Architecture Document 23: Anticipatory Transformer Architecture
**Status**: Research Proposal (Revised) **Created**: 2026-01-04 **Revised**: 2026-01-04 (Incorporated engineering feedback) **Dependencies**: DELL Theory (19), Graph Kernel (15), Computational Choreography (01), TrajectoryOS (02)
---
I. Core Thesis
Current transformer architectures operate on prediction: given context, predict next token. This proposal introduces an anticipatory transformer that operates on commitment detection: given motion through semantic space, detect when futures become constrained enough to warrant action.
Key Insight: Just as Comp-Core's motion intelligence detects when a gesture is irreversible (not when it completes), an anticipatory transformer should detect when semantic trajectories become committed, enabling earlier, more efficient generation.
---
II. Philosophical Alignment with Comp-Core
II.1 Anticipation Over Prediction
Traditional Transformers:
P(token_t+1 | context_0:t) → argmax(probability distribution)
Anticipatory Transformer:
Commitment(trajectory_0:t) × Uncertainty(futures_t) → action_threshold
- High commitment + Low uncertainty = Generate immediately
- Low commitment + High uncertainty = Buffer and observe
- High commitment + High uncertainty = Surprising (adapt fast)
Implementation Consequence: The model outputs not just probabilities but also commitment scores and uncertainty estimates, which drive the generation policy.
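The three commitment/uncertainty cases can be sketched as a small decision function. This is a minimal illustration, not the proposal's policy: the single shared `threshold`, the `Action` names, and the handling of the fourth quadrant (low commitment, low uncertainty), which the text does not specify, are all assumptions.

```python
from enum import Enum


class Action(Enum):
    GENERATE = "generate immediately"
    BUFFER = "buffer and observe"
    ADAPT = "surprising: adapt fast"
    UNSPECIFIED = "not covered by the text"  # assumption: fourth quadrant


def generation_action(commit: float, uncert: float, threshold: float = 0.5) -> Action:
    """Map a (commitment, uncertainty) pair to one of the policy quadrants."""
    high_commit = commit >= threshold
    high_uncert = uncert >= threshold
    if high_commit and not high_uncert:
        return Action.GENERATE
    if not high_commit and high_uncert:
        return Action.BUFFER
    if high_commit and high_uncert:
        return Action.ADAPT
    return Action.UNSPECIFIED  # low commitment + low uncertainty


print(generation_action(commit=0.9, uncert=0.1).value)
```

A real policy would likely use separate, tuned thresholds for the two signals; one threshold is used here only to keep the quadrants visible.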
---
II.1.1 Operationalizing Commitment (Ground Truth)
Problem: "Commitment" is conceptually clear but needs a trainable target to avoid becoming a random-number generator correlated with logit sharpness.
Solution: Three workable approaches, ordered by alignment with philosophy:
A. Counterfactual Stability (Canonical - Best Aligned)
Definition: Commitment at token t is how invariant the best continuation is under small admissible perturbations of context.
Mechanism:
1. Sample several admissible context slices from kernel (small variations in slice selection)
2. Run model on each variant
3. Measure continuation agreement (e.g., edit distance, token overlap)
4. High agreement = high commitment
Training Target:
import numpy as np

def compute_commitment_target(model, anchor, kernel, base_policy, epsilon, k_samples=5):
    """
    Self-supervised commitment signal via counterfactual stability.

    perturb_policy and edit_distance are helpers assumed to be defined elsewhere.
    """
    # Sample k different admissible slices
    slices = [
        kernel.slice(anchor, policy=perturb_policy(base_policy, epsilon))
        for _ in range(k_samples)
    ]
    # Generate continuations for each
    continuations = [
        model.generate(s, max_tokens=10)
        for s in slices
    ]
    # Measure agreement (1 - normalized edit distance) over all pairs
    agreements = []
    for i in range(len(continuations)):
        for j in range(i + 1, len(continuations)):
            # Normalize by the longer continuation so agreement stays in [0, 1]
            max_len = max(len(continuations[i]), len(continuations[j]), 1)
            agreements.append(
                1.0 - edit_distance(continuations[i], continuations[j]) / max_len
            )
    # High agreement = high commitment
    commitment = np.mean(agreements)
    return commitment
Advantages:
- Directly measures "irreversibility" (futures constrained)
- Self-supervised (no manual labels needed)
- Philosophically aligned with anticipation
Cost: Requires k_samples generation passes per training example (roughly 5x slower at the default k_samples=5)
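The agreement measure in approach A depends on an `edit_distance` helper that the text leaves undefined. A minimal sketch follows, assuming plain Levenshtein distance over token sequences and normalization by the longer sequence; both choices are illustrative assumptions rather than the proposal's definition.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def pairwise_agreement(continuations):
    """Mean of (1 - normalized edit distance) over all unordered pairs."""
    scores = []
    for x, y in combinations(continuations, 2):
        longest = max(len(x), len(y), 1)
        scores.append(1.0 - edit_distance(x, y) / longest)
    # With fewer than two continuations there is nothing to disagree with.
    return sum(scores) / len(scores) if scores else 1.0


samples = [["return", "x", "+", "1"], ["return", "x", "+", "2"], ["return", "x", "+", "1"]]
print(pairwise_agreement(samples))
```

Identical continuations give an agreement of 1.0 and fully disjoint ones give 0.0, which is the range the commitment target is expected to live in.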
---
B. Edit Distance to Future (Cheap Supervision)
Definition: Commitment is high when next k tokens are robust under sampling temperature variations.
Mechanism:
def commitment_via_robustness(model, context, k=5):
    """
    Commitment = robustness of next k tokens across temperatures.
    """
    temps = [0.5, 0.7, 1.0, 1.2, 1.5]
    samples = [
        model.generate(context, max_tokens=k, temperature=t)
        for t in temps
    ]
    # Low variance = high commitment
    variance = np.mean([
        edit_distance(samples[i], samples[j])
        for i in range(len(samples))
        for j in range(i + 1, len(samples))
    ])
    commitment = 1.0 / (1.0 + variance)
    return commitment
Advantages:
- Cheap (single forward pass, multiple decode heads)
- Easy to implement
Disadvantages:
- Slightly circular (model defines own confidence)
- Less philosophically pure
---
C. Task-Conditioned Commitment (Production Usefulness)
Definition: Commitment varies by domain-specific structural signals.
For Code:
# Commitment high when next structural token is unambiguous
commitment = P(next_token in {'{', '}', 'def', 'class', 'return'})
For Dialogue:
# Commitment high when sentence intention class is stable
commitment = max(P(intention_class | context))
Advantages:
- Practical, measurable
- Enables domain-specific tuning
Disadvantages:
- Requires task-specific engineering
- Less universal
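The code-domain signal in approach C can be read as the probability mass the model places on structural tokens at the next position. The sketch below assumes access to raw next-token logits and a vocabulary list; `structural_commitment` and the toy vocabulary are hypothetical names used only for illustration.

```python
import numpy as np

# Structural token set taken from the "For Code" example above.
STRUCTURAL_TOKENS = {"{", "}", "def", "class", "return"}


def softmax(logits):
    shifted = logits - np.max(logits)  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def structural_commitment(logits, vocab):
    """Probability mass assigned to structural tokens at the next position."""
    probs = softmax(np.asarray(logits, dtype=float))
    mask = np.array([tok in STRUCTURAL_TOKENS for tok in vocab])
    return float(probs[mask].sum())


vocab = ["def", "x", "return", "+"]
print(structural_commitment([0.0, 0.0, 0.0, 0.0], vocab))   # uniform logits
print(structural_commitment([10.0, 0.0, 0.0, 0.0], vocab))  # strongly favors "def"
```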
---
Recommendation: Use Counterfactual Stability (A) as canonical for research, with Robustness (B) as a cheaper proxy during development.
---
II.2 Dual-Timescale Processing (DELL-Inspired)
Humans process language at multiple timescales:
- Fast: Word-level phonology, syntax, local coherence (~100-200ms)
- Slow: Sentence/paragraph-level semantics, narrative arc, style (~500ms-2s)
Architectural Innovation: Parallel processing pathways with different temporal resolutions.
┌─────────────────────────────────────────────────┐
│                 Input Embedding                 │
└─────────────────┬───────────────────────────────┘
                  │
         ┌────────┴────────┐
         ▼                 ▼
┌─────────────────┐ ┌──────────────────┐
│ Fast Pathway    │ │ Slow Pathway     │
│                 │ │                  │
│ • Local attn    │ │ • Slice attn     │
│ • 8-12 layers   │ │ • 4-6 layers     │
│ • τ_fast = 4    │ │ • τ_slow = 64    │
│ • High freq     │ │ • Low freq       │
│ • Syntax/fluency│ │ • Semantics/plan │
└────────┬────────┘ └────────┬─────────┘
         │                   │
         └────────┬──────────┘
                  ▼
         ┌─────────────────┐
         │ Gating Network  │
         │ (Coordinator)   │
         └────────┬────────┘
                  ▼
         ┌─────────────────┐
         │ Output + Commit │
         │ + Uncertainty   │
         └─────────────────┘
Key Parameters:
- τ_fast: Fast pathway time constant (tokens)
- τ_slow: Slow pathway time constant (tokens)
- Coordinator: Learns HOW to blend based on context
---
II.2.1 Specialization via Orthogonality (Not Naive Divergence)
Problem: Naive `L_div = -mse(h_fast, h_slow)` can cause training instability (norm explosion, subspace rotation).
Solution: Constrained specialization via orthogonality + information bottleneck.
Training Objective:
L_total = L_recon + λ_smooth * L_smooth + λ_ortho * L_ortho + λ_info * L_info + λ_commit * L_commit + λ_cal * L_cal
L_recon = -log P(y | context) # Standard LM loss
# Temporal smoothness (within pathways)
L_smooth_fast = ||h^F_{t+1} - h^F_t||²
L_smooth_slow = ||h^S_{t+τ} - h^S_t||² # Slower timescale
L_smooth = L_smooth_fast + L_smooth_slow
# Orthogonality penalty (force decorrelation, not just distance)
cov = (h^F - mean(h^F))^T @ (h^S - mean(h^S))
L_ortho = ||cov||²_F # Squared Frobenius norm of cross-covariance
# Information bottleneck on slow path (force low-freq compression)
# Slow path should only carry semantics, not high-freq details
L_info = -mutual_information(h^S, low_freq_target)
# Commitment coherence (penalize oscillation)
L_commit = ||commit_{t+1} - commit_t||²
# Uncertainty calibration
empirical_correct = (argmax(logits) == targets).float()
L_cal = ||(1.0 - uncert) - empirical_correct||²
Hyperparameters (initial proposal):
λ_smooth = 0.01 # Smoothness regularization
λ_ortho = 0.1 # Orthogonality (decorrelation)
λ_info = 0.05 # Information bottleneck
λ_commit = 0.1 # Commitment smoothness
λ_cal = 0.2 # Uncertainty calibration
Stop-Gradient Trick (prevent chasing):
# Occasionally stop gradient from one path to prevent coupling
if step % stop_grad_freq == 0:
    h_fast = h_fast.detach() # Slow can't pull fast
Expected Behavior:
- Fast pathway: High-frequency, syntax, local coherence
- Slow pathway: Low-frequency, semantics, global plan
- No norm explosion, controlled specialization
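To make the orthogonality term concrete, the sketch below evaluates the squared Frobenius norm of the cross-covariance on synthetic features. It is a NumPy illustration under stated assumptions: the features are random stand-ins for pathway activations, and dividing by n - 1 is a normalization added here so the value does not grow with batch size (the objective above does not specify one).

```python
import numpy as np


def cross_covariance_penalty(h_fast, h_slow):
    """Squared Frobenius norm of the cross-covariance of two [n, d] feature matrices."""
    f = h_fast - h_fast.mean(axis=0, keepdims=True)
    s = h_slow - h_slow.mean(axis=0, keepdims=True)
    cov = f.T @ s / (len(f) - 1)  # assumption: sample-normalized cross-covariance
    return float(np.sum(cov ** 2))


rng = np.random.default_rng(0)
h_fast = rng.normal(size=(4096, 8))
h_independent = rng.normal(size=(4096, 8))  # statistically independent of h_fast
h_coupled = h_fast.copy()                   # pathways carrying identical features

print(cross_covariance_penalty(h_fast, h_independent))  # close to zero
print(cross_covariance_penalty(h_fast, h_coupled))      # much larger
```

Unlike `-mse(h_fast, h_slow)`, this penalty is bounded below by zero, so minimizing it cannot be satisfied by inflating activation norms.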
---
II.3 Trajectory-Aware Positional Encoding
Traditional positional encoding: `PE(pos) = sin/cos functions of position`
Anticipatory Encoding: 5D trajectory coordinates
struct TrajectoryPosition {
    temporal: f64,     // Sequential position (like traditional PE)
    semantic: f64,     // Distance to current semantic anchor
    depth: f64,        // Nesting level (quotes, parentheticals, recursion)
    homogeneity: f64,  // Similarity to local context (regime stability)
    salience: f64,     // Dynamic importance (attention-weighted)
}
---
II.3.1 Trajectory Attention (Additive Bias, Not Multiplicative)
Problem: Multiplying attention scores by `exp(-α * ring_dist)` fights softmax and causes numerical instability.
Solution: Add bias to scores before softmax (standard practice, FlashAttention-compatible).
Attention Computation:
import math
import torch

def trajectory_attention(Q, K, V, traj_coords, bias_net):
    """
    Trajectory-aware attention with ADDITIVE bias.
    Compatible with FlashAttention and numerically stable.
    """
    d_k = Q.size(-1)
    # Traditional dot-product attention (transpose only the last two dims)
    scores = (Q @ K.transpose(-2, -1)) / math.sqrt(d_k)  # [batch, n_heads, seq_len, seq_len]
    # Compute trajectory bias (learned per head)
    bias = compute_trajectory_bias(
        traj_coords,
        bias_net  # Learned bias function
    )  # [batch, n_heads, seq_len, seq_len]
    # ADD bias (not multiply)
    scores = scores + bias
    # Apply softmax over the key dimension
    weights = torch.softmax(scores, dim=-1)
    return weights @ V

def compute_trajectory_bias(traj_coords, bias_net):
    """
    Compute additive bias from trajectory distance.
    Each attention head learns its own bias function.
    """
    # Compute pairwise ring distances
    ring_dist = pairwise_ring_distance(traj_coords)
    # Per-head learned bias (could be MLP or simple linear)
    # shape: [n_heads, 1] - per-head scale parameter
    bias = bias_net(ring_dist)  # Learned function of distance
    return bias
Ring Distance (with per-dimension normalization):
def pairwise_ring_distance(coords):
    """
    5D ring distance with learned dimension weights.
    Prevents any single dimension from dominating.
    """
    # Normalize each dimension independently
    temporal_norm = coords.temporal / temporal_scale
    semantic_norm = coords.semantic / semantic_scale  # From embeddings
    depth_norm = coords.depth / max_depth
    homogeneity_norm = coords.homogeneity  # Already [0,1]
    salience_norm = coords.salience  # Already [0,1]
    # Compute pairwise distances (pseudocode: i, j range over all position pairs)
    d_ring = sqrt(
        w_t * (temporal_norm[i] - temporal_norm[j])² +
        w_s * (semantic_norm[i] - semantic_norm[j])² +
        w_d * (depth_norm[i] - depth_norm[j])² +
        w_h * (homogeneity_norm[i] - homogeneity_norm[j])² +
        w_sal * (salience_norm[i] - salience_norm[j])²
    )
    # Weights w_* are LEARNED per attention head
    return d_ring
Key Improvements:
- ✅ Additive bias (numerically stable, standard practice)
- ✅ Per-dimension normalization (prevents dominance)
- ✅ Learned bias function (not hand-tuned α)
- ✅ FlashAttention-compatible
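A small NumPy sketch of additive-bias attention, for a single head and a single sequence. The ring distance is replaced by the positional gap |i - j| and the learned bias function by a fixed slope; both are stand-ins chosen for illustration, not the proposal's learned components.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def biased_attention(Q, K, V, bias):
    """Scaled dot-product attention with an additive [seq, seq] bias on the scores."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + bias
    weights = softmax(scores, axis=-1)
    return weights @ V, weights


rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))

# Stand-in for pairwise_ring_distance: positional gap between tokens.
idx = np.arange(seq_len)
ring_dist = np.abs(idx[:, None] - idx[None, :]).astype(float)
bias = -2.0 * ring_dist  # stand-in for the learned bias_net(ring_dist)

out, weights = biased_attention(Q, K, V, bias)
print(weights.round(3))
```

Because softmax exponentiates the scores, adding a bias b to a score is equivalent to multiplying the unnormalized attention weight by exp(b), so the distance decay is still expressed while every row remains a proper probability distribution.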
---
II.4 Regime-Based Context Processing
Analogous to motion regimes (preparation, travel, accent, rebound), semantic processing has regimes:
Semantic Regimes:
1. Exploration - High uncertainty, many plausible futures
2. Consolidation - Narrowing options, committing to direction
3. Synthesis - Low uncertainty, executing committed plan
4. Transition - Regime shift detected, recalibrating
---
II.4.1 Differentiable Regime Detection (Simplex Over Regimes)
Problem: Rule-based `detect_regime()` is not differentiable and uses brittle thresholds.
Solution: Model outputs a probability distribution over regimes (simplex), then uses it to mix attention patterns.
Regime Head:
class RegimeDetector(nn.Module):
    def __init__(self, d_model, n_regimes=4):
        super().__init__()
        # Input is h_fast and h_slow concatenated with the two scalar signals
        self.regime_head = nn.Linear(2 * d_model + 2, n_regimes)

    def forward(self, h_fast, h_slow, commit, uncert):
        """
        Output differentiable regime distribution.
        """
        # Combine fast, slow, and commitment signals
        combined = torch.cat([
            h_fast,
            h_slow,
            commit.unsqueeze(-1),
            uncert.unsqueeze(-1)
        ], dim=-1)
        # Project to regime logits
        regime_logits = self.regime_head(combined)
        # Softmax to get regime probabilities (simplex)
        regime_probs = torch.softmax(regime_logits, dim=-1)  # [batch, seq_len, n_regimes]
        return regime_probs
Regime-Conditional Attention:
def regime_conditional_attention(Q, K, V, regime_probs):
    """
    Mix attention patterns based on regime probabilities.
    Each regime has its own attention bias.
    """
    # Precompute attention for each regime
    attn_exploration = attention(Q, K, V, bias=exploration_bias)
    attn_consolidation = attention(Q, K, V, bias=consolidation_bias)
    attn_synthesis = attention(Q, K, V, bias=synthesis_bias)
    attn_transition = attention(Q, K, V, bias=transition_bias)
    # Mix according to regime probabilities
    output = (
        regime_probs[:, :, 0].unsqueeze(-1) * attn_exploration +
        regime_probs[:, :, 1].unsqueeze(-1) * attn_consolidation +
        regime_probs[:, :, 2].unsqueeze(-1) * attn_synthesis +
        regime_probs[:, :, 3].unsqueeze(-1) * attn_transition
    )
    return output
Training:
# Regime-specific loss weighting
regime_weights = {
    'exploration': 0.3,    # Less important (often tangential)
    'consolidation': 0.6,  # Medium importance
    'synthesis': 1.0,      # Most important (insights)
    'transition': 0.5      # Medium (recalibration)
}
# Weight reconstruction loss by regime
regime_weight = torch.sum(
    regime_probs * torch.tensor([0.3, 0.6, 1.0, 0.5]),
    dim=-1
)
L_recon_weighted = regime_weight * L_recon
Inference (optional hard regime for logging):
# For interpretability, can hard-argmax regime
regime_names = ['exploration', 'consolidation', 'synthesis', 'transition']
regime_idx = torch.argmax(regime_probs, dim=-1)  # [batch, seq_len]
# A list cannot be indexed by a tensor, so pick one position (here: last token of the first sequence)
regime_name = regime_names[regime_idx[0, -1].item()]
Key Improvements:
- ✅ Differentiable (can backprop through regime selection)
- ✅ No brittle thresholds
- ✅ Regime-specific attention patterns
- ✅ Still interpretable via argmax at inference
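The simplex mixing can be checked numerically. The following NumPy sketch blends four per-regime outputs with softmax probabilities and derives the per-token loss weight; the random arrays stand in for real attention outputs, and `mix_by_regime` is a hypothetical helper name.

```python
import numpy as np

REGIMES = ["exploration", "consolidation", "synthesis", "transition"]
LOSS_WEIGHTS = np.array([0.3, 0.6, 1.0, 0.5])  # values from the Training block above


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mix_by_regime(regime_logits, per_regime_outputs):
    """Blend outputs of shape [4, seq, d] using probabilities of shape [seq, 4]."""
    probs = softmax(regime_logits, axis=-1)
    mixed = np.einsum("sr,rsd->sd", probs, per_regime_outputs)
    loss_weight = probs @ LOSS_WEIGHTS  # one weight per position
    return probs, mixed, loss_weight


rng = np.random.default_rng(0)
seq_len, d_model = 3, 4
outputs = rng.normal(size=(len(REGIMES), seq_len, d_model))
logits = rng.normal(size=(seq_len, len(REGIMES)))

probs, mixed, loss_weight = mix_by_regime(logits, outputs)
hard_regimes = [REGIMES[i] for i in probs.argmax(axis=-1)]  # logging only
print(hard_regimes, loss_weight.round(3))
```

Since the mixture is a convex combination, the loss weight always lies between the smallest and largest regime weight, and the hard argmax is needed only for logging.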
---
II.5 Kernel Slice-Based Context Selection
Problem with Current Transformers: Context window is fixed or uses heuristics (e.g., "keep last N tokens").
Anticipatory Solution: Graph kernel-inspired priority-queue expansion with explicit budget, operating on SliceExport primitives (not raw turns).
---
II.5.1 Context Selection via Kernel Slices
Key Shift: The unit of context is not a "turn" but a kernel slice with provenance.
def select_context(
    anchor_query: Embedding,
    kernel: GraphKernel,
    budget: int,  # Token budget
    policy: SlicePolicy
) -> AdmissibleSliceBundle:
    """
    Deterministic context selection via priority-queue expansion.
    Operates on SliceExport primitives from graph kernel.
    Same input → identical output across runs.
    """
    # Initialize priority queue
    pq = PriorityQueue()
    # Request initial slice from kernel (always includes most recent context)
    initial_slice = kernel.slice(
        anchor=get_most_recent_turn(),
        policy=policy,
        budget=budget // 2  # Reserve half budget for expansion
    )
    # Verify admissibility
    assert initial_slice.verify_admissibility(kernel.hmac_secret)
    pq.push(initial_slice, priority=float('inf'))
    total_tokens = initial_slice.num_tokens()
    selected_slices = [initial_slice]
    seen_ids = {initial_slice.slice_id}  # Guard against selecting a slice twice
    while not pq.empty() and total_tokens < budget:
        current_slice = pq.pop()
        # Request neighbor slices from kernel
        neighbors = kernel.get_neighbor_slices(
            current_slice,
            policy=policy
        )
        for neighbor in neighbors:
            if neighbor.slice_id in seen_ids:
                continue  # Already selected via another path
            # Verify each neighbor is kernel-issued
            if not neighbor.verify_admissibility(kernel.hmac_secret):
                continue  # Skip non-admissible slices
            if total_tokens + neighbor.num_tokens() > budget:
                break  # Hard budget constraint
            priority = compute_priority(neighbor, anchor_query, policy)
            pq.push(neighbor, priority)
            selected_slices.append(neighbor)
            seen_ids.add(neighbor.slice_id)
            total_tokens += neighbor.num_tokens()
    # Deterministic ordering (stable sort by anchor turn id)
    selected_slices.sort(key=lambda s: s.anchor_turn_id)
    return AdmissibleSliceBundle(
        slices=selected_slices,
        fingerprint=compute_bundle_hash(selected_slices),
        budget_used=total_tokens
    )
Priority Computation (from Graph Kernel):
from collections import Counter
from math import exp

def compute_priority(
    slice: SliceExport,
    query: Embedding,
    policy: SlicePolicy
) -> float:
    """
    Compute priority for a kernel slice.
    Uses phase weights calibrated to information density.
    """
    # Extract dominant phase from slice turns
    phase_counts = Counter(turn.phase for turn in slice.turns)
    dominant_phase = phase_counts.most_common(1)[0][0]
    priority = (
        slice.salience *                                # Base salience
        PHASE_WEIGHTS[dominant_phase] *                 # Information density
        exp(-policy.distance_decay *                    # Semantic distance
            semantic_distance(slice.embedding, query)) *
        recency_boost(slice.anchor_turn().timestamp)    # Recent = higher priority
    )
    return priority
Phase Weights (calibrated to information density):
PHASE_WEIGHTS = {
    'synthesis': 1.0,      # Rare, breakthrough insights (0.1% of data, high signal)
    'planning': 0.9,       # Strategic thinking
    'consolidation': 0.6,  # Summarization
    'debugging': 0.5,      # Often repetitive
    'exploration': 0.3     # Often tangential
}
Guarantees:
- ✅ Deterministic: Same query → same slices
- ✅ Budget-constrained: Never exceeds token limit
- ✅ Provenance-tracked: Every slice has admissibility token
- ✅ Verifiable: Can audit slice selection decisions
- ✅ Kernel-first: Model consumes slices, not raw turns
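The determinism and budget guarantees can be illustrated with a self-contained toy. The sketch below is a simplification, not the kernel interface: `ToySlice` and `select_slices` are hypothetical, admissibility checks are omitted, and a slice is selected when it is popped (best-first) rather than when it is pushed, which is a deliberate variation on the listing above.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySlice:
    slice_id: str
    num_tokens: int
    priority: float
    neighbors: tuple = ()


def select_slices(start_id, slices, budget):
    """Best-first expansion under a hard token budget; ties broken by slice_id."""
    start = slices[start_id]
    if start.num_tokens > budget:
        return [], 0
    heap = [(-start.priority, start.slice_id)]  # max-heap via negated priority
    seen = {start_id}
    selected, used = [], 0
    while heap:
        _, sid = heapq.heappop(heap)
        current = slices[sid]
        if used + current.num_tokens > budget:
            continue  # this slice no longer fits; try lower-priority ones
        selected.append(sid)
        used += current.num_tokens
        for nid in current.neighbors:
            if nid not in seen:
                seen.add(nid)
                heapq.heappush(heap, (-slices[nid].priority, nid))
    return sorted(selected), used  # sorted for a stable, reproducible order


slices = {
    "a": ToySlice("a", 40, 1.0, ("b", "c")),
    "b": ToySlice("b", 50, 0.9, ("d",)),
    "c": ToySlice("c", 30, 0.4),
    "d": ToySlice("d", 20, 0.8),
}
print(select_slices("a", slices, budget=100))
```

The heap key includes the slice id, so equal priorities are resolved the same way on every run, which is what makes the output reproducible for identical inputs.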
---
II.6 Commitment-Driven Generation Policy with Safety Mechanisms
Traditional Greedy Decoding:
while not done:
    [sensitive field redacted]))
    output.append(token)
Anticipatory Decoding (with deadlock prevention):
def anticipatory_decode(
    model,
    context,
    kernel,              # Used by expand_context below
    policy,              # Used by expand_context below
    threshold_commit=0.8,
    max_buffer=5,        # NEW: Max buffer horizon
    max_wait_steps=3     # NEW: Max steps without committing
):
    """
    Commitment-gated generation with safety mechanisms.
    Prevents deadlock in high-uncertainty regimes.
    """
    committed_output = []  # Final, committed tokens
    draft_buffer = []      # Provisional tokens (may be revised)
    wait_steps = 0
    while not done:
        # Model outputs: logits, commitment, uncertainty
        logits, commit, uncert, regime_probs = model(context)
        # Multi-signal convergence check
        prob = max(softmax(logits))
        signals = {
            'probability': prob > threshold_commit,
            'commitment': commit > threshold_commit,
            'certainty': uncert < (1.0 - threshold_commit),
            'regime': regime_probs[2] > 0.5  # Synthesis regime
        }
        # Only commit if >= 3 signals agree
        if sum(signals.values()) >= 3:
            # High confidence - commit immediately
            [sensitive field redacted])
            # Flush draft buffer first: drafted tokens precede the new token
            if draft_buffer:
                committed_output.extend(draft_buffer)
                draft_buffer.clear()
            committed_output.append(token)
            wait_steps = 0  # Reset wait counter
            if [sensitive field redacted]
                break
            context = append_and_trim(context, token)
        elif commit > 0.5:
            # Moderate confidence - buffer for revision
            # (compare the raw score; signals['commitment'] is already a boolean)
            [sensitive field redacted])
            draft_buffer.append(token)
            # SAFETY: Max buffer horizon
            if len(draft_buffer) >= max_buffer:
                # Commit buffer as "provisional"
                committed_output.extend(draft_buffer)
                draft_buffer.clear()
            wait_steps = 0
            context = append_and_trim(context, token)
        else:
            # Low confidence - wait/observe
            wait_steps += 1
            # SAFETY: Max wait steps (prevent infinite waiting)
            if wait_steps >= max_wait_steps:
                # Forced commit with warning flag
                [sensitive field redacted])
                committed_output.append(token)
                committed_output.append(PROVISIONAL_MARKER)
                wait_steps = 0
                context = append_and_trim(context, token)
            else:
                # Expand context (request more slices)
                context = expand_context(context, kernel, policy)
    return committed_output, draft_buffer
Two-Channel Output (for UIs):
from dataclasses import dataclass
from typing import List

@dataclass
class GenerationOutput:
    committed: List[Token]          # Final, high-confidence tokens
    draft: List[Token]              # Provisional tokens (may change)
    commitment_scores: List[float]  # Per-token commitment

    def render_ui(self):
        """
        In a UI, show committed in bold, draft in gray italics.
        User sees model thinking in real-time.
        """
        # Built-in print() has no color/style arguments, so use ANSI escape codes
        BOLD, GRAY_ITALIC, RESET = "\033[1m", "\033[3;90m", "\033[0m"
        for token in self.committed:
            print(f"{BOLD}{token}{RESET}", end="")
        for token in self.draft:
            print(f"{GRAY_ITALIC}{token}{RESET}", end="")
        print()
Key Improvements:
- ✅ No deadlock: max_buffer and max_wait_steps prevent infinite waiting
- ✅ Provisional commitment: Low-confidence tokens marked, not silently emitted
- ✅ Two-channel output: Committed vs draft (UI gold)
- ✅ Graceful degradation: Forced commit with warning when needed
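The branch structure of the decoding loop can be isolated into a pure function that is easy to test. This sketch covers only the decision logic (3-of-4 vote, draft fallback, forced commit); the function names and string labels are illustrative, and token selection, buffer flushing, and context expansion are left out.

```python
def convergence_signals(prob, commit, uncert, synthesis_prob, threshold=0.8):
    """The four boolean signals used by the 3-of-4 commitment vote."""
    return {
        "probability": prob > threshold,
        "commitment": commit > threshold,
        "certainty": uncert < (1.0 - threshold),
        "regime": synthesis_prob > 0.5,
    }


def decide(prob, commit, uncert, synthesis_prob, wait_steps,
           threshold=0.8, max_wait_steps=3):
    """Return 'commit', 'draft', 'forced_commit', or 'wait' for one decoding step."""
    signals = convergence_signals(prob, commit, uncert, synthesis_prob, threshold)
    if sum(signals.values()) >= 3:
        return "commit"
    if commit > 0.5:  # raw score, not the boolean signal
        return "draft"
    # The loop increments wait_steps before checking the limit.
    if wait_steps + 1 >= max_wait_steps:
        return "forced_commit"
    return "wait"


print(decide(prob=0.9, commit=0.9, uncert=0.1, synthesis_prob=0.2, wait_steps=0))
print(decide(prob=0.3, commit=0.2, uncert=0.9, synthesis_prob=0.1, wait_steps=2))
```

Because every path either commits, drafts, or advances `wait_steps` toward a hard limit, no sequence of inputs can keep the decoder waiting indefinitely.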
---
III. Architectural Specification
III.1 Model Components
AnticipatoryTransformer:
├─ Input Embedding Layer
├─ Trajectory Encoder (5D coordinates)
├─ Fast Pathway (8-12 layers, local token attention, τ=4)
├─ Slow Pathway (4-6 layers, slice cross-attention, τ=64)
├─ Gating Network (learns blending policy)
├─ Regime Detector (differentiable simplex over regimes)
├─ Output Heads:
│ ├─ Token Logits (standard LM head)
│ ├─ Commitment Score (scalar, 0-1, counterfactual stability)
│ ├─ Uncertainty Estimate (scalar, 0-1, calibrated)
│ └─ Regime Probabilities (simplex over 4 regimes)
└─ Context Selector (kernel slice priority-queue)
---
III.2 Layer Architecture
Fast Layer (Token Self-Attention):
class FastLayer(nn.Module):
    def __init__(self, d_model, n_heads, window_size=128):
        super().__init__()  # Required before registering submodules
        self.local_attn = LocalAttention(
            d_model, n_heads, window_size
        )
        self.ffn = FeedForward(d_model, expansion=4)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)
        # Per-head trajectory bias network
        self.traj_bias_net = nn.ModuleList([
            nn.Linear(5, 1)  # 5D trajectory → scalar bias per head
            for _ in range(n_heads)
        ])

    def forward(self, x, traj_coords):
        # Local trajectory-aware attention
        attn_out = self.local_attn(
            x, x, x,
            traj_coords=traj_coords,
            bias_net=self.traj_bias_net
        )
        x = self.norm1(x + attn_out)
        # Feed-forward
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)
        return x
Slow Layer (Slice Cross-Attention):
class SlowLayer(nn.Module):
    def __init__(self, d_model, n_heads, max_slices=32):
        super().__init__()  # Required before registering submodules
        # Cross-attention: attend to slice embeddings
        self.slice_cross_attn = CrossAttention(
            d_model, n_heads
        )
        # Materialization gate (expand slice → tokens only when needed)
        self.materialize_gate = nn.Linear(d_model, 1)
        self.ffn = FeedForward(d_model, expansion=4)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)
        self.max_slices = max_slices

    def forward(self, x_tokens, slice_embeddings):
        """
        x_tokens: [batch, seq_len, d_model] - token sequence
        slice_embeddings: [batch, num_slices, d_model] - compressed slices
        """
        # Cross-attention to slices (virtual context window)
        attn_out = self.slice_cross_attn(
            query=x_tokens,
            key=slice_embeddings,
            value=slice_embeddings
        )
        x = self.norm1(x_tokens + attn_out)
        # Feed-forward
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)
        return x
Gating Network (Coordinator):
class GatingNetwork(nn.Module):
    def __init__(self, d_model):
        super().__init__()  # Required before registering submodules
        self.project = nn.Linear(d_model * 2 + 2, d_model)
        self.gate = nn.Linear(d_model, 1)

    def forward(self, h_fast, h_slow, commit, uncert):
        # Concatenate fast, slow, and anticipation scalars
        combined = torch.cat([
            h_fast,
            h_slow,
            commit.unsqueeze(-1),
            uncert.unsqueeze(-1)
        ], dim=-1)
        # Learn gating weight
        α = torch.sigmoid(self.gate(self.project(combined)))
        # Blend outputs
        h = α * h_fast + (1 - α) * h_slow
        return h, α
---
III.3 Training Procedure
Multi-Objective Loss:
def compute_loss(
    model_out,
    targets,
    h_fast,
    h_slow,
    commit,
    uncert,
    regime_probs,
    stride=64  # Slow-pathway offset in tokens (τ_slow)
):
    # 1. Reconstruction loss (standard LM), kept per token so it can be regime-weighted
    L_recon = cross_entropy(model_out, targets, reduction='none')
    # 2. Temporal smoothness (within pathways)
    L_smooth_fast = mse(h_fast[1:], h_fast[:-1])
    # Compare each slow state with the state `stride` tokens earlier
    L_smooth_slow = mse(h_slow[stride:], h_slow[:-stride])
    L_smooth = L_smooth_fast + L_smooth_slow
    # 3. Orthogonality penalty (force decorrelation)
    h_fast_centered = h_fast - h_fast.mean(dim=0)
    h_slow_centered = h_slow - h_slow.mean(dim=0)
    cov_matrix = h_fast_centered.T @ h_slow_centered
    L_ortho = torch.norm(cov_matrix, p='fro') ** 2  # Squared Frobenius norm, as in II.2.1
    # 4. Commitment coherence (penalize oscillation)
    L_commit = mse(commit[1:], commit[:-1])
    # 5. Uncertainty calibration (match empirical correctness)
    empirical_correct = (argmax(model_out) == targets).float()
    L_calibration = mse(1.0 - uncert, empirical_correct)
    # 6. Regime-weighted reconstruction
    regime_weight = torch.sum(
        regime_probs * torch.tensor([0.3, 0.6, 1.0, 0.5]),
        dim=-1
    )
    L_recon_weighted = (regime_weight * L_recon).mean()
    # Weighted combination
    L_total = (
        L_recon_weighted +
        λ_smooth * L_smooth +
        λ_ortho * L_ortho +
        λ_commit * L_commit +
        λ_cal * L_calibration
    )
    return L_total
Hyperparameters (initial proposal):
λ_smooth = 0.01 # Smoothness regularization
λ_ortho = 0.1 # Orthogonality (decorrelation)
λ_commit = 0.1 # Commitment smoothness
λ_cal = 0.2 # Uncertainty calibration
---
III.4 Inference Procedure
def generate(
    model,
    prompt,
    kernel,
    policy,
    max_tokens=512,
    commit_threshold=0.8,
    uncert_threshold=0.3
):
    # Select initial context slices from kernel
    slice_bundle = select_context(
        anchor_query=encode(prompt),
        kernel=kernel,
        budget=model.slice_budget,
        policy=policy
    )
    # Verify all slices are kernel-issued
    for slice in slice_bundle.slices:
        assert slice.verify_admissibility(kernel.hmac_secret)
    # Compress slices to embeddings for slow pathway
    slice_embeddings = [
        compress_slice(slice) for slice in slice_bundle.slices
    ]
    committed_output = []
    draft_buffer = []
    wait_steps = 0
    for step in range(max_tokens):
        # Forward pass
        logits, h_fast, h_slow, commit, uncert, regime_probs = model(
            tokens=committed_output + draft_buffer,
            slice_embeddings=slice_embeddings
        )
        # Multi-signal convergence check (all signals read at the last position)
        prob = max(softmax(logits[-1]))
        signals = {
            'probability': prob > commit_threshold,
            'commitment': commit[-1] > commit_threshold,
            'certainty': uncert[-1] < uncert_threshold,
            'regime': regime_probs[-1, 2] > 0.5  # Synthesis
        }
        # Commitment-gated generation (with safety)
        if sum(signals.values()) >= 3:
            [sensitive field redacted])
            # Flush draft buffer first: drafted tokens precede the new token
            if draft_buffer:
                committed_output.extend(draft_buffer)
                draft_buffer.clear()
            committed_output.append(token)
            wait_steps = 0
            if [sensitive field redacted]
                break
        elif commit[-1] > 0.5:
            # Compare the raw score; signals['commitment'] is already a boolean
            [sensitive field redacted])
            draft_buffer.append(token)
            # Safety: max buffer
            if len(draft_buffer) >= 5:
                committed_output.extend(draft_buffer)
                draft_buffer.clear()
        else:
            wait_steps += 1
            # Safety: max wait steps
            if wait_steps >= 3:
                [sensitive field redacted])
                committed_output.append(token)
                wait_steps = 0
    return committed_output, draft_buffer
---
IV. Theoretical Properties & Rigorous Evaluation
IV.1 Latency Reduction via Anticipation
Hypothesis: By detecting commitment early, the model reaches the same output quality with fewer input tokens.
Locked Evaluation Protocol (non-gameable):
Setup:
1. Fix target quality metric (e.g., pass@k for code, BLEU for translation)
2. Fix quality threshold Q_target (e.g., pass@k >= 0.85)
3. Use identical retrieval budgets and memory sources across baselines
Measurement:
def measure_latency_gain(model, baseline, dataset, Q_target):
    """
    Hold quality constant, measure input context length required.
    """
    results = []
    for sample in dataset:
        # Baseline: Binary search for minimum context achieving Q_target
        baseline_ctx_len = binary_search_min_context(
            model=baseline,
            sample=sample,
            quality_fn=compute_quality,
            target=Q_target
        )
        # Anticipatory: Same procedure
        anticipatory_ctx_len = binary_search_min_context(
            model=model,
            sample=sample,
            quality_fn=compute_quality,
            target=Q_target
        )
        gain = (baseline_ctx_len - anticipatory_ctx_len) / baseline_ctx_len
        results.append(gain)
    return np.mean(results), np.std(results)
Attribution-Based Quality (for "contributing tokens"):
def compute_contributing_tokens(model, context, output, k):
    """
    Ablate each context slice and measure loss delta.
    Contribution = measurable degradation when removed.
    k: number of top-contributing slices to return.
    """
    baseline_loss = model.loss(context, output)
    contributions = []
    for i, slice in enumerate(context.slices):
        # Remove slice i
        ablated_context = context.without_slice(i)
        ablated_loss = model.loss(ablated_context, output)
        # Contribution = loss increase when removed
        contribution = ablated_loss - baseline_loss
        contributions.append(contribution)
    # "Contributing" = top-k by contribution score, as (slice_index, contribution) pairs
    contributing_slices = sorted(
        enumerate(contributions),
        key=lambda x: x[1],
        reverse=True
    )[:k]
    return contributing_slices
Expected Improvement: 15-30
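The measurement code calls `binary_search_min_context` without defining it. One possible sketch is given below, assuming quality is non-decreasing in context length (otherwise binary search is not valid). The `quality_at` callable is a hypothetical stand-in that bundles the model, sample, and quality function from the protocol above.

```python
def binary_search_min_context(quality_at, max_len, target):
    """Smallest context length in [0, max_len] whose quality reaches `target`.

    Assumes quality_at(n) is non-decreasing in n. Returns None when even
    max_len falls short of the target.
    """
    if quality_at(max_len) < target:
        return None
    lo, hi = 0, max_len
    while lo < hi:
        mid = (lo + hi) // 2
        if quality_at(mid) >= target:
            hi = mid       # mid is sufficient; look for something shorter
        else:
            lo = mid + 1   # mid is insufficient; need more context
    return lo


def latency_gain(baseline_len, anticipatory_len):
    """Relative reduction in required context length."""
    return (baseline_len - anticipatory_len) / baseline_len


# Toy step-shaped quality curves (illustrative only, not measured results).
baseline_quality = lambda n: 0.9 if n >= 850 else 0.5
anticipatory_quality = lambda n: 0.9 if n >= 680 else 0.5

b = binary_search_min_context(baseline_quality, max_len=2048, target=0.85)
a = binary_search_min_context(anticipatory_quality, max_len=2048, target=0.85)
print(b, a, latency_gain(b, a))
```

If quality is noisy rather than monotone, a linear scan or repeated trials per length would be the safer measurement.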
---
IV.2 Context Window Efficiency via Slice Selection
Hypothesis: Kernel slice priority-queue outperforms fixed-window or recency-based selection.
Measurement:
Relevance Score = sum(contribution_i for i in selected_slices) / total_tokens
Compare:
- Fixed window (last N tokens)
- Recency-only (last N turns, no priority)
- Priority-queue kernel slices (proposed)
Locked Protocol:
1. Same task dataset across all methods
2. Same token budget B
3. Measure attribution-based relevance (ablation method above)
4. Report mean relevance ± std across dataset
Expected Improvement: 20-40
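The relevance score and the mean ± std report can be sketched in a few lines. All numbers and selections below are toy values made up for illustration; they are not results.

```python
from statistics import mean, pstdev


def relevance_score(contributions, selected, total_tokens):
    """Summed ablation contribution of the selected slices per token of budget."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return sum(contributions[i] for i in selected) / total_tokens


def summarize(scores):
    """Mean and population standard deviation across a dataset."""
    return mean(scores), pstdev(scores)


# Per-slice ablation contributions for one sample; same budget B for every method.
contributions = [0.50, 0.05, 0.40, 0.02]
budget = 300
methods = {
    "fixed_window": [2, 3],    # hypothetical selection: last two slices
    "priority_queue": [0, 2],  # hypothetical selection: highest-priority slices
}
for name, selected in methods.items():
    print(name, relevance_score(contributions, selected, budget))
```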
---
IV.3 Specialization via Orthogonality Training
Hypothesis: Orthogonality penalty forces fast/slow pathways to specialize without mode collapse.
Measurement:
def measure_specialization(h_fast, h_slow):
    """
    Specialization Index = cross-covariance Frobenius norm.
    Low = specialized, High = mode collapse.
    """
    h_fast_centered = h_fast - h_fast.mean(dim=0)
    h_slow_centered = h_slow - h_slow.mean(dim=0)
    cov_matrix = h_fast_centered.T @ h_slow_centered
    specialization_index = torch.norm(cov_matrix, p='fro')
    return specialization_index.item()

def measure_frequency_separation(h_fast, h_slow, cutoff):
    """
    Fast should have high-frequency content, slow should have low-frequency.
    Measure via FFT power spectrum.
    cutoff: index of the first frequency bin counted as high-frequency.
    """
    # rfft keeps only non-negative frequencies, so [cutoff:] is purely high-frequency
    # (a full fft would also include the mirrored low-frequency bins at the end)
    fft_fast = torch.fft.rfft(h_fast, dim=0)
    fft_slow = torch.fft.rfft(h_slow, dim=0)
    # Measure power (squared magnitude) in high-frequency bands
    high_freq_power_fast = torch.sum(torch.abs(fft_fast[cutoff:]) ** 2)
    high_freq_power_slow = torch.sum(torch.abs(fft_slow[cutoff:]) ** 2)
    # Fast should have MORE high-freq power than slow
    freq_separation = high_freq_power_fast / high_freq_power_slow
    return freq_separation.item()
Expected Behavior:
- Without orthogonality: specialization_index → high (mode collapse)
- With orthogonality: specialization_index → low (decorrelated)
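- freq_separation well above 1: the fast pathway carries the high-frequency content (syntax, local) while the slow pathway stays low-frequency (semantics, global)
- freq_separation near or below 1: the pathways have not separated by timescale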
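The frequency-separation measurement can be sanity-checked on synthetic signals. The NumPy sketch below uses two pure sinusoids as stand-ins for slow and fast pathway activations; the `eps` guard against division by zero and the choice of `cutoff=16` are assumptions made for this example.

```python
import numpy as np


def high_freq_power(h, cutoff):
    """Spectral power at or above frequency bin `cutoff` for a real [seq, d] signal."""
    spectrum = np.fft.rfft(h, axis=0)  # one-sided spectrum along the sequence axis
    return float(np.sum(np.abs(spectrum[cutoff:]) ** 2))


def frequency_separation(h_fast, h_slow, cutoff, eps=1e-12):
    """Ratio of high-frequency power in the fast pathway to that in the slow pathway."""
    return high_freq_power(h_fast, cutoff) / (high_freq_power(h_slow, cutoff) + eps)


t = np.arange(256)
h_slow = np.sin(2 * np.pi * 2 * t / 256)[:, None]   # 2 cycles over the window
h_fast = np.sin(2 * np.pi * 40 * t / 256)[:, None]  # 40 cycles over the window

print(frequency_separation(h_fast, h_slow, cutoff=16) > 1.0)
```

On real activations the ratio depends strongly on where `cutoff` is placed, so it would make sense to report it for several cutoffs rather than one.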
---
V. Kernel Slice Interface Specification
This section defines how the transformer consumes kernel slices and maintains provenance.
V.1 SliceExport Structure (From Graph Kernel)
/// Exported slice from Graph Kernel with full provenance.
pub struct SliceExport {
    /// Anchor turn this slice was built around
    pub anchor_turn_id: TurnId,
    /// Turns in the slice, sorted by TurnId
    pub turns: Vec<TurnSnapshot>,
    /// Edges between turns in the slice
    pub edges: Vec<Edge>,
    /// Policy identifier (e.g., "slice_policy_v1")
    pub policy_id: String,
    /// Hash of policy parameters (deterministic)
    pub policy_params_hash: String,
    /// Schema version
    pub schema_version: String,
    /// Unique fingerprint of this slice (selection identity)
    pub slice_id: SliceFingerprint,
    /// Graph state at slicing time (content immutability proof)
    pub graph_snapshot_hash: GraphSnapshotHash,
    /// Unforgeable admissibility claim from Graph Kernel (HMAC-SHA256)
    pub admissibility_token: AdmissibilityToken,
}
Key Fields:
- `slice_id`: Content-derived fingerprint (deterministic replay)
- `graph_snapshot_hash`: Detects content drift (immutability proof)
- `admissibility_token`: HMAC-SHA256 signed by kernel (unforgeable)
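How an HMAC-SHA256 admissibility token could be issued and checked is sketched below with Python's standard library. The field encoding (UTF-8 strings joined by a separator byte) and the helper names are assumptions for illustration; the Graph Kernel's actual serialization is not specified in this document.

```python
import hashlib
import hmac


def _message(fields):
    """Join the signed fields with a separator byte (hypothetical encoding)."""
    return b"\x1f".join(f.encode("utf-8") for f in fields)


def issue_token(secret: bytes, fields) -> str:
    """Kernel side: sign the slice's identifying fields."""
    return hmac.new(secret, _message(fields), hashlib.sha256).hexdigest()


def verify_token(secret: bytes, fields, token: str) -> bool:
    """Consumer side: recompute the MAC and compare in constant time."""
    expected = issue_token(secret, fields)
    return hmac.compare_digest(expected, token)


secret = b"kernel-secret-for-demo-only"
fields = ["slice-123", "turn-42", "slice_policy_v1", "params-hash", "graph-hash", "1.0"]
token = issue_token(secret, fields)

print(verify_token(secret, fields, token))                       # untouched slice
print(verify_token(secret, ["slice-999"] + fields[1:], token))   # tampered slice_id
```

Verification requires the same secret that signed the token, so any component able to verify slices could also forge them; whether that is acceptable depends on the trust boundary between the kernel and the transformer.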
Verification:
impl SliceExport {
    /// Verify this slice was issued by the kernel.
    pub fn verify_admissibility(&self, hmac_secret: &[u8]) -> bool {
        self.admissibility_token.verify_hmac(
            hmac_secret,
            &self.slice_id,
            &self.anchor_turn_id,
            &self.policy_id,
            &self.policy_params_hash,
            &self.graph_snapshot_hash,
            &self.schema_version,
        )
    }

    /// Check if a turn is admissible in this slice.
    pub fn is_turn_admissible(&self, turn_id: &TurnId) -> bool {
        self.turns.binary_search_by_key(turn_id, |t| t.id).is_ok()
    }
}
---
V.2 Transformer Consumption Interface
from __future__ import annotations  # defer evaluation of type hints

from typing import Dict, List

import torch

# SliceExport, SliceEncoder, Token, and canonical_hash are assumed to be
# defined elsewhere in the codebase.


class SliceBundle:
    """
    Bundle of admissible slices for transformer consumption.
    """

    def __init__(self, slices: List[SliceExport], kernel_secret: bytes):
        self.slices = slices
        self.kernel_secret = kernel_secret
        # Verify all slices are kernel-issued. Raise instead of assert so the
        # check still runs under `python -O`.
        for s in slices:
            if not s.verify_admissibility(kernel_secret):
                raise ValueError(
                    f"Slice {s.slice_id} failed admissibility check"
                )
        # Compute bundle fingerprint (for audit trail)
        self.bundle_fingerprint = self._compute_fingerprint()

    def _compute_fingerprint(self) -> str:
        """
        Content-derived hash of entire bundle.
        Enables deterministic replay of context selection.
        """
        slice_ids = sorted(s.slice_id for s in self.slices)
        return canonical_hash(slice_ids)

    def compress_to_embeddings(self, encoder: SliceEncoder) -> torch.Tensor:
        """
        Compress slices to fixed-size embeddings for slow pathway.
        Each slice → single embedding vector.
        Slow pathway attends to these compressed representations.
        """
        embeddings = []
        for s in self.slices:
            # Encode slice content
            slice_embedding = encoder.encode(
                turns=s.turns,
                edges=s.edges,
                metadata={
                    'phase': s.dominant_phase(),
                    'salience': s.average_salience(),
                    'timestamp': s.anchor_turn().timestamp
                }
            )
            embeddings.append(slice_embedding)
        return torch.stack(embeddings)  # [num_slices, d_model]

    def materialize_slice(self, slice_idx: int, tokenizer) -> List[Token]:
        """
        Materialize a slice to full token sequence (optional, expensive).
        Only called when materialization gate activates.
        """
        selected = self.slices[slice_idx]
        # Concatenate turn contents
        full_text = "\n".join(turn.content for turn in selected.turns)
        tokens = tokenizer.encode(full_text)
        return tokens

    def get_provenance_trail(self) -> Dict:
        """
        Export full provenance for auditing.
        Enables tracing model decisions back to kernel slices.
        """
        return {
            'bundle_fingerprint': self.bundle_fingerprint,
            'slices': [
                {
                    'slice_id': s.slice_id,
                    'anchor_turn': s.anchor_turn_id,
                    'policy_id': s.policy_id,
                    'graph_snapshot': s.graph_snapshot_hash,
                    'admissibility_token': s.admissibility_token,
                    'num_turns': len(s.turns),
                    'num_tokens': s.num_tokens()
                }
                for s in self.slices
            ]
        }
---
V.3 Slice Cross-Attention Mechanism
Architecture:
Token Sequence (small window) ─┐
                               ├─→ Token Self-Attention (Fast)
                               │
Kernel Slices (compressed) ────┼─→ Slice Cross-Attention (Slow)
                               │
                               └─→ Gating Network → Output
Implementation:
import torch.nn as nn

# FastLayer, SlowLayer, SliceEncoder, GatingNetwork, and SliceBundle are
# assumed to be defined elsewhere.


class SliceAwareTransformer(nn.Module):
    def __init__(self, d_model, n_heads, n_fast_layers, n_slow_layers):
        super().__init__()  # nn.Module must be initialised before submodules are assigned
        # Fast pathway: token self-attention
        self.fast_layers = nn.ModuleList([
            FastLayer(d_model, n_heads, window_size=128)
            for _ in range(n_fast_layers)
        ])
        # Slow pathway: slice cross-attention
        self.slow_layers = nn.ModuleList([
            SlowLayer(d_model, n_heads, max_slices=32)
            for _ in range(n_slow_layers)
        ])
        # Slice encoder (compresses SliceExport → embedding)
        self.slice_encoder = SliceEncoder(d_model)
        # Coordinator
        self.gating = GatingNetwork(d_model)
        # NOTE: self.embed_tokens (the token embedding) is assumed to be
        # registered elsewhere.

    def forward(self, token_ids, slice_bundle: SliceBundle, commit, uncert):
        # commit / uncert: per-token commitment and uncertainty estimates,
        # assumed here to be supplied by the caller.
        # Encode tokens
        x_tokens = self.embed_tokens(token_ids)
        # Compress slices to embeddings
        slice_embeddings = slice_bundle.compress_to_embeddings(
            self.slice_encoder
        )
        # Fast pathway: process tokens
        h_fast = x_tokens
        for layer in self.fast_layers:
            h_fast = layer(h_fast, traj_coords=None)
        # Slow pathway: attend to slices
        h_slow = x_tokens  # Same initial state
        for layer in self.slow_layers:
            h_slow = layer(h_slow, slice_embeddings)
        # Coordinate outputs (alpha is the gating weight, unused here)
        h_combined, alpha = self.gating(h_fast, h_slow, commit, uncert)
        return h_combined
---
V.4 Provenance Tracking in Model Outputs
from __future__ import annotations  # defer evaluation of type hints

from dataclasses import dataclass
from typing import Dict, List

import torch

# Token is assumed to be defined elsewhere.


@dataclass
class ModelOutput:
    """
    Model output with full provenance trail.
    """
    tokens: List[Token]
    commitment_scores: List[float]
    uncertainty_scores: List[float]
    regime_probs: torch.Tensor
    # Provenance fields
    slice_bundle_fingerprint: str
    slice_ids: List[str]
    slice_admissibility_tokens: List[str]

    def export_audit_log(self) -> Dict:
        """
        Export audit log for debugging/analysis.
        Enables tracing each token back to source slices.
        """
        return {
            'output_tokens': self.tokens,
            'commitment_per_token': self.commitment_scores,
            'uncertainty_per_token': self.uncertainty_scores,
            'regimes_per_token': self.regime_probs.tolist(),
            'provenance': {
                'bundle_fingerprint': self.slice_bundle_fingerprint,
                'source_slices': [
                    {
                        'slice_id': sid,
                        'admissibility_token': token,
                        # Admissibility tokens are verified when the
                        # SliceBundle is constructed
                        'verified': True
                    }
                    for sid, token in zip(
                        self.slice_ids,
                        self.slice_admissibility_tokens
                    )
                ]
            }
        }
---
VI. Implementation Roadmap
Phase 0: Validation (3-4 weeks)
Objective: Prove core concepts on small scale before full implementation.
- [ ] Implement dual-pathway mini-transformer (2M params)
- [ ] Validate orthogonality loss prevents mode collapse (measure specialization index)
- [ ] Measure fast/slow frequency separation on small dataset
- [ ] Prototype trajectory encoding with additive bias attention
- [ ] Benchmark slice priority-queue vs fixed-window (attribution-based relevance)
- [ ] Test commitment targets (counterfactual stability vs robustness)
Deliverable: Technical report with ablation studies proving each innovation.
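To make the orthogonality check in the list above concrete, here is a minimal sketch of one way the penalty could be measured on a small batch. It assumes the penalty is the squared Frobenius norm of the cross-covariance between fast and slow pathway activations; the exact loss belongs to an earlier section, so this form, the function name `orthogonality_penalty`, and the plain-list representation are assumptions for illustration.

```python
from typing import List

Matrix = List[List[float]]  # [positions][features]


def _center(x: Matrix) -> Matrix:
    """Subtract the per-feature mean over positions."""
    n, d = len(x), len(x[0])
    means = [sum(row[j] for row in x) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in x]


def orthogonality_penalty(h_fast: Matrix, h_slow: Matrix) -> float:
    """Squared Frobenius norm of the fast/slow cross-covariance (assumed form)."""
    if len(h_fast) != len(h_slow):
        raise ValueError("pathways must cover the same positions")
    f, s = _center(h_fast), _center(h_slow)
    n = len(f)
    penalty = 0.0
    for i in range(len(f[0])):
        for j in range(len(s[0])):
            cov = sum(f[t][i] * s[t][j] for t in range(n)) / n
            penalty += cov * cov
    return penalty


fast = [[1.0], [-1.0], [1.0], [-1.0]]   # alternates every step
slow = [[1.0], [1.0], [-1.0], [-1.0]]   # changes half as often
print(orthogonality_penalty(fast, slow))  # uncorrelated signals
print(orthogonality_penalty(fast, fast))  # collapsed pathways
```

A penalty near zero suggests the pathways carry decorrelated features, while a large value is one possible symptom of the mode collapse the checklist aims to rule out.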
---
Phase 1: Core Architecture (6-8 weeks)
Objective: Build full-scale model with all components.
- [ ] Implement trajectory encoder (5D coordinate system with normalization)
- [ ] Build fast pathway (8 layers, local attention, τ=4, additive bias)
- [ ] Build slow pathway (6 layers, slice cross-attention, τ=64)
- [ ] Implement gating network (coordinator)
- [ ] Add differentiable regime detector (simplex over regimes)
- [ ] Integrate kernel slice interface (SliceExport consumption)
- [ ] Implement slice encoder and compression
- [ ] Add provenance tracking to all outputs
Deliverable: Trainable model with ~350M parameters + slice interface.
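The "simplex over regimes" item can be read as a softmax over per-regime logits, which keeps the regime assignment differentiable instead of thresholded. The sketch below assumes the three regimes named later in the document (exploration, consolidation, synthesis); the function name and the `temperature` parameter are hypothetical.

```python
import math
from typing import Dict, Sequence

REGIMES = ("exploration", "consolidation", "synthesis")


def regime_simplex(logits: Sequence[float], temperature: float = 1.0) -> Dict[str, float]:
    """Map regime logits to probabilities that are non-negative and sum to 1."""
    if len(logits) != len(REGIMES):
        raise ValueError("expected one logit per regime")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return {name: e / total for name, e in zip(REGIMES, exps)}


print(regime_simplex([2.0, 0.0, -1.0]))
```

Lower temperatures push the distribution toward a single regime; higher ones keep it closer to uniform.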
---
Phase 2: Training & Evaluation (8-12 weeks)
Objective: Train to convergence and benchmark rigorously.
- [ ] Train on diverse corpus (code, prose, dialogue, technical)
- [ ] Implement multi-objective loss (orthogonality, not naive divergence)
- [ ] Hyperparameter tuning (λ_ortho, λ_smooth, λ_commit, λ_cal)
- [ ] Evaluate on standard benchmarks (perplexity, pass@k, BLEU)
- [ ] Measure latency gain (locked protocol, attribution-based)
- [ ] Measure context efficiency (slice relevance vs baselines)
- [ ] Analyze fast/slow specialization (FFT, cross-covariance)
- [ ] Validate commitment calibration (counterfactual stability)
Deliverable: Trained model + comprehensive evaluation report with non-gameable metrics.
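For the commitment calibration item, one standard measure is expected calibration error (ECE): bin the predicted commitment scores and compare each bin's mean score with the observed rate at which the prefix actually proved stable. The sketch below assumes binary stability labels (for example, derived from the counterfactual stability targets) and equal-width bins; both are assumptions, and the proposal may settle on a different metric.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    scores: Sequence[float],
    outcomes: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE between commitment scores in [0, 1] and binary stability outcomes."""
    if len(scores) != len(outcomes) or not scores:
        raise ValueError("scores and outcomes must be non-empty and aligned")
    if any(s < 0.0 or s > 1.0 for s in scores):
        raise ValueError("scores must lie in [0, 1]")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for score, outcome in zip(scores, outcomes):
        idx = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((score, outcome))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_score = sum(s for s, _ in bucket) / len(bucket)
        hit_rate = sum(1 for _, o in bucket if o) / len(bucket)
        ece += (len(bucket) / len(scores)) * abs(mean_score - hit_rate)
    return ece


# Overconfident model: commits at 0.9 but is stable only 1 time in 4.
print(expected_calibration_error([0.9] * 4, [True, False, False, False]))
```

An ECE near zero means a commitment score of, say, 0.9 corresponds to stable prefixes about 90% of the time.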
---
Phase 3: Scaling & Optimization (4-6 weeks)
Objective: Scale to production size and optimize inference.
- [ ] Scale to 1B+ parameters
- [ ] Implement efficient attention kernels (FlashAttention-compatible additive bias)
- [ ] Optimize slice compression and materialization gate
- [ ] Add KV-cache compatibility for fast inference
- [ ] Benchmark throughput and memory usage
- [ ] Implement streaming generation with commitment gating (safety mechanisms)
- [ ] Production slice interface integration with Graph Kernel service
Deliverable: Production-ready model with deployment guide + kernel integration.
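The streaming item pairs commitment gating with the deadlock-prevention limits named in the conclusion (max buffer, max wait steps, provisional commit). A minimal sketch of such a gate follows. The class name `CommitGate`, its defaults, and the channel labels are assumptions for illustration, not the proposal's specified interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CommitGate:
    """Buffers tokens until commitment is high or a safety limit is hit."""
    commit_threshold: float = 0.8
    max_buffer: int = 4        # max tokens held back
    max_wait_steps: int = 6    # max steps with tokens pending
    _buffer: List[str] = field(default_factory=list, init=False)
    _waited: int = field(default=0, init=False)

    def step(self, commitment: float, token: Optional[str] = None) -> Tuple[List[str], str]:
        """Advance one step; return (released_tokens, channel)."""
        if token is not None:
            self._buffer.append(token)
        if not self._buffer:
            self._waited = 0  # nothing pending, so nothing can deadlock
            return [], "idle"
        self._waited += 1
        if commitment >= self.commit_threshold:
            return self._flush("committed")
        if len(self._buffer) >= self.max_buffer or self._waited >= self.max_wait_steps:
            return self._flush("provisional")  # forced release, may be revised
        return [], "buffered"

    def _flush(self, channel: str) -> Tuple[List[str], str]:
        released, self._buffer, self._waited = self._buffer, [], 0
        return released, channel


gate = CommitGate(max_buffer=3)
for tok, c in [("def", 0.2), ("main", 0.9), ("(", 0.1), (")", 0.1), (":", 0.1)]:
    print(gate.step(c, tok))
```

The two limits guarantee that pending tokens are released after a bounded number of steps even if commitment never crosses the threshold. The `"provisional"` label lets a UI render those tokens as draft rather than committed output.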
---
VII. Expected Innovations
VII.1 Research Contributions
1. Dual-Timescale Language Models: First architecture with explicit fast/slow pathways trained with orthogonality penalty (not naive divergence)
2. Trajectory-Aware Attention: Novel positional encoding based on 5D semantic trajectory space with additive bias (numerically stable)
3. Commitment-Driven Generation: Generation policy based on counterfactual stability and multi-signal convergence, with deadlock prevention
4. Kernel Slice Context Selection: Deterministic, policy-driven context windows operating on provenance-tracked SliceExport primitives
5. Regime-Based Processing: Differentiable simplex over regimes (exploration/consolidation/synthesis) guiding attention patterns
6. Slice Cross-Attention: Virtual context window via compressed kernel slices, enabling bounded-compute long-context processing
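The bounded-compute claim in item 6 can be illustrated with a single query attending over a capped number of slice embeddings. This is a deliberately simplified sketch: it uses identity query/key/value projections and one head, and the function name is hypothetical. The `max_slices=32` default matches the value used by `SlowLayer` in Section V.3.

```python
import math
from typing import List, Sequence, Tuple


def slice_cross_attention(
    query: Sequence[float],
    slice_embeddings: Sequence[Sequence[float]],
    max_slices: int = 32,
) -> Tuple[List[float], List[float]]:
    """One token attends over at most `max_slices` compressed slices.

    Returns (context_vector, attention_weights). Per-token cost is
    O(max_slices * d) regardless of how many turns the slices summarise.
    """
    slices = list(slice_embeddings)[:max_slices]
    if not slices:
        raise ValueError("need at least one slice embedding")
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, emb)) / math.sqrt(d) for emb in slices
    ]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * emb[j] for w, emb in zip(weights, slices)) for j in range(d)]
    return context, weights


context, weights = slice_cross_attention([1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]])
print(weights)
```

Because the slow pathway sees only the compressed embeddings, the attention cost depends on the slice budget rather than on the length of the underlying conversation history.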
---
VII.2 Practical Benefits
For Long-Context Tasks:
- More efficient use of context window (slice priority-queue)
- Better handling of long-range dependencies (trajectory-aware attention)
- Reduced latency for equivalent quality (anticipatory generation)
- Provenance-tracked context (audit trail to source slices)
For Code Generation:
- Fast pathway handles syntax, slow pathway handles semantics
- Commitment detection enables early generation of boilerplate
- Regime detection identifies exploratory vs synthesis phases
- Slice attention captures relevant code patterns from memory
For Dialogue:
- Fast pathway maintains conversational flow
- Slow pathway tracks narrative arc and user intent
- Uncertainty estimation enables asking clarifying questions
- Two-channel output (committed vs draft) for real-time UIs
For Reasoning:
- Slow pathway performs deliberate reasoning over compressed memory
- Fast pathway maintains coherence during thinking
- Buffered generation enables self-revision
- Provenance trail for debugging reasoning chains
---
VIII. Philosophical Alignment Summary
This architecture embodies Comp-Core's core principles:
| Principle | Manifestation in Architecture |
|---|---|
| Anticipation Over Prediction | Commitment detection (counterfactual stability), multi-signal convergence, early generation with safety mechanisms |
| Motion as Semantic Object | Continuous processing, differentiable regime simplex, trajectory encoding |
| Dual-Timescale Processing | Fast/slow pathways, orthogonality penalty, coordinator network |
| Trajectory-Aware Memory | 5D positional encoding, additive bias attention, I-RCP-inspired ring distance |
| Determinism | Kernel slice priority-queue, content-derived fingerprints, admissibility tokens |
| Asymmetric Reversibility | Easy to expand slices (weaken), hard to commit generation (strengthen) |
| Policy-Driven Expansion | Explicit budget constraints, phase-weighted priorities, auditable slice selection |
| Provenance Tracking | SliceExport with HMAC tokens, bundle fingerprints, audit logs |
---
IX. Open Questions & Research Directions
IX.1 Theoretical Questions
1. Optimal Orthogonality Strength: What value of λ_ortho maximizes specialization without hurting reconstruction?
2. Timescale Ratios: What τ_fast / τ_slow ratio is optimal for different task domains?
3. Trajectory Dimensionality: Is 5D sufficient, or should we add dimensions (emotional valence, formality, discourse structure)?
4. Commitment Threshold Adaptation: Should thresholds be learned per-task or globally fixed?
5. Slice Compression: What is the optimal slice embedding dimension vs information loss tradeoff?
---
IX.2 Engineering Challenges
1. Attention Efficiency: How to implement trajectory-aware additive bias without quadratic complexity blowup?
2. Training Stability: Does orthogonality penalty require curriculum learning or warmup schedules?
3. Slice Materialization: When should the gate trigger full slice expansion vs compressed representation?
4. Streaming Generation: How to support streaming output with buffered commitment gating in production UIs?
5. Kernel Integration: What latency is acceptable for slice requests to Graph Kernel service?
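As a reference point for question 1, the sketch below shows one baseline that keeps the additive bias out of the quadratic regime: apply the bias only inside the fast pathway's local causal window, so the cost is O(n · window) rather than O(n²). The bias form (negative scaled Euclidean distance between trajectory coordinates), the function name, and the `bias_scale` parameter are assumptions; the document's actual bias, including the ring distance, may differ.

```python
import math
from typing import List, Sequence


def biased_local_attention_weights(
    scores: Sequence[Sequence[float]],
    coords: Sequence[Sequence[float]],
    window: int = 128,
    bias_scale: float = 1.0,
) -> List[List[float]]:
    """Softmax weights per query over a causal local window.

    scores[i][j] is the raw attention logit for query i and key j;
    coords[i] holds the trajectory coordinates of position i.
    Row i of the result covers keys max(0, i - window + 1) .. i.
    """
    out: List[List[float]] = []
    for i in range(len(scores)):
        start = max(0, i - window + 1)
        logits = [
            scores[i][j] - bias_scale * math.dist(coords[i], coords[j])
            for j in range(start, i + 1)
        ]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


zero_scores = [[0.0] * 4 for _ in range(4)]
coords = [[0.0], [1.0], [2.0], [3.0]]
print(biased_local_attention_weights(zero_scores, coords, window=2))
```

With equal raw scores, positions that are closer in trajectory space receive more weight. Whether this windowed form preserves enough long-range signal is exactly what the open question asks.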
---
IX.3 Evaluation Methodology
1. Anticipation Metrics: How to measure "latency reduction via anticipation" on diverse tasks beyond code?
2. Specialization Metrics: What frequency bands constitute "high-freq" vs "low-freq" for language?
3. Trajectory Quality: How to validate that trajectory encoding captures meaningful semantic structure?
4. Commitment Calibration: How to ensure counterfactual stability targets correlate with human judgments?
5. Slice Relevance: Can attribution-based relevance be gamed by adversarial slice selection?
---
X. Conclusion
The Anticipatory Transformer is more than an incremental improvement to existing architectures: it proposes a shift from prediction to anticipation, from fixed context to kernel slice selection, and from single-timescale to dual-equilibrium processing.
By aligning with Comp-Core's philosophical foundations (DELL theory, computational choreography, motion intelligence, graph kernel provenance), this architecture has the potential to achieve:
- **15-30
- **20-40
- Better long-range coherence via trajectory-aware additive bias attention
- Interpretable behavior via differentiable regime detection and commitment signals
- Full provenance tracking via SliceExport admissibility tokens
Key Engineering Corrections:
- ✅ Commitment operationalized (counterfactual stability, not vibes)
- ✅ Deadlock prevention (max buffer, max wait steps, provisional commit)
- ✅ Orthogonality penalty (not naive divergence, numerically stable)
- ✅ Additive bias attention (FlashAttention-compatible)
- ✅ Differentiable regimes (simplex, not thresholds)
- ✅ Kernel slice interface (SliceExport, provenance-tracked)
- ✅ Rigorous evaluation (attribution-based, non-gameable)
- ✅ Slice cross-attention (virtual context window, bounded compute)
The path forward is clear:
1. Validate core concepts on small scale (Phase 0)
2. Build and train full architecture with slice interface (Phases 1-2)
3. Scale and optimize for production (Phase 3)
Next Step: Review this revised proposal, validate engineering corrections, and initiate Phase 0 validation experiments.
---
References:
- [DELL Theory (19)]([home]/Desktop/Comp-Core/Docs/architecture/19-DELL_THEORY.md)
- [Graph Kernel (15)]([home]/Desktop/Comp-Core/Docs/architecture/15-GRAPH_KERNEL.md)
- [Computational Choreography (01)]([home]/Desktop/Comp-Core/Docs/architecture/01-COMPUTATIONAL_CHOREOGRAPHY.md)
- [TrajectoryOS (02)]([home]/Desktop/Comp-Core/Docs/architecture/02-TRAJECTORY_OS.md)
- [Anticipation Kernel]([home]/Desktop/Comp-Core/core/cc-anticipation/docs/PROJECT_CHARTER.md)
- [Graph Kernel Design]([home]/Desktop/Comp-Core/core/cc-graph-kernel/docs/DESIGN.md)
- [Graph Kernel Slice Types]([home]/Desktop/Comp-Core/core/cc-graph-kernel/src/types/slice.rs)
Promotion Decision
Promote into a technical note or architecture paper with implementation anchors.
Source Anchor
Comp-Core/docs/architecture/23-ANTICIPATORY_TRANSFORMER.md