working paper | preprint structure candidate | score 100

The Anticipatory Transformer: Geometry-Steered Attention for Trajectory-Aware Reasoning

Extracted abstract or opening context

Standard transformers attend using position encodings (sinusoidal, RoPE, ALiBi) that encode *where* tokens are in a sequence but not *what the sequence is doing* as a geometric process. I introduce the Anticipatory Transformer, a modified transformer architecture in which seven geometric scalars derived from Anticipation Geometry (commitment, uncertainty, transition pressure, recovery margin, phase stiffness, novelty, stability) steer the multi-head attention mechanism via an additive bias. The trajectory bias is computed by a learned network that maps the seven scalars at each position to per-head, position-dependent attention biases, enabling different heads to specialize to different geometric dimensions of the reasoning trajectory. I also introduce the CommitmentGate, a threshold-based mechanism that determines *when* to emit tokens: when the model's predicted commitment is below a learned threshold, it buffers hidden states and defers emission, enabling variable-rate generation that mirrors the deliberative pauses of human reasoning. The architecture further incorporates a dual-pathway design: a fast pathway with local windowed attention (128-token window, updated every token) for high-frequency pattern capture, and a slow pathway with global attention (full context) for long-range dependency modeling. In smoke tests on a 678,206-parameter model trained for 50 steps on synthetic data, the commitment gate achieves +0.93 correlation with the commitment scalar, attention heads specialize to 3 out of 4 unique dominant scalars, scalar prediction MSE drops from 0.15 to 0.07, and the orthogonality penalty converges to 0.005. I present this as a complete, implemented architecture with preliminary validation, not as a benchmark-breaking result.
I argue that the trajectory-bias mechanism is well suited to three application domains where standard position encodings are insufficient: agent reasoning over multi-step plans, multi-hop knowledge graph traversal, and real-time motion-to-audio synthesis.
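
The two mechanisms described in the abstract can be illustrated with a small plain-Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the learned bias network is replaced by one linear weight vector per head, the bias is applied on the key side only, the gate threshold is a fixed argument rather than a learned parameter, and emission is modelled as releasing the buffered group of hidden states. All function names (`trajectory_bias`, `biased_attention`, `commitment_gate`) are hypothetical.

```python
import math

# The seven scalars named in the abstract, in an assumed (arbitrary) order.
SCALARS = ["commitment", "uncertainty", "transition_pressure", "recovery_margin",
           "phase_stiffness", "novelty", "stability"]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def trajectory_bias(scalars, head_weights):
    """Map per-position scalars to per-head additive attention biases.

    scalars:      [seq_len][7] geometric scalars at each position.
    head_weights: [num_heads][7] linear stand-in for the learned bias network
                  (assumption: the real network may be nonlinear).
    Returns bias[h][i][j], the value added to head h's logit for query i
    attending to key j. Here it depends only on the key position j.
    """
    seq_len = len(scalars)
    assert all(len(s) == len(SCALARS) for s in scalars)
    bias = []
    for w in head_weights:
        key_score = [sum(wk * sk for wk, sk in zip(w, s)) for s in scalars]
        bias.append([list(key_score) for _ in range(seq_len)])
    return bias


def biased_attention(logits, bias):
    """Causal attention weights for one head: softmax(logits + bias)."""
    weights = []
    for i, (lrow, brow) in enumerate(zip(logits, bias)):
        visible = [l + b for l, b in zip(lrow[: i + 1], brow[: i + 1])]
        # Future positions are masked out and receive zero weight.
        weights.append(softmax(visible) + [0.0] * (len(lrow) - i - 1))
    return weights


def commitment_gate(hidden_states, commitments, threshold=0.5):
    """Buffer hidden states while commitment is below the threshold.

    Returns (emitted, pending): emitted is a list of buffered groups released
    each time commitment reaches the threshold; pending holds states that are
    still deferred when the sequence ends.
    """
    emitted, pending = [], []
    for h, c in zip(hidden_states, commitments):
        pending.append(h)
        if c >= threshold:
            emitted.append(pending)
            pending = []
    return emitted, pending
```

As a usage example, a head whose weight vector is nonzero only on the commitment scalar shifts attention toward high-commitment positions, while a head with all-zero weights reduces to ordinary causal attention. Calling `commitment_gate(["a", "b", "c", "d"], [0.2, 0.9, 0.1, 0.3])` releases `["a", "b"]` as one group and leaves `["c", "d"]` deferred, which is the variable-rate behaviour the abstract describes.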

Promotion decision

What has to happen next

Convert into the standard paper schema, add citations, and render a draft PDF.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.