Grand Diomande Research · Full HTML Reader

AGP Paper Framing V1

**AGP: Anticipatory Geometry Partitioning for Semantically Routed Distributed Transformer Inference on Heterogeneous Apple Silicon**

Working Paper Title

AGP: Anticipatory Geometry Partitioning for Semantically Routed Distributed Transformer Inference on Heterogeneous Apple Silicon

Alternate Short Title

AGP: A Hierarchical Transformer Architecture for Routed Hidden-State Transfer

Detailed Abstract

We introduce Anticipatory Geometry Partitioning (AGP), a transformer-centered systems architecture that treats intermediate hidden states not as disposable internals of a monolithic forward pass, but as operational interfaces for conditional computation, semantic control, and cross-device continuation. In standard decoder-only transformer inference, every token traverses essentially the same depth, on the same device, under the same compute budget, regardless of whether the model already possesses a sufficient internal estimate of the answer. AGP challenges that assumption. It augments a base language model with a learned control stack that predicts, from intermediate representations, whether the current state should be accepted locally, resumed from a later boundary, revived locally, or escalated to a deeper corrective path. In our current implementation, the base model is a Thunder-trained `Gemma 4 E2B` backbone in `MLX`; the control stack consists of route, vitality, and earliest-layer heads; and the transfer stack consists of a learned same-host latent adapter that reconstructs final hidden-state targets from selected source-layer states.
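As a rough illustration of the control stack described above, the following sketch reads route, vitality, and earliest-layer predictions from a single intermediate hidden state using three small linear heads. It is a minimal stand-in, not the MLX implementation: the `LinearHead` class, the random weights, the 16-dimensional hidden state, and the `CANDIDATE_LAYERS` set are all assumptions made for the example.

```python
import math
import random

# Label sets. The route and vitality names follow the abstract; the candidate
# layer indices are placeholders chosen for illustration.
ROUTES = ("accept_local", "continue_local", "revive_local", "escalate")
VITALITY = ("healthy", "weak", "needs_revival")
CANDIDATE_LAYERS = (26, 30)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearHead:
    """A hypothetical lightweight head: one linear layer followed by softmax."""

    def __init__(self, n_in, labels, rng):
        self.labels = labels
        # Random weights stand in for trained parameters.
        self.weights = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in labels]
        self.bias = [0.0] * len(labels)

    def probs(self, hidden):
        logits = [
            sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(self.weights, self.bias)
        ]
        return dict(zip(self.labels, softmax(logits)))


def control_readout(hidden, route_head, vitality_head, boundary_head):
    """Summarize one intermediate hidden state as a small control record."""
    route = route_head.probs(hidden)
    vitality = vitality_head.probs(hidden)
    boundary = boundary_head.probs(hidden)
    return {
        "route": max(route, key=route.get),
        "vitality": max(vitality, key=vitality.get),
        "earliest_layer": max(boundary, key=boundary.get),
        "route_probs": route,
    }


rng = random.Random(0)
hidden = [rng.gauss(0, 1) for _ in range(16)]  # toy hidden state
heads = [LinearHead(16, labels, rng) for labels in (ROUTES, VITALITY, CANDIDATE_LAYERS)]
readout = control_readout(hidden, *heads)
```

The point of the sketch is the shape of the interface: each head is cheap relative to a transformer block, and all three read the same intermediate state.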

The central claim of AGP is that transformer hidden states can become typed scheduling objects. Rather than spending full-depth computation uniformly, the system learns a small policy over latent sufficiency. This policy is informed by an anticipation-inspired control geometry: the model does not merely ask “what token comes next,” but “what kind of state is this, how alive is it, how far is it from semantic sufficiency, and what is the cheapest safe continuation path?” In the full architecture, these judgments are made by a hierarchy of lightweight heads that estimate route action, vitality, and earliest acceptable boundary. The route layer determines whether computation should terminate locally, continue locally from a late layer, be revived locally, or escalate to a corrective path. The vitality layer estimates whether the latent state is healthy, weak, or in need of revival. The boundary layer estimates where a continuation boundary should be drawn. Together, these components convert a decoder transformer into a runtime that allocates compute according to the geometry of the current hidden state rather than the static index of the final layer.
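One way to read "the cheapest safe continuation path" is as a thresholded choice over routes ordered by cost. The sketch below is an assumption about how such a policy could look, not the controller used in AGP: the `TypedState` record, the relative costs in `ROUTE_COST`, the `min_confidence` threshold, and the rule that a state needing revival may not be accepted or continued as-is are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical relative costs per route; only their ordering matters here.
ROUTE_COST = {
    "accept_local": 0.0,
    "continue_local": 1.0,
    "revive_local": 2.0,
    "escalate": 5.0,
}


@dataclass(frozen=True)
class TypedState:
    """A hidden state summarized as a typed scheduling object (illustrative)."""
    route_probs: dict      # route name -> probability from the route head
    vitality: str          # "healthy", "weak", or "needs_revival"
    earliest_layer: int    # earliest acceptable boundary from the boundary head


def choose_path(state, min_confidence=0.6):
    """Pick the cheapest route that is confident enough, else escalate."""
    allowed = dict(state.route_probs)
    if state.vitality == "needs_revival":
        # Assumed safety rule: an unhealthy state is never accepted or
        # continued without revival or escalation.
        allowed.pop("accept_local", None)
        allowed.pop("continue_local", None)
    for route in sorted(allowed, key=ROUTE_COST.get):
        if route != "escalate" and allowed[route] >= min_confidence:
            return route
    return "escalate"  # fallback when no cheaper route is trusted


easy = TypedState(
    {"accept_local": 0.9, "continue_local": 0.05, "revive_local": 0.03, "escalate": 0.02},
    vitality="healthy",
    earliest_layer=26,
)
ambiguous = TypedState(
    {"accept_local": 0.25, "continue_local": 0.25, "revive_local": 0.25, "escalate": 0.25},
    vitality="weak",
    earliest_layer=30,
)
easy_choice = choose_path(easy)
ambiguous_choice = choose_path(ambiguous)
```

Falling back to `escalate` whenever no cheaper route clears the threshold keeps the policy conservative: uncertainty costs compute rather than correctness.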

AGP is designed explicitly for heterogeneous Apple hardware. The dense model and trainable adapters live in `MLX`, taking advantage of unified-memory GPU execution on Apple silicon. Two hosts, `Mac4` and `Mac5`, are connected through `Thunderbolt 5`, which supplies the transport envelope for future cross-host continuation. In the full design, the `Apple Neural Engine` is not treated as a replacement for the GPU, but as a low-power reflex substrate for cheap, high-frequency modules such as routing heads, vitality heads, semantic heads, and shallow projection or packetization layers. The GPU remains the primary substrate for dense transformer execution, corrective continuation, and training updates. The CPU manages orchestration, packet control, and host coordination. This yields a hardware hierarchy rather than a flat accelerator story: `ANE` for reflexive control, `GPU` for deep reasoning and adaptation, `CPU` for scheduling, and `Thunderbolt` for latent-state mobility.
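The hardware hierarchy above can be written down as a placement table. The sketch below only restates the intended assignment from the paragraph in code; the module names are hypothetical labels, and the ANE entries describe the planned full design rather than a measured deployment.

```python
from enum import Enum


class Substrate(Enum):
    ANE = "Apple Neural Engine"
    GPU = "GPU"
    CPU = "CPU"
    THUNDERBOLT = "Thunderbolt 5"


# Hypothetical module names mapped to the substrate the text assigns them to.
PLACEMENT = {
    "route_head": Substrate.ANE,
    "vitality_head": Substrate.ANE,
    "semantic_head": Substrate.ANE,
    "packetizer": Substrate.ANE,
    "dense_transformer": Substrate.GPU,
    "corrective_continuation": Substrate.GPU,
    "training_update": Substrate.GPU,
    "orchestration": Substrate.CPU,
    "packet_control": Substrate.CPU,
    "host_coordination": Substrate.CPU,
    "latent_transfer": Substrate.THUNDERBOLT,
}


def substrate_for(module):
    """Return the substrate a module is assigned to, or raise if unknown."""
    try:
        return PLACEMENT[module]
    except KeyError:
        raise ValueError(f"no placement defined for module {module!r}") from None


def modules_on(substrate):
    """List the modules assigned to one substrate, sorted by name."""
    return sorted(name for name, target in PLACEMENT.items() if target is substrate)


cpu_modules = modules_on(Substrate.CPU)
```

Keeping placement in one table makes the division of labor explicit and easy to audit when a module moves between substrates.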

The current AGP stack is staged. First, a personalized backbone is trained with `LoRA` on a domain-specific conversation corpus so that the model’s hidden states lie on the user’s actual task manifold rather than a generic public distribution. Second, route and vitality supervision are derived from traced hidden-state artifacts and teacher comparisons, producing a calibrated controller over `accept_local`, `continue_local`, `revive_local`, and `escalate`. Third, a boundary head estimates the earliest acceptable layer. Fourth, a transfer adapter is trained to reconstruct final hidden states from selected source-layer states. In current results, late-boundary transfer from layer `30` is already strong, while early-boundary transfer from layer `26` remains weak. This asymmetry is not a failure of the concept; it is a useful structural discovery. It implies that not all latent boundaries are equally operational, and that the system must distinguish between mature, resumable states and premature states that should not yet be used as transfer or exit points.
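To make the fourth stage concrete, the sketch below fits a toy adapter that maps source-layer states to final-layer targets and scores it by mean cosine similarity on held-out states. Everything here is an assumption for illustration: the data is synthetic Gaussian noise (a "late-like" source is a lightly perturbed copy of the target, an "early-like" source a heavily perturbed one), and the per-dimension affine map is a simple stand-in for the learned latent adapter, whose actual form is not specified here. It does not reproduce the layer `30` or layer `26` results.

```python
import math
import random


def fit_diagonal_adapter(sources, targets):
    """Least-squares fit of target[j] ~ a[j] * source[j] + b[j] per dimension."""
    n, dim = len(sources), len(sources[0])
    scale, shift = [], []
    for j in range(dim):
        xs = [s[j] for s in sources]
        ys = [t[j] for t in targets]
        mx, my = sum(xs) / n, sum(ys) / n
        var = sum((x - mx) ** 2 for x in xs)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        a = cov / var if var > 0 else 0.0
        scale.append(a)
        shift.append(my - a * mx)
    return scale, shift


def apply_adapter(adapter, source):
    scale, shift = adapter
    return [a * x + b for a, b, x in zip(scale, shift, source)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def synthetic_states(n, dim, noise, rng):
    """Synthetic targets plus noisy copies that play the role of source states."""
    targets = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    sources = [[t + rng.gauss(0, noise) for t in row] for row in targets]
    return sources, targets


def evaluate_boundary(noise, seed=0, n=300, dim=8):
    """Fit on two thirds of the data and score reconstruction on the rest."""
    rng = random.Random(seed)
    src, tgt = synthetic_states(n, dim, noise, rng)
    split = n * 2 // 3
    adapter = fit_diagonal_adapter(src[:split], tgt[:split])
    sims = [cosine(apply_adapter(adapter, s), t) for s, t in zip(src[split:], tgt[split:])]
    return sum(sims) / len(sims)


late_like = evaluate_boundary(noise=0.1)   # source close to the final state
early_like = evaluate_boundary(noise=2.0)  # source far from the final state
```

A per-boundary score of this kind is what would let a runtime mark some layers as resumable and others as premature, for example by requiring the held-out score to clear a threshold before a boundary is used for transfer.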

This leads to the most important architectural implication of AGP: the goal is not simply to distribute a large model across two machines, nor merely to quantize or accelerate a conventional transformer. The goal is to turn the transformer into a hierarchical runtime in which hidden states determine how much computation is still necessary. In a normal architecture, performance gains come primarily from larger models, better kernels, or lower precision. In AGP, performance gains can also come from better compute allocation. Easy cases should terminate locally. Late-boundary cases should resume cheaply. Hard cases should escalate. Only a subset of cases should invoke deeper or remote correction. If this works, the result is not just faster inference. It is a system that extracts more useful behavior per parameter, per watt, and per unit of latency by replacing uniform full-depth execution with learned, state-aware routing.
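The compute-allocation argument can be stated as a simple expected-cost condition. The notation below is introduced here for illustration and is not taken from the AGP implementation: $p_r$ is the fraction of tokens sent down route $r$, $c_r$ is the cost of that route relative to one uniform full-depth pass (so full depth has cost $1$), and $c_{\text{ctrl}}$ is the per-token overhead of running the control heads.

```latex
\mathbb{E}[c] \;=\; c_{\text{ctrl}} + \sum_{r \in \mathcal{R}} p_r \, c_r,
\qquad
\mathcal{R} = \{\texttt{accept\_local},\ \texttt{continue\_local},\ \texttt{revive\_local},\ \texttt{escalate}\}

\text{savings} \;=\; 1 - \mathbb{E}[c] \;>\; 0
\quad\Longleftrightarrow\quad
\sum_{r \in \mathcal{R}} p_r \, c_r \;<\; 1 - c_{\text{ctrl}}
```

Under these assumptions, routing pays off only when cheap routes carry enough of the traffic to offset both the controller overhead and any escalations whose cost exceeds a single full-depth pass ($c_{\texttt{escalate}} > 1$ is possible when correction is remote).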

The full-fledged architecture extends beyond the current local runtime. A semantic projection layer, informed by a canonical semantic kernel, can project dense hidden states into sparse, typed semantic bundles that make route decisions more interpretable and eventually more robust. A transport layer can compress and serialize latent packets for cross-host continuation, ideally using a structured bottleneck or quantized packetization strategy rather than shipping raw activations. A corrective remote path can turn `Mac5` into the deeper reasoning or verifier host, while `Mac4` remains the reflex and draft host. In this full form, AGP becomes a semantically routed distributed transformer architecture in which the unit of scheduling is no longer the token alone, but the typed hidden state associated with that token.
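As one possible reading of "quantized packetization", the sketch below packs a hidden state into a small binary packet using symmetric int8 quantization with a single per-packet scale. The packet layout (`HEADER`), the function names, and the choice of int8 are assumptions for illustration; the text leaves the actual transport format open.

```python
import struct

# Hypothetical header: source layer (uint16), dimension (uint32), scale (float32).
HEADER = struct.Struct("<HIf")


def pack_latent(hidden, source_layer):
    """Quantize a hidden state to int8 codes and serialize it with a header."""
    max_abs = max((abs(x) for x in hidden), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round the scale to float32 first so sender and receiver use the same value.
    scale = struct.unpack("<f", struct.pack("<f", scale))[0] or 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in hidden]
    header = HEADER.pack(source_layer, len(codes), scale)
    return header + struct.pack(f"<{len(codes)}b", *codes)


def unpack_latent(packet):
    """Recover the source layer and the dequantized hidden state."""
    source_layer, dim, scale = HEADER.unpack_from(packet)
    codes = struct.unpack_from(f"<{dim}b", packet, HEADER.size)
    return source_layer, [c * scale for c in codes]


hidden = [0.5, -1.0, 0.25, 0.0]
packet = pack_latent(hidden, source_layer=30)
layer, restored = unpack_latent(packet)
max_error = max(abs(a - b) for a, b in zip(hidden, restored))
```

Each value costs one byte on the wire instead of four for float32, at the price of a quantization error bounded by roughly half the scale per element. Whether that error is tolerable for continuation on the receiving host is an empirical question this sketch does not answer.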

If successful, AGP implies a different scaling story from ordinary transformer deployment. The normal path says: larger model, more memory, more FLOPs. AGP says: better latent interfaces, better routing, better hardware specialization, and better continuation policies can increase effective intelligence without requiring every token to pay for the deepest path. The win is therefore not only larger effective model capacity, but better parameter efficiency, better energy efficiency, and a more faithful alignment between semantic difficulty and compute expenditure. In that sense, AGP is not a replacement for transformers. It is a proposal for how transformers become schedulable systems.

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

Comp-Core/docs/research/agp-paper-title-abstract-v1.md
