AGP / MLX / ANE Research Track
Date: `2026-04-16`
Problem Statement
Current Apple-local language model execution treats the model as a monolith. A single host loads the full graph, every token pays for roughly the same depth, hidden states are transient internals, and hardware engines are mostly passive containers. That leaves three opportunities underexploited on the current `Mac4 + Mac5` setup:
1. intermediate hidden states may already contain enough information for early acceptance, correction, or transfer
2. different Apple engines are better suited to different classes of work
3. Thunderbolt 5 makes inter-host state transfer practical if the transferred object is compact and semantically meaningful
The research track therefore asks whether a transformer can be wrapped in a learned partitioning architecture where hidden states become routable artifacts rather than opaque byproducts.
Core Goal
The goal is to prove that a `transformer-based model running natively in MLX` can be extended into a `hierarchical Apple-local compute fabric` where:
- shallow inference, routing, and semantic inspection happen cheaply and frequently
- deeper correction and continuation happen conditionally
- hidden states are compressed, typed, and transferred across hosts only when justified
- ANE, GPU, CPU, and Thunderbolt each own the work they are structurally best at
This is not primarily a "bigger model on two Macs" project. It is a `semantic sufficiency and conditional compute` project.
Research Hypotheses
H1. Representational Sufficiency
For a meaningful subset of prompts and tokens, an intermediate hidden state contains enough information to support:
- local early acceptance
- resumed continuation on another host
- corrective verification without replaying the entire stack
H2. Trajectory-Aware Routing
The anticipation geometry packet is useful as a routing state. Conditioning depth selection and transfer decisions on trajectory signals should outperform fixed layer splits and naive entropy-only heuristics.
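To make the comparison in H2 concrete, the following is a minimal sketch of the two routing styles side by side. The field names come from the Trajectory State section of the packet; the `[0, 1]` value ranges, the thresholds, and the combination rule in `trajectory_route` are assumptions chosen for illustration, not the trained route head.

```python
import math
from dataclasses import dataclass


@dataclass
class TrajectoryState:
    # Field names follow the Trajectory State packet section; the [0, 1]
    # ranges are an assumption made for this sketch.
    commitment: float
    uncertainty: float
    recovery_margin: float
    route_confidence: float


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_only_route(probs, max_entropy=1.0):
    """Baseline heuristic: escalate whenever entropy exceeds a fixed threshold."""
    return "escalate" if token_entropy(probs) > max_entropy else "local"


def trajectory_route(state, min_confidence=0.6, min_margin=0.3):
    """Hypothetical rule: stay local only when the route head is confident,
    there is margin to recover from a wrong draft, and commitment
    outweighs uncertainty."""
    if state.route_confidence < min_confidence:
        return "escalate"
    if state.recovery_margin < min_margin:
        return "escalate"
    if state.uncertainty > state.commitment:
        return "escalate"
    return "local"
```

The point of the contrast is that the entropy rule sees only the current token distribution, while the trajectory rule can escalate a confident-looking token when the recovery margin is thin.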
H3. Semantic Projection Utility
A sparse semantic projection layer aligned to kernel primitives, invariants, and concept bundles improves transfer acceptance, dead-state detection, and corrective routing.
H4. Heterogeneous Apple Execution
Projection-heavy shallow modules, semantic heads, and routing heads can be placed on ANE-adjacent execution paths, while MLX on GPU remains the main dense transformer runtime. This should improve energy efficiency and average latency versus GPU-only execution.
H5. Quantized State Mobility
TurboQuant-style rotation plus low-bit transport can compress hidden-state packets enough that two-host partitioning becomes worthwhile over Thunderbolt 5.
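As an illustration of the rotate-then-quantize idea in H5, here is a rough sketch that uses a randomized Hadamard rotation followed by uniform 4-bit quantization. This is a stand-in technique, not the TurboQuant algorithm itself; the function names, the single shared scale, and the seeded sign pattern are all assumptions. The rotation spreads outlier channels across all coordinates so that one shared scale wastes fewer quantization levels.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def sign_pattern(n, seed):
    """Seeded +/-1 pattern that sender and receiver can both regenerate."""
    rng = random.Random(seed)
    return [1.0 if rng.random() < 0.5 else -1.0 for _ in range(n)]


def encode_state(hidden, seed=0, bits=4):
    """Rotate, then quantize to signed integer codes with one shared scale."""
    signs = sign_pattern(len(hidden), seed)
    rotated = fwht([s * v for s, v in zip(signs, hidden)])
    levels = (1 << (bits - 1)) - 1  # 7 for 4-bit codes
    scale = max(abs(v) for v in rotated) / levels or 1.0
    codes = [round(v / scale) for v in rotated]
    return codes, scale


def decode_state(codes, scale, seed=0):
    """Dequantize, then undo the rotation (the orthonormal WHT is its own inverse)."""
    signs = sign_pattern(len(codes), seed)
    restored = fwht([c * scale for c in codes])
    return [s * v for s, v in zip(signs, restored)]


def relative_error(original, restored):
    """L2 reconstruction error relative to the norm of the original vector."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(original, restored)))
    den = math.sqrt(sum(a * a for a in original))
    return num / den if den else 0.0
```

Whether this level of error is acceptable for continuation is exactly what the transfer experiments have to measure; the sketch only shows the mechanics.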
What We Are Trying To Prove
The track has four proof targets:
1. hidden states can be treated as `operational interfaces`
2. `conditional compute beats uniform full-depth compute` on the real local workload
3. `Apple engine specialization` can reduce average energy and median latency
4. `semantic structure can become part of the scheduler`, not just a post hoc analysis layer
Expected Advantages If Successful
If the architecture works, the expected gains are:
- better quality per unit compute
- lower median latency on easy or familiar prompts
- lower energy per accepted token
- higher effective context through inter-host block/state movement
- improved parameter efficiency, where smaller models become more useful because the system adds routing, correction, and semantic structure
- larger effective model capacity when two hosts cooperate on hard cases
Cold-start load speed is not the main target. The deeper target is spending compute only where it improves the accepted output.
Baselines To Beat
The project only counts as successful if it beats at least one of these on real workloads:
1. `Single-host MLX baseline`
Same base model, same quantization, one Mac only.
2. `Naive two-host fixed split`
Static layer boundary, no trajectory routing, no semantic layer, no learned transfer adapter.
3. `Uniform full-depth inference`
No early exit, no correction routing, no typed transfer.
4. `Retrieval-only augmentation baseline`
Same model plus retrieval, but without AGP routing or semantic projection.
Canonical Initial Model Choices
Initial model family:
- `Gemma 4 E2B` for phase-one feasibility
- `Gemma 4 E4B` for phase-two scale-up if the first architecture proves out
Reason:
- small enough to iterate quickly on two 16 GB Macs
- official support exists across Google release materials and MLX ecosystem paths
- large enough to expose meaningful intermediate-state behavior without turning every experiment into a memory problem
System Architecture
Base Model Class
The base model remains a decoder transformer.
Architectural Modifications
The research architecture adds four learned interfaces around the transformer:
1. `Trajectory State Head`
Predicts anticipation geometry scalars from intermediate hidden states.
2. `Depth Routing Residual Module`
Replaces fixed residual accumulation at selected blocks with learned depth-aware routing over preceding layer states.
3. `Semantic Projection Head`
Projects hidden states into sparse or typed kernel-aligned primitives, invariants, and concept bundles.
4. `Transfer Adapter`
Compresses hidden-state and optional KV summaries for cross-host continuation.
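One possible reading of the Depth Routing Residual Module is a softmax-weighted mix over preceding layer states, in place of the plain sum that a standard residual stream accumulates. The sketch below shows that reading in framework-free Python; the `query` routing vector is a hypothetical learned parameter, and the real module may differ in form.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def depth_routed_residual(layer_states, query):
    """Mix preceding layer states with learned weights instead of a plain sum.

    layer_states: hidden vectors from earlier blocks, all the same length.
    query: hypothetical learned routing vector that scores each depth.
    Returns the mixed state and the per-depth weights.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in layer_states]
    weights = softmax(scores)
    dim = len(layer_states[0])
    mixed = [
        sum(w * state[i] for w, state in zip(weights, layer_states))
        for i in range(dim)
    ]
    return mixed, weights
```

Returning the weights alongside the mixed state keeps the routing decision inspectable, which matters if average active depth is to be reported as a benchmark metric.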
Runtime Topology
`Mac4` owns the reflex path:
- tokenization
- embeddings
- shallow layers
- route estimation
- semantic projection
- local draft
`Mac5` owns the corrective path:
- resumed continuation from transfer packet
- deeper correction
- verifier decoding
- semantic consistency checks
Engine Ownership
`MLX` is the primary dense model runtime.
`GPU` owns the main transformer path and adapter training.
`ANE` is used through exported or sidecar modules for:
- routing heads
- semantic heads
- projection-heavy shallow modules
- frozen-forward or low-branch front-end paths when viable
`CPU` owns orchestration, logging, packet assembly, and transport control.
`Thunderbolt 5` is the inter-host transport fabric.
AGP-PTP: Partition Transfer Protocol
Purpose
Move semantically useful state, not raw monolithic model internals, between Apple hosts.
Packet Sections
1. `Identity`
model fingerprint, quantization, tokenizer fingerprint, session id, token span id, source host, target host, layer boundary, checksum
2. `Trajectory State`
commitment, uncertainty, transition pressure, recovery margin, phase stiffness, novelty, stability, route confidence
3. `Vitality State`
hidden norm, entropy surrogate, sparsity surrogate, semantic confidence, early-exit confidence, dead-state flags
4. `Semantic State`
sparse primitive activations, invariant activations, nearest concept bundles, semantic mismatch risk
5. `Payload State`
compressed residual packet, optional KV summary packet, compression metadata, decoder reconstruction hint
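As an example of how one packet section could be populated, the sketch below computes a few Vitality State fields from a single hidden vector. The specific surrogate definitions (normalized energy entropy, near-zero fraction) and the dead-state threshold are assumptions for illustration; the source does not define them.

```python
import math
from dataclasses import dataclass


@dataclass
class VitalityState:
    hidden_norm: float
    entropy_surrogate: float   # assumed: normalized entropy of squared activations
    sparsity_surrogate: float  # assumed: fraction of near-zero activations
    dead_state: bool           # assumed: norm below a fixed floor


def vitality(hidden, eps=1e-6, dead_norm=1e-3):
    """Summarize one hidden vector into vitality fields."""
    energy = [v * v for v in hidden]
    total = sum(energy)
    norm = math.sqrt(total)
    if total <= 0.0 or len(hidden) < 2:
        entropy = 0.0
    else:
        probs = [e / total for e in energy]
        raw = -sum(p * math.log(p) for p in probs if p > 0.0)
        entropy = raw / math.log(len(hidden))  # scaled to [0, 1]
    sparsity = sum(1 for v in hidden if abs(v) < eps) / len(hidden)
    return VitalityState(norm, entropy, sparsity, norm < dead_norm)
```

A vector with all its energy in one channel scores entropy 0 and a uniform vector scores 1, so the surrogate distinguishes collapsed states from diffuse ones without touching the logits.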
Transport Requirements
- binary framing, not JSON, for the payload path
- replay-safe ids
- packet checksums
- exact model/version compatibility checks
- optional local loopback mode for same-host debugging
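The transport requirements above can be sketched as a small framing layer. Everything about the wire layout here is hypothetical: the `AGPP` magic, the field order, the 16-byte model fingerprint, and the CRC32 trailer are placeholders rather than the AGP-PTP specification, and the full Identity section carries more fields than this header does.

```python
import struct
import zlib

MAGIC = b"AGPP"   # hypothetical frame marker
VERSION = 1       # hypothetical protocol version
# magic, version, layer boundary, session id, token span id,
# model fingerprint, payload length (little-endian, no padding)
HEADER = struct.Struct("<4sHHQQ16sI")


def pack_frame(session_id, span_id, layer_boundary, model_fp, payload):
    """Build header + payload and append a CRC32 over both."""
    header = HEADER.pack(
        MAGIC, VERSION, layer_boundary, session_id, span_id, model_fp, len(payload)
    )
    body = header + payload
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_frame(frame, expected_model_fp, seen_spans):
    """Validate checksum, compatibility, and replay safety, then return fields."""
    if len(frame) < HEADER.size + 4:
        raise ValueError("frame too short")
    body = frame[:-4]
    (crc,) = struct.unpack("<I", frame[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("checksum mismatch")
    magic, version, layer, session_id, span_id, model_fp, length = HEADER.unpack_from(body)
    if magic != MAGIC or version != VERSION:
        raise ValueError("unknown frame type or version")
    if model_fp != expected_model_fp:
        raise ValueError("model fingerprint mismatch")
    payload = body[HEADER.size:]
    if len(payload) != length:
        raise ValueError("payload length mismatch")
    key = (session_id, span_id)
    if key in seen_spans:
        raise ValueError("replayed span")
    seen_spans.add(key)
    return {"session_id": session_id, "span_id": span_id,
            "layer_boundary": layer, "payload": payload}
```

Because `pack_frame` and `unpack_frame` only deal in bytes, the same pair serves the local loopback mode: encode and decode in one process before any socket is involved.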
Work Packages
WP0. Feasibility + Instrumentation
Goal:
Establish the real local execution baseline and collect hidden-state evidence before any distributed claims.
Tasks:
1. verify `Gemma 4 E2B` local inference path in MLX on a single host
2. add hidden-state capture at selected layers
3. record per-layer sufficiency metrics on real prompt sets
4. define baseline latency, tokens/sec, memory footprint, and energy measurement path
5. identify candidate split layers for later transfer experiments
Deliverable:
`single-host baseline report + candidate split-layer map`
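One candidate for the per-layer sufficiency metric in WP0 is a logit-lens style agreement rate: project each captured layer through the output embedding and check how often its top-1 token matches the final layer's. The sketch below assumes captures are already plain Python lists and skips the final normalization a real model would apply before unembedding; the function names and the `0.8` threshold are illustrative.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def project(hidden, unembed):
    """Logits from a hidden vector; unembed has one row per vocabulary entry."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembed]


def layer_sufficiency(captures, unembed):
    """captures: one entry per token, each a list of hidden vectors by layer.

    Returns, per layer, the fraction of tokens whose top-1 prediction from
    that layer already equals the final layer's top-1 prediction.
    """
    n_layers = len(captures[0])
    hits = [0] * n_layers
    for per_layer in captures:
        final_top = argmax(project(per_layer[-1], unembed))
        for i, hidden in enumerate(per_layer):
            hits[i] += argmax(project(hidden, unembed)) == final_top
    return [h / len(captures) for h in hits]


def candidate_split_layers(sufficiency, threshold=0.8):
    """Layers (excluding the final one) that meet the agreement threshold."""
    return [i for i, s in enumerate(sufficiency[:-1]) if s >= threshold]
```

Top-1 agreement is a coarse proxy; it says nothing about whether a resumed continuation stays faithful, which is why WP4 measures that separately.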
WP1. Trajectory Routing Head
Goal:
Attach anticipation geometry to the model as a first-class route signal.
Tasks:
1. wire a trajectory state head to selected hidden layers
2. compute supervision from local conversation/tool trajectories
3. train and calibrate route confidence
4. compare trajectory-aware routing against entropy-only routing
Deliverable:
`route-head evaluation report`
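For the calibration task in WP1, expected calibration error (ECE) is one standard way to check whether route confidence means what it says. The sketch below assumes each routing decision yields a confidence in `[0, 1]` and a binary outcome (1 if the local decision turned out to be acceptable); that outcome definition is an assumption, not something the track has fixed.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Weighted average gap between mean confidence and observed accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, outcome in zip(confidences, outcomes):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, outcome))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(o for _, o in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

Running the same function on the entropy-only baseline, with entropy mapped to a pseudo-confidence, would give the two routers a shared calibration number for the evaluation report.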
WP2. Depth Routing Residual Module
Goal:
Test whether learned depth selection improves local quality or calibration before distribution.
Tasks:
1. implement MLX-native depth routing residual blocks
2. insert at a small number of candidate blocks
3. train with next-token + calibration losses
4. compare to plain residual path and naive depth mixing
Deliverable:
`residual-routing ablation`
WP3. Semantic Projection Layer
Goal:
Project hidden states into a sparse or typed semantic layer aligned to the semantic kernel.
Tasks:
1. define initial primitive/invariant label schema
2. train sparse projection head or SAE-style bottleneck
3. evaluate semantic stability across neighboring prompts
4. test whether semantic confidence helps dead-state detection
Deliverable:
`semantic-latent evaluation report`
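A simple way to picture tasks 2 and 3 of WP3 is a top-k sparse projection plus a set-overlap stability score. In the sketch below the projection weights are plain lists with one row per primitive, the ReLU and top-k choices are assumptions, and Jaccard overlap of active primitives stands in for whatever stability metric the evaluation finally adopts.

```python
def sparse_project(hidden, weights, k=2):
    """Return the k strongest positive primitive activations as {index: value}.

    weights: one row per semantic primitive (hypothetical learned matrix).
    """
    acts = [max(0.0, sum(h * w for h, w in zip(hidden, row))) for row in weights]
    top = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k]
    return {i: acts[i] for i in top if acts[i] > 0.0}


def jaccard(active_a, active_b):
    """Overlap of two sets of active primitive indices, in [0, 1]."""
    sa, sb = set(active_a), set(active_b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def semantic_stability(hidden_a, hidden_b, weights, k=2):
    """Stability of the active primitive set across two neighboring prompts."""
    return jaccard(sparse_project(hidden_a, weights, k),
                   sparse_project(hidden_b, weights, k))
```

A projection that fires with no active primitives at all (an empty dict) is also a natural input to dead-state detection, which connects this head to task 4.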
WP4. Transfer Adapter + AGP-PTP
Goal:
Make hidden-state transfer real and measurable.
Tasks:
1. implement compressed residual packet encoder/decoder
2. define binary AGP-PTP framing
3. test same-host encode/decode fidelity first
4. test cross-host packet transfer over Thunderbolt 5
5. compare resumed continuation against local no-transfer continuation
Deliverable:
`transfer fidelity and transport report`
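Tasks 3 and 5 of WP4 need two kinds of fidelity numbers: how close the decoded state is to the original, and how far a resumed continuation drifts from the local one. The following is a minimal sketch of both, assuming hidden states are flat float lists and continuations are lists of token ids; the metric choices are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between an original and a reconstructed hidden state."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def continuation_agreement(local_tokens, resumed_tokens):
    """Token-level agreement between local and resumed continuations.

    Compares only the overlapping prefix and reports the first index
    at which the two sequences diverge (None if they never do).
    """
    n = min(len(local_tokens), len(resumed_tokens))
    if n == 0:
        return {"agreement": 0.0, "first_divergence": None}
    matches = [x == y for x, y in zip(local_tokens, resumed_tokens)]
    first = matches.index(False) if False in matches else None
    return {"agreement": sum(matches) / n, "first_divergence": first}
```

The first-divergence index is worth reporting next to the agreement rate, since a late single-token divergence and an early one have very different consequences for a continuation.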
WP5. Two-Host Corrective Execution
Goal:
Turn the packet path into actual conditional distributed inference.
Tasks:
1. Mac4 local draft path
2. Mac5 corrective continuation path
3. acceptance/rejection logic
4. transfer-trigger calibration
5. quality and latency comparison against all baselines
Deliverable:
`two-host distributed inference benchmark`
WP6. ANE Sidecar Path
Goal:
Place the correct modules on ANE-adjacent execution surfaces.
Tasks:
1. choose exportable shallow modules
2. implement Core ML sidecar path for public-runtime experiments
3. benchmark ANE vs GPU for those modules
4. optionally validate private-runtime ANE path separately from the mainline build
Deliverable:
`ANE sidecar benchmark and decision memo`
Immediate First Package
The first executable package is `WP0`.
WP0 success criteria:
- a reproducible single-host Gemma 4 E2B MLX baseline exists
- layer captures are saved for real prompt sets
- candidate split layers are identified from evidence, not intuition
- local metrics exist for:
  - median latency
  - p95 latency
  - tokens/sec
  - peak memory
  - per-layer sufficiency proxy
WP0 is intentionally boring. If this part is weak, everything downstream becomes storytelling.
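The latency and throughput criteria can be computed with a few lines of standard-library Python. In this sketch `generate` is a hypothetical callable that takes a prompt and returns the generated tokens; the p95 uses the nearest-rank method, and peak memory is left out because it depends on the runtime's own reporting.

```python
import statistics
import time


def time_generation(generate, prompt):
    """Run one generation and return (latency_seconds, tokens_generated).

    `generate` is a placeholder for whatever function wraps the local model.
    """
    start = time.perf_counter()
    tokens = generate(prompt)
    return time.perf_counter() - start, len(tokens)


def summarize_latency(samples):
    """samples: list of (latency_seconds, tokens_generated) tuples."""
    latencies = sorted(s[0] for s in samples)
    rank = (95 * len(latencies) + 99) // 100  # nearest-rank p95, integer ceiling
    total_time = sum(latencies)
    total_tokens = sum(s[1] for s in samples)
    return {
        "median_latency_s": statistics.median(latencies),
        "p95_latency_s": latencies[rank - 1],
        "tokens_per_sec": total_tokens / total_time if total_time else 0.0,
    }
```

Keeping the raw `(latency, tokens)` samples on disk alongside the summary makes the baseline report reproducible when the harness changes later.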
Training Decision
The canonical training plan now lives in `agp-mlx-ane-training-spec.md`.
The key decision is that AGP is trained as an `adapter-first curriculum`, not as an end-to-end new model. The order is:
1. `Gemma 4 E2B` domain SFT adapter on the real local corpus
2. trajectory and vitality heads on top of the tuned backbone
3. depth-routing residual modules with teacher distillation
4. semantic projection layer aligned to kernel primitives and invariants
5. transfer adapter with same-host continuation fidelity before any real cross-host claim
6. short joint alignment finetune across the trained heads and adapters
This keeps the architecture measurable at every stage and avoids training a scheduler on the wrong hidden-state manifold.
Benchmark Matrix
The benchmark suite for the whole track should include:
- `quality`
  - task-completion quality on personal prompts
  - coding-task correctness
  - memory-grounded response quality
  - semantic consistency
- `systems`
  - median latency
  - p95 latency
  - tokens/sec
  - peak memory
  - host-to-host bytes per escalated token
  - escalation rate
  - correction acceptance rate
- `efficiency`
  - energy per token
  - energy per accepted token
  - average active depth
  - average local-only completion rate
- `representation`
  - hidden-state transfer fidelity
  - semantic projection stability
  - dead-state detection accuracy
  - route calibration
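Several of the systems and efficiency metrics fall out of one per-token log. The sketch below assumes a hypothetical `TokenRecord` with the fields shown and picks one plausible definition for each rate; in particular, "accepted" and "local-only completion" are interpreted here as assumptions that the benchmark harness would need to pin down.

```python
from dataclasses import dataclass


@dataclass
class TokenRecord:
    escalated: bool     # token was sent to the corrective host
    accepted: bool      # token survived into the final output
    active_depth: int   # number of layers actually executed
    energy_j: float     # measured or estimated energy in joules
    bytes_sent: int     # host-to-host bytes attributed to this token


def summarize_run(records):
    """Aggregate per-token records into benchmark-matrix metrics."""
    n = len(records)
    escalated = [r for r in records if r.escalated]
    accepted = [r for r in records if r.accepted]
    total_energy = sum(r.energy_j for r in records)
    return {
        "escalation_rate": len(escalated) / n,
        "correction_acceptance_rate":
            sum(r.accepted for r in escalated) / len(escalated) if escalated else 0.0,
        "energy_per_token_j": total_energy / n,
        "energy_per_accepted_token_j":
            total_energy / len(accepted) if accepted else float("inf"),
        "average_active_depth": sum(r.active_depth for r in records) / n,
        "bytes_per_escalated_token":
            sum(r.bytes_sent for r in escalated) / len(escalated) if escalated else 0.0,
        "local_only_completion_rate":
            sum(1 for r in records if r.accepted and not r.escalated) / n,
    }
```

Energy per accepted token charges rejected work to the tokens that survive, which is why it can diverge sharply from energy per token when the escalation path is poorly calibrated.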
Decision Rules
If `single-host MLX` beats the distributed path on both quality and efficiency, stop and simplify.
If `naive fixed split` matches the learned split, remove complexity and re-evaluate.
If the semantic projection does not improve routing, keep it as analysis only and do not let it bloat the critical path.
If ANE export paths do not beat GPU for their assigned modules, keep ANE as a future optimization rather than forcing it into the mainline build.
Current Start State
Known assets already available:
- anticipation geometry research and transformer bias work
- ANE spike result in the N'Ko track
- TurboQuant sidecar evaluation on real embeddings
- semantic-kernel / semantic-layer theory
- Thunder-Train / Thunderbolt 5 context for Mac4 + Mac5
Known missing pieces:
- a canonical AGP codebase
- MLX-native prototype with route heads
- transfer protocol implementation
- cross-host benchmark harness
- semantic projection training loop tied to Gemma 4 hidden states
Immediate Next Actions
1. create the WP0 experiment folder and baseline harness
2. validate Gemma 4 E2B local MLX path
3. instrument hidden-state capture at selected layers
4. assemble the first real prompt pack from the real local corpus
5. produce the first baseline report before any distributed execution claims
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
Comp-Core/docs/research/agp-mlx-ane-research-track.md