Grand Diomande Research · Full HTML Reader

AGP / MLX / ANE Research Track

Date: `2026-04-16`

Problem Statement

Current Apple-local language model execution treats the model as a monolith. A single host loads the full graph, every token pays for roughly the same depth, hidden states are transient internals, and hardware engines are mostly passive containers. That leaves three opportunities underexploited on the current `Mac4 + Mac5` setup:

1. intermediate hidden states may already contain enough information for early acceptance, correction, or transfer
2. different Apple engines are better suited to different classes of work
3. Thunderbolt 5 makes inter-host state transfer practical if the transferred object is compact and semantically meaningful

The research track therefore asks whether a transformer can be wrapped in a learned partitioning architecture where hidden states become routable artifacts rather than opaque byproducts.

Core Goal

The goal is to prove that a `transformer-based model running natively in MLX` can be extended into a `hierarchical Apple-local compute fabric` where:

  • shallow inference, routing, and semantic inspection happen cheaply and frequently
  • deeper correction and continuation happen conditionally
  • hidden states are compressed, typed, and transferred across hosts only when justified
  • ANE, GPU, CPU, and Thunderbolt each own the work they are structurally best at

This is not primarily a "bigger model on two Macs" project. It is a `semantic sufficiency and conditional compute` project.

Research Hypotheses

H1. Representational Sufficiency

For a meaningful subset of prompts and tokens, an intermediate hidden state contains enough information to support:

  • local early acceptance
  • resumed continuation on another host
  • corrective verification without replaying the entire stack

H2. Trajectory-Aware Routing

The anticipation geometry packet is useful as a routing state. Conditioning depth selection and transfer decisions on trajectory signals should outperform fixed layer splits and naive entropy-only heuristics.
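
A minimal sketch of the comparison H2 implies, assuming the trajectory signals are scalars in [0, 1] and that routing is a simple threshold rule. The class name `TrajectorySignals`, both function names, and every threshold are hypothetical; the real router is expected to be learned, not hand-written.

```python
import math
from dataclasses import dataclass

@dataclass
class TrajectorySignals:
    # Field names follow the Trajectory State packet section.
    # The [0, 1] value range is an assumption of this sketch.
    commitment: float
    uncertainty: float
    recovery_margin: float

def token_entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_only_route(probs, max_entropy=1.0):
    """Baseline heuristic: escalate only when the distribution is flat."""
    return "local" if token_entropy(probs) <= max_entropy else "escalate"

def trajectory_route(probs, signals, max_entropy=1.0,
                     min_margin=0.3, min_commitment=0.5):
    """Hypothetical rule: entropy check plus trajectory-based vetoes."""
    if token_entropy(probs) > max_entropy:
        return "escalate"
    if signals.recovery_margin < min_margin:
        # Confident token, but little room to recover if it is wrong.
        return "escalate"
    if signals.commitment < min_commitment and signals.uncertainty > 0.5:
        return "escalate"
    return "local"

confident = [0.9, 0.05, 0.05]
risky = TrajectorySignals(commitment=0.9, uncertainty=0.1, recovery_margin=0.1)
print(entropy_only_route(confident), trajectory_route(confident, risky))
```

The case of interest is the one printed above: a low-entropy token that the entropy-only baseline keeps local while the trajectory-conditioned rule escalates. H2 only holds if such disagreements improve quality or latency on real workloads.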

H3. Semantic Projection Utility

A sparse semantic projection layer aligned to kernel primitives, invariants, and concept bundles improves transfer acceptance, dead-state detection, and corrective routing.

H4. Heterogeneous Apple Execution

Projection-heavy shallow modules, semantic heads, and routing heads can be placed on ANE-adjacent execution paths, while MLX on GPU remains the main dense transformer runtime. This should improve energy efficiency and average latency versus GPU-only execution.

H5. Quantized State Mobility

TurboQuant-style rotation plus low-bit transport can compress hidden-state packets enough that two-host partitioning becomes worthwhile over Thunderbolt 5.
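
The sketch below illustrates the general idea of rotation followed by low-bit transport. It is not the TurboQuant algorithm itself: it swaps in a randomized Walsh-Hadamard rotation and a uniform symmetric quantizer as stand-ins, and all function names are hypothetical. The hidden width of 2048 used in the size estimate is an assumed value, not a confirmed model dimension.

```python
import math
import random

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(x)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def make_signs(n, seed):
    """Random sign flips; both hosts must derive them from the same seed."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def rotate(x, signs):
    # Sign flips then Hadamard: spreads outlier energy across coordinates.
    return fwht([s * v for s, v in zip(signs, x)])

def unrotate(y, signs):
    # The orthonormal Hadamard transform is its own inverse.
    return [s * v for s, v in zip(signs, fwht(y))]

def quantize(y, bits):
    """Uniform symmetric quantizer with one shared step size."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in y)
    step = peak / qmax if peak > 0.0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / step))) for v in y]
    return codes, step

def dequantize(codes, step):
    return [c * step for c in codes]

def payload_bytes(dim, bits):
    """Packed codes plus one float32 step size."""
    return (dim * bits + 7) // 8 + 4

def relative_error(x, x_hat):
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
    return num / math.sqrt(sum(a * a for a in x))

signs = make_signs(16, seed=0)
hidden = [0.1 * (i - 8) for i in range(16)]
hidden[3] = 8.0  # a single outlier channel
codes, step = quantize(rotate(hidden, signs), bits=4)
restored = unrotate(dequantize(codes, step), signs)
print(relative_error(hidden, restored), payload_bytes(2048, 4))
```

Under the assumed width of 2048, a 4-bit packet is 1028 bytes per position against 4096 bytes for fp16. Whether that ratio makes two-host partitioning worthwhile depends on measured continuation fidelity, which this sketch does not address.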

What We Are Trying To Prove

The first proof target is that hidden states can be treated as `operational interfaces`. The second is that `conditional compute beats uniform full-depth compute` on the real local workload. The third is that `Apple engine specialization` can reduce average energy and median latency. The fourth is that `semantic structure can become part of the scheduler`, not just a post hoc analysis layer.

Expected Advantages If Successful

If the architecture works, the expected gains are:

  • better quality per unit compute
  • lower median latency on easy or familiar prompts
  • lower energy per accepted token
  • higher effective context through inter-host block/state movement
  • improved parameter efficiency, where smaller models become more useful because the system adds routing, correction, and semantic structure
  • larger effective model capacity when two hosts cooperate on hard cases

Cold-start load speed is not the main target. The deeper target is to spend compute only where it improves the output.

Baselines To Beat

The project only counts as successful if it beats at least one of these on real workloads:

1. `Single-host MLX baseline`
Same base model, same quantization, one Mac only.

2. `Naive two-host fixed split`
Static layer boundary, no trajectory routing, no semantic layer, no learned transfer adapter.

3. `Uniform full-depth inference`
No early exit, no correction routing, no typed transfer.

4. `Retrieval-only augmentation baseline`
Same model plus retrieval, but without AGP routing or semantic projection.

Canonical Initial Model Choices

Initial model family:

  • `Gemma 4 E2B` for phase-one feasibility
  • `Gemma 4 E4B` for phase-two scale-up if the first architecture proves out

Reason:

  • small enough to iterate quickly on two 16 GB Macs
  • official support exists across Google release materials and MLX ecosystem paths
  • large enough to expose meaningful intermediate-state behavior without turning every experiment into a memory problem

System Architecture

Base Model Class

The base model remains a decoder transformer.

Architectural Modifications

The research architecture adds four learned interfaces around the transformer:

1. `Trajectory State Head`
Predicts anticipation geometry scalars from intermediate hidden states.

2. `Depth Routing Residual Module`
Replaces fixed residual accumulation at selected blocks with learned depth-aware routing over preceding layer states.

3. `Semantic Projection Head`
Projects hidden states into sparse or typed kernel-aligned primitives, invariants, and concept bundles.

4. `Transfer Adapter`
Compresses hidden-state and optional KV summaries for cross-host continuation.
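
One possible reading of the Depth Routing Residual Module, sketched in plain Python: instead of adding all preceding layer states with fixed weight, mix them with softmax weights. The softmax form and the names `depth_routed_residual` and `route_logits` are assumptions; in a trained module the logits would presumably come from a learned projection of the current hidden state.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def depth_routed_residual(layer_states, route_logits):
    """Weighted mix of preceding layer states.

    layer_states: list of hidden vectors, one per preceding layer.
    route_logits: one score per layer (hypothetical learned output).
    """
    if len(layer_states) != len(route_logits):
        raise ValueError("one logit per layer state is required")
    weights = softmax(route_logits)
    dim = len(layer_states[0])
    return [
        sum(w * state[i] for w, state in zip(weights, layer_states))
        for i in range(dim)
    ]

states = [[1.0, 2.0], [3.0, 4.0]]
print(depth_routed_residual(states, [0.0, 0.0]))   # equal weights: the mean
print(depth_routed_residual(states, [0.0, 50.0]))  # nearly all weight on the last layer
```

With equal logits the module reduces to averaging, and with one dominant logit it selects a single depth, which is the behavior WP2 would compare against the plain residual path.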

Runtime Topology

`Mac4` owns the reflex path:

  • tokenization
  • embeddings
  • shallow layers
  • route estimation
  • semantic projection
  • local draft

`Mac5` owns the corrective path:

  • resumed continuation from transfer packet
  • deeper correction
  • verifier decoding
  • semantic consistency checks

Engine Ownership

`MLX` is the primary dense model runtime.

`GPU` owns the main transformer path and adapter training.

`ANE` is used through exported or sidecar modules for:

  • routing heads
  • semantic heads
  • projection-heavy shallow modules
  • frozen-forward or low-branch front-end paths when viable

`CPU` owns orchestration, logging, packet assembly, and transport control.

`Thunderbolt 5` is the inter-host transport fabric.

AGP-PTP: Partition Transfer Protocol

Purpose

Move semantically useful state, not raw monolithic model internals, between Apple hosts.

Packet Sections

1. `Identity`
model fingerprint, quantization, tokenizer fingerprint, session id, token span id, source host, target host, layer boundary, checksum

2. `Trajectory State`
commitment, uncertainty, transition pressure, recovery margin, phase stiffness, novelty, stability, route confidence

3. `Vitality State`
hidden norm, entropy surrogate, sparsity surrogate, semantic confidence, early-exit confidence, dead-state flags

4. `Semantic State`
sparse primitive activations, invariant activations, nearest concept bundles, semantic mismatch risk

5. `Payload State`
compressed residual packet, optional KV summary packet, compression metadata, decoder reconstruction hint
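
The five sections above map naturally onto a typed record. The sketch below uses the field names listed in the text; the Python types, the container choices, and the class names are assumptions made for illustration and are not a defined schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Identity:
    model_fingerprint: str
    quantization: str
    tokenizer_fingerprint: str
    session_id: int
    token_span_id: int
    source_host: str
    target_host: str
    layer_boundary: int
    checksum: int

@dataclass
class TrajectoryState:
    commitment: float
    uncertainty: float
    transition_pressure: float
    recovery_margin: float
    phase_stiffness: float
    novelty: float
    stability: float
    route_confidence: float

@dataclass
class VitalityState:
    hidden_norm: float
    entropy_surrogate: float
    sparsity_surrogate: float
    semantic_confidence: float
    early_exit_confidence: float
    dead_state_flags: int  # bitfield; the encoding is hypothetical

@dataclass
class SemanticState:
    primitive_activations: Dict[int, float]  # sparse: index -> activation
    invariant_activations: Dict[int, float]
    nearest_concept_bundles: List[str]
    semantic_mismatch_risk: float

@dataclass
class PayloadState:
    compressed_residual: bytes
    kv_summary: Optional[bytes]  # optional per the section list
    compression_metadata: Dict[str, str]
    reconstruction_hint: str

@dataclass
class AGPPacket:
    identity: Identity
    trajectory: TrajectoryState
    vitality: VitalityState
    semantic: SemanticState
    payload: PayloadState
```

Keeping the sections as separate records lets a receiver inspect identity and trajectory fields before deciding whether to decode the payload at all.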

Transport Requirements

  • binary framing, not JSON for the payload path
  • replay-safe ids
  • packet checksums
  • exact model/version compatibility checks
  • optional local loopback mode for same-host debugging
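
A minimal framing sketch that covers the requirements above using only the Python standard library. The magic bytes `AGPP`, the header layout, the use of CRC-32 as the checksum, and the function names are all assumptions; the actual AGP-PTP wire format is not yet defined.

```python
import struct
import zlib

MAGIC = b"AGPP"  # hypothetical frame marker
# Network byte order: magic, version, session id, token span id, payload length.
HEADER = struct.Struct("!4sHQQI")
TRAILER = struct.Struct("!I")  # CRC-32 over header + payload

def encode_frame(version, session_id, span_id, payload):
    header = HEADER.pack(MAGIC, version, session_id, span_id, len(payload))
    crc = zlib.crc32(header + payload)
    return header + payload + TRAILER.pack(crc)

def decode_frame(frame, expected_version, seen_ids):
    """Validate and unpack one frame; seen_ids is a set used to reject replays."""
    if len(frame) < HEADER.size + TRAILER.size:
        raise ValueError("frame too short")
    magic, version, session_id, span_id, length = HEADER.unpack_from(frame)
    if magic != MAGIC:
        raise ValueError("bad magic")
    body_end = HEADER.size + length
    if len(frame) != body_end + TRAILER.size:
        raise ValueError("length mismatch")
    (crc,) = TRAILER.unpack_from(frame, body_end)
    if zlib.crc32(frame[:body_end]) != crc:
        raise ValueError("checksum mismatch")
    if version != expected_version:
        raise ValueError("incompatible version")  # exact match, no negotiation
    key = (session_id, span_id)
    if key in seen_ids:
        raise ValueError("replayed frame")
    seen_ids.add(key)
    return session_id, span_id, frame[HEADER.size:body_end]

# Loopback use: encode and decode on the same host before any cross-host test.
seen = set()
frame = encode_frame(1, session_id=42, span_id=7, payload=b"\x00\x01\x02")
print(decode_frame(frame, expected_version=1, seen_ids=seen))
```

Because the frame is plain bytes, the same encode and decode pair works over a socket or in the same-host loopback mode listed above.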

Work Packages

WP0. Feasibility + Instrumentation

Goal:
Establish the real local execution baseline and collect hidden-state evidence before any distributed claims.

Tasks:

1. verify `Gemma 4 E2B` local inference path in MLX on a single host
2. add hidden-state capture at selected layers
3. record per-layer sufficiency metrics on real prompt sets
4. define baseline latency, tokens/sec, memory footprint, and energy measurement path
5. identify candidate split layers for later transfer experiments

Deliverable:
`single-host baseline report + candidate split-layer map`
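
One way tasks 3 and 5 could be sketched, assuming the sufficiency proxy is top-1 agreement between a readout of an intermediate layer and the final layer. That choice of proxy, the 0.8 threshold, and the function names are assumptions; the track has not fixed a sufficiency metric.

```python
def top1_agreement(layer_top1, final_top1):
    """Fraction of positions where the layer's top-1 token matches the final one."""
    if len(layer_top1) != len(final_top1) or not final_top1:
        raise ValueError("token lists must be non-empty and equal in length")
    matches = sum(a == b for a, b in zip(layer_top1, final_top1))
    return matches / len(final_top1)

def candidate_split_layers(per_layer_top1, final_top1, min_agreement=0.8):
    """Return (sorted candidate layers, per-layer scores).

    per_layer_top1 maps a layer index to the top-1 token ids obtained by
    applying some readout to that layer's captured hidden states.
    """
    scores = {
        layer: top1_agreement(tokens, final_top1)
        for layer, tokens in per_layer_top1.items()
    }
    candidates = sorted(l for l, s in scores.items() if s >= min_agreement)
    return candidates, scores

final = [5, 7, 7, 2, 9]
captured = {
    4: [5, 1, 1, 2, 0],
    12: [5, 7, 7, 2, 0],
    20: [5, 7, 7, 2, 9],
}
print(candidate_split_layers(captured, final))
```

The layer indices and token ids here are made-up inputs that only show the shape of the computation; real values come from the captured prompt sets.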

WP1. Trajectory Routing Head

Goal:
Attach anticipation geometry to the model as a first-class route signal.

Tasks:

1. wire a trajectory state head to selected hidden layers
2. compute supervision from local conversation/tool trajectories
3. train and calibrate route confidence
4. compare trajectory-aware routing against entropy-only routing

Deliverable:
`route-head evaluation report`
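
Task 3 asks for calibrated route confidence. A common way to measure that is expected calibration error (ECE), sketched below. Using ECE here is an assumption about the evaluation, and `outcomes` is a hypothetical 0/1 label for whether the routing decision turned out to be correct.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """ECE with equal-width bins over [0, 1].

    confidences: predicted probability that the route decision is right.
    outcomes: 1 if it was right, 0 otherwise.
    """
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("inputs must be non-empty and equal in length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, outcomes):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(h for _, h in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# An overconfident head: claims 0.9 but is right only one time in four.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0]))
```

Computing the same score for an entropy-derived confidence gives a direct comparison for task 4.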

WP2. Depth Routing Residual Module

Goal:
Test whether learned depth selection improves local quality or calibration before distribution.

Tasks:

1. implement MLX-native depth routing residual blocks
2. insert at a small number of candidate blocks
3. train with next-token + calibration losses
4. compare to plain residual path and naive depth mixing

Deliverable:
`residual-routing ablation`

WP3. Semantic Projection Layer

Goal:
Project hidden states into a sparse or typed semantic layer aligned to the semantic kernel.

Tasks:

1. define initial primitive/invariant label schema
2. train sparse projection head or SAE-style bottleneck
3. evaluate semantic stability across neighboring prompts
4. test whether semantic confidence helps dead-state detection

Deliverable:
`semantic-latent evaluation report`
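
Tasks 2 and 3 can be illustrated with a top-k sparsifier and a set-overlap stability score. Both are stand-ins chosen for this sketch: the text leaves open whether the head is a sparse projection or an SAE-style bottleneck, and Jaccard overlap is only one possible definition of semantic stability.

```python
def top_k_sparse(activations, k):
    """Keep the k largest-magnitude activations as {primitive index: value}."""
    ranked = sorted(
        range(len(activations)),
        key=lambda i: abs(activations[i]),
        reverse=True,
    )
    return {i: activations[i] for i in ranked[:k]}

def jaccard(active_a, active_b):
    """Overlap between two sets of active primitive indices."""
    a, b = set(active_a), set(active_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def semantic_stability(acts_a, acts_b, k):
    """Stability of the sparse code across two neighboring prompts."""
    return jaccard(top_k_sparse(acts_a, k), top_k_sparse(acts_b, k))

prompt = [0.1, -2.0, 0.5, 1.5]
paraphrase = [0.2, -1.8, 1.6, 0.4]
print(top_k_sparse(prompt, 2), semantic_stability(prompt, paraphrase, 2))
```

A head whose active primitives change sharply between near-identical prompts would score low here, which is the failure WP3 task 3 is meant to surface.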

WP4. Transfer Adapter + AGP-PTP

Goal:
Make hidden-state transfer real and measurable.

Tasks:

1. implement compressed residual packet encoder/decoder
2. define binary AGP-PTP framing
3. test same-host encode/decode fidelity first
4. test cross-host packet transfer over Thunderbolt 5
5. compare resumed continuation against local no-transfer continuation

Deliverable:
`transfer fidelity and transport report`
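
For task 3, same-host fidelity can be summarized with two standard vector metrics. The sketch below assumes cosine similarity and relative L2 error are acceptable fidelity measures and uses a hypothetical pass threshold of 0.99; neither is specified by the track.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # undefined for zero vectors; treated as no similarity
    return dot / (norm_a * norm_b)

def relative_l2_error(original, reconstructed):
    num = math.sqrt(sum((x - y) ** 2 for x, y in zip(original, reconstructed)))
    den = math.sqrt(sum(x * x for x in original))
    if den == 0.0:
        return 0.0 if num == 0.0 else float("inf")
    return num / den

def fidelity_report(original, reconstructed, min_cosine=0.99):
    """Compare a hidden state before encoding with its decoded counterpart."""
    cosine = cosine_similarity(original, reconstructed)
    return {
        "cosine": cosine,
        "relative_l2": relative_l2_error(original, reconstructed),
        "passes": cosine >= min_cosine,  # threshold is a placeholder
    }

print(fidelity_report([3.0, 4.0], [3.0, 4.1]))
```

Vector-level fidelity is only a proxy. Task 5 still requires comparing the resumed continuation against the local no-transfer continuation at the token level.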

WP5. Two-Host Corrective Execution

Goal:
Turn the packet path into actual conditional distributed inference.

Tasks:

1. Mac4 local draft path
2. Mac5 corrective continuation path
3. acceptance/rejection logic
4. transfer-trigger calibration
5. quality and latency comparison against all baselines

Deliverable:
`two-host distributed inference benchmark`
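
The acceptance/rejection step in task 3 could take many forms. The sketch below borrows the greedy prefix-match rule familiar from speculative decoding as one candidate: keep draft tokens until the verifier disagrees, then take the verifier's token. This rule and the function name are assumptions, not the track's chosen logic.

```python
def accept_draft(draft_tokens, verifier_tokens):
    """Greedy prefix acceptance.

    draft_tokens: tokens proposed by the local draft path (Mac4).
    verifier_tokens: tokens the corrective path (Mac5) would emit
    at the same positions.
    Returns (accepted prefix, correction token or None).
    """
    accepted = []
    for draft, verified in zip(draft_tokens, verifier_tokens):
        if draft != verified:
            return accepted, verified  # first disagreement ends acceptance
        accepted.append(draft)
    return accepted, None

print(accept_draft([11, 12, 13], [11, 12, 99]))
print(accept_draft([11, 12], [11, 12]))
```

The correction acceptance rate in the benchmark matrix would then follow from how often the returned correction is `None` across escalated spans.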

WP6. ANE Sidecar Path

Goal:
Place the correct modules on ANE-adjacent execution surfaces.

Tasks:

1. choose exportable shallow modules
2. implement Core ML sidecar path for public-runtime experiments
3. benchmark ANE vs GPU for those modules
4. optionally validate private-runtime ANE path separately from the mainline build

Deliverable:
`ANE sidecar benchmark and decision memo`

Immediate First Package

The first executable package is `WP0`.

WP0 success criteria:

  • a reproducible single-host Gemma 4 E2B MLX baseline exists
  • layer captures are saved for real prompt sets
  • candidate split layers are identified from evidence, not intuition
  • local metrics exist for:
      • median latency
      • p95 latency
      • tokens/sec
      • peak memory
      • per-layer sufficiency proxy

WP0 is intentionally boring. If this part is weak, everything downstream becomes storytelling.
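
A small summary helper for the latency and throughput criteria, using only the standard library. It assumes one latency sample and one token count per request and uses the nearest-rank definition of p95; the function names are hypothetical, and peak memory and energy need platform-specific measurement that is not shown.

```python
import statistics

def percentile_nearest_rank(values, percent):
    """Nearest-rank percentile; percent is an integer in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (percent * len(ordered) + 99) // 100  # integer ceiling
    return ordered[rank - 1]

def summarize_runs(latencies_s, tokens_generated):
    """latencies_s[i] and tokens_generated[i] describe the same request."""
    if len(latencies_s) != len(tokens_generated):
        raise ValueError("one token count per latency sample is required")
    return {
        "median_latency_s": statistics.median(latencies_s),
        "p95_latency_s": percentile_nearest_rank(latencies_s, 95),
        "tokens_per_sec": sum(tokens_generated) / sum(latencies_s),
    }

latencies = [float(i) for i in range(1, 21)]  # placeholder samples, not measurements
print(summarize_runs(latencies, [10] * 20))
```

Reporting p95 alongside the median matters for this track because escalation to the second host is expected to affect the tail far more than the middle of the distribution.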

Training Decision

The canonical training plan now lives in `agp-mlx-ane-training-spec.md`.

The key decision is that AGP is trained as an `adapter-first curriculum`, not as an end-to-end new model. The order is:

1. `Gemma 4 E2B` domain SFT adapter on the real local corpus
2. trajectory and vitality heads on top of the tuned backbone
3. depth-routing residual modules with teacher distillation
4. semantic projection layer aligned to kernel primitives and invariants
5. transfer adapter with same-host continuation fidelity before any real cross-host claim
6. short joint alignment finetune across the trained heads and adapters

This keeps the architecture measurable at every stage and avoids training a scheduler on the wrong hidden-state manifold.

Benchmark Matrix

The benchmark suite for the whole track should include:

  • `quality`
      • task-completion quality on personal prompts
      • coding-task correctness
      • memory-grounded response quality
      • semantic consistency
  • `systems`
      • median latency
      • p95 latency
      • tokens/sec
      • peak memory
      • host-to-host bytes per escalated token
      • escalation rate
      • correction acceptance rate
  • `efficiency`
      • energy per token
      • energy per accepted token
      • average active depth
      • average local-only completion rate
  • `representation`
      • hidden-state transfer fidelity
      • semantic projection stability
      • dead-state detection accuracy
      • route calibration
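
Several of the systems and efficiency metrics can be derived from a single per-token event log. The sketch below assumes such a log exists with the fields shown; the `TokenEvent` record, its field names, and the token-level definition of the local-only rate are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TokenEvent:
    accepted: bool        # token made it into the final output
    escalated: bool       # token required the second host
    energy_joules: float  # energy attributed to this token
    bytes_sent: int       # host-to-host bytes for this token
    active_depth: int     # number of layers actually executed

def efficiency_summary(events):
    if not events:
        raise ValueError("no events")
    n = len(events)
    accepted = sum(e.accepted for e in events)
    escalated = [e for e in events if e.escalated]
    total_energy = sum(e.energy_joules for e in events)
    return {
        "escalation_rate": len(escalated) / n,
        "local_only_token_rate": 1.0 - len(escalated) / n,
        "energy_per_token": total_energy / n,
        # Rejected tokens still cost energy, so this is >= energy_per_token.
        "energy_per_accepted_token":
            total_energy / accepted if accepted else float("inf"),
        "bytes_per_escalated_token":
            sum(e.bytes_sent for e in escalated) / len(escalated)
            if escalated else 0.0,
        "average_active_depth": sum(e.active_depth for e in events) / n,
    }

log = [
    TokenEvent(True, False, 1.0, 0, 8),
    TokenEvent(True, True, 3.0, 1000, 24),
    TokenEvent(False, True, 2.0, 1200, 24),
    TokenEvent(True, False, 2.0, 0, 8),
]
print(efficiency_summary(log))
```

The values in `log` are illustrative inputs only. Dividing total energy by accepted tokens rather than generated tokens is what makes wasted drafts visible in the efficiency numbers.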

Decision Rules

If `single-host MLX` beats the distributed path on both quality and efficiency, stop and simplify.

If `naive fixed split` matches the learned split, remove complexity and re-evaluate.

If the semantic projection does not improve routing, keep it as analysis only and do not let it bloat the critical path.

If ANE export paths do not beat GPU for their assigned modules, keep ANE as a future optimization rather than forcing it into the mainline build.

Current Start State

Known assets already available:

  • anticipation geometry research and transformer bias work
  • ANE spike result in the N'Ko track
  • TurboQuant sidecar evaluation on real embeddings
  • semantic-kernel / semantic-layer theory
  • Thunder-Train / Thunderbolt 5 context for Mac4 + Mac5

Known missing pieces:

  • a canonical AGP codebase
  • MLX-native prototype with route heads
  • transfer protocol implementation
  • cross-host benchmark harness
  • semantic projection training loop tied to Gemma 4 hidden states

Immediate Next Actions

1. create the WP0 experiment folder and baseline harness
2. validate Gemma 4 E2B local MLX path
3. instrument hidden-state capture at selected layers
4. assemble the first real prompt pack from your own data
5. produce the first baseline report before any distributed execution claims

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

Comp-Core/docs/research/agp-mlx-ane-research-track.md
