Grand Diomande Research · Full HTML Reader

AGP / MLX / ANE Training Spec

The first trainable version of AGP is not a new foundation model. It is a `Gemma 4 E2B` decoder transformer wrapped in a small set of new trainable interfaces. The objective is to teach the system four things at once: how to speak in your distribution, how to estimate whether an intermediate state is alive enough to trust, how to project that state into a typed semantic layer, and how to compress or route that state without paying for full-depth inference every time. The right starting point is therefore not full-m

Date: `2026-04-16`

Training Objective

The architecture we are training has five trainable regions. The first region is the ordinary language-model adaptation layer, which is a LoRA or DoRA-style adapter on selected Gemma blocks so the base model moves toward your conversational, coding, and memory-grounded distribution. The second region is the `trajectory and vitality head`, which reads selected hidden layers and predicts the routing state of the current reasoning process. The third region is the `semantic projection head`, which turns dense hidden states into sparse activations over kernel-aligned primitives, invariants, and bundle neighborhoods. The fourth region is the `depth-routing residual module`, which learns whether earlier hidden states are sufficient and how much prior depth should still matter. The fifth region is the `transfer adapter`, which encodes a cross-host packet and then reconstructs a continuation-ready latent on the receiving side.

What We Train First

The first thing we train is not the routing logic. It is not the transfer adapter either. We first train a `domain adapter` so the base Gemma hidden states become useful on your own distribution. If we skip that and train routing on generic hidden states, we end up learning a scheduler over the wrong representation manifold. So the actual order is strict. First produce a strong local LoRA-tuned Gemma 4 E2B on your data. Then use that tuned model as the teacher and backbone for every later AGP head.

This matters because the architecture is trying to decide when a hidden state is sufficient. Sufficiency depends on the actual target workload. A hidden state that is sufficient for generic internet chat may not be sufficient for your prompt style, your code tasks, your memory-heavy requests, or your semantic-layer work. The hidden-state geometry has to be measured and trained on the real distribution.

Model Choice

The canonical base for training is `Gemma 4 E2B`. The E2B model is small enough to iterate on two 16 GB Apple machines and large enough to expose meaningful intermediate-state structure. The next scale-up model is `Gemma 4 E4B`, but that is phase two only. We should not jump to E4B before the curriculum and auxiliary heads are working on E2B.

For training, the base transformer stays mostly frozen. We train LoRA or DoRA adapters on a restricted set of blocks, and we train the AGP heads and modules natively in MLX. The ANE is not part of the first training loop. The ANE belongs to later inference deployment for shallow exported heads. Early training should remain `GPU-first MLX` because that is the stable path and because the AGP modules are still changing rapidly.

Data Lanes

The training corpus should be split into four lanes rather than treated as one giant bag. The first lane is `domain SFT`, built from the richest high-value conversational and coding sessions. The second lane is `routing supervision`, built from hidden-state traces and oracle labels derived from the tuned model itself. The third lane is `semantic supervision`, built from prompts and responses that can be weakly labeled by the semantic-kernel or primitive-bundle pipeline. The fourth lane is `transfer supervision`, built from captured intermediate hidden states at candidate split layers and their corresponding full-depth continuations.

The source priority should remain the one already established during the audit. Raw Claude session transcripts are the best source for real sequential behavior. `verbose-all.jsonl` is the best metadata and ranking layer. `prompts-all.jsonl` and Codex prompt history are support surfaces. Supabase memory is the longest-horizon archive and becomes especially important later for memory-grounded and temporal continuity training.

The SFT lane should contain only high-signal prompt-response pairs. Tool-only orchestration wrappers, task notifications, mirror noise, and heartbeat chatter should be filtered aggressively. The routing lane should preserve more process detail because it needs tool actions, edits, confidence patterns, and multi-turn structure. The semantic lane should over-sample prompts that naturally express invariants, transitions, return, dwell, and stability concepts, and cross-script or conceptual decomposition work. The transfer lane should come from instrumented forward passes, not static JSONL alone.
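
The SFT filtering rule can be sketched in plain Python. The record fields (`kind`, `prompt`, `response`), the noise tag names, and the minimum-length cutoff are all hypothetical; the real schema of the audited sources will differ.

```python
from typing import Iterable

# Hypothetical tags for records that should never reach the SFT lane.
NOISE_KINDS = {"tool_only", "task_notification", "mirror", "heartbeat"}


def is_high_signal(record: dict, min_chars: int = 20) -> bool:
    """Keep a pair only if it is not tagged as noise and both sides carry real text."""
    if record.get("kind") in NOISE_KINDS:
        return False
    prompt = (record.get("prompt") or "").strip()
    response = (record.get("response") or "").strip()
    return len(prompt) >= min_chars and len(response) >= min_chars


def filter_sft(records: Iterable[dict]) -> list:
    """Return the subset of records that pass the high-signal check."""
    return [r for r in records if is_high_signal(r)]
```

Writing the rule as a single predicate keeps the filtering decision reviewable before any training run depends on it.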

Label Construction

The route labels should come from an oracle built during WP0 and the first tuned-model passes. For each prompt or token span, run the full model and record selected hidden layers. At each candidate layer, decode provisional logits through a lightweight translator or direct probe and compare them to the final logits. The earliest layer that satisfies a quality threshold becomes the `earliest acceptable boundary`. That label then supervises the router. If no early layer passes the threshold, the target is `deepen` or `escalate`.
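
A minimal sketch of the oracle, assuming the quality threshold is a cap on KL divergence between the final distribution and each layer's provisional distribution. The threshold value and the function names are illustrative, and the spec does not fix the metric.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def earliest_acceptable_boundary(layer_logits, final_logits, max_kl=0.05):
    """layer_logits maps layer index -> provisional logits.

    Returns the shallowest layer whose distribution is within max_kl of the
    final distribution, or None when the target should be deepen/escalate.
    """
    teacher = softmax(final_logits)
    for layer in sorted(layer_logits):
        if kl_divergence(teacher, softmax(layer_logits[layer])) <= max_kl:
            return layer
    return None
```

A `None` result corresponds to the `deepen` or `escalate` target described above.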

Vitality labels should be built from a mix of heuristics and specific dead-state signals. Hidden-state norm, entropy proxies, agreement gap with final logits, semantic confidence, and sparse activation concentration all matter. For the N'Ko and semantic-heavy tasks, the dead-state lessons from the brain-scanner work should be used explicitly. We want the router to learn the difference between a state that is merely unfinished and a state that is structurally unfit for transfer.

Semantic labels should come from the kernel and its neighboring theory stack. The primitive layer gives multi-label targets over operators such as stabilize, transition, dwell, echo, return, and shift. The invariant layer gives a second, slower-varying semantic target. Concept bundles can be used as neighborhood supervision rather than as a single hard class. In practice, the semantic head should predict primitives and invariants directly, with bundle identity used as a contrastive or retrieval-style target on top.

Transfer labels are generated automatically. At each candidate split layer, capture the hidden state and the final continuation outcome. The encoder-decoder transfer pair is then trained to reconstruct a latent from which the downstream continuation matches the no-transfer baseline closely enough. The target is not merely hidden-state reconstruction. The target is continuation fidelity.

Training Curriculum

Stage 0 — Baseline Instrumentation

Before training anything, finish WP0 and lock the baseline. That means a single-host Gemma 4 E2B MLX run, hidden-state capture at a small set of layers, prompt-pack execution on real local tasks, and a candidate split-layer map. This stage produces the empirical boundaries that later stages need.

Stage 1 — Domain Adapter SFT

Train a plain language adapter on Gemma 4 E2B first. This is a standard supervised finetune using LoRA or DoRA on selected transformer blocks. The objective is next-token cross-entropy on your filtered conversational, coding, planning, and memory-grounded pairs. The base model remains frozen except for the adapter layers.

This stage should not try to teach routing, semantics, or transfer directly. Its job is to bend the hidden-state manifold toward your workload so later AGP supervision has something meaningful to work with. The output of this stage is the first `AGP backbone`, which is still just a tuned Gemma with no scheduler yet.

Stage 2 — Trajectory And Vitality Heads

Once the domain adapter exists, freeze the tuned backbone and train trajectory and vitality heads on top of selected hidden layers. These heads read hidden states and predict route state, dead-state risk, early-exit confidence, and escalate-versus-stay-local decisions. The supervision comes from the oracle labels built from full-depth teacher runs.

The route head should initially solve a small decision problem. The first version only needs to predict four states: `accept_local`, `continue_local`, `revive_local`, and `escalate`. If that works, we can later make the action space finer. The vitality head should predict dead-versus-alive risk and confidence calibration. The goal is not only correct routing. The goal is reliable refusal to route when the hidden state is not trustworthy.
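
One possible reading of "reliable refusal" is a gate that never lets an untrusted state be accepted early. The sketch below encodes the four route states and that gate. The thresholds, the function name, and the choice of `continue_local` as the fallback are assumptions, not decisions made in this spec.

```python
from enum import Enum


class RouteAction(Enum):
    ACCEPT_LOCAL = "accept_local"
    CONTINUE_LOCAL = "continue_local"
    REVIVE_LOCAL = "revive_local"
    ESCALATE = "escalate"


def gated_action(route_probs, alive_prob, min_alive=0.5, min_conf=0.6):
    """Pick the most probable action, but refuse early acceptance of an untrusted state.

    route_probs maps RouteAction -> probability from the route head.
    alive_prob is the vitality head's estimate that the state is alive.
    """
    best = max(route_probs, key=route_probs.get)
    trusted = alive_prob >= min_alive and route_probs[best] >= min_conf
    if best is RouteAction.ACCEPT_LOCAL and not trusted:
        # Assumed fallback: keep computing locally instead of accepting.
        return RouteAction.CONTINUE_LOCAL
    return best
```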

Stage 3 — Depth Routing Residual Module

After the route head is calibrated, train the depth-routing residual module. This module is the first actual architectural intervention inside the transformer path. It decides how much useful signal exists in earlier hidden states and whether a shallower mixture can approach the later result.

This stage is trained with teacher distillation. The reference is the tuned full-depth backbone from stage one. The routed path is penalized when its decoded distribution diverges too far from the full-depth output. This module should be inserted in only a small number of candidate blocks at first. The experiment is not about maximal architectural change. It is about testing whether a limited learned depth interface actually improves early sufficiency.

Stage 4 — Semantic Projection Layer

With a stable tuned backbone and route head, train the semantic projection layer. This can be implemented either as a sparse bottleneck directly over hidden states or as a two-part head where a dense projector is followed by sparse or thresholded primitive activations. The key is that the head predicts typed kernel-aligned structure, not generic opaque features.
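
The two-part variant (dense projector followed by thresholded activations) might look like the following in plain Python. The weight layout, the threshold, and the optional `top_k` cap are illustrative assumptions; a real head would be an MLX module operating on tensors.

```python
def sparse_project(hidden, weights, bias, threshold=0.0, top_k=None):
    """Project a hidden vector onto primitive units, then sparsify.

    weights has one row per primitive; each row has len(hidden) entries.
    Activations at or below threshold are zeroed. If top_k is given,
    only the top_k largest surviving activations are kept.
    """
    dense = [sum(w * h for w, h in zip(row, hidden)) + b
             for row, b in zip(weights, bias)]
    active = [d if d > threshold else 0.0 for d in dense]
    if top_k is not None:
        order = sorted(range(len(active)), key=lambda i: active[i], reverse=True)
        keep = set(order[:top_k])
        active = [a if i in keep else 0.0 for i, a in enumerate(active)]
    return active
```

Each output index would correspond to one kernel-aligned primitive, which is what keeps the layer typed rather than opaque.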

The losses here are multi-label semantic prediction, stability across semantically adjacent prompts, and sparsity regularization. If a sparse autoencoder path is added later, its discovered features should be aligned back to the kernel rather than treated as a separate ontology. The semantic head is successful if it becomes useful for routing and dead-state detection. If it is only interpretable but not operationally helpful, it should remain off the critical path.

Stage 5 — Transfer Adapter

Now train the cross-host transfer adapter. The encoder reads an intermediate hidden state plus route and vitality context. The decoder reconstructs a continuation-ready state that the deeper path can resume from. Training begins same-host only. The receiving side should still be local so that transport noise is not yet in the loop. Once the encode-decode path is stable, the same packet format is exercised across the two Macs.

The transfer objective is not raw tensor reconstruction for its own sake. It is agreement with the no-transfer continuation after resume. Hidden-state cosine and MSE are useful stabilizers, but the real target is downstream logit fidelity and continuation quality.

Stage 6 — Joint Limited Finetune

Only after the separate modules exist do we run a short joint finetune. This is not full end-to-end base-model training. It is a small composite optimization over the already trained adapters and heads. The purpose is to align the router, the semantic layer, and the transfer path to each other so they stop fighting. This stage should be short, carefully regularized, and easy to roll back from.

Loss Design

The total loss should be composite and curriculum-gated rather than active all at once from day one.

The base SFT loss is ordinary next-token cross-entropy. This dominates stage one.

The route loss is a small classification loss over route actions. It should be weighted by confidence so ambiguous teacher labels do not destabilize the head.
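
Confidence weighting can be sketched as a weighted mean of per-example cross-entropy. The assumption here is that each oracle label carries a confidence in `[0, 1]`; how that confidence is derived is left open.

```python
import math


def weighted_route_loss(batch_logits, targets, confidences):
    """Cross-entropy over route actions, each example scaled by teacher confidence.

    batch_logits: list of per-example logit lists.
    targets: list of target action indices.
    confidences: list of weights in [0, 1]; zero removes the example.
    """
    total, weight_sum = 0.0, 0.0
    for logits, target, conf in zip(batch_logits, targets, confidences):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += conf * (log_z - logits[target])
        weight_sum += conf
    return total / max(weight_sum, 1e-8)
```

Normalizing by the summed confidence rather than the batch size keeps the loss scale stable when many labels in a batch are ambiguous.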

The vitality loss is a binary or small multi-class calibration loss over alive, weak, dead, and revive-needed regimes.

The depth-routing loss is primarily a KL divergence between routed-path logits and full-depth teacher logits, optionally with an intermediate hidden-state similarity term.

The semantic loss is multi-label binary cross-entropy over primitives and invariants, with optional contrastive loss over concept-neighborhood similarity and an L1-style sparsity penalty on the latent activations.
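
A minimal sketch of the two non-optional terms, multi-label BCE and the L1 penalty. The contrastive term is omitted, and the default `l1_weight` is a placeholder rather than a recommended value.

```python
import math


def semantic_loss(logits, targets, activations, l1_weight=1e-3):
    """Mean BCE-with-logits over primitive/invariant labels plus mean |activation|."""
    bce = 0.0
    for z, y in zip(logits, targets):
        # Stable form of BCE with logits: max(z, 0) - z*y + log(1 + exp(-|z|))
        bce += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    bce /= len(logits)
    l1 = sum(abs(a) for a in activations) / len(activations)
    return bce + l1_weight * l1
```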

The transfer loss is a weighted combination of a hidden-state cosine term (one minus cosine similarity), hidden-state MSE, and final-logit KL after resumed continuation. If transfer is intended to reduce packet size aggressively, a bitrate or compression penalty should be added later.
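
As a sketch, the three transfer terms can be combined as below. The weights are assumptions chosen only so that the KL term dominates, matching the point that continuation fidelity is the real target. The cosine term is written as one minus similarity so that lower is better.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def transfer_loss(h_ref, h_rec, kl_after_resume, w_cos=0.1, w_mse=0.1, w_kl=1.0):
    """h_ref: captured hidden state; h_rec: decoder reconstruction.

    kl_after_resume: KL between no-transfer logits and logits after resuming
    from h_rec, computed elsewhere by running the deeper path.
    """
    cos_term = 1.0 - cosine_similarity(h_ref, h_rec)
    return w_cos * cos_term + w_mse * mse(h_ref, h_rec) + w_kl * kl_after_resume
```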

The composite objective should therefore look like:

`L = L_sft + λ_route L_route + λ_vitality L_vitality + λ_depth L_depth + λ_sem L_sem + λ_transfer L_transfer`

The important part is not the exact coefficients yet. The important part is that stages activate these terms gradually instead of all at once.
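
Curriculum gating can be expressed as a per-stage set of active terms. The mapping below is one reading of the stages in this spec, and including every term in stage 6 is an assumption. Coefficients default to 1.0 when not supplied.

```python
# Loss terms active at each curriculum stage (stage 0 trains nothing).
STAGE_TERMS = {
    1: {"sft"},
    2: {"route", "vitality"},
    3: {"depth"},
    4: {"sem"},
    5: {"transfer"},
    6: {"sft", "route", "vitality", "depth", "sem", "transfer"},
}


def composite_loss(stage, terms, lambdas):
    """Sum only the terms active at this stage, each scaled by its coefficient.

    terms: dict of term name -> already computed loss value.
    lambdas: dict of term name -> coefficient (missing names default to 1.0).
    """
    active = STAGE_TERMS[stage]
    return sum(lambdas.get(name, 1.0) * value
               for name, value in terms.items() if name in active)
```

Keeping the gate in a single table makes it easy to audit which gradients are live at each stage and to roll back the joint stage.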

MLX Implementation Strategy

The training stack should be built natively in MLX with the existing `mlx_lm` tuner as the base. `mlx_lm 0.31.2` already exposes LoRA/DoRA utilities and a trainer path. That gives us a stable starting point for the base domain adapter and potentially for parts of the later curriculum.

The first implementation should be split into two code paths. One path is the standard adapter path, which uses `mlx_lm` LoRA training for stage one. The second path is the AGP path, which wraps the tuned base model in custom MLX modules for trajectory heads, semantic heads, and transfer adapters. The same trainer interface can still be used, but with a custom loss function and controlled hidden-state capture.

The repo should eventually gain a dedicated training package, ideally something like:

`experiments/agp_mlx/data/` for dataset builders and label constructors
`experiments/agp_mlx/model/` for the Gemma wrapper and AGP heads
`experiments/agp_mlx/train/` for stage-specific training entrypoints
`experiments/agp_mlx/eval/` for route, semantic, and transfer evaluation

The point is to keep AGP-specific training code separate from the generic WP0 measurement harness.

Hardware Plan

Training should not begin as a distributed two-host gradient system. That is the wrong complexity level for the first pass. Instead, use the two Macs asymmetrically.

One Mac should be the active trainer. It runs the MLX training loop for the current stage. The second Mac should be the teacher and evaluation sidecar. It can generate oracle labels, run hidden-state trace jobs, or score checkpoints without interrupting the active trainer. This gives us real parallelism without prematurely inventing distributed MLX training.

The ANE stays out of the early training loop. Later, once the route head and semantic head stabilize, those compact heads can be exported or rewritten for a Core ML sidecar path so the inference architecture can actually test ANE placement. But training them initially on the GPU in MLX is the sane path.

The updated execution decision narrows the caution above rather than contradicting it. The rule is still that distributed training waits until a single-host baseline and a clean dataset path exist. That condition is now satisfied, so stage one can pivot into a dual-host `Thunder-Train` backbone run across `Mac4 + Mac5`, while preserving the single-host MLX run as the baseline and fallback control. In other words, stage one is now `distributed backbone training`, but only for the domain adapter itself. The later AGP heads still follow the staged curriculum.

Benchmarks For Training Success

Stage one succeeds when the tuned Gemma clearly outperforms the untuned baseline on your own prompt pack and yields more coherent hidden-state traces on the target distribution.

Stage two succeeds when the route and vitality heads beat naive entropy-only routing on held-out prompts.
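
For reference, a naive entropy-only baseline could look like this. The threshold is arbitrary, and the baseline only distinguishes two of the four actions, which is part of why it is a low bar.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_only_route(probs, max_entropy=0.5):
    """Accept early when the distribution is confident, otherwise keep going."""
    return "accept_local" if entropy(probs) <= max_entropy else "continue_local"


def accuracy(predicted, oracle):
    """Fraction of held-out examples where the prediction matches the oracle label."""
    return sum(p == o for p, o in zip(predicted, oracle)) / len(oracle)
```

The learned heads would be scored with the same `accuracy` call against the same oracle labels, so the comparison stays like for like.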

Stage three succeeds when the depth-routing path can approximate the full-depth logits closely enough that early sufficiency becomes measurable instead of speculative.

Stage four succeeds when the semantic layer is both stable and useful. Interpretability alone is not enough. It must improve either route quality, dead-state detection, or transfer acceptance.

Stage five succeeds when same-host transfer and resumed continuation stay close to no-transfer continuation at the candidate split layers. Cross-host tests should only begin after this is true locally.

The full curriculum succeeds when the combined system improves at least one of the actual architecture targets: quality per compute, median latency on easy prompts, energy per accepted token, or parameter efficiency on your workload.

The First Three Concrete Tasks

The next three build actions should be locked now.

First, create the `domain SFT` dataset from the audited local sources and write the filtering rules before training anything. This is the most important data decision in the whole track.

Second, build the first `Gemma 4 E2B MLX LoRA` run and treat that checkpoint as the backbone for every later AGP module.

Third, add hidden-state capture and oracle-label generation on top of that tuned checkpoint so the route and vitality heads have real supervision instead of hand-wavy heuristics.

That is the real starting point for training this architecture. Everything else comes after that.

Source Anchor

Comp-Core/docs/research/agp-mlx-ane-training-spec.md
