AGP / MLX / ANE Training Spec

Extracted abstract or opening context

The first trainable version of AGP is not a new foundation model. It is a `Gemma 4 E2B` decoder transformer wrapped in a small set of new trainable interfaces. The objective is to teach the system four things at once: how to speak in your distribution, how to estimate whether an intermediate state is alive enough to trust, how to project that state into a typed semantic layer, and how to compress or route that state without paying for full-depth inference every time. The right starting point is therefore not full-model training. The right starting point is `adapter-first curriculum training` on top of a mostly frozen base model in `MLX`.

The architecture we are training has five trainable regions. The first region is the ordinary language-model adaptation layer, which is a LoRA or DoRA-style adapter on selected Gemma blocks so the base model moves toward your conversational, coding, and memory-grounded distribution. The second region is the `trajectory and vitality head`, which reads selected hidden layers and predicts the routing state of the current reasoning process. The third region is the `semantic projection head`, which turns dense hidden states into sparse activations over kernel-aligned primitives, invariants, and bundle neighborhoods. The fourth region is the `depth-routing residual module`, which learns whether earlier hidden states are sufficient and how much prior depth should still matter. The fifth region is the `transfer adapter`, which encodes a cross-host packet and then reconstructs a continuation-ready latent on the receiving side.

The first thing we train is not the routing logic. It is not the transfer adapter either. We first train a `domain adapter` so the base Gemma hidden states become useful on your own distribution. If we skip that and train routing on generic hidden states, we end up learning a scheduler over the wrong representation manifold. So the actual order is strict.
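The five regions and the adapter-first ordering can be written down as a small staged schedule. The following is a minimal sketch, not the spec's actual configuration: the identifiers (`Region`, `REGIONS`, `trainable_at`) are hypothetical, and it assumes that the adapter is frozen after the first stage and that the four later regions share one stage, since the text does not order them relative to each other.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str  # hypothetical identifier, not taken from the spec
    role: str  # paraphrase of the role described in the text


# The five trainable regions, in the order the text lists them.
REGIONS = (
    Region("domain_adapter", "LoRA or DoRA-style adapter on selected Gemma blocks"),
    Region("trajectory_vitality_head", "predicts routing state from selected hidden layers"),
    Region("semantic_projection_head", "maps dense hidden states to sparse primitive activations"),
    Region("depth_routing_residual", "learns whether earlier hidden states are sufficient"),
    Region("transfer_adapter", "encodes and reconstructs a cross-host packet"),
)


def trainable_at(stage: int) -> set[str]:
    """Return the names of the regions that receive gradients at a stage.

    Stage 0 trains only the domain adapter. Any later stage trains the
    remaining four regions on top of the adapter-tuned backbone.
    Assumption: the adapter is frozen after stage 0; the base model
    weights are never in the trainable set.
    """
    if stage < 0:
        raise ValueError("stage must be non-negative")
    if stage == 0:
        return {REGIONS[0].name}
    return {region.name for region in REGIONS[1:]}


if __name__ == "__main__":
    for stage in (0, 1):
        print(stage, sorted(trainable_at(stage)))
```

Keeping the schedule as data makes the constraint checkable: a training script can refuse to build a routing or transfer head while `trainable_at` still reports the domain adapter as the only trainable region.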
First produce a strong local LoRA-tuned Gemma 4 E2B on your data. Then use that tuned model as the teacher and backbone for every later AGP head. This matters because the architecture is trying to decide when a hidden state is sufficient. Sufficiency depends on the actual target workload. A hidden state that is sufficient for generic internet chat may not be sufficient for your prompt style, your code tasks, your memory-heavy requests, or your semantic-layer work. The hidden-state geometry has to be measured and trained on the real distribution.

The canonical base for training is `Gemma 4 E2B`. The E2B model is small enough to iterate on two 16 GB Apple machines and large enough to expose meaningful intermediate-state structure. The next scale-up model is `Gemma 4 E4B`, but that is phase two only. We should not jump to E4B before the curriculum and
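To make the cost argument for a LoRA-style adapter on a mostly frozen base concrete, here is a minimal pure-Python sketch of the standard low-rank update, y = W x + (alpha / rank) · B (A x), together with the parameter counts it implies. The layer sizes and rank used in the example are hypothetical placeholders, not Gemma 4 E2B's actual dimensions, and the function names are illustrative only.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the low-rank pair A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def matvec(matrix, vector):
    """Multiply a row-major list-of-lists matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute y = W x + (alpha / rank) * B (A x).

    W is the frozen base weight; only A and B would be trained.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


if __name__ == "__main__":
    # Hypothetical sizes chosen for illustration only.
    d_in, d_out, rank = 2048, 2048, 8
    dense = full_params(d_in, d_out)
    low_rank = lora_params(d_in, d_out, rank)
    print(f"dense: {dense}, adapter: {low_rank}, ratio: {low_rank / dense:.4%}")

    # With B initialised to zeros the adapter leaves the base output unchanged,
    # which is the usual LoRA starting point.
    W = [[1, 0], [0, 1]]
    A = [[1, 1]]
    B_zero = [[0], [0]]
    print(lora_forward(W, A, B_zero, [3, 4], alpha=2, rank=1))
```

Because the adapter's parameter count grows with `rank * (d_in + d_out)` rather than `d_in * d_out`, the trainable state stays small relative to the frozen base, which fits the text's emphasis on iterating locally before any scale-up.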

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.