AGP Thunder-Train Stage 1 Plan
This is the stage-1 backbone plan for running AGP domain adaptation across `Mac4 + Mac5` over `Thunderbolt 5` using the existing `thunder-train` stack. The purpose of this stage is not to train the full AGP architecture end to end. The purpose is to make both Macs compute immediately on the first useful backbone problem: `Gemma 4 E2B` domain adaptation on the AGP high-signal corpus.
Full Public Reader
AGP Thunder-Train Stage 1 Plan
Date: `2026-04-16`
This is the stage-1 backbone plan for running AGP domain adaptation across `Mac4 + Mac5` over `Thunderbolt 5` using the existing `thunder-train` stack. The purpose of this stage is not to train the full AGP architecture end to end. The purpose is to make both Macs compute immediately on the first useful backbone problem: `Gemma 4 E2B` domain adaptation on the AGP high-signal corpus.
The core decision is now explicit. `Mac4 + Mac5 should both compute.` The prior single-host MLX path remains important, but only as the baseline and fallback control. The primary execution path for stage 1 is dual-host Thunder-Train over the `10.0.5.x` Thunderbolt link.
There are four hard constraints that shape this stage.
First, the real Thunder entrypoint is `launch.sh`, not `distributed_launch.sh`. The repo marks `distributed_launch.sh` as legacy and the current launcher path is `launch.sh` + `mlx_launch.py`.
Second, Thunder-Train expects ChatML-style `{"messages": [...]}` records. The current AGP MLX lane already proved out on plain `text` exports for `mlx_lm lora`, so the AGP data must be re-exported into Thunder format instead of pretending the same files can be reused unchanged.
Third, the remote runtime assumption is currently broken. The Thunder launcher expects a consistent Python interpreter path on both Macs with `mlx` and `mlx_lm` installed. At status check time, neither host satisfied that assumption cleanly. Until that parity is fixed, no distributed launch should be called “live.”
Fourth, this stage is still only the backbone stage. Route heads, vitality heads, semantic projection, transfer adapters, and ANE placement remain later curriculum stages. Stage 1 only needs to produce a strong dual-host domain adapter checkpoint.
The machine roles are simple. `Mac1` remains the orchestrator, dataset builder, and verification surface. `Mac4` and `Mac5` form the dual compute fabric. `Mac4` should be treated as rank `0` and `Mac5` as rank `1`, matching the existing Thunder topology and hostfile. Both machines should hold the same stage-1 dataset and launchable runtime. The training strategy for the first pass should remain `data` parallel, not tensor parallel, because the problem is throughput and adapter training stability, not model fit failure.
The execution order should be strict.
First, export the AGP `domain_sft_v1_high_signal` corpus into Thunder `messages` JSONL format with train, valid, and test splits. The first export should preserve provenance fields such as `record_id`, `weight`, and `source`, even if Thunder ignores them, so the data remains auditable.
Second, establish Python parity on both remote hosts. The launcher must resolve to a Python environment that imports `mlx` and `mlx_lm` on both machines. Until this is true, any launch attempt is noise.
Third, sync the AGP Thunder dataset to both Macs in a consistent path, preferably under `Desktop/Comp-Core/experiments/agp_mlx/train/datasets/domain_sft_v1_high_signal_thunder/` or a mirrored path inside `[home-path]`.
Fourth, run a tiny smoke job over the actual Thunder launcher with `Gemma 4 E2B`, low iteration count, and a narrow adapter path. The purpose of the smoke is not model quality. The purpose is to verify the distributed group initializes, both ranks load data, and a checkpoint lands without the launcher or remote Python path collapsing.
Fifth, only after the smoke succeeds should the first real stage-1 backbone run begin. That run should use the high-signal corpus, conservative learning rate, and explicit logging on both ranks.
Success for this stage means five things. Both Macs join the distributed group. Both Macs process real AGP data. A checkpoint lands without launcher drift. Validation runs complete. The resulting checkpoint can be compared against the single-host MLX backbone control.
Failure at this stage is also useful, but it must be categorized correctly. If the launcher fails before rank initialization, the problem is runtime parity. If the data loader fails, the problem is export format or path consistency. If loss diverges immediately, the problem is learning rate or quantized LoRA stability. If the run works but underperforms the single-host control badly, the problem is the Thunder path itself rather than the AGP data.
The immediate blockers at the time of writing are known and finite. The dataset still needs Thunder-format export. The remote Python launcher path is not yet parity-clean on either host. No real dual-host AGP smoke run has been completed yet. Those are setup blockers, not research ambiguity.
The immediate next command path, once runtime parity is fixed, should be a smoke run from `Mac1` through the Thunder launcher with `Mac4 + Mac5` on the same AGP dataset. Once that succeeds, stage 1 is officially dual-host.
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
Comp-Core/docs/research/agp-thunder-train-stage1-plan.md
Detected Structure
Method · Evaluation · Code Anchors · Architecture