Codex Handoff: Partial-Real Thunder Train Lane
This handoff is for one bounded job only: get the first completed partial-real local Thunder training window to finish cleanly and verify that it writes a real checkpoint. Do not expand scope beyond that.
Date: 2026-04-25
The system context matters because this lane sits inside a larger architecture. The canonical acoustic science lane is running separately on Vast and is responsible for the paper-grade CER claims. That lane is the same-snapshot Paper 4 matrix on the A100 instance and it should be treated as read-only from this handoff. The local Thunder lane is different. It is the AGP and Gemma correction-adapter lane. Gemma is not the authority model. Gemma is the bounded proposal model that only acts after the AGP partition router says a chunk deserves additional compute. Rust admissibility is still the final authority for whether a proposed correction is allowed.
The current goal is therefore narrow and practical. We already proved the mechanics on a synthetic correction set. We already proved that the launchers, chunk runner, and local distributed setup can work. The next real milestone is to prove that a first bounded training window can run on a partial-real correction corpus built from actual A100 replay outputs. This is the first time the local Gemma adapter lane will be training on real replay-derived correction rows rather than the small synthetic mechanics set.
Scope Boundary
You own only the partial-real Thunder lane.
You do own:
- the local and remote Thunder dataset path for the partial-real corpus
- the Gemma cache warm state on `mac4` and `mac5`
- the partial-real Thunder launcher and chunk runner
- the dedicated adapter output path for the partial-real run
- checkpoint verification and artifact mirroring back to local
You do not own:
- the Vast closeout watcher
- the A100 instance lifecycle
- the paper text or manuscript framing
- the MAOE replay scripts, except to read field names or confirm the input format
- routing logic redesign
- changing the correction policy
If you discover a blocker outside your scope, record it precisely and stop there. Do not silently absorb adjacent systems into this task.
Current Architecture State
The partial-real correction corpus already exists. It was assembled from the four completed same-snapshot A100 runs that were already replayed locally. Those runs produced real bridge rows, and those bridge rows were converted into Gemma correction SFT data. That partial-real corpus has already been staged into Thunder Train and mirrored to the two compute hosts.
The local Thunder launch layer has already been patched so that this corpus is a named profile rather than a one-off path hack. The only reason the first bounded run did not proceed is that `mac4` had only a partial Hugging Face cache for `mlx-community/gemma-4-e2b-4bit`. `mac5` already had the full cache. The launch therefore stalled waiting on a cold model download on `mac4`. The correct fix was started but not yet verified through to the first checkpoint boundary.
Files And Paths That Matter
Dataset root
The partial-real SFT dataset should exist locally at:
`[home-path]`
This directory should contain at least:
- `train.jsonl`
- `valid.jsonl`
- `test.jsonl`
- `manifest.json`
The same path is expected on both compute hosts:
- `mac4:[home-path]`
- `mac5:[home-path]`
Launcher and runner
Relevant scripts:
- `Desktop/Comp-Core/experiments/agp_mlx/train/launch_gemma4_e2b_nko_correction_thunder.sh`
- `Desktop/Comp-Core/experiments/agp_mlx/train/run_gemma4_e2b_nko_correction_chunks.sh`
- `Desktop/Comp-Core/experiments/agp_mlx/train/sync_gemma4_e2b_cache_to_thunder.sh`
- `Desktop/Comp-Core/experiments/agp_mlx/train/README.md`
Named profiles already exist:
- `partial_real_stage1_stable`
- `partial_real_stage1`
Expected output path
The dedicated partial-real stable run should write here:
`[home-path]`
The specific checkpoint file that matters is:
`[home-path]`
Current known-success command
The exact bounded run to relaunch is:
./run_gemma4_e2b_nko_correction_chunks.sh partial_real_stage1_stable 10 10
Interpret that as: launch the `partial_real_stage1_stable` profile, advance to absolute step 10 in a 10-step window, and verify a real checkpoint lands.
What Is Already Done
These items are not the job anymore. Do not redo them unless you verify they are missing.
The partial-real corpus has already been promoted into Thunder Train.
The launchers have already been patched to expose the partial-real profile by name.
The synthetic mechanics lane has already proven the local chunked resume logic on the smaller correction set.
The current failure mode is not “training code broken.” The known failure mode is “one Thunder node is cold on the Gemma cache.”
Your Actual Task
You need to finish four concrete things:
First, verify the partial-real dataset is actually present and nontrivial on all three machines: local, `mac4`, and `mac5`.
Second, verify that the full `mlx-community/gemma-4-e2b-4bit` Hugging Face cache is present on `mac4` and `mac5`, and compare it against the known-good local cache. Do not trust directory existence alone. Confirm that `mac4` is no longer stuck with the partial cache snapshot of approximately 334 megabytes that previously caused the cold-download stall. The complete local cache is around 3.4 gigabytes, and each remote cache should be materially similar in total size and file count.
Third, once cache completeness is proven, relaunch the bounded partial-real Thunder window using the exact command already selected. The target is not a long run. The target is the first clean completed stable window on the real corpus.
Fourth, verify that `checkpoint.json` reaches step 10 and that the run artifact is mirrored back locally so the state is not stranded on a single compute host.
Recommended Verification Sequence
Use this order so you do not conflate dataset, cache, and training failures.
Step 1: Dataset verification
Verify locally:
ls -lh [home-path]
wc -l [home-path]
wc -l [home-path]
wc -l [home-path]
Verify on `mac4` and `mac5` with the same path. You are not trying to remeasure the corpus scientifically here. You are proving that both distributed ranks can see the same real files.
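The dataset check above can be wrapped in a small helper so the same test runs identically on every machine. This is a minimal sketch, not part of the existing Thunder scripts: `check_dataset_dir` is a hypothetical name, and the only facts it relies on are the four file names listed under "Dataset root".

```shell
#!/usr/bin/env bash
# Hypothetical helper: confirm a dataset directory holds the four expected
# files, each non-empty, and report a line count for every file found.
check_dataset_dir() {
  local dir="$1" status=0 name lines
  for name in train.jsonl valid.jsonl test.jsonl manifest.json; do
    if [ ! -s "$dir/$name" ]; then
      # -s is false for missing files and for zero-byte files.
      echo "MISSING_OR_EMPTY $name"
      status=1
      continue
    fi
    lines=$(wc -l < "$dir/$name" | tr -d ' ')
    echo "OK $name lines=$lines"
  done
  # Non-zero exit if any expected file was missing or empty.
  return "$status"
}
```

Locally this would be called as `check_dataset_dir "$DATASET_ROOT"`, where `DATASET_ROOT` is a variable you set to the dataset path. Assuming passwordless SSH and bash on the remote host, one way to run the identical check remotely is `{ declare -f check_dataset_dir; echo 'check_dataset_dir "$1"'; } | ssh mac4 bash -s -- "$DATASET_ROOT"`, repeated for `mac5`. Matching line counts across all three machines is the evidence that both ranks see the same files.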
Step 2: Gemma cache verification
Find the local cache root actually used by Thunder for `mlx-community/gemma-4-e2b-4bit`. Then compare local against `mac4` and `mac5`.
At minimum record:
- the cache directory path
- total size
- whether the expected large weight files are present
- whether `mac4` still looks partial compared to local
If `sync_gemma4_e2b_cache_to_thunder.sh` exists exactly for this fix, use that rather than inventing a new sync pattern. If it fails, collect the exact failure and only then fall back to manual copy.
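To make "materially similar in size and file count" concrete, the comparison can be reduced to two numbers per host. The sketch below is illustrative only: `cache_summary` and `cache_matches` are hypothetical helpers, and the 5 percent size tolerance is an assumption chosen to absorb filesystem block-size differences, not a threshold defined by this lane.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<kilobytes> <regular-file-count>" for a cache
# directory. Symlinks are not followed, so in the standard Hugging Face hub
# layout (snapshots/ entries are symlinks into blobs/) each weight file is
# counted once.
cache_summary() {
  local dir="$1" kb files
  kb=$(du -sk "$dir" | awk '{print $1}')
  files=$(find "$dir" -type f | wc -l | tr -d ' ')
  echo "$kb $files"
}

# Hypothetical helper: succeed when file counts are equal and the remote size
# is within pct percent of the local size (default 5, an assumed tolerance).
cache_matches() {
  local local_kb="$1" local_files="$2" remote_kb="$3" remote_files="$4"
  local pct="${5:-5}" diff
  diff=$(( local_kb > remote_kb ? local_kb - remote_kb : remote_kb - local_kb ))
  [ "$local_files" -eq "$remote_files" ] && [ $(( diff * 100 )) -le $(( local_kb * pct )) ]
}
```

With the sizes quoted earlier, a remote cache of roughly 334 MB against a local cache of roughly 3.4 GB fails this check by a wide margin, which is the intended behaviour. Record the two numbers per host in the report even when the check passes.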
Step 3: Relaunch bounded run
Run from the intended training directory and do not alter the profile:
./run_gemma4_e2b_nko_correction_chunks.sh partial_real_stage1_stable 10 10
You are not tuning the profile in this handoff. You are only proving that the known selected stable profile can reach its first real checkpoint on the partial-real corpus once model cache warmup is no longer the bottleneck.
Step 4: Checkpoint verification
Confirm:
cat [home-path]
You want the step field to show `10`.
Also verify the run directory contains the expected adapter and metadata files, not just an empty shell.
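Reading the step by eye works, but a scripted check gives an unambiguous pass or fail for the report. The following is a sketch under one stated assumption: that `checkpoint.json` stores the step as an integer under a key literally named `"step"`. The helper names are hypothetical, and the pattern needs adjusting if the actual schema differs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the integer stored under a "step" key, taken
# from the first line of the file that contains one.
# Assumes the form  "step": <integer>  somewhere in the file.
checkpoint_step() {
  sed -n 's/.*"step"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: succeed only when the file exists and its recorded
# step equals the expected target step.
checkpoint_reached() {
  local file="$1" want="$2" got
  [ -f "$file" ] || return 1
  got=$(checkpoint_step "$file")
  [ -n "$got" ] && [ "$got" -eq "$want" ]
}
```

A call such as `checkpoint_reached "$CHECKPOINT_FILE" 10`, with `CHECKPOINT_FILE` set to the checkpoint path given above, returns success only for a step of exactly 10. A missing file, a missing field, or any other step value all count as failure.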
Step 5: Mirror back
Make sure the final adapter state is accessible locally, not only on the primary writer host.
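One way to prove the mirrored copy is complete, rather than merely present, is to compare a checksum manifest of the run directory on the writer host against the local copy. This is a sketch, not the lane's established mirroring procedure: `dir_manifest` and `manifests_match` are hypothetical helpers built on the POSIX `cksum` utility.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one "<crc> <bytes> <relative-path>" line per
# regular file under a directory, sorted by path for stable comparison.
dir_manifest() {
  ( cd "$1" && find . -type f -exec cksum {} + | sort -k 3 )
}

# Hypothetical helper: succeed when two manifests are identical and
# non-empty, so that two empty directories never count as a valid mirror.
manifests_match() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}
```

Assuming passwordless SSH and bash on the writer host, the remote manifest can be captured with `remote=$({ declare -f dir_manifest; echo 'dir_manifest "$1"'; } | ssh <writer-host> bash -s -- <run-dir>)` and then compared with `manifests_match "$remote" "$(dir_manifest <local-run-dir>)"`. The copy itself could be done with something like `rsync -a <writer-host>:<run-dir>/ <local-run-dir>/` if the lane has no mirroring step of its own. The host and directory placeholders are left unfilled deliberately.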
You do not need to evaluate the model in this handoff unless the run completed instantly and there is zero ambiguity about success. The checkpoint boundary is the success gate for this task.
Success Definition
This task is successful only if all of the following are true:
- The partial-real dataset is verified on local, `mac4`, and `mac5`.
- The Gemma cache on `mac4` is confirmed complete enough that the run no longer blocks on model download.
- The bounded command `./run_gemma4_e2b_nko_correction_chunks.sh partial_real_stage1_stable 10 10` completes.
- The file `[home-path]` reaches step `10`.
- The resulting run artifact is available locally after the run, not stranded on one remote host.
Failure Triage
Do not improvise profile changes until the failure mode is concrete.
If the run stalls before training logs begin, the likely remaining problem is still cache or model materialization. Prove that first.
If rank 1 cannot see the dataset path, the mirror is incomplete. Fix dataset sync, not training.
If the run starts training but never writes a checkpoint, determine whether it is a runtime crash, a distributed teardown issue, or a path/write issue. Do not call it success just because logs appeared.
If checkpoint step 10 lands and only the post-save distributed teardown is noisy, treat that as secondary unless it prevents artifact availability.
If the run fails after cache warmup and before step 10 for a real training/runtime reason, stop after recording the exact failure mode. Do not retune the profile in this handoff unless the failure is trivial and obviously environmental.
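The triage rules above amount to an ordered decision: check dataset visibility first, then whether training logs began, then the checkpoint state. The sketch below is one reading of those rules as a lookup, useful mainly for keeping the order straight. The function name, argument layout, and bucket labels are all hypothetical, and it does not cover the noisy-teardown case, which depends on artifact availability rather than on these inputs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map three observations to a triage bucket.
#   $1  yes/no  rank 1 can see the dataset path
#   $2  yes/no  training logs began
#   $3  step recorded in checkpoint.json, empty if no checkpoint was written
#   $4  target step (defaults to 10)
triage() {
  local dataset_visible="$1" logs_started="$2" step="${3:-}" target="${4:-10}"
  if [ "$dataset_visible" != yes ]; then
    echo "fix-dataset-sync"
  elif [ "$logs_started" != yes ]; then
    echo "check-cache-or-model-materialization"
  elif [ -z "$step" ]; then
    echo "classify-crash-teardown-or-write-path"
  elif [ "$step" -lt "$target" ]; then
    echo "record-runtime-failure-and-stop"
  else
    echo "checkpoint-ok-verify-mirror"
  fi
}
```

For example, `triage yes no` points back at the cache, while `triage yes yes 10` means the only remaining work is confirming the artifact is mirrored locally.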
Explicit Stop Conditions
Stop once one of these is true:
You have a verified step-10 checkpoint for the partial-real stable run and the adapter is mirrored back locally.
Or:
You have a concrete, reproducible failure mode after cache completeness is verified and the bounded run still cannot reach step 10.
Anything beyond that is a different task.
Deliverable Format Back To Main Lane
Report back with exactly this information:
- The dataset verification result on local, `mac4`, and `mac5`.
- The Gemma cache verification result on local, `mac4`, and `mac5`, including whether `mac4` was partial or complete.
- The exact command used to relaunch the bounded partial-real run.
- Whether `checkpoint.json` reached step `10`.
- The absolute path to the final run directory.
- Any remaining runtime limit or failure mode that still blocks the next window.
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
Comp-Core/docs/handoffs/codex-partial-real-thunder-handoff-2026-04-25.md