Grand Diomande Research · Full HTML Reader

Codex Handoff: Partial-Real Thunder Train Lane

This handoff is for one bounded job only: get the first completed partial-real local Thunder training window to finish cleanly and verify that it writes a real checkpoint. Do not expand scope beyond that.

Date: 2026-04-25

The system context matters because this lane sits inside a larger architecture. The canonical acoustic science lane is running separately on Vast and is responsible for the paper-grade CER claims. That lane is the same-snapshot Paper 4 matrix on the A100 instance and it should be treated as read-only from this handoff. The local Thunder lane is different. It is the AGP and Gemma correction-adapter lane. Gemma is not the authority model. Gemma is the bounded proposal model that only acts after the AGP partition router says a chunk deserves additional compute. Rust admissibility is still the final authority for whether a proposed correction is allowed.

The current goal is therefore narrow and practical. We already proved the mechanics on a synthetic correction set. We already proved that the launchers, chunk runner, and local distributed setup can work. The next real milestone is to prove that a first bounded training window can run on a partial-real correction corpus built from actual A100 replay outputs. This is the first time the local Gemma adapter lane will be training on real replay-derived correction rows rather than the small synthetic mechanics set.

Scope Boundary

You own only the partial-real Thunder lane.

You do own:

  • the local and remote Thunder dataset path for the partial-real corpus
  • the Gemma cache warm state on `mac4` and `mac5`
  • the partial-real Thunder launcher and chunk runner
  • the dedicated adapter output path for the partial-real run
  • checkpoint verification and artifact mirroring back to local

You do not own:

  • the Vast closeout watcher
  • the A100 instance lifecycle
  • the paper text or manuscript framing
  • the MAOE replay scripts, except to read field names or confirm the input format
  • routing logic redesign
  • changing the correction policy

If you discover a blocker outside your scope, record it precisely and stop there. Do not silently absorb adjacent systems into this task.

Current Architecture State

The partial-real correction corpus already exists. It was assembled from the four completed same-snapshot A100 runs that were already replayed locally. Those runs produced real bridge rows, and those bridge rows were converted into Gemma correction SFT data. That partial-real corpus has already been staged into Thunder Train and mirrored to the two compute hosts.

The local Thunder launch layer has already been patched so that this corpus is a named profile rather than a one-off path hack. The only reason the first bounded run did not proceed is that `mac4` had only a partial Hugging Face cache for `mlx-community/gemma-4-e2b-4bit`. `mac5` already had the full cache. The launch therefore stalled waiting on a cold model download on `mac4`. The correct fix was started but not yet verified through to the first checkpoint boundary.

Files And Paths That Matter

Dataset root

The partial-real SFT dataset should exist locally at:

`[home-path]`

This directory should contain at least:

  • `train.jsonl`
  • `valid.jsonl`
  • `test.jsonl`
  • `manifest.json`

The same path is expected on both compute hosts:

  • `mac4:[home-path]`
  • `mac5:[home-path]`

Launcher and runner

Relevant scripts:

  • `Desktop/Comp-Core/experiments/agp_mlx/train/launch_gemma4_e2b_nko_correction_thunder.sh`
  • `Desktop/Comp-Core/experiments/agp_mlx/train/run_gemma4_e2b_nko_correction_chunks.sh`
  • `Desktop/Comp-Core/experiments/agp_mlx/train/sync_gemma4_e2b_cache_to_thunder.sh`
  • `Desktop/Comp-Core/experiments/agp_mlx/train/README.md`

Named profiles already exist:

  • `partial_real_stage1_stable`
  • `partial_real_stage1`

Expected output path

The dedicated partial-real stable run should write here:

`[home-path]`

The specific checkpoint file that matters is:

`[home-path]`

Current known-success command

The exact bounded run to relaunch is:

```bash
./run_gemma4_e2b_nko_correction_chunks.sh partial_real_stage1_stable 10 10
```

Interpret that as: launch the `partial_real_stage1_stable` profile, advance to absolute step 10 in a 10-step window, and verify a real checkpoint lands.

What Is Already Done

These items are not the job anymore. Do not redo them unless you verify they are missing.

The partial-real corpus has already been promoted into Thunder Train.

The launchers have already been patched to expose the partial-real profile by name.

The synthetic mechanics lane has already proven the local chunked resume logic on the smaller correction set.

The current failure mode is not “training code broken.” The known failure mode is “one Thunder node is cold on the Gemma cache.”

Your Actual Task

You need to finish four concrete things:

First, verify the partial-real dataset is actually present and nontrivial on all three machines: local, `mac4`, and `mac5`.

Second, verify that the full `mlx-community/gemma-4-e2b-4bit` Hugging Face cache is present on `mac4` and `mac5`, and compare it against the known-good local cache. Do not trust directory existence alone. Confirm that `mac4` is no longer stuck with the partial cache snapshot of approximately 334 MB that previously caused the cold-download stall. The complete local cache is around 3.4 GB, and each remote cache should be materially similar in total size and file count.

Third, once cache completeness is proven, relaunch the bounded partial-real Thunder window using the exact command already selected. The target is not a long run. The target is the first clean completed stable window on the real corpus.

Fourth, verify that `checkpoint.json` reaches step 10 and that the run artifact is mirrored back locally so the state is not stranded on a single compute host.

Recommended Verification Sequence

Use this order so you do not conflate dataset, cache, and training failures.

Step 1: Dataset verification

Verify locally:

```bash
ls -lh [home-path]
wc -l [home-path]
wc -l [home-path]
wc -l [home-path]
```

Verify on `mac4` and `mac5` with the same path. You are not trying to remeasure the corpus scientifically here. You are proving that both distributed ranks can see the same real files.
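The presence check can be sketched as a small shell function. This is a minimal sketch, not part of the existing scripts: the function name `check_dataset_dir` and its output format are hypothetical, and the only facts taken from this handoff are the four expected file names.

```bash
# check_dataset_dir DIR
# Hypothetical helper: confirms that the four expected files exist and are
# non-empty, and prints a line count for each JSONL split.
# Returns non-zero if any file is missing or empty.
check_dataset_dir() {
    local dir="$1" status=0 f
    for f in train.jsonl valid.jsonl test.jsonl manifest.json; do
        if [ -s "$dir/$f" ]; then
            case "$f" in
                *.jsonl) printf '%s %s\n' "$f" "$(wc -l < "$dir/$f" | tr -d ' ')" ;;
                *)       printf '%s present\n' "$f" ;;
            esac
        else
            printf '%s MISSING_OR_EMPTY\n' "$f"
            status=1
        fi
    done
    return "$status"
}
```

Running the same function locally and on `mac4` and `mac5` (for example over `ssh`) and comparing the printed counts is one way to show that both ranks see the same files.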

Step 2: Gemma cache verification

Find the local cache root actually used by Thunder for `mlx-community/gemma-4-e2b-4bit`. Then compare local against `mac4` and `mac5`.

At minimum record:

  • the cache directory path
  • total size
  • whether the expected large weight files are present
  • whether `mac4` still looks partial compared to local
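The size and file-count comparison could look roughly like the sketch below. Both function names are hypothetical, and the 95 percent size threshold is an assumption chosen for illustration, not a value taken from the Thunder scripts.

```bash
# cache_summary DIR
# Prints "<size_kb> <file_count>" for a cache directory.
# du does not follow symlinks by default and find -type f skips them, so
# symlinked snapshot entries are not double counted.
cache_summary() {
    local dir="$1" kb files
    kb=$(du -sk "$dir" | awk '{print $1}')
    files=$(find "$dir" -type f | wc -l | tr -d ' ')
    printf '%s %s\n' "$kb" "$files"
}

# cache_looks_complete REF_KB REF_FILES KB FILES [MIN_PERCENT]
# Succeeds when the candidate has at least as many files as the reference
# and at least MIN_PERCENT (default 95, an assumed threshold) of its size.
cache_looks_complete() {
    local ref_kb="$1" ref_files="$2" kb="$3" files="$4" min_pct="${5:-95}"
    [ "$files" -ge "$ref_files" ] && [ $((kb * 100)) -ge $((ref_kb * min_pct)) ]
}
```

With the sizes quoted earlier, a snapshot of roughly 334 MB measured against a reference of roughly 3.4 GB is about a tenth of the expected size and would fail this check by a wide margin. Passing the check does not prove the weights are intact; it only rules out the obviously partial case.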

If `sync_gemma4_e2b_cache_to_thunder.sh` exists for exactly this fix, use it rather than inventing a new sync pattern. If it fails, collect the exact failure and only then fall back to a manual copy.

Step 3: Relaunch bounded run

Run from the intended training directory and do not alter the profile:

```bash
./run_gemma4_e2b_nko_correction_chunks.sh partial_real_stage1_stable 10 10
```

You are not tuning the profile in this handoff. You are only proving that the known selected stable profile can reach its first real checkpoint on the partial-real corpus once model cache warmup is no longer the bottleneck.
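Because the deliverable asks for the exact failure mode if the run does not reach step 10, it may help to capture the full log and the exit status of the relaunch. The wrapper below is a minimal sketch: `run_logged` is a hypothetical name, and it runs the command unchanged without touching the profile.

```bash
# run_logged LOGFILE COMMAND [ARGS...]
# Runs COMMAND as given, copies its stdout and stderr to LOGFILE while still
# printing them, appends the exit status to the log, and returns that status.
run_logged() {
    local log="$1" rcfile rc
    shift
    rcfile=$(mktemp)
    # The command runs on the left side of a pipe, so its exit status is
    # handed back through a temporary file rather than through $?.
    {
        if "$@" 2>&1; then
            echo 0 > "$rcfile"
        else
            echo "$?" > "$rcfile"
        fi
    } | tee "$log"
    rc=$(cat "$rcfile")
    rm -f "$rcfile"
    printf 'exit_status=%s\n' "$rc" >> "$log"
    return "$rc"
}
```

A possible invocation, with an illustrative log file name, is `run_logged partial_real_stage1_stable.log ./run_gemma4_e2b_nko_correction_chunks.sh partial_real_stage1_stable 10 10`.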

Step 4: Checkpoint verification

Confirm:

```bash
cat [home-path]
```

You want the step field to show `10`.

Also verify the run directory contains the expected adapter and metadata files, not just an empty shell.
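Both checks can be scripted instead of read by eye. The sketch below assumes that `checkpoint.json` is a JSON object with a top-level `step` key and that `python3` is on the path; neither assumption is confirmed by this handoff, and both function names are hypothetical.

```bash
# checkpoint_step FILE
# Prints the value of the top-level "step" key.
# Assumes FILE is a JSON object with that key (not confirmed by the handoff).
checkpoint_step() {
    python3 -c 'import json, sys
with open(sys.argv[1]) as fh:
    print(json.load(fh)["step"])' "$1"
}

# run_dir_has_payload DIR
# Succeeds when DIR contains at least one non-empty file other than
# checkpoint.json, so the directory is not just an empty shell.
run_dir_has_payload() {
    [ -n "$(find "$1" -type f -size +0 ! -name checkpoint.json | head -n 1)" ]
}
```

The gate for this task would then be that `checkpoint_step` prints `10` for the checkpoint file and `run_dir_has_payload` succeeds on the run directory. The payload check only proves that something was written; it does not validate the adapter files themselves.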

Step 5: Mirror back

Make sure the final adapter state is accessible locally, not only on the primary writer host.
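Once the run directory has been copied back by whatever means the lane already uses (`rsync -a` is one common option), a checksum comparison can confirm that the local copy matches the source. This is a sketch under stated assumptions: the function names are hypothetical, and `mirrors_match` expects both directories to be reachable as local paths.

```bash
# tree_fingerprint DIR
# Prints one "<crc> <bytes> <relative path>" line per regular file under DIR,
# sorted by path, so two copies of the same tree produce identical output.
tree_fingerprint() {
    (cd "$1" && find . -type f -exec cksum {} + | sort -k 3)
}

# mirrors_match SRC DST
# Succeeds only when SRC contains at least one file and both trees have the
# same files with the same checksums and sizes.
mirrors_match() {
    local a b
    a=$(tree_fingerprint "$1")
    b=$(tree_fingerprint "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

When the source stays on the writer host, `tree_fingerprint` can be run there and locally, and the two outputs compared with `diff`.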

You do not need to evaluate the model in this handoff unless the run completed instantly and there is zero ambiguity about success. The checkpoint boundary is the success gate for this task.

Success Definition

This task is successful only if all of the following are true:

The partial-real dataset is verified on local, `mac4`, and `mac5`.

The Gemma cache on `mac4` is confirmed complete enough that the run no longer blocks on model download.

The bounded command `./run_gemma4_e2b_nko_correction_chunks.sh partial_real_stage1_stable 10 10` completes.

The file `[home-path]` reaches step `10`.

The resulting run artifact is available locally after the run, not stranded on one remote host.

Failure Triage

Do not improvise profile changes until the failure mode is concrete.

If the run stalls before training logs begin, the likely remaining problem is still cache or model materialization. Prove that first.

If rank 1 cannot see the dataset path, the mirror is incomplete. Fix dataset sync, not training.

If the run starts training but never writes a checkpoint, determine whether it is a runtime crash, a distributed teardown issue, or a path/write issue. Do not call it success just because logs appeared.

If checkpoint step 10 lands and only the post-save distributed teardown is noisy, treat that as secondary unless it prevents artifact availability.

If the run fails after cache warmup and before step 10 for a real training/runtime reason, stop after recording the exact failure mode. Do not retune the profile in this handoff unless the failure is trivial and obviously environmental.

Explicit Stop Conditions

Stop once one of these is true:

You have a verified step-10 checkpoint for the partial-real stable run and the adapter is mirrored back locally.

Or:

You have a concrete, reproducible failure mode after cache completeness is verified and the bounded run still cannot reach step 10.

Anything beyond that is a different task.

Deliverable Format Back To Main Lane

Report back with exactly this information:

The dataset verification result on local, `mac4`, and `mac5`.

The Gemma cache verification result on local, `mac4`, and `mac5`, including whether `mac4` was partial or complete.

The exact command used to relaunch the bounded partial-real run.

Whether `checkpoint.json` reached step `10`.

The absolute path to the final run directory.

Any remaining runtime limit or failure mode that still blocks the next window.

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

Comp-Core/docs/handoffs/codex-partial-real-thunder-handoff-2026-04-25.md
