# Codex Handoff: N'Ko ASR Vast.ai Pipeline Monitor
**Created:** 2026-04-04 14:50 UTC
**Author:** Claude (Mac1 bottom-right pane)
**For:** Codex agent taking over monitoring duties
---
## What Is Running Right Now
Two CTC inference jobs are transcribing 32,826 Djoko soap opera audio segments on a Vast.ai RTX 4090 GPU. A watchdog monitors both jobs and will auto-launch a 3-step chain when they finish.
- Instance: 34097422 (label: `djoko-batch-v3`)
- Machine: RTX 4090, 16 vCPUs, 127GB RAM, 100GB disk (53GB used)
- Cost: $0.3337/hr
- SSH: `[ssh command redacted]`
### Progress at handoff time
- N'Ko: 16,128 / 32,826 (49.1%)
- Latin: 16,129 / 32,826 (49.1%)
- Rate: ~1.4 segments/second per job
- ETA: ~3.2 hours from handoff (~6pm local)
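
The percentages and ETA above follow directly from the counts and the observed rate. A minimal sketch of that arithmetic, assuming each job keeps running at roughly 1.4 segments/second (the function names are illustrative and not part of the pipeline):

```python
def percent_done(done: int, total: int) -> float:
    """Completed share of the segments, as a percentage."""
    return 100.0 * done / total


def eta_hours(done: int, total: int, rate_per_sec: float) -> float:
    """Hours left for one job at a constant processing rate."""
    remaining = total - done
    return remaining / rate_per_sec / 3600.0


if __name__ == "__main__":
    total = 32_826
    for name, done in [("nko", 16_128), ("latin", 16_129)]:
        print(f"{name}: {percent_done(done, total):.1f}% done, "
              f"~{eta_hours(done, total, 1.4):.1f} h left")
```

At exactly 1.4 segments/second this gives about 3.3 hours per job, in line with the ~3.2 hour estimate given that the rate is approximate. The two jobs run in parallel, so the overall ETA is the slower job's ETA and not the sum.
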
---
## How to Check Status
### Quick one-liner
[ssh command redacted]
### Full status
[ssh command redacted]
echo "=== PROGRESS ==="
wc -l /workspace/nko_transcriptions.jsonl /workspace/latin_transcriptions.jsonl
echo "=== GPU ==="
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader
echo "=== PROCESSES ==="
ps aux | grep vastai_transcribe | grep -v grep | awk '{print $2, $NF}'
echo "=== WATCHDOG (last 5) ==="
tail -5 /workspace/watchdog.log
echo "=== CHAIN STATUS ==="
ls /workspace/.nko_done /workspace/.latin_done /workspace/.chain_done 2>/dev/null || echo "not done yet"
cat /workspace/chain_results.json 2>/dev/null || echo "no chain results yet"
EOF
### Attach to live tmux sessions
# Watch N'Ko inference live
[ssh command redacted]
# Watch Latin inference live
[ssh command redacted]
# Watch watchdog
[ssh command redacted]
# Detach from tmux: Ctrl+B then D

---
## Architecture of What's Running
3 tmux sessions on the Vast.ai instance:
| Session | Process | Purpose |
|---|---|---|
| `nko_infer` | `vastai_transcribe.py --script nko --resume` | N'Ko CTC inference on 32,826 Djoko WAV segments |
| `latin_infer` | `vastai_transcribe.py --script latin --resume` | Latin CTC inference on same 32,826 segments |
| `watchdog` | `watchdog.sh` | Checks every 60s. Restarts dead jobs. Launches chain when both done. |
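
The actual `watchdog.sh` is not reproduced in this handoff. As a rough sketch of the behaviour the table describes, the per-check decision can be written as a small pure function. Every name below (`JobState`, `job_action`, `watchdog_tick`, the action strings) is hypothetical and only illustrates the logic:

```python
from dataclasses import dataclass

TARGET_LINES = 32_826  # segments each job must transcribe


@dataclass
class JobState:
    lines_written: int   # e.g. from `wc -l` on the output JSONL
    process_alive: bool  # e.g. from `pgrep -f vastai_transcribe`


def job_action(job: JobState) -> str:
    """Decide what to do about one inference job on this check."""
    if job.lines_written >= TARGET_LINES:
        return "done"
    if not job.process_alive:
        return "restart"  # relaunch with --resume so finished segments are skipped
    return "wait"


def watchdog_tick(nko: JobState, latin: JobState, chain_started: bool) -> list[str]:
    """Return the actions for one 60-second check (hypothetical action names)."""
    actions = []
    for name, job in (("nko", nko), ("latin", latin)):
        if job_action(job) == "restart":
            actions.append(f"restart:{name}")
    both_done = job_action(nko) == "done" and job_action(latin) == "done"
    if both_done and not chain_started:
        actions.append("launch_chain")
    return actions
```

A finished job whose process has exited is treated as done, not dead, so it is never restarted, and the chain is launched at most once.
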
The auto-chain (fires when both outputs reach 32,826 lines):
When the watchdog detects that both jobs are complete, it launches `chain.sh` in a new tmux session named `chain`. The chain runs 3 steps sequentially:
1. Consensus pipeline (`consensus_pipeline.py`)
   - Merges `nko_transcriptions.jsonl` + `latin_transcriptions.jsonl`
   - Scores each segment by CTC confidence + cross-script agreement
   - Outputs: `/workspace/consensus_pairs.jsonl`
   - Runtime: ~5-10 minutes (CPU)
2. Experiment G: TTT N'Ko (`ttt_eval.py --script nko`)
   - Test-Time Training: loads N'Ko checkpoint, processes Djoko speakers sequentially
   - Updates model weights after each utterance to measure per-speaker adaptation
   - Uses `/workspace/djoko_speakers.json` for speaker clustering
   - Outputs: `/workspace/exp_g_results_nko.json`
   - Runtime: ~30-60 minutes (GPU)
3. Experiment G: TTT Latin (`ttt_eval.py --script latin`)
   - Same as above but with Latin checkpoint
   - Outputs: `/workspace/exp_g_results_latin.json`
   - Runtime: ~30-60 minutes (GPU)

After step 3, a summary is written to `/workspace/chain_results.json` and the marker `/workspace/.chain_done` is created.
Total expected runtime from now: ~5-6 hours (inference + chain)
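
The scoring inside `consensus_pipeline.py` is not shown in this handoff. To make the merge step concrete, here is a minimal sketch under loud assumptions: the JSONL field names `segment`, `text`, and `confidence`, the `agreement` callback, the score formula, and the `min_score` threshold are all hypothetical stand-ins, not the pipeline's actual schema or method.

```python
import json
from typing import Callable, Iterable


def index_by_segment(lines: Iterable[str], key: str = "segment") -> dict:
    """Parse JSONL lines and index the records by segment key (assumed field)."""
    records = (json.loads(line) for line in lines if line.strip())
    return {rec[key]: rec for rec in records}


def merge_pairs(nko_lines: Iterable[str], latin_lines: Iterable[str],
                agreement: Callable[[str, str], float],
                min_score: float = 0.5) -> list:
    """Join both scripts per segment and keep pairs scoring at least min_score."""
    nko = index_by_segment(nko_lines)
    latin = index_by_segment(latin_lines)
    pairs = []
    # Only segments transcribed by both jobs can form a pair.
    for seg in sorted(nko.keys() & latin.keys()):
        n, l = nko[seg], latin[seg]
        # Illustrative score: mean CTC confidence scaled by cross-script agreement.
        mean_conf = 0.5 * (n["confidence"] + l["confidence"])
        score = mean_conf * agreement(n["text"], l["text"])
        if score >= min_score:
            pairs.append({"segment": seg, "nko": n["text"],
                          "latin": l["text"], "score": score})
    return pairs
```

The `agreement` function is left to the caller because comparing N'Ko and Latin output requires a transliteration step that this document does not describe.
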
---
## File Map on Instance
### Checkpoints (DO NOT DELETE)
| File | Size | What |
|------|------|------|
| `/workspace/nko_traj_best.pt` | 185MB | N'Ko trajectory CTC model (27.50%) |
| `/workspace/latin_best.pt` | 179MB | Latin baseline CTC model (31.43%) |
### Scripts
| File | Purpose |
|------|---------|
| `/workspace/vastai_transcribe.py` | Whisper encoder + CTC head inference. Supports `--script nko/latin`, `--resume`, `--batch-size` |
| `/workspace/consensus_pipeline.py` | Merges N'Ko + Latin transcriptions into consensus pairs |
| `/workspace/ttt_eval.py` | Experiment G: Test-Time Training evaluation |
| `/workspace/watchdog.sh` | Process monitor, auto-restarter, chain launcher |
| `/workspace/chain.sh` | Post-inference pipeline (consensus + TTT x2 + summary) |
### Data
| File/Dir | What |
|----------|------|
| `/workspace/djoko_audio/segments/` | 32,826 WAV files across 1,124 episode subdirectories |
| `/workspace/djoko_speakers.json` | Speaker diarization: speaker IDs mapped to segment paths |
### Outputs (grow during run)
| File | What | Target size |
|------|------|-------------|
| `/workspace/nko_transcriptions.jsonl` | N'Ko CTC transcriptions | 32,826 lines |
| `/workspace/latin_transcriptions.jsonl` | Latin CTC transcriptions | 32,826 lines |
| `/workspace/consensus_pairs.jsonl` | Consensus-filtered training pairs | Generated after inference |
| `/workspace/exp_g_results_nko.json` | TTT N'Ko results | Generated after consensus |
| `/workspace/exp_g_results_latin.json` | TTT Latin results | Generated after consensus |
| `/workspace/chain_results.json` | Summary of all chain outputs | Generated last |
### Logs
| File | What |
|------|------|
| `/workspace/nko_infer.log` | N'Ko inference stdout |
| `/workspace/latin_infer.log` | Latin inference stdout |
| `/workspace/watchdog.log` | Watchdog activity log |
| `/workspace/chain.log` | Chain pipeline stdout |
### Completion markers
| File | Meaning |
|------|---------|
| `/workspace/.nko_done` | N'Ko inference finished (not used in v2 watchdog, but may be created) |
| `/workspace/.latin_done` | Latin inference finished |
| `/workspace/.chain_done` | All chain steps complete |
---
## What Can Go Wrong and How to Fix It
### 1. A job dies and watchdog doesn't restart it
Symptom: `wc -l` on output file stops growing, `pgrep -f vastai_transcribe` returns nothing
Fix:
[ssh command redacted]
# Restart N'Ko
tmux kill-session -t nko_infer 2>/dev/null
tmux new-session -d -s nko_infer
tmux send-keys -t nko_infer "python3 /workspace/vastai_transcribe.py --checkpoint /workspace/nko_traj_best.pt --input /workspace/djoko_audio/segments/ --output /workspace/nko_transcriptions.jsonl --script nko --batch-size 8 --resume 2>&1 | tee -a /workspace/nko_infer.log" Enter
# Restart Latin
tmux kill-session -t latin_infer 2>/dev/null
tmux new-session -d -s latin_infer
tmux send-keys -t latin_infer "python3 /workspace/vastai_transcribe.py --checkpoint /workspace/latin_best.pt --input /workspace/djoko_audio/segments/ --output /workspace/latin_transcriptions.jsonl --script latin --batch-size 8 --resume 2>&1 | tee -a /workspace/latin_infer.log" Enter
EOF
The `--resume` flag reads the output JSONL and skips already-processed segments. Zero rework.
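
The resume behaviour described above can be sketched as follows. This is not the code in `vastai_transcribe.py`; it is a minimal illustration that assumes each output line is a JSON object carrying a `segment` field (a hypothetical field name) identifying the audio file:

```python
import json
from pathlib import Path


def already_processed(output_jsonl: Path, key: str = "segment") -> set:
    """Collect segment keys already present in the output JSONL (assumed field name)."""
    done = set()
    if not output_jsonl.exists():
        return done
    with output_jsonl.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)[key])
            except (json.JSONDecodeError, KeyError, TypeError):
                # A partially written last line (e.g. after a crash) is ignored,
                # so that segment is simply transcribed again.
                continue
    return done


def pending_segments(all_segments: list, output_jsonl: Path) -> list:
    """Segments still to transcribe, in their original order."""
    done = already_processed(output_jsonl)
    return [seg for seg in all_segments if seg not in done]
```

Skipping unparsable lines is a design choice in this sketch: a job killed mid-write leaves at most one truncated record, and re-transcribing that one segment is cheaper than failing the restart.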
### 2. Watchdog itself dies
Symptom: `tmux ls` doesn't show `watchdog`
Fix:
[ssh command redacted]
tmux new-session -d -s watchdog
tmux send-keys -t watchdog "/workspace/watchdog.sh" Enter
EOF
### 3. GPU out of memory (OOM)
Symptom: Process crashes with CUDA OOM in logs
Fix: Reduce batch size. Kill both, restart with `--batch-size 4`:
# Kill both
[ssh command redacted]
# Then restart with lower batch size (replace 8 with 4 in the commands above)
### 4. Instance gets preempted/stopped by Vast.ai
Symptom: SSH connection refused
Fix from Mac1:
vastai start instance 34097422
# Wait 1-2 min, then SSH in and check tmux sessions
# If tmux sessions are gone, restart everything manually (see fix #1 + #2)
### 5. Chain step fails (consensus or TTT)
Symptom: `/workspace/.chain_done` never appears. Check `/workspace/chain.log`.
Fix: Read `chain.log` for the error, fix it, then re-run:
[ssh command redacted]
`chain.sh` is idempotent (it doesn't delete previous outputs), so re-running it is safe.
### 6. Disk fills up
Symptom: Write errors in logs
Check: `ssh ... 'df -h / | tail -1'`
Current: 53GB used / 100GB total. Each output JSONL is ~2-3MB. Should be fine.
---
## When Everything Is Done
### How to know it's done
[ssh command redacted]
If this returns JSON with segment counts and TTT CER values, everything completed.
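
The same check can be scripted on the instance. The sketch below uses only the marker and summary paths listed in this document. The keys inside `chain_results.json` are not specified here, so the function (a hypothetical helper, not part of the pipeline) only confirms that the file parses as non-empty JSON:

```python
import json
from pathlib import Path


def chain_finished(workspace: Path = Path("/workspace")) -> bool:
    """True when the done marker exists and the summary parses as non-empty JSON."""
    marker = workspace / ".chain_done"
    summary = workspace / "chain_results.json"
    if not (marker.exists() and summary.exists()):
        return False
    try:
        return bool(json.loads(summary.read_text(encoding="utf-8")))
    except json.JSONDecodeError:
        # A summary that is still being written or was truncated counts as not done.
        return False
```
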
### Download results to Mac1
mkdir -p Desktop/nko-brain-scanner/results/vastai_djoko_full/
scp -i [home-path] -P 17422 [email]:/workspace/nko_transcriptions.jsonl \
Desktop/nko-brain-scanner/results/vastai_djoko_full/
scp -i [home-path] -P 17422 [email]:/workspace/latin_transcriptions.jsonl \
Desktop/nko-brain-scanner/results/vastai_djoko_full/
scp -i [home-path] -P 17422 [email]:/workspace/consensus_pairs.jsonl \
Desktop/nko-brain-scanner/results/vastai_djoko_full/
scp -i [home-path] -P 17422 [email]:/workspace/exp_g_results_nko.json \
Desktop/nko-brain-scanner/results/vastai_djoko_full/
scp -i [home-path] -P 17422 [email]:/workspace/exp_g_results_latin.json \
Desktop/nko-brain-scanner/results/vastai_djoko_full/
scp -i [home-path] -P 17422 [email]:/workspace/chain_results.json \
Desktop/nko-brain-scanner/results/vastai_djoko_full/
### Stop the instance (SAVE MONEY)
vastai stop instance 34097422
Data persists on stopped instances. Only delete the instance if you're sure you have everything.

---
## What Comes After This
Once results are downloaded, the next tasks in priority order:
1. Task #34: Retrain trajectory CTC on 297K + Djoko consensus
   - Combine existing 297K training pairs + new consensus_pairs.jsonl
   - Train on Vast.ai (needs 297K features from instance 33981290 or re-extract)
   - Expected: lower CER with more diverse training data
2. Task #31: Gemma 4 scene analysis on Vast.ai GPU
   - Visual scene context from Djoko video frames
   - Can run on same instance or a new one
3. OCR pipeline (not yet started)
   - synth_data_gen.py exists but 0 images generated
   - train_trocr_nko.py exists, no checkpoint
   - Needs Vast.ai GPU for image generation + training
4. Task #35: Paper 5 draft
   - Needs Exp G results (from chain) + consensus methodology writeup

---
## Key Numbers to Verify
| Metric | Expected |
|---|---|
| nko_transcriptions.jsonl lines | 32,826 |
| latin_transcriptions.jsonl lines | 32,826 |
| N'Ko non-empty rate | ~99.5% |
| Latin non-empty rate | ~100% |
| consensus_pairs.jsonl lines | Unknown (depends on agreement threshold) |
| Instance cost for full run | ~$2-3 total |
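
The line counts and non-empty rates in the table can be checked after download with a short script. This is a sketch, not an existing project tool: `transcription_stats` is a hypothetical helper, and it assumes each JSONL record stores its transcription under a `text` field, which this document does not confirm.

```python
import json
from pathlib import Path

EXPECTED_LINES = 32_826  # target from the table above


def transcription_stats(path: Path, text_key: str = "text") -> dict:
    """Count JSONL records and the share whose transcription is non-empty."""
    total = non_empty = 0
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            total += 1
            text = json.loads(line).get(text_key) or ""
            if str(text).strip():
                non_empty += 1
    rate = 100.0 * non_empty / total if total else 0.0
    return {"lines": total, "non_empty_pct": rate,
            "complete": total == EXPECTED_LINES}
```

Running it on both `nko_transcriptions.jsonl` and `latin_transcriptions.jsonl` gives the numbers to compare against the first four rows of the table.
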
---
## Vast.ai CLI Reference
# List all instances
vastai show instances
# Start/stop
vastai start instance 34097422
vastai stop instance 34097422
# Get SSH URL
vastai ssh-url 34097422
# Check balance
vastai show user

---
## Contact
If something breaks that isn't covered above, the full conversation context is at:
`[home]/.claude/projects/-Users-mohameddiomande/7a56c132-6c3e-4a60-bbaa-926b93a5c744.jsonl`
Key files on Mac1:
- `[home]/Desktop/nko-brain-scanner/asr/vastai_transcribe.py` (local copy of inference script)
- `[home]/Desktop/nko-brain-scanner/asr/consensus_pipeline.py` (local copy of consensus)
- `[home]/Desktop/nko-brain-scanner/experiments/B_script_advantage_ctc/vastai_results_297k/` (all 297K training results + checkpoints)
Source Anchor
nko-brain-scanner/HANDOFF_CODEX.md