Grand Diomande Research · Full HTML Reader

# Codex Handoff: N'Ko ASR Vast.ai Pipeline Monitor

**Created:** 2026-04-04 14:50 UTC **Author:** Claude (Mac1 bottom-right pane) **For:** Codex agent taking over monitoring duties

---

## What Is Running Right Now

Two CTC inference jobs are transcribing 32,826 Djoko soap opera audio segments on a Vast.ai RTX 4090 GPU. A watchdog monitors both jobs and will auto-launch a 3-step chain when they finish.

- Instance: 34097422 (label: `djoko-batch-v3`)
- Machine: RTX 4090, 16 vCPUs, 127GB RAM, 100GB disk (53GB used)
- Cost: $0.3337/hr
- SSH: `[ssh command redacted]`

### Progress at handoff time
- N'Ko: 16,128 / 32,826 (49.1%)
- Latin: 16,129 / 32,826 (49.1%)
- Rate: ~1.4 segments/second per job
- ETA: ~3.2 hours from handoff (~6pm local)
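
The ETA can be rechecked from the counts above. This is a minimal sketch that assumes the rate stays at a steady 1.4 segments/second per job; the function name `eta_hours` is illustrative and not part of the pipeline.

```python
def eta_hours(done: int, total: int, rate_per_sec: float) -> float:
    """Hours left for one job, given segments done and a steady rate."""
    remaining = total - done
    return remaining / rate_per_sec / 3600


# Counts and rate taken from the progress list at handoff time.
nko_eta = eta_hours(16_128, 32_826, 1.4)
latin_eta = eta_hours(16_129, 32_826, 1.4)
print(f"N'Ko: {nko_eta:.1f} h, Latin: {latin_eta:.1f} h")
```

At exactly 1.4 segments/second this gives roughly 3.3 hours per job, in line with the ~3.2 hour estimate given that the rate is approximate. Re-run it with fresh `wc -l` counts to update the ETA.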

---

## How to Check Status

### Quick one-liner

```bash
[ssh command redacted]
```

### Full status

```bash
[ssh command redacted]
echo "=== PROGRESS ==="
wc -l /workspace/nko_transcriptions.jsonl /workspace/latin_transcriptions.jsonl
echo "=== GPU ==="
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader
echo "=== PROCESSES ==="
ps aux | grep vastai_transcribe | grep -v grep | awk '{print $2, $NF}'
echo "=== WATCHDOG (last 5) ==="
tail -5 /workspace/watchdog.log
echo "=== CHAIN STATUS ==="
ls /workspace/.nko_done /workspace/.latin_done /workspace/.chain_done 2>/dev/null || echo "not done yet"
cat /workspace/chain_results.json 2>/dev/null || echo "no chain results yet"
EOF
```

### Attach to live tmux sessions

```bash
# Watch N'Ko inference live
[ssh command redacted]

# Watch Latin inference live
[ssh command redacted]

# Watch watchdog
[ssh command redacted]

# Detach from tmux: Ctrl+B then D
```

---

## Architecture of What's Running

3 tmux sessions on the Vast.ai instance:

| Session | Process | Purpose |
|---------|---------|---------|
| `nko_infer` | `vastai_transcribe.py --script nko --resume` | N'Ko CTC inference on 32,826 Djoko WAV segments |
| `latin_infer` | `vastai_transcribe.py --script latin --resume` | Latin CTC inference on the same 32,826 segments |
| `watchdog` | `watchdog.sh` | Checks every 60s. Restarts dead jobs. Launches the chain when both are done. |
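
The watchdog's behavior as described above (restart dead jobs, launch the chain once both outputs are complete) can be sketched as a pure decision function. This is a Python rendering for illustration only: the real `watchdog.sh` is a shell script whose contents are not shown here, and `JobState`, `decide`, and the action strings are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

TOTAL_SEGMENTS = 32_826  # target line count per output JSONL


@dataclass
class JobState:
    name: str        # tmux session name, e.g. "nko_infer"
    lines_done: int  # line count of the job's output JSONL
    alive: bool      # whether the inference process is still running


def decide(jobs: List[JobState], chain_started: bool) -> List[str]:
    """Return the actions one watchdog tick would take (assumed logic)."""
    actions = []
    for job in jobs:
        finished = job.lines_done >= TOTAL_SEGMENTS
        # A job that is neither finished nor running has died.
        if not finished and not job.alive:
            actions.append(f"restart:{job.name}")
    all_done = all(j.lines_done >= TOTAL_SEGMENTS for j in jobs)
    # Launch the chain once, and only when every job is complete.
    if all_done and not chain_started:
        actions.append("launch:chain")
    return actions
```

Separating the decision from the side effects (tmux commands) makes it possible to reason about a tick, such as one dead job and one healthy job, without touching the instance.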

**The auto-chain (fires when both hit 32,826):**

The watchdog detects that both jobs are complete and launches `chain.sh` in a new tmux session named `chain`. The chain runs 3 steps sequentially, then writes a summary:

1. Consensus pipeline (`consensus_pipeline.py`)
   - Merges `nko_transcriptions.jsonl` + `latin_transcriptions.jsonl`
   - Scores each segment by CTC confidence + cross-script agreement
   - Outputs: `/workspace/consensus_pairs.jsonl`
   - Runtime: ~5-10 minutes (CPU)

2. Experiment G: TTT N'Ko (`ttt_eval.py --script nko`)
   - Test-Time Training: loads the N'Ko checkpoint and processes Djoko speakers sequentially
   - Updates model weights after each utterance to measure per-speaker adaptation
   - Uses `/workspace/djoko_speakers.json` for speaker clustering
   - Outputs: `/workspace/exp_g_results_nko.json`
   - Runtime: ~30-60 minutes (GPU)

3. Experiment G: TTT Latin (`ttt_eval.py --script latin`)
   - Same as above but with the Latin checkpoint
   - Outputs: `/workspace/exp_g_results_latin.json`
   - Runtime: ~30-60 minutes (GPU)

4. Summary written to `/workspace/chain_results.json`, marker `/workspace/.chain_done` created.
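
The consensus step above (merge both transcription files, then score by confidence and cross-script agreement) might look roughly like the sketch below. Everything specific here is an assumption, not the contents of `consensus_pipeline.py`: the record fields `segment`, `text`, and `confidence`, the caller-supplied `to_latin` transliteration function, the score formula (minimum confidence times string similarity), and the `threshold` default.

```python
import json
from difflib import SequenceMatcher


def load_by_id(lines):
    """Index JSONL records by their (assumed) `segment` identifier."""
    return {rec["segment"]: rec for rec in map(json.loads, lines)}


def consensus(nko_lines, latin_lines, to_latin, threshold=0.6):
    """Keep segments where both scripts are confident and agree."""
    nko, latin = load_by_id(nko_lines), load_by_id(latin_lines)
    pairs = []
    # Only segments present in both outputs can be compared.
    for seg in sorted(nko.keys() & latin.keys()):
        a, b = nko[seg], latin[seg]
        # Agreement: similarity of the transliterated N'Ko text to the Latin text.
        agreement = SequenceMatcher(None, to_latin(a["text"]), b["text"]).ratio()
        score = min(a["confidence"], b["confidence"]) * agreement
        if score >= threshold:
            pairs.append({"segment": seg, "nko": a["text"],
                          "latin": b["text"], "score": round(score, 3)})
    return pairs
```

Because the number of surviving pairs depends directly on the threshold, the final line count of `consensus_pairs.jsonl` cannot be predicted before the run.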

Total expected runtime from now: ~5-6 hours (inference + chain)
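
The runtime and cost figures can be cross-checked from the numbers in this document. This is a worked-arithmetic sketch; it assumes the step runtimes quoted above hold, that both inference jobs run in parallel at 1.4 segments/second for the whole run, and that the instance is billed only for this job.

```python
HOURLY_RATE_USD = 0.3337      # instance cost from this document
TOTAL_SEGMENTS = 32_826
RATE_PER_SEC = 1.4            # per job; both jobs run in parallel
INFERENCE_REMAINING_H = 3.2   # ETA quoted at handoff
# (low, high) minutes for consensus, TTT N'Ko, TTT Latin
CHAIN_STEPS_MIN = [(5, 10), (30, 60), (30, 60)]

chain_low_h = sum(lo for lo, _ in CHAIN_STEPS_MIN) / 60
chain_high_h = sum(hi for _, hi in CHAIN_STEPS_MIN) / 60

# Time left from handoff: remaining inference plus the chain.
remaining_low_h = INFERENCE_REMAINING_H + chain_low_h
remaining_high_h = INFERENCE_REMAINING_H + chain_high_h

# Cost of the whole run: full inference pass plus the chain.
full_inference_h = TOTAL_SEGMENTS / RATE_PER_SEC / 3600
cost_low = (full_inference_h + chain_low_h) * HOURLY_RATE_USD
cost_high = (full_inference_h + chain_high_h) * HOURLY_RATE_USD

print(f"Remaining: {remaining_low_h:.1f}-{remaining_high_h:.1f} h")
print(f"Full-run cost: ${cost_low:.2f}-${cost_high:.2f}")
```

Summing the quoted steps gives about 4.3 to 5.4 hours remaining, so the ~5-6 hour figure leaves some margin for restarts. The full-run cost comes out near $2.5 to $2.9.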

---

## File Map on Instance

### Checkpoints (DO NOT DELETE)
| File | Size | What |
|------|------|------|
| `/workspace/nko_traj_best.pt` | 185MB | N'Ko trajectory CTC model (27.50%) |
| `/workspace/latin_best.pt` | 179MB | Latin baseline CTC model (31.43%) |

### Scripts
| File | Purpose |
|------|---------|
| `/workspace/vastai_transcribe.py` | Whisper encoder + CTC head inference. Supports `--script nko/latin`, `--resume`, `--batch-size` |
| `/workspace/consensus_pipeline.py` | Merges N'Ko + Latin transcriptions into consensus pairs |
| `/workspace/ttt_eval.py` | Experiment G: Test-Time Training evaluation |
| `/workspace/watchdog.sh` | Process monitor, auto-restarter, chain launcher |
| `/workspace/chain.sh` | Post-inference pipeline (consensus + TTT x2 + summary) |

### Data
| File/Dir | What |
|----------|------|
| `/workspace/djoko_audio/segments/` | 32,826 WAV files across 1,124 episode subdirectories |
| `/workspace/djoko_speakers.json` | Speaker diarization: speaker IDs mapped to segment paths |

### Outputs (grow during run)
| File | What | Target size |
|------|------|-------------|
| `/workspace/nko_transcriptions.jsonl` | N'Ko CTC transcriptions | 32,826 lines |
| `/workspace/latin_transcriptions.jsonl` | Latin CTC transcriptions | 32,826 lines |
| `/workspace/consensus_pairs.jsonl` | Consensus-filtered training pairs | Generated after inference |
| `/workspace/exp_g_results_nko.json` | TTT N'Ko results | Generated after consensus |
| `/workspace/exp_g_results_latin.json` | TTT Latin results | Generated after consensus |
| `/workspace/chain_results.json` | Summary of all chain outputs | Generated last |

### Logs
| File | What |
|------|------|
| `/workspace/nko_infer.log` | N'Ko inference stdout |
| `/workspace/latin_infer.log` | Latin inference stdout |
| `/workspace/watchdog.log` | Watchdog activity log |
| `/workspace/chain.log` | Chain pipeline stdout |

### Completion markers
| File | Meaning |
|------|---------|
| `/workspace/.nko_done` | N'Ko inference finished (not used in v2 watchdog, but may be created) |
| `/workspace/.latin_done` | Latin inference finished |
| `/workspace/.chain_done` | All chain steps complete |

---

## What Can Go Wrong and How to Fix It

### 1. A job dies and watchdog doesn't restart it
Symptom: `wc -l` on the output file stops growing and `pgrep -f vastai_transcribe` returns nothing.

Fix:

```bash
[ssh command redacted]
# Restart N'Ko
tmux kill-session -t nko_infer 2>/dev/null
tmux new-session -d -s nko_infer
tmux send-keys -t nko_infer "python3 /workspace/vastai_transcribe.py --checkpoint /workspace/nko_traj_best.pt --input /workspace/djoko_audio/segments/ --output /workspace/nko_transcriptions.jsonl --script nko --batch-size 8 --resume 2>&1 | tee -a /workspace/nko_infer.log" Enter

# Restart Latin
tmux kill-session -t latin_infer 2>/dev/null
tmux new-session -d -s latin_infer
tmux send-keys -t latin_infer "python3 /workspace/vastai_transcribe.py --checkpoint /workspace/latin_best.pt --input /workspace/djoko_audio/segments/ --output /workspace/latin_transcriptions.jsonl --script latin --batch-size 8 --resume 2>&1 | tee -a /workspace/latin_infer.log" Enter
EOF
```

The `--resume` flag reads the output JSONL and skips already-processed segments, so a restart repeats no work.
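
The resume behavior described here can be sketched as follows. This is not the code in `vastai_transcribe.py`; the field name `segment` and the helper names `processed_ids` and `pending` are assumptions chosen for illustration.

```python
import json
from pathlib import Path
from typing import List, Set


def processed_ids(output_path: Path) -> Set[str]:
    """Collect segment IDs already written to the output JSONL."""
    done: Set[str] = set()
    if not output_path.exists():
        return done
    with output_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["segment"])
            except (json.JSONDecodeError, KeyError, TypeError):
                # A crash can leave a partially written last line; skip it
                # so that segment is simply processed again.
                continue
    return done


def pending(all_segments: List[str], output_path: Path) -> List[str]:
    """Return the segments that still need inference, in original order."""
    done = processed_ids(output_path)
    return [s for s in all_segments if s not in done]
```

Tolerating a truncated final line matters for the crash cases in this section: a job killed mid-write should resume cleanly rather than fail on its own output file.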

### 2. Watchdog itself dies
Symptom: `tmux ls` doesn't show `watchdog`.

Fix:

```bash
[ssh command redacted]
tmux new-session -d -s watchdog
tmux send-keys -t watchdog "/workspace/watchdog.sh" Enter
EOF
```

### 3. GPU out of memory (OOM)
Symptom: Process crashes with CUDA OOM in the logs.

Fix: Reduce the batch size. Kill both jobs, then restart with `--batch-size 4`:

```bash
# Kill both
[ssh command redacted]
# Then restart with lower batch size (replace 8 with 4 in the commands above)
```

### 4. Instance gets preempted/stopped by Vast.ai
Symptom: SSH connection refused.

Fix from Mac1:

```bash
vastai start instance 34097422
# Wait 1-2 min, then SSH in and check tmux sessions
# If tmux sessions are gone, restart everything manually (see fix #1 + #2)
```

### 5. Chain step fails (consensus or TTT)
Symptom: `/workspace/.chain_done` never appears. Check `/workspace/chain.log`.

Fix: Read `chain.log` for the error, fix it, then re-run:

```bash
[ssh command redacted]
```

`chain.sh` is safe to re-run: it does not delete previous outputs.

### 6. Disk fills up
Symptom: Write errors in the logs.

Check: `ssh ... 'df -h / | tail -1'`

Current: 53GB used / 100GB total. Each output JSONL is ~2-3MB, so disk space should be fine.
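
If a scripted check is preferred over reading `df` output, the standard library can report the same information. This is a minimal sketch meant to run on the instance; the 90% warning threshold and the name `disk_report` are arbitrary choices for illustration.

```python
import shutil


def disk_report(path: str = "/", warn_fraction: float = 0.9) -> dict:
    """Summarize disk usage for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_fraction = usage.used / usage.total
    return {
        "used_gb": round(usage.used / 1e9, 1),
        "total_gb": round(usage.total / 1e9, 1),
        "used_fraction": round(used_fraction, 3),
        "warn": used_fraction >= warn_fraction,
    }


print(disk_report("/"))
```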

---

## When Everything Is Done

### How to know it's done

```bash
[ssh command redacted]
```

If this returns JSON with segment counts and TTT CER values, everything completed.
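
The same completion check can be scripted. The sketch below treats the run as done only when the `.chain_done` marker exists and `chain_results.json` parses as JSON. It makes no assumption about the keys inside the summary, and `chain_finished` is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def chain_finished(workspace: Path) -> Tuple[bool, Optional[dict]]:
    """Return (done, results); done requires the marker and a valid summary."""
    marker = workspace / ".chain_done"
    summary = workspace / "chain_results.json"
    if not marker.exists() or not summary.exists():
        return False, None
    try:
        return True, json.loads(summary.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        # The summary exists but is incomplete or corrupt.
        return False, None
```

On the instance this would be called as `chain_finished(Path("/workspace"))`. Checking both files guards against a summary that was only partially written when a step failed.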

### Download results to Mac1

```bash
# Paths are anchored at the home directory so the commands work from any cwd
mkdir -p ~/Desktop/nko-brain-scanner/results/vastai_djoko_full/

scp -i [home-path] -P 17422 [email]:/workspace/nko_transcriptions.jsonl \
    ~/Desktop/nko-brain-scanner/results/vastai_djoko_full/

scp -i [home-path] -P 17422 [email]:/workspace/latin_transcriptions.jsonl \
    ~/Desktop/nko-brain-scanner/results/vastai_djoko_full/

scp -i [home-path] -P 17422 [email]:/workspace/consensus_pairs.jsonl \
    ~/Desktop/nko-brain-scanner/results/vastai_djoko_full/

scp -i [home-path] -P 17422 [email]:/workspace/exp_g_results_nko.json \
    ~/Desktop/nko-brain-scanner/results/vastai_djoko_full/

scp -i [home-path] -P 17422 [email]:/workspace/exp_g_results_latin.json \
    ~/Desktop/nko-brain-scanner/results/vastai_djoko_full/

scp -i [home-path] -P 17422 [email]:/workspace/chain_results.json \
    ~/Desktop/nko-brain-scanner/results/vastai_djoko_full/
```

### Stop the instance (SAVE MONEY)

```bash
vastai stop instance 34097422
```

Data persists on stopped instances. Only delete the instance once you're sure you have everything.

---

## What Comes After This

Once results are downloaded, the next tasks are, in priority order:

1. Task #34: Retrain trajectory CTC on 297K + Djoko consensus
   - Combine the existing 297K training pairs + new `consensus_pairs.jsonl`
   - Train on Vast.ai (needs the 297K features from instance 33981290, or re-extract them)
   - Expected: lower CER with more diverse training data

2. Task #31: Gemma 4 scene analysis on Vast.ai GPU
   - Visual scene context from Djoko video frames
   - Can run on the same instance or a new one

3. OCR pipeline (not yet started)
   - `synth_data_gen.py` exists but 0 images generated
   - `train_trocr_nko.py` exists, no checkpoint
   - Needs a Vast.ai GPU for image generation + training

4. Task #35: Paper 5 draft
   - Needs Exp G results (from the chain) + consensus methodology writeup

---

## Key Numbers to Verify

| Metric | Expected |
|--------|----------|
| `nko_transcriptions.jsonl` lines | 32,826 |
| `latin_transcriptions.jsonl` lines | 32,826 |
| N'Ko non-empty rate | ~99.5% |
| Latin non-empty rate | ~100% |
| `consensus_pairs.jsonl` lines | Unknown (depends on agreement threshold) |
| Instance cost for full run | ~$2-3 total |
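
Line counts and non-empty rates can be computed from the downloaded files with a short script. This sketch assumes each JSONL record stores its transcription under a `text` key, which is a guess at the schema; adjust `text_key` to match the real output.

```python
import json
from typing import Iterable, Tuple


def jsonl_stats(lines: Iterable[str], text_key: str = "text") -> Tuple[int, float]:
    """Return (record_count, non_empty_rate) for JSONL transcription lines."""
    total = non_empty = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        total += 1
        record = json.loads(line)
        # Treat a missing, null, or whitespace-only transcription as empty.
        if (record.get(text_key) or "").strip():
            non_empty += 1
    rate = non_empty / total if total else 0.0
    return total, rate


sample = ['{"segment": "a.wav", "text": "hello"}',
          '{"segment": "b.wav", "text": ""}']
count, rate = jsonl_stats(sample)
print(f"{count} records, {rate:.1%} non-empty")
```

To check a real file, pass an open file handle, for example `jsonl_stats(open("nko_transcriptions.jsonl", encoding="utf-8"))`, and compare the result with the table above.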

---

## Vast.ai CLI Reference

```bash
# List all instances
vastai show instances

# Start/stop
vastai start instance 34097422
vastai stop instance 34097422

# Get SSH URL
vastai ssh-url 34097422

# Check balance
vastai show user
```

---

## Contact

If something breaks that isn't covered above, the full conversation context is at:
`[home]/.claude/projects/-Users-mohameddiomande/7a56c132-6c3e-4a60-bbaa-926b93a5c744.jsonl`

Key files on Mac1:
- `[home]/Desktop/nko-brain-scanner/asr/vastai_transcribe.py` (local copy of inference script)
- `[home]/Desktop/nko-brain-scanner/asr/consensus_pipeline.py` (local copy of consensus)
- `[home]/Desktop/nko-brain-scanner/experiments/B_script_advantage_ctc/vastai_results_297k/` (all 297K training results + checkpoints)

Source Anchor

nko-brain-scanner/HANDOFF_CODEX.md