V5 Training Session Handoff

Extracted abstract or opening context

# V5 Training Session Handoff

> Updated: 2026-03-21 00:11 EDT
> Purpose: Fresh session picks this up with full context, nothing lost

| Instance | What | Cost | SSH |
|----------|------|------|-----|
| Vast.ai 33195812 | Cognitive twin SFT/DPO training | $0.97/hr | `[ssh command redacted]` |
| Vast.ai 33248108 | V5 mel extraction (tmux "v5") | $0.93/hr | `[ssh command redacted]` |
| Mac4 monitor | V5 watch (LaunchAgent com.nko.v5-monitor) | free | `ssh -o IdentitiesOnly=yes -i [home-path] mac4` |

1. `prepare_v5_data.py` only loads HuggingFace datasets (afvoices + bam-asr-early = 259K samples)
2. It does NOT load any of the local JSONL data files (20K+ pairs on disk)
3. It does NOT run the YouTube OCR pipeline (490+ unprocessed videos)
4. Training script had `total_mem` instead of `total_memory` (fixed 3 times, kept reverting from stale SCPs)
5. Training launched with `--skip-features` but the script requires mel spectrograms to exist
6. Context compaction dropped details between tool calls, causing repeated mistakes

### Audio-paired data (for ASR training):

| Source | Samples | Quality | Location | Used in V5? |
|--------|---------|---------|----------|-------------|
| afvoices (HF) | 253,290 | Whisper-inferred | HuggingFace | YES |
| bam-asr-early (HF) | 37,306 | Whisper-inferred | HuggingFace | YES |
| Babamamadidiane OCR | 2,577 | **Ground truth** (Gemini vision) | `results/dynamic_ocr/dynamic_pairs.jsonl` | NO |
| Babamamadidiane features | 941 | Whisper-inferred | `results/feature_pairs_babamamadidiane.jsonl` | NO |
| Djoko pairs | 926 | Whisper-inferred | `results/quebec_djoko_pairs.jsonl` | NO |
| Texas pairs | 6,633 | Whisper-inferred | `results/vastai/texas_nko_pairs.jsonl` | NO |
| Texas transcriptions | 8,981 | Whisper-inferred | `results/vastai/texas_transcriptions.jsonl` | NO |
| Common Voice Bambara | 500 files | Human-verified | `data/common_voice_bm/audio/` | NO |
| **Total available** | **~311,000** | | | **259K used** |

### UNPROCESSED YouTube (highest value, ground-truth labels):

| Channel | Total videos | Processed | Remaining |
|---------|-------------|-----------|-----------|
| @babamamadidiane | 532 | ~40 | **490** |
| @mamadibabadiane1 | 58 | 0 | **58** |
| Djoko | partial | partial | unknown |
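Two of the failures listed above (local JSONL files never loaded, and `--skip-features` used without mel spectrograms on disk) could be caught by a pre-flight check before training starts. The following is a minimal sketch, not the project's actual tooling: the JSONL paths come from the table above, while `preflight`, `count_jsonl_records`, the `features_dir` argument, and the `*.npy` feature layout are assumptions made for illustration.

```python
import json
from pathlib import Path

# Local pair files listed in the data table (relative to the repo root).
LOCAL_JSONL = [
    "results/dynamic_ocr/dynamic_pairs.jsonl",
    "results/feature_pairs_babamamadidiane.jsonl",
    "results/quebec_djoko_pairs.jsonl",
    "results/vastai/texas_nko_pairs.jsonl",
    "results/vastai/texas_transcriptions.jsonl",
]


def count_jsonl_records(path: Path) -> int:
    """Count non-empty lines, parsing each one so corrupt lines fail loudly."""
    count = 0
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                json.loads(line)  # raises json.JSONDecodeError on a bad line
                count += 1
    return count


def preflight(root: Path, features_dir: Path, skip_features: bool) -> dict:
    """Return record counts per local file (None if the file is absent).

    Raises RuntimeError when skip_features is requested but no feature
    files exist. ASSUMPTION: mel features are stored as *.npy files
    directly under features_dir; adjust to the real layout.
    """
    counts = {}
    for rel in LOCAL_JSONL:
        path = root / rel
        counts[rel] = count_jsonl_records(path) if path.is_file() else None

    has_features = features_dir.is_dir() and any(features_dir.glob("*.npy"))
    if skip_features and not has_features:
        raise RuntimeError(
            f"--skip-features requested but no mel features found in {features_dir}"
        )
    return counts
```

Running a check like this on the training instance right after each SCP would also show whether the expected files actually arrived, which is relevant given that stale copies kept reverting earlier fixes.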

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.
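One way to track that requirement is a small evidence record that reports which pieces are still missing. This is only a sketch: the paper schema itself is not shown on this page, so the `RunRecord` class and all of its field names are hypothetical, and the run ID, metrics, and reproduction command are deliberately left empty rather than guessed.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    """Hypothetical evidence record; field names are illustrative only."""

    run_id: str
    datasets: list[str]
    repro_command: str
    metrics: dict[str, float] = field(default_factory=dict)

    def missing_fields(self) -> list[str]:
        """Names of fields that are still empty."""
        return [name for name, value in asdict(self).items() if not value]


# Only the dataset names are taken from the handoff; the rest is unfilled.
record = RunRecord(
    run_id="",
    datasets=["afvoices", "bam-asr-early"],
    repro_command="",
)

print(json.dumps(asdict(record), indent=2))
print("still missing:", record.missing_fields())
```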

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.