Grand Diomande Research · Full HTML Reader



# V5 Training Session Handoff
> Updated: 2026-03-21 00:11 EDT
> Purpose: Fresh session picks this up with full context, nothing lost

## WHAT'S RUNNING RIGHT NOW

| Instance | What | Cost | SSH |
|----------|------|------|-----|
| Vast.ai 33195812 | Cognitive twin SFT/DPO training | $0.97/hr | `[ssh command redacted]` |
| Vast.ai 33248108 | V5 mel extraction (tmux "v5") | $0.93/hr | `[ssh command redacted]` |
| Mac4 monitor | V5 watch (LaunchAgent com.nko.v5-monitor) | free | `ssh -o IdentitiesOnly=yes -i [home-path] mac4` |

DO NOT STOP instance 33195812 — cognitive twin is training.

## WHAT WENT WRONG WITH V5

1. `prepare_v5_data.py` only loads HuggingFace datasets (afvoices + bam-asr-early = 259K samples)
2. It does NOT load any of the local JSONL data files (20K+ pairs on disk)
3. It does NOT run the YouTube OCR pipeline (490+ unprocessed videos)
4. Training script used `total_mem` instead of `total_memory` (fixed 3 times; the fix kept reverting because stale copies were re-uploaded over SCP)
5. Training launched with `--skip-features`, but the script requires the mel spectrograms to already exist
6. Context compaction dropped details between tool calls, causing repeated mistakes
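
Item 5 could be caught before any GPU time is spent. The following is a minimal preflight sketch, not the actual training script: the function names, the `.npy` suffix, and the idea of passing an expected file count are all assumptions made for illustration.

```python
from pathlib import Path


def mels_ready(mel_dir, expected_count, suffix=".npy"):
    """Return (ok, found); ok is True only if at least expected_count mel files exist."""
    root = Path(mel_dir)
    if not root.is_dir():
        return False, 0
    found = sum(1 for p in root.rglob(f"*{suffix}") if p.is_file())
    return found >= expected_count, found


def check_launch(skip_features, mel_dir, expected_count):
    """Raise before training starts if --skip-features is set but mels are missing."""
    if not skip_features:
        return  # features will be extracted by the run itself
    ok, found = mels_ready(mel_dir, expected_count)
    if not ok:
        raise RuntimeError(
            f"--skip-features set but only {found}/{expected_count} mel files in {mel_dir}"
        )
```

Calling `check_launch(args.skip_features, mel_dir, n_samples)` at the top of the training entry point would turn a late failure into an immediate, readable one.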

## COMPLETE DATA INVENTORY (USE ALL OF THIS)

### Audio-paired data (for ASR training):
| Source | Samples | Quality | Location | Used in V5? |
|--------|---------|---------|----------|-------------|
| afvoices (HF) | 253,290 | Whisper-inferred | HuggingFace | YES |
| bam-asr-early (HF) | 37,306 | Whisper-inferred | HuggingFace | YES |
| Babamamadidiane OCR | 2,577 | Ground truth (Gemini vision) | `results/dynamic_ocr/dynamic_pairs.jsonl` | NO |
| Babamamadidiane features | 941 | Whisper-inferred | `results/feature_pairs_babamamadidiane.jsonl` | NO |
| Djoko pairs | 926 | Whisper-inferred | `results/quebec_djoko_pairs.jsonl` | NO |
| Texas pairs | 6,633 | Whisper-inferred | `results/vastai/texas_nko_pairs.jsonl` | NO |
| Texas transcriptions | 8,981 | Whisper-inferred | `results/vastai/texas_transcriptions.jsonl` | NO |
| Common Voice Bambara | 500 files | Human-verified | `data/common_voice_bm/audio/` | NO |
| Total available | ~311,000 | | | 259K used |
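
The rows marked "NO" are all local JSONL files, so loading them is mostly a matter of iterating over a path list. This is a sketch of what that loader might look like, not the contents of `prepare_v5_data.py`: the function names and the `_source` provenance field are hypothetical, and it assumes each line is a JSON object.

```python
import json
from pathlib import Path

# Local files marked "NO" in the inventory table.
LOCAL_JSONL = [
    "results/dynamic_ocr/dynamic_pairs.jsonl",
    "results/feature_pairs_babamamadidiane.jsonl",
    "results/quebec_djoko_pairs.jsonl",
    "results/vastai/texas_nko_pairs.jsonl",
    "results/vastai/texas_transcriptions.jsonl",
]


def load_jsonl(path):
    """Read one JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, 1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: bad JSON ({exc})") from exc
    return records


def load_local_pairs(paths=LOCAL_JSONL, root="."):
    """Load every listed file; return (records, per-file counts, missing files)."""
    records, counts, missing = [], {}, []
    for rel in paths:
        full = Path(root) / rel
        if not full.is_file():
            missing.append(rel)  # report instead of silently skipping
            continue
        rows = load_jsonl(full)
        for row in rows:
            row["_source"] = rel  # hypothetical field: keeps provenance per pair
        counts[rel] = len(rows)
        records.extend(rows)
    return records, counts, missing
```

Comparing the returned `counts` against the Samples column is a quick way to confirm nothing was dropped, and a non-empty `missing` list flags files that never made it onto the training machine.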

### UNPROCESSED YouTube (highest value, ground-truth labels):
| Channel | Total videos | Processed | Remaining |
|---------|-------------|-----------|-----------|
| @babamamadidiane | 532 | ~40 | 490 |
| @mamadibabadiane1 | 58 | 0 | 58 |
| Djoko | partial | partial | unknown |

Estimated yield from unprocessed videos: ~34,000 ground-truth pairs (OCR from screen text aligned with audio)
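
One way to sanity-check that estimate is to extrapolate from what has already been processed. The sketch below assumes the 2,577 Babamamadidiane OCR pairs came from the ~40 processed videos and that the remaining videos yield pairs at the same rate; neither assumption is stated in these notes.

```python
def estimate_pairs(pairs_so_far, videos_processed, videos_remaining):
    """Linear extrapolation: same pairs-per-video rate for the remaining videos."""
    per_video = pairs_so_far / videos_processed
    return per_video, round(per_video * videos_remaining)


# 2,577 pairs from ~40 videos; 490 + 58 videos still unprocessed.
per_video, total = estimate_pairs(2577, 40, 490 + 58)
print(f"{per_video:.1f} pairs/video -> ~{total:,} pairs")
```

With these inputs the rate is about 64 pairs per video and the extrapolated total is roughly 35,000 pairs, in the same range as the ~34,000 figure.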

### Text-only data (for NLLB translation, not ASR):
- parallel_corpus.jsonl: 460 pairs
- nicolingua: 29,513 train + 3,279 test
- bayelemabaga: 1,001 pilot SFT
- sft_v3_combined: 92,184
- NLLB training: 8,640 pairs

## THE GENERATIVE FRAMES PIPELINE

This is the most valuable data pipeline and it's NOT being used.

`asr/dynamic_ocr_pipeline.py` does:
1. FFmpeg scene detection — finds every slide transition in teaching videos
2. Gemini vision OCR — reads N'Ko text from each frame (bounding boxes, sections, digital slides)
3. Audio-text alignment — extracts the audio window when that text is on screen
4. Output: `(audio_segment, nko_text)` pairs with GROUND TRUTH labels

This produces labels that come from what's literally written on screen, NOT from Whisper inference. Worth 10x more per pair than Whisper-inferred data.

Run: `python3 asr/dynamic_ocr_pipeline.py --all --resume` (needs GOOGLE_API_KEY for Gemini)
Also: `asr/gcs_ocr_pipeline.py` processes 189 videos from babamamadidiane
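
Step 3 (audio-text alignment) can be pictured as turning slide-transition timestamps into audio windows. This is an illustrative sketch only, not the code in `asr/dynamic_ocr_pipeline.py`: it assumes each slide's text stays on screen until the next transition, and the function name, `min_len` threshold, and output keys are hypothetical.

```python
def build_segments(scene_times, texts, duration, min_len=0.5):
    """Pair each slide's OCR text with the audio window during which it is on screen.

    scene_times: sorted timestamps (seconds) at which a new slide appears.
    texts: OCR text per slide, same length as scene_times.
    duration: total length of the video in seconds.
    """
    if len(scene_times) != len(texts):
        raise ValueError("need exactly one OCR text per scene")
    ends = list(scene_times[1:]) + [duration]  # each slide ends where the next begins
    segments = []
    for start, end, text in zip(scene_times, ends, texts):
        text = text.strip()
        if not text or end - start < min_len:
            continue  # no label, or window too short to hold speech
        segments.append({"start": start, "end": end, "nko_text": text})
    return segments
```

Each returned window can then be cut from the audio track to form one `(audio_segment, nko_text)` pair.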

## PARENTS' CONVERSATION

File: `[home-path] Hollywood Luxury Car Services 2.m4a` (23MB, 24.5 min)
- Copied to Vast.ai at `/workspace/parents_conversation.m4a` and `.wav`
- Transcription script ready at `/workspace/transcribe_parents.py`
- NOT YET TRANSCRIBED (GPU was busy with mel extraction)

### What Mohamed wants:
1. Transcribe the Malinke conversation with Whisper
2. Speaker diarization — detect mom vs dad speaking turns
3. Translate to English, send to Discord
4. Text-to-speech voice cloning of both parents
5. Use their voices as "voice of the architecture" for inscriptions/sigils
6. Language models communicating in Malinke using parents' voices
7. Find all mom's recordings in [home-path] folder
8. Build a voice model minimizing loss between generated and real audio

### The bigger vision:
"Eternal parents" — inscriptions on EPOCH blockchain spoken in parents' voices in Malinke. The sigils and derivation chain voiced by the people who speak the language. Not communicating with Mohamed directly but with a language model, in Malinke.

## V5 ARCHITECTURE

- LoRA on ALL 32 Whisper encoder layers (V4 used only top 8)
- Rank 64, alpha 128, target: q_proj, v_proj, k_proj, out_proj
- CTC head: CharASR V3 (h768, L6, nhead=12, Conv1d stride-4)
- Dual LR: 5e-6 encoder, 1e-4 CTC head
- Mel-based training (encoder is being adapted, can't use pre-extracted features)
- Beam search decoder with FSM constraints (`beam_search_decoder.py`)
- 50 epochs, batch 16, SpecAugment, mixed precision
- W&B logging to `wandb.ai/bwb-brews-with-beats/nko-brain-scanner`

## PAPERS

| Paper | Status | File |
|-------|--------|------|
| 1: Dead Circuits | READY for arXiv | `paper/current/paper1_dead_circuits.tex` |
| 2: Living Speech | Awaiting V5 results | `paper/current/paper2_living_speech.tex` |
| 3: Script Invisibility | 3/4 models scanned | From Experiment A |
| 4: Script Design | Scripts ready | Needs Experiment B ($10 Vast.ai) |
| 5: Script as Thought | Prepped | Needs cognitive twin |
| 6: Inscribing Knowledge | D+E done | Needs C results |

Paper 1 verified: zero fabricated numbers, real brain scan data from 3 models (Qwen3-8B 2.94x, Qwen2.5-7B 3.22x, Mistral-7B 3.42x).

## ACADEMIC OUTREACH

UPenn: Marlyse Baptista (Mande substrate, strongest contact). Lead with Paper 1 or Paper 3.
Miami: Ludovic Mompelat (U Miami, Creole NLP, local). Coffee meeting.
LDC: Dataset submission pathway at submissions.ldc.upenn.edu.

Prep timeline: 4-6 weeks. Papers on arXiv first, then colloquium talk.

## CONTEXT MANAGEMENT (LEARNED THIS SESSION)

Problem: Parallel agent outputs fill context, compaction drops details, mistakes compound.

Solution:
1. Agent outputs → disk files, not context. Read only summaries.
2. State → this handoff document, updated periodically.
3. Heavy work → pane spawn (separate Claude sessions), not subagents.
4. Use RAG++ (`rag_search`) to recall conversation history instead of keeping it in context.
5. Codex/Gemini CLI/OpenCode for isolated tasks that write to git, not back to this context.
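
Point 1 can be as simple as a helper that writes the full output to disk and hands back only a pointer and a one-line summary. This is a minimal sketch under assumed names (`stash_output`, the `agent_outputs` directory); it is not an existing tool in this setup.

```python
from pathlib import Path


def stash_output(name, text, out_dir="agent_outputs", summary_chars=200):
    """Write full agent output to disk and return only a short summary."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{name}.txt"
    path.write_text(text, encoding="utf-8")
    stripped = text.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    # Only this small dict goes back into context; the full text stays on disk.
    return {"path": str(path), "chars": len(text), "summary": first_line[:summary_chars]}
```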

## WHAT THE NEXT SESSION SHOULD DO

1. Read this file first
2. Check V5 mel extraction status on Vast.ai 33248108
3. Run `transcribe_parents.py` on Vast.ai after mels done
4. FIX `prepare_v5_data.py` to load ALL local JSONL files
5. Run YouTube OCR pipeline on remaining 548 videos (490 + 58)
6. Retrain V5 with ALL data (~345K+ samples)
7. Compile Paper 1 PDF (needs LaTeX toolchain)
8. Upload Paper 1 to arXiv
9. Send parents' transcription to Discord

## YouTube Download Blocker (2026-03-21 04:42 EDT)

Vast.ai CANNOT download from YouTube — datacenter IP is blocked even with exported cookies + deno JS solver.

Only Mac1 works (Safari cookies + residential IP + nightly yt-dlp 2026.03.17.232108).

Working command:

```bash
[home-path] --cookies-from-browser safari -x --audio-format wav -o "OUTPUT.%(ext)s" "YOUTUBE_URL"
```

Mac1 has 159MB free disk. Must clear space before downloading 590 videos.

Steps for next session:
1. Clear Mac1 disk (rm old builds, caches, sim builds)
2. Download all 590 videos as wav audio (~50GB estimated)
3. SCP to Vast.ai
4. OR: stream from Mac1 to Vast.ai via pipe (download + SCP in one shot per video)

Alternative: use the CC-stream approach — stream video frames directly without downloading, process in real-time.

## YouTube Download SOLVED (2026-03-21 11:06 EDT)

Cloud-VM works! Not Mac1-only after all.

Working setup on cloud-vm:
- deno 2.7.7 installed at `[home-path]`
- yt-dlp 2026.03.03 at `[home-path]`
- Cookies exported from Mac1 Safari at `[home-path]`
- Video ID list at `[home-path]` (590 IDs)
- 34GB free disk

Working command:

```bash
export PATH="$HOME/.deno/bin:$PATH"
[home-path] --remote-components ejs:github --cookies [home-path] -x --audio-format wav -o "OUTPUT.%(ext)s" "URL"
```

For batch download next week:

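```bash
mkdir -p [home-path]
# One video ID per line; -r keeps backslashes literal, IFS= keeps whitespace intact.
while IFS= read -r ID; do
    # Extract audio as WAV, resampled to 16 kHz mono.
    [home-path] --remote-components ejs:github --cookies [home-path] \
        -x --audio-format wav --postprocessor-args "ffmpeg:-ar 16000 -ac 1" \
        -o "[home-path])s" "https://www.youtube.com/watch?v=${ID}"
done < [home-path]
```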
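
A batch of 590 downloads is likely to be interrupted at least once, so it helps to know which IDs still need fetching. The sketch below filters the ID list against files already on disk. It assumes finished files are named `<ID>.wav` (the output template in the command above is redacted, so this naming is a guess), and `pending_ids` is a hypothetical helper.

```python
from pathlib import Path


def pending_ids(id_file, out_dir, ext=".wav"):
    """Return video IDs from id_file that have no finished audio file in out_dir."""
    out = Path(out_dir)
    done = {p.stem for p in out.glob(f"*{ext}")} if out.is_dir() else set()
    ids = []
    for line in Path(id_file).read_text(encoding="utf-8").splitlines():
        vid = line.strip()
        if vid and vid not in done and vid not in ids:
            ids.append(vid)  # keep file order, drop blanks and duplicates
    return ids
```

Writing the result to a fresh ID file and feeding that to the download loop makes a rerun skip everything that already finished.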


Source Anchor

SESSION-HANDOFF-V5.md
