Grand Diomande Research · Full HTML Reader

N'Ko Brain Scanner — Comprehensive Project Handoff

## What This Project Is

An ASR (Automatic Speech Recognition) system for N'Ko script — a phonetically transparent writing system used by ~30M Manding-language speakers in West Africa. The core research question: does N'Ko's phonetic transparency give it a measurable architectural advantage over Latin script in ASR?

Current verified answer: the N'Ko trajectory model is strong enough to anchor the project baseline. The fully archived reproduction on the current `290,596`-pair corpus snapshot achieves **20.57% test CER** (216,225 edits over 1,050,967 characters).

---

## Best Model

### N'Ko Trajectory CTC: 20.57% test CER

| Property | Value |
|----------|-------|
| Architecture | UnifiedCTCHead (46.8M params) |
| Input | Whisper large-v3 encoder features (1280-dim) |
| Decoder | 6-layer Transformer with trajectory bias injection |
| Training data | 290,596 pairs (232,476 train / 29,060 val / 29,060 test) |
| Seed | 42 (deterministic split) |
| Epochs trained | 46 (best checkpoint at epoch 38) |
| Best val loss | 0.6359 |
| Checkpoint | `results/paper4_reproduction_35205256/best.pt` |
| Results JSON | `results/paper4_reproduction_35205256/results.json` |
| Inference script | `asr/transcribe_nko.py` |

Artifacts are also synced to `Mac5:/Volumes/HD1/tar_297k_clean/paper4_reproduction_35205256/`.
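
The split sizes reported above are consistent with a seeded 80/10/10 cut of 290,596 pairs. The following is a minimal sketch of such a split, not the project's actual code: the function name `split_indices`, the use of `random.Random`, and the rounding rule are assumptions chosen so that the counts match.

```python
import random


def split_indices(n_pairs: int, seed: int = 42):
    """Shuffle indices with a fixed seed, then cut train/val/test (80/10/10).

    Hypothetical helper: the real split logic lives in the training script
    and may differ in shuffling method and rounding.
    """
    indices = list(range(n_pairs))
    random.Random(seed).shuffle(indices)   # same seed gives the same order
    n_train = n_pairs * 8 // 10            # floor of 80%
    n_val = (n_pairs - n_train) // 2       # half of what is left
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]       # everything that remains
    return train, val, test


train, val, test = split_indices(290_596)
print(len(train), len(val), len(test))  # 232476 29060 29060
```

Persisting the resulting index lists (as `split.json` does for the verified run) is what makes later CER numbers comparable across runs.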

### Trajectory Bias Mechanism
The model injects pen-stroke trajectory scalars (7-dim: velocity, curvature, acceleration, etc.) into transformer attention as position-dependent bias. This exploits N'Ko's bijective grapheme-phoneme mapping — each character encodes exactly one phoneme, and the trajectory captures how that character is physically written.

Key: in the historical comparison, trajectory bias lowers N'Ko CER by 5.25pp but raises Latin CER by 0.24pp. The bias acts as a bijection amplifier: it only helps when the script's structure is phonetically transparent.
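
To make "position-dependent bias" concrete, here is a toy, framework-free sketch of adding a bias to attention scores before the softmax. Everything specific in it is an assumption for illustration: the linear projection of the 7 scalars, the exponential distance kernel, and the names `trajectory_bias`, `biased_attention_weights`, and `tau` are not taken from the project's `TrajectoryBiasNetwork`.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def trajectory_bias(scalars, head_weights, tau=4.0):
    """Bias for one head: bias[i][j] = (w . s_j) * exp(-|i - j| / tau).

    scalars:      T frames, each a list of 7 trajectory scalars
    head_weights: 7 weights for this head (hypothetical learned parameters)
    """
    proj = [sum(w * s for w, s in zip(head_weights, frame)) for frame in scalars]
    n = len(scalars)
    return [[proj[j] * math.exp(-abs(i - j) / tau) for j in range(n)]
            for i in range(n)]


def biased_attention_weights(scores, bias):
    """Add the bias to raw attention scores, then normalise each query row."""
    return [softmax([s + b for s, b in zip(s_row, b_row)])
            for s_row, b_row in zip(scores, bias)]


# Three frames; only the middle frame has a non-zero first scalar.
scalars = [[0.0] * 7, [1.0] + [0.0] * 6, [0.0] * 7]
bias = trajectory_bias(scalars, head_weights=[2.0] + [0.0] * 6)
attn = biased_attention_weights([[0.0] * 3 for _ in range(3)], bias)
```

With uniform raw scores, every query row in `attn` puts the most weight on the middle frame, which is the only effect the bias is meant to have here. A 12-head model would hold one such weight vector per head.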

---

## Verified Baseline (Current)

This is the benchmark you can cite without qualification inside the repo:

| Metric | Value |
|--------|-------|
| Test CER | **20.57%** |
| Test edits / chars | 216,225 / 1,050,967 |
| Corpus snapshot | 290,596 pairs |
| Split | 232,476 / 29,060 / 29,060 |
| Mode | N'Ko trajectory |
| Early stopping | Epoch 46 |
| Best validation | 0.6359 at epoch 38 |
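
The test CER follows directly from the edit and character counts in the table. The sketch below shows a corpus-level CER (total edits divided by total reference characters) and re-derives the headline figure. The helper names are illustrative, and it assumes the "chars" count refers to reference characters.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_cer(pairs):
    """Corpus-level CER: summed edits over summed reference length."""
    edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    chars = sum(len(ref) for ref, _ in pairs)
    return edits / chars


# Re-derive the reported figure from the counts in the table above.
print(f"{100 * 216_225 / 1_050_967:.2f}")  # 20.57
```

Summing edits before dividing weights long utterances more heavily than averaging per-utterance CER would, so the two conventions should not be mixed when comparing runs.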

## Historical 8-Way Comparison (Provisional)

These numbers motivated the script-dependent trajectory story, but the full local artifact bundle for all eight runs is still missing. Keep them marked as historical/provisional until the companion Latin and ablation artifacts are restored.

### Historical 297K Results (internal logs)

| Mode | N'Ko CER | Latin CER | Delta (N'Ko - Latin) |
|------|----------|-----------|----------------------|
| Baseline | 32.75% | | |
| Graph | 32.38% | | |
| Trajectory | **27.50%** | | |
| Combined | 30.46% | | |

Historical key findings:
1. At baseline, Latin slightly wins (+1.32pp) — Latin has fewer classes (41 vs 66)
2. Graph hurts Latin badly (+5.71pp) but helps N'Ko (-0.37pp) — graph encodes phonological structure that only helps transparent scripts
3. Trajectory is the big win: -5.25pp for N'Ko, +0.24pp for Latin — trajectory bias is script-dependent
4. Combined doesn't beat trajectory alone for N'Ko — the mechanisms compete
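
The N'Ko deltas quoted in the findings can be re-derived from the historical N'Ko CER column alone. The short sketch below does that arithmetic. It uses only the provisional N'Ko numbers listed above, and the helper name `delta_vs_baseline` is illustrative.

```python
# Historical (provisional) N'Ko CER values, in percent.
nko_cer = {"baseline": 32.75, "graph": 32.38, "trajectory": 27.50, "combined": 30.46}


def delta_vs_baseline(cer_by_mode):
    """Change in CER relative to the baseline mode, in percentage points."""
    base = cer_by_mode["baseline"]
    return {mode: round(cer - base, 2)
            for mode, cer in cer_by_mode.items() if mode != "baseline"}


print(delta_vs_baseline(nko_cer))
# {'graph': -0.37, 'trajectory': -5.25, 'combined': -2.29}
```

The combined mode's derived delta (-2.29pp) is smaller in magnitude than trajectory alone (-5.25pp), which is the arithmetic behind finding 4.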

### Historical results JSON
- Expected historical path: `experiments/B_script_advantage_ctc/vastai_results_297k/...`
- Status on this machine: missing; not available for direct verification

### Verified reproduction logs
- `results/paper4_reproduction_35205256/train.log`
- `results/paper4_reproduction_35205256/run.log`
- `results/paper4_reproduction_35205256/split.json`
- `results/paper4_reproduction_35205256/vocab.json`

---

## Experiment F: Compositional Generalization (COMPLETE)

Question: Can N'Ko decode words never seen in training?

| Script | Full test CER | Exp F test CER | Degradation |
|--------|---------------|----------------|-------------|
| N'Ko | 32.75% | | |
| Latin | 31.43% | | |

N'Ko degrades 6x less than Latin on unseen vocabulary. Trained on SEEN-only words (213,664 samples), tested on utterances with UNSEEN words (59,648 samples).
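
One way to build such a SEEN/UNSEEN partition is sketched below. It is a hypothetical illustration only: how the experiment actually chose its seen vocabulary is not described here, so `seen_words` is simply taken as given, and whitespace tokenisation is an assumption.

```python
def split_by_seen_words(utterances, seen_words):
    """Partition transcripts by whether every word is in the seen vocabulary.

    Returns (seen_only, has_unseen): the first list is usable for training,
    the second forms the compositional-generalization test set.
    """
    seen_only, has_unseen = [], []
    for text in utterances:
        if all(word in seen_words for word in text.split()):
            seen_only.append(text)
        else:
            has_unseen.append(text)
    return seen_only, has_unseen


# Toy example with placeholder Latin-script transcripts.
train_set, test_set = split_by_seen_words(
    ["i ni ce", "i ni su", "aw ni ce"],
    seen_words={"i", "ni", "ce"},
)
```

Here `train_set` holds only the first transcript, and the other two land in `test_set` because each contains at least one word outside the seen vocabulary.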

Detailed word-level analysis:
- N'Ko SEEN-word CER: 16.09%
- Latin SEEN-word CER: 15.05%
- The N'Ko gap is 3.65pp smaller than Latin's

Files:
- `experiments/B_script_advantage_ctc/vastai_results_297k/exp_f_nko_results.json`
- `experiments/B_script_advantage_ctc/vastai_results_297k/exp_f_latin_results.json`
- `experiments/B_script_advantage_ctc/expF_results.json` (word-level breakdown)

---

## Experiment H: Vocabulary Expansion (COMPLETE)

Question: Can new words be added to the graph and immediately transcribed without retraining?

Results in `experiments/B_script_advantage_ctc/expH_results.json`. Three conditions: no graph, SEEN-only graph, FULL graph (with unseen words added back, no model retraining).

---

## Experiment G: TTT Speaker Adaptation (IN PROGRESS)

Question: Does the decoder improve per-speaker during inference?

Status: Script built, diarization done, awaiting execution.

- Speaker diarization: 7 clusters from 6,625 Djoko segments (`djoko_speakers.json`)
- TTT script: `experiments/G_living_weights/ttt_eval.py`
- Method: Load trajectory checkpoint → process speaker's utterances sequentially → update last 2 MLP layers after each → measure CER improvement curve
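
The method above can be sketched as a framework-agnostic loop. This is not the contents of `ttt_eval.py`: the callables `transcribe`, `adapt`, and `cer` are hypothetical stand-ins for the model forward pass, the update of the last two MLP layers, and the scoring function. The point it illustrates is the ordering, where each utterance is scored before the model adapts on it.

```python
def ttt_curve(utterances, transcribe, adapt, cer):
    """Run test-time training over one speaker's utterances, in order.

    utterances: iterable of (audio, reference_text) pairs
    transcribe: audio -> hypothesis text, using the current weights
    adapt:      (audio, hypothesis) -> None, updates the adaptable layers
    cer:        (reference, hypothesis) -> float
    Returns the per-utterance CER curve.
    """
    curve = []
    for audio, reference in utterances:
        hypothesis = transcribe(audio)             # predict with current weights
        curve.append(cer(reference, hypothesis))   # score before adapting
        adapt(audio, hypothesis)                   # reference is never used here
    return curve
```

Because `adapt` only sees the audio and the model's own hypothesis, the reference text stays out of the update, and a falling curve can be read as per-speaker adaptation and not as label leakage.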

---

## Data Pipeline

### Training Data Sources
| Source | Samples | Type |
|--------|---------|------|
| AfVoices | ~260K | Read speech, multiple African languages |
| bam-asr-early | ~37K | Bambara read speech |
| Total | 290,596 | Combined, deduplicated in current verified snapshot |

### Djoko Soap Opera Extraction (IN PROGRESS)
- Source: YouTube channel "Koman Diabate - Film Djoko" (2,001 videos)
- Download: `asr/download_djoko.py` (PID 34858, 1,124/2,001 videos, 32,826 segments)
- Audio: 30s WAV segments at 16kHz mono
- Transcription: `asr/vastai_transcribe.py` on Vast.ai GPU
- First batch: 8,985 segments → 8.8
- Results: `djoko_transcriptions.jsonl`
- Speaker diarization: `asr/diarize_djoko.py` using MFCC + agglomerative clustering
- Results: `djoko_speakers.json` (7 speakers, 5 eligible for TTT)
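
For readers unfamiliar with the clustering step, here is a tiny pure-Python sketch of agglomerative clustering over per-segment embeddings. It is not `asr/diarize_djoko.py`: average linkage, Euclidean distance, and the idea that each segment is summarised by one fixed-length MFCC-derived vector are assumptions, and this cubic-time version is only suitable for toy inputs.

```python
import math


def agglomerative(embeddings, n_clusters):
    """Average-linkage agglomerative clustering; returns one label per item."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(a, b):
        # Mean pairwise Euclidean distance between two clusters.
        dists = [math.dist(embeddings[i], embeddings[j]) for i in a for j in b]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        _, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Two obvious groups of 2-D "embeddings".
labels = agglomerative([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)], n_clusters=2)
```

At the scale of thousands of segments a library implementation would be the practical choice; the sketch only shows the merge rule.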

### Parents' Audio (COMPLETE)
- Source: Voice memo, 24.5 min Malinke conversation
- Diarized: mom + dad separated
- N'Ko transcribed: Whisper large-v3, avg confidence 0.56
- Files: `parents-audio/segments/`

---

## Current Pipeline Architecture

Djoko YouTube Video
    │
    ├── Audio → yt-dlp → 30s WAV segments
    │           → Whisper encoder → 1280-dim features
    │           → Our CTC head (nko_traj_best.pt) → N'Ko text (hypothesis A)
    │           → FarmRadio Whisper → Latin text → transliterate → N'Ko (hypothesis B)
    │
    └── Video → Gemma 4 E4B (scene analysis)
                → character identification
                → speaker attribution
                → scene type (dialogue/action/music)
                → dialogue quality score (1-5)
                │
                └── Consensus: where A ≈ B AND quality ≥ 3 → training pair
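
The consensus rule at the bottom of the diagram (A ≈ B and quality ≥ 3) could be sketched as follows. The agreement measure (normalised edit distance), the `max_disagreement` threshold of 0.15, the record field names, and the choice to keep hypothesis A as the label are all assumptions for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def consensus_pairs(segments, max_disagreement=0.15, min_quality=3):
    """Keep segments where both hypotheses agree and scene quality is high.

    Each segment is a dict with hypothetical keys:
    "audio", "hyp_a" (CTC head), "hyp_b" (transliterated Whisper), "quality".
    """
    kept = []
    for seg in segments:
        a, b = seg["hyp_a"], seg["hyp_b"]
        if not a or not b:
            continue  # an empty hypothesis cannot confirm anything
        disagreement = edit_distance(a, b) / max(len(a), len(b))
        if disagreement <= max_disagreement and seg["quality"] >= min_quality:
            kept.append({"audio": seg["audio"], "text": a})
    return kept
```

Requiring two independently trained systems to agree trades recall for label precision, which suits a source where the CTC head alone is known to struggle.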

---

## Model Architecture Details

### UnifiedCTCHead (`experiments/B_script_advantage_ctc/train_vastai.py`)

Input: Whisper features (B, T, 1280)
  → input_proj: Linear(1280, 768) + GELU + Dropout
  → temporal_ds: Conv1d(768, 768, k=5, s=4) + GELU  (4x downsample)
  → sinusoidal positional encoding
  → [if trajectory]: AudioTrajectoryScalars → TrajectoryBiasNetwork → bias
  → 6x TrajectoryTransformerLayer (d=768, heads=12, ff=3072)
     └── each layer: self-attention with trajectory bias + FFN
     └── [if graph]: GraphCrossAttention after self-attention
  → LayerNorm → Linear(768, num_classes)
Output: CTC logits (B, T/4, num_classes)
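
The "T/4" output length is approximate, since the exact frame count after `Conv1d(768, 768, k=5, s=4)` depends on padding, which is not stated above. The sketch below applies the standard 1-D convolution length formula to the 1,500 encoder frames Whisper produces for a 30 s window. The padding values tried are assumptions.

```python
def conv1d_out_len(t_in: int, kernel: int = 5, stride: int = 4,
                   padding: int = 0, dilation: int = 1) -> int:
    """Output length of a 1-D convolution (standard formula, floor division)."""
    return (t_in + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


print(conv1d_out_len(1500, padding=0))  # 374
print(conv1d_out_len(1500, padding=2))  # 375
```

The difference matters for CTC, whose loss requires the downsampled length to be at least as long as the target sequence, so length bookkeeping should use the real layer settings, not `T // 4`.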

### Key Components
- AudioTrajectoryScalars: Conv1d + projection → 7 trajectory scalars per timestep
- TrajectoryBiasNetwork: Maps 7 scalars → 12 attention head biases with distance-dependent kernel
- GraphCrossAttention: K/V from GNN path embeddings (d=256), Q from decoder hidden states, gated residual

### Vocabularies
- N'Ko: 66 classes (U+07C0-07FF range + space + blank)
- Latin: 41 classes (a-z + ɛ/ɔ/ŋ/ɲ + accented vowels + space/' /- + blank)
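
The N'Ko class count can be checked by enumeration: the range U+07C0 to U+07FF holds 64 code points, and adding space and the CTC blank gives 66. The sketch below builds such a vocabulary and applies greedy CTC decoding (collapse repeats, drop blanks). The index order, with blank at 0 and space at 1, is an assumption, and a couple of code points in that range are unassigned in Unicode, so the real `vocab.json` may be laid out differently.

```python
BLANK = 0  # assumed index of the CTC blank symbol


def build_nko_vocab():
    """Blank + space + every code point in U+07C0..U+07FF (64 of them)."""
    return ["<blank>", " "] + [chr(cp) for cp in range(0x07C0, 0x0800)]


def ctc_greedy_decode(frame_ids, vocab, blank=BLANK):
    """Collapse consecutive repeats, then remove blanks."""
    out = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != blank:
            out.append(vocab[idx])
        prev = idx
    return "".join(out)


vocab = build_nko_vocab()
print(len(vocab))  # 66

# Per-frame argmax ids: the blank between the two 2s keeps them distinct.
text = ctc_greedy_decode([0, 2, 2, 0, 2, 3, 3, 0], vocab)
```

Here `text` contains three characters: U+07C0 twice, then U+07C1. Because every class maps to exactly one character, decoding needs no lexicon or merge table.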

---

## Paper Status

### Paper 4: "Phonetic Transparency as Architectural Advantage" (ACTIVE REVISION)
- File: `paper/current/paper4_script_advantage.tex`
- PDF: `Desktop/Paper4_ScriptAdvantage_297K.pdf` (12 pages)
- Audio narrative: `Desktop/Paper4_Narrative.mp3` (21.3 min, OpenAI TTS onyx voice)
- All figures: `figures/fig{1-6}_{name}.pdf` + `.png`
- Status: manuscript now needs to foreground the verified `20.57%` test CER baseline

### Paper 5: "Compositional Generalization and Speaker Adaptation" (OUTLINE)
- File: `paper/paper5_outline.md`
- Thesis: N'Ko generalizes better to unseen vocab, expands without retraining, adapts faster to speakers
- Blocked on: Exp G results

---

## Figures (8 total, all generated)

| Figure | File | Source |
|--------|------|--------|
| Fig 1: CER comparison bar chart | `figures/fig1_cer_comparison.pdf` | `figures/gen_fig1.py` |
| Fig 2: Training loss curves (REAL) | `figures/fig2_loss_curves.pdf` | `figures/gen_fig2_real_loss.py` |
| Fig 3: Architecture asymmetry delta | `figures/fig3_delta.pdf` | `figures/gen_fig3_delta.py` |
| Fig 4: Data scale effect (37K→297K) | `figures/fig4_data_scale.pdf` | `figures/gen_fig4_scale.py` |
| Fig 5: Exp F compositional generalization | `figures/fig5_expF_compositional.pdf` | `figures/gen_fig5_expF.py` |
| Fig 6: Exp H vocabulary expansion | `figures/fig6_expH_vocab_expansion.pdf` | `figures/gen_fig6_expH.py` |
| (old) Fig 2: Compositional gen | `figures/fig2_compositional_generalization.pdf` | Previous session |
| (old) Fig 3: Vocab expansion | `figures/fig3_vocabulary_expansion.pdf` | Previous session |

---

## Key File Map

### Scripts
| File | Purpose |
|------|---------|
| `experiments/B_script_advantage_ctc/train_vastai.py` | PyTorch CTC trainer (4 modes × 2 scripts) |
| `asr/transcribe_nko.py` | Inference: audio → N'Ko text using trajectory model |
| `asr/vastai_transcribe.py` | Batch GPU transcription on Vast.ai |
| `asr/download_djoko.py` | YouTube audio downloader (streaming mode) |
| `asr/diarize_djoko.py` | Speaker clustering via MFCC embeddings |
| `asr/scene_analyzer.py` | Gemma 4 video frame scene analysis |
| `asr/extract_djoko_frames.py` | Video frame extraction for OCR |
| `experiments/G_living_weights/ttt_eval.py` | Test-time training per-speaker |
| `nko_pretext/renderer.py` | Synthetic N'Ko OCR image generator |

### Data
| File | Purpose |
|------|---------|
| `experiments/B_script_advantage_ctc/data/pairs.jsonl` | 37,305 training pairs (local) |
| `experiments/B_script_advantage_ctc/vastai_results_297k/` | All 297K results + logs |
| `djoko_transcriptions.jsonl` | 8,985 CTC transcriptions of Djoko segments |
| `djoko_speakers.json` | Speaker diarization (7 clusters) |
| `djoko_audio/segments/` | 32,826+ WAV segments (growing) |
| `parents-audio/` | Diarized parents' audio + transcriptions |
| `data/triples_cache.json` | Knowledge graph triples |
| `data/path_embeddings.npz` | Pre-computed GNN path embeddings |

### Checkpoints
| File | Model | CER |
|------|-------|-----|
| `results/paper4_reproduction_35205256/best.pt` | Verified N'Ko trajectory baseline | 20.57% |
| `vastai_results_297k/checkpoints/nko_graph_traj_best.pt` | N'Ko graph+trajectory | 30.46% |
| `vastai_results/nko_best.pt` | N'Ko baseline (37K) | 38.90% |
| `vastai_results/nko_graph_best.pt` | N'Ko graph (37K) | ~32% |
| `vastai_results_clean/nko_baseline_best.pt` | N'Ko baseline (297K) | 32.75% |

---

## Infrastructure

| Resource | Details |
|----------|---------|
| Vast.ai | RTX 4090, ~$0.30/hr. Instance 34072387 running. $15.66 credit remaining. |
| Mac5 | M4 16GB. Ollama installed. MLX available. |
| Mac1 | Build host. Djoko downloader running (PID 34858). |
| HuggingFace | Token available. Models: FarmRadio bambara-whisper-asr, Whisper large-v3. |

---

## What's Next (Priority Order)

1. #31 Run Gemma 4 E4B scene analysis on Vast.ai — filter Djoko segments by dialogue quality
2. #30 Run FarmRadio Whisper as second opinion on Vast.ai
3. #32 Build consensus labeling pipeline (CTC + Whisper + scene quality)
4. #33 Run Experiment G (TTT) on Vast.ai with diarized speakers
5. #34 Retrain CTC on expanded data (297K + Djoko consensus pairs)
6. #35 Write Paper 5 draft

## Known Issues
- CTC model has severe domain gap on Djoko audio (8.8
- Djoko has NO burned-in subtitles — OCR pathway not viable for this source
- macOS OMP double-load crashes PyTorch-based speaker embedding models (pyannote, resemblyzer) on Mac1 — workaround: MFCC-based embeddings
- Vast.ai instance SSH uses `[home-path]` key specifically


Source Anchor

nko-brain-scanner/HANDOFF.md
