# N'Ko Brain Scanner: Comprehensive Project Handoff
## What This Project Is
An ASR (Automatic Speech Recognition) system for N'Ko script — a phonetically transparent writing system used by ~30M Manding-language speakers in West Africa. The core research question: does N'Ko's phonetic transparency give it a measurable architectural advantage over Latin script in ASR?
Current verified answer: the N'Ko trajectory model is strong enough to anchor the project baseline. The fully archived reproduction on the current `290,596`-pair corpus snapshot achieves **20.57% test CER**.
---
## Best Model
### N'Ko Trajectory CTC: 20.57% CER
| Property | Value |
|---|---|
| Architecture | UnifiedCTCHead (46.8M params) |
| Input | Whisper large-v3 encoder features (1280-dim) |
| Decoder | 6-layer Transformer with trajectory bias injection |
| Training data | 290,596 pairs (232,476 train / 29,060 val / 29,060 test) |
| Seed | 42 (deterministic split) |
| Epochs trained | 46 (best checkpoint at epoch 38) |
| Best val loss | 0.6359 |
| Checkpoint | `results/paper4_reproduction_35205256/best.pt` |
| Results JSON | `results/paper4_reproduction_35205256/results.json` |
| Inference script | `asr/transcribe_nko.py` |
Artifacts are also synced to `Mac5:/Volumes/HD1/tar_297k_clean/paper4_reproduction_35205256/`.
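
The split sizes in the table are consistent with an 80/10/10 partition of the 290,596 pairs. The following is a minimal sketch of a seeded split that reproduces those sizes, assuming a shuffled index partition in which validation and test each take `round(0.1 * N)` items. The function name `split_indices` and the shuffling method are illustrative assumptions, not taken from `train_vastai.py`; the authoritative split membership is whatever `split.json` records.

```python
import random

def split_indices(n_pairs, seed=42, holdout_frac=0.1):
    """Shuffle indices with a fixed seed and cut train/val/test partitions."""
    indices = list(range(n_pairs))
    random.Random(seed).shuffle(indices)       # same seed -> same order every run
    n_holdout = round(n_pairs * holdout_frac)  # size of val and of test (assumed rule)
    n_train = n_pairs - 2 * n_holdout
    train = indices[:n_train]
    val = indices[n_train:n_train + n_holdout]
    test = indices[n_train + n_holdout:]
    return train, val, test

train, val, test = split_indices(290_596)
print(len(train), len(val), len(test))
```

Using a private `random.Random(seed)` instance rather than the global generator keeps the split independent of any other random calls made during training.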
### Trajectory Bias Mechanism
The model injects pen-stroke trajectory scalars (7-dim: velocity, curvature, acceleration, etc.) into transformer attention as position-dependent bias. This exploits N'Ko's bijective grapheme-phoneme mapping — each character encodes exactly one phoneme, and the trajectory captures how that character is physically written.
Key result (historical 297K runs): trajectory bias lowers N'Ko CER by 5.25pp but raises Latin CER by 0.24pp. The bias acts as a bijection amplifier: it helps only when the script is phonetically transparent.
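
To make "position-dependent bias" concrete, here is a toy, dependency-free sketch of an additive attention bias for a single head. It is an illustration under stated assumptions, not the project's `TrajectoryBiasNetwork`: the linear projection of the 7 scalars, the exponential distance kernel, and the names `weights` and `tau` are all hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention_weights(scores, scalars, weights, tau=4.0):
    """Add a trajectory-derived bias to raw attention scores for one head.

    scores:  T x T list of query-key logits
    scalars: T x 7 list of per-timestep trajectory scalars
    weights: 7 hypothetical learned weights for this head
    tau:     hypothetical decay length of the distance kernel
    """
    T = len(scores)
    # Per-key strength: linear projection of that timestep's scalars.
    strength = [sum(w * s for w, s in zip(weights, scalars[j])) for j in range(T)]
    out = []
    for i in range(T):
        # The bias shrinks as the key position j moves away from the query i.
        row = [scores[i][j] + strength[j] * math.exp(-abs(i - j) / tau)
               for j in range(T)]
        out.append(softmax(row))
    return out

# Toy usage: 3 timesteps, flat scores, one active scalar on the last timestep.
flat = [[0.0] * 3 for _ in range(3)]
traj = [[0.0] * 7 for _ in range(3)]
traj[2][0] = 1.0
attn = biased_attention_weights(flat, traj, [2.0] + [0.0] * 6)
```

The bias is added to the logits before the softmax, so it shifts where attention mass goes without changing the value vectors themselves.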
---
## Verified Baseline (Current)
This is the benchmark you can cite without qualification inside the repo:
| Metric | Value |
|---|---|
| Test CER | **20.57%** |
| Test edits / chars | 216,225 / 1,050,967 |
| Corpus snapshot | 290,596 pairs |
| Split | 232,476 / 29,060 / 29,060 |
| Mode | N'Ko trajectory |
| Early stopping | Epoch 46 |
| Best validation | 0.6359 at epoch 38 |
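
The edit and character counts in the table let the headline number be checked directly: 216,225 / 1,050,967 ≈ 0.2057. The sketch below assumes CER is computed at corpus level (total edits over total reference characters), which is what that ratio suggests. The helper names are illustrative and are not the project's evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def corpus_cer(pairs):
    """Corpus-level CER: total edits divided by total reference characters."""
    edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    chars = sum(len(ref) for ref, _ in pairs)
    return edits / chars

# Check the reported figure against the counts in the table.
reported_cer = 216_225 / 1_050_967
print(f"{100 * reported_cer:.2f}%")  # 20.57%
```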
## Historical 8-Way Comparison (Provisional)
These numbers motivated the script-dependent trajectory story, but the full local artifact bundle for all eight runs is still missing. Keep them marked as historical/provisional until the companion Latin and ablation artifacts are restored.
### Historical 297K Results (internal logs)
| Mode | N'Ko CER | Latin CER | Delta (N'Ko - Latin) |
|---|---|---|---|
| Baseline | 32.75% | 31.43% | +1.32pp |
| Graph | 32.38% | | |
| Trajectory | **27.50%** | | |
| Combined | 30.46% | | |
Historical key findings:
1. At baseline, Latin slightly wins (+1.32pp) — Latin has fewer classes (41 vs 66)
2. Graph hurts Latin badly (+5.71pp) but helps N'Ko (-0.37pp) — graph encodes phonological structure that only helps transparent scripts
3. Trajectory is the big win: -5.25pp for N'Ko, +0.24pp for Latin — trajectory bias is script-dependent
4. Combined doesn't beat trajectory alone for N'Ko — the mechanisms compete
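
The N'Ko deltas quoted in the findings follow directly from the N'Ko CER column of the historical table. A short check, using only values stated in this document (negative means lower CER than the baseline mode):

```python
# Historical (provisional) N'Ko CER per mode, in percent.
nko_cer = {"baseline": 32.75, "graph": 32.38, "trajectory": 27.50, "combined": 30.46}

# Change versus the baseline mode, in percentage points.
deltas = {
    mode: round(cer - nko_cer["baseline"], 2)
    for mode, cer in nko_cer.items()
    if mode != "baseline"
}
best_mode = min(nko_cer, key=nko_cer.get)

print(deltas)
print(best_mode)
```

The combined mode still improves on the baseline but trails trajectory alone, which is the basis for finding 4.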
### Historical results JSON
- Expected historical path: `experiments/B_script_advantage_ctc/vastai_results_297k/...`
- Status on this machine: missing; not available for direct verification
### Verified reproduction logs
- `results/paper4_reproduction_35205256/train.log`
- `results/paper4_reproduction_35205256/run.log`
- `results/paper4_reproduction_35205256/split.json`
- `results/paper4_reproduction_35205256/vocab.json`
---
## Experiment F: Compositional Generalization (COMPLETE)
Question: Can N'Ko decode words never seen in training?
| Script | Full test CER | Exp F test CER | Degradation |
|---|---|---|---|
| N'Ko | 32.75% | | |
| Latin | 31.43% | | |
N'Ko degrades 6x less than Latin on unseen vocabulary. Trained on SEEN-only words (213,664 samples), tested on utterances with UNSEEN words (59,648 samples).
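
The SEEN/UNSEEN protocol described above can be sketched as a partition over transcripts: any utterance containing at least one held-out word goes to the test pool, and everything else is available for training. How the held-out words were actually chosen is not stated here, so `unseen_words` is a hypothetical input and whitespace tokenization is an assumption.

```python
def partition_by_vocabulary(utterances, unseen_words):
    """Split transcripts into SEEN-only and contains-UNSEEN groups."""
    seen_only, with_unseen = [], []
    for text in utterances:
        # Whitespace tokenization is an assumption about how words are defined.
        if any(word in unseen_words for word in text.split()):
            with_unseen.append(text)
        else:
            seen_only.append(text)
    return seen_only, with_unseen

# Toy usage with placeholder tokens.
train_pool, test_pool = partition_by_vocabulary(["a b", "a c", "b"], {"c"})
print(train_pool, test_pool)
```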
Detailed word-level analysis:
- N'Ko SEEN-word CER: 16.09%
- Latin SEEN-word CER: 15.05%
- N'Ko gap is 3.65pp smaller than Latin
Files:
- `experiments/B_script_advantage_ctc/vastai_results_297k/exp_f_nko_results.json`
- `experiments/B_script_advantage_ctc/vastai_results_297k/exp_f_latin_results.json`
- `experiments/B_script_advantage_ctc/expF_results.json` (word-level breakdown)
---
## Experiment H: Vocabulary Expansion (COMPLETE)
Question: Can new words be added to the graph and immediately transcribed without retraining?
Results in `experiments/B_script_advantage_ctc/expH_results.json`. Three conditions: no graph, SEEN-only graph, FULL graph (with unseen words added back, no model retraining).
---
## Experiment G: TTT Speaker Adaptation (IN PROGRESS)
Question: Does the decoder improve per-speaker during inference?
Status: Script built, diarization done, awaiting execution.
- Speaker diarization: 7 clusters from 6,625 Djoko segments (`djoko_speakers.json`)
- TTT script: `experiments/G_living_weights/ttt_eval.py`
- Method: Load trajectory checkpoint → process speaker's utterances sequentially → update last 2 MLP layers after each → measure CER improvement curve
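
The method above amounts to a decode-then-adapt loop per speaker. The sketch below shows only the control flow, with the model-specific parts passed in as callables; it is not the contents of `ttt_eval.py`. The callback names, and the choice to adapt on the model's own hypothesis, are assumptions.

```python
def ttt_cer_curve(utterances, transcribe, adapt, cer):
    """Run test-time training over one speaker's utterances in order.

    utterances: list of (features, reference_text) for a single speaker
    transcribe: callable(features) -> hypothesis text
    adapt:      callable(features, hypothesis) -> None, updates the model in place
    cer:        callable(reference, hypothesis) -> float
    """
    curve = []
    for features, reference in utterances:
        hypothesis = transcribe(features)   # decode BEFORE adapting on this utterance
        curve.append(cer(reference, hypothesis))
        adapt(features, hypothesis)         # e.g. one update of the last 2 MLP layers
    return curve
```

Scoring each utterance before the update keeps the curve honest: every CER value reflects only what the model learned from earlier utterances of the same speaker.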
---
## Data Pipeline
### Training Data Sources
| Source | Samples | Type |
|--------|---------|------|
| AfVoices | ~260K | Read speech, multiple African languages |
| bam-asr-early | ~37K | Bambara read speech |
| Total | 290,596 | Combined, deduplicated in current verified snapshot |
### Djoko Soap Opera Extraction (IN PROGRESS)
- Source: YouTube channel "Koman Diabate - Film Djoko" (2,001 videos)
- Download: `asr/download_djoko.py` (PID 34858, 1,124/2,001 videos, 32,826 segments)
- Audio: 30s WAV segments at 16kHz mono
- Transcription: `asr/vastai_transcribe.py` on Vast.ai GPU
- First batch: 8,985 segments → 8.8
- Results: `djoko_transcriptions.jsonl`
- Speaker diarization: `asr/diarize_djoko.py` using MFCC + agglomerative clustering
- Results: `djoko_speakers.json` (7 speakers, 5 eligible for TTT)
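
For the 30-second segmentation step, a minimal standard-library sketch is shown below. It assumes the source audio has already been converted to 16 kHz mono WAV; `split_wav` and its naming scheme are illustrative, and `asr/download_djoko.py` may segment differently.

```python
import wave

def split_wav(src_path, out_prefix, segment_seconds=30):
    """Cut a WAV file into fixed-length segments; the last one may be shorter."""
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_per_segment = segment_seconds * src.getframerate()
        index = 0
        while True:
            frames = src.readframes(frames_per_segment)
            if not frames:
                break
            out_path = f"{out_prefix}_{index:04d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)    # the header frame count is fixed up on write
                dst.writeframes(frames)
            paths.append(out_path)
            index += 1
    return paths
```

At 16 kHz, each full segment holds 480,000 frames.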
### Parents' Audio (COMPLETE)
- Source: Voice memo, 24.5 min Malinke conversation
- Diarized: mom + dad separated
- N'Ko transcribed: Whisper large-v3, avg confidence 0.56
- Files: `parents-audio/segments/`
---
## Current Pipeline Architecture
Djoko YouTube Video
│
├── Audio → yt-dlp → 30s WAV segments
│ → Whisper encoder → 1280-dim features
│ → Our CTC head (nko_traj_best.pt) → N'Ko text (hypothesis A)
│ → FarmRadio Whisper → Latin text → transliterate → N'Ko (hypothesis B)
│
└── Video → Gemma 4 E4B (scene analysis)
→ character identification
→ speaker attribution
→ scene type (dialogue/action/music)
→ dialogue quality score (1-5)
│
└── Consensus: where A ≈ B AND quality ≥ 3 → training pair

---
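
The consensus rule in the pipeline diagram (hypotheses A and B agree, and dialogue quality is at least 3) can be sketched as a simple filter. The similarity measure and the `0.8` agreement threshold are assumptions, since the diagram does not define "≈"; only the quality cutoff of 3 comes from the diagram.

```python
from difflib import SequenceMatcher

def is_training_pair(hyp_a, hyp_b, quality, min_similarity=0.8, min_quality=3):
    """Accept a segment when both hypotheses agree and scene quality is high enough."""
    # ratio() is 1.0 for identical strings and 0.0 when nothing matches.
    similarity = SequenceMatcher(None, hyp_a, hyp_b).ratio()
    return similarity >= min_similarity and quality >= min_quality

# Toy usage with placeholder strings.
print(is_training_pair("abcde", "abcde", quality=4))
print(is_training_pair("abcde", "vwxyz", quality=5))
```

A character-level error rate between A and B would be an equally reasonable agreement measure; `SequenceMatcher` is used here only to keep the sketch self-contained.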
## Model Architecture Details
### UnifiedCTCHead (`experiments/B_script_advantage_ctc/train_vastai.py`)
Input: Whisper features (B, T, 1280)
→ input_proj: Linear(1280, 768) + GELU + Dropout
→ temporal_ds: Conv1d(768, 768, k=5, s=4) + GELU (4x downsample)
→ sinusoidal positional encoding
→ [if trajectory]: AudioTrajectoryScalars → TrajectoryBiasNetwork → bias
→ 6x TrajectoryTransformerLayer (d=768, heads=12, ff=3072)
└── each layer: self-attention with trajectory bias + FFN
└── [if graph]: GraphCrossAttention after self-attention
→ LayerNorm → Linear(768, num_classes)
Output: CTC logits (B, T/4, num_classes)
### Key Components
- AudioTrajectoryScalars: Conv1d + projection → 7 trajectory scalars per timestep
- TrajectoryBiasNetwork: Maps 7 scalars → 12 attention head biases with distance-dependent kernel
- GraphCrossAttention: K/V from GNN path embeddings (d=256), Q from decoder hidden states, gated residual
### Vocabularies
- N'Ko: 66 classes (U+07C0-07FF range + space + blank)
- Latin: 41 classes (a-z + ɛ/ɔ/ŋ/ɲ + accented vowels + space/' /- + blank)
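
The N'Ko class count can be reproduced from the stated range: U+07C0 through U+07FF is 64 code points, and adding space and the CTC blank gives 66. The sketch below builds such a vocabulary and applies greedy CTC decoding (collapse repeats, then drop blanks). The index ordering, the blank at index 0, and the function names are assumptions; the real mapping is in `vocab.json`.

```python
BLANK = "<blank>"

def build_nko_vocab():
    """64 code points in U+07C0..U+07FF plus space and the CTC blank."""
    chars = [chr(cp) for cp in range(0x07C0, 0x0800)]
    return [BLANK, " "] + chars   # blank at index 0 is an assumption

def ctc_greedy_decode(frame_ids, vocab, blank_id=0):
    """Collapse repeated ids, then drop blanks."""
    out = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != blank_id:
            out.append(vocab[idx])
        prev = idx
    return "".join(out)

vocab = build_nko_vocab()
print(len(vocab))  # 66
```

A blank between two identical ids is what lets CTC emit a doubled character: `[2, 2, 0, 2]` decodes to two copies of `vocab[2]`, while `[2, 2, 2]` decodes to one.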
---
## Paper Status
### Paper 4: "Phonetic Transparency as Architectural Advantage" (ACTIVE REVISION)
- File: `paper/current/paper4_script_advantage.tex`
- PDF: `Desktop/Paper4_ScriptAdvantage_297K.pdf` (12 pages)
- Audio narrative: `Desktop/Paper4_Narrative.mp3` (21.3 min, OpenAI TTS onyx voice)
- All figures: `figures/fig{1-6}_{name}.pdf` + `.png`
- Status: manuscript now needs to foreground the verified `20.57%` CER baseline
### Paper 5: "Compositional Generalization and Speaker Adaptation" (OUTLINE)
- File: `paper/paper5_outline.md`
- Thesis: N'Ko generalizes better to unseen vocab, expands without retraining, adapts faster to speakers
- Blocked on: Exp G results
---
## Figures (8 total, all generated)
| Figure | File | Source |
|---|---|---|
| Fig 1: CER comparison bar chart | `figures/fig1_cer_comparison.pdf` | `figures/gen_fig1.py` |
| Fig 2: Training loss curves (REAL) | `figures/fig2_loss_curves.pdf` | `figures/gen_fig2_real_loss.py` |
| Fig 3: Architecture asymmetry delta | `figures/fig3_delta.pdf` | `figures/gen_fig3_delta.py` |
| Fig 4: Data scale effect (37K→297K) | `figures/fig4_data_scale.pdf` | `figures/gen_fig4_scale.py` |
| Fig 5: Exp F compositional generalization | `figures/fig5_expF_compositional.pdf` | `figures/gen_fig5_expF.py` |
| Fig 6: Exp H vocabulary expansion | `figures/fig6_expH_vocab_expansion.pdf` | `figures/gen_fig6_expH.py` |
| (old) Fig 2: Compositional gen | `figures/fig2_compositional_generalization.pdf` | Previous session |
| (old) Fig 3: Vocab expansion | `figures/fig3_vocabulary_expansion.pdf` | Previous session |
---
## Key File Map
### Scripts
| File | Purpose |
|------|---------|
| `experiments/B_script_advantage_ctc/train_vastai.py` | PyTorch CTC trainer (4 modes × 2 scripts) |
| `asr/transcribe_nko.py` | Inference: audio → N'Ko text using trajectory model |
| `asr/vastai_transcribe.py` | Batch GPU transcription on Vast.ai |
| `asr/download_djoko.py` | YouTube audio downloader (streaming mode) |
| `asr/diarize_djoko.py` | Speaker clustering via MFCC embeddings |
| `asr/scene_analyzer.py` | Gemma 4 video frame scene analysis |
| `asr/extract_djoko_frames.py` | Video frame extraction for OCR |
| `experiments/G_living_weights/ttt_eval.py` | Test-time training per-speaker |
| `nko_pretext/renderer.py` | Synthetic N'Ko OCR image generator |
### Data
| File | Purpose |
|------|---------|
| `experiments/B_script_advantage_ctc/data/pairs.jsonl` | 37,305 training pairs (local) |
| `experiments/B_script_advantage_ctc/vastai_results_297k/` | All 297K results + logs |
| `djoko_transcriptions.jsonl` | 8,985 CTC transcriptions of Djoko segments |
| `djoko_speakers.json` | Speaker diarization (7 clusters) |
| `djoko_audio/segments/` | 32,826+ WAV segments (growing) |
| `parents-audio/` | Diarized parents' audio + transcriptions |
| `data/triples_cache.json` | Knowledge graph triples |
| `data/path_embeddings.npz` | Pre-computed GNN path embeddings |
### Checkpoints
| File | Model | CER |
|------|-------|-----|
| `results/paper4_reproduction_35205256/best.pt` | Verified N'Ko trajectory baseline | 20.57% |
| `vastai_results_297k/checkpoints/nko_graph_traj_best.pt` | N'Ko graph+trajectory | 30.46% |
| `vastai_results/nko_best.pt` | N'Ko baseline (37K) | 38.90% |
| `vastai_results/nko_graph_best.pt` | N'Ko graph (37K) | ~32% |
| `vastai_results_clean/nko_baseline_best.pt` | N'Ko baseline (297K) | 32.75% |
---
## Infrastructure
| Resource | Details |
|---|---|
| Vast.ai | RTX 4090, ~$0.30/hr. Instance 34072387 running. $15.66 credit remaining. |
| Mac5 | M4 16GB. Ollama installed. MLX available. |
| Mac1 | Build host. Djoko downloader running (PID 34858). |
| HuggingFace | Token available. Models: FarmRadio bambara-whisper-asr, Whisper large-v3. |
---
## What's Next (Priority Order)
1. #31 Run Gemma 4 E4B scene analysis on Vast.ai — filter Djoko segments by dialogue quality
2. #30 Run FarmRadio Whisper as second opinion on Vast.ai
3. #32 Build consensus labeling pipeline (CTC + Whisper + scene quality)
4. #33 Run Experiment G (TTT) on Vast.ai with diarized speakers
5. #34 Retrain CTC on expanded data (297K + Djoko consensus pairs)
6. #35 Write Paper 5 draft
## Known Issues
- CTC model has severe domain gap on Djoko audio (8.8
- Djoko has NO burned-in subtitles — OCR pathway not viable for this source
- macOS OMP double-load crashes PyTorch-based speaker embedding models (pyannote, resemblyzer) on Mac1 — workaround: MFCC-based embeddings
- Vast.ai instance SSH uses `[home-path]` key specifically
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
nko-brain-scanner/HANDOFF.md