Dead Circuits and Living Speech: Building the First N'Ko AI Pipeline

Extracted abstract or opening context

*Two papers in preparation for ACL/EMNLP 2026. All code and models open-source.*

- **What we found**: Qwen3-8B-Instruct processes N'Ko text with a 2.94x "translation tax" (L2 norm deficit) across all 36 transformer layers. Circuit duplication analysis (55 configurations, RYS methodology) finds 0/55 N'Ko-advantageous configurations. Three-zone failure analysis reveals structurally distinct collapse modes at embedding, middle, and output layers. Arabic, another RTL script, sits within 7% of English on the same metrics. The failure is entirely data-driven.
- **What we built**: (1) A three-stage LoRA pipeline (CPT + SFT + BPE-aware, 185 minutes on Apple M4) that reduces the translation tax from 2.94x to 0.70x. (2) The first audio-to-N'Ko ASR system, with a four-version architecture progression from BiLSTM baseline (56% CER) to Whisper LoRA fine-tune (29.4% CER). A deterministic cross-script bridge, a 4-state FSM for phonotactic validation, and a downstream NLLB-200 translation pipeline complete the stack.
- **The numbers**: Translation tax: 2.94x to 0.70x (76% reduction). ASR: V1 56% CER / 91.5% WER to V4 29.4% CER / 62.3% WER. Prediction confidence: 0.46 to 0.82 (79% improvement). Per-sample: worst case au30 drops from WER 15.8 to 1.0 (93.7% improvement). Total compute cost across all experiments: $14.

N'Ko is the only writing system in history designed from the ground up with the computational properties that NLP engineers dream about when building synthetic alphabets. Solomana Kante designed it in 1949 in Kankan, Guinea, in response to a claim that African languages could not be written. The result is a right-to-left alphabetic script occupying Unicode block `U+07C0--U+07FF` (standardized 2006) with properties that no evolved script can match:

- **Strict phoneme-grapheme bijection**: every phoneme in the Manding inventory maps to exactly one character. No digraphs, no silent letters, no context-dependent pronunciation rules.
- **Explicit tonal diacritics**: Bambara is tonal; N'Ko marks tone with combining characters above vowels. Latin Bambara orthography does not mark tone.
- **Zero spelling irregularities**: English has ~1,100 letter-to-sound rules for ~44 phonemes. N'Ko has 1-to-1 correspondence for 33 phonemes.
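
As a rough illustration of the script properties listed above, the following Python sketch checks whether characters fall inside the N'Ko block and groups each base letter with the combining tone marks that follow it. It relies only on the standard-library `unicodedata` module. The helper names (`in_nko_block`, `split_base_and_marks`, `nko_ratio`) are hypothetical and are not taken from the project's code; the pipeline itself may handle these steps differently.

```python
import unicodedata

# N'Ko Unicode block bounds, as stated in the text above.
NKO_START, NKO_END = 0x07C0, 0x07FF


def in_nko_block(ch: str) -> bool:
    """Return True if a single character lies inside the N'Ko block."""
    return NKO_START <= ord(ch) <= NKO_END


def split_base_and_marks(text: str) -> list[tuple[str, str]]:
    """Group each base character with the combining marks that follow it.

    unicodedata.combining() returns a non-zero canonical combining class
    for combining marks, which includes the N'Ko tone diacritics.
    """
    groups: list[tuple[str, str]] = []
    for ch in text:
        if unicodedata.combining(ch) and groups:
            # Attach the mark to the most recent base character.
            base, marks = groups[-1]
            groups[-1] = (base, marks + ch)
        else:
            # A new base character (or a stray leading mark).
            groups.append((ch, ""))
    return groups


def nko_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that lie in the N'Ko block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(in_nko_block(ch) for ch in chars) / len(chars)


if __name__ == "__main__":
    # U+07CA NKO LETTER A, U+07EB NKO COMBINING SHORT HIGH TONE, U+07D3 NKO LETTER BA
    sample = "\u07CA\u07EB\u07D3"
    for base, marks in split_base_and_marks(sample):
        print(f"U+{ord(base):04X}", [f"U+{ord(m):04X}" for m in marks])
    print("N'Ko ratio:", nko_ratio(sample))
```

A check like `nko_ratio` could serve as a cheap filter when assembling N'Ko training text, and separating base letters from tone marks makes it possible to score transcriptions with or without tone. Both uses are suggestions here, not steps the source describes.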

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.