Grand Diomande Research · Full HTML Reader

Living Speech: Building the First N'Ko ASR for 14

Every speech recognition system for Bambara outputs Latin characters. French colonial characters designed for French colonial administrators. Not for the 40 million people who actually speak and read these languages.

Language as Infrastructure research note experiment writeup candidate score 26 .md

Full Public Reader

Living Speech: Building the First N'Ko ASR for $14

Every speech recognition system for Bambara outputs Latin characters. French colonial characters designed for French colonial administrators. Not for the 40 million people who actually speak and read these languages.

I built the first one that outputs N'Ko.

This is the story of how a $14 experiment on consumer hardware produced a speech recognition system that competes with models 38 times its size. And why Solomana Kante's 1949 design decisions turned out to be a machine learning advantage nobody predicted.

---

The Problem Nobody Solved

There is no N'Ko speech corpus. Every Bambara audio dataset has Latin transcriptions. The FarmRadio datasets, the MALIBA-AI models, the Bayelemabaga corpus. All Latin output.

If you want to build a system that listens to Bambara and writes N'Ko, you first need to solve a problem that nobody has needed to solve before: convert thousands of Latin Bambara transcriptions into valid N'Ko.

That sounds simple. It isn't.

The Bridge: Undoing Colonial Orthography

Latin Bambara was designed by French colonial linguists in the 20th century. It reflects French phonological conventions, not Manding phonological reality.

The digraph "ny" represents a single sound (the palatal nasal). But is "n" followed by "y" the digraph, or two separate sounds? The decoder can't tell without phonological context.

N'Ko doesn't have this problem. Every sound gets one character. Kante designed it that way in 1949.

I built a deterministic bridge: Latin Bambara to IPA to N'Ko Unicode.

B: Latin Bambara -> IPA -> N'Ko

During development I hit six distinct bug classes. None of them were typical programming bugs. They were all places where Latin orthography conceals information that N'Ko was designed to express:

1. Greedy character matching. "ny" is one phoneme, not two. You have to consume digraphs before single characters.
2. Missing consonants. The phoneme /g/ had no IPA-to-N'Ko entry. Words with /g/ produced Latin characters embedded in N'Ko output.
3. Extended IPA symbols. FarmRadio's transcription model uses symbols for loanwords and dialectal variants not in the standard Bambara inventory.
4. Cascading gaps. Fixing bug 1 revealed that digraph outputs from stage 1 had no entries in stage 2.
5. Unicode decomposition. Pre-composed toned vowels (like "a" with grave accent) need canonical decomposition before lookup.
6. RTL rendering. N'Ko reads right-to-left. Spaces between N'Ko words need the RTL mark (U+200F) to render correctly in left-to-right environments.

Each bug is a small document of where colonial orthography erased phonemic structure. The bridge doesn't just convert scripts. It recovers information that was encoded away.

After fixing all six: 99.4

The Grammar Machine

N'Ko syllables follow a strict structure: optional consonant, required vowel, optional nasal coda. Written as (C)V(N). No consonant clusters. No exceptions.

This is a formal, closed system. There is zero reason to ask a neural network to learn it from data when you can specify it exactly. I built a 4-state finite-state machine:

Start  + consonant -> Onset
Start  + vowel     -> Nucleus
Onset  + vowel     -> Nucleus
Onset  + consonant -> REJECT (no CC clusters)
Nucleus + tone     -> Nucleus
Nucleus + nasal    -> Coda
Nucleus + consonant -> Onset (new syllable)

When the FSM hits an invalid transition, it doesn't throw out the character. It replaces it with the highest-probability valid token from the CTC decoder's output distribution at that time step. In practice, most corrections insert a vowel to resolve consonant clusters that the acoustic signal probably carried but the neural network missed.

Result: 100

This guarantee is only possible because N'Ko's rules are complete and exception-free. You cannot build an equivalent FSM for Latin Bambara because of digraph ambiguity and French loan words that violate Manding syllable structure.

This is free accuracy. A structural guarantee that is architecturally impossible for Latin-output systems.

Four Versions, $14 Total

I used frozen Whisper Large V3 as the acoustic encoder. 37 hours of Bambara speech from bam-asr-early. Latin transcriptions run through the bridge for N'Ko targets.

**V1: BiLSTM, 56

Garbage. But informative garbage.

Architecture search: 28 configurations. Before scaling up, I tested every combination of architecture family (BiLSTM, Transformer, Conformer), hidden dimension (256, 512, 768), depth (2, 4, 6 layers), and temporal downsampling (4x, 8x, 16x).

Three findings:
- Transformers beat BiLSTMs by 15-20 points at every scale. Self-attention's global context window is exactly what N'Ko syllable structure needs.
- 4x downsampling consistently beats 8x and 16x. N'Ko characters represent individual phonemes, shorter acoustic events than higher downsampling rates assume.
- Conformers underperform Transformers at 37 hours. The local convolution kernels overfit to speaker-specific patterns with this little data.

**V3: Transformer, 33

A 23-point CER improvement over V1.

**V4: Whisper LoRA, 29.4

The dramatic result is per-sample. Sample au30, V3's worst (WER 15.8, total failure), improves to WER 1.0 under V4. A 93.7

Mean prediction confidence jumps from 0.46 to 0.82. The model isn't just predicting better characters. It's predicting with nearly twice the certainty.

VersionParamsCERWERCost
V1 BiLSTM5.4M56.0
V2 Transformer22.1M45.7
V3 Transformer46.9M33.0
V4 Whisper LoRA52.8M29.4
MALIBA-AI v3~2Bn/a*45.73

Total compute: $14.

MALIBA-AI's Latin-output system uses approximately 2 billion parameters and achieves 45.73

Why N'Ko Has a Structural Advantage

I proved this formally. The proof is from the structure of CTC (Connectionist Temporal Classification), the algorithm that aligns audio frames to output characters.

CTC works by searching over all possible alignments between the audio and the target text. The more ambiguous the output vocabulary, the harder this search is.

N'Ko's output space: 36 classes. One character per phoneme, plus blank. No ambiguity.

Latin Bambara's output space: ~40 character classes, but with digraphs creating segmentation ambiguity. The CTC decoder has to learn that "n" followed by "y" might be one phoneme or two. That learning takes model capacity and training data.

Theorem (Phonetic Transparency Advantage): Given identical architecture and capacity, a CTC decoder targeting a bijective transcription function achieves CER less than or equal to a decoder targeting a many-to-many function.

The proof: N'Ko's bijective mapping eliminates insertion and deletion errors caused by digraph segmentation ambiguity. Latin's many-to-many mapping creates exactly those errors.

In practice, this means: for the same amount of training data and the same model size, N'Ko will always produce fewer character errors than Latin Bambara. The advantage is structural, not data-dependent. More data helps both systems, but N'Ko starts ahead.

Kante couldn't have known this in 1949. But when he designed a script where every phoneme gets one character, he was solving the CTC alignment problem 57 years before CTC was invented.

The Bigger Picture

This research is part of a larger arc:

Paper 1 (Dead Circuits) showed that AI models process N'Ko at 30

Paper 2 (Living Speech) is this post. The first N'Ko ASR system. $14 total. 28 architectures tested.

Paper 3 (Script Invisibility Is Structural) confirmed the deficit appears identically in three different models from two companies. It's universal.

Paper 4 (Does Script Design Matter?) formalized the phonetic transparency advantage with a mathematical proof.

Two more papers are in progress. One tests whether N'Ko gives a controlled CER advantage over Latin on identical audio (training running right now). The other asks whether a personalized AI twin behaves differently when retrained in N'Ko.

All code, data, and papers: [github.com/Diomandeee/nko-brain-scanner](https://github.com/Diomandeee/nko-brain-scanner)

What Comes Next

More data. The afvoices dataset has 612 hours of Bambara speech. We trained on 37. Scaling to 612 is the single highest-leverage move.

Tone labels. 38

The controlled experiment. Training two identical CTC decoders on the same audio, one outputting N'Ko and one outputting Latin, to get a clean CER comparison. This is running right now on two Mac Minis.

The first time a machine wrote N'Ko from speech cost $14 and ran on consumer hardware. The script Kante designed in 1949 turns out to be computationally optimal for speech recognition. The 1:1 phoneme mapping, the explicit tone marks, the exception-free syllable structure. Every design decision is a machine learning advantage.

All it needed was someone to build the on-ramp.

---

Mohamed Diomande is an independent researcher. This is the second post in a series on N'Ko and AI. The first post, "The Script Machines Can't Read," covers the brain scan findings.

All research was conducted on Apple Silicon at a total compute cost of $19 across all experiments. Code and papers are open source.

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

nko-brain-scanner/blog/substack/02-living-speech.md

Detected Structure

Method · Evaluation · Architecture