Grand Diomande Research · Full HTML Reader

Why Word Error Rate Is the Wrong Metric for Bambara Speech Recognition

Every Bambara ASR system published today reports Word Error Rate. MALIBA-AI's bambara-asr-v3 reports 45.73% WER. Normalized, that drops to 13.23% WER. Those numbers sound meaningful. They are not.

Why Word Error Rate Is the Wrong Metric for Bambara Speech Recognition

And what we should be measuring instead

---

The Problem with WER

Every Bambara ASR system published today reports Word Error Rate. MALIBA-AI's bambara-asr-v3 reports 45.73% WER. Normalized, that drops to 13.23% WER. Those numbers sound meaningful. They are not.

WER counts the word-level insertions, deletions, and substitutions needed to transform a predicted transcript into the reference, divided by the number of words in the reference. It was designed for English, where words are separated by spaces, spelled consistently, and carry stable meaning across contexts.
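The definition above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and the usual Levenshtein alignment; the function names `edit_distance` and `wer` are ours, not taken from any particular toolkit.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref_words = reference.split()  # assumption: words are whitespace-separated
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# One substitution and one deletion against a four-word reference.
print(wer("a b c d", "a x c"))  # 0.5
```

Note that the comparison `r == h` is exact string equality. Everything that follows in this post is a consequence of that one line.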

Bambara written in Latin script has none of these properties.

Latin Bambara Is Not a Stable Target

Take the Bambara word for "goat." Depending on who transcribed it, you might see: ba, baa, bà, or bâ. Those are all the same word. But WER treats each spelling as a distinct token. If your model outputs "ba" and the reference says "baa", that is a substitution error: 100% error on that word, even though the two spellings differ by a single letter.

It gets worse. Bambara has digraphs: "ny" for the palatal nasal, "ng" for the velar nasal. Is "nyuman" one word or two? Does "n'ka" contain an apostrophe or not? Different transcribers make different choices. WER punishes the model for disagreeing with the transcriber, not for getting the language wrong.

And then there is tone. Bambara is a tonal language. The word "ba" means mother, goat, or river depending on its tone. Latin script has no standard way to mark tone. Some transcribers use diacritics. Most do not. WER has no way to distinguish "got the word right but missed the tone" from "got a completely different word." It collapses a gradient of correctness into binary right/wrong.

When you report WER on Latin Bambara, you are measuring agreement with one transcriber's spelling conventions. You are not measuring whether the system understood the speech.
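To make the spelling problem concrete, here is a small sketch using the four spellings from the example above. It counts how many distinct tokens an exact-match metric sees, before and after stripping diacritics. The `strip_diacritics` helper is a hypothetical normalization written for this illustration; it is not the normalization behind any published "normalized WER" figure, which this post does not describe.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (tone diacritics included) from Latin text."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


# The four spellings listed in the post.
variants = ["ba", "baa", "bà", "bâ"]

raw_tokens = set(variants)
normalized_tokens = {strip_diacritics(v) for v in variants}

print(len(raw_tokens))            # 4 distinct tokens for an exact-match metric
print(sorted(normalized_tokens))  # ['ba', 'baa']
```

Stripping diacritics merges the tone-marked spellings, but it does so by throwing tone away, and it still leaves "ba" and "baa" as different tokens. Normalization lowers the number without making it mean more.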

What CER on N'Ko Actually Measures

N'Ko was designed by Solomana Kante with a single constraint: one sound, one character. No exceptions. No digraphs. No ambiguous spellings. Every consonant has one symbol. Every vowel has one symbol. Tone is marked with explicit diacritics above the character.

This means Character Error Rate on N'Ko output is a direct measurement of phonemic accuracy.

If the model outputs the wrong N'Ko character, it has identified the wrong phoneme. If it outputs the right character, it has identified the right phoneme. There is no ambiguity, no spelling convention, no transcriber disagreement to confuse the signal.
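A rough sketch of what that looks like in code. We assume here that CER is computed over Unicode code points, so a combining tone mark counts as one character; how tone marks are grouped into output classes in a given model is a separate choice that this sketch does not settle. The three-code-point strings below are illustrative sequences built from N'Ko letters and tone marks, not verified dictionary spellings.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character edits divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# N'Ko code points (Unicode block U+07C0..U+07FF).
BA, PA, A = "\u07D3", "\u07D4", "\u07CA"  # letters BA, PA, A
HIGH, LOW = "\u07EB", "\u07EC"            # combining short high / short low tone

reference = BA + A + HIGH
wrong_tone = BA + A + LOW        # right consonant and vowel, wrong tone
wrong_consonant = PA + A + HIGH  # wrong consonant, right vowel and tone

print(cer(reference, wrong_tone))       # one edit out of three characters
print(cer(reference, wrong_consonant))  # one edit out of three characters
```

Each error lands on exactly one symbol, and each symbol names exactly one sound or tone, so the edit list doubles as a list of which sounds the model got wrong.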

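A 27.50% CER on N'Ko means the model's output needs about 27.5 character edits per 100 reference characters, and under the one-sound-one-character design each of those edits is a phoneme-level error.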

The Numbers Tell the Story

In our controlled experiments on 290K+ Bambara speech pairs, we trained identical architectures targeting both scripts:

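Script	Metric	What It Measures
N'Ko (66 classes)	27.50% CER	Phonemic accuracy (one character per phoneme)
Latin (~104 classes)	30.32% CER	Noisy proxy (characters do not map reliably to phonemes)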

The Latin CER is not directly comparable to the N'Ko CER because Latin characters do not reliably correspond to phonemes. A correct Latin character might represent the wrong sound (if tone is unmarked). An incorrect Latin character might represent the right sound (if the transcription convention differs from the reference).

N'Ko CER is unambiguous. Latin CER is noisy. And Latin WER is noisier still, because it compounds character-level ambiguity with word-boundary ambiguity.

A Formal Argument

For a transcription function f mapping phonemes to characters:

If f is bijective (one-to-one and onto, like N'Ko), then CER equals the phoneme error rate exactly. Every character error is a phoneme error. Every phoneme error is a character error.

If f is many-to-many (like Latin Bambara, where digraphs map two characters to one phoneme, and tone is unmarked), then CER is a noisy proxy. Character errors and phoneme errors are correlated but not identical. And WER adds another layer of noise on top.
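The two cases can be illustrated with a toy inventory. Everything below is hypothetical: four phoneme labels, a `BIJECTIVE` map that gives each phoneme its own placeholder symbol (uppercase letters standing in for a one-symbol-per-sound script, not actual N'Ko), and a `DIGRAPH` map that spells the palatal nasal with two letters the way Latin Bambara does.

```python
from typing import Dict, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over any sequence (phoneme list or string)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def render(phonemes: Sequence[str], mapping: Dict[str, str]) -> str:
    """Apply a transcription function f to a phoneme sequence."""
    return "".join(mapping[p] for p in phonemes)


# Toy transcription functions (illustrative only).
BIJECTIVE = {"n": "N", "y": "Y", "ny": "Ñ", "a": "A"}   # one symbol per phoneme
DIGRAPH = {"n": "n", "y": "y", "ny": "ny", "a": "a"}    # palatal nasal = two letters

ref_phonemes = ["ny", "a"]      # palatal nasal + vowel
hyp_phonemes = ["n", "y", "a"]  # model heard two separate consonants

phoneme_errors = edit_distance(ref_phonemes, hyp_phonemes)
bijective_errors = edit_distance(render(ref_phonemes, BIJECTIVE),
                                 render(hyp_phonemes, BIJECTIVE))
digraph_errors = edit_distance(render(ref_phonemes, DIGRAPH),
                               render(hyp_phonemes, DIGRAPH))

print(phoneme_errors, bijective_errors, digraph_errors)  # 2 2 0
```

Under the one-symbol map, the character edit count matches the phoneme edit count. Under the digraph map, two different phoneme sequences render to the same string "nya", so two phoneme errors show up as zero character errors.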

We prove in our paper that for identical model architectures and training data, the bijective transcription function yields CER less than or equal to the many-to-many function. This is not an empirical observation. It is a mathematical consequence of the output space structure.

What This Means for Bambara ASR

Stop reporting WER on Latin Bambara as if it means something precise. It does not. It measures transcription agreement, not speech understanding.

If you want a metric that tells you whether your ASR system actually recognizes Bambara sounds, decode into N'Ko and report CER. The script was literally designed for this. Every character is a phoneme. Every phoneme is a character. The metric and the measurement are the same thing.

This is not about cultural preference. It is about measurement validity. N'Ko CER is a better instrument for measuring Bambara ASR accuracy than Latin WER, for the same reason a scale is a better instrument for measuring weight than a ruler.

We propose N'Ko CER as the standard evaluation metric for Manding ASR. Not because N'Ko is culturally important (it is), but because it is the only output representation where character-level accuracy is phonemically interpretable.

Solomana Kante built a perfect phonemic encoding in 1949. It took us 77 years to realize that machines need it as much as people do.

Source Anchor

nko-brain-scanner/blog/posts/05-why-wer-is-wrong.md
