Grand Diomande Research · Full HTML Reader

LinkedIn Post — Week 1

In 1949, in the city of Kankan, Guinea, a self-taught linguist named Solomana Kante sat down and designed a writing system from scratch.

Language as Infrastructure research note experiment writeup candidate score 18 .md

Full Public Reader

LinkedIn Post — Week 1

---

In 1949, in the city of Kankan, Guinea, a self-taught linguist named Solomana Kante sat down and designed a writing system from scratch.

He was frustrated by a claim he'd read: that African languages were inherently unsuitable for writing. He spoke Manding, a family of languages used by over 40 million people across West Africa. Bambara in Mali, Maninka in Guinea, Dioula in Cote d'Ivoire. These languages had been written in Arabic script for centuries, and in Latin script since colonization. Neither was designed for them. Arabic doesn't capture Manding vowel distinctions. Latin doesn't encode its tonal system.

So Kante built N'Ko (ߒߞߏ), which means "I say" in all Manding languages.

Here's what makes N'Ko different from English, French, Arabic, or any other major writing system:

Every character maps to exactly one sound. Every sound maps to exactly one character. This is called bijective mapping. English doesn't have it ("ough" has six pronunciations, "knight" has six letters and two sounds). French doesn't have it. Arabic doesn't have it. Kante engineered it deliberately.

He also encoded tone directly into the script. In Manding, the word "ba" means mother, goat, or river depending on pitch. In Latin Bambara, all three look identical: "ba." In N'Ko, each one is written differently using combining diacritical marks for high, low, and mid tone. The semantic information is in the text, not left for the reader to guess from context.

27 base characters. Zero spelling exceptions. Zero silent letters. Zero irregular forms. A 64-character Unicode block (U+07C0 through U+07FF). The entire script fits in the space that English uses for uppercase letters alone.

Kante designed this for humans. But these properties have computational consequences nobody anticipated.

Last year, I came across a paper by David Noel Ng called "Reasoning Yield from Stacking." He found that duplicating certain layers inside a language model improved reasoning by 17.72

He never tested a bijective script.

My parents speak Malinke. I grew up hearing this language. So I built a brain scanner to see what happens when a model reads N'Ko.

Qwen3-8B, 36 transformer layers, 4,096 neurons per layer. 100 parallel sentences in English and N'Ko. At every layer, I extracted the hidden state and measured activation magnitude, entropy, sparsity, and circuit specialization.

The model processes N'Ko at 30

I replicated Ng's layer duplication across both scripts. For English, clear reasoning improvement bands appeared. For N'Ko, nothing. Every configuration scored at chance. The reasoning circuits exist in the architecture. They have no signal to work with.

I call this the translation tax: 2.94x. The model works nearly three times harder to process N'Ko, and still fails. Not because N'Ko is hard. Because the model has never seen it. I ran the same scan on Qwen2.5-7B (3.59x tax) and Mistral-7B (2.67x). Three model families, three companies, same structural deficit. This is how the systems are built.

Now here's where N'Ko's construction starts to matter in a way Kante couldn't have predicted.

Because of the bijective mapping, character error rate (CER) on N'Ko output means something different than CER on English or Latin. If a model outputs N'Ko and gets 70

You can't do that with Latin Bambara. The digraph "ny" is one sound written as two characters. The letter "c" changes pronunciation depending on what follows it (a French colonial artifact baked into Bambara spelling). CER on Latin output is a noisy approximation. CER on N'Ko output is a direct phonemic measurement.

Most ASR research uses Word Error Rate, which treats a one-letter typo and a completely hallucinated word as equal errors. For a tonal, agglutinative language like Bambara, WER is the wrong metric. It was designed for English. CER on a bijective script is the metric that measures what matters: did the machine hear the sounds?

I built the pipeline to test this. 290,596 Bambara speech samples. Whisper large-v3 encoder features extracted once, stored permanently. CTC decoding into both scripts simultaneously. Same audio. Same architecture. Same training data. The only difference: the character set the decoder writes in.

When I injected 7 temporal acoustic scalars (velocity, acceleration, jerk, spectral flux, zero-crossing rate, harmonic ratio, onset strength) into the decoder, N'Ko CER dropped by 5.25 percentage points. For Latin, the exact same scalars produced zero improvement. The decoder can use acoustic trajectory information when the output space has a clean 1:1 mapping to sounds. When the output space has spelling ambiguity, the information gets wasted.

This is not a theoretical argument. It's a controlled experiment on 290,596 pairs.

What I find compelling about this process is how each piece required building the previous piece. You can't measure the script advantage without CER on bijective output. You can't get bijective output without a transliteration bridge (which I built, and which uncovered 6 colonial encoding artifacts hiding in the Latin Bambara conventions). You can't run the transliteration without understanding N'Ko's phonological rules. You can't understand those rules without understanding what Kante was trying to solve in 1949.

The writing system, the measurement framework, the acoustic decoder, the transliteration bridge, the evaluation metric. Each one had to be constructed. Each one revealed properties that didn't exist until it was built. That's what this work is: constructing the pieces that let you see something that was always there in the script's design but that nobody had the tools to measure.

Under $50 in total GPU compute. Everything else on Apple Silicon.

Code: github.com/Diomandeee/nko-brain-scanner
Data: huggingface.co/datasets/Diomande/bambara-whisper-features

#NLP #ASR #MachineLearning #AfricanLanguages #SpeechRecognition #Linguistics #OpenSource

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

content-pipeline/linkedin-nko-week1.md

Detected Structure

Method · Evaluation · Architecture