research note | experiment writeup candidate | score 18

The Script That Machines Can't Read

N'Ko is an alphabetic script used by over 40 million Manding-language speakers across West Africa. It has a Unicode block (U+07C0-U+07FF), a Wikipedia with thousands of articles, and a vibrant literary tradition. But when you feed N'Ko text to state-of-the-art language models, they choke. Not subtly. Catastrophically.

Extracted abstract or opening context

This is the story of how I taught a language model to read N'Ko, on a laptop, for less than $2.

When I ran Qwen3-8B on N'Ko text, the model produced garbled output. Its N'Ko perplexity was 11.02, compared to 3.8 for English. That is a 2.90x "translation tax": by this measure, the model found N'Ko nearly three times harder to predict than English. N'Ko token accuracy sat at 23%, and about 10% of generated syllables were phonotactically invalid.

The root cause: the N'Ko block spans just 64 codepoints in Unicode. During pre-training, these 64 characters compete with tens of thousands of CJK, Latin, and Cyrillic tokens for model capacity. The model learns to represent N'Ko, barely, through its general multilingual abilities, but it never develops deep understanding.

Before fine-tuning, I wanted to understand what was happening inside the model. I built an "activation profiler" (I call it a brain scanner) that records the hidden-state norms at every transformer layer while the model processes English vs. N'Ko text.
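The two measurements described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in the experiment. It assumes perplexity is computed as the exponential of the negative mean natural-log probability per token, that the "translation tax" is simply the ratio of the two perplexities, and that the profiler reduces each layer to the mean L2 norm of its per-token hidden vectors. The function names (`perplexity`, `translation_tax`, `layer_norm_profile`) and the nested-list input layout are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity as exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def translation_tax(ppl_other, ppl_english):
    """Ratio of another language's perplexity to English perplexity."""
    return ppl_other / ppl_english


def layer_norm_profile(hidden_states):
    """Mean L2 norm of the token vectors at each layer.

    Assumed layout: hidden_states[layer][token] is a list of floats,
    and every layer holds at least one token vector.
    """
    profile = []
    for layer in hidden_states:
        norms = [math.sqrt(sum(x * x for x in vec)) for vec in layer]
        profile.append(sum(norms) / len(norms))
    return profile


# Perplexities reported in the note: 11.02 for N'Ko, 3.8 for English.
print(f"{translation_tax(11.02, 3.8):.2f}x")
```

Running the same `layer_norm_profile` over an English passage and an N'Ko passage and comparing the two lists layer by layer is one simple way to see where the representations diverge. With a real model, the per-layer vectors would come from whatever mechanism the framework offers for exposing hidden states.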

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.