Grand Diomande Research · Full HTML Reader

When Unicode Is Not Understanding

Most people think the machine has crossed the line once the script appears on the
screen.

The characters render. The cursor moves right to left. The prompt accepts the
input. The model returns something confident. From the outside, the system looks
as if it can read.

That is not what reading means.

N'Ko makes the difference visible because it is not a broken script, not an
informal spelling habit, and not a niche encoding trick. It is a real writing
system designed for Manding languages, standardized in Unicode from U+07C0 through
U+07FF, used in books, education, religious texts, newspapers, keyboards, and
online writing. If the machine fails on N'Ko, the failure is not that N'Ko is
ill-formed. The failure is that the machine's world is incomplete.
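The Unicode side of that claim is easy to check in code. The sketch below tests whether characters fall inside the N'Ko block cited above. The helper names `is_nko` and `nko_fraction` are hypothetical, introduced only for illustration. Note what it does and does not show: it confirms that the text is encoded as N'Ko, which is exactly the level at which a system can look competent without being so.

```python
# N'Ko Unicode block boundaries, as cited in the text above.
NKO_START, NKO_END = 0x07C0, 0x07FF


def is_nko(ch: str) -> bool:
    """True if a single character falls inside the N'Ko block."""
    return NKO_START <= ord(ch) <= NKO_END


def nko_fraction(text: str) -> float:
    """Share of non-whitespace characters that are N'Ko code points."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_nko(c) for c in chars) / len(chars)


sample = "\u07d2\u07de\u07cf"  # three N'Ko letters
print(nko_fraction(sample))   # 1.0
print(nko_fraction("N'Ko"))   # 0.0 (Latin transliteration, not the script)
```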

That was the first experiment: look inside the model and ask whether N'Ko has a
real internal life there.

The brain scan

The method was simple in spirit. Take a large language model. Feed it matched text
in several scripts it should be able to process. Then extract the hidden states
layer by layer: the internal vectors the model builds before it ever speaks.
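A minimal sketch of that extraction step, assuming the Hugging Face `transformers` and `torch` libraries. This is not the project's actual pipeline; the function names `extract_hidden_states` and `pool_layers` are hypothetical, and the mean-pooling choice is an assumption made so that texts with different token counts can be compared.

```python
import numpy as np


def extract_hidden_states(model_name: str, text: str) -> list:
    """Return one (tokens, hidden_dim) array per layer for a single text.

    The heavy imports are local so pool_layers can be used without them.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # hidden_states holds the embedding output plus one tensor per layer,
    # each shaped (batch, tokens, hidden_dim); batch index 0 is our text.
    return [h[0].float().cpu().numpy() for h in outputs.hidden_states]


def pool_layers(layers: list) -> np.ndarray:
    """Mean-pool over tokens so each layer becomes a single vector."""
    return np.stack([layer.mean(axis=0) for layer in layers])
```

Calling `extract_hidden_states` once per script on matched sentences, then comparing the per-layer arrays, is the general shape of the measurement described here.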

The point was not to ask the model, "Do you know N'Ko?" A model can bluff. It can
route through French or English. It can produce an answer that sounds plausible
while never building a clean representation of the script itself.

So we measured the machinery.

The older blog drafts called this a brain scan. That is still the right metaphor.
A transformer has layers. Early layers build token and script representations.
Middle layers combine those representations into meaning. Later layers prepare a
prediction. If a script is really present in the model, it should leave a
signature across that stack.

For N'Ko, the signature was weak.

In one Qwen3-8B protocol, the model processed N'Ko with an average representation
tax around 2.94x. In a cross-model protocol, the tax was 3.30x for Qwen3-8B,
3.59x for Qwen2.5-7B, and 2.67x for Mistral-7B. Those are not universal constants.
They are measurements under a specific protocol. But the pattern is consistent:
N'Ko enters the model, yet the model does not give it the same internal support it
gives better-represented scripts.

![N'Ko activation deficit](../final/01-script-invisibility/figures/brain_scan_l2_comparison.png)

What the numbers mean

The first number is activation energy. If the model has useful pathways for a
script, the hidden states carry strong signal. If the script is poorly represented,
the signal is weaker. N'Ko produced lower activation energy across the stack.

The second number is entropy. Entropy tells you whether the model is concentrating
or spreading out. High entropy means the internal representation is diffuse. It is
the model shrugging with mathematics. N'Ko produced more diffuse hidden states.

The third number is sparsity. A sparse representation has many inactive
dimensions. That can be useful when a model has learned specialized circuits, but
it is dangerous at the embedding stage. For N'Ko, the early representation looked
underconnected. Too many dimensions were doing too little.

The fourth number is specialization. A model that knows a script should develop
peaked, task-relevant activation patterns. N'Ko showed weaker specialization near
the places where the model should have been preparing a confident prediction.

That combination matters. One weak metric might be noise. Four related metrics
pointing in the same direction are a diagnosis.
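The four measurements can be sketched in a few lines. This note does not give their exact formulas, so every definition below is an assumption chosen for illustration: energy as the mean L2 norm, entropy as Shannon entropy over normalized absolute activations, sparsity as the fraction of near-zero values, and specialization as a simple peak-to-mean ratio. The input `h` is a `(tokens, hidden_dim)` array for one layer.

```python
import numpy as np


def activation_energy(h: np.ndarray) -> float:
    """Mean L2 norm of the token vectors in one layer (assumed definition)."""
    return float(np.linalg.norm(h, axis=-1).mean())


def activation_entropy(h: np.ndarray, eps: float = 1e-12) -> float:
    """Mean Shannon entropy (nats) of each token's normalized |activation|."""
    a = np.abs(h)
    p = a / (a.sum(axis=-1, keepdims=True) + eps)
    return float(-(p * np.log(p + eps)).sum(axis=-1).mean())


def sparsity(h: np.ndarray, threshold: float = 1e-2) -> float:
    """Fraction of activations whose magnitude is below the threshold."""
    return float((np.abs(h) < threshold).mean())


def specialization(h: np.ndarray, eps: float = 1e-12) -> float:
    """Peakedness: largest |activation| relative to the mean |activation|."""
    a = np.abs(h)
    return float((a.max(axis=-1) / (a.mean(axis=-1) + eps)).mean())
```

Under these assumed definitions, a diffuse vector such as `[1, 1, 1, 1]` has maximal entropy and a specialization of 1, while a one-hot vector has entropy near zero and a high peak-to-mean ratio.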

Arabic was the control

Right-to-left writing is the obvious suspect. N'Ko is written right to left. Maybe
the model was confused by direction.

But Arabic is also written right to left, and modern large language models handle
Arabic far better. The older experiments showed Arabic sitting close to English
on the same kinds of metrics, while N'Ko remained far behind.

That control changes the interpretation. The model is not failing because
right-to-left text is impossible for transformers. It is failing because N'Ko did
not receive enough data, tokenizer allocation, and training pressure to build
useful pathways.
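Tokenizer allocation can be made concrete with a small worst-case illustration. Assume a byte-level tokenizer that has learned no merges for a script: each character then costs as many tokens as its UTF-8 bytes. The helper name `utf8_bytes_per_char` is hypothetical, and this is an upper bound under that assumption, not a measurement of any particular model's tokenizer.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per non-whitespace character."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(len(c.encode("utf-8")) for c in chars) / len(chars)


print(utf8_bytes_per_char("hello"))               # 1.0
print(utf8_bytes_per_char("\u07d2\u07de\u07cf"))  # 2.0 (N'Ko letters)
print(utf8_bytes_per_char("\u0633\u0644\u0627\u0645"))  # 2.0 (Arabic letters)
```

Arabic and N'Ko cost the same at the byte level, which fits the control argument: the raw encoding is not the handicap. What differs is whether training gave the tokenizer learned merges, and the model learned pathways, on top of those bytes.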

The machine can render the script. It has not learned the script.

The missing on-ramp

The older blog section had a phrase worth preserving: the reasoning circuits may
already exist, but N'Ko lacks the on-ramp.

That is the difference between a weak script and a weak model. N'Ko was designed
with properties that should help machines: consistent sound-symbol mapping,
explicit tone marks, no silent letters, no digraph ambiguity inside the script
itself. The script is not the source of the confusion. The model's training
history is.

This is why prompt engineering is not enough. A better prompt can coax a surface
answer from a model. It cannot create the missing internal pathway. The fix has to
be data, tokenizer support, adaptation, and measurement.

Why this mattered for speech

Once we saw script invisibility in language models, the design decision for
automatic speech recognition (ASR) became clearer.

If a general model is weak in N'Ko, then a pipeline that hears Bambara, writes
Latin, and later asks a generic language model to "restore" N'Ko is risky. It
passes through the exact layer of weakness the brain scan exposed. The model may
produce something fluent while losing the script-specific evidence that matters.

The alternative is direct script-native ASR. Let the speech model emit N'Ko
characters directly. Let the measurement happen in the script designed for the
sound system. Then, if a correction layer is used later, govern it carefully.
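To make "emit N'Ko characters directly" concrete, here is a minimal sketch of an output vocabulary built from the N'Ko block, with greedy decoding over per-frame scores. This note does not say which decoder the project uses. A CTC-style output head is assumed purely for illustration, and the names `VOCAB` and `greedy_ctc_decode` are hypothetical.

```python
import unicodedata

import numpy as np

BLANK = ""  # CTC blank symbol, kept at index 0

# Assigned code points in the N'Ko block, as known to this Python's
# Unicode database, plus a space for word boundaries.
NKO_CHARS = [
    chr(cp)
    for cp in range(0x07C0, 0x0800)
    if unicodedata.name(chr(cp), None) is not None
]
VOCAB = [BLANK, " "] + NKO_CHARS


def greedy_ctc_decode(logits: np.ndarray) -> str:
    """Pick the best label per frame, collapse repeats, then drop blanks.

    logits has shape (frames, len(VOCAB)).
    """
    best = logits.argmax(axis=-1)
    out = []
    prev = -1
    for idx in best:
        if idx != prev and idx != 0:
            out.append(VOCAB[idx])
        prev = idx
    return "".join(out)
```

The design point is that no Latin intermediate appears anywhere in the label set, so the acoustic model's output can be scored directly against N'Ko reference text.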

That is the bridge from Paper 1 to the rest of the project. Script invisibility is
not a side note. It explains why direct N'Ko output matters.

The human point

Unicode inclusion is not computational inclusion.

A script can be standardized, rendered, typed, and stored while still being
effectively absent from the systems that now mediate search, transcription,
education, translation, and everyday writing. That absence does not announce
itself. The model still responds. The interface still works. The failure is hidden
inside the layers.

N'Ko lets us name that failure.

Solomana Kante designed a writing system because someone claimed African languages
could not be written properly. Seventy-seven years later, the machines made a
quieter version of the same mistake. They accepted the characters but did not make
room for the language.

The first job of this project was to show that gap clearly. The next job was to
build through it.

Source Anchor

nko-brain-scanner/paper/blog-series/01-when-unicode-is-not-understanding.md
