Grand Diomande Research · Full HTML Reader

The Model That Listened in N'Ko

The architecture began with a refusal.

Do not make the system hear Bambara, write Latin, and then ask another model to
recover N'Ko afterward. That route repeats the exact problem the brain scan found:
the generic model is weak in N'Ko, so a post-hoc restoration step can sound fluent
while losing the script's evidence.

The cleaner route is harder but more honest:

audio -> acoustic features -> N'Ko decoder -> N'Ko text

That decision shaped the whole experiment.

The bridge that made training possible

There was a practical problem first. There was no large N'Ko-labeled speech corpus
waiting to be downloaded. Existing Bambara automatic speech recognition, or ASR,
data was mostly Latin. To train a direct N'Ko decoder, the project needed targets
in N'Ko.

So we built a deterministic bridge:

Latin Bambara -> IPA -> N'Ko Unicode

IPA means International Phonetic Alphabet. The Latin-to-IPA stage handles Bambara
digraphs and vowel forms. The IPA-to-N'Ko stage maps the sound inventory into N'Ko
characters.

This is where the script argument became concrete. The bridge did not just move
letters around. It exposed every place where Latin Bambara made the model guess.
"ny" has to be consumed as a single palatal nasal, not as "n" plus "y." "ng" has
to be treated as the velar nasal. Toned vowels need Unicode decomposition before
lookup. Extended IPA symbols from real transcriptions needed explicit handling.
Right-to-left rendering needed marks so the output would not visually scramble in
Latin-dominant environments.
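
The digraph and tone-mark handling described above can be sketched as a greedy longest-match scan. This is a minimal illustration under stated assumptions, not the project's code: the tables `DIGRAPHS` and `SINGLES`, the function names, and the use of directional isolates for right-to-left wrapping are all hypothetical, and a real Bambara inventory is larger.

```python
import unicodedata

# Illustrative tables only (assumed); the real inventory has more entries.
DIGRAPHS = {"ny": "ɲ", "ng": "ŋ"}
SINGLES = {"j": "dʒ", "c": "tʃ", "y": "j"}

# Assumed choice of marks: right-to-left isolate and pop directional isolate.
RLI, PDI = "\u2067", "\u2069"


def latin_to_ipa(word):
    """Map a Latin Bambara word to a list of IPA symbols."""
    # NFD splits a toned vowel such as "à" into "a" plus a combining mark,
    # so the table lookup sees the bare vowel.
    chars = unicodedata.normalize("NFD", word.lower())
    out, i = [], 0
    while i < len(chars):
        ch = chars[i]
        if unicodedata.combining(ch):
            # Keep the tone mark attached to the symbol it modifies.
            if out:
                out[-1] += ch
            i += 1
        elif chars[i:i + 2] in DIGRAPHS:
            # Consume the digraph as one unit before trying single letters.
            out.append(DIGRAPHS[chars[i:i + 2]])
            i += 2
        else:
            out.append(SINGLES.get(ch, ch))
            i += 1
    return out


def wrap_rtl(nko_text):
    """Isolate N'Ko output so it keeps its direction inside Latin text."""
    return f"{RLI}{nko_text}{PDI}"
```

With these tables, `latin_to_ipa("nyà")` yields two symbols: the palatal nasal, then `a` with its grave mark still attached. Trying digraphs before single letters is what prevents the "n" plus "y" misreading.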

These were engineering bugs, but they were also linguistic evidence. Each bug was
a place where script choice changed what the machine had to infer.

The acoustic front end

Training a full acoustic encoder from scratch would have been wasteful. Whisper
large-v3 has already been trained on a very large multilingual audio corpus and
can turn waveform input into a sequence of acoustic feature vectors.

The project used Whisper as a frozen feature extractor. In this setup, Whisper
produces 1280-dimensional acoustic features. The N'Ko decoder learns what to do
with those features.
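
The size of that feature sequence follows from Whisper's front-end constants: 16 kHz audio, a 10 ms hop between log-mel frames, and a stride-2 convolution in the encoder. The sketch below estimates how many 1280-dimensional vectors cover the real audio in a clip. The function name and the ceiling rounding are assumptions for illustration; Whisper itself pads every input to a 30-second window, so the encoder always returns 1500 positions and the trailing ones cover padding.

```python
SAMPLE_RATE = 16_000   # Whisper expects 16 kHz audio
HOP_LENGTH = 160       # 10 ms between log-mel frames
CONV_STRIDE = 2        # the encoder's second convolution halves the frame rate
MAX_POSITIONS = 1500   # encoder positions for the padded 30 s window
FEATURE_DIM = 1280     # feature width quoted in the text


def feature_shape(num_samples):
    """Approximate (frames, dim) of the features that cover real audio."""
    mel_frames = -(-num_samples // HOP_LENGTH)    # ceiling division
    enc_frames = -(-mel_frames // CONV_STRIDE)    # ceiling division
    return min(enc_frames, MAX_POSITIONS), FEATURE_DIM
```

A two-second clip (`feature_shape(2 * SAMPLE_RATE)`) gives 100 frames of 20 ms each. That count is the budget the decoder has for emitting characters.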

That separation matters. The question was not "can we retrain Whisper into
everything?" The sharper question was: if we keep a powerful acoustic front end,
can a script-native decoder learn to write N'Ko directly?

The CTC decoder

The decoder was a Transformer trained with CTC, connectionist temporal
classification. CTC is built for exactly this kind of problem. Speech is a long
sequence of frames. Text is a shorter sequence of characters. We usually know the
transcript, but not the exact frame where each character begins and ends.

CTC learns that alignment. It lets the model emit blanks, repeated characters, and
then collapse the framewise output into a final string.
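
The collapse step can be sketched in a few lines. This shows only the decoding-side rule (merge repeats, then drop blanks), not the training loss, which sums over all alignments that collapse to the transcript. The blank index of 0 and the function name are assumptions.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_collapse(frame_ids, blank=BLANK):
    """Merge repeated frame labels, then drop blanks."""
    merged = [label for label, _ in groupby(frame_ids)]
    return [label for label in merged if label != blank]
```

The order matters: merging before removing blanks is what lets a blank separate two genuine repeats of the same character, so `[5, 5, 0, 5]` collapses to `[5, 5]` rather than `[5]`.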

For N'Ko, that is especially attractive because the output units are close to the
sound units. The decoder is not choosing between Latin spelling conventions. It is
choosing N'Ko characters.

The trajectory idea

The older blog post on trajectory bias had the right phrase: speech is movement.

A person does not produce isolated characters. The signal holds, bends, releases,
crosses boundaries, recovers, and sometimes becomes uncertain. The trajectory
module tried to give the decoder a small sense of that movement.

The seven channels were:

| Channel | Plain meaning |
| --- | --- |
| Commitment | How locked-in the current path looks. |
| Uncertainty | How many plausible interpretations remain open. |
| Transition pressure | Whether the speech is near a boundary or change. |
| Recovery margin | Whether the model can recover from a local instability. |
| Phase stiffness | Whether the current pattern resists sudden change. |
| Novelty | Whether the signal looks unusual or out of distribution. |
| Stability | Whether the decoding path is holding together. |

Those names can sound abstract, but the job is practical. If the model is in a
stable stretch, it can use broader context. If it is near a boundary, it should be
more careful. If the signal is novel, it should not rush to normalize it away.
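
To make the seven channels concrete, here is one way the per-frame state could be laid out, with simple proxies for two of the channels. Everything here is an assumption for illustration: this account names the channels but gives no formulas, so the container, the two `*_proxy` functions, and their definitions (top-label probability and normalized entropy over a frame's label distribution) are hypothetical stand-ins, not the project's method.

```python
import math
from typing import NamedTuple, Sequence


class TrajectoryState(NamedTuple):
    """One value per channel for a single frame (assumed layout)."""
    commitment: float
    uncertainty: float
    transition_pressure: float
    recovery_margin: float
    phase_stiffness: float
    novelty: float
    stability: float


def commitment_proxy(probs: Sequence[float]) -> float:
    """Assumed proxy: probability mass on the top label."""
    return max(probs)


def uncertainty_proxy(probs: Sequence[float]) -> float:
    """Assumed proxy: entropy of the distribution, normalized to [0, 1]."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))
```

Under these proxies, a frame whose distribution is flat scores high on uncertainty and low on commitment, which matches the plain meanings in the table. The other five channels are left undefined here because nothing in this account pins them down.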

The archived 20.57% CER anchor

TAR and TTT, without the fog

Two later branches need to be explained because otherwise they sound like secret
machinery.

TAR means trajectory-attention residual. It pushed trajectory information deeper
into the attention mechanism. Instead of using compact trajectory state as a
lighter conditioning signal, TAR made it a heavier residual branch.

TTT means test-time training, also called test-time adaptation. Instead of keeping
the model fixed during evaluation, TTT keeps adapting it on the fly as it processes
the evaluation inputs.

Both ideas are reasonable research branches. Neither produced the archived 20.57%
CER anchor. That distinction is important because otherwise the project starts to
sound like a bundle of acronyms instead of a sequence of experiments.

The clean architecture story is:

Whisper large-v3 features
  -> trainable Transformer CTC decoder
  -> compact trajectory conditioning
  -> direct N'Ko characters
  -> archived 20.57% CER anchor

TAR and TTT sit beside that story. They do not replace it.

What happened during the runs

The training story had the usual shape of real research: the first thing that
worked was not the final thing that mattered.

Earlier systems established feasibility. BiLSTM-style CTC systems showed that
direct audio-to-N'Ko output was possible, but they were weak. Transformer CTC
decoders on frozen Whisper features improved the range substantially. The older
trajectory-bias experiments showed why dynamic state could help N'Ko more than
Latin: the acoustic boundary information matched the script units better.

The final anchor came from scaling the script-native trajectory line to the
290,596-pair corpus snapshot and preserving the scorer arithmetic. The result was
the archived 20.57% CER anchor.
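
CER is character error rate: the number of character edits needed to turn the model's output into the reference, divided by the number of reference characters. The sketch below is one common way to compute it, pooling edits and characters over the whole corpus. The function names are assumptions, and whether the project's scorer pools over the corpus or averages per utterance is not stated here. The two conventions give different numbers, which is why the scorer arithmetic has to stay fixed between runs.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_cer(refs, hyps):
    """Total edits divided by total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return edits / chars
```

Python strings are sequences of code points, so a combining tone mark counts as its own character in this sketch. A scorer that normalizes or strips marks first would report a different CER on the same output.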

Later, heavier branches and a lower learning-rate matrix produced results around
31% CER, which could make it look as if the earlier gain had evaporated. But the
runs were not comparable. The learning rate changed from 0.0003 to 0.0001. The
architecture branches changed. The question changed.

That is the part a good public account has to make clear: the story is not a
single magic number. It is a path through design decisions.

Why the model listened differently in N'Ko

The point of direct N'Ko output is not nostalgia. It is information geometry.

If the decoder emits Latin, it has to learn the sound and the spelling convention
at the same time. If it emits N'Ko, the label space is closer to the sound space.
Trajectory state is useful precisely because it describes movement in the acoustic
signal. N'Ko gives that movement a cleaner written target.

This is why the project keeps returning to script design. A writing system is not
just a cultural wrapper around language. It is an interface between sound and
symbol. In ASR, that interface becomes part of the model.

Kante designed N'Ko so Manding speech could be written without distortion. The
model listened better when we stopped forcing the speech through a script that was
not built for it.

Source Anchor

nko-brain-scanner/paper/blog-series/03-the-model-that-listened-in-nko.md
