Grand Diomande Research · Full HTML Reader

How Seven Numbers Changed Everything We Know About Speech Recognition

We had a working Bambara ASR system. A 46.9M-parameter Transformer CTC decoder sitting on top of frozen Whisper features. It took raw audio, ran it through Whisper's encoder to get acoustic features, then decoded those features into N'Ko characters.


The discovery of trajectory bias and why it only works for N'Ko

---

The Baseline Was Not Enough

It worked: roughly 33% character error rate (CER).

But there was a pattern we kept noticing in the errors. The model would get the middle of words right and the beginnings wrong. It would handle sustained vowels perfectly but stumble on consonant clusters at phrase boundaries. It would decode long stretches of speech with no errors, then suddenly produce a burst of garbage at exactly the moments where a human listener would say "oh, something changed there."

The model had no sense of where it was in the utterance. It processed each frame using only local attention to the surrounding frames. It had no concept of "we are at the beginning of a new phrase" or "the speaker is about to transition to a different topic." It was reading the audio like a very fast typewriter, one frame at a time, with no awareness of the larger structure.

What Trajectories Are

In our motion capture work (a separate project involving dance, music, and real-time audio generation), we had developed something called anticipation geometry. The idea: you can predict what a dancer is about to do by tracking seven scalar values derived from their movement trajectory.

Those seven values are:
1. Commitment (0-1): how locked-in the current motion path is
2. Uncertainty (0-1): how many possible next-states exist
3. Transition pressure (0-1): how close we are to a regime change
4. Rhythmic phase (0-2pi): where we are in the current periodic cycle
5. Energy (0-1): overall intensity of the signal
6. Curvature (0-1): how rapidly the trajectory is bending
7. Jerk (0-1): rate of change of curvature, the "snap" in the motion

These seven numbers compress the entire dynamic state of a moving system into a trajectory signature. High commitment + low uncertainty = the dancer is about to execute a known move. Low commitment + high transition pressure = something is about to change. High jerk + high energy = explosive movement.
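To make the signature concrete, here is a minimal sketch of the seven values as a data structure, with the three readings above expressed as rules. The class name `TrajectoryState`, the method `describe`, and the 0.7 / 0.3 cutoffs for "high" and "low" are illustrative assumptions, not values from the motion capture system.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class TrajectoryState:
    """Hypothetical container for the seven trajectory scalars."""
    commitment: float            # 0-1
    uncertainty: float           # 0-1
    transition_pressure: float   # 0-1
    rhythmic_phase: float        # 0-2*pi, in radians
    energy: float                # 0-1
    curvature: float             # 0-1
    jerk: float                  # 0-1

    def __post_init__(self):
        # Enforce the ranges listed in the text.
        for name, value in vars(self).items():
            upper = 2 * math.pi if name == "rhythmic_phase" else 1.0
            if not 0.0 <= value <= upper:
                raise ValueError(f"{name}={value} is outside [0, {upper:.3f}]")

    def describe(self, high=0.7, low=0.3):
        # The cutoffs are assumptions chosen only for illustration.
        if self.commitment >= high and self.uncertainty <= low:
            return "known move"
        if self.commitment <= low and self.transition_pressure >= high:
            return "change imminent"
        if self.jerk >= high and self.energy >= high:
            return "explosive movement"
        return "unclassified"
```

A state with commitment 0.9 and uncertainty 0.1 reads as "known move"; one with commitment 0.1 and transition pressure 0.9 reads as "change imminent".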

The insight was: speech has trajectories too.

Applying Trajectories to Audio

We built a small network (AudioTrajectoryScalars) that takes the Whisper encoder output and extracts seven scalar values. Same seven dimensions as the motion system. The network pools across the time axis, runs through a two-layer MLP, and produces seven numbers between 0 and 1.
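The shape of that computation can be sketched in a few lines of NumPy. This is not the AudioTrajectoryScalars code itself: mean pooling, the ReLU activation, the hidden width of 64, and the random initialization are all assumptions made so the example runs on its own. Only the outline (pool over time, two-layer MLP, seven outputs between 0 and 1) comes from the description above.

```python
import numpy as np


def init_params(feature_dim, hidden_dim=64, num_scalars=7, seed=0):
    """Random weights for a two-layer MLP (sizes are placeholders)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.02, (feature_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.02, (hidden_dim, num_scalars)),
        "b2": np.zeros(num_scalars),
    }


def audio_trajectory_scalars(features, params):
    """Map encoder features of shape (T, D) to seven values in (0, 1)."""
    pooled = features.mean(axis=0)                       # pool over time (mean is an assumption)
    hidden = np.maximum(0.0, pooled @ params["W1"] + params["b1"])  # ReLU (assumption)
    logits = hidden @ params["W2"] + params["b2"]
    return 1.0 / (1.0 + np.exp(-logits))                 # sigmoid squashes into (0, 1)
```

Because the pooling collapses the time axis, this version yields one trajectory signature per utterance regardless of its length.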

Then we built a TrajectoryBiasNetwork that converts those seven scalars into attention biases. The output is a per-head bias matrix that gets added to the attention scores in every transformer layer of the CTC decoder.

Concretely: if the trajectory says "transition pressure is high" (something is about to change in the speech), the attention bias shifts to favor recent frames over distant ones. If commitment is high (the speaker is in the middle of a stable phrase), the bias widens to let the model attend to the full context.
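One way to picture this is a distance penalty on the attention scores whose strength depends on the scalars. The sketch below is a hand-written stand-in, not the learned TrajectoryBiasNetwork: the rule `transition_pressure * (1 - commitment)`, the per-head geometric scales, and the scalar ordering (taken from the numbered list) are assumptions, and "recent" is approximated by "nearby" in both directions.

```python
import numpy as np


def trajectory_attention_bias(scalars, num_heads, seq_len):
    """Build a (heads, T, T) additive bias from the trajectory scalars."""
    commitment, transition_pressure = scalars[0], scalars[2]
    # High pressure and low commitment -> strong preference for nearby frames.
    locality = transition_pressure * (1.0 - commitment)        # assumed rule
    head_scale = 2.0 ** -np.arange(1, num_heads + 1)           # assumed per-head scales
    positions = np.arange(seq_len)
    distance = np.abs(positions[:, None] - positions[None, :])  # (T, T) frame distance
    slopes = head_scale * locality                              # (heads,)
    return -slopes[:, None, None] * distance[None, :, :]


def biased_attention_weights(scores, bias):
    """Add the bias to raw attention scores, then softmax over keys."""
    logits = scores + bias
    logits = logits - logits.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)
```

With commitment at 1.0 the bias is zero everywhere and the attention pattern is left untouched, which matches the "widen to the full context" case. With high transition pressure and low commitment, weight shifts toward the diagonal.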

The scalars do not replace any part of the architecture. They modulate it. The same six transformer layers run the same attention computation, but the attention weights are nudged by the trajectory state. It is like giving the model a sense of proprioception: awareness of where it is and where it is going.

The Controlled Experiment

We ran four configurations on 290K+ Bambara speech pairs: each mode (with and without trajectory bias) was trained twice, once for N'Ko output and once for Latin output. Same data. Same architecture. Same hyperparameters. Same random seed. The only difference: whether trajectory bias was active, and which script the output targeted.
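The design can be written down as a small run grid. This sketch assumes the four runs are the 2 x 2 grid implied by the result tables (script by trajectory mode). `RunConfig`, `build_grid`, the seed value, and the dataset label are hypothetical placeholders, not the project's actual configuration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical description of one training run."""
    script: str                     # "nko" or "latin"
    use_trajectory_bias: bool
    seed: int = 0                   # placeholder; shared across all runs
    dataset: str = "bambara-290k"   # placeholder label for the shared data


def build_grid(seed=0):
    """Every combination of output script and trajectory mode, same seed."""
    return [
        RunConfig(script=script, use_trajectory_bias=flag, seed=seed)
        for script, flag in product(("nko", "latin"), (False, True))
    ]
```

Holding everything but these two fields fixed is what lets a CER difference be attributed to the script or to the bias rather than to training noise.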

N'Ko Results

Mode	CER (%)
Baseline (no trajectory)	32.75
Trajectory	27.50

Trajectory bias reduced N'Ko CER by 5.25 percentage points. That is a 16% relative reduction (5.25 / 32.75).
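For readers who want to check numbers like these on their own outputs, CER is conventionally the character-level edit distance divided by the reference length. The sketch below assumes standard Levenshtein distance, corpus-level aggregation, and one Python code point per character (so combining marks count separately). Whether the runs reported here were scored exactly this way is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, free on a match
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level character error rate, in percent."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return 100.0 * total_edits / total_chars
```

One wrong character in a four-character reference gives a CER of 25%.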

Latin Results

Mode	CER (%)
Baseline (no trajectory)	31.43
Trajectory	31.67

Trajectory bias did nothing for Latin. Actually, it made it slightly worse (+0.24pp). Same model, same data, same training. The only difference was the output script.
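The arithmetic behind both comparisons is short enough to spell out. The CER values below are the ones from the two result tables. The helper name `cer_change` is ours for illustration.

```python
def cer_change(baseline, treated):
    """Return (absolute change in points, relative change in percent)."""
    absolute = treated - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


results = {
    "nko": (32.75, 27.50),    # baseline CER, trajectory CER
    "latin": (31.43, 31.67),
}

for script, (baseline, treated) in results.items():
    absolute, relative = cer_change(baseline, treated)
    print(f"{script}: {absolute:+.2f} pp absolute, {relative:+.1f}% relative")
# nko: -5.25 pp absolute, -16.0% relative
# latin: +0.24 pp absolute, +0.8% relative
```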

Without Any Trajectory Scalars

We also ran a pure baseline with no trajectory information at all. Latin collapsed to 100% CER.

N'Ko's baseline still converged fine at 32.75% CER.

Why It Only Works for N'Ko

This is the finding that matters.

Trajectory bias helps the model figure out where it is in the utterance. "We are mid-phrase, high commitment, low transition pressure." That information is useful when the model needs to decide which character to emit. But it is only useful if the character space is unambiguous.

For N'Ko, "which character to emit" is the same question as "which phoneme am I hearing." The trajectory state helps the model resolve ambiguity at acoustic boundaries, where two phonemes sound similar and context is the deciding factor.

For Latin, "which character to emit" is a different question than "which phoneme am I hearing." The model needs to decide not just what sound it heard, but how to spell it. Should the palatal nasal be "ny" or "gn" or "ñ"? Should the tone be marked with a diacritic or left unmarked? These are orthographic decisions, not acoustic ones. Trajectory bias gives the model acoustic context. Acoustic context does not help with spelling.

The trajectory scalars provide information about where in the speech stream we are. N'Ko converts that information directly into character predictions because its characters are sounds. Latin cannot use that information because its characters are spelling conventions.
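A toy calculation shows the gap. Suppose the acoustics have already told the model it is hearing a palatal nasal. The sketch below counts how many bits of choice remain before a character can be emitted. The inventory is illustrative only: the Latin spellings are the three named above, the N'Ko entry is a placeholder label rather than a real code point, and treating the spellings as equally likely is an assumption.

```python
import math

# Toy inventory: which written forms can realize one phoneme in each script.
SPELLINGS = {
    "nko": {"palatal_nasal": ["<nko-letter>"]},      # placeholder label, one form
    "latin": {"palatal_nasal": ["ny", "gn", "ñ"]},   # spellings named in the text
}


def residual_choice_bits(script, phoneme):
    """Bits of spelling choice left after the phoneme is known.

    Assumes every listed spelling is equally likely.
    """
    return math.log2(len(SPELLINGS[script][phoneme]))
```

Under these assumptions the N'Ko entry leaves 0 bits to resolve, while the Latin entry leaves about 1.58 bits that no amount of acoustic context can settle.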

The Deeper Insight

Script design is not neutral. It is an architectural choice that determines what kinds of model improvements are available.

Trajectory bias is a general technique. It works for any sequential decoding task where dynamic context matters. But it only improves ASR when the output space is phonemically transparent. For N'Ko, trajectory bias is a 5.25pp win. For Latin, it is a 0.24pp loss. Same technique, same data, opposite outcomes. The difference is the script.

This has implications beyond our system. Any attention-modulating technique, any contextual mechanism that provides temporal or dynamic information to the decoder, will be more effective when the output space directly encodes the signal being modulated. Trajectory bias helps with acoustic dynamics. N'Ko encodes acoustic content. Latin encodes spelling.

This is not about N'Ko being a "better" script in some abstract sense. It is about match. The technique matches the representation. When they align, you get 5.25pp. When they do not, you get nothing.

What Comes Next

We are currently running a full reproduction of the 27.50% CER result.

The trajectory technique came from watching people dance. It works on speech because speech and movement share the same temporal geometry: commitment, uncertainty, transition, rhythm, energy. The seven numbers that predict where a dancer will move next are the same seven numbers that predict where a speaker is going.

Kante designed N'Ko to capture how Manding languages sound. We found that when you build a machine to decode those sounds, the script he designed in 1949 enables techniques that Latin cannot use.

That is not a coincidence. That is design.


Source Anchor

nko-brain-scanner/blog/posts/06-trajectory-bias.md
