Language/Sound and Tone·6 min read

Tone Is an Acoustic Problem

The standard plan restores tone from text context alone. The signal was in the audio the whole time.

In Manding, tone is not decoration. The same consonants and vowels with different pitch patterns are different words, different grammar. Any speech recognizer that drops tone has not transcribed the language; it has transcribed a lossy projection of it.

The strongest released N'Ko decoder today is toneless. It emits consonants and vowels and drops the tone diacritics. The standard plan for fixing this is a text-context language model: look at the surrounding words and infer which tone makes sense. That works exactly as well as the ambiguity allows, and in Manding the text-only ceiling is real — many sentences remain genuinely ambiguous on paper.

The reframe

Tone restoration is treated as a text problem because the text is what we have. But the pitch contour that disambiguates the word was in the audio all along. Throwing it away and then guessing is a choice.

Featural Acoustic Coding starts from a property of N'Ko that I think is underappreciated: the script natively encodes axes of the acoustic signal. Tone marks map to pitch. Length marks map to duration. Seven vowels map to spectral color. Consonant manner maps to onset shape. A designed script, designed by someone who could hear the language clearly, turns out to be a compact descriptor system for the sound itself.

That has a measurable token-cost consequence. Descriptor systems for acoustics spend words: a pitch contour becomes a phrase. N'Ko spends one glyph. When you are paying per token, in training data, in context windows, in on-device inference, a notation that carries four or five acoustic axes natively is not a curiosity. It is compression.

The honest status: the framework is built, the synthetic results show the expected gap on tonal material, and the decisive test needs tone-bearing read speech rather than casual conversation, where pitch excursions flatten. The claim I am prepared to defend now is narrower and still useful: tone should be modeled as text prior times acoustic evidence, and a system that ignores the second factor is leaving recoverable signal on the floor.