Grand Diomande Research · Full HTML Reader

Reading Tone from the Signal: Featural Acoustic Coding for Tone Resolution in N'Ko Speech Recognition

% Reading Tone from the Signal: % Featural Acoustic Coding for Tone Resolution in N'Ko Speech Recognition % Compiles with pdflatex (MacTeX). N'Ko shown via Unicode codepoints + % transliteration; IPA via tipa; architecture figure via TikZ.

Language as Infrastructure working paper preprint render candidate score 100 .tex

Full Public Reader

Abstract

Script-native automatic speech recognition for the Manding languages, written in
N'Ko (U+07C0--U+07FF), reaches a meaningful error regime, but the strongest
released decoder is toneless: it emits N'Ko consonants and vowels and drops
the tone diacritics that are lexically and grammatically contrastive in Manding.
Tone is therefore restored downstream, and the planned mechanism is a text-context
language model that infers tone from orthographic context alone. We make three
claims that reframe this problem as an acoustic one. First, we correct the
tone inventory: N'Ko has seven combining tone marks (U+07EB--U+07F1),
short and long forms of high, low, rising, and a native descending (falling) tone
at U+07EE; a widely reused syllable codebook had mislabeled the long marks
as length, which propagated into downstream tooling. Second, we measure an
empirical tone prior from a corpus of 105 transcribed N'Ko lesson frames
(12{,}541 N'Ko characters, 3{,}316 tone marks, harvested by vision-language OCR):
\textbf{65.8\
are unmarked mid, and only \textbf{0.9\
is, to first order, a three-way non-contour register classification. Third, we establish that text-only
tone resolution is weak: a context model on the same corpus reaches a
(tone-diacritic error rate) of \textbf{50.8\
roughly half of all tones, leaving large headroom for acoustic evidence that the
toneless recognizer discards. We then define Featural Acoustic Coding
(FAC), a featural treatment of the N'Ko syllable in which the tone mark is a
register-plus-contour primitive read directly from the fundamental frequency, and
we place tone restoration as a conservative, governance-gated correction in an
Anticipation Geometry Partition (AGP) layer that reuses the recognizer's
own trajectory geometry. We give the architecture, the reproducibility structure
(the core claims run standalone, without the acoustic model), a preregistered
fusion evaluation, and an honest account of what remains, principally read-speech
ground truth for the acoustic tone-diacritic error rate. A secondary section
relates FAC to Lexical Acoustic Coding and reports that the featural pitch
advantage over a lexical carrier is real but conditional on tonal density; the
durable advantage is token efficiency, not reconstruction fidelity.

Introduction

N'Ko is a script created in 1949 by Solomana Kant\'e for the tonal Manding
language family (Bambara, Dioula, Maninka), spoken by tens of millions of people
in West Africa, and encoded in Unicode at U+07C0--U+07FF
[citation: unicode_nko]. Unlike the Latin orthography of Bambara, N'Ko was engineered
around Manding phonology with a near one-to-one phoneme-to-character mapping and
explicit, obligatory tone marking. Prior work shows that this design is not
cosmetic for machine learning: it changes what tokenizers represent, how acoustic
evidence aligns to symbols, and whether a reported error rate measures recognition
or agreement with an inherited spelling convention [citation: diomande2026nko].

That same line of work produced a script-native recognizer, an anticipatory
Transformer connectionist-temporal-classification (CTC) decoder over frozen
encoder features, with an archived checkpoint at roughly twenty percent character
error rate (CER) on a large Bambara corpus. The decoder is, however, toneless: it
emits N'Ko segmental content and omits the tone diacritics. In Manding this is a
substantive loss, because tone is contrastive at the lexical and grammatical level;
two utterances that differ only in tone differ in meaning, and a toneless
transcript is therefore underspecified. The planned remedy is contextual tone
resolution by a language model trained on tone-marked N'Ko text: a prior
over which tone is plausible given the surrounding orthography. Such a model never
consults the acoustic signal, even though tone is, physically, a property of the
fundamental frequency that is present in the audio and merely discarded by a
toneless decoder.

This paper reframes tone restoration as an acoustic problem and makes the case
quantitatively. Our contributions are as follows.

- A corrected tone inventory (\S[ref: sec:inventory]). We show, against
authoritative Unicode names, that N'Ko has seven combining tone marks
(U+07EB--U+07F1) encoding high, low, rising, and a native descending
(falling) tone in short and long forms, and we document a mislabeling in a reused
syllable codebook that treated two tone marks as length and propagated into
downstream tooling and an earlier draft of this work.

- An empirical tone prior (\S[ref: sec:prior]). From a 105-frame
vision-OCR corpus of N'Ko lessons we measure the tone-class distribution:
\textbf{65.8\
establishing that acoustic tone resolution for N'Ko is dominated by a
non-contour register decision.

- A text-only baseline (\S[ref: sec:baseline]). On the same corpus, a
context model reaches ~$=50.8\%$, the bar that an acoustic channel must
beat. Text alone resolves only about half of all tones.

- Featural Acoustic Coding and a governance-gated correction layer
(\S[ref: sec:fac], \S[ref: sec:arch]). FAC reads register and contour from the
fundamental frequency into the native tone mark; tone restoration is a
conservative correction governed by an Anticipation Geometry Partition (AGP) that
reuses the recognizer's trajectory geometry. We give the unified architecture and
its reproducibility structure (\S[ref: sec:repro]): the component claims run
standalone, with no dependence on the acoustic model.

- A preregistered fusion evaluation (\S[ref: sec:eval]) and an honest
relation to Lexical Acoustic Coding (\S[ref: sec:lac]), where we report that the
featural pitch-fidelity advantage is conditional on tonal density and that the
durable advantage of a designed script is token efficiency.

Background

Script-native N'Ko ASR..
The recognizer of [citation: diomande2026nko] projects and downsamples frozen
Whisper-scale features, then decodes with a Transformer CTC head over N'Ko
character classes, with a finite-state constraint enforcing syllable validity at
output. An anticipation module estimates a seven-dimensional trajectory state
$z_t \in [0,1]^7$ per timestep, comprising commitment, uncertainty, transition
pressure, recovery margin, phase stiffness, novelty, and stability, injected as an
additive attention-logit bias $B_{ij}(z_i, z_j)$ before emission. The central
measurement argument is a transparent-script proposition: if the normalized script
map is bijective over the phoneme inventory, character edit distance preserves
phoneme-edit structure up to normalization, so N'Ko CER is more phonemically
interpretable than Latin word error rate. Tone is the principal hedge in that
argument, because CER still depends on tone and diacritic policy. This paper
attacks that hedge.

Tone in Manding..
Manding is tonal; tone carries lexical and grammatical contrasts
[citation: vydrin2015manding]. N'Ko marks tone with obligatory combining diacritics,
which is exactly the property a tone-restoration system can exploit and which
Latin Bambara orthography, with optional and inconsistent tone marking, lacks.

Audio as text..
Lexical Acoustic Coding (LAC) transmits a sound between agents as a readable
sentence by quantizing interpretable acoustic descriptors into lexical labels
[citation: lac2026], situated against captions [citation: drossos2020clotho,
mei2022survey], neural codecs [citation: zeghidour2021soundstream, defossez2022encodec,
kumar2023dac], and continuous descriptor sets [citation: peeters2004cuidado,
zwicker1999psychoacoustics]. FAC keeps the quantize-and-serialize move but uses a
designed featural script as the carrier; we treat this relation as secondary
(\S[ref: sec:lac]) and lead instead with tone resolution.

The N'Ko Tone Inventory, Corrected

N'Ko encodes seven combining tone marks, verified against the Unicode character
database: U+07EB short high, U+07EC short low, U+07ED short
rising, U+07EE long descending, U+07EF long high, U+07F0 long
low, and U+07F1 long rising. Folding the short/long length distinction into
tone class yields four shapes, high, low, rising, and
falling, all native; an unmarked syllable is mid by default. This matters
because a widely reused N'Ko syllable codebook, which serves as the retrieval
target for the joint-embedding recognizer, enumerated only five marks and labeled
U+07EE and U+07EF as ``long'' and ``very long'' length rather than as
the descending and long-high tones they are. That mislabeling propagated into tone
parsers and into an earlier draft of this work, which incorrectly claimed that
falling tone was absent from native N'Ko and had to be added as a designed
diacritic. It is not: all four tone shapes are native. Consequently, any designed
extension (\S[ref: sec:fac-ext]) is required only for timbral descriptors, never for
pitch. We corrected the codebook labels in place while freezing the enumerated
tone set, since the recognizer's retrieval indices depend on it.

An Empirical Tone Prior

To ground the design we measured the distribution of tone classes in real N'Ko
text. We harvested 16 numbered N'Ko teaching lessons (printed, tone-marked N'Ko on
screen-shared documents), sampled frames, and transcribed the printed N'Ko with a
vision-language model under a deterministic decoding configuration; flagship
general models without that configuration either produced degenerate repetition on
N'Ko or refused, while a current reasoning model with reasoning disabled
transcribed cleanly. The resulting corpus has 105 entries spanning 20 lessons,
12{,}541 N'Ko characters and 3{,}316 tone marks (a glyph-level tone density of
26.4\
(\S[ref: sec:inventory]) over 4{,}139 syllables yields the prior in
Table~[ref: tab:prior].

Tone classShareGroup
low (l)40.1 high (h)25.7 mid / unmarked33.3 rising0.4 falling0.6

\caption{Empirical N'Ko tone prior, 4{,}139 syllables from the lesson corpus.
Tone is dominated by register; contour is rare. The labels carry OCR noise (see
\S[ref: sec:limitations]), but the register/contour split is stable across corpus
size.}

The implication is design-determining. Because tone is roughly two-thirds marked
high/low register, one-third unmarked mid, and only about one percent contour,
acoustic tone resolution for N'Ko reduces, to first order, to deciding low versus
high versus unmarked from the fundamental frequency relative to a speaker
baseline. This is precisely the kind of decision $F_0$ supports robustly, and it
argues that a tone channel should lead with register estimation while treating
contour as a high-confidence slope event.

Text-Only Tone Resolution Is Weak

We next quantify the ceiling of the text-only strategy that the recognizer
pipeline currently plans to rely on, using only the corpus and no audio. For each
entry we strip the tone marks to obtain the toneless syllable sequence a toneless
decoder would emit, treat the original marks as ground truth, and predict the tone
class of each syllable. We report the tone-diacritic error rate, the fraction of
syllables whose tone class is wrong, averaged over five lesson-disjoint splits.
Three models of increasing context are compared: a majority-class floor, a unigram
model conditioning on syllable identity with backoff to the floor, and a bigram
model additionally conditioning on the previous tone with backoff to the unigram.
Results are in Table~[ref: tab:tder].

Text-only model\textbf
majority-class floor58.7 unigram (syllable identity)51.4 bigram ( $+$ previous tone)\textbf50.8

Caption: Text-only tone resolution on the lesson corpus (mean of five lesson-disjoint splits). Even with context, text alone misses about half of all tones. This is the bar an acoustic channel must beat.

The best text-only model still misses about half of the syllable tone classes.
This is the empirical case for the acoustic channel: the toneless decoder throws
away the fundamental frequency, which is the most direct evidence for the register
decision that text struggles with, so an acoustic register estimator has
substantial headroom to reduce below 50.8\

Featural Acoustic Coding

The syllable as a featural tuple

FAC treats each N'Ko syllable as a tuple
$\sigma = (o, v, c, t, \ell, \mathbf{d})$ of onset $o$, vowel nucleus $v$, nasal
coda $c$, tone $t$, length $\ell$, and an optional stack of timbral diacritics
$\mathbf{d}$. The base tuple indexes the existing syllable codebook. Each slot is
an acoustic feature: the onset's manner of articulation is the attack-transient
character (plosive [k t p] percussive, fricative [s f] noisy,
nasal [m n N] resonant, approximant [l r w j] glide); the seven
vowels partition the formant plane from the bright high-second-formant front vowel
[i] to the dark back vowel [O], an ordered code for spectral
centroid; the tone mark encodes pitch register and contour; the length mark encodes
duration; the nasal coda encodes a resonant sustain. Crucially for this paper, the
tone slot is native and complete (\S[ref: sec:inventory]): register and contour are
written, not approximated.

Encoder, decoder, and the register-first reading

The FAC encoder maps a waveform to a string of codebook syllables by quantizing a
standard acoustic-descriptor front end into the featural slots: spectral centroid
to vowel, onset descriptors to onset manner, duration to length, sustain to coda,
and, the focus here, the fundamental frequency to the tone mark. Given the prior of
\S[ref: sec:prior], the tone encoder is implemented register-first: estimate the
event's mean $F_0$ relative to a per-speaker baseline (calibrated from the
speaker's $F_0$ distribution), quantize to low / mid / high, and apply a contour
primitive only when the within-event slope exceeds a deadband. The FAC decoder is
the generative dual: it maps each syllable back to descriptor intervals and drives
a synthesizer, with per-slot interpretable error (a tone mismatch is a pitch
error). The decoder is what makes FAC more than an analysis: it is the proof that a
featural N'Ko code can author a continuous signal, which is the template for using
N'Ko as a score for other embodied signals; we do not develop that here.

The designed extension is purely timbral

N'Ko natively carries all four tone shapes and the vocal/segmental axes, so the
only descriptors without a native symbol are higher timbral ones. We define four
combining diacritics in the script's spirit, each a small ordered set, for
harmonicity, spectral spread, roughness, and dynamics.
No pitch dimension requires a designed mark; the correction of \S[ref: sec:inventory]
removes the earlier, erroneous ``designed falling tone.''

Tone Resolution as a Governance-Gated Correction

The recognizer, the governance layer, and the tone correction compose into one
architecture unified by the trajectory geometry $z_t$, which is reused at three
levels (Figure~[ref: fig:arch]). At the decode level, $z_t$ biases attention
in the CTC head and the decoder emits toneless N'Ko. At the govern level,
the Anticipation Geometry Partition (AGP) reuses the same geometry as a correction
policy rather than a decoder, partitioning each output span into stable, boundary,
uncertain, and novelty states and gating which spans may be corrected, reviewed, or
excluded. At the correct level, tone restoration is applied conservatively
to eligible spans as a fusion of the text-context prior (\S[ref: sec:baseline]) and
the acoustic register estimate (\S[ref: sec:fac]). AGP is governance, deciding where
and whether to correct; FAC is content, supplying the tone. They are complementary:
AGP marks a span uncertain and eligible, FAC supplies the register-resolved tone.

tikzpicture[
box/.style={draw, rounded corners, align=left,
text width=0.86\columnwidth, inner sep=4.5pt, font= },
ann/.style={font=\scriptsize\itshape}, >=Stealth, node distance=8mm]
\node[box] (dec) {1. Decode --- anticipatory Transformer CTC.
frozen features $\to$ trajectory $z_t \in [0,1]^7 \to$ toneless N'Ko
($\sim$20\
\node[box, below=of dec] (gov) {2. Govern (AGP) --- same $z_t$ as a
correction policy. partition spans: stable / boundary / uncertain / novelty.};
\node[box, below=of gov] (cor) {3. Correct (conservative, AGP-gated).
tone $=$ text prior $\times$ acoustic register (FAC) $\to$ toned N'Ko.};
\draw[->] (dec) -- node[right, ann] {toneless} (gov);
\draw[->] (gov) -- node[right, ann] {eligible spans} (cor);
tikzpicture

Caption: The N'Ko architecture. The trajectory geometry is reused as a decoding bias, an AGP correction policy, and (implicitly) the eligibility signal for tone restoration. FAC's acoustic register estimate is the new evidence at the correction level.

This composition is already realized as a data pipeline. The lesson corpus of
\S[ref: sec:prior], treated as OCR-derived pairs, passes through the AGP partitioner
unchanged: of 105 rows, 99 partition as stable (train-ready), 6 as
boundary (review), and 0 as uncertain, so the corpus feeds the recognizer's
self-improving loop (recognize, label, OCR, tone-resolve, retrain) as governed
training data. We note that frame-sampled data supplies the partitioner a constant
scene duration, which inflates one of its positive signals; the discriminating
signals on this corpus are text length and structure, and the stable rate should be
read with that caveat.

Reproducibility and Dependency Structure

The claims of this paper are layered by what they require to reproduce. The
component claims, the corrected inventory, the tone prior, the text-only
baseline, and the acoustic register classifier, depend only on the released
corpus and standard numerical and audio libraries; they run with no GPU and no
access to the acoustic model, so any reader can clone and verify them. The
end-to-end claim, that acoustic evidence lowers the recognizer's toned CER,
requires one inference pass with the archived checkpoint and no retraining. Only the
deeper integration, a jointly trained acoustic tone head or a trajectory
state augmented with FAC's interpretable descriptors, requires training. We regard
this separation as a feature: the scientific core is decoupled from the heavy model
and is reproducible on a laptop.

Preregistered Fusion Evaluation

We specify the decisive evaluation rather than report it, since it requires data we
name explicitly. The metric is on held-out material, with full toned CER as
a secondary metric. The text-only bar is fixed at the 50.8\
\S[ref: sec:baseline]. The treatment is the fusion of the text prior with the
acoustic register estimate of \S[ref: sec:fac]. Hypotheses: H1, the fusion
reduces below the text-only bar; H2, the reduction is driven by
the non-contour register decision, consistent with the prior of
\S[ref: sec:prior], so an ablation removing contour primitives barely changes
; H3, the gain concentrates on AGP-uncertain spans, where the text
prior is least confident.
The one dependency that is a data dependency, not a model dependency, is
ground truth: scoring acoustic requires read speech of known tone-marked
text (forced-alignable), because lesson speech is commentary on the displayed text,
not narration of it, and therefore does not align tone-per-spoken-syllable. We
treat obtaining a small read-speech set as the gating next step.

Relation to Lexical Acoustic Coding

FAC and LAC share the move of quantizing acoustic descriptors and serializing them
as text; they differ in the carrier. A natural-language carrier encodes pitch as an
adjective and does not decompose along acoustic axes, whereas a designed featural
script writes register and contour as marks and composes onset, nucleus, coda, and
tone as independent slots. We tested the sharpest consequence, pitch-contour
fidelity, in a controlled rate-distortion study at matched code budget on synthetic
tonal contours and on real Manding speech. Two honest findings result. On synthetic
contours with large excursions, a level-only lexical summary hits an error floor
that register budget cannot break (it cannot represent within-event movement),
while a contour-carrying code reaches a $4.4\times$ lower error; but a lexical code
that also spends a word on contour ties the featural code on error, so the featural
advantage is not fidelity per se but token cost (one glyph versus two words). On
real speech the advantage shrinks with tonal density: roughly $1.2\times$ on
didactic lesson speech (within-event excursion $\approx$77 cents) and negligible on
conversational speech ($\approx$40 cents). The durable contribution of a designed
script is therefore token efficiency and native, decomposable tone, not a universal
reconstruction-fidelity win. This is why the present paper leads with tone
resolution and treats the LAC comparison as secondary.

Limitations

The tone prior and the text-only baseline are computed on a corpus whose labels are
vision-OCR output and therefore carry OCR noise; they are prototype measurements,
and the register/contour split, while stable across corpus growth, is not a
gold-standard annotation. The acoustic register classifier is validated on
controlled synthetic syllables and runs on real audio, but no acoustic is
reported because that requires the read-speech ground truth named in
\S[ref: sec:eval]. The AGP stable rate is mildly optimistic for frame-sampled data
(\S[ref: sec:arch]). The end-to-end CER effect is unmeasured pending the fusion run.
Finally, N'Ko is a living writing system for tens of millions of people; we use it
as an acoustic substrate with that in mind and frame the contribution as exploiting
its design discipline.

Conclusion

N'Ko speech recognition leaves tone on the table, and the field's planned remedy
reads tone from text. We argue, with measurements, that tone should be read from
the signal: N'Ko tone is overwhelmingly non-contour, text resolves only about half
of it, and the fundamental frequency the toneless decoder discards is exactly the
evidence the register decision needs. Featural Acoustic Coding makes the tone mark
an acoustic primitive, and an Anticipation Geometry Partition makes tone
restoration a governed, conservative correction in the same architecture that does
the recognition. The result is a concrete, mostly standalone, and reproducible
program for closing the tone gap, with one honest dependency, read-speech ground
truth, standing between it and an end-to-end number.

references

Promotion Decision

Compile/render the source, verify references and figures, then add to the curated atlas.

Source Anchor

nko-acoustic-coding/main.tex

Detected Structure

Latex · Abstract · Method · Evaluation · References · Figures · Architecture