Grand Diomande Research · Full HTML Reader

Script-Native ASR for N'Ko: Anticipatory Transformer CTC Decoding and the CER Anchor

This paper preserves the technical ASR center of the \nko{} research program: an archived script-native trajectory checkpoint reporting \anchorcer{} character error rate on a \corpusn{}-pair Bambara corpus snapshot. The model uses frozen Whisper large-v3 acoustic features, a trainable Transformer CTC decoder, and a compact trajectory state that biases attention with speech-dynamic information. The result is the strongest retained ASR artifact in the project and is the correct way to discuss the phrase ``20 CER'' pu

Language as Infrastructure working paper preprint render candidate score 100 .tex

Full Public Reader

Abstract

This paper preserves the technical ASR center of the research program: an
archived script-native trajectory checkpoint reporting character error
rate on a -pair Bambara corpus snapshot. The model uses frozen Whisper
large-v3 acoustic features, a trainable Transformer CTC decoder, and a compact
trajectory state that biases attention with speech-dynamic information. The result
is the strongest retained ASR artifact in the project and is the correct way to
discuss the phrase ``20 CER'' publicly.

The paper gives the architecture, training regime, artifact contract, and claim
boundaries.
Input audio is encoded by Whisper large-v3 [citation: radford2023robust]; 1280-dimensional
features are projected to a 768-dimensional decoder space, temporally downsampled,
and decoded by a six-layer Transformer CTC head [citation: graves2006connectionist]. A
trajectory module estimates a seven-dimensional state for each timestep:
commitment, uncertainty, transition pressure, recovery margin, phase stiffness,
novelty, and stability. This state produces an additive attention-logit bias before
CTC emission, giving the decoder an anticipatory geometry over speech dynamics.

The archived anchor was trained on paired examples split into
232,476 training rows, 29,060 validation rows, and 29,060 test rows, with learning
rate 0.0003, batch size 32, dropout 0.1, seed 42, and best validation loss
0.6358872798606507. The reported test CER is , computed as
216,225 edits over 1,050,967 reference characters. It is an archived checkpoint
result with preserved metadata.
Later low-learning-rate runs around 31\
because they used a different learning-rate regime. The conclusion is therefore
bounded: direct ASR reached a meaningful error regime under recorded
settings, and the artifact should not be silently replaced by non-comparable runs.

Introduction

The first two papers in this series establish the premise. General language models
can be weak processors of even when they accept the Unicode string, and Latin
WER is not a sufficient metric for script-native Manding speech recognition. This
paper asks the next question: what did the ASR system actually achieve, and how
should that achievement be stated without overstating it?

The answer is the anchor. The project retains an archived
trajectory CTC checkpoint trained on a -pair corpus snapshot. It reports
test CER with explicit scorer arithmetic. This is the number that can
be discussed publicly, but only with provenance: later ablations used a different
learning-rate regime and should not be merged into the same claim. The paper
therefore treats the result as an artifact-backed anchor rather than as a loose
leaderboard slogan.

The scientific contribution is not only the number. It is the architecture and the
measurement stack around the number. The model decodes directly into , not
through Latin. It uses CTC, whose labels are script-native characters rather than
words. It adds a compact trajectory state to attention, making speech dynamics
visible to the decoder. The surrounding artifacts record splits, row counts, hashes,
and scorer denominators. This is what makes the result useful for research even
when the public story is kept concise.

Research Questions and Claim Boundaries

The paper is organized around three questions: what architecture produced the
retained number, what metadata makes the number interpretable, and which later
branches should not be merged into the same claim. This keeps the paper from
turning the 20.57\

Caption: Research questions for the script-native ASR anchor.

ID	Question	Required evidence
RQ1	What architecture produced the retained 20.57 encoder, CTC decoder, trajectory-state definition, and training metadata.
RQ2	What makes the anchor inspectable?	Row counts, split sizes, scorer numerator, denominator, hashes, and artifact paths.
RQ3	Which later runs are non-comparable?	Hyperparameter table showing learning rate, architecture branch, and artifact differences.
RQ4	What should be preserved for future work?	Row exports, partition metrics, feature/pair hashes, and the exact scoring contract.

Caption: Allowed and disallowed public claims about the ASR result.

Allowed	Disallowed
The score was recorded under lr=0.0003, batch size 32, dropout 0.1, seed 42.	The later lr=0.0001 matrix directly refutes or replaces the anchor.
The result supports direct script-native ASR as a serious research path.	The result proves universal superiority over Latin under all settings.
The anchor should be reported with its scorer arithmetic and metadata.	AGP, TAR, or TTT produced the 20.57

Background

CTC and script-native labels

Connectionist Temporal Classification solves sequence alignment without frame-level labels [citation: graves2006connectionist]. For an input sequence $x_{1:T}$ and target label sequence $y_{1:U}$, CTC marginalizes over all framewise paths $\pi$ that collapse to $y$: \[ \mathcal{L}_{\mathrm{CTC}}(x,y) =-\log \sum_{\pi\in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T}p(\pi_t\mid x). \] The label inventory matters. If labels are characters, the model learns a direct relation between acoustic evidence and script units. If labels are Latin words or Latin characters, the model learns a different output relation that may hide tone, digraph boundaries, or orthographic convention.

Whisper features as acoustic substrate

The project uses Whisper large-v3 as a frozen acoustic encoder [citation: radford2023robust].
This is pragmatic. Training an acoustic encoder from scratch for Bambara is
unrealistic under the available data and compute. Frozen Whisper features give the
decoder access to strong multilingual acoustic representations while keeping the
script-specific learning problem in the CTC head.

Development path

The final anchor should be understood against the development sequence. Early
BiLSTM CTC systems established feasibility but were weak. Transformer CTC decoders
on frozen Whisper features improved validation CER into the low-30s in earlier
development regimes. LoRA experiments on Whisper improved confidence and loss
behavior but were not the canonical benchmark [citation: hu2022lora]. The archived
anchor belongs to a later trajectory-conditioned CTC line trained at larger scale.

Caption: Condensed development path. Values are from different regimes and should not be plotted as one homogeneous leaderboard.

System	Role	Recorded outcome
V1 BiLSTM CTC	Feasibility: direct audio-to- output.	Approximately 56 V3 Transformer CTC	Frozen Whisper features plus six-layer Transformer decoder.	Approximately 33 V4 Whisper LoRA	Acoustic adaptation and confidence experiment.	Loss and confidence improved; not the canonical anchor.
Trajectory CTC	Script-native trajectory decoder at larger corpus scale.	archived test CER.

Architecture

Encoder and decoder

Let $a$ be an audio segment. Whisper large-v3 produces acoustic features \[ H=E_{\mathrm{Whisper}}(a)\in \mathbb{R}^{T\times 1280}. \] The decoder first projects these features: \[ U_0 = H W_p + b_p,\qquad W_p\in\mathbb{R}^{1280\times 768}. \] A temporal downsampling module reduces the effective sequence length to $T'$: \[ U = \mathrm{Downsample}(U_0)\in\mathbb{R}^{T'\times 768}. \] The downsampled sequence passes through six Transformer blocks and a CTC output projection over the normalized character vocabulary plus blank.

Trajectory state

The anticipatory component computes a compact state: \[ z_t=\sigma(g_\theta(U_{t-k:t+k}))\in[0,1]^7. \] The seven channels are defined operationally, not metaphysically:

Caption: Seven trajectory channels.

Channel	Intended meaning
Commitment	Evidence that the current local acoustic-symbol state is settled.
Uncertainty	Local ambiguity in the acoustic or decoder state.
Transition pressure	Evidence that the utterance is crossing a phoneme, syllable, or phrase boundary.
Recovery margin	Room to recover after an unstable local decision.
Phase stiffness	Resistance to abrupt changes in the local trajectory.
Novelty	Evidence of unseen word, unusual speaker behavior, or domain shift.
Stability	Persistence of a coherent local decoding path.

For attention head $m$, the trajectory state defines an additive bias: \[ \alpha_{ij}^{(m)} =\mathrm{softmax}_j\left( \frac{Q_i^{(m)}K_j^{(m)\top}}{\sqrt{d_h}} +B_{ij}^{(m)}(z_i,z_j) \right). \] The model is called anticipatory because attention is conditioned not only on content similarity but also on local speech dynamics. A frame near a boundary or uncertainty region should not be treated the same as a stable frame deep inside a settled character span.

Why trajectory belongs with

The trajectory hypothesis is that compact dynamic state is most useful when output
labels preserve acoustic structure. characters are closer to Manding
phonemic units than Latin word tokens or many Latin character sequences. If a
trajectory channel detects transition pressure, a CTC decoder can use it at
a boundary where a single script unit often corresponds to a sound unit. In a Latin
digraph regime, the same acoustic event may need multiple written characters, so
the boundary relation is less direct.

This hypothesis is not the same as claiming every trajectory variant improves every
run. Later low-learning-rate experiments suggest the heavier trajectory-attention
residual branch can underperform. The retained claim is narrower: the archived
checkpoint is associated with the simpler trajectory-conditioned CTC
path, and the historical comparisons motivated trajectory as a script-sensitive
mechanism.

Ablation logic

The architecture family has three separable mechanisms. The baseline mechanism is
direct script-native CTC: the model emits normalized characters instead of
Latin words or post-converted strings. The trajectory mechanism adds a compact state
that summarizes local speech dynamics before or during attention. The heavier TAR
mechanism, short for trajectory-attention residual, injects trajectory information
as a deeper residual branch. TTT, short for test-time training or test-time
adaptation, changes the inference procedure rather than simply changing the
decoder's forward pass.

Caption: Ablation logic for the ASR architecture family.

Mechanism	Hypothesis tested	What a negative result means
Script-native CTC	Direct labels reduce label ambiguity compared with a Latin bridge.	The model or data may still be insufficient; it does not refute metric validity.
Compact trajectory state	Local speech dynamics help boundary and uncertainty handling.	The state may be misplaced, undertrained, or unnecessary for a given regime.
TAR branch	Deeper trajectory residuals improve attention decisions.	More geometry can overconstrain or destabilize training.
TTT branch	Inference-time adaptation improves difficult rows.	Test-time updates may add variance or fail without a calibrated objective.

Training Regime and Artifact Anchor

The canonical anchor metadata is explicit.

longtable{p{0.32\linewidth}p{0.56\linewidth}}
Canonical archived ASR anchor.

Field & Value

\endfirsthead

Field & Value

\endhead
Corpus snapshot & paired examples

Train / validation / test & 232,476 / 29,060 / 29,060

Script and mode & trajectory CTC

Learning rate & 0.0003

Batch size & 32

Dropout & 0.1

Seed & 42

Best validation loss & 0.6358872798606507

Epochs trained & 47

Reported test CER &

CER arithmetic & 216,225 edits / 1,050,967 reference characters

Results SHA-256 & 252aecd6e323f7d50cefd3c1e507ddaf035d9f0ac4f78d67766c4cf6ed5d24a7

Vocabulary SHA-256 & e3ab620c9d2f971603d76f953f2be40bf9283dfd99d6428c7d51a9a73246ea67

Best checkpoint SHA-256 & ab1fe47f96c2c434d8f301ae065b3292d592b9a4f5accf1d09acc97ca2c03b59

longtable

The anchor should be reported with its arithmetic: \[ \frac{216{,}225}{1{,}050{,}967}=0.20574\ldots \approx 20.57\%. \] This protects the claim from rounding ambiguity. If a future scorer review changes the numerator or denominator, the difference can be localized to scorer behavior, normalization, split composition, model output, or reference material.

Provenance Protocol

The anchor is useful because it carries enough metadata to be inspected rather than
only repeated as a rounded percentage. The provenance protocol verifies the pair
file hash, row count, feature count, feature tensor shapes, vocabulary hash, split
sizes, and output directories. It also preserves the prediction and reference rows
needed to recompute the score.

longtable{p{0.24\linewidth}p{0.62\linewidth}}

Caption: Required provenance stages for the anchor.

Stage & Required check

\endfirsthead

Stage & Required check

\endhead
Data identity & Pair file hash, row count 290,596, feature tensor count 290,596,
and feature-shape normalization.

Split identity & Train, validation, and test row identities; expected counts
232,476, 29,060, and 29,060.

Vocabulary identity & Script-native vocabulary file and SHA-256
e3ab620c9d2f971603d76f953f2be40bf9283dfd99d6428c7d51a9a73246ea67.

Training identity & Learning rate 0.0003, batch size 32, dropout 0.1, seed 42,
patience, optimizer, maximum epochs, and stopping rule.

Runtime identity & Trainer hash, library versions, GPU type, feature cache path,
and exact launch command.

Evaluation identity & CER normalizer, prediction export, reference export,
partition metrics, edit numerator, and reference denominator.

Publication identity & Results JSON, best checkpoint, final checkpoint, logs, split,
vocabulary, and hashes copied into a durable bundle.

longtable

Artifact Contract

The main research artifact is not only a checkpoint. A checkpoint without row-level
predictions cannot defend the metric. A row export without a split file cannot
defend the data boundary. A metric without a normalizer cannot defend the scorer.
The durable anchor bundle should therefore contain the following artifacts.

\caption{Artifact contract for the 20.57\

Artifact	Purpose
`results.json`	Stores scalar metrics, hyperparameters, row counts, hashes, and artifact paths.
`test\_predictions.jsonl`	Provides one hypothesis per test row for independent rescoring.
`test\_references.jsonl`	Provides the aligned reference rows and denominator source.
`test\_metrics\_\allowbreak by\_\allowbreak partition.json`	Supports AGP or error-partition analysis without repeating inference.
`split.json`	Preserves exact row membership for train, validation, and test.
`vocab.json`	Defines output labels and lets readers verify script-native decoding.
`best.pt` and `final.pt`	Preserve the selected checkpoint and the terminal training state.
Logs and launch scripts	Record command line, hardware, dependency versions, and guardrail checks.

[figure: figures/fig1_cer_comparison.pdf]

Caption: CER comparison context from the ASR experiments. The final paper series separates archived anchor evidence from historical comparisons and non-comparable low-learning-rate ablations.

[figure: figures/fig2_loss_curves.pdf]

Caption: Loss-curve context from the ASR line of work. Loss curves are useful for training diagnosis but do not by themselves establish matched script superiority.

Historical Comparative Evidence

The project history contains an eight-way internal comparison across baseline,
graph, trajectory, and combined conditions in and Latin. These runs explain
why the trajectory hypothesis became central. They are not the canonical benchmark,
because the full local artifact bundle for all eight historical runs is not
restored.

Caption: Historical eight-way comparison. These values are hypothesis-generating context, not the primary benchmark.

Decoder condition	CER	Latin CER	--Latin delta
Baseline	32.75 Graph structure	32.38 Trajectory	27.50 Graph + trajectory	30.46

The historical pattern is scientifically useful because it suggests that dynamic
or structural mechanisms may help more when the output labels preserve
phonemic structure. But it should not be promoted as the final matched proof.
Artifact status determines claim strength.

[figure: figures/fig3_delta.pdf]

Caption: Delta figure from the ASR experiments. The public interpretation should keep historical and canonical evidence separate.

Non-Comparable Later Runs

The project also preserved a low-learning-rate matrix around 31\
are valuable engineering evidence, but they are not anchor replacements because
the learning rate differed. The main lesson is comparability: a change from
lr=0.0003 to lr=0.0001 changes the training regime, so the result should be read as
a separate condition rather than as a direct contradiction.

Operational lessons

The later execution work exposed practical issues that matter for future work:
feature hydration required substantial disk, feature tensors appeared in mixed
shapes and needed loader normalization, and guardrails had to distinguish
pre-hydration disk capacity from post-hydration free space. These are engineering
lessons about running the pipeline, not reasons to bury the archived anchor.

Caption: Low-learning-rate ablation context. These runs used lr=0.0001 and should not be compared directly against the lr=0.0003 anchor as if only architecture changed.

Run	Script	Mode	CER
baseline		baseline	31.38 TAR		trajectory-attention residual	31.69 trajectory TTT		trajectory + test-time training branch	31.12 Latin baseline	Latin	baseline	31.66 Latin trajectory	Latin	trajectory	32.81

This table is important because it prevents separate mechanisms from being merged
into one story. TAR
means trajectory-attention residual: a heavier branch that injects trajectory state
deeper into attention. TTT means test-time training or adaptation: an inference-time
update procedure. Neither is the archived 20.57\
variants did not obviously improve the low-learning-rate matrix is a useful
mechanistic warning: compact trajectory state may be helpful in the right position,
but more geometry is not automatically better.

Data Scale

Data scale matters. Earlier development used smaller corpora for architecture
search and feasibility. The anchor uses the larger -pair snapshot. The
paper should not mix results from 37-hour or 37K-row development runs with the
290K-row anchor without labeling them. The development runs answer engineering
questions; the anchor answers the retained benchmark question.

[figure: figures/fig4_data_scale.pdf]

Caption: Data-scale context from the ASR experiments. Scaling changes both model behavior and claim strength, which is why artifact provenance is central.

Publication Language

The correct public sentence is:
quote
An archived trajectory ASR checkpoint reports test CER after
training on Bambara speech pairs under recorded settings.
quote
That sentence is strong because it is specific. It names the script, architecture
family, metric, corpus scale, and artifact status. The unsafe sentence is:
quote
The later TAR, TTT, or AGP branch produced 20.57\
beats Latin.
quote
That sentence is false or at least unsupported by the retained artifact chain.

Model Card and Intended Use

The retained anchor should be accompanied by a minimal model card. The intended use
is research on script-native ASR for Manding speech under controlled
evaluation. The intended metric is normalized CER with explicit edit counts
and denominators. The system is not intended as a final legal, medical, emergency,
or fully automated subtitle system. It is also not intended to replace community
orthographic authority. The correct deployment path is benchmarked ASR followed by
row-level governance, review, and domain-specific validation.

Minimal model-card fields for the ASR anchor.

Field	Statement
Primary task	Direct acoustic-to- character transcription for Manding speech.
Primary metric	Normalized CER with prediction/reference row exports.
Known strengths	Script-native output, explicit scorer arithmetic, large retained corpus snapshot.
Known limitations	Limited out-of-domain evidence and no universal Latin comparison claim.
Unsafe uses	High-stakes transcription, unreviewed corpus expansion, or automatic normalization of uncertain text.
Required governance	AGP-style row contracts, correction admissibility, and human review for uncertain or novel rows.

Limitations

The main limitation is scope. The archived checkpoint is a retained benchmark
anchor, not a deployment guarantee. Future work should preserve the exact lr=0.0003
anchor contract with the frozen split, pair hash, vocabulary hash, feature
validation, row exports, and partition metrics.

The second limitation is matched comparison. The historical N'Ko/Latin tables are
informative, but artifact status and hyperparameter mismatch prevent them from
closing a universal superiority claim. The next fair comparison needs identical
data, split, feature cache, optimizer, learning rate, patience, seed schedule,
normalizer, scorer, and artifact export.

The third limitation is deployment. The anchor is a within-distribution ASR result.
It is not a finished conversational-broadcast model, a Djoko production model, or a
guarantee of subtitle-quality transcription. Out-of-domain use requires the AGP
governance layer described in the fourth paper.

Conclusion

The result is usable, but only if stated with discipline. It is an
archived trajectory CTC checkpoint, trained under recorded settings on a
-pair snapshot, with explicit scorer arithmetic. It is not the later TAR
branch, not the TTT branch, and not AGP.

That bounded claim is still important. It shows that direct script-native
ASR reached a meaningful error regime on a large Bambara corpus. Combined with the
metric argument from the previous paper, it gives the project a concrete scientific
anchor: Manding ASR should be evaluated in a script that preserves the linguistic
structure the system is supposed to recognize.

plainnat
references

Promotion Decision

Compile/render the source, verify references and figures, then add to the curated atlas.

Source Anchor

nko-brain-scanner/paper/final/03-script-native-asr-anchor/paper.tex

Detected Structure

Latex · Abstract · Method · Evaluation · References · Math · Figures · Architecture