Governed Self-Correction for Low-Resource N'Ko ASR: A Technical Report on the Acoustic Verifier Experiment
Date: 2026-06-01
Author: Mohamed Diomande
Status: Component-characterized; loop not yet closed. Preservation/data-selection signal confirmed (clean preservation AUC 0.739; original 297k/ANE pilot AUC 0.923 was inflated); live acoustic correction is capped (absolute proposal plausibility AUC 0.60); proposal quality identified as the main bottleneck.
Scope: This report documents the full experimental chain from the AGP correction benchmark through the acoustic verifier, including every measured number, every dead end, and the precise wiring required to reproduce it.
---
0. Executive Summary
The N'Ko speech program set out to build a self-improving correction loop: an ASR
model decodes audio to toneless N'Ko, a language model proposes corrections, a
governance gate accepts only admissible corrections, and the accepted/rejected pairs
are recycled as training data. This report covers the experimental campaign that tested
whether that loop closes.
Three findings, in order of certainty:
1. An ungoverned LLM corrector is catastrophic to a low-resource transcript. Blind
acceptance of a Gemma-3n correction model moved CER from 0.3106 to 0.4701
(+15.94pp worse) on a 500-row real-proposal benchmark. The governance gate
neutralized this to +0.14pp (a 99
2. No available text-internal signal can build a correctness gate. Of the text-side
signals that could be computed (trajectory confidence, small-edit preference, length
ratio), none predicted whether a proposed edit actually lowers CER: AUC ranged from
0.37 to 0.45, at or below the 0.50 chance level (§2.5). N-best consensus and character
posteriors were not available in the snapshot. Good edits and bad edits are
indistinguishable from the text side; this was measured, not assumed.
3. The acoustic signal is useful, but not as a live correction gate. The initial
`score_delta` result looked strong (AUC 0.8243 versus 0.50 for text signals), but
the robust follow-up showed that score is partly tautological: the ASR hypothesis is
the greedy decode of the same logits, so it is acoustically hard to beat by
construction. The non-tautological live-correction signal is modest (absolute
proposal plausibility AUC 0.60). The genuine acoustic win is preservation and
data selection: ASR self-score predicted already-good rows at AUC 0.923 on the
original 297k/ANE pilot, then revalidated at a usable clean-anchor AUC 0.739.
The clean-output proposal hit rate on the old substrate was still only **1.9%**
after harness de-confounding and now needs a clean-anchor proposal regeneration.
Net status: The loop is now fully characterized component-by-component. Decode works,
live acoustic correction is capped, preservation/data-selection survives clean
revalidation (AUC 0.739), and the corrector still needs a clean-anchor proposal
regeneration before its true hit rate is known. Closing the loop requires a stronger
proposer plus using acoustic self-score as a preservation/data-selection filter for the
training flywheel rather than as a live corrector.
---
1. Background and Motivation
1.1 The data desert
N'Ko (Unicode U+07C0–U+07FF) is a phonemic, bijective script engineered by Solomana
Kanté in 1949 for the Manding languages (~40M speakers). For machine learning it has a
rare property: each grapheme maps to exactly one phoneme, so character edit distance is
a phonemically interpretable error metric (unlike Latin WER). But there is almost no
digitized N'Ko speech-text data. The realistic public pool is roughly ~500h of
@babamamadidiane YouTube lessons, the RobotsMali/afvoices corpus, Bayelemabaga
(46,976 Bambara-French pairs), and bam-asr-early (~37h). There is no large toned-text
corpus and no large N'Ko speech corpus.
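
Since every headline number in this report is a CER, a minimal sketch of the metric follows. It assumes the usual definition (unit-cost Levenshtein distance over code points, divided by reference length); the report's own scoring code is not shown, so `levenshtein` and `cer` are illustrative names rather than the repo's functions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert/delete/substitute, unit cost) over code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def cer(hyp: str, ref: str) -> float:
    """Character error rate of hyp against ref (assumed definition)."""
    if not ref:
        return 0.0 if not hyp else float("inf")
    return levenshtein(hyp, ref) / len(ref)
```

Because the denominator is the reference length, CER can exceed 1.0 when the hypothesis needs more edits than the reference has characters, which is consistent with the 3.25 reading of a mis-wired decode in §3.3.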
Consequence: you cannot pretrain an N'Ko language model on a giant corpus, because
the corpus does not exist. The system is forced to manufacture its own training data.
This is the central architectural driver: the loop is not a convenience, it is the only
path to scale for a language with no corpus.
1.2 The governed self-correction loop

    audio --> [DECODE] --> toneless N'Ko hypothesis
                                    |
                                [CORRECT]  LLM proposes a refined N'Ko string
                                    |
                                [GOVERN]   gate accepts only admissible corrections
                                    |
    accepted/rejected pairs --> [RECYCLE] --> SFT --> better corrector
            ^                                              |
            +----------------------------------------------+

Every stage is governed by the same intent: never let a fluent-but-wrong correction
overwrite acoustic evidence. The governance layer (AGP — Anticipation Geometry
Partition) exists precisely because a language-model corrector can improve surface
fluency while destroying speech evidence.
---
2. The AGP Correction Benchmark
2.1 Corpus and prior state
The benchmark uses the `nko_tar_290596` snapshot: 290,596 pairs (232,476 train /
29,060 val / 29,060 test) decoded by the trajectory-CTC ASR at ~31
`safe_lr1e4` run, not the archived 20.57
hypothesis, the reference, and seven trajectory scalars (confidence, uncertainty,
transition_pressure, recovery_margin, novelty, stability, text_quality).
A prior 29,060-row run existed but was an oracle-proposal test: it fed the
reference as the proposal to measure the gate's upper bound. The oracle ceiling
computes to CER → 0.0000 (proposal == reference), confirming it was never a
real-correction benchmark. Handed the correct answer, the gate accepted only 12
(3,302 / 27,762), with 89
correct fixes that exceeded its 2-character edit budget.
2.2 The real-proposal benchmark
A Gemma-3n-E2B-4bit model with a LoRA correction adapter ("thunder" adapter) generated
real proposals for a 500-row pilot. Results:
| Scenario | CER | Δ vs ASR |
|---|---|---|
| ASR baseline | 0.3106 | — |
| Blind accept (no gate) | 0.4701 | +15.94pp |
| AGP-gated (Python admissibility) | 0.3120 | +0.14pp |
Direct scoring: 21 better / 169 same / 310 worse. The corrector paraphrases instead
of minimally correcting — it regenerates the line and drifts. Blindly accepting its
output adds ~16 points of CER. The gate neutralizes 99
accepted-worse (the famed "zero accepted-worse" of the oracle run was partly an
artifact: oracle proposals are never harmful by construction).
2.2b Reconciliation: the documented "AGP improved CER" result
There is prior documentation (`paper4_same_snapshot_batch_replay_v2/batch_summary.md`)
showing AGP lowering CER across all four arms:
| Arm | CER before → after | Δ | accepted-worse |
|---|---|---|---|
| latin_baseline | 0.3162 → 0.3115 | −0.47pp | 0 |
| nko_baseline | 0.3124 → 0.3077 | −0.47pp | 0 |
| nko_tar | 0.3156 → 0.3111 | −0.46pp | 0 |
| latin_trajectory | 0.3270 → 0.3225 | −0.45pp | 0 |
This is real and on disk. But it is the `oracle_guardrail` run: every accepted
proposal had `proposed_text == reference_text` (verified: 242/242 accepted rows in a
2,000-row sample had proposal == reference). It feeds the correct answer as the
proposal to measure the gate's upper bound. So it proves the gate can improve CER
when handed good proposals — not that an end-to-end corrector works. It is the same
result class as §2.3's oracle column: it validates the governance upper bound, not the
end-to-end loop. The later acoustic-gate follow-up validates preservation/data-selection
more strongly than live correction. The −0.46pp (oracle) and the +0.14pp / +0.50pp
(real) numbers are all true and answer different questions — do not conflate "the gate
can improve CER given correct proposals" with "the system improves CER."
2.3 The budget two-regime proof
Sweeping the edit-budget cap on the same 500 rows, with real vs oracle proposals:
| Edit cap | Real proposals | Oracle proposals |
|---|---|---|
| 2 (current) | +0.14pp | −0.61pp (acc_worse 0) |
| 8 | +2.36pp | −6.01pp (acc_worse 0) |
| 999 (unbounded) | +15.50pp | −26.12pp (acc_worse 0) |
Interpretation: with trustworthy proposals, raising the budget unlocks up to
−26pp CER with zero accepted-worse. With the real corrector, the same lever makes it
worse. Same gate, same budget, opposite outcome — the bottleneck is provably proposal
quality, not the gate or the budget. The −26pp of headroom is real and reachable, but
only by a proposer that makes correct edits.
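
The two-regime behaviour can be illustrated with a size-only gate. The sketch below assumes the gate accepts a proposal if and only if its edit distance from the ASR hypothesis is within the cap; the real AGP admissibility check is not spelled out in this report and may apply further conditions. `gate_output` and `mean_cer_delta_pp` are hypothetical names, and the row keys are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance over code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def gate_output(asr_hyp: str, proposal: str, cap: int) -> str:
    """Assumed size-only rule: keep the proposal only if it stays within the edit cap."""
    return proposal if levenshtein(asr_hyp, proposal) <= cap else asr_hyp


def mean_cer_delta_pp(rows, cap: int) -> float:
    """Mean per-row CER change versus the ASR baseline, in percentage points."""
    total = 0.0
    for r in rows:
        ref = r["reference"]  # assumed non-empty
        out = gate_output(r["asr_hyp"], r["proposal"], cap)
        total += (levenshtein(out, ref) - levenshtein(r["asr_hyp"], ref)) / len(ref)
    return 100.0 * total / len(rows)


# Same gate, same caps, opposite outcome depending on proposal quality.
oracle = [{"reference": "abcdef", "asr_hyp": "abcxef", "proposal": "abcdef"}]
drifting = [{"reference": "abcdef", "asr_hyp": "abcdeg", "proposal": "zzzzzz"}]
for cap in (2, 8, 999):
    print(cap, mean_cer_delta_pp(oracle, cap), mean_cer_delta_pp(drifting, cap))
```

With oracle-style proposals a larger cap can only help; with a drifting proposer the same cap lets harmful rewrites through, which is the pattern in the table above.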
2.4 The minimal-edit retrain (loop-v1)
A new LoRA adapter was trained on 26,187 SFT examples built from the bridge rows, where
the target teaches minimal edits (change only wrong glyphs, else repeat the ASR). The
SFT set was built at cap=8 (27
89
iters, batch 1, max-seq 320, peak mem 4.3GB, loss 2.43 → 0.46. (A first attempt OOM'd
at iter 120 with batch 4; fixed with batch 1 + seq cap.)
Result (loop-v1 did NOT close):
| Metric | Old adapter | Minimal-edit adapter |
|---|---|---|
| Gated CER delta | +0.14pp | +0.50pp (worse) |
| Blind delta | +15.94pp | +11.63pp |
| Refused (prop=ASR) | ~0 | 204/500 |
| Accepted-worse (gated) | 18 | 69 |
| Direct better/same/worse | 21/169/310 | 14/225/261 |
What improved: the model learned to refuse (204/500 left noisy rows alone) and blind
harm dropped. What failed: it made edits small enough to slip the 2-char gate but
still wrong. In the `uncertain` partition, 59 of 78 accepted edits were worse.
Training for minimal-edit SIZE without minimal-edit CORRECTNESS made mistakes small
enough to defeat a size-only gate.
2.5 The correctness-gate dead end
The natural fix, a smarter gate built on the available signals, was tested and failed on
every signal that could be computed. AUC for predicting whether a committed edit is actually better:
- Trajectory confidence: 0.369 (below 0.5 — higher confidence weakly predicts worse)
- Small-edit preference: 0.449
- Length-ratio ~1: 0.397
- N-best consensus: unavailable (every row has exactly 1 hypothesis; 0/29,060 have >1)
- Char posteriors: null
Conclusion: the existing text-only artifacts contain no signal capable of building a
correctness gate. The missing signal had to come from outside the text — from the
acoustics.
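
For reference, an AUC of this kind can be computed with a rank-based (Mann-Whitney) estimate. This is a minimal sketch, not the report's evaluation script: `auc` is an illustrative name, the label is assumed to be "the edit lowered CER", and ties are counted as half a win.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: higher score should mean "edit is better".
print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

A value of 0.5 means the signal ranks improving and non-improving edits no better than chance; a value below 0.5 means it tends to rank improving edits lower than non-improving ones.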
---
3. The Acoustic Verifier Experiment
3.1 Hypothesis
If the proposed N'Ko string matches the audio better than the ASR hypothesis does, it
is more likely correct. Operationalize "matches the audio" as the CTC score of the
candidate against the ASR model's own per-frame logits: lower CTC loss = the audio
aligns better to that target. Define `score_delta = asr_score − prop_score` (>0 means
the audio prefers the proposal). Test whether `score_delta` predicts real CER
improvement.
3.2 Data
The 290,596 snapshot is text-only (audio_id and audio_path are null for every row),
so the existing benchmark cannot be acoustically re-scored. However, 1,381 `bam_train`
utterances have pre-extracted Whisper-large-v3 encoder features on HD1
(`/Volumes/HD1/Mac4-Offload/Desktop/ane-training/features/bam_train_*.pt`, shape
[375,1280] float16) with gold N'Ko in `pairs.jsonl`. This is a self-contained,
no-download eval set. (The af_human_corrected majority needs a ~1.3GB HF pull and was
out of scope.)
3.3 The matched-triple wiring problem (and its resolution)
This was the hardest part of the experiment and the most important to document. The
checkpoint, the model class, and the vocabulary must all come from the same training
run. Two wrong combinations were tried before the correct one (attempt 3 below):
| Attempt | Checkpoint | Model class | Vocab | Sanity CER |
|---|---|---|---|---|
| 1 | nko_traj_best.pt (66-class) | TrajectoryTransformerCTCHead | fixed U+07C0.. blank=65 | 3.25 (random) |
| 2 | nko_traj_best.pt (66-class) | transcribe_nko UnifiedCTCHead | fixed, blank=0 | 0.53 (wrong vocab) |
| 3 (correct) | nko_traj_297k/nko_traj/best.pt (61-class) | TrajectoryTransformerCTCHead | sibling vocab.json, data-derived, blank=60 | 0.22 (correct) |
Key facts that resolve the wiring:
- The training script (`docs/handoffs/train_vastai_tar_ttt_anchor_audit_20260502.py`)
uses `get_vocab_from_data()` — the vocab is data-derived (60 chars: mixed
N'Ko + Latin + punctuation in frequency order, NOT the U+07C0–07FF codepoint block),
saved to `vocab.json` as `{"chars":[...], "num_classes":N}` with blank = `len(chars)`.
- The matched checkpoint `nko_traj_297k/nko_traj/best.pt` has output dim 61, uses the
`self_attn` architecture (→ `asr/trajectory_asr.py::TrajectoryTransformerCTCHead`),
and its sibling `vocab.json` is the correct map. The earlier `nko_traj_best.pt`
(66-class) is a different run whose vocab is not on disk.
- 12 missing / 8 unexpected keys remain (`scalar_computer` vs `trajectory_bias_net`
naming drift), so the trajectory bias is not fully loaded — but the base CTC path
loads cleanly (0.22 CER proves it), and the trajectory bias is not load-bearing for
the acoustic gate (we score via CTC logits, not the z_t bias).
The sanity gate caught all three errors. A pipeline that produces CER 3.25 or 0.53
is mis-wired; only the 0.22 result (with one utterance decoding at exactly 0.000)
indicates correct wiring. This is why the plan put a mandatory sanity gate before any
trust.
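
The wiring checks above can be sketched as a small guard that runs before any scoring. This is an illustration under stated assumptions: it relies only on the `vocab.json` layout quoted above (`"chars"` list, blank = `len(chars)`), and the `max_cer=0.30` cutoff and the names `load_vocab` and `check_wiring` are hypothetical, chosen so that the 0.22 run passes and the 0.53 and 3.25 runs fail.

```python
import json


def load_vocab(path):
    """Read the data-derived vocab; the blank index is len(chars) by convention."""
    with open(path, encoding="utf-8") as f:
        chars = json.load(f)["chars"]
    char_vocab = {c: i for i, c in enumerate(chars)}
    return char_vocab, len(chars)


def check_wiring(vocab_path, model_output_dim, mean_cer, max_cer=0.30):
    """Fail fast if checkpoint, vocab, and decode quality do not line up."""
    char_vocab, blank_idx = load_vocab(vocab_path)
    # A 60-char vocab plus blank must match a 61-way output layer.
    if model_output_dim != blank_idx + 1:
        raise ValueError(
            f"output dim {model_output_dim} does not match vocab size + blank ({blank_idx + 1})"
        )
    # Hypothetical threshold: anything far above the expected ~0.22 is treated as mis-wired.
    if mean_cer > max_cer:
        raise ValueError(f"sanity CER {mean_cer:.2f} exceeds {max_cer:.2f}; treat as mis-wired")
    return char_vocab, blank_idx
```

Checking the output dimension first catches a checkpoint/vocab mismatch (such as a 66-class checkpoint with a 60-char vocab) before any audio is decoded; the CER check catches mismatches that happen to have compatible shapes.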
3.4 The CTC scoring primitive
The one piece not already in the repo is a CTC scoring function (CTC loss of a fixed
target against fixed logits), ~15 lines:
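
    import torch
    import torch.nn.functional as F

    def score_ctc(log_probs, target, char_vocab, blank_idx):
        """CTC loss of a fixed target against fixed per-frame log-probs [T, C]."""
        idxs = [char_vocab[c] for c in target if c in char_vocab]  # drop OOV (gold has IPA leak)
        if not idxs or len(idxs) > log_probs.shape[0]:
            return float("inf")
        loss = F.ctc_loss(log_probs.unsqueeze(1),            # [T, 1, C]
                          torch.tensor(idxs).unsqueeze(0),   # [1, L]
                          torch.tensor([log_probs.shape[0]]), torch.tensor([len(idxs)]),
                          blank=blank_idx, zero_infinity=True)
        # The default reduction="mean" already divides by the target length, so this
        # second division returns loss / L**2; use reduction="sum" for per-char nats.
        return loss.item() / len(idxs)

3.5 Procedure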
1. Decode all 1,381 eval utterances: feature [375,1280] → logits [T',61] → greedy
decode (asr_hyp) + `score_ctc(asr_hyp)`. Mean CER = 0.2168 over 1,380 (1 corrupt
feature skipped). Sanity gate passed.
2. Propose corrections for all 1,380 via the minimal-edit adapter on mac5
(`adapter_minedit_v1`, Gemma-3n + LoRA, max-tokens 48). 1,380 proposals.
3. Re-score each proposal against the same audio logits; compute `score_delta`,
`asr_cer`, `prop_cer`, and the label `improved = prop_cer < asr_cer`. Restrict to
committed edits (proposal ≠ ASR): 991 rows.
4. Evaluate: AUC of `score_delta` for predicting `improved`; end-to-end gated CER
at margins 0, 0.001, 0.005, 0.01.
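
Step 4's margin sweep can be sketched as follows. The sketch assumes the accept rule is `score_delta > margin` and that gated CER is the per-row mean over all committed edits; the field names follow the `gate_eval.jsonl` description in §8, while `sweep_margins` is a hypothetical name and not necessarily how `evaluate_gate.py` is organized.

```python
def sweep_margins(rows, margins=(0.0, 0.001, 0.005, 0.01)):
    """For each margin, accept proposals whose score_delta exceeds it and report gated CER."""
    baseline = sum(r["asr_cer"] for r in rows) / len(rows)
    report = []
    for m in margins:
        accepted = [r for r in rows if r["score_delta"] > m]
        acc_worse = sum(r["prop_cer"] > r["asr_cer"] for r in accepted)
        # Accepted rows contribute the proposal's CER; all others keep the ASR's CER.
        gated = sum(
            r["prop_cer"] if r["score_delta"] > m else r["asr_cer"] for r in rows
        ) / len(rows)
        report.append({
            "margin": m,
            "accepted": len(accepted),
            "acc_worse": acc_worse,
            "gated_cer": gated,
            "delta_pp": 100.0 * (gated - baseline),
        })
    return report


# Toy rows only; real rows come from gate_eval.jsonl.
toy = [
    {"score_delta": 0.02, "asr_cer": 0.4, "prop_cer": 0.2},
    {"score_delta": 0.003, "asr_cer": 0.2, "prop_cer": 0.5},
    {"score_delta": -0.1, "asr_cer": 0.3, "prop_cer": 0.9},
]
for line in sweep_margins(toy):
    print(line)
```

The gated CER can only differ from the baseline on accepted rows, so a gate that accepts very few proposals cannot move CER much in either direction.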
3.6 Results

    991 committed edits | actually-better: 5 (0.5%)
    ACOUSTIC score_delta AUC (predict prop_cer < asr_cer) = 0.8243
      (text-side signals gave ~0.50)

    margin   accepted   acc_worse   gated_CER   vs_asr
    0.000    9          3           0.3238      +0.01pp
    0.005    7          3           0.3238      +0.01pp
    0.010    6          3           0.3238      +0.01pp

    ASR baseline CER on these 991 rows = 0.3237

3.7 Initial interpretation, later corrected by §8b
Both numbers are true and they measure different things:
- AUC 0.82 (score_delta ranking): this initially looked like a strong acoustic
verifier signal versus 0.50 for all text signals. The robust follow-up in §8b corrects
that interpretation: because the ASR hypothesis is the greedy decode of these same
logits, `score_delta` is partly a proxy for how far the proposal moved away from the
ASR, not an independent measurement that the audio prefers the proposal.
- +0.01pp (operating-point power): the end-to-end gate barely moves CER, because
only 5 of 991 proposals are improvements. At the accept threshold, the gate accepts
9, of which 1 improved and 3 were worse — net negative at the operating point. The
five improvements rank at `score_delta` positions [3, 62, 156, 268, 387]: only one is
near the top, so the 0.82 AUC comes mostly from correctly demoting bad edits, not
from cleanly surfacing good ones.
These diverge because the base rate is 0.5
acoustically privileged by construction. You cannot gate your way to a CER win when
almost all raw material is wrong and the live acoustic comparison almost never accepts
non-ASR text.
---
4. Status of the Loop, Component by Component
| Stage | Validated? | Evidence | Bottleneck? |
|---|---|---|---|
| Decode (trajectory-CTC ASR) | Yes | 0.22 CER on local eval; 0.275 full-test (archived); 0.2057 anchor | No |
| Govern (live acoustic verifier) | No | score_delta AUC 0.82 was inflated; absolute proposal plausibility AUC 0.60; accepts almost nothing | Yes — not viable as live gate |
| Preserve (acoustic self-score) | Yes | AUC 0.923 on original contaminated 297k/ANE pilot; AUC 0.739 on clean anchor revalidation | No |
| Govern (size-only gate) | Partial | Neutralizes +15.94pp → +0.14pp; but leaks small wrong edits | No |
| Correct (LLM proposer) | No | 1.9 | |
| Recycle (SFT flywheel) | Untested | infrastructure exists (build_gemma_nko_correction_sft.py) | Pending proposer |
The architecture is sound; the empirical weak link is isolated to one component with
quantified evidence.
---
5. Honest Limitations
1. Eval set is Bambara `bam_train`, 1,380 utts, not the af_human_corrected
"human-corrected" subset used in the AGP text benchmarks; not apples-to-apples with
the 500-row text-gate result. A positive result here justifies the HF download for an
at-scale, matched run.
2. Base CER is ~31
At lower base CER, corrections may become more learnable; the raw 0.5
cleaned 1.9
3. The trajectory bias is not fully loaded in the verifier checkpoint (12 missing
keys). This does not affect the acoustic gate (CTC-logit-based) but means the loaded
model is the toneless base CTC, not the full trajectory-biased variant.
4. The 0.82 AUC is not an independent live-verifier result. It is a score_delta
ranking metric inflated by comparing proposals against the ASR's own greedy decode.
It should never be reported as "the acoustic gate improves CER" — it does not.
5. OOV handling: gold N'Ko contains ~11
outside the model's vocabulary; these are dropped in scoring, which slightly favors
in-vocab targets. A cleaned gold set would tighten the numbers.
---
6. What Closing the Loop Now Requires
The bottleneck is proposal quality, not governance. Three paths, ranked:
1. Acoustic self-score as preservation/data-selector (strongest, honest, reframes the
value). Do not use the verifier for live correction. Use ASR self-score to preserve
already-good transcripts (clean AUC 0.739; original 297k/ANE pilot AUC 0.923) and rank self-labeled `(audio → N'Ko)` pairs for
retraining. This is the self-improving flywheel: the acoustic score becomes the
quality filter that lets the system manufacture clean training data from its own
output — directly answering the data desert. The +0.01pp live-correction result is
irrelevant to this use; subset quality/yield is the metric that matters.
2. A fundamentally stronger proposer. More/better SFT data, contrastive training, or
a larger correction model, so the hit rate rises from 1.9
conservative gate yields a net win. The budget two-regime proof shows up to −26pp is
available once proposals are trustworthy.
3. Lower the base ASR error first, but only after reconstructing provenance. The
archived 20.57
anchor class loads cleanly but emits blanks on `bam_train_*` pilot features. Recreate
the anchor's April eval feature/pair path first, then rerun preservation and harvest
yield against the genuinely lower-CER base.
---
7. Reproduction
All paths absolute; HD1 must be mounted.

    cd Desktop/nko-brain-scanner/experiments/acoustic_gate

    # Step 1: build the eval set (1,381 local bam_train rows)
    python3 build_eval_set.py

    # Step 2: decode + acoustic-score (mandatory sanity gate: mean CER must be ~0.22)
    KMP_DUPLICATE_LIB_OK=TRUE python3 decode_and_score.py    # -> decoded.jsonl

    # Step 3: generate proposals on mac5 (minimal-edit adapter)
    #   build proposer_input.jsonl from decoded.jsonl, rsync to mac5,
    #   run agp_text_proposal.py with --adapter-path [home-path]
    #   then scp proposals_acoustic_eval.jsonl back here.

    # Step 4: the verdict (AUC + gated CER)
    KMP_DUPLICATE_LIB_OK=TRUE python3 evaluate_gate.py       # -> gate_eval.jsonl

Matched triple (critical):
- Checkpoint: `/Volumes/HD1/tar_297k_clean/checkpoints/nko_traj_297k/nko_traj/best.pt`
- Model class: `asr/trajectory_asr.py::TrajectoryTransformerCTCHead(num_chars=60, use_trajectory=True)`
- Vocab: sibling `nko_traj_297k/nko_traj/vocab.json` (60 chars, blank=60, data-derived order)
- Features: `/Volumes/HD1/Mac4-Offload/Desktop/ane-training/features/bam_train_*.pt` ([375,1280] f16)
Environment note: set `KMP_DUPLICATE_LIB_OK=TRUE` (OpenMP double-init on this
machine); model runs on MPS (Apple Silicon) or CPU; no mesh required.
---
8. Artifacts
| File | Contents |
|---|---|
| `build_eval_set.py` | joins local bam_train features → eval_set.jsonl (1,381 rows) |
| `decode_and_score.py` | matched-triple loader + CTC `score_ctc` + decode → decoded.jsonl |
| `evaluate_gate.py` | acoustic re-score + AUC + gated-CER margin sweep → gate_eval.jsonl |
| `eval_set.jsonl` | 1,381 {feat_id, feature_path, reference_text} |
| `decoded.jsonl` | 1,380 {asr_hyp, asr_cer (0.22 mean), asr_score} |
| `proposals_acoustic_eval.jsonl` | 1,380 minimal-edit proposals (from mac5) |
| `gate_eval.jsonl` | 991 {score_delta, asr_cer, prop_cer, improved} — the verdict data |
---
8b. Loop-v2 Refinement: Harness Fix, Preservation Lane, and a Correction to the AUC Claim
After the initial verdict, the result was interrogated and found confounded. Three
refinements were applied on the existing artifacts (no regeneration — proposals carry
`raw_response`):
### Harness fix (real, measured)
The mac5 proposer leaked bracket-wrapping (`[answer]`, 71
("Return the candidate.", 28
re-extractor (`reextract.py`: unwrap brackets, keep only U+07C0–07FF spans, collapse
repeats, fall back to ASR when no N'Ko remains) reduced non-N'Ko characters from 7.0
to ~0
correctly fall back to ASR instead of counting as garbage edits.
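
The re-extraction steps named above can be sketched as below. This is not the contents of `reextract.py`, which this report does not reproduce: the function signature is hypothetical, and "collapse repeats" is interpreted here as dropping a span that immediately repeats the previous one, which is an assumption.

```python
import re

# Runs of characters from the N'Ko block (U+07C0 to U+07FF).
NKO_SPAN = re.compile(r"[\u07C0-\u07FF]+")


def reextract(raw_response: str, asr_hyp: str) -> str:
    """Recover a clean N'Ko proposal from a raw model response, else fall back to the ASR."""
    text = raw_response.strip()
    # Unwrap a single layer of [ ... ] around the whole answer.
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]
    # Keep only N'Ko spans; Latin text, IPA, and instruction echoes are dropped.
    spans = NKO_SPAN.findall(text)
    # Assumed meaning of "collapse repeats": drop immediate duplicates of a span.
    collapsed = [s for i, s in enumerate(spans) if i == 0 or s != spans[i - 1]]
    # No N'Ko left means the model did not propose anything usable.
    return " ".join(collapsed) if collapsed else asr_hyp
```

The fallback matters for the hit-rate accounting: a response with no N'Ko content becomes a refusal (proposal equals ASR) rather than a large, wrong edit.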
### Preservation lane (original 297k/ANE pilot; real but later magnitude-corrected)
The acoustic self-score `score_ctc(asr_hyp, audio)` predicts whether the ASR was already
good (CER<0.30) at AUC 0.923 (good rows median 0.0021 vs bad 0.0176). A reference-free
preservation rule (`preserve if asr_score < 0.00843`, tuned on a held-out split) preserves
rows that are **94
lane as the first behaving lane and directly addresses the 77
### Correction to the AUC 0.82 claim (important)
The robust 4-condition eval (`robust_eval.py`) exposed a structural flaw: the acoustic
gate as a "proposal beats ASR" rule (`accept iff prop_score < asr_score`) accepts
almost nothing (9/991; mean score_delta = −0.108). Root cause: the ASR hypothesis is
the greedy decode of the very logits we score against, so it has a near-optimal CTC score
by construction — a proposal essentially cannot beat the audio score of the ASR. The
reported AUC 0.82 was therefore partly tautological: `score_delta ≈ −prop_score ≈ how
much the proposal differs from the ASR`, which correlates with "big risky edit," not with
genuine independent acoustic preference. The non-tautological signal — the absolute
acoustic plausibility of the proposal (low `prop_score`) — separates good from bad edits
at only AUC 0.60. So the true acoustic-verifier signal is modest, not strong; the
0.82 was inflated by the construction.
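
The structural point can be checked on a toy example. The sketch below is a pure-Python CTC forward pass and greedy decoder written for illustration only (it is not the repo's `score_ctc`, and all names are hypothetical). Greedy decoding picks the single best path, which is not guaranteed to be the highest-likelihood target in general, but when frames are peaked it is, and every alternative target then scores worse.

```python
import math


def logsumexp(*xs):
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_nll(log_probs, target, blank):
    """CTC negative log-likelihood of target; log_probs is T lists of C log-probs."""
    ext = [blank]
    for t in target:
        ext += [t, blank]          # blank-interleaved target
    S = len(ext)
    alpha = [-math.inf] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for frame in log_probs[1:]:
        new = [-math.inf] * S
        for s in range(S):
            terms = [alpha[s]]
            if s >= 1:
                terms.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + frame[ext[s]]
        alpha = new
    total = logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total


def greedy_decode(log_probs, blank):
    """Argmax per frame, collapse repeats, drop blanks."""
    out, prev = [], blank
    for frame in log_probs:
        k = max(range(len(frame)), key=frame.__getitem__)
        if k != prev and k != blank:
            out.append(k)
        prev = k
    return out


def to_log(rows):
    return [[math.log(p) for p in r] for r in rows]


# Peaked toy posteriors over classes {0, 1, blank=2}.
peaked = to_log([[0.9, 0.05, 0.05], [0.05, 0.05, 0.9],
                 [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]])
hyp = greedy_decode(peaked, blank=2)
print(hyp, ctc_nll(peaked, hyp, blank=2), ctc_nll(peaked, [0, 0], blank=2))
```

In this toy case the greedy hypothesis holds more than half of the total probability mass, so no other target can beat it; the `accept iff prop_score < asr_score` rule runs into the same situation.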
### What survives this correction
1. The harness fix is real (0.5
2. The preservation lane is real and reference-free (AUC 0.923 here on the original
297k/ANE pilot; clean-anchor magnitude later corrected to AUC 0.739 in §8d.2).
3. The proposer, now de-confounded, is still only 1.9
"Gemma is weak at correction" conclusion is now earned, not assumed.
4. The acoustic gate as a live correction rule does not work (ASR is unbeatable by
construction); its viable use is as a data-selection filter (absolute plausibility
0.60 + preservation may suffice to pick keep-worthy self-labeled pairs — a lower bar
than live correction).
8c. The Data-Flywheel Filter: Verifier as Reference-Free Data Selector
The strongest surviving use of the acoustic verifier is not live correction but
data selection: rank the ASR's own outputs by acoustic self-score and keep the most
confident as clean self-labeled training pairs — exactly the bootstrapping a
corpus-less language needs. Tested on the 1,380 decoded utterances (sort by `asr_score`
ascending, take top-K):
| Keep top | Subset CER | vs full (0.217) |
|---|---|---|
| 10 | ||
| 25 | ||
| 50 |
The filter is a real quality ranker. But the high-precision harvest is small (proper
precision-at-threshold sweep, 90
CER<0.15, 133 (10
model quality:** a 31
training data at 90
harvest cleaner-than-the-model data. Scaling it requires a better base ASR (the 20.57
anchor would raise yield), more raw audio to filter (the af_human 25k pool), or both.
Critically, both deployable wins — preservation and data filtering — ride the same
reference-free acoustic self-score, so they are one mechanism with two uses. §8d.2
revalidates the magnitude on the clean anchor substrate.
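
Both uses of the self-score reduce to a sort or a threshold on `asr_score`. A minimal sketch, assuming rows shaped like `decoded.jsonl` (`asr_score`, plus `asr_cer` for evaluation only); the function names are hypothetical, and the threshold is a parameter because the right value depends on the model and feature regime.

```python
def harvest_top_fraction(rows, fraction):
    """Keep the lowest-loss (most acoustically confident) fraction of rows."""
    ranked = sorted(rows, key=lambda r: r["asr_score"])
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]


def harvest_below_threshold(rows, threshold):
    """Reference-free preservation rule: keep rows whose self-score is under the threshold."""
    return [r for r in rows if r["asr_score"] < threshold]


def mean_cer(rows):
    """Evaluation only: needs references, unlike the selection itself."""
    return sum(r["asr_cer"] for r in rows) / len(rows)


# Toy rows; lower asr_score means the audio fits the transcript better.
rows = [
    {"asr_score": 0.001, "asr_cer": 0.05},
    {"asr_score": 0.002, "asr_cer": 0.10},
    {"asr_score": 0.020, "asr_cer": 0.40},
    {"asr_score": 0.030, "asr_cer": 0.60},
]
top = harvest_top_fraction(rows, 0.5)
print(len(top), mean_cer(top), mean_cer(rows))
```

Selection never touches the reference, so it can run on unlabeled audio; references are needed only to measure how clean the kept subset is.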
8d. Anchor Re-Evaluation Attempt: Corrected Blocker
The obvious next question is whether the same preservation/data-selection conclusions
improve when the base ASR is the archived 20.57
pilot model. A quick checkpoint swap was invalid: loading the anchor into
`asr/trajectory_asr.py::TrajectoryTransformerCTCHead` silently dropped trajectory-module
keys and produced near-total blank output.
That initial diagnosis was still incomplete. The anchor checkpoint itself confirms the
exact trained configuration: no TTT and no depth gate. Its only trajectory pieces are
`scalar_computer.net.{0,2}` and `trajectory_bias_net.net.{0,2}`. Loading the anchor with
its training-era model class from
`docs/handoffs/train_vastai_tar_ttt_anchor_audit_20260502.py`
(`UnifiedCTCHead(num_classes=66, use_trajectory=True, use_tar=False, use_ttt=False)`)
loads cleanly: 0 missing keys, 0 unexpected keys. Yet on the acoustic-gate pilot
rows (`bam_train_*.pt` from `/Volumes/HD1/Mac4-Offload/Desktop/ane-training/features`)
it still emits near-total blank output: the 20-row matched-class probe had mean CER
1.0000, and row 0's argmax was blank on all 94 frames. So the current blocker is
not merely model class mismatch. The sharper status is:
- the anchor checkpoint and its own class are loadable;
- the anchor does not decode this 1,380-row acoustic-gate feature set;
- the pilot features come from `Mac4-Offload/Desktop/ane-training/features/bam_train_*`,
while the anchor's recorded 20.57
pipeline with its own randomized 232,476/29,060/29,060 split;
- the likely mismatch is feature corpus / preprocessing / split provenance, not just
state-dict key naming;
- the 20.57
not yet a drop-in replacement for the acoustic-gate pilot without reconstructing the
exact anchor inference dataset/path from the April run.
8d.1 Resolved: provenance recovered, anchor validated, but the CER head-to-head is confounded
The provenance gap is now closed. The anchor trained on the HF dataset
`Diomande/bambara-whisper-features`, whose features are full-30s Whisper-large-v3 encoder
output (1, 1500, 1280) — versus the ane-training pilot features (375, 1280). That
frame-structure mismatch is exactly why the cleanly-loaded anchor saw silence. Both
datasets key off `RobotsMali/bam-asr-early` as `bam_train_{idx:06d}`, so the same 1,381
utts exist in the HF set with identical ids. `fetch_anchor_feats.py` stream-extracts them
from `features_complete_shard_021.tar` (early-stop, ~5 GB) to
`/Volumes/HD1/anchor_bam_feats`. Decoding the anchor on its native features
(`NKO_ARCH=anchor`, `eval_set_anchor.jsonl`) yields CER 0.3087 over 1,380 utts —
real, correct N'Ko, not blank. The anchor checkpoint is therefore validated
end-to-end.
The seed-42 split was reconstructed exactly (`train=232476`, `val=29060`, `test=29060`,
matching `results.json`). The 1,381 pilot utterances scatter across that split; they are
not all train leakage. On this slice, the anchor's train-portion CER is 0.3081 and its
held-out val+test portion is 0.3112, a negligible +0.0031 gap. So the anchor
generalizes on these utterances. The gap between this slice's ~0.31 CER and the anchor's
recorded 0.2057 is distributional: this is a small, short-utterance-heavy
`bam_train_000000..001380` slice, while the archived number is the 29,060-row held-out
test over the full April corpus.
A direct anchor-vs-297k CER comparison is invalid, and the apparent −19pp "win" must
not be cited. The two models target different N'Ko transliteration conventions. The
297k pilot trained on the contaminated ane pairs (its data-derived vocab is 60-class
mixed N'Ko + Latin, including raw IPA `ɔ/ɛ`); the anchor trained on the clean
re-transliterated 66-class pairs (the repro script's explicit fix: "bam-asr-early native
nko is CONTAMINATED, re-transliterate from latin, Fixed 2026-04-16"). Concretely, for
`bam_train_000000` ("jigi i bolo…") the clean target is `ߖߌߜߌ ߌ ߓߏߟߏ…` (ji = ߖߌ, correct),
while the ane target is `ߙߌߗߌ ߌ ߐߍߓߍ…` (wrong glyphs) with IPA. The 297k faithfully
reproduces the corrupt convention, so it scores 0.2168 against its own ane refs but
0.5420 against clean refs — a convention mismatch, not a quality gap
(`compare_anchor_vs_297k.py`). The defensible conclusions are: (1) the anchor decodes
correct N'Ko; (2) the 297k's low pilot CER is partly measured against corrupted
targets. Implication for this report: the mechanism behind the reference-free
acoustic self-score survives, but its magnitude must be re-measured on the clean anchor
substrate (§8d.2: preservation AUC 0.739, not 0.923). Every reference-dependent result
(proposer hit rate 1.9
targets and had to be re-run on the anchor + clean refs before being treated as final
(completed in §8d.3).
8d.2 Clean re-validation: the preservation number was inflated
Re-running the preservation lane on the anchor decode (clean HF refs as the good/bad
label, not the contaminated ane refs) lowers the AUC from 0.923 to 0.739. The
acoustic self-score still separates good rows from bad (good median self-score 0.0013 vs
bad 0.0034, ~2.6×), so the lane is a real, reference-free signal — but the 0.92 headline
was inflated by (a) contaminated good/bad labels and (b) the 297k/ANE feature regime; the
usable figure is ≈0.74. The flywheel harvest holds on clean data: 3.2
CER<0.10, 9.1
the deployable components remain the cleanest wins, but their magnitudes were overstated.
> Feature-extractor provenance (ANE). The 297k pilot's features
> (`ane-training/features`) were produced by a CoreML Whisper encoder on the Apple
> Neural Engine (`ane_ctc_train.py`: "Frozen Whisper encoder on ANE, CTC head on MLX
> GPU"), giving 375-frame features that are numerically distinct from the anchor's
> 1500-frame standard PyTorch-GPU Whisper-large-v3 features. This is the concrete
> mechanism behind the out-of-distribution all-blank output, and a third confound
> (alongside transliteration convention and split) that makes a 297k-vs-anchor CER
> ranking invalid. All decoding in this session ran on MPS (Metal GPU)/CPU, not the
> ANE; the ANE was used only upstream, for the 297k feature extraction.
8d.3 Clean-anchor min-edit regeneration: the loop still does not close
The clean proposal regeneration was run on mac5 using the staged clean-anchor input
(`proposer_input_anchor.jsonl`, 1,381 rows: anchor hypotheses + clean HF references),
`mlx-community/gemma-4-e2b-4bit`, and the min-edit adapter
`[home-path]`. The output artifact is
`proposals_anchor_clean_minedit.jsonl`, with a re-extracted clean-N'Ko view in
`proposals_anchor_clean_minedit_extracted.jsonl`.
The harness fix still matters: re-extraction changed 999/1,381 rows (72
381 refusals (28
not rescue the correction loop.
`robust_eval_anchor_clean.py` re-scored the regenerated proposals with the anchor model
and native HF features. It reports aggregate corpus CER (total edit distance / total
reference characters), which weights long utterances more heavily, so its values are not
directly comparable to the earlier per-utterance mean CER sanity number.
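A minimal sketch of why the two CER conventions diverge, assuming character-level Levenshtein distance and non-empty references. The function names here are illustrative and are not taken from `robust_eval_anchor_clean.py`.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert/delete/substitute, cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def corpus_cer(pairs):
    """Aggregate CER: total edits over total reference characters."""
    total_edits = sum(levenshtein(hyp, ref) for hyp, ref in pairs)
    total_chars = sum(len(ref) for _, ref in pairs)
    return total_edits / total_chars


def mean_utterance_cer(pairs):
    """Per-utterance CER averaged with equal weight per utterance."""
    return sum(levenshtein(hyp, ref) / len(ref) for hyp, ref in pairs) / len(pairs)


# Toy (hypothesis, reference) pairs: one short wrong row, one long correct row.
pairs = [("a", "b"), ("abcdefghij", "abcdefghij")]
print(corpus_cer(pairs), mean_utterance_cer(pairs))
```

On the toy pairs the corpus figure is 1 edit over 11 reference characters, while the per-utterance mean is 0.5, because the short wrong row counts as much as the long correct one.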
| Condition | CER | Δ vs ASR | 95% CI |
|---|---|---|---|
| ASR baseline | 0.3514 | — | — |
| raw + acoustic gate | 0.3518 | +0.03pp | [+0.00,+0.12]pp |
| clean + acoustic gate | 0.3514 | +0.00pp | [+0.00,+0.00]pp |
| clean + preserve + acoustic | 0.3514 | +0.00pp | [+0.00,+0.00]pp |
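This section does not state how the interval column was computed. One common choice is a paired bootstrap over utterances, sketched below under that assumption. The function name, the per-row edit-count inputs, the resample count, and the seed are all hypothetical.

```python
import random


def bootstrap_delta_cer(base_edits, gated_edits, ref_lens, n_boot=2000, seed=0):
    """Paired bootstrap of corpus-level CER delta (gated minus baseline).

    Each argument is a per-utterance list: edit distance of the ASR baseline,
    edit distance of the gated output, and reference length in characters.
    """
    rng = random.Random(seed)
    n = len(ref_lens)

    def delta(indices):
        chars = sum(ref_lens[i] for i in indices)
        extra = sum(gated_edits[i] - base_edits[i] for i in indices)
        return extra / chars

    point = delta(list(range(n)))
    # Resample utterances with replacement, keeping baseline/gated rows paired.
    samples = sorted(
        delta([rng.randrange(n) for _ in range(n)]) for _ in range(n_boot)
    )
    lo = samples[int(0.025 * n_boot)]
    hi = samples[int(0.975 * n_boot) - 1]
    return point, lo, hi


# If the gate accepts nothing, every row is unchanged and the interval collapses.
print(bootstrap_delta_cer([3, 5, 2, 4], [3, 5, 2, 4], [10, 12, 8, 9]))
```

When the gate accepts no edits, every resample has a delta of exactly zero, which is consistent with the degenerate [+0.00,+0.00]pp intervals in the clean rows.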
Proposal quality remains the bottleneck even after moving to clean references: raw
proposals contained 3 better / 391 same / 987 worse candidates; clean extracted
proposals contained 10 better / 936 same / 435 worse. The gate prevents damage by
rejecting almost everything, but it does not create a CER win. The conclusion is now
earned on the clean substrate: AGP/preservation is useful as harm prevention and data
selection; the current Gemma min-edit proposer is not strong enough for live
self-correction.
8d.4 Base-Gemma control: the adapter is not the only problem
To isolate whether the min-edit LoRA adapter was hurting proposals, the same clean-anchor
input was regenerated on mac5 with base `mlx-community/gemma-4-e2b-4bit` and no
adapter. This produced `proposals_anchor_clean_base.jsonl`, with the cleaned view
`proposals_anchor_clean_base_extracted.jsonl` and eval report
`clean_anchor_eval_base_report.json`.
The base model is less format-broken than the min-edit adapter: re-extraction changed
929/1,381 rows (67
characters fell 9.3
messier. But base Gemma still does not close the loop.
| Proposer | Proposal view | better | same | worse | accepted by clean+preserve gate | CER delta |
|---|---|---|---|---|---|---|
| min-edit adapter | raw | 3 | 391 | 987 | 1 raw worse edit | +0.03pp raw |
| min-edit adapter | cleaned | 10 | 936 | 435 | 0 | +0.00pp |
| base Gemma | raw | 8 | 717 | 656 | 1 raw worse edit | +0.00pp raw |
| base Gemma | cleaned | 19 | 306 | 1056 | 14 (5 better / 5 same / 4 worse) | +0.01pp |
Base Gemma exposes a small real correction signal (5 accepted helpful edits), but it is
too weak and too noisy: the clean gated result is 0.3516 CER vs 0.3514 baseline
(+0.01pp, CI [-0.01,+0.06]pp). A fast edit-distance diagnostic makes the failure mode
plain: for base-cleaned proposals, the median helpful improvement is 1 character,
while the median harmful damage is 10 characters (min-edit: 1 vs 9). The adapter is
not the sole bottleneck. The broader problem is that Gemma-E2B, with the current
prompting/extraction setup and no clean task-specific training, does not propose enough
acoustically valid corrections to beat a 20.57
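The better/same/worse counts and the median helpful versus harmful sizes can be reproduced with a diagnostic along these lines. This is a sketch, not the report's script: it assumes "better" means the proposal is strictly closer to the reference than the ASR hypothesis in character edit distance, and the row layout `(hypothesis, proposal, reference)` is hypothetical.

```python
from statistics import median


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert/delete/substitute, cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def proposal_diagnostic(rows):
    """Count better/same/worse proposals and the median size of each effect."""
    counts = {"better": 0, "same": 0, "worse": 0}
    gains, losses = [], []
    for hyp, prop, ref in rows:
        # Positive change: the proposal removed errors; negative: it added them.
        change = levenshtein(hyp, ref) - levenshtein(prop, ref)
        if change > 0:
            counts["better"] += 1
            gains.append(change)
        elif change < 0:
            counts["worse"] += 1
            losses.append(-change)
        else:
            counts["same"] += 1
    return (counts,
            median(gains) if gains else None,
            median(losses) if losses else None)


rows = [
    ("abcd", "abcf", "abcf"),  # proposal fixes one character
    ("abcd", "abcd", "abcd"),  # proposal copies a correct hypothesis
    ("abcd", "xyz", "abcd"),   # proposal rewrites a correct hypothesis
]
print(proposal_diagnostic(rows))
```

The asymmetry the text describes shows up directly in the two medians: a proposer whose typical gain is one character but whose typical loss is many characters needs a very high acceptance precision before it can move corpus CER.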
8d.5 Clean proposer LoRA v1: training works, full-string generation fails the latency gate
A clean proposer SFT dataset was then built to attack the actual bottleneck instead of
tuning the gate. `build_clean_proposer_sft.py` uses the clean HF N'Ko corpus
(`corrected_pairs_290k.jsonl`), excludes all 1,381 acoustic-gate eval ids, learns
ASR-like confusions from `decoded_anchor_native.jsonl`, and emits three target classes:
exact-copy, bounded local repair, and noisy-preserve. The dataset contains 80,063
train / 3,461 valid / 3,466 test rows, with 36,000 exact-copy, 35,959 local
repair, and 15,031 noisy-preserve examples.
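As a quick consistency check, the split sizes and the class counts quoted above sum to the same total. The snippet below verifies that and sketches the eval-id exclusion step. The `drop_eval_rows` helper and the `id` field name are assumptions for illustration, not the contents of `build_clean_proposer_sft.py`.

```python
splits = {"train": 80_063, "valid": 3_461, "test": 3_466}
classes = {"exact_copy": 36_000, "local_repair": 35_959, "noisy_preserve": 15_031}

total = sum(splits.values())
# Both views of the dataset must describe the same number of rows.
assert total == sum(classes.values())

shares = {name: count / total for name, count in classes.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")


def drop_eval_rows(rows, eval_ids, id_key="id"):
    """Remove any row whose id appears in the held-out eval set (hypothetical schema)."""
    return [row for row in rows if row[id_key] not in eval_ids]
```

Both sums come to 86,990 rows, with exact-copy and local-repair each at roughly 41% and noisy-preserve at roughly 17%.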
Gemma-4-E2B's MLX tokenizer has no chat template, so OpenAI-style `messages` rows fail in
`mlx_lm.lora`. `convert_clean_proposer_sft_to_text.py` converts the data to the same
fallback prompt format used by `agp_text_proposal.py` (`System: ... User: ... Assistant:
...`). A new adapter was trained locally on mac1 at
`[home]/agp_pilot/adapter_clean_anchor_v1` because mac5 became
unreachable over SSH during dataset transfer. Training completed cleanly for 600
iterations (peak memory ~5.46 GB). Validation loss improved from 2.169 at startup
to a best checkpoint of 0.246 at iter 400, then ended at 0.289 at iter 600.
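The conversion step can be sketched as follows. This is not the contents of `convert_clean_proposer_sft_to_text.py`: the newline separator between turns and the `{"text": ...}` output row shape are assumptions, since the report only gives the `System: ... User: ... Assistant: ...` outline.

```python
import json

# Role labels follow the fallback prompt outline quoted in the report.
ROLE_LABELS = {"system": "System", "user": "User", "assistant": "Assistant"}


def messages_to_text(row, separator="\n"):
    """Flatten an OpenAI-style `messages` row into one plain-text training row."""
    parts = [f"{ROLE_LABELS[m['role']]}: {m['content']}" for m in row["messages"]]
    return {"text": separator.join(parts)}


def convert_jsonl(src_path, dst_path):
    """Convert a JSONL file of `messages` rows to plain-text rows, line by line."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                flat = messages_to_text(json.loads(line))
                # ensure_ascii=False keeps N'Ko characters readable in the output.
                dst.write(json.dumps(flat, ensure_ascii=False) + "\n")
```

Whatever separator is used, it has to match the prompt format at inference time, otherwise the adapter is trained on one layout and served on another.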
Smoke generation was clean on the first three clean-anchor rows: no prompt echo, no
bracket wrapping, no Cyrillic/Russian leak. But the full 1,381-row generation using the
current full-string correction interface was stopped after 39+ minutes on mac1
with no completed output file. That is enough to fail the live-correction latency gate
for this wiring. This does not prove the trained adapter is useless, but it does prove
that the current "LLM emits the entire corrected transcript" interface is the wrong
serving shape for live ASR correction unless a hard latency/stop-token cap is added and
shown to preserve accuracy.
Practical implication: the next proposer should emit bounded edit operations (copy,
substitute character/span, insert/delete short span) or run under a strict max-token and
first-line stop regime. A full-string generative corrector is too slow and too
open-ended for the live loop, even when its training loss improves.
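One possible shape for such a bounded interface is sketched below. The `Edit` type, the span bound, and the edit-count bound are all hypothetical choices made for illustration. The key property is that any malformed or oversized proposal falls back to copying the ASR hypothesis unchanged.

```python
from dataclasses import dataclass

MAX_SPAN = 3   # hypothetical cap on characters removed or inserted per edit
MAX_EDITS = 2  # hypothetical cap on edits per utterance


@dataclass(frozen=True)
class Edit:
    op: str          # "sub", "ins", or "del"
    pos: int         # character offset in the ASR hypothesis
    length: int = 0  # characters removed (used by "sub" and "del")
    text: str = ""   # characters inserted (used by "sub" and "ins")


def apply_edits(hyp, edits, max_span=MAX_SPAN, max_edits=MAX_EDITS):
    """Apply bounded edits to `hyp`; return `hyp` unchanged if any edit is invalid."""
    if len(edits) > max_edits:
        return hyp
    out = hyp
    limit = len(hyp)  # right edge that the next (more leftward) edit may touch
    # Apply right to left so earlier offsets stay valid after each splice.
    for e in sorted(edits, key=lambda e: e.pos, reverse=True):
        if e.op not in {"sub", "ins", "del"}:
            return hyp
        removed = e.length if e.op in {"sub", "del"} else 0
        inserted = e.text if e.op in {"sub", "ins"} else ""
        if removed > max_span or len(inserted) > max_span:
            return hyp
        if e.pos < 0 or e.pos + removed > limit:  # out of range or overlapping
            return hyp
        out = out[:e.pos] + inserted + out[e.pos + removed:]
        limit = e.pos
    return out


print(apply_edits("abcd", [Edit("sub", 3, 1, "f")]))  # substitute the last character
```

Under these bounds the output can differ from the hypothesis by at most `max_edits * max_span` inserted and removed characters each, and the proposer only has to generate a few short operations rather than the full transcript.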
9. One-Paragraph Conclusion
The governed self-correction loop for N'Ko is fully characterized on the clean anchor
substrate: decoding works, the live acoustic correction gate is capped, preservation/data
selection survives as the useful reference-free signal, and both the current Gemma
min-edit adapter and base Gemma-E2B remain too weak for live self-correction. The clean
proposer LoRA v1 proved that clean copy/min-edit training is feasible, but it also exposed
a systems problem: full-string LLM correction is too slow for live use under the current
interface. The headline "breakthrough" - closing the loop with a CER win - did not happen
and should not be claimed. The clean usable preservation magnitude is AUC 0.739, not
the contaminated-substrate 0.923; the clean regenerated loop is neutral (0 accepted
clean edits, 0.3514 CER unchanged) and the base-Gemma control is nominally worse
(+0.01pp, CI [-0.01,+0.06]pp, so not distinguishable from neutral). The strongest path
forward is therefore a bounded edit-operation
proposer trained on the clean anchor substrate, while keeping AGP/preservation as harm
prevention and data-selection infrastructure. ANE/TurboQuant serving is still valuable,
but only after the proposer interface is made fast and correction-positive.
Source Anchor
nko-brain-scanner/experiments/acoustic_gate/TECHNICAL-REPORT.md