
Living Speech: Script-Native Automatic Speech Recognition for N'Ko

\documentclass[11pt]{article}
\usepackage{acl}
\usepackage{times}
\usepackage{latexsym}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{hyperref}
\usepackage{multirow}
\usepackage{xcolor}
\usepackage{enumitem}
\usepackage{tipa}


\noindent\textbf{Provenance note.} This manuscript documents the development path that produced the first audio-to-N'Ko ASR pipeline: the bridge, the V1--V4 architecture progression, the FSM, and the downstream translation stack. The V1--V4 metrics reported here are historical development results from the earlier 37-hour bam-asr-early regime and small held-out evaluations. They are not the current repository benchmark. The strongest fully verified ASR checkpoint currently archived in this repository is the companion N'Ko trajectory model on the 290,596-pair corpus snapshot, which achieves 20.57\% test CER.

\begin{abstract}
Every published Bambara automatic speech recognition system produces Latin-script output. For the 40+ million N'Ko-literate speakers across West Africa, the entire ASR field has been writing in a foreign script. We build the first audio-to-N'Ko ASR system, converting Bambara speech directly to N'Ko script without Latin as an intermediary. Our approach exploits a structural advantage: N'Ko is a bijective phoneme-to-grapheme script. Every Manding phoneme maps to exactly one Unicode character, tone is marked explicitly, and there are no spelling irregularities. This bijectivity reduces the CTC decoder's output space to 66 classes (64 N'Ko codepoints plus space and blank), compared with the effectively larger combinatorial space of Latin Bambara (26 base letters plus digraphs, tone-unmarked). We present a four-version architecture progression. \textbf{V1}: A BiLSTM CTC decoder (5.4M parameters) on frozen Whisper large-v3 features achieves 56\% CER. A \textbf{28-configuration architecture search} over BiLSTM, Transformer, and Conformer variants identifies the best-performing design among those tested.
\textbf{V3}: A Transformer CTC decoder (46.9M parameters) with 6 layers, a hidden dimension of 768, and 4$\times$ downsampling on frozen Whisper features achieves 33\% validation CER and 70\% validation WER at epoch 200 (measured on a 10\% holdout from the bam-asr-early training split, not an independent test set). \textbf{V4}: Whisper LoRA fine-tuning (rank=32, layers 24--31, 5.9M trainable parameters) on an A100 GPU reduces validation loss from 0.884 to 0.290 over 30 epochs. Qualitatively, V4 produces coherent Bambara word sequences where V3 produces degenerate repetitions; quantitatively, V4 raises prediction confidence from 0.46 to 0.82 (a 79\% relative improvement) and reduces WER from 70\% to 62.3\% ($-$11.0\% relative). Per-sample evaluation shows that the LoRA model wins on 20/50 samples, the base model wins on 19/50, and 11/50 are ties, with the largest gains on worst-case inputs (sample au30: WER 15.8 $\to$ 1.0). The syllable validity rate exceeds 91\% for both models, confirming that the CTC decoder has learned N'Ko phonotactic structure even when word identity is incorrect. A deterministic cross-script b
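
To make the quantities in the abstract concrete, the following is a minimal Python sketch of a 66-class CTC output vocabulary, greedy CTC decoding, and the CER/WER and relative-change arithmetic. It is an illustration under stated assumptions, not the paper's implementation: the names `VOCAB`, `ctc_greedy_decode`, `cer`, `wer`, and `relative_change` are hypothetical, the blank symbol is placed at index 0 by choice, and the "64 N'Ko codepoints" are assumed to be the full Unicode N'Ko block U+07C0 to U+07FF.

```python
# Illustrative sketch only; all names here are hypothetical, not from the paper's code.

BLANK = "<blank>"  # CTC blank symbol; placing it at index 0 is an assumption
SPACE = " "

# Assumption: the 64 N'Ko codepoints are the full Unicode block U+07C0..U+07FF.
NKO_BLOCK = [chr(cp) for cp in range(0x07C0, 0x0800)]

VOCAB = [BLANK, SPACE] + NKO_BLOCK  # 1 + 1 + 64 = 66 output classes


def ctc_greedy_decode(frame_ids):
    """Standard greedy CTC decoding: collapse repeated ids, then drop blanks."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and VOCAB[i] != BLANK:
            out.append(VOCAB[i])
        prev = i
    return "".join(out)


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev_row = list(range(len(hyp) + 1))
    for r_idx, r in enumerate(ref, start=1):
        row = [r_idx]
        for h_idx, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            row.append(min(prev_row[h_idx] + 1,          # deletion
                           row[h_idx - 1] + 1,           # insertion
                           prev_row[h_idx - 1] + cost))  # substitution or match
        prev_row = row
    return prev_row[-1]


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate: word edits divided by reference word count."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def relative_change(old, new):
    """Signed relative change, e.g. 70 -> 62.3 gives about -0.11."""
    return (new - old) / old
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference. That would be consistent with reading the per-sample value of 15.8 as a ratio rather than a percentage, although the excerpt does not state the unit. Under the same definition, `relative_change(70.0, 62.3)` reproduces the quoted relative WER reduction of about 11.0\%.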

