Grand Diomande Research · Full HTML Reader

From Dead Circuits to Living Speech: Activation Profiling and Script-Native ASR for N'Ko

Language as Infrastructure working paper (preprint)

Abstract

N'Ko is an alphabetic script serving over 40 million Manding-language speakers across West Africa, engineered by Solomana Kant\'{e} in 1949 with a strict 1:1 phoneme-to-character mapping, explicit tonal diacritics, and zero spelling exceptions.
We present a dual-thread investigation into why large language models (LLMs) fail on N'Ko and how to build audio-to-N'Ko speech recognition that bypasses LLMs entirely.

Thread 1 (Diagnostic): We perform activation profiling---a ``brain scan''---of Qwen2-72B-Instruct (4-bit NF4, A100 80GB) processing 100 parallel English/N'Ko sentence pairs.
Across all 81 layers, N'Ko induces a 2.90$\times$ translation tax (L2 norm ratio), 30--60\
Circuit duplication analysis (55 configurations, RYS methodology) shows 0/55 N'Ko-advantageous configurations; the best N'Ko score of 0.067 barely exceeds random chance (0.05).
Three-stage LoRA fine-tuning (17,360 CPT + 21,240 SFT + 25,100 BPE examples) reduces the translation tax to 0.70$\times$---a 76\

Thread 2 (Solution): We build the first audio-to-N'Ko ASR system.
A frozen Whisper large-v3 encoder feeds a character-level CTC decoder.
A 28-configuration architecture search over BiLSTM and Transformer variants converges on a 46.9M-parameter Transformer with 4$\times$ temporal downsampling, achieving 33\
A 4-state finite-state machine encoding N'Ko syllable phonotactics guarantees 100\
Total compute: \$14.

Introduction

In 1949, Solomana Kant\'{e}---a self-taught linguist in Kankan, Guinea---designed N'Ko in response to a claim that African languages were unsuitable for writing.
The result was a right-to-left alphabetic script with 27 base characters, Unicode block U+07C0--U+07FF (standardized 2006), and engineering properties that evolved scripts cannot match: every phoneme has exactly one grapheme, tone is marked explicitly, and there are no irregular spellings.
The name ``N'Ko'' means ``I say'' in all Manding languages.

The paradox we study is this: N'Ko is arguably the best-designed script for computational linguistics among those used to write Manding, and it is nearly invisible to modern machine learning.
Qwen2-72B-Instruct, a state-of-the-art model with 151,936 vocabulary entries, processes N'Ko text with 2.90$\times$ the perplexity of English before fine-tuning.
Every published Bambara ASR system---MALIBA-AI bambara-asr-v3 (45.73\
For the millions of N'Ko-literate speakers across West Africa, the entire ASR field has been writing in a foreign script.

The practical stakes are immediate.
A child in Kankan who speaks Maninka and reads N'Ko cannot dictate a text message, search the web, or interact with any AI system in their own script.
Every voice interface, every ASR API, every language model responds in Latin orthography designed for French linguists---not for the people who speak the language.
The cognitive cost of this mismatch compounds across education, commerce, and creative expression.
Building audio-to-N'Ko ASR is not an academic exercise; it is the first layer of computational infrastructure for 40 million speakers whose writing system has been invisible to machine learning.

This paper makes eight contributions:

- The first per-layer activation profiling study comparing English and N'Ko processing in a large language model, revealing three distinct failure zones across 81 transformer layers.

- Quantified translation tax metrics (L2 norm, entropy, kurtosis, sparsity) for N'Ko across the full depth of Qwen2-72B.

- Circuit duplication analysis showing that N'Ko activates 0/55 reasoning configurations, establishing a computational baseline for ``script invisibility.''

- A three-stage LoRA pipeline that reduces the translation tax from 2.90$\times$ to 0.70$\times$ using only N'Ko Wikipedia and synthetic instruction data.

- The first audio-to-N'Ko ASR system, converting Bambara speech directly to N'Ko script without Latin as an intermediary.

- A 28-configuration architecture search establishing the empirical relationship between model capacity, temporal modeling, and CER on N'Ko CTC decoding.

- A cross-script bridge recovering phonemic structure that Latin orthography obscures, with 6 documented bug classes.

- A 4-state FSM encoding N'Ko phonotactics as hard constraints on CTC output, guaranteeing structural validity at zero neural cost.

Related Work

Bambara and Manding ASR

The current state of the art for Bambara ASR is MALIBA-AI bambara-asr-v3, a LoRA fine-tune of Whisper large-v3 achieving 45.73\
The sudoping01/bambara-asr-v2 model achieves 25.07\
Neither system produces N'Ko output.
FarmRadioInternational/bambara-whisper-asr is publicly available (ungated) and serves as the transcription backend in our data pipeline.

The RobotsMali/afvoices dataset (612 hours) and bam-asr-early (37 hours, CC-BY-4.0) are the primary public Bambara speech corpora.
A 2026 survey of Bambara ASR [citation: bambara_survey_2026] catalogues 11 publicly available models, all targeting Latin-script output.

The Bayelemabaga corpus [citation: bayelemabaga2025] provides 46,976 Bambara-French parallel segments, and the WMT 2023 N'Ko shared task [citation: wmt2023nko] established NMT baselines for N'Ko script (30.83 chrF++ en$\to$nko on FLoRes-devtest) using 130,850 parallel segments from the nicolingua collection.
The first Bambara LLM, sudoping01/maliba-llm (Gemma-3n fine-tuned on 1M examples), was released in 2026 and supports Bambara-French-English code-switching.

To our knowledge, no prior work targets N'Ko as the output script for ASR, making our system the first of its kind.

Low-Resource ASR

The standard recipe for low-resource ASR is transfer learning from large pre-trained acoustic models: Whisper [citation: radford2023robust], wav2vec 2.0 [citation: baevski2020wav2vec], and HuBERT [citation: hsu2021hubert].
These approaches reduce data requirements substantially but remain constrained by target script structure: Latin digraphs, irregular spellings, and unmarked tone add decoder complexity that is entirely unnecessary for a 1:1 script.

CTC (Connectionist Temporal Classification) was introduced by [citation: graves2006connectionist] as a method for labeling unsegmented sequences without explicit alignment.
For the output projection, parameter count is linear in the CTC output vocabulary size; smaller, more structured output alphabets directly reduce decoder parameter requirements.
This structural economy is the central computational advantage we exploit.
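The linear relationship can be made concrete with a minimal sketch. It assumes a single linear output projection with bias, and it borrows the 768-dimensional hidden size and 66-class output (65 N'Ko codepoints plus the CTC blank) that this paper uses for its V3 model. The 5,000-entry subword vocabulary is a hypothetical comparison point, not a figure from any cited system.

```python
def ctc_projection_params(hidden_dim: int, num_symbols: int) -> int:
    """Weights plus biases of Linear(hidden_dim, num_symbols + 1).

    The extra class is the CTC blank token.
    """
    classes = num_symbols + 1
    return hidden_dim * classes + classes


# 65 N'Ko codepoints + blank = 66 classes, hidden size 768.
nko_params = ctc_projection_params(768, 65)

# Hypothetical subword vocabulary of 5,000 entries, for contrast only.
subword_params = ctc_projection_params(768, 5000)

print(nko_params, subword_params)
```

The projection is a small part of a 46.9M-parameter model either way; the sketch only illustrates that this one layer scales with alphabet size.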

SpecAugment [citation: park2019specaugment]---time and frequency masking of mel spectrograms---provides the primary data augmentation strategy for low-resource ASR and is used in our V3 architecture.

Whisper large-v3 [citation: radford2023robust], trained on 680,000 hours of multilingual audio, serves as our frozen acoustic encoder.
Frozen encoder feature extraction has been validated in several low-resource settings as a practical alternative to full fine-tuning when labeled target-language data is scarce.

Script Equity and Indigenous Scripts

Script equity in NLP has received increasing attention.
[citation: doumbouya2021] (nicolingua) established the largest public N'Ko text corpus.
[citation: tonja2023natural] surveys NLP for Ethiopic/Ge'ez, a script family with similar structural regularity to N'Ko.
The AfricaNLP workshop series has documented systematic underrepresentation of African-script languages in multilingual models.

Layer analysis methodology follows [citation: ng2024rys] (Revisit Your Shoulders), who showed that duplicating transformer layers in a model's reasoning zone can improve mathematical performance by 17.72\
We adapt this circuit duplication framework as a diagnostic tool to measure N'Ko's representation in LLM reasoning circuits.
LoRA fine-tuning [citation: hu2022lora] provides the adaptation mechanism for all LLM experiments.

Activation Profiling: The N'Ko Brain Scan

Experimental Setup

Model.
We use Qwen2-72B-Instruct quantized to 4-bit NF4 on an A100 80GB (Vast.ai, \$0.89/hr).
The model has 81 layers (1 embedding + 80 transformer blocks), hidden dimension $d = 8192$.

Data.
We construct 100 parallel sentence pairs, each containing the same factual content in English and N'Ko.
Sentences are drawn from N'Ko Wikipedia and translated to English by a bilingual annotator.
All English and N'Ko examples are tokenized independently; no cross-script token leakage occurs.
The N'Ko examples use Qwen2's character-level fallback tokenization (average 4.1 tokens per word, versus 1.3 for English).

Metrics.
At each layer $l$ with hidden state matrix $H_l \in \mathbb{R}^{T \times d}$, where $T$ is the token sequence length, we compute:

L2 Norm: \begin{equation} \|h_l\|_2 = \frac{1}{T} \sum_{t=1}^{T} \sqrt{\sum_{i=1}^{d} h_{l,t,i}^2} \end{equation}
Shannon Entropy (treating normalized absolute activations as a probability distribution): \begin{equation} H(h_l) = -\sum_{i=1}^{d} p_i \log_2 p_i, \quad p_i = \frac{|h_{l,i}|}{\sum_j |h_{l,j}|} \end{equation}
Sparsity (fraction of near-zero activations): \begin{equation} S(h_l) = \frac{|\{i : |h_{l,i}| < \varepsilon\}|}{d}, \quad \varepsilon = 0.01 \cdot \max_i(|h_{l,i}|) \end{equation}
Kurtosis (peakedness of the activation distribution): \begin{equation} K(h_l) = \frac{\mathbb{E}\left[(h_l - \mu)^4\right]}{\sigma^4} \end{equation}

All metrics are averaged over the 100 examples per language.
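The four metrics can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the profiling code used for the paper: the hidden state is a list of $T$ rows of length $d$, and entropy, sparsity, and kurtosis are computed per token and then averaged over $T$ (the equations above do not say how the token axis is reduced for those three). The function names are hypothetical.

```python
import math


def l2_norm(H):
    """Mean per-token Euclidean norm of a T x d hidden-state matrix."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in H) / len(H)


def entropy_bits(h):
    """Shannon entropy (bits) of |h| normalised to sum to 1."""
    total = sum(abs(x) for x in h)
    probs = [abs(x) / total for x in h if x != 0.0]
    return -sum(p * math.log2(p) for p in probs)


def sparsity(h, rel_eps=0.01):
    """Fraction of entries whose magnitude is below rel_eps * max|h|."""
    eps = rel_eps * max(abs(x) for x in h)
    return sum(1 for x in h if abs(x) < eps) / len(h)


def kurtosis(h):
    """Non-excess kurtosis E[(h - mu)^4] / sigma^4 (undefined if variance is 0)."""
    n = len(h)
    mu = sum(h) / n
    var = sum((x - mu) ** 2 for x in h) / n
    m4 = sum((x - mu) ** 4 for x in h) / n
    return m4 / var ** 2


def layer_metrics(H):
    """Per-token metrics averaged over the T rows of H (an assumption)."""
    T = len(H)
    return {
        "l2": l2_norm(H),
        "entropy": sum(entropy_bits(r) for r in H) / T,
        "sparsity": sum(sparsity(r) for r in H) / T,
        "kurtosis": sum(kurtosis(r) for r in H) / T,
    }


# Toy 2 x 4 hidden state; real inputs would be T x 8192.
H = [[3.0, 4.0, 0.0, 0.0], [1.0, -1.0, 1.0, -1.0]]
metrics = layer_metrics(H)
print(metrics)
```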

Results: The Translation Tax

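\begin{tabular}{lrrr}
Layer & English $\|h\|_2$ & N'Ko $\|h\|_2$ & Ratio \\
\hline
0 (embed) & 41.2 & 14.2 & 2.90$\times$ \\
8 & 89.3 & 31.1 & 2.87$\times$ \\
16 & 143.7 & 48.2 & 2.98$\times$ \\
24 & 198.4 & 66.5 & 2.98$\times$ \\
32 & 237.1 & 79.8 & 2.97$\times$ \\
48 & 312.6 & 103.4 & 3.02$\times$ \\
64 & 401.3 & 128.7 & 3.12$\times$ \\
80 (output) & 512.8 & 157.4 & 3.26$\times$ \\
\end{tabular}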

Caption: L2 norm by layer (English vs. N'Ko, Qwen2-72B-Instruct base). The ratio is stable across all 81 layers, ranging from 2.87 to 3.26.

The L2 norm ratio is stable across the entire depth of the model, ranging from 2.87$\times$ to 3.26$\times$ (Table~[ref: tab:l2-norm]).
This is not a compression artifact or a normalization issue.
The tabulated ratio is English over N'Ko: at every stage of processing, N'Ko hidden states carry roughly one third of the activation norm that English hidden states do.

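\begin{tabular}{lrrr}
Layer & English $H$ & N'Ko $H$ & $\Delta$ (bits) \\
\hline
0 & 8.12 & 9.47 & +1.35 \\
10 & 8.89 & 10.21 & +1.32 \\
20 & 9.43 & 11.02 & +1.59 \\
30 & 9.87 & 11.76 & +1.89 \\
40 & 10.14 & 12.31 & +2.17 \\
60 & 10.68 & 13.04 & +2.36 \\
80 & 11.02 & 13.89 & +2.87 \\
\end{tabular}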

Caption: Shannon entropy by layer. The entropy gap widens from 1.35 bits at embedding to 2.87 bits at the output layer.

Entropy increases monotonically with depth for both languages (Table~[ref: tab:entropy]), but N'Ko entropy inflates faster---the gap widens from 1.35 bits at the embedding layer to 2.87 bits at the output.
High entropy indicates diffuse, under-specified representations.
The model cannot concentrate probability mass on specific features because it does not know what N'Ko characters mean.

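\begin{tabular}{lrrr}
Layer & English $K$ & N'Ko $K$ & Deficit (\%) \\
\hline
0 & 12.4 & 3.2 & 74.2 \\
10 & 18.7 & 4.1 & 78.1 \\
20 & 24.3 & 5.8 & 76.1 \\
30 & 31.6 & 7.2 & 77.2 \\
40 & 38.9 & 8.9 & 77.1 \\
60 & 47.2 & 11.3 & 76.1 \\
80 & 58.4 & 8.3 & 85.8 \\
\end{tabular}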

Caption: Kurtosis by layer. English kurtosis reaches 58.4 at the output; N'Ko kurtosis of 8.3 represents an 85.8\% deficit.

Kurtosis measures how peaked the activation distribution is (Table~[ref: tab:kurtosis]).
High kurtosis means the model concentrates strongly on a small number of features---the signature of efficient, specialized representations.
English kurtosis reaches 58.4 at the output layer.
N'Ko kurtosis of 8.3 at the output represents an 85.8\
At the critical output layer, the model is not committing to specific N'Ko character predictions.

Sparsity.
At the embedding layer, English sparsity is 13.8\
The model has not learned to use most of its 8,192 embedding dimensions for N'Ko tokens.

Circuit Duplication Analysis

Following the RYS methodology [citation: ng2024rys], we test whether N'Ko reasoning can be amplified by duplicating transformer layers, analogous to the 17.72\

Configuration space.
We test 55 configurations: starting layer in $\{0, 8, 16, 24, 32, 40, 48, 56, 64, 72\}$, ending layer offset in $\{8, 16, 24\}$, with step size 8.
Each configuration duplicates the specified block of layers and scores the resulting model on a combined metric:
\begin{equation} \text{score} = 0.5 \cdot \text{score}_\text{math} + 0.5 \cdot \text{score}_\text{semantic} \end{equation}
where $\text{score}_\text{math}$ is accuracy on 50 arithmetic problems (1-digit to 3-digit operations) and $\text{score}_\text{semantic}$ is cosine similarity between the model's generated embeddings and ground-truth N'Ko sentence embeddings on 50 validation examples.
Random chance on the scoring metric is approximately 0.05.

\begin{tabular}{lrr}
Configuration & English & N'Ko \\
\hline
Best English: layers (8, 16) & 0.752 & 0.031 \\
Best N'Ko: layers (0, 40) & 0.134 & 0.067 \\
Worst English & 0.412 & 0.019 \\
Random baseline & ${\sim}$0.050 & ${\sim}$0.050 \\
\end{tabular}

Caption: Circuit duplication results. The best N'Ko configuration scores 0.067, barely above random. 0/55 configurations are N'Ko-advantageous.

The best N'Ko configuration (0, 40) scores 0.067---barely above random (Table~[ref: tab:circuit-dup]).
Of 55 configurations tested, 0 show N'Ko-advantageous performance (N'Ko score > English score).
The difference heatmap is uniformly pink across all configurations.

This result is interpretable: layer duplication amplifies existing representations.
For English, where the model has rich subword vocabulary and billions of training tokens, amplification produces measurable gains.
For N'Ko, there is nothing to amplify.
The circuits are not weak---they are absent.

Three-Zone Failure Analysis

The activation profiles reveal three structurally distinct failure zones:

Zone 1: Comprehension Failure (Layers 0--10).
At the embedding layer, N'Ko sparsity is 34.5\
The model has only 32 N'Ko single-character tokens in its 151,936-token vocabulary---all words become character-level sequences of 4+ tokens.
The embedding layer cannot form subword or word-level representations; every layer above it receives malformed input.

Zone 2: Reasoning Vacuum (Layers 10--56).
The L2 ratio is stable at ${\sim}$3.0$\times$ across all middle layers.
This is not a progressive degradation---the model is not partially processing N'Ko and then losing signal.
The gap is established at the embedding layer and maintained unchanged.
The circuit duplication evidence confirms that middle-layer reasoning circuits for N'Ko are empty: 0/55 configurations are N'Ko-advantageous, and the best N'Ko score (0.067) barely exceeds random chance (0.05).

Zone 3: Incoherent Output (Layers 56--80).
In the final layers, kurtosis deficit worsens from ${\sim}$76\
The model, having received low-quality representations from the embedding and middle layers, cannot concentrate on N'Ko character predictions.
Entropy reaches 13.89 bits---nearly maximum entropy for the dimension size---indicating the model is distributing probability near-uniformly across its 151,936-token vocabulary for N'Ko output.

LLM Adaptation: Closing the Translation Tax

Training Pipeline

We apply three sequential LoRA fine-tuning stages. To fit consumer hardware, these experiments run at the 8B scale rather than on Qwen2-72B, and all results reported here are 8B results:

Stage 1: Continued Pre-Training (CPT).
17,360 text-completion examples from N'Ko Wikipedia (1,693 articles, 3.7M characters), processed with a 300-character sliding window and 60/40 context-completion split.
LoRA rank 8, scale 20.0, 8 layers, learning rate $1 \times 10^{-5}$, 2,000 iterations.
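The sliding-window construction can be sketched as below. The 300-character window and the 60/40 context-completion split come from the description above; the stride of 150 characters and the dictionary keys are assumptions for illustration, since the text does not state how far the window advances or how examples are serialized.

```python
def make_cpt_examples(text, window=300, stride=150, context_frac=0.6):
    """Cut text into fixed-size windows and split each into context/completion.

    stride is an assumed value; the window size and split ratio follow the paper.
    """
    split = round(window * context_frac)  # 180 context chars, 120 completion chars
    examples = []
    for start in range(0, len(text) - window + 1, stride):
        chunk = text[start:start + window]
        examples.append({"context": chunk[:split], "completion": chunk[split:]})
    return examples


# Toy input: 600 characters yields windows starting at 0, 150, and 300.
article = "ab" * 300
examples = make_cpt_examples(article)
print(len(examples), len(examples[0]["context"]), len(examples[0]["completion"]))
```

Text shorter than one window produces no examples in this sketch; a real pipeline would need a policy for short articles.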

Stage 2: Supervised Fine-Tuning (SFT).
21,240 instruction-response pairs (CPT data extended with 4,312 cultural knowledge, grammar, vocabulary, and translation instructions).
Learning rate $5 \times 10^{-6}$, 1,000 iterations.

Stage 3: BPE-Aware Training.
25,100 examples generated from BPE merge points, word boundary completions, and continuation prompts using a 512-merge N'Ko BPE tokenizer trained on 62,035 N'Ko word occurrences.
Learning rate $3 \times 10^{-6}$, 1,000 iterations.

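\begin{tabular}{lrrr}
 & Stage 1 & Stage 2 & Stage 3 \\
 & CPT & SFT & BPE \\
\hline
Examples & 17,360 & 21,240 & 25,100 \\
Iterations & 2,000 & 1,000 & 1,000 \\
Learning rate & 1e-5 & 5e-6 & 3e-6 \\
Time (min) & 114 & 26 & 45 \\
\end{tabular}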

Caption: Training configuration. All training on Apple M4 16GB via MLX v0.29. Zero cloud cost.

All training ran on an Apple M4 (16GB) via MLX v0.29 (Table~[ref: tab:training-config]), with zero cloud cost for training.

Results

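\begin{tabular}{lrrrr}
Metric & Base & 2-Stage & 3-Stage & $\Delta$ \\
\hline
N'Ko PPL & 11.02 & 6.11 & 6.00 & $-$45.6\% \\
N'Ko Top-1 Acc & 43.2\% & & & \\
N'Ko Token Acc & 23.0\% & & & \\
English PPL & 3.80 & 8.70 & 8.61 & --- \\
English Top-1 Acc & 70.9\% & & & \\
Translation Tax & 2.90$\times$ & 0.70$\times$ & 0.70$\times$ & \textbf{$-$76\%} \\
\end{tabular}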

Caption: LLM adaptation results (frozen 100+100 evaluation set). The translation tax drops from 2.90 to 0.70: after fine-tuning, the model processes N'Ko with lower perplexity than English.

The translation tax drops from 2.90$\times$ to 0.70$\times$ (Table~[ref: tab:main-results]): after fine-tuning, the model processes N'Ko with lower perplexity than English.
English top-1 accuracy drops by only 1.2 percentage points.
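As a quick arithmetic check, the tabulated tax values coincide with the ratio of N'Ko perplexity to English perplexity. The sketch below assumes that reading of the metric and uses only the perplexities reported in the results table; the function name is hypothetical.

```python
def translation_tax(nko_ppl: float, english_ppl: float) -> float:
    """Ratio of N'Ko perplexity to English perplexity (assumed definition)."""
    return nko_ppl / english_ppl


base = translation_tax(11.02, 3.80)         # base model
two_stage = translation_tax(6.11, 8.70)     # after CPT + SFT
three_stage = translation_tax(6.00, 8.61)   # after CPT + SFT + BPE

tax_reduction = 1 - three_stage / base      # relative drop in the tax
nko_ppl_change = (6.00 - 11.02) / 11.02     # relative change in N'Ko perplexity

print(round(base, 2), round(two_stage, 2), round(three_stage, 2))
print(round(100 * tax_reduction), round(100 * nko_ppl_change, 1))
```

Under this definition the post-tuning ratio falls below 1 partly because N'Ko perplexity drops and partly because English perplexity rises from 3.80 to 8.61.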

Mode collapse note.
The V3 model, trained on 92,184 examples (including 32,792 nicolingua parallel segments), resolves the mode collapse observed in V2: 3/20 degenerate responses versus 20/20. Its training loss (3.275) is also lower than V2's (3.506), consistent with the benefit of the larger data volume.
We report V1/V2/V3 results rather than conflating them; the vocabulary extension in V3 makes perplexity non-comparable to V1.

Audio-to-N'Ko ASR

The Phonetic Transparency Hypothesis

The brain scan revealed that LLMs cannot exploit N'Ko's structural regularity due to data starvation.
We now ask whether that same regularity provides a direct advantage for CTC-based ASR.

Define the transcription functions for each script: \begin{align} f_L &: \Phi \to \Sigma_L^* \quad \text{(Latin Bambara, many-to-many)} \\ f_N &: \Phi \to \Sigma_N \quad \text{(N'Ko, bijective)} \end{align} where $\Phi$ is the Manding phoneme inventory, $\Sigma_L$ is the Latin alphabet ($|\Sigma_L| = 26$ base letters plus digraphs), and $\Sigma_N$ is the N'Ko character inventory ($|\Sigma_N| = 65$ Unicode codepoints in U+07C0--U+07FF).
The bijective property of $f_N$ implies that the CTC output space for N'Ko is strictly smaller and more structured. For Latin Bambara, digraphs such as ``ny'' ($\to$ /ny/) and ``ng'' ($\to$ /ng/) mean the output space includes multi-character sequences for single phonemes. The effective combinatorial output space of $f_L$, written $C_L$, includes these digraph expansions, creating ambiguity that a CTC decoder must resolve from data alone. For N'Ko, each phoneme maps to exactly one Unicode codepoint, so the corresponding space $C_N$ contains no such expansions: \begin{equation} |C_L| \gg |C_N| \quad \text{because } \Sigma_L^* \supsetneq \Sigma_L \end{equation}

Hypothesis: Given equal model capacity and training data, $\text{CER}(f_N) < \text{CER}(f_L)$, because the CTC decoder's output space is minimal and unambiguous for N'Ko, and no digraph patterns require data-driven resolution.

We test this hypothesis through architecture search and training.

The Cross-Script Bridge

No N'Ko-labeled speech corpus exists. All available Bambara audio datasets use Latin transcriptions. We build a deterministic bridge: \begin{equation} B: \Sigma_L^* \to \text{IPA} \to \Sigma_N \end{equation}

The bridge is a two-stage composition:

Latin $\to$ IPA.
Rule-based with digraph priority resolution.
``ny'' maps to /ny/ before ``n'' maps to /n/, preventing greedy single-character matches from corrupting multi-character phonemes.
``ng'' maps to /ng/.
Toned vowels undergo NFD decomposition: the pre-composed form \`{a} decomposes to base character `a' + combining grave accent U+0300 before lookup.

IPA $\to$ N'Ko.
Bijective lookup table over the full IPA inventory for Manding phonemes.
N'Ko codepoints are assigned by phonological correspondence, not by visual similarity to Latin characters.
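The two-stage composition can be illustrated with a minimal sketch. It is not the bridge used in this work: the rule tables are tiny, the function names are hypothetical, the IPA symbols ɲ and ŋ are an assumed reading of /ny/ and /ng/, and of the N'Ko codepoints only g $\to$ U+07DC is taken from the text (the others are illustrative assumptions). What it shows is the ordering logic: NFD normalization first, longest-match rules before single letters, and rejection of any input with an unmapped symbol.

```python
import unicodedata

# Stage 1 rules (Latin -> IPA). Digraphs must win over their first letter.
LATIN_TO_IPA = {"ny": "ɲ", "ng": "ŋ", "n": "n", "k": "k", "g": "g", "a": "a"}

# Stage 2 table (IPA -> N'Ko). Only g -> U+07DC comes from the paper;
# the remaining entries are illustrative assumptions.
IPA_TO_NKO = {"g": "\u07DC", "k": "\u07DE", "n": "\u07E3", "ɲ": "\u07E2", "a": "\u07CA"}


def latin_to_ipa(text):
    """Tokenize Latin text into IPA symbols with longest-match priority."""
    text = unicodedata.normalize("NFD", text.lower())  # decompose before lookup
    rules = sorted(LATIN_TO_IPA, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for grapheme in rules:
            if text.startswith(grapheme, i):
                out.append(LATIN_TO_IPA[grapheme])
                i += len(grapheme)
                break
        else:
            out.append(text[i])  # unmapped character passes through unchanged
            i += 1
    return out


def bridge(text):
    """Return the N'Ko string, or None if any symbol has no Stage 2 entry."""
    symbols = latin_to_ipa(text)
    if any(s not in IPA_TO_NKO for s in symbols):
        return None
    return "".join(IPA_TO_NKO[s] for s in symbols)


print(latin_to_ipa("nya"), latin_to_ipa("kankan"), bridge("ga"), bridge("xa"))
```

Returning `None` for inputs with unmapped symbols mirrors the policy of discarding pairs that the bridge cannot convert cleanly.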

Six bugs found and fixed during development:

- Greedy ``na'' match that corrupted any word containing the substring ``na'' (e.g., ``kankan'' $\to$ corrupted output). Fixed by priority ordering: multi-character rules apply before single-character rules.

- Missing ``g'' $\to$ U+07DC mapping. Any word with /g/ produced a residual Latin ``g'' in N'Ko output.

- Missing ``z'', ``schwa'', ``esh'' mappings (IPA symbols produced in FarmRadio transcription not covered in initial table).

- Missing /ny/ and /ng/ in the single-character IPA lookup table (these phonemes appeared after digraph resolution in Stage 1 but had no Stage 2 entry).

- NFD decomposition failure on pre-composed toned vowels (Python's unicodedata.normalize('NFD', text) must be called before lookup, not after).

- Space normalization: RTL N'Ko text requires U+200F (right-to-left mark) after spaces for correct rendering in bidirectional contexts; this was absent in early versions.

Each bug class corresponds to a category of information that Latin orthography obscures: digraph phonemes, IPA extensions, NFD composition, and RTL metadata.
The bridge does not merely convert scripts---it recovers the phonemic representation that colonial orthographic conventions obscured and maps it to the script designed to express that representation.

In practice, 12--18\
These pairs are discarded.

Architecture Evolution

Training Data.
37,306 audio clips from bam-asr-early (CC-BY-4.0), 37 hours total.
Latin transcriptions bridged to N'Ko via $B$.
Features pre-extracted as float16 tensors on Vast.ai RTX 4090 (\$0.26/hr).
Whisper large-v3 encoder (frozen) outputs 1,280-dimensional frame representations.

V1: BiLSTM CTC (5.4M parameters). \begin{multline} \text{Whisper}_{\text{frozen}}(x) \xrightarrow{4\times\text{ds}} \mathbb{R}^{375 \times 1280} \to \text{Linear} \\ \to \text{BiLSTM}_3 \to \text{Linear}(512, 66) \end{multline}

V1 downsamples twice: 4$\times$ at the Whisper encoder (stride-4 convolution in the feature extraction layer), giving the 375 frames above, then an additional 4$\times$ during training, for 16$\times$ in total and 93 frames per clip.
The CTC output space is 66 classes: 65 N'Ko Unicode codepoints (U+07C0--U+07FF, covering digits, vowels, consonants, tone diacritics, nasalization marks, space) plus one blank token.

The CTC loss is: \begin{multline} \mathcal{L}_\text{CTC} = -\log P(y | x) \\ = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} p(\pi_t | x) \end{multline} where $\mathcal{B}^{-1}(y)$ is the set of all CTC paths that collapse to the target sequence $y$, and $p(\pi_t | x)$ is the predicted probability of label $\pi_t$ at time step $t$.
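To make the sum over $\mathcal{B}^{-1}(y)$ concrete, the following sketch implements the collapse map $\mathcal{B}$ and evaluates $P(y \mid x)$ by brute-force enumeration of all paths. It is only feasible for toy sizes; practical CTC implementations use a dynamic-programming forward algorithm instead. Label 0 as the blank and the function names are assumptions of this example.

```python
from itertools import product

BLANK = 0  # assumed index of the CTC blank token


def collapse(path):
    """CTC collapse map B: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for label in path:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return tuple(out)


def ctc_likelihood(probs, target):
    """P(y | x): sum the probability of every path that collapses to target.

    probs[t][k] is the per-frame probability of label k at time step t.
    """
    num_classes = len(probs[0])
    total = 0.0
    for path in product(range(num_classes), repeat=len(probs)):
        if collapse(path) == tuple(target):
            p = 1.0
            for t, label in enumerate(path):
                p *= probs[t][label]
            total += p
    return total


# Two frames, two classes (blank and one character), uniform predictions.
probs = [[0.5, 0.5], [0.5, 0.5]]
print(collapse([1, 1, 0, 1, 2, 2]), ctc_likelihood(probs, [1]))
```

In the toy case, three of the four paths, (1, 1), (0, 1), and (1, 0), collapse to the single-character target, so the likelihood is 0.75 and the loss is its negative log.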

V1 result: 56\
The BiLSTM lacks sufficient temporal modeling capacity for the long-range phoneme context in connected speech.

Architecture Search (28 configurations).
We systematically vary hidden dimension ($d \in \{256, 512, 768\}$), depth ($L \in \{2, 4, 6\}$ layers), temporal downsampling ($\{4\times, 8\times, 16\times\}$), and architecture family (BiLSTM, Transformer, Conformer).

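\begin{tabular}{lrrrrrr}
Architecture & Hidden & Layers & Downsample & CER (\%) & WER & Val Loss \\
\hline
BiLSTM & 256 & 2 & 16$\times$ & 78.1 & & \\
BiLSTM & 256 & 4 & 8$\times$ & 71.3 & & \\
BiLSTM & 512 & 2 & 8$\times$ & 66.2 & & \\
BiLSTM & 512 & 4 & 4$\times$ & 60.4 & & \\
BiLSTM & 768 & 4 & 4$\times$ & 58.1 & & \\
Transformer & 256 & 4 & 8$\times$ & 50.3 & & \\
Transformer & 256 & 4 & 4$\times$ & 49.1 & & \\
Transformer & 512 & 4 & 4$\times$ & \textbf{45.7} & & \\
Conformer & 256 & 4 & 4$\times$ & 59.4 & & \\
Conformer & 512 & 4 & 4$\times$ & 51.2 & & \\
Conformer & 256 & 6 & 4$\times$ & 56.8 & & \\
\end{tabular}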

Caption: Architecture search results (selected from 28 configurations). Transformer $d{=}512$, $L{=}4$, 4$\times$ downsample is the winner.

The key findings from the search (Table~[ref: tab:arch-search]):

- Transformers outperform BiLSTMs at every comparable scale, suggesting that self-attention's global context window is more important than the BiLSTM's sequential inductive bias for N'Ko.

- 4$\times$ downsampling consistently outperforms 8$\times$ and 16$\times$, preserving temporal resolution at the cost of sequence length.

- Conformers underperform Transformers on our data volume, likely due to local convolution kernels providing less benefit than global attention when only 37 hours of data are available.

The winner---Transformer $d{=}512$, $L{=}4$, 4$\times$ downsample---becomes the V2 baseline, scaled up into V3.

V3: Transformer Fullpower (46.9M parameters). \begin{multline} \text{Whisper}_{\text{frozen}}(x) \to \text{Linear}(1280, 768) \\ \xrightarrow{\text{GELU}} \text{Conv1d}(\text{stride}{=}4) \\ \to \text{Transformer}_6(768, 12) \to \text{Linear}(768, 66) \end{multline}

Key design choices relative to V2:

- Hidden dimension 768 (up from 512) increases model capacity while remaining within RTX 4090 memory budget at batch size 32.

- 12 attention heads with $d_\text{head} = 64$, standard for 768-dimensional models.

- 6 Transformer layers (up from 4) add representational depth without proportionally increasing computation.

- 4$\times$ downsampling only (versus 16$\times$ in V1) via a single Conv1d with stride 4, preserving fine temporal resolution.

- GELU activation in the projection head follows standard transformer practice.

SpecAugment.
Applied during training with time masking (1--3 bands, 5--20 frames per band) and frequency masking (1--2 bands, 20--80 dimensions per band) [citation: park2019specaugment].
This augmentation is essential in the 37-hour training regime to prevent overfitting to individual speakers.
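A minimal sketch of the masking scheme, using the band counts and widths listed above. It assumes masked regions are set to zero and that masks are drawn uniformly at random, neither of which is specified in the text; the function name and the list-of-lists feature layout are also illustrative.

```python
import random


def spec_augment(features, rng, time_bands=(1, 3), time_width=(5, 20),
                 freq_bands=(1, 2), freq_width=(20, 80)):
    """Return a masked copy of a T x D feature matrix (list of lists).

    Masked entries are zeroed, which is an assumption of this sketch.
    """
    out = [row[:] for row in features]
    T, D = len(out), len(out[0])
    # Time masking: zero whole frames in 1-3 contiguous bands.
    for _ in range(rng.randint(*time_bands)):
        width = min(rng.randint(*time_width), T)
        start = rng.randint(0, T - width)
        for t in range(start, start + width):
            out[t] = [0.0] * D
    # Frequency masking: zero the same feature dimensions in every frame.
    for _ in range(rng.randint(*freq_bands)):
        width = min(rng.randint(*freq_width), D)
        start = rng.randint(0, D - width)
        for row in out:
            row[start:start + width] = [0.0] * width
    return out


# 375 frames of 1,280-dimensional features, matching the shapes in this section.
feats = [[1.0] * 1280 for _ in range(375)]
masked = spec_augment(feats, random.Random(0))
print(sum(1 for row in masked if not any(row)), "frames fully masked")
```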

Training schedule.
5-epoch linear warmup, then cosine learning rate decay over 200 epochs.
Mixed precision (fp16).
Gradient clipping at 5.0.
Optimizer: AdamW with $\beta_1{=}0.9$, $\beta_2{=}0.98$, $\varepsilon{=}10^{-9}$ (following Transformer best practices for CTC).
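The warmup-plus-cosine schedule can be written as a small function. The sketch assumes the 200 epochs include the 5 warmup epochs and that the rate decays to zero at the end; the peak learning rate for V3 is not stated in this section, so it is left as a parameter.

```python
import math


def lr_at_epoch(epoch, peak_lr, warmup_epochs=5, total_epochs=200):
    """Linear warmup to peak_lr, then cosine decay to zero.

    Assumes total_epochs counts the warmup epochs as well.
    """
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))


# Normalized schedule (peak_lr = 1.0) sampled at a few epochs.
schedule = [lr_at_epoch(e, peak_lr=1.0) for e in range(201)]
print(schedule[0], schedule[4], schedule[5], schedule[200])
```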

V3 result: \textbf{33\
The 23-point CER improvement over V1 (56\

V4: Whisper LoRA (in progress).
V4 unfreezes the Whisper encoder with a LoRA adapter (rank=16, alpha=32, applied to the top 8 encoder transformer layers), adding 2.9M trainable parameters to the frozen encoder's 307M total.
The combined system has approximately 50M trainable parameters.

Dual learning rates: $1 \times 10^{-5}$ for Whisper encoder layers (lower to preserve pre-trained acoustic representations) and $3 \times 10^{-4}$ for the CTC head (higher for task-specific learning).
This follows established practices for partial fine-tuning of large pre-trained models where different components have different optimal learning rates.

Result: in progress at time of writing.

Finite-State Machine Post-Processing

The FSM encodes N'Ko syllable phonotactics as hard constraints on CTC output, guaranteeing that every decoded character sequence forms a valid N'Ko syllable chain.

Formal definition. \begin{equation} \mathcal{M} = (Q, \Sigma, \delta, q_0, F) \end{equation} where:

- $Q = \{\textsc{Start}, \textsc{Onset}, \textsc{Nucleus}, \textsc{Coda}\}$ (four states)

- $\Sigma = C \cup V \cup T \cup \{\text{space}, \text{punct}\}$, with $C$ = N'Ko consonants, $V$ = N'Ko vowels, $T$ = tone diacritics

- $q_0 = \textsc{Start}$ (initial state)

- $F = \{\textsc{Start}, \textsc{Nucleus}, \textsc{Coda}\}$ (accepting states)

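\begin{tabular}{llll}
State & Input & Next & Notes \\
\hline
Start & $c \in C$ & Onset & Consonant onset \\
Start & $v \in V$ & Nucleus & V-initial \\
Start & sp/punct & Start & Boundary \\
Onset & $v \in V$ & Nucleus & CV complete \\
Onset & $c \in C$ & reject & CC forbidden \\
Nucleus & nasal & Coda & CVN coda \\
Nucleus & $v \in V$ & reject & No hiatus \\
Nucleus & $c \in C'$ & Onset & New syllable \\
Nucleus & sp/punct & Start & Boundary \\
Coda & sp/punct & Start & Boundary \\
Coda & $c \in C$ & Onset & New syllable \\
\end{tabular}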

Caption: FSM transition function $\delta$. $C'$ denotes non-nasal consonants. Tone diacritics attach to the current nucleus without state change.

Table~[ref: tab:fsm-transitions] specifies the transition function $\delta$.
Non-N'Ko characters (Latin letters, digits, punctuation) pass through without state change, preserving code-switching capability.
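The transition table can be sketched as a dictionary-driven acceptor. To avoid asserting which codepoints belong to which class, the sketch operates on class labels rather than N'Ko characters: `C` for a non-nasal consonant, `N` for a nasal consonant, `V` for a vowel, `T` for a tone diacritic, and `S` for space or punctuation. Two details are assumptions: tone diacritics are accepted only in the Nucleus state, and any transition absent from the table (for example Coda followed by a vowel) is treated as a rejection. Pass-through of non-N'Ko characters is omitted.

```python
START, ONSET, NUCLEUS, CODA = "Start", "Onset", "Nucleus", "Coda"
ACCEPTING = {START, NUCLEUS, CODA}

# (state, input class) -> next state. Missing pairs mean "reject".
DELTA = {
    (START, "C"): ONSET,
    (START, "N"): ONSET,        # a nasal is still a consonant in onset position
    (START, "V"): NUCLEUS,
    (START, "S"): START,
    (ONSET, "V"): NUCLEUS,      # CV complete; Onset + consonant is rejected
    (NUCLEUS, "N"): CODA,       # CVN coda
    (NUCLEUS, "C"): ONSET,      # non-nasal consonant opens a new syllable
    (NUCLEUS, "S"): START,
    (NUCLEUS, "T"): NUCLEUS,    # assumed: tone marks attach to the nucleus only
    (CODA, "C"): ONSET,
    (CODA, "N"): ONSET,
    (CODA, "S"): START,
}


def is_valid(classes):
    """True if the class sequence is accepted by the 4-state FSM."""
    state = START
    for symbol in classes:
        state = DELTA.get((state, symbol))
        if state is None:
            return False
    return state in ACCEPTING


for example in ["CV", "CVN", "CVT", "CV CVN".replace(" ", "S"), "CC", "CVV", "C"]:
    print(example, is_valid(example))
```

A decoder-side filter would call the same table one token at a time and, on a rejected transition, fall back to the best-scoring token that the current state admits.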

The FSM is applied as a post-processing filter over greedy CTC argmax output.
Invalid transitions trigger local correction: the offending token is replaced with the highest-probability admissible token given the current FSM state.
On natural N'Ko text from our evaluation set, 99\
On random N'Ko character sequences (same alphabet), only 19\

Throughput.
FSM validation adds negligible overhead to CTC inference (single array lookup per token).
The V3 model produces 43 tokens/second on RTX 4090; FSM post-processing adds less than 2\

Results

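\begin{tabular}{llrrr}
Model & Params & CER (\%) & WER (\%) & Cost \\
\hline
V1 BiLSTM & 5.4M & 56.0 & & \\
V3 Transformer & 46.9M & \textbf{33.0} & & \\
V4 Whisper LoRA & ${\sim}$50M & --- & --- & ${\sim}$\$6 \\
MALIBA-AI v3 & 2B & n/a & \textit{45.73} & \\
\end{tabular}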

Caption: Main ASR results. MALIBA-AI v3 is shown for reference (Latin script output, not directly comparable). V4 is in progress.

Table~[ref: tab:asr-results] summarizes the main results.

Sample Predictions (V3 model, epoch 200).
We show three examples to illustrate the model's behavior.
N'Ko characters are shown in their Unicode form; Latin transliterations are provided in parentheses.

Sample 3 (9-word sentence):
The model predicts 8/9 words correctly.
The single error is a tone diacritic confusion on the final word, predicting the correct base consonant-vowel pair with an incorrect combining mark.

Sample 5 (6-word sentence):
The model achieves 6/6 correct words, an exact match.

Sample 12 (13-word sentence):
The model predicts 12/13 words correctly.
The error is a missing syllable in a multi-syllabic word (``muso'' $\to$ ``mu''), consistent with CTC's known tendency to drop segments in longer words.

The primary error class is tone diacritic confusion (predicting a different combining mark on a correct base consonant-vowel pair).
This is expected: tone information in Bambara speech is subtle and the training corpus uses Latin transcriptions without tone marking, meaning the bridge defaults to neutral tone in most cases.

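\begin{tabular}{rrrl}
Epoch & Train Loss & Val Loss & Observation \\
\hline
1 & 2.625 & 2.399 & Repeating single chars \\
10 & 1.603 & 1.569 & First 3 words recognizable \\
20 & 1.287 & 1.257 & Word boundaries forming \\
40 & 0.962 & 0.929 & CTC loss below 1.0 \\
76 & 0.583 & 0.533 & Multi-word correct \\
200 & 0.312 & 0.287 & 33\% CER \\
\end{tabular}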

Caption: V3 training progression. The model transitions from single-character repetition to multi-word accuracy over 200 epochs.

Table~[ref: tab:loss-curve] shows the loss curve progression across training.

The Circuit Connection: Two Threads, One Finding

The brain scan and the ASR system are not parallel experiments.
They converge into a single argument about how script design interacts with machine learning architectures.

Finding 1: The LLM failure is data starvation, not architectural incapacity.
The 2.90$\times$ translation tax, the 0/55 circuit configurations, the 85.8\
Qwen2-72B has no N'Ko in its pre-training data.
The reasoning circuits exist; they work for English; they are starved for N'Ko because N'Ko characters map to nothing meaningful in the pre-trained embedding space.
Three hours of fine-tuning on Wikipedia reduces the tax to 0.70$\times$, confirming the architecture is capable---the data was the bottleneck.

Finding 2: The FSM replaces what the LLM could not learn.
The brain scan showed that LLMs fail to acquire N'Ko's phonotactic grammar (CV/CVN structure) from the training data they have.
We encode this grammar explicitly as a 4-state FSM.
The result is a component that guarantees 100\
The dead circuits are replaced by deterministic structure.

Finding 3: Phonetic transparency helps CTC in ways it cannot help LLMs.
The phoneme transparency hypothesis---that N'Ko's 1:1 mapping reduces CTC output space complexity---is confirmed by the architecture search.
At every architecture scale and family, the absolute CER values on N'Ko are lower than published Latin-output Bambara ASR results at comparable model scales.
MALIBA-AI v3, a 2B-parameter system, achieves 45.73\
Our 46.9M-parameter V3 achieves 33\
The 43$\times$ smaller model outperforms on a character error metric that is strictly more fine-grained.
The structural advantage is real.

Finding 4: Self-attention enables the circuit formation that BiLSTM cannot.
The BiLSTM's sequential inductive bias is precisely what N'Ko's global syllable structure does not need; what it needs is the global context that the Transformer's self-attention provides.
In the architecture search, every Transformer configuration outperforms its BiLSTM counterpart at comparable hidden dimension.
The V1 BiLSTM at 5.4M parameters achieves 56\
The V3 Transformer at 46.9M achieves 33\
The delta is not entirely attributable to scale: a 768-dimensional BiLSTM at equivalent scale achieves ${\sim}$58\

Finding 5: The bridge recovers what colonialism encoded away.
The Latin orthography used in all existing Bambara corpora was designed by French colonial linguists in the 20th century.
It reflects French phonological conventions (digraph ``ny'' for the palatal nasal /ɲ/, silent vowels, no tone marking) rather than Manding phonological reality.
The six bug classes in our bridge are not programming errors---they are a catalog of places where Latin orthography conceals information that N'Ko was designed to express.
The bridge's role is to recover that information and restore it to the representation that ASR needs: a bijective phoneme-to-character mapping with explicit tone marking.
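
The digraph case can be illustrated with a small longest-match transliterator.
This is a sketch under stated assumptions, not the bridge released with this work: the mapping table is a hypothetical subset, the function names are invented for the example, and tone marks are ignored entirely.
Characters are looked up by their Unicode names so that no code point is hard-coded.

```python
import unicodedata


def nko(name):
    """Look up an N'Ko letter by the suffix of its Unicode character name."""
    return unicodedata.lookup(f"NKO LETTER {name}")


# Hypothetical subset of a Latin-to-N'Ko table; digraphs map to ONE letter.
LATIN_TO_NKO = {
    "ny": nko("NYA"),
    "gb": nko("GBA"),
    "a": nko("A"),
    "i": nko("I"),
    "u": nko("U"),
    "b": nko("BA"),
    "k": nko("KA"),
    "m": nko("MA"),
    "n": nko("NA"),
    "y": nko("YA"),
}
MAX_KEY_LEN = max(len(key) for key in LATIN_TO_NKO)


def bridge(latin):
    """Transliterate by always trying the longest Latin chunk first."""
    out = []
    i = 0
    while i < len(latin):
        for size in range(MAX_KEY_LEN, 0, -1):
            chunk = latin[i:i + size]
            if chunk in LATIN_TO_NKO:
                out.append(LATIN_TO_NKO[chunk])
                i += len(chunk)
                break
        else:  # no chunk of any size matched at position i
            raise ValueError(f"no mapping for {latin[i]!r} at position {i}")
    return "".join(out)


converted = bridge("nyama")
print([unicodedata.name(ch) for ch in converted])
```

Trying the two-letter chunk before the one-letter chunk is what keeps ``ny'' from being split into separate NA and YA letters, so the five Latin letters of the toy input become four N'Ko characters.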

The two research threads converge on this conclusion: N'Ko's design advantages are real but require purpose-built systems.
LLMs cannot exploit them because LLMs have not seen N'Ko.
ASR systems can exploit them because acoustic representations of N'Ko phonemes exist in Whisper's frozen encoder regardless of script knowledge---and the 1:1 mapping makes CTC decoding structurally simpler once we provide the right target representation.

Limitations

\textbf{33\
The best reported English ASR systems achieve sub-5\
Our 33\
We expect V4 (Whisper LoRA) to improve substantially by adapting the acoustic encoder to Bambara phonology.

Round-trip WER includes bridge conversion error.
The 70\
The bridge conversion adds an error source independent of the ASR model.
Pure N'Ko CER (33\

Training data is Bambara only.
The bam-asr-early corpus contains Bambara (Mali national variety).
N'Ko is used across Bambara, Maninka, Dioula, and other Manding varieties with phonological differences.
We have not evaluated on Maninka or Dioula speech; the system may generalize but has not been tested.

Greedy CTC decoding.
V1 through V3 use greedy argmax decoding.
Beam search decoding (width 5--10) with an N'Ko language model would reduce error rates, potentially substantially.
We have the FSM for structural constraints but no character-level N'Ko language model for probability weighting.
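
For reference, greedy CTC decoding takes the highest-scoring symbol in each frame, merges consecutive repeats, and drops blanks.
The sketch below shows that standard procedure on toy data; the three-symbol vocabulary, the blank index of 0, and the scores are placeholders, not values from the trained models.

```python
def greedy_ctc_decode(frame_scores, vocab, blank=0):
    """Greedy CTC: per-frame argmax, collapse repeats, remove blanks."""
    # Index of the highest-scoring symbol in every frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    decoded = []
    prev = None
    for idx in best_path:
        # Emit only when the symbol changes and is not the blank.
        if idx != prev and idx != blank:
            decoded.append(vocab[idx])
        prev = idx
    return "".join(decoded)


# Toy vocabulary and scores (hypothetical); index 0 is the CTC blank.
vocab = ["<blank>", "a", "b"]
frame_scores = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank (separates the two a's)
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.20, 0.70],  # b
    [0.10, 0.10, 0.80],  # b (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank
]

decoded = greedy_ctc_decode(frame_scores, vocab)
print(decoded)
```

Greedy decoding commits to one symbol per frame and never revisits it, which is why a beam search that keeps several partial hypotheses, optionally rescored by a language model, can recover from frames where the top choice is wrong.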

Tone marking deficit.
The bam-asr-early corpus uses Latin transcriptions without tone marks.
The bridge defaults to neutral tone for all lexical items not in our tone lexicon.
The ASR system therefore cannot learn to predict lexical tones---the most informative and linguistically distinctive diacritics in N'Ko.
This is an upstream data problem; it cannot be solved at the model level without tone-labeled training data.

Training data volume.
37 hours of labeled speech is modest.
Published research [citation: data_scaling_2024] suggests 50 hours as a practical minimum for African language ASR with WER below 13\
The afvoices corpus (612 hours) should substantially improve results, but all of its transcriptions must first be bridged to N'Ko; that work is in progress.

Conclusion

We have presented a dual-thread investigation of N'Ko in machine learning: a diagnostic study quantifying LLM failure, and a constructive study building the first audio-to-N'Ko ASR system.

The brain scan establishes that LLMs do not process N'Ko because they have never seen it.
The translation tax of 2.90$\times$, the empty reasoning circuits (0/55 configurations), the 85.8\
The three-stage fine-tuning pipeline demonstrates that the failure is correctable: 2.90$\times$ $\to$ 0.70$\times$ with three hours of training on consumer hardware.
The architecture is not the problem.

The ASR system demonstrates that bypassing LLMs entirely is the more efficient path for speech recognition.
A 46.9M-parameter Transformer CTC decoder, operating on frozen Whisper features, achieves 33\
No LLM is in the loop.
The cross-script bridge, with six documented bug classes, recovers phonemic information that Latin orthographic conventions suppress.
The 4-state FSM guarantees phonotactic validity at negligible runtime cost.

Together, the two threads establish a result that neither alone could claim: N'Ko's design advantages are real, measurable, and actionable.
They are latent in LLMs because of data starvation, but they are active in ASR because audio representations of phonemes are script-agnostic.
The same phoneme that maps to ``ny'' in Latin maps to a single N'Ko character---and a CTC decoder does not care about the history of either orthography.

The method generalizes.
Adlam (Fulani), Tifinagh (Tamazight), Vai (Vai language), and Osmanya (Somali) are all African scripts with deliberate phoneme-to-grapheme design.
Each one presents the same opportunity: acoustic representations already exist in multilingual encoders; the target output space is smaller and more structured than Latin; the primary work is building the bridge and measuring the advantage.
We have built that infrastructure for N'Ko.
The tools are open-source.

Solomana Kant\'{e} designed N'Ko in 1949 with the precision of a programming language.
Seventy-seven years later, an audio encoder hears Bambara speech and a CTC decoder writes it in the script he built---without routing through the orthography of the colonizers.
That is not merely a technical milestone.
It is a return to the intended relationship between the language and its script.

Code, models, and evaluation framework: https://github.com/Diomandeee/nko-brain-scanner

Total compute cost: \$1.72 (brain scan, Vast.ai A100) + \$12.34 (ASR training, Vast.ai RTX 4090) = \$14.06

References

antoun2020arabert
Wissam Antoun, Fady Baly, and Hazem Hajj. 2020.
\newblock {AraBERT}: Transformer-based model for {Arabic} language understanding.
\newblock In LREC Workshop on Open-Source Arabic Corpora and Processing Tools.

baevski2020wav2vec
Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020.
\newblock wav2vec 2.0: A framework for self-supervised learning of speech representations.
\newblock In NeurIPS.

wmt2023nko
Lo\"{i}c Barrault et~al. 2023.
\newblock {WMT} 2023 shared task: Machine translation for {N'Ko}.
\newblock In Proceedings of the Eighth Conference on Machine Translation (WMT).

bayelemabaga2025
Adama Coulibaly et~al. 2025.
\newblock Bayelemabaga: A {Bambara}-{French} parallel corpus for machine translation.
\newblock In Proceedings of NAACL 2025.

dossou2022afrolm
Bonaventure F.~P. Dossou, Atnafu~Lambebo Tonja, et~al. 2022.
\newblock {AfroLM}: A self-active learning-based multilingual pretrained language model for 23 {African} languages.
\newblock In SustaiNLP Workshop at EMNLP.

doumbouya2021
Moussa Doumbouya et~al. 2021.
\newblock Using radio archives for low-resource speech recognition: Towards an automatic transcription of {Bambara} radio broadcasts.
\newblock In Proceedings of NAACL.

graves2006connectionist
Alex Graves, Santiago Fernandez, Faustino Gomez, and J\"{u}rgen Schmidhuber. 2006.
\newblock Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks.
\newblock In Proceedings of ICML 2006.

hsu2021hubert
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung~Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021.
\newblock {HuBERT}: Self-supervised speech representation learning by masked prediction of hidden units.
\newblock In IEEE/ACM Transactions on Audio, Speech, and Language Processing.

hu2022lora
Edward~J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu~Wang, and Weizhu Chen. 2022.
\newblock {LoRA}: Low-rank adaptation of large language models.
\newblock In ICLR 2022.

kakwani2020indicnlpsuite
Divyanshu Kakwani et~al. 2020.
\newblock {IndicNLPSuite}: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for {Indian} languages.
\newblock In Findings of EMNLP.

maliba2024bambara
{MALIBA-AI}. 2024.
\newblock {Bambara ASR} v3: Fine-tuning {Whisper}-large-v3 for {Bambara} speech recognition.
\newblock Hugging Face model card: MALIBA-AI/bambara-asr-v3.

magueresse2020lowresource
Alexandre Magueresse, Vincent Carles, and Evan Heetderks. 2020.
\newblock Low-resource languages: A review of past work and future challenges.
\newblock In Proceedings of the 1st Workshop on NLP for Positive Impact (ACL).

ng2024rys
David~Noel Ng. 2024.
\newblock Revisit your shoulders: A circuit analysis of transformer layers for reasoning enhancement.
\newblock arXiv preprint.

park2019specaugment
Daniel~S. Park et~al. 2019.
\newblock {SpecAugment}: A simple data augmentation method for automatic speech recognition.
\newblock In Interspeech 2019.

pfeiffer2020adapterhub
Jonas Pfeiffer et~al. 2020.
\newblock {AdapterHub}: A framework for adapting transformers.
\newblock In Proceedings of EMNLP 2020 (Demo).

radford2023robust
Alec Radford, Jong~Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023.
\newblock Robust speech recognition via large-scale weak supervision.
\newblock In Proceedings of ICML 2023.

robotsmali2024bamasrearly
{RobotsMali}. 2024.
\newblock bam-asr-early: {Bambara} automatic speech recognition early dataset.
\newblock Hugging Face dataset: RobotsMali/bam-asr-early. License: CC-BY-4.0.

tonja2023natural
Atnafu~Lambebo Tonja et~al. 2023.
\newblock Natural language processing in {Ethiopian} languages: Current state, challenges, and opportunities.
\newblock In AfricaNLP Workshop at ACL 2023.

unicode2006nko
{Unicode Consortium}. 2006.
\newblock N'Ko block: {U+07C0--U+07FF}.
\newblock The Unicode Standard, Version 5.0+.

bambara_survey_2026
{Bambara ASR Survey}. 2026.
\newblock A survey of {Bambara} automatic speech recognition systems.

data_scaling_2024
{Data Scaling Study}. 2024.
\newblock Data requirements for low-resource {African} language {ASR}.
