Theorems, Proofs, and Derivations for N'Ko Script-Native ASR
This document collects the formal mathematical results underlying the N'Ko Brain Scanner and ASR system. We present five main theorems with proofs, three derivations of key quantities, and two corollaries that connect the LLM diagnostic thread to the ASR construction thread. The results establish: (1) a phonetic transparency advantage for CTC decoding on bijective scripts, (2) bounds on the translation tax in under-represented scripts, (3) completeness and soundness of the FSM phonotactic validator, (4) a circuit d
Full Public Reader
Abstract
This document collects the formal mathematical results underlying the N'Ko Brain Scanner and ASR system. We present five main theorems with proofs, three derivations of key quantities, and two corollaries that connect the LLM diagnostic thread to the ASR construction thread. The results establish: (1) a phonetic transparency advantage for CTC decoding on bijective scripts, (2) bounds on the translation tax in under-represented scripts, (3) completeness and soundness of the FSM phonotactic validator, (4) a circuit death theorem for reasoning layers processing unseen scripts, and (5) rank-efficiency bounds for script adaptation via LoRA.
Preliminaries and Notation
Let $\Phi = \{\phi_1, \ldots, \phi_P\}$ denote the phoneme inventory of Manding languages, where $P = 35$ (23 consonants, 7 vowels, 5 tone levels). Let $\Sigma_N = \{c_1, \ldots, c_{65}\}$ denote the N'Ko Unicode character set (U+07C0--U+07FF), and let $\Sigma_L = \{a, b, \ldots, z\}$ denote the Latin alphabet used in standard Bambara orthography.
The bijective property of $f_N$ means $|f_N(\phi)| = 1$ for all $\phi \in \Phi$: every phoneme maps to exactly one character. For $f_L$, digraphs create multi-character representations: $f_L(\text{/\textipa{J}/}) = \texttt{ny}$ (two characters for one phoneme), $f_L(\text{/\textipa{N}/}) = \texttt{ng}$ (two characters for one phoneme).
For a transformer model $\mathcal{M}$ with $L$ layers, we denote the hidden state at layer $l$ as $h_l \in \mathbb{R}^{T \times d}$, where $T$ is the sequence length and $d$ is the hidden dimension.
Theorem 1: Phonetic Transparency Advantage
For N'Ko, each target token $y_u$ corresponds to exactly one phoneme: $y_u = f_N(\phi_u)$. The alignment search over $\mathcal{B}^{-1}(y)$ operates on $|V_{f_N}| = P + 1 = 36$ output classes.
For Latin, the digraph phonemes create ambiguity. Consider the phoneme /J/ (palatal nasal). In Latin, $f_L(\text{/\textipa{J}/}) = \texttt{ny}$, requiring the CTC decoder to emit two tokens $(\texttt{n}, \texttt{y})$ in sequence. But n is also a valid standalone consonant mapping ($f_L(\text{/n/}) = \texttt{n}$), creating a segmentation ambiguity: is the output sequence $[\ldots, \texttt{n}, \texttt{y}, \ldots]$ a single /J/ or a sequence /n/ followed by /j/? This ambiguity must be resolved from data alone.
In N'Ko, this ambiguity does not exist: /J/ maps to a single character U+07E2, and /n/ maps to a different character U+07E3. No segmentation decision is required.
The variance of this gradient is proportional to the entropy of the alignment posterior $P(\pi | x, y)$. For N'Ko, where each target token is unambiguous, this entropy is lower. For Latin, the digraph ambiguity adds alignment uncertainty, increasing gradient variance and slowing convergence.
Empirically, our architecture search confirms this: the Transformer $d=512$, $L=4$ achieves 45.7\
corollary: Generalization to Other Bijective Scripts.
The phonetic transparency advantage holds for any script $\Sigma$ with a bijective mapping $f: \Phi \to \Sigma$. This includes Adlam (Fulani), Tifinagh (Tamazight), Vai (Vai language), and Osmanya (Somali). The CTC output space complexity for these scripts is bounded by $|\Phi_{\text{language}}| + 1$.
Theorem 2: Translation Tax Bounds
proof.
We measure $\mathcal{T}(l)$ at 81 layers using 100 parallel sentence pairs. The empirical values are:
| Layer | $\|h^{(\text{EN})}\|_2$ | $\|h^{(\text{NK})}\|_2$ | $\mathcal{T}(l)$ |
|---|---|---|---|
| 0 (embed) | 0.61 | 0.25 | 2.50 |
| 2 | 1,803 | 497 | 3.63 |
| 9 | 1,815 | 504 | 3.60 |
| 20 | 1,832 | 518 | 3.54 |
| 40 | 1,965 | 628 | 3.13 |
| 51 | 2,093 | 684 | 3.06 |
| 65 | 2,179 | 901 | 2.42 |
| 77 | 2,438 | 1,221 | 2.00 |
The tax narrows from 3.63$\times$ at layer 2 to 2.00$\times$ at layer 77, but this is not because N'Ko representations improve. Rather, the later layers produce generic language-agnostic patterns (high entropy, low kurtosis) that reduce the ratio without improving N'Ko-specific representation quality.
The persistence of $\mathcal{T} > 2.0$ across all 81 layers confirms that the model has no internal mechanism for compensating for the embedding-layer deficit. Each layer amplifies the input it receives; if the input is weak (low L2 norm for N'Ko), the output remains proportionally weak.
The adaptation concentrates in the output projection layers (layer 35), where the L2 norm delta is $+572.7$ (a dramatic increase in confidence for N'Ko token predictions). Middle layers (28--34) show reduced L2 norms (deltas of $-38$ to $-104$), indicating more efficient encoding --- the model learns to represent N'Ko with less activation energy because the representations are better aligned with the task.
The English accuracy drops by only 1.2 percentage points (70.9\
Theorem 3: FSM Completeness and Soundness
definition: N'Ko Syllable FSM.
Define the finite-state machine $\mathcal{M} = (Q, \Sigma, \delta, q_0, F)$ where:
- $Q = \{\textsc{Start}, \textsc{Onset}, \textsc{Nucleus}, \textsc{Coda}\}$
- $\Sigma = C \cup V \cup T \cup N \cup \{\text{space}\}$, with $C$ = consonants (26), $V$ = vowels (7), $T$ = tone marks (5), $N$ = nasal marks (2)
- $q_0 = \textsc{Start}$
- $F = \{\textsc{Start}, \textsc{Nucleus}, \textsc{Coda}\}$
The transition function $\delta$ is defined by Table~[ref: tab:fsm].
Caption: FSM transition function
| Current | Input | Next | Condition |
|---|---|---|---|
| Start | $c \in C$ | Onset | Begin syllable |
| Start | $v \in V$ | Nucleus | V-initial syllable |
| Start | space | Start | Word boundary |
| Onset | $v \in V$ | Nucleus | CV nucleus |
| Onset | $c \in C$ | $\perp$ (reject) | CC forbidden |
| Nucleus | $c \in C$ | Onset | New syllable |
| Nucleus | $n \in N$ | Coda | Nasal coda |
| Nucleus | $t \in T$ | Nucleus | Tone attach |
| Nucleus | space | Start | Word boundary |
| Coda | $c \in C$ | Onset | New syllable |
| Coda | space | Start | Word boundary |
theorem: FSM Completeness.
For any valid N'Ko word $w$ composed of syllables from the set $\{V, CV, VN, CVN\}$, the FSM $\mathcal{M}$ accepts $w$: $\delta^*(q_0, w) \in F$.
proof.
We verify each syllable pattern:
Case V: Input is $v \in V$. Trace: $\textsc{Start} \xrightarrow{v} \textsc{Nucleus} \in F$. Accepted.
Case CV: Input is $cv$ with $c \in C$, $v \in V$. Trace: $\textsc{Start} \xrightarrow{c} \textsc{Onset} \xrightarrow{v} \textsc{Nucleus} \in F$. Accepted.
Case VN: Input is $vn$ with $v \in V$, $n \in N$. Trace: $\textsc{Start} \xrightarrow{v} \textsc{Nucleus} \xrightarrow{n} \textsc{Coda} \in F$. Accepted.
Case CVN: Input is $cvn$. Trace: $\textsc{Start} \xrightarrow{c} \textsc{Onset} \xrightarrow{v} \textsc{Nucleus} \xrightarrow{n} \textsc{Coda} \in F$. Accepted.
For multi-syllable words, the transition from $\textsc{Nucleus}$ or $\textsc{Coda}$ to $\textsc{Onset}$ (on consonant input) or $\textsc{Nucleus}$ (on vowel input from $\textsc{Start}$) correctly chains syllables. Word boundaries reset to $\textsc{Start}$ via space transitions.
theorem: FSM Soundness.
The FSM $\mathcal{M}$ rejects all sequences containing phonotactically invalid patterns:
- Consonant clusters: $cc$ with $c_1, c_2 \in C$ (no intervening vowel)
- Isolated onsets: sequence ending in Onset state
- Double codas: $nn$ with $n_1, n_2 \in N$
proof.
(1) From Onset, input $c \in C$ triggers $\delta(\textsc{Onset}, c) = \perp$ (rejection). (2) Onset $\notin F$, so any sequence ending after a consonant without a following vowel is rejected. (3) From Coda, there is no transition defined for $n \in N$, so the input is rejected by default.
remark
Empirically, 99\
remark
Theorem 4: Circuit Death in Under-Represented Scripts
proof.
We test all 55 configurations. The best English score is $S_{\text{EN}}(8, 16) = 0.752$, confirming that the comprehension-to-reasoning transition zone (layers 8--16) contains active reasoning circuits for English. The best N'Ko score is $S_{\text{NK}}(0, 40) = 0.067$.
Interpretation. Layer duplication is an amplifier. It runs the same weight matrices twice, giving the model a ``second pass'' at building representations. For English, where the embedding layer produces high-quality initial representations (L2 norm 1,803 at layer 2, kurtosis 7,692), a second pass refines them further. For N'Ko, where the embedding layer produces near-random representations (L2 norm 497, kurtosis 7,699 but declining to 128 at output), a second pass amplifies noise. The circuits are not weak --- they are absent. There is nothing to amplify.
corollary: Script-Specific Circuit Formation.
Reasoning circuits in transformer LLMs are language-specific, not language-agnostic. They form during pre-training as a function of token frequency. A script with token frequency $f \ll 10^{-5}$ of the training distribution will have no measurable reasoning circuits, regardless of the script's intrinsic computational properties.
Theorem 5: LoRA Rank-Efficiency for Script Adaptation
theorem: Low-Rank Script Comprehension.
Script comprehension is a low-rank adaptation: the mapping from a pre-trained multilingual embedding space to N'Ko character prediction requires rank $r \leq 16$ modification per layer.
This 0.059\
The rank-8 constraint means the adaptation modifies at most an 8-dimensional subspace of the 4096-dimensional layer activations. Script comprehension, therefore, lies on a low-dimensional manifold within the model's representational capacity. This is consistent with the linguistic observation that N'Ko's phonological structure is regular and learnable from few examples --- the LoRA adapter needs to learn only the mapping from Unicode codepoints to phonological features, not the phonological system itself.
Derivation: CTC Loss Gradient for N'Ko
For N'Ko, the extended label sequence $y' = (\epsilon, y_1, \epsilon, y_2, \epsilon, \ldots, y_U, \epsilon)$ has length $2U + 1$. With 93 input frames (after 16$\times$ downsampling) and typical target lengths of 10--30 characters, the CTC alignment is well-conditioned: $T \gg U$, ensuring multiple valid alignment paths.
where $\beta_t(s)$ is the backward variable. This gradient is computed in $O(TU)$ time, which for our setting ($T = 93$, $U \leq 30$) is $O(2{,}790)$ --- negligible.
Derivation: Kurtosis Deficit as Circuit Specialization Measure
High kurtosis indicates a leptokurtic distribution: most neurons are near-zero, but a small number fire strongly. This is the signature of specialized circuits --- neurons that have learned to respond to specific input patterns.
For English at the output layer (layer 80): $K_{\text{EN}} = 901$. This means a small fraction of the 8,192 neurons concentrate the model's prediction confidence on specific tokens.
For N'Ko at the output layer: $K_{\text{NK}} = 128$. The distribution is nearly mesokurtic (Gaussian-like), meaning activations are spread diffusely across neurons with no specialization.
reaches 85.8\
The monotonic increase of $\Delta K(l)$ from 0.1\
Derivation: Cross-Script Bridge Composition
The bridge $B: \Sigma_L^* \to \Sigma_N$ is a two-stage composition:
This ensures the base vowel and the tone diacritic are processed independently by stages 1 and 2.
The six documented bug classes correspond to violations of this composition:
- Greedy single-char match before digraph check (Bug 1: ``na'' $\to$ U+07E0)
- Missing entries in $g_{\text{single}}$ (Bugs 2, 3: ``g'', ``z'', ``@'', ``S'')
- Missing entries in $h$ after digraph resolution (Bug 4: /J/, /N/)
- Missing NFD call before lookup (Bug 5)
- Missing RTL mark after space (Bug 6: rendering, not phonological)
Summary of Results
Caption: Summary of theorems and their empirical validation
| Theorem | Prediction | Empirical Result | ||||
|---|---|---|---|---|---|---|
| 1 (Transparency) | $\text{CER}(f_N) \leq \text{CER}(f_L)$ | 33 2 (Tax Bound) | $\mathcal{T}(l) \in [2.0, 3.6]$ | Confirmed across all 81 layers | ||
| 2b (Tax Inversion) | LoRA inverts $\mathcal{T}$ | 2.90$\times$ $\to$ 0.70$\times$ | ||||
| 3 (FSM Complete) | Accepts all valid syllables | 99 3b (FSM Sound) | Rejects invalid patterns | 81 4 (Circuit Death) | $S_{\text{NK}} \approx S_{\text{rand}}$ | Best $S_{\text{NK}} = 0.067$ vs $S_{\text{rand}} = 0.05$ |
| 5 (Low-Rank) | $r = 8$ suffices | 45.6 |
References
- [graves2006] A. Graves et al., ``Connectionist Temporal Classification,'' ICML, 2006.
- [hu2022] E.J. Hu et al., ``LoRA: Low-Rank Adaptation of Large Language Models,'' ICLR, 2022.
- [ng2024] D.N. Ng, ``Revisit Your Shoulders: Circuit Analysis of Transformer Layers,'' arXiv, 2024.
- [radford2023] A. Radford et al., ``Robust Speech Recognition via Large-Scale Weak Supervision,'' ICML, 2023.
- [park2019] D.S. Park et al., ``SpecAugment,'' Interspeech, 2019.
- [maliba2024] MALIBA-AI, ``Bambara ASR v3,'' HuggingFace, 2024.
Promotion Decision
Compile/render the source, verify references and figures, then add to the curated atlas.
Source Anchor
nko-brain-scanner/paper/archive/nko_theorems.tex
Detected Structure
Latex · Abstract · Method · Evaluation · References · Math · Architecture