The Script That Machines Can't Read: Adapting Large Language Models for N'Ko

Extracted abstract or opening context

We present a systematic study of how large language models process N'Ko (U+07C0 to U+07FF), an alphabetic script used by over 40 million Manding-language speakers in West Africa. Through activation profiling ("brain scanning") of Qwen3-8B before and after fine-tuning, we demonstrate that: (1) fine-tuning concentrates N'Ko adaptation in the top 8 transformer layers, reducing activation magnitudes in reasoning layers while amplifying output confidence; (2) a three-stage training pipeline (continued pre-training on 3.7M characters of N'Ko Wikipedia, supervised fine-tuning on 4,312 instruction examples, and BPE-aware subword training on 25,100 examples) reduces the N'Ko-to-English perplexity gap ("translation tax") from 2.90× to 0.70×, a 76% reduction; (3) N'Ko-specific token prediction accuracy improves from 23.0% to 32.8%, a 43% relative gain, with only 1.2 percentage points of English accuracy loss; (4) a custom 512-merge BPE tokenizer trained on N'Ko Wikipedia achieves 2.75× compression, discovering linguistically valid subword units that align with Manding grammatical particles; (5) a morpheme-constrained BPE variant that respects Manding morphological boundaries improves boundary preservation by 5.0 percentage points (0.941 vs 0.891, a 5.6% relative gain) at the cost of 42.6% more tokens, revealing that morphological awareness requires larger training corpora to compete with unconstrained BPE on compression; (6) a 4-state finite-state machine encoding N'Ko CV/CVN syllable structure, used as a logits processor during generation, guarantees 100% syllable validity with 39% throughput overhead (the V3 model achieves 99.8% validity even without the FSM); (7) vocabulary extension via quantized embedding surgery (151,936 → 152,192 tokens, adding 250 N'Ko BPE tokens) combined with a 2,000-iteration LoRA pass on 33,912 training examples reduces validation loss to 3.506, an 18.3% improvement over the base adapter, and produces, to our knowledge, the first N'Ko text generation model. All training was performed on consumer hardware (Apple M4, 16GB) at zero cloud cost. We release the model, training pipeline, BPE tokenizer, and evaluation framework as open-source artifacts.
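
The idea in item (6), a finite-state machine that constrains generation to CV/CVN syllables, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the released pipeline: the four state names, the reduction of tokens to three abstract classes (C for consonant, V for vowel, N for a nasal coda), the transition table, and the `mask_logits` helper, which is a plain-Python stand-in for a real logits processor.

```python
import math
from enum import Enum


class State(Enum):
    """Hypothetical 4-state layout; the paper's actual states may differ."""
    START = 0    # at a syllable boundary before any input
    ONSET = 1    # a consonant has been read, a vowel must follow
    NUCLEUS = 2  # a vowel has been read, the syllable may end here
    CODA = 3     # a nasal coda has been read, the syllable is closed


# Assumed transitions for CV and CVN syllables over classes "C", "V", "N".
TRANSITIONS = {
    (State.START, "C"): State.ONSET,
    (State.ONSET, "V"): State.NUCLEUS,
    (State.NUCLEUS, "N"): State.CODA,   # CV + N closes a CVN syllable
    (State.NUCLEUS, "C"): State.ONSET,  # CV ends, next syllable begins
    (State.CODA, "C"): State.ONSET,     # CVN ends, next syllable begins
}

# A sequence is complete only if it stops right after a full syllable.
ACCEPTING = {State.NUCLEUS, State.CODA}


def is_valid(classes):
    """Return True if the class sequence is a run of CV/CVN syllables."""
    state = State.START
    for cls in classes:
        state = TRANSITIONS.get((state, cls))
        if state is None:
            return False
    return state in ACCEPTING


def mask_logits(logits, token_classes, state):
    """Set the score of every token whose class is illegal in `state` to -inf.

    `logits` and `token_classes` are parallel lists: one score and one
    class label per vocabulary entry (a toy vocabulary in this sketch).
    """
    return [
        score if (state, cls) in TRANSITIONS else -math.inf
        for score, cls in zip(logits, token_classes)
    ]
```

For example, `is_valid("CVCVN")` accepts a CV syllable followed by a CVN syllable, while `is_valid("CC")` rejects a consonant cluster. Masking before sampling means an illegal continuation can never be chosen, which is how a constraint of this kind can guarantee validity at the cost of extra per-step work.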

Promotion decision

What has to happen next

Compile/render the source, verify references and figures, then add to the curated atlas.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.