Dead Circuits: Script Invisibility and Representation Failure for N'Ko in Large Language Models

Extracted abstract or opening context

This paper studies script invisibility: the condition in which a large language model accepts a writing system as valid Unicode while allocating little functional internal representation to it. The test case is N'Ko, the script designed by Solomana Kante for Manding languages. N'Ko is not a noisy informal encoding of Bambara, Maninka, or Dioula. It is a dedicated alphabetic system in the Unicode block U+07C0–U+07FF, with a close mapping between Manding phonology and written symbols, explicit diacritic machinery, and an active literacy tradition. For computational linguistics, it should be unusually favorable: it is more phonemically transparent than Latin Bambara, it avoids many digraph ambiguities, and it preserves distinctions that standard Latin transcriptions often hide.

The empirical problem is that current LLMs do not receive N'Ko that way. Across the project papers, activation profiling found a repeated failure signature: reduced hidden-state norms, higher entropy, elevated sparsity, weaker output-layer kurtosis, and little evidence of reusable reasoning circuits for N'Ko strings. In one Qwen3-8B protocol, N'Ko incurred an average representation tax of about 2.94x, a 1.2–1.7 bit entropy gap, approximately 2.2x higher embedding sparsity, a 78.1% output-layer kurtosis deficit, and no N'Ko-advantageous configuration in 45 layer-duplication probes. In a cross-model protocol, the average translation tax was 3.30x for Qwen3-8B, 3.59x for Qwen2.5-7B, and 2.67x for Mistral-7B; N'Ko activations were roughly 66–72% weaker than English, and output-layer kurtosis deficits ranged from 64.6% to 93.5%.

The main conclusion is not that N'Ko is intrinsically difficult. The opposite is more plausible: N'Ko is computationally regular, but invisible in the data and tokenizer regimes from which general LLM capability emerges. Arabic provides the control: another right-to-left script can be handled competently when it receives large-scale pretraining exposure and substantial tokenizer allocation. The failure is therefore structural and historical, not a rendering artifact. The paper argues that script invisibility should be measured directly in hidden-state geometry before downstream claims about translation, ASR correction, or language support are trusted.
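
The quantities named in the abstract (hidden-state norm, entropy, sparsity, kurtosis) can be illustrated with a minimal sketch. The abstract does not give the paper's exact metric definitions, so everything here is an assumption made for illustration: the function names, the near-zero threshold `eps`, the use of softmax entropy in bits, the population form of excess kurtosis, and the `relative_deficit` formula. Only the N'Ko block range U+07C0–U+07FF comes from the text. The sketch uses plain Python lists in place of real model activations.

```python
import math

# Unicode N'Ko block, as stated in the abstract.
NKO_START, NKO_END = 0x07C0, 0x07FF


def nko_fraction(text):
    """Share of non-whitespace characters that fall inside the N'Ko block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(NKO_START <= ord(c) <= NKO_END for c in chars) / len(chars)


def l2_norm(vec):
    """Euclidean norm of one hidden-state vector."""
    return math.sqrt(sum(x * x for x in vec))


def entropy_bits(logits):
    """Shannon entropy, in bits, of softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return -sum((e / total) * math.log2(e / total) for e in exps if e > 0)


def sparsity(vec, eps=1e-3):
    """Fraction of entries with magnitude below eps (eps is an assumed threshold)."""
    return sum(abs(x) < eps for x in vec) / len(vec)


def excess_kurtosis(vec):
    """Population excess kurtosis: fourth central moment / variance^2 - 3."""
    n = len(vec)
    mean = sum(vec) / n
    var = sum((x - mean) ** 2 for x in vec) / n
    if var == 0:
        return 0.0  # constant vector: define as 0 to avoid division by zero
    m4 = sum((x - mean) ** 4 for x in vec) / n
    return m4 / var ** 2 - 3.0


def relative_deficit(reference, target):
    """How much weaker target is than reference, as a fraction of reference.

    Assumed formula, e.g. a reference norm of 10 and a target norm of 3
    gives a deficit of 0.7 ("70% weaker").
    """
    return 1.0 - target / reference
```

In use, one would compute each statistic per layer for matched N'Ko and English inputs and compare them, for example `relative_deficit(l2_norm(h_english), l2_norm(h_nko))`, where `h_english` and `h_nko` are hypothetical hidden-state vectors taken from the same layer.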

Promotion decision

What has to happen next

Compile/render the source, verify references and figures, then add to the curated atlas.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.