Script Invisibility Is Structural: Activation Profiling Across Three LLM Families
Abstract
A prior study demonstrated that Qwen3-8B processes N'Ko text with severely diminished neural activation compared to English, a phenomenon termed script invisibility.
That finding left an open question: is the deficit specific to one model, or is it a structural property of all models trained on corpora where N'Ko is absent?
We answer this by performing identical activation profiling---per-layer extraction of L2 norm, Shannon entropy, sparsity, and kurtosis---on three architecturally distinct models: Qwen3-8B (37 layers, Qwen architecture), Qwen2.5-7B (29 layers, previous-generation Qwen), and Mistral-7B (33 layers, Mistral architecture).
All three process the same 100 parallel English/N'Ko sentence pairs.
Every model exhibits the same failure signature.
The average translation tax (ratio of English to N'Ko L2 norm) is 3.30$\times$ for Qwen3-8B, 3.59$\times$ for Qwen2.5-7B, and 2.67$\times$ for Mistral-7B.
N'Ko activations are 66\
Embedding-layer sparsity is 2.2--2.6$\times$ higher for N'Ko in both Qwen models.
Output-layer kurtosis deficit ranges from 64.6\% to 93.5\%.
Entropy inflation of 0.78--1.22 bits confirms that N'Ko activations are diffuse rather than structured across all three architectures.
The consistency of these results across different model families, training pipelines, tokenizers, and companies establishes that script invisibility is a consequence of training data composition, not architectural design.
We discuss implications for the 50+ scripts in Unicode that share N'Ko's data-poverty profile and argue that architectural innovation cannot substitute for representative training data.
Total compute cost for all three scans: under \$5.
All code, data, and results are publicly available.
Introduction
Diomande (2026) introduced script invisibility: the measurable phenomenon of a language model allocating near-zero representational capacity to a writing system despite having no architectural constraint against processing it.
The study profiled Qwen3-8B on 100 parallel English/N'Ko sentence pairs and found a 2.94$\times$ translation tax and a 78.1\% output-layer kurtosis deficit.
N'Ko (U+07C0--U+07FF) is an alphabetic script designed in 1949 by Solomana Kant\'{e} for Manding languages (Bambara, Maninka, Dioula), serving over 40 million speakers across West Africa.
Its engineering properties---strict phoneme-to-grapheme bijection, explicit tonal diacritics, zero spelling irregularities---make it arguably the most computationally advantageous script in the Unicode standard for NLP applications.
Its near-total absence from LLM training data makes it an ideal test case for studying how data composition shapes model capability.
The original finding, however, was limited to a single model from a single vendor.
A skeptic could attribute the deficit to Qwen-specific architectural decisions, to Alibaba's particular training data pipeline, or to idiosyncrasies of the Qwen tokenizer.
This paper eliminates those alternative explanations.
We run the identical brain scan on three models from two different companies, spanning two generations of Qwen and a completely independent architecture (Mistral).
The models differ in layer count, tokenizer vocabulary, and training data composition; hidden dimensionality also differs between Qwen2.5-7B (3,584) and the other two models (4,096).
They agree on one thing: N'Ko is invisible.
Our contributions are:
- Cross-architecture validation of script invisibility across three model families, demonstrating that the translation tax, entropy inflation, sparsity elevation, and kurtosis deficit are consistent properties of models trained on N'Ko-absent data (\S[ref: sec:results]).
- Identification of a universal three-zone failure pattern---embedding collapse, mid-layer diffusion, and output-layer circuit death---that appears in all three models despite architectural differences (\S[ref: sec:zones]).
- Quantitative evidence that the magnitude of script invisibility is not proportional to model quality or recency, ruling out the hypothesis that newer or better models naturally acquire low-resource script capability (\S[ref: sec:discussion]).
- A reproducibility framework: all scans run on consumer hardware (Apple M4) for under \$5 total, with code and data publicly released (\S[ref: sec:reproducibility]).
Background and Related Work
Script Invisibility
Script invisibility was defined by Diomande (2026) as the condition where a model's internal activation profile for a given script falls measurably below its activation profile for well-represented scripts, despite the model's architecture imposing no constraint on processing that script.
The term distinguishes the phenomenon from language invisibility: a model may represent Bambara (a Manding language) through Latin-script tokens while maintaining zero functional capacity for N'Ko (the native script).
The diagnostic methodology uses four per-layer metrics:
- L2 norm: $\|h_l\|_2$ for hidden state $h_l$ at layer $l$. Measures activation magnitude.
- Shannon entropy: $H(h_l) = -\sum p_i \log_2 p_i$ over the normalized activation distribution. Measures how broadly information is distributed.
- Sparsity: Fraction of dimensions with $|h_{l,i}| < \epsilon$ (we use $\epsilon = 0.01$). Measures the share of near-dead neurons.
- Kurtosis: $\kappa(h_l) = \frac{\mu_4}{\sigma^4}$. Measures circuit specialization. High kurtosis indicates a few neurons firing strongly (specialized circuits); low kurtosis indicates diffuse, unstructured activation.
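The four metrics above can be sketched in plain Python. This is a minimal illustration, not the released scanner script: the function names are hypothetical, and the normalization used for the entropy ($p_i = |h_{l,i}| / \sum_j |h_{l,j}|$) is an assumption, since the text only says "normalized activation distribution".

```python
import math


def l2_norm(h):
    """Activation magnitude: Euclidean norm of one hidden-state vector."""
    return math.sqrt(sum(x * x for x in h))


def shannon_entropy(h):
    """Entropy in bits of the normalized activation distribution.

    Assumption: p_i = |h_i| / sum_j |h_j|; the source does not spell this out.
    """
    total = sum(abs(x) for x in h)
    if total == 0.0:
        return 0.0
    probs = (abs(x) / total for x in h)
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def sparsity(h, eps=0.01):
    """Fraction of dimensions whose absolute activation is below eps."""
    return sum(1 for x in h if abs(x) < eps) / len(h)


def kurtosis(h):
    """Fourth central moment divided by squared variance (mu_4 / sigma^4)."""
    n = len(h)
    mean = sum(h) / n
    m2 = sum((x - mean) ** 2 for x in h) / n
    m4 = sum((x - mean) ** 4 for x in h) / n
    return m4 / (m2 * m2) if m2 > 0.0 else 0.0


def layer_metrics(h, eps=0.01):
    """Bundle the four per-layer metrics for one hidden-state vector."""
    return {
        "l2": l2_norm(h),
        "entropy": shannon_entropy(h),
        "sparsity": sparsity(h, eps),
        "kurtosis": kurtosis(h),
    }


# Two toy vectors: one neuron firing strongly versus all neurons firing equally.
peaked = [10.0] + [0.0] * 7
diffuse = [1.0, -1.0] * 4
```

On these toy vectors, `peaked` has lower entropy and higher kurtosis than `diffuse`, which is the contrast the metrics are meant to capture between specialized and unstructured activation.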
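The translation tax is defined as the ratio of average English L2 norm to average N'Ko L2 norm across all layers:
\[
\tau = \frac{\frac{1}{N}\sum_{l} \|h_l^{\mathrm{EN}}\|_2}{\frac{1}{N}\sum_{l} \|h_l^{\mathrm{NKo}}\|_2}
\]
where $N$ is the number of profiled layers and the superscript denotes the input script.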
A tax of $\tau > 1$ indicates that the model activates with less energy for N'Ko than for English.
A tax of $\tau = 1$ would indicate parity.
Multilingual Model Evaluation
Prior work on multilingual model capabilities has focused primarily on task-level performance metrics: BLEU scores for translation [citation: conneau2020unsupervised], accuracy on NLU benchmarks like XNLI [citation: conneau2018xnli], and perplexity on held-out text.
These metrics measure what the model does but not what the model is doing internally.
Activation profiling provides a mechanistic lens.
Rather than asking ``does the model translate N'Ko correctly?'' (it does not), we ask ``what is happening inside the model when it encounters N'Ko tokens?''
The answer---reduced activation magnitude, elevated sparsity, diffuse entropy, collapsed kurtosis---explains why the model fails and predicts which interventions can fix it.
Wu and Dredze (2020) studied how multilingual BERT allocates capacity across languages and found that low-resource languages receive systematically less representation.
Muller et al. (2021) showed that cross-lingual transfer in mBERT depends on shared vocabulary rather than shared syntax.
Our work extends this line by moving from task-level to activation-level analysis and from language-level to script-level granularity.
The Models Under Test
We selected three models to maximize architectural diversity at a comparable scale:
| | Qwen3-8B | Qwen2.5-7B | Mistral-7B |
|---|---|---|---|
| Parameters | 8B | 7B | 7B |
| Layers | 37 | 29 | 33 |
| Hidden dim | 4,096 | 3,584 | 4,096 |
| Attention | GQA | GQA | GQA |
| Vocab size | 151,936 | 151,936 | 32,000 |
| Developer | Alibaba | Alibaba | Mistral AI |
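Caption: Models under test. GQA = grouped-query attention. Layer counts include the embedding output (layer 0) in addition to the 36, 28, and 32 transformer blocks, respectively.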
Qwen3-8B is the baseline model from the original study.
Qwen2.5-7B is the previous generation from the same developer, with a different layer count and training data vintage, providing an intra-family comparison.
Mistral-7B is from a different company with a different training pipeline, a significantly smaller vocabulary (32K vs 152K), and a different tokenizer implementation, providing an inter-family comparison.
Experimental Setup
Data
We use the same 100 parallel English/N'Ko sentence pairs from Diomande (2026).
Each pair consists of a sentence in English and its N'Ko equivalent, validated through the phoneme-to-grapheme bridge and the FSM phonotactic validator.
Sentences span everyday language, greetings, cultural knowledge, and simple factual statements.
The parallel corpus ensures that any difference in activation profile between English and N'Ko is attributable to the script encoding, not to the semantic content.
Procedure
For each model:
- Load in 4-bit quantization on Apple M4 (64GB unified memory).
- For each of the 100 sentence pairs, feed the English sentence and extract hidden states $h_l$ at every layer $l$.
- Repeat with the N'Ko sentence.
- Compute L2 norm, Shannon entropy, sparsity ($\epsilon = 0.01$), and kurtosis at each layer.
- Average across all 100 sentences to produce per-layer profiles for English and N'Ko.
The procedure is identical for all three models.
The only differences are the number of layers and hidden dimensions, which are intrinsic to each architecture.
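The averaging step of this procedure can be sketched as follows. This is a hedged outline rather than the released scanner: `extract_hidden_states` is a hypothetical callable standing in for the model forward pass, and it is assumed to return one activation vector per layer (for example, after pooling over tokens, which the text does not specify). The toy extractor and toy metric exist only to make the sketch runnable without model weights.

```python
from statistics import fmean


def profile_script(sentences, extract_hidden_states, metric_fns):
    """Average each metric per layer over all sentences of one script.

    extract_hidden_states(text) -> list of per-layer vectors (hypothetical).
    metric_fns maps a metric name to a function of one vector.
    """
    if not sentences:
        raise ValueError("need at least one sentence")
    per_layer = None
    for text in sentences:
        states = extract_hidden_states(text)
        if per_layer is None:
            per_layer = [{name: [] for name in metric_fns} for _ in states]
        for layer, h in enumerate(states):
            for name, fn in metric_fns.items():
                per_layer[layer][name].append(fn(h))
    return [
        {name: fmean(values) for name, values in layer.items()}
        for layer in per_layer
    ]


def profile_pairs(pairs, extract_hidden_states, metric_fns):
    """Build per-layer profiles for both sides of parallel (English, N'Ko) pairs."""
    english = [en for en, _ in pairs]
    nko = [nk for _, nk in pairs]
    return {
        "en": profile_script(english, extract_hidden_states, metric_fns),
        "nko": profile_script(nko, extract_hidden_states, metric_fns),
    }


def toy_extractor(text):
    """Stand-in for a model: two fake 'layers' whose magnitude tracks text length."""
    n = float(len(text))
    return [[n, 0.0], [2.0 * n, 0.0]]


def toy_l2(h):
    """Euclidean norm, used here as the only metric."""
    return sum(x * x for x in h) ** 0.5


# Placeholder strings, not real corpus sentences.
demo_pairs = [("ab", "x"), ("abcd", "xyz")]
demo_profiles = profile_pairs(demo_pairs, toy_extractor, {"l2": toy_l2})
```

Passing the extractor in as a callable keeps the averaging logic identical across models, which mirrors the claim that only layer count and hidden dimension vary between scans.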
Results
Translation Tax
All three models impose a substantial translation tax on N'Ko:
| Model | EN avg L2 | N'Ko avg L2 | Tax ($\tau$) |
|---|---|---|---|
| Qwen3-8B | 2,908.6 | 880.2 | 3.30$\times$ |
| Qwen2.5-7B | 3,642.2 | 1,014.9 | 3.59$\times$ |
| Mistral-7B | 47.9 | 17.9 | 2.67$\times$ |
Caption: Translation tax across models. All three models process N'Ko with significantly less activation energy than English.
The absolute L2 magnitudes differ by roughly two orders of magnitude between the Qwen and Mistral families (reflecting different weight initialization and normalization schemes), but the ratio is consistent: N'Ko receives 28\%--37\% of the activation magnitude that English receives.
Qwen2.5-7B shows the highest tax (3.59$\times$), despite being from the same family as Qwen3-8B.
This indicates that architectural improvements between Qwen generations did not address the N'Ko deficit, consistent with the hypothesis that the deficit is data-driven rather than architecture-driven.
Mistral-7B shows the lowest tax (2.67$\times$) but still well above parity.
Its smaller vocabulary (32K vs 152K) means it has even fewer dedicated N'Ko tokens than the Qwen models, yet its tax is slightly lower, suggesting that tokenizer vocabulary size alone does not fully determine the magnitude of script invisibility.
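The reported taxes can be rechecked directly from the average L2 norms in the translation-tax table. The snippet below is a small arithmetic sketch; the dictionary and function names are illustrative. Because the table values are already rounded, a recomputed ratio may differ from the reported one in the last digit.

```python
# Average L2 norms (English, N'Ko) as reported in the translation-tax table.
AVG_L2 = {
    "Qwen3-8B": (2908.6, 880.2),
    "Qwen2.5-7B": (3642.2, 1014.9),
    "Mistral-7B": (47.9, 17.9),
}


def translation_tax(en_avg_l2, nko_avg_l2):
    """Ratio of average English L2 norm to average N'Ko L2 norm."""
    return en_avg_l2 / nko_avg_l2


taxes = {model: translation_tax(en, nko) for model, (en, nko) in AVG_L2.items()}

# Share of the English activation magnitude that N'Ko receives (1 / tax).
shares = {model: 1.0 / tau for model, tau in taxes.items()}

for model in AVG_L2:
    print(f"{model}: tax = {taxes[model]:.3f}, N'Ko share = {shares[model]:.1%}")
```

The shares fall between roughly 28\% and 37\%, regardless of the large difference in absolute scale between the Qwen and Mistral norms.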
Entropy Inflation
| Model | EN avg $H$ | N'Ko avg $H$ | $\Delta H$ (bits) |
|---|---|---|---|
| Qwen3-8B | 8.82 | 10.04 | +1.22 |
| Qwen2.5-7B | 8.39 | 9.17 | +0.78 |
| Mistral-7B | 9.61 | 10.75 | +1.14 |
Caption: Shannon entropy averaged across all layers. N'Ko produces higher entropy (more diffuse activations) in every model.
Higher entropy for N'Ko is counterintuitive if interpreted as ``more information.''
In this context, it indicates the opposite: the model has not learned to concentrate its N'Ko representations.
For English, the model routes activation energy to specialized circuits (lower entropy, higher kurtosis).
For N'Ko, the model spreads energy more evenly across dimensions because it has not learned which dimensions are relevant.
The entropy gap is remarkably consistent across architectures: 0.78--1.22 bits.
This narrow range suggests a common mechanism---plausibly the shared property that all three models have essentially zero N'Ko subword tokens and must process N'Ko at the single-character level.
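One way to read an entropy gap in bits is through the quantity $2^{H}$, which can be interpreted as an effective number of dimensions carrying activation mass. This reading is an interpretive aid added here, not a metric from the original study, and it assumes the entropy is taken over a distribution across hidden dimensions. The values are the layer-averaged entropies from the table above.

```python
# Layer-averaged Shannon entropy in bits (English, N'Ko) from the entropy table.
ENTROPY_BITS = {
    "Qwen3-8B": (8.82, 10.04),
    "Qwen2.5-7B": (8.39, 9.17),
    "Mistral-7B": (9.61, 10.75),
}


def effective_dims(entropy_bits):
    """Effective number of equally weighted dimensions implied by an entropy."""
    return 2.0 ** entropy_bits


# How many times more dimensions share the activation mass for N'Ko.
spread_factor = {
    model: effective_dims(nko) / effective_dims(en)
    for model, (en, nko) in ENTROPY_BITS.items()
}

for model, factor in spread_factor.items():
    print(f"{model}: N'Ko mass is spread over about {factor:.2f}x as many dimensions")
```

Under this reading, a gap of 0.78 to 1.22 bits corresponds to N'Ko activation mass being spread over roughly 1.7 to 2.3 times as many effective dimensions as English.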
Sparsity
| Model | EN $s_0$ | N'Ko $s_0$ | Ratio | Avg ratio |
|---|---|---|---|---|
| Qwen3-8B | 0.065 | 0.144 | 2.21$\times$ | 1.65$\times$ |
| Qwen2.5-7B | 0.110 | 0.285 | 2.59$\times$ | 2.35$\times$ |
| Mistral-7B | 0.705 | 0.666 | 0.95$\times$ | 1.15$\times$ |
Caption: Sparsity at the embedding layer ($s_0$) and average ratio across all layers. Both Qwen models show dramatically elevated N'Ko sparsity at embedding; Mistral's embedding layer is uniformly sparse for both scripts.
The sparsity results reveal an architectural difference.
Both Qwen models show a pronounced sparsity spike at the embedding layer for N'Ko---2.2--2.6$\times$ more dead neurons than English---indicating that the comprehension failure begins at the earliest stage of processing.
Mistral-7B does not show this embedding-layer spike.
Its embedding layer is highly sparse for both English and N'Ko ($>$66\% in each case).
However, Mistral's overall sparsity ratio (1.15$\times$ averaged across all layers) still favors English, confirming that the model allocates more active neurons to English processing even when the embedding-layer pattern differs.
Kurtosis Deficit
| Model | EN $\kappa_L$ | N'Ko $\kappa_L$ | Deficit |
|---|---|---|---|
| Qwen3-8B | 601.5 | 131.9 | 78.1\% |
| Qwen2.5-7B | 644.7 | 41.7 | 93.5\% |
| Mistral-7B | 168.0 | 59.5 | 64.6\% |
Caption: Kurtosis at the output layer. All models show severely reduced kurtosis for N'Ko, indicating absence of specialized output circuits.
Kurtosis measures circuit specialization.
A high-kurtosis layer has learned to activate a small number of neurons strongly---it knows which circuits matter for this input.
A low-kurtosis layer activates all neurons weakly---it is guessing.
At the output layer, where the model's representation is closest to producing actual predictions, kurtosis is decimated for N'Ko across all three models.
Qwen2.5-7B is the most extreme: a 93.5\% deficit, with output-layer kurtosis falling from 644.7 for English to 41.7 for N'Ko.
This result is the most direct evidence that none of these models has learned functional N'Ko circuits.
The output layer is where language-specific knowledge would be most concentrated if it existed.
Its near-total absence of specialization for N'Ko confirms that what Paper 1 called ``dead circuits'' is not a Qwen-specific phenomenon.
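The deficit percentages are consistent with a relative drop in output-layer kurtosis. As a worked check, assuming the deficit $D$ is defined as one minus the N'Ko-to-English kurtosis ratio (the symbol $D$ is introduced here for illustration), the Qwen2.5-7B figures give:

```latex
\[
D = 1 - \frac{\kappa_L^{\mathrm{NKo}}}{\kappa_L^{\mathrm{EN}}},
\qquad
D_{\text{Qwen2.5-7B}} = 1 - \frac{41.7}{644.7} \approx 0.935 .
\]
```

The same formula yields approximately 0.781 for Qwen3-8B ($1 - 131.9/601.5$) and 0.646 for Mistral-7B ($1 - 59.5/168.0$), matching the reported deficits.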
Three-Zone Failure Analysis
Paper 1 identified three structurally distinct failure zones in Qwen3-8B: embedding collapse (early layers), reasoning vacuum (middle layers), and incoherent prediction (output layers).
We test whether this zonal structure appears in the other two models.
We divide each model's layers into three equal zones (early, middle, late) and compute the average N'Ko/English L2 ratio, entropy gap, and sparsity gap for each zone.
| Model | Zone | L2 ratio | $\Delta H$ | $\Delta s$ |
|---|---|---|---|---|
| Qwen3-8B | Early | 0.277 | +0.73 | +0.006 |
| Qwen3-8B | Mid | 0.275 | +1.63 | +0.001 |
| Qwen3-8B | Late | 0.339 | +1.30 | $-$0.000 |
| Qwen2.5-7B | Early | 0.271 | +0.32 | +0.020 |
| Qwen2.5-7B | Mid | 0.271 | +0.79 | +0.001 |
| Qwen2.5-7B | Late | 0.290 | +1.13 | +0.000 |
| Mistral-7B | Early | 0.245 | +1.51 | +0.026 |
| Mistral-7B | Mid | 0.266 | +1.41 | $-$0.000 |
| Mistral-7B | Late | 0.525 | +0.50 | $-$0.000 |
Caption: Three-zone failure analysis. L2 ratio = N'Ko/English L2 norm (lower means worse deficit). $\Delta H$ = N'Ko entropy minus English entropy (bits). $\Delta s$ = N'Ko sparsity minus English sparsity.
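The zonal averaging can be sketched as below. Two details are assumptions, since the text does not specify them: when the layer count is not divisible by three, the extra layers go to the earlier zones, and the zone value is the mean of per-layer N'Ko/English ratios rather than a ratio of zone means. Function names are illustrative.

```python
from statistics import fmean


def split_into_zones(n_layers):
    """Split layer indices 0..n_layers-1 into early, mid, late index lists.

    Assumption: leftover layers (when n_layers % 3 != 0) go to earlier zones.
    """
    base, extra = divmod(n_layers, 3)
    sizes = [base + (1 if i < extra else 0) for i in range(3)]
    zones, start = [], 0
    for size in sizes:
        zones.append(list(range(start, start + size)))
        start += size
    return zones


def zone_l2_ratios(en_norms, nko_norms):
    """Mean per-layer N'Ko/English L2 ratio within each of the three zones."""
    if len(en_norms) != len(nko_norms):
        raise ValueError("profiles must cover the same layers")
    zones = split_into_zones(len(en_norms))
    names = ("early", "mid", "late")
    return {
        name: fmean([nko_norms[i] / en_norms[i] for i in idx])
        for name, idx in zip(names, zones)
    }


# Synthetic six-layer profile, for illustration only (not measured data).
toy_en = [10.0] * 6
toy_nko = [2.0, 2.0, 3.0, 3.0, 5.0, 5.0]
toy_zones = zone_l2_ratios(toy_en, toy_nko)
```

With the layer counts used here, this split gives zone sizes of 13/12/12 for 37 layers, 10/10/9 for 29 layers, and 11/11/11 for 33 layers.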
The three-zone pattern is consistent across all models, with one architectural variation:
Early zone (embedding collapse): All three models show severely depressed L2 ratios (0.245--0.277) and elevated sparsity gaps in the early layers.
This is where the model first encounters N'Ko tokens and fails to activate appropriate representations.
The sparsity gap is concentrated here: it ranges from +0.006 to +0.026 in the early zone, against at most 0.001 in magnitude in the middle and late zones.
Middle zone (reasoning vacuum): The L2 ratio remains severely depressed (0.266--0.275).
The entropy gap peaks in the middle layers for Qwen3-8B (+1.63 bits), indicating maximum diffusion of the already-weak N'Ko signal.
The sparsity gap drops to near zero, because by this point the model has distributed N'Ko activations broadly---there are few dead neurons, but the live ones are firing weakly and without direction.
Late zone (output collapse): The L2 ratio improves slightly in all models (0.290--0.525), which might appear encouraging.
However, this improvement is accompanied by the kurtosis collapse shown in Table~[ref: tab:kurtosis]: the output layers have slightly more energy but dramatically less specialization.
The model is activating more neurons but has not learned which ones to use for N'Ko prediction.
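Mistral-7B shows the most recovery in the late zone (L2 ratio 0.525 vs 0.245 in early layers), but this recovery is illusory: its 64.6\% output-layer kurtosis deficit shows that the added energy is not organized into specialized circuits.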
Discussion
Architecture Does Not Determine Script Visibility
The central finding is negative: no architectural feature we tested---layer count, hidden dimensionality, vocabulary size, model generation---predicts the magnitude of script invisibility.
Qwen3-8B has 37 layers and a 152K vocabulary.
Mistral-7B has 33 layers and a 32K vocabulary.
Both show the same qualitative failure signature and quantitatively similar translation taxes (3.30$\times$ vs 2.67$\times$).
This result has a direct implication for the field: architectural innovation cannot fix script invisibility.
A new attention mechanism, a deeper network, a larger vocabulary---none of these address the root cause.
If N'Ko is absent from training data, the model will not learn to process it, regardless of how sophisticated the architecture.
Generational Improvement Does Not Help
Qwen2.5-7B (released 2024) and Qwen3-8B (released 2025) represent consecutive generations from the same developer.
Qwen3 is the better model by standard benchmarks.
Yet its translation tax (3.30$\times$) is only marginally lower than Qwen2.5's (3.59$\times$), and its kurtosis deficit (78.1\%) remains severe, even if smaller than Qwen2.5's (93.5\%).
This suggests that the incremental improvements between model generations---more data, better training recipes, architectural refinements---do not naturally trickle down to unseen scripts.
N'Ko does not benefit from improvements made on languages that are already well-represented.
The rising tide does not lift all boats.
Implications for Other Scripts
N'Ko is not the only script affected.
The Unicode Standard defines blocks for Adlam (Fulani), Tifinagh (Berber), Vai (Liberian), Osmanya (Somali), and dozens of other scripts that share N'Ko's data-poverty profile.
Our results predict that all such scripts will exhibit similar activation deficits in any model trained on standard web-crawled corpora.
We note that these scripts are not edge cases.
Adlam serves approximately 40 million Fulani speakers.
Tifinagh serves approximately 30 million Berber speakers.
The combined population affected by script invisibility likely exceeds 200 million people.
The Validation Question
A natural concern is whether these activation-level measurements translate to meaningful differences in model behavior.
We argue that they do, for two reasons.
First, the translation tax is a mechanistic explanation for known task-level failures.
N'Ko generation quality is poor across all tested models.
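Activation profiling explains why: the model is processing N'Ko with 28--37\% of the activation magnitude it devotes to English.
A model operating at roughly a third of its usual activation magnitude cannot be expected to produce output of comparable quality.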
Second, the kurtosis deficit at the output layer directly predicts prediction quality.
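A model with a 93.5\% kurtosis deficit at its output layer has almost no specialized circuitry left for selecting N'Ko tokens.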
This is consistent with the observed behavior: N'Ko outputs from these models are phonotactically invalid, semantically incoherent, and often degenerate to repetitive token sequences.
Reproducibility
All experiments were conducted on a single Apple M4 Mac (64GB unified memory) using 4-bit quantization.
Total wall-clock time for all three scans was approximately 45 minutes.
Total compute cost: under \$5 (electricity only; no cloud compute required).
The following artifacts are publicly available at https://github.com/Diomandeee/nko-brain-scanner:
- Scanner script with per-layer metric extraction
- 100 parallel English/N'Ko sentence pairs
- Raw scan results for all three models (JSON format)
- Comparison and visualization scripts
Any researcher with access to a consumer laptop and the publicly available model weights can reproduce these results in under an hour at negligible cost.
Conclusion
Script invisibility is not a quirk of one model.
It is a structural property of models trained on corpora where a script is absent.
Three architecturally distinct models from two companies, spanning two model generations and vocabulary sizes from 32K to 152K, all exhibit the same failure signature: reduced activation magnitude, elevated entropy, increased sparsity, and collapsed kurtosis when processing N'Ko.
The failure is consistent across layers and follows a three-zone pattern (embedding collapse, mid-layer diffusion, output-layer circuit death) that is architecturally invariant.
The implication is clear.
No amount of architectural innovation will grant a model capability on a script it has never seen.
The path to script visibility is through the training data.
For N'Ko and the 50+ scripts that share its data-poverty profile, the question is not whether the technology can handle these scripts---the phoneme-to-grapheme bijection makes N'Ko computationally ideal.
The question is whether the institutions that build language models choose to include them.
Acknowledgments
This work builds directly on the methodology established in ``Dead Circuits'' (Paper 1 of this series).
The 100 parallel sentence pairs were developed as part of the N'Ko Brain Scanner project.
References
diomande2026dead
Mohamed Diomande. 2026.
\newblock Dead Circuits: Activation Profiling and Script Invisibility in Large Language Models.
\newblock Manuscript.
conneau2020unsupervised
Alexis Conneau et~al. 2020.
\newblock Unsupervised cross-lingual representation learning at scale.
\newblock In Proceedings of ACL 2020.
conneau2018xnli
Alexis Conneau et~al. 2018.
\newblock {XNLI}: Evaluating cross-lingual sentence representations.
\newblock In Proceedings of EMNLP 2018.
wu2020languages
Shijie Wu and Mark Dredze. 2020.
\newblock Are all languages created equal in multilingual {BERT}?
\newblock In Proceedings of the 5th Workshop on Representation Learning for NLP.
muller2021first
Benjamin M\"{u}ller et~al. 2021.
\newblock First align, then predict: Understanding the cross-lingual ability of multilingual {BERT}.
\newblock In Proceedings of EACL 2021.
lauscher2020zero
Anne Lauscher et~al. 2020.
\newblock From zero to hero: On the limitations of zero-shot language transfer with multilingual transformers.
\newblock In Proceedings of EMNLP 2020.
ponti2019modeling
Edoardo~Maria Ponti et~al. 2019.
\newblock Modeling language variation and universals: A survey on typological linguistics for natural language processing.
\newblock Computational Linguistics, 45(3):559--601.