Grand Diomande Research · Full HTML Reader

Does Every AI Have the Same Blind Spot?

Testing whether N'Ko invisibility is a universal property of language models, or a quirk of one.

---

The Original Finding

The first N'Ko brain scan found something uncomfortable.

Qwen3-8B, an 8-billion-parameter model trained on trillions of tokens, processed N'Ko text with measurably less activation than English at every single layer. More dead neurons. Less information being distributed. Flatter circuits. The model wasn't failing because N'Ko is difficult. It was failing because it had barely seen the script in training.

The technical name for this is an activation deficit. When a layer processes unfamiliar input, fewer neurons fire. The ones that do fire produce weaker signals. The result is a flatter, sparser activation profile across all 4,096 hidden dimensions. You can measure this with four numbers: L2 norm (how loudly is the layer speaking?), Shannon entropy (how spread out is the information?), sparsity (what fraction of neurons are essentially turned off?), and kurtosis (how specialized are the active circuits?).
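As a minimal sketch of how those four numbers could be computed for a single hidden-state vector, the function below uses NumPy. The exact definitions used by the scanner are not given here, so several choices are assumptions: entropy is taken over magnitudes normalized to sum to one, sparsity uses an illustrative threshold `eps`, and kurtosis is the Pearson (non-excess) form. The name `activation_metrics` is hypothetical.

```python
import numpy as np

def activation_metrics(hidden, eps=1e-3):
    """Summarize one hidden-state vector with the four scan metrics.

    eps is an assumed threshold for "essentially turned off".
    """
    h = np.asarray(hidden, dtype=np.float64).ravel()
    mag = np.abs(h)

    # L2 norm: overall signal strength of the layer output.
    l2_norm = float(np.linalg.norm(h))

    # Shannon entropy in bits, treating normalized magnitudes as a distribution.
    total = mag.sum()
    if total > 0:
        p = mag[mag > 0] / total
        entropy = float(-(p * np.log2(p)).sum())
    else:
        entropy = 0.0

    # Sparsity: fraction of dimensions whose magnitude falls below eps.
    sparsity = float((mag < eps).mean())

    # Pearson kurtosis: fourth central moment over squared variance.
    centered = h - h.mean()
    var = float((centered ** 2).mean())
    kurtosis = float((centered ** 4).mean() / var ** 2) if var > 0 else 0.0

    return {"l2_norm": l2_norm, "entropy": entropy,
            "sparsity": sparsity, "kurtosis": kurtosis}

# Toy vectors: a few strong dimensions versus many weak, noisy ones.
rng = np.random.default_rng(0)
peaked = np.zeros(4096)
peaked[:8] = 10.0
diffuse = rng.normal(0.0, 0.1, 4096)

peaked_metrics = activation_metrics(peaked)
diffuse_metrics = activation_metrics(diffuse)
print(peaked_metrics)
print(diffuse_metrics)
```

On these toy inputs the peaked vector has low entropy and high kurtosis, while the diffuse vector has high entropy and low kurtosis. That is the same contrast the scan looks for between familiar and unfamiliar input.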

For N'Ko, all four metrics pointed the same direction. The model was running on reduced capacity. It was processing N'Ko text with the cognitive equivalent of one hand behind its back.

But Qwen3-8B is one model, from one company, trained in one particular way.

Does every AI have this same blind spot?

---

What We're Testing

Experiment A asks a direct question: is N'Ko's invisibility specific to Qwen, or is it a structural property of models trained on data where N'Ko barely exists?

The hypothesis is that it's structural. Every tested model will show the same activation deficit for N'Ko, because all of them were trained predominantly on Latin and CJK text. The gap isn't about architecture. It's about data.

To test this, we ran the same brain scan on three models from different families:

Qwen3-8B (8B parameters, Qwen architecture, 37 scanned layers: the embedding output plus 36 transformer blocks) is the baseline from the original experiment. It gives us a replication check before we start comparing.

Qwen2.5-7B (7B parameters, Qwen architecture, 29 scanned layers: the embedding output plus 28 transformer blocks) is the previous-generation Qwen model. Same family, different training data vintage, slightly different scale. If the problem is architectural, the older Qwen should show a similar pattern. If it comes from changes in training data composition between generations, the gap might differ.

Mistral-7B (7B parameters, Mistral architecture, 33 scanned layers: the embedding output plus 32 transformer blocks) is a completely different model family from a different company. Different architecture, different training data pipeline, different tokenizer. If Mistral shows the same N'Ko deficit as both Qwen models, that rules out Qwen-specific explanations entirely.

All three models were fed the same 100 parallel English/N'Ko sentence pairs from the project's corpus. Same sentences, same metrics, same method. The only variable is the model.

---

How the Scan Works

For each model, we load it in 4-bit quantization on an Apple M4 and run every sentence through it. At every layer, we extract the full hidden state tensor and compute the four core metrics. The hidden state is the model's internal representation at that stage of processing: one vector per token (4,096 numbers for Qwen3-8B and Mistral-7B, 3,584 for Qwen2.5-7B) that encodes what the model understands about the input so far.
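The per-layer loop can be sketched as follows. This is an illustration, not the project's scanner: it uses random NumPy arrays in place of real hidden states (toy sizes, made-up scales), and `scan_layers` is a hypothetical name. With a real model, the list would hold one `[tokens, hidden_size]` array per layer; for example, Hugging Face `transformers` models return such a tuple, embedding output first, when called with `output_hidden_states=True`.

```python
import numpy as np

def scan_layers(hidden_states, eps=1e-3):
    """Compute a per-layer profile from a sequence of [tokens, hidden] arrays."""
    profile = []
    for layer_idx, h in enumerate(hidden_states):
        h = np.asarray(h, dtype=np.float64)
        token_norms = np.linalg.norm(h, axis=-1)  # one L2 norm per token
        profile.append({
            "layer": layer_idx,
            "l2_norm": float(token_norms.mean()),
            "sparsity": float((np.abs(h) < eps).mean()),
        })
    return profile

# Stand-in data: 5 "layers", 12 tokens, 64 hidden dimensions.
# The 0.3 scale for the N'Ko stand-in is arbitrary, chosen only for the demo.
rng = np.random.default_rng(0)
english_states = [rng.normal(0.0, 1.0, (12, 64)) for _ in range(5)]
nko_states = [rng.normal(0.0, 0.3, (12, 64)) for _ in range(5)]

english_profile = scan_layers(english_states)
nko_profile = scan_layers(nko_states)

# Per-layer N'Ko / English ratio of average L2 norm.
l2_ratios = [n["l2_norm"] / e["l2_norm"]
             for n, e in zip(nko_profile, english_profile)]
print([round(r, 2) for r in l2_ratios])
```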

We also compute the translation tax: the ratio of N'Ko perplexity to English perplexity. Perplexity is a direct measure of how confused the model is. If the model has fully learned a language, perplexity is low. If the model is essentially guessing token by token, perplexity is high. The ratio tells us how much harder N'Ko is for each model, independent of the model's overall quality.
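A minimal sketch of the translation tax, assuming perplexity is computed in the standard way as the exponential of the mean negative log-likelihood per token. The function names and the log-probabilities below are illustrative, not values from the experiment.

```python
import math

def perplexity(token_logprobs):
    """exp(mean negative log-likelihood), with natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def translation_tax(nko_logprobs, english_logprobs):
    """Ratio of N'Ko perplexity to English perplexity for the same sentence pair."""
    return perplexity(nko_logprobs) / perplexity(english_logprobs)

# Made-up log-probabilities: the model is far less sure about the N'Ko tokens.
english_logprobs = [-1.0, -1.0, -1.0]
nko_logprobs = [-3.0, -3.0]

tax = translation_tax(nko_logprobs, english_logprobs)
print(f"translation tax: {tax:.2f}x")
```

A tax of 1.0x would mean the model is equally confident in both languages. Larger values mean N'Ko is proportionally harder for that model.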

After all three scans are done, a comparison script pulls all the results together and looks for patterns. Does every model show a sparsity spike at the same layers? Are the activation curves shaped similarly, or does each architecture fail in a different way?
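Because the three models have different depths, comparing curve shapes requires putting them on a common axis. One way this could be done, offered as an assumption and not a description of the actual comparison script, is to resample each per-layer curve onto relative depth (0 = first scanned layer, 1 = last) and correlate the results. The function names and the synthetic curves are hypothetical.

```python
import numpy as np

def resample_to_relative_depth(per_layer_values, num_points=20):
    """Linearly interpolate a per-layer curve onto num_points relative depths."""
    values = np.asarray(per_layer_values, dtype=np.float64)
    src = np.linspace(0.0, 1.0, len(values))
    dst = np.linspace(0.0, 1.0, num_points)
    return np.interp(dst, src, values)

def curve_similarity(curve_a, curve_b, num_points=20):
    """Pearson correlation between two curves after depth alignment."""
    a = resample_to_relative_depth(curve_a, num_points)
    b = resample_to_relative_depth(curve_b, num_points)
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic curves with the same shape but different depths (37 vs 29 points).
deep_curve = np.linspace(0.0, 1.0, 37) ** 2
shallow_curve = 5.0 * np.linspace(0.0, 1.0, 29) ** 2

similarity = curve_similarity(deep_curve, shallow_curve)
print(f"shape similarity: {similarity:.3f}")
```

Correlation ignores overall scale, which matters here because the models' raw L2 norms differ by orders of magnitude.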

---

Why the Answer Matters

There are two possible outcomes, and both are useful.

If all three models show the same activation deficit, that's strong evidence the problem is in the training data, not the architecture. No amount of architectural innovation will fix N'Ko invisibility if training datasets keep underrepresenting the script. The fix has to be data-level.

If some models handle N'Ko better than others, that points to something else. Maybe certain architectures generalize better across scripts. Maybe certain training data compositions matter more than total scale. Maybe the tokenizer design affects how well the model can handle a right-to-left script with a compact Unicode block.

Either outcome changes how you think about solving the problem.

---

Results

Every model has the same blind spot. The deficit is not a quirk of Qwen. It is structural.

The Numbers

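| Metric | Row | Qwen3-8B (37 layers) | Qwen2.5-7B (29 layers) | Mistral-7B (33 layers) |
|---|---|---|---|---|
| Avg L2 Norm | English | 2,908.6 | 3,642.2 | 47.9 |
| | N'Ko | 880.2 | 1,014.9 | 17.9 |
| | N'Ko / English | 0.30x | 0.28x | 0.37x |
| Avg Entropy | English | 8.82 | 8.39 | 9.61 |
| | N'Ko | 10.04 | 9.17 | 10.75 |
| | Delta | +1.22 | +0.78 | +1.14 |
| Avg Sparsity | English | 0.0033 | 0.0048 | 0.056 |
| | N'Ko | 0.0054 | 0.0112 | 0.064 |
| | N'Ko / English | 1.65x | 2.35x | 1.15x |
| Avg Kurtosis | English | 3,156.3 | 1,341.6 | 3,253.3 |
| | N'Ko | 2,734.1 | 1,283.6 | 2,261.9 |
| | N'Ko / English | 0.87x | 0.96x | 0.70x |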
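The ratio rows can be rechecked from the averages with a few lines of arithmetic. The sketch below transcribes the English and N'Ko averages from the table and recomputes N'Ko / English; the dictionary layout and function name are illustrative. Because the table entries are already rounded, the recomputed sparsity ratios can differ from the reported ones in the last digit.

```python
# Averages transcribed from the results table: (English, N'Ko) per model.
table = {
    "l2_norm": {
        "Qwen3-8B": (2908.6, 880.2),
        "Qwen2.5-7B": (3642.2, 1014.9),
        "Mistral-7B": (47.9, 17.9),
    },
    "sparsity": {
        "Qwen3-8B": (0.0033, 0.0054),
        "Qwen2.5-7B": (0.0048, 0.0112),
        "Mistral-7B": (0.056, 0.064),
    },
    "kurtosis": {
        "Qwen3-8B": (3156.3, 2734.1),
        "Qwen2.5-7B": (1341.6, 1283.6),
        "Mistral-7B": (3253.3, 2261.9),
    },
}

def nko_over_english(metric):
    """N'Ko / English ratio of the average value, per model."""
    return {model: nko / en for model, (en, nko) in table[metric].items()}

for metric in table:
    for model, ratio in nko_over_english(metric).items():
        change = (ratio - 1.0) * 100.0  # percent change relative to English
        print(f"{metric:9s} {model:11s} ratio={ratio:.2f}x change={change:+.0f}%")
```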

What the Numbers Mean

L2 norm measures how loudly a layer is speaking. Across all three models, N'Ko activations are 63% to 72% weaker than English (N'Ko / English ratios of 0.28x to 0.37x).

Shannon entropy is higher for N'Ko in every model, by 0.78 to 1.22 bits. This is counterintuitive. Higher entropy means the activation energy is spread more uniformly across dimensions, which sounds like it should be good. But in the context of language processing, it means the model has not learned to concentrate its representation. English activations are more structured, more peaked, more specialized. N'Ko activations are flatter, more diffuse, less organized. The model is distributing its limited N'Ko energy across all neurons rather than routing it to the neurons that matter.
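One way to make the entropy gap concrete, offered as an interpretation and not as part of the original analysis, is to convert bits into an "effective number of dimensions". A distribution with entropy H bits is as spread out as a uniform distribution over 2^H dimensions. The sketch assumes the reported entropy is taken over the hidden dimensions, as the description above suggests; the function name is hypothetical.

```python
# Average entropies in bits from the results table: (English, N'Ko).
avg_entropy_bits = {
    "Qwen3-8B": (8.82, 10.04),
    "Qwen2.5-7B": (8.39, 9.17),
    "Mistral-7B": (9.61, 10.75),
}

def effective_dimensions(entropy_bits):
    """Size of a uniform distribution with the same entropy: 2 ** H."""
    return 2.0 ** entropy_bits

spread_factors = {}
for model, (en_bits, nko_bits) in avg_entropy_bits.items():
    en_dims = effective_dimensions(en_bits)
    nko_dims = effective_dimensions(nko_bits)
    spread_factors[model] = nko_dims / en_dims
    print(f"{model:11s} English ~{en_dims:.0f} dims, N'Ko ~{nko_dims:.0f} dims, "
          f"spread x{spread_factors[model]:.2f}")
```

Read this way, a gap of 0.78 to 1.22 bits means the N'Ko signal is smeared across roughly 1.7 to 2.3 times as many dimensions as the English signal, while carrying far less total energy.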

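Sparsity tells us what fraction of neurons are essentially silent. N'Ko produces 15% to 135% more silent neurons than English (N'Ko / English ratios of 1.15x to 2.35x), depending on the model.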

Kurtosis measures how peaked the activation distribution is, how specialized the active circuits are. For English, high kurtosis means certain neurons fire strongly while most stay quiet. This is the pattern of a model that has learned which circuits to activate for this input. For N'Ko, kurtosis drops 4% to 30% (N'Ko / English ratios of 0.96x to 0.70x), depending on the model.

The Pattern Across Architectures

The most important finding is not any individual number. It is that all three models fail in the same way, despite having different architectures, different training pipelines, different tokenizers, and different companies behind them.

The failure signature is consistent: low L2 norm, high entropy, elevated sparsity, reduced kurtosis. The embedding layers are hit hardest. The middle layers show the widest entropy gaps. The pattern holds for Mistral and for both Qwen generations, all of which use grouped-query attention in differing configurations. It does not depend on the number of layers. It does not depend on the tokenizer's vocabulary size.

The simplest explanation fits all three scans. All three models were trained on data distributions where N'Ko barely exists. The N'Ko Unicode block, U+07C0 through U+07FF, is a statistical rounding error in every major training corpus. The models have not learned to process these characters because they have almost never seen them.
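To check how much N'Ko a given text sample contains, it is enough to test code points against that block. The sketch below is a simple illustration with hypothetical function names; it counts non-whitespace characters only, which is an arbitrary choice for the demo.

```python
NKO_START, NKO_END = 0x07C0, 0x07FF  # N'Ko Unicode block

def is_nko(ch):
    """True if the character's code point lies in the N'Ko block."""
    return NKO_START <= ord(ch) <= NKO_END

def nko_fraction(text):
    """Share of non-whitespace characters that belong to the N'Ko block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_nko(c) for c in chars) / len(chars)

# Three N'Ko letters (written as escapes) mixed with Latin text.
nko_word = "\u07d2\u07de\u07cf"
sample = "abc " + nko_word

print(nko_fraction(sample))
# Every code point in this block encodes to 2 bytes in UTF-8.
print([len(c.encode("utf-8")) for c in nko_word])
```

Running a counter like this over a training corpus sample would give a direct estimate of how rare the script is in that data.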

This is a data problem, not an architecture problem. Swapping model families does not fix it. The fix has to happen upstream, in the training data.

Reproducibility

All scan data is available at `experiments/A_cross_model_brain_scan/results/` in the repository. Each JSON contains per-layer metrics for both English and N'Ko across the full depth of the model. The scanner script and the 100 parallel sentence pairs are included.

Total compute cost: under five dollars across all three scans.

Source Anchor

nko-brain-scanner/blog/experiments/experiment-a-cross-model-brain-scan.md
