preprint · 2026 · Preprint-ready manuscript

The Script That Machines Can't Read: Adapting Large Language Models for N'Ko

This preprint studies script invisibility in modern language models and asks what adaptation recovers when the training distribution barely contains the writing system. The target is N'Ko, but the argument generalizes to scripts that are present in Unicode yet absent from model competence.

Paper workspace

Live draft structure

release-candidate

Artifacts

Draft PDF

Rendered draft for the script-invisibility line. Treat as a live manuscript, not fixed publication copy.

Final split-paper render

Final split-paper artifact from the N'Ko paper release set. Still treated as editable public draft copy here.

Editable source

Submission-ready draft exists, but it should remain editable until venue packaging is chosen.

Source anchors

nko-brain-scanner/paper/final/01-script-invisibility/paper.tex

nko-brain-scanner/paper/current/paper1_dead_circuits.tex

nko-brain-scanner/paper/current/paper3_cross_model.tex

Method tags

activation profiling, script invisibility, model adaptation

Ingest intersections

nko, script, llm, tokenization, activation, adaptation

Status

Submission-ready.

Key claims

01

Unicode support is not the same thing as model visibility.

02

Script adaptation should be measured internally, not only by output fluency.

03

N'Ko is a useful stress test for low-resource script competence.
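
The first claim can be made concrete with a minimal sketch using only the Python standard library. It shows that N'Ko characters are fully named and encodable in Unicode (block U+07C0 to U+07FF), and that each one occupies two UTF-8 bytes. The helper names `is_nko` and `byte_fallback_cost` are hypothetical, and the reading of bytes-per-character as a token cost assumes a tokenizer that has learned no merges for the script and falls back to raw bytes; real tokenizers vary.

```python
import unicodedata

# The N'Ko block in Unicode spans U+07C0..U+07FF.
NKO_BLOCK = range(0x07C0, 0x0800)


def is_nko(ch: str) -> bool:
    """Return True if the character's code point lies in the N'Ko block."""
    return ord(ch) in NKO_BLOCK


def byte_fallback_cost(text: str) -> float:
    """Average UTF-8 bytes per character.

    Assumption: a tokenizer with no learned merges for a script emits
    roughly one token per byte, so this approximates tokens per character.
    """
    return len(text.encode("utf-8")) / len(text)


# Three assigned N'Ko letters, written as escapes to stay font-independent.
nko_sample = "\u07ca\u07cb\u07cc"
latin_sample = "abc"

for ch in nko_sample:
    # Unicode support: every character has an official name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), is_nko(ch))

print("N'Ko bytes per char: ", byte_fallback_cost(nko_sample))
print("Latin bytes per char:", byte_fallback_cost(latin_sample))
```

Nothing here measures model visibility; it only shows that the Unicode side of the contrast is satisfied while the per-character cost is already higher before any model is involved.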

Public reading note

Ready to attach once you choose the public preprint release path.

Standard skeleton

What this paper must keep proving

Schema

problem

A script can be present in Unicode while remaining functionally invisible to model internals.

method

Probe activation behavior and adaptation effects when models encounter N'Ko text.

implementation

Model-family probes, adaptation runs, and activation-profile comparisons.

data

N'Ko text probes and controlled adaptation examples. Release must preserve dataset provenance.

evaluation

Output behavior plus internal activation evidence, because fluency alone is not competence.

references

Tokenizer coverage, multilingual representation learning, activation patching, low-resource scripts.

openQuestions

Which failures are tokenizer-level, which are pretraining-distribution-level, and which are downstream alignment artifacts?
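
As an illustration of the method and evaluation rows above, the following is a minimal sketch of one way an activation-profile comparison could be scored. It is not the paper's procedure: the "dead unit" definition, the threshold, and the function name `dead_unit_fraction` are assumptions introduced here, and the activation values are synthetic placeholders, not results.

```python
from typing import List

# Rows are tokens, columns are hidden units.
Matrix = List[List[float]]


def dead_unit_fraction(acts: Matrix, threshold: float = 1e-3) -> float:
    """Fraction of units whose absolute activation never exceeds threshold.

    Assumed definition for illustration: a unit is 'dead' for an input if
    it stays at or below the threshold on every token of that input.
    """
    n_units = len(acts[0])
    dead = sum(
        1
        for j in range(n_units)
        if max(abs(row[j]) for row in acts) <= threshold
    )
    return dead / n_units


# Synthetic, made-up activations (2 tokens x 4 units). Not measured data.
latin_acts = [[0.9, 0.0, 0.4, 0.2], [0.1, 0.7, 0.0, 0.3]]
nko_before = [[0.0, 0.0, 0.5, 0.0], [0.0, 0.0, 0.1, 0.0]]
nko_after = [[0.3, 0.0, 0.5, 0.2], [0.0, 0.6, 0.1, 0.0]]

baseline = dead_unit_fraction(latin_acts)
before = dead_unit_fraction(nko_before)
after = dead_unit_fraction(nko_after)

print("dead fraction, Latin baseline:       ", baseline)
print("dead fraction, N'Ko before adaptation:", before)
print("dead fraction, N'Ko after adaptation: ", after)
print("recovered by adaptation:              ", before - after)
```

The point of an internal score like this is that it can move independently of output fluency, which is why the evaluation row asks for both.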

Checkpoints and references

Proof chain

paper · pending

Claim checkpoint

central-claim slot

Every central claim must point to a proof anchor or remain labeled as speculative.

implementation · pending

Implementation checkpoint

implementation-map slot

Every method should identify the code path, harness, schema, or protocol that embodies it.

experiment · pending

Evidence checkpoint

evidence-manifest slot

Every reported result should point to run IDs, packet IDs, data snapshots, commits, or review artifacts.

external-reference · pending

Reference checkpoint

references slot

Every external claim should resolve to a cited paper, benchmark, standard, or documented prior system.

paper · pending

Release checkpoint

release-gate slot

Every PDF needs a named condition before it can move from draft to citation-ready.
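
The proof chain above can be sketched as a small data structure with a gate check. This is a hypothetical encoding, not the workspace's actual schema: the `Checkpoint` class, its field names, the `"resolved"` status value, and the example anchor string are all assumptions. Only the kinds, slot names, and the `pending` status come from the page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Checkpoint:
    kind: str                      # e.g. "paper", "experiment"
    slot: str                      # e.g. "evidence-manifest"
    status: str = "pending"        # "pending" is the only status on the page
    anchors: List[str] = field(default_factory=list)  # run IDs, commits, ...

    def is_resolved(self) -> bool:
        """Resolved only if no longer pending AND backed by an anchor."""
        return self.status != "pending" and len(self.anchors) > 0


def blocking_slots(chain: List[Checkpoint]) -> List[str]:
    """Slots that still block a move from draft to citation-ready."""
    return [c.slot for c in chain if not c.is_resolved()]


# The five checkpoints as listed, all pending.
chain = [
    Checkpoint("paper", "central-claim"),
    Checkpoint("implementation", "implementation-map"),
    Checkpoint("experiment", "evidence-manifest"),
    Checkpoint("external-reference", "references"),
    Checkpoint("paper", "release-gate"),
]

print("blocking:", blocking_slots(chain))

# Hypothetical update: a status change without an anchor does not count.
chain[2].status = "resolved"
print("still blocking without an anchor:", "evidence-manifest" in blocking_slots(chain))

chain[2].anchors.append("run-id-placeholder")  # placeholder, not a real run ID
print("blocking after anchoring:", blocking_slots(chain))
```

Requiring both a status change and at least one anchor mirrors the rule that a claim either points to evidence or stays labeled as speculative.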