Grand Diomande Research · Full HTML Reader

N'Ko Phonemic Substrate Validation Report

This report freezes the evidence state before turning the current work into a paper. It separates what is mechanically validated, what is empirically supported, what failed, and what remains a hypothesis.

Date: 2026-06-01

Terminology

The correct term in this package is N'Ko. "NKL" was only a conversational typo;
it is not used as a technical term here.

Unicode uses the formal block/script spelling NKo because apostrophes are not
allowed in Unicode character and block names. The research text uses N'Ko for the
script/language tradition and NKo only when referring to Unicode identifiers.

External Anchors Checked

- Unicode Core Specification, Chapter 19, section 19.4.1, documents N'Ko as a
right-to-left, phonetic script with seven vowels, tone marks, nasalization, and
diacritics used for foreign sounds.
Source: https://www.unicode.org/versions/latest/core-spec/chapter-19/
- Unicode Table 19-3 documents concrete foreign-sound combinations, including
mappings involving U+07ED and U+07F3 for sounds such as [v], [theta], [esh],
[ezh], [schwa], and French [y].
Source: https://www.unicode.org/versions/latest/core-spec/chapter-19/
- ScriptSource independently summarizes N'Ko as a phonemic alphabet with 19
consonants, 7 vowels, 8 diacritics, and later combining marks for foreign sounds.
Source: https://scriptsource.org/scr/Nkoo
- Library of Congress romanization notes list foreign-sound and diacritic handling
for N'Ko cataloging practice.
Source: https://www.loc.gov/catdir/cpso/romanization/N

These sources validate the key standards claim: extending N'Ko with diacritics for
foreign sounds is not an invented premise. The paper's internal full-compositional
layer is still a computational encoding, not an official orthographic reform.

Mechanically Validated Claims

1. Representation coverage can cross the 90% threshold

Implemented evaluator:

`[home]/Desktop/NKo/scripts/evaluate_phoneme_coverage.py`

Reusable module:

`[home]/Desktop/NKo/nko/phonemic_extensions.py`

Verified command:

```bash
cd [home]/Desktop/NKo
python3 scripts/evaluate_phoneme_coverage.py --threshold 0.90
```

Result:

```text
language   layer                   covered  coverage
manding    baseline                27/27     100.0%
manding    unicode_extensions      27/27     100.0%
manding    full_compositional      27/27     100.0%
french     baseline                23/36      63.9%
french     unicode_extensions      29/36      80.6%
french     full_compositional      36/36     100.0%
english    baseline                24/41      58.5%
english    unicode_extensions      30/41      73.2%
english    full_compositional      41/41     100.0%
```

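Interpretation: for the tested Manding, French, and English phoneme inventories, a
deterministic N'Ko extension layer reaches at least 90% coverage. Only the
full_compositional layer does so for all three inventories; the documented Unicode
extensions alone reach 80.6% (French) and 73.2% (English). This is a
representation result, not an ASR accuracy result.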
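
The percentages in the table follow directly from the covered/total counts. As a minimal sketch of that arithmetic, using only the counts reported above (the `ROWS` layout and the function names are illustrative and are not taken from `evaluate_phoneme_coverage.py`):

```python
# Counts copied from the coverage table above: (language, layer, covered, total).
ROWS = [
    ("manding", "baseline", 27, 27),
    ("manding", "unicode_extensions", 27, 27),
    ("manding", "full_compositional", 27, 27),
    ("french", "baseline", 23, 36),
    ("french", "unicode_extensions", 29, 36),
    ("french", "full_compositional", 36, 36),
    ("english", "baseline", 24, 41),
    ("english", "unicode_extensions", 30, 41),
    ("english", "full_compositional", 41, 41),
]


def coverage(covered: int, total: int) -> float:
    """Fraction of inventory phonemes that have an N'Ko representation."""
    return covered / total


def layers_meeting(rows, threshold: float = 0.90):
    """Map each layer to True only if every language reaches the threshold."""
    result = {}
    for _language, layer, covered, total in rows:
        ok = coverage(covered, total) >= threshold
        result[layer] = result.get(layer, True) and ok
    return result


if __name__ == "__main__":
    for language, layer, covered, total in ROWS:
        pct = coverage(covered, total)
        print(f"{language:<10} {layer:<22} {covered}/{total}  {pct:6.1%}")
    print(layers_meeting(ROWS))
```

Under this check, `full_compositional` is the only layer that clears 0.90 for all three inventories.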

2. The implementation is tested

Verified command:

```bash
cd [home]/Desktop/NKo
python3 -m pytest -q tests/test_phonemic_extensions.py tests/test_phonetics.py tests/test_transliterate.py
```

Result:

```text
164 passed in 0.16s
```

The tests enforce extension coverage, composition examples, and non-regression of the
existing phonetics/transliteration tests.

3. IPA to N'Ko label generation is runnable

The bundle generated concrete label examples such as:

```text
theta i-small eng k   -> ߛ߳ߌ߳ߧߞ
v epsilon rhotic i   -> ߝ߭ߐߙ߳ߌ
French y             -> ߎ߳
French nasal vowel   -> vowel + ߲
```

This validates that text/IPA-to-N'Ko label construction is a deterministic pipeline.
It does not validate acoustic recognition.
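
To illustrate what "deterministic" means here, the following is a minimal table-lookup sketch. It is not the code in `label_ipa_corpus.py`. The `TABLE` contents, the `label` function, and the use of the IPA combining tilde to signal nasal vowels are assumptions made for the example; only the French [y] entry and the "vowel + nasalization mark" rule come from the examples above.

```python
from typing import Dict, List, Sequence, Tuple

NASALIZATION = "\u07F2"      # NKO COMBINING NASALIZATION MARK
IPA_NASAL_TILDE = "\u0303"   # COMBINING TILDE, marks nasal vowels in IPA

# Illustrative table. The [y] entry follows the "French y" example above;
# the [a] entry (NKO LETTER A) is an assumption added for the demo.
TABLE: Dict[str, str] = {
    "y": "\u07CE\u07F3",  # NKO LETTER U + NKO COMBINING DOUBLE DOT ABOVE
    "a": "\u07CA",
}


def label(symbols: Sequence[str], table: Dict[str, str]) -> Tuple[str, List[str]]:
    """Map IPA symbols to an N'Ko label; return the label and unknown symbols."""
    out: List[str] = []
    unknown: List[str] = []
    for sym in symbols:
        # A trailing tilde means "nasal vowel": look up the base vowel,
        # then append the N'Ko nasalization mark.
        nasal = sym.endswith(IPA_NASAL_TILDE)
        base = sym[: -len(IPA_NASAL_TILDE)] if nasal else sym
        if base not in table:
            unknown.append(sym)
            continue
        out.append(table[base] + (NASALIZATION if nasal else ""))
    return "".join(out), unknown


if __name__ == "__main__":
    text, missing = label(["y", "a" + IPA_NASAL_TILDE, "q"], TABLE)
    print(text, missing)
```

The same input always yields the same label, and symbols without an entry are reported rather than guessed, which is the property a coverage count depends on.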

Operational label harness:

`[home]/Desktop/nko-brain-scanner/experiments/phonemic_substrate/label_ipa_corpus.py`

Verified default-sample result:

```text
rows: 4
layer: full_compositional
coverage: 1.0
covered symbols: 29/29
unknown symbols: 0
```

Empirically Supported Claims

1. Ungoverned N'Ko generation is harmful on the 500-row AGP pilot

Archived artifacts:

`[home]/Desktop/nko-brain-scanner/artifacts/agp_pilot/`

Bundle summary:

`[home]/Desktop/nko-brain-scanner/artifacts/phonemic_substrate/overnight_bundle_2026-06-01/bundle_summary.md`

Old adapter:

```text
ASR CER:              0.3106
Blind proposal CER:   0.4701
Blind delta:          +15.94pp
Direct better/same/worse: 21/169/310
```

Interpretation: a generic N'Ko-emitting LLM adapter can produce valid-looking N'Ko
while moving away from the reference. Script validity is not correction correctness.
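
For readers unfamiliar with the metrics, here is a minimal sketch of how CER, a percentage-point delta, and better/same/worse counts are commonly computed. It assumes the usual definition (Levenshtein distance over reference length, pooled across rows, counted in code points). The report does not state which averaging or character unit the pilot used, so treat those choices as assumptions. The function names are illustrative.

```python
from typing import Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over code points (row-by-row dynamic programme)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(hyps: Sequence[str], refs: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    errors = sum(edit_distance(h, r) for h, r in zip(hyps, refs))
    chars = sum(len(r) for r in refs)
    return errors / chars


def delta_pp(cer_after: float, cer_before: float) -> float:
    """Change in CER expressed in percentage points."""
    return (cer_after - cer_before) * 100


def better_same_worse(
    asr: Sequence[str], proposals: Sequence[str], refs: Sequence[str]
) -> Tuple[int, int, int]:
    """Count rows where the proposal is closer to, as close as, or further from the reference."""
    better = same = worse = 0
    for a, p, r in zip(asr, proposals, refs):
        d_asr, d_prop = edit_distance(a, r), edit_distance(p, r)
        if d_prop < d_asr:
            better += 1
        elif d_prop == d_asr:
            same += 1
        else:
            worse += 1
    return better, same, worse
```

Because N'Ko tone and nasalization marks are combining characters, a code-point CER counts a base letter and its mark as separate characters; a grapheme-level CER would give different numbers.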

2. AGP governance neutralizes most harm but is not sufficient

Old adapter, gated:

```text
Gated CER:            0.3120
Gated delta:          +0.14pp
Accepted:             23/500
Accepted worse:       18
```

Interpretation: the gate reduced a +15.94pp blind catastrophe to a +0.14pp near-wash.
That supports the governance thesis, while the 18 accepted-worse rows prevent any
"perfect safety" claim.

3. Minimal-edit SFT did not close the loop

Minimal-edit adapter:

```text
Blind proposal CER:   0.4269
Blind delta:          +11.63pp
Direct better/same/worse: 14/225/261
Gated CER:            0.3156
Gated delta:          +0.50pp
Accepted:             89/500
Accepted worse:       69
```

Interpretation: the SFT model learned smaller edits and more refusals, but small wrong
edits passed the edit-size gate. Loop-v1 failed usefully: the next gate must score
evidence/correctness, not edit size alone.
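
The failure mode can be shown with a toy edit-size gate. This is a sketch under assumptions, not the AGP gate itself: the `gate` function, its return values other than the `edit_too_large` reason, and the example strings are invented for illustration.

```python
from typing import Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def gate(asr: str, proposal: str, cap: int = 2) -> Tuple[str, str]:
    """Accept a proposal only if it differs from the ASR text by at most `cap` edits.

    The gate never sees the reference, so it cannot tell a small
    correct edit from a small wrong one.
    """
    size = edit_distance(asr, proposal)
    if size == 0:
        return asr, "no_change"
    if size > cap:
        return asr, "edit_too_large"
    return proposal, "accepted"


if __name__ == "__main__":
    reference = "abcde"   # toy reference, unknown to the gate
    asr = "abcdx"         # one error against the reference
    proposal = "abzdx"    # one-edit proposal that adds a second error
    output, reason = gate(asr, proposal, cap=2)
    print(reason, edit_distance(asr, reference), edit_distance(output, reference))
```

In the toy case the proposal is accepted because it is small, yet the output is further from the reference than the ASR text was. A gate keyed only to edit size has no way to reject it.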

4. Oracle headroom shows the architecture could improve if proposals become trustworthy

Oracle facts from the full bridge:

```text
Rows:                         29,060
Accepted at cap=2:             3,302
Rejected:                     25,758
edit_too_large rejections:    22,914
Median rejected edit size:         9
Accepted-worse:                   0
```

Cap sweep, oracle proposals:

```text
cap=2      -0.46pp
cap=8      -5.57pp
cap=12     -9.97pp
cap=999    -29.15pp
```

Same cap sweep, real proposals:

```text
cap=2      +0.14pp
cap=8      +2.36pp
cap=12     +3.99pp
cap=999    +15.59pp
```

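Interpretation: proposal correctness is the bottleneck. With oracle proposals,
loosening the cap from 2 to 999 moves the delta from -0.46pp to -29.15pp; with real
proposals the same sweep moves it from +0.14pp to +15.59pp. The edit budget is
protective while proposals are bad, and it becomes usable headroom only once
proposals are trustworthy.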
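
The opposite signs of the two sweeps can be reproduced on toy data. The sketch below is illustrative only: the rows are invented strings rather than AGP data, and the gate is reduced to a bare edit cap, whereas the counts above show that the real gate also rejects for reasons other than `edit_too_large`.

```python
from typing import List, Tuple

Row = Tuple[str, str, str]  # (asr, proposal, reference)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def gated_delta_pp(rows: List[Row], cap: int) -> float:
    """CER change in percentage points when proposals within `cap` edits are accepted."""
    base_errors = gated_errors = chars = 0
    for asr, proposal, ref in rows:
        accepted = edit_distance(asr, proposal) <= cap
        output = proposal if accepted else asr
        base_errors += edit_distance(asr, ref)
        gated_errors += edit_distance(output, ref)
        chars += len(ref)
    return (gated_errors - base_errors) / chars * 100


# Toy rows: oracle proposals equal the reference, bad proposals move away from it.
ORACLE: List[Row] = [
    ("abcdef", "abcdxf", "abcdxf"),
    ("abcdef", "xyzdef", "xyzdef"),
    ("abcdef", "abcdef", "abcdef"),
]
BAD: List[Row] = [
    ("abcdef", "abcdeg", "abcdxf"),
    ("abcdef", "qqqqef", "xyzdef"),
    ("abcdef", "abcdef", "abcdef"),
]

if __name__ == "__main__":
    for cap in (2, 999):
        print(cap, round(gated_delta_pp(ORACLE, cap), 2), round(gated_delta_pp(BAD, cap), 2))
```

Raising the cap helps when proposals are correct and hurts when they are not. The cap setting only controls how much of the proposer's quality, good or bad, reaches the output.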

Claims That Remain Hypotheses

1. Direct cross-lingual audio to N'Ko without retraining

The representation layer does not require retraining. Direct ASR usually does.

Claim boundary:

```text
text/IPA -> N'Ko labels: deterministic, no retraining
audio -> phoneme labels: requires an acoustic recognizer
audio -> N'Ko directly: requires ASR training/adaptation unless the model is featural
```

The zero-shot acoustic claim becomes plausible only if the acoustic model predicts
features such as place, manner, voicing, vowel height, rounding, nasality, and tone.
That is the FAC hypothesis. It is not yet validated by the current artifacts.
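
To make the idea of a featural output space concrete, here is a toy sketch. It is not an implementation of FAC and validates nothing: the `Features` type, the lookup table, and the `decode` function are hypothetical, and the table holds only a few standard IPA descriptions.

```python
from typing import NamedTuple, Optional


class Features(NamedTuple):
    """A deliberately small feature bundle; a real model would need more dimensions."""
    place: str
    manner: str
    voiced: bool


# Standard IPA descriptions for four consonants.
FEATURES_TO_IPA = {
    Features("bilabial", "plosive", False): "p",
    Features("bilabial", "plosive", True): "b",
    Features("labiodental", "fricative", False): "f",
    Features("labiodental", "fricative", True): "v",
}


def decode(features: Features) -> Optional[str]:
    """Look up the phoneme for a predicted feature bundle, or None if unlisted."""
    return FEATURES_TO_IPA.get(features)


if __name__ == "__main__":
    # If a recognizer has learned "labiodental fricative" from [f] and
    # "voiced" from [b], the bundle for [v] is a new combination of
    # feature values it has already seen.
    print(decode(Features("labiodental", "fricative", True)))
```

The sketch only shows why a featural model could, in principle, emit a phoneme it was never trained on as a unit. Whether an acoustic model can predict such features reliably across languages is the unvalidated part.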

2. Phrase-level transfer through N'Ko

The idea that phrase structures can move across languages through a shared N'Ko
phonemic/semantic layer is promising, but currently conceptual. It needs a separate
evaluation:

```text
source phrase -> phonemic/semantic representation -> N'Ko substrate -> target language
```

The representation layer can carry sounds. Meaning transfer needs semantic alignment
and language-specific generation.

3. A self-improving correction loop that lowers CER

The infrastructure exists:

```text
ASR -> proposer -> gate -> decisions -> SFT data -> new proposer
```

But loop-v1 did not lower CER. The next version needs:

- a correctness/evidence gate
- proposal confidence or a likelihood ratio
- acoustic support checks
- edit localization, not just edit size
- stricter handling of uncertain/boundary partitions
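
One pass of the ASR -> proposer -> gate -> decisions -> SFT data loop can be sketched as follows. All names here (`Decision`, `run_pass`, `to_sft_rows`) and the shape of the SFT rows are assumptions for illustration; the report does not specify how decisions are converted into training data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

Proposer = Callable[[str], str]
Gate = Callable[[str, str], Tuple[bool, str]]  # returns (accepted, reason)


@dataclass
class Decision:
    asr: str
    proposal: str
    accepted: bool
    reason: str


def run_pass(asr_rows: Iterable[str], proposer: Proposer, gate: Gate) -> List[Decision]:
    """Run the proposer and gate over each ASR hypothesis and record every decision."""
    decisions = []
    for asr in asr_rows:
        proposal = proposer(asr)
        accepted, reason = gate(asr, proposal)
        decisions.append(Decision(asr, proposal, accepted, reason))
    return decisions


def to_sft_rows(decisions: Iterable[Decision]) -> List[Dict[str, str]]:
    """Assumed conversion: accepted proposals become targets, rejections teach 'leave as is'."""
    return [
        {"input": d.asr, "target": d.proposal if d.accepted else d.asr}
        for d in decisions
    ]


if __name__ == "__main__":
    # Toy stand-ins for the real proposer and gate.
    toy_proposer = lambda s: s.replace("x", "c")
    toy_gate = lambda a, p: (a != p, "accepted" if a != p else "no_change")
    print(to_sft_rows(run_pass(["abx", "abc"], toy_proposer, toy_gate)))
```

In this sketch, whatever the gate accepts becomes a training target for the next proposer, so a gate that admits wrong edits would feed them back into training. That is one reason the items listed above focus on evidence and correctness.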

Paper-Safe Interpretation

The strongest accurate claim is:

> N'Ko is an extensible phonemic substrate whose representation coverage can be
> raised mechanically using documented foreign-sound diacritics plus bounded internal
> composition; this enables automatic construction of N'Ko phonemic labels for
> cross-lingual ASR training, while the current AGP experiments show why any
> generative correction loop must be governed by evidence, not left to an LLM.

The claim to avoid:

> We can recognize any language in N'Ko without retraining.

Correct replacement:

> We can represent many languages in N'Ko without retraining; recognizing them from
> audio requires either ASR adaptation or a validated featural acoustic model.

Reproducible Artifacts

- Coverage artifact:
`[home]/Desktop/nko-brain-scanner/artifacts/phonemic_coverage/coverage_2026-06-01.md`
- Overnight bundle:
`[home]/Desktop/nko-brain-scanner/artifacts/phonemic_substrate/overnight_bundle_2026-06-01/bundle_summary.md`
- Bundle JSON:
`[home]/Desktop/nko-brain-scanner/artifacts/phonemic_substrate/overnight_bundle_2026-06-01/bundle_summary.json`
- IPA label harness:
`[home]/Desktop/nko-brain-scanner/experiments/phonemic_substrate/label_ipa_corpus.py`
- IPA label report:
`[home]/Desktop/nko-brain-scanner/artifacts/phonemic_substrate/overnight_bundle_2026-06-01/label_examples_report.json`
- AGP pilot artifacts:
`[home]/Desktop/nko-brain-scanner/artifacts/agp_pilot/`
- Paper proposal:
`[home]/Desktop/nko-brain-scanner/NKO-PHONEMIC-SUBSTRATE-PROPOSAL.md`

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

nko-brain-scanner/NKO-VALIDATION-REPORT.md
