Grand Diomande Research · Full HTML Reader

NKO-1.3 Complete — `nko.transliterate` Canonical Engine

Status: ✅ DONE
Date: 2025-07-19
Tests: 58/58 passing (0.04s)

---

What Was Built

`nko/transliterate.py` — the canonical, unified transliteration engine for the N'Ko Unified Platform. Consolidates 6 scattered implementations into one authoritative module.

Implementations Consolidated

| # | Source | Type | Status |
|---|--------|------|--------|
| 1 | `core/transliteration/` (Bridge + NkoHandler + ArabicHandler + LatinHandler) | Python, IPA intermediary architecture | Primary source, most structured |
| 2 | `core/prediction/translation_bridge.py` | Python, keyboard-ai with dictionary + SQLite | Superseded (duplicate transliteration logic) |
| 3 | `tools/telegram-bot/bridge_core.py` | Python, standalone fallback bridge | Superseded |
| 4 | `tools/pwa/js/bridge-core.js` | JavaScript, client-side bridge | Reference only (JS, different runtime) |
| 5 | `core/audio/phoneme.py` | Python, PhonemeMapper with char maps | Phoneme maps referenced |
| 6 | `core/pipeline/stages.py` | Python, TransliterationStage wrapping Bridge | Pipeline consumer, not source |

Canonical source chosen: `core/transliteration/` — best architecture (IPA intermediary), most complete character maps, cleanest separation of concerns. Enhanced with corrections from other implementations.

API Surface

Module-level functions (quick usage)

```python
from nko.transliterate import transliterate, detect_script, convert_all, batch, to_ipa, analyze

# Auto-detect → Latin
transliterate("ߒߞߏ")  # → "nkɔ"

# Explicit source/target
transliterate("baka", target="nko")  # → "ߓߊߞߊ"
transliterate("سلام", target="latin")  # → "slam"

# Script detection
detect_script("ߒߞߏ")  # → Script.NKO

# All scripts at once
convert_all("ߒߞߏ")  # → {"nko": "ߒߞߏ", "latin": "nkɔ", "arabic": "نكو"}

# Batch
batch(["ߒߞߏ", "ߓߊ"], target="latin")  # → [TranslitResult(...), ...]

# IPA intermediary
to_ipa("ߒߞߏ")  # → "nkɔ"

# Analysis
analyze("ߒߞߏ abc")  # → {"dominant": "nko", "counts": {...}, ...}
```

Class API (full control)

```python
from nko.transliterate import NkoTransliterator, Script, TranslitResult

t = NkoTransliterator()
result = t.convert("ߒߞߏ", source=Script.NKO, target=Script.LATIN)
# TranslitResult(source_text='ߒߞߏ', source_script=Script.NKO,
#                target_text='nkɔ', target_script=Script.LATIN,
#                ipa='nkɔ', confidence=1.0)
```

Exported character maps (for phonetics integration)

```python
from nko.transliterate import (
    NKO_TO_IPA, IPA_TO_NKO, IPA_TO_LATIN, IPA_TO_ARABIC,
    ARABIC_TO_IPA, NKO_VOWELS_TO_IPA, NKO_CONSONANTS_TO_IPA,
    NKO_TONE_MARKS, NKO_DIGITS_TO_WESTERN,
)
```

Scripts Supported

| Direction | Status |
|-----------|--------|
| N'Ko → Latin | ✅ Full (7 vowels, 19+ consonants, digits, tone marks, punctuation) |
| Latin → N'Ko | ✅ Full (single chars + digraphs: ny, ng, gb, ch, sh, dj, rr) |
| N'Ko → Arabic | ✅ Via IPA intermediary |
| Arabic → N'Ko | ✅ Via IPA intermediary |
| Arabic → Latin | ✅ Full Arabic consonants + vowel diacritics |
| Latin → Arabic | ✅ Via IPA intermediary |
| Script detection | ✅ Unicode-range voting (NKO: U+07C0-07FF, Arabic: U+0600-06FF+) |
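
The "Unicode-range voting" in the last row can be illustrated with a minimal sketch. This is not the module's actual `detect_script`: the `Script` enum below is a stand-in, `Script.UNKNOWN` and the function name are hypothetical, only the basic Arabic block is checked (the note's "06FF+" implies further ranges), and every other letter is assumed to count as Latin.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Script(Enum):
    # Stand-in enum; the real Script type lives in nko.transliterate.
    NKO = "nko"
    ARABIC = "arabic"
    LATIN = "latin"
    UNKNOWN = "unknown"  # hypothetical value for text with no letters


def _vote(ch: str) -> Optional[Script]:
    """Classify one character by code point; None means it casts no vote."""
    cp = ord(ch)
    if 0x07C0 <= cp <= 0x07FF:   # N'Ko block
        return Script.NKO
    if 0x0600 <= cp <= 0x06FF:   # basic Arabic block only
        return Script.ARABIC
    if ch.isalpha():             # simplification: any other letter is Latin
        return Script.LATIN
    return None                  # ASCII digits, spaces, punctuation


def detect_script_sketch(text: str) -> Script:
    """Return the script with the most votes; ties go to the first one seen."""
    votes = Counter(v for v in map(_vote, text) if v is not None)
    if not votes:
        return Script.UNKNOWN
    return votes.most_common(1)[0][0]
```

Voting per character, rather than checking only the first one, lets mixed input such as N'Ko text with an embedded Latin word resolve to the dominant script.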

N'Ko Character Coverage

  • 7 vowels: ߊ(a) ߋ(o) ߌ(i) ߍ(e) ߎ(u) ߏ(ɔ) ߐ(ɛ)
  • 19+ consonants (21 entries listed): ߒ(n) ߓ(b) ߔ(p) ߕ(t) ߖ(dʒ) ߗ(tʃ) ߘ(d) ߙ(r) ߚ(rr) ߛ(s) ߜ(gb) ߝ(f) ߞ(k) ߟ(l) ߡ(m) ߢ(ɲ) ߣ(n) ߤ(h) ߥ(w) ߦ(j) ߧ(ŋ)
  • 4 alternates: ߠ(na) ߨ(p) ߩ(r) ߪ(s)
  • 10 digits: ߀-߉
  • 5 tone marks: ߫(high) ߬(low) ߭(falling) ߮(rising) ߯(long)
  • 4 punctuation: ߸(,) ߹(.) ߷(!) ߺ(-)
  • 2 combining: ߲(nasalization) ߳(tilde)
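
As a small illustration of how one of these tables can be built, the digit map follows directly from Unicode, where the N'Ko digits occupy U+07C0 (zero) through U+07C9 (nine). The names below are hypothetical stand-ins, not the contents of the exported `NKO_DIGITS_TO_WESTERN`.

```python
# Sketch only: derive the ten digit pairs from their code points.
NKO_DIGITS_SKETCH = {chr(0x07C0 + n): str(n) for n in range(10)}


def digits_to_western(text: str) -> str:
    """Replace N'Ko digits with ASCII digits; leave everything else unchanged."""
    return "".join(NKO_DIGITS_SKETCH.get(ch, ch) for ch in text)
```

The mapping works character by character on the stored (logical) order of the string, so it does not need to know anything about right-to-left display.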

Test Results

58 passed in 0.04s

Test classes:
  TestScriptDetection ......... 11 tests (detect NKO/Latin/Arabic/empty/mixed/extended, is_* helpers)
  TestNkoToLatin .............. 9 tests (basic word, vowels, consonants, digraphs, nasals, spaces, digits, convenience, auto-detect)
  TestLatinToNko .............. 4 tests (basic, digraphs, spaces, convenience)
  TestNkoArabic ............... 3 tests (N'Ko→Arabic, Arabic→N'Ko, Arabic→Latin)
  TestIPA ..................... 3 tests (N'Ko→IPA, Latin→IPA, convenience)
  TestEdgeCases ............... 8 tests (empty, identity, whitespace, tones, punctuation, mixed, invalid script, long vowel)
  TestBatchAndConvertAll ...... 5 tests (batch, empty batch, convert_all, convenience functions)
  TestTranslitResult .......... 5 tests (str, repr, IPA, confidence, frozen)
  TestRoundTrip ............... 3 tests (simple CV, consonant cluster, vowels)
  TestAnalyze ................. 3 tests (NKO text, mixed, convenience)
  TestCharacterMaps ........... 4 tests (vowels complete, consonants complete, reverse map, digits)

Architecture

Input Text
    │
    ▼
detect_script() ──→ Script.NKO / Script.LATIN / Script.ARABIC
    │
    ▼
_to_ipa(text, source_script)
    │  NKO:    char-by-char lookup in NKO_TO_IPA
    │  Latin:  digraph-first matching (ny,ng,gb,ch...) then single chars
    │  Arabic: char-by-char lookup in ARABIC_TO_IPA
    │
    ▼
IPA String (phonetic intermediary)
    │
    ▼
_from_ipa(ipa, target_script)
    │  Latin:  longest-match against IPA_TO_LATIN (dʒ→j, tʃ→c, ŋ→ng...)
    │  NKO:    longest-match against IPA_TO_NKO
    │  Arabic: longest-match against IPA_TO_ARABIC
    │
    ▼
Target Text
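
The two-stage flow above can be sketched with toy tables. This is an illustration under stated assumptions, not the code in `nko/transliterate.py`: the `*_TOY` maps hold a handful of entries copied from the coverage list in this note, scripts are plain strings rather than the `Script` enum, and `longest_match` and `convert_sketch` are hypothetical names.

```python
# Toy maps: a few entries taken from the coverage list in this note.
# The real NKO_TO_IPA / IPA_TO_* tables are far larger.
NKO_TO_IPA_TOY = {"ߒ": "n", "ߞ": "k", "ߏ": "ɔ", "ߓ": "b", "ߊ": "a", "ߢ": "ɲ", "ߖ": "dʒ"}
LATIN_TO_IPA_TOY = {"ny": "ɲ", "dj": "dʒ"}   # digraphs; other letters pass through
IPA_TO_LATIN_TOY = {"ɲ": "ny", "dʒ": "j"}    # other symbols pass through
IPA_TO_NKO_TOY = {ipa: nko for nko, ipa in NKO_TO_IPA_TOY.items()}

TO_IPA = {"nko": NKO_TO_IPA_TOY, "latin": LATIN_TO_IPA_TOY}
FROM_IPA = {"nko": IPA_TO_NKO_TOY, "latin": IPA_TO_LATIN_TOY}


def longest_match(text: str, table: dict) -> str:
    """Greedy left-to-right replacement, trying the longest key first."""
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            out.append(text[i])  # unknown symbol: copy unchanged
            i += 1
    return "".join(out)


def convert_sketch(text: str, source: str, target: str) -> str:
    """Source script → IPA → target script, as in the diagram."""
    ipa = longest_match(text, TO_IPA[source])
    return longest_match(ipa, FROM_IPA[target])
```

With these toy tables, `convert_sketch("nya", "latin", "nko")` consumes `ny` as one unit before looking at `a`, which is the behaviour that digraph-first matching is meant to guarantee.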

Integration with nko.phonetics

The module exports all character maps as public symbols. When `nko.phonetics` (NKO-1.2) is ready, it can import the maps it needs directly:

```python
from nko.transliterate import NKO_TO_IPA, NKO_VOWELS_TO_IPA, NKO_CONSONANTS_TO_IPA, NKO_TONE_MARKS
```

These are the single source of truth for N'Ko → IPA mappings across the entire platform.
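
One thing a downstream consumer may need to handle is reverse lookup. The coverage list in this note gives the same value to more than one letter (for example both ߒ and ߣ are listed as "n"), so a plain dict inversion would silently keep whichever entry came last. The sketch below shows one way to make that choice explicit. It is an assumption about how a consumer might do this, not a description of how `IPA_TO_NKO` is actually built, and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Toy forward map; entries copied from the coverage list in this note.
FORWARD_TOY: Dict[str, str] = {"ߒ": "n", "ߣ": "n", "ߓ": "b", "ߊ": "a"}


def collisions(forward: Dict[str, str]) -> Dict[str, List[str]]:
    """Return IPA values that more than one source character maps to."""
    by_ipa: Dict[str, List[str]] = defaultdict(list)
    for src, ipa in forward.items():
        by_ipa[ipa].append(src)
    return {ipa: srcs for ipa, srcs in by_ipa.items() if len(srcs) > 1}


def invert(forward: Dict[str, str],
           preferred: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Invert a map; first entry wins unless `preferred` names another one."""
    reverse: Dict[str, str] = {}
    for src, ipa in forward.items():
        reverse.setdefault(ipa, src)
    for ipa, src in (preferred or {}).items():
        if forward.get(src) != ipa:
            raise ValueError(f"{src!r} does not map to {ipa!r}")
        reverse[ipa] = src
    return reverse
```

Running `collisions` over a forward map before inverting it gives a quick list of the entries where a round trip cannot be exact without an explicit preference.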

Files Modified/Created

  • Created: `nko/transliterate.py` (22.7 KB — canonical engine)
  • Created: `tests/test_transliterate.py` (15.6 KB — 58 tests)
  • Updated: `nko/__init__.py` (clean imports)
  • Created: `NKO-1.3-COMPLETE.md` (this file)

Key Design Decisions

1. IPA intermediary: all conversions route through IPA. This keeps the phonetic mapping explicit and means a new script only needs a pair of IPA↔NewScript maps.
2. ɔ and ɛ preserved in Latin output: Manding Latin orthography uses ɔ and ɛ, so they are not flattened to "o"/"e", which would lose the vowel distinction.
3. Longest-match for multi-char tokens: Latin digraphs (ny, ng, gb) and IPA sequences (dʒ, tʃ) are matched before single characters to avoid ambiguity.
4. Frozen TranslitResult: an immutable dataclass prevents accidental mutation.
5. Module-level singleton: `_DEFAULT_ENGINE` avoids re-initialization cost for the convenience functions.
6. Character maps exported: downstream modules (phonetics, audio, prediction) can use the same maps.
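
Decisions 4 and 5 can be illustrated together. The sketch below is a simplified stand-in, not the module's code: `TranslitResultSketch` carries only some of the fields shown in the `TranslitResult` repr earlier in this note, and `EngineSketch` and `transliterate_sketch` are hypothetical names with a toy three-letter table.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class TranslitResultSketch:
    """Immutable result record; assigning to a field raises FrozenInstanceError."""
    source_text: str
    target_text: str
    ipa: str
    confidence: float = 1.0


class EngineSketch:
    """Stand-in engine; counts constructions to show it is built only once."""
    instances = 0

    def __init__(self) -> None:
        EngineSketch.instances += 1
        # Toy table; a real engine would build its full lookup maps here.
        self._nko_to_ipa = {"ߒ": "n", "ߞ": "k", "ߏ": "ɔ"}

    def convert(self, text: str) -> TranslitResultSketch:
        ipa = "".join(self._nko_to_ipa.get(ch, ch) for ch in text)
        return TranslitResultSketch(source_text=text, target_text=ipa, ipa=ipa)


# Built once at import time and shared by every convenience call.
_DEFAULT_ENGINE = EngineSketch()


def transliterate_sketch(text: str) -> TranslitResultSketch:
    """Module-level convenience wrapper around the shared engine."""
    return _DEFAULT_ENGINE.convert(text)
```

Because the result is frozen, callers that want a modified copy would use `dataclasses.replace` rather than assigning to a field.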

Source Anchor

NKo/NKO-1.3-COMPLETE.md
