Mohamed Diomande

Extracted abstract or opening context

**The Script That Machines Can't Read: Adapting Large Language Models for N'Ko**

A systematic study of how large language models process N'Ko (U+07C0-U+07FF), an alphabetic script used by more than 40 million Manding-language speakers in West Africa. We perform activation profiling ("brain scanning"), train a multi-stage adaptation pipeline, build a script-specific BPE tokenizer, implement phonotactically constrained decoding, and design a retrieval-centric multimodal ASR architecture. The resulting system is the world's first ASR system that outputs N'Ko script directly from audio: all existing Bambara ASR systems (MALIBA-AI, Meta MMS, Google USM) output Latin transliteration only.

**Architecture**: Whisper large-v3 (frozen encoder) + 4x downsampling + character-level BiLSTM with CTC (5.4M parameters) + 65 N'Ko character classes + FSM syllable validation after decoding.

**Training data**: bam-asr-early (37 hours, 37,306 human-labeled samples) with a cross-script bridge (Latin→N'Ko transliteration).
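The CTC output stage and the N'Ko code point range mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the blank index, the toy four-entry vocabulary, and the function names `is_nko`, `nko_ratio`, `ctc_greedy_collapse`, and `decode_to_text` are all assumptions made for illustration, and greedy decoding stands in for whatever decoding strategy the system actually uses. The sketch shows only the standard CTC collapse rule (merge repeated frame labels, then drop blanks) and a check that decoded text stays inside the block U+07C0-U+07FF.

```python
# Illustrative sketch only; names and the toy vocabulary are hypothetical.
NKO_START, NKO_END = 0x07C0, 0x07FF  # N'Ko Unicode block, as given in the abstract
BLANK = 0  # assumed index of the CTC blank class


def is_nko(ch: str) -> bool:
    """Return True if a single character lies in the N'Ko block."""
    return NKO_START <= ord(ch) <= NKO_END


def nko_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are N'Ko code points."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_nko(c) for c in chars) / len(chars)


def ctc_greedy_collapse(frame_ids, blank=BLANK):
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out


def decode_to_text(frame_ids, vocab, blank=BLANK):
    """Map collapsed class indices to characters using a class vocabulary."""
    return "".join(vocab[i] for i in ctc_greedy_collapse(frame_ids, blank))


# Toy vocabulary: blank plus three code points from the N'Ko block.
# The real system is described as having 65 classes; their order is not given.
toy_vocab = ["", "\u07d2", "\u07de", "\u07cf"]

# Per-frame argmax indices, as a CTC head might emit after downsampling.
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
text = decode_to_text(frames, toy_vocab)
```

Here `text` contains three characters, one per run of non-blank frames, and `nko_ratio(text)` can serve as a quick sanity check that a decoder emits N'Ko rather than Latin output. A blank between two identical labels, as in `[1, 0, 1]`, keeps both occurrences, which is how CTC represents doubled characters.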

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.