NKo-Qwen3-8B-V3: N'Ko Script Adaptation with nicolingua Integration

---
license: apache-2.0
language:
- nqo
- en
- bm
base_model: mlx-community/Qwen3-8B-8bit
tags:
- nko
- manding
- low-resource
- african-languages
- brain-scan
- lora
- mlx
- constrained-decoding
- bpe-tokenizer
- retrieval-asr
datasets:
- custom
- nicolingua
pipeline_tag: text-generation
---

A Qwen3-8B model adapted for N'Ko script processing through multi-stage training (CPT + SFT + BPE-aware + vocabulary extension + nicolingua integration), with admissibility-constrained decoding via a syllable FSM and a retrieval-centric multimodal ASR architecture.

| Metric | Base | V1 (3-Stage) | V2 (Extended) | V3 (nicolingua) |
|--------|------|-------------|---------------|-----------------|
| N'Ko Perplexity | 11.02 | **6.00** | --- | 62.89 |
| Translation Tax | 2.90x | **0.70x** | --- | 7.11x |
| Val Loss | 4.290 | --- | 3.506 | **3.275** |
| Training Examples | --- | 25,100 | 33,912 | **92,184** |
| LoRA Layers | --- | 8 | 8 | 8 |
| Syllable Validity | 89.8% | --- | 100% (FSM) | 99.8% / 100% (FSM) |
| Mode Collapse | --- | No | Yes (20/20) | **No (3/20)** |

**Note**: V3's higher N'Ko perplexity is an artifact of vocabulary extension. The extended model (152,192 tokens) tokenizes N'Ko differently than the base model (151,936 tokens), making PPL scores non-comparable across vocabulary sizes. V3's key contribution is fixing V2's mode collapse while maintaining the extended vocabulary's superior training loss (3.275 vs V1's 4.290).

### V1: Three-Stage Adapter (Base Vocabulary)

- **Training**: CPT (17,360) + SFT (21,240) + BPE-aware (25,100) = 25,100 total
- **Config**: LoRA rank 8, scale 20.0, top 8 layers, lr 1e-5/5e-6/3e-6
- **Result**: N'Ko PPL 11.02 -> 6.00, Translation Tax 2.90x -> 0.70x
- **Strength**: Higher generation diversity on base vocabulary
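
The note above says perplexity is not comparable across vocabulary sizes. A minimal sketch of why, using only the standard definition of token-level perplexity (the exponential of the mean negative log-likelihood per token): if two tokenizers split the same text into different numbers of tokens, the same total likelihood gives different perplexities. All numbers and function names here are hypothetical illustrations, not values from this model's evaluation. A per-character measure is shown as one common tokenizer-independent alternative; it is not something the model card reports.

```python
import math


def token_perplexity(total_nll_nats: float, n_tokens: int) -> float:
    """Token-level perplexity: exp of the mean negative log-likelihood per token."""
    return math.exp(total_nll_nats / n_tokens)


def bits_per_char(total_nll_nats: float, n_chars: int) -> float:
    """Total NLL converted to bits and normalized by character count.

    This depends only on the text and its total likelihood, not on how
    the tokenizer happened to segment the text.
    """
    return total_nll_nats / (n_chars * math.log(2))


# Hypothetical: the same 1,000-character passage, assigned the same total
# negative log-likelihood (in nats) under two different tokenizations.
total_nll = 1200.0
n_chars = 1000
tokens_fine = 500    # tokenizer A (hypothetical): more, shorter tokens
tokens_coarse = 300  # tokenizer B (hypothetical): fewer, longer tokens

ppl_fine = token_perplexity(total_nll, tokens_fine)      # exp(1200 / 500)
ppl_coarse = token_perplexity(total_nll, tokens_coarse)  # exp(1200 / 300)
bpc = bits_per_char(total_nll, n_chars)

print(f"PPL with {tokens_fine} tokens: {ppl_fine:.2f}")
print(f"PPL with {tokens_coarse} tokens: {ppl_coarse:.2f}")
print(f"Bits per character (either tokenization): {bpc:.3f}")
```

In this sketch the coarser tokenization reports the higher perplexity even though the model assigns the passage exactly the same probability. A shift in token-level PPL after a vocabulary change therefore does not by itself indicate better or worse modeling.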

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.