working paper | preprint structure candidate | score 64
Overview
Rewritten abstract and overview with a modular breakdown and explicit mention of bidirectional translation across English, French, N’ko, and Bambara.
Extracted abstract or opening context
This project develops a modular multilingual system for translation and speech processing in low-resource West African languages. It focuses on N’ko and Bambara and bridges them with French and English. Using RobotsMali/bam-asr-early as the foundational ASR dataset, the system integrates automatic speech recognition (ASR), translation, and text-to-speech synthesis (TTS) in a single pipeline. Built on self-supervised speech models (wav2vec 2.0) and multilingual transformers (mBART, mT5, or LLaMA derivatives), it supports bidirectional translation across five language pairs: English ↔ French, English ↔ N’ko, English ↔ Bambara, French ↔ N’ko, and French ↔ Bambara. The framework is designed to support script-aware tokenization for N’ko, grammar-sensitive fine-tuning, and mixture-of-experts routing for language-specific adaptation. The system is also structured for iterative learning, so that it improves as new community-contributed data is added. The intended outcome is a scalable, accessible platform that supports literacy, education, and cultural preservation in West Africa while advancing research in multilingual low-resource NLP.
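As a minimal sketch of what "bidirectional" implies for routing, the five pairs named in the abstract can be expanded into ten translation directions. The ISO 639-3 codes (`eng`, `fra`, `nqo`, `bam`) and the helper name `directions` are illustrative choices, not identifiers from the project. N’ko ↔ Bambara is not among the listed pairs, so it is left out here as well.

```python
# Unordered language pairs as listed in the abstract.
# Codes are ISO 639-3 and are an illustrative assumption.
PAIRS = [
    ("eng", "fra"),
    ("eng", "nqo"),
    ("eng", "bam"),
    ("fra", "nqo"),
    ("fra", "bam"),
]


def directions(pairs):
    """Expand each unordered pair into its two translation directions."""
    out = []
    for a, b in pairs:
        out.append((a, b))
        out.append((b, a))
    return out


DIRECTIONS = directions(PAIRS)

if __name__ == "__main__":
    for src, tgt in DIRECTIONS:
        print(f"{src} -> {tgt}")
```

A table like `DIRECTIONS` could serve as the key set for per-direction adapters or expert routing, if the system is organized that way.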
West African languages such as N’ko and Bambara have limited digital resources and are weakly represented in NLP systems. Key challenges include:
• Scarcity of annotated data, especially parallel corpora.
• Lack of robust ASR datasets outside RobotsMali.
• Unique script handling for N’ko (Unicode segmentation, grammar).
• Cross-lingual gaps, especially for English ↔ Bambara and French ↔ N’ko translation.
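The N’ko script handling mentioned above can be illustrated with a small sketch. It relies on the Unicode NKo block (U+07C0 to U+07FF) and on Python's standard `unicodedata` module. The function names are hypothetical, and the cluster grouping is a simplification (base character plus following combining marks), not a full Unicode grapheme segmentation and not the project's tokenizer.

```python
import unicodedata

# Bounds of the Unicode "NKo" block.
NKO_START, NKO_END = 0x07C0, 0x07FF


def is_nko_char(ch):
    """True if the character falls inside the NKo block."""
    return NKO_START <= ord(ch) <= NKO_END


def nko_ratio(text):
    """Share of non-whitespace characters that belong to the NKo block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_nko_char(c) for c in chars) / len(chars)


def segment_clusters(text):
    """Attach combining marks (e.g. N'ko tone marks) to the preceding base."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # keep the mark with its base character
        else:
            clusters.append(ch)
    return clusters
```

Keeping tone marks attached to their base letters before subword tokenization is one way to avoid splitting a letter from its diacritic. Whether that suits the final tokenizer is a design decision the source does not settle.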
2. Objectives
• Build a modular pipeline combining ASR, translation, and TTS.
• Support bidirectional translation across English, French, N’ko, and Bambara.
• Enable both text-based and speech-based interaction.
• Ensure N’ko script fidelity through custom tokenization.
• Scale iteratively, improving as more data is added.
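The modular pipeline in these objectives could be wired together roughly as follows. This is a sketch under stated assumptions: the `Pipeline` class, the stage signatures, and the stub stages are all hypothetical and stand in for trained ASR, translation, and TTS models. It only shows how text-based and speech-based entry points can share one translation stage.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; real stages would wrap trained models.
ASRFn = Callable[[bytes, str], str]            # (audio, lang) -> text
TranslateFn = Callable[[str, str, str], str]   # (text, src, tgt) -> text
TTSFn = Callable[[str, str], bytes]            # (text, lang) -> audio


@dataclass
class Pipeline:
    asr: ASRFn
    translate: TranslateFn
    tts: TTSFn

    def text_to_text(self, text, src, tgt):
        """Text-based interaction: translation only."""
        return self.translate(text, src, tgt)

    def speech_to_text(self, audio, src, tgt):
        """Speech input: transcribe, then translate."""
        return self.translate(self.asr(audio, src), src, tgt)

    def speech_to_speech(self, audio, src, tgt):
        """Speech in, speech out: ASR, translation, then TTS."""
        return self.tts(self.speech_to_text(audio, src, tgt), tgt)


# Stub stages so the sketch runs without any model.
def stub_asr(audio, lang):
    return audio.decode("utf-8")


def stub_translate(text, src, tgt):
    return f"[{src}->{tgt}] {text}"


def stub_tts(text, lang):
    return text.encode("utf-8")


pipeline = Pipeline(stub_asr, stub_translate, stub_tts)
```

Because each stage is passed in as a plain callable, any one of them can be swapped or retrained without touching the others, which is the property the "modular" objective asks for.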
3. Core Dataset: RobotsMali/bam-asr-early
• Total Duration: 37.41 hours
• Samples: 38,769 (Train: 37,306 | Test: 1,463)
• Subsets:
  • Oza’s Bambara-ASR: ~29 hours
  • Jeli-ASR-RMAI: ~3.5 hours
  • oza-tts-mali-pense: ~4 hours
  • Reading-tutor-data: ~1 hour
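A quick consistency check on these figures can be written in a few lines. The numbers are copied from the list above and the variable names are illustrative. The subset durations are approximate in the source, so their sum is only expected to land near the stated total.

```python
# Figures copied from the dataset summary above.
TOTAL_HOURS = 37.41
TRAIN_SAMPLES = 37_306
TEST_SAMPLES = 1_463
TOTAL_SAMPLES = 38_769

# Approximate subset durations in hours, as listed.
SUBSET_HOURS = {
    "Oza's Bambara-ASR": 29.0,
    "Jeli-ASR-RMAI": 3.5,
    "oza-tts-mali-pense": 4.0,
    "Reading-tutor-data": 1.0,
}

# The train and test counts should add up to the stated sample total.
splits_consistent = TRAIN_SAMPLES + TEST_SAMPLES == TOTAL_SAMPLES

# Derived quantities.
mean_clip_seconds = TOTAL_HOURS * 3600 / TOTAL_SAMPLES
test_share = TEST_SAMPLES / TOTAL_SAMPLES
subset_sum_hours = sum(SUBSET_HOURS.values())

if __name__ == "__main__":
    print(f"splits consistent: {splits_consistent}")
    print(f"mean clip length:  {mean_clip_seconds:.2f} s")
    print(f"test share:        {test_share:.2%}")
    print(f"subset sum:        {subset_sum_hours} h (stated total {TOTAL_HOURS} h)")
```

The split counts add up exactly, the average clip works out to roughly 3.5 seconds, and the test split is a little under 4% of all samples. The rounded subset durations sum to 37.5 hours, close to the stated 37.41.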
Promotion decision
What has to happen next
Convert into the standard paper schema, add citations, and render a draft PDF.
Why this is not always a full paper yet
Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.