Mohamed Diomande

Full HTML reader

Read the full artifact

Extracted abstract or opening context

Perfect — here’s a rewritten abstract and overview with the modular breakdown and explicit mention of bidirectional translations across English, French, N’ko, and Bambara. This project develops a modular multilingual system for translation and speech processing in low-resource West African languages, focusing on N’ko and Bambara, while bridging them with French and English. Using RobotsMali/bam-asr-early as the foundational ASR dataset, the system integrates speech recognition (ASR), translation, and speech synthesis (TTS) within a unified pipeline. Built on self-supervised speech models (wav2vec 2.0) and multilingual transformers (mBART, mT5, or LLaMA derivatives), it supports bidirectional translation across all language pairs: English ↔ French, English ↔ N’ko, English ↔ Bambara, French ↔ N’ko, and French ↔ Bambara. The framework is designed to handle script-aware tokenization for N’ko, grammar-sensitive fine-tuning, and mixture-of-experts routing for language-specific adaptation. Beyond immediate performance, the system is structured for iterative learning, improving over time as new community-driven data is added. The outcome is a scalable, accessible platform that strengthens literacy, education, and cultural preservation in West Africa while advancing research in multilingual low-resource NLP. West African languages such as N’ko and Bambara face limited digital resources and weak representation in NLP systems. Key challenges include: • Scarcity of annotated data, especially parallel corpora. • Lack of robust ASR datasets outside RobotsMali. • Unique script handling for N’ko (Unicode segmentation, grammar). • Cross-lingual gaps, especially for English ↔ Bambara and French ↔ N’ko translation. 2. Objectives • Build a modular pipeline combining ASR, translation, and TTS. • Support bidirectional translation across English, French, N’ko, and Bambara. • Enable both text-based and speech-based interaction. • Ensure N’ko script fidelity through custom tokenization. • Scale iteratively, improving as more data is added. 3. Core Dataset: RobotsMali/bam-asr-early • Total Duration: 37.41 hours • Samples: 38,769 (Train: 37,306 | Test: 1,463) • Subsets: • Oza’s Bambara-ASR: ~29 hours • Jeli-ASR-RMAI: ~3.5 hours • oza-tts-mali-pense: ~4 hours • Reading-tutor-data: ~1 hour

Promotion decision

What has to happen next

Convert into the standard paper schema, add citations, and render a draft PDF.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.