Organic Vocabulary Acquisition for Low-Resource African Languages: A Video-First Approach to N'Ko and Manding Language Processing

Extracted abstract or opening context

This document presents a novel approach to building state-of-the-art natural language processing systems for N'Ko, Bambara, and related Manding languages, which are spoken by approximately forty million people across West Africa. Unlike traditional corpus-driven methodologies that depend on pre-existing parallel texts such as Bible translations or government documents, we introduce a video-first organic vocabulary discovery system that extracts training data directly from educational YouTube content. The system runs multimodal optical character recognition on video frames, cross-references the detections against the Ankataa dictionary, which contains over fifteen hundred verified entries, and expands the vocabulary through AI-powered contextual generation across five distinct linguistic registers. The result is contextually grounded training data suitable for automatic speech recognition, machine translation, and optical character recognition. The architecture currently processes content from seven N'Ko educational channels comprising nine hundred sixty-nine videos, an estimated five hundred hours of instructional material.
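
The dictionary cross-referencing step described above might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the system's actual implementation: the function names (`contains_nko`, `normalize`, `cross_reference`), the NFC and case-folding normalization, and the two-entry sample dictionary are all hypothetical, and the real Ankataa dictionary format is not specified in the abstract. The only external fact relied on is that the Unicode N'Ko block spans U+07C0 to U+07FF.

```python
import unicodedata

# Unicode N'Ko block boundaries (U+07C0..U+07FF).
NKO_START, NKO_END = 0x07C0, 0x07FF


def contains_nko(text: str) -> bool:
    """Return True if any character in text falls in the N'Ko block."""
    return any(NKO_START <= ord(ch) <= NKO_END for ch in text)


def normalize(token: str) -> str:
    """Assumed normalization: NFC composition, trimmed, case-folded."""
    return unicodedata.normalize("NFC", token).strip().casefold()


def cross_reference(detections, dictionary):
    """Split OCR detections into dictionary-verified and unverified tokens.

    detections: iterable of raw OCR strings (hypothetical input shape).
    dictionary: mapping of headword -> gloss (hypothetical format).
    """
    lookup = {normalize(word): gloss for word, gloss in dictionary.items()}
    verified, unverified = {}, []
    for token in detections:
        key = normalize(token)
        if not key:
            continue  # skip empty or whitespace-only OCR output
        if key in lookup:
            verified[key] = lookup[key]
        else:
            unverified.append(key)
    return verified, unverified


# Illustrative placeholder entries, not taken from the Ankataa dictionary.
sample_dictionary = {"ji": "water", "muso": "woman"}
verified, unverified = cross_reference(["Ji", " muso ", "xyz", ""], sample_dictionary)
```

In a pipeline like the one described, the unverified tokens would presumably be the candidates passed on to later stages for review or contextual expansion; that routing is an assumption here rather than something the abstract states.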

Promotion decision

What has to happen next

Convert into the standard paper schema, add citations, and render a draft PDF.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.