Working paper | preprint structure candidate | score 84

Organic Vocabulary Acquisition for Low-Resource African Languages: A Video-First Approach to N'Ko and Manding Language Processing

Extracted abstract or opening context

This document presents an approach to building natural language processing systems for N'Ko, Bambara, and related Manding languages, which are spoken by approximately forty million people across West Africa. Unlike traditional corpus-driven methodologies that depend on pre-existing parallel texts such as Bible translations or government documents, we introduce a video-first organic vocabulary discovery system that extracts training data directly from educational YouTube content. The system processes video frames through multimodal optical character recognition, cross-references the detected text against the Ankataa dictionary, which contains over fifteen hundred verified entries, and expands the vocabulary through AI-powered contextual generation across five distinct linguistic registers. The result is contextually grounded training data suitable for automatic speech recognition, machine translation, and optical character recognition. The architecture currently processes content from seven N'Ko educational channels, comprising nine hundred sixty-nine videos that represent an estimated five hundred hours of instructional material.
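
The cross-referencing step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Detection`, `is_nko`, `normalize`, and `cross_reference` are hypothetical, the dictionary is modeled as a plain mapping from headword to gloss, and tokenization is assumed to be simple whitespace splitting. The only external fact relied on is that the Unicode N'Ko block spans U+07C0 to U+07FF.

```python
import unicodedata
from dataclasses import dataclass

# Unicode N'Ko block (letters, digits, combining marks, punctuation).
NKO_START, NKO_END = 0x07C0, 0x07FF


def is_nko(token: str) -> bool:
    """True if the token is non-empty and every character is in the N'Ko block."""
    return bool(token) and all(NKO_START <= ord(ch) <= NKO_END for ch in token)


def normalize(token: str) -> str:
    """Trim whitespace and apply NFC so OCR output and dictionary keys compare equal."""
    return unicodedata.normalize("NFC", token.strip())


@dataclass(frozen=True)
class Detection:
    """Hypothetical record for the OCR text found in one video frame."""
    video_id: str
    frame_index: int
    text: str


def cross_reference(detections, dictionary):
    """Split OCR tokens into dictionary-verified entries and unverified candidates.

    `dictionary` is assumed to map normalized N'Ko headwords to glosses.
    Returns (verified, candidates): verified maps token -> gloss, and
    candidates maps token -> list of detections where it was seen.
    """
    verified = {}
    candidates = {}
    for det in detections:
        for raw in det.text.split():
            token = normalize(raw)
            if not is_nko(token):
                continue  # skip Latin text, channel watermarks, and similar noise
            if token in dictionary:
                verified[token] = dictionary[token]
            else:
                candidates.setdefault(token, []).append(det)
    return verified, candidates
```

In a pipeline of this shape, the `verified` entries could serve as trusted seeds, while `candidates` keep their source frames so that a later review or expansion step can inspect the context in which each unverified token appeared.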

Promotion decision

What has to happen next

Convert into the standard paper schema, add citations, and render a draft PDF.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.