N'Ko Speech Search, Diarization, and TTS Architecture
Date: 2026-04-26
Mission
Turn the existing N'Ko ASR + AGP stack into a speaker-aware speech system with three outputs:
1. Provenance-first search over N'Ko audio, transcripts, papers, and corrections.
2. Improved diarization for Djoko and future Bambara/Malinke broadcast corpora.
3. N'Ko TTS / voice generation, but only from a high-precision subset with explicit speaker boundaries and alignment confidence.
This is not a generic web-search play. It is a vertical system for Manding speech understanding, correction, retrieval, and eventually synthesis.
What Already Exists
Acoustic / correction boundary
The stack is already split correctly:
- PyTorch/Whisper trajectory ASR on Vast
- Gemma/AGP corrective layer on Mac4/Mac5
That boundary should stay explicit:
audio
-> Whisper/trajectory ASR
-> first-pass N'Ko candidate + uncertainty metadata
-> AGP/Gemma corrective proposal
-> Rust admissibility gate
-> final corrected N'Ko text + provenance
Relevant reference:
- `docs/handoffs/agp-nko-vast-training-handoff.md`
Djoko assets already on disk
Verified local artifacts:
- `djoko_speakers.json`
  - `7` weak speaker clusters
  - `6,625` diarized segments
  - `5` eligible speakers for adaptation experiments
- `djoko_transcriptions.jsonl`
  - historical first-pass N'Ko transcriptions
  - quality is still noisy and often collapses into repeated characters
- `consensus_pairs.jsonl`
  - filtered subset with confidence and text-quality metadata
Broader historical deployment state, as recorded in handoffs and papers:
- `32,826` Djoko segments across `1,124` episodes
- speaker-level TTT already treated as a real experimental lane
Core Design Decision
The next interface between ASR and AGP should not just be decoded text. It should be a compact uncertainty packet.
Instead of:
ASR -> raw text -> Gemma correction
use:
ASR -> text + uncertainty packet + local context -> Gemma correction -> gated final text
Minimum uncertainty packet fields:
- `audio_path`
- `episode_id`
- `segment_id`
- `speaker_id` and `speaker_confidence`
- `start_ms`, `end_ms`
- `asr_text_raw`
- `asr_text_postprocessed`
- `ctc_confidence`
- `trajectory_features`
- `top_confusable_spans`
- `n_best_hypotheses`
- `char_posteriors` or compressed token confidence summary
- `agp_proposal`
- `agp_accept_reject`
- `agp_reason`
- `final_text`
- `provenance_score`
This is the substrate for both retrieval and future synthesis.
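A minimal sketch of the packet as a typed record, assuming Python on the consuming side. The field names come from the list above; the types, the `[0, 1]` range for `ctc_confidence`, the `token_confidence_summary` name (standing in for "`char_posteriors` or compressed token confidence summary"), and the `to_jsonl` helper are illustrative assumptions, not the existing schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class UncertaintyPacket:
    # Identity and timing
    audio_path: str
    episode_id: str
    segment_id: str
    speaker_id: Optional[str]
    speaker_confidence: float
    start_ms: int
    end_ms: int
    # ASR outputs
    asr_text_raw: str
    asr_text_postprocessed: str
    ctc_confidence: float  # assumed to be normalized to [0, 1]
    trajectory_features: dict = field(default_factory=dict)
    top_confusable_spans: list = field(default_factory=list)  # assumed: [start, end] char offsets
    n_best_hypotheses: list = field(default_factory=list)
    token_confidence_summary: dict = field(default_factory=dict)  # hypothetical name
    # AGP decision; stays None until the corrective layer has run
    agp_proposal: Optional[str] = None
    agp_accept_reject: Optional[str] = None
    agp_reason: Optional[str] = None
    final_text: Optional[str] = None
    provenance_score: Optional[float] = None

    def __post_init__(self):
        if self.end_ms <= self.start_ms:
            raise ValueError("end_ms must be greater than start_ms")
        if not 0.0 <= self.ctc_confidence <= 1.0:
            raise ValueError("ctc_confidence must be in [0, 1]")

    def to_jsonl(self) -> str:
        # ensure_ascii=False keeps N'Ko text readable in the dump.
        return json.dumps(asdict(self), ensure_ascii=False)
```

Keeping the AGP fields optional lets the same record travel through the whole pipeline: the ASR side fills the first half, the corrective layer and gate fill the rest.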
System Architecture
1. Ingestion and canonical segment table
Create a canonical row format for every utterance:
episode -> segment -> speaker -> ASR hypotheses -> AGP decision -> final text
Each row should be stable and queryable. The system should never require re-parsing ad hoc JSONL files from different experiments to answer basic questions.
Canonical outputs:
- `artifacts/corpus/segments.parquet`
- `artifacts/corpus/segments.jsonl`
- `artifacts/corpus/speakers.jsonl`
- `artifacts/corpus/episodes.jsonl`
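One way the JSONL outputs could be derived from a single list of segment rows is sketched below, using only the standard library. The function name `export_canonical_tables` and the aggregate fields (`segment_count`, `episodes`, `speakers`) are assumptions for illustration; the Parquet output is left out because it would need an extra dependency.

```python
import json
from collections import defaultdict
from pathlib import Path


def export_canonical_tables(rows, out_dir):
    """Write segments, speakers, and episodes JSONL from a list of segment dicts."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    speakers = defaultdict(lambda: {"segment_count": 0, "episodes": set()})
    episodes = defaultdict(lambda: {"segment_count": 0, "speakers": set()})

    # Sort so repeated exports of the same data produce identical files.
    ordered = sorted(rows, key=lambda r: (r["episode_id"], r["start_ms"]))
    with open(out / "segments.jsonl", "w", encoding="utf-8") as f:
        for r in ordered:
            f.write(json.dumps(r, ensure_ascii=False, sort_keys=True) + "\n")
            spk, ep = r.get("speaker_id"), r["episode_id"]
            episodes[ep]["segment_count"] += 1
            if spk is not None:
                speakers[spk]["segment_count"] += 1
                speakers[spk]["episodes"].add(ep)
                episodes[ep]["speakers"].add(spk)

    def dump(name, key, table, set_field):
        # Sets are written as sorted lists so the output is deterministic.
        with open(out / name, "w", encoding="utf-8") as f:
            for k in sorted(table):
                rec = {key: k,
                       "segment_count": table[k]["segment_count"],
                       set_field: sorted(table[k][set_field])}
                f.write(json.dumps(rec, ensure_ascii=False) + "\n")

    dump("speakers.jsonl", "speaker_id", speakers, "episodes")
    dump("episodes.jsonl", "episode_id", episodes, "speakers")
    return len(ordered)
```

Deriving the speaker and episode tables from the segment table, rather than maintaining them separately, keeps the three files from drifting apart between experiments.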
2. Provenance-first search index
Build a vertical search index over:
- corrected N'Ko transcripts
- raw ASR hypotheses
- AGP proposals and rejections
- transliteration variants
- morphology-aware term expansions
- paper text and notes
- timestamps and speaker metadata
Retrieval stack:
- BM25 / lexical search over N'Ko, Latin, and transliterated forms
- dense embeddings for semantic recall
- metadata filters for `speaker`, `episode`, `date`, `confidence`, `accepted_by_agp`
- reranking with Gemma or another compact judge model
Answer layer requirements:
- always cite exact segments
- include timestamps
- include speaker if known
- expose whether returned text is raw ASR or AGP-corrected
This is where the system can beat broad providers: exact search over a script, dialect space, and provenance graph they do not model deeply.
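The lexical half of the retrieval stack, combined with metadata filters and citation-ready results, can be sketched in plain Python. This is a toy in-memory BM25, not the intended production index; whitespace tokenization, the `confidence` and `text_source` field names, and the default `k1`/`b` values are assumptions.

```python
import math
from collections import Counter


def bm25_search(segments, query, k1=1.5, b=0.75,
                min_confidence=0.0, speaker=None, top_k=5):
    """Rank segments by BM25 over whitespace tokens, after metadata filtering."""
    pool = [s for s in segments
            if s["confidence"] >= min_confidence
            and (speaker is None or s["speaker_id"] == speaker)]
    if not pool:
        return []
    docs = [s["text"].split() for s in pool]
    avg_len = sum(len(d) for d in docs) / len(docs)
    df = Counter(term for d in docs for term in set(d))  # document frequency
    n = len(docs)

    hits = []
    for seg, doc in zip(pool, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            hits.append((score, seg))
    hits.sort(key=lambda h: -h[0])
    # Every hit carries the citation fields the answer layer requires:
    # exact segment, timestamps, speaker, and raw-vs-corrected status.
    return [{"segment_id": s["segment_id"], "episode_id": s["episode_id"],
             "start_ms": s["start_ms"], "end_ms": s["end_ms"],
             "speaker_id": s["speaker_id"], "text_source": s["text_source"],
             "score": round(sc, 4)} for sc, s in hits[:top_k]]
```

Indexing the raw ASR text, the corrected text, and transliteration variants as separate rows with different `text_source` values would let one query surface all three while still telling the reader which one matched.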
3. AGP as correction and indexing intelligence
AGP should do two jobs:
1. Correction
   - repair low-confidence ASR outputs under admissibility constraints
2. Index enrichment
   - produce normalized forms, transliteration variants, topic hints, and compact summaries for search
Do not let AGP become an unconstrained rewriter. It should stay a bounded inference layer over acoustic evidence.
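To make "bounded" concrete, here is a Python illustration of one possible admissibility rule: a proposal is accepted only if every edit falls inside a span the ASR flagged as low-confidence and the total number of changed characters stays under a budget. This is not the Rust gate itself; the rule, the span format, and the `max_changed_chars` default are all assumptions.

```python
from difflib import SequenceMatcher


def gate_proposal(asr_text, proposal, confusable_spans, max_changed_chars=4):
    """Return (accepted, reason) for an AGP proposal against the ASR text.

    confusable_spans: (start, end) character offsets into asr_text, end
    exclusive, marking where the ASR reported low confidence (assumed format).
    """
    matcher = SequenceMatcher(None, asr_text, proposal, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        changed += max(i2 - i1, j2 - j1)
        if i1 == i2:
            # Pure insertion: must sit inside or at the edge of a flagged span.
            inside = any(s <= i1 <= e for s, e in confusable_spans)
        else:
            # Replacement or deletion: must be fully covered by one span.
            inside = any(s <= i1 and i2 <= e for s, e in confusable_spans)
        if not inside:
            return False, f"edit at [{i1}, {i2}) falls outside the low-confidence spans"
    if changed > max_changed_chars:
        return False, f"{changed} changed characters exceeds the budget of {max_changed_chars}"
    return True, "all edits fall inside low-confidence spans and within budget"
```

The returned reason string maps naturally onto the `agp_reason` field, so rejected proposals stay searchable alongside accepted ones.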
4. Speaker layer
Build a persistent speaker atlas instead of one-off clustering outputs.
Speaker atlas record:
- `speaker_id`
- `cluster_version`
- `embedding_centroid`
- `episode_count`
- `segment_count`
- `voiceprint_quality`
- `named_character_guess`
- `cooccurrence_neighbors`
- `tts_eligible`
The current MFCC clustering is enough to bootstrap, but not enough to be the final diarization story.
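A sketch of how an atlas record might be kept up to date and queried, assuming embeddings are plain lists of floats and that assignment is by cosine similarity to the centroid. Only a subset of the record fields is shown; `absorb`, `assign_speaker`, and the `0.7` threshold are hypothetical.

```python
import math
from dataclasses import dataclass


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class SpeakerRecord:
    speaker_id: str
    cluster_version: str
    embedding_centroid: list
    segment_count: int = 0
    tts_eligible: bool = False

    def absorb(self, embedding):
        """Fold one segment embedding into the running mean centroid."""
        n = self.segment_count
        self.embedding_centroid = [(c * n + e) / (n + 1)
                                   for c, e in zip(self.embedding_centroid, embedding)]
        self.segment_count = n + 1


def assign_speaker(atlas, embedding, min_similarity=0.7):
    """Return (speaker_id, similarity); speaker_id is None below the threshold."""
    best = max(atlas, key=lambda r: cosine(r.embedding_centroid, embedding), default=None)
    if best is None:
        return None, 0.0
    sim = cosine(best.embedding_centroid, embedding)
    return (best.speaker_id if sim >= min_similarity else None), sim
```

Returning the similarity even when no speaker is assigned gives the segment table a usable `speaker_confidence` value for later filtering.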
5. TTS layer
TTS should be built from the top of the pyramid down:
1. Text-first N'Ko TTS
   - speaker-independent
   - target: intelligible N'Ko with tone and script coverage
2. Speaker-conditioned TTS
   - conditioned on stable speaker embeddings
   - target: style transfer across a bounded speaker set
3. Character/actor voice recreation
   - only if rights and provenance are clean
Do not start with raw voice cloning from YouTube. Start with clean alignment, speaker confidence, and a safety/risk model.
Can We Build Our Own Diarization From Djoko + YouTube?
Yes, but treat the current assets as weak supervision, not final labels.
What Djoko can already give us:
- recurring voices
- episode-level co-occurrence
- consistent background/domain conditions
- candidate speaker adaptation trajectories
What to build next:
1. Weak speaker atlas
   - initialize from `djoko_speakers.json`
2. Embedding upgrade
   - move from MFCC-only clustering to learned speaker embeddings on GPU/Linux where the Mac OMP issues do not interfere
3. Episode graph smoothing
   - use co-occurrence and recurrence across episodes to merge/split unstable clusters
4. Optional multimodal aid
   - use frame/scene analysis where available to associate recurring on-screen characters with voices
The right claim is not “we have perfect diarization.” The right claim is “we can build a progressively improving speaker graph tailored to Bambara/Malinke broadcast audio.”
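The episode graph smoothing step could start from something like the following: propose cluster merges when centroids are close and the clusters rarely appear in the same episode. The co-occurrence rule is a heuristic assumption (one voice split into two clusters tends to show up in different episodes), and the function name and thresholds are hypothetical.

```python
import math


def merge_candidates(centroids, episodes_by_cluster,
                     min_similarity=0.85, max_shared_episodes=0):
    """Group cluster IDs that look like the same voice (candidates for review)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    ids = sorted(centroids)
    parent = {c: c for c in ids}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for i, a in enumerate(ids):
        for other in ids[i + 1:]:
            shared = len(episodes_by_cluster[a] & episodes_by_cluster[other])
            close = cosine(centroids[a], centroids[other]) >= min_similarity
            if shared <= max_shared_episodes and close:
                parent[find(other)] = find(a)

    groups = {}
    for c in ids:
        groups.setdefault(find(c), []).append(c)
    return sorted(groups.values())
```

Because union-find merges are transitive, two clusters that share episodes can still end up in one group through a third cluster, so the output is best treated as a review queue for the merge/split diagnostics rather than an automatic relabeling.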
Can We Build TTS From The YouTube Data?
Technically: yes, partially, with caveats.
The key distinction:
- Good for training now
  - speaker-independent N'Ko TTS
  - prosody/style modeling
  - phoneme-to-audio alignment studies
- Not ready to trust blindly
  - direct character/actor voice cloning from noisy YouTube scrape output
Why:
- current first-pass Djoko transcripts are still noisy
- diarization is weak supervision, not gold labels
- many segments have background music, overlap, and domain noise
- rights/consent around cloning identifiable voices are a real risk
So the correct TTS plan is:
Stage A: high-precision speech-text subset
Keep only segments that satisfy:
- high consensus score
- high CTC confidence
- acceptable AGP correction confidence
- stable single-speaker assignment
- low overlap/music estimate
This subset becomes the seed TTS corpus.
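The Stage A criteria translate directly into a row filter. The sketch below records why each rejected segment failed, which keeps the filter auditable. All threshold values are illustrative placeholders rather than tuned settings, and the row field names (`consensus_score`, `agp_confidence`, `overlap_music_estimate`) are assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedThresholds:
    # Placeholder values for illustration only; tune against held-out data.
    min_consensus: float = 0.9
    min_ctc_confidence: float = 0.85
    min_agp_confidence: float = 0.8
    min_speaker_confidence: float = 0.8
    max_overlap_music: float = 0.1


def seed_rejection_reasons(row, t=SeedThresholds()):
    """Return the names of failed criteria; an empty list means the row is kept."""
    checks = [
        ("consensus", row["consensus_score"] >= t.min_consensus),
        ("ctc_confidence", row["ctc_confidence"] >= t.min_ctc_confidence),
        ("agp_confidence", row["agp_confidence"] >= t.min_agp_confidence),
        ("speaker_confidence", row["speaker_confidence"] >= t.min_speaker_confidence),
        ("overlap_music", row["overlap_music_estimate"] <= t.max_overlap_music),
    ]
    return [name for name, ok in checks if not ok]


def select_seed_corpus(rows, t=SeedThresholds()):
    """Split rows into the kept seed set and a segment_id -> reasons map."""
    kept, rejected = [], {}
    for row in rows:
        reasons = seed_rejection_reasons(row, t)
        if reasons:
            rejected[row["segment_id"]] = reasons
        else:
            kept.append(row)
    return kept, rejected
```

Counting rejection reasons across the corpus shows which criterion is the bottleneck, for example whether more data is lost to music overlap or to weak speaker assignment.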
Stage B: speaker-independent N'Ko TTS
Train a text-to-mel or discrete codec model on the clean subset plus any cleaner narrated/teaching sources.
Target output:
- clear N'Ko pronunciation
- tone-aware reading
- not tied to any named actor
Stage C: speaker-conditioned TTS
Add speaker embeddings only after the atlas is stable.
Target output:
- “voice family” control
- character/style reconstruction
- bounded style transfer
Stage D: actor-specific cloning
Only do this if provenance, permissions, and product intent are explicit.
Immediate Build Order
Phase 1: search substrate
1. Export a canonical segment table from Djoko artifacts.
2. Add AGP decision/provenance columns.
3. Build a first retrieval index over `final_text`, `asr_text_raw`, transliterations, speakers, and timestamps.
Success condition:
- query by N'Ko, Latinized Bambara, or speaker
- get exact segment citations back
Phase 2: same-provenance correction corpus
1. Re-run Djoko transcription on the current artifact-complete checkpoint family.
2. Export row-level ASR prediction dumps.
3. Join them with AGP proposals and gate outcomes.
Success condition:
- AGP is training on the real current ASR error surface, not historical or synthetic-only approximations
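The join in step 3 is where same-provenance can be enforced mechanically. A minimal sketch, assuming both dumps carry a `segment_id` and a `checkpoint_id` (the latter is a hypothetical field naming the ASR checkpoint that produced the text the AGP proposal was based on):

```python
def join_asr_with_agp(asr_rows, agp_rows):
    """Join ASR prediction rows with AGP decisions on segment_id.

    Returns (joined, missing, stale):
      joined  - merged rows where both sides come from the same checkpoint
      missing - segment_ids with no AGP decision yet
      stale   - segment_ids whose AGP decision was made on another checkpoint
    """
    agp_by_id = {}
    for r in agp_rows:
        if r["segment_id"] in agp_by_id:
            raise ValueError(f"duplicate AGP row for {r['segment_id']}")
        agp_by_id[r["segment_id"]] = r

    joined, missing, stale = [], [], []
    for asr in asr_rows:
        agp = agp_by_id.get(asr["segment_id"])
        if agp is None:
            missing.append(asr["segment_id"])
        elif agp["checkpoint_id"] != asr["checkpoint_id"]:
            # The proposal corrected text from a different ASR run, so it does
            # not reflect the current error surface.
            stale.append(asr["segment_id"])
        else:
            joined.append({**asr, **agp})
    return joined, missing, stale
```

Reporting `missing` and `stale` separately makes it clear how much of the correction corpus still rests on historical ASR output.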
Phase 3: speaker atlas
1. Promote current clusters into versioned speaker records.
2. Re-embed on Linux/GPU with a stronger speaker encoder.
3. Build merge/split diagnostics across episodes.
Success condition:
- stable recurring speaker IDs with confidence scores and audit trails
Phase 4: TTS seed corpus
1. Filter the corpus to high-confidence single-speaker rows.
2. Build alignments and duration stats.
3. Train a small speaker-independent baseline.
Success condition:
- intelligible N'Ko TTS on held-out text
Concrete Research Opportunities
Paper / system lane 1
ASR -> AGP provenance search
Claim:
- low-resource speech systems become more useful when every correction is searchable, attributable, and reversible
Paper / system lane 2
Speaker adaptation and diarization under script advantage
Claim:
- N'Ko ASR may support better speaker-conditioned adaptation because the decoder starts from cleaner symbol structure
Paper / system lane 3
N'Ko TTS from weakly supervised broadcast corpora
Claim:
- it is feasible to bootstrap N'Ko speech synthesis from consensus-filtered YouTube data if speaker and text uncertainty are modeled explicitly
What Not To Do
- Do not train TTS directly on the raw `djoko_transcriptions.jsonl` output.
- Do not collapse ASR and AGP into one undocumented “smart model.”
- Do not claim actor-level voice cloning from the current corpus.
- Do not use broad “search engine” framing when the real moat is vertical speech retrieval.
Recommended Next Engineering Steps
1. Build the canonical segment/provenance table.
2. Wire AGP outputs into that schema.
3. Create the first search index and query API over those rows.
4. Re-run the current Djoko lane on the promoted checkpoint family for same-provenance correction data.
5. Start the speaker-atlas upgrade on Linux/GPU.
6. Only after that, cut the clean TTS seed subset.
Bottom Line
Yes, the YouTube data can support all three directions:
- search, immediately
- diarization, with weak supervision upgraded into a speaker atlas
- TTS, but only from a filtered high-precision subset
The right build order is:
ASR -> AGP -> provenance index -> speaker atlas -> clean TTS subset -> TTS
That gives the fastest path to a real product and the cleanest path to publishable claims.
Promotion Decision
Promote into a technical note or architecture paper with implementation anchors.
Source Anchor
nko-brain-scanner/docs/handoffs/nko_speech_search_diarization_tts_architecture_2026-04-26.md