N'Ko Speech Search, Diarization, and TTS Architecture

Date: 2026-04-26

Mission

Turn the existing N'Ko ASR + AGP stack into a speaker-aware speech system with three outputs:

1. Provenance-first search over N'Ko audio, transcripts, papers, and corrections.
2. Improved diarization for Djoko and future Bambara/Malinke broadcast corpora.
3. N'Ko TTS / voice generation, but only from a high-precision subset with explicit speaker boundaries and alignment confidence.

This is not a generic web-search play. It is a vertical system for Manding speech understanding, correction, retrieval, and eventually synthesis.

What Already Exists

Acoustic / correction boundary

The stack is already split correctly:

- PyTorch/Whisper trajectory ASR on Vast
- Gemma/AGP corrective layer on Mac4/Mac5

That boundary should stay explicit:

```text
audio
  -> Whisper/trajectory ASR
  -> first-pass N'Ko candidate + uncertainty metadata
  -> AGP/Gemma corrective proposal
  -> Rust admissibility gate
  -> final corrected N'Ko text + provenance
```

Relevant reference:

- `docs/handoffs/agp-nko-vast-training-handoff.md`

Djoko assets already on disk

Verified local artifacts:

- `djoko_speakers.json`
  - `7` weak speaker clusters
  - `6,625` diarized segments
  - `5` eligible speakers for adaptation experiments
- `djoko_transcriptions.jsonl`
  - historical first-pass N'Ko transcriptions
  - quality is still noisy and often collapses into repeated characters
- `consensus_pairs.jsonl`
  - filtered subset with confidence and text-quality metadata

Historical broader deployment state in handoffs/papers:

- `32,826` Djoko segments across `1,124` episodes
- speaker-level TTT already treated as a real experimental lane

Core Design Decision

The next interface between ASR and AGP should not just be decoded text. It should be a compact uncertainty packet.

Instead of:

```text
ASR -> raw text -> Gemma correction
```

use:

```text
ASR -> text + uncertainty packet + local context -> Gemma correction -> gated final text
```

Minimum uncertainty packet fields:

- `audio_path`
- `episode_id`
- `segment_id`
- `speaker_id` and `speaker_confidence`
- `start_ms`, `end_ms`
- `asr_text_raw`
- `asr_text_postprocessed`
- `ctc_confidence`
- `trajectory_features`
- `top_confusable_spans`
- `n_best_hypotheses`
- `char_posteriors` or compressed token confidence summary
- `agp_proposal`
- `agp_accept_reject`
- `agp_reason`
- `final_text`
- `provenance_score`

This is the substrate for both retrieval and future synthesis.
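A minimal sketch of the packet as a typed record, assuming Python is used for the table-building side. The field names come from the list above; the types, defaults, the `end_ms > start_ms` check, and the `to_jsonl` helper are illustrative assumptions, not the existing schema.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class UncertaintyPacket:
    # Identity and location of the utterance.
    audio_path: str
    episode_id: str
    segment_id: str
    start_ms: int
    end_ms: int
    # Speaker assignment is weak supervision, so confidence travels with it.
    speaker_id: Optional[str] = None
    speaker_confidence: Optional[float] = None
    # ASR side (types are assumptions).
    asr_text_raw: str = ""
    asr_text_postprocessed: str = ""
    ctc_confidence: Optional[float] = None
    trajectory_features: dict = field(default_factory=dict)
    top_confusable_spans: list = field(default_factory=list)
    n_best_hypotheses: list = field(default_factory=list)
    char_posteriors: Optional[list] = None
    # AGP side, left empty until the corrective layer and gate have run.
    agp_proposal: Optional[str] = None
    agp_accept_reject: Optional[bool] = None
    agp_reason: Optional[str] = None
    final_text: Optional[str] = None
    provenance_score: Optional[float] = None

    def __post_init__(self):
        # Hypothetical sanity check: a segment must have positive duration.
        if self.end_ms <= self.start_ms:
            raise ValueError("end_ms must be greater than start_ms")

    def to_jsonl(self) -> str:
        # ensure_ascii=False keeps N'Ko text readable in the dump.
        return json.dumps(asdict(self), ensure_ascii=False)
```

Leaving the AGP fields as `None` until the gate has run keeps "not yet corrected" distinguishable from "corrected and rejected".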

System Architecture

1. Ingestion and canonical segment table

Create a canonical row format for every utterance:

```text
episode -> segment -> speaker -> ASR hypotheses -> AGP decision -> final text
```

Each row should be stable and queryable. The system should never require re-parsing ad hoc JSONL files from different experiments to answer basic questions.

Canonical outputs:

- `artifacts/corpus/segments.parquet`
- `artifacts/corpus/segments.jsonl`
- `artifacts/corpus/speakers.jsonl`
- `artifacts/corpus/episodes.jsonl`
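One way to keep the JSONL export stable across runs is to write rows in a deterministic order and refuse duplicate segment IDs. The sketch below assumes each row is a dict with `episode_id`, `segment_id`, `start_ms`, and `end_ms`; the function names and the ordering key are hypothetical, and the Parquet output is not covered.

```python
import json
from pathlib import Path


def segment_key(row):
    # Deterministic ordering: by episode, then time, then ID as a tiebreaker.
    return (row["episode_id"], row["start_ms"], row["end_ms"], row["segment_id"])


def write_segments_jsonl(rows, path):
    ordered = sorted(rows, key=segment_key)
    seen = set()
    for row in ordered:
        if row["segment_id"] in seen:
            raise ValueError(f"duplicate segment_id: {row['segment_id']}")
        seen.add(row["segment_id"])
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for row in ordered:
            # sort_keys gives byte-identical output for identical input rows.
            f.write(json.dumps(row, ensure_ascii=False, sort_keys=True) + "\n")
    return len(ordered)


def read_segments_jsonl(path):
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because the output is byte-identical for identical input, two exports can be compared with a plain file diff.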

2. Provenance-first search index

Build a vertical search index over:

- corrected N'Ko transcripts
- raw ASR hypotheses
- AGP proposals and rejections
- transliteration variants
- morphology-aware term expansions
- paper text and notes
- timestamps and speaker metadata

Retrieval stack:

- BM25 / lexical search over N'Ko, Latin, and transliterated forms
- dense embeddings for semantic recall
- metadata filters for `speaker`, `episode`, `date`, `confidence`, `accepted_by_agp`
- reranking with Gemma or another compact judge model

Answer layer requirements:

- always cite exact segments
- include timestamps
- include speaker if known
- expose whether returned text is raw ASR or AGP-corrected

This is where the system can beat broad providers: exact search over a script, dialect space, and provenance graph they do not model deeply.
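The lexical layer and the answer-layer requirements can be sketched together: a small in-memory BM25 scorer with metadata filters that returns citations rather than bare text. This is an illustrative sketch only. The row keys (`final_text`, `ctc_confidence`, `accepted_by_agp`, and so on), the whitespace tokenizer, and the `SegmentIndex` class are assumptions. Dense retrieval, transliteration expansion, and reranking are left out.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokens; real N'Ko/Latin input needs normalization.
    return text.lower().split()


class SegmentIndex:
    def __init__(self, rows, k1=1.5, b=0.75):
        self.rows = rows
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(r["final_text"])) for r in rows]
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avg_len = (sum(self.lengths) / len(rows) if rows else 0.0) or 1.0
        # Document frequency: number of segments containing each term.
        self.df = Counter(t for d in self.docs for t in d)

    def _score(self, i, terms):
        n = len(self.rows)
        score = 0.0
        for t in terms:
            tf = self.docs[i].get(t, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n - self.df[t] + 0.5) / (self.df[t] + 0.5))
            norm = tf + self.k1 * (1 - self.b + self.b * self.lengths[i] / self.avg_len)
            score += idf * tf * (self.k1 + 1) / norm
        return score

    def search(self, query, speaker=None, min_confidence=0.0, top_k=5):
        terms = tokenize(query)
        hits = []
        for i, r in enumerate(self.rows):
            if speaker is not None and r["speaker_id"] != speaker:
                continue
            if r["ctc_confidence"] < min_confidence:
                continue
            s = self._score(i, terms)
            if s > 0:
                hits.append((s, r))
        hits.sort(key=lambda h: h[0], reverse=True)
        # Every hit is a citation: segment, timestamps, speaker, text origin.
        return [{
            "segment_id": r["segment_id"],
            "episode_id": r["episode_id"],
            "start_ms": r["start_ms"],
            "end_ms": r["end_ms"],
            "speaker_id": r["speaker_id"],
            "text": r["final_text"],
            "text_source": "agp_corrected" if r["accepted_by_agp"] else "raw_asr",
            "score": s,
        } for s, r in hits[:top_k]]
```

Returning `text_source` with every hit is what keeps raw ASR and AGP-corrected text distinguishable at the answer layer.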

3. AGP as correction and indexing intelligence

AGP should do two jobs:

1. Correction
- repair low-confidence ASR outputs under admissibility constraints
2. Index enrichment
- produce normalized forms, transliteration variants, topic hints, and compact summaries for search

Do not let AGP become an unconstrained rewriter. It should stay a bounded inference layer over acoustic evidence.
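To make "bounded" concrete, here is an illustrative Python sketch of one possible admissibility rule: a proposal is accepted only if every edit falls inside a span the ASR flagged as confusable, and the total change stays under a budget. This is not the Rust gate itself. The function name, the half-open character-offset span format, and the `max_changed_ratio` default are all assumptions.

```python
from difflib import SequenceMatcher


def gate_proposal(raw, proposal, confusable_spans, max_changed_ratio=0.3):
    """Return (accepted, reason) for an AGP proposal against raw ASR text.

    confusable_spans: list of (start, end) half-open character offsets into raw.
    """
    changed = 0
    matcher = SequenceMatcher(None, raw, proposal, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Each edit must sit entirely inside one flagged span of the raw text.
        if not any(s <= i1 and i2 <= e for s, e in confusable_spans):
            return False, f"edit at raw[{i1}:{i2}] outside confusable spans"
        changed += max(i2 - i1, j2 - j1)
    if changed > max_changed_ratio * max(len(raw), 1):
        return False, "too many characters changed"
    return True, "ok"
```

The returned pair maps onto the `agp_accept_reject` and `agp_reason` packet fields, so every rejection stays searchable.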

4. Speaker layer

Build a persistent speaker atlas instead of one-off clustering outputs.

Speaker atlas record:

- `speaker_id`
- `cluster_version`
- `embedding_centroid`
- `episode_count`
- `segment_count`
- `voiceprint_quality`
- `named_character_guess`
- `cooccurrence_neighbors`
- `tts_eligible`

The current MFCC clustering is enough to bootstrap, but not enough to be the final diarization story.
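As a sketch of promoting a cluster into a versioned atlas record, the snippet below averages per-segment embeddings into a centroid and counts episodes and segments. The record fields follow the list above. The input shape (dicts with `episode_id` and `embedding`), the `build_record` helper, and the conservative defaults are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpeakerRecord:
    speaker_id: str
    cluster_version: str
    embedding_centroid: list
    episode_count: int
    segment_count: int
    voiceprint_quality: float = 0.0
    named_character_guess: Optional[str] = None
    cooccurrence_neighbors: list = field(default_factory=list)
    # Conservative default: nothing is TTS-eligible until explicitly promoted.
    tts_eligible: bool = False


def build_record(speaker_id, cluster_version, segments):
    """segments: list of dicts with 'episode_id' and 'embedding' (list of floats)."""
    if not segments:
        raise ValueError("cannot build a speaker record from zero segments")
    dim = len(segments[0]["embedding"])
    centroid = [
        sum(s["embedding"][i] for s in segments) / len(segments)
        for i in range(dim)
    ]
    episodes = {s["episode_id"] for s in segments}
    return SpeakerRecord(
        speaker_id=speaker_id,
        cluster_version=cluster_version,
        embedding_centroid=centroid,
        episode_count=len(episodes),
        segment_count=len(segments),
    )
```

Carrying `cluster_version` on every record means a later re-embedding pass can coexist with the MFCC bootstrap instead of overwriting it.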

5. TTS layer

TTS should be built from the top of the pyramid down:

1. Text-first N'Ko TTS
- speaker-independent
- target: intelligible N'Ko with tone and script coverage
2. Speaker-conditioned TTS
- conditioned on stable speaker embeddings
- target: style transfer across a bounded speaker set
3. Character/actor voice recreation
- only if rights and provenance are clean

Do not start with raw voice cloning from YouTube. Start with clean alignment, speaker confidence, and a safety/risk model.

Can We Build Our Own Diarization From Djoko + YouTube?

Yes, but treat the current assets as weak supervision, not final labels.

What Djoko can already give us:

- recurring voices
- episode-level co-occurrence
- consistent background/domain conditions
- candidate speaker adaptation trajectories

What to build next:

1. Weak speaker atlas
- initialize from `djoko_speakers.json`
2. Embedding upgrade
- move from MFCC-only clustering to learned speaker embeddings on GPU/Linux where the Mac OMP issues do not interfere
3. Episode graph smoothing
- use co-occurrence and recurrence across episodes to merge/split unstable clusters
4. Optional multimodal aid
- use frame/scene analysis where available to associate recurring on-screen characters with voices
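The merge half of step 3 could look like the sketch below: clusters are greedily merged in order of centroid cosine similarity, and a caller-supplied `cannot_link` set vetoes merges between clusters already judged to be different speakers, for example from co-occurrence evidence. The threshold, the function names, and the cannot-link mechanism are assumptions. Splitting unstable clusters is not shown.

```python
import math
from itertools import combinations


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def merge_clusters(centroids, threshold=0.85, cannot_link=frozenset()):
    """centroids: dict speaker_id -> embedding. Returns sorted groups of IDs."""
    groups = {sid: {sid} for sid in centroids}  # group owner -> members
    owner = {sid: sid for sid in centroids}     # member -> group owner
    blocked = {frozenset(pair) for pair in cannot_link}
    pairs = sorted(
        ((cosine(centroids[a], centroids[b]), a, b)
         for a, b in combinations(sorted(centroids), 2)),
        reverse=True,
    )
    for sim, a, b in pairs:
        if sim < threshold:
            break  # pairs are sorted, so nothing further can qualify
        ga, gb = owner[a], owner[b]
        if ga == gb:
            continue
        # Veto the merge if it would join any pair marked as distinct speakers.
        if any(frozenset((x, y)) in blocked for x in groups[ga] for y in groups[gb]):
            continue
        groups[ga] |= groups[gb]
        for member in groups[gb]:
            owner[member] = ga
        del groups[gb]
    return sorted(sorted(g) for g in groups.values())
```

Checking the veto against whole groups rather than single pairs prevents a chain of merges from joining two clusters that were marked as distinct.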

The right claim is not “we have perfect diarization.” The right claim is “we can build a progressively improving speaker graph tailored to Bambara/Malinke broadcast audio.”

Can We Build TTS From The YouTube Data?

Technically: yes, partially, with caveats.

The key distinction:

- Good for training now
  - speaker-independent N'Ko TTS
  - prosody/style modeling
  - phoneme-to-audio alignment studies
- Not ready to trust blindly
  - direct character/actor voice cloning from noisy YouTube scrape output

Why:

- current first-pass Djoko transcripts are still noisy
- diarization is weak supervision, not gold labels
- many segments have background music, overlap, and domain noise
- rights/consent around cloning identifiable voices are a real risk

So the correct TTS plan is:

Stage A: high-precision speech-text subset

Keep only segments that satisfy:

- high consensus score
- high CTC confidence
- acceptable AGP correction confidence
- stable single-speaker assignment
- low overlap/music estimate

This subset becomes the seed TTS corpus.
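The Stage A criteria can be written as an explicit, auditable filter. In the sketch below every threshold value and every row key (`consensus_score`, `agp_confidence`, `overlap_music_estimate`, and so on) is a placeholder assumption to be tuned against the real corpus. The point is that each rejected row reports which checks it failed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedThresholds:
    # Placeholder values, not measured on Djoko data.
    min_consensus: float = 0.9
    min_ctc: float = 0.85
    min_agp: float = 0.8
    min_speaker: float = 0.9
    max_overlap_music: float = 0.1


def failed_checks(row, t=SeedThresholds()):
    """Return the names of the Stage A criteria this row does not satisfy."""
    checks = {
        "consensus": row["consensus_score"] >= t.min_consensus,
        "ctc": row["ctc_confidence"] >= t.min_ctc,
        "agp": row["agp_confidence"] >= t.min_agp,
        "speaker": row["speaker_confidence"] >= t.min_speaker,
        "overlap_music": row["overlap_music_estimate"] <= t.max_overlap_music,
    }
    return [name for name, passed in checks.items() if not passed]


def select_seed_corpus(rows, t=SeedThresholds()):
    # A row enters the seed corpus only if it passes every criterion.
    return [r for r in rows if not failed_checks(r, t)]
```

Keeping the failure reasons per row makes it easy to see which criterion removes the most data before loosening any threshold.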

Stage B: speaker-independent N'Ko TTS

Train a text-to-mel or discrete codec model on the clean subset plus any cleaner narrated/teaching sources.

Target output:

- clear N'Ko pronunciation
- tone-aware reading
- not tied to any named actor

Stage C: speaker-conditioned TTS

Add speaker embeddings only after the atlas is stable.

Target output:

- “voice family” control
- character/style reconstruction
- bounded style transfer

Stage D: actor-specific cloning

Only do this if provenance, permissions, and product intent are explicit.

Immediate Build Order

Phase 1: search substrate

1. Export a canonical segment table from Djoko artifacts.
2. Add AGP decision/provenance columns.
3. Build a first retrieval index over `final_text`, `asr_text_raw`, transliterations, speakers, and timestamps.

Success condition:

- query by N'Ko, Latinized Bambara, or speaker
- get exact segment citations back

Phase 2: same-provenance correction corpus

1. Re-run Djoko transcription on the current artifact-complete checkpoint family.
2. Export row-level ASR prediction dumps.
3. Join them with AGP proposals and gate outcomes.

Success condition:

- AGP is training on the real current ASR error surface, not historical or synthetic-only approximations

Phase 3: speaker atlas

1. Promote current clusters into versioned speaker records.
2. Re-embed on Linux/GPU with a stronger speaker encoder.
3. Build merge/split diagnostics across episodes.

Success condition:

- stable recurring speaker IDs with confidence scores and audit trails

Phase 4: TTS seed corpus

1. Filter the corpus to high-confidence single-speaker rows.
2. Build alignments and duration stats.
3. Train a small speaker-independent baseline.

Success condition:

- intelligible N'Ko TTS on held-out text

Concrete Research Opportunities

Paper / system lane 1

ASR -> AGP provenance search

Claim:

- low-resource speech systems become more useful when every correction is searchable, attributable, and reversible

Paper / system lane 2

Speaker adaptation and diarization under script advantage

Claim:

- N'Ko ASR may support better speaker-conditioned adaptation because the decoder starts from cleaner symbol structure

Paper / system lane 3

N'Ko TTS from weakly supervised broadcast corpora

Claim:

- it is feasible to bootstrap N'Ko speech synthesis from consensus-filtered YouTube data if speaker and text uncertainty are modeled explicitly

What Not To Do

- Do not train TTS directly on the raw `djoko_transcriptions.jsonl` output.
- Do not collapse ASR and AGP into one undocumented “smart model.”
- Do not claim actor-level voice cloning from the current corpus.
- Do not use broad “search engine” framing when the real moat is vertical speech retrieval.

Recommended Next Engineering Steps

1. Build the canonical segment/provenance table.
2. Wire AGP outputs into that schema.
3. Create the first search index and query API over those rows.
4. Re-run the current Djoko lane on the promoted checkpoint family for same-provenance correction data.
5. Start the speaker-atlas upgrade on Linux/GPU.
6. Only after that, cut the clean TTS seed subset.

Bottom Line

Yes, the YouTube data can support all three directions:

- search, immediately
- diarization, with weak supervision upgraded into a speaker atlas
- TTS, but only from a filtered high-precision subset

The right build order is:

```text
ASR -> AGP -> provenance index -> speaker atlas -> clean TTS subset -> TTS
```

That gives the fastest path to a real product and the cleanest path to publishable claims.

Promotion Decision

Promote into a technical note or architecture paper with implementation anchors.

Source Anchor

nko-brain-scanner/docs/handoffs/nko_speech_search_diarization_tts_architecture_2026-04-26.md