Grand Diomande Research · Full HTML Reader

N'Ko Uncertainty Packet Execution Plan



Date: 2026-04-28

Goal

Define and roll out the exact packet that connects:

Whisper/trajectory ASR -> AGP/Gemma correction -> provenance search -> TTS subset selection

This packet is the operational difference between:

  • a loose chain of scripts passing text around
  • a real speech system with explicit uncertainty, provenance, and partition-aware routing

Why This Plan Exists

We already have:

  • trajectory ASR on Vast
  • AGP/Gemma correction on Mac4/Mac5
  • ASR partitioning (`stable|boundary|uncertain|novelty`)
  • a canonical segment corpus at `artifacts/corpus/segments.{jsonl,parquet}`

What is still missing is the formal interface. Right now the stack is too text-centric. The next step is to make uncertainty and routing first-class.

Design Principles

1. Audio evidence stays upstream
- Gemma does not replace the acoustic model.
2. Correction stays bounded
- AGP proposes; the gate decides.
3. Every row is attributable
- raw ASR, corrected text, and decisions remain inspectable.
4. Partitions are policy, not decoration
- `stable`, `boundary`, `uncertain`, and `novelty` drive downstream behavior.
5. Search and TTS consume different slices
- not every corrected utterance is valid TTS training data.

Packet Schema

A. Identity

  • `feat_id`
  • `audio_id`
  • `audio_path`
  • `episode_id`
  • `segment_id`
  • `split`
  • `script`
  • `mode`

B. Timing and speaker

  • `start_ms`
  • `end_ms`
  • `duration_ms`
  • `speaker_id`
  • `speaker_confidence`
  • `speaker_cluster_version`

C. Acoustic output

  • `asr_text_raw`
  • `asr_text_postprocessed`
  • `reference_text`
  • `ctc_confidence`
  • `cer_edits`
  • `reference_chars`
  • `trajectory_scalars`
  • `partition`
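
As an illustration of how the acoustic fields combine, the sketch below computes a micro-averaged character error rate per partition. It assumes `cer_edits` is the character edit distance to `reference_text` and `reference_chars` is the reference length in characters; the function name `cer_by_partition` is hypothetical and not part of the existing code.

```python
from collections import defaultdict


def cer_by_partition(rows):
    """Micro-averaged CER per partition: sum(cer_edits) / sum(reference_chars).

    Assumption: `cer_edits` is an edit count and `reference_chars` is the
    reference length in characters for the same row.
    """
    edits = defaultdict(int)
    chars = defaultdict(int)
    for row in rows:
        edits[row["partition"]] += row["cer_edits"]
        chars[row["partition"]] += row["reference_chars"]
    # Skip partitions with no reference characters to avoid dividing by zero.
    return {p: edits[p] / chars[p] for p in chars if chars[p] > 0}


# Illustrative rows only; the numbers are made up.
rows = [
    {"partition": "stable", "cer_edits": 1, "reference_chars": 20},
    {"partition": "stable", "cer_edits": 0, "reference_chars": 30},
    {"partition": "uncertain", "cer_edits": 6, "reference_chars": 12},
]
print(cer_by_partition(rows))
```

Summing edits and characters before dividing weights long segments more heavily than averaging per-row CER would, which is usually the intended behavior for a corpus-level metric.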

D. Local uncertainty summary

  • `top_confusable_spans`: list of spans with alternate characters/tokens and confidence deltas
  • `n_best_hypotheses`: top candidates with scores
  • `char_posteriors_summary`: compressed per-span posterior summary, not full frame dumps
  • `uncertainty_score`: normalized scalar for routing

E. AGP correction block

  • `agp_prompt_version`
  • `agp_model_id`
  • `agp_proposal`
  • `agp_confidence`
  • `agp_accept_reject`
  • `agp_reason`
  • `agp_delta_spans`

F. Provenance/search block

  • `final_text`
  • `provenance_score`
  • `sources_used`
  • `transliteration_variants`
  • `normalized_forms`
  • `retrieval_tags`

G. TTS eligibility block

  • `tts_eligible`
  • `tts_exclusion_reason`
  • `overlap_risk`
  • `music_risk`
  • `single_speaker_clean`

Producer Responsibilities

Vast ASR producer

Must emit:

  • identity fields
  • acoustic output fields
  • partition
  • trajectory scalars
  • n-best / confusable summaries if available

Primary sources:

  • `test_predictions.jsonl`
  • `test_references.jsonl`
  • `test_metrics_by_partition.json`

Corpus builder

Must:

  • join Djoko transcriptions, speakers, consensus rows, and later ASR prediction dumps
  • preserve raw and corrected text separately
  • write canonical corpus rows

Current entry point:

- `asr/build_segment_provenance_corpus.py`

AGP producer

Must append:

  • proposal
  • accept/reject outcome
  • rationale code
  • corrected final text

AGP does not overwrite ASR fields. It adds a decision layer.
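
A minimal sketch of that append-only behavior follows. The function name `append_agp_decision`, the `"accept"`/`"reject"` string values, and the fallback to `asr_text_postprocessed` on rejection are assumptions for illustration; the actual AGP writeback format is still to be defined.

```python
from copy import deepcopy


def append_agp_decision(row, proposal, accepted, reason,
                        model_id, prompt_version):
    """Return a copy of `row` with the AGP decision layer added.

    ASR fields are never modified. `final_text` is derived from the gate
    outcome, so it can be recomputed or reverted from the same row.
    """
    out = deepcopy(row)
    out.update({
        "agp_prompt_version": prompt_version,
        "agp_model_id": model_id,
        "agp_proposal": proposal,
        "agp_accept_reject": "accept" if accepted else "reject",  # assumed values
        "agp_reason": reason,
    })
    # On rejection, fall back to the best available ASR text (assumption).
    fallback = row.get("asr_text_postprocessed") or row["asr_text_raw"]
    out["final_text"] = proposal if accepted else fallback
    return out


# Illustrative values only; identifiers and reason code are made up.
asr_row = {"segment_id": "seg-0001", "asr_text_raw": "raw text",
           "asr_text_postprocessed": "post text", "partition": "boundary"}
decided = append_agp_decision(asr_row, "proposed text", True,
                              "R_EXAMPLE", "example-model", "v0")
print(decided["final_text"], "|", decided["asr_text_raw"])
```

Because the input row is copied rather than mutated, the raw ASR evidence and the decision can always be compared side by side.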

Search/index producer

Must derive:

  • transliteration variants
  • normalized forms
  • retrieval tags
  • search-time embeddings

TTS filter producer

Must derive:

  • tts eligibility
  • exclusion reason
  • overlap/music risk
  • single-speaker cleanliness

Partition Policy

`stable`

Use for:

  • search by default
  • potential TTS candidate pool if speaker and audio quality also pass

`boundary`

Use for:

  • AGP correction training
  • search with provenance warning
  • usually not first-pass TTS training

`uncertain`

Use for:

  • AGP hard-case training
  • manual review or deferred indexing
  • excluded from initial TTS

`novelty`

Use for:

  • error analysis
  • vocabulary/domain expansion
  • not for TTS until independently validated
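
The policy above can be held as data rather than scattered conditionals. The following is a sketch under that assumption; the route labels and the names `PARTITION_ROUTES`, `route`, and `tts_first_pass_allowed` are hypothetical.

```python
# Partition policy as a lookup table. Labels are illustrative, not an agreed
# vocabulary.
PARTITION_ROUTES = {
    "stable": ["search", "tts_candidate_pool"],
    "boundary": ["agp_training", "search_with_provenance_warning"],
    "uncertain": ["agp_hard_case_training", "manual_review_or_deferred_indexing"],
    "novelty": ["error_analysis", "vocabulary_expansion"],
}


def route(row):
    """Return the downstream destinations for a row based on its partition."""
    partition = row.get("partition")
    if partition not in PARTITION_ROUTES:
        raise ValueError(f"unknown partition: {partition!r}")
    return list(PARTITION_ROUTES[partition])


def tts_first_pass_allowed(row):
    """Only `stable` rows enter the initial TTS candidate pool.

    Speaker and audio quality checks still apply afterwards.
    """
    return row.get("partition") == "stable"


print(route({"partition": "boundary"}))
print(tts_first_pass_allowed({"partition": "novelty"}))
```

Raising on an unknown partition keeps a mislabeled row from silently falling into a default route.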

Rollout Phases

Phase 1: schema lock

Deliverables:

  • this plan
  • stable canonical field list
  • explicit mapping from current artifacts to target packet

Success check:

- no downstream component invents ad hoc field names

Phase 2: ASR dump upgrade

Tasks:

1. Ensure current Vast jobs emit prediction/reference rows in the expected format.
2. Add `partition` and `trajectory_scalars` to every row.
3. Add compact n-best/confusable summaries where feasible.

Deliverables:

  • upgraded `test_predictions.jsonl`
  • upgraded `test_references.jsonl`
  • partition-aware row dumps

Success check:

- a single ASR row is sufficient to reconstruct the correction input

Phase 3: corpus integration

Tasks:

1. Extend `build_segment_provenance_corpus.py` to ingest the upgraded ASR rows.
2. Join them onto Djoko segment rows by stable IDs.
3. Preserve current consensus and speaker joins.

Deliverables:

- enriched `artifacts/corpus/segments.parquet`

Success check:

- one row includes ASR text, partition, speaker data, consensus info, and AGP slots
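
The join in this phase could look like the sketch below. It assumes `(episode_id, segment_id)` is the stable key, which the plan does not specify, and it lets existing segment columns win so that current consensus and speaker joins are preserved. The function name `join_asr_onto_segments` is hypothetical and is not the API of `build_segment_provenance_corpus.py`.

```python
def join_asr_onto_segments(segments, asr_rows,
                           key=("episode_id", "segment_id")):
    """Left-join ASR rows onto segment rows by a stable key.

    Segment rows without a matching ASR row are kept unchanged. Where both
    sides define a column, the segment value is kept.
    """
    asr_by_key = {tuple(r[k] for k in key): r for r in asr_rows}
    joined = []
    for seg in segments:
        merged = dict(seg)
        asr = asr_by_key.get(tuple(seg[k] for k in key), {})
        for name, value in asr.items():
            merged.setdefault(name, value)  # existing segment values win
        joined.append(merged)
    return joined


# Illustrative rows only.
segments = [
    {"episode_id": "ep1", "segment_id": "s1", "speaker_id": "spk_a"},
    {"episode_id": "ep1", "segment_id": "s2", "speaker_id": "spk_b"},
]
asr_rows = [
    {"episode_id": "ep1", "segment_id": "s1", "partition": "stable",
     "speaker_id": "should_not_overwrite"},
]
for merged_row in join_asr_onto_segments(segments, asr_rows):
    print(merged_row)
```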

Phase 4: AGP writeback

Tasks:

1. Define AGP output schema precisely.
2. Append AGP proposals and gate outcomes back into the canonical corpus.
3. Version AGP prompt/model identifiers.

Deliverables:

- AGP-enriched corpus rows

Success check:

- every correction is attributable and reversible

Phase 5: search index

Tasks:

1. Build lexical + metadata retrieval over canonical rows.
2. Add transliteration and normalized-form enrichment.
3. Add reranking with Gemma or another compact judge.

Deliverables:

- first vertical search API over N'Ko speech corpus

Success check:

- query by N'Ko, Latinized Bambara, episode, or speaker returns cited rows
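
To make the success check concrete, here is a minimal lexical-plus-metadata lookup that returns cited rows. It is a sketch only: substring matching stands in for a real index, the function name `search_rows` is hypothetical, and treating `agp_accept_reject == "accept"` as the marker for corrected text is an assumption.

```python
def search_rows(rows, query=None, episode_id=None, speaker_id=None):
    """Filter canonical rows by text and metadata, returning citations."""
    hits = []
    for row in rows:
        if episode_id is not None and row.get("episode_id") != episode_id:
            continue
        if speaker_id is not None and row.get("speaker_id") != speaker_id:
            continue
        if query is not None:
            # Search the final text plus its enrichment fields.
            haystack = [row.get("final_text") or ""]
            haystack += list(row.get("transliteration_variants") or [])
            haystack += list(row.get("normalized_forms") or [])
            needle = query.casefold()
            if not any(needle in text.casefold() for text in haystack):
                continue
        hits.append({
            "episode_id": row.get("episode_id"),
            "segment_id": row.get("segment_id"),
            "speaker_id": row.get("speaker_id"),
            "text": row.get("final_text"),
            # Assumed convention: an accepted AGP proposal means corrected text.
            "text_status": ("corrected"
                            if row.get("agp_accept_reject") == "accept"
                            else "raw"),
        })
    return hits


# Illustrative rows only.
corpus = [
    {"episode_id": "ep1", "segment_id": "s1", "speaker_id": "spk_a",
     "final_text": "ߒߞߏ", "transliteration_variants": ["n'ko", "nko"],
     "agp_accept_reject": "accept"},
    {"episode_id": "ep2", "segment_id": "s7", "speaker_id": "spk_b",
     "final_text": "ߛߓߍ", "transliteration_variants": ["sebe"]},
]
print(search_rows(corpus, query="N'Ko"))
print(search_rows(corpus, speaker_id="spk_b"))
```

Each hit carries speaker, episode, and segment identifiers together with a raw/corrected flag, which is the information Gate 3 asks a search result to expose.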

Phase 6: TTS subset extraction

Tasks:

1. Mark TTS-eligible rows.
2. Exclude overlap/music/noisy or low-confidence rows.
3. Build speaker-independent training subset first.

Deliverables:

- `artifacts/corpus/tts_seed_subset.jsonl`

Success check:

- subset rows are high-confidence, single-speaker, and correction-clean
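
A possible shape for the eligibility filter is sketched below. The thresholds (`min_confidence=0.9`, `max_risk=0.1`), the exclusion reason strings, and the function name `tts_eligibility` are placeholders chosen for illustration, not values from this plan. Missing risk scores are treated as failing, on the assumption that unknown quality should not reach a synthesis corpus.

```python
def tts_eligibility(row, min_confidence=0.9, max_risk=0.1):
    """Return (tts_eligible, tts_exclusion_reason) for one corpus row.

    Thresholds are placeholders; tune them against real data.
    """
    if row.get("partition") != "stable":
        return False, "partition_not_stable"
    if not row.get("single_speaker_clean"):
        return False, "not_single_speaker_clean"
    if (row.get("ctc_confidence") or 0.0) < min_confidence:
        return False, "low_confidence"
    for name in ("overlap_risk", "music_risk"):
        risk = row.get(name)
        # A missing risk score counts as a failure (conservative assumption).
        if risk is None or risk > max_risk:
            return False, name
    return True, None


# Illustrative rows only.
good = {"partition": "stable", "single_speaker_clean": True,
        "ctc_confidence": 0.97, "overlap_risk": 0.02, "music_risk": 0.0}
noisy = dict(good, music_risk=0.6)
print(tts_eligibility(good))
print(tts_eligibility(noisy))
```

Returning the first failing reason gives each excluded row a single `tts_exclusion_reason`, which keeps the exclusion statistics easy to aggregate.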

Mapping From Current Artifacts

Already present

  • `audio_path`
  • `episode_id`
  • `segment_id`
  • `speaker_id`
  • `speaker_cluster_version`
  • `asr_text_raw`
  • `asr_text_postprocessed`
  • `final_text`
  • `ctc_confidence`
  • `consensus_score`
  • `text_quality`
  • `char_diversity`
  • `sources_used`
  • `provenance_score`

Present in handoff spec but not yet wired into corpus builder

  • `feat_id`
  • `audio_id`
  • `split`
  • `script`
  • `mode`
  • `reference_text`
  • `cer_edits`
  • `reference_chars`
  • `trajectory_scalars`
  • `partition`

Not yet produced anywhere cleanly

  • `top_confusable_spans`
  • `n_best_hypotheses`
  • `char_posteriors_summary`
  • `agp_confidence`
  • `agp_delta_spans`
  • `transliteration_variants`
  • `normalized_forms`
  • `retrieval_tags`
  • `tts_eligible`
  • `tts_exclusion_reason`
  • `overlap_risk`
  • `music_risk`

Immediate Task List

1. Extend ASR evaluation/export code to emit the row-level fields already defined in the handoff.
2. Upgrade the corpus builder to join those ASR dumps.
3. Define AGP writeback JSONL format.
4. Add a corpus filter that emits:
- search-ready rows
- AGP-training rows
- TTS-seed rows

Validation Gates

Gate 1: row completeness

For a sampled row, verify:

  • stable identity
  • speaker metadata
  • raw and corrected text
  • partition
  • provenance score

Gate 2: correction auditability

For a corrected row, verify:

  • raw ASR text remains preserved
  • AGP delta is visible
  • accept/reject decision is visible

Gate 3: search readiness

For a search result, verify:

  • query returns exact row
  • cites speaker/episode/segment
  • shows whether text is raw or corrected

Gate 4: TTS readiness

For a TTS candidate row, verify:

  • single-speaker
  • high-confidence
  • low overlap/music risk
  • no unresolved AGP ambiguity
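
Gates 1 and 2 lend themselves to mechanical checks on a sampled row. The sketch below shows one way to express them; the mapping from each gate item to concrete field names (for example, `episode_id` plus `segment_id` as "stable identity") is an assumption, and the function names are hypothetical.

```python
# Assumed mapping from Gate 1 items to packet fields.
GATE1_REQUIRED = ("episode_id", "segment_id", "speaker_id",
                  "asr_text_raw", "final_text", "partition",
                  "provenance_score")

# Assumed mapping from Gate 2 items to packet fields.
GATE2_REQUIRED = ("asr_text_raw", "agp_delta_spans", "agp_accept_reject")


def _missing(row, names):
    """Names whose value is absent, None, or an empty string."""
    return [n for n in names if row.get(n) is None or row.get(n) == ""]


def gate1_row_completeness(row):
    """Gate 1: return the required fields that are missing from the row."""
    return _missing(row, GATE1_REQUIRED)


def gate2_correction_auditability(row):
    """Gate 2: return the audit fields missing from a corrected row."""
    return _missing(row, GATE2_REQUIRED)


# Illustrative row only.
sample = {"episode_id": "ep1", "segment_id": "s1", "speaker_id": "spk_a",
          "asr_text_raw": "raw text", "final_text": "final text",
          "partition": "boundary", "provenance_score": 0.0,
          "agp_accept_reject": "accept"}
print(gate1_row_completeness(sample))
print(gate2_correction_auditability(sample))
```

An empty list means the gate passes. A `provenance_score` of `0.0` counts as present, since only absent or empty values are flagged.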

Main Risks

1. ASR dump incompleteness
- if the acoustic job does not export row-level details, AGP training stays dominated by synthetic errors rather than real ASR output
2. Overcorrection
- if Gemma is allowed to rewrite too freely, provenance collapses
3. Weak diarization
- speaker labels are still weak supervision and must remain versioned
4. Premature TTS
- raw Djoko output is not a safe synthesis corpus

Bottom Line

This plan makes the uncertainty packet the backbone of the stack.

Once implemented, the same row can support:

  • AGP correction
  • provenance search
  • speaker atlas growth
  • high-precision TTS filtering

That is the right bridge between the current N'Ko ASR system and the broader speech platform you want to build.

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

nko-brain-scanner/docs/handoffs/nko_uncertainty_packet_execution_plan_2026-04-28.md
