Grand Diomande Research · Full HTML Reader

N'Ko ASR Row Contract


Date: 2026-04-28

Purpose

This is the concrete row-level contract that the Vast ASR exporter should emit so the local uncertainty-packet pipeline can consume it without ad hoc parsing.

Consumer already implemented:

- `asr/build_segment_provenance_corpus.py`

Prediction Row

```json
{
  "feat_id": "bam_test_0001",
  "audio_id": "EP999_0001",
  "audio_path": "/workspace/djoko_audio/segments/EP999/EP999_0001.wav",
  "segment_id": "EP999_0001",
  "split": "test",
  "script": "nko",
  "mode": "trajectory",
  "asr_text": "ߊߟߐ",
  "ctc_confidence": 0.73,
  "cer_edits": 1,
  "trajectory_scalars": {
    "velocity": 0.2,
    "jerk": 0.8
  },
  "partition": "boundary",
  "top_confusable_spans": [
    {
      "start": 2,
      "end": 3,
      "alts": ["ߎ", "ߐ"]
    }
  ],
  "n_best_hypotheses": [
    {
      "text": "ߊߟߐ",
      "score": -0.4
    }
  ],
  "char_posteriors_summary": {
    "mean_entropy": 0.31
  }
}
```

Reference Row

```json
{
  "feat_id": "bam_test_0001",
  "audio_path": "/workspace/djoko_audio/segments/EP999/EP999_0001.wav",
  "reference_text": "ߊߟߎ"
}
```
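The example prediction and reference rows differ in exactly one character, which is consistent with `cer_edits: 1`. The note does not define how `cer_edits` is computed, so the following is only a minimal sketch under the assumption that it is the character-level Levenshtein distance between `asr_text` and `reference_text`. The function name `char_edit_distance` is hypothetical and is not part of the exporter or the corpus builder.

```python
def char_edit_distance(hyp: str, ref: str) -> int:
    """Levenshtein distance over Unicode code points.

    Assumption: this is how cer_edits is defined; the source note does not say.
    """
    # prev[j] holds the distance between the hypothesis prefix seen so far
    # and the first j characters of the reference.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,         # drop a hypothesis character
                            curr[j - 1] + 1,     # insert a reference character
                            prev[j - 1] + cost)) # match or substitute
        prev = curr
    return prev[-1]


# The asr_text and reference_text values from the example rows.
edits = char_edit_distance("ߊߟߐ", "ߊߟߎ")
print(edits)
```

Because the distance is counted per code point, N'Ko combining tone marks would each count as a separate edit; whether the real pipeline normalizes or groups them is not stated in this note.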

Required Fields

Prediction row must include:

  • `audio_path`
  • `split`
  • `script`
  • `mode`
  • `asr_text`
  • `ctc_confidence`
  • `partition`

Reference row must include:

  • `reference_text`
  • one stable join key: `audio_path` (preferred), or `segment_id`, or `feat_id`
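The required-field lists above can be checked mechanically before rows are handed to the consumer. This is a minimal sketch, not the validation that `asr/build_segment_provenance_corpus.py` actually performs: the helper names are hypothetical, and it assumes a field counts as missing when it is absent or `null`, so an empty `asr_text` is still accepted.

```python
PREDICTION_REQUIRED = (
    "audio_path", "split", "script", "mode",
    "asr_text", "ctc_confidence", "partition",
)
REFERENCE_JOIN_KEYS = ("audio_path", "segment_id", "feat_id")


def missing_prediction_fields(row: dict) -> list:
    """Return the required prediction fields that are absent or null."""
    return [key for key in PREDICTION_REQUIRED if row.get(key) is None]


def reference_problems(row: dict) -> list:
    """Return a list of contract violations for a reference row."""
    problems = []
    if row.get("reference_text") is None:
        problems.append("reference_text")
    # At least one join key must be present and non-empty.
    if not any(row.get(key) for key in REFERENCE_JOIN_KEYS):
        problems.append("join key (audio_path, segment_id, or feat_id)")
    return problems
```

An empty list from either helper means the row satisfies the required part of the contract; recommended fields are deliberately not checked.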

Recommended Fields

  • `feat_id`
  • `audio_id`
  • `segment_id`
  • `cer_edits`
  • `trajectory_scalars`
  • `top_confusable_spans`
  • `n_best_hypotheses`
  • `char_posteriors_summary`

Join Policy

The local corpus builder joins rows in this order:

1. `audio_path`
2. `segment_id`
3. `feat_id`

Because `audio_path` is tried first, the exporter should emit it on every row whenever it is available.
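The join order can be illustrated with a short sketch. It assumes the builder tries each key in turn and takes the first reference row that matches; the real matching logic in `asr/build_segment_provenance_corpus.py` is not shown in this note, and `join_rows` is a hypothetical name.

```python
JOIN_KEY_ORDER = ("audio_path", "segment_id", "feat_id")


def join_rows(predictions, references):
    """Pair each prediction with a reference row, trying keys in priority order.

    Returns a list of (prediction, reference_or_None) tuples.
    """
    # One lookup table per join key; the first reference seen for a value wins.
    indexes = {key: {} for key in JOIN_KEY_ORDER}
    for ref in references:
        for key in JOIN_KEY_ORDER:
            value = ref.get(key)
            if value:
                indexes[key].setdefault(value, ref)

    joined = []
    for pred in predictions:
        match = None
        for key in JOIN_KEY_ORDER:
            value = pred.get(key)
            if value and value in indexes[key]:
                match = indexes[key][value]
                break  # a higher-priority key matched; skip the rest
        joined.append((pred, match))
    return joined
```

Under this reading, a prediction that carries both `audio_path` and `feat_id` is matched on `audio_path` even if a different reference row shares its `feat_id`.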

Local Build Command

If you already have legacy artifact bundles

First normalize them:

```bash
python3 asr/upgrade_asr_row_contract.py \
  --predictions /path/to/legacy/test_predictions.jsonl \
  --references /path/to/legacy/test_references.jsonl \
  --metrics-by-partition /path/to/legacy/test_metrics_by_partition.json \
  --output-dir /tmp/paper4_upgraded_rows
```

Then build the canonical corpus:

```bash
python3 asr/build_segment_provenance_corpus.py \
  --transcriptions djoko_transcriptions.jsonl \
  --speakers djoko_speakers.json \
  --consensus consensus_pairs.jsonl \
  --asr-predictions /tmp/paper4_upgraded_rows/test_predictions.jsonl \
  --asr-references /tmp/paper4_upgraded_rows/test_references.jsonl \
  --output-dir artifacts/corpus
```
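Before running the build, it can help to confirm that the upgraded files parse and that predictions and references overlap on `audio_path`. The sketch below assumes the `.jsonl` files hold one UTF-8 JSON object per line; `read_jsonl` and `audio_path_coverage` are hypothetical helpers, not part of either script.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Load a JSON Lines file, skipping blank lines."""
    rows = []
    # Explicit UTF-8 so N'Ko text decodes the same way on every platform.
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            if not isinstance(row, dict):
                raise ValueError(f"{path}:{line_no}: expected a JSON object")
            rows.append(row)
    return rows


def audio_path_coverage(predictions, references):
    """Return (matched, total): predictions whose audio_path has a reference."""
    ref_paths = {r["audio_path"] for r in references if r.get("audio_path")}
    matched = sum(1 for p in predictions if p.get("audio_path") in ref_paths)
    return matched, len(predictions)
```

A typical check would load both files from `/tmp/paper4_upgraded_rows` and compare the two numbers; a large gap suggests the rows will fall back to `segment_id` or `feat_id` during the join.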

Current Local Status

Implemented and tested:

  • legacy row upgrader: `asr/upgrade_asr_row_contract.py`
  • corpus schema now includes:
      • `feat_id`
      • `audio_id`
      • `split`
      • `script`
      • `mode`
      • `reference_text`
      • `cer_edits`
      • `partition`
      • `trajectory_features`
      • `top_confusable_spans`
      • `n_best_hypotheses`
      • `char_posteriors_summary`
      • `uncertainty_score`

Current producer gap:

  • the local repo does not currently contain the live Vast trainer/exporter implementation, `train_vastai_tar_ttt.py`
  • the producer therefore still needs to be patched wherever that code is maintained, or directly on the remote box

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

nko-brain-scanner/docs/handoffs/nko_asr_row_contract_2026-04-28.md
