N'Ko ASR Row Contract
Date: 2026-04-28
Purpose
This is the concrete row-level contract that the Vast ASR exporter should emit so the local uncertainty-packet pipeline can consume it without ad hoc parsing.
Consumer already implemented:
- `asr/build_segment_provenance_corpus.py`
Prediction Row
{
  "feat_id": "bam_test_0001",
  "audio_id": "EP999_0001",
  "audio_path": "/workspace/djoko_audio/segments/EP999/EP999_0001.wav",
  "segment_id": "EP999_0001",
  "split": "test",
  "script": "nko",
  "mode": "trajectory",
  "asr_text": "ߊߟߐ",
  "ctc_confidence": 0.73,
  "cer_edits": 1,
  "trajectory_scalars": {
    "velocity": 0.2,
    "jerk": 0.8
  },
  "partition": "boundary",
  "top_confusable_spans": [
    {
      "start": 2,
      "end": 3,
      "alts": ["ߎ", "ߐ"]
    }
  ],
  "n_best_hypotheses": [
    {
      "text": "ߊߟߐ",
      "score": -0.4
    }
  ],
  "char_posteriors_summary": {
    "mean_entropy": 0.31
  }
}
Reference Row
{
  "feat_id": "bam_test_0001",
  "audio_path": "/workspace/djoko_audio/segments/EP999/EP999_0001.wav",
  "reference_text": "ߊߟߎ"
}
Required Fields
Prediction row must include:
- `audio_path`
- `split`
- `script`
- `mode`
- `asr_text`
- `ctc_confidence`
- `partition`
Reference row must include:
- `reference_text`
- one stable join key:
  - `audio_path`, preferred
  - or `segment_id`
  - or `feat_id`
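The required-field lists above can be checked mechanically before rows are handed to the corpus builder. The following is a minimal sketch, not the consumer's actual validation: the function names `missing_prediction_fields` and `reference_row_problems` are hypothetical, and treating a `None` value as missing is an assumption.

```python
# Field lists copied from this contract. The helper names below are
# hypothetical and are not part of asr/build_segment_provenance_corpus.py.
REQUIRED_PREDICTION_FIELDS = (
    "audio_path", "split", "script", "mode",
    "asr_text", "ctc_confidence", "partition",
)
REFERENCE_JOIN_KEYS = ("audio_path", "segment_id", "feat_id")


def missing_prediction_fields(row):
    """Return the required prediction fields that are absent or None."""
    return [f for f in REQUIRED_PREDICTION_FIELDS if row.get(f) is None]


def reference_row_problems(row):
    """Return a list of human-readable problems for a reference row."""
    problems = []
    if row.get("reference_text") is None:
        problems.append("missing reference_text")
    # Assumption: an empty-string key is as unusable as a missing one.
    if not any(row.get(k) for k in REFERENCE_JOIN_KEYS):
        problems.append("no join key (audio_path, segment_id, or feat_id)")
    return problems


# A reference row that carries only feat_id still satisfies the contract.
example_reference = {"feat_id": "bam_test_0001", "reference_text": "ߊߟߎ"}
assert reference_row_problems(example_reference) == []
```

Returning a list of problems rather than raising lets an exporter log every bad row in one pass instead of stopping at the first.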
Recommended Fields
- `feat_id`
- `audio_id`
- `segment_id`
- `cer_edits`
- `trajectory_scalars`
- `top_confusable_spans`
- `n_best_hypotheses`
- `char_posteriors_summary`
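The contract does not define how `cer_edits` is computed. One plausible reading, labeled here as an assumption, is the character-level Levenshtein distance between `asr_text` and `reference_text`; under that reading the example rows above give 1, matching the `cer_edits` value shown. A minimal sketch (the helper name `edit_distance` is hypothetical):

```python
def edit_distance(hyp, ref):
    """Character-level Levenshtein distance between two strings.

    Operates on Unicode code points, so a combining mark counts as its
    own character. Whether the real exporter does the same is unknown.
    """
    prev = list(range(len(ref) + 1))  # distances for the empty hypothesis
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(
                prev[j] + 1,         # delete h from the hypothesis
                curr[j - 1] + 1,     # insert r into the hypothesis
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


asr_text = "ߊߟߐ"
reference_text = "ߊߟߎ"
edits = edit_distance(asr_text, reference_text)  # 1: one substitution at index 2
cer = edits / len(reference_text)                # 1 edit over 3 reference characters
```

The single substitution falls at character index 2, which is also the span flagged in `top_confusable_spans` in the example prediction row.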
Join Policy
The local corpus builder joins rows in this order:
1. `audio_path`
2. `segment_id`
3. `feat_id`
Because `audio_path` is tried first, the exporter should emit it on every row whenever it is available.
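The join order can be sketched as a priority lookup. This illustrates the policy only and is not the code in `asr/build_segment_provenance_corpus.py`: the names `index_references` and `find_reference` are hypothetical, and keeping the first row when a key value repeats is an assumption.

```python
JOIN_KEY_ORDER = ("audio_path", "segment_id", "feat_id")


def index_references(reference_rows):
    """Build one lookup table per join key: key name -> {value: row}."""
    index = {key: {} for key in JOIN_KEY_ORDER}
    for row in reference_rows:
        for key in JOIN_KEY_ORDER:
            value = row.get(key)
            if value:
                # Assumption: the first row wins if a key value repeats.
                index[key].setdefault(value, row)
    return index


def find_reference(prediction, index):
    """Return (matched_key, reference_row), or (None, None) if nothing joins."""
    for key in JOIN_KEY_ORDER:
        value = prediction.get(key)
        if value and value in index[key]:
            return key, index[key][value]
    return None, None


references = [{
    "feat_id": "bam_test_0001",
    "audio_path": "/workspace/djoko_audio/segments/EP999/EP999_0001.wav",
    "reference_text": "ߊߟߎ",
}]
index = index_references(references)

# A prediction without audio_path falls through to the feat_id lookup.
matched_key, ref = find_reference({"feat_id": "bam_test_0001"}, index)
```

Returning the matched key alongside the row makes it easy to count how many rows joined on a fallback key, which is a quick signal that the exporter is dropping `audio_path`.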
Local Build Command
If you already have legacy artifact bundles
First normalize them:
python3 asr/upgrade_asr_row_contract.py \
  --predictions /path/to/legacy/test_predictions.jsonl \
  --references /path/to/legacy/test_references.jsonl \
  --metrics-by-partition /path/to/legacy/test_metrics_by_partition.json \
  --output-dir /tmp/paper4_upgraded_rows
Then build the canonical corpus:
python3 asr/build_segment_provenance_corpus.py \
  --transcriptions djoko_transcriptions.jsonl \
  --speakers djoko_speakers.json \
  --consensus consensus_pairs.jsonl \
  --asr-predictions /tmp/paper4_upgraded_rows/test_predictions.jsonl \
  --asr-references /tmp/paper4_upgraded_rows/test_references.jsonl \
  --output-dir artifacts/corpus
Current Local Status
Implemented and tested:
- legacy row upgrader:
  - `asr/upgrade_asr_row_contract.py`
- corpus schema now includes:
  - `feat_id`
  - `audio_id`
  - `split`
  - `script`
  - `mode`
  - `reference_text`
  - `cer_edits`
  - `partition`
  - `trajectory_features`
  - `top_confusable_spans`
  - `n_best_hypotheses`
  - `char_posteriors_summary`
  - `uncertainty_score`
Current producer gap:
- the local repo does not currently contain the live Vast trainer/exporter implementation for `train_vastai_tar_ttt.py`
- so the producer still needs to be patched wherever that code is currently maintained, or directly on the remote box
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
nko-brain-scanner/docs/handoffs/nko_asr_row_contract_2026-04-28.md