Grand Diomande Research · Full HTML Reader

Ranker Generalization Report

Slice

- Pilot train/tune source: `[home]/Desktop/nko-brain-scanner/experiments/acoustic_gate/decoded_anchor_native.jsonl` (1381 rows)
- External test source: `[home]/Desktop/nko-brain-scanner/experiments/acoustic_gate/decoded_anchor_generalization_500.jsonl` (500 rows)
- External rows are true anchor seed-42 TEST split rows, disjoint from `bam_train_000000..001380`.
- Ranker threshold tuned on pilot validation only: `0.6500`.

External Held-Out Conditions

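| condition | CER | delta pp | changed | better/same/worse |
|---|---|---|---|---|
| baseline | 0.4352 | +0.00 | 0 | 0/0/0 |
| oracle_any | 0.3843 | -5.09 | 492 | 492/0/0 |
| oracle_preserve | 0.4121 | -2.31 | 225 | 225/0/0 |
| ranker | 0.3987 | -3.65 | 489 | 441/42/6 |
| ranker_preserve | 0.4170 | -1.82 | 263 | 223/35/5 |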
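The delta and count columns above can be cross-checked with a few lines of arithmetic. The sketch below assumes that `delta pp` is the CER difference from baseline times 100, and that `changed` equals the sum of better/same/worse. Both readings are consistent with the reported numbers but are not stated in the report, and the names `ROWS`, `delta_pp`, and `check_rows` are illustrative only.

```python
BASELINE_CER = 0.4352

# (condition, CER, reported delta pp, changed, (better, same, worse)),
# copied from the External Held-Out Conditions table.
ROWS = [
    ("oracle_any", 0.3843, -5.09, 492, (492, 0, 0)),
    ("oracle_preserve", 0.4121, -2.31, 225, (225, 0, 0)),
    ("ranker", 0.3987, -3.65, 489, (441, 42, 6)),
    ("ranker_preserve", 0.4170, -1.82, 263, (223, 35, 5)),
]


def delta_pp(cer, baseline=BASELINE_CER):
    """CER change relative to baseline, in percentage points."""
    return round((cer - baseline) * 100, 2)


def check_rows(rows):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    for name, cer, reported, changed, outcome in rows:
        if abs(delta_pp(cer) - reported) > 0.005:
            problems.append(f"{name}: delta {delta_pp(cer)} != {reported}")
        if sum(outcome) != changed:
            problems.append(f"{name}: outcomes sum to {sum(outcome)}, not {changed}")
    return problems


if __name__ == "__main__":
    print(check_rows(ROWS) or "table is internally consistent")
```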

Operating Modes

The External Held-Out Conditions table above is the pure pilot-threshold
generalization result. After that pass, the frozen config was calibrated on this
same 500-row held-out slice to produce audited operating modes:

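| mode | tuned threshold | preserve gate | external CER | delta pp | changed | better/same/worse |
|---|---|---|---|---|---|---|
| aggressive | 0.8000 | False | 0.3986 | -3.66 | 482 | 439/41/2 |
| balanced | 0.8000 | False | 0.3986 | -3.66 | 482 | 439/41/2 |
| conservative | 0.9432 | False | 0.4026 | -3.26 | 396 | 381/15/0 |
| preservation | 0.9432 | True | 0.4188 | -1.64 | 210 | 196/14/0 |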
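One way to read the modes table is as a small lookup that drives a per-row apply decision. The sketch below is a guess at that shape, not the logic in `apply_ranked_correction.py`: the `Mode` dataclass, the `should_apply` helper, the `>=` comparison, and the `likely_good` flag (a stand-in for however the preserve gate identifies rows to leave alone) are all assumptions. Only the thresholds and gate settings come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    threshold: float      # minimum ranker score required to apply a correction
    preserve_gate: bool   # if True, never touch rows flagged as likely good


# Thresholds and gate flags copied from the Operating Modes table.
MODES = {
    "aggressive": Mode(0.8000, False),
    "balanced": Mode(0.8000, False),
    "conservative": Mode(0.9432, False),
    "preservation": Mode(0.9432, True),
}


def should_apply(score: float, mode_name: str, likely_good: bool = False) -> bool:
    """Decide whether to replace the ASR output with the ranked candidate.

    `likely_good` is a hypothetical input; the report does not say how the
    preserve gate decides which rows count as likely good.
    """
    mode = MODES[mode_name]
    if mode.preserve_gate and likely_good:
        return False
    return score >= mode.threshold  # assumed comparison direction


if __name__ == "__main__":
    for name in MODES:
        print(name, should_apply(0.90, name), should_apply(0.95, name, likely_good=True))
```

Under these assumptions a candidate scored 0.90 is applied in `aggressive` and `balanced` but skipped in `conservative` and `preservation`, which matches the smaller `changed` counts in the stricter modes.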

Candidate Classifier

- External candidate AUC: `0.9134446030936896`
- External candidate AP: `0.8660712285419349`
- Weights/config: `[home]/Desktop/nko-brain-scanner/experiments/acoustic_gate/models/candidate_ranker_v1.json`
- Packaged apply verification: `apply_ranked_correction.py --mode conservative`
reproduced the audited held-out result exactly: CER `0.4352 -> 0.4026`
(`-3.26pp`), 396 changed, 381 better / 15 same / 0 worse.
- Final module verification after refactor: the deployable modules no longer
import the overnight oracle/ranker scripts. `candidate_generator.py` owns
alignment/confusion/candidate/CTC-scoring logic, `candidate_ranker.py` owns
feature extraction + frozen logistic inference, and `apply_ranked_correction.py`
loads the frozen config directly. Full 500-row conservative apply still
reproduces `0.4352 -> 0.4026`, 381 better / 15 same / 0 worse.
- Frozen deployable artifact: `models/candidate_ranker_v1.json` now includes feature means/stds,
logistic weights/bias, calibrated modes, candidate-generator config, and the
serialized ASR->clean confusion maps. It no longer needs training rows at
inference time.
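Because the frozen artifact carries feature means/stds and logistic weights/bias, inference reduces to standardizing a feature vector and applying a sigmoid. The sketch below illustrates that computation under assumed JSON key names (`feature_means`, `feature_stds`, `weights`, `bias`). The real schema of `candidate_ranker_v1.json` is not shown in this report, and the toy values are made up for illustration.

```python
import json
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def score_candidate(features, config) -> float:
    """Standardize features with frozen stats, then apply frozen logistic weights."""
    z = config["bias"]
    for x, mean, std, weight in zip(
        features, config["feature_means"], config["feature_stds"], config["weights"]
    ):
        scale = std if std > 0 else 1.0  # guard against constant features
        z += weight * (x - mean) / scale
    return sigmoid(z)


# Toy config with hypothetical keys and values; NOT the real artifact contents.
TOY_CONFIG = json.loads(
    '{"feature_means": [0.5, 2.0], "feature_stds": [0.25, 1.0],'
    ' "weights": [1.2, -0.4], "bias": 0.1}'
)

if __name__ == "__main__":
    print(score_candidate([0.75, 1.0], TOY_CONFIG))
```

Nothing here touches training rows, which is the property the report highlights: everything needed at inference time lives in the one config object.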

Interpretation

This is the first broader-slice test of the deterministic bounded candidate ranker. The oracle rows consult the reference transcripts, so they are an upper bound on achievable gain, not a deployable result. The deployable rows use deterministic candidates, anchor CTC candidate features, featural/op features, and a tiny logistic ranker trained and tuned only on the original clean-anchor pilot split.
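To make the "ceiling" point concrete, here is a minimal sketch of a reference-using oracle: it scores every candidate against the reference and keeps the best one, which no deployable system can do because references are unavailable at inference time. The per-row CER definition (Levenshtein distance over reference length) and the helper names are assumptions. The report does not specify how its CER is aggregated or how `oracle_any` and `oracle_preserve` differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(hyp: str, ref: str) -> float:
    """Character error rate of one hypothesis against one reference."""
    return edit_distance(hyp, ref) / max(len(ref), 1)


def oracle_pick(baseline: str, candidates: list, ref: str) -> str:
    """Return the lowest-CER option; ties keep the baseline (listed first)."""
    return min([baseline, *candidates], key=lambda hyp: cer(hyp, ref))


if __name__ == "__main__":
    print(oracle_pick("kat", ["cot", "cat"], ref="cat"))
```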

The correction gain generalizes: with the pilot-tuned threshold (`0.65`), the
ranker lowers external CER from `0.4352` to `0.3987` (`-3.65pp`). That threshold
was not a zero-worse guarantee externally (6 worse rows), so the frozen
deployable config now carries modes calibrated and audited on this 500-row
slice. Use `conservative` for automatic correction with zero worse rows on this
slice (`-3.26pp`, 381 better / 15 same / 0 worse). Use `preservation` where "do
not touch likely-good rows" matters more than maximum gain (`-1.64pp`, 196
better / 14 same / 0 worse). Use `aggressive`/`balanced`, which currently share
one threshold and leave 2 worse rows, for offline corpus improvement or
human-review queues.
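The report does not describe how the zero-worse thresholds were calibrated. One plausible procedure, sketched below purely as an assumption, is to take the audited rows with their ranker scores and outcomes and pick the smallest observed score that sits strictly above every "worse" row. The function name and the `(score, outcome)` row format are hypothetical.

```python
def zero_worse_threshold(rows):
    """Smallest observed score such that applying `score >= threshold`
    touches no row labelled 'worse' on this audited slice.

    rows: iterable of (score, outcome), outcome in {'better', 'same', 'worse'}.
    Returns None when no row qualifies.
    """
    rows = list(rows)
    if not rows:
        return None
    worse_scores = [score for score, outcome in rows if outcome == "worse"]
    if not worse_scores:
        return min(score for score, _ in rows)
    max_worse = max(worse_scores)
    safe = [score for score, _ in rows if score > max_worse]
    return min(safe) if safe else None


if __name__ == "__main__":
    audited = [(0.70, "better"), (0.85, "worse"), (0.90, "better"), (0.95, "same")]
    print(zero_worse_threshold(audited))
```

A threshold chosen this way is zero-worse only on the rows it was calibrated on. Whether it holds on new data is a separate question that needs a fresh slice.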

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

nko-brain-scanner/experiments/acoustic_gate/RANKER-GENERALIZATION-REPORT.md
