Grand Diomande Research · Full HTML Reader

Ranker Generalization Report

Slice

- Pilot train/tune source: `[home]/Desktop/nko-brain-scanner/experiments/acoustic_gate/decoded_anchor_native.jsonl` (1381 rows)
- External test source: `[home]/Desktop/nko-brain-scanner/experiments/acoustic_gate/decoded_anchor_generalization_500.jsonl` (500 rows)
- External rows are true anchor seed-42 TEST split rows, disjoint from `bam_train_000000..001380`.
- Ranker threshold tuned on pilot validation only: `0.6500`.

External Held-Out Conditions

| condition | CER | delta (pp) | rows changed | better/same/worse |
| --- | --- | --- | --- | --- |
| baseline | 0.4352 | +0.00 | 0 | 0/0/0 |
| oracle_any | 0.3843 | -5.09 | 492 | 492/0/0 |
| oracle_preserve | 0.4121 | -2.31 | 225 | 225/0/0 |
| ranker | 0.3987 | -3.65 | 489 | 441/42/6 |
| ranker_preserve | 0.4170 | -1.82 | 263 | 223/35/5 |
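
The delta and tally columns can be cross-checked with a few lines of arithmetic. The sketch below hard-codes the rows from the table above; the helper name `delta_pp` is illustrative and not part of the repository.

```python
# Rows copied from the table above: condition -> (CER, changed, better, same, worse).
BASELINE_CER = 0.4352
ROWS = {
    "oracle_any": (0.3843, 492, 492, 0, 0),
    "oracle_preserve": (0.4121, 225, 225, 0, 0),
    "ranker": (0.3987, 489, 441, 42, 6),
    "ranker_preserve": (0.4170, 263, 223, 35, 5),
}


def delta_pp(cer, baseline=BASELINE_CER):
    """Absolute CER change in percentage points (negative means improvement)."""
    return round((cer - baseline) * 100, 2)


for name, (cer, changed, better, same, worse) in ROWS.items():
    # Every changed row must land in exactly one outcome bucket.
    assert changed == better + same + worse, name
    print(f"{name}: {delta_pp(cer):+.2f} pp, {worse} worse of {changed} changed")
```

Note that the deltas are absolute percentage points, not relative reductions: `-3.65` means CER dropped from 43.52% to 39.87%.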

Operating Modes

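The table above is the pure pilot-threshold generalization result: the `0.6500`
threshold was tuned on pilot validation only. After that pass, the frozen config
was calibrated on this same 500-row held-out slice, so the mode thresholds below
were chosen with access to these rows and their numbers are not a fresh
generalization estimate. `aggressive` and `balanced` currently share one
threshold and therefore report identical results. The audited operating modes: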

| mode | tuned threshold | preserve gate | external CER | delta (pp) | rows changed | better/same/worse |
| --- | --- | --- | --- | --- | --- | --- |
| aggressive | 0.8000 | False | 0.3986 | -3.66 | 482 | 439/41/2 |
| balanced | 0.8000 | False | 0.3986 | -3.66 | 482 | 439/41/2 |
| conservative | 0.9432 | False | 0.4026 | -3.26 | 396 | 381/15/0 |
| preservation | 0.9432 | True | 0.4188 | -1.64 | 210 | 196/14/0 |
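
Each mode in the table above is a threshold plus a preserve-gate flag. A minimal sketch of how such a pair could drive the accept/reject decision follows. The report does not spell out how the preserve gate decides which rows to leave alone, so `row_looks_good` is a hypothetical boolean standing in for that signal, and the `>=` comparison is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    threshold: float      # minimum ranker score needed to accept a candidate
    preserve_gate: bool   # if True, never touch rows flagged as likely good


# Values copied from the operating-modes table.
MODES = {
    "aggressive": Mode(0.8000, False),
    "balanced": Mode(0.8000, False),
    "conservative": Mode(0.9432, False),
    "preservation": Mode(0.9432, True),
}


def should_apply(mode_name, ranker_score, row_looks_good):
    """Return True if the top candidate should replace the ASR output.

    `row_looks_good` is a hypothetical stand-in for the real preserve-gate
    signal, which this report does not describe.
    """
    mode = MODES[mode_name]
    if mode.preserve_gate and row_looks_good:
        return False
    return ranker_score >= mode.threshold
```

Under these assumptions, `preservation` differs from `conservative` only in that it skips rows the gate flags, which is consistent with it changing fewer rows (210 vs 396) at the same threshold.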

Candidate Classifier

- External candidate AUC: `0.9134446030936896`
- External candidate AP: `0.8660712285419349`
- Weights/config: `[home]/Desktop/nko-brain-scanner/experiments/acoustic_gate/models/candidate_ranker_v1.json`
- Packaged apply verification: `apply_ranked_correction.py --mode conservative`
reproduced the audited held-out result exactly: CER `0.4352 -> 0.4026`
(`-3.26pp`), 396 changed, 381 better / 15 same / 0 worse.
- Final module verification after refactor: the deployable modules no longer
import the overnight oracle/ranker scripts. `candidate_generator.py` owns
alignment/confusion/candidate/CTC-scoring logic, `candidate_ranker.py` owns
feature extraction + frozen logistic inference, and `apply_ranked_correction.py`
loads the frozen config directly. Full 500-row conservative apply still
reproduces `0.4352 -> 0.4026`, 381 better / 15 same / 0 worse.
- Frozen deployable artifact: `models/candidate_ranker_v1.json` now includes feature means/stds,
logistic weights/bias, calibrated modes, candidate-generator config, and the
serialized ASR->clean confusion maps. It no longer needs training rows at
inference time.
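
Since the frozen artifact carries feature means/stds and logistic weights/bias, inference reduces to standardizing the features and applying a sigmoid. The sketch below illustrates that step only. The JSON keys and feature names (`feature_names`, `means`, `stds`, `weights`, `bias`, `ctc_gain`, `edit_ops`) and all numeric values are hypothetical; the real layout of `candidate_ranker_v1.json` may differ.

```python
import json
import math

# Hypothetical config layout and values, for illustration only.
EXAMPLE_CONFIG = json.loads("""
{
  "feature_names": ["ctc_gain", "edit_ops"],
  "means": [0.0, 2.0],
  "stds": [1.0, 0.5],
  "weights": [1.5, -0.5],
  "bias": 0.25
}
""")


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score_candidate(config, features):
    """Standardize raw features, then apply the frozen logistic model."""
    z = config["bias"]
    for name, mean, std, weight in zip(
        config["feature_names"], config["means"], config["stds"], config["weights"]
    ):
        z += weight * (features[name] - mean) / std
    return sigmoid(z)


print(score_candidate(EXAMPLE_CONFIG, {"ctc_gain": 1.0, "edit_ops": 2.0}))
```

Because the means and stds travel with the weights, no training rows are needed to reproduce the standardization at inference time.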

Interpretation

This is the first test of the deterministic bounded candidate ranker on a broader slice. The oracle rows select candidates using the references, so they are only a ceiling and not a deployable result. The deployable rows use deterministic candidates, anchor CTC candidate features, featural/op features, and a tiny logistic ranker trained and tuned only on the original clean-anchor pilot split.

The correction gain generalizes. The initial pilot-tuned threshold (`0.65`) was not
a zero-worse guarantee externally (6 worse rows), so the frozen deployable config
now carries audited held-out modes. Use `conservative` for zero-worse automatic
correction (`-3.26pp`, 381 better / 0 worse). Use `preservation` where "do not
touch likely-good rows" matters more than maximum gain (`-1.64pp`, 0 worse). Use
`aggressive`/`balanced` for offline corpus improvement or human-review queues.
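
The report does not describe how the audited thresholds were chosen. One plausible rule for a zero-worse mode is to take the lowest threshold at which no accepted row gets worse. The sketch below is an assumption about such a procedure, not the repository's calibration code, and the function name and toy rows are illustrative.

```python
def lowest_zero_worse_threshold(rows):
    """Pick the smallest observed score that admits no "worse" row.

    rows: iterable of (ranker_score, outcome) with outcome in
    {"better", "same", "worse"}. Accepting every row with score >= the
    returned value includes no "worse" outcome. Returns None if no
    such threshold exists among the observed scores.
    """
    rows = list(rows)
    worse_scores = [score for score, outcome in rows if outcome == "worse"]
    if not worse_scores:
        return min((score for score, _ in rows), default=None)
    cutoff = max(worse_scores)
    # Only scores strictly above the highest-scoring "worse" row are safe.
    safe = [score for score, _ in rows if score > cutoff]
    return min(safe, default=None)


# Toy rows, not taken from the experiment.
toy = [
    (0.99, "better"),
    (0.95, "better"),
    (0.90, "worse"),
    (0.85, "better"),
    (0.70, "same"),
]
print(lowest_zero_worse_threshold(toy))
```

A threshold chosen this way is only zero-worse on the rows it was calibrated against, so it should be re-audited whenever the slice changes.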

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

nko-brain-scanner/experiments/acoustic_gate/RANKER-GENERALIZATION-REPORT.md
