Grand Diomande Research · Full HTML Reader

Stage 3 Report — Bounded Edit-Op Corrector

**Status:** done. The edit-op interface is valid and much faster than full-string correction, but the trained proposer collapsed to COPY and does **not** improve clean-anchor CER.



Headline

Full-string correction was the wrong serving shape: Codex's clean LoRA trained successfully, but its
1,381-row generation run was stopped after 39+ minutes without completing. Stage 3 replaced
full-string output with bounded edit operations (`COPY`, `SUB`, `INS`, `DEL`) that are scored by
`featural_edit.py` and applied deterministically by `edit_ops.py`.
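
To make "applied deterministically" concrete, here is a minimal sketch of an applier. The op schema (`op`, `pos`, `char` keys, with positions indexing the original candidate string) and the name `apply_ops` are assumptions for illustration; this report does not show the actual format used by `edit_ops.py`.

```python
from typing import Dict, List

MAX_EDIT_OPS = 3  # mirrors the "over 3 ops" drop rule used for the dataset


def apply_ops(candidate: str, ops: List[Dict]) -> str:
    """Apply a bounded edit script to `candidate` and return the new string.

    Hypothetical schema: {"op": "SUB"|"INS"|"DEL"|"COPY", "pos": int, "char": str}.
    """
    edits = [o for o in ops if o["op"] != "COPY"]
    if len(edits) > MAX_EDIT_OPS:
        raise ValueError("edit script exceeds the op bound")
    positions = [o["pos"] for o in edits]
    if len(set(positions)) != len(positions):
        raise ValueError("this sketch does not resolve ops sharing a position")

    chars = list(candidate)
    # Apply right-to-left so that lower positions are still valid afterwards.
    for o in sorted(edits, key=lambda o: o["pos"], reverse=True):
        kind, pos = o["op"], o["pos"]
        if kind == "INS":
            if not 0 <= pos <= len(candidate):
                raise ValueError("INS position out of range")
            chars.insert(pos, o["char"])  # insert before original index `pos`
        elif kind in ("SUB", "DEL"):
            if not 0 <= pos < len(candidate):
                raise ValueError(f"{kind} position out of range")
            if kind == "SUB":
                chars[pos] = o["char"]
            else:
                del chars[pos]
        else:
            raise ValueError(f"unknown op: {kind}")
    return "".join(chars)
```

A script containing only `COPY` returns the candidate unchanged, which is the behavior every generated output in this stage ended up having.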

The interface works only when decoding is constrained by a prompt that ends with an opening bracket:

```text
ASR: <candidate>
OPS: [
```

The generated tail is parsed as `[` + tail. Without this prefix, Gemma drifts into N'Ko phrase loops
instead of JSON. With the prefix, the full 1,381-row generation is valid and bounded, but every output
is `COPY`.
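
The prefix trick can be sketched as follows: the prompt already ends in `[`, so the parser re-attaches that bracket to the generated tail and keeps only the first complete JSON array. The function names and the per-op object schema (an `op` key) are assumptions for illustration, not the parser used in this experiment.

```python
import json
from typing import List, Optional

VALID_OPS = {"COPY", "SUB", "INS", "DEL"}


def build_prompt(candidate: str) -> str:
    # The trailing "[" forces generation to start inside a JSON array.
    return f"ASR: {candidate}\nOPS: ["


def parse_tail(tail: str, max_edit_ops: int = 3) -> Optional[List[dict]]:
    """Return the parsed op list, or None if the output fails the schema."""
    text = "[" + tail
    try:
        # raw_decode stops at the end of the first JSON value, so any text
        # the model emits after the closing "]" is ignored.
        ops, _end = json.JSONDecoder().raw_decode(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(ops, list) or not ops:
        return None
    if any(not isinstance(o, dict) or o.get("op") not in VALID_OPS for o in ops):
        return None
    if sum(o["op"] != "COPY" for o in ops) > max_edit_ops:
        return None  # enforce the bound on edit operations
    return ops
```

Treating any parse or schema failure as `None` lets the caller fall back to the unchanged ASR candidate instead of applying a malformed script.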

Dataset

Builder: `build_editop_sft.py`

Source: `experiments/acoustic_gate/datasets/clean_proposer_sft_v1`

Final output: `overnight/editop_sft_v2/`

| metric | value |
| --- | --- |
| rows in | 86,990 |
| rows kept | 86,043 |
| rows dropped, over 3 ops | 947 |
| median edit ops | 0 |
| max edit ops observed | 3 |
| median target chars | 10 |
| max target chars | 50 |

Operation counts:

| op | count |
| --- | --- |
| COPY | 51,031 |
| SUB | 25,939 |
| DEL | 9,837 |
| INS | 8,414 |
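
As a quick arithmetic check on how dominant `COPY` is in the training targets, the counts above can be turned into shares. The per-row share assumes each unchanged row contributes exactly one `COPY` op, which this report does not state explicitly.

```python
op_counts = {"COPY": 51031, "SUB": 25939, "DEL": 9837, "INS": 8414}
rows_kept = 86043

total_ops = sum(op_counts.values())
copy_op_share = op_counts["COPY"] / total_ops
# Assumption: one COPY op per unchanged row, so COPY count == unchanged rows.
copy_row_share = op_counts["COPY"] / rows_kept

print(f"total ops: {total_ops}")                              # total ops: 95221
print(f"COPY share of ops: {copy_op_share:.1%}")              # 53.6%
print(f"COPY share of rows (assumed): {copy_row_share:.1%}")  # 59.3%
```

Under that assumption, a majority of rows are pure `COPY`, which is consistent with the reported median of 0 edit ops.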

Latency target passed at the data level: every kept target is at most 3 edit ops (median 0), not a
full transcript.
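
One way such targets could be derived from (candidate, reference) pairs is sketched below using `difflib.SequenceMatcher` from the Python standard library. This is an illustrative stand-in, not the logic of `build_editop_sft.py`: the op schema is hypothetical, and `SequenceMatcher` does not guarantee a minimal edit script, so op counts may differ from a true Levenshtein alignment.

```python
from difflib import SequenceMatcher
from typing import Dict, List

MAX_EDIT_OPS = 3


def diff_to_ops(candidate: str, target: str) -> List[Dict]:
    """Convert a character-level diff into single-character edit ops."""
    ops: List[Dict] = []
    matcher = SequenceMatcher(a=candidate, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        src, dst = candidate[i1:i2], target[j1:j2]
        shared = min(len(src), len(dst))
        for k in range(shared):                 # overlapping part: substitutions
            ops.append({"op": "SUB", "pos": i1 + k, "char": dst[k]})
        for k in range(shared, len(src)):       # extra source chars: deletions
            ops.append({"op": "DEL", "pos": i1 + k})
        for k in range(shared, len(dst)):       # extra target chars: insertions
            ops.append({"op": "INS", "pos": i2, "char": dst[k]})
    return ops or [{"op": "COPY"}]


def keep_row(candidate: str, target: str) -> bool:
    """Apply the bound: drop rows that need more than MAX_EDIT_OPS edits."""
    ops = diff_to_ops(candidate, target)
    return sum(o["op"] != "COPY" for o in ops) <= MAX_EDIT_OPS
```

Rows where the candidate already equals the target become a single `COPY`, and rows needing more than three edits are filtered out, matching the drop rule in the table above.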

Training

v1: verbose prompt

Adapter: `[home]/agp_pilot/adapter_editop_v1`

Full 600-iter training completed. Best validation was at iter 200:

| checkpoint | val loss |
| --- | --- |
| iter 1 | 3.763 |
| iter 100 | 0.673 |
| iter 200 | 0.605 |
| iter 300 | 0.620 |
| iter 400 | 0.631 |
| iter 500 | 0.693 |
| iter 600 | 0.625 |

Smoke generation failed the schema check: 3/20 outputs were valid JSON scripts and 17/20 were invalid.
All 3 valid scripts were `COPY`. The raw generations were N'Ko text, not edit syntax. Diagnosis: the
verbose prompt produced max-sequence truncation warnings, so the model likely did not reliably see
the JSON target during training.

v2: compact prompt

Adapter: `[home]/agp_pilot/adapter_editop_v2`

Prompt: `ASR: <candidate>\nOPS: `

Full 600-iter training completed:

| checkpoint | val loss |
| --- | --- |
| iter 1 | 3.417 |
| iter 100 | 1.143 |
| iter 200 | 1.064 |
| iter 300 | 1.085 |
| iter 400 | 1.066 |
| iter 500 | 1.064 |
| iter 600 | 1.032 |

Peak memory: 4.558 GB.

Plain compact generation still failed the schema check: 1/20 outputs were valid JSON scripts; the
other 19/20 were invalid N'Ko loops.

Constrained-prefix generation fixed schema:

| eval | valid JSON | invalid | median sec/row | changed |
| --- | --- | --- | --- | --- |
| 20-row smoke, `compact_open` | 20/20 | 0 | 0.319 | 0 |
| full 1,381, `compact_open` | 1,381/1,381 | 0 | 0.493 | 0 |

Full generation report:

`accepted_by_cost`: 1,381
`copy`: 1,381
`changed`: 0
`max_featural_cost`: 0.0
output report: `overnight/editop_anchor_eval_v2/generation_report.json`

Clean-Anchor Robust Eval

Robust eval:

```bash
KMP_DUPLICATE_LIB_OK=TRUE python3 experiments/acoustic_gate/robust_eval_anchor_clean.py \
  --raw experiments/acoustic_gate/overnight/editop_anchor_eval_v2/proposals_editop_v1.jsonl \
  --clean experiments/acoustic_gate/overnight/editop_anchor_eval_v2/proposals_editop_v1.jsonl \
  --output experiments/acoustic_gate/overnight/editop_anchor_eval_v2/clean_anchor_eval_editop_v2_report.json
```

Result:

| condition | CER | delta | accepted | better/same/worse |
| --- | --- | --- | --- | --- |
| ASR baseline | 0.3514 | - | - | - |
| raw + acoustic gate | 0.3514 | +0.00pp | 0 | 0/0/0 |
| clean + acoustic gate | 0.3514 | +0.00pp | 0 | 0/0/0 |
| clean + preserve + acoustic | 0.3514 | +0.00pp | 0 | 0/0/0 |

Proposal diagnostics:

raw: 0 changed / 0 better / 1,381 same / 0 worse
clean: 0 changed / 0 better / 1,381 same / 0 worse
preservation AUC remains 0.7397 on clean anchor labels.
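
For readers unfamiliar with the metric, the sketch below shows one common way to compute CER: total character-level Levenshtein distance divided by total reference characters. Whether `robust_eval_anchor_clean.py` aggregates at corpus level like this or averages per row is not stated here, so treat the aggregation as an assumption.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def corpus_cer(hyps: Sequence[str], refs: Sequence[str]) -> float:
    """Total edits over total reference characters (assumed aggregation)."""
    edits = sum(levenshtein(h, r) for h, r in zip(hyps, refs))
    chars = sum(len(r) for r in refs)
    return edits / chars if chars else 0.0
```

Because every proposal is `COPY`, the corrected hypotheses are identical to the ASR baseline, so any CER computed this way is unchanged by construction. That is why every condition reports a +0.00pp delta.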

Interpretation

This is a useful negative result, not a win. As an interface, bounded edit operations remove the
latency failure of full-string generation (stopped after 39+ minutes), and constrained-prefix
decoding produced parseable output for all 1,381 rows. But the Gemma LoRA learned the safe prior too
strongly and emits `COPY` for every row, so it cannot close the correction loop.

The next correction architecture should not ask Gemma to freely serialize both decision and edit.
Use a smaller structured head or two-step system instead:

1. classify `COPY` vs `EDIT` with calibrated confidence;
2. only when `EDIT`, predict a bounded op from a constrained candidate set;
3. score candidates by featural cost plus anchor acoustic support;
4. keep Gemma out of live token-by-token generation unless its output space is grammar-constrained.
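
The steps above can be sketched as a small decision function. Everything here is hypothetical: the `Candidate` fields, the threshold, the weighting, and the way featural cost and acoustic support are combined are placeholders for a design that has not been built or evaluated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str                # candidate string after one bounded edit
    featural_cost: float     # lower means a more plausible edit
    acoustic_support: float  # higher means better agreement with the audio


def correct(asr: str, p_edit: float, candidates: List[Candidate],
            edit_threshold: float = 0.8, acoustic_weight: float = 1.0,
            max_score: float = 0.0) -> str:
    """Return the ASR string unchanged (COPY) or the best-scoring edit."""
    # Step 1: stay with COPY unless the classifier is confident about EDIT.
    if p_edit < edit_threshold or not candidates:
        return asr

    # Steps 2-3: rank the constrained candidate set by a combined score.
    def score(c: Candidate) -> float:
        return c.featural_cost - acoustic_weight * c.acoustic_support

    best = min(candidates, key=score)
    # Fall back to COPY when even the best candidate scores poorly.
    return best.text if score(best) <= max_score else asr
```

Separating the `COPY` vs `EDIT` decision from the edit choice means the safe prior only has to be overcome once, by a calibrated classifier, rather than at every generated token.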

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

nko-brain-scanner/experiments/acoustic_gate/overnight/STAGE3-REPORT.md
