Ranker Serving Handoff — ANE/TurboQuant Wraps the Tiny Ranker, Not Gemma
Decision
The live correction lane is now:
anchor ASR hyp + logits/self-score
-> deterministic bounded candidates
-> frozen logistic candidate ranker
-> calibrated mode threshold
-> deterministic corrected text

Gemma is out of the live path. Full-string Gemma failed the latency gate; edit-op
Gemma met the schema only under constrained decode and collapsed to `COPY`. The
deployable correction model is the tiny deterministic ranker config:
`models/candidate_ranker_v1.json`
That JSON carries feature means/stds, logistic weights/bias, calibrated operating
modes, candidate-generator config, and serialized ASR->clean confusion maps. It
does not need training rows at inference time.
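
A minimal sketch of how a frozen config of this kind could be applied at inference time: standardize the features, apply the logistic weights, and compare against the selected mode's threshold. The field names (`feature_means`, `feature_stds`, `weights`, `bias`, `modes`) and every number below are hypothetical; the actual schema of `candidate_ranker_v1.json` is not shown in this document.

```python
import json
import math

# Hypothetical config layout and values; the real JSON schema may differ.
EXAMPLE_CONFIG = json.loads("""
{
  "feature_means": [0.0, 1.0],
  "feature_stds": [1.0, 2.0],
  "weights": [1.5, -0.5],
  "bias": 0.0,
  "modes": {"conservative": 0.9, "balanced": 0.5}
}
""")


def ranker_probability(config, features):
    """Standardize the raw features, then apply the logistic model."""
    z = config["bias"]
    for x, mean, std, w in zip(features, config["feature_means"],
                               config["feature_stds"], config["weights"]):
        z += w * (x - mean) / std
    return 1.0 / (1.0 + math.exp(-z))


def accept(config, features, mode):
    """Accept a candidate only if its probability clears the mode threshold."""
    return ranker_probability(config, features) >= config["modes"][mode]
```

Nothing here touches training data: the config alone determines the decision, which is the property the paragraph above relies on.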
Current Performance
500 disjoint true anchor-test rows beyond the original 1,381 pilot:
| mode | CER | delta pp | better/same/worse |
|---|---|---|---|
| baseline | 0.4352 | +0.00 | 0/0/0 |
| aggressive/balanced | 0.3986 | -3.66 | 439/41/2 |
| conservative | 0.4026 | -3.26 | 381/15/0 |
| preservation | 0.4188 | -1.64 | 196/14/0 |
Use `conservative` for automatic correction. Use `aggressive`/`balanced` for
offline corpus improvement or human-review queues.
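
For reference, the table's columns can be computed along the following lines. This is a sketch under assumptions: it takes aggregate CER to be total character edits divided by total reference characters, and it tallies every row. The source does not state either convention, and the reported tallies do not sum to 500, so the real tally may cover only a subset of rows.

```python
def edit_distance(ref, hyp):
    """Character-level Levenshtein distance using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def aggregate_cer(refs, hyps):
    """Total edits over total reference characters (assumed definition)."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return edits / chars


def better_same_worse(refs, baseline_hyps, corrected_hyps):
    """Count rows where correction lowers, keeps, or raises the edit distance."""
    better = same = worse = 0
    for ref, base, corr in zip(refs, baseline_hyps, corrected_hyps):
        before = edit_distance(ref, base)
        after = edit_distance(ref, corr)
        if after < before:
            better += 1
        elif after == before:
            same += 1
        else:
            worse += 1
    return better, same, worse
```

The `delta pp` column is then `100 * (corrected_cer - baseline_cer)`; for the conservative row, `100 * (0.4026 - 0.4352)` gives `-3.26`.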
Serving Contract
The serving ASR path must provide, per utterance:
- `feat_id`
- `asr_hyp`
- `asr_score`
- CTC log-prob access for candidate scoring
The correction layer then:
1. Generates bounded `SUB`/`DEL`/`INS` candidates from frozen confusion maps.
2. Scores each candidate with the ASR CTC model/log-probs.
3. Computes featural/op/CTC/prior features.
4. Applies the frozen logistic ranker in the selected mode.
5. Emits either `COPY` or a deterministic corrected string.
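
Step 1 can be sketched as follows. This is an illustration, not the packaged generator: it assumes single-edit candidates, a hypothetical map layout (`sub_map`, `del_chars`, `ins_map`), and a hypothetical `max_candidates` cap as the bound. Determinism comes from iterating positions and map entries in a fixed order.

```python
def generate_candidates(hyp, sub_map, del_chars, ins_map, max_candidates=32):
    """Single-edit SUB/DEL/INS candidates in a fixed order, capped in size."""
    candidates = [("COPY", hyp)]  # leaving the hypothesis unchanged is always an option
    seen = {hyp}

    def add(op, text):
        # Skip duplicates and stop growing once the cap is reached.
        if text not in seen and len(candidates) < max_candidates:
            seen.add(text)
            candidates.append((op, text))

    for i, ch in enumerate(hyp):
        for repl in sub_map.get(ch, ()):    # SUB: replace ch with a confusable char
            add("SUB", hyp[:i] + repl + hyp[i + 1:])
        if ch in del_chars:                 # DEL: drop a likely spurious char
            add("DEL", hyp[:i] + hyp[i + 1:])
        for extra in ins_map.get(ch, ()):   # INS: insert a char after ch
            add("INS", hyp[:i + 1] + extra + hyp[i + 1:])
    return candidates
```

For example, `generate_candidates("ab", {"a": ["o"]}, {"b"}, {"a": ["n"]})` yields `COPY "ab"`, `SUB "ob"`, `INS "anb"`, and `DEL "a"`, in that order on every call.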
Serving Verification
The deployable path has now been exercised in four progressively lower-level
forms:
1. Packaged Python modules. `apply_ranked_correction.py --mode conservative`
loads only the frozen JSON config and reproduces the audited 500-row result:
CER `0.4352 -> 0.4026` (`-3.26pp`), 381 better / 15 same / 0 worse.
2. One-pass CTC-logit reuse. `decode_and_correct.py` decodes the anchor ASR
and applies the ranker from the same `log_probs` tensor used for greedy ASR,
instead of rerunning the head per candidate. On the same 500 rows it exactly
reproduces conservative mode: aggregate CER `0.435201 -> 0.402613`
(`-3.2588pp`), 381 better / 15 same / 0 worse.
3. Swift correction export. `export_ranker_for_serving.py` emits a Swift
Package at `ios/NKORanker/` with both the frozen scalar ranker
(`NKOCandidateRankerV1.swift`) and the deterministic serving correction engine
(`NKOCorrectionEngineV1.swift`). The Swift engine embeds the frozen confusion
maps/config, generates bounded COPY/SUB/DEL/INS candidates, scores candidates
with a small CTC forward algorithm, and selects through the ranker. `swift
test` passes 5 tests: probability fixture, mode acceptance, candidate
generation, CTC target preference, and short-edit selection.
4. iOS physical-device validation. Direct SwiftPM/XCTest execution on a
physical device is not available for package-only test targets, so
`ios/NKORankerDeviceHarness/` provides a tiny XcodeGen host app. The hosted
test imports the local `NKORanker` package and runs the conservative fixture on
Mohamed's paired iPad (`platform=iOS,id=00008120-000A6D3221080032`). Result:
`NKORankerDeviceHarnessTests/testConservativeFixtureOnDevice` passed on-device
in `0.003s`, `1 test, 0 failures`, `TEST SUCCEEDED`.
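
The one-pass idea in items 2 and 3 is that a single per-frame `log_probs` grid serves both greedy decoding and candidate scoring. A minimal pure-Python sketch, assuming blank index `0` and a `[frames][classes]` list of log-probabilities; it uses the standard CTC forward recursion and is not the code in `decode_and_correct.py` or the Swift engine.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def greedy_decode(log_probs, blank=0):
    """Argmax per frame, collapse repeats, drop blanks."""
    out, prev = [], blank
    for frame in log_probs:
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != blank and best != prev:
            out.append(best)
        prev = best
    return out


def ctc_log_likelihood(log_probs, target, blank=0):
    """log P(target | frames) via the CTC forward recursion."""
    ext = [blank]                      # target interleaved with blanks
    for t in target:
        ext += [t, blank]
    alpha = [NEG_INF] * len(ext)
    alpha[0] = log_probs[0][blank]
    if len(ext) > 1:
        alpha[1] = log_probs[0][ext[1]]
    for frame in log_probs[1:]:
        new = [NEG_INF] * len(ext)
        for s, sym in enumerate(ext):
            terms = [alpha[s]]         # stay on the same state
            if s >= 1:
                terms.append(alpha[s - 1])   # advance by one state
            if s >= 2 and sym != blank and sym != ext[s - 2]:
                terms.append(alpha[s - 2])   # skip the blank between distinct symbols
            new[s] = logsumexp(*terms) + frame[sym]
        alpha = new
    if len(ext) == 1:
        return alpha[0]
    return logsumexp(alpha[-1], alpha[-2])
```

Each candidate is scored by calling `ctc_log_likelihood` on the same grid that produced the greedy hypothesis, so the head runs once per utterance regardless of how many candidates are ranked.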
CoreML note: the ranker itself does not need CoreML; Swift scalar math is the
verified serving artifact. The ASR CTC head now has a reproducible CoreML export
path via `export_anchor_ctc_head_coreml.py` (Python 3.11 + torch 2.7.0 +
coremltools 9.0 in `/Volumes/HD1/venvs/nko-coreml`). The validated package is
`/Volumes/HD1/nko_coreml/anchor_ctc_head_valid_fp32.mlpackage`.
ANE / TurboQuant Placement
ANE and TurboQuant are serving-stack tools, not correction-brain tools.
- ANE: run the heavy frozen Whisper/ASR feature path efficiently on-device,
but only if train/serve feature consistency is maintained. The earlier
1500-vs-375-frame mismatch is the failure mode to avoid.
- TurboQuant: apply to the ASR/encoder/head serving stack only after CER-cost
measurement. The ranker itself is already tiny JSON math and does not need
quantization.
- Ranker: run as CPU/Accelerate/Swift scalar math or a tiny CoreML logistic
layer. It is not worth routing through Gemma or a large neural proposer.
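
One way to guard against the frame-count mismatch is to check the feature shape before the head runs. A small sketch: the shapes are the ones this document reports for the anchor CTC head export, while the function name and error text are hypothetical.

```python
EXPECTED_FEATURE_SHAPE = (1, 1500, 1280)  # native HF Whisper features
EXPECTED_LOGIT_SHAPE = (1, 375, 66)       # anchor CTC head logits


def check_feature_contract(feature_shape):
    """Fail closed when serve-time features do not match the head's input contract."""
    shape = tuple(feature_shape)
    if shape != EXPECTED_FEATURE_SHAPE:
        raise ValueError(
            f"feature shape {shape} != expected {EXPECTED_FEATURE_SHAPE}; "
            "pre-downsampled features are not a drop-in input"
        )
    return shape
```

A `(1, 375, 1280)` tensor, for instance, is rejected here instead of being decoded into silently degraded text.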
TurboQuant ASR-Stack Smoke
`test_turboquant_asr_stack.py` measures TurboQuant where it belongs: around the
real anchor Whisper feature tensors before the CTC head. It quantizes/dequantizes
the 500 disjoint anchor-test feature rows (`[1,1500,1280]` fp16), decodes through
the same anchor CTC head, and compares CER against the unquantized baseline.
| feature path | aggregate CER | delta pp | compression vs fp16 | avg MSE | changed vs baseline |
|---|---|---|---|---|---|
| baseline fp16->fp32 | 0.435201 | +0.0000 | 1.00x | - | - |
| TurboQuant affine8:g32 | 0.435309 | +0.0108 | 1.78x | 0.000021 | 16 |
| TurboQuant affine4:g32 | 0.435094 | -0.0108 | 3.20x | 0.006050 | 150 |
Interpretation: 8-bit group-wise affine quantization is effectively decode-stable
on this slice. 4-bit is also CER-neutral within measurement noise and gives the
deployment-relevant compression. This is not an ANE throughput claim; it is
the CER-cost gate that says ANE/TurboQuant serving experiments are worth doing
without expecting immediate ASR collapse.
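
A generic sketch of group-wise affine quantization, to make the `affine8:g32` and `affine4:g32` rows concrete. TurboQuant's actual implementation is not shown in this document, so the rounding scheme and the metadata layout below are assumptions. The compression helper assumes each group of 32 values stores 32 bits of scale/offset metadata (for example two fp16 numbers), an assumption that happens to reproduce the reported `1.78x` and `3.20x`.

```python
def quantize_group(values, bits):
    """Affine-quantize one group to unsigned integer codes plus (scale, offset)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, offset):
    """Map integer codes back to approximate float values."""
    return [offset + scale * c for c in codes]


def quant_dequant(row, bits, group_size=32):
    """Round-trip a flat feature row through group-wise affine quantization."""
    out = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        codes, scale, offset = quantize_group(group, bits)
        out.extend(dequantize_group(codes, scale, offset))
    return out


def compression_vs_fp16(bits, group_size=32, metadata_bits=32):
    """Assumed layout: `metadata_bits` of scale/offset stored per group."""
    return 16.0 / (bits + metadata_bits / group_size)
```

Under that layout, 8-bit costs 9 bits per value (`16 / 9 ≈ 1.78x`) and 4-bit costs 5 bits per value (`16 / 5 = 3.20x`).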
TurboQuant Serving-Stack Smoke
`benchmark_turboquant_serving_stack.py` extends the previous ASR-only smoke to
the complete deterministic serving shape:
feature -> optional TurboQuant quant/dequant -> anchor CTC head -> greedy ASR
-> deterministic candidates -> conservative frozen ranker -> corrected text

This is still local PyTorch/MPS evidence, not iPad/ANE throughput. It is the
CER/timing gate that tests whether TurboQuant breaks the correction stack, not
just the base ASR decode. On the full 500-row disjoint anchor-test slice:
| feature path | ASR CER | corrected CER | ranker delta pp | compression vs fp16 | q/dq ms | head ms | ranker ms | better/same/worse |
|---|---|---|---|---|---|---|---|---|
| fp32 | 0.435201 | 0.402613 | -3.26 | 0.50x | 0.00 | 111.23 | 133.70 | 381/15/0 |
| affine8:g32 | 0.435309 | 0.402613 | -3.27 | 1.78x | 27.29 | 109.36 | 131.85 | 382/15/0 |
| affine4:g32 | 0.435094 | 0.402345 | -3.27 | 3.20x | 34.14 | 109.45 | 130.74 | 382/15/0 |
Interpretation: TurboQuant survives the full deterministic correction path on
the complete held-out slice. 8-bit exactly matches fp32 aggregate corrected CER;
4-bit is slightly better within measurement noise. Both keep `ranker_worse=0`.
The correction ranker remains tiny scalar math and is not quantized.
CoreML / ANE Status
The ASR-stack CoreML export is now validated; the physical ANE half is not yet
validated. Current evidence:
- The previous Homebrew Python 3.14 `coremltools` install was degraded
(`BlobWriter` / native library failures). A repaired Python 3.11 environment on
HD1 now exports and runs the anchor CTC head.
- Export target: `UnifiedCTCHead(num_classes=66, use_trajectory=True,
use_tar=False, use_ttt=False)` consuming native HF Whisper features
`[1,1500,1280]` and producing logits `[1,375,66]`.
- FP32 CoreML runtime parity passes on macOS:
`parity_mse=7.9046e-11`, `parity_max_abs=0.000172`, and greedy CTC output
matches PyTorch. Report: `ane_coreml_export_report.json`.
- Physical iPad CoreML runtime parity also passes with the compiled FP32 head,
one real native feature fixture, and the Swift ranker package in the same
XCTest harness:
`/Volumes/HD1/nko_coreml/AnchorHeadDeviceHarness`; result bundle:
`/Volumes/HD1/tmp/AnchorHeadHarness-iPad-combined.xcresult`. XCTest
`AnchorHeadDeviceHarnessTests/testAnchorHeadCoreMLParityOnDevice` ran 7 warmed
predictions for `MLComputeUnits.cpuOnly` and `cpuAndNeuralEngine`, then 100k
ranker evaluations. Result: `cpu_only_avg_ms=44.145`,
`cpu_ne_avg_ms=44.234`, `parity_mse=3.9121e-11`,
`parity_max_abs=0.0001097`, `argmax_mismatches=0`,
`ranker_avg_us=1.737`, `TEST SUCCEEDED`. Repo report:
`ane_device_anchor_head_report.json`.
- FP16 CoreML export is not production-valid yet: it saved and ran, but
parity failed (`parity_mse≈27.18`, mismatched greedy output). Use FP32 until a
separate FP16/quantized-CoreML parity fix lands.
- The existing Comp-Core `agp-turboquant-ane` benchmark proves only a constrained
route-head proxy: ANE bridge available/compiled, but ANE eval failed
(`ANEBridgeError('ANE eval failed')`). It is explicitly not a full ASR-stack
claim.
- Whisper encoder path audit: no reusable clean-anchor CoreML Whisper encoder
artifact was found locally. The historical
`/Volumes/HD1/Mac4-Offload/Desktop/ane-training` folder contains 1,381
already-extracted `(375,1280)` fp16 feature tensors plus an MLX CTC-head
trainer/checkpoint, but no `.mlmodel`, `.mlmodelc`, or `.mlpackage`. The clean
anchor head expects native `(1,1500,1280)` features and downsamples internally,
so that old feature path is not a drop-in serving encoder.
- New Whisper-large-v3 encoder export: `export_whisper_encoder_coreml.py`
verifies the actual model config (`num_mel_bins=128`,
`max_source_positions=1500`, `d_model=1280`), smokes the torch encoder
contract `[1,128,3000] -> [1,1500,1280]`, traces it, exports a FP32 CoreML ML
Program, and compiles it with `coremlcompiler`. The successful package lives
outside git on an APFS sparse image because direct packaging on `/Volumes/HD1`
fails on AppleDouble `._weight.bin` sidecars:
`/Volumes/NKOCoreMLWork/whisper_large_v3_encoder_fp32.mlpackage`;
compiled model:
`/Volumes/NKOCoreMLWork/compiled/whisper_large_v3_encoder_fp32.mlmodelc`.
Report: `whisper_encoder_coreml_export_report.json`. See
`WHISPER-ENCODER-PATH-AUDIT.md`.
- External device harness staging: `/Volumes/HD1/nko_coreml/AnchorHeadDeviceHarness`
now has both `testWhisperEncoderCoreMLParityOnDevice` and
`testWhisperEncoderToHeadToRankerPipelineOnDevice`. It also now has
`testRealAudioWhisperEncoderToHeadToRankerPipelineOnDevice`, backed by a real
local Djoko WAV converted through ffmpeg + the Whisper-large-v3 128-mel feature
extractor. The real-audio test now calls `NKOCorrectionEngineV1` with the
expected CTC logits fixture, so the compiled iOS harness exercises:
greedy decode -> deterministic candidates -> CTC scoring -> Swift ranker
selection. Stable resources live under
`/Volumes/HD1/nko_coreml/device_harness_resources`: compiled encoder, random
and real-audio mel fixtures, expected encoder output, expected CTC-head
logits/argmax, and manifests. `xcodebuild build-for-testing` passed for the
chained real-audio pipeline harness when DerivedData was placed on HD1
(`/Volumes/HD1/tmp/NKORealAudioPipelineHarnessDD`), including the final
correction-engine build. A later focused iPhone 16 Plus runtime attempt
executed `0` tests and logged `Error locating DeviceSupport directory`; this
is invalid as proof. DeviceSupport has now been moved to an HD1-backed symlink,
iPhone (7) DeviceSupport was prepared successfully, and the trace runner fails
closed on zero-test/DeviceSupport/device-offline conditions. The latest iPhone
(7) run executes one real XCTest, but the FP32 Whisper encoder fails while
CoreML builds the on-device execution plan (`BNNS` storage-reader version
error, Espresso `model.mil:511:12` index-out-of-bounds, CoreML error `-14`).
A later CPU-only rerun of `testWhisperEncoderCoreMLParityOnDevice` still fails
before prediction with the same BNNS/Espresso/CoreML `model.mil:511` error,
even after a reproducible explicit-position MIL patch changed the positional
constant from `[1500,1280]` to `[1,1500,1280]`. Instruments also failed to
record the device, so there is still no ANE/GPU/CPU placement evidence.
Report: `whisper_encoder_device_harness_report.json`.
- Split-graph device probes now narrow the encoder blocker. `convpos`
(`mel -> conv1/conv2 -> explicit positions`) passes on iPhone (7) with
CPU-only `42.454ms`, CPU+NE requested `31.174ms`, `mse=3.39e-14`.
Standalone `layer00` passes with CPU-only `110.054ms`, CPU+NE requested
`98.289ms`, `mse=3.27e-13`. Multi-layer prefix probes show that
`layers00_01` and `layers00_02` both pass on iPhone with tight parity
(`mse=3.71e-13` and `4.06e-13`), while `layers00_03` executes but is
numerically wrong (`mse=0.03256`, max abs `1.466`). Standalone layer 3 passes
when fed the `layers00_02` fixture (`mse=5.49e-14`), so the failure is the
four-layer package composition/device lowering, not layer 3 itself. Report:
`whisper_split_device_probe_report.json`. These are requested compute-unit
timings, not ANE placement proof.
- The split encoder is now staged through the full Whisper-large-v3 encoder
chain. `layers04_06`, `layers07_09`, `layers10_12`, `layers13_15`,
`layers16_18`, `layers19_21`, `layers22_24`, `layers25_27`, `layers28_30`,
standalone `layer31`, and `finalnorm` export, compile, pass macOS CoreML
parity, and are bundled into the external XCTest harness. Later parity stays
tight: `layers13_15` `mse=1.88e-12`, `layers16_18` `8.92e-13`,
`layers19_21` `1.97e-13`, `layers22_24` `1.01e-13`, `layers25_27`
`1.23e-13`, `layers28_30` `1.83e-13`, `layer31` `7.51e-14`, and
`finalnorm` `7.89e-15`.
- Generic iOS `build-for-testing` passes with the full split chain and the
real-audio split pipeline XCTest embedded. The built host app is 3.9GB and the
embedded XCTest plug-in is 2.7GB. The test binary contains
`WHISPER_SPLIT_FULL_ENCODER_DEVICE_BENCH`,
`WHISPER_SPLIT_PIPELINE_DEVICE_BENCH`, and
`WHISPER_SPLIT_REAL_AUDIO_PIPELINE_DEVICE_BENCH`.
- Historical physical-device attempts stopped on locked-device or install-delta
preflight before the final process-attached proof succeeded. Those failures are
retained in `whisper_split_device_probe_report.json` because they justify the
current runner design: existing-app launch preflight, strict readiness mode,
output-disk guard, runtime-marker analysis, traced-runtime analysis, CoreML
trace analysis, and fail-closed `proof-summary.json`.
- Full physical runtime and trace proof is now complete at
`/Volumes/HD1/tmp/nko_real_audio_device_watch_active_20260603/20260603_075734_iPhone7_proof_attempt1`.
The proof emits `WHISPER_SPLIT_REAL_AUDIO_PIPELINE_DEVICE_BENCH`, passes both
runtime analyzers, exports a valid process-attached CoreML trace, and reports
`proof_complete`.
- Device-preflight analysis is now machine-readable too:
`experiments/acoustic_gate/analyze_device_preflight.py` writes
`device-preflight-analysis.json` for existing-app launch probes. Fresh resume
probes on 2026-06-03 saw both available iPhones in CoreDevice, Xcode, and
Instruments, but classified both launch gates as `blocked_device_locked`:
`/Volumes/HD1/tmp/nko_preflight_only_iPhone7_resume_20260603_0326/device-preflight-analysis.json`
and
`/Volumes/HD1/tmp/nko_preflight_only_Mohameds_iPhone_resume_20260603_0326/device-preflight-analysis.json`.
- Auto-resume is available through
`experiments/acoustic_gate/watch_real_audio_iphone_proof.sh`. It polls cheap
launch preflights and runs the full proof runner only after a launchable phone
is found. The dry run at
`/Volumes/HD1/tmp/nko_real_audio_device_watch_dry_20260603_0335` made one
pass across both phones, classified both as `blocked_device_locked`, exited
`76`, and did not start XCTest or Instruments.
- The runner now has an output disk guard before XCTest/trace. It exits `74`
when `OUT_DIR` has less than `MIN_FREE_GIB_BEFORE_XCTEST` free, default
`2GiB`. Smoke:
`/Volumes/HD1/tmp/nko_disk_preflight_guard_iPhone7_20260603_0308`.
- Trace analysis is now automated after `xctrace` succeeds:
`experiments/acoustic_gate/analyze_coreml_trace.py` writes
`coreml-trace-analysis.json` next to the trace. It was smoke-tested against the
old failed trace and correctly classified it as
`invalid_empty_or_failed_trace`, with no valid ANE/CPU/GPU placement claim.
- Runtime-marker analysis is also automated:
`experiments/acoustic_gate/analyze_pipeline_runtime_log.py` writes
`pipeline-runtime-analysis.json` after the runtime gate and
`pipeline-runtime-analysis-traced.json` after the traced pass. The analyzer
requires the audio fixture, both 14-stage split-encoder passes, encoder/head
parity, nonblank greedy decode, bounded candidates, and Swift ranker
acceptance. It was smoke-tested at
`/Volumes/HD1/tmp/nko_pipeline_runtime_marker_synthetic_20260603.json` with
status `runtime_marker_requirements_passed`; the locked preflight log correctly
classifies as `missing_marker`.
- Completion audit is now explicit:
`experiments/acoustic_gate/audit_ondevice_asr_goal.py` writes
`experiments/acoustic_gate/ondevice_asr_goal_audit.json`. The current audit is
`complete`/`completion_ready=true`. The completed artifact directory is
`/Volumes/HD1/tmp/nko_real_audio_device_watch_active_20260603/20260603_075734_iPhone7_proof_attempt1`.
That proof has passing runtime analysis, passing traced-runtime analysis, and a
CoreML trace analysis with paired CoreML CPU/GPU compute markers.
- The full runner now emits a machine-checkable
`proof-summary.json` through `summarize_ondevice_asr_proof.py`; the runner
fails closed with exit `78` unless that summary is `proof_complete`.
- Disk state: HD1 remains tight at about `3.1GiB` free. A 2.3GB archived
CoreSimulator cache, `/Volumes/HD1/tmp/root_cache_archive_20260602`, was moved
to HD1 Trash, but that does not free space until the Trash item is emptied or
moved to a writable external volume. The active real-audio harness DerivedData
is still retained at `/Volumes/HD1/tmp/NKORealAudioPipelineHarnessDD` because it
is the no-rebuild proof bundle.
So the current deployable proof is: Swift/iPad ranker + one-pass CTC-logit reuse
+ TurboQuant feature CER-cost + CoreML-compatible anchor CTC head with FP32
parity on macOS and physical iPad + physical-iPhone real-audio split encoder to
head to greedy decode to Swift ranker runtime and traced CoreML placement
evidence. The whole exported Whisper encoder MLProgram is still a negative
device-compatibility baseline, while the split encoder is the proven serving
path. The paired iPad run is evidence against useful Neural Engine acceleration
for the CTC-head graph, and the final iPhone trace has no paired CoreML-ANE
marker. The correct claim is on-device CoreML ASR with paired CPU/GPU
placement/fallback evidence, not ANE-accelerated ASR.
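
The fail-closed runner gates described above (exit `74` for the output disk guard, exit `78` unless the summary is `proof_complete`) can be sketched like this. It is an illustration only: the real runner is not reproduced here, the `status` key in `proof-summary.json` is a hypothetical field name, and the `free_bytes` parameter exists just to make the check testable.

```python
import json
import shutil

EXIT_DISK_GUARD = 74        # exit code reported for the output disk guard
EXIT_PROOF_INCOMPLETE = 78  # exit code reported when the summary is not proof_complete
GIB = 1024 ** 3


def disk_guard_exit_code(out_dir, min_free_gib=2.0, free_bytes=None):
    """Return 74 when the output volume has too little free space, else 0."""
    if free_bytes is None:
        free_bytes = shutil.disk_usage(out_dir).free
    return EXIT_DISK_GUARD if free_bytes < min_free_gib * GIB else 0


def proof_exit_code(summary_text):
    """Return 0 only for an explicit proof_complete status; anything else is 78."""
    try:
        # "status" is a hypothetical key name for the summary's verdict.
        status = json.loads(summary_text).get("status")
    except (ValueError, AttributeError, TypeError):
        return EXIT_PROOF_INCOMPLETE
    return 0 if status == "proof_complete" else EXIT_PROOF_INCOMPLETE
```

The design point is that missing, malformed, or unexpected summaries map to failure, so a zero-test or empty-trace run cannot be mistaken for proof.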
Next Deployment Work
1. Treat the CTC head as on-device CoreML/CPU unless Instruments or a CoreML
compute plan proves otherwise; the paired iPad run currently looks like CPU
fallback.
2. Keep the full split encoder as the active physical-device serving path. It
has now passed on iPhone (7):
`convpos -> layers00_02 -> layer03 -> layers04_06 -> layers07_09 ->
layers10_12 -> layers13_15 -> layers16_18 -> layers19_21 -> layers22_24 ->
layers25_27 -> layers28_30 -> layer31 -> finalnorm`. The four-layer
`layers00_03` package remains invalid on physical iPhone even though macOS
CoreML parity passes.
3. The staged split real-audio proof is complete. `djoko_IAfxh3pI1R4` now runs
as real WAV -> audio `[480000]` -> mel `[1,128,3000]` -> split encoder/head ->
greedy text with `16` nonblank frames -> deterministic candidates -> Swift
ranker correction. Runtime and traced-runtime analyses both report
`runtime_marker_requirements_passed`.
4. Instruments/CoreML compute-plan capture is complete enough for fallback
placement, not for ANE. The process-attached trace exported successfully and
`coreml-trace-analysis.json` reports `placement_markers_found`, with paired
CoreML CPU/GPU markers and no paired CoreML-ANE marker.
5. `experiments/acoustic_gate/audit_ondevice_asr_goal.py` reports
`completion_ready=true`. Re-run it only when the proof artifacts or reports
change.
6. If FP32 head/encoder latency is too high for live use, investigate CoreML-safe
quantization/precision strategies; the naive FP16 CoreML export already failed
parity, so every lower-precision variant needs a parity gate before speed work.
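
A parity gate of the kind item 6 calls for could look like the following sketch. The metric names mirror the ones reported in this document (`parity_mse`, `parity_max_abs`, `argmax_mismatches`); the tolerance values are placeholders, not thresholds taken from the export scripts.

```python
def parity_report(reference, candidate):
    """Compare two [frames][classes] logit grids element by element."""
    sq_sum, max_abs, count, mismatches = 0.0, 0.0, 0, 0
    for ref_frame, cand_frame in zip(reference, candidate):
        for r, c in zip(ref_frame, cand_frame):
            diff = abs(r - c)
            sq_sum += diff * diff
            max_abs = max(max_abs, diff)
            count += 1
        ref_arg = max(range(len(ref_frame)), key=ref_frame.__getitem__)
        cand_arg = max(range(len(cand_frame)), key=cand_frame.__getitem__)
        mismatches += ref_arg != cand_arg   # per-frame greedy disagreement
    return {
        "parity_mse": sq_sum / count,
        "parity_max_abs": max_abs,
        "argmax_mismatches": mismatches,
    }


def passes_parity(report, mse_tol=1e-6, max_abs_tol=1e-2):
    """Placeholder tolerances; per-frame argmax agreement is required outright."""
    return (report["parity_mse"] <= mse_tol
            and report["parity_max_abs"] <= max_abs_tol
            and report["argmax_mismatches"] == 0)
```

By this gate the FP32 results reported above would pass, while the FP16 export (`parity_mse≈27.18`, mismatched greedy output) would be rejected before any latency measurement.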
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
nko-brain-scanner/experiments/acoustic_gate/RANKER-SERVING-HANDOFF.md