ADR-001: On-Device Serving — Feature-Extractor Consistency, the ANE, and TurboQuant
- **Status:** Accepted (2026-06-02) - **Scope:** How N'Ko ASR is served on-device, and the discipline that keeps a model decodable end to end. Does **not** change the training/research path. - **Authors:** Mohamed + Claude
Full Public Reader
ADR-001: On-Device Serving — Feature-Extractor Consistency, the ANE, and TurboQuant
- Status: Accepted (2026-06-02)
- Scope: How N'Ko ASR is served on-device, and the discipline that keeps a model
decodable end to end. Does not change the training/research path.
- Authors: Mohamed + Claude
Context
Two compute backends appear in this program and were briefly conflated:
1. Research / training path — features extracted with standard PyTorch-GPU
Whisper-large-v3, giving `(1500, 1280)` encoder output (1500 frames / 30 s). The
validated 20.57
`UnifiedCTCHead(use_trajectory=True, use_tar=False, use_ttt=False)`) trains and decodes
on this path. Its model contains an internal `temporal_ds` (Conv1d, stride 4) that
downsamples 1500 → 375 frames inside the network.
2. On-device path — features extracted with a CoreML Whisper encoder on the Apple
Neural Engine (ANE) (`Desktop/ane-training/ane_ctc_train.py`: "Frozen Whisper
encoder on ANE, CTC head on MLX GPU"), giving pre-downsampled `(375, 1280)`
features. The 297k pilot model is compatible with this 375-frame layout.
The incident that forced this ADR
Feeding the GPU-trained anchor the ANE's 375-frame features produced all-blank output
(`blank_mass = 1.000` on every frame). Root cause: a train/serve feature mismatch — the
anchor expects raw 1500 frames and downsamples internally; the ANE features are already
downsampled, so the model double-downsampled (375 → ~94) into garbage. The cleanly-loaded
state dict (0 missing / 0 unexpected) was a red herring; the features were out of
distribution.
Two things that are NOT the same
- The ANE is a hardware accelerator. It did not "contaminate" anything. The
contamination in `ane-training/` was in the text labels (a corrupt N'Ko
transliteration — `ߙߌߗߌ` instead of `ߖߌߜߌ` for "ji", plus raw IPA `ɔ/ɛ`), not in the
acoustic features. The two merely co-lived in one folder. The ANE features are fine; the
ANE is fully reusable.
Decision
1. Feature-extractor consistency is law. A model's head MUST be trained on features
from the same extractor it will be served with (same model variant, precision, and
frame layout), or the GPU↔ANE skew must be measured and shown negligible. One extractor
per model, end to end.
2. Research stays on the GPU/standard path. All correctness work — fixing the
transliteration contamination, the anchor re-eval, the acoustic-gate / preservation /
flywheel numbers — uses PyTorch-GPU 1500-frame features. Reproducible, comparable to
published baselines, not Apple-locked. Clean numbers are earned here, not on-device.
3. The ANE is the serving backend (Phase 2), paired with TurboQuant. On-device serving
= frozen Whisper encoder on the ANE (efficient, offline, battery-friendly) + the small
trainable CTC head, compressed with TurboQuant. To deploy the clean anchor on the
ANE, do one of:
- re-extract the anchor's training features through the CoreML/ANE Whisper and confirm
it still decodes ~0.20, or
- make the ANE extraction emit raw 1500-frame features and let the model's own
`temporal_ds` downsample (restores drop-in compatibility with the GPU-trained head).
4. TurboQuant is the compression half of the serving stack. (`Desktop/turboquant/`,
benchmarked with AGP + ANE in `Comp-Core/benchmarks/agp-turboquant-ane`, 2026-04-22.)
Measured, relevant operating points:
- Whisper 1280-dim activation packets → 4× at MSE 0.037 (the feature tensor itself).
- 768-dim vectors (our `d_model`) → 5.9× @ 4-bit (recall 0.86) / 3× @ 8-bit
(recall 0.98).
- KV-cache → 3.2× (relevant only if an autoregressive proposer runs on-device).
5. Pursue AGP-aware quantization as the differentiator. AGP already partitions the
signal (stable / boundary / uncertain / novelty); TurboQuant already exposes an
"AGP-PTP packet-compression lane." Let the anticipation geometry allocate the bit
budget: high precision for uncertain/boundary/novelty regions, 4-bit for stable
regions. This ties three of the program's own primitives — AGP partition + TurboQuant +
ANE serving — into one coherent on-device story. This is systems-level novelty, not
model-capability novelty, and it is the defensible deployment moat.
Consequences
Positive
- The "N'Ko as computational infrastructure" thesis gets a real systems leg: a
script-native ASR that runs offline, on the phone, at low power — where the ~40 M
Manding speakers actually are.
- A clear seam: research = GPU/clean; serving = ANE + TurboQuant. No more accidental
cross-contamination of feature regimes.
Costs / risks
- TurboQuant's CER impact on the N'Ko ASR path is unmeasured (numbers above are from
the AGP/retrieval benchmark). Quantizing the head/encoder and measuring CER is required
Phase-2 work before any claim.
- Running the full large-v3 encoder on the ANE may be too heavy; a smaller/quantized
encoder variant changes the features again — re-triggering the consistency rule.
- None of this fixes contamination; it is strictly downstream of clean numbers.
Status of related work
- Anchor validated end-to-end on native GPU features (0.31 here / 0.2057 held-out). ✅
- Transliteration contamination identified; clean re-eval pending (needs mac5 proposal
regen). ⏳
- ANE + TurboQuant = Phase 2 (deployment), not started for the ASR path. 🔜
References
- `experiments/acoustic_gate/TECHNICAL-REPORT.md` §8d.1, §8d.2
- `docs/handoffs/train_vastai_tar_ttt_anchor_audit_20260502.py` (`UnifiedCTCHead`)
- `Desktop/ane-training/ane_ctc_train.py` (ANE+MLX split design)
- `Desktop/turboquant/` (affine.py, codebooks.py, core.py)
- `Desktop/Comp-Core/benchmarks/agp-turboquant-ane/` (2026-04-22 reports)
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
nko-brain-scanner/docs/adr/ADR-001-on-device-serving-and-quantization.md
Detected Structure
Evaluation · References · Code Anchors