AGP TurboQuant + Apple Neural Engine Performance Plan
Date: `2026-04-22`
Purpose
This plan turns the local TurboQuant and Apple Neural Engine research into an executable AGP performance lane. The goal is not to add accelerator names to the paper. The goal is to prove which parts of AGP become faster, smaller, or more energy efficient when the system uses the right engine for the right class of computation.
The local implementation already exists outside the main AGP docs:
- `Desktop/cog-rlm/scripts/turboquant.py`
- `Desktop/cog-rlm/scripts/ane_bridge.py`
- `Desktop/cog-rlm/scripts/ane_lora_mil.py`
- `Desktop/cog-rlm/scripts/ane_trainer.py`
- `Desktop/cog-rlm/scripts/ane_whisper_spike.py`
- `Desktop/cog-rlm/scripts/ane_mlx_train.py`
The new AGP benchmark harness lives at:
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/
Architecture Placement
TurboQuant belongs at the compression boundary, not at the acoustic model boundary. It should compress embedding indexes, hidden-state transfer packets, and AGP-PTP payloads. Its role is state mobility under bounded distortion.
The Apple Neural Engine belongs at the reflex boundary, not at the full-transformer training boundary. It should run compact projection-heavy heads: route, vitality, semantic projection, sigil or partition classifiers, and later frozen-forward projection kernels if the private MIL path remains stable. Its role is low-power repeated inference, not replacing MLX/GPU training.
The performance architecture is therefore:
- MLX/GPU
  - dense transformer trunk
  - LoRA/DoRA adapter training
  - corrective continuation
- ANE
  - route head
  - vitality head
  - semantic head
  - shallow projection kernels
  - optional micro-update kernels
- TurboQuant
  - compressed retrieval index
  - compressed AGP transfer packet
  - compressed activation/hidden-state transport
- CPU/Rust
  - orchestration
  - packet framing
  - Graph Kernel admissibility
  - exact rerank and provenance
- Thunderbolt
  - Mac4 <-> Mac5 state fabric
Current Measured Anchors
TurboQuant real-data Convex eval already measured:
- `4-bit`: about `0.993` cosine on real `768D` embeddings.
- `4-bit`: about `0.95-0.98` recall@10 on small real samples.
- `4-bit`: about `5.9x` compression versus fp32.
- `8-bit`: effectively lossless retrieval behavior in the small eval.
The local library also contains a RAG++ scale target:
- `332K x 768D` vectors.
- fp32 footprint about `1.02GB`.
- 4-bit packed footprint target around `127MB` plus metadata.
- intended compressed scan target under `10ms`, but this requires a Rust/SIMD sidecar rather than the current Python prototype path.
ANE local research already contains:
- a private `libane_bridge.dylib` ctypes bridge
- MIL text compilation
- IOSurface input/output
- dynamic weight injection
- LoRA forward/backward MIL generators
- Whisper-scale projection spike code
- ANE + MLX hybrid training scaffold
The current stable statement is: ANE bridge availability and projection-kernel feasibility are locally proven. Full production integration is not yet proven.
First AGP Benchmark Harness Result
Command:
cd Desktop/Comp-Core/benchmarks/agp-turboquant-ane
python3 agp_turboquant_ane_bench.py \
    --n-vectors 4096 \
    --queries 32 \
    --activation-shape 1,1280,1,256
Report:
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/reports/agp-turboquant-ane-benchmark-20260422T115420Z.md
Embedding candidate generation:
| bits | recall@10 | mean cosine | ratio fp32 | per query ms | packed MB | prototype MB |
|---|---|---|---|---|---|---|
| 4 | 0.859 | 0.993 | 5.91x | 7.283 | 2.1 | 12.6 |
| 8 | 0.981 | 1.000 | 2.98x | 7.245 | 4.2 | 12.6 |
Activation packet compression:
- packet shape: `[1, 1280, 1, 256]`
- fp16 bytes: `655360`
- estimated packed 4-bit bytes: `163848`
- ratio vs fp16: `4.0x`
- MSE: `0.03651738`
- max absolute error: `0.332031`
- compress/decompress: `0.807ms / 0.345ms`
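The packet-size figures above can be reproduced with a minimal sketch of 4-bit nibble packing. This is a plain min/max uniform quantizer, not TurboQuant's own codebook, and the 8-byte header (an fp32 offset plus an fp32 step) is an assumption chosen only because it makes the total match the reported `163848` bytes; the source does not describe the real layout. The function names are hypothetical.

```python
import math
import struct

def pack_4bit(values):
    """Quantize floats to 16 uniform levels and pack two codes per byte."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for constant input
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad to a whole byte
    body = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    # Assumed 8-byte header: fp32 offset and fp32 step size.
    return struct.pack("<ff", lo, scale) + body

def unpack_4bit(packet, n):
    """Invert pack_4bit, returning the first n dequantized values."""
    lo, scale = struct.unpack_from("<ff", packet, 0)
    out = []
    for byte in packet[8:]:
        out.append(lo + (byte >> 4) * scale)
        out.append(lo + (byte & 0x0F) * scale)
    return out[:n]

# Synthetic stand-in for a [1, 1280, 1, 256] activation packet.
n = 1 * 1280 * 1 * 256
activations = [math.sin(i * 0.01) for i in range(n)]
packet = pack_4bit(activations)
restored = unpack_4bit(packet, n)
mse = sum((a - b) ** 2 for a, b in zip(activations, restored)) / n
print(len(packet), (n * 2) / len(packet), mse)
```

The packed length is `n / 2 + 8` bytes, which is where the roughly `4.0x` ratio versus fp16 comes from. The MSE printed here belongs to the synthetic sine input and is not comparable to the measured value.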
ANE bridge:
- local private MIL bridge available: `true`
- compile count at check: `0`
Interpretation:
The benchmark confirms the expected structure. TurboQuant already gives useful distortion behavior and packet-size reduction, but the Python prototype search path is not the final latency path because it pre-dequantizes rotated vectors to fp16. The next implementation target is a Rust/SIMD packed-code sidecar. The ANE bridge exists and initializes locally, but this benchmark only checks availability; the next ANE benchmark must compile and time a route/vitality/semantic head.
Expected Performance
TurboQuant Retrieval
Target:
- `4-bit` first-stage candidate generation.
- `5-6x` memory reduction versus fp32.
- `0.95+` recall@10 before exact rerank on real embeddings.
- Rust sidecar p95 under `10ms` for hundreds of thousands of vectors.
Fallback:
- `8-bit` when recall is more important than memory.
- `3x` memory reduction versus fp32.
- near-lossless top-k behavior in the small local eval.
Do not claim:
- replacement of exact rerank.
- replacement of database vector indexes.
- final latency from the Python/NumPy prototype.
TurboQuant AGP Packets
Target:
- compress transfer bottleneck packets by roughly `4x` versus fp16 at 4-bit.
- use packed codes for AGP-PTP transport over Thunderbolt.
- exact reconstruction is not required; continuation fidelity is the metric.
Benchmark:
- compress/decompress synthetic activation packets first.
- then benchmark trained `CompressionBottleneck` states once Stage 5 exists.
- score by downstream continuation fidelity, not only MSE.
Do not claim:
- useful transfer before the transfer bottleneck exists.
- generic compression of every hidden tensor.
Apple Neural Engine Sidecar
Target:
- sub-ms to low-ms route/vitality/semantic head inference.
- lower GPU contention by moving repeated projection-heavy heads off the MLX path.
- lower energy per routing decision.
- private MIL path for research, Core ML path for public/product-safe experiments.
The ANE micro-update path targets:
- small LoRA correction batches.
- `3-10` gradient steps.
- sub-second to low-second update latency.
- hot-reloadable emergency correction adapters.
Do not claim:
- full-transformer ANE training.
- stable public API training support.
- ANE replacement for MLX/GPU adapter training.
Implementation Plan
Phase 1. Benchmark Lock
Status: started.
Deliverables:
- `agp_turboquant_ane_bench.py`
- JSON report
- Markdown report
Metrics:
- `recall@k`
- `mean_cosine`
- `compression_ratio_vs_fp32`
- `per_query_ms`
- activation packet compression ratio
- activation packet MSE
- ANE bridge availability
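As a reference for the first two metrics, the following is a minimal sketch of how `recall@k` and `mean_cosine` are conventionally computed. It assumes the harness follows the usual definitions (overlap between exact and approximate top-k, and average cosine between original and reconstructed vectors); the function names and sample IDs are illustrative, not taken from `agp_turboquant_ane_bench.py`.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k that the approximate top-k recovered."""
    return len(set(exact_ids[:k]) & set(approx_ids[:k])) / k

def mean_cosine(originals, reconstructions):
    """Average cosine between each vector and its reconstruction."""
    pairs = list(zip(originals, reconstructions))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Made-up ranking for one query: 8 of the exact top-10 are recovered.
exact_top = [3, 7, 1, 9, 4, 0, 8, 2, 6, 5]
approx_top = [3, 7, 1, 9, 4, 0, 8, 2, 11, 12]
recall = recall_at_k(exact_top, approx_top, k=10)
fidelity = mean_cosine([[1.0, 0.0], [0.0, 2.0]], [[0.9, 0.1], [0.1, 1.9]])
print(recall, fidelity)
```

A full report would average `recall_at_k` over all queries; that averaging step is assumed here rather than confirmed by the source.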
Phase 2. TurboQuant Rust Sidecar
Build `cc-turboquant-index` as the mainline AGP sidecar.
Status: sidecar v0 implemented.
Current crate:
Desktop/Comp-Core/core/retrieval/cc-turboquant-index/
Current reports:
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/reports/cc-turboquant-index-4096x768-b4-20260422.json
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/reports/cc-turboquant-index-4096x768-b8-20260422.json
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/reports/cc-turboquant-index-32768x768-b4-20260422.json
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/reports/cc-turboquant-index-4096x768-b4-snapshot-20260422.json
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/reports/cc-turboquant-index-4096x768-b4-inspect-20260422.json
Current snapshot:
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/snapshots/cc-turboquant-index-4096x768-b4-20260422.tqidx
Sidecar v0 results:
| vectors | dim | bits | queries | recall@10 | per query ms | ratio fp32 | packed bytes |
|---|---|---|---|---|---|---|---|
| 4096 | 768 | 4 | 32 | 0.878 | 4.802 | 5.95x | 2113536 |
| 4096 | 768 | 8 | 32 | 0.981 | 4.321 | 2.99x | 4210688 |
| 32768 | 768 | 4 | 16 | 0.838 | 37.417 | 5.95x | 16908288 |
Snapshot gate:
- saved `.tqidx` snapshot: `2.0MB`
- inspected vectors: `4096`
- inspected dim/padded dim: `768 / 1024`
- inspected bit width: `4`
- inspected compression ratio: `5.953x` versus fp32
- `cargo test`: `5` passed, including snapshot round-trip search preservation
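The packed byte counts and ratios in the sidecar table can be checked with simple arithmetic. The sketch below assumes each vector stores `padded_dim * bits / 8` code bytes plus 4 extra bytes; that 4-byte overhead is inferred from the reported totals (for example `2113536 / 4096 = 516 = 512 + 4`) and its meaning (a norm or scale, perhaps) is a guess. It also treats `332K` as exactly 332,000 vectors.

```python
def packed_bytes(n_vectors, padded_dim, bits, per_vector_overhead=4):
    """Total bytes for packed codes plus an assumed per-vector overhead."""
    return n_vectors * (padded_dim * bits // 8 + per_vector_overhead)

def fp32_bytes(n_vectors, dim):
    """Total bytes for the uncompressed fp32 vectors."""
    return n_vectors * dim * 4

# Rows of the sidecar v0 table: 768D vectors padded to 1024.
for n, bits in [(4096, 4), (4096, 8), (32768, 4)]:
    packed = packed_bytes(n, 1024, bits)
    print(n, bits, packed, round(fp32_bytes(n, 768) / packed, 3))

# Projection to the RAG++ scale target (assumed 332,000 vectors).
rag_fp32 = fp32_bytes(332_000, 768)
rag_padded = packed_bytes(332_000, 1024, 4)
rag_unpadded = packed_bytes(332_000, 768, 4, per_vector_overhead=0)
print(rag_fp32, rag_padded, rag_unpadded)
```

Under these assumptions the padded layout projects to about 171MB at RAG++ scale, while the unpadded figure is about 127MB. The `127MB` target therefore appears to assume unpadded 768-dimensional codes, which is worth confirming before the target is compared against a `.tqidx` snapshot.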
JSONL ingestion gate:
- `benchmark-jsonl --input PATH` implemented.
- Accepted row fields: `embedding`, `vector`, `values`, `embedding_vector`, or raw JSON arrays.
- Loaded vectors are L2-normalized before indexing, so exported non-unit embeddings can still be evaluated by cosine-style top-k.
- Smoke verification passed on mixed JSONL row shapes with valid benchmark and inspect JSON output.
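The ingestion rules above can be mirrored in a few lines of Python for preparing or sanity-checking an export before handing it to the sidecar. This is an illustrative sketch, not the Rust crate's code: the field precedence order, the blank-line skipping, and the zero-vector error are assumptions.

```python
import json
import math

# Accepted row fields, in an assumed precedence order.
VECTOR_FIELDS = ("embedding", "vector", "values", "embedding_vector")

def extract_vector(row):
    """Return the vector from a parsed row: a raw array or a known field."""
    if isinstance(row, list):
        return row
    if isinstance(row, dict):
        for field in VECTOR_FIELDS:
            if isinstance(row.get(field), list):
                return row[field]
    raise ValueError("row has no recognized vector field")

def l2_normalize(vec):
    """Scale a vector to unit L2 norm so dot product equals cosine."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def load_jsonl_vectors(lines):
    """Parse JSONL lines into L2-normalized vectors, skipping blank lines."""
    vectors = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        raw = extract_vector(json.loads(line))
        vectors.append(l2_normalize([float(x) for x in raw]))
    return vectors

# Mixed row shapes, as in the smoke verification described above.
sample = ['{"embedding": [3, 4]}', '{"vector": [0, 2]}', "", "[1, 0]"]
vectors = load_jsonl_vectors(sample)
print(vectors)
```

Normalizing at load time is what lets a plain inner-product scan behave like cosine top-k on embeddings that were exported without unit norm.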
Interpretation: packed-code Rust scanning removes the Python prototype's fp16 predecode memory path and improves the 4096-vector smoke latency. The sidecar can now persist and reload a packed index snapshot, and it can ingest exported embedding JSONL. That is the minimum bridge needed before attaching real Orbit/RAG++ embeddings. The 32768-vector result also makes the next bottleneck explicit: scalar packed scan will not meet the final hundreds-of-thousands-vector target by itself. The next sidecar version needs memory mapping, blocked scanning, an Apple Silicon SIMD kernel, and exact rerank.
Required features:
- packed-code storage — implemented in v0
- snapshot format — implemented in v0
- JSONL vector ingestion — implemented in v0
- mmap-friendly snapshot loading
- query rotation — implemented in v0
- SIMD dot-product path over compressed or predecoded blocks
- exact rerank against fp16/fp32 vectors for final top-k
- import of JSONL and binary exports
Success criteria:
- same or better recall than Python prototype at the same bit width
- lower memory than fp16 predecoded search
- p95 under `10ms` on the target Orbit/RAG++ embedding snapshot
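The role of exact rerank in the feature list above can be illustrated with a two-stage search sketch: a cheap scan over the lossy representation produces a shortlist, and only the shortlist is rescored against full-precision vectors. The function name, the `n_candidates` parameter, and the toy vectors are hypothetical; the real sidecar would scan packed codes in Rust rather than Python lists.

```python
import heapq

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def two_stage_search(query, approx_vectors, exact_vectors, k=10, n_candidates=50):
    """Shortlist with approximate scores, then rerank with exact scores."""
    # Stage 1: cheap scan over the lossy representation.
    candidates = heapq.nlargest(
        n_candidates,
        range(len(approx_vectors)),
        key=lambda i: dot(query, approx_vectors[i]),
    )
    # Stage 2: exact rescoring of the shortlist only.
    reranked = sorted(
        candidates, key=lambda i: dot(query, exact_vectors[i]), reverse=True
    )
    return reranked[:k]

# Toy data where quantization error swaps the top two results.
exact = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
approx = [[0.9, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
top = two_stage_search([1.0, 0.0], approx, exact, k=2, n_candidates=3)
print(top)
```

In this toy case the approximate scan ranks vector 1 first, and the exact pass restores vector 0 to the top. The final order depends only on the exact vectors, so recall@10 of the first stage bounds quality only through which items reach the shortlist.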
Phase 3. ANE Route-Head Export
Train or stub the route/vitality/semantic head in MLX first. Export the stabilized head to Core ML or private MIL.
Status: private-MIL probe implemented; current eval blocked.
Current script:
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/ane_route_head_bench.py
Latest report:
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/reports/agp-ane-route-head-benchmark-20260422T120745Z.md
Current result:
- bridge available: `true`
- MIL compile: `true`
- shape: `seq=8`, `dim=64`, `out=16`
- compile time: `38.426834ms`
- eval: `false`
- error: `ANEBridgeError('ANE eval failed')`
- the repository smoke test `Desktop/cog-rlm/scripts/test_eval_simple.py` also compiles but fails eval with Apple ANE status `0x2/statusType=0x9`
Interpretation: the ANE bridge is still a research lane, not an operational AGP accelerator yet. The current local private path can compile tiny kernels but cannot execute them in the present daemon/runtime state. The next production-safe lane is Core ML export for the route/vitality/semantic heads, while private MIL remains useful for low-level investigation once the ANE daemon issue is resolved.
Required features:
- fixed-shape input
- fp16 or int8-friendly weights
- deterministic comparison against MLX output
- benchmark against CPU and MLX/GPU
Success criteria:
- ANE sidecar beats CPU for the head.
- ANE sidecar reduces GPU contention or improves median latency in the full AGP loop.
- output parity remains within defined tolerance.
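The parity criterion could be checked along the following lines once a head actually runs on the ANE or Core ML path. This is a sketch under assumptions: the tolerance of `1e-2` is a placeholder rather than a value defined in this plan, the function names are hypothetical, and the logits are made-up numbers, not measured outputs.

```python
def max_abs_error(reference, candidate):
    """Largest elementwise difference between two flat output vectors."""
    if len(reference) != len(candidate):
        raise ValueError("output shapes differ")
    return max(abs(r - c) for r, c in zip(reference, candidate))

def same_route(reference, candidate):
    """True when both outputs select the same argmax route."""
    pick = lambda xs: max(range(len(xs)), key=xs.__getitem__)
    return pick(reference) == pick(candidate)

def parity_ok(reference, candidate, atol=1e-2):
    """Pass when outputs agree within atol and pick the same route."""
    return max_abs_error(reference, candidate) <= atol and same_route(
        reference, candidate
    )

# Made-up logits standing in for MLX (reference) and sidecar outputs.
mlx_logits = [0.10, 2.31, -0.52, 0.98]
sidecar_logits = [0.101, 2.305, -0.523, 0.979]
print(max_abs_error(mlx_logits, sidecar_logits), parity_ok(mlx_logits, sidecar_logits))
```

Checking the selected route alongside the numeric tolerance matters for a routing head, because a small numeric drift that flips the argmax changes behavior even when the absolute error looks acceptable.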
Phase 4. AGP-PTP Packet Compression
Attach TurboQuant to the learned transfer packet, not raw arbitrary hidden states.
Required features:
- packet schema with quantization metadata
- checksum and model/version fingerprint
- decode path on receiver
- continuation-fidelity benchmark
Success criteria:
- compressed transfer preserves continuation quality within tolerance.
- transfer latency is lower than raw fp16 transfer.
- packet size is small enough that Thunderbolt transfer is not the bottleneck.
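One possible shape for the packet schema is sketched below: a fixed header carrying quantization metadata, a model/version fingerprint, and a payload checksum, followed by the packed codes. Every field, the `AGPT` magic, the 40-byte layout, and the use of CRC32 and truncated SHA-256 are assumptions for illustration; the plan does not define the AGP-PTP wire format.

```python
import hashlib
import struct
import zlib

MAGIC = b"AGPT"  # hypothetical magic bytes
# magic, schema version, bits, reserved, n_values, scale, offset, fingerprint, crc32
HEADER = struct.Struct("<4sBBHIff16sI")

def model_fingerprint(model_id):
    """Truncated SHA-256 of a model/version identifier string."""
    return hashlib.sha256(model_id.encode("utf-8")).digest()[:16]

def encode_packet(payload, bits, n_values, scale, offset, model_id):
    """Prefix packed codes with metadata, fingerprint, and checksum."""
    header = HEADER.pack(
        MAGIC, 1, bits, 0, n_values, scale, offset,
        model_fingerprint(model_id), zlib.crc32(payload),
    )
    return header + payload

def decode_packet(packet, expected_model_id):
    """Validate the header on the receiver and return metadata and payload."""
    magic, version, bits, _, n_values, scale, offset, fp, crc = HEADER.unpack_from(packet)
    payload = packet[HEADER.size:]
    if magic != MAGIC or version != 1:
        raise ValueError("unrecognized packet")
    if fp != model_fingerprint(expected_model_id):
        raise ValueError("model/version fingerprint mismatch")
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return {"bits": bits, "n_values": n_values, "scale": scale,
            "offset": offset, "payload": payload}

payload = bytes(range(16))  # stand-in for 32 packed 4-bit codes
packet = encode_packet(payload, 4, 32, 0.125, -1.0, "agp-trunk@v0")
decoded = decode_packet(packet, "agp-trunk@v0")
print(len(packet), decoded["bits"], decoded["n_values"])
```

Rejecting a packet on fingerprint mismatch before dequantizing keeps a receiver from decoding states produced by a different trunk or adapter version, which would otherwise look like a silent drop in continuation fidelity.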
Phase 5. Paper Integration
Only after Phases 1-4:
- report TurboQuant retrieval and packet results separately
- report ANE sidecar head results separately
- report full AGP runtime impact as an end-to-end measurement
The paper claim should be:
> AGP improves local model efficiency by combining conditional compute, typed semantic routing, compressed state transfer, and heterogeneous Apple execution.
Not:
> ANE makes the whole model faster.
Immediate Next Commands
cd Desktop/Comp-Core/benchmarks/agp-turboquant-ane
python3 agp_turboquant_ane_bench.py
Optional larger benchmark:
python3 agp_turboquant_ane_bench.py \
    --n-vectors 10000 \
    --queries 50 \
    --activation-shape 1,1280,1,512
Paper Readiness Gate
This lane becomes paper-ready when we have:
- repeatable benchmark reports committed to the repo
- Rust TurboQuant sidecar report on a real Orbit/RAG++ snapshot
- ANE head export report with parity and latency
- AGP end-to-end report showing latency, energy, or quality-per-compute improvement
- clear distinction between private research path and public deployment path
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
Comp-Core/docs/research/agp-turboquant-ane-performance-plan.md