AGP TurboQuant + Apple Neural Engine Performance Plan

Date: `2026-04-22`

Purpose

This plan turns the local TurboQuant and Apple Neural Engine research into an executable AGP performance lane. The goal is not to add accelerator names to the paper. The goal is to prove which parts of AGP become faster, smaller, or more energy efficient when the system uses the right engine for the right class of computation.

The local implementation already exists outside the main AGP docs:

  • `Desktop/cog-rlm/scripts/turboquant.py`
  • `Desktop/cog-rlm/scripts/ane_bridge.py`
  • `Desktop/cog-rlm/scripts/ane_lora_mil.py`
  • `Desktop/cog-rlm/scripts/ane_trainer.py`
  • `Desktop/cog-rlm/scripts/ane_whisper_spike.py`
  • `Desktop/cog-rlm/scripts/ane_mlx_train.py`

The new AGP benchmark harness lives at:

text
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/

Architecture Placement

TurboQuant belongs at the compression boundary, not at the acoustic model boundary. It should compress embedding indexes, hidden-state transfer packets, and AGP-PTP payloads. Its role is state mobility under bounded distortion.

The Apple Neural Engine belongs at the reflex boundary, not at the full-transformer training boundary. It should run compact projection-heavy heads: route, vitality, semantic projection, sigil or partition classifiers, and later frozen-forward projection kernels if the private MIL path remains stable. Its role is low-power repeated inference, not replacing MLX/GPU training.

The performance architecture is therefore:

text
MLX/GPU
  dense transformer trunk
  LoRA/DoRA adapter training
  corrective continuation

ANE
  route head
  vitality head
  semantic head
  shallow projection kernels
  optional micro-update kernels

TurboQuant
  compressed retrieval index
  compressed AGP transfer packet
  compressed activation/hidden-state transport

CPU/Rust
  orchestration
  packet framing
  Graph Kernel admissibility
  exact rerank and provenance

Thunderbolt
  Mac4 <-> Mac5 state fabric
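
The placement above can be read as a lookup from workload to engine. The sketch below is one hedged way to write that down so an orchestrator fails loudly on an unplaced workload. The `Engine` enum, the workload keys, and `engine_for` are illustrative names, not identifiers from the AGP codebase.

```python
from enum import Enum


class Engine(Enum):
    MLX_GPU = "MLX/GPU"
    ANE = "ANE"
    TURBOQUANT = "TurboQuant"
    CPU_RUST = "CPU/Rust"


# Hypothetical workload keys mirroring rows of the layout above.
PLACEMENT = {
    "dense_transformer_trunk": Engine.MLX_GPU,
    "adapter_training": Engine.MLX_GPU,
    "corrective_continuation": Engine.MLX_GPU,
    "route_head": Engine.ANE,
    "vitality_head": Engine.ANE,
    "semantic_head": Engine.ANE,
    "retrieval_index": Engine.TURBOQUANT,
    "transfer_packet": Engine.TURBOQUANT,
    "orchestration": Engine.CPU_RUST,
    "packet_framing": Engine.CPU_RUST,
    "exact_rerank": Engine.CPU_RUST,
}


def engine_for(workload: str) -> Engine:
    """Return the planned engine, or raise if the workload has no placement."""
    try:
        return PLACEMENT[workload]
    except KeyError:
        raise ValueError(f"no engine placement defined for {workload!r}") from None


print(engine_for("route_head").value)
```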

Current Measured Anchors

The TurboQuant eval on real Convex data has already measured:

  • `4-bit`: about `0.993` cosine on real `768D` embeddings.
  • `4-bit`: about `0.95-0.98` recall@10 on small real samples.
  • `4-bit`: about `5.9x` compression versus fp32.
  • `8-bit`: effectively lossless retrieval behavior in the small eval.

The local library also contains a RAG++ scale target:

  • `332K x 768D` vectors.
  • fp32 footprint about `1.02GB`.
  • 4-bit packed footprint target around `127MB` plus metadata.
  • intended compressed scan target under `10ms`, but this requires a Rust/SIMD sidecar rather than the current Python prototype path.
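
The footprint figures in this list follow from simple arithmetic. The sketch below reproduces them, assuming "332K" means 332,000 vectors and counting only raw code bytes (no padding, no per-vector metadata). `index_bytes` is an illustrative helper, not a function from `turboquant.py`.

```python
def index_bytes(n_vectors: int, dim: int, bits: int) -> int:
    """Bytes for n_vectors x dim values at `bits` bits each, rounded up to whole bytes."""
    total_bits = n_vectors * dim * bits
    return (total_bits + 7) // 8


n, dim = 332_000, 768  # assumption: K read as 1000
fp32 = index_bytes(n, dim, 32)
packed4 = index_bytes(n, dim, 4)

print(f"fp32:  {fp32 / 1e9:.2f} GB")    # fp32:  1.02 GB
print(f"4-bit: {packed4 / 1e6:.0f} MB")  # 4-bit: 127 MB
print(f"raw ratio: {fp32 / packed4:.1f}x")
```

The raw ratio here is an upper bound for 4-bit codes. Any padding or per-vector metadata in a real index lowers it, so it does not contradict the measured figure of about `5.9x`.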

ANE local research already contains:

  • a private `libane_bridge.dylib` ctypes bridge
  • MIL text compilation
  • IOSurface input/output
  • dynamic weight injection
  • LoRA forward/backward MIL generators
  • Whisper-scale projection spike code
  • ANE + MLX hybrid training scaffold

The current stable statement is: ANE bridge availability and projection-kernel feasibility are locally proven. Full production integration is not yet proven.

First AGP Benchmark Harness Result

Command:

bash
cd Desktop/Comp-Core/benchmarks/agp-turboquant-ane
python3 agp_turboquant_ane_bench.py \
  --n-vectors 4096 \
  --queries 32 \
  --activation-shape 1,1280,1,256

Report:

text
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/reports/agp-turboquant-ane-benchmark-20260422T115420Z.md

Embedding candidate generation:

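| bits | recall@10 | mean cosine | ratio vs fp32 | per query (ms) | packed (MB) | prototype (MB) |
|---|---|---|---|---|---|---|
| 4 | 0.859 | 0.993 | 5.91x | 7.283 | 2.1 | 12.6 |
| 8 | 0.981 | 1.000 | 2.98x | 7.245 | 4.2 | 12.6 |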

Activation packet compression:

  • packet shape: `[1, 1280, 1, 256]`
  • fp16 bytes: `655360`
  • estimated packed 4-bit bytes: `163848`
  • ratio vs fp16: `4.0x`
  • MSE: `0.03651738`
  • max absolute error: `0.332031`
  • compress/decompress: `0.807ms / 0.345ms`
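
To make the packet numbers concrete: the shape holds `1 * 1280 * 1 * 256 = 327680` values, so bare 4-bit codes take `163840` bytes. The reported estimate is 8 bytes larger, which would be consistent with a small metadata header, though the report does not say so. The sketch below shows the round trip being measured, using a plain uniform scalar quantizer as a stand-in. It is not TurboQuant's codebook or rotation, and every function name here is illustrative.

```python
def quantize4(values, lo, hi):
    """Uniform 4-bit scalar quantizer: map floats in [lo, hi] to codes 0..15."""
    scale = (hi - lo) / 15  # assumes hi > lo
    return [min(15, max(0, round((v - lo) / scale))) for v in values]


def dequantize4(codes, lo, hi):
    """Map codes 0..15 back to the centre of their quantization step."""
    scale = (hi - lo) / 15
    return [lo + c * scale for c in codes]


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, low nibble first; odd lengths get a zero pad."""
    padded = list(codes) + [0] * (len(codes) % 2)
    return bytes(padded[i] | (padded[i + 1] << 4) for i in range(0, len(padded), 2))


def unpack_nibbles(data, count):
    """Inverse of pack_nibbles; `count` drops the pad nibble if one was added."""
    codes = []
    for byte in data:
        codes.extend((byte & 0x0F, byte >> 4))
    return codes[:count]


values = [-1.0, -0.37, 0.05, 0.3, 0.9]
codes = quantize4(values, -1.0, 1.0)
packed = pack_nibbles(codes)
restored = dequantize4(unpack_nibbles(packed, len(values)), -1.0, 1.0)

errors = [abs(a - b) for a, b in zip(values, restored)]
mse = sum(e * e for e in errors) / len(errors)
print(len(packed), "bytes for", len(values), "values")
print("mse", mse, "max abs error", max(errors))
```

MSE and max absolute error are computed the same way at full packet scale. They describe reconstruction only, not continuation fidelity.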

ANE bridge:

  • local private MIL bridge available: `true`
  • compile count at check: `0`

Interpretation:

The benchmark confirms the expected structure. TurboQuant already gives useful distortion behavior and packet-size reduction, but the Python prototype search path is not the final latency path because it pre-dequantizes rotated vectors to fp16. The next implementation target is a Rust/SIMD packed-code sidecar. The ANE bridge exists and initializes locally, but this benchmark only checks availability; the next ANE benchmark must compile and time a route/vitality/semantic head.

Expected Performance

TurboQuant Retrieval

Target:

  • `4-bit` first-stage candidate generation.
  • `5-6x` memory reduction versus fp32.
  • `0.95+` recall@10 before exact rerank on real embeddings.
  • Rust sidecar p95 under `10ms` for hundreds of thousands of vectors.

Fallback:

  • `8-bit` when recall is more important than memory.
  • about `3x` memory reduction versus fp32.
  • near-lossless top-k behavior in the small local eval.

Do not claim:

  • replacement of exact rerank.
  • replacement of database vector indexes.
  • final latency from the Python/Numpy prototype.

TurboQuant AGP Packets

Target:

  • compress transfer bottleneck packets by roughly `4x` versus fp16 at 4-bit.
  • use packed codes for AGP-PTP transport over Thunderbolt.
  • exact reconstruction is not required; continuation fidelity is the metric.

Benchmark:

  • compress/decompress synthetic activation packets first.
  • then benchmark trained `CompressionBottleneck` states once Stage 5 exists.
  • score by downstream continuation fidelity, not only MSE.

Do not claim:

  • useful transfer before the transfer bottleneck exists.
  • generic compression of every hidden tensor.

Apple Neural Engine Sidecar

Target:

  • sub-ms to low-ms route/vitality/semantic head inference.
  • lower GPU contention by moving repeated projection-heavy heads off the MLX path.
  • lower energy per routing decision.
  • private MIL path for research, Core ML path for public/product-safe experiments.

The ANE micro-update path targets:

  • small LoRA correction batches.
  • `3-10` gradient steps.
  • sub-second to low-second update latency.
  • hot-reloadable emergency correction adapters.

Do not claim:

  • full-transformer ANE training.
  • stable public API training support.
  • ANE replacement for MLX/GPU adapter training.

Implementation Plan

Phase 1. Benchmark Lock

Status: started.

Deliverables:

  • `agp_turboquant_ane_bench.py`
  • JSON report
  • Markdown report

Metrics:

  • `recall@k`
  • `mean_cosine`
  • `compression_ratio_vs_fp32`
  • `per_query_ms`
  • activation packet compression ratio
  • activation packet MSE
  • ANE bridge availability
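
For reference, `recall@k` here means the fraction of the exact top-k neighbours that also appear in the compressed top-k. The sketch below shows that definition. The function names are illustrative and are not taken from `agp_turboquant_ane_bench.py`, whose exact averaging may differ.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k ids that the approximate top-k also returned."""
    exact = set(exact_ids[:k])
    if not exact:
        return 0.0
    return len(exact & set(approx_ids[:k])) / len(exact)


def mean_recall_at_k(exact_lists, approx_lists, k):
    """Average recall@k over queries; both arguments hold one ranked id list per query."""
    scores = [recall_at_k(e, a, k) for e, a in zip(exact_lists, approx_lists)]
    return sum(scores) / len(scores)


exact = [[1, 2, 3, 4], [7, 8, 9, 10]]
approx = [[1, 9, 3, 8], [7, 8, 9, 10]]
print(mean_recall_at_k(exact, approx, k=4))  # 0.75
```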

Phase 2. TurboQuant Rust Sidecar

Build `cc-turboquant-index` as the mainline AGP sidecar.

Status: sidecar v0 implemented.

Current crate:

text
Desktop/Comp-Core/core/retrieval/cc-turboquant-index/

Current reports:

text
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/reports/cc-turboquant-index-4096x768-b4-20260422.json
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/reports/cc-turboquant-index-4096x768-b8-20260422.json
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/reports/cc-turboquant-index-32768x768-b4-20260422.json
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/reports/cc-turboquant-index-4096x768-b4-snapshot-20260422.json
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/reports/cc-turboquant-index-4096x768-b4-inspect-20260422.json

Current snapshot:

text
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/snapshots/cc-turboquant-index-4096x768-b4-20260422.tqidx

Sidecar v0 results:

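| vectors | dim | bits | queries | recall@10 | per query (ms) | ratio vs fp32 | packed bytes |
|---|---|---|---|---|---|---|---|
| 4096 | 768 | 4 | 32 | 0.878 | 4.802 | 5.95x | 2113536 |
| 4096 | 768 | 8 | 32 | 0.981 | 4.321 | 2.99x | 4210688 |
| 32768 | 768 | 4 | 16 | 0.838 | 37.417 | 5.95x | 16908288 |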

Snapshot gate:

  • saved `.tqidx` snapshot: `2.0MB`
  • inspected vectors: `4096`
  • inspected dim/padded dim: `768 / 1024`
  • inspected bit width: `4`
  • inspected compression ratio: `5.953x` versus fp32
  • `cargo test`: `5` passed, including snapshot round-trip search preservation
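
The reported packed-byte counts are consistent with a simple layout: each vector padded from 768 to 1024 dimensions, codes stored at the stated bit width, plus 4 bytes per vector (for example one `f32` norm). That layout is inferred from the numbers and is not confirmed by the crate source. The sketch below checks it against the reported figures. `packed_bytes` and its `per_vector_extra` parameter are illustrative.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def packed_bytes(n_vectors: int, dim: int, bits: int, per_vector_extra: int = 4) -> int:
    """Code bytes at the padded dimension plus an assumed fixed per-vector overhead."""
    padded_dim = next_pow2(dim)
    code_bytes = n_vectors * padded_dim * bits // 8
    return code_bytes + n_vectors * per_vector_extra


for n, bits in [(4096, 4), (4096, 8), (32768, 4)]:
    size = packed_bytes(n, 768, bits)
    ratio = n * 768 * 4 / size  # fp32 bytes at the unpadded dimension
    print(n, bits, size, f"{ratio:.3f}x")
```

If this layout is right, padding accounts for most of the gap between the raw 8x ratio of 4-bit codes and the inspected `5.953x`.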

JSONL ingestion gate:

  • `benchmark-jsonl --input PATH` implemented.
  • Accepted row fields: `embedding`, `vector`, `values`, `embedding_vector`, or raw JSON arrays.
  • Loaded vectors are L2-normalized before indexing, so exported non-unit embeddings can still be evaluated by cosine-style top-k.
  • Smoke verification passed on mixed JSONL row shapes with valid benchmark and inspect JSON output.
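
The ingestion behaviour described above can be sketched in Python to show the accepted row shapes and the normalization step. The sidecar itself is Rust. The field precedence and the handling of blank lines and zero vectors below are assumptions, and the function names are illustrative.

```python
import json
import math

# Field names listed in the gate above; the lookup order is an assumption.
VECTOR_FIELDS = ("embedding", "vector", "values", "embedding_vector")


def extract_vector(row):
    """Accept either a raw JSON array or an object carrying one of the known fields."""
    if isinstance(row, list):
        return [float(x) for x in row]
    if isinstance(row, dict):
        for field in VECTOR_FIELDS:
            if isinstance(row.get(field), list):
                return [float(x) for x in row[field]]
    raise ValueError("row has no recognised vector field")


def l2_normalize(vec):
    """Scale to unit length so dot product ranks like cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def load_jsonl(lines):
    """Parse JSONL lines into unit vectors, skipping blank lines."""
    vectors = []
    for line in lines:
        line = line.strip()
        if line:
            vectors.append(l2_normalize(extract_vector(json.loads(line))))
    return vectors


rows = ['{"embedding": [3, 4]}', "[0, 2]", "", '{"id": 7, "values": [1, 0]}']
print(load_jsonl(rows))
```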

Interpretation: packed-code Rust scanning removes the Python prototype's fp16 predecode memory path and lowers 4-bit per-query latency on the 4096-vector smoke run from `7.283ms` in the Python harness to `4.802ms`. The sidecar can now persist and reload a packed index snapshot, and it can ingest exported embedding JSONL. That is the minimum bridge needed before attaching real Orbit/RAG++ embeddings. The 32768-vector result also makes the next bottleneck explicit: at `37.417ms` per query, about 7.8x the 4096-vector latency for 8x the vectors, the scalar packed scan scales roughly linearly and will not meet the under-`10ms` target at hundreds of thousands of vectors by itself. The next sidecar version needs memory mapping, blocked scanning, an Apple Silicon SIMD kernel, and exact rerank.

Required features:

  • packed-code storage — implemented in v0
  • snapshot format — implemented in v0
  • JSONL vector ingestion — implemented in v0
  • mmap-friendly snapshot loading
  • query rotation — implemented in v0
  • SIMD dot-product path over compressed or predecoded blocks
  • exact rerank against fp16/fp32 vectors for final top-k
  • JSONL and binary export/import
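
The exact-rerank requirement describes a two-stage search: the compressed scan proposes more candidates than needed, then full-precision vectors decide the final top-k. The sketch below shows that shape on plain Python lists. `two_stage_search` and its parameters are illustrative, and the lossy vectors stand in for whatever the packed scan actually scores.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(query, approx_vectors, exact_vectors, k, n_candidates):
    """Stage 1 ranks by lossy scores; stage 2 reranks the survivors exactly."""
    approx_scored = ((dot(query, v), i) for i, v in enumerate(approx_vectors))
    candidates = [i for _, i in heapq.nlargest(n_candidates, approx_scored)]
    exact_scored = ((dot(query, exact_vectors[i]), i) for i in candidates)
    return [i for _, i in heapq.nlargest(k, exact_scored)]


exact = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
lossy = [[0.8, 0.0], [1.0, 0.0], [0.0, 1.0]]  # quantization noise swaps ids 0 and 1
print(two_stage_search([1.0, 0.0], lossy, exact, k=1, n_candidates=2))  # [0]
```

Rerank can only recover neighbours that survive stage 1, so `n_candidates` has to be tuned against the first-stage recall.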

Success criteria:

  • same or better recall than Python prototype at the same bit width
  • lower memory than fp16 predecoded search
  • p95 under `10ms` on the target Orbit/RAG++ embedding snapshot
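
The tables in this plan report per-query latency as a single figure, while this criterion is stated as p95, so the two need separate measurement. The sketch below is a minimal way to collect per-query samples and take the 95th percentile with the standard library. The `search` callable and the dummy workload are placeholders, not the sidecar's interface.

```python
import statistics
import time


def p95(samples_ms):
    """95th percentile using linear interpolation over the observed range."""
    return statistics.quantiles(samples_ms, n=100, method="inclusive")[94]


def time_queries(search, queries, warmup=3):
    """Run a few untimed warmup queries, then time every query in milliseconds."""
    for q in queries[:warmup]:
        search(q)
    samples = []
    for q in queries:
        start = time.perf_counter()
        search(q)
        samples.append((time.perf_counter() - start) * 1e3)
    return samples


# Placeholder workload; replace with a call into the real index.
samples = time_queries(lambda q: sum(x * x for x in q), [[0.1] * 768] * 32)
print(f"mean {statistics.fmean(samples):.4f} ms, p95 {p95(samples):.4f} ms")
```

With only 16 or 32 queries, as in the current runs, p95 rests on one or two samples, so a larger query set is worth using before comparing against the `10ms` target.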

Phase 3. ANE Route-Head Export

Train or stub the route/vitality/semantic head in MLX first. Export the stabilized head to Core ML or private MIL.

Status: private-MIL probe implemented; current eval blocked.

Current script:

text
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/ane_route_head_bench.py

Latest report:

text
Desktop/Comp-Core/benchmarks/agp-turboquant-ane/reports/agp-ane-route-head-benchmark-20260422T120745Z.md

Current result:

  • bridge available: `true`
  • MIL compile: `true`
  • shape: `seq=8`, `dim=64`, `out=16`
  • compile time: `38.426834ms`
  • eval: `false`
  • error: `ANEBridgeError('ANE eval failed')`
  • the repository smoke test `Desktop/cog-rlm/scripts/test_eval_simple.py` also compiles but fails eval with Apple ANE status `0x2/statusType=0x9`

Interpretation: the ANE bridge is still a research lane, not an operational AGP accelerator yet. The current local private path can compile tiny kernels but cannot execute them in the present daemon/runtime state. The next production-safe lane is Core ML export for the route/vitality/semantic heads, while private MIL remains useful for low-level investigation once the ANE daemon issue is resolved.

Required features:

  • fixed-shape input
  • fp16 or int8-friendly weights
  • deterministic comparison against MLX output
  • benchmark against CPU and MLX/GPU

Success criteria:

  • ANE sidecar beats CPU for the head.
  • ANE sidecar reduces GPU contention or improves median latency in the full AGP loop.
  • output parity remains within defined tolerance.
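
The parity criterion needs a concrete check before any latency number counts. The sketch below compares a reference output (for example from MLX) with a candidate output (for example from Core ML or the ANE path) as plain float lists. The tolerance value is a placeholder, and checking the argmax is an assumption about what matters for a route head.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equal-length outputs."""
    if len(reference) != len(candidate):
        raise ValueError("output shapes differ")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def parity_report(reference, candidate, atol=1e-2):
    """Summarize numeric parity and whether the selected route is unchanged."""
    diff = max_abs_diff(reference, candidate)
    return {
        "max_abs_diff": diff,
        "within_tolerance": diff <= atol,
        "same_decision": argmax(reference) == argmax(candidate),
    }


mlx_out = [0.10, 0.70, 0.20]
ane_out = [0.102, 0.695, 0.203]  # made-up values standing in for fp16 drift
print(parity_report(mlx_out, ane_out))
```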

Phase 4. AGP-PTP Packet Compression

Attach TurboQuant to the learned transfer packet, not raw arbitrary hidden states.

Required features:

  • packet schema with quantization metadata
  • checksum and model/version fingerprint
  • decode path on receiver
  • continuation-fidelity benchmark
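
One possible framing for these requirements is sketched below: a fixed header with the bit width and shape, a short model fingerprint, the packed payload, and a trailing CRC32. The magic bytes, field layout, version number, and function names are all hypothetical. The actual AGP-PTP schema is not defined in this plan.

```python
import hashlib
import struct
import zlib

MAGIC = b"AGPP"  # hypothetical magic bytes
# magic, version, bits, four uint16 shape dims, 8-byte fingerprint, payload length
HEADER = struct.Struct("<4sBB4H8sI")


def fingerprint(model_id: str) -> bytes:
    """Short digest identifying the model/version that produced the packet."""
    return hashlib.sha256(model_id.encode("utf-8")).digest()[:8]


def encode_packet(payload: bytes, bits: int, shape, model_id: str) -> bytes:
    header = HEADER.pack(MAGIC, 1, bits, *shape, fingerprint(model_id), len(payload))
    body = header + payload
    return body + struct.pack("<I", zlib.crc32(body))


def decode_packet(packet: bytes, expected_model_id: str) -> dict:
    """Verify checksum, framing, and fingerprint before handing back the payload."""
    if len(packet) < HEADER.size + 4:
        raise ValueError("truncated packet")
    body, (crc,) = packet[:-4], struct.unpack("<I", packet[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("checksum mismatch")
    magic, version, bits, d0, d1, d2, d3, fp, n = HEADER.unpack_from(body)
    if magic != MAGIC or version != 1:
        raise ValueError("unknown packet format")
    if fp != fingerprint(expected_model_id):
        raise ValueError("model fingerprint mismatch")
    payload = body[HEADER.size:]
    if len(payload) != n:
        raise ValueError("payload length mismatch")
    return {"bits": bits, "shape": (d0, d1, d2, d3), "payload": payload}


pkt = encode_packet(b"\x12\x34", 4, (1, 1280, 1, 256), "model-a")
print(decode_packet(pkt, "model-a")["shape"])
```

A real packet would also carry whatever the quantizer needs to decode, such as scales or a rotation seed. The point of the fingerprint check is that the receiver refuses packets produced by a different model or version.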

Success criteria:

  • compressed transfer preserves continuation quality within tolerance.
  • transfer latency is lower than raw fp16 transfer.
  • packet size is small enough that Thunderbolt transfer is not the bottleneck.
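
Whether compressed transfer is actually faster than raw fp16 depends on codec time versus link bandwidth. The sketch below computes the break-even point under strong assumptions: one packet, compress, send, and decompress run strictly in sequence with no pipelining, and effective link bandwidth is a measured input rather than a nominal spec. The function names are illustrative. The example inputs are the synthetic packet sizes and Python-prototype codec timings reported earlier in this plan.

```python
def transfer_ms(n_bytes: int, link_gbps: float) -> float:
    """Time to move n_bytes over a link with the given effective bandwidth."""
    return n_bytes * 8 / (link_gbps * 1e9) * 1e3


def compressed_wins(raw_bytes, packed_bytes, codec_ms, link_gbps) -> bool:
    """True if compress + send packed + decompress beats sending raw."""
    return codec_ms + transfer_ms(packed_bytes, link_gbps) < transfer_ms(raw_bytes, link_gbps)


def break_even_gbps(raw_bytes, packed_bytes, codec_ms) -> float:
    """Effective bandwidth below which compression lowers end-to-end latency."""
    saved_bits = (raw_bytes - packed_bytes) * 8
    return saved_bits / (codec_ms * 1e-3) / 1e9


raw, packed = 655360, 163848
codec_ms = 0.807 + 0.345  # compress + decompress from the prototype run
print(f"break-even: {break_even_gbps(raw, packed, codec_ms):.2f} Gb/s")
```

Under these assumptions the prototype codec only pays off on a link much slower than a Thunderbolt-class connection. That fits the plan's position that the Python path is not the final latency path, and it gives any faster codec a concrete time budget to meet.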

Phase 5. Paper Integration

Only after Phases 1-4:

  • report TurboQuant retrieval and packet results separately
  • report ANE sidecar head results separately
  • report full AGP runtime impact as an end-to-end measurement

The paper claim should be:

> AGP improves local model efficiency by combining conditional compute, typed semantic routing, compressed state transfer, and heterogeneous Apple execution.

Not:

> ANE makes the whole model faster.

Immediate Next Commands

bash
cd Desktop/Comp-Core/benchmarks/agp-turboquant-ane
python3 agp_turboquant_ane_bench.py

Optional larger benchmark:

bash
python3 agp_turboquant_ane_bench.py \
  --n-vectors 10000 \
  --queries 50 \
  --activation-shape 1,1280,1,512

Paper Readiness Gate

This lane becomes paper-ready when we have:

  • repeatable benchmark reports committed to the repo
  • Rust TurboQuant sidecar report on a real Orbit/RAG++ snapshot
  • ANE head export report with parity and latency
  • AGP end-to-end report showing latency, energy, or quality-per-compute improvement
  • clear distinction between private research path and public deployment path

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

Comp-Core/docs/research/agp-turboquant-ane-performance-plan.md
