Grand Diomande Research · Full HTML Reader

RAG++ SOTA Improvements

1. [Overview](#overview) 2. [Architecture](#architecture) 3. [SIMD Acceleration (P0)](#simd-acceleration-p0) 4. [Scalar Quantization - SQ8 (P1.1)](#scalar-quantization---sq8-p11) 5. [Product Quantization - PQ (P1.2)](#product-quantization---pq-p12) 6. [Parallel HNSW Construction (P2)](#parallel-hnsw-construction-p2) 7. [Hybrid Search - Dense + Sparse (P3)](#hybrid-search---dense--sparse-p3) 8. [Hybrid Query Engine (P4)](#hybrid-query-engine-p4) 9. [Benchmarks](#benchmarks) 10. [API Reference](#api-reference) 11. [M

Agents That Account for Themselves proposal experiment writeup candidate score 44 .md

Full Public Reader

RAG++ SOTA Improvements

Technical documentation for state-of-the-art performance optimizations in RAG++ retrieval engine.

Table of Contents

1. [Overview](#overview)
2. [Architecture](#architecture)
3. [SIMD Acceleration (P0)](#simd-acceleration-p0)
4. [Scalar Quantization - SQ8 (P1.1)](#scalar-quantization---sq8-p11)
5. [Product Quantization - PQ (P1.2)](#product-quantization---pq-p12)
6. [Parallel HNSW Construction (P2)](#parallel-hnsw-construction-p2)
7. [Hybrid Search - Dense + Sparse (P3)](#hybrid-search---dense--sparse-p3)
8. [Hybrid Query Engine (P4)](#hybrid-query-engine-p4)
9. [Benchmarks](#benchmarks)
10. [API Reference](#api-reference)
11. [Migration Guide](#migration-guide)

---

Overview

RAG++ now includes production-ready optimizations for high-performance retrieval:

ImprovementBenefitTrade-off
SIMD (AVX2)4-8x faster distance computationx86_64 only
SQ84x memory reduction<1
PQ32-128x memory reduction5-20
Parallel HNSW2-4x faster index buildMore memory during build
Hybrid Search+10-20

Design Principles

1. Additive Changes: All existing APIs remain unchanged
2. Opt-in Performance: New features are explicitly enabled
3. Measurable Trade-offs: Each optimization has documented recall/latency impact
4. Composable: Features can be combined (e.g., SQ8 + Hybrid)

---

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         RAG++ Core                                   │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐               │
│  │   distance/  │  │ quantization/│  │    sparse/   │               │
│  │              │  │              │  │              │               │
│  │  • scalar    │  │  • sq8       │  │  • bm25      │               │
│  │  • simd_avx2 │  │  • pq        │  │  • hybrid    │               │
│  │              │  │              │  │  • tokenizer │               │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘               │
│         │                 │                 │                        │
│         └────────────┬────┴────────────────┬┘                        │
│                      │                     │                         │
│  ┌───────────────────▼─────────────────────▼───────────────────────┐│
│  │                        index/                                    ││
│  │                                                                  ││
│  │  • FlatIndex        • HNSWIndex        • ParallelHNSWIndex      ││
│  │  • IndexRegistry    • ParallelHNSWBuilder                       ││
│  └──────────────────────────┬───────────────────────────────────────┘│
│                             │                                        │
│  ┌──────────────────────────▼───────────────────────────────────────┐│
│  │                      retrieval/                                  ││
│  │                                                                  ││
│  │  • QueryEngine           • HybridQueryEngine                     ││
│  │  • Reranker              • HybridSearcher                        ││
│  └──────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────┘

---

SIMD Acceleration (P0)

### Location
- `crates/core/src/distance/simd_avx2.rs`
- `crates/core/src/distance/mod.rs`

Description

AVX2 + FMA intrinsics for vectorized distance computation. Processes 8 floats per instruction.

Supported Operations

FunctionSpeedupNotes
`l2_distance_avx2`4-8xSquared L2, then sqrt
`inner_product_avx2`4-8xDot product
`cosine_similarity_avx2`4-6xIncludes normalization
`norm_avx2`4-8xVector magnitude

Usage

rust
use rag_plusplus_core::{l2_distance_fast, inner_product_fast, cosine_similarity_fast};

// Automatic dispatch - uses AVX2 if available, falls back to scalar
let dist = l2_distance_fast(&vec_a, &vec_b);
let ip = inner_product_fast(&vec_a, &vec_b);
let cos = cosine_similarity_fast(&vec_a, &vec_b);

// Or use compute_distance for index-compatible interface
use rag_plusplus_core::{compute_distance, DistanceType};
let dist = compute_distance(&vec_a, &vec_b, DistanceType::L2);

Runtime Detection

rust
// Check if AVX2 is available
#[cfg(target_arch = "x86_64")]
fn has_avx2() -> bool {
    is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma")
}

---

Scalar Quantization - SQ8 (P1.1)

### Location
- `crates/core/src/quantization/sq8.rs`
- `crates/core/src/quantization/mod.rs`

Description

Quantizes each f32 dimension to u8 (0-255), achieving 4x memory compression with minimal accuracy loss.

Algorithm

For each dimension d:
  1. Compute min_d, max_d from training data
  2. scale_d = 255.0 / (max_d - min_d)
  3. Encode: q[d] = round((x[d] - min_d) * scale_d)
  4. Decode: x'[d] = q[d] / scale_d + min_d

Usage

rust
use rag_plusplus_core::{SQ8Quantizer, Quantizer, AsymmetricDistance};

// Train on representative vectors
let training_data: Vec<Vec<f32>> = load_training_vectors();
let quantizer = SQ8Quantizer::train(&training_data);

// Encode corpus
let encoded: Vec<SQ8EncodedVector> = corpus
    .iter()
    .map(|v| quantizer.encode(v))
    .collect();

// Asymmetric distance (full precision query vs quantized corpus)
let query: Vec<f32> = get_query();
for encoded_vec in &encoded {
    let dist = quantizer.asymmetric_l2(&query, encoded_vec);
    // dist is approximate L2 distance
}

Memory Savings

DimensionOriginal (f32)SQ8 (u8)Savings
128512 bytes128 bytes4x
5122048 bytes512 bytes4x
7683072 bytes768 bytes4x

Recall Impact

Typical recall@10 loss: **< 1

---

Product Quantization - PQ (P1.2)

### Location
- `crates/core/src/quantization/pq.rs`

Description

Divides vectors into M subspaces, clusters each subspace into K centroids, stores only centroid indices. Achieves 32-128x compression.

Algorithm

Training:
  1. Split D-dim vectors into M subvectors of D/M dimensions
  2. For each subspace m:
     - Run k-means with K centroids
     - Store codebook[m] = K centroids

Encoding:
  For each subvector m:
    code[m] = argmin_k ||x[m] - codebook[m][k]||

Distance (ADC - Asymmetric Distance Computation):
  1. Precompute distance table: table[m][k] = ||q[m] - codebook[m][k]||²
  2. For each encoded vector: dist = Σ table[m][code[m]]

Usage

rust
use rag_plusplus_core::{PQQuantizer, PQConfig, Quantizer};

// Configure PQ
let config = PQConfig {
    m: 16,           // Number of subspaces (compression = dim * 4 / m)
    k: 256,          // Centroids per subspace (256 = 1 byte per code)
    kmeans_iters: 25,
    seed: Some(42),
};

// Train
let quantizer = PQQuantizer::train_with_config(&training_data, &config);

// Encode corpus
let encoded: Vec<PQEncodedVector> = corpus
    .iter()
    .map(|v| quantizer.encode(v))
    .collect();

// Search with precomputed distance table (fast!)
let query: Vec<f32> = get_query();
let table = quantizer.compute_distance_table(&query);

for enc in &encoded {
    // O(M) distance computation instead of O(D)
    let dist_sq = quantizer.asymmetric_l2_squared_with_table(&table, enc);
}

Compression Ratios

ConfigOriginalEncodedCompression
512d, M=162048 bytes16 bytes128x
512d, M=322048 bytes32 bytes64x
768d, M=243072 bytes24 bytes128x

Recall Trade-off

M (subspaces)CompressionTypical Recall@10
8256x70-85
16128x80-92
3264x88-96
6432x93-98

---

Parallel HNSW Construction (P2)

### Location
- `crates/core/src/index/hnsw_parallel.rs`

Description

Rayon-based parallel batch insertion for HNSW graph construction. Uses RwLock for thread-safe graph updates.

Algorithm

1. Insert first node (entry point) sequentially
2. Partition remaining nodes into batches
3. For each batch in parallel:
   - Search for insertion point (read locks)
   - Update graph edges (write locks on affected nodes)
4. Maintain entry point atomically

Usage

rust
use rag_plusplus_core::{ParallelHNSWBuilder, HNSWConfig};

let config = HNSWConfig::new(128)  // dimension
    .with_m(16)                     // max connections per layer
    .with_ef_construction(100);     // search width during build

let vectors: Vec<(String, Vec<f32>)> = load_vectors();

// Build with automatic thread count
let index = ParallelHNSWBuilder::new()
    .with_config(config)
    .with_seed(42)  // reproducible builds
    .build(vectors)?;

// Or specify thread count
let index = ParallelHNSWBuilder::new()
    .with_config(config)
    .with_threads(8)
    .with_batch_size(1000)
    .build(vectors)?;

// Search as normal
let results = index.search(&query, 10)?;

Performance

VectorsDimensionSequentialParallel (8 cores)Speedup
10K1282.1s0.7s3.0x
50K12818s5.2s3.5x
100K12865s18s3.6x

---

Hybrid Search - Dense + Sparse (P3)

### Location
- `crates/core/src/sparse/bm25.rs`
- `crates/core/src/sparse/hybrid.rs`
- `crates/core/src/sparse/tokenizer.rs`

Description

Combines dense vector search (semantic) with BM25 sparse search (keyword) for improved recall on queries with specific terminology.

Components

BM25 Index

rust
use rag_plusplus_core::{BM25Index, BM25Config};

let mut index = BM25Index::new();

// Or with custom config
let config = BM25Config {
    k1: 1.2,      // Term frequency saturation
    b: 0.75,      // Length normalization
    min_idf: 0.0, // IDF threshold
};
let mut index = BM25Index::with_config(config);

// Add documents
index.add("doc-1".into(), "machine learning neural networks");
index.add("doc-2".into(), "deep learning transformers attention");

// Search
let results = index.search("neural network architecture", 10);

Hybrid Searcher

rust
use rag_plusplus_core::{HybridSearcher, HybridSearchConfig, HybridFusionStrategy};

let config = HybridSearchConfig {
    strategy: HybridFusionStrategy::RRF { k: 60.0 },
    candidates_per_index: 100,
    min_sparse_score: 0.0,
};

let searcher = HybridSearcher::new(&dense_index, &sparse_index)
    .with_config(config);

// Search with both vector and text
let results = searcher.search(
    &query_embedding,
    "transformer attention mechanism",
    10
)?;

// Results include both score components
for r in results {
    println!("{}: fused={:.3}, dense={:?}, sparse={:?}",
        r.id, r.score, r.dense_score, r.sparse_score);
}

Fusion Strategies

StrategyFormulaBest For
RRF`score = Σ 1/(k + rank)`Robust, no normalization needed
Linear`score = α·dense + (1-α)·sparse`Tunable balance
WeightedRRF`score = w_d·rrf_d + w_s·rrf_s`Fine-grained control

---

Hybrid Query Engine (P4)

### Location
- `crates/core/src/retrieval/hybrid.rs`

Description

Production-ready query engine that combines dense index, sparse index, filtering, reranking, and prior building.

Usage

rust
use rag_plusplus_core::{
    HybridQueryEngine, HybridQueryEngineConfig, HybridQueryRequest,
    HybridFusionStrategy,
};

// Configure
let config = HybridQueryEngineConfig::new()
    .with_default_k(10)
    .with_fusion_strategy(HybridFusionStrategy::RRF { k: 60.0 })
    .with_candidates_per_index(100);

// Create engine
let engine = HybridQueryEngine::new(config, &dense_index, &sparse_index, &store);

// Query
let request = HybridQueryRequest::new(embedding)
    .with_text("neural network training")
    .with_k(10)
    .with_filter(FilterExpr::Eq("category".into(), "ml".into()));

let response = engine.query(request)?;

// Response includes
println!("Results: {}", response.len());
println!("Latency: {:?}", response.latency);
println!("Used hybrid: {}", response.used_hybrid);

if let Some(priors) = response.priors {
    println!("Prior mean: {}", priors.mean_outcome);
    println!("Prior CI: {:?}", priors.confidence_interval);
}

---

Benchmarks

Running Benchmarks

bash
# All benchmarks
cargo bench

# Specific benchmark
cargo bench --bench e2e_benchmark
cargo bench --bench quantization
cargo bench --bench distance

Results Summary (Apple M-series / Intel Xeon)

Distance Computation

OperationScalarAVX2Speedup
L2 (128d)45ns8ns5.6x
L2 (512d)180ns28ns6.4x
Inner Product (512d)165ns22ns7.5x
Cosine (512d)220ns38ns5.8x

Quantization Search (50K vectors, 512d)

MethodMemorySearch Time (100 queries)Recall@10
Full f32102 MB850ms100
SQ826 MB920ms99.3
PQ-160.8 MB45ms85

*PQ recall depends heavily on data distribution

Index Construction

Method100K vectors, 128dSpeedup
Sequential HNSW65s1x
Parallel HNSW (4t)22s3.0x
Parallel HNSW (8t)18s3.6x

---

API Reference

New Exports

rust
// Quantization
pub use quantization::{
    Quantizer,              // Trait for all quantizers
    AsymmetricDistance,     // Trait for ADC
    SQ8Quantizer,           // Scalar quantization
    SQ8EncodedVector,
    PQQuantizer,            // Product quantization
    PQEncodedVector,
    PQDistanceTable,
    PQConfig,
};

// Sparse Retrieval
pub use sparse::{
    BM25Index,              // BM25 inverted index
    BM25Config,
    SparseResult,
    SimpleTokenizer,        // Default tokenizer
    Tokenizer,              // Tokenizer trait
    HybridSearcher,         // Dense + Sparse fusion
    HybridSearchConfig,
    HybridFusionStrategy,
    HybridResult,
};

// Parallel Index
pub use index::{
    ParallelHNSWBuilder,    // Parallel construction
    ParallelHNSWIndex,      // Thread-safe HNSW
};

// Hybrid Query Engine
pub use retrieval::{
    HybridQueryEngine,
    HybridQueryEngineConfig,
    HybridQueryRequest,
    HybridQueryResponse,
    HybridRetrievedRecord,
};

// Distance (SIMD-accelerated)
pub use distance::{
    l2_distance_fast,
    inner_product_fast,
    cosine_similarity_fast,
    compute_distance,       // Unified interface
};

---

Migration Guide

From v0.1.x to v0.2.x (with SOTA)

No breaking changes. All existing code continues to work.

Opt-in to SIMD

rust
// Before (still works)
use rag_plusplus_core::l2_distance;

// After (automatic SIMD when available)
use rag_plusplus_core::l2_distance_fast;

Opt-in to Quantization

rust
// Add quantized corpus alongside existing
let sq8 = SQ8Quantizer::train(&sample_vectors);
let quantized_corpus: Vec<_> = corpus.iter().map(|v| sq8.encode(v)).collect();

Opt-in to Hybrid Search

rust
// Create BM25 index alongside dense index
let mut bm25 = BM25Index::new();
for (id, text) in documents {
    bm25.add(id, &text);
}

// Use HybridSearcher instead of direct index search
let searcher = HybridSearcher::new(&dense_index, &bm25);

Opt-in to Parallel Build

rust
// Before
let mut index = HNSWIndex::new(config);
for (id, vec) in vectors {
    index.add(id, &vec)?;
}

// After (parallel)
let index = ParallelHNSWBuilder::new()
    .with_config(config)
    .build(vectors)?;

---

Invariants

The following invariants are maintained:

IDInvariantEnforcement
INV-001Record immutability after creationAPI design
INV-002Welford numerical stabilityAlgorithm
INV-003WAL-before-buffer durabilityOrdering
INV-004Index-record consistencyTransactions
INV-005Quantizer trained before encodeRuntime check
INV-006Dimension match on searchAssertion
INV-007Thread-safe parallel HNSW updatesRwLock

---

Future Roadmap

1. QuantizedHNSW: Store PQ codes directly in HNSW nodes
2. Streaming BM25: Approximate IDF for real-time updates
3. NEON SIMD: ARM acceleration for Apple Silicon
4. Distributed Sharding: Multi-node index partitioning
5. GPU Acceleration: CUDA distance computation

---

References

  • [Product Quantization for Nearest Neighbor Search](https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf) - Jégou et al., 2011
  • [Efficient and Robust Approximate Nearest Neighbor Search](https://arxiv.org/abs/1603.09320) - HNSW, Malkov & Yashunin, 2016
  • [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25) - Robertson et al., 1995
  • [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) - Cormack et al., 2009

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

Comp-Core/core/retrieval/cc-rag-plus-plus/SOTA_IMPROVEMENTS.md

Detected Structure

Method · Evaluation · References · Figures · Code Anchors · Architecture