RAG++ SOTA Improvements
1. [Overview](#overview) 2. [Architecture](#architecture) 3. [SIMD Acceleration (P0)](#simd-acceleration-p0) 4. [Scalar Quantization - SQ8 (P1.1)](#scalar-quantization---sq8-p11) 5. [Product Quantization - PQ (P1.2)](#product-quantization---pq-p12) 6. [Parallel HNSW Construction (P2)](#parallel-hnsw-construction-p2) 7. [Hybrid Search - Dense + Sparse (P3)](#hybrid-search---dense--sparse-p3) 8. [Hybrid Query Engine (P4)](#hybrid-query-engine-p4) 9. [Benchmarks](#benchmarks) 10. [API Reference](#api-reference) 11. [M
Full Public Reader
RAG++ SOTA Improvements
Technical documentation for state-of-the-art performance optimizations in RAG++ retrieval engine.
Table of Contents
1. [Overview](#overview)
2. [Architecture](#architecture)
3. [SIMD Acceleration (P0)](#simd-acceleration-p0)
4. [Scalar Quantization - SQ8 (P1.1)](#scalar-quantization---sq8-p11)
5. [Product Quantization - PQ (P1.2)](#product-quantization---pq-p12)
6. [Parallel HNSW Construction (P2)](#parallel-hnsw-construction-p2)
7. [Hybrid Search - Dense + Sparse (P3)](#hybrid-search---dense--sparse-p3)
8. [Hybrid Query Engine (P4)](#hybrid-query-engine-p4)
9. [Benchmarks](#benchmarks)
10. [API Reference](#api-reference)
11. [Migration Guide](#migration-guide)
---
Overview
RAG++ now includes production-ready optimizations for high-performance retrieval:
| Improvement | Benefit | Trade-off |
|---|---|---|
| SIMD (AVX2) | 4-8x faster distance computation | x86_64 only |
| SQ8 | 4x memory reduction | <1 |
| PQ | 32-128x memory reduction | 5-20 |
| Parallel HNSW | 2-4x faster index build | More memory during build |
| Hybrid Search | +10-20 |
Design Principles
1. Additive Changes: All existing APIs remain unchanged
2. Opt-in Performance: New features are explicitly enabled
3. Measurable Trade-offs: Each optimization has documented recall/latency impact
4. Composable: Features can be combined (e.g., SQ8 + Hybrid)
---
Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ RAG++ Core │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ distance/ │ │ quantization/│ │ sparse/ │ │
│ │ │ │ │ │ │ │
│ │ • scalar │ │ • sq8 │ │ • bm25 │ │
│ │ • simd_avx2 │ │ • pq │ │ • hybrid │ │
│ │ │ │ │ │ • tokenizer │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └────────────┬────┴────────────────┬┘ │
│ │ │ │
│ ┌───────────────────▼─────────────────────▼───────────────────────┐│
│ │ index/ ││
│ │ ││
│ │ • FlatIndex • HNSWIndex • ParallelHNSWIndex ││
│ │ • IndexRegistry • ParallelHNSWBuilder ││
│ └──────────────────────────┬───────────────────────────────────────┘│
│ │ │
│ ┌──────────────────────────▼───────────────────────────────────────┐│
│ │ retrieval/ ││
│ │ ││
│ │ • QueryEngine • HybridQueryEngine ││
│ │ • Reranker • HybridSearcher ││
│ └──────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────┘---
SIMD Acceleration (P0)
### Location
- `crates/core/src/distance/simd_avx2.rs`
- `crates/core/src/distance/mod.rs`
Description
AVX2 + FMA intrinsics for vectorized distance computation. Processes 8 floats per instruction.
Supported Operations
| Function | Speedup | Notes |
|---|---|---|
| `l2_distance_avx2` | 4-8x | Squared L2, then sqrt |
| `inner_product_avx2` | 4-8x | Dot product |
| `cosine_similarity_avx2` | 4-6x | Includes normalization |
| `norm_avx2` | 4-8x | Vector magnitude |
Usage
use rag_plusplus_core::{l2_distance_fast, inner_product_fast, cosine_similarity_fast};
// Automatic dispatch - uses AVX2 if available, falls back to scalar
let dist = l2_distance_fast(&vec_a, &vec_b);
let ip = inner_product_fast(&vec_a, &vec_b);
let cos = cosine_similarity_fast(&vec_a, &vec_b);
// Or use compute_distance for index-compatible interface
use rag_plusplus_core::{compute_distance, DistanceType};
let dist = compute_distance(&vec_a, &vec_b, DistanceType::L2);Runtime Detection
// Check if AVX2 is available
#[cfg(target_arch = "x86_64")]
fn has_avx2() -> bool {
is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma")
}---
Scalar Quantization - SQ8 (P1.1)
### Location
- `crates/core/src/quantization/sq8.rs`
- `crates/core/src/quantization/mod.rs`
Description
Quantizes each f32 dimension to u8 (0-255), achieving 4x memory compression with minimal accuracy loss.
Algorithm
For each dimension d:
1. Compute min_d, max_d from training data
2. scale_d = 255.0 / (max_d - min_d)
3. Encode: q[d] = round((x[d] - min_d) * scale_d)
4. Decode: x'[d] = q[d] / scale_d + min_dUsage
use rag_plusplus_core::{SQ8Quantizer, Quantizer, AsymmetricDistance};
// Train on representative vectors
let training_data: Vec<Vec<f32>> = load_training_vectors();
let quantizer = SQ8Quantizer::train(&training_data);
// Encode corpus
let encoded: Vec<SQ8EncodedVector> = corpus
.iter()
.map(|v| quantizer.encode(v))
.collect();
// Asymmetric distance (full precision query vs quantized corpus)
let query: Vec<f32> = get_query();
for encoded_vec in &encoded {
let dist = quantizer.asymmetric_l2(&query, encoded_vec);
// dist is approximate L2 distance
}Memory Savings
| Dimension | Original (f32) | SQ8 (u8) | Savings |
|---|---|---|---|
| 128 | 512 bytes | 128 bytes | 4x |
| 512 | 2048 bytes | 512 bytes | 4x |
| 768 | 3072 bytes | 768 bytes | 4x |
Recall Impact
Typical recall@10 loss: **< 1
---
Product Quantization - PQ (P1.2)
### Location
- `crates/core/src/quantization/pq.rs`
Description
Divides vectors into M subspaces, clusters each subspace into K centroids, stores only centroid indices. Achieves 32-128x compression.
Algorithm
Training:
1. Split D-dim vectors into M subvectors of D/M dimensions
2. For each subspace m:
- Run k-means with K centroids
- Store codebook[m] = K centroids
Encoding:
For each subvector m:
code[m] = argmin_k ||x[m] - codebook[m][k]||
Distance (ADC - Asymmetric Distance Computation):
1. Precompute distance table: table[m][k] = ||q[m] - codebook[m][k]||²
2. For each encoded vector: dist = Σ table[m][code[m]]Usage
use rag_plusplus_core::{PQQuantizer, PQConfig, Quantizer};
// Configure PQ
let config = PQConfig {
m: 16, // Number of subspaces (compression = dim * 4 / m)
k: 256, // Centroids per subspace (256 = 1 byte per code)
kmeans_iters: 25,
seed: Some(42),
};
// Train
let quantizer = PQQuantizer::train_with_config(&training_data, &config);
// Encode corpus
let encoded: Vec<PQEncodedVector> = corpus
.iter()
.map(|v| quantizer.encode(v))
.collect();
// Search with precomputed distance table (fast!)
let query: Vec<f32> = get_query();
let table = quantizer.compute_distance_table(&query);
for enc in &encoded {
// O(M) distance computation instead of O(D)
let dist_sq = quantizer.asymmetric_l2_squared_with_table(&table, enc);
}Compression Ratios
| Config | Original | Encoded | Compression |
|---|---|---|---|
| 512d, M=16 | 2048 bytes | 16 bytes | 128x |
| 512d, M=32 | 2048 bytes | 32 bytes | 64x |
| 768d, M=24 | 3072 bytes | 24 bytes | 128x |
Recall Trade-off
| M (subspaces) | Compression | Typical Recall@10 |
|---|---|---|
| 8 | 256x | 70-85 |
| 16 | 128x | 80-92 |
| 32 | 64x | 88-96 |
| 64 | 32x | 93-98 |
---
Parallel HNSW Construction (P2)
### Location
- `crates/core/src/index/hnsw_parallel.rs`
Description
Rayon-based parallel batch insertion for HNSW graph construction. Uses RwLock for thread-safe graph updates.
Algorithm
1. Insert first node (entry point) sequentially
2. Partition remaining nodes into batches
3. For each batch in parallel:
- Search for insertion point (read locks)
- Update graph edges (write locks on affected nodes)
4. Maintain entry point atomicallyUsage
use rag_plusplus_core::{ParallelHNSWBuilder, HNSWConfig};
let config = HNSWConfig::new(128) // dimension
.with_m(16) // max connections per layer
.with_ef_construction(100); // search width during build
let vectors: Vec<(String, Vec<f32>)> = load_vectors();
// Build with automatic thread count
let index = ParallelHNSWBuilder::new()
.with_config(config)
.with_seed(42) // reproducible builds
.build(vectors)?;
// Or specify thread count
let index = ParallelHNSWBuilder::new()
.with_config(config)
.with_threads(8)
.with_batch_size(1000)
.build(vectors)?;
// Search as normal
let results = index.search(&query, 10)?;Performance
| Vectors | Dimension | Sequential | Parallel (8 cores) | Speedup |
|---|---|---|---|---|
| 10K | 128 | 2.1s | 0.7s | 3.0x |
| 50K | 128 | 18s | 5.2s | 3.5x |
| 100K | 128 | 65s | 18s | 3.6x |
---
Hybrid Search - Dense + Sparse (P3)
### Location
- `crates/core/src/sparse/bm25.rs`
- `crates/core/src/sparse/hybrid.rs`
- `crates/core/src/sparse/tokenizer.rs`
Description
Combines dense vector search (semantic) with BM25 sparse search (keyword) for improved recall on queries with specific terminology.
Components
BM25 Index
use rag_plusplus_core::{BM25Index, BM25Config};
let mut index = BM25Index::new();
// Or with custom config
let config = BM25Config {
k1: 1.2, // Term frequency saturation
b: 0.75, // Length normalization
min_idf: 0.0, // IDF threshold
};
let mut index = BM25Index::with_config(config);
// Add documents
index.add("doc-1".into(), "machine learning neural networks");
index.add("doc-2".into(), "deep learning transformers attention");
// Search
let results = index.search("neural network architecture", 10);Hybrid Searcher
use rag_plusplus_core::{HybridSearcher, HybridSearchConfig, HybridFusionStrategy};
let config = HybridSearchConfig {
strategy: HybridFusionStrategy::RRF { k: 60.0 },
candidates_per_index: 100,
min_sparse_score: 0.0,
};
let searcher = HybridSearcher::new(&dense_index, &sparse_index)
.with_config(config);
// Search with both vector and text
let results = searcher.search(
&query_embedding,
"transformer attention mechanism",
10
)?;
// Results include both score components
for r in results {
println!("{}: fused={:.3}, dense={:?}, sparse={:?}",
r.id, r.score, r.dense_score, r.sparse_score);
}Fusion Strategies
| Strategy | Formula | Best For |
|---|---|---|
| RRF | `score = Σ 1/(k + rank)` | Robust, no normalization needed |
| Linear | `score = α·dense + (1-α)·sparse` | Tunable balance |
| WeightedRRF | `score = w_d·rrf_d + w_s·rrf_s` | Fine-grained control |
---
Hybrid Query Engine (P4)
### Location
- `crates/core/src/retrieval/hybrid.rs`
Description
Production-ready query engine that combines dense index, sparse index, filtering, reranking, and prior building.
Usage
use rag_plusplus_core::{
HybridQueryEngine, HybridQueryEngineConfig, HybridQueryRequest,
HybridFusionStrategy,
};
// Configure
let config = HybridQueryEngineConfig::new()
.with_default_k(10)
.with_fusion_strategy(HybridFusionStrategy::RRF { k: 60.0 })
.with_candidates_per_index(100);
// Create engine
let engine = HybridQueryEngine::new(config, &dense_index, &sparse_index, &store);
// Query
let request = HybridQueryRequest::new(embedding)
.with_text("neural network training")
.with_k(10)
.with_filter(FilterExpr::Eq("category".into(), "ml".into()));
let response = engine.query(request)?;
// Response includes
println!("Results: {}", response.len());
println!("Latency: {:?}", response.latency);
println!("Used hybrid: {}", response.used_hybrid);
if let Some(priors) = response.priors {
println!("Prior mean: {}", priors.mean_outcome);
println!("Prior CI: {:?}", priors.confidence_interval);
}---
Benchmarks
Running Benchmarks
# All benchmarks
cargo bench
# Specific benchmark
cargo bench --bench e2e_benchmark
cargo bench --bench quantization
cargo bench --bench distanceResults Summary (Apple M-series / Intel Xeon)
Distance Computation
| Operation | Scalar | AVX2 | Speedup |
|---|---|---|---|
| L2 (128d) | 45ns | 8ns | 5.6x |
| L2 (512d) | 180ns | 28ns | 6.4x |
| Inner Product (512d) | 165ns | 22ns | 7.5x |
| Cosine (512d) | 220ns | 38ns | 5.8x |
Quantization Search (50K vectors, 512d)
| Method | Memory | Search Time (100 queries) | Recall@10 |
|---|---|---|---|
| Full f32 | 102 MB | 850ms | 100 |
| SQ8 | 26 MB | 920ms | 99.3 |
| PQ-16 | 0.8 MB | 45ms | 85 |
*PQ recall depends heavily on data distribution
Index Construction
| Method | 100K vectors, 128d | Speedup |
|---|---|---|
| Sequential HNSW | 65s | 1x |
| Parallel HNSW (4t) | 22s | 3.0x |
| Parallel HNSW (8t) | 18s | 3.6x |
---
API Reference
New Exports
// Quantization
pub use quantization::{
Quantizer, // Trait for all quantizers
AsymmetricDistance, // Trait for ADC
SQ8Quantizer, // Scalar quantization
SQ8EncodedVector,
PQQuantizer, // Product quantization
PQEncodedVector,
PQDistanceTable,
PQConfig,
};
// Sparse Retrieval
pub use sparse::{
BM25Index, // BM25 inverted index
BM25Config,
SparseResult,
SimpleTokenizer, // Default tokenizer
Tokenizer, // Tokenizer trait
HybridSearcher, // Dense + Sparse fusion
HybridSearchConfig,
HybridFusionStrategy,
HybridResult,
};
// Parallel Index
pub use index::{
ParallelHNSWBuilder, // Parallel construction
ParallelHNSWIndex, // Thread-safe HNSW
};
// Hybrid Query Engine
pub use retrieval::{
HybridQueryEngine,
HybridQueryEngineConfig,
HybridQueryRequest,
HybridQueryResponse,
HybridRetrievedRecord,
};
// Distance (SIMD-accelerated)
pub use distance::{
l2_distance_fast,
inner_product_fast,
cosine_similarity_fast,
compute_distance, // Unified interface
};---
Migration Guide
From v0.1.x to v0.2.x (with SOTA)
No breaking changes. All existing code continues to work.
Opt-in to SIMD
// Before (still works)
use rag_plusplus_core::l2_distance;
// After (automatic SIMD when available)
use rag_plusplus_core::l2_distance_fast;Opt-in to Quantization
// Add quantized corpus alongside existing
let sq8 = SQ8Quantizer::train(&sample_vectors);
let quantized_corpus: Vec<_> = corpus.iter().map(|v| sq8.encode(v)).collect();Opt-in to Hybrid Search
// Create BM25 index alongside dense index
let mut bm25 = BM25Index::new();
for (id, text) in documents {
bm25.add(id, &text);
}
// Use HybridSearcher instead of direct index search
let searcher = HybridSearcher::new(&dense_index, &bm25);Opt-in to Parallel Build
// Before
let mut index = HNSWIndex::new(config);
for (id, vec) in vectors {
index.add(id, &vec)?;
}
// After (parallel)
let index = ParallelHNSWBuilder::new()
.with_config(config)
.build(vectors)?;---
Invariants
The following invariants are maintained:
| ID | Invariant | Enforcement |
|---|---|---|
| INV-001 | Record immutability after creation | API design |
| INV-002 | Welford numerical stability | Algorithm |
| INV-003 | WAL-before-buffer durability | Ordering |
| INV-004 | Index-record consistency | Transactions |
| INV-005 | Quantizer trained before encode | Runtime check |
| INV-006 | Dimension match on search | Assertion |
| INV-007 | Thread-safe parallel HNSW updates | RwLock |
---
Future Roadmap
1. QuantizedHNSW: Store PQ codes directly in HNSW nodes
2. Streaming BM25: Approximate IDF for real-time updates
3. NEON SIMD: ARM acceleration for Apple Silicon
4. Distributed Sharding: Multi-node index partitioning
5. GPU Acceleration: CUDA distance computation
---
References
- [Product Quantization for Nearest Neighbor Search](https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf) - Jégou et al., 2011
- [Efficient and Robust Approximate Nearest Neighbor Search](https://arxiv.org/abs/1603.09320) - HNSW, Malkov & Yashunin, 2016
- [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25) - Robertson et al., 1995
- [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) - Cormack et al., 2009
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
Comp-Core/core/retrieval/cc-rag-plus-plus/SOTA_IMPROVEMENTS.md
Detected Structure
Method · Evaluation · References · Figures · Code Anchors · Architecture