Graph Kernel Benchmark Evaluation
Date: 2026-02-13
Version: Graph Kernel v0.1.0, Schema v1.0.0
Author: Automated benchmark suite
---
Executive Summary
The Graph Kernel service at `localhost:8001` was evaluated against three baseline retrieval methods across 27 queries in 5 categories. The evaluation reveals that the Graph Kernel is not a general-purpose search engine — it's a deterministic context slicing engine with a bolted-on knowledge graph. Its real value lies in provenance-tracked, policy-governed context construction — not keyword matching.
Verdict: Worth the operational complexity if you need deterministic, auditable context construction. The knowledge graph query layer needs significant improvement to compete with even naive baselines for ad-hoc search.
---
1. System Under Test
### Graph Kernel (localhost:8001)
- Language: Rust (Axum + sqlx)
- Storage: PostgreSQL (Supabase-hosted, remote)
- Data: 2,681 knowledge triples (subject–predicate–object), 70 unique subjects, 39 unique predicates
- Source: Kimi-K2 memory extraction (synced from conversation history)
- Primary purpose: Deterministic context slicing for conversation DAGs
- Secondary purpose: Knowledge graph triple store (the `/api/knowledge` endpoint we're benchmarking)
### Baselines
| Method | Description |
|--------|------------|
| Keyword | Substring matching across `subject + predicate + object` text |
| BM25 | Classic Okapi BM25 (k1=1.5, b=0.75) over the same triple corpus |
| RAG++ (localhost:8000) | Vector similarity search over 107K+ conversation turns (embeddings) |
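For readers unfamiliar with the BM25 baseline, the following is a minimal sketch of Okapi BM25 scoring with the stated parameters (k1=1.5, b=0.75). It is not the benchmark suite's code: the tokenizer, the non-negative IDF variant, and the sample triples are assumptions made for illustration.

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # parameters stated in the baseline table


def tokenize(text):
    # Assumed tokenizer: lowercase, split predicates like "works_on" into words.
    return text.lower().replace("_", " ").split()


def bm25_rank(query, docs, k1=K1, b=B):
    """Return (score, doc) pairs sorted by descending BM25 score."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scored = []
    for doc, toks in zip(docs, tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption; other variants exist).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scored.append((score, doc))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Illustrative triples flattened to "subject predicate object" text.
triples = [
    "clawdbot uses Gemini batch API",
    "clawdbot deploys_to Google Cloud Platform",
    "Mohamed works_on clawdbot",
]
ranked = bm25_rank("gemini api", triples)
```

Each triple is treated as one short document, so length normalization has little effect on a corpus like this one; the ranking is driven mostly by term rarity.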
### Important Context
The Graph Kernel and the three baselines do not all operate on the same data:
- Graph Kernel / Keyword / BM25: 2,681 structured triples (extracted knowledge)
- RAG++: 107K+ raw conversation turns (original text with embeddings)
This means RAG++ results are not directly comparable for precision — it searches a fundamentally different and much larger corpus. We include it to understand the semantic retrieval landscape.
---
2. Methodology
### Query Categories (27 total queries)
| Category | Count | Tests |
|----------|-------|-------|
| Factual Recall | 6 | Direct attribute lookups ("What does X use?") |
| Relationship | 6 | Dependency/integration mapping ("What depends on X?") |
| Multi-hop | 5 | 2-hop graph traversal ("X → Y → Z") |
| Fuzzy/Semantic | 5 | Loose topic matching ("anything about skating") |
| Predicate-specific | 5 | Structured predicate filters ("likes", "should", "has_file") |
### Metrics
- Response time (ms): Wall-clock latency including network round-trip
- Result count: Number of results returned
- Relevance score (0–1): Fraction of expected terms found in results
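The relevance metric can be sketched as follows. This assumes case-insensitive substring matching against the concatenated result text; the benchmark script's actual matching rules are not given in this report, and the sample results are illustrative.

```python
def relevance(expected_terms, results):
    """Fraction of expected terms that appear in at least one result."""
    if not expected_terms:
        return 0.0
    haystack = " ".join(results).lower()
    hits = sum(1 for term in expected_terms if term.lower() in haystack)
    return hits / len(expected_terms)


# Illustrative results: "GCP" is expected, but the stored object spells it out.
results = ["clawdbot deploys_to Google Cloud Platform", "clawdbot uses Docker"]
score = relevance(["GCP", "Docker", "clawdbot"], results)
print(round(score, 2))
```

Under this definition, a metric based on expected terms penalizes alias mismatches (an abbreviation versus its expanded form) even when the returned triple is the correct one.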
### Graph Kernel query strategy
- Factual/Relationship/Predicate queries use the `/api/knowledge` endpoint with exact `subject`, `predicate`, and/or `object` filters
- Multi-hop queries use sequential API calls: fetch triples for subject → follow objects → fetch triples for new subjects
- Fuzzy queries use the general endpoint (no filters) or predicate filters
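The sequential multi-hop strategy above can be sketched as follows. The `fetch` callable stands in for one call to `/api/knowledge` filtered by subject; the in-memory `fetch_local` function and the sample triples are hypothetical, used here so the example runs without the service.

```python
from typing import Callable, Iterable

Triple = tuple[str, str, str]  # (subject, predicate, object)


def two_hop(start: str, fetch: Callable[[str], Iterable[Triple]]) -> list[list[Triple]]:
    """Follow subject -> object -> object chains, one fetch per visited subject."""
    chains = []
    for first in fetch(start):
        # The object of the first triple becomes the subject of the next lookup.
        for second in fetch(first[2]):
            chains.append([first, second])
    return chains


# Hypothetical in-memory stand-in for the HTTP endpoint.
TRIPLES = [
    ("Mohamed", "works_on", "clawdbot"),
    ("clawdbot", "uses", "Gemini batch API"),
    ("Mohamed", "likes", "skating"),
]


def fetch_local(subject: str) -> list[Triple]:
    return [t for t in TRIPLES if t[0] == subject]


chains = two_hop("Mohamed", fetch_local)
```

When `fetch` is a real HTTP call, every visited subject costs one round trip, so latency grows with both hop depth and fan-out.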
---
3. Results
3.1 Per-Category Performance
#### Factual Recall
| Method | Avg Latency | Avg Results | Avg Relevance |
|--------|-------------|-------------|---------------|
| Graph Kernel | 248.3 ms | 3.7 | 1.00 |
| Keyword | 2.7 ms | 20.0 | 1.00 |
| BM25 | 9.0 ms | 18.2 | 1.00 |
| RAG++ | 421.9 ms | 10.0 | 0.92 |
Analysis: All triple-based methods achieve perfect relevance on factual recall. Graph Kernel is slower because every query hits a remote PostgreSQL instance over the network (~200ms baseline). Keyword/BM25 operate in-memory. RAG++ misses some factual queries because conversation turn text doesn't always contain the exact entity name (e.g., "Mohamed" appears in conversation context but not always explicitly).
#### Relationship Queries
| Method | Avg Latency | Avg Results | Avg Relevance |
|--------|-------------|-------------|---------------|
| Graph Kernel | 204.3 ms | 9.5 | 0.94 |
| Keyword | 2.8 ms | 19.3 | 1.00 |
| BM25 | 8.7 ms | 12.3 | 1.00 |
| RAG++ | 336.4 ms | 10.0 | 0.69 |
Analysis: Graph Kernel returns precisely scoped results (only matching triples), while keyword/BM25 return many partially-matching results. The GK 0.94 relevance drop comes from one query where "GCP" wasn't in the `deploys_to` results (it uses "Google Cloud Platform" instead). RAG++ struggles with relationship queries — it wasn't designed to express "A depends on B" relationships.
#### Multi-hop Reasoning ⭐
| Method | Avg Latency | Avg Results | Avg Relevance |
|--------|-------------|-------------|---------------|
| Graph Kernel | 586.6 ms | 7.6 | 1.00 |
| Keyword | 3.3 ms | 20.0 | 1.00 |
| BM25 | 9.2 ms | 18.8 | 1.00 |
| RAG++ | 348.1 ms | 10.0 | 0.40 |
Analysis: This is Graph Kernel's killer feature. Multi-hop traversal (Mohamed → works_on → clawdbot → uses → Gemini batch API) returns causally connected results that follow actual relationship chains. Keyword/BM25 achieve high relevance scores but through keyword coincidence, not structural reasoning — they find documents containing "Mohamed" and "clawdbot" but don't understand the connection between them. RAG++ drops to 0.40 relevance because conversation turns rarely contain multi-hop relationship chains in a single text span.
The relevance scores mask a critical quality difference: Graph Kernel returns 7.6 results that are structurally connected through the graph. Keyword returns 20 results that happen to contain matching words. The GK results are a knowledge chain; the keyword results are a coincidence pile.
#### Fuzzy/Semantic Search
| Method | Avg Latency | Avg Results | Avg Relevance |
|--------|-------------|-------------|---------------|
| Graph Kernel | 215.2 ms | 19.8 | 0.42 |
| Keyword | 2.0 ms | 16.0 | 0.80 |
| BM25 | 6.1 ms | 7.6 | 0.53 |
| RAG++ | 484.0 ms | 10.0 | 0.65 |
Analysis: Graph Kernel's weakest category. It has no semantic understanding: a search for "music" won't find triples about "audio production" or "sound design". Keyword search scores highest (0.80) because of the breadth of its substring matching ("music" appears directly in some triple text). RAG++ is the only method with embedding-based similarity, yet it reaches only 0.65, and BM25 trails at 0.53.
#### Predicate-specific Queries
| Method | Avg Latency | Avg Results | Avg Relevance |
|--------|-------------|-------------|---------------|
| Graph Kernel | 230.1 ms | 16.0 | 0.80 |
| Keyword | 3.3 ms | 20.0 | 1.00 |
| BM25 | 9.5 ms | 20.0 | 1.00 |
| RAG++ | 460.8 ms | 10.0 | 0.80 |
Analysis: Graph Kernel should excel here (exact predicate filters), but one query returned 0 results due to subject mismatch — "Dream Weaver" has files under a different capitalization/alias. This exposes a normalization gap in the knowledge extraction pipeline.
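One way to close this kind of gap is to resolve subjects to a canonical form before querying. The sketch below is an assumption-laden illustration, not the pipeline's implementation: the `normalize` rule and the `ALIASES` mapping are hypothetical, and the canonical name `dream-weaver-engine` is taken from the alias example reported elsewhere in this document.

```python
import re


def normalize(name: str) -> str:
    """Lowercase and collapse spaces, hyphens, and underscores into single hyphens."""
    return re.sub(r"[\s_\-]+", "-", name.strip().lower())


# Hypothetical alias table: normalized alias -> canonical subject.
ALIASES = {
    normalize("Dream Weaver"): "dream-weaver-engine",
    normalize("dream-weaver-engine"): "dream-weaver-engine",
}


def canonical_subject(name: str) -> str:
    """Map any known alias to its canonical subject; unknown names are only normalized."""
    key = normalize(name)
    return ALIASES.get(key, key)


print(canonical_subject("Dream Weaver"))
```

Normalization alone handles capitalization and separator differences; the alias table is still needed for names that differ in wording, such as a short display name versus a repository name.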
3.2 Overall Averages
| Method | Avg Latency | Avg Results | Avg Relevance | Latency Rank | Relevance Rank |
|---|---|---|---|---|---|
| Keyword | 2.8 ms | 19.1 | 0.96 | 🥇 | 🥇 |
| BM25 | 8.5 ms | 15.4 | 0.91 | 🥈 | 🥈 |
| Graph Kernel | 291.7 ms | 11.0 | 0.84 | 🥉 | 🥉 |
| RAG++ | 407.9 ms | 10.0 | 0.70 | 4th | 4th |
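The overall figures are consistent with a query-count-weighted mean of the per-category averages rather than a plain mean of the five categories. A quick check for Graph Kernel relevance, using only numbers from the tables above (the weighting scheme itself is inferred, not stated by the benchmark suite):

```python
# (query count, Graph Kernel avg relevance) per category, copied from the tables above.
categories = {
    "factual": (6, 1.00),
    "relationship": (6, 0.94),
    "multi_hop": (5, 1.00),
    "fuzzy": (5, 0.42),
    "predicate": (5, 0.80),
}

total = sum(n for n, _ in categories.values())
overall = sum(n * r for n, r in categories.values()) / total
unweighted = sum(r for _, r in categories.values()) / len(categories)

print(total, round(overall, 2), round(unweighted, 2))
```

The weighted mean reproduces the reported 0.84 over 27 queries, while the unweighted mean of the five categories would round to 0.83.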
3.3 Latency Distribution
| Component | Contribution |
|---|---|
| Network RTT to Supabase Postgres | ~180-200 ms (baseline) |
| PostgreSQL query execution | ~5-20 ms |
| Rust serialization | ~1-2 ms |
| Multi-hop (per additional hop) | +200 ms |
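Key insight: roughly 90% of single-lookup latency is the network round trip to the remote Supabase PostgreSQL instance (~180-200 ms), not query execution or serialization.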
---
4. Strengths & Weaknesses
Graph Kernel Strengths
| Strength | Evidence | Impact |
|---|---|---|
| Structural queries | 1.00 relevance on multi-hop | Only method that returns causally connected results |
| Precise filtering | Exact subject/predicate/object filters | No false positives from keyword coincidence |
| Deterministic slicing | slice_id fingerprints are reproducible | Critical for provenance, audit, replay |
| Admissibility tokens | HMAC-signed, verifiable bundles | Downstream trust without sharing secrets |
| Confidence scoring | Per-triple confidence values | Quality-aware retrieval |
| Provenance tracking | Source field on every triple | Know where knowledge came from |
Graph Kernel Weaknesses
| Weakness | Evidence | Severity |
|---|---|---|
| High latency | 291ms avg (remote DB) | Medium — fixable with local DB |
| No semantic search | 0.42 fuzzy relevance | High — can't handle paraphrases |
| Entity normalization | "Dream Weaver" ≠ "dream-weaver-engine" | High — knowledge fragmented across aliases |
| No full-text search | Must match exact subject/predicate/object | High — no LIKE, no trigram, no FTS |
| Limited traversal API | Must do multi-hop via multiple HTTP calls | Medium — needs server-side traversal endpoint |
| No pagination | Max 500 results, no offset | Low — but limits bulk operations |
| Small corpus | 2,681 triples vs 107K turns | Context — triple extraction is lossy |
---
5. Category-Specific Recommendations
### Where Graph Kernel Excels (USE IT)
1. Answering "what connects X to Y?" — multi-hop traversal is uniquely valuable
2. Structured lookups — "what does X use?" with exact entity names
3. Provenance-critical workflows — when you need to prove which context influenced a decision
4. Dependency analysis — following chains of depends_on, integrates_with, uses
5. Context window construction — the slice API is its true purpose and unmatched
### Where Graph Kernel Struggles (USE ALTERNATIVES)
1. Fuzzy/semantic search → Use RAG++ (vector similarity)
2. Speed-critical autocomplete → Use Keyword/BM25 (in-memory, sub-10ms)
3. Raw conversation retrieval → Use RAG++ (has the full 107K corpus)
4. Exploratory browsing → Use Keyword (broad results, instant)
---
6. Improvement Recommendations
Priority 1: Entity Normalization (High Impact)
Problem: "Dream Weaver", "Dream-weaver-engine", "dream-weaver-engine" are separate subjects
Impact: Queries miss related triples, fragmented knowledge
Fix: Canonical entity resolution during Kimi-K2 extraction
Add alias table: `subject_aliases(canonical_id, alias_text)`

Priority 2: Server-Side Graph Traversal (High Impact)
Problem: Multi-hop requires N sequential HTTP calls (N × 200ms)
Impact: 2-hop query takes 600ms+ vs 200ms for single lookup
Fix: Add POST /api/knowledge/traverse endpoint
{ "start": "Mohamed Diomande", "max_hops": 2, "predicates": ["works_on", "uses"] }
Execute traversal server-side in single DB transaction

Priority 3: Full-Text Search on Triples (Medium Impact)
Problem: No way to search "anything mentioning docker" without exact field match
Impact: Fuzzy queries return irrelevant or zero results
Fix: Add PostgreSQL tsvector column + GIN index on (subject || predicate || object)
Enable ILIKE or pg_trgm for approximate matching

Priority 4: Local Database Option (Medium Impact)
Problem: 180-200ms network overhead per query to Supabase
Impact: Can't compete with in-memory BM25 on latency
Fix: Support SQLite backend for local-first mode
Use Supabase as sync target, not primary query engine

Priority 5: Hybrid Search (Future)
Problem: Structured graph and semantic search are complementary, not competing
Impact: Best results need both structured + semantic
Fix: Query Graph Kernel for structure, RAG++ for semantics, merge results
Already partially implemented in RAG++'s `/api/rag/search/enhanced`
---
7. The Context Slicing Value Proposition
The benchmark focused on the knowledge graph `/api/knowledge` endpoint, but this is a secondary feature. The Graph Kernel's core value is the context slicing API (`/api/slice`):
POST /api/slice
{
"anchor_turn_id": "726d8410-3a61-448b-a2d9-463598bbdda3"
}
→ Returns:
- Deterministic slice of related turns (BFS from anchor)
- Policy-governed (phase-weighted, budget-bounded)
- Fingerprinted (slice_id = xxHash64 of canonical content)
- HMAC-signed (admissibility_token for downstream verification)

No other system in the stack provides this. RAG++ finds similar turns by embedding distance. BM25/keyword find matching text. Only the Graph Kernel constructs a deterministic, reproducible, provenance-tracked context window around a specific conversation turn.
This is essential for:
- Reproducible analysis: Same slice_id = same input = same output
- Audit trails: Every downstream artifact references its source slice
- Trust boundaries: Admissibility tokens prove context wasn't tampered with
- Budget control: Policies limit context window size (max_nodes, max_radius)
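The properties listed above (deterministic BFS, budget bounds, fingerprinting, signing) can be illustrated with a small sketch. This is not the Graph Kernel's implementation: the graph shape, function names, and key are hypothetical, phase weighting is omitted, and BLAKE2b from the standard library stands in for xxHash64.

```python
import hashlib
import hmac
import json
from collections import deque


def build_slice(graph, anchor, max_nodes=8, max_radius=2):
    """Bounded BFS from the anchor; neighbors are visited in sorted order for determinism."""
    seen, order = {anchor}, [anchor]
    queue = deque([(anchor, 0)])
    while queue and len(order) < max_nodes:
        node, depth = queue.popleft()
        if depth == max_radius:
            continue  # radius budget reached, do not expand further
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen and len(order) < max_nodes:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
    return order


def fingerprint(turn_ids):
    """Stand-in fingerprint: 64-bit BLAKE2b over canonical JSON (the source names xxHash64)."""
    canonical = json.dumps(sorted(turn_ids), separators=(",", ":")).encode()
    return hashlib.blake2b(canonical, digest_size=8).hexdigest()


def sign(slice_id, key):
    """HMAC-SHA256 token over the slice fingerprint."""
    return hmac.new(key, slice_id.encode(), hashlib.sha256).hexdigest()


def verify(slice_id, token, key):
    # Constant-time comparison; this sketch assumes the verifier holds the same key.
    return hmac.compare_digest(sign(slice_id, key), token)


# Hypothetical turn graph: t5 lies beyond max_radius=2 and is excluded.
graph = {"t1": ["t3", "t2"], "t2": ["t4"], "t3": [], "t4": ["t5"]}
turns = build_slice(graph, "t1")
slice_id = fingerprint(turns)
token = sign(slice_id, b"demo-key")
```

Sorting neighbors before expansion and hashing a canonical serialization are what make the same anchor and policy yield the same `slice_id` on every run. How the real service distributes or checks the signing key is not described in this report.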
---
8. Verdict
Is the Graph Kernel worth the operational complexity?
Yes — but not for the reasons the benchmark tested.
| Criterion | Assessment |
|---|---|
| As a search engine | ❌ Not competitive. Keyword/BM25 are faster and more flexible |
| As a knowledge graph | ⚠️ Adequate but needs entity normalization and traversal API |
| As a context slicer | ✅ Irreplaceable. No alternative provides deterministic, auditable context construction |
| As part of CompCore stack | ✅ Essential. Provides the provenance layer that RAG++ and Orbit depend on |
### Operational Cost Justified When:
1. You need reproducible context windows for downstream analysis
2. You need auditable provenance (which slice produced which output)
3. You need multi-hop relationship reasoning over structured knowledge
4. You're building a cognitive twin that needs deterministic memory recall
### Not Justified When:
1. You only need keyword search (use SQLite FTS or grep)
2. You only need semantic similarity (use RAG++ directly)
3. Your corpus is too small to benefit from graph structure (<100 entities)
---
9. Appendix: Raw Data
### Test Environment
- Machine: MacBook Air (Apple Silicon, arm64)
- OS: Darwin 24.6.0
- Graph Kernel: Rust binary, v0.1.0
- RAG++: Python (FastAPI), v0.1.0
- Database: Supabase PostgreSQL (remote, us-east-1)
- Network: Home broadband (~200ms RTT to Supabase)
### Corpus Statistics
| Metric | Value |
|--------|-------|
| Total triples | 2,681 |
| Unique subjects | 70 |
| Unique predicates | 39 |
| Top predicate | has_file (774) |
| Data source | kimi-k2-extraction |
| Avg confidence | 0.73 |
### Top 10 Predicates
| Predicate | Count | Share |
|-----------|-------|-------|
| has_file | 774 | 28.9% |
| needs_to | 456 | 17.0% |
| has_path | 344 | 12.8% |
| should | 318 | 11.9% |
| likes | 209 | 7.8% |
| wants_to | 173 | 6.5% |
| building | 77 | 2.9% |
| item | 50 | 1.9% |
| uses | 50 | 1.9% |
| works_on | 50 | 1.9% |
### Benchmark Script
Located at: `benchmarks/run_benchmark.py`
Raw JSON results: `/tmp/benchmark_results.json`
---
Generated by automated benchmark suite on 2026-02-13. Graph Kernel commit: HEAD of cc-graph-kernel.
Source Anchor
Comp-Core/benchmarks/graph-kernel-evaluation.md