# Graph Kernel Benchmark Evaluation

Date: 2026-02-13
Version: Graph Kernel v0.1.0, Schema v1.0.0
Author: Automated benchmark suite

---

## Executive Summary

The Graph Kernel service at `localhost:8001` was evaluated against three baseline retrieval methods across 27 queries in 5 categories. The evaluation reveals that the Graph Kernel is not a general-purpose search engine — it's a deterministic context slicing engine with a bolted-on knowledge graph. Its real value lies in provenance-tracked, policy-governed context construction — not keyword matching.

Verdict: Worth the operational complexity if you need deterministic, auditable context construction. The knowledge graph query layer needs significant improvement to compete with even naive baselines for ad-hoc search.

---

## 1. System Under Test

### Graph Kernel (localhost:8001)
- Language: Rust (Axum + sqlx)
- Storage: PostgreSQL (Supabase-hosted, remote)
- Data: 2,681 knowledge triples (subject–predicate–object), 70 unique subjects, 39 unique predicates
- Source: Kimi-K2 memory extraction (synced from conversation history)
- Primary purpose: Deterministic context slicing for conversation DAGs
- Secondary purpose: Knowledge graph triple store (the `/api/knowledge` endpoint we're benchmarking)

### Baselines
| Method | Description |
|--------|------------|
| Keyword | Substring matching across `subject + predicate + object` text |
| BM25 | Classic Okapi BM25 (k1=1.5, b=0.75) over the same triple corpus |
| RAG++ (localhost:8000) | Vector similarity search over 107K+ conversation turns (embeddings) |
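
To make the BM25 baseline concrete, here is a minimal sketch of Okapi BM25 with the parameters from the table (k1=1.5, b=0.75). It is not the benchmark's own code: the whitespace tokenizer, the smoothed IDF variant, and the three sample triples are assumptions for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, not the suite's)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF that stays positive for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical triples flattened to "subject predicate object" text.
triples = [
    "clawdbot uses gemini batch api",
    "clawdbot deploys_to google cloud platform",
    "mohamed works_on clawdbot",
]
scores = bm25_scores("clawdbot uses", triples)
```

The first triple scores highest because it matches both query terms, and the rarer term ("uses") carries most of the weight.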

### Important Context
The four methods do not all search the same data:
- Graph Kernel / Keyword / BM25: 2,681 structured triples (extracted knowledge)
- RAG++: 107K+ raw conversation turns (original text with embeddings)

This means RAG++ results are not directly comparable for precision — it searches a fundamentally different and much larger corpus. We include it to understand the semantic retrieval landscape.

---

## 2. Methodology

### Query Categories (27 total queries)
| Category | Count | Tests |
|----------|-------|-------|
| Factual Recall | 6 | Direct attribute lookups ("What does X use?") |
| Relationship | 6 | Dependency/integration mapping ("What depends on X?") |
| Multi-hop | 5 | 2-hop graph traversal ("X → Y → Z") |
| Fuzzy/Semantic | 5 | Loose topic matching ("anything about skating") |
| Predicate-specific | 5 | Structured predicate filters ("likes", "should", "has_file") |

### Metrics
- Response time (ms): Wall-clock latency including network round-trip
- Result count: Number of results returned
- Relevance score (0–1): Fraction of expected terms found in results
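
The relevance metric can be sketched as follows. This is one plausible reading of "fraction of expected terms found in results", assuming case-insensitive substring matching over the concatenated result text; the benchmark script may differ in details.

```python
def relevance(expected_terms, results):
    """Fraction of expected terms that appear in at least one result."""
    if not expected_terms:
        return 0.0
    haystack = " ".join(results).lower()
    hits = sum(1 for term in expected_terms if term.lower() in haystack)
    return hits / len(expected_terms)


# "GCP" is expected, but the stored object spells it out in full.
score = relevance(
    ["GCP", "clawdbot"],
    ["clawdbot deploys_to Google Cloud Platform"],
)
```

Because the metric only checks for term presence, a method that returns more results has more chances to score a hit, which is worth keeping in mind when comparing methods with different result counts.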

### Graph Kernel query strategy
- Factual/Relationship/Predicate queries use the `/api/knowledge` endpoint with exact `subject`, `predicate`, and/or `object` filters
- Multi-hop queries use sequential API calls: fetch triples for subject → follow objects → fetch triples for new subjects
- Fuzzy queries use the general endpoint (no filters) or predicate filters
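
The sequential multi-hop strategy can be sketched as below. The `fetch` callable stands in for one HTTP call to `/api/knowledge` with a subject filter; the exact query parameter names are not given here, so the sketch injects the lookup instead of guessing them. The in-memory store and the `main.rs` object are hypothetical.

```python
from typing import Callable, Dict, List

Triple = Dict[str, str]
Fetch = Callable[[str], List[Triple]]


def two_hop(start: str, fetch: Fetch) -> List[Triple]:
    """Fetch triples for `start`, then for each distinct object it points to."""
    first = fetch(start)
    chain = list(first)
    seen = {start}
    for triple in first:
        nxt = triple["object"]
        if nxt in seen:
            continue  # avoid refetching the same subject
        seen.add(nxt)
        chain.extend(fetch(nxt))  # one more round-trip per new subject
    return chain


# In-memory stand-in for the service, recording each simulated call.
STORE = [
    {"subject": "Mohamed", "predicate": "works_on", "object": "clawdbot"},
    {"subject": "clawdbot", "predicate": "uses", "object": "Gemini batch API"},
    {"subject": "Dream Weaver", "predicate": "has_file", "object": "main.rs"},
]
calls = []


def fetch_from_store(subject: str) -> List[Triple]:
    calls.append(subject)
    return [t for t in STORE if t["subject"] == subject]


chain = two_hop("Mohamed", fetch_from_store)
```

Each entry in `calls` corresponds to one network round-trip, which is why multi-hop latency grows with the number of subjects followed.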

---

## 3. Results

### 3.1 Per-Category Performance

#### Factual Recall
| Method | Avg Latency | Avg Results | Avg Relevance |
|--------|-------------|-------------|---------------|
| Graph Kernel | 248.3 ms | 3.7 | 1.00 |
| Keyword | 2.7 ms | 20.0 | 1.00 |
| BM25 | 9.0 ms | 18.2 | 1.00 |
| RAG++ | 421.9 ms | 10.0 | 0.92 |

Analysis: All triple-based methods achieve perfect relevance on factual recall. Graph Kernel is slower because every query hits a remote PostgreSQL instance over the network (~200 ms baseline), whereas Keyword and BM25 run in memory. RAG++ misses some factual queries because a conversation turn does not always contain the exact entity name (for example, a turn can be about Mohamed without the name "Mohamed" appearing in its text).

#### Relationship Queries
| Method | Avg Latency | Avg Results | Avg Relevance |
|--------|-------------|-------------|---------------|
| Graph Kernel | 204.3 ms | 9.5 | 0.94 |
| Keyword | 2.8 ms | 19.3 | 1.00 |
| BM25 | 8.7 ms | 12.3 | 1.00 |
| RAG++ | 336.4 ms | 10.0 | 0.69 |

Analysis: Graph Kernel returns precisely scoped results (only matching triples), while Keyword and BM25 return many partially matching results. Graph Kernel's drop to 0.94 relevance comes from a single query: the expected term "GCP" was absent from the `deploys_to` results, which store "Google Cloud Platform" instead. RAG++ struggles with relationship queries because raw conversation turns rarely state "A depends on B" explicitly.

#### Multi-hop Reasoning ⭐
| Method | Avg Latency | Avg Results | Avg Relevance |
|--------|-------------|-------------|---------------|
| Graph Kernel | 586.6 ms | 7.6 | 1.00 |
| Keyword | 3.3 ms | 20.0 | 1.00 |
| BM25 | 9.2 ms | 18.8 | 1.00 |
| RAG++ | 348.1 ms | 10.0 | 0.40 |

Analysis: This is Graph Kernel's killer feature. Multi-hop traversal (Mohamed → works_on → clawdbot → uses → Gemini batch API) returns causally connected results that follow actual relationship chains. Keyword/BM25 achieve high relevance scores but through keyword coincidence, not structural reasoning — they find documents containing "Mohamed" and "clawdbot" but don't understand the connection between them. RAG++ drops to 0.40 relevance because conversation turns rarely contain multi-hop relationship chains in a single text span.

The relevance scores mask a critical quality difference: Graph Kernel returns 7.6 results that are structurally connected through the graph. Keyword returns 20 results that happen to contain matching words. The GK results are a knowledge chain; the keyword results are a coincidence pile.

#### Fuzzy/Semantic Search
| Method | Avg Latency | Avg Results | Avg Relevance |
|--------|-------------|-------------|---------------|
| Graph Kernel | 215.2 ms | 19.8 | 0.42 |
| Keyword | 2.0 ms | 16.0 | 0.80 |
| BM25 | 6.1 ms | 7.6 | 0.53 |
| RAG++ | 484.0 ms | 10.0 | 0.65 |

Analysis: Graph Kernel's weakest category (0.42). It has no semantic understanding: a search for "music" won't find triples about "audio production" or "sound design". Keyword search scores highest here (0.80) because of the breadth of its substring matching ("music" appears directly in some triple text). RAG++ comes second (0.65): embedding-based similarity helps with paraphrases, but it still misses expected terms. BM25 trails at 0.53.

#### Predicate-specific Queries
| Method | Avg Latency | Avg Results | Avg Relevance |
|--------|-------------|-------------|---------------|
| Graph Kernel | 230.1 ms | 16.0 | 0.80 |
| Keyword | 3.3 ms | 20.0 | 1.00 |
| BM25 | 9.5 ms | 20.0 | 1.00 |
| RAG++ | 460.8 ms | 10.0 | 0.80 |

Analysis: Graph Kernel should excel here (exact predicate filters), but one query returned 0 results due to subject mismatch — "Dream Weaver" has files under a different capitalization/alias. This exposes a normalization gap in the knowledge extraction pipeline.

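### 3.2 Overall Averages

| Method | Avg Latency | Avg Results | Avg Relevance | Latency Rank | Relevance Rank |
|--------|-------------|-------------|---------------|--------------|----------------|
| Keyword | 2.8 ms | 19.1 | 0.96 | 🥇 | 🥇 |
| BM25 | 8.5 ms | 15.4 | 0.91 | 🥈 | 🥈 |
| Graph Kernel | 291.7 ms | 11.0 | 0.84 | 🥉 | 🥉 |
| RAG++ | 407.9 ms | 10.0 | 0.70 | 4th | 4th |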
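
The overall figures appear to be means weighted by the number of queries in each category, rather than plain averages of the five category rows. That is an inference from the numbers, not something the report states. The sketch below reproduces the Graph Kernel row from the per-category tables under that assumption.

```python
# (query count, Graph Kernel avg latency in ms, Graph Kernel avg relevance),
# copied from the per-category tables in section 3.1.
CATEGORIES = {
    "Factual Recall": (6, 248.3, 1.00),
    "Relationship": (6, 204.3, 0.94),
    "Multi-hop": (5, 586.6, 1.00),
    "Fuzzy/Semantic": (5, 215.2, 0.42),
    "Predicate-specific": (5, 230.1, 0.80),
}


def weighted_mean(rows, column):
    """Mean of `column`, weighting each row by its query count (column 0)."""
    total = sum(row[0] for row in rows)
    return sum(row[0] * row[column] for row in rows) / total


rows = list(CATEGORIES.values())
overall_latency = weighted_mean(rows, 1)
overall_relevance = weighted_mean(rows, 2)
print(round(overall_latency, 1), round(overall_relevance, 2))
```

An unweighted mean of the same rows gives about 296.9 ms and 0.83, which does not match the table, so the query-weighted reading seems the more likely one.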

### 3.3 Latency Distribution

| Component | Contribution |
|-----------|--------------|
| Network RTT to Supabase Postgres | ~180-200 ms (baseline) |
| PostgreSQL query execution | ~5-20 ms |
| Rust serialization | ~1-2 ms |
| Multi-hop (per additional hop) | +200 ms |

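Key insight: roughly 90% of single-query latency is the network round-trip to the remote Supabase database, not query execution or serialization.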
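
As a rough check on the latency breakdown, the sketch below models each hop as one sequential call that pays the round-trip, query, and serialization costs. The default values are midpoints of the ranges in the table above, chosen here for illustration; they are not measured values.

```python
def estimate_latency_ms(hops, rtt_ms=190.0, query_ms=12.5, serialize_ms=1.5):
    """Estimated wall-clock time for `hops` sequential calls."""
    return hops * (rtt_ms + query_ms + serialize_ms)


def network_share(rtt_ms=190.0, query_ms=12.5, serialize_ms=1.5):
    """Fraction of a single call's latency spent on the network round-trip."""
    return rtt_ms / (rtt_ms + query_ms + serialize_ms)


single = estimate_latency_ms(1)
three_calls = estimate_latency_ms(3)
share = network_share()
```

Under these assumptions the network accounts for a little over 90% of a single call, and three sequential calls land near the observed multi-hop average of 586.6 ms.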

---

## 4. Strengths & Weaknesses

### Graph Kernel Strengths

| Strength | Evidence | Impact |
|----------|----------|--------|
| Structural queries | 1.00 relevance on multi-hop | Only method that returns causally connected results |
| Precise filtering | Exact subject/predicate/object filters | No false positives from keyword coincidence |
| Deterministic slicing | `slice_id` fingerprints are reproducible | Critical for provenance, audit, replay |
| Admissibility tokens | HMAC-signed, verifiable bundles | Downstream trust without sharing secrets |
| Confidence scoring | Per-triple confidence values | Quality-aware retrieval |
| Provenance tracking | Source field on every triple | Know where knowledge came from |

### Graph Kernel Weaknesses

| Weakness | Evidence | Severity |
|----------|----------|----------|
| High latency | 291.7 ms avg (remote DB) | Medium (fixable with local DB) |
| No semantic search | 0.42 fuzzy relevance | High (can't handle paraphrases) |
| Entity normalization | "Dream Weaver" ≠ "dream-weaver-engine" | High (knowledge fragmented across aliases) |
| No full-text search | Must match exact subject/predicate/object | High (no LIKE, no trigram, no FTS) |
| Limited traversal API | Must do multi-hop via multiple HTTP calls | Medium (needs server-side traversal endpoint) |
| No pagination | Max 500 results, no offset | Low (but limits bulk operations) |
| Small corpus | 2,681 triples vs 107K turns | Context (triple extraction is lossy) |

---

## 5. Category-Specific Recommendations

### Where Graph Kernel Excels (USE IT)
1. Answering "what connects X to Y?" — multi-hop traversal is uniquely valuable
2. Structured lookups — "what does X use?" with exact entity names
3. Provenance-critical workflows — when you need to prove which context influenced a decision
4. Dependency analysis — following chains of depends_on, integrates_with, uses
5. Context window construction — the slice API is its true purpose and unmatched

### Where Graph Kernel Struggles (USE ALTERNATIVES)
1. Fuzzy/semantic search → Use RAG++ (vector similarity)
2. Speed-critical autocomplete → Use Keyword/BM25 (in-memory, sub-10ms)
3. Raw conversation retrieval → Use RAG++ (has the full 107K corpus)
4. Exploratory browsing → Use Keyword (broad results, instant)

---

## 6. Improvement Recommendations

### Priority 1: Entity Normalization (High Impact)

- Problem: "Dream Weaver", "Dream-weaver-engine", and "dream-weaver-engine" are stored as separate subjects
- Impact: Queries miss related triples; knowledge is fragmented
- Fix: Canonical entity resolution during Kimi-K2 extraction. Add an alias table: `subject_aliases(canonical_id, alias_text)`
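
A minimal sketch of alias resolution, assuming a normalization step followed by a lookup in an alias map. The dictionary is a hypothetical in-memory stand-in for the proposed `subject_aliases` table, and picking `dream-weaver-engine` as the canonical form is an assumption made for the example.

```python
import re


def normalize(name: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


# Hypothetical stand-in for subject_aliases(canonical_id, alias_text),
# keyed by the normalized alias text.
SUBJECT_ALIASES = {
    normalize("Dream Weaver"): "dream-weaver-engine",
    normalize("Dream-weaver-engine"): "dream-weaver-engine",
}


def canonical_subject(name: str) -> str:
    """Map any known alias to its canonical ID; unknown names normalize only."""
    key = normalize(name)
    return SUBJECT_ALIASES.get(key, key)


resolved = {canonical_subject(n) for n in
            ["Dream Weaver", "Dream-weaver-engine", "dream-weaver-engine"]}
```

Normalization alone already merges the two capitalization variants; the alias map is only needed for names that differ in more than case and punctuation, such as "Dream Weaver".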

### Priority 2: Server-Side Graph Traversal (High Impact)

- Problem: Multi-hop requires N sequential HTTP calls (N × 200 ms)
- Impact: A 2-hop query takes about 600 ms, vs about 200 ms for a single lookup
- Fix: Add a `POST /api/knowledge/traverse` endpoint accepting `{ "start": "Mohamed Diomande", "max_hops": 2, "predicates": ["works_on", "uses"] }` and execute the traversal server-side in a single DB transaction
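
The traversal such an endpoint would perform can be sketched as a bounded breadth-first walk. The function below mirrors the proposed request fields (`start`, `max_hops`, `predicates`) but runs over an in-memory list of tuples; it illustrates the logic and is not the service's implementation. The last two sample triples are hypothetical.

```python
from collections import deque


def traverse(triples, start, max_hops=2, predicates=None):
    """Breadth-first walk over (subject, predicate, object) tuples."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, []).append((s, p, o))
    visited = {start}
    queue = deque([(start, 0)])
    path = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # hop budget exhausted for this branch
        for s, p, o in by_subject.get(node, []):
            if predicates is not None and p not in predicates:
                continue  # edge type not requested
            path.append((s, p, o))
            if o not in visited:
                visited.add(o)
                queue.append((o, depth + 1))
    return path


TRIPLES = [
    ("Mohamed Diomande", "works_on", "clawdbot"),
    ("clawdbot", "uses", "Gemini batch API"),
    ("clawdbot", "has_file", "README.md"),          # hypothetical
    ("Gemini batch API", "uses", "third-hop node"),  # hypothetical
]
path = traverse(TRIPLES, "Mohamed Diomande", max_hops=2,
                predicates=["works_on", "uses"])
```

Running this inside the service would replace several round-trips with one, since the whole walk happens next to the data.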

### Priority 3: Full-Text Search on Triples (Medium Impact)

- Problem: No way to search "anything mentioning docker" without an exact field match
- Impact: Fuzzy queries return irrelevant or zero results
- Fix: Add a PostgreSQL `tsvector` column and GIN index on `(subject || predicate || object)`. Enable `ILIKE` or `pg_trgm` for approximate matching
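
To show what trigram matching would buy, here is a pure-Python sketch loosely modeled on how `pg_trgm` similarity is documented to work: lowercase the text, split on non-alphanumerics, pad each word, and compare the sets of three-character substrings. It is an approximation for illustration and does not call PostgreSQL.

```python
import re


def trigrams(text: str) -> set:
    """Set of padded three-character substrings for each word in `text`."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by total distinct trigrams."""
    ga, gb = trigrams(a), trigrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


alias_score = similarity("Dream Weaver", "dream-weaver-engine")
unrelated_score = similarity("docker", "music")
```

The two "Dream Weaver" spellings that an exact filter treats as unrelated score well above an unrelated pair, so a similarity threshold could recover the missed subject. Choosing that threshold is left open here.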

### Priority 4: Local Database Option (Medium Impact)

- Problem: 180-200 ms network overhead per query to Supabase
- Impact: Can't compete with in-memory BM25 on latency
- Fix: Support a SQLite backend for local-first mode. Use Supabase as a sync target, not the primary query engine

### Priority 5: Hybrid Search (Future)

- Problem: Structured graph and semantic search are complementary, not competing
- Impact: Best results need both structured and semantic retrieval
- Fix: Query Graph Kernel for structure and RAG++ for semantics, then merge the results. Already partially implemented in RAG++'s `/api/rag/search/enhanced`
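
One common way to merge two ranked lists is reciprocal rank fusion (RRF), sketched below. This is a suggestion, not a description of what `/api/rag/search/enhanced` does. It also assumes both systems' results can be mapped to shared identifiers (the `t1`..`t4` IDs are hypothetical), which the report does not establish.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists; each item scores sum(1 / (k + rank)) over lists."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by item ID to keep output deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


graph_hits = ["t1", "t2", "t3"]     # hypothetical Graph Kernel ranking
semantic_hits = ["t3", "t1", "t4"]  # hypothetical RAG++ ranking
merged = reciprocal_rank_fusion([graph_hits, semantic_hits])
```

Items that both systems return rise to the top, while items found by only one system are kept but ranked lower.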

---

## 7. The Context Slicing Value Proposition

The benchmark focused on the knowledge graph `/api/knowledge` endpoint, but this is a secondary feature. The Graph Kernel's core value is the context slicing API (`/api/slice`):

POST /api/slice
{
  "anchor_turn_id": "726d8410-3a61-448b-a2d9-463598bbdda3"
}

→ Returns:
  - Deterministic slice of related turns (BFS from anchor)
  - Policy-governed (phase-weighted, budget-bounded)
  - Fingerprinted (slice_id = xxHash64 of canonical content)
  - HMAC-signed (admissibility_token for downstream verification)

No other system in the stack provides this. RAG++ finds similar turns by embedding distance. BM25/keyword find matching text. Only the Graph Kernel constructs a deterministic, reproducible, provenance-tracked context window around a specific conversation turn.

This is essential for:
- Reproducible analysis: Same slice_id = same input = same output
- Audit trails: Every downstream artifact references its source slice
- Trust boundaries: Admissibility tokens prove context wasn't tampered with
- Budget control: Policies limit context window size (max_nodes, max_radius)
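
The properties listed above can be illustrated with a small sketch: a budget-bounded breadth-first slice, a fingerprint over the canonical content, and an HMAC token. Several details are assumptions. BLAKE2b with an 8-byte digest stands in for xxHash64 (which is not in the Python standard library), HMAC-SHA256 is assumed for the token, the canonical form is a sorted JSON list of turn IDs, and phase weighting is not modeled. The graph and secret are hypothetical.

```python
import hashlib
import hmac
import json
from collections import deque


def bfs_slice(graph, anchor, max_nodes=5, max_radius=2):
    """Collect turn IDs breadth-first from the anchor, within the budget."""
    selected, seen = [anchor], {anchor}
    queue = deque([(anchor, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_radius:
            continue
        # Sorted neighbours: input ordering cannot change the result.
        for neighbour in sorted(graph.get(node, ())):
            if neighbour in seen:
                continue
            if len(selected) == max_nodes:
                return selected
            seen.add(neighbour)
            selected.append(neighbour)
            queue.append((neighbour, depth + 1))
    return selected


def slice_id(turn_ids):
    """64-bit fingerprint of the canonical (sorted) turn list."""
    canonical = json.dumps(sorted(turn_ids), separators=(",", ":")).encode()
    return hashlib.blake2b(canonical, digest_size=8).hexdigest()


def sign(fingerprint, secret):
    """HMAC token over the fingerprint (SHA-256 is an assumption)."""
    return hmac.new(secret, fingerprint.encode(), hashlib.sha256).hexdigest()


def verify(fingerprint, token, secret):
    """Constant-time check that the token matches the fingerprint."""
    return hmac.compare_digest(sign(fingerprint, secret), token)


GRAPH = {"a": ["c", "b"], "b": ["d"], "c": [], "d": ["e"]}
SECRET = b"hypothetical-shared-secret"
turns = bfs_slice(GRAPH, "a", max_nodes=4, max_radius=2)
fingerprint = slice_id(turns)
token = sign(fingerprint, SECRET)
```

Because the neighbour order and the canonical form are both fixed, the same anchor and policy always yield the same fingerprint, and any change to the slice invalidates the token.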

---

## 8. Verdict

Is the Graph Kernel worth the operational complexity?

Yes — but not for the reasons the benchmark tested.

| Criterion | Assessment |
|-----------|------------|
| As a search engine | ❌ Not competitive. Keyword/BM25 are faster and more flexible |
| As a knowledge graph | ⚠️ Adequate but needs entity normalization and traversal API |
| As a context slicer | ✅ Irreplaceable. No alternative provides deterministic, auditable context construction |
| As part of CompCore stack | ✅ Essential. Provides the provenance layer that RAG++ and Orbit depend on |

### Operational Cost Justified When:
1. You need reproducible context windows for downstream analysis
2. You need auditable provenance (which slice produced which output)
3. You need multi-hop relationship reasoning over structured knowledge
4. You're building a cognitive twin that needs deterministic memory recall

### Not Justified When:
1. You only need keyword search (use SQLite FTS or grep)
2. You only need semantic similarity (use RAG++ directly)
3. Your corpus is too small to benefit from graph structure (<100 entities)

---

## 9. Appendix: Raw Data

### Test Environment
- Machine: MacBook Air (Apple Silicon, arm64)
- OS: Darwin 24.6.0
- Graph Kernel: Rust binary, v0.1.0
- RAG++: Python (FastAPI), v0.1.0
- Database: Supabase PostgreSQL (remote, us-east-1)
- Network: Home broadband (~200ms RTT to Supabase)

### Corpus Statistics
| Metric | Value |
|--------|-------|
| Total triples | 2,681 |
| Unique subjects | 70 |
| Unique predicates | 39 |
| Top predicate | has_file (774) |
| Data source | kimi-k2-extraction |
| Avg confidence | 0.73 |

### Top 10 Predicates
| Predicate | Count | Share (%) |
|-----------|-------|-----------|
| has_file | 774 | 28.9 |
| needs_to | 456 | 17.0 |
| has_path | 344 | 12.8 |
| should | 318 | 11.9 |
| likes | 209 | 7.8 |
| wants_to | 173 | 6.5 |
| building | 77 | 2.9 |
| item | 50 | 1.9 |
| uses | 50 | 1.9 |
| works_on | 50 | 1.9 |

### Benchmark Script
Located at: `benchmarks/run_benchmark.py`
Raw JSON results: `/tmp/benchmark_results.json`

---

Generated by automated benchmark suite on 2026-02-13. Graph Kernel commit: HEAD of cc-graph-kernel.

Source Anchor

Comp-Core/benchmarks/graph-kernel-evaluation.md
