Policy-Governed Context Slicing for Autonomous Agent Systems: A Lightweight Knowledge Graph Approach
\author{Mohamed Diomande
\IEEEauthorblockA{OpenClaw Research
Brooklyn, NY, USA
[email]}}
Abstract
Autonomous AI agent systems face a fundamental challenge: constructing reproducible, trustworthy context windows from large conversational histories while enforcing governance policies over what information may influence downstream decisions. We present the Graph Kernel, a deterministic context slicing engine implemented as a single Rust binary (${\sim}15$ KLOC) that combines a lightweight knowledge graph triple store with cryptographically-signed, policy-governed context window construction. Unlike general-purpose graph databases or retrieval-augmented generation (RAG) pipelines, the Graph Kernel introduces the concept of a provenance engine---a system whose primary purpose is not information retrieval but the production of verifiable, reproducible evidence bundles for autonomous agent reasoning.
We evaluate the Graph Kernel across 27 queries spanning five categories (factual recall, relationship mapping, multi-hop reasoning, fuzzy/semantic search, and predicate-specific queries) against three baseline methods: keyword search, BM25, and vector-similarity RAG. Results demonstrate that the Graph Kernel achieves perfect relevance (1.00) on multi-hop traversal queries---returning structurally connected knowledge chains rather than keyword-coincidence result sets---while maintaining sub-300ms average latency over a remote PostgreSQL backend. We further present a comparative analysis against nine industry-grade alternatives (Neo4j, Amazon Neptune, Apache Jena, Dgraph, TypeDB, Weaviate, LangChain/LlamaIndex Knowledge Graphs, Microsoft GraphRAG, and Zep), establishing that no existing system provides the combination of HMAC-signed deterministic context windows, policy-governed access control, and multi-hop provenance tracking that the Graph Kernel offers.
Our key contributions are: (1)~a formal model for HMAC-signed deterministic context windows with type-level enforcement of admissibility invariants; (2)~a policy governance framework for phase-weighted, budget-bounded context expansion; (3)~multi-hop provenance at sub-300ms latency with projected sub-30ms under local deployment; and (4)~a hybrid architecture positioning that bridges structural graph reasoning with semantic vector search.
Keywords
knowledge graphs, context management, autonomous agents, provenance, deterministic systems, retrieval-augmented generation, graph databases, policy governance
Introduction and Motivation
The Context Authority Problem
The rapid deployment of autonomous AI agents---systems that plan, reason, and act over extended multi-turn conversations---has exposed a critical gap in the infrastructure stack: who decides what context an agent is allowed to see, and how do we prove it?
Modern large language model (LLM) agents operate over conversation histories that can span tens of thousands of turns~[citation: vaswani2017attention]. At inference time, these histories must be compressed into context windows of fixed token budgets. The prevailing approach---truncation, summarization, or embedding-based retrieval---treats context selection as an information retrieval problem. But for autonomous agents making consequential decisions (code deployment, financial transactions, system administration), context selection is fundamentally a governance problem:
- Reproducibility. If an agent produces an output, can we reconstruct the exact context window that produced it?
- Auditability. Can a downstream system verify that a context window was authorized by a trusted authority?
- Policy Compliance. Can we enforce that certain conversation phases (e.g., debugging artifacts) are deprioritized relative to others (e.g., synthesis conclusions)?
- Tamper Resistance. Can we detect if a context window has been modified after construction?
None of the widely-deployed systems---vector databases~[citation: weaviate2023,johnson2021billion], RAG pipelines~[citation: lewis2020rag,gao2023rag_survey], or general-purpose graph databases~[citation: neo4j2024,angles2017foundations]---address these requirements as first-class concerns. They optimize for relevance, latency, or scale, but not for provenance.
The Provenance Engine Category
We propose that autonomous agent systems require a new category of infrastructure component: the provenance engine. A provenance engine is not a search engine, a vector database, or a general-purpose graph store. Its purpose is narrower and more fundamental:
Definition 1 (Provenance Engine). A service that, given a target anchor in a conversation DAG and a governance policy, produces a deterministic, cryptographically-signed context window (evidence bundle) such that: (a)~the same inputs always produce the same output; (b)~the output includes an unforgeable proof of authorization; and (c)~downstream services can verify the proof without accessing the signing secret.
The Graph Kernel is the first purpose-built implementation of this category. It answers a single question:
Given a target turn, which other turns are allowed to influence meaning---and can you prove it?
Contributions
This paper makes four contributions:
- HMAC-Signed Deterministic Context Windows. We present a formal model for context window construction where identical inputs (anchor turn, policy, graph state) produce identical outputs (slice ID, fingerprint, admissibility token). The admissibility token is an HMAC-SHA256 signature over six canonical fields, providing 128-bit tamper resistance. We enforce this at the Rust type level through the AdmissibleEvidenceBundle type, which can only be constructed through the verified pathway.
- Policy-Governed Access Control. We introduce SlicePolicyV1, a parameterized policy framework that governs context expansion through phase-weighted priority queues, budget bounds (max nodes, max radius), distance decay, and sibling expansion limits. Policies are registered in an immutable registry with hash-stable fingerprints, enabling reproducible policy resolution across sessions.
- Multi-hop Provenance at Sub-300ms. We demonstrate that structural multi-hop traversal (following actual relationship chains through a knowledge graph) produces qualitatively superior results to keyword coincidence, achieving 1.00 relevance on causally-connected queries. Average latency across all 27 queries is 291.7ms (network-dominated); multi-hop queries alone average 586.6ms because each hop is a separate HTTP call. We project sub-30ms latency under local SQLite deployment.
- Hybrid Architecture with Semantic Search. We present an integration architecture that bridges the Graph Kernel's structural reasoning with vector-similarity search (RAG++), combining deterministic provenance with semantic flexibility. This hybrid approach addresses the Graph Kernel's primary limitation (0.42 relevance on fuzzy/semantic queries) while preserving its provenance guarantees.
Paper Organization
Section~[ref: sec:related] surveys related work in knowledge graphs, RAG systems, context management, and graph databases. Section~[ref: sec:design] describes the system design of the Graph Kernel. Section~[ref: sec:methodology] presents our evaluation methodology. Section~[ref: sec:results] reports results across five query categories and four retrieval methods. Section~[ref: sec:comparative] provides a comparative analysis against nine industry alternatives. Section~[ref: sec:discussion] discusses implications, limitations, and the provenance engine positioning. Section~[ref: sec:future] outlines future work. Section~[ref: sec:conclusion] concludes.
Related Work
Knowledge Graphs and Triple Stores
Knowledge graphs have a long history in structured knowledge representation, from early semantic networks~[citation: sowa1992semantic] to the modern Knowledge Graph paradigm popularized by Google~[citation: singhal2012knowledge]. RDF-based systems like Apache Jena~[citation: jena2024] and Blazegraph~[citation: blazegraph2024] provide standards-compliant triple stores with SPARQL query interfaces and OWL-based reasoning. Property graph databases such as Neo4j~[citation: neo4j2024] and its Cypher query language~[citation: francis2018cypher] offer more flexible schema-on-read models suited to application development.
Recent surveys~[citation: ji2022survey,hogan2021knowledge] identify knowledge graphs as foundational to AI systems, but focus on construction, completion, and embedding---not on the governance and provenance dimensions that autonomous agents require. Hogan et al.~[citation: hogan2022knowledge] provide a comprehensive overview of knowledge graph technology but do not address deterministic context window construction.
The Graph Kernel differs from these systems by being purpose-built for a narrower mandate: it is not a general-purpose knowledge representation system but a context authority layer that happens to use a triple store as its backend. It trades query expressiveness (no SPARQL, no Cypher) for deployment simplicity (single binary, ${\sim}20$MB) and provenance guarantees (HMAC-signed bundles).
Retrieval-Augmented Generation
RAG~[citation: lewis2020rag] has become the dominant paradigm for grounding LLM outputs in external knowledge. Lewis et al. demonstrated that combining retrieval with generation improves factual accuracy on knowledge-intensive tasks. Subsequent work has refined the approach: REALM~[citation: guu2020realm] pre-trains the retriever jointly with the language model; FiD~[citation: izacard2021fid] processes multiple retrieved passages independently before fusion; and Self-RAG~[citation: asai2024selfrag] introduces retrieval-time self-reflection.
Vector databases like Weaviate~[citation: weaviate2023], Pinecone~[citation: pinecone2024], and Chroma~[citation: chroma2024] provide the retrieval backbone for RAG systems, offering approximate nearest-neighbor search over dense embeddings. These systems optimize for semantic similarity but provide no structural reasoning capability---they cannot follow relationship chains or enforce governance policies over what is retrieved.
The Graph Kernel is complementary to RAG, not competitive with it. Our evaluation (Section~[ref: sec:results]) shows that RAG achieves 0.65 relevance on fuzzy/semantic queries where the Graph Kernel scores 0.42, while the Graph Kernel achieves 1.00 on multi-hop structural queries where RAG drops to 0.40. The hybrid architecture (Section~[ref: sec:hybrid]) combines both modalities.
Graph-Enhanced RAG
Microsoft's GraphRAG~[citation: edge2024graphrag] represents the most prominent effort to combine graph structure with RAG. It uses LLM-driven entity and relationship extraction to construct a knowledge graph, then applies the Leiden community detection algorithm to identify hierarchical topic clusters. Queries are answered through either local search (entity-centric subgraph retrieval) or global search (community summary aggregation).
GraphRAG advances the state of the art in holistic corpus understanding---its community summaries can answer ``what is this dataset about?'' queries that neither flat RAG nor the Graph Kernel can address. However, GraphRAG does not provide deterministic reproducibility (its outputs depend on LLM behavior during both indexing and querying), has no concept of governance policies, and offers no cryptographic provenance chain. Its graph construction is also computationally expensive, requiring multiple LLM calls per document chunk.
LlamaIndex~[citation: llamaindex2024] and LangChain~[citation: langchain2024] provide knowledge graph integrations that wrap external graph stores (Neo4j, Nebula) in LLM-friendly interfaces. These are orchestration layers rather than graph engines; they inherit the limitations of their backend stores and add LLM-dependent extraction with no formal reproducibility guarantees.
Context Window Management
The challenge of managing context for LLMs has received increasing attention. Longformer~[citation: beltagy2020longformer] and BigBird~[citation: zaheer2020bigbird] extend attention mechanisms to handle longer sequences. Memory-augmented architectures like MemoryBank~[citation: zhong2024memorybank] and RMT~[citation: bulatov2022rmt] provide external memory stores for conversation continuity. Zep~[citation: zep2024] offers a purpose-built memory layer for LLM applications with automatic entity extraction and temporal awareness.
These systems optimize for the content of context windows---what information is included. The Graph Kernel optimizes for the governance of context windows---who authorized the inclusion, whether the window is reproducible, and whether downstream consumers can verify its provenance. This is a complementary concern; a production system may use Zep for memory management and the Graph Kernel for provenance tracking.
Provenance and Trust in AI Systems
Provenance tracking in data systems has a long history~[citation: buneman2001provenance,cheney2009provenance], but its application to AI agent context has received limited attention. The W3C PROV specification~[citation: moreau2013prov] provides a data model for provenance but lacks integration with LLM context windows. Blockchain-based provenance systems~[citation: liang2023blockchain] offer tamper-resistance but introduce latency and complexity inappropriate for real-time context slicing.
The Graph Kernel's HMAC-based provenance model is inspired by JSON Web Tokens (JWT)~[citation: jones2015jwt] but adapted for the specific requirements of context window authorization: binding six provenance fields (slice ID, anchor turn, policy ID, policy parameters hash, graph snapshot hash, schema version) into a single unforgeable token.
System Design
Architecture Overview
The Graph Kernel is implemented as a single Rust binary (${\sim}15$ KLOC) built on the Axum web framework~[citation: axum2024] with the Tokio async runtime~[citation: tokio2024]. It compiles to a statically-linked binary (${\sim}20$MB) deployable as a local service, Docker container, or cloud function (Google Cloud Run).
The architecture comprises three layers:
[Figure: Graph Kernel Service, three boxes connected left to right. API Layer (Axum; /api/slice, /api/verify, /api/knowledge) → Core Engine (ContextSlicer, PolicyRegistry, TokenAuthority, SnapshotHash) → Storage Layer (PostgreSQL; sqlx, pool 2..10).]
Caption: Graph Kernel three-layer architecture.
API Layer. Axum route handlers expose RESTful endpoints for context slicing (POST /api/slice, POST /api/slice/batch), token verification (POST /api/verify\_token), policy management (GET/POST /api/policies), knowledge graph CRUD (GET/POST /api/knowledge, POST /api/knowledge/batch), and health monitoring (/health/*).
Core Engine. The ContextSlicer implements a priority-queue BFS expansion over the conversation DAG. The PolicyRegistry maintains an immutable store of registered policies with hash-stable fingerprints. The TokenAuthority issues and verifies HMAC-SHA256 admissibility tokens. The SnapshotHash computes content-derived graph fingerprints for reproducibility.
Storage Layer. The GraphStore trait abstracts database access with two implementations: PostgresGraphStore for production (sqlx connection pool, 2--10 connections) and InMemoryGraphStore for testing.
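The priority-queue expansion performed by the ContextSlicer can be illustrated with a minimal Python sketch. It is not the Rust implementation: the `Policy` class, the `slice_context` function, the in-memory adjacency dictionary, and the scoring formula (phase weight multiplied by distance decay raised to the hop count) are all assumptions made for illustration, and salience and sibling expansion are omitted.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Policy:
    # Hypothetical stand-in for SlicePolicyV1; defaults mirror the paper's table.
    max_nodes: int = 256
    max_radius: int = 10
    distance_decay: float = 0.9
    phase_weights: dict = field(
        default_factory=lambda: {"S": 1.0, "P": 0.9, "C": 0.6, "D": 0.5, "E": 0.3}
    )


def slice_context(anchor, neighbors, phases, policy):
    """Return turn ids in selection order, highest priority first."""

    def score(turn, dist):
        # Assumed scoring rule: phase weight decayed once per hop.
        return policy.phase_weights.get(phases[turn], 0.0) * policy.distance_decay ** dist

    # heapq is a min-heap, so priorities are negated; the turn id breaks ties
    # so that equal scores always pop in the same order.
    frontier = [(-score(anchor, 0), anchor, 0)]
    visited = {anchor}
    selected = []
    while frontier and len(selected) < policy.max_nodes:
        _, turn, dist = heapq.heappop(frontier)
        if dist > policy.max_radius:
            continue
        selected.append(turn)
        for nxt in sorted(neighbors.get(turn, ())):
            if nxt not in visited:
                visited.add(nxt)
                heapq.heappush(frontier, (-score(nxt, dist + 1), nxt, dist + 1))
    return selected


# Toy conversation DAG: parents and children merged into one adjacency map.
neighbors = {"t1": ["t0", "t2"], "t2": ["t1", "t3"], "t3": ["t2"], "t0": ["t1"]}
phases = {"t0": "D", "t1": "S", "t2": "P", "t3": "S"}
print(slice_context("t1", neighbors, phases, Policy(max_nodes=3)))
# ['t1', 't2', 't3']
```

With a budget of three nodes, the debugging-phase turn `t0` is dropped in favor of the more distant synthesis-phase turn `t3`, which is the deprioritization behavior the policy is meant to express. Sorting neighbors and tie-breaking on turn id keeps the output identical across runs.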
Deterministic Context Slicing
The core algorithm is a policy-weighted BFS expansion from an anchor turn through the conversation DAG:
Algorithm: PolicyWeightedBFS
Input: anchor_id ∈ TurnId, policy ∈ SlicePolicyV1, store ∈ GraphStore
Output: bundle ∈ AdmissibleEvidenceBundle
  anchor ← store.get_turn(anchor_id)
  frontier ← MaxHeap<ExpansionCandidate>
  frontier.push(anchor, d=0, p=score(anchor, policy, 0))
  selected ← [ ]; visited ← {anchor_id}
  while frontier ≠ ∅ and |selected| < policy.max_nodes:
    c ← frontier.pop()
    if c.d > policy.max_radius:
      continue
    selected ← selected ∪ {c.turn}
    for each neighbor ∈ parents(c) ∪ children(c):
      if neighbor ∉ visited:
        frontier.push(neighbor, c.d+1, score(neighbor, policy, c.d+1))
        visited ← visited ∪ {neighbor}
    if policy.include_siblings:
      for each sib ∈ siblings(c, policy.max_siblings):
        frontier.push(sib, c.d, score(sib, policy, c.d))
  edges ← store.get_edges(selected)
  sh ← xxHash64(sort(content_hashes(selected)))
  sid ← xxHash64(anchor, sort(ids), sort(edges), pid, ph, sv)
  token ← HMAC-SHA256(secret, canonical(sid, anchor, pid, ph, sh, sv))[0:16]
  return AdmissibleEvidenceBundle(selected, edges, sid, token)
HMAC-Signed Admissibility Tokens
The admissibility token creates an unforgeable proof-of-authorization binding six provenance fields:
lstlisting
canonical = "{slice_id}|{anchor_turn_id}|
{policy_id}|{policy_params_hash}|
{graph_snapshot_hash}|{schema_version}|
admissibility_token_v2_hmac"
[sensitive field redacted], canonical)[0..16]
lstlisting
The token is a 128-bit (32 hex character) truncation of the full HMAC-SHA256 digest. Downstream services verify tokens via POST /api/verify\_token without accessing the HMAC secret. Verification uses constant-time comparison to prevent timing attacks.
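As a minimal sketch of the token construction described above, the following Python uses only the standard `hmac` and `hashlib` modules. The function names, the example field values, and the choice to join the six fields with `|` on a single line (with no embedded newlines) are assumptions; the exact canonical byte layout used by the Graph Kernel is not specified here.

```python
import hashlib
import hmac

DOMAIN_TAG = "admissibility_token_v2_hmac"  # suffix taken from the canonical string


def canonical_string(slice_id, anchor_turn_id, policy_id,
                     policy_params_hash, graph_snapshot_hash, schema_version):
    # Assumed layout: six fields plus the domain tag, separated by '|'.
    return "|".join([slice_id, anchor_turn_id, policy_id, policy_params_hash,
                     graph_snapshot_hash, schema_version, DOMAIN_TAG])


def issue_token(secret: bytes, fields: tuple) -> str:
    digest = hmac.new(secret, canonical_string(*fields).encode("utf-8"),
                      hashlib.sha256).digest()
    return digest[:16].hex()  # first 16 bytes = 128 bits = 32 hex characters


def verify_token(secret: bytes, fields: tuple, token: str) -> bool:
    # compare_digest runs in time independent of where the strings differ.
    return hmac.compare_digest(issue_token(secret, fields), token)


secret = b"example-secret-for-illustration-only"
fields = ("slice-01", "turn-42", "slice_policy_v1", "ph-abc", "sh-def", "1")
token = issue_token(secret, fields)
print(len(token), verify_token(secret, fields, token))  # 32 True
```

In this sketch `verify_token` needs the secret, so it models what the verification endpoint does on the server side; a downstream client would send the fields and token to that endpoint rather than recompute the HMAC itself.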
We enforce a critical invariant at the type level:
Invariant INV-GK-003 (No Phantom Authority). The AdmissibleEvidenceBundle type can only be constructed through the from\_verified() pathway, which requires a valid HMAC computation. Unverified context windows are unrepresentable in the type system.
This prevents a class of bugs where context windows bypass the authorization pathway through careless refactoring or missing validation checks.
Policy Governance Framework
Context expansion is parameterized by SlicePolicyV1:
Caption: SlicePolicyV1 Parameters
| Parameter | Default | Description |
|---|---|---|
| `max_nodes` | 256 | Max turns in a context slice |
| `max_radius` | 10 | Max graph hops from anchor |
| `phase_weights` | S=1.0, P=0.9, C=0.6, D=0.5, E=0.3 | Phase importance |
| `salience_weight` | 0.3 | Salience contribution |
| `distance_decay` | 0.9 | Priority decay per hop |
| `include_siblings` | true | Expand to sibling turns |
| `max_siblings` | 5 | Sibling expansion limit |
Policies are registered in an immutable PolicyRegistry with deterministic fingerprints. Policy parameter hashes use quantized floats (multiply by $10^6$, round to i64) to ensure cross-platform determinism between Rust and Python clients. The registry itself has a fingerprint (xxHash64 of sorted policy hashes) enabling clients to detect policy changes.
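The quantization step can be sketched in Python as follows. This is an illustration under stated assumptions: the `key=value` canonical encoding and the function names are hypothetical, and `hashlib.blake2b` with an 8-byte digest stands in for xxHash64, which is not part of the Python standard library.

```python
import hashlib


def quantize(x: float) -> int:
    # Multiply by 10^6 and round to an integer, as described in the text.
    # Note: Python rounds exact halves to even, whereas Rust's f64::round
    # rounds them away from zero; inputs with at most six decimals avoid ties.
    return round(x * 1_000_000)


def policy_params_hash(params: dict) -> str:
    parts = []
    for key in sorted(params):  # sorted keys give a stable field order
        value = params[key]
        if isinstance(value, bool):  # bool must be tested before numeric types
            parts.append(f"{key}={int(value)}")
        elif isinstance(value, float):
            parts.append(f"{key}={quantize(value)}")
        elif isinstance(value, dict):
            inner = ",".join(f"{k}:{quantize(v)}" for k, v in sorted(value.items()))
            parts.append(f"{key}={{{inner}}}")
        else:
            parts.append(f"{key}={value}")
    canonical = ";".join(parts).encode("utf-8")
    return hashlib.blake2b(canonical, digest_size=8).hexdigest()  # 64-bit stand-in


def registry_fingerprint(policy_hashes) -> str:
    joined = ",".join(sorted(policy_hashes)).encode("utf-8")
    return hashlib.blake2b(joined, digest_size=8).hexdigest()


default_policy = {
    "max_nodes": 256, "max_radius": 10, "salience_weight": 0.3,
    "distance_decay": 0.9, "include_siblings": True, "max_siblings": 5,
    "phase_weights": {"S": 1.0, "P": 0.9, "C": 0.6, "D": 0.5, "E": 0.3},
}
print(quantize(0.9), policy_params_hash(default_policy))
```

Hashing integers rather than floating-point text means that two clients whose float formatting differs in the last digits still agree on the hash, provided they agree on the rounding rule for ties.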
Hybrid Architecture: Structural + Semantic
The Graph Kernel's primary limitation is the absence of semantic search (Section~[ref: sec:fuzzy]). To address this, we design a hybrid retrieval architecture that bridges structural graph reasoning with vector-similarity search:
[Figure: RAG++ (Semantic: vector embedding → similarity → top-K turns) and Graph Kernel (Structural: subject/predicate → exact triples) both feed Merge & Rank (deduplicate, cross-enrich, provenance-tag), which outputs the Enriched Result Set.]
Caption: Hybrid retrieval architecture combining semantic and structural search with provenance tagging.
The enrichment endpoint (POST /api/enrich) accepts RAG++ results and augments them with graph context: entities found in result text are resolved to knowledge graph subjects, their 1--2 hop neighborhoods are traversed, and the combined result set preserves provenance metadata from both sources. This transforms flat vector similarity into structured, provenance-tagged reasoning.
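The enrichment flow can be sketched in Python over in-memory data. Everything here is hypothetical: the `enrich` function, the hit and fact dictionary shapes, the provenance labels, and the use of case-insensitive substring matching for entity resolution are assumptions for illustration and do not describe the actual request or response format of the endpoint.

```python
def enrich(rag_hits, triples, max_hops=2):
    """Attach graph context to RAG hits; names and fields are illustrative."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s.lower(), []).append((s, p, o))

    enriched, seen_turns = [], set()
    for hit in rag_hits:
        if hit["turn_id"] in seen_turns:  # deduplicate repeated turns
            continue
        seen_turns.add(hit["turn_id"])

        # Resolve entities: any known subject mentioned in the hit text.
        text = hit["text"].lower()
        frontier = [subj for subj in by_subject if subj in text]
        visited, facts = set(frontier), []
        for _ in range(max_hops):  # walk the 1..max_hops neighborhood
            next_frontier = []
            for subj in frontier:
                for s, p, o in by_subject.get(subj, []):
                    facts.append({"triple": (s, p, o), "provenance": "graph_kernel"})
                    if o.lower() not in visited:
                        visited.add(o.lower())
                        next_frontier.append(o.lower())
            frontier = next_frontier
        # Tag the original hit with its own source and attach the graph facts.
        enriched.append({**hit, "provenance": "rag++", "graph_context": facts})
    return enriched


triples = [("Mohamed", "works_on", "clawdbot"),
           ("clawdbot", "uses", "Gemini batch API")]
hits = [{"turn_id": "t1", "text": "Mohamed pushed a fix"},
        {"turn_id": "t1", "text": "Mohamed pushed a fix"}]
for item in enrich(hits, triples):
    print(item["turn_id"], [f["triple"] for f in item["graph_context"]])
```

The duplicate hit is dropped, and the surviving hit carries both triples of the two-hop chain, each labeled with the source it came from.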
Knowledge Graph Triple Store
The secondary function of the Graph Kernel is a knowledge graph triple store with the schema:
lstlisting[language=SQL]
CREATE TABLE knowledge_graph (
id BIGSERIAL PRIMARY KEY,
subject TEXT NOT NULL,
predicate TEXT NOT NULL,
object TEXT NOT NULL,
confidence DOUBLE PRECISION DEFAULT 0.5,
source TEXT DEFAULT 'unknown',
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(subject, predicate, object)
);
lstlisting
Each triple (subject, predicate, object) represents a factual assertion with an associated confidence score (0.0--1.0) and source provenance. On conflict (duplicate SPO), the system retains the higher confidence value and updates the source. The evaluated corpus contains 3,502 triples across 221 unique subjects and 88 unique predicates, sourced from LLM-driven conversation extraction (Kimi-K2) and topology ingestion pipelines.
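The conflict rule can be illustrated with a small Python sketch that uses the standard `sqlite3` module as a stand-in for PostgreSQL. The upsert statement is an assumption about how the rule might be expressed, not the system's actual query; in PostgreSQL the two-argument `MAX` would be written `GREATEST`. The sketch also assumes the source is overwritten on every conflict, which is one reading of "updates the source".

```python
import sqlite3

# SQLite adaptation of the schema above (created_at omitted for brevity).
SCHEMA = """
CREATE TABLE knowledge_graph (
    id INTEGER PRIMARY KEY,
    subject TEXT NOT NULL,
    predicate TEXT NOT NULL,
    object TEXT NOT NULL,
    confidence REAL DEFAULT 0.5,
    source TEXT DEFAULT 'unknown',
    UNIQUE(subject, predicate, object)
)
"""

# On a duplicate (subject, predicate, object): keep the higher confidence
# and take the incoming source.
UPSERT = """
INSERT INTO knowledge_graph (subject, predicate, object, confidence, source)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT(subject, predicate, object) DO UPDATE SET
    confidence = MAX(confidence, excluded.confidence),
    source = excluded.source
"""


def upsert_triple(conn, subject, predicate, obj, confidence, source):
    conn.execute(UPSERT, (subject, predicate, obj, confidence, source))


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
upsert_triple(conn, "clawdbot", "deploys_to", "GCP", 0.9, "extraction")
upsert_triple(conn, "clawdbot", "deploys_to", "GCP", 0.4, "topology")
rows = conn.execute("SELECT confidence, source FROM knowledge_graph").fetchall()
print(rows)  # [(0.9, 'topology')]
```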
Evaluation Methodology
Test Design
We evaluated four retrieval methods across 27 queries in five categories:
Caption: Query Categories
| Category | $n$ | What It Tests |
|---|---|---|
| Factual Recall | 6 | Direct attribute lookups |
| Relationship | 6 | Dependency/integration mapping |
| Multi-hop | 5 | 2-hop graph traversal |
| Fuzzy/Semantic | 5 | Loose topic matching |
| Predicate-specific | 5 | Structured predicate filters |
Queries were designed to exercise different retrieval capabilities: factual recall tests exact match precision; relationship queries test structured predicate filtering; multi-hop queries test graph traversal; fuzzy/semantic queries test paraphrase understanding; and predicate-specific queries test schema-aware filtering.
Methods Under Test
Caption: Retrieval Methods
| Method | Corpus | Mechanism | Deploy |
|---|---|---|---|
| Graph Kernel | 2,681 triples | REST API | Rust |
| Keyword | 2,681 triples | Substring match | Python |
| BM25 | 2,681 triples | Okapi BM25 | Python |
| RAG++ | 107K+ turns | Vector similarity | FastAPI |
Important caveat. The Graph Kernel, keyword, and BM25 methods operate on the same triple corpus (2,681 structured triples extracted from conversations). RAG++ operates on a fundamentally different and much larger corpus (107K+ raw conversation turns with embeddings). Direct comparison between the triple-based methods and RAG++ is informational, not apples-to-apples.
Metrics
We measure three metrics per query:
- Response Time (ms). Wall-clock latency including all network round-trips. For multi-hop queries, this includes sequential HTTP calls (one per hop).
- Result Count. Number of results returned per query.
- Relevance Score (0--1). Fraction of expected terms found in results. This metric captures whether the expected information is present but does not capture the structural quality of results.
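The relevance score can be sketched in a few lines of Python. The function name and the choice of case-insensitive substring matching are assumptions; the matching rule used in the evaluation harness is not specified beyond "fraction of expected terms found in results".

```python
def relevance(expected_terms, results):
    """Fraction of expected terms that appear anywhere in the result texts."""
    if not expected_terms:
        return 0.0
    haystack = " ".join(results).lower()
    found = sum(1 for term in expected_terms if term.lower() in haystack)
    return found / len(expected_terms)


results = ["clawdbot deploys_to Google Cloud Platform",
           "clawdbot runs_on Cloud Run"]
print(relevance(["GCP", "Cloud Run"], results))  # 0.5
```

The example also shows why the metric is sensitive to naming: the stored object "Google Cloud Platform" does not contain the expected term "GCP", so a correct result is scored as a miss.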
Environment
Caption: Evaluation Environment
| Component | Specification |
|---|---|
| Machine | MacBook Air (Apple M3, arm64) |
| OS | Darwin 24.6.0 |
| Graph Kernel | Rust binary, v0.1.0 |
| RAG++ | Python FastAPI, v0.1.0 |
| Database | Supabase PostgreSQL (us-east-1) |
| Network | Home broadband (${\sim}200$ms RTT) |
Results
Factual Recall
Caption: Factual Recall Results
| Method | Avg Lat. (ms) | Avg Results | Avg Rel. |
|---|---|---|---|
| Graph Kernel | 248.3 | 3.7 | 1.00 |
| Keyword | 2.7 | 20.0 | 1.00 |
| BM25 | 9.0 | 18.2 | 1.00 |
| RAG++ | 421.9 | 10.0 | 0.92 |
All triple-based methods achieve perfect relevance on factual recall, confirming that the knowledge graph corpus contains the expected information. The Graph Kernel returns precisely scoped results (3.7 average) compared to keyword's broader 20.0, reflecting exact field matching versus substring coincidence. RAG++ achieves 0.92 relevance due to occasional mismatches between conversation turn text and expected entity names. The latency differential between Graph Kernel (248.3ms) and keyword (2.7ms) is attributable to network RTT to the remote PostgreSQL instance.
Relationship Queries
Caption: Relationship Query Results
| Method | Avg Lat. (ms) | Avg Results | Avg Rel. |
|---|---|---|---|
| Graph Kernel | 204.3 | 9.5 | 0.94 |
| Keyword | 2.8 | 19.3 | 1.00 |
| BM25 | 8.7 | 12.3 | 1.00 |
| RAG++ | 336.4 | 10.0 | 0.69 |
The Graph Kernel's 0.94 relevance on relationship queries is attributable to a single entity normalization failure: ``GCP'' not matching ``Google Cloud Platform'' in deploys\_to results. With proper entity normalization (Section~[ref: sec:limitations]), this would be 1.00. RAG++ drops to 0.69, as vector similarity was not designed to express structured relationships like ``A depends on B.''
Multi-hop Reasoning
Caption: Multi-hop Reasoning Results
| Method | Avg Lat. (ms) | Avg Results | Avg Rel. |
|---|---|---|---|
| Graph Kernel | 586.6 | 7.6 | 1.00 |
| Keyword | 3.3 | 20.0 | 1.00 |
| BM25 | 9.2 | 18.8 | 1.00 |
| RAG++ | 348.1 | 10.0 | 0.40 |
Multi-hop reasoning is the Graph Kernel's distinguishing capability. While keyword and BM25 achieve identical 1.00 relevance scores, the nature of their results is fundamentally different:
- Graph Kernel returns 7.6 structurally connected results forming verified relationship chains (e.g., Mohamed $\rightarrow$ works\_on $\rightarrow$ clawdbot $\rightarrow$ uses $\rightarrow$ Gemini batch API). Each result is causally linked through graph edges.
- Keyword returns 20.0 coincidence results---documents that happen to contain matching substrings but with no concept of why the terms co-occur.
The relevance metric (Section~[ref: sec:methodology]) masks this critical quality difference. In production, the Graph Kernel's causal chains enable provenance-tracked reasoning; keyword coincidence does not. RAG++ drops to 0.40 because conversation turns rarely contain multi-hop relationship chains in a single text span.
The Graph Kernel's higher latency (586.6ms) on multi-hop queries results from sequential HTTP round-trips: a 2-hop query requires 3 HTTP calls $\times$ ${\sim}200$ms RTT. A server-side traversal endpoint (Section~[ref: sec:future]) would reduce this to a single call.
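The difference between a structural chain and a keyword coincidence, and the latency arithmetic above, can be sketched in Python. The `two_hop` function and the in-memory triple list are illustrative assumptions; the real system issues one HTTP request per lookup instead of scanning a list.

```python
def two_hop(triples, subject, first_predicate, second_predicate):
    """Follow subject -[first_predicate]-> x -[second_predicate]-> y."""
    chains = []
    for s1, p1, o1 in triples:
        if s1 == subject and p1 == first_predicate:
            for s2, p2, o2 in triples:
                # The second hop must start where the first hop ended.
                if s2 == o1 and p2 == second_predicate:
                    chains.append((s1, p1, o1, p2, o2))
    return chains


triples = [
    ("Mohamed", "works_on", "clawdbot"),
    ("clawdbot", "uses", "Gemini batch API"),
    ("other-service", "uses", "Gemini batch API"),  # matches keywords, not the chain
]
print(two_hop(triples, "Mohamed", "works_on", "uses"))
# [('Mohamed', 'works_on', 'clawdbot', 'uses', 'Gemini batch API')]

# Latency estimate from the text: 3 sequential HTTP calls at roughly 200 ms RTT.
http_calls, rtt_ms = 3, 200
print(http_calls * rtt_ms)  # 600
```

The third triple would be returned by a keyword search for "uses" or "Gemini", but it is excluded here because no edge connects it to the starting subject. The 600 ms estimate is close to the measured 586.6 ms average.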
Fuzzy/Semantic Search
Caption: Fuzzy/Semantic Search Results
| Method | Avg Lat. (ms) | Avg Results | Avg Rel. |
|---|---|---|---|
| Graph Kernel | 215.2 | 19.8 | 0.42 |
| Keyword | 2.0 | 16.0 | 0.80 |
| BM25 | 6.1 | 7.6 | 0.53 |
| RAG++ | 484.0 | 10.0 | 0.65 |
Fuzzy/semantic search is the Graph Kernel's weakest category, confirming the expected limitation of a system without embedding-based similarity. Searching for ``music'' will not find triples about ``audio production'' or ``sound design.'' This is the primary motivation for the hybrid architecture (Section~[ref: sec:hybrid]). RAG++ (0.65) outperforms the Graph Kernel (0.42) and BM25 (0.53) on semantic queries due to its vector-similarity backbone, but it trails keyword substring search (0.80). That even RAG++ reaches only 0.65 suggests that the query set includes genuinely difficult semantic matching tasks.
Predicate-Specific Queries
Caption: Predicate-Specific Query Results
| Method | Avg Lat. (ms) | Avg Results | Avg Rel. |
|---|---|---|---|
| Graph Kernel | 230.1 | 16.0 | 0.80 |
| Keyword | 3.3 | 20.0 | 1.00 |
| BM25 | 9.5 | 20.0 | 1.00 |
| RAG++ | 460.8 | 10.0 | 0.80 |
The Graph Kernel should excel on predicate-specific queries (exact predicate filters are a native capability), but entity normalization failures suppress relevance. One query for ``Dream Weaver'' files returned 0 results due to a capitalization/alias mismatch with the stored subject ``dream-weaver-engine.'' This is a data quality issue, not an algorithmic limitation.
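A minimal sketch of the kind of entity normalization that would avoid both failures is shown below. The alias table and function name are hypothetical, and the two entries are taken from the mismatches reported in this section; a production alias table would need to be curated or learned.

```python
import re

# Hypothetical alias table mapping surface forms to one canonical name.
ALIASES = {
    "gcp": "google cloud platform",
    "dream weaver": "dream-weaver-engine",
}


def normalize_entity(name: str) -> str:
    # Lowercase, trim, and collapse internal whitespace before alias lookup.
    key = re.sub(r"\s+", " ", name.strip().lower())
    return ALIASES.get(key, key)


print(normalize_entity("GCP") == normalize_entity("Google Cloud Platform"))  # True
print(normalize_entity("Dream Weaver"))  # dream-weaver-engine
```

Applying the same function at ingestion time and at query time means both sides of a comparison pass through one canonical form, so the lookup no longer depends on capitalization or on which alias the extractor happened to emit.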
Overall Summary
Caption: Overall Results Summary
| Method | Avg Lat. (ms) | Avg Results | Avg Rel. | Lat. Rank | Rel. Rank |
|---|---|---|---|---|---|
| Keyword | 2.8 | 19.1 | 0.96 | 1 | 1 |
| BM25 | 8.5 | 15.4 | 0.91 | 2 | 2 |
| Graph Kernel | 291.7 | 11.0 | 0.84 | 3 | 3 |
| RAG++ | 407.9 | 10.0 | 0.70 | 4 | 4 |
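The overall figures are consistent with a query-weighted mean of the per-category results. The short Python check below reproduces the Graph Kernel row from the category tables; the dictionary keys and the `weighted_mean` helper are illustrative names, and the assumption is that each category is weighted by its number of queries.

```python
# Number of queries per category (27 in total).
counts = {"factual": 6, "relationship": 6, "multi_hop": 5, "fuzzy": 5, "predicate": 5}

# Graph Kernel per-category averages from the result tables.
gk_latency_ms = {"factual": 248.3, "relationship": 204.3, "multi_hop": 586.6,
                 "fuzzy": 215.2, "predicate": 230.1}
gk_relevance = {"factual": 1.00, "relationship": 0.94, "multi_hop": 1.00,
                "fuzzy": 0.42, "predicate": 0.80}


def weighted_mean(values, weights):
    total = sum(weights.values())
    return sum(values[k] * weights[k] for k in weights) / total


print(round(weighted_mean(gk_latency_ms, counts), 1))  # 291.7
print(round(weighted_mean(gk_relevance, counts), 2))   # 0.84
```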
Latency Decomposition
Caption: Latency Decomposition
Critical insight. The Graph Kernel is compute-efficient. Its latency problem is an architecture choice (remote Supabase), not a fundamental limitation. A latency decomposition reveals that 90\
Comparative Analysis
We compare the Graph Kernel against nine industry-grade alternatives across seven dimensions: context slicing, provenance, multi-hop traversal, semantic search, scale, deployment complexity, and cost.
Comparison Matrix
Caption: Comparative Analysis: Graph Kernel vs. Industry Alternatives
| System | Ctx Slice | Provenance | Multi-hop | Semantic | Scale | Deploy | Cost |
|---|---|---|---|---|---|---|---|
| Graph Kernel | ✅ Native | ✅ HMAC | ✅ | --- | Small | Single binary | Free |
| Neo4j [neo4j2024] | --- Custom | --- | ✅✅ | --- | Large | JVM | Free/$$$ |
| Amazon Neptune [neptune2024] | --- | --- | ✅✅ | --- | Massive | AWS managed | $$$ |
| Apache Jena [jena2024] | --- | Partial | ✅ | --- | Medium | JVM | Free |
| Dgraph [dgraph2024] | --- | --- | ✅✅ | ✅ | Massive | Distributed | Free/$$ |
| TypeDB [typedb2024] | --- | --- | ✅✅ | --- | Large | JVM | Free |
| Weaviate [weaviate2023] | --- | --- | Limited | ✅✅ | Large | Go binary | Free/$$ |
| LC/LI KGs [llamaindex2024,langchain2024] | --- | --- | ✅ | ✅ | Varies | Python | Free+LLM |
| GraphRAG [edge2024graphrag] | --- | --- | ✅ | ✅✅ | Medium | Python | Free+LLM |
| Zep [zep2024] | --- | --- | ✅ | ✅ | Medium | Go/Cloud | Free/$$ |
Key Findings
No existing system provides deterministic context slicing. Across all nine alternatives, none offers native, policy-governed context window construction with cryptographic provenance. This is the Graph Kernel's distinguishing contribution. To replicate this functionality with Neo4j, for example, one would need to build a custom application layer implementing BFS expansion with priority scoring, HMAC token issuance, deterministic fingerprinting, and policy registry management, which amounts to re-implementing the Graph Kernel on top of a more complex substrate.
General-purpose graph databases offer superior query expressiveness and scale. Neo4j (Cypher), Amazon Neptune (SPARQL/Gremlin/openCypher), Apache Jena (SPARQL), and TypeDB (TypeQL) all provide significantly more expressive query languages than the Graph Kernel's REST field filters. For applications requiring complex graph pattern matching, recursive path queries, or billion-scale traversal, these systems are superior. The Graph Kernel does not compete on this axis.
Vector databases address the Graph Kernel's primary weakness. Weaviate's hybrid search (vector + BM25 + filters) and GraphRAG's community-level semantic understanding directly address the 0.42 fuzzy/semantic relevance that is the Graph Kernel's weakest dimension. The hybrid architecture (Section~[ref: sec:hybrid]) is designed to bridge this gap.
Memory-oriented systems (Zep) optimize for developer experience at the cost of determinism. Zep provides automatic entity extraction, temporal memory decay, and drop-in integration with LLM applications. However, it offers no formal reproducibility guarantees---the same query at different times may return different results due to memory decay and re-summarization.
LLM-dependent systems (GraphRAG, LangChain/LlamaIndex KGs) are non-deterministic by construction. Both graph construction and query answering depend on LLM behavior, which varies across model versions, temperature settings, and even inference infrastructure. The Graph Kernel's determinism guarantee (same input, same output) cannot be met by a system that keeps an LLM in the retrieval loop.
Positioning
The Graph Kernel does not compete with Neo4j on query expressiveness, Neptune on scale, Weaviate on semantic search, or GraphRAG on holistic corpus understanding. It occupies a distinct niche:
\begin{quote}
The Graph Kernel is a provenance engine---its value proposition is not information retrieval but the construction of verifiable, reproducible evidence bundles for autonomous agent systems.
\end{quote}
This positioning implies that the Graph Kernel is most valuable as a layer in a stack rather than a standalone system. In the OpenClaw CompCore architecture, it serves as the context authority layer between raw conversation storage (PostgreSQL) and agent reasoning (LLM inference), providing the trust boundary that ensures every downstream decision can be traced to a specific, authorized context window.
Discussion
The Provenance Gap
Our evaluation reveals a significant gap in the current AI infrastructure landscape: no widely-deployed system provides deterministic, cryptographically-verifiable context windows for autonomous agents. This is not because the problem is unrecognized---the AI safety community has extensively discussed the need for interpretable, auditable AI systems~[citation: amodei2016concrete,weidinger2021ethical]---but because the infrastructure to support these properties has not been built.
The Graph Kernel addresses this gap by making provenance a first-class architectural concern rather than an afterthought. The AdmissibleEvidenceBundle type, the HMAC-signed admissibility token, and the policy governance framework together ensure that every context window carries a complete provenance chain: what was included, why it was included (policy), and proof that the inclusion was authorized (token).
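The three parts of the provenance chain (what, why, proof) can be illustrated with a minimal sketch. It assumes HMAC-SHA256 over a canonical JSON encoding; the field names, the helpers `issue_token` and `verify_token`, and the demo key are hypothetical and are not taken from the Graph Kernel source.

```python
import hashlib
import hmac
import json

# Hypothetical key for illustration only; a real deployment would load it securely.
SECRET = b"demo-signing-secret"


def canonical(bundle: dict) -> bytes:
    """Serialize with sorted keys and fixed separators so equal bundles give equal bytes."""
    return json.dumps(bundle, sort_keys=True, separators=(",", ":")).encode("utf-8")


def issue_token(bundle: dict, secret: bytes = SECRET) -> str:
    """Sign the canonical bundle bytes with HMAC-SHA256 and return a hex digest."""
    return hmac.new(secret, canonical(bundle), hashlib.sha256).hexdigest()


def verify_token(bundle: dict, token: str, secret: bytes = SECRET) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(issue_token(bundle, secret), token)


bundle = {
    "included": ["turn-17", "turn-18"],  # what was included
    "policy": "default-v1",              # why it was included
}
token = issue_token(bundle)              # proof that the inclusion was authorized

# Changing the included set invalidates the token.
tampered = dict(bundle, included=["turn-17", "turn-99"])
```

In this sketch any change to the included turns or to the policy identifier changes the signed bytes, so a token issued for one bundle does not verify against another.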
Limitations
Entity normalization. The evaluated corpus exhibits a 22\
No semantic search. The Graph Kernel achieves only 0.42 average relevance on fuzzy/semantic queries, confirming that exact field matching cannot substitute for embedding-based similarity. This is a fundamental limitation of the structural approach and motivates the hybrid architecture (Section~[ref: sec:hybrid]).
Client-side multi-hop. Multi-hop queries currently require sequential HTTP round-trips ($N$ hops = $N{+}1$ calls $\times$ ${\sim}200$ms RTT), resulting in 586.6ms average latency for 2-hop queries. A server-side traversal endpoint would collapse this to a single call, reducing latency to ${\sim}220$ms (remote) or ${\sim}15$ms (local).
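The round-trip arithmetic above can be written as a small latency model. This is a sketch only: the function name and the `overhead_ms` parameter are assumptions introduced here, and the 20 ms server-side traversal overhead is chosen to match the ${\sim}220$ms projection rather than taken from a measurement.

```python
def multi_hop_latency_ms(hops: int, rtt_ms: float,
                         server_side: bool = False,
                         overhead_ms: float = 0.0) -> float:
    """Estimate query latency from the number of HTTP round-trips.

    Client-side traversal needs one call per hop plus the initial lookup
    (hops + 1 calls); a server-side endpoint needs a single call.
    """
    calls = 1 if server_side else hops + 1
    return calls * rtt_ms + overhead_ms


# 2 hops at ~200 ms RTT: three sequential calls on the client side.
client = multi_hop_latency_ms(hops=2, rtt_ms=200)

# One call plus an assumed 20 ms of in-database traversal work.
server = multi_hop_latency_ms(hops=2, rtt_ms=200, server_side=True, overhead_ms=20)
```

The model gives 600 ms for the client-side case, close to the measured 586.6 ms average, and 220 ms for the single-call case.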
Scale. The evaluated corpus contains 3,502 triples across 221 subjects. The Graph Kernel has not been tested at scales beyond tens of thousands of triples. For billion-scale graph applications, purpose-built distributed graph databases (Neo4j, Neptune, Dgraph) remain necessary.
Relevance metric limitations. Our relevance metric (fraction of expected terms found) does not capture the structural quality of results. As demonstrated in Section~[ref: sec:results], keyword search and Graph Kernel can achieve identical 1.00 relevance scores while producing qualitatively different results (coincidence piles vs. causal chains). Future evaluations should incorporate structural metrics such as chain validity and provenance completeness.
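To make the limitation concrete, the following sketch implements a term-overlap score of the kind described (fraction of expected terms found) and applies it to two illustrative result sets. The function name, the case-insensitive substring matching, and the sample strings are assumptions made for illustration; the benchmark's actual matching rules may differ.

```python
def term_relevance(expected_terms: list[str], result_text: str) -> float:
    """Fraction of expected terms that appear (case-insensitively) in the result text."""
    if not expected_terms:
        return 0.0
    text = result_text.lower()
    found = sum(1 for term in expected_terms if term.lower() in text)
    return found / len(expected_terms)


expected = ["graph-kernel", "postgresql", "supabase"]

# A connected chain: each statement's object is the next statement's subject.
chain = "graph-kernel depends_on postgresql; postgresql hosted_on supabase"

# A coincidence pile: the same terms appear, but nothing links them.
pile = "supabase pricing notes. graph-kernel readme. postgresql tuning tips"

chain_score = term_relevance(expected, chain)
pile_score = term_relevance(expected, pile)
```

Both inputs score 1.0 even though only the first encodes a traversable relationship, which is why a term-overlap score alone cannot distinguish the two.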
Broader Implications
The provenance engine concept has implications beyond the specific Graph Kernel implementation:
- Regulatory compliance. As AI regulation evolves (EU AI Act, NIST AI RMF), the ability to audit which information influenced an agent's decision will become a compliance requirement, not an optional feature.
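- Multi-agent trust. In systems where multiple agents collaborate, provenance tokens enable trust delegation: Agent A can verify that Agent B's context window was authorized by a shared policy authority without holding the authority's signing secret itself. Because HMAC is a symmetric scheme, this requires Agent A to submit the token to the authority (or to a verification service that holds the key) rather than check it locally.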
- Deterministic debugging. When an agent produces an unexpected output, deterministic context slicing enables exact reproduction of the input conditions---same anchor, same policy, same graph state yields the same context window.
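The reproducibility property in the last item can be sketched as a fingerprint over the slice inputs. This is an illustrative construction, assuming SHA-256 over a canonical JSON payload with the triples sorted; the function `slice_fingerprint` and its payload layout are hypothetical and may not match the Graph Kernel's own fingerprinting.

```python
import hashlib
import json


def slice_fingerprint(anchor: str, policy_id: str,
                      triples: list[tuple[str, str, str]]) -> str:
    """Hash the anchor, policy, and graph state in an order-independent way."""
    payload = {
        "anchor": anchor,
        "policy": policy_id,
        # Sorting removes any dependence on storage or retrieval order.
        "triples": sorted(list(t) for t in triples),
    }
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


graph = [
    ("graph-kernel", "depends_on", "postgresql"),
    ("postgresql", "hosted_on", "supabase"),
]

first = slice_fingerprint("turn-17", "default-v1", graph)
replay = slice_fingerprint("turn-17", "default-v1", list(reversed(graph)))
other_policy = slice_fingerprint("turn-17", "strict-v2", graph)
```

Replaying the same anchor, policy, and graph state yields the same digest regardless of row order, while changing the policy yields a different one, so a stored fingerprint can confirm that a debugging session reproduced the original inputs.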
Future Work
Server-Side Traversal API
The highest-priority improvement is a server-side traversal endpoint (POST /api/knowledge/traverse) that executes multi-hop BFS within a single database transaction. This collapses the $N{+}1$ sequential round-trips of an $N$-hop query into a single call, reducing 2-hop latency from ${\sim}600$ms to ${\sim}220$ms (remote) or ${\sim}15$ms (local).
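A rough sketch of the traversal such an endpoint could run is shown below. It operates on an in-memory list of triples rather than a database transaction, and the function name `traverse`, the return shape, and the sample data are assumptions for illustration only.

```python
from collections import deque


def traverse(triples: list[tuple[str, str, str]], start: str,
             max_hops: int) -> list[tuple[str, str, str, int]]:
    """Breadth-first expansion from `start`, returning (subject, predicate, object, hop)."""
    # Index outgoing edges by subject; sort them so the output order is deterministic.
    adjacency: dict[str, list[tuple[str, str]]] = {}
    for subject, predicate, obj in triples:
        adjacency.setdefault(subject, []).append((predicate, obj))
    for edges in adjacency.values():
        edges.sort()

    visited = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for predicate, obj in adjacency.get(node, []):
            found.append((node, predicate, obj, depth + 1))
            if obj not in visited:
                visited.add(obj)
                queue.append((obj, depth + 1))
    return found


triples = [
    ("graph-kernel", "depends_on", "postgresql"),
    ("postgresql", "hosted_on", "supabase"),
    ("supabase", "located_in", "cloud"),
]
two_hop = traverse(triples, "graph-kernel", max_hops=2)
```

With `max_hops=2` the sketch returns the first two edges of the chain and stops before the third, all within one function call rather than one HTTP request per hop.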
Entity Normalization Pipeline
A canonical entity resolution system with alias tables and fuzzy matching at both ingestion and query time. Projected impact: overall relevance improvement from 0.84 to 0.92+.
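One way such a pipeline might resolve names is sketched below, assuming a hand-maintained alias table with a fuzzy fallback based on the standard-library `difflib`. The alias entries, the 0.8 cutoff, and the function `normalize_entity` are hypothetical choices for illustration.

```python
import difflib

# Hypothetical alias table mapping surface forms to canonical entity names.
ALIASES = {
    "gk": "graph-kernel",
    "graph kernel": "graph-kernel",
    "postgres": "postgresql",
}
CANONICAL = sorted(set(ALIASES.values()))


def normalize_entity(name: str, cutoff: float = 0.8) -> str:
    """Resolve a surface form to a canonical name: exact alias first, then fuzzy match."""
    key = " ".join(name.lower().split())  # lowercase and collapse whitespace
    if key in ALIASES:
        return ALIASES[key]
    if key in CANONICAL:
        return key
    match = difflib.get_close_matches(key, CANONICAL, n=1, cutoff=cutoff)
    # Unknown names pass through unchanged rather than being forced onto a neighbor.
    return match[0] if match else key


resolved = [normalize_entity(n) for n in ["Graph  Kernel", "Postgres", "graph_kernel", "weaviate"]]
```

Applying the same function at ingestion and at query time would keep both sides of a lookup on the same canonical form, which is the property the proposed pipeline relies on.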
RAG++ Integration Bridge
The POST /api/enrich endpoint that combines vector-similarity results with graph context, addressing the 0.42 fuzzy/semantic relevance while preserving provenance guarantees. This bridge transforms RAG from flat similarity search into structured, provenance-tagged reasoning.
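The combination step might look roughly like the following sketch, which attaches one-hop graph context to each vector hit and records where each part of the answer came from. The function `enrich`, the result fields, and the scores are assumptions for illustration and do not describe the actual endpoint's request or response format.

```python
def enrich(vector_hits: list[tuple[str, float]],
           triples: list[tuple[str, str, str]]) -> list[dict]:
    """Attach one-hop graph context and a provenance tag to each vector hit."""
    by_subject: dict[str, list[tuple[str, str, str]]] = {}
    for subject, predicate, obj in triples:
        by_subject.setdefault(subject, []).append((subject, predicate, obj))

    enriched = []
    # Highest similarity first; ties broken by entity name for a stable order.
    for entity, score in sorted(vector_hits, key=lambda hit: (-hit[1], hit[0])):
        context = sorted(by_subject.get(entity, []))
        enriched.append({
            "entity": entity,
            "similarity": score,
            "graph_context": context,
            "source": "vector+graph" if context else "vector",
        })
    return enriched


hits = [("weaviate", 0.64), ("graph-kernel", 0.91)]
graph = [("graph-kernel", "depends_on", "postgresql")]
result = enrich(hits, graph)
```

Hits with no graph neighbors are kept but tagged as vector-only, so a downstream consumer can tell which statements are backed by stored triples.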
Local-First Deployment
SQLite-based local storage with periodic sync to the remote PostgreSQL backend. Reduces query latency from 291ms to projected 10--30ms, making the Graph Kernel competitive with in-memory BM25 on latency while maintaining provenance guarantees.
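A minimal sketch of the local store follows, using Python's built-in `sqlite3` module. The table schema, the helper names, and the use of `INSERT OR REPLACE` to make repeated syncs idempotent are assumptions for illustration; the snapshot format used by the Graph Kernel may differ.

```python
import sqlite3


def open_local_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a local triple table keyed on the full triple."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS triples ("
        " subject TEXT NOT NULL,"
        " predicate TEXT NOT NULL,"
        " object TEXT NOT NULL,"
        " confidence REAL NOT NULL,"
        " PRIMARY KEY (subject, predicate, object))"
    )
    return conn


def upsert_triples(conn: sqlite3.Connection, rows: list[tuple[str, str, str, float]]) -> None:
    """Apply a batch pulled from the remote backend; re-applying the same batch is a no-op."""
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT OR REPLACE INTO triples VALUES (?, ?, ?, ?)", rows)


def query_subject(conn: sqlite3.Connection, subject: str) -> list[tuple[str, str, float]]:
    """Return (predicate, object, confidence) rows for a subject in a stable order."""
    cursor = conn.execute(
        "SELECT predicate, object, confidence FROM triples"
        " WHERE subject = ? ORDER BY predicate, object",
        (subject,),
    )
    return cursor.fetchall()


conn = open_local_store()
batch = [
    ("graph-kernel", "depends_on", "postgresql", 0.95),
    ("graph-kernel", "has_file", "main.rs", 0.73),
]
upsert_triples(conn, batch)
upsert_triples(conn, batch)  # a repeated sync does not duplicate rows
rows = query_subject(conn, "graph-kernel")
```

Because lookups hit a local file instead of a remote instance, the network round-trip disappears from the query path; the periodic sync is the only step that still contacts the backend.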
Federated Graph Protocol
A multi-kernel federation protocol enabling distributed knowledge across agent systems with cross-kernel queries and provenance chain propagation. This extends the single-agent provenance model to multi-agent collaboration scenarios.
Community Detection
Applying the Leiden algorithm~[citation: traag2019leiden] (as in GraphRAG) to identify hierarchical communities within the knowledge graph, enabling ``what is this corpus about?'' queries that the current system cannot answer.
Structural Relevance Metrics
Development of evaluation metrics that capture the structural quality of results (chain validity, provenance completeness, subgraph connectivity) rather than relying solely on term-overlap relevance scores.
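As one possible starting point, chain validity could be defined as the fraction of adjacent result triples that actually link, meaning the object of one is the subject of the next. The definition and the function `chain_validity` below are a sketch proposed for illustration, not a metric used in the evaluation.

```python
def chain_validity(triples: list[tuple[str, str, str]]) -> float:
    """Fraction of adjacent triple pairs where one triple's object is the next one's subject."""
    if not triples:
        return 0.0
    if len(triples) == 1:
        return 1.0  # a single triple is trivially a valid chain
    links = sum(1 for prev, nxt in zip(triples, triples[1:]) if prev[2] == nxt[0])
    return links / (len(triples) - 1)


causal_chain = [
    ("graph-kernel", "depends_on", "postgresql"),
    ("postgresql", "hosted_on", "supabase"),
]
coincidence_pile = [
    ("supabase", "has_doc", "pricing"),
    ("graph-kernel", "has_file", "readme"),
]

chain_result = chain_validity(causal_chain)
pile_result = chain_validity(coincidence_pile)
```

Unlike a term-overlap score, this measure separates the two cases: the connected chain scores 1.0 and the unrelated pile scores 0.0.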
Conclusion
We have presented the Graph Kernel, a deterministic context slicing engine that introduces the provenance engine category to the AI agent infrastructure stack. Through evaluation across 27 queries and four retrieval methods, we demonstrate that the Graph Kernel achieves perfect relevance on multi-hop structural queries while providing properties that no existing system offers: HMAC-signed deterministic context windows, policy-governed access control, and type-level enforcement of admissibility invariants.
The Graph Kernel is not a general-purpose search engine (0.84 overall relevance vs. keyword's 0.96), nor a general-purpose graph database (REST field filters vs. Cypher/SPARQL), nor a semantic retrieval system (0.42 fuzzy relevance). It occupies a distinct niche: the provenance authority layer for autonomous agent systems. In this role, it is---among the ten systems we evaluated---irreplaceable.
As autonomous AI agents assume greater responsibility in production systems, the infrastructure to ensure their decisions are traceable, reproducible, and verifiable becomes not optional but essential. The Graph Kernel is a first step toward that infrastructure.
\appendices
Reproducibility Statement
All benchmark scripts, raw results, and evaluation code are available in the OpenClaw CompCore repository at benchmarks/run\_benchmark.py. The Graph Kernel binary can be compiled from source with cargo build --release --features service. The evaluation corpus (3,502 triples) is stored in the Supabase PostgreSQL instance; a SQLite snapshot is maintained at ~/.compcore/graph-kernel.db for offline evaluation. Complete benchmark results in JSON format are archived at /tmp/benchmark\_results.json.
Corpus Statistics
Caption: Corpus Statistics
| Metric | Value |
|---|---|
| Total triples | 3,502 |
| Unique subjects | 221 |
| Unique predicates | 88 |
| Data sources | kimi-k2-extraction, topology-ingester |
| Average confidence | 0.73 (Kimi), 0.95 (topology) |
| Top predicate | has\_file (810, 23.1\%) |
| Second predicate | needs\_to (467, 13.3\%) |
Predicate Taxonomy
The 88 unique predicates cluster into four semantic categories:
- \textbf{Structural (40\%)}
- \textbf{Intentional (34\%)}
- \textbf{Relational (12\%)}
- \textbf{Descriptive (14\%)}
The dominance of structural and intentional predicates reflects the Graph Kernel's primary data source: LLM-extracted knowledge from development-focused conversations, where entities are software systems and relationships encode architectural dependencies and development intentions.
References
- [vaswani2017attention]
A.~Vaswani, N.~Shazeer, N.~Parmar, J.~Uszkoreit, L.~Jones, A.~N.~Gomez, \L.~Kaiser, and I.~Polosukhin, ``Attention is all you need,'' in Advances in Neural Information Processing Systems (NeurIPS), vol.~30, 2017.
- [weaviate2023]
B.~Mohr, E.~Bueno~de~Mesquita, and S.~van~Cranenburgh, ``Weaviate: An open-source vector database,'' 2023. [Online]. Available: https://weaviate.io
- [johnson2021billion]
J.~Johnson, M.~Douze, and H.~J\'{e}gou, ``Billion-scale similarity search with GPUs,'' IEEE Transactions on Big Data, vol.~7, no.~3, pp.~535--547, 2021.
- [lewis2020rag]
P.~Lewis et~al., ``Retrieval-augmented generation for knowledge-intensive NLP tasks,'' in Advances in Neural Information Processing Systems (NeurIPS), vol.~33, 2020, pp.~9459--9474.
- [gao2023rag_survey]
Y.~Gao et~al., ``Retrieval-augmented generation for large language models: A survey,'' arXiv preprint arXiv:2312.10997, 2023.
- [neo4j2024]
Neo4j, Inc., ``Neo4j Graph Database,'' 2024. [Online]. Available: https://neo4j.com
- [angles2017foundations]
R.~Angles, M.~Arenas, P.~Barcel\'{o}, A.~Hogan, J.~Reutter, and D.~Vrgoc, ``Foundations of modern query languages for graph databases,'' ACM Computing Surveys, vol.~50, no.~5, pp.~1--40, 2017.
- [sowa1992semantic]
J.~F.~Sowa, ``Semantic networks,'' in Encyclopedia of Artificial Intelligence, 2nd~ed., S.~C.~Shapiro, Ed. Wiley, 1992.
- [singhal2012knowledge]
A.~Singhal, ``Introducing the Knowledge Graph: Things, not strings,'' Google Official Blog, May 2012. [Online]. Available: https://blog.google/products/search/introducing-knowledge-graph-things-not/
- [jena2024]
Apache Software Foundation, ``Apache Jena,'' 2024. [Online]. Available: https://jena.apache.org
- [blazegraph2024]
Blazegraph, ``Blazegraph Database,'' 2024. [Online]. Available: https://blazegraph.com
- [francis2018cypher]
N.~Francis et~al., ``Cypher: An evolving query language for property graphs,'' in Proc. 2018 International Conference on Management of Data (SIGMOD), 2018, pp.~1433--1445.
- [ji2022survey]
S.~Ji, S.~Pan, E.~Cambria, P.~Marttinen, and P.~S.~Yu, ``A survey on knowledge graphs: Representation, acquisition, and applications,'' IEEE Transactions on Neural Networks and Learning Systems, vol.~33, no.~2, pp.~494--514, 2022.
- [hogan2021knowledge]
A.~Hogan et~al., ``Knowledge graphs,'' ACM Computing Surveys, vol.~54, no.~4, pp.~1--37, 2021.
- [hogan2022knowledge]
A.~Hogan et~al., ``Knowledge graphs,'' Synthesis Lectures on Data, Semantics, and Knowledge, vol.~12, no.~2, pp.~1--257, Morgan \& Claypool, 2022.
- [guu2020realm]
K.~Guu, K.~Lee, Z.~Tung, P.~Pasupat, and M.-W.~Chang, ``Retrieval augmented language model pre-training,'' in Proc. 37th ICML, 2020, pp.~3929--3938.
- [izacard2021fid]
G.~Izacard and E.~Grave, ``Leveraging passage retrieval with generative models for open domain question answering,'' in Proc. 16th EACL, 2021, pp.~874--880.
- [asai2024selfrag]
A.~Asai, Z.~Wu, Y.~Wang, A.~Sil, and H.~Hajishirzi, ``Self-RAG: Learning to retrieve, generate, and critique through self-reflection,'' in Proc. 12th ICLR, 2024.
- [pinecone2024]
Pinecone Systems, Inc., ``Pinecone: Vector Database for Machine Learning,'' 2024. [Online]. Available: https://www.pinecone.io
- [chroma2024]
Chroma, ``Chroma: The AI-native open-source embedding database,'' 2024. [Online]. Available: https://www.trychroma.com
- [edge2024graphrag]
D.~Edge et~al., ``From local to global: A graph RAG approach to query-focused summarization,'' arXiv preprint arXiv:2404.16130, 2024.
- [llamaindex2024]
J.~Liu, ``LlamaIndex: A data framework for LLM applications,'' 2024. [Online]. Available: https://www.llamaindex.ai
- [langchain2024]
H.~Chase, ``LangChain,'' 2024. [Online]. Available: https://www.langchain.com
- [beltagy2020longformer]
I.~Beltagy, M.~E.~Peters, and A.~Cohan, ``Longformer: The long-document transformer,'' arXiv preprint arXiv:2004.05150, 2020.
- [zaheer2020bigbird]
M.~Zaheer et~al., ``Big Bird: Transformers for longer sequences,'' in Advances in Neural Information Processing Systems (NeurIPS), vol.~33, 2020, pp.~17283--17297.
- [zhong2024memorybank]
W.~Zhong, L.~Guo, Q.~Gao, H.~Ye, and Y.~Wang, ``MemoryBank: Enhancing large language models with long-term memory,'' in Proc. AAAI Conference on Artificial Intelligence, vol.~38, 2024.
- [bulatov2022rmt]
A.~Bulatov, Y.~Kuratov, and M.~S.~Burtsev, ``Recurrent memory transformer,'' in Advances in Neural Information Processing Systems (NeurIPS), vol.~35, 2022, pp.~11079--11091.
- [zep2024]
Zep, ``Zep: Long-term memory for AI assistants,'' 2024. [Online]. Available: https://www.getzep.com
- [buneman2001provenance]
P.~Buneman, S.~Khanna, and W.-C.~Tan, ``Why and where: A characterization of data provenance,'' in International Conference on Database Theory (ICDT), 2001, pp.~316--330.
- [cheney2009provenance]
J.~Cheney, L.~Chiticariu, and W.-C.~Tan, ``Provenance in databases: Why, how, and where,'' Foundations and Trends in Databases, vol.~1, no.~4, pp.~379--474, 2009.
- [moreau2013prov]
L.~Moreau, P.~Missier, et~al., ``PROV-DM: The PROV Data Model,'' W3C Recommendation, 2013. [Online]. Available: https://www.w3.org/TR/prov-dm/
- [liang2023blockchain]
R.~Liang, C.~Niu, F.~Zhang, Z.~Qi, and Y.~Hao, ``Blockchain-based data provenance for AI: A comprehensive survey,'' IEEE Access, vol.~11, pp.~61748--61770, 2023.
- [jones2015jwt]
M.~Jones, J.~Bradley, and N.~Sakimura, ``JSON Web Token (JWT),'' RFC~7519, Internet Engineering Task Force, 2015. [Online]. Available: https://tools.ietf.org/html/rfc7519
- [axum2024]
D.~Pedersen et~al., ``Axum: Ergonomic and modular web framework built with Tokio, Tower, and Hyper,'' 2024. [Online]. Available: https://github.com/tokio-rs/axum
- [tokio2024]
C.~Lerche et~al., ``Tokio: An asynchronous runtime for the Rust programming language,'' 2024. [Online]. Available: https://tokio.rs
- [neptune2024]
Amazon Web Services, ``Amazon Neptune,'' 2024. [Online]. Available: https://aws.amazon.com/neptune/
- [dgraph2024]
Dgraph Labs, ``Dgraph: Distributed graph database,'' 2024. [Online]. Available: https://dgraph.io
- [typedb2024]
Vaticle, ``TypeDB: The polymorphic database,'' 2024. [Online]. Available: https://typedb.com
- [amodei2016concrete]
D.~Amodei et~al., ``Concrete problems in AI safety,'' arXiv preprint arXiv:1606.06565, 2016.
- [weidinger2021ethical]
L.~Weidinger et~al., ``Ethical and social risks of harm from language models,'' arXiv preprint arXiv:2112.04359, 2021.
- [traag2019leiden]
V.~A.~Traag, L.~Waltman, and N.~J.~van~Eck, ``From Louvain to Leiden: Guaranteeing well-connected communities,'' Scientific Reports, vol.~9, no.~1, p.~5233, 2019.