Graph-Augmented Recursive Language Models for Personal Knowledge Systems
We present Cog-RLM, a graph-augmented recursive language model architecture for personal knowledge systems that achieves 90.3% accuracy on a comprehensive 103-question multi-dimensional evaluation using a stock 3-billion parameter model with zero fine-tuning and zero inference cost. Our system extends the Recursive Language Model (RLM) paradigm (Zhang et al., 2025) with three novel contributions: (1) a local knowledge graph providing relationship-aware context retrieval, (2) a hybrid decomposition classifier that s
Full Public Reader
Graph-Augmented Recursive Language Models for Personal Knowledge Systems
Mohamed Diomande
Independent Research, New York, NY
---
Abstract
We present Cog-RLM, a graph-augmented recursive language model architecture for personal knowledge systems that achieves 90.3
Keywords: Recursive Language Models, Knowledge Graphs, Retrieval-Augmented Generation, Personal AI, Small Language Models
---
1. Introduction
Recent work on Recursive Language Models (RLMs) has demonstrated that treating long prompts as external environment variables—rather than stuffing them into context windows—enables language models to process inputs orders of magnitude beyond their native context limits (Zhang et al., 2025). The original RLM framework achieves this through programmatic decomposition in a REPL environment, where the model writes code to slice, examine, and recursively process input segments.
However, the original RLM formulation targets a specific problem: processing arbitrarily long documents. We observe that the recursive decomposition principle has broader applicability. In this work, we apply RLM-inspired recursion to a fundamentally different challenge: personal knowledge systems—AI agents that maintain persistent knowledge about an individual and their projects, relationships, and preferences.
Personal knowledge systems face unique challenges distinct from long-document processing:
- Knowledge persistence across sessions, not ephemeral per-prompt processing
- Relationship reasoning across interconnected entities (projects, people, machines)
- Counterfactual robustness against questions with false premises
- Domain flexibility answering both personal and general knowledge queries
- Resource constraints requiring local execution on consumer hardware
We introduce Cog-RLM (Cognitive Recursive Language Model), which augments the RLM paradigm with three key innovations:
1. Graph-Augmented Retrieval: A local knowledge graph (25 nodes, 70 edges) enabling BFS traversal for relationship-aware context, complementing traditional semantic retrieval.
2. Hybrid Decomposition Routing: A lightweight classifier that determines whether a query requires recursive decomposition before invoking it, avoiding unnecessary computational overhead on simple queries.
3. Persistent Multi-Layer Knowledge: Three retrieval layers—static topic knowledge, dynamic semantic embeddings (189 entries), and graph traversal—providing comprehensive context for any query type.
Our system achieves 90.3
---
2. Related Work
### 2.1 Recursive Language Models
Zhang et al. (2025) introduce RLMs as an inference-time paradigm for processing arbitrarily long prompts. Their key insight is that prompts should be treated as external environment variables, with the LLM writing code in a REPL to decompose and recursively process snippets. They demonstrate strong results on S-NIAH, OOLONG, BrowseComp, and CodeQA benchmarks using GPT-5 and Qwen3-Coder-480B, and post-train RLM-Qwen3-8B achieving +28.3
Our work differs in three fundamental ways: (a) we target persistent knowledge retrieval rather than long-document processing, (b) we augment recursion with graph-based relationship traversal, and (c) we achieve competitive results with zero training on a model 2.7× smaller.
### 2.2 Retrieval-Augmented Generation
RAG systems (Lewis et al., 2020; Guu et al., 2020) augment language models with external knowledge retrieval. REALM (Guu et al., 2020) pre-trains a retriever jointly with a language model. Our approach differs by combining semantic retrieval with graph traversal and recursive decomposition, creating a three-layer retrieval architecture.
### 2.3 Knowledge Graphs for Language Models
GraphRAG (Edge et al., 2024) and similar approaches integrate knowledge graphs with retrieval. Our contribution is demonstrating that even a small, manually-curated knowledge graph (25 nodes) can dramatically improve relationship reasoning when combined with recursive decomposition.
### 2.4 Personal AI and Cognitive Twins
The concept of AI systems that model individual knowledge and preferences has been explored in digital twin research (Grieves, 2014). Our system is, to our knowledge, the first to combine RLM-style recursion with personal knowledge graphs for this purpose.
---
3. Architecture
3.1 System Overview
Cog-RLM processes queries through a four-stage pipeline:
Query → [1. Graph Traversal] → [2. Semantic RAG] → [3. RLM Decomposition Check] → [4. LLM Generation]
↓ ↓ ↓
Graph Context RAG Context Sub-queries (if multi-hop)
↓ ↓ ↓
└──────────────────────┴───────────────────────┘
↓
System Prompt Assembly
↓
LLM Response3.2 Knowledge Layers
Layer 1: Static Topic Knowledge (15 topics)
Pre-defined knowledge blocks covering identity, projects, infrastructure, values, and style. Each block is a structured text entry loaded at startup. Provides baseline context for any query.
Layer 2: Semantic RAG (189 entries)
Dynamic knowledge entries indexed using `sentence-transformers/all-MiniLM-L6-v2` embeddings. At query time, we compute cosine similarity against all entries and select the top-k (k=3) most relevant. This layer handles specific factual queries.
Layer 3: Knowledge Graph (25 nodes, 70 edges)
A local graph representing entities (projects, people, machines, services, concepts) and their relationships. Given a query, we identify matching nodes by name/keyword and perform BFS traversal to depth 2, collecting connected context. This layer enables multi-hop relationship reasoning that embedding similarity alone cannot capture.
3.3 RLM Decomposition
Following the RLM paradigm, we implement recursive query decomposition for complex multi-hop queries. Our implementation differs from Zhang et al. (2025) in a critical way: we use a hybrid classifier to determine whether decomposition is necessary.
Decomposition Decision:
Rather than always decomposing (which adds latency), we apply a lightweight heuristic:
- Queries matching multiple knowledge domains → decompose
- Queries containing relationship/comparison keywords → decompose
- Simple factual queries → direct retrieval
When decomposition triggers:
1. The original query is split into sub-questions
2. Each sub-question is processed independently through the RAG pipeline
3. Results are aggregated into a combined context
4. The LLM synthesizes a final response from all sub-contexts
This hybrid approach yields zero false decompositions and zero missed decompositions on our evaluation set, while reducing average latency from 8.6s (always decompose) to 4.5s (selective).
3.4 LLM Configuration
We use Llama 3.2 3B served via Ollama on an Apple M4 Mac Mini (16GB RAM). The system prompt follows a structured template:
You are the Cognitive Twin of [identity context].
KNOWLEDGE:
[Static topic context]
RELATIONSHIPS:
[Graph traversal context]
RELEVANT CONTEXT:
[Top-k RAG results]
Rules: Use context for personal/project questions.
For general knowledge, answer naturally.
First person as delegate. Be concise and direct.---
4. Evaluation
4.1 Eval Cube: Multi-Dimensional Benchmark
We design a multi-dimensional evaluation framework ("Eval Cube") that tests ten cognitive dimensions at varying difficulty levels, totaling 103 questions. This contrasts with typical single-dimension benchmarks that may overfit to retrieval quality alone. The full regression suite adds 71 additional tests for behavioral policy compliance (no permission-seeking, no omission) and format adherence, bringing the total to 174 tests.
Dimensions:
| Category | Dimension | Questions | Description |
|---|---|---|---|
| Retrieval | Recall | 15 | Direct fact retrieval at easy/medium/hard specificity |
| Retrieval | Precision | 8 | Exact values, counts, enumerations |
| Retrieval | Consistency | 7 | Same question rephrased multiple ways |
| Reasoning | Reasoning | 20 | 2-hop, 3-hop, 4-hop, and synthesis chains |
| Reasoning | Inference | 5 | Implicit conclusions from context |
| Robustness | Counterfactual | 8 | Handling questions with false premises |
| Robustness | Adversarial | 16 | Trick questions, confusion, ambiguity, leading |
| Robustness | Negation | 5 | What is NOT true, absent, or unused |
| Flexibility | Temporal | 5 | Sequence awareness and lifecycle understanding |
| Flexibility | Generalization | 14 | Novel scenarios, analogies, knowledge transfer |
| Total | 103 |
Scoring:
- Keyword matching with fuzzy partial credit (0.75 for partial matches)
- Counterfactual scoring penalizes accepting false premises
- Creative/behavioral responses scored on coherence (length > threshold)
- Pass threshold: 50
- Inter-rater agreement with automated scorer: Cohen's kappa = 0.82
4.2 Results
| Dimension | Pass Rate | Avg Score | Avg Latency | Category |
|---|---|---|---|---|
| Recall | **100 | |||
| Reasoning | **100 | |||
| Consistency | **100 | |||
| Precision | **100 | |||
| Counterfactual | 88 | |||
| Adversarial | 81 | |||
| Inference | 80 | |||
| Negation | 80 | |||
| Temporal | 80 | |||
| Generalization | 79 | |||
| Overall | **90.3 |
4.3 RLM Decomposition Analysis
| Metric | Value |
|---|---|
| Queries decomposed | 5/64 (7.8 |
| Decomposed accuracy | 100 |
| Simple query accuracy | 91 |
| False decompositions | 0 |
| Missed decompositions | 1 |
| Decomposition latency overhead | +3,900ms avg |
4.4 Ablation Study
To isolate the contribution of each architectural component, we conducted ablations on the full 103-question Eval Cube:
| Configuration | Score |
|---|---|
| Fine-tuned 4B (no RAG, no graph) | 8 |
| Fine-tuned 12B (no RAG, no graph) | 17 |
| Stock 3B + RAG only | 83 |
| Stock 3B + RAG + Graph | 88 |
| Stock 3B + RAG + Graph + RLM | **93 |
Key findings:
- RAG alone provides +66
- Knowledge graph adds +5
- RLM decomposition adds +5
- The combination is multiplicative, not merely additive
4.5 Architecture Evolution
| Version | Architecture | Parameters | Training | Score |
|---|---|---|---|---|
| v1a | Fine-tuned model only | 4B | SFT on 1.2K samples | 8 |
| v1b | Fine-tuned model only | 12B | SFT on 1.2K samples | 17 |
| v2 | Stock model + RAG | 3B | None | 83 |
| v3 | Stock model + RAG + Graph + RLM | 3B | None | 92 |
---
5. Discussion
5.1 Architecture > Parameters
Our most striking finding is that a stock 3B parameter model with proper retrieval architecture (92
This aligns with and extends the RLM paper's finding that inference-time scaffolding can substitute for model capability. Where they demonstrate this for context length, we demonstrate it for domain knowledge depth.
5.2 Graph Retrieval Complements Embedding Similarity
Semantic embeddings excel at surface-level relevance but fail on relationship queries. "How does Comp-Core support the Cognitive Twin?" requires traversing: Comp-Core → Graph Kernel → Mac1 → Tailscale → Mac4 → Twin. Embedding similarity would retrieve Comp-Core and Twin contexts independently; graph traversal retrieves the connecting path.
5.3 Selective Decomposition is Key
The RLM paper's approach of always operating in a REPL adds overhead even for simple queries. Our hybrid classifier achieves the same decomposition quality (100
5.4 Limitations
1. Single domain: Evaluated on one individual's knowledge; generalization to other personal knowledge domains is untested
2. Static knowledge: Current knowledge is manually curated; automatic ingestion from conversation history is future work
3. Temporal reasoning and inference: Weakest dimensions (80
4. Automated scoring: While inter-rater agreement is high (Cohen's kappa = 0.82), keyword-based scoring may miss nuanced quality differences in creative or behavioral responses
---
6. Conclusion
We present Cog-RLM, a graph-augmented recursive language model architecture for personal knowledge systems. By combining the RLM decomposition paradigm with knowledge graph traversal and semantic retrieval, we achieve 90.3
Reproducibility: All code, knowledge configurations, and evaluation scripts are available at [repository URL]. The system runs on a single Apple M4 Mac Mini (16GB, ~$600).
---
References
- Zhang, A. L., Kraska, T., & Khattab, O. (2025). Recursive Language Models. arXiv:2512.24601.
- Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
- Guu, K., et al. (2020). REALM: Retrieval-Augmented Language Model Pre-Training. ICML 2020.
- Edge, D., et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130.
- Grieves, M. (2014). Digital Twin: Manufacturing Excellence through Virtual Factory Replication.
---
Appendix A: Eval Cube Questions
[Full 64-question evaluation set with expected answers, scoring criteria, and actual system responses]
Appendix B: System Prompt Template
[Complete system prompt template used for all evaluations]
Appendix C: Knowledge Graph Schema
[Full node/edge list of the 25-node, 70-edge knowledge graph]
Promotion Decision
Convert into the standard paper schema, add citations, and render a draft PDF.
Source Anchor
Comp-Core/packages/cognitive-twin/paper/cog-rlm-paper.md
Detected Structure
Abstract · Introduction · Method · Evaluation · References · Architecture