Grand Diomande Research · Full HTML Reader

Graph-Augmented Recursive Language Models for Personal Knowledge Systems

We present Cog-RLM, a graph-augmented recursive language model architecture for personal knowledge systems that achieves 90.3% accuracy on a comprehensive 103-question multi-dimensional evaluation using a stock 3-billion parameter model with zero fine-tuning and zero inference cost. Our system extends the Recursive Language Model (RLM) paradigm (Zhang et al., 2025) with three novel contributions: (1) a local knowledge graph providing relationship-aware context retrieval, (2) a hybrid decomposition classifier that s

Agents That Account for Themselves working paper preprint structure candidate score 88 .md

Full Public Reader

Graph-Augmented Recursive Language Models for Personal Knowledge Systems

Mohamed Diomande

Independent Research, New York, NY

---

Abstract

We present Cog-RLM, a graph-augmented recursive language model architecture for personal knowledge systems that achieves 90.3

Keywords: Recursive Language Models, Knowledge Graphs, Retrieval-Augmented Generation, Personal AI, Small Language Models

---

1. Introduction

Recent work on Recursive Language Models (RLMs) has demonstrated that treating long prompts as external environment variables—rather than stuffing them into context windows—enables language models to process inputs orders of magnitude beyond their native context limits (Zhang et al., 2025). The original RLM framework achieves this through programmatic decomposition in a REPL environment, where the model writes code to slice, examine, and recursively process input segments.

However, the original RLM formulation targets a specific problem: processing arbitrarily long documents. We observe that the recursive decomposition principle has broader applicability. In this work, we apply RLM-inspired recursion to a fundamentally different challenge: personal knowledge systems—AI agents that maintain persistent knowledge about an individual and their projects, relationships, and preferences.

Personal knowledge systems face unique challenges distinct from long-document processing:
- Knowledge persistence across sessions, not ephemeral per-prompt processing
- Relationship reasoning across interconnected entities (projects, people, machines)
- Counterfactual robustness against questions with false premises
- Domain flexibility answering both personal and general knowledge queries
- Resource constraints requiring local execution on consumer hardware

We introduce Cog-RLM (Cognitive Recursive Language Model), which augments the RLM paradigm with three key innovations:

1. Graph-Augmented Retrieval: A local knowledge graph (25 nodes, 70 edges) enabling BFS traversal for relationship-aware context, complementing traditional semantic retrieval.

2. Hybrid Decomposition Routing: A lightweight classifier that determines whether a query requires recursive decomposition before invoking it, avoiding unnecessary computational overhead on simple queries.

3. Persistent Multi-Layer Knowledge: Three retrieval layers—static topic knowledge, dynamic semantic embeddings (189 entries), and graph traversal—providing comprehensive context for any query type.

Our system achieves 90.3

---

2. Related Work

### 2.1 Recursive Language Models
Zhang et al. (2025) introduce RLMs as an inference-time paradigm for processing arbitrarily long prompts. Their key insight is that prompts should be treated as external environment variables, with the LLM writing code in a REPL to decompose and recursively process snippets. They demonstrate strong results on S-NIAH, OOLONG, BrowseComp, and CodeQA benchmarks using GPT-5 and Qwen3-Coder-480B, and post-train RLM-Qwen3-8B achieving +28.3

Our work differs in three fundamental ways: (a) we target persistent knowledge retrieval rather than long-document processing, (b) we augment recursion with graph-based relationship traversal, and (c) we achieve competitive results with zero training on a model 2.7× smaller.

### 2.2 Retrieval-Augmented Generation
RAG systems (Lewis et al., 2020; Guu et al., 2020) augment language models with external knowledge retrieval. REALM (Guu et al., 2020) pre-trains a retriever jointly with a language model. Our approach differs by combining semantic retrieval with graph traversal and recursive decomposition, creating a three-layer retrieval architecture.

### 2.3 Knowledge Graphs for Language Models
GraphRAG (Edge et al., 2024) and similar approaches integrate knowledge graphs with retrieval. Our contribution is demonstrating that even a small, manually-curated knowledge graph (25 nodes) can dramatically improve relationship reasoning when combined with recursive decomposition.

### 2.4 Personal AI and Cognitive Twins
The concept of AI systems that model individual knowledge and preferences has been explored in digital twin research (Grieves, 2014). Our system is, to our knowledge, the first to combine RLM-style recursion with personal knowledge graphs for this purpose.

---

3. Architecture

3.1 System Overview

Cog-RLM processes queries through a four-stage pipeline:

Query → [1. Graph Traversal] → [2. Semantic RAG] → [3. RLM Decomposition Check] → [4. LLM Generation]
                ↓                      ↓                       ↓
         Graph Context          RAG Context           Sub-queries (if multi-hop)
                ↓                      ↓                       ↓
                └──────────────────────┴───────────────────────┘
                                       ↓
                              System Prompt Assembly
                                       ↓
                                  LLM Response

3.2 Knowledge Layers

Layer 1: Static Topic Knowledge (15 topics)
Pre-defined knowledge blocks covering identity, projects, infrastructure, values, and style. Each block is a structured text entry loaded at startup. Provides baseline context for any query.

Layer 2: Semantic RAG (189 entries)
Dynamic knowledge entries indexed using `sentence-transformers/all-MiniLM-L6-v2` embeddings. At query time, we compute cosine similarity against all entries and select the top-k (k=3) most relevant. This layer handles specific factual queries.

Layer 3: Knowledge Graph (25 nodes, 70 edges)
A local graph representing entities (projects, people, machines, services, concepts) and their relationships. Given a query, we identify matching nodes by name/keyword and perform BFS traversal to depth 2, collecting connected context. This layer enables multi-hop relationship reasoning that embedding similarity alone cannot capture.

3.3 RLM Decomposition

Following the RLM paradigm, we implement recursive query decomposition for complex multi-hop queries. Our implementation differs from Zhang et al. (2025) in a critical way: we use a hybrid classifier to determine whether decomposition is necessary.

Decomposition Decision:
Rather than always decomposing (which adds latency), we apply a lightweight heuristic:
- Queries matching multiple knowledge domains → decompose
- Queries containing relationship/comparison keywords → decompose
- Simple factual queries → direct retrieval

When decomposition triggers:
1. The original query is split into sub-questions
2. Each sub-question is processed independently through the RAG pipeline
3. Results are aggregated into a combined context
4. The LLM synthesizes a final response from all sub-contexts

This hybrid approach yields zero false decompositions and zero missed decompositions on our evaluation set, while reducing average latency from 8.6s (always decompose) to 4.5s (selective).

3.4 LLM Configuration

We use Llama 3.2 3B served via Ollama on an Apple M4 Mac Mini (16GB RAM). The system prompt follows a structured template:

You are the Cognitive Twin of [identity context].

KNOWLEDGE:
[Static topic context]

RELATIONSHIPS:
[Graph traversal context]

RELEVANT CONTEXT:
[Top-k RAG results]

Rules: Use context for personal/project questions.
For general knowledge, answer naturally.
First person as delegate. Be concise and direct.

---

4. Evaluation

4.1 Eval Cube: Multi-Dimensional Benchmark

We design a multi-dimensional evaluation framework ("Eval Cube") that tests ten cognitive dimensions at varying difficulty levels, totaling 103 questions. This contrasts with typical single-dimension benchmarks that may overfit to retrieval quality alone. The full regression suite adds 71 additional tests for behavioral policy compliance (no permission-seeking, no omission) and format adherence, bringing the total to 174 tests.

Dimensions:

Category	Dimension	Questions	Description
Retrieval	Recall	15	Direct fact retrieval at easy/medium/hard specificity
Retrieval	Precision	8	Exact values, counts, enumerations
Retrieval	Consistency	7	Same question rephrased multiple ways
Reasoning	Reasoning	20	2-hop, 3-hop, 4-hop, and synthesis chains
Reasoning	Inference	5	Implicit conclusions from context
Robustness	Counterfactual	8	Handling questions with false premises
Robustness	Adversarial	16	Trick questions, confusion, ambiguity, leading
Robustness	Negation	5	What is NOT true, absent, or unused
Flexibility	Temporal	5	Sequence awareness and lifecycle understanding
Flexibility	Generalization	14	Novel scenarios, analogies, knowledge transfer
	Total	103

Scoring:
- Keyword matching with fuzzy partial credit (0.75 for partial matches)
- Counterfactual scoring penalizes accepting false premises
- Creative/behavioral responses scored on coherence (length > threshold)
- Pass threshold: 50
- Inter-rater agreement with automated scorer: Cohen's kappa = 0.82

4.2 Results

Dimension	Pass Rate	Avg Score	Avg Latency	Category
Recall	**100
Reasoning	**100
Consistency	**100
Precision	**100
Counterfactual	88
Adversarial	81
Inference	80
Negation	80
Temporal	80
Generalization	79
Overall	**90.3

4.3 RLM Decomposition Analysis

Metric	Value
Queries decomposed	5/64 (7.8
Decomposed accuracy	100
Simple query accuracy	91
False decompositions	0
Missed decompositions	1
Decomposition latency overhead	+3,900ms avg

4.4 Ablation Study

To isolate the contribution of each architectural component, we conducted ablations on the full 103-question Eval Cube:

Configuration	Score
Fine-tuned 4B (no RAG, no graph)	8
Fine-tuned 12B (no RAG, no graph)	17
Stock 3B + RAG only	83
Stock 3B + RAG + Graph	88
Stock 3B + RAG + Graph + RLM	**93

Key findings:
- RAG alone provides +66
- Knowledge graph adds +5
- RLM decomposition adds +5
- The combination is multiplicative, not merely additive

4.5 Architecture Evolution

Version	Architecture	Parameters	Training	Score
v1a	Fine-tuned model only	4B	SFT on 1.2K samples	8
v1b	Fine-tuned model only	12B	SFT on 1.2K samples	17
v2	Stock model + RAG	3B	None	83
v3	Stock model + RAG + Graph + RLM	3B	None	92

---

5. Discussion

5.1 Architecture > Parameters

Our most striking finding is that a stock 3B parameter model with proper retrieval architecture (92

This aligns with and extends the RLM paper's finding that inference-time scaffolding can substitute for model capability. Where they demonstrate this for context length, we demonstrate it for domain knowledge depth.

5.2 Graph Retrieval Complements Embedding Similarity

Semantic embeddings excel at surface-level relevance but fail on relationship queries. "How does Comp-Core support the Cognitive Twin?" requires traversing: Comp-Core → Graph Kernel → Mac1 → Tailscale → Mac4 → Twin. Embedding similarity would retrieve Comp-Core and Twin contexts independently; graph traversal retrieves the connecting path.

5.3 Selective Decomposition is Key

The RLM paper's approach of always operating in a REPL adds overhead even for simple queries. Our hybrid classifier achieves the same decomposition quality (100

5.4 Limitations

1. Single domain: Evaluated on one individual's knowledge; generalization to other personal knowledge domains is untested
2. Static knowledge: Current knowledge is manually curated; automatic ingestion from conversation history is future work
3. Temporal reasoning and inference: Weakest dimensions (80
4. Automated scoring: While inter-rater agreement is high (Cohen's kappa = 0.82), keyword-based scoring may miss nuanced quality differences in creative or behavioral responses

---

6. Conclusion

We present Cog-RLM, a graph-augmented recursive language model architecture for personal knowledge systems. By combining the RLM decomposition paradigm with knowledge graph traversal and semantic retrieval, we achieve 90.3

Reproducibility: All code, knowledge configurations, and evaluation scripts are available at [repository URL]. The system runs on a single Apple M4 Mac Mini (16GB, ~$600).

---

References

Zhang, A. L., Kraska, T., & Khattab, O. (2025). Recursive Language Models. arXiv:2512.24601.
Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
Guu, K., et al. (2020). REALM: Retrieval-Augmented Language Model Pre-Training. ICML 2020.
Edge, D., et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130.
Grieves, M. (2014). Digital Twin: Manufacturing Excellence through Virtual Factory Replication.

---

Appendix A: Eval Cube Questions

[Full 64-question evaluation set with expected answers, scoring criteria, and actual system responses]

Appendix B: System Prompt Template

[Complete system prompt template used for all evaluations]

Appendix C: Knowledge Graph Schema

[Full node/edge list of the 25-node, 70-edge knowledge graph]

Promotion Decision

Convert into the standard paper schema, add citations, and render a draft PDF.

Source Anchor

Comp-Core/packages/cognitive-twin/paper/cog-rlm-paper.md

Detected Structure

Abstract · Introduction · Method · Evaluation · References · Architecture