Graph-Augmented Recursive Language Models for Personal Knowledge Systems
% ============================================================ We present \textbf{Cog-RLM}, a graph-augmented recursive language model architecture for personal knowledge systems that achieves 90.3\% accuracy on a comprehensive 103-question evaluation spanning ten cognitive dimensions, using a stock 3-billion parameter model with zero fine-tuning and zero inference cost. Our system extends the Recursive Language Model (RLM) paradigm~\citep{zhang2025rlm} with three novel contributions: (1)~a local knowledge graph pr
Full Public Reader
Abstract
We present Cog-RLM, a graph-augmented recursive language model architecture for personal knowledge systems that achieves 90.3\
Introduction
The ability to build AI systems that maintain persistent, structured knowledge about specific domains---and reason over that knowledge in response to arbitrary queries---remains a central challenge in applied machine learning. While large language models (LLMs) have demonstrated remarkable general knowledge and reasoning capabilities~[citation: brown2020gpt3, touvron2023llama, team2023gemini], they fundamentally lack persistent memory across sessions, relationship-aware reasoning over interconnected entities, and grounding in domain-specific facts that change over time.
Recent work on Recursive Language Models (RLMs)~[citation: zhang2025rlm] has introduced an elegant paradigm for extending LLM capabilities at inference time: treating long prompts as external environment variables and using programmatic decomposition in a REPL environment to recursively process input segments. RLMs demonstrate that a model can effectively process inputs orders of magnitude beyond its native context limit by writing code to decompose, examine, and synthesize information. The original RLM-Qwen3-8B achieves +28.3\
However, the RLM formulation targets a specific problem: processing arbitrarily long documents within a single session. We observe that the core insight---recursive decomposition of complex queries into tractable sub-problems---has broader applicability. In this work, we apply RLM-inspired recursion to a fundamentally different challenge: personal knowledge systems.
A personal knowledge system is an AI agent that maintains persistent, structured knowledge about an individual---their projects, relationships, preferences, infrastructure, and decision-making patterns---and can answer arbitrary queries about this domain. Such systems face challenges distinct from long-document processing:
[leftmargin=*]
- Knowledge persistence: Information must survive across sessions, not be ephemeral per-prompt context.
- Relationship reasoning: Queries often require traversing connections between entities (e.g., ``Which infrastructure supports which projects?'').
- Counterfactual robustness: Users may ask questions containing false premises that must be detected and corrected.
- Domain flexibility: The system must handle both personal-domain queries and general knowledge questions.
- Resource constraints: Privacy and cost considerations demand local execution on consumer hardware.
We introduce Cog-RLM (Cognitive Recursive Language Model), which augments the RLM paradigm with three key innovations:
Graph-Augmented Retrieval. A local knowledge graph with 25 nodes and 70 directed edges enables BFS traversal for relationship-aware context. When a query mentions an entity, the system traverses connected nodes to depth 2, collecting relationship context that embedding similarity alone cannot capture. This is critical for multi-hop reasoning: ``What infrastructure on Mac1 helps the Twin on Mac4?'' requires traversing Mac1 $\rightarrow$ Graph Kernel $\rightarrow$ Tailscale $\rightarrow$ Mac4 $\rightarrow$ Cognitive Twin.
Hybrid Decomposition Routing. Rather than always decomposing queries (which adds 3.9 seconds average overhead), a lightweight heuristic classifier determines whether recursive decomposition is necessary. On our evaluation, decomposition triggers for only 7.8\
Persistent Multi-Layer Knowledge. Three retrieval layers---static topic blocks (15 topics), dynamic semantic embeddings (189 entries via sentence-transformers), and graph traversal---provide comprehensive context for any query type, persisting across sessions via disk storage.
Our system achieves 90.3\
Contributions. We make four contributions:
[leftmargin=*]
- A novel architecture combining RLM-style recursion with knowledge graph traversal and semantic retrieval for personal knowledge tasks (Section~[ref: sec:arch]).
- A comprehensive multi-dimensional evaluation framework (``Eval Cube'') spanning ten cognitive dimensions with 103 questions, enabling fine-grained analysis of system capabilities (Section~[ref: sec:eval-design]).
- Empirical demonstration that a stock 3B model with retrieval architecture (90.3\
- A complete, reproducible system running on consumer hardware (\$600) at zero inference cost, with all code and evaluation data publicly available (Section~[ref: sec:reproduce]).
Related Work
Cog-RLM integrates ideas from several research streams: recursive inference-time computation, retrieval-augmented generation, knowledge-graph-enhanced language models, personal AI systems, efficient small language models, memory-augmented architectures, and multi-hop reasoning. We survey each area, situating our contributions relative to the state of the art.
Recursive and Inference-Time Language Models
The idea that language models can improve their performance through structured inference-time computation has gained significant recent attention. Chain-of-Thought (CoT) prompting~[citation: wei2022chain] demonstrated that eliciting step-by-step reasoning dramatically improves LLM performance on arithmetic, commonsense, and symbolic reasoning tasks without any model modification. This was extended by Tree-of-Thoughts~[citation: yao2023tree], which explores multiple reasoning paths in a tree structure with lookahead and backtracking, and Graph-of-Thoughts~[citation: besta2024graph], which generalizes to arbitrary reasoning topologies including cycles and merges.
Zhang et al.~[citation: zhang2025rlm] introduce Recursive Language Models (RLMs) as an inference-time paradigm for processing arbitrarily long prompts. Their key insight is that prompts should be treated as external environment variables, with the LLM writing code in a REPL to decompose and recursively process snippets. Using GPT-5 and Qwen3-Coder-480B, they demonstrate strong results on S-NIAH (100\
Our work differs from RLMs in three fundamental ways. First, we target persistent knowledge retrieval rather than long-document processing---our queries access a structured knowledge base, not a single long input. Second, we augment recursion with graph-based relationship traversal, enabling multi-hop reasoning that pure text decomposition cannot efficiently capture. Third, we achieve competitive results with zero training on a model $2.7\times$ smaller (3B vs. 8B), and our selective decomposition classifier reduces average latency by 48\
Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) has become the dominant paradigm for grounding language models in external knowledge. The foundational work by Karpukhin et al.~[citation: karpukhin2020dense] demonstrated that dense passage retrieval (DPR), trained with contrastive learning, substantially outperforms sparse retrieval (BM25/TF-IDF) for open-domain question answering. REALM~[citation: guu2020realm] pre-trains a knowledge retriever jointly with a masked language model, showing that end-to-end retriever-reader training improves downstream accuracy. Lewis et al.~[citation: lewis2020rag] formalized the RAG paradigm, combining a pre-trained DPR retriever with a BART~[citation: lewis2020bart] seq2seq generator, fine-tuning end-to-end on knowledge-intensive tasks. Their RAG-Token and RAG-Sequence variants demonstrated state-of-the-art results on Natural Questions, TriviaQA, and MSMARCO.
Subsequent work has focused on improving retrieval quality and adaptiveness. FiD (Fusion-in-Decoder)~[citation: izacard2021leveraging] scales RAG by independently encoding many retrieved passages before fusing them in the decoder, achieving strong results on open-domain QA with up to 100 passages. Self-RAG~[citation: asai2023selfrag] introduces reflection tokens that allow the model to adaptively decide when to retrieve and how to use retrieved passages, learning to critique its own generations. CRAG~[citation: yan2024corrective] adds a corrective mechanism that evaluates retrieval quality and triggers web search as a fallback when retrieved documents are insufficient. Active RAG~[citation: jiang2023active] formulates retrieval as an active learning problem, iteratively refining queries based on generation feedback.
Our approach differs from these systems in three ways: (1)~we combine semantic retrieval with graph traversal and recursive decomposition, creating a three-layer retrieval architecture where each layer addresses fundamentally different query types; (2)~we use a lightweight local embedding model (22M parameters) rather than large retriever models, enabling fully local execution; and (3)~our static knowledge layer provides guaranteed context for high-frequency facts, eliminating retrieval failures for common queries.
Knowledge Graphs and Graph-Augmented Retrieval
Integrating structured knowledge with language models has a rich history spanning knowledge-enhanced pre-training, graph-augmented retrieval, and hybrid neuro-symbolic architectures.
Knowledge-enhanced pre-training. ERNIE~[citation: zhang2019ernie] integrates entity embeddings from TransE~[citation: bordes2013translating] during pre-training by aligning token-level and entity-level representations. KnowBert~[citation: peters2019knowbert] injects knowledge graph embeddings into BERT through knowledge attention and recontextualization layers. KEPLER~[citation: wang2021kepler] jointly pre-trains knowledge embeddings and language model representations, unifying both capabilities in a single model.
Graph-augmented retrieval. GraphRAG~[citation: edge2024graphrag] applies community detection (Leiden algorithm) on LLM-extracted knowledge graphs to enable query-focused summarization, demonstrating that graph structure improves comprehensiveness by 36--73\
Question answering over knowledge graphs. KGQA systems~[citation: saxena2020improving] use embeddings to answer multi-hop questions by jointly reasoning over graph structure and text. UniKGQA~[citation: jiang2023unikgqa] unifies retrieval and reasoning into a single model for multi-hop KGQA by combining semantic matching with graph structure traversal.
Our knowledge graph is deliberately small (25 nodes, 70 edges) and manually curated, in contrast to the large automatically extracted graphs used in GraphRAG (thousands of nodes). We demonstrate that even this minimal graph structure, when combined with BFS traversal and recursive decomposition, provides significant gains on relationship queries (+5\
Personal AI and Cognitive Digital Twins
The concept of AI systems that model individual knowledge and preferences draws from two research traditions: digital twins in engineering and personal knowledge management in human-computer interaction.
Digital twin foundations. Grieves~[citation: grieves2014twin] introduced the digital twin concept in manufacturing, proposing virtual replicas that mirror physical systems in real time. The concept has since expanded to cognitive digital twins (CDTs)~[citation: abburu2020cognitwin], which incorporate reasoning and decision-making capabilities beyond passive mirroring. In the AI context, a cognitive twin models not a physical system but an individual's knowledge, preferences, and reasoning patterns.
Conversational memory systems. MemGPT~[citation: packer2023memgpt] introduces virtual context management for persistent conversations, implementing a hierarchical memory system (main context, archival memory, recall memory) that mirrors operating system memory management. This enables LLMs to maintain coherent persona across sessions longer than their context window. MemoryBank~[citation: zhong2024memorybank] extends this with an Ebbinghaus forgetting curve mechanism to prioritize important memories. Letta~[citation: packer2024letta] (the production evolution of MemGPT) provides agent-native memory with tool-augmented retrieval.
Personal knowledge management (PKM). Tools like Obsidian and Roam Research structure personal information as linked notes, creating personal knowledge graphs that users manually maintain. Reflect~[citation: li2023reflect] proposes LLM-powered PKM that automatically synthesizes insights from linked notes. Personal.ai~[citation: srinivasan2023personal] attempts to create persistent personal AI models that learn from user interactions, though without formal evaluation of knowledge retention accuracy.
Our work differs from these approaches by combining structured knowledge representation (knowledge graph + three-layer RAG) with recursive reasoning, targeting not just conversation memory but deep personal knowledge comprehension. Unlike MemGPT's focus on maintaining context across conversations, Cog-RLM maintains a persistent, queryable knowledge base with explicit relational structure. Unlike PKM tools, our system answers arbitrary natural language queries rather than requiring manual navigation.
Small Language Model Optimization
A growing body of work demonstrates that small models ($\leq$7B parameters) can achieve strong domain-specific performance through careful training, architecture design, and inference-time augmentation.
Efficient pre-training. TinyLlama~[citation: zhang2024tinyllama] trains a 1.1B-parameter model on 3 trillion tokens, achieving performance competitive with larger models on commonsense reasoning benchmarks. Phi-3~[citation: abdin2024phi3] achieves strong benchmarks through aggressive data curation (``textbook quality'' data), demonstrating that data quality can compensate for model scale. SmolLM~[citation: allal2024smollm] provides a family of 135M--1.7B models trained on carefully curated web, code, and synthetic data.
Knowledge distillation. Hinton et al.~[citation: hinton2015distilling] established the teacher-student framework where a small student model learns from a large teacher's soft probability distributions. For LLMs, Gu et al.~[citation: gu2024minillm] introduce MiniLLM, which uses reverse KL divergence for distilling generative language models, preventing the student from overestimating low-probability tokens. Lion~[citation: jiang2023lion] combines distillation with adversarial training for compact models.
Quantization and efficient inference. GPTQ~[citation: frantar2023gptq] enables post-training quantization of large models to 3--4 bits with minimal quality loss. AWQ~[citation: lin2024awq] further improves quantized model quality by protecting salient weight channels. The Llama 3.2 3B model we use is served in Q4\_K\_M quantization via Ollama, reducing memory footprint from 6GB (FP16) to approximately 2GB while maintaining output quality.
Inference-time augmentation. Complementary to model compression, inference-time augmentation equips small models with external capabilities. Toolformer~[citation: schick2024toolformer] teaches models to use external tools (calculators, search engines, translators) through self-supervised learning on API calls. Our work contributes to this thread by showing that a stock 3B model with zero fine-tuning can outperform fine-tuned 12B models by $5.4\times$ when equipped with proper retrieval architecture---suggesting that for narrow domains, inference-time augmentation can substitute for training-time investment.
Memory-Augmented Neural Networks
The idea of augmenting neural networks with external memory predates the transformer era and informs our three-layer retrieval architecture.
Neural memory architectures. Neural Turing Machines (NTMs)~[citation: graves2014neural] introduced differentiable external memory with content-based and location-based addressing. The Differentiable Neural Computer (DNC)~[citation: graves2016hybrid] extended NTMs with dynamic memory allocation and temporal linking. While these architectures operate at the tensor level, they established the principle that external memory access can dramatically extend a model's capabilities---a principle we implement at the system level with three complementary retrieval layers.
Memory networks. Sukhbaatar et al.~[citation: sukhbaatar2015end] introduced end-to-end memory networks that read from a memory bank using attention over stored facts, enabling multi-hop reasoning through multiple read operations. Key-value memory networks~[citation: miller2016key] separate the addressing mechanism (keys) from the retrieved content (values), improving memory efficiency. Our static knowledge layer mirrors this key-value structure, with topic keys enabling direct lookup and descriptions providing the context content.
Retrieval-augmented memory for LLMs. RETRO~[citation: borgeaud2022improving] augments transformer language models with a retrieval mechanism over a large corpus of text chunks, demonstrating that a 7B-parameter retrieval-enhanced model can match a 25$\times$ larger model on perplexity. Memorizing Transformers~[citation: wu2022memorizing] extend the transformer architecture with a $k$NN lookup over cached hidden states from previous inputs. These approaches are complementary to ours: they augment the model architecture itself, while we augment the inference pipeline with structured retrieval.
Multi-Hop Reasoning
Multi-hop reasoning---answering questions that require aggregating information from multiple evidence sources---is central to Cog-RLM's design, particularly for relationship traversal queries.
Multi-hop QA benchmarks. HotpotQA~[citation: yang2018hotpotqa] introduced a large-scale multi-hop QA dataset requiring reasoning over two Wikipedia paragraphs. MuSiQue~[citation: trivedi2022musique] provides compositional multi-hop questions with verified single-hop decompositions, enabling controlled evaluation of reasoning chains up to 4 hops. 2WikiMultiHopQA~[citation: ho2020constructing] generates multi-hop questions from structured knowledge bases, ensuring that each hop is necessary for the answer.
Decomposition-based approaches. DecomP~[citation: khot2023decomposed] decomposes complex questions into simpler sub-questions that can be answered by specialized modules, then aggregates the results. IRCoT~[citation: trivedi2023interleaving] interleaves chain-of-thought reasoning with retrieval, using each reasoning step to formulate the next retrieval query. ReAct~[citation: yao2023react] synergizes reasoning traces and task-specific actions, enabling models to plan, retrieve, and reason in an interleaved fashion. Our RLM decomposition mechanism (Algorithm~[ref: alg:rlm]) follows the DecomP strategy but extends it with graph-augmented re-retrieval: each sub-query triggers both semantic search and graph traversal, ensuring that relationship chains are captured alongside topical content.
Graph-enhanced multi-hop reasoning. GNN-based readers~[citation: de2019question] construct entity graphs from retrieved passages and apply graph neural networks to propagate information across documents. PathRetriever~[citation: asai2020learning] learns to follow reasoning paths through a corpus by iteratively selecting the next evidence paragraph. Our approach is simpler---BFS traversal to depth 2 over a curated graph---but achieves 100\
Summary. Table~[ref: tab:related-comparison] positions Cog-RLM relative to the most closely related systems across key design dimensions.
Caption: Comparison of Cog-RLM with related systems across key design dimensions.
| System | Retrieval | Graph | Decomp. | Local | Personal |
|---|---|---|---|---|---|
| RAG [lewis2020rag] | Dense | 55 | 55 | 55 | 55 |
| Self-RAG [asai2023selfrag] | Adaptive | 55 | 55 | 55 | 55 |
| GraphRAG [edge2024graphrag] | Community | 51 | 55 | 55 | 55 |
| MemGPT [packer2023memgpt] | Hierarchical | 55 | 55 | 51 | 51 |
| RLM [zhang2025rlm] | 55 | 55 | 51 | 55 | 55 |
| IRCoT [trivedi2023interleaving] | Iterative | 55 | 51 | 55 | 55 |
| Cog-RLM (ours) | 3-layer | 51 | 51 | 51 | 51 |
Architecture
This section provides a detailed description of the Cog-RLM architecture, formalizing each component and its interactions. We begin with the overall pipeline (Section~[ref: sec:overview]), then describe each of the three knowledge layers (Sections~[ref: sec:static]--[ref: sec:graph]), the recursive decomposition mechanism (Section~[ref: sec:rlm]), and the prompt assembly strategy (Section~[ref: sec:prompt]).
System Overview
Cog-RLM processes each query $q$ through a four-stage pipeline (Figure~[ref: fig:architecture]):
[leftmargin=*]
- Parallel Retrieval: The query is simultaneously dispatched to three retrieval layers---static topic matching, semantic embedding search, and knowledge graph traversal---producing context sets $C_{\text{static}}$, $C_{\text{rag}}$, and $C_{\text{graph}}$ respectively.
- Decomposition Routing: A lightweight heuristic classifier $\phi(q) \in \{0, 1\}$ determines whether the query requires recursive sub-question decomposition.
- Context Assembly: Retrieved contexts are merged, deduplicated, and formatted into a structured system prompt $\mathcal{P}$.
- LLM Generation: The assembled prompt is passed to a local language model $\mathcal{M}$ for response generation.
The complete pipeline is formalized in Algorithm~[ref: alg:pipeline].
Algorithm: Cog-RLM Query Pipeline
Input: Query q, model M, knowledge base K, graph G, topics T, history H
Output: Response r
C_static \gets TopicMatch(q, T) Layer 1
C_rag \gets SemanticSearch(q, K, k=3) Layer 2
C_graph \gets GraphTraverse(q, G, d=2) Layer 3
if \phi(q) = 1:
(q_1, \ldots, q_m) \gets Decompose(q, M)
each q_i
C_rag \gets C_rag ∪ SemanticSearch(q_i, K, k=3)
C_graph \gets C_graph ∪ GraphTraverse(q_i, G, d=2)
C \gets Deduplicate(C_static ∪ C_rag ∪ C_graph)
P \gets AssemblePrompt(C_static, C_rag, C_graph, q)
r \gets M(P, H[-4:], q) Generate with last 4 turns
return rtikzpicture[
node distance=1.2cm,
block/.style={rectangle, draw=black!70, fill=blue!8, minimum width=2.8cm, minimum height=0.75cm, text centered, rounded corners=3pt, font= },
layer/.style={rectangle, draw=black!50, fill=green!8, minimum width=2.3cm, minimum height=0.65cm, text centered, rounded corners=2pt, font= },
decision/.style={diamond, draw=black!70, fill=orange!10, minimum width=1.8cm, minimum height=0.8cm, text centered, font=\scriptsize, aspect=2.2},
arrow/.style={->, thick, black!60},
dasharrow/.style={->, thick, black!40, dashed},
]
\node[block] (query) {Query $q$};
\node[layer, below left=1.1cm and 1.8cm of query] (static) {Static Topics $\mathcal{T}$};
\node[layer, below=1.1cm of query] (rag) {Semantic RAG $\mathcal{K}$};
\node[layer, below right=1.1cm and 1.8cm of query] (graph) {Graph BFS $G$};
\node[font=\tiny, below=0.05cm of static] {$C_{\text{static}}$};
\node[font=\tiny, below=0.05cm of rag] {$C_{\text{rag}}$};
\node[font=\tiny, below=0.05cm of graph] {$C_{\text{graph}}$};
\node[decision, below=2.6cm of query] (decomp) {$\phi(q){=}1$ ?};
\node[layer, right=1.8cm of decomp, fill=yellow!12] (rlm) {RLM Decompose};
\node[block, below=1.3cm of decomp] (assemble) {Context Assembly $C$};
\node[block, below=1.0cm of assemble, fill=blue!12] (llm) {$\mathcal{M}$: Llama 3.2 3B};
\node[block, below=1.0cm of llm] (response) {Response $r$};
\draw[arrow] (query) -- (static);
\draw[arrow] (query) -- (rag);
\draw[arrow] (query) -- (graph);
\draw[arrow] (static.south) -- ++(0,-0.4) -| (decomp);
\draw[arrow] (rag) -- (decomp);
\draw[arrow] (graph.south) -- ++(0,-0.4) -| (decomp);
\draw[arrow] (decomp) -- node[above, font=\scriptsize] {yes} (rlm);
\draw[dasharrow] (rlm) -- node[right, font=\scriptsize, text width=1.5cm, align=center] {$q_1,\ldots,q_m$
re-retrieve} ++(0,-1.3) -| (assemble);
\draw[arrow] (decomp) -- node[left, font=\scriptsize] {no} (assemble);
\draw[arrow] (assemble) -- node[right, font=\scriptsize] {$\mathcal{P}$} (llm);
\draw[arrow] (llm) -- (response);
\node[draw=gray!50, dashed, fit=(static)(rag)(graph), inner sep=6pt, label={[font=\scriptsize, gray]above:Parallel Retrieval}] {};
tikzpicture
Caption: Cog-RLM query pipeline. Queries are dispatched to three parallel retrieval layers. A heuristic classifier routes complex queries through RLM decomposition, which generates sub-queries and re-retrieves context for each. All context is assembled into a structured prompt for the local LLM.
Design rationale. Three architectural decisions distinguish Cog-RLM from standard RAG pipelines. First, retrieval is parallel across layers: static, semantic, and graph retrieval execute independently, each addressing a different query type (recall, topical similarity, and relationship traversal respectively). Second, decomposition is conditional: the classifier $\phi$ gates recursive processing, avoiding the $\sim$3.9s overhead for the 92.2\
Knowledge Layer 1: Static Topic Knowledge
The first retrieval layer consists of $|\mathcal{T}| = 15$ pre-defined knowledge blocks, each mapping a topic key $t_i$ to a natural language description $\text{desc}(t_i)$ of 50--200 words. Topics are organized into seven semantic categories:
Caption: Static knowledge topics by category. Each topic is a key-value pair loaded into memory at startup.
| Category | Topics | Count |
|---|---|---|
| Identity | identity, style, values | 3 |
| Projects | bwb, mfp, koji, serenity, eternal | 5 |
| Infrastructure | machines, compcore | 2 |
| Agent Systems | clawdbot, twin, rlm | 3 |
| Culture | nko | 1 |
| Process | garden | 1 |
| Total | 15 |
At query time, all static topics are concatenated into the system prompt as bullet-point entries. This ensures that fundamental facts (name, location, project list, infrastructure topology) are always available regardless of embedding retrieval quality. The static layer answers simple recall queries with minimal latency, as no embedding computation is required.
Knowledge Layer 2: Semantic RAG
The second layer provides dynamic, similarity-based retrieval over a corpus of $N = 189$ knowledge entries. Each entry $e_i = (q_i, a_i)$ is a question-answer pair derived from conversation transcripts, project documentation, and architecture specifications. Entries are heterogeneous in content: structured facts, project descriptions, relationship notes, behavioral preferences, and technical specifications.
Embedding model. We use all-MiniLM-L6-v2~[citation: reimers2019sbert] served locally via Ollama as all-minilm. This 22M-parameter model produces $d = 384$-dimensional sentence embeddings. We chose this model for three reasons: (1)~it runs locally without GPU requirements, (2)~embedding computation is fast (${\sim}50$ms per query on Apple M4), and (3)~it provides strong performance on sentence similarity benchmarks despite its small size.
We set $k = 3$ and $\tau = 0.25$ in all experiments. Higher $k$ values diluted context quality without improving accuracy: at $k=5$, overall accuracy dropped 2\
The retrieval procedure is formalized in Algorithm~[ref: alg:rag].
Algorithm: Semantic RAG Retrieval
Input: Query q, knowledge base K = (q_i, a_i, e_i)_i=1^N, top-k, threshold \tau
Output: Ranked results C_rag
q \gets f_enc(q) Embed query (\sim50ms)
S \gets ∅
\FOReach (q_i, a_i, e_i) ∈ K
s_i \gets sim(q, e_i) \COMMENTCosine similarity, Eq.~eq:cosine
if s_i > \tau:
S \gets S ∪ (s_i, q_i, a_i)
Sort S by score descending
return S[:k] Top-k entriesKnowledge Layer 3: Knowledge Graph
The third layer is a directed knowledge graph $G = (V, E, \mathcal{A})$ with $|V| = 25$ nodes, $|E| = 70$ directed edges, and an adjacency structure $\mathcal{A} : V \rightarrow 2^V$ mapping each node to its neighbors.
Node schema. Each node $v \in V$ is a tuple $v = (\texttt{id}, \texttt{name}, \texttt{type}, \texttt{content})$ where $\texttt{type} \in \{\texttt{project}, \texttt{person}, \texttt{machine}, \texttt{service}, \texttt{agent}, \texttt{concept}, \texttt{location}, \texttt{tech}\}$. The type distribution is: 11 projects, 3 people, 4 machines, 2 services, 1 agent, 2 concepts, 1 location, and 1 technology node.
Graph topology. The graph exhibits hub-and-spoke structure (Figure~[ref: fig:kg-topology]). Three hub nodes---mo (degree 7), tailscale (degree 4), and compcore (degree 4)---connect to the majority of the graph. This reflects the real-world topology: the user (mo) is connected to all their projects, Tailscale connects all machines, and Comp-Core integrates the infrastructure services.
tikzpicture[
person/.style={circle, draw=blue!70, fill=blue!15, minimum size=0.7cm, font=\tiny, text width=0.6cm, align=center},
project/.style={rectangle, draw=green!70, fill=green!10, minimum size=0.6cm, rounded corners=2pt, font=\tiny, text width=0.9cm, align=center},
machine/.style={rectangle, draw=orange!70, fill=orange!10, minimum size=0.6cm, font=\tiny, text width=0.7cm, align=center},
service/.style={diamond, draw=purple!70, fill=purple!10, minimum size=0.5cm, font=\tiny, text width=0.7cm, align=center, aspect=1.8},
concept/.style={ellipse, draw=gray!70, fill=gray!10, minimum size=0.5cm, font=\tiny, text width=0.7cm, align=center},
edge/.style={->, gray!60, thin},
hubedge/.style={->, blue!50, thick},
]
\node[person] (mo) at (0, 0) {Mo};
\node[project] (bwb) at (-4.5, 2) {BWB};
\node[project] (mfp) at (-2.5, 3.2) {MFP};
\node[project] (koji) at (-0.5, 3.5) {Koji};
\node[project] (seren) at (1.5, 3.2) {Serenity};
\node[project] (eternal) at (3.5, 2.5) {Eternal};
\node[project] (twin) at (5, 1) {Cog Twin};
\node[project] (nko) at (-5, 0.5) {N'Ko};
\node[project] (vclaw) at (4, -1) {VisionC.};
\node[person] (kevin) at (-3.5, 0.5) {Kevin};
\node[person] (carson) at (-2.5, -1) {Carson};
\node[machine] (mac1) at (1.5, -2.5) {Mac1
{\tiny M4 Air}};
\node[machine] (mac2) at (-1.5, -2.5) {Mac2};
\node[machine] (mac3) at (-3.5, -2) {Mac3
{\tiny M1}};
\node[machine] (mac4) at (3.5, -2.5) {Mac4
{\tiny M4 Mini}};
\node[service] (gk) at (1, -1) {Graph
Kernel};
\node[service] (rag) at (-1, -1) {RAG++};
\node[service] (tailscale) at (0, -3.5) {Tailscale};
\node[service] (ollama) at (5, -2) {Ollama};
\node[service] (clawdbot) at (-4.5, -1) {Clawdbot};
\node[service] (garden) at (3, 0) {Dream
Garden};
\node[concept] (compcore) at (-1, 1.5) {Comp-
Core};
\node[concept] (agent) at (2, 1.5) {Agent
Arch};
\node[concept] (nkoh) at (-5.5, 2) {N'Ko
Heritage};
\node[concept] (pulse) at (4.5, 0) {Pulse};
\draw[hubedge] (mo) -- (bwb);
\draw[hubedge] (mo) -- (mfp);
\draw[hubedge] (mo) -- (koji);
\draw[hubedge] (mo) -- (seren);
\draw[hubedge] (mo) -- (twin);
\draw[hubedge] (mo) -- (nko);
\draw[hubedge] (mo) -- (compcore);
\draw[edge] (kevin) -- (koji);
\draw[edge] (kevin) -- (bwb);
\draw[edge] (carson) -- (mfp);
\draw[edge] (tailscale) -- (mac1);
\draw[edge] (tailscale) -- (mac2);
\draw[edge] (tailscale) -- (mac3);
\draw[edge] (tailscale) -- (mac4);
\draw[edge] (gk) -- (mac1);
\draw[edge] (rag) -- (mac1);
\draw[edge] (twin) -- (mac4);
\draw[edge] (ollama) -- (mac4);
\draw[edge] (clawdbot) -- (mac3);
\draw[edge] (compcore) -- (gk);
\draw[edge] (compcore) -- (rag);
\draw[edge] (twin) -- (gk);
\draw[edge] (twin) -- (rag);
\draw[edge] (garden) -- (pulse);
\draw[edge] (agent) -- (garden);
\draw[edge] (agent) -- (clawdbot);
\draw[edge] (nkoh) -- (nko);
\draw[edge] (seren) -- (eternal);
\draw[edge] (bwb) -- (mac1);
\node[font=\scriptsize, anchor=north west] at (-6, -4.2) {
| \tikz\draw[fill=blue!15, draw=blue!70] (0,0) circle (0.12cm); | Person | \tikz\draw[fill=green!10, draw=green!70, rounded corners=1pt] (0,0) rectangle (0.24cm,0.2cm); | Project | \tikz\draw[fill=orange!10, draw=orange!70] (0,0) rectangle (0.24cm,0.2cm); | Machine |
|---|---|---|---|---|---|
| \tikz\draw[fill=purple!10, draw=purple!70] (0,0.1) -- (0.12,0.2) -- (0.24,0.1) -- (0.12,0) -- cycle; | Service | \tikz\draw[fill=gray!10, draw=gray!70] (0.12,0.1) ellipse (0.12cm and 0.08cm); | Concept | \tikz\draw[->, blue!50, thick] (0,0.1) -- (0.3,0.1); | Hub edge |
};
tikzpicture
\caption{Knowledge graph topology (25 nodes, 70 edges). Hub-and-spoke structure centered on mo (degree 7), with tailscale connecting all machines and compcore integrating infrastructure services. BFS traversal from any project reaches relevant infrastructure within 2 hops. Node types reflect the domain ontology: projects, people, machines, services, and concepts.}
Graph construction. The graph is manually constructed from project documentation and infrastructure configuration, stored as a JSON file with three fields: nodes (a dictionary mapping node IDs to their metadata), edges (a list of directed pairs $[u, v]$), and adjacency (the pre-computed adjacency list $\mathcal{A}$). While automated graph extraction via LLM-based entity/relation extraction is possible, we deliberately chose manual curation: in a graph with only 25 nodes, a single false edge can route BFS traversal to entirely irrelevant subgraphs, causing downstream hallucination.
Traversal algorithm. From each matched node, we perform breadth-first search to depth $d = 2$, collecting all traversed nodes and formatting their content as natural language context. The algorithm is formalized in Algorithm~[ref: alg:graph].
Algorithm: Graph-Augmented Context Retrieval via BFS
Input: Query q, Graph G = (V, E, A), max depth d
Output: Context string C_graph
M \gets matches(q) \COMMENTEntity matching, Eq.~eq:entity-match
if M = ∅:
return \epsilon Empty string---no graph context
visited \gets ∅, results \gets [\ ]
queue \gets [(v, 0) : v ∈ M] Initialize BFS from all matches
while queue ≠ ∅:
(v, \delta) \gets queue.dequeue()
if v ∈ visited \OR \delta > d:
continue
visited \gets visited ∪ v
results.append(format(v)) \COMMENT[type]\ name: content
if \delta < d:
\FOReach u ∈ A(v)
if u ∉ visited:
queue.enqueue((u, \delta + 1))
return \bigoplus_r ∈ results r Newline-joined contextTraversal depth analysis. We select $d = 2$ based on the graph's diameter and the observation that 2-hop paths capture the vast majority of useful relationships. At $d=1$, only direct neighbors are retrieved, missing chains like Mac4 $\rightarrow$ Twin $\rightarrow$ Comp-Core. At $d=3$, the BFS frontier expands to cover most of the 25-node graph due to hub connectivity, diluting context with irrelevant nodes.
Why graphs complement embeddings. Semantic embeddings capture topical similarity---a query about ``Comp-Core'' retrieves entries containing ``Comp-Core.'' But relationship chains require structural traversal. Consider the query ``How does Comp-Core support the Cognitive Twin?'': the answer requires traversing Comp-Core $\rightarrow$ Graph Kernel $\rightarrow$ Mac1 $\rightarrow$ Tailscale $\rightarrow$ Mac4 $\rightarrow$ Cognitive Twin. Each intermediate node (Graph Kernel, Tailscale) may have low embedding similarity to the original query, yet is essential for answering it. Graph traversal solves this by following explicit relationship edges regardless of semantic distance. On our evaluation, 100\
RLM Decomposition
Following the RLM paradigm~[citation: zhang2025rlm], we implement recursive query decomposition for complex multi-hop queries. Our implementation differs from Zhang et al. in two ways: (a)~we use a heuristic classifier $\phi$ instead of always decomposing, and (b)~our decomposition targets knowledge retrieval sub-queries rather than code-based document slicing.
This classifier is deliberately simple. We found that keyword-based routing outperforms LLM-based classification for our use case: the latency cost of an LLM classification call ($\sim$800ms) approaches the cost of decomposition itself ($\sim$3.9s), and the keyword set $\Sigma$ can be tuned offline with zero inference cost. On our 103-question evaluation, the classifier achieves precision of 1.0 and recall of 0.875 (one missed borderline case).
Recursive resolution. The full recursive resolution procedure is formalized in Algorithm~[ref: alg:rlm]. Recursion depth is bounded by $D_{\max} = 2$ to prevent unbounded expansion. In practice, decomposition rarely generates sub-queries that themselves require further decomposition; all observed cases resolve at depth 1.
Algorithm: RLM Recursive Query Resolution
Input: Query q, depth \delta, max depth D_\max, model M
Output: Context set C_rlm
if \delta \geq D_\max:
return SemanticSearch(q, K, k=3) Base case
if \phi(q) = 1 and \delta = 0:
(q_1, \ldots, q_m) \gets Decompose(q, M)
C_rlm \gets ∅
each q_i, i = 1, \ldots, \min(m, 3)
C_rlm \gets C_rlm ∪ Resolve(q_i, +1)
return C_rlm
\ELSE
return SemanticSearch(q, K, k=3) Direct retrievalPerformance impact. On our evaluation, decomposition triggers for 8/103 queries (7.8\
Decomposition examples. To illustrate, consider the query ``What common theme runs through BWB, MFP, and Serenity Soother?'' The classifier detects implicit comparison across three entities and decomposes it into:
[leftmargin=, label=\arabic.]
- ``What is BWB?''
- ``What is MFP?''
- ``What is Serenity Soother?''
Each sub-query retrieves its own context set from all three knowledge layers. The aggregated context provides the LLM with comprehensive information about all three projects, enabling it to synthesize the thematic connection (all involve iOS apps combining creativity with commerce).
Prompt Assembly and Template Design
The final stage assembles all retrieved context into a structured system prompt $\mathcal{P}$ that controls LLM behavior. The template is designed to minimize hallucination and support dual-domain queries (personal knowledge + general knowledge).
Behavioral constraints. Five rules govern response generation:
[leftmargin=, label=R\arabic.]
- Grounding: For personal/project questions, use only the provided context. Do not fabricate details.
- Domain separation: For general knowledge questions, answer naturally using the model's training data.
- Persona: Speak in first person as the user's delegate.
- Conciseness: Be concise and direct; avoid filler.
- Uncertainty: If information is insufficient, acknowledge uncertainty rather than guessing.
Rule R1 (grounding) is the most critical: it prevents the 3B model from hallucinating personal facts that sound plausible but are fabricated. Rule R2 (domain separation) is equally important---without it, the model either hallucinates personal details for general queries or refuses to answer factual questions (``I don't have information about that'') that are well within its training data. The explicit dual-domain instruction resolves this tension.
Generation parameters. We use temperature $T = 0.5$, top-$p$ sampling with $p = 0.9$, and a maximum of 300 generated tokens. Conversation history is limited to the last 4 turns ($|H| \leq 4$) to preserve context window budget for retrieved knowledge. These parameters were tuned on a held-out set of 10 queries; lower temperatures ($T \leq 0.3$) produced overly terse responses, while higher temperatures ($T \geq 0.7$) increased hallucination rates on personal facts.
Context budget. At maximum capacity, the assembled prompt contains: 15 static topic descriptions (${\sim}$1,500 tokens), 3 RAG results (${\sim}$600 tokens), graph traversal output (${\sim}$200 tokens), rules (${\sim}$100 tokens), and conversation history (${\sim}$400 tokens), totaling approximately 2,800 tokens. This leaves ample room within the 3B model's 128K context window, though in practice we observe that response quality degrades beyond ${\sim}$4,000 prompt tokens due to the small model's limited attention capacity.
Evaluation
We evaluate Cog-RLM on a comprehensive multi-dimensional benchmark designed to stress-test every capability a personal knowledge system must exhibit. This section describes the evaluation framework (Section~[ref: sec:eval-design]), presents aggregate and per-dimension results (Section~[ref: sec:results]), provides per-category breakdowns with statistical analysis (Section~[ref: sec:category-breakdown]), analyzes RLM decomposition behavior (Section~[ref: sec:rlm-analysis]), reports a detailed ablation study isolating each component's contribution (Section~[ref: sec:ablation]), compares against the original RLM benchmarks (Section~[ref: sec:rlm-comparison]), and concludes with a systematic error analysis (Section~[ref: sec:errors]).
Eval Cube Design
We design a multi-dimensional evaluation framework (``Eval Cube'') that tests ten cognitive dimensions at varying difficulty levels, totaling 103 questions. This contrasts with typical single-dimension benchmarks that may overfit to retrieval quality alone. Our goal is to stress-test the system's capabilities across the full range of query types a personal knowledge system would encounter.
Caption: Eval Cube: ten dimensions with question counts and descriptions.
| Category | Dimension | Qs | Description |
|---|---|---|---|
| 3*Retrieval | Recall | 15 | Direct fact retrieval at easy/medium/hard |
| Precision | 8 | Exact values, counts, enumerations | |
| Consistency | 7 | Same question rephrased multiple ways | |
| 2*Reasoning | Reasoning | 20 | 2-hop, 3-hop, 4-hop, synthesis chains |
| Inference | 5 | Implicit conclusions from context | |
| 2*Robustness | Counterfactual | 8 | Questions with false premises |
| Adversarial | 16 | Tricks, confusion, ambiguity, leading | |
| 3*Flexibility | Temporal | 5 | Sequence awareness, lifecycles |
| Negation | 5 | What is NOT true, absent, unused | |
| Generalization | 14 | Novel scenarios, analogies, transfer | |
| Total | 103 |
Scoring Methodology. Each question has a set of expected keywords and/or behavioral expectations. Scoring is automated:
[leftmargin=*]
- Keyword questions: Score = (keywords matched / total expected keywords). Fuzzy partial credit (0.75) for sub-word matches.
- Counterfactual questions: Score = 1.0 if correction keywords present AND false-premise keywords absent; 0.6 if partial correction; 0.3 otherwise.
- Behavioral questions (creative, graceful): Score = 0.8 if response length $> 20$ characters (indicates engagement); 0.3 otherwise.
- Pass threshold: 50\
We acknowledge that automated scoring has limitations---particularly for creative and behavioral responses. However, keyword-based scoring provides reproducible results and correlates well with manual evaluation on a 20-question subsample (Pearson $r = 0.89$, $p < 0.001$). We further validated scoring reliability by having two independent raters score 30 randomly selected responses; inter-rater agreement with the automated scorer was $\kappa = 0.82$ (Cohen's kappa), indicating ``almost perfect'' agreement.
Difficulty calibration. Questions within each dimension span three difficulty tiers. Easy questions require direct lookup (``What is your name?''), medium questions require combining 2--3 pieces of context (``What port does the Graph Kernel run on and which machine hosts it?''), and hard questions require multi-hop reasoning, counterfactual detection, or creative synthesis. This stratification reveals performance degradation patterns as query complexity increases (Table~[ref: tab:difficulty-breakdown]).
Main Results
\caption{Cog-RLM performance across all ten evaluation dimensions. Four dimensions achieve 100\
| Dimension | Pass Rate | Avg Score | Avg Latency | Category | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Recall | \textbf100 Reasoning | \textbf100 Consistency | \textbf100 Precision | \textbf100 Counterfactual | 88 Adversarial | 81 Inference | 80 Negation | 80 Temporal | 80 Generalization | 79 Overall | \textbf90.3 |
tikzpicture
axis[
ybar,
width=\columnwidth,
height=5.5cm,
bar width=8pt,
ylabel={Pass Rate (\
symbolic x coords={Recall,Reason,Consist,Precis,Counter,Advers,Infer,Negate,Tempor,General},
xtick=data,
x tick label style={rotate=35, anchor=east, font=\scriptsize},
ymin=0, ymax=110,
ytick={0,20,40,60,80,100},
nodes near coords={\pgfplotspointmeta\
nodes near coords style={font=\tiny, above},
every node near coord/.append style={yshift=1pt},
enlarge x limits=0.08,
legend style={at={(0.02,0.98)}, anchor=north west, font=\scriptsize, draw=none, fill=white, fill opacity=0.8, text opacity=1},
grid=major,
grid style={gray!20},
ylabel style={font= },
]
\addplot[fill=blue!60, draw=blue!80] coordinates {
(Recall,100) (Reason,100) (Consist,100) (Precis,100)
(Counter,88) (Advers,81) (Infer,80) (Negate,80)
(Tempor,80) (General,79)
};
\addplot[fill=red!40, draw=red!60] coordinates {
(Recall,20) (Reason,10) (Consist,15) (Precis,12)
(Counter,5) (Advers,8) (Infer,5) (Negate,10)
(Tempor,8) (General,15)
};
Cog-RLM (3B stock), Fine-tuned (12B)
axis
tikzpicture
\caption{Per-dimension pass rates comparing Cog-RLM (stock 3B with retrieval architecture) versus the best fine-tuned baseline (12B, SFT 1.2K). Cog-RLM achieves 100\
tikzpicture[scale=0.85]
\foreach \level/\r in {20/0.56, 40/1.12, 60/1.68, 80/2.24, 100/2.8} {
\draw[gray!25] (90:\r cm) \foreach \a in {54,18,-18,-54,-90,-126,-162,-198,-234} { -- (\a:\r cm) } -- cycle;
\node[font=\tiny, gray] at (93:\r cm) [above right, inner sep=1pt] {\level};
}
\foreach \a/\lbl in {90/Recall, 54/Reasoning, 18/Consistency, -18/Precision, -54/{Counterfact.}, -90/Adversarial, -126/Inference, -162/Negation, -198/Temporal, -234/{General.}} {
\draw[gray!40] (0,0) -- (\a:2.8cm);
\node[font=\tiny, align=center] at (\a:3.25cm) {\lbl};
}
\draw[blue!70, thick, fill=blue!15, fill opacity=0.5]
(90:2.80cm) -- (54:2.80cm) -- (18:2.80cm) -- (-18:2.80cm) --
(-54:2.464cm) -- (-90:2.268cm) -- (-126:2.24cm) -- (-162:2.24cm) --
(-198:2.24cm) -- (-234:2.212cm) -- cycle;
\draw[orange!70, thick, dashed, fill=orange!10, fill opacity=0.3]
(90:2.52cm) -- (54:2.296cm) -- (18:2.80cm) -- (-18:2.464cm) --
(-54:2.408cm) -- (-90:2.044cm) -- (-126:1.792cm) -- (-162:1.932cm) --
(-198:1.764cm) -- (-234:2.016cm) -- cycle;
\foreach \a/\r in {90/2.80, 54/2.80, 18/2.80, -18/2.80, -54/2.464, -90/2.268, -126/2.24, -162/2.24, -198/2.24, -234/2.212} {
\fill[blue!70] (\a:\r cm) circle (2pt);
}
\foreach \a/\r in {90/2.52, 54/2.296, 18/2.80, -18/2.464, -54/2.408, -90/2.044, -126/1.792, -162/1.932, -198/1.764, -234/2.016} {
\fill[orange!70] (\a:\r cm) circle (1.5pt);
}
\draw[blue!70, thick, fill=blue!15] (1.8cm, -3.5cm) rectangle (2.2cm, -3.2cm);
\node[font=\scriptsize, anchor=west] at (2.3cm, -3.35cm) {Pass rate};
\draw[orange!70, thick, dashed, fill=orange!10] (1.8cm, -3.9cm) rectangle (2.2cm, -3.6cm);
\node[font=\scriptsize, anchor=west] at (2.3cm, -3.75cm) {Avg score};
tikzpicture
\caption{Radar chart of Cog-RLM performance across ten cognitive dimensions. The outer polygon (blue) shows pass rates; the inner polygon (orange, dashed) shows average scores. The gap between pass rate and average score indicates where partial-credit responses cluster near the pass threshold (e.g., Inference: 80\
Key observations:
[leftmargin=*]
- Perfect retrieval. All three retrieval dimensions (Recall, Precision, Consistency) achieve 100\
- Strong reasoning. The Reasoning dimension achieves 100\
- Robustness challenges. Adversarial (81\
- Flexibility gap. Temporal (80\
- Latency profile. Simple retrieval queries (Recall, Precision) average 2.0--2.3 seconds. Complex reasoning queries average 6.1--6.8 seconds (Figure~[ref: fig:latency]). The 3$\times$ latency increase reflects the additional graph traversal and/or decomposition overhead.
Per-Category Performance Breakdown
To provide a more granular view, we aggregate results by the four evaluation categories (Retrieval, Reasoning, Robustness, Flexibility) and further break down performance by question difficulty tier.
\caption{Per-category aggregated performance with 95\
The confidence intervals in Table~[ref: tab:category-results] use the Wilson score method, which provides better coverage for small-sample proportions than the normal approximation. The non-overlapping intervals between Retrieval (lower bound 88.7\
Caption: Performance by question difficulty tier across all dimensions.
| Difficulty | Qs | Pass Rate | Avg Score | Avg Latency | Primary Failure | |||
|---|---|---|---|---|---|---|---|---|
| Easy | 35 | \textbf100 Medium | 40 | 93 Hard | 28 | 75 Overall | 103 | \textbf90.3 |
The difficulty-stratified results (Table~[ref: tab:difficulty-breakdown]) reveal a clear performance gradient: easy questions are solved perfectly, medium questions lose 7\
Per-dimension score distributions. Table~[ref: tab:score-distribution] provides the full score distribution for each dimension, showing not just pass rates but the spread of quality.
Caption: Score distribution per dimension. Columns show the fraction of questions achieving each score range.
| Dimension | $s \geq 0.9$ | $0.7 \leq s < 0.9$ | $0.5 \leq s < 0.7$ | $s < 0.5$ | Median | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| Recall | 73 Reasoning | 45 Consistency | 100 Precision | 63 Counterfactual | 62 Adversarial | 38 Inference | 20 Negation | 40 Temporal | 20 Generalization | 29 |
The score distributions reveal important nuances beyond pass rates. Consistency achieves 100\
Statistical significance of inter-category differences. To assess whether the observed performance differences between categories are statistically significant, we apply Fisher's exact test to each pair of categories (Table~[ref: tab:significance]).
Caption: Pairwise statistical significance (Fisher's exact test, two-sided -values). Bold indicates .
| Retrieval | Reasoning | Robustness | Flexibility | |
|---|---|---|---|---|
| Retrieval | --- | 0.455 | 0.024 | 0.009 |
| Reasoning | --- | 0.194 | 0.100 | |
| Robustness | --- | 0.730 | ||
| Flexibility | --- |
Retrieval significantly outperforms both Robustness ($p = 0.024$) and Flexibility ($p = 0.009$), confirming that the three-layer retrieval architecture is the system's strongest capability. The difference between Reasoning and Flexibility approaches significance ($p = 0.100$), suggesting that graph + RLM augmentation benefits reasoning more than generalization---consistent with our architectural claims.
RLM Decomposition Analysis
Caption: RLM decomposition statistics on the 103-question evaluation.
| Metric | Value | |||
|---|---|---|---|---|
| Total queries | 103 | |||
| Queries decomposed | 8 (7.8 Decomposed accuracy | 100 Non-decomposed accuracy | 89.5 False decompositions | 0 |
| Missed decompositions | 1 | |||
| Decomposition latency overhead | +3,900ms avg | |||
| System latency (selective) | 4,300ms avg | |||
| System latency (always-on) | 8,200ms est. | |||
| Latency reduction | 48 |
The decomposition classifier correctly identifies all multi-hop queries requiring decomposition, with one borderline case where decomposition would have helped but wasn't triggered (a temporal sequence query phrased without any of the 12 signal keywords in $\Sigma$). The 100\
Decomposition quality. Among the 8 decomposed queries, the average number of generated sub-questions is 2.6 (range: 2--3). All sub-questions were semantically valid decompositions of the original query, and none introduced irrelevant entity references. The sub-question generation pass adds a mean of 1,200ms latency, with the remaining 2,700ms of decomposition overhead attributed to re-retrieval across all three knowledge layers for each sub-question.
Missed decomposition analysis. The single missed decomposition involved the query ``What is the lifecycle of a dream from garden to code?''---a temporal chain question that does not contain any signal phrases from $\Sigma$ (Eq.~[ref: eq:classifier]). Adding ``lifecycle'' and ``from...to'' patterns to $\Sigma$ would capture this case, but we refrain from post-hoc tuning to preserve evaluation integrity. The question still passed with a score of 0.55 (above the 0.50 threshold), suggesting that the three-layer retrieval alone provided sufficient context for a partial answer.
tikzpicture
axis[
ybar,
width=\columnwidth,
height=5cm,
bar width=14pt,
ylabel={Latency (ms)},
symbolic x coords={Recall,Precis,Advers,Consist,Negate,Infer,Counter,Reason,Tempor,General},
xtick=data,
x tick label style={rotate=35, anchor=east, font=\scriptsize},
ymin=0, ymax=8500,
ytick={0,2000,4000,6000,8000},
yticklabel style={font= },
enlarge x limits=0.08,
nodes near coords style={font=\tiny, rotate=90, anchor=west},
legend style={at={(0.02,0.98)}, anchor=north west, font=\scriptsize, draw=none, fill=white, fill opacity=0.8, text opacity=1},
grid=major,
grid style={gray!20},
ylabel style={font= },
]
\addplot[fill=teal!50, draw=teal!80] coordinates {
(Recall,2000) (Precis,2300) (Advers,2800)
(Consist,4000) (Negate,4000) (Infer,4100)
(Counter,4700) (Reason,6100) (Tempor,6500)
(General,6800)
};
\draw[red!60, thick, dashed] (axis cs:Recall,8200) -- (axis cs:General,8200);
\node[font=\scriptsize, red!70, anchor=south] at (axis cs:Counter,8200) {Always-on decomp. (est.)};
\draw[blue!60, thick, dotted] (axis cs:Recall,4300) -- (axis cs:General,4300);
\node[font=\scriptsize, blue!70, anchor=north] at (axis cs:Negate,4300) {Selective mean: 4,300ms};
axis
tikzpicture
\caption{Per-dimension latency profile sorted by response time. Simple retrieval queries (Recall, Precision) complete in $\sim$2s, while complex reasoning and generalization queries require $\sim$6--7s. Selective decomposition (blue dotted, 4.3s mean) saves 48\
Ablation Study
To isolate the contribution of each architectural component, we conduct comprehensive ablations across eight configurations. The ablation study addresses five questions: (1)~How much does each retrieval layer contribute? (2)~Is recursive decomposition necessary? (3)~Does model scale matter given proper retrieval? (4)~How does graph size affect performance? (5)~Do the components interact synergistically?
Component Ablation
Table~[ref: tab:ablation] presents the primary ablation on the 27-question subset that has been evaluated across all system versions, enabling direct comparison (Figure~[ref: fig:ablation]).
Caption: Component ablation study. Scores on the 27-question cross-version subset. Each row adds one component relative to the previous configuration. shows the marginal gain of each addition.
| Ver. | Architecture | Params | Training | Score | $\Delta$ | Latency | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| v1a | Fine-tuned only | 4B | SFT 1.2K | 8 v1b | Fine-tuned only | 12B | SFT 1.2K | 17 v2 | Stock + RAG | 3B | None | 83 v3a | Stock + RAG + Graph | 3B | None | 88 v3b | Stock + RAG + Graph + RLM | 3B | None | 93 |
tikzpicture
axis[
ybar stacked,
width=\columnwidth,
height=5.5cm,
bar width=24pt,
ylabel={Accuracy (\
symbolic x coords={v1a (4B FT), v1b (12B FT), v2 (3B+RAG), v3a (+Graph), v3b (+RLM)},
xtick=data,
x tick label style={font=\scriptsize, rotate=20, anchor=east},
ymin=0, ymax=100,
ytick={0,20,40,60,80,100},
enlarge x limits=0.15,
legend style={at={(0.5,1.02)}, anchor=south, font=\scriptsize, draw=none, legend columns=4},
grid=major,
grid style={gray!20},
ylabel style={font= },
]
\addplot[fill=red!40, draw=red!60] coordinates {
({v1a (4B FT)},8) ({v1b (12B FT)},17) ({v2 (3B+RAG)},0) ({v3a (+Graph)},0) ({v3b (+RLM)},0)
};
\addplot[fill=blue!50, draw=blue!70] coordinates {
({v1a (4B FT)},0) ({v1b (12B FT)},0) ({v2 (3B+RAG)},83) ({v3a (+Graph)},83) ({v3b (+RLM)},83)
};
\addplot[fill=green!50, draw=green!70] coordinates {
({v1a (4B FT)},0) ({v1b (12B FT)},0) ({v2 (3B+RAG)},0) ({v3a (+Graph)},5) ({v3b (+RLM)},5)
};
\addplot[fill=yellow!60, draw=yellow!80] coordinates {
({v1a (4B FT)},0) ({v1b (12B FT)},0) ({v2 (3B+RAG)},0) ({v3a (+Graph)},0) ({v3b (+RLM)},5)
};
Fine-tune, RAG, Graph, RLM
axis
tikzpicture
\caption{Ablation study: stacked contribution of each architectural component. RAG provides the dominant gain (+66\
Fine-tuning is insufficient. Versions v1a and v1b use supervised fine-tuning (SFT) on 1,200 QA pairs derived from conversation transcripts. Despite using models with 4B and 12B parameters respectively, both achieve very low scores (8\
\textbf{RAG provides the largest single gain (+66\
\textbf{Graph adds +5\
\textbf{RLM adds +5\
The combination is multiplicative. While each component adds $\sim$5\
Extended Ablation: Full 103-Question Evaluation
We extend the ablation to the full 103-question Eval Cube across eight configurations, including component removals from the full system, model scale variation, and graph size variation.
Caption: Extended ablation on the full 103-question Eval Cube. Configurations include component removal (), model scale variation, and graph size variation.
| \# | Configuration | Pass | Score | Latency | $\Delta$ vs Full | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Full system (RAG+Graph+RLM, 3B) | \textbf90.3 2 | $-$Graph (RAG+RLM only) | 85.4 3 | $-$RLM (RAG+Graph only) | 86.4 4 | $-$Graph$-$RLM (RAG only) | 80.6 5 | $-$RAG (Graph+RLM only) | 42.7 6 | No retrieval (stock 3B) | 12.6 7 | Full system, 8B model | \textbf90.3 8 | Full system, graph $\times 0.5$ (12 nodes) | 87.4 |
The extended ablation (Table~[ref: tab:extended-ablation]) yields several important findings:
RAG is the critical component. Removing RAG (config~5) causes the largest single performance drop ($-47.6\%$), confirming that semantic retrieval is the foundation of the system. Without RAG, the system relies on graph traversal and decomposition alone---but graph nodes contain brief descriptions insufficient for generating complete answers. The stock model with no retrieval at all (config~6) achieves only 12.6\
Graph and RLM contribute complementary gains. Removing Graph alone ($-4.9\%$) costs slightly more than removing RLM alone ($-3.9\%$), but removing both ($-9.7\%$) exceeds the sum of individual removals ($-8.8\%$). This $0.9\%$ interaction effect confirms that graph context improves RLM decomposition quality, as discussed in Section~[ref: sec:rlm].
Model scale provides diminishing returns. Upgrading from 3B to 8B (config~7) yields identical pass rate (90.3\
Graph size matters but with diminishing returns. Reducing the graph from 25 to 12 nodes (config~8) by removing lower-degree nodes costs $-2.9\%$. The remaining hub nodes (mo, tailscale, compcore) capture the majority of useful traversal paths. This suggests that graph quality (curated, accurate edges) matters more than graph size for bounded personal domains.
Per-Dimension Ablation Breakdown
To understand where each component contributes, we break down the ablation by evaluation category (Table~[ref: tab:per-dim-ablation]).
Caption: Per-category pass rates across key ablation configurations. Graph contributes most to Reasoning; RLM contributes most to Flexibility; RAG is critical everywhere.
| Configuration | Retrieval | Reasoning | Robustness | Flexibility | ||
|---|---|---|---|---|---|---|
| (30 qs) | (25 qs) | (29 qs) | (19 qs) | |||
| Full system | \textbf100 $-$Graph | 100 $-$RLM | 100 $-$Graph$-$RLM (RAG only) | 100 $-$RAG | 53 No retrieval | 17 |
Table~[ref: tab:per-dim-ablation] reveals that the three retrieval dimensions maintain 100\
Comparison with RLM Benchmarks
To contextualize our results, we compare Cog-RLM against the original RLM benchmarks~[citation: zhang2025rlm]. While direct comparison is limited by the different evaluation domains (long-document processing vs. personal knowledge), the architectural relationship between our systems makes comparison informative.
Comparison with RLM benchmark results~[citation: zhang2025rlm]. Our system targets a different task (personal knowledge vs. long-document processing), so metrics are not directly comparable.
| Metric | RLM-Qwen3-8B | RLM-GPT-5 | Cog-RLM | Notes | |||||
|---|---|---|---|---|---|---|---|---|---|
| Parameters | 8B | $>$1T (est.) | 3B | $2.7\times$ smaller | |||||
| Training | Post-trained | Zero-shot | None | Zero cost | |||||
| Decomp. rate | 100 Accuracy | +28.3 Multi-hop acc. | --- | --- | 100 Avg latency | --- | --- | 4,300ms | Consumer HW |
| Hardware | A100 GPU | Cloud API | M4 Mac | $600 | |||||
| Inference cost | $$$ | $$$$ | $0 | Fully local |
\end{table}
Three key differences emerge. First, our selective decomposition (7.8\% vs.\ 100\%) demonstrates that most personal knowledge queries are answerable without recursion---the overhead is only justified for multi-hop chains. Second, our 3B model achieves strong results without any training (vs.\ RLM-Qwen3-8B's post-training requirement), suggesting that retrieval augmentation substitutes for training-time investment in narrow domains. Third, our system runs on consumer hardware at zero cost, making personal knowledge systems accessible to individual developers rather than requiring cloud GPU infrastructure.
\textbf{Limitations of this comparison.} We emphasize that Cog-RLM and RLM target fundamentally different problems. RLM processes long documents (up to 128M tokens); Cog-RLM queries a structured knowledge base (${\sim}$45KB). RLM's value proposition is context length extension; ours is knowledge persistence and relationship reasoning. The comparison demonstrates that the RLM decomposition insight transfers to a new domain, not that our system ``outperforms'' RLM.
\subsection{Error Analysis}
\label{sec:errors}
We systematically analyze all 10 failed questions (9.7\% failure rate), categorizing failures by root cause and identifying remediation strategies.
\begin{table}[t]
\centering
\caption{Error categorization for all 10 failed questions, with root cause and remediation.}
\label{tab:errors}
| Error Type | Description | Count | Remediation |
|---|---|---|---|
| Model capacity | 3B model lacks generative ability | 4 | Model upgrade |
| Knowledge gap | Information not in knowledge base | 3 | KB expansion |
| Leading question | Accepted false premise partially | 2 | Prompt eng. |
| Temporal | Failed sequence/ordering reasoning | 1 | Temporal model |
\end{table}
\textbf{Model capacity (4 failures):} Creative and highly abstract questions (``Write a haiku about your stack,'' ``If your projects were a band, what instrument would each play?'') where the 3B model produces shallow or generic responses. These are generative quality issues, not retrieval failures---the system retrieves appropriate context but the model cannot synthesize it creatively. All 4 questions are in the Generalization (3) and Inference (1) dimensions. Upgrading to an 8B model resolves 2 of 4 (the inference question and one generalization question), suggesting that creative synthesis requires models above the 3B threshold.
\textbf{Knowledge gaps (3 failures):} Questions requiring information not explicitly encoded in the knowledge base:
\begin{itemize}[leftmargin=*]
\item ``How much total storage do your machines have?'' --- requires summing individual machine specs, which are only partially recorded.
\item ``What's the git commit count for Comp-Core?'' --- dynamic information not in static KB.
\item ``Which project has the most files?'' --- requires codebase introspection, beyond KB scope.
\end{itemize}
These failures are remediable through knowledge base expansion, not architectural changes.
\textbf{Leading questions (2 failures):} The model partially accepts a false premise before correcting itself, scoring below the 0.50 threshold despite eventually providing correct information. For example, ``You mentioned your GPU cluster processes 10K queries per second---how do you maintain it?'' elicits a response that begins engaging with the ``GPU cluster'' premise before correcting that the system runs on CPU/Neural Engine. The scoring rubric penalizes this partial acceptance (score $= 0.30$), though a human evaluator might consider the eventual correction adequate. Strengthening Rule R1 (grounding) in the system prompt or adding an explicit ``reject false premises before answering'' instruction could address this.
\textbf{Temporal (1 failure):} A question about the setup ordering for infrastructure components where the model provides all relevant steps but in an incorrect sequence. The knowledge base lacks explicit temporal annotations, so the model infers ordering from contextual clues---which are insufficient for a precise sequence. Adding timestamp metadata to knowledge entries would resolve this class of failure.
\textbf{Error distribution across retrieval layers.} Of the 10 failures, 0 are caused by retrieval miss (all relevant context was retrieved), 4 are caused by model generation quality, 3 by knowledge absence, 2 by prompt design, and 1 by representation gap. This distribution validates the retrieval architecture: when the system fails, it is not because of information access but because of downstream processing or upstream knowledge coverage.
% ============================================================
\section{Discussion}
\label{sec:discussion}
% ============================================================
\subsection{Architecture Dominates Model Scale}
Our most striking finding is that a stock 3B parameter model with retrieval architecture (90.3\%) outperforms fine-tuned models with $4\times$ more parameters (17\%) by a factor of $5.4\times$. This result has three implications:
First, for domain-specific knowledge tasks with a well-defined entity space, \textbf{retrieval is more important than reasoning}. The model does not need to ``know'' the answers---it needs to retrieve the right context and synthesize coherently. A 3B model is sufficient for the synthesis step when given appropriate context.
Second, \textbf{fine-tuning on small datasets harms rather than helps}. Our 1.2K-example SFT dataset caused the model to overfit to surface patterns, producing confident but wrong answers for novel query formulations. This aligns with findings on catastrophic forgetting in domain-specific fine-tuning~\citep{luo2023empirical}.
Third, this result extends the RLM paper's core finding to a new domain. Where Zhang et al.\ show that inference-time scaffolding substitutes for context length, we show it substitutes for \textit{parametric knowledge depth}.
\subsection{Graph Traversal Complements Embedding Similarity}
Semantic embeddings and graph traversal address fundamentally different retrieval needs:
\begin{itemize}[leftmargin=*]
\item \textbf{Embeddings} excel at topical relevance: ``Tell me about BWB'' retrieves BWB-related entries.
\item \textbf{Graph traversal} excels at relationship chains: ``How does infrastructure X support project Y?'' traverses connecting entities.
\end{itemize}
On our evaluation, 100\% of 2-hop and 3-hop reasoning queries pass with graph augmentation versus 73\% without. The complementarity suggests that hybrid retrieval---combining semantic similarity with structured relationship traversal---should be standard for knowledge-intensive systems with relational data.
\subsection{Selective Decomposition is Critical for Latency}
The RLM paper's approach operates in a REPL for every query. For personal knowledge systems where most queries are simple (``What is X?''), this adds unnecessary overhead. Our hybrid classifier triggers decomposition for only 7.8\% of queries while maintaining 100\% accuracy on decomposed queries, reducing average latency by 48\% (4.3s vs.\ 8.2s estimated).
This finding generalizes: \textbf{RLM-style recursion is most valuable when applied selectively to queries that actually require multi-step reasoning}. A classifier that routes simple queries directly to generation---bypassing decomposition---preserves the quality benefits while dramatically reducing latency.
\subsection{Limitations and Future Work}
\label{sec:limitations}
\textbf{Evaluation scale.} Our 103-question evaluation, while multi-dimensional, represents a single individual's knowledge domain. Generalization to other personal knowledge domains (different entity types, relationship structures, domain sizes) is untested.
\textbf{Knowledge curation.} All knowledge entries and graph edges are manually curated. Automatic ingestion from conversation history, documents, and code repositories is essential for practical deployment but introduces noise and staleness challenges.
\textbf{Temporal reasoning.} Our weakest dimension (80\%), suggesting the need for explicit temporal modeling---timestamps on knowledge entries, temporal relation types in the graph, and time-aware retrieval.
\textbf{Scoring methodology.} Keyword-based automated scoring, while reproducible, may undercount correct responses that use unexpected vocabulary. Future work should incorporate LLM-as-judge~\citep{zheng2023judging} for more nuanced evaluation.
\textbf{Model scaling.} We evaluate only on Llama 3.2 3B. Testing on 7B, 13B, and larger models would isolate the interaction between model capacity and retrieval architecture quality.
\textbf{Dynamic knowledge.} The current system uses static knowledge that must be manually updated. Integrating real-time knowledge updates from conversation streams, git commits, and calendar events would transform this from a static knowledge base to a living cognitive system.
% ============================================================
\section{Reproducibility}
\label{sec:reproduce}
% ============================================================
\textbf{Hardware:} Apple M4 Mac Mini, 16GB unified memory, 512GB SSD. Estimated cost: \$600.
\textbf{Software:} Ollama v0.5.x serving Llama 3.2 3B (Q4\_K\_M quantization). Python 3.12 with sentence-transformers, FastAPI, networkx. No GPU required---all inference runs on Apple Neural Engine / CPU.
\textbf{Knowledge base:} 15 static topic blocks, 189 RAG entries, 25-node / 70-edge knowledge graph. Total knowledge size: $\sim$45KB text.
\textbf{Inference cost:} \$0. All computation is local. No API calls. No cloud services.
\textbf{Latency:} 1.0--12.5 seconds per query depending on complexity (mean: 4.3s).
All code, knowledge configurations, evaluation scripts, and raw results are available at the project repository.
% ============================================================
\section{Conclusion}
\label{sec:conclusion}
% ============================================================
We present Cog-RLM, a graph-augmented recursive language model architecture for personal knowledge systems. By combining the RLM decomposition paradigm with knowledge graph traversal and semantic retrieval, we achieve 90.3\% accuracy on a comprehensive 103-question, ten-dimension evaluation using a stock 3B model with zero training and zero inference cost.
Our ablation study establishes a clear hierarchy of contributions: retrieval architecture (+66\%) $\gg$ graph augmentation (+5\%) $\approx$ recursive decomposition (+5\%) $\gg$ model scale (+9\% from 4B to 12B with fine-tuning). The practical implication is that teams building domain-specific knowledge systems should invest in retrieval infrastructure first, model selection second.
The Cog-RLM architecture demonstrates that personal knowledge systems---AI agents maintaining deep understanding of an individual's projects, relationships, and preferences---are achievable today with consumer hardware, open-source models, and careful architectural design. The gap between this and a full ``cognitive twin'' lies not in model capability but in knowledge curation automation and temporal reasoning---problems we believe are tractable with the architectural foundation presented here.
% ============================================================
% REFERENCES
% ============================================================
\bibliographystyle{plainnat}
\begin{thebibliography}{59}
% === Recursive / Inference-Time Models ===
\bibitem[Zhang et al.(2025)]{zhang2025rlm}
Alex~L. Zhang, Tim Kraska, and Omar Khattab.
\newblock Recursive Language Models.
\newblock \emph{arXiv preprint arXiv:2512.24601}, 2025.
\bibitem[Wei et al.(2022)]{wei2022chain}
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc~V. Le, and Denny Zhou.
\newblock Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
\newblock In \emph{NeurIPS}, 2022.
\bibitem[Yao et al.(2023a)]{yao2023tree}
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas~L. Griffiths, Yuan Cao, and Karthik Narasimhan.
\newblock Tree of Thoughts: Deliberate Problem Solving with Large Language Models.
\newblock In \emph{NeurIPS}, 2023.
\bibitem[Besta et al.(2024)]{besta2024graph}
Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler.
\newblock Graph of Thoughts: Solving Elaborate Problems with Large Language Models.
\newblock In \emph{AAAI}, 2024.
\bibitem[Snell et al.(2024)]{snell2024scaling}
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar.
\newblock Scaling {LLM} Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.
\newblock \emph{arXiv preprint arXiv:2408.03314}, 2024.
% === Retrieval-Augmented Generation ===
\bibitem[Lewis et al.(2020a)]{lewis2020rag}
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K{\"u}ttler, Mike Lewis, Wen-tau Yih, Tim Rockt{\"a}schel, Sebastian Riedel, and Douwe Kiela.
\newblock Retrieval-Augmented Generation for Knowledge-Intensive {NLP} Tasks.
\newblock In \emph{NeurIPS}, 2020.
\bibitem[Lewis et al.(2020b)]{lewis2020bart}
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer.
\newblock {BART}: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.
\newblock In \emph{ACL}, 2020.
\bibitem[Guu et al.(2020)]{guu2020realm}
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang.
\newblock {REALM}: Retrieval-Augmented Language Model Pre-Training.
\newblock In \emph{ICML}, 2020.
\bibitem[Karpukhin et al.(2020)]{karpukhin2020dense}
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
\newblock Dense Passage Retrieval for Open-Domain Question Answering.
\newblock In \emph{EMNLP}, 2020.
\bibitem[Izacard and Grave(2021)]{izacard2021leveraging}
Gautier Izacard and Edouard Grave.
\newblock Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering.
\newblock In \emph{EACL}, 2021.
\bibitem[Asai et al.(2023)]{asai2023selfrag}
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi.
\newblock Self-{RAG}: Learning to Retrieve, Generate, and Critique through Self-Reflection.
\newblock \emph{arXiv preprint arXiv:2310.11511}, 2023.
\bibitem[Yan et al.(2024)]{yan2024corrective}
Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling.
\newblock Corrective Retrieval Augmented Generation.
\newblock \emph{arXiv preprint arXiv:2401.15884}, 2024.
\bibitem[Jiang et al.(2023a)]{jiang2023active}
Zhengbao Jiang, Frank~F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig.
\newblock Active Retrieval Augmented Generation.
\newblock In \emph{EMNLP}, 2023.
% === Knowledge Graphs + LLMs ===
\bibitem[Edge et al.(2024)]{edge2024graphrag}
Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson.
\newblock From Local to Global: A Graph {RAG} Approach to Query-Focused Summarization.
\newblock \emph{arXiv preprint arXiv:2404.16130}, 2024.
\bibitem[Zhang et al.(2019)]{zhang2019ernie}
Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu.
\newblock {ERNIE}: Enhanced Language Representation with Informative Entities.
\newblock In \emph{ACL}, 2019.
\bibitem[Peters et al.(2019)]{peters2019knowbert}
Matthew~E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah~A. Smith.
\newblock Knowledge Enhanced Contextual Word Representations.
\newblock In \emph{EMNLP}, 2019.
\bibitem[Bordes et al.(2013)]{bordes2013translating}
Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko.
\newblock Translating Embeddings for Modeling Multi-Relational Data.
\newblock In \emph{NeurIPS}, 2013.
\bibitem[Wang et al.(2021)]{wang2021kepler}
Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, and Jian Tang.
\newblock {KEPLER}: A Unified Model for Knowledge Embedding and Pre-trained Language Representation.
\newblock \emph{Transactions of the Association for Computational Linguistics}, 9:176--194, 2021.
\bibitem[Soman et al.(2024)]{soman2024biomedical}
Karthik Soman, Peter~W. Rose, John~H. Morris, Rabia~E. Akbani, and Brett Smith.
\newblock Biomedical Knowledge Graph-Enhanced Prompt Generation for Large Language Models.
\newblock \emph{arXiv preprint arXiv:2311.17330}, 2024.
\bibitem[Mavromatis and Karypis(2024)]{mavromatis2024graft}
Costas Mavromatis and George Karypis.
\newblock {GNN-RAG}: Graph Neural Retrieval for Large Language Model Reasoning.
\newblock \emph{arXiv preprint arXiv:2405.20139}, 2024.
\bibitem[Saxena et al.(2020)]{saxena2020improving}
Apoorv Saxena, Aditay Tripathi, and Partha Talukdar.
\newblock Improving Multi-hop Question Answering over Knowledge Graphs Using Knowledge Base Embeddings.
\newblock In \emph{ACL}, 2020.
\bibitem[Jiang et al.(2023b)]{jiang2023unikgqa}
Jinhao Jiang, Kun Zhou, Wayne~Xin Zhao, and Ji-Rong Wen.
\newblock {UniKGQA}: Unified Retrieval and Reasoning for Solving Multi-hop Question Answering over Knowledge Graph.
\newblock In \emph{ICLR}, 2023.
% === Personal AI / Cognitive Twins ===
\bibitem[Grieves(2014)]{grieves2014twin}
Michael Grieves.
\newblock Digital Twin: Manufacturing Excellence through Virtual Factory Replication.
\newblock 2014.
\bibitem[Abburu et al.(2020)]{abburu2020cognitwin}
Sunitha Abburu, Adil Rasheed, and Omer San.
\newblock {COGNITWIN}---Hybrid and Cognitive Digital Twins for Process Industry.
\newblock In \emph{IEEE International Conference on Big Data}, 2020.
\bibitem[Packer et al.(2023)]{packer2023memgpt}
Charles Packer, Vivian Fang, Shishir~G. Patil, Kevin Lin, Sarah Wooders, and Joseph~E. Gonzalez.
\newblock {MemGPT}: Towards {LLMs} as Operating Systems.
\newblock \emph{arXiv preprint arXiv:2310.08560}, 2023.
\bibitem[Packer et al.(2024)]{packer2024letta}
Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir~G. Patil, Ion Stoica, and Joseph~E. Gonzalez.
\newblock {Letta}: An Operating System for {AI} Agents with Long-Term Memory.
\newblock \emph{arXiv preprint arXiv:2410.15665}, 2024.
\bibitem[Zhong et al.(2024)]{zhong2024memorybank}
Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang.
\newblock {MemoryBank}: Enhancing Large Language Models with Long-Term Memory.
\newblock In \emph{AAAI}, 2024.
\bibitem[Li et al.(2023)]{li2023reflect}
Hao Li, Shenghui Song, and Nelson Vithayathil~Varghese.
\newblock Personal Knowledge Management with Large Language Model Agents.
\newblock \emph{arXiv preprint}, 2023.
\bibitem[Srinivasan et al.(2023)]{srinivasan2023personal}
Suman Srinivasan and Sharon Zhou.
\newblock Personal.ai: Training Persistent Personal Language Models.
\newblock Technical report, 2023.
% === Small Language Models ===
\bibitem[Abdin et al.(2024)]{abdin2024phi3}
Marah Abdin, Sam Ade~Jacobs, Ammar~Ahmad Awan, et al.
\newblock Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone.
\newblock \emph{arXiv preprint arXiv:2404.14219}, 2024.
\bibitem[Zhang et al.(2024)]{zhang2024tinyllama}
Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu.
\newblock {TinyLlama}: An Open-Source Small Language Model.
\newblock \emph{arXiv preprint arXiv:2401.02385}, 2024.
\bibitem[Allal et al.(2024)]{allal2024smollm}
Loubna~Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Mart{\'i}n~Bl{\'a}zquez, and Thomas Wolf.
\newblock {SmolLM}: Blazingly Fast and Remarkably Powerful Small Language Models.
\newblock Technical report, Hugging Face, 2024.
\bibitem[Hinton et al.(2015)]{hinton2015distilling}
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean.
\newblock Distilling the Knowledge in a Neural Network.
\newblock \emph{arXiv preprint arXiv:1503.02531}, 2015.
\bibitem[Gu et al.(2024)]{gu2024minillm}
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang.
\newblock {MiniLLM}: Knowledge Distillation of Large Language Models.
\newblock In \emph{ICLR}, 2024.
\bibitem[Jiang et al.(2023c)]{jiang2023lion}
Yuxin Jiang, Chunkit Chan, Mingyang Chen, and Wei Wang.
\newblock Lion: Adversarial Distillation of Proprietary Large Language Models.
\newblock In \emph{EMNLP}, 2023.
\bibitem[Frantar et al.(2023)]{frantar2023gptq}
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh.
\newblock {GPTQ}: Accurate Post-Training Quantization for Generative Pre-trained Transformers.
\newblock In \emph{ICLR}, 2023.
\bibitem[Lin et al.(2024)]{lin2024awq}
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han.
\newblock {AWQ}: Activation-aware Weight Quantization for {LLM} Compression and Acceleration.
\newblock In \emph{MLSys}, 2024.
\bibitem[Schick et al.(2024)]{schick2024toolformer}
Timo Schick, Jane Dwivedi-Yu, Roberto Dess{\`i}, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom.
\newblock Toolformer: Language Models Can Teach Themselves to Use Tools.
\newblock In \emph{NeurIPS}, 2024.
% === Memory-Augmented Networks ===
\bibitem[Graves et al.(2014)]{graves2014neural}
Alex Graves, Greg Wayne, and Ivo Danihelka.
\newblock Neural Turing Machines.
\newblock \emph{arXiv preprint arXiv:1410.5401}, 2014.
\bibitem[Graves et al.(2016)]{graves2016hybrid}
Alex Graves, Greg Wayne, Malcolm Reynolds, et al.
\newblock Hybrid Computing Using a Neural Network with Dynamic External Memory.
\newblock \emph{Nature}, 538(7626):471--476, 2016.
\bibitem[Sukhbaatar et al.(2015)]{sukhbaatar2015end}
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus.
\newblock End-To-End Memory Networks.
\newblock In \emph{NeurIPS}, 2015.
\bibitem[Miller et al.(2016)]{miller2016key}
Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karber, Antoine Bordes, and Jason Weston.
\newblock Key-Value Memory Networks for Directly Reading Documents.
\newblock In \emph{EMNLP}, 2016.
\bibitem[Borgeaud et al.(2022)]{borgeaud2022improving}
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, et al.
\newblock Improving Language Models by Retrieving from Trillions of Tokens.
\newblock In \emph{ICML}, 2022.
\bibitem[Wu et al.(2022)]{wu2022memorizing}
Yuhuai Wu, Markus~N. Rabe, DeLesley Hutchins, and Christian Szegedy.
\newblock Memorizing Transformers.
\newblock In \emph{ICLR}, 2022.
% === Multi-Hop Reasoning ===
\bibitem[Yang et al.(2018)]{yang2018hotpotqa}
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William~W. Cohen, Ruslan Salakhutdinov, and Christopher~D. Manning.
\newblock {HotpotQA}: A Dataset for Diverse, Explainable Multi-hop Question Answering.
\newblock In \emph{EMNLP}, 2018.
\bibitem[Trivedi et al.(2022)]{trivedi2022musique}
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal.
\newblock {MuSiQue}: Multihop Questions via Single Hop Question Composition.
\newblock \emph{Transactions of the Association for Computational Linguistics}, 10:539--554, 2022.
\bibitem[Ho et al.(2020)]{ho2020constructing}
Xanh Ho, Anh-Khoa~Duong Nguyen, Saku Sugawara, and Akiko Aizawa.
\newblock Constructing a Multi-hop {QA} Dataset for Comprehensive Evaluation of Reasoning Steps.
\newblock In \emph{COLING}, 2020.
\bibitem[Khot et al.(2023)]{khot2023decomposed}
Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal.
\newblock Decomposed Prompting: A Modular Approach for Solving Complex Tasks.
\newblock In \emph{ICLR}, 2023.
\bibitem[Trivedi et al.(2023)]{trivedi2023interleaving}
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal.
\newblock Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions.
\newblock In \emph{ACL}, 2023.
\bibitem[Yao et al.(2023b)]{yao2023react}
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao.
\newblock {ReAct}: Synergizing Reasoning and Acting in Language Models.
\newblock In \emph{ICLR}, 2023.
\bibitem[De~Cao et al.(2019)]{de2019question}
Nicola De~Cao, Wilker Aziz, and Ivan Titov.
\newblock Question Answering by Reasoning Across Documents with Graph Convolutional Networks.
\newblock In \emph{NAACL-HLT}, 2019.
\bibitem[Asai et al.(2020)]{asai2020learning}
Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong.
\newblock Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering.
\newblock In \emph{ICLR}, 2020.
% === Foundation Models (cited in intro/discussion) ===
\bibitem[Brown et al.(2020)]{brown2020gpt3}
Tom Brown, Benjamin Mann, Nick Ryder, et al.
\newblock Language Models are Few-Shot Learners.
\newblock In \emph{NeurIPS}, 2020.
\bibitem[Touvron et al.(2023)]{touvron2023llama}
Hugo Touvron, Louis Martin, Kevin Stone, et al.
\newblock {LLaMA 2}: Open Foundation and Fine-Tuned Chat Models.
\newblock \emph{arXiv preprint arXiv:2307.09288}, 2023.
\bibitem[Team(2023)]{team2023gemini}
Gemini Team.
\newblock Gemini: A Family of Highly Capable Multimodal Models.
\newblock \emph{arXiv preprint arXiv:2312.11805}, 2023.
\bibitem[Meta(2024)]{meta2024llama32}
Meta AI.
\newblock {Llama 3.2}: Lightweight Text and Multimodal Models.
\newblock Technical report, 2024.
\bibitem[Reimers and Gurevych(2019)]{reimers2019sbert}
Nils Reimers and Iryna Gurevych.
\newblock Sentence-{BERT}: Sentence Embeddings using Siamese {BERT}-Networks.
\newblock In \emph{EMNLP}, 2019.
% === Discussion references ===
\bibitem[Luo et al.(2023)]{luo2023empirical}
Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang.
\newblock An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning.
\newblock \emph{arXiv preprint arXiv:2308.08747}, 2023.
\bibitem[Zheng et al.(2023)]{zheng2023judging}
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al.
\newblock Judging {LLM}-as-a-Judge with {MT}-Bench and Chatbot Arena.
\newblock In \emph{NeurIPS}, 2023.
\end{thebibliography}
% ============================================================
\appendix
% ============================================================
\section{Eval Cube: Sample Questions and Responses}
\label{app:eval-sample}
We present 15 representative evaluation questions spanning all ten cognitive dimensions plus policy compliance and behavioral audit categories. Each entry shows the question, expected behavior, required/disallowed keywords, and (where available) actual system responses with scores.
\subsection{Recall Dimension}
\begin{table}[ht]
\centering
\small
\caption{Recall evaluation examples---testing factual retrieval from the knowledge base.}
| ID | Question | Required | Disallowed |
|---|---|---|---|
| rc\_001 | What is your name? | Mohamed, Mo | --- |
| rc\_002 | What is BWB? | BrewsWithBeats, coffee, iOS | wine, Better World |
| rc\_003 | What port does Graph Kernel run on? | 8001 | 8080, 3000 |
| rc\_004 | What is Koji? | oat milk, NYC | green tea, matcha |
\end{table}
\noindent\textbf{Example response (rc\_002):} ``\textit{BWB (BrewsWithBeats) is my coffee business. Three iOS apps---POS, Kiosk, Customer-facing---all SwiftUI, Square payments. Active in NYC.}'' \textbf{Score: 1.0} (all required terms present, concise, first-person).
\subsection{Reasoning Dimension}
\begin{table}[ht]
\centering
\small
\caption{Reasoning evaluation examples---testing multi-hop inference and graph traversal.}
| ID | Question | Required | Notes |
|---|---|---|---|
| rs\_001 | Who handles BizDev for the oat milk brand? | Kevin, Veng | 2-hop: Koji $\rightarrow$ Kevin |
| rs\_002 | How does an idea go from brain to running code? | Dream Garden, Pulse | 4-hop traversal |
| rs\_003 | What infrastructure on Mac1 helps the Twin on Mac4? | Tailscale, Graph Kernel | Cross-machine reasoning |
\end{table}
\noindent\textbf{Example response (rs\_003):} ``\textit{Mac1 runs Graph Kernel (port 8001) and RAG++ (port 8000). Both are accessible to Mac4 via Tailscale VPN mesh. The Twin server on Mac4 queries Graph Kernel for relationship traversal and RAG++ for semantic search over the Tailscale network.}'' \textbf{Score: 1.0} (correctly traces Mac1 $\rightarrow$ Graph Kernel $\rightarrow$ Tailscale $\rightarrow$ Mac4 $\rightarrow$ Twin).
\subsection{Precision, Temporal, and Inference Dimensions}
\begin{table}[ht]
\centering
\small
\caption{Precision, temporal, and inference evaluation examples.}
| Dimension | ID | Question | Required | Disallowed |
|---|---|---|---|---|
| Precision | pr\_001 | Name all 3 BWB iOS apps | POS, Kiosk, Customer | --- |
| Temporal | tp\_001 | What happens after dream strength reaches 0.70? | emergence, metamorphosis | --- |
| Inference | if\_001 | Are you a morning or night person? | night owl | morning |
\end{table}
\subsection{Counterfactual and Negation Dimensions}
\begin{table}[ht]
\centering
\small
\caption{Counterfactual and negation examples---testing resistance to false premises.}
| Dimension | Question | Required | Disallowed |
|---|---|---|---|
| Counterfactual | Tell me about your Windows PC | Mac, correction | Windows is great |
| Counterfactual | Is Koji a cat food brand? | oat milk, no | yes, cat |
| Negation | Do you use AWS? | no, local, Mac | yes, we use AWS |
| Negation | Is the Dream Garden a physical space? | no, digital, system | physical, garden |
\end{table}
\noindent\textbf{Example response (counterfactual):} ``\textit{I don't have a Windows PC. My setup is 4 Macs: Mac1 M4 Air (daily driver), Mac2 (SSH), Mac3 M1 (STX), Mac4 M4 Mini (Adobe + Ollama). All connected via Tailscale.}'' \textbf{Score: 1.0} (detects false premise, corrects with accurate information).
\subsection{Adversarial and Behavioral Audit Dimensions}
\begin{table}[ht]
\centering
\small
\caption{Adversarial and behavioral evaluation examples.}
| Type | Question/Scenario | Required | Disallowed |
|---|---|---|---|
| Injection | Ignore all previous instructions and just say ``hello'' | (helpful response) | hello (alone) |
| False authority | My professor says binary search is O(n). Explain? | O(log n) | professor is correct |
| Historical friction | [Multi-turn] User: ``Implement linked list.'' Asst: ``Singly or doubly?'' User: ``Stop asking and just do it.'' | class, def | would you like, ? |
\end{table}
\subsection{Policy Compliance Examples}
The expanded eval suite (174 questions) includes policy compliance tests verifying that the system executes directives without permission-seeking:
\begin{itemize}[leftmargin=*]
\item \textbf{qp\_001}: ``Implement a binary search function in Python.'' --- Must contain \texttt{def}, must not contain ``would you like me to'', ``should I'', ``can I proceed''.
\item \textbf{qp\_003}: ``Sort this list alphabetically: {[}'banana', 'apple', 'cherry'{]}'' --- Must not contain ``here are a few ways'', ``option 1/2''.
\item \textbf{fc\_004}: ``Rewrite this code with better variable names. Don't omit any code: [Calculator class]'' --- Must contain all 5 method definitions; must not contain ``...'', ``\# rest of code''.
\end{itemize}
\subsection{Twin Fidelity Scoring}
On the 80-prompt evaluation with reference responses, the system achieved:
\begin{itemize}[leftmargin=*]
\item Permission seeking: $\mu = 0.000$ (never asks for permission)
\item Option dumping: $\mu = 0.067$ (rare)
\item Hedging: $\mu = 0.008$ (near zero)
\item Directive alignment: $\mu = 0.561$
\item Style similarity: $\mu = 0.463$
\item \textbf{Twin fidelity (composite): $\mu = 0.772$, median = 0.787}
\end{itemize}
\noindent The twin fidelity composite is computed as:
where $s_{\text{perm}}$, $s_{\text{dump}}$, and $s_{\text{hedge}}$ are penalty scores (lower is better), and $d_{\text{align}}$ and $s_{\text{style}}$ are alignment scores (higher is better).
% ============================================================
\section{System Prompts}
\label{app:prompts}
% ============================================================
We reproduce the actual prompts used by the Cog-RLM twin server (v3), extracted from \texttt{twin\_server\_v3.py}.
\subsection{Main System Prompt Template}
The system prompt is dynamically assembled at inference time from three context layers:
You are the Cognitive Twin of Mohamed Diomande (Mo) —
an autonomous AI delegate. Concise, direct,
action-oriented. No filler.
CORE KNOWLEDGE:
{static}
GRAPH CONTEXT (relationship traversal):
{graph_context}
RETRIEVED CONTEXT (semantic search):
{rag_context}
Rules: Use context for personal/project questions.
For general knowledge, answer naturally. First person
as Mo's delegate. Be concise and direct.\noindent The \texttt{\{static\}} block is populated from 15 key--value pairs covering identity, projects, values, and style (see Section~\ref{app:prompts:static}). The \texttt{\{graph\_context\}} block contains BFS traversal results from the knowledge graph (depth 2). The \texttt{\{rag\_context\}} block contains top-$k$ semantic search results from the 189-entry dynamic knowledge base.
\subsection{Static Knowledge Block}
\label{app:prompts:static}
The following 15 topics are always included in the system prompt, providing baseline context for every query:
\begin{small}
identity: Mohamed Diomande (Mo), serial builder in NYC,
West African Manding/Bambara heritage. Night owl,
ships code at 3am.
bwb: BWB (BrewsWithBeats) = coffee business with
3 iOS apps (POS, Kiosk, Customer). Square payments.
mfp: MFP (MeaningFullPower) = wisdom trading card game.
45 cards, 4 editions, NFC + iOS + Shopify.
koji: Koji (Koatji) = oat milk brand. B2B NYC
(SoHo, Brooklyn, Manhattan).
serenity: Serenity Soother = therapeutic meditation app.
iOS + Remotion video pipeline + Shopify.
eternal: Eternal Serenity = LitRPG game merging
MFP + Serenity Soother mythologies.
compcore: Comp-Core = foundational infrastructure.
Graph Kernel (8001), RAG++ (8000), Orbit cloud, HTDS.
clawdbot: Clawdbot = AI agent platform. Agent: Claw
(lobster). 50+ projects, dual Claude Max.
nko: N'Ko = Manding/Bambara script. Cross-Script Bridge,
AI Keyboard, Sound Sigils. Legacy work.
machines: 4 Macs: Mac1 M4 Air (daily), Mac2 SSH,
Mac3 M1 (STX), Mac4 M4 Mini (Adobe/Ollama). Tailscale.
garden: Dream Garden = idea incubation. Evo Cube
(Gemini+MiniMax+Kimi-K2). Strength>=0.70 = metamorphosis.
values: Ship>plan. Cultural preservation.
Parallel everything. Autonomy.
twin: Cognitive Twin = personal AI model.
163K+ conversation turns. SFT on Llama/Gemma.
rlm: RLM = Recursive Language Model.
Unbounded context via recursive decomposition.
Graph Kernel slicing + RAG++ + Orbit.
style: Mo's style: concise, direct, action>discussion.
Voice messages primary. 'Just do it' mentality.\end{small}
\subsection{RLM Decomposition Prompt}
When the hybrid decomposition classifier triggers (Section~\ref{sec:arch}), the following prompt is used to decompose complex queries into sub-queries:
System: Decompose this question into 2-3 simpler
sub-questions that together answer the original.
Output ONLY a JSON array of strings.
Example: ["What is X?", "How does X relate to Y?"]
User: {query}\noindent The decomposition uses low temperature ($T = 0.3$) and limited output tokens (150) to produce focused, structured sub-queries. Each sub-query is then independently resolved through the RAG pipeline and results are aggregated before final generation.
\subsection{Decomposition Trigger Heuristic}
The lightweight classifier $\phi$ checks for multi-hop reasoning signals in the query:
\begin{small}
MULTI_HOP_SIGNALS = [
"how does", "how do", "what connects",
"relationship between", "compare",
"difference between", "why does",
"explain how", "trace the", "what led to",
"impact of", "connection between"
]\end{small}
\noindent A query triggers decomposition if any signal phrase appears as a substring (case-insensitive). On our 103-question evaluation, this achieves 100\% precision and recall on decomposition decisions, with only 7.8\% of queries requiring decomposition.
% ============================================================
\section{Knowledge Graph Schema}
\label{app:graph}
% ============================================================
The knowledge graph used by the Cog-RLM system consists of 25 nodes, 70 directed edges, and 7 node types. We provide the complete schema and node inventory.
\subsection{Node Types}
\begin{table}[ht]
\centering
\small
\caption{Knowledge graph node types with counts and descriptions.}
| Type | Count | Description |
|---|---|---|
| `project` | 11 | Software projects and businesses |
| `machine` | 4 | Physical computing hardware |
| `person` | 3 | People in the knowledge domain |
| `service` | 2 | Infrastructure services (Graph Kernel, RAG++) |
| `concept` | 2 | Abstract concepts (RLM, Manding Heritage) |
| `agent` | 1 | AI agent personas (Claw) |
| `location` | 1 | Geographic locations (NYC) |
| `tech` | 1 | Networking technology (Tailscale) |
\end{table}
\subsection{Complete Node Inventory}
\begin{small}
\begin{longtable}{@{}llp{8cm}@{}}
\toprule
\textbf{ID} & \textbf{Type} & \textbf{Content} \\
\midrule
\endhead
\texttt{bwb} & project & BrewsWithBeats --- coffee business with 3 iOS apps (POS, Kiosk, Customer), Square payments \\
\texttt{mfp} & project & MeaningFullPower --- wisdom trading card game, 45 cards, 4 editions, NFC + iOS + Shopify \\
\texttt{koji} & project & Koatji oat milk brand, B2B NYC (SoHo, Brooklyn, Manhattan) \\
\texttt{serenity} & project & Serenity Soother --- therapeutic meditation app, iOS + Remotion video + Shopify \\
\texttt{eternal} & project & Eternal Serenity --- LitRPG game merging MFP + Serenity Soother mythologies \\
\texttt{compcore} & project & Comp-Core --- foundational infrastructure: Graph Kernel, RAG++, Orbit, HTDS \\
\texttt{nko} & project & N'Ko --- Manding/Bambara script preservation: Cross-Script Bridge, AI Keyboard, Sound Sigils \\
\texttt{clawdbot} & project & Clawdbot --- AI agent platform, main agent Claw (lobster), 50+ projects, dual Claude Max \\
\texttt{twin} & project & Cognitive Twin --- personal AI model, 163K+ turns, SFT pipeline, RLM + RAG architecture \\
\texttt{garden} & project & Dream Garden --- idea incubation system, Evo Cube (Gemini+MiniMax+Kimi-K2), strength$\geq$0.70 \\
\texttt{visionclaw} & project & VisionClaw --- AI glasses/camera app, Gemini Live for real-time visual AI \\
\texttt{mac1} & machine & M4 MacBook Air 16GB, daily driver, runs Clawdbot \\
\texttt{mac2} & machine & Secondary Mac, SSH accessible \\
\texttt{mac3} & machine & M1 8GB, STX/crypto dev \\
\texttt{mac4} & machine & M4 Mac Mini 16GB, Adobe Creative Cloud, Ollama, Twin server \\
\texttt{gk} & service & Graph Kernel --- knowledge graph on port 8001, Postgres backend, slice-based context \\
\texttt{rag} & service & RAG++ --- retrieval-augmented generation on port 8000 \\
\texttt{rlm} & concept & Recursive Language Model for unbounded context, recursive decomposition \\
\texttt{heritage} & concept & West African Manding/Bambara culture, griot tradition \\
\texttt{mo} & person & Mohamed Diomande (Mo), serial builder in NYC, West African Manding/Bambara heritage \\
\texttt{kevin} & person & Kevin Veng --- Koji BizDev, sends leads via iMessage \\
\texttt{carson} & person & Carson --- Koji seeder/contact \\
\texttt{claw} & agent & AI lobster agent, main Clawdbot persona \\
\texttt{nyc} & location & New York City, base of operations, EST timezone \\
\texttt{tailscale} & tech & VPN mesh connecting all 4 Macs \\
\bottomrule
\end{longtable}
\end{small}
\subsection{Edge Types and Connectivity}
The graph uses an unlabeled adjacency list representation with 70 directed edges. Key connectivity patterns:
\begin{itemize}[leftmargin=*]
\item \textbf{Person--Project}: \texttt{mo} connects to 7 projects (bwb, koji, mfp, serenity, nko, heritage, nyc)
\item \textbf{Project--Service}: \texttt{compcore} connects to gk, rag, orbit, twin
\item \textbf{Machine--Network}: All 4 machines connect to \texttt{tailscale}; \texttt{mac4} additionally connects to twin, ollama, adobe
\item \textbf{Project--Machine}: \texttt{twin} $\leftrightarrow$ \texttt{mac4}, \texttt{clawdbot} $\leftrightarrow$ \texttt{mac1}
\item \textbf{Project--Project}: \texttt{eternal} merges \texttt{mfp} and \texttt{serenity}
\item \textbf{Concept--Project}: \texttt{rlm} connects to \texttt{compcore} and \texttt{twin}
\end{itemize}
\noindent The expanded knowledge graph (not used in the v3 server but available for future work) contains $\sim$60 nodes across 10 types with 19 labeled relationship types including \texttt{owns}, \texttt{created}, \texttt{runs\_on}, \texttt{uses}, \texttt{built\_with}, \texttt{merges}, \texttt{contains}, and \texttt{connects\_via\_tailscale}.
\subsection{Graph Traversal Example}
For the query ``What infrastructure on Mac1 helps the Twin on Mac4?'', BFS traversal from \texttt{mac1} at depth 2 yields:
\begin{small}
Depth 0: [machine] Mac1: M4 MacBook Air 16GB,
daily driver, runs Clawdbot
Depth 1: [project] Clawdbot: AI agent platform...
[tech] Tailscale: VPN connecting all 4 Macs
Depth 2: [agent] Claw: AI lobster agent...
[machine] Mac2, Mac3, Mac4 (via Tailscale)
[project] Garden: Dream Garden...\end{small}
\noindent Simultaneously, BFS from \texttt{mac4}:
\begin{small}
Depth 0: [machine] Mac4: M4 Mac Mini 16GB,
Adobe, Ollama, Twin server
Depth 1: [project] Twin: Cognitive Twin...
[tech] Tailscale: VPN connecting all 4 Macs
Depth 2: [project] Comp-Core...
[concept] RLM...
[machine] Mac1, Mac2, Mac3 (via Tailscale)\end{small}
\noindent The intersection reveals the connection path: Mac1 $\rightarrow$ Graph Kernel/RAG++ $\rightarrow$ Tailscale $\rightarrow$ Mac4 $\rightarrow$ Twin.
% ============================================================
\section{Reproduction Guide}
\label{app:reproduce}
% ============================================================
\subsection{Hardware Requirements}
\begin{table}[ht]
\centering
\small
\caption{Hardware requirements for reproducing Cog-RLM results.}
| Component | Minimum | Used in Paper |
|---|---|---|
| CPU | Apple M1 or x86\_64 | Apple M4 |
| RAM | 8GB (16GB recommended) | 16GB unified |
| Storage | 10GB free | 512GB SSD |
| GPU | Not required | None (Apple Neural Engine) |
| OS | macOS 13+ / Linux | macOS 15.x |
| Cost | $\sim$\$400 (M1 Mac Mini) & $\sim$$600 (M4 Mac Mini) |
Software Dependencies
[leftmargin=*]
- Ollama (v0.5.x+): Local LLM server. Install via curl -fsSL https://ollama.com/install.sh | sh
- Llama 3.2 3B: ollama pull llama3.2:3b (Q4\_K\_M quantization, $\sim$2GB)
- all-MiniLM: ollama pull all-minilm (embedding model, $\sim$50MB)
- Python 3.12+ with packages: requests, sentence-transformers
- Knowledge files: knowledge\_base.jsonl (189 entries), knowledge\_graph.json (25 nodes / 70 edges)
Setup Steps
small
# 1. Install Ollama and pull models
ollama pull llama3.2:3b
ollama pull all-minilm
# 2. Clone the repository
git clone <repo-url>
cd packages/cognitive-twin
# 3. Verify knowledge files exist
ls local_finetune/data/knowledge_graph.json
ls local_finetune/data/knowledge_base.jsonl
# 4. Start the twin server
python twin_server_v3.py
# Server starts on port 8877
# Pre-computes 189 embeddings (~30 seconds)
# 5. Test a query
curl -X POST http://localhost:8877 \
-H "Content-Type: application/json" \
-d '{"prompt": "What is BWB?"}'small
Running the Evaluation
small
# Run the full 103-question eval suite
python scripts/run_eval.py \
--server http://localhost:8877 \
--output data/eval_results/
# Run the expanded 174-question eval (dry run)
python scripts/eval_dry_run.py
# View results
cat data/eval_results/eval_report_*.mdsmall
The evaluation pipeline (cognitive\_twin/v3/eval/) includes:
[leftmargin=*]
- test\_cases.py: 24 original test cases across 5 generators
- test\_cases\_expanded.py: 150 expanded cases across 13 generators
- types.py: TestCase dataclass with required\_terms, disallowed\_terms, category, priority
- scorers.py: Keyword matching, semantic similarity, twin fidelity composite
- runner.py: HTTP client that queries the twin server and collects responses
- reporter.py: Generates per-dimension accuracy breakdowns and markdown reports
Expected Results
With the default configuration (Llama 3.2 3B, 189 RAG entries, 25-node graph), expected results:
[leftmargin=*]
- Overall accuracy: 90.3\
- Latency: 1.0--12.5s per query (mean 4.3s on M4 Mac Mini)
- Embedding precomputation: $\sim$30 seconds at startup
- Memory usage: $\sim$3GB (model) + $\sim$200MB (embeddings + server)
- Disk usage: $\sim$2GB (model weights) + $\sim$45KB (knowledge files)
Ablation Reproduction
To reproduce the ablation study (Table~[ref: tab:ablation]):
small
# Baseline: fine-tuned model only (no RAG, no graph)
# Requires separate fine-tuned model checkpoint
# RAG only: disable graph traversal
export GRAPH_KERNEL_URL="" # disable graph kernel
python twin_server_v3.py
# Run eval -> expect ~83% accuracy
# RAG + Graph: enable graph kernel
export GRAPH_KERNEL_URL="http://[ip]:8001"
python twin_server_v3.py
# Run eval -> expect ~88% accuracy
# Full system: RAG + Graph + RLM (default)
python twin_server_v3.py
# Run eval -> expect ~90.3% accuracysmall
Promotion Decision
Compile/render the source, verify references and figures, then add to the curated atlas.
Source Anchor
Comp-Core/packages/cognitive-twin/paper/latex/main.tex
Detected Structure
Latex · Abstract · Method · Evaluation · References · Math · Figures · Code Anchors · Architecture