Cognitive Twin Architecture V2 — Decoupled RLM + Qwen 3.5 Migration
The Cognitive Twin is Mo's personal AI delegate — a model that knows his projects, preferences, reasoning patterns, and history. V1 used Llama 3.2:3B locally with a tightly-coupled RAG+Graph+RLM stack. V2 decouples every layer, swaps the base model to Qwen 3.5, and creates a clean evaluation pipeline.
Full Public Reader
Cognitive Twin Architecture V2 — Decoupled RLM + Qwen 3.5 Migration
Executive Summary
The Cognitive Twin is Mo's personal AI delegate — a model that knows his projects, preferences, reasoning patterns, and history. V1 used Llama 3.2:3B locally with a tightly-coupled RAG+Graph+RLM stack. V2 decouples every layer, swaps the base model to Qwen 3.5, and creates a clean evaluation pipeline.
Key Results (Benchmark: March 4, 2026)
| Config | Score | What it proves |
|---|---|---|
| A: Bare Qwen3-Next-80B | 29.5 | |
| B: + RAG | 87.2 | |
| C: + Graph | 89.7 | |
| D: + RLM | 93.6 |
---
Architecture Layers (Decoupled)
### Layer 0: Base Model
Current: Qwen3-Next-80B-A3B (Together AI, serverless, $0)
Target: Qwen3.5-35B-A3B (local on Mac4+Mac5 exo cluster OR API)
The base model is swappable. Any OpenAI-compatible chat endpoint works.
Model Options (tested/available):
| Model | Where | Active Params | Score (Config A) | Cost |
|-------|-------|---------------|------------------|------|
| Llama 3.2:3B | Mac4 Ollama | 3B | ~25
| Qwen3-Next-80B-A3B | Together API | 3B | 29.5
| Qwen3.5-35B-A3B | OpenRouter | 3B | TBD | $0.16/M |
| Qwen3.5-35B-A3B | Mac4+Mac5 exo | 3B | TBD | $0 |
| Qwen3.5-397B-A17B | Together API | 17B | TBD | $0.60/M |
Migration path: Start with API for evaluation speed, migrate to local exo cluster for $0 inference at scale.
### Layer 1: RAG (Retrieval-Augmented Generation)
Status: Working, biggest impact (+57.7
Components:
- Knowledge Base: 466 entries in JSONL (knowledge_base.jsonl + knowledge_base_v2.jsonl)
- Embeddings: Gemini gemini-embedding-001 (batch API, fast, free tier)
- Search: Cosine similarity, top-k=3, min_sim=0.25
- Format: Q&A pairs injected into system prompt
Decoupled interface:
class RAGLayer:
def __init__(self, kb_paths: list[Path], embed_fn: Callable):
self.entries = load_knowledge_base(kb_paths)
self.embed_fn = embed_fn
def search(self, query: str, top_k=3) -> list[tuple[float, dict]]:
"""Returns [(similarity, entry), ...] ranked by relevance."""
def to_context(self, results) -> str:
"""Formats search results for system prompt injection."""Improvement opportunities:
- Increase knowledge base from 466 → 1000+ entries (V10+ corpus)
- Add recency weighting (prefer recent entries)
- Hybrid search: combine embedding similarity with keyword matching
- Move to Qwen3.5's native 262K context — at some point, just dump everything in-context
### Layer 2: Knowledge Graph
Status: Working, marginal lift (+2.5
Components:
- Graph: 103 nodes, 103 adjacency entries (knowledge_graph_v2.json)
- Traversal: BFS, max_depth=2
- Node types: Project, Service, Tech, Concept, Person
- Format: `[type] name: content` strings injected into context
Decoupled interface:
class GraphLayer:
def __init__(self, graph_path: Path):
self.nodes, self.adjacency = load_graph(graph_path)
def traverse(self, query: str, max_depth=2) -> str:
"""BFS from query-matched nodes, returns context string."""Why marginal lift: The graph only fires when query terms match node names exactly. Many questions don't trigger any graph traversal. The RAG layer already captures most of the same information.
Improvement opportunities:
- Fuzzy node matching (embedding-based seed selection, not just string matching)
- Expand graph to 500+ nodes (currently only 103)
- Add temporal edges (what changed when)
- Integrate with Graph Kernel service (:8001) for live traversal
### Layer 3: RLM (Recursive Language Model)
Status: Working, meaningful lift on multi-hop (+3.9
What it does:
1. Detects complex queries via signal phrases ("how does", "what connects", etc.)
2. Uses the LLM itself to decompose into 2-3 sub-queries
3. Runs RAG + Graph for each sub-query independently
4. Deduplicates and merges all context
5. Final LLM call with enriched context
Decoupled interface:
class RLMLayer:
def __init__(self, rag: RAGLayer, graph: GraphLayer, llm_fn: Callable):
self.rag = rag
self.graph = graph
self.llm_fn = llm_fn
def should_decompose(self, query: str) -> bool:
"""Detect if query needs multi-hop decomposition."""
def decompose(self, query: str) -> list[str]:
"""Use LLM to split complex query into sub-queries."""
def retrieve(self, query: str) -> str:
"""Full pipeline: decompose → search → merge → format context."""Key insight from benchmarks: The RLM helps most on multi-hop questions (83
Improvement opportunities:
- Confidence-based decomposition (don't decompose if RAG alone scores high)
- Chain-of-thought decomposition (let the model explain why it's decomposing)
- Recursive depth > 1 (currently decompose → search, could do decompose → search → decompose again)
- Separate decomposition model (small/fast model decomposes, big model answers)
---
Serving Architecture
Phase 1: API-First (NOW)
┌──────────────┐
│ Clawdbot │
│ (Gateway) │
└──────┬───────┘
│ query
┌──────▼───────┐
│ CogTwin │
│ Server v4 │
│ │
│ ┌─────────┐ │
│ │ RAG │ │ ← Gemini embeddings
│ │ Layer │ │ ← 466 entries
│ └────┬────┘ │
│ ┌────▼────┐ │
│ │ Graph │ │ ← 103 nodes
│ │ Layer │ │ ← BFS traversal
│ └────┬────┘ │
│ ┌────▼────┐ │
│ │ RLM │ │ ← Decomposition
│ │ Layer │ │ ← Multi-hop merge
│ └────┬────┘ │
│ │ │
└───────┼──────┘
│ enriched prompt
┌──────▼───────┐
│ Together AI │
│ Qwen3-Next │
│ 80B-A3B │
│ (FREE) │
└──────────────┘Advantages: Zero cost, fast iteration, no local compute needed for inference.
Phase 2: Hybrid (NEXT)
┌──────────────┐
│ CogTwin │
│ Server v4 │
│ + Layers │
└──────┬───────┘
│
┌──────▼───────┐
┌─────│ Router │─────┐
│ └──────────────┘ │
│ │
┌──────▼───────┐ ┌──────▼───────┐
│ Mac4+Mac5 │ │ Together AI │
│ exo cluster │ │ (fallback) │
│ Qwen3.5 │ │ Qwen3-Next │
│ 35B-A3B Q4 │ │ 80B-A3B │
└──────────────┘ └──────────────┘Advantages: Local = zero marginal cost for high-volume. API = fallback when local is slow/busy.
Phase 3: Fine-Tuned Local (TARGET)
┌──────────────┐
│ Mac4+Mac5 │
│ exo cluster │
│ Qwen3.5 │
│ 35B-A3B │
│ + LoRA │ ← Fine-tuned on Mo's corpus
│ + Adapters │ ← Domain-specific experts
└──────────────┘Advantages: A model that knows Mo's projects natively (no RAG needed for common questions), with RAG as augmentation for long-tail knowledge.
---
Dataset Architecture
### Current Corpus
| Version | Records | Source |
|---------|---------|--------|
| V1-V5 | 43,173 | Conversations, Apple Notes, Discord, WORMS |
| V6 | +191 | Evoflow/TIE evolution |
| V7 | +58 | Meta-evolution |
| V8 | +100 | Deep convos, RLM-enhanced |
| V9 | +133 | Kimi synthesis |
| Total | ~43,655 | |
Corpus Pipeline (Decoupled)
Raw Sources → Extract → Deduplicate → Format → Split → Train/Val/Test
│ │
├─ Discord sessions ├─ SFT format (chat turns)
├─ Apple Notes ├─ DPO format (preferred/rejected)
├─ Kimi synthesis └─ Knowledge base (Q&A JSONL)
├─ Claude Code logs
└─ Voice transcripts### V10 Corpus Targets
1. 163K+ conversation turns from full session mining (Kimi DB)
2. Architecture knowledge from all AGENTS.md, SOUL.md, design docs
3. Project-specific Q&A generated from codebases
4. Multi-hop training pairs (question + decomposition + answer)
5. Preference pairs for DPO (good vs bad answers on same questions)
---
Evaluation Framework (Decoupled)
### Eval Suite
- 39 questions across 9 categories
- Keyword scoring: must_contain / must_not_contain
- Categories: v1_project, v1_personal, v1_arch, v1_style, v2_project, v2_tech, v2_service, v2_concept, multi_hop
### Eval Configs
| Config | Tests | Purpose |
|--------|-------|---------|
| A (Bare) | Base model capability | How much does the model know without context? |
| B (+RAG) | Retrieval impact | Does semantic search provide the right context? |
| C (+Graph) | Graph impact | Does BFS traversal add useful relationships? |
| D (+RLM) | Decomposition impact | Does multi-hop decomposition help complex queries? |
### Eval Targets
| Metric | Current | Target |
|--------|---------|--------|
| Config A (bare) | 29.5
| Config B (RAG) | 87.2
| Config C (RAG+Graph) | 89.7
| Config D (full) | 93.6
| Multi-hop specifically | 83
---
Implementation Roadmap
### Wave 1: Foundation (Week 1)
1. Decouple CogTwin server — Refactor `twin_server_v3.py` into modular layers (RAGLayer, GraphLayer, RLMLayer)
2. API backend adapter — Abstract LLM calls behind a provider interface (Ollama/Together/OpenRouter)
3. Benchmark suite upgrade — Expand from 39 → 100+ questions, add semantic scoring (not just keyword)
### Wave 2: Knowledge Expansion (Week 2)
4. Corpus V10 — Mine full 163K turns from Kimi DB into training format
5. Knowledge Base expansion — Generate 500+ Q&A pairs from all project docs
6. Graph expansion — Grow from 103 → 500+ nodes with fuzzy matching
### Wave 3: Model Migration (Week 3)
7. Pull Qwen3.5-35B-A3B on Mac4 via Ollama
8. Set up exo cluster — Mac4 + Mac5 tensor parallel for full model
9. Benchmark local vs API — Verify parity before switching
### Wave 4: Fine-Tuning (Week 4)
10. LoRA fine-tune on Qwen3.5-35B-A3B using V10 corpus
11. DPO training — Preference alignment on Mo's style
12. Ablation benchmark — Compare bare fine-tuned vs fine-tuned + RAG + Graph + RLM
### Wave 5: Production (Week 5)
13. Deploy fine-tuned model to exo cluster
14. Integrate with Clawdbot — CogTwin as a MCP tool / skill
15. Continuous learning pipeline — Auto-ingest new sessions into corpus
---
Key Decisions
### Why Qwen 3.5?
1. MoE architecture — 35B total, 3B active = same memory as Llama 3B but 10x knowledge
2. 1M context — Can hold entire session histories in-context
3. Multimodal — Native image understanding (future: screenshot comprehension)
4. Agent benchmarks — TAU2-Bench: 81.2
5. Available everywhere — Ollama, MLX, Together, OpenRouter
### Why Decouple?
1. Independent evaluation — Test each layer's contribution cleanly
2. Swappable models — Move from API to local to fine-tuned without touching retrieval code
3. Layer optimization — Improve RAG without touching Graph, improve Graph without touching RLM
4. Cost optimization — Use cheap/free API for most queries, expensive/local for complex ones
### Why Keep the RLM?
The benchmark proves it: RLM adds **+3.9
---
Files
| File | Purpose |
|---|---|
| `scripts/benchmark_api_qwen35.py` | API-based benchmark (ABCD configs) |
| `scripts/benchmark_cog_rlm.py` | Local Ollama benchmark (original) |
| `twin_server_v3.py` | Current monolithic server |
| `twin_rag_server.py` | RAG-only server |
| `local_finetune/data/` | Knowledge bases and training data |
| `eval_results/` | Benchmark result JSONs |
| `docs/ARCHITECTURE-V2.md` | This document |
Promotion Decision
Promote into a technical note or architecture paper with implementation anchors.
Source Anchor
Comp-Core/packages/cognitive-twin/docs/ARCHITECTURE-V2.md
Detected Structure
Method · Evaluation · References · Code Anchors · Architecture