Grand Diomande Research · Full HTML Reader

Cognitive Twin Architecture V2 — Decoupled RLM + Qwen 3.5 Migration

The Cognitive Twin is Mo's personal AI delegate — a model that knows his projects, preferences, reasoning patterns, and history. V1 used Llama 3.2:3B locally with a tightly-coupled RAG+Graph+RLM stack. V2 decouples every layer, swaps the base model to Qwen 3.5, and creates a clean evaluation pipeline.

Agents That Account for Themselves architecture technical paper candidate score 62 .md

Full Public Reader

Cognitive Twin Architecture V2 — Decoupled RLM + Qwen 3.5 Migration

Executive Summary

Key Results (Benchmark: March 4, 2026)

Config	Score	What it proves
A: Bare Qwen3-Next-80B	29.5
B: + RAG	87.2
C: + Graph	89.7
D: + RLM	93.6

---

Architecture Layers (Decoupled)

### Layer 0: Base Model
Current: Qwen3-Next-80B-A3B (Together AI, serverless, $0)
Target: Qwen3.5-35B-A3B (local on Mac4+Mac5 exo cluster OR API)

The base model is swappable. Any OpenAI-compatible chat endpoint works.

Model Options (tested/available):
| Model | Where | Active Params | Score (Config A) | Cost |
|-------|-------|---------------|------------------|------|
| Llama 3.2:3B | Mac4 Ollama | 3B | ~25
| Qwen3-Next-80B-A3B | Together API | 3B | 29.5
| Qwen3.5-35B-A3B | OpenRouter | 3B | TBD | $0.16/M |
| Qwen3.5-35B-A3B | Mac4+Mac5 exo | 3B | TBD | $0 |
| Qwen3.5-397B-A17B | Together API | 17B | TBD | $0.60/M |

Migration path: Start with API for evaluation speed, migrate to local exo cluster for $0 inference at scale.

### Layer 1: RAG (Retrieval-Augmented Generation)
Status: Working, biggest impact (+57.7

Components:
- Knowledge Base: 466 entries in JSONL (knowledge_base.jsonl + knowledge_base_v2.jsonl)
- Embeddings: Gemini gemini-embedding-001 (batch API, fast, free tier)
- Search: Cosine similarity, top-k=3, min_sim=0.25
- Format: Q&A pairs injected into system prompt

Decoupled interface:

python

class RAGLayer:
    def __init__(self, kb_paths: list[Path], embed_fn: Callable):
        self.entries = load_knowledge_base(kb_paths)
        self.embed_fn = embed_fn

    def search(self, query: str, top_k=3) -> list[tuple[float, dict]]:
        """Returns [(similarity, entry), ...] ranked by relevance."""

    def to_context(self, results) -> str:
        """Formats search results for system prompt injection."""

Improvement opportunities:
- Increase knowledge base from 466 → 1000+ entries (V10+ corpus)
- Add recency weighting (prefer recent entries)
- Hybrid search: combine embedding similarity with keyword matching
- Move to Qwen3.5's native 262K context — at some point, just dump everything in-context

### Layer 2: Knowledge Graph
Status: Working, marginal lift (+2.5

Components:
- Graph: 103 nodes, 103 adjacency entries (knowledge_graph_v2.json)
- Traversal: BFS, max_depth=2
- Node types: Project, Service, Tech, Concept, Person
- Format: `[type] name: content` strings injected into context

Decoupled interface:

python

class GraphLayer:
    def __init__(self, graph_path: Path):
        self.nodes, self.adjacency = load_graph(graph_path)

    def traverse(self, query: str, max_depth=2) -> str:
        """BFS from query-matched nodes, returns context string."""

Why marginal lift: The graph only fires when query terms match node names exactly. Many questions don't trigger any graph traversal. The RAG layer already captures most of the same information.

Improvement opportunities:
- Fuzzy node matching (embedding-based seed selection, not just string matching)
- Expand graph to 500+ nodes (currently only 103)
- Add temporal edges (what changed when)
- Integrate with Graph Kernel service (:8001) for live traversal

### Layer 3: RLM (Recursive Language Model)
Status: Working, meaningful lift on multi-hop (+3.9

What it does:
1. Detects complex queries via signal phrases ("how does", "what connects", etc.)
2. Uses the LLM itself to decompose into 2-3 sub-queries
3. Runs RAG + Graph for each sub-query independently
4. Deduplicates and merges all context
5. Final LLM call with enriched context

Decoupled interface:

python

class RLMLayer:
    def __init__(self, rag: RAGLayer, graph: GraphLayer, llm_fn: Callable):
        self.rag = rag
        self.graph = graph
        self.llm_fn = llm_fn

    def should_decompose(self, query: str) -> bool:
        """Detect if query needs multi-hop decomposition."""

    def decompose(self, query: str) -> list[str]:
        """Use LLM to split complex query into sub-queries."""

    def retrieve(self, query: str) -> str:
        """Full pipeline: decompose → search → merge → format context."""

Key insight from benchmarks: The RLM helps most on multi-hop questions (83

Improvement opportunities:
- Confidence-based decomposition (don't decompose if RAG alone scores high)
- Chain-of-thought decomposition (let the model explain why it's decomposing)
- Recursive depth > 1 (currently decompose → search, could do decompose → search → decompose again)
- Separate decomposition model (small/fast model decomposes, big model answers)

---

Serving Architecture

Phase 1: API-First (NOW)

                    ┌──────────────┐
                    │  Clawdbot    │
                    │  (Gateway)   │
                    └──────┬───────┘
                           │ query
                    ┌──────▼───────┐
                    │  CogTwin     │
                    │  Server v4   │
                    │              │
                    │  ┌─────────┐ │
                    │  │  RAG    │ │  ← Gemini embeddings
                    │  │  Layer  │ │  ← 466 entries
                    │  └────┬────┘ │
                    │  ┌────▼────┐ │
                    │  │  Graph  │ │  ← 103 nodes
                    │  │  Layer  │ │  ← BFS traversal
                    │  └────┬────┘ │
                    │  ┌────▼────┐ │
                    │  │  RLM    │ │  ← Decomposition
                    │  │  Layer  │ │  ← Multi-hop merge
                    │  └────┬────┘ │
                    │       │      │
                    └───────┼──────┘
                           │ enriched prompt
                    ┌──────▼───────┐
                    │  Together AI │
                    │  Qwen3-Next  │
                    │  80B-A3B     │
                    │  (FREE)      │
                    └──────────────┘

Advantages: Zero cost, fast iteration, no local compute needed for inference.

Phase 2: Hybrid (NEXT)

                    ┌──────────────┐
                    │  CogTwin     │
                    │  Server v4   │
                    │  + Layers    │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
              ┌─────│  Router      │─────┐
              │     └──────────────┘     │
              │                          │
       ┌──────▼───────┐          ┌──────▼───────┐
       │  Mac4+Mac5   │          │  Together AI │
       │  exo cluster │          │  (fallback)  │
       │  Qwen3.5     │          │  Qwen3-Next  │
       │  35B-A3B Q4  │          │  80B-A3B     │
       └──────────────┘          └──────────────┘

Advantages: Local = zero marginal cost for high-volume. API = fallback when local is slow/busy.

Phase 3: Fine-Tuned Local (TARGET)

       ┌──────────────┐
       │  Mac4+Mac5   │
       │  exo cluster │
       │  Qwen3.5     │
       │  35B-A3B     │
       │  + LoRA      │  ← Fine-tuned on Mo's corpus
       │  + Adapters  │  ← Domain-specific experts
       └──────────────┘

Advantages: A model that knows Mo's projects natively (no RAG needed for common questions), with RAG as augmentation for long-tail knowledge.

---

Dataset Architecture

### Current Corpus
| Version | Records | Source |
|---------|---------|--------|
| V1-V5 | 43,173 | Conversations, Apple Notes, Discord, WORMS |
| V6 | +191 | Evoflow/TIE evolution |
| V7 | +58 | Meta-evolution |
| V8 | +100 | Deep convos, RLM-enhanced |
| V9 | +133 | Kimi synthesis |
| Total | ~43,655 | |

Corpus Pipeline (Decoupled)

Raw Sources → Extract → Deduplicate → Format → Split → Train/Val/Test
     │                                    │
     ├─ Discord sessions                  ├─ SFT format (chat turns)
     ├─ Apple Notes                       ├─ DPO format (preferred/rejected)
     ├─ Kimi synthesis                    └─ Knowledge base (Q&A JSONL)
     ├─ Claude Code logs
     └─ Voice transcripts

### V10 Corpus Targets
1. 163K+ conversation turns from full session mining (Kimi DB)
2. Architecture knowledge from all AGENTS.md, SOUL.md, design docs
3. Project-specific Q&A generated from codebases
4. Multi-hop training pairs (question + decomposition + answer)
5. Preference pairs for DPO (good vs bad answers on same questions)

---

Evaluation Framework (Decoupled)

### Eval Suite
- 39 questions across 9 categories
- Keyword scoring: must_contain / must_not_contain
- Categories: v1_project, v1_personal, v1_arch, v1_style, v2_project, v2_tech, v2_service, v2_concept, multi_hop

### Eval Configs
| Config | Tests | Purpose |
|--------|-------|---------|
| A (Bare) | Base model capability | How much does the model know without context? |
| B (+RAG) | Retrieval impact | Does semantic search provide the right context? |
| C (+Graph) | Graph impact | Does BFS traversal add useful relationships? |
| D (+RLM) | Decomposition impact | Does multi-hop decomposition help complex queries? |

### Eval Targets
| Metric | Current | Target |
|--------|---------|--------|
| Config A (bare) | 29.5
| Config B (RAG) | 87.2
| Config C (RAG+Graph) | 89.7
| Config D (full) | 93.6
| Multi-hop specifically | 83

---

Implementation Roadmap

### Wave 1: Foundation (Week 1)
1. Decouple CogTwin server — Refactor `twin_server_v3.py` into modular layers (RAGLayer, GraphLayer, RLMLayer)
2. API backend adapter — Abstract LLM calls behind a provider interface (Ollama/Together/OpenRouter)
3. Benchmark suite upgrade — Expand from 39 → 100+ questions, add semantic scoring (not just keyword)

### Wave 2: Knowledge Expansion (Week 2)
4. Corpus V10 — Mine full 163K turns from Kimi DB into training format
5. Knowledge Base expansion — Generate 500+ Q&A pairs from all project docs
6. Graph expansion — Grow from 103 → 500+ nodes with fuzzy matching

### Wave 3: Model Migration (Week 3)
7. Pull Qwen3.5-35B-A3B on Mac4 via Ollama
8. Set up exo cluster — Mac4 + Mac5 tensor parallel for full model
9. Benchmark local vs API — Verify parity before switching

### Wave 4: Fine-Tuning (Week 4)
10. LoRA fine-tune on Qwen3.5-35B-A3B using V10 corpus
11. DPO training — Preference alignment on Mo's style
12. Ablation benchmark — Compare bare fine-tuned vs fine-tuned + RAG + Graph + RLM

### Wave 5: Production (Week 5)
13. Deploy fine-tuned model to exo cluster
14. Integrate with Clawdbot — CogTwin as a MCP tool / skill
15. Continuous learning pipeline — Auto-ingest new sessions into corpus

---

Key Decisions

### Why Qwen 3.5?
1. MoE architecture — 35B total, 3B active = same memory as Llama 3B but 10x knowledge
2. 1M context — Can hold entire session histories in-context
3. Multimodal — Native image understanding (future: screenshot comprehension)
4. Agent benchmarks — TAU2-Bench: 81.2
5. Available everywhere — Ollama, MLX, Together, OpenRouter

### Why Decouple?
1. Independent evaluation — Test each layer's contribution cleanly
2. Swappable models — Move from API to local to fine-tuned without touching retrieval code
3. Layer optimization — Improve RAG without touching Graph, improve Graph without touching RLM
4. Cost optimization — Use cheap/free API for most queries, expensive/local for complex ones

### Why Keep the RLM?
The benchmark proves it: RLM adds **+3.9

---

Files

File	Purpose
`scripts/benchmark_api_qwen35.py`	API-based benchmark (ABCD configs)
`scripts/benchmark_cog_rlm.py`	Local Ollama benchmark (original)
`twin_server_v3.py`	Current monolithic server
`twin_rag_server.py`	RAG-only server
`local_finetune/data/`	Knowledge bases and training data
`eval_results/`	Benchmark result JSONs
`docs/ARCHITECTURE-V2.md`	This document

Promotion Decision

Promote into a technical note or architecture paper with implementation anchors.

Source Anchor

Comp-Core/packages/cognitive-twin/docs/ARCHITECTURE-V2.md

Detected Structure

Method · Evaluation · References · Code Anchors · Architecture