Grand Diomande Research · Full HTML Reader

Pulse Plan: Cognitive Twin V2 — Decoupled RLM + Qwen 3.5

Benchmark results (March 4, 2026) using Qwen3-Next-80B-A3B via Together AI: - Config A (Bare): 29.5% → Config D (Full RLM): 93.6% - RAG is the biggest lever (+57.7%), RLM adds meaningful value on multi-hop (+3.9%) - API inference is fast (~1-2s/question) and free (Together serverless) - Target: 97%+ accuracy with fine-tuned Qwen3.5-35B-A3B on local exo cluster

Agents That Account for Themselves research note experiment writeup candidate score 44 .md

Full Public Reader

Pulse Plan: Cognitive Twin V2 — Decoupled RLM + Qwen 3.5

Created: 2026-03-04
Status: 🟢 ACTIVE
Priority: HIGH
Estimated Duration: 5 waves across 5 weeks

---

Context

Benchmark results (March 4, 2026) using Qwen3-Next-80B-A3B via Together AI:
- Config A (Bare): 29.5
- RAG is the biggest lever (+57.7
- API inference is fast (~1-2s/question) and free (Together serverless)
- Target: 97

---

## Wave 1: Decouple & Refactor
Duration: 3-4 days
Dependencies: None

Tasks

#### 1.1 — Extract RAGLayer Module
- Input: `twin_server_v3.py` monolithic code
- Output: `layers/rag.py` with clean interface
- Spec:
- `RAGLayer(kb_paths, embed_fn)` constructor
- `.search(query, top_k=3) -> list[SearchResult]`
- `.to_context(results) -> str`
- Support Gemini + Ollama embedding backends
- Unit tests with mock embeddings
- Verify: Run benchmark with extracted layer, scores unchanged

#### 1.2 — Extract GraphLayer Module
- Input: BFS traversal code from `twin_server_v3.py`
- Output: `layers/graph.py`
- Spec:
- `GraphLayer(graph_path)` constructor
- `.traverse(query, max_depth=2) -> str`
- Add fuzzy node matching via embeddings (not just string match)
- Unit tests
- Verify: Config C score should IMPROVE with fuzzy matching

#### 1.3 — Extract RLMLayer Module
- Input: Decomposition logic from both benchmark scripts
- Output: `layers/rlm.py`
- Spec:
- `RLMLayer(rag, graph, llm_fn)` constructor
- `.should_decompose(query) -> bool`
- `.decompose(query) -> list[str]`
- `.retrieve(query) -> str` (full pipeline)
- Configurable decomposition signals
- Unit tests with mock LLM
- Verify: Config D scores unchanged

#### 1.4 — Abstract LLM Provider
- Input: Direct API calls in benchmark
- Output: `providers/llm.py`
- Spec:
- `LLMProvider` base class
- `OllamaProvider(url)` — local models
- `TogetherProvider(api_key)` — Together AI
- `OpenRouterProvider(api_key)` — OpenRouter
- Common interface: `.generate(messages, max_tokens, temperature) -> (text, latency, usage)`
- Provider auto-detection from model name
- Verify: Benchmark works with `--provider ollama/together/openrouter`

#### 1.5 — Unified Server v4
- Input: Extracted layers + providers
- Output: `twin_server_v4.py`
- Spec:
- Compose layers: `CogTwin(llm, rag, graph, rlm)`
- Config-driven: `config.yaml` specifies which layers are active
- HTTP API: `POST /query` with `{"query": "...", "config": "D"}`
- Health endpoint: `GET /health`
- Backward compatible with v3 callers
- Verify: Serve queries via API, benchmark against it

Wave 1 Gate: All 4 benchmark configs pass with scores within ±1

---

## Wave 2: Knowledge Expansion
Duration: 4-5 days
Dependencies: Wave 1 complete (layers extracted)

Tasks

#### 2.1 — Corpus V10 Mining
- Input: Kimi synthesis DB (163K+ turns)
- Output: `data/ctv3_export_v10/` with SFT + DPO formats
- Spec:
- Extract all session turns from `kimi_memory.db`
- Deduplicate against V1-V9
- Format as `{"messages": [...]}` JSONL
- Generate preference pairs (DPO): take good answers, generate worse alternatives
- Target: 5000+ new high-quality training records
- Verify: Dataset stats, no duplicates, format validates

#### 2.2 — Knowledge Base Expansion
- Input: All project docs (AGENTS.md, SOUL.md, READMEs, design docs)
- Output: Expanded `knowledge_base_v3.jsonl`
- Spec:
- Parse all markdown docs in active projects
- Generate Q&A pairs using Claude/Gemini
- Categories: project facts, architecture decisions, personal preferences, relationships
- Target: 1000+ entries (up from 466)
- Verify: Benchmark Config B score improves to 92

#### 2.3 — Graph Expansion
- Input: Current 103-node graph + project docs
- Output: `knowledge_graph_v3.json` with 500+ nodes
- Spec:
- Auto-extract entities from project docs
- Add temporal edges (when was X created/updated)
- Add relationship types (uses, depends_on, created_by, part_of)
- Integrate with Graph Kernel :8001 for live queries
- Verify: Config C score improves, fuzzy matching fires on 80

#### 2.4 — Eval Suite Expansion
- Input: Current 39-question suite
- Output: `eval_suite_v3.json` with 100+ questions
- Spec:
- Add 30+ multi-hop questions (RLM's sweet spot)
- Add temporal questions ("When was X last updated?")
- Add preference questions ("How would Mo approach X?")
- Add negative tests (questions it should refuse)
- Add semantic scoring alongside keyword scoring
- Verify: Suite validates, baseline scores established

Wave 2 Gate: Knowledge base 1000+ entries, graph 500+ nodes, eval suite 100+ questions, Config B ≥ 92

---

## Wave 3: Model Migration
Duration: 3-4 days
Dependencies: Wave 1 (provider abstraction)

Tasks

#### 3.1 — Pull Qwen3.5-35B-A3B on Mac4
- Command: `ssh mac4 "ollama pull qwen3.5:35b"`
- Verify: `ollama run qwen3.5:35b "Hello"` returns valid response
- Check: Memory usage, inference speed (tok/s)
- Fallback: If doesn't fit in 16GB, use Q4_K_M quantization via GGUF

#### 3.2 — Set Up exo Cluster (Mac4 + Mac5)
- Spec:
- Install exo on Mac5
- Configure tensor parallel: model split across 32GB combined
- OpenAI-compatible API on :52415
- Test with `curl http://[ip]:52415/v1/chat/completions`
- Verify: Model responds correctly, latency acceptable (<3s for 200 tokens)

#### 3.3 — MLX Alternative Test
- Spec:
- Download `mlx-community/Qwen3.5-35B-A3B-4bit` on Mac4
- Run via `mlx_lm.server`
- Compare: throughput, memory, quality vs Ollama/exo
- Verify: Pick best local serving option

#### 3.4 — Local vs API Parity Benchmark
- Spec:
- Run full ABCD benchmark against local model
- Run same against Together API
- Compare scores, latencies, quality
- Decision: which to use as primary, which as fallback
- Verify: Local within 5

Wave 3 Gate: Local model serving at <3s/query, within 5

---

## Wave 4: Fine-Tuning
Duration: 5-7 days
Dependencies: Wave 2 (V10 corpus), Wave 3 (local model running)

Tasks

#### 4.1 — SFT Training (Together AI)
- Spec:
- Upload V10 corpus to Together AI
- Fine-tune Qwen3.5-35B-A3B with LoRA
- Hyperparams: lr=2e-5, epochs=3, rank=16
- Monitor training loss, eval loss
- Cost estimate: ~$5-15 for fine-tuning job
- Verify: Eval loss < 1.0, no catastrophic forgetting on general knowledge

#### 4.2 — DPO Training
- Spec:
- Take SFT checkpoint
- Train with preference pairs (Mo's style vs generic)
- Focus on: conciseness, directness, project-specific terminology
- Verify: A/B comparison with human eval (Mo judges 20 pairs)

#### 4.3 — Adapter Merge & Quantization
- Spec:
- Merge LoRA adapter into base weights
- Convert to GGUF (Q4_K_M)
- Deploy to Mac4 Ollama
- If too large: quantize further (Q3_K_M) or run on exo cluster
- Verify: Model loads, runs, produces good output

#### 4.4 — Fine-Tuned Model Ablation
- Spec:
- Run full ABCD benchmark on fine-tuned model
- Compare: fine-tuned bare (Config A) vs base bare
- Key question: Does fine-tuning reduce dependence on RAG?
- Target: Config A jumps from 29.5
- Verify: Overall Config D ≥ 95

Wave 4 Gate: Fine-tuned Config A ≥ 60

---

## Wave 5: Production & Integration
Duration: 3-4 days
Dependencies: Wave 4 (fine-tuned model deployed)

Tasks

#### 5.1 — Clawdbot Integration
- Spec:
- Register CogTwin v4 as a Clawdbot skill
- Route personal knowledge queries through CogTwin
- Fallback: if local model down, use API
- Add to heartbeat health checks
- Verify: Ask Clawdbot "What is BWB?" and get CogTwin-powered answer

#### 5.2 — Continuous Learning Pipeline
- Spec:
- Auto-ingest new session turns into knowledge base
- Nightly corpus update job (cron)
- Monthly re-benchmark (automated)
- Quarterly re-fine-tune if accuracy drops below 90
- Verify: New session content appears in KB within 24h

#### 5.3 — Multi-Modal Integration
- Spec:
- If Qwen3.5 has vision: feed screenshots to CogTwin
- Browser context → CogTwin → understands what Mo is looking at
- Voice input → transcription → CogTwin → response → TTS
- Verify: Send screenshot, get contextual response

#### 5.4 — Twin Swarm (Stretch Goal)
- Spec:
- Twin Alpha (Architect): Same base, architect system prompt
- Twin Beta (Builder): Same base, code-focused prompt
- Twin Gamma (Runner): Same base, lightweight tasks
- All share same LoRA weights, different system prompts
- Serve via Together AI serverless (3 instances, zero marginal cost)
- Verify: All three personas respond correctly to domain queries

Wave 5 Gate: CogTwin live in production, continuous learning active, Config D ≥ 97

---

Success Criteria

MetricCurrentWave 2Wave 4Wave 5
Config A (bare)29.5
Config B (RAG)87.2
Config C (RAG+Graph)89.7
Config D (full)93.6
Eval questions39100+100+150+
KB entries4661000+1000+1500+
Graph nodes103500+500+800+
Inference cost$0 | $0$0 | $0
Latency (Config D)1.9s1.5s2.5s (local)1.0s

---

Risk Register

RiskProbabilityImpactMitigation
Qwen3.5-35B doesn't fit Mac4 16GBMediumHighUse exo cluster or Q3 quantization
Fine-tuning degrades general knowledgeMediumHighEval includes general questions, early stopping
Together API goes down/paidLowMediumLocal fallback ready from Wave 3
Mac5 not configured in timeMediumMediumMac4 solo with Q4 quantization
Corpus V10 has quality issuesLowHighManual review of 100 random samples

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

Comp-Core/packages/cognitive-twin/docs/PULSE-PLAN-COGTWIN-V2.md

Detected Structure

Method · Evaluation · References · Figures · Code Anchors · Architecture