Pulse Plan: Cognitive Twin V2 — Decoupled RLM + Qwen 3.5
Benchmark results (March 4, 2026) using Qwen3-Next-80B-A3B via Together AI: - Config A (Bare): 29.5% → Config D (Full RLM): 93.6% - RAG is the biggest lever (+57.7%), RLM adds meaningful value on multi-hop (+3.9%) - API inference is fast (~1-2s/question) and free (Together serverless) - Target: 97%+ accuracy with fine-tuned Qwen3.5-35B-A3B on local exo cluster
Full Public Reader
Pulse Plan: Cognitive Twin V2 — Decoupled RLM + Qwen 3.5
Created: 2026-03-04
Status: 🟢 ACTIVE
Priority: HIGH
Estimated Duration: 5 waves across 5 weeks
---
Context
Benchmark results (March 4, 2026) using Qwen3-Next-80B-A3B via Together AI:
- Config A (Bare): 29.5
- RAG is the biggest lever (+57.7
- API inference is fast (~1-2s/question) and free (Together serverless)
- Target: 97
---
## Wave 1: Decouple & Refactor
Duration: 3-4 days
Dependencies: None
Tasks
#### 1.1 — Extract RAGLayer Module
- Input: `twin_server_v3.py` monolithic code
- Output: `layers/rag.py` with clean interface
- Spec:
- `RAGLayer(kb_paths, embed_fn)` constructor
- `.search(query, top_k=3) -> list[SearchResult]`
- `.to_context(results) -> str`
- Support Gemini + Ollama embedding backends
- Unit tests with mock embeddings
- Verify: Run benchmark with extracted layer, scores unchanged
#### 1.2 — Extract GraphLayer Module
- Input: BFS traversal code from `twin_server_v3.py`
- Output: `layers/graph.py`
- Spec:
- `GraphLayer(graph_path)` constructor
- `.traverse(query, max_depth=2) -> str`
- Add fuzzy node matching via embeddings (not just string match)
- Unit tests
- Verify: Config C score should IMPROVE with fuzzy matching
#### 1.3 — Extract RLMLayer Module
- Input: Decomposition logic from both benchmark scripts
- Output: `layers/rlm.py`
- Spec:
- `RLMLayer(rag, graph, llm_fn)` constructor
- `.should_decompose(query) -> bool`
- `.decompose(query) -> list[str]`
- `.retrieve(query) -> str` (full pipeline)
- Configurable decomposition signals
- Unit tests with mock LLM
- Verify: Config D scores unchanged
#### 1.4 — Abstract LLM Provider
- Input: Direct API calls in benchmark
- Output: `providers/llm.py`
- Spec:
- `LLMProvider` base class
- `OllamaProvider(url)` — local models
- `TogetherProvider(api_key)` — Together AI
- `OpenRouterProvider(api_key)` — OpenRouter
- Common interface: `.generate(messages, max_tokens, temperature) -> (text, latency, usage)`
- Provider auto-detection from model name
- Verify: Benchmark works with `--provider ollama/together/openrouter`
#### 1.5 — Unified Server v4
- Input: Extracted layers + providers
- Output: `twin_server_v4.py`
- Spec:
- Compose layers: `CogTwin(llm, rag, graph, rlm)`
- Config-driven: `config.yaml` specifies which layers are active
- HTTP API: `POST /query` with `{"query": "...", "config": "D"}`
- Health endpoint: `GET /health`
- Backward compatible with v3 callers
- Verify: Serve queries via API, benchmark against it
Wave 1 Gate: All 4 benchmark configs pass with scores within ±1
---
## Wave 2: Knowledge Expansion
Duration: 4-5 days
Dependencies: Wave 1 complete (layers extracted)
Tasks
#### 2.1 — Corpus V10 Mining
- Input: Kimi synthesis DB (163K+ turns)
- Output: `data/ctv3_export_v10/` with SFT + DPO formats
- Spec:
- Extract all session turns from `kimi_memory.db`
- Deduplicate against V1-V9
- Format as `{"messages": [...]}` JSONL
- Generate preference pairs (DPO): take good answers, generate worse alternatives
- Target: 5000+ new high-quality training records
- Verify: Dataset stats, no duplicates, format validates
#### 2.2 — Knowledge Base Expansion
- Input: All project docs (AGENTS.md, SOUL.md, READMEs, design docs)
- Output: Expanded `knowledge_base_v3.jsonl`
- Spec:
- Parse all markdown docs in active projects
- Generate Q&A pairs using Claude/Gemini
- Categories: project facts, architecture decisions, personal preferences, relationships
- Target: 1000+ entries (up from 466)
- Verify: Benchmark Config B score improves to 92
#### 2.3 — Graph Expansion
- Input: Current 103-node graph + project docs
- Output: `knowledge_graph_v3.json` with 500+ nodes
- Spec:
- Auto-extract entities from project docs
- Add temporal edges (when was X created/updated)
- Add relationship types (uses, depends_on, created_by, part_of)
- Integrate with Graph Kernel :8001 for live queries
- Verify: Config C score improves, fuzzy matching fires on 80
#### 2.4 — Eval Suite Expansion
- Input: Current 39-question suite
- Output: `eval_suite_v3.json` with 100+ questions
- Spec:
- Add 30+ multi-hop questions (RLM's sweet spot)
- Add temporal questions ("When was X last updated?")
- Add preference questions ("How would Mo approach X?")
- Add negative tests (questions it should refuse)
- Add semantic scoring alongside keyword scoring
- Verify: Suite validates, baseline scores established
Wave 2 Gate: Knowledge base 1000+ entries, graph 500+ nodes, eval suite 100+ questions, Config B ≥ 92
---
## Wave 3: Model Migration
Duration: 3-4 days
Dependencies: Wave 1 (provider abstraction)
Tasks
#### 3.1 — Pull Qwen3.5-35B-A3B on Mac4
- Command: `ssh mac4 "ollama pull qwen3.5:35b"`
- Verify: `ollama run qwen3.5:35b "Hello"` returns valid response
- Check: Memory usage, inference speed (tok/s)
- Fallback: If doesn't fit in 16GB, use Q4_K_M quantization via GGUF
#### 3.2 — Set Up exo Cluster (Mac4 + Mac5)
- Spec:
- Install exo on Mac5
- Configure tensor parallel: model split across 32GB combined
- OpenAI-compatible API on :52415
- Test with `curl http://[ip]:52415/v1/chat/completions`
- Verify: Model responds correctly, latency acceptable (<3s for 200 tokens)
#### 3.3 — MLX Alternative Test
- Spec:
- Download `mlx-community/Qwen3.5-35B-A3B-4bit` on Mac4
- Run via `mlx_lm.server`
- Compare: throughput, memory, quality vs Ollama/exo
- Verify: Pick best local serving option
#### 3.4 — Local vs API Parity Benchmark
- Spec:
- Run full ABCD benchmark against local model
- Run same against Together API
- Compare scores, latencies, quality
- Decision: which to use as primary, which as fallback
- Verify: Local within 5
Wave 3 Gate: Local model serving at <3s/query, within 5
---
## Wave 4: Fine-Tuning
Duration: 5-7 days
Dependencies: Wave 2 (V10 corpus), Wave 3 (local model running)
Tasks
#### 4.1 — SFT Training (Together AI)
- Spec:
- Upload V10 corpus to Together AI
- Fine-tune Qwen3.5-35B-A3B with LoRA
- Hyperparams: lr=2e-5, epochs=3, rank=16
- Monitor training loss, eval loss
- Cost estimate: ~$5-15 for fine-tuning job
- Verify: Eval loss < 1.0, no catastrophic forgetting on general knowledge
#### 4.2 — DPO Training
- Spec:
- Take SFT checkpoint
- Train with preference pairs (Mo's style vs generic)
- Focus on: conciseness, directness, project-specific terminology
- Verify: A/B comparison with human eval (Mo judges 20 pairs)
#### 4.3 — Adapter Merge & Quantization
- Spec:
- Merge LoRA adapter into base weights
- Convert to GGUF (Q4_K_M)
- Deploy to Mac4 Ollama
- If too large: quantize further (Q3_K_M) or run on exo cluster
- Verify: Model loads, runs, produces good output
#### 4.4 — Fine-Tuned Model Ablation
- Spec:
- Run full ABCD benchmark on fine-tuned model
- Compare: fine-tuned bare (Config A) vs base bare
- Key question: Does fine-tuning reduce dependence on RAG?
- Target: Config A jumps from 29.5
- Verify: Overall Config D ≥ 95
Wave 4 Gate: Fine-tuned Config A ≥ 60
---
## Wave 5: Production & Integration
Duration: 3-4 days
Dependencies: Wave 4 (fine-tuned model deployed)
Tasks
#### 5.1 — Clawdbot Integration
- Spec:
- Register CogTwin v4 as a Clawdbot skill
- Route personal knowledge queries through CogTwin
- Fallback: if local model down, use API
- Add to heartbeat health checks
- Verify: Ask Clawdbot "What is BWB?" and get CogTwin-powered answer
#### 5.2 — Continuous Learning Pipeline
- Spec:
- Auto-ingest new session turns into knowledge base
- Nightly corpus update job (cron)
- Monthly re-benchmark (automated)
- Quarterly re-fine-tune if accuracy drops below 90
- Verify: New session content appears in KB within 24h
#### 5.3 — Multi-Modal Integration
- Spec:
- If Qwen3.5 has vision: feed screenshots to CogTwin
- Browser context → CogTwin → understands what Mo is looking at
- Voice input → transcription → CogTwin → response → TTS
- Verify: Send screenshot, get contextual response
#### 5.4 — Twin Swarm (Stretch Goal)
- Spec:
- Twin Alpha (Architect): Same base, architect system prompt
- Twin Beta (Builder): Same base, code-focused prompt
- Twin Gamma (Runner): Same base, lightweight tasks
- All share same LoRA weights, different system prompts
- Serve via Together AI serverless (3 instances, zero marginal cost)
- Verify: All three personas respond correctly to domain queries
Wave 5 Gate: CogTwin live in production, continuous learning active, Config D ≥ 97
---
Success Criteria
| Metric | Current | Wave 2 | Wave 4 | Wave 5 |
|---|---|---|---|---|
| Config A (bare) | 29.5 | |||
| Config B (RAG) | 87.2 | |||
| Config C (RAG+Graph) | 89.7 | |||
| Config D (full) | 93.6 | |||
| Eval questions | 39 | 100+ | 100+ | 150+ |
| KB entries | 466 | 1000+ | 1000+ | 1500+ |
| Graph nodes | 103 | 500+ | 500+ | 800+ |
| Inference cost | $0 | $0 | $0 | $0 | ||
| Latency (Config D) | 1.9s | 1.5s | 2.5s (local) | 1.0s |
---
Risk Register
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Qwen3.5-35B doesn't fit Mac4 16GB | Medium | High | Use exo cluster or Q3 quantization |
| Fine-tuning degrades general knowledge | Medium | High | Eval includes general questions, early stopping |
| Together API goes down/paid | Low | Medium | Local fallback ready from Wave 3 |
| Mac5 not configured in time | Medium | Medium | Mac4 solo with Q4 quantization |
| Corpus V10 has quality issues | Low | High | Manual review of 100 random samples |
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
Comp-Core/packages/cognitive-twin/docs/PULSE-PLAN-COGTWIN-V2.md
Detected Structure
Method · Evaluation · References · Figures · Code Anchors · Architecture