Cognitive Twin V9 — Evolution³
**Generated:** 2026-02-18 **Method:** Evolution³ — three-stage recursive evoflow **Core Question:** How should we train, deploy, and integrate the Cognitive Twin V9 into our multi-machine architecture (Mac1 gateway + Mac4 local compute + Together AI cloud + 3 Claude Max accounts) to maximize autonomy, minimize cost, and keep the model evergreen with our rapidly evolving ecosystem?
# Cognitive Twin V9 — Evolution³
### Stage 1: Explore → Stage 2: Compound → Stage 3: Master Plan
Context Inherited:
- Current dataset: 77,708 records (V5+V6+V7+V8 combined, CTv3.1 JSONL)
- V9 expansion potential: ~2,635 new records from 8 sources
- 141 skills, 32 CLAUDE.md files, 23 pulse plans, 30K+ Kimi memory turns
- Mac4: M4 Mac Mini 16GB, Ollama (Llama 3.2:3B, MiniMax M2.5), macOS 26.3
- Mac1: M4 MacBook Air 16GB, Clawdbot gateway, daily driver
- Together AI: Serverless LoRA on Qwen3-235B ($0.20/$0.60/MTk) or Llama 4 Maverick
- 3 Claude Max accounts (free frontier inference, rate-limited)
- Graph Kernel + RAG++ + Cortex + Dream Weaver operational
- Twin Swarm DEP (Feb 14): Alpha/Beta/Gamma lanes designed but never deployed
- Previous blockers: Together AI billing limit, Vast.ai not rented
---
STAGE 1: EXPLORE
Path 1: "The Distillery" — Quantized Qwen3-235B on Mac4
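Concept: Fine-tune Qwen3-235B-A22B on Together AI/Vast.ai, merge the adapter into the base weights, quantize the merged model to GGUF Q4_K_M, and run it locally on Mac4 via Ollama. The MoE architecture means only 22B params are active per token, which keeps per-token compute low. Memory is the open question: every expert must stay loaded (or be paged from disk) because routing picks experts per token, so fitting in 16GB RAM would take far more than standard quantization.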
Why it works:
- Qwen3-235B has best-in-class coding performance among open models
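- MoE with 22B active params has per-token compute comparable to a 22B dense model
- Q4_K_M quantization of the 22B active params needs ~12-14GB VRAM/RAM, but the full 235B expert set must also be resident or streamed from disk (on the order of 140GB at Q4_K_M)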
- Mac4's M4 chip has unified memory — no CPU-GPU transfer overhead
- Once deployed, inference is completely free and private
- 262K context window survives quantization
Profit/value angle:
- $0/month inference after one-time training cost ($40-75)
- No external dependency — works offline, no rate limits
- Private — training data never leaves our machines during inference
- If it works, this replaces the entire Together AI serving layer
Risks:
- GGUF MoE quantization for Qwen3 is bleeding edge — tooling may not exist yet
- 16GB is razor thin — OS + Ollama overhead could cause OOM
- Quantization quality loss on a LoRA-merged MoE model is unknown territory
- Mac4 inference speed may be too slow for real-time agent work (tokens/sec?)
- If it doesn't fit, we wasted the training budget
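A back-of-envelope check of the 16GB assumption may be useful before spending the training budget. The sketch below estimates the weight footprint from parameter count and average bits per weight. The 4.85 bits/weight figure for Q4_K_M is an approximation assumed for illustration, and KV cache and runtime overhead are ignored.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (10^9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

Q4_K_M_BITS = 4.85  # assumed average bits per weight, not an exact spec value

active_only = weight_footprint_gb(22, Q4_K_M_BITS)    # weights touched per token
all_experts = weight_footprint_gb(235, Q4_K_M_BITS)   # weights that must be stored

print(f"22B active params:  ~{active_only:.1f} GB")
print(f"235B total params: ~{all_experts:.1f} GB")
```

The active-parameter figure lands in the 12-14GB range quoted above, but the stored-weight figure is what has to fit in memory or be paged from disk.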
---
Path 2: "The Cascade" — Tiered Model Stack with Skill-Aware Routing
Concept: Train THREE separate LoRA adapters at different sizes, each specialized for a tier of task complexity. Mac4 runs the small model always-on, Together AI runs the medium on-demand, frontier (Claude Max) handles the rest. Clawdbot's model router (already partially built) dispatches based on task classification.
Tiers:
- T0 (Mac4 local): Llama 3.2-3B + LoRA → triage, classification, simple completions, heartbeat checks, density scoring. Always on. Free.
- T1 (Together AI): Qwen3-235B + LoRA → feature implementation, architecture decisions, code review. On-demand. $0.20-0.60/MTk.
- T2 (Claude Max): Claude Opus 4.6 → novel reasoning, complex debugging, creative work. Escalation only. Free (subscription).
Why it works:
- Matches our existing 3-account dual-max architecture
- T0 handles 60-70% of tasks
- T1 handles 20-30% of tasks
- T2 handles <10% of tasks
- Each tier is optimized for its role — no wasted capacity
- The routing layer already exists in Cortex daemon
Profit/value angle:
- Total cost drops to $30-100/mo (mostly T1 usage)
- T0 gives us an always-on local brain that understands our ecosystem
- The 3B LoRA trains in hours on Mac4 itself (no Vast.ai needed for T0)
- T1 LoRA trains once on Together AI, serves via serverless
- We keep Claude Max for genuinely hard problems
Risks:
- Training 3 separate adapters = 3x training overhead
- Routing accuracy is critical — misrouted tasks waste money or produce bad output
- 3B model may not understand enough to route correctly
- Context gap between tiers (T0 doesn't know what T1 worked on)
---
Path 3: "The Mirror" — Continuous Self-Distillation Loop
Concept: Instead of one-shot training, build a continuous pipeline where every Clawdbot session generates training data in real-time. Mac4 runs a perpetual training loop: ingest today's sessions → generate SFT/DPO pairs → fine-tune → deploy updated model → repeat. The Twin is never stale because it's always learning.
Pipeline:
1. Clawdbot sessions (all channels) → RAG++ bridge → kimi_memory.db
2. Density scorer (MiniMax on Mac4) filters for CORE+ENRICHED turns
3. Training pair generator converts high-density turns → CTv3.1 format
4. Mac4 runs LoRA training on accumulated data (nightly batch)
5. Updated model deployed to Ollama, replaces previous version
6. Eval suite runs, gates deployment (reject if quality drops)
Why it works:
- Solves the "stale model" problem permanently
- Leverages Mac4's idle overnight compute (Mo sleeps, Mac4 trains)
- Density scoring pipeline already exists and works
- RAG++ bridge already ingests all sessions in real-time
- The Twin evolves WITH the ecosystem, not behind it
Profit/value angle:
- No recurring training costs (Mac4 does it locally)
- Model improves daily, automatically
- New projects, skills, and patterns are absorbed within 24h
- Competitive advantage: most fine-tunes are snapshots, ours is a living model
Risks:
- Mac4 can only train small models (3-8B) locally — M4 16GB limits us
- Training quality on consumer hardware vs H100 cluster
- Catastrophic forgetting if nightly updates aren't properly regularized
- Need robust eval to prevent quality regression
- Accumulated drift over time without periodic full retraining
---
Path 4: "The Diplomat" — MiniMax M2.5 as Primary Twin
Concept: Skip the LoRA training entirely. MiniMax M2.5 is already running on Mac4 via Ollama, already handles density scoring, and already processes through Cortex. Instead of training a new model, deeply prompt-engineer MiniMax with our full governance stack (AGENTS.md, SOUL.md, skills) and use RAG++ to inject project context. Essentially: make MiniMax our Twin through context, not weights.
Why it works:
- Zero training cost, zero training time
- MiniMax M2.5 is surprisingly capable (handles Cortex routing well)
- Already deployed, already running, already tested
- We can iterate on prompts hourly vs waiting for training runs
- If MiniMax adds LoRA support later, we have the data ready
Profit/value angle:
- Immediate deployment — could be "Twin V9" by end of day
- $0 inference (local Ollama)
- Low risk — if it doesn't work well enough, we still have the training data for a real LoRA
- Buys time while we solve billing/Vast.ai blockers
Risks:
- Context injection hits the same token/compaction limits as current architecture
- MiniMax M2.5 reasoning may not match fine-tuned Qwen3
- Not a real "twin" — it's still a generic model with fancy prompting
- Doesn't solve the fundamental problem (context in weights vs context in window)
---
Path 5: "The Federation" — Multi-Model Ensemble with Consensus
Concept: Don't pick one model. Run the Twin as an ensemble: fine-tuned 3B on Mac4 + fine-tuned Qwen3-235B on Together + MiniMax M2.5 on Mac4, all receiving the same task. A lightweight consensus layer (could be the 3B itself) evaluates responses and picks the best one, or synthesizes a merged answer.
Inspired by: Dream Weaver's Evo³ pattern (Gemini + MiniMax + Kimi-K2 evolving together), Swarm Consensus system at `Desktop/swarm-consensus/`.
Why it works:
- Diversity of models catches different failure modes
- Consensus is more reliable than any single model
- We already have the swarm-consensus voting infrastructure
- Small model can be the fast-path (if confident, skip ensemble)
- Large model validates when small model is uncertain
Profit/value angle:
- Best quality ceiling — ensemble beats individual models
- Graceful degradation — if Together AI is down, local models still work
- The consensus patterns become training data for the next version
- Aligns with our multi-model philosophy (we already run 5+ models)
Risks:
- 3x latency if all models must respond before consensus
- Complexity: debugging ensemble failures is hard
- Cost: running 3 models per query is expensive even if 2 are local
- Consensus layer itself needs to be smart enough to evaluate
- Over-engineered for simple tasks (heartbeats don't need 3 opinions)
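The fast-path idea (skip the ensemble when the small model is confident) can be sketched as a vote over answer labels. This is a minimal illustration, not the swarm-consensus implementation: the function name, the 0.85 threshold, and the use of simple majority voting are all assumptions.

```python
from collections import Counter

def consensus(fast_answer, fast_confidence, ensemble_answers, threshold=0.85):
    """Return (answer, reason): the fast-path answer if confident, else the majority vote."""
    if fast_confidence >= threshold:
        return fast_answer, "fast-path"
    # Below threshold: the small model's answer counts as one vote among the ensemble.
    votes = Counter([fast_answer, *ensemble_answers])
    answer, count = votes.most_common(1)[0]
    return answer, f"majority {count}/{sum(votes.values())}"

print(consensus("route-local", 0.93, []))
print(consensus("route-local", 0.55, ["escalate", "escalate"]))
```

Majority voting only works for discrete outputs such as routing labels; free-form answers would need the evaluating or synthesizing layer the concept describes.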
---
Path 6: "The Surgeon" — Targeted Micro-LoRAs per Domain
Concept: Instead of one monolithic LoRA, train multiple small LoRAs — one per domain (coding, architecture, planning, N'Ko, business). Load the right LoRA at inference time based on task classification. Ollama supports hot-swapping LoRA adapters.
LoRA domains:
- `twin-code`: Coding patterns, conventional commits, RTD verification
- `twin-arch`: Architecture decisions, CLAUDE.md patterns, system design
- `twin-plan`: Pulse planning, wave-gating, task decomposition
- `twin-nko`: N'Ko/Manding language patterns, cross-script bridge
- `twin-biz`: Business context (Koji, BWB, Shopify, B2B sales)
- `twin-ops`: Heartbeat patterns, agent management, infrastructure
Why it works:
- Each LoRA is small and fast to train (subset of data per domain)
- Domain specialization > generalist dilution
- Can update individual domains without retraining everything
- Base model stays clean — no domain interference
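- Ollama can apply a LoRA adapter to a base model through the Modelfile `ADAPTER` instruction, so each domain adapter can be packaged as its own model tag on M-series Macs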
Profit/value angle:
- Faster iteration: update `twin-code` when coding patterns change, leave others alone
- Better quality: each adapter deeply trained on its domain
- Smaller adapters = faster training = Mac4 can do it locally
- Mix-and-match: load `twin-code` + `twin-nko` for N'Ko keyboard work
Risks:
- Router must correctly classify domain before loading adapter
- Cross-domain tasks (coding + architecture) need multi-adapter merging
- More moving parts = more failure modes
- Total training time across all domains may exceed monolithic approach
- Adapter merging quality is unpredictable
---
STAGE 2: COMPOUND
Step 1: Foundation — What We Actually Need the Twin to Do
Inherits: Stage 1 all paths — synthesizing requirements from all six directions
Before choosing an architecture, we need clarity on what the Twin must accomplish. From analyzing all six paths and our current ecosystem, the Twin's job is:
Primary roles (ordered by frequency):
1. Triage & Route (~40%)
2. Code Generation (~25%)
3. Planning & Decomposition (~15%)
4. Knowledge Retrieval (~10%)
5. Autonomous Operations (~10%)
Critical constraint: The Twin must work WITHIN Clawdbot's existing tool infrastructure. It needs to call read/write/exec/message/etc — not be a standalone chatbot. This means it must be served as a model endpoint that Clawdbot can route to.
The real insight from Path 2 (Cascade): Not all roles need the same model. Triage (~40% of the workload) does not need the same model capacity as code generation or planning.
---
Step 2: The Two-Brain Architecture — Building on Step 1's Role Analysis
Inherits: Step 1 — role frequency analysis, tool integration requirement
From Step 1, we know triage and routing account for ~40% of the Twin's work.
Brain 1: "Cortex Twin" (Mac4 local, always-on)
- Base: Llama 3.2-3B + LoRA (or Qwen2.5-3B if better coding benchmarks)
- Role: Triage, routing, classification, heartbeats, simple completions, density scoring
- Serves: Ollama on Mac4 (http://[ip]:11434)
- Training: Can train ON Mac4 itself (3B LoRA trains in 2-4 hours on M4)
- Update cycle: Weekly (absorbs new sessions, skills, plans)
- Cost: $0 (local compute)
Brain 2: "Reasoning Twin" (Together AI, on-demand)
- Base: Qwen3-235B-A22B + LoRA
- Role: Feature implementation, architecture decisions, complex planning, code review
- Serves: Together AI serverless LoRA ($0.20/$0.60/MTk)
- Training: Vast.ai or Together AI fine-tuning ($40-75 per run)
- Update cycle: Monthly (curated high-quality data only)
- Cost: $30-150/mo depending on usage
The bridge between brains: Cortex daemon (already running on Mac4 port 18081) classifies incoming tasks and routes to Brain 1 or Brain 2. If Brain 1 confidence is high (>0.85), it handles it directly. If low, it escalates to Brain 2. If Brain 2 confidence is also low or task is novel, escalate to frontier (Claude Max — free).
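The escalation rule described above can be sketched as a small cascade. This is an illustration under stated assumptions: each model is a callable returning a response plus a self-reported confidence, Brain 2 reuses the same 0.85 threshold, and the `route` function and its stubs are hypothetical rather than Cortex's actual code.

```python
from typing import Callable, Tuple

# A model is any callable returning (response, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]

def route(task: str, brain1: Model, brain2: Model,
          frontier: Callable[[str], str],
          threshold: float = 0.85, is_novel: bool = False) -> Tuple[str, str]:
    """Return (model_name, response) following Brain 1 -> Brain 2 -> frontier."""
    response, confidence = brain1(task)
    if confidence > threshold:
        return "twin-cortex", response
    response, confidence = brain2(task)
    if confidence > threshold and not is_novel:
        return "twin-reasoning", response
    # Brain 2 was unsure or the task is novel: escalate to the frontier model.
    return "claude-opus", frontier(task)

# Stub models for illustration only.
easy = route("heartbeat check", lambda t: ("ok", 0.95),
             lambda t: ("n/a", 0.0), lambda t: "n/a")
hard = route("refactor the bridge", lambda t: ("?", 0.40),
             lambda t: ("patch", 0.90), lambda t: "n/a")
novel = route("new research idea", lambda t: ("?", 0.30),
              lambda t: ("guess", 0.90), lambda t: "deep answer", is_novel=True)
print(easy, hard, novel, sep="\n")
```

Small models are often poorly calibrated, so the confidence value may need to come from the routing eval rather than from the model itself.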
This inherits the best elements from:
- Path 1 (local inference) → Brain 1
- Path 2 (tiered cascade) → the routing logic
- Path 3 (continuous learning) → Brain 1's weekly update cycle
- Path 4 (MiniMax as interim) → MiniMax stays as density scorer alongside Brain 1
---
Step 3: Training Data Strategy — Building on Steps 1-2's Architecture
Inherits: Steps 1-2 — two-brain architecture, role specialization, update cycles
Each brain needs different training data because they serve different roles (Step 1) at different capacities (Step 2):
Brain 1 (Cortex Twin, 3B) — Training Data:
Focus: Classification accuracy, routing decisions, operational patterns
- Source A: Skill routing examples — For each of 141 skills, generate 3-5 "when to invoke" examples → ~500 SFT pairs
- Source B: Pulse plan decomposition summaries — Task descriptions → correct wave assignment → ~200 SFT pairs
- Source C: Heartbeat decision patterns — From memory files, extract "check X → found Y → action Z" chains → ~150 SFT pairs
- Source D: Triage DPO pairs — "Route to local" (preferred) vs "Escalate unnecessarily" (rejected) → ~300 DPO pairs
- Source E: Kimi memory high-density turns (CORE 9-10 only) — Pure personality/decision DNA → ~400 SFT pairs
- Total Brain 1: ~1,250 SFT + 300 DPO = 1,550 new records
- Merged with: Subset of V5 base (triage-relevant turns only, ~5,000) = ~6,550 total
Brain 2 (Reasoning Twin, 235B) — Training Data:
Focus: Code quality, architecture decisions, deep reasoning
- Source A: CLAUDE.md architecture specs → design decision Q&A pairs → ~300 SFT pairs
- Source B: Full pulse plans → "how to decompose this feature" → ~200 SFT pairs
- Source C: Governance protocols (AGENTS.md, SOUL.md, PROTOCOLS.md) → behavioral pairs → ~150 SFT pairs
- Source D: Code-grounded examples from V5 base (Claude Code sessions) → keep all ~35,000 coding turns
- Source E: DPO pairs — autonomous execution (preferred) vs permission-seeking (rejected) → keep all 740 existing + ~260 new = ~1,000 DPO
- Source F: Memory file architecture narratives → reasoning chain examples → ~200 SFT pairs
- Total Brain 2: ~850 new SFT + 260 new DPO on top of V8 base
- Merged with: Full V5+V6+V7+V8 (77,708) = ~78,818 total
Key insight: Brain 1 gets a CURATED, SMALL, FOCUSED dataset. Brain 2 gets the FULL corpus. This is because:
- 3B models learn better from focused data (less noise to overfit on)
- 235B models can handle the full distribution without confusion
- Brain 1 only needs to be RIGHT about routing, not CREATIVE
- Brain 2 needs the full context to generate high-quality code
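The record counts above can be cross-checked with a few lines of arithmetic. The dictionary keys are shorthand labels chosen for this sketch; the numbers are the estimates listed for each source.

```python
# Brain 1: new SFT sources A, B, C, E plus DPO source D.
brain1_sft = {"skills": 500, "plans": 200, "heartbeat": 150, "kimi_core": 400}
brain1_dpo = 300
brain1_new = sum(brain1_sft.values()) + brain1_dpo
brain1_total = brain1_new + 5_000          # plus the triage-relevant V5 subset

# Brain 2: new SFT sources A, B, C, F plus new DPO pairs from source E.
brain2_sft = {"claude_md": 300, "plans": 200, "governance": 150, "memory": 200}
brain2_new_dpo = 260
brain2_new = sum(brain2_sft.values()) + brain2_new_dpo
brain2_total = 77_708 + brain2_new         # plus the full V5-V8 corpus

print(brain1_new, brain1_total)
print(brain2_new, brain2_total)
```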
---
Step 4: Training Pipeline — Building on Steps 1-3's Data Strategy
Inherits: Steps 1-3 — two brains, different datasets, different update cycles
Brain 1 Training (Mac4 local, weekly):
Weekly Cycle (every Sunday night, Mo sleeping):
1. Extract: Pull last 7 days from kimi_memory.db + new session logs
2. Score: MiniMax M2.5 density scorer (already built) → filter CORE+ENRICHED
3. Generate: Convert to SFT/DPO using template prompts
4. Merge: Append to Brain 1 dataset (cumulative, capped at 10K to prevent drift)
5. Train: LoRA fine-tune on Mac4 (MLX or llama.cpp, ~2-4 hours for 3B)
6. Eval: Run routing accuracy test suite (target: >90% correct routing)
7. Deploy: Hot-swap Ollama adapter if eval passes
8. Log: Write results to memory/twin-training-log.md

Key detail from Path 3 (Mirror): We don't train from scratch each week. We accumulate data but use a sliding window (last 30 days weighted 2x, older weighted 1x) to prevent catastrophic forgetting while staying current.
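One way to realize the 2x/1x sliding window is to turn record age into a sampling weight. The sketch below assumes each record carries a `date` field, which is a hypothetical schema detail, and samples with replacement for simplicity.

```python
import random
from datetime import date, timedelta

def sample_weight(recorded: date, today: date, window_days: int = 30) -> float:
    """2x weight inside the recent window, 1x for older records."""
    return 2.0 if today - recorded <= timedelta(days=window_days) else 1.0

def draw_batch(records, today: date, k: int, seed: int = 0):
    """Sample k records with replacement, favoring recent ones."""
    weights = [sample_weight(r["date"], today) for r in records]
    return random.Random(seed).choices(records, weights=weights, k=k)

records = [
    {"id": "old", "date": date(2025, 12, 1)},
    {"id": "new", "date": date(2026, 2, 10)},
]
batch = draw_batch(records, today=date(2026, 2, 18), k=8)
print([r["id"] for r in batch])
```

An alternative with the same effect is to duplicate recent records once when writing the training file, which avoids depending on a trainer that supports per-sample weights.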
Brain 2 Training (Cloud, monthly):
Monthly Cycle (1st of each month):
1. Audit: Review V9 audit doc for new data sources
2. Generate: Full expansion pipeline (Gemini 3 Pro for generation, RLM for quality gate)
3. Merge: Combine with existing dataset (dedup by content hash)
4. Upload: Push to Together AI or Vast.ai
5. Train: LoRA fine-tune on Qwen3-235B (H100 cluster, ~15-30 hours, $40-75)
6. Eval: Run Twin Fidelity suite (target: >0.80, currently 0.772 baseline)
7. Deploy: Upload adapter to Together AI serverless
8. Benchmark: Run side-by-side vs previous version on 50 held-out tasks

Cost model (from Steps 1-2):
- Brain 1 training: $0/month (Mac4 local)
- Brain 2 training: $40-75/month (cloud GPU)
- Brain 1 inference: $0/month (Mac4 local)
- Brain 2 inference: $30-150/month (Together AI serverless)
- **Total: $70-225/month** (vs. no marginal cost today on the Claude Max subscriptions, which are rate-limited)
The honest comparison: Claude Max is "free" but rate-limited and burns 3 subscription slots. The Twin frees those slots for genuinely complex work while handling routine stuff locally.
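The monthly total follows directly from the four line items, and Brain 2's inference line can be estimated from token volume. In this sketch the $0.20/$0.60 prices are read as input/output per million tokens, which is an assumption about the quoted rate, and the token volumes are made up for illustration.

```python
# (low, high) monthly USD ranges from the cost model.
MONTHLY = {
    "brain1_training": (0, 0),
    "brain2_training": (40, 75),
    "brain1_inference": (0, 0),
    "brain2_inference": (30, 150),
}
total_low = sum(low for low, _ in MONTHLY.values())
total_high = sum(high for _, high in MONTHLY.values())

def inference_cost(input_mtok: float, output_mtok: float,
                   in_price: float = 0.20, out_price: float = 0.60) -> float:
    """USD cost for a month of serverless usage, volumes in millions of tokens."""
    return input_mtok * in_price + output_mtok * out_price

print(f"Total: ${total_low}-{total_high}/month")
print(f"Example: ${inference_cost(100, 50):.2f} for 100M input + 50M output tokens")
```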
---
Step 5: Integration Architecture — Building on Steps 1-4's Pipeline
Inherits: Steps 1-4 — two brains, training pipelines, cost model, Cortex routing
How the Twin plugs into Clawdbot's existing infrastructure:
Clawdbot Model Router (already exists in config):

    models:
      twin-cortex:
        provider: ollama
        endpoint: http://[ip]:11434
        model: twin-cortex-3b:latest
        role: triage
      twin-reasoning:
        provider: together
        model: Qwen/Qwen3-235B-A22B-Instruct-2507
        lora: twin-reasoning-v9
        role: coding, architecture, planning
      claude-opus:
        provider: anthropic
        model: claude-opus-4-6
        role: escalation

Request flow:

    User message → Clawdbot gateway
      → Cortex daemon classifies task type + complexity
        → If triage/simple/heartbeat → twin-cortex (Mac4, free)
        → If coding/architecture/planning → twin-reasoning (Together AI)
        → If novel/complex/failing → claude-opus (Max account, free but limited)
      → Response routes back through Clawdbot → user

Tool access: Both Twins need tool access. Options:
1. Clawdbot proxy: Twin generates tool call JSON, Clawdbot executes and returns results. Keeps tool infra centralized.
2. Direct MCP: Twin connects directly to file system, exec, etc. More autonomous but harder to monitor.
3. Hybrid: Brain 1 (triage) uses Clawdbot proxy (simple tool calls). Brain 2 (reasoning) gets direct MCP for coding tasks.
Recommendation: Option 1 (Clawdbot proxy) for V9. Keep it simple. Direct MCP is a V10 optimization.
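A minimal sketch of the proxy pattern in Option 1: the Twin emits a tool call as JSON, and a central registry validates and executes it. The `{"tool": ..., "args": ...}` shape, the `execute_tool_call` name, and the stub tools are hypothetical; Clawdbot's real tool interface is not shown in this document.

```python
import json

# Stub tools for illustration; the real read/write/exec/message tools live in Clawdbot.
TOOLS = {
    "read": lambda path: f"<contents of {path}>",
    "message": lambda to, text: f"sent to {to}: {text}",
}

def execute_tool_call(raw: str) -> dict:
    """Parse a model-emitted tool call and run it through the central registry."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"invalid JSON: {exc.msg}"}
    if not isinstance(call, dict):
        return {"ok": False, "error": "tool call must be a JSON object"}
    name = call.get("tool")
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        result = TOOLS[name](**call.get("args", {}))
    except TypeError as exc:  # wrong or missing arguments
        return {"ok": False, "error": str(exc)}
    return {"ok": True, "result": result}

print(execute_tool_call('{"tool": "read", "args": {"path": "active-tasks.md"}}'))
print(execute_tool_call('{"tool": "rm", "args": {}}'))
```

Because every call passes through one function, logging and allowlisting stay in one place, which is the monitoring advantage the recommendation relies on.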
Integration with existing systems:
- RAG++: Both brains query RAG++ for current file state (Twin knows patterns, RAG++ knows "now")
- Graph Kernel: Brain 2 uses graph for dependency analysis, architecture queries
- Dream Weaver: Brain 1 routes dream seeds, Brain 2 evaluates emergence
- Pulse Plans: Brain 1 monitors wave gates, Brain 2 generates task specs
- Memory: Both read/write to memory files through Clawdbot's file tools
---
Step 6: Eval Framework & Quality Gates — Building on Steps 1-5
Inherits: Steps 1-5 — full architecture, both brains, integration points
We can't deploy what we can't measure. From Step 4's eval targets and the existing Twin Fidelity baseline (0.772):
Brain 1 (Cortex Twin) Eval Suite:
| Test | Metric | Target | Method |
|---|---|---|---|
| Routing accuracy | Share of tasks routed correctly | >90% | Routing accuracy test suite |
| Classification speed | tokens/sec on Mac4 | >50 tok/s | Benchmark 100 classifications |
| False escalation rate | | | |
| Heartbeat quality | | | |
| Personality fidelity | Matches Mo's communication style | Score >0.8 | A/B test vs real Mo responses |
Brain 2 (Reasoning Twin) Eval Suite:
| Test | Metric | Target | Method |
|---|---|---|---|
| Twin Fidelity | Overall similarity to Mo's patterns | >0.80 | Existing eval framework |
| Code quality | Builds pass, conventional commits | >95% | |
| Architecture alignment | Matches CLAUDE.md patterns | >85% | |
| Permission-seeking | DPO effectiveness | <5% | |
| Context adherence | References correct project context | >90% | |
Quality gate for deployment:
1. Both brains must pass ALL targets before deployment
2. A/B test: run Twin alongside Claude Max for 48h, compare outputs
3. Gradual rollout: 10%
4. Automatic rollback if any metric drops >10%
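The gate and rollback rules can be expressed as two small checks. This sketch makes several assumptions: targets are stored as fractions (so the 5 for permission-seeking becomes 0.05), only two of the table's metrics are included, and "drops >10" is read as a relative drop of more than 10% against the pre-deployment baseline for higher-is-better metrics.

```python
import operator

# (comparison, threshold) per metric; rates expressed as fractions.
TARGETS = {
    "twin_fidelity": (operator.gt, 0.80),
    "permission_seeking": (operator.lt, 0.05),
}

def passes_gate(metrics: dict, targets: dict = TARGETS) -> bool:
    """True only if every target metric is present and satisfied."""
    return all(
        name in metrics and compare(metrics[name], limit)
        for name, (compare, limit) in targets.items()
    )

def should_rollback(baseline: dict, current: dict, max_drop: float = 0.10) -> bool:
    """True if any higher-is-better metric fell by more than max_drop (relative)."""
    return any(
        baseline[name] > 0
        and (baseline[name] - current.get(name, 0.0)) / baseline[name] > max_drop
        for name in baseline
    )

print(passes_gate({"twin_fidelity": 0.772, "permission_seeking": 0.03}))
print(should_rollback({"twin_fidelity": 0.80}, {"twin_fidelity": 0.70}))
```

With the current 0.772 fidelity baseline the gate fails, which matches the document's framing of 0.80 as a target that has not been reached yet.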
---
Step 7: Evergreen Strategy — Building on Steps 1-6's Full System
Inherits: Steps 1-6 — architecture, training, integration, eval
The Twin must stay current as the ecosystem evolves. This is where Path 3 (Mirror) and Path 6 (Surgeon) combine:
Continuous learning loop:
Daily:
- All Clawdbot sessions → kimi_memory.db (already happening)
- Density scorer tags new high-value turns (Mac4, free)
Weekly (Brain 1):
- Aggregate scored turns → generate training pairs
- LoRA fine-tune on Mac4 (2-4 hours overnight)
- Eval gate → deploy if passing
Monthly (Brain 2):
- Full audit of new data sources (new skills, CLAUDE.md updates, etc.)
- Generate expansion dataset (Gemini 3 Pro)
- LoRA fine-tune on cloud ($40-75)
- Eval gate → deploy to Together AI
Quarterly:
- Full dataset review: prune stale data, rebalance domains
- Consider base model upgrade (if better models available)
- Re-evaluate Brain 1 vs Brain 2 split ratio

Domain-specific micro-updates (from Path 6):
When a specific domain changes significantly (e.g., new pulse plan skill, new project), generate a focused update batch for that domain only. Brain 1 can absorb this in a targeted overnight run without touching other domains.
Drift detection:
- Track routing accuracy weekly; if it drops below 85%
- Track Twin Fidelity monthly; if below 0.75, trigger full audit
- Track false escalation rate; if above 20%
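The three drift thresholds can be checked in one function. This sketch treats the routing and escalation thresholds as fractions (0.85 and 0.20) and only reports which checks were breached; the function name is hypothetical, and what to do about each breach is left to the caller.

```python
def drift_alerts(routing_accuracy: float, twin_fidelity: float,
                 false_escalation_rate: float) -> list:
    """Return the names of drift checks that breached their thresholds."""
    alerts = []
    if routing_accuracy < 0.85:
        alerts.append("routing_accuracy")
    if twin_fidelity < 0.75:
        alerts.append("twin_fidelity")
    if false_escalation_rate > 0.20:
        alerts.append("false_escalation_rate")
    return alerts

print(drift_alerts(0.91, 0.78, 0.12))
print(drift_alerts(0.82, 0.74, 0.25))
```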
The V10 horizon: Once the two-brain system is proven, expand to:
- Domain micro-LoRAs (Path 6) for specialized tasks
- Ensemble consensus (Path 5) for high-stakes decisions
- Full MoE distillation (Path 1) if Mac4 gets a RAM upgrade
---
Step 8: Synthesis — The Complete V9 Twin Architecture
Inherits: Steps 1-7 — everything
The Cognitive Twin V9 is a two-brain system:
1. Cortex Twin (Brain 1): A fine-tuned 3B model on Mac4 that handles triage, routing, heartbeats, and simple operations. Trained weekly from density-scored session data. Free to run. Covers ~50% of tasks.
2. Reasoning Twin (Brain 2): A fine-tuned Qwen3-235B on Together AI that handles coding, architecture, and complex planning. Trained monthly from curated high-quality data. Costs $30-150/mo. Covers ~40% of tasks.
3. Frontier Escalation (existing): Claude Max accounts handle the remaining ~10% of tasks.
The training pipeline is asymmetric by design:
- Brain 1: Small data, fast iteration, local training, weekly updates
- Brain 2: Large data, careful curation, cloud training, monthly updates
- Both share the same eval framework but with different targets
Total monthly cost: $70-225 (training + inference), replacing ~50%
Timeline to deployment: 3-4 weeks from go signal.
---
STAGE 3: EXPAND + MASTER PLAN
3a. Audit
What holds strong:
✅ Two-brain split is well-justified. The role frequency analysis (Step 1) clearly shows 50%
✅ Mac4 as always-on local brain. Already running Ollama, already has MiniMax scoring pipeline, already connected via Tailscale. Zero new infrastructure needed for Brain 1.
✅ Training data strategy is sound. Brain 1 gets focused data (classification accuracy matters more than creativity). Brain 2 gets full corpus (needs the breadth). Different data for different roles.
✅ Weekly Brain 1 updates solve the evergreen problem. No more "stale model" — new skills, plans, and patterns absorbed within 7 days. Mac4 handles it overnight for free.
✅ Clawdbot proxy for tool access (V9). Keeps integration simple. Direct MCP can come in V10 after the basic system is proven.
✅ Gradual rollout with automatic rollback. 10%
What breaks under pressure:
⚠️ Brain 1 routing accuracy is the single point of failure. If the 3B model misroutes a complex task as "simple," the response quality tanks. Mitigation: conservative confidence threshold (0.85) — when in doubt, escalate.
⚠️ Together AI serverless LoRA availability. If Together has an outage or changes their LoRA serving API, Brain 2 goes down. Mitigation: fallback to Claude Max (already configured). Also consider uploading adapter to a second provider.
⚠️ 3B model ceiling for routing. 3B params may not understand nuanced task classification (e.g., "is this a simple N'Ko keyboard fix or a complex cross-script bridge architecture change?"). Mitigation: include domain-specific routing examples in training data. If accuracy stays below 85%
⚠️ Monthly Brain 2 updates may be too slow. If a major architecture shift happens mid-month, Brain 2 won't know about it until next training cycle. Mitigation: allow emergency mid-cycle training for critical changes.
⚠️ Billing blocker still exists. Together AI billing limit was the original V8 blocker. Need to resolve before Brain 2 can train or serve. Mitigation: check current billing status, consider Vast.ai as fallback for training.
What's missing:
🔴 No plan for N'Ko/multilingual data. Brain 2 needs Manding/Bambara patterns but Qwen3's multilingual coverage of N'Ko is unknown. Need to test base model on N'Ko tasks before committing.
🔴 No offline fallback for Brain 2 tasks. If Together AI is down AND Claude Max is rate-limited, complex tasks have no path. Consider: can a quantized 7B on Mac4 serve as emergency Brain 2?
🔴 Inter-brain context sharing. When Brain 1 triages a task and Brain 2 picks it up, how does Brain 2 know what Brain 1 observed? Need a context handoff protocol.
🔴 Training data generation tooling. V8 used `run_v8_gemini3.py` for Gemini-powered generation. V9 needs updated scripts for each new source (skills, pulse plans, governance docs). These don't exist yet.
🔴 Mac4 MLX/llama.cpp training benchmarks. We assume 2-4 hours for 3B LoRA on M4 but haven't actually benchmarked this. Need to validate before committing to weekly cycle.
---
3b. Expand — Deep-Dives
Deep-Dive 1: Inter-Brain Context Handoff Protocol
When Brain 1 triages a task and escalates to Brain 2, Brain 2 needs:
1. The original user message
2. Brain 1's classification (task type, complexity score, suggested skill)
3. Relevant context Brain 1 pulled from RAG++ during classification
4. The routing reason (why Brain 1 escalated)
Protocol:

    {
      "handoff": {
        "from": "cortex-twin",
        "to": "reasoning-twin",
        "task_id": "uuid",
        "original_message": "...",
        "classification": {
          "type": "coding",
          "complexity": 0.78,
          "skill": "bot:pulse",
          "project": "PULSE-V1"
        },
        "context": [
          {"source": "rag++", "doc_id": "...", "relevance": 0.92},
          {"source": "memory", "file": "active-tasks.md", "section": "..."}
        ],
        "escalation_reason": "Complexity above Brain 1 threshold (0.78 > 0.70)",
        "timestamp": "2026-02-18T12:00:00Z"
      }
    }

Clawdbot's Cortex daemon manages this handoff. Brain 2 receives the handoff as a structured context prefix before the task prompt.
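A sketch of how the handoff payload might be built and turned into a context prefix. The field names follow the protocol above; the `build_handoff` and `with_context_prefix` helpers and the `[HANDOFF]` delimiters are hypothetical choices for illustration, not Cortex's actual format.

```python
import json
import uuid
from datetime import datetime, timezone

def build_handoff(message: str, classification: dict,
                  context: list, reason: str) -> dict:
    """Assemble the Brain 1 -> Brain 2 handoff payload."""
    return {"handoff": {
        "from": "cortex-twin",
        "to": "reasoning-twin",
        "task_id": str(uuid.uuid4()),
        "original_message": message,
        "classification": classification,
        "context": context,
        "escalation_reason": reason,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }}

def with_context_prefix(handoff: dict, task_prompt: str) -> str:
    """Prepend the handoff as a JSON block ahead of the task prompt."""
    return f"[HANDOFF]\n{json.dumps(handoff, indent=2)}\n[/HANDOFF]\n\n{task_prompt}"

handoff = build_handoff(
    "Add wave gating to the pulse planner",
    {"type": "coding", "complexity": 0.78, "skill": "bot:pulse", "project": "PULSE-V1"},
    [{"source": "memory", "file": "active-tasks.md", "section": "..."}],
    "Complexity above Brain 1 threshold (0.78 > 0.70)",
)
prompt = with_context_prefix(handoff, "Implement the change described above.")
print(prompt)
```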
Deep-Dive 2: V9 Training Data Generation Scripts
New scripts needed (placed in `scripts/`):
| Script | Source | Output | Records |
|---|---|---|---|
| `gen_v9_skills.py` | 141 SKILL.md files | Skill routing SFT pairs | ~500 |
| `gen_v9_plans.py` | 23 pulse plan JSONs | Task decomposition SFT pairs | ~200 |
| `gen_v9_governance.py` | AGENTS/SOUL/HEARTBEAT/PROTOCOLS.md | Behavioral SFT + DPO pairs | ~250 |
| `gen_v9_architecture.py` | 32 CLAUDE.md files | Design decision SFT pairs | ~300 |
| `gen_v9_memory.py` | 35+ memory files | Decision narrative SFT pairs | ~250 |
| `gen_v9_kimi.py` | kimi_memory.db (density-filtered) | Conversation SFT pairs | ~750 |
| `merge_v9.py` | All V9 sources + V8 base | Final combined datasets | ~80K total |
Each generator uses Gemini 3 Pro for pair creation + RLM quality gate (existing pattern from V8).
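For `merge_v9.py`, deduplication by content hash could look like the sketch below. It assumes each record is a JSON-serializable dict and hashes the whole record in canonical form; the actual CTv3.1 schema may call for hashing only the conversation fields, so treat the choice of hashed content as an open design decision.

```python
import hashlib
import json

def content_hash(record: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def merge_dedup(*datasets):
    """Concatenate datasets in order, keeping the first copy of each record."""
    seen, merged = set(), []
    for dataset in datasets:
        for record in dataset:
            digest = content_hash(record)
            if digest not in seen:
                seen.add(digest)
                merged.append(record)
    return merged

v8 = [{"prompt": "a", "response": "1"}]
v9 = [{"response": "1", "prompt": "a"}, {"prompt": "b", "response": "2"}]
print(len(merge_dedup(v8, v9)))
```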
Deep-Dive 3: Mac4 Training Benchmark
Before committing to weekly Brain 1 training:

    # On Mac4 via SSH:
    # 1. Install MLX-LM (Apple's MLX-based fine-tuning and inference package for M-series)
    pip install mlx-lm
    # 2. Download Llama 3.2-3B Instruct and convert it to MLX format
    mlx_lm.convert --hf-path meta-llama/Llama-3.2-3B-Instruct --mlx-path ./llama-3.2-3b-mlx
    # 3. Benchmark LoRA training on 1000 records
    #    mlx_lm.lora counts iterations, not epochs: 1000 records / batch 4 = 250 iters per epoch,
    #    so 2 epochs is about 500 iters. Older mlx-lm releases name --num-layers as --lora-layers.
    time mlx_lm.lora --model ./llama-3.2-3b-mlx \
      --train --data ./test-data \
      --batch-size 4 --num-layers 16 --iters 500
    # Expected: 2-4 hours for 6,500 records on M4 16GB
    # If >6 hours, consider reducing dataset or epochs

Deep-Dive 4: Together AI Billing Resolution
Current status check needed:

    together billing status
    # Or check https://api.together.xyz/settings/billing

Options if still blocked:
1. Add payment method — resolve billing limit directly
2. Vast.ai for training only — rent A100 80GB ($0.40-1.00/hr), train there, upload adapter to Together AI for serving
3. Lambda Labs — alternative GPU cloud, often cheaper spot instances
4. RunPod — another alternative, good Unsloth/Axolotl support
For serving, Together AI serverless LoRA is still the best option ($0.20/$0.60/MTk). Training can happen anywhere.
---
3c. Master Execution Checklist
---
PHASE 1: Data Preparation (Days 1-3)
- [ ] 1.1 Resolve Together AI billing status
- Owner: Mo (requires payment method)
- Input: Together AI account access
- Output: Active billing, confirmed LoRA training + serving access
- Validation: `together billing status` returns active
- Depends on: Nothing
- Status: Not Started
- [ ] 1.2 Benchmark Mac4 LoRA training
- Owner: Claw (via Mac4 SSH)
- Input: Mac4 access, MLX-LM installed
- Output: Training time benchmark for 3B LoRA on 1K/5K/10K records
- Validation: Benchmark report with tokens/sec, total time, memory usage
- Depends on: Nothing
- Status: Not Started
- [ ] 1.3 Complete Kimi memory density scoring
- Owner: Claw (Mac4)
- Input: kimi_memory.db (30,704 messages), density scorer pipeline
- Output: All messages scored, filtered CORE+ENRICHED set extracted
- Validation: density_scores file with 30K+ entries, CORE/ENRICHED subset extracted
- Depends on: Nothing
- Status: In Progress (9,155/30,704 scored)
- [ ] 1.4 Write V9 training data generators
- Owner: Claw
- Input: V9 audit doc, existing V8 scripts as templates
- Output: 7 generator scripts in `scripts/` (skills, plans, governance, architecture, memory, kimi, merge)
- Validation: Each script runs without errors, produces valid CTv3.1 JSONL
- Depends on: Nothing
- Status: Not Started
- [ ] 1.5 Test Qwen3-235B N'Ko/multilingual capability
- Owner: Claw
- Input: 10 N'Ko test prompts, Qwen3-235B via Together AI
- Output: N'Ko capability assessment (pass/fail per task)
- Validation: Report showing Qwen3's N'Ko token handling, generation quality
- Depends on: 1.1 (Together AI access)
- Status: Not Started
---
PHASE 2: Brain 1 — Cortex Twin (Days 3-7)
- [ ] 2.1 Generate Brain 1 training dataset
- Owner: Claw
- Input: V9 generators (1.4), density-scored Kimi data (1.3)
- Output: `data/brain1_v9/sft_train.jsonl` + `dpo_train.jsonl` (~6,550 records)
- Validation: Record count matches estimate, format validates, no duplicates
- Depends on: 1.3, 1.4
- Status: Not Started
- [ ] 2.2 Train Brain 1 LoRA on Mac4
- Owner: Claw (Mac4)
- Input: Brain 1 dataset (2.1), Llama 3.2-3B base model, MLX-LM
- Output: LoRA adapter `twin-cortex-v9` in Ollama format
- Validation: Training completes, loss < 1.0, adapter loads in Ollama
- Depends on: 1.2, 2.1
- Status: Not Started
- [ ] 2.3 Eval Brain 1 routing accuracy
- Owner: Claw
- Input: 200 labeled routing tasks, Brain 1 model (2.2)
- Output: Eval report — routing accuracy, false escalation rate, speed
- Validation: Accuracy >90%
- Depends on: 2.2
- Status: Not Started
- [ ] 2.4 Deploy Brain 1 to Mac4 Ollama
- Owner: Claw (Mac4)
- Input: Passing eval (2.3), Ollama adapter
- Output: `twin-cortex-3b:latest` running on Mac4:11434
- Validation: `curl http://[ip]:11434/api/generate` returns valid response
- Depends on: 2.3 (must pass)
- Status: Not Started
---
PHASE 3: Brain 2 — Reasoning Twin (Days 5-14)
- [ ] 3.1 Generate Brain 2 training dataset expansion
- Owner: Claw
- Input: V9 generators (1.4), full V8 dataset, Gemini 3 Pro API
- Output: `data/brain2_v9/` — merged V5-V9 dataset (~78,818 records)
- Validation: Record count, format validation, dedup check
- Depends on: 1.4, 1.5 (N'Ko assessment informs data balance)
- Status: Not Started
- [ ] 3.2 Upload dataset to training platform
- Owner: Claw
- Input: Brain 2 dataset (3.1), Together AI or Vast.ai credentials
- Output: Dataset uploaded, file ID confirmed
- Validation: Upload complete, file verified on platform
- Depends on: 1.1, 3.1
- Status: Not Started
- [ ] 3.3 Train Brain 2 LoRA on Qwen3-235B
- Owner: Claw (remote)
- Input: Uploaded dataset (3.2), Qwen3-235B base
- Output: LoRA adapter `twin-reasoning-v9`
- Validation: Training completes, loss < 1.0, adapter saves
- Depends on: 3.2
- Status: Not Started
- [ ] 3.4 Eval Brain 2 Twin Fidelity
- Owner: Claw
- Input: Twin Fidelity eval suite, Brain 2 model (3.3)
- Output: Eval report — fidelity score, code quality, permission-seeking rate
- Validation: Fidelity >0.80, code quality >95%
- Depends on: 3.3
- Status: Not Started
- [ ] 3.5 Deploy Brain 2 to Together AI serverless
- Owner: Claw
- Input: Passing eval (3.4), Together AI account
- Output: Serverless LoRA endpoint active
- Validation: API call returns valid response with Twin patterns
- Depends on: 3.4 (must pass)
- Status: Not Started
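
The dedup check in 3.1 matters because the Brain 2 dataset merges several versions that may share records. One possible approach, sketched below, hashes each record's canonical JSON so that key order does not affect identity. This catches exact duplicates only; near-duplicates (for example, the same text with different whitespace) would need a separate pass. The function names are illustrative and not taken from the existing scripts.

```python
import hashlib
import json


def record_key(record: dict) -> str:
    """Hash a record's canonical JSON so key order does not affect identity."""
    canonical = json.dumps(
        record, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merge_unique(*datasets):
    """Merge datasets in order, keeping the first copy of each exact duplicate.

    Returns (merged_records, dropped_count) so the record count can be
    compared against the estimate in the task's validation step.
    """
    seen = set()
    merged = []
    dropped = 0
    for dataset in datasets:
        for record in dataset:
            key = record_key(record)
            if key in seen:
                dropped += 1
                continue
            seen.add(key)
            merged.append(record)
    return merged, dropped
```

Listing the older datasets first means that when a record appears in both V8 and V9, the V8 copy is the one retained.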
---
PHASE 4: Integration (Days 10-18)
- [ ] 4.1 Implement inter-brain handoff protocol in Cortex
- Owner: Claw
- Input: Handoff protocol spec (Stage 3 deep-dive), Cortex daemon code
- Output: Cortex daemon routes tasks between Brain 1, Brain 2, and frontier
- Validation: 50 test tasks correctly routed with handoff context
- Depends on: 2.4, 3.5
- Status: Not Started
- [ ] 4.2 Add Twin models to Clawdbot config
- Owner: Claw
- Input: Clawdbot gateway config, model endpoints
- Output: `twin-cortex` and `twin-reasoning` available as model options
- Validation: `clawdbot status` shows both models, test query works
- Depends on: 2.4, 3.5
- Status: Not Started
- [ ] 4.3 Wire RAG++ context injection for both brains
- Owner: Claw
- Input: RAG++ API, Cortex daemon
- Output: Both brains receive relevant RAG++ context with every query
- Validation: Brain 2 references current file state in coding responses
- Depends on: 4.1, 4.2
- Status: Not Started
- [ ] 4.4 Implement automatic rollback mechanism
- Owner: Claw
- Input: Quality metrics baseline, monitoring script
- Output: Cron job that checks Twin quality hourly, rolls back to Claude Max if degraded
- Validation: Simulated degradation triggers automatic rollback within 1 hour
- Depends on: 4.2
- Status: Not Started
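
The hourly check in 4.4 reduces to comparing current quality metrics against a baseline and switching back to Claude Max when any of them degrades. The sketch below shows only that decision logic. The 5% relative tolerance, the metric names, and the backend labels are assumptions for illustration; collecting the metrics and applying the switch are left to the monitoring script.

```python
def degraded_metrics(baseline: dict, current: dict, tolerance: float = 0.05) -> list:
    """Return names of metrics that fell more than `tolerance` (relative) below baseline.

    Assumes every metric is higher-is-better. A metric missing from `current`
    counts as degraded, so a silent monitor does not keep the Twin live.
    """
    failing = []
    for name, base in baseline.items():
        value = current.get(name)
        if value is None or value < base * (1.0 - tolerance):
            failing.append(name)
    return failing


def choose_backend(baseline: dict, current: dict, tolerance: float = 0.05) -> str:
    """Pick the serving backend for the next hour based on metric health."""
    if degraded_metrics(baseline, current, tolerance):
        return "claude-max"
    return "twin"
```

Run from an hourly job, this meets the "rollback within 1 hour" validation as long as the metrics passed in reflect the most recent hour of traffic.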
---
PHASE 5: Gradual Rollout (Days 18-32)
- [ ] 5.1 10% rollout
- Owner: Claw
- Input: Deployed system (Phase 4), traffic router
- Output: 10% of traffic
- Validation: 48h monitoring, routing accuracy maintained, no quality drops
- Depends on: Phase 4 complete
- Status: Not Started
- [ ] 5.2 25% rollout
- Owner: Claw
- Input: Passing 5.1 metrics
- Output: Brain 2 handles 25% of traffic
- Validation: 48h monitoring, code quality maintained, fidelity stable
- Depends on: 5.1 (48h passing)
- Status: Not Started
- [ ] 5.3 50% rollout
- Owner: Claw
- Input: Passing 5.2 metrics
- Output: Half of all interactions through Twin system
- Validation: 1 week monitoring, all metrics stable
- Depends on: 5.2 (48h passing)
- Status: Not Started
- [ ] 5.4 100% rollout
- Owner: Claw + Mo decision
- Input: Passing 5.3 metrics, Mo's approval
- Output: Twin handles all routine tasks, Claude Max is escalation-only
- Validation: 2 week monitoring, cost reduction confirmed, quality maintained
- Depends on: 5.3 (1 week passing), Mo approval
- Status: Not Started
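
The staged rollout above needs a traffic router that sends a fixed share of tasks to the Twin. One way to do this, sketched below, is deterministic bucketing: hash a task identifier into one of 100 buckets and route to the Twin when the bucket falls below the current stage's share. The stage values 10, 25, 50, and 100 are read here as percentages, and the `task_id` field and backend labels are assumptions.

```python
import hashlib

ROLLOUT_STAGES = (10, 25, 50, 100)  # read as percentages for tasks 5.1 to 5.4


def bucket(task_id: str) -> int:
    """Map a task id to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(task_id: str, twin_percent: int) -> str:
    """Send the task to the Twin if its bucket falls inside the rollout share."""
    if not 0 <= twin_percent <= 100:
        raise ValueError("twin_percent must be between 0 and 100")
    return "twin" if bucket(task_id) < twin_percent else "claude-max"
```

Hashing rather than random sampling keeps routing stable: a task that lands on the Twin at one stage stays on the Twin at every later stage, which makes the 48h comparisons between stages easier to interpret.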
---
PHASE 6: Continuous Learning Pipeline (Days 25-35, ongoing)
- [ ] 6.1 Automate Brain 1 weekly training cycle
- Owner: Claw
- Input: Training pipeline scripts, Mac4 cron
- Output: launchd service that trains Brain 1 every Sunday night
- Validation: First automated cycle completes, model deploys, eval passes
- Depends on: Phase 2 complete, Phase 5 in progress
- Status: Not Started
- [ ] 6.2 Automate Brain 2 monthly training cycle
- Owner: Claw
- Input: Training pipeline scripts, Together AI/Vast.ai automation
- Output: Monthly cron job that generates data, trains, evals, and deploys
- Validation: First automated cycle completes end-to-end
- Depends on: Phase 3 complete, Phase 5 in progress
- Status: Not Started
- [ ] 6.3 Build drift detection dashboard
- Owner: Claw
- Input: Quality metrics, routing accuracy logs
- Output: Discord channel (#twin-health) with daily metrics digest
- Validation: Dashboard shows metrics for 7 days, alerts on degradation
- Depends on: Phase 5 in progress
- Status: Not Started
- [ ] 6.4 Document V9→V10 upgrade path
- Owner: Claw
- Input: V9 learnings, architecture decisions
- Output: V10 planning doc (micro-LoRAs, ensemble, MoE distillation)
- Validation: Document covers all V10 candidates from Stage 1 paths
- Depends on: Phase 5 complete
- Status: Not Started
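
For the weekly Brain 1 cycle in 6.1, the launchd job definition can be generated rather than hand-written. The sketch below builds a property list with a `StartCalendarInterval` that fires on Sundays. The label, the script path in the usage note, the 23:00 start time, and the log locations are all hypothetical; only the launchd keys themselves are standard.

```python
import plistlib


def weekly_training_plist(
    label: str, program_args: list, hour: int = 23, minute: int = 0
) -> bytes:
    """Build a launchd job definition that fires every Sunday at hour:minute."""
    job = {
        "Label": label,
        "ProgramArguments": program_args,
        # launchd treats Weekday 0 (and 7) as Sunday.
        "StartCalendarInterval": {"Weekday": 0, "Hour": hour, "Minute": minute},
        # Hypothetical log locations; adjust to wherever pipeline logs live.
        "StandardOutPath": f"/tmp/{label}.out.log",
        "StandardErrorPath": f"/tmp/{label}.err.log",
    }
    return plistlib.dumps(job)  # XML plist by default
```

A call such as `weekly_training_plist("com.example.twin-cortex.train", ["/bin/sh", "scripts/train_brain1.sh"])` returns bytes that can be written to a `.plist` file under `~/Library/LaunchAgents/` and registered with `launchctl`.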
---
Total timeline: ~5 weeks from go signal
Total training cost: ~$40-150 (one-time) + $30-150/mo (inference)
**Expected outcome:** 50%
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
Comp-Core/packages/cognitive-twin/EVOCUBE_V9.md