# Cognitive Twin V9 — Evolution³
### Stage 1: Explore → Stage 2: Compound → Stage 3: Master Plan

Generated: 2026-02-18
Method: Evolution³ — three-stage recursive evoflow
Core Question: How should we train, deploy, and integrate the Cognitive Twin V9 into our multi-machine architecture (Mac1 gateway + Mac4 local compute + Together AI cloud + 3 Claude Max accounts) to maximize autonomy, minimize cost, and keep the model evergreen with our rapidly evolving ecosystem?

Context Inherited:
- Current dataset: 77,708 records (V5+V6+V7+V8 combined, CTv3.1 JSONL)
- V9 expansion potential: ~2,635 new records from 8 sources
- 141 skills, 32 CLAUDE.md files, 23 pulse plans, 30K+ Kimi memory turns
- Mac4: M4 Mac Mini 16GB, Ollama (Llama 3.2:3B, MiniMax M2.5), macOS 26.3
- Mac1: M4 MacBook Air 16GB, Clawdbot gateway, daily driver
- Together AI: Serverless LoRA on Qwen3-235B ($0.20/$0.60/MTk) or Llama 4 Maverick
- 3 Claude Max accounts (free frontier inference, rate-limited)
- Graph Kernel + RAG++ + Cortex + Dream Weaver operational
- Twin Swarm DEP (Feb 14): Alpha/Beta/Gamma lanes designed but never deployed
- Previous blockers: Together AI billing limit, Vast.ai not rented

---

STAGE 1: EXPLORE

Path 1: "The Distillery" — Quantized Qwen3-235B on Mac4

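Concept: Fine-tune Qwen3-235B-A22B on Together AI/Vast.ai, merge the adapter, quantize the merged model to GGUF Q4_K_M, and run it locally on Mac4 via Ollama. The MoE architecture means only 22B params are active per token, which cuts compute per token. Memory is the open question: experts that are inactive for a given token still have to be loaded, so the weight footprint scales with all 235B params, and fitting in 16GB RAM would take far more than standard 4-bit quantization.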

Why it works:
- Qwen3-235B has best-in-class coding performance among open models
- MoE with 22B active params has per-token compute comparable to a 22B dense model
- Q4_K_M quantization of the 22B active params alone needs ~12-14GB VRAM/RAM, though the inactive experts must also be held in memory
- Mac4's M4 chip has unified memory — no CPU-GPU transfer overhead
- Once deployed, inference is completely free and private
- 262K context window survives quantization

Profit/value angle:
- $0/month inference after one-time training cost ($40-75)
- No external dependency — works offline, no rate limits
- Private — training data never leaves our machines during inference
- If it works, this replaces the entire Together AI serving layer

Risks:
- GGUF MoE quantization for Qwen3 is bleeding edge — tooling may not exist yet
- 16GB is razor thin — OS + Ollama overhead could cause OOM
- Quantization quality loss on a LoRA-merged MoE model is unknown territory
- Mac4 inference speed may be too slow for real-time agent work (tokens/sec?)
- If it doesn't fit, we wasted the training budget
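
A back-of-envelope check on the 16GB risk, as a sketch rather than a measurement. It assumes roughly 4.5 bits per weight for a 4-bit K-quant (an approximation, not a published figure for this model) and ignores KV cache and runtime overhead. An MoE model only computes with its active experts, but all expert weights generally have to be resident, so the footprint tracks total parameters.

```python
def quantized_weight_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (10^9 bytes) for a quantized model."""
    bytes_total = total_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Assumption: ~4.5 bits per weight as a rough average for a 4-bit K-quant.
BITS_Q4 = 4.5

full_moe_gb = quantized_weight_gb(235, BITS_Q4)    # every expert must be loaded
active_only_gb = quantized_weight_gb(22, BITS_Q4)  # the compute-active slice only

print(f"All 235B weights at ~{BITS_Q4} bpw: {full_moe_gb:.0f} GB")
print(f"22B active slice at ~{BITS_Q4} bpw: {active_only_gb:.1f} GB")
print(f"Fits in 16 GB unified memory: {full_moe_gb < 16}")
```

Under these assumptions the 12-14GB estimate matches the active slice only, which is why this path depends on a much smaller distilled variant or more RAM.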

---

Path 2: "The Cascade" — Tiered Model Stack with Skill-Aware Routing

Concept: Train separate LoRA adapters at different sizes for the tiers we host (T0 and T1), each specialized for a tier of task complexity. Mac4 runs the small model always-on, Together AI runs the medium on-demand, and frontier (Claude Max) handles the rest without an adapter. Clawdbot's model router (already partially built) dispatches based on task classification.

Tiers:
- T0 (Mac4 local): Llama 3.2-3B + LoRA → triage, classification, simple completions, heartbeat checks, density scoring. Always on. Free.
- T1 (Together AI): Qwen3-235B + LoRA → feature implementation, architecture decisions, code review. On-demand. $0.20-0.60/MTk.
- T2 (Claude Max): Claude Opus 4.6 → novel reasoning, complex debugging, creative work. Escalation only. Free (subscription).

Why it works:
- Matches our existing 3-account dual-max architecture
- T0 handles 60-70%
- T1 handles 20-30%
- T2 handles <10%
- Each tier is optimized for its role — no wasted capacity
- The routing layer already exists in Cortex daemon

Profit/value angle:
- Total cost drops to $30-100/mo (mostly T1 usage)
- T0 gives us an always-on local brain that understands our ecosystem
- The 3B LoRA trains in hours on Mac4 itself (no Vast.ai needed for T0)
- T1 LoRA trains once on Together AI, serves via serverless
- We keep Claude Max for genuinely hard problems

Risks:
- Training a separate adapter per hosted tier multiplies training overhead (2x for T0 + T1)
- Routing accuracy is critical — misrouted tasks waste money or produce bad output
- 3B model may not understand enough to route correctly
- Context gap between tiers (T0 doesn't know what T1 worked on)

---

Path 3: "The Mirror" — Continuous Self-Distillation Loop

Concept: Instead of one-shot training, build a continuous pipeline where every Clawdbot session generates training data in real-time. Mac4 runs a perpetual training loop: ingest today's sessions → generate SFT/DPO pairs → fine-tune → deploy updated model → repeat. The Twin is never stale because it's always learning.

Pipeline:
1. Clawdbot sessions (all channels) → RAG++ bridge → kimi_memory.db
2. Density scorer (MiniMax on Mac4) filters for CORE+ENRICHED turns
3. Training pair generator converts high-density turns → CTv3.1 format
4. Mac4 runs LoRA training on accumulated data (nightly batch)
5. Updated model deployed to Ollama, replaces previous version
6. Eval suite runs, gates deployment (reject if quality drops)

Why it works:
- Solves the "stale model" problem permanently
- Leverages Mac4's idle overnight compute (Mo sleeps, Mac4 trains)
- Density scoring pipeline already exists and works
- RAG++ bridge already ingests all sessions in real-time
- The Twin evolves WITH the ecosystem, not behind it

Profit/value angle:
- No recurring training costs (Mac4 does it locally)
- Model improves daily, automatically
- New projects, skills, and patterns are absorbed within 24h
- Competitive advantage: most fine-tunes are snapshots, ours is a living model

Risks:
- Mac4 can only train small models (3-8B) locally — M4 16GB limits us
- Training quality on consumer hardware vs H100 cluster
- Catastrophic forgetting if nightly updates aren't properly regularized
- Need robust eval to prevent quality regression
- Accumulated drift over time without periodic full retraining

---

Path 4: "The Diplomat" — MiniMax M2.5 as Primary Twin

Concept: Skip the LoRA training entirely. MiniMax M2.5 is already running on Mac4 via Ollama, already handles density scoring, and already processes through Cortex. Instead of training a new model, deeply prompt-engineer MiniMax with our full governance stack (AGENTS.md, SOUL.md, skills) and use RAG++ to inject project context. Essentially: make MiniMax our Twin through context, not weights.

Why it works:
- Zero training cost, zero training time
- MiniMax M2.5 is surprisingly capable (handles Cortex routing well)
- Already deployed, already running, already tested
- We can iterate on prompts hourly vs waiting for training runs
- If MiniMax adds LoRA support later, we have the data ready

Profit/value angle:
- Immediate deployment — could be "Twin V9" by end of day
- $0 inference (local Ollama)
- Low risk — if it doesn't work well enough, we still have the training data for a real LoRA
- Buys time while we solve billing/Vast.ai blockers

Risks:
- Context injection hits the same token/compaction limits as current architecture
- MiniMax M2.5 reasoning may not match fine-tuned Qwen3
- Not a real "twin" — it's still a generic model with fancy prompting
- Doesn't solve the fundamental problem (context in weights vs context in window)

---

Path 5: "The Federation" — Multi-Model Ensemble with Consensus

Concept: Don't pick one model. Run the Twin as an ensemble: fine-tuned 3B on Mac4 + fine-tuned Qwen3-235B on Together + MiniMax M2.5 on Mac4, all receiving the same task. A lightweight consensus layer (could be the 3B itself) evaluates responses and picks the best one, or synthesizes a merged answer.

Inspired by: Dream Weaver's Evo³ pattern (Gemini + MiniMax + Kimi-K2 evolving together), Swarm Consensus system at `Desktop/swarm-consensus/`.

Why it works:
- Diversity of models catches different failure modes
- Consensus is more reliable than any single model
- We already have the swarm-consensus voting infrastructure
- Small model can be the fast-path (if confident, skip ensemble)
- Large model validates when small model is uncertain

Profit/value angle:
- Best quality ceiling — ensemble beats individual models
- Graceful degradation — if Together AI is down, local models still work
- The consensus patterns become training data for the next version
- Aligns with our multi-model philosophy (we already run 5+ models)

Risks:
- 3x latency if all models must respond before consensus
- Complexity: debugging ensemble failures is hard
- Cost: running 3 models per query is expensive even if 2 are local
- Consensus layer itself needs to be smart enough to evaluate
- Over-engineered for simple tasks (heartbeats don't need 3 opinions)
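
A minimal sketch of the fast-path idea from this path: answer with the small model when it is confident, otherwise call the other models and take a majority vote. The 0.85 threshold, the `resolve` name, and exact-match voting on answer strings are all illustrative assumptions; the existing swarm-consensus voting logic may work differently.

```python
from collections import Counter
from typing import Callable, Sequence

FAST_PATH_CONFIDENCE = 0.85  # assumption: Path 5 does not state a threshold


def resolve(
    fast_answer: str,
    fast_confidence: float,
    ensemble: Sequence[Callable[[], str]],
) -> str:
    """Return the fast-path answer when confident, else a majority vote."""
    if fast_confidence >= FAST_PATH_CONFIDENCE:
        return fast_answer  # skip the ensemble entirely
    # Only now pay for the slower models; the small model's answer also votes.
    votes = Counter([fast_answer] + [model() for model in ensemble])
    winner, _count = votes.most_common(1)[0]
    return winner
```

Passing the larger models as callables keeps them from being invoked on the fast path, which addresses the latency and cost risks listed above for simple tasks.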

---

Path 6: "The Surgeon" — Targeted Micro-LoRAs per Domain

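Concept: Instead of one monolithic LoRA, train multiple small LoRAs, one per domain (coding, architecture, planning, N'Ko, business). Load the right LoRA at inference time based on task classification. Ollama can apply a LoRA adapter through the Modelfile `ADAPTER` instruction, so each domain adapter can be packaged as its own model tag and selected per request; whether switching is fast enough to count as hot-swapping on Mac4 still needs testing.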

LoRA domains:
- `twin-code`: Coding patterns, conventional commits, RTD verification
- `twin-arch`: Architecture decisions, CLAUDE.md patterns, system design
- `twin-plan`: Pulse planning, wave-gating, task decomposition
- `twin-nko`: N'Ko/Manding language patterns, cross-script bridge
- `twin-biz`: Business context (Koji, BWB, Shopify, B2B sales)
- `twin-ops`: Heartbeat patterns, agent management, infrastructure

Why it works:
- Each LoRA is small and fast to train (subset of data per domain)
- Domain specialization > generalist dilution
- Can update individual domains without retraining everything
- Base model stays clean — no domain interference
- Ollama can attach LoRA adapters via the Modelfile `ADAPTER` instruction on M-series Macs (swap latency untested)

Profit/value angle:
- Faster iteration: update `twin-code` when coding patterns change, leave others alone
- Better quality: each adapter deeply trained on its domain
- Smaller adapters = faster training = Mac4 can do it locally
- Mix-and-match: load `twin-code` + `twin-nko` for N'Ko keyboard work

Risks:
- Router must correctly classify domain before loading adapter
- Cross-domain tasks (coding + architecture) need multi-adapter merging
- More moving parts = more failure modes
- Total training time across all domains may exceed monolithic approach
- Adapter merging quality is unpredictable

---

STAGE 2: COMPOUND

Step 1: Foundation — What We Actually Need the Twin to Do

Inherits: Stage 1 all paths — synthesizing requirements from all six directions

Before choosing an architecture, we need clarity on what the Twin must accomplish. From analyzing all six paths and our current ecosystem, the Twin's job is:

Primary roles (ordered by frequency):
1. Triage & Route (~40%)
2. Code Generation (~25%)
3. Planning & Decomposition (~15%)
4. Knowledge Retrieval (~10%)
5. Autonomous Operations (~10%)

Critical constraint: The Twin must work WITHIN Clawdbot's existing tool infrastructure. It needs to call read/write/exec/message/etc — not be a standalone chatbot. This means it must be served as a model endpoint that Clawdbot can route to.

The real insight from Path 2 (Cascade): Not all roles need the same model. Triage (40

---

Step 2: The Two-Brain Architecture — Building on Step 1's Role Analysis

Inherits: Step 1 — role frequency analysis, tool integration requirement

From Step 1, we know 40

Brain 1: "Cortex Twin" (Mac4 local, always-on)
- Base: Llama 3.2-3B + LoRA (or Qwen2.5-3B if better coding benchmarks)
- Role: Triage, routing, classification, heartbeats, simple completions, density scoring
- Serves: Ollama on Mac4 (http://[ip]:11434)
- Training: Can train ON Mac4 itself (3B LoRA estimated at 2-4 hours on M4, not yet benchmarked)
- Update cycle: Weekly (absorbs new sessions, skills, plans)
- Cost: $0 (local compute)

Brain 2: "Reasoning Twin" (Together AI, on-demand)
- Base: Qwen3-235B-A22B + LoRA
- Role: Feature implementation, architecture decisions, complex planning, code review
- Serves: Together AI serverless LoRA ($0.20/$0.60/MTk)
- Training: Vast.ai or Together AI fine-tuning ($40-75 per run)
- Update cycle: Monthly (curated high-quality data only)
- Cost: $30-150/mo depending on usage

The bridge between brains: Cortex daemon (already running on Mac4 port 18081) classifies incoming tasks and routes to Brain 1 or Brain 2. If Brain 1 confidence is high (>0.85), it handles it directly. If low, it escalates to Brain 2. If Brain 2 confidence is also low or task is novel, escalate to frontier (Claude Max — free).
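
The escalation order described here can be sketched as a small routing function. This is an illustration, not the Cortex daemon's actual code: the `Classification` fields are hypothetical, and the Brain 2 threshold is an assumption because the text only gives a value for Brain 1.

```python
from dataclasses import dataclass
from typing import Optional

BRAIN1_CONFIDENCE_MIN = 0.85  # threshold stated in the text
BRAIN2_CONFIDENCE_MIN = 0.85  # assumption: the text gives no Brain 2 value


@dataclass
class Classification:
    brain1_confidence: float
    brain2_confidence: Optional[float] = None  # known only after Brain 2 has tried
    novel: bool = False


def route(c: Classification) -> str:
    """Pick the model that should answer, following the escalation order."""
    if c.brain1_confidence > BRAIN1_CONFIDENCE_MIN:
        return "twin-cortex"
    # Brain 1 was not confident: escalate, and go to frontier if Brain 2
    # is also unsure or the task is flagged as novel.
    brain2_unsure = (
        c.brain2_confidence is not None
        and c.brain2_confidence < BRAIN2_CONFIDENCE_MIN
    )
    if c.novel or brain2_unsure:
        return "claude-opus"
    return "twin-reasoning"
```

For example, `route(Classification(0.5))` escalates to `twin-reasoning`, while `route(Classification(0.5, novel=True))` goes straight to `claude-opus`.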

This inherits the best elements from:
- Path 1 (local inference) → Brain 1
- Path 2 (tiered cascade) → the routing logic
- Path 3 (continuous learning) → Brain 1's weekly update cycle
- Path 4 (MiniMax as interim) → MiniMax stays as density scorer alongside Brain 1

---

Step 3: Training Data Strategy — Building on Steps 1-2's Architecture

Inherits: Steps 1-2 — two-brain architecture, role specialization, update cycles

Each brain needs different training data because they serve different roles (Step 1) at different capacities (Step 2):

Brain 1 (Cortex Twin, 3B) — Training Data:

Focus: Classification accuracy, routing decisions, operational patterns
- Source A: Skill routing examples — For each of 141 skills, generate 3-5 "when to invoke" examples → ~500 SFT pairs
- Source B: Pulse plan decomposition summaries — Task descriptions → correct wave assignment → ~200 SFT pairs
- Source C: Heartbeat decision patterns — From memory files, extract "check X → found Y → action Z" chains → ~150 SFT pairs
- Source D: Triage DPO pairs — "Route to local" (preferred) vs "Escalate unnecessarily" (rejected) → ~300 DPO pairs
- Source E: Kimi memory high-density turns (CORE 9-10 only) — Pure personality/decision DNA → ~400 SFT pairs
- Total Brain 1: ~1,250 SFT + 300 DPO = 1,550 new records
- Merged with: Subset of V5 base (triage-relevant turns only, ~5,000) = ~6,550 total

Brain 2 (Reasoning Twin, 235B) — Training Data:

Focus: Code quality, architecture decisions, deep reasoning
- Source A: CLAUDE.md architecture specs → design decision Q&A pairs → ~300 SFT pairs
- Source B: Full pulse plans → "how to decompose this feature" → ~200 SFT pairs
- Source C: Governance protocols (AGENTS.md, SOUL.md, PROTOCOLS.md) → behavioral pairs → ~150 SFT pairs
- Source D: Code-grounded examples from V5 base (Claude Code sessions) → keep all ~35,000 coding turns
- Source E: DPO pairs — autonomous execution (preferred) vs permission-seeking (rejected) → keep all 740 existing + ~260 new = ~1,000 DPO
- Source F: Memory file architecture narratives → reasoning chain examples → ~200 SFT pairs
- Total Brain 2: ~850 new SFT + 260 new DPO on top of V8 base
- Merged with: Full V5+V6+V7+V8 (77,708) = ~78,818 total

Key insight: Brain 1 gets a CURATED, SMALL, FOCUSED dataset. Brain 2 gets the FULL corpus. This is because:
- 3B models learn better from focused data (less noise to overfit on)
- 235B models can handle the full distribution without confusion
- Brain 1 only needs to be RIGHT about routing, not CREATIVE
- Brain 2 needs the full context to generate high-quality code
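
The record totals for both brains can be re-derived from the per-source estimates as a quick arithmetic check. The dictionary keys are shorthand labels for the sources listed above, not real file or dataset names.

```python
# Brain 1: new SFT pairs per source, plus DPO pairs, plus the V5 triage subset.
brain1_sft = {"skills": 500, "plans": 200, "heartbeat": 150, "kimi_core": 400}
brain1_dpo = 300
brain1_new = sum(brain1_sft.values()) + brain1_dpo
brain1_total = brain1_new + 5_000

# Brain 2: new SFT pairs per source, plus new DPO pairs, on top of V5-V8.
brain2_sft = {"claude_md": 300, "plans": 200, "governance": 150, "memory": 200}
brain2_dpo_new = 260
brain2_new = sum(brain2_sft.values()) + brain2_dpo_new
brain2_total = 77_708 + brain2_new

print(brain1_new, brain1_total)  # 1550 6550
print(brain2_new, brain2_total)  # 1110 78818
```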

---

Step 4: Training Pipeline — Building on Steps 1-3's Data Strategy

Inherits: Steps 1-3 — two brains, different datasets, different update cycles

Brain 1 Training (Mac4 local, weekly):

Weekly Cycle (every Sunday night, Mo sleeping):
1. Extract: Pull last 7 days from kimi_memory.db + new session logs
2. Score: MiniMax M2.5 density scorer (already built) → filter CORE+ENRICHED
3. Generate: Convert to SFT/DPO using template prompts
4. Merge: Append to Brain 1 dataset (cumulative, capped at 10K to prevent drift)
5. Train: LoRA fine-tune on Mac4 (MLX or llama.cpp, ~2-4 hours for 3B)
6. Eval: Run routing accuracy test suite (target: >90% correct routing)
7. Deploy: Hot-swap Ollama adapter if eval passes
8. Log: Write results to memory/twin-training-log.md

Key detail from Path 3 (Mirror): We don't train from scratch each week. We accumulate data but use a sliding window (last 30 days weighted 2x, older weighted 1x) to prevent catastrophic forgetting while staying current.
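
One way to express the sliding window and the 10K cap, as a sketch. It assumes each record carries a `date` field and that the cap drops the oldest records first; neither detail is specified above.

```python
from datetime import date

RECENT_DAYS = 30
RECENT_WEIGHT = 2.0   # last 30 days weighted 2x
OLDER_WEIGHT = 1.0    # everything older weighted 1x


def sample_weight(record_date: date, today: date) -> float:
    """Sampling weight for one record under the 30-day sliding window."""
    age_days = (today - record_date).days
    return RECENT_WEIGHT if age_days <= RECENT_DAYS else OLDER_WEIGHT


def cap_newest(records: list, limit: int = 10_000) -> list:
    """Keep the newest `limit` records (assumption: the cap drops oldest first)."""
    return sorted(records, key=lambda r: r["date"], reverse=True)[:limit]
```

The weights would feed whatever sampler the training framework accepts; if it accepts none, duplicating recent records once in the training file gives a similar 2x effect.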

Brain 2 Training (Cloud, monthly):

Monthly Cycle (1st of each month):
1. Audit: Review V9 audit doc for new data sources
2. Generate: Full expansion pipeline (Gemini 3 Pro for generation, RLM for quality gate)
3. Merge: Combine with existing dataset (dedup by content hash)
4. Upload: Push to Together AI or Vast.ai
5. Train: LoRA fine-tune on Qwen3-235B (H100 cluster, ~15-30 hours, $40-75)
6. Eval: Run Twin Fidelity suite (target: >0.80, currently 0.772 baseline)
7. Deploy: Upload adapter to Together AI serverless
8. Benchmark: Run side-by-side vs previous version on 50 held-out tasks
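
The merge step's "dedup by content hash" could look like the sketch below. Hashing a canonical JSON dump of the whole record is an assumption; the real pipeline may hash only selected fields of a CTv3.1 record.

```python
import hashlib
import json


def content_hash(record: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merge_dedup(*datasets: list) -> list:
    """Concatenate datasets, keeping the first copy of each distinct record."""
    seen = set()
    merged = []
    for dataset in datasets:
        for record in dataset:
            digest = content_hash(record)
            if digest not in seen:
                seen.add(digest)
                merged.append(record)
    return merged
```

Listing the existing dataset first means older records win ties, so re-running the merge does not reorder the base corpus.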

Cost model (from Steps 1-2):
- Brain 1 training: $0/month (Mac4 local)
- Brain 2 training: $40-75/month (cloud GPU)
- Brain 1 inference: $0/month (Mac4 local)
- Brain 2 inference: $30-150/month (Together AI serverless)
**Total: $70-225/month** (vs. $0 marginal cost today on the Claude Max subscriptions, which are rate-limited)

The honest comparison: Claude Max is "free" but rate-limited and burns 3 subscription slots. The Twin frees those slots for genuinely complex work while handling routine stuff locally.
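
A small sketch for sanity-checking the cost model. It assumes "$0.20/$0.60/MTk" means $0.20 per million input tokens and $0.60 per million output tokens; the token volumes in the example are hypothetical.

```python
# Assumption: first price is per million input tokens, second per million output.
INPUT_USD_PER_MTOK = 0.20
OUTPUT_USD_PER_MTOK = 0.60


def brain2_inference_usd(input_mtok: float, output_mtok: float) -> float:
    """Monthly Brain 2 serving cost for a given token volume (in millions)."""
    return input_mtok * INPUT_USD_PER_MTOK + output_mtok * OUTPUT_USD_PER_MTOK


# Ranges taken from the cost model above (Brain 1 contributes $0).
TRAINING_USD = (40, 75)
INFERENCE_USD = (30, 150)
total_low = TRAINING_USD[0] + INFERENCE_USD[0]
total_high = TRAINING_USD[1] + INFERENCE_USD[1]

print(f"Total: ${total_low}-{total_high}/month")
# Hypothetical month: 100M input tokens, 20M output tokens.
print(f"Example inference bill: ${brain2_inference_usd(100, 20):.2f}")
```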

---

Step 5: Integration Architecture — Building on Steps 1-4's Pipeline

Inherits: Steps 1-4 — two brains, training pipelines, cost model, Cortex routing

How the Twin plugs into Clawdbot's existing infrastructure:

Clawdbot Model Router (already exists in config):

```yaml
models:
  twin-cortex:
    provider: ollama
    endpoint: http://[ip]:11434
    model: twin-cortex-3b:latest
    role: triage

  twin-reasoning:
    provider: together
    model: Qwen/Qwen3-235B-A22B-Instruct-2507
    lora: twin-reasoning-v9
    role: coding, architecture, planning

  claude-opus:
    provider: anthropic
    model: claude-opus-4-6
    role: escalation
```

Request flow:

```
User message → Clawdbot gateway
  → Cortex daemon classifies task type + complexity
  → If triage/simple/heartbeat → twin-cortex (Mac4, free)
  → If coding/architecture/planning → twin-reasoning (Together AI)
  → If novel/complex/failing → claude-opus (Max account, free but limited)
  → Response routes back through Clawdbot → user
```

Tool access: Both Twins need tool access. Options:
1. Clawdbot proxy: Twin generates tool call JSON, Clawdbot executes and returns results. Keeps tool infra centralized.
2. Direct MCP: Twin connects directly to file system, exec, etc. More autonomous but harder to monitor.
3. Hybrid: Brain 1 (triage) uses Clawdbot proxy (simple tool calls). Brain 2 (reasoning) gets direct MCP for coding tasks.

Recommendation: Option 1 (Clawdbot proxy) for V9. Keep it simple. Direct MCP is a V10 optimization.
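
A sketch of what Option 1 implies on the Clawdbot side: the Twin emits a tool call as JSON, and the proxy validates and executes it. The `{"tool": ..., "args": ...}` shape, the tool names, and the in-memory file store are all hypothetical stand-ins; Clawdbot's real tool schema is not described here.

```python
import json

# Hypothetical in-memory stand-in for the real file tools.
FAKE_FS = {"notes.md": "hello"}


def tool_read(path: str) -> str:
    return FAKE_FS[path]


def tool_write(path: str, content: str) -> str:
    FAKE_FS[path] = content
    return "ok"


TOOLS = {"read": tool_read, "write": tool_write}


def execute_tool_call(raw: str) -> dict:
    """Parse a model-emitted tool call and run it, never raising to the caller."""
    try:
        call = json.loads(raw)
        name, args = call["tool"], call.get("args", {})
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError) as exc:
        return {"ok": False, "error": f"malformed tool call: {exc}"}
    tool = TOOLS.get(name)
    if tool is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return {"ok": True, "result": tool(**args)}
    except Exception as exc:  # bad arguments, missing file, and so on
        return {"ok": False, "error": str(exc)}
```

Returning errors as data rather than raising lets the result go back to the Twin as an ordinary tool response, so it can retry or escalate.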

Integration with existing systems:
- RAG++: Both brains query RAG++ for current file state (Twin knows patterns, RAG++ knows "now")
- Graph Kernel: Brain 2 uses graph for dependency analysis, architecture queries
- Dream Weaver: Brain 1 routes dream seeds, Brain 2 evaluates emergence
- Pulse Plans: Brain 1 monitors wave gates, Brain 2 generates task specs
- Memory: Both read/write to memory files through Clawdbot's file tools

---

Step 6: Eval Framework & Quality Gates — Building on Steps 1-5

Inherits: Steps 1-5 — full architecture, both brains, integration points

We can't deploy what we can't measure. From Step 4's eval targets and the existing Twin Fidelity baseline (0.772):

Brain 1 (Cortex Twin) Eval Suite:

| Test | Metric | Target | Method |
|---|---|---|---|
| Routing accuracy | | >90% (Step 4 target) | |
| Classification speed | tokens/sec on Mac4 | >50 tok/s | Benchmark 100 classifications |
| False escalation rate | | | |
| Heartbeat quality | | | |
| Personality fidelity | Matches Mo's communication style | Score >0.8 | A/B test vs real Mo responses |

Brain 2 (Reasoning Twin) Eval Suite:

| Test | Metric | Target | Method |
|---|---|---|---|
| Twin Fidelity | Overall similarity to Mo's patterns | >0.80 | Existing eval framework |
| Code quality | Builds pass, conventional commits | >95% | |
| Architecture alignment | Matches CLAUDE.md patterns | >85% | |
| Permission-seeking | DPO effectiveness | <5% | |
| Context adherence | References correct project context | >90% | |

Quality gate for deployment:
1. Both brains must pass ALL targets before deployment
2. A/B test: run Twin alongside Claude Max for 48h, compare outputs
3. Gradual rollout, starting at 10%
4. Automatic rollback if any metric drops >10%
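
The deployment gate and rollback rule could be checked mechanically along these lines. This is a sketch: the metric names are invented labels for the Brain 2 targets, the percentage targets are written as fractions, and "drops >10" is read as a relative drop of more than 10% against the previous baseline, which is an assumption.

```python
import operator

# name -> (comparison, threshold); fractions stand in for percentage targets.
TARGETS = {
    "twin_fidelity": (operator.gt, 0.80),
    "code_quality": (operator.gt, 0.95),
    "architecture_alignment": (operator.gt, 0.85),
    "permission_seeking": (operator.lt, 0.05),
    "context_adherence": (operator.gt, 0.90),
}


def passes_gate(metrics: dict) -> bool:
    """True only if every target is present and met."""
    return all(
        name in metrics and compare(metrics[name], threshold)
        for name, (compare, threshold) in TARGETS.items()
    )


def should_rollback(baseline: dict, current: dict, max_drop: float = 0.10) -> bool:
    """True if any higher-is-better metric fell by more than `max_drop` (relative)."""
    for name, (compare, _threshold) in TARGETS.items():
        if compare is not operator.gt:
            continue  # lower-is-better metrics are not covered by a "drop" rule
        if name not in baseline or name not in current or baseline[name] <= 0:
            continue
        if (baseline[name] - current[name]) / baseline[name] > max_drop:
            return True
    return False
```

With the current 0.772 fidelity baseline, `passes_gate` would return `False`, which matches the stated need to clear 0.80 before deploying.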

---

Step 7: Evergreen Strategy — Building on Steps 1-6's Full System

Inherits: Steps 1-6 — architecture, training, integration, eval

The Twin must stay current as the ecosystem evolves. This is where Path 3 (Mirror) and Path 6 (Surgeon) combine:

Continuous learning loop:

Daily:
  - All Clawdbot sessions → kimi_memory.db (already happening)
  - Density scorer tags new high-value turns (Mac4, free)

Weekly (Brain 1):
  - Aggregate scored turns → generate training pairs
  - LoRA fine-tune on Mac4 (2-4 hours overnight)
  - Eval gate → deploy if passing

Monthly (Brain 2):
  - Full audit of new data sources (new skills, CLAUDE.md updates, etc.)
  - Generate expansion dataset (Gemini 3 Pro)
  - LoRA fine-tune on cloud ($40-75)
  - Eval gate → deploy to Together AI

Quarterly:
  - Full dataset review: prune stale data, rebalance domains
  - Consider base model upgrade (if better models available)
  - Re-evaluate Brain 1 vs Brain 2 split ratio

Domain-specific micro-updates (from Path 6):
When a specific domain changes significantly (e.g., new pulse plan skill, new project), generate a focused update batch for that domain only. Brain 1 can absorb this in a targeted overnight run without touching other domains.

Drift detection:
- Track routing accuracy weekly (alert threshold: below 85%)
- Track Twin Fidelity monthly; if below 0.75, trigger full audit
- Track false escalation rate (alert threshold: above 20%)
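
The three drift checks reduce to threshold comparisons, sketched below. The alert labels are hypothetical, and the routing and false-escalation thresholds are read as percentages expressed as fractions; only the Twin Fidelity check has a stated follow-up action (a full audit).

```python
def drift_alerts(
    routing_accuracy: float,
    twin_fidelity: float,
    false_escalation_rate: float,
) -> list:
    """Return the drift checks that fired, in the order they are listed above."""
    alerts = []
    if routing_accuracy < 0.85:
        alerts.append("routing_accuracy_low")
    if twin_fidelity < 0.75:
        alerts.append("full_audit")  # the one action the text specifies
    if false_escalation_rate > 0.20:
        alerts.append("false_escalation_high")
    return alerts
```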

The V10 horizon: Once the two-brain system is proven, expand to:
- Domain micro-LoRAs (Path 6) for specialized tasks
- Ensemble consensus (Path 5) for high-stakes decisions
- Full MoE distillation (Path 1) if Mac4 gets a RAM upgrade

---

Step 8: Synthesis — The Complete V9 Twin Architecture

Inherits: Steps 1-7 — everything

The Cognitive Twin V9 is a two-brain system:

1. Cortex Twin (Brain 1): A fine-tuned 3B model on Mac4 that handles triage, routing, heartbeats, and simple operations. Trained weekly from density-scored session data. Free to run. Covers ~50%.

2. Reasoning Twin (Brain 2): A fine-tuned Qwen3-235B on Together AI that handles coding, architecture, and complex planning. Trained monthly from curated high-quality data. Costs $30-150/mo. Covers ~40%.

3. Frontier Escalation (existing): Claude Max accounts handle the remaining ~10%.

The training pipeline is asymmetric by design:
- Brain 1: Small data, fast iteration, local training, weekly updates
- Brain 2: Large data, careful curation, cloud training, monthly updates
- Both share the same eval framework but with different targets

Total monthly cost: $70-225 (training + inference), replacing ~50

Timeline to deployment: 3-4 weeks from go signal.

---

STAGE 3: EXPAND + MASTER PLAN

3a. Audit

What holds strong:

✅ Two-brain split is well-justified. The role frequency analysis (Step 1) clearly shows 50

✅ Mac4 as always-on local brain. Already running Ollama, already has MiniMax scoring pipeline, already connected via Tailscale. Zero new infrastructure needed for Brain 1.

✅ Training data strategy is sound. Brain 1 gets focused data (classification accuracy matters more than creativity). Brain 2 gets full corpus (needs the breadth). Different data for different roles.

✅ Weekly Brain 1 updates solve the evergreen problem. No more "stale model" — new skills, plans, and patterns absorbed within 7 days. Mac4 handles it overnight for free.

✅ Clawdbot proxy for tool access (V9). Keeps integration simple. Direct MCP can come in V10 after the basic system is proven.

✅ Gradual rollout with automatic rollback. 10

What breaks under pressure:

⚠️ Brain 1 routing accuracy is the single point of failure. If the 3B model misroutes a complex task as "simple," the response quality tanks. Mitigation: conservative confidence threshold (0.85) — when in doubt, escalate.

⚠️ Together AI serverless LoRA availability. If Together has an outage or changes their LoRA serving API, Brain 2 goes down. Mitigation: fallback to Claude Max (already configured). Also consider uploading adapter to a second provider.

⚠️ 3B model ceiling for routing. 3B params may not understand nuanced task classification (e.g., "is this a simple N'Ko keyboard fix or a complex cross-script bridge architecture change?"). Mitigation: include domain-specific routing examples in training data. If accuracy stays below 85

⚠️ Monthly Brain 2 updates may be too slow. If a major architecture shift happens mid-month, Brain 2 won't know about it until next training cycle. Mitigation: allow emergency mid-cycle training for critical changes.

⚠️ Billing blocker still exists. Together AI billing limit was the original V8 blocker. Need to resolve before Brain 2 can train or serve. Mitigation: check current billing status, consider Vast.ai as fallback for training.

What's missing:

🔴 No plan for N'Ko/multilingual data. Brain 2 needs Manding/Bambara patterns but Qwen3's multilingual coverage of N'Ko is unknown. Need to test base model on N'Ko tasks before committing.

🔴 No offline fallback for Brain 2 tasks. If Together AI is down AND Claude Max is rate-limited, complex tasks have no path. Consider: can a quantized 7B on Mac4 serve as emergency Brain 2?

🔴 Inter-brain context sharing. When Brain 1 triages a task and Brain 2 picks it up, how does Brain 2 know what Brain 1 observed? Need a context handoff protocol.

🔴 Training data generation tooling. V8 used `run_v8_gemini3.py` for Gemini-powered generation. V9 needs updated scripts for each new source (skills, pulse plans, governance docs). These don't exist yet.

🔴 Mac4 MLX/llama.cpp training benchmarks. We assume 2-4 hours for 3B LoRA on M4 but haven't actually benchmarked this. Need to validate before committing to weekly cycle.

---

3b. Expand — Deep-Dives

Deep-Dive 1: Inter-Brain Context Handoff Protocol

When Brain 1 triages a task and escalates to Brain 2, Brain 2 needs:
1. The original user message
2. Brain 1's classification (task type, complexity score, suggested skill)
3. Relevant context Brain 1 pulled from RAG++ during classification
4. The routing reason (why Brain 1 escalated)

Protocol:

```json
{
  "handoff": {
    "from": "cortex-twin",
    "to": "reasoning-twin",
    "task_id": "uuid",
    "original_message": "...",
    "classification": {
      "type": "coding",
      "complexity": 0.78,
      "skill": "bot:pulse",
      "project": "PULSE-V1"
    },
    "context": [
      {"source": "rag++", "doc_id": "...", "relevance": 0.92},
      {"source": "memory", "file": "active-tasks.md", "section": "..."}
    ],
    "escalation_reason": "Complexity above Brain 1 threshold (0.78 > 0.70)",
    "timestamp": "2026-02-18T12:00:00Z"
  }
}
```

Clawdbot's Cortex daemon manages this handoff. Brain 2 receives the handoff as a structured context prefix before the task prompt.
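
A sketch of how the handoff payload might be built and turned into a context prefix. The class names, the `as_context_prefix` helper, and the `<handoff>` wrapper tags are illustrative assumptions; only the field names come from the protocol above.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class Classification:
    type: str
    complexity: float
    skill: str
    project: str


@dataclass
class Handoff:
    original_message: str
    classification: Classification
    context: list
    escalation_reason: str
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    )
    from_: str = "cortex-twin"  # trailing underscore: `from` is a Python keyword
    to: str = "reasoning-twin"

    def to_json(self) -> str:
        body = asdict(self)
        body["from"] = body.pop("from_")  # restore the protocol's field name
        return json.dumps({"handoff": body}, indent=2)


def as_context_prefix(handoff: Handoff, task_prompt: str) -> str:
    """Place the structured handoff ahead of the task prompt for Brain 2."""
    return f"<handoff>\n{handoff.to_json()}\n</handoff>\n\n{task_prompt}"
```

Keeping `task_id` stable across the handoff also gives the eval suite a key for joining Brain 1's routing decision with Brain 2's eventual output.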

Deep-Dive 2: V9 Training Data Generation Scripts

New scripts needed (placed in `scripts/`):

| Script | Source | Output | Records |
|---|---|---|---|
| `gen_v9_skills.py` | 141 SKILL.md files | Skill routing SFT pairs | ~500 |
| `gen_v9_plans.py` | 23 pulse plan JSONs | Task decomposition SFT pairs | ~200 |
| `gen_v9_governance.py` | AGENTS/SOUL/HEARTBEAT/PROTOCOLS.md | Behavioral SFT + DPO pairs | ~250 |
| `gen_v9_architecture.py` | 32 CLAUDE.md files | Design decision SFT pairs | ~300 |
| `gen_v9_memory.py` | 35+ memory files | Decision narrative SFT pairs | ~250 |
| `gen_v9_kimi.py` | kimi_memory.db (density-filtered) | Conversation SFT pairs | ~750 |
| `merge_v9.py` | All V9 sources + V8 base | Final combined datasets | ~80K total |

Each generator uses Gemini 3 Pro for pair creation + RLM quality gate (existing pattern from V8).

Deep-Dive 3: Mac4 Training Benchmark

Before committing to weekly Brain 1 training:

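```bash
# On Mac4 via SSH:
# 1. Install MLX-LM (Apple's native training framework for M-series)
pip install mlx-lm

# 2. Download and convert Llama 3.2-3B Instruct (gated repo: accept the license
#    on Hugging Face and log in first)
mlx_lm.convert --hf-path meta-llama/Llama-3.2-3B-Instruct --mlx-path ./llama-3.2-3b-mlx

# 3. Benchmark LoRA training on 1000 records
#    ./test-data must contain train.jsonl and valid.jsonl.
#    mlx_lm.lora counts iterations, not epochs: 1000 records x 2 epochs / batch 4 = 500.
#    (Older mlx-lm releases name the layer flag --lora-layers.)
time mlx_lm.lora --model ./llama-3.2-3b-mlx \
  --train --data ./test-data \
  --batch-size 4 --num-layers 16 --iters 500

# Expected: 2-4 hours for 6,500 records on M4 16GB
# If >6 hours, consider reducing dataset or epochs
```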
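
To turn the 1,000-record benchmark into an estimate for the full Brain 1 dataset, a sketch like the following can help. It assumes the trainer is driven by an iteration count (as `mlx_lm.lora` is with `--iters`) and that time per iteration stays roughly constant as the dataset grows, which the benchmark itself should confirm. The 1,800-second figure in the example is made up.

```python
import math


def iters_for(records: int, batch_size: int, epochs: int) -> int:
    """Iterations needed to pass over `records` examples `epochs` times."""
    return math.ceil(records * epochs / batch_size)


def extrapolate_hours(bench_seconds: float, bench_iters: int, target_iters: int) -> float:
    """Scale a measured benchmark linearly to a larger iteration count."""
    return bench_seconds / bench_iters * target_iters / 3600


bench_iters = iters_for(1_000, batch_size=4, epochs=2)   # 500
full_iters = iters_for(6_550, batch_size=4, epochs=2)    # 3275

# Hypothetical measurement: the 500-iteration benchmark took 1,800 seconds.
print(f"{extrapolate_hours(1_800, bench_iters, full_iters):.2f} hours")
```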

Deep-Dive 4: Together AI Billing Resolution

Current status check needed:

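```bash
# Assumes the installed Together CLI exposes a billing subcommand; if it does not,
# use the dashboard URL below instead.
together billing status
# Or check https://api.together.xyz/settings/billing
```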

Options if still blocked:
1. Add payment method — resolve billing limit directly
2. Vast.ai for training only — rent A100 80GB ($0.40-1.00/hr), train there, upload adapter to Together AI for serving
3. Lambda Labs — alternative GPU cloud, often cheaper spot instances
4. RunPod — another alternative, good Unsloth/Axolotl support

For serving, Together AI serverless LoRA is still the best option ($0.20/$0.60/MTk). Training can happen anywhere.

---

3c. Master Execution Checklist

---

PHASE 1: Data Preparation (Days 1-3)

[ ] 1.1 Resolve Together AI billing status
Owner: Mo (requires payment method)
Input: Together AI account access
Output: Active billing, confirmed LoRA training + serving access
Validation: `together billing status` returns active
Depends on: Nothing
Status: Not Started

[ ] 1.2 Benchmark Mac4 LoRA training
Owner: Claw (via Mac4 SSH)
Input: Mac4 access, MLX-LM installed
Output: Training time benchmark for 3B LoRA on 1K/5K/10K records
Validation: Benchmark report with tokens/sec, total time, memory usage
Depends on: Nothing
Status: Not Started

[ ] 1.3 Complete Kimi memory density scoring
Owner: Claw (Mac4)
Input: kimi_memory.db (30,704 messages), density scorer pipeline
Output: All messages scored, filtered CORE+ENRICHED set extracted
Validation: density_scores file with 30K+ entries, CORE/ENRICHED subset extracted
Depends on: Nothing
Status: In Progress (9,155/30,704 scored)

[ ] 1.4 Write V9 training data generators
Owner: Claw
Input: V9 audit doc, existing V8 scripts as templates
Output: 7 generator scripts in `scripts/` (skills, plans, governance, architecture, memory, kimi, merge)
Validation: Each script runs without errors, produces valid CTv3.1 JSONL
Depends on: Nothing
Status: Not Started

[ ] 1.5 Test Qwen3-235B N'Ko/multilingual capability
Owner: Claw
Input: 10 N'Ko test prompts, Qwen3-235B via Together AI
Output: N'Ko capability assessment (pass/fail per task)
Validation: Report showing Qwen3's N'Ko token handling, generation quality
Depends on: 1.1 (Together AI access)
Status: Not Started

---

PHASE 2: Brain 1 — Cortex Twin (Days 3-7)

[ ] 2.1 Generate Brain 1 training dataset
Owner: Claw
Input: V9 generators (1.4), density-scored Kimi data (1.3)
Output: `data/brain1_v9/sft_train.jsonl` + `dpo_train.jsonl` (~6,550 records)
Validation: Record count matches estimate, format validates, no duplicates
Depends on: 1.3, 1.4
Status: Not Started

[ ] 2.2 Train Brain 1 LoRA on Mac4
Owner: Claw (Mac4)
Input: Brain 1 dataset (2.1), Llama 3.2-3B base model, MLX-LM
Output: LoRA adapter `twin-cortex-v9` in Ollama format
Validation: Training completes, loss < 1.0, adapter loads in Ollama
Depends on: 1.2, 2.1
Status: Not Started

[ ] 2.3 Eval Brain 1 routing accuracy
Owner: Claw
Input: 200 labeled routing tasks, Brain 1 model (2.2)
Output: Eval report — routing accuracy, false escalation rate, speed
Validation: Accuracy >90%
Depends on: 2.2
Status: Not Started

[ ] 2.4 Deploy Brain 1 to Mac4 Ollama
Owner: Claw (Mac4)
Input: Passing eval (2.3), Ollama adapter
Output: `twin-cortex-3b:latest` running on Mac4:11434
Validation: `curl http://[ip]:11434/api/generate` returns valid response
Depends on: 2.3 (must pass)
Status: Not Started

---

PHASE 3: Brain 2 — Reasoning Twin (Days 5-14)

[ ] 3.1 Generate Brain 2 training dataset expansion
Owner: Claw
Input: V9 generators (1.4), full V8 dataset, Gemini 3 Pro API
Output: `data/brain2_v9/` — merged V5-V9 dataset (~78,818 records)
Validation: Record count, format validation, dedup check
Depends on: 1.4, 1.5 (N'Ko assessment informs data balance)
Status: Not Started

[ ] 3.2 Upload dataset to training platform
Owner: Claw
Input: Brain 2 dataset (3.1), Together AI or Vast.ai credentials
Output: Dataset uploaded, file ID confirmed
Validation: Upload complete, file verified on platform
Depends on: 1.1, 3.1
Status: Not Started

[ ] 3.3 Train Brain 2 LoRA on Qwen3-235B
Owner: Claw (remote)
Input: Uploaded dataset (3.2), Qwen3-235B base
Output: LoRA adapter `twin-reasoning-v9`
Validation: Training completes, loss < 1.0, adapter saves
Depends on: 3.2
Status: Not Started

[ ] 3.4 Eval Brain 2 Twin Fidelity
Owner: Claw
Input: Twin Fidelity eval suite, Brain 2 model (3.3)
Output: Eval report — fidelity score, code quality, permission-seeking rate
Validation: Fidelity >0.80, code quality >95%
Depends on: 3.3
Status: Not Started

[ ] 3.5 Deploy Brain 2 to Together AI serverless
Owner: Claw
Input: Passing eval (3.4), Together AI account
Output: Serverless LoRA endpoint active
Validation: API call returns valid response with Twin patterns
Depends on: 3.4 (must pass)
Status: Not Started

---

PHASE 4: Integration (Days 10-18)

[ ] 4.1 Implement inter-brain handoff protocol in Cortex
Owner: Claw
Input: Handoff protocol spec (Stage 3 deep-dive), Cortex daemon code
Output: Cortex daemon routes tasks between Brain 1, Brain 2, and frontier
Validation: 50 test tasks correctly routed with handoff context
Depends on: 2.4, 3.5
Status: Not Started

[ ] 4.2 Add Twin models to Clawdbot config
Owner: Claw
Input: Clawdbot gateway config, model endpoints
Output: `twin-cortex` and `twin-reasoning` available as model options
Validation: `clawdbot status` shows both models, test query works
Depends on: 2.4, 3.5
Status: Not Started

[ ] 4.3 Wire RAG++ context injection for both brains
Owner: Claw
Input: RAG++ API, Cortex daemon
Output: Both brains receive relevant RAG++ context with every query
Validation: Brain 2 references current file state in coding responses
Depends on: 4.1, 4.2
Status: Not Started

[ ] 4.4 Implement automatic rollback mechanism
Owner: Claw
Input: Quality metrics baseline, monitoring script
Output: Cron job that checks Twin quality hourly, rolls back to Claude Max if degraded
Validation: Simulated degradation triggers automatic rollback within 1 hour
Depends on: 4.2
Status: Not Started
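
One way the hourly check in 4.4 might decide whether to roll back is sketched below. It assumes every metric is "higher is better" and that a 5% relative drop from baseline counts as degraded; both the threshold and the metric names in the usage note are illustrative, not taken from the plan.

```python
def degraded_metrics(baseline: dict[str, float], current: dict[str, float],
                     max_relative_drop: float = 0.05) -> list[str]:
    """Return the names of metrics that fell too far below baseline.

    A metric missing from `current` is treated as degraded, so a broken
    monitoring run fails safe toward rollback.
    """
    degraded = []
    for name, base in baseline.items():
        value = current.get(name)
        if value is None or value < base * (1 - max_relative_drop):
            degraded.append(name)
    return degraded


def should_roll_back(baseline: dict[str, float], current: dict[str, float]) -> bool:
    """Roll back to the fallback model if any metric is degraded."""
    return bool(degraded_metrics(baseline, current))
```

For example, with a baseline of `{"routing_accuracy": 0.92, "fidelity": 0.82}`, a current fidelity of 0.70 would trigger rollback while a routing accuracy of 0.91 would not.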

---

PHASE 5: Gradual Rollout (Days 18-32)

[ ] 5.1 10%
Owner: Claw
Input: Deployed system (Phase 4), traffic router
Output: 10%
Validation: 48h monitoring, routing accuracy maintained, no quality drops
Depends on: Phase 4 complete
Status: Not Started

[ ] 5.2 25%
Owner: Claw
Input: Passing 5.1 metrics
Output: Brain 2 handles 25%
Validation: 48h monitoring, code quality maintained, fidelity stable
Depends on: 5.1 (48h passing)
Status: Not Started

[ ] 5.3 50%
Owner: Claw
Input: Passing 5.2 metrics
Output: Half of all interactions through Twin system
Validation: 1 week monitoring, all metrics stable
Depends on: 5.2 (48h passing)
Status: Not Started

[ ] 5.4 100%
Owner: Claw + Mo decision
Input: Passing 5.3 metrics, Mo's approval
Output: Twin handles all routine tasks, Claude Max is escalation-only
Validation: 2 week monitoring, cost reduction confirmed, quality maintained
Depends on: 5.3 (1 week passing), Mo approval
Status: Not Started
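
The staged ramp in Phase 5 needs a traffic router that sends a fixed share of work to the Twin. A minimal sketch, assuming each task has a stable string ID: hashing the ID into one of 100 buckets keeps the split deterministic, so a task that is routed to the Twin at one stage stays there when the percentage is raised. The function name and the use of task IDs are assumptions.

```python
import hashlib


def routes_to_twin(task_id: str, rollout_percent: int) -> bool:
    """Deterministically assign a task to the Twin for a given rollout share."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto buckets 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent
```

Because the bucket depends only on the task ID, raising the percentage only adds tasks to the Twin's share and never moves existing ones back, which keeps the 48-hour monitoring windows comparable across stages.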

---

PHASE 6: Continuous Learning Pipeline (Days 25-35, ongoing)

[ ] 6.1 Automate Brain 1 weekly training cycle
Owner: Claw
Input: Training pipeline scripts, Mac4 cron
Output: Launchd service that trains Brain 1 every Sunday night
Validation: First automated cycle completes, model deploys, eval passes
Depends on: Phase 2 complete, Phase 5 in progress
Status: Not Started

[ ] 6.2 Automate Brain 2 monthly training cycle
Owner: Claw
Input: Training pipeline scripts, Together AI/Vast.ai automation
Output: Monthly cron that generates data, trains, evals, deploys
Validation: First automated cycle completes end-to-end
Depends on: Phase 3 complete, Phase 5 in progress
Status: Not Started

[ ] 6.3 Build drift detection dashboard
Owner: Claw
Input: Quality metrics, routing accuracy logs
Output: Discord channel (#twin-health) with daily metrics digest
Validation: Dashboard shows metrics for 7 days, alerts on degradation
Depends on: Phase 5 in progress
Status: Not Started
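
The "alerts on degradation" part of 6.3 could be as simple as comparing the newest daily value against a trailing window. The sketch below flags a drop of more than two standard deviations below the window mean; the 7-point minimum mirrors the 7 days in the validation line, while the z-score threshold is an assumption.

```python
from statistics import mean, pstdev


def drift_alert(history: list[float], latest: float,
                min_points: int = 7, z_threshold: float = 2.0) -> bool:
    """Return True when `latest` is unusually low relative to `history`."""
    if len(history) < min_points:
        return False  # not enough data to judge drift yet
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A perfectly flat history: any drop at all is a change.
        return latest < mu
    return (mu - latest) / sigma > z_threshold
```

This only detects sudden drops. A slow decline that drags the window mean down with it would need a comparison against a fixed baseline instead.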

[ ] 6.4 Document V9→V10 upgrade path
Owner: Claw
Input: V9 learnings, architecture decisions
Output: V10 planning doc (micro-LoRAs, ensemble, MoE distillation)
Validation: Document covers all V10 candidates from Stage 1 paths
Depends on: Phase 5 complete
Status: Not Started

---

Total timeline: ~5 weeks from go signal
Total training cost: ~$40-150 (one-time) + $30-150/mo (inference)
**Expected outcome:** 50%

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

Comp-Core/packages/cognitive-twin/EVOCUBE_V9.md
