Grand Diomande Research · Full HTML Reader

Stage 2: Compound -- KARL Phase 4+ Unified Architecture

We have a trajectory recording system (110 records), a reward engine (3-signal composite), a shadow vector router (10% cache hit rate), and a training pipeline that produced one adapter (KARL v2, loss 1.843, gemma-3-1b-4bit) from 35 SFT examples. The adapter exists but has never been evaluated for actual routing or planning quality. The finetune daemon on Mac5 is down. The promotion gate says the shadow router is not ready.

Agents That Account for Themselves architecture technical paper candidate score 44 .md

Full Public Reader

Stage 2: Compound -- KARL Phase 4+ Unified Architecture

Starting from Scratch: Ground Truth

We have a trajectory recording system (110 records), a reward engine (3-signal composite), a shadow vector router (10

The honest assessment: We have a data collection system that works and a training pipeline that runs. We do not yet have evidence that the trained model improves anything. The gap between "model trained" and "model useful" is the gap this compound must close.

---

Step 1: Fix Infrastructure Foundation (Path E core)

Before any data or algorithm work, the training infrastructure must be reliable and observable.

Actions:
1. SSH to Mac5, restart finetune daemon, create LaunchAgent `com.openclaw.finetune-daemon.plist` with auto-restart
2. Increase training hyperparameters: seq_len 256->512, LoRA rank 8->16, layers 4->8, batch_size 1->2
3. Create `[home-path]` with three evaluation functions:
- `evaluate_routing_accuracy()`: Given test prompts, does the model predict the correct skill?
- `evaluate_planning_quality()`: Given tasks, does the model generate tool plans that match reference?
- `compare_adapters()`: Head-to-head comparison between two adapter versions
4. Build a held-out evaluation set: 20 real prompts with known-correct skills and tool plans, manually curated from the 34 high-reward trajectories. This set is NEVER used for training.

Validation gate: Finetune daemon responds on :9200. Evaluator runs successfully on KARL v2. Baseline metrics recorded.

From Path E: Hyperparameter optimization, evaluation pipeline, daemon reliability.
Rejected from Path E: Multi-model comparison (deferred to Step 6), daily micro-training (premature).

---

Step 2: Sharpen the Reward Engine (Path F core)

The reward distribution is too compressed (mean 0.6066, range 0.32-0.79). The advantage signal is too weak for meaningful training contrast.

Actions:
1. Add verification reward signal (20
2. Add build/test detection to Tap B: pattern-match xcodebuild/pytest/npm-test/git-push commands, record build_success as objective ground truth.
3. Enhance Tap D correction detection: add 8 new patterns for subtle corrections ("actually, let's", "instead, use", "undo that") and 4 positive signals ("looks good", "ship it", "next, let's").
4. Implement z-score advantage computation: domain-specific baselines with 30-day rolling window and 14-day EMA halflife. Normalize advantage by domain stddev. Floor stddev at 0.05.
5. Re-score all 110 trajectories with upgraded engine. Re-export SFT data.

Validation gate: New reward distribution has stddev >= 0.12 (up from ~0.09). Advantage range spans at least [-2, +2] after z-score normalization. High-reward trajectories (top 25

From Path F: Verification signal, build detection, enhanced Tap D, z-score normalization.
Rejected from Path F: Power-law sharpening (z-score is sufficient). Rank-based normalization (loses absolute reward information).

---

Step 3: Implement Multi-Stage Trajectory Filtering (Path B core)

Replace the single-filter SFT export with a composable 6-stage pipeline.

Actions:
1. Create `TrajectoryFilter` class in sft_exporter.py with stages:
- Stage 1: Reward floor (>= 0.35)
- Stage 2: Tool diversity (Shannon entropy >= 0.5, requiring >= 2 unique tool types)
- Stage 3: Novelty filter (max cosine similarity < 0.90 to already-selected trajectories)
- Stage 4: Difficulty calibration (difficulty score in [0.2, 0.8])
- Stage 5: Continuous advantage weighting (weight = clip(1 + advantage_zscore * 0.5, 0.5, 3.0))
- Stage 6: Skill coverage balancing (min 3 per skill, max 30
2. Add stage-by-stage survival stats to export output (logged to karl_status.json)
3. The novelty filter requires prompt embeddings. Use existing embedding_cache.py. For trajectories without cached embeddings, fall back to content-hash dedup.

Validation gate: Filter pipeline produces >= 40 training examples from 110 trajectories (up from 31). Skill distribution has at least 5 skills represented. No single skill exceeds 30

From Path B: All 6 filter stages.
Conflict resolved: Path B's novelty filter requires embeddings, but embedding cache hit rate is 10

---

Step 4: Build Agentic Synthetic Q&A Generator (Path A core)

Replace the crude git-diff synthetic_qa.py with a two-stage agentic synthesis pipeline.

Actions:
1. Create `[home-path]` with:
- `index_corpus()`: Verify obsidian vault and key codebase dirs are indexed in RAG++ pgvector
- `synthesize_questions(n_clusters=50)`: For each topic cluster, retrieve 3-5 docs via RAG++, prompt Gemini 2.5 Flash to generate 2-4 Q&A pairs requiring multi-step tool use
- `validate_questions()`: Run 2 solver variants on Mac5 (gemma-1b base + gemma-1b adapter). Filter by pass rate [0.25, 0.75]. Start with 2 solvers (not 4) to keep validation time under 3 hours.
- `quality_filter()`: Gemini 2.5 Flash judges factual accuracy, ambiguity, tool-use requirement
2. Target: 150 raw questions -> 60-80 after pass-rate filter -> 50-70 after quality filter
3. Integrate output into sft_exporter.py alongside trajectory data

Validation gate: >= 50 synthetic Q&A pairs pass all filters. Each requires >= 3 tool steps. At least 8 of 13 skill domains are covered.

From Path A: Two-stage synthesis, multi-solver validation, quality filter.
Conflict resolved: Path A proposes 4 solvers (including Qwen-9B). Resolution: start with 2 solvers (gemma-1b) to validate the pipeline works, then add Qwen-9B in Step 6 after confirming it trains on Mac5.

---

Step 5: Add Expert Distillation Track (Path D routing component)

The 1B model serves two functions: routing (which skill?) and planning (which tools?). Routing is the higher-value target because correct routing is the prerequisite for everything else.

Actions:
1. Create `[home-path]` with:
- `extract_routing_pairs()`: From the 110 trajectories, extract (prompt_text, skill_name) pairs where skill was explicitly injected and the session had reward >= 0.55
- `generate_expert_plans()`: For each synthetic question from Step 4, call Gemini 2.5 Flash to generate an expert tool-use plan. These are gold-standard references.
- `build_routing_sft()`: Format routing pairs as ChatML where user provides prompt and assistant responds with skill name
- `build_planning_sft()`: Format planning data as ChatML where user provides task and assistant responds with tool sequence
2. Merge routing + planning + trajectory SFT + synthetic Q&A into a unified training set
3. Tag each example with its source (trajectory/synthetic/expert/routing) for analysis

Validation gate: Combined training set has >= 120 examples. Routing subset has >= 30 examples covering >= 10 skills. Expert plans are structurally valid (contain tool names and file paths).

From Path D: Routing distillation, expert plan generation via Gemini.
Rejected from Path D: Separate routing model (single model handles both tasks). Gemini as an inference source (only for plan generation, not for live routing).

---

Step 6: Model Comparison and Algorithm Selection (Path C phased + Path E multi-model)

With reliable infrastructure (Step 1), strong training data (Steps 2-5), and evaluation (Step 1), we can now empirically answer: which model and which algorithm?

Actions:
1. Train KARL v3 adapter (gemma-3-1b-4bit) on the combined dataset from Step 5 using SFT+ (continuous advantage weighting from Step 3)
2. Evaluate KARL v3 vs KARL v2 using the held-out evaluation set from Step 1
3. Train KARL v3-qwen adapter (Qwen3.5-9B-4bit) on the same dataset with conservative hyperparameters (batch 1, rank 8, grad_checkpoint true)
4. Evaluate KARL v3-qwen vs KARL v3 on the same held-out set
5. Decision point: If Qwen-9B routing accuracy is >= 10

Validation gate: KARL v3 routing accuracy > KARL v2 routing accuracy (the new data and filters actually improved the model). At least one adapter achieves >= 60

From Path C: SFT+ as Phase 1 algorithm. The DPO and OAPL phases are deferred to Step 8.
From Path E: Multi-model comparison, Qwen-9B training config.
Conflict resolved: Path C recommends phased algorithm rollout (SFT+ -> DPO -> OAPL). Path E recommends multi-model comparison. Resolution: run them together. Train SFT+ on both models, compare, then apply DPO to the winner.

---

Step 7: Iterative Bootstrapping (Path F bootstrap loop)

With a validated model (Step 6) and synthetic questions (Step 4), implement the iterative improvement loop.

Actions:
1. Create `[home-path]`:
- Use the Step 6 winning adapter as iteration 0's pi_ref
- Generate rollouts on the 50-70 synthetic questions from Step 4
- Score rollouts with upgraded reward engine from Step 2
- Filter through pipeline from Step 3
- Combine with original training data
- Train iteration 1 adapter
- Evaluate against iteration 0 using held-out set
- If routing accuracy improved by >= 3
- If not: stop (convergence or collapse)
2. Maximum 3 iterations (matching Databricks paper)
3. Save every iteration's adapter for comparison

Validation gate: At least 1 bootstrap iteration produces measurable improvement (>= 3

From Path F: Bootstrap loop architecture, convergence detection.
Conflict resolved: Path F proposes 5 max iterations. Path C's OAPL would replace SFT in later iterations. Resolution: cap at 3 iterations for SFT+ (proven in paper). OAPL is Step 8.

---

Step 8: DPO/OAPL Upgrade (Path C Phase 2-3, conditional)

Only execute if Steps 1-7 demonstrate that the model has capacity for RL signal (routing accuracy >= 60

Actions:
1. Construct preference pairs from trajectory data: high-reward trajectory = chosen, low-reward trajectory for same skill domain = rejected
2. Implement DPO training loop in MLX (custom script, ~200 lines)
3. Train DPO adapter on preference pairs
4. Evaluate DPO adapter vs best SFT+ adapter
5. If DPO improves over SFT+: Proceed to OAPL implementation
- Generate 4 rollouts per prompt on Mac5 overnight
- Implement OAPL loss function (squared regression)
- Train OAPL adapter
- Evaluate against DPO
6. If DPO does not improve: Stop. SFT+ is the ceiling for this model scale.

Validation gate: DPO routing accuracy > SFT+ routing accuracy by >= 5

From Path C: DPO as Phase 2, OAPL as Phase 3, conditional execution.
This step may never execute -- and that is the correct outcome if the model lacks RL capacity.

---

Compound Summary

Step	Primary Path	Deliverable	Dependencies	Effort
1	E	Infrastructure + Evaluation	None	1 day
2	F	Reward Engine v2	None	0.5 day
3	B	Multi-Stage Filter	Step 2 (new scores)	0.5 day
4	A	Agentic Synthetic Q&A	Step 1 (Mac5 working)	1.5 days
5	D	Expert Distillation	Step 4 (questions)	1 day
6	C+E	Model Comparison	Steps 1-5 (data + eval)	1 day
7	F	Iterative Bootstrap	Steps 4+6 (questions + model)	1 day
8	C	DPO/OAPL (conditional)	Step 7 (evidence of capacity)	2-3 days

Total estimated effort: 7-9.5 days
Critical path: Steps 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7 (sequential, ~6.5 days)
Step 8 is parallel/optional: Only if Steps 1-7 prove the model can learn

---

What Was Rejected and Why

Rejected Element	Source	Reason
8 solver variants for validation	Path A	4 is enough; 8 takes 10+ hours on Mac5
Power-law reward sharpening	Path F	Z-score normalization is sufficient
Separate routing model	Path D	Single model, two output modes is simpler
PPO	Path C	Too complex, unstable on small models
Daily micro-training	Path E	Premature; weekly is sufficient until data pipeline matures
Training-Free GRPO	Path C	No parameter updates = no persistent learning
Gemini as live inference source	Path D	Adds latency and cost to every routing decision
Qwen3-8B-8bit	Path E	OOM risk too high; Qwen3.5-9B-4bit is the right tradeoff

---

Key Insight from Compounding

The six paths have a natural dependency chain that was not visible when examining them individually:

Infrastructure (E) -> Signal Quality (F) -> Data Quality (B) -> Data Quantity (A) -> Expert Data (D) -> Model Selection (C/E) -> Iteration (F) -> Algorithm (C)

Attempting to implement OAPL (Path C Phase 3) before having reliable evaluation (Path E), strong reward signal (Path F), sufficient data (Path A), and evidence of model capacity (Path C Phase 1) would be building on sand. The compound enforces this sequencing.

The second insight: evaluation is the bottleneck, not training. Training gemma-1b takes 8 minutes. But without evaluation, we cannot know if those 8 minutes produced an improvement. The held-out evaluation set from Step 1 is the most valuable single deliverable in this entire compound.

Promotion Decision

Promote into a technical note or architecture paper with implementation anchors.

Source Anchor

evo-cube-output/carl-karl-trajectory-rl/stage2-compound.md

Detected Structure

Method · Evaluation · References · Code Anchors · Architecture · is Stage Research