KARL Integration -- Evolution3 / Stage 3: EXPAND + MASTER PLAN
| Assumption (from Stage 2) | Reality | Impact | |---|---|---| | unified.jsonl has 3,909 entries with tool_calls arrays | **3,940 entries, but only 3 have populated tool_calls** (all empty arrays) | **CRITICAL**: The unified store does NOT contain usable trajectory data | | verbose-all.jsonl has 3,249 entries, 157 with multi-step tool sequences | **3,258 entries, 157 with tool_calls in assistant_turns** (confirmed) | Correct, but these are 96% Codex entries (`exec_command`, `shell_command`), not Claude Code entries
Full Public Reader
# KARL Integration -- Evolution3 / Stage 3: EXPAND + MASTER PLAN
Run: karl-trajectory-intelligence
Generated: 2026-03-10
Method: Evolution3 -- stress-test and master checklist
Run Directory: Desktop/evo-cube-output/karl-trajectory-intelligence/
---
PART A: AUDIT -- What Holds, What Breaks, What's Missing
Verified Ground Truth (as of 2026-03-10)
Before auditing the compound, I verified every data assumption against the live system:
| Assumption (from Stage 2) | Reality | Impact |
|---|---|---|
| unified.jsonl has 3,909 entries with tool_calls arrays | 3,940 entries, but only 3 have populated tool_calls (all empty arrays) | CRITICAL: The unified store does NOT contain usable trajectory data |
| verbose-all.jsonl has 3,249 entries, 157 with multi-step tool sequences | 3,258 entries, 157 with tool_calls in assistant_turns (confirmed) | Correct, but these are 96 |
| 88 skills, 13 active in registry | Confirmed: 88 skills, 13 active | Accurate |
| ops_trigger.py is 233 lines | 232 lines | Accurate |
| RAG++ at :8000 reachable | Confirmed: healthy, returns related_turns + graph_context | Accurate |
| Mac5 MLX :8100 reachable | Unreachable via direct curl AND via SSH | BLOCKER: Mac5 is offline or unreachable right now |
| Mac5 finetune-daemon :9200 | Unreachable | Same blocker as above |
| numu-weave at [home-path] | Confirmed: 270 lines | Accurate |
| l4_controller.py exists | Does not exist (greenfield) | Expected -- this is what we build |
| pulse_bridge.py exists | 981 lines (much larger than Stage 0's estimate of "lines 109-123") | Need to account for actual complexity |
| Supabase has 141 tables | Could not verify (API key not in env) | Assume accurate per MEMORY.md |
Step-by-Step Audit
---
STEP 1: Unified Data Layer -- HOLDS WITH MAJOR REVISION
What holds:
- The two-channel architecture (live tap + historical backfill) is correct.
- The JSONL append-only store design is sound for hook latency constraints.
- The schema is comprehensive and covers all downstream consumers.
- The tap points in ops_trigger.py, post_tool_hook.py, response_hook.py, and session_end_hook.py are identified correctly.
What breaks under pressure:
1. The backfill data is mostly Codex, not Claude Code. Of 157 entries with tool_calls, the tool names are `exec_command` (1,632), `shell_command` (1,108), `apply_patch` (266). These are Codex tool names. Only 7 entries have `read_file` (Claude Code's equivalent). The backfill extractor assumed these entries represent Claude Code workflows, but they represent Codex workflows. The tool name normalization (`shell_command` -> `Bash`) is technically possible but produces Codex-biased trajectories that may not transfer to Claude Code routing.
2. Claude Code tool calls are NOT captured in unified.jsonl. All 3,940 entries have `tool_calls: []`. The response_hook writes tool data to verbose-all.jsonl (via `assistant_turns[].tool_calls`), but the unified store aggregation drops this data. The compound assumed unified.jsonl was the goldmine -- it is not. The goldmine is verbose-all.jsonl, but only for Codex sessions.
3. Claude Code session tool capture appears broken or non-existent. Of 3,258 verbose entries, `source: "claude"` entries (40 of last 50) have zero tool_calls. Only `source: "codex"` entries (10 of last 50) capture tools. This means the response_hook's tool extraction pathway may not fire for Claude Code sessions, or the transcript data it parses does not include tool use details in the same format.
Revised assessment: The historical backfill channel produces ~157 usable records, but they are predominantly Codex agent trajectories. The live recording channel (Taps A-D) is the actual critical path. Without live recording working for Claude Code sessions, KARL has no trajectory data to learn from.
Gap: Transcript Parser. The response_hook at `[home-path]` (1,283 lines) needs investigation. Why are Claude Code tool calls not captured? Likely: Claude Code's Stop event payload does not include the same structured tool data that Codex sessions produce. The tap implementation must parse Claude Code's actual output format, which may require reading the conversation JSONL files that Claude Code maintains internally.
---
STEP 2: Reward Engine -- HOLDS, NO MAJOR ISSUES
What holds:
- The 3-level reward design (outcome score, process reward, fleet fitness) is architecturally sound.
- Process reward computation from exit codes and error counts is self-contained.
- Advantage computation via daily batch Prefect flow is feasible.
- The composite reward weights (0.35/0.25/0.20/0.20) are reasonable defaults.
What breaks under pressure:
1. Outcome score depends on "next prompt" signals that may not arrive. If a session has only one prompt (common for quick tasks), the cross-turn annotation (Tap D) never fires. The `annotation_status` will be `"partial"` for a significant fraction of records. Estimate: 30-40
2. The correction_signal (weight 0.20) relies on correction_detector.py, which has a 500ms SIGALRM. Both the correction detector AND the trajectory tap would fire on the same Stop event. Combined budget: 500ms. The trajectory flush (Tap C, ~15ms) plus reward computation (~5ms) eats into the correction detector's budget. If they run sequentially, there is no budget issue (each hook invocation gets its own 500ms). But if they share a process, there could be contention.
3. Fleet fitness (L4 signal) assumes CALC dispatch markers in trajectory records. Currently, pulse_bridge.py does not write any session markers to the trajectory store (which does not exist yet). The marker injection depends on pulse_bridge modification.
No revision needed, but the partial-annotation rate should be tracked as a metric from day 1.
---
STEP 3: Routing Intelligence -- HOLDS WITH LATENCY RISK
What holds:
- Embedding-based routing via RAG++ at :8000 is architecturally sound.
- Trajectory-weighted similarity is the right evolution from regex matching.
- Shadow mode before live cutover is prudent.
- Two-tier learning (real-time weight updates + daily re-embedding) is clean.
- The local cache (skill_embeddings.pkl, ~60KB, 1h TTL) handles the cache layer correctly.
What breaks under pressure:
1. Embedding latency is the hard constraint. The compound says "~80ms with cache warmth" for the RAG++ embedding call. I tested the gateway and it returns results, but the embedding is generated server-side via Gemini API. Cold-path latency for a fresh embedding is 200-400ms (Gemini API round-trip). The 500ms hook budget leaves only 100-300ms for all other processing. If the Gemini API has a bad day (500ms+ latency spikes), the entire routing hook gets SIGALRM-killed.
Mitigation: The compound's design includes a fallback to regex if embedding fails. This is correct but needs to be the DEFAULT behavior during the shadow period -- regex-first, with embedding computed async and logged for comparison, not blocking the hook.
2. **Multi-skill composition (two skills within 20
3. The similarity threshold (0.35) is arbitrary. Without trajectory data to calibrate against, this threshold was invented. Too low = noisy matches. Too high = cold prompts never routed. The shadow period MUST include threshold sweep analysis.
Revision: Move embedding to async-fire-and-log during shadow mode. Do NOT block the hook on Gemini API latency. After 2 weeks of data, calibrate the threshold empirically.
---
STEP 4: Evolution Integration (L4) -- HOLDS BUT DEPENDS ON STEPS 1-3
What holds:
- L4 as a policy consumer (not trajectory producer) is the right separation.
- The 5 mutation operators (affinity reinforcement, duration prior, search template, strategy lock, weight redistribution) are well-designed.
- The 4 invariants (minimum trajectory entropy, bounded policy divergence, cross-layer coupling, no degenerate routing) prevent catastrophic failure modes.
What breaks under pressure:
1. L4 fires every 90 L1 steps, but L1 cycle frequency is variable. The daemon's heartbeat interval is adaptive [30s, 600s]. If the heartbeat runs at 300s average, 90 L1 steps = 7.5 hours between L4 fires. If the heartbeat runs at 30s (maximum speed), 90 L1 steps = 45 minutes. The L4 fire rate is unpredictable and depends on EW's feedback metabolism state.
2. L4 needs CALC dispatch data that does not yet exist in trajectory format. The pulse_bridge.py is 981 lines -- much more complex than the compound assumed. Adding CALC session markers requires understanding the actual dispatch flow, not just "add a marker at dispatch time." This is a non-trivial integration.
3. M3 (Search Template Mutation) probes new GK queries, but GK entity naming is fragile. From MEMORY.md gotchas: entities are bare lowercase names, multi-word names need exact-then-fallback. An L4-generated template that uses `project:evolution-world` instead of `evolution` will silently return empty results. The mutation operator needs a GK query validator.
Revision: L4 should be Phase 5 (not Phase 4) since it depends on trajectory volume from Phases 1-3. Move pulse_bridge integration to a separate sub-task with its own spike.
---
STEP 5: Training Pipeline (OAPL-Lite) -- BLOCKED ON MAC5
What holds:
- The advantage-weighted SFT concept is correct.
- The quality gate (Path B thresholds for live, Path E LLM judge for self-play) is well-layered.
- The numu-weave extension (new source type, advantage weight) is minimal and clean.
- The A/B evaluation design is sound.
What breaks under pressure:
1. Mac5 is unreachable. Both direct curl and SSH to [ip] failed. The MLX server (:8100) and finetune-daemon (:9200) are offline. Until Mac5 is back online and stable, the entire training pipeline is blocked. This is a hard dependency.
2. Training data volume is insufficient. The compound estimates "~180-200 training examples after augmentation" from backfill + initial live data. But the backfill data is Codex trajectories (see Step 1 audit). Codex's `exec_command` calls do not translate directly to Claude Code's `Read`, `Edit`, `Bash`, `Write`, `Grep`, `Glob` tool vocabulary. After filtering for Claude-native tool calls, backfill may yield fewer than 20 usable examples.
3. The 200ms MLX query timeout in the A/B treatment is tight. Mac5 is behind Tailscale. Typical Tailscale RTT between Mac1 and Mac5 is 5-15ms for local network, but can spike to 50-100ms. Add MLX inference time (50-100ms for a short completion), and the 200ms budget is barely sufficient for happy-path. Any contention on Mac5 (other Ollama models, LoRA training running simultaneously) would push this over budget.
4. Training "from scratch each time" at adapter v3+ is wasteful. The compound says "not fine-tuning on fine-tune to avoid catastrophic forgetting." But retraining from scratch means each run takes the same ~400s regardless of incremental data. A better approach: LoRA warm-start from the previous adapter with a lower learning rate for fine-tuning iterations after v2.
Revised assessment: Phase 4 (Training) is the riskiest phase. It depends on: (a) Mac5 being online, (b) sufficient Claude Code trajectory data (which does not exist yet), (c) network latency being acceptable. The A/B evaluation cannot begin until Phases 1-3 have run for at least 2-3 weeks.
---
STEP 6: Self-Play Loop -- HOLDS BUT EXPENSIVE ON COMPUTE
What holds:
- The 5-tier corpus source hierarchy is comprehensive.
- G=5 solve attempts with pass-rate filtering is standard KARL methodology.
- The freshness guard (mesh event bus subscription) prevents stale Q&A.
- Cost estimates ($1.44-$6.23/week) are reasonable.
What breaks under pressure:
1. Solver agents require headless Claude Code sessions. The compound says "spawned using the Pane Spawn Protocol." From MEMORY.md, the pane spawn protocol has a ~30
Better approach: Use `claude --dangerously-skip-permissions -p "prompt" --output-format stream-json` in headless mode via Bash, not pane spawning. This is more reliable for batch operations.
2. Self-play trajectories recorded via Taps A-D assume the solver uses the same hook infrastructure. But headless Claude Code sessions spawned with `--dangerously-skip-permissions` may not load hooks (depends on whether the spawned session reads `[home-path]`). If hooks do not load, the trajectory tap does not fire, and self-play produces no trajectory records.
Mitigation: The solver orchestrator must verify that the spawned session's output includes tool-use data (parse the stream-json output directly instead of relying on hooks).
3. The bootstrapping iteration (Phase 3, Week 6+) assumes the improved LoRA model is a better solver. But the LoRA model is a 1-3B parameter adapter on a small base model. It cannot solve questions that require Claude Opus 4.6-level reasoning. The bootstrapping gain will plateau quickly because the solver's reasoning ceiling is far below the question difficulty ceiling. KARL's paper bootstrapped from GLM 4.5 Air (a much larger model with real reasoning capability).
Revised assessment: Self-play is valuable for generating question-answer pairs and SKILL.md improvements (the "fastest path to value"), but its contribution to LoRA training will be limited by the base model's capabilities. Prioritize SKILL.md improvement over training data generation.
---
STEP 7: Feedback Closure -- HOLDS
What holds:
- The three learning loops (routing/minutes, training/weeks, evolution/months) are correctly identified.
- The interaction effects between loops are analyzed (Loop 1 feeds Loop 2 feeds Loop 3).
- The decay integration (quality-aware decay) is a genuine improvement over frequency-only decay.
What breaks under pressure:
1. Oscillation risk between Loop 1 and Loop 3. If L4 shifts agent routing (e.g., more Codex, less Claude), the trajectory distribution changes, which changes skill routing weights, which changes routing, which changes trajectory quality, which L4 observes. The compound acknowledges this but relies on bounded EMA and invariants to damp oscillation. In practice, three coupled feedback loops with shared state will produce transient oscillations during any regime change (new skill added, infrastructure outage, codebase refactor).
2. No circuit breaker at the system level. Individual loops have bounds (EMA alpha, L4-I2 max shift, etc.), but there is no global "something is wrong, freeze everything" mechanism. If the reward engine has a bug (e.g., scoring all trajectories at 0.0), Loop 1 would drag all weights down, Loop 2 would train on garbage advantages, and Loop 3 would panic-redistribute agents. A global health check (`karl_health.py`) should monitor aggregate metrics and freeze all loops if anomalies are detected.
---
STEP 8: Unified Architecture -- HOLDS
What holds:
- The component registry (14 files, ~2,620 lines) is feasible.
- The Prefect flow registry (6 flows) is manageable.
- The Supabase table consolidation (3 new tables instead of 5) is correct.
- The Nexus Portal pages (/karl, /karl/skills) add value.
What breaks under pressure:
1. The timeline (6 weeks to full compound) is optimistic given the Step 1 data gap. The compound assumes live trajectory recording starts on Day 1 and accumulates 10-40 records/day. But if Claude Code tool calls are not captured (see Step 1 audit), we first need to fix the response_hook's tool extraction for Claude Code sessions. This is an unbudgeted investigation + fix that could take 2-5 days.
2. 14 new files across 4 different directories (`[home-path]`, `[home-path]`, `[home-path]`, `flows/feed-hub/`) increase the surface area for maintenance. The existing cortex has 17 files. Adding 14 files (82
---
What's Missing (Gaps the Compound Didn't Address)
1. Claude Code Tool Capture Gap. The most critical gap. The response_hook needs to extract tool use data from Claude Code sessions. Currently, only Codex sessions populate `assistant_turns[].tool_calls`. Either the Stop event payload from Claude Code includes different data, or the response_hook's parsing logic has a code path that does not fire for Claude Code sessions. This must be diagnosed and fixed BEFORE any trajectory recording can work.
2. Global Health Monitor. No mechanism to detect when the KARL system itself is malfunctioning. Need: `karl_health.py` that checks trajectory ingest rate, reward distribution sanity, embedding freshness, and Mac5 availability. Alerts via Discord webhook.
3. Rollback Plan. What happens if KARL routing is worse than regex routing? The compound describes shadow mode and A/B testing, but no explicit "revert to regex" playbook. Need: a feature flag (`KARL_ROUTING_ENABLED=0/1`) that can instantly switch back to regex routing without code changes.
4. Cross-Machine Trajectory Sync. The compound mentions Supabase sync for trajectories but does not specify the sync mechanism. Mac2 and Mac3 also run Claude Code sessions. Their trajectories should flow into the same store. Need: either a shared Supabase write path from all machines, or a Syncthing-based file sync for `[home-path]`.
5. Token Budget Tracking. Embedding via Gemini API, LLM judge via Claude Haiku, self-play solver via Claude Code -- these all cost tokens. No budget tracking or cost alerting exists. Need: per-component token counters with weekly summaries.
6. Skill Retirement Path. The compound evolves SKILL.md content based on trajectories but does not address when a skill should be RETIRED (not just decayed). If a skill's trajectory weight drops to 0.5 (minimum) and stays there for 30 days, it should be archived and its domain coverage absorbed by a neighbor skill.
---
PART B: EXPAND -- Deep-Dive on Critical Subsystems
B1. Claude Code Tool Capture Fix (Unbudgeted Prerequisite)
The entire KARL system depends on Claude Code tool use data flowing into the trajectory store. Currently, this pipeline is broken.
Root Cause Investigation Plan:
The response_hook at `[home-path]` (1,283 lines) fires on the Stop event. The Stop event payload from Claude Code contains:
{
"type": "Stop",
"session_id": "...",
"stop_hook_result": {
"transcript": "...",
"model": "claude-opus-4-6",
"cost_usd": 0.12
}
}The `transcript` field is the raw conversation JSONL. The response_hook's `_parse_transcript()` function extracts tool calls from this transcript. For Claude Code, tool calls appear as content blocks with `type: "tool_use"` and `type: "tool_result"` in the conversation messages.
For Codex, tool calls appear as `exec_command` entries in a different format (OpenAI function calling format).
Hypothesis: The response_hook's transcript parser handles Codex's format (OpenAI function calling) but not Claude Code's format (Anthropic content blocks). The `tool_calls` extraction path has a conditional that only fires for Codex-formatted transcripts.
Fix specification:
- File: `[home-path]`
- New parser: Extract `tool_use` and `tool_result` blocks from Claude conversation messages
- Normalize to the same `VerboseToolCall` schema used for Codex
- Map Claude tool names directly (Read, Edit, Bash, Write, Grep, Glob are already the canonical names)
- Test: After fix, the next Claude Code session's Stop event should produce a verbose-all.jsonl entry with populated `assistant_turns[].tool_calls`
Estimated effort: 2-3 days (1 day investigation, 1 day implementation, 0.5 day testing)
Lines of code: ~80-120 new lines in response_hook.py
Risk: Medium. The Stop event payload format may have changed between Claude Code versions. Need to capture a raw payload sample first.
---
B2. Trajectory Tap -- Full Implementation Spec
The 4 tap points from the compound, expanded with exact code locations and error handling:
Tap A: Session Buffer Init (ops_trigger.py)
# Insert after line 226 in ops_trigger.py (after invocation_record write)
# Budget: 5ms max
try:
from karl.trajectory_tap import init_session_buffer
init_session_buffer(
session_id=session_id,
prompt_id=prompt_id,
skill_name=matched_skill_name,
domain=domain,
prompt_text=prompt_text[:500], # Truncate to prevent bloat
cwd=cwd,
machine=_get_machine_id(), # "mac1", "mac2", etc.
pane=os.environ.get("TTY", "unknown"),
)
except Exception:
pass # Never fail the hookTap B: Tool Event Append (post_tool_hook.py)
# Insert after line 244 in post_tool_hook.py (after cadence signaling)
# Budget: 8ms max
try:
from karl.trajectory_tap import append_tool_event
tool_input = hook_input.get("tool_input", {})
tool_result = hook_input.get("tool_result", {})
append_tool_event(
session_id=session_id,
tool_name=tool_input.get("tool_name", "unknown"),
parameters={
k: v for k, v in (tool_input.get("input", {}) or {}).items()
if k in ("file_path", "command", "pattern", "path", "url") # Only safe params
},
result_success=not bool(tool_result.get("error")),
exit_code=tool_result.get("exit_code"),
duration_ms=tool_result.get("duration_ms"),
file_path=tool_input.get("input", {}).get("file_path"),
)
except Exception:
passTap C: Session Flush (response_hook.py or session_end_hook.py)
# Insert in response_hook.py Stop handler, after verbose record write
# OR in session_end_hook.py (cleaner separation)
# Budget: 15ms max
try:
from karl.trajectory_tap import flush_session
record = flush_session(
session_id=session_id,
response_length=len(response_text),
token_usage=token_usage,
files_read=files_read_list,
files_modified=files_modified_list,
errors=errors_list,
)
if record:
# Compute immediate process reward (Step 2)
from karl.reward_engine import compute_process_reward
record["outcome"] = compute_process_reward(record)
# Append to store
from karl.trajectory_tap import append_to_store
append_to_store(record)
except Exception:
passTap D: Cross-Turn Annotation (ops_trigger.py, beginning)
# Insert at the beginning of ops_trigger.py main(), before skill matching
# Budget: 10ms max
try:
from karl.trajectory_tap import annotate_previous
annotate_previous(
session_id=session_id,
prompt_text=prompt_text,
correction_detected=_check_correction_patterns(prompt_text),
redo_detected=_check_redo_patterns(prompt_text),
)
except Exception:
passtrajectory_tap.py -- Full Module Spec:
# [home-path] (~220 lines)
import json
import os
import time
from pathlib import Path
from typing import Optional, Dict, Any, List
KARL_DIR = Path.home() / ".claude" / "karl"
STORE_PATH = KARL_DIR / "trajectories.jsonl"
BUFFER_DIR = KARL_DIR / "buffers" # Per-session buffers, cleaned on flush
MAX_BUFFER_TOOLS = 200 # Cap tool events per session to prevent runaway
SELF_DISABLE_COUNT = 3 # Disable after N consecutive timeouts
SELF_DISABLE_THRESHOLD_MS = 50 # A single tap exceeding this counts
_disable_counter = 0
def _is_disabled() -> bool:
return os.environ.get("KARL_TAP_DISABLED") == "1"
def _check_timing(start_ns: int) -> None:
global _disable_counter
elapsed_ms = (time.monotonic_ns() - start_ns) / 1_000_000
if elapsed_ms > SELF_DISABLE_THRESHOLD_MS:
_disable_counter += 1
if _disable_counter >= SELF_DISABLE_COUNT:
os.environ["KARL_TAP_DISABLED"] = "1"
else:
_disable_counter = max(0, _disable_counter - 1)
def init_session_buffer(session_id: str, prompt_id: str, skill_name: str,
domain: str, prompt_text: str, cwd: str,
machine: str, pane: str) -> None:
if _is_disabled(): return
start = time.monotonic_ns()
BUFFER_DIR.mkdir(parents=True, exist_ok=True)
buf_path = BUFFER_DIR / f"{session_id}.json"
buffer = {
"session_id": session_id,
"prompt_id": prompt_id,
"skill": {"name": skill_name, "injected": True, "domain": domain},
"prompt": {
"text_excerpt": prompt_text[:200],
"text_length": len(prompt_text),
},
"cwd": cwd,
"machine": machine,
"pane": pane,
"tool_events": [],
"started_at": time.time(),
}
buf_path.write_text(json.dumps(buffer))
_check_timing(start)
def append_tool_event(session_id: str, tool_name: str, parameters: dict,
result_success: bool, exit_code: Optional[int],
duration_ms: Optional[int], file_path: Optional[str]) -> None:
if _is_disabled(): return
start = time.monotonic_ns()
buf_path = BUFFER_DIR / f"{session_id}.json"
if not buf_path.exists():
# Session not tracked (no skill was injected) -- create minimal buffer
init_session_buffer(session_id, "", "", "unknown", "", "", "", "")
try:
buffer = json.loads(buf_path.read_text())
except (json.JSONDecodeError, FileNotFoundError):
return
if len(buffer["tool_events"]) >= MAX_BUFFER_TOOLS:
return # Cap reached
buffer["tool_events"].append({
"tool": tool_name,
"params": parameters,
"success": result_success,
"exit_code": exit_code,
"duration_ms": duration_ms,
"file": file_path,
"t": time.time(),
})
buf_path.write_text(json.dumps(buffer))
_check_timing(start)
def flush_session(session_id: str, response_length: int,
token_usage: dict, files_read: list,
files_modified: list, errors: list) -> Optional[dict]:
if _is_disabled(): return None
start = time.monotonic_ns()
buf_path = BUFFER_DIR / f"{session_id}.json"
if not buf_path.exists():
return None
try:
buffer = json.loads(buf_path.read_text())
except (json.JSONDecodeError, FileNotFoundError):
return None
events = buffer["tool_events"]
tool_sequence = [e["tool"] for e in events]
tool_counts = {}
for t in tool_sequence:
tool_counts[t] = tool_counts.get(t, 0) + 1
bash_fails = sum(1 for e in events if e["tool"] == "Bash" and not e["success"])
error_count = len(errors) + sum(1 for e in events if not e["success"])
duration_ms = int((time.time() - buffer["started_at"]) * 1000)
record = {
"id": f"karl-{session_id[:12]}",
"schema_version": 1,
"channel": "live",
"recorded_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
"session_id": session_id,
"prompt_id": buffer["prompt_id"],
"machine": buffer["machine"],
"pane": buffer["pane"],
"cwd": buffer["cwd"],
"skill": buffer["skill"],
"prompt": buffer["prompt"],
"trajectory": {
"tool_sequence": tool_sequence,
"tool_counts": tool_counts,
"total_tools": len(tool_sequence),
"duration_ms": duration_ms,
"files_read": files_read[:50], # Cap list sizes
"files_modified": files_modified[:50],
"bash_commands": [e["params"].get("command", "")[:200] for e in events
if e["tool"] == "Bash"][:20],
"bash_exit_codes": [e["exit_code"] for e in events
if e["tool"] == "Bash" and e["exit_code"] is not None],
"error_count": error_count,
"bash_fail_count": bash_fails,
"token_usage": token_usage,
},
"outcome": {
"annotation_status": "pending",
"score": None,
"reward": None,
"advantage": None,
"signals": {},
},
"karl_meta": None,
}
# Clean up buffer
buf_path.unlink(missing_ok=True)
_check_timing(start)
return record
def append_to_store(record: dict) -> None:
KARL_DIR.mkdir(parents=True, exist_ok=True)
with open(STORE_PATH, "a") as f:
f.write(json.dumps(record) + "\n")
def annotate_previous(session_id: str, prompt_text: str,
correction_detected: bool, redo_detected: bool) -> None:
if _is_disabled(): return
# Read last record from store, update if same session
try:
lines = STORE_PATH.read_text().strip().split("\n")
if not lines: return
last = json.loads(lines[-1])
if last["session_id"] != session_id: return
if last["outcome"]["annotation_status"] != "pending": return
last["outcome"]["signals"]["correction_detected"] = correction_detected
last["outcome"]["signals"]["redo_detected"] = redo_detected
last["outcome"]["signals"]["session_continued"] = True
last["outcome"]["annotation_status"] = "annotated"
# Rewrite last line (atomic via temp file)
lines[-1] = json.dumps(last)
tmp = STORE_PATH.with_suffix(".tmp")
tmp.write_text("\n".join(lines) + "\n")
tmp.replace(STORE_PATH)
except Exception:
pass # Never failSize estimate: 220 lines
File count: 1 new file + 4 modified files (8-12 lines each)
---
B3. Embedding Latency Solution -- Async Shadow Router
The hook budget of 500ms makes synchronous Gemini API calls risky. The solution: compute embeddings asynchronously and use them for the NEXT prompt, not the current one.
Architecture:
UserPromptSubmit hook (500ms budget):
1. Check local cache for prompt embedding (< 1ms)
2. If cached: compute weighted similarity, select skill (< 5ms total)
3. If NOT cached: fall back to regex routing (< 2ms)
4. Fire background thread to embed this prompt via RAG++ (no budget)
5. Store embedding in local LRU cache for future lookups
Stop hook:
1. weight_updater.py updates trajectory weights (< 50ms)
Background (no hook budget):
1. Embedding completed, stored in [home-path]
2. Next UserPromptSubmit for a similar prompt hits cacheThis design means the FIRST prompt in a topic misses vector routing (falls back to regex). The SECOND+ prompt benefits from the cached embedding. Since users often issue multiple prompts on the same topic, the cache hit rate after warmup should be 60-80
Cache spec:
- Format: LRU dict, max 500 entries
- Key: SHA256(prompt_text[:200] + cwd_basename)
- Value: 768-dim float32 array (3KB per entry)
- Max size: 500 * 3KB = 1.5MB
- Persistence: pickle file, loaded on hook startup (~50ms), saved async on write
- TTL: 24 hours per entry (prompt relevance decays)
Embedding call path:
import threading
def _embed_async(prompt_text: str, cwd: str, cache_key: str):
try:
import requests
resp = requests.post("http://localhost:8000/api/rag/gateway/context", json={
"query": f"[project:{os.path.basename(cwd)}] {prompt_text[:200]}",
"cwd": cwd,
"k_rag": 0, # We only need the embedding, not results
"include_graph": False,
}, timeout=5)
if resp.ok:
# Extract embedding from response (need to add endpoint to RAG++)
embedding = resp.json().get("query_embedding")
if embedding:
_cache_store(cache_key, embedding)
except Exception:
pass # Silent failure, regex fallback continues to work
def route_prompt_v2(prompt: str, cwd: str) -> Optional[str]:
cache_key = _cache_key(prompt, cwd)
cached = _cache_get(cache_key)
if cached is not None:
# Vector routing path
skill_embeddings = _load_skill_embeddings()
scores = {name: _cosine(cached, vec) * weight
for name, (vec, weight) in skill_embeddings.items()}
best = max(scores.items(), key=lambda x: x[1])
if best[1] > 0.35:
return best[0]
# Regex fallback (always available)
regex_match = _regex_route(prompt)
# Fire async embedding for next time
threading.Thread(target=_embed_async, args=(prompt, cwd, cache_key), daemon=True).start()
return regex_matchRAG++ modification needed: The gateway currently returns `related_turns` and `graph_context` but does NOT return the raw query embedding vector. Need to add `"query_embedding"` to the response payload. This is a ~5-line change in the RAG++ gateway code.
---
B4. Reward Engine -- Numbers and Edge Cases
Computation costs:
| Component | Computation | Latency | Per-day (40 prompts) |
|---|---|---|---|
| Process reward (Tap C) | 4 weighted sums | < 1ms | 40ms total |
| Outcome annotation (Tap D) | File read + pattern match + rewrite | < 10ms | 400ms total |
| Advantage batch (daily Prefect) | Read all records, group by domain, compute means | < 2s for 1000 records | 2s once |
| Skill metrics aggregation (30-min Prefect) | Read records, compute per-skill stats | < 1s for 1000 records | 48s/day |
Edge cases and handling:
| Edge Case | Frequency | Handling |
|---|---|---|
| Session with 0 tool calls | ~20 | |
| Session with 100+ tool calls | ~5 | |
| Correction detected in SAME turn | ~8 | |
| Multiple skills injected (multi-skill composition) | ~3 | |
| Session ends without Stop event (crash, timeout) | ~2 | |
| Reward distribution skew (all records at 0.7-0.8) | Likely in first weeks | Add normalization: z-score within domain bucket, clip to [-2, 2] |
Expected data volume projections:
| Week | Live Records | Backfill | Self-Play | Total | Training-Ready (after quality gate) |
|---|---|---|---|---|---|
| Week 1 | 70-140 | 20-40 (Codex only) | 0 | 90-180 | 30-60 |
| Week 2 | 140-280 | 0 (one-time) | 0 | 140-280 | 50-100 |
| Week 3 | 210-420 | 0 | 100-150 (seed run) | 310-570 | 100-200 |
| Week 4 | 280-560 | 0 | 200-350 | 480-910 | 160-300 |
| Week 6 | 420-840 | 0 | 400-700 | 820-1540 | 270-500 |
Assumptions: 20-40 prompts/day on Mac1 with skill injection, 40-60
---
B5. Self-Play Implementation Detail
Solver approach (revised from pane spawning to headless CLI):
# Headless solver invocation (no pane spawn needed)
timeout 120 claude --dangerously-skip-permissions \
-p "$(cat question.txt)" \
--output-format stream-json \
--max-tokens 4096 \
2>/dev/null | python3 parse_solver_output.pyparse_solver_output.py extracts:
- Final text answer
- Tool calls (from stream-json events with `type: "tool_use"`)
- Total tokens consumed
- Duration
Question bank format:
{
"id": "qb-a7f3c2d1",
"question": "What port does the Graph Kernel run on, and where is it hosted?",
"type": "lookup",
"tier": "T1",
"source_file": "[home-path]
"source_line": 42,
"reference_answer": "Port 8001, running natively on cloud-vm (not Docker). Accessed from Mac1 via SSH tunnel.",
"created_at": "2026-03-15T03:00:00Z",
"status": "active",
"freshness_hash": "sha256:abc123...",
"attempts": [],
"pass_rate": null,
"best_trajectory_id": null
}Question generation prompt template:
Given this documentation excerpt:
---
{excerpt}
---
Generate {N} questions of type "{type}" that:
1. Can be answered using ONLY the tools: Read, Grep, Glob, Bash, WebSearch
2. Require reading 1-3 files to answer (not trivial, not impossibly deep)
3. Have a single clear correct answer verifiable from the source material
4. Are representative of what a human operator would ask about this system
For each question, provide:
- question: the question text
- reference_answer: the correct answer with source citation
- expected_tools: which tools would be needed
- difficulty: easy/medium/hardCost model (detailed):
| Component | Unit Cost | Units/Week (Conservative) | Weekly Cost |
|---|---|---|---|
| Question generation (Claude Haiku) | $0.001/question | 50 questions | $0.05 | ||
| Solver attempts (Claude Sonnet) | $0.02/attempt | 250 attempts (50 x G=5) | $5.00 | ||
| LLM judge (Claude Haiku) | $0.001/judgment | 250 judgments | $0.25 | ||
| Freshness re-solve (Claude Sonnet) | $0.02/attempt | 20 re-solves | $0.40 | ||
| Weekly total | $5.70 | ||
| Monthly total | $22.80 |
Note: Stage 2 estimated $1.44-$6.23/week. The revised estimate uses Claude Sonnet for solving (more realistic than Haiku for multi-tool tasks), which pushes the conservative estimate to $5.70/week. Still under $25/month.
---
B6. Dependency Map
DEPENDENCY GRAPH
================
[B1: Tool Capture Fix]
|
v
[1.1: [home-path] package]
|
v
[1.2: trajectory_tap.py] -------> [1.3: Wire Taps A-D]
| |
v v
[1.4: trajectory_extractor.py] [1.5: 24h validation]
| |
+-----------------------------------+
|
v
[2.1: reward_engine.py] ---------> [2.2: Backfill rewards]
| |
v v
[2.3: karl_advantage_batch] [2.4: Metrics aggregator]
| |
+-----------------------------------+
|
v
[3.1: RAG++ embedding endpoint] --> [3.2: embedding_cache.py]
| |
v v
[3.3: ops_trigger_v2.py] [3.4: Shadow mode deploy]
| |
v v
[3.5: weight_updater.py] [3.6: skill_embedding_refresh]
|
+-------> [3.7: Shadow mode analysis (2 weeks)]
|
v
[3.8: Threshold calibration + cutover decision]
|
+---------------+---------------+
| |
v v
[4.1: question_generator.py] [5.1: sft_formatter_trajectory.py]
| |
v v
[4.2: solver (headless CLI)] [5.2: numu-weave extension]
| |
v v
[4.3: quality_filter.py] [5.3: Mac5 readiness check]
| |
v v
[4.4: SKILL.md updater] [5.4: LoRA v2 training]
| |
v v
[4.5: karl_daily_batch flow] [5.5: A/B evaluation deploy]
|
v
[6.1: l4_controller.py]
|
v
[6.2: pulse_bridge integration]
|
v
[6.3: EW daemon wiring]
|
v
[6.4: L4 invariants]
|
v
[7.1: Nexus /karl page]
[7.2: Nexus /karl/skills page]
[7.3: Grafana dashboard]
[7.4: karl_health.py]
[7.5: karl_drift_check flow]---
PART C: MASTER CHECKLIST
PHASE 0: Prerequisite Fix (Days 1-3)
This phase was NOT in the Stage 2 compound. It emerged from the audit. Without it, Phases 1-6 produce no Claude Code trajectory data.
- [ ] 0.1 Diagnose Claude Code tool capture in response_hook.py
- Owner: claw
- automate: false (requires investigation and debugging)
- Input: `[home-path]` (1,283 lines), a raw Stop event payload from a Claude Code session
- Output: Root cause document explaining why Claude Code sessions produce empty `tool_calls` arrays, with proposed fix
- Validation: Capture a Stop event payload from a live Claude Code session. Identify the code path in response_hook.py that should extract tool calls. Confirm whether the data is missing from the payload or whether the parser skips it.
- Depends on: nothing
- Status: Not Started
- [ ] 0.2 Fix Claude Code tool extraction in response_hook.py
- Owner: claw
- automate: true
- Input: Root cause from 0.1, response_hook.py source
- Output: Modified `response_hook.py` that populates `assistant_turns[].tool_calls` for Claude Code sessions using the Anthropic content block format (`tool_use` / `tool_result`)
- Validation: Run 5 Claude Code prompts that involve Read + Bash + Edit. Check `verbose-all.jsonl` -- the last 5 entries must have non-empty `tool_calls` arrays with correct tool names (Read, Bash, Edit).
- Depends on: 0.1
- Status: Not Started
- [ ] 0.3 Create `[home-path]` package directory structure
- Owner: claw
- automate: true
- Input: nothing
- Output: `[home-path]`, `[home-path]` directory
- Validation: `python3 -c "from karl import trajectory_tap"` does not error (after adding to PYTHONPATH)
- Depends on: nothing
- Status: Not Started
---
PHASE 1: Data Foundation (Days 3-7)
- [ ] 1.1 Implement `trajectory_tap.py`
- Owner: claw
- automate: true
- Input: Phase 0 complete (tool capture working), `[home-path]` package exists
- Output: `[home-path]` (~220 lines) with functions: `init_session_buffer()`, `append_tool_event()`, `flush_session()`, `append_to_store()`, `annotate_previous()`
- Validation: Unit test: call init_session_buffer, append 5 tool events, flush_session. Verify `trajectories.jsonl` contains 1 record with correct schema. Verify buffer file is cleaned up.
- Depends on: 0.3
- Status: Not Started
- [ ] 1.2 Wire Tap A + Tap D into ops_trigger.py
- Owner: claw
- automate: true
- Input: trajectory_tap.py (1.1), `[home-path]` (232 lines)
- Output: Modified `ops_trigger.py` with +8 lines: Tap A after line 226, Tap D at beginning of main()
- Validation: Trigger a skill injection by entering a prompt matching an active skill's regex. Check that `[home-path]` exists with correct skill name.
- Depends on: 1.1
- Status: Not Started
- [ ] 1.3 Wire Tap B into post_tool_hook.py
- Owner: claw
- automate: true
- Input: trajectory_tap.py (1.1), `[home-path]` (287 lines)
- Output: Modified `post_tool_hook.py` with +5 lines after line 244
- Validation: Run a prompt that uses Read + Bash. Check the session buffer JSON has 2+ tool_events with correct tool names.
- Depends on: 1.1
- Status: Not Started
- [ ] 1.4 Wire Tap C into response_hook.py or session_end_hook.py
- Owner: claw
- automate: true
- Input: trajectory_tap.py (1.1), response_hook.py or session_end_hook.py
- Output: Modified hook file with +7 lines that calls `flush_session()` and `append_to_store()` on Stop event
- Validation: Complete a session. Check `[home-path]` has a new record with populated `trajectory.tool_sequence`, `trajectory.tool_counts`, and `outcome.annotation_status == "pending"`.
- Depends on: 1.1, 0.2
- Status: Not Started
- [ ] 1.5 Implement `trajectory_extractor.py` for historical backfill
- Owner: claw
- automate: true
- Input: `[home-path]` (3,258 entries, 157 with tool data)
- Output: `[home-path]` (~180 lines). Extracts Codex trajectories, normalizes tool names (`exec_command` -> `Bash`, `apply_patch` -> `Edit`, `read_file` -> `Read`), writes to `trajectories.jsonl` with `channel: "backfill"`.
- Validation: Run extractor. Verify `trajectories.jsonl` gains 100-157 backfill records. Spot-check 10 records: tool names must be normalized to Claude Code vocabulary.
- Depends on: 0.3
- Status: Not Started
- [ ] 1.6 48-hour live validation
- Owner: mohamed (monitoring)
- automate: false
- Input: All taps wired (1.2, 1.3, 1.4)
- Output: Validation report: number of records in `trajectories.jsonl`, breakdown by channel (live vs backfill), tool distribution, any tap failures logged
- Validation: At least 20 live trajectory records with non-empty tool sequences. Zero hook crashes (check `[home-path]`). No performance degradation reports from user.
- Depends on: 1.2, 1.3, 1.4, 1.5
- Status: Not Started
---
PHASE 2: Reward Engine (Days 7-10)
- [ ] 2.1 Implement `reward_engine.py`
- Owner: claw
- automate: true
- Input: Trajectory schema from Phase 1
- Output: `[home-path]` (~250 lines) with functions: `compute_process_reward()`, `compute_outcome_score()`, `compute_full_reward()`
- Validation: Unit test with 5 synthetic trajectory records covering: clean bash-only trajectory (expected reward > 0.7), high-error trajectory (expected reward < 0.4), read-only trajectory (expected reward ~0.85), correction-detected trajectory (expected outcome score < 0). All rewards must be in [0.0, 1.0], all outcome scores in [-1.0, 1.0].
- Depends on: 1.1 (schema)
- Status: Not Started
- [ ] 2.2 Backfill rewards for existing trajectory records
- Owner: claw
- automate: true
- Input: `trajectories.jsonl` with backfill records (from 1.5), `reward_engine.py` (2.1)
- Output: All backfill records in `trajectories.jsonl` have `outcome.reward` populated (annotation_status = "partial" since no cross-turn signals for historical data)
- Validation: `python3 -c "import json; lines=open('[home-path]).readlines(); rewarded=[l for l in lines if json.loads(l)['outcome']['reward'] is not None]; print(f'{len(rewarded)}/{len(lines)} rewarded')"` shows 100
- Depends on: 1.5, 2.1
- Status: Not Started
- [ ] 2.3 Deploy `karl_advantage_batch` Prefect flow
- Owner: claw
- automate: true
- Input: `reward_engine.py` (2.1), Prefect on cloud-vm (:4200)
- Output: `flows/feed-hub/karl_advantage_batch.py` (~120 lines). Daily at 02:00 UTC. Reads all trajectories, groups by skill domain, computes V_baseline = mean(reward) per domain, patches `outcome.advantage = (reward - V_baseline) / 0.05` clipped to [-2.0, 2.0].
- Validation: Deploy to Prefect. Manual trigger. Check that trajectories.jsonl records have `outcome.advantage` populated. Verify advantage distribution has mean near 0.0 (by construction).
- Depends on: 2.1, Phase 1 complete
- Status: Not Started
- [ ] 2.4 Implement `metrics_aggregator.py` + deploy Prefect flow
- Owner: claw
- automate: true
- Input: `trajectories.jsonl` with rewards
- Output: `[home-path]` (~150 lines) + `flows/feed-hub/karl_metrics.py` (~60 lines). Runs every 30 minutes. Computes per-skill: success_rate, mean_reward, trend (7-day MA), invocation_count, lift_over_baseline. Writes to `[home-path]`.
- Validation: After one day of data, `skill_metrics.json` has entries for each skill that recorded a trajectory. Values are plausible (success_rate in [0,1], invocation_count > 0).
- Depends on: 2.1
- Status: Not Started
- [ ] 2.5 Add reward-aware quality gate to decay detector
- Owner: claw
- automate: true
- Input: `skill_metrics.json` (2.4), `[home-path]` (251 lines)
- Output: Modified `detector.py` with +15 lines: load `skill_metrics.json`, check if any skill has `success_rate < 0.2` AND `invocations >= 20`, flag as `"disable"` (actively harmful, not just stale).
- Validation: Create a synthetic `skill_metrics.json` entry with success_rate=0.1, invocations=25. Run decay detector. Verify it flags that skill for disabling.
- Depends on: 2.4
- Status: Not Started
---
PHASE 3: Routing Upgrade (Days 10-24)
- [ ] 3.1 Add `query_embedding` to RAG++ gateway response
- Owner: claw
- automate: true
- Input: RAG++ gateway source (Docker on cloud-vm)
- Output: Modified RAG++ gateway that includes `"query_embedding": [768 floats]` in the `/api/rag/gateway/context` response body
- Validation: `curl -s -X POST http://localhost:8000/api/rag/gateway/context -H 'Content-Type: application/json' -d '{"query":"test","cwd":"/tmp","k_rag":0}' | python3 -c "import json,sys; d=json.load(sys.stdin); print(len(d.get('query_embedding',[])))"` prints `768`
- Depends on: nothing
- Status: Not Started
- [ ] 3.2 Implement `embedding_cache.py`
- Owner: claw
- automate: true
- Input: RAG++ embedding endpoint (3.1)
- Output: `[home-path]` (~100 lines). LRU dict cache (max 500 entries, 24h TTL), pickle persistence at `[home-path]`. Functions: `_cache_key()`, `_cache_get()`, `_cache_store()`, `_embed_async()`, `load_skill_embeddings()`.
- Validation: Call `_embed_async("deploy flows", "[home]")`. Wait 3s. Call `_cache_get()` with same key -- returns 768-dim array. Call again with different text -- returns None.
- Depends on: 3.1
- Status: Not Started
- [ ] 3.3 Bootstrap initial skill embeddings
- Owner: claw
- automate: true
- Input: 13 active SKILL.md files, RAG++ embedding endpoint (3.1)
- Output: `skill_embeddings` table in Supabase with 13 rows (one per active skill). Each row: skill_name, embedding (768-dim), embedding_text, trajectory_weight=1.0. Local cache at `[home-path]`.
- Validation: `python3 -c "import pickle; d=pickle.load(open('$HOME/.claude/cortex/skill_embeddings.pkl','rb')); print(len(d))"` prints `13`.
- Depends on: 3.1
- Status: Not Started
- [ ] 3.4 Implement `ops_trigger_v2.py` (shadow mode)
- Owner: claw
- automate: true
- Input: `embedding_cache.py` (3.2), skill embeddings (3.3)
- Output: `[home-path]` (~280 lines). Runs alongside existing ops_trigger.py. On UserPromptSubmit: compute vector similarity if cache hit, otherwise async embed + regex fallback. Log both selections (regex and vector) to `[home-path]`. Do NOT inject based on vector routing -- only log.
- Validation: Run 10 prompts. Check `routing_shadow.jsonl` has 10 entries, each with `regex_selection` and `vector_selection` (or `"miss"` if no cache hit). No hook failures.
- Depends on: 3.2, 3.3
- Status: Not Started
- [ ] 3.5 Deploy shadow mode routing hook
- Owner: claw
- automate: true
- Input: ops_trigger_v2.py (3.4)
- Output: New hook registration in `[home-path]` that runs ops_trigger_v2.py on UserPromptSubmit alongside the existing ops_trigger.py
- Validation: Both hooks fire on the same prompt. Neither crashes. `routing_shadow.jsonl` accumulates entries.
- Depends on: 3.4
- Status: Not Started
- [ ] 3.6 Implement `weight_updater.py`
- Owner: claw
- automate: true
- Input: `reward_engine.py` (2.1), `skill_embeddings` table
- Output: `[home-path]` (~120 lines). Runs on Stop event. Reads the trajectory record just written by Tap C. Computes outcome score. Updates `trajectory_weight` for the matched skill via EMA (alpha=0.1, bounds [0.5, 1.5]). Updates both Supabase table and local pickle cache.
- Validation: Manually set a skill's trajectory_weight to 1.0. Simulate 5 successful trajectories (reward=0.8) and 5 failed trajectories (reward=0.2). After 10 updates: successful skill's weight should be > 1.0, failed skill's weight should be < 1.0.
- Depends on: 2.1, 3.3
- Status: Not Started
- [ ] 3.7 Deploy `skill_embedding_refresh` Prefect flow
- Owner: claw
- automate: true
- Input: Skill embeddings in Supabase (3.3), SKILL.md files
- Output: `flows/feed-hub/skill_embedding_refresh.py` (~100 lines). Daily at 05:00 UTC. Re-embeds skills whose SKILL.md content changed (mtime check) or whose trajectory_weight drifted > 0.15 from 1.0.
- Validation: Modify a SKILL.md file. Run the flow manually. Verify the skill's embedding in Supabase was updated (check `updated_at` timestamp).
- Depends on: 3.3
- Status: Not Started
- [ ] 3.8 2-week shadow mode analysis
- Owner: mohamed (analysis decision)
- automate: false
- Input: 2 weeks of `routing_shadow.jsonl` data + trajectory reward data from Phase 2
- Output: Analysis report: agreement rate (regex vs vector), lift on disagreements (which selection leads to higher outcome scores), coverage improvement (prompts matched by vector but not regex). Recommended threshold value. Go/no-go decision for live cutover.
- Validation: At least 200 routing comparisons logged. Report includes statistical significance test (chi-squared or Fisher's exact on success rates).
- Depends on: 3.5, 2.4, 14 calendar days of data collection
- Status: Not Started
- [ ] 3.9 Live cutover (conditional on 3.8 approval)
- Owner: claw (after mohamed approves)
- automate: true
- Input: Positive shadow mode analysis (3.8), ops_trigger_v2.py
- Output: ops_trigger_v2.py replaces ops_trigger.py as the primary routing hook. Feature flag `KARL_ROUTING_ENABLED=1` in `[home-path]`. Regex routing preserved as fallback (fires when vector cache misses or flag is 0).
- Validation: Set `KARL_ROUTING_ENABLED=1`. Run 5 prompts. Verify vector routing is active (check injection source in trajectory records). Set `KARL_ROUTING_ENABLED=0`. Verify regex routing fires instead.
- Depends on: 3.8 (approved)
- Status: Not Started
---
PHASE 4: Training Pipeline (Days 24-35)
- [ ] 4.1 Verify Mac5 online and MLX server operational
- Owner: mohamed (hardware)
- automate: false
- Input: Mac5 at [ip]
- Output: Mac5 accessible via SSH, MLX server (:8100) responding to /health, finetune-daemon (:9200) responding to /health
- Validation: `ssh [ip] "curl -s http://localhost:8100/health && curl -s http://localhost:9200/health"` returns healthy responses for both
- Depends on: nothing (can start anytime)
- Status: Not Started
- [ ] 4.2 Implement `quality_filter.py`
- Owner: claw
- automate: true
- Input: `trajectories.jsonl` with rewards (Phase 2)
- Output: `[home-path]` (~200 lines). Two modes: `filter_live()` (threshold-based: discard reward < -0.3 or > 0.95, require at least 2 tool calls) and `filter_selfplay()` (adds pass-rate band [0.1, 0.9] + LLM judge with mean >= 3.5). Output: filtered JSONL at `[home-path]`.
- Validation: Feed 20 synthetic trajectories spanning reward range [-1.0, 1.0]. Verify live filter passes 60-80
- Depends on: 2.1
- Status: Not Started
- [ ] 4.3 Implement `sft_formatter_trajectory.py`
- Owner: claw
- automate: true
- Input: `training_ready.jsonl` (4.2)
- Output: `[home-path]` (~180 lines). Converts filtered trajectories to ChatML format with advantage-weighted repetition (A > 0.5 -> duplicate 2x, A < -0.5 -> frame as "what NOT to do"). Output: `[home-path]`.
- Validation: Feed 10 filtered trajectories with varying advantages. Output JSONL has 12-14 entries (some duplicated). Each entry has valid ChatML schema with `messages` array and `advantage_weight` field.
- Depends on: 4.2
- Status: Not Started
- [ ] 4.4 Extend numu-weave with trajectory source
- Owner: claw
- automate: true
- Input: `[home-path]` (270 lines)
- Output: Modified `index.ts` with: new `source: "trajectory"` type in CorpusEntry, `advantageWeight` field, `exportWeighted()` method (~40 new lines)
- Validation: TypeScript compiles without errors. Unit test: create 3 CorpusEntry objects with source="trajectory" and varying advantageWeight. Call `exportWeighted()`. Verify high-advantage entries appear 2x in output.
- Depends on: nothing
- Status: Not Started
- [ ] 4.5 First LoRA training run (adapter v2) on Mac5
- Owner: claw
- automate: true
- Input: `training_data.jsonl` (4.3), Mac5 online (4.1), numu-weave extended (4.4)
- Output: Adapter v2 at `[home-path]`. Training parameters: 1000 iterations, lr=5e-5, batch_size=1, num_layers=4, max_seq_length=256. Training log with final loss.
- Validation: Training completes without OOM or crash. Final loss < 1.5 (vs v1 baseline 1.694). Adapter files exist on Mac5. MLX server can load and serve the fused model.
- Depends on: 4.1, 4.3, 4.4
- Status: Not Started
- [ ] 4.6 Deploy `karl_training_trigger` Prefect flow
- Owner: claw
- automate: true
- Input: Training pipeline (4.2, 4.3, 4.5)
- Output: `flows/feed-hub/karl_training_trigger.py` (~80 lines). Daily at 04:00 UTC. Counts new annotated trajectories since last training run. If >= 50 new, triggers Mac5 LoRA training via SSH + numu-weave.
- Validation: Set counter to 50+. Manual trigger. Verify it SSHs to Mac5 and initiates training (or logs "Mac5 unreachable" gracefully).
- Depends on: 4.5
- Status: Not Started
- [ ] 4.7 Deploy A/B evaluation framework
- Owner: claw
- automate: true
- Input: Fused model on Mac5 :8100 (4.5), ops_trigger_v2.py (3.4 or 3.9)
- Output: Modified ops_trigger_v2.py adds LoRA tool plan prepend for 50
- Validation: Run 10 prompts. 5 should have `[Learned Tool Plan]` prefix in injection (check trajectory records). 5 should not. No timeouts (Mac5 is responsive -- if not, all 10 fall back gracefully).
- Depends on: 3.9 (or 3.4 for shadow evaluation), 4.5
- Status: Not Started
---
PHASE 5: Evolution L4 (Days 28-38)
- [ ] 5.1 Implement `l4_controller.py`
- Owner: claw
- automate: true
- Input: Trajectory store with CALC dispatch markers, Evolution World files (`daemon.py` 965 lines, `engine.py` 400 lines, `invariants.py` 325 lines)
- Output: `[home-path]` (~400 lines). Classes: `ToolPreferenceGenome` (6 components: agent_weights, technique_agent_affinity, duration_priors, search_templates, cross_layer_thresholds, trajectory_discount). 5 mutation operators (M1-M5). 4 invariants (L4-I1 through L4-I4). Method: `should_fire(l1_step)` returns True every 90 steps. Method: `mutate(trajectory_window)`. Method: `apply_to_pulse_bridge(bridge)`.
- Validation: Unit test: create L4Controller, feed 20 synthetic trajectory records with varying agent + success. Call `mutate()`. Verify: agent_weights changed within L4-I2 bounds (max 0.30 shift), at least 2 agents have weight >= 0.10 (L4-I4), KL divergence > 0.005 (L4-I1).
- Depends on: 2.1 (reward engine), Phase 1 (trajectory store)
- Status: Not Started
- [ ] 5.2 Spike: pulse_bridge.py CALC dispatch marker integration
- Owner: claw
- automate: false (requires reading 981-line file and understanding dispatch flow)
- Input: `[home-path]` (981 lines)
- Output: Design doc: exactly where in pulse_bridge.py to inject the CALC session marker, what data the marker contains, and how L4 filters for it in trajectories.jsonl.
- Validation: Design doc reviewed. The injection point identified is reachable during normal EW dispatch. The marker format matches what l4_controller.py expects.
- Depends on: 5.1
- Status: Not Started
- [ ] 5.3 Wire L4 into Evolution World daemon and engine
- Owner: claw
- automate: true
- Input: l4_controller.py (5.1), pulse_bridge design (5.2), daemon.py (965 lines), engine.py (400 lines)
- Output: Modified files: `engine.py` (+15 lines: import L4Controller, call `l4.should_fire()` after L3 block), `daemon.py` (+20 lines: wire CALC completions to `l4.record_trajectory()`, call `l4.apply_to_pulse_bridge()` post-adaptation), `pulse_bridge.py` (+10 lines: CALC dispatch marker per 5.2 design), `l2_controller.py` (+5 lines: `_l4_strategy_locked` guard on `_mutate_strategy`), `invariants.py` (+30 lines: `check_l4_policy_divergence()`, `check_l4_agent_diversity()`).
- Validation: Start EW daemon. After 90 L1 steps, L4 fires. Check Supabase `ew_l4_steps` table has a new row with the L4 generation data. No invariant violations logged.
- Depends on: 5.1, 5.2
- Status: Not Started
- [ ] 5.4 Create `ew_l4_steps` Supabase table
- Owner: claw
- automate: true
- Input: L4 generation data schema from 5.1
- Output: New Supabase table: `ew_l4_steps(id uuid, generation int, fired_at timestamptz, agent_weights jsonb, affinity_matrix jsonb, fitness float, invariant_violations jsonb, trajectory_window_size int)`.
- Validation: `curl` POST to Supabase REST API creates a test row. Select returns it.
- Depends on: nothing
- Status: Not Started
---
PHASE 6: Self-Play (Days 30-42)
- [ ] 6.1 Implement `question_generator.py`
- Owner: claw
- automate: true
- Input: Corpus sources T1-T5 (memory files, SKILL.md files, flows, hooks, prompt logs)
- Output: `[home-path]` (~250 lines). Functions: `generate_questions(tier, count, question_type)`. Uses Claude Haiku for generation. Writes to `[home-path]`.
- Validation: Run `generate_questions("T1", 10, "lookup")`. Output 10 questions in `question_bank.jsonl` with correct schema (id, question, type, tier, source_file, reference_answer, status="active").
- Depends on: nothing
- Status: Not Started
- [ ] 6.2 Implement headless solver (parse_solver_output.py)
- Owner: claw
- automate: true
- Input: Question bank (6.1)
- Output: `[home-path]` (~100 lines). Parses Claude Code `--output-format stream-json` output. Extracts: final text answer, tool calls (from `tool_use` events), total tokens, duration. Plus `[home-path]` (~200 lines): takes a question, runs G solve attempts via headless CLI, records results.
- Validation: Run solver on 3 questions from question bank with G=2. Each attempt produces a solver output file with answer, tool calls, and metrics. Trajectories are recorded (either via hooks or by direct write to trajectories.jsonl with `channel: "self_play"`).
- Depends on: 6.1, Phase 1 (trajectory store)
- Status: Not Started
- [ ] 6.3 Phase 0 seed run: 200 questions, G=3
- Owner: claw
- automate: true
- Input: question_generator (6.1), solver (6.2), quality_filter (4.2)
- Output: 200 questions generated from T1+T2. 600 solve attempts. Filtered results in `training_ready.jsonl`. Expected: 100-150 questions pass quality gate.
- Validation: `question_bank.jsonl` has 200+ entries. `training_ready.jsonl` has 100+ entries. Pass rate distribution is not degenerate (not all 0
- Depends on: 6.1, 6.2, 4.2
- Status: Not Started
- [ ] 6.4 Implement `skill_updater.py` for SKILL.md trajectory-derived workflows
- Owner: claw
- automate: true
- Input: Filtered self-play trajectories (6.3), active SKILL.md files
- Output: `[home-path]` (~150 lines). For each skill with 5+ successful solver trajectories: extract modal tool sequence, compute pass_rate, format as "Verified Tool Sequence" section. Write proposed diff to `[home-path]`.
- Validation: After seed run, at least 3 skills have proposed diffs. Each diff replaces the generic 4-step workflow with a specific tool sequence with pass_rate annotation.
- Depends on: 6.3
- Status: Not Started
- [ ] 6.5 Deploy `karl_daily_batch` Prefect flow
- Owner: claw
- automate: true
- Input: question_generator (6.1), solver (6.2), quality_filter (4.2)
- Output: `flows/feed-hub/karl_daily_batch.py` (~120 lines). Daily at 03:00 UTC. Generates 50 questions (rotating tiers: T1 Mon/Fri, T2 Tue, T3 Wed, T4/T5 Thu), runs solver with G=5, filters, writes to training_ready.jsonl.
- Validation: Deploy to Prefect. Manual trigger. After run: question_bank has 50 new entries, training_ready has 15-30 new entries.
- Depends on: 6.1, 6.2, 4.2
- Status: Not Started
- [ ] 6.6 Deploy `karl_drift_check` weekly Prefect flow
- Owner: claw
- automate: true
- Input: Question bank (6.1), solver (6.2)
- Output: `flows/feed-hub/karl_drift_check.py` (~80 lines). Weekly Sunday 06:00 UTC. Re-solves 20 random questions from the bank. Compares answers (Jaccard similarity) against stored reference. Marks questions with similarity < 0.5 as `stale`.
- Validation: Manually modify a memory file that a question references. Run drift check. Verify the affected question is marked stale.
- Depends on: 6.1, 6.2
- Status: Not Started
---
PHASE 7: Nexus + Observability (Days 35-42)
- [ ] 7.1 Build Nexus Portal `/karl` dashboard page
- Owner: claw
- automate: true
- Input: `skill_metrics.json` (2.4), `trajectories.jsonl` stats, Dashboard API at :8421
- Output: New page at `monitoring/nexus-portal/src/app/karl/page.tsx`. Displays: ranked skills by success_rate with lift-over-baseline, trend sparklines (7-day MA), top tool sequences per skill, self-play funnel (generated -> filtered -> training), LoRA adapter version history.
- Validation: `curl http://localhost:3001/karl` renders the page. All data sections populated (or show "No data yet" placeholders).
- Depends on: 2.4
- Status: Not Started
- [ ] 7.2 Build Nexus Portal `/karl/skills` skill evolution viewer
- Owner: claw
- automate: true
- Input: `skill_proposals/` diffs (6.4), active SKILL.md files
- Output: New page at `monitoring/nexus-portal/src/app/karl/skills/page.tsx`. Displays: current vs proposed SKILL.md content (diff view), trajectory-derived workflow vs static template, approve/reject controls (writes decision to `[home-path]`).
- Validation: Page loads and shows at least one skill with a proposed diff. Approve action writes to decisions file.
- Depends on: 6.4, 7.1
- Status: Not Started
- [ ] 7.3 Implement `karl_health.py` global health monitor
- Owner: claw
- automate: true
- Input: trajectories.jsonl, skill_metrics.json, Mac5 health endpoint, Supabase
- Output: `[home-path]` (~120 lines). Checks: trajectory ingest rate (alert if < 5/day for 2 consecutive days), reward distribution (alert if std_dev < 0.05 -- all same value), embedding freshness (alert if oldest embedding > 7 days), Mac5 availability (alert if unreachable for > 1 hour). Sends alerts to Discord webhook.
- Validation: Simulate each alert condition. Verify Discord webhook fires with correct alert text.
- Depends on: Phase 2 (metrics), Phase 3 (embeddings)
- Status: Not Started
- [ ] 7.4 Create `karl_trajectories` Supabase table + sync job
- Owner: claw
- automate: true
- Input: trajectories.jsonl schema, Supabase
- Output: New table: `karl_trajectories(id text primary key, session_id text, machine text, skill_name text, domain text, channel text, tool_sequence jsonb, tool_count int, reward float, outcome_score float, advantage float, recorded_at timestamptz)`. Plus `[home-path]` (~80 lines) that syncs new records every 30 minutes.
- Validation: Sync job runs. Records visible in Supabase. Count matches local trajectories.jsonl.
- Depends on: Phase 1 (trajectory store)
- Status: Not Started
- [ ] 7.5 Create KARL feature flag and rollback mechanism
- Owner: claw
- automate: true
- Input: ops_trigger_v2.py (3.4)
- Output: `[home-path]` with flags: `routing_enabled` (bool), `training_enabled` (bool), `self_play_enabled` (bool), `l4_enabled` (bool). Each subsystem checks its flag before executing. Setting `routing_enabled: false` reverts to regex routing instantly.
- Validation: Set `routing_enabled: false`. Next prompt uses regex routing. Set back to `true`. Next prompt uses vector routing (if cache hit).
- Depends on: 3.4
- Status: Not Started
- [ ] 7.6 Add KARL Grafana dashboard
- Owner: claw
- automate: true
- Input: Prometheus metrics from karl_health.py, skill_metrics.json
- Output: Grafana dashboard JSON at `monitoring/grafana/dashboards/karl.json`. Panels: trajectory ingest rate (time series), reward distribution (histogram), per-skill success rate (bar chart), routing method breakdown (pie: vector vs regex vs miss), Mac5 availability (status).
- Validation: Dashboard visible in Grafana at :3000. At least 3 panels show data after 1 day of operation.
- Depends on: 7.3
- Status: Not Started
---
Phase Summary Table
| Phase | Tasks | Estimated Days | Key Dependency | Automatable |
|---|---|---|---|---|
| Phase 0: Prereq Fix | 0.1-0.3 | 3 | None | 2/3 |
| Phase 1: Data Foundation | 1.1-1.6 | 4 | Phase 0 | 5/6 |
| Phase 2: Reward Engine | 2.1-2.5 | 3 | Phase 1 | 5/5 |
| Phase 3: Routing Upgrade | 3.1-3.9 | 14 (includes 2-week shadow) | Phase 2 | 8/9 |
| Phase 4: Training Pipeline | 4.1-4.7 | 11 | Phase 3, Mac5 | 6/7 |
| Phase 5: Evolution L4 | 5.1-5.4 | 10 | Phase 2, Phase 1 | 3/4 |
| Phase 6: Self-Play | 6.1-6.6 | 12 | Phase 1, Phase 4 | 6/6 |
| Phase 7: Observability | 7.1-7.6 | 7 | Phase 2, Phase 3 | 6/6 |
Total: 41 tasks, ~7 weeks calendar time (with parallelism between Phases 4-6).
Critical path: Phase 0 -> Phase 1 -> Phase 2 -> Phase 3 (shadow period) -> Phase 4 (training). Total critical path: 35 days.
Parallelizable: Phase 5 (L4) can start after Phase 2 (does not need routing upgrade). Phase 6 (Self-Play) can start after Phase 1 (does not need routing or training). Phase 7 can start incrementally after Phase 2.
---
Appendix: Risk Register
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Claude Code tool capture cannot be fixed (Phase 0 blocker) | 15 | ||
| Mac5 stays offline through Phase 4 | 25 | ||
| Gemini API latency spikes break routing hook | 20 | ||
| Reward function produces degenerate distribution | 15 | ||
| Shadow mode shows vector routing is worse than regex | 10 | ||
| Self-play solver produces low-quality trajectories | 20 | ||
| Three feedback loops oscillate instead of converging | 10 |
---
Sources
### Stage 0 Research
- `[home]/Desktop/evo-cube-output/karl-trajectory-intelligence/stage0-research.md` -- Full research: Cortex architecture, hooks, EW, KARL paper
### Stage 2 Compound
- `[home]/Desktop/evo-cube-output/karl-trajectory-intelligence/stage2-compound.md` -- 8-step unified system design
### Live System Verification (2026-03-10)
- `[home-path]` -- 3,940 entries, tool_calls empty on ALL entries
- `[home-path]` -- 3,258 entries, 157 with tool_calls (96
- `[home-path]` -- 399 entries (324 invocation_records, 75 decay_flags)
- `[home-path]` -- 232 lines, confirmed
- `[home-path]` -- 88 skills, 13 active, confirmed
- RAG++ `:8000` -- healthy, returns all expected response fields
- Mac5 `:8100` / `:9200` -- UNREACHABLE (both direct and SSH)
- `[home-path]` -- 270 lines, confirmed
- `[home-path]` -- DOES NOT EXIST (greenfield)
- `[home-path]` -- 981 lines (much larger than assumed)
- `[home-path]` -- 965 lines
- `[home-path]` -- 400 lines
- `[home-path]` -- 325 lines
### External
- KARL paper: arXiv 2603.05218 -- OAPL algorithm, self-play pipeline, nugget-based rewards
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
evo-cube-output/karl-trajectory-intelligence/stage3-expand-master-plan.md
Detected Structure
Method · Evaluation · References · Code Anchors · Architecture · is Stage Research