Grand Diomande Research · Full HTML Reader

KARL Integration -- Evolution3 / Stage 2: COMPOUND

# KARL Integration -- Evolution3 / Stage 2: COMPOUND
Run: karl-trajectory-intelligence
Generated: 2026-03-10
Method: Evolution3 -- sequential compounding
Run Directory: Desktop/evo-cube-output/karl-trajectory-intelligence/

---

STEP 1: The Unified Data Layer -- Foundation for Everything

Inherits: Nothing -- this is ground zero.

The Problem All Five Paths Exposed

Stage 0 established that the Cortex pipeline operates on prompt text only (Section 1, core limitation). It has zero visibility into tool sequences, exit codes, file diffs, or task outcomes. All five Stage 1 paths converged on the same foundational requirement: a structured trajectory record that captures what actually happened during a session, not just what the user asked for.

Path A designed the `TrajectoryRecord` schema with 4 tap points in existing hooks. Path B discovered that `verbose-all.jsonl` already contains 3,249 entries with 157 multi-step tool sequences. Path D needs trajectory outcomes to weight its embedding space. Path C needs execution records to compute L4 fitness. Path E needs a place to store solver attempt trajectories. They all need the same thing: a unified trajectory store with outcome annotations.

The Compound Design: Two Ingest Channels, One Store

The unified data layer merges Path A's live recording with Path B's historical extraction into a single trajectory store at `[home-path]`. Two channels feed it:

Channel 1: Live Recording (Path A's Trajectory Tap)

Four tap points in existing hooks, exactly as Path A specified:
- Tap A (`ops_trigger.py`, after line 226): Initialize session buffer when skill is injected. Cost: ~5ms.
- Tap B (`post_tool_hook.py`, after line 244): Append tool event metadata per tool call. Cost: ~8ms.
- Tap C (`response_hook.py`, after line 838): Flush session buffer into a TrajectoryRecord at Stop. Cost: ~15ms.
- Tap D (`ops_trigger.py`, beginning of main()): Annotate the previous record with cross-turn outcome signals at the next UserPromptSubmit. Cost: ~10ms.

All tap points use `try/except: pass` wrappers and self-disable if they exceed 50ms for 3+ consecutive calls (`KARL_TAP_DISABLED=1` in hook state). The 500ms SIGALRM budget (enforced at `ops_trigger.py` line 21) kills the entire hook if anything overruns. Together these bound the worst case: a failing tap is skipped silently, and a slow tap cannot hold a hook past its existing budget.
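The guard described above can be sketched as a small wrapper. This is a minimal illustration, not the code in `trajectory_tap.py`: the class name `GuardedTap` and the injectable `clock` are assumptions, and the `disabled` attribute stands in for the `KARL_TAP_DISABLED=1` flag in hook state.

```python
import time

MAX_TAP_MS = 50        # per-call latency ceiling from the text
MAX_SLOW_STREAK = 3    # consecutive slow calls before the tap disables itself


class GuardedTap:
    """Hypothetical wrapper: swallow tap failures, disable persistently slow taps."""

    def __init__(self, fn, clock=time.perf_counter):
        self.fn = fn
        self.clock = clock          # injectable so the behaviour can be tested
        self.slow_streak = 0
        self.disabled = False       # stands in for KARL_TAP_DISABLED=1

    def __call__(self, *args, **kwargs):
        if self.disabled:
            return None
        start = self.clock()
        try:
            return self.fn(*args, **kwargs)
        except Exception:
            return None             # a tap must never break the host hook
        finally:
            elapsed_ms = (self.clock() - start) * 1000.0
            # A fast call resets the streak; only consecutive slow calls count.
            self.slow_streak = self.slow_streak + 1 if elapsed_ms > MAX_TAP_MS else 0
            if self.slow_streak >= MAX_SLOW_STREAK:
                self.disabled = True
```

Wrapping each tap once at import time (for example `tap_b = GuardedTap(record_tool_event)`, where `record_tool_event` is a placeholder name) keeps the added lines in each hook to a single call.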

Channel 2: Historical Backfill (Path B's Extraction)

`trajectory_extractor.py` processes the existing `verbose-all.jsonl` (3,249 entries) and `unified.jsonl` (3,909 entries) to produce retroactive trajectory records. Path B identified 157 entries with tool sequences of 2+ calls. The extractor normalizes tool names from various agent formats (Path B Section 2.2: `shell_command` -> `Bash`, `read_file` -> `Read`, etc.) and writes them to the same `trajectories.jsonl` store with `channel: "backfill"` to distinguish from live records.
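A minimal sketch of the normalization and channel tagging, under stated assumptions: only the two alias mappings shown come from the text, and the input field name `tools` is hypothetical because the `verbose-all.jsonl` entry layout is not reproduced here.

```python
from collections import Counter
from typing import Optional

# Only these two mappings appear in the text; the real table is longer.
TOOL_ALIASES = {
    "shell_command": "Bash",
    "read_file": "Read",
}


def normalize_tool(name: str) -> str:
    """Map an agent-specific tool name to its canonical name."""
    return TOOL_ALIASES.get(name, name)


def to_backfill_record(entry: dict) -> Optional[dict]:
    """Turn one historical entry into a minimal trajectory record.

    Returns None for entries with fewer than two tool calls, mirroring the
    2+ call criterion. The 'tools' key is an assumed input field.
    """
    tools = [normalize_tool(t) for t in entry.get("tools", [])]
    if len(tools) < 2:
        return None
    return {
        "schema_version": 1,
        "channel": "backfill",
        "trajectory": {
            "tool_sequence": tools,
            "tool_counts": dict(Counter(tools)),
            "total_tools": len(tools),
        },
    }
```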

Schema (merging Path A Section 3.1 + Path E Section 3.2):

json
{
  "id": "karl-abc12345",
  "schema_version": 1,
  "channel": "live" | "backfill" | "self_play",
  "recorded_at": "2026-03-10T14:23:11Z",
  "session_id": "...",
  "prompt_id": "...",
  "machine": "mac1",
  "pane": "/dev/ttys007",
  "cwd": "[home]",
  "git_repo": "mohameddiomande",

  "skill": {
    "name": "deploy",
    "injected": true,
    "domain": "deploy"
  },

  "prompt": {
    "text_excerpt": "deploy the flows to cloud-vm",
    "text_length": 28,
    "intent_tokens": ["deploy", "flows", "cloud-vm"]
  },

  "trajectory": {
    "tool_sequence": ["Read", "Bash", "Bash", "Bash"],
    "tool_counts": {"Read": 1, "Bash": 3},
    "total_tools": 4,
    "duration_ms": 8420,
    "files_read": ["docker-compose.yml"],
    "files_modified": [],
    "bash_commands": ["ssh cloud-vm 'systemctl restart prefect'"],
    "bash_exit_codes": [0],
    "error_count": 0,
    "bash_fail_count": 0,
    "token_usage": {"input": 4200, "output": 310}
  },

  "outcome": {
    "annotation_status": "complete",
    "score": 0.7,
    "reward": 0.68,
    "advantage": 0.13,
    "signals": {
      "correction_detected": false,
      "correction_absent": true,
      "build_success_detected": true,
      "redo_detected": false,
      "session_continued": true,
      "process_cleanliness": 0.85,
      "file_modification_signal": 0.65,
      "error_signal": 1.0,
      "correction_signal": null
    }
  },

  "karl_meta": null
}

The `karl_meta` field is populated only for self-play trajectories (Step 6). The `outcome.reward` and `outcome.advantage` fields are computed by Step 2's Reward Engine. The schema supports all five paths' data needs without requiring separate stores.

Why One Store, Not Five

Path A proposed `[home-path]`. Path B works from `verbose-all.jsonl`. Path C proposed `ew_trajectory_log` in Supabase. Path D proposed `skill_embeddings` in pgvector. Path E proposed `[home-path]` shards.

Five stores means five sync problems, five schema migrations, five dedup passes. One JSONL store with a `channel` field and a Supabase sync job (following the existing `pane_supabase_sync.py` pattern) gives every downstream consumer a single source of truth. The Supabase table (`karl_trajectories`) mirrors the JSONL for cross-machine access; the JSONL is the primary write target because hooks need sub-millisecond appends.

Storage budget: At 30+ prompts/day with tool calls, the store grows ~50-80 KB/day. 90-day rotation keeps it under 10 MB. The Supabase table stays permanently (used by L4 for long-term trajectory history, per Path C Section 9).
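At the upper estimate of 80 KB/day, 90 days comes to about 7.2 MB, which is where the 10 MB ceiling comes from. The rotation itself could look like the sketch below. The function name `rotate` and the choice to keep unparseable lines are assumptions; the timestamp format follows the `recorded_at` field in the schema above.

```python
import json
from datetime import datetime, timedelta, timezone


def rotate(lines, now, max_age_days=90):
    """Return the JSONL lines whose recorded_at falls within the retention window."""
    cutoff = now - timedelta(days=max_age_days)
    kept = []
    for line in lines:
        try:
            rec = json.loads(line)
            ts = datetime.strptime(
                rec["recorded_at"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc)
        except (ValueError, KeyError, TypeError):
            # Assumption: keep lines we cannot parse rather than lose data.
            kept.append(line)
            continue
        if ts >= cutoff:
            kept.append(line)
    return kept
```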

Files Created

[home-path]
  trajectories.jsonl          # Append-only unified trajectory store
  trajectory_tap.py           # Live recording logic (~200 lines, Path A)
  trajectory_extractor.py     # Backfill from verbose-all.jsonl (~180 lines, Path B)
  __init__.py                 # Package marker

Files Modified

[home-path]      # +8 lines (Tap A, Tap D)
[home-path]  # +5 lines (Tap B)
[home-path]   # +7 lines (Tap C)
[home-path] # +4 lines (final flush)

---

STEP 2: The Reward Engine -- Building on the Data Layer

Inherits: Step 1 -- trajectories now flow into a single store with tool sequences, exit codes, and prompt context. What we lack is a value judgment: did this trajectory succeed?

Synthesizing Three Reward Designs

Path A designed 4 outcome signals (correction absence, build success, redo absence, session continuation) with weight coefficients summing to 1.0. Path B designed 4 reward components (process cleanliness, file modification, error array, correction signal) with weights 0.35/0.25/0.20/0.20. Path C designed a 4-term L4 fitness function (build pass rate, files changed rate, milestone delta, duration efficiency) with weights 0.40/0.30/0.20/0.10.

These are not competing designs. They measure success at three different granularities:

| Level | What it measures | Source | Consumer |
|---|---|---|---|
| Outcome score (Path A) | Did the user approve the result? | Cross-turn signals (next prompt) | Skill metrics, decay detector |
| Process reward (Path B) | Was the tool execution clean? | Within-turn signals (exit codes, errors) | LoRA training weights, advantage computation |
| Fleet fitness (Path C) | Did the trajectory advance EW goals? | Post-dispatch signals (builds, milestones) | L4 policy evolution |

The compound Reward Engine computes all three, stored in the `outcome` block of Step 1's schema.

The Composite Reward Function

python
# [home-path] (~250 lines)

def compute_full_reward(record: dict) -> dict:
    """
    Computes outcome score (Path A), process reward (Path B),
    and fleet fitness contribution (Path C) for a single trajectory record.

    Returns updated outcome dict with all fields populated.
    """
    traj = record["trajectory"]
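    # Cross-turn annotations written by Tap D; empty until the next prompt arrives.
    signals = dict((record.get("outcome") or {}).get("signals") or {})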

    # ---- Path B: Process Reward (within-turn, available immediately) ----

    # Component 1: Process cleanliness (weight 0.35)
    bash_events = [i for i, t in enumerate(traj["tool_sequence"]) if t == "Bash"]
    if not bash_events:
        r_process = 0.85  # Read-only trajectories are inherently clean
    else:
        fail_count = traj["bash_fail_count"]
        total_bash = traj["tool_counts"].get("Bash", 1)
        fail_rate = fail_count / max(total_bash, 1)
        r_process = max(0.0, 1.0 - fail_rate * 0.6)

    # Component 2: File modification signal (weight 0.25)
    domain = (record.get("skill") or {}).get("domain", "unknown")
    modified = len(traj.get("files_modified", []))
    if domain in ("ios", "deploy", "docker"):
        # Build domains are expected to touch files: no modifications scores low.
        r_file = 0.2 if modified == 0 else min(1.0, 0.5 + modified * 0.15)
    else:
        r_file = 0.6  # Neutral for non-build domains

    # Component 3: Error signal (weight 0.20)
    error_count = traj["error_count"]
    r_error = max(0.1, 1.0 - error_count * 0.3)

    # Component 4: Correction signal (weight 0.20, from cross-turn annotation)
    r_correction = signals.get("correction_signal")  # Set by Tap D

    # Composite process reward
    if r_correction is not None:
        process_reward = (r_process * 0.35 + r_file * 0.25 +
                         r_error * 0.20 + r_correction * 0.20)
    else:
        total_other = 0.35 + 0.25 + 0.20
        process_reward = (r_process * 0.35/total_other +
                         r_file * 0.25/total_other +
                         r_error * 0.20/total_other)

    # ---- Path A: Outcome Score (cross-turn, requires next prompt) ----

    outcome_score = 0.0
    if signals.get("correction_detected"):
        outcome_score -= 0.5
    elif signals.get("correction_absent") and signals.get("next_prompt_exists"):
        outcome_score += 0.4
    if signals.get("build_success_detected"):
        outcome_score += 0.3
    elif signals.get("build_fail_detected"):
        outcome_score -= 0.3
    if signals.get("redo_detected"):
        outcome_score -= 0.4
    elif signals.get("redo_absent"):
        outcome_score += 0.2
    if signals.get("session_continued"):
        outcome_score += 0.1
    outcome_score = max(-1.0, min(1.0, outcome_score))

    return {
        "annotation_status": "complete" if signals.get("next_prompt_exists") else "partial",
        "score": outcome_score,                    # Path A: [-1, 1]
        "reward": process_reward,                  # Path B: [0, 1]
        "advantage": None,                         # Patched later by the karl_advantage_batch flow
        "signals": {
            **signals,                             # Cross-turn annotations first, so fresh values below win
            "process_cleanliness": r_process,
            "file_modification_signal": r_file,
            "error_signal": r_error,
            "correction_signal": r_correction,
        }
    }

Advantage Computation (OAPL Core from Path B)

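The advantage measures how far a trajectory's reward `r` sits above or below its domain baseline `V_baseline`, scaled by `1/beta` and clipped. It is computed in batch, not per-record, because it requires domain-level statistics. Path B Section 3.4 specifies: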

V_baseline = mean reward across all trajectories in the same domain bucket
A = (reward - V_baseline) / beta
beta = 0.05 (tighter KL constraint than KARL's 0.1 due to Mac5's smaller model)
Clipped to [-2.0, 2.0]

The advantage computation runs as a daily Prefect flow (`karl_advantage_batch`) that reads all trajectories from the store, groups by domain, computes baselines, and patches the `outcome.advantage` field. This feeds directly into Step 5's training pipeline.
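The batch step can be sketched in a few lines. This is an illustration of the formula above, not the `karl_advantage_batch` flow itself: the function name `compute_advantages` is an assumption, and records are mutated in place for brevity.

```python
from collections import defaultdict

BETA = 0.05   # from Path B Section 3.4
CLIP = 2.0    # advantages are clipped to [-CLIP, CLIP]


def compute_advantages(records, beta=BETA, clip=CLIP):
    """Patch outcome.advantage on every record that already has a reward."""
    by_domain = defaultdict(list)
    for rec in records:
        reward = (rec.get("outcome") or {}).get("reward")
        if reward is None:
            continue  # not yet scored by the Reward Engine
        domain = (rec.get("skill") or {}).get("domain", "unknown")
        by_domain[domain].append(rec)

    for recs in by_domain.values():
        # V_baseline: mean reward within the domain bucket.
        baseline = sum(r["outcome"]["reward"] for r in recs) / len(recs)
        for r in recs:
            raw = (r["outcome"]["reward"] - baseline) / beta
            r["outcome"]["advantage"] = max(-clip, min(clip, raw))
    return records
```

Note that with `beta = 0.05`, any reward more than 0.10 away from its domain baseline saturates at the clip bound, so the clipped advantage behaves close to a three-level signal for domains with wide reward spread.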

Fleet Fitness (Path C's L4 Signal)

For trajectories produced by Evolution World dispatches (identified by `channel: "live"` + a CALC session marker), the Reward Engine also computes the L4 fitness contribution:

L4_contribution = 0.40 * build_pass_bit
                + 0.30 * (1 if files_changed > 0 else 0)
                + 0.20 * milestone_delta_bit
                + 0.10 * (duration_prior / actual_duration)

This is written to the same record as `outcome.l4_fitness_contribution` and consumed by Step 4's L4 controller. The key insight: the same raw trajectory data (exit codes, files changed, duration) produces three different reward signals at three different abstraction levels, all sharing a single computation pass.
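As a worked instance of the formula, a minimal sketch follows. The function and parameter names are assumptions, as is the choice to score the duration term as 0 when the actual duration is missing or non-positive.

```python
def l4_fitness_contribution(build_passed, files_changed, milestone_advanced,
                            duration_prior_s, actual_duration_s):
    """Evaluate the four-term L4 fitness contribution for one trajectory."""
    if actual_duration_s > 0:
        efficiency = duration_prior_s / actual_duration_s
    else:
        efficiency = 0.0  # assumption: no usable duration means no efficiency credit
    return (0.40 * (1.0 if build_passed else 0.0)
            + 0.30 * (1.0 if files_changed > 0 else 0.0)
            + 0.20 * (1.0 if milestone_advanced else 0.0)
            + 0.10 * efficiency)
```

A build that passes, changes two files, advances no milestone, and takes twice its duration prior scores 0.40 + 0.30 + 0.00 + 0.05 = 0.75. The formula as written does not cap the duration ratio, so a run that finishes faster than its prior contributes more than 0.10 from the last term.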

Expected Baseline Rewards (Path B Section 3.5)

| Domain | Expected Mean r | Why |
|---|---|---|
| deploy | ~0.55 | Mix of clean deploys and SSH retry loops |
| ios | ~0.45 | xcodebuild is noisy, high Bash failure rate |
| git | ~0.70 | Simple tool sequences, usually clean |
| debug | ~0.40 | Exploratory, expected high error rates |
| supabase | ~0.60 | Read + verify patterns |
| monitoring | ~0.65 | Clean check + report patterns |

---

STEP 3: Routing Intelligence -- Building on Data + Rewards

Inherits: Steps 1-2 -- trajectories are being recorded and annotated with outcome scores, process rewards, and advantages. The data layer is accumulating. Now: how does the system use this data to make better routing decisions?

The Regex Problem (Confirmed by Path D)

Path D documented 5 concrete failures of the current regex-based routing in `ops_trigger.py`:
1. Vocabulary gaps: "The Nexus portal is throwing a 502" matches neither `ops:deploy` nor `ops:monitoring` patterns.
2. First-match-wins: "Fix the deploy script" triggers `ops:debug` (on "fix") instead of `ops:deploy`.
3. No outcome signal: 71 `ops:ios` invocations in `entries.jsonl` with zero feedback on which actually helped.
4. No multi-skill composition: "Fix the Prefect deploy for Spore's Supabase migration" touches 3 domains.
5. Cold prompts never routed: Creative or hybrid prompts bypass all skills.

Step 2's Reward Engine now solves problem #3 (outcome signal exists). Step 3 uses that signal to solve problems #1, #2, #4, and #5 by replacing regex matching with trajectory-weighted vector similarity.

Embedding Architecture (Path D Section 1)

Embedding model: RAG++ gateway at `:8000` (SSH tunnel to cloud-vm Docker). Uses Gemini `text-embedding-004` producing 768-dimensional vectors. Already exists, already persistent in pgvector. No new infrastructure.

What gets embedded:

Each skill is embedded from its operational semantic surface (Path D `build_skill_embedding_text()`):

Skill: {name}
Intent: {intent_section[:200]}
Workflow: {workflow_section[:300]}
Gotchas summary: {gotchas[:200]}
Used for prompts like: {top_5_historical_trigger_prompts}

Each prompt is embedded with project context:

[project:{cwd_basename}] {prompt_text}
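The two templates translate directly into string builders. The sketch below assumes the SKILL.md sections have already been parsed into plain strings; the parameter names and the `"; "` separator for trigger prompts are assumptions, while the truncation lengths come from the templates above.

```python
import os


def build_skill_embedding_text(name, intent, workflow, gotchas, trigger_prompts):
    """Assemble the text that gets embedded for one skill."""
    top_prompts = "; ".join(trigger_prompts[:5])   # top 5 historical triggers
    return "\n".join([
        f"Skill: {name}",
        f"Intent: {intent[:200]}",
        f"Workflow: {workflow[:300]}",
        f"Gotchas summary: {gotchas[:200]}",
        f"Used for prompts like: {top_prompts}",
    ])


def build_prompt_embedding_text(prompt, cwd):
    """Prefix the prompt with the project name derived from the working directory."""
    project = os.path.basename(os.path.normpath(cwd))
    return f"[project:{project}] {prompt}"
```

Prefixing the project name lets the same prompt text ("fix the build") land in different regions of the embedding space depending on where it was typed.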

Storage: One new pgvector table `skill_embeddings` (Path D Section 1):

sql
CREATE TABLE skill_embeddings (
    skill_name      TEXT PRIMARY KEY,
    embedding       vector(768),
    embedding_text  TEXT,
    updated_at      TIMESTAMPTZ DEFAULT NOW(),
    version         INT DEFAULT 1,
    trajectory_weight FLOAT DEFAULT 1.0
);

Local cache: `[home-path]` -- 10 skills x 768 dims = ~60KB. Loaded once per hook process. TTL: 1 hour (Path D Section 1, `CACHE_TTL_SECONDS = 3600`).

Trajectory-Weighted Similarity (Path D Section 2 + Step 2 Rewards)

This is where Steps 1-2 compound into Step 3. Raw cosine similarity asks "which skill text is closest to this prompt?" Trajectory-weighted similarity asks "which skill has historically worked for prompts like this?"

sim_weighted(prompt, skill) = cosine_sim(embed(prompt), embed(skill)) * w_s

Where `w_s` is the trajectory weight, updated by Step 2's outcome signals via EMA:

python
ALPHA = 0.1   # Learning rate
def update_weight(current_weight: float, outcome: float) -> float:
    target = 1.0 + (outcome * 0.5)  # Maps [-1,1] to [0.5, 1.5]
    return current_weight * (1 - ALPHA) + target * ALPHA

The weight bounds [0.5, 1.5] ensure no skill is completely suppressed or unconditionally dominant. A skill with consistent corrections (outcome=-1) converges to weight 0.5 (halved similarity). A skill with consistent success converges to 1.5 (a 50% similarity boost).
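The convergence behaviour is easy to check numerically. The sketch below repeats `update_weight` from above and adds a hypothetical `simulate` helper that applies the same outcome for a number of sessions.

```python
ALPHA = 0.1   # learning rate, as above


def update_weight(current_weight: float, outcome: float) -> float:
    target = 1.0 + (outcome * 0.5)   # maps [-1, 1] to [0.5, 1.5]
    return current_weight * (1 - ALPHA) + target * ALPHA


def simulate(outcome: float, steps: int, start: float = 1.0) -> float:
    """Apply the same outcome `steps` times and return the resulting weight."""
    weight = start
    for _ in range(steps):
        weight = update_weight(weight, outcome)
    return weight
```

With a constant outcome the gap to the target shrinks by a factor of 0.9 per update, so it halves roughly every 7 sessions (ln 0.5 / ln 0.9 ≈ 6.6). A single bad session moves a weight from 1.0 only to 0.95, which is what keeps one-off corrections from derailing routing.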

The New Routing Pipeline

The replacement for `ops_trigger.py`'s regex matching (Path D Section 4):

python
# ops_trigger_v2.py -- replaces regex matching, keeps injection mechanism
from typing import Optional

def route_prompt(prompt: str, cwd: str, session_id: str) -> Optional[str]:
    """Select the best skill for this prompt using vector similarity."""

    # 1. Embed the prompt (RAG++ :8000, ~80ms with cache warmth)
    prompt_text = build_prompt_embedding_text(prompt, cwd)
    prompt_embedding = embed_via_ragpp(prompt_text)

    # 2. Load cached skill embeddings (<1ms)
    skill_embeddings = load_skill_embeddings()  # {name: (vector, weight)}

    # 3. Compute weighted similarities (numpy, <1ms for 10 skills)
    scores = {}
    for skill_name, (skill_vec, weight) in skill_embeddings.items():
        sim = cosine_similarity(prompt_embedding, skill_vec)
        scores[skill_name] = sim * weight

    # 4. Select top-k skills above threshold
    threshold = 0.35  # Below this, no skill is relevant enough
    candidates = [(name, score) for name, score in scores.items() if score > threshold]
    candidates.sort(key=lambda x: -x[1])

    if not candidates:
        return None  # No skill matched -- record as baseline trajectory (Step 1)

    # 5. Multi-skill composition (solves Path D Problem #4)
    if len(candidates) >= 2 and candidates[1][1] > 0.8 * candidates[0][1]:
        # Two skills within 20% -- compose them
        return compose_skills(candidates[0][0], candidates[1][0])

    return candidates[0][0]
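`route_prompt` relies on a `cosine_similarity` helper and on the threshold/composition rule in steps 4 and 5. A dependency-free sketch of both follows; the text uses numpy, so the pure-Python version here is a substitution for illustration, and `select_skills` is a hypothetical name that returns the chosen skill names rather than calling `compose_skills`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_skills(scores, threshold=0.35, compose_ratio=0.8):
    """Return [] (no match), [best], or [best, runner_up] when the two are close."""
    candidates = sorted(
        ((name, score) for name, score in scores.items() if score > threshold),
        key=lambda item: -item[1],
    )
    if not candidates:
        return []
    if len(candidates) >= 2 and candidates[1][1] > compose_ratio * candidates[0][1]:
        return [candidates[0][0], candidates[1][0]]
    return [candidates[0][0]]
```

Because scores are similarity times weight, the fixed 0.35 threshold is effectively stricter for skills whose trajectory weight has decayed toward 0.5.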

Shadow Mode Before Live (Path D Section 5)

The vector router runs in shadow mode for 2 weeks alongside the existing regex router. Both fire on every UserPromptSubmit. The shadow router's selection is logged but NOT injected. After 2 weeks, compare:

  • Agreement rate: How often do regex and vector select the same skill?
  • Lift on disagreements: When they disagree, which selection leads to better Step 2 outcome scores?
  • Coverage improvement: How many prompts does the vector router match that the regex router missed (cold prompts)?

If the vector router shows positive lift on disagreements AND higher coverage, promote it. Otherwise, investigate the embedding quality.
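Two of the three comparison metrics can be computed from the shadow log alone. The sketch below assumes each log event is a dict with hypothetical keys `regex` and `vector` holding the selected skill name or `None`; it counts two `None` selections as agreement. Lift on disagreements is left out because it additionally needs outcome scores joined from Step 2.

```python
def shadow_report(events):
    """Summarize agreement and coverage between the regex and vector routers."""
    total = len(events)
    if total == 0:
        return {"agreement_rate": None, "regex_coverage": None,
                "vector_coverage": None, "newly_covered": 0}
    agree = sum(1 for e in events if e["regex"] == e["vector"])
    regex_hits = sum(1 for e in events if e["regex"] is not None)
    vector_hits = sum(1 for e in events if e["vector"] is not None)
    # Cold prompts the regex router missed but the vector router matched.
    newly_covered = sum(
        1 for e in events if e["regex"] is None and e["vector"] is not None
    )
    return {
        "agreement_rate": agree / total,
        "regex_coverage": regex_hits / total,
        "vector_coverage": vector_hits / total,
        "newly_covered": newly_covered,
    }
```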

Two-Tier Learning (Path D Section 3)

Tier 1 (real-time): At every Stop event, `weight_updater.py` computes trajectory scores from Step 2's Reward Engine and applies EMA updates to `trajectory_weight` in the skill_embeddings table. Cost: <50ms per session. This is the closed feedback loop: better routing -> better outcomes -> updated weights -> even better routing.

Tier 2 (daily Prefect): `skill_embedding_refresh` re-embeds any skill whose content changed or whose trajectory weight drifted significantly from 1.0. Incorporates new top historical trigger prompts accumulated since the last embedding.

Files Created

[home-path]     # Vector-based routing (~250 lines)
[home-path]     # Local skill embedding cache (~80 lines)
[home-path]      # Stop-event weight update (~120 lines)
flows/feed-hub/skill_embedding_refresh.py      # Daily Prefect flow (~100 lines)

---

STEP 4: Evolution Integration -- Building on Data + Rewards + Routing

Inherits: Steps 1-3 -- trajectories are recorded (Step 1), scored with multi-level rewards (Step 2), and used to route skills via weighted embeddings (Step 3). Now: how does Evolution World consume trajectory intelligence to evolve its own agent routing?

Why L4 Is Not Redundant With Step 3

Step 3's routing intelligence operates at the skill selection level: given a user prompt, which SKILL.md should be injected? Path C's L4 operates at the agent routing level: given a task dispatched by the EW heartbeat loop, which CALC agent (claude-code, codex, gemini-cli) should execute it?

These are different decisions at different scales:
- Step 3 runs in 500ms hooks, on every human prompt, routing ~10 skills
- L4 fires every 90 L1 steps, routing across 3+ agents, across 40+ apps

But they share Step 1's trajectory store and Step 2's reward signals. L4 reads the same `trajectories.jsonl` records but filters for those with CALC dispatch markers. It computes fleet-wide fitness from the same reward components (build pass rate, files changed, duration) that Step 2 already computed.

The L4 Controller (Path C, informed by Steps 1-2)

Path C designed the `ToolPreferenceGenome` with 6 components:
1. agent_weights: `{claude-code: 0.6, codex: 0.8, gemini-cli: 0.4}`
2. technique_agent_affinity: 2D matrix mapping technique prefix x agent to success probability
3. duration_priors: Per-technique-per-agent expected duration (replaces the global `CALC_MAX_WAIT=1200`)
4. search_templates: Parameterized GK query patterns for the sense phase
5. cross-layer forwarding thresholds: Circuit breaker controls
6. trajectory_discount: Temporal decay factor (0.85)

The compound advancement over Path C alone: L4 does not need its own trajectory buffer. It reads from Step 1's unified store, filtered by `channel: "live"` + the CALC session marker that `pulse_bridge.py` writes on dispatch. And it computes fitness using Step 2's Reward Engine outputs rather than reimplementing its own reward function.

L4's 5 Mutation Operators (Path C Section 5)

These operate on the `ToolPreferenceGenome` every 90 L1 steps:

M1: Affinity Reinforcement -- For each (technique, agent) pair in the observation window, update affinity toward observed success rate using `lr = l3_genome.learning_rate` (inherited from L3). This is the direct feedback loop from Step 2's rewards into agent routing.

M2: Duration Prior Update -- Decay-weighted moving average of `PulseResult.duration_s`, per technique-agent pair. Replaces the static `CALC_MAX_WAIT = 1200` with learned per-pair estimates.

M3: Search Template Mutation -- Probes new GK query patterns in the daemon's sense phase. Uses a 10-step inner trial (compare admissibility scores of new vs. old template). If the new template yields higher average admissibility, absorb it. This evolves how the daemon queries RAG++ for context, not just which agent executes.

M4: Strategy Lock/Unlock -- Circuit breaker. If fleet-wide trajectory success crosses the lock threshold upward, inhibit L2's `_mutate_strategy` for 5 L2 steps (protect what's working). If it drops below the interrupt threshold, unlock and force maximum exploration. This is the only mutation operator that directly modifies a lower layer's live state.

M5: Agent Weight Redistribution -- Dirichlet perturbation on agent weights. Concentration parameter `alpha = 3.0 * l4_fitness` -- higher fitness = tighter (exploit), lower = looser (explore). Combined with the no-degenerate-routing invariant (L4-I4): at least 2 agents must have weight >= 0.10 at all times.
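Of the five operators, M1 is the simplest to make concrete. The sketch below moves each observed (technique, agent) affinity toward its success rate in the window. The data shapes, the function name, and the neutral 0.5 prior for unseen pairs are assumptions; in the design `lr` would come from `l3_genome.learning_rate`.

```python
def reinforce_affinity(affinity, observations, lr):
    """Return a new affinity map nudged toward observed success rates.

    affinity:     {(technique, agent): success probability}
    observations: iterable of (technique, agent, succeeded) tuples
    """
    tallies = {}
    for technique, agent, succeeded in observations:
        wins, count = tallies.get((technique, agent), (0, 0))
        tallies[(technique, agent)] = (wins + (1 if succeeded else 0), count + 1)

    updated = dict(affinity)
    for key, (wins, count) in tallies.items():
        observed_rate = wins / count
        current = updated.get(key, 0.5)   # assumption: neutral prior for new pairs
        updated[key] = current + lr * (observed_rate - current)
    return updated
```

Pairs with no observations in the window are left untouched, so affinity only moves where there is evidence.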

L4's Invariants (Path C Section 4)

| Invariant | What It Prevents | Mechanism |
|---|---|---|
| L4-I1: Minimum Trajectory Entropy | L4 converging to a fixed policy | KL divergence between pre/post mutation must exceed 0.005 |
| L4-I2: Bounded Policy Divergence | Catastrophic routing shifts | Max 0.30 weight shift per agent per generation |
| L4-I3: Cross-Layer Coupling | L4 prescriptions being ignored | Check actual dispatches match prescribed distribution within 10 steps |
| L4-I4: No Degenerate Routing | All traffic to one agent | At least 2 agents with weight >= 0.10 |
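L4-I2 and L4-I4 reduce to simple checks over the agent weight map. The sketch below borrows the function names that the design adds to `invariants.py`, but the signatures and bodies are assumptions for illustration.

```python
def check_l4_policy_divergence(before, after, max_shift=0.30):
    """L4-I2: no agent's weight may move more than max_shift in one generation."""
    agents = set(before) | set(after)
    return all(
        abs(after.get(agent, 0.0) - before.get(agent, 0.0)) <= max_shift
        for agent in agents
    )


def check_l4_agent_diversity(weights, min_weight=0.10, min_agents=2):
    """L4-I4: at least min_agents agents must keep weight >= min_weight."""
    return sum(1 for w in weights.values() if w >= min_weight) >= min_agents
```

A mutation that fails either check would be rejected before it reaches `pulse_bridge.py`, so the live routing distribution only ever changes in bounded steps.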

Integration with Existing EW Files

From Path C Section 14:

New files:
- `[home]/projects/evolution_world/l4_controller.py` -- L4Controller, ToolPreferenceGenome, 5 mutation operators
- `[home]/projects/evolution_world/trajectory_store.py` -- Reads from Step 1's `[home-path]`, filtered for CALC dispatches

Modified files:
- `engine.py` -- Import L4Controller, call `l4.should_fire` after L3 block (~line 244)
- `daemon.py` -- Wire CALC completions to `l4.record_trajectory`, call `l4.apply_to_pulse_bridge` post-adaptation
- `pulse_bridge.py` -- Replace static `select_calc_agent` with trajectory-informed version consulting `ToolPreferenceGenome`
- `l2_controller.py` -- Add `_l4_strategy_locked` guard on `_mutate_strategy`
- `invariants.py` -- Add `check_l4_policy_divergence` and `check_l4_agent_diversity`
- `state.py` -- Add `save_l4_step`, `get_l4_state` for Supabase persistence

The Compound Insight

Without Steps 1-2, L4 would need its own trajectory buffer and its own reward function (as Path C originally designed). With Steps 1-2 in place, L4 becomes a policy consumer of the unified store: it reads annotated trajectories with pre-computed rewards and focuses entirely on the agent routing optimization problem. This eliminates ~200 lines of redundant data collection and reward computation from L4's implementation.

Without Step 3's embedding-based routing, L4 would need to also handle skill selection. With Step 3 in place, the concerns are cleanly separated: Step 3 handles "which skill for this prompt?" while L4 handles "which agent for this technique?"

---

STEP 5: Training Pipeline -- Building on Data + Rewards + Routing + L4

Inherits: Steps 1-4 -- trajectories are recorded and scored (Steps 1-2), routing intelligence learns from trajectory weights (Step 3), and L4 evolves agent preferences (Step 4). Now: how do we train a LoRA adapter that internalizes trajectory patterns?

What OAPL-Lite Trains On (Path B + Steps 1-2)

Path B designed the OAPL-Lite pipeline: advantage-weighted SFT on Mac5 using MLX LoRA. The compound advancement: Path B's original design had to build its own reward function and extract its own trajectories from `verbose-all.jsonl`. With Steps 1-2 in place, the training pipeline simply reads from the unified trajectory store where rewards and advantages are already computed.

The pipeline:

Step 1's trajectories.jsonl (reward + advantage pre-computed by Step 2)
    |
    v
quality_filter.py                -- Path B's pass-rate band [0.1, 0.9]
    |                               + Path E's LLM judge (Section 4.3)
    v
sft_formatter_trajectory.py      -- ChatML examples with trajectory context
    |                               Advantage-weighted repetition (Path B Section 4.2)
    v
MLX LoRA training on Mac5         -- python3 -m mlx_lm lora
    |                               (gotcha: use --num-layers not --lora-layers, v0.29+)
    v
adapter_v2 (tool-use specialist)
    |
    v
fused model at :8100              -- hot-swapped via finetune-daemon.py

Advantage-Weighted SFT Format (Path B Section 2.3)

Each trajectory becomes a ChatML training example:

json
{
  "messages": [
    {"role": "system", "content": "You are a Cognitive Twin. When given a task, reason through the optimal tool-use sequence."},
    {"role": "user", "content": "Task: deploy the flows to cloud-vm\n\nContext:\n- Domain: deploy\n- Skill: ops:deploy\n- Gotchas: SSH heredoc mangles variables; port 8080 is Docker proxy\n- Git: branch=main, repo=mohameddiomande"},
    {"role": "assistant", "content": "<thinking>Deploy task targeting cloud-vm. Read config first, verify service state, restart via SSH, confirm success.</thinking>\n\nTool sequence: Read(docker-compose.yml) -> Bash(ssh cloud-vm 'systemctl status') -> Bash(ssh cloud-vm 'systemctl restart prefect') -> Bash(verify)\n\nKey checks:\n1. Read docker-compose.yml first to confirm service names\n2. SSH commands must be single-quoted to avoid heredoc variable expansion\n3. Verify exit_code=0 on systemctl restart before marking complete"}
  ],
  "advantage_weight": 0.73
}

The `advantage_weight` comes directly from Step 2's advantage computation (`outcome.advantage` in the trajectory record). High-advantage examples (A > 0.5) are duplicated 2x in the training JSONL. Negative-advantage examples are included once with a "what NOT to do" framing (Path B Section 4.2, Strategy A).
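The repetition rule can be sketched as follows. This is a Python illustration of what the TypeScript `exportWeighted()` would do, not its implementation; the prefix text used for the negative framing is a placeholder, since the exact wording from Path B Section 4.2 is not given here.

```python
import copy

NEGATIVE_PREFIX = "What NOT to do: "   # placeholder wording (assumption)


def frame_as_negative(example):
    """Return a copy whose assistant turns are marked as a counter-example."""
    framed = copy.deepcopy(example)
    for message in framed["messages"]:
        if message["role"] == "assistant":
            message["content"] = NEGATIVE_PREFIX + message["content"]
    return framed


def export_weighted(examples, high_threshold=0.5):
    """Duplicate high-advantage examples and reframe negative-advantage ones."""
    out = []
    for example in examples:
        advantage = example["advantage_weight"]
        if advantage > high_threshold:
            out.extend([example, example])          # 2x upweighting
        elif advantage < 0:
            out.append(frame_as_negative(example))  # included once, reframed
        else:
            out.append(example)
    return out
```

Repetition is a coarse stand-in for per-example loss weighting, but it works with any trainer that only accepts a flat JSONL file.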

Quality Gate: Merging Path B + Path E Filters

Path B uses simple thresholds: discard if `reward < -0.3` or `reward > 0.95`. Path E adds a sophisticated 3-stage filter:
1. Pass-rate filtering: For self-play questions with multiple attempts, keep only 0.1 <= pass_rate <= 0.9
2. LLM judge: Claude Haiku scores factual_accuracy, completeness, clarity, training_value (each 1-5). Accept if mean >= 3.5.
3. Deduplication: BM25 Jaccard > 0.7 triggers dedup

The compound quality gate uses Path B's thresholds for live trajectories (fast, no API cost) and Path E's full pipeline for self-play trajectories (more rigorous, worth the API cost for synthetic data).
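The channel-based dispatch might look like the sketch below. The function name and argument shapes are assumptions; the LLM judge is represented by a dict of scores already obtained, and the deduplication stage is omitted because it needs access to the existing training set.

```python
def passes_quality_gate(record, pass_rate=None, judge_scores=None):
    """Apply Path B thresholds to live/backfill records, Path E filters to self-play."""
    if record.get("channel") == "self_play":
        # Stage 1: keep questions that are neither unsolvable nor trivial.
        if pass_rate is None or not (0.1 <= pass_rate <= 0.9):
            return False
        # Stage 2: mean judge score across the rated dimensions.
        if not judge_scores:
            return False
        return sum(judge_scores.values()) / len(judge_scores) >= 3.5

    # Live and backfill records: cheap reward-band check, no API calls.
    reward = record["outcome"]["reward"]
    return -0.3 <= reward <= 0.95
```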

numu-weave Integration (Path B Section 6)

The existing `NUMUWeave` class at `[home]/bin/numu-daemon/packages/numu-weave/src/index.ts` already handles corpus building, training dispatch to Mac5 `:9200`, and evaluation. Path B extends it with:

  • New `source: "trajectory"` type in `CorpusEntry`
  • `advantageWeight` field for per-example weighting
  • `exportWeighted()` method applying repeat-based upweighting

This means the training pipeline runs through the existing numu-weave infrastructure rather than building a parallel system. The finetune-daemon on Mac5 already has hot-swap capability for fused adapters.

Training Schedule

Initial run (adapter v2):
- Source: all trajectories with `outcome.advantage != null` from unified store
- Expected: ~180-200 training examples after augmentation (Path B Section 4.1)
- Training: 1000 iterations, lr=5e-5, batch_size=1, num_layers=4, max_seq_length=256
- Duration: ~400s on Mac5 M4 (2x adapter v1's 188.4s for 2x iterations)
- Target loss: < 1.5 (vs. v1's 1.694 baseline)

Subsequent runs (adapter v3+):
- Triggered when Step 1 accumulates 50+ new annotated trajectories (Path B Section 4.4)
- Retrain from scratch each time (not fine-tuning on fine-tune) to avoid catastrophic forgetting
- Expected cadence: every ~10-17 days at 3-5 new tool trajectories/day

A/B Evaluation (Path B Section 5 + Step 3)

After training, the fused model at `:8100` is evaluated via pane routing, with panes split 50/50 between a control group and a treatment group.

The `ops_trigger_v2.py` from Step 3 checks a `karl_oapl_lite` flag per pane. When set:

python
if _is_karl_pane(session_id):
    plan = _query_mlx_server(prompt_text, skill_name, timeout_ms=200)
    injection = f"[Learned Tool Plan]\n{plan}\n\n" + skill_content

The 200ms timeout accounts for Mac5 Tailscale latency (~150ms inference + network). If it exceeds the budget, fall back to standard injection.

Evaluation period: 2 weeks. Success criterion: `treatment.success_rate - control.success_rate >= 0.05` (5 percentage point lift), measured using Step 2's outcome scores.
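The success criterion can be evaluated with a few lines once outcome scores are grouped by arm. The sketch assumes a trajectory counts as a success when its Step 2 outcome score is above 0; that threshold and the function names are assumptions, since the text does not define how scores map to success.

```python
def success_rate(scores, success_threshold=0.0):
    """Fraction of outcome scores strictly above the threshold; None if empty."""
    if not scores:
        return None
    return sum(1 for s in scores if s > success_threshold) / len(scores)


def promote_treatment(control_scores, treatment_scores, min_lift=0.05):
    """True when treatment beats control by at least min_lift in success rate."""
    control = success_rate(control_scores)
    treatment = success_rate(treatment_scores)
    if control is None or treatment is None:
        return False   # no decision without data in both arms
    return (treatment - control) >= min_lift
```

At 3-5 tool trajectories per day, two weeks yields only a few dozen samples per arm, so a 5-point lift measured this way should be read as a go/no-go signal rather than a statistically tight estimate.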

---

STEP 6: Self-Play Loop -- Building on Data + Rewards + Routing + L4 + Training

Inherits: Steps 1-5 -- trajectories are recorded (1), scored (2), used for routing (3), consumed by L4 (4), and feeding LoRA training (5). Now: how do we generate MORE trajectories on demand, especially for skills/domains with insufficient coverage?

The Coverage Problem

Steps 1-5 depend on trajectory volume. Path A estimates 500+ annotated records in 2-3 weeks of passive recording. But trajectory coverage is uneven: `ops:deploy` may accumulate 50 records/week while `ops:supabase` gets 5. The LoRA adapter from Step 5 will be biased toward high-frequency domains and ignorant of low-frequency ones.

Path E solves this by generating synthetic trajectories from our own codebase: mine questions from code/docs, solve them with real tools, record the trajectories, quality-filter, and feed into the training pipeline.

Question Generation (Path E Section 1)

Five tiers of corpus sources, mined by a Prefect flow:

| Tier | Source | Volume | Question Type |
|---|---|---|---|
| T1 | `[home-path]` topic files | 29 files, ~3,500 lines | Declarative lookup |
| T2 | `[home-path]` SKILL.md files (Gen 2) | 12 active, ~800 lines | Procedural |
| T3 | `flows/feed-hub/*.py` | 106 files | Implementation procedure |
| T4 | `[home-path]` all hook source | 34 hooks, 29 scripts | Hook behavior |
| T5 | `[home-path]` | ~400 filtered prompts | Contextual (what users ask) |

Five question types: Lookup, Procedure, Diagnostic, Architectural, Comparative. Each type has a different expected solver trajectory length and difficulty profile.

Questions go into `[home-path]`, tagged with type, source file, tier, and status.

Solver Agent (Path E Section 2)

Each question gets G=5 solve attempts (G=3 for expensive Architectural questions) via headless Claude Code sessions spawned using the Pane Spawn Protocol. The solver uses Read, Grep, Bash, and RAG++ queries to find answers.

Compound advantage over standalone Path E: The solver sessions are recorded by Step 1's Trajectory Tap. Every solver attempt automatically becomes a new trajectory record with `channel: "self_play"` and `karl_meta` populated:

```json
{
  "karl_meta": {
    "question_id": "qb-7fa3c2",
    "question_text": "What port does the Graph Kernel run on?",
    "question_type": "lookup",
    "attempt_index": 3,
    "extracted_answer": "Port 8001, running natively on cloud-vm",
    "answer_found": true,
    "pass": null,
    "reward": null
  }
}
```

Step 2's Reward Engine computes rewards for these trajectories the same way it does for live ones. But for self-play trajectories, there's an additional reward signal: answer correctness (checked via consensus across G attempts, Path E Section 4.2).
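
The consensus check could look roughly like the sketch below. It assumes the consensus is the most common normalized answer and that a strict majority of the G attempts must agree; Path E Section 4.2 may define consensus differently, so treat both rules as assumptions.

```python
from collections import Counter
from typing import List, Optional, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def consensus_pass_flags(
    answers: List[Optional[str]],
) -> Tuple[Optional[str], List[bool], float]:
    """Return (consensus answer, per-attempt pass flags, pass rate).

    Assumption: an attempt passes only if it matches an answer that a strict
    majority of all attempts (including failed ones) agree on.
    """
    found = [normalize(a) for a in answers if a]
    if not found:
        return None, [False] * len(answers), 0.0
    top, count = Counter(found).most_common(1)[0]
    if count * 2 <= len(answers):
        return None, [False] * len(answers), 0.0
    flags = [bool(a) and normalize(a) == top for a in answers]
    return top, flags, sum(flags) / len(answers)
```

Exact matching after normalization is crude: "Port 8001" and "8001" count as different answers here, so a real implementation would probably need a fuzzier comparison.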

Quality Filtering Pipeline (Path E Section 4)

Three filters applied in sequence:

1. Pass-rate filter: Discard if pass_rate < 0.1 (unsolvable) or > 0.9 (trivial). Keep the sweet spot 0.1-0.9.

2. LLM judge: Claude Haiku evaluates factual accuracy, completeness, clarity, and training value. Threshold: mean >= 3.5.

3. Deduplication: Discard if BM25 Jaccard similarity to any example in the existing training set exceeds 0.7.

Filtered trajectories flow into Step 5's training pipeline. High-pass-rate trajectories with clear tool sequences also flow into SKILL.md improvement (Path E Section 7.1): the modal tool sequence from successful solver trajectories replaces the generic 4-step workflow in the corresponding SKILL.md.
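
The three filters can be sketched as a single predicate. This version takes the judge's rubric scores as an input rather than calling a model, and it uses plain token-set Jaccard as a stand-in for the similarity measure; both choices are assumptions made for illustration.

```python
from typing import Dict, Iterable, Set


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase token sets of two strings."""
    ta: Set[str] = set(a.lower().split())
    tb: Set[str] = set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def passes_filters(
    candidate_text: str,
    pass_rate: float,
    judge_scores: Dict[str, float],
    training_texts: Iterable[str],
) -> bool:
    """Apply the three filters in order, stopping at the first failure."""
    # 1. Pass-rate filter: drop unsolvable (< 0.1) and trivial (> 0.9) questions.
    if pass_rate < 0.1 or pass_rate > 0.9:
        return False
    # 2. LLM judge: the mean over the rubric scores must reach 3.5.
    if not judge_scores or sum(judge_scores.values()) / len(judge_scores) < 3.5:
        return False
    # 3. Deduplication: drop near-duplicates of existing training examples.
    return all(token_jaccard(candidate_text, t) <= 0.7 for t in training_texts)
```

Ordering the filters from cheapest to most expensive means the judge is only consulted for questions that already fall in the useful difficulty band.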

SKILL.md Evolution (Path E Output A)

This is the fastest path to value, requiring zero LoRA training:

Before (static template from `generator.py`):

```markdown
## Workflow
1. Check current state
2. Execute operation
3. Verify success
4. Report outcome
```

After (trajectory-derived, from 12 successful solver trajectories):

```markdown
## Workflow -- Deploy Prefect Flow to cloud-vm

### Verified Tool Sequence (pass_rate=0.83)
1. Read current flow: `ssh cloud-vm prefect flow ls`
2. Verify Docker stack: `ssh cloud-vm docker ps | grep prefect`
3. Copy flow file (scp, NOT heredoc): `scp flows/feed-hub/{flow}.py cloud-vm:[home-path]`
4. Register: `ssh cloud-vm "cd [home-path] && python3 {flow}.py"`
5. Verify in Prefect UI: `open http://localhost:4200`

### Gotchas (from failed trajectories)
- SSH heredoc mangles `${...}` -- always scp files
- Flow name in registry != filename. Use `@flow(name="...")` decorator
```
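
Deriving the "modal tool sequence" from solver trajectories might look like the sketch below. The trajectory shape (`passed`, `tools`) and the `min_support` cutoff are hypothetical; the real `skill_updater.py` may group or normalize steps differently.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def modal_tool_sequence(
    trajectories: Sequence[dict], min_support: int = 2
) -> Optional[Tuple[str, ...]]:
    """Most common tool sequence among passing trajectories, or None.

    Each trajectory is assumed to look like
    {"passed": bool, "tools": ["step one", "step two", ...]}.
    """
    counts = Counter(tuple(t["tools"]) for t in trajectories if t.get("passed"))
    if not counts:
        return None
    sequence, support = counts.most_common(1)[0]
    return sequence if support >= min_support else None


def render_workflow(sequence: Sequence[str]) -> str:
    """Render a tool sequence as a numbered markdown list for SKILL.md."""
    return "\n".join(f"{i}. {step}" for i, step in enumerate(sequence, start=1))
```

Requiring a minimum support avoids rewriting a SKILL.md workflow from a single lucky trajectory.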

Bootstrapping Schedule (Path E Section 5)

Phase 0 (Week 1): Seed run. 200 questions from T1+T2. G=3. Validate pipeline. ~120 pass filtering.

Phase 1 (Weeks 2-4): Daily batch. 50 questions/run. Corpus rotation (T1 Mon/Fri, T2 Tue, T3 Wed, T4/T5 Thu). G=5. Prefect flow `karl_daily_batch` at 03:00 UTC.

Phase 2 (Week 4+): Trigger Mac5 LoRA training when training set reaches 500 filtered examples.

Phase 3 (Week 6+): Bootstrapped iteration. The improved LoRA model from Step 5 becomes the solver for the next batch. Better model -> better trajectories -> better training data -> better model. KARL's iterative bootstrapping flywheel.

Cost Control (Path E Section 9)

| Cadence | Weekly Cost | Monthly Cost |
|---------|-------------|--------------|
| Conservative (50 Q/week) | ~$1.44 | ~$6 |
| Full speed (200 Q/week) | ~$6.23 | ~$25 |

Mac5 LoRA training: electricity only (~$0.05/run). The entire self-play pipeline costs less than a Spotify subscription.

Freshness Guard (Path E Section 8)

Questions go stale when the codebase changes. The freshness watcher subscribes to the Mesh Event Bus (`:8600`) for `file_modified` events on corpus sources. Changed files -> mark derived questions as `stale` -> regenerate in next batch. Weekly drift detection re-solves 20 random Q&As against the current codebase and flags answers with Jaccard similarity < 0.5 against stored consensus.
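
The two freshness checks can be sketched as plain functions. The event-bus subscription is left out, and the question dictionary keys (`source_file`, `status`) are assumed names; only the `stale` status and the 0.5 Jaccard threshold come from the text.

```python
from typing import Dict, List


def mark_stale(questions: List[Dict], modified_path: str) -> int:
    """Flag every question derived from the modified file; return how many changed."""
    changed = 0
    for q in questions:
        if q["source_file"] == modified_path and q["status"] != "stale":
            q["status"] = "stale"
            changed += 1
    return changed


def answer_drifted(stored: str, fresh: str, threshold: float = 0.5) -> bool:
    """True when the re-solved answer shares too few tokens with the stored one."""
    a, b = set(stored.lower().split()), set(fresh.lower().split())
    if not a and not b:
        return False
    return len(a & b) / len(a | b) < threshold
```

A handler for `file_modified` events would call `mark_stale` with the changed path, while the weekly drift job would call `answer_drifted` on each re-solved sample.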

---

STEP 7: Feedback Closure -- Building on All Six Previous Steps

Inherits: Steps 1-6 -- trajectory recording (1), reward computation (2), embedding-based routing (3), L4 evolution (4), LoRA training (5), and synthetic self-play (6) are all in place. Now: how do they form closed loops where improvements compound?

Three Learning Loops

The KARL-integrated system contains three distinct feedback loops operating at different timescales:

Loop 1: The Routing Loop (Steps 1-3, timescale: minutes)

User prompt arrives
    -> Step 3: vector routing selects skill
    -> Step 1: Trajectory Tap records tool sequence
    -> Step 2: Reward Engine computes outcome score
    -> Step 3: weight_updater.py adjusts trajectory_weight via EMA
    -> NEXT user prompt: routing is slightly more informed

This is the fastest loop. Within a single session, the routing weights update after each Stop event. A skill that fails in Turn 1 has slightly lower weight by Turn 5. Over a week of operation, the routing weights converge toward the empirically best skill for each prompt class.

Convergence bound: The EMA with `ALPHA=0.1` means each observation moves the weight by at most 10% of the gap between the observed outcome and the current weight.
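
A minimal sketch of the EMA update, assuming the value being averaged is the Step 2 outcome score (the actual target used by `weight_updater.py` is not shown in this section). The second helper counts how many observations it takes for the starting weight's influence to fade below a tolerance.

```python
ALPHA = 0.1  # smoothing factor named in the text


def ema_update(weight: float, outcome_score: float, alpha: float = ALPHA) -> float:
    """Move the weight a fraction alpha of the way toward the new observation."""
    return (1.0 - alpha) * weight + alpha * outcome_score


def updates_until_within(tolerance: float, alpha: float = ALPHA) -> int:
    """Observations needed before the initial weight's influence drops below tolerance."""
    steps, residual = 0, 1.0
    while residual > tolerance:
        residual *= 1.0 - alpha
        steps += 1
    return steps
```

With `ALPHA=0.1`, `updates_until_within(0.1)` returns 22, since 0.9^22 is about 0.098. A skill therefore needs roughly two dozen scored invocations before its starting weight stops dominating.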

Loop 2: The Training Loop (Steps 1-2-5-6, timescale: weeks)

Step 1: Trajectories accumulate (live + self-play)
    -> Step 2: Rewards and advantages computed in daily batch
    -> Step 5: When 50 new examples accumulate, LoRA training runs on Mac5
    -> Step 5: Fused model deployed at :8100, A/B tested
    -> Step 6: Improved model used as solver for next self-play batch
    -> Step 1: Better solver trajectories -> better training data
    -> CYCLE REPEATS with iteratively improving model

This is KARL's bootstrapping flywheel adapted for our stack. Each LoRA iteration (v2, v3, v4...) produces a better solver that generates higher-quality training data for the next iteration. Path B estimated ~10-17 days between retraining runs. Path E's self-play accelerates this to ~7 days by generating 50 questions/night.
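
The daily batch's advantage computation, written `A = (r - V_baseline) / beta` in the Step 8 diagram, can be sketched as follows. The sketch assumes `V_baseline` is the mean reward within each domain and uses `beta=1.0` as a placeholder default; neither value is specified in this section.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def compute_advantages(records: List[Dict], beta: float = 1.0) -> List[float]:
    """Return A = (r - V_baseline) / beta for each record, baselined per domain."""
    rewards_by_domain: Dict[str, List[float]] = defaultdict(list)
    for rec in records:
        rewards_by_domain[rec["domain"]].append(rec["reward"])
    # Assumed baseline: the mean reward observed in the record's own domain.
    baselines = {d: mean(rs) for d, rs in rewards_by_domain.items()}
    return [(rec["reward"] - baselines[rec["domain"]]) / beta for rec in records]
```

Baselining per domain keeps a hard domain with uniformly low rewards from being treated as uniformly bad training data.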

Convergence bound: KARL's paper showed 2 iterations of bootstrapping improved from 52.6 to 67.5 (a 14.9-point gain). Our gains will be smaller (we're optimizing routing, not core reasoning), but the same diminishing-returns curve applies. Expect the largest improvement from v1 to v2, a smaller one from v2 to v3, and a plateau around v4-v5.

Loop 3: The Evolution Loop (Steps 1-2-4, timescale: months)

Step 1: Trajectories from EW dispatches accumulate
    -> Step 2: L4 fitness contributions computed
    -> Step 4: L4 fires every 90 L1 steps
    -> Step 4: Affinity matrix updated, duration priors refined
    -> Step 4: PulseBridge routing updated via apply_to_pulse_bridge()
    -> NEXT heartbeat cycle: agent selection uses learned preferences
    -> Step 1: Next dispatch produces a trajectory
    -> CYCLE REPEATS

This is the slowest loop. L4 fires every 90 L1 steps. At typical heartbeat rates, that's every few hours. It takes 3+ L4 generations (multiple days) to accumulate enough trajectory evidence to shift agent routing weights significantly, bounded by L4-I2's 0.30 max shift per generation.

Convergence bound: The technique-agent affinity matrix (Path C Section 2b) has at most 10 technique prefixes x 3 agents = 30 cells. Each cell converges via EMA with L3's learning rate (typically 0.1-0.3). With L4 firing ~3x/day, full matrix convergence takes 2-4 weeks.

How the Loops Interact

The three loops are not independent. They share data (Step 1's store) and reward signals (Step 2's engine):

Loop 1 feeds Loop 2: Every routing decision in Loop 1 produces a trajectory. Those trajectories accumulate in the unified store. When Loop 2 triggers a LoRA retraining, it trains on ALL accumulated trajectories, including the routing-optimized ones from Loop 1. Better routing -> better training examples -> better model.

Loop 2 feeds Loop 1: When the LoRA model improves and its tool plans are injected into SKILL.md context (Step 5's A/B treatment), the resulting sessions produce higher-quality trajectories. Those trajectories have higher reward scores, which update Loop 1's routing weights more strongly. Better model -> better outcome scores -> stronger weight updates.

Loop 3 feeds Loops 1-2: When L4 improves agent routing, the EW dispatches produce better outcomes (more builds passing, more files changed). Those improved trajectories flow into the unified store and eventually into LoRA training. Better agent routing -> better fleet trajectory quality -> better training data.

Loops 1-2 feed Loop 3: As Loop 1's routing improves, skills produce better outcomes. As Loop 2's model improves, tool plans produce cleaner executions. Both effects increase the fleet-wide trajectory success rate that L4 observes, which increases L4's fitness signal, which stabilizes L4's routing (L4-M4 strategy lock activates when things are going well).

What Breaks the Compound

The primary failure mode is feedback delay masking real degradation. If a codebase change breaks a procedure (e.g., a service moves ports), the existing trajectory weights reflect historical success. The routing weight for the now-broken skill stays high for days until enough new failures accumulate to drag it down.

Mitigations:
- Path E's freshness guard (Step 6): detects file changes and marks questions as stale, triggering re-solve
- Step 2's correction signal: detects user corrections within 120 seconds, immediately penalizing the trajectory
- Path C's L4-I1 invariant: prevents L4 from converging to a fixed policy, ensuring it keeps exploring alternatives
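
The correction-signal mitigation could be sketched like this. Only the 120-second window comes from the text; the three regex patterns are hypothetical examples, not the detector's actual 14 correction patterns.

```python
import re
from datetime import datetime, timedelta
from typing import Sequence

CORRECTION_WINDOW = timedelta(seconds=120)  # window named in the text
# Hypothetical patterns; the real detector's 14 patterns are not reproduced here.
EXAMPLE_PATTERNS = (r"\bthat'?s wrong\b", r"\bundo\b", r"^no[,.! ]")


def correction_detected(
    stop_time: datetime,
    next_prompt_time: datetime,
    next_prompt: str,
    patterns: Sequence[str] = EXAMPLE_PATTERNS,
) -> bool:
    """True if the next prompt arrives within the window and looks like a correction."""
    elapsed = next_prompt_time - stop_time
    if elapsed < timedelta(0) or elapsed > CORRECTION_WINDOW:
        return False
    text = next_prompt.strip().lower()
    return any(re.search(p, text) for p in patterns)
```

Because the check runs on the very next prompt, a broken procedure can be penalized after one bad session instead of waiting for routing weights to decay.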

The Decay Integration (Path A Section 5.5 + Step 2)

The existing decay detector at `[home-path]` currently measures only invocation frequency (days inactive). With Step 2's reward data available, decay detection gains a quality dimension:

```python
# In detect_stale_skills(), add quality gate:
metrics = load_skill_metrics()
skill_data = metrics.get("skills", {}).get(name, {})
if skill_data.get("success_rate", 1.0) < 0.2 and skill_data.get("invocations", 0) >= 20:
    action = "disable"  # Not just stale, actively harmful
```

A skill that is frequently invoked but consistently produces bad outcomes (a success rate below 0.2 across 20 or more invocations in the gate above, or lift < 0 over baseline) is worse than a stale skill. The compound system can now distinguish "unused" from "harmful."

---

STEP 8: Unified Architecture -- Building on Everything

Inherits: Steps 1-7 -- the full system exists. Now: the complete picture showing data flow, component interactions, and the three learning loops.

System Diagram

                                KARL Unified Architecture
                        ========================================

    USER PROMPT
         |
         v
    [ops_trigger_v2.py] -----> embed prompt (RAG++ :8000, ~80ms)
         |                        |
         |                   [skill_embeddings.pkl]
         |                   10 skills x 768 dims
         |                   trajectory_weight per skill
         |                        |
         v                        v
    Select skill         cosine_sim * trajectory_weight
    (highest score          |
     above 0.35)            |
         |                  |
         v                  v
    [SKILL.md injection] <--'
         |
         +--- if karl_oapl_lite pane: query Mac5 :8100 for tool plan (200ms)
         |    prepend [Learned Tool Plan] to injection
         |
         v
    SESSION EXECUTES (tools called, files read/modified, bash commands)
         |
         |--- Tap B fires per tool call (post_tool_hook.py, ~8ms each)
         |    accumulates tool events in session buffer
         |
         v
    [STOP EVENT]
         |
         +--- Tap C fires: flush session buffer -> TrajectoryRecord
         |    writes to [home-path]
         |
         +--- Reward Engine (Step 2): compute process reward [0,1]
         |    (process_cleanliness, file_modification, error_signal)
         |
         +--- weight_updater.py: EMA update trajectory_weight for matched skill
         |    (Loop 1: Routing Loop, updates in <50ms)
         |
         v
    [NEXT USER PROMPT]
         |
         +--- Tap D: annotate previous record with cross-turn signals
         |    (correction_detected, redo_detected, session_continued)
         |
         +--- Reward Engine completes: outcome score [-1,1]
         |
         v
    trajectories.jsonl (unified store)
         |
         +---> [Daily Advantage Batch] (Prefect)
         |     compute A = (r - V_baseline) / beta per domain
         |     patch outcome.advantage in store
         |
         +---> [30-min Metrics Aggregator] (Prefect)
         |     compute per-skill success_rate, lift, trend
         |     write skill_metrics.json
         |     serve via Dashboard API GET /api/karl/skill-metrics
         |
         +---> [L4 Controller] (every 90 L1 steps)
         |     filter for CALC dispatches
         |     M1: affinity reinforcement
         |     M2: duration prior update
         |     M3: search template mutation
         |     M4: strategy lock/unlock
         |     M5: agent weight redistribution
         |     -> apply_to_pulse_bridge()
         |     (Loop 3: Evolution Loop)
         |
         +---> [Training Pipeline] (triggered at 50 new examples)
         |     quality_filter.py -> sft_formatter.py -> MLX LoRA on Mac5
         |     -> fused model at :8100 -> A/B evaluation
         |     (Loop 2: Training Loop)
         |
         +---> [Self-Play Loop] (Prefect daily 03:00 UTC)
               question generator -> solver agent panes
               -> trajectories -> quality filter
               -> SKILL.md improvement (near-term)
               -> training dataset (medium-term)
               -> bootstrapped iteration (long-term)
               (Feeds back into trajectories.jsonl with channel: "self_play")


    ---- INFRASTRUCTURE MAP ----

    Mac1:     ops_trigger_v2.py, trajectory_tap.py, hooks, embedding_cache.pkl
    Mac5:     MLX Server :8100, finetune-daemon :9200, LoRA training
    cloud-vm: RAG++ :8000, pgvector (skill_embeddings table), Prefect :4200,
              Graph Kernel :8001, Nexus Portal :3001 (/karl page)
    Supabase: karl_trajectories (mirror), ew_l4_steps, ew_trajectory_log
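
The skill-selection step at the top of the diagram can be sketched in a few lines. It assumes the 0.35 floor applies to the combined `cosine_sim * trajectory_weight` score, not to the raw cosine similarity; the diagram is ambiguous on that point. The function names are illustrative.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

SCORE_THRESHOLD = 0.35  # selection floor shown in the diagram


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_skill(
    prompt_vec: Sequence[float],
    skills: Dict[str, Tuple[Sequence[float], float]],
) -> Optional[str]:
    """Pick the skill with the highest cosine_sim * trajectory_weight above the floor.

    `skills` maps skill name -> (embedding, trajectory_weight).
    """
    best_name, best_score = None, SCORE_THRESHOLD
    for name, (embedding, weight) in skills.items():
        score = cosine_similarity(prompt_vec, embedding) * weight
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Multiplying by the trajectory weight lets a semantically close skill with a poor track record lose to a slightly less similar skill that has been succeeding.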

Component Registry

| Component | Location | Lines | Consumer (downstream) | Producer (upstream) |
|-----------|----------|-------|-----------------------|---------------------|
| trajectory_tap.py | `[home-path]` | ~200 | trajectories.jsonl | hooks (Taps A-D) |
| trajectory_extractor.py | `[home-path]` | ~180 | trajectories.jsonl | verbose-all.jsonl |
| reward_engine.py | `[home-path]` | ~250 | outcome fields in store | trajectory records |
| ops_trigger_v2.py | `[home-path]` | ~250 | SKILL.md injection | prompt embeddings + weights |
| embedding_cache.py | `[home-path]` | ~80 | ops_trigger_v2 | pgvector skill_embeddings |
| weight_updater.py | `[home-path]` | ~120 | skill_embeddings weights | outcome scores |
| metrics_aggregator.py | `[home-path]` | ~150 | skill_metrics.json | trajectories.jsonl |
| quality_filter.py | `[home-path]` | ~200 | filtered training set | raw trajectories |
| sft_formatter_trajectory.py | `[home-path]` | ~180 | MLX LoRA dataset | filtered trajectories |
| l4_controller.py | `[home-path]` | ~400 | ToolPreferenceGenome | CALC trajectories |
| question_generator.py | `[home-path]` | ~250 | question_bank.jsonl | corpus sources T1-T5 |
| solver_orchestrator.py | `[home-path]` | ~200 | solver pane spawns | question_bank |
| skill_updater.py | `[home-path]` | ~150 | SKILL.md patches | filtered trajectories |
| skill_embedding_refresh.py | `flows/feed-hub/` | ~100 | pgvector updates | skill content changes |

Total new code: ~2,710 lines across 14 files.

Prefect Flow Registry

| Flow | Schedule | Host | Purpose |
|------|----------|------|---------|
| karl_advantage_batch | Daily 02:00 UTC | cloud-vm | Compute advantages across all trajectories |
| karl_metrics | Every 30 min (`*/30 * * * *`) | cloud-vm | Aggregate per-skill success metrics |
| karl_daily_batch | Daily 03:00 UTC | cloud-vm | Self-play question generation + solving |
| karl_training_trigger | Daily 04:00 UTC | cloud-vm | Check if 50+ new examples, trigger Mac5 LoRA |
| skill_embedding_refresh | Daily 05:00 UTC | cloud-vm | Re-embed skills with changed content |
| karl_drift_check | Weekly Sun 06:00 UTC | cloud-vm | Re-verify Q&A answers against live codebase |
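
The check inside `karl_training_trigger` might reduce to a count of filtered examples newer than the last training run. This is a sketch under that assumption; the timestamp representation and function name are hypothetical, and the actual Prefect task and Mac5 hand-off are not shown.

```python
from typing import Iterable

RETRAIN_THRESHOLD = 50  # new filtered examples needed to trigger a run


def should_trigger_training(
    example_timestamps: Iterable[float], last_trained_at: float
) -> bool:
    """True when at least RETRAIN_THRESHOLD filtered examples postdate the last run."""
    new_examples = sum(1 for ts in example_timestamps if ts > last_trained_at)
    return new_examples >= RETRAIN_THRESHOLD
```

Running this once a day at 04:00 UTC, after the 03:00 self-play batch, means freshly generated trajectories are counted the same night.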

Supabase Tables (3 new, consolidating Path C + D + E proposals)

| Table | Purpose | Writes From | Reads From |
|-------|---------|-------------|------------|
| karl_trajectories | Mirror of trajectories.jsonl for cross-machine access | Daily sync job | L4, Nexus Portal |
| skill_embeddings | Skill vectors with trajectory weights | embedding_refresh flow | ops_trigger_v2 (via cache) |
| ew_l4_steps | L4 generation history | L4 controller | Nexus Portal /evolution |

Nexus Portal Pages (2 new)

/karl -- Skill performance dashboard:
- Ranked skills by success_rate with lift-over-baseline
- Trend sparklines (7-day moving average)
- Top tool sequences per skill with mean scores
- Self-play question bank funnel (generated -> filtered -> training)
- LoRA adapter version history with loss curves

/karl/skills -- Skill evolution viewer:
- Current vs. proposed SKILL.md content (diff view)
- Trajectory-derived workflow vs. static template
- Approve/reject proposed updates

Implementation Sequence

Week 1 (Days 1-5): Foundation
- Day 1: Create `[home-path]` package. Write `trajectory_tap.py`, `trajectory_extractor.py`.
- Day 2: Wire Taps A-D into existing hooks (+20 lines total across 4 files).
- Day 3: Write `reward_engine.py`. Backfill rewards for existing trajectories.
- Day 4: Write `metrics_aggregator.py`. Deploy Prefect flow `karl_metrics`.
- Day 5: Validate: 24h of live recording, check trajectories.jsonl has 10-40 records.

Week 2 (Days 6-10): Routing Intelligence
- Day 6: Write `embedding_cache.py`, `ops_trigger_v2.py`.
- Day 7: Bootstrap skill embeddings from current SKILL.md content via RAG++.
- Day 8: Deploy in shadow mode alongside existing regex router.
- Day 9: Write `weight_updater.py`. Deploy on Stop hook.
- Day 10: Deploy `skill_embedding_refresh.py` Prefect flow.

Week 3 (Days 11-15): Accumulation + Self-Play
- Day 11-12: Write `question_generator.py`, `solver_orchestrator.py`.
- Day 13: Phase 0 seed run: 200 questions, G=3.
- Day 14-15: Write `quality_filter.py`. Process seed results. Deploy `karl_daily_batch`.

Week 4 (Days 16-20): Training Pipeline + L4
- Day 16: Write `sft_formatter_trajectory.py`. Extend numu-weave with trajectory source.
- Day 17: First LoRA training run on Mac5 (adapter v2).
- Day 18-19: Write `l4_controller.py`. Wire into Evolution World.
- Day 20: Deploy A/B evaluation. Promote vector router if shadow mode validates.

Week 5-6: Compound
- Full pipeline operational. All three loops running.
- Monitor convergence of routing weights, training loss, L4 fitness.
- Trigger adapter v3 when 50+ new trajectories accumulate post-v2.

Success Metrics

| Metric | Baseline (current) | Target (Week 6) | Measurement |
|--------|--------------------|-----------------|-------------|
| Skill routing accuracy | Unknown (no measurement) | 70% | |
| Skill lift over baseline | Unknown | > 0.05 for 8/10 active skills | skill_metrics.json |
| LoRA first-action accuracy | ~0.35 (random for 4 tools) | > 0.60 | Held-out test set |
| LoRA plan accuracy (LCS) | ~0.20 | > 0.50 | Held-out test set |
| L4 fleet trajectory success rate | Unknown | > 0.50 (weighted) | l4_fitness in ew_l4_steps |
| Self-play question bank size | 0 | 500+ passing questions | question_bank.jsonl |
| SKILL.md trajectory coverage | 0/10 skills | 8/10 skills with trajectory-derived workflows | skill_updater diffs |
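
The "plan accuracy (LCS)" metric can be computed with the standard longest-common-subsequence dynamic program. Dividing by the longer of the two sequences is an assumed normalization; the source does not say which denominator the held-out evaluation uses.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for item_a in a:
        curr = [0]
        for j, item_b in enumerate(b, start=1):
            if item_a == item_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def plan_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """LCS length divided by the longer sequence length (assumed normalization)."""
    longest = max(len(predicted), len(actual))
    return lcs_length(predicted, actual) / longest if longest else 1.0
```

For example, a predicted plan of `Read, Grep, Bash, Edit` against an actual sequence of `Read, Bash, Edit, Bash` shares the subsequence `Read, Bash, Edit`, giving a plan accuracy of 0.75.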

What This System Cannot Do

1. Replace Claude Code's reasoning. KARL-style LoRA training on a 1B-parameter model will not match Claude Opus 4.6's reasoning. The LoRA model suggests tool plans; Claude Code still does the actual work. The system optimizes which context Claude receives, not Claude itself.

2. Handle novel domains. The vector routing and trajectory weights are calibrated on the 10 operational domains defined in `extractor.py` (lines 70-81). A genuinely new domain (e.g., "blockchain") starts with uniform weights and no trajectory history. Self-play can accelerate bootstrapping, but there is an inherent cold-start period.

3. Guarantee convergence. The three learning loops have interaction effects that may produce oscillation (L4 changes agent routing, which changes trajectory quality, which changes skill weights, which changes routing, etc.). The invariant system (L4-I1 through L4-I4) and bounded EMA updates limit amplitude, but they do not guarantee monotonic improvement.

4. Scale beyond Mac5. The LoRA training pipeline is bottlenecked at Mac5's M4 16GB. Training on larger models or with larger batch sizes requires additional hardware. The architecture is designed for this constraint -- OAPL-Lite explicitly drops online rollouts and compression steps to fit on a single M4.

---

Sources

### Stage 0 Research
- `[home]/Desktop/evo-cube-output/karl-trajectory-intelligence/stage0-research.md` -- Full research: Cortex architecture, hooks, EW, KARL paper analysis, constraints

### Stage 1 Paths
- `[home]/Desktop/evo-cube-output/karl-trajectory-intelligence/stage1-path-a.md` -- Trajectory Tap: 4 tap points, outcome signals, TrajectoryRecord schema, per-skill metrics
- `[home]/Desktop/evo-cube-output/karl-trajectory-intelligence/stage1-path-b.md` -- OAPL-Lite: advantage-weighted SFT, 157 tool trajectories in verbose-all.jsonl, reward function, numu-weave integration
- `[home]/Desktop/evo-cube-output/karl-trajectory-intelligence/stage1-path-c.md` -- EW L4: ToolPreferenceGenome, 5 mutation operators, 4 invariants, agent routing
- `[home]/Desktop/evo-cube-output/karl-trajectory-intelligence/stage1-path-d.md` -- Skill Embeddings: vector routing, trajectory-weighted similarity, two-tier learning
- `[home]/Desktop/evo-cube-output/karl-trajectory-intelligence/stage1-path-e.md` -- Synthetic Self-Play: question generation, solver agents, quality filtering, SKILL.md improvement

### Codebase Files Referenced
- `[home-path]` (233 lines) -- Current regex routing, Tap A + D insertion points
- `[home-path]` (287 lines) -- Domain taxonomy (10 domains, lines 70-81)
- `[home-path]` (178 lines) -- Static SKILL.md generation
- `[home-path]` (160 lines) -- 14 correction patterns, outcome signal source
- `[home-path]` (251 lines) -- Decay ladder, quality gate insertion point
- `[home-path]` (169 lines) -- CortexEntry types, JSONL persistence
- `[home-path]` (1283 lines) -- VerboseToolCall extraction, Tap C insertion
- `[home-path]` (287 lines) -- Per-tool tracking, Tap B insertion
- `[home-path]` (404 lines) -- 3,909 entries with tool_calls arrays
- `[home-path]` -- 3,249 entries, 157 with multi-step tool sequences
- `[home-path]` (966 lines) -- 5-phase heartbeat, CALC dispatch, fitness tracking
- `[home-path]` (401 lines) -- L1/L2/L3 coupling, L4 insertion at line 244
- `[home-path]` -- PulseResult, select_calc_agent (lines 109-123)
- `[home-path]` -- MethodGenome, _l4_strategy_locked guard
- `[home-path]` -- L3Genome, line 8: "L4 would optimize 3-4 floats"
- `[home-path]` -- 4 non-halting invariants, L4 extension points

### External
- KARL paper: arXiv 2603.05218 -- OAPL algorithm, nugget-based rewards, self-play data pipeline
- KARL performance: GLM 4.5 Air 52.6 -> KARL 67.5 (matching Claude Opus 4.6) via trajectory RL


Source Anchor

evo-cube-output/karl-trajectory-intelligence/stage2-compound.md
