Grand Diomande Research · Full HTML Reader

KARL Integration — Evolution³ / Stage 1: PATH E

Path E adapts KARL's synthetic self-play pipeline to our living codebase. Instead of mining static enterprise documents, our question generator reads our own code, memory files, hooks, flows, and configs to produce domain-specific questions. A solver agent then attempts to answer each question using our actual tool stack (Read, Grep, Bash, RAG++, GK). Every attempt is recorded as a trajectory. Trajectories are quality-filtered, then used to either improve SKILL.md content (near-term, zero training cost) or train a

Agents That Account for Themselves proposal experiment writeup candidate score 22 .md

# KARL Integration — Evolution³ / Stage 1: PATH E
Run: karl-trajectory-intelligence
Path: E — Synthetic Self-Play: Generate Our Own Training Data from the Codebase
Stage: 1 of 4 (Explore + Design)
Generated: 2026-03-10
Method: Evolution³ — four-stage recursive evoflow (research-grounded)
Run Directory: Desktop/evo-cube-output/karl-trajectory-intelligence/

---

Executive Summary

Why this is the right path over alternatives:
- Path A (add OAPL directly) requires distributed GPU training we do not have.
- Path B (plug in KARL's external model weights) gives us someone else's knowledge, not ours.
- Path C (add reward signals to existing Cortex) improves routing but not knowledge depth.
- Path D (use RAG++ as the KARL search tool) improves retrieval but not the agent's procedures.
- Path E generates our own proprietary training signal from our own codebase — a moat no external model can replicate.

---

1. Question Generation

1.1 Corpus Sources

The question generator mines five tiers of source material, ordered by information density:

| Tier | Source | Volume | Why |
|---|---|---|---|
| T1 | `[home-path]` topic files | 29 files, ~3,500 lines | Curated operational knowledge with hard-won gotchas |
| T2 | `[home-path]` SKILL.md files (Gen 2 only) | 12 active, ~800 lines | Procedure documents with trigger patterns and workflows |
| T3 | `flows/feed-hub/*.py` (Prefect flow source) | 106 files | Actual implementation truth for flow/task questions |
| T4 | `[home-path]` (all hook source files) | 34 hooks, 29 scripts | Ground truth for hook behavior questions |
| T5 | `[home-path]` (filtered) | ~400 real human prompts after SKIP_PATTERNS | Reveals what humans actually need to know |

T1 and T2 generate declarative questions ("What port does the Graph Kernel run on?", "What is the minimum line count for memory/active-tasks.md?"). T3 and T4 generate procedural questions ("How do you deploy a new Prefect flow to cloud-vm?", "What does the ops_trigger hook do when two panes claim the same domain?"). T5 generates contextual questions based on what users historically asked.

1.2 Question Types

Five question archetypes, each with different solver profiles and reward signals:

Type 1: Lookup (T1-T2 primary)
Exact-answer questions with a single nugget.
Examples:
- "What Tailscale IP does Mac5 use?"
- "What is the WebSocket port for the NUMU Bus?"
- "What model does the MLX Server on Mac5 serve?"
Expected solver trajectory: 1-2 tool calls (Read MEMORY.md or Grep for IP pattern). Short. High success rate.

Type 2: Procedure (T3-T4 primary)
Multi-step how-to questions requiring ordered steps.
Examples:
- "How do you add a new Prefect flow to the feed-hub?"
- "What are the steps to run a LoRA fine-tune on Mac5?"
- "How do you spawn a pane on Mac2 from Mac1?"
Expected solver trajectory: 5-15 tool calls (Read file, check imports, grep for deploy pattern, read .service file). Medium length. Moderate success rate.

Type 3: Diagnostic (T5 primary — "debug/fix" domain)
Problem-to-solution questions extracted from correction patterns.
Examples:
- "Why does `mlx_lm.lora` fail on Mac5 with v0.29+?"
- "Why does the pane spawn miss ~30
- "What causes the Graph Kernel to be unreachable from Docker containers?"
Expected solver trajectory: 3-8 tool calls (Grep for error, Read relevant config, trace code path). Medium length. Low success rate initially (these are hard precisely because they're gotchas).

Type 4: Architectural (T1-T4 cross-referenced)
Design-level questions requiring synthesis across multiple files.
Examples:
- "What is the full data flow from a user prompt to a SKILL.md injection?"
- "How do the Cortex hooks interact with the pane orchestrator?"
- "What is the relationship between the NUMU Bus and the Mesh Event Bus?"
Expected solver trajectory: 10-25 tool calls. Long. Low initial success rate. Highest value when solved correctly.

Type 5: Comparative (T2-T5 synthetic)
"Should I use X or Y?" questions with trade-off answers.
Examples:
- "When should a new tool use the Bash hook vs. a PostToolUse hook?"
- "What is the difference between unified.jsonl and verbose-all.jsonl?"
- "When should a new skill target mac1 vs. cloud-vm?"
Expected solver trajectory: 5-12 tool calls (Read two files, compare schemas). Medium length. High value for disambiguation.

1.3 Question Generation Process

The generator runs as a Prefect flow. Each batch:

Step 1: Select source shard
  - Round-robin over Tier 1-5 sources, weighted by tier priority (T1 = 3x, T2 = 2x, T3-T5 = 1x)
  - For T3-T5: randomly sample 5-10 files to avoid context overflow

Step 2: Chunk the shard
  - Sliding window over source text: 2K token chunks with 200 token overlap
  - Skip chunks with < 50 meaningful tokens (table of contents, blank sections)

Step 3: Call Claude Haiku with generation prompt:
  System: "You are a question generator for an internal ops knowledge base. Generate questions
           a new engineer would ask when working with this system. Questions must be answerable
           from code or documentation, not from general knowledge."
  User:   "<source_chunk>\n\nGenerate 5 questions of types: lookup, procedure, diagnostic,
           architectural, comparative. Return JSON array."

Step 4: Parse + deduplicate
  - Reject questions < 20 chars or > 300 chars
  - Reject questions without "?" or imperative verbs
  - Dedup against existing question bank using BM25 similarity > 0.7
  - Tag each question with: type, source_file, source_line_range, tier, generated_at
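
Steps 2 and 4 above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: `chunk_tokens`, `is_valid_question`, and the `IMPERATIVE_VERBS` list are hypothetical names, tokens are assumed to be pre-split by the caller, and the "meaningful token" filter and BM25 dedup are left out.

```python
# Assumed starter list; the source does not enumerate its imperative verbs.
IMPERATIVE_VERBS = {"list", "describe", "explain", "show", "name", "identify"}

def chunk_tokens(tokens, size=2000, overlap=200):
    """Yield sliding windows of `size` tokens; consecutive windows share `overlap` tokens."""
    if not tokens:
        return
    step = size - overlap
    for start in range(0, max(len(tokens) - overlap, 1), step):
        yield tokens[start:start + size]

def is_valid_question(text):
    """Apply the Step 4 length and form checks (20-300 chars, '?' or imperative verb)."""
    text = text.strip()
    if not 20 <= len(text) <= 300:
        return False
    first_word = text.split()[0].lower()
    return "?" in text or first_word in IMPERATIVE_VERBS
```

A 4,000-token source yields three chunks under these defaults (starting at tokens 0, 1,800, and 3,600), with the last one shorter than the window.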

The output is a `question_bank.jsonl` in `[home-path]`:

json

{
  "id": "qb-7fa3c2",
  "text": "What port does the Graph Kernel run on, and on which machine?",
  "type": "lookup",
  "tier": 1,
  "source_file": "[home-path]…",
  "source_lines": [29, 40],
  "generated_at": "2026-03-10T08:00:00Z",
  "batch_id": "b-001",
  "status": "unsolved",
  "solve_attempts": 0,
  "pass_rate": null,
  "judge_score": null
}

1.4 Freshness-Triggered Regeneration

Questions go stale when the codebase changes. A freshness watcher:

1. Subscribes to the Mesh Event Bus (`:8600`) for `file_modified` events on corpus sources.
2. When a watched file is modified, marks all questions sourced from that file as `stale`.
3. The next batch run regenerates stale questions, compares against existing bank with diff-aware dedup.
4. Keeps old questions as historical data (with `stale: true` flag) rather than deleting them.

Specific triggers:
- MEMORY.md modified → re-mine T1 questions
- New `.py` file in `flows/feed-hub/` → add T3 questions for new flow
- SKILL.md updated → re-mine T2 questions from updated skill
- CLAUDE.md rule added (via Cortex promotion) → add T1/T3 questions for new rule context
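
The stale-marking step could look roughly like the sketch below. It assumes the question bank is a plain JSONL file with a `source_file` field as in Section 1.3. `mark_stale` is a hypothetical helper, and the Mesh Event Bus subscription that would call it is not shown.

```python
import json

def mark_stale(bank_path, modified_file):
    """Flag every question sourced from `modified_file` as stale; return how many were flagged."""
    with open(bank_path, encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    flagged = 0
    for rec in records:
        if rec.get("source_file") == modified_file and not rec.get("stale"):
            rec["stale"] = True       # kept as historical data, never deleted
            rec["status"] = "stale"
            flagged += 1
    with open(bank_path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")
    return flagged
```

Rewriting the whole file keeps the sketch short; a real watcher would probably want an atomic write or a lock, since the batch run appends to the same file.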

---

2. Solver Agent Design

2.1 Agent Profile

The solver agent is a headless Claude Code session spawned with `--dangerously-skip-permissions`. It receives a structured prompt containing:

SOLVER SYSTEM PROMPT:
You are answering operational questions about this codebase. Use the available tools
(Read, Grep, Bash, WebFetch) to find the definitive answer. You have access to all files
on this machine.

Rules:
1. Do NOT guess. Use tools to verify every factual claim.
2. Reach a definitive answer or explicitly state "ANSWER_NOT_FOUND" with the reason.
3. Your final answer must be wrapped in <answer> tags.
4. After your answer, list every file you read in <sources> tags.
5. If you discover a discrepancy between two sources, note it in <conflicts> tags.
6. Stop after 20 tool calls if no answer found (flag as TIMEOUT).

QUESTION: {question_text}
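
Rules 2 through 5 make the solver's final message machine-parseable. A minimal parser sketch follows. The function names are hypothetical, and it assumes sources are separated by commas or newlines, which the prompt does not specify.

```python
import re

def extract_tag(text, tag):
    """Return the stripped content of the first <tag>...</tag> pair, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

def parse_solver_output(text):
    """Split a solver response into answer, sources, and conflicts."""
    answer = extract_tag(text, "answer")
    sources = extract_tag(text, "sources")
    return {
        "answer": answer,
        "answer_found": answer is not None and "ANSWER_NOT_FOUND" not in answer,
        "sources": [s.strip() for s in re.split(r"[,\n]", sources) if s.strip()] if sources else [],
        "conflicts": extract_tag(text, "conflicts"),
    }
```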

2.2 Tool Usage Strategy

The solver is seeded with a tool priority ladder based on question type:

| Question Type | Tool Priority | Reasoning |
|---|---|---|
| Lookup | Read → Grep → Bash | Exact data in files; grep for ports/IPs faster than reading |
| Procedure | Read → Grep → Bash (for validation) | Procedures need full context, not just matches |
| Diagnostic | Grep → Read → Bash (run the command) | Error patterns found by grep; cause by reading; reproduce by running |
| Architectural | Read → RAG++ query → GK traversal | Synthesis needs semantic search across multiple files |
| Comparative | Read (both targets) → Bash (diff) | Side-by-side inspection |

The solver does NOT have access to WebSearch — all answers must come from the local codebase and deployed infrastructure. This is by design: we are learning OUR knowledge, not generic web knowledge.

2.3 RAG++ and GK Integration

For Architectural and Comparative questions, the solver can call:

bash

# RAG++ query (via SSH tunnel on Mac1 :8000)
curl -s -X POST http://localhost:8000/api/rag/gateway/context \
  -H "Content-Type: application/json" \
  -d '{"query": "{question_text}", "k_rag": 5, "include_graph": true}'

This returns `related_turns` (pgvector) and `graph_context` (GK entity traversal). The solver uses these to seed its Read calls — looking at the top 3 related files before doing free-form exploration.
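
Substituting `{question_text}` directly into a single-quoted shell string breaks as soon as a question contains a quote. A Python sketch of the same call, using only the standard library, avoids that by letting `json.dumps` do the escaping. The endpoint and payload fields are taken from the curl example; `build_rag_request` and `query_rag` are hypothetical names.

```python
import json
import urllib.request

RAG_URL = "http://localhost:8000/api/rag/gateway/context"  # endpoint from the curl example

def build_rag_request(question_text, k_rag=5, include_graph=True):
    """Build the POST request; json.dumps handles any quoting inside the question."""
    body = json.dumps({"query": question_text, "k_rag": k_rag, "include_graph": include_graph})
    return urllib.request.Request(
        RAG_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def query_rag(question_text, timeout=10):
    """Send the request and decode the JSON response (requires the SSH tunnel to be up)."""
    with urllib.request.urlopen(build_rag_request(question_text), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```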

Key insight: This makes the solver a meta-demonstration of how to use RAG++ effectively. The trajectories it produces are also training examples for RAG++ query formulation.

2.4 Multi-Attempt Per Question

Following KARL's self-play design, each question gets G=8 solve attempts (configurable). Each attempt is independent — different random seed, different tool call ordering, potentially different answer. This is the core of self-play: variation in trajectories enables quality filtering.

For expensive questions (Architectural, Comparative), G=3 to control costs. For cheap Lookup questions, G=8 to get clean pass-rate signal.

2.5 Solver as Pane Job

Each solver attempt runs as a pane spawn using the existing Pane Spawn Protocol:

1. Spawn new Terminal window on Mac1 (or route to Mac2/Mac4 via mesh coordinator)
2. `unset CLAUDECODE` (nested session guard)
3. `claude --dangerously-skip-permissions` in the spawn directory
4. Clipboard-paste the solver prompt (avoid shell escaping issues)
5. Wait for session end via SessionEnd hook detection
6. Read trajectory from verbose-all.jsonl for this session_id

The session_id from the spawn links the solver session back to the question_bank entry. This is the record-keeping mechanism.

---

3. Trajectory Recording

3.1 What to Capture

The `verbose-all.jsonl` already captures the full trajectory structure per session. The schema from `response_hook.py` gives us:

json

{
  "prompt_id": "...",
  "session_id": "...",
  "prompt_text": "QUESTION: What port does the Graph Kernel run on?",
  "assistant_turns": [
    {
      "turn_index": 0,
      "tool_calls": [
        {
          "tool_id": "...",
          "tool_name": "Grep",
          "parameters": {"pattern": "Graph Kernel", "path": "[home-path]…"},
          "result": ":8001 (SSH tunnel to cloud-vm native)",
          "success": true,
          "duration_ms": 45.2
        },
        {
          "tool_name": "Read",
          "parameters": {"file_path": "/Users/.../MEMORY.md"},
          "result": "...",
          "success": true
        }
      ],
      "text_response": "<answer>Port 8001, running natively on cloud-vm (not Docker)</answer>",
      "stop_reason": "end_turn"
    }
  ],
  "files_read": ["MEMORY.md"],
  "git_repo": "mohameddiomande"
}

3.2 Trajectory Metadata Augmentation

After the solver session ends, a `trajectory_augmentor.py` enriches the raw verbose entry with:

json

{
  "karl_meta": {
    "question_id": "qb-7fa3c2",
    "question_text": "What port does the Graph Kernel run on?",
    "question_type": "lookup",
    "attempt_index": 3,
    "batch_id": "b-001",
    "extracted_answer": "Port 8001, running natively on cloud-vm",
    "answer_found": true,
    "tool_call_count": 2,
    "unique_files_read": ["MEMORY.md"],
    "first_useful_tool_index": 0,
    "trajectory_length_tokens": 1240,
    "reward": null,
    "judge_score": null,
    "pass": null
  }
}

The `reward` field is populated by the quality filtering stage (Section 4).
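
As a rough sketch, the augmentor can derive most `karl_meta` fields directly from the raw entry. `build_karl_meta` is a hypothetical function. It assumes the entry follows the Section 3.1 schema and omits `batch_id`, `first_useful_tool_index`, and `trajectory_length_tokens`, which need information not shown here.

```python
import re

def build_karl_meta(entry, question, attempt_index):
    """Derive the karl_meta block from one raw verbose-all.jsonl entry."""
    turns = entry.get("assistant_turns", [])
    tool_calls = [call for turn in turns for call in turn.get("tool_calls", [])]
    text = "\n".join(turn.get("text_response") or "" for turn in turns)
    match = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return {
        "question_id": question["id"],
        "question_text": question["text"],
        "question_type": question["type"],
        "attempt_index": attempt_index,
        "extracted_answer": match.group(1).strip() if match else None,
        "answer_found": match is not None,
        "tool_call_count": len(tool_calls),
        "unique_files_read": sorted(set(entry.get("files_read", []))),
        "reward": None,        # filled in by quality filtering (Section 4)
        "judge_score": None,
        "pass": None,
    }
```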

3.3 Trajectory Storage

Trajectories live in a new Cortex subdirectory `[home-path]` as JSONL shards by batch:

[home-path]
  question_bank.jsonl        # all generated questions + status
  trajectories/
    batch-001.jsonl          # all solver attempts for batch 001
    batch-002.jsonl
    ...
  filtered/
    batch-001-filtered.jsonl # quality-filtered subset for training
  training/
    sft-dataset-v1.jsonl     # formatted for MLX LoRA
    sft-dataset-v2.jsonl

This stays off Supabase (too noisy for a 141-table project). Cross-machine sync via Syncthing to `Desktop/homelab/karl/` (already syncs to cloud-vm).

3.4 Trajectory Compression for Long Attempts

For Architectural questions with 20+ tool calls, KARL's compression boundary approach applies directly. The trajectory is split at natural "pivot points" where the agent synthesized intermediate findings:

[Read MEMORY.md] -> [Grep for 'NUMU Bus'] -> [Read numu-fare.md]
    ^-- segment 1 (context gathering) --|

[Read daemon.py lines 195-336] -> [Read mesh_coordinator.py]
    ^-- segment 2 (implementation deep-dive) --|

[Text: "The NUMU Bus operates on..."] -> <answer>...</answer>
    ^-- segment 3 (synthesis) --|

Each segment has its own reward computation (partial credit). The LoRA training uses all segments but masks tool result tokens (only model output tokens contribute to the loss).
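
The loss masking described above can be illustrated with a small helper. This is a sketch under the assumption that a trajectory has already been tokenized into alternating spans labeled `"model"` or `"tool_result"`. The span labels and `build_loss_mask` are illustrative names, not part of any existing script.

```python
def build_loss_mask(spans):
    """spans: list of (kind, token_ids) with kind in {"model", "tool_result"}.

    Returns (token_ids, mask) where mask is 1 for model-generated tokens and
    0 for tool-result tokens, so only model output contributes to the loss.
    """
    token_ids, mask = [], []
    for kind, ids in spans:
        token_ids.extend(ids)
        mask.extend([1 if kind == "model" else 0] * len(ids))
    return token_ids, mask
```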

---

4. Quality Filtering

4.1 Pass-Rate Filtering (KARL Stage 1)

For each question with G=8 attempts:
- Compute `pass_rate = (correct_attempts / G)`
- Discard if `pass_rate < 0.1`: Question is unsolvable from current codebase (either too ambiguous, or answer genuinely not in corpus). Flag question as `unsolvable` in question_bank.jsonl.
- Discard if `pass_rate > 0.9`: Question is trivially solved (one-hop Read). Little learning signal. Keep in question bank for bootstrapping but don't include in LoRA dataset.
- Keep 0.1 <= pass_rate <= 0.9: The "sweet spot" — hard enough to need learning, solvable enough to train on.

For G=3 (expensive questions), the thresholds become raw pass counts: 0/3 = unsolvable, 3/3 = trivial, 1/3 or 2/3 = keep.
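
A minimal sketch of the filter rule, with `classify_question` as a hypothetical helper:

```python
def classify_question(correct_attempts, G):
    """Return 'unsolvable', 'trivial', or 'keep' for one question."""
    if G <= 0 or not 0 <= correct_attempts <= G:
        raise ValueError("need G > 0 and 0 <= correct_attempts <= G")
    if G <= 3:
        # Small-G rule: decide on raw pass counts.
        if correct_attempts == 0:
            return "unsolvable"
        return "trivial" if correct_attempts == G else "keep"
    pass_rate = correct_attempts / G
    if pass_rate < 0.1:
        return "unsolvable"
    if pass_rate > 0.9:
        return "trivial"
    return "keep"
```

Note that for any G of 10 or fewer, the 0.1 and 0.9 thresholds reduce to "all attempts failed" and "all attempts passed": at G=8, 1/8 = 0.125 and 7/8 = 0.875 both fall inside the keep band.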

4.2 Correct-Answer Extraction

"Correct" is defined by comparing extracted answers across attempts. The comparator:

1. Extract `<answer>` tag content from each attempt.
2. Cluster answers by lexical similarity (BM25 Jaccard > 0.6 across answer tokens).
3. The plurality cluster is the "consensus answer."
4. Any attempt whose answer matches the consensus answer (or is a strict superset) is marked `pass=true`.
5. If no plurality (all answers diverge): flag question as `ambiguous`, reduce its tier priority, re-generate with more specific phrasing.

For Lookup questions, exact or near-exact string match is used instead of BM25 (port numbers, IP addresses, file paths are not fuzzy).
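
One way to realize steps 2 through 5 is greedy clustering on token overlap. The sketch below uses plain token-set Jaccard as a stand-in for the BM25-based measure and does not implement the "strict superset" rule. `consensus` and its helpers are hypothetical names.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def consensus(answers, threshold=0.6):
    """Greedy clustering: each answer joins the first cluster whose seed it resembles.

    Returns (consensus_answer, pass_flags). When no single cluster is largest,
    returns (None, all False) so the question can be flagged as ambiguous.
    """
    clusters = []  # lists of answer indices; cluster[0] is the seed
    for i, ans in enumerate(answers):
        for cluster in clusters:
            if jaccard(answers[cluster[0]], ans) > threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    clusters.sort(key=len, reverse=True)
    if not clusters or (len(clusters) > 1 and len(clusters[0]) == len(clusters[1])):
        return None, [False] * len(answers)
    winners = set(clusters[0])
    return answers[clusters[0][0]], [i in winners for i in range(len(answers))]
```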

4.3 LLM Judge Quality Filtering (KARL Stage 2)

A Claude Haiku judge evaluates each question-answer pair against four rubrics:

JUDGE PROMPT:
Evaluate this question-answer pair for an internal ops knowledge base.

Question: {question_text}
Consensus answer: {consensus_answer}
Source files: {source_files}

Score each dimension 1-5:
1. FACTUAL_ACCURACY: Is the answer verifiable from the cited source files?
2. COMPLETENESS: Does the answer cover all aspects of the question?
3. QUESTION_CLARITY: Is the question unambiguous and specific?
4. TRAINING_VALUE: Would a less-experienced engineer benefit from seeing this Q&A?

Return JSON: {"factual": N, "complete": N, "clarity": N, "value": N, "discard_reason": "..." or null}

Filter thresholds:
- Discard if `factual < 3` (answer contradicts source files)
- Discard if `clarity < 3` (question too vague to learn from)
- Discard if any score is 1 (catastrophic failure)
- Accept if `mean(all four) >= 3.5`
- Flag for human review if `factual >= 3 AND mean < 3.5`
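
Applied in order, these thresholds give three outcomes. A sketch, assuming the judge's JSON has already been parsed into a dict with the four score keys from the prompt (`judge_decision` is a hypothetical name):

```python
def judge_decision(scores):
    """scores: dict with keys factual, complete, clarity, value (each 1-5)."""
    values = [scores[k] for k in ("factual", "complete", "clarity", "value")]
    if scores["factual"] < 3 or scores["clarity"] < 3 or min(values) == 1:
        return "discard"
    mean = sum(values) / len(values)
    return "accept" if mean >= 3.5 else "human_review"
```

Because the discard rules run first, anything that reaches the mean check already has `factual >= 3`, so the human-review condition reduces to `mean < 3.5`.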

4.4 Deduplication

Before adding to the training dataset:

1. Exact match: Hash of normalized question text. Drop exact duplicates.
2. Semantic dedup: BM25 Jaccard against existing training set. Drop if similarity > 0.7 to any existing Q.
3. Cross-batch dedup: Questions from earlier batches take precedence. Newer duplicates are discarded unless they have a higher judge score.

Dedup runs per-batch, not globally, for efficiency. A global dedup pass runs weekly.
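
The first two dedup passes might be sketched as follows. The normalization (lowercase, alphanumeric tokens only) is an assumption, token-set Jaccard again stands in for the BM25-based measure, and `is_duplicate` is a hypothetical helper.

```python
import hashlib
import re

def normalize(text):
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def question_hash(text):
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def is_duplicate(candidate, existing, threshold=0.7):
    """True if `candidate` exactly or nearly matches any question in `existing`."""
    cand_tokens = set(normalize(candidate).split())
    cand_hash = question_hash(candidate)
    for question in existing:
        if question_hash(question) == cand_hash:
            return True  # exact match after normalization
        other = set(normalize(question).split())
        union = cand_tokens | other
        if union and len(cand_tokens & other) / len(union) > threshold:
            return True  # near-duplicate by token overlap
    return False
```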

4.5 Adversarial Self-Check

For every 10th question, the judge also attempts to answer the question WITHOUT looking at the source (using only its training knowledge). If the judge's answer matches the consensus answer, the question is likely "general knowledge" rather than proprietary — flag as `low_moat` and deprioritize for LoRA training (use for SKILL.md improvement instead, where the value is documentation, not differentiation).

---

5. Bootstrapping Schedule

5.1 Phase 0: Seed Run (Week 1)

Generate the initial question bank from the highest-density corpus.

Target: 200 questions across all 5 types
Corpus: T1 (MEMORY.md topic files) + T2 (Gen2 SKILL.md files) only
G: 3 attempts per question (cost control)
Goal: Establish question bank, validate pipeline, tune judge thresholds
Expected output: ~120 questions pass filtering (60% retention expected)

Run manually: `python3 [home-path]`

Estimated duration: 4-6 hours (question gen: 20 min, solving: 3-4 hours at G=3, judging: 30 min)
Estimated cost: ~$5-10 (haiku-class model for all steps)

5.2 Phase 1: Regular Cadence (Weeks 2-4)

Daily batch runs after Phase 0 validates the pipeline.

Batch size: 50 questions per run
Corpus rotation: T1 Mon/Fri, T2 Tue, T3 Wed, T4/T5 Thu
G: 5 attempts per question
Schedule: Prefect flow `karl_daily_batch` at 03:00 UTC (off-peak)

Prefect flow registration:

python

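from prefect import flow

@flow(name="karl-daily-batch")
def karl_daily_batch():
    questions = generate_batch(n=50)
    trajectories = solve_batch(questions, G=5)
    filtered = quality_filter(trajectories)
    update_skill_mds(filtered)  # near-term output
    append_training_set(filtered)  # for LoRA when 500+ examples ready

if __name__ == "__main__":
    # @flow takes no schedule argument; the cron schedule is attached when the flow is served.
    karl_daily_batch.serve(name="karl-daily-batch", cron="0 3 * * *")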

5.3 Phase 2: LoRA Training Trigger (Week 4+)

Trigger a Mac5 LoRA training run when:
- Training set reaches 500 filtered examples, OR
- Weekly batch rate has been stable for 2 weeks

python

# [home-path]
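def check_training_trigger(sft_dataset_path, min_examples=500):
    # Covers the example-count trigger only; the stable-batch-rate trigger is not checked here.
    n_examples = count_training_examples()
    if n_examples >= min_examples:
        dispatch_to_mac5_lora(sft_dataset_path)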

5.4 Phase 3: Bootstrapped Iteration (Week 6+)

After the first LoRA adapter is trained on Mac5, use the improved model as the solver for the next batch — KARL's iterative bootstrapping loop:

Iteration 1: Base model (sonnet-4-6) solves questions → 500 trajectories → LoRA v1
Iteration 2: LoRA v1 (via Mac5 MLX :8100) solves questions → 1200 trajectories → LoRA v2
Iteration 3: LoRA v2 solves questions → 2500 trajectories → LoRA v3

Each iteration's improved model generates better training data (more precise tool calls, more accurate answers), which improves the next model. This is the compounding flywheel.

---

6. Pane-Parallel Execution

6.1 Parallelism Strategy

KARL ran multiple solver instances in parallel. We have the same capability via the Pane Spawn Protocol and mesh distribution.

Mac1 (primary): Up to 4 concurrent solver panes. Constraint: Mac1 is also the build host and Xcode machine. Limit to 2 during business hours, 4 during overnight batch runs.

Mac2 (secondary): 2-4 concurrent solver panes. Mac2 is the iOS workstation on Account 2. Same business-hours limit.

Mac4/Mac5 (compute): For Bash-heavy diagnostic questions (port probes, grep across large codebases), Mac4/Mac5 can run solver panes without Tailscale overhead. These are ideal for T3/T4 questions that don't need local file access.

cloud-vm (infra questions): For questions about cloud-vm-specific services (Prefect, Grafana, Docker Compose), spawn solver panes directly on cloud-vm via SSH + AppleScript tunnel. The solver has ground-truth access to the services it's answering about.

6.2 Mesh Coordinator Integration

The existing `mesh_coordinator.py` in `[home-path]` already handles cross-machine work distribution. The KARL orchestrator registers solver jobs as "mesh tasks" with the coordinator:

python

from mesh_coordinator import MeshCoordinator

mc = MeshCoordinator()
for question in batch:
    target_machine = route_question_to_machine(question)  # T3 → cloud-vm, T1 → mac1
    mc.dispatch_task(
        task_type="karl_solver",
        payload={"question_id": question.id, "question_text": question.text, "G": 5},
        target_machine=target_machine,
        priority=question.tier
    )

The coordinator's claim guard prevents two solver panes from taking the same question simultaneously.

6.3 Pane Spawn Sequence

1. Pane orchestrator wakes, checks backlog at [home-path]
2. Finds KARL solver tasks (type=karl_solver, priority from question tier)
3. Checks available pane slots (respects business-hours limits)
4. For each slot: spawn solver pane with question payload via clipboard
5. When solver session ends (SessionEnd hook), trajectory_augmentor.py runs
6. Augmentor writes enriched trajectory to batch-NNN.jsonl
7. Pane orchestrator picks up next question from backlog

6.4 Throughput Estimate

With 4 concurrent panes on Mac1+Mac2 and 2 on Mac4:
- Average solver session: 3 min (Lookup) to 12 min (Architectural)
- Average: ~6 min per attempt, so G=5 attempts = 30 min per question fully solved
- 6 concurrent panes: 6 questions per 30 min = 12/hr
- Overnight batch (10pm-6am = 8hr): 96 questions fully solved
- Per week (5 overnight runs): 480 questions

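At that pace the question bank reaches 500 solved questions in just over one week. If roughly 60% of questions survive filtering (the Phase 0 expectation), 500 filtered training examples take closer to two weeks. At $0.32/batch of 50 (haiku), ten batches cost ~$3.20 per week. Highly tractable.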
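
The arithmetic behind this estimate, written out so the inputs can be varied. The defaults are the figures from the list above; the 60% retention figure is the Phase 0 expectation from Section 5.1, and `weekly_throughput` is a hypothetical helper.

```python
def weekly_throughput(panes=6, minutes_per_attempt=6, G=5, hours_per_night=8, nights=5):
    """Return (questions per hour, per night, per week) for fully solved questions."""
    minutes_per_question = minutes_per_attempt * G   # one pane runs all G attempts
    per_hour = panes * 60 / minutes_per_question
    per_night = per_hour * hours_per_night
    return per_hour, per_night, per_night * nights

per_hour, per_night, per_week = weekly_throughput()
print(per_hour, per_night, per_week)   # 12.0 96.0 480.0
print(round(per_week * 0.6))           # 288 filtered examples/week at 60% retention
```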

---

7. Output Formats

7.1 Output A: SKILL.md Improvement (Near-Term, Zero Training Cost)

The fastest path to value. Filtered trajectories reveal the CORRECT tool sequence for answering a class of question. That sequence becomes the "Workflow" section of the corresponding SKILL.md.

Example transformation:

Before (static template):

markdown

## Workflow
1. Check current state
2. Execute operation
3. Verify success
4. Report outcome

After (trajectory-derived):

markdown

## Workflow — Deploy Prefect Flow to cloud-vm

### Verified Tool Sequence (from 12 successful trajectories, pass_rate=0.83)

1. Read current flow registration:
   `ssh cloud-vm prefect flow ls` or check `flows/feed-hub/` for matching filename
2. Verify cloud-vm Docker stack is running:
   `ssh cloud-vm docker ps | grep prefect`
3. Copy flow file to VM (avoid heredoc — scp instead):
   `scp flows/feed-hub/{flow_name}.py cloud-vm:[home-path]`
4. Register flow (from cloud-vm):
   `ssh cloud-vm "cd [home-path] && python3 {flow_name}.py"`
5. Verify in Prefect UI:
   `open http://localhost:4200` (via SSH tunnel or Nexus Portal)

### Gotchas (extracted from failed trajectories)
- SSH heredoc mangles `${...}` — always scp files, never pipe code over SSH
- Port 4200 serves both the Prefect UI and the API; the API lives under the `/api/` path
- Flow name in Prefect registry ≠ filename. Use `@flow(name="...")` decorator

Update mechanism: A `skill_updater.py` script:
1. Groups filtered trajectories by question type + domain
2. Extracts the modal tool sequence (most common successful path)
3. Compares against existing SKILL.md workflow section
4. If modal sequence diverges significantly (>40
5. Human reviews proposed update in Nexus Portal `/skills` page (new page, see Section 7.3)
6. On approval, Edit tool patches the SKILL.md. Memory Guardian ensures additive-only.
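
Step 2, extracting the modal tool sequence, can be sketched with a `Counter`. The sketch assumes each trajectory has been flattened to a `tool_calls` list and a `pass` flag. `modal_tool_sequence` is a hypothetical helper, and the divergence comparison in step 4 is not shown.

```python
from collections import Counter

def modal_tool_sequence(trajectories):
    """Return (sequence, share) for the most common tool-name sequence among passing trajectories."""
    sequences = [
        tuple(call["tool_name"] for call in traj["tool_calls"])
        for traj in trajectories
        if traj.get("pass")
    ]
    if not sequences:
        return None, 0.0
    sequence, count = Counter(sequences).most_common(1)[0]
    return list(sequence), count / len(sequences)
```

The returned share gives the reviewer a sense of how dominant the modal path is before accepting a proposed workflow update.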

7.2 Output B: LoRA Training Dataset (Medium-Term, Mac5)

For the Mac5 LoRA adapter, trajectories are converted to SFT format using KARL's approach adapted for Claude Code tool use:

SFT record format (for MLX LoRA):

json

{
  "messages": [
    {
      "role": "system",
      "content": "You are answering operational questions about the OpenClaw infrastructure. Use tools to verify every claim."
    },
    {
      "role": "user",
      "content": "What port does the Graph Kernel run on, and on which machine?"
    },
    {
      "role": "assistant",
      "content": "Let me check the memory documentation.\n\n[TOOL: Grep pattern='Graph Kernel' path='[home-path]…'] … shows `:8001 (SSH tunnel to cloud-vm native)` at line 35.\n\n<answer>Port 8001. The Graph Kernel runs natively on cloud-vm (not in Docker). It is accessed from Mac1 via an SSH tunnel: `ssh -f -N -L 8001:localhost:8001 cloud-vm`</answer>\n\n<sources>[home-path]…</sources>"
    }
  ],
  "trajectory_id": "batch-001-q7fa3c2-attempt-3",
  "question_type": "lookup",
  "reward": 0.92,
  "pass_rate": 0.875
}

Key design choices matching KARL:
- Tool results are represented as inline text (not structured JSON) — mirrors how Claude Code interleaves tool results in its response
- The model output (what gets trained) is the reasoning + tool call decisions; tool results are masked from loss
- Trajectories with higher `reward` scores are upsampled 2x in the training set (reward-weighted sampling)
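
Reward-weighted sampling can be as simple as duplicating high-reward records before shuffling. The cutoff for "higher reward" is not stated, so the `threshold=0.8` default below is purely an assumption, as is the helper name.

```python
def upsample_by_reward(records, threshold=0.8, factor=2):
    """Repeat records whose reward meets `threshold` `factor` times; keep the rest once."""
    out = []
    for rec in records:
        reward = rec.get("reward")
        copies = factor if reward is not None and reward >= threshold else 1
        out.extend([rec] * copies)
    return out
```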

corpus-to-sft.py extension: The existing `corpus-to-sft.py` script needs a `--karl-trajectories` flag that reads from `[home-path]` instead of the standard unified.jsonl format.

Mac5 training command (MLX LoRA v0.29+):

bash

python3 -m mlx_lm lora \
  --model [home-path] \
  --data [home-path] \
  --train \
  --num-layers 8 \
  --batch-size 4 \
  --iters 600 \
  --save-every 100 \
  --adapter-path [home-path]

7.3 Output C: Nexus Portal Dashboard (Observability)

A new `/karl` page in the Nexus Portal showing:
- Question bank statistics (total, by type, by status: unsolved/passing/trivial/stale)
- Quality filter funnel (generated → pass-rate filter → judge filter → training set)
- Skill improvement diffs (proposed vs. current SKILL.md)
- LoRA training history (adapter versions, loss curves)
- Coverage heatmap (which corpus areas have good question coverage vs. gaps)

This connects KARL into the existing observability stack without adding new infrastructure.

---

8. Freshness: When to Regenerate Questions

8.1 Change Detection

The freshness watcher subscribes to the `file_modified` and `file_created` events on the Mesh Event Bus (`:8600`). Relevant file patterns:

python

FRESHNESS_WATCHERS = {
    "[home-path]…": "T1",
    "[home-path]…": "T2",
    "flows/feed-hub/*.py": "T3",
    "[home-path]…": "T4",
    "[home-path]…": "T5",  # append-only, new prompts = new questions
    "CLAUDE.md": "T1",  # new rules = new lookup questions
    "MEMORY.md": "T1",
}

8.2 Staleness Rules

| Event | Action | Urgency |
|---|---|---|
| Port number change in MEMORY.md | Mark all lookup questions from MEMORY.md as stale | High (next batch) |
| New skill added to registry | Generate 10 questions for new skill immediately | High |
| New Prefect flow file added | Generate 8 procedure questions for new flow | Medium (next night) |
| CLAUDE.md rule added | Generate 3 lookup questions for new rule | Medium |
| Hook modified | Mark procedure questions for that hook as stale | Medium |
| Existing question answer diverges from current file | Mark question as stale | Low (next weekly pass) |

8.3 Drift Detection

A weekly drift-check job compares the consensus answers in the filtered dataset against the current state of the codebase. For each filtered Q&A pair:

1. Re-run the solver ONCE against the current codebase
2. Compare new answer against stored consensus answer
3. If Jaccard similarity < 0.5: codebase has drifted, mark Q&A as `drifted`
4. Drifted Q&As are removed from the LoRA training set (but kept with `drifted=true` for audit)
5. Regenerate fresh questions from the modified source

This prevents the LoRA adapter from learning outdated procedures.

---

9. Cost Analysis

9.1 Token Cost Breakdown (Claude Haiku for all LLM steps)

| Operation | Input tokens | Output tokens | Cost per unit |
|---|---|---|---|
| Question generation (per 2K chunk) | 2,500 | 300 | $0.00100 |
| Solver attempt (lookup) | 4,000 | 500 | $0.00163 |
| Solver attempt (procedure) | 8,000 | 1,200 | $0.00350 |
| Solver attempt (architectural) | 15,000 | 2,500 | $0.00688 |
| LLM judge (per Q&A) | 3,000 | 300 | $0.00113 |
| Drift check (per Q&A, 1 solve) | 6,000 | 800 | $0.00250 |

Haiku pricing: $0.25/MTok input, $1.25/MTok output. Each unit cost is input tokens × $0.25/MTok plus output tokens × $1.25/MTok; for question generation that is $0.000625 + $0.000375 = $0.00100.

9.2 Weekly Budget Projections

Conservative (50 questions/week):

Question generation: 50 × $0.00100 × 4 chunks avg = $0.20
Solving (G=5 per Q): 50 × 5 × $0.003 avg = $0.75
Judging: 50 × $0.00113 = $0.06
Drift check (200 Q bank): 200 × $0.0025 = $0.50
Total: ~$1.51/week

Full speed (200 questions/week at overnight pace):

Question generation: 200 × $0.00100 × 4 = $0.80
Solving: 200 × 5 × $0.003 avg = $3.00
Judging: 200 × $0.00113 = $0.23
Drift check (1000 Q bank): 1000 × $0.0025 = $2.50
Total: ~$6.53/week

LoRA training on Mac5: Electricity only. Approximately $0.02-0.05 per training run at 30 min on M4.

Monthly ceiling at full speed: ~$28/month ($6.53 × 52/12), far below any commercial ML training cost. The entire knowledge base training pipeline runs for about the price of a streaming subscription or two.
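
Any per-unit cost in Section 9.1 can be recomputed from its token counts and the Haiku prices quoted there. A small sketch, with `unit_cost` as a hypothetical helper:

```python
INPUT_PER_MTOK = 0.25    # USD per million input tokens (Haiku pricing from Section 9.1)
OUTPUT_PER_MTOK = 1.25   # USD per million output tokens

def unit_cost(input_tokens, output_tokens):
    """Cost in USD of one call with the given token counts."""
    return (input_tokens * INPUT_PER_MTOK + output_tokens * OUTPUT_PER_MTOK) / 1_000_000

for name, tokens_in, tokens_out in [
    ("question generation", 2_500, 300),
    ("solver (lookup)", 4_000, 500),
    ("solver (procedure)", 8_000, 1_200),
    ("solver (architectural)", 15_000, 2_500),
    ("judge", 3_000, 300),
    ("drift check", 6_000, 800),
]:
    print(f"{name}: ${unit_cost(tokens_in, tokens_out):.5f}")
```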

9.3 Cost Optimization Options

1. Haiku for generation + judgment, skip solver Claude calls for Lookup questions: Lookup questions can be "solved" by deterministic file reads (no model inference). Auto-answer via Grep/Read script. Saves 40

2. Amortize with existing sessions: When a human session naturally answers an infra question, that trajectory is already in verbose-all.jsonl. A post-hoc classifier can retroactively tag these as KARL trajectories without any additional API cost.

3. Question reuse across iterations: A question that generated a good training example in iteration 1 doesn't need re-generation. Only new/stale questions need fresh generation.

---

10. Risks

10.1 Self-Referential Loops

Risk: The question generator reads SKILL.md files that were written by the skill updater that uses KARL trajectories. Over time, questions and skills co-evolve toward a local optimum that does not reflect the actual codebase.

Mitigation:
- Weight T3-T4 sources (actual code) 2x over T1-T2 (documentation) in question generation, overriding the default tier weights from Section 1.3 (T1 = 3x, T2 = 2x). Code does not lie; docs drift.
- Monthly "ground truth audit": randomly sample 20 Q&As and verify manually against live infrastructure.
- The LLM judge rubric penalizes circular answers ("the SKILL.md says X, therefore X is correct"). Require source citation to be code files, not documentation files, for Factual_Accuracy scoring.
- Adversarial check (Section 4.5) detects when answers are "general knowledge" vs. proprietary.

10.2 Stale Answers in LoRA Weights

Risk: Once a wrong or outdated procedure is baked into a LoRA adapter, it will be confidently wrong on future queries. The adapter cannot un-learn without retraining.

Mitigation:
- Weekly drift detection (Section 8.3) marks drifted Q&As before they can re-enter training.
- LoRA adapters are versioned and the active version is explicit in the MLX Server config. Rolling back to v(n-1) is a 30-second operation.
- The SKILL.md update pipeline (Output A) is decoupled from the LoRA pipeline (Output B). SKILL.md can be corrected immediately via Edit; LoRA requires a new training run.
- The decay detector already handles skill staleness — add trajectory staleness to the same decay ladder (60d inactive = flag for drift check).

10.3 Confirmation Bias in Pass-Rate Filtering

Risk: If the codebase has a systemic error (e.g., wrong port documented in MEMORY.md), all solver attempts will "agree" on the wrong answer, giving it high pass-rate and high judge score.

Mitigation:
- For infrastructure-critical facts (ports, IPs, credentials), the solver must verify against LIVE state, not just documentation. Add a "live verification" tool call: `ssh cloud-vm netstat -tlnp | grep 8001` rather than just reading MEMORY.md.
- Factual_Accuracy rubric in the judge specifically requires: "Can this be verified against running services, not just documentation?"
- Cross-reference rule: Lookup questions about ports/IPs must cite at least 2 independent sources (MEMORY.md + live probe, or MEMORY.md + docker-compose.yml).
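
One way to enforce the cross-reference rule is to count distinct source kinds among an answer's citations. The citation dict shape and the `kind` labels below are hypothetical; only the "at least 2 independent sources" requirement comes from the rule above.

```python
def satisfies_cross_reference(citations, min_sources=2):
    """Check that citations span at least `min_sources` independent kinds.

    Each citation is a dict with a "kind" key such as "memory_md",
    "live_probe", or "compose_file" (labels are illustrative assumptions).
    Two citations of the same kind count as one source.
    """
    kinds = {c["kind"] for c in citations}
    return len(kinds) >= min_sources
```

Counting kinds rather than raw citations prevents an answer from passing by citing MEMORY.md twice.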

10.4 Compute Contention

Risk: Overnight batch runs spawn 4-6 solver panes, consuming Mac1/Mac2 resources needed for builds.

Mitigation:
- Pane orchestrator business-hours limits (Section 6.1) already handle this.
- KARL batch jobs register with the mesh coordinator at LOW priority — any human session or build job preempts them.
- Add a `max_karl_panes` setting to the pane orchestrator config. Default 2, settable to 0 for urgent build days.
- Hard stop: if any LaunchAgent reports a build failure (via Mesh Event Bus), karl_daily_batch pauses all spawns immediately.
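
The spawn rules above reduce to a small budget function. This is a sketch only: `max_karl_panes` is the setting proposed above, while the function name and the other parameters are assumptions about what the orchestrator can observe.

```python
def karl_spawn_budget(max_karl_panes, active_karl_panes,
                      build_failed=False, higher_priority_waiting=False):
    """Return how many new KARL solver panes may be spawned right now.

    - A reported build failure is a hard stop (no spawns).
    - Any human session or build job preempts LOW-priority KARL work.
    - Otherwise spawn up to the configured cap (0 disables KARL entirely).
    """
    if build_failed or higher_priority_waiting:
        return 0
    return max(0, max_karl_panes - active_karl_panes)
```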

10.5 Question Quality Decay

Risk: As the easy/clear questions are exhausted, the generator produces increasingly obscure or unanswerable questions, wasting solver compute.

Mitigation:
- Track `unsolvable_rate` per corpus tier per week. If T3 unsolvable rate exceeds 30
- Vary question complexity intentionally: 60
- The adversarial self-check (Section 4.5) catches questions that are answerable from general knowledge. These need rephrasing to include proprietary specifics before re-entering the bank.
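
Tracking `unsolvable_rate` per tier per week might be computed as below. The record fields (`tier`, `iso_week`, `passes`) are hypothetical, and a question is treated as unsolvable when no solver attempt passed. The alert threshold is left as a parameter here.

```python
from collections import defaultdict


def unsolvable_rate(records):
    """Return {(tier, iso_week): fraction of questions with zero passes}."""
    totals = defaultdict(int)
    unsolved = defaultdict(int)
    for rec in records:
        key = (rec["tier"], rec["iso_week"])
        totals[key] += 1
        if rec["passes"] == 0:
            unsolved[key] += 1
    return {key: unsolved[key] / totals[key] for key in totals}


def tiers_over_threshold(rates, threshold):
    """List (tier, iso_week) keys whose unsolvable rate exceeds threshold."""
    return sorted(key for key, rate in rates.items() if rate > threshold)
```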

10.6 Privacy / Secret Leakage in Training Data

Risk: Solver trajectories may capture credential values, API keys, or private data that appear in files read during solving.

Mitigation:
- A `secret_scrubber.py` pass runs on every trajectory before it enters `filtered/` or `training/`. It uses regex patterns drawn from the credential architecture: Supabase JWT patterns, Tailscale node keys, and API key formats.
- Any trajectory containing a secret pattern is quarantined, not discarded — flagged for manual review (secrets in code = a problem that needs fixing, not hiding).
- The training dataset is stored locally only (not synced to Supabase). Cross-machine sync to Mac5 goes over a direct Tailscale transfer, not cloud storage.
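
A scrub-and-quarantine pass could be sketched as follows. The regexes are illustrative assumptions, not the patterns from the credential architecture: the JWT pattern matches the generic three-segment `eyJ...` shape, and the Tailscale and API key patterns are guesses at common prefixes.

```python
import re

# Illustrative patterns only (assumptions); the real scrubber would load
# its patterns from the credential architecture.
SECRET_PATTERNS = {
    "jwt": re.compile(
        r"\beyJ[A-Za-z0-9_-]{8,}\.[A-Za-z0-9_-]{8,}\.[A-Za-z0-9_-]{8,}"
    ),
    "tailscale_key": re.compile(
        r"\b(?:tskey-[A-Za-z0-9-]{8,}|nodekey:[0-9a-f]{16,})"
    ),
    "generic_api_key": re.compile(r"\b(?:sk|pk)-[A-Za-z0-9]{20,}\b"),
}


def scan_trajectory(text):
    """Return the sorted names of all secret patterns found in text."""
    return sorted(
        name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)
    )


def route_trajectory(text):
    """Quarantine (never discard) anything that matches a secret pattern."""
    return "quarantine" if scan_trajectory(text) else "filtered"
```

Returning the matched pattern names, rather than a bare boolean, gives the manual reviewer a starting point for finding the leaked value in the source file.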

---

11. Integration with Existing Systems

11.1 Cortex Integration

KARL extends Cortex rather than replacing it:

Cortex Component	Role with KARL
`extractor.py`	Continues to mine prompt frequency → generates Type-5 (contextual) questions
`generator.py`	Extended to generate SKILL.md updates from trajectory patterns, not just static templates
`ops_trigger.py`	Continues to inject SKILL.md content. Skills are now trajectory-derived, not generic
`correction_detector.py`	Corrections become NEGATIVE reward signals (trajectory that prompted correction gets `reward=-1.0`)
`frequency_tracker.py`	Frequency still drives promotion ladder. KARL adds a second signal: question pass-rate
`models.py`	Two new entry types: `trajectory` and `reward_signal` added alongside existing 6 types
`decay/detector.py`	Adds `drift_check` action alongside existing disable/archive actions

New CortexEntry types:

```python
ENTRY_TYPES = frozenset({
    "skill_candidate",
    "routing_decision",
    "correction",
    "rule",
    "decay_flag",
    "invocation_record",
    "trajectory",       # NEW: a solved question trajectory
    "reward_signal",    # NEW: outcome reward for a trajectory
})
```
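
To illustrate how a correction might become a negative reward, the sketch below builds a `reward_signal` record for the trajectory that prompted it. The `RewardSignal` dataclass and its fields are hypothetical stand-ins; the real `CortexEntry` schema in `models.py` is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardSignal:
    """Hypothetical stand-in for a CortexEntry of type "reward_signal"."""
    entry_type: str
    trajectory_id: str
    reward: float
    reason: str


def reward_from_correction(trajectory_id):
    """Map a detected user correction to reward=-1.0 on its trajectory."""
    return RewardSignal(
        entry_type="reward_signal",
        trajectory_id=trajectory_id,
        reward=-1.0,
        reason="user_correction",
    )
```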

11.2 Evolution World Integration

KARL can function as a new Evolution World layer — but this is Phase 2, not Phase 1.

Phase 1 (now): KARL runs standalone, feeds improved SKILL.mds back into Cortex.

Phase 2 (Week 8+): KARL trajectories become "genomes" for EW L2. The OAPL reward function becomes a fitness signal: skills with high pass-rate across KARL questions score higher fitness in EW's selection algorithm. EW's L2 strategy selection (elitist, diversity_driven, etc.) can then choose whether to focus on improving weak skills or diversifying across domains. This mirrors KARL's OAPL policy optimization mapped onto EW's evolutionary selection.

11.3 Pane Orchestrator Integration

The pane orchestrator's backlog at `[home-path]` gets a new task type:

```json
{
  "type": "karl_solver",
  "question_id": "qb-7fa3c2",
  "question_text": "...",
  "attempt_index": 2,
  "priority": 2,
  "target_machine": "mac1",
  "created_at": "2026-03-10T03:00:00Z",
  "deadline": "2026-03-10T07:00:00Z"
}
```

The orchestrator's 5-phase cycle (sense→select→mutate→check→adapt) handles KARL tasks at LOW priority, filling idle pane slots overnight. No changes to the orchestrator core logic are required.
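
Before a `karl_solver` task is appended to the backlog, a light validation step could catch malformed entries. The field list mirrors the example task above; the validator itself and its error messages are assumptions, not existing orchestrator code.

```python
from datetime import datetime

REQUIRED_FIELDS = {
    "type", "question_id", "question_text", "attempt_index",
    "priority", "target_machine", "created_at", "deadline",
}


def parse_ts(value):
    """Parse an ISO-8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def validate_karl_task(task):
    """Return a list of problems; an empty list means the task is usable."""
    missing = REQUIRED_FIELDS - set(task)
    if missing:
        return [f"missing fields: {sorted(missing)}"]
    errors = []
    if task["type"] != "karl_solver":
        errors.append("type must be 'karl_solver'")
    if parse_ts(task["deadline"]) <= parse_ts(task["created_at"]):
        errors.append("deadline must be after created_at")
    return errors
```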

---

12. Implementation Plan

Sprint 1 (Days 1-3): Pipeline Skeleton

Deliverables:
1. `[home-path]` — module root
2. `[home-path]` — Tier 1 generation only (MEMORY.md topics)
3. `[home-path]` — solver session spawn + session_id linkage
4. `[home-path]` — enriches verbose-all entries post-session
5. `[home-path]` — pass-rate filter only (no LLM judge yet)
6. Update `cortex/models.py` to add `trajectory` and `reward_signal` entry types
7. Seed run: 20 questions from MEMORY.md, G=3, validate end-to-end

Success criteria: 15+ questions in question_bank.jsonl, 5+ filtered trajectories in filtered/, no pipeline crashes.
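
The Sprint 1 pass-rate filter can be sketched in a few lines. The 0.1-0.9 band is the sweet spot reported for KARL in the Stage 0 research notes; applying the same band here, and the function names, are assumptions.

```python
def pass_rate(outcomes):
    """Fraction of solver attempts that passed (outcomes is a list of bools)."""
    return sum(outcomes) / len(outcomes)


def keep_question(outcomes, low=0.1, high=0.9):
    """Keep questions that are neither trivially easy nor unsolvable."""
    if not outcomes:
        return False
    return low <= pass_rate(outcomes) <= high
```

With G=3 the only possible pass-rates are 0, 1/3, 2/3, and 1, so this band keeps exactly the questions where the solver attempts disagreed.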

Sprint 2 (Days 4-7): Quality + SKILL.md Output

Deliverables:
1. `[home-path]` — judge rubric + Haiku API calls
2. `[home-path]` — trajectory → SKILL.md diff proposals
3. `[home-path]` — BM25 question dedup
4. Update `cortex/forge/generator.py` to accept trajectory-derived workflow sections
5. Nexus Portal `/karl` page (stats only, no controls)
6. Full Phase 0 seed run: 200 questions, 3 tiers

Success criteria: At least 2 SKILL.md improvement proposals generated and reviewed. Judge accuracy > 80
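
BM25 question dedup could work roughly as below: score a candidate against the existing bank and flag it when its best match scores close to the candidate's score against itself. This is a pure-Python Okapi BM25 sketch; the tokenizer, the `ratio=0.8` cut-off, and the function names are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the tokenized query (Okapi BM25)."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            )
            score += idf * norm
        scores.append(score)
    return scores


def is_duplicate(candidate, bank, ratio=0.8):
    """Flag candidate if its best bank match reaches ratio * its self-score."""
    if not bank:
        return False
    cand = tokenize(candidate)
    docs = [tokenize(q) for q in bank] + [cand]
    scores = bm25_scores(cand, docs)
    self_score = scores[-1]
    return self_score > 0 and max(scores[:-1]) >= ratio * self_score
```

Normalizing by the self-score keeps the threshold comparable across short and long questions, since raw BM25 scores are unbounded.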

Sprint 3 (Days 8-14): Automation + Parallelism

Deliverables:
1. Prefect flow `karl_daily_batch` registered on cloud-vm
2. Pane orchestrator integration (backlog task type)
3. Mesh coordinator routing (T3 → cloud-vm, T1 → mac1)
4. Freshness watcher subscribed to Mesh Event Bus
5. Secret scrubber pass before filtered/ writes
6. Phase 1 full cadence: 50 questions/night for 5 nights

Success criteria: 250+ questions solved without manual intervention. Pipeline runs overnight autonomously.

Sprint 4 (Days 15-28): LoRA Training + Iteration

Deliverables:
1. `[home-path]` — converts filtered/ to MLX LoRA format
2. Updated `corpus-to-sft.py` with `--karl-trajectories` flag
3. First LoRA training run on Mac5 (target: 500 examples)
4. Drift detector Prefect flow `karl_weekly_drift`
5. Iteration 2 bootstrapping: Mac5 adapter as solver

Success criteria: LoRA v1 trained with loss < 1.5. At least 10
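
Converting `filtered/` trajectories into LoRA training files might look like this. The sketch assumes the prompt/completion JSONL layout (`train.jsonl`, `valid.jsonl`) that `mlx-lm` accepts for LoRA fine-tuning; the trajectory fields `question` and `final_answer` and the 10% validation split are hypothetical.

```python
import json
from pathlib import Path


def trajectory_to_example(traj):
    """Map one filtered trajectory to a prompt/completion pair."""
    return {"prompt": traj["question"], "completion": traj["final_answer"]}


def write_split(trajectories, out_dir, valid_fraction=0.1):
    """Write train.jsonl and valid.jsonl; return the row count per file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    examples = [trajectory_to_example(t) for t in trajectories]
    # Hold out at least one validation row when there is more than one example.
    n_valid = max(1, int(len(examples) * valid_fraction)) if len(examples) > 1 else 0
    splits = {"valid.jsonl": examples[:n_valid], "train.jsonl": examples[n_valid:]}
    for name, rows in splits.items():
        with open(out / name, "w", encoding="utf-8") as fh:
            for row in rows:
                fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    return {name: len(rows) for name, rows in splits.items()}
```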

---

13. Success Metrics

Metric	Baseline	Target (4 weeks)	Target (12 weeks)
Questions in bank	0	200	1,000
Filtered training examples	0	120	600
SKILL.md improvement proposals	0	5	20
SKILL.md updates applied	0	3	15
Pass-rate on held-out Q set	-	- (no adapter yet)	+15
Pipeline cost/week	-	<$2	<$8
Stale question rate	-	<20
Solver trajectory coverage (

---

Sources

### Codebase Files Read
- `[home]/.claude/prompt-logs/verbose-all.jsonl` (3,241 entries, tool_call schema confirmed)
- `[home]/.claude/prompt-logs/unified.jsonl` (3,917 entries, simplified tool schema)
- `[home]/.claude/prompt-logs/prompts-all.jsonl` (909 entries, domain distribution analyzed)
- `[home]/.claude/cortex/models.py` (CortexEntry dataclass, 6 entry types)
- `[home]/.claude/cortex/forge/extractor.py` (SKIP_PATTERNS, n-gram clustering)
- `[home]/.claude/skills/registry.json` (83 skill dirs, 12 Gen2 active)
- `[home]/Desktop/evo-cube-output/karl-trajectory-intelligence/stage0-research.md` (primary research corpus)
- `[home]/.claude/projects/-Users-mohameddiomande/memory/` directory listing (29 topic files)

### Key Infrastructure Facts Confirmed
- Verbose-all entries have full `tool_call` schema: `tool_name`, `parameters`, `result`, `success`, `duration_ms`, `exit_code`, `file_diff`
- 2,206/3,241 verbose entries are from `mohameddiomande` repo (primary training corpus)
- Domain frequency: prefect(513), debug(253), evo(225), git(178), ios(133), deploy(119), mesh(109), supabase(95)
- Mac5 LoRA pipeline confirmed: adapter v1 trained in 188.4s on 972 examples, loss 1.694
- Pane spawn limitations confirmed: ~30
- Cost estimate: $1.44-6.23/week at full speed (haiku-class)

### Stage 0 External Research (from stage0-research.md)
- KARL paper: arXiv 2603.05218 (Databricks AI Research, March 2026)
- OAPL algorithm: stable with 400+ step policy lags, 3x fewer samples than GRPO
- Synthetic data pipeline: question generation → G=8 solver attempts → pass-rate filter (0.1-0.9 sweet spot) → LLM judge → dedup → training set
- Scale: iteration 1: 1,218+6,270 prompts; iteration 2: 1,336+11,371 (compounding)
- Base model (GLM 4.5 Air, 52.6 avg) matched Claude Opus (67.5) through trajectory RL alone

Source Anchor

evo-cube-output/karl-trajectory-intelligence/stage1-path-e.md
