# KARL Integration — Evolution³ / Stage 1, Path B: OAPL-Lite
Run: karl-trajectory-intelligence
Generated: 2026-03-10
Method: Evolution³ — divergent exploration
---
Executive Summary
Path B implements a stripped-down version of KARL's OAPL algorithm that runs on Mac5's
single M4 chip using offline advantage estimation instead of online rollouts. The core
insight: we don't need live rollout infrastructure when we already have 3,249 logged
trajectories in `verbose-all.jsonl`, 157 of which contain rich tool-use sequences with
exit codes, file diffs, and success signals. The approach converts those trajectories into
advantage-weighted training examples, computes rewards from build results, correction
signals, and user approval proxies, and trains a LoRA adapter on Mac5 that learns which
tool-use sequences are associated with successful task completion.
The target is a LoRA adapter specialized for tool-use reasoning: given a prompt and a
skill context, predict the high-advantage next action. This replaces the static
`(prompt) -> inject SKILL.md content` pipeline with a `(prompt + trajectory context) ->
learned action selection` model that improves as more trajectories accumulate.
---
1. OAPL Simplification: What Survives on a Single M4
1.1 The Full OAPL Objective (KARL Paper)
The KARL OAPL loss is a regression objective derived from the KL-regularized RL problem:
OAPL objective (full):
L(pi) = sum_x sum_i ( beta * ln(pi(y_i|x) / pi_ref(y_i|x)) - A*(x, y_i) )^2
where:
A*(x, y_i) = r(x, y_i) - V*(x)
V*(x) = beta * ln( (1/G) * sum_i exp(r(x, y_i) / beta) )
G = number of rollouts per prompt x
beta = KL regularization coefficient (controls deviation from pi_ref)
pi_ref = frozen reference policy (base model weights)
r(x, y_i) = reward for response y_i to prompt x
OAPL's key innovation over GRPO: V*(x) is the soft optimal value, not a simple
baseline. It is computed in closed form from the rewards of all G rollouts, requiring
no value network, no importance weight clipping, and no gradient through the value
estimate. This makes it stable at policy lags up to 400+ gradient steps — the training
data can be much older than in PPO/GRPO without degrading optimization.
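To make the closed-form value concrete, here is a minimal sketch of V*(x), A*(x, y_i), and the per-prompt squared loss, written directly from the definitions above. The function names are illustrative and are not taken from the KARL codebase; the log-ratios are assumed to be supplied by the caller.

```python
import math
from typing import List


def soft_value(rewards: List[float], beta: float) -> float:
    """V*(x) = beta * ln((1/G) * sum_i exp(r_i / beta)), computed stably."""
    scaled = [r / beta for r in rewards]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    mean_exp = sum(math.exp(s - m) for s in scaled) / len(scaled)
    return beta * (m + math.log(mean_exp))


def optimal_advantages(rewards: List[float], beta: float) -> List[float]:
    """A*(x, y_i) = r(x, y_i) - V*(x) for each of the G rollouts."""
    v = soft_value(rewards, beta)
    return [r - v for r in rewards]


def oapl_loss(log_ratios: List[float], rewards: List[float], beta: float) -> float:
    """sum_i (beta * ln(pi/pi_ref) - A*)^2 for one prompt x.

    log_ratios[i] is ln(pi(y_i|x) / pi_ref(y_i|x)), assumed precomputed.
    """
    advantages = optimal_advantages(rewards, beta)
    return sum((beta * lr - a) ** 2 for lr, a in zip(log_ratios, advantages))
```

For small beta the soft value approaches the best reward in the group, and for large beta it approaches the group mean, which is why the domain-mean baseline used later is only a proxy for V*(x).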
1.2 Components OAPL-Lite Keeps
KL Regularization (CRITICAL — keep)
The KL penalty `beta * ln(pi(y_i|x) / pi_ref(y_i|x))` prevents the policy from
drifting catastrophically away from the base model. On Mac5 with a 4-bit quantized
Gemma-3-1b, catastrophic forgetting is the primary failure mode. The KL term ensures
the adapter stays within a reasonable neighborhood of the base. In practice: keep a
frozen reference adapter (the current v1 adapter) and compute log-probability ratios at
training time using teacher-forcing on held-out examples.
Implementation: MLX LoRA has no built-in KL term, so a conservative `--learning-rate`
plus early stopping on validation loss serves as an implicit constraint. We add an
explicit KL penalty term to the per-example loss weight (advantage-weighted SFT is
equivalent to OAPL under mild assumptions; see Section 8).
Advantage Estimation (CRITICAL — keep)
Each training example gets an advantage weight `A = r - V_baseline` where `r` is the
computed trajectory reward and `V_baseline` is the mean reward across all trajectories
from the same skill/domain bucket. High-advantage examples (trajectories where the
agent did significantly better than average) get upweighted. Low-advantage examples
are downweighted or excluded. This is the core of what makes OAPL more sample-efficient
than plain SFT.
Offline Training (CRITICAL — keep)
Instead of running live rollouts, we use the existing `verbose-all.jsonl` (3,249
entries, 157 with tool sequences) as a fixed offline dataset. OAPL's stability at
large policy lags (400+ gradient steps) means the age of these trajectories is not a
fundamental problem — it is a design feature that Path B exploits.
1.3 Components OAPL-Lite Drops
Online Rollouts (DROP)
KARL runs 8 parallel rollouts per training prompt, requiring concurrent model inference
on a GPU cluster. Mac5 is a single M4 chip; running 8 parallel inference processes
against the 4-bit quantized Gemma model would consume the entire 16GB unified memory
during training. Dropped entirely. The offline trajectory dataset is the substitute.
Importance Weighting (DROP)
Standard off-policy RL uses importance weights `pi(y|x) / pi_behavior(y|x)` to correct
for distribution shift between the behavior policy (which collected trajectories) and
the current training policy. OAPL explicitly claims stability without importance
weighting due to the KL constraint. With offline data, we cannot compute `pi_behavior`
reliably anyway — the trajectories were generated by Claude Code (Anthropic API), not by
our local model. Dropped.
Compression Steps (DROP)
KARL handles long trajectories by inserting compression boundaries where the model
summarizes prior context before continuing. Our trajectories are shorter (median ~15
tool calls vs. KARL's 50-200 steps), and we're training on the full ChatML sequence
within MLX's 256-token max sequence length. No compression infrastructure needed.
For longer trajectories (>256 tokens), we truncate at the most informative prefix
(see Section 2.5).
Multi-Task Transfer (DROP for now)
KARL trains on 6 enterprise search task types simultaneously and shows generalization.
Path B trains on a single task distribution: Claude Code tool-use for software
engineering. Multi-task extension is a Stage 2 option.
Pass-Rate Filtering (SIMPLIFY)
KARL filters out trivially solved AND trivially failed examples (the pass-rate filter).
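Path B uses a simpler quality gate: discard trajectories where `reward < -0.3` (noise
floor) or `reward > 0.95` (trivially easy, no learning signal). Keep the middle band
of genuinely informative trajectories. Because the composite reward in Section 3 is
bounded to [0, 1], the lower threshold only takes effect if rewards are rescaled to a
signed range.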
1.4 The Resulting Simplified System
OAPL-Lite Pipeline:
verbose-all.jsonl (3,249 entries)
|
v
trajectory_extractor.py -- filter to tool-bearing entries (157 found)
|
v
reward_computer.py -- r(x,y): [0,1] per trajectory
|
v
advantage_weighter.py -- A = r - V_baseline per domain
|
v
sft_formatter_trajectory.py -- advantage-weighted ChatML examples
| (each example repeated ceil(A*k) times OR
| per-example weight stored as metadata)
v
MLX LoRA training on Mac5 -- python3 -m mlx_lm lora
|
v
adapter_v2 (tool-use specialist)
|
v
fused model at :8100 -- hot-swapped via finetune-daemon.py
---
2. Trajectory-to-SFT Conversion
2.1 Input Data Inventory
From our direct inspection of `verbose-all.jsonl`:
| Metric | Value |
|---|---|
| Total verbose entries | 3,249 |
| Entries with tool_calls | 157 |
| Max tool calls per entry | 144 |
| Median tool calls (for tool entries) | ~15 |
| Entries with 5+ tool calls | ~116 |
| Entries with exit codes parseable | ~90 |
| Entries with files_modified | ~40 |
| Entries with errors[] | ~15 |
The tool_call schema in verbose-all has the fields we need:
{
"tool_id": "call_...",
"tool_name": "shell_command",
"tool_type": "shell_command",
"parameters": {"command": "...", "exit_code": 0},
"result": "Exit code: 0\nOutput: ..."
}
Note: `tool_name` values in verbose-all are the source-tool names from various agents
(shell_command, exec_command, run_terminal_cmd, read_file, codebase_search, update_plan).
These differ from Claude Code's canonical names (Bash, Read, Edit, Write, Glob, Grep).
The converter must normalize these.
2.2 Trajectory Extraction Logic
New file: `[home-path]` (~180 lines)
import re
from typing import Optional

# ToolEvent, TrajectoryExample, and _parse_duration are assumed to be defined
# elsewhere in this module.

TOOL_NAME_MAP = {
    "shell_command": "Bash",
    "exec_command": "Bash",
    "run_terminal_cmd": "Bash",
    "read_file": "Read",
    "write_file": "Write",
    "codebase_search": "Grep",
    "glob_file_search": "Glob",
    "update_plan": "Write",
    "view_file": "Read",
}

_EXIT_CODE_RE = re.compile(r"Exit code: (\d+)")

def extract_trajectory(entry: dict) -> Optional[TrajectoryExample]:
    """Convert a VerbosePromptEntry to a TrajectoryExample for OAPL training."""
    prompt = (entry.get("prompt_text") or "").strip()
    if len(prompt) < 15:
        return None
    # Skip hook/system prompts
    if prompt.startswith(("<system-reminder>", "SessionStart:")):
        return None
    # Extract tool sequence from assistant_turns
    tool_events = []
    for turn in entry.get("assistant_turns") or []:
        for tc in turn.get("tool_calls") or []:
            raw_name = tc.get("tool_name") or tc.get("tool_type") or "Unknown"
            canonical_name = TOOL_NAME_MAP.get(raw_name, raw_name)
            result_str = str(tc.get("result") or "")
            # Parse exit code: 0 = success, any non-zero code is recorded as 1,
            # None when the result carries no exit code
            match = _EXIT_CODE_RE.search(result_str)
            exit_code = None
            if match:
                exit_code = 0 if int(match.group(1)) == 0 else 1
            tool_events.append(ToolEvent(
                tool_name=canonical_name,
                parameters=tc.get("parameters") or {},
                exit_code=exit_code,
                result_preview=result_str[:200],
                duration_ms=_parse_duration(result_str),
            ))
    if len(tool_events) < 2:
        return None  # Require at least 2 tool calls for trajectory learning
    return TrajectoryExample(
        prompt_id=entry.get("prompt_id", ""),
        session_id=entry.get("session_id", ""),
        prompt_text=prompt,
        tool_events=tool_events,
        files_modified=entry.get("files_modified") or [],
        files_created=entry.get("files_created") or [],
        errors=entry.get("errors") or [],
        git_repo=entry.get("git_repo"),
        intent_classification=entry.get("intent_classification"),
        complexity_score=entry.get("complexity_score"),
    )
2.3 Training Example Format
Each trajectory becomes a ChatML example where the assistant response encodes the
optimal tool-use reasoning chain. The key design decision: we train on the observed
tool sequence as the "correct" answer, weighted by the advantage of the outcome.
{
"messages": [
{
"role": "system",
"content": "You are a Cognitive Twin — a personal AI delegate for software engineering tasks. When given a task, reason through the optimal tool-use sequence. Consider which tools to call, in what order, and what success looks like."
},
{
"role": "user",
"content": "Task: deploy the flows to cloud-vm\n\nContext:\n- Domain: deploy\n- Skill: ops:deploy\n- Relevant gotchas: SSH heredoc mangles variables; port 8080 is Docker proxy\n- Git state: branch=main, repo=mohameddiomande\n\nWhat is your tool-use plan?"
},
{
"role": "assistant",
"content": "<thinking>\nDeploy task targeting cloud-vm. Need to: read docker-compose config, verify service state, run systemctl restart via SSH, confirm success.\n</thinking>\n\nTool sequence: Read(docker-compose.yml) → Bash(ssh cloud-vm 'systemctl status') → Bash(ssh cloud-vm 'systemctl restart prefect') → Bash(verify)\n\nKey checks:\n1. Read docker-compose.yml first to confirm service names\n2. SSH commands must be single-quoted to avoid heredoc variable expansion\n3. Verify exit_code=0 on systemctl restart before marking complete"
}
],
"advantage_weight": 0.73
}
The `advantage_weight` field is used by the custom training loop to modulate per-example
loss contribution. Examples with high advantage (trajectory was much better than average)
are up-weighted. Examples near the baseline are weighted ~1.0. Negative-advantage examples
are still included but with reduced weight, serving as negative contrast examples.
2.4 Trajectory Context Construction
The user message is not just the raw prompt — it includes contextual fields that exist in
`verbose-all.jsonl` and are already extracted by the response hook:
| Field | Source | Purpose |
|---|---|---|
| `prompt_text` | Direct from entry | Core task description |
| `intent_classification` | response_hook.py | Domain label (deploy, ios, git, etc.) |
| `complexity_score` | response_hook.py | Float [0,1] indicating task complexity |
| `git_repo` | Direct from entry | Project context |
| `git_branch` | Direct from entry | Branch context |
| `skill_name` | Matched from ops_trigger invocation_records | Which skill was injected (if any) |
| `skill_gotchas` | Loaded from SKILL.md | Hard-won gotchas as context |
The skill gotchas are the bridge between the static SKILL.md system and the learned
trajectories: we inject the gotchas into the training context so the model learns to
reason through them, not just pattern-match on prompt text.
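As an illustration of how these fields could be assembled into the user message, the sketch below reproduces the layout of the example in Section 2.3. The function name `build_user_message` and its signature are hypothetical; only the message layout comes from the example above.

```python
from typing import List, Optional


def build_user_message(
    prompt_text: str,
    domain: Optional[str] = None,
    skill_name: Optional[str] = None,
    skill_gotchas: Optional[List[str]] = None,
    git_branch: Optional[str] = None,
    git_repo: Optional[str] = None,
) -> str:
    """Assemble the trajectory context; fields that are missing are omitted."""
    lines = [f"Task: {prompt_text}", "", "Context:"]
    if domain:
        lines.append(f"- Domain: {domain}")
    if skill_name:
        lines.append(f"- Skill: {skill_name}")
    if skill_gotchas:
        lines.append("- Relevant gotchas: " + "; ".join(skill_gotchas))
    git_parts = []
    if git_branch:
        git_parts.append(f"branch={git_branch}")
    if git_repo:
        git_parts.append(f"repo={git_repo}")
    if git_parts:
        lines.append("- Git state: " + ", ".join(git_parts))
    lines += ["", "What is your tool-use plan?"]
    return "\n".join(lines)
```

Keeping the gotchas on a single line helps the context fit within the 256-token sequence limit.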
2.5 Sequence Length Management
MLX is configured with `MAX_SEQ_LENGTH = 256` in the finetune-daemon. Tool sequences
with 20+ steps easily exceed this. Strategy:
1. Compress the tool sequence representation: Instead of reproducing full tool
parameters, encode the sequence as a compact string: `Read(docker-compose.yml) →
Bash[0] → Bash[0] → Bash[0]` where `[exit_code]` is appended only when informative.
2. Prioritize the reasoning prefix: The most learning-relevant part is the initial
reasoning in `<thinking>` blocks and the first 8 tool calls. Truncate at step 8 if
needed; the completion reward already captures the whole-trajectory outcome.
3. For long trajectories (>50 tool calls): Sample 3 non-overlapping 8-step windows
— the opening window, the highest-failure-density window, and the closing window.
Each window becomes a separate training example with the same reward signal. This
increases data diversity without requiring sequence length expansion.
4. Future option: Increase `MAX_SEQ_LENGTH` to 512 or 1024 for the LoRA run. Mac5's
M4 16GB can handle this for Gemma-3-1b at 4-bit — the memory constraint is on batch
size, not sequence length. With `batch_size=1` and `num_layers=4`, 512 tokens is
feasible. Measure peak memory before committing.
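Strategies 1 to 3 above can be sketched as follows. This is a minimal illustration, not the formatter's actual code: the `Step` tuple, `encode_sequence`, and `sample_windows` are hypothetical names, and "failure density" is assumed to mean the count of non-zero exit codes in a window.

```python
from typing import List, Optional, Tuple

# (tool_name, short_argument, exit_code); hypothetical compact step record
Step = Tuple[str, Optional[str], Optional[int]]


def encode_sequence(steps: List[Step], max_steps: int = 8) -> str:
    """Compact encoding: Name(arg) when an argument is known, else Name[exit_code]."""
    parts = []
    for name, arg, code in steps[:max_steps]:
        if arg:
            parts.append(f"{name}({arg})")
        elif code is not None:
            parts.append(f"{name}[{code}]")
        else:
            parts.append(name)
    return " → ".join(parts)


def sample_windows(steps: List[Step], width: int = 8) -> List[List[Step]]:
    """Opening, highest-failure-density, and closing windows, non-overlapping."""
    n = len(steps)
    if n < 3 * width:
        return [steps[:width]]  # too short for three disjoint windows

    def failures(start: int) -> int:
        window = steps[start:start + width]
        return sum(1 for _, _, code in window if code not in (None, 0))

    # Middle window must start after the opening and end before the closing one.
    best = max(range(width, n - 2 * width + 1), key=failures)
    return [steps[:width], steps[best:best + width], steps[n - width:]]
```

Each returned window would then be encoded with `encode_sequence` and emitted as its own training example carrying the trajectory's reward.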
---
3. Reward Function
3.1 Design Principles
The reward function is the core engineering challenge of OAPL-Lite. Unlike KARL's
nugget-based accuracy (which compares retrieved documents against ground-truth answers),
we must infer task success from observable side-effects. Three principles:
1. Use process signals, not just outcome: Exit codes, file modification counts,
and absence of error arrays are direct evidence of process quality, not just
final outcome. A trajectory that hit 3 Bash exit_code=1 errors and then succeeded
is worse than one that succeeded on the first attempt.
2. Use session-lagged signals sparingly: Correction signals from the next prompt
are the strongest available outcome indicator, but they require cross-turn linkage.
The reward function computes a within-trajectory score that is available immediately,
then enriches it with a session-lagged correction signal when available.
3. Normalize per skill-domain bucket: Absolute rewards are not comparable across
tasks. A deploy trajectory and a git trajectory have different baseline difficulty.
Normalize by subtracting the per-domain mean reward (deploy, ios, git, supabase, etc.)
to produce the advantage A = r - V_baseline used in OAPL-Lite.
3.2 Reward Signal Components
Component 1: Process Cleanliness (weight 0.35)
Measures how clean the tool-use process was, independent of outcome.
from typing import List

def process_cleanliness(events: List[ToolEvent]) -> float:
    """[0, 1]. Higher = cleaner execution."""
    if not events:
        return 0.5
    bash_events = [e for e in events if e.tool_name == "Bash"]
    if not bash_events:
        return 0.85  # Read-only trajectories are inherently clean
    fail_count = sum(1 for e in bash_events if e.exit_code == 1)
    total_count = len(bash_events)
    fail_rate = fail_count / total_count
    # Penalty scaling: first failure is less bad (could be expected check),
    # multiple consecutive failures are bad (retry loops without progress)
    consecutive_fails = _max_consecutive(bash_events, lambda e: e.exit_code == 1)
    consecutive_penalty = min(0.4, consecutive_fails * 0.1)
    return max(0.0, 1.0 - fail_rate * 0.6 - consecutive_penalty)
Component 2: File Modification Signal (weight 0.25)
For software engineering tasks, modifying the right files is a proxy for task completion.
def file_modification_signal(
    entry: TrajectoryExample,
    domain: str,
) -> float:
    """[0, 1]. Higher = more evidence of productive file changes."""
    modified = len(entry.files_modified)
    created = len(entry.files_created)
    if domain in ("ios", "deploy", "docker"):
        # These domains should modify config/source files
        if modified + created == 0:
            return 0.2  # Low: task should have changed something
        if modified >= 1:
            return min(1.0, 0.5 + modified * 0.15)
        # Only new files were created: falls through to the default below
    elif domain in ("git", "monitoring"):
        # Commits, restarts: file changes are optional
        if modified + created >= 1:
            return 0.8
        return 0.6  # Neutral: not modifying files is fine for git/monitor
    elif domain == "debug":
        # Debug might not modify files (just diagnosis)
        return 0.7  # Neutral
    return 0.5  # Default neutral
Component 3: Error Array Signal (weight 0.20)
The `errors` field in VerbosePromptEntry captures hook-detected errors (non-zero exits
that the hook classified as errors, Python tracebacks surfaced via response_hook).
def error_signal(entry: TrajectoryExample) -> float:
    """[0, 1]. Higher = fewer detected errors."""
    error_count = len(entry.errors)
    if error_count == 0:
        return 1.0
    elif error_count == 1:
        return 0.6  # Single error, may have been recovered
    elif error_count <= 3:
        return 0.3
    else:
        return 0.1
Component 4: Correction Signal (weight 0.20 when available, else 0)
Requires cross-turn lookup. The reward computer checks if any `correction` type
CortexEntry (in `[home-path]`) is timestamped within 120 seconds
after the trajectory's `captured_at` and shares the same `session_id`. If found:
def correction_signal(trajectory: TrajectoryExample) -> Optional[float]:
    """[0, 1] or None if correction data not available."""
    entries = load_cortex_entries_for_session(trajectory.session_id)
    # Entries logged within 2 minutes after the trajectory was captured.
    # total_seconds() is used because timedelta.seconds ignores whole days.
    followups = [
        e for e in entries
        if 0 < (e.timestamp - trajectory.captured_at).total_seconds() < 120
    ]
    if not followups:
        # No follow-up activity: could mean no correction happened OR
        # the session ended before the next prompt.
        return None  # Do not penalize; treat as missing data
    if any(e.type == "correction" for e in followups):
        return 0.0  # Strong negative: user had to correct within 2 minutes
    # Assumes any non-correction entry in the window means the session continued
    return 1.0  # Correction absent in next prompt (2-minute window)
3.3 Composite Reward Computation
def compute_reward(
    entry: TrajectoryExample,
    domain: str,
) -> RewardResult:
    """Compute composite reward r in [0, 1]."""
    w_process = 0.35
    w_file = 0.25
    w_error = 0.20
    w_correction = 0.20
    r_process = process_cleanliness(entry.tool_events)
    r_file = file_modification_signal(entry, domain)
    r_error = error_signal(entry)
    r_correction = correction_signal(entry)
    if r_correction is None:
        # Redistribute correction weight to other signals
        total_other = w_process + w_file + w_error
        r = (r_process * w_process / total_other +
             r_file * w_file / total_other +
             r_error * w_error / total_other)
        correction_available = False
    else:
        r = (r_process * w_process +
             r_file * w_file +
             r_error * w_error +
             r_correction * w_correction)
        correction_available = True
    return RewardResult(
        reward=r,
        process_component=r_process,
        file_component=r_file,
        error_component=r_error,
        correction_component=r_correction,
        correction_available=correction_available,
    )
3.4 Advantage Computation (The OAPL Core)
from collections import defaultdict
from typing import List

# beta=0.1 is from KARL paper; we use 0.05 for tighter KL constraint on M4
BETA = 0.05

def compute_advantages(
    trajectories: List[TrajectoryWithReward],
) -> List[TrajectoryWithAdvantage]:
    """
    Compute OAPL-style advantages: A = (r - V_baseline) / beta, clipped to [-2, 2].
    V_baseline = per-domain mean reward (a simple baseline, not the soft optimal value).
    For full OAPL: V*(x) = beta * ln((1/G) * sum_i exp(r_i / beta)).
    With offline data and no grouped rollouts, we use the domain mean as proxy.
    """
    domain_rewards = defaultdict(list)
    for t in trajectories:
        domain_rewards[t.domain].append(t.reward.reward)
    domain_baselines = {
        domain: sum(rewards) / len(rewards)
        for domain, rewards in domain_rewards.items()
    }
    result = []
    for t in trajectories:
        baseline = domain_baselines.get(t.domain, 0.5)
        advantage = t.reward.reward - baseline
        # Divide by beta: in the OAPL objective the target log-ratio
        # ln(pi/pi_ref) equals A / beta
        soft_advantage = advantage / BETA
        # Clip to prevent extreme weighting. With BETA = 0.05, any
        # |r - baseline| > 0.1 saturates the clip.
        clipped_advantage = max(-2.0, min(2.0, soft_advantage))
        result.append(TrajectoryWithAdvantage(
            **t.__dict__,
            advantage=clipped_advantage,
            baseline=baseline,
        ))
    return result
3.5 Reward Signal Calibration (Expected Values)
Based on our inspection of 157 tool-bearing trajectories:
| Domain | Expected Baseline r | Notes |
|---|---|---|
| deploy | ~0.55 | Mix of clean deploys and retry loops |
| ios | ~0.45 | Higher Bash failure rate (xcodebuild is noisy) |
| git | ~0.70 | Usually clean, simple tool sequences |
| debug | ~0.40 | Expected high failure rates (exploratory) |
| supabase | ~0.60 | Mostly read + verify patterns |
| monitoring | ~0.65 | Typically clean check + report patterns |
Trajectories with advantage >= +0.3 (much better than domain average) become the
positive training signal. Trajectories with advantage <= -0.3 become negative contrast
examples. The ~60
---
4. Training Pipeline: End-to-End
4.1 Data Pipeline Steps
Step 1: Extract (trajectory_extractor.py)
Input: [home-path] (3,249 entries)
Filter: tool_count >= 2, non-system prompts
Output: ~157 TrajectoryExample objects
Step 2: Reward (reward_computer.py)
Input: 157 TrajectoryExamples
Compute: r per trajectory using 4-component formula
Enrich: correction signals from cortex/entries.jsonl where available
Output: 157 TrajectoryWithReward objects
Step 3: Advantage (advantage_weighter.py)
Input: 157 TrajectoryWithReward
Compute: domain baselines, soft advantages per OAPL formula
Filter: discard if |advantage| < 0.05 (genuinely uninformative)
Output: ~130 TrajectoryWithAdvantage objects (estimate)
Step 4: Format (sft_formatter_trajectory.py)
Input: 130 TrajectoryWithAdvantage
Convert: each → ChatML example with trajectory context + tool reasoning
Handle: sequences >256 tokens via windowing strategy (Section 2.5)
Augment: high-advantage trajectories (A > 0.5) duplicated 2x for emphasis
Output: ~180-200 training examples (ChatML JSONL)
Step 5: Split (80/10/10)
Train: ~145 examples
Valid: ~18 examples
Test: ~18 examples
Note: small test set — primary evaluation via A/B comparison (Section 5)
Step 6: Train (MLX LoRA on Mac5)
Command: python3 -m mlx_lm lora \
--model mlx-community/gemma-3-1b-it-4bit \
--data [home-path] \
--adapter-path [home-path] \
--train \
--iters 1000 \
--batch-size 1 \
--num-layers 4 \
--max-seq-length 256 \
--learning-rate 5e-5
Expected duration: ~400s (2x adapter v1's 188.4s for 2x iterations)
Expected loss: target < 1.5 (vs. v1's 1.694 baseline)
Step 7: Fuse and hot-swap
Command: python3 -m mlx_lm fuse \
--model mlx-community/gemma-3-1b-it-4bit \
--adapter-path [home-path] \
--save-path [home-path]
Then: restart MLX server at :8100 via finetune-daemon.py hot_swap_adapter(2)
4.2 Advantage-Weighted SFT: Implementation Detail
Standard MLX LoRA does not support per-example loss weights natively. Path B uses
two complementary strategies:
Strategy A: Repeat-based upweighting (simpler)
High-advantage examples (A > 0.5) are duplicated `ceil(A * 2)` times in the training
JSONL. Example with A=0.8 appears twice; A=1.2 appears three times. Negative-advantage
examples (A < -0.3) are included once but with `instruction` modified to include a
"what NOT to do" framing:
{
"messages": [
{"role": "system", "content": "..."},
{"role": "user", "content": "Task: deploy to cloud-vm\n\n[CONTRAST EXAMPLE: The following approach led to failures. Identify what went wrong.]"},
{"role": "assistant", "content": "Problematic sequence: Bash(ssh cloud-vm 'systemctl restart') without first reading config → exit_code=1 repeated 3x → session abandoned without verification.\n\nWhat should have happened: Read(docker-compose.yml) first to confirm service name, then SSH with correct service identifier."}
]
}
Strategy B: Custom loss wrapper (future, Stage 2)
Patch MLX LoRA to accept a `--loss-weights` JSONL file mapping example index to
float weight. This is 20-30 lines of MLX Python and enables continuous advantage
weighting rather than discrete repetition. Deferred to Stage 2 because it requires
forking `mlx_lm`.
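Strategy A's repeat rule can be sketched in a few lines. The helper names `repeat_count` and `expand_examples` are illustrative; the threshold of 0.5 and the multiplier of 2 come from the description of Strategy A.

```python
import math
from typing import Dict, List


def repeat_count(advantage: float, threshold: float = 0.5, k: float = 2.0) -> int:
    """ceil(A * k) copies for high-advantage examples, one copy otherwise."""
    if advantage > threshold:
        return math.ceil(advantage * k)
    return 1


def expand_examples(examples: List[Dict]) -> List[Dict]:
    """Duplicate each example according to its `advantage_weight` field."""
    expanded: List[Dict] = []
    for example in examples:
        expanded.extend([example] * repeat_count(example["advantage_weight"]))
    return expanded
```

Repetition only approximates loss weighting in integer steps, which is the limitation Strategy B is meant to remove.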
4.3 Batch of Corrective Trajectories
The cortex/entries.jsonl has 399 entries including `correction` type records. These are
a pre-labeled negative dataset. Each correction entry links to a session_id and contains
the correction text. Cross-referencing against verbose-all trajectories from the same
session_id produces correction-linked negative examples with explicit `advantage = -1.0`
— the strongest possible negative training signal.
Expected yield: ~30-50 correction-linked trajectory pairs (estimate based on cortex's
399 entries and 157 tool trajectories sharing session IDs).
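A minimal sketch of the session-level join described above, assuming both datasets are loaded as lists of dicts with `session_id` keys and that cortex entries carry a `type` field. The function name and the `is_contrast` key are hypothetical.

```python
from typing import Dict, Iterable, List


def link_corrections(
    trajectories: Iterable[Dict],
    cortex_entries: Iterable[Dict],
) -> List[Dict]:
    """Return trajectories whose session contains at least one correction entry."""
    corrected_sessions = {
        entry["session_id"]
        for entry in cortex_entries
        if entry.get("type") == "correction" and entry.get("session_id")
    }
    linked = []
    for trajectory in trajectories:
        if trajectory.get("session_id") in corrected_sessions:
            # Copy the trajectory and attach the fixed negative advantage.
            linked.append({**trajectory, "advantage": -1.0, "is_contrast": True})
    return linked
```

Joining on `session_id` alone marks every trajectory in a corrected session. Applying the 120-second window from the correction signal in Section 3.2 would be a tighter rule if that proves too coarse.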
4.4 Incremental Training Protocol
Initial run (adapter v2):
- Source: all 157 tool trajectories from verbose-all.jsonl
- Expected training examples: ~180-200 after augmentation
- Training: 1,000 iterations, lr=5e-5
- Duration: ~400s on M4
Subsequent runs (adapter v3, v4, ...):
- Triggered when trajectory_tap.py (from Path A) accumulates 50+ new annotated trajectories
- Source: existing corpus + new annotated trajectories from trajectories.jsonl
- The finetune-daemon already watches for new turns and has the hot-swap mechanism
- Each run retrains from scratch (not fine-tuning on fine-tune) to avoid catastrophic
forgetting from iterated LoRA updates
Cadence: With ~30 active sessions/day and ~10% of them producing tool-bearing
trajectories, expect ~3-5 new tool trajectories/day. The 50-example threshold for
retraining would be hit in ~10-17 days, creating adapter v3 ~2 weeks after initial deployment.
---
5. A/B Evaluation Design
5.1 The Evaluation Problem
The fused model at :8100 is used by the cognitive twin pipeline — but evaluating it
against a production Claude API agent is not a fair comparison. The correct comparison
is:
Baseline: Current ops skill injection (static SKILL.md context + no learned
trajectory reasoning)
Treatment: Same SKILL.md injection + the OAPL-Lite fused model's tool-use plan
prepended to the context
5.2 Automated Evaluation Metrics
Metric 1: Plan Accuracy Score
For a held-out set of 18 test trajectories (see Section 4.1), prompt the fused model
with the trajectory context (minus the tool sequence) and measure how well its predicted
tool sequence matches the actual tool sequence using:
def plan_accuracy(predicted: List[str], actual: List[str]) -> float:
    """LCS-based recall: order matters, but extra or skipped steps are tolerated."""
    if not actual:
        return 0.0  # No reference sequence to compare against
    # Use longest-common-subsequence (LCS) normalized by actual length
    lcs_len = lcs(predicted, actual)
    return lcs_len / len(actual)
Expected baseline (before OAPL-Lite): ~0.2 (model has no trajectory training)
Target (after OAPL-Lite): ~0.5+ (model has internalized common tool sequences)
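`plan_accuracy` relies on an `lcs` helper that is not defined in this document. A standard dynamic-programming sketch, assuming it should return the LCS length over tool names:

```python
from typing import Sequence


def lcs(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of a and b (O(len(a) * len(b)))."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)  # extend the match diagonally
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

For example, a predicted plan of Read, Bash, Bash against an actual sequence of Read, Grep, Bash, Bash shares a subsequence of length 3, giving a plan accuracy of 3/4 = 0.75.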
Metric 2: First-Action Accuracy
The most actionable: does the model predict the correct first tool to call?
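def first_action_accuracy(predicted_plans: List, actual_trajectories: List) -> float:
    """Fraction of plans whose first tool matches the observed first tool."""
    pairs = [
        (pred, actual)
        for pred, actual in zip(predicted_plans, actual_trajectories)
        if actual  # Skip trajectories with no recorded tool calls
    ]
    if not pairs:
        return 0.0
    correct = sum(1 for pred, actual in pairs if pred and pred[0] == actual[0])
    return correct / len(pairs)
For deploy tasks: correct first action is usually Read (config file) before any Bash.
For ios tasks: Read (project.yml) before xcodebuild. For debug: Bash (check logs).
Expected baseline: ~0.35 (above the 0.25 uniform-chance rate for 4 common tools, since
first actions skew toward Read and Bash)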
Target: ~0.60+
Metric 3: Advantage-Weighted Prediction Quality
Weight each test example by its advantage score when computing metrics. High-advantage
examples (the genuinely difficult, genuinely successful trajectories) should be predicted
more accurately than low-advantage ones. If the model learns nothing useful, accuracy
will be uncorrelated with advantage. If it learns the right patterns, accuracy should
increase monotonically with advantage.
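One way to check this relationship is sketched below. Both function names are illustrative. Clipping negative advantages to zero when weighting is an assumption, since the text does not say how negative weights should be handled, and Pearson correlation is used here as a simple stand-in for a monotonicity test.

```python
import math
from typing import List


def advantage_weighted_accuracy(accuracies: List[float], advantages: List[float]) -> float:
    """Mean accuracy weighted by advantage; negative advantages get zero weight."""
    weights = [max(a, 0.0) for a in advantages]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * acc for w, acc in zip(weights, accuracies)) / total


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)
```

A correlation near zero between per-example advantage and per-example plan accuracy would match the "learned nothing useful" case described above.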
5.3 Live A/B via Pane Routing
After initial automated evaluation passes thresholds, deploy a live A/B via the pane
orchestrator:
# In pane_orchestrator controller.py sense() phase:
pane_count = len(all_panes)
karl_panes = pane_count // 2  # 50% of panes get KARL treatment
for i, pane in enumerate(all_panes):
    if i < karl_panes:
        # Treatment: inject tool-use plan from fused model before skill content
        pane.set_flag("karl_oapl_lite", True)
    else:
        # Control: standard skill injection only
        pane.set_flag("karl_oapl_lite", False)
The ops_trigger.py hook checks the `karl_oapl_lite` flag. When set, it POSTs the
prompt to the MLX server at :8100 and prepends the returned tool plan to the SKILL.md
injection.
# In ops_trigger.py, after skill match:
if _is_karl_pane(session_id):
    try:
        plan = _query_mlx_server(prompt_text, skill_name, timeout_ms=200)
        injection = f"[Learned Tool Plan]\n{plan}\n\n" + skill_content
    except Exception:
        # Catch Exception rather than using a bare except, so that
        # KeyboardInterrupt and SystemExit still propagate
        injection = skill_content  # Fall back to baseline on timeout or error
The 200ms timeout is critical: the MLX server on Mac5 (Tailscale IP [ip])
takes ~150ms for a 256-token inference. Total overhead: ~150ms + network = within budget
if the hook budget is 500ms and this check runs after the primary skill match.
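A possible shape for the timeout-bounded query, using only the standard library. Everything specific here is an assumption: the URL and the chat-completions request and response layout are placeholders for whatever the server at :8100 actually exposes, and the function names are hypothetical.

```python
import json
import urllib.request
from typing import Dict

# Hypothetical endpoint; substitute the real host and route of the MLX server.
MLX_URL = "http://127.0.0.1:8100/v1/chat/completions"


def build_plan_request(prompt_text: str, skill_name: str, max_tokens: int = 256) -> Dict:
    """Request body in an assumed chat-completions layout."""
    content = (
        f"Task: {prompt_text}\n\nContext:\n- Skill: {skill_name}\n\n"
        "What is your tool-use plan?"
    )
    return {"messages": [{"role": "user", "content": content}], "max_tokens": max_tokens}


def query_mlx_server(prompt_text: str, skill_name: str,
                     timeout_ms: int = 200, url: str = MLX_URL) -> str:
    body = json.dumps(build_plan_request(prompt_text, skill_name)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    # urlopen takes its timeout in seconds.
    with urllib.request.urlopen(request, timeout=timeout_ms / 1000.0) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


def plan_or_fallback(prompt_text: str, skill_name: str, skill_content: str,
                     timeout_ms: int = 200, url: str = MLX_URL) -> str:
    """Prepend the learned plan, or return the plain skill content on any failure."""
    try:
        plan = query_mlx_server(prompt_text, skill_name, timeout_ms, url)
    except (OSError, ValueError, KeyError, IndexError):
        return skill_content  # timeouts and connection errors are OSError subclasses
    return f"[Learned Tool Plan]\n{plan}\n\n{skill_content}"
```

Note that the `urlopen` timeout bounds each blocking socket operation rather than total wall-clock time, so a hard 200ms ceiling would need an outer deadline as well.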
5.4 Evaluation Period and Decision Gate
Run the live A/B for 2 weeks (consistent with Path A's accumulation window). Compute:
lift = treatment.success_rate - control.success_rate
where `success_rate` is computed from trajectory annotations (Path A's tap, if deployed
in parallel). If `lift >= 0.05` (5 percentage point improvement), promote the KARL
treatment to 100% of panes. Otherwise, review tool sequence quality and revise the
reward function before the next training run.
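The decision gate amounts to a small calculation, sketched here under the assumption that each arm's outcomes are available as a list of booleans from the trajectory annotations. The names `success_rate` and `decide` and the returned labels are illustrative.

```python
from typing import Sequence, Tuple


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of successful trajectories; 0.0 for an empty arm."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def decide(treatment: Sequence[bool], control: Sequence[bool],
           min_lift: float = 0.05) -> Tuple[float, str]:
    """Return (lift, decision) where decision is 'promote' or 'revise'."""
    lift = success_rate(treatment) - success_rate(control)
    return lift, ("promote" if lift >= min_lift else "revise")
```

For example, 7 successes out of 10 in treatment against 6 out of 10 in control gives a lift of 0.10, which clears the 0.05 threshold.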
---
6. Integration with numu-weave Pipeline
6.1 Current numu-weave Architecture
The `NUMUWeave` class (`[home]/bin/numu-daemon/packages/numu-weave/src/index.ts`)
is the existing cognitive twin pipeline connector with three stations:
1. Corpus Builder (`addCorpusEntries`, `exportCorpus`): Accepts `CorpusEntry[]`
with `{instruction, input, output, source}`. Currently fed by `sft-formatter.py`
(text-only SFT pairs from verbose-all).
2. Fine-Tune Trainer (`startTraining`, `checkTrainingStatus`): POSTs corpus to
the finetune-daemon at Mac5 `:9200/train`. Watches `/status` endpoint.
3. Evaluator (`evaluate`): A/B comparison between base and fused model. Currently
placeholder scores (0.72 base, 0.78 fused) — not wired to real metrics.
6.2 Extension Points for OAPL-Lite
Extension 1: New source type for trajectory corpus
Add `"trajectory"` as a valid `source` in `CorpusEntry`:
export interface CorpusEntry {
id: string;
instruction: string;
input: string;
output: string;
source: "prompt-logger" | "memory" | "thread" | "trajectory"; // ADD trajectory
createdAt: string;
// New fields for trajectory entries:
advantageWeight?: number; // [-2.0, 2.0] from OAPL advantage computation
domain?: string; // "deploy" | "ios" | "git" | etc.
reward?: number; // [0, 1] raw reward
isContrast?: boolean; // True for negative-advantage examples
}
Extension 2: Advantage-weighted corpus export
The current `exportCorpus()` outputs uniform JSONL. Add an `exportWeighted()` method
that applies repeat-based upweighting (Strategy A from Section 4.2):
exportWeighted(): string {
  return this.corpus.flatMap((e) => {
    // Strategy A: only high-advantage, non-contrast entries are repeated,
    // ceil(A * 2) times; entries without an advantage weight appear once.
    const a = e.advantageWeight;
    const repeats =
      a !== undefined && !e.isContrast && a > 0.5 ? Math.ceil(a * 2) : 1;
    const line = JSON.stringify({
      instruction: e.instruction,
      input: e.input,
      output: e.output,
    });
    return new Array<string>(repeats).fill(line);
  }).join("\n");
}
Extension 3: Wire evaluator to trajectory metrics
Replace the placeholder evaluation in `evaluate()` with a real call to the
`skill_metrics.json` endpoint (from Path A's metrics aggregator):
async evaluate(): Promise<EvalResult> {
  // Fetch from Dashboard API
  const res = await fetch("http://[ip]:8421/api/karl/skill-metrics");
  const metrics = await res.json() as SkillMetrics;
  const baseScore = metrics.baseline?.no_skill_success_rate ?? 0.55;
  const fusedScore = this._computeMeanSuccessRate(metrics.skills);
  const improvement = fusedScore - baseScore;
  return { ..., baseScore, fusedScore, improvement };
}
Extension 4: Trajectory feeder script
New Python script: `[home-path]`
"""
Feed OAPL-Lite trajectory corpus into numu-weave via HTTP.
The finetune-daemon exposes POST /train. This script:
1. Runs trajectory_extractor.py
2. Runs reward_computer.py + advantage_weighter.py
3. Runs sft_formatter_trajectory.py
4. Writes output to Desktop/homelab/compute-pair/sft-output/oapl-lite.jsonl
5. Signals finetune-daemon: POST http://[ip]:9200/train
with {"trigger": true, "source": "oapl-lite"}
"""
This preserves the existing data flow that `finetune-daemon.py` already implements:
we add `oapl-lite.jsonl` as a new source alongside the existing `prompt-sft.jsonl` and
`browser-sft/train.jsonl` in `merge_training_data()`.
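The feeder steps listed in the docstring can be sketched roughly as follows. This is a minimal sketch, not the actual script: it assumes each stage is a standalone Python file that runs with no arguments, and the names `STAGES`, `run_stages`, and `build_train_request` are hypothetical. The request is only built here; sending it is left to the caller.

```python
import json
import subprocess
import urllib.request

# Stage scripts named in the docstring; invoking them without arguments is an assumption.
STAGES = [
    "trajectory_extractor.py",
    "reward_computer.py",
    "advantage_weighter.py",
    "sft_formatter_trajectory.py",
]


def run_stages(stages, runner=subprocess.run):
    """Run each stage in order and stop at the first non-zero exit code."""
    for script in stages:
        result = runner(["python3", script], check=False)
        if result.returncode != 0:
            raise RuntimeError(f"stage failed: {script}")


def build_train_request(daemon_url):
    """Build (but do not send) the POST /train signal for the finetune-daemon."""
    payload = json.dumps({"trigger": True, "source": "oapl-lite"}).encode("utf-8")
    return urllib.request.Request(
        daemon_url.rstrip("/") + "/train",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A caller would run `run_stages(STAGES)` and then pass the result of `build_train_request(...)` to `urllib.request.urlopen` with a timeout. Keeping the request construction separate makes the payload easy to check without a live daemon.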
6.3 finetune-daemon.py Integration
The existing daemon at `Desktop/homelab/compute-pair/finetune-daemon.py` already has
the complete MLX training + hot-swap pipeline (lines 246-422 — confirmed by reading the
file). Path B only needs to add one new data source to `merge_training_data()`:
sources = [
    SFT_OUTPUT_DIR / "prompt-sft.jsonl",
    Path.home() / "Desktop/homelab/compute-pair/browser-sft/train.jsonl",
    DPO_DIR / "dpo-chosen.jsonl",  # Future: race protocol
    Path.home() / ".claude/karl/oapl-lite.jsonl",  # ADD: OAPL-Lite trajectories
]
The training command in `run_mlx_training()` already uses the correct CLI flags
(`python3 -m mlx_lm lora`, `--num-layers`) that avoid the known v0.29+ gotchas.
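For illustration, a merge step over a list of JSONL sources might look like the sketch below. It is not the daemon's actual `merge_training_data()`, whose body is not shown here; `merge_jsonl_sources` is a hypothetical helper. One design point matters for OAPL-Lite: the sketch deliberately does not deduplicate, because repeat-based upweighting relies on repeated lines surviving the merge.

```python
import json
from pathlib import Path


def merge_jsonl_sources(sources, out_path):
    """Concatenate JSONL sources into out_path.

    Missing files are skipped; blank or unparseable lines are dropped.
    Duplicate lines are kept on purpose so repeat-based upweighting survives.
    Returns a dict mapping each source path that was read to its kept line count.
    """
    counts = {}
    with open(out_path, "w", encoding="utf-8") as out:
        for src in map(Path, sources):
            if not src.exists():
                continue
            kept = 0
            with open(src, encoding="utf-8") as fh:
                for line in fh:
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        json.loads(line)
                    except json.JSONDecodeError:
                        continue
                    out.write(line + "\n")
                    kept += 1
            counts[str(src)] = kept
    return counts
```

The returned counts make it easy to log how many examples each source contributed to a training run.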
---
7. Scale Analysis
7.1 Training Data Scale
| Stage | Trajectory count | SFT examples (after augment) | Notes |
|---|---|---|---|
| Initial (verbose-all) | 157 | ~180-200 | Includes repeat upweighting |
| + correction pairs | +30-50 | +30-50 | Negative contrast examples |
| After 2 weeks (Path A tap) | +100 | +150 | New annotated trajectories |
| After 4 weeks | +250 | +375 | Approaching KARL's Stage I scale |
| After 8 weeks | +500 | +750 | Competitive with adapter v1's 972 examples |
Concern: 157 initial examples is small. KARL used 1,218+ examples in Stage I.
The risk of overfitting is real. Mitigations:
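- Validation loss is reported during training, so divergence between val loss and
train loss is visible early. As far as we know, `mlx_lm lora` only reports the
metric and does not stop on its own, so the stop rule from Section 9.4 has to be
applied by the daemon or by hand
- Restricting to `--num-layers 4` (4 LoRA layers out of the 26 in Gemma-3-1b) limits
adapter capacity, reducing overfitting risk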
- We do NOT train on the test set (18 held-out examples from Section 5.2)
7.2 Mac5 Memory Constraints
M4 16GB unified memory analysis for training:
| Component | Memory | Notes |
|---|---|---|
| Base model (Gemma-3-1b at 4-bit) | ~1.5 GB | Fixed overhead |
| LoRA adapter weights (4 layers) | ~50 MB | Small relative to base |
| Optimizer state (Adam) | ~200 MB | First and second moments (plus gradients) for LoRA params |
| Activation memory (batch=1, seq=256) | ~800 MB | Forward pass intermediate values |
| KV cache | ~300 MB | Attention key/value states |
| Total estimated | ~2.9 GB | Well within 16GB |
Increasing to seq=512 would add ~600MB of activation memory, which is still feasible
(~3.5GB total). Increasing batch_size to 2 at seq=256 would instead double activation
memory to ~1.6GB, which is also feasible (~3.7GB total). Mac5 has headroom for more
aggressive training than the current defaults.
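The budget arithmetic can be checked with a few lines of Python. The component figures below are the estimates from the table, not measurements, and treating 16GB as 16,000 MB is a simplification for a rough check.

```python
# Estimates in MB, copied from the memory table (not measured values).
BASELINE_MB = {
    "base_model_4bit": 1500,
    "lora_weights": 50,
    "optimizer_state": 200,
    "activations_batch1_seq256": 800,
    "kv_cache": 300,
}
TOTAL_MEMORY_MB = 16_000  # 16GB unified memory, rounded for a rough check


def total_mb(components, extra_mb=0):
    """Sum the component estimates plus any additional allocation."""
    return sum(components.values()) + extra_mb


baseline = total_mb(BASELINE_MB)               # table total
seq512 = total_mb(BASELINE_MB, extra_mb=600)   # +600 MB activations at seq=512

print(f"baseline:            {baseline / 1000:.2f} GB")
print(f"seq=512:             {seq512 / 1000:.2f} GB")
print(f"headroom at seq=512: {(TOTAL_MEMORY_MB - seq512) / 1000:.2f} GB")
```

The baseline sums to 2,850 MB, which rounds to the ~2.9 GB shown in the table.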
MLX Server at :8100 during training: The fused model server consumes ~2.5GB when
active. Concurrent training + serving would require ~5.4GB — still well within 16GB.
However, to prevent interference, the finetune-daemon already kills the server before
training (line 381: `subprocess.run(["pkill", "-f", "mlx_lm.*server"])`) and restarts
it after hot-swap. This is the correct protocol.
7.3 Training Time Estimate
Based on adapter v1 benchmark: 188.4s for 500 iterations on 972 examples.
Time per iteration ≈ 188.4s / 500 = 0.377s/iter
Path B initial run:
- 1,000 iterations
- ~180 examples (smaller corpus, similar per-example cost)
- Estimated: 1,000 * 0.377s ≈ 377s (~6.3 minutes)
Future runs with 500 examples:
- 1,000 iterations
- Estimated: ~380s (per-iteration cost at batch_size=1 with gradient accumulation
is set by sequence length, not by corpus size)
Total pipeline time including extraction, reward computation, formatting:
- Extraction + reward: ~30s (reading 3,249 JSON lines + Python computation)
- MLX training: ~380s
- Fusion: ~60s
- Server restart: ~10s
- Total: ~8 minutes end-to-end
This is well within finetune-daemon's 10-minute training timeout.
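The estimate can be reproduced from the adapter v1 benchmark. This is a small sketch under the same assumption used in the text, namely that per-iteration cost carries over from v1; the function names are illustrative.

```python
V1_SECONDS = 188.4   # adapter v1 wall-clock time
V1_ITERS = 500       # adapter v1 iteration count
TIMEOUT_S = 10 * 60  # finetune-daemon training timeout


def estimate_training_seconds(iters, sec_per_iter=V1_SECONDS / V1_ITERS):
    """Scale the v1 per-iteration time to a new iteration count."""
    return iters * sec_per_iter


def pipeline_seconds(train_s, extract_s=30, fuse_s=60, restart_s=10):
    """Add the fixed pipeline stages from the list above to the training time."""
    return extract_s + train_s + fuse_s + restart_s


train_s = estimate_training_seconds(1000)
total_s = pipeline_seconds(train_s)
print(f"training: {train_s:.1f}s, pipeline: {total_s:.1f}s ({total_s / 60:.1f} min)")
```

If the first real run is much slower than this, the per-iteration figure should be re-measured rather than extrapolated from v1.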
7.4 Expected Loss Trajectory
| Run | Examples | Target Loss | Interpretation |
|---|---|---|---|
| v1 (baseline) | 972 (text SFT) | 1.694 (actual) | Text-only cognitive twin |
| v2 (OAPL-Lite init) | ~200 | 1.5-1.6 | Tool-use specialist, small corpus |
| v3 (2-week) | ~350 | 1.4-1.5 | Larger corpus, better coverage |
| v4 (4-week) | ~600 | 1.3-1.4 | Approaching KARL-scale improvement |
Loss below 1.4 indicates the model has internalized meaningful tool-use patterns.
Loss above 1.7 (worse than v1's 1.694) would indicate the model is not learning from
the trajectory format beyond the base model's priors (retrain trigger).
---
8. Mathematical Formulation of the Offline OAPL Objective
8.1 Full OAPL Objective (Reference)
KARL's OAPL solves the KL-regularized RL problem:
max_pi E_{x~D}[ E_{y~pi(.|x)}[r(x,y)] - beta * KL(pi(.|x) || pi_ref(.|x)) ]
The closed-form optimal policy is:
pi*(y|x) = (1/Z(x)) * pi_ref(y|x) * exp(r(x,y) / beta)
where Z(x) = sum_y pi_ref(y|x) * exp(r(x,y) / beta) (partition function)
OAPL approximates this with a regression loss. With G offline samples y_1,...,y_G per x:
L_OAPL(pi) = sum_x sum_i { beta * ln[pi(y_i|x) / pi_ref(y_i|x)] - A*(x,y_i) }^2
A*(x,y_i) = r(x,y_i) - V*(x)
V*(x) = beta * ln[ (1/G) * sum_j exp(r(x,y_j) / beta) ] (soft optimal value)
8.2 OAPL-Lite Objective (Our Approximation)
In the offline single-rollout setting (G=1 per prompt from our trajectory log), the
soft optimal value V*(x) degenerates: with a single sample, V*(x) = r(x,y_1), so
A*(x,y_1) = 0 for every example. This is a fundamental limitation of single-rollout
offline data.
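The degeneracy is easy to confirm numerically. The sketch below implements the soft value V*(x) = beta * ln[(1/G) * sum_j exp(r_j / beta)] with a standard max-shift for numerical stability; the function names and the example rewards are illustrative.

```python
import math


def soft_value(rewards, beta):
    """V*(x) = beta * ln[(1/G) * sum_j exp(r_j / beta)], computed with a max-shift."""
    scaled = [r / beta for r in rewards]
    m = max(scaled)
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return beta * log_mean_exp


def soft_advantages(rewards, beta):
    """A*(x, y_i) = r(x, y_i) - V*(x) for each sample of one prompt."""
    v = soft_value(rewards, beta)
    return [r - v for r in rewards]


# G = 1: the value equals the single reward, so the advantage vanishes.
print(soft_advantages([0.7], beta=0.05))

# G = 2: the better sample gets a positive advantage, the worse one a negative one.
print(soft_advantages([0.0, 1.0], beta=0.05))
```

With a single sample the advantage is zero (up to floating-point error) regardless of the reward or beta, which is why a separate baseline is needed.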
Our solution: Use the domain-mean baseline as a proxy for V*(x).
V_approx(x) = mean_{j: domain(x_j) = domain(x)} r(x_j, y_j)
A_approx(x, y) = r(x, y) - V_approx(x)
L_OAPL-Lite(pi) = - sum_i A_approx(x_i, y_i) * log pi(y_i | x_i)
This is not the full OAPL regression loss. It is advantage-weighted cross-entropy
(closely related to REINFORCE with a baseline, applied to fixed offline data), written
here as a loss to minimize. It approximates OAPL when:
1. The KL term `beta * ln(pi/pi_ref)` is approximated by the implicit constraint
that the low-rank LoRA update places on deviation from the base model
2. V_approx(x) is a good proxy for V*(x) (valid when domain baselines are stable)
This approximation was independently shown to work well in the ILQL and filtered
behavior cloning literature, and its gradient is exactly the gradient of
advantage-weighted SFT.
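A minimal sketch of the domain-mean baseline and the resulting advantage-weighted cross-entropy follows. It assumes each example is a dict with `domain` and `reward` keys and that per-example sequence log-probabilities are already available; these field names are assumptions, not the pipeline's confirmed schema. The loss is written as a quantity to minimize, hence the leading minus sign.

```python
from collections import defaultdict


def domain_baselines(examples):
    """V_approx: mean reward over all examples that share a domain."""
    totals = defaultdict(lambda: [0.0, 0])
    for ex in examples:
        totals[ex["domain"]][0] += ex["reward"]
        totals[ex["domain"]][1] += 1
    return {domain: s / n for domain, (s, n) in totals.items()}


def approx_advantages(examples):
    """A_approx(x, y) = r(x, y) - V_approx(x)."""
    base = domain_baselines(examples)
    return [ex["reward"] - base[ex["domain"]] for ex in examples]


def oapl_lite_loss(advantages, logprobs):
    """Advantage-weighted cross-entropy: -sum_i A_i * log pi(y_i | x_i)."""
    return -sum(a * lp for a, lp in zip(advantages, logprobs))


examples = [
    {"domain": "git", "reward": 1.0},
    {"domain": "git", "reward": 0.0},
    {"domain": "ios", "reward": 0.5},
]
advs = approx_advantages(examples)
print(advs)
print(oapl_lite_loss(advs, logprobs=[-1.0, -2.0, -3.0]))
```

Note that a domain with a single example always gets zero advantage, so sparse domains contribute no gradient under this baseline.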
8.3 Why This Converges Despite Single-Rollout Data
The key theoretical result (from OAPL paper Section 4.3): OAPL is stable with policy
lags up to L* = 1/(2*beta) gradient steps. With beta=0.05:
L* = 1 / (2 * 0.05) = 10 gradient steps
This seems worse than KARL's 400-step stability. However, KARL's stability claim is
for online OAPL where rollouts are updated periodically. In the offline setting, we
compute the advantage once from fixed data and do not update the behavior policy. The
stability concern (importance weight blow-up, reward hacking) is much weaker because
the training distribution is frozen. The offline stability is governed by the KL penalty
(the LoRA rank constraint) rather than the policy lag.
In practice: if validation loss diverges during training (val_loss > train_loss + 0.3),
reduce learning_rate from 5e-5 to 1e-5 and rerun. This is the operational stability
gate.
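The lag bound and the operational gate can be expressed in a few lines. The helper names are hypothetical; the thresholds (gap of 0.3, learning rates 5e-5 and 1e-5) are the ones given above.

```python
def lag_tolerance(beta):
    """L* = 1 / (2 * beta), the policy-lag bound quoted above."""
    return 1.0 / (2.0 * beta)


def next_learning_rate(train_loss, val_loss, lr=5e-5, reduced_lr=1e-5, gap=0.3):
    """Return the learning rate for the next run under the stability gate.

    If validation loss exceeds training loss by more than `gap`, the run is
    treated as diverged and the reduced learning rate is used for the rerun.
    """
    if val_loss > train_loss + gap:
        return reduced_lr
    return lr


print(lag_tolerance(0.05))
print(next_learning_rate(train_loss=1.2, val_loss=1.6))
print(next_learning_rate(train_loss=1.4, val_loss=1.6))
```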
8.4 Relationship to DPO
Direct Preference Optimization (DPO) also emerges from the same KL-regularized RL
framework with a specific pairing structure (chosen vs. rejected completions). OAPL-Lite
with negative contrast examples (Section 4.2) is structurally similar to DPO:
DPO loss: -log sigmoid( beta * ( log[pi(y+|x)/pi_ref(y+|x)] - log[pi(y-|x)/pi_ref(y-|x)] ) )
OAPL-Lite with contrast: advantage-weighted CE where y+ gets positive weight, y- gets negative framing
The difference: DPO requires explicit (chosen, rejected) pairs. OAPL-Lite uses scalar
advantage weights. If we accumulate enough correction-linked pairs (Section 4.3), a
future Stage 2 option is to switch to DPO using those pairs directly — the `dpo-pairs/`
directory in the compute-pair setup already exists for this purpose.
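For reference, the DPO loss for a single (chosen, rejected) pair can be computed from four sequence log-probabilities. This is a sketch of the standard formula only; reusing beta=0.05 from the OAPL setting as the default is an assumption, and nothing here reflects how the `dpo-pairs/` data is actually stored.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.05):
    """-log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected)) for one pair."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))

# Policy prefers the chosen completion more than the reference does: loss drops below log(2).
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```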
---
9. Risks
9.1 Distribution Shift (High Probability, Manageable)
The risk: Training on Claude Code's API-generated trajectories (from Claude Opus
4.6 or similar) and deploying the learned patterns to the local Gemma-3-1b fused model
creates a fundamental mismatch. The trajectories in verbose-all were generated by a
much more capable model with different tool-calling tendencies. The local model cannot
replicate those trajectories.
Why this matters for OAPL-Lite specifically: The "correct" tool sequences in our
training data reflect what a large frontier model would do, not what a 1B parameter
local model can do. We are teaching the 1B model to predict actions that may lie
outside its capability range.
Mitigation:
1. The local model is not expected to execute these trajectories — it generates a
plan that is injected as context to the main Claude session. Distribution shift
between the plan generator and the executor is acceptable.
2. Restrict training to short trajectories (<=15 tool calls, ~100 examples) where the
local model has realistic capacity to internalize the pattern.
3. Frame the training output as "tool-use reasoning" (explicit thinking steps) rather
than exact action sequences — this is more generalizable across model capabilities.
Counterargument: If the local model can only generate incoherent plans, the
200ms inference investment is wasted compute that degrades rather than helps. Measure
plan coherence qualitatively on the first 10 outputs before deploying to live panes.
9.2 Reward Hacking (Medium Probability)
The risk: The reward function uses proxy signals (Bash exit codes, file
modifications, error arrays) that can be satisfied without the task actually succeeding.
A model could learn that "add more file modifications" always improves reward, regardless
of whether those modifications are useful.
Why this is particularly risky for tool-use training: Unlike text quality, tool-use
sequences can contain genuinely harmful hacks (creating dummy files to inflate the
file_modification_signal, running `exit 0` to force exit_code success).
Mitigation:
1. The file_modification_signal caps at `modified >= 1` for most domains — adding more
files beyond 1 doesn't improve the reward. No incentive for excessive file creation.
2. Exit codes are parsed from actual Bash output, not from the model's generated text.
The model cannot learn to "hack" exit codes since they are observed signals.
3. The correction_signal (20%) covers this case: if the model
generates a visually-clean-but-useless trajectory, the next-prompt correction will
penalize it despite the clean exit codes.
4. The LoRA rank constraint (4 layers, low capacity) limits the model's ability to
memorize reward-hacking patterns specific to the training set.
Counterargument: With only 157 training examples, overfitting to specific reward
patterns is more likely than systematic reward hacking. The real risk is not a
sophisticated hack but a brute-force memorization of the 20 highest-reward examples.
Monitor validation loss: if train_loss << val_loss, this is occurring.
9.3 Compute Limits and Mac5 Availability (Low Probability)
The risk: Mac5 is also serving the MLX Server at :8100, running Ollama, and
participating in the exo cluster. Training + serving + clustering simultaneously could
cause memory pressure.
Mitigation:
1. The finetune-daemon already kills the MLX server before training. The exo cluster
can be gracefully detached from Mac4's master perspective.
2. Training can be scheduled off-peak by extending finetune-daemon's `POLL_INTERVAL`
logic: only train between 2AM and 6AM, and only if Mac5's free memory is above a threshold.
3. The 8-minute training window is short enough that brief Mac5 unavailability for
other services is acceptable.
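The off-peak gate from the mitigation list could be sketched as below. The function name and the 6 GB free-memory threshold are hypothetical; how free memory is measured on Mac5 is left to the caller.

```python
from datetime import datetime


def should_train_now(now, free_memory_gb, min_free_gb=6.0, start_hour=2, end_hour=6):
    """Allow training only inside the off-peak window and with enough free memory.

    min_free_gb is an illustrative threshold, not a value taken from the daemon.
    """
    in_window = start_hour <= now.hour < end_hour
    return in_window and free_memory_gb >= min_free_gb


print(should_train_now(datetime(2026, 3, 10, 3, 0), free_memory_gb=8.0))
print(should_train_now(datetime(2026, 3, 10, 7, 0), free_memory_gb=8.0))
```

The poll loop would call this once per `POLL_INTERVAL` tick and skip the training trigger whenever it returns False.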
9.4 Small Corpus Overfitting (High Probability with 157 examples)
The risk: 157 examples is genuinely small for LoRA training. The model will likely
memorize much of the training set after 1,000 iterations rather than generalizing.
Evidence: Adapter v1 had loss 1.694 on 972 examples after 500 iterations. With
157 examples and 1,000 iterations, we are doing ~6.4 effective epochs. Even with
dropout and the KL constraint, memorization is likely.
Mitigation:
1. Use early stopping based on validation loss: stop training when val_loss increases
for 3 consecutive checkpoints. With 18 validation examples, this is a coarse signal
but better than fixed iterations.
2. Reduce `--iters` from 1,000 to 500 for the initial run (consistent with v1). Scale
iterations as corpus grows.
3. Add Gaussian noise to advantage weights during training (jitter ±0.1) to prevent
exact memorization of reward-specific patterns.
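Mitigations 1 and 3 can be sketched as two small helpers. Both are illustrative: the stop rule expects the sequence of validation losses recorded at each checkpoint, and interpreting "jitter ±0.1" as Gaussian noise with sigma 0.1, clipped to the [-2.0, 2.0] advantage range, is an assumption.

```python
import random


def early_stop_index(val_losses, patience=3):
    """Return the checkpoint index at which to stop, or None.

    Stops once validation loss has increased for `patience` consecutive checkpoints.
    """
    rises = 0
    for i in range(1, len(val_losses)):
        if val_losses[i] > val_losses[i - 1]:
            rises += 1
            if rises >= patience:
                return i
        else:
            rises = 0
    return None


def jitter_weight(weight, sigma=0.1, lo=-2.0, hi=2.0, rng=random):
    """Add Gaussian noise to an advantage weight and clip to the allowed range."""
    return min(hi, max(lo, weight + rng.gauss(0.0, sigma)))


print(early_stop_index([1.8, 1.6, 1.5, 1.55, 1.6, 1.7, 1.9]))
print(early_stop_index([1.8, 1.7, 1.75, 1.6]))
```

With only 18 validation examples the loss curve will be noisy, so a patience below 3 would likely stop on noise rather than on real divergence.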
Acceptance criterion: If the test plan accuracy is >= 0.4 (better than chance) and
val_loss < 1.6, the initial run is a success despite the small corpus.
9.5 Correction Detector Cross-Contamination
The risk: The correction_detector.py is also used by the Cortex pipeline for rule
promotion. Using it as a reward signal creates a feedback loop: high-correction
trajectories get negative reward, which trains the model to avoid actions that trigger
corrections, which reduces correction events, which reduces the training signal, etc.
Assessment: This feedback loop operates over weeks (correction → reward → training
→ deployment → behavior change → fewer corrections). It is actually the desired loop:
a system that learns to avoid generating corrections is learning to act correctly. The
risk is that the loop overshoots — the model becomes too conservative and avoids all
actions that even pattern-match to past corrections, including correct ones.
Mitigation: Monitor the correction_signal component in `RewardResult` logs. If the
fraction of trajectories with `correction_available=True` drops below 10% (i.e., the model is rarely or
never triggering the correction detector), investigate whether the remaining trajectories
are genuinely correction-free or whether the signal has gone silent.
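The monitoring check can be sketched as follows. It assumes `RewardResult` log entries are available as dicts with a boolean `correction_available` field and that the alert threshold is 10%; both the record shape and the helper names are assumptions.

```python
def correction_fraction(reward_results):
    """Fraction of logged results whose correction signal was available."""
    if not reward_results:
        return 0.0
    hits = sum(1 for r in reward_results if r.get("correction_available"))
    return hits / len(reward_results)


def signal_gone_silent(reward_results, threshold=0.10):
    """True when the correction signal has become too rare to trust as a reward input."""
    return correction_fraction(reward_results) < threshold


logs = [{"correction_available": True}] + [{"correction_available": False}] * 19
print(correction_fraction(logs), signal_gone_silent(logs))
```

A True result is a prompt for manual review, not an automatic conclusion: the trajectories may simply be correct.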
---
10. Implementation Sequence
Phase 1: Data Pipeline (Days 1-2)
1. Write `[home-path]` (~180 lines)
- Parse verbose-all.jsonl
- Extract TrajectoryExample objects
- Normalize tool names via TOOL_NAME_MAP
- Output: `[home-path]` (~157 records)
2. Write `[home-path]` (~200 lines)
- Implement 4-component reward formula
- Cross-reference cortex/entries.jsonl for correction signals
- Output: `[home-path]`
3. Write `[home-path]` (~100 lines)
- Compute per-domain baselines
- Apply OAPL-Lite advantage formula
- Output: `[home-path]`
4. Write `[home-path]` (~200 lines)
- Convert to ChatML format with trajectory context
- Apply repeat-based upweighting for high-advantage examples
- Add contrast examples for negative-advantage trajectories
- Write to `[home-path]` + train/valid/test split
Validation gate: `wc -l [home-path]` should be
140-160. Run `python3 -c "import json; [json.loads(l) for l in open('train.jsonl')]"` —
should parse without errors.
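The same gate can be run as one Python check that reports the failing line instead of a bare traceback. This is a sketch; `validate_jsonl` is a hypothetical helper and the 140-160 bounds are the ones stated above.

```python
import json


def validate_jsonl(path, min_lines=140, max_lines=160):
    """Check that every non-blank line parses as JSON and the record count is in range.

    Returns the record count, or raises ValueError describing the first problem found.
    """
    count = 0
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{lineno}: invalid JSON ({exc.msg})") from exc
            count += 1
    if not (min_lines <= count <= max_lines):
        raise ValueError(
            f"{path}: expected {min_lines}-{max_lines} records, found {count}"
        )
    return count
```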
Phase 2: Initial Training on Mac5 (Day 3)
5. Copy `[home-path]` to Mac5 via:
rsync -av [home-path] mohameddiomande@[ip]:[home-path]
6. Trigger training via finetune-daemon:
curl -X POST http://[ip]:9200/train \
  -H "Content-Type: application/json" \
  -d '{"source": "oapl-lite", "data_dir": "[home-path]", "iters": 500}'
(Requires adding `data_dir` and `source` params to the daemon's `/train` handler.)
7. Monitor training: `curl http://[ip]:9200/status`
8. After completion: hot-swap adapter via `hot_swap_adapter(2)` in daemon.
Phase 3: Evaluation (Days 4-5)
9. Run automated plan accuracy evaluation on 18 held-out test examples
10. Qualitative review of 10 generated tool plans (are they coherent?)
11. Deploy to 50% [...]
Phase 4: Integration with Path A (Week 2+)
12. Update `feed_weave.py` to incorporate Path A's annotated trajectories.jsonl
13. Set up Prefect flow: `oapl_lite_retrain` — triggers when trajectory_tap accumulates
50+ new records.
14. Add `oapl-lite` source to `merge_training_data()` in finetune-daemon.py.
---
Sources
Codebase Files Read (with line-level depth)
- `[home]/bin/numu-daemon/packages/numu-weave/src/index.ts`
(270 lines — full NUMUWeave class, CorpusEntry schema, WeaveConfig defaults,
startTraining/evaluate stubs)
- `[home]/Desktop/homelab/compute-pair/finetune-daemon.py`
(540 lines — full training + hot-swap + Prometheus metrics daemon; confirmed
MLX CLI flags, adapter versioning, Mac5 Tailscale IP, merge_training_data sources)
- `[home]/Desktop/homelab/compute-pair/sft-formatter.py`
(296 lines — sft-formatter: verbose-all parsing, VerbosePromptEntry schema,
ChatML output format, dedup via sha256)
- `[home]/.openclaw/browser/corpus-to-sft.py`
(316 lines — browser corpus converter; shows SFT pipeline pattern for new source types)
- `[home]/.claude/prompt-logs/verbose-all.jsonl`
(3,249 entries — confirmed schema: prompt_id, session_id, prompt_text, assistant_turns,
tool_calls with tool_name/parameters/result, files_modified, files_created, errors,
git_repo, git_branch, captured_at)
- `[home]/.claude/prompt-logs/unified.jsonl`
(3,928 entries — confirmed tool_count distribution: 3,925 at 0, 3 at 1 — confirming
that tool_calls in unified.jsonl are simplified and verbose-all is the correct source)
- `[home]/Desktop/evo-cube-output/karl-trajectory-intelligence/stage0-research.md`
(KARL paper summary, OAPL objective, Mac5 fine-tune infrastructure, Cortex system)
- `[home]/Desktop/evo-cube-output/karl-trajectory-intelligence/stage1-path-a.md`
(Path A Trajectory Tap — outcome signals, session buffer, annotation protocol)
Memory Files Consulted
- `[home]/.claude/projects/-Users-mohameddiomande/memory/MEMORY.md`
(Mac5 IPs, LoRA CLI gotchas, adapter v1 stats, finetune-daemon ports)
- `[home]/.claude/agent-memory/research-engine/MEMORY.md`
(KARL paper stats, Stage 0 summary)
Direct Trajectory Data Analyzed
- 3,249 verbose-all.jsonl entries scanned programmatically
- 157 entries with tool_calls confirmed
- Max trajectory depth: 144 tool calls (map-architecture task)
- Exit code extraction confirmed from result string parsing
- Distribution: 41 entries with 1-4 calls, 75+ entries with 5+ calls
Source Anchor
evo-cube-output/karl-trajectory-intelligence/stage1-path-b.md