# KARL Integration — Evolution³ / Stage 1, Path B: OAPL-Lite
Run: karl-trajectory-intelligence
Generated: 2026-03-10
Method: Evolution³ — divergent exploration

---

Executive Summary

Path B implements a stripped-down version of KARL's OAPL algorithm that runs on Mac5's
single M4 chip using offline advantage estimation instead of online rollouts. The core
insight: we don't need live rollout infrastructure when we already have 3,249 logged
trajectories in `verbose-all.jsonl`, 157 of which contain rich tool-use sequences with
exit codes, file diffs, and success signals. The approach converts those trajectories into
advantage-weighted training examples, computes rewards from build results, correction
signals, and user approval proxies, and trains a LoRA adapter on Mac5 that learns which
tool-use sequences are associated with successful task completion.

The target is a LoRA adapter specialized for tool-use reasoning: given a prompt and a
skill context, predict the high-advantage next action. This replaces the static
`(prompt) -> inject SKILL.md content` pipeline with a `(prompt + trajectory context) ->
learned action selection` model that improves as more trajectories accumulate.

---

1. OAPL Simplification: What Survives on a Single M4

1.1 The Full OAPL Objective (KARL Paper)

The KARL OAPL loss is a regression objective derived from the KL-regularized RL problem:

OAPL objective (full):
  L(pi) = sum_x sum_i ( beta * ln(pi(y_i|x) / pi_ref(y_i|x)) - A*(x, y_i) )^2

where:
  A*(x, y_i) = r(x, y_i) - V*(x)
  V*(x) = beta * ln( (1/G) * sum_i exp(r(x, y_i) / beta) )
  G = number of rollouts per prompt x
  beta = KL regularization coefficient (controls deviation from pi_ref)
  pi_ref = frozen reference policy (base model weights)
  r(x, y_i) = reward for response y_i to prompt x

OAPL's key innovation over GRPO: V*(x) is the soft optimal value, not a simple
baseline. It is computed in closed form from the rewards of all G rollouts, requiring
no value network, no importance weight clipping, and no gradient through the value
estimate. This makes it stable at policy lags up to 400+ gradient steps — the training
data can be much older than in PPO/GRPO without degrading optimization.
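
To make the closed-form value concrete, the sketch below evaluates V*(x), A*(x, y_i), and the squared regression term for a single prompt, using only the definitions above. The function names and the sample rewards are illustrative assumptions, not taken from the KARL code.

```python
import math
from typing import List


def soft_value(rewards: List[float], beta: float) -> float:
    """V*(x) = beta * ln((1/G) * sum_i exp(r_i / beta)), computed stably."""
    scaled = [r / beta for r in rewards]
    m = max(scaled)  # subtract the max before exponentiating (log-sum-exp trick)
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return beta * log_mean_exp


def soft_advantages(rewards: List[float], beta: float) -> List[float]:
    """A*(x, y_i) = r(x, y_i) - V*(x) for each rollout in the group."""
    v = soft_value(rewards, beta)
    return [r - v for r in rewards]


def oapl_loss(log_ratios: List[float], rewards: List[float], beta: float) -> float:
    """Sum over rollouts of (beta * ln(pi / pi_ref) - A*)^2 for one prompt."""
    advs = soft_advantages(rewards, beta)
    return sum((beta * lr - a) ** 2 for lr, a in zip(log_ratios, advs))


rewards = [0.2, 0.5, 0.9]  # hypothetical rewards for G = 3 rollouts
mean_baseline = sum(rewards) / len(rewards)
v_star = soft_value(rewards, beta=0.1)
```

For these sample rewards, `v_star` is roughly 0.79 against a mean of about 0.53. With a small beta the soft value moves toward the best reward in the group, and as beta grows it falls toward the group mean, which is the baseline OAPL-Lite substitutes in Section 3.4.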

1.2 Components OAPL-Lite Keeps

KL Regularization (CRITICAL — keep)
The KL penalty `beta * ln(pi(y_i|x) / pi_ref(y_i|x))` prevents the policy from
drifting catastrophically away from the base model. On Mac5 with a 4-bit quantized
Gemma-3-1b, catastrophic forgetting is the primary failure mode. The KL term ensures
the adapter stays within a reasonable neighborhood of the base. In practice: keep a
frozen reference adapter (the current v1 adapter) and compute log-probability ratios at
training time using teacher-forcing on held-out examples.

Implementation: MLX LoRA has no explicit KL term. A small `--learning-rate` combined
with early stopping on validation loss limits drift only implicitly, so we add an
explicit KL penalty term to the per-example loss weight (advantage-weighted SFT is
equivalent to OAPL under mild assumptions; see Section 8).

Advantage Estimation (CRITICAL — keep)
Each training example gets an advantage weight `A = r - V_baseline` where `r` is the
computed trajectory reward and `V_baseline` is the mean reward across all trajectories
from the same skill/domain bucket. High-advantage examples (trajectories where the
agent did significantly better than average) get upweighted. Low-advantage examples
are downweighted or excluded. This is the core of what makes OAPL more sample-efficient
than plain SFT.

Offline Training (CRITICAL — keep)
Instead of running live rollouts, we use the existing `verbose-all.jsonl` (3,249
entries, 157 with tool sequences) as a fixed offline dataset. OAPL's stability at
large policy lags (400+ gradient steps) means the age of these trajectories is not a
fundamental problem — it is a design feature that Path B exploits.

1.3 Components OAPL-Lite Drops

Online Rollouts (DROP)
KARL runs 8 parallel rollouts per training prompt, requiring concurrent model inference
on a GPU cluster. Mac5 is a single M4 chip; running 8 parallel inference processes
against the 4-bit quantized Gemma model would consume the entire 16GB unified memory
during training. Dropped entirely. The offline trajectory dataset is the substitute.

Importance Weighting (DROP)
Standard off-policy RL uses importance weights `pi(y|x) / pi_behavior(y|x)` to correct
for distribution shift between the behavior policy (which collected trajectories) and
the current training policy. OAPL explicitly claims stability without importance
weighting due to the KL constraint. With offline data, we cannot compute `pi_behavior`
reliably anyway — the trajectories were generated by Claude Code (Anthropic API), not by
our local model. Dropped.

Compression Steps (DROP)
KARL handles long trajectories by inserting compression boundaries where the model
summarizes prior context before continuing. Our trajectories are shorter (median ~15
tool calls vs. KARL's 50-200 steps), and we're training on the full ChatML sequence
within MLX's 256-token max sequence length. No compression infrastructure needed.
For longer trajectories (>256 tokens), we truncate at the most informative prefix
(see Section 2.5).

Multi-Task Transfer (DROP for now)
KARL trains on 6 enterprise search task types simultaneously and shows generalization.
Path B trains on a single task distribution: Claude Code tool-use for software
engineering. Multi-task extension is a Stage 2 option.

Pass-Rate Filtering (SIMPLIFY)
KARL filters out trivially solved AND trivially failed examples (the pass-rate filter).
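Path B uses a simpler quality gate: discard trajectories where `reward < -0.3` (noise
floor) or `reward > 0.95` (trivially easy, no learning signal). Keep the middle band
of genuinely informative trajectories. Because the composite reward of Section 3.3
lies in [0, 1], the -0.3 floor can only take effect on a signed score such as the
advantage.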

1.4 The Resulting Simplified System

OAPL-Lite Pipeline:
  verbose-all.jsonl (3,249 entries)
      |
      v
  trajectory_extractor.py        -- filter to tool-bearing entries (157 found)
      |
      v
  reward_computer.py             -- r(x,y): [0,1] per trajectory
      |
      v
  advantage_weighter.py          -- A = r - V_baseline per domain
      |
      v
  sft_formatter_trajectory.py    -- advantage-weighted ChatML examples
      |                             (each example repeated ceil(A*k) times OR
      |                              per-example weight stored as metadata)
      v
  MLX LoRA training on Mac5      -- python3 -m mlx_lm lora
      |
      v
  adapter_v2 (tool-use specialist)
      |
      v
  fused model at :8100           -- hot-swapped via finetune-daemon.py

---

2. Trajectory-to-SFT Conversion

2.1 Input Data Inventory

From our direct inspection of `verbose-all.jsonl`:

| Metric | Value |
| --- | --- |
| Total verbose entries | 3,249 |
| Entries with tool_calls | 157 |
| Max tool calls per entry | 144 |
| Median tool calls (for tool entries) | ~15 |
| Entries with 5+ tool calls | ~116 |
| Entries with exit codes parseable | ~90 |
| Entries with files_modified | ~40 |
| Entries with errors[] | ~15 |

The tool_call schema in verbose-all has the fields we need:

json
{
  "tool_id": "call_...",
  "tool_name": "shell_command",
  "tool_type": "shell_command",
  "parameters": {"command": "...", "exit_code": 0},
  "result": "Exit code: 0\nOutput: ..."
}

Note: `tool_name` values in verbose-all are the source-tool names from various agents
(shell_command, exec_command, run_terminal_cmd, read_file, codebase_search, update_plan).
These differ from Claude Code's canonical names (Bash, Read, Edit, Write, Glob, Grep).
The converter must normalize these.

2.2 Trajectory Extraction Logic

New file: `[home-path]` (~180 lines)

python
import re
from typing import Optional

# Map agent-specific tool names onto Claude Code's canonical names
TOOL_NAME_MAP = {
    "shell_command": "Bash",
    "exec_command": "Bash",
    "run_terminal_cmd": "Bash",
    "read_file": "Read",
    "write_file": "Write",
    "codebase_search": "Grep",
    "glob_file_search": "Glob",
    "update_plan": "Write",
    "view_file": "Read",
}

# Matches "Exit code: <n>" anywhere in a tool result string
EXIT_CODE_RE = re.compile(r"Exit code: (-?\d+)")

def extract_trajectory(entry: dict) -> Optional[TrajectoryExample]:
    """Convert a VerbosePromptEntry to a TrajectoryExample for OAPL training."""
    prompt = (entry.get("prompt_text") or "").strip()
    if not prompt or len(prompt) < 15:
        return None

    # Skip hook/system prompts
    if prompt.startswith("<system-reminder>") or prompt.startswith("SessionStart:"):
        return None

    # Extract tool sequence from assistant_turns
    turns = entry.get("assistant_turns") or []
    tool_events = []
    for turn in turns:
        for tc in (turn.get("tool_calls") or []):
            raw_name = tc.get("tool_name") or tc.get("tool_type") or "Unknown"
            canonical_name = TOOL_NAME_MAP.get(raw_name, raw_name)
            result_str = str(tc.get("result") or "")

            # Parse exit code. Any non-zero code (2, 127, ...) is recorded as 1;
            # None means the result carried no exit code at all.
            match = EXIT_CODE_RE.search(result_str)
            exit_code = None
            if match:
                exit_code = 0 if int(match.group(1)) == 0 else 1

            tool_events.append(ToolEvent(
                tool_name=canonical_name,
                parameters=tc.get("parameters") or {},
                exit_code=exit_code,
                result_preview=result_str[:200],
                duration_ms=_parse_duration(result_str),
            ))

    if len(tool_events) < 2:
        return None  # Require at least 2 tool calls for trajectory learning

    return TrajectoryExample(
        prompt_id=entry.get("prompt_id", ""),
        session_id=entry.get("session_id", ""),
        prompt_text=prompt,
        tool_events=tool_events,
        files_modified=entry.get("files_modified") or [],
        files_created=entry.get("files_created") or [],
        errors=entry.get("errors") or [],
        git_repo=entry.get("git_repo"),
        intent_classification=entry.get("intent_classification"),
        complexity_score=entry.get("complexity_score"),
    )

2.3 Training Example Format

Each trajectory becomes a ChatML example where the assistant response encodes the
optimal tool-use reasoning chain. The key design decision: we train on the observed
tool sequence as the "correct" answer, weighted by the advantage of the outcome.

json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a Cognitive Twin — a personal AI delegate for software engineering tasks. When given a task, reason through the optimal tool-use sequence. Consider which tools to call, in what order, and what success looks like."
    },
    {
      "role": "user",
      "content": "Task: deploy the flows to cloud-vm\n\nContext:\n- Domain: deploy\n- Skill: ops:deploy\n- Relevant gotchas: SSH heredoc mangles variables; port 8080 is Docker proxy\n- Git state: branch=main, repo=mohameddiomande\n\nWhat is your tool-use plan?"
    },
    {
      "role": "assistant",
      "content": "<thinking>\nDeploy task targeting cloud-vm. Need to: read docker-compose config, verify service state, run systemctl restart via SSH, confirm success.\n</thinking>\n\nTool sequence: Read(docker-compose.yml) → Bash(ssh cloud-vm 'systemctl status') → Bash(ssh cloud-vm 'systemctl restart prefect') → Bash(verify)\n\nKey checks:\n1. Read docker-compose.yml first to confirm service names\n2. SSH commands must be single-quoted to avoid heredoc variable expansion\n3. Verify exit_code=0 on systemctl restart before marking complete"
    }
  ],
  "advantage_weight": 0.73
}

The `advantage_weight` field is used by the custom training loop to modulate per-example
loss contribution. Examples with high advantage (trajectory was much better than average)
are up-weighted. Examples near the baseline are weighted ~1.0. Negative-advantage examples
are still included but with reduced weight, serving as negative contrast examples.

2.4 Trajectory Context Construction

The user message is not just the raw prompt — it includes contextual fields that exist in
`verbose-all.jsonl` and are already extracted by the response hook:

| Field | Source | Purpose |
| --- | --- | --- |
| `prompt_text` | Direct from entry | Core task description |
| `intent_classification` | response_hook.py | Domain label (deploy, ios, git, etc.) |
| `complexity_score` | response_hook.py | Float [0,1] indicating task complexity |
| `git_repo` | Direct from entry | Project context |
| `git_branch` | Direct from entry | Branch context |
| `skill_name` | Matched from ops_trigger invocation_records | Which skill was injected (if any) |
| `skill_gotchas` | Loaded from SKILL.md | Hard-won gotchas as context |

The skill gotchas are the bridge between the static SKILL.md system and the learned
trajectories: we inject the gotchas into the training context so the model learns to
reason through them, not just pattern-match on prompt text.

2.5 Sequence Length Management

MLX is configured with `MAX_SEQ_LENGTH = 256` in the finetune-daemon. Tool sequences
with 20+ steps easily exceed this. Strategy:

1. Compress the tool sequence representation: Instead of reproducing full tool
parameters, encode the sequence as a compact string: `Read(docker-compose.yml) →
Bash[0] → Bash[0] → Bash[0]` where `[exit_code]` is appended only when informative.

2. Prioritize the reasoning prefix: The most learning-relevant part is the initial
reasoning in `<thinking>` blocks and the first 8 tool calls. Truncate at step 8 if
needed; the completion reward already captures the whole-trajectory outcome.

3. For long trajectories (>50 tool calls): Sample 3 non-overlapping 8-step windows
— the opening window, the highest-failure-density window, and the closing window.
Each window becomes a separate training example with the same reward signal. This
increases data diversity without requiring sequence length expansion.

4. Future option: Increase `MAX_SEQ_LENGTH` to 512 or 1024 for the LoRA run. Mac5's
M4 16GB can handle this for Gemma-3-1b at 4-bit — the memory constraint is on batch
size, not sequence length. With `batch_size=1` and `num_layers=4`, 512 tokens is
feasible. Measure peak memory before committing.
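
Strategies 1 and 3 above can be sketched as follows. This is an illustrative sketch, not the formatter's actual code: the `Step` tuple, its `arg` field, and the reading of "informative" as "exit code is known" are assumptions.

```python
from typing import List, NamedTuple, Optional

WINDOW = 8  # steps per window, per the strategy above


class Step(NamedTuple):
    tool_name: str
    arg: Optional[str] = None        # hypothetical: primary argument, e.g. a file path
    exit_code: Optional[int] = None


def encode(steps: List[Step]) -> str:
    """Compact form such as 'Read(docker-compose.yml) → Bash[0]'."""
    parts = []
    for s in steps:
        text = s.tool_name
        if s.arg:
            text += f"({s.arg})"
        if s.exit_code is not None:  # assumption: "informative" means "known"
            text += f"[{s.exit_code}]"
        parts.append(text)
    return " → ".join(parts)


def sample_windows(steps: List[Step]) -> List[List[Step]]:
    """Opening, highest-failure-density, and closing windows, non-overlapping."""
    n = len(steps)
    if n <= 50:
        return [steps[:WINDOW]]      # short trajectories: keep the prefix only

    def failures(start: int) -> int:
        return sum(1 for s in steps[start:start + WINDOW] if s.exit_code == 1)

    # Candidate starts for the middle window leave both end windows untouched.
    candidates = range(WINDOW, n - 2 * WINDOW + 1)
    middle = max(candidates, key=failures)  # earliest start wins on ties
    return [steps[:WINDOW], steps[middle:middle + WINDOW], steps[n - WINDOW:]]
```

Restricting the middle window's start to `[WINDOW, n - 2 * WINDOW]` is what guarantees the three windows never overlap, so no step is counted in two training examples.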

---

3. Reward Function

3.1 Design Principles

The reward function is the core engineering challenge of OAPL-Lite. Unlike KARL's
nugget-based accuracy (which compares retrieved documents against ground-truth answers),
we must infer task success from observable side-effects. Three principles:

1. Use process signals, not just outcome: Exit codes, file modification counts,
and absence of error arrays are direct evidence of process quality, not just
final outcome. A trajectory that hit 3 Bash exit_code=1 errors and then succeeded
is worse than one that succeeded on the first attempt.

2. Use session-lagged signals sparingly: Correction signals from the next prompt
are the strongest available outcome indicator, but they require cross-turn linkage.
The reward function computes a within-trajectory score that is available immediately,
then enriches it with a session-lagged correction signal when available.

3. Normalize per skill-domain bucket: Absolute rewards are not comparable across
tasks. A deploy trajectory and a git trajectory have different baseline difficulty.
Normalize by subtracting the mean reward within each domain (deploy, ios, git,
supabase, etc.) to produce the advantage A = r - V_baseline used in OAPL-Lite.

3.2 Reward Signal Components

Component 1: Process Cleanliness (weight 0.35)
Measures how clean the tool-use process was, independent of outcome.

python
def process_cleanliness(events: List[ToolEvent]) -> float:
    """[0, 1]. Higher = cleaner execution."""
    if not events:
        return 0.5

    bash_events = [e for e in events if e.tool_name == "Bash"]
    if not bash_events:
        return 0.85  # Read-only trajectories are inherently clean

    fail_count = sum(1 for e in bash_events if e.exit_code == 1)
    total_count = len(bash_events)
    fail_rate = fail_count / total_count

    # Penalty scaling: first failure is less bad (could be expected check),
    # multiple consecutive failures are bad (retry loops without progress)
    consecutive_fails = _max_consecutive(bash_events, lambda e: e.exit_code == 1)
    consecutive_penalty = min(0.4, consecutive_fails * 0.1)

    return max(0.0, 1.0 - fail_rate * 0.6 - consecutive_penalty)

Component 2: File Modification Signal (weight 0.25)
For software engineering tasks, modifying the right files is a proxy for task completion.

python
def file_modification_signal(
    entry: TrajectoryExample,
    domain: str,
) -> float:
    """[0, 1]. Higher = more evidence of productive file changes."""
    modified = len(entry.files_modified)
    created = len(entry.files_created)

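    if domain in ("ios", "deploy", "docker"):
        # These domains should modify config/source files
        changed = modified + created
        if changed == 0:
            return 0.2  # Low: task should have changed something
        # Count created files too; otherwise a create-only trajectory
        # (modified == 0) falls through to the neutral default below
        return min(1.0, 0.5 + changed * 0.15)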
    elif domain in ("git", "monitoring"):
        # Commits, restarts — file changes are optional
        if modified + created >= 1:
            return 0.8
        return 0.6  # Neutral — not modifying files is fine for git/monitor
    elif domain == "debug":
        # Debug might not modify files (just diagnosis)
        return 0.7  # Neutral

    return 0.5  # Default neutral

Component 3: Error Array Signal (weight 0.20)
The `errors` field in VerbosePromptEntry captures hook-detected errors (non-zero exits
that the hook classified as errors, Python tracebacks surfaced via response_hook).

python
def error_signal(entry: TrajectoryExample) -> float:
    """[0, 1]. Higher = fewer detected errors."""
    error_count = len(entry.errors)
    if error_count == 0:
        return 1.0
    elif error_count == 1:
        return 0.6  # Single error, may have been recovered
    elif error_count <= 3:
        return 0.3
    else:
        return 0.1

Component 4: Correction Signal (weight 0.20 when available, else 0)
Requires cross-turn lookup within the same session. The reward computer checks whether
any `correction` type CortexEntry (in `[home-path]`) is timestamped within 120 seconds
after the trajectory's `captured_at` and shares the same `session_id`. If found:

python
def correction_signal(trajectory: TrajectoryExample) -> Optional[float]:
    """[0, 1] or None if correction data not available."""
    entries = load_cortex_entries_for_session(trajectory.session_id)
    # Entries logged within 120 s after the trajectory was captured.
    # total_seconds() is required: timedelta.seconds silently drops whole days.
    followups = [
        e for e in entries
        if 0 < (e.timestamp - trajectory.captured_at).total_seconds() < 120
    ]

    if not followups:
        # No follow-up data: could mean no correction happened OR
        # the session ended before the next prompt.
        return None  # Do not penalize; treat as missing data

    if any(e.type == "correction" for e in followups):
        return 0.0  # Strong negative: user had to correct within 2 minutes

    # Assumes any non-correction entry in the window means the session continued
    return 1.0  # Correction absent in next prompt (2-minute window)

3.3 Composite Reward Computation

python
def compute_reward(
    entry: TrajectoryExample,
    domain: str,
) -> RewardResult:
    """Compute composite reward r in [0, 1]."""
    w_process = 0.35
    w_file = 0.25
    w_error = 0.20
    w_correction = 0.20

    r_process = process_cleanliness(entry.tool_events)
    r_file = file_modification_signal(entry, domain)
    r_error = error_signal(entry)
    r_correction = correction_signal(entry)

    if r_correction is None:
        # Redistribute correction weight to other signals
        total_other = w_process + w_file + w_error
        r = (r_process * w_process/total_other +
             r_file * w_file/total_other +
             r_error * w_error/total_other)
        correction_available = False
    else:
        r = (r_process * w_process +
             r_file * w_file +
             r_error * w_error +
             r_correction * w_correction)
        correction_available = True

    return RewardResult(
        reward=r,
        process_component=r_process,
        file_component=r_file,
        error_component=r_error,
        correction_component=r_correction,
        correction_available=correction_available,
    )
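
The redistribution step is easier to check with the numbers written out. The short sketch below recomputes the effective weights for both cases; the dictionary layout and function name are illustrative only.

```python
from typing import Dict

WEIGHTS = {"process": 0.35, "file": 0.25, "error": 0.20, "correction": 0.20}


def effective_weights(correction_available: bool) -> Dict[str, float]:
    """Weights actually applied, renormalized when the correction is missing."""
    active = {
        name: w for name, w in WEIGHTS.items()
        if correction_available or name != "correction"
    }
    total = sum(active.values())
    return {name: w / total for name, w in active.items()}


without = effective_weights(False)
# process 0.35 / 0.80 = 0.4375, file 0.25 / 0.80 = 0.3125, error 0.20 / 0.80 = 0.25
```

One consequence worth keeping in mind: a trajectory with a recorded correction (component 0.0) can score at most 0.80, while a trajectory with no correction data can still reach 1.0, so the two groups are not on identical scales before the per-domain baseline is subtracted.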

3.4 Advantage Computation (The OAPL Core)

python
def compute_advantages(
    trajectories: List[TrajectoryWithReward],
) -> List[TrajectoryWithAdvantage]:
    """
    Compute OAPL-style advantages: A = r - V_baseline.

    V_baseline = per-domain mean reward (soft baseline, not optimal value).
    For full OAPL: V*(x) = beta * ln((1/G) * sum_i exp(r_i / beta)).
    With offline data and no grouped rollouts, we use the domain mean as proxy.
    """
    from collections import defaultdict
    domain_rewards = defaultdict(list)
    for t in trajectories:
        domain_rewards[t.domain].append(t.reward.reward)

    domain_baselines = {
        domain: sum(rewards) / len(rewards)
        for domain, rewards in domain_rewards.items()
    }

    result = []
    for t in trajectories:
        baseline = domain_baselines.get(t.domain, 0.5)
        advantage = t.reward.reward - baseline

        # Scale by 1/beta so the advantage is measured in units of the KL
        # strength. This is a heuristic rescaling, not the log-sum-exp
        # normalization that defines V*(x) in full OAPL.
        # beta=0.1 is from KARL paper; we use 0.05 for tighter KL constraint on M4
        soft_advantage = advantage / 0.05

        # Clip to prevent extreme weighting; with beta=0.05 any raw
        # advantage beyond +/-0.1 saturates at the clip bounds
        clipped_advantage = max(-2.0, min(2.0, soft_advantage))

        result.append(TrajectoryWithAdvantage(
            **t.__dict__,
            advantage=clipped_advantage,
            baseline=baseline,
        ))

    return result

3.5 Reward Signal Calibration (Expected Values)

Based on our inspection of 157 tool-bearing trajectories:

| Domain | Expected Baseline r | Notes |
| --- | --- | --- |
| deploy | ~0.55 | Mix of clean deploys and retry loops |
| ios | ~0.45 | Higher Bash failure rate (xcodebuild is noisy) |
| git | ~0.70 | Usually clean, simple tool sequences |
| debug | ~0.40 | Expected high failure rates (exploratory) |
| supabase | ~0.60 | Mostly read + verify patterns |
| monitoring | ~0.65 | Typically clean check + report patterns |

Trajectories with advantage >= +0.3 (much better than domain average) become the
positive training signal. Trajectories with advantage <= -0.3 become negative contrast
examples. The ~60% of trajectories in between sit near their domain baseline.

---

4. Training Pipeline: End-to-End

4.1 Data Pipeline Steps

Step 1: Extract (trajectory_extractor.py)
  Input:  [home-path] (3,249 entries)
  Filter: tool_count >= 2, non-system prompts
  Output: ~157 TrajectoryExample objects

Step 2: Reward (reward_computer.py)
  Input:  157 TrajectoryExamples
  Compute: r per trajectory using 4-component formula
  Enrich: correction signals from cortex/entries.jsonl where available
  Output: 157 TrajectoryWithReward objects

Step 3: Advantage (advantage_weighter.py)
  Input:  157 TrajectoryWithReward
  Compute: domain baselines, soft advantages per OAPL formula
  Filter: discard if |advantage| < 0.05 (genuinely uninformative)
  Output: ~130 TrajectoryWithAdvantage objects (estimate)

Step 4: Format (sft_formatter_trajectory.py)
  Input:  130 TrajectoryWithAdvantage
  Convert: each → ChatML example with trajectory context + tool reasoning
  Handle: sequences >256 tokens via windowing strategy (Section 2.5)
  Augment: high-advantage trajectories (A > 0.5) duplicated 2x for emphasis
  Output: ~180-200 training examples (ChatML JSONL)

Step 5: Split (80/10/10)
  Train:  ~145 examples
  Valid:  ~18 examples
  Test:   ~18 examples
  Note: small test set — primary evaluation via A/B comparison (Section 5)

Step 6: Train (MLX LoRA on Mac5)
  Command: python3 -m mlx_lm lora \
    --model mlx-community/gemma-3-1b-it-4bit \
    --data [home-path] \
    --adapter-path [home-path] \
    --train \
    --iters 1000 \
    --batch-size 1 \
    --num-layers 4 \
    --max-seq-length 256 \
    --learning-rate 5e-5
  Expected duration: ~400s (2x adapter v1's 188.4s for 2x iterations)
  Expected loss: target < 1.5 (vs. v1's 1.694 baseline)

Step 7: Fuse and hot-swap
  Command: python3 -m mlx_lm fuse \
    --model mlx-community/gemma-3-1b-it-4bit \
    --adapter-path [home-path] \
    --save-path [home-path]
  Then: restart MLX server at :8100 via finetune-daemon.py hot_swap_adapter(2)

4.2 Advantage-Weighted SFT: Implementation Detail

Standard MLX LoRA does not support per-example loss weights natively. Path B uses
two complementary strategies:

Strategy A: Repeat-based upweighting (simpler)
High-advantage examples (A > 0.5) are duplicated `ceil(A * 2)` times in the training
JSONL. Example with A=0.8 appears twice; A=1.2 appears three times. Negative-advantage
examples (A < -0.3) are included once but with `instruction` modified to include a
"what NOT to do" framing:

json
{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "Task: deploy to cloud-vm\n\n[CONTRAST EXAMPLE: The following approach led to failures. Identify what went wrong.]"},
    {"role": "assistant", "content": "Problematic sequence: Bash(ssh cloud-vm 'systemctl restart') without first reading config → exit_code=1 repeated 3x → session abandoned without verification.\n\nWhat should have happened: Read(docker-compose.yml) first to confirm service name, then SSH with correct service identifier."}
  ]
}
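
Strategy A's repetition rule can be written in a few lines. The sketch below assumes each example is a dict carrying the `advantage_weight` field from Section 2.3; the function names are illustrative.

```python
import math
from typing import Dict, List


def repeat_count(advantage: float) -> int:
    """Copies of an example under Strategy A."""
    if advantage > 0.5:
        return math.ceil(advantage * 2)
    return 1  # near-baseline and contrast examples appear once


def expand(examples: List[Dict]) -> List[Dict]:
    """Duplicate each example according to its advantage weight."""
    out: List[Dict] = []
    for ex in examples:
        out.extend([ex] * repeat_count(ex["advantage_weight"]))
    return out


batch = [
    {"id": "a", "advantage_weight": 0.8},   # ceil(1.6) = 2 copies
    {"id": "b", "advantage_weight": 1.2},   # ceil(2.4) = 3 copies
    {"id": "c", "advantage_weight": -0.4},  # contrast example, 1 copy
]
expanded = expand(batch)
```

Because repetition is discrete, advantages of 0.55 and 1.0 both yield two copies. That loss of resolution is what the continuous weighting in Strategy B is meant to recover.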

Strategy B: Custom loss wrapper (future — Stage 2)
Patch MLX LoRA to accept a `--loss-weights` JSONL file mapping example index to
float weight. This is 20-30 lines of MLX Python and enables continuous advantage
weighting rather than discrete repetition. Deferred to Stage 2 because it requires
forking `mlx_lm`.

4.3 Batch of Corrective Trajectories

The cortex/entries.jsonl has 399 entries including `correction` type records. These are
a pre-labeled negative dataset. Each correction entry links to a session_id and contains
the correction text. Cross-referencing against verbose-all trajectories from the same
session_id produces correction-linked negative examples with explicit `advantage = -1.0`
— the strongest possible negative training signal.

Expected yield: ~30-50 correction-linked trajectory pairs (estimate based on cortex's
399 entries and 157 tool trajectories sharing session IDs).
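
The cross-reference described above amounts to a join on `session_id`. A minimal sketch, assuming both datasets are loaded as lists of dicts with `session_id` and `type` keys (a simplification of the real record types):

```python
from typing import Dict, List


def link_corrections(
    trajectories: List[Dict],
    cortex_entries: List[Dict],
) -> List[Dict]:
    """Return copies of trajectories whose session has a correction entry."""
    corrected_sessions = {
        e["session_id"] for e in cortex_entries if e.get("type") == "correction"
    }
    return [
        {**t, "advantage": -1.0}  # explicit strongest-negative label
        for t in trajectories
        if t.get("session_id") in corrected_sessions
    ]
```

A session-level join is coarse: it also labels trajectories that ran before the correction or were unrelated to it. Narrowing the match with the 120-second window from Section 3.2 would reduce that mislabeling at the cost of a lower yield.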

4.4 Incremental Training Protocol

Initial run (adapter v2):
- Source: all 157 tool trajectories from verbose-all.jsonl
- Expected training examples: ~180-200 after augmentation
- Training: 1,000 iterations, lr=5e-5
- Duration: ~400s on M4

Subsequent runs (adapter v3, v4, ...):
- Triggered when trajectory_tap.py (from Path A) accumulates 50+ new annotated trajectories
- Source: existing corpus + new annotated trajectories from trajectories.jsonl
- The finetune-daemon already watches for new turns and has the hot-swap mechanism
- Each run retrains from scratch (not fine-tuning on fine-tune) to avoid catastrophic
forgetting from iterated LoRA updates

Cadence: With ~30 active sessions/day and ~10% of them yielding tool trajectories, expect
~3-5 new tool trajectories/day. The 50-example threshold for retraining would be hit
in ~10-17 days, creating adapter v3 ~2 weeks after initial deployment.

---

5. A/B Evaluation Design

5.1 The Evaluation Problem

The fused model at :8100 is used by the cognitive twin pipeline — but evaluating it
against a production Claude API agent is not a fair comparison. The correct comparison
is:

Baseline: Current ops skill injection (static SKILL.md context + no learned
trajectory reasoning)

Treatment: Same SKILL.md injection + the OAPL-Lite fused model's tool-use plan
prepended to the context

5.2 Automated Evaluation Metrics

Metric 1: Plan Accuracy Score
For a held-out set of 18 test trajectories (see Section 4.1), prompt the fused model
with the trajectory context (minus the tool sequence) and measure how well its predicted
tool sequence matches the actual tool sequence using:

python
def plan_accuracy(predicted: List[str], actual: List[str]) -> float:
    """LCS-based similarity: order matters but some flexibility allowed."""
    if not actual:
        return 0.0  # No reference sequence to compare against
    # Longest common subsequence (LCS) length, normalized by actual length
    lcs_len = lcs(predicted, actual)
    return lcs_len / len(actual)
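
The `lcs` helper is not defined in the snippet above. A standard dynamic-programming version, shown here as one possible implementation with hypothetical tool sequences, would be:

```python
from typing import List


def lcs(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


predicted = ["Read", "Bash", "Bash"]        # hypothetical model plan
actual = ["Read", "Bash", "Bash", "Bash"]   # hypothetical logged trajectory
score = lcs(predicted, actual) / len(actual)
```

Here the score is 0.75. Note that normalizing by the actual length means extra predicted steps are never penalized: a plan that lists every tool many times can reach 1.0, so the metric is best read alongside first-action accuracy.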

Expected baseline (before OAPL-Lite): ~0.2 (model has no trajectory training)
Target (after OAPL-Lite): ~0.5+ (model has internalized common tool sequences)

Metric 2: First-Action Accuracy
The most actionable: does the model predict the correct first tool to call?

python
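def first_action_accuracy(predicted_plans: List, actual_trajectories: List) -> float:
    if not predicted_plans:
        return 0.0
    correct = sum(
        1 for pred, actual in zip(predicted_plans, actual_trajectories)
        # An empty plan counts as a miss instead of raising IndexError
        if pred and actual and pred[0] == actual[0]
    )
    return correct / len(predicted_plans)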

For deploy tasks: correct first action is usually Read (config file) before any Bash.
For ios tasks: Read (project.yml) before xcodebuild. For debug: Bash (check logs).

Expected baseline: ~0.35 (near chance for 4 common tools; uniform guessing alone would give 0.25)
Target: ~0.60+

Metric 3: Advantage-Weighted Prediction Quality
Weight each test example by its advantage score when computing metrics. High-advantage
examples (the genuinely difficult, genuinely successful trajectories) should be predicted
more accurately than low-advantage ones. If the model learns nothing useful, accuracy
will be uncorrelated with advantage. If it learns the right patterns, accuracy should
increase monotonically with advantage.
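
One simple way to check the monotonicity claim is to bucket test examples by advantage and compare mean accuracy across buckets. The sketch below is an illustrative approach, not a method prescribed by the proposal; the bucket count and sample values are assumptions.

```python
from typing import List, Sequence, Tuple


def accuracy_by_advantage(
    pairs: Sequence[Tuple[float, float]],  # (advantage, per-example accuracy)
    buckets: int = 3,
) -> List[float]:
    """Mean accuracy per advantage bucket, lowest-advantage bucket first."""
    ordered = sorted(pairs, key=lambda p: p[0])
    size = len(ordered) / buckets
    means = []
    for b in range(buckets):
        chunk = ordered[round(b * size):round((b + 1) * size)]
        if chunk:  # skip empty buckets when there are few examples
            means.append(sum(acc for _, acc in chunk) / len(chunk))
    return means


def is_monotonic(means: List[float]) -> bool:
    """True when each bucket's mean is at least the previous bucket's."""
    return all(lo <= hi for lo, hi in zip(means, means[1:]))


pairs = [(-1.0, 0.2), (-0.5, 0.3), (0.0, 0.5), (0.2, 0.4), (0.8, 0.7), (1.5, 0.9)]
means = accuracy_by_advantage(pairs)
```

With only ~18 test examples, three buckets hold about six examples each, so a single outlier can break monotonicity. The check is better treated as a sanity signal than as a pass/fail gate.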

5.3 Live A/B via Pane Routing

After initial automated evaluation passes thresholds, deploy a live A/B via the pane
orchestrator:

python
# In pane_orchestrator controller.py sense() phase:
pane_count = len(all_panes)
karl_panes = pane_count // 2  # 50% of panes get KARL treatment

for i, pane in enumerate(all_panes):
    if i < karl_panes:
        # Treatment: inject tool-use plan from fused model before skill content
        pane.set_flag("karl_oapl_lite", True)
    else:
        # Control: standard skill injection only
        pane.set_flag("karl_oapl_lite", False)

The ops_trigger.py hook checks the `karl_oapl_lite` flag. When set, it POSTs the
prompt to the MLX server at :8100 and prepends the returned tool plan to the SKILL.md
injection.

python
# In ops_trigger.py, after skill match:
injection = skill_content  # Baseline: standard skill injection
if _is_karl_pane(session_id):
    try:
        plan = _query_mlx_server(prompt_text, skill_name, timeout_ms=200)
        injection = f"[Learned Tool Plan]\n{plan}\n\n" + skill_content
    except Exception:
        pass  # Keep the baseline injection on timeout or server error

The 200ms timeout is critical: the MLX server on Mac5 (Tailscale IP [ip])
takes ~150ms for a 256-token inference. Total overhead: ~150ms + network = within budget
if the hook budget is 500ms and this check runs after the primary skill match.

5.4 Evaluation Period and Decision Gate

Run the live A/B for 2 weeks (consistent with Path A's accumulation window). Compute:

lift = treatment.success_rate - control.success_rate

where `success_rate` is computed from trajectory annotations (Path A's tap, if deployed
in parallel). If `lift >= 0.05` (5 percentage point improvement), promote the KARL
treatment to 100% of panes. Otherwise, examine tool sequence quality and revise the
reward function before the next training run.

---

6. Integration with numu-weave Pipeline

6.1 Current numu-weave Architecture

The `NUMUWeave` class (`[home]/bin/numu-daemon/packages/numu-weave/src/index.ts`)
is the existing cognitive twin pipeline connector with three stations:

1. Corpus Builder (`addCorpusEntries`, `exportCorpus`): Accepts `CorpusEntry[]`
with `{instruction, input, output, source}`. Currently fed by `sft-formatter.py`
(text-only SFT pairs from verbose-all).

2. Fine-Tune Trainer (`startTraining`, `checkTrainingStatus`): POSTs corpus to
the finetune-daemon at Mac5 `:9200/train`. Watches `/status` endpoint.

3. Evaluator (`evaluate`): A/B comparison between base and fused model. Currently
placeholder scores (0.72 base, 0.78 fused) — not wired to real metrics.

6.2 Extension Points for OAPL-Lite

Extension 1: New source type for trajectory corpus

Add `"trajectory"` as a valid `source` in `CorpusEntry`:

typescript
export interface CorpusEntry {
  id: string;
  instruction: string;
  input: string;
  output: string;
  source: "prompt-logger" | "memory" | "thread" | "trajectory";  // ADD trajectory
  createdAt: string;
  // New fields for trajectory entries:
  advantageWeight?: number;   // [-2.0, 2.0] from OAPL advantage computation
  domain?: string;            // "deploy" | "ios" | "git" | etc.
  reward?: number;            // [0, 1] raw reward
  isContrast?: boolean;       // True for negative-advantage examples
}

Extension 2: Advantage-weighted corpus export

The current `exportCorpus()` outputs uniform JSONL. Add an `exportWeighted()` method
that applies repeat-based upweighting (Strategy A from Section 4.2):

typescript
exportWeighted(): string {
  return this.corpus.flatMap((e) => {
    const weight = e.advantageWeight ?? 1.0;
    // Contrast and non-positive-advantage entries are emitted exactly once;
    // positive-advantage entries repeat ceil(weight) times.
    const repeats = e.isContrast ? 1 : Math.max(1, Math.ceil(weight));
    const line = JSON.stringify({
      instruction: e.instruction,
      input: e.input,
      output: e.output,
    });
    return Array(repeats).fill(line);
  }).join("\n");
}

Extension 3: Wire evaluator to trajectory metrics

Replace the placeholder evaluation in `evaluate()` with a real call to the
`skill_metrics.json` endpoint (from Path A's metrics aggregator):

typescript
async evaluate(): Promise<EvalResult> {
  // Fetch aggregated skill metrics from the Dashboard API
  const res = await fetch("http://[ip]:8421/api/karl/skill-metrics");
  if (!res.ok) {
    throw new Error(`skill-metrics request failed: ${res.status}`);
  }
  const metrics = (await res.json()) as SkillMetrics;

  // Fall back to 0.55 when no no-skill baseline has been recorded
  const baseScore = metrics.baseline?.no_skill_success_rate ?? 0.55;
  const fusedScore = this._computeMeanSuccessRate(metrics.skills);
  const improvement = fusedScore - baseScore;

  return { ..., baseScore, fusedScore, improvement };
}

Extension 4: Trajectory feeder script

New Python script: `[home-path]`

python
"""
Feed OAPL-Lite trajectory corpus into numu-weave via HTTP.

The finetune-daemon exposes POST /train. This script:
1. Runs trajectory_extractor.py
2. Runs reward_computer.py + advantage_weighter.py
3. Runs sft_formatter_trajectory.py
4. Writes output to Desktop/homelab/compute-pair/sft-output/oapl-lite.jsonl
5. Signals finetune-daemon: POST http://[ip]:9200/train
   with {"trigger": true, "source": "oapl-lite"}
"""

This preserves the existing data flow that `finetune-daemon.py` already implements —
we add oapl-lite.jsonl as a new source alongside the existing `prompt-sft.jsonl` and
`browser-sft/train.jsonl` in `merge_training_data()`.

6.3 finetune-daemon.py Integration

The existing daemon at `Desktop/homelab/compute-pair/finetune-daemon.py` already has
the complete MLX training + hot-swap pipeline (lines 246-422 — confirmed by reading the
file). Path B only needs to add one new data source to `merge_training_data()`:

python
sources = [
    SFT_OUTPUT_DIR / "prompt-sft.jsonl",
    Path.home() / "Desktop/homelab/compute-pair/browser-sft/train.jsonl",
    DPO_DIR / "dpo-chosen.jsonl",          # Future: race protocol
    Path.home() / ".claude/karl/oapl-lite.jsonl",  # ADD: OAPL-Lite trajectories
]

The training command in `run_mlx_training()` already uses the correct CLI flags
(`python3 -m mlx_lm lora`, `--num-layers`) that avoid the known v0.29+ gotchas.

---

7. Scale Analysis

7.1 Training Data Scale

| Stage | Trajectory count | SFT examples (after augment) | Notes |
|---|---|---|---|
| Initial (verbose-all) | 157 | ~180-200 | Includes repeat upweighting |
| + correction pairs | +30-50 | +30-50 | Negative contrast examples |
| After 2 weeks (Path A tap) | +100 | +150 | New annotated trajectories |
| After 4 weeks | +250 | +375 | Approaching KARL's Stage I scale |
| After 8 weeks | +500 | +750 | Competitive with adapter v1's 972 examples |

Concern: 157 initial examples is small. KARL used 1,218+ examples in Stage I.
The risk of overfitting is real. Mitigations:
- Validation loss is tracked during training so that overfitting is caught early (stop
the run if val loss diverges from train loss; see the early-stopping rule in Section 9.4)
- Restricting to `--num-layers 4` (4 LoRA layers out of model's 18 total) limits
adapter capacity, reducing overfitting risk
- We do NOT train on the test set (18 held-out examples from Section 5.2)

7.2 Mac5 Memory Constraints

M4 16GB unified memory analysis for training:

| Component | Memory | Notes |
|---|---|---|
| Base model (Gemma-3-1b at 4-bit) | ~1.5 GB | Fixed overhead |
| LoRA adapter weights (4 layers) | ~50 MB | Small relative to base |
| Optimizer state (Adam) | ~200 MB | Gradient + second moment for LoRA params |
| Activation memory (batch=1, seq=256) | ~800 MB | Forward pass intermediate values |
| KV cache | ~300 MB | Attention key/value states |
| Total estimated | ~2.9 GB | Well within 16GB |

Increasing to seq=512 would add ~600MB of activation memory, which is still feasible
(~3.5GB total). Increasing batch_size to 2 at seq=256 would double activation memory to
~1.6GB, which is also feasible (~3.7GB total). Mac5 has headroom for more aggressive
training than the current defaults.

MLX Server at :8100 during training: The fused model server consumes ~2.5GB when
active. Concurrent training + serving would require ~5.4GB — still well within 16GB.
However, to prevent interference, the finetune-daemon already kills the server before
training (line 381: `subprocess.run(["pkill", "-f", "mlx_lm.*server"])`) and restarts
it after hot-swap. This is the correct protocol.
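
The memory arithmetic above can be reproduced with a short script. This is a sketch that only sums the estimates quoted in this section; the dictionary keys are hypothetical labels, and the calculation ignores the OS and any other processes running on Mac5.

```python
# Component estimates in GB, copied from the Section 7.2 table.
TRAINING_COMPONENTS_GB = {
    "base_model_4bit": 1.5,
    "lora_adapter": 0.05,
    "optimizer_state": 0.2,
    "activations_batch1_seq256": 0.8,
    "kv_cache": 0.3,
}
SERVER_GB = 2.5           # fused model server at :8100 when active
UNIFIED_MEMORY_GB = 16.0  # M4 unified memory


def training_total_gb(components=TRAINING_COMPONENTS_GB) -> float:
    """Sum of the per-component training estimates."""
    return sum(components.values())


def headroom_gb(concurrent_serving: bool) -> float:
    """Memory left after training, optionally with the server running too."""
    used = training_total_gb() + (SERVER_GB if concurrent_serving else 0.0)
    return UNIFIED_MEMORY_GB - used


print(f"training only: {training_total_gb():.2f} GB")
print(f"training + serving: {training_total_gb() + SERVER_GB:.2f} GB")
print(f"headroom with serving: {headroom_gb(True):.2f} GB")
```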

7.3 Training Time Estimate

Based on adapter v1 benchmark: 188.4s for 500 iterations on 972 examples.

Time per iteration ≈ 188.4s / 500 = 0.377s/iter

Path B initial run:
  - 1,000 iterations
  - ~180 examples (smaller corpus, similar per-example cost)
  - Estimated: 1,000 * 0.377s ≈ 377s (~6.3 minutes)

Future runs with 500 examples:
  - 1,000 iterations
  - Estimated: ~380s (example count doesn't linearly scale iteration cost
    for batch_size=1 with gradient accumulation)

Total pipeline time including extraction, reward computation, formatting:
- Extraction + reward: ~30s (reading 3,249 JSON lines + Python computation)
- MLX training: ~380s
- Fusion: ~60s
- Server restart: ~10s
- Total: ~8 minutes end-to-end

The ~380s training step fits within finetune-daemon's 10-minute training timeout.
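
The estimate above is simple enough to script, which makes it easy to re-run when the iteration count changes. This is a sketch under the same assumption the section makes (per-iteration cost stays at the adapter v1 rate); the function names are hypothetical, and the 600s constant is the 10-minute timeout mentioned above.

```python
V1_SECONDS = 188.4   # adapter v1 benchmark wall time
V1_ITERS = 500       # adapter v1 iteration count
TRAINING_TIMEOUT_S = 600.0


def estimate_training_seconds(iters: int) -> float:
    """Scale the v1 per-iteration time to a new iteration count."""
    return iters * (V1_SECONDS / V1_ITERS)


def estimate_pipeline_seconds(iters: int, extraction_s: float = 30.0,
                              fusion_s: float = 60.0,
                              restart_s: float = 10.0) -> float:
    """Extraction + training + fusion + server restart."""
    return extraction_s + estimate_training_seconds(iters) + fusion_s + restart_s


train_s = estimate_training_seconds(1000)
total_s = estimate_pipeline_seconds(1000)
print(f"training: {train_s:.0f}s, pipeline: {total_s / 60:.1f} min")
print("training fits timeout:", train_s < TRAINING_TIMEOUT_S)
```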

7.4 Expected Loss Trajectory

| Run | Examples | Target Loss | Interpretation |
|---|---|---|---|
| v1 (baseline) | 972 (text SFT) | 1.694 (actual) | Text-only cognitive twin |
| v2 (OAPL-Lite init) | ~200 | 1.5-1.6 | Tool-use specialist, small corpus |
| v3 (2-week) | ~350 | 1.4-1.5 | Larger corpus, better coverage |
| v4 (4-week) | ~600 | 1.3-1.4 | Approaching KARL-scale improvement |

Loss below 1.4 indicates the model has internalized meaningful tool-use patterns beyond
random guessing. Loss above 1.7 would indicate the model is not learning anything from
the trajectory format beyond the base model's priors (retrain trigger).

---

8. Mathematical Formulation of the Offline OAPL Objective

8.1 Full OAPL Objective (Reference)

KARL's OAPL solves the KL-regularized RL problem:

max_pi E_{x~D}[ E_{y~pi(.|x)}[r(x,y)] - beta * KL(pi(.|x) || pi_ref(.|x)) ]

The closed-form optimal policy is:

pi*(y|x) = (1/Z(x)) * pi_ref(y|x) * exp(r(x,y) / beta)

where Z(x) = sum_y pi_ref(y|x) * exp(r(x,y) / beta)   (partition function)

OAPL approximates this with a regression loss. With G offline samples y_1,...,y_G per x:

L_OAPL(pi) = sum_x sum_i { beta * ln[pi(y_i|x) / pi_ref(y_i|x)] - A*(x,y_i) }^2

A*(x,y_i) = r(x,y_i) - V*(x)

V*(x) = beta * ln[ (1/G) * sum_j exp(r(x,y_j) / beta) ]    (soft optimal value)

8.2 OAPL-Lite Objective (Our Approximation)

In the offline single-rollout setting (G=1 per prompt from our trajectory log), the
optimal value V(x) degenerates: with a single sample, V(x) = r(x,y_1), so A*(x,y_1)
= 0 for every example. This is a fundamental limitation of single-rollout offline data.
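
The degeneracy described above can be checked numerically. The sketch below computes V*(x) and A*(x, y_i) from the formulas in Section 8.1 using a stable log-mean-exp; the function names and the reward values are illustrative assumptions.

```python
import math


def soft_value(rewards, beta):
    """V*(x) = beta * ln[(1/G) * sum_j exp(r_j / beta)], computed stably."""
    if not rewards:
        raise ValueError("need at least one reward")
    scaled = [r / beta for r in rewards]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return beta * log_mean_exp


def advantages(rewards, beta):
    """A*(x, y_i) = r(x, y_i) - V*(x) for every sample of one prompt."""
    v = soft_value(rewards, beta)
    return [r - v for r in rewards]


single = advantages([0.8], beta=0.05)      # G = 1: advantage collapses to ~0
group = advantages([0.2, 0.8], beta=0.05)  # G = 2: advantages are informative
print(single, group)
```

With G=1 the only advantage is zero up to floating-point error, while with two samples the lower-reward rollout gets a clearly negative advantage.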

Our solution: Use the domain-mean baseline as a proxy for V*(x).

V_approx(x) = mean_{j: domain(x_j) = domain(x)} r(x_j, y_j)

A_approx(x, y) = r(x, y) - V_approx(x)

L_OAPL-Lite(pi) = - sum_i { A_approx(x_i, y_i) * log pi(y_i | x_i) }

This is not the full OAPL regression loss. It is advantage-weighted cross-entropy,
which has the same form as REINFORCE with a baseline but is evaluated on fixed logged
data. It approximates OAPL when:
1. The KL term `beta * ln(pi/pi_ref)` is approximated by the implicit constraint that
low-rank adaptation places on deviation from the base model
2. V_approx(x) is a good proxy for V*(x) (valid when domain baselines are stable)

Closely related objectives appear in the ILQL and filtered behavior cloning
literature, and this is the gradient that advantage-weighted SFT optimizes.
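
A minimal sketch of the domain-mean baseline and the resulting objective follows. It assumes each example is a dict with `domain` and `reward` keys and that per-example log-probabilities are supplied by the trainer; those names and the sample values are hypothetical. The loss is written with the sign chosen so that minimizing it raises the log-likelihood of positive-advantage examples.

```python
from collections import defaultdict


def domain_baselines(examples):
    """V_approx: mean reward per domain."""
    totals = defaultdict(lambda: [0.0, 0])
    for ex in examples:
        acc = totals[ex["domain"]]
        acc[0] += ex["reward"]
        acc[1] += 1
    return {domain: s / n for domain, (s, n) in totals.items()}


def approx_advantages(examples):
    """A_approx(x, y) = r(x, y) - V_approx(x)."""
    base = domain_baselines(examples)
    return [ex["reward"] - base[ex["domain"]] for ex in examples]


def advantage_weighted_nll(advs, logprobs):
    """Loss to minimize: - sum_i A_i * log pi(y_i | x_i)."""
    return -sum(a * lp for a, lp in zip(advs, logprobs))


examples = [
    {"domain": "deploy", "reward": 0.9},
    {"domain": "deploy", "reward": 0.5},
    {"domain": "git", "reward": 0.4},
]
advs = approx_advantages(examples)
loss = advantage_weighted_nll(advs, logprobs=[-1.0, -2.0, -0.5])
print(advs, loss)
```

Note that a domain with a single example (`git` here) still gets a zero advantage, so the baseline only helps in domains with several logged trajectories.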

8.3 Why This Converges Despite Single-Rollout Data

The key theoretical result (from OAPL paper Section 4.3): OAPL is stable with policy
lags up to L* = 1/(2 * beta) gradient steps. With beta=0.05:

L* = 1 / (2 * 0.05) = 10 gradient steps

This seems worse than KARL's 400-step stability. However, KARL's stability claim is
for online OAPL where rollouts are updated periodically. In the offline setting, we
compute the advantage once from fixed data and do not update the behavior policy. The
stability concern (importance weight blow-up, reward hacking) is much weaker because
the training distribution is frozen. The offline stability is governed by the KL penalty
(the LoRA rank constraint) rather than the policy lag.

In practice: if validation loss diverges during training (val_loss > train_loss + 0.3),
reduce learning_rate from 5e-5 to 1e-5 and rerun. This is the operational stability
gate.
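
The operational gate can be expressed as a small helper. This is a sketch assuming the final train and validation losses are read from the training log; the function name is hypothetical, and the thresholds are the ones stated above.

```python
def next_learning_rate(train_loss: float, val_loss: float,
                       current_lr: float = 5e-5,
                       reduced_lr: float = 1e-5,
                       max_gap: float = 0.3) -> float:
    """Return the learning rate for the next run.

    If val_loss exceeds train_loss by more than max_gap, the run is treated
    as divergent and the reduced rate is returned; otherwise the current
    rate is kept.
    """
    if val_loss > train_loss + max_gap:
        return reduced_lr
    return current_lr


print(next_learning_rate(train_loss=1.2, val_loss=1.6))  # divergent run
print(next_learning_rate(train_loss=1.5, val_loss=1.6))  # stable run
```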

8.4 Relationship to DPO

Direct Preference Optimization (DPO) also emerges from the same KL-regularized RL
framework with a specific pairing structure (chosen vs. rejected completions). OAPL-Lite
with negative contrast examples (Section 4.2) is structurally similar to DPO:

DPO loss: -log sigmoid( beta * [ log(pi(y+|x) / pi_ref(y+|x)) - log(pi(y-|x) / pi_ref(y-|x)) ] )

OAPL-Lite with contrast: advantage-weighted CE where y+ gets positive weight, y- gets negative framing

The difference: DPO requires explicit (chosen, rejected) pairs. OAPL-Lite uses scalar
advantage weights. If we accumulate enough correction-linked pairs (Section 4.3), a
future Stage 2 option is to switch to DPO using those pairs directly — the `dpo-pairs/`
directory in the compute-pair setup already exists for this purpose.
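
For comparison, the per-pair DPO loss can be computed directly from sequence log-probabilities. This is a sketch of the standard DPO formula only; the argument names are hypothetical, and `beta=0.05` is carried over from Section 8.3 as an assumption rather than a tuned value.

```python
import math


def dpo_loss(logp_pos: float, ref_logp_pos: float,
             logp_neg: float, ref_logp_neg: float,
             beta: float = 0.05) -> float:
    """-log sigmoid(beta * [(logp+ - ref+) - (logp- - ref-)]) for one pair."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to keep exp() bounded.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
at_reference = dpo_loss(-1.0, -1.0, -2.0, -2.0)
# Policy prefers the chosen completion more than the reference does.
improved = dpo_loss(-0.5, -1.0, -3.0, -2.0)
print(at_reference, improved)
```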

---

9. Risks

9.1 Distribution Shift (High Probability, Manageable)

The risk: Training on Claude Code's API-generated trajectories (from Claude Opus
4.6 or similar) and deploying the learned patterns to the local Gemma-3-1b fused model
creates a fundamental mismatch. The trajectories in verbose-all were generated by a
much more capable model with different tool-calling tendencies. The local model cannot
replicate those trajectories.

Why this matters for OAPL-Lite specifically: The "correct" tool sequences in our
training data reflect what a large frontier model would do, not what a 1B parameter
local model can do. We are teaching the 1B model to predict actions that may lie
outside its capability range.

Mitigation:
1. The local model is not expected to execute these trajectories — it generates a
plan that is injected as context to the main Claude session. Distribution shift
between the plan generator and the executor is acceptable.
2. Restrict training to short trajectories (<=15 tool calls, ~100 examples) where the
local model has realistic capacity to internalize the pattern.
3. Frame the training output as "tool-use reasoning" (explicit thinking steps) rather
than exact action sequences — this is more generalizable across model capabilities.

Counterargument: If the local model can only generate incoherent plans, the
200ms inference investment is wasted compute that degrades rather than helps. Measure
plan coherence qualitatively on the first 10 outputs before deploying to live panes.

9.2 Reward Hacking (Medium Probability)

The risk: The reward function uses proxy signals (Bash exit codes, file
modifications, error arrays) that can be satisfied without the task actually succeeding.
A model could learn that "add more file modifications" always improves reward, regardless
of whether those modifications are useful.

Why this is particularly risky for tool-use training: Unlike text quality, tool-use
sequences can contain genuinely harmful hacks (creating dummy files to inflate the
file_modification_signal, running `exit 0` to force exit_code success).

Mitigation:
1. The file_modification_signal caps at `modified >= 1` for most domains — adding more
files beyond 1 doesn't improve the reward. No incentive for excessive file creation.
2. Exit codes are parsed from actual Bash output, not from the model's generated text.
The model cannot learn to "hack" exit codes since they are observed signals.
3. The correction_signal (20% weight) acts as a backstop: if the model
generates a visually-clean-but-useless trajectory, the next-prompt correction will
penalize it despite the clean exit codes.
4. The LoRA rank constraint (4 layers, low capacity) limits the model's ability to
memorize reward-hacking patterns specific to the training set.

Counterargument: With only 157 training examples, overfitting to specific reward
patterns is more likely than systematic reward hacking. The real risk is not a
sophisticated hack but a brute-force memorization of the 20 highest-reward examples.
Monitor validation loss: if train_loss << val_loss, this is occurring.

9.3 Compute Limits and Mac5 Availability (Low Probability)

The risk: Mac5 is also serving the MLX Server at :8100, running Ollama, and
participating in the exo cluster. Training + serving + clustering simultaneously could
cause memory pressure.

Mitigation:
1. The finetune-daemon already kills the MLX server before training. The exo cluster
can be gracefully detached from Mac4's master perspective.
2. Training can be scheduled off-peak (add a check to finetune-daemon's `POLL_INTERVAL`
logic: only train between 2AM-6AM and only if Mac5's free memory is above a threshold).
3. The 8-minute training window is short enough that brief Mac5 unavailability for
other services is acceptable.
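
The off-peak check in item 2 could look like the sketch below. The function name and the `min_free_gb` default are hypothetical; the document does not specify a memory threshold, so 6.0 GB is a placeholder assumption, and how free memory is measured on Mac5 is left to the caller.

```python
from datetime import datetime, time


def in_training_window(now: datetime, free_memory_gb: float,
                       start: time = time(2, 0), end: time = time(6, 0),
                       min_free_gb: float = 6.0) -> bool:
    """True when `now` falls in [start, end) and enough memory is free."""
    in_window = start <= now.time() < end
    return in_window and free_memory_gb >= min_free_gb


print(in_training_window(datetime(2026, 3, 10, 3, 0), free_memory_gb=8.0))
print(in_training_window(datetime(2026, 3, 10, 14, 0), free_memory_gb=8.0))
```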

9.4 Small Corpus Overfitting (High Probability with 157 examples)

The risk: 157 examples is genuinely small for LoRA training. The model will likely
memorize much of the training set after 1,000 iterations rather than generalizing.

Evidence: Adapter v1 had loss 1.694 on 972 examples after 500 iterations. With
157 examples and 1,000 iterations, we are doing ~6.4 effective epochs. Even with
dropout and the KL constraint, memorization is likely.

Mitigation:
1. Use early stopping based on validation loss: stop training when val_loss increases
for 3 consecutive checkpoints. With 18 validation examples, this is a coarse signal
but better than fixed iterations.
2. Reduce `--iters` from 1,000 to 500 for the initial run (consistent with v1). Scale
iterations as corpus grows.
3. Add Gaussian noise to advantage weights during training (jitter ±0.1) to prevent
exact memorization of reward-specific patterns.
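
Mitigations 1 and 3 can be sketched as two small helpers. The names are hypothetical, the validation losses are assumed to be collected once per checkpoint, and reading "jitter ±0.1" as a Gaussian with standard deviation 0.1 is an assumption.

```python
import random


def should_stop(val_losses, patience: int = 3) -> bool:
    """True when val loss rose at each of the last `patience` checkpoints."""
    if len(val_losses) < patience + 1:
        return False
    recent = val_losses[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def jitter_advantage(weight: float, sigma: float = 0.1, rng=random) -> float:
    """Add zero-mean Gaussian noise to an advantage weight."""
    return weight + rng.gauss(0.0, sigma)


print(should_stop([1.60, 1.50, 1.52, 1.55, 1.60]))  # three rises in a row
print(should_stop([1.60, 1.50, 1.52, 1.51, 1.60]))  # rise interrupted
```

With only 18 validation examples the loss is noisy, so a patience of 3 is a coarse guard rather than a precise stopping point.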

Acceptance criterion: If the test plan accuracy is >= 0.4 (better than chance) and
val_loss < 1.6, the initial run is a success despite the small corpus.

9.5 Correction Detector Cross-Contamination

The risk: The correction_detector.py is also used by the Cortex pipeline for rule
promotion. Using it as a reward signal creates a feedback loop: high-correction
trajectories get negative reward, which trains the model to avoid actions that trigger
corrections, which reduces correction events, which reduces the training signal, etc.

Assessment: This feedback loop operates over weeks (correction → reward → training
→ deployment → behavior change → fewer corrections). It is actually the desired loop:
a system that learns to avoid generating corrections is learning to act correctly. The
risk is that the loop overshoots — the model becomes too conservative and avoids all
actions that even pattern-match to past corrections, including correct ones.

Mitigation: Monitor the correction_signal component in `RewardResult` logs. If the
fraction of trajectories with `correction_available=True` drops below 10% (i.e., sessions
are almost never triggering the correction detector), investigate whether the remaining
trajectories are genuinely correction-free or whether the signal has gone silent.
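
A monitoring check along these lines might look like the sketch below. It assumes the `RewardResult` logs can be loaded as dicts with a boolean `correction_available` field and that the alert threshold is a 10% fraction; the function names and that reading of the threshold are assumptions.

```python
def correction_coverage(reward_results) -> float:
    """Fraction of reward results whose correction signal was available."""
    if not reward_results:
        raise ValueError("no reward results to inspect")
    available = sum(1 for r in reward_results if r.get("correction_available"))
    return available / len(reward_results)


def signal_gone_silent(reward_results, min_fraction: float = 0.10) -> bool:
    """True when coverage falls below the investigation threshold."""
    return correction_coverage(reward_results) < min_fraction


# Illustrative logs only: 1 of 20 trajectories has a correction signal.
sample = [{"correction_available": i == 0} for i in range(20)]
print(correction_coverage(sample), signal_gone_silent(sample))
```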

---

10. Implementation Sequence

Phase 1: Data Pipeline (Days 1-2)

1. Write `[home-path]` (~180 lines)
- Parse verbose-all.jsonl
- Extract TrajectoryExample objects
- Normalize tool names via TOOL_NAME_MAP
- Output: `[home-path]` (~157 records)

2. Write `[home-path]` (~200 lines)
- Implement 4-component reward formula
- Cross-reference cortex/entries.jsonl for correction signals
- Output: `[home-path]`

3. Write `[home-path]` (~100 lines)
- Compute per-domain baselines
- Apply OAPL-Lite advantage formula
- Output: `[home-path]`

4. Write `[home-path]` (~200 lines)
- Convert to ChatML format with trajectory context
- Apply repeat-based upweighting for high-advantage examples
- Add contrast examples for negative-advantage trajectories
- Write to `[home-path]` + train/valid/test split

Validation gate: `wc -l [home-path]` should be
140-160. Run `python3 -c "import json; [json.loads(l) for l in open('train.jsonl')]"` —
should parse without errors.
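
The same gate can be run as one script that reports the offending line on failure. This is a sketch; `validate_jsonl` is a hypothetical helper, and the 140-160 bounds are the ones stated above.

```python
import json


def validate_jsonl(path, min_lines: int = 140, max_lines: int = 160) -> int:
    """Parse every non-empty line as JSON and check the record count."""
    count = 0
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines such as a trailing newline
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{lineno}: invalid JSON ({exc.msg})") from exc
            count += 1
    if not (min_lines <= count <= max_lines):
        raise ValueError(f"{path}: {count} records, expected {min_lines}-{max_lines}")
    return count
```

Calling `validate_jsonl("train.jsonl")` returns the record count when the file passes and raises `ValueError` otherwise.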

Phase 2: Initial Training on Mac5 (Day 3)

5. Copy `[home-path]` to Mac5 via:

   rsync -av [home-path] mohameddiomande@[ip]:[home-path]

6. Trigger training via finetune-daemon:

   curl -X POST http://[ip]:9200/train \
     -H "Content-Type: application/json" \
     -d '{"source": "oapl-lite", "data_dir": "[home-path]", "iters": 500}'

(Requires adding `data_dir` and `source` params to the daemon's `/train` handler.)

7. Monitor training: `curl http://[ip]:9200/status`

8. After completion: hot-swap adapter via `hot_swap_adapter(2)` in daemon.

Phase 3: Evaluation (Days 4-5)

9. Run automated plan accuracy evaluation on 18 held-out test examples
10. Qualitative review of 10 generated tool plans (are they coherent?)
11. Deploy to 50% of panes for the live A/B test

Phase 4: Integration with Path A (Week 2+)

12. Update `feed_weave.py` to incorporate Path A's annotated trajectories.jsonl
13. Set up Prefect flow: `oapl_lite_retrain` — triggers when trajectory_tap accumulates
50+ new records.
14. Add `oapl-lite` source to `merge_training_data()` in finetune-daemon.py.

---

Sources

Codebase Files Read (with line-level depth)

- `[home]/bin/numu-daemon/packages/numu-weave/src/index.ts`
(270 lines — full NUMUWeave class, CorpusEntry schema, WeaveConfig defaults,
startTraining/evaluate stubs)

- `[home]/Desktop/homelab/compute-pair/finetune-daemon.py`
(540 lines — full training + hot-swap + Prometheus metrics daemon; confirmed
MLX CLI flags, adapter versioning, Mac5 Tailscale IP, merge_training_data sources)

- `[home]/Desktop/homelab/compute-pair/sft-formatter.py`
(296 lines — sft-formatter: verbose-all parsing, VerbosePromptEntry schema,
ChatML output format, dedup via sha256)

- `[home]/.openclaw/browser/corpus-to-sft.py`
(316 lines — browser corpus converter; shows SFT pipeline pattern for new source types)

- `[home]/.claude/prompt-logs/verbose-all.jsonl`
(3,249 entries — confirmed schema: prompt_id, session_id, prompt_text, assistant_turns,
tool_calls with tool_name/parameters/result, files_modified, files_created, errors,
git_repo, git_branch, captured_at)

- `[home]/.claude/prompt-logs/unified.jsonl`
(3,928 entries — confirmed tool_count distribution: 3,925 at 0, 3 at 1 — confirming
that tool_calls in unified.jsonl are simplified and verbose-all is the correct source)

- `[home]/Desktop/evo-cube-output/karl-trajectory-intelligence/stage0-research.md`
(KARL paper summary, OAPL objective, Mac5 fine-tune infrastructure, Cortex system)

- `[home]/Desktop/evo-cube-output/karl-trajectory-intelligence/stage1-path-a.md`
(Path A Trajectory Tap — outcome signals, session buffer, annotation protocol)

Memory Files Consulted

- `[home]/.claude/projects/-Users-mohameddiomande/memory/MEMORY.md`
(Mac5 IPs, LoRA CLI gotchas, adapter v1 stats, finetune-daemon ports)

- `[home]/.claude/agent-memory/research-engine/MEMORY.md`
(KARL paper stats, Stage 0 summary)

Direct Trajectory Data Analyzed

  • 3,249 verbose-all.jsonl entries scanned programmatically
  • 157 entries with tool_calls confirmed
  • Max trajectory depth: 144 tool calls (map-architecture task)
  • Exit code extraction confirmed from result string parsing
  • Distribution: 41 entries with 1-4 calls, 75+ entries with 5+ calls


Source Anchor

evo-cube-output/karl-trajectory-intelligence/stage1-path-b.md
