proposal | experiment writeup candidate | score 26

KARL Integration — Evolution³ / Stage 1, Path B: OAPL-Lite



Extracted abstract or opening context

# KARL Integration — Evolution³ / Stage 1, Path B: OAPL-Lite

**Run:** karl-trajectory-intelligence
**Generated:** 2026-03-10
**Method:** Evolution³, divergent exploration

Path B implements a stripped-down version of KARL's OAPL algorithm that runs on Mac5's single M4 chip using offline advantage estimation instead of online rollouts. The core insight: we don't need live rollout infrastructure when we already have 3,249 logged trajectories in `verbose-all.jsonl`, 157 of which contain rich tool-use sequences with exit codes, file diffs, and success signals. The approach converts those trajectories into advantage-weighted training examples, computes rewards from build results, correction signals, and user approval proxies, and trains a LoRA adapter on Mac5 that learns which tool-use sequences are associated with successful task completion.

The target is a LoRA adapter specialized for tool-use reasoning: given a prompt and a skill context, predict the high-advantage next action. This replaces the static `(prompt) -> inject SKILL.md content` pipeline with a `(prompt + trajectory context) -> learned action selection` model that improves as more trajectories accumulate.

The KARL OAPL loss is a regression objective derived from the KL-regularized RL problem.

OAPL's key innovation over GRPO: V*(x) is the *soft optimal value*, not a simple baseline. It is computed in closed form from the rewards of all G rollouts, requiring no value network, no importance weight clipping, and no gradient through the value estimate. This makes it stable at policy lags of 400+ gradient steps, so the training data can be much older than in PPO/GRPO without degrading optimization.
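
The closed-form value and the regression loss described above can be sketched roughly as follows. This excerpt does not spell out either formula, so the sketch assumes the standard soft value from KL-regularized RL, V*(x) = β · log((1/G) · Σ exp(r_i / β)), and a squared error between β · log(π/π_ref) and the advantage r − V*(x). The function names, the `beta` parameter, and the toy inputs are all hypothetical and are not taken from the KARL codebase.

```python
import math
from typing import List, Sequence


def soft_optimal_value(rewards: Sequence[float], beta: float) -> float:
    """Assumed closed form: V*(x) = beta * log(mean(exp(r_i / beta)))."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    if not rewards:
        raise ValueError("need at least one rollout reward")
    scaled = [r / beta for r in rewards]
    m = max(scaled)  # subtract the max before exponentiating (log-sum-exp trick)
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return beta * log_mean_exp


def advantages(rewards: Sequence[float], beta: float) -> List[float]:
    """Advantage of each rollout relative to the soft value of its prompt."""
    v_star = soft_optimal_value(rewards, beta)
    return [r - v_star for r in rewards]


def oapl_lite_loss(
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    rewards: Sequence[float],
    beta: float,
) -> float:
    """Assumed objective: mean of (beta * log(pi / pi_ref) - (r - V*))^2.

    V* is a plain number computed from logged rewards, so no gradient
    would flow through it in a real training loop.
    """
    adv = advantages(rewards, beta)
    squared = [
        (beta * (lp - lr) - a) ** 2
        for lp, lr, a in zip(logp_policy, logp_ref, adv)
    ]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    # Toy group of G = 4 logged rollouts for one prompt (made-up numbers).
    rewards = [1.0, 0.0, 0.0, 1.0]
    print("V*:", soft_optimal_value(rewards, beta=1.0))
    print("advantages:", advantages(rewards, beta=1.0))
```

Because the value depends only on logged rewards, it can be precomputed once per prompt group from the trajectory file, which is what would make a purely offline variant plausible on a single machine. Under this assumed form, a small `beta` pushes V*(x) toward the best reward in the group and a large `beta` pushes it toward the mean.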

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.