Teaching My AI Agent to Learn From Its Own Mistakes

In early March, Databricks published KARL (Knowledge Agents via Reinforcement Learning), a system that trains enterprise search agents via reinforcement learning. They had 26 researchers, enterprise GPUs, and a proprietary base model. Their agent beats Claude Opus 4.6 and GPT 5.2 on enterprise search benchmarks at 33% lower cost.

How I Built a Self-Improving Code Agent on a Mac Mini, Inspired by Databricks' KARL

Mohamed Diomande, March 2026

---

I read the paper and thought: what if I applied the same idea, not to enterprise search, but to the software engineering agent I use every day? And not on a GPU cluster, but on the Mac Minis in my living room?

Five days later, I had 485 recorded trajectories, a 5-signal reward engine, a trained LoRA adapter, and a system that automatically learns from every coding session I run. Here's how it works, how it differs from the original, and what I learned.

---

The Core Idea

KARL's premise is simple: instead of hand-writing rules for how an AI agent should behave, you record what the agent actually does, score those recordings based on outcomes, and train on the best ones.

Databricks applies this to enterprise search. Agent runs a vector search, reads documents, compresses context, synthesizes an answer. They score it against ground-truth "nuggets" and run off-policy RL (their OAPL algorithm) to improve the model.

I apply it to software engineering. Agent reads files, writes code, runs tests, deploys services. I score it based on whether the user corrected the agent, whether builds passed, whether the tool sequence was efficient. Then I export the best trajectories as advantage-weighted SFT data and fine-tune a small open-source model.

Same philosophy. Radically different execution.

---

Architecture: 4 Taps, 22 Files, 0 New Infrastructure

The biggest constraint was: no new infrastructure. I already run Claude Code across 5 machines connected via Tailscale, with tasks dispatched from Discord and Telegram. The agent already has hooks that fire on every tool use. I needed to piggyback on what exists.

The 4-Tap System

Claude Code has a hook system. Scripts fire on specific events: when a prompt is submitted, when a tool is used, when the agent finishes responding. I wired 4 tap points into these hooks:

Tap A fires on prompt submission. Opens a JSON buffer file for the session. Records the prompt, working directory, and which skill (if any) was injected.

Tap B fires after every tool use. Appends the tool name, key parameters, success/failure status, and exit code to the buffer. This is the main data collection point. Every Read, Edit, Write, Bash, Grep, and Glob call gets captured.

Tap C fires when the agent finishes responding. Flushes the buffer into `trajectories.jsonl` as a complete record. Runs the reward engine inline to score the trajectory before writing.

Tap D fires on the next prompt submission. Examines the incoming text for correction signals: "no, I meant," "try again," "that's wrong." If detected, it retroactively annotates the previous trajectory as a failure. The user's natural behavior becomes the label.

The entire capture system adds about 5ms per tool call. The agent doesn't know it's being recorded.
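A minimal sketch of the buffer-and-flush flow and the Tap D check, assuming an in-memory buffer and a hand-picked phrase list. The function names, record fields, and regex patterns are illustrative assumptions rather than the actual hook scripts, and the wiring into Claude Code's hook events is omitted.

```python
import json
import re

# Hypothetical phrase list; the real Tap D patterns are not shown in this post.
CORRECTION_PATTERNS = [
    r"\bno,? i meant\b",
    r"\btry again\b",
    r"\bthat'?s wrong\b",
]
_CORRECTION_RE = re.compile("|".join(CORRECTION_PATTERNS), re.IGNORECASE)


def open_buffer(session_id, prompt, cwd, skill=None):
    """Tap A: start a per-session buffer when a prompt is submitted."""
    return {"session_id": session_id, "prompt": prompt, "cwd": cwd,
            "skill": skill, "steps": [], "corrected": False}


def record_tool_call(buffer, tool, params, success, exit_code=None):
    """Tap B: append one tool call to the buffer."""
    buffer["steps"].append({"tool": tool, "params": params,
                            "success": success, "exit_code": exit_code})


def flush(buffer):
    """Tap C: serialize the finished trajectory as one JSONL line."""
    return json.dumps(buffer)


def is_correction(next_prompt):
    """Tap D: does the next prompt look like the user correcting the agent?"""
    return bool(_CORRECTION_RE.search(next_prompt))


def annotate_previous(trajectory, next_prompt):
    """Tap D: retroactively mark the previous trajectory if it was corrected."""
    if is_correction(next_prompt):
        trajectory["corrected"] = True
    return trajectory
```

A phrase list like this is cheap but brittle: it misses polite corrections and can fire on quoted text, so the patterns deserve periodic review against real sessions.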

Why Hooks, Not Rollouts

Databricks collects data through controlled rollouts with their custom "aroll" harness. They spin up solver instances, run 8 attempts per question, and filter by pass rate.

I don't have a separate data collection phase. Every real work session, on every machine, on every project, automatically becomes training data. The agent is always in production and always being recorded.

The tradeoff: I don't get clean, controlled environments. My data is messy, from real projects with real bugs. But that's also the point. The agent should learn from real conditions, not curated benchmarks.

---

The Reward Engine: 5 Signals, Not 1

This is where the biggest divergence from Databricks happens.

Their reward is binary nugget completion: did the answer contain the required information? Score: 0 or 1.

Mine decomposes reward into 5 continuous signals:

### Outcome (30%)
Cross-turn signals. Did the user correct us on the next turn? Did they ask for a redo? Did a subsequent build succeed? Did the session continue (implying satisfaction)?

This is the signal that requires Tap D. If the user says "that's wrong" next turn, this trajectory gets penalized.

### Process (25%)
Within-turn execution quality. What percentage of tool calls succeeded? Were bash commands clean (exit code 0)? Were errors scattered or concentrated?

Key detail: I weight later steps more heavily. A tool failure at step 2 is less costly than a failure at step 15. The intuition: early failures are part of exploration; late failures mean the approach didn't work.
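One way to express the "later steps cost more" idea is a position-weighted success rate. The sketch below assumes a linear ramp (step i gets weight i); the actual weighting curve is not specified here, so treat the ramp and the function name as assumptions.

```python
def process_score(successes):
    """Position-weighted success rate in [0, 1].

    `successes` is a list of booleans, one per tool call, in order.
    Step i (1-indexed) gets weight i, so a late failure costs more
    than an early one. The linear ramp is an assumption.
    """
    if not successes:
        return 0.0  # assumption: no tool calls earns no process credit
    weights = range(1, len(successes) + 1)
    total = sum(weights)
    earned = sum(w for w, ok in zip(weights, successes) if ok)
    return earned / total
```

With four steps, a failure at step 1 scores 9/10 while a failure at step 4 scores 6/10, which matches the exploration-versus-dead-end intuition.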

### Efficiency (15%)
Trajectory shape analysis. Tool diversity measured by Shannon entropy (using 5 different tools is better than running Bash 50 times). Tools per minute (2-8 is the sweet spot). What fraction of tool calls actually modified files.
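The diversity and file-modification measures can be sketched as below. Normalizing the entropy by a fixed vocabulary of 6 tools, and counting only Edit and Write as file-modifying, are both assumptions made for illustration; the tools-per-minute term is left out because its scoring curve is not described.

```python
import math
from collections import Counter

# Assumption: which tools count as modifying files.
MODIFYING_TOOLS = {"Edit", "Write"}


def tool_diversity(tools, vocab_size=6):
    """Shannon entropy of the tool mix, normalized to [0, 1].

    `vocab_size` is the assumed number of distinct tools available;
    dividing by log2(vocab_size) maps a uniform mix over all of them to 1.0.
    """
    if not tools or vocab_size < 2:
        return 0.0
    n = len(tools)
    counts = Counter(tools)
    entropy = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    return min(entropy / math.log2(vocab_size), 1.0)


def modifying_fraction(tools):
    """Fraction of tool calls that changed a file."""
    if not tools:
        return 0.0
    return sum(t in MODIFYING_TOOLS for t in tools) / len(tools)
```

A run of 50 Bash calls has zero entropy, while an even mix of six tools reaches the maximum, so the Bash-only sessions are penalized without any tool-specific rule.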

### Verification (15%)
Did the agent verify its work? Specifically: did it run tests (pytest, cargo test, npm test)? Did it run a build check? Did it read a file after editing it?

Agents that verify their work should be rewarded. Agents that edit and move on without checking should not.
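A rough sketch of the three checks, assuming each step is a dict with a `tool` key plus either `command` (for Bash) or `path` (for file tools). The step schema, the command patterns, and the equal weighting of the three checks are all assumptions.

```python
import re

# Assumed command patterns; a real project list would be longer.
TEST_RE = re.compile(r"\b(pytest|cargo test|npm test)\b")
BUILD_RE = re.compile(r"\b(xcodebuild|cargo build|npm run build)\b")


def verification_score(steps):
    """Return the fraction of verification behaviors seen, in [0, 1]."""
    ran_tests = ran_build = read_after_edit = False
    edited = set()
    for step in steps:
        tool = step["tool"]
        path = step.get("path")
        if tool == "Bash":
            cmd = step.get("command", "")
            ran_tests = ran_tests or bool(TEST_RE.search(cmd))
            ran_build = ran_build or bool(BUILD_RE.search(cmd))
        elif tool in ("Edit", "Write") and path:
            edited.add(path)
        elif tool == "Read" and path in edited:
            # Re-reading a file that was already edited counts as a check.
            read_after_edit = True
    return (ran_tests + ran_build + read_after_edit) / 3
```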

### Consistency (15%)
Internal coherence. Did the agent read a file before editing it? Did it avoid thrashing (editing the same file 3+ times in rapid succession)?
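The two coherence checks might look like the sketch below. It approximates "rapid succession" as back-to-back Edit calls on the same file, and halves the score when thrashing is detected; both choices, along with the step schema, are assumptions.

```python
def consistency_score(steps, thrash_threshold=3):
    """Read-before-edit rate, halved if any file is edited repeatedly in a row."""
    seen_reads = set()
    edits = informed = 0
    run_path, run_len, thrashed = None, 0, False
    for step in steps:
        tool, path = step["tool"], step.get("path")
        if tool == "Read":
            seen_reads.add(path)
        if tool == "Edit":
            edits += 1
            informed += path in seen_reads  # edit was preceded by a read
            run_len = run_len + 1 if path == run_path else 1
            run_path = path
            thrashed = thrashed or run_len >= thrash_threshold
        else:
            # Any non-Edit step breaks a run of consecutive edits.
            run_path, run_len = None, 0
    read_first = informed / edits if edits else 1.0
    return read_first * (0.5 if thrashed else 1.0)
```

Under this definition an Edit, Bash(test), Edit loop on one file is not thrashing, because the test run between edits resets the counter.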

The composite score is:

R = 0.30 * outcome + 0.25 * process + 0.15 * efficiency
    + 0.15 * verification + 0.15 * consistency

All signals normalize to [0, 1]. The advantage is computed as a z-score against the domain baseline:

A = (R - baseline_domain) / max(std_domain, 1.0)

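This centers each trajectory on its own domain's baseline. An iOS trajectory scoring 0.60 might have positive advantage if the iOS baseline is 0.55, while an infra trajectory scoring 0.60 might have negative advantage if the infra baseline is 0.62. One caveat: every signal lies in [0, 1], so std_domain can never exceed 0.5 and the floor of 1.0 is always active. As written, the formula reduces to R - baseline_domain; scaling by each domain's variance would need a much smaller floor.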
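The two formulas above translate directly into code. This is a sketch with hypothetical function and parameter names; the `std_floor` default of 1.0 is taken from the formula as written.

```python
# Signal weights from the composite formula.
WEIGHTS = {
    "outcome": 0.30,
    "process": 0.25,
    "efficiency": 0.15,
    "verification": 0.15,
    "consistency": 0.15,
}


def composite_reward(signals):
    """Weighted sum of the five signals, each expected in [0, 1]."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


def advantage(reward, baseline, std, std_floor=1.0):
    """Reward relative to the domain baseline, divided by the floored std."""
    return (reward - baseline) / max(std, std_floor)
```

For the example in the text, `advantage(0.60, 0.55, 0.07)` is positive and `advantage(0.60, 0.62, 0.07)` is negative.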

---

What 485 Trajectories Look Like

After 5 days of production capture:

| Stat | Value |
| --- | --- |
| Total trajectories | 485 |
| Live capture | 429 (88.5%) |
| Backfilled from logs | 56 (11.5%) |
| Mean reward | 0.583 |
| Median reward | 0.601 |
| Std deviation | ~0.07 |
| Positive advantage | 84.3% |
| Domains covered | 10 |

The domain breakdown shows where the agent actually spends its time:

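| Domain | Count | Mean Reward |
| --- | --- | --- |
| _global | 174 | 0.590 |
| ios | 100 | 0.595 |
| infra | 68 | 0.572 |
| web | 37 | 0.606 |
| automation | 32 | 0.558 |
| data | 23 | 0.552 |
| creative | 21 | 0.573 |
| systems | 21 | 0.566 |
| knowledge | 6 | 0.560 |
| ml | 2 | 0.455 |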

Top-performing trajectories (reward > 0.65) share a pattern: Read -> Read -> Edit -> Bash(test) -> Edit -> Bash(build). The classic "understand, modify, verify" loop.

Worst-performing trajectories (reward < 0.40) tend to be Bash-monoculture: long sequences of only Bash commands with no file reads and no verification.

---

Training: OAPL-Lite on a Mac Mini

Databricks uses full off-policy RL with their OAPL algorithm. This requires infrastructure for policy updates, value function estimation, and KL-regularized optimization.

I use what I call OAPL-Lite: advantage-weighted supervised fine-tuning. The idea:

1. Compute advantage for each trajectory (reward minus domain baseline)
2. Filter to positive-advantage trajectories only
3. Oversample proportional to advantage (high advantage = 3x, medium = 2x, low = 1x)
4. Export as ChatML JSONL
5. Fine-tune with MLX LoRA on gemma-3-1b-it-4bit
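Steps 1 through 4 can be sketched as a small export function. The advantage thresholds for the 3x/2x/1x tiers, the trajectory field names (`prompt`, `summary`, `reward`, `domain`), and the function names are assumptions; step 5 (the MLX LoRA run itself) is not shown.

```python
import json


def oversample_factor(adv, high=0.10, medium=0.05):
    """Map advantage to a copy count. The tier thresholds are assumptions."""
    if adv >= high:
        return 3
    if adv >= medium:
        return 2
    return 1


def to_chatml(traj, system_prompt):
    """Build one ChatML-style record from a trajectory."""
    return {"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": traj["prompt"]},
        {"role": "assistant", "content": traj["summary"]},
    ]}


def build_sft_rows(trajectories, baselines, system_prompt):
    """Return JSONL lines for positive-advantage trajectories, oversampled."""
    rows = []
    for traj in trajectories:
        adv = traj["reward"] - baselines[traj["domain"]]
        if adv <= 0:
            continue  # step 2: keep positive advantage only
        line = json.dumps(to_chatml(traj, system_prompt))
        rows.extend([line] * oversample_factor(adv))  # step 3
    return rows
```

Duplicating rows is the simplest way to weight examples when the trainer has no per-example loss weight; the cost is that high-advantage trajectories are more likely to be memorized verbatim.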

The SFT data looks like:

```json
{
  "messages": [
    {"role": "system", "content": "You are an expert software engineering assistant..."},
    {"role": "user", "content": "[the original prompt]"},
    {"role": "assistant", "content": "1. [ok] Read ../ContentView.swift\n2. [ok] Read ../AppReducer.swift\n3. [ok] Edit ../AppReducer.swift\n4. [ok] Bash: xcodebuild build\n\nResult: 4/4 tools succeeded, reward=0.68"}
  ]
}
```

The model learns to predict effective tool-use sequences given a task description. It's not learning to write code. It's learning to plan which tools to use in which order.

FlowRL-Style Balanced Sampling

With 10 domains of very different sizes (174 _global vs 2 ml), naive sampling would collapse to _global trajectories. I implemented distribution-balanced sampling: equal representation per domain, with oversampling (with replacement) for underrepresented domains.

Four sampling strategies are available:
- Balanced: Equal per domain (default for training)
- Advantage-weighted: Softmax-temperature sampling proportional to advantage
- Top-k: Highest reward only
- Uniform: Random baseline
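The balanced and advantage-weighted strategies could look like the sketch below. The function names, the `domain` and `advantage` fields, the default temperature, and the seeded default RNG are all assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict


def balanced_sample(trajectories, per_domain, rng=None):
    """Draw `per_domain` items from every domain.

    Large domains are sampled without replacement; domains with fewer
    than `per_domain` items are oversampled with replacement.
    """
    rng = rng or random.Random(0)  # seeded default keeps runs repeatable
    by_domain = defaultdict(list)
    for traj in trajectories:
        by_domain[traj["domain"]].append(traj)
    sample = []
    for items in by_domain.values():
        if len(items) >= per_domain:
            sample.extend(rng.sample(items, per_domain))
        else:
            sample.extend(rng.choices(items, k=per_domain))
    return sample


def advantage_weighted_sample(trajectories, k, temperature=0.05, rng=None):
    """Sample with softmax(advantage / temperature) weights."""
    rng = rng or random.Random(0)
    top = max(t["advantage"] for t in trajectories)
    # Subtracting the max keeps exp() numerically stable.
    weights = [math.exp((t["advantage"] - top) / temperature)
               for t in trajectories]
    return rng.choices(trajectories, weights=weights, k=k)
```

Note that with only 2 ml trajectories, balanced sampling repeats each of them many times, so a domain that small contributes volume but little variety.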

Synthetic Data Augmentation

To supplement the 119 real SFT examples, I generate synthetic trajectories from 3 sources:

1. Canonical patterns: 6 high-value tool sequences (read-edit-verify, search-read-write, debug-diagnose-fix, etc.) instantiated with domain-specific prompts
2. Augmented trajectories: High-reward real trajectories with slightly rephrased prompts
3. Counterfactuals: Low-reward trajectories paired with what the correct approach should have been

This brought total training data to 119 + 37 = 156 examples. Small by any standard, but with advantage weighting and balanced sampling, enough to move the needle on a 1B model.

Results

KARL v2 adapter trained on Mac5 (M4 16GB):
- Training examples: 84 (35 real SFT + 37 synthetic + 12 augmented)
- Iterations: 500
- Test loss: 1.843
- Training time: ~8 minutes

The adapter was never formally evaluated against a baseline before Mac5 went offline. This is the honest gap in the work. I have the architecture, the data, and a trained adapter, but not a rigorous A/B comparison.

---

The Cortex Bridge: Behavioral Intelligence

One component that has no equivalent in the Databricks paper is the Cortex bridge.

Cortex is a separate behavioral intelligence system that tracks routing decisions (which skill was selected for which prompt), correction patterns (when users override the agent's approach), and session-level metadata.

The bridge (`cortex_karl_bridge.py`) joins KARL trajectories with Cortex entries by session ID. This enriches each trajectory with:

- The skill routing decision that triggered this session
- Whether Cortex detected corrections in the session window
- The inferred domain from Cortex routing

The bridge also provides cross-system success inference: if Cortex saw a correction, the trajectory gets a negative signal even if the tool-level metrics looked fine.
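The join itself is a dictionary lookup keyed on session ID. The sketch below is not the contents of `cortex_karl_bridge.py`; the field names (`session_id`, `skill`, `domain`, `correction`, `success`) and the function name are hypothetical.

```python
def enrich(trajectories, cortex_entries):
    """Attach Cortex routing context to each trajectory by session ID."""
    by_session = {entry["session_id"]: entry for entry in cortex_entries}
    enriched = []
    for traj in trajectories:
        merged = dict(traj)  # copy so the input records stay untouched
        entry = by_session.get(traj["session_id"])
        if entry is not None:
            merged["skill"] = entry.get("skill")
            merged["cortex_domain"] = entry.get("domain")
            merged["cortex_correction"] = bool(entry.get("correction", False))
            if merged["cortex_correction"]:
                # Cross-system inference: a Cortex correction overrides
                # tool-level success.
                merged["success"] = False
        enriched.append(merged)
    return enriched
```

Trajectories with no matching Cortex entry pass through unchanged, so the bridge can run even when Cortex coverage is partial.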

---

Lessons and Honest Gaps

### What Worked
- Hook-wired capture: Negligible overhead (about 5ms per tool call), no new infrastructure, every session recorded automatically
- 5-signal reward: Much richer signal than binary completion. Verification and consistency scores surface real quality differences.
- Domain-aware baselines: Advantage z-scoring normalizes across domains with different difficulty levels

### What Didn't Work (Yet)
- Correction detection was broken: A schema change in Claude Code renamed `tool_result` to `tool_response`. The hook was reading an empty string for weeks. Every trajectory showed `success=True`. The most novel reward signal (Tap D corrections) captured 0 corrections. Fixed as of March 15, 2026.
- Vector routing barely activated: 1.9%
- No A/B evaluation: The adapter was trained but never compared against baseline on controlled tasks. Mac5 (where the adapter lives) went offline.
- Small training set: 84 examples is extremely small. Databricks uses thousands of rollouts per task.

### What I'd Do Differently
1. Evaluate first, train second. I should have established a benchmark suite before training.
2. Run the reward engine on the holdout set with AND without the adapter to get a real delta.
3. Keep Mac5 up. The M4 Mac Mini is the only machine with enough memory for MLX LoRA training.

---

How This Differs From Databricks

| | Databricks KARL | This Implementation |
| --- | --- | --- |
| Team | 26 researchers | 1 developer + Claude agents |
| Base model | GLM 4.5 Air (large) | gemma-3-1b-it-4bit (1B) |
| RL method | Full OAPL | OAPL-Lite (advantage-weighted SFT) |
| Tools | Vector search only | 10+ tools (full IDE) |
| Domain | Enterprise search | Software engineering (50+ projects) |
| Reward | Binary nugget completion | 5-signal composite with temporal weighting |
| Data source | Synthetic benchmarks | Live production sessions |
| Cross-turn signal | None | Tap D retroactive correction |
| Behavioral context | None | Cortex bridge enrichment |
| Infrastructure | GPU cluster | 5-node mesh, Apple Silicon |

The core thesis: the same RL-for-agents philosophy works outside enterprise search, with richer signals, on commodity hardware, from live production data. The architecture is the contribution. The evaluation is the next step.

---

What's Next

1. Recompute rewards with the fixed hook. Now that `tool_response` is being read correctly, new trajectories will have accurate process and verification scores.
2. Get Mac5 back online and run the A/B evaluation.
3. Grow the holdout set from 20 to 100+ trajectories with domain balance.
4. Iterate: Use the improved model as the new data generator (Databricks' iterative bootstrapping approach).
5. Publish the evaluation as a follow-up once the numbers are solid.

The code is live. The hooks are firing. Every session is being recorded with corrected signals. The self-improving loop continues.

---

This system was built in a single session on March 10, 2026, resulting in 22 Python files, 485 trajectories, a trained adapter, and this documentation. The implementation references Databricks' KARL paper (arXiv 2603.05218) and adapts its principles for multi-tool software engineering agents on edge hardware.
