Teaching My AI Agent to Learn From Its Own Mistakes
In early March, Databricks published KARL (Knowledge Agents via Reinforcement Learning), a system that trains enterprise search agents via reinforcement learning. They had 26 researchers, enterprise GPUs, and a proprietary base model. Their agent beats Claude Opus 4.6 and GPT 5.2 on enterprise search benchmarks at 33% lower cost.
How I Built a Self-Improving Code Agent on a Mac Mini, Inspired by Databricks' KARL
Mohamed Diomande, March 2026
---
I read the paper and thought: what if I applied the same idea, not to enterprise search, but to the software engineering agent I use every day? And not on a GPU cluster, but on the Mac Minis in my living room?
Five days later, I had 485 recorded trajectories, a 5-signal reward engine, a trained LoRA adapter, and a system that automatically learns from every coding session I run. Here's how it works, how it differs from the original, and what I learned.
---
The Core Idea
KARL's premise is simple: instead of hand-writing rules for how an AI agent should behave, you record what the agent actually does, score those recordings based on outcomes, and train on the best ones.
Databricks applies this to enterprise search. Their agent runs a vector search, reads documents, compresses context, and synthesizes an answer. They score it against ground-truth "nuggets" and run off-policy RL (their OAPL algorithm) to improve the model.
I apply it to software engineering. My agent reads files, writes code, runs tests, and deploys services. I score it on whether the user corrected the agent, whether builds passed, and whether the tool sequence was efficient. Then I export the best trajectories as advantage-weighted SFT data and fine-tune a small open-source model.
Same philosophy. Radically different execution.
---
Architecture: 4 Taps, 22 Files, 0 New Infrastructure
The biggest constraint was: no new infrastructure. I already run Claude Code across 5 machines connected via Tailscale, with tasks dispatched from Discord and Telegram. The agent already has hooks that fire on every tool use. I needed to piggyback on what exists.
The 4-Tap System
Claude Code has a hook system. Scripts fire on specific events: when a prompt is submitted, when a tool is used, when the agent finishes responding. I wired 4 tap points into these hooks:
Tap A fires on prompt submission. Opens a JSON buffer file for the session. Records the prompt, working directory, and which skill (if any) was injected.
Tap B fires after every tool use. Appends the tool name, key parameters, success/failure status, and exit code to the buffer. This is the main data collection point. Every Read, Edit, Write, Bash, Grep, and Glob call gets captured.
Tap C fires when the agent finishes responding. Flushes the buffer into `trajectories.jsonl` as a complete record. Runs the reward engine inline to score the trajectory before writing.
Tap D fires on the next prompt submission. Examines the incoming text for correction signals: "no, I meant," "try again," "that's wrong." If detected, it retroactively annotates the previous trajectory as a failure. The user's natural behavior becomes the label.
The entire capture system adds about 5ms per tool call. The agent doesn't know it's being recorded.
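As a rough illustration of taps A through C, here is a minimal sketch of a buffer that is opened, appended to, and flushed. The function names, the record fields, and the `score` callback are assumptions made for this example, not the actual hook scripts, and the sketch leaves out how each hook receives its event payload.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional


def open_buffer(buffer_dir: Path, session_id: str, prompt: str,
                cwd: str, skill: Optional[str] = None) -> Path:
    """Tap A: start a fresh buffer file when a prompt is submitted."""
    buffer_dir.mkdir(parents=True, exist_ok=True)
    path = buffer_dir / f"{session_id}.json"
    record = {"session_id": session_id, "prompt": prompt, "cwd": cwd,
              "skill": skill, "started_at": time.time(), "steps": []}
    path.write_text(json.dumps(record))
    return path


def append_tool_call(path: Path, tool: str, params: dict,
                     success: bool, exit_code: Optional[int] = None) -> None:
    """Tap B: append one tool call to the session buffer."""
    record = json.loads(path.read_text())
    record["steps"].append({"tool": tool, "params": params,
                            "success": success, "exit_code": exit_code})
    path.write_text(json.dumps(record))


def flush_buffer(path: Path, out_file: Path,
                 score: Callable[[dict], float]) -> dict:
    """Tap C: score the finished trajectory and append it as one JSONL line."""
    record = json.loads(path.read_text())
    record["reward"] = score(record)
    with out_file.open("a") as fh:
        fh.write(json.dumps(record) + "\n")
    path.unlink()  # the buffer is only needed while the turn is in flight
    return record
```

Keeping the buffer on disk rather than in memory matters because each hook invocation is a separate short-lived script, so nothing survives between calls except files.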
Why Hooks, Not Rollouts
Databricks collects data through controlled rollouts with their custom "aroll" harness. They spin up solver instances, run 8 attempts per question, and filter by pass rate.
I don't have a separate data collection phase. Every real work session, on every machine, on every project, automatically becomes training data. The agent is always in production and always being recorded.
The tradeoff: I don't get clean, controlled environments. My data is messy, from real projects with real bugs. But that's also the point. The agent should learn from real conditions, not curated benchmarks.
---
The Reward Engine: 5 Signals, Not 1
This is where the biggest divergence from Databricks happens.
Their reward is binary nugget completion: did the answer contain the required information? Score: 0 or 1.
Mine decomposes reward into 5 continuous signals:
### Outcome (30%)
Cross-turn signals. Did the user correct us on the next turn? Did they ask for a redo? Did a subsequent build succeed? Did the session continue (implying satisfaction)?
This is the signal that requires Tap D. If the user says "that's wrong" next turn, this trajectory gets penalized.
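The retroactive labeling can be sketched in a few lines. This is a simplified illustration: the three phrases come from the examples above, while the regex details, the `corrected` field, and the function names are assumptions for the example.

```python
import re

# Phrases taken from the examples in the text; a real list would be longer.
CORRECTION_PATTERNS = [
    r"\bno,? i meant\b",
    r"\btry again\b",
    r"\bthat['’]?s wrong\b",
]
_CORRECTION_RE = re.compile("|".join(CORRECTION_PATTERNS), re.IGNORECASE)


def is_correction(prompt: str) -> bool:
    """Return True if the incoming prompt looks like a correction."""
    return bool(_CORRECTION_RE.search(prompt))


def annotate_previous(trajectories: list, session_id: str,
                      next_prompt: str) -> bool:
    """Mark the most recent trajectory of this session as corrected.

    Returns True if a trajectory was annotated.
    """
    if not is_correction(next_prompt):
        return False
    for traj in reversed(trajectories):
        if traj.get("session_id") == session_id:
            traj["corrected"] = True  # hypothetical field name
            return True
    return False
```

Phrase matching like this is cheap but brittle: it will miss polite corrections and can fire on quoted text, so the resulting label is best treated as a noisy signal.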
### Process (25%)
Within-turn execution quality. What percentage of tool calls succeeded? Were bash commands clean (exit code 0)? Were errors scattered or concentrated?
Key detail: I weight later steps more heavily. A tool failure at step 2 is less costly than a failure at step 15. The intuition: early failures are part of exploration; late failures mean the approach didn't work.
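One way to express "later steps count more" is a position-weighted success rate. The linear ramp below is an assumption chosen for illustration; the text does not specify the actual weighting curve, and this sketch covers only the weighting idea, not exit codes or error clustering.

```python
def process_score(successes: list) -> float:
    """Position-weighted success rate in [0, 1].

    Step i (0-based) gets weight 1 + i/n, so the last step counts
    almost twice as much as the first. The ramp is an assumed shape.
    """
    n = len(successes)
    if n == 0:
        return 0.0
    weights = [1.0 + i / n for i in range(n)]
    earned = sum(w for w, ok in zip(weights, successes) if ok)
    return earned / sum(weights)
```

With four steps, a single failure on the first step scores about 0.82, while a single failure on the last step scores about 0.68, which matches the intuition that late failures are more damaging.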
### Efficiency (15%)
Trajectory shape analysis. Tool diversity measured by Shannon entropy (using 5 different tools is better than running Bash 50 times). Tools per minute (2-8 is the sweet spot). What fraction of tool calls actually modified files.
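The diversity and file-modification pieces can be sketched as follows. Normalizing the entropy by a fixed vocabulary of six tools (the six named in the Tap B description) and treating `Edit` and `Write` as the file-modifying tools are assumptions for this example.

```python
import math
from collections import Counter

MODIFYING_TOOLS = {"Edit", "Write"}  # assumed set of file-modifying tools


def tool_entropy(tools: list, vocab_size: int = 6) -> float:
    """Shannon entropy of the tool distribution, scaled to [0, 1].

    Dividing by log2(vocab_size) means a trajectory only approaches 1.0
    when it spreads its calls across many different tools.
    """
    counts = Counter(tools)
    if len(counts) <= 1:
        return 0.0  # empty or single-tool trajectories have no diversity
    total = len(tools)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return min(1.0, h / math.log2(vocab_size))


def modifying_fraction(tools: list) -> float:
    """Fraction of tool calls that changed a file."""
    if not tools:
        return 0.0
    return sum(1 for t in tools if t in MODIFYING_TOOLS) / len(tools)
```

Fifty Bash calls in a row score 0.0 on diversity, while five different tools used once each score roughly 0.9.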
### Verification (15%)
Did the agent verify its work? Specifically: did it run tests (pytest, cargo test, npm test)? Did it run a build check? Did it read a file after editing it?
Agents that verify their work should be rewarded. Agents that edit and move on without checking should not.
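A simplified sketch of the three checks is shown below. The test command markers come from the text; the build markers, the step dictionary shape (`tool`, `command`, `file`), and the equal one-third weighting are assumptions for illustration.

```python
TEST_MARKERS = ("pytest", "cargo test", "npm test")
BUILD_MARKERS = ("xcodebuild", "cargo build", "npm run build")  # assumed list


def verification_score(steps: list) -> float:
    """Fraction of three verification habits seen in a trajectory:
    ran tests, ran a build, and re-read a file after changing it."""
    commands = [s.get("command", "") for s in steps if s["tool"] == "Bash"]
    ran_tests = any(m in c for c in commands for m in TEST_MARKERS)
    ran_build = any(m in c for c in commands for m in BUILD_MARKERS)

    read_after_edit = False
    edited = set()
    for s in steps:
        if s["tool"] in ("Edit", "Write"):
            edited.add(s.get("file"))
        elif s["tool"] == "Read" and s.get("file") in edited:
            read_after_edit = True  # the read happened after the change

    return (ran_tests + ran_build + read_after_edit) / 3
```

Substring matching on commands is deliberately loose, so a command such as `echo pytest` would also count; a stricter version would parse the command line.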
### Consistency (15%)
Internal coherence. Did the agent read a file before editing it? Did it avoid thrashing (editing the same file 3+ times in rapid succession)?
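Both checks reduce to simple passes over the step list. In this sketch, "rapid succession" is approximated by a sliding window of five steps because the example has no timestamps; the window size, the step shape, and the function names are assumptions.

```python
from collections import Counter


def read_before_edit_ratio(steps: list) -> float:
    """Share of edits whose target file was read earlier in the trajectory."""
    seen, edits, ok = set(), 0, 0
    for s in steps:
        if s["tool"] == "Read":
            seen.add(s["file"])
        elif s["tool"] == "Edit":
            edits += 1
            if s["file"] in seen:
                ok += 1
    return ok / edits if edits else 1.0  # no edits means nothing to violate


def has_thrashing(steps: list, threshold: int = 3, window: int = 5) -> bool:
    """True if any file is edited `threshold`+ times within `window` steps."""
    for i in range(len(steps)):
        edits = Counter(s["file"] for s in steps[i:i + window]
                        if s["tool"] == "Edit")
        if edits and max(edits.values()) >= threshold:
            return True
    return False
```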
The composite score is:

    R = 0.30 * outcome + 0.25 * process + 0.15 * efficiency
      + 0.15 * verification + 0.15 * consistency

All signals normalize to [0, 1]. The advantage is computed as a z-score against the domain baseline:

    A = (R - baseline_domain) / max(std_domain, 1.0)

This normalizes across domains with different reward variances. An iOS trajectory scoring 0.60 might have positive advantage if the iOS baseline is 0.55, while an infra trajectory scoring 0.60 might have negative advantage if the infra baseline is 0.62.
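The two formulas translate directly into code. The sketch below uses the weights and the advantage formula exactly as stated; the function names and the clamping of each signal are assumptions. One observation worth keeping in mind: because rewards are confined to [0, 1], their standard deviation can never reach 1.0, so the `max(std_domain, 1.0)` floor always applies and the advantage as written reduces to `R - baseline_domain`.

```python
WEIGHTS = {
    "outcome": 0.30,
    "process": 0.25,
    "efficiency": 0.15,
    "verification": 0.15,
    "consistency": 0.15,
}


def composite_reward(signals: dict) -> float:
    """Weighted sum of the five signals, each clamped to [0, 1]."""
    return sum(w * min(1.0, max(0.0, signals[name]))
               for name, w in WEIGHTS.items())


def advantage(reward: float, baseline: float, std: float) -> float:
    """Advantage against the domain baseline, as given in the text."""
    return (reward - baseline) / max(std, 1.0)


# The two examples from the paragraph above.
ios_adv = advantage(0.60, baseline=0.55, std=0.07)    # positive
infra_adv = advantage(0.60, baseline=0.62, std=0.07)  # negative
```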
---
What 485 Trajectories Look Like
After 5 days of production capture:
| Stat | Value |
|---|---|
| Total trajectories | 485 |
| Live capture | 429 (88.5%) |
| Backfilled from logs | 56 (11.5%) |
| Mean reward | 0.583 |
| Median reward | 0.601 |
| Std deviation | ~0.07 |
| Positive advantage | 84.3% |
| Domains covered | 10 |
The domain breakdown shows where the agent actually spends its time:
| Domain | Count | Mean Reward |
|---|---|---|
| _global | 174 | 0.590 |
| ios | 100 | 0.595 |
| infra | 68 | 0.572 |
| web | 37 | 0.606 |
| automation | 32 | 0.558 |
| data | 23 | 0.552 |
| creative | 21 | 0.573 |
| systems | 21 | 0.566 |
| knowledge | 6 | 0.560 |
| ml | 2 | 0.455 |
Top-performing trajectories (reward > 0.65) share a pattern: Read -> Read -> Edit -> Bash(test) -> Edit -> Bash(build). The classic "understand, modify, verify" loop.
Worst-performing trajectories (reward < 0.40) tend to be Bash-monoculture: long sequences of only Bash commands with no file reads and no verification.
---
Training: OAPL-Lite on a Mac Mini
Databricks uses full off-policy RL with their OAPL algorithm. This requires infrastructure for policy updates, value function estimation, and KL-regularized optimization.
I use what I call OAPL-Lite: advantage-weighted supervised fine-tuning. The idea:
1. Compute advantage for each trajectory (reward minus domain baseline)
2. Filter to positive-advantage trajectories only
3. Oversample proportional to advantage (high advantage = 3x, medium = 2x, low = 1x)
4. Export as ChatML JSONL
5. Fine-tune with MLX LoRA on gemma-3-1b-it-4bit
The SFT data looks like:

    {
      "messages": [
        {"role": "system", "content": "You are an expert software engineering assistant..."},
        {"role": "user", "content": "[the original prompt]"},
        {"role": "assistant", "content": "1. [ok] Read ../ContentView.swift\n2. [ok] Read ../AppReducer.swift\n3. [ok] Edit ../AppReducer.swift\n4. [ok] Bash: xcodebuild build\n\nResult: 4/4 tools succeeded, reward=0.68"}
      ]
    }

The model learns to predict effective tool-use sequences given a task description. It's not learning to write code. It's learning to plan which tools to use in which order.
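Steps 1 through 4 of the OAPL-Lite recipe can be sketched as a small export function. The 3x/2x/1x tiers come from the list above; the advantage thresholds that separate the tiers, the trajectory field names (`reward`, `domain`, `prompt`, `summary`), and the function names are assumptions for this example. The fine-tuning step itself is not shown.

```python
import json


def oversample_factor(adv: float, high: float = 0.10,
                      medium: float = 0.05) -> int:
    """Map an advantage to a repeat count (thresholds are assumed)."""
    if adv >= high:
        return 3
    if adv >= medium:
        return 2
    return 1


def build_sft_rows(trajectories: list, baselines: dict,
                   system_prompt: str) -> list:
    """Keep positive-advantage trajectories and repeat the best ones."""
    rows = []
    for t in trajectories:
        adv = t["reward"] - baselines[t["domain"]]
        if adv <= 0:
            continue  # step 2: positive advantage only
        row = {"messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": t["prompt"]},
            {"role": "assistant", "content": t["summary"]},
        ]}
        rows.extend([row] * oversample_factor(adv))  # step 3
    return rows


def to_jsonl(rows: list) -> str:
    """Step 4: one chat-format JSON object per line."""
    return "\n".join(json.dumps(r) for r in rows)
```

Repeating rows is the simplest way to weight examples when the trainer has no per-example loss weight; the cost is that the effective dataset looks larger than it is.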
FlowRL-Style Balanced Sampling
With 10 domains of very different sizes (174 _global vs 2 ml), naive sampling would collapse to _global trajectories. I implemented distribution-balanced sampling: equal representation per domain, with oversampling (with replacement) for underrepresented domains.
Four sampling strategies are available:
- Balanced: Equal per domain (default for training)
- Advantage-weighted: Softmax-temperature sampling proportional to advantage
- Top-k: Highest reward only
- Uniform: Random baseline
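The balanced strategy can be sketched as a per-domain draw that switches to sampling with replacement when a domain is too small. The `per_domain` parameter, the `domain` field, and the fixed seed are assumptions for this illustration.

```python
import random
from collections import defaultdict


def balanced_sample(trajectories: list, per_domain: int,
                    seed: int = 0) -> list:
    """Draw the same number of trajectories from every domain."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    by_domain = defaultdict(list)
    for t in trajectories:
        by_domain[t["domain"]].append(t)

    sample = []
    for items in by_domain.values():
        if len(items) >= per_domain:
            sample.extend(rng.sample(items, per_domain))     # no repeats
        else:
            sample.extend(rng.choices(items, k=per_domain))  # with replacement
    return sample
```

With 174 `_global` trajectories and 2 `ml` trajectories, a draw of 10 per domain repeats each `ml` trajectory about five times, so tiny domains are represented equally but risk being memorized.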
Synthetic Data Augmentation
To supplement the 119 real SFT examples, I generate synthetic trajectories from 3 sources:
1. Canonical patterns: 6 high-value tool sequences (read-edit-verify, search-read-write, debug-diagnose-fix, etc.) instantiated with domain-specific prompts
2. Augmented trajectories: High-reward real trajectories with slightly rephrased prompts
3. Counterfactuals: Low-reward trajectories paired with what the correct approach should have been
This brought total training data to 119 + 37 = 156 examples. Small by any standard, but with advantage weighting and balanced sampling, enough to move the needle on a 1B model.
Results
KARL v2 adapter trained on Mac5 (M4 16GB):
- Training examples: 84 (35 real SFT + 37 synthetic + 12 augmented)
- Iterations: 500
- Test loss: 1.843
- Training time: ~8 minutes
The adapter was never formally evaluated against a baseline before Mac5 went offline. This is the honest gap in the work. I have the architecture, the data, and a trained adapter, but not a rigorous A/B comparison.
---
The Cortex Bridge: Behavioral Intelligence
One component that has no equivalent in the Databricks paper is the Cortex bridge.
Cortex is a separate behavioral intelligence system that tracks routing decisions (which skill was selected for which prompt), correction patterns (when users override the agent's approach), and session-level metadata.
The bridge (`cortex_karl_bridge.py`) joins KARL trajectories with Cortex entries by session ID. This enriches each trajectory with:
- The skill routing decision that triggered this session
- Whether Cortex detected corrections in the session window
- The inferred domain from Cortex routing
The bridge also provides cross-system success inference: if Cortex saw a correction, the trajectory gets a negative signal even if the tool-level metrics looked fine.
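The join itself is a lookup by session ID. The sketch below is not the contents of `cortex_karl_bridge.py`; the Cortex entry fields (`skill`, `domain`, `correction`) and the output field names are assumptions used to illustrate the idea.

```python
def enrich(trajectories: list, cortex_entries: list) -> list:
    """Attach Cortex routing and correction data to each trajectory."""
    by_session = {e["session_id"]: e for e in cortex_entries}
    enriched = []
    for t in trajectories:
        merged = dict(t)  # copy so the original record is left untouched
        entry = by_session.get(t["session_id"])
        if entry is not None:
            merged["skill"] = entry.get("skill")
            merged["cortex_domain"] = entry.get("domain")
            merged["cortex_correction"] = bool(entry.get("correction"))
            if merged["cortex_correction"]:
                # Cross-system inference: a correction seen by Cortex
                # overrides clean-looking tool-level metrics.
                merged["success"] = False
        enriched.append(merged)
    return enriched
```

Trajectories with no matching Cortex entry pass through unchanged, which keeps the bridge optional rather than a hard dependency of the capture pipeline.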
---
Lessons and Honest Gaps
### What Worked
- Hook-wired capture: About 5ms of overhead per tool call, no new infrastructure, every session recorded automatically
- 5-signal reward: Much richer signal than binary completion. Verification and consistency scores surface real quality differences.
- Domain-aware baselines: Advantage z-scoring normalizes across domains with different difficulty levels
### What Didn't Work (Yet)
- Correction detection was broken: A schema change in Claude Code renamed `tool_result` to `tool_response`. The hook was reading an empty string for weeks. Every trajectory showed `success=True`. The most novel reward signal (Tap D corrections) captured 0 corrections. Fixed as of March 15, 2026.
- Vector routing barely activated: 1.9%
- No A/B evaluation: The adapter was trained but never compared against baseline on controlled tasks. Mac5 (where the adapter lives) went offline.
- Small training set: 84 examples is extremely small. Databricks uses thousands of rollouts per task.
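The correction-detection failure is the kind of silent schema drift that a defensive reader can surface early. A minimal sketch, assuming the hook event is a dictionary: it tries `tool_response` first and falls back to `tool_result`, and it reports how often neither field carried anything, so an all-empty capture shows up as a number rather than as uniformly successful trajectories. The function names are hypothetical.

```python
from typing import Optional, Tuple

OUTPUT_KEYS = ("tool_response", "tool_result")  # current name first, old name second


def extract_tool_output(event: dict) -> Tuple[Optional[object], Optional[str]]:
    """Return (output, key_used), or (None, None) if no output was found."""
    for key in OUTPUT_KEYS:
        value = event.get(key)
        if value not in (None, "", {}):
            return value, key
    return None, None


def missing_output_rate(events: list) -> float:
    """Fraction of events where no tool output could be read."""
    if not events:
        return 0.0
    missing = sum(1 for e in events if extract_tool_output(e)[1] is None)
    return missing / len(events)
```

A rate near 1.0 across a whole day of sessions would be a strong hint that a field was renamed upstream.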
### What I'd Do Differently
1. Evaluate first, train second. I should have established a benchmark suite before training.
2. Run the reward engine on the holdout set with AND without the adapter to get a real delta.
3. Keep Mac5 up. The M4 Mac Mini is the only machine with enough memory for MLX LoRA training.
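For the second item, the "real delta" could be computed as a paired mean difference over the holdout tasks. This is a sketch under the assumption that each holdout task has a stable ID and has been scored by the same reward engine in both conditions; the function name and input shape are hypothetical.

```python
from statistics import mean


def paired_delta(baseline_rewards: dict, adapter_rewards: dict) -> float:
    """Mean reward change (adapter minus baseline) over shared task IDs."""
    shared = sorted(set(baseline_rewards) & set(adapter_rewards))
    if not shared:
        raise ValueError("no tasks were scored in both conditions")
    return mean(adapter_rewards[k] - baseline_rewards[k] for k in shared)
```

Pairing by task matters with a holdout set this small, because differences in task difficulty would otherwise swamp any effect of the adapter.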
---
How This Differs From Databricks
| | Databricks KARL | This Implementation |
|---|---|---|
| Team | 26 researchers | 1 developer + Claude agents |
| Base model | GLM 4.5 Air (large) | gemma-3-1b-it-4bit (1B) |
| RL method | Full OAPL | OAPL-Lite (advantage-weighted SFT) |
| Tools | Vector search only | 10+ tools (full IDE) |
| Domain | Enterprise search | Software engineering (50+ projects) |
| Reward | Binary nugget completion | 5-signal composite with temporal weighting |
| Data source | Synthetic benchmarks | Live production sessions |
| Cross-turn signal | None | Tap D retroactive correction |
| Behavioral context | None | Cortex bridge enrichment |
| Infrastructure | GPU cluster | 5-node mesh, Apple Silicon |
The core thesis: the same RL-for-agents philosophy works outside enterprise search, with richer signals, on commodity hardware, from live production data. The architecture is the contribution. The evaluation is the next step.
---
What's Next
1. Recompute rewards with the fixed hook. Now that `tool_response` is being read correctly, new trajectories will have accurate process and verification scores.
2. Get Mac5 back online and run the A/B evaluation.
3. Grow the holdout set from 20 to 100+ trajectories with domain balance.
4. Iterate: Use the improved model as the new data generator (Databricks' iterative bootstrapping approach).
5. Publish the evaluation as a follow-up once the numbers are solid.
The code is live. The hooks are firing. Every session is being recorded with corrected signals. The self-improving loop continues.
---
This system was built in a single session on March 10, 2026, resulting in 22 Python files, 485 trajectories, a trained adapter, and this documentation. The implementation references Databricks' KARL paper (arXiv 2603.05218) and adapts its principles for multi-tool software engineering agents on edge hardware.