KARL-Edge: Multi-Signal Reinforcement Learning for Software Engineering Agents on Commodity Hardware
Mohamed Diomande
Independent Researcher
Technical Report, March 2026
---
Abstract
We present KARL-Edge, an adaptation of the Knowledge Agents via Reinforcement Learning (KARL) framework to multi-tool software engineering agents running on commodity Apple Silicon hardware. Where the original KARL system (Chang et al., 2026) trains enterprise search agents using full off-policy RL with binary reward signals, our system introduces three architectural contributions: (1) a 5-signal composite reward function that decomposes trajectory quality into outcome, process, efficiency, verification, and consistency dimensions; (2) a hook-wired zero-overhead trajectory capture system that records production sessions without separate data collection infrastructure; and (3) a retroactive cross-turn correction signal that uses the user's natural behavior as an implicit reward label. We report preliminary results on 485 trajectories across 10 software engineering domains, with a mean composite reward of 0.583 and 84.3% of trajectories showing positive advantage.
---
1. Introduction
The training of tool-using language model agents via reinforcement learning has emerged as a promising approach to improving agent quality without manual rule engineering. Chang et al. (2026) demonstrate this with KARL, achieving Pareto-optimal cost-quality tradeoffs against Claude Opus 4.6 and GPT 5.2 on enterprise knowledge tasks. Their approach combines a novel off-policy RL algorithm (OAPL), an agentic data synthesis pipeline, and iterative bootstrapping across multiple training iterations.
We ask: can the same principles be applied to a fundamentally different agent domain (software engineering instead of enterprise search), with fundamentally different constraints (commodity hardware instead of GPU clusters, live production data instead of synthetic benchmarks)?
We present KARL-Edge, a system that:
- Records tool-use trajectories from live Claude Code sessions across a 5-node mesh network
- Scores trajectories using a 5-signal composite reward function with z-score advantage normalization
- Trains LoRA adapters on a 1B-parameter open-source model using advantage-weighted SFT
- Operates with zero additional infrastructure, piggybacking on existing hook systems
Our primary contributions are architectural. The reward function decomposition, the hook-wired capture system, and the retroactive correction signal are design choices that could be applied to any agent training pipeline, independent of the specific RL algorithm or base model.
---
2. Background
2.1 KARL (Chang et al., 2026)
The original KARL system trains enterprise search agents on GLM 4.5 Air using OAPL (Optimal Advantage-based Policy Optimization with Lagged Inference), an off-policy RL algorithm that avoids importance-weight clipping by directly optimizing the squared Bellman error.
Key design choices:
- Single tool: Vector search as the sole external tool
- Binary reward: Nugget-based completion scoring, $r(x, y) \in \{0, 1\}$
- Agentic synthesis: Two-stage pipeline generating question-answer pairs from curated corpora
- Iterative bootstrapping: Each RL iteration uses the improved model as the new data generator
- KARLBench: 6-task evaluation suite spanning constraint-driven search, cross-document synthesis, tabular reasoning, exhaustive retrieval, procedural reasoning, and fact aggregation
Results: KARL matches Claude Opus 4.6 quality at 33
2.2 Differences in Problem Setting
Our setting differs from Chang et al. in several fundamental ways:
| Dimension | KARL (Original) | KARL-Edge (Ours) |
|---|---|---|
| Agent domain | Enterprise search | Software engineering |
| Tool space | $|\mathcal{T}| = 1$ (vector search) | $|\mathcal{T}| \geq 10$ (Read, Edit, Write, Bash, Grep, Glob, Task, ...) |
| Environment | Curated document corpora | Live codebases (50+ projects) |
| Reward observability | Immediate (answer vs. ground truth) | Delayed (next-turn correction, build outcome) |
| Compute | Enterprise GPU cluster | Apple Silicon (M4, 16GB) |
| Data source | Synthetic rollouts | Live production sessions |
These differences motivate architectural adaptations in reward design, data collection, and training methodology.
---
3. System Architecture
3.1 Trajectory Capture: The 4-Tap System
We define a trajectory $\tau$ as a sequence of tool-use events within a single agent response:
$$\tau = \big(p,\; s,\; \{(t_i, \theta_i, o_i)\}_{i=1}^{n},\; \mathbf{m}\big)$$
where $p$ is the user prompt, $s$ is the inferred skill/domain, $t_i \in \mathcal{T}$ is the tool name, $\theta_i$ is the tool's input parameters, $o_i \in \{\text{success}, \text{failure}, \text{unknown}\}$ is the outcome, and $\mathbf{m}$ is timing/context metadata.
Capture is achieved through 4 tap points wired into the existing Claude Code hook system:
Tap A ($\text{UserPromptSubmit}$): Initializes buffer $B_s$ for session $s$. Records $p$, working directory, and skill injection.
Tap B ($\text{PostToolUse}$): Appends event $(t_i, \theta_i, o_i)$ to $B_s$. Extracts success from exit codes and error patterns. Latency budget: $\leq 5\text{ms}$.
Tap C ($\text{Stop}$): Flushes $B_s$ to persistent store. Computes inline reward $R(\tau)$. Appends to `trajectories.jsonl`.
Tap D ($\text{UserPromptSubmit}_{t+1}$): Examines prompt $p_{t+1}$ for correction patterns (regex: `no,? I meant|try again|that's wrong|redo|fix that`). If matched, retroactively annotates $\tau_t$ with $\text{correction\_detected} = \text{True}$.
Key property: Tap D provides a delayed reward signal without explicit human labeling. The user's natural language behavior serves as an implicit binary annotation.
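A minimal sketch of the Tap D check, using the regex quoted above. Case-insensitive matching, the function name `annotate_previous`, and writing `False` when nothing matches are assumptions made for illustration; the report does not specify them.

```python
import re

# Pattern taken from the Tap D description; IGNORECASE is an assumption.
CORRECTION_RE = re.compile(
    r"no,? I meant|try again|that's wrong|redo|fix that", re.IGNORECASE
)


def annotate_previous(prev_trajectory, next_prompt):
    """Retroactively label the previous trajectory from the next user prompt."""
    detected = CORRECTION_RE.search(next_prompt) is not None
    prev_trajectory.setdefault("outcome", {})["correction_detected"] = detected
    return prev_trajectory


corrected = annotate_previous({"outcome": {}}, "No, I meant the other file")
accepted = annotate_previous({"outcome": {}}, "Great, now add a unit test")
```

Because the pattern is a plain substring search, a word such as "redo" inside an otherwise neutral prompt would also trigger the annotation, so false positives are possible.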
3.2 Trajectory Store
Trajectories are stored as append-only JSONL with file-level locking (POSIX `flock`). Each record contains:
{session_id, channel, recorded_at, skill: {name, domain},
context: {prompt_text, cwd, git_repo},
trajectory: {tool_sequence, tool_counts, total_tools,
successes, failures, bash_errors, events},
outcome: {reward_score, advantage, components, ...},
timing: {started_at, ended_at, duration_s}}
The store supports backfill from historical prompt logs (56 records recovered) and live capture (429 records at time of writing).
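The append-only write with file-level locking could look roughly like the sketch below. It relies on Python's standard `fcntl.flock`; the helper name `append_trajectory` is hypothetical and the report's actual writer may differ.

```python
import fcntl
import json


def append_trajectory(path, record):
    """Append one record as a single JSON line while holding an exclusive lock."""
    with open(path, "a", encoding="utf-8") as fh:
        fcntl.flock(fh.fileno(), fcntl.LOCK_EX)  # blocks until other writers finish
        try:
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # make sure the line is written before the lock is released
        finally:
            fcntl.flock(fh.fileno(), fcntl.LOCK_UN)
```

Holding the lock only for the duration of a single line write keeps concurrent sessions from interleaving partial records.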
---
4. Reward Function
4.1 Multi-Signal Decomposition
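We decompose trajectory reward into 5 orthogonal signals. Let $\tau$ be a trajectory with $n$ tool events. The composite reward is
$$R(\tau) = w_O R_O(\tau) + w_P R_P(\tau) + w_E R_E(\tau) + w_V R_V(\tau) + w_C R_C(\tau)$$
with weights $\mathbf{w} = (w_O, w_P, w_E, w_V, w_C) = (0.30, 0.25, 0.15, 0.15, 0.15)$ and all component scores $\in [0, 1]$.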
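As a sketch, the weighted combination of the five component scores can be written as follows. The dictionary keys are illustrative names, and the range check is an added safeguard rather than something the report describes.

```python
# Weights in the order outcome, process, efficiency, verification, consistency.
WEIGHTS = {
    "outcome": 0.30,
    "process": 0.25,
    "efficiency": 0.15,
    "verification": 0.15,
    "consistency": 0.15,
}


def composite_reward(components):
    """Weighted sum of the five component scores, each expected in [0, 1]."""
    for name in WEIGHTS:
        value = components[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score {value} is outside [0, 1]")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


neutral = composite_reward({name: 0.5 for name in WEIGHTS})
```

Since the weights sum to 1, the composite stays in $[0, 1]$ whenever every component does.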
4.2 Outcome Score $R_O$
The outcome score captures cross-turn quality signals:
$$R_O = \operatorname{clip}\Big(0.5 + \sum_{j \in \mathcal{S}} \delta_j,\; 0,\; 1\Big)$$
where $\mathcal{S}$ is the set of available outcome signals and $\delta_j$ are signed contributions:
| Signal $j$ | $\delta_j$ if True | $\delta_j$ if False |
|---|---|---|
| correction_detected | $-0.35$ | $+0.35$ |
| redo_detected | $-0.25$ | $+0.25$ |
| build_success | $+0.20$ | $-0.10$ |
| session_continued | $+0.20$ | $0.00$ |
The base score of 0.5 ensures graceful degradation when signals are absent (e.g., no build was triggered, so build_success is null). Missing signals contribute 0, not a penalty.
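The table and the base score of 0.5 translate into a short lookup, sketched below. Clipping the result to $[0, 1]$ is an assumption made so the score stays in the stated component range; `None` stands for a signal that was not observed.

```python
# (contribution if True, contribution if False), copied from the table above.
DELTAS = {
    "correction_detected": (-0.35, +0.35),
    "redo_detected": (-0.25, +0.25),
    "build_success": (+0.20, -0.10),
    "session_continued": (+0.20, 0.00),
}


def outcome_score(signals):
    """Base 0.5 plus signed contributions; missing (None) signals add nothing."""
    score = 0.5
    for name, (if_true, if_false) in DELTAS.items():
        value = signals.get(name)
        if value is None:
            continue  # absent signal: no bonus and no penalty
        score += if_true if value else if_false
    return min(1.0, max(0.0, score))  # clipping is an assumption


no_signals = outcome_score({})
after_correction = outcome_score({"correction_detected": True})
```

With no signals present the function returns the neutral 0.5, matching the graceful-degradation behavior described above.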
4.3 Process Score $R_P$
The process score evaluates within-turn execution quality with temporal weighting.
Temporally-weighted success rate: Later tool calls are weighted more heavily. For event $i$ in a trajectory of length $n$:
Intuition: Early failures are part of exploration (the agent tries an approach and discovers it doesn't work). Late failures indicate the approach didn't converge. A trajectory that fails at step 2 and succeeds at step 15 should score higher than one that succeeds at step 2 and fails at step 15.
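The intuition can be illustrated with a toy weighting. The linear ramp below (event $i$ carries weight $i$) is purely hypothetical and is not the report's weighting function; it only shows that any scheme increasing in $i$ ranks an early failure above a late one.

```python
def temporal_success_rate(outcomes):
    """Success rate where the i-th event (1-indexed) has weight i.

    The linear ramp is a hypothetical stand-in for the report's weights.
    """
    if not outcomes:
        return 0.0  # arbitrary choice for empty trajectories
    weights = range(1, len(outcomes) + 1)
    earned = sum(w for w, ok in zip(weights, outcomes) if ok)
    return earned / sum(weights)


n = 15
early_failure = [i != 2 for i in range(1, n + 1)]   # fails only at step 2
late_failure = [i != 15 for i in range(1, n + 1)]   # fails only at step 15
early_score = temporal_success_rate(early_failure)
late_score = temporal_success_rate(late_failure)
```

Under this ramp the early-failure trajectory scores higher than the late-failure one, even though both contain exactly one failed call.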
Bash cleanliness: $\text{BC} = 1 - (\text{bash\_errors} / \text{bash\_count})$
Error density: Consecutive failures are penalized more than scattered failures:
Late-stage penalty: Failures in the final 25
The composite process score:
4.4 Efficiency Score $R_E$
Tool diversity via normalized Shannon entropy:
$$\text{Div} = \frac{-\sum_{t \in \mathcal{T}_\tau} \frac{c_t}{n} \log \frac{c_t}{n}}{\log |\mathcal{T}_\tau|}$$
where $c_t$ is the count of tool $t$ in the trajectory and $\mathcal{T}_\tau$ is the set of unique tools used. Single-tool trajectories ($|\mathcal{T}_\tau| = 1$) receive a monoculture penalty: $\text{Div} = 0.3$.
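A small sketch of the diversity term, assuming the entropy is normalized by $\log |\mathcal{T}_\tau|$ (the number of distinct tools actually used). Returning 0.0 for an empty trajectory is an additional assumption.

```python
import math
from collections import Counter


def tool_diversity(tool_sequence):
    """Normalized Shannon entropy of tool usage, with the monoculture penalty."""
    counts = Counter(tool_sequence)
    if not counts:
        return 0.0  # assumption: empty trajectories get no diversity credit
    if len(counts) == 1:
        return 0.3  # monoculture penalty from the text
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))


bash_only = tool_diversity(["Bash"] * 5)
balanced = tool_diversity(["Read", "Edit", "Read", "Edit"])
skewed = tool_diversity(["Read", "Read", "Read", "Edit"])
```

The special case for a single tool also avoids dividing by $\log 1 = 0$.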
Duration efficiency: Tools per minute in the optimal range:
File touch rate: Fraction of mutation tools (Write, Edit) relative to total:
4.5 Verification Score $R_V$
Checks whether the agent verified its work after making mutations:
where:
- $\text{has\_test}$: trajectory contains pytest/npm test/cargo test/go test
- $\text{has\_build}$: trajectory contains xcodebuild/cargo build/npm run build
- $\text{read\_after\_write}$: a Read event targets a file previously modified by Edit/Write
If no mutations occurred ($\nexists i : t_i \in \{\text{Write, Edit}\}$), $R_V = 0.6$ (neutral).
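The three verification flags can be derived from the event list along the following lines. The `command` key for Bash events is an assumption (the schema in Appendix B only shows `file_path`), and the sketch stops at the flags because the way they are combined into $R_V$ is not reproduced here.

```python
TEST_CMDS = ("pytest", "npm test", "cargo test", "go test")
BUILD_CMDS = ("xcodebuild", "cargo build", "npm run build")


def verification_flags(events):
    """Scan events in order and report which verification behaviors occurred."""
    flags = {"has_test": False, "has_build": False,
             "read_after_write": False, "has_mutation": False}
    modified = set()
    for ev in events:
        tool = ev["tool_name"]
        params = ev.get("key_params", {})
        path = params.get("file_path")
        if tool == "Bash":
            cmd = params.get("command", "")  # hypothetical field name
            flags["has_test"] |= any(c in cmd for c in TEST_CMDS)
            flags["has_build"] |= any(c in cmd for c in BUILD_CMDS)
        elif tool in ("Edit", "Write"):
            flags["has_mutation"] = True
            if path:
                modified.add(path)
        elif tool == "Read" and path in modified:
            flags["read_after_write"] = True
    return flags


example = verification_flags([
    {"tool_name": "Edit", "key_params": {"file_path": "a.py"}},
    {"tool_name": "Bash", "key_params": {"command": "pytest -q"}},
    {"tool_name": "Read", "key_params": {"file_path": "a.py"}},
])
```

A caller would return the neutral 0.6 whenever `has_mutation` is false, as stated above.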
4.6 Consistency Score $R_C$
Read-before-write discipline:
Anti-thrashing: Files edited 3+ times in a session are flagged:
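Both consistency checks reduce to simple passes over the event list. The sketch below is illustrative only: counting just `Edit` events toward thrashing, and measuring read-before-write as the fraction of mutated files that were read before their first mutation, are assumptions rather than the report's exact definitions.

```python
from collections import Counter

MUTATING = ("Edit", "Write")


def thrashed_files(events, threshold=3):
    """Files with at least `threshold` Edit events in the trajectory."""
    edits = Counter(
        ev["key_params"]["file_path"]
        for ev in events
        if ev["tool_name"] == "Edit" and "file_path" in ev.get("key_params", {})
    )
    return sorted(path for path, count in edits.items() if count >= threshold)


def read_before_write_ratio(events):
    """Fraction of mutated files that were Read before their first mutation."""
    read, mutated, disciplined = set(), set(), 0
    for ev in events:
        path = ev.get("key_params", {}).get("file_path")
        if path is None:
            continue
        if ev["tool_name"] == "Read":
            read.add(path)
        elif ev["tool_name"] in MUTATING and path not in mutated:
            mutated.add(path)
            disciplined += path in read
    return disciplined / len(mutated) if mutated else None


session = [
    {"tool_name": "Read", "key_params": {"file_path": "a.py"}},
    {"tool_name": "Edit", "key_params": {"file_path": "a.py"}},
    {"tool_name": "Edit", "key_params": {"file_path": "a.py"}},
    {"tool_name": "Edit", "key_params": {"file_path": "a.py"}},
    {"tool_name": "Write", "key_params": {"file_path": "b.py"}},
]
```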
4.7 Advantage Computation
Advantage is computed as a z-score against domain-specific baselines:
$$A(\tau) = \frac{R(\tau) - \bar{R}_d}{\max(\sigma_d,\, 1.0)}$$
where $\bar{R}_d$ and $\sigma_d$ are the mean and standard deviation of rewards for domain $d$. The floor of 1.0 on $\sigma_d$ prevents degenerate scaling in low-variance domains.
Interpretation: $A > 0$ means the trajectory performed better than average for its domain. $A < 0$ means it performed worse. This normalizes across domains with different difficulty levels.
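A compact sketch of the per-domain normalization with the stated floor on $\sigma_d$. Using the population standard deviation (`statistics.pstdev`) is an assumption; the report does not say which estimator is used.

```python
from statistics import mean, pstdev


def domain_advantages(records, sigma_floor=1.0):
    """records: list of (domain, reward) pairs. Returns one advantage per record."""
    by_domain = {}
    for domain, reward in records:
        by_domain.setdefault(domain, []).append(reward)
    stats = {
        d: (mean(rs), max(pstdev(rs), sigma_floor))  # floor on sigma_d
        for d, rs in by_domain.items()
    }
    return [(reward - stats[d][0]) / stats[d][1] for d, reward in records]


advantages = domain_advantages([("ios", 0.6), ("ios", 0.5), ("web", 0.7)])
```

Because rewards lie in $[0, 1]$, $\sigma_d$ is always below 1.0, so with the stated floor the denominator is 1.0 and the advantage reduces to $R(\tau) - \bar{R}_d$.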
---
5. Training Pipeline
5.1 OAPL-Lite: Advantage-Weighted SFT
Where Chang et al. use full off-policy RL (OAPL), we approximate with advantage-weighted supervised fine-tuning. The key simplification: instead of optimizing the squared Bellman error with policy gradient updates, we filter and oversample trajectories proportional to their advantage, then train via standard SFT.
Given a set of scored trajectories $\{\tau_k, R_k, A_k\}_{k=1}^{N}$:
1. Filter: Discard trajectories with $A_k \leq 0$ or $|\text{events}(\tau_k)| < 2$
2. Oversample: Duplicate trajectories proportional to advantage:
3. Deduplicate: SHA-256 content hash on (prompt + plan) to remove exact duplicates
4. Format: Convert to ChatML (system + user prompt + assistant tool plan)
5. Train: MLX LoRA fine-tuning on gemma-3-1b-it-4bit
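Steps 1 and 3 of the list above can be sketched as a single pass. The flat record fields (`prompt`, `plan`, `advantage`, `events`) are hypothetical simplifications of the trajectory schema, and the oversampling step is left out.

```python
import hashlib


def select_training_set(trajectories):
    """Keep positive-advantage trajectories with 2+ events, dropping exact duplicates."""
    seen, kept = set(), []
    for traj in trajectories:
        if traj["advantage"] <= 0 or len(traj["events"]) < 2:
            continue  # step 1: filter
        content = (traj["prompt"] + traj["plan"]).encode("utf-8")
        digest = hashlib.sha256(content).hexdigest()  # step 3: content hash
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(traj)
    return kept


candidates = [
    {"prompt": "p1", "plan": "a", "advantage": 0.2, "events": [1, 2]},
    {"prompt": "p1", "plan": "a", "advantage": 0.3, "events": [1, 2, 3]},
    {"prompt": "p2", "plan": "b", "advantage": -0.1, "events": [1, 2]},
    {"prompt": "p3", "plan": "c", "advantage": 0.5, "events": [1]},
]
selected = select_training_set(candidates)
```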
5.2 FlowRL Balanced Sampling
To prevent training collapse on overrepresented domains, we implement distribution-balanced sampling. Given $D$ domains with trajectory counts $\{n_1, \ldots, n_D\}$, the balanced sampler draws $\lfloor B/D \rfloor$ trajectories per domain (with replacement for underrepresented domains), where $B$ is batch size.
Four strategies are available:
- Balanced: Equal per-domain representation (default)
- Advantage: Softmax-temperature sampling with $T = 2.0$: $P(\tau_k) = \exp(A_k / T) \,/\, \sum_{j} \exp(A_j / T)$
- Top-k: Highest reward trajectories only
- Uniform: Random baseline
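The balanced and advantage strategies might be implemented roughly as follows. Function names and the explicit `rng` argument are illustrative choices; the softmax subtracts the maximum advantage only for numerical stability, which leaves the sampling probabilities unchanged.

```python
import math
import random


def balanced_sample(by_domain, batch_size, rng):
    """Draw floor(B / D) items per domain, with replacement when a domain is short."""
    per_domain = batch_size // len(by_domain)
    batch = []
    for items in by_domain.values():
        if len(items) >= per_domain:
            batch.extend(rng.sample(items, per_domain))
        else:
            batch.extend(rng.choices(items, k=per_domain))  # with replacement
    return batch


def advantage_sample(trajectories, advantages, k, rng, temperature=2.0):
    """Sample k trajectories with probability proportional to exp(A / T)."""
    top = max(advantages)
    weights = [math.exp((a - top) / temperature) for a in advantages]
    return rng.choices(trajectories, weights=weights, k=k)


rng = random.Random(0)
pool = {"ios": list(range(10)), "ml": [100, 101]}
batch = balanced_sample(pool, batch_size=8, rng=rng)
picks = advantage_sample(["a", "b", "c"], [0.1, 0.0, -0.2], k=5, rng=rng)
```

With a temperature of 2.0 the advantage strategy stays fairly close to uniform when advantages differ by only a few hundredths.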
5.3 Synthetic Data Augmentation
Three synthetic data sources supplement real trajectories:
1. Pattern instantiation: 6 canonical tool-use patterns (read-edit-verify, search-read-write, debug-diagnose-fix, explore-plan-implement, create-module, deploy-verify) instantiated with domain-specific prompts and placeholder values
2. Trajectory augmentation: High-reward trajectories ($R > 0.55$) with synonym-substituted prompts
3. Counterfactual generation: Low-reward trajectories ($R < 0.40$) paired with canonical "corrected approach" sequences
5.4 Training Configuration
- Base model: gemma-3-1b-it-4bit (Google, quantized)
- Method: LoRA (rank=16, alpha=32)
- Framework: MLX (Apple Silicon optimized)
- Hardware: Mac Mini M4, 16GB unified memory
- Training data: 84 examples (35 real SFT + 37 synthetic + 12 augmented)
- Iterations: 500
- Test loss: 1.843
- Training time: ~8 minutes
---
6. Behavioral Intelligence Bridge
6.1 Cortex Integration
KARL-Edge integrates with Cortex, a behavioral intelligence system that tracks agent routing decisions and correction patterns independently. The bridge (`cortex_karl_bridge.py`) performs session-level joins:
For trajectory $\tau$ with session $s$:
1. Query Cortex entries where $\text{entry.session\_id} = s$
2. Extract routing decisions (skill, domain) from `routing_decision` or `invocation_record` entries
3. Count corrections within the trajectory's time window $[t_{\text{start}}, t_{\text{end}} + 300\text{s}]$
4. Infer success from multi-signal heuristics (build success > correction absence > tool success rate > session continuation)
The bridge enriches trajectories with behavioral context unavailable from tool-level signals alone.
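Step 3 of the join, counting corrections inside the trajectory's time window plus the 300 s grace period, could be sketched like this. The entry fields `type` and `ts` are hypothetical; the actual Cortex record layout is not shown in this report.

```python
from datetime import datetime, timedelta

GRACE = timedelta(seconds=300)


def corrections_in_window(entries, session_id, started_at, ended_at):
    """Count correction entries for the session within [start, end + 300 s]."""
    return sum(
        1
        for entry in entries
        if entry["session_id"] == session_id
        and entry["type"] == "correction"  # hypothetical entry type label
        and started_at <= entry["ts"] <= ended_at + GRACE
    )


start = datetime(2026, 3, 16, 12, 0, 0)
end = start + timedelta(seconds=100)
cortex_entries = [
    {"session_id": "s1", "type": "correction", "ts": end + timedelta(seconds=200)},
    {"session_id": "s1", "type": "correction", "ts": end + timedelta(seconds=400)},
    {"session_id": "s2", "type": "correction", "ts": start},
    {"session_id": "s1", "type": "routing_decision", "ts": start},
]
count = corrections_in_window(cortex_entries, "s1", start, end)
```

The grace period lets a correction typed shortly after the agent stops still be attributed to the trajectory it refers to.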
6.2 Shadow Vector Routing
Skill routing operates in two layers:
1. Regex layer (< 1ms): Pattern matching on prompt text and file paths. 32 skill patterns across 10 domains.
2. Vector layer (async, ~300ms): Prompt embedding via gemini-embedding-001 (3072 dimensions), cosine similarity against skill embedding centroids, weighted by trajectory success rate.
The vector layer operates in shadow mode: it logs routing decisions to `routing_shadow.jsonl` but does not override the regex layer. Once sufficient agreement data is collected, the vector layer can be promoted to primary.
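A sketch of shadow-mode routing, assuming the prompt embedding has already been computed elsewhere. Multiplying cosine similarity by the skill's success rate is one plausible reading of "weighted by trajectory success rate", and the log record layout is hypothetical.

```python
import json
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shadow_route(prompt_vec, centroids, success_rates, regex_choice, log_path):
    """Log what the vector layer would pick, but return the regex layer's choice."""
    scores = {
        skill: cosine(prompt_vec, centroid) * success_rates.get(skill, 1.0)
        for skill, centroid in centroids.items()
    }
    vector_choice = max(scores, key=scores.get)
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps({
            "regex_choice": regex_choice,
            "vector_choice": vector_choice,
            "agree": regex_choice == vector_choice,
        }) + "\n")
    return regex_choice  # shadow mode: the regex decision stays authoritative
```

The logged `agree` field is what an agreement rate between the two layers would be computed from before any promotion decision.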
Current status: 623 shadow entries, 1.9
---
7. Preliminary Results
7.1 Trajectory Statistics
| Metric | Value |
|---|---|
| Total trajectories | 485 |
| Live captured | 429 (88.5%) |
| Backfilled | 56 (11.5%) |
| Mean $R$ | 0.583 |
| Median $R$ | 0.601 |
| Std $R$ | ~0.07 |
| Min $R$ | 0.326 |
| Max $R$ | 0.704 |
| Positive advantage | 84.3% |
| Domains | 10 |
7.2 Domain-Level Analysis
| Domain | $n$ | $\bar{R}$ | Best Tool Pattern |
|---|---|---|---|
| web | 37 | 0.606 | Read-Read-Edit-Bash |
| ios | 100 | 0.595 | Glob-Read-Edit-Bash(xcodebuild) |
| _global | 174 | 0.590 | Mixed |
| creative | 21 | 0.573 | Bash-heavy (evo-cubed) |
| infra | 68 | 0.572 | Read-Bash-Bash-Bash |
| systems | 21 | 0.566 | Read-Write-Edit-Bash(pytest) |
| knowledge | 6 | 0.560 | Read-Read-Read |
| automation | 32 | 0.558 | Read-Edit-Bash(deploy) |
| data | 23 | 0.552 | Read-Bash(sql)-Edit |
| ml | 2 | 0.455 | Bash-monoculture |
7.3 Holdout Evaluation
20 held-out trajectories (stratified across domains):
| Metric | Value |
|---|---|
| Mean $R$ | 0.568 |
| Std | 0.082 |
| Min | 0.328 |
| Max | 0.694 |
| Weak domains ($< \bar{R} - 2\sigma$) | ml (0.328), knowledge (0.400) |
| Strong domains ($> \bar{R} + \sigma$) | desktop (0.657) |
| Domain spread | 0.329 |
| Generalization | Needs work |
7.4 Reward Signal Analysis
The reward distribution shows clear separation between high-quality and low-quality trajectories:
- Top decile ($R > 0.65$, $n \approx 49$): Mean $R = 0.668$. Characterized by diverse tool use, verification steps, and read-before-write discipline.
- Bottom decile ($R < 0.44$, $n \approx 49$): Mean $R = 0.442$. Characterized by Bash monoculture (single tool type), no verification, and no file mutations.
The verification signal $R_V$ shows the strongest discriminative power: trajectories with at least one test/build command score 0.12 higher on average than those without.
7.5 Adapter Training (Preliminary)
The KARL v2 adapter was trained on Mac5 (M4, 16GB) with the following configuration:
- Training examples: 84 (35 real + 37 synthetic + 12 augmented)
- Base: gemma-3-1b-it-4bit
- Method: LoRA (rank 16, alpha 32)
- Iterations: 500
- Final test loss: 1.843
Critical limitation: No A/B evaluation was conducted before the training hardware went offline. The adapter exists but has not been compared against the base model on held-out tasks.
---
8. Limitations
We identify several significant limitations:
### 8.1 Corrupted Reward Signals
A schema change in Claude Code (renaming `tool_result` to `tool_response` in the hook input) caused the PostToolUse hook to read empty strings for all tool responses. Consequences:
- `success` was always `True` (exit codes never parsed)
- `exit_code` was always `None`
- Process score $R_P$ was inflated for Bash-heavy trajectories
- Correction detection (Tap D) captured 0 corrections
This was discovered and fixed on March 15, 2026. All trajectories prior to this date have partially unreliable process and verification scores. The outcome and efficiency scores are unaffected (they derive from trajectory structure, not tool response content).
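One defensive pattern against this class of failure is to accept both field names and fail loudly when neither carries data, rather than silently reading an empty string. The sketch below is an illustration of that idea, not the fix that was actually deployed; `read_tool_response` is a hypothetical helper.

```python
def read_tool_response(hook_input):
    """Return the tool response payload, accepting the new and the old key."""
    for key in ("tool_response", "tool_result"):
        value = hook_input.get(key)
        if value not in (None, ""):
            return value
    raise KeyError("hook input has neither 'tool_response' nor 'tool_result'")


new_style = read_tool_response({"tool_response": {"exit_code": 1}})
old_style = read_tool_response({"tool_result": "error: build failed"})
```

Whether a hook should raise or merely log in this situation is a design choice; the point is that a missing field becomes visible instead of being scored as a success.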
### 8.2 No Controlled Evaluation
The adapter was never evaluated against the base model on identical tasks. We report reward statistics on the training and holdout sets, but cannot claim the adapter produces better tool-use plans than the base model.
### 8.3 Small Training Set
The training set of 84 examples is orders of magnitude smaller than the thousands of rollouts used by Chang et al. Advantage weighting and balanced sampling help, but this remains a clear scaling limitation.
### 8.4 Imbalanced Domain Coverage
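The domain distribution is heavily skewed: `_global` (36%) and `ios` (21%) together account for more than half of all trajectories, while `ml` ($n = 2$) and `knowledge` ($n = 6$) are barely represented.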
### 8.5 Missing Cross-Turn Signals
The correction detection bug means we have zero data points for the most novel reward signal (Tap D). As a result, the outcome score $R_O$ defaults to 0.5 (neutral) for all trajectories, which effectively reduces the composite to a 4-signal reward.
---
9. Related Work
Agent training via RL: KARL (Chang et al., 2026) is the most directly related work. Others include Voyager (Wang et al., 2023) for Minecraft agents, SWE-Agent (Yang et al., 2024) for software engineering, and AgentQ (Putta et al., 2024) for web agents.
Trajectory-based learning: ReAct (Yao et al., 2023) and Reflexion (Shinn et al., 2023) use trajectory reflection for self-improvement, but without RL training. Our approach records trajectories for offline training rather than online reflection.
Multi-signal reward: Reward decomposition has been explored in robotics (van Seijen et al., 2017) and game playing (Juozapaitis et al., 2019), but not previously applied to LLM agent tool-use trajectories.
Edge-device fine-tuning: MLX (Apple, 2024) enables LoRA fine-tuning on Apple Silicon. Our work demonstrates that agent training is feasible at this scale.
---
10. Conclusion
KARL-Edge demonstrates that reinforcement learning from tool-use trajectories transfers from enterprise search to software engineering, from GPU clusters to commodity hardware, and from synthetic benchmarks to live production data. The architectural contributions (a 5-signal composite reward, hook-wired zero-overhead capture, and retroactive correction detection) are independent of the specific base model, RL algorithm, or training scale.
The honest status: the architecture is sound and deployed; the data pipeline works and has collected 485 trajectories; the reward decomposition provides richer signals than binary completion scoring. But the evaluation is preliminary. The most novel reward signal (cross-turn corrections) was silent due to a schema bug. The trained adapter has not been A/B tested. The training set is small.
The path forward is clear: fix the signals (done), recompute rewards, grow the holdout set, and run controlled evaluation. The system is live and recording with corrected hooks. Every new session contributes to the next training iteration.
---
References
Chang, J.D., Drozdov, A., Toshniwal, S., et al. (2026). KARL: Knowledge Agents via Reinforcement Learning. arXiv:2603.05218.
Putta, P., Mills, E., Garg, N., et al. (2024). Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents. arXiv:2408.07199.
Shinn, N., Cassano, F., Gopinath, A., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023.
van Seijen, H., Fatemi, M., Romoff, J., et al. (2017). Hybrid Reward Architecture for Reinforcement Learning. NeurIPS 2017.
Wang, G., Xie, Y., Jiang, Y., et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291.
Yang, J., Jimenez, C.E., Wettig, A., et al. (2024). SWE-Agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv:2405.15793.
Yao, S., Zhao, J., Yu, D., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.
---
Appendix A: Component Weight Sensitivity
The reward function weights $\mathbf{w} = (0.30, 0.25, 0.15, 0.15, 0.15)$ were chosen based on intuition about signal importance. A formal sensitivity analysis would perturb each weight by $\pm 0.05$ and measure the effect on trajectory ranking (Kendall's $\tau$ correlation with the default ranking). This analysis is pending.
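The pending analysis could be set up along the lines below. This is a sketch only: renormalizing the perturbed weights so they sum to 1 is an assumption, and the Kendall coefficient is the simple tau-a variant, in which tied pairs count as neither concordant nor discordant.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall tau-a between two equally long score lists (needs 2+ items)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / pairs


def weight_sensitivity(component_rows, weights, delta=0.05):
    """Perturb each weight by +/- delta and compare rankings with the default."""
    def score(ws):
        return [sum(w * c for w, c in zip(ws, row)) for row in component_rows]

    base = score(weights)
    results = {}
    for k in range(len(weights)):
        for sign in (+1, -1):
            perturbed = list(weights)
            perturbed[k] += sign * delta
            total = sum(perturbed)
            perturbed = [w / total for w in perturbed]  # renormalize (assumption)
            results[(k, sign)] = kendall_tau(base, score(perturbed))
    return results


rows = [
    (0.5, 0.9, 0.6, 0.7, 0.8),
    (0.5, 0.4, 0.3, 0.6, 0.5),
    (0.5, 0.7, 0.8, 0.2, 0.9),
    (0.5, 0.6, 0.5, 0.6, 0.4),
]
report = weight_sensitivity(rows, (0.30, 0.25, 0.15, 0.15, 0.15))
```

Values close to 1 for every perturbation would indicate that the trajectory ranking is stable under small changes to the weights.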
Appendix B: Full Trajectory Schema
{
"id": "traj_{session_hash}_{timestamp}",
"session_id": "uuid",
"channel": "live|backfill",
"recorded_at": "ISO-8601",
"skill": {"name": "string|null", "domain": "string|null"},
"context": {"prompt_text": "string", "cwd": "string", "git_repo": "string|null"},
"trajectory": {
"tool_sequence": ["Read", "Edit", "Bash", ...],
"tool_counts": {"Read": 3, "Edit": 2, "Bash": 1},
"total_tools": 6,
"successes": 5,
"failures": 1,
"bash_errors": 0,
"events": [{
"tool_name": "Read",
"key_params": {"file_path": "/path/to/file.py"},
"success": true,
"exit_code": null,
"duration_ms": null,
"ts": "ISO-8601"
}, ...]
},
"outcome": {
"reward_score": 0.683,
"advantage": 0.095,
"outcome_score": 0.85,
"process_score": 0.72,
"efficiency_score": 0.58,
"verification_score": 0.70,
"consistency_score": 0.80,
"correction_detected": false,
"build_success": true,
"annotation_status": "scored"
},
"timing": {
"started_at": "ISO-8601",
"ended_at": "ISO-8601",
"duration_s": 142.5
},
"cortex_bridge": {
"cortex_skill": "ops:ios",
"cortex_domain": "ios",
"corrections_in_window": 0,
"cortex_entries": 3
}
}
Appendix C: Evaluation Roadmap
To bring evaluation to publication standard:
1. Reward recomputation: Run `reward_engine.py --backfill --force` with fixed hook to recompute all 485 trajectories
2. Grow holdout: Stratified sample of 100+ trajectories (10+ per domain)
3. A/B test: Run same task set with and without KARL adapter, compare:
- Mean reward on held-out tasks
- Build/test pass rate
- Correction frequency
- Task completion time
4. Iterative bootstrapping: Use improved model to generate new SFT data (Databricks' Iter 2/3 approach)
5. Domain transfer: Test if adapter trained on ios+infra improves web+systems performance
6. Weight sensitivity: Perturb reward weights and measure ranking stability