KARL: Advantage-Weighted Training from Full Agent Session Traces

Extracted abstract or opening context

Standard supervised fine-tuning (SFT) for language model agents operates on input-output pairs: a prompt and the response the model should produce. This format captures *what* an agent said but discards *why* it made specific decisions. We present KARL (Knowledge-Augmented Reinforcement Learning), a trajectory intelligence system that trains language model agents from full session traces rather than isolated completions. A trajectory in KARL records every tool call, file read, code edit, bash command, success signal, and failure signal across an entire work session, preserving the sequential decision structure that determines session outcomes. KARL computes a 5-signal composite reward function (outcome, process, efficiency, verification, and consistency) and applies z-score advantage weighting to identify the decisions that mattered most within each trajectory. We report results from an operational deployment across 11 domains and 290 trajectories (21,380 tool calls). Two LoRA adapters were trained on Gemma-3-4B-it, one on 972 random examples (loss 1.694) and one on 35 advantage-weighted examples (loss 1.843), each on Apple M4 hardware in under 3 minutes. A leave-one-out ablation study on the 5-signal reward function reveals that efficiency (tool diversity via Shannon entropy) is the most important signal (impact = 0.568), while outcome (task completion) is the least important (impact = 0.005). The key finding is that how an agent works matters more than whether it succeeds. We additionally report results from a complementary geometric analysis of conversational trajectories: transition pressure variability predicts conversation convergence at 69.8% accuracy ($z = 2.72$, $p < 0.007$), providing a geometric complement to KARL's reward-based scoring.
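
To make the scoring terms above concrete, the following is a minimal sketch, not KARL's actual implementation. It assumes the efficiency signal is the Shannon entropy of the tool-call distribution normalized to [0, 1], that the five signals are combined by a weighted mean, and that advantages are plain z-scores over a list of rewards. The function names, the normalization, and the equal weights are all hypothetical; the abstract does not specify them, nor whether z-scoring is applied per step or per trajectory.

```python
import math
from collections import Counter
from statistics import mean, pstdev

SIGNALS = ("outcome", "process", "efficiency", "verification", "consistency")


def tool_entropy(tool_calls):
    """Shannon entropy of the tool-call distribution, normalized to [0, 1].

    Normalizing by log2(number of distinct tools) is an assumption.
    """
    counts = Counter(tool_calls)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # no calls, or a single tool: no diversity
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(len(counts))


def composite_reward(signals, weights=None):
    """Weighted mean of the five signals (equal weights are an assumption)."""
    weights = weights or {name: 1.0 for name in SIGNALS}
    return sum(weights[k] * signals[k] for k in weights) / sum(weights.values())


def zscore_advantages(rewards):
    """Z-score each reward against the mean and population std of the list."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)  # identical rewards carry no advantage
    return [(r - mu) / sigma for r in rewards]
```

Under these assumptions, a session that uses only one tool gets an efficiency score of 0, a session that splits its calls evenly across its tools gets 1, and examples with the largest positive z-scores would be the ones selected or up-weighted for training.
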
We describe the complete system: trajectory extraction from Claude Code hook infrastructure, the 5-signal reward function with Bayesian-smoothed domain baselines, FlowRL-style balanced sampling, a Cortex behavioral intelligence bridge for live correction capture, and a remote MLX LoRA training pipeline. We clearly distinguish proven results (operational metrics, training loss, signal ablation, anticipation geometry signal strength) from proposed experiments (downstream evaluation, cross-domain transfer, reward-geometry fusion).
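
One common way to read "Bayesian-smoothed domain baselines" is as shrinkage of each domain's mean reward toward the global mean, so that domains with few trajectories do not produce noisy baselines. The sketch below illustrates that reading only; the function name, the `prior_strength` pseudo-count, and its default value are hypothetical and are not taken from the KARL system.

```python
def smoothed_baseline(domain_rewards, global_mean, prior_strength=5.0):
    """Shrink a domain's mean reward toward the global mean.

    prior_strength acts as a pseudo-count: it is the number of imaginary
    trajectories at the global mean that are mixed into the domain's data.
    Both the form and the default value are assumptions for illustration.
    """
    n = len(domain_rewards)
    return (sum(domain_rewards) + prior_strength * global_mean) / (n + prior_strength)


# A domain with no data falls back to the global mean; a domain with
# many trajectories is dominated by its own observed mean.
empty_domain = smoothed_baseline([], global_mean=0.4)
large_domain = smoothed_baseline([0.9] * 100, global_mean=0.4)
```

With this form, the advantage of a trajectory in a sparse domain would be measured against a baseline close to the global mean, and the baseline moves toward the domain's own mean as more trajectories accumulate.
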

Promotion decision

What has to happen next

Convert into the standard paper schema, add citations, and render a draft PDF.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.