KARL: Trajectory Reward Engine
KARL scores full agent sessions instead of isolated completions. The ledger records tool calls, failures, verifications, and corrections, then uses a reward model to identify which trajectories deserve to become training data.
Paper workspace
Live draft structure
Artifacts
System and paper source
KARL is a running system. A public PDF render should follow a privacy review of the trajectory examples.
source-only
Editable source
A running system and paper source both exist. Raw traces remain private; the public page should expose method and proof summaries only.
Source anchors
karl/paper/karl-paper.md
karl-research-paper.md
karl reward engine and trajectory ledger deployment
Status
Running; thousands of trajectories scored, retraining loop active.
Key claims
01
How an agent works can be more instructive than whether one final response looked correct.
02
Trajectory-level reward creates a continuous learning signal from real work.
03
Training data selection should be advantage-weighted, not random.
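As a minimal sketch of claim 03, a session's advantage can be read as its reward minus a baseline, with training examples sampled in proportion to a weight that grows with advantage. The function names, the batch-mean baseline, the softmax weighting, and the `temperature` parameter are illustrative assumptions, not details taken from the KARL source.

```python
import math
import random
from typing import Dict, List


def advantages(rewards: Dict[str, float]) -> Dict[str, float]:
    """Advantage = reward minus the batch mean (an assumed baseline)."""
    baseline = sum(rewards.values()) / len(rewards)
    return {sid: r - baseline for sid, r in rewards.items()}


def selection_weights(adv: Dict[str, float], temperature: float = 0.5) -> Dict[str, float]:
    """Softmax over advantages; a lower temperature favors the best sessions more."""
    peak = max(adv.values())  # subtract the max for numerical stability
    exps = {sid: math.exp((a - peak) / temperature) for sid, a in adv.items()}
    total = sum(exps.values())
    return {sid: e / total for sid, e in exps.items()}


def sample_sessions(weights: Dict[str, float], k: int, seed: int = 0) -> List[str]:
    """Draw k session ids with replacement, proportionally to their weights."""
    rng = random.Random(seed)
    ids = list(weights)
    return rng.choices(ids, weights=[weights[s] for s in ids], k=k)


# Hypothetical rewards for three scored sessions.
rewards = {"s1": 0.9, "s2": 0.4, "s3": 0.2}
adv = advantages(rewards)
weights = selection_weights(adv)
picked = sample_sessions(weights, k=5)
```

Uniform random selection would correspond to giving every session the same weight; here the higher-reward session `s1` is drawn more often than `s2` or `s3`.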
Public reading note
System summary public; raw session traces are private.
Standard skeleton
What this paper must keep proving
problem
Agent systems produce rich tool-use behavior but usually discard the process signal after the final answer.
method
Score complete trajectories with observable process/outcome signals and use high-advantage sessions for training data.
implementation
Trajectory taps, ledger daemon, reward engine, SFT exporter, and entity/skill performance bridge.
data
Private real agent sessions summarized into privacy-safe metrics and training examples after review.
evaluation
Reward ablations, downstream training lift, routing improvements, and benchmark head-to-heads.
references
Process reward models, agent trajectory learning, Databricks agent training, SWE-bench, RLHF.
open questions
Demonstrating downstream lift on official real-repository benchmarks remains the hard gate for public proof.
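The method row above can be sketched as a scoring function over ledger events. This is a rough illustration only: KARL uses a reward model, whereas the sketch substitutes a hand-weighted linear score. The `Event` and `Trajectory` types, the event kind strings, and every weight are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    kind: str          # assumed kinds: "tool_call", "failure", "verification", "correction"
    ok: bool = True    # only meaningful for verification events in this sketch


@dataclass
class Trajectory:
    session_id: str
    events: List[Event] = field(default_factory=list)
    outcome_ok: bool = False  # whether the session's final result was accepted


def score(traj: Trajectory) -> float:
    """Combine process and outcome signals into a value in [0, 1].

    The weights are hand-picked stand-ins for a learned reward model.
    """
    failures = sum(1 for e in traj.events if e.kind == "failure")
    corrections = sum(1 for e in traj.events if e.kind == "correction")
    verifications = [e for e in traj.events if e.kind == "verification"]

    steps = max(len(traj.events), 1)
    failure_rate = failures / steps
    verified = (
        sum(1 for e in verifications if e.ok) / len(verifications)
        if verifications
        else 0.0
    )
    # Corrections are credited as recovery, capped at the number of failures.
    recovery = min(corrections, failures) / failures if failures else 1.0

    process = 0.4 * (1.0 - failure_rate) + 0.4 * verified + 0.2 * recovery
    outcome = 1.0 if traj.outcome_ok else 0.0
    return 0.5 * process + 0.5 * outcome


clean = Trajectory("a", [Event("tool_call"), Event("verification", ok=True)], outcome_ok=True)
messy = Trajectory("b", [Event("tool_call"), Event("failure"), Event("tool_call")])
```

Sessions whose scores sit well above the batch baseline would be the high-advantage candidates for training data; in this example `clean` scores higher than `messy`.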
Checkpoints and references
Proof chain
Running scored ledger
KARL trajectory store and ledger daemon
The system is operational; raw traces stay private.
Downstream training lift
reward-selected adapter tests
Early positive signals exist; broader official benchmark proof remains open.