Every Trajectory Gets a Score

What happens when every piece of agent work on your infrastructure is scored at the moment it is emitted.

Agents on my infrastructure do not just complete tasks. Every trajectory (the recorded sequence of tool calls, outcomes, and artifacts) receives a composite reward score the moment it is written. The score combines process quality, outcome, tool discipline, and domain-calibrated baselines learned from thousands of prior trajectories.
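
To make the shape of such a score concrete, here is a minimal sketch. It assumes a weighted average of three signals in [0, 1], reported alongside a z-score against a per-domain baseline. The weights, the `Signals` type, and the z-score step are illustrative assumptions, not the actual formula.

```python
from dataclasses import dataclass

# Hypothetical weights: the real combination rule is not specified here.
WEIGHTS = {"process": 0.3, "outcome": 0.4, "tool_discipline": 0.3}


@dataclass
class Signals:
    """Per-trajectory signals, each assumed to lie in [0, 1]."""
    process: float
    outcome: float
    tool_discipline: float


def composite_score(signals: Signals, baseline_mean: float, baseline_std: float) -> dict:
    """Combine the signals, then express the result relative to a domain baseline."""
    raw = (
        WEIGHTS["process"] * signals.process
        + WEIGHTS["outcome"] * signals.outcome
        + WEIGHTS["tool_discipline"] * signals.tool_discipline
    )
    # Assumption: the baseline is summarized as a mean and standard deviation
    # of raw scores from prior trajectories in the same domain.
    if baseline_std > 0:
        vs_baseline = (raw - baseline_mean) / baseline_std
    else:
        vs_baseline = 0.0
    return {"raw": raw, "vs_baseline": vs_baseline}
```

Keeping the raw value next to the baseline-relative value lets a reader tell "good in absolute terms" apart from "good for this domain".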

The reason is selection. When you fine-tune a model on your own agent's work, the question is never whether you have data. You have too much. The question is which fraction of it deserves to shape the next model. An unscored corpus answers that question with vibes; a scored corpus answers it with a threshold.

Score at emit time

Scoring trajectories in a batch afterward invites drift: the scorer and the work fall out of sync. Scoring at emit time means every record is born with its grade, and the training pipeline never waits.
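
One way to picture emit-time scoring is a single write path that refuses to persist a record without its grade. The sketch below assumes JSON Lines output; `emit_trajectory`, the `reward` field, and the `scorer` callable are hypothetical names chosen for illustration.

```python
import json
from typing import Callable


def emit_trajectory(
    record: dict,
    scorer: Callable[[dict], dict],
    write: Callable[[str], None],
) -> dict:
    """Score a trajectory record and write it out in one step.

    `write` is any function that accepts one serialized line, for example
    the `write` method of an open file or `list.append` in a test.
    """
    # The score is computed before anything is persisted, so no unscored
    # record can ever reach the corpus through this path.
    scored = {**record, "reward": scorer(record)}
    write(json.dumps(scored, sort_keys=True) + "\n")
    return scored
```

Because the scorer sits on the only write path, a change to the record schema and a change to the scorer have to ship together.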

The hard lesson came from a silent failure. For weeks, every record scored exactly 0.5 on every signal, the neutral default, because the scorer's field names had drifted from the on-disk schema. Nothing crashed. The numbers looked plausible in aggregate. Three latent bugs, each individually invisible. The fix was mechanical; the lesson was not. A reward system needs its own evaluation: distributions you inspect, baselines you can name, and alarms for suspicious uniformity.
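
An alarm for suspicious uniformity can be small. The sketch below flags any signal that is missing from every record, sits at the 0.5 neutral default almost everywhere, or has essentially zero spread. The function name, the flat record layout, and both thresholds are assumptions made for illustration.

```python
from statistics import pstdev

NEUTRAL_DEFAULT = 0.5  # the neutral default described above


def uniformity_alarms(records, signals, max_default_share=0.9, min_std=1e-6):
    """Return human-readable alarms for signals that look suspiciously uniform.

    `records` is a list of dicts mapping signal names to floats (hypothetical layout).
    """
    alarms = []
    for name in signals:
        values = [r[name] for r in records if name in r]
        if not values:
            # A signal absent from every record is a strong hint of schema drift.
            alarms.append(f"{name}: no values found")
            continue
        default_share = sum(v == NEUTRAL_DEFAULT for v in values) / len(values)
        if default_share >= max_default_share:
            alarms.append(f"{name}: {default_share:.0%} of values equal the neutral default")
        elif pstdev(values) <= min_std:
            alarms.append(f"{name}: values have no spread")
    return alarms
```

Run over a recent window of records, a check like this would have fired on the first day of the failure rather than weeks later.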

The current frontier is integrity. Recent benchmark work confirmed what I had seen locally: outcome-only grading overestimates agents, because a trajectory can claim success without evidence. A run that reports a passing test should carry the test output. A claimed file change should carry the diff. The next signal in the composite penalizes exactly that gap (outcomes without corroborating artifacts), because the most dangerous agent is not the one that fails but the one that reports unverified success.
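
A rough sketch of how such an integrity signal could be computed follows. It assumes each trajectory carries a list of claims, and that each claim kind names the artifact field that should corroborate it. The claim schema, `REQUIRED_EVIDENCE`, and `integrity_signal` are hypothetical.

```python
# Hypothetical mapping from claim kind to the artifact field that backs it up.
REQUIRED_EVIDENCE = {
    "test_passed": "test_output",
    "file_changed": "diff",
}


def integrity_signal(claims: list) -> float:
    """Return the share of checkable claims that carry a non-empty artifact.

    1.0 means every claim is corroborated; 0.0 means none are.
    A trajectory with no checkable claims scores 1.0 (nothing was asserted).
    """
    checkable = [c for c in claims if c.get("kind") in REQUIRED_EVIDENCE]
    if not checkable:
        return 1.0
    backed = 0
    for claim in checkable:
        artifact = claim.get(REQUIRED_EVIDENCE[claim["kind"]])
        # Whitespace-only artifacts count as missing evidence.
        if isinstance(artifact, str) and artifact.strip():
            backed += 1
    return backed / len(checkable)
```

This only checks that an artifact is present, not that it is truthful. Verifying that the test output actually shows a pass would be a stricter check layered on top.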