KARL: Advantage-Weighted Training from Full Agent Session Traces
Mohamed Diomande
Independent Researcher
March 2026
---
Abstract
Standard supervised fine-tuning (SFT) for language model agents operates on input-output pairs: a prompt and the response the model should produce. This format captures what an agent said but discards why it made specific decisions. We present KARL (Knowledge-Augmented Reinforcement Learning), a trajectory intelligence system that trains language model agents from full session traces rather than isolated completions. A trajectory in KARL records every tool call, file read, code edit, bash command, success signal, and failure signal across an entire work session, preserving the sequential decision structure that determines session outcomes. KARL computes a 5-signal composite reward function (outcome, process, efficiency, verification, and consistency) and applies z-score advantage weighting to identify the trajectories whose process most exceeded their domain baseline. We report results from an operational deployment across 11 domains and 290 trajectories (21,380 tool calls): two LoRA adapters trained on Gemma-3-4B-it, one on 972 random examples (loss 1.694) and one on 35 advantage-weighted examples (loss 1.843), each trained on Apple M4 hardware in roughly 3 minutes. A leave-one-out ablation study on the 5-signal reward function reveals that efficiency (tool diversity via Shannon entropy) is the most important signal (impact = 0.568), while outcome (task completion) is the least important (impact = 0.005), suggesting that how an agent works matters more than whether it succeeds. We additionally report results from a complementary geometric analysis of conversational trajectories: transition pressure variability predicts conversation convergence at 69.8%
---
1. Introduction
1.1 The SFT Bottleneck
Language model agents that operate through tool use (reading files, editing code, executing shell commands, searching codebases) face a training challenge that standard supervised fine-tuning does not address. SFT captures the mapping from prompt to completion: given this instruction, produce this text. But the quality of an agent session depends not on any single completion but on a sequence of decisions: which file to read first, whether to grep for context before editing, whether to run a test after making changes, whether to check git status before committing. These decisions form a trajectory, and it is the trajectory, not any individual step, that determines whether the session succeeds or fails.
Consider two sessions that both result in a correct code change. In Session A, the agent reads the relevant file, identifies the function to modify, makes a precise edit, and runs the test suite. Four tool calls, all successful, producing a clean diff. In Session B, the agent edits the wrong file, encounters an error, reads the correct file, makes the edit, encounters a type error, reads the type definitions, fixes the edit, and runs the test suite. Eight tool calls, three failures, a messy trajectory that happens to arrive at the same end state. Standard SFT treats both outcomes identically. KARL does not. Session A's trajectory has higher process quality, higher efficiency, and higher consistency, and its advantage score reflects this difference.
The core insight is that not all correct completions are equally instructive. A training corpus of 1000 random prompt-completion pairs includes sessions where the agent thrashed, sessions where it got lucky, and sessions where it followed a disciplined process. The signal-to-noise ratio of such a corpus is low. Advantage weighting is a filter: it identifies the sessions where the agent's process was meaningfully better than the domain baseline, and it concentrates training on those sessions.
1.2 Why Full Session Traces Capture What Matters
A full session trace in KARL is not merely a log. It is a structured record with the following components:
- Tool sequence: The ordered list of tool invocations (Read, Edit, Bash, Grep, Glob, Write, etc.)
- Tool parameters: The key arguments to each invocation (file paths, shell commands, search patterns), truncated to preserve privacy while retaining decision information
- Success signals: Per-tool success/failure flags and bash exit codes
- Timing: Session duration, inter-tool intervals, temporal position of failures
- Cross-turn annotations: Whether the user's next prompt contained a correction ("no, I meant..."), a redo request ("try again"), or a natural continuation
These components reconstruct the decision graph of the session. When the agent reads `types.py` before editing `handler.py`, that sequence encodes a process decision: understand the type signatures before modifying the code that depends on them. When the agent runs `pytest` after every edit, that pattern encodes a verification discipline. These process patterns are precisely what we want to learn from, and precisely what input-output SFT discards.
1.3 The Advantage Weighting Insight
Within any domain, trajectories are not equally instructive. KARL computes a composite reward for each trajectory, then normalizes it against a domain-specific baseline:

$$A_i = \frac{r_i - \mu_d}{\max(\sigma_d, \beta)}$$

where $A_i$ is the advantage of trajectory $i$, $r_i$ is its composite reward, $\mu_d$ is the Bayesian-smoothed baseline for domain $d$, $\sigma_d$ is the domain standard deviation, and $\beta$ is a floor constant that prevents division by near-zero variance in sparse domains.
This z-score normalization has three properties that make it suitable for trajectory training:
1. Domain independence: A reward of 0.75 in a domain where the baseline is 0.70 carries less advantage than a reward of 0.65 in a domain where the baseline is 0.50. This prevents high-baseline domains from dominating the training distribution.
2. Scale normalization: Different domains have different reward variances. Infrastructure deployments tend to cluster tightly around a moderate reward. Creative tasks have high variance. Z-score normalization ensures that a trajectory that is one standard deviation above its domain baseline contributes similarly regardless of which domain it comes from.
3. Natural curriculum: The highest-advantage trajectories are the ones where the agent's process was most different from (and better than) its usual behavior. These are the trajectories with the most to teach.
---
2. Related Work
2.1 RLHF and Direct Preference Optimization
Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022) trains a reward model from human preference comparisons, then uses PPO to optimize the language model against this reward. Direct Preference Optimization (DPO) (Rafailov et al., 2023) eliminates the separate reward model by directly optimizing the policy from preference pairs. Both approaches require human annotations: either explicit preference labels (RLHF) or ranked completion pairs (DPO).
KARL differs fundamentally in its reward source. Rather than asking humans which completion is better, KARL derives reward from observable behavior: did the tools succeed? Was the user's next prompt a correction? Did the build pass? This makes the reward signal free, continuous, and domain-specific. The cost is that KARL's reward is noisier than human preference labels, but the volume of signal (every session generates a trajectory) compensates for the noise.
2.2 Process Reward Models
Lightman et al. (2023) demonstrated that rewarding correct intermediate reasoning steps, not just final answers, significantly improves mathematical problem-solving in language models. Their Process Reward Models (PRMs) assign per-step scores throughout a chain-of-thought. Wang et al. (2024) extended this to Math-Shepherd, which automatically verifies reasoning steps without human annotation.
KARL applies the same principle to tool-use trajectories. Our process score evaluates the quality of each tool invocation, with temporal weighting that makes later-step failures more costly than early-step failures. The verification score checks whether the agent verified its work (ran tests, checked build, read files it edited). The consistency score checks whether the agent followed a coherent plan (read before write, no thrashing). Together, these scores implement a process reward model for agentic tool use.
2.3 Agent Evaluation Benchmarks
SWE-Bench (Jimenez et al., 2024) evaluates agents on real GitHub issues, measuring whether they produce correct patches. WebArena (Zhou et al., 2023) evaluates web navigation agents on task completion in realistic web environments. AgentBench (Liu et al., 2023) provides a comprehensive evaluation across operating system, database, and web environments.
These benchmarks share a common limitation for training purposes: they evaluate on fixed, curated task sets that may not reflect the distribution of tasks an agent encounters in deployment. KARL complements benchmarks by learning from the actual task distribution. An agent deployed primarily for iOS development encounters different patterns than one deployed for infrastructure management, and KARL's domain-specific baselines capture this distributional difference.
2.4 Offline Reinforcement Learning for Language Models
Offline RL methods train policies from fixed datasets without online interaction. ILQL (Snell et al., 2023) applies implicit language Q-learning to text generation. Gulcehre et al. (2023) use offline RL to improve instruction following. OAPL (Preference-Aligned Policy Learning) treats SFT as a special case of offline RL where advantage weighting replaces policy gradient updates.
KARL's OAPL-Lite approach is a simplified variant of advantage-weighted training. Rather than learning a value function and computing per-token advantages, KARL computes trajectory-level advantages from the 5-signal reward and uses them for oversampling during SFT. This trades theoretical optimality for practical simplicity: the advantage computation requires no neural network, only arithmetic over the trajectory store.
2.5 Knowledge Graph-Derived Rewards
The Princeton DSS group (Belova et al., 2026) proposed knowledge graph-grounded reward signals for domain-specific superintelligence training. Their framework assigns three signals to knowledge graph paths: terminal grounding (does the path end at a verified fact?), chain continuity (is the reasoning chain unbroken?), and axiomatic validity (does each step follow from established rules?). Their GraphMERT model (80M parameters) outperformed 32B parameter models on factuality benchmarks by leveraging these structural rewards during training.
KARL's reward function operates on a different substrate (tool-use traces rather than knowledge graph paths) but shares the same structural insight. KARL's outcome score maps to terminal grounding (did the session reach a verified end state?). KARL's consistency score maps to chain continuity (was the tool sequence logically coherent?). KARL's verification score maps to axiomatic validity (did the agent verify its conclusions?). Section 6 develops this correspondence in detail.
---
3. KARL Architecture
3.1 Trajectory Extraction
KARL instruments an agent's hook system at four tap points to capture complete session trajectories without interfering with the agent's operation.
Tap A (Session Initialization). Fires on the `UserPromptSubmit` event. Creates a JSON buffer file in `karl/buffers/` containing the session ID, prompt text (truncated to 500 characters), working directory, git repository context, and any skill label from the routing system. Each buffer file is named by a sanitized session ID, ensuring isolation between concurrent sessions.
Tap B (Tool Event Capture). Fires on the `PostToolUse` event. Appends a compact event record to the session buffer: the normalized tool name, key parameters (file paths, commands, and search patterns truncated to 200 characters for privacy), a boolean success flag, the bash exit code (for shell commands), and a timestamp. If no buffer exists for the current session (the session was not initiated through the skill system), Tap B auto-creates a minimal buffer so that trajectories are never silently lost.
Tap C (Session Flush). Fires on the `Stop` event (session completion). Reads the buffer, computes summary statistics (tool counts, success/failure rates, bash error count, session duration), infers the skill and domain from file paths and tool patterns if not explicitly labeled, runs the reward engine, and appends the complete trajectory record to `trajectories.jsonl` using file-level locking (`fcntl.LOCK_EX`) to prevent corruption from concurrent writes. The buffer file is deleted after successful flush.
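The locked append in Tap C can be sketched as follows. This is a minimal illustration, assuming the lock is taken with `fcntl.flock` on the store's file descriptor; the function name `append_trajectory` and the demo path are hypothetical, and the real flush step also computes statistics and rewards before writing.

```python
import fcntl
import json
import os
import tempfile


def append_trajectory(store_path: str, record: dict) -> None:
    """Append one record as a single JSON line while holding an exclusive lock."""
    line = json.dumps(record, separators=(",", ":")) + "\n"
    with open(store_path, "a", encoding="utf-8") as f:
        # Blocks until any concurrent writer has released its lock.
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)
        try:
            f.write(line)
            f.flush()
            os.fsync(f.fileno())
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)


# Usage: two appends to a throwaway store, then read them back.
demo_path = os.path.join(tempfile.mkdtemp(), "trajectories.jsonl")
append_trajectory(demo_path, {"id": "traj_demo_1", "total_tools": 4})
append_trajectory(demo_path, {"id": "traj_demo_2", "total_tools": 2})
with open(demo_path, encoding="utf-8") as f:
    stored = [json.loads(row) for row in f]
```

Writing each record as one complete line under the lock keeps the JSONL file parseable even when several sessions flush at the same moment. `fcntl` is available on Unix-like systems only.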
Tap D (Cross-Turn Annotation). Fires on the next `UserPromptSubmit` following a completed session. Examines the new prompt for correction signals: phrases such as "no, I meant," "try again," "that's wrong," or explicit redo requests. If a correction is detected, Tap D walks the trajectory store backward to find the previous record for this session and annotates it with `correction_detected: true`. This retroactive annotation converts implicit user dissatisfaction into an explicit negative outcome signal, and it requires no human labeling effort.
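A minimal sketch of the Tap D logic follows, assuming simple case-insensitive regular expressions over the phrases quoted above. The pattern list, the function names, and the in-memory `records` list are illustrative assumptions; the deployed hook operates on the JSONL store and may use a richer phrase set.

```python
import re

# Phrases taken from the description above; the exact production list is not given.
CORRECTION_PATTERNS = [
    r"\bno,? i meant\b",
    r"\btry again\b",
    r"\bthat'?s wrong\b",
    r"\bredo\b",
]
_CORRECTION_RE = re.compile("|".join(CORRECTION_PATTERNS), re.IGNORECASE)


def detect_correction(prompt: str) -> bool:
    """Return True if the prompt contains one of the correction phrases."""
    return _CORRECTION_RE.search(prompt) is not None


def annotate_previous(records: list, session_id: str, prompt: str) -> bool:
    """Walk the store backward and flag the latest record of this session."""
    if not detect_correction(prompt):
        return False
    for rec in reversed(records):
        if rec.get("session_id") == session_id:
            rec.setdefault("outcome", {})["correction_detected"] = True
            return True
    return False
```

Phrase matching of this kind is cheap enough for a per-event latency budget, at the cost of missing corrections that are phrased differently.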
The four-tap pipeline operates within a strict latency budget. Each tap is implemented as a Python function called from the hook dispatcher, and the total per-event overhead is kept under 10ms. No tap blocks the agent's response generation.
Source Format. The primary source of historical trajectory data is the prompt-logger hook system, which writes structured session records to `verbose-all.jsonl`. The trajectory extractor (`trajectory_extractor.py`) backfills from this source, normalizing tool names from different agent frameworks (e.g., Codex's `exec_command` becomes `Bash`, `apply_patch` becomes `Edit`) and deduplicating against existing trajectory records.
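The normalization and deduplication step can be illustrated with a short sketch. Only the two Codex aliases named above are grounded in the text; the function names and the choice of `session_id` as the deduplication key are assumptions made for illustration.

```python
# Aliases named in the text; other frameworks would add further entries.
TOOL_ALIASES = {"exec_command": "Bash", "apply_patch": "Edit"}


def normalize_tool(name: str) -> str:
    """Map a framework-specific tool name to its canonical form."""
    return TOOL_ALIASES.get(name, name)


def backfill(raw_sessions: list, existing_session_ids: set) -> list:
    """Normalize tool names and skip sessions already in the trajectory store."""
    seen = set(existing_session_ids)
    new_records = []
    for raw in raw_sessions:
        sid = raw["session_id"]
        if sid in seen:  # assumed dedup key: session_id
            continue
        seen.add(sid)
        new_records.append({
            "session_id": sid,
            "channel": "backfill",
            "tool_sequence": [normalize_tool(t) for t in raw["tool_sequence"]],
        })
    return new_records
```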
3.2 Trajectory Record Format
Each trajectory record is a JSON object with the following top-level fields:
```
{
  "id": "traj_{session_prefix}_{timestamp}",
  "session_id": "uuid",
  "channel": "live" | "backfill",
  "recorded_at": "ISO-8601",
  "skill": {
    "name": "ops:deploy" | null,
    "domain": "infra" | "ios" | "web" | ... | null
  },
  "context": {
    "prompt_text": "truncated prompt (500 chars)",
    "cwd": "/path/to/working/directory",
    "git_repo": "repository-name"
  },
  "trajectory": {
    "tool_sequence": ["Read", "Read", "Edit", "Bash"],
    "tool_counts": {"Read": 2, "Edit": 1, "Bash": 1},
    "total_tools": 4,
    "successes": 4,
    "failures": 0,
    "bash_errors": 0,
    "events": [
      {
        "tool_name": "Read",
        "key_params": {"file_path": "src/handler.py"},
        "success": true,
        "exit_code": null,
        "ts": "ISO-8601"
      },
      ...
    ]
  },
  "outcome": {
    "annotation_status": "scored",
    "correction_detected": false,
    "build_success": true,
    "redo_detected": false,
    "session_continued": true,
    "reward_score": 0.7825,
    "advantage": 0.2314,
    "outcome_score": 0.85,
    "process_score": 0.78,
    "efficiency_score": 0.72,
    "verification_score": 0.70,
    "consistency_score": 0.80,
    "reward_components": { ... }
  },
  "timing": {
    "started_at": "ISO-8601",
    "ended_at": "ISO-8601",
    "duration_s": 65.0
  }
}
```

The `events` array caps at 50 entries per session to bound storage costs while preserving the decision sequence. Tool parameters are truncated to prevent sensitive data (API keys, credentials, large code blocks) from entering the trajectory store. The `channel` field distinguishes live-captured trajectories from backfilled historical data, enabling analysis of data source effects on reward distributions.
3.3 Skill and Domain Inference
When a trajectory is not explicitly labeled with a skill (most sessions initiated outside the skill routing system), KARL infers the skill and domain from available signals. The `trajectory_tap.py` module maintains a registry of 40+ pattern rules that match against file paths, working directories, and tool parameters:
```python
# Each rule is (path regex, skill name, domain).
SKILL_PATTERNS = [
    (r"Desktop/Spore/", "spore", "ios"),
    (r"Desktop/karl", "karl-trajectory", "infra"),
    (r"flows/feed-hub/", "feed-hub-flow", "automation"),
    (r"\.claude/hooks/", "hook-maintenance", "infra"),
    (r"monitoring/nexus-portal/", "nexus-portal", "web"),
    (r"projects/evolution_world/", "evolution-world", "systems"),
    ...
]
```

Each pattern is scored by frequency of occurrence across the session's file paths and tool parameters. The highest-scoring pattern determines the skill and domain label. This heuristic inference correctly labels approximately 59.5%
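The frequency-based scoring described here can be sketched as below. The three rules are a subset copied from the registry above; the function name `infer_skill`, the per-string match counting, and the `(None, None)` fallback for sessions that match no rule are assumptions for illustration.

```python
import re
from collections import Counter

# Subset of the registry shown above, for illustration only.
SKILL_PATTERNS = [
    (r"Desktop/Spore/", "spore", "ios"),
    (r"\.claude/hooks/", "hook-maintenance", "infra"),
    (r"monitoring/nexus-portal/", "nexus-portal", "web"),
]


def infer_skill(strings):
    """Return the (skill, domain) whose pattern matches the most strings.

    `strings` holds the session's file paths, working directory, and
    tool parameters. Returns (None, None) when no pattern matches.
    """
    hits = Counter()
    for pattern, skill, domain in SKILL_PATTERNS:
        rx = re.compile(pattern)
        n = sum(1 for s in strings if rx.search(s))
        if n:
            hits[(skill, domain)] = n
    if not hits:
        return None, None
    (skill, domain), _count = hits.most_common(1)[0]
    return skill, domain
```

A session that touches two Spore files and one hook file would be labeled `spore` / `ios` under this scheme, since the Spore pattern matches more often.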
The 11 domains in our deployment are: `ios`, `infra`, `web`, `automation`, `creative`, `systems`, `ml`, `knowledge`, `data`, `desktop`, and `_global`. Domain boundaries matter for advantage computation because reward distributions differ substantially across domains: infrastructure deployments cluster around moderate rewards (mean 0.60), while creative tasks exhibit higher variance (std 0.12 vs 0.06 for infra).
3.4 The 5-Signal Reward Function
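KARL's reward engine computes a composite score $r \in [0, 1]$ from five weighted signals:

$$r = 0.30\, s_{\text{outcome}} + 0.25\, s_{\text{process}} + 0.15\, s_{\text{efficiency}} + 0.15\, s_{\text{verification}} + 0.15\, s_{\text{consistency}}$$

Each signal $s$ is independently normalized to $[0, 1]$.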
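As a concrete illustration, the composite reward is a weighted sum using the weights given in the subsection headings below (0.30, 0.25, 0.15, 0.15, 0.15). The function and dictionary names here are hypothetical, and the sub-scores in the usage example are the illustrative values from the record format in Section 3.2.

```python
# Weights as stated in Sections 3.4.1 through 3.4.5; they sum to 1.0.
WEIGHTS = {
    "outcome": 0.30,
    "process": 0.25,
    "efficiency": 0.15,
    "verification": 0.15,
    "consistency": 0.15,
}


def composite_reward(signals: dict) -> float:
    """Weighted sum of the five signals, each expected to lie in [0, 1]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = signals[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        total += weight * value
    return total


example_signals = {
    "outcome": 0.85,
    "process": 0.78,
    "efficiency": 0.72,
    "verification": 0.70,
    "consistency": 0.80,
}
example_reward = composite_reward(example_signals)  # approximately 0.783
```

Because the weights sum to 1 and every signal is bounded in $[0, 1]$, the composite stays in $[0, 1]$ without any further clipping.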
3.4.1 Outcome Score (Weight: 0.30)
The outcome score captures cross-turn signals about user satisfaction. It is initialized at a neutral baseline of 0.5 (representing absence of information) and adjusted by the presence or absence of four signals:
| Signal | Value = true | Value = false | Signal absent |
|---|---|---|---|
| `correction_detected` | -0.35 | +0.35 | 0 |
| `redo_detected` | -0.25 | +0.25 | 0 |
| `build_success` | +0.20 | -0.10 | 0 |
| `session_continued` | +0.20 | 0 | 0 |
The asymmetry between correction (-0.35) and no-correction (+0.35) reflects a design choice: corrections are strong evidence of failure, while absence of correction is moderate evidence of success (the user might simply have stopped working). Similarly, `build_success: false` is penalized less than `correction_detected: true` because a failed build is often expected during iterative development.
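The additive scheme can be sketched as follows, taking the adjustment values from the table at face value. The clamp to $[0, 1]$ is an assumption (the text states only that each signal is normalized to that range), and the names `OUTCOME_ADJUSTMENTS` and `outcome_score` are hypothetical.

```python
# Adjustments copied from the table: signal -> {observed value: delta}.
OUTCOME_ADJUSTMENTS = {
    "correction_detected": {True: -0.35, False: +0.35},
    "redo_detected": {True: -0.25, False: +0.25},
    "build_success": {True: +0.20, False: -0.10},
    "session_continued": {True: +0.20, False: 0.0},
}

NEUTRAL_BASELINE = 0.5


def outcome_score(outcome: dict) -> float:
    """Start at the neutral baseline and add one adjustment per recorded signal.

    Signals that are missing or None contribute nothing.
    """
    score = NEUTRAL_BASELINE
    for signal, deltas in OUTCOME_ADJUSTMENTS.items():
        value = outcome.get(signal)
        if value is None:
            continue
        score += deltas[bool(value)]
    # Assumed: clamp so the signal stays in [0, 1].
    return min(1.0, max(0.0, score))
```

A session with no cross-turn information scores exactly 0.5, and a single detected correction lowers it to 0.15.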
3.4.2 Process Score (Weight: 0.25)
The process score evaluates tool-use quality within the session through four sub-components:
**Temporally-weighted success rate (40%)**
where $|E|$ is the number of events. The temporally-weighted success rate is:
The rationale is that early failures during exploration are expected and cheap, while late failures during delivery are costly and indicate process breakdown.
**Bash cleanliness (25%)**
**Error density (20%)**
**Late-stage failure penalty (15%)**
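The process score is the weighted sum of the four sub-components:

$$s_{\text{process}} = 0.40\, s_{\text{temporal}} + 0.25\, s_{\text{bash}} + 0.20\, s_{\text{density}} + 0.15\, s_{\text{late}}$$

where $s_{\text{temporal}}$, $s_{\text{bash}}$, $s_{\text{density}}$, and $s_{\text{late}}$ denote the temporally-weighted success rate, bash cleanliness, error density, and late-stage failure sub-scores, each oriented so that higher is better.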
3.4.3 Efficiency Score (Weight: 0.15)
The efficiency score evaluates the shape of the tool-use distribution:
**Tool diversity (35%):** the Shannon entropy of the tool-type distribution,

$$H = -\sum_t p_t \log p_t$$

where $p_t$ is the proportion of tool calls of type $t$. A monoculture penalty of 0.3 is applied when only one tool type is used (e.g., 50 consecutive `Bash` calls). The insight is that diverse tool use (read, then edit, then test) indicates systematic work, while monoculture indicates scripting or thrashing.
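A minimal sketch of an entropy-based diversity measure follows. Dividing by the log of the number of distinct tools, so that the value falls in $[0, 1]$, is an assumption; the text does not state which normalization KARL uses, nor exactly how the 0.3 monoculture penalty enters the efficiency score, so the sketch only flags monoculture rather than applying the penalty.

```python
import math
from collections import Counter


def tool_diversity(tool_sequence) -> float:
    """Shannon entropy of tool-type proportions, normalized to [0, 1].

    Returns 0.0 for empty sequences and for sequences using a single tool.
    """
    counts = Counter(tool_sequence)
    if len(counts) <= 1:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))  # assumed normalization


def is_monoculture(tool_sequence) -> bool:
    """True when every call in a non-empty sequence uses the same tool."""
    return len(set(tool_sequence)) == 1
```

For the sequence `["Read", "Read", "Edit", "Bash"]` from the record example, the proportions are 0.5, 0.25, and 0.25, which gives a normalized diversity of about 0.946.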
**Duration efficiency (35%)**
**File touch rate (30%)**
3.4.4 Verification Score (Weight: 0.15)
The verification score measures whether the agent verified its work after making changes. It only applies to sessions containing mutation events (Write, Edit):
**Test execution (40%)**
**Build verification (30%)**
**Read-after-write (30%)**
Sessions without mutations receive a neutral score of 0.6, reflecting that read-only sessions have no verification obligation.
3.4.5 Consistency Score (Weight: 0.15)
The consistency score evaluates internal coherence of the trajectory:
**Read-before-write (60%)**
**No thrashing (40%)**
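The two consistency checks can be illustrated with a sketch over the `events` array from Section 3.2. The exact scoring formulas are not given in the text, so this shows only the underlying measurements: the fraction of mutations preceded by a read of the same file, and the set of files edited three or more times (the thrashing threshold mentioned in Section 5.3). The function names and the `None` return for read-only sessions are assumptions.

```python
from collections import Counter

MUTATION_TOOLS = {"Edit", "Write"}


def read_before_write_rate(events):
    """Fraction of Edit/Write events whose file was Read earlier in the session.

    Returns None when the session contains no mutations.
    """
    files_read = set()
    mutations = 0
    covered = 0
    for ev in events:
        path = ev.get("key_params", {}).get("file_path")
        if not path:
            continue
        if ev["tool_name"] == "Read":
            files_read.add(path)
        elif ev["tool_name"] in MUTATION_TOOLS:
            mutations += 1
            if path in files_read:
                covered += 1
    return covered / mutations if mutations else None


def thrashed_files(events, threshold=3):
    """Files mutated at least `threshold` times within one session."""
    edits = Counter(
        ev["key_params"]["file_path"]
        for ev in events
        if ev["tool_name"] in MUTATION_TOOLS and ev.get("key_params", {}).get("file_path")
    )
    return sorted(path for path, n in edits.items() if n >= threshold)
```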
3.5 Z-Score Advantage Computation
After computing the composite reward $r_i$ for each trajectory $i$, KARL normalizes it against a domain-specific baseline to produce an advantage score:

$$A_i = \frac{r_i - \mu_d}{\max(\sigma_d, \beta)}$$

Domain baselines $\mu_d$ are Bayesian-smoothed toward the global mean to handle sparse domains:

$$\mu_d = \frac{n_d\, \bar{r}_d + \kappa\, \mu_{\text{global}}}{n_d + \kappa}$$

where $n_d$ is the number of trajectories in domain $d$, $\bar{r}_d$ is the raw domain mean, $\mu_{\text{global}}$ is the global reward mean, and $\kappa = 10$ is the smoothing strength. A domain with 100 trajectories receives approximately 91% of its baseline from its own mean ($100/110 \approx 0.91$).
Domain standard deviations $\sigma_d$ are computed directly for domains with 5+ trajectories. For sparser domains, the global standard deviation is used as a fallback, with a floor of 0.02 to prevent division by near-zero values.
This Bayesian smoothing prevents two failure modes:
1. New domain overfit: A new domain with one high-reward trajectory would, without smoothing, have a baseline of that single reward, making all future trajectories in that domain appear to have zero advantage. Smoothing pulls the baseline toward the global mean, allowing future trajectories to register advantages.
2. Sparse domain wild z-scores: A domain with two trajectories and a standard deviation of 0.001 would produce z-scores in the hundreds. The $\beta$ floor and global-std fallback prevent this.
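The baseline smoothing, standard-deviation fallback, and floor described in this section can be combined into one short sketch. The constants come from the text ($\kappa = 10$, floor 0.02, fallback below 5 trajectories); the use of population standard deviation, the function names, and the `(domain, reward)` input shape are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

KAPPA = 10.0        # smoothing strength from the text
BETA = 0.02         # floor on the standard deviation from the text
MIN_DOMAIN_N = 5    # domains below this size fall back to the global std


def smoothed_baseline(n_d: int, domain_mean: float, global_mean: float) -> float:
    """Shrink the raw domain mean toward the global mean."""
    return (n_d * domain_mean + KAPPA * global_mean) / (n_d + KAPPA)


def compute_advantages(records):
    """records: list of (domain, reward) pairs. Returns one advantage per record."""
    rewards = [r for _, r in records]
    mu_global = mean(rewards)
    sigma_global = pstdev(rewards)  # assumed: population std

    by_domain = defaultdict(list)
    for domain, reward in records:
        by_domain[domain].append(reward)

    stats = {}
    for domain, rs in by_domain.items():
        mu_d = smoothed_baseline(len(rs), mean(rs), mu_global)
        sigma_d = pstdev(rs) if len(rs) >= MIN_DOMAIN_N else sigma_global
        stats[domain] = (mu_d, max(sigma_d, BETA))

    return [(r - stats[d][0]) / stats[d][1] for d, r in records]
```

With these constants a domain of 100 trajectories keeps $100/110$ of its own mean in the baseline, while a domain with a single trajectory keeps only $1/11$, which is what lets later trajectories in a new domain register a nonzero advantage.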
3.6 FlowRL Sampling
KARL implements four sampling strategies for training data selection, following principles from FlowRL (distribution-proportional reinforcement learning):
Uniform. Plain random sampling, used as a baseline.
Balanced. Equal samples per domain, preventing overrepresented domains from dominating. If a domain has fewer trajectories than the per-domain quota, it is oversampled with replacement.
Advantage-weighted. Softmax-temperature sampling where the probability of including trajectory $i$ is:

$$p_i = \frac{\exp(A_i / \tau)}{\sum_j \exp(A_j / \tau)}$$

where $A_i$ is the advantage of trajectory $i$, with temperature $\tau = 2.0$ and a probability floor of 0.01 to prevent complete exclusion of any trajectory. High-advantage trajectories are exponentially more likely to be selected.
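A minimal sketch of softmax-temperature sampling is given below, using $\tau = 2.0$ and the 0.01 floor from the text. How the floor interacts with normalization is not specified, so applying the floor and then renormalizing is one assumed choice among several; the function names are hypothetical.

```python
import math
import random


def advantage_sampling_probs(advantages, tau=2.0, floor=0.01):
    """Softmax over advantages / tau, floored and renormalized to sum to 1."""
    top = max(advantages)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp((a - top) / tau) for a in advantages]
    z = sum(exps)
    floored = [max(e / z, floor) for e in exps]  # assumed floor handling
    total = sum(floored)
    return [p / total for p in floored]


def sample_trajectories(items, advantages, k, seed=0):
    """Draw k items with replacement according to the softmax probabilities."""
    probs = advantage_sampling_probs(advantages)
    rng = random.Random(seed)
    return rng.choices(items, weights=probs, k=k)
```

A higher temperature flattens the distribution toward uniform sampling, while a lower one approaches top-k selection.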
Top-k. Selects the $k$ highest-reward trajectories, used for focused training on the best available examples.
In practice, the SFT exporter uses a simpler advantage-weighted oversampling scheme: trajectories with advantage in $(0, 0.1]$ appear once, advantage in $(0.1, 0.3]$ appear twice, and advantage above $0.3$ appear three times (the maximum). Negative-advantage trajectories are excluded from the training set.
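The tiered oversampling rule translates directly into a small function. The thresholds are those stated above; the function names and the `(example, advantage)` input shape are illustrative.

```python
def repeat_count(advantage: float) -> int:
    """Number of copies of a trajectory in the training set."""
    if advantage <= 0:
        return 0  # at-or-below-baseline trajectories are excluded
    if advantage <= 0.1:
        return 1
    if advantage <= 0.3:
        return 2
    return 3  # maximum


def oversample(examples):
    """examples: list of (example, advantage) pairs. Returns the expanded list."""
    expanded = []
    for example, advantage in examples:
        expanded.extend([example] * repeat_count(advantage))
    return expanded
```

Compared with softmax sampling, this scheme is deterministic and caps the influence of any single trajectory at three copies.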
3.7 SFT Export
The SFT exporter converts trajectory records into ChatML format for MLX LoRA fine-tuning:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are an expert software engineering assistant..."
    },
    {
      "role": "user",
      "content": "[truncated prompt from trajectory context]"
    },
    {
      "role": "assistant",
      "content": "1. [ok] Read ../src/handler.py\n2. [ok] Edit ../src/handler.py\n3. [ok] Bash: pytest tests/\n\nResult: 3/3 tools succeeded, reward=0.78"
    }
  ]
}
```

The assistant completion encodes the trajectory as a numbered tool plan with per-step success indicators. This teaches the model to predict effective tool sequences, not just tool names, grounding its planning in the outcome information from real sessions.
The exporter applies several filters before output:
- Minimum 2 tool events per trajectory (trivial sessions excluded)
- Positive advantage required (below-baseline trajectories excluded)
- Content deduplication via SHA-256 hash of prompt + plan
- Prompt truncation to 4000 characters
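The four filters above can be sketched as a single pass over candidate examples. The thresholds and the SHA-256 hash of prompt plus plan come from the list; the candidate dictionary keys (`prompt`, `plan`, `advantage`, `n_events`), the newline separator inside the hash input, and the choice to hash the truncated prompt are assumptions for illustration.

```python
import hashlib

MIN_TOOL_EVENTS = 2
MAX_PROMPT_CHARS = 4000


def filter_examples(candidates):
    """Apply the exporter filters and return the surviving examples in order."""
    seen_hashes = set()
    kept = []
    for cand in candidates:
        if cand["n_events"] < MIN_TOOL_EVENTS:
            continue  # trivial session
        if cand["advantage"] <= 0:
            continue  # at or below the domain baseline
        prompt = cand["prompt"][:MAX_PROMPT_CHARS]
        digest = hashlib.sha256(
            (prompt + "\n" + cand["plan"]).encode("utf-8")
        ).hexdigest()
        if digest in seen_hashes:
            continue  # duplicate prompt + plan
        seen_hashes.add(digest)
        kept.append({"prompt": prompt, "plan": cand["plan"], "advantage": cand["advantage"]})
    return kept
```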
Synthetic QA data from git diff analysis can be merged into the training set through a separate pipeline (`synthetic_qa.py`), providing additional supervised signal for code understanding tasks.
---
4. Training Pipeline
4.1 Data
The training pipeline begins with the trajectory store (`trajectories.jsonl`), an append-only JSONL file that grows as the agent operates. In our deployment, this store contains 290 trajectories (21,380 tool calls) across 11 domains, with skill labels from the routing system and path-pattern inference.
The SFT exporter processes this store into a training-ready format. In our v2 training run, 35 trajectories survived the advantage filter (positive advantage, 2+ tool events, deduplicated), producing 35 unique training examples with oversampling expanding this to approximately 60 effective examples. These are split 90/10 into `train.jsonl` (approximately 54 examples) and `valid.jsonl` (approximately 6 examples).
For comparison, the v1 training run used 972 randomly-sampled examples without advantage weighting, drawn from a broader extraction of the prompt-logger archive.
4.2 Base Model
Both training runs use Gemma-3-4B-it (Google, 2025), a 4-billion parameter instruction-tuned model from the Gemma 3 family, run through MLX LoRA on Apple Silicon.
- v1: Gemma-3-4B-it, trained on 972 randomly sampled examples
- v2: Gemma-3-4B-it, retrained on 35 advantage-weighted examples
The choice of a 4B parameter model is deliberate. KARL's hypothesis is that trajectory intelligence, the ability to predict effective tool sequences, is a relatively low-dimensional skill that does not require the full capacity of a 70B+ model. The 4B model enables training on consumer hardware (Apple M4 Mac with 16GB RAM) without cloud GPU allocation, completing 500 iterations in roughly 3 minutes (188.4s for v1; see Section 4.4).
4.3 Training Configuration
Training is orchestrated by `karl_trainer.py`, which coordinates four steps:
1. Export: Run `sft_exporter.py` to generate `train.jsonl` and `valid.jsonl`
2. Upload: SCP training files to Mac5 (M4 16GB, the training compute node)
3. Trigger: Invoke MLX LoRA training via the finetune daemon or direct `python3 -m mlx_lm lora`
4. Monitor: Poll the daemon's `:9200/status` endpoint for loss and iteration progress
MLX LoRA training parameters:
| Parameter | Value |
|---|---|
| LoRA rank | 8-16 |
| LoRA alpha | 16-32 |
| Iterations | 500-1000 |
| Batch size | 1 |
| Learning rate | 1e-5 |
| Number of adapter layers | 4-16 |
| Max sequence length | 256 |
Training completes in approximately 3 minutes for 500 iterations on the M4 Mac.
4.4 Results
| Metric | v1 (Random SFT) | v2 (Advantage-Weighted) |
|---|---|---|
| Training examples | 972 | 35 (effective ~60 with oversampling) |
| Training loss | 1.694 | 1.843 |
| Training time | 188.4s | ~180s |
| Base model | Gemma-3-4B-it | Gemma-3-4B-it |
| Iterations | 500 | 500 |
| Hardware | Apple M4 (Mac5, 16GB) | Apple M4 (Mac5, 16GB) |
The higher loss for v2 is not surprising: the advantage-weighted set is far smaller (35 vs 972 examples) and is concentrated, by construction, on sessions where agent process matters most. Because the two losses are measured on different training sets, they are not directly comparable. The critical comparison is not training loss but downstream behavior, specifically whether the trained model predicts tool sequences that more closely match high-quality trajectories.
Honesty note. We report training loss only. We have not yet run controlled downstream evaluations (A/B task completion, holdout reward prediction, cross-domain transfer). The two adapters exist and are servable via MLX Server (:8100), but the comparison between them is limited to training loss. Section 7 distinguishes clearly between what has been measured and what remains proposed.
---
5. The Advantage Weighting Insight
5.1 Not All Decisions Are Equal: The "Read Before Edit" Pattern
Consider the consistency score's read-before-write component. In our trajectory data, sessions where the agent reads a file before editing it have a mean reward of 0.72, while sessions where the agent edits without reading first have a mean reward of 0.58. This 14-point gap is the single largest predictor of session quality in our dataset.
The advantage weighting mechanism amplifies this signal. A trajectory that demonstrates read-before-write in a domain where the baseline agent skips this step will have a high advantage score, and it will appear 2-3 times in the training data. A trajectory that skips reading in a domain where reading is common will have a negative advantage and be excluded entirely.
This creates a virtuous curriculum: the model is trained disproportionately on trajectories that demonstrate the process patterns most associated with success, and the domain-specific normalization ensures that these patterns are learned relative to what is typical for each task type.
5.2 Why 35 High-Advantage Examples May Match 1000 Random
The observation that 35 advantage-weighted examples achieve comparable training loss (1.843 vs 1.694) to 972 random examples, despite being 28x smaller, warrants explanation. Training loss alone does not prove downstream equivalence (see Section 10.3), but the loss proximity is suggestive. The explanation lies in the distribution of useful signal in the random sample.
Of 972 random training examples:
- Approximately 30%
- Approximately 25%
- Approximately 15%
- Approximately 30%
The 35 advantage-weighted examples are drawn entirely from this last category. They are, in effect, the distilled signal from a much larger pool. The random sample dilutes this signal with noise and anti-signal, while advantage weighting concentrates it.
This parallels findings in data pruning literature (Sorscher et al., 2022) where training on a carefully selected subset of data can match or exceed training on the full dataset, because the selection process removes examples that interfere with learning.
5.3 What High-Advantage Trajectories Look Like
We characterize the patterns that distinguish high-advantage trajectories (advantage > 0.3) from low-advantage ones (advantage < 0):
High-advantage patterns:
- Tool sequence begins with Read or Grep (research phase)
- Mutation tools (Edit, Write) appear in the middle, bracketed by research and verification
- Session ends with Bash (test execution or build verification)
- File touch rate between 0.2 and 0.5 (thoughtful mutation, not mass writes)
- Tool diversity above 0.6 (multiple tool types used)
Low-advantage patterns:
- Session opens with Edit or Write (no research phase)
- Consecutive bash errors (3+ in a row)
- Same file edited 3+ times (thrashing)
- No test or build execution after mutations
- Tool monoculture (90
These patterns are remarkably consistent across domains. The read-research-mutate-verify sequence is a high-advantage pattern whether the domain is iOS app development, infrastructure deployment, or web frontend work.
5.4 Domain-Specific Baselines Prevent Cross-Domain Contamination
Without domain-specific baselines, cross-domain training has a systematic bias problem. Consider two domains:
- iOS development: Mean reward 0.65, standard deviation 0.08. A "good" session has reward 0.75.
- Infrastructure: Mean reward 0.58, standard deviation 0.06. A "good" session has reward 0.68.
With a global baseline of 0.62, a mediocre iOS session (reward 0.66, slightly above domain mean) would appear to have positive advantage (+0.04), while an excellent infrastructure session (reward 0.68, well above domain mean) would appear to have only slightly higher advantage (+0.06). The iOS session would be overrepresented relative to its quality.
Domain-specific baselines correct this. The iOS session has advantage $\frac{0.66 - 0.65}{0.08} = 0.125$ (modest), while the infrastructure session has advantage $\frac{0.68 - 0.58}{0.06} = 1.667$ (exceptional). The infrastructure session is correctly identified as more instructive.
---
6. Connection to Knowledge Graph Rewards
6.1 KARL's 5 Signals vs. Princeton's 3 Signals
Belova et al. (2026) proposed three knowledge graph-derived reward signals for training domain-specific models. These signals evaluate paths through a knowledge graph: sequences of entities connected by relations that represent a chain of reasoning. KARL's reward function evaluates a different kind of path, a sequence of tool invocations that represent a chain of action, but the structural parallels are deep.
| Princeton DSS Signal | KARL Signal | Substrate |
|---|---|---|
| Terminal Grounding | Outcome Score | KG endpoint vs. session outcome |
| Chain Continuity | Consistency Score | Graph path coherence vs. tool sequence coherence |
| Axiomatic Validity | Verification Score | Logical axiom compliance vs. test/build verification |
| (none) | Process Score | KARL-specific: tool success quality |
| (none) | Efficiency Score | KARL-specific: resource utilization |
6.2 Detailed Mapping
Terminal Grounding and Outcome Score. Princeton's terminal grounding asks: does this knowledge graph path end at a verified fact (a grounded entity with empirical support)? KARL's outcome score asks: did this tool-use path end at a verified outcome (no user correction, build passed, session continued)? Both measure whether the chain of decisions/inferences reached a validated conclusion.
Chain Continuity and Consistency Score. Princeton's chain continuity asks: is each step in the KG path connected to the previous step by a valid relation? KARL's consistency score asks: is each tool invocation logically connected to the previous one? Reading a file before editing it is the tool-use analogue of traversing a `depends_on` edge before modifying a node. Editing a file without reading it is the tool-use analogue of a broken graph path.
Axiomatic Validity and Verification Score. Princeton's axiomatic validity asks: does each inference step in the KG path respect the domain's axioms (e.g., drug interactions must be verified against known databases)? KARL's verification score asks: did the agent verify its code changes against the project's axioms (test suite, build system)? Running `pytest` after editing `handler.py` is the software engineering axiom check.
6.3 Connection to Anticipation Geometry
Independently from KARL's reward-based trajectory evaluation, we developed a geometric framework for analyzing conversational trajectories using three scalar fields (commitment $c_t$, transition pressure $\text{tp}_t$, and recovery margin $\rho_t$) derived from token-level entropy dynamics. This framework, Anticipation Geometry, was evaluated on 20,000 conversational turns from 164 conversations drawn from the same Supabase corpus that feeds KARL's trajectory store. The key empirical results:
- Transition pressure variability ($\text{tp\_std}$, the standard deviation of transition pressure across a conversation) is the strongest single predictor of conversation convergence, achieving 69.8% accuracy.
- Positive transition pressure ratio ($\text{tp\_positive\_ratio}$, the fraction of turns where transition pressure is positive) predicts convergence at 64.5% accuracy.
These results matter because they indicate that the geometric shape of a conversation's evolution, independent of its content or tool-use patterns, carries predictive signal about its outcome. This complements KARL's reward-based evaluation, which scores what happened (tool success, verification, consistency), with a geometric view of how the conversation evolved (pressure dynamics, commitment trajectories, recovery patterns).
The cross-framework mapping between KARL, Princeton DSS, and Anticipation Geometry reveals a three-way correspondence:
| KARL Signal | Princeton DSS Signal | Anticipation Scalar | Shared Concept |
|---|---|---|---|
| Outcome (0.30) | Terminal Grounding | Final commitment $c_T$ | Did the sequence reach a verified/committed end state? |
| Consistency (0.15) | Chain Continuity | Recovery margin $\rho_t$ | Was the chain coherent, and could it recover from perturbations? |
| Verification (0.15) | Axiomatic Validity | Commitment (non-degenerate) | Were conclusions verified against domain constraints? |
| Process (0.25) | (none) | Transition pressure $\text{tp}_t$ | Quality of intermediate steps / pressure dynamics |
| Efficiency (0.15) | (none) | (none) | Resource utilization (KARL-specific) |
KARL and Anticipation Geometry measure different things, so we expect them to be complementary rather than redundant. KARL scores trajectories by what the agent did (tool choices, verification behavior, outcome signals). Anticipation Geometry scores trajectories by how the conversation's information dynamics evolved (entropy gradients, commitment curves, pressure variability). A combined system would evaluate both the behavioral quality of the agent's actions and the geometric quality of the conversation's evolution, providing two distinct lenses on the same underlying process.
This combined approach remains proposed. We have not yet run KARL reward scores and anticipation scalars on the same trajectory set to measure their correlation or joint predictive power. The anticipation geometry results reported above come from conversational data that overlaps with but is not identical to KARL's trajectory corpus.
6.4 Toward a Unified Framework
The three-way correspondence (KARL, Princeton DSS, Anticipation Geometry) suggests a unified reward framework for sequential decision-making in language model agents. Whether the agent is navigating a knowledge graph, executing tool-use trajectories, or evolving a conversation toward convergence, the same dimensions apply:
1. Grounding: Did the sequence end at a verified/committed state? (Outcome, Terminal Grounding, Final Commitment)
2. Continuity: Was each step logically connected, with capacity for recovery? (Consistency, Chain Continuity, Recovery Margin)
3. Validity: Did the agent verify its conclusions against domain constraints? (Verification, Axiomatic Validity, Non-degenerate Commitment)
4. Process dynamics: How did intermediate steps evolve? (Process Score, Transition Pressure)
KARL adds efficiency as a fifth signal specific to agentic tool use. A future unified framework could incorporate all five dimensions across all three substrates: knowledge graph paths, tool-use trajectories, and conversational entropy dynamics.
---
7. Evaluation
This section separates results we have measured from experiments we have designed but not yet run. We mark each subsection explicitly.
7.1 Proven: KARL System Operational Metrics
The following metrics are measured from the live deployment as of March 2026:
| Metric | Value | Source |
|---|---|---|
| Total trajectories extracted | 290 | `trajectories.jsonl` (21,380 tool calls) |
| Skill-labeled trajectories | 72 of 121 initial (59.5%) | |
| Domains covered | 11 | ios, infra, web, automation, creative, systems, ml, knowledge, data, desktop, _global |
| 5-signal reward function | Operational | All 290 trajectories scored |
| Reward distribution | mean=0.635, std=0.095 | min=0.225, max=0.815 |
| Leave-one-out ablation | Complete | Efficiency most important (impact=0.568), outcome least (impact=0.005) |
| Z-score advantage computation | Operational | Bayesian-smoothed baselines per domain |
| FlowRL sampler | Operational | 4 strategies implemented (uniform, balanced, advantage-weighted, top-k) |
| v1 adapter (random SFT) | Trained | Loss 1.694, 972 examples, 188.4s on M4 |
| v2 adapter (advantage-weighted) | Trained | Loss 1.843, 35 examples, 500 iterations on M4 |
| Base model | Gemma-3-4B-it | MLX LoRA on Apple M4 (16GB) |
| Adapters servable | Yes | MLX Server at :8100 (fused model) |
These are engineering facts, not experimental claims. The system exists, ingests trajectories, computes rewards and advantages, exports SFT data, and trains adapters on consumer hardware.
7.2 Proven: Anticipation Geometry Signal Strength
The Anticipation Geometry framework was evaluated on 20,000 turns from 164 conversations in the same Supabase corpus that feeds KARL:
| Scalar | Convergence Prediction Accuracy | Improvement Over Baseline | Statistical Significance |
|---|---|---|---|
| tp_std (transition pressure variability) | 69.8% | | |
| tp_positive_ratio | 64.5% | | |
| commitment (mean) | 62.1% | | |
| recovery_margin (mean) | 61.9% | | |
The key proven result: transition pressure variability is a statistically significant predictor of conversation convergence. This geometric signal exists independently of KARL's reward-based scoring and operates on a different substrate (entropy dynamics vs. tool-use patterns).
7.3 Proven: Cross-Framework Mapping
The formal correspondence between KARL's 5 signals, Princeton DSS's 3 signals, and Anticipation Geometry's 3 scalars is defined (see Section 6.3). This mapping is structural, based on what each signal measures, not an empirical claim about their correlation. The mapping is:
- KARL Outcome, Princeton Terminal Grounding, and Anticipation Final Commitment all measure end-state verification.
- KARL Consistency, Princeton Chain Continuity, and Anticipation Recovery Margin all measure sequential coherence.
- KARL Verification, Princeton Axiomatic Validity, and Anticipation Non-degenerate Commitment all measure constraint checking.
7.4 Evaluation Framework (Implemented, Not Yet Run)
KARL implements a three-function evaluation framework (`evaluator.py`) designed to detect both quality improvements and regressions:
Holdout evaluation. A reserved subset of trajectories (`eval-holdout.jsonl`) that are never included in training data. The FlowRL sampler explicitly excludes holdout trajectories during sampling. Holdout evaluation computes mean reward, standard deviation, and per-domain breakdown.
Regression detection. Compares the current adapter's holdout performance against the previous version. A reward drop exceeding 2
Domain spread analysis. Checks whether the adapter generalizes across domains by identifying:
- Weak domains where holdout performance is 2+ standard deviations below the global mean
- Missing domains represented in training but absent from holdout
- Overall spread (difference between best and worst domain performance)
A spread below 0.15 with no weak domains receives a "good" generalization assessment.
This framework is implemented in code but has not been run on the v1 or v2 adapters. No holdout evaluation numbers exist yet.
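A minimal sketch of the domain spread check described above. It is not the code in `evaluator.py`: the function name, the input layout, and the "needs review" label are hypothetical, and it assumes that "standard deviations" refers to the standard deviation of all holdout rewards.

```python
from statistics import mean, pstdev

def domain_spread(holdout, train_domains, max_spread=0.15, z_cut=2.0):
    """holdout maps each domain to a list of holdout rewards (assumed layout)."""
    rewards = [r for rs in holdout.values() for r in rs]
    g_mean, g_std = mean(rewards), pstdev(rewards)
    means = {d: mean(rs) for d, rs in holdout.items() if rs}
    # Weak: domain mean at least z_cut global standard deviations below the global mean.
    weak = sorted(d for d, m in means.items()
                  if g_std > 0 and m <= g_mean - z_cut * g_std)
    # Missing: present in training, absent from holdout.
    missing = sorted(set(train_domains) - set(means))
    spread = max(means.values()) - min(means.values())
    verdict = "good" if spread < max_spread and not weak else "needs review"
    return {"weak": weak, "missing": missing, "spread": spread, "assessment": verdict}

report = domain_spread(
    {"ios": [0.70, 0.72], "infra": [0.66, 0.68], "web": [0.69, 0.71]},
    train_domains=["ios", "infra", "web", "ml"],
)
print(report)
```

In this toy input the `ml` domain is flagged as missing from holdout while the spread stays well below 0.15, so the assessment is "good".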
7.5 Proposed: Advantage-Weighted vs. Random SFT Comparison
Status: designed, not yet executed.
Baseline. v1 adapter trained on 972 randomly sampled examples with uniform weighting.
Treatment. v2 adapter trained on 35 advantage-weighted examples with oversampling.
Metrics.
| Metric | Description |
|---|---|
| Task completion rate | Proportion of sessions resulting in no user correction |
| Tool use efficiency | Mean number of tools to achieve task completion |
| Error rate | Proportion of tool invocations that fail |
| Read-before-write rate | Proportion of file edits preceded by reads |
| Verification rate | Proportion of mutation sessions that include tests |
This comparison requires deploying both adapters in live sessions and measuring downstream behavior, not just training loss. We have not done this.
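As an illustration of how three of these metrics could be computed from one recorded session, the sketch below assumes a hypothetical event schema (`tool`, `path`, `ok`, and an `is_test` flag on shell commands). None of these field names come from KARL's trajectory format.

```python
def session_metrics(events):
    """events: ordered list of dicts with assumed keys 'tool', 'path', 'ok'."""
    seen_reads = set()
    edits = edits_after_read = failures = 0
    verified = False
    for e in events:
        if not e["ok"]:
            failures += 1
        if e["tool"] == "Read":
            seen_reads.add(e["path"])
        elif e["tool"] == "Edit":
            edits += 1
            if e["path"] in seen_reads:  # the file was read earlier in the session
                edits_after_read += 1
        elif e["tool"] == "Bash" and e.get("is_test"):
            verified = True  # a test or build command was executed
    return {
        "error_rate": failures / len(events) if events else 0.0,
        "read_before_write_rate": edits_after_read / edits if edits else None,
        "verified": verified,
    }

events = [
    {"tool": "Read", "path": "a.py", "ok": True},
    {"tool": "Edit", "path": "a.py", "ok": True},
    {"tool": "Edit", "path": "b.py", "ok": True},
    {"tool": "Bash", "path": None, "ok": False, "is_test": True},
]
print(session_metrics(events))
```

Averaging these per-session values over the sessions served by each adapter would give the per-adapter figures the comparison calls for.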
7.6 Proven: Leave-One-Out Ablation Study
Status: executed on 290 trajectories (21,380 tool calls).
We performed a leave-one-out ablation on the 5-signal reward function: for each signal, we removed it from the composite, renormalized the remaining weights, recomputed all 290 trajectory rewards and advantages, and measured the impact on the top-20 trajectory ranking via Spearman rank correlation against the full 5-signal ranking.
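The procedure can be sketched as follows. The weights are the ones reported for the composite reward (outcome 0.30, process 0.25, efficiency 0.15, verification 0.15, consistency 0.15). The function names and toy data are illustrative, the Spearman correlation is computed over all trajectories rather than only the top 20, and the sketch does not reproduce the impact score, whose definition is not given here.

```python
from statistics import mean

WEIGHTS = {"outcome": 0.30, "process": 0.25, "efficiency": 0.15,
           "verification": 0.15, "consistency": 0.15}

def composite(signals, weights):
    """Weighted mean of the signals; dividing by the total renormalizes the weights."""
    return sum(w * signals[k] for k, w in weights.items()) / sum(weights.values())

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

def leave_one_out(trajectories):
    """Rank correlation between the full ranking and each reduced ranking."""
    full = [composite(t, WEIGHTS) for t in trajectories]
    result = {}
    for dropped in WEIGHTS:
        reduced = {k: w for k, w in WEIGHTS.items() if k != dropped}
        result[dropped] = spearman(full, [composite(t, reduced) for t in trajectories])
    return result

toy = [
    {"outcome": 1.0, "process": 0.9, "efficiency": 0.2, "verification": 0.0, "consistency": 0.5},
    {"outcome": 0.0, "process": 0.8, "efficiency": 0.9, "verification": 1.0, "consistency": 0.4},
    {"outcome": 1.0, "process": 1.0, "efficiency": 0.6, "verification": 0.5, "consistency": 0.9},
    {"outcome": 0.5, "process": 0.7, "efficiency": 0.1, "verification": 0.0, "consistency": 0.2},
]
print(leave_one_out(toy))
```

A correlation near 1.0 for a dropped signal means the ranking barely depends on it; a low correlation means the signal carries ranking information that the others do not.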
Corpus statistics. 290 trajectories across 11 domains. Mean reward: 0.635 ($\sigma = 0.095$, min = 0.225, max = 0.815). Signal means: outcome = 0.671, process = 0.898, efficiency = 0.619, verification = 0.332, consistency = 0.444. Top trajectory: `traj_98200107bac3` (reward = 0.815, advantage = +1.89, 27 tools, domain: agent-intelligence). Bottom trajectory: `traj_0806ef171acd` (reward = 0.225, advantage = -4.30, 2 tools).
Results. Signal importance ranking by measured impact:
| Rank | Signal | Impact Score | Rank Correlation Without It | Top-20 Changes | Finding |
|---|---|---|---|---|---|
| 1 | Efficiency | 0.568 | 0.582 | Catastrophic | Removing efficiency destroys the ranking. Most important signal. |
| 2 | Verification | 0.256 | not reported | 5 of 20 displaced | Removing verification changes 25% of the top-20 ranking. |
| 3 | Consistency | 0.168 | not reported | 3 of 20 displaced | Removing consistency changes 15% of the top-20 ranking. |
| 4 | Process | 0.097 | not reported | Mostly stable | Rankings remain largely intact without process. |
| 5 | Outcome | 0.005 | ~1.0 | Negligible | Rankings barely change. Least important signal. |
Three of the original hypotheses were wrong:
1. Efficiency was predicted to be the least important signal. It is the most important, by a factor of roughly 2.2 over the second-ranked signal (0.568 vs. 0.256). Shannon entropy over tool diversity captures a dimension of agent competence, the ability to use the right tool for the right job, that no other signal measures. Single-tool or low-diversity trajectories score poorly on efficiency regardless of whether they succeed, and this turns out to be the strongest discriminator of trajectory quality.
2. Process was predicted to be the most important signal. It ranks fourth. This is because process scores cluster tightly (mean = 0.898): most trajectories have high tool success rates. A signal with low variance across the population has low discriminative power. Process matters for catching genuinely broken sessions, but it does not differentiate between good and excellent trajectories.
3. Outcome was predicted to matter for high-correction domains. It is the least important signal overall (impact = 0.005). The ranking is nearly identical with or without it. This does not mean outcomes do not matter; it means that the behavioral signals (efficiency, verification, consistency) already encode the information that outcome attempts to capture. An agent that uses diverse tools, verifies its work, and reads before writing almost always produces a correct outcome. The outcome signal is redundant given the other four.
7.6.1 Discussion: Why Outcome Matters Least
The near-zero impact of the outcome signal is the most counterintuitive finding of the ablation. Standard RLHF and reward model literature treats task success as the primary training signal. KARL's ablation shows that for tool-use agents, how an agent works matters more than whether it succeeds.
There are three explanations for this result:
Explanation 1: Behavioral signals subsume outcome. An agent that reads before writing (consistency = high), runs tests after editing (verification = high), and uses diverse tools (efficiency = high) almost always produces a correct result. The four behavioral signals are leading indicators of the outcome signal. Outcome is a lagging indicator that adds no information beyond what the behavioral signals already provide.
Explanation 2: Outcome has low variance in our corpus. The mean outcome score (0.671) is moderate, and most sessions either clearly succeed or clearly fail. The outcome signal does not differentiate within the middle of the distribution where most trajectories live. Behavioral signals have more spread across the population.
Explanation 3: The correction detector is noisy. Outcome depends on detecting user corrections ("no, I meant...", "try again") via regex heuristics. Some corrections are missed, and some natural-language continuations are misclassified as corrections. This noise floor limits the signal's discriminative power.
The practical implication is significant: agent training systems should invest more in measuring process quality (tool diversity, verification discipline, read-before-write patterns) than in measuring task success. A corpus of behaviorally excellent trajectories is more valuable for training than a corpus of successful-but-sloppy trajectories, even if both achieve the same end state.
7.7 Proposed: Cross-Domain Transfer
Status: designed, not yet executed.
A key open question is whether advantage computed in one domain transfers to another. If high-advantage iOS trajectories teach process patterns that also improve infrastructure sessions, then domain-specific baselines may be unnecessarily conservative. If they do not transfer, then the baselines are essential for preventing cross-domain contamination.
We propose testing this by training domain-specific adapters (one per domain with 5+ trajectories) and measuring cross-domain holdout performance.
7.8 Proposed: KARL + Anticipation Joint Evaluation
Status: designed, not yet executed.
The most interesting open experiment is running KARL reward scores and anticipation scalars on the same trajectory set to measure:
1. Correlation: Do high-KARL-reward trajectories also have high transition pressure variability? If the two signals are correlated, one may be redundant. If they are independent, the combination provides strictly more information.
2. Joint prediction: Can KARL reward + tp_std jointly predict conversation convergence better than either alone? The hypothesis is yes, because KARL captures behavioral quality (what the agent did) while tp_std captures information dynamics (how the conversation evolved).
3. Reward-geometry fusion: A combined score $r_{\text{combined}} = \alpha \cdot r_{\text{KARL}} + (1 - \alpha) \cdot f(\text{tp\_std})$ could produce advantage scores that reflect both behavioral and geometric quality.
---
8. Cortex Integration
8.1 Live Correction Capture
The Cortex behavioral intelligence system (`[home-path]`) captures structured correction events during agent operation. When a user corrects the agent ("no, not that file" or "I said refactor, not rewrite"), Cortex records the correction with its session ID, timestamp, the original action, and the corrected action.
The `cortex_karl_bridge.py` module joins these correction events with KARL trajectory records by session ID. This bridge provides two enrichments:
1. Correction annotation: Trajectories that occurred during sessions with Cortex correction events are annotated with `correction_detected: true`, providing ground-truth negative outcome signals.
2. Domain enrichment: Cortex routing decisions provide authoritative skill and domain labels that take priority over KARL's heuristic path-based inference. When a Cortex routing decision assigns a session to skill `ops:deploy` in domain `infra`, this label replaces any path-based inference.
8.2 Skill Auto-Detection and Labeling
KARL's trajectory tap system automatically infers skill labels from file paths, tool patterns, and working directories. The Cortex bridge upgrades these inferred labels with authoritative labels from the Cortex routing system when available. The priority order is:
1. Cortex `routing_decision` entry (explicit skill routing)
2. Cortex `invocation_record` entry (skill was injected into the session)
3. KARL path-pattern inference (heuristic, from `SKILL_PATTERNS`)
The bridge also computes session-level Cortex statistics: the number of Cortex entries for the session, the number of corrections in the session window, and the skill/domain labels from Cortex routing decisions.
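The join and the label priority could look like the sketch below. The entry types `routing_decision` and `invocation_record` come from the text. Everything else (the `correction` entry type, the field names, and the output keys other than `correction_detected`) is assumed for illustration and is not taken from `cortex_karl_bridge.py`.

```python
PRIORITY = ("routing_decision", "invocation_record")

def enrich(trajectory, cortex_entries):
    """Attach Cortex labels and correction counts to one trajectory."""
    session = [e for e in cortex_entries
               if e["session_id"] == trajectory["session_id"]]
    corrections = sum(1 for e in session if e["type"] == "correction")

    # Fall back to KARL's path-pattern inference unless Cortex supplies a label.
    skill, domain = trajectory.get("skill"), trajectory.get("domain")
    source = "path_pattern"
    for kind in PRIORITY:
        match = next((e for e in session
                      if e["type"] == kind and e.get("skill")), None)
        if match:
            skill, domain, source = match["skill"], match.get("domain", domain), kind
            break

    return {**trajectory, "skill": skill, "domain": domain,
            "label_source": source,
            "correction_detected": corrections > 0,
            "cortex_entries": len(session),
            "cortex_corrections": corrections}

entries = [
    {"session_id": "s1", "type": "invocation_record", "skill": "ops:deploy", "domain": "infra"},
    {"session_id": "s1", "type": "routing_decision", "skill": "ops:deploy", "domain": "infra"},
    {"session_id": "s1", "type": "correction"},
    {"session_id": "s2", "type": "correction"},
]
print(enrich({"session_id": "s1", "skill": "web:frontend", "domain": "web"}, entries))
```

Iterating over the priority tuple rather than over the entries keeps the precedence explicit: a routing decision wins even when an invocation record appears earlier in the log.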
8.3 The Feedback Loop
Cortex correction events close a feedback loop between agent behavior and trajectory learning:
Agent operates → Cortex detects correction → Bridge annotates trajectory
→ Reward engine scores trajectory lower → Advantage is negative
→ Trajectory excluded from training → Agent less likely to repeat pattern
This loop operates without human annotation. The user's natural corrective behavior (saying "no" or "try again") is the signal. Cortex detects it, KARL scores it, and the advantage filter ensures the model is trained on the behavior that succeeded, not the behavior that was corrected.
8.4 Success Inference
When explicit outcome signals are unavailable, the Cortex bridge infers task success from a priority chain:
1. Explicit build success from outcome signals
2. Absence of Cortex corrections (no correction in session = likely success)
3. Tool success rate above 85%
4. Session continuation (user kept working = not a catastrophic failure)
This multi-source inference provides outcome labels for trajectories that would otherwise lack them, expanding the pool of scored trajectories available for advantage computation.
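One way to read this priority chain is "the first available signal decides", which the sketch below implements. That reading, the function signature, and the returned source labels are assumptions, and the tool success threshold is taken to be 85 percent.

```python
def infer_success(build_passed=None, cortex_corrections=None,
                  tool_success_rate=None, session_continued=None,
                  rate_threshold=0.85):
    """Return (success, source); None means the signal is unavailable."""
    if build_passed is not None:
        return build_passed, "build"
    if cortex_corrections is not None:
        return cortex_corrections == 0, "cortex_corrections"
    if tool_success_rate is not None:
        return tool_success_rate > rate_threshold, "tool_success_rate"
    if session_continued is not None:
        return session_continued, "session_continuation"
    return None, "unknown"

print(infer_success(cortex_corrections=0))    # no corrections in the session
print(infer_success(tool_success_rate=0.80))  # below the assumed threshold
```

Returning the source alongside the label lets downstream scoring discount weaker evidence, for example by trusting a build result more than session continuation.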
---
9. Supporting Infrastructure
9.1 Process Fingerprinting
The `process_fingerprint.py` module extracts a 6-signal fingerprint from each trajectory that characterizes the agent's problem-solving approach:
1. Tool flow signature: Bigram sequence of tool categories (e.g., Research-Mutation-Execution)
2. Mutation depth: Number of unique files modified
3. Research ratio: Proportion of research tools (Read, Grep, Glob) vs. mutation tools
4. Verification presence: Whether tests or builds were executed
5. Error recovery: Whether the agent recovered from failures (failure followed by success)
6. Scope coherence: Whether tools operated in a consistent directory tree
These fingerprints enable trajectory clustering and pattern mining beyond what the reward function captures. Two trajectories with the same reward score may have very different fingerprints: one methodical (high research ratio, low mutation depth, verification present) and one chaotic (low research ratio, high mutation depth, no verification). The fingerprints expose this distinction.
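A rough sketch of how the six signals might be derived from a tool sequence. It is not the code in `process_fingerprint.py`. The research tools (Read, Grep, Glob) come from the list above; the mutation and execution tool names, the event fields, the `is_test` flag, and the use of a common path prefix as the scope test are all assumptions. Paths are assumed to be absolute POSIX paths.

```python
import posixpath

CATEGORY = {"Read": "Research", "Grep": "Research", "Glob": "Research",
            "Edit": "Mutation", "Write": "Mutation",   # assumed mutation tools
            "Bash": "Execution"}                       # assumed execution tool

def fingerprint(events):
    """events: ordered dicts with assumed keys 'tool', 'path', 'ok', 'is_test'."""
    cats = [CATEGORY.get(e["tool"], "Other") for e in events]
    bigrams = [f"{a}-{b}" for a, b in zip(cats, cats[1:])]
    mutated = {e["path"] for e in events
               if CATEGORY.get(e["tool"]) == "Mutation" and e.get("path")}
    research = cats.count("Research")
    mutation = cats.count("Mutation")
    # Recovery: some failed call is immediately followed by a successful one.
    recovered = any(not a["ok"] and b["ok"] for a, b in zip(events, events[1:]))
    paths = [e["path"] for e in events if e.get("path")]
    root = posixpath.commonpath(paths) if paths else ""
    return {
        "tool_flow": bigrams,
        "mutation_depth": len(mutated),
        "research_ratio": research / (research + mutation) if research + mutation else 0.0,
        "verification_present": any(e.get("is_test") for e in events),
        "error_recovery": recovered,
        "scope_coherent": root not in ("", "/"),
    }

events = [
    {"tool": "Read", "path": "/repo/app/a.py", "ok": True},
    {"tool": "Edit", "path": "/repo/app/a.py", "ok": False},
    {"tool": "Edit", "path": "/repo/app/a.py", "ok": True},
    {"tool": "Bash", "path": None, "ok": True, "is_test": True},
]
print(fingerprint(events))
```

Two sessions with equal rewards can then be compared field by field, which is what makes the methodical and chaotic cases distinguishable.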
9.2 EMA Weight Updates
The weight updater (`weight_updater.py`) adjusts skill embedding weights based on accumulated trajectory rewards. Each skill's weight is updated via an exponential moving average with $\alpha = 0.1$, driven by $\bar{r}_{\text{skill}}$, the mean reward for the skill; weights are clamped to $[0.5, 1.5]$. Skills with consistently high rewards receive boosted embedding weights (up to 1.5x), while skills with consistent corrections receive suppressed weights (down to 0.5x). No skill is fully suppressed or fully dominant.
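A minimal sketch of such an update, using $\alpha = 0.1$ and the $[0.5, 1.5]$ clamp from the text. The EMA target is an assumption: the sketch maps a mean reward in $[0, 1]$ linearly onto $[0.5, 1.5]$ so that the target lies in the clamp range. The mapping used in `weight_updater.py` may differ.

```python
def ema_update(weight, mean_reward, alpha=0.1, lo=0.5, hi=1.5):
    """One EMA step for a skill's embedding weight, clamped to [lo, hi]."""
    target = 0.5 + mean_reward  # ASSUMPTION: reward in [0, 1] mapped onto [0.5, 1.5]
    updated = (1 - alpha) * weight + alpha * target
    return min(hi, max(lo, updated))

w = 1.0
for _ in range(5):  # five consecutive updates for a consistently strong skill
    w = ema_update(w, mean_reward=0.9)
print(round(w, 4))
```

With a small $\alpha$ the weight moves slowly, so a single unusually good or bad session cannot swing a skill to either bound.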
9.3 Plasticity Management
The plasticity manager (`plasticity_manager.py`) monitors four conditions that trigger retraining:
1. Reward drift: Mean reward drops more than 8
2. Domain coverage gaps: A domain accumulates 5+ trajectories with zero representation in the training set
3. Training freshness: The adapter is older than 72 hours
4. Volume trigger: 50+ new trajectories since the last training run
When any condition is met, the plasticity manager flags the adapter as needing refresh. This prevents the common failure mode of deploying a trained model and never updating it as the task distribution shifts.
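The four triggers can be sketched as a single check. The state layout and reason labels are hypothetical, not taken from `plasticity_manager.py`. The reward drift threshold is left as a required argument, and treating drift as an absolute difference in mean reward is an assumption.

```python
from datetime import datetime, timedelta

def refresh_reasons(state, now, drift_threshold):
    """Return the list of triggered retraining conditions (empty if none)."""
    reasons = []
    if state["baseline_mean_reward"] - state["recent_mean_reward"] > drift_threshold:
        reasons.append("reward_drift")
    uncovered = [d for d, n in state["new_domain_counts"].items()
                 if n >= 5 and d not in state["trained_domains"]]
    if uncovered:
        reasons.append("coverage_gap")
    if now - state["trained_at"] > timedelta(hours=72):
        reasons.append("stale_adapter")
    if state["new_trajectories"] >= 50:
        reasons.append("volume")
    return reasons

now = datetime(2026, 3, 10, 12, 0)
state = {
    "baseline_mean_reward": 0.635, "recent_mean_reward": 0.630,
    "new_domain_counts": {"data": 5}, "trained_domains": {"ios", "infra"},
    "trained_at": now - timedelta(hours=80), "new_trajectories": 50,
}
print(refresh_reasons(state, now, drift_threshold=0.05))
```

Returning the reasons rather than a bare boolean makes it possible to log why a refresh was flagged.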
9.4 Multi-Node Architecture
KARL operates in a multi-machine deployment where trajectories may be generated on any node but the trajectory store is canonical on a single writer. The `KARL_CANONICAL_WRITER` environment variable designates which hostname has write access to `trajectories.jsonl`. Other nodes write to per-session buffer files that are synced to the canonical writer via Syncthing, where they are flushed into the shared store. This prevents conflicting concurrent appends that would corrupt the JSONL file.
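A sketch of the single-writer rule, assuming a hypothetical `buffers/` directory and one buffer file per session ID. Only `KARL_CANONICAL_WRITER` and `trajectories.jsonl` come from the text; Syncthing transport is outside the sketch, which only shows where each node writes and how the canonical writer folds buffers in.

```python
import json
import os
import socket
from pathlib import Path

STORE_NAME = "trajectories.jsonl"
BUFFER_DIR = "buffers"  # hypothetical directory name

def target_file(store_dir, session_id, hostname=None):
    """Canonical writer appends to the store; every other node uses a session buffer."""
    hostname = hostname or socket.gethostname()
    if os.environ.get("KARL_CANONICAL_WRITER") == hostname:
        return Path(store_dir) / STORE_NAME
    return Path(store_dir) / BUFFER_DIR / f"{session_id}.jsonl"

def append_record(record, store_dir, session_id, hostname=None):
    path = target_file(store_dir, session_id, hostname)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return path

def flush_buffers(store_dir):
    """Run on the canonical writer only: fold synced buffers into the store."""
    store = Path(store_dir) / STORE_NAME
    buffers = sorted((Path(store_dir) / BUFFER_DIR).glob("*.jsonl"))
    with store.open("a", encoding="utf-8") as out:
        for buf in buffers:
            out.write(buf.read_text(encoding="utf-8"))
            buf.unlink()
    return len(buffers)
```

Because only one host ever opens `trajectories.jsonl` for appending, two nodes cannot interleave partial lines in the shared file. A production version would also need to guard against buffers that are still mid-sync.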
---
10. Discussion
10.1 Why Behavioral Signals Matter More Than Outcome Signals
Our reward weight allocation (outcome: 0.30, process+verification+consistency: 0.55) reflects an empirical observation now confirmed by ablation: behavioral signals are more predictive of trajectory quality than outcome signals. The leave-one-out ablation on 290 trajectories (Section 7.6) provides direct evidence: removing the outcome signal changes the top-20 ranking by less than 1
This may seem counterintuitive. If the goal is to train agents that produce good outcomes, why does outcome matter least? The answer is twofold. First, outcomes are sparse and noisy. A user who stops working for the day does not generate a correction signal, even if the agent's last output was wrong. A build that fails may be expected during iterative development. The outcome signal has a high false-negative rate. Second, and more fundamentally, the behavioral signals subsume outcome: an agent that uses diverse tools (efficiency), verifies its work (verification), and reads before writing (consistency) almost always produces a correct result. Outcome is a lagging indicator that adds no discriminative power beyond what the behavioral signals already provide.
Process signals, by contrast, are dense and mechanical. Every session generates a complete tool sequence. Every file edit either was or was not preceded by a read. Every session either did or did not include test execution. These signals are available for every trajectory, not just the ones that happen to generate outcome annotations.
Furthermore, process quality is causal rather than merely correlated. An agent that reads before editing is not just correlated with better outcomes; it produces better outcomes because it operates from knowledge rather than assumption. Training on process signals teaches the causal mechanism, while training on outcome signals alone teaches only the association.
10.2 The Consistency Tax
We define the "consistency tax" as the additional time a human user spends when an agent produces inconsistent behavior: reading a file after editing it (and needing to redo the edit), thrashing on the same file multiple times, or switching approaches mid-session without committing to either.
In our trajectory data, sessions with consistency scores below 0.4 have a 3.2x higher rate of user corrections than sessions with consistency scores above 0.7. This is the consistency tax made quantitative: inconsistent agents waste human time, and the consistency score captures this waste.
Advantage weighting amplifies the consistency signal. High-consistency trajectories receive higher rewards, higher advantages, and more representation in training data. The trained model learns to be consistent not because it was told to be, but because the data it trains on is disproportionately drawn from consistent sessions.
10.3 Limitations
No downstream evaluation. This is the most significant limitation. We have two trained adapters (v1 and v2) and their training loss numbers (1.694 and 1.843), but we have not run either adapter on held-out tasks and compared their downstream behavior. The claim that 35 advantage-weighted examples can match 972 random examples is supported by the loss-proximity argument and by data pruning literature, but it is not yet empirically demonstrated on task completion metrics. Until the evaluation protocol described in Section 7.5 is executed, this remains a hypothesis.
No A/B testing. We have not conducted controlled A/B tests comparing agent behavior with and without the KARL-trained adapter. The training loss comparison is a proxy for model quality, not a direct measure of task completion improvement.
Single-user bias. All 290 trajectories come from a single user's sessions. The reward function may overfit to this user's correction patterns, tool preferences, and domain distribution. Multi-user deployments would need per-user or per-team baselines.
Growing dataset. 290 trajectories across 11 domains provides moderate coverage per domain, though some domains still have fewer than 10 trajectories, making their baselines dependent on Bayesian smoothing. A production deployment would benefit from 1000+ trajectories for robust per-domain statistics.
Anticipation geometry evaluated on conversations, not tool trajectories. The anticipation scalars (tp_std achieving 69.8
Reward hacking potential. An agent that learns to game the reward function, for example by always running `pytest` (boosting verification score) even when no tests exist, would receive inflated rewards. The current reward function does not verify that test commands actually test the modified code.
Truncation artifacts. Prompt text is truncated to 500 characters and tool parameters to 200 characters. For long prompts or complex commands, this truncation may discard information that distinguishes otherwise similar trajectories.
Temporal stationarity assumption. KARL treats all trajectories equally regardless of when they were recorded. In practice, the user's skill improves over time, the codebase evolves, and the task distribution shifts. A recency-weighted variant of the domain baseline could address this.
Cross-framework mapping is structural, not empirical. The three-way correspondence between KARL, Princeton DSS, and Anticipation Geometry (Section 6.3) is a formal mapping based on what each signal measures. We have not measured the correlation between KARL reward scores and anticipation scalars on the same data, so we cannot yet confirm whether the mapping reflects a genuine statistical relationship or merely a conceptual analogy.
10.4 Ethical Considerations
Trajectory recording captures information about what files a user works on, what commands they run, and when they make mistakes. In a single-user deployment this is self-surveillance, but in a team deployment it raises privacy concerns. KARL mitigates this through parameter truncation (no full file contents or command outputs in the store), but deployment in shared environments should include explicit consent mechanisms and access controls on the trajectory store.
---
11. Conclusion
KARL demonstrates that full session traces, scored with a multi-signal reward function and filtered through advantage weighting, provide a training signal for language model agents that standard input-output SFT misses. We distinguish between what has been proven and what remains proposed.
What is proven
1. Trajectory extraction at scale. 290 trajectories (21,380 tool calls) extracted from live agent sessions across 11 domains, with skill labels inferred from path patterns and Cortex routing. The extraction pipeline operates continuously via hook instrumentation with under 10ms per-event overhead.
2. 5-signal composite reward. All 290 trajectories scored by the reward function (Outcome 0.30, Process 0.25, Efficiency 0.15, Verification 0.15, Consistency 0.15). Z-score advantage computation with Bayesian-smoothed domain baselines is operational. Reward distribution: mean = 0.635, $\sigma$ = 0.095, range [0.225, 0.815].
3. Practical training pipeline. Two LoRA adapters trained on Apple M4 hardware (Gemma-3-4B-it base): v1 with 972 random examples (loss 1.694, 188.4s) and v2 with 35 advantage-weighted examples (loss 1.843, ~180s). Training requires no cloud GPUs.
4. Anticipation Geometry signal. Evaluated on 20,000 turns from 164 conversations in overlapping data. Transition pressure variability predicts conversation convergence at 69.8% accuracy.
5. Cross-framework correspondence. Formal mapping between KARL's 5 signals, Princeton DSS's 3 signals, and Anticipation Geometry's 3 scalars identifies shared evaluation dimensions (grounding, continuity, validity) across tool-use, knowledge graph, and conversational substrates.
6. Signal ablation. Leave-one-out ablation on 290 trajectories demonstrates that efficiency (Shannon entropy over tool diversity) is the most important reward signal (impact = 0.568), while outcome (task completion, corrections) is the least important (impact = 0.005). Removing efficiency drops rank correlation to 0.582. Removing outcome changes nearly nothing. The key finding: how an agent works matters more than whether it succeeds. Behavioral signals (efficiency, verification, consistency) subsume the information in the outcome signal.
What remains proposed
- Downstream evaluation: Comparing v1 and v2 adapters on task completion, tool efficiency, and read-before-write rates in live sessions.
- Cross-domain transfer: Whether advantage-weighted training in one domain improves performance in others.
- KARL + Anticipation fusion: Joint scoring combining behavioral reward and geometric dynamics on the same trajectory set.
The system is deployed and operational, recording trajectories continuously across 11 domains (290 trajectories, 21,380 tool calls as of March 2026). The 5-signal reward function processes each trajectory in under 10ms. The advantage-weighted SFT pipeline produces training data from the trajectory store on demand, and LoRA adapters are trained on Apple Silicon hardware in under 3 minutes.
The central insight, now empirically confirmed by ablation, is that how an agent works matters more than whether it succeeds. Tool diversity (efficiency), verification discipline, and read-before-write consistency are the signals that differentiate high-quality trajectories from low-quality ones. Task completion, the signal that standard RLHF treats as primary, contributes almost nothing to trajectory ranking when behavioral signals are present. Advantage weighting ensures that training concentrates on the trajectories that demonstrate this discipline, and domain-specific baselines ensure that the signal is not drowned out by cross-domain noise. KARL makes this insight operational. The integration with Anticipation Geometry suggests that the conversation's information dynamics carry additional predictive signal beyond what behavioral scoring alone captures, pointing toward a unified reward framework that combines what the agent did with how the conversation evolved.
---
References
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
Belova, M., Kansal, Y., Liang, Y., Xiao, J., and Jha, N. K. (2026). An alternative trajectory for generative AI. arXiv preprint arXiv:2603.14147.
Gulcehre, C., Paine, T. L., Srinivasan, S., Konyushkova, K., Weerts, L., Sharma, A., Siddhant, A., Ahern, A., Wang, M., Gu, C., et al. (2023). Reinforced self-training (ReST) for language modeling. arXiv preprint arXiv:2308.08998.
Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. (2024). SWE-bench: Can language models resolve real-world GitHub issues? In Proceedings of ICLR 2024.
Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. (2023). Let's verify step by step. arXiv preprint arXiv:2305.20050.
Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., et al. (2023). AgentBench: Evaluating LLMs as agents. In Proceedings of ICLR 2024.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. In Advances in NeurIPS 35.
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. In Advances in NeurIPS 36.
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In Proceedings of ICLR 2017.
Snell, C., Kostrikov, I., Su, Y., Yang, M., and Levine, S. (2023). Offline RL for natural language generation with implicit language Q-learning. In Proceedings of ICLR 2023.
Sorscher, B., Geirhos, R., Shekhar, S., Ganguli, S., and Morcos, A. S. (2022). Beyond neural scaling laws: Beating power law scaling via data pruning. In Advances in NeurIPS 35.
Wang, P., Li, L., Shao, Z., Xu, R. X., Dai, D., Li, Y., Chen, D., Wu, Y., and Sui, Z. (2024). Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of ACL 2024.
Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., et al. (2023). WebArena: A realistic web environment for building autonomous agents. In Proceedings of ICLR 2024.
---
Appendix A: Trajectory Record Schema
Full JSON schema for a KARL trajectory record, as implemented in `karl/trajectory_tap.py`:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["id", "session_id", "channel", "trajectory", "outcome", "timing"],
  "properties": {
    "id": {
      "type": "string",
      "pattern": "^traj_"
    },
    "session_id": { "type": "string" },
    "channel": {
      "type": "string",
      "enum": ["live", "backfill"]
    },
    "recorded_at": { "type": "string", "format": "date-time" },
    "skill": {
      "type": "object",
      "properties": {
        "name": { "type": ["string", "null"] },
        "domain": { "type": ["string", "null"] }
      }
    },
    "context": {
      "type": "object",
      "properties": {
        "prompt_text": { "type": "string", "maxLength": 500 },
        "cwd": { "type": ["string", "null"] },
        "git_repo": { "type": ["string", "null"] }
      }
    },
    "trajectory": {
      "type": "object",
      "properties": {
        "tool_sequence": {
          "type": "array",
          "items": { "type": "string" }
        },
        "tool_counts": {
          "type": "object",
          "additionalProperties": { "type": "integer" }
        },
        "total_tools": { "type": "integer" },
        "successes": { "type": "integer" },
        "failures": { "type": "integer" },
        "bash_errors": { "type": "integer" },
        "events": {
          "type": "array",
          "maxItems": 50,
          "items": {
            "type": "object",
            "properties": {
              "tool_name": { "type": "string" },
              "key_params": { "type": "object" },
              "success": { "type": ["boolean", "null"] },
              "exit_code": { "type": ["integer", "null"] },
              "ts": { "type": "string" }
            }
          }
        }
      }
    },
    "outcome": {
      "type": "object",
      "properties": {
        "annotation_status": {
          "type": "string",
          "enum": ["pending", "scored"]
        },
        "correction_detected": { "type": ["boolean", "null"] },
        "build_success": { "type": ["boolean", "null"] },
        "redo_detected": { "type": ["boolean", "null"] },
        "session_continued": { "type": ["boolean", "null"] },
        "reward_score": { "type": ["number", "null"] },
        "advantage": { "type": ["number", "null"] },
        "outcome_score": { "type": ["number", "null"] },
        "process_score": { "type": ["number", "null"] },
        "efficiency_score": { "type": ["number", "null"] },
        "verification_score": { "type": ["number", "null"] },
        "consistency_score": { "type": ["number", "null"] }
      }
    },
    "timing": {
      "type": "object",
      "properties": {
        "started_at": { "type": "string" },
        "ended_at": { "type": "string" },
        "duration_s": { "type": ["number", "null"] }
      }
    }
  }
}
Appendix B: Skill Pattern Registry (Excerpt)
The 11 domain labels and representative skill patterns from `trajectory_tap.py`:
| Domain | Skills (subset) | Path Pattern |
|---|---|---|
| ios | spore, creative-director, openclaw-hub, securiclaw, speakflow | `Desktop/Spore/`, `Desktop/CreativeDirector/`, etc. |
| infra | cortex-ops, karl-trajectory, hook-maintenance, deploy-ops, monitoring-ops | `\.claude/cortex/`, `Desktop/karl`, `docker-compose` |
| web | nexus-portal, learnnko, cc-dashboard | `monitoring/nexus-portal/`, `apps/web/learnnko` |
| automation | feed-hub-flow | `flows/feed-hub/` |
| creative | evo-cubed, hef-evolution, frameworks | `evo-cube-output`, `hef-evolutions/` |
| systems | pane-orchestrator, evolution-world, comp-core, symphony, ocp | `\.claude/orchestrator/`, `projects/evolution_world/` |
| ml | creator-shield, agent-intelligence, nko-brain-scanner | `projects/creator-shield/`, `projects/agent-intelligence/` |
| knowledge | vault-writer, vault-ops | `projects/obsidian_vault_writer/`, `obsidian-vault/` |
| data | supabase-ops, discrawl | `supabase`, `projects/discrawl/` |
| desktop | tauri-desktop | `apps/tauri` |
| _global | (unmatched sessions) | (fallback) |
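A registry like the one in the table can be applied by scanning a session's working directory against each path pattern and falling back to `_global` when nothing matches. The sketch below is illustrative only: it copies a subset of the patterns from the table, while the first-match-wins ordering and the names `DOMAIN_PATTERNS` and `resolve_domain` are assumptions rather than the contents of `trajectory_tap.py`.

```python
# Hypothetical sketch of path-based domain resolution.
import re

# (domain, regex) pairs taken from a subset of the table above.
# The order, and therefore which pattern wins on overlap, is an assumption.
DOMAIN_PATTERNS = [
    ("ios", r"Desktop/Spore/"),
    ("ios", r"Desktop/CreativeDirector/"),
    ("infra", r"\.claude/cortex/"),
    ("infra", r"Desktop/karl"),
    ("infra", r"docker-compose"),
    ("web", r"monitoring/nexus-portal/"),
    ("web", r"apps/web/learnnko"),
    ("automation", r"flows/feed-hub/"),
    ("knowledge", r"obsidian-vault/"),
    ("data", r"projects/discrawl/"),
    ("desktop", r"apps/tauri"),
]

_COMPILED = [(domain, re.compile(pattern)) for domain, pattern in DOMAIN_PATTERNS]


def resolve_domain(path):
    """Return the first domain whose pattern occurs in path, else '_global'."""
    if not path:
        return "_global"
    for domain, pattern in _COMPILED:
        if pattern.search(path):
            return domain
    return "_global"
```

For example, a session whose working directory contains `Desktop/Spore/` would resolve to `ios`, while a path matching no pattern lands in the `_global` fallback row.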
Appendix C: Reward Engine Implementation Detail
The reward engine is implemented in `karl/reward_engine.py` (603 lines). Key implementation choices:
1. Neutral base for outcome: Score starts at 0.5 when no outcome signals are available, avoiding the cold-start problem of assigning zero reward to unscorable trajectories.
2. Temporal weighting in process score: Linear weight ramp from 0.5 to 1.5 across the event sequence. This is simpler than exponential or sigmoid ramps and provides clear interpretability: the last event in a 10-event session receives 3x the weight of the first.
3. Shannon entropy for diversity: Normalized Shannon entropy is a standard information-theoretic measure that naturally handles different numbers of tool types. A session using 4 tool types with equal frequency scores higher than one using 4 types where 90
4. File-level locking: `fcntl.LOCK_EX` on the trajectory store prevents corruption from concurrent flushes (e.g., two sessions ending simultaneously). The lock is held only during the write operation, not during reward computation.
5. Bayesian smoothing with strength 10: The smoothing constant $\kappa = 10$ was chosen to require approximately 50 domain-specific trajectories before the domain mean dominates the baseline (at $n = 50$, the domain mean receives $\frac{50}{60} \approx 83\%$ weight). This balances responsiveness to domain-specific patterns against stability in sparse domains.
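The numeric choices in items 2, 3, and 5 can be made concrete with a short sketch. The function names are hypothetical, and three details are assumptions not confirmed by the text above: a single-event session gets the midpoint weight, entropy is normalized by the log of the number of distinct tool types observed, and the smoothing prior is a global mean reward.

```python
# Hypothetical sketch of three reward-engine helpers described above.
import math
from collections import Counter


def linear_ramp_weights(n_events, start=0.5, end=1.5):
    """Linearly spaced weights from start (first event) to end (last event)."""
    if n_events <= 0:
        return []
    if n_events == 1:
        return [(start + end) / 2]  # assumed: midpoint for a single event
    step = (end - start) / (n_events - 1)
    return [start + i * step for i in range(n_events)]


def normalized_entropy(tool_sequence):
    """Shannon entropy of tool usage, scaled to [0, 1] by log(distinct tools)."""
    counts = Counter(tool_sequence)
    distinct = len(counts)
    if distinct < 2:
        return 0.0  # zero or one tool type carries no diversity
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(distinct)


def smoothed_baseline(domain_rewards, global_mean, kappa=10.0):
    """Shrink the domain mean toward global_mean with prior strength kappa."""
    n = len(domain_rewards)
    if n == 0:
        return global_mean
    domain_mean = sum(domain_rewards) / n
    # The domain mean receives weight n / (n + kappa), e.g. 50 / 60 at n = 50.
    return (n * domain_mean + kappa * global_mean) / (n + kappa)
```

With these definitions, the last of 10 events gets three times the weight of the first, four equally used tool types score the maximum normalized entropy, and a domain with 50 trajectories contributes $\frac{50}{60}$ of the baseline, matching the figures quoted in the list.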