# KARL Integration -- Evolution3 / Stage 0: RESEARCH
Run: karl-trajectory-intelligence
Generated: 2026-03-10
Method: Evolution3 -- four-stage recursive evoflow (research-grounded)
Run Directory: Desktop/evo-cube-output/karl-trajectory-intelligence/

## Noosphere Context
No prior dreams or patterns found on this topic. Fresh ground.

---

## 1. Existing Cortex System

### Architecture Overview
The Cortex is a self-improving behavioral intelligence system at `[home-path]` with 7 phases, 17 Python files, 29 tests, and 3 hooks. Its data store is `[home-path]` (currently 399 entries: 324 invocation_records + 75 decay_flags).

### Key Components

#### 1a. Extractor (`[home-path]`, 287 lines)
- Pipeline: 4-pass -- Load+Filter, Tokenize+N-gram, Cluster (Jaccard >0.6), Enrich
- Input: `[home-path]` (currently 903 entries)
- Output: `CortexEntry(type="skill_candidate")` objects
- Filtering: Lines 41-67 define SKIP_PATTERNS (automated prompts, trivial inputs like "yes", "ok", "continue")
- Tokenization: Lines 96-98 -- lowercase, strip punctuation, remove stop words, 2-4 word n-grams
- Clustering: Line 157 -- requires `min_count=15` for meaningful clusters. Jaccard threshold 0.6 at line 192
- Domain detection: Lines 70-81 -- 10 operational domains (ios, deploy, supabase, docker, git, prefect, monitoring, mesh, debug, asc)
- Cap: 30 candidates maximum (line 224)
- Critical gap: Only extracts intent labels from prompts. Does NOT capture tool-use sequences, file paths touched, success/failure signals, or the full trajectory of how a task was accomplished.
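The tokenize and cluster passes can be illustrated with a minimal sketch. It assumes clustering compares sets of 2-4 word n-grams using Jaccard similarity; `STOP_WORDS`, `ngrams`, `jaccard`, and `same_cluster` are hypothetical names rather than the extractor's actual functions, and the stop-word list is an illustrative subset.

```python
import re

# Illustrative subset only; the real stop-word list is not shown in the source.
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "in", "on", "for", "is"}


def ngrams(prompt: str, n_min: int = 2, n_max: int = 4) -> set[str]:
    """Lowercase, strip punctuation, drop stop words, emit 2-4 word n-grams."""
    cleaned = re.sub(r"[^\w\s]", " ", prompt.lower())
    words = [w for w in cleaned.split() if w not in STOP_WORDS]
    return {
        " ".join(words[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(words) - n + 1)
    }


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_cluster(p1: str, p2: str, threshold: float = 0.6) -> bool:
    """Two prompts cluster together when n-gram Jaccard exceeds the threshold."""
    return jaccard(ngrams(p1), ngrams(p2)) > threshold
```

Note that nothing in this representation records which tools ran or whether they succeeded, which is the gap described above.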

#### 1b. Generator (`[home-path]`, 178 lines)
- Input: `CortexEntry` skill candidate
- Output: Static SKILL.md file with frontmatter (YAML) + markdown body
- Template variables: Lines 19-59 define DOMAIN_TOOLS, DOMAIN_MACHINES, DOMAIN_TRIGGERS per domain
- Content: Generic 4-step workflow (check state, execute, verify, report) at lines 112-127
- Critical gap: Generated skills are static templates, not learned procedures. The "Workflow" section is the same 4 generic steps for every skill. No tool sequences, no gotcha accumulation from actual failures, no reward signal integration.

#### 1c. Correction Detector (`[home-path]`, 160 lines)
- Hook: Stop event, 500ms SIGALRM budget
- Detection: 14 regex patterns at lines 32-47 (e.g., "don't...use", "never...do", "always...prefer")
- Scoring: Weighted sum of pattern matches + metacognitive signals + prompt length penalties
- Threshold: 0.6 confidence at line 29
- Output: `CortexEntry(type="correction")` written to entries.jsonl
- Gap: Detects corrections but doesn't link them to the specific tool actions that prompted the correction. No trajectory context.
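A minimal sketch of the scoring scheme described above, assuming a weighted sum that is raised by metacognitive phrases and lowered for long prompts. The three patterns, all weights, the `META` phrases, and the length cutoff are hypothetical stand-ins; only the 0.6 threshold comes from the source.

```python
import re

# Hypothetical (pattern, weight) pairs; the real detector has 14 patterns.
PATTERNS = [
    (re.compile(r"\bdon'?t\b.*\buse\b", re.IGNORECASE), 0.5),
    (re.compile(r"\bnever\b.*\bdo\b", re.IGNORECASE), 0.5),
    (re.compile(r"\balways\b.*\bprefer\b", re.IGNORECASE), 0.4),
]
# Hypothetical metacognitive signal and penalty settings.
META = re.compile(r"\b(i told you|as i said|again)\b", re.IGNORECASE)
META_BONUS = 0.2
LONG_PROMPT_CHARS = 400
LENGTH_PENALTY = 0.2
THRESHOLD = 0.6  # confidence threshold stated in the source


def correction_confidence(prompt: str) -> float:
    """Weighted pattern sum, plus a metacognitive bonus, minus a length penalty."""
    score = sum(weight for pattern, weight in PATTERNS if pattern.search(prompt))
    if META.search(prompt):
        score += META_BONUS
    if len(prompt) > LONG_PROMPT_CHARS:
        score -= LENGTH_PENALTY
    return max(0.0, min(1.0, score))


def is_correction(prompt: str) -> bool:
    return correction_confidence(prompt) >= THRESHOLD
```

The score depends only on the prompt text, so the resulting entry carries no pointer back to the tool calls that triggered the correction.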

#### 1d. Frequency Tracker (`[home-path]`, 100+ lines)
- Threshold ladder: 3x = pane inject, 5x = propose CLAUDE.md rule, 7x = auto-promote (lines 28-29)
- Dedup: Jaccard word similarity >0.5 to group similar corrections (line 43)
- Rule cap: MAX_RULES = 15 in CLAUDE.md (line 24)
- Sentinel blocks: `` format for traceable rules
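The dedup and threshold ladder can be sketched as follows. The thresholds (3, 5, 7) and the 0.5 word-level Jaccard cutoff come from the description above; the function names, action labels, and the greedy first-match grouping are assumptions made for illustration.

```python
def word_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace-separated words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def group_corrections(corrections: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedily attach each correction to the first group whose seed is similar."""
    groups: list[list[str]] = []
    for text in corrections:
        for group in groups:
            if word_jaccard(text, group[0]) > threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return groups


def ladder_action(count: int) -> str:
    """Map a correction-group size to the escalation step."""
    if count >= 7:
        return "auto_promote"
    if count >= 5:
        return "propose_rule"
    if count >= 3:
        return "pane_inject"
    return "none"
```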

#### 1e. Decay Detector (`[home-path]`, 251 lines)
- Ladder: 30d warn, 60d disable, 90d archive (lines 33-35)
- Tracks: Both skills (via invocation_records) and rules (via correction matches)
- Actions: Can disable skills in registry and propose rule removal from CLAUDE.md
- Prefect flow: `cortex_decay_detector` daily at 6AM UTC
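For reference, the decay ladder reduces to a comparison on idle days. This is a sketch under the assumption that idleness is measured from the last invocation timestamp; `decay_action` and its return labels are hypothetical.

```python
from datetime import datetime, timedelta, timezone

WARN_DAYS, DISABLE_DAYS, ARCHIVE_DAYS = 30, 60, 90  # ladder from the source


def decay_action(last_invoked: datetime, now: datetime) -> str:
    """Return the ladder step for a skill or rule idle since last_invoked."""
    idle_days = (now - last_invoked).days
    if idle_days >= ARCHIVE_DAYS:
        return "archive"
    if idle_days >= DISABLE_DAYS:
        return "disable"
    if idle_days >= WARN_DAYS:
        return "warn"
    return "ok"


# Example: a skill last used 45 days before the report date.
NOW = datetime(2026, 3, 10, tzinfo=timezone.utc)
print(decay_action(NOW - timedelta(days=45), NOW))
```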

#### 1f. Ops Trigger / Router (`[home-path]`, 233 lines)
- Hook: UserPromptSubmit event, 500ms budget
- Flow: Load registry (mtime-cached), compile trigger regexes, match prompt, check pane claims, inject SKILL.md content, write invocation_record
- Injection: Prints `<system-reminder>[Cortex] Auto-triggered skill: {name}content</system-reminder>` to stdout
- Claim checking: Lines 124-142 -- prevents duplicate skill injection when another pane holds domain claim

#### 1g. Data Model (`[home-path]`, 169 lines)
- CortexEntry: Single flat dataclass with 6 types: skill_candidate, routing_decision, correction, rule, decay_flag, invocation_record
- Persistence: Append-only JSONL at `[home-path]`
- Missing types for KARL: No `trajectory`, `tool_sequence`, `reward_signal`, or `learned_procedure` type exists

### System Flow Summary

```
prompts-all.jsonl --> extractor.py --> skill_candidate --> generator.py --> SKILL.md
                                                                              |
User prompt --> ops_trigger.py --> regex match? --> inject SKILL.md content    |
                                                                              |
Stop event --> correction_detector.py --> correction --> frequency_tracker.py --> rule --> CLAUDE.md
                                                                              |
Daily cron --> decay/detector.py --> decay_flag --> disable/archive
```

### Core Limitation
The entire Cortex pipeline operates on prompt text only. It has no visibility into:
1. What tools were actually called (Read, Edit, Bash, etc.)
2. In what order they were called
3. Whether they succeeded or failed
4. What files were touched
5. Whether the overall task succeeded
This makes it fundamentally a pattern-matching system, not a trajectory-learning system.

---

## 2. Current Skills Infrastructure

### Registry Stats
- Total skills: 88 in `[home-path]`
- Forged (by Cortex): 80
- Active: 13 (10 ops-enriched + 3 cortex management)
- Inactive/unflagged: 75 (mostly creative/philosophical skills from Jan 2026)
- Domains in use: `any`, `cortex`, plus 10 operational domains

### Skill Directory Layout
`[home-path]` contains 88+ subdirectories, each with a `SKILL.md`. Two distinct generations:
1. Gen 1 (Jan 2026): Hand-authored creative skills (phi:metaphysical, art:divergent, etc.) -- 49 total, mostly never invoked
2. Gen 2 (Mar 2026): Cortex-forged ops skills (ops:deploy, ops:ios, etc.) -- 13 active, enriched with gotchas from MEMORY.md

### Sample SKILL.md -- ops:deploy (`[home-path]`)

```yaml
---
name: ops:deploy
description: "Deploy services to cloud-vm..."
allowed-tools: [Bash, Read, Write, Edit]
forged: true
status: active
generation: 2
domains: [deploy]
target-machine: cloud-vm
auto-trigger:
  - "\b(deploy|restart|systemctl|cloud-vm)\b.*\b(flow|service|hub)\b"
---
```

Body contains: Intent, Workflow (6 generic steps), Gotchas (5 hard-won from MEMORY.md), Key Services table, References.

### How Skills Are Triggered
1. User submits prompt
2. `ops_trigger.py` (UserPromptSubmit hook) loads registry.json
3. Compiles trigger regex patterns from active forged skills
4. Matches prompt text against patterns
5. On match: loads SKILL.md, strips frontmatter, injects via stdout `<system-reminder>`
6. Records `invocation_record` in entries.jsonl
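The trigger steps above can be sketched in a few lines. The regex is the `auto-trigger` pattern from the ops:deploy sample; the `SKILLS` mapping, the function names, and the exact layout of the injected reminder (name, newline, body) are assumptions, not the actual `ops_trigger.py` code.

```python
import re
from typing import Optional

# Hypothetical in-memory stand-in for the active forged skills in registry.json.
SKILLS = {
    "ops:deploy": [r"\b(deploy|restart|systemctl|cloud-vm)\b.*\b(flow|service|hub)\b"],
}


def match_skill(prompt: str, skills: dict[str, list[str]]) -> Optional[str]:
    """Return the first skill whose trigger pattern matches the prompt."""
    for name, patterns in skills.items():
        if any(re.search(p, prompt, re.IGNORECASE) for p in patterns):
            return name
    return None


def strip_frontmatter(skill_md: str) -> str:
    """Drop a leading '---' delimited YAML frontmatter block, if present."""
    if skill_md.startswith("---\n"):
        end = skill_md.find("\n---", 3)
        if end != -1:
            return skill_md[end + 4:].lstrip("\n")
    return skill_md


def render_injection(name: str, skill_md: str) -> str:
    """Wrap the skill body in the reminder tag that the hook prints to stdout."""
    body = strip_frontmatter(skill_md)
    return f"<system-reminder>[Cortex] Auto-triggered skill: {name}\n{body}</system-reminder>"
```

Matching happens once, on the prompt text, before any tool runs; nothing in this path observes what the agent does after the skill is injected.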

### Critical Observation
Skills are static documents injected as context. They don't adapt based on whether the agent's actions succeeded. A SKILL.md that says "Step 3: Verify success" doesn't know whether success was actually verified or what happened when verification failed. The KARL approach would replace these static playbooks with learned tool-use trajectories that adapt based on outcome rewards.

---

## 3. Hooks Architecture & Tool-Use Recording

### Current Hook Coverage
17/18 Claude Code events are hooked (per `hooks-architecture.md`):
- UserPromptSubmit: 5 hooks (prompt logging, design-trigger, correction detector, ops trigger, plan detector)
- Stop: 3 hooks (response logging, auto-continue, correction detector)
- PostToolUse: 5 hooks (file tracking, plan sync, cron bridge, bash audit, Discord progress)
- PreToolUse: 4 hooks (memory guardian, claim guard, design trigger)
- Plus 8 more event types

### What IS Currently Recorded

#### Prompt Logger (`[home-path]`, 574 lines)
Records to `[home-path]`:
- Prompt text, session_id, CWD, timestamp, git context (repo, branch, commit, dirty state)
- Links prompts to parent prompts via hook state
- Mycelium keyword frequency sensing

#### Response Hook (`[home-path]`, 1283 lines)
Records verbose response data including:
- Full text response
- All tool calls with parameters and results (lines 225-285)
- Thinking blocks (if extended thinking)
- File diffs for Edit/Write operations
- Timing metrics, token usage
- Exit codes from Bash commands

#### Post Tool Hook (`[home-path]`, 287 lines)
Fires after each tool use:
- Cadence signaling for governance files
- Rate limit detection in tool results
- Plan file edit tracking
- Background task/subagent tracking

#### Unified Store (`[home-path]`, 404 lines)
Single source of truth: `[home-path]` (3,909 entries, 8.4MB)
Schema per entry:

```json
{
  "id": "unique-id",
  "session_id": "...",
  "prompt": "user input text",
  "response": "assistant response text",
  "tool_calls": [{"name": "Read", "file": "foo.py"}, ...],
  "tool_count": 5,
  "duration_ms": 1234,
  "cwd": "/path",
  "git_repo": "repo-name",
  "tokens": {"input": 100, "output": 500}
}
```

#### Bash Audit (`[home-path]`)
Records all Bash tool invocations with commands and results.

### What IS Available but NOT Linked to Learning
The response_hook already captures detailed tool-use sequences including:
- Tool name, parameters, result, timing
- File diffs for Edit/Write
- Exit codes for Bash
- Thinking blocks (decision reasoning)

This is exactly the trajectory data KARL needs. The data exists but flows into storage (unified.jsonl, verbose-all.jsonl) without any feedback loop to skill improvement. The unified store has 3,909 entries with tool_calls arrays -- this is a goldmine of trajectory data that currently goes unused for learning.
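As a rough illustration of how the stored entries could feed a learning loop, the sketch below counts tool-name sequences from JSONL lines shaped like the unified-store schema above. The function names are hypothetical, and it assumes each `tool_calls` item is an object with a `name` key, as in the schema sample.

```python
import json
from collections import Counter
from typing import Iterable


def tool_sequence(entry: dict) -> tuple[str, ...]:
    """Ordered tool names for one unified-store entry (empty if none recorded)."""
    calls = entry.get("tool_calls") or []
    return tuple(c["name"] for c in calls if isinstance(c, dict) and "name" in c)


def count_sequences(lines: Iterable[str]) -> Counter:
    """Count how often each exact tool sequence appears across JSONL lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        seq = tool_sequence(json.loads(line))
        if seq:
            counts[seq] += 1
    return counts


# Usage with in-memory sample lines instead of the real file.
SAMPLE = [
    '{"id": "a", "tool_calls": [{"name": "Read", "file": "foo.py"}, {"name": "Edit", "file": "foo.py"}]}',
    '{"id": "b", "tool_calls": [{"name": "Read", "file": "bar.py"}, {"name": "Edit", "file": "bar.py"}]}',
    '{"id": "c", "tool_calls": []}',
]
print(count_sequences(SAMPLE).most_common(1))
```

Sequence counts alone carry no outcome information; a reward signal would still have to be joined in from another source.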

### Key Files for Trajectory Recording
| File | Purpose | Lines |
|------|---------|-------|
| `[home-path]` | Captures tool sequences + results | 1283 |
| `[home-path]` | Flat prompt+response+tools store | 404 |
| `[home-path]` | Per-tool event tracking | 287 |
| `[home-path]` | Parses Claude transcript JSONL | ~200 |
| `[home-path]` | VerboseToolCall, ThinkingBlock models | ~200 |
| `[home-path]` | File snapshot + diff capture | ~150 |

---

## 4. Mac5 Fine-Tune Pipeline

### Current Infrastructure
From MEMORY.md and memory topic files:

| Component | Location | Status |
|-----------|----------|--------|
| Fine-tune daemon | Mac5 `:9200` | Running (Prometheus metrics, /status, /health) |
| MLX Server | Mac5 `:8100` | Running (OpenAI-compatible, serves fused model) |
| Adapter v1 | Mac5 | Trained 2026-03-04, loss 1.694, 972 train examples, 188.4s on M4 |
| TB5 link | Mac4 `[ip]` <-> Mac5 `[ip]` | Static IPs, active |
| LaunchAgent | `com.compcore.cognitive-twin` | Config/dependency issue (blocked) |

### Pipeline Flow

```
Prompt logs --> corpus-to-sft.py --> SFT JSONL --> MLX LoRA --> Fused model --> MLX Server :8100
```

### Key Gotchas (from MEMORY.md)
- MLX LoRA CLI (v0.29+): Use `python3 -m mlx_lm lora` (NOT `mlx_lm.lora`), `--num-layers` (NOT `--lora-layers`)
- Gemma 3 GGUF: `mlx_lm fuse --export-gguf` NOT supported for gemma3_text. Use fused safetensors via MLX server
- corpus-to-sft.py: `el.get("text")` returns `None` not `""` -- always use `(el.get("text") or "")`
- Mac5 has no Xcode: Built exo with `--no-deps` + pre-installed MLX wheel
- Metal Toolchain: New Xcode versions need `xcodebuild -downloadComponent MetalToolchain` before building MLX from source

### Relevance to KARL
The fine-tune pipeline currently trains on conversational text (prompt-response pairs). KARL would extend this to train on tool-use trajectories with outcome rewards. The MLX LoRA infrastructure on Mac5 is already capable of running this training -- the missing piece is the trajectory data format (currently prompt text only, not structured tool sequences with rewards) and the OAPL-style reward computation.

### NUMU Weave Package
From `numu-fare.md` line 36: `numu-weave` (270 lines) handles "Cognitive twin: corpus --> Mac5 LoRA trainer --> A/B evaluator". This is the existing pipeline connector that would need to be extended to handle trajectory-formatted training data.

---

## 5. RAG++ and Search Infrastructure

### Current Search Stack
| Component | Port | Tech | Purpose |
|-----------|------|------|---------|
| RAG++ | :8000 (SSH tunnel) | Docker, pgvector | Dual-plane retrieval (raw turns + semantic artifacts) |
| Graph Kernel | :8001 (SSH tunnel) | Rust native binary | Entity graph traversal, predicate-based reasoning |
| Context Intelligence | :8002 | Docker | Contextual scoring and ranking |
| cc-semantic | :8003 | Docker | Semantic similarity |
| NUMU Memory | (package) | BM25 + Vector + temporal + MMR | Hybrid search in daemon |

### RAG++ Gateway
- Endpoint: `POST /api/rag/gateway/context`
- Input: query, cwd, session_id, k_rag, max_tokens, include_graph
- Output: `related_turns` (pgvector), `graph_context` (GK traversal), `admissibility_token` (provenance)
- Embeddings: Gemini API (`GOOGLE_API_KEY` required)
- Storage: pgvector in Docker (persistent)

### Graph Kernel (GK)
- Rust native binary on cloud-vm (NOT Docker)
- Accessed from Mac1 via SSH tunnel
- Predicates: `works_on`, `built_with`, `has_feature`, `uses`, `evolved_from`, `is_a`, `belongs_to`, `tagged`, `deployed_on`
- Entities are bare lowercase names (e.g., `spore`, `evolution`)

### Evolution World Already Uses Search
The daemon's sense phase (daemon.py lines 399-450) queries RAG++ gateway for semantic context:

```python
gk_payload = {"query": dynamic_query, "cwd": ..., "include_graph": True}
# Returns related_turns, graph_context, admissibility_token
```

### Relevance to KARL
KARL's architecture uses vector search as its primary tool. The existing RAG++ + pgvector + GK infrastructure maps directly to KARL's search component. Key difference: KARL learns which queries to issue through RL optimization, while current usage is hardcoded queries. A KARL integration could use the existing search infrastructure as the "tool" that the trained agent learns to call optimally.

### Search Quality Gap
No existing tracking of search result quality. When RAG++ returns results, there's no feedback loop measuring whether those results actually helped the agent complete its task. KARL's reward functions address exactly this gap.

---

## 6. Evolution World

### Architecture (`[home-path]`)
A recursive two-layer evolution engine with 4 non-halting invariants:

- L1 (App Evolution): Genomes mutate toward milestones via TIE techniques
- L2 (Meta-Evolution): The evolution process itself adapts based on L1 outcomes
- L3 (Meta-Meta-Evolution): Evolves L2's constants every 30 L1 steps
- 4 invariants: KL divergence, bounded divergence, cross-layer forcing, no absorbing states

### Daemon (`daemon.py`, 966 lines)
5-phase heartbeat cycle: sense --> select --> mutate --> check --> adapt
- Heartbeat interval: Adaptive [30s, 600s], controlled by L2 fitness dynamics or feedback metabolism
- CALC integration: Dispatches tasks to multiple agents (Claude, Codex, Gemini) via NUMU bus
- NUMU listener: WebSocket-based instant completion signals (lines 195-336)
- Pulse Bridge: Blocking or non-blocking dispatch to code agents

### Engine (`engine.py`, 401 lines)
- Selection strategies: elitist, diversity_driven, rank_based, roulette, tournament (lines 346-369)
- Population crossover: Every 3 L1 cycles, transfers capabilities between apps (lines 158-168)
- Graduation: Apps at fitness >= 0.85 can graduate or speciate (lines 260-279)
- Immune system: Records invariant violations, quarantines harmful techniques

### Relevance to KARL
Evolution World is the closest existing system to KARL's RL approach:

| Aspect | Evolution World | KARL |
|--------|-----------------|------|
| Optimization target | App fitness (milestone progress) | Search accuracy (nugget completion) |
| Selection | L2 strategy (tournament, elitist, etc.) | RL policy optimization |
| Reward signal | Milestone delta, novelty bits | Nugget-based accuracy score |
| Adaptation | L2 evolves L1 techniques | OAPL updates policy weights |
| Exploration | Diversity-driven selection, dream storms | Off-policy rollout diversity |
| State | Genome (capabilities, fitness, milestones) | Agent policy (model weights) |

The L1/L2/L3 layered evolution in EW maps to KARL's iterative training loops where improved models generate better training data (bootstrapping). EW's immune system maps to KARL's quality filtering (removing trivially solved/unsolved examples).

---

## 7. KARL Paper -- External Research

### Paper Overview
Title: KARL: Knowledge Agents via Reinforcement Learning
Authors: Databricks AI Research
Published: March 2026 (arXiv: 2603.05218)
Claims: Matches Claude Opus 4.6 on KARLBench at 33

### OAPL Algorithm
OAPL (Optimal Advantage-based Policy Optimization with Lagged Inference policy):

Core objective (KL-regularized RL):

```
max_pi E[r(x,y) - beta * KL(pi(.|x) || pi_ref(.|x))]
```

Closed-form optimal policy:

```
pi*(y|x) proportional to pi_ref(y|x) * exp(r(x,y) / beta)
```

Regression loss (the key innovation):

```
min_pi sum_x sum_i (beta * ln(pi(y_i|x) / pi_ref(y_i|x)) - (r(x,y_i) - V*(x)))^2
```

Where `V*(x) = beta * ln(1/G * sum_i exp(r(x,y_i) / beta))` estimates the optimal value from the `G` rollouts sampled for prompt `x`.

Key property: Stable with policy lags of 400+ gradient steps (100x more off-policy than prior approaches). No importance weighting or clipped ratios needed. Uses 3x fewer training samples than GRPO.
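The regression loss can be checked numerically with a small sketch. It assumes the per-trajectory log-probabilities under the current and reference policies are already available as plain floats for one prompt; `optimal_value` and `oapl_loss` are hypothetical names, and the log-sum-exp is stabilized by subtracting the maximum, which is an implementation choice not taken from the paper.

```python
import math


def optimal_value(rewards: list[float], beta: float) -> float:
    """V*(x) = beta * ln((1/G) * sum_i exp(r_i / beta)), computed stably."""
    scaled = [r / beta for r in rewards]
    m = max(scaled)
    mean_exp = sum(math.exp(s - m) for s in scaled) / len(scaled)
    return beta * (m + math.log(mean_exp))


def oapl_loss(logp: list[float], logp_ref: list[float],
              rewards: list[float], beta: float) -> float:
    """Sum over rollouts of (beta * ln(pi/pi_ref) - (r - V*))^2 for one prompt."""
    v_star = optimal_value(rewards, beta)
    return sum(
        (beta * (lp - lr) - (r - v_star)) ** 2
        for lp, lr, r in zip(logp, logp_ref, rewards)
    )


# With identical rewards every advantage is zero, so an unchanged policy has zero loss.
print(oapl_loss([-1.0, -2.0], [-1.0, -2.0], [1.0, 1.0], beta=0.5))
```

The loss is also zero when `ln(pi/pi_ref) = (r - V*) / beta` for every rollout, which is the closed-form optimal policy restated in log space.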

### Six Enterprise Search Behaviors
| Task | Type | Characteristic |
|------|------|---------------|
| BrowseComp-Plus | Constraint-driven entity search | Progressive filtering across attributes |
| TREC-Biogen | Cross-document report synthesis | Integrating dispersed findings |
| FinanceBench | Tabular + long-document reasoning | 100+ page financial reports |
| QAMPARI | Exhaustive entity retrieval | Complete set enumeration |
| FreshStack | Procedural technical docs | Step-by-step implementation |
| PMBench | Enterprise fact aggregation | Informal/fragmented internal docs |

Multi-task training on just BrowseComp-Plus + TREC-Biogen generalizes to all 4 held-out tasks.

### Synthetic Data Pipeline (Self-Play)
Stage I: Question-Answer synthesizer explores corpus via vector search, proposes question-answer pairs. Iterative bootstrapping: improved models generate better training data.

Stage II: Multiple solver instances attempt each question. Quality filtering:
1. Pass-rate filtering (remove trivially solved/unsolved)
2. LLM judge quality filtering (detect ambiguity, factual errors)
3. Deduplication
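Pass-rate filtering, the first of these steps, can be sketched as below. It assumes each question maps to a list of boolean solver outcomes and that "trivially solved/unsolved" means a pass rate of exactly 1.0 or 0.0; the paper's actual cutoffs are not given here, so `low` and `high` are hypothetical parameters.

```python
def pass_rate_filter(results: dict[str, list[bool]],
                     low: float = 0.0, high: float = 1.0) -> list[str]:
    """Keep questions whose solver pass rate lies strictly between low and high."""
    kept = []
    for question, outcomes in results.items():
        if not outcomes:
            continue  # no attempts recorded, nothing to estimate
        rate = sum(outcomes) / len(outcomes)
        if low < rate < high:
            kept.append(question)
    return kept


# Usage: only the question with mixed outcomes survives.
attempts = {
    "always solved": [True] * 8,
    "never solved": [False] * 8,
    "informative": [True, False, True, False, False, True, False, False],
}
print(pass_rate_filter(attempts))
```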

Scale: Iteration 1 produced 1,218 + 6,270 prompts. Iteration 2 produced 1,336 + 11,371 using improved model.

### Reward Functions
Unified nugget-based completion framework:
- Binary tasks: single nugget (correct/incorrect)
- Entity retrieval: each entity = separate nugget
- Report synthesis: multiple reference answers consolidated into nuggets
- Rewards assigned per-trajectory, same value to all segments when rollouts split at compression steps
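A minimal sketch of a nugget-style reward, assuming the score is the fraction of reference nuggets that the answer covers and that coverage has already been decided upstream (for example by a judge). Both function names are hypothetical, and the exact scoring rule in the paper may differ.

```python
def nugget_reward(covered: set[str], nuggets: set[str]) -> float:
    """Fraction of reference nuggets covered by the answer."""
    if not nuggets:
        raise ValueError("at least one reference nugget is required")
    return len(covered & nuggets) / len(nuggets)


def segment_rewards(trajectory_reward: float, num_segments: int) -> list[float]:
    """Every segment of a split rollout receives the same trajectory-level reward."""
    return [trajectory_reward] * num_segments


# Binary task: one nugget. Entity retrieval: one nugget per entity.
print(nugget_reward({"answer"}, {"answer"}))
print(nugget_reward({"e1", "e3"}, {"e1", "e2", "e3", "e4"}))
```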

### Trajectory Structure
- Multi-step rollouts containing model outputs (search queries) and tool outputs (documents)
- Log-probability computation masks out non-model tokens (prompts, tool outputs)
- Long trajectories split at compression boundaries: compressed_history --> follow-up steps
- Max horizons: 50-200 steps depending on task complexity
- 8 rollouts per training prompt
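The masking step can be illustrated with plain lists. This sketch assumes per-token log-probabilities and a parallel boolean mask marking model-generated tokens; `trajectory_logprob` is a hypothetical name.

```python
def trajectory_logprob(token_logps: list[float], model_mask: list[bool]) -> float:
    """Sum log-probabilities over model-generated tokens only.

    Tokens from prompts and tool outputs (mask False) are excluded, so the
    policy is scored only on what it actually produced.
    """
    if len(token_logps) != len(model_mask):
        raise ValueError("token_logps and model_mask must have the same length")
    return sum(lp for lp, is_model in zip(token_logps, model_mask) if is_model)


# Layout: prompt token, two model tokens (a search query), one tool-output token.
print(trajectory_logprob([-0.5, -1.0, -0.25, -2.0], [False, True, True, False]))
```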

### Performance
| Model | In-Dist Avg | Out-of-Dist | Total |
|-------|-------------|-------------|-------|
| GLM 4.5 Air (base) | 55.4 | 51.2 | 52.6 |
| GPT 5.2 | 54.9 | 51.8 | 52.8 |
| Claude 4.6 Opus | 77.9 | 62.3 | 67.5 |
| KARL (N=10) | 77.1 | 62.7 | 67.5 |

Test-time compute scaling: N=3 rollouts = 64.1, N=10 = 67.5, N=20 = 68.1.

### Key Insight for Our Integration
KARL shows that trajectory-based RL training on tool-use sequences with outcome rewards substantially improves performance over static prompting. The base model (GLM 4.5 Air, 52.6 total) reaches parity with Claude Opus (67.5) after trajectory training combined with N=10 test-time rollouts; at N=3 the trained model scores 64.1. This is the same leap we want: moving from static SKILL.md templates to learned procedures that improve through experience.

---

## 8. Key Constraints

### Hard Constraints
1. Hook budget: All hooks must complete in <500ms (SIGALRM enforced). Trajectory recording must be lightweight -- no model inference in hooks.
2. Mac5 memory: M4 16GB limits LoRA training to smaller models. Cannot run OAPL at KARL's scale (they used GLM 4.5 Air).
3. No multi-GPU: Mac5 is a single M4. OAPL's large-batch off-policy training assumes distributed GPU clusters.
4. Existing data format: 3,909 unified.jsonl entries exist but tool_calls are simplified (name + key param only). Full trajectory would need verbose response data from verbose-all.jsonl or transcript files.
5. Registry schema: `registry.json` is schema-validated. Adding new fields requires careful migration.
6. CLAUDE.md protected: Memory Guardian enforces additive-only, max 30
7. Supabase table count: 141 tables already. New tables for trajectory storage should consolidate, not proliferate.

### Soft Constraints
1. Prompt log volume: 903 entries in prompts-all.jsonl. For KARL-style training, need 1000+ trajectory examples minimum.
2. Cortex entry types: Currently 6 types. Adding trajectory/reward types requires model.py changes.
3. MLX ecosystem: Limited to models with MLX support. Cannot directly use GLM 4.5 Air.
4. Cross-machine sync: Cortex propagation uses Supabase. Trajectory data would need same sync mechanism.

### Resource Constraints
| Resource | Available | KARL Needs | Gap |
|----------|-----------|------------|-----|
| Trajectory data | 3,909 unified entries (simplified tools) | 1000+ full trajectories | Need verbose tool sequences with success/fail |
| Compute (training) | Mac5 M4 16GB, MLX LoRA | Distributed GPU cluster | 100x scale gap -- need efficient adaptation |
| Reward signals | None (no outcome tracking) | Per-trajectory outcome scores | Must build reward computation |
| Search infrastructure | RAG++ + GK + pgvector | Vector search tool | Already have this |
| Policy model | Gemma/LLaMA via MLX | GLM 4.5 Air | Different base model; the LoRA method carries over, but trained adapters are not portable across bases |

---

## 9. Open Questions for Stage 1

### Architecture Questions
1. Where does OAPL fit in our stack? Full OAPL requires distributed training. Can we use a simplified version (e.g., offline advantage estimation from logged trajectories) that runs on Mac5's M4?

2. What is the "policy" in our context? KARL trains a search agent's policy (which queries to issue). In our case, the "policy" could be: (a) which tools to call in what order, (b) which SKILL.md content to inject, or (c) weights in a routing function. Option (b) is most tractable.

3. Trajectory granularity: Should we record per-turn tool sequences (what KARL does) or per-session task completions? Per-turn gives richer signal but produces many more records per session to store and score.

4. Reward function design: KARL uses nugget-based accuracy. Our equivalent could be: (a) task completion (did the user say "ok" / "continue" after?), (b) correction absence (no behavioral correction detected within 3 prompts), (c) build success (for iOS/deploy tasks), (d) time-to-completion.
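To make the reward options in question 4 concrete, the sketch below combines the four candidate signals into one score. Everything here is hypothetical: the `OutcomeSignals` fields, the equal weighting, and the 60-second time budget are placeholders for discussion, not a proposed design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutcomeSignals:
    user_acknowledged: bool             # (a) user replied "ok" / "continue"
    correction_within_3_prompts: bool   # (b) a behavioral correction followed
    build_succeeded: Optional[bool]     # (c) None when no build was involved
    duration_ms: int                    # (d) time to completion


def trajectory_reward(s: OutcomeSignals, time_budget_ms: int = 60_000) -> float:
    """Equal-weight mean of the available signals, each mapped into [0, 1]."""
    parts = [
        1.0 if s.user_acknowledged else 0.0,
        0.0 if s.correction_within_3_prompts else 1.0,
        max(0.0, 1.0 - s.duration_ms / time_budget_ms),
    ]
    if s.build_succeeded is not None:
        parts.append(1.0 if s.build_succeeded else 0.0)
    return sum(parts) / len(parts)
```

Treating the build signal as optional keeps non-build tasks from being penalized for a missing measurement.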

### Implementation Questions
5. Data pipeline: Should trajectory collection be a new hook (PostToolUse extension) or a post-hoc analysis of existing unified.jsonl + transcript files?

6. Skill evolution: Should skills "evolve" by updating SKILL.md content based on trajectories, or should we move to a completely different representation (e.g., embedding-based skill vectors that are retrieved via similarity)?

7. Integration with Evolution World: Should KARL-style training be a new L4 layer in Evolution World, or a separate system that feeds improved skills into EW's L1?

8. Bootstrapping: KARL uses self-play to generate training data. Can we bootstrap from existing unified.jsonl (3,909 entries) as seed data, then iteratively improve?

### Validation Questions
9. Measurement: How do we measure whether trajectory-learned skills outperform static SKILL.md? A/B testing across panes? Before/after metrics on task completion time?

10. Failure modes: What happens when a learned trajectory is wrong? KARL filters training data aggressively (pass-rate filtering, quality judges). We need equivalent safety mechanisms.

11. Decay interaction: How do trajectory-learned skills interact with the existing decay detector? A skill that worked last week might not work after a codebase change.

### Prioritization Questions
12. What to KARL-ify first? The 10 ops skills (deploy, ios, git, etc.) have the highest usage frequency and clearest success/failure signals. Creative skills (phi:metaphysical) have no measurable outcome.

13. Minimal viable trajectory: What's the smallest useful trajectory? Just "prompt + tool sequence + success_bit"? Or do we need the full KARL treatment (multi-step rollouts, compression, nugget-based rewards)?
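The smallest candidate in question 13 could be represented as follows. This is a sketch only: `MinimalTrajectory`, its field names, and the `"trajectory"` type label are hypothetical and do not exist in the current `CortexEntry` model.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class MinimalTrajectory:
    prompt: str
    tool_sequence: list[str]   # ordered tool names, e.g. ["Read", "Edit", "Bash"]
    success: bool              # the single success bit
    type: str = "trajectory"   # hypothetical new entry type


def to_jsonl_line(t: MinimalTrajectory) -> str:
    """Serialize one record as a single line for an append-only JSONL store."""
    return json.dumps(asdict(t), sort_keys=True)


record = MinimalTrajectory(
    prompt="restart the hub service",
    tool_sequence=["Bash", "Read", "Bash"],
    success=True,
)
print(to_jsonl_line(record))
```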

---

## Sources

### Codebase Files Read
- `[home-path]` (287 lines)
- `[home-path]` (178 lines)
- `[home-path]` (160 lines)
- `[home-path]` (100+ lines)
- `[home-path]` (251 lines)
- `[home-path]` (233 lines)
- `[home-path]` (169 lines)
- `[home-path]` (399 entries: 324 invocation_records, 75 decay_flags)
- `[home-path]` (574 lines)
- `[home-path]` (1283 lines)
- `[home-path]` (287 lines)
- `[home-path]` (404 lines)
- `[home-path]` (100+ lines)
- `[home-path]` (80+ lines)
- `[home-path]` (100+ lines)
- `[home-path]` (88 skills, 80 forged, 13 active)
- `[home-path]` (Gen 2 enriched skill example)
- `[home-path]` (Gen 1 hand-authored skill example)
- `[home-path]` (401 lines)
- `[home-path]` (966 lines)
- `[home-path]` (3,909 entries, 8.4MB)
- `[home-path]` (903 entries)

### Memory Files Consulted
- `[home-path]`
- `[home-path]`
- `[home-path]`
- `[home-path]`
- `[home-path]`

### External Sources
- [KARL: Knowledge Agents via Reinforcement Learning (arXiv)](https://arxiv.org/abs/2603.05218)
- [KARL HTML paper](https://arxiv.org/html/2603.05218v1)
- [Databricks KARL PDF](https://www.databricks.com/sites/default/files/2026-03/karl.pdf)
- [Databricks Blog: Meet KARL](https://www.databricks.com/blog/meet-karl-faster-agent-enterprise-knowledge-powered-custom-rl)
- [VentureBeat: Databricks RAG Agent](https://venturebeat.com/data/databricks-built-a-rag-agent-it-says-can-handle-every-kind-of-enterprise)
- [WinBuzzer: KARL 33
- [OAPL Algorithm (alphaXiv)](https://www.alphaxiv.org/overview/2603.05218)

Source Anchor

evo-cube-output/karl-trajectory-intelligence/stage0-research.md