Grand Diomande Research · Full HTML Reader

Stage 0: RESEARCH -- Flow RL: From GRPO to SAMPO



## Source
- Video: code4AI "From GRPO to SAMPO: Solving Training Collapse in Agentic RL" (XoS5RlM2kog, score 7.5/10)
- FlowRL paper: arXiv 2509.15207 (LUMIA Lab, Sep 2025)
- PACED-RL paper: arXiv 2602.12642 (Feb 2026)
- ARLArena/SAMPO paper: arXiv 2602.21534 (UCLA, Feb 2026)
- GFlowNet Foundations: JMLR 2024, Bengio et al.

---

## 1. What Exists Today (KARL Codebase)

### File Inventory (14 Python files, [home-path])
| File | Lines | Purpose |
|------|-------|---------|
| `trajectory_tap.py` | 349 | 4 tap points (A/B/C/D) wired into Claude Code hooks |
| `reward_engine.py` | 428 | 3-signal composite reward (outcome 40%, …) |
| `embedding_cache.py` | 214 | LRU cache for 3072-dim Gemini embeddings, async embed |
| `sft_exporter.py` | 323 | Advantage-weighted SFT export (OAPL-Lite oversampling) |
| `karl_trainer.py` | 269 | Mac5 training orchestration (SSH/SCP, MLX LoRA trigger) |
| `weight_updater.py` | 148 | EMA weight updates for skill embeddings |
| `trajectory_bridge.py` | 463 | Shadow routing analysis, promotion gate, EW technique recs |
| `bootstrap_skill_embeddings.py` | ~200 | Pre-compute skill vectors |
| `trajectory_extractor.py` | ~200 | Historical backfill from prompt logs |
| `karl_training_flow.py` | 168 | Weekly Prefect training flow (Sunday 3am) |
| `karl_analysis_flow.py` | 179 | Daily Prefect analysis flow (6:30am) |
| `synthetic_qa.py` | ~300 | Git-commit-based synthetic Q&A generation |

### Data Inventory
| File | Records | Size |
|------|---------|------|
| `trajectories.jsonl` | 111 | 620 KB |
| `routing_shadow.jsonl` | 87 | 19 KB |
| `karl-sft.jsonl` | 35 | 40 KB |
| `synthetic_qa.jsonl` | 37 | 23 KB |
| `skill_embeddings.pkl` | 13 skills | 360 KB |
| `prompt_embedding_cache.pkl` | ~100 entries | 166 KB |

### Reward Distribution (Current)

    Mean:  0.6047    Std: 0.0902
    Min:   0.3223    Max: 0.7929

    Bucket distribution:
      0.3: 1  (0.9%)
      0.4: 3  (2.7%)
      0.5: 32 (28.8%)  <-- mode
      0.6: 41 (36.9%)  <-- mode
      0.7: 27 (24.3%)
      0.8: 7  (6.3%)

### Advantage Distribution
- Mean advantage: 0.1047
- Positive advantages: 104/111 (93.7%)
- Negative advantages: 7/111 (6.3%)
- Only 1 skill label populated ("test"), domains empty

### Training Infrastructure
- Mac5: M4 16GB, MLX LoRA on gemma-3-1b-it-4bit
- Training params: 500 iters, batch 1, LoRA rank 8, 4 layers, lr 1e-5, max_seq 256
- Adapter v2: 35 SFT examples, test loss 1.843
- Fine-tune daemon on :9200 (Prometheus metrics)

### Current Training Algorithm
1. Compute 3-signal reward per trajectory (hardcoded weights: 0.40/0.35/0.25)
2. Compute advantage = reward - domain_baseline (default 0.5)
3. Filter trajectories with positive advantage
4. Oversample high-advantage trajectories (1x/2x/3x based on advantage buckets)
5. Export to ChatML SFT format
6. Train MLX LoRA on Mac5
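
Steps 2 to 4 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `sft_exporter.py`: the `Trajectory` class, the `select_for_sft` function, and the bucket edges (0.1 and 0.2) are hypothetical. The source states only the 0.5 default baseline, the positive-advantage filter, and the 1x/2x/3x oversampling.

```python
from dataclasses import dataclass

DEFAULT_BASELINE = 0.5  # default domain baseline, as stated in the text


@dataclass
class Trajectory:
    """Hypothetical record shape; the real JSONL schema is not shown in the source."""
    prompt: str
    reward: float
    domain: str = ""


def oversample_factor(advantage: float) -> int:
    """Map an advantage to 1x/2x/3x. The bucket edges are assumed, not from the source."""
    if advantage >= 0.2:
        return 3
    if advantage >= 0.1:
        return 2
    return 1


def select_for_sft(trajectories, baselines=None):
    """Drop non-positive advantages, then repeat the rest by bucket."""
    baselines = baselines or {}
    selected = []
    for t in trajectories:
        advantage = t.reward - baselines.get(t.domain, DEFAULT_BASELINE)
        if advantage <= 0:
            continue  # step 3: excluded entirely
        selected.extend([t] * oversample_factor(advantage))  # step 4
    return selected
```

Under these assumed edges, a trajectory just above the baseline appears once in the export while the best ones appear three times, and anything at or below the baseline contributes nothing.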

**Critical observation:** This is pure reward maximization via advantage-weighted SFT oversampling. High-advantage trajectories are overrepresented up to 3x, and trajectories with non-positive advantage are excluded entirely. This is the mode-collapse pattern that FlowRL was designed to fix.

---

## 2. What FlowRL Proposes (The Theory)

### Core Insight: Distribution Matching vs Reward Maximization

Standard RL (PPO, GRPO, KARL's current OAPL-Lite):

    max E[r(x,y)]    -- maximize expected reward

This objective tends to converge on a single high-reward strategy and to abandon other viable approaches.

FlowRL:

    min D_KL(pi_theta(y|x) || pi_tilde(y|x))    -- match the reward distribution

Where:

    pi_tilde(y|x) = exp(beta * r(x,y)) * pi_ref(y|x) / Z_phi(x)

The policy learns to sample trajectories in PROPORTION to their rewards, not just the best ones.
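
A small numeric illustration of the target distribution may help. The sketch below evaluates `pi_tilde` over a finite list of candidate responses, which is a simplification: in FlowRL the normalizer is learned as `Z_phi(x)` because summing over all possible responses is intractable. The function name and the example value `beta=5.0` are assumptions chosen for illustration.

```python
import math


def target_distribution(rewards, ref_probs, beta):
    """pi_tilde(y|x) = exp(beta * r) * pi_ref / Z, with Z summed over the given candidates."""
    logits = [beta * r + math.log(p) for r, p in zip(rewards, ref_probs)]
    shift = max(logits)  # subtract the max for numerical stability
    unnormalized = [math.exp(l - shift) for l in logits]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


# Three candidate responses with a uniform reference policy.
rewards = [0.5, 0.6, 0.7]
probs = target_distribution(rewards, [1 / 3, 1 / 3, 1 / 3], beta=5.0)
for r, p in zip(rewards, probs):
    print(f"reward={r:.1f}  target_prob={p:.3f}")
```

Pure reward maximization would put all probability on the 0.7 response. The target distribution keeps every candidate in play, with the ratio between any two candidates set by `exp(beta * (r_a - r_b))`.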

### The Trajectory Balance Loss

FlowRL's loss is mathematically equivalent to GFlowNet's trajectory balance:

    L_FlowRL = w * (log Z_phi(x) + (1/|y|) * log pi_theta(y|x) - beta * r_hat(x,y) - (1/|y|) * log pi_ref(y|x))^2

Components:
- `Z_phi(x)`: Learnable partition function (3-layer MLP, trained 10x faster than policy)
- `beta`: Temperature (15 in the paper)
- `r_hat(x,y)`: Group-normalized rewards
- `w`: Clipped importance weight (PPO-style)
- `1/|y|`: Length normalization for gradient stability
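
The loss can be evaluated for a single sample as a direct transcription of the formula above. This is a sketch, not the paper's implementation: the function name and argument names are hypothetical, the log-probabilities are assumed to be sequence totals, and `w` is passed in rather than computed.

```python
def flowrl_loss(log_z, logp_theta, logp_ref, r_hat, length, beta=15.0, w=1.0):
    """Squared trajectory-balance residual for one (x, y) pair.

    log_z      : log Z_phi(x), output of the partition-function model
    logp_theta : log pi_theta(y|x), summed over tokens
    logp_ref   : log pi_ref(y|x), summed over tokens
    r_hat      : group-normalized reward
    length     : |y|, number of tokens in the response
    w          : clipped importance weight, treated as a constant here
    """
    residual = (
        log_z
        + logp_theta / length
        - beta * r_hat
        - logp_ref / length
    )
    return w * residual ** 2


# The residual vanishes when log Z + (1/|y|) log pi_theta matches
# beta * r_hat + (1/|y|) log pi_ref.
print(flowrl_loss(log_z=2.0, logp_theta=-20.0, logp_ref=-30.0, r_hat=0.2, length=10))
```

The loss is zero exactly when the length-normalized policy log-probability, offset by `log Z`, equals the reward-tilted reference, which is the balance condition the squared error enforces.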

### Key Results (from paper)
- Math benchmarks: +10.0
- Code reasoning: 37.4
- Diversity: Nearly 2x solution diversity vs GRPO (GPT-4o evaluated)
- KL divergence: Dramatically reduced (the 8.6 to 0.1 figure from the video)
- Ablation: removing importance sampling drops 35.63

### PACED-RL Extension
PACED-RL (arXiv 2602.12642) discovered that Z_phi(x) encodes per-prompt difficulty:

    p_old(x) ~ beta * log Z*(x)

This enables adaptive curriculum scheduling with negligible overhead (0.035-0.110s per eval vs 308-1086s per training step).

Results: +29.1

---

## 3. What SAMPO Proposes (Stability Layer)

Source: ARLArena (arXiv 2602.21534, UCLA, Feb 2026)

SAMPO = Stable Agentic (Multi-) Policy Optimization

### The Four Design Dimensions
1. Loss Aggregation: Sequential Mean -- later tokens weighted more heavily for credit assignment
2. Clipping Strategy: Sequence-level clipping of importance ratios (not token-level like PPO)
3. Advantage Design: Entropy-modulated advantages prevent exploration paralysis
4. Trajectory Filtering: Dynamic batch filtering ensures non-zero advantage variance
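
The fourth dimension can be illustrated with a simple filter. The source does not give SAMPO's exact filtering rule, so the sketch below shows one plausible reading: drop any prompt group whose rewards are all identical, since group-normalized advantages for such a group are all zero and carry no gradient signal. The function name and the `min_var` threshold are assumptions.

```python
from statistics import pvariance


def filter_zero_variance_groups(groups, min_var=1e-8):
    """Keep only groups whose rewards vary.

    groups: dict mapping a prompt id to the list of rewards for its rollouts.
    """
    return {
        prompt_id: rewards
        for prompt_id, rewards in groups.items()
        if len(rewards) > 1 and pvariance(rewards) > min_var
    }


batch = {
    "prompt-a": [1.0, 1.0, 1.0],  # all rollouts succeeded: no contrast
    "prompt-b": [0.0, 1.0],       # mixed outcomes: useful signal
    "prompt-c": [0.5],            # single rollout: nothing to compare against
}
print(filter_zero_variance_groups(batch))
```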

### Training Collapse Mechanism
The fundamental instability: a sequence-level importance ratio is the product of per-token probability ratios, so over long sequences small token-level changes compound and the ratio becomes extremely sensitive.

SAMPO's fix -- sequence-level trust region:

    w_seq = clip(pi_theta(y|x) / pi_old(y|x), 1-epsilon, 1+epsilon)
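
The compounding effect and the clip can be seen in a few lines. This is a sketch of the formula above only: the function names are hypothetical and `epsilon=0.2` is an assumed value (a common PPO default), not one reported in the source.

```python
import math


def sequence_ratio(logp_new, logp_old):
    """pi_theta(y|x) / pi_old(y|x), computed from per-token log-probabilities."""
    return math.exp(sum(logp_new) - sum(logp_old))


def clipped_sequence_weight(logp_new, logp_old, epsilon=0.2):
    """Clip the whole-sequence ratio to [1 - epsilon, 1 + epsilon]."""
    ratio = sequence_ratio(logp_new, logp_old)
    return min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)


# Each token's probability rises by only 1%, but over 500 tokens
# the sequence-level ratio is 1.01 ** 500, which exceeds 100.
n_tokens = 500
logp_old = [-1.0] * n_tokens
logp_new = [lp + math.log(1.01) for lp in logp_old]

raw = sequence_ratio(logp_new, logp_old)
clipped = clipped_sequence_weight(logp_new, logp_old)
print(f"raw ratio: {raw:.1f}, clipped weight: {clipped:.2f}")
```

A per-token change that looks harmless in isolation produces an unusable raw weight at the sequence level, which is why the trust region is applied to the whole sequence.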

### Key Results
- 4B Qwen model: 92
- Consistent stability across web, embodied, math, game, and search agent tasks
- No task-specific tuning required

---

## 4. The Gap: What KARL Lacks

| FlowRL Feature | KARL Status | Gap |
|----------------|-------------|-----|
| Distribution matching objective | Reward maximization (oversample top) | Critical |
| Learnable partition function Z | None -- baseline is flat 0.5 | Critical |
| Group-normalized rewards | Domain baselines exist but primitive | Medium |
| Length normalization | Not needed (trajectories, not tokens) | Low |
| Importance sampling | Not present | Medium |
| Solution diversity metric | None -- no diversity measurement | Critical |
| Trajectory balance loss | Squared error not used anywhere | Critical |
| Beta temperature tuning | Oversampling thresholds hardcoded | Medium |
| Curriculum scheduling (PACED) | No difficulty awareness | Low |
| Sequence-level clipping (SAMPO) | Not applicable (SFT, not RL) | Different |
| Entropy-modulated advantage | Simple subtraction from baseline | Medium |

### Fundamental Architecture Mismatch
KARL trains a small model (gemma-3-1b-4bit) via supervised fine-tuning on tool-plan sequences. It does NOT do online policy gradient optimization. FlowRL and SAMPO are designed for online RL training loops.

This is the central tension: FlowRL's mathematical machinery assumes you are running gradient updates against a policy loss during training. KARL's training is offline SFT.

### But: FlowRL's Principles Still Apply to Offline Training

1. Sampling proportional to reward -- instead of oversampling top trajectories, sample proportional to `exp(beta * reward) / Z`. This is distribution matching applied to the training data selection.

2. Learnable partition function -- Z can normalize the reward distribution per-domain, replacing the fixed baseline.

3. Diversity preservation -- instead of excluding low-advantage trajectories, include them with calibrated weights.

4. Group normalization -- normalize rewards within trajectory batches, not just against a global baseline.
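
The four principles above can be combined into one offline weighting function. This is a hedged sketch, not an existing KARL component: `proportional_weights`, the `(domain, reward)` record shape, z-score normalization, and `beta=1.0` are all assumptions. Here Z is simply the per-domain sum of unnormalized weights, not a learned model.

```python
import math
import random
from collections import defaultdict
from statistics import mean, pstdev


def proportional_weights(records, beta=1.0):
    """Sampling weight per record, proportional to exp(beta * r_hat) within its domain.

    records: list of (domain, reward) pairs. Weights sum to 1 inside each domain,
    so every domain gets equal total mass (a design choice made for this sketch).
    """
    groups = defaultdict(list)
    for idx, (domain, _) in enumerate(records):
        groups[domain].append(idx)

    weights = [0.0] * len(records)
    for idxs in groups.values():
        rewards = [records[i][1] for i in idxs]
        mu, sigma = mean(rewards), pstdev(rewards)
        # Group normalization; a constant-reward group gets r_hat = 0 for all members.
        r_hat = [(r - mu) / sigma if sigma > 0 else 0.0 for r in rewards]
        unnormalized = [math.exp(beta * rh) for rh in r_hat]
        z = sum(unnormalized)  # per-domain partition estimate
        for i, u in zip(idxs, unnormalized):
            weights[i] = u / z
    return weights


records = [("routing", 0.4), ("routing", 0.6), ("search", 0.5)]
weights = proportional_weights(records)
sample = random.Random(0).choices(records, weights=weights, k=10)
```

Unlike the positive-advantage filter, every record keeps a non-zero weight, so low-reward trajectories are down-weighted but not discarded. With empty domain labels, as in the current data, all records fall into one group and the normalization becomes global.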

---

## 5. Real Constraints

### Compute
- Mac5: M4 16GB, single machine, MLX only
- LoRA training budget: ~200s per run, 500 iterations
- No GPU cluster, no multi-node training
- Cannot run full FlowRL (requires online RL loop with billion-param model)

### Data
- 111 trajectories is tiny by ML standards
- Reward range compressed (0.32-0.79, std 0.09)
- 93.7
- Skill labels mostly empty -- 1 populated out of 111

### Time
- Weekly training cadence (Sunday 3am)
- Daily analysis cadence (6:30am)
- System is live and recording -- changes must be backward compatible

### Model
- gemma-3-1b-it-4bit -- too small for meaningful reward model training
- SFT-only -- no RLHF/PPO/GRPO infrastructure
- Max sequence length 256 tokens -- tool plans are short

---

## 6. Open Questions

1. Can FlowRL's distribution matching be applied to offline SFT data selection? The math assumes online RL, but the principle of sampling proportional to reward (not just maximizing reward) is universal.

2. What should Z represent in our setting? In FlowRL, Z normalizes the KL target distribution. In KARL, the closest analog is the domain baseline. Could a learned Z replace fixed baselines?

3. Is 111 trajectories enough to train a partition function? Z_phi is a 3-layer MLP. With 111 samples and 13 domains, overfitting is a real risk.

4. Does SAMPO's stability framework apply to SFT? SAMPO targets online RL training loops. The sequence-level clipping and entropy-modulated advantages may not translate directly to offline SFT.

5. How do we measure diversity? FlowRL uses GPT-4o to evaluate solution diversity. We need a cheaper proxy for trajectory diversity measurement.

6. Should we migrate from SFT to actual RL? If FlowRL's benefits require online RL, the right move might be building a proper RL training loop on Mac5, not grafting distribution matching onto SFT.

7. What is the minimum viable FlowRL integration? If we cannot do full FlowRL, what is the smallest meaningful change that captures the core insight of distribution matching?
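
For question 5, one inexpensive candidate is the mean pairwise cosine distance between trajectory embeddings, which could reuse the embeddings that `embedding_cache.py` already stores. The sketch below is an assumption-laden starting point: the function names are hypothetical, vectors are assumed non-zero, and nothing here shows that this proxy tracks the GPT-4o-judged diversity used in the FlowRL paper.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_pairwise_cosine_distance(vectors):
    """Average distance over all unordered pairs; 0.0 if fewer than two vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Toy 2-d vectors standing in for real embeddings.
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(mean_pairwise_cosine_distance(collapsed), mean_pairwise_cosine_distance(spread))
```

Tracking this number on the exported SFT set before and after a change to the sampling scheme would give a first signal of whether the selection is becoming more or less concentrated. The cost is quadratic in the number of trajectories, which is negligible at 111 records.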

---

## 7. Research Summary

The research reveals three tiers of integration complexity:

Tier 1 (Adapt principles to SFT): Replace KARL's oversampling-based reward maximization with distribution-proportional sampling. This captures the core insight with zero infrastructure change.

Tier 2 (Add partition function): Train a lightweight Z model that normalizes per-domain rewards, replacing fixed baselines. Requires new training code but no infrastructure change.

Tier 3 (Full RL migration): Build an online RL training loop on Mac5 with trajectory balance loss. Requires significant infrastructure work but captures the full FlowRL/SAMPO benefit.

The evidence suggests Tier 1 alone would fix the most critical gap: KARL's reward distribution is compressed (std 0.09) and nearly all trajectories have positive advantage (93.7%).


Source Anchor

evo-cube-output/flow-rl-sampo/stage0-research.md
