Stage 0: RESEARCH -- Flow RL: From GRPO to SAMPO

Extracted abstract or opening context

## Source

- Video: code4AI "From GRPO to SAMPO: Solving Training Collapse in Agentic RL" (XoS5RlM2kog, score 7.5/10)
- FlowRL paper: arXiv 2509.15207 (LUMIA Lab, Sep 2025)
- PACED-RL paper: arXiv 2602.12642 (Feb 2026)
- ARLArena/SAMPO paper: arXiv 2602.21534 (UCLA, Feb 2026)
- GFlowNet Foundations: JMLR 2024, Bengio et al.

### File Inventory (14 Python files, [home-path])

| File | Lines | Purpose |
|------|-------|---------|
| `trajectory_tap.py` | 349 | 4 tap points (A/B/C/D) wired into Claude Code hooks |
| `reward_engine.py` | 428 | 3-signal composite reward (outcome 40%, process 35%, efficiency 25%) |
| `embedding_cache.py` | 214 | LRU cache for 3072-dim Gemini embeddings, async embed |
| `sft_exporter.py` | 323 | Advantage-weighted SFT export (OAPL-Lite oversampling) |
| `karl_trainer.py` | 269 | Mac5 training orchestration (SSH/SCP, MLX LoRA trigger) |
| `weight_updater.py` | 148 | EMA weight updates for skill embeddings |
| `trajectory_bridge.py` | 463 | Shadow routing analysis, promotion gate, EW technique recs |
| `bootstrap_skill_embeddings.py` | ~200 | Pre-compute skill vectors |
| `trajectory_extractor.py` | ~200 | Historical backfill from prompt logs |
| `karl_training_flow.py` | 168 | Weekly Prefect training flow (Sunday 3am) |
| `karl_analysis_flow.py` | 179 | Daily Prefect analysis flow (6:30am) |
| `synthetic_qa.py` | ~300 | Git-commit-based synthetic Q&A generation |

### Data Inventory

| File | Records | Size |
|------|---------|------|
| `trajectories.jsonl` | 111 | 620 KB |
| `routing_shadow.jsonl` | 87 | 19 KB |
| `karl-sft.jsonl` | 35 | 40 KB |
| `synthetic_qa.jsonl` | 37 | 23 KB |
| `skill_embeddings.pkl` | 13 skills | 360 KB |
| `prompt_embedding_cache.pkl` | ~100 entries | 166 KB |

### Advantage Distribution

- Mean advantage: 0.1047
- Positive advantages: 104/111 (93.7%)
- Negative advantages: 7/111 (6.3%)
- Only 1 skill label populated ("test"), domains empty

### Training Infrastructure

- Mac5: M4 16GB, MLX LoRA on gemma-3-1b-it-4bit
- Training params: 500 iters, batch 1, LoRA rank 8, 4 layers, lr 1e-5, max_seq 256
- Adapter v2: 35 SFT examples, test loss 1.843
- Fine-tune daemon on :9200 (Prometheus metrics)
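
The file inventory describes `reward_engine.py` as a 3-signal composite reward weighted outcome 40%, process 35%, efficiency 25%. A minimal sketch of such a weighted sum follows. Only the three weights come from the inventory; the `Signals` type, the `composite_reward` name, and the assumption that each signal is already normalized to [0, 1] are hypothetical and may not match the actual module.

```python
from dataclasses import dataclass

# Weights as listed in the file inventory entry for reward_engine.py.
WEIGHTS = {"outcome": 0.40, "process": 0.35, "efficiency": 0.25}


@dataclass
class Signals:
    """Hypothetical container; each signal is assumed to lie in [0, 1]."""
    outcome: float
    process: float
    efficiency: float


def composite_reward(signals: Signals) -> float:
    """Return the weighted sum of the three signals."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = getattr(signals, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        total += weight * value
    return total


if __name__ == "__main__":
    print(composite_reward(Signals(outcome=1.0, process=0.5, efficiency=0.0)))
```

Because the weights sum to 1, a trajectory that scores 1.0 on every signal receives a composite reward of 1.0, which keeps the reward on the same scale as its inputs.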
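
The inventory also lists `weight_updater.py` as applying EMA weight updates to skill embeddings. The sketch below shows a standard exponential moving average over a vector. The function name `ema_update`, the smoothing factor `alpha`, and its default of 0.1 are assumptions for illustration; the source does not state the actual update rule or its parameters.

```python
from typing import List, Sequence


def ema_update(current: Sequence[float],
               observed: Sequence[float],
               alpha: float = 0.1) -> List[float]:
    """Blend an observed vector into the current one.

    new[i] = (1 - alpha) * current[i] + alpha * observed[i]
    alpha is a hypothetical smoothing factor in (0, 1].
    """
    if len(current) != len(observed):
        raise ValueError("vectors must have the same length")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return [(1.0 - alpha) * c + alpha * o for c, o in zip(current, observed)]


if __name__ == "__main__":
    skill_vec = [0.0, 1.0]
    new_obs = [1.0, 1.0]
    print(ema_update(skill_vec, new_obs, alpha=0.5))
```

A small `alpha` makes each skill vector drift slowly toward new observations, which is one plausible way to keep embeddings stable when only about a hundred trajectories are available.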

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.