Stage 3: EXPAND + MASTER PLAN -- KARL V6 Session Driver

Extracted abstract or opening context

**R1: Terminal Diff Fragility**
- **Failure scenario:** ANSI escape codes, cursor repositioning, wrapped lines, and partial terminal renders cause the diff algorithm to produce garbage. The pane hash changes on every read (due to timestamp updates or animated spinners), defeating the "skip unchanged" optimization.
- **Probability:** HIGH (70%). Terminal output is inherently messy. tmux capture-pane includes invisible characters.
- **Impact:** HIGH. Without reliable diff, every turn fires the model with junk context, recreating V5's problems.
- **Mitigation:** Multi-layer stripping: (1) `strip_ansi()` removes escape sequences via regex `\x1b\[[0-9;]*[a-zA-Z]`, (2) normalize whitespace, (3) hash only alpha-numeric content (ignore formatting), (4) use a "stable hash" that ignores timestamps matching `\d{2}:\d{2}:\d{2}` pattern.
- **Validation:** Test against 100 real tmux captures (save from current sessions). Hash stability rate must be >90% on unchanged content.

**R2: Phase Detection False Positives**
- **Failure scenario:** Phase detector sees "running" in a file path (`/app/running-config.ts`) and classifies as WAITING. Or sees "error" in a variable name and classifies as ERROR. The session stalls or takes wrong actions.
- **Probability:** MEDIUM (40%). Keyword matching on raw terminal output is inherently noisy.
- **Impact:** HIGH. Wrong phase = wrong action. WAITING when BUILDING means the driver waits forever.
- **Mitigation:** (1) Only match keywords in the LAST 3 lines (not all 80), (2) require keyword to be the dominant signal (not embedded in longer text), (3) add a "confidence" score -- if multiple conflicting signals, default to BUILDING (safest), (4) timeout: any phase held for >120s without change auto-transitions to STUCK.
- **Validation:** Annotate 50 real pane snapshots with correct phase. Phase detector accuracy must be >85%.

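
The four R1 mitigation layers can be sketched in Python as follows. This is a minimal illustration, not the driver's actual code: `strip_ansi()` and the two regexes come from the text, while `stable_hash()`, the constant names, the order of the steps, and the choice of SHA-256 are assumptions. Timestamps are removed before the alphanumeric filter because the filter would otherwise delete the colons the timestamp pattern relies on.

```python
import hashlib
import re

# Regexes quoted in the R1 mitigation.
ANSI_RE = re.compile(r"\x1b\[[0-9;]*[a-zA-Z]")
TIMESTAMP_RE = re.compile(r"\d{2}:\d{2}:\d{2}")
# Assumption: "alpha-numeric content" means ASCII letters and digits only.
NON_ALNUM_RE = re.compile(r"[^0-9a-zA-Z]+")


def strip_ansi(text: str) -> str:
    """Layer 1: remove escape sequences matched by the quoted regex."""
    return ANSI_RE.sub("", text)


def stable_hash(pane_text: str) -> str:
    """Hash a pane capture while ignoring formatting and clock updates.

    The name and the SHA-256 choice are hypothetical.
    """
    text = strip_ansi(pane_text)
    # Layer 4 runs early: drop HH:MM:SS timestamps while colons still exist.
    text = TIMESTAMP_RE.sub("", text)
    # Layer 2: collapse all runs of whitespace (including newlines).
    text = " ".join(text.split())
    # Layer 3: keep only alphanumeric characters.
    text = NON_ALNUM_RE.sub("", text)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    first = "\x1b[32mBuild OK\x1b[0m  12:00:01\n"
    second = "Build   OK 12:00:59"
    # Same content, different colors, spacing, and clock: hashes agree.
    print(stable_hash(first) == stable_hash(second))
```

Note that the quoted escape-sequence regex does not match CSI sequences carrying a `?` parameter (for example `\x1b[?25l`, cursor hide), so the R1 validation set of real tmux captures is a good place to check whether a broader pattern is needed.
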
**R3: Model Generates Harmful Prompts**
- **Failure scenario:** The 4B model generates a prompt that causes Claude to delete files, push to wrong branches, deploy broken code, or modify production data. The validation gate (Step 5) only checks for repetition and status, not destructive commands.
- **Probability:** LOW (15%). The model is trained on Mohamed's patterns, which are rarely destructive. But a confused model could generate `git push --force` or `rm -rf`.
- **Impact:** CRITICAL. Data loss, broken deployments.
- **Mitigation:** Add a DESTRUCTIVE_PATTERNS blocklist to the validation gate: `['rm -rf', 'git push --force', 'drop table', 'git reset --hard', 'DELETE FROM', 'kill -9']`. Any generated prompt matching these patterns is rejected outright.
- **Validation:** Test with 200 generated prompts. Zero destructive prompts must pass validation.

**R4: Plan Generation Quality**
- **Failure scenario:** The 4
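
As an illustration of the R3 mitigation, the blocklist check might look like the sketch below. The `DESTRUCTIVE_PATTERNS` list is taken verbatim from the text. The function names `find_destructive()` and `is_safe()`, the case-insensitive matching, and the tolerance for extra whitespace are assumptions made for this example, not confirmed details of the validation gate.

```python
import re
from typing import List

# Blocklist quoted in the R3 mitigation.
DESTRUCTIVE_PATTERNS = [
    "rm -rf",
    "git push --force",
    "drop table",
    "git reset --hard",
    "DELETE FROM",
    "kill -9",
]


def _compile(pattern: str) -> "re.Pattern[str]":
    # Assumption: match case-insensitively and allow any run of
    # whitespace between the words of a pattern.
    parts = [re.escape(word) for word in pattern.split()]
    return re.compile(r"\s+".join(parts), re.IGNORECASE)


_COMPILED = [_compile(p) for p in DESTRUCTIVE_PATTERNS]


def find_destructive(prompt: str) -> List[str]:
    """Return every blocklist entry found in the prompt (hypothetical helper)."""
    return [
        pattern
        for pattern, regex in zip(DESTRUCTIVE_PATTERNS, _COMPILED)
        if regex.search(prompt)
    ]


def is_safe(prompt: str) -> bool:
    """Reject outright if any destructive pattern matches."""
    return not find_destructive(prompt)


if __name__ == "__main__":
    print(is_safe("run the test suite and fix failures"))
    print(find_destructive("cleanup: RM  -rf build/"))
```

A literal blocklist like this misses equivalent spellings such as `git push -f` or `rm -fr`, so the 200-prompt validation run is where such gaps would need to surface.
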

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.