Grand Diomande Research · Full HTML Reader

Stage 3: EXPAND + MASTER PLAN -- KARL V6 Session Driver


Agents That Account for Themselves research note experiment writeup candidate score 20 .md


---

3a. Risk Audit

CRITICAL RISKS

R1: Terminal Diff Fragility
- Failure scenario: ANSI escape codes, cursor repositioning, wrapped lines, and partial terminal renders cause the diff algorithm to produce garbage. The pane hash changes on every read (due to timestamp updates or animated spinners), defeating the "skip unchanged" optimization.
- Probability: HIGH (70%). Terminal output is inherently messy. tmux capture-pane includes invisible characters.
- Impact: HIGH. Without reliable diff, every turn fires the model with junk context, recreating V5's problems.
- Mitigation: Multi-layer stripping: (1) `strip_ansi()` removes escape sequences via regex `\x1b\[[0-9;]*[a-zA-Z]`, (2) normalize whitespace, (3) hash only alphanumeric content (ignore formatting), (4) use a "stable hash" that ignores timestamps matching an `HH:MM:SS` pattern.
- Validation: Test against 100 real tmux captures (saved from current sessions). Hash stability rate must be >90%.

R2: Phase Detection False Positives
- Failure scenario: Phase detector sees "running" in a file path (`/app/running-config.ts`) and classifies as WAITING. Or sees "error" in a variable name and classifies as ERROR. The session stalls or takes wrong actions.
- Probability: MEDIUM (40%).
- Impact: HIGH. Wrong phase = wrong action. WAITING when BUILDING means the driver waits forever.
- Mitigation: (1) Only match keywords in the LAST 3 lines (not all 80), (2) require keyword to be the dominant signal (not embedded in longer text), (3) add a "confidence" score -- if multiple conflicting signals, default to BUILDING (safest), (4) timeout: any phase held for >120s without change auto-transitions to STUCK.
- Validation: Annotate 50 real pane snapshots with correct phase. Phase detector accuracy must be >85%.
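
The R2 mitigations can be sketched roughly as below. This is an illustration under assumptions, not the driver's actual detector: the keyword lists, the `is_dominant` rule (a keyword only counts when it is not glued to a path or identifier), and the function name `detect_phase_sketch` are all hypothetical.

```python
import re

# Hypothetical keyword lists; the plan does not spell out the real vocabulary.
PHASE_KEYWORDS = {
    "WAITING": ("running", "waiting", "thinking"),
    "ERROR": ("error", "failed", "traceback"),
    "DONE": ("done", "completed"),
}
TAIL_LINES = 3          # mitigation (1): only inspect the last 3 lines
STUCK_AFTER_S = 120.0   # mitigation (4): phase timeout in seconds


def is_dominant(keyword: str, line: str) -> bool:
    """True when the keyword stands alone, not inside a path or identifier."""
    pattern = rf"(?<![\w./-]){re.escape(keyword)}(?![\w/-]|\.\w)"
    return re.search(pattern, line, flags=re.IGNORECASE) is not None


def detect_phase_sketch(lines: list[str], seconds_in_phase: float) -> str:
    if seconds_in_phase > STUCK_AFTER_S:
        return "STUCK"
    tail = [line for line in lines if line.strip()][-TAIL_LINES:]
    hits = {
        phase
        for phase, words in PHASE_KEYWORDS.items()
        for line in tail
        for word in words
        if is_dominant(word, line)
    }
    if len(hits) == 1:
        return hits.pop()
    return "BUILDING"  # mitigation (3): no signal or conflicting signals
```

With this rule, `/app/running-config.ts` and `error_count = 3` no longer trigger WAITING or ERROR, while a bare `Tests running...` still does.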

R3: Model Generates Harmful Prompts
- Failure scenario: The 4B model generates a prompt that causes Claude to delete files, push to wrong branches, deploy broken code, or modify production data. The validation gate (Step 5) only checks for repetition and status, not destructive commands.
- Probability: LOW (15%).
- Impact: CRITICAL. Data loss, broken deployments.
- Mitigation: Add a DESTRUCTIVE_PATTERNS blocklist to the validation gate: `['rm -rf', 'git push --force', 'drop table', 'git reset --hard', 'DELETE FROM', 'kill -9']`. Any generated prompt matching these patterns is rejected outright.
- Validation: Test with 200 generated prompts. Zero destructive prompts must pass validation.
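
A minimal sketch of the blocklist check described for R3, assuming the patterns are compiled as case-insensitive regexes (the plan lists both `drop table` and `DELETE FROM`, so case handling is an assumption here). The helper name `find_destructive` is hypothetical.

```python
from __future__ import annotations

import re

# Patterns mirror the R3 blocklist; case-insensitive matching is an assumption.
DESTRUCTIVE_PATTERNS = [
    r"rm\s+-rf\b",
    r"git\s+push\s+--force",
    r"git\s+reset\s+--hard",
    r"drop\s+table",
    r"delete\s+from",
    r"kill\s+-9",
]
_DESTRUCTIVE_RE = [re.compile(p, re.IGNORECASE) for p in DESTRUCTIVE_PATTERNS]


def find_destructive(prompt: str) -> str | None:
    """Return the first blocklist pattern found in the prompt, or None if clean."""
    for pattern in _DESTRUCTIVE_RE:
        if pattern.search(prompt):
            return pattern.pattern
    return None
```

A gate built on this would reject any prompt for which `find_destructive` returns a pattern. Plain-English phrases such as "delete from the list" would also be rejected, which errs on the side of caution.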

MEDIUM RISKS

R4: Plan Generation Quality
- Failure scenario: The 4B model generates a plan with wrong file paths, wrong commands, wrong dependencies. The plan is used as fallback when the model fails validation 3 times, so a bad plan injects bad prompts.
- Probability: MEDIUM (50%).
- Impact: MEDIUM. Bad plan steps waste turns but don't cause data loss (Claude itself validates before executing).
- Mitigation: (1) Keep plan steps vague ("Create the health API route") not specific ("Create /app/api/health/route.ts with exact implementation"), (2) plan is a HINT, model can deviate, (3) validate plan steps against the same blocklist as generated prompts.
- Validation: Generate plans for 10 known projects. Human review for basic coherence.

R5: Session Record Corruption
- Failure scenario: Driver crashes mid-write, leaving a partial JSON file. Next driver start fails to load the session record and creates a new one, losing all history.
- Probability: LOW-MEDIUM (25%).
- Impact: MEDIUM. Loss of session history means temporary amnesia, but the session can recover from the pane state.
- Mitigation: (1) Atomic write: write to `{session_id}.tmp.json`, then rename, (2) backup: keep previous version as `{session_id}.bak.json`, (3) recovery: if main file is corrupt, try loading backup.
- Validation: Kill the driver process during a write. Verify recovery from backup.
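
The recovery step (3) could look something like the sketch below. It only covers reading: try the main file, and fall back to the `.bak.json` copy if the main file is missing or does not parse. The function name `load_json_with_backup` is hypothetical, and converting the dict back into a session record is left out.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_json_with_backup(path: Path) -> dict | None:
    """Try the main session file first, then its .bak.json sibling."""
    backup = path.with_suffix(".bak.json")
    for candidate in (path, backup):
        try:
            with open(candidate) as f:
                return json.load(f)
        except (OSError, ValueError):
            # Missing, unreadable, or partially written file: try the next one.
            continue
    return None
```

Catching `ValueError` covers `json.JSONDecodeError` as well as decoding errors from a file that was cut off mid-write.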

R6: Token Budget Overflow
- Failure scenario: The pane diff is unexpectedly large (for example, Claude outputs a 200-line error trace) and, combined with history and the system prompt, pushes the total past 2048 tokens. The model receives truncated context and generates poor output.
- Probability: MEDIUM (35%).
- Impact: MEDIUM. Truncated context leads to confused generation, but the validation gate catches most bad outputs.
- Mitigation: (1) Hard cap diff at 20 lines (take first 5 + last 15), (2) hard cap total prompt at 1800 tokens before generation budget, (3) if over budget, drop plan_steps first, then truncate history to last 3 entries.
- Validation: Test with edge-case pane outputs (500-line error dumps, binary output, huge file listings).
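
Mitigation (1) for R6 can be illustrated with a small head-plus-tail cap. The omission marker line is an addition of this sketch (so the result has at most 21 entries, not 20), and the name `cap_diff` is hypothetical.

```python
def cap_diff(lines: list[str], head: int = 5, tail: int = 15) -> list[str]:
    """Keep the first `head` and last `tail` lines, marking the omitted middle."""
    if len(lines) <= head + tail:
        return list(lines)
    omitted = len(lines) - head - tail
    # The marker is an assumption of this sketch; drop it for a strict 20-line cap.
    return lines[:head] + [f"... [{omitted} lines omitted] ..."] + lines[-tail:]
```

Keeping the head preserves the first error message of a long trace, and keeping the tail preserves the most recent state of the pane.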

R7: MLX Server Unavailability
- Failure scenario: Mac5 is offline, MLX server crashed, or network timeout. Every model call fails. The driver has no LLM to generate prompts.
- Probability: MEDIUM (30%).
- Impact: MEDIUM-HIGH. Without the model, the driver can only inject plan steps mechanically.
- Mitigation: (1) Plan-only fallback mode: step through plan_steps without model calls, (2) retry with exponential backoff on MLX failures, (3) health check MLX at session start, warn if unreachable, (4) optional: fall back to a different endpoint (local ollama, cloud API).
- Validation: Start a session with MLX offline. Verify plan-only mode works.
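
One way the retry and plan-only fallback for R7 might fit together is sketched below. The model call is passed in as a plain callable so that nothing here depends on a specific MLX client API; `query_with_backoff`, `next_prompt`, and the default delays are hypothetical.

```python
import time
from typing import Callable, Optional


def query_with_backoff(
    query: Callable[[], str],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call `query`, doubling the delay after each failure; None if all fail."""
    for attempt in range(retries):
        try:
            return query()
        except Exception:  # network errors, timeouts, malformed responses
            if attempt < retries - 1:
                sleep(base_delay * (2 ** attempt))
    return None


def next_prompt(
    query: Callable[[], str],
    plan_steps: list[str],
    plan_index: int,
    **backoff_kwargs,
) -> Optional[str]:
    """Use the model when reachable, otherwise fall back to the next plan step."""
    generated = query_with_backoff(query, **backoff_kwargs)
    if generated is not None:
        return generated
    if plan_index < len(plan_steps):
        return plan_steps[plan_index]  # plan-only fallback mode
    return None
```

Injecting `sleep` keeps the function testable without real waiting.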

LOW RISKS

R8: Adaptive Interval Too Slow
- Failure scenario: 60s max_interval means the driver misses a rapid Claude output that needed immediate direction, causing Claude to idle for a full minute.
- Probability: LOW (20%).
- Impact: LOW. Wasted time, not wasted quality.
- Mitigation: (1) After injecting a prompt, use min_interval (20s) for the next 3 reads, (2) only use max_interval after 3 consecutive WAITING phases.
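
A rough sketch of the interval rules from R8, assuming the driver falls back to `min_interval` whenever neither rule applies. The class name `IntervalPolicy` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IntervalPolicy:
    min_interval: float = 20.0
    max_interval: float = 60.0
    fast_reads_left: int = 0   # reads remaining in the post-injection fast window
    waiting_streak: int = 0    # consecutive WAITING phases observed

    def on_inject(self) -> None:
        """Rule (1): the next 3 reads after an injection use min_interval."""
        self.fast_reads_left = 3
        self.waiting_streak = 0

    def next_interval(self, phase: str) -> float:
        self.waiting_streak = self.waiting_streak + 1 if phase == "WAITING" else 0
        if self.fast_reads_left > 0:
            self.fast_reads_left -= 1
            return self.min_interval
        if self.waiting_streak >= 3:
            return self.max_interval  # rule (2): 3 consecutive WAITING phases
        return self.min_interval
```

Any non-WAITING phase resets the streak, so the interval drops back to 20s as soon as the pane shows activity again.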

R9: NATS Integration Breaks Core Driver
- Failure scenario: NATS code throws an unhandled exception that crashes the main loop.
- Probability: LOW (10%).
- Impact: LOW if properly isolated. HIGH if not.
- Mitigation: Every NATS call is wrapped in a `try/except` that logs a warning and continues. NATS failures never block or crash the driver.
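
The isolation described for R9 can be sketched as a thin wrapper. The publish function is passed in as a callable, so no particular NATS client API is assumed; `safe_publish` and the logger name are hypothetical.

```python
import logging
from typing import Any, Callable

log = logging.getLogger("karl.nats")


def safe_publish(
    publish: Callable[[str, bytes], Any], subject: str, payload: bytes
) -> bool:
    """Run a publish callable; log and swallow any failure so the loop keeps going."""
    try:
        publish(subject, payload)
        return True
    except Exception as exc:  # deliberately broad: observability must not crash the driver
        log.warning("NATS publish to %s failed: %s", subject, exc)
        return False
```

The return value lets the caller count failed publishes without having to handle exceptions itself.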

R10: Session ID Collision
- Failure scenario: Two sessions on different machines have the same pane ID, causing session record overwrites.
- Probability: LOW (5%).
- Impact: LOW. Confusion in logs.
- Mitigation: Session ID format: `{machine}_{pane_id}_{timestamp}`. Unique by construction.

---

3b. Expanded Specifications

SPEC 1: Session Record Manager (`session_manager.py`)

Purpose: CRUD operations for session records on disk.

File location: `Desktop/karl/karl/session_manager.py`

Interface:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path


@dataclass(kw_only=True)  # kw_only lets required fields follow defaulted ones
class SessionRecord:
    version: int = 6
    session_id: str
    machine: str
    pane_id: str
    project: str
    goal: str
    plan_steps: list[str]
    plan_index: int = 0  # Next uncompleted step
    phase: str = "STARTING"
    turn_number: int = 0
    history: list[dict]  # max 8 entries
    last_3_prompts: list[str]
    pane_hash: str | None = None
    pane_hash_streak: int = 0
    prev_lines: list[str] = field(default_factory=list)  # last 40 lines for diff
    idle_seconds: float = 0
    created_at: str
    updated_at: str

SESSION_DIR = Path.home() / ".karl-sessions"

# Signatures only; bodies omitted.
def load_session(session_id: str) -> SessionRecord | None: ...
def save_session(session: SessionRecord) -> bool: ...  # atomic write
def create_session(machine, pane_id, project, goal, plan_steps=None) -> SessionRecord: ...
def update_turn(session: SessionRecord, generated_prompt: str, pane_result: PaneResult) -> None: ...
def get_next_plan_step(session: SessionRecord) -> str | None: ...
def advance_plan(session: SessionRecord) -> None: ...
```

Atomic write pattern:

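```python
import json
import os
import shutil
from dataclasses import asdict

def save_session(session):
    SESSION_DIR.mkdir(parents=True, exist_ok=True)  # first run: directory may not exist
    path = SESSION_DIR / f"{session.session_id}.json"
    tmp = path.with_suffix('.tmp.json')
    bak = path.with_suffix('.bak.json')
    with open(tmp, 'w') as f:
        json.dump(asdict(session), f, indent=2, default=str)
    if path.exists():
        shutil.copy2(path, bak)  # keep the last good version for recovery
    os.replace(tmp, path)  # atomic on POSIX, and overwrites on Windows
    return True
```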

Size: ~120 lines.

---

SPEC 2: Pane Processor (`pane_processor.py`)

Purpose: Read, clean, hash, diff pane output.

File location: `Desktop/karl/karl/pane_processor.py`

Interface:

```python
from __future__ import annotations

from dataclasses import dataclass

@dataclass
class PaneResult:
    changed: bool
    diff: list[str]      # New lines since last read
    novelty: float        # 0.0-1.0
    all_lines: list[str]  # Full cleaned output
    raw_line_count: int

# Signatures only; bodies omitted.
def strip_ansi(text: str) -> str: ...
def stable_hash(lines: list[str]) -> str: ...  # Ignores timestamps, whitespace
def compute_diff(prev_lines: list[str], curr_lines: list[str]) -> list[str]: ...
def compute_novelty(diff_lines: list[str]) -> float: ...
def process_pane(session: SessionRecord, raw_output: str) -> PaneResult: ...
def is_spinner_line(line: str) -> bool: ...
def detect_phase(session: SessionRecord, pane: PaneResult) -> str: ...
```

ANSI stripping regex:

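```python
import re

# CSI sequences (colors, cursor movement, private modes) or OSC sequences (titles)
ANSI_RE = re.compile(r'\x1b\[[0-?]*[ -/]*[@-~]|\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)')
TIMESTAMP_RE = re.compile(r'\d{1,2}:\d{2}(:\d{2})?(\s*(AM|PM))?')
SPINNER_RE = re.compile(r'[⠋⠙⠹⠸⠼⠴⠦⠧⠇⠏|/\\-]')  # one spinner frame; anchor it when testing whole lines
```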
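
The interface declares `strip_ansi` without a body. A minimal sketch, assuming only CSI and OSC sequences need handling (the patterns are repeated here so the snippet runs on its own, and stripping carriage returns is an extra assumption):

```python
import re

# CSI sequences (colors, cursor moves) and OSC sequences (window titles).
CSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")
OSC_RE = re.compile(r"\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)")


def strip_ansi(text: str) -> str:
    """Remove CSI and OSC escape sequences, then any stray carriage returns."""
    text = OSC_RE.sub("", text)
    text = CSI_RE.sub("", text)
    return text.replace("\r", "")


colored = "\x1b[1;32mPASS\x1b[0m tests/test_a.py"
print(strip_ansi(colored))  # PASS tests/test_a.py
```

The CSI pattern follows the parameter, intermediate, and final byte ranges of the escape-sequence grammar, so private-mode sequences such as `\x1b[?25l` (hide cursor) are removed without consuming the text after them.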

Stable hash:

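```python
import hashlib
import re

def stable_hash(lines):
    # Remove timestamps, drop non-alphanumeric characters, normalize whitespace
    normalized = []
    for line in lines:
        clean = TIMESTAMP_RE.sub('', line)
        clean = re.sub(r'[^A-Za-z0-9\s]', '', clean)  # ignore formatting and spinner glyphs
        clean = re.sub(r'\s+', ' ', clean).strip()
        if len(clean) > 3:  # Skip noise lines
            normalized.append(clean)
    # MD5 serves as a cheap fingerprint here, not for security
    return hashlib.md5('\n'.join(normalized).encode()).hexdigest()[:12]
```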

Diff algorithm:

```python
def compute_diff(prev_lines, curr_lines):
    if not prev_lines:
        return curr_lines[-30:]

    # Find overlap point: the last line of curr that also appears in the tail of prev
    prev_set = set(l.strip() for l in prev_lines[-10:] if l.strip())
    overlap_idx = None  # None means no overlap was found

    for i in range(len(curr_lines) - 1, -1, -1):
        if curr_lines[i].strip() in prev_set:
            overlap_idx = i + 1
            break

    if overlap_idx is None:
        # No clear overlap found -- take last 20 lines
        return curr_lines[-20:]

    # Everything after the overlap is new (empty if nothing changed); cap at 30 lines
    return curr_lines[overlap_idx:][-30:]
```

Size: ~150 lines.

---

SPEC 3: Prompt Assembler (`prompt_assembler.py`)

Purpose: Build the model prompt from session state and pane data.

File location: `Desktop/karl/karl/prompt_assembler.py`

Interface:

```python
from __future__ import annotations

V6_SYSTEM_PROMPT: str  # Fixed system prompt for V6

# Signatures only; bodies omitted.
def assemble_context(session: SessionRecord, pane: PaneResult) -> list[dict]: ...
def format_history(history: list[dict]) -> str: ...
def format_plan_hint(plan_steps: list[str], plan_index: int) -> str: ...
def format_diff(diff_lines: list[str], phase: str) -> str: ...
def format_anti_repeat(last_3: list[str]) -> str: ...
def count_tokens_approx(text: str) -> int: ...  # word_count * 1.3
def trim_to_budget(components: dict, max_tokens: int) -> dict: ...
```

Token budget manager:

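```python
MAX_TOTAL = 1800  # Leave 200+ for generation
PRIORITIES = ['system', 'identity', 'anti_repeat', 'diff', 'history', 'plan']
PROTECTED = {'system', 'identity', 'anti_repeat'}  # Never dropped or truncated

def count_tokens_approx(text):
    return int(len(text.split()) * 1.3)  # word_count * 1.3

def truncate_text(text, max_tokens):
    # Assumed helper (not defined in the spec): keep the most recent words that fit
    max_words = int(max_tokens / 1.3)
    return ' '.join(text.split()[-max_words:]) if max_words > 0 else ''

def trim_to_budget(components, max_tokens=MAX_TOTAL):
    components = dict(components)  # Do not mutate the caller's dict
    total = sum(count_tokens_approx(v) for v in components.values())

    # Drop in reverse priority order: plan first, then history, then diff
    for key in reversed(PRIORITIES):
        if total <= max_tokens:
            break
        if key in PROTECTED or key not in components:
            continue
        excess = total - max_tokens
        component_tokens = count_tokens_approx(components[key])
        if component_tokens > excess:
            # Truncate this component
            components[key] = truncate_text(components[key], component_tokens - excess)
        else:
            # Drop entirely
            components[key] = ''
        total = sum(count_tokens_approx(v) for v in components.values())
    return components
```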

Size: ~130 lines.

---

SPEC 4: Validation Gate (`validation_gate.py`)

Purpose: Post-generation validation and regeneration logic.

File location: `Desktop/karl/karl/validation_gate.py`

Interface:

```python
from __future__ import annotations

from dataclasses import dataclass

@dataclass
class ValidationResult:
    valid: bool
    reason: str | None = None  # status_blocked, exact_repeat, near_repeat, too_short, off_topic, destructive
    action: str | None = None  # regenerate, regenerate_with_constraint, block

BANNED_PHRASES: list[str]  # status variants
DESTRUCTIVE_PATTERNS: list[str]  # rm -rf, push --force, etc.

# Signatures only; bodies omitted.
def validate_prompt(generated: str, session: SessionRecord) -> ValidationResult: ...
def similarity(a: str, b: str) -> float: ...  # Jaccard on word trigrams
def has_project_relevance(prompt: str, session: SessionRecord) -> bool: ...
def build_constraint_prompt(original_context: list, reason: str, rejected: str) -> list[dict]: ...
```

Banned phrases (comprehensive):

```python
BANNED_PHRASES = [
    'status', 'check status', 'show status', 'what is the status',
    "what's the status", 'give me status', 'get status',
    'current status', 'project status',
]

# Regexes; match with re.IGNORECASE so lowercase SQL is also caught
DESTRUCTIVE_PATTERNS = [
    r'rm\s+-(?:rf|fr)\b',
    r'git\s+push\s+(?:--force|-f)\b',
    r'git\s+reset\s+--hard',
    r'DROP\s+TABLE',
    r'DELETE\s+FROM',
    r'git\s+checkout\s+\.',
    r'git\s+clean\s+-[fd]',
    r'kill\s+-9\s+\d',
    r'pkill\s+-9',
    r'sudo\s+rm',
]
```

Size: ~100 lines.
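
The `similarity()` signature above specifies Jaccard on word trigrams, but a strict trigram set is empty for inputs shorter than three words (for example the single word "status"). The sketch below swaps in a pooled set of 1- to 3-word n-grams so that short prompts remain comparable. That pooling is an assumption of this example, not something the spec states.

```python
def word_ngrams(text: str, max_n: int = 3) -> set[tuple[str, ...]]:
    """All word n-grams of length 1..max_n, lowercased."""
    words = text.lower().split()
    return {
        tuple(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two n-gram sets; 0.0 when either text is empty."""
    grams_a, grams_b = word_ngrams(a), word_ngrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Under this variant, "status" versus "check status" shares one of three n-grams (about 0.33), and "build X" versus "deploy Y" shares none.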

---

SPEC 5: V6 Main Driver (`twin_session_driver_v6.py`)

Purpose: Main loop tying all components together. CLI entry point.

File location: `Desktop/karl/twin_session_driver_v6.py`

Interface:

```python
from __future__ import annotations

# Signatures only; bodies omitted.
class V6SessionDriver:
    def __init__(self, machine, pane_id, project, goal, plan_steps=None,
                 max_turns=30, min_interval=20, max_interval=60,
                 mlx_url=None, dry_run=False): ...
    def run(self) -> SessionSummary: ...
    def _read_pane(self) -> str | None: ...
    def _inject(self, prompt: str) -> bool: ...
    def _query_model(self, messages: list[dict]) -> str: ...
    def _query_with_constraint(self, messages: list, reason: str, rejected: str) -> str: ...

def auto_detect_session(raw_output: str, mlx_url: str) -> dict: ...
def generate_plan(project: str, goal: str, mlx_url: str) -> list[str]: ...

def main(): ...  # CLI via argparse
```

CLI:

```bash
# Seeded session with explicit goal
karl-twin-v6 mac1 agent-codex:1.1 \
  --project meshd-dashboard \
  --goal "Build health polling dashboard" \
  --turns 30

# Auto-detect session from pane content
karl-twin-v6 mac1 agent-codex:1.1 --auto-detect

# Dry run (log what would be injected)
karl-twin-v6 mac1 agent-codex:1.1 \
  --project meshd-dashboard \
  --goal "Build health polling dashboard" \
  --dry-run

# Resume existing session
karl-twin-v6 mac1 agent-codex:1.1 --resume
```

Size: ~250 lines (main driver + CLI).

---

3c. Master Execution Checklist

### Wave 0: Foundation (Day 1)
No dependencies. Can start immediately.

| # | Task | Input | Output | Owner | Validation | Status |
|---|---|---|---|---|---|---|
| 0.1 | Create `[home-path]` directory and session record schema | SPEC 1 | `session_manager.py` with dataclass + CRUD | agent | `pytest test_session_manager.py` -- create, save, load, atomic write, backup recovery | TODO |
| 0.2 | Build ANSI stripping + stable hash functions | SPEC 2 | `pane_processor.py` with strip_ansi, stable_hash | agent | Test against 20 saved tmux captures, hash stability >90% | TODO |
| 0.3 | Implement word-trigram similarity function | SPEC 4 | `validation_gate.py` with similarity() | agent | similarity("status", "check status") > 0.3, similarity("build X", "deploy Y") < 0.3 | TODO |
| 0.4 | Save 50 real pane captures from current sessions for testing | Session logs | `Desktop/karl/tests/fixtures/pane_captures/` (50 text files) | agent | Files exist, cover BUILDING/WAITING/ERROR/DONE phases | TODO |

### Wave 1: Pane Processing (Day 2)
Depends on: Wave 0 (0.2 for strip_ansi, 0.4 for test fixtures)

| # | Task | Input | Output | Owner | Validation | Status |
|---|---|---|---|---|---|---|
| 1.1 | Implement compute_diff and compute_novelty | SPEC 2, fixtures from 0.4 | diff and novelty functions in pane_processor.py | agent | Diff correctly identifies new content in 40/50 test captures | TODO |
| 1.2 | Implement detect_phase with keyword matching | SPEC 2 | Phase detection in pane_processor.py | agent | Phase accuracy >85% | TODO |
| 1.3 | Implement process_pane combining hash+diff+phase | SPEC 2 | Complete PaneResult pipeline | agent | Integration test: read raw -> PaneResult with correct fields | TODO |
| 1.4 | Write turn gating logic (should_fire) | Step 3 compound | Gating function returns TurnDecision | agent | Correctly gates: WAITING=skip, STUCK(3x)=skip, BUILDING+novelty>0.15=fire | TODO |
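
The gating rules in task 1.4 could be expressed roughly as follows. Two things are assumptions of this sketch: "STUCK(3x)" is read as the pane hash being unchanged for 3 consecutive reads (`pane_hash_streak >= 3`), and phases not covered by the three listed rules default to skip. The fields of `TurnDecision` are also hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TurnDecision:
    fire: bool
    reason: str


NOVELTY_THRESHOLD = 0.15  # from the validation column of task 1.4


def should_fire(phase: str, novelty: float, pane_hash_streak: int) -> TurnDecision:
    """Decide whether this pane read justifies a model call."""
    if phase == "WAITING":
        return TurnDecision(False, "waiting")
    if pane_hash_streak >= 3:
        return TurnDecision(False, "stuck: pane unchanged for 3 reads")
    if phase == "BUILDING" and novelty > NOVELTY_THRESHOLD:
        return TurnDecision(True, "new build output")
    # Assumption: anything else is skipped; ERROR and DONE would need their own rules.
    return TurnDecision(False, "below novelty threshold")
```

Returning a reason string alongside the boolean makes the gated-versus-fired counts in the session log easier to audit.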

### Wave 2: Prompt Assembly + Validation (Day 3)
Depends on: Wave 0 (0.1 for session record), Wave 1 (1.3 for PaneResult)

| # | Task | Input | Output | Owner | Validation | Status |
|---|---|---|---|---|---|---|
| 2.1 | Write V6 system prompt | Compound Step 4 | V6_SYSTEM_PROMPT constant | agent | Human review for clarity, <150 tokens | TODO |
| 2.2 | Implement assemble_context with token budget | SPEC 3 | prompt_assembler.py | agent | Context never exceeds 1800 tokens on 50 test inputs | TODO |
| 2.3 | Implement validate_prompt with all 5 rules | SPEC 4 | validation_gate.py complete | agent | Blocks "status", repeats, near-repeats, destructive, too-short | TODO |
| 2.4 | Implement build_constraint_prompt for regeneration | SPEC 4 | Constraint prompt builder | agent | Constraint prompt includes rejected text and reason | TODO |

### Wave 3: Main Driver (Day 4)
Depends on: Waves 0-2 (all components)

| # | Task | Input | Output | Owner | Validation | Status |
|---|---|---|---|---|---|---|
| 3.1 | Implement V6SessionDriver main loop | SPEC 5, all prior waves | twin_session_driver_v6.py | agent | Dry-run against saved session logs produces valid prompts | TODO |
| 3.2 | Implement auto_detect_session | SPEC 5 | Auto-detection from pane content | agent | Correctly identifies project/goal from 5 test panes | TODO |
| 3.3 | Implement generate_plan | SPEC 5 | Plan generation from project+goal | agent | Generates 5-10 coherent steps for 5 test scenarios | TODO |
| 3.4 | Implement CLI with argparse | SPEC 5 | Full CLI (--project, --goal, --auto-detect, --dry-run, --resume) | agent | All flags parsed correctly, help text clear | TODO |
| 3.5 | Implement adaptive interval logic | Compound Step 6 | Dynamic wait times (20s-60s) | agent | Interval increases after 3 WAITING phases, resets on BUILDING | TODO |
| 3.6 | Implement plan fallback on validation failure | Compound Steps 5-6 | Fallback to next plan step after 3 rejected attempts | agent | After 3 rejections, injects plan step instead of giving up | TODO |
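
Task 3.6 (fall back to the next plan step after 3 rejected attempts) might be structured like the sketch below. The generator and validator are passed in as callables so that no model or gate implementation is assumed; `generate_or_fallback` and the returned source labels are hypothetical.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # rejections allowed before falling back to the plan


def generate_or_fallback(
    generate: Callable[[Optional[str]], str],
    is_valid: Callable[[str], bool],
    plan_steps: list[str],
    plan_index: int,
) -> tuple[Optional[str], str]:
    """Return (prompt, source) where source is 'model', 'plan', or 'none'."""
    rejected: Optional[str] = None
    for _ in range(MAX_ATTEMPTS):
        # The previous rejection is handed back so the generator can avoid it.
        candidate = generate(rejected)
        if is_valid(candidate):
            return candidate, "model"
        rejected = candidate
    if plan_index < len(plan_steps):
        return plan_steps[plan_index], "plan"
    return None, "none"
```

Tagging the result with its source makes it straightforward to count how often the plan had to take over.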

### Wave 4: Testing + Hardening (Day 5)
Depends on: Wave 3 (complete driver)

| # | Task | Input | Output | Owner | Validation | Status |
|---|---|---|---|---|---|---|
| 4.1 | End-to-end dry-run test against all 6 existing session logs | Session logs | Comparison report: V5 actual vs V6 would-have-generated | agent | V6 generates 0 "status" prompts, 0 exact repeats | TODO |
| 4.2 | Stress test: 100-turn session with simulated pane data | Synthetic pane data | No crashes, no memory leaks, session record stays consistent | agent | Driver completes 100 turns without error | TODO |
| 4.3 | Hash stability test on 50 pane captures | Test fixtures | stable_hash produces same hash on cosmetic-only changes | agent | >90% | TODO |
| 4.4 | Phase detection accuracy test | Annotated fixtures | Confusion matrix for 6 phases | agent | >85% | TODO |
| 4.5 | Destructive pattern blocklist test | 200 synthetic prompts (50 destructive) | Zero destructive prompts pass validation | agent | 100% | TODO |
| 4.6 | Token budget overflow test | Edge-case pane outputs (500 lines, binary, huge traces) | Prompt never exceeds 1800 tokens | agent | 0 overflows on 20 edge cases | TODO |

### Wave 5: Live Integration (Day 6-7)
Depends on: Wave 4 (all tests pass)

| # | Task | Input | Output | Owner | Validation | Status |
|---|---|---|---|---|---|---|
| 5.1 | Live dry-run test: drive 3 real sessions without injecting | 3 active tmux panes | JSONL logs showing what V6 would inject | human | Review logs, confirm no status/repeats, confirm project coherence | TODO |
| 5.2 | Live test: drive 1 real session with injection on a non-critical project | 1 test pane | Actual session driven by V6 | human | Session makes progress, no harmful prompts injected | TODO |
| 5.3 | A/B comparison: run V5 and V6 on parallel sessions, same project | 2 panes, same seed | Side-by-side comparison of turn logs | human | V6 has fewer wasted turns, more progress per turn | TODO |
| 5.4 | Deploy to all 5 Macs via meshd | Working V6 driver | Driver accessible on mac1-5 | agent | `karl-twin-v6 mac2 agent-codex:1.1 --dry-run` works from mac1 | TODO |

### Wave 6: NATS + Observability (Day 7-8, Optional)
Depends on: Wave 5 (live driver working)

| # | Task | Input | Output | Owner | Validation | Status |
|---|---|---|---|---|---|---|
| 6.1 | Add NATS turn event publishing (fire-and-forget) | Step 8b compound | NATS events on karl.twin.turn | agent | Events appear in NATS monitor when driver runs | TODO |
| 6.2 | Add NATS idle detection (optional enhancer) | Step 8c compound | NATS subscriber for tool events | agent | Falls back gracefully when NATS offline | TODO |
| 6.3 | Add metrics tracking (turns fired/gated, validation rejections) | Step 8d compound | Metrics in session record + summary log | agent | Metrics correctly count all categories | TODO |
| 6.4 | Add session summary output at completion | -- | Print summary: turns, phases, rejections, efficiency | agent | Summary is printed after every session | TODO |

### Wave 7: Retraining Data Prep (Day 8-10, Phase 2)
Depends on: Wave 5 (live session data from V6)

| # | Task | Input | Output | Owner | Validation | Status |
|---|---|---|---|---|---|---|
| 7.1 | Extract contrastive pairs from V5 session logs | 126 existing turns | DPO-format pairs: (good_prompt, bad_prompt) | agent | 50+ contrastive pairs from status and repeat failures | TODO |
| 7.2 | Generate 200 synthetic session-driving scenarios | Project templates | SFT examples in ChatML format | agent | 200 examples covering error/progress/stuck/done scenarios | TODO |
| 7.3 | Merge V5 training data + V6 session-driving data | V4 train.jsonl + new data | Combined train/valid JSONL | agent | No data leakage between train and valid | TODO |
| 7.4 | Train V6 LoRA adapter on Mac5 | Merged training data | V6 adapter weights | agent | Validation loss < V5 (currently 2.051 NLL) | TODO |
| 7.5 | Evaluate V6 model on session-driving test set | Test set from 7.1 | Status generation rate, repeat rate, coherence score | human | Status rate <5% | TODO |

---

Dependency Graph

Wave 0 (Foundation)
  0.1  0.2  0.3  0.4      (all parallel)
  |    |    |    |
  v    v    v    v
Wave 1 (Pane Processing)
  1.1 depends on 0.2, 0.4
  1.2 depends on 0.4
  1.3 depends on 1.1, 1.2
  1.4 depends on 1.3, 0.1
  |
  v
Wave 2 (Prompt + Validation)
  2.1 no deps beyond writing
  2.2 depends on 0.1, 1.3
  2.3 depends on 0.3
  2.4 depends on 2.3
  |
  v
Wave 3 (Main Driver)
  3.1 depends on ALL of Waves 0-2
  3.2, 3.3 depend on 0.1
  3.4 depends on 3.1
  3.5, 3.6 depend on 3.1
  |
  v
Wave 4 (Testing)
  All depend on Wave 3
  |
  v
Wave 5 (Live)
  All depend on Wave 4
  |
  v
Wave 6 (NATS, optional)     Wave 7 (Retrain, Phase 2)
  Depends on Wave 5          Depends on Wave 5

Critical path: 0.2/0.4 -> 1.1 -> 1.3 -> 2.2 -> 3.1 -> 4.1 -> 5.2

Total estimated effort:
- Waves 0-4: 5 agent sessions (~3-5 hours each)
- Wave 5: 2 human-supervised sessions
- Wave 6: 2 agent sessions (optional)
- Wave 7: 3 agent sessions (Phase 2)
- Critical path to first live test: 5 days

---

Pulse Auto-Spawn Candidates

Tasks tagged for automated Pulse dispatch:

| Task | Session Type | Reason |
|---|---|---|
| 0.1 | agent | Pure Python module, no human judgment needed |
| 0.2 | agent | Regex + hashing, testable in isolation |
| 0.3 | agent | Single function, testable in isolation |
| 1.1-1.4 | agent | Algorithmic work with clear test criteria |
| 2.1-2.4 | agent | Prompt templates + validation logic |
| 3.1-3.6 | agent | Integration, but all components available |
| 4.1-4.6 | agent | Test writing, all testable |
| 5.1-5.3 | human | Requires watching live sessions |
| 6.1-6.4 | agent | NATS integration, testable |
| 7.1-7.3 | agent | Data processing |
| 7.4-7.5 | human | Training requires monitoring, evaluation requires judgment |

Agent-dispatchable: 26/31 tasks (84%)
Human-required: 5/31 tasks (16%)

---

Kill Criteria

Day 5: If dry-run test (4.1) shows V6 generating >5%
Day 7: If live test (5.2) shows V6 injecting harmful or incoherent prompts, halt. Review validation gate.
Day 14: If V6 has not been used for 3+ real sessions by human choice, the driver is not providing value. Review whether the architecture is too conservative or too aggressive.

---

File Summary

| File | Lines | Purpose |
|---|---|---|
| `karl/session_manager.py` | ~120 | Session record CRUD |
| `karl/pane_processor.py` | ~150 | ANSI strip, hash, diff, phase detect |
| `karl/prompt_assembler.py` | ~130 | Context assembly with token budget |
| `karl/validation_gate.py` | ~100 | Post-generation validation + dedup |
| `twin_session_driver_v6.py` | ~250 | Main driver + CLI |
| `tests/test_session_manager.py` | ~80 | Unit tests |
| `tests/test_pane_processor.py` | ~120 | Unit + integration tests |
| `tests/test_validation_gate.py` | ~80 | Validation rule tests |
| `tests/test_driver_v6.py` | ~100 | Integration tests with mocked model |
| `tests/fixtures/pane_captures/` | 50 files | Real pane output for testing |
| Total | ~1130 | ~750 source + ~380 tests |

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

evo-cube-output/karl-v6-session-driver/stage3-expand-master-plan.md
