Grand Diomande Research · Full HTML Reader

Stage 3: EXPAND + MASTER PLAN -- KARL V6 Session Driver

---

3a. Risk Audit

CRITICAL RISKS

R1: Terminal Diff Fragility
- Failure scenario: ANSI escape codes, cursor repositioning, wrapped lines, and partial terminal renders cause the diff algorithm to produce garbage. The pane hash changes on every read (due to timestamp updates or animated spinners), defeating the "skip unchanged" optimization.
- Probability: HIGH (70%). Terminal output is inherently messy. tmux capture-pane includes invisible characters.
- Impact: HIGH. Without reliable diff, every turn fires the model with junk context, recreating V5's problems.
- Mitigation: Multi-layer stripping: (1) `strip_ansi()` removes escape sequences via regex `\x1b\[[0-9;]*[a-zA-Z]`, (2) normalize whitespace, (3) hash only alphanumeric content (ignore formatting), (4) use a "stable hash" that ignores timestamps matching the `\d{2}:\d{2}:\d{2}` pattern.
- Validation: Test against 100 real tmux captures (saved from current sessions). Hash stability rate must be >90%.

R2: Phase Detection False Positives
- Failure scenario: Phase detector sees "running" in a file path (`/app/running-config.ts`) and classifies as WAITING. Or sees "error" in a variable name and classifies as ERROR. The session stalls or takes wrong actions.
- Probability: MEDIUM (40%)
- Impact: HIGH. Wrong phase = wrong action. WAITING when BUILDING means the driver waits forever.
- Mitigation: (1) Only match keywords in the LAST 3 lines (not all 80), (2) require keyword to be the dominant signal (not embedded in longer text), (3) add a "confidence" score -- if multiple conflicting signals, default to BUILDING (safest), (4) timeout: any phase held for >120s without change auto-transitions to STUCK.
- Validation: Annotate 50 real pane snapshots with correct phase. Phase detector accuracy must be >85%.
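
Mitigations (1) to (3) for R2 could look roughly like the sketch below. The keyword lists, the function name `detect_phase_from_tail`, and the vote-share confidence are illustrative assumptions, not part of the spec; the 120s STUCK timeout from (4) is left out.

```python
import re

# Hypothetical keyword sets; real lists would come from annotated pane captures.
PHASE_KEYWORDS = {
    "ERROR": ("error", "failed", "traceback"),
    "WAITING": ("running", "waiting", "thinking"),
    "DONE": ("done", "completed", "finished"),
}


def detect_phase_from_tail(lines, tail=3):
    """Return (phase, confidence) using only the last `tail` non-blank lines."""
    recent = [line for line in lines if line.strip()][-tail:]
    votes = {}
    for line in recent:
        for phase, keywords in PHASE_KEYWORDS.items():
            for kw in keywords:
                # Standalone word only: not part of a path, file name, or identifier.
                pattern = rf'(?<![\w./-]){kw}(?![\w/-]|\.\w)'
                if re.search(pattern, line, re.IGNORECASE):
                    votes[phase] = votes.get(phase, 0) + 1
                    break  # at most one vote per phase per line
    if not votes:
        return "BUILDING", 0.0  # no signal: safest default
    best = max(votes.values())
    leaders = [phase for phase, count in votes.items() if count == best]
    if len(leaders) > 1:
        return "BUILDING", 0.0  # conflicting signals: safest default
    return leaders[0], best / sum(votes.values())
```

With this shape, `running` inside `/app/running-config.ts` and `error` inside `error_count` do not vote, while a trailing `Tests running...` line does.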

R3: Model Generates Harmful Prompts
- Failure scenario: The 4B model generates a prompt that causes Claude to delete files, push to wrong branches, deploy broken code, or modify production data. The validation gate (Step 5) only checks for repetition and status, not destructive commands.
- Probability: LOW (15%)
- Impact: CRITICAL. Data loss, broken deployments.
- Mitigation: Add a DESTRUCTIVE_PATTERNS blocklist to the validation gate: `['rm -rf', 'git push --force', 'drop table', 'git reset --hard', 'DELETE FROM', 'kill -9']`. Any generated prompt matching these patterns is rejected outright.
- Validation: Test with 200 generated prompts. Zero destructive prompts must pass validation.

MEDIUM RISKS

R4: Plan Generation Quality
- Failure scenario: The 4B model generates a plan with wrong file paths, wrong commands, wrong dependencies. The plan is used as fallback when the model fails validation 3 times, so a bad plan injects bad prompts.
- Probability: MEDIUM (50%)
- Impact: MEDIUM. Bad plan steps waste turns but don't cause data loss (Claude itself validates before executing).
- Mitigation: (1) Keep plan steps vague ("Create the health API route") not specific ("Create /app/api/health/route.ts with exact implementation"), (2) plan is a HINT, model can deviate, (3) validate plan steps against the same blocklist as generated prompts.
- Validation: Generate plans for 10 known projects. Human review for basic coherence.

R5: Session Record Corruption
- Failure scenario: Driver crashes mid-write, leaving a partial JSON file. Next driver start fails to load the session record and creates a new one, losing all history.
- Probability: LOW-MEDIUM (25%)
- Impact: MEDIUM. Loss of session history means temporary amnesia, but the session can recover from the pane state.
- Mitigation: (1) Atomic write: write to `{session_id}.tmp.json`, then rename, (2) backup: keep previous version as `{session_id}.bak.json`, (3) recovery: if main file is corrupt, try loading backup.
- Validation: Kill the driver process during a write. Verify recovery from backup.
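
The recovery path in mitigation (3) might be sketched as follows. `load_json_with_backup` is a hypothetical helper name, and it returns a plain dict rather than a `SessionRecord` to keep the example self-contained.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_json_with_backup(path: Path) -> Optional[dict]:
    """Load `path`; if it is missing or corrupt, fall back to the `.bak.json` sibling."""
    for candidate in (path, path.with_suffix('.bak.json')):
        try:
            with open(candidate) as f:
                return json.load(f)
        except (OSError, ValueError):
            continue  # missing, unreadable, or partially written: try the next file
    return None


# Simulate a crash that left the main file half-written next to a good backup.
with tempfile.TemporaryDirectory() as d:
    main = Path(d) / "mac1_pane0_1700000000.json"
    main.with_suffix('.bak.json').write_text('{"turn_number": 4}')
    main.write_text('{"turn_number": 5, "hist')
    recovered = load_json_with_backup(main)
    missing = load_json_with_backup(Path(d) / "no_such_session.json")
```

Here `recovered` holds the backup's contents and `missing` is `None`, which the caller can treat as "create a new session".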

R6: Token Budget Overflow
- Failure scenario: The pane diff is unexpectedly large (Claude outputs a 200-line error trace); combined with history and system prompt, the total exceeds 2048 tokens. The model receives truncated context and generates poor output.
- Probability: MEDIUM (35%)
- Impact: MEDIUM. Truncated context leads to confused generation, but the validation gate catches most bad outputs.
- Mitigation: (1) Hard cap diff at 20 lines (take first 5 + last 15), (2) hard cap total prompt at 1800 tokens before generation budget, (3) if over budget, drop plan_steps first, then truncate history to last 3 entries.
- Validation: Test with edge-case pane outputs (500-line error dumps, binary output, huge file listings).
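
The "first 5 + last 15" cap from mitigation (1) is small enough to show directly. This is a minimal sketch; `cap_diff` is a hypothetical name.

```python
def cap_diff(diff_lines, head=5, tail=15):
    """Cap a diff at head + tail lines, keeping the start and the end."""
    if len(diff_lines) <= head + tail:
        return list(diff_lines)
    # The start usually names the failing command; the end carries the final error.
    return diff_lines[:head] + diff_lines[-tail:]
```

A 200-line trace comes back as 20 lines; anything already at or under 20 lines passes through unchanged.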

R7: MLX Server Unavailability
- Failure scenario: Mac5 is offline, MLX server crashed, or network timeout. Every model call fails. The driver has no LLM to generate prompts.
- Probability: MEDIUM (30%)
- Impact: MEDIUM-HIGH. Without the model, the driver can only inject plan steps mechanically.
- Mitigation: (1) Plan-only fallback mode: step through plan_steps without model calls, (2) retry with exponential backoff on MLX failures, (3) health check MLX at session start, warn if unreachable, (4) optional: fall back to a different endpoint (local ollama, cloud API).
- Validation: Start a session with MLX offline. Verify plan-only mode works.
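
Mitigations (1) and (2) for R7 combine naturally. The sketch below assumes the model call is passed in as a zero-argument callable that raises an `OSError` subclass when the server is unreachable; `query_with_backoff` and `next_prompt` are hypothetical names, and the retry count and delays are placeholders.

```python
import time


def query_with_backoff(query, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `query`, retrying on connection-style errors with doubling delays."""
    for attempt in range(retries):
        try:
            return query()
        except OSError:  # ConnectionError and TimeoutError are OSError subclasses
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return None  # signal the caller to fall back to plan-only mode


def next_prompt(query, plan_steps, plan_index, **backoff):
    """Prefer the model; if it is unreachable, inject the next plan step as-is."""
    generated = query_with_backoff(query, **backoff)
    if generated is not None:
        return generated
    if plan_index < len(plan_steps):
        return plan_steps[plan_index]
    return None  # no model and no plan left: skip this turn
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting.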

LOW RISKS

R8: Adaptive Interval Too Slow
- Failure scenario: 60s max_interval means the driver misses a rapid Claude output that needed immediate direction, causing Claude to idle for a full minute.
- Probability: LOW (20%)
- Impact: LOW. Wasted time, not wasted quality.
- Mitigation: (1) After injecting a prompt, use min_interval (20s) for the next 3 reads, (2) only use max_interval after 3 consecutive WAITING phases.
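
Read literally, the two rules reduce to one condition. A minimal sketch, assuming the driver tracks `reads_since_inject` and `waiting_streak` (both hypothetical counters):

```python
def next_interval(reads_since_inject, waiting_streak,
                  min_interval=20.0, max_interval=60.0):
    """Seconds to wait before the next pane read."""
    # Back off only when no prompt was injected recently AND the pane has been
    # WAITING for 3 consecutive reads; otherwise stay responsive.
    if reads_since_inject >= 3 and waiting_streak >= 3:
        return max_interval
    return min_interval
```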

R9: NATS Integration Breaks Core Driver
- Failure scenario: NATS code throws an unhandled exception that crashes the main loop.
- Probability: LOW (10%)
- Impact: LOW if properly isolated. HIGH if not.
- Mitigation: Every NATS call is wrapped in a `try/except` that logs a warning and continues. NATS failures never block or crash the driver.
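
One way to enforce this isolation uniformly is a decorator, sketched below. `never_raise` and `publish_turn_event` are hypothetical names, and `client.publish` stands in for whichever NATS client the driver ends up using; no specific client API is assumed.

```python
import functools
import logging

log = logging.getLogger("karl.nats")


def never_raise(fn):
    """Log any exception from `fn` as a warning and return None instead of raising."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:  # deliberately broad: telemetry must not kill the loop
            log.warning("NATS call %s failed: %s", fn.__name__, exc)
            return None
    return wrapper


@never_raise
def publish_turn_event(client, payload):
    # Placeholder call shape; adapt to the real client's publish signature.
    client.publish("karl.twin.turn", payload)
```

Unlike a bare `except: pass`, this keeps a log trail, so a silently dead NATS connection is still visible after the session.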

R10: Session ID Collision
- Failure scenario: Two sessions on different machines have the same pane ID, causing session record overwrites.
- Probability: LOW (5%)
- Impact: LOW. Confusion in logs.
- Mitigation: Session ID format: `{machine}_{pane_id}_{timestamp}`. Unique by construction.

---

3b. Expanded Specifications

SPEC 1: Session Record Manager (`session_manager.py`)

Purpose: CRUD operations for session records on disk.

File location: `Desktop/karl/karl/session_manager.py`

Interface:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path


@dataclass(kw_only=True)  # kw_only (Python 3.10+) lets required fields follow defaulted ones
class SessionRecord:
    version: int = 6
    session_id: str
    machine: str
    pane_id: str
    project: str
    goal: str
    plan_steps: list[str]
    plan_index: int = 0  # Next uncompleted step
    phase: str = "STARTING"
    turn_number: int = 0
    history: list[dict]  # max 8 entries
    last_3_prompts: list[str]
    pane_hash: str | None = None
    pane_hash_streak: int = 0
    prev_lines: list[str] = field(default_factory=list)  # last 40 lines for diff
    idle_seconds: float = 0
    created_at: str
    updated_at: str


SESSION_DIR = Path.home() / ".karl-sessions"


# PaneResult is defined in pane_processor.py (SPEC 2)
def load_session(session_id: str) -> SessionRecord | None: ...
def save_session(session: SessionRecord) -> bool: ...  # atomic write
def create_session(machine, pane_id, project, goal, plan_steps=None) -> SessionRecord: ...
def update_turn(session: SessionRecord, generated_prompt: str, pane_result: PaneResult) -> None: ...
def get_next_plan_step(session: SessionRecord) -> str | None: ...
def advance_plan(session: SessionRecord) -> None: ...
```

Atomic write pattern:

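```python
import json
import os
import shutil
from dataclasses import asdict

def save_session(session: SessionRecord) -> bool:
    SESSION_DIR.mkdir(parents=True, exist_ok=True)
    path = SESSION_DIR / f"{session.session_id}.json"
    tmp = path.with_suffix('.tmp.json')
    bak = path.with_suffix('.bak.json')
    with open(tmp, 'w') as f:
        json.dump(asdict(session), f, indent=2, default=str)
    if path.exists():
        shutil.copy2(path, bak)  # keep the last good version for recovery
    os.replace(tmp, path)  # atomic rename; also overwrites an existing target on Windows
    return True
```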

Size: ~120 lines.

---

SPEC 2: Pane Processor (`pane_processor.py`)

Purpose: Read, clean, hash, diff pane output.

File location: `Desktop/karl/karl/pane_processor.py`

Interface:

```python
from dataclasses import dataclass


@dataclass
class PaneResult:
    changed: bool
    diff: list[str]       # New lines since last read
    novelty: float        # 0.0-1.0
    all_lines: list[str]  # Full cleaned output
    raw_line_count: int


# SessionRecord is defined in session_manager.py (SPEC 1)
def strip_ansi(text: str) -> str: ...
def stable_hash(lines: list[str]) -> str: ...  # Ignores timestamps, whitespace
def compute_diff(prev_lines: list[str], curr_lines: list[str]) -> list[str]: ...
def compute_novelty(diff_lines: list[str]) -> float: ...
def process_pane(session: "SessionRecord", raw_output: str) -> PaneResult: ...
def is_spinner_line(line: str) -> bool: ...
def detect_phase(session: "SessionRecord", pane: PaneResult) -> str: ...
```

ANSI stripping regex:

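```python
import re

# CSI sequences (colors, cursor movement, private modes) and OSC sequences (e.g. window titles)
ANSI_RE = re.compile(r'\x1b\[[0-?]*[ -/]*[@-~]|\x1b\].*?\x07')
TIMESTAMP_RE = re.compile(r'\d{1,2}:\d{2}(:\d{2})?(\s*(AM|PM))?')
# Matches a single spinner glyph; apply it to a line's leading character,
# since "/", "-", and "|" also occur in ordinary text
SPINNER_RE = re.compile(r'[⠋⠙⠹⠸⠼⠴⠦⠧⠇⠏|/\\-]')
```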

Stable hash:

```python
import hashlib
import re

def stable_hash(lines: list[str]) -> str:
    # Remove timestamps, normalize whitespace, keep only alphanumeric content
    normalized = []
    for line in lines:
        clean = TIMESTAMP_RE.sub('', line)
        clean = re.sub(r'[^A-Za-z0-9\s]', '', clean)  # drop punctuation and spinner glyphs
        clean = re.sub(r'\s+', ' ', clean).strip()
        if len(clean) > 3:  # Skip noise lines
            normalized.append(clean)
    return hashlib.md5('\n'.join(normalized).encode()).hexdigest()[:12]
```
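
To make the intended stability concrete, here is a self-contained sketch that pairs a possible `strip_ansi` with the hash and compares three simulated pane reads. The sample captures are invented for illustration, and the ASCII-only alphanumeric filter is an assumption about what "alphanumeric content" should mean.

```python
import hashlib
import re

# CSI sequences (colors, cursor moves) and BEL-terminated OSC sequences.
ANSI_RE = re.compile(r'\x1b\[[0-?]*[ -/]*[@-~]|\x1b\][^\x07]*\x07')
TIMESTAMP_RE = re.compile(r'\d{1,2}:\d{2}(:\d{2})?(\s*(AM|PM))?')


def strip_ansi(text):
    """Remove terminal escape sequences, leaving the printable text."""
    return ANSI_RE.sub('', text)


def stable_hash(lines):
    """Hash only the alphanumeric content, ignoring timestamps and formatting."""
    normalized = []
    for line in lines:
        clean = TIMESTAMP_RE.sub('', strip_ansi(line))
        clean = re.sub(r'[^A-Za-z0-9\s]', '', clean)
        clean = re.sub(r'\s+', ' ', clean).strip()
        if len(clean) > 3:  # skip noise lines such as bare spinners
            normalized.append(clean)
    return hashlib.md5('\n'.join(normalized).encode()).hexdigest()[:12]


# Two reads of the same pane: only the clock, the color codes, and the spinner differ.
first = ["\x1b[32m✓ build passed\x1b[0m  12:04:31", "⠋ waiting for tests"]
second = ["\x1b[1;32m✓ build passed\x1b[0m  12:04:36", "⠙ waiting for tests"]
# A third read where the content really changed.
changed = ["\x1b[31m✗ build failed\x1b[0m  12:04:41", "⠙ waiting for tests"]

print(stable_hash(first) == stable_hash(second))   # True
print(stable_hash(first) == stable_hash(changed))  # False
```

The trade-off is that a change made only of digits in `H:MM` form, or only of punctuation, is invisible to the hash.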

Diff algorithm:

```python
def compute_diff(prev_lines: list[str], curr_lines: list[str]) -> list[str]:
    if not prev_lines:
        return curr_lines[-30:]

    # Anchor on the last 10 non-blank lines of the previous read
    prev_set = {l.strip() for l in prev_lines[-10:] if l.strip()}
    overlap_idx = None  # Index just past the last line of curr also seen in prev

    # Scan backwards so the most recent shared line wins
    for i in range(len(curr_lines) - 1, -1, -1):
        if curr_lines[i].strip() in prev_set:
            overlap_idx = i + 1
            break

    if overlap_idx is None:
        # No clear overlap found -- take last 20 lines
        return curr_lines[-20:]

    # Everything after the overlap is new (empty if nothing changed); cap at 30 lines
    return curr_lines[overlap_idx:][-30:]
```

Size: ~150 lines.

---

SPEC 3: Prompt Assembler (`prompt_assembler.py`)

Purpose: Build the model prompt from session state and pane data.

File location: `Desktop/karl/karl/prompt_assembler.py`

Interface:

```python
V6_SYSTEM_PROMPT: str  # Fixed system prompt for V6

# SessionRecord and PaneResult come from SPEC 1 and SPEC 2
def assemble_context(session: "SessionRecord", pane: "PaneResult") -> list[dict]: ...
def format_history(history: list[dict]) -> str: ...
def format_plan_hint(plan_steps: list[str], plan_index: int) -> str: ...
def format_diff(diff_lines: list[str], phase: str) -> str: ...
def format_anti_repeat(last_3: list[str]) -> str: ...
def count_tokens_approx(text: str) -> int: ...  # word_count * 1.3
def trim_to_budget(components: dict, max_tokens: int) -> dict: ...
```

Token budget manager:

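```python
import re

MAX_TOTAL = 1800  # Leave 200+ for generation
PRIORITIES = ['system', 'identity', 'anti_repeat', 'diff', 'history', 'plan']
PROTECTED = {'system', 'identity', 'anti_repeat'}  # Never dropped or truncated


def count_tokens_approx(text: str) -> int:
    return int(len(text.split()) * 1.3)  # word_count * 1.3


def truncate_text(text: str, max_tokens: int) -> str:
    # Assumed helper (not specified in this plan): keep the newest words that fit
    keep = int(max_tokens / 1.3)
    chunks = re.findall(r'\S+\s*', text)  # Words with their trailing whitespace
    return ''.join(chunks[-keep:]) if keep > 0 else ''


def trim_to_budget(components: dict, max_tokens: int = MAX_TOTAL) -> dict:
    components = dict(components)  # Work on a copy; leave the caller's dict intact
    total = sum(count_tokens_approx(v) for v in components.values())

    # Drop in reverse priority order: plan first, then history, then diff
    for key in reversed(PRIORITIES):
        if total <= max_tokens:
            break
        if key in PROTECTED or not components.get(key):
            continue
        excess = total - max_tokens
        component_tokens = count_tokens_approx(components[key])
        if component_tokens > excess:
            # Truncate this component to absorb the excess
            components[key] = truncate_text(components[key], component_tokens - excess)
        else:
            # Drop entirely
            components[key] = ''
        total = sum(count_tokens_approx(v) for v in components.values())
    return components
```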

Size: ~130 lines.

---

SPEC 4: Validation Gate (`validation_gate.py`)

Purpose: Post-generation validation and regeneration logic.

File location: `Desktop/karl/karl/validation_gate.py`

Interface:

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    valid: bool
    reason: str | None = None  # status_blocked, exact_repeat, near_repeat, too_short, off_topic, destructive
    action: str | None = None  # regenerate, regenerate_with_constraint, block


BANNED_PHRASES: list[str]  # status variants
DESTRUCTIVE_PATTERNS: list[str]  # rm -rf, push --force, etc.

# SessionRecord is defined in session_manager.py (SPEC 1)
def validate_prompt(generated: str, session: "SessionRecord") -> ValidationResult: ...
def similarity(a: str, b: str) -> float: ...  # Jaccard on word trigrams
def has_project_relevance(prompt: str, session: "SessionRecord") -> bool: ...
def build_constraint_prompt(original_context: list, reason: str, rejected: str) -> list[dict]: ...
```

Banned phrases (comprehensive):

```python
BANNED_PHRASES = [
    'status', 'check status', 'show status', 'what is the status',
    "what's the status", 'give me status', 'get status',
    'current status', 'project status',
]

# Regex sources; match case-insensitively so "drop table" is caught as well as "DROP TABLE"
DESTRUCTIVE_PATTERNS = [
    r'rm\s+-rf\s',
    r'git\s+push\s+--force',
    r'git\s+reset\s+--hard',
    r'DROP\s+TABLE',
    r'DELETE\s+FROM',
    r'git\s+checkout\s+\.',
    r'git\s+clean\s+-[fd]',
    r'kill\s+-9\s+\d',
    r'pkill\s+-9',
    r'sudo\s+rm',
]
```

Size: ~100 lines.
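
A minimal sketch of how the gate's rules might be ordered follows. It makes several assumptions that are not in the spec: banned phrases are matched against the whole normalized prompt, "too short" means fewer than 3 words, the near-repeat threshold is 0.6, and `validate_prompt` takes the recent prompts directly instead of a `SessionRecord`. It also swaps in character trigrams for the word trigrams named above, because word trigrams are empty for prompts shorter than three words and would score `similarity("status", "check status")` as 0.

```python
import re
from dataclasses import dataclass
from typing import Optional

BANNED_PHRASES = {
    'status', 'check status', 'show status', 'what is the status',
    "what's the status", 'give me status', 'get status',
    'current status', 'project status',
}
DESTRUCTIVE_PATTERNS = [
    r'rm\s+-rf\s', r'git\s+push\s+--force', r'git\s+reset\s+--hard',
    r'DROP\s+TABLE', r'DELETE\s+FROM', r'git\s+checkout\s+\.',
    r'git\s+clean\s+-[fd]', r'kill\s+-9\s+\d', r'pkill\s+-9', r'sudo\s+rm',
]
DESTRUCTIVE_RE = re.compile('|'.join(DESTRUCTIVE_PATTERNS), re.IGNORECASE)


@dataclass
class ValidationResult:
    valid: bool
    reason: Optional[str] = None
    action: Optional[str] = None


def char_trigrams(text):
    t = ' '.join(text.lower().split())
    return {t[i:i + 3] for i in range(len(t) - 2)}


def similarity(a, b):
    """Jaccard similarity on character trigrams (a swap for word trigrams)."""
    ta, tb = char_trigrams(a), char_trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def validate_prompt(generated, last_prompts):
    text = generated.strip()
    if DESTRUCTIVE_RE.search(text):
        return ValidationResult(False, 'destructive', 'block')
    if text.lower().strip(' ?!.') in BANNED_PHRASES:  # assumed: whole-prompt match
        return ValidationResult(False, 'status_blocked', 'regenerate_with_constraint')
    if len(text.split()) < 3:  # assumed threshold for "too short"
        return ValidationResult(False, 'too_short', 'regenerate')
    for prev in last_prompts:
        if text.lower() == prev.strip().lower():
            return ValidationResult(False, 'exact_repeat', 'regenerate_with_constraint')
        if similarity(text, prev) > 0.6:  # assumed near-repeat threshold
            return ValidationResult(False, 'near_repeat', 'regenerate_with_constraint')
    return ValidationResult(True)
```

The destructive check runs first so that a harmful prompt is blocked outright rather than routed into regeneration.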

---

SPEC 5: V6 Main Driver (`twin_session_driver_v6.py`)

Purpose: Main loop tying all components together. CLI entry point.

File location: `Desktop/karl/twin_session_driver_v6.py`

Interface:

```python
class V6SessionDriver:
    def __init__(self, machine, pane_id, project, goal, plan_steps=None,
                 max_turns=30, min_interval=20, max_interval=60,
                 mlx_url=None, dry_run=False): ...
    def run(self) -> "SessionSummary": ...
    def _read_pane(self) -> str | None: ...
    def _inject(self, prompt: str) -> bool: ...
    def _query_model(self, messages: list[dict]) -> str: ...
    def _query_with_constraint(self, messages: list, reason: str, rejected: str) -> str: ...


def auto_detect_session(raw_output: str, mlx_url: str) -> dict: ...
def generate_plan(project: str, goal: str, mlx_url: str) -> list[str]: ...


def main(): ...  # CLI via argparse
```

CLI:

```bash
# Seeded session with explicit goal
karl-twin-v6 mac1 agent-codex:1.1 \
  --project meshd-dashboard \
  --goal "Build health polling dashboard" \
  --turns 30

# Auto-detect session from pane content
karl-twin-v6 mac1 agent-codex:1.1 --auto-detect

# Dry run (log what would be injected)
karl-twin-v6 mac1 agent-codex:1.1 \
  --project meshd-dashboard \
  --goal "Build health polling dashboard" \
  --dry-run

# Resume existing session
karl-twin-v6 mac1 agent-codex:1.1 --resume
```

Size: ~250 lines (main driver + CLI).
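
The invocations above suggest an argparse layout along these lines. This is a sketch: making `--auto-detect` and `--resume` mutually exclusive, and requiring `--project` plus `--goal` for a seeded session, are assumptions rather than stated requirements.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="karl-twin-v6",
        description="Drive a tmux pane session with the KARL V6 driver.",
    )
    parser.add_argument("machine", help="target machine, e.g. mac1")
    parser.add_argument("pane_id", help="tmux pane, e.g. agent-codex:1.1")
    parser.add_argument("--project", help="project name for a seeded session")
    parser.add_argument("--goal", help="session goal for a seeded session")
    parser.add_argument("--turns", type=int, default=30, help="maximum turns")
    parser.add_argument("--dry-run", action="store_true",
                        help="log what would be injected without injecting")
    mode = parser.add_mutually_exclusive_group()  # assumed to be exclusive
    mode.add_argument("--auto-detect", action="store_true",
                      help="infer project and goal from pane content")
    mode.add_argument("--resume", action="store_true",
                      help="resume the existing session for this pane")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: a seeded session needs both --project and --goal.
    seeded = not (args.auto_detect or args.resume)
    if seeded and not (args.project and args.goal):
        parser.error("--project and --goal are required unless "
                     "--auto-detect or --resume is given")
    return args
```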

---

3c. Master Execution Checklist

### Wave 0: Foundation (Day 1)
No dependencies. Can start immediately.

| # | Task | Input | Output | Owner | Validation | Status |
|---|------|-------|--------|-------|------------|--------|
| 0.1 | Create `[home-path]` directory and session record schema | SPEC 1 | `session_manager.py` with dataclass + CRUD | agent | `pytest test_session_manager.py` -- create, save, load, atomic write, backup recovery | TODO |
| 0.2 | Build ANSI stripping + stable hash functions | SPEC 2 | `pane_processor.py` with strip_ansi, stable_hash | agent | Test against 20 saved tmux captures, hash stability >90% | TODO |
| 0.3 | Implement word-trigram similarity function | SPEC 4 | `validation_gate.py` with similarity() | agent | similarity("status", "check status") > 0.3, similarity("build X", "deploy Y") < 0.3 | TODO |
| 0.4 | Save 50 real pane captures from current sessions for testing | Session logs | `Desktop/karl/tests/fixtures/pane_captures/` (50 text files) | agent | Files exist, cover BUILDING/WAITING/ERROR/DONE phases | TODO |

### Wave 1: Pane Processing (Day 2)
Depends on: Wave 0 (0.2 for strip_ansi, 0.4 for test fixtures)

| # | Task | Input | Output | Owner | Validation | Status |
|---|------|-------|--------|-------|------------|--------|
| 1.1 | Implement compute_diff and compute_novelty | SPEC 2, fixtures from 0.4 | diff and novelty functions in pane_processor.py | agent | Diff correctly identifies new content in 40/50 test captures | TODO |
| 1.2 | Implement detect_phase with keyword matching | SPEC 2 | Phase detection in pane_processor.py | agent | Phase accuracy >85% | TODO |
| 1.3 | Implement process_pane combining hash+diff+phase | SPEC 2 | Complete PaneResult pipeline | agent | Integration test: read raw -> PaneResult with correct fields | TODO |
| 1.4 | Write turn gating logic (should_fire) | Step 3 compound | Gating function returns TurnDecision | agent | Correctly gates: WAITING=skip, STUCK(3x)=skip, BUILDING+novelty>0.15=fire | TODO |
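
The three gating rules listed for task 1.4 could be sketched as below. The fields of `TurnDecision` are hypothetical, and "STUCK(3x)" is read here as "the pane hash has been unchanged for 3 consecutive reads", which is one interpretation of the shorthand. Phases the checklist does not mention (ERROR, DONE) fall through to a no-fire default that a real driver would need to decide on.

```python
from dataclasses import dataclass


@dataclass
class TurnDecision:
    fire: bool
    reason: str


def should_fire(phase, novelty, pane_hash_streak,
                novelty_threshold=0.15, stuck_streak=3):
    """Decide whether this read should trigger a model call."""
    if phase == "WAITING":
        return TurnDecision(False, "waiting")
    if pane_hash_streak >= stuck_streak:
        return TurnDecision(False, "unchanged")  # assumed meaning of STUCK(3x)
    if phase == "BUILDING" and novelty > novelty_threshold:
        return TurnDecision(True, "new_output")
    return TurnDecision(False, "no_rule_matched")
```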

### Wave 2: Prompt Assembly + Validation (Day 3)
Depends on: Wave 0 (0.1 for session record), Wave 1 (1.3 for PaneResult)

| # | Task | Input | Output | Owner | Validation | Status |
|---|------|-------|--------|-------|------------|--------|
| 2.1 | Write V6 system prompt | Compound Step 4 | V6_SYSTEM_PROMPT constant | agent | Human review for clarity, <150 tokens | TODO |
| 2.2 | Implement assemble_context with token budget | SPEC 3 | prompt_assembler.py | agent | Context never exceeds 1800 tokens on 50 test inputs | TODO |
| 2.3 | Implement validate_prompt with all 5 rules | SPEC 4 | validation_gate.py complete | agent | Blocks "status", repeats, near-repeats, destructive, too-short | TODO |
| 2.4 | Implement build_constraint_prompt for regeneration | SPEC 4 | Constraint prompt builder | agent | Constraint prompt includes rejected text and reason | TODO |

### Wave 3: Main Driver (Day 4)
Depends on: Waves 0-2 (all components)

| # | Task | Input | Output | Owner | Validation | Status |
|---|------|-------|--------|-------|------------|--------|
| 3.1 | Implement V6SessionDriver main loop | SPEC 5, all prior waves | twin_session_driver_v6.py | agent | Dry-run against saved session logs produces valid prompts | TODO |
| 3.2 | Implement auto_detect_session | SPEC 5 | Auto-detection from pane content | agent | Correctly identifies project/goal from 5 test panes | TODO |
| 3.3 | Implement generate_plan | SPEC 5 | Plan generation from project+goal | agent | Generates 5-10 coherent steps for 5 test scenarios | TODO |
| 3.4 | Implement CLI with argparse | SPEC 5 | Full CLI (--project, --goal, --auto-detect, --dry-run, --resume) | agent | All flags parsed correctly, help text clear | TODO |
| 3.5 | Implement adaptive interval logic | Compound Step 6 | Dynamic wait times (20s-60s) | agent | Interval increases after 3 WAITING phases, resets on BUILDING | TODO |
| 3.6 | Implement plan fallback on validation failure | Compound Steps 5-6 | Fallback to next plan step after 3 rejected attempts | agent | After 3 rejections, injects plan step instead of giving up | TODO |
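
Task 3.6 describes a bounded retry loop with a plan fallback. A minimal sketch, assuming the model call and the validation gate are passed in as callables (`generate` receives the list of rejected candidates so it can build a constraint prompt; all names here are hypothetical):

```python
def generate_or_fallback(generate, validate, plan_steps, plan_index, max_attempts=3):
    """Try the model up to `max_attempts` times, then fall back to the next plan step.

    Returns (prompt, source), where source is "model", "plan", or "skip".
    """
    rejected = []
    for _ in range(max_attempts):
        candidate = generate(rejected)
        if validate(candidate):
            return candidate, "model"
        rejected.append(candidate)  # fed back so the next attempt can avoid it
    if plan_index < len(plan_steps):
        return plan_steps[plan_index], "plan"
    return None, "skip"  # nothing safe to inject this turn
```

Per R4, the plan step itself should still pass the destructive-pattern blocklist before injection; that check is left to the caller here.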

### Wave 4: Testing + Hardening (Day 5)
Depends on: Wave 3 (complete driver)

| # | Task | Input | Output | Owner | Validation | Status |
|---|------|-------|--------|-------|------------|--------|
| 4.1 | End-to-end dry-run test against all 6 existing session logs | Session logs | Comparison report: V5 actual vs V6 would-have-generated | agent | V6 generates 0 "status" prompts, 0 exact repeats | TODO |
| 4.2 | Stress test: 100-turn session with simulated pane data | Synthetic pane data | No crashes, no memory leaks, session record stays consistent | agent | Driver completes 100 turns without error | TODO |
| 4.3 | Hash stability test on 50 pane captures | Test fixtures | stable_hash produces same hash on cosmetic-only changes | agent | >90% | TODO |
| 4.4 | Phase detection accuracy test | Annotated fixtures | Confusion matrix for 6 phases | agent | >85% | TODO |
| 4.5 | Destructive pattern blocklist test | 200 synthetic prompts (50 destructive) | Zero destructive prompts pass validation | agent | 100% | TODO |
| 4.6 | Token budget overflow test | Edge-case pane outputs (500 lines, binary, huge traces) | Prompt never exceeds 1800 tokens | agent | 0 overflows on 20 edge cases | TODO |

### Wave 5: Live Integration (Day 6-7)
Depends on: Wave 4 (all tests pass)

| # | Task | Input | Output | Owner | Validation | Status |
|---|------|-------|--------|-------|------------|--------|
| 5.1 | Live dry-run test: drive 3 real sessions without injecting | 3 active tmux panes | JSONL logs showing what V6 would inject | human | Review logs, confirm no status/repeats, confirm project coherence | TODO |
| 5.2 | Live test: drive 1 real session with injection on a non-critical project | 1 test pane | Actual session driven by V6 | human | Session makes progress, no harmful prompts injected | TODO |
| 5.3 | A/B comparison: run V5 and V6 on parallel sessions, same project | 2 panes, same seed | Side-by-side comparison of turn logs | human | V6 has fewer wasted turns, more progress per turn | TODO |
| 5.4 | Deploy to all 5 Macs via meshd | Working V6 driver | Driver accessible on mac1-5 | agent | `karl-twin-v6 mac2 agent-codex:1.1 --dry-run` works from mac1 | TODO |

### Wave 6: NATS + Observability (Day 7-8, Optional)
Depends on: Wave 5 (live driver working)

| # | Task | Input | Output | Owner | Validation | Status |
|---|------|-------|--------|-------|------------|--------|
| 6.1 | Add NATS turn event publishing (fire-and-forget) | Step 8b compound | NATS events on karl.twin.turn | agent | Events appear in NATS monitor when driver runs | TODO |
| 6.2 | Add NATS idle detection (optional enhancer) | Step 8c compound | NATS subscriber for tool events | agent | Falls back gracefully when NATS offline | TODO |
| 6.3 | Add metrics tracking (turns fired/gated, validation rejections) | Step 8d compound | Metrics in session record + summary log | agent | Metrics correctly count all categories | TODO |
| 6.4 | Add session summary output at completion | -- | Print summary: turns, phases, rejections, efficiency | agent | Summary is printed after every session | TODO |

### Wave 7: Retraining Data Prep (Day 8-10, Phase 2)
Depends on: Wave 5 (live session data from V6)

| # | Task | Input | Output | Owner | Validation | Status |
|---|------|-------|--------|-------|------------|--------|
| 7.1 | Extract contrastive pairs from V5 session logs | 126 existing turns | DPO-format pairs: (good_prompt, bad_prompt) | agent | 50+ contrastive pairs from status and repeat failures | TODO |
| 7.2 | Generate 200 synthetic session-driving scenarios | Project templates | SFT examples in ChatML format | agent | 200 examples covering error/progress/stuck/done scenarios | TODO |
| 7.3 | Merge V5 training data + V6 session-driving data | V4 train.jsonl + new data | Combined train/valid JSONL | agent | No data leakage between train and valid | TODO |
| 7.4 | Train V6 LoRA adapter on Mac5 | Merged training data | V6 adapter weights | agent | Validation loss < V5 (currently 2.051 NLL) | TODO |
| 7.5 | Evaluate V6 model on session-driving test set | Test set from 7.1 | Status generation rate, repeat rate, coherence score | human | Status rate <5% | TODO |

---

Dependency Graph

```
Wave 0 (Foundation)
  0.1  0.2  0.3  0.4      (all parallel)
  |    |    |    |
  v    v    v    v
Wave 1 (Pane Processing)
  1.1 depends on 0.2, 0.4
  1.2 depends on 0.4
  1.3 depends on 1.1, 1.2
  1.4 depends on 1.3, 0.1
  |
  v
Wave 2 (Prompt + Validation)
  2.1 no deps beyond writing
  2.2 depends on 0.1, 1.3
  2.3 depends on 0.3
  2.4 depends on 2.3
  |
  v
Wave 3 (Main Driver)
  3.1 depends on ALL of Waves 0-2
  3.2, 3.3 depend on 0.1
  3.4 depends on 3.1
  3.5, 3.6 depend on 3.1
  |
  v
Wave 4 (Testing)
  All depend on Wave 3
  |
  v
Wave 5 (Live)
  All depend on Wave 4
  |
  v
Wave 6 (NATS, optional)     Wave 7 (Retrain, Phase 2)
  Depends on Wave 5          Depends on Wave 5
```

Critical path: 0.1 -> 1.3 -> 2.2 -> 3.1 -> 4.1 -> 5.2

Total estimated effort:
- Waves 0-4: 5 agent sessions (~3-5 hours each)
- Wave 5: 2 human-supervised sessions
- Wave 6: 2 agent sessions (optional)
- Wave 7: 3 agent sessions (Phase 2)
- Critical path to first live test: 5 days

---

Pulse Auto-Spawn Candidates

Tasks tagged for automated Pulse dispatch:

| Task | Session Type | Reason |
|------|--------------|--------|
| 0.1 | agent | Pure Python module, no human judgment needed |
| 0.2 | agent | Regex + hashing, testable in isolation |
| 0.3 | agent | Single function, testable in isolation |
| 1.1-1.4 | agent | Algorithmic work with clear test criteria |
| 2.1-2.4 | agent | Prompt templates + validation logic |
| 3.1-3.6 | agent | Integration, but all components available |
| 4.1-4.6 | agent | Test writing, all testable |
| 5.1-5.3 | human | Requires watching live sessions |
| 6.1-6.4 | agent | NATS integration, testable |
| 7.1-7.3 | agent | Data processing |
| 7.4-7.5 | human | Training requires monitoring, evaluation requires judgment |

Agent-dispatchable: 26/31 tasks (84%)
Human-required: 5/31 tasks (16%)

---

Kill Criteria

  • Day 5: If dry-run test (4.1) shows V6 generating >5%
  • Day 7: If live test (5.2) shows V6 injecting harmful or incoherent prompts, halt. Review validation gate.
  • Day 14: If V6 has not been used for 3+ real sessions by human choice, the driver is not providing value. Review whether the architecture is too conservative or too aggressive.

---

File Summary

| File | Lines | Purpose |
|------|-------|---------|
| `karl/session_manager.py` | ~120 | Session record CRUD |
| `karl/pane_processor.py` | ~150 | ANSI strip, hash, diff, phase detect |
| `karl/prompt_assembler.py` | ~130 | Context assembly with token budget |
| `karl/validation_gate.py` | ~100 | Post-generation validation + dedup |
| `twin_session_driver_v6.py` | ~250 | Main driver + CLI |
| `tests/test_session_manager.py` | ~80 | Unit tests |
| `tests/test_pane_processor.py` | ~120 | Unit + integration tests |
| `tests/test_validation_gate.py` | ~80 | Validation rule tests |
| `tests/test_driver_v6.py` | ~100 | Integration tests with mocked model |
| `tests/fixtures/pane_captures/` | 50 files | Real pane output for testing |
| Total new code | ~750 | + ~380 tests (~1130 total) |

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

evo-cube-output/karl-v6-session-driver/stage3-expand-master-plan.md
