Stage 3: EXPAND + MASTER PLAN -- KARL V6 Session Driver
---
3a. Risk Audit
CRITICAL RISKS
R1: Terminal Diff Fragility
- Failure scenario: ANSI escape codes, cursor repositioning, wrapped lines, and partial terminal renders cause the diff algorithm to produce garbage. The pane hash changes on every read (due to timestamp updates or animated spinners), defeating the "skip unchanged" optimization.
- Probability: HIGH (70%). Terminal output is inherently messy. tmux capture-pane includes invisible characters.
- Impact: HIGH. Without reliable diff, every turn fires the model with junk context.
R2: Phase Detection False Positives
- Failure scenario: Phase detector sees "running" in a file path (`/app/running-config.ts`) and classifies as WAITING. Or sees "error" in a variable name and classifies as ERROR. The session stalls or takes wrong actions.
- Probability: MEDIUM (40%).
- Impact: HIGH. Wrong phase = wrong action. WAITING when BUILDING means the driver waits forever.
- Mitigation: (1) Only match keywords in the LAST 3 lines (not all 80), (2) require keyword to be the dominant signal (not embedded in longer text), (3) add a "confidence" score -- if multiple conflicting signals, default to BUILDING (safest), (4) timeout: any phase held for >120s without change auto-transitions to STUCK.
- Validation: Annotate 50 real pane snapshots with correct phase. Phase detector accuracy must be >85%.
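The four R2 mitigations can be sketched roughly as follows. This is an illustration under stated assumptions, not the planned detector: the keyword table, the `classify_phase` name, and the exact word-boundary rule are all hypothetical.

```python
import re

# Hypothetical keyword table; the plan does not list the real vocabulary.
PHASE_KEYWORDS = {
    "WAITING": ["waiting", "running"],
    "ERROR": ["error", "failed"],
    "DONE": ["done", "completed"],
}
STUCK_TIMEOUT_S = 120  # mitigation (4)


def _matches(keyword: str, line: str) -> bool:
    # Mitigation (2): the keyword must stand alone, not sit inside a path,
    # identifier, or filename such as /app/running-config.ts or error_count.
    pattern = rf"(?<![\w/.\-]){re.escape(keyword)}(?![\w/\-]|\.\w)"
    return re.search(pattern, line, flags=re.IGNORECASE) is not None


def classify_phase(lines: list[str], held_seconds: float) -> str:
    if held_seconds > STUCK_TIMEOUT_S:
        return "STUCK"
    tail = [l for l in lines if l.strip()][-3:]  # mitigation (1): last 3 lines only
    hits = {
        phase
        for phase, keywords in PHASE_KEYWORDS.items()
        if any(_matches(k, line) for k in keywords for line in tail)
    }
    if len(hits) == 1:
        return hits.pop()
    return "BUILDING"  # mitigation (3): no signal or conflicting signals
```

With this rule, `import x from '/app/running-config.ts'` stays BUILDING, while a bare `Still running...` line is classified as WAITING.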
R3: Model Generates Harmful Prompts
- Failure scenario: The 4B model generates a prompt that causes Claude to delete files, push to wrong branches, deploy broken code, or modify production data. The validation gate (Step 5) only checks for repetition and status, not destructive commands.
- Probability: LOW (15%).
- Impact: CRITICAL. Data loss, broken deployments.
- Mitigation: Add a DESTRUCTIVE_PATTERNS blocklist to the validation gate: `['rm -rf', 'git push --force', 'drop table', 'git reset --hard', 'DELETE FROM', 'kill -9']`. Any generated prompt matching these patterns is rejected outright.
- Validation: Test with 200 generated prompts. Zero destructive prompts must pass validation.
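A minimal sketch of the blocklist check, assuming the patterns are compiled as case-insensitive regexes. The `is_destructive` name and the exact regex forms are illustrative; only the command list comes from the mitigation above.

```python
import re

# Patterns derived from the R3 blocklist. Case-insensitive so that both
# `drop table` and `DROP TABLE` are caught.
DESTRUCTIVE_RE = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"\brm\s+-(?:rf|fr)\b",
        r"\bgit\s+push\s+(?:--force|-f)\b",
        r"\bgit\s+reset\s+--hard\b",
        r"\bdrop\s+table\b",
        r"\bdelete\s+from\b",
        r"kill\s+-9\b",  # no leading \b, so pkill -9 is caught too
    )
]


def is_destructive(prompt: str) -> bool:
    """Return True if any blocklist pattern appears anywhere in the prompt."""
    return any(rx.search(prompt) for rx in DESTRUCTIVE_RE)
```

A regex blocklist is a coarse net: it also rejects prompts that merely quote one of these commands, which is probably the safer failure mode for this gate.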
MEDIUM RISKS
R4: Plan Generation Quality
- Failure scenario: The 4B model generates a plan with wrong file paths, wrong commands, wrong dependencies. The plan is used as fallback when the model fails validation 3 times, so a bad plan injects bad prompts.
- Probability: MEDIUM (50%).
- Impact: MEDIUM. Bad plan steps waste turns but don't cause data loss (Claude itself validates before executing).
- Mitigation: (1) Keep plan steps vague ("Create the health API route") not specific ("Create /app/api/health/route.ts with exact implementation"), (2) plan is a HINT, model can deviate, (3) validate plan steps against the same blocklist as generated prompts.
- Validation: Generate plans for 10 known projects. Human review for basic coherence.
R5: Session Record Corruption
- Failure scenario: Driver crashes mid-write, leaving a partial JSON file. Next driver start fails to load the session record and creates a new one, losing all history.
- Probability: LOW-MEDIUM (25%).
- Impact: MEDIUM. Loss of session history means temporary amnesia, but the session can recover from the pane state.
- Mitigation: (1) Atomic write: write to `{session_id}.tmp.json`, then rename, (2) backup: keep previous version as `{session_id}.bak.json`, (3) recovery: if main file is corrupt, try loading backup.
- Validation: Kill the driver process during a write. Verify recovery from backup.
R6: Token Budget Overflow
- Failure scenario: Pane diff is unexpectedly large (Claude outputs a 200-line error trace), combined with history and system prompt, the total exceeds 2048 tokens. The model receives truncated context and generates poor output.
- Probability: MEDIUM (35%).
- Impact: MEDIUM. Truncated context leads to confused generation, but the validation gate catches most bad outputs.
- Mitigation: (1) Hard cap diff at 20 lines (take first 5 + last 15), (2) hard cap total prompt at 1800 tokens before generation budget, (3) if over budget, drop plan_steps first, then truncate history to last 3 entries.
- Validation: Test with edge-case pane outputs (500-line error dumps, binary output, huge file listings).
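The head-plus-tail cap from mitigation (1) is small enough to show directly. The `cap_diff` name is hypothetical; the 5 and 15 defaults come from the text above.

```python
def cap_diff(lines: list[str], head: int = 5, tail: int = 15) -> list[str]:
    """Keep the first `head` and last `tail` lines; return short diffs unchanged."""
    if len(lines) <= head + tail:
        return list(lines)
    return lines[:head] + lines[-tail:]
```

Keeping the head preserves the start of an error trace (usually the command and the first exception line), while the tail preserves the most recent output. A marker line for the omitted middle could be added at the cost of one of the 20 slots.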
R7: MLX Server Unavailability
- Failure scenario: Mac5 is offline, MLX server crashed, or network timeout. Every model call fails. The driver has no LLM to generate prompts.
- Probability: MEDIUM (30%).
- Impact: MEDIUM-HIGH. Without the model, the driver can only inject plan steps mechanically.
- Mitigation: (1) Plan-only fallback mode: step through plan_steps without model calls, (2) retry with exponential backoff on MLX failures, (3) health check MLX at session start, warn if unreachable, (4) optional: fall back to a different endpoint (local ollama, cloud API).
- Validation: Start a session with MLX offline. Verify plan-only mode works.
LOW RISKS
R8: Adaptive Interval Too Slow
- Failure scenario: 60s max_interval means the driver misses a rapid Claude output that needed immediate direction, causing Claude to idle for a full minute.
- Probability: LOW (20%).
- Impact: LOW. Wasted time, not wasted quality.
- Mitigation: (1) After injecting a prompt, use min_interval (20s) for the next 3 reads, (2) only use max_interval after 3 consecutive WAITING phases.
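One way to read these two rules as code is shown below. The function name and the two counters are assumptions about how the driver would track state.

```python
def next_interval(
    reads_since_inject: int,
    consecutive_waiting: int,
    min_interval: float = 20.0,
    max_interval: float = 60.0,
) -> float:
    """Pick the sleep before the next pane read.

    reads_since_inject: pane reads completed since the last injected prompt.
    consecutive_waiting: number of WAITING phases observed in a row.
    """
    if reads_since_inject < 3:
        return min_interval  # stay responsive right after an injection
    if consecutive_waiting >= 3:
        return max_interval  # back off only once the pane is clearly idle
    return min_interval
```

Under this reading, the long interval is only ever reached when both conditions hold: at least three reads since the last injection and three WAITING phases in a row.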
R9: NATS Integration Breaks Core Driver
- Failure scenario: NATS code throws an unhandled exception that crashes the main loop.
- Probability: LOW (10%).
- Impact: LOW if properly isolated. HIGH if not.
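- Mitigation: Wrap every NATS call in a `try/except Exception` handler that logs a warning and continues, rather than a bare `try/except: pass`, which would hide the failure and also swallow `KeyboardInterrupt`. NATS failures never block or crash the driver.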
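A sketch of that isolation wrapper follows. The `publish` callable is injected and hypothetical, so this does not assume any particular NATS client API.

```python
import logging
from typing import Any, Callable

log = logging.getLogger("karl.nats")


def safe_publish(
    publish: Callable[[str, bytes], Any], subject: str, payload: bytes
) -> bool:
    """Call `publish`, logging and swallowing any ordinary exception."""
    try:
        publish(subject, payload)
        return True
    except Exception as exc:  # not BaseException: Ctrl-C must still stop the driver
        log.warning("NATS publish to %s failed: %s", subject, exc)
        return False
```

Catching `Exception` rather than using a bare `except` leaves `KeyboardInterrupt` and `SystemExit` alone, so the operator can still stop the main loop.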
R10: Session ID Collision
- Failure scenario: Two sessions on different machines have the same pane ID, causing session record overwrites.
- Probability: LOW (5%).
- Impact: LOW. Confusion in logs.
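- Mitigation: Session ID format: `{machine}_{pane_id}_{timestamp}`. The machine name separates identical pane IDs on different hosts, and the timestamp separates successive sessions on the same pane.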
---
3b. Expanded Specifications
SPEC 1: Session Record Manager (`session_manager.py`)
Purpose: CRUD operations for session records on disk.
File location: `Desktop/karl/karl/session_manager.py`
Interface:
```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SessionRecord:
    # Fields without defaults come first, as dataclasses require.
    session_id: str
    machine: str
    pane_id: str
    project: str
    goal: str
    created_at: str
    updated_at: str
    version: int = 6
    plan_steps: list[str] = field(default_factory=list)
    plan_index: int = 0  # Next uncompleted step
    phase: str = "STARTING"
    turn_number: int = 0
    history: list[dict] = field(default_factory=list)  # max 8 entries
    last_3_prompts: list[str] = field(default_factory=list)
    pane_hash: str | None = None
    pane_hash_streak: int = 0
    prev_lines: list[str] = field(default_factory=list)  # last 40 lines for diff
    idle_seconds: float = 0


SESSION_DIR = Path.home() / ".karl-sessions"

def load_session(session_id: str) -> SessionRecord | None: ...
def save_session(session: SessionRecord) -> bool: ...  # atomic write
def create_session(machine, pane_id, project, goal, plan_steps=None) -> SessionRecord: ...
def update_turn(session: SessionRecord, generated_prompt: str, pane_result: "PaneResult") -> None: ...
def get_next_plan_step(session: SessionRecord) -> str | None: ...
def advance_plan(session: SessionRecord) -> None: ...
```
Atomic write pattern:
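```python
import json
import os
import shutil
from dataclasses import asdict


def save_session(session: SessionRecord) -> bool:
    SESSION_DIR.mkdir(parents=True, exist_ok=True)
    path = SESSION_DIR / f"{session.session_id}.json"
    tmp = path.with_suffix('.tmp.json')
    bak = path.with_suffix('.bak.json')
    with open(tmp, 'w') as f:
        json.dump(asdict(session), f, indent=2, default=str)
        f.flush()
        os.fsync(f.fileno())  # Make sure the bytes are on disk before the rename
    if path.exists():
        shutil.copy2(path, bak)  # Keep the previous version for recovery
    os.replace(tmp, path)  # Atomic on POSIX; unlike os.rename, also overwrites on Windows
    return True
```
Size: ~120 lines.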
---
SPEC 2: Pane Processor (`pane_processor.py`)
Purpose: Read, clean, hash, diff pane output.
File location: `Desktop/karl/karl/pane_processor.py`
Interface:
```python
from dataclasses import dataclass


@dataclass
class PaneResult:
    changed: bool
    diff: list[str]       # New lines since last read
    novelty: float        # 0.0-1.0
    all_lines: list[str]  # Full cleaned output
    raw_line_count: int


def strip_ansi(text: str) -> str: ...
def stable_hash(lines: list[str]) -> str: ...  # Ignores timestamps, whitespace
def compute_diff(prev_lines: list[str], curr_lines: list[str]) -> list[str]: ...
def compute_novelty(diff_lines: list[str]) -> float: ...
def process_pane(session: SessionRecord, raw_output: str) -> PaneResult: ...
def is_spinner_line(line: str) -> bool: ...
def detect_phase(session: SessionRecord, pane: PaneResult) -> str: ...
```
ANSI stripping regex:
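```python
import re

# CSI sequences (including private modes such as ESC[?25l) and OSC sequences ended by BEL or ESC \
ANSI_RE = re.compile(r'\x1b\[[0-?]*[ -/]*[@-~]|\x1b\].*?(?:\x07|\x1b\\)')
TIMESTAMP_RE = re.compile(r'\d{1,2}:\d{2}(:\d{2})?(\s*(AM|PM))?')
SPINNER_RE = re.compile(r'[⠋⠙⠹⠸⠼⠴⠦⠧⠇⠏|/\\-]')
```
Stable hash: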
```python
import hashlib
import re


def stable_hash(lines: list[str]) -> str:
    # Remove timestamps and normalize whitespace so cosmetic changes keep the hash stable
    normalized = []
    for line in lines:
        clean = TIMESTAMP_RE.sub('', line)
        clean = re.sub(r'\s+', ' ', clean).strip()
        if len(clean) > 3:  # Skip noise lines
            normalized.append(clean)
    # MD5 is used for change detection only, not for security
    return hashlib.md5('\n'.join(normalized).encode()).hexdigest()[:12]
```
Diff algorithm:
```python
def compute_diff(prev_lines: list[str], curr_lines: list[str]) -> list[str]:
    if not prev_lines:
        return curr_lines[-30:]
    # Find the overlap point: the last line of curr that also appears in the tail of prev
    prev_set = {l.strip() for l in prev_lines[-10:] if l.strip()}
    overlap_idx = None
    for i in range(len(curr_lines) - 1, -1, -1):
        if curr_lines[i].strip() in prev_set:
            overlap_idx = i + 1
            break
    if overlap_idx is None:
        # No overlap found (pane scrolled past prev or was cleared): take last 20 lines
        return curr_lines[-20:]
    # Everything after the overlap is new. This is empty when the pane is unchanged,
    # so an unchanged pane no longer falls through to the "last 20 lines" branch.
    return curr_lines[overlap_idx:][-30:]  # Cap at 30 lines
```
Size: ~150 lines.
---
SPEC 3: Prompt Assembler (`prompt_assembler.py`)
Purpose: Build the model prompt from session state and pane data.
File location: `Desktop/karl/karl/prompt_assembler.py`
Interface:
```python
V6_SYSTEM_PROMPT: str  # Fixed system prompt for V6

def assemble_context(session: SessionRecord, pane: PaneResult) -> list[dict]: ...
def format_history(history: list[dict]) -> str: ...
def format_plan_hint(plan_steps: list[str], plan_index: int) -> str: ...
def format_diff(diff_lines: list[str], phase: str) -> str: ...
def format_anti_repeat(last_3: list[str]) -> str: ...
def count_tokens_approx(text: str) -> int: ...  # word_count * 1.3
def trim_to_budget(components: dict, max_tokens: int) -> dict: ...
```
Token budget manager:
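```python
MAX_TOTAL = 1800  # Leave 200+ for generation
PRIORITIES = ['system', 'identity', 'anti_repeat', 'diff', 'history', 'plan']
PROTECTED = {'system', 'identity', 'anti_repeat'}  # Never dropped or truncated


def count_tokens_approx(text: str) -> int:
    return int(len(text.split()) * 1.3)


def truncate_text(text: str, max_tokens: int) -> str:
    # Assumed helper (not defined in this plan): keep the leading words that fit.
    return ' '.join(text.split()[:int(max_tokens / 1.3)])


def trim_to_budget(components: dict, max_tokens: int = MAX_TOTAL) -> dict:
    components = dict(components)  # Do not mutate the caller's dict
    total = sum(count_tokens_approx(v) for v in components.values())
    # Trim in reverse priority order: plan first, then history, then diff
    for key in reversed(PRIORITIES):
        if total <= max_tokens:
            break
        if key in PROTECTED or key not in components:
            continue
        excess = total - max_tokens
        component_tokens = count_tokens_approx(components[key])
        if component_tokens > excess:
            # Truncating this component is enough to fit the budget
            components[key] = truncate_text(components[key], component_tokens - excess)
        else:
            # Drop entirely
            components[key] = ''
        total = sum(count_tokens_approx(v) for v in components.values())
    return components
```
Size: ~130 lines.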
---
SPEC 4: Validation Gate (`validation_gate.py`)
Purpose: Post-generation validation and regeneration logic.
File location: `Desktop/karl/karl/validation_gate.py`
Interface:
```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    valid: bool
    reason: str | None = None  # status_blocked, exact_repeat, near_repeat, too_short, off_topic, destructive
    action: str | None = None  # regenerate, regenerate_with_constraint, block


BANNED_PHRASES: list[str]        # status variants
DESTRUCTIVE_PATTERNS: list[str]  # rm -rf, push --force, etc.

def validate_prompt(generated: str, session: SessionRecord) -> ValidationResult: ...
def similarity(a: str, b: str) -> float: ...  # Jaccard on word trigrams
def has_project_relevance(prompt: str, session: SessionRecord) -> bool: ...
def build_constraint_prompt(original_context: list, reason: str, rejected: str) -> list[dict]: ...
```
Banned phrases (comprehensive):
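```python
# Intended for whole-prompt matching: as substrings, 'status' alone would subsume every other entry
BANNED_PHRASES = [
    'status', 'check status', 'show status', 'what is the status',
    "what's the status", 'give me status', 'get status',
    'current status', 'project status',
]

# Regex patterns; compile with re.IGNORECASE so lowercase SQL such as `drop table` also matches
DESTRUCTIVE_PATTERNS = [
    r'rm\s+-(?:rf|fr)\b',              # also matches at end of string and the -fr spelling
    r'git\s+push\s+(?:--force|-f)\b',  # long and short force flags
    r'git\s+reset\s+--hard',
    r'DROP\s+TABLE',
    r'DELETE\s+FROM',
    r'git\s+checkout\s+\.',
    r'git\s+clean\s+-[fd]',
    r'kill\s+-9\s+\d',
    r'pkill\s+-9',
    r'sudo\s+rm',
]
```
Size: ~100 lines.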
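A minimal sketch of `similarity()` as Jaccard overlap on word trigrams. One assumption is added: prompts shorter than three words have no trigrams at all, so this version falls back to the largest n-gram size both strings support. Without some such fallback, the Wave 0 check `similarity("status", "check status") > 0.3` could not pass.

```python
def similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity over word n-grams (trigrams by default)."""
    words_a, words_b = a.lower().split(), b.lower().split()
    if not words_a or not words_b:
        return 0.0
    # Assumed fallback: shrink n so that short prompts still produce n-grams.
    n = min(n, len(words_a), len(words_b))

    def grams(words: list[str]) -> set[tuple[str, ...]]:
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    grams_a, grams_b = grams(words_a), grams(words_b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Under this fallback, `similarity("status", "check status")` compares single words and gives 0.5, while `similarity("build X", "deploy Y")` gives 0.0.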
---
SPEC 5: V6 Main Driver (`twin_session_driver_v6.py`)
Purpose: Main loop tying all components together. CLI entry point.
File location: `Desktop/karl/twin_session_driver_v6.py`
Interface:
```python
class V6SessionDriver:
    def __init__(self, machine, pane_id, project, goal, plan_steps=None,
                 max_turns=30, min_interval=20, max_interval=60,
                 mlx_url=None, dry_run=False): ...
    def run(self) -> "SessionSummary": ...
    def _read_pane(self) -> str | None: ...
    def _inject(self, prompt: str) -> bool: ...
    def _query_model(self, messages: list[dict]) -> str: ...
    def _query_with_constraint(self, messages: list, reason: str, rejected: str) -> str: ...


def auto_detect_session(raw_output: str, mlx_url: str) -> dict: ...
def generate_plan(project: str, goal: str, mlx_url: str) -> list[str]: ...
def main(): ...  # CLI via argparse
```
CLI:
```bash
# Seeded session with explicit goal
karl-twin-v6 mac1 agent-codex:1.1 \
  --project meshd-dashboard \
  --goal "Build health polling dashboard" \
  --turns 30

# Auto-detect session from pane content
karl-twin-v6 mac1 agent-codex:1.1 --auto-detect

# Dry run (log what would be injected)
karl-twin-v6 mac1 agent-codex:1.1 \
  --project meshd-dashboard \
  --goal "Build health polling dashboard" \
  --dry-run

# Resume existing session
karl-twin-v6 mac1 agent-codex:1.1 --resume
```
Size: ~250 lines (main driver + CLI).
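The CLI above maps onto `argparse` roughly as follows. The flag names and the `--turns` default come from the examples and the `max_turns=30` constructor default. The rule that a seeded session needs both `--project` and `--goal` is an assumption, as are the `build_parser` and `parse_args` names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="karl-twin-v6")
    parser.add_argument("machine", help="target machine, e.g. mac1")
    parser.add_argument("pane_id", help="tmux pane, e.g. agent-codex:1.1")
    parser.add_argument("--project")
    parser.add_argument("--goal")
    parser.add_argument("--turns", type=int, default=30)
    parser.add_argument("--auto-detect", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--resume", action="store_true")
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed rule: a seeded session needs both --project and --goal.
    if not (args.auto_detect or args.resume) and not (args.project and args.goal):
        parser.error("--project and --goal are required unless --auto-detect or --resume is set")
    return args
```

`parser.error` prints the usage text and exits with status 2, so a half-specified session fails before any pane is touched.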
---
3c. Master Execution Checklist
### Wave 0: Foundation (Day 1)
No dependencies. Can start immediately.
| # | Task | Input | Output | Owner | Validation | Status |
|---|---|---|---|---|---|---|
| 0.1 | Create `[home-path]` directory and session record schema | SPEC 1 | `session_manager.py` with dataclass + CRUD | agent | `pytest test_session_manager.py` -- create, save, load, atomic write, backup recovery | TODO |
| 0.2 | Build ANSI stripping + stable hash functions | SPEC 2 | `pane_processor.py` with strip_ansi, stable_hash | agent | Test against 20 saved tmux captures, hash stability >90% | TODO |
| 0.3 | Implement word-trigram similarity function | SPEC 4 | `validation_gate.py` with similarity() | agent | similarity("status", "check status") > 0.3, similarity("build X", "deploy Y") < 0.3 | TODO |
| 0.4 | Save 50 real pane captures from current sessions for testing | Session logs | `Desktop/karl/tests/fixtures/pane_captures/` (50 text files) | agent | Files exist, cover BUILDING/WAITING/ERROR/DONE phases | TODO |
### Wave 1: Pane Processing (Day 2)
Depends on: Wave 0 (0.2 for strip_ansi, 0.4 for test fixtures)
| # | Task | Input | Output | Owner | Validation | Status |
|---|---|---|---|---|---|---|
| 1.1 | Implement compute_diff and compute_novelty | SPEC 2, fixtures from 0.4 | diff and novelty functions in pane_processor.py | agent | Diff correctly identifies new content in 40/50 test captures | TODO |
| 1.2 | Implement detect_phase with keyword matching | SPEC 2 | Phase detection in pane_processor.py | agent | Phase accuracy >85% | TODO |
| 1.3 | Implement process_pane combining hash+diff+phase | SPEC 2 | Complete PaneResult pipeline | agent | Integration test: read raw -> PaneResult with correct fields | TODO |
| 1.4 | Write turn gating logic (should_fire) | Step 3 compound | Gating function returns TurnDecision | agent | Correctly gates: WAITING=skip, STUCK(3x)=skip, BUILDING+novelty>0.15=fire | TODO |
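The gating criterion for task 1.4 could be expressed as below. The `TurnDecision` fields, the `stuck_streak` parameter (one reading of "STUCK(3x)"), and the conservative default for phases the criterion does not mention are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class TurnDecision:
    fire: bool
    reason: str


NOVELTY_THRESHOLD = 0.15  # from the 1.4 validation criterion


def should_fire(phase: str, stuck_streak: int, novelty: float) -> TurnDecision:
    if phase == "WAITING":
        return TurnDecision(False, "waiting")
    if phase == "STUCK" and stuck_streak >= 3:
        return TurnDecision(False, "stuck_3x")
    if phase == "BUILDING" and novelty > NOVELTY_THRESHOLD:
        return TurnDecision(True, "building_with_new_output")
    # Assumed default: phases the criterion does not cover are skipped here.
    return TurnDecision(False, "no_rule_matched")
```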
### Wave 2: Prompt Assembly + Validation (Day 3)
Depends on: Wave 0 (0.1 for session record), Wave 1 (1.3 for PaneResult)
| # | Task | Input | Output | Owner | Validation | Status |
|---|---|---|---|---|---|---|
| 2.1 | Write V6 system prompt | Compound Step 4 | V6_SYSTEM_PROMPT constant | agent | Human review for clarity, <150 tokens | TODO |
| 2.2 | Implement assemble_context with token budget | SPEC 3 | prompt_assembler.py | agent | Context never exceeds 1800 tokens on 50 test inputs | TODO |
| 2.3 | Implement validate_prompt with all 5 rules | SPEC 4 | validation_gate.py complete | agent | Blocks "status", repeats, near-repeats, destructive, too-short | TODO |
| 2.4 | Implement build_constraint_prompt for regeneration | SPEC 4 | Constraint prompt builder | agent | Constraint prompt includes rejected text and reason | TODO |
### Wave 3: Main Driver (Day 4)
Depends on: Waves 0-2 (all components)
| # | Task | Input | Output | Owner | Validation | Status |
|---|---|---|---|---|---|---|
| 3.1 | Implement V6SessionDriver main loop | SPEC 5, all prior waves | twin_session_driver_v6.py | agent | Dry-run against saved session logs produces valid prompts | TODO |
| 3.2 | Implement auto_detect_session | SPEC 5 | Auto-detection from pane content | agent | Correctly identifies project/goal from 5 test panes | TODO |
| 3.3 | Implement generate_plan | SPEC 5 | Plan generation from project+goal | agent | Generates 5-10 coherent steps for 5 test scenarios | TODO |
| 3.4 | Implement CLI with argparse | SPEC 5 | Full CLI (--project, --goal, --auto-detect, --dry-run, --resume) | agent | All flags parsed correctly, help text clear | TODO |
| 3.5 | Implement adaptive interval logic | Compound Step 6 | Dynamic wait times (20s-60s) | agent | Interval increases after 3 WAITING phases, resets on BUILDING | TODO |
| 3.6 | Implement plan fallback on validation failure | Compound Steps 5-6 | Fallback to next plan step after 3 rejected attempts | agent | After 3 rejections, injects plan step instead of giving up | TODO |
### Wave 4: Testing + Hardening (Day 5)
Depends on: Wave 3 (complete driver)
| # | Task | Input | Output | Owner | Validation | Status |
|---|---|---|---|---|---|---|
| 4.1 | End-to-end dry-run test against all 6 existing session logs | Session logs | Comparison report: V5 actual vs V6 would-have-generated | agent | V6 generates 0 "status" prompts, 0 exact repeats | TODO |
| 4.2 | Stress test: 100-turn session with simulated pane data | Synthetic pane data | No crashes, no memory leaks, session record stays consistent | agent | Driver completes 100 turns without error | TODO |
| 4.3 | Hash stability test on 50 pane captures | Test fixtures | stable_hash produces same hash on cosmetic-only changes | agent | >90% | TODO |
| 4.4 | Phase detection accuracy test | Annotated fixtures | Confusion matrix for 6 phases | agent | >85% | TODO |
| 4.5 | Destructive pattern blocklist test | 200 synthetic prompts (50 destructive) | Zero destructive prompts pass validation | agent | 100% of destructive prompts rejected | TODO |
| 4.6 | Token budget overflow test | Edge-case pane outputs (500 lines, binary, huge traces) | Prompt never exceeds 1800 tokens | agent | 0 overflows on 20 edge cases | TODO |
### Wave 5: Live Integration (Day 6-7)
Depends on: Wave 4 (all tests pass)
| # | Task | Input | Output | Owner | Validation | Status |
|---|---|---|---|---|---|---|
| 5.1 | Live dry-run test: drive 3 real sessions without injecting | 3 active tmux panes | JSONL logs showing what V6 would inject | human | Review logs, confirm no status/repeats, confirm project coherence | TODO |
| 5.2 | Live test: drive 1 real session with injection on a non-critical project | 1 test pane | Actual session driven by V6 | human | Session makes progress, no harmful prompts injected | TODO |
| 5.3 | A/B comparison: run V5 and V6 on parallel sessions, same project | 2 panes, same seed | Side-by-side comparison of turn logs | human | V6 has fewer wasted turns, more progress per turn | TODO |
| 5.4 | Deploy to all 5 Macs via meshd | Working V6 driver | Driver accessible on mac1-5 | agent | `karl-twin-v6 mac2 agent-codex:1.1 --dry-run` works from mac1 | TODO |
### Wave 6: NATS + Observability (Day 7-8, Optional)
Depends on: Wave 5 (live driver working)
| # | Task | Input | Output | Owner | Validation | Status |
|---|---|---|---|---|---|---|
| 6.1 | Add NATS turn event publishing (fire-and-forget) | Step 8b compound | NATS events on karl.twin.turn | agent | Events appear in NATS monitor when driver runs | TODO |
| 6.2 | Add NATS idle detection (optional enhancer) | Step 8c compound | NATS subscriber for tool events | agent | Falls back gracefully when NATS offline | TODO |
| 6.3 | Add metrics tracking (turns fired/gated, validation rejections) | Step 8d compound | Metrics in session record + summary log | agent | Metrics correctly count all categories | TODO |
| 6.4 | Add session summary output at completion | -- | Print summary: turns, phases, rejections, efficiency | agent | Summary is printed after every session | TODO |
### Wave 7: Retraining Data Prep (Day 8-10, Phase 2)
Depends on: Wave 5 (live session data from V6)
| # | Task | Input | Output | Owner | Validation | Status |
|---|---|---|---|---|---|---|
| 7.1 | Extract contrastive pairs from V5 session logs | 126 existing turns | DPO-format pairs: (good_prompt, bad_prompt) | agent | 50+ contrastive pairs from status and repeat failures | TODO |
| 7.2 | Generate 200 synthetic session-driving scenarios | Project templates | SFT examples in ChatML format | agent | 200 examples covering error/progress/stuck/done scenarios | TODO |
| 7.3 | Merge V5 training data + V6 session-driving data | V4 train.jsonl + new data | Combined train/valid JSONL | agent | No data leakage between train and valid | TODO |
| 7.4 | Train V6 LoRA adapter on Mac5 | Merged training data | V6 adapter weights | agent | Validation loss < V5 (currently 2.051 NLL) | TODO |
| 7.5 | Evaluate V6 model on session-driving test set | Test set from 7.1 | Status generation rate, repeat rate, coherence score | human | Status rate <5% | TODO |
---
Dependency Graph
```
Wave 0 (Foundation)
  0.1   0.2   0.3   0.4   (all parallel)
   |     |     |     |
   v     v     v     v
Wave 1 (Pane Processing)
  1.1 depends on 0.2, 0.4
  1.2 depends on 0.4
  1.3 depends on 1.1, 1.2
  1.4 depends on 1.3, 0.1
   |
   v
Wave 2 (Prompt + Validation)
  2.1 no deps beyond writing
  2.2 depends on 0.1, 1.3
  2.3 depends on 0.3
  2.4 depends on 2.3
   |
   v
Wave 3 (Main Driver)
  3.1 depends on ALL of Waves 0-2
  3.2, 3.3 depend on 0.1
  3.4 depends on 3.1
  3.5, 3.6 depend on 3.1
   |
   v
Wave 4 (Testing)
  All depend on Wave 3
   |
   v
Wave 5 (Live)
  All depend on Wave 4
   |
   v
Wave 6 (NATS, optional)        Wave 7 (Retrain, Phase 2)
  Depends on Wave 5              Depends on Wave 5
```
Critical path: 0.1 -> 1.3 -> 2.2 -> 3.1 -> 4.1 -> 5.2
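The per-task dependencies listed for Waves 0 to 2 can be checked mechanically. The sketch below transcribes them into a dict and uses the standard-library `graphlib` module to group tasks into batches that can run in parallel. The `execution_batches` helper is hypothetical.

```python
from graphlib import TopologicalSorter

# Task -> prerequisites, transcribed from the dependency lists for Waves 0-2.
DEPS = {
    "0.1": [], "0.2": [], "0.3": [], "0.4": [],
    "1.1": ["0.2", "0.4"],
    "1.2": ["0.4"],
    "1.3": ["1.1", "1.2"],
    "1.4": ["1.3", "0.1"],
    "2.1": [],
    "2.2": ["0.1", "1.3"],
    "2.3": ["0.3"],
    "2.4": ["2.3"],
}


def execution_batches(deps: dict[str, list[str]]) -> list[list[str]]:
    """Group tasks into batches whose prerequisites are all in earlier batches."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

Under these lists, 2.1 is ready immediately and 2.3 becomes ready as soon as 0.3 finishes, so neither strictly has to wait for the Wave 2 day.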
Total estimated effort:
- Waves 0-4: 5 agent sessions (~3-5 hours each)
- Wave 5: 2 human-supervised sessions
- Wave 6: 2 agent sessions (optional)
- Wave 7: 3 agent sessions (Phase 2)
- Critical path to first live test: 5 days
---
Pulse Auto-Spawn Candidates
Tasks tagged for automated Pulse dispatch:
| Task | Session Type | Reason |
|---|---|---|
| 0.1 | agent | Pure Python module, no human judgment needed |
| 0.2 | agent | Regex + hashing, testable in isolation |
| 0.3 | agent | Single function, testable in isolation |
| 1.1-1.4 | agent | Algorithmic work with clear test criteria |
| 2.1-2.4 | agent | Prompt templates + validation logic |
| 3.1-3.6 | agent | Integration, but all components available |
| 4.1-4.6 | agent | Test writing, all testable |
| 5.1-5.3 | human | Requires watching live sessions |
| 6.1-6.4 | agent | NATS integration, testable |
| 7.1-7.3 | agent | Data processing |
| 7.4-7.5 | human | Training requires monitoring, evaluation requires judgment |
Agent-dispatchable: 26/31 tasks (84%)
Human-required: 5/31 tasks (16%)
---
Kill Criteria
- Day 5: If dry-run test (4.1) shows V6 generating >5%
- Day 7: If live test (5.2) shows V6 injecting harmful or incoherent prompts, halt. Review validation gate.
- Day 14: If V6 has not been used for 3+ real sessions by human choice, the driver is not providing value. Review whether the architecture is too conservative or too aggressive.
---
File Summary
| File | Lines | Purpose |
|---|---|---|
| `karl/session_manager.py` | ~120 | Session record CRUD |
| `karl/pane_processor.py` | ~150 | ANSI strip, hash, diff, phase detect |
| `karl/prompt_assembler.py` | ~130 | Context assembly with token budget |
| `karl/validation_gate.py` | ~100 | Post-generation validation + dedup |
| `twin_session_driver_v6.py` | ~250 | Main driver + CLI |
| `tests/test_session_manager.py` | ~80 | Unit tests |
| `tests/test_pane_processor.py` | ~120 | Unit + integration tests |
| `tests/test_validation_gate.py` | ~80 | Validation rule tests |
| `tests/test_driver_v6.py` | ~100 | Integration tests with mocked model |
| `tests/fixtures/pane_captures/` | 50 files | Real pane output for testing |
| Total new code | ~1130 | ~750 driver code + ~380 tests |
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
evo-cube-output/karl-v6-session-driver/stage3-expand-master-plan.md