# ELP-2 — Survivor Architecture
> Crucible Stage 3 FORGE | cycle: chain:full-omega #2 | date: 2026-05-13
> Input layers: L1 meta-review (27 findings, NO-SHIP), L2 AMR (AMEND), L4 Evo3 (Hybrid wins)
> Layer 3 (Codex adversarial): pending — see footnote at §15.
---
Preamble
Three of four scrutiny layers returned a convergent verdict: ELP-1 as written cannot ship. The CRITICAL findings are structural — a /inject format mismatch that breaks the primary dispatch path, a concurrent SKILL.md write race enabled by a 5-minute claim TTL, and a Syncthing-backed filesystem fallback that is architecturally described but physically unprovisioned. These are not tuning problems; they are root-cause failures.
The verdict is not "abandon the loop." The four caveats ELP-1 was designed to kill are real. Claude Code session closure kills the current ScheduleWakeup loop. Stall detection is absent. A single driver is a single point of failure. These problems are worth solving.
The verdict is: replace ELP-1's bespoke distributed worker system with a thin Pulse-backed hybrid that kills the same four caveats at roughly 40
That replacement is ELP-2.
---
§1. Identity Statement
ELP-2 is a Pulse-orchestrated, criteria-driven convergence loop for SOOP-2.
What the supervisor does each cycle:
1. Read filesystem state, compute which of the 10 SOOP-2 criteria pass or fail.
2. Decide what batch of work to dispatch next (or whether Mohamed's approval is required first).
3. Dispatch via Pulse (for work that fits Pulse's session model) or inline (for trivial work).
4. Write the heartbeat file, append to the dispatch log, regenerate the dashboard.
5. Sleep and repeat.
What ELP-2 does not do:
- Run its own worker pool (Pulse handles this).
- Maintain a `soop2_workers` table (Pulse handles this).
- Implement claim semantics with a TTL (the source of the race condition in ELP-1's L5-1 finding).
- Assume Supabase exists (filesystem-only in v1.0).
- Escalate via telegram (dashboard banner in v1.0; telegram in v1.1).
The result is roughly 5 files instead of ELP-1's 15+, a 6-hour build instead of ~9.5 hours, and zero of ELP-1's five CRITICAL findings.
---
§2. What Changed from ELP-1 (the short version)
| Component | ELP-1 | ELP-2 | Reason killed |
|---|---|---|---|
| Worker model | bespoke `soop2_workers` table, claim TTL, `FOR UPDATE SKIP LOCKED` | Pulse spawn calls | Pulse already implements worker lifecycle, rate-limit awareness, retry. Reinventing it was L4's sharpest finding. |
| State plane | Supabase (primary) + filesystem mirror (fallback) | Filesystem only in v1.0 | Syncthing mirror was physically unprovisioned (L4-2 CRITICAL). Supabase adds complexity for a 13-day project. |
| Mac2 observer mode | Automatic promotion after 30-min heartbeat gap | One-command failover script | Observer-mode is real dual-primary risk (L1-1, L5-2 CRITICAL). For a 13-day project, manual failover in 60s is acceptable. |
| Escalation | Telegram + Twilio from supervisor daemon | Dashboard HTML banner | Telegram/Twilio require direct REST calls from a daemon context; ELP-1 assumed the Claude skill wrappers were callable (L4-6 MEDIUM). Not correct. |
| Observability | Grafana on cloud-vm | Static HTML at Desktop/elp2-dashboard.html | Grafana requires Supabase direct Postgres port (L4-5 MEDIUM). Dashboard is Syncthing-shared and readable anywhere. |
| Claim TTL race | 5-minute TTL on batch claims; expired claims released while original worker still writing | N/A | ELP-2 has no claims. Pulse manages its own session lifecycle. L5-1 CRITICAL is gone by construction. |
| /inject format | Spec said {machine, tmux_target, prompt}; gateway required tty for mac1 | Verified against gateway.py line 1877; tty provided for mac1 dispatches | L4-1 CRITICAL: the primary dispatch path was broken in ELP-1. |
---
§3. File Map (~5 files total)
```
[home-path]
  scoreboard.py        # reads filesystem, computes 10-criteria pass/fail/in-progress
  dispatcher.py        # decides next batch, dispatches via Pulse or inline or gate
  deadman.sh           # 3-line cron, INDEPENDENT of dispatcher.py
[home-path]
  scoreboard.json      # the truth; atomic writes via tmp + os.replace
  heartbeat            # touched every cycle by dispatcher.py (not by deadman.sh)
  dispatch-log.jsonl   # append-only audit
  snapshots/           # hourly scoreboard copies (local, Syncthing-excluded)
Desktop/elp2-dashboard.html  # Syncthing-shared static HTML, regenerated each cycle
```

That is it. No `soop2_workers` table. No `soop2_queue` table. No `soop2_log` table. No `schema.sql`. No worker registration protocol. No claim semantics. No reconciler. No observer-mode supervisor. No Grafana panels. No telegram bot. No Twilio integration. All of those existed in ELP-1's spec and none of them worked correctly per the three scrutiny layers.
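The `snapshots/` entry implies a small rotation step somewhere in the cycle. A minimal sketch of how that could look, assuming one copy per UTC hour; the `snapshot_if_due` helper and the file naming scheme are hypothetical and not part of the ELP-2 spec:

```python
import datetime
import pathlib
import shutil
from typing import Optional


def snapshot_if_due(scoreboard: pathlib.Path, snapshots_dir: pathlib.Path,
                    now: Optional[datetime.datetime] = None) -> Optional[pathlib.Path]:
    """Copy scoreboard.json into snapshots/ at most once per UTC hour.

    Returns the new snapshot path, or None if this hour already has one.
    The naming scheme (scoreboard-YYYYMMDDTHH.json) is an assumption.
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    target = snapshots_dir / f"scoreboard-{now:%Y%m%dT%H}.json"
    if target.exists() or not scoreboard.exists():
        return None
    snapshots_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(scoreboard, target)
    return target
```

Calling this once per dispatcher cycle would yield hourly copies without a separate timer, since the hour-stamped filename acts as the "already done" marker.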
---
§4. Component Specifications
§4.1 scoreboard.py
Location: `[home-path]`
Role: Read-only computation from filesystem state. No writes except to `scoreboard.json` via its `write()` function.
Inputs read:
- `[home-path]` frontmatter (for typed_count, silent_capable_count)
- `[home-path]` (optional; for criteria that require Mohamed's manual flip, e.g., REVIEW-4)
- The linter binary at `[home-path]` (invoked as subprocess, budget: 2 seconds)
Output schema (scoreboard.json, schema_version: 1):
```json
{
  "schema_version": 1,
  "updated_at": "2026-05-13T04:00:00Z",
  "typed_count": 14,
  "total_skills": 295,
  "silent_capable_count": 3,
  "criteria": {
    "1_all_typed":        {"status": "in_progress", "current": 14, "target": 295},
    "2_linter_fast":      {"status": "pass"},
    "3_tier1_recall":     {"status": "pending", "note": "gated on REVIEW-4"},
    "4_twin_primary":     {"status": "pending", "note": "mac4:8100 required"},
    "5_type_weight_knob": {"status": "pending", "note": "architecture decision"},
    "6_feedback_loop":    {"status": "pending"},
    "7_contrarian":       {"status": "pass"},
    "8_silent_capable":   {"status": "in_progress", "current": 3, "target": 5},
    "9_mac3_purge":       {"status": "pass"},
    "10_audit_memory":    {"status": "pending", "note": "Interactive per Rail"}
  },
  "criteria_passed": 3,
  "loop_complete": false
}
```

Performance contract: completes in under 2 seconds. The linter alone runs in 0.13s (verified). The SKILL.md scan of 295 files with frontmatter parsing is well under 2s on mac1 SSD.
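The two roll-up fields in the schema (`criteria_passed`, `loop_complete`) can be derived from the per-criterion statuses instead of being tracked separately. A minimal sketch, assuming a criterion counts as passed only when its `status` is exactly `"pass"`; the `summarize` helper is hypothetical:

```python
def summarize(criteria: dict) -> dict:
    """Derive the roll-up fields from the per-criterion status entries."""
    passed = sum(1 for c in criteria.values() if c.get("status") == "pass")
    # An empty criteria dict must never count as a completed loop.
    complete = bool(criteria) and passed == len(criteria)
    return {"criteria_passed": passed, "loop_complete": complete}


criteria = {
    "1_all_typed": {"status": "in_progress", "current": 14, "target": 295},
    "2_linter_fast": {"status": "pass"},
    "7_contrarian": {"status": "pass"},
    "9_mac3_purge": {"status": "pass"},
}
print(summarize(criteria))  # {'criteria_passed': 3, 'loop_complete': False}
```

Deriving the fields this way keeps `scoreboard.json` internally consistent: the counts cannot drift from the statuses they summarize.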
Atomic write pattern:
```python
import json, os, tempfile, pathlib

def write(path: pathlib.Path, data: dict) -> None:
    tmp = path.with_suffix(".tmp")  # temp file sits next to the target
    tmp.write_text(json.dumps(data, indent=2))
    os.replace(tmp, path)  # POSIX atomic on same filesystem
```

This is the same pattern ELP-1 specified in §3.2. It is correct. ELP-2 keeps it.
---
§4.2 dispatcher.py
Location: `[home-path]`
Role: One cycle of the loop. Reads scoreboard, decides what to do, dispatches, writes outputs.
Invocation: Called by launchd every 5 minutes. NOT called by deadman.sh. These are independent.
Cycle steps:
```
1. read scoreboard.json
2. if loop_complete: write completion memory file, exit plist (via a sentinel file)
3. select next batch from priority table (see §6)
4. if batch kind == "Mohamed-gated":
     update dashboard with "waiting for Mohamed: <gate_name>"
     touch heartbeat
     exit (do NOT queue post-gate work)
5. if batch kind == "inline":
     run the work directly (e.g., count files, read a single SKILL.md)
6. if batch kind == "pulse-dispatched":
     call pulse_start() via MCP with prompt template + stop conditions
     log spawn_id to dispatch-log.jsonl
7. touch [home-path]
8. append entry to dispatch-log.jsonl
9. regenerate dashboard HTML
10. run scoreboard.py to refresh scoreboard.json
```

No claim semantics. There is no batch state machine. Pulse manages its own session lifecycle. If dispatcher.py crashes after step 6 but before step 7, the Pulse session continues running. The next dispatcher cycle reads the same scoreboard state, detects no advance in typed_count (or whichever metric), and either dispatches a new Pulse spawn or waits. Duplicate Pulse sessions for the same track are acceptable because SKILL.md edits are idempotent (frontmatter is either written or it is not; writing it twice is a no-op).
This eliminates ELP-1 finding L5-1 by construction. There is no concurrent SKILL.md write race because there is no claim TTL releasing a batch to a second worker while the first is still writing. Pulse owns one session at a time per track. The idempotency of frontmatter writes handles any overlap.
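The idempotency claim can be made concrete. The following is a sketch only: it assumes frontmatter is a block delimited by `---` lines at the top of SKILL.md with simple one-line `key: value` entries, and the `type` key and `tool` value in the example are placeholders, not the canonical SOOP-2 format.

```python
def ensure_frontmatter_key(text: str, key: str, value: str) -> str:
    """Return text with `key: value` present in the leading frontmatter block.

    Applying the function twice gives the same result as applying it once.
    """
    lines = text.split("\n")
    entry = f"{key}: {value}"
    if not lines or lines[0].strip() != "---":
        # No frontmatter yet: create a new block above the body.
        return "\n".join(["---", entry, "---"] + lines)
    try:
        end = lines.index("---", 1)  # closing delimiter of the block
    except ValueError:
        raise ValueError("unterminated frontmatter block")
    for i in range(1, end):
        if lines[i].split(":", 1)[0].strip() == key:
            lines[i] = entry  # overwrite in place; same value means no change
            return "\n".join(lines)
    lines.insert(end, entry)  # key absent: add it just before the delimiter
    return "\n".join(lines)


doc = "---\nname: demo\n---\nBody text\n"
once = ensure_frontmatter_key(doc, "type", "tool")
assert ensure_frontmatter_key(once, "type", "tool") == once
```

An unterminated block raises instead of being patched, which matches the document's approach of letting the linter flag partially written files.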
---
§4.3 deadman.sh
Location: `[home-path]`
Role: INDEPENDENT escalation. This script does not import or call dispatcher.py. It has no Python dependency. It cannot be killed by a dispatcher crash.
Contents:
```bash
#!/usr/bin/env bash
# ELP-2 dead-man's switch. Cron runs this every 15 min independent of dispatcher.
# If the heartbeat file hasn't been touched in 30 min, write a visible alert.
HEARTBEAT=[home-path]
ALERT=Desktop/ELP2-STALLED.txt
if [[ ! -f "$HEARTBEAT" ]] || [[ $(find "$HEARTBEAT" -mmin +30 2>/dev/null) ]]; then
  echo "ELP-2 stalled at $(date). Last heartbeat: $(stat -f '%Sm' "$HEARTBEAT" 2>/dev/null || echo 'never')" > "$ALERT"
fi
```

Three lines of logic. No Python. No Supabase. No aura-gateway dependency. No paging infrastructure that fails silently.
Why independent matters: ELP-1's L5-A finding (Layer 2) identified that if supervisor.py crashes before it can write the heartbeat, the escalation layer inside supervisor.py never fires. The monitoring system for ELP crashes along with ELP. deadman.sh kills this blind spot because it is a separate process launched by a separate cron entry, with no shared code path with dispatcher.py.
Cron entry (installed during bootstrap):
```
*/15 * * * * [home]/.claude/tools/elp2/deadman.sh
```

Alert visibility: The alert writes to `Desktop/ELP2-STALLED.txt`, which is both on the Desktop (visible when Mohamed opens his Mac) and in a directory Syncthing can replicate to make it appear on his other machines. In v1.1, this triggers a telegram message. In v1.0, the Desktop file is sufficient.
---
§4.4 scoreboard.json (state plane)
Location: `[home-path]`
Truth hierarchy:
- `scoreboard.json` is the ELP-2 truth. It is derived from filesystem state (SKILL.md files), not from a DB.
- If `scoreboard.json` is deleted, dispatcher.py re-generates it from scratch by running scoreboard.py. Recovery is one dispatcher cycle (5 minutes).
- If the entire `[home-path]` directory is deleted, dispatcher.py creates it on the next run. Nothing is permanently lost because the canonical state is the SKILL.md files themselves.
No Supabase in v1.0. This is an intentional deferral, not an oversight. The filesystem IS the database for SOOP-2 because the work product (typed SKILL.md frontmatter) lives on the filesystem. Supabase would add cross-mesh durability for a project that runs on one machine 95
Write ownership:
- `scoreboard.json`: written by scoreboard.py (called by dispatcher.py). Only one writer.
- `heartbeat`: written by dispatcher.py. Only one writer.
- `dispatch-log.jsonl`: written by dispatcher.py in append mode. Only one writer.
Single-writer per file. No concurrent write contention. No need for `FOR UPDATE SKIP LOCKED`.
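With a single writer, the append-only log needs nothing more than opening the file in append mode. A minimal sketch; the `ts` field and both helper names are assumptions, while `mode` and `spawn_id` mirror the dictionaries the dispatch functions in §5 return:

```python
import datetime
import json
import pathlib


def append_log(path: pathlib.Path, entry: dict) -> None:
    """Append one JSON object per line; earlier entries are never rewritten."""
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    record = {"ts": stamp, **entry}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def tail(path: pathlib.Path, n: int = 10) -> list:
    """Return the last n entries, e.g. for the dashboard's recent-dispatch list."""
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines[-n:] if line.strip()]
```

A call such as `append_log(log_path, {"mode": "pulse", "spawn_id": "abc"})` followed by `tail(log_path)` would cover both the audit trail and the "last 10 dispatch log entries" the dashboard shows.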
---
§4.5 elp2-dashboard.html
Location: `Desktop/elp2-dashboard.html`
Generated by: dispatcher.py at the end of each cycle. Not a separate daemon.
Contents:
- Current scoreboard (criteria table, typed_count progress bar)
- Last 10 dispatch log entries
- Heartbeat timestamp and staleness indicator
- Gate status (if a Mohamed-gated batch is waiting, this is the call to action)
- One-line failover instructions for mac2
Syncthing: `Desktop/` is Syncthing-shared (verified: it is in the Syncthing config). Mohamed can read the dashboard from any mesh node's browser by opening `file://[home]/Desktop/elp2-dashboard.html` or from a shared folder on another machine.
Resolution to L4-2 (CRITICAL from Layer 1): ELP-1's dashboard depended on a Grafana instance that required Supabase direct Postgres access (L4-5 MEDIUM) and a Syncthing share of `[home-path]` that was unprovisioned (L4-2 CRITICAL). ELP-2 uses `Desktop/` which is already Syncthing-provisioned. The dashboard is a static HTML file that reads no remote data — it is pre-rendered each cycle by dispatcher.py from local state.
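The dashboard's heartbeat staleness indicator can be computed from the heartbeat file's modification time. A sketch under the assumption that the dashboard uses the same 30-minute threshold as deadman.sh; the `heartbeat_status` helper and its return shape are hypothetical:

```python
import pathlib
import time
from typing import Optional


def heartbeat_status(heartbeat: pathlib.Path, stale_after_s: int = 30 * 60,
                     now: Optional[float] = None) -> dict:
    """Classify the heartbeat file by the age of its mtime."""
    if not heartbeat.exists():
        return {"state": "missing", "age_s": None}
    now = time.time() if now is None else now
    age = max(0.0, now - heartbeat.stat().st_mtime)
    state = "stale" if age > stale_after_s else "fresh"
    return {"state": state, "age_s": int(age)}
```

Because the page is pre-rendered, the indicator reflects the age at render time; a reader comparing it against the page's own timestamp can tell whether the dashboard itself has stopped refreshing.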
---
§5. Dispatch Model
Three modes per batch kind. The selector lives in `dispatcher.py`'s priority table (see §6).
Mode A: Inline
Used for: trivial work that completes in milliseconds and has no Claude session dependency. Examples: read a file to check its state, count typed files, write a small JSON override.
```python
# Inline dispatch example: no Pulse, no aura-gateway
def dispatch_inline(batch: dict) -> dict:
    result = batch["fn"](*batch["args"])
    return {"mode": "inline", "result": result}
```

Mode B: Pulse-dispatched
Used for: bounded work that fits Pulse's session model. Examples: type the next 10 SKILL.md files, run the Tier 1 recall benchmark, wire a feedback component.
The /inject format fix from L4-1 CRITICAL:
ELP-1's spec stated the inject payload as `{machine, tmux_target, prompt}`. This was wrong. The gateway's actual `InjectRequest` (gateway.py line 1875-1879) accepts:
```python
class InjectRequest(BaseModel):
    prompt: str                 # required
    tty: Optional[str]          # required when machine == "mac1"; ignored for remote
    machine: str                # default "mac1"; pattern ^(mac[1-5]|cloud-vm)$
    tmux_target: Optional[str]  # defaults to "claude" for remote machines
```

For mac1 dispatches, `tty` must match `^(/dev/)?ttys\d{1,4}$`. Without it, the gateway returns HTTP 422. ELP-1's primary worker (mac1) would have failed every single injection. This was a CRITICAL finding that ELP-2 resolves not by patching ELP-1 but by not using raw aura-gateway /inject at all for the primary dispatch path.
ELP-2 dispatches via Pulse MCP calls:
```python
from mcp__pulse_local import pulse_start

def dispatch_pulse(batch: dict) -> dict:
    result = pulse_start(
        projectName="soop-2",
        projectPath="[home]",
        goal=batch["prompt_template"].format(**batch["vars"]),
        maxIterations=batch.get("max_iterations", 5)
    )
    return {"mode": "pulse", "spawn_id": result["sessionId"]}
```

Pulse's own MCP interface handles the routing decision. It already has rate-limit awareness, retry logic, and session lifecycle management. ELP-2 does not reimplement any of that.
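`dispatch_pulse` formats `batch["prompt_template"]` with `batch["vars"]`, while the priority table in §6 supplies a `vars_fn` callable instead. One way to bridge the two is sketched below; `render_goal` is a hypothetical helper, and the skill names in the example are placeholders.

```python
def render_goal(batch: dict, scoreboard: dict) -> str:
    """Fill a batch's prompt_template using the values its vars_fn returns."""
    variables = batch["vars_fn"](scoreboard) if "vars_fn" in batch else {}
    # str.format raises KeyError when the template names a variable that
    # vars_fn did not supply, so template/vars mismatches surface early.
    return batch["prompt_template"].format(**variables)


batch = {
    "prompt_template": "Type the frontmatter for these {n} SKILL.md files: {skill_list}.",
    "vars_fn": lambda s: {"n": 2, "skill_list": "alpha, beta"},
}
print(render_goal(batch, scoreboard={}))
# Type the frontmatter for these 2 SKILL.md files: alpha, beta.
```

Evaluating `vars_fn` at dispatch time, not at table-definition time, means each cycle renders the prompt against the current scoreboard.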
When direct /inject is needed (e.g., for a mac4-specific dispatch that Pulse cannot route): dispatcher.py reads the current tty from the pane registry before injecting.
```python
import json, os, pathlib, requests
from typing import Optional

# Token comes from the environment (ELP2_GATEWAY_TOKEN is set in the launchd plist).
GATEWAY_TOKEN = os.environ.get("ELP2_GATEWAY_TOKEN", "")

def _get_mac1_tty() -> str:
    # gateway.py reads pane_registry.json directly for mac1.
    # We mirror that by reading it here.
    registry = pathlib.Path.home() / ".claude/state/agent_pane_registry.json"
    data = json.loads(registry.read_text())
    # Return first active tty from mac1 panes
    for tty, info in data.get("panes", {}).items():
        if info.get("machine_id") == "mac1" and not info.get("is_ghost"):
            return tty
    raise RuntimeError("no active mac1 pane found in registry")

def dispatch_inject_mac1(prompt: str, tmux_target: Optional[str] = None) -> dict:
    # tty is required for mac1. gateway.py line 1877:
    # "Required when machine == mac1; ignored for remote machines."
    tty = _get_mac1_tty()
    payload = {
        "prompt": prompt,
        "machine": "mac1",
        "tty": tty,
    }
    if tmux_target:
        payload["tmux_target"] = tmux_target
    r = requests.post(
        "http://localhost:8095/inject",
        json=payload,
        headers={"Authorization": f"Bearer {GATEWAY_TOKEN}"},
        timeout=10,
    )
    r.raise_for_status()  # a schema mismatch surfaces here as HTTP 422
    return r.json()
```

The tty field is not hardcoded. It is read at dispatch time from the pane registry, which is always current. This is the correct fix for L4-1 CRITICAL.
Mode C: Mohamed-gated
Used for: any batch that depends on Mohamed's judgment or sign-off before the next batch can proceed. Examples: REVIEW-4 (recall quality judgment), A.1 spec sign-off, I.4 audit authoring.
This resolves Layer 2's HIGH finding about review gates. ELP-1 had no mechanism to pause dispatch at a review gate. Its batch generation logic would queue post-gate work before Mohamed signed off on the gate. ELP-2 makes gates explicit in the priority table. When the selector picks a Mohamed-gated batch, dispatcher.py updates the dashboard with "waiting for Mohamed: REVIEW-4" and exits. It does NOT queue the next batch. The only way to unblock is to set the gate to `true` under `gates` in `criteria_overrides.json` (for example `{"gates": {"REVIEW-4": true}}`).
```python
def dispatch_gate(batch: dict, scoreboard_path: pathlib.Path) -> dict:
    # Write to dashboard, do not dispatch further work.
    update_dashboard_gate_status(batch["gate_name"], batch["description"])
    # Touch heartbeat so deadman.sh sees a live system.
    (pathlib.Path.home() / ".claude/state/elp2/heartbeat").touch()
    return {"mode": "gate", "gate_name": batch["gate_name"], "waiting": True}
```

---
§6. Priority Table (batch kinds)
The priority table is a Python list in dispatcher.py ordered from highest to lowest priority. The selector picks the first entry whose `ready` predicate returns True given the current scoreboard.
```python
BATCH_PRIORITY = [
    # Criterion 9 (mac3 purge) and 7 (contrarian) are already done.
    # Criterion 2 (linter fast) is already done.
    # These entries are here for completeness; their ready predicate always returns False.
    {
        "name": "silent_capable",
        "criteria": "8_silent_capable",
        "ready": lambda s: s["criteria"]["8_silent_capable"]["current"] < 5,
        "kind": "pulse",
        "max_iterations": 3,
        "prompt_template": "Add silent_capable: true to these 2 skills: {skill_list}. "
                           "Follow the frontmatter format in SKILL.md. Run the linter after.",
        "vars_fn": lambda s: {"skill_list": next_silent_candidates(s, n=2)},
        "parallelizable": False,
    },
    {
        "name": "mass_typing_batch",
        "criteria": "1_all_typed",
        "ready": lambda s: s["typed_count"] < s["total_skills"],
        "kind": "pulse",
        "max_iterations": 8,
        "prompt_template": "Type the frontmatter for these {n} SKILL.md files: {skill_list}. "
                           "Use the canonical format. Run the linter after each file.",
        "vars_fn": lambda s: next_typing_batch(s, n=10),
        "parallelizable": True,  # can dispatch 3 parallel Pulse sessions for different skill slices
    },
    {
        "name": "tier1_recall_bench",
        "criteria": "3_tier1_recall",
        "ready": lambda s: s["criteria"]["3_tier1_recall"]["status"] == "pending"
                           and gate_approved("REVIEW-4"),
        "kind": "pulse",
        "max_iterations": 5,
        "prompt_template": "Run the Tier 1 recall benchmark at [home-path] "
                           "Report typed_count, recall rate, and whether it exceeds 80%.",
        "vars_fn": lambda s: {},
        "parallelizable": False,
    },
    {
        "name": "review_4_gate",
        "criteria": "3_tier1_recall",
        "ready": lambda s: s["criteria"]["3_tier1_recall"]["status"] == "pending"
                           and not gate_approved("REVIEW-4"),
        "kind": "gate",
        "gate_name": "REVIEW-4",
        "description": "Mohamed: review the current recall benchmark quality before dispatching Track C.",
        "parallelizable": False,
    },
    {
        "name": "feedback_wiring",
        "criteria": "6_feedback_loop",
        "ready": lambda s: s["criteria"]["6_feedback_loop"]["status"] == "pending",
        "kind": "pulse",
        "max_iterations": 10,
        "prompt_template": "Wire the feedback components E.1-E.4 at [home-path] "
                           "Follow the EXECUTION-PLAN track E specification.",
        "vars_fn": lambda s: {},
        "parallelizable": False,
    },
    # ... additional criteria entries follow the same pattern
]
```

Gate approval mechanism:
```python
import json, pathlib

# criteria_overrides.json format:
# {"gates": {"REVIEW-4": true, "A.1": false}}
def gate_approved(gate_name: str) -> bool:
    overrides_path = pathlib.Path.home() / ".claude/state/elp2/criteria_overrides.json"
    if not overrides_path.exists():
        return False  # no overrides file means no gate has been approved
    data = json.loads(overrides_path.read_text())
    return data.get("gates", {}).get(gate_name, False)
```

Mohamed sets the gate to `true` under `gates` in the overrides file (or dispatcher.py generates a button in the dashboard HTML that does it via a small local handler). The next dispatcher cycle sees the gate as approved and queues the work.
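The selector described at the top of this section is a first-match scan over the ordered table. A minimal sketch; `select_next_batch` is a hypothetical name and the two-entry table is a stand-in for `BATCH_PRIORITY`:

```python
from typing import Optional


def select_next_batch(priority: list, scoreboard: dict) -> Optional[dict]:
    """Return the first entry whose ready predicate is true, else None."""
    for entry in priority:
        if entry["ready"](scoreboard):
            return entry
    return None


# Tiny stand-in table; the real entries live in BATCH_PRIORITY.
demo_priority = [
    {"name": "silent_capable", "kind": "pulse",
     "ready": lambda s: s["criteria"]["8_silent_capable"]["current"] < 5},
    {"name": "mass_typing_batch", "kind": "pulse",
     "ready": lambda s: s["typed_count"] < s["total_skills"]},
]
scoreboard = {"typed_count": 14, "total_skills": 295,
              "criteria": {"8_silent_capable": {"current": 5, "target": 5}}}
print(select_next_batch(demo_priority, scoreboard)["name"])  # mass_typing_batch
```

Because the scan stops at the first ready entry, list order is the priority: an entry that stays ready for a long time (such as mass typing) shadows everything below it until its predicate turns false.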
---
§7. Failure Resilience (honest, scoped)
What ELP-2 survives
Claude Code closure: dispatcher.py runs as a launchd plist, outside Claude Code. Closing Claude Code has no effect on the loop. The Pulse sessions it spawned continue to run inside their own Claude Code instances. This kills ELP-1 caveat C2.
Dispatcher crash: deadman.sh runs every 15 minutes from cron. If dispatcher.py crashes and does not touch the heartbeat for 30 minutes, deadman.sh writes `Desktop/ELP2-STALLED.txt`. The next Mohamed session can restart the plist with `launchctl kickstart gui/$(id -u)/com.diomande.elp2-dispatcher`. Because the plist uses `StartInterval`, launchd also starts dispatcher.py again at the next 300-second tick after a crash.
Bad Pulse session: Pulse already handles retry and session boundary detection. If a spawned session hits an error, Pulse emits BLOCKED and the next dispatcher cycle either retries or escalates to the dashboard.
Scoreboard file deleted: dispatcher.py re-generates it on the next run by calling scoreboard.py, which reads from SKILL.md files (the canonical state). Recovery: one cycle (5 minutes).
aura-gateway /inject format changes: ELP-2 uses Pulse as the primary dispatch path. If aura-gateway's InjectRequest schema changes, the fallback direct-inject path breaks, but Pulse-dispatched batches are unaffected. The coupling between ELP-2 and aura-gateway is narrow.
What ELP-2 does not survive
mac1 power loss: There is no mac2 automatic observer-mode in v1.0. The loop stops. Recovery: run `[home-path]` from any machine (see the failover script in §11). Manual but 60 seconds. This is the honest scope. For a 13-day project, mac1 power loss is recoverable in under 2 minutes. Automatic observer-mode is the right upgrade for SOOP-3 when the horizon is 60+ days.
Mass file corruption of SKILL.md files: ELP-2 has no DB backup of the skill content itself (ELP-1 had Supabase journaling but the filesystem mirror was unprovisioned anyway). The mitigation is git — the skill-typecheck repo is a git repo, and the SKILL.md files are committed after each typing batch by the Pulse session. Git is the journal. `git revert` is the rollback mechanism. This is more honest than ELP-1's claimed "atomic + journaled" rollback that had no specified journal format or recovery command.
Mohamed offline for 5+ days during a Mohamed-gated track: The loop waits at the gate. The dashboard shows the wait status. No automated escalation in v1.0. If Mohamed is offline that long during SOOP-2, the project timeline shifts regardless of what the loop does.
Simultaneous Pulse outage and mac1 launchd failure: At that point the whole environment is broken and nothing is running anything. This is not a meaningful failure mode to defend against in v1.0.
---
§8. The /inject Format Fix (resolving L4-1 CRITICAL)
Layer 1 finding L4-1 was the most concrete CRITICAL finding across all three layers: ELP-1's spec called the inject payload `{machine, tmux_target, prompt}`, but the actual `InjectRequest` model (gateway.py line 1875) requires `tty` when `machine == "mac1"` and rejects the call with HTTP 422 without it. The primary worker on the primary machine could never self-dispatch.
ELP-2's resolution:
1. Pulse is the primary dispatch path. Pulse's own MCP interface (`pulse_start`) abstracts over routing. ELP-2 does not call `/inject` directly for normal Pulse-dispatched batches.
2. For the narrow case where direct /inject is needed (mac4-specific work where Pulse's routing is not granular enough), dispatcher.py reads the current tty from `[home-path]` at dispatch time and includes it in the payload. The tty is never hardcoded. See §5 Mode B for the full code sample.
3. The exact InjectRequest schema is documented in a comment in dispatcher.py pointing to gateway.py line 1875. When the gateway schema changes, the comment goes stale (visible in code review); the behavior also fails fast with HTTP 422 (immediately visible). There is no silent failure mode.
---
§9. The Claim TTL Fix (resolving L5-1 CRITICAL)
ELP-1's most dangerous correctness bug was L5-1: a 5-minute claim TTL could release a batch to a second worker while the first worker was still writing to the same SKILL.md files. Both workers write concurrently. The files end up in an undefined state.
ELP-2's resolution is architectural, not a parameter change. There are no claims in ELP-2. Pulse sessions do not share files while running. A Pulse session for "type these 10 skills" owns those 10 files for the duration of the session. When the session completes (or hits BLOCKED), the next dispatcher cycle scans the filesystem and picks the next 10 untyped skills. If a Pulse session crashes mid-write, the next cycle sees a partially typed file, runs the linter (which will fail on it), and either retries that file or skips it based on the linter exit code.
The race condition L5-1 described simply does not exist when the worker model is one-session-at-a-time Pulse spawns with filesystem-canonical state.
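When the mass-typing batch is dispatched as several parallel Pulse sessions, the no-shared-files property depends on each session receiving a disjoint slice. A sketch of one way to produce such slices; the `disjoint_slices` helper, the sorted ordering, and the skill names are assumptions made for illustration:

```python
def disjoint_slices(untyped: list, batch_size: int = 10, sessions: int = 3) -> list:
    """Split the head of the untyped list into non-overlapping batches."""
    ordered = sorted(untyped)  # stable order so every cycle slices the same way
    head = ordered[: batch_size * sessions]
    return [head[i:i + batch_size] for i in range(0, len(head), batch_size)]


untyped = [f"skill-{i:03d}" for i in range(25)]
slices = disjoint_slices(untyped)
print([len(s) for s in slices])  # [10, 10, 5]
```

Each inner list would feed one session's `{skill_list}`; since the slices are cut from a single ordered list, no file name can appear in two sessions of the same cycle.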
---
§10. The Syncthing Fix (resolving L4-2 CRITICAL)
ELP-1 described Syncthing replication of `[home-path]` as a fallback mechanism. This path was not in the Syncthing config on mac1 (15 folder entries, none matching `.claude/state`). The fallback was unprovisioned.
ELP-2 makes Syncthing replication of `[home-path]` optional for resilience and uses `Desktop/` (already Syncthing-provisioned) for the primary human-visible surface (the dashboard HTML).
One-time Syncthing setup for `[home-path]`:
This step is recommended but not required for v1.0 operation. Without it, the scoreboard and heartbeat are mac1-local. With it, Mohamed can check ELP-2 status from any machine by reading the Syncthing-replicated scoreboard.json.
```bash
# Run once on mac1. Mohamed must confirm the folder share on each target machine.
# Syncthing GUI is at http://localhost:8384
FOLDER_ID="elp2-state-$(hostname | md5 | cut -c1-8)"
FOLDER_PATH="$HOME/.claude/state/elp2"
mkdir -p "$FOLDER_PATH"
# Add via Syncthing CLI (syncthing cli config folder add)
# Or: open http://localhost:8384, click Add Folder, set:
#   Folder ID: $FOLDER_ID
#   Folder Path: $FOLDER_PATH
#   Share with: mac2, mac4, mac5 (whichever are online)
# Exclude the snapshots directory from Syncthing (local-only backup)
echo "snapshots" >> "$FOLDER_PATH/.stignore"
```

If this setup is not done, ELP-2 still works correctly. Mohamed reads the dashboard from mac1's Desktop or via Tailscale browser. The only thing lost is the convenience of seeing the scoreboard from other machines without Tailscale.
---
§11. Bootstrap Procedure
The full bootstrap is one bash script that takes under 5 minutes.
```bash
#!/usr/bin/env bash
# bootstrap-elp2.sh - run once on mac1
set -euo pipefail

ELP2_TOOLS="$HOME/.claude/tools/elp2"
ELP2_STATE="$HOME/.claude/state/elp2"
PLIST_LABEL="com.diomande.elp2-dispatcher"
DEADMAN_CRON="*/15 * * * * $ELP2_TOOLS/deadman.sh"

echo "[1/7] Creating state directory..."
mkdir -p "$ELP2_STATE/snapshots"

echo "[2/7] Seeding scoreboard.json from current SOOP-2 memory..."
# Run scoreboard.py directly to generate the initial scoreboard
python3 "$ELP2_TOOLS/scoreboard.py" --write "$ELP2_STATE/scoreboard.json"

echo "[3/7] Writing initial heartbeat..."
touch "$ELP2_STATE/heartbeat"
# Reference file created after the seed heartbeat. Step 6 only accepts a
# heartbeat newer than this, i.e. one written by dispatcher.py itself.
HB_REF="$(mktemp)"

echo "[4/7] Installing launchd plist for dispatcher.py..."
PLIST_PATH="$HOME/Library/LaunchAgents/$PLIST_LABEL.plist"
cat > "$PLIST_PATH" << PLIST
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key> <string>$PLIST_LABEL</string>
  <key>ProgramArguments</key>
  <array>
    <string>/opt/homebrew/bin/python3</string>
    <string>$ELP2_TOOLS/dispatcher.py</string>
  </array>
  <key>StartInterval</key> <integer>300</integer>
  <key>RunAtLoad</key> <true/>
  <key>StandardOutPath</key> <string>/tmp/elp2-dispatcher.log</string>
  <key>StandardErrorPath</key><string>/tmp/elp2-dispatcher.err</string>
  <key>EnvironmentVariables</key>
  <dict>
    <key>ELP2_GATEWAY_TOKEN</key>
    <string>claw-glasses-2026</string>
  </dict>
</dict>
</plist>
PLIST
# Note: credentials are loaded from environment, not embedded in the plist as literals.
# The gateway token is the default public token for the home mesh.
# If a secret token is in use, replace with:
#   launchctl setenv ELP2_GATEWAY_TOKEN "$(cat [home-path])"
# before loading the plist.
launchctl load "$PLIST_PATH"
echo "  Plist loaded. Check /tmp/elp2-dispatcher.err for startup errors."

echo "[5/7] Installing deadman.sh cron entry..."
# Add only if not already present. "|| true" keeps set -e/pipefail from
# aborting when the existing crontab is empty or grep filters out every line.
(crontab -l 2>/dev/null | grep -v "deadman.sh" || true; echo "$DEADMAN_CRON") | crontab -
echo "  Cron entry installed."

echo "[6/7] Verifying first heartbeat (waiting up to 6 min for dispatcher to run)..."
FRESH=0
for i in $(seq 1 12); do
  sleep 30
  # Counts the heartbeat only if dispatcher.py touched it after step 3.
  FRESH=$(find "$ELP2_STATE/heartbeat" -newer "$HB_REF" 2>/dev/null | wc -l)
  if [[ "$FRESH" -gt 0 ]]; then
    echo "  Heartbeat confirmed fresh (cycle ${i})."
    break
  fi
  echo "  Waiting... (${i}/12)"
done
rm -f "$HB_REF"
if [[ "$FRESH" -eq 0 ]]; then
  echo "  WARNING: heartbeat not refreshed after 6 min. Check /tmp/elp2-dispatcher.err"
  exit 1
fi

echo "[7/7] Bootstrap complete. Dashboard at Desktop/elp2-dashboard.html"
```

Python path note: The plist uses `/opt/homebrew/bin/python3`. This is the correct path for Homebrew Python on Apple Silicon. ELP-1's plist specified `/usr/local/bin/python3` (L2-2 HIGH). The difference matters because launchd starts jobs with a minimal PATH that does not include the Homebrew directories.
Credential handling: The gateway token is set as an `EnvironmentVariables` entry in the plist, not embedded as a string value that would require Mohamed to edit XML. For a secret token, `launchctl setenv` is the correct pattern before plist load. ELP-1 specified plaintext secrets in plist XML (L2-3 HIGH). ELP-2 uses the standard pattern.
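On the dispatcher side, the token can be read from the environment once at startup. A minimal sketch, assuming the `ELP2_GATEWAY_TOKEN` variable name from the plist above; the `gateway_token` helper is hypothetical:

```python
import os


def gateway_token(env=os.environ) -> str:
    """Read the gateway token that launchd injects via EnvironmentVariables."""
    token = env.get("ELP2_GATEWAY_TOKEN", "").strip()
    if not token:
        # Fail loudly at startup instead of sending unauthenticated requests.
        raise RuntimeError("ELP2_GATEWAY_TOKEN is not set")
    return token
```

Raising at startup puts the error in `/tmp/elp2-dispatcher.err` on the first cycle, which is where the bootstrap script already tells the operator to look.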
Failover to mac2:
```bash
# failover-to-mac2.sh - run from any machine if mac1 is down
# Copies state from Syncthing-replicated path and starts plist on mac2
ssh mac2 'mkdir -p [home-path]'
ssh mac2 'cp Desktop/elp2-dashboard.html [home-path] 2>/dev/null || true'
ssh mac2 "cp $HOME/Library/LaunchAgents/com.diomande.elp2-dispatcher.plist [home-path]"
ssh mac2 'launchctl load [home-path]'
echo "ELP-2 now running on mac2. Monitor at /tmp/elp2-dispatcher.err on mac2."
```

This is the entire mac1-offline recovery procedure. It is not automatic. It runs in about 60 seconds.
---
§12. Acceptance Criteria (ELP-2's own)
ELP-1 had 10 acceptance criteria. Layer 1's Lens 6 analysis showed 8 of those 10 could pass with a silently broken implementation (the AC measured file existence, not correctness). ELP-2 has 8 criteria that require observable output.
| # | Criterion | How to verify |
|---|---|---|
| 1 | dispatcher.py exists at `[home-path]` and launchd plist is loaded | `launchctl list \| grep elp2` shows the label |
| 2 | scoreboard.py correctly identifies `typed_count`, `silent_capable_count`, and `criteria_passed` from filesystem state in under 2s | `time python3 scoreboard.py` prints the expected values against a known state |
| 3 | deadman.sh runs INDEPENDENT of dispatcher.py and writes `Desktop/ELP2-STALLED.txt` when heartbeat is >30 min stale | `launchctl unload dispatcher.plist && sleep 2100 && bash deadman.sh && ls Desktop/ELP2-STALLED.txt` (2100 s = 35 min) |
| 4 | dashboard.html exists, is non-empty, and contains the current typed_count after a dispatcher cycle | read dashboard.html and grep for the typed_count value |
| 5 | Closing Claude Code for 30 min does NOT cause typed_count to stop advancing | close Claude Code, wait 30 min, open new session, run scoreboard.py, confirm typed_count increased |
| 6 | A simulated dispatcher crash (`kill -9`) is detected by deadman.sh within one cron cycle (15 min) of the heartbeat passing the 30-minute staleness threshold | kill -9 the dispatcher process, wait 45 min (30 min threshold plus one cron cycle), confirm ELP2-STALLED.txt exists |
| 7 | When all 10 SOOP-2 criteria flip to true in scoreboard.json, dispatcher.py writes a completion memory file and the launchd plist stops reloading | manually set all criteria to "pass" in scoreboard.json, wait one cycle, confirm memory file written and plist no longer running |
| 8 | ELP-2 itself ships in under 6 hours of work | track wall-clock time from bootstrap start to AC 1-7 all green |
Note: AC 8 is self-referential. It exists because Layer 4 correctly identified that the build cost of ELP is itself a project risk. If building ELP-2 takes more than 6 hours, the build cost has exceeded the benefit for a 13-day convergence project. At that point, the fallback is to run Pulse sessions manually from Mohamed's main session, which remains available regardless of ELP-2 status.
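AC 3 and AC 6 both reduce to one check: is the heartbeat older than 30 minutes? The sketch below shows that check in Python so it can be tested in isolation, even though deadman.sh itself is a shell script. The heartbeat path, the marker text, and both function names are assumptions for illustration; only the 30-minute threshold and the `ELP2-STALLED.txt` marker name come from the criteria above.

```python
import time
from pathlib import Path
from typing import Optional

STALE_AFTER_SECONDS = 30 * 60  # staleness threshold taken from AC 3


def heartbeat_is_stale(heartbeat: Path, now: Optional[float] = None) -> bool:
    """Return True when the heartbeat file is missing or older than the threshold."""
    now = time.time() if now is None else now
    try:
        age = now - heartbeat.stat().st_mtime
    except FileNotFoundError:
        return True  # assumption: a missing heartbeat counts as a stall
    return age > STALE_AFTER_SECONDS


def check_and_mark(heartbeat: Path, marker: Path, now: Optional[float] = None) -> bool:
    """Write the stall marker if the heartbeat is stale; return whether it was written."""
    if not heartbeat_is_stale(heartbeat, now):
        return False
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text(f"ELP-2 heartbeat stale: {heartbeat}\n")  # hypothetical marker text
    return True
```

Passing `now` explicitly lets the check be exercised without waiting 30 minutes. This makes AC 3 verifiable in a unit test before the slow end-to-end run.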
---
§13. Comparison Table: ELP-1 vs ELP-2
| Dimension | ELP-1 | ELP-2 | Verdict |
|---|---|---|---|
| File count | ~15 (supervisor.py, worker.py, verifier.py, reconciler.py, schema.sql, 2 plists, 2 cron jobs, Grafana panel defs, telegram config, +more) | 5 (scoreboard.py, dispatcher.py, deadman.sh, scoreboard.json, dashboard.html) | ELP-2 |
| Build cost | 9.5h realistic (vs "1-2 sessions" claimed in spec) | ~6h including bootstrap and smoke test | ELP-2 |
| Worker model | bespoke: soop2_workers table, claim TTL, FOR UPDATE SKIP LOCKED, worker registration | Pulse spawn calls | ELP-2 (Pulse already exists) |
| State plane | Supabase primary + filesystem fallback (fallback physically unprovisioned) | filesystem only, Syncthing-optional | ELP-2 |
| CRITICAL bugs at design time | 5 (split-brain, Syncthing unprovisioned, claim TTL race, /inject format mismatch, PostgREST SKIP LOCKED unsupported) | 0 | ELP-2 |
| Escalation | Telegram + Twilio (require direct REST from daemon; not callable as Claude skills from outside Claude) | Desktop file (visible) + v1.1 telegram | ELP-1 is better in principle; ELP-2 is better in practice because it actually works |
| Mac2 failover | Automatic observer-mode promotion after 30 min heartbeat gap (dual-primary risk L1-1, L5-2) | Manual failover script, ~60 seconds | ELP-2 (honest about scope) |
| Observability | Grafana on cloud-vm (requires Supabase direct Postgres, L4-5) | Static HTML at [home-path] | ELP-2 for v1.0; ELP-1 is better long-term |
| Reuse of existing infrastructure | None (Pulse exists; ELP-1 does not use it) | Pulse for worker dispatch | ELP-2 |
| Upgrade path to multi-machine | Built-in (observer-mode, Supabase cross-mesh) but broken at v1.0 | Explicit v1.1 upgrade path (see §14) | Tie |
| Code someone can read cold in 30 minutes | No (scattered across 15 files, 4 tables, Grafana) | Yes (5 files, 1 JSON) | ELP-2 |
---
§14. Upgrade Path to ELP-3
ELP-2 is scoped for SOOP-2: one machine (mac1 primary), 13 days, 295 skills, 10 criteria. When SOOP-3 or a similar 50-day multi-mesh project arrives, the upgrade path is explicit.
v1.1 (add when SOOP-2 closes, before SOOP-3 starts)
- Add Supabase state plane (`scoreboard` table, `dispatch_log` table). Filesystem scoreboard.json becomes the write-ahead buffer; Supabase is the durable truth.
- Add automatic mac2 failover. The observer-mode logic from ELP-1 §4.4 is the right design; implement it now that the dual-primary bug (L1-1, L5-2) has been acknowledged. The fix is a `supervisor_generation` column and a CAS update: `UPDATE soop2_state SET supervisor_id = 'mac2' WHERE supervisor_id = 'mac1' AND supervisor_generation = $expected`.
- Add telegram escalation. The implementation is a direct Telegram Bot API call in deadman.sh (a shell script; `curl` is all that is needed). Not a Claude skill wrapper. Not a Python library that assumes Claude runtime.
- Add Grafana panels if Supabase is now the state plane (correct dependency order; ELP-1 had Grafana before Supabase).
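The compare-and-swap promotion in the failover item can be illustrated with a small sketch. It uses Python's built-in `sqlite3` as a stand-in for Supabase/Postgres, and the single-row `soop2_state` schema is assumed. Bumping `supervisor_generation` in the same statement is an addition not present in the SQL quoted above. It is included here because, without it, a later promotion could not tell a fresh view of the row from a stale one.

```python
import sqlite3


def try_promote(conn, new_id, old_id, expected_generation):
    """Compare-and-swap the supervisor row; return True only if this caller won."""
    cur = conn.execute(
        "UPDATE soop2_state "
        "SET supervisor_id = ?, supervisor_generation = supervisor_generation + 1 "
        "WHERE supervisor_id = ? AND supervisor_generation = ?",
        (new_id, old_id, expected_generation),
    )
    conn.commit()
    return cur.rowcount == 1  # zero rows updated means another writer got there first


# In-memory stand-in for the real state table (schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE soop2_state (supervisor_id TEXT, supervisor_generation INTEGER)"
)
conn.execute("INSERT INTO soop2_state VALUES ('mac1', 7)")

first = try_promote(conn, "mac2", "mac1", 7)   # matches the row, promotion succeeds
second = try_promote(conn, "mac2", "mac1", 7)  # stale view, no row matches
```

A caller that gets `False` should re-read the row and stay in observer mode rather than retry blindly. That is what turns the dual-primary race into a clean loser-backs-off outcome.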
v2.0 (if SOOP-3 demands true multi-machine parallelism)
- Add Supabase queue table with proper claim semantics (the `soop2_queue` table from ELP-1 §3.2, but with correct PostgREST RPC wrappers for `FOR UPDATE SKIP LOCKED` — not raw SQL as ELP-1 assumed).
- Add quorum if multiple supervisors are truly needed.
- Keep Pulse as the worker model. Do not build a bespoke worker pool.
What ELP-2 explicitly forbids (the things that killed ELP-1):
1. Custom worker pools. Use Pulse.
2. Custom queue tables in v1.0. Pulse has work-item primitives.
3. Mac3 references. Mac3 is deprecated per SOOP-2 Track H.
4. Telegram in v1.0. The claim "telegram skill is callable from supervisor daemon" is architecturally wrong (L4-6). Telegram goes in v1.1 with a direct curl call.
5. Escalation logic embedded inside dispatcher.py. deadman.sh is INDEPENDENT. If dispatcher.py has a bug, deadman.sh still fires.
6. Hardcoded TTY values in /inject calls. TTY must be read from the pane registry at dispatch time.
7. "Soft failover" language for dual-primary scenarios. Call it what it is: a dual-primary race. Design around it with CAS writes or single-machine scope.
---
§15. Layer 3 Status
Layer 3 (Codex adversarial review) did not exist on the filesystem when this document was written. The file `06c-layer3-codex-adversarial.md` returned a file-not-found error.
ELP-2 will be revised if Layer 3 surfaces findings incompatible with the design choices here. The most likely Layer 3 targets, given the pattern of CRITICAL findings in Layers 1 and 2, are:
- The Pulse session-length envelope. If Pulse cannot handle a 2-hour batch of 30 skill typings (Layer 4 Risk 1), ELP-2 needs a per-batch-kind fallback to direct aura-gateway /inject with the correct tty handling from §5 Mode B.
- The scoreboard.json single-writer assumption. If anything other than dispatcher.py touches scoreboard.json (e.g., a manual override script that does not use the atomic write pattern), there is a corruption window. The fix is a file lock, not a structural change.
- The Pulse MCP interface availability from a launchd context. Pulse uses `mcp__pulse-local__pulse_start`, which requires an MCP server running alongside the Claude session. A launchd daemon outside Claude does not have MCP access. This is a real concern: if Pulse can only be called from inside a Claude Code session, then ELP-2's primary dispatch path requires a running Claude session, which re-creates caveat C2.
This last point is the most significant potential flaw in ELP-2 as designed. The resolution, if Layer 3 confirms it: fall back to calling `pulse_start` via the Pulse orchestrator HTTP API directly (`https://pulse-orchestrator-274020562532.us-central1.run.app`), not the MCP interface. The Pulse SKILL.md shows the orchestrator URL. A direct HTTP call from a Python daemon does not require Claude runtime. This fallback should be verified before implementation begins.
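Returning to the scoreboard.json single-writer concern listed above: the atomic write pattern and the proposed file lock can be combined in one helper. This is a minimal sketch, assuming every writer (dispatcher.py and any manual override script) goes through the same function. The `.lock` sidecar file and the function name are hypothetical. `fcntl.flock` is advisory and POSIX-only, which fits macOS but protects only against cooperating writers.

```python
import fcntl
import json
import os
import tempfile
from pathlib import Path


def write_scoreboard(path: Path, state: dict) -> None:
    """Replace scoreboard.json atomically while holding an exclusive advisory lock."""
    lock_path = Path(str(path) + ".lock")  # hypothetical sidecar lock file
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until other cooperating writers finish
        fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(state, f, indent=2)
                f.flush()
                os.fsync(f.fileno())  # make sure the bytes reach disk before the rename
            os.replace(tmp, path)  # atomic rename on the same filesystem
        except BaseException:
            if os.path.exists(tmp):
                os.unlink(tmp)  # do not leave partial temp files behind
            raise
    # the lock is released when the lock file is closed by the with-block
```

Readers never see a half-written file because the rename swaps the whole file at once. The lock only serializes writers, so a reader such as the dashboard generator needs no lock at all.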
---
§16. Build Order
| Step | Component | Effort | Unlocks |
|---|---|---|---|
| 1 | scoreboard.py with write() and all 10 criterion checks | 45 min | everything |
| 2 | dispatcher.py skeleton: cycle loop, priority table, gate logic | 90 min | launchd install |
| 3 | dispatch_pulse() with Pulse orchestrator HTTP call (not MCP) | 30 min | Pulse batches |
| 4 | dispatch_inject_mac1() with pane registry tty read | 20 min | direct /inject fallback |
| 5 | deadman.sh (3 lines) | 5 min | independent escalation |
| 6 | launchd plist + bootstrap-elp2.sh | 30 min | production operation |
| 7 | dashboard HTML generation in dispatcher.py | 45 min | observability |
| 8 | Smoke test: one full cycle through mass_typing_batch | 45 min | AC 5 verification |
Total: 310 min (~5.2 hours). Under the 6-hour meta-criterion (AC 8).
---
§17. SOOP-2 Scoreboard at ELP-2 Design Time
For reference, the current state that ELP-2 will advance:
| Criterion | Status | Notes |
|---|---|---|
| 1. All 295 skills typed | in_progress (14/295) | Primary work target for ELP-2 |
| 2. Linter under 3s | pass | Already done |
| 3. Tier 1 recall >= 80 | | |
| 4. Twin primary (mac4:8100) | pending | Requires mac4 availability |
| 5. type_compatibility_weight knob | pending | Architecture decision |
| 6. Feedback loop wired | pending | Track E work |
| 7. Contrarian files exist | pass | Already done |
| 8. Silent capable >= 5 | in_progress (3/5) | 2 more skill edits |
| 9. mac3 purge | pass | Already done |
| 10. Audit memory file | pending | Interactive per Rail |
3 criteria passed. 7 to go. ELP-2's primary contribution is advancing criterion 1 (281 SKILL.md files remaining) without requiring Mohamed's main session to stay open.
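The tally above can be reproduced mechanically, which is roughly what scoreboard.py has to do each cycle. The sketch below is illustrative only: the dictionary layout and the `summarize` function are assumptions, not the actual scoreboard.json schema. Criterion 3 is recorded as `None` because the table gives no status for it.

```python
from collections import Counter

TOTAL_SKILLS = 295  # from criterion 1

# Statuses transcribed from the table above; criterion 3 has no recorded status.
criteria = {
    1: "in_progress", 2: "pass", 3: None, 4: "pending", 5: "pending",
    6: "pending", 7: "pass", 8: "in_progress", 9: "pass", 10: "pending",
}


def summarize(criteria, typed_count):
    """Count passed criteria and remaining work from a status mapping."""
    counts = Counter(criteria.values())
    passed = counts["pass"]
    return {
        "criteria_passed": passed,
        "criteria_remaining": len(criteria) - passed,
        "skills_remaining": TOTAL_SKILLS - typed_count,
        "complete": passed == len(criteria),  # the AC 7 completion condition
    }


summary = summarize(criteria, typed_count=14)
# summary["criteria_passed"] == 3, summary["criteria_remaining"] == 7,
# summary["skills_remaining"] == 281, summary["complete"] is False
```

Only an explicit `"pass"` counts toward completion, so a missing or unknown status can never trigger the completion path by accident.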
---
End of ELP-2 survivor architecture document.
Produced by Crucible Stage 3 FORGE for chain:full-omega cycle 2.
Promotion Decision
Promote into a technical note or architecture paper with implementation anchors.
Source Anchor
crucible-output/soop-2/07-elp-survivor-architecture.md