Grand Diomande Research · Full HTML Reader

ELP-2 — Survivor Architecture

Three of four scrutiny layers returned a convergent verdict: ELP-1 as written cannot ship. The CRITICAL findings are structural — a /inject format mismatch that breaks the primary dispatch path, a concurrent SKILL.md write race enabled by a 5-minute claim TTL, and a Syncthing-backed filesystem fallback that is architecturally described but physically unprovisioned. These are not tuning problems; they are root-cause failures.

# ELP-2 — Survivor Architecture
> Crucible Stage 3 FORGE | cycle: chain:full-omega #2 | date: 2026-05-13
> Input layers: L1 meta-review (27 findings, NO-SHIP), L2 AMR (AMEND), L4 Evo3 (Hybrid wins)
> Layer 3 (Codex adversarial): pending — see footnote at §15.

---

Preamble

The verdict is not "abandon the loop." The four caveats ELP-1 was designed to kill are real. Claude Code session closure kills the current ScheduleWakeup loop. Stall detection is absent. A single driver is a single point of failure. These problems are worth solving.

The verdict is: replace ELP-1's bespoke distributed worker system with a thin Pulse-backed hybrid that kills the same four caveats at roughly 40

That replacement is ELP-2.

---

§1. Identity Statement

ELP-2 is a Pulse-orchestrated, criteria-driven convergence loop for SOOP-2.

What the supervisor does each cycle:

1. Read filesystem state, compute which of the 10 SOOP-2 criteria pass or fail.
2. Decide what batch of work to dispatch next (or whether Mohamed's approval is required first).
3. Dispatch via Pulse (for work that fits Pulse's session model) or inline (for trivial work).
4. Write the heartbeat file, append to the dispatch log, regenerate the dashboard.
5. Sleep and repeat.

What ELP-2 does not do:

- Run its own worker pool (Pulse handles this).
- Maintain a `soop2_workers` table (Pulse handles this).
- Implement claim semantics with a TTL (the source of the race condition in ELP-1's L5-1 finding).
- Assume Supabase exists (filesystem-only in v1.0).
- Escalate via telegram (dashboard banner in v1.0; telegram in v1.1).

The result is roughly 5 files instead of ELP-1's 15+, a 6-hour build instead of ~9.5 hours, and zero of ELP-1's five CRITICAL findings.

---

§2. What Changed from ELP-1 (the short version)

| Component | ELP-1 | ELP-2 | Reason killed |
|---|---|---|---|
| Worker model | bespoke `soop2_workers` table, claim TTL, `FOR UPDATE SKIP LOCKED` | Pulse spawn calls | Pulse already implements worker lifecycle, rate-limit awareness, and retry. Reinventing it was L4's sharpest finding. |
| State plane | Supabase (primary) + filesystem mirror (fallback) | Filesystem only in v1.0 | The Syncthing mirror was physically unprovisioned (L4-2 CRITICAL). Supabase adds complexity for a 13-day project. |
| Mac2 observer mode | Automatic promotion after 30-min heartbeat gap | One-command failover script | Observer mode is a real dual-primary risk (L1-1, L5-2 CRITICAL). For a 13-day project, manual failover in 60s is acceptable. |
| Escalation | Telegram + Twilio from supervisor daemon | Dashboard HTML banner | Telegram/Twilio require direct REST calls from a daemon context. ELP-1 assumed the Claude skill wrappers were callable from that context, which is not correct (L4-6 MEDIUM). |
| Observability | Grafana on cloud-vm | Static HTML at Desktop/elp2-dashboard.html | Grafana requires the Supabase direct Postgres port (L4-5 MEDIUM). The dashboard is Syncthing-shared and readable anywhere. |
| Claim TTL race | 5-minute TTL on batch claims; expired claims released while original worker still writing | N/A | ELP-2 has no claims. Pulse manages its own session lifecycle. L5-1 CRITICAL is gone by construction. |
| /inject format | Spec said {machine, tmux_target, prompt}; gateway required tty for mac1 | Verified against gateway.py line 1877; tty provided for mac1 dispatches | L4-1 CRITICAL: the primary dispatch path was broken in ELP-1. |

---

§3. File Map (~5 files total)

[home-path]
  scoreboard.py       # reads filesystem, computes 10-criteria pass/fail/in-progress
  dispatcher.py       # decides next batch, dispatches via Pulse or inline or gate
  deadman.sh          # 3-line cron, INDEPENDENT of dispatcher.py

[home-path]
  scoreboard.json     # the truth; atomic writes via tmp + os.replace
  heartbeat           # touched every cycle by dispatcher.py (not by deadman.sh)
  dispatch-log.jsonl  # append-only audit
  snapshots/          # hourly scoreboard copies (local, Syncthing-excluded)

Desktop/elp2-dashboard.html   # Syncthing-shared static HTML, regenerated each cycle

That is it. No `soop2_workers` table. No `soop2_queue` table. No `soop2_log` table. No `schema.sql`. No worker registration protocol. No claim semantics. No reconciler. No observer-mode supervisor. No Grafana panels. No telegram bot. No Twilio integration. All of those existed in ELP-1's spec and none of them worked correctly per the three scrutiny layers.

---

§4. Component Specifications

§4.1 scoreboard.py

Location: `[home-path]`

Role: Read-only computation from filesystem state. No writes except to `scoreboard.json` via its `write()` function.

Inputs read:
- `[home-path]` frontmatter (for typed_count, silent_capable_count)
- `[home-path]` (optional; for criteria that require Mohamed's manual flip, e.g., REVIEW-4)
- The linter binary at `[home-path]` (invoked as subprocess, budget: 2 seconds)

Output schema (scoreboard.json, schema_version: 1):

json

{
  "schema_version": 1,
  "updated_at": "2026-05-13T04:00:00Z",
  "typed_count": 14,
  "total_skills": 295,
  "silent_capable_count": 3,
  "criteria": {
    "1_all_typed":       {"status": "in_progress", "current": 14, "target": 295},
    "2_linter_fast":     {"status": "pass"},
    "3_tier1_recall":    {"status": "pending", "note": "gated on REVIEW-4"},
    "4_twin_primary":    {"status": "pending", "note": "mac4:8100 required"},
    "5_type_weight_knob":{"status": "pending", "note": "architecture decision"},
    "6_feedback_loop":   {"status": "pending"},
    "7_contrarian":      {"status": "pass"},
    "8_silent_capable":  {"status": "in_progress", "current": 3, "target": 5},
    "9_mac3_purge":      {"status": "pass"},
    "10_audit_memory":   {"status": "pending", "note": "Interactive per Rail"}
  },
  "criteria_passed": 3,
  "loop_complete": false
}

Performance contract: completes in under 2 seconds. The linter alone runs in 0.13s (verified). The SKILL.md scan of 295 files with frontmatter parsing is well under 2s on mac1 SSD.
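
The two summary fields at the bottom of the schema can be derived from the `criteria` map instead of being maintained by hand. The following is a minimal sketch, assuming that `criteria_passed` counts entries whose status is `"pass"` and that `loop_complete` means every criterion passes. The helper name `summarize` is hypothetical and does not come from the spec.

```python
def summarize(criteria: dict) -> dict:
    """Derive criteria_passed and loop_complete from the criteria map."""
    passed = sum(1 for c in criteria.values() if c.get("status") == "pass")
    return {
        "criteria_passed": passed,
        # Assumption: the loop is complete only when every criterion passes.
        "loop_complete": bool(criteria) and passed == len(criteria),
    }


# Statuses copied from the example scoreboard.json above.
example_criteria = {
    "1_all_typed": {"status": "in_progress"},
    "2_linter_fast": {"status": "pass"},
    "3_tier1_recall": {"status": "pending"},
    "4_twin_primary": {"status": "pending"},
    "5_type_weight_knob": {"status": "pending"},
    "6_feedback_loop": {"status": "pending"},
    "7_contrarian": {"status": "pass"},
    "8_silent_capable": {"status": "in_progress"},
    "9_mac3_purge": {"status": "pass"},
    "10_audit_memory": {"status": "pending"},
}

summary = summarize(example_criteria)
```

Computing the summary from the map keeps `criteria_passed` from drifting out of step with the per-criterion statuses.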

Atomic write pattern:

python

import json, os, pathlib

def write(path: pathlib.Path, data: dict) -> None:
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(data, indent=2))
    os.replace(tmp, path)   # POSIX atomic on same filesystem

This is the same pattern ELP-1 specified in §3.2. It is correct. ELP-2 keeps it.

---

§4.2 dispatcher.py

Location: `[home-path]`

Role: One cycle of the loop. Reads scoreboard, decides what to do, dispatches, writes outputs.

Invocation: Called by launchd every 5 minutes. NOT called by deadman.sh. These are independent.

Cycle steps:

1. read scoreboard.json
2. if loop_complete: write completion memory file, exit plist (via a sentinel file)
3. select next batch from priority table (see §6)
4. if batch kind == "Mohamed-gated":
       update dashboard with "waiting for Mohamed: <gate_name>"
       touch heartbeat
       exit (do NOT queue post-gate work)
5. if batch kind == "inline":
       run the work directly (e.g., count files, read a single SKILL.md)
6. if batch kind == "pulse-dispatched":
       call pulse_start() via MCP with prompt template + stop conditions
       log spawn_id to dispatch-log.jsonl
7. touch [home-path]
8. append entry to dispatch-log.jsonl
9. regenerate dashboard HTML
10. run scoreboard.py to refresh scoreboard.json

No claim semantics. There is no batch state machine. Pulse manages its own session lifecycle. If dispatcher.py crashes after step 6 but before step 7, the Pulse session continues running. The next dispatcher cycle reads the same scoreboard state, detects no advance in typed_count (or whichever metric), and either dispatches a new Pulse spawn or waits. Duplicate Pulse sessions for the same track are acceptable because SKILL.md edits are idempotent (frontmatter is either written or it is not; writing it twice is a no-op).

This eliminates ELP-1 finding L5-1 by construction. There is no concurrent SKILL.md write race because there is no claim TTL releasing a batch to a second worker while the first is still writing. Pulse owns one session at a time per track. The idempotency of frontmatter writes handles any overlap.
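
The idempotency claim can be made concrete with a small sketch. It assumes SKILL.md frontmatter is a leading block delimited by `---` lines containing simple `key: value` entries; the function name `ensure_frontmatter_key` is hypothetical, and the real typing work is done by the Pulse session, not by this helper.

```python
def ensure_frontmatter_key(text: str, key: str, value: str) -> str:
    """Return text with `key: value` present in the leading frontmatter block.

    Applying the function twice gives the same result as applying it once.
    """
    lines = text.split("\n")
    entry = f"{key}: {value}"
    if lines and lines[0].strip() == "---":
        # Locate the closing delimiter of the existing frontmatter block.
        end = next(
            (i for i in range(1, len(lines)) if lines[i].strip() == "---"),
            None,
        )
        if end is not None:
            for i in range(1, end):
                if lines[i].split(":", 1)[0].strip() == key:
                    lines[i] = entry  # same value on a second run: no change
                    return "\n".join(lines)
            lines.insert(end, entry)
            return "\n".join(lines)
    # No (closed) frontmatter block yet: create one at the top.
    return "\n".join(["---", entry, "---"] + lines)


doc = "---\nname: demo\n---\nbody\n"
once = ensure_frontmatter_key(doc, "silent_capable", "true")
twice = ensure_frontmatter_key(once, "silent_capable", "true")
```

Because the second application returns the text unchanged, two overlapping sessions that write the same key to the same file converge on the same content.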

---

§4.3 deadman.sh

Location: `[home-path]`

Role: INDEPENDENT escalation. This script does not import or call dispatcher.py. It has no Python dependency. It cannot be killed by a dispatcher crash.

Contents:

bash

#!/usr/bin/env bash
# ELP-2 dead-man's switch. Cron runs this every 15 min independent of dispatcher.
# If the heartbeat file hasn't been touched in 30 min, write a visible alert.
HEARTBEAT=[home-path]
ALERT="$HOME/Desktop/ELP2-STALLED.txt"   # absolute path, so the script does not depend on cron's working directory
if [[ ! -f "$HEARTBEAT" ]] || [[ $(find "$HEARTBEAT" -mmin +30 2>/dev/null) ]]; then
    echo "ELP-2 stalled at $(date). Last heartbeat: $(stat -f '%Sm' "$HEARTBEAT" 2>/dev/null || echo 'never')" > "$ALERT"
fi

Three lines of logic. No Python. No Supabase. No aura-gateway dependency. No paging infrastructure that fails silently.

Why independent matters: ELP-1's L5-A finding (Layer 2) identified that if supervisor.py crashes before it can write the heartbeat, the escalation layer inside supervisor.py never fires. The monitoring system for ELP crashes along with ELP. deadman.sh kills this blind spot because it is a separate process launched by a separate cron entry, with no shared code path with dispatcher.py.

Cron entry (installed during bootstrap):

*/15 * * * * [home]/.claude/tools/elp2/deadman.sh

Alert visibility: The alert is written to `Desktop/ELP2-STALLED.txt`. It is visible on the Desktop when Mohamed opens his Mac, and because `Desktop/` is Syncthing-shared it also appears on his other machines. In v1.1, a stall also triggers a telegram message. In v1.0, the Desktop file is sufficient.

---

§4.4 scoreboard.json (state plane)

Location: `[home-path]`

Truth hierarchy:
- `scoreboard.json` is the ELP-2 truth. It is derived from filesystem state (SKILL.md files), not from a DB.
- If `scoreboard.json` is deleted, dispatcher.py re-generates it from scratch by running scoreboard.py. Recovery is one dispatcher cycle (5 minutes).
- If the entire `[home-path]` directory is deleted, dispatcher.py creates it on the next run. Nothing is permanently lost because the canonical state is the SKILL.md files themselves.

No Supabase in v1.0. This is an intentional deferral, not an oversight. The filesystem IS the database for SOOP-2 because the work product (typed SKILL.md frontmatter) lives on the filesystem. Supabase would add cross-mesh durability for a project that runs on one machine 95

Write ownership:
- `scoreboard.json`: written by scoreboard.py (called by dispatcher.py). Only one writer.
- `heartbeat`: written by dispatcher.py. Only one writer.
- `dispatch-log.jsonl`: written by dispatcher.py in append mode. Only one writer.

Single-writer per file. No concurrent write contention. No need for `FOR UPDATE SKIP LOCKED`.
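
As an illustration of the append-only log under a single writer, the sketch below writes one JSON object per line and reads back the most recent entries. The function names and the `ts`, `mode`, and `seq` fields are assumptions made for this example; the spec does not define the log's field layout.

```python
import datetime
import json
import pathlib


def append_dispatch_log(path: pathlib.Path, entry: dict) -> None:
    """Append one JSON object per line; earlier lines are never rewritten."""
    record = dict(entry)
    # Assumption: entries carry a UTC timestamp under the key "ts".
    record.setdefault(
        "ts", datetime.datetime.now(datetime.timezone.utc).isoformat()
    )
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def tail_dispatch_log(path: pathlib.Path, n: int = 10) -> list:
    """Return the last n entries, e.g. for a recent-dispatches list."""
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines[-n:] if line.strip()]
```

A line-per-record format means a crash mid-write can damage at most the final line, and readers never need to parse the whole file as one document.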

---

§4.5 elp2-dashboard.html

Location: `Desktop/elp2-dashboard.html`

Generated by: dispatcher.py at the end of each cycle. Not a separate daemon.

Contents:
- Current scoreboard (criteria table, typed_count progress bar)
- Last 10 dispatch log entries
- Heartbeat timestamp and staleness indicator
- Gate status (if a Mohamed-gated batch is waiting, this is the call to action)
- One-line failover instructions for mac2

Syncthing: `Desktop/` is Syncthing-shared (verified: it is in the Syncthing config). Mohamed can read the dashboard from any mesh node's browser by opening `file://[home]/Desktop/elp2-dashboard.html` or from a shared folder on another machine.

Resolution to L4-2 (CRITICAL from Layer 1): ELP-1's dashboard depended on a Grafana instance that required Supabase direct Postgres access (L4-5 MEDIUM) and a Syncthing share of `[home-path]` that was unprovisioned (L4-2 CRITICAL). ELP-2 uses `Desktop/` which is already Syncthing-provisioned. The dashboard is a static HTML file that reads no remote data — it is pre-rendered each cycle by dispatcher.py from local state.
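
A pre-rendered dashboard of this kind could be produced roughly as follows. This is a sketch under stated assumptions: the function name `render_dashboard`, its parameters, and the HTML layout are illustrative, and the 30-minute staleness threshold is borrowed from deadman.sh.

```python
import html


def render_dashboard(scoreboard: dict, heartbeat_age_s: float,
                     stale_after_s: float = 30 * 60) -> str:
    """Render a self-contained HTML page from already-loaded local state."""
    rows = "\n".join(
        f"<tr><td>{html.escape(name)}</td>"
        f"<td>{html.escape(str(info.get('status', 'unknown')))}</td></tr>"
        for name, info in scoreboard["criteria"].items()
    )
    # Staleness indicator: mirrors the 30-minute threshold used by deadman.sh.
    state = "STALE" if heartbeat_age_s > stale_after_s else "fresh"
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<title>ELP-2 dashboard</title></head><body>\n"
        f"<p>typed: {scoreboard['typed_count']}/{scoreboard['total_skills']}</p>\n"
        f"<p>heartbeat: {state} ({int(heartbeat_age_s)}s old)</p>\n"
        f"<table>\n{rows}\n</table>\n"
        "</body></html>\n"
    )
```

The page embeds every value at render time, so a browser on any machine that receives the file through Syncthing shows the state as of the last dispatcher cycle without contacting mac1.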

---

§5. Dispatch Model

Three dispatch modes, one per batch kind. The selector lives in `dispatcher.py`'s priority table (see §6).

Mode A: Inline

Used for: trivial work that completes in milliseconds and has no Claude session dependency. Examples: read a file to check its state, count typed files, write a small JSON override.

python

# Inline dispatch example — no Pulse, no aura-gateway
def dispatch_inline(batch: dict) -> dict:
    result = batch["fn"](*batch["args"])
    return {"mode": "inline", "result": result}

Mode B: Pulse-dispatched

Used for: bounded work that fits Pulse's session model. Examples: type the next 10 SKILL.md files, run the Tier 1 recall benchmark, wire a feedback component.

The /inject format fix from L4-1 CRITICAL:

ELP-1's spec stated the inject payload as `{machine, tmux_target, prompt}`. This was wrong. The gateway's actual `InjectRequest` (gateway.py lines 1875-1879) accepts:

python

class InjectRequest(BaseModel):
    prompt: str       # required
    tty: Optional[str]   # required when machine == "mac1"; ignored for remote
    machine: str      # default "mac1"; pattern ^(mac[1-5]|cloud-vm)$
    tmux_target: Optional[str]  # defaults to "claude" for remote machines

For mac1 dispatches, `tty` must match `^(/dev/)?ttys\d{1,4}$`. Without it, the gateway returns HTTP 422. ELP-1's primary worker (mac1) would have failed every single injection. This was a CRITICAL finding that ELP-2 resolves not by patching ELP-1 but by not using raw aura-gateway /inject at all for the primary dispatch path.

ELP-2 dispatches via Pulse MCP calls:

python

from mcp__pulse_local import pulse_start

def dispatch_pulse(batch: dict) -> dict:
    result = pulse_start(
        projectName="soop-2",
        projectPath="[home]",
        goal=batch["prompt_template"].format(**batch["vars"]),
        maxIterations=batch.get("max_iterations", 5)
    )
    return {"mode": "pulse", "spawn_id": result["sessionId"]}

Pulse's own MCP interface handles the routing decision. It already has rate-limit awareness, retry logic, and session lifecycle management. ELP-2 does not reimplement any of that.

When direct /inject is needed (e.g., for a mac4-specific dispatch that Pulse cannot route): dispatcher.py reads the current tty from the pane registry before injecting.

python

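import json, os, pathlib, requests

# The launchd plist in §11 sets ELP2_GATEWAY_TOKEN in the dispatcher's environment.
GATEWAY_TOKEN = os.environ.get("ELP2_GATEWAY_TOKEN", "")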

def _get_mac1_tty() -> str:
    # gateway.py reads pane_registry.json directly for mac1.
    # We mirror that by reading it here.
    registry = pathlib.Path.home() / ".claude/state/agent_pane_registry.json"
    data = json.loads(registry.read_text())
    # Return first active tty from mac1 panes
    for tty, info in data.get("panes", {}).items():
        if info.get("machine_id") == "mac1" and not info.get("is_ghost"):
            return tty
    raise RuntimeError("no active mac1 pane found in registry")

def dispatch_inject_mac1(prompt: str, tmux_target: str = None) -> dict:
    # tty is required for mac1. gateway.py line 1877:
    # "Required when machine == mac1; ignored for remote machines."
    tty = _get_mac1_tty()
    payload = {
        "prompt": prompt,
        "machine": "mac1",
        "tty": tty,
    }
    if tmux_target:
        payload["tmux_target"] = tmux_target
    r = requests.post(
        "http://localhost:8095/inject",
        json=payload,
        headers={"Authorization": f"Bearer {GATEWAY_TOKEN}"},
        timeout=10,
    )
    r.raise_for_status()
    return r.json()

The tty field is not hardcoded. It is read at dispatch time from the pane registry, which is always current. This is the correct fix for L4-1 CRITICAL.

Mode C: Mohamed-gated

Used for: any batch that depends on Mohamed's judgment or sign-off before the next batch can proceed. Examples: REVIEW-4 (recall quality judgment), A.1 spec sign-off, I.4 audit authoring.

This resolves Layer 2's HIGH finding about review gates. ELP-1 had no mechanism to pause dispatch at a review gate. Its batch generation logic would queue post-gate work before Mohamed signed off on the gate. ELP-2 makes gates explicit in the priority table. When the selector picks a Mohamed-gated batch, dispatcher.py updates the dashboard with "waiting for Mohamed: REVIEW-4" and exits. It does NOT queue the next batch. The only way to unblock is to set the gate's entry to `true` under `gates` in `criteria_overrides.json` (for example, `{"gates": {"REVIEW-4": true}}`).

python

def dispatch_gate(batch: dict, scoreboard_path: pathlib.Path) -> dict:
    # Write to dashboard, do not dispatch further work.
    update_dashboard_gate_status(batch["gate_name"], batch["description"])
    # Touch heartbeat so deadman.sh sees a live system.
    (pathlib.Path.home() / ".claude/state/elp2/heartbeat").touch()
    return {"mode": "gate", "gate_name": batch["gate_name"], "waiting": True}

---

§6. Priority Table (batch kinds)

The priority table is a Python list in dispatcher.py ordered from highest to lowest priority. The selector picks the first entry whose `ready` predicate returns True given the current scoreboard.

python

BATCH_PRIORITY = [
    # Criterion 9 (mac3 purge) and 7 (contrarian) are already done.
    # Criterion 2 (linter fast) is already done.
    # Their entries are omitted here; if listed, their ready predicate would always return False.

    {
        "name": "silent_capable",
        "criteria": "8_silent_capable",
        "ready": lambda s: s["criteria"]["8_silent_capable"]["current"] < 5,
        "kind": "pulse",
        "max_iterations": 3,
        "prompt_template": "Add silent_capable: true to these 2 skills: {skill_list}. "
                           "Follow the frontmatter format in SKILL.md. Run the linter after.",
        "vars_fn": lambda s: {"skill_list": next_silent_candidates(s, n=2)},
        "parallelizable": False,
    },
    {
        "name": "mass_typing_batch",
        "criteria": "1_all_typed",
        "ready": lambda s: s["typed_count"] < s["total_skills"],
        "kind": "pulse",
        "max_iterations": 8,
        "prompt_template": "Type the frontmatter for these {n} SKILL.md files: {skill_list}. "
                           "Use the canonical format. Run the linter after each file.",
        "vars_fn": lambda s: next_typing_batch(s, n=10),
        "parallelizable": True,  # can dispatch 3 parallel Pulse sessions for different skill slices
    },
    {
        "name": "tier1_recall_bench",
        "criteria": "3_tier1_recall",
        "ready": lambda s: s["criteria"]["3_tier1_recall"]["status"] == "pending"
                           and gate_approved("REVIEW-4"),
        "kind": "pulse",
        "max_iterations": 5,
        "prompt_template": "Run the Tier 1 recall benchmark at [home-path] "
                           "Report typed_count, recall rate, and whether it exceeds 80%.",
        "vars_fn": lambda s: {},
        "parallelizable": False,
    },
    {
        "name": "review_4_gate",
        "criteria": "3_tier1_recall",
        "ready": lambda s: s["criteria"]["3_tier1_recall"]["status"] == "pending"
                           and not gate_approved("REVIEW-4"),
        "kind": "gate",
        "gate_name": "REVIEW-4",
        "description": "Mohamed: review the current recall benchmark quality before dispatching Track C.",
        "parallelizable": False,
    },
    {
        "name": "feedback_wiring",
        "criteria": "6_feedback_loop",
        "ready": lambda s: s["criteria"]["6_feedback_loop"]["status"] == "pending",
        "kind": "pulse",
        "max_iterations": 10,
        "prompt_template": "Wire the feedback components E.1-E.4 at [home-path] "
                           "Follow the EXECUTION-PLAN track E specification.",
        "vars_fn": lambda s: {},
        "parallelizable": False,
    },
    # ... additional criteria entries follow the same pattern
]
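
The selection rule described at the top of this section (first entry whose `ready` predicate returns True) can be sketched as below. The function name `select_batch` and the trimmed demo table are assumptions for illustration; only the `ready` convention and the two predicates come from the table above.

```python
from typing import Optional


def select_batch(priority: list, scoreboard: dict) -> Optional[dict]:
    """Return the first entry whose ready predicate is true, or None."""
    for entry in priority:
        if entry["ready"](scoreboard):
            return entry
    return None


# Trimmed-down stand-in for BATCH_PRIORITY, for illustration only.
demo_priority = [
    {"name": "silent_capable", "kind": "pulse",
     "ready": lambda s: s["criteria"]["8_silent_capable"]["current"] < 5},
    {"name": "mass_typing_batch", "kind": "pulse",
     "ready": lambda s: s["typed_count"] < s["total_skills"]},
]

demo_scoreboard = {
    "typed_count": 14,
    "total_skills": 295,
    "criteria": {"8_silent_capable": {"current": 5, "target": 5}},
}

chosen = select_batch(demo_priority, demo_scoreboard)
print(chosen["name"])  # mass_typing_batch
```

Returning `None` when nothing is ready gives the dispatcher an explicit "no work this cycle" case, which it can log before touching the heartbeat.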

Gate approval mechanism:

python

# criteria_overrides.json format:
# {"gates": {"REVIEW-4": true, "A.1": false}}

def gate_approved(gate_name: str) -> bool:
    overrides_path = pathlib.Path.home() / ".claude/state/elp2/criteria_overrides.json"
    if not overrides_path.exists():
        return False
    data = json.loads(overrides_path.read_text())
    return data.get("gates", {}).get(gate_name, False)

Mohamed sets the gate's entry to `true` under `gates` in the overrides file (or dispatcher.py generates a button in the dashboard HTML that does it via a small local handler). The next dispatcher cycle sees the gate as approved and dispatches the work.
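
For the approval side, a small helper could flip a gate in the overrides file using the same tmp-plus-`os.replace` pattern as the scoreboard. This is a sketch only: `approve_gate` is a hypothetical name, and it assumes the `{"gates": {...}}` layout shown in the comment above.

```python
import json
import os
import pathlib


def approve_gate(overrides_path: pathlib.Path, gate_name: str) -> None:
    """Set gates[gate_name] = true, preserving any other entries."""
    if overrides_path.exists():
        data = json.loads(overrides_path.read_text())
    else:
        data = {}
    data.setdefault("gates", {})[gate_name] = True
    # Write to a sibling temp file, then swap it in atomically.
    tmp = overrides_path.with_suffix(".tmp")
    tmp.write_text(json.dumps(data, indent=2))
    os.replace(tmp, overrides_path)
```

Calling `approve_gate(path, "REVIEW-4")` leaves other gates untouched, so approving one review does not silently approve another.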

---

§7. Failure Resilience (honest, scoped)

What ELP-2 survives

Claude Code closure: dispatcher.py runs as a launchd plist, outside Claude Code. Closing Claude Code has no effect on the loop. The Pulse sessions it spawned continue to run inside their own Claude Code instances. This kills ELP-1 caveat C2.

Dispatcher crash: deadman.sh runs every 15 minutes from cron. If dispatcher.py crashes and does not touch the heartbeat for 30 minutes, deadman.sh writes `Desktop/ELP2-STALLED.txt`. The next Mohamed session can restart the plist with `launchctl kickstart gui/$(id -u)/com.diomande.elp2-dispatcher`. Because the plist uses `StartInterval`, launchd also launches dispatcher.py again at the next 300-second interval after a crashed run, without manual intervention.

Bad Pulse session: Pulse already handles retry and session boundary detection. If a spawned session hits an error, Pulse emits BLOCKED and the next dispatcher cycle either retries or escalates to the dashboard.

Scoreboard file deleted: dispatcher.py re-generates it on the next run by calling scoreboard.py, which reads from SKILL.md files (the canonical state). Recovery: one cycle (5 minutes).

aura-gateway /inject format changes: ELP-2 uses Pulse as the primary dispatch path. If aura-gateway's InjectRequest schema changes, the fallback direct-inject path breaks, but Pulse-dispatched batches are unaffected. The coupling between ELP-2 and aura-gateway is narrow.

What ELP-2 does not survive

mac1 power loss: There is no mac2 automatic observer-mode in v1.0. The loop stops. Recovery: run `[home-path]` from any machine (see the failover script in §11). The procedure is manual but takes about 60 seconds. This is the honest scope. For a 13-day project, mac1 power loss is recoverable in under 2 minutes. Automatic observer-mode is the right upgrade for SOOP-3 when the horizon is 60+ days.

Mass file corruption of SKILL.md files: ELP-2 has no DB backup of the skill content itself (ELP-1 had Supabase journaling but the filesystem mirror was unprovisioned anyway). The mitigation is git — the skill-typecheck repo is a git repo, and the SKILL.md files are committed after each typing batch by the Pulse session. Git is the journal. `git revert` is the rollback mechanism. This is more honest than ELP-1's claimed "atomic + journaled" rollback that had no specified journal format or recovery command.

Mohamed offline for 5+ days during a Mohamed-gated track: The loop waits at the gate. The dashboard shows the wait status. No automated escalation in v1.0. If Mohamed is offline that long during SOOP-2, the project timeline shifts regardless of what the loop does.

Simultaneous Pulse outage and mac1 launchd failure: At that point the whole environment is broken and nothing is running anything. This is not a meaningful failure mode to defend against in v1.0.

---

§8. The /inject Format Fix (resolving L4-1 CRITICAL)

Layer 1 finding L4-1 was the most concrete CRITICAL finding across all three layers: ELP-1's spec called the inject payload `{machine, tmux_target, prompt}`, but the actual `InjectRequest` model (gateway.py line 1875) requires `tty` when `machine == "mac1"` and rejects the call with HTTP 422 without it. The primary worker on the primary machine could never self-dispatch.

ELP-2's resolution:

1. Pulse is the primary dispatch path. Pulse's own MCP interface (`pulse_start`) abstracts over routing. ELP-2 does not call `/inject` directly for normal Pulse-dispatched batches.

2. For the narrow case where direct /inject is needed (mac4-specific work where Pulse's routing is not granular enough), dispatcher.py reads the current tty from `[home-path]` at dispatch time and includes it in the payload. The tty is never hardcoded. See §5 Mode B for the full code sample.

3. The exact InjectRequest schema is documented in a comment in dispatcher.py pointing to gateway.py line 1875. When the gateway schema changes, the comment goes stale (visible in code review); the behavior also fails fast with HTTP 422 (immediately visible). There is no silent failure mode.

---

§9. The Claim TTL Fix (resolving L5-1 CRITICAL)

ELP-1's most dangerous correctness bug was L5-1: a 5-minute claim TTL could release a batch to a second worker while the first worker was still writing to the same SKILL.md files. Both workers write concurrently. The files end up in an undefined state.

ELP-2's resolution is architectural, not a parameter change. There are no claims in ELP-2. Pulse sessions do not share files while running. A Pulse session for "type these 10 skills" owns those 10 files for the duration of the session. When the session completes (or hits BLOCKED), the next dispatcher cycle scans the filesystem and picks the next 10 untyped skills. If a Pulse session crashes mid-write, the next cycle sees a partially typed file, runs the linter (which will fail on it), and either retries that file or skips it based on the linter exit code.

The race condition L5-1 described simply does not exist when the worker model is one-session-at-a-time Pulse spawns with filesystem-canonical state.

---

§10. The Syncthing Fix (resolving L4-2 CRITICAL)

ELP-1 described Syncthing replication of `[home-path]` as a fallback mechanism. This path was not in the Syncthing config on mac1 (15 folder entries, none matching `.claude/state`). The fallback was unprovisioned.

ELP-2 makes Syncthing replication of `[home-path]` optional for resilience and uses `Desktop/` (already Syncthing-provisioned) for the primary human-visible surface (the dashboard HTML).

One-time Syncthing setup for `[home-path]`:

This step is recommended but not required for v1.0 operation. Without it, the scoreboard and heartbeat are mac1-local. With it, Mohamed can check ELP-2 status from any machine by reading the Syncthing-replicated scoreboard.json.

bash

# Run once on mac1. Mohamed must confirm the folder share on each target machine.
# Syncthing GUI is at http://localhost:8384

FOLDER_ID="elp2-state-$(hostname | md5 | cut -c1-8)"
FOLDER_PATH="$HOME/.claude/state/elp2"
mkdir -p "$FOLDER_PATH"

# Add via Syncthing CLI (syncthing cli config folders add)
# Or: open http://localhost:8384, click Add Folder, set:
#   Folder ID: $FOLDER_ID
#   Folder Path: $FOLDER_PATH
#   Share with: mac2, mac4, mac5 (whichever are online)

# Exclude the snapshots directory from Syncthing (local-only backup)
echo "snapshots" >> "$FOLDER_PATH/.stignore"

If this setup is not done, ELP-2 still works correctly. Mohamed reads the dashboard from mac1's Desktop or via Tailscale browser. The only thing lost is the convenience of seeing the scoreboard from other machines without Tailscale.

---

§11. Bootstrap Procedure

The full bootstrap is one bash script that takes under 5 minutes.

bash

#!/usr/bin/env bash
# bootstrap-elp2.sh — run once on mac1

set -euo pipefail

ELP2_TOOLS="$HOME/.claude/tools/elp2"
ELP2_STATE="$HOME/.claude/state/elp2"
PLIST_LABEL="com.diomande.elp2-dispatcher"
DEADMAN_CRON="*/15 * * * * $ELP2_TOOLS/deadman.sh"

echo "[1/7] Creating state directory..."
mkdir -p "$ELP2_STATE/snapshots"

echo "[2/7] Seeding scoreboard.json from current SOOP-2 memory..."
# Run scoreboard.py directly to generate the initial scoreboard
python3 "$ELP2_TOOLS/scoreboard.py" --write "$ELP2_STATE/scoreboard.json"

echo "[3/7] Writing initial heartbeat..."
touch "$ELP2_STATE/heartbeat"

echo "[4/7] Installing launchd plist for dispatcher.py..."
PLIST_PATH="$HOME/Library/LaunchAgents/$PLIST_LABEL.plist"
cat > "$PLIST_PATH" << PLIST
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>           <string>$PLIST_LABEL</string>
  <key>ProgramArguments</key>
  <array>
    <string>/opt/homebrew/bin/python3</string>
    <string>$ELP2_TOOLS/dispatcher.py</string>
  </array>
  <key>StartInterval</key>   <integer>300</integer>
  <key>RunAtLoad</key>       <true/>
  <key>StandardOutPath</key> <string>/tmp/elp2-dispatcher.log</string>
  <key>StandardErrorPath</key><string>/tmp/elp2-dispatcher.err</string>
  <key>EnvironmentVariables</key>
  <dict>
    <key>ELP2_GATEWAY_TOKEN</key>
    <string>claw-glasses-2026</string>
  </dict>
</dict>
</plist>
PLIST
# Note: dispatcher.py reads the token from the ELP2_GATEWAY_TOKEN environment variable.
# The value written above is the default public token for the home mesh, so it is safe as a literal.
# If a secret token is in use, remove that entry from the plist and run:
#   launchctl setenv ELP2_GATEWAY_TOKEN "$(cat [home-path])"
# before loading the plist.

launchctl load "$PLIST_PATH"
echo "   Plist loaded. Check /tmp/elp2-dispatcher.err for startup errors."

echo "[5/7] Installing deadman.sh cron entry..."
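# Replace any existing deadman.sh entry. The "|| true" keeps an empty crontab
# (crontab -l or grep exiting non-zero) from aborting under "set -euo pipefail".
{ crontab -l 2>/dev/null | grep -v "deadman.sh" || true; echo "$DEADMAN_CRON"; } | crontab -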
echo "   Cron entry installed."

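echo "[6/7] Verifying first heartbeat (waiting up to 6 min for dispatcher to run)..."
# The heartbeat was touched in step 3, so a plain age check would pass immediately.
# Compare against the plist file (written in step 4, after that touch) to confirm
# that dispatcher.py itself has refreshed the heartbeat.
FRESH=0
for i in $(seq 1 12); do
    sleep 30
    FRESH=$(find "$ELP2_STATE/heartbeat" -newer "$PLIST_PATH" 2>/dev/null | wc -l)
    if [[ "$FRESH" -gt 0 ]]; then
        echo "   Heartbeat confirmed fresh (cycle ${i})."
        break
    fi
    echo "   Waiting... (${i}/12)"
done

if [[ "$FRESH" -eq 0 ]]; then
    echo "   WARNING: heartbeat not refreshed after 6 min. Check /tmp/elp2-dispatcher.err"
    exit 1
fi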

echo "[7/7] Bootstrap complete. Dashboard at Desktop/elp2-dashboard.html"

Python path note: The plist uses `/opt/homebrew/bin/python3`. This is the correct path for Homebrew Python on Apple Silicon. ELP-1's plist specified `/usr/local/bin/python3` (L2-2 HIGH). The difference matters because launchd starts jobs with a minimal PATH that does not include Homebrew's directories, so the interpreter must be given as an absolute path.

Credential handling: dispatcher.py reads the gateway token from the `ELP2_GATEWAY_TOKEN` environment variable. The plist's `EnvironmentVariables` entry carries only the default public mesh token. A secret token should not be written into the plist XML; set it with `launchctl setenv` before loading the plist instead. ELP-1 specified plaintext secrets in plist XML (L2-3 HIGH).

Failover to mac2:

bash

# failover-to-mac2.sh — run from any machine if mac1 is down
# Copies state from Syncthing-replicated path and starts plist on mac2

ssh mac2 'mkdir -p [home-path]'
ssh mac2 'cp Desktop/elp2-dashboard.html [home-path] 2>/dev/null || true'
ssh mac2 "cp $HOME/Library/LaunchAgents/com.diomande.elp2-dispatcher.plist [home-path]"
ssh mac2 'launchctl load [home-path]'
echo "ELP-2 now running on mac2. Monitor at /tmp/elp2-dispatcher.err on mac2."

This is the entire mac1-offline recovery procedure. It is not automatic. It runs in about 60 seconds.

---

§12. Acceptance Criteria (ELP-2's own)

ELP-1 had 10 acceptance criteria. Layer 1's Lens 6 analysis showed 8 of those 10 could pass with a silently broken implementation (the AC measured file existence, not correctness). ELP-2 has 8 criteria that require observable output.

| # | Criterion | How to verify |
|---|---|---|
| 1 | dispatcher.py exists at `[home-path]` and launchd plist is loaded | `launchctl list \| grep elp2` shows the label |
| 2 | scoreboard.py correctly identifies `typed_count`, `silent_capable_count`, and `criteria_passed` from filesystem state in under 2s | `time python3 scoreboard.py` prints the expected values against a known state |
| 3 | deadman.sh runs INDEPENDENT of dispatcher.py and writes `Desktop/ELP2-STALLED.txt` when heartbeat is >30 min stale | `launchctl unload dispatcher.plist && sleep 2100 && bash deadman.sh && ls Desktop/ELP2-STALLED.txt` (2100 s = 35 min) |
| 4 | elp2-dashboard.html exists, is non-empty, and contains the current typed_count after a dispatcher cycle | read elp2-dashboard.html and grep for the typed_count value |
| 5 | Closing Claude Code for 30 min does NOT cause typed_count to stop advancing | close Claude Code, wait 30 min, open new session, run scoreboard.py, confirm typed_count increased |
| 6 | A simulated dispatcher crash (`kill -9`) is detected by deadman.sh within one cron cycle (15 min) of the heartbeat going 30 min stale | kill -9 the dispatcher process, wait 45 min, confirm ELP2-STALLED.txt exists |
| 7 | When all 10 SOOP-2 criteria flip to true in scoreboard.json, dispatcher.py writes a completion memory file and the launchd plist stops reloading | manually set all criteria to "pass" in scoreboard.json, wait one cycle, confirm memory file written and plist no longer running |
| 8 | ELP-2 itself ships in under 6 hours of work | track wall-clock time from bootstrap start to AC 1-7 all green |

Note: AC 8 is self-referential. It exists because Layer 4 correctly identified that the build cost of ELP is itself a project risk. If building ELP-2 takes more than 6 hours, the build cost has exceeded the benefit for a 13-day convergence project. At that point, the fallback is to run Pulse sessions manually from Mohamed's main session, which remains available regardless of ELP-2 status.
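The staleness check that AC 3 and AC 6 exercise can be sketched as follows. deadman.sh itself is a shell script; this Python version only restates the logic in a form that is easy to test. The function name `check_heartbeat` and the file paths passed to it are hypothetical, and treating a missing heartbeat file as a stall is an assumption. The 30-minute threshold comes from AC 3.

```python
import time
from pathlib import Path

STALE_AFTER_SECONDS = 30 * 60  # AC 3: a heartbeat older than 30 minutes counts as stalled


def check_heartbeat(heartbeat, marker, now=None):
    """Write the stall marker and return True if the heartbeat is missing or stale."""
    now = time.time() if now is None else now
    try:
        age = now - Path(heartbeat).stat().st_mtime
    except FileNotFoundError:
        age = float("inf")  # assumption: no heartbeat file at all is treated as a stall
    if age > STALE_AFTER_SECONDS:
        Path(marker).write_text(f"ELP-2 heartbeat stale: age={age:.0f}s\n")
        return True
    return False
```

Because the check reads only the heartbeat file's modification time, it shares no code or state with dispatcher.py, which is the independence property AC 3 asks for.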

---

§13. Comparison Table: ELP-1 vs ELP-2

| Dimension | ELP-1 | ELP-2 | Verdict |
|---|---|---|---|
| File count | ~15 (supervisor.py, worker.py, verifier.py, reconciler.py, schema.sql, 2 plists, 2 cron jobs, Grafana panel defs, telegram config, +more) | 5 (scoreboard.py, dispatcher.py, deadman.sh, scoreboard.json, dashboard.html) | ELP-2 |
| Build cost | 9.5h realistic (vs "1-2 sessions" claimed in spec) | ~6h including bootstrap and smoke test | ELP-2 |
| Worker model | bespoke: soop2_workers table, claim TTL, FOR UPDATE SKIP LOCKED, worker registration | Pulse spawn calls | ELP-2 (Pulse already exists) |
| State plane | Supabase primary + filesystem fallback (fallback physically unprovisioned) | filesystem only, Syncthing-optional | ELP-2 |
| CRITICAL bugs at design time | 5 (split-brain, Syncthing unprovisioned, claim TTL race, /inject format mismatch, PostgREST SKIP LOCKED unsupported) | 0 | ELP-2 |
| Escalation | Telegram + Twilio (require direct REST from daemon; not callable as Claude skills from outside Claude) | Desktop file (visible) + v1.1 telegram | ELP-1 is better in principle; ELP-2 is better in practice because it actually works |
| Mac2 failover | Automatic observer-mode promotion after 30 min heartbeat gap (dual-primary risk L1-1, L5-2) | Manual failover script, ~60 seconds | ELP-2 (honest about scope) |
| Observability | Grafana on cloud-vm (requires Supabase direct Postgres, L4-5) | Static HTML at [home-path] | ELP-2 for v1.0; ELP-1 is better long-term |
| Reuse of existing infrastructure | None (Pulse exists; ELP-1 does not use it) | Pulse for worker dispatch | ELP-2 |
| Upgrade path to multi-machine | Built-in (observer-mode, Supabase cross-mesh) but broken at v1.0 | Explicit v1.1 upgrade path (see §14) | Tie |
| Code someone can read cold in 30 minutes | No (scattered across 15 files, 4 tables, Grafana) | Yes (5 files, 1 JSON) | ELP-2 |

---

§14. Upgrade Path to ELP-3

ELP-2 is scoped for SOOP-2: one machine (mac1 primary), 13 days, 295 skills, 10 criteria. When SOOP-3 or a similar 50-day multi-mesh project arrives, the upgrade path is explicit.

v1.1 (add when SOOP-2 closes, before SOOP-3 starts)

- Add Supabase state plane (`scoreboard` table, `dispatch_log` table). Filesystem scoreboard.json becomes the write-ahead buffer; Supabase is the durable truth.
- Add automatic mac2 failover. The observer-mode logic from ELP-1 §4.4 is the right design; implement it now that the dual-primary bug (L1-1, L5-2) has been acknowledged. The fix is a `supervisor_generation` column and a CAS update: `UPDATE soop2_state SET supervisor_id = 'mac2' WHERE supervisor_id = 'mac1' AND supervisor_generation = $expected`.
- Add telegram escalation. The implementation is a direct Telegram Bot API call in deadman.sh (a shell script; `curl` is all that is needed). Not a Claude skill wrapper. Not a Python library that assumes Claude runtime.
- Add Grafana panels if Supabase is now the state plane (correct dependency order; ELP-1 had Grafana before Supabase).
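The CAS update proposed for automatic failover can be illustrated with a small, self-contained sketch. It uses Python's built-in `sqlite3` as a stand-in for the Supabase Postgres state plane, so it shows only the compare-and-swap logic, not the real client. The helper name `try_promote` is hypothetical, and incrementing `supervisor_generation` inside the same statement is an assumption added here so that a stale observer's later attempt cannot match.

```python
import sqlite3


def try_promote(conn, new_id, expected_id, expected_generation):
    """Compare-and-swap: take over only if owner and generation are unchanged."""
    cur = conn.execute(
        "UPDATE soop2_state "
        "SET supervisor_id = ?, supervisor_generation = supervisor_generation + 1 "
        "WHERE supervisor_id = ? AND supervisor_generation = ?",
        (new_id, expected_id, expected_generation),
    )
    conn.commit()
    return cur.rowcount == 1  # exactly one row changed means this caller won


# In-memory stand-in for the state table (illustrative schema, not the real one).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE soop2_state (supervisor_id TEXT, supervisor_generation INTEGER)")
conn.execute("INSERT INTO soop2_state VALUES ('mac1', 7)")

first = try_promote(conn, "mac2", "mac1", 7)   # observer saw (mac1, 7) and promotes itself
second = try_promote(conn, "mac2", "mac1", 7)  # a replay of the same observation no longer matches
```

The caller decides whether it is primary from the affected-row count alone, so two observers acting on the same stale read cannot both win.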

v2.0 (if SOOP-3 demands true multi-machine parallelism)

- Add Supabase queue table with proper claim semantics (the `soop2_queue` table from ELP-1 §3.2, but with correct PostgREST RPC wrappers for `FOR UPDATE SKIP LOCKED`, not raw SQL as ELP-1 assumed).
- Add quorum if multiple supervisors are truly needed.
- Keep Pulse as the worker model. Do not build a bespoke worker pool.

What ELP-2 explicitly forbids (the things that killed ELP-1):

1. Custom worker pools. Use Pulse.
2. Custom queue tables in v1.0. Pulse has work-item primitives.
3. Mac3 references. Mac3 is deprecated per SOOP-2 Track H.
4. Telegram in v1.0. The claim "telegram skill is callable from supervisor daemon" is architecturally wrong (L4-6). Telegram goes in v1.1 with a direct curl call.
5. Escalation logic embedded inside dispatcher.py. deadman.sh is INDEPENDENT. If dispatcher.py has a bug, deadman.sh still fires.
6. Hardcoded TTY values in /inject calls. TTY must be read from the pane registry at dispatch time.
7. "Soft failover" language for dual-primary scenarios. Call it what it is: a dual-primary race. Design around it with CAS writes or single-machine scope.

---

§15. Layer 3 Status

Layer 3 (Codex adversarial review) did not exist on the filesystem when this document was written. The file `06c-layer3-codex-adversarial.md` returned a file-not-found error.

ELP-2 will be revised if Layer 3 surfaces findings incompatible with the design choices here. The most likely Layer 3 targets, given the pattern of CRITICAL findings in Layers 1 and 2, are:

- The Pulse session-length envelope. If Pulse cannot handle a 2-hour batch of 30 skill typings (Layer 4 Risk 1), ELP-2 needs a per-batch-kind fallback to direct aura-gateway /inject with the correct tty handling from §5 Mode B.
- The scoreboard.json single-writer assumption. If anything other than dispatcher.py touches scoreboard.json (e.g., a manual override script that does not use the atomic write pattern), there is a corruption window. The fix is a file lock, not a structural change.
- The Pulse MCP interface availability from a launchd context. Pulse uses `mcp__pulse-local__pulse_start` which requires an MCP server running alongside the Claude session. A launchd daemon outside Claude does not have MCP access. This is a real concern: if Pulse can only be called from inside a Claude Code session, then ELP-2's primary dispatch path requires a running Claude session, which re-creates caveat C2.

This last point is the most significant potential flaw in ELP-2 as designed. The resolution, if Layer 3 confirms it: fall back to calling `pulse_start` via the Pulse orchestrator HTTP API directly (`https://pulse-orchestrator-274020562532.us-central1.run.app`), not the MCP interface. The Pulse SKILL.md shows the orchestrator URL. A direct HTTP call from a Python daemon does not require Claude runtime. This fallback should be verified before implementation begins.
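For the second concern listed above (the scoreboard.json single-writer assumption), a file lock combined with an atomic replace could look like the sketch below. It is a minimal illustration under stated assumptions, not dispatcher.py's actual write path: the function name `write_scoreboard` and the `.lock` sidecar file are hypothetical, and `fcntl.flock` is advisory, so it only protects against other writers that take the same lock.

```python
import fcntl
import json
import os
import tempfile
from pathlib import Path


def write_scoreboard(path, data):
    """Serialize cooperating writers with an advisory lock, then replace atomically."""
    path = Path(path)
    lock_path = path.with_name(path.name + ".lock")  # hypothetical sidecar lock file
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until other cooperating writers finish
        fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(data, f, indent=2)
                f.flush()
                os.fsync(f.fileno())  # make sure the bytes are on disk before the rename
            os.replace(tmp, path)  # atomic when tmp and path are on the same filesystem
        except BaseException:
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise
    # Closing the lock file at the end of the with-block releases the lock.
```

Readers never observe a half-written file because the target only changes through the rename; the lock matters only when a second writer, such as a manual override script, is introduced.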

---

§16. Build Order

| Step | Component | Effort | Unlocks |
|---|---|---|---|
| 1 | scoreboard.py with write() and all 10 criterion checks | 45 min | everything |
| 2 | dispatcher.py skeleton: cycle loop, priority table, gate logic | 90 min | launchd install |
| 3 | dispatch_pulse() with Pulse orchestrator HTTP call (not MCP) | 30 min | Pulse batches |
| 4 | dispatch_inject_mac1() with pane registry tty read | 20 min | direct /inject fallback |
| 5 | deadman.sh (3 lines) | 5 min | independent escalation |
| 6 | launchd plist + bootstrap-elp2.sh | 30 min | production operation |
| 7 | dashboard HTML generation in dispatcher.py | 45 min | observability |
| 8 | Smoke test: one full cycle through mass_typing_batch | 45 min | AC 5 verification |

Total: 310 minutes (about 5.2 hours), leaving roughly 50 minutes of slack under the 6-hour meta-criterion (AC 8).

---

§17. SOOP-2 Scoreboard at ELP-2 Design Time

For reference, the current state that ELP-2 will advance:

| Criterion | Status | Notes |
|---|---|---|
| 1. All 295 skills typed | in_progress (14/295) | Primary work target for ELP-2 |
| 2. Linter under 3s | pass | Already done |
| 3. Tier 1 recall >= 80 | | |
| 4. Twin primary (mac4:8100) | pending | Requires mac4 availability |
| 5. type_compatibility_weight knob | pending | Architecture decision |
| 6. Feedback loop wired | pending | Track E work |
| 7. Contrarian files exist | pass | Already done |
| 8. Silent capable >= 5 | in_progress (3/5) | 2 more skill edits |
| 9. mac3 purge | pass | Already done |
| 10. Audit memory file | pending | Interactive per Rail |

3 criteria passed. 7 to go. ELP-2's primary contribution is advancing criterion 1 (281 SKILL.md files remaining) without requiring Mohamed's main session to stay open.
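The tally above, and the completion condition in AC 7, can be reproduced with a short sketch. The dictionary layout is an assumption and is not taken from the real scoreboard.json schema. The status for criterion 3 is not recoverable from the table, so it is represented as `None` rather than guessed.

```python
# Status per criterion number, copied from the table in this section.
# Criterion 3's status is missing in the source, so it is None here.
STATUSES = {
    1: "in_progress", 2: "pass", 3: None, 4: "pending", 5: "pending",
    6: "pending", 7: "pass", 8: "in_progress", 9: "pass", 10: "pending",
}


def summarize(statuses):
    """Count passing criteria and report whether the loop may stop (AC 7)."""
    passed = sum(1 for status in statuses.values() if status == "pass")
    return {
        "criteria_passed": passed,
        "remaining": len(statuses) - passed,
        "complete": passed == len(statuses),
    }


summary = summarize(STATUSES)
skills_remaining = 295 - 14  # criterion 1: total skills minus skills already typed
```

Only an explicit "pass" counts toward completion, so an unknown or in-progress status can never cause the dispatcher to stop early.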

---

End of ELP-2 survivor architecture document.
Produced by Crucible Stage 3 FORGE for chain:full-omega cycle 2.

Promotion Decision

Promote into a technical note or architecture paper with implementation anchors.

Source Anchor

crucible-output/soop-2/07-elp-survivor-architecture.md
