Grand Diomande Research · Full HTML Reader

Everlasting Loop Protocol (ELP-1) — Mesh-Backed Autonomous Convergence

> Status: v1 draft (2026-05-13). Born from honest accounting of the SOOP-2 single-Claude loop's caveats.
> Goal: Drive multi-day acceptance-criteria convergence without any single point of failure. No daemon, no machine, no session, no model dependency that can kill the loop.

0. The four caveats this protocol kills

| Caveat | Root cause | ELP solution |
|---|---|---|
| C1: No visibility into the harness wake queue | ScheduleWakeup is a black-box claim | Supervisor maintains its own queue in Supabase + filesystem; wake claims are independently verifiable |
| C2: Wake dies if Claude Code closes | Wake is intra-session | Supervisor runs as launchd plist outside Claude; dispatches work into any available session/pane |
| C3: Loop is mine alone to drive | Single-driver pattern | Multi-worker pattern: any of mac1-5 + cloud-vm can pull from the queue; worker failure does not kill the loop |
| C4: No external supervisor for stall | Silent failures don't escalate | Stagnation detector pages Mohamed via telegram/SMS after configurable thresholds; failed workers get quarantined via cortex:rules |

1. Failure model (what we explicitly defend against)

For each failure class, the protocol must continue:

| Failure | Frequency | Recovery cost |
|---|---|---|
| Claude Code app closed | hourly | zero (supervisor outside Claude) |
| Single machine offline | daily | zero (mesh redistributes to other nodes) |
| Specific pane rate-limited | hourly | zero (cortex:rules quarantines pane, work redistributes) |
| Worker session hangs mid-batch | per-cycle | <5 min (claim TTL releases; another worker reclaims) |
| Supabase unreachable | rare | falls back to filesystem-mirror state |
| Tailscale partition | rare | mesh nodes work from local state; reconcile on heal |
| Power loss on Mac1 | weeks | mac2/cloud-vm take over supervisor role |
| Mohamed unreachable | known | protocol continues; queues escalations for resume |
| Bad batch corrupts state | rare | state writes are atomic + journaled; rollback to last good |

The protocol has no single component whose failure stalls the loop indefinitely. Every component has a recovery path.

2. Architecture (6 layers, numbered 0-5, mesh-native primitives)

+---------------------------------------------------------------+
| Layer 5: ESCALATION                                            |
|   telegram bot, SMS via twilio gateway, supabase alerts        |
|   Pages Mohamed only after a sustained stall (see §8.1)        |
+---------------------------------------------------------------+
| Layer 4: OBSERVABILITY                                          |
|   Grafana dashboard (cloud-vm), Supabase realtime,             |
|   Desktop/soop-2-dashboard.html (Syncthing-shared)            |
+---------------------------------------------------------------+
| Layer 3: VERIFICATION (independent of workers)                  |
|   Cron job runs linter every 5min, writes report to Supabase   |
|   Per-criterion verifier polls state, flips ✅ when threshold  |
+---------------------------------------------------------------+
| Layer 2: WORKER DISPATCH                                        |
|   Workers across mac1-5 + cloud-vm pull from soop2_queue        |
|   Atomic claim with 5min TTL, idempotent batch processing      |
|   aura-gateway (mac1:8095) injects prompts; ops:inject delivers|
+---------------------------------------------------------------+
| Layer 1: SUPERVISOR (launchd, outside Claude Code)              |
|   Runs every 5min as launchd plist on mac1 (with mac2 backup)  |
|   Reads scoreboard, generates work items, dispatches workers   |
|   Heartbeat to [home-path]       |
+---------------------------------------------------------------+
| Layer 0: STATE PLANE (durable, source of truth)                 |
|   Supabase tables: soop2_state, soop2_queue, soop2_log         |
|   Filesystem mirror: [home-path] via Syncthing      |
|   Atomic writes (tmp + os.replace), journaled, version-tagged  |
+---------------------------------------------------------------+

3. State plane (Layer 0)

### 3.1 Supabase schema

sql

-- The scoreboard. One row total.
create table soop2_state (
  id integer primary key default 1,
  schema_version integer not null default 1,
  updated_at timestamptz not null default now(),
  typed_count integer not null default 0,
  total_skills integer not null default 295,
  linter_last_run_at timestamptz,
  linter_last_elapsed_seconds numeric,
  linter_last_exit_code integer,
  silent_capable_count integer not null default 0,
  mac3_purged boolean not null default false,
  contrarian_live boolean not null default false,
  criteria_passed jsonb not null default '[]'::jsonb,  -- list of criterion IDs
  status text not null default 'active',  -- active|paused|stalled|completed
  last_advance_at timestamptz,
  consecutive_no_advance_cycles integer not null default 0,
  constraint single_row check (id = 1)
);

-- Work items. Workers claim with atomic update.
create table soop2_queue (
  id uuid primary key default gen_random_uuid(),
  created_at timestamptz not null default now(),
  kind text not null,  -- type-batch|silent-skill|tier1-expand|feedback-wire|audit|verifier
  payload jsonb not null,  -- batch contents specific to kind
  priority integer not null default 100,
  status text not null default 'pending',  -- pending|claimed|completed|failed|quarantined
  claimed_by text,  -- worker_id (e.g., mac1-claude-cli, mac4-codex)
  claimed_at timestamptz,
  claim_ttl_seconds integer not null default 300,
  completed_at timestamptz,
  attempt_count integer not null default 0,
  max_attempts integer not null default 3,
  result jsonb,
  error_log text
);

-- Append-only audit trail. Every event.
create table soop2_log (
  id bigserial primary key,
  ts timestamptz not null default now(),
  worker text,  -- supervisor|mac1-claude-cli|mac4-codex|...
  event text not null,  -- linter-run|batch-claimed|batch-completed|criterion-passed|alert
  details jsonb not null default '{}'::jsonb
);

-- Worker heartbeats. Each worker writes its status here.
create table soop2_workers (
  worker_id text primary key,  -- e.g., 'mac1-claude-cli'
  machine text not null,
  pane text,
  last_heartbeat_at timestamptz not null default now(),
  status text not null default 'idle',  -- idle|claiming|processing|paused|quarantined
  current_batch_id uuid,
  rate_limited_until timestamptz,
  consecutive_failures integer not null default 0
);

### 3.2 Filesystem mirror (Syncthing-shared)

Every node has a local copy at `[home-path]`:
- `scoreboard.json` — mirror of `soop2_state` row
- `queue/pending/<batch-id>.json` — atomic per-batch files
- `queue/claimed/<batch-id>.json` — moved here on claim
- `queue/completed/<batch-id>.json` — moved on completion
- `log/YYYY-MM-DD.jsonl` — daily log files, append-only
- `supervisor.heartbeat` — touched every cycle by supervisor

Syncthing handles propagation. Atomic writes via `tmp + os.replace`. On Supabase unavailability, all workers fall back to filesystem (eventually consistent, no data loss for the work in flight).
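
The `tmp + os.replace` pattern mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual writer: the function name `atomic_write_json` and the temp-file naming are assumptions made for the example.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, data: dict) -> None:
    """Write JSON so readers see either the old file or the new one, never a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp, sort_keys=True)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)  # atomic rename on POSIX filesystems
    except BaseException:
        # Remove the temp file if anything failed before the rename completed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Because the temp file sits inside the synced directory, Syncthing may briefly see it unless an ignore pattern excludes `*.tmp`; whether that matters depends on the folder configuration.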

### 3.3 Write rules
- Supabase is preferred: workers read from + write to Supabase first
- Filesystem is fallback: if Supabase 503/timeout > 5s, switch to filesystem
- Reconciler runs every cycle: filesystem → Supabase replay on heal
- Idempotency: every event has a unique ID; replay is safe
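
A rough sketch of how the write rules above could fit together, assuming the remote writer is passed in as a callable and signals a 503 or timeout by raising an exception. `RemoteUnavailable`, `record_event`, `replay_journal`, and the `event_id` field are hypothetical names chosen for the illustration; the real reconciler may be structured differently.

```python
import json
import uuid
from pathlib import Path
from typing import Callable


class RemoteUnavailable(Exception):
    """Raised by the remote writer on a 503 or timeout (hypothetical signal)."""


def record_event(event: dict, write_remote: Callable[[dict], None], journal: Path) -> str:
    """Try the remote store first; on failure append the event to a local journal."""
    event = {**event, "event_id": event.get("event_id") or str(uuid.uuid4())}
    try:
        write_remote(event)
        return "remote"
    except RemoteUnavailable:
        journal.parent.mkdir(parents=True, exist_ok=True)
        with journal.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(event) + "\n")
        return "filesystem"


def replay_journal(journal: Path, write_remote: Callable[[dict], None], seen_ids: set) -> int:
    """Replay journaled events after heal; events whose ID is already known are skipped."""
    if not journal.exists():
        return 0
    replayed = 0
    for line in journal.read_text(encoding="utf-8").splitlines():
        event = json.loads(line)
        if event["event_id"] in seen_ids:
            continue  # already applied, so replaying it is a no-op
        write_remote(event)
        seen_ids.add(event["event_id"])
        replayed += 1
    return replayed
```

Running `replay_journal` twice with the same `seen_ids` set writes each journaled event once, which is the property the idempotency rule relies on.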

4. Supervisor daemon (Layer 1)

### 4.1 Identity and placement
- Primary: `mac1`, runs as user launchd plist
- Backup: `mac2` runs the same plist in observer mode (logs once mac1's heartbeat is older than 10 min; promotes itself only after 30 min, see §4.4)
- Reason mac1 is primary: aura-gateway is already at `mac1:8095` (the dispatcher endpoint)

### 4.2 launchd plist

xml

<?xml version="1.0" encoding="UTF-8"?>
<!-- [home-path] -->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>            <string>com.diomande.soop2-supervisor</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/python3</string>
    <string>[home]/.claude/tools/soop2/supervisor.py</string>
    <string>--mode=primary</string>
  </array>
  <key>StartInterval</key>    <integer>300</integer>  <!-- 5 min -->
  <key>RunAtLoad</key>        <true/>
  <key>StandardOutPath</key>  <string>/tmp/soop2-supervisor.log</string>
  <key>StandardErrorPath</key><string>/tmp/soop2-supervisor.err</string>
  <key>EnvironmentVariables</key>
  <dict>
    <key>SOOP2_SUPABASE_URL</key>  <string>...</string>
    <key>SOOP2_SUPABASE_KEY</key>  <string>...</string>
  </dict>
</dict>
</plist>

### 4.3 Supervisor cycle (every 5 min)

1. heartbeat: touch [home-path]
2. snapshot: read scoreboard from Supabase (fallback filesystem)
3. run linter: subprocess.run(skill-typecheck.py); write report to Supabase
4. delta detection:
   - If typed_count increased since last cycle: reset consecutive_no_advance to 0, log advance event
   - If unchanged this cycle: increment consecutive_no_advance (one count per 5-min cycle, so 6 counts = 30 min)
   - If consecutive_no_advance >= 12 (1h): emit stall warning to soop2_log
   - If consecutive_no_advance >= 24 (2h): page Mohamed (telegram + SMS)
5. queue maintenance:
   - Release expired claims: status='claimed' AND claimed_at + claim_ttl < now -> set status='pending' (attempt_count is already incremented at claim time, §5.2)
   - Quarantine: any batch with attempt_count >= max_attempts -> status='quarantined', log; notify Mohamed if it blocks a criterion
6. work generation:
   - If pending queue size < 5: generate next batches based on remaining criteria
   - Batch kinds and generation rules in §6
7. worker pinging:
   - For each non-rate-limited worker in soop2_workers: check last_heartbeat; if stale > 10min, set status='idle'
   - For each idle worker: aura-gateway POST /inject with "claim next batch" prompt
8. status write: update soop2_state with new metrics
9. exit
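
Step 5 of the cycle (queue maintenance) can be sketched in plain Python over in-memory rows. This is only an illustration: `maintain_queue` is a hypothetical name, the rows are dicts shaped like `soop2_queue`, and the sketch assumes `attempt_count` is incremented at claim time (as in the §5.2 claim statement), so releasing an expired claim does not increment it again.

```python
from datetime import datetime, timedelta


def maintain_queue(rows: list[dict], now: datetime) -> list[str]:
    """Release expired claims and quarantine exhausted batches, in place.

    Returns the IDs of the batches quarantined during this pass.
    """
    quarantined = []
    for row in rows:
        if row["status"] == "claimed":
            deadline = row["claimed_at"] + timedelta(seconds=row["claim_ttl_seconds"])
            if deadline < now:
                # The claim expired: hand the batch back to the pool.
                row.update(status="pending", claimed_by=None, claimed_at=None)
        if row["status"] == "pending" and row["attempt_count"] >= row["max_attempts"]:
            # No attempts left: park the batch so it stops cycling.
            row["status"] = "quarantined"
            quarantined.append(row["id"])
    return quarantined
```

In the real supervisor the same two rules would be `UPDATE` statements against Supabase; the in-memory version just makes the ordering explicit (release first, then quarantine).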

### 4.4 Mac2 backup activation
Mac2 runs same plist with `--mode=observer`. Observer behavior:

1. Read primary heartbeat from filesystem (Syncthing-replicated)
2. If heartbeat is < 10 min old: do nothing
3. If heartbeat is 10-30 min old: log "primary stale"
4. If heartbeat is > 30 min old: promote self to primary, alert Mohamed
5. On promotion: become the authoritative writer; mac1 will demote when it returns

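Soft failover. The two nodes should never write simultaneously because the filesystem heartbeat acts as the lock; since the heartbeat is Syncthing-replicated, this is a best-effort guard rather than a strict mutex (see §13, item 1).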
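
The observer thresholds above reduce to a small decision function. The sketch below is illustrative: `observer_action` and `heartbeat_age` are hypothetical names, and treating a missing heartbeat file as infinitely old (and therefore eligible for promotion) is an assumption, not something the protocol specifies.

```python
import os
import time
from typing import Optional

STALE_AFTER_S = 10 * 60    # start logging "primary stale"
PROMOTE_AFTER_S = 30 * 60  # promote self to primary


def heartbeat_age(path: str, now: Optional[float] = None) -> float:
    """Seconds since the heartbeat file was last touched; inf if it does not exist."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path)
    except FileNotFoundError:
        return float("inf")  # assumption: no heartbeat at all counts as stale


def observer_action(age_s: float) -> str:
    """Map heartbeat age to the observer behavior listed in the steps above."""
    if age_s < STALE_AFTER_S:
        return "noop"
    if age_s <= PROMOTE_AFTER_S:
        return "log-stale"
    return "promote"
```

An observer cycle would then call `observer_action(heartbeat_age(path))` and act on the returned label.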

5. Worker dispatch (Layer 2)

### 5.1 Worker model
A worker is a Claude Code session (or Codex pane) on a mesh node that pulls batches and processes them. Workers register themselves on session start by writing to `soop2_workers`.

Worker IDs are deterministic: `{machine}-{pane-name}`. Examples:
- `mac1-claude-cli` — the primary Claude pane on mac1
- `mac4-codex` — Codex.app on mac4 (uses codex-gateway, already shipped)
- `cloud-vm-claude` — Claude pane on cloud-vm via tmux

### 5.2 Worker lifecycle

On session start (via SessionStart hook or manual):
  1. Register: upsert into soop2_workers with status='idle'
  2. Heartbeat loop: every 60s write last_heartbeat_at = now()
  3. Wait for inject prompt from aura-gateway

On receiving "claim next batch":
  1. Atomic claim:
     UPDATE soop2_queue
     SET status='claimed', claimed_by={worker_id}, claimed_at=now(), attempt_count=attempt_count+1
     WHERE id = (SELECT id FROM soop2_queue WHERE status='pending' ORDER BY priority DESC, created_at LIMIT 1 FOR UPDATE SKIP LOCKED)
     RETURNING *
  2. Process batch per kind (§6)
  3. On success: UPDATE soop2_queue SET status='completed', result=..., completed_at=now() WHERE id={batch_id}
  4. On failure: UPDATE soop2_queue SET status = CASE WHEN attempt_count < max_attempts THEN 'pending' ELSE 'quarantined' END, claimed_by=NULL, claimed_at=NULL, error_log=... WHERE id={batch_id}
  5. Update worker status back to 'idle'
  6. Reset consecutive_failures to 0 on success; increment on failure
  7. If consecutive_failures >= 3: self-quarantine for 30 min

On supervisor ping:
  1. Update last_heartbeat
  2. If status='idle' and batch is available: claim and process
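
The claim and failure paths above can be exercised locally with a stand-in database. The sketch below uses Python's built-in `sqlite3` purely for illustration: SQLite has no `FOR UPDATE SKIP LOCKED`, so the example serializes claimers with `BEGIN IMMEDIATE` instead, uses a trimmed-down table, and orders ties by `id` rather than `created_at`. `claim_next` and `fail_batch` are hypothetical names; the production path is the Postgres statement shown in the lifecycle.

```python
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE soop2_queue (
  id INTEGER PRIMARY KEY,
  priority INTEGER NOT NULL DEFAULT 100,
  status TEXT NOT NULL DEFAULT 'pending',
  claimed_by TEXT,
  attempt_count INTEGER NOT NULL DEFAULT 0,
  max_attempts INTEGER NOT NULL DEFAULT 3,
  error_log TEXT
);
"""


def claim_next(conn: sqlite3.Connection, worker_id: str) -> Optional[int]:
    """Claim the highest-priority pending batch; return its id, or None if none is pending."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id FROM soop2_queue WHERE status='pending' "
            "ORDER BY priority DESC, id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE soop2_queue SET status='claimed', claimed_by=?, "
                "attempt_count=attempt_count+1 WHERE id=?",
                (worker_id, row[0]),
            )
        conn.execute("COMMIT")
        return None if row is None else row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise


def fail_batch(conn: sqlite3.Connection, batch_id: int, error: str) -> None:
    """Return a failed batch to the pool, or quarantine it once attempts are exhausted."""
    conn.execute(
        "UPDATE soop2_queue SET "
        "status = CASE WHEN attempt_count < max_attempts THEN 'pending' ELSE 'quarantined' END, "
        "claimed_by = NULL, error_log = ? WHERE id = ?",
        (error, batch_id),
    )
```

The connection should be opened with `sqlite3.connect(path, isolation_level=None)` so the explicit `BEGIN IMMEDIATE` / `COMMIT` statements are not wrapped in an implicit transaction.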

### 5.3 Aura-gateway prompt-injection format
The supervisor invokes:

POST mac1:8095/inject
{
  "machine": "{worker.machine}",
  "tmux_target": "{worker.pane}",
  "prompt": "[SOOP-2 worker {worker_id}] Claim and process the next batch. Read [home-path] for available work. See [home-path] for the canonical processor."
}

`aura-gateway` already routes to the right mesh node via SSH+tmux send-keys (per `meshcontrol-cross-machine-shipped-2026-05-06.md`).
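
For illustration, the inject call could be assembled with the standard library as below. The `http://` scheme, the `Content-Type` header, and the helper name `build_inject_request` are assumptions; the prompt is shortened here because the original references redacted paths. The function only builds the request object and does not send it.

```python
import json
import urllib.request


def build_inject_request(worker: dict, base_url: str = "http://mac1:8095") -> urllib.request.Request:
    """Build (but do not send) the POST /inject request for one worker row."""
    body = {
        "machine": worker["machine"],
        "tmux_target": worker["pane"],
        "prompt": f"[SOOP-2 worker {worker['worker_id']}] Claim and process the next batch.",
    }
    return urllib.request.Request(
        url=f"{base_url}/inject",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending would be a `urllib.request.urlopen(request, timeout=...)` call wrapped in error handling, so that an unreachable gateway is logged rather than crashing the supervisor cycle.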

### 5.4 Backpressure and rate-limit awareness
Workers check `soop2_workers.rate_limited_until` for their own row before each claim. If a worker hits a rate limit:
1. Set its own `rate_limited_until = now() + observed_window`
2. Reject the claim (return to pending)
3. Skip prompts from aura-gateway for the duration

Supervisor also reads SessionStart's rate-limit info (Mohamed already sees rate-limit warnings on session start) and updates worker rows accordingly.
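
A minimal sketch of the pre-claim check described in this section, operating on a dict shaped like a `soop2_workers` row. `may_claim` and `mark_rate_limited` are hypothetical helper names, and requiring `status == 'idle'` before claiming is an assumption drawn from the worker lifecycle.

```python
from datetime import datetime, timedelta


def may_claim(worker: dict, now: datetime) -> bool:
    """A worker may claim only when idle and outside any rate-limit window."""
    until = worker.get("rate_limited_until")
    return worker["status"] == "idle" and (until is None or until <= now)


def mark_rate_limited(worker: dict, now: datetime, observed_window: timedelta) -> dict:
    """Record the observed rate-limit window on the worker's own row."""
    worker["rate_limited_until"] = now + observed_window
    return worker
```

While `may_claim` returns `False`, the worker would skip injected prompts instead of claiming and then failing a batch.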

6. Batch kinds and processing

### 6.1 `kind=type-batch`
Payload: `{skills: [{path, category, input_type, output_type, effects, idempotent, silent_capable, commutes_with}, ...]}`
Worker behavior:
1. For each skill, apply the surgical FRONTMATTER_RE injection pattern
2. Run linter against the modified files
3. On all-clean: status='completed', result includes typed_delta
4. On any failure: status back to pending (if attempts left) with error_log

### 6.2 `kind=silent-skill`
Payload: `{skill_name, body_template}`
Creates a new silent-capable skill (like cortex:watch). Worker writes the SKILL.md with full typed frontmatter and silent return logic.

### 6.3 `kind=tier1-expand`
Payload: `{batch_skills: [...], embedding_target: ...}`
Worker runs `embedding_indexer.py` against the batch, updates the embedding cache, validates retrieval recall.

### 6.4 `kind=feedback-wire`
Payload: `{component: 'reaction_logger' | 'pattern_learner' | 'threshold_calibrator' | 'lse_bridge'}`
Worker patches the SEA production code at `[home-path]` to add the named feedback-loop component. Idempotent (no-op if already wired).

### 6.5 `kind=audit`
Payload: `{}`
Worker runs the full skill audit, generates `skills-typing-audit-YYYY-MM-DD.md` memory file. Only one of these should ever be in flight (priority=1, max 1 per day).

### 6.6 `kind=verifier`
Payload: `{criterion_id: 1..10}`
Worker checks the specific criterion's success signal against current state. Writes to `soop2_state.criteria_passed` if passing. This is how criteria flip from ⏳ to ✅.

7. Verification (Layer 3, independent of workers)

### 7.1 Continuous verifier cron
Separate launchd plist (`com.diomande.soop2-verifier`) runs `verifier.py` every 5 min, offset 150s from supervisor (avoids race conditions).

Verifier behavior:
1. Read scoreboard
2. For each criterion 1-10: run its automated check
3. Compare against current `criteria_passed`; emit `criterion_flipped` event on change
4. Write per-criterion timestamp to `soop2_state.criterion_passed_at_<N>` (these columns are not in the §3.1 schema and must be added by a migration)
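
Steps 2 and 3 of the verifier can be sketched as a pure function that re-runs every check and reports which criteria changed. `detect_flips` is a hypothetical name, and treating a check that raises as "not passing" is an assumption made so that one broken check cannot abort the whole verifier run.

```python
from typing import Callable


def detect_flips(checks: dict[int, Callable[[dict], bool]], state: dict) -> tuple[list[int], list[dict]]:
    """Re-run all checks; return the new passed list and one event per changed criterion."""
    passed_before = set(state["criteria_passed"])
    passed_now = set()
    for criterion_id, check in checks.items():
        try:
            if check(state):
                passed_now.add(criterion_id)
        except Exception:
            # Assumption: a check that errors is treated as not passing.
            pass
    events = [
        {"event": "criterion_flipped", "criterion": cid, "passed": cid in passed_now}
        for cid in sorted(passed_before ^ passed_now)  # symmetric difference = changed
    ]
    return sorted(passed_now), events
```

Closure detection then amounts to checking whether the returned passed list covers every key in `checks`.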

### 7.2 Per-criterion automated checks

python

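# Each check takes the scoreboard row `s` (a dict) and returns True when the criterion passes.
# The helper predicates (tier1_recall_test_passing, etc.) are assumed to be defined elsewhere in verifier.py.
CHECKS = {
  # Guard against an empty skill set to avoid ZeroDivisionError.
  1: lambda s: s['total_skills'] > 0 and s['typed_count'] / s['total_skills'] >= 0.95,
  # linter_last_elapsed_seconds is nullable until the first linter run.
  2: lambda s: s['linter_last_elapsed_seconds'] is not None and s['linter_last_elapsed_seconds'] < 3.0,
  3: lambda s: tier1_recall_test_passing(),  # runs labeled-benchmark
  4: lambda s: tier2_endpoint_is_twin(),     # curl Mac4:8100/health
  5: lambda s: router_config_has_type_weight_knob(),
  6: lambda s: feedback_loop_closure_test_passing(),
  7: lambda s: contrarian_files_exist() and meta_review_has_round2(),
  8: lambda s: s['silent_capable_count'] >= 5,
  9: lambda s: mac3_grep_returns_only_in_archive(),
  10: lambda s: typing_audit_memory_file_exists(),
}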

Each check is idempotent and fast (<1s).

### 7.3 Closure detection
When all 10 criteria flip to ✅:
1. Supervisor writes `soop2_state.status = 'completed'`
2. Generates final memory file `soop-2-completed-YYYY-MM-DD.md`
3. Pages Mohamed via telegram with completion summary
4. Stops dispatching new work
5. Workers gracefully transition to idle and de-register

8. Escalation (Layer 5)

### 8.1 Trigger thresholds
| Condition | Action |
|---|---|
| `consecutive_no_advance >= 12` (1h) | Log warning |
| `consecutive_no_advance >= 24` (2h) | telegram + SMS |
| Any criterion stalled in 'in_progress' > 24h | Daily summary to Mohamed |
| Worker fail_count >= 5 in 1h | Auto-quarantine that worker |
| All workers quarantined or rate-limited | Page Mohamed within 5 min |
| Linter exit code unstable (flipping) | Page Mohamed |
| Supabase + filesystem disagree after reconciliation | Page Mohamed (data integrity) |
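
The first two rows of the table can be sketched as a counter update, assuming `consecutive_no_advance` is incremented once per 5-minute supervisor cycle without progress (which is what the 12 = 1h and 24 = 2h annotations imply). `advance_stall_counter` and the action labels are hypothetical.

```python
WARN_AT = 12  # 12 cycles x 5 min = 1h
PAGE_AT = 24  # 24 cycles x 5 min = 2h


def advance_stall_counter(prev_typed: int, typed: int, counter: int) -> tuple[int, str]:
    """Return the updated counter and the action for this cycle."""
    if typed > prev_typed:
        return 0, "advance"  # progress resets the counter
    counter += 1
    if counter >= PAGE_AT:
        return counter, "page"
    if counter >= WARN_AT:
        return counter, "warn"
    return counter, "none"
```

As written, this returns `"page"` on every cycle once the threshold is crossed; a real implementation would probably page once per stall episode to avoid alarm fatigue.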

### 8.2 Escalation delivery
Reuses existing `telegram` skill (gateway) and `sms` skill (twilio). Both already shipped per memory.

Message template:

SOOP-2 stall alert (cycle {N}):
- typed: {x}/{total} ({pct}%)
- last advance: {hours_ago}h ago
- criteria passed: {n}/10
- stuck on: {criterion_name}
- last error: {err_summary}
- action: {recommend}

Inspect: [home-path]
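
Filling the template is a plain string-formatting step. The sketch below shows one way to do it; `format_stall_alert` and its parameter names are hypothetical, and the trailing `Inspect:` line is left out because its path is redacted in this document.

```python
def format_stall_alert(cycle: int, typed: int, total: int, hours_ago: float,
                       passed: int, stuck_on: str, err_summary: str, recommend: str) -> str:
    """Render the stall alert body from scoreboard values."""
    pct = 100.0 * typed / total if total else 0.0  # avoid dividing by zero
    return "\n".join([
        f"SOOP-2 stall alert (cycle {cycle}):",
        f"- typed: {typed}/{total} ({pct:.1f}%)",
        f"- last advance: {hours_ago:.1f}h ago",
        f"- criteria passed: {passed}/10",
        f"- stuck on: {stuck_on}",
        f"- last error: {err_summary}",
        f"- action: {recommend}",
    ])
```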

9. Observability (Layer 4)

### 9.1 Grafana dashboard
Cloud-vm already runs Grafana per memory. Add a SOOP-2 dashboard with panels:
- typed_count time-series (line chart)
- criteria_passed (status grid, 10 cells)
- worker_status (color-coded by worker_id)
- queue_depth + claim_rate (gauges)
- consecutive_no_advance (counter with red threshold at 24)
- linter exit_code history (binary chart)

Data source: Supabase Postgres direct query. Refresh every 30s.

### 9.2 Local dashboard fallback
For when grafana is unavailable, supervisor writes a static `Desktop/soop-2-dashboard.html` every cycle. Syncthing replicates. Mohamed opens it anywhere.
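
As a rough idea of what the static fallback could look like, the sketch below renders the scoreboard as a one-table HTML page. `render_dashboard` is a hypothetical helper and the layout is invented for the example; writing the result to disk would use the same atomic-write approach as the rest of the filesystem mirror.

```python
import html


def render_dashboard(state: dict) -> str:
    """Render scoreboard key/value pairs as a minimal static HTML page."""
    rows = "\n".join(
        # Escape both sides so error text or JSON cannot break the markup.
        f"<tr><th>{html.escape(str(key))}</th><td>{html.escape(str(value))}</td></tr>"
        for key, value in sorted(state.items())
    )
    return (
        '<!doctype html>\n<html><head><meta charset="utf-8">'
        "<title>SOOP-2 dashboard</title></head>\n"
        f"<body><table>\n{rows}\n</table></body></html>\n"
    )
```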

### 9.3 CLI status command
`soop2 status` (a small bash script in `[home-path]`):

bash

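#!/bin/bash
# Print scoreboard, queue, and worker summaries from Supabase Postgres.
# SOOP2_DB is quoted so a connection string with special characters is passed intact.
psql "$SOOP2_DB" -c "SELECT typed_count, criteria_passed, status, last_advance_at FROM soop2_state"
echo "--- Queue ---"
psql "$SOOP2_DB" -c "SELECT kind, status, count(*) FROM soop2_queue GROUP BY kind, status"
echo "--- Workers ---"
psql "$SOOP2_DB" -c "SELECT worker_id, status, last_heartbeat_at FROM soop2_workers"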

Fallback to filesystem if Supabase unavailable.

10. Bootstrap procedure (one-time deploy)

bash

# 1. Provision Supabase tables (idempotent SQL migrations)
psql $SOOP2_DB < [home-path]

# 2. Initialize filesystem mirror
mkdir -p [home-path]

# 3. Install supervisor launchd plist (mac1 primary)
cp [home-path] [home-path]
launchctl load [home-path]

# 4. Install observer plist on mac2
ssh mac2 "cp [home-path] [home-path] && launchctl load [home-path]"

# 5. Seed initial state from current scoreboard
python3 [home-path] [home-path]

# 6. Register workers on each mesh node
ssh mac1 "python3 [home-path] --pane=claude-cli"
ssh mac4 "python3 [home-path] --pane=codex"
# (mac2/mac3/mac5/cloud-vm as available; mac3 stays out per SOOP-2 directive)

# 7. Verify
[home-path] status

# 8. Watch first cycle
tail -f /tmp/soop2-supervisor.log

After bootstrap, the protocol runs forever (or until soop2_state.status='completed'). Closing Claude Code on Mac1 does NOT stop it. Killing mac1 does NOT stop it. The only way to stop it is to disable both launchd plists OR set status='paused' in the DB.

11. Migration from current single-Claude loop

The ScheduleWakeup-based loop active right now is single-driver. ELP is multi-driver. Migration:

1. Phase A (today's single loop continues): the ScheduleWakeup wakes type their batches as already planned.
2. Phase B (bootstrap ELP): run the bootstrap procedure above. Initial seed comes from current `soop-2-launch-2026-05-12.md`.
3. Phase C (cutover): the next ScheduleWakeup fire detects ELP is live (file exists at `[home-path]`), registers self as worker `mac1-claude-cli`, and from then on operates as one of many workers rather than the sole driver.
4. Phase D (post-completion): when ELP reports `status=completed`, ScheduleWakeup chain cancels itself.

The migration is non-destructive. ELP can run alongside the single loop; first one to advance a batch claims it; the other's attempt becomes a no-op.

12. Anti-patterns this protocol forbids

1. Any worker that doesn't write a heartbeat (cannot be supervised)
2. Any state write that isn't atomic + journaled (loses progress on crash)
3. Any criterion that doesn't have an automated verifier (cannot be measured)
4. Hardcoded mac3 references (deprecated per SOOP-2 Track H)
5. Workers that bypass the queue (creates race conditions with claim semantics)
6. Supervisor that runs inside Claude Code (re-creates C2 caveat)
7. Escalation thresholds without a clear remediation step (alarm fatigue)
8. Filesystem-only state with no Supabase copy (loses durability)
9. Workers without rate-limit awareness (hammers the cap)
10. Any change to ELP state without a `soop2_log` event (audit hole)

13. Open questions for v1.1

1. Quorum vs single-supervisor: ELP-1 uses primary+observer with heartbeat-as-lock. For v1.1, consider Raft-style quorum across 3+ supervisors. Tradeoff: simpler now (heartbeat) vs more resilient later (quorum).
2. Worker capability declarations: should each worker advertise what `kind` of batches it can process? (e.g., mac1 worker can do tier1-expand because it has the embedding cache; cloud-vm worker can do typing). Probably yes, in v1.1.
3. Time-of-day awareness: Mohamed's wake hours vs sleep hours — should the protocol be more aggressive at night when no human interruption is welcome? Probably no; the work is async anyway.
4. Pull-vs-push: ELP-1 uses push (supervisor injects prompts). Pull (workers poll) is more resilient but spammier. Stay with push for v1.
5. Test mode: before going live, ELP needs a dry-run mode that processes synthetic batches without modifying real files. Add `--dry-run` flag everywhere.

14. Acceptance criteria for ELP itself

When all of these are true, ELP is operationally complete:

1. `[home-path]` exists and runs as launchd on mac1
2. `[home-path]` exists and is callable from any mesh node
3. Supabase schema is applied; all 4 tables exist with at least 1 test row
4. `soop2_state.status == 'active'` and `supervisor.heartbeat` is fresh (< 10 min old)
5. At least 1 worker is registered with a fresh heartbeat
6. Closing Claude Code on mac1 for 30 min does NOT cause `consecutive_no_advance` to increment
7. Manually quarantining a worker (set status='quarantined' in DB) does NOT stop other workers from progressing
8. Killing the Supabase connection for 10 min causes filesystem fallback; on heal, state reconciles correctly
9. A simulated stall (no advance for 2h) triggers a telegram message to Mohamed
10. When all 10 SOOP-2 criteria flip to ✅, ELP automatically writes the completion memory file and stops dispatching

15. Why this is "everlasting"

The protocol survives:
- Any single machine going offline (mesh redistributes)
- Claude Code being closed (supervisor is outside)
- Any single worker crashing (claim TTL releases work)
- Supabase being down (filesystem fallback)
- Tailscale partition (eventual consistency)
- Mohamed being offline (queues escalations, continues work)
- Bad code in a SOOP-2 batch (quarantine, alert, continue)
- The model being slow on a node (rate-limit aware redirect)
- A criterion definition being wrong (verifier runs forever, just doesn't flip)

The only ways to permanently stop ELP:
1. Mohamed sets `soop2_state.status = 'paused'`
2. All mesh nodes are simultaneously offline for > 30 days (Syncthing pruning)
3. Both supervisor plists are disabled and not re-enabled

Even then, the state in Supabase + filesystem mirror is durable. Restart trivially.

---

End of ELP-1 v1 draft. Authority gate: Mohamed Diomande review for v1.1 sign-off.
Next session: bootstrap procedure §10 ships ELP to production.

Source Anchor

crucible-output/soop-2/05-everlasting-loop-protocol.md