Everlasting Loop Protocol (ELP-1) — Mesh-Backed Autonomous Convergence
> **Status:** v1 draft (2026-05-13). Born from honest accounting of the SOOP-2 single-Claude loop's caveats.
>
> **Goal:** Drive multi-day acceptance-criteria convergence without any single point of failure. No daemon, no machine, no session, no model dependency that can kill the loop.
0. The four caveats this protocol kills
| Caveat | Root cause | ELP solution |
|---|---|---|
| C1: No visibility into the harness wake queue | ScheduleWakeup is a black-box claim | Supervisor maintains its own queue in Supabase + filesystem; wake claims are independently verifiable |
| C2: Wake dies if Claude Code closes | Wake is intra-session | Supervisor runs as launchd plist outside Claude; dispatches work into any available session/pane |
| C3: Loop is mine alone to drive | Single-driver pattern | Multi-worker pattern: any of mac1-5 + cloud-vm can pull from the queue; worker failure does not kill the loop |
| C4: No external supervisor for stall | Silent failures don't escalate | Stagnation detector pages Mohamed via telegram/SMS after configurable thresholds; failed workers get quarantined via cortex:rules |
1. Failure model (what we explicitly defend against)
For each failure class, the protocol must continue:
| Failure | Frequency | Recovery cost |
|---|---|---|
| Claude Code app closed | hourly | zero (supervisor outside Claude) |
| Single machine offline | daily | zero (mesh redistributes to other nodes) |
| Specific pane rate-limited | hourly | zero (cortex:rules quarantines pane, work redistributes) |
| Worker session hangs mid-batch | per-cycle | <5 min (claim TTL releases; another worker reclaims) |
| Supabase unreachable | rare | falls back to filesystem-mirror state |
| Tailscale partition | rare | mesh nodes work from local state; reconcile on heal |
| Power loss on Mac1 | weeks | mac2/cloud-vm take over supervisor role |
| Mohamed unreachable | known | protocol continues; queues escalations for resume |
| Bad batch corrupts state | rare | state writes are atomic + journaled; rollback to last good |
The protocol has no single component whose failure stalls the loop indefinitely. Every component has a recovery path.
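The "worker session hangs mid-batch" row depends on claim expiry. A minimal in-memory sketch of that release step follows. `Batch` and `release_expired` are hypothetical names that mirror a subset of the `soop2_queue` columns; this is an illustration of the rule, not the shipped supervisor code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Batch:
    """In-memory stand-in for one soop2_queue row (subset of columns)."""
    id: str
    status: str = "pending"
    claimed_by: Optional[str] = None
    claimed_at: Optional[datetime] = None
    claim_ttl_seconds: int = 300
    attempt_count: int = 0
    max_attempts: int = 3


def release_expired(queue: List[Batch], now: datetime) -> List[str]:
    """Return expired claims to 'pending', or quarantine them once
    attempt_count has reached max_attempts. Returns the touched batch ids."""
    touched = []
    for b in queue:
        if b.status != "claimed" or b.claimed_at is None:
            continue
        if b.claimed_at + timedelta(seconds=b.claim_ttl_seconds) >= now:
            continue  # claim is still within its TTL
        b.status = "quarantined" if b.attempt_count >= b.max_attempts else "pending"
        b.claimed_by = None
        b.claimed_at = None
        touched.append(b.id)
    return touched
```

A claim that is six minutes old with the default 300 s TTL is released; a two-minute-old claim is left alone.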
2. Architecture (5 layers, mesh-native primitives)
```
+---------------------------------------------------------------+
|  Layer 5: ESCALATION                                          |
|  telegram bot, SMS via twilio gateway, supabase alerts        |
|  Pages Mohamed after a sustained stall (thresholds in §8.1)   |
+---------------------------------------------------------------+
|  Layer 4: OBSERVABILITY                                       |
|  Grafana dashboard (cloud-vm), Supabase realtime,             |
|  Desktop/soop-2-dashboard.html (Syncthing-shared)             |
+---------------------------------------------------------------+
|  Layer 3: VERIFICATION (independent of workers)               |
|  Cron job runs linter every 5min, writes report to Supabase   |
|  Per-criterion verifier polls state, flips ✅ when threshold  |
+---------------------------------------------------------------+
|  Layer 2: WORKER DISPATCH                                     |
|  Workers across mac1-5 + cloud-vm pull from soop2_queue       |
|  Atomic claim with 5min TTL, idempotent batch processing      |
|  aura-gateway (mac1:8095) injects prompts; ops:inject delivers|
+---------------------------------------------------------------+
|  Layer 1: SUPERVISOR (launchd, outside Claude Code)           |
|  Runs every 5min as launchd plist on mac1 (with mac2 backup)  |
|  Reads scoreboard, generates work items, dispatches workers   |
|  Heartbeat to [home-path]                                     |
+---------------------------------------------------------------+
|  Layer 0: STATE PLANE (durable, source of truth)              |
|  Supabase: soop2_state, soop2_queue, soop2_log, soop2_workers |
|  Filesystem mirror: [home-path] via Syncthing                 |
|  Atomic writes (tmp + os.replace), journaled, version-tagged  |
+---------------------------------------------------------------+
```
3. State plane (Layer 0)
### 3.1 Supabase schema
```sql
-- The scoreboard. One row total.
create table soop2_state (
  id                            integer primary key default 1,
  schema_version                integer not null default 1,
  updated_at                    timestamptz not null default now(),
  typed_count                   integer not null default 0,
  total_skills                  integer not null default 295,
  linter_last_run_at            timestamptz,
  linter_last_elapsed_seconds   numeric,
  linter_last_exit_code         integer,
  silent_capable_count          integer not null default 0,
  mac3_purged                   boolean not null default false,
  contrarian_live               boolean not null default false,
  criteria_passed               jsonb not null default '[]'::jsonb,  -- list of criterion IDs
  status                        text not null default 'active',      -- active|paused|stalled|completed
  last_advance_at               timestamptz,
  consecutive_no_advance_cycles integer not null default 0,
  constraint single_row check (id = 1)
);

-- Work items. Workers claim with atomic update.
create table soop2_queue (
  id                uuid primary key default gen_random_uuid(),
  created_at        timestamptz not null default now(),
  kind              text not null,                    -- type-batch|silent-skill|tier1-expand|feedback-wire|audit|verifier
  payload           jsonb not null,                   -- batch contents specific to kind
  priority          integer not null default 100,
  status            text not null default 'pending',  -- pending|claimed|completed|failed|quarantined
  claimed_by        text,                             -- worker_id (e.g., mac1-claude-cli, mac4-codex)
  claimed_at        timestamptz,
  claim_ttl_seconds integer not null default 300,
  completed_at      timestamptz,
  attempt_count     integer not null default 0,
  max_attempts      integer not null default 3,
  result            jsonb,
  error_log         text
);

-- Append-only audit trail. Every event.
create table soop2_log (
  id      bigserial primary key,
  ts      timestamptz not null default now(),
  worker  text,                                -- supervisor|mac1-claude-cli|mac4-codex|...
  event   text not null,                       -- linter-run|batch-claimed|batch-completed|criterion-passed|alert
  details jsonb not null default '{}'::jsonb
);

-- Worker heartbeats. Each worker writes its status here.
create table soop2_workers (
  worker_id            text primary key,               -- e.g., 'mac1-claude-cli'
  machine              text not null,
  pane                 text,
  last_heartbeat_at    timestamptz not null default now(),
  status               text not null default 'idle',   -- idle|claiming|processing|paused|quarantined
  current_batch_id     uuid,
  rate_limited_until   timestamptz,
  consecutive_failures integer not null default 0
);
```
### 3.2 Filesystem mirror (Syncthing-shared)
Every node has a local copy at `[home-path]`:
- `scoreboard.json` — mirror of `soop2_state` row
- `queue/pending/<batch-id>.json` — atomic per-batch files
- `queue/claimed/<batch-id>.json` — moved here on claim
- `queue/completed/<batch-id>.json` — moved on completion
- `log/YYYY-MM-DD.jsonl` — daily log files, append-only
- `supervisor.heartbeat` — touched every cycle by supervisor
Syncthing handles propagation. Atomic writes via `tmp + os.replace`. On Supabase unavailability, all workers fall back to filesystem (eventually consistent, no data loss for the work in flight).
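The `tmp + os.replace` pattern can be sketched as follows. The function name `atomic_write_json` is an assumption for illustration; the one hard requirement is that the temp file lives in the same directory as the target so the rename stays on one filesystem.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, data: dict) -> None:
    """Write JSON so readers see either the old file or the new one, never a partial."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as the target, so os.replace is a same-filesystem rename.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, sort_keys=True)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, str(path))
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Because the temp files briefly exist inside the Syncthing-shared folder, an ignore pattern for `*.tmp` would probably be wanted; that is an assumption, not something this draft specifies.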
### 3.3 Write rules
- Supabase is preferred: workers read from and write to Supabase first
- Filesystem is fallback: if Supabase returns a 503 or a request times out (> 5 s), switch to the filesystem mirror
- Reconciler runs every cycle: once Supabase heals, filesystem events are replayed into Supabase
- Idempotency: every event has a unique ID, so replaying an event that was already applied is a no-op
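A minimal sketch of these write rules, using in-memory stand-ins. `FallbackWriter`, `primary_write`, and the `event_id` field are hypothetical names; in a real deployment the duplicate check would more plausibly be a unique constraint on the Supabase side than a Python set.

```python
import uuid
from typing import Callable, Dict, List, Set


class FallbackWriter:
    """Primary-first event writer with a local journal and idempotent replay."""

    def __init__(self, primary_write: Callable[[Dict], None]):
        self.primary_write = primary_write  # hypothetical wrapper around the Supabase insert
        self.journal: List[Dict] = []       # stand-in for the filesystem mirror
        self.seen_ids: Set[str] = set()     # ids already accepted by the primary

    def write(self, event: Dict) -> str:
        """Try the primary; on outage, journal locally. Returns where the event went."""
        event = {**event, "event_id": event.get("event_id") or str(uuid.uuid4())}
        try:
            self._deliver(event)
            return "primary"
        except (TimeoutError, ConnectionError):
            self.journal.append(event)
            return "fallback"

    def _deliver(self, event: Dict) -> None:
        if event["event_id"] in self.seen_ids:
            return  # duplicate id: replay is a no-op
        self.primary_write(event)
        self.seen_ids.add(event["event_id"])

    def reconcile(self) -> int:
        """Replay journaled events to the primary; returns how many are still pending."""
        remaining = []
        for event in self.journal:
            try:
                self._deliver(event)
            except (TimeoutError, ConnectionError):
                remaining.append(event)
        self.journal = remaining
        return len(remaining)
```

Calling `reconcile()` once per supervisor cycle matches the "reconciler runs every cycle" rule: it is harmless when the journal is empty and safe to repeat after a partial replay.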
4. Supervisor daemon (Layer 1)
### 4.1 Identity and placement
- Primary: `mac1`, runs as user launchd plist
- Backup: `mac2` runs the same plist in observer mode (logs a warning once the mac1 heartbeat is older than 10 min and promotes itself once it is older than 30 min; see §4.4)
- Reason mac1 is primary: aura-gateway is already at `mac1:8095` (the dispatcher endpoint)
### 4.2 launchd plist
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- [home-path] -->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>             <string>com.diomande.soop2-supervisor</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/python3</string>
    <string>[home]/.claude/tools/soop2/supervisor.py</string>
    <string>--mode=primary</string>
  </array>
  <key>StartInterval</key>     <integer>300</integer>  <!-- 5 min -->
  <key>RunAtLoad</key>         <true/>
  <key>StandardOutPath</key>   <string>/tmp/soop2-supervisor.log</string>
  <key>StandardErrorPath</key> <string>/tmp/soop2-supervisor.err</string>
  <key>EnvironmentVariables</key>
  <dict>
    <key>SOOP2_SUPABASE_URL</key> <string>...</string>
    <key>SOOP2_SUPABASE_KEY</key> <string>...</string>
  </dict>
</dict>
</plist>
```
### 4.3 Supervisor cycle (every 5 min)
```
1. heartbeat: touch [home-path]
2. snapshot: read scoreboard from Supabase (fallback filesystem)
3. run linter: subprocess.run(skill-typecheck.py); write report to Supabase
4. delta detection:
   - If typed_count increased since last cycle: reset consecutive_no_advance to 0, log advance event
   - If unchanged: increment consecutive_no_advance by 1 (one count per 5-min cycle, so 6 counts = 30 min)
   - If consecutive_no_advance >= 12 (1h): emit stall warning to soop2_log
   - If consecutive_no_advance >= 24 (2h): page Mohamed (telegram + SMS)
5. queue maintenance:
   - Release expired claims: status='claimed' AND claimed_at + claim_ttl < now -> set status='pending'
     (attempt_count was already incremented at claim time, see §5.2)
   - Quarantine: any batch with attempt_count >= max_attempts -> status='quarantined', log; notify Mohamed if it blocks a criterion
6. work generation:
   - If pending queue size < 5: generate next batches based on remaining criteria
   - Batch kinds and generation rules in §6
7. worker pinging:
   - For each non-rate-limited worker in soop2_workers: check last_heartbeat; if stale > 10min, set status='idle'
   - For each idle worker: aura-gateway POST /inject with "claim next batch" prompt
8. status write: update soop2_state with new metrics
9. exit
```
### 4.4 Mac2 backup activation
Mac2 runs the same plist with `--mode=observer`. Observer behavior:
1. Read primary heartbeat from filesystem (Syncthing-replicated)
2. If heartbeat is < 10 min old: do nothing
3. If heartbeat is 10-30 min old: log "primary stale"
4. If heartbeat is > 30 min old: promote self to primary, alert Mohamed
5. On promotion: become the authoritative writer; mac1 will demote when it returns

This is a soft failover. The two nodes never write simultaneously because the filesystem heartbeat is the lock.
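The observer's decision reduces to a function of heartbeat age. A minimal sketch, assuming the heartbeat is a file whose modification time is what the observer reads; `heartbeat_age_seconds` and `observer_action` are hypothetical names. The sketch ignores clock skew between mac1 and mac2, which a real deployment would need to bound.

```python
import os
import time
from typing import Optional


def heartbeat_age_seconds(path: str, now: Optional[float] = None) -> float:
    """Age of the heartbeat file in seconds, based on its mtime."""
    now = time.time() if now is None else now
    return max(0.0, now - os.path.getmtime(path))


def observer_action(age_s: float) -> str:
    """Map heartbeat age to the observer behavior listed above."""
    if age_s < 10 * 60:
        return "noop"               # primary is healthy
    if age_s <= 30 * 60:
        return "log-primary-stale"  # warn, but do not take over yet
    return "promote"                # take over and alert
```

The 20-minute gap between "stale" and "promote" is what keeps a slow Syncthing sync from triggering a takeover while mac1 is still alive.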
5. Worker dispatch (Layer 2)
### 5.1 Worker model
A worker is a Claude Code session (or Codex pane) on a mesh node that pulls batches and processes them. Workers register themselves on session start by writing to `soop2_workers`.
Worker IDs are deterministic: `{machine}-{pane-name}`. Examples:
- `mac1-claude-cli` — the primary Claude pane on mac1
- `mac4-codex` — Codex.app on mac4 (uses codex-gateway, already shipped)
- `cloud-vm-claude` — Claude pane on cloud-vm via tmux
### 5.2 Worker lifecycle
```
On session start (via SessionStart hook or manual):
  1. Register: upsert into soop2_workers with status='idle'
  2. Heartbeat loop: every 60s write last_heartbeat_at = now()
  3. Wait for inject prompt from aura-gateway

On receiving "claim next batch":
  1. Atomic claim:
       UPDATE soop2_queue
       SET status='claimed', claimed_by={worker_id}, claimed_at=now(), attempt_count=attempt_count+1
       WHERE id = (
         SELECT id FROM soop2_queue
         WHERE status='pending'
         ORDER BY priority DESC, created_at
         LIMIT 1
         FOR UPDATE SKIP LOCKED
       )
       RETURNING *
  2. Process batch per kind (§6)
  3. On success: UPDATE soop2_queue SET status='completed', result=..., completed_at=now() WHERE id={batch_id}
  4. On failure: UPDATE soop2_queue
       SET status = CASE WHEN attempt_count < max_attempts THEN 'pending' ELSE 'quarantined' END, error_log=...
       WHERE id={batch_id}
  5. Update worker status back to 'idle'
  6. Decrement consecutive_failures if success; increment on failure
  7. If consecutive_failures >= 3: self-quarantine for 30 min

On supervisor ping:
  1. Update last_heartbeat
  2. If status='idle' and batch is available: claim and process
```
### 5.3 Aura-gateway prompt-injection format
The supervisor invokes:
```
POST mac1:8095/inject
{
  "machine": "{worker.machine}",
  "tmux_target": "{worker.pane}",
  "prompt": "[SOOP-2 worker {worker_id}] Claim and process the next batch. Read [home-path] for available work. See [home-path] for the canonical processor."
}
```
`aura-gateway` already routes to the right mesh node via SSH + tmux send-keys (per `meshcontrol-cross-machine-shipped-2026-05-06.md`).
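As an illustration, the request body can be assembled like this. The `http://` scheme, the `build_inject_request` name, and the two path parameters are assumptions; the redacted `[home-path]` values are passed in rather than guessed. The function only builds the request and leaves sending to the caller.

```python
import json
import urllib.request


def build_inject_request(worker: dict, queue_path: str, processor_path: str,
                         base_url: str = "http://mac1:8095") -> urllib.request.Request:
    """Build (but do not send) the POST /inject request for one worker."""
    worker_id = f"{worker['machine']}-{worker['pane']}"
    body = {
        "machine": worker["machine"],
        "tmux_target": worker["pane"],
        "prompt": (
            f"[SOOP-2 worker {worker_id}] Claim and process the next batch. "
            f"Read {queue_path} for available work. "
            f"See {processor_path} for the canonical processor."
        ),
    }
    return urllib.request.Request(
        f"{base_url}/inject",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending would be `urllib.request.urlopen(req, timeout=5)`; the 5 s timeout is likewise an assumption.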
### 5.4 Backpressure and rate-limit awareness
Workers check `soop2_workers.rate_limited_until` for their own row before each claim. If a worker hits a rate limit:
1. Set its own `rate_limited_until = now() + observed_window`
2. Reject the claim (return to pending)
3. Skip prompts from aura-gateway for the duration
Supervisor also reads SessionStart's rate-limit info (Mohamed already sees rate-limit warnings on session start) and updates worker rows accordingly.
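The pre-claim check and the back-off update can be sketched as two small functions over a worker row. The names `may_claim` and `mark_rate_limited` are hypothetical, and the row is modeled as a plain dict keyed by the `soop2_workers` column names.

```python
from datetime import datetime, timedelta


def may_claim(row: dict, now: datetime) -> bool:
    """True if this worker is allowed to claim a batch right now."""
    if row.get("status") in ("paused", "quarantined"):
        return False
    until = row.get("rate_limited_until")
    return until is None or until <= now


def mark_rate_limited(row: dict, now: datetime, observed_window: timedelta) -> dict:
    """Record a rate limit: the worker skips claims until the window has passed."""
    row["rate_limited_until"] = now + observed_window
    return row
```

For example, a worker that observes a 15-minute limit at 12:00 will refuse claims until 12:15 and then resume without supervisor intervention.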
6. Batch kinds and processing
### 6.1 `kind=type-batch`
Payload: `{skills: [{path, category, input_type, output_type, effects, idempotent, silent_capable, commutes_with}, ...]}`
Worker behavior:
1. For each skill, apply the surgical FRONTMATTER_RE injection pattern
2. Run linter against the modified files
3. On all-clean: status='completed', result includes typed_delta
4. On any failure: status back to pending (if attempts left) with error_log
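The draft names a `FRONTMATTER_RE` injection pattern but does not define it. The sketch below shows one plausible shape, assuming each skill file starts with a YAML frontmatter block delimited by `---` lines. Both the regex and `inject_types` are illustrative guesses, not the pattern the workers actually use.

```python
import re

# Hypothetical pattern: a leading frontmatter block delimited by '---' lines.
FRONTMATTER_RE = re.compile(r"\A---\n(.*?\n)---\n", re.DOTALL)


def inject_types(text: str, fields: dict) -> str:
    """Append missing typed fields to the frontmatter; existing keys are left untouched."""
    m = FRONTMATTER_RE.match(text)
    if m is None:
        raise ValueError("no frontmatter block found")
    existing = {
        line.split(":", 1)[0].strip()
        for line in m.group(1).splitlines()
        if ":" in line
    }
    additions = "".join(f"{k}: {v}\n" for k, v in fields.items() if k not in existing)
    # Insert just before the closing '---' so the body below stays byte-identical.
    return text[: m.end(1)] + additions + text[m.end(1):]
```

Running it twice with the same fields returns the same text, which is the property the retry path in step 4 relies on.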
### 6.2 `kind=silent-skill`
Payload: `{skill_name, body_template}`
Creates a new silent-capable skill (like cortex:watch). Worker writes the SKILL.md with full typed frontmatter and silent return logic.
### 6.3 `kind=tier1-expand`
Payload: `{batch_skills: [...], embedding_target: ...}`
Worker runs `embedding_indexer.py` against the batch, updates the embedding cache, validates retrieval recall.
### 6.4 `kind=feedback-wire`
Payload: `{component: 'reaction_logger' | 'pattern_learner' | 'threshold_calibrator' | 'lse_bridge'}`
Worker patches the SEA production code at `[home-path]` to add the named feedback-loop component. Idempotent (no-op if already wired).
### 6.5 `kind=audit`
Payload: `{}`
Worker runs the full skill audit, generates `skills-typing-audit-YYYY-MM-DD.md` memory file. Only one of these should ever be in flight (priority=1, max 1 per day).
### 6.6 `kind=verifier`
Payload: `{criterion_id: 1..10}`
Worker checks the specific criterion's success signal against current state. Writes to `soop2_state.criteria_passed` if passing. This is how criteria flip from ⏳ to ✅.
7. Verification (Layer 3, independent of workers)
### 7.1 Continuous verifier cron
Separate launchd plist (`com.diomande.soop2-verifier`) runs `verifier.py` every 5 min, offset 150s from supervisor (avoids race conditions).
Verifier behavior:
1. Read scoreboard
2. For each criterion 1-10: run its automated check
3. Compare against current `criteria_passed`; emit `criterion_flipped` event on change
4. Write per-criterion timestamp to `soop2_state.criterion_passed_at_<N>`
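Steps 2 and 3 amount to a diff between fresh check results and the stored `criteria_passed` list. A minimal sketch, in which `run_verifier` and the event dict shape are assumptions; a check that raises is treated as not passing so that one broken check cannot stop the pass.

```python
from typing import Callable, Dict, List


def run_verifier(state: dict, checks: Dict[int, Callable[[dict], bool]]) -> List[dict]:
    """Run every check, update state['criteria_passed'], return flip events."""
    passed = set(state.get("criteria_passed", []))
    events = []
    for cid in sorted(checks):
        try:
            ok = bool(checks[cid](state))
        except Exception:
            ok = False  # a crashing check counts as not passing
        if ok != (cid in passed):
            events.append({"event": "criterion_flipped", "criterion": cid, "passed": ok})
            if ok:
                passed.add(cid)
            else:
                passed.discard(cid)
    state["criteria_passed"] = sorted(passed)
    return events
```

A second run against unchanged state returns no events, so the verifier can run every 5 min without flooding `soop2_log`.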
### 7.2 Per-criterion automated checks
```python
# Helper functions such as tier1_recall_test_passing() are not defined in this document.
CHECKS = {
    1: lambda s: s['typed_count'] / s['total_skills'] >= 0.95,
    # linter_last_elapsed_seconds is nullable until the first linter run
    2: lambda s: s['linter_last_elapsed_seconds'] is not None
                 and s['linter_last_elapsed_seconds'] < 3.0,
    3: lambda s: tier1_recall_test_passing(),     # runs labeled-benchmark
    4: lambda s: tier2_endpoint_is_twin(),        # curl Mac4:8100/health
    5: lambda s: router_config_has_type_weight_knob(),
    6: lambda s: feedback_loop_closure_test_passing(),
    7: lambda s: contrarian_files_exist() and meta_review_has_round2(),
    8: lambda s: s['silent_capable_count'] >= 5,
    9: lambda s: mac3_grep_returns_only_in_archive(),
    10: lambda s: typing_audit_memory_file_exists(),
}
```
Each check is idempotent and fast (<1s).
### 7.3 Closure detection
When all 10 criteria flip to ✅:
1. Supervisor writes `soop2_state.status = 'completed'`
2. Generates final memory file `soop-2-completed-YYYY-MM-DD.md`
3. Pages Mohamed via telegram with completion summary
4. Stops dispatching new work
5. Workers gracefully transition to idle and de-register
8. Escalation (Layer 5)
### 8.1 Trigger thresholds
| Condition | Action |
|---|---|
| `consecutive_no_advance >= 12` (1h) | Log warning |
| `consecutive_no_advance >= 24` (2h) | telegram + SMS |
| Any criterion stalled in 'in_progress' > 24h | Daily summary to Mohamed |
| Worker fail_count >= 5 in 1h | Auto-quarantine that worker |
| All workers quarantined or rate-limited | Page Mohamed within 5 min |
| Linter exit code unstable (flipping) | Page Mohamed |
| Supabase + filesystem disagree after reconciliation | Page Mohamed (data integrity) |
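The first two rows of the table follow directly from the per-cycle counter. A minimal sketch of that mapping, where `update_stall_counter` and the action labels are hypothetical names:

```python
from typing import Tuple

WARN_AT = 12   # 12 cycles x 5 min = 1h
PAGE_AT = 24   # 24 cycles x 5 min = 2h


def update_stall_counter(prev_typed: int, typed: int, counter: int) -> Tuple[int, str]:
    """Return (new_counter, action) for one supervisor cycle."""
    if typed > prev_typed:
        return 0, "advance"   # progress resets the counter
    counter += 1
    if counter >= PAGE_AT:
        return counter, "page"
    if counter >= WARN_AT:
        return counter, "warn"
    return counter, "none"
```

As written, the function reports the current level on every cycle. A supervisor that pages on each "page" result would alert every 5 min, so the caller would presumably act only when the level changes; that debouncing is an assumption beyond what the table states.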
### 8.2 Escalation delivery
Reuses existing `telegram` skill (gateway) and `sms` skill (twilio). Both already shipped per memory.
Message template:
```
SOOP-2 stall alert (cycle {N}):
- typed: {x}/{total} ({pct}%)
- last advance: {hours_ago}h ago
- criteria passed: {n}/10
- stuck on: {criterion_name}
- last error: {err_summary}
- action: {recommend}
Inspect: [home-path]
```
9. Observability (Layer 4)
### 9.1 Grafana dashboard
Cloud-vm already runs Grafana per memory. Add a SOOP-2 dashboard with panels:
- typed_count time-series (line chart)
- criteria_passed (status grid, 10 cells)
- worker_status (color-coded by worker_id)
- queue_depth + claim_rate (gauges)
- consecutive_no_advance (counter with red threshold at 24)
- linter exit_code history (binary chart)
Data source: Supabase Postgres direct query. Refresh every 30s.
### 9.2 Local dashboard fallback
When Grafana is unavailable, the supervisor writes a static `Desktop/soop-2-dashboard.html` every cycle. Syncthing replicates it, so Mohamed can open it from any node.
### 9.3 CLI status command
`soop2 status` (a small bash script in `[home-path]`):
```bash
#!/bin/bash
# Print scoreboard, queue depth by kind/status, and worker heartbeats.
psql "$SOOP2_DB" -c "SELECT typed_count, criteria_passed, status, last_advance_at FROM soop2_state"
echo "--- Queue ---"
psql "$SOOP2_DB" -c "SELECT kind, status, count(*) FROM soop2_queue GROUP BY kind, status"
echo "--- Workers ---"
psql "$SOOP2_DB" -c "SELECT worker_id, status, last_heartbeat_at FROM soop2_workers"
```
The command falls back to the filesystem mirror if Supabase is unavailable.
10. Bootstrap procedure (one-time deploy)
```bash
# 1. Provision Supabase tables (idempotent SQL migrations)
psql "$SOOP2_DB" < [home-path]

# 2. Initialize filesystem mirror
mkdir -p [home-path]

# 3. Install supervisor launchd plist (mac1 primary)
cp [home-path] [home-path]
launchctl load [home-path]

# 4. Install observer plist on mac2
ssh mac2 "cp [home-path] [home-path] && launchctl load [home-path]"

# 5. Seed initial state from current scoreboard
python3 [home-path] [home-path]

# 6. Register workers on each mesh node
ssh mac1 "python3 [home-path] --pane=claude-cli"
ssh mac4 "python3 [home-path] --pane=codex"
# (mac2/mac5/cloud-vm as available; mac3 stays out per SOOP-2 directive)

# 7. Verify
[home-path] status

# 8. Watch first cycle
tail -f /tmp/soop2-supervisor.log
```
After bootstrap, the protocol runs forever (or until `soop2_state.status='completed'`). Closing Claude Code on mac1 does NOT stop it. Killing mac1 does NOT stop it. The only ways to stop it are to disable both launchd plists or to set `status='paused'` in the DB.
11. Migration from current single-Claude loop
The ScheduleWakeup-based loop active right now is single-driver. ELP is multi-driver. Migration:
1. Phase A (today's single loop continues): the ScheduleWakeup wakes type their batches as already planned.
2. Phase B (bootstrap ELP): run the bootstrap procedure above. Initial seed comes from current `soop-2-launch-2026-05-12.md`.
3. Phase C (cutover): the next ScheduleWakeup fire detects ELP is live (file exists at `[home-path]`), registers self as worker `mac1-claude-cli`, and from then on operates as one of many workers rather than the sole driver.
4. Phase D (post-completion): when ELP reports `status=completed`, ScheduleWakeup chain cancels itself.
The migration is non-destructive. ELP can run alongside the single loop; first one to advance a batch claims it; the other's attempt becomes a no-op.
12. Anti-patterns this protocol forbids
1. Any worker that doesn't write a heartbeat (cannot be supervised)
2. Any state write that isn't atomic + journaled (loses progress on crash)
3. Any criterion that doesn't have an automated verifier (cannot be measured)
4. Hardcoded mac3 references (deprecated per SOOP-2 Track H)
5. Workers that bypass the queue (creates race conditions with claim semantics)
6. Supervisor that runs inside Claude Code (re-creates C2 caveat)
7. Escalation thresholds without a clear remediation step (alarm fatigue)
8. Filesystem-only state that is never written or reconciled to Supabase (loses durability)
9. Workers without rate-limit awareness (hammers the cap)
10. Any change to ELP state without a `soop2_log` event (audit hole)
13. Open questions for v1.1
1. Quorum vs single-supervisor: ELP-1 uses primary+observer with heartbeat-as-lock. For v1.1, consider Raft-style quorum across 3+ supervisors. Tradeoff: simpler now (heartbeat) vs more resilient later (quorum).
2. Worker capability declarations: should each worker advertise what `kind` of batches it can process? (e.g., mac1 worker can do tier1-expand because it has the embedding cache; cloud-vm worker can do typing). Probably yes, in v1.1.
3. Time-of-day awareness: Mohamed's wake hours vs sleep hours — should the protocol be more aggressive at night when no human interruption is welcome? Probably no; the work is async anyway.
4. Pull-vs-push: ELP-1 uses push (supervisor injects prompts). Pull (workers poll) is more resilient but spammier. Stay with push for v1.
5. Test mode: before going live, ELP needs a dry-run mode that processes synthetic batches without modifying real files. Add `--dry-run` flag everywhere.
14. Acceptance criteria for ELP itself
When all of these are true, ELP is operationally complete:
1. `[home-path]` exists and runs as launchd on mac1
2. `[home-path]` exists and is callable from any mesh node
3. Supabase schema is applied; all 4 tables exist with at least 1 test row
4. `soop2_state.status == 'active'` and `supervisor.heartbeat` is fresh (< 10 min old)
5. At least 1 worker is registered with a fresh heartbeat
6. Closing Claude Code on mac1 for 30 min does NOT cause `consecutive_no_advance` to increment
7. Manually quarantining a worker (set status='quarantined' in DB) does NOT stop other workers from progressing
8. Killing the Supabase connection for 10 min causes filesystem fallback; on heal, state reconciles correctly
9. A simulated stall (no advance for 2h) triggers a telegram message to Mohamed
10. When all 10 SOOP-2 criteria flip to ✅, ELP automatically writes the completion memory file and stops dispatching
15. Why this is "everlasting"
The protocol survives:
- Any single machine going offline (mesh redistributes)
- Claude Code being closed (supervisor is outside)
- Any single worker crashing (claim TTL releases work)
- Supabase being down (filesystem fallback)
- Tailscale partition (eventual consistency)
- Mohamed being offline (queues escalations, continues work)
- Bad code in a SOOP-2 batch (quarantine, alert, continue)
- The model being slow on a node (rate-limit aware redirect)
- A criterion definition being wrong (verifier runs forever, just doesn't flip)
The only ways to permanently stop ELP:
1. Mohamed sets `soop2_state.status = 'paused'`
2. All mesh nodes are simultaneously offline for > 30 days (Syncthing pruning)
3. Both supervisor plists are disabled and not re-enabled
Even then, the state in Supabase and the filesystem mirror is durable, so restarting is trivial.
---
End of ELP-1 v1 draft. Authority gate: Mohamed Diomande review for v1.1 sign-off.
Next session: bootstrap procedure §10 ships ELP to production.
Source Anchor
crucible-output/soop-2/05-everlasting-loop-protocol.md