Grand Diomande Research · Full HTML Reader

Meta-Recursive Failure Pattern Audit

**Date:** 2026-03-14 **Auditor:** Meta-Recursive Explorer (Opus 4.6) **Data Sources:** Supabase mesh_events (23,124 total), failure_journal.jsonl (5,678 entries), cortex/entries.jsonl (679 entries), session-crons/results (11 ring buffers), failure-museum.md (10 exhibits, 26 gems), skills/registry.json (88 skills)

Agents That Account for Themselves proposal experiment writeup candidate score 28 .md

Full Public Reader

Meta-Recursive Failure Pattern Audit

Date: 2026-03-14
Auditor: Meta-Recursive Explorer (Opus 4.6)
Data Sources: Supabase mesh_events (23,124 total), failure_journal.jsonl (5,678 entries), cortex/entries.jsonl (679 entries), session-crons/results (11 ring buffers), failure-museum.md (10 exhibits, 26 gems), skills/registry.json (88 skills)

---

Executive Summary

The mesh failure detection system has a critical blind spot: **100

---

PATTERN 1: The Blind Classifier [ACTIVE, CRITICAL]

What: `failure-intel/handler.py` classifies 100

Evidence:
- `[home]/.claude/state/failure_journal.jsonl`: 5,678 entries, all class="unknown"
- 99.9
- The non-empty ones contain "Shell cwd was reset" messages, not actual errors

Root Cause: The handler extracts error text from `tool_response.stderr` + `tool_response.error` (line 178 of `[home]/.claude/hooks/failure-intel/handler.py`). But Claude Code's `PostToolUseFailure` hook input does not always populate these fields. The actual error content is likely in `tool_response.stdout`, `tool_response.content`, or the raw `tool_response` string itself. The handler concatenates two empty/missing fields and classifies the empty string, which matches nothing.

Impact:
- 4 recovery registry entries (ssh_timeout, xcode_build_failure, disk_full, import_error) have never fired
- All escalation hints default to the generic message
- `failure_pattern` events flood mesh_events (1,000+ events since Mar 7), all with `failure_class: unknown`

Recommended Fix:

python
# In handler.py, expand error extraction (line 175-180):
tool_response = data.get("tool_response", {})
error_text = ""
if isinstance(tool_response, dict):
    error_text = " ".join(filter(None, [
        str(tool_response.get("stderr", "")),
        str(tool_response.get("error", "")),
        str(tool_response.get("stdout", "")),
        str(tool_response.get("content", "")),
    ]))
elif isinstance(tool_response, str):
    error_text = tool_response

CLAUDE.md Gotcha Candidate:
> failure-intel handler: PostToolUseFailure hook input may put error content in `stdout`, `content`, or raw string, not just `stderr`/`error`. Always extract from all fields.

---

PATTERN 2: The Event Storm [ACTIVE, HIGH]

What: `failure_pattern` events dominate mesh_events at 55

Evidence:
- Last 1,000 mesh_events: 550 are `failure_pattern`, 402 are `health_summary`, only 48 are operational
- Peak rate: 173 failure_pattern events in a single hour (2026-03-11T11:00)
- Single session `3894a6d2` generated 32+ failure_pattern events in under 2 minutes
- Events have no deduplication: count increments 22, 23, 24, 25... each as a separate row

Root Cause: The handler fires a mesh event for every failure after the 3rd in a 10-minute window, not just the 3rd. With sessions generating dozens of failures, this creates an escalating storm of near-identical events.

Impact:
- mesh_events table growing fast (23,124 rows since Mar 7, ~8 days)
- Signal-to-noise ratio is terrible: operational events (spawn, evo3, parity) are buried
- health_summary events (the other 40

Recommended Fix:

python
# In handler.py, add dedup/rate-limit after line 211:
if count >= ESCALATION_THRESHOLD:
    # Only emit event at specific thresholds, not every increment
    if count not in (3, 10, 25, 50, 100):
        sys.exit(0)  # Skip intermediate counts

---

PATTERN 3: The Zombie Skill Registry [ACTIVE, MEDIUM]

What: 88 skills registered, 0 invocations, 355 decay flags

Evidence:
- `[home]/.claude/skills/registry.json`: 88 skills, all with `invocations: 0`
- `[home]/.claude/cortex/entries.jsonl`: 355 decay_flag entries, 70 at `days_inactive: 999` (never used)
- All skills were bulk-forged on 2026-01-26, registered on 2026-03-07
- Decay detector has been flagging them since Mar 7, but no action taken
- The 10 ops-domain skills (ops:ios, ops:deploy, ops:supabase, etc.) forged Mar 7 also show 0 invocations

Root Cause: Skills were created speculatively during an exploration phase (the "20-Session Hubris" from Exhibit #7). The cortex decay system correctly identifies them but has no auto-disable or cleanup mechanism for the 60-day threshold.

Impact:
- Entries.jsonl is 150KB and growing (mostly decay_flag noise)
- Registry is cluttered, making any skill-based routing less effective
- Cortex correction detection has logged 0 corrections, suggesting the detector may also be misconfigured

Failure Museum Parallel: This is a recurrence of "Generation Explosion" + "20-Session Hubris". The gem "Audit Before Expand" was never applied to the skills system.

Recommended Action:
- Archive all 88 skills with 0 invocations and >60 days inactive
- Or at minimum, move them to a `disabled` status in the registry
- The 10 ops skills (Mar 7) should be evaluated: if they were intended for the ops_trigger router, verify the router actually invokes them

---

PATTERN 4: The Correlation Engine False Alarm [ACTIVE, LOW]

What: correlation-engine cron has 100

Evidence:
- `[home]/.claude/session-crons/results/correlation-engine.jsonl`: 100 entries, all `status: error`
- Error output: `[cortex-correlator] 142 results, 38 failures, no correlated clusters found`
- Exit code is 0 (success), but status is marked `error`

Root Cause: The cron reporter marks any output containing "failures" or "no correlated clusters" as an error, even though this is expected informational output. The correlator is doing its job (scanning 142 results, finding 38 failures, but no clusters), and that is a valid result, not an error.

Impact:
- Inflates the cron failure count artificially
- If the cron pattern extractor runs, it will incorrectly label correlation-engine as unhealthy
- Makes real cron failures harder to spot

Recommended Fix: Either change the cron status determination logic to check exit_code rather than output text, or have the correlator output "ok" when no action is needed.

---

PATTERN 5: Health Summary Ghost Fleet [ACTIVE, LOW]

What: health_summary reports 3 nodes active, 0 healthy

Evidence:
- Last 3 health_summary payloads: `{"total": 3, "active": 3, "healthy": 0, "exhausted": 0, "offline": 0}`
- 3 active nodes but none classified as healthy

Root Cause: The health check criteria may be too strict, or the nodes are active but failing their health checks. With 0 healthy out of 3 active, the mesh is technically running in degraded mode but nobody is alerting on it because the event is informational.

Recommended Action: Investigate what the health criteria are. If all 3 nodes are serving requests successfully but failing a check (e.g., memory threshold, staleness counter), the threshold may need adjustment.

---

PATTERN 6: Cortex Cross-Machine Sync Stall [ACTIVE, LOW]

What: Cross-machine cortex sync has been stalled since Mar 9

Evidence:
- `[home]/.claude/cortex/last_sync.json`: `last_sync: 2026-03-09T23:04:45`, 5 days stale
- Sync result: `pulled: 0, propagated: 0`
- Propagation state shows 102 projects but 0 rules propagated

Impact:
- Any corrections or skill updates on other machines (Mac2, Mac3, Mac4, Mac5) are not reaching Mac1
- The cortex system was designed for cross-machine learning, but it has been effectively single-machine for 5 days

---

Resolved Patterns (Gems That Worked)

PatternResolutionGem Applied
Coordinate Labyrinth (Exhibit #10)React `__reactEventHandlers` approach now standard for ASC"Use API Internals, Not Physical Clicks"
Phantom Dispatch (Exhibit #9)Spawn governor and completion detection hooks added"Null State Is Not Error-Free"
Recursive Curation LoopMuseum in maintenance mode, no meta-curation"Recursive Systems Need Termination"
Disk FullDocker prune in recovery registry, cleanup as first-class task"Physical Limits Are Features"
SSH TimeoutAuto-recovery in registry: `gcloud compute instances reset`Codified in recovery-registry.json
TCA Macro Trust`-skipMacroValidation` flag now standardCodified in CLAUDE.md gotchas
Tunnel Healthtunnels ok (100

---

New Patterns Discovered

NP-1: The Self-Referential Failure Loop

The failure detection system is itself generating failures. When this audit runs Bash commands that produce non-zero exit codes (e.g., Python JSON parse errors from Supabase queries), the failure-intel handler fires, logs them as "unknown", hits the 3-event threshold, and broadcasts failure_pattern events to the mesh. This audit literally created new failure_pattern events in Supabase by the act of investigating failure patterns. This is a meta-recursive failure: the observer perturbs the system.

Recommended CLAUDE.md Gotcha:
> failure-intel self-reference: The PostToolUseFailure hook fires on ALL tool failures including investigative/diagnostic commands. This means debugging sessions inflate the failure journal. Consider filtering out sessions that are explicitly running diagnostics.

NP-2: The Invocation Void

324 `invocation_record` entries exist in cortex, but the skills registry shows 0 invocations for all 88 skills. The invocation records were batch-forged (all 10 ops skills registered at the exact same timestamp `2026-03-07T05:30:47`), not actual invocations. The cortex thinks skills are being used, but none have ever been invoked by a real user session.

NP-3: Failure Journal Unbounded Growth

The failure journal at 2.8MB and 5,678 entries has no rotation or truncation. At the current rate (~700 entries/day), it will hit 10MB within 2 weeks. The `load_recent_failures()` function reads the entire file on every PostToolUseFailure event, which means every failure triggers a full 2.8MB file read.

Recommended Fix: Add ring buffer rotation (keep last 500 entries) or switch to a time-windowed file per day.

---

Recommended Fixes Summary

PriorityFixFileEffort
P0Expand error field extraction in failure-intel handler`[home-path]`10 min
P0Add event deduplication/rate-limit for failure_pattern events`[home-path]`10 min
P1Add failure journal rotation (ring buffer, 500 entries)`[home-path]`15 min
P1Fix correlation-engine status logic (exit_code 0 = ok)Session cron reporter10 min
P2Archive 88 zero-invocation skills with >60d inactive`[home-path]`5 min
P2Restart cross-machine cortex sync`[home-path]`10 min
P2Investigate health_summary 0-healthy criteriaHealth check code15 min
P3Add diagnostic session filter to failure-intel`[home-path]`10 min

---

Exploration Tree

Root: "Meta-recursive failure pattern audit" (depth: 4, sources: 7)
  |
  +-- Branch 1: failure-museum.md (score: HIGH, 10 exhibits, 26 gems)
  |     +-- Sub: Active gems (8/26 reused, 18 dormant/new)
  |     +-- Sub: Failure signatures (11 known patterns)
  |
  +-- Branch 2: Supabase mesh_events (score: HIGH, 23,124 events)
  |     +-- Sub: event type distribution (12 distinct types)
  |     |     +-- Sub: failure_pattern dominance (55% of traffic)
  |     |     +-- Sub: health_summary (40% of traffic)
  |     |     +-- Sub: operational events (5% of traffic)
  |     +-- Sub: failure_pattern deep-dive
  |     |     +-- Sub: all class=unknown (100%)
  |     |     +-- Sub: single session storms (32 events in 2 min)
  |     |     +-- Sub: peak rate analysis (173/hour on Mar 11)
  |     +-- Sub: health_summary payload (3 active, 0 healthy)
  |
  +-- Branch 3: failure_journal.jsonl (score: CRITICAL, 5,678 entries)
  |     +-- Sub: class distribution (100% unknown)
  |     +-- Sub: empty error_snippet (99.9%)
  |     +-- Sub: tool distribution (93.8% Bash)
  |     +-- Sub: handler.py code analysis (extraction bug found)
  |
  +-- Branch 4: cortex/ (score: MEDIUM, 679 entries)
  |     +-- Sub: decay_flag entries (355, all >30 days)
  |     +-- Sub: invocation_records (324, all batch-forged)
  |     +-- Sub: correction entries (0)
  |     +-- Sub: cross-machine sync stall (since Mar 9)
  |     +-- Sub: skills registry (88 skills, 0 invocations)
  |
  +-- Branch 5: session-crons/results (score: MEDIUM)
  |     +-- Sub: correlation-engine (100% error, false alarm)
  |     +-- Sub: tunnel-health (100% ok)
  |     +-- Sub: all other crons (0% error rate)
  |
  +-- Branch 6: recovery-registry.json (score: LOW)
        +-- Sub: 4 entries registered (ssh, xcode, disk, import)
        +-- Sub: 0 ever fired (because classifier never matches)

---

Confidence

HIGH -- This audit reached full convergence across all 7 data sources. The primary finding (blind classifier) is deterministic, verified by examining 100

---

Related Questions

  • What is the actual PostToolUseFailure hook input schema, and which fields carry error content?
  • Should the failure_pattern event storm be cleaned from mesh_events (bulk delete of uninformative rows)?
  • Is the cortex correction detector working at all, given 0 correction entries in 8 days?
  • What defines "healthy" in the health_summary check, and why are 3/3 nodes unhealthy?

---

Generated by Meta-Recursive Explorer on 2026-03-14. All file paths are absolute and verified.

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

mesh-dashboard/system-review.md

Detected Structure

Evaluation · Figures · Code Anchors