Meta-Recursive Failure Pattern Audit
**Date:** 2026-03-14 **Auditor:** Meta-Recursive Explorer (Opus 4.6) **Data Sources:** Supabase mesh_events (23,124 total), failure_journal.jsonl (5,678 entries), cortex/entries.jsonl (679 entries), session-crons/results (11 ring buffers), failure-museum.md (10 exhibits, 26 gems), skills/registry.json (88 skills)
Full Public Reader
Meta-Recursive Failure Pattern Audit
Date: 2026-03-14
Auditor: Meta-Recursive Explorer (Opus 4.6)
Data Sources: Supabase mesh_events (23,124 total), failure_journal.jsonl (5,678 entries), cortex/entries.jsonl (679 entries), session-crons/results (11 ring buffers), failure-museum.md (10 exhibits, 26 gems), skills/registry.json (88 skills)
---
Executive Summary
The mesh failure detection system has a critical blind spot: **100
---
PATTERN 1: The Blind Classifier [ACTIVE, CRITICAL]
What: `failure-intel/handler.py` classifies 100
Evidence:
- `[home]/.claude/state/failure_journal.jsonl`: 5,678 entries, all class="unknown"
- 99.9
- The non-empty ones contain "Shell cwd was reset" messages, not actual errors
Root Cause: The handler extracts error text from `tool_response.stderr` + `tool_response.error` (line 178 of `[home]/.claude/hooks/failure-intel/handler.py`). But Claude Code's `PostToolUseFailure` hook input does not always populate these fields. The actual error content is likely in `tool_response.stdout`, `tool_response.content`, or the raw `tool_response` string itself. The handler concatenates two empty/missing fields and classifies the empty string, which matches nothing.
Impact:
- 4 recovery registry entries (ssh_timeout, xcode_build_failure, disk_full, import_error) have never fired
- All escalation hints default to the generic message
- `failure_pattern` events flood mesh_events (1,000+ events since Mar 7), all with `failure_class: unknown`
Recommended Fix:
# In handler.py, expand error extraction (line 175-180):
tool_response = data.get("tool_response", {})
error_text = ""
if isinstance(tool_response, dict):
error_text = " ".join(filter(None, [
str(tool_response.get("stderr", "")),
str(tool_response.get("error", "")),
str(tool_response.get("stdout", "")),
str(tool_response.get("content", "")),
]))
elif isinstance(tool_response, str):
error_text = tool_responseCLAUDE.md Gotcha Candidate:
> failure-intel handler: PostToolUseFailure hook input may put error content in `stdout`, `content`, or raw string, not just `stderr`/`error`. Always extract from all fields.
---
PATTERN 2: The Event Storm [ACTIVE, HIGH]
What: `failure_pattern` events dominate mesh_events at 55
Evidence:
- Last 1,000 mesh_events: 550 are `failure_pattern`, 402 are `health_summary`, only 48 are operational
- Peak rate: 173 failure_pattern events in a single hour (2026-03-11T11:00)
- Single session `3894a6d2` generated 32+ failure_pattern events in under 2 minutes
- Events have no deduplication: count increments 22, 23, 24, 25... each as a separate row
Root Cause: The handler fires a mesh event for every failure after the 3rd in a 10-minute window, not just the 3rd. With sessions generating dozens of failures, this creates an escalating storm of near-identical events.
Impact:
- mesh_events table growing fast (23,124 rows since Mar 7, ~8 days)
- Signal-to-noise ratio is terrible: operational events (spawn, evo3, parity) are buried
- health_summary events (the other 40
Recommended Fix:
# In handler.py, add dedup/rate-limit after line 211:
if count >= ESCALATION_THRESHOLD:
# Only emit event at specific thresholds, not every increment
if count not in (3, 10, 25, 50, 100):
sys.exit(0) # Skip intermediate counts---
PATTERN 3: The Zombie Skill Registry [ACTIVE, MEDIUM]
What: 88 skills registered, 0 invocations, 355 decay flags
Evidence:
- `[home]/.claude/skills/registry.json`: 88 skills, all with `invocations: 0`
- `[home]/.claude/cortex/entries.jsonl`: 355 decay_flag entries, 70 at `days_inactive: 999` (never used)
- All skills were bulk-forged on 2026-01-26, registered on 2026-03-07
- Decay detector has been flagging them since Mar 7, but no action taken
- The 10 ops-domain skills (ops:ios, ops:deploy, ops:supabase, etc.) forged Mar 7 also show 0 invocations
Root Cause: Skills were created speculatively during an exploration phase (the "20-Session Hubris" from Exhibit #7). The cortex decay system correctly identifies them but has no auto-disable or cleanup mechanism for the 60-day threshold.
Impact:
- Entries.jsonl is 150KB and growing (mostly decay_flag noise)
- Registry is cluttered, making any skill-based routing less effective
- Cortex correction detection has logged 0 corrections, suggesting the detector may also be misconfigured
Failure Museum Parallel: This is a recurrence of "Generation Explosion" + "20-Session Hubris". The gem "Audit Before Expand" was never applied to the skills system.
Recommended Action:
- Archive all 88 skills with 0 invocations and >60 days inactive
- Or at minimum, move them to a `disabled` status in the registry
- The 10 ops skills (Mar 7) should be evaluated: if they were intended for the ops_trigger router, verify the router actually invokes them
---
PATTERN 4: The Correlation Engine False Alarm [ACTIVE, LOW]
What: correlation-engine cron has 100
Evidence:
- `[home]/.claude/session-crons/results/correlation-engine.jsonl`: 100 entries, all `status: error`
- Error output: `[cortex-correlator] 142 results, 38 failures, no correlated clusters found`
- Exit code is 0 (success), but status is marked `error`
Root Cause: The cron reporter marks any output containing "failures" or "no correlated clusters" as an error, even though this is expected informational output. The correlator is doing its job (scanning 142 results, finding 38 failures, but no clusters), and that is a valid result, not an error.
Impact:
- Inflates the cron failure count artificially
- If the cron pattern extractor runs, it will incorrectly label correlation-engine as unhealthy
- Makes real cron failures harder to spot
Recommended Fix: Either change the cron status determination logic to check exit_code rather than output text, or have the correlator output "ok" when no action is needed.
---
PATTERN 5: Health Summary Ghost Fleet [ACTIVE, LOW]
What: health_summary reports 3 nodes active, 0 healthy
Evidence:
- Last 3 health_summary payloads: `{"total": 3, "active": 3, "healthy": 0, "exhausted": 0, "offline": 0}`
- 3 active nodes but none classified as healthy
Root Cause: The health check criteria may be too strict, or the nodes are active but failing their health checks. With 0 healthy out of 3 active, the mesh is technically running in degraded mode but nobody is alerting on it because the event is informational.
Recommended Action: Investigate what the health criteria are. If all 3 nodes are serving requests successfully but failing a check (e.g., memory threshold, staleness counter), the threshold may need adjustment.
---
PATTERN 6: Cortex Cross-Machine Sync Stall [ACTIVE, LOW]
What: Cross-machine cortex sync has been stalled since Mar 9
Evidence:
- `[home]/.claude/cortex/last_sync.json`: `last_sync: 2026-03-09T23:04:45`, 5 days stale
- Sync result: `pulled: 0, propagated: 0`
- Propagation state shows 102 projects but 0 rules propagated
Impact:
- Any corrections or skill updates on other machines (Mac2, Mac3, Mac4, Mac5) are not reaching Mac1
- The cortex system was designed for cross-machine learning, but it has been effectively single-machine for 5 days
---
Resolved Patterns (Gems That Worked)
| Pattern | Resolution | Gem Applied |
|---|---|---|
| Coordinate Labyrinth (Exhibit #10) | React `__reactEventHandlers` approach now standard for ASC | "Use API Internals, Not Physical Clicks" |
| Phantom Dispatch (Exhibit #9) | Spawn governor and completion detection hooks added | "Null State Is Not Error-Free" |
| Recursive Curation Loop | Museum in maintenance mode, no meta-curation | "Recursive Systems Need Termination" |
| Disk Full | Docker prune in recovery registry, cleanup as first-class task | "Physical Limits Are Features" |
| SSH Timeout | Auto-recovery in registry: `gcloud compute instances reset` | Codified in recovery-registry.json |
| TCA Macro Trust | `-skipMacroValidation` flag now standard | Codified in CLAUDE.md gotchas |
| Tunnel Health | tunnels ok (100 |
---
New Patterns Discovered
NP-1: The Self-Referential Failure Loop
The failure detection system is itself generating failures. When this audit runs Bash commands that produce non-zero exit codes (e.g., Python JSON parse errors from Supabase queries), the failure-intel handler fires, logs them as "unknown", hits the 3-event threshold, and broadcasts failure_pattern events to the mesh. This audit literally created new failure_pattern events in Supabase by the act of investigating failure patterns. This is a meta-recursive failure: the observer perturbs the system.
Recommended CLAUDE.md Gotcha:
> failure-intel self-reference: The PostToolUseFailure hook fires on ALL tool failures including investigative/diagnostic commands. This means debugging sessions inflate the failure journal. Consider filtering out sessions that are explicitly running diagnostics.
NP-2: The Invocation Void
324 `invocation_record` entries exist in cortex, but the skills registry shows 0 invocations for all 88 skills. The invocation records were batch-forged (all 10 ops skills registered at the exact same timestamp `2026-03-07T05:30:47`), not actual invocations. The cortex thinks skills are being used, but none have ever been invoked by a real user session.
NP-3: Failure Journal Unbounded Growth
The failure journal at 2.8MB and 5,678 entries has no rotation or truncation. At the current rate (~700 entries/day), it will hit 10MB within 2 weeks. The `load_recent_failures()` function reads the entire file on every PostToolUseFailure event, which means every failure triggers a full 2.8MB file read.
Recommended Fix: Add ring buffer rotation (keep last 500 entries) or switch to a time-windowed file per day.
---
Recommended Fixes Summary
| Priority | Fix | File | Effort |
|---|---|---|---|
| P0 | Expand error field extraction in failure-intel handler | `[home-path]` | 10 min |
| P0 | Add event deduplication/rate-limit for failure_pattern events | `[home-path]` | 10 min |
| P1 | Add failure journal rotation (ring buffer, 500 entries) | `[home-path]` | 15 min |
| P1 | Fix correlation-engine status logic (exit_code 0 = ok) | Session cron reporter | 10 min |
| P2 | Archive 88 zero-invocation skills with >60d inactive | `[home-path]` | 5 min |
| P2 | Restart cross-machine cortex sync | `[home-path]` | 10 min |
| P2 | Investigate health_summary 0-healthy criteria | Health check code | 15 min |
| P3 | Add diagnostic session filter to failure-intel | `[home-path]` | 10 min |
---
Exploration Tree
Root: "Meta-recursive failure pattern audit" (depth: 4, sources: 7)
|
+-- Branch 1: failure-museum.md (score: HIGH, 10 exhibits, 26 gems)
| +-- Sub: Active gems (8/26 reused, 18 dormant/new)
| +-- Sub: Failure signatures (11 known patterns)
|
+-- Branch 2: Supabase mesh_events (score: HIGH, 23,124 events)
| +-- Sub: event type distribution (12 distinct types)
| | +-- Sub: failure_pattern dominance (55% of traffic)
| | +-- Sub: health_summary (40% of traffic)
| | +-- Sub: operational events (5% of traffic)
| +-- Sub: failure_pattern deep-dive
| | +-- Sub: all class=unknown (100%)
| | +-- Sub: single session storms (32 events in 2 min)
| | +-- Sub: peak rate analysis (173/hour on Mar 11)
| +-- Sub: health_summary payload (3 active, 0 healthy)
|
+-- Branch 3: failure_journal.jsonl (score: CRITICAL, 5,678 entries)
| +-- Sub: class distribution (100% unknown)
| +-- Sub: empty error_snippet (99.9%)
| +-- Sub: tool distribution (93.8% Bash)
| +-- Sub: handler.py code analysis (extraction bug found)
|
+-- Branch 4: cortex/ (score: MEDIUM, 679 entries)
| +-- Sub: decay_flag entries (355, all >30 days)
| +-- Sub: invocation_records (324, all batch-forged)
| +-- Sub: correction entries (0)
| +-- Sub: cross-machine sync stall (since Mar 9)
| +-- Sub: skills registry (88 skills, 0 invocations)
|
+-- Branch 5: session-crons/results (score: MEDIUM)
| +-- Sub: correlation-engine (100% error, false alarm)
| +-- Sub: tunnel-health (100% ok)
| +-- Sub: all other crons (0% error rate)
|
+-- Branch 6: recovery-registry.json (score: LOW)
+-- Sub: 4 entries registered (ssh, xcode, disk, import)
+-- Sub: 0 ever fired (because classifier never matches)---
Confidence
HIGH -- This audit reached full convergence across all 7 data sources. The primary finding (blind classifier) is deterministic, verified by examining 100
---
Related Questions
- What is the actual PostToolUseFailure hook input schema, and which fields carry error content?
- Should the failure_pattern event storm be cleaned from mesh_events (bulk delete of uninformative rows)?
- Is the cortex correction detector working at all, given 0 correction entries in 8 days?
- What defines "healthy" in the health_summary check, and why are 3/3 nodes unhealthy?
---
Generated by Meta-Recursive Explorer on 2026-03-14. All file paths are absolute and verified.
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
mesh-dashboard/system-review.md
Detected Structure
Evaluation · Figures · Code Anchors