Grand Diomande Research · Full HTML Reader

Performance Prophet

ML-powered system that predicts slowdowns before users notice. Uses statistical models and anomaly detection to forecast performance degradation and alert proactively.

Concept

Traditional monitoring alerts when thresholds are crossed. By then, users already feel the pain. Performance Prophet flips this: it learns your system's patterns and alerts before degradation becomes noticeable.

Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Collectors │ ──▶ │  Predictor  │ ──▶ │   Alerter   │
└─────────────┘     └─────────────┘     └─────────────┘
     │                    │                    │
     ▼                    ▼                    ▼
  Metrics DB         Model State          Notifications

Features

  • Triple Exponential Smoothing — Captures level, trend, and seasonality
  • Anomaly Detection — IQR-based + Z-score for outliers
  • Trend Extrapolation — Predicts where metrics are heading
  • Pattern Memory — Learns daily/weekly cycles
  • Proactive Alerts — Warning before threshold breach
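
As an illustration of the anomaly-detection feature, the sketch below combines an IQR fence with a z-score check using only the standard library. The function name `find_outliers`, the cutoffs, and the sample data are assumptions made for this example, not Prophet's actual implementation.

```python
import statistics

def find_outliers(values, z_cutoff=3.0, iqr_factor=1.5):
    """Return indices of points flagged by the IQR fence or the z-score rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    flagged = []
    for i, v in enumerate(values):
        outside_fence = v < low or v > high
        z = abs(v - mean) / stdev if stdev > 0 else 0.0
        if outside_fence or z > z_cutoff:
            flagged.append(i)
    return flagged

# Hypothetical response times in milliseconds; only the 480 ms sample is unusual.
latencies_ms = [120, 118, 125, 122, 119, 121, 480, 123, 117, 124]
print(find_outliers(latencies_ms))  # → [6]
```

Using both rules is a deliberate choice in this sketch: a single large outlier inflates the standard deviation and can hide itself from a pure z-score test, while the IQR fence stays robust.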

Usage

bash
# Start collector daemon
prophet collect --interval 60

# Show current predictions
prophet predict

# Analyze specific metric
prophet analyze response_time --hours 24

# Show predicted incidents
prophet forecast --horizon 1h

# Interactive dashboard
prophet watch

# Pattern Learning (Gen 6+)
prophet patterns               # Show learned patterns & stats
prophet similar cpu_percent    # Find similar past incidents
prophet cascade memory_percent # Predict cascade effects
prophet link db_connections response_time 2.5 0.85  # Record causal link
prophet incident response_time --severity warning --cause "DB pool exhausted"

# Automatic Causality Discovery (Gen 6 Evolution)
prophet causality discover               # Auto-discover all causal relationships
prophet causality discover --metric cpu  # Discover relationships for specific metric
prophet causality explain --metric response_time  # Explain what causes/affects a metric
prophet causality graph                  # Generate Mermaid diagram of causal graph
prophet causality summary                # Summary of discovered relationships

# Causal Analysis
prophet causal-explain response_time     # What causes response_time? What does it affect?
prophet causal-roots error_rate          # Trace back to root causes
prophet causal-test memory_percent response_time  # Test if one metric causes another

Metrics Tracked

| Metric | Source | Warning Sign |
| --- | --- | --- |
| Response Time | API logs | Upward trend |
| Memory Usage | System | > 80% |
| Queue Depth | Workers | Growing backlog |
| Error Rate | Logs | Spike detection |
| CPU Usage | System | Sustained climb |
| Disk I/O | System | Saturation approach |

Prediction Models

### 1. Holt-Winters (Triple Exponential Smoothing)
Best for metrics with seasonality (traffic patterns, daily cycles).

### 2. ARIMA-lite
Moving average with autoregression for trend detection.

### 3. Isolation Forest
Anomaly detection without requiring labeled data.

### 4. Threshold Projection
Simple but effective: if current trend continues, when does it hit the threshold?
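
Threshold projection can be sketched as a least-squares line fitted to recent samples and extrapolated forward. The helper name `minutes_until_breach` and its parameters are hypothetical; this shows the idea, not the code in `prophet.py`.

```python
def minutes_until_breach(samples, threshold, interval_min=1.0):
    """Fit a least-squares line to evenly spaced samples and extrapolate.

    Returns the minutes until the fitted line reaches `threshold`,
    0.0 if the latest fitted value is already at or above it,
    or None if there is too little data or the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    xs = [i * interval_min for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    slope = sxy / sxx
    fitted_now = mean_y + slope * (xs[-1] - mean_x)
    if fitted_now >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - fitted_now) / slope

# Response time rising 10 ms per minute toward a 500 ms threshold.
print(minutes_until_breach([400, 410, 420, 430, 440], threshold=500))  # → 6.0
```

A straight-line fit is the simplest possible choice; a metric with strong daily cycles would need the seasonal model described above instead.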

Alert Levels

| Level | Meaning | Horizon |
| --- | --- | --- |
| 🟢 HEALTHY | All metrics normal | |
| 🟡 WATCH | Unusual pattern detected | |
| 🟠 WARNING | Threshold breach predicted | < 1h |
| 🔴 CRITICAL | Threshold breach predicted | < 15m |

Configuration

yaml
# [home-path]
metrics:
  response_time:
    threshold: 500ms
    warning_horizon: 30m
    seasonality: daily

  memory_percent:
    threshold: 85
    warning_horizon: 15m

  error_rate:
    threshold: 0.05
    anomaly_sensitivity: high

notifications:
  slack: "#ops-alerts"
  email: [email]
  cooldown: 5m

Integration

With Clawdbot Sessions

python
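from prophet import Prophet

def alert(message: str) -> None:
    # Placeholder notifier; swap in your own alerting hook.
    print(message)

p = Prophet()
response_ms = 412.0  # latency of the most recent request, in milliseconds
p.track_session_metric("response_time", response_ms)

if p.predict_breach("response_time", horizon="30m"):
    alert("Slowdown predicted in next 30 minutes")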

### With Heartbeat
Add to HEARTBEAT.md:

markdown
- [ ] Performance Prophet status
- [ ] Any predicted incidents?

Implementation

This skill provides:
- `prophet.py` — Core prediction engine
- `collector.py` — Metric gathering
- `alerter.py` — Notification dispatch
- `cli.py` — Command line interface
- `models/` — Serialized model states

Pattern Learning (Gen 6 Evolution)

Performance Prophet learns from every incident to improve future predictions.

How It Works

1. Pattern Fingerprinting — Each metric trajectory gets a unique fingerprint based on shape characteristics (trend, acceleration, volatility, peak position)

2. Shape Classification — Fingerprints are classified into categories:
- `gradual_rise` — Slow buildup toward threshold
- `accelerating_rise` — Exponential degradation
- `spike` — Sudden jump
- `oscillating` — Irregular fluctuations
- `step` — Discrete jumps

3. Incident Memory — Past incidents are stored with their patterns, root causes, and resolutions

4. Pattern Matching — When current metrics match a known pattern, Prophet alerts with historical context

5. Causal Graph — Relationships between metrics are tracked (e.g., "when database connections degrade, response time follows 2.5 minutes later")
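
Steps 1 and 2 can be sketched as follows. The four named characteristics come from the list above; the two helper features (`max_step`, `reversals`), the normalisation by value range, and every threshold in `classify` are assumptions chosen for illustration, not Prophet's real fingerprinting logic.

```python
import statistics

def fingerprint(series):
    """Summarise a trajectory's shape. Needs at least three samples."""
    if len(series) < 3:
        raise ValueError("need at least three samples")
    diffs = [b - a for a, b in zip(series, series[1:])]
    span = (max(series) - min(series)) or 1.0
    half = len(diffs) // 2
    nonzero = [d for d in diffs if d != 0]
    return {
        # The four characteristics named in the text, normalised by value range.
        "trend": (series[-1] - series[0]) / span,
        "acceleration": (statistics.fmean(diffs[half:])
                         - statistics.fmean(diffs[:half])) / span,
        "volatility": statistics.pstdev(diffs) / span,
        "peak_position": series.index(max(series)) / (len(series) - 1),
        # Helper features added for this sketch only.
        "max_step": max(abs(d) for d in diffs) / span,
        "reversals": sum(1 for a, b in zip(nonzero, nonzero[1:]) if a * b < 0),
    }

def classify(fp):
    """Map a fingerprint to one of the five categories (thresholds are guesses)."""
    if fp["reversals"] >= 3:
        return "oscillating"
    if fp["max_step"] > 0.8:
        return "step" if fp["trend"] > 0.5 else "spike"
    if fp["acceleration"] > 0.1:
        return "accelerating_rise"
    return "gradual_rise"

print(classify(fingerprint([10, 10, 10, 90, 10, 10])))  # → spike
```

The classifier here reads only a subset of the fingerprint; `volatility` and `peak_position` are kept because they would matter when comparing two fingerprints for similarity.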

Cascade Prediction

When one metric degrades, Prophet predicts which others will follow:

🌊 Cascade Prediction: database_connections
  → response_time (lag: 2.5 min, correlation: 85%)
  → error_rate (lag: 5.0 min, correlation: 72%)
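
One way to produce such a cascade is a breadth-first walk over the causal graph that sums lags and multiplies link strengths along each path. The data layout, the `predict_cascade` name, the example links, and the multiplication rule are all assumptions for this sketch.

```python
from collections import deque

def predict_cascade(links, source, min_strength=0.3):
    """Walk causal links from `source`, accumulating lag and strength.

    `links` maps a metric to a list of (downstream_metric, lag_min, strength).
    Returns [(metric, (total_lag_min, path_strength)), ...] ordered by lag.
    """
    results = {}
    queue = deque([(source, 0.0, 1.0)])
    while queue:
        metric, lag, strength = queue.popleft()
        for child, link_lag, link_strength in links.get(metric, []):
            total_lag = lag + link_lag
            total_strength = strength * link_strength
            if total_strength < min_strength or child == source:
                continue
            best = results.get(child)
            # Keep only the strongest path to each metric; this also stops cycles.
            if best is None or total_strength > best[1]:
                results[child] = (total_lag, total_strength)
                queue.append((child, total_lag, total_strength))
    return sorted(results.items(), key=lambda item: item[1][0])

# Hypothetical links for illustration.
links = {
    "database_connections": [("response_time", 2.5, 0.85)],
    "response_time": [("error_rate", 3.0, 0.65)],
}
for metric, (lag, strength) in predict_cascade(links, "database_connections"):
    print(f"→ {metric} (lag: {lag:.1f} min, strength: {strength:.0%})")
```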

Similar Incident Lookup

Find past incidents with matching patterns:

bash
prophet similar response_time

🔍 Similar past incidents:
  📋 inc_20260131_response
     Root cause: Database connection pool exhausted
     Resolution: Increased pool size to 100
     Duration: 12.3 min

Time-to-Recovery (TTR) Prediction (Gen 6+)

During an active incident, this module answers the critical question: "How long until we're back to normal?"

How It Works

1. Pattern Matching — Compares current incident fingerprint to historical incidents
2. Multi-Factor Scoring — Considers severity, trend, peak deviation, pattern type
3. Bayesian Updating — Adjusts prediction as time passes
4. Confidence Intervals — Provides optimistic, median, and pessimistic estimates

Usage

bash
# Predict TTR for an active incident
prophet ttr response_time --value 650 --threshold 500 --elapsed 5 --trend stable

# Output:
# ⏱️  TTR Prediction: response_time
# ───────────────────────────────────
#    Estimated: 12 min remaining
#    Confidence: ████████░░ 80%
#
#    📊 Range:
#       Best case:  5 min
#       Most likely: 12 min
#       Worst case: 25 min
#
#    📋 Factors:
#       • Based on 8 similar past incidents
#       • Pattern type: gradual rise
#       • ⏱️ 5 min elapsed
#
#    💡 Recommendation:
#       Check: Database connection pool exhausted (common cause in similar incidents)

# View prediction accuracy
prophet ttr-stats

Pattern-Based Adjustments

| Pattern Type | Recovery Modifier | Typical Behavior |
| --- | --- | --- |
| Spike | 0.5x (faster) | Usually self-resolves |
| Gradual Rise | 1.0x (baseline) | Standard recovery |
| Accelerating | 1.5x (slower) | Needs intervention |
| Step Change | 2.0x (slowest) | Usually deployment-related |
| Oscillating | 1.2x (moderate) | Capacity contention |
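
A rough sketch of how the modifiers could feed a TTR range: take percentiles of past incident durations, scale them by the pattern modifier, and subtract elapsed time. The modifiers come from the table above; `estimate_ttr`, the percentile choice, and the sample durations are assumptions. This sketch does not implement the Bayesian updating step.

```python
import statistics

# Multipliers from the table above, keyed by the shape category names.
RECOVERY_MODIFIER = {
    "spike": 0.5, "gradual_rise": 1.0, "accelerating_rise": 1.5,
    "step": 2.0, "oscillating": 1.2,
}

def estimate_ttr(past_durations_min, pattern, elapsed_min):
    """Return best/likely/worst minutes remaining, floored at zero."""
    deciles = statistics.quantiles(past_durations_min, n=10, method="inclusive")
    totals = {
        "best": deciles[0],    # 10th percentile of past durations
        "likely": statistics.median(past_durations_min),
        "worst": deciles[-1],  # 90th percentile of past durations
    }
    modifier = RECOVERY_MODIFIER[pattern]
    return {k: max(0.0, t * modifier - elapsed_min) for k, t in totals.items()}

# Hypothetical durations (minutes) of eleven similar past incidents.
history = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
print(estimate_ttr(history, "spike", elapsed_min=5))
# → {'best': 1.0, 'likely': 5.0, 'worst': 9.0}
```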

Integration with Incident Response

When Prophet detects an incident, it automatically provides a TTR estimate:

🔴 INCIDENT: response_time breached threshold (650ms > 500ms)
   ⏱️ Estimated recovery: 15 min (confidence: 75%)
   💡 Similar to 3 past incidents - common cause: DB connection pool

SLO Budget Burn Prediction (Gen 6 Evolution)

Predicts when your error budget will be exhausted, giving you time to act before SLOs are breached.

Philosophy

SLOs are promises. This module predicts broken promises before they happen.

Traditional SLO monitoring: "Error budget exhausted" → already broke the promise
Prophet SLO prediction: "Error budget will exhaust in 3 days" → time to fix reliability
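
The arithmetic behind such a forecast can be sketched for an event-based SLO. The function name `budget_forecast` and its parameters are hypothetical, and the sketch assumes traffic and error rate stay at their observed averages.

```python
def budget_forecast(target, good_events, total_events, elapsed_days, window_days=30):
    """Return (remaining_fraction, days_to_exhaustion) for an event-based SLO.

    days_to_exhaustion is None when no errors have been observed.
    """
    allowed_error_rate = 1.0 - target
    observed_error_rate = (total_events - good_events) / total_events
    # Burn rate 1.0 means the budget runs out exactly at the end of the window.
    burn_rate = observed_error_rate / allowed_error_rate
    consumed = burn_rate * elapsed_days / window_days
    remaining = max(0.0, 1.0 - consumed)
    if burn_rate == 0:
        return remaining, None
    return remaining, remaining * window_days / burn_rate

# 99.9% target, 10 failures in 10,000 events, 6 days into a 30-day window.
remaining, days_left = budget_forecast(0.999, 9990, 10000, elapsed_days=6)
print(f"{remaining:.1%} remaining, exhaustion in {days_left:.1f}d")
```

In this example the burn rate is about 1.0, so roughly 80% of the budget remains and it lasts about 24 more days, which is exactly the rest of the window.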

Defining SLOs

bash
# Define an availability SLO
prophet slo-define api_availability \
  --target 0.999 \
  --window 30 \
  --metric response_success_rate \
  --policy "Freeze deployments, prioritize reliability"

# Define a latency SLO
prophet slo-define api_latency \
  --target 0.95 \
  --window 30 \
  --policy "No risky changes until budget recovers"

# List all SLOs
prophet slo-list

Tracking Budget

bash
# Record good/total events (for event-based SLOs)
prophet slo-events api_availability 9990 10000

# Record a burn event with cause
prophet slo-burn api_availability 0.001 --cause "Deployment hiccup" --incident INC-123

Predictions

bash
# Show budget status for all SLOs
prophet slo-budget

# Output:
# 📊 SLO Budget Summary
#
#    🟡 api_availability
#       [████████████░░░] 85.2% remaining
#       Exhaustion in: 12.3d
#
#    🔴 api_latency
#       [████░░░░░░░░░░░] 28.1% remaining
#       Exhaustion in: 2.1d

# Detailed prediction for specific SLO
prophet slo-budget api_availability

# Output:
# 🟡 SLO Budget: api_availability
# ────────────────────────────────────────
#    Budget:     [████████████████░░░░] 85.2%
#    Status:     nominal
#    Confidence: 80%
#    Exhaustion: 12.3 days
#
#    📋 85.2% budget remaining. Burn rate is at expected.
#
#    💡 Recommendations:
#       • 📈 Monitor burn rate trend closely
#       • 🔧 Address known reliability issues

# Get JSON for dashboards
prophet slo-budget api_availability --json

Multi-Window Burn Rate Alerts

Based on Google SRE's multi-window alerting strategy:

bash
prophet slo-alerts

# Output:
# 📊 api_availability
#    🚨 PAGE: 2.00% burned in 5m, 3.50% in 60m (threshold: 2%)
#
#    Burn rates:
#       5m:  0.0040%/min
#       1h:  0.0006%/min
#       1d:  0.0001%/min
#       Trend: rising

| Window Pair | Budget Threshold | Severity | Meaning |
| --- | --- | --- | --- |
| 5m / 60m | 2% | | |
| 30m / 6h | 5% | | |
| 2h / 1d | 10% | | |
| 6h / 3d | 20% | | |
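
A minimal sketch of the multi-window check: an alert fires only when both the short and the long window of a pair have burned at least the pair's threshold. The thresholds are read as percentages of budget, matching the sample output; `firing_alerts` and the input layout are assumptions.

```python
# (short window, long window, budget threshold in percent)
WINDOW_PAIRS = [
    ("5m", "60m", 2.0),
    ("30m", "6h", 5.0),
    ("2h", "1d", 10.0),
    ("6h", "3d", 20.0),
]

def firing_alerts(burned_pct):
    """Return the window pairs whose short AND long windows meet the threshold.

    `burned_pct` maps a window label to the percent of budget burned in it.
    """
    return [
        (short, long_, threshold)
        for short, long_, threshold in WINDOW_PAIRS
        if burned_pct.get(short, 0.0) >= threshold
        and burned_pct.get(long_, 0.0) >= threshold
    ]

# Hypothetical readings: only the fastest pair crosses its threshold.
print(firing_alerts({"5m": 2.0, "60m": 3.5, "30m": 2.8, "6h": 4.0}))
# → [('5m', '60m', 2.0)]
```

Requiring both windows keeps a brief blip from paging anyone, while the short window lets the alert clear quickly once the burn stops.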

Status Levels

| Status | Icon | Meaning |
| --- | --- | --- |
| comfortable | 🟢 | Plenty of budget, low burn |
| nominal | 🟡 | On track for the window |
| elevated | 🟠 | Burning faster than expected |
| danger | 🔴 | Budget will exhaust early |
| exhausted | 💀 | Budget gone |

Integration with Incidents

When Prophet detects an incident, it correlates with SLO burn:

🔴 INCIDENT: api_latency breached p99 threshold
   ⏱️ Estimated recovery: 15 min
   💸 Projected budget burn: 0.3%
   📊 Budget remaining after: 27.8%

Error Budget Policy Automation

Define policies that trigger automatically:

yaml
# [home-path]
slos:
  api_availability:
    target: 0.999
    window_days: 30
    policies:
      danger:
        - freeze_deployments
        - page_oncall
        - notify_stakeholders
      exhausted:
        - block_ci_merges
        - escalate_leadership

Automatic Causality Discovery (Gen 6 Evolution)

Prophet now automatically discovers which metrics influence each other — no manual configuration needed.

How It Works

Three statistical methods work together to identify causal relationships:

1. Cross-Correlation — Finds temporal relationships at different time lags
2. Granger Causality — Tests if one metric's past helps predict another's future
3. Transfer Entropy — Measures information flow between metrics

Why It Matters

Manual approach: "I think memory affects response time... maybe?" → Guesswork
Auto-discovery: "Memory causes response_time (85

Usage

bash
# Discover all causal relationships
prophet causality discover

# Output:
# 🧬 Discovering causal relationships...
#
# ✅ Discovered 5 causal links:
#
# 📊 memory_percent
#    → response_time
#       [████████░░] 82% strength
#       Lag: 2.5m | Method: granger
#    → disk_io
#       [██████░░░░] 61% strength
#       Lag: 5.0m | Method: cross_correlation

# Explain a specific metric
prophet causal-explain response_time

# Output:
# 🔍 Causal Analysis: response_time
#
# 📥 CAUSED BY:
#    ← memory_percent (strength: 82%, lag: 2.5m)
#    ← db_connections (strength: 71%, lag: 1.0m)
#
# 📤 CASCADES TO:
#    → error_rate (+3.0m, strength: 65%)
#       → user_complaints (+10.0m, strength: 45%)

# Test a specific relationship
prophet causal-test memory_percent response_time

# Output:
# 🔬 Testing: memory_percent → response_time
#
# 1️⃣ Cross-Correlation:
#    Peak lag: 2 minutes
#    Correlation: 0.823
#    Direction: memory_percent causes response_time
#
# 2️⃣ Granger Causality:
#    F-statistic: 15.32
#    Optimal lag: 2 minutes
#    Significance: strong
#    ✅ Granger causes: Yes
#
# 📋 Summary:
#    ✅ Evidence of causality: Correlated (0.82), Granger causes (F=15.3)

# Find root causes
prophet causal-roots error_rate --depth 3

# Output:
# 🌳 Root Causes: error_rate
#
# 📍 memory_percent (depth: 2)
# 📍 db_pool_size (depth: 2)
# 📍 traffic_load (depth: 1)

# Generate visual graph
prophet causality graph

# Output (Mermaid):
# ```mermaid
# graph LR
#     memory_percent--2m-->response_time
#     memory_percent--5m-->disk_io
#     response_time--3m-->error_rate
# ```

Causal Graph Visualization

Prophet can generate Mermaid diagrams of discovered relationships:

mermaid
graph LR
    memory_percent--2m-->response_time
    memory_percent--5m-->disk_io
    db_connections--1m-->response_time
    response_time--3m-->error_rate
    error_rate--7m-->user_complaints

Integration with Incident Response

When an incident occurs, Prophet now traces the causal chain:

🔴 INCIDENT: error_rate breached threshold (5% > 1%)

🔍 Root Cause Analysis:
   └── memory_percent spike detected 8 minutes ago
       └── response_time degraded 5 minutes ago
           └── error_rate breached now

💡 Likely root cause: memory_percent (82% confidence)
   Historical pattern: Memory pressure → slow responses → errors

Best Practices

1. Collect sufficient data — Run `prophet collect` for at least 24 hours before discovery
2. Run discovery periodically — Relationships can change with system updates
3. Verify surprising links — High correlation doesn't always mean causation
4. Use `causal-test` for specific hypotheses — Get detailed statistical evidence

Event Attribution (Gen 6+ Evolution)

Beyond metric-to-metric causality, Prophet can attribute anomalies to external events like deployments, config changes, traffic spikes, and scheduled jobs.

Recording Events

bash
# Record a deployment
prophet event deployment "Deployed v2.3.1" --severity medium

# Record a config change
prophet event config_change "Increased connection pool to 100" --severity low

# Record a scheduled job
prophet event scheduled_job "Nightly batch processing" --impact-window 30

# Record traffic spike
prophet event traffic_spike "Marketing campaign launch" --severity high

# Collect deployments from git automatically
prophet event-collect git --path /path/to/repo --hours 24

Attributing Anomalies

When an anomaly occurs, find what caused it:

bash
# Attribute a current anomaly
prophet attribute response_time

# Output:
# 🔍 Attributing anomaly: response_time at 2024-01-15 14:30
#    Looking back 60 minutes for events...
#
# ## Anomaly Attribution Report
#
# ### 1. 🟢 Deployed v2.3.1
# - **Type:** deployment
# - **Confidence:** high (78%)
# - **Lag:** 12.3 minutes before anomaly
# - **Temporal fit:** 95%
# - **Metric relevance:** 100%
# - **Status:** Pending confirmation (ID: `a7b3c9d2`)
#
# ### 2. 🟡 Increased connection pool
# - **Type:** config_change
# - **Confidence:** medium (52%)
# - **Lag:** 45.1 minutes before anomaly

# Attribute a past anomaly
prophet attribute cpu_percent --timestamp 2024-01-15T10:00:00 --lookback 120

Confirming Attributions (Learning)

Prophet learns from your confirmations to improve future attributions:

bash
# Confirm an attribution was correct
prophet event-confirm a7b3c9d2 --yes

# Reject an incorrect attribution
prophet event-confirm b8c4d0e3 --no

Viewing Learned Patterns

bash
# Show what Prophet has learned
prophet event-patterns

# Output:
# 🧠 Learned Attribution Patterns
#
# 📍 deployment → response_time
#    Typical lag: 8.5 min
#    Duration: 25.3 min
#    Occurrences: 12
#    Confidence boost: +35%
#
# 📍 scheduled_job → cpu_percent
#    Typical lag: 1.2 min
#    Duration: 15.0 min
#    Occurrences: 28
#    Confidence boost: +50%

Attribution Summary

bash
prophet event-summary --hours 48

# Output:
# 📊 Attribution Summary (last 48h)
#
# By Confidence Level:
#    🟢 high: 15
#    🟡 medium: 8
#    🟠 low: 3
#
# By Event Type:
#    • deployment: 12
#    • scheduled_job: 8
#    • traffic_spike: 4
#
# Confirmation Status:
#    ✅ Confirmed: 18
#    ❌ Rejected: 3
#    ⏳ Pending: 5
#    📈 Accuracy: 86%
#
# 🧠 Learned Patterns: 6

Event Types

| Type | Description | Typical Metrics Affected |
| --- | --- | --- |
| `deployment` | Code/artifact deployments | response_time, error_rate, cpu |
| `config_change` | Configuration updates | response_time, error_rate |
| `scheduled_job` | Cron jobs, batch processing | cpu, memory, disk_io |
| `traffic_spike` | Unusual traffic patterns | response_time, queue_depth |
| `maintenance` | Planned maintenance | availability, response_time |
| `incident` | Known incidents | varies |
| `external` | Third-party service issues | response_time, error_rate |
| `custom` | User-defined events | varies |

Integration with Incident Response

🔴 INCIDENT: response_time breached threshold (850ms > 500ms)

🔍 Causal Analysis:
   Metric chain: memory_percent → response_time (lag: 2.5m)

🎯 Event Attribution:
   1. 🟢 [HIGH] Deployed v2.3.1 (12 min ago)
      Historical pattern: deployments → response_time (+15%, 8m lag)
   2. 🟡 [MEDIUM] DB pool config change (45 min ago)

💡 Likely root cause: Deployment v2.3.1
   Suggested action: Review recent deployment changes

Why It Matters

Reactive monitoring: "Server is slow" → users already suffering
Proactive prediction: "Server will be slow in 20 minutes" → time to act

Manual causality: "What's causing this?" → Guesswork and war rooms
Auto-discovery: "memory_percent → response_time (lag: 2.5m)" → Instant root cause

Manual attribution: "Was it the deployment?" → Hours of correlation
Auto-attribution: "Deployment v2.3.1 (78

Reactive SLO monitoring: "Error budget exhausted" → SLO already broken
Proactive SLO prediction: "Budget exhausts in 3 days" → time to prioritize reliability

The difference between firefighting and fire prevention.

Predictive Capacity Planning (Gen 6+ Evolution)

Prophet now forecasts when you'll run out of capacity and recommends scaling actions before you hit limits.

Philosophy

Reactive capacity: "Disk full" → Scramble to add storage at 3 AM
Predictive capacity: "Disk will fill in 14 days" → Plan and budget calmly

Quick Start

bash
# Generate capacity plan
prophet capacity

# Output:
# 📊 Capacity Planning Report
#    Generated: 2024-01-15 14:30
#    Horizon: 90 days
#
# ✅ All systems have comfortable capacity headroom
#    Risk Score: 10%
#
# 📈 Capacity Forecasts:
#
# 🟢 cpu_percent
#    Utilization: [████░░░░░░] 42%
#    Current: 42.3 / 85 (threshold)
#    Trend: stable/declining ✓
#    Growth: -0.12/day (stable)
#    Confidence: 78%
#
# 🟡 disk_percent
#    Utilization: [███████░░░] 68%
#    Current: 68.2 / 90 (threshold)
#    Time to breach: 2.3 weeks
#    Forecast date: 2024-01-29
#    Growth: +1.35/day (linear)
#    Confidence: 85%
#
# 🔧 Scaling Recommendations:
#
# 🟡 disk_percent
#    Action: plan_scaling
#    Plan disk_percent capacity increase for ~2.3 weeks
#    Current: 68.2 → Recommended: 115.0
#    Act by: 2024-01-22
#    Cost impact: ~1.2x
#    • Forecast breach: 2024-01-29
#    • Time to plan and budget
#    Alternatives:
#      - Monitor and reassess in 1 week

Commands

bash
# Full capacity plan with all metrics
prophet capacity

# Detailed forecast for specific metric
prophet capacity-forecast disk_percent

# Analyze growth patterns
prophet capacity-growth memory_percent --window 14

# View/set thresholds
prophet capacity-threshold
prophet capacity-threshold cpu_percent --warning 70 --critical 85 --max 100

# Record scaling actions (for learning)
prophet capacity-scale disk_percent scale_up --old 100 --new 200 --description "Added second SSD"

# View scaling history
prophet capacity-history --days 30
prophet capacity-history --metric disk_percent

# Compare forecasts over time
prophet capacity-compare disk_percent --days 7

Growth Analysis

Prophet analyzes growth patterns to understand how resources are trending:

bash
prophet capacity-growth memory_percent

# Output:
# 📊 Growth Analysis: memory_percent
#
#    Current value: 72.50
#    Daily growth: +0.85
#    Growth rate: +1.17%/day
#
#    Trend: linear
#    Seasonality: daily
#    Volatility: 0.15
#
#    Data points: 2048
#    Window: 14 days
#    Confidence: 82%
#
# Projections (at current rate):
#    7d: 78.45
#    30d: 98.00
#    90d: 149.00

Trend Types

| Trend | Description | Action |
| --- | --- | --- |
| `stable` | Minimal change | Monitor |
| `linear` | Steady growth | Plan ahead |
| `exponential` | Accelerating | Act soon |
| `saturating` | Approaching limit | May self-limit |

Urgency Levels

| Urgency | Time to Threshold | Recommended Action |
| --- | --- | --- |
| 🟢 `comfortable` | > 30 days | Continue monitoring |
| 🔵 `planning` | 14-30 days | Budget and plan |
| 🟡 `soon` | 7-14 days | Schedule scaling |
| 🟠 `urgent` | 3-7 days | Scale this week |
| 🔴 `critical` | < 3 days | Emergency scaling |
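
The mapping from a linear growth rate to an urgency level can be sketched as below. The function names are hypothetical, and because the table's ranges share their endpoints, treating each lower bound as inclusive is an assumption.

```python
def days_to_threshold(current, threshold, growth_per_day):
    """Days until a linearly growing metric reaches its threshold.

    Returns 0.0 if already at or over the threshold, None if not growing.
    """
    if current >= threshold:
        return 0.0
    if growth_per_day <= 0:
        return None
    return (threshold - current) / growth_per_day

def urgency(days):
    """Map days-to-threshold onto the urgency levels in the table above."""
    if days is None or days > 30:
        return "comfortable"
    if days >= 14:
        return "planning"
    if days >= 7:
        return "soon"
    if days >= 3:
        return "urgent"
    return "critical"

# Figures from the disk_percent sample report: 68.2 now, 90 threshold, +1.35/day.
days = days_to_threshold(68.2, 90, 1.35)
print(f"{days:.1f} days ({days / 7:.1f} weeks): {urgency(days)}")
# → 16.1 days (2.3 weeks): planning
```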

Recording Scaling Actions

Track what you did to improve future predictions:

bash
# After upgrading disk
prophet capacity-scale disk_percent scale_up \
  --old 100 --new 200 \
  --description "Upgraded to 200GB SSD"

# After adding instances
prophet capacity-scale cpu_percent scale_out \
  --old 100 --new 200 \
  --description "Added second server"

# After optimization
prophet capacity-scale memory_percent optimize \
  --old 85 --new 60 \
  --description "Fixed memory leak in API service"

Comparing Forecasts

See if predictions are getting more accurate:

bash
prophet capacity-compare disk_percent --days 7

# Output:
# 📈 Forecast Trend: disk_percent (last 7 days)
#
# Date               Days to Threshold    Growth Rate     Confidence
# ----------------------------------------------------------------------
# 2024-01-08 10:30   25.3 days           +0.98/day       75%
# 2024-01-09 10:30   22.1 days           +1.12/day       78%
# 2024-01-10 10:30   19.8 days           +1.25/day       80%
# 2024-01-11 10:30   16.2 days           +1.35/day       82%
# 2024-01-12 10:30   14.5 days           +1.35/day       85%

Custom Thresholds

Set different thresholds for different environments:

bash
# Production: stricter thresholds
prophet capacity-threshold cpu_percent --warning 60 --critical 75 --max 100
prophet capacity-threshold memory_percent --warning 70 --critical 85 --max 100

# Development: more relaxed
prophet capacity-threshold disk_percent --warning 85 --critical 95 --max 100

Integration with Heartbeats

Add to HEARTBEAT.md for proactive capacity monitoring:

markdown
## Daily Capacity Check
- [ ] Run `prophet capacity` — any urgent scaling needed?
- [ ] Critical resources: < 7 days to threshold → escalate
- [ ] Planning horizon: 14-30 days → add to sprint backlog

Cost-Aware Recommendations

Prophet considers cost impact when recommending scaling:

bash
# Cost model in config
[home-path]
{
  "cost_model": {
    "cpu": {"scale_factor": 2, "cost_ratio": "2x"},
    "memory": {"scale_factor": 2, "cost_ratio": "1.5x"},
    "disk": {"scale_factor": 2, "cost_ratio": "1.2x"}
  }
}

Why It Matters

Reactive capacity: "Out of disk space" → 3 AM emergency
Predictive capacity: "Disk exhausts in 14 days" → calm planning

Manual tracking: "How fast is memory growing?" → Spreadsheets
Auto-analysis: "1.17

Surprise costs: "We need to double capacity NOW" → budget shock
Planned scaling: "Schedule 50

Capacity planning turns infrastructure surprises into scheduled maintenance.

Predictive Remediation Engine (Gen 6+ Evolution)

Prophet now recommends specific remediation actions when issues are detected, based on what worked historically and industry best practices.

Philosophy

Reactive debugging: "Server is slow" → Guesswork and war rooms
Predictive remediation: "Here are 3 fixes, ranked by effectiveness" → Instant action

How It Works

1. Knowledge Base — Built-in SRE best practices for common issues
2. Historical Learning — Records outcomes to improve future recommendations
3. Causal Awareness — Recommends fixing root causes, not symptoms
4. Pattern Matching — Different patterns need different fixes

Quick Start

bash
# Get remediation recommendations for a metric issue
prophet remediate response_time --value 850 --threshold 500

# Output:
# 🔧 Remediation Recommendations
#
# ### 1. Rollback recent deployment
#    [████████░░] 78% confidence
#    Revert to previous known-good version
#
#    📊 Expected improvement: 90%
#    ⏱️  Time to effect: ~5 min
#    🟢 Effort: low
#    🟡 Risk: medium
#    📋 Based on 12 past uses (83% success)
#
#    💡 Evidence:
#       • Recent deployment detected: v2.3.1
#       • Historical success rate: 83%
#       • Recommended for 'spike' pattern
#
# ### 2. Increase connection pool
#    [██████░░░░] 65% confidence
#    Increase database connection pool size
#    ...

Commands

bash
# Get recommendations for any metric
prophet remediate memory_percent --value 92 --threshold 85

# Include pattern context for better recommendations
prophet remediate cpu_percent --value 95 --pattern gradual_rise

# Skip causal analysis (faster, but less accurate)
prophet remediate response_time --value 800 --no-causality

# Record a remediation outcome (for learning)
prophet remediate-record mem_restart_service memory_percent \
  --before 92 --after 45 --threshold 85 --success \
  --time-to-effect 2 --duration 480

# View remediation statistics
prophet remediate-stats
prophet remediate-stats --action-id mem_restart_service

Adding Custom Actions

bash
# Add a custom remediation action
prophet remediate-action --add \
  --name "Restart API Gateway" \
  --description "Restart the API gateway to clear stale connections" \
  --category restart \
  --metrics response_time,error_rate \
  --commands "kubectl rollout restart deployment/api-gateway" \
  --impact 75 \
  --effort low \
  --risk low \
  --tags quick_fix,gateway

# List custom actions
prophet remediate-action
prophet remediate-action --metric response_time

Playbooks (Action Sequences)

Create playbooks for common incident types:

bash
# Create a playbook for memory issues
prophet remediate-playbook --create \
  --name "Memory Pressure Response" \
  --description "Standard response for memory exhaustion" \
  --trigger-metric memory_percent \
  --trigger-pattern gradual_rise \
  --actions "mem_restart_service,mem_increase_limit,mem_fix_leak"

# Find matching playbook for a situation
prophet remediate-playbook --match memory_percent --pattern gradual_rise

# List all playbooks
prophet remediate-playbook

Built-in Knowledge Base

Prophet includes remediation knowledge for common metrics:

| Metric | Actions Available |
| --- | --- |
| `memory_percent` | Restart service, increase limit, fix leak, add swap |
| `cpu_percent` | Scale out, scale up, optimize code, rate limit |
| `response_time` | Add cache, DB indexes, increase pool, rollback, scale |
| `error_rate` | Rollback, circuit breaker, tune retries, fix bug |
| `disk_percent` | Cleanup logs, expand volume, compress, archive |
| `queue_depth` | Add workers, throttle input, priority queuing |
| `db_connections` | Increase pool, fix leak, add replica |

Confidence Scoring

Recommendations are ranked by confidence based on:

| Factor | Weight | Description |
| --- | --- | --- |
| Historical success | 40% | |
| Pattern match | 15% | |
| Event alignment | 15% | |
| Risk penalty | -5% to -35% | |
| Base knowledge | 30% | |
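
One possible reading of these weights is a weighted sum with a risk penalty subtracted at the end. Treating the weights as percentages, the penalty assigned to each risk level, and the function itself are all assumptions made for this sketch; the only figures taken from the table are 40, 15, 15, 30 and the penalty range of 5 to 35.

```python
# Assumed penalty per risk level, spanning the 5-35 range from the table.
RISK_PENALTY = {"low": 0.05, "medium": 0.20, "high": 0.35}

def recommendation_confidence(base_knowledge, success_rate, pattern_match,
                              event_alignment, risk):
    """Combine factor scores (each in [0, 1]) into a confidence in [0, 1]."""
    score = (
        0.30 * base_knowledge
        + 0.40 * success_rate
        + 0.15 * pattern_match
        + 0.15 * event_alignment
        - RISK_PENALTY[risk]
    )
    return min(1.0, max(0.0, score))

# Hypothetical rollback action: 83% historical success, medium risk.
conf = recommendation_confidence(1.0, 0.83, 1.0, 1.0, risk="medium")
print(f"{conf:.0%}")  # → 73%
```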

Integration with Incident Response

When Prophet detects an incident, it automatically includes remediation:

🔴 INCIDENT: response_time breached threshold (850ms > 500ms)

🔍 Root Cause Analysis:
   └── db_connections spike detected 5 minutes ago

🔧 Recommended Actions:
   1. [78%] Increase connection pool (effort: trivial)
   2. [65%] Rollback recent deployment (effort: low)
   3. [52%] Add read replica (effort: high)

💡 Start with: Increase connection pool
   Highest confidence, lowest effort, addresses root cause

Learning Loop

Prophet learns from every remediation:

bash
# After successful remediation
prophet remediate-record db_increase_pool db_connections \
  --before 95 --after 40 --threshold 80 --success \
  --time-to-effect 1 --duration 1440 \
  --notes "Increased pool from 50 to 100"

# After failed remediation
prophet remediate-record rt_add_cache response_time \
  --before 850 --after 800 --threshold 500 \
  --notes "Cache helped but not enough - need DB optimization"

Why It Matters

Reactive debugging: "What should we try?" → Guesswork
Predictive remediation: "Try X (78

Manual runbooks: "Check the wiki..." → Stale documentation
Dynamic playbooks: "Based on 15 similar incidents..." → Living knowledge

Individual knowledge: "Ask Bob, he fixed this before" → Bus factor
Organizational memory: "Action X worked 83

Remediation recommendations turn tribal knowledge into actionable intelligence.

Prediction Explainer (Gen 6+ Evolution)

ML interpretability module that explains WHY predictions are made, not just what they predict.

Philosophy

> "A prediction without explanation is just a guess with confidence."

Traditional ML alerts say "response time will spike." But that's useless without knowing WHY it's going to spike and WHAT you can do to prevent it.

How It Works

1. Feature Attribution — SHAP-like contribution analysis shows how much each metric contributes to the prediction
2. Counterfactual Generation — "What would need to change to avoid this?" with specific targets
3. Pattern Matching — Recognizes common problematic patterns (resource exhaustion, swapping, backlog)
4. Natural Language Explanations — Human-readable summaries, not just numbers

Quick Start

bash
# Explain a current prediction
prophet explain response_time

# Demo with sample data
prophet explain-demo

# Show what changes would help
prophet counterfactual response_time

# Show feature contributions
prophet contributors response_time

# Update baselines from history
prophet update-baselines --days 7

Example Output

## 🟠 Prediction Explanation

**Metric:** response_time
**Predicted Value:** 650.00
**Confidence:** 85%
**Severity:** WARNING

### Summary
Response times are predicted to reach 650ms within 25 minutes, which would
exceed your 500ms threshold. Users may start noticing delays.

### Why This Is Happening

- ↑ **cpu_percent**: above CPU usage (78.5% vs normal 45.2%) is contributing to the prediction
- ↑ **memory_percent**: above memory (82.3% vs normal 55.1%) contributing to slowdowns
- ↑ **queue_depth**: Queue depth of 156 (above from 45) contributing to delays

### Pattern Matches

- 🔍 Resource exhaustion pattern: Both CPU and memory elevated

### Recommended Actions

🟢 Profile and optimize CPU-intensive operations — Expected impact: 35% risk reduction
🟢 Tune garbage collection parameters — Expected impact: 28% risk reduction
🟡 Add more compute capacity or scale horizontally — Expected impact: 20% risk reduction

### What Would Help

- Reducing **cpu_percent** by 25% would reduce risk by ~35%
- Reducing **memory_percent** by 20% would reduce risk by ~28%
- Reducing **queue_depth** by 50% would reduce risk by ~20%

Feature Attribution

The explainer calculates contribution scores for each metric:

| Feature | Contribution | Explanation |
| --- | --- | --- |
| cpu_percent | +12.5 | Elevated CPU increases latency |
| memory_percent | +8.3 | GC pauses causing delays |
| queue_depth | +5.1 | Backlog building up |
| error_rate | -2.0 | Low errors mitigating |

Positive contributions push toward the prediction; negative contributions work against it.
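
A simplified stand-in for this kind of scoring is a weighted z-score: how far each metric sits from its learned baseline, scaled by a per-metric weight. This is not SHAP and is not claimed to be Prophet's method; `contributions`, the baseline values, and the weights are assumptions for illustration.

```python
def contributions(current, baselines, weights=None):
    """Score each metric by its weighted deviation from baseline.

    `baselines` maps metric -> (mean, stdev); `weights` maps metric -> multiplier.
    Returns a dict ordered from largest to smallest absolute contribution.
    """
    weights = weights or {}
    scores = {}
    for metric, value in current.items():
        mean, stdev = baselines[metric]
        z = (value - mean) / stdev if stdev > 0 else 0.0
        scores[metric] = weights.get(metric, 1.0) * z
    return dict(sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True))

# Hypothetical baselines; the current CPU reading matches the example output.
current = {"cpu_percent": 78.5, "error_rate": 0.01}
baselines = {"cpu_percent": (45.2, 11.1), "error_rate": (0.02, 0.01)}
for metric, score in contributions(current, baselines).items():
    print(f"{metric}: {score:+.1f}")
```

Positive scores mark metrics running above their baseline and negative scores mark metrics below it, which mirrors the sign convention described above.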

Counterfactual Analysis

Shows what changes would avoid the predicted problem:

🔄 Counterfactuals for response_time
   Predicted: 650.00 → Threshold: 500

What would help:

1. 🟢 cpu_percent
   Current: 78.50 → Target: 58.87
   Change needed: 25% reduction
   Impact: ~35% risk reduction
   Action: Profile and optimize CPU-intensive operations

2. 🟢 memory_percent
   Current: 82.30 → Target: 65.84
   Change needed: 20% reduction
   Impact: ~28% risk reduction
   Action: Tune garbage collection parameters

Baseline Learning

The explainer learns what "normal" looks like for your system:

bash
# Update baselines from last week
prophet update-baselines --days 7

# Baselines are used to:
# - Calculate deviation (z-score) for each metric
# - Determine if values are unusually high/low
# - Weight contribution calculations
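As a rough illustration of the deviation calculation mentioned above, the sketch below computes a z-score against a baseline built from recent samples. The sample values and the 3.0 cutoff are assumptions chosen for the example; the source does not say how Prophet stores baselines or which cutoff it uses.

```python
from statistics import mean, stdev


def z_score(value: float, history: list[float]) -> float:
    """Deviation of `value` from the baseline, in standard deviations."""
    baseline = mean(history)
    spread = stdev(history)  # needs at least two samples
    if spread == 0:
        return 0.0  # flat history: no meaningful deviation
    return (value - baseline) / spread


# Hypothetical week of cpu_percent samples (illustrative numbers only).
cpu_history = [44.0, 46.5, 45.0, 43.8, 47.1, 44.9, 45.1]
cpu_z = z_score(78.5, cpu_history)
is_unusually_high = cpu_z > 3.0  # the 3.0 cutoff is an assumption
```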

Integration with Other Modules

With Causality Engine:
- Explanations use discovered causal relationships
- "cpu_percent causes response_time (lag: 2.5 min)"

With Attribution:
- Links predictions to external events
- "Response time spike correlates with deploy_20260201"

With Remediation:
- Recommended actions come from remediation engine
- Actions have confidence scores from historical outcomes

Why It Matters

Black box alerts: "Response time will spike" → Panic
Explained predictions: "Response time will spike because CPU is exhausted" → Action

Guessing at fixes: "Maybe we should scale up?" → Wasted effort
Counterfactual guidance: "Reduce CPU by 25%…

Tribal knowledge: "Bob knows what to check" → Bus factor
Automated explanations: "Here's why and what to do" → Self-service

Explainable predictions turn alerts into actionable intelligence.

Deployment Risk Predictor (Gen 6+ Evolution)

Predicts the likelihood that a deployment will cause performance problems before it goes live. Uses historical patterns, change characteristics, system state, and timing factors to generate risk scores.

Philosophy

> "The best incident is the one that never happens."

Traditional deployment processes: merge → deploy → cross fingers → incident → rollback → postmortem.

Performance Prophet: merge → analyze risk → informed decision → targeted monitoring → smooth deployment.

How It Works

1. Code Change Analysis — Size, complexity, affected areas, dependency changes
2. Historical Patterns — What happened with similar deployments in the past
3. System State — Current health affects deployment risk
4. Timing Factors — Day, time, proximity to recent incidents
5. Context Signals — Test coverage, review thoroughness, deployment frequency

Quick Start

bash
# Analyze a deployment (auto-detects git changes)
prophet deploy-risk

# Analyze specific files/changes
prophet deploy-risk --files "src/db/migrations/001.sql,src/api/routes.py" \
    --message "feat: add payment processing"

# With test coverage
prophet deploy-risk --coverage 75

# In CI/CD - fail if high risk
prophet deploy-risk --fail-on-high

# Record deployment for learning
prophet deploy-risk --record

# Record outcome after deployment
prophet deploy-record <deployment-id> --incident-24h --rollback

# View deployment history
prophet deploy-history --days 30

# Show learned patterns
prophet deploy-patterns

Example Output

🔍 Analyzing deployment risk...

🟠 **Deployment Risk: 67/100 (HIGH)**

⚠️ **Recommendation: PROCEED WITH CAUTION** - Review risks below.

**Key Risk Factors:**
- Database migration detected
- 3 critical path files modified: src/db/migrations/001.sql, src/payments/handler.py
- Deploying on Friday - weekend/end of week
- Test coverage is 65% (below 70% threshold)

### Signal Breakdown

| Signal | Category | Value | Contribution |
|--------|----------|-------|--------------|
| migration_detected | code | 90% | ████ 0.23 |
| hotpath_touched | code | 60% | ██ 0.13 |
| deploy_day_risk | timing | 50% | █ 0.06 |
| test_coverage_delta | context | 14% | 0.03 |

### Recommendations

1. 🚨 Consider delaying this deployment
2. Test migration on production copy first
3. Ensure backward compatibility
4. Have rollback migration ready
5. Deploy during low-traffic window

**Optimal Deploy Window:** Tuesday 10:00 AM (lower risk window)

### Similar Past Deployments

- 2026-01-15: risk 72 → 🔴 incident (rolled back)
- 2026-01-08: risk 45 → ✅ ok

Risk Signals

Code Signals

| Signal | Description | Weight |
|--------|-------------|--------|
| `lines_changed` | Total lines modified (logarithmic) | 15 |
| `files_changed` | Number of files touched | 12 |
| `complexity_delta` | Code complexity changes | 18 |
| `dependency_changes` | Package/dependency updates | 20 |
| `migration_detected` | Database migrations present | 25 |
| `hotpath_touched` | Critical path files modified | 22 |

Hot Path Patterns

These file patterns are considered high-risk:
- `database|db|query|sql` — Database layer
- `cache|redis|memcache` — Caching systems
- `auth|login|session|token` — Authentication
- `payment|billing|transaction` — Financial operations
- `queue|worker|job|task` — Background processing
- `api|endpoint|route` — API surface
- `config|settings|env` — Configuration
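The patterns above read as regular-expression alternations, so a changed-file list can be screened with ordinary regex search. This is a sketch under that assumption; the category labels and the `hot_paths_touched` helper are hypothetical, and the source does not say whether Prophet matches substrings or whole path segments.

```python
import re

# Patterns copied from the list above; the labels are shortened.
HOT_PATH_PATTERNS = {
    "database": r"database|db|query|sql",
    "cache": r"cache|redis|memcache",
    "auth": r"auth|login|session|token",
    "payments": r"payment|billing|transaction",
    "background": r"queue|worker|job|task",
    "api": r"api|endpoint|route",
    "config": r"config|settings|env",
}


def hot_paths_touched(files: list[str]) -> dict[str, list[str]]:
    """Map each changed file to the hot-path categories it matches."""
    hits = {}
    for path in files:
        matched = [
            label
            for label, pattern in HOT_PATH_PATTERNS.items()
            if re.search(pattern, path, flags=re.IGNORECASE)
        ]
        if matched:
            hits[path] = matched
    return hits


changed = ["src/db/migrations/001.sql", "src/api/routes.py", "docs/README.md"]
touched = hot_paths_touched(changed)
```

Plain substring search is deliberately naive here: a short pattern such as `env` or `job` will also match longer, unrelated words, so a real implementation may want to anchor on path segments.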

Historical Signals

| Signal | Description | Weight |
|--------|-------------|--------|
| `author_incident_rate` | Author's past deployment success | 15 |
| `similar_deployment_failures` | Pattern-matched failure rate | 25 |
| `recent_incident_proximity` | Incidents in past week | 18 |
| `regression_frequency` | How often changes regress | 20 |

System State Signals

| Signal | Description | Weight |
|--------|-------------|--------|
| `current_health` | Response time/error rate | 22 |
| `active_alerts` | Outstanding alerts | 20 |
| `resource_headroom` | CPU/memory available | 15 |
| `traffic_level` | Current load level | 12 |

Timing Signals

| Signal | Description | Weight |
|--------|-------------|--------|
| `deploy_hour_risk` | Time of day (avoid 9pm-5am) | 15 |
| `deploy_day_risk` | Day of week (avoid Fri-Sun) | 12 |
| `freeze_proximity` | Near change freeze? | 18 |
| `incident_proximity` | Time since last incident | 20 |
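The source does not state how a signal's value and weight combine into the Contribution column of the example Signal Breakdown. For the three signals whose weights are listed, the example figures are consistent with value × weight / 100, so the sketch below uses that as an assumed formula. The weight for `test_coverage_delta` is not listed in these tables and is left out.

```python
# Weights copied from the signal tables above (0-100 scale).
SIGNAL_WEIGHTS = {
    "migration_detected": 25,
    "hotpath_touched": 22,
    "deploy_day_risk": 12,
}


def contribution(signal: str, value_pct: float) -> float:
    """Assumed formula: signal value (0-100%) scaled by its weight."""
    return (value_pct / 100) * (SIGNAL_WEIGHTS[signal] / 100)


# Signal values taken from the example Signal Breakdown.
contributions = {
    "migration_detected": contribution("migration_detected", 90),
    "hotpath_touched": contribution("hotpath_touched", 60),
    "deploy_day_risk": contribution("deploy_day_risk", 50),
}
```

These come out at roughly 0.225, 0.132 and 0.06, which line up with the 0.23, 0.13 and 0.06 shown in the example once rounded for display.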

Risk Levels

| Level | Score | Meaning | Action |
|-------|-------|---------|--------|
| 🟢 SAFE | < 20 | No concerns | Deploy freely |
| 🔵 LOW | 20-40 | Minor considerations | Standard process |
| 🟡 MODERATE | 40-60 | Notable risks | Extra monitoring |
| 🟠 HIGH | 60-80 | Significant risks | Review carefully |
| 🔴 CRITICAL | > 80 | Major concerns | Consider delay |
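The bands above share their boundary values (20, 40 and 60 each appear in two rows), so any implementation has to pick a side. The sketch below assumes each boundary belongs to the higher band, except 80, which stays HIGH because CRITICAL is listed as "> 80". `risk_level` is a hypothetical helper.

```python
def risk_level(score: float) -> str:
    """Map a 0-100 risk score to the levels in the table above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score < 20:
        return "SAFE"
    if score < 40:
        return "LOW"
    if score < 60:
        return "MODERATE"
    if score <= 80:
        return "HIGH"
    return "CRITICAL"


example_level = risk_level(67)  # the score from the example output
```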

Go/No-Go Decisions

Based on risk analysis:
- GO — Risk is low, proceed with standard process
- CAUTION — Risk is moderate, proceed with extra monitoring
- NO_GO — Risk is high, address issues before deploying

CI/CD Integration

yaml
# GitHub Actions example
deploy:
  steps:
    - name: Analyze deployment risk
      run: |
        prophet deploy-risk --fail-on-high --record

    - name: Deploy (if safe)
      if: success()
      run: ./deploy.sh

    - name: Record outcome
      if: always()
      run: |
        if [ "${{ job.status }}" == "failure" ]; then
          prophet deploy-record $DEPLOYMENT_ID --incident-24h
        fi

Learning Loop

Prophet learns from every deployment:

bash
# After successful deployment
prophet deploy-record abc123def456

# After incident within 24h
prophet deploy-record abc123def456 --incident-24h

# After rollback
prophet deploy-record abc123def456 --rollback --incident-1h

Pattern Recognition

Over time, Prophet learns which patterns are risky for your codebase:

bash
prophet deploy-patterns

# Output:
# 📊 Learned Deployment Patterns
#
# **Changes to: src/db, src/api**
#    🔴 HIGH RISK
#    Deployments: 15 | Incidents: 5 (33%)
#    Avg Risk Score: 68
#
# **Changes to: src/utils**
#    🟢 SAFE
#    Deployments: 42 | Incidents: 0 (0%)
#    Avg Risk Score: 22

Why It Matters

Hope-based deployment: "It passed tests, ship it!" → 3 AM incident
Risk-aware deployment: "Risk is 72%…

Reactive incidents: "Why did this happen?" → Postmortem
Proactive prevention: "This looks similar to the Jan 15 outage" → Avoidance

Tribal knowledge: "Bob always deploys on Tuesday mornings" → Bus factor
Codified wisdom: "Tuesday 10am is 40%…

Manual judgment: "Seems fine to me" → Bias
Data-driven decisions: "67%…

Deployment risk prediction turns gut feelings into quantified decisions.

Adaptive Thresholds (Gen 6+ Evolution)

Static thresholds are the enemy of good alerting:
- Too tight → Alert fatigue, ignored warnings
- Too loose → Missed issues, angry users

Adaptive thresholds learn what "normal" looks like for YOUR system, accounting for:
- Time of day — Traffic spikes at 10am? Normal!
- Day of week — Quiet weekends? Different baseline
- Seasonal patterns — Holiday traffic? Adjusted
- System growth — Gradual metric creep over months
- Context — During deployment? Different expectations

Concept

Traditional: `if cpu > 75: alert()`
Adaptive: `if cpu > learned_p95_for_this_hour: alert()`

Thresholds adapt to reality, not the other way around.
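To make the `learned_p95_for_this_hour` idea concrete, here is a minimal sketch that groups historical samples by hour of day and takes a nearest-rank 95th percentile for each hour. The function names, the nearest-rank method and the static fallback of 75 are assumptions for illustration; the source does not describe Prophet's percentile method.

```python
import math
from collections import defaultdict
from datetime import datetime


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def learn_hourly_p95(samples):
    """Group (timestamp, value) samples by hour of day and take each p95."""
    by_hour = defaultdict(list)
    for ts, value in samples:
        by_hour[ts.hour].append(value)
    return {hour: percentile(vals, 95) for hour, vals in by_hour.items()}


def should_alert(value, ts, hourly_p95, fallback=75.0):
    """Alert when the value exceeds the learned p95 for that hour."""
    return value > hourly_p95.get(ts.hour, fallback)


# Hypothetical history: busy at 10:00, quiet at 03:00, 20 days each.
samples = [(datetime(2026, 1, d, 10, 0), 59.0 + d) for d in range(1, 21)]
samples += [(datetime(2026, 1, d, 3, 0), 19.0 + d) for d in range(1, 21)]
thresholds = learn_hourly_p95(samples)
```

With this history, a reading of 70 at 10:00 is unremarkable, while the same reading at 03:00 exceeds the learned value for that hour even though it sits below a static threshold of 75.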

Quick Start

bash
# View adaptive threshold status
prophet adaptive status

# Register a metric with base thresholds
prophet adaptive register cpu_percent 75 90 above
prophet adaptive register response_time 500 1000 above

# Learn optimal thresholds from history
prophet adaptive learn cpu_percent

# Check a value against adaptive thresholds
prophet adaptive check cpu_percent 82

# View growth trends
prophet adaptive growth

Context Windows

Prophet automatically detects time-based contexts:

| Context | Time | Description |
|---------|------|-------------|
| `peak` | 9am-6pm weekdays | High traffic periods |
| `off_peak` | 6am-9am, 6pm-11pm | Normal traffic |
| `quiet` | 12am-6am | Night/minimal traffic |
| `weekend` | Sat-Sun | Weekend patterns |
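A sketch of how a timestamp might be mapped onto these contexts. Two points are assumptions, since the table does not settle them: weekends take precedence over time of day, and the unlisted 11pm-12am hour is treated as `quiet`. `detect_context` is a hypothetical name.

```python
from datetime import datetime


def detect_context(ts: datetime) -> str:
    """Classify a timestamp into one of the contexts in the table above."""
    if ts.weekday() >= 5:  # Saturday=5, Sunday=6
        return "weekend"
    hour = ts.hour
    if 9 <= hour < 18:
        return "peak"
    if 6 <= hour < 9 or 18 <= hour < 23:
        return "off_peak"
    # 12am-6am is "quiet" in the table; 11pm-12am is not listed,
    # so treating it as "quiet" here is an assumption.
    return "quiet"


monday_morning = detect_context(datetime(2026, 2, 2, 10, 0))
```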

Manual context windows for special situations:

bash
# Starting a deployment (relax thresholds temporarily)
prophet adaptive context start deployment

# Deployment complete
prophet adaptive context end deployment

# Maintenance window
prophet adaptive context start maintenance
prophet adaptive context end maintenance

Learning from Feedback

The system learns from your feedback on alerts:

bash
# View recent violations
prophet adaptive violations

# Output:
# [42] cpu_percent = 78.50 (warning) @ 2025-01-15 10:30 [?]
# [41] memory_percent = 85.20 (warning) @ 2025-01-15 09:15 [?]
# [40] response_time = 520.00 (warning) @ 2025-01-14 14:22 [✓]

# Provide feedback: was it a valid alert?
prophet adaptive feedback 42 invalid   # False positive → relax thresholds
prophet adaptive feedback 41 valid     # Good catch → maintain thresholds
prophet adaptive feedback 40 valid 15  # Valid, resolved in 15 minutes

Threshold Suggestions

Prophet analyzes patterns and suggests threshold adjustments:

bash
prophet adaptive suggest

# Output:
# 🔴 [relax_thresholds] cpu_percent
#    High false positive rate (45%)
#    Suggest: warning=86.25, critical=103.50
#
# 📈 [growth_warning] memory_percent
#    Gradual growth (18.5%)
#    🚨 memory_percent has grown 18.5% over analyzed period.
#    Consider adjusting base thresholds upward.

Growth Detection

Catch the "boiling frog" scenario — metrics slowly creeping up over weeks:

bash
prophet adaptive growth

# Output:
# 📈 cpu_percent: growing (12.3%)
# ➡️ memory_percent: stable (-2.1%)
# 🚀 disk_usage: rapid_growth (35.7%)

How It Works

1. Percentile Learning — Calculates p95/p99 from historical data per context
2. Feedback Integration — Adjusts thresholds based on false positive/negative feedback
3. Growth Adjustment — Detects long-term metric drift
4. Confidence Scoring — Only uses learned thresholds when confidence is high

Threshold Bands Per Context

Each metric can have different thresholds for different contexts:

cpu_percent thresholds:
  peak:     warning=82, critical=95 (learned, 87% confidence)
  off_peak: warning=70, critical=88 (learned, 72% confidence)
  quiet:    warning=45, critical=70 (learned, 65% confidence)
  weekend:  warning=55, critical=80 (learned, 58% confidence)
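One way to combine per-context bands with the "only use learned thresholds when confidence is high" rule is to fall back to the registered base thresholds below a confidence cutoff. The sketch below assumes a cutoff of 0.70 and `>=` comparisons; neither is given in the source, and `Band`, `active_band` and `classify` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Band:
    warning: float
    critical: float
    confidence: float  # 0.0-1.0


BASE = Band(warning=75, critical=90, confidence=1.0)  # registered defaults
MIN_CONFIDENCE = 0.70  # hypothetical cutoff; the source gives no value

# Learned bands copied from the cpu_percent example above.
LEARNED = {
    "peak": Band(82, 95, 0.87),
    "off_peak": Band(70, 88, 0.72),
    "quiet": Band(45, 70, 0.65),
    "weekend": Band(55, 80, 0.58),
}


def active_band(context: str) -> Band:
    """Use the learned band only when its confidence is high enough."""
    learned = LEARNED.get(context)
    if learned is None or learned.confidence < MIN_CONFIDENCE:
        return BASE
    return learned


def classify(value: float, context: str) -> str:
    """Return "ok", "warning" or "critical" for a value in a context."""
    band = active_band(context)
    if value >= band.critical:
        return "critical"
    if value >= band.warning:
        return "warning"
    return "ok"


peak_result = classify(82, "peak")
```

Under these assumptions a reading of 78 is fine during `peak` but raises a warning during `quiet`, where the low-confidence learned band is ignored in favour of the base thresholds.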

Status Report

bash
prophet adaptive status

# Output:
# 🎯 Adaptive Thresholds Status
#
# **Metrics Tracked:** 5
# **Metrics Adapted:** 3
# **Violations (30d):** 47
# **False Positive Rate:** 18.5%
# **Pending Feedback:** 12
#
# ## Per-Metric Status
#
# ### 🟢 cpu_percent (✓ adapted)
# - Warning: 75.00 → **82.50**
# - Critical: 90.00 → **97.35**
# - Context: peak | Confidence: 87%
# - Violations: 12 | FP: 2 | Pending: 1

Why It Matters

Static thresholds: "CPU > 75%…
Adaptive thresholds: "CPU > p95 for this hour" → Alerts only for anomalies

One-size-fits-all: "Response time > 500ms" → 3am alert for a slow batch job
Context-aware: "Response time > 200ms at 3am" → Catch actual issues

Set and forget: Thresholds drift out of touch with system reality
Continuous learning: Thresholds evolve with your system

Alert fatigue: "Just ignore CPU alerts, they're always noisy"
Actionable alerts: "If Prophet alerts, it's probably real"

Adaptive thresholds turn monitoring from a firehose into a precision instrument.

Database Schema

Adaptive thresholds data is stored in the same `prophet.db`:

  • `adaptive_metrics` — Metric configurations and learned bands
  • `context_windows` — Active context windows (deployments, maintenance)
  • `threshold_violations` — Historical violations for learning
  • `hourly_aggregates` — Per-hour percentile data for learning
  • `metric_growth` — Weekly growth tracking for drift detection

Anomaly Fingerprints (Gen 6+ Evolution)

Incident pattern recognition — Prophet learns to recognize recurring problems.

When an anomaly looks like something you've seen before, you shouldn't have to figure it out from scratch. Fingerprints capture the "signature" of incidents so Prophet can say: "This looks 87%…

Concept

A fingerprint captures:
- Which metrics were affected (and how)
- The pattern shape (spike, sustained, oscillating)
- Magnitude (how far from normal)
- Causality chain (which metric went first)
- Time context (peak hours? weekend?)
- Resolution (what fixed it)
- TTR (how long to resolve)

When new anomalies occur, Prophet matches them against the library and suggests proven fixes.

Quick Start

bash
# List all fingerprints
prophet fingerprint list

# Capture current incident as a fingerprint
prophet fingerprint capture "Memory leak in API" --resolution "Restart pods" --tags memory,api

# Match current anomalies against library
prophet fingerprint match

# Show fingerprint details
prophet fingerprint show fp_abc123

# Update a fingerprint
prophet fingerprint update fp_abc123 --resolution "Restart pods AND fix query in /api/heavy"

# View library statistics
prophet fingerprint stats

Capturing Fingerprints

Capture the current anomalous state:

bash
# Basic capture
prophet fingerprint capture "DB connection exhaustion"

# With resolution and metadata
prophet fingerprint capture "Cache miss storm" \
  --lookback 45 \
  --resolution "Increase Redis maxmemory, add cache warming" \
  --tags cache,redis,performance \
  --severity critical

Options:
- `--lookback N` — Minutes of data to capture (default: 30)
- `--resolution "text"` — What fixed the issue
- `--tags tag1,tag2` — Categorization tags
- `--severity info|warning|critical` — Severity level

Matching Current State

When metrics go anomalous, check for known patterns:

bash
prophet fingerprint match

# Output:
# 🔍 Found 2 matching fingerprint(s)
#
# 🟢 **Memory leak in API service** (87% match)
#    Matched: memory_percent, response_time, gc_pause_ms
#    Missing: none
#    💡 Suggested: Restart API pods, investigate /api/heavy-query
#    Last seen: 12d ago
#
# 🟡 **Cache invalidation cascade** (64% match)
#    Matched: memory_percent, response_time
#    Missing: cache_hit_rate
#    Extra: gc_pause_ms
#    💡 Suggested: Check cache TTL settings
#    Last seen: 23d ago

Confidence levels:
- 🟢 high (≥85%)
- 🟡 medium (65-84%)
- 🟠 low (50-64%)

Managing Fingerprints

bash
# Show detailed fingerprint
prophet fingerprint show fp_abc123

# Update metadata
prophet fingerprint update fp_abc123 \
  --resolution "Better fix: Add connection pooling" \
  --ttr 25 \
  --notes "Happens after deployments with DB migrations"

# Delete a fingerprint
prophet fingerprint delete fp_abc123

# Find similar fingerprints (merge candidates)
prophet fingerprint similar fp_abc123

# Merge similar fingerprints
prophet fingerprint merge fp_abc123 fp_def456 --name "API memory issues (merged)"

Feedback Loop

Improve matching accuracy with feedback:

bash
# When a match was helpful
prophet fingerprint feedback fp_abc123 correct

# When a match was wrong
prophet fingerprint feedback fp_abc123 incorrect "Different root cause - was network latency"

Statistics

bash
prophet fingerprint stats

# Output:
# 📊 Fingerprint Library Statistics
#
#    Total fingerprints: 24
#    Unique metrics tracked: 8
#    Total occurrences: 156
#    Average TTR: 18.5 minutes
#
#    By severity:
#      🔵 info: 5
#      🟡 warning: 14
#      🔴 critical: 5
#
#    Most recurring patterns:
#      • Memory leak in API: 23 matches
#      • DB connection pool exhaustion: 18 matches
#      • Cache miss storm: 12 matches

How Similarity Matching Works

1. Metric overlap — Jaccard similarity between affected metrics
2. Direction match — Same pattern type (spike, sustained, etc.)
3. Magnitude similarity — Similar z-scores
4. Pattern hash — Shape of the anomaly curve
5. Time context bonus — Same time of day/week

Weights: Metric overlap (40%)…
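The metric-overlap component is described as Jaccard similarity, which is the size of the intersection of two sets divided by the size of their union. A minimal sketch, using the metric names from the "Cache invalidation cascade" match shown earlier (returning 0.0 for two empty sets is a choice made here, not something the source specifies):

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


current = {"memory_percent", "response_time", "gc_pause_ms"}
fingerprint = {"memory_percent", "response_time", "cache_hit_rate"}
overlap = jaccard(current, fingerprint)  # 2 shared metrics out of 4 distinct
```

This is only one of the five components listed above, so it will not by itself reproduce the match percentages Prophet reports.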

Resolution Playbook

Get a full resolution playbook for a known incident type:

bash
prophet fingerprint show fp_abc123

# Output includes:
# 📋 Resolution Playbook:
#    Resolution: Restart API pods, investigate /api/heavy-query
#    Expected TTR: 15 minutes
#    Occurrences: 23
#    Match accuracy: 87% (19 matches)

Integration with Alerts

When Prophet detects anomalies, it automatically checks fingerprints:

bash
# In HEARTBEAT.md or monitoring:
prophet fingerprint match --json | jq '.matches[0].suggested_resolution'
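The same lookup can be done from a script instead of `jq`. The sketch below parses a sample payload; only the `matches[0].suggested_resolution` path is implied by the filter above, and every other field name in the sample is an assumption made for illustration.

```python
import json
from typing import Optional

# Hypothetical payload: only `matches[0].suggested_resolution` is implied
# by the jq filter above; every other field name is an assumption.
SAMPLE_OUTPUT = """
{
  "matches": [
    {"name": "Memory leak in API service", "similarity": 0.87,
     "suggested_resolution": "Restart API pods, investigate /api/heavy-query"},
    {"name": "Cache invalidation cascade", "similarity": 0.64,
     "suggested_resolution": "Check cache TTL settings"}
  ]
}
"""


def top_resolution(raw: str) -> Optional[str]:
    """Return the first match's suggested resolution, or None if no matches."""
    matches = json.loads(raw).get("matches", [])
    if not matches:
        return None
    return matches[0].get("suggested_resolution")


resolution = top_resolution(SAMPLE_OUTPUT)
```

Handling the empty-matches case explicitly avoids the silent `null` that the `jq` one-liner would print.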

Why It Matters

Reinventing the wheel: "What's causing this? I feel like I've seen this before..."
Pattern recognition: "This is 92%…

Tribal knowledge: "Ask Bob, he fixed this last time"
Institutional memory: "Prophet remembers every incident and what fixed it"

Slow MTTR: "Let me investigate from scratch..."
Fast MTTR: "Known pattern, known fix, 15-minute resolution expected"

Isolated incidents: "Each incident is a one-off"
Pattern awareness: "This is the 5th occurrence — time to fix the root cause"

Fingerprints turn your incident history into a diagnostic superpower.

Database Schema

Fingerprint data is stored in `prophet.db`:

  • `fingerprints` — Main fingerprint records
  • `fingerprint_metrics` — Inverted index for fast lookups
  • `fingerprint_matches` — Match history for learning
  • `pattern_shapes` — Cached pattern hashes for shape matching

Simulation Engine (Gen 6+ Evolution)

War-game scenarios before they happen. Ask "what if?" and get predictions.

The Problem:
- Capacity planning is guesswork: "Will we survive Black Friday?"
- Incident response is reactive: "What happens if we lose a server?"
- Architecture decisions are risky: "Will this scaling strategy work?"

The Solution:
Simulate scenarios using the causality graph and prediction models to forecast cascade effects before they happen.

Quick Start

bash
# List available scenarios
prophet simulate list

# Run a preset scenario
prophet simulate run traffic_2x
prophet simulate run server_loss
prophet simulate run memory_leak

# Quick what-if analysis
prophet simulate what-if memory_percent 95
prophet simulate what-if cpu_percent 90

# Find breaking points
prophet simulate breaking-point cpu_percent
prophet simulate breaking-point memory_percent

# Compare scenarios (which is safer?)
prophet simulate compare traffic_2x server_loss

Preset Scenarios

| Preset | Description | Type |
|--------|-------------|------|
| `traffic_2x` | Traffic doubles over 5 min | Traffic Spike |
| `traffic_5x` | 5x traffic (Black Friday) | Traffic Spike |
| `server_loss` | One server goes down | Resource Loss |
| `scale_up_2x` | Double capacity | Resource Gain |
| `memory_leak` | Gradual memory increase | Degradation |
| `db_saturation` | DB connections hit max | Metric Injection |
| `cascade_test` | Force high memory | Metric Injection |
| `chaos_mild` | Random 10%… | |
| `chaos_severe` | Random 30%… | |

Simulation Output

============================================================
SIMULATION: Double Traffic
============================================================

🟠 Impact: SIGNIFICANT
📊 Risk Score: 84/100
⏱️  Duration: 60 minutes

🚨 Threshold Breaches:
   • response_time: 520.0 (threshold: 500.0)
   • request_rate: 10000.0 (threshold: 10000.0)

🔗 Cascade Effects:
   cpu_percent → response_time
   response_time → error_rate
   queue_depth → response_time

📈 SLO Impact:
   latency_p99: █████░░░░░ 52%
   error_budget: ░░░░░░░░░░ 3%

⚠️  Warnings:
   ⚠️ Deep cascade detected (3 levels) - failure can propagate quickly

💡 Recommended Mitigations:
   1. Pre-scale horizontally before expected traffic increase
   2. Enable auto-scaling with appropriate thresholds
   3. Add response caching for common requests

Breaking Point Analysis

Find where your system breaks:

bash
prophet simulate breaking-point cpu_percent

# Output:
# 📊 Breaking Point Analysis: cpu_percent
#    Baseline value: 42.5
#    ⚠️  Breaks at: 2.5x (106.3)
#    📏 Headroom: 150%
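The figures in this output are related by simple arithmetic: the breaking value is the baseline times the multiplier, and headroom is the multiplier minus one, expressed as a percentage. A short sketch (`breaking_point` is a hypothetical helper, and finding the multiplier itself is the part Prophet's simulation does):

```python
def breaking_point(baseline: float, multiplier: float) -> tuple[float, float]:
    """Return (value at which the metric breaks, headroom in percent)."""
    breaks_at = baseline * multiplier
    headroom_pct = (multiplier - 1) * 100
    return breaks_at, headroom_pct


# Baseline and multiplier taken from the example output above.
breaks_at, headroom = breaking_point(42.5, 2.5)
```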

Comparing Scenarios

Choose between strategies by comparing simulated outcomes:

bash
prophet simulate compare traffic_2x scale_up_2x chaos_mild

# Output:
# 🟢 SAFEST: Double Capacity
#    Risk Score: 15/100
#    Impact: minimal
#
# 🔴 RISKIEST: Double Traffic
#    Risk Score: 84/100
#    Impact: significant
#
# All Results (sorted by risk):
#    1. 🟢 Double Capacity: 15/100 (minimal)
#    2. 🟡 Mild Chaos: 45/100 (moderate)
#    3. 🔴 Double Traffic: 84/100 (significant)

Impact Levels

| Level | Risk Score | Meaning |
|-------|------------|---------|
| 🟢 NONE | 0-10 | No predicted impact |
| 🟢 MINIMAL | 10-25 | Minor deviations, no breaches |
| 🟡 MODERATE | 25-50 | Some metrics elevated |
| 🟠 SIGNIFICANT | 50-70 | Threshold breaches predicted |
| 🔴 SEVERE | 70-85 | Multiple cascading failures |
| 💀 CRITICAL | 85-100 | System-wide impact expected |

How It Works

1. Load Current State — Gets latest metric values and baselines
2. Apply Scenario — Modifies metrics according to scenario definition
3. Propagate Cascades — Uses causality graph to predict downstream effects
4. Check Thresholds — Identifies which metrics breach thresholds
5. Estimate SLO Impact — Predicts probability of SLO violations
6. Generate Mitigations — Recommends actions based on scenario type
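Step 3 can be pictured as a graph traversal. The sketch below walks a causality graph breadth-first from the metric a scenario changes and records how many hops away each downstream metric is. The edges are copied from the sample simulation output; the traversal is an illustrative assumption, and it ignores the lags and strengths a real causality graph would carry.

```python
from collections import deque

# Edges copied from the "Cascade Effects" in the sample output above.
CAUSALITY = {
    "cpu_percent": ["response_time"],
    "queue_depth": ["response_time"],
    "response_time": ["error_rate"],
}


def cascade(start: str, graph: dict[str, list[str]]) -> dict[str, int]:
    """Breadth-first walk: downstream metric -> hops from `start`."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        metric = queue.popleft()
        for downstream in graph.get(metric, []):
            if downstream not in depth:  # also guards against cycles
                depth[downstream] = depth[metric] + 1
                queue.append(downstream)
    del depth[start]
    return depth


affected = cascade("cpu_percent", CAUSALITY)
```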

Integration with Capacity Planning

bash
# Before expected traffic event
prophet simulate run traffic_5x

# If risk is high, test scaling
prophet simulate run scale_up_2x
prophet simulate compare traffic_5x scale_up_2x

Custom Scenarios

Create custom scenarios via CLI:

bash
prophet simulate custom \
  --name "Peak Hour Load" \
  --metric cpu_percent 85 \
  --metric memory_percent 80 \
  --duration 120

Why It Matters

Reactive capacity: "We crashed during Black Friday" → Postmortem
Proactive capacity: "Black Friday simulation shows 84%…

Hope-based architecture: "This should handle the load" → Guesswork
Evidence-based architecture: "Breaking point is 2.5x, we need headroom" → Data

Single-point thinking: "If we add servers..." → One variable
Cascade awareness: "Adding servers affects CPU, which affects response time, which..." → Full picture

Expensive failures: "Let's find out in production" → $$$
Safe failures: "Let's simulate first" → Free

Simulation turns architecture decisions from gut feelings into calculated risks.

---

User Experience Impact (Gen 6 Evolution)

Translates technical metrics into actual user impact. "Your p99 latency is 2.3s" becomes "~4,200 users/hour are likely abandoning checkout."

Concept

Engineers think in metrics. Users think in experience. Business thinks in dollars. User Impact bridges all three:

Technical: response_time = 2500ms
    ↓
User: "Frustrating - attention wanders"
    ↓
Business: ~70 abandonments/hr, $4,350/hr revenue loss

Key Insight

Users don't complain about "high latency." They say "this is slow" and leave. By the time you see it in metrics, they're already gone. User Impact quantifies the invisible damage.

Commands

bash
# System-wide impact summary
prophet impact

# List all configured user journeys
prophet impact-journey

# Impact for specific journey
prophet impact-journey checkout

# Impact of a specific metric
prophet impact-metric response_time --value 2500 --journey checkout

# Add or update a user journey
prophet impact-add --name payment \
  --stage CONVERSION \
  --traffic 500 \
  --conversion-rate 0.8 \
  --revenue 120 \
  --latency-target 300 \
  --latency-sensitivity 2.0

# Show impact trend over time
prophet impact-trend
prophet impact-trend checkout --hours 48

User Journeys

Pre-configured journeys map metrics to business context:

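| Journey | Stage | Traffic | Target Latency | Revenue/Conversion |
|---------|-------|---------|----------------|--------------------|
| `homepage` | Discovery | 10k/hr | 300ms | $0 |
| `search` | Discovery | 5k/hr | 200ms | $0 |
| `product_view` | Consideration | 8k/hr | 500ms | $0 |
| `checkout` | Conversion | 1k/hr | 400ms | $75 |
| `api` | Engagement | 50k/hr | 100ms | $0.01 |
| `login` | Engagement | 2k/hr | 300ms | $0 |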

Impact Severity Levels

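| Level | Icon | Frustration | User Experience |
|-------|------|-------------|-----------------|
| IMPERCEPTIBLE | | 0-5 | Users don't notice |
| MINOR | 🟢 | 5-15 | Some slight annoyance |
| NOTICEABLE | 🟡 | 15-35 | Many users frustrated |
| SIGNIFICANT | 🟠 | 35-60 | Complaints likely |
| SEVERE | 🔴 | 60-85 | Business-critical |
| CATASTROPHIC | 💀 | 85-100 | Stop everything |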

Response Time Psychology

Based on UX research (Nielsen, Google, Akamai):

| Threshold | User Perception |
|-----------|-----------------|
| < 100ms | Instantaneous - direct control |
| < 300ms | Smooth - unnoticeable delay |
| < 1s | Noticeable - flow maintained |
| < 3s | Frustrating - attention wanders |
| < 8s | Painful - considering abandoning |
| > 10s | Broken - most users leave |

Abandonment Curve

Response time to abandonment probability (research-based):

Response Time → Abandonment
     0.5s    →    1%
     1.0s    →    3%
     2.0s    →    7%
     3.0s    →   13%
     4.0s    →   20%
     5.0s    →   28%
     6.0s    →   35%
     8.0s    →   45%
    10.0s    →   53%
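The source does not say whether Prophet interpolates between points on this curve. The checkout example elsewhere in this document reports 7.0% abandonment at 2500 ms, which matches the 2.0s row rather than a value between 7% and 13%, so the sketch below assumes a step lookup: use the highest breakpoint at or below the response time. Treating anything faster than 0.5s as 0% is also an assumption.

```python
from bisect import bisect_right

# (response time in seconds, abandonment probability) from the curve above.
CURVE = [(0.5, 0.01), (1.0, 0.03), (2.0, 0.07), (3.0, 0.13), (4.0, 0.20),
         (5.0, 0.28), (6.0, 0.35), (8.0, 0.45), (10.0, 0.53)]
BREAKPOINTS = [seconds for seconds, _ in CURVE]


def abandonment_rate(response_ms: float) -> float:
    """Step lookup: use the highest breakpoint at or below the response time."""
    index = bisect_right(BREAKPOINTS, response_ms / 1000)
    if index == 0:
        return 0.0  # faster than 0.5s: assumed negligible
    return CURVE[index - 1][1]


rate = abandonment_rate(2500)
abandonments_per_hour = 1000 * rate  # for a journey with 1,000 users/hr
```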

Example Output

╔════════════════════════════════════════════╗
║     👤 USER EXPERIENCE IMPACT SUMMARY      ║
╚════════════════════════════════════════════╝

UX Score: 🟡🟡🟡⚪⚪ 60/100 (Degraded)
Frustration Index: 40/100

👥 Users Affected: ~50,000/hr
🚪 Max Abandonment: 3.0%

💸 Hourly Revenue Impact: $1,894
📊 Projected Daily Loss: $45,465

📈 Predicted UX Score: 55/100
⚠️  Critical in: ~2.0 hours

🔻 Most Affected Journeys:
   • checkout: 60/100 health
   • search: 66/100 health
   • api: 71/100 health

🚨 Critical Issues:
   • response_time on checkout: 40 frustration

Metric-Specific Impact

🔴 **checkout** — SEVERE
   Metric: response_time = 2500.0
   Frustration Score: 70/100

   👥 Affected Users: ~1,000/hr
   🚪 Abandonment Rate: 7.0%
   ❌ Est. Abandonments: ~70/hr

   💸 Lost Conversions: ~58/hr
   💰 Revenue Impact: $4,350/hr

   🎯 User Experience: Response: Frustrating - attention wanders

Customizing Journeys

Add your own journeys to match your product:

bash
# High-value subscription flow
prophet impact-add --name subscription_checkout \
  --stage CONVERSION \
  --traffic 200 \
  --conversion-rate 0.4 \
  --revenue 299 \
  --latency-target 250 \
  --latency-sensitivity 2.5 \
  --error-sensitivity 3.0 \
  --metrics "payment_latency,subscription_api"

Integration with Predictions

User Impact connects to Prophet's predictions:

  • If latency is predicted to increase, calculates future user impact
  • Shows "time to noticeable" — when users will start complaining
  • Combines with fingerprints to show impact of known incident patterns

Why It Matters

Technical alerting: "Response time crossed 2s threshold" → Engineer context
User impact alerting: "70 users abandoning checkout/hour, $4,350/hr loss" → Everyone understands

Metric overload: "We have 50 metrics in alert state" → What matters?
Impact prioritization: "Checkout at 70 frustration, homepage at 15" → Clear priority

Reactive support: "Users are complaining" → Already too late
Proactive protection: "UX score dropping, critical in 2 hours" → Time to act

Vague post-mortems: "There was degraded performance" → Hand-wavy
Quantified impact: "2-hour incident = 5,600 abandonments, $87,000 revenue impact" → Concrete

User Impact transforms abstract metrics into human stories and business reality.

Incident Memory (Gen 6 Evolution)

"Those who cannot remember the past are condemned to repeat it." — George Santayana

Most production incidents aren't novel — they're variations of past issues. Incident Memory gives Prophet a long-term memory of past incidents, enabling it to recognize when similar conditions are emerging and surface historical context + successful remediations.

Key Insight

When you see "response_time climbing, memory_percent at 82%…

Incident Memory answers: "This looks like the 2024-01-15 database connection exhaustion incident. Last time, increasing pool size to 100 fixed it in 12 minutes."

Core Features

| Feature | Description |
|---------|-------------|
| Incident Recording | Store incidents with full context (metrics, timing, root cause, resolution) |
| Signature Matching | Multi-dimensional similarity scoring finds past incidents matching current conditions |
| Learning Extraction | Automatically extracts root causes, remediations, and lessons from resolved incidents |
| Recurrence Detection | Identifies incidents that happen periodically and predicts next occurrence |
| Confidence Weighting | Recent incidents with clear outcomes get higher confidence |

CLI Commands

bash
# View memory statistics
prophet memory stats

# Recent incidents
prophet memory recent [limit]

# Search incidents by keyword
prophet memory search "database"

# Record a new incident (interactive)
prophet memory record

# Resolve an incident
prophet memory resolve <incident_id> "Increased connection pool"

# Check for predicted incidents based on current state
prophet memory predict

# View learnings (root causes, remediations, preventions)
prophet memory learnings [incident_id]

# Add a learning to an incident
prophet memory add-learning <incident_id> remediation "Restart the worker pods"

# View recurring patterns
prophet memory recurring

# Auto-record from current breached metrics
prophet memory-auto --title "API slowdown"

Incident Categories

| Category | Description |
|----------|-------------|
| `latency` | Response time / slowdowns |
| `availability` | Outages, errors, failures |
| `saturation` | Resource exhaustion (CPU, memory, connections) |
| `traffic` | Load spikes, DDoS |
| `deployment` | Deploy-induced issues |
| `external` | Third-party failures |
| `cascade` | Multi-service failures |

Severity Levels (PagerDuty-style)

| Level | Meaning |
|-------|---------|
| `sev1` | Critical: customer-facing, major impact |
| `sev2` | Major: significant degradation |
| `sev3` | Minor: limited impact |
| `sev4` | Informational: no user impact |
| `sev5` | Noise: not really an incident |

How Similarity Matching Works

Incident Memory uses multi-dimensional matching:

1. Primary Metrics (35%)
2. Pattern Shapes (25%)
3. Threshold Values (15%)
4. Temporal Context (10%)
5. Root Causes (15%)
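The five figures above sum to 100, so they read naturally as percentage weights. Assuming each dimension yields a score between 0.0 and 1.0, the overall similarity can be sketched as a weighted sum. The per-dimension scores below are made up for illustration, and how Prophet computes each one is not covered here.

```python
# Weights from the list above, read as percentages (they sum to 100).
WEIGHTS = {
    "primary_metrics": 0.35,
    "pattern_shapes": 0.25,
    "threshold_values": 0.15,
    "temporal_context": 0.10,
    "root_causes": 0.15,
}


def similarity(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each in the range 0.0-1.0."""
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)


# Hypothetical per-dimension scores for one past incident.
example = similarity({
    "primary_metrics": 1.0,
    "pattern_shapes": 0.8,
    "threshold_values": 0.6,
    "temporal_context": 1.0,
    "root_causes": 0.0,
})
```

Missing dimensions count as zero in this sketch, so an incident with no recorded root cause can score at most 0.85.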

Recurrence Detection

After 2+ similar incidents, Memory detects recurring patterns:

⚠️  Memory exhaustion during batch processing
   Type: daily (every 24.0h)
   Next predicted: 2024-02-03 03:00:00
   Confidence: 78%
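A simple way to arrive at a "daily (every 24.0h)" pattern and a next-occurrence estimate is to average the gaps between past occurrences and add that interval to the most recent one. This is a sketch of that idea with a made-up history; the source does not describe Prophet's actual recurrence model or how the confidence figure is derived.

```python
from datetime import datetime, timedelta
from statistics import mean


def predict_next(occurrences: list[datetime]) -> tuple[float, datetime]:
    """Return (mean interval in hours, predicted next occurrence)."""
    if len(occurrences) < 2:
        raise ValueError("need at least two occurrences")
    ordered = sorted(occurrences)
    gaps = [
        (later - earlier).total_seconds() / 3600
        for earlier, later in zip(ordered, ordered[1:])
    ]
    interval_hours = mean(gaps)
    return interval_hours, ordered[-1] + timedelta(hours=interval_hours)


# Hypothetical history of a nightly batch-job incident.
history = [datetime(2024, 1, 31, 3, 0), datetime(2024, 2, 1, 3, 0),
           datetime(2024, 2, 2, 3, 0)]
interval, next_expected = predict_next(history)
```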

Learning Types

| Type | Example |
|------|---------|
| `root_cause` | "Database connection pool exhausted" |
| `remediation` | "Increased pool size to 100" |
| `prevention` | "Add connection pool monitoring" |
| `lesson` | "Always check connection usage before deploying" |

Remediations track effectiveness scores — how quickly did they resolve the issue?

Example Prediction

⚠️  INCIDENT PREDICTION

Category: saturation
Severity: sev2
Probability: 72%

Warning Signs:
  • Similar to Database connection exhaustion: metrics: memory_percent, db_connections
  • memory_percent pattern: gradual_rise
  • Recurring pattern: occurs daily, next expected 2024-02-03 03:00

Similar Past Incidents:
  • Database connection exhaustion (78% match)
  • Memory pressure incident (62% match)

Suggested Actions:
  ✓ Increase connection pool size before peak
  ✓ Pre-emptive: Restart worker pods
  ✓ Enable connection pool monitoring

📚 LEARNINGS FROM SIMILAR INCIDENTS

Root Causes (most common):
  → Connection pool exhausted under load (seen 5x)
  → Batch job holding connections (seen 3x)

Effective Remediations:
  ✓ Increase pool size to 100 (85% effective)
  ✓ Add connection timeout (78% effective)

Prevention Measures:
  🛡 Monitor connection pool utilization
  🛡 Set alerts at 70% pool usage

Integration with Other Prophet Features

  • Fingerprints: Signatures incorporate fingerprint patterns for matching
  • Causality: Root cause metrics from causality discovery feed into signatures
  • User Impact: Incidents record user impact scores for prioritization
  • Remediation: Learnings feed into remediation recommendations
  • Adaptive Thresholds: Recurring patterns inform threshold adjustments

Why It Matters

Without Memory:
- "This is slow" → Investigate from scratch
- No context on past resolutions
- Same issues repeat without recognition

With Memory:
- "This is slow" → "Last 3 times this happened, it was connection pool exhaustion"
- Historical remediations surface immediately
- Recurring patterns get predicted before they happen

Post-Mortem Value:
- Forced documentation of incidents during resolution
- Learnings accumulate over time
- New team members can search past incidents for context

Prevention Focus:
- Recurring patterns trigger proactive alerts
- Prevention measures surface before incidents
- Time-of-day patterns enable scheduled mitigation

Incident Memory transforms Prophet from a predictor to a learning system that gets smarter with every incident.

Confidence Calibration (Gen 6 Evolution)

"Trust, but verify" — Predictions are only useful if confidence scores are accurate.

When Prophet says "85% confident," it should be right about 85% of the time.

The Problem

Uncalibrated predictions erode trust:
- Overconfident: Says 90%, but is right less often than that
- Underconfident: Says 50%, but is right more often than that

Calibration ensures confidence = actual accuracy.

CLI Commands

bash
# Show overall calibration statistics
prophet calibration stats

# Output:
# ╔════════════════════════════════════════════════════════════╗
# ║              📊 CALIBRATION STATISTICS                     ║
# ╚════════════════════════════════════════════════════════════╝
#
# 📝 Total Predictions Tracked: 247
# ⏳ Pending Evaluation: 12
# ✅ Evaluated: 235
#
# 🎯 Overall Accuracy: 78.3%
#    Correct: 184 | Incorrect: 51
# 📊 Average Confidence: 76.1%
# 📈 Recent 7-Day Accuracy: 81.2%

# Detailed calibration report
prophet calibration report

# Output:
# ╔════════════════════════════════════════════════════════════╗
# ║            🎯 CONFIDENCE CALIBRATION REPORT               ║
# ╚════════════════════════════════════════════════════════════╝
#
# ┌─────────────────────────────────────────────────────────┐
# │ CALIBRATION BY CONFIDENCE BUCKET                        │
# ├────────────┬───────┬─────────┬──────────┬───────────────┤
# │ Confidence │ Count │ Correct │ Accuracy │ Calibration   │
# ├────────────┼───────┼─────────┼──────────┼───────────────┤
# │   0%-10%   │     5 │       0 │       0% │ 🎯 Excellent  │
# │  10%-20%   │     8 │       2 │      25% │ ✅ Good       │
# │  20%-30%   │    12 │       3 │      25% │ ⚠️ Fair       │
# │  30%-40%   │    18 │       8 │      44% │ ✅ Good       │
# │  40%-50%   │    22 │      11 │      50% │ 🎯 Excellent  │
# │  50%-60%   │    35 │      21 │      60% │ ✅ Good       │
# │  60%-70%   │    42 │      27 │      64% │ ⚠️ Fair       │
# │  70%-80%   │    48 │      39 │      81% │ ✅ Good       │
# │  80%-90%   │    31 │      26 │      84% │ 🎯 Excellent  │
# │  90%-100%  │    14 │      13 │      93% │ 🎯 Excellent  │
# └────────────┴───────┴─────────┴──────────┴───────────────┘

# Filter by prediction type
prophet calibration report --type threshold_breach --days 60

# Show predictions awaiting outcome evaluation
prophet calibration pending

# Manually record an outcome
prophet calibration resolve pred_20260202_abc123 correct 512.5 "Threshold was breached as predicted"

# Update adjustment factors (recalculate from outcomes)
prophet calibration update

# Show reliability by prediction type
prophet calibration reliability

# Track a prediction manually (for testing)
prophet calibration track threshold_breach response_time 0.85 30

Prediction Types Tracked

| Type | What It Predicts | Evaluation Method |
|---|---|---|
| `threshold_breach` | Will metric cross threshold? | Compare to actual value |
| `anomaly` | Is this an anomaly? | Manual confirmation |
| `trend_direction` | Will metric go up/down? | Compare direction |
| `cascade` | Will one metric cause another to degrade? | Check both metrics |
| `ttr` | Time-to-recovery estimate | Compare to actual duration |
| `similar_incident` | Pattern match to past incident | Confirm root cause match |
| `capacity` | Will we run out of capacity? | Check capacity state |
| `deployment_risk` | Will deployment cause issues? | Post-deployment review |

Calibration Metrics

Expected Calibration Error (ECE):
- Weighted average of |confidence - accuracy| per bucket
- Lower is better (0 = perfect calibration)
- < 0.05 Excellent | < 0.10 Good | < 0.20 Fair | ≥ 0.20 Poor

Maximum Calibration Error (MCE):
- Worst bucket's calibration error
- Shows where predictions are most unreliable

Brier Score:
- Mean squared error between the predicted confidence and the actual outcome (1 = correct, 0 = incorrect)
- Lower is better (0 = perfect)
- Combines calibration AND accuracy
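
The three metrics above can be computed directly from tracked (confidence, outcome) pairs. The following is a minimal sketch under stated assumptions, not Prophet's actual implementation: `Outcome` and `calibration_metrics` are hypothetical names, and the ten equal-width buckets simply mirror the calibration report table.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    confidence: float  # stated confidence in [0, 1]
    correct: bool      # whether the prediction turned out to be right


def calibration_metrics(outcomes, n_buckets=10):
    """Return (ece, mce, brier) for a non-empty list of Outcome records."""
    if not outcomes:
        raise ValueError("need at least one evaluated prediction")

    buckets = [[] for _ in range(n_buckets)]
    for o in outcomes:
        # Clamp so that confidence == 1.0 lands in the last bucket.
        idx = min(int(o.confidence * n_buckets), n_buckets - 1)
        buckets[idx].append(o)

    total = len(outcomes)
    ece, mce = 0.0, 0.0
    for bucket in buckets:
        if not bucket:
            continue
        avg_conf = sum(o.confidence for o in bucket) / len(bucket)
        accuracy = sum(o.correct for o in bucket) / len(bucket)
        gap = abs(avg_conf - accuracy)
        ece += (len(bucket) / total) * gap  # weighted by bucket size
        mce = max(mce, gap)                 # worst single bucket

    brier = sum((o.confidence - float(o.correct)) ** 2 for o in outcomes) / total
    return ece, mce, brier
```

For example, ten predictions stated at 0.9 of which only six were correct give an ECE and MCE of 0.30, which falls in the "Poor" band.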

Adjustment Factor:
- Multiplier to apply to raw confidence
- < 1.0 = overconfident, reduce confidence
- > 1.0 = underconfident, increase confidence

Automatic Integration

Calibration integrates automatically with other Prophet modules:

python
from calibration import CalibrationEngine, CalibratedPredictor, PredictionType

# Wrap any predictor with calibration
engine = CalibrationEngine()
predictor = CalibratedPredictor(engine)

# Make a calibrated prediction
# Raw confidence 0.85 might become 0.78 based on historical accuracy
adjusted_conf = predictor.predict_threshold_breach(
    metric="response_time",
    raw_confidence=0.85,
    horizon_minutes=30,
    threshold=500
)
# Prediction is automatically tracked for later evaluation

Auto-Evaluation

For threshold and trend predictions, calibration can automatically evaluate outcomes:

python
from calibration import CalibrationEngine

engine = CalibrationEngine()

# Auto-evaluate threshold breach predictions.
# current_metrics is a placeholder for your own {metric_name: latest_value} mapping.
def get_metric_value(metric):
    return current_metrics[metric]

engine.auto_evaluate_threshold_predictions(get_metric_value)

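# Auto-evaluate trend predictions.
# historical_values is a placeholder for {metric_name: [samples]}; the slice assumes
# one sample per minute (for example, `prophet collect --interval 60`).
def get_metric_history(metric, hours):
    return historical_values[metric][-hours * 60:]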

engine.auto_evaluate_trend_predictions(get_metric_history)

Why Calibration Matters

Before Calibration:
- Prophet: "90% confident"
- Reality: Actually right 65% of the time
- Result: Team ignores high-confidence alerts

After Calibration:
- Prophet: "65% confident"
- Reality: Actually right 65% of the time
- Result: Confidence scores you can trust

The Feedback Loop:
1. Make prediction with confidence
2. Track outcome (correct/incorrect)
3. Measure calibration over time
4. Adjust future confidence to match accuracy
5. Trust improves, decisions improve
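
Step 4 of the loop can be sketched as a single multiplier learned from history. This is an illustration under assumptions, not the method Prophet necessarily uses: `adjustment_factor` and `calibrate` are hypothetical helpers, the factor is taken as observed accuracy divided by mean stated confidence, and the clamp range is arbitrary.

```python
def adjustment_factor(confidences, outcomes, lo=0.5, hi=1.5):
    """Observed accuracy divided by mean stated confidence, clamped to [lo, hi].

    A result below 1.0 means past predictions were overconfident.
    """
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need matching, non-empty confidence and outcome lists")
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(1 for ok in outcomes if ok) / len(outcomes)
    if mean_conf == 0:
        return 1.0  # nothing to scale
    return max(lo, min(hi, accuracy / mean_conf))


def calibrate(raw_confidence, factor):
    """Apply the multiplier and keep the result a valid probability."""
    return max(0.0, min(1.0, raw_confidence * factor))


# 100 past predictions stated at 0.85, of which 78 were correct.
history_conf = [0.85] * 100
history_ok = [True] * 78 + [False] * 22
factor = adjustment_factor(history_conf, history_ok)
adjusted = calibrate(0.85, factor)  # roughly 0.78
```

A single global multiplier is the simplest choice; a per-bucket or per-prediction-type factor would follow the same pattern.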

Well-calibrated predictions let you make better decisions about when to act on alerts, when to ignore them, and how much effort to invest in prevention.

Ensemble Prediction (Gen 6 Evolution)

Single prediction models fail in specific scenarios. Ensemble prediction combines multiple models for more robust forecasting — averaging out individual model errors and using disagreement to estimate uncertainty.

Why Ensemble?

| Single Model | Problem |
|---|---|
| Holt-Winters | Struggles without seasonality |
| ARIMA | Sensitive to trend breaks |
| Linear | Misses non-linear patterns |

Ensemble Solution: Combine all models with adaptive weights — each model's strengths compensate for others' weaknesses.

Models Combined

| Model | Best For | Weight Learning |
|---|---|---|
| Holt-Winters | Seasonal data (daily/weekly cycles) | ↑ if seasonal pattern detected |
| ARIMA-lite | Trending data with autocorrelation | ↑ if strong trend |
| Threshold Projection | Clear linear trends | ↑ if high R² |
| Percentile | Bounded metrics, known distributions | ↑ for stable metrics |
| Seasonal Naive | Strongly seasonal (same as last cycle) | ↑ if high autocorrelation at season lag |

CLI Commands

bash
# Make ensemble prediction
prophet ensemble predict cpu_percent
prophet ensemble predict response_time --horizon 60

# Show model performance statistics
prophet ensemble stats
prophet ensemble stats response_time  # For specific metric

# Show learned weights for metric
prophet ensemble weights cpu_percent

# Update weights from historical accuracy
prophet ensemble update-weights memory_percent

# Analyze when/why models disagree
prophet ensemble disagreement error_rate

# Predict threshold breach probability
prophet ensemble breach memory_percent 85
prophet ensemble breach cpu_percent 90 --horizon 30

# Show metric's pattern classification
prophet ensemble pattern response_time

Example Output

╔════════════════════════════════════════════════════════════╗
║                🎯 ENSEMBLE PREDICTION                       ║
╚════════════════════════════════════════════════════════════╝

  Metric:      cpu_percent
  Current:     68.42
  Horizon:     30 steps
  Predicted:   72.15
  Confidence:  78.3%
  Uncertainty: 15.2%
  Agreement:   84.8%
  Dominant:    holt_winters

  ┌─────────────────────┬────────────┬────────────┬──────────┐
  │ Model               │ Prediction │ Confidence │ Weight   │
  ├─────────────────────┼────────────┼────────────┼──────────┤
  │ holt_winters        │      72.34 │      82.5% │   35.2%  │
  │ arima_lite          │      71.89 │      76.1% │   28.1%  │
  │ threshold_proj      │      73.12 │      71.3% │   18.5%  │
  │ percentile          │      71.45 │      68.9% │   12.1%  │
  │ seasonal_naive      │      72.00 │      55.0% │    6.1%  │
  └─────────────────────┴────────────┴────────────┴──────────┘

  ✅ High agreement — models converge on similar prediction

Weight Learning

Weights adapt based on historical accuracy:

bash
# After enough predictions with recorded outcomes...
prophet ensemble update-weights cpu_percent

✅ Updated weights for cpu_percent:

  holt_winters: 38.2%   # Best performer
  arima_lite: 26.1%
  threshold_proj: 17.4%
  percentile: 11.8%
  seasonal_naive: 6.5%  # Worst performer

Learning algorithm: Inverse MSE weighting — models with lower mean squared error get higher weights.
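
Inverse MSE weighting can be sketched in a few lines. The helper names `inverse_mse_weights` and `combine` are hypothetical, and the small `eps` term is an added assumption to keep a model with zero recorded error from causing a division by zero.

```python
def inverse_mse_weights(errors_by_model, eps=1e-9):
    """Map {model: [prediction - actual, ...]} to {model: weight}, weights summing to 1."""
    inverse = {}
    for name, errors in errors_by_model.items():
        if not errors:
            raise ValueError(f"no recorded outcomes for model {name!r}")
        mse = sum(e * e for e in errors) / len(errors)
        inverse[name] = 1.0 / (mse + eps)  # lower error -> larger share
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def combine(predictions, weights):
    """Weighted average of per-model predictions."""
    return sum(predictions[name] * weights[name] for name in predictions)


weights = inverse_mse_weights({
    "holt_winters": [1.0, -1.0],  # MSE 1.0
    "arima_lite": [2.0, -2.0],    # MSE 4.0
})
# holt_winters gets about 0.8 of the weight, arima_lite about 0.2
forecast = combine({"holt_winters": 72.0, "arima_lite": 74.0}, weights)
```

Because the weights are proportional to 1/MSE, a model with four times the squared error receives a quarter of the weight of its peer.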

Uncertainty from Disagreement

When models disagree, that's valuable information:

Agreement: 42.3%  ❌ Low agreement — significant uncertainty

💡 Models diverging suggests:
   - Regime change in data
   - Unusual pattern not seen before
   - Higher risk of prediction error

Agreement → Uncertainty mapping:
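- Agreement ≥ 80%: high agreement, models converge on a similar prediction
- Agreement 50-80%: moderate agreement
- Agreement < 50%: low agreement, significant uncertainty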

Threshold Breach Prediction

Specialized ensemble for answering "Will we breach the threshold?"

bash
prophet ensemble breach memory_percent 85

╔════════════════════════════════════════════════════════════╗
║              🚨 THRESHOLD BREACH PREDICTION                 ║
╚════════════════════════════════════════════════════════════╝

  Metric:       memory_percent
  Threshold:    85.0
  Current:      78.23
  Predicted:    82.45
  Horizon:      30 steps

  🟡 Breach Probability: 45.2% (LOW RISK)
  Confidence:   72.1%
  Agreement:    81.3%
  Time to Breach: ~42.3 steps
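
How Prophet derives the breach probability is not specified here. One common approach, shown as a sketch under assumptions, treats the forecast error as normally distributed around the ensemble prediction and extrapolates the forecast slope linearly for the time to breach. The function names and the `sigma` parameter are hypothetical.

```python
import math


def breach_probability(predicted, sigma, threshold):
    """P(value > threshold), assuming the outcome is Normal(predicted, sigma)."""
    if sigma <= 0:
        return 1.0 if predicted > threshold else 0.0
    z = (threshold - predicted) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the normal CDF


def steps_to_breach(current, predicted, horizon, threshold):
    """Linear extrapolation of the forecast slope.

    Returns 0.0 if already at or above the threshold, and None if the
    forecast is flat or moving away from it.
    """
    if current >= threshold:
        return 0.0
    slope = (predicted - current) / horizon
    if slope <= 0:
        return None
    return (threshold - current) / slope


p = breach_probability(predicted=76.0, sigma=3.0, threshold=80.0)
eta = steps_to_breach(current=70.0, predicted=76.0, horizon=30, threshold=80.0)  # about 50 steps
```

In an ensemble, `sigma` could come from the spread of the individual model predictions, so that low agreement widens the distribution and pulls the probability toward 50%.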

Pattern Classification

The ensemble automatically classifies metric patterns to boost specialist models:

| Pattern | Description | Boosted Models |
|---|---|---|
| `seasonal` | Clear daily/weekly cycles | Holt-Winters, Seasonal Naive |
| `trending` | Consistent upward/downward | ARIMA-lite, Threshold Proj |
| `volatile` | High variance, irregular | Isolation Forest, Percentile |
| `stable` | Low variance, predictable | Percentile, Seasonal Naive |
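
A rule-based classifier for these four patterns might look like the sketch below. It is illustrative only: the function names, the order of the checks, and all three thresholds are assumptions rather than Prophet's actual rules. It checks autocorrelation at the season lag, then the R² of a straight-line fit, then the coefficient of variation.

```python
from statistics import mean, pstdev


def autocorr(values, lag):
    """Sample autocorrelation at the given lag (0.0 if undefined)."""
    m = mean(values)
    denom = sum((v - m) ** 2 for v in values)
    if denom == 0 or lag >= len(values):
        return 0.0
    num = sum((values[i] - m) * (values[i - lag] - m) for i in range(lag, len(values)))
    return num / denom


def r_squared(values):
    """R² of a least-squares line fitted against the sample index."""
    n = len(values)
    mx, my = (n - 1) / 2, mean(values)
    sxx = sum((x - mx) ** 2 for x in range(n))
    sxy = sum((x - mx) * (y - my) for x, y in enumerate(values))
    syy = sum((y - my) ** 2 for y in values)
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def classify_pattern(values, season_lag, seasonal_min=0.6, trend_min=0.7, volatile_cv=0.25):
    """Return 'seasonal', 'trending', 'volatile', or 'stable'."""
    if autocorr(values, season_lag) >= seasonal_min:
        return "seasonal"
    if r_squared(values) >= trend_min:
        return "trending"
    m = mean(values)
    cv = pstdev(values) / abs(m) if m else float("inf")
    return "volatile" if cv >= volatile_cv else "stable"
```

With one sample per minute, `season_lag=1440` would test for a daily cycle. A strong trend also raises autocorrelation at short lags, so the season lag should be long relative to any ramp you expect.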

Python Integration

python
# PredictionType is used below; importing it from ensemble is an assumption.
from ensemble import EnsemblePredictor, ThresholdBreachPredictor, PredictionType

# Basic ensemble prediction
ensemble = EnsemblePredictor()
prediction = ensemble.predict(
    metric="response_time",
    values=[100, 105, 110, 108, 115, 120],
    horizon=10
)

print(f"Predicted: {prediction.final_value}")
print(f"Confidence: {prediction.confidence:.1%}")
print(f"Agreement: {prediction.agreement_score:.1%}")

# Threshold breach prediction
# (memory_values and alert() are placeholders for your own samples and notification hook)
breach_predictor = ThresholdBreachPredictor(ensemble)
result = breach_predictor.predict_breach(
    metric="memory_percent",
    values=memory_values,
    threshold=85.0,
    horizon_minutes=30
)

if result['breach_probability'] > 0.7:
    alert("High probability of memory breach!")

# Record outcomes for weight learning
ensemble.record_outcome(
    metric="response_time",
    prediction_type=PredictionType.VALUE_FORECAST,
    model_predictions=prediction.model_predictions,
    actual_value=118.5  # What actually happened
)

# Update weights periodically
ensemble.update_weights("response_time", PredictionType.VALUE_FORECAST)

Philosophy

No single model is best for everything.

  • Holt-Winters excels at seasonal data but struggles without clear cycles
  • ARIMA captures trends but is sensitive to regime changes
  • Simple linear projection is surprisingly robust for short horizons
  • Percentile-based prediction works great for bounded metrics

The ensemble learns which models work best for YOUR specific metrics and adapts over time. Disagreement between models becomes a feature, not a bug — it tells you when to be more cautious.

Result: More robust predictions that improve as they learn from outcomes.

Unified Health Score & Auto-Report (Gen 6 Evolution)

Instead of checking 16 different modules, get ONE number (0-100) and a human-readable story.

The Problem

Performance Prophet has grown to 16+ modules: predictions, thresholds, calibration, fingerprints, incidents, capacity, causality, remediation, and more. Each has its own dashboard. No single view answers: "How is my system doing RIGHT NOW?"

The Solution

A unified health score that:
1. Queries all modules and calculates a weighted score (0-100)
2. Grades your system (A+ through F)
3. Generates human-readable narratives explaining what's happening
4. Produces auto-reports (hourly/daily/weekly) with trends
5. Tracks score history with sparkline visualization

Health Dimensions

| Dimension | Weight | What It Measures |
|---|---|---|
| Prediction | 25% | |
| Threshold | 20% | |
| Capacity | 15% | |
| Pattern | 15% | |
| Incident | 15% | |
| Calibration | 10% | |

Grade Scale

| Grade | Score | Meaning |
|---|---|---|
| A+ | 97-100 | Exceptional: everything running perfectly |
| A | 93-96 | Excellent: minor things to watch |
| A- | 90-92 | Very good |
| B+/B/B- | 80-89 | Good but some areas need attention |
| C+/C/C- | 70-79 | Average: multiple concerns |
| D | 60-69 | Poor: action needed |
| F | 0-59 | Critical: immediate intervention |
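
The weighted score and grade lookup can be sketched as follows. The weights come from the Health Dimensions table and the neutral fallback of 70 from the module's stated behavior. The function names are hypothetical, and because the B and C sub-grade boundaries are not listed, this sketch returns the plain letter for those bands.

```python
# Integer percentages keep the arithmetic exact until the final division.
WEIGHTS = {
    "prediction": 25, "threshold": 20, "capacity": 15,
    "pattern": 15, "incident": 15, "calibration": 10,
}
NEUTRAL = 70.0  # used when a module is unavailable

# Lower bound of each band, highest first. Scores between bands
# (for example 96.5) fall into the lower band; that is an assumption.
GRADE_FLOORS = [(97, "A+"), (93, "A"), (90, "A-"), (80, "B"), (70, "C"), (60, "D")]


def overall_score(dimension_scores):
    """Weighted 0-100 score; missing or None dimensions fall back to NEUTRAL."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        score = dimension_scores.get(name)
        total += weight * (NEUTRAL if score is None else score)
    return total / 100


def grade(score):
    for floor, letter in GRADE_FLOORS:
        if score >= floor:
            return letter
    return "F"


scores = {
    "calibration": 75.0, "pattern": 92.0, "incident": 95.0,
    "prediction": 100.0, "threshold": 100.0, "capacity": 100.0,
}
total = overall_score(scores)  # about 95.55
letter = grade(total)          # "A"
```

These dimension scores are the ones shown in the example output, and the weighted sum lands in the A band, consistent with the 95.5/100 (A) headline there.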

CLI Commands

bash
# Current health score
prophet health

# Health score with JSON output
prophet health --json

# Health score trend (sparkline)
prophet health-history
prophet health-history --hours 48

# Auto-generated reports
prophet report              # Daily report (default)
prophet report hourly       # Last hour
prophet report daily        # Last 24 hours
prophet report weekly       # Last 7 days
prophet report daily --json # JSON output

Example Output

✨ System Health: 95.5/100 (A)
   Trend: 📈 Improving

   🟡 calibration   75.0/100 — Building calibration data (6 predictions tracked)
   🟢 pattern       92.0/100 — No anomalous patterns detected
   🟢 incident      95.0/100 — No incidents in the last 24 hours
   🟢 prediction   100.0/100 — All 6 metrics within expected bounds
   🟢 threshold    100.0/100 — All 3 adaptive thresholds healthy
   🟢 capacity     100.0/100 — Healthy capacity: 100% avg headroom

   Recommendations:
   🟡 Monitor prediction accuracy — slight drift detected

   System is running like a well-oiled machine. ✨

Report Example

══════════════════════════════════════════════════
  PERFORMANCE PROPHET — DAILY REPORT
  Generated: 2026-02-02 23:29
══════════════════════════════════════════════════

  ✨ Overall Health: 95.5/100 (A)
  📈 Period: avg 90 | low 78 | high 96

  📈 Trend: improving

  DIMENSIONS:
  ──────────────────────────────────────────────
  calibration  [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓·····]  75.0
  pattern      [██████████████████··]  92.0
  incident     [███████████████████·]  95.0
  prediction   [████████████████████] 100.0
  threshold    [████████████████████] 100.0
  capacity     [████████████████████] 100.0

  ✅ IMPROVEMENTS:
    • threshold improved by 30 points
    • capacity improved by 30 points

  📋 RECOMMENDATIONS:
    🟡 Monitor prediction accuracy — slight drift detected

══════════════════════════════════════════════════

Integration with Heartbeat

Health score can be checked during Clawdbot heartbeats:

python
from health_score import HealthScoreEngine
engine = HealthScoreEngine()
snapshot = engine.calculate_snapshot()
if snapshot.overall_score < 70:
    # Alert: system health degraded (notify is a placeholder for your alerting hook)
    notify(snapshot.narrative)

Architecture

┌─────────────────────────────────────────────────────────┐
│                 All Prophet Modules                       │
│  Prophet │ Thresholds │ Calibration │ Fingerprints │ ... │
└────┬─────┴─────┬──────┴──────┬──────┴──────┬───────┴────┘
     │           │             │             │
     ▼           ▼             ▼             ▼
┌─────────────────────────────────────────────────────────┐
│              Health Score Engine                         │
│  Weighted combination → single 0-100 score              │
│  Grade assignment (A+ through F)                         │
│  Trend detection (improving/stable/degrading/volatile)  │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│              Narrative Generator                         │
│  Human-readable stories from numbers                    │
│  Top concerns + strengths + recommendations             │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│              Auto-Report Engine                          │
│  Hourly / Daily / Weekly reports                        │
│  Score history + sparklines + trend analysis            │
│  Period comparisons + notable events                    │
└─────────────────────────────────────────────────────────┘

Philosophy

One number. One story. One glance.

Dashboards with 50 charts cause alert fatigue. A single health score with a narrative explanation lets you know instantly: "Do I need to pay attention right now?" If yes, the dimensions tell you where. If no, move on with your day.

The score degrades gracefully — if a module is unavailable, it falls back to neutral (70) rather than breaking. This means the health score is always available, even during partial outages.

Model Drift Detection (Gen 6 Evolution)

The Problem

All Prophet models learn from historical patterns. But systems change: new deployments, infrastructure migrations, traffic pattern shifts, seasonal evolution. Without drift detection, Prophet gives confidently wrong predictions.

The Solution

A meta-intelligence layer that monitors the monitors. It answers the question: "Can we still trust our predictions?"

Drift Types Detected

| Type | What It Detects | Method |
|---|---|---|
| Covariate | Input distributions changed | KS test + Page-Hinkley |
| Performance | Prediction accuracy degraded | ADWIN + error tracking |
| Concept | Metric relationships shifted | Fisher z-transform on correlations |
| Regime | Operational mode changed | Rule-based multi-signal classification |
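
For the concept drift row, the Fisher z-transform compares the correlation between two metrics in a reference window against the same correlation in a recent window. The sketch below is the standard two-sample test for independent correlations. The function name is hypothetical, and how Prophet chooses its windows and significance level is not specified here.

```python
import math


def fisher_z_compare(r1, n1, r2, n2):
    """Two-sided test for a difference between two independent Pearson correlations.

    r1, r2 must lie strictly between -1 and 1; n1, n2 are the window sizes.
    Returns (z statistic, p-value).
    """
    if min(n1, n2) <= 3:
        raise ValueError("each window needs more than 3 samples")
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher transform
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# db_connections vs response_time: r = 0.85 last month, r = 0.40 this week
z, p = fisher_z_compare(0.85, 200, 0.40, 200)
drifted = p < 0.01
```

A small p-value indicates that the relationship between the two metrics has changed by more than sampling noise would explain.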

Statistical Tests

  • Page-Hinkley — Sequential change detection (O(1) memory, real-time)
  • ADWIN — Adaptive windowing for concept drift detection
  • Kolmogorov-Smirnov — Non-parametric distribution comparison
  • CUSUM — Cumulative sum control charts for mean shifts
  • Welch's t-test — Window comparison significance testing
  • Fisher z-transform — Correlation comparison across time periods
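
As an illustration of the first test in the list, here is a minimal one-sided Page-Hinkley detector for upward mean shifts. It keeps only a running mean and two cumulative values, which is where the O(1) memory comes from. The class name and the default `delta` and `threshold` values are assumptions, not Prophet's settings.

```python
class PageHinkley:
    """Detects an increase in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerated drift per sample
        self.threshold = threshold  # alarm level (lambda)
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0       # running sum of (x - mean - delta)
        self.minimum = 0.0          # smallest cumulative value seen so far

    def update(self, x):
        """Feed one sample; return True while a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cumulative += x - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return (self.cumulative - self.minimum) > self.threshold


detector = PageHinkley()
stream = [10.0] * 50 + [12.0] * 50  # mean shifts upward at index 50
first_alarm = next((i for i, x in enumerate(stream) if detector.update(x)), None)
```

In practice the detector is usually reset after an alarm, and a mirrored instance (fed `-x`) covers downward shifts.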

Operational Regimes

Prophet recognizes 7 distinct operational modes:

| Regime | Characteristics | Trust Impact |
|---|---|---|
| Normal | Baseline behavior | Full trust |
| Peak | High utilization, low errors | Normal trust |
| Off-Peak | Low utilization | Normal trust |
| Maintenance | Very low traffic | Normal trust |
| Incident | High errors + latency | 30% |
| Degraded | Elevated but not critical | Moderate reduction |
| Recovery | Improving from bad state | 15% |

CLI Commands

bash
# Comprehensive drift check
prophet drift                    # Full drift report with trust scores
prophet drift --metric cpu       # Check specific metric
prophet drift --json             # JSON output

# Trust scores
prophet drift-trust              # All metric trust scores
prophet drift-trust cpu          # Trust for specific metric

# Operational regime
prophet drift-regime             # Current regime + transitions
prophet drift-regime --json      # JSON output

# Signal history
prophet drift-history            # Last 24h signals
prophet drift-history --hours 72 # Custom lookback
prophet drift-history --severity severe  # Filter by severity

# Prediction accuracy
prophet drift-accuracy           # All metrics
prophet drift-accuracy cpu       # Specific metric

# Resolve signals
prophet drift-resolve 42 --note "Deployed model v2"

# Interactive demos
prophet drift-demo --scenario gradual   # Slow mean shift
prophet drift-demo --scenario sudden    # Abrupt distribution change
prophet drift-demo --scenario concept   # Relationship decorrelation
prophet drift-demo --scenario regime    # Operational mode transitions

Trust Scoring

Each metric gets a trust score (0-100%) based on:
- Active drift signals (severity × confidence)
- Current operational regime
- Recent prediction accuracy
- Concept drift affecting related metrics

Trust influences all other Prophet modules:
- Health Score weights predictions by trust
- Alerts include trust disclaimers when low
- Ensemble can deprioritize low-trust models
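
The exact trust formula is not given, so the sketch below shows only one plausible way to combine the inputs. The severity penalties follow the Severity → Action Mapping table and the regime values follow the Operational Regimes table. Treating both as points subtracted from 100, leaving out recent accuracy and related-metric drift, and the name `trust_score` are all assumptions.

```python
SEVERITY_PENALTY = {"minor": 10, "moderate": 25, "severe": 45, "critical": 70}

# Only regimes with a numeric trust impact are listed; others subtract nothing here.
REGIME_PENALTY = {"incident": 30, "recovery": 15}


def trust_score(signals, regime="normal"):
    """Return a 0-100 trust score.

    signals: iterable of (severity, confidence) pairs for active drift
    signals, with confidence in [0, 1].
    """
    score = 100.0
    for severity, confidence in signals:
        score -= SEVERITY_PENALTY[severity] * confidence  # severity x confidence
    score -= REGIME_PENALTY.get(regime, 0)
    return max(0.0, min(100.0, score))


calm = trust_score([])                                        # 100.0
drifting = trust_score([("moderate", 0.8)])                   # about 80
during_incident = trust_score([("moderate", 0.8)], "incident")  # about 50
```

Clamping at zero means several simultaneous severe signals bottom out rather than producing a negative score.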

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Drift Engine                             │
│                                                              │
│  ┌────────────┐  ┌──────────────┐  ┌────────────────┐      │
│  │ Covariate  │  │ Performance  │  │    Concept     │      │
│  │ Detector   │  │  Detector    │  │   Detector     │      │
│  │            │  │              │  │                │      │
│  │ KS Test    │  │ ADWIN        │  │ Fisher z       │      │
│  │ Page-Hinkl │  │ Error Track  │  │ Correlation    │      │
│  └─────┬──────┘  └──────┬───────┘  └───────┬────────┘      │
│        │                 │                   │               │
│  ┌─────┴─────────────────┴───────────────────┴──────────┐   │
│  │              Regime Detector                          │   │
│  │   Classifies: normal/peak/incident/recovery/...      │   │
│  └─────┬────────────────────────────────────────────────┘   │
│        │                                                     │
│  ┌─────┴────────────────────────────────────────────────┐   │
│  │              Trust Scorer                             │   │
│  │   Per-metric trust (0-100%) from all signals         │   │
│  └─────┬────────────────────────────────────────────────┘   │
│        │                                                     │
│  ┌─────┴────────────────────────────────────────────────┐   │
│  │              Drift Advisor                            │   │
│  │   OK → Monitor → Recalibrate → Retrain → Suspend    │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  Persistence: drift.db (SQLite)                             │
└─────────────────────────────────────────────────────────────┘

Severity → Action Mapping

| Severity | Trust Penalty | Recommended Action |
|---|---|---|
| Minor | -10% | |
| Moderate | -25% | |
| Severe | -45% | |
| Critical | -70% | |

Philosophy

> "A prediction model that doesn't know it's wrong is more dangerous than no model at all."

Traditional monitoring tells you when your system breaks. Drift detection tells you when your understanding of your system breaks. It's the difference between watching the road and checking that your GPS is still accurate.

Fleet Intelligence (Gen 6 Evolution)

Extends Prophet from single-service monitoring to fleet-wide intelligence. It adds cross-service correlation, dependency mapping, blast radius estimation, and risk matrices.

Philosophy

> "In a microservice architecture, no service fails alone."

Single-service monitoring catches local problems. Fleet Intelligence catches systemic problems — when auth-service slows down, it predicts api-gateway, user-service, and search-service will follow.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                   Fleet Intelligence                         │
│                                                              │
│  ┌──────────────┐  ┌────────────────┐  ┌────────────────┐  │
│  │  Service      │  │  Correlation   │  │  Blast Radius  │  │
│  │  Registry     │  │  Engine        │  │  Estimator     │  │
│  │              │  │                │  │                │  │
│  │  Register    │  │  Cross-corr    │  │  Graph walk    │  │
│  │  Track       │  │  Lag detect    │  │  Impact score  │  │
│  │  Status      │  │  Auto-discover │  │  Cascade sim   │  │
│  └──────┬───────┘  └───────┬────────┘  └───────┬────────┘  │
│         │                  │                     │           │
│  ┌──────┴──────────────────┴─────────────────────┴───────┐  │
│  │              Dependency Graph                          │  │
│  │   Directed graph of service relationships             │  │
│  └──────┬────────────────────────────────────────────────┘  │
│         │                                                    │
│  ┌──────┴────────────────────────────────────────────────┐  │
│  │              Fleet Dashboard                           │  │
│  │   Overall health │ Risk matrix │ Topology view        │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                              │
│  Persistence: fleet.db (SQLite)                             │
└─────────────────────────────────────────────────────────────┘

Registering Services

bash
# Register services with tiers (1=critical, 2=important, 3=normal)
prophet fleet register api-gateway --tier 1 --desc "Main entry point"
prophet fleet register auth-service --tier 1 --desc "Authentication"
prophet fleet register user-service --tier 2 --desc "User management"
prophet fleet register payment-svc --tier 1 --desc "Payment processing"
prophet fleet register notification-svc --tier 3 --desc "Email/SMS"

Declaring Dependencies

Dependencies can be declared manually or discovered automatically:

bash
# Manual: api-gateway depends on auth-service
prophet fleet deps add api-gateway auth-service --type upstream --strength 0.95

# Automatic: discover from metric correlations
prophet fleet discover

Fleet Status

bash
prophet fleet status

# Output:
# ══════════════════════════════════════════════════
#   🛰️  FLEET INTELLIGENCE
# ══════════════════════════════════════════════════
#
#   Fleet Health: [████████████████░░░░] 83% (B)
#
#   Services: 7 total
#     🟢 Healthy:  4
#     🟡 Degraded: 1
#     🔴 Critical: 0
#     ❓ Unknown:  2
#
#   ⚠️  Top Risks:
#     🔴 auth-service — widespread (4 services, 82% impact)

Blast Radius Estimation

The key feature: predict the cascading impact of a service failure.

bash
prophet fleet blast auth-service

# Output:
# 💥 Blast Radius: auth-service
# ────────────────────────────────────────
#   Severity: 🔴 WIDESPREAD
#   Affected: 4 service(s)
#   Impact:   82%
#   Depth:    2 hop(s)
#
#   📊 Affected Services:
#     api-gateway: [████████████████████] 100% (+0m, depth 1)
#     user-service: [████████████████████] 100% (+0m, depth 1)
#     search-service: [████████████████░░░░] 81% (+0m, depth 2)
#
#   🌊 Cascade Path:
#     auth-service → api-gateway → user-service → search-service
#
#   💡 Mitigations:
#     🚨 Tier-1 critical service — activate incident response
#     ⚡ Critical services at risk: api-gateway — prepare failover
#     🔌 Enable circuit breakers for: api-gateway, user-service
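
The graph walk behind a blast radius estimate can be sketched as a breadth-first traversal over the "is depended on by" edges. This is an illustration under assumptions, not Prophet's algorithm: the function name, the rule that impact is the product of edge strengths times the tier multiplier (capped at 100%), and every edge strength other than the 0.95 declared earlier are hypothetical.

```python
from collections import deque

TIER_MULTIPLIER = {1: 1.5, 2: 1.2, 3: 1.0}  # from the Service Tiers table


def blast_radius(failed, dependents, tiers):
    """Estimate the impact of `failed` on every service that depends on it.

    dependents: {service: [(dependent_service, strength), ...]}
    tiers: {service: 1 | 2 | 3}; unknown services default to tier 3.
    Returns {service: (impact in [0, 1], depth in hops)}.
    """
    result = {}
    queue = deque([(failed, 1.0, 0)])
    while queue:
        service, carried, depth = queue.popleft()
        for dependent, strength in dependents.get(service, []):
            if dependent == failed or dependent in result:
                continue  # shallowest visit wins; this also breaks cycles
            raw = carried * strength
            scaled = min(1.0, raw * TIER_MULTIPLIER[tiers.get(dependent, 3)])
            result[dependent] = (scaled, depth + 1)
            queue.append((dependent, raw, depth + 1))
    return result


dependents = {
    "auth-service": [("api-gateway", 0.95), ("user-service", 0.9)],
    "user-service": [("search-service", 0.75)],
}
tiers = {"api-gateway": 1, "user-service": 2, "search-service": 2}
impact = blast_radius("auth-service", dependents, tiers)
```

The hypothetical strengths were picked so that the sketch yields 100%, 100%, and about 81% at depths 1, 1, and 2, the same figures as the sample output. That shows the shape of the calculation; it does not confirm how Prophet computes it.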

Blast Severity Levels

Severity
🟢 Contained
🟡 Limited
🟠 Moderate
🔴 Widespread
💀 Catastrophic

Cross-Service Correlation Discovery

Automatically discovers which services influence each other by cross-correlating their metrics:

bash
prophet fleet discover --min 0.6

# Output:
# ✅ Discovered 5 cross-service correlations:
#
#   api-gateway.response_time
#     → user-service.response_time
#       [████████░░] 82% (lag: 2.5m, n=150)
#
#   auth-service.cpu
#     → api-gateway.error_rate
#       [███████░░░] 71% (lag: 1.0m, n=200)
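
Lag detection of this kind is typically done by sliding one series against the other and keeping the shift with the highest Pearson correlation. The sketch below assumes equally spaced, equal-length series and reports the lag in samples. `pearson` and `best_lag` are hypothetical helper names.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5 if sxx and syy else 0.0


def best_lag(leader, follower, max_lag):
    """Return (lag, correlation) maximizing corr(leader[t], follower[t + lag]).

    max_lag must be smaller than len(leader) - 1.
    """
    best = (0, pearson(leader, follower))
    for lag in range(1, max_lag + 1):
        r = pearson(leader[:-lag], follower[lag:])
        if r > best[1]:
            best = (lag, r)
    return best


gateway = [100, 104, 99, 120, 135, 110, 102, 98, 125, 140, 118, 101]
users = [0, 0] + gateway[:-2]  # follows the gateway two samples later
lag, r = best_lag(gateway, users, max_lag=4)
```

A pair would be reported only if the best correlation clears the `--min` threshold. Multiplying the lag by the collection interval converts it to minutes.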

Service Topology

bash
# ASCII topology
prophet fleet topology

# Mermaid diagram (for docs/dashboards)
prophet fleet topology --mermaid

Risk Matrix

Combines failure likelihood (from current status) with blast radius impact:

bash
prophet fleet risk

# Output:
# 🎯 Risk Matrix
# ────────────────────────────────────────────────
#   Service            Tier  Status    Risk   Blast      Affected
#   user-service        T2    degraded  █████░  widespread 3
#   auth-service        T1    healthy   █░░░░░  widespread 4
#   payment-service     T1    healthy   █░░░░░  moderate   2

Fleet Events

Track fleet-level events (deployments, incidents, config changes):

bash
# Record events
prophet fleet event deployment api-gateway "Deployed v2.3.1"
prophet fleet event incident auth-service "Elevated latency"

# View recent events
prophet fleet event

Comparative Service Analysis

Compare the same metric across two services:

bash
prophet fleet compare api-gateway user-service response_time

# Returns: mean, std, p50, p95, p99, correlation, lag

Service Tiers

| Tier | Meaning | Blast Multiplier |
|---|---|---|
| 1 | Critical (auth, payments) | 1.5× impact |
| 2 | Important (user, search) | 1.2× impact |
| 3 | Normal (analytics, notif.) | 1.0× impact |

Integration with Other Prophet Modules

Fleet Intelligence feeds into existing Prophet modules:
- Health Score — fleet health contributes to overall system health
- Drift Detection — cross-service correlation changes signal architectural drift
- Incident Memory — fleet-wide incidents are stored with cascade paths
- SLO Budget — blast radius informs SLO budget burn predictions

All Fleet Commands

bash
prophet fleet status              # Fleet dashboard
prophet fleet status --json       # JSON output
prophet fleet register <name>     # Register a service
prophet fleet remove <name>       # Remove a service
prophet fleet blast <name>        # Blast radius estimation
prophet fleet discover            # Auto-discover correlations
prophet fleet discover --min 0.7  # Higher correlation threshold
prophet fleet topology            # ASCII topology
prophet fleet topology --mermaid  # Mermaid diagram
prophet fleet risk                # Risk matrix
prophet fleet deps                # All dependencies
prophet fleet deps <name>         # Dependencies for a service
prophet fleet compare <a> <b> <m> # Compare metric across services
prophet fleet event               # Recent events
prophet fleet event <type> <svc> <desc>  # Record event

Gen 7 Ideas (Future)

  • Fleet Anomaly Correlation — When multiple services degrade simultaneously, auto-identify the common root
  • Dependency Health Score — Per-dependency health tracking (is this link healthy?)
  • Fleet Simulation — "What if we lose an entire AZ?" chaos engineering simulation
  • Service SLA Composition — Calculate composite SLA from dependency chain
  • Fleet Drift — Detect when the dependency graph itself is changing

Source Anchor

homelab/clawdbot/skills/performance-prophet/SKILL.md
