Performance Prophet
ML-powered system that predicts slowdowns before users notice. Uses statistical models and anomaly detection to forecast performance degradation and alert proactively.
Concept
Traditional monitoring alerts when thresholds are crossed. By then, users already feel the pain. Performance Prophet flips this: it learns your system's patterns and alerts before degradation becomes noticeable.
Architecture
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Collectors  │ ──▶ │  Predictor  │ ──▶ │   Alerter   │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
  Metrics DB          Model State        Notifications
Features
- Triple Exponential Smoothing — Captures level, trend, and seasonality
- Anomaly Detection — IQR-based + Z-score for outliers
- Trend Extrapolation — Predicts where metrics are heading
- Pattern Memory — Learns daily/weekly cycles
- Proactive Alerts — Warning before threshold breach
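The two outlier checks named in the feature list can be sketched with the standard library alone. This is a minimal illustration, not Prophet's actual code: the function names `iqr_outliers` and `zscore_outliers`, the 1.5 fence multiplier, and the 2.5 z-score limit are all assumptions chosen for the example.

```python
from statistics import mean, quantiles, stdev


def iqr_outliers(values, k=1.5):
    """Return values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, limit=2.5):
    """Return values more than `limit` sample standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # a constant series has no outliers
    return [v for v in values if abs(v - mu) / sigma > limit]


# Hypothetical response-time samples (ms) with one obvious spike.
samples = [100, 102, 98, 101, 99, 103, 97, 100, 250]
print(iqr_outliers(samples), zscore_outliers(samples))
```

On short windows a single extreme value inflates the standard deviation, so the IQR fence tends to be the more robust of the two checks.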
Usage
# Start collector daemon
prophet collect --interval 60
# Show current predictions
prophet predict
# Analyze specific metric
prophet analyze response_time --hours 24
# Show predicted incidents
prophet forecast --horizon 1h
# Interactive dashboard
prophet watch
# Pattern Learning (Gen 6+)
prophet patterns # Show learned patterns & stats
prophet similar cpu_percent # Find similar past incidents
prophet cascade memory_percent # Predict cascade effects
prophet link db_connections response_time 2.5 0.85 # Record causal link
prophet incident response_time --severity warning --cause "DB pool exhausted"
# Automatic Causality Discovery (Gen 6 Evolution)
prophet causality discover # Auto-discover all causal relationships
prophet causality discover --metric cpu # Discover relationships for specific metric
prophet causality explain --metric response_time # Explain what causes/affects a metric
prophet causality graph # Generate Mermaid diagram of causal graph
prophet causality summary # Summary of discovered relationships
# Causal Analysis
prophet causal-explain response_time # What causes response_time? What does it affect?
prophet causal-roots error_rate # Trace back to root causes
prophet causal-test memory_percent response_time # Test if one metric causes another
Metrics Tracked
| Metric | Source | Warning Sign |
|---|---|---|
| Response Time | API logs | Upward trend |
| Memory Usage | System | > 80% |
| Queue Depth | Workers | Growing backlog |
| Error Rate | Logs | Spike detection |
| CPU Usage | System | Sustained climb |
| Disk I/O | System | Saturation approach |
Prediction Models
### 1. Holt-Winters (Triple Exponential Smoothing)
Best for metrics with seasonality (traffic patterns, daily cycles).
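As a rough picture of what triple exponential smoothing does, here is a minimal additive Holt-Winters sketch. It is not taken from `prophet.py`: the function name, the default smoothing factors of 0.5, and the simple two-season initialisation are assumptions made for illustration.

```python
def holt_winters_forecast(series, season_len, horizon,
                          alpha=0.5, beta=0.5, gamma=0.5):
    """Additive Holt-Winters: smooth level, trend, and seasonal offsets,
    then extrapolate `horizon` steps past the end of `series`."""
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons of data")

    first = series[:season_len]
    second = series[season_len:2 * season_len]
    level = sum(first) / season_len
    # Average per-step change between the first two seasons.
    trend = (sum(second) - sum(first)) / season_len ** 2
    seasonal = [value - level for value in first]

    for t, value in enumerate(series):
        s = seasonal[t % season_len]
        previous_level = level
        level = alpha * (value - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        seasonal[t % season_len] = gamma * (value - level) + (1 - gamma) * s

    n = len(series)
    return [level + h * trend + seasonal[(n + h - 1) % season_len]
            for h in range(1, horizon + 1)]
```

For a metric sampled hourly with a daily cycle, `season_len` would be 24. A production setup would also fit the three smoothing factors to the data instead of fixing them.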
### 2. ARIMA-lite
Moving average with autoregression for trend detection.
### 3. Isolation Forest
Anomaly detection without requiring labeled data.
### 4. Threshold Projection
Simple but effective: if current trend continues, when does it hit the threshold?
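The projection idea fits in a few lines: fit a least-squares line through recent samples and solve for the time at which it crosses the threshold. The sketch below is an illustration under assumptions. The function names are hypothetical, and mapping the result to the WARNING (< 1h) and CRITICAL (< 15m) horizons from the Alert Levels table, with everything else treated as HEALTHY, is one plausible reading.

```python
def minutes_to_threshold(samples, threshold):
    """samples: list of (minute, value) pairs, oldest first.
    Returns minutes until the fitted line reaches `threshold`,
    0.0 if it is already there, or None if the trend is flat or falling."""
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    fitted_now = mean_v + slope * (samples[-1][0] - mean_t)
    if fitted_now >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - fitted_now) / slope


def alert_level(minutes):
    """Map a projected time-to-breach onto the documented alert horizons."""
    if minutes is None:
        return "HEALTHY"
    if minutes < 15:
        return "CRITICAL"
    if minutes < 60:
        return "WARNING"
    return "HEALTHY"
```

For example, a response time climbing from 400 ms to 440 ms over 20 minutes has a slope of 2 ms per minute, so a 500 ms threshold is 30 minutes away and the level is WARNING.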
Alert Levels
| Level | Meaning | Horizon |
|---|---|---|
| 🟢 HEALTHY | All metrics normal | — |
| 🟡 WATCH | Unusual pattern detected | — |
| 🟠 WARNING | Threshold breach predicted | < 1h |
| 🔴 CRITICAL | Threshold breach predicted | < 15m |
Configuration
# [home-path]
metrics:
  response_time:
    threshold: 500ms
    warning_horizon: 30m
    seasonality: daily
  memory_percent:
    threshold: 85
    warning_horizon: 15m
  error_rate:
    threshold: 0.05
    anomaly_sensitivity: high
notifications:
  slack: "#ops-alerts"
  email: [email]
  cooldown: 5m
Integration
### With Clawdbot Sessions
from prophet import Prophet
p = Prophet()
# response_ms and alert() are supplied by the calling session code
p.track_session_metric("response_time", response_ms)
if p.predict_breach("response_time", horizon="30m"):
    alert("Slowdown predicted in next 30 minutes")
### With Heartbeat
Add to HEARTBEAT.md:
- [ ] Performance Prophet status
- [ ] Any predicted incidents?
Implementation
This skill provides:
- `prophet.py` — Core prediction engine
- `collector.py` — Metric gathering
- `alerter.py` — Notification dispatch
- `cli.py` — Command line interface
- `models/` — Serialized model states
Pattern Learning (Gen 6 Evolution)
Performance Prophet learns from every incident to improve future predictions.
How It Works
1. Pattern Fingerprinting — Each metric trajectory gets a unique fingerprint based on shape characteristics (trend, acceleration, volatility, peak position)
2. Shape Classification — Fingerprints are classified into categories:
- `gradual_rise` — Slow buildup toward threshold
- `accelerating_rise` — Exponential degradation
- `spike` — Sudden jump
- `oscillating` — Irregular fluctuations
- `step` — Discrete jumps
3. Incident Memory — Past incidents are stored with their patterns, root causes, and resolutions
4. Pattern Matching — When current metrics match a known pattern, Prophet alerts with historical context
5. Causal Graph — Relationships between metrics are tracked (e.g., "when database connections degrade, response time follows 2.5 minutes later")
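Step 5 can be pictured as a small directed graph whose edges carry a lag and a strength, in the spirit of `prophet link db_connections response_time 2.5 0.85`. The sketch below is an assumption about how such a graph might be walked to predict cascades. The `links` structure, the function name, and the choice to add lags and multiply strengths along a path are all illustrative and are not Prophet's documented behaviour.

```python
from collections import deque


def predict_cascade(links, source):
    """links: {metric: [(downstream, lag_min, strength), ...]}, strength in 0..1.
    Returns [(metric, (cumulative_lag_min, combined_strength)), ...]
    for everything reachable from `source`, soonest first."""
    results = {}
    queue = deque([(source, 0.0, 1.0)])
    while queue:
        metric, lag, strength = queue.popleft()
        for downstream, link_lag, link_strength in links.get(metric, []):
            if downstream == source:
                continue  # ignore loops back to the starting metric
            total_lag = lag + link_lag
            total_strength = strength * link_strength
            known = results.get(downstream)
            # Keep only the strongest path to each downstream metric.
            if known is None or total_strength > known[1]:
                results[downstream] = (total_lag, total_strength)
                queue.append((downstream, total_lag, total_strength))
    return sorted(results.items(), key=lambda item: item[1][0])


# Hypothetical links, using values of the kind shown in this document.
links = {
    "db_connections": [("response_time", 2.5, 0.85)],
    "response_time": [("error_rate", 3.0, 0.65)],
}
for metric, (lag, strength) in predict_cascade(links, "db_connections"):
    print(f"-> {metric} (lag: {lag:.1f} min, strength: {strength:.0%})")
```

Because strengths never exceed 1, a path can only get weaker as it grows, which is what keeps the walk from looping forever on cyclic graphs.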
Cascade Prediction
When one metric degrades, Prophet predicts which others will follow:
🌊 Cascade Prediction: database_connections
→ response_time (lag: 2.5 min, correlation: 85%)
→ error_rate (lag: 5.0 min, correlation: 72%)
Similar Incident Lookup
Find past incidents with matching patterns:
prophet similar response_time
🔍 Similar past incidents:
📋 inc_20260131_response
Root cause: Database connection pool exhausted
Resolution: Increased pool size to 100
Duration: 12.3 min
Time-to-Recovery (TTR) Prediction (Gen 6+)
During an active incident, answers the critical question: "How long until we're back to normal?"
How It Works
1. Pattern Matching — Compares current incident fingerprint to historical incidents
2. Multi-Factor Scoring — Considers severity, trend, peak deviation, pattern type
3. Bayesian Updating — Adjusts prediction as time passes
4. Confidence Intervals — Provides optimistic, median, and pessimistic estimates
Usage
# Predict TTR for an active incident
prophet ttr response_time --value 650 --threshold 500 --elapsed 5 --trend stable
# Output:
# ⏱️ TTR Prediction: response_time
# ───────────────────────────────────
# Estimated: 12 min remaining
# Confidence: ████████░░ 80%
#
# 📊 Range:
# Best case: 5 min
# Most likely: 12 min
# Worst case: 25 min
#
# 📋 Factors:
# • Based on 8 similar past incidents
# • Pattern type: gradual rise
# • ⏱️ 5 min elapsed
#
# 💡 Recommendation:
# Check: Database connection pool exhausted (common cause in similar incidents)
# View prediction accuracy
prophet ttr-stats
Pattern-Based Adjustments
| Pattern Type | Recovery Modifier | Typical Behavior |
|---|---|---|
| Spike | 0.5x (faster) | Usually self-resolves |
| Gradual Rise | 1.0x (baseline) | Standard recovery |
| Accelerating | 1.5x (slower) | Needs intervention |
| Step Change | 2.0x (slowest) | Usually deployment-related |
| Oscillating | 1.2x (moderate) | Capacity contention |
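To show how the modifiers in this table could feed a TTR estimate, here is a deliberately simple sketch. It assumes the durations of similar past incidents are already known, uses their minimum, median, and maximum as the best, most likely, and worst cases, and omits the Bayesian updating step. The function name and the mapping from pattern identifiers to table rows are assumptions.

```python
from statistics import median

# Recovery modifiers from the table, keyed by the shape identifiers
# used in the Pattern Learning section (this mapping is an assumption).
RECOVERY_MODIFIER = {
    "spike": 0.5,
    "gradual_rise": 1.0,
    "accelerating_rise": 1.5,
    "step": 2.0,
    "oscillating": 1.2,
}


def estimate_ttr(similar_durations_min, pattern, elapsed_min):
    """Return remaining-minutes estimates, or None without history."""
    if not similar_durations_min:
        return None
    modifier = RECOVERY_MODIFIER.get(pattern, 1.0)

    def remaining(total_min):
        # Scale the historical duration, then subtract time already spent.
        return max(0.0, total_min * modifier - elapsed_min)

    return {
        "best_case": remaining(min(similar_durations_min)),
        "most_likely": remaining(median(similar_durations_min)),
        "worst_case": remaining(max(similar_durations_min)),
    }
```

With past durations of 10, 12, and 20 minutes, a `spike` pattern, and 2 minutes elapsed, the estimates come out at 3, 4, and 8 minutes remaining.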
Integration with Incident Response
When Prophet detects an incident, it automatically provides a TTR estimate:
🔴 INCIDENT: response_time breached threshold (650ms > 500ms)
⏱️ Estimated recovery: 15 min (confidence: 75%)
💡 Similar to 3 past incidents - common cause: DB connection pool
SLO Budget Burn Prediction (Gen 6 Evolution)
Predicts when your error budget will be exhausted, giving you time to act before SLOs are breached.
Philosophy
SLOs are promises. This module predicts broken promises before they happen.
Traditional SLO monitoring: "Error budget exhausted" → already broke the promise
Prophet SLO prediction: "Error budget will exhaust in 3 days" → time to fix reliability
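The arithmetic behind an exhaustion forecast is short enough to show directly. In this sketch the function names are hypothetical, and projecting exhaustion from a constant daily burn rate is a simplifying assumption. For an SLO with target 0.999, the error budget is the 0.1% of events that are allowed to fail.

```python
def budget_remaining(target, good_events, total_events):
    """Fraction of the error budget still unspent (1.0 = untouched,
    0.0 = exhausted, negative = overspent)."""
    allowed_bad = (1 - target) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


def days_to_exhaustion(remaining_fraction, burn_per_day):
    """Days until the budget reaches zero at a constant burn rate,
    or None if nothing is being burned."""
    if remaining_fraction <= 0:
        return 0.0
    if burn_per_day <= 0:
        return None
    return remaining_fraction / burn_per_day
```

With a 0.999 target, 100,000 events allow 100 failures. If 5 have failed, 95% of the budget remains. A budget at 75% that loses 25 percentage points per day is gone in 3 days.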
Defining SLOs
# Define an availability SLO
prophet slo-define api_availability \
--target 0.999 \
--window 30 \
--metric response_success_rate \
--policy "Freeze deployments, prioritize reliability"
# Define a latency SLO
prophet slo-define api_latency \
--target 0.95 \
--window 30 \
--policy "No risky changes until budget recovers"
# List all SLOs
prophet slo-list
Tracking Budget
# Record good/total events (for event-based SLOs)
prophet slo-events api_availability 9990 10000
# Record a burn event with cause
prophet slo-burn api_availability 0.001 --cause "Deployment hiccup" --incident INC-123
Predictions
# Show budget status for all SLOs
prophet slo-budget
# Output:
# 📊 SLO Budget Summary
#
# 🟡 api_availability
# [████████████░░░] 85.2% remaining
# Exhaustion in: 12.3d
#
# 🔴 api_latency
# [████░░░░░░░░░░░] 28.1% remaining
# Exhaustion in: 2.1d
# Detailed prediction for specific SLO
prophet slo-budget api_availability
# Output:
# 🟡 SLO Budget: api_availability
# ────────────────────────────────────────
# Budget: [████████████████░░░░] 85.2%
# Status: nominal
# Confidence: 80%
# Exhaustion: 12.3 days
#
# 📋 85.2% budget remaining. Burn rate is at expected.
#
# 💡 Recommendations:
# • 📈 Monitor burn rate trend closely
# • 🔧 Address known reliability issues
# Get JSON for dashboards
prophet slo-budget api_availability --json
Multi-Window Burn Rate Alerts
Based on Google SRE's multi-window alerting strategy:
prophet slo-alerts
# Output:
# 📊 api_availability
# 🚨 PAGE: 2.00% burned in 5m, 3.50% in 60m (threshold: 2%)
#
# Burn rates:
# 5m: 0.0040%/min
# 1h: 0.0006%/min
# 1d: 0.0001%/min
# Trend: rising
| Window Pair | Budget Threshold | Severity | Meaning |
|---|---|---|---|
| 5m / 60m | 2% | | |
| 30m / 6h | 5% | | |
| 2h / 1d | 10% | | |
| 6h / 3d | 20% | | |
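The multi-window idea is that an alert fires only when both the short and the long window of a pair have burned at least the threshold, which filters out brief blips. The sketch below uses the window pairs from the table and reads the thresholds as percent of total budget, consistent with the `threshold: 2%` shown in the sample output. The data layout and function name are assumptions.

```python
# (short window, long window, percent of budget) from the table above.
WINDOW_PAIRS = [
    ("5m", "60m", 2.0),
    ("30m", "6h", 5.0),
    ("2h", "1d", 10.0),
    ("6h", "3d", 20.0),
]


def firing_alerts(burned_pct):
    """burned_pct: {window: percent of budget burned in that window}.
    Returns the window pairs where BOTH windows meet the threshold."""
    return [
        (short, long_, threshold)
        for short, long_, threshold in WINDOW_PAIRS
        if burned_pct.get(short, 0.0) >= threshold
        and burned_pct.get(long_, 0.0) >= threshold
    ]


# The situation from the sample output: 2.00% in 5m, 3.50% in 60m.
print(firing_alerts({"5m": 2.0, "60m": 3.5}))
```

Requiring the long window as well means a two-minute error burst that has already ended will not page anyone an hour later.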
Status Levels
| Status | Icon | Meaning |
|---|---|---|
| comfortable | 🟢 | Plenty of budget, low burn |
| nominal | 🟡 | On track for the window |
| elevated | 🟠 | Burning faster than expected |
| danger | 🔴 | Budget will exhaust early |
| exhausted | 💀 | Budget gone |
Integration with Incidents
When Prophet detects an incident, it correlates with SLO burn:
🔴 INCIDENT: api_latency breached p99 threshold
⏱️ Estimated recovery: 15 min
💸 Projected budget burn: 0.3%
📊 Budget remaining after: 27.8%
Error Budget Policy Automation
Define policies that trigger automatically:
# [home-path]
slos:
  api_availability:
    target: 0.999
    window_days: 30
    policies:
      danger:
        - freeze_deployments
        - page_oncall
        - notify_stakeholders
      exhausted:
        - block_ci_merges
        - escalate_leadership
Automatic Causality Discovery (Gen 6 Evolution)
Prophet now automatically discovers which metrics influence each other — no manual configuration needed.
How It Works
Three statistical methods work together to identify causal relationships:
1. Cross-Correlation — Finds temporal relationships at different time lags
2. Granger Causality — Tests if one metric's past helps predict another's future
3. Transfer Entropy — Measures information flow between metrics
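Of the three methods, cross-correlation is the easiest to illustrate: shift one series against the other and keep the lag with the strongest Pearson correlation. The pure-Python sketch below is an illustration, not Prophet's implementation. The function names are hypothetical, and Granger causality and transfer entropy are not shown.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def best_lag(cause, effect, max_lag):
    """Return (lag, correlation) where `effect` shifted back by `lag`
    samples correlates most strongly with `cause`."""
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        xs = cause[:len(cause) - lag]
        ys = effect[lag:]
        if len(xs) < 3:
            break  # too few overlapping samples to be meaningful
        r = pearson(xs, ys)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best
```

A strong lagged correlation only says that one metric tends to move before the other. It is a candidate link to verify, which is why the document pairs it with the other two tests.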
Why It Matters
Manual approach: "I think memory affects response time... maybe?" → Guesswork
Auto-discovery: "Memory causes response_time (85
Usage
# Discover all causal relationships
prophet causality discover
# Output:
# 🧬 Discovering causal relationships...
#
# ✅ Discovered 5 causal links:
#
# 📊 memory_percent
# → response_time
# [████████░░] 82% strength
# Lag: 2.5m | Method: granger
# → disk_io
# [██████░░░░] 61% strength
# Lag: 5.0m | Method: cross_correlation
# Explain a specific metric
prophet causal-explain response_time
# Output:
# 🔍 Causal Analysis: response_time
#
# 📥 CAUSED BY:
# ← memory_percent (strength: 82%, lag: 2.5m)
# ← db_connections (strength: 71%, lag: 1.0m)
#
# 📤 CASCADES TO:
# → error_rate (+3.0m, strength: 65%)
# → user_complaints (+10.0m, strength: 45%)
# Test a specific relationship
prophet causal-test memory_percent response_time
# Output:
# 🔬 Testing: memory_percent → response_time
#
# 1️⃣ Cross-Correlation:
# Peak lag: 2 minutes
# Correlation: 0.823
# Direction: memory_percent causes response_time
#
# 2️⃣ Granger Causality:
# F-statistic: 15.32
# Optimal lag: 2 minutes
# Significance: strong
# ✅ Granger causes: Yes
#
# 📋 Summary:
# ✅ Evidence of causality: Correlated (0.82), Granger causes (F=15.3)
# Find root causes
prophet causal-roots error_rate --depth 3
# Output:
# 🌳 Root Causes: error_rate
#
# 📍 memory_percent (depth: 2)
# 📍 db_pool_size (depth: 2)
# 📍 traffic_load (depth: 1)
# Generate visual graph
prophet causality graph
# Output (Mermaid):
# ```mermaid
# graph LR
# memory_percent--2m-->response_time
# memory_percent--5m-->disk_io
# response_time--3m-->error_rate
# ```
Causal Graph Visualization
Prophet can generate Mermaid diagrams of discovered relationships:
graph LR
memory_percent--2m-->response_time
memory_percent--5m-->disk_io
db_connections--1m-->response_time
response_time--3m-->error_rate
error_rate--7m-->user_complaints
Integration with Incident Response
When an incident occurs, Prophet now traces the causal chain:
🔴 INCIDENT: error_rate breached threshold (5% > 1%)
🔍 Root Cause Analysis:
└── memory_percent spike detected 8 minutes ago
└── response_time degraded 5 minutes ago
└── error_rate breached now
💡 Likely root cause: memory_percent (82% confidence)
Historical pattern: Memory pressure → slow responses → errors
Best Practices
1. Collect sufficient data — Run `prophet collect` for at least 24 hours before discovery
2. Run discovery periodically — Relationships can change with system updates
3. Verify surprising links — High correlation doesn't always mean causation
4. Use `causal-test` for specific hypotheses — Get detailed statistical evidence
Event Attribution (Gen 6+ Evolution)
Beyond metric-to-metric causality, Prophet can attribute anomalies to external events like deployments, config changes, traffic spikes, and scheduled jobs.
Recording Events
# Record a deployment
prophet event deployment "Deployed v2.3.1" --severity medium
# Record a config change
prophet event config_change "Increased connection pool to 100" --severity low
# Record a scheduled job
prophet event scheduled_job "Nightly batch processing" --impact-window 30
# Record traffic spike
prophet event traffic_spike "Marketing campaign launch" --severity high
# Collect deployments from git automatically
prophet event-collect git --path /path/to/repo --hours 24
Attributing Anomalies
When an anomaly occurs, find what caused it:
# Attribute a current anomaly
prophet attribute response_time
# Output:
# 🔍 Attributing anomaly: response_time at 2024-01-15 14:30
# Looking back 60 minutes for events...
#
# ## Anomaly Attribution Report
#
# ### 1. 🟢 Deployed v2.3.1
# - **Type:** deployment
# - **Confidence:** high (78%)
# - **Lag:** 12.3 minutes before anomaly
# - **Temporal fit:** 95%
# - **Metric relevance:** 100%
# - **Status:** Pending confirmation (ID: `a7b3c9d2`)
#
# ### 2. 🟡 Increased connection pool
# - **Type:** config_change
# - **Confidence:** medium (52%)
# - **Lag:** 45.1 minutes before anomaly
# Attribute a past anomaly
prophet attribute cpu_percent --timestamp 2024-01-15T10:00:00 --lookback 120
Confirming Attributions (Learning)
Prophet learns from your confirmations to improve future attributions:
# Confirm an attribution was correct
prophet event-confirm a7b3c9d2 --yes
# Reject an incorrect attribution
prophet event-confirm b8c4d0e3 --no
Viewing Learned Patterns
# Show what Prophet has learned
prophet event-patterns
# Output:
# 🧠 Learned Attribution Patterns
#
# 📍 deployment → response_time
# Typical lag: 8.5 min
# Duration: 25.3 min
# Occurrences: 12
# Confidence boost: +35%
#
# 📍 scheduled_job → cpu_percent
# Typical lag: 1.2 min
# Duration: 15.0 min
# Occurrences: 28
# Confidence boost: +50%
Attribution Summary
prophet event-summary --hours 48
# Output:
# 📊 Attribution Summary (last 48h)
#
# By Confidence Level:
# 🟢 high: 15
# 🟡 medium: 8
# 🟠 low: 3
#
# By Event Type:
# • deployment: 12
# • scheduled_job: 8
# • traffic_spike: 4
#
# Confirmation Status:
# ✅ Confirmed: 18
# ❌ Rejected: 3
# ⏳ Pending: 5
# 📈 Accuracy: 86%
#
# 🧠 Learned Patterns: 6
Event Types
| Type | Description | Typical Metrics Affected |
|---|---|---|
| `deployment` | Code/artifact deployments | response_time, error_rate, cpu |
| `config_change` | Configuration updates | response_time, error_rate |
| `scheduled_job` | Cron jobs, batch processing | cpu, memory, disk_io |
| `traffic_spike` | Unusual traffic patterns | response_time, queue_depth |
| `maintenance` | Planned maintenance | availability, response_time |
| `incident` | Known incidents | varies |
| `external` | Third-party service issues | response_time, error_rate |
| `custom` | User-defined events | varies |
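The first step of attribution, looking back from the anomaly for recorded events, can be sketched as a simple window filter. The event dictionaries, the function name, and the nearest-first ordering are assumptions for illustration. How Prophet actually turns lag, temporal fit, and metric relevance into a confidence score is not described here, so it is left out.

```python
from datetime import datetime, timedelta


def candidate_events(events, anomaly_time, lookback_min=60):
    """Return events that happened within `lookback_min` minutes before
    the anomaly, each annotated with its lag, nearest first."""
    window_start = anomaly_time - timedelta(minutes=lookback_min)
    candidates = []
    for event in events:
        if window_start <= event["time"] <= anomaly_time:
            lag = (anomaly_time - event["time"]).total_seconds() / 60
            candidates.append({**event, "lag_min": round(lag, 1)})
    return sorted(candidates, key=lambda e: e["lag_min"])


# Hypothetical event log around the anomaly shown in the report above.
anomaly = datetime(2024, 1, 15, 14, 30)
events = [
    {"type": "config_change", "time": datetime(2024, 1, 15, 13, 44, 54)},
    {"type": "deployment", "time": datetime(2024, 1, 15, 14, 17, 42)},
    {"type": "scheduled_job", "time": datetime(2024, 1, 15, 12, 0)},
]
for event in candidate_events(events, anomaly):
    print(event["type"], event["lag_min"], "min before anomaly")
```

The scheduled job falls outside the 60-minute lookback and is dropped, leaving the deployment and the config change as candidates.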
Integration with Incident Response
🔴 INCIDENT: response_time breached threshold (850ms > 500ms)
🔍 Causal Analysis:
Metric chain: memory_percent → response_time (lag: 2.5m)
🎯 Event Attribution:
1. 🟢 [HIGH] Deployed v2.3.1 (12 min ago)
Historical pattern: deployments → response_time (+15%, 8m lag)
2. 🟡 [MEDIUM] DB pool config change (45 min ago)
💡 Likely root cause: Deployment v2.3.1
Suggested action: Review recent deployment changes
Why It Matters
Reactive monitoring: "Server is slow" → users already suffering
Proactive prediction: "Server will be slow in 20 minutes" → time to act
Manual causality: "What's causing this?" → Guesswork and war rooms
Auto-discovery: "memory_percent → response_time (lag: 2.5m)" → Instant root cause
Manual attribution: "Was it the deployment?" → Hours of correlation
Auto-attribution: "Deployment v2.3.1 (78
Reactive SLO monitoring: "Error budget exhausted" → SLO already broken
Proactive SLO prediction: "Budget exhausts in 3 days" → time to prioritize reliability
The difference between firefighting and fire prevention.
Predictive Capacity Planning (Gen 6+ Evolution)
Prophet now forecasts when you'll run out of capacity and recommends scaling actions before you hit limits.
Philosophy
Reactive capacity: "Disk full" → Scramble to add storage at 3 AM
Predictive capacity: "Disk will fill in 14 days" → Plan and budget calmly
Quick Start
# Generate capacity plan
prophet capacity
# Output:
# 📊 Capacity Planning Report
# Generated: 2024-01-15 14:30
# Horizon: 90 days
#
# ✅ All systems have comfortable capacity headroom
# Risk Score: 10%
#
# 📈 Capacity Forecasts:
#
# 🟢 cpu_percent
# Utilization: [████░░░░░░] 42%
# Current: 42.3 / 85 (threshold)
# Trend: stable/declining ✓
# Growth: -0.12/day (stable)
# Confidence: 78%
#
# 🟡 disk_percent
# Utilization: [███████░░░] 68%
# Current: 68.2 / 90 (threshold)
# Time to breach: 2.3 weeks
# Forecast date: 2024-01-29
# Growth: +1.35/day (linear)
# Confidence: 85%
#
# 🔧 Scaling Recommendations:
#
# 🟡 disk_percent
# Action: plan_scaling
# Plan disk_percent capacity increase for ~2.3 weeks
# Current: 68.2 → Recommended: 115.0
# Act by: 2024-01-22
# Cost impact: ~1.2x
# • Forecast breach: 2024-01-29
# • Time to plan and budget
# Alternatives:
# - Monitor and reassess in 1 week
Commands
# Full capacity plan with all metrics
prophet capacity
# Detailed forecast for specific metric
prophet capacity-forecast disk_percent
# Analyze growth patterns
prophet capacity-growth memory_percent --window 14
# View/set thresholds
prophet capacity-threshold
prophet capacity-threshold cpu_percent --warning 70 --critical 85 --max 100
# Record scaling actions (for learning)
prophet capacity-scale disk_percent scale_up --old 100 --new 200 --description "Added second SSD"
# View scaling history
prophet capacity-history --days 30
prophet capacity-history --metric disk_percent
# Compare forecasts over time
prophet capacity-compare disk_percent --days 7
Growth Analysis
Prophet analyzes growth patterns to understand how resources are trending:
prophet capacity-growth memory_percent
# Output:
# 📊 Growth Analysis: memory_percent
#
# Current value: 72.50
# Daily growth: +0.85
# Growth rate: +1.17%/day
#
# Trend: linear
# Seasonality: daily
# Volatility: 0.15
#
# Data points: 2048
# Window: 14 days
# Confidence: 82%
#
# Projections (at current rate):
# 7d: 78.45
# 30d: 98.00
# 90d: 149.00
Trend Types
| Trend | Description | Action |
|---|---|---|
| `stable` | Minimal change | Monitor |
| `linear` | Steady growth | Plan ahead |
| `exponential` | Accelerating | Act soon |
| `saturating` | Approaching limit | May self-limit |
Urgency Levels
| Urgency | Time to Threshold | Recommended Action |
|---|---|---|
| 🟢 `comfortable` | > 30 days | Continue monitoring |
| 🔵 `planning` | 14-30 days | Budget and plan |
| 🟡 `soon` | 7-14 days | Schedule scaling |
| 🟠 `urgent` | 3-7 days | Scale this week |
| 🔴 `critical` | < 3 days | Emergency scaling |
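For a `linear` trend, the numbers in the reports above follow from two small formulas: projected value = current + growth × days, and days to threshold = (threshold − current) / growth. The sketch below applies them and maps the result onto the urgency table. The function names are hypothetical, and treating each band's lower bound as inclusive is an assumption.

```python
def project(current, growth_per_day, days):
    """Linear projection of a metric `days` into the future."""
    return current + growth_per_day * days


def days_to_threshold(current, threshold, growth_per_day):
    """Days until a linearly growing metric reaches `threshold`
    (0.0 if already there, None if it is not growing)."""
    if current >= threshold:
        return 0.0
    if growth_per_day <= 0:
        return None
    return (threshold - current) / growth_per_day


def urgency(days):
    """Map days-to-threshold onto the urgency levels in the table."""
    if days is None or days > 30:
        return "comfortable"
    if days >= 14:
        return "planning"
    if days >= 7:
        return "soon"
    if days >= 3:
        return "urgent"
    return "critical"
```

Using the disk example, (90 − 68.2) / 1.35 is about 16.1 days, or roughly 2.3 weeks, which lands in the `planning` band. An `exponential` trend would need a different projection formula.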
Recording Scaling Actions
Track what you did to improve future predictions:
# After upgrading disk
prophet capacity-scale disk_percent scale_up \
--old 100 --new 200 \
--description "Upgraded to 200GB SSD"
# After adding instances
prophet capacity-scale cpu_percent scale_out \
--old 100 --new 200 \
--description "Added second server"
# After optimization
prophet capacity-scale memory_percent optimize \
--old 85 --new 60 \
--description "Fixed memory leak in API service"
Comparing Forecasts
See if predictions are getting more accurate:
prophet capacity-compare disk_percent --days 7
# Output:
# 📈 Forecast Trend: disk_percent (last 7 days)
#
# Date Days to Threshold Growth Rate Confidence
# ----------------------------------------------------------------------
# 2024-01-08 10:30 25.3 days +0.98/day 75%
# 2024-01-09 10:30 22.1 days +1.12/day 78%
# 2024-01-10 10:30 19.8 days +1.25/day 80%
# 2024-01-11 10:30 16.2 days +1.35/day 82%
# 2024-01-12 10:30 14.5 days +1.35/day 85%
Custom Thresholds
Set different thresholds for different environments:
# Production: stricter thresholds
prophet capacity-threshold cpu_percent --warning 60 --critical 75 --max 100
prophet capacity-threshold memory_percent --warning 70 --critical 85 --max 100
# Development: more relaxed
prophet capacity-threshold disk_percent --warning 85 --critical 95 --max 100
Integration with Heartbeats
Add to HEARTBEAT.md for proactive capacity monitoring:
## Daily Capacity Check
- [ ] Run `prophet capacity` — any urgent scaling needed?
- [ ] Critical resources: < 7 days to threshold → escalate
- [ ] Planning horizon: 14-30 days → add to sprint backlog
Cost-Aware Recommendations
Prophet considers cost impact when recommending scaling:
# Cost model in config
[home-path]
{
  "cost_model": {
    "cpu": {"scale_factor": 2, "cost_ratio": "2x"},
    "memory": {"scale_factor": 2, "cost_ratio": "1.5x"},
    "disk": {"scale_factor": 2, "cost_ratio": "1.2x"}
  }
}
Why It Matters
Reactive capacity: "Out of disk space" → 3 AM emergency
Predictive capacity: "Disk exhausts in 14 days" → calm planning
Manual tracking: "How fast is memory growing?" → Spreadsheets
Auto-analysis: "1.17
Surprise costs: "We need to double capacity NOW" → budget shock
Planned scaling: "Schedule 50
Capacity planning turns infrastructure surprises into scheduled maintenance.
Predictive Remediation Engine (Gen 6+ Evolution)
Prophet now recommends specific remediation actions when issues are detected, based on what worked historically and industry best practices.
Philosophy
Reactive debugging: "Server is slow" → Guesswork and war rooms
Predictive remediation: "Here are 3 fixes, ranked by effectiveness" → Instant action
How It Works
1. Knowledge Base — Built-in SRE best practices for common issues
2. Historical Learning — Records outcomes to improve future recommendations
3. Causal Awareness — Recommends fixing root causes, not symptoms
4. Pattern Matching — Different patterns need different fixes
Quick Start
# Get remediation recommendations for a metric issue
prophet remediate response_time --value 850 --threshold 500
# Output:
# 🔧 Remediation Recommendations
#
# ### 1. Rollback recent deployment
# [████████░░] 78% confidence
# Revert to previous known-good version
#
# 📊 Expected improvement: 90%
# ⏱️ Time to effect: ~5 min
# 🟢 Effort: low
# 🟡 Risk: medium
# 📋 Based on 12 past uses (83% success)
#
# 💡 Evidence:
# • Recent deployment detected: v2.3.1
# • Historical success rate: 83%
# • Recommended for 'spike' pattern
#
# ### 2. Increase connection pool
# [██████░░░░] 65% confidence
# Increase database connection pool size
# ...
Commands
# Get recommendations for any metric
prophet remediate memory_percent --value 92 --threshold 85
# Include pattern context for better recommendations
prophet remediate cpu_percent --value 95 --pattern gradual_rise
# Skip causal analysis (faster, but less accurate)
prophet remediate response_time --value 800 --no-causality
# Record a remediation outcome (for learning)
prophet remediate-record mem_restart_service memory_percent \
--before 92 --after 45 --threshold 85 --success \
--time-to-effect 2 --duration 480
# View remediation statistics
prophet remediate-stats
prophet remediate-stats --action-id mem_restart_service
Adding Custom Actions
# Add a custom remediation action
prophet remediate-action --add \
--name "Restart API Gateway" \
--description "Restart the API gateway to clear stale connections" \
--category restart \
--metrics response_time,error_rate \
--commands "kubectl rollout restart deployment/api-gateway" \
--impact 75 \
--effort low \
--risk low \
--tags quick_fix,gateway
# List custom actions
prophet remediate-action
prophet remediate-action --metric response_time
Playbooks (Action Sequences)
Create playbooks for common incident types:
# Create a playbook for memory issues
prophet remediate-playbook --create \
--name "Memory Pressure Response" \
--description "Standard response for memory exhaustion" \
--trigger-metric memory_percent \
--trigger-pattern gradual_rise \
--actions "mem_restart_service,mem_increase_limit,mem_fix_leak"
# Find matching playbook for a situation
prophet remediate-playbook --match memory_percent --pattern gradual_rise
# List all playbooks
prophet remediate-playbook
Built-in Knowledge Base
Prophet includes remediation knowledge for common metrics:
| Metric | Actions Available |
|---|---|
| `memory_percent` | Restart service, increase limit, fix leak, add swap |
| `cpu_percent` | Scale out, scale up, optimize code, rate limit |
| `response_time` | Add cache, DB indexes, increase pool, rollback, scale |
| `error_rate` | Rollback, circuit breaker, tune retries, fix bug |
| `disk_percent` | Cleanup logs, expand volume, compress, archive |
| `queue_depth` | Add workers, throttle input, priority queuing |
| `db_connections` | Increase pool, fix leak, add replica |
Confidence Scoring
Recommendations are ranked by confidence based on:
| Factor | Weight | Description |
|---|---|---|
| Historical success | 40 | |
| Pattern match | 15 | |
| Event alignment | 15 | |
| Risk penalty | -5 to -35 | |
| Base knowledge | 30 | |
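One way to read the table is as a weighted sum: each positive factor is scored between 0 and 1, multiplied by its weight, and a risk penalty is subtracted. The sketch below follows that reading, which is an assumption. The function name, treating the weights as percentage points, and the 20-point penalty for medium risk (the midpoint of the stated range) are all illustrative choices.

```python
# Weights from the table, read as percentage points (assumption).
WEIGHTS = {
    "historical_success": 40,
    "pattern_match": 15,
    "event_alignment": 15,
    "base_knowledge": 30,
}

# The table gives a penalty range of 5 to 35. Assigning the ends to
# low/high risk and the midpoint to medium risk is an assumption.
RISK_PENALTY = {"low": 5, "medium": 20, "high": 35}


def confidence(factors, risk):
    """factors: {name: score in 0..1}. Returns a 0..100 confidence."""
    raw = sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)
    return max(0.0, min(100.0, raw - RISK_PENALTY[risk]))
```

For an action with an 83% historical success rate, full pattern and event matches, a base-knowledge score of 0.8, and medium risk, this gives 33.2 + 15 + 15 + 24 − 20 = 67.2.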
Integration with Incident Response
When Prophet detects an incident, it automatically includes remediation:
🔴 INCIDENT: response_time breached threshold (850ms > 500ms)
🔍 Root Cause Analysis:
└── db_connections spike detected 5 minutes ago
🔧 Recommended Actions:
1. [78%] Increase connection pool (effort: trivial)
2. [65%] Rollback recent deployment (effort: low)
3. [52%] Add read replica (effort: high)
💡 Start with: Increase connection pool
Highest confidence, lowest effort, addresses root cause
Learning Loop
Prophet learns from every remediation:
# After successful remediation
prophet remediate-record db_increase_pool db_connections \
--before 95 --after 40 --threshold 80 --success \
--time-to-effect 1 --duration 1440 \
--notes "Increased pool from 50 to 100"
# After failed remediation
prophet remediate-record rt_add_cache response_time \
--before 850 --after 800 --threshold 500 \
--notes "Cache helped but not enough - need DB optimization"
Why It Matters
Reactive debugging: "What should we try?" → Guesswork
Predictive remediation: "Try X (78
Manual runbooks: "Check the wiki..." → Stale documentation
Dynamic playbooks: "Based on 15 similar incidents..." → Living knowledge
Individual knowledge: "Ask Bob, he fixed this before" → Bus factor
Organizational memory: "Action X worked 83
Remediation recommendations turn tribal knowledge into actionable intelligence.
Prediction Explainer (Gen 6+ Evolution)
ML interpretability module that explains WHY predictions are made, not just what they predict.
Philosophy
> "A prediction without explanation is just a guess with confidence."
Traditional ML alerts say "response time will spike." But that's useless without knowing WHY it's going to spike and WHAT you can do to prevent it.
How It Works
1. Feature Attribution — SHAP-like contribution analysis shows how much each metric contributes to the prediction
2. Counterfactual Generation — "What would need to change to avoid this?" with specific targets
3. Pattern Matching — Recognizes common problematic patterns (resource exhaustion, swapping, backlog)
4. Natural Language Explanations — Human-readable summaries, not just numbers
Quick Start
# Explain a current prediction
prophet explain response_time
# Demo with sample data
prophet explain-demo
# Show what changes would help
prophet counterfactual response_time
# Show feature contributions
prophet contributors response_time
# Update baselines from history
prophet update-baselines --days 7
Example Output
## 🟠 Prediction Explanation
**Metric:** response_time
**Predicted Value:** 650.00
**Confidence:** 85%
**Severity:** WARNING
### Summary
Response times are predicted to reach 650ms within 25 minutes, which would
exceed your 500ms threshold. Users may start noticing delays.
### Why This Is Happening
- ↑ **cpu_percent**: above CPU usage (78.5% vs normal 45.2%) is contributing to the prediction
- ↑ **memory_percent**: above memory (82.3% vs normal 55.1%) contributing to slowdowns
- ↑ **queue_depth**: Queue depth of 156 (above from 45) contributing to delays
### Pattern Matches
- 🔍 Resource exhaustion pattern: Both CPU and memory elevated
### Recommended Actions
🟢 Profile and optimize CPU-intensive operations — Expected impact: 35% risk reduction
🟢 Tune garbage collection parameters — Expected impact: 28% risk reduction
🟡 Add more compute capacity or scale horizontally — Expected impact: 20% risk reduction
### What Would Help
- Reducing **cpu_percent** by 25% would reduce risk by ~35%
- Reducing **memory_percent** by 20% would reduce risk by ~28%
- Reducing **queue_depth** by 50% would reduce risk by ~20%
Feature Attribution
The explainer calculates contribution scores for each metric:
| Feature | Contribution | Explanation |
|---|---|---|
| cpu_percent | +12.5 | Elevated CPU increases latency |
| memory_percent | +8.3 | GC pauses causing delays |
| queue_depth | +5.1 | Backlog building up |
| error_rate | -2.0 | Low errors mitigating |
Positive contributions push toward the prediction; negative contributions work against it.
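A simple stand-in for this kind of attribution is to score each metric by how far it sits from its baseline, in standard deviations, times a per-metric weight. This is not SHAP and is not claimed to be Prophet's method. The function name, the `(mean, std)` baseline layout, and the weights are assumptions made for the example.

```python
def contributions(current, baselines, weights):
    """current:   {metric: value}
    baselines: {metric: (mean, std)} learned from history
    weights:   {metric: influence on the predicted metric}
    Returns {metric: signed contribution}, largest magnitude first."""
    scores = {}
    for metric, value in current.items():
        base_mean, base_std = baselines[metric]
        # z-score: how unusual the current value is for this metric.
        z = (value - base_mean) / base_std if base_std > 0 else 0.0
        scores[metric] = weights.get(metric, 0.0) * z
    return dict(sorted(scores.items(),
                       key=lambda item: abs(item[1]), reverse=True))


# Hypothetical baselines and weights.
print(contributions(
    current={"cpu_percent": 78.5, "error_rate": 0.01},
    baselines={"cpu_percent": (45.2, 11.1), "error_rate": (0.02, 0.01)},
    weights={"cpu_percent": 4.0, "error_rate": 2.0},
))
```

A metric above its baseline gets a positive score and one below gets a negative score, matching the sign convention described above.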
Counterfactual Analysis
Shows what changes would avoid the predicted problem:
🔄 Counterfactuals for response_time
Predicted: 650.00 → Threshold: 500
What would help:
1. 🟢 cpu_percent
Current: 78.50 → Target: 58.87
Change needed: 25% reduction
Impact: ~35% risk reduction
Action: Profile and optimize CPU-intensive operations
2. 🟢 memory_percent
Current: 82.30 → Target: 65.84
Change needed: 20% reduction
Impact: ~28% risk reduction
Action: Tune garbage collection parameters
Baseline Learning
The explainer learns what "normal" looks like for your system:
# Update baselines from last week
prophet update-baselines --days 7
# Baselines are used to:
# - Calculate deviation (z-score) for each metric
# - Determine if values are unusually high/low
# - Weight contribution calculations
Integration with Other Modules
With Causality Engine:
- Explanations use discovered causal relationships
- "cpu_percent causes response_time (lag: 2.5 min)"
With Attribution:
- Links predictions to external events
- "Response time spike correlates with deploy_20260201"
With Remediation:
- Recommended actions come from remediation engine
- Actions have confidence scores from historical outcomes
Why It Matters
Black box alerts: "Response time will spike" → Panic
Explained predictions: "Response time will spike because CPU is exhausted" → Action
Guessing at fixes: "Maybe we should scale up?" → Wasted effort
Counterfactual guidance: "Reduce CPU by 25
Tribal knowledge: "Bob knows what to check" → Bus factor
Automated explanations: "Here's why and what to do" → Self-service
Explainable predictions turn alerts into actionable intelligence.
Deployment Risk Predictor (Gen 6+ Evolution)
Predicts the likelihood that a deployment will cause performance problems before it goes live. Uses historical patterns, change characteristics, system state, and timing factors to generate risk scores.
Philosophy
> "The best incident is the one that never happens."
Traditional deployment processes: merge → deploy → cross fingers → incident → rollback → postmortem.
Performance Prophet: merge → analyze risk → informed decision → targeted monitoring → smooth deployment.
How It Works
1. Code Change Analysis — Size, complexity, affected areas, dependency changes
2. Historical Patterns — What happened with similar deployments in the past
3. System State — Current health affects deployment risk
4. Timing Factors — Day, time, proximity to recent incidents
5. Context Signals — Test coverage, review thoroughness, deployment frequency
Quick Start
# Analyze a deployment (auto-detects git changes)
prophet deploy-risk
# Analyze specific files/changes
prophet deploy-risk --files "src/db/migrations/001.sql,src/api/routes.py" \
--message "feat: add payment processing"
# With test coverage
prophet deploy-risk --coverage 75
# In CI/CD - fail if high risk
prophet deploy-risk --fail-on-high
# Record deployment for learning
prophet deploy-risk --record
# Record outcome after deployment
prophet deploy-record <deployment-id> --incident-24h --rollback
# View deployment history
prophet deploy-history --days 30
# Show learned patterns
prophet deploy-patterns
Example Output
🔍 Analyzing deployment risk...
🟠 **Deployment Risk: 67/100 (HIGH)**
⚠️ **Recommendation: PROCEED WITH CAUTION** - Review risks below.
**Key Risk Factors:**
- Database migration detected
- 3 critical path files modified: src/db/migrations/001.sql, src/payments/handler.py
- Deploying on Friday - weekend/end of week
- Test coverage is 65% (below 70% threshold)
### Signal Breakdown
| Signal | Category | Value | Contribution |
|--------|----------|-------|--------------|
| migration_detected | code | 90% | ████ 0.23 |
| hotpath_touched | code | 60% | ██ 0.13 |
| deploy_day_risk | timing | 50% | █ 0.06 |
| test_coverage_delta | context | 14% | 0.03 |
### Recommendations
1. 🚨 Consider delaying this deployment
2. Test migration on production copy first
3. Ensure backward compatibility
4. Have rollback migration ready
5. Deploy during low-traffic window
**Optimal Deploy Window:** Tuesday 10:00 AM (lower risk window)
### Similar Past Deployments
- 2026-01-15: risk 72 → 🔴 incident (rolled back)
- 2026-01-08: risk 45 → ✅ ok
Risk Signals
Code Signals
| Signal | Description | Weight |
|---|---|---|
| `lines_changed` | Total lines modified (logarithmic) | 15 |
| `files_changed` | Number of files touched | 12 |
| `complexity_delta` | Code complexity changes | 18 |
| `dependency_changes` | Package/dependency updates | 20 |
| `migration_detected` | Database migrations present | 25 |
| `hotpath_touched` | Critical path files modified | 22 |
Hot Path Patterns
These file patterns are considered high-risk:
- `database|db|query|sql` — Database layer
- `cache|redis|memcache` — Caching systems
- `auth|login|session|token` — Authentication
- `payment|billing|transaction` — Financial operations
- `queue|worker|job|task` — Background processing
- `api|endpoint|route` — API surface
- `config|settings|env` — Configuration
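A minimal sketch of how these patterns could be applied to a changed-file list. It assumes case-insensitive substring matching on the file path, which is not specified above; `HOT_PATH_PATTERNS` and `hot_path_hits` are illustrative names, not part of Prophet's API.

```python
import re

# Alternations copied from the list above; labels are the descriptions given there.
HOT_PATH_PATTERNS = {
    "Database layer": r"database|db|query|sql",
    "Caching systems": r"cache|redis|memcache",
    "Authentication": r"auth|login|session|token",
    "Financial operations": r"payment|billing|transaction",
    "Background processing": r"queue|worker|job|task",
    "API surface": r"api|endpoint|route",
    "Configuration": r"config|settings|env",
}

# Assumption: a pattern may match anywhere in the path, ignoring case.
_COMPILED = {
    label: re.compile(pattern, re.IGNORECASE)
    for label, pattern in HOT_PATH_PATTERNS.items()
}


def hot_path_hits(path: str) -> list[str]:
    """Return the label of every hot-path category that the file path matches."""
    return [label for label, rx in _COMPILED.items() if rx.search(path)]


if __name__ == "__main__":
    for changed in ("src/db/migrations/001.sql", "src/utils/strings.py"):
        print(changed, "->", hot_path_hits(changed) or "no hot path")
```

Under this assumption the match is deliberately loose: a path such as `docs/capital.md` would count as API surface because it contains `api`. Anchoring patterns to path segments would reduce such false positives.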
Historical Signals
| Signal | Description | Weight |
|---|---|---|
| `author_incident_rate` | Author's past deployment success | 15 |
| `similar_deployment_failures` | Pattern-matched failure rate | 25 |
| `recent_incident_proximity` | Incidents in past week | 18 |
| `regression_frequency` | How often changes regress | 20 |
System State Signals
| Signal | Description | Weight |
|---|---|---|
| `current_health` | Response time/error rate | 22 |
| `active_alerts` | Outstanding alerts | 20 |
| `resource_headroom` | CPU/memory available | 15 |
| `traffic_level` | Current load level | 12 |
Timing Signals
| Signal | Description | Weight |
|---|---|---|
| `deploy_hour_risk` | Time of day (avoid 9pm-5am) | 15 |
| `deploy_day_risk` | Day of week (avoid Fri-Sun) | 12 |
| `freeze_proximity` | Near change freeze? | 18 |
| `incident_proximity` | Time since last incident | 20 |
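The sample signal breakdown in the example output is consistent with each contribution being the signal value multiplied by its weight and divided by 100 (for `migration_detected`, 0.90 × 25 / 100 ≈ 0.23). That relationship is inferred from the example numbers rather than from a documented formula, and how contributions combine into the final 0-100 score is not stated here, so this sketch stops at per-signal contributions. The function names are illustrative.

```python
# Weights copied from the signal tables above (subset used in the example output).
WEIGHTS = {
    "migration_detected": 25,
    "hotpath_touched": 22,
    "deploy_day_risk": 12,
}


def contribution(signal: str, value: float) -> float:
    """Contribution of one signal, assuming value (0..1) * weight / 100."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("signal value must be between 0 and 1")
    return value * WEIGHTS[signal] / 100


def breakdown(values: dict[str, float]) -> list[tuple[str, float]]:
    """Per-signal contributions, largest first, like the breakdown table."""
    rows = [(name, contribution(name, v)) for name, v in values.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    observed = {
        "deploy_day_risk": 0.50,
        "migration_detected": 0.90,
        "hotpath_touched": 0.60,
    }
    for name, value in breakdown(observed):
        print(f"{name:20s} {value:.3f}")
```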
Risk Levels
| Level | Score | Meaning | Action |
|---|---|---|---|
| 🟢 SAFE | < 20 | No concerns | Deploy freely |
| 🔵 LOW | 20-40 | Minor considerations | Standard process |
| 🟡 MODERATE | 40-60 | Notable risks | Extra monitoring |
| 🟠 HIGH | 60-80 | Significant risks | Review carefully |
| 🔴 CRITICAL | > 80 | Major concerns | Consider delay |
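The bands in the table share their boundary values (40 appears in both LOW and MODERATE, for example). The sketch below assumes a shared boundary belongs to the higher band, except 80, which stays HIGH because CRITICAL is listed as strictly greater than 80. `risk_level` is an illustrative name.

```python
def risk_level(score: float) -> str:
    """Map a 0-100 risk score to the level names in the table above."""
    if not 0 <= score <= 100:
        raise ValueError("risk score must be between 0 and 100")
    if score < 20:
        return "SAFE"
    if score < 40:
        return "LOW"
    if score < 60:
        return "MODERATE"
    if score <= 80:  # CRITICAL starts strictly above 80
        return "HIGH"
    return "CRITICAL"
```

With this mapping, the score of 67 from the example output falls in HIGH, matching the label shown there.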
Go/No-Go Decisions
Based on risk analysis:
- GO — Risk is low, proceed with standard process
- CAUTION — Risk is moderate, proceed with extra monitoring
- NO_GO — Risk is high, address issues before deploying
CI/CD Integration
# GitHub Actions example
deploy:
  steps:
    - name: Analyze deployment risk
      run: |
        prophet deploy-risk --fail-on-high --record
    - name: Deploy (if safe)
      if: success()
      run: ./deploy.sh
    - name: Record outcome
      if: always()
      run: |
        if [ "${{ job.status }}" == "failure" ]; then
          prophet deploy-record $DEPLOYMENT_ID --incident-24h
        fi
Learning Loop
Prophet learns from every deployment:
# After successful deployment
prophet deploy-record abc123def456
# After incident within 24h
prophet deploy-record abc123def456 --incident-24h
# After rollback
prophet deploy-record abc123def456 --rollback --incident-1h
Pattern Recognition
Over time, Prophet learns which patterns are risky for your codebase:
prophet deploy-patterns
# Output:
# 📊 Learned Deployment Patterns
#
# **Changes to: src/db, src/api**
# 🔴 HIGH RISK
# Deployments: 15 | Incidents: 5 (33%)
# Avg Risk Score: 68
#
# **Changes to: src/utils**
# 🟢 SAFE
# Deployments: 42 | Incidents: 0 (0%)
# Avg Risk Score: 22
Why It Matters
Hope-based deployment: "It passed tests, ship it!" → 3 AM incident
Risk-aware deployment: "Risk is 72
Reactive incidents: "Why did this happen?" → Postmortem
Proactive prevention: "This looks similar to the Jan 15 outage" → Avoidance
Tribal knowledge: "Bob always deploys on Tuesday mornings" → Bus factor
Codified wisdom: "Tuesday 10am is 40
Manual judgment: "Seems fine to me" → Bias
Data-driven decisions: "67
Deployment risk prediction turns gut feelings into quantified decisions.
Adaptive Thresholds (Gen 6+ Evolution)
Static thresholds are the enemy of good alerting:
- Too tight → Alert fatigue, ignored warnings
- Too loose → Missed issues, angry users
Adaptive thresholds learn what "normal" looks like for YOUR system, accounting for:
- Time of day — Traffic spikes at 10am? Normal!
- Day of week — Quiet weekends? Different baseline
- Seasonal patterns — Holiday traffic? Adjusted
- System growth — Gradual metric creep over months
- Context — During deployment? Different expectations
Concept
Traditional: `if cpu > 75: alert()`
Adaptive: `if cpu > learned_p95_for_this_hour: alert()`
Thresholds adapt to reality, not the other way around.
Quick Start
# View adaptive threshold status
prophet adaptive status
# Register a metric with base thresholds
prophet adaptive register cpu_percent 75 90 above
prophet adaptive register response_time 500 1000 above
# Learn optimal thresholds from history
prophet adaptive learn cpu_percent
# Check a value against adaptive thresholds
prophet adaptive check cpu_percent 82
# View growth trends
prophet adaptive growth
Context Windows
Prophet automatically detects time-based contexts:
| Context | Time | Description |
|---|---|---|
| `peak` | 9am-6pm weekdays | High traffic periods |
| `off_peak` | 6am-9am, 6pm-11pm | Normal traffic |
| `quiet` | 12am-6am | Night/minimal traffic |
| `weekend` | Sat-Sun | Weekend patterns |
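One way the time-based contexts in the table could be derived from a timestamp. Two points are assumptions, since the table does not settle them: weekends take precedence over hour-of-day bands, and the unlisted hour from 11pm to midnight is treated as `quiet`. `time_context` is an illustrative name.

```python
from datetime import datetime


def time_context(ts: datetime) -> str:
    """Classify a timestamp into one of the context names from the table above."""
    # Assumption: weekend wins over any hour-of-day band.
    if ts.weekday() >= 5:  # Monday is 0, so 5 and 6 are Saturday and Sunday
        return "weekend"
    hour = ts.hour
    if 9 <= hour < 18:
        return "peak"
    if 6 <= hour < 9 or 18 <= hour < 23:
        return "off_peak"
    # 12am-6am per the table; 11pm-12am is not listed and is assumed quiet here.
    return "quiet"


if __name__ == "__main__":
    print(time_context(datetime(2026, 2, 3, 10, 0)))  # a Tuesday morning
```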
Manual context windows for special situations:
# Starting a deployment (relax thresholds temporarily)
prophet adaptive context start deployment
# Deployment complete
prophet adaptive context end deployment
# Maintenance window
prophet adaptive context start maintenance
prophet adaptive context end maintenance
Learning from Feedback
The system learns from your feedback on alerts:
# View recent violations
prophet adaptive violations
# Output:
# [42] cpu_percent = 78.50 (warning) @ 2025-01-15 10:30 [?]
# [41] memory_percent = 85.20 (warning) @ 2025-01-15 09:15 [?]
# [40] response_time = 520.00 (warning) @ 2025-01-14 14:22 [✓]
# Provide feedback: was it a valid alert?
prophet adaptive feedback 42 invalid # False positive → relax thresholds
prophet adaptive feedback 41 valid # Good catch → maintain thresholds
prophet adaptive feedback 40 valid 15 # Valid, resolved in 15 minutes
Threshold Suggestions
Prophet analyzes patterns and suggests threshold adjustments:
prophet adaptive suggest
# Output:
# 🔴 [relax_thresholds] cpu_percent
# High false positive rate (45%)
# Suggest: warning=86.25, critical=103.50
#
# 📈 [growth_warning] memory_percent
# Gradual growth (18.5%)
# 🚨 memory_percent has grown 18.5% over analyzed period.
# Consider adjusting base thresholds upward.
Growth Detection
Catch the "boiling frog" scenario — metrics slowly creeping up over weeks:
prophet adaptive growth
# Output:
# 📈 cpu_percent: growing (12.3%)
# ➡️ memory_percent: stable (-2.1%)
# 🚀 disk_usage: rapid_growth (35.7%)
How It Works
1. Percentile Learning — Calculates p95/p99 from historical data per context
2. Feedback Integration — Adjusts thresholds based on false positive/negative feedback
3. Growth Adjustment — Detects long-term metric drift
4. Confidence Scoring — Only uses learned thresholds when confidence is high
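Steps 1 and 4 can be sketched together: compute a percentile from the history for one context, and fall back to the registered base threshold when there is too little data. The confidence formula (sample count divided by `full_confidence_at`) and the `min_confidence` cutoff are placeholders chosen for illustration; how Prophet scores confidence is not described here.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def learned_threshold(
    samples: list[float],
    base: float,
    pct: float = 95.0,
    full_confidence_at: int = 500,  # hypothetical sample count for 100% confidence
    min_confidence: float = 0.5,    # hypothetical cutoff for trusting learned values
) -> tuple[float, float]:
    """Return (threshold, confidence) for the samples of one metric and context."""
    confidence = min(1.0, len(samples) / full_confidence_at)
    if not samples or confidence < min_confidence:
        return base, confidence  # not enough history: keep the base threshold
    return percentile(samples, pct), confidence
```

Calling this once per context (`peak`, `off_peak`, and so on) with that context's samples would give a separate band for each.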
Threshold Bands Per Context
Each metric can have different thresholds for different contexts:
cpu_percent thresholds:
peak: warning=82, critical=95 (learned, 87% confidence)
off_peak: warning=70, critical=88 (learned, 72% confidence)
quiet: warning=45, critical=70 (learned, 65% confidence)
weekend: warning=55, critical=80 (learned, 58% confidence)
Status Report
prophet adaptive status
# Output:
# 🎯 Adaptive Thresholds Status
#
# **Metrics Tracked:** 5
# **Metrics Adapted:** 3
# **Violations (30d):** 47
# **False Positive Rate:** 18.5%
# **Pending Feedback:** 12
#
# ## Per-Metric Status
#
# ### 🟢 cpu_percent (✓ adapted)
# - Warning: 75.00 → **82.50**
# - Critical: 90.00 → **97.35**
# - Context: peak | Confidence: 87%
# - Violations: 12 | FP: 2 | Pending: 1
Why It Matters
Static thresholds: "CPU > 75
Adaptive thresholds: "CPU > p95 for this hour" → Alerts only for anomalies
One-size-fits-all: "Response time > 500ms" → 3am alert for a slow batch job
Context-aware: "Response time > 200ms at 3am" → Catch actual issues
Set and forget: Thresholds drift out of touch with system reality
Continuous learning: Thresholds evolve with your system
Alert fatigue: "Just ignore CPU alerts, they're always noisy"
Actionable alerts: "If Prophet alerts, it's probably real"
Adaptive thresholds turn monitoring from a firehose into a precision instrument.
Database Schema
Adaptive threshold data is stored in the same `prophet.db`:
- `adaptive_metrics` — Metric configurations and learned bands
- `context_windows` — Active context windows (deployments, maintenance)
- `threshold_violations` — Historical violations for learning
- `hourly_aggregates` — Per-hour percentile data for learning
- `metric_growth` — Weekly growth tracking for drift detection
Anomaly Fingerprints (Gen 6+ Evolution)
Incident pattern recognition — Prophet learns to recognize recurring problems.
When an anomaly looks like something you've seen before, you shouldn't have to figure it out from scratch. Fingerprints capture the "signature" of incidents so Prophet can say: *"This looks 87
Concept
A fingerprint captures:
- Which metrics were affected (and how)
- The pattern shape (spike, sustained, oscillating)
- Magnitude (how far from normal)
- Causality chain (which metric went first)
- Time context (peak hours? weekend?)
- Resolution (what fixed it)
- TTR (how long to resolve)
When new anomalies occur, Prophet matches them against the library and suggests proven fixes.
Quick Start
# List all fingerprints
prophet fingerprint list
# Capture current incident as a fingerprint
prophet fingerprint capture "Memory leak in API" --resolution "Restart pods" --tags memory,api
# Match current anomalies against library
prophet fingerprint match
# Show fingerprint details
prophet fingerprint show fp_abc123
# Update a fingerprint
prophet fingerprint update fp_abc123 --resolution "Restart pods AND fix query in /api/heavy"
# View library statistics
prophet fingerprint stats
Capturing Fingerprints
Capture the current anomalous state:
# Basic capture
prophet fingerprint capture "DB connection exhaustion"
# With resolution and metadata
prophet fingerprint capture "Cache miss storm" \
--lookback 45 \
--resolution "Increase Redis maxmemory, add cache warming" \
--tags cache,redis,performance \
--severity critical
Options:
- `--lookback N` — Minutes of data to capture (default: 30)
- `--resolution "text"` — What fixed the issue
- `--tags tag1,tag2` — Categorization tags
- `--severity info|warning|critical` — Severity level
Matching Current State
When metrics go anomalous, check for known patterns:
prophet fingerprint match
# Output:
# 🔍 Found 2 matching fingerprint(s)
#
# 🟢 **Memory leak in API service** (87% match)
# Matched: memory_percent, response_time, gc_pause_ms
# Missing: none
# 💡 Suggested: Restart API pods, investigate /api/heavy-query
# Last seen: 12d ago
#
# 🟡 **Cache invalidation cascade** (64% match)
# Matched: memory_percent, response_time
# Missing: cache_hit_rate
# Extra: gc_pause_ms
# 💡 Suggested: Check cache TTL settings
# Last seen: 23d ago
Confidence levels:
- 🟢 high (≥85%)
- 🟡 medium (65-84%)
- 🟠 low (50-64%)
Managing Fingerprints
# Show detailed fingerprint
prophet fingerprint show fp_abc123
# Update metadata
prophet fingerprint update fp_abc123 \
--resolution "Better fix: Add connection pooling" \
--ttr 25 \
--notes "Happens after deployments with DB migrations"
# Delete a fingerprint
prophet fingerprint delete fp_abc123
# Find similar fingerprints (merge candidates)
prophet fingerprint similar fp_abc123
# Merge similar fingerprints
prophet fingerprint merge fp_abc123 fp_def456 --name "API memory issues (merged)"
Feedback Loop
Improve matching accuracy with feedback:
# When a match was helpful
prophet fingerprint feedback fp_abc123 correct
# When a match was wrong
prophet fingerprint feedback fp_abc123 incorrect "Different root cause - was network latency"
Statistics
prophet fingerprint stats
# Output:
# 📊 Fingerprint Library Statistics
#
# Total fingerprints: 24
# Unique metrics tracked: 8
# Total occurrences: 156
# Average TTR: 18.5 minutes
#
# By severity:
# 🔵 info: 5
# 🟡 warning: 14
# 🔴 critical: 5
#
# Most recurring patterns:
# • Memory leak in API: 23 matches
# • DB connection pool exhaustion: 18 matches
# • Cache miss storm: 12 matches
How Similarity Matching Works
1. Metric overlap — Jaccard similarity between affected metrics
2. Direction match — Same pattern type (spike, sustained, etc.)
3. Magnitude similarity — Similar z-scores
4. Pattern hash — Shape of the anomaly curve
5. Time context bonus — Same time of day/week
Weights: Metric overlap (40
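Metric overlap, the first component in the list, is a Jaccard similarity: the number of shared metrics divided by the number of distinct metrics across both sets. The sketch below covers only that component and the matched/missing/extra breakdown shown in the match output; the function names are illustrative, and the other components are not reproduced.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare_metrics(current: set[str], fingerprint: set[str]) -> dict:
    """Overlap score plus the matched / missing / extra metric lists."""
    return {
        "overlap": jaccard(current, fingerprint),
        "matched": sorted(current & fingerprint),
        "missing": sorted(fingerprint - current),  # in the fingerprint, not seen now
        "extra": sorted(current - fingerprint),    # seen now, not in the fingerprint
    }


if __name__ == "__main__":
    now = {"memory_percent", "response_time", "gc_pause_ms"}
    cache_cascade = {"memory_percent", "response_time", "cache_hit_rate"}
    print(compare_metrics(now, cache_cascade))
```

For the two metric sets in the demo, two of four distinct metrics are shared, so the overlap component is 0.5.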
Resolution Playbook
Get a full resolution playbook for a known incident type:
prophet fingerprint show fp_abc123
# Output includes:
# 📋 Resolution Playbook:
# Resolution: Restart API pods, investigate /api/heavy-query
# Expected TTR: 15 minutes
# Occurrences: 23
# Match accuracy: 87% (19 matches)
Integration with Alerts
When Prophet detects anomalies, it automatically checks fingerprints:
# In HEARTBEAT.md or monitoring:
prophet fingerprint match --json | jq '.matches[0].suggested_resolution'
Why It Matters
Reinventing the wheel: "What's causing this? I feel like I've seen this before..."
Pattern recognition: "This is 92
Tribal knowledge: "Ask Bob, he fixed this last time"
Institutional memory: "Prophet remembers every incident and what fixed it"
Slow MTTR: "Let me investigate from scratch..."
Fast MTTR: "Known pattern, known fix, 15-minute resolution expected"
Isolated incidents: "Each incident is a one-off"
Pattern awareness: "This is the 5th occurrence — time to fix the root cause"
Fingerprints turn your incident history into a diagnostic superpower.
Database Schema
Fingerprint data is stored in `prophet.db`:
- `fingerprints` — Main fingerprint records
- `fingerprint_metrics` — Inverted index for fast lookups
- `fingerprint_matches` — Match history for learning
- `pattern_shapes` — Cached pattern hashes for shape matching
Simulation Engine (Gen 6+ Evolution)
War-game scenarios before they happen. Ask "what if?" and get predictions.
The Problem:
- Capacity planning is guesswork: "Will we survive Black Friday?"
- Incident response is reactive: "What happens if we lose a server?"
- Architecture decisions are risky: "Will this scaling strategy work?"
The Solution:
Simulate scenarios using the causality graph and prediction models to forecast cascade effects before they happen.
Quick Start
# List available scenarios
prophet simulate list
# Run a preset scenario
prophet simulate run traffic_2x
prophet simulate run server_loss
prophet simulate run memory_leak
# Quick what-if analysis
prophet simulate what-if memory_percent 95
prophet simulate what-if cpu_percent 90
# Find breaking points
prophet simulate breaking-point cpu_percent
prophet simulate breaking-point memory_percent
# Compare scenarios (which is safer?)
prophet simulate compare traffic_2x server_loss
Preset Scenarios
| Preset | Description | Type |
|---|---|---|
| `traffic_2x` | Traffic doubles over 5 min | Traffic Spike |
| `traffic_5x` | 5x traffic (Black Friday) | Traffic Spike |
| `server_loss` | One server goes down | Resource Loss |
| `scale_up_2x` | Double capacity | Resource Gain |
| `memory_leak` | Gradual memory increase | Degradation |
| `db_saturation` | DB connections hit max | Metric Injection |
| `cascade_test` | Force high memory | Metric Injection |
| `chaos_mild` | Random 10% | |
| `chaos_severe` | Random 30% | |
Simulation Output
============================================================
SIMULATION: Double Traffic
============================================================
🟠 Impact: SIGNIFICANT
📊 Risk Score: 84/100
⏱️ Duration: 60 minutes
🚨 Threshold Breaches:
• response_time: 520.0 (threshold: 500.0)
• request_rate: 10000.0 (threshold: 10000.0)
🔗 Cascade Effects:
cpu_percent → response_time
response_time → error_rate
queue_depth → response_time
📈 SLO Impact:
latency_p99: █████░░░░░ 52%
error_budget: ░░░░░░░░░░ 3%
⚠️ Warnings:
⚠️ Deep cascade detected (3 levels) - failure can propagate quickly
💡 Recommended Mitigations:
1. Pre-scale horizontally before expected traffic increase
2. Enable auto-scaling with appropriate thresholds
3. Add response caching for common requests
Breaking Point Analysis
Find where your system breaks:
prophet simulate breaking-point cpu_percent
# Output:
# 📊 Breaking Point Analysis: cpu_percent
# Baseline value: 42.5
# ⚠️ Breaks at: 2.5x (106.3)
# 📏 Headroom: 150%
Comparing Scenarios
Choose between strategies by comparing simulated outcomes:
prophet simulate compare traffic_2x scale_up_2x chaos_mild
# Output:
# 🟢 SAFEST: Double Capacity
# Risk Score: 15/100
# Impact: minimal
#
# 🔴 RISKIEST: Double Traffic
# Risk Score: 84/100
# Impact: significant
#
# All Results (sorted by risk):
# 1. 🟢 Double Capacity: 15/100 (minimal)
# 2. 🟡 Mild Chaos: 45/100 (moderate)
# 3. 🔴 Double Traffic: 84/100 (significant)
Impact Levels
| Level | Risk Score | Meaning |
|---|---|---|
| 🟢 NONE | 0-10 | No predicted impact |
| 🟢 MINIMAL | 10-25 | Minor deviations, no breaches |
| 🟡 MODERATE | 25-50 | Some metrics elevated |
| 🟠 SIGNIFICANT | 50-70 | Threshold breaches predicted |
| 🔴 SEVERE | 70-85 | Multiple cascading failures |
| 💀 CRITICAL | 85-100 | System-wide impact expected |
How It Works
1. Load Current State — Gets latest metric values and baselines
2. Apply Scenario — Modifies metrics according to scenario definition
3. Propagate Cascades — Uses causality graph to predict downstream effects
4. Check Thresholds — Identifies which metrics breach thresholds
5. Estimate SLO Impact — Predicts probability of SLO violations
6. Generate Mitigations — Recommends actions based on scenario type
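Step 3 can be illustrated with a breadth-first walk over a causality graph. The edges below are the three cascade effects from the sample simulation output; the dictionary layout and `cascade_depths` are assumptions made for this sketch, not Prophet's internal representation.

```python
from collections import deque

# Edges taken from the "Cascade Effects" lines of the sample output.
CAUSAL_EDGES = {
    "cpu_percent": ["response_time"],
    "queue_depth": ["response_time"],
    "response_time": ["error_rate"],
}


def cascade_depths(graph: dict[str, list[str]], source: str) -> dict[str, int]:
    """Return each downstream metric with its hop count from the source metric."""
    depths = {source: 0}
    pending = deque([source])
    while pending:
        node = pending.popleft()
        for child in graph.get(node, []):
            if child not in depths:  # first visit is the shortest path; also stops cycles
                depths[child] = depths[node] + 1
                pending.append(child)
    del depths[source]
    return depths


if __name__ == "__main__":
    print(cascade_depths(CAUSAL_EDGES, "cpu_percent"))
```

A fuller model would presumably also carry a lag and strength per edge, like the values passed to `prophet link`, and scale the predicted effect at each hop.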
Integration with Capacity Planning
# Before expected traffic event
prophet simulate run traffic_5x
# If risk is high, test scaling
prophet simulate run scale_up_2x
prophet simulate compare traffic_5x scale_up_2x
Custom Scenarios
Create custom scenarios via CLI:
prophet simulate custom \
--name "Peak Hour Load" \
--metric cpu_percent 85 \
--metric memory_percent 80 \
--duration 120
Why It Matters
Reactive capacity: "We crashed during Black Friday" → Postmortem
Proactive capacity: "Black Friday simulation shows 84
Hope-based architecture: "This should handle the load" → Guesswork
Evidence-based architecture: "Breaking point is 2.5x, we need headroom" → Data
Single-point thinking: "If we add servers..." → One variable
Cascade awareness: "Adding servers affects CPU, which affects response time, which..." → Full picture
Simulation turns architecture decisions from gut feelings into calculated risks.
---
User Experience Impact (Gen 6 Evolution)
Translates technical metrics into actual user impact. "Your p99 latency is 2.3s" becomes "~4,200 users/hour are likely abandoning checkout."
Concept
Engineers think in metrics. Users think in experience. Business thinks in dollars. User Impact bridges all three:
Technical: response_time = 2500ms
↓
User: "Frustrating - attention wanders"
↓
Business: ~70 abandonments/hr, $4,350/hr revenue loss
Key Insight
Users don't complain about "high latency." They say "this is slow" and leave. By the time you see it in metrics, they're already gone. User Impact quantifies the invisible damage.
Commands
# System-wide impact summary
prophet impact
# List all configured user journeys
prophet impact-journey
# Impact for specific journey
prophet impact-journey checkout
# Impact of a specific metric
prophet impact-metric response_time --value 2500 --journey checkout
# Add or update a user journey
prophet impact-add --name payment \
--stage CONVERSION \
--traffic 500 \
--conversion-rate 0.8 \
--revenue 120 \
--latency-target 300 \
--latency-sensitivity 2.0
# Show impact trend over time
prophet impact-trend
prophet impact-trend checkout --hours 48
User Journeys
Pre-configured journeys map metrics to business context:
| Journey | Stage | Traffic | Target Latency | Revenue/Conversion |
|---|---|---|---|---|
| `homepage` | Discovery | 10k/hr | 300ms | $0 |
| `search` | Discovery | 5k/hr | 200ms | $0 |
| `product_view` | Consideration | 8k/hr | 500ms | $0 |
| `checkout` | Conversion | 1k/hr | 400ms | $75 |
| `api` | Engagement | 50k/hr | 100ms | $0.01 |
| `login` | Engagement | 2k/hr | 300ms | $0 |
Impact Severity Levels
| Level | Icon | Frustration | User Experience |
|---|---|---|---|
| IMPERCEPTIBLE | ✅ | 0-5 | Users don't notice |
| MINOR | 🟢 | 5-15 | Some slight annoyance |
| NOTICEABLE | 🟡 | 15-35 | Many users frustrated |
| SIGNIFICANT | 🟠 | 35-60 | Complaints likely |
| SEVERE | 🔴 | 60-85 | Business-critical |
| CATASTROPHIC | 💀 | 85-100 | Stop everything |
Response Time Psychology
Based on UX research (Nielsen, Google, Akamai):
| Threshold | User Perception |
|---|---|
| < 100ms | Instantaneous - direct control |
| < 300ms | Smooth - unnoticeable delay |
| < 1s | Noticeable - flow maintained |
| < 3s | Frustrating - attention wanders |
| < 8s | Painful - considering abandoning |
| > 10s | Broken - most users leave |
Abandonment Curve
Response time to abandonment probability (research-based):
Response Time → Abandonment
0.5s → 1%
1.0s → 3%
2.0s → 7%
3.0s → 13%
4.0s → 20%
5.0s → 28%
6.0s → 35%
8.0s → 45%
10.0s → 53%
Example Output
╔════════════════════════════════════════════╗
║ 👤 USER EXPERIENCE IMPACT SUMMARY ║
╚════════════════════════════════════════════╝
UX Score: 🟡🟡🟡⚪⚪ 60/100 (Degraded)
Frustration Index: 40/100
👥 Users Affected: ~50,000/hr
🚪 Max Abandonment: 3.0%
💸 Hourly Revenue Impact: $1,894
📊 Projected Daily Loss: $45,465
📈 Predicted UX Score: 55/100
⚠️ Critical in: ~2.0 hours
🔻 Most Affected Journeys:
• checkout: 60/100 health
• search: 66/100 health
• api: 71/100 health
🚨 Critical Issues:
• response_time on checkout: 40 frustration
Metric-Specific Impact
🔴 **checkout** — SEVERE
Metric: response_time = 2500.0
Frustration Score: 70/100
👥 Affected Users: ~1,000/hr
🚪 Abandonment Rate: 7.0%
❌ Est. Abandonments: ~70/hr
💸 Lost Conversions: ~58/hr
💰 Revenue Impact: $4,350/hr
🎯 User Experience: Response: Frustrating - attention wanders
Customizing Journeys
Add your own journeys to match your product:
# High-value subscription flow
prophet impact-add --name subscription_checkout \
--stage CONVERSION \
--traffic 200 \
--conversion-rate 0.4 \
--revenue 299 \
--latency-target 250 \
--latency-sensitivity 2.5 \
--error-sensitivity 3.0 \
--metrics "payment_latency,subscription_api"
Integration with Predictions
User Impact connects to Prophet's predictions:
- If latency is predicted to increase, calculates future user impact
- Shows "time to noticeable" — when users will start complaining
- Combines with fingerprints to show impact of known incident patterns
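A rough sketch of turning a current or predicted latency into abandonments, using the abandonment curve from earlier in this section. The step lookup (taking the highest curve point not above the latency) is an assumption inferred from the metric-specific sample, where 2500 ms maps to 7.0%; linear interpolation would give about 10% instead. The function names are illustrative.

```python
from bisect import bisect_right

# (response time in seconds, abandonment probability) from the abandonment curve.
ABANDONMENT_CURVE = [
    (0.5, 0.01), (1.0, 0.03), (2.0, 0.07), (3.0, 0.13), (4.0, 0.20),
    (5.0, 0.28), (6.0, 0.35), (8.0, 0.45), (10.0, 0.53),
]
_TIMES = [seconds for seconds, _ in ABANDONMENT_CURVE]


def abandonment_rate(latency_s: float) -> float:
    """Step lookup: rate of the highest curve point not above latency_s."""
    idx = bisect_right(_TIMES, latency_s)
    return ABANDONMENT_CURVE[idx - 1][1] if idx else 0.0  # below 0.5s: assumed 0


def abandonments_per_hour(latency_ms: float, users_per_hour: float) -> float:
    """Estimated users abandoning per hour at the given latency."""
    return users_per_hour * abandonment_rate(latency_ms / 1000)


if __name__ == "__main__":
    # Checkout journey: ~1,000 users/hr at 2500 ms
    print(round(abandonments_per_hour(2500, 1000)))
```

Feeding a forecast latency instead of the current one gives the future impact described in the first bullet.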
Why It Matters
Technical alerting: "Response time crossed 2s threshold" → Engineer context
User impact alerting: "70 users abandoning checkout/hour, $4,350/hr loss" → Everyone understands
Metric overload: "We have 50 metrics in alert state" → What matters?
Impact prioritization: "Checkout at 70 frustration, homepage at 15" → Clear priority
Reactive support: "Users are complaining" → Already too late
Proactive protection: "UX score dropping, critical in 2 hours" → Time to act
Vague post-mortems: "There was degraded performance" → Hand-wavy
Quantified impact: "2-hour incident = 5,600 abandonments, $87,000 revenue impact" → Concrete
User Impact transforms abstract metrics into human stories and business reality.
Incident Memory (Gen 6 Evolution)
"Those who cannot remember the past are condemned to repeat it." — George Santayana
Most production incidents aren't novel — they're variations of past issues. Incident Memory gives Prophet a long-term memory of past incidents, enabling it to recognize when similar conditions are emerging and surface historical context + successful remediations.
Key Insight
When you see "response_time climbing, memory_percent at 82
Incident Memory answers: "This looks like the 2024-01-15 database connection exhaustion incident. Last time, increasing pool size to 100 fixed it in 12 minutes."
Core Features
| Feature | Description |
|---|---|
| Incident Recording | Store incidents with full context (metrics, timing, root cause, resolution) |
| Signature Matching | Multi-dimensional similarity scoring finds past incidents matching current conditions |
| Learning Extraction | Automatically extracts root causes, remediations, and lessons from resolved incidents |
| Recurrence Detection | Identifies incidents that happen periodically and predicts next occurrence |
| Confidence Weighting | Recent incidents with clear outcomes get higher confidence |
CLI Commands
# View memory statistics
prophet memory stats
# Recent incidents
prophet memory recent [limit]
# Search incidents by keyword
prophet memory search "database"
# Record a new incident (interactive)
prophet memory record
# Resolve an incident
prophet memory resolve <incident_id> "Increased connection pool"
# Check for predicted incidents based on current state
prophet memory predict
# View learnings (root causes, remediations, preventions)
prophet memory learnings [incident_id]
# Add a learning to an incident
prophet memory add-learning <incident_id> remediation "Restart the worker pods"
# View recurring patterns
prophet memory recurring
# Auto-record from current breached metrics
prophet memory-auto --title "API slowdown"
Incident Categories
| Category | Description |
|---|---|
| `latency` | Response time / slowdowns |
| `availability` | Outages, errors, failures |
| `saturation` | Resource exhaustion (CPU, memory, connections) |
| `traffic` | Load spikes, DDoS |
| `deployment` | Deploy-induced issues |
| `external` | Third-party failures |
| `cascade` | Multi-service failures |
Severity Levels (PagerDuty-style)
| Level | Meaning |
|---|---|
| `sev1` | Critical: customer-facing, major impact |
| `sev2` | Major: significant degradation |
| `sev3` | Minor: limited impact |
| `sev4` | Informational: no user impact |
| `sev5` | Noise: not really an incident |
How Similarity Matching Works
Incident Memory uses multi-dimensional matching:
1. Primary Metrics (35%)
2. Pattern Shapes (25%)
3. Threshold Values (15%)
4. Temporal Context (10%)
5. Root Causes (15%)
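The five dimensions carry the weights 35, 25, 15, 10, and 15, which sum to 100. Assuming they are percentages applied to per-dimension similarities in the range 0 to 1, the overall match is a weighted sum. The dictionary keys and `incident_similarity` are illustrative names, and how each per-dimension similarity is computed is outside this sketch.

```python
# Weights from the list above, expressed as fractions (assumed to be percentages).
DIMENSION_WEIGHTS = {
    "primary_metrics": 0.35,
    "pattern_shapes": 0.25,
    "threshold_values": 0.15,
    "temporal_context": 0.10,
    "root_causes": 0.15,
}


def incident_similarity(components: dict[str, float]) -> float:
    """Weighted sum of per-dimension similarities; missing dimensions count as 0."""
    for name, value in components.items():
        if name not in DIMENSION_WEIGHTS:
            raise KeyError(f"unknown dimension: {name}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return sum(
        weight * components.get(name, 0.0)
        for name, weight in DIMENSION_WEIGHTS.items()
    )


if __name__ == "__main__":
    score = incident_similarity({
        "primary_metrics": 1.0,
        "pattern_shapes": 0.8,
        "threshold_values": 0.6,
        "temporal_context": 1.0,
        "root_causes": 0.5,
    })
    print(f"{score:.0%} match")
```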
Recurrence Detection
After 2+ similar incidents, Memory detects recurring patterns:
⚠️ Memory exhaustion during batch processing
Type: daily (every 24.0h)
Next predicted: 2024-02-03 03:00:00
Confidence: 78%
Learning Types
| Type | Example |
|---|---|
| `root_cause` | "Database connection pool exhausted" |
| `remediation` | "Increased pool size to 100" |
| `prevention` | "Add connection pool monitoring" |
| `lesson` | "Always check connection usage before deploying" |
Remediations track effectiveness scores — how quickly did they resolve the issue?
Example Prediction
⚠️ INCIDENT PREDICTION
Category: saturation
Severity: sev2
Probability: 72%
Warning Signs:
• Similar to Database connection exhaustion: metrics: memory_percent, db_connections
• memory_percent pattern: gradual_rise
• Recurring pattern: occurs daily, next expected 2024-02-03 03:00
Similar Past Incidents:
• Database connection exhaustion (78% match)
• Memory pressure incident (62% match)
Suggested Actions:
✓ Increase connection pool size before peak
✓ Pre-emptive: Restart worker pods
✓ Enable connection pool monitoring
📚 LEARNINGS FROM SIMILAR INCIDENTS
Root Causes (most common):
→ Connection pool exhausted under load (seen 5x)
→ Batch job holding connections (seen 3x)
Effective Remediations:
✓ Increase pool size to 100 (85% effective)
✓ Add connection timeout (78% effective)
Prevention Measures:
🛡 Monitor connection pool utilization
🛡 Set alerts at 70% pool usage
Integration with Other Prophet Features
- Fingerprints: Signatures incorporate fingerprint patterns for matching
- Causality: Root cause metrics from causality discovery feed into signatures
- User Impact: Incidents record user impact scores for prioritization
- Remediation: Learnings feed into remediation recommendations
- Adaptive Thresholds: Recurring patterns inform threshold adjustments
Why It Matters
Without Memory:
- "This is slow" → Investigate from scratch
- No context on past resolutions
- Same issues repeat without recognition
With Memory:
- "This is slow" → "Last 3 times this happened, it was connection pool exhaustion"
- Historical remediations surface immediately
- Recurring patterns get predicted before they happen
Post-Mortem Value:
- Forced documentation of incidents during resolution
- Learnings accumulate over time
- New team members can search past incidents for context
Prevention Focus:
- Recurring patterns trigger proactive alerts
- Prevention measures surface before incidents
- Time-of-day patterns enable scheduled mitigation
Incident Memory transforms Prophet from a predictor to a learning system that gets smarter with every incident.
Confidence Calibration (Gen 6 Evolution)
"Trust, but verify" — Predictions are only useful if confidence scores are accurate.
When Prophet says "85
The Problem
Uncalibrated predictions erode trust:
- Overconfident: Says 90
- Underconfident: Says 50
Calibration ensures confidence = actual accuracy.
CLI Commands
# Show overall calibration statistics
prophet calibration stats
# Output:
# ╔════════════════════════════════════════════════════════════╗
# ║ 📊 CALIBRATION STATISTICS ║
# ╚════════════════════════════════════════════════════════════╝
#
# 📝 Total Predictions Tracked: 247
# ⏳ Pending Evaluation: 12
# ✅ Evaluated: 235
#
# 🎯 Overall Accuracy: 78.3%
# Correct: 184 | Incorrect: 51
# 📊 Average Confidence: 76.1%
# 📈 Recent 7-Day Accuracy: 81.2%
# Detailed calibration report
prophet calibration report
# Output:
# ╔════════════════════════════════════════════════════════════╗
# ║ 🎯 CONFIDENCE CALIBRATION REPORT ║
# ╚════════════════════════════════════════════════════════════╝
#
# ┌─────────────────────────────────────────────────────────┐
# │ CALIBRATION BY CONFIDENCE BUCKET │
# ├────────────┬───────┬─────────┬──────────┬───────────────┤
# │ Confidence │ Count │ Correct │ Accuracy │ Calibration │
# ├────────────┼───────┼─────────┼──────────┼───────────────┤
# │ 0%-10% │ 5 │ 0 │ 0% │ 🎯 Excellent │
# │ 10%-20% │ 8 │ 2 │ 25% │ ✅ Good │
# │ 20%-30% │ 12 │ 3 │ 25% │ ⚠️ Fair │
# │ 30%-40% │ 18 │ 8 │ 44% │ ✅ Good │
# │ 40%-50% │ 22 │ 11 │ 50% │ 🎯 Excellent │
# │ 50%-60% │ 35 │ 21 │ 60% │ ✅ Good │
# │ 60%-70% │ 42 │ 27 │ 64% │ ⚠️ Fair │
# │ 70%-80% │ 48 │ 39 │ 81% │ ✅ Good │
# │ 80%-90% │ 31 │ 26 │ 84% │ 🎯 Excellent │
# │ 90%-100% │ 14 │ 13 │ 93% │ 🎯 Excellent │
# └────────────┴───────┴─────────┴──────────┴───────────────┘
# Filter by prediction type
prophet calibration report --type threshold_breach --days 60
# Show predictions awaiting outcome evaluation
prophet calibration pending
# Manually record an outcome
prophet calibration resolve pred_20260202_abc123 correct 512.5 "Threshold was breached as predicted"
# Update adjustment factors (recalculate from outcomes)
prophet calibration update
# Show reliability by prediction type
prophet calibration reliability
# Track a prediction manually (for testing)
prophet calibration track threshold_breach response_time 0.85 30
Prediction Types Tracked
| Type | What It Predicts | Evaluation Method |
|---|---|---|
| `threshold_breach` | Will metric cross threshold? | Compare to actual value |
| `anomaly` | Is this an anomaly? | Manual confirmation |
| `trend_direction` | Will metric go up/down? | Compare direction |
| `cascade` | Will one metric cause another to degrade? | Check both metrics |
| `ttr` | Time-to-recovery estimate | Compare to actual duration |
| `similar_incident` | Pattern match to past incident | Confirm root cause match |
| `capacity` | Will we run out of capacity? | Check capacity state |
| `deployment_risk` | Will deployment cause issues? | Post-deployment review |
Calibration Metrics
Expected Calibration Error (ECE):
- Average of |confidence - accuracy| across buckets, weighted by each bucket's share of predictions
- Lower is better (0 = perfect calibration)
- < 0.05 Excellent | < 0.10 Good | < 0.20 Fair | ≥ 0.20 Poor
Maximum Calibration Error (MCE):
- Worst bucket's calibration error
- Shows where predictions are most unreliable
Brier Score:
- Mean squared error between the predicted confidence and the actual outcome (1 = correct, 0 = incorrect)
- Lower is better (0 = perfect)
- Combines calibration AND accuracy
Adjustment Factor:
- Multiplier to apply to raw confidence
- < 1.0 = overconfident, reduce confidence
- > 1.0 = underconfident, increase confidence
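The three error metrics above can be sketched in a few lines. This is a minimal illustration, not Prophet's implementation: the function name `calibration_metrics`, the `(confidence, correct)` record format, and the ten equal-width buckets are assumptions chosen to mirror the report layout.

```python
from typing import List, Tuple


def calibration_metrics(records: List[Tuple[float, bool]], n_buckets: int = 10) -> dict:
    """Compute ECE, MCE and Brier score.

    records: (confidence in [0, 1], whether the prediction turned out correct).
    The record format and bucket count are illustrative assumptions.
    """
    if not records:
        raise ValueError("need at least one evaluated prediction")

    buckets = [[] for _ in range(n_buckets)]
    for conf, correct in records:
        # A confidence of exactly 1.0 is placed in the last bucket.
        idx = min(int(conf * n_buckets), n_buckets - 1)
        buckets[idx].append((conf, correct))

    total = len(records)
    ece, mce = 0.0, 0.0
    for bucket in buckets:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        gap = abs(avg_conf - accuracy)
        ece += gap * len(bucket) / total  # weight by the bucket's share
        mce = max(mce, gap)               # worst bucket

    brier = sum((c - (1.0 if ok else 0.0)) ** 2 for c, ok in records) / total
    return {"ece": ece, "mce": mce, "brier": brier}
```

A caller would pass every evaluated prediction, for example `calibration_metrics([(0.8, True), (0.8, False)])`, and compare the returned `ece` against the Excellent/Good/Fair/Poor bands listed above.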
Automatic Integration
Calibration integrates automatically with other Prophet modules:
from calibration import CalibrationEngine, CalibratedPredictor, PredictionType
# Wrap any predictor with calibration
engine = CalibrationEngine()
predictor = CalibratedPredictor(engine)
# Make a calibrated prediction
# Raw confidence 0.85 might become 0.78 based on historical accuracy
adjusted_conf = predictor.predict_threshold_breach(
    metric="response_time",
    raw_confidence=0.85,
    horizon_minutes=30,
    threshold=500
)
# Prediction is automatically tracked for later evaluation
Auto-Evaluation
For threshold and trend predictions, calibration can automatically evaluate outcomes:
from calibration import CalibrationEngine
engine = CalibrationEngine()
# Auto-evaluate threshold breach predictions
def get_metric_value(metric):
    # current_metrics: dict of the latest sample per metric, supplied by the caller
    return current_metrics[metric]
engine.auto_evaluate_threshold_predictions(get_metric_value)
# Auto-evaluate trend predictions
def get_metric_history(metric, hours):
    # Assumes historical_values holds one sample per minute
    return historical_values[metric][-hours * 60:]
engine.auto_evaluate_trend_predictions(get_metric_history)
Why Calibration Matters
Before Calibration:
- Prophet: "90% confident"
- Reality: Actually right 65% of the time
- Result: Team ignores high-confidence alerts
After Calibration:
- Prophet: "65% confident"
- Reality: Actually right 65% of the time
- Result: Confidence scores you can trust
The Feedback Loop:
1. Make prediction with confidence
2. Track outcome (correct/incorrect)
3. Measure calibration over time
4. Adjust future confidence to match accuracy
5. Trust improves, decisions improve
Well-calibrated predictions let you make better decisions about when to act on alerts, when to ignore them, and how much effort to invest in prevention.
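Step 4 of the feedback loop can be illustrated with a simple multiplicative adjustment. This is a sketch under stated assumptions: the source describes the adjustment factor as a multiplier on raw confidence but does not say how it is derived, so the ratio of observed accuracy to mean stated confidence used here, the clamp bounds, and both function names are hypothetical.

```python
def adjustment_factor(confidences, outcomes, lo=0.5, hi=1.5):
    """Ratio of observed accuracy to mean stated confidence, clamped to [lo, hi].

    A value below 1.0 means past predictions were overconfident.
    The formula and clamp bounds are assumptions for illustration.
    """
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must be non-empty and equal length")
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(1 for ok in outcomes if ok) / len(outcomes)
    if mean_conf == 0:
        return 1.0  # nothing to scale
    return max(lo, min(hi, accuracy / mean_conf))


def calibrate(raw_confidence, factor):
    """Apply the factor and keep the result a valid probability."""
    return max(0.0, min(1.0, raw_confidence * factor))
```

With twenty past predictions stated at 0.9 of which thirteen were correct, the factor is about 0.72, so a new raw confidence of 0.9 is reported as roughly 0.65, matching the before/after example above.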
Ensemble Prediction (Gen 6 Evolution)
Single prediction models fail in specific scenarios. Ensemble prediction combines multiple models for more robust forecasting — averaging out individual model errors and using disagreement to estimate uncertainty.
Why Ensemble?
| Single Model | Problem |
|---|---|
| Holt-Winters | Struggles without seasonality |
| ARIMA | Sensitive to trend breaks |
| Linear | Misses non-linear patterns |
Ensemble Solution: Combine all models with adaptive weights — each model's strengths compensate for others' weaknesses.
Models Combined
| Model | Best For | Weight Learning |
|---|---|---|
| Holt-Winters | Seasonal data (daily/weekly cycles) | ↑ if seasonal pattern detected |
| ARIMA-lite | Trending data with autocorrelation | ↑ if strong trend |
| Threshold Projection | Clear linear trends | ↑ if high R² |
| Percentile | Bounded metrics, known distributions | ↑ for stable metrics |
| Seasonal Naive | Strongly seasonal (same as last cycle) | ↑ if high autocorrelation at season lag |
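The two simplest rows of the table can be sketched directly. This is an illustrative reading rather than Prophet's code: `seasonal_naive`, `linear_fit`, and `steps_to_threshold` are hypothetical names, and the projection assumes an upper threshold and evenly spaced samples.

```python
from typing import List, Optional, Tuple


def seasonal_naive(values: List[float], season_length: int, horizon: int) -> List[float]:
    """Forecast by repeating the most recent full season."""
    if season_length <= 0 or len(values) < season_length:
        raise ValueError("need at least one full season of data")
    last_season = values[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]


def linear_fit(values: List[float]) -> Tuple[float, float, float]:
    """Least-squares line over x = 0..n-1. Returns (slope, intercept, r_squared)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    syy = sum((y - y_mean) ** 2 for y in values)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    r_squared = 0.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


def steps_to_threshold(values: List[float], threshold: float) -> Optional[float]:
    """Steps from the latest sample until the fitted line reaches the threshold.

    Returns 0.0 if the fit is already at or above it, None if not rising.
    """
    slope, intercept, _ = linear_fit(values)
    latest = intercept + slope * (len(values) - 1)
    if latest >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - latest) / slope
```

A high `r_squared` is the kind of signal the "↑ if high R²" column refers to; how Prophet turns it into a weight is not specified in this document.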
CLI Commands
# Make ensemble prediction
prophet ensemble predict cpu_percent
prophet ensemble predict response_time --horizon 60
# Show model performance statistics
prophet ensemble stats
prophet ensemble stats response_time # For specific metric
# Show learned weights for metric
prophet ensemble weights cpu_percent
# Update weights from historical accuracy
prophet ensemble update-weights memory_percent
# Analyze when/why models disagree
prophet ensemble disagreement error_rate
# Predict threshold breach probability
prophet ensemble breach memory_percent 85
prophet ensemble breach cpu_percent 90 --horizon 30
# Show metric's pattern classification
prophet ensemble pattern response_time
Example Output
╔════════════════════════════════════════════════════════════╗
║ 🎯 ENSEMBLE PREDICTION ║
╚════════════════════════════════════════════════════════════╝
Metric: cpu_percent
Current: 68.42
Horizon: 30 steps
Predicted: 72.15
Confidence: 78.3%
Uncertainty: 15.2%
Agreement: 84.8%
Dominant: holt_winters
┌─────────────────────┬────────────┬────────────┬──────────┐
│ Model │ Prediction │ Confidence │ Weight │
├─────────────────────┼────────────┼────────────┼──────────┤
│ holt_winters │ 72.34 │ 82.5% │ 35.2% │
│ arima_lite │ 71.89 │ 76.1% │ 28.1% │
│ threshold_proj │ 73.12 │ 71.3% │ 18.5% │
│ percentile │ 71.45 │ 68.9% │ 12.1% │
│ seasonal_naive │ 72.00 │ 55.0% │ 6.1% │
└─────────────────────┴────────────┴────────────┴──────────┘
✅ High agreement — models converge on similar prediction
Weight Learning
Weights adapt based on historical accuracy:
# After enough predictions with recorded outcomes...
prophet ensemble update-weights cpu_percent
✅ Updated weights for cpu_percent:
holt_winters: 38.2% # Best performer
arima_lite: 26.1%
threshold_proj: 17.4%
percentile: 11.8%
seasonal_naive: 6.5% # Worst performer
Learning algorithm: inverse MSE weighting. Models with lower mean squared error get higher weights.
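Inverse MSE weighting can be sketched as follows. The function name, the input shape (a dict of per-model error lists), and the small `epsilon` guard against a zero MSE are assumptions for illustration.

```python
def inverse_mse_weights(errors_by_model, epsilon=1e-9):
    """Turn per-model forecast errors into normalised weights.

    errors_by_model: model name -> list of (prediction - actual) errors.
    Models with no recorded errors are skipped.
    """
    inverse = {}
    for model, errors in errors_by_model.items():
        if not errors:
            continue
        mse = sum(e * e for e in errors) / len(errors)
        inverse[model] = 1.0 / (mse + epsilon)  # epsilon avoids division by zero

    total = sum(inverse.values())
    if total == 0:
        raise ValueError("no model has recorded outcomes yet")
    return {model: value / total for model, value in inverse.items()}
```

A model with a quarter of another model's MSE ends up with four times its weight, and the weights always sum to 1.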
Uncertainty from Disagreement
When models disagree, that's valuable information:
Agreement: 42.3% ❌ Low agreement — significant uncertainty
💡 Models diverging suggests:
- Regime change in data
- Unusual pattern not seen before
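- Higher risk of prediction error
Agreement → Uncertainty mapping:
- Agreement ≥ 80%: high agreement, low uncertainty
- Agreement 50-80%: moderate agreement, moderate uncertainty
- Agreement < 50%: low agreement, significant uncertainty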
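One way to combine model outputs and derive an agreement score is sketched below. The 0.8 and 0.5 cut-offs come from the mapping above; everything else is an assumption: this document does not give Prophet's agreement formula, so the "1 minus coefficient of variation" definition and the low/moderate/high labels are hypothetical and will not reproduce the percentages in the example output.

```python
import statistics


def combine(predictions, weights):
    """Weighted mean of per-model predictions.

    Weights are renormalised over the models actually present.
    """
    total_weight = sum(weights[m] for m in predictions)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(predictions[m] * weights[m] for m in predictions) / total_weight


def agreement_score(predictions):
    """1 - coefficient of variation, clamped to [0, 1] (hypothetical definition)."""
    values = list(predictions.values())
    if len(values) < 2:
        return 1.0
    mean = statistics.fmean(values)
    if mean == 0:
        return 0.0
    cv = statistics.pstdev(values) / abs(mean)
    return max(0.0, min(1.0, 1.0 - cv))


def uncertainty_level(agreement):
    """Map an agreement score in [0, 1] to a coarse uncertainty label."""
    if agreement >= 0.8:
        return "low"
    if agreement >= 0.5:
        return "moderate"
    return "high"
```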
Threshold Breach Prediction
Specialized ensemble for answering "Will we breach the threshold?"
prophet ensemble breach memory_percent 85
╔════════════════════════════════════════════════════════════╗
║ 🚨 THRESHOLD BREACH PREDICTION ║
╚════════════════════════════════════════════════════════════╝
Metric: memory_percent
Threshold: 85.0
Current: 78.23
Predicted: 82.45
Horizon: 30 steps
🟡 Breach Probability: 45.2% (LOW RISK)
Confidence: 72.1%
Agreement: 81.3%
Time to Breach: ~42.3 steps
Pattern Classification
The ensemble automatically classifies metric patterns to boost specialist models:
| Pattern | Description | Boosted Models |
|---|---|---|
| `seasonal` | Clear daily/weekly cycles | Holt-Winters, Seasonal Naive |
| `trending` | Consistent upward/downward | ARIMA-lite, Threshold Proj |
| `volatile` | High variance, irregular | Isolation Forest, Percentile |
| `stable` | Low variance, predictable | Percentile, Seasonal Naive |
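A rule-based classifier for the four patterns in the table might look like the sketch below. The features (trend R², autocorrelation at the season lag, coefficient of variation), the order in which rules are checked, and all three thresholds are assumptions; the document does not state how Prophet decides.

```python
import statistics


def autocorrelation(values, lag):
    """Sample autocorrelation at the given lag (0.0 if undefined)."""
    n = len(values)
    if lag <= 0 or lag >= n:
        return 0.0
    mean = statistics.fmean(values)
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        return 0.0
    num = sum((values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag))
    return num / denom


def trend_r_squared(values):
    """R² of a least-squares line fitted over x = 0..n-1."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(values)
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    syy = sum((y - y_mean) ** 2 for y in values)
    return 0.0 if syy == 0 else (sxy * sxy) / (sxx * syy)


def classify_pattern(values, season_length, trend_r2=0.7, seasonal_acf=0.6, volatile_cv=0.3):
    """Return 'trending', 'seasonal', 'volatile' or 'stable' (thresholds are illustrative)."""
    if len(values) < 3:
        raise ValueError("need at least three samples")
    if trend_r_squared(values) >= trend_r2:
        return "trending"
    if autocorrelation(values, season_length) >= seasonal_acf:
        return "seasonal"
    spread = statistics.pstdev(values)
    if spread == 0:
        return "stable"
    mean = statistics.fmean(values)
    cv = spread / abs(mean) if mean else float("inf")
    return "volatile" if cv >= volatile_cv else "stable"
```

Checking the trend first means a seasonal series with a strong drift is labeled `trending`; a real classifier would likely detrend before testing for seasonality.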
Python Integration
# PredictionType is used below; importing it from the ensemble module is an assumption
from ensemble import EnsemblePredictor, ThresholdBreachPredictor, PredictionType
# Basic ensemble prediction
ensemble = EnsemblePredictor()
prediction = ensemble.predict(
    metric="response_time",
    values=[100, 105, 110, 108, 115, 120],
    horizon=10
)
print(f"Predicted: {prediction.final_value}")
print(f"Confidence: {prediction.confidence:.1%}")
print(f"Agreement: {prediction.agreement_score:.1%}")
# Threshold breach prediction
# memory_values (recent samples) and alert() are supplied by the caller
breach_predictor = ThresholdBreachPredictor(ensemble)
result = breach_predictor.predict_breach(
    metric="memory_percent",
    values=memory_values,
    threshold=85.0,
    horizon_minutes=30
)
if result['breach_probability'] > 0.7:
    alert("High probability of memory breach!")
# Record outcomes for weight learning
ensemble.record_outcome(
    metric="response_time",
    prediction_type=PredictionType.VALUE_FORECAST,
    model_predictions=prediction.model_predictions,
    actual_value=118.5  # What actually happened
)
# Update weights periodically
ensemble.update_weights("response_time", PredictionType.VALUE_FORECAST)
Philosophy
No single model is best for everything.
- Holt-Winters excels at seasonal data but struggles without clear cycles
- ARIMA captures trends but is sensitive to regime changes
- Simple linear projection is surprisingly robust for short horizons
- Percentile-based prediction works great for bounded metrics
The ensemble learns which models work best for YOUR specific metrics and adapts over time. Disagreement between models becomes a feature, not a bug — it tells you when to be more cautious.
Result: More robust predictions that improve as they learn from outcomes.
Unified Health Score & Auto-Report (Gen 6 Evolution)
Instead of checking 16 different modules, get ONE number (0-100) and a human-readable story.
The Problem
Performance Prophet has grown to 16+ modules: predictions, thresholds, calibration, fingerprints, incidents, capacity, causality, remediation, and more. Each has its own dashboard. No single view answers: "How is my system doing RIGHT NOW?"
The Solution
A unified health score that:
1. Queries all modules and calculates a weighted score (0-100)
2. Grades your system (A+ through F)
3. Generates human-readable narratives explaining what's happening
4. Produces auto-reports (hourly/daily/weekly) with trends
5. Tracks score history with sparkline visualization
Health Dimensions
| Dimension | Weight | What It Measures |
|---|---|---|
| Prediction | 25% | |
| Threshold | 20% | |
| Capacity | 15% | |
| Pattern | 15% | |
| Incident | 15% | |
| Calibration | 10% | |
Grade Scale
| Grade | Score | Meaning |
|---|---|---|
| A+ | 97-100 | Exceptional — everything running perfectly |
| A | 93-96 | Excellent — minor things to watch |
| A- | 90-92 | Very good |
| B+/B/B- | 80-89 | Good but some areas need attention |
| C+/C/C- | 70-79 | Average — multiple concerns |
| D | 60-69 | Poor — action needed |
| F | 0-59 | Critical — immediate intervention |
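The weighted score and grade lookup can be sketched as follows. The weights are read as percentages that sum to 100, and the lower bound of each grade band is used as its cut-off; the names `WEIGHTS`, `GRADE_BANDS`, `overall_score`, and `grade` are illustrative, and the combined B and C bands are kept as single labels because the table does not split them.

```python
# Dimension weights from the Health Dimensions table, read as percentages.
WEIGHTS = {
    "prediction": 25,
    "threshold": 20,
    "capacity": 15,
    "pattern": 15,
    "incident": 15,
    "calibration": 10,
}

# (lower bound, label) pairs from the Grade Scale table, highest first.
GRADE_BANDS = [
    (97, "A+"),
    (93, "A"),
    (90, "A-"),
    (80, "B+/B/B-"),
    (70, "C+/C/C-"),
    (60, "D"),
    (0, "F"),
]


def overall_score(dimension_scores):
    """Weighted average of per-dimension scores (each 0-100)."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
    return weighted / total_weight


def grade(score):
    """Return the label of the first band whose lower bound the score meets."""
    for lower, label in GRADE_BANDS:
        if score >= lower:
            return label
    return "F"
```

Feeding in the dimension scores from this section's example output (calibration 75, pattern 92, incident 95, the other three at 100) gives 95.55, which lands in the A band and agrees with the 95.5/100 (A) shown there.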
CLI Commands
# Current health score
prophet health
# Health score with JSON output
prophet health --json
# Health score trend (sparkline)
prophet health-history
prophet health-history --hours 48
# Auto-generated reports
prophet report # Daily report (default)
prophet report hourly # Last hour
prophet report daily # Last 24 hours
prophet report weekly # Last 7 days
prophet report daily --json # JSON output
Example Output
✨ System Health: 95.5/100 (A)
Trend: 📈 Improving
🟡 calibration 75.0/100 — Building calibration data (6 predictions tracked)
🟢 pattern 92.0/100 — No anomalous patterns detected
🟢 incident 95.0/100 — No incidents in the last 24 hours
🟢 prediction 100.0/100 — All 6 metrics within expected bounds
🟢 threshold 100.0/100 — All 3 adaptive thresholds healthy
🟢 capacity 100.0/100 — Healthy capacity: 100% avg headroom
Recommendations:
🟡 Monitor prediction accuracy — slight drift detected
System is running like a well-oiled machine. ✨
Report Example
══════════════════════════════════════════════════
PERFORMANCE PROPHET — DAILY REPORT
Generated: 2026-02-02 23:29
══════════════════════════════════════════════════
✨ Overall Health: 95.5/100 (A)
📈 Period: avg 90 | low 78 | high 96
📈 Trend: improving
DIMENSIONS:
──────────────────────────────────────────────
calibration [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓·····] 75.0
pattern [██████████████████··] 92.0
incident [███████████████████·] 95.0
prediction [████████████████████] 100.0
threshold [████████████████████] 100.0
capacity [████████████████████] 100.0
✅ IMPROVEMENTS:
• threshold improved by 30 points
• capacity improved by 30 points
📋 RECOMMENDATIONS:
🟡 Monitor prediction accuracy — slight drift detected
══════════════════════════════════════════════════
Integration with Heartbeat
Health score can be checked during Clawdbot heartbeats:
from health_score import HealthScoreEngine
engine = HealthScoreEngine()
snapshot = engine.calculate_snapshot()
if snapshot.overall_score < 70:
    # Alert: system health degraded
    notify(snapshot.narrative)
Architecture
┌─────────────────────────────────────────────────────────┐
│ All Prophet Modules │
│ Prophet │ Thresholds │ Calibration │ Fingerprints │ ... │
└────┬─────┴─────┬──────┴──────┬──────┴──────┬───────┴────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────┐
│ Health Score Engine │
│ Weighted combination → single 0-100 score │
│ Grade assignment (A+ through F) │
│ Trend detection (improving/stable/degrading/volatile) │
└────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Narrative Generator │
│ Human-readable stories from numbers │
│ Top concerns + strengths + recommendations │
└────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Auto-Report Engine │
│ Hourly / Daily / Weekly reports │
│ Score history + sparklines + trend analysis │
│ Period comparisons + notable events │
└─────────────────────────────────────────────────────────┘
Philosophy
One number. One story. One glance.
Dashboards with 50 charts cause alert fatigue. A single health score with a narrative explanation lets you know instantly: "Do I need to pay attention right now?" If yes, the dimensions tell you where. If no, move on with your day.
The score degrades gracefully — if a module is unavailable, it falls back to neutral (70) rather than breaking. This means the health score is always available, even during partial outages.
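The neutral fallback described above might be implemented along these lines. This is a minimal sketch: `safe_dimension_score` and the idea of passing each module's scorer as a callable are assumptions, while the neutral value of 70 comes from the text.

```python
NEUTRAL_SCORE = 70.0  # fallback stated in the text for unavailable modules


def safe_dimension_score(fetch_score):
    """Call a module's score function; fall back to neutral if it fails.

    fetch_score: zero-argument callable returning a 0-100 score (hypothetical interface).
    """
    try:
        score = float(fetch_score())
    except Exception:
        # Module unavailable or returned something unusable: stay neutral.
        return NEUTRAL_SCORE
    return max(0.0, min(100.0, score))  # keep the score in range
```

Catching every exception is deliberate here, since the goal is that one broken module never prevents the overall score from being produced.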
Model Drift Detection (Gen 6 Evolution)
The Problem
All Prophet models learn from historical patterns. But systems change: new deployments, infrastructure migrations, traffic pattern shifts, seasonal evolution. Without drift detection, Prophet gives confidently wrong predictions.
The Solution
A meta-intelligence layer that monitors the monitors. Answers the question: "Can we still trust our predictions?"
Drift Types Detected
| Type | What It Detects | Method |
|---|---|---|
| Covariate | Input distributions changed | KS test + Page-Hinkley |
| Performance | Prediction accuracy degraded | ADWIN + error tracking |
| Concept | Metric relationships shifted | Fisher z-transform on correlations |
| Regime | Operational mode changed | Rule-based multi-signal classification |
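The concept-drift row compares a correlation measured in two time windows. A minimal sketch of the standard Fisher z-test for two independent Pearson correlations follows; the function name and the idea of comparing a "baseline" window against a "recent" window are illustrative assumptions about how it would be applied here.

```python
import math


def fisher_z_test(r1, n1, r2, n2):
    """Two-sided test that two Pearson correlations differ.

    r1, r2: correlations from two independent windows.
    n1, n2: number of paired samples in each window.
    Returns (z, p_value).
    """
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each window needs more than 3 samples")
    if not (-1 < r1 < 1 and -1 < r2 < 1):
        raise ValueError("correlations must be strictly between -1 and 1")
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher z-transform
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value
```

For example, a correlation that drops from 0.85 over 150 baseline samples to 0.30 over 150 recent samples yields a very small p-value, which is the kind of shift a concept drift signal would flag.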
Statistical Tests
- Page-Hinkley — Sequential change detection (O(1) memory, real-time)
- ADWIN — Adaptive windowing for concept drift detection
- Kolmogorov-Smirnov — Non-parametric distribution comparison
- CUSUM — Cumulative sum control charts for mean shifts
- Welch's t-test — Window comparison significance testing
- Fisher z-transform — Correlation comparison across time periods
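Of the tests listed, Page-Hinkley is the easiest to show in full because it keeps only a handful of running values. The sketch below detects upward mean shifts; the class name and the default `delta`, `threshold`, and `min_samples` values are illustrative assumptions, not Prophet's settings.

```python
class PageHinkley:
    """Sequential detector for an upward shift in the mean, using O(1) memory."""

    def __init__(self, delta=0.005, threshold=50.0, min_samples=30):
        self.delta = delta              # tolerated drift per sample
        self.threshold = threshold      # alarm level (lambda)
        self.min_samples = min_samples  # warm-up before alarms are allowed
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0
        self.cumulative_min = 0.0

    def update(self, x):
        """Feed one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n        # running mean
        self.cumulative += x - self.mean - self.delta
        self.cumulative_min = min(self.cumulative_min, self.cumulative)
        gap = self.cumulative - self.cumulative_min
        return self.n >= self.min_samples and gap > self.threshold
```

A typical caller creates one detector per metric and calls `update()` on each new sample; detecting downward shifts would need a second detector fed with negated values.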
Operational Regimes
Prophet recognizes 7 distinct operational modes:
| Regime | Characteristics | Trust Impact |
|---|---|---|
| Normal | Baseline behavior | Full trust |
| Peak | High utilization, low errors | Normal trust |
| Off-Peak | Low utilization | Normal trust |
| Maintenance | Very low traffic | Normal trust |
| Incident | High errors + latency | 30% |
| Degraded | Elevated but not critical | Moderate reduction |
| Recovery | Improving from bad state | 15% |
CLI Commands
# Comprehensive drift check
prophet drift # Full drift report with trust scores
prophet drift --metric cpu # Check specific metric
prophet drift --json # JSON output
# Trust scores
prophet drift-trust # All metric trust scores
prophet drift-trust cpu # Trust for specific metric
# Operational regime
prophet drift-regime # Current regime + transitions
prophet drift-regime --json # JSON output
# Signal history
prophet drift-history # Last 24h signals
prophet drift-history --hours 72 # Custom lookback
prophet drift-history --severity severe # Filter by severity
# Prediction accuracy
prophet drift-accuracy # All metrics
prophet drift-accuracy cpu # Specific metric
# Resolve signals
prophet drift-resolve 42 --note "Deployed model v2"
# Interactive demos
prophet drift-demo --scenario gradual # Slow mean shift
prophet drift-demo --scenario sudden # Abrupt distribution change
prophet drift-demo --scenario concept # Relationship decorrelation
prophet drift-demo --scenario regime # Operational mode transitions
Trust Scoring
Each metric gets a trust score (0-100%) based on:
- Active drift signals (severity × confidence)
- Current operational regime
- Recent prediction accuracy
- Concept drift affecting related metrics
Trust influences all other Prophet modules:
- Health Score weights predictions by trust
- Alerts include trust disclaimers when low
- Ensemble can deprioritize low-trust models
Architecture
┌─────────────────────────────────────────────────────────────┐
│ Drift Engine │
│ │
│ ┌────────────┐ ┌──────────────┐ ┌────────────────┐ │
│ │ Covariate │ │ Performance │ │ Concept │ │
│ │ Detector │ │ Detector │ │ Detector │ │
│ │ │ │ │ │ │ │
│ │ KS Test │ │ ADWIN │ │ Fisher z │ │
│ │ Page-Hinkl │ │ Error Track │ │ Correlation │ │
│ └─────┬──────┘ └──────┬───────┘ └───────┬────────┘ │
│ │ │ │ │
│ ┌─────┴─────────────────┴───────────────────┴──────────┐ │
│ │ Regime Detector │ │
│ │ Classifies: normal/peak/incident/recovery/... │ │
│ └─────┬────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────┴────────────────────────────────────────────────┐ │
│ │ Trust Scorer │ │
│ │ Per-metric trust (0-100%) from all signals │ │
│ └─────┬────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────┴────────────────────────────────────────────────┐ │
│ │ Drift Advisor │ │
│ │ OK → Monitor → Recalibrate → Retrain → Suspend │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Persistence: drift.db (SQLite) │
└─────────────────────────────────────────────────────────────┘
Severity → Action Mapping
| Severity | Trust Penalty | Recommended Action |
|---|---|---|
| Minor | -10 | |
| Moderate | -25 | |
| Severe | -45 | |
| Critical | -70 | |
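The penalties in the table can be combined into a trust score as sketched below. The penalty values come from the table and the "severity × confidence" weighting from the Trust Scoring list; summing penalties across signals and clamping at zero are assumptions, as are the names used. Regime and accuracy inputs are left out to keep the sketch short.

```python
# Trust penalties per severity, taken from the table above (in trust points).
SEVERITY_PENALTY = {"minor": 10, "moderate": 25, "severe": 45, "critical": 70}


def trust_score(signals, base=100.0):
    """Subtract confidence-weighted penalties for each active drift signal.

    signals: iterable of (severity, confidence in [0, 1]) pairs.
    Additive penalties and the floor at 0 are illustrative assumptions.
    """
    penalty = sum(SEVERITY_PENALTY[severity] * confidence
                  for severity, confidence in signals)
    return max(0.0, base - penalty)
```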
Philosophy
> "A prediction model that doesn't know it's wrong is more dangerous than no model at all."
Traditional monitoring tells you when your system breaks. Drift detection tells you when your understanding of your system breaks. It's the difference between watching the road and checking that your GPS is still accurate.
Fleet Intelligence (Gen 6 Evolution)
Extends Prophet from single-service monitoring to fleet-wide intelligence. Cross-service correlation, dependency mapping, blast radius estimation, and risk matrices.
Philosophy
> "In a microservice architecture, no service fails alone."
Single-service monitoring catches local problems. Fleet Intelligence catches systemic problems — when auth-service slows down, it predicts api-gateway, user-service, and search-service will follow.
Architecture
┌─────────────────────────────────────────────────────────────┐
│ Fleet Intelligence │
│ │
│ ┌──────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ Service │ │ Correlation │ │ Blast Radius │ │
│ │ Registry │ │ Engine │ │ Estimator │ │
│ │ │ │ │ │ │ │
│ │ Register │ │ Cross-corr │ │ Graph walk │ │
│ │ Track │ │ Lag detect │ │ Impact score │ │
│ │ Status │ │ Auto-discover │ │ Cascade sim │ │
│ └──────┬───────┘ └───────┬────────┘ └───────┬────────┘ │
│ │ │ │ │
│ ┌──────┴──────────────────┴─────────────────────┴───────┐ │
│ │ Dependency Graph │ │
│ │ Directed graph of service relationships │ │
│ └──────┬────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────┴────────────────────────────────────────────────┐ │
│ │ Fleet Dashboard │ │
│ │ Overall health │ Risk matrix │ Topology view │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Persistence: fleet.db (SQLite) │
└─────────────────────────────────────────────────────────────┘
Registering Services
# Register services with tiers (1=critical, 2=important, 3=normal)
prophet fleet register api-gateway --tier 1 --desc "Main entry point"
prophet fleet register auth-service --tier 1 --desc "Authentication"
prophet fleet register user-service --tier 2 --desc "User management"
prophet fleet register payment-svc --tier 1 --desc "Payment processing"
prophet fleet register notification-svc --tier 3 --desc "Email/SMS"
Declaring Dependencies
Dependencies can be declared manually or discovered automatically:
# Manual: api-gateway depends on auth-service
prophet fleet deps add api-gateway auth-service --type upstream --strength 0.95
# Automatic: discover from metric correlations
prophet fleet discover
Fleet Status
prophet fleet status
# Output:
# ══════════════════════════════════════════════════
# 🛰️ FLEET INTELLIGENCE
# ══════════════════════════════════════════════════
#
# Fleet Health: [████████████████░░░░] 83% (B)
#
# Services: 7 total
# 🟢 Healthy: 4
# 🟡 Degraded: 1
# 🔴 Critical: 0
# ❓ Unknown: 2
#
# ⚠️ Top Risks:
# 🔴 auth-service — widespread (4 services, 82% impact)
Blast Radius Estimation
The key feature: predict the cascading impact of a service failure.
prophet fleet blast auth-service
# Output:
# 💥 Blast Radius: auth-service
# ────────────────────────────────────────
# Severity: 🔴 WIDESPREAD
# Affected: 4 service(s)
# Impact: 82%
# Depth: 2 hop(s)
#
# 📊 Affected Services:
# api-gateway: [████████████████████] 100% (+0m, depth 1)
# user-service: [████████████████████] 100% (+0m, depth 1)
# search-service: [████████████████░░░░] 81% (+0m, depth 2)
#
# 🌊 Cascade Path:
# auth-service → api-gateway → user-service → search-service
#
# 💡 Mitigations:
# 🚨 Tier-1 critical service — activate incident response
# ⚡ Critical services at risk: api-gateway — prepare failover
# 🔌 Enable circuit breakers for: api-gateway, user-service
Blast Severity Levels
| Severity |
|---|
| 🟢 Contained |
| 🟡 Limited |
| 🟠 Moderate |
| 🔴 Widespread |
| 💀 Catastrophic |
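The graph walk behind a blast radius estimate can be sketched as a breadth-first traversal over "who depends on whom" edges. This is a hypothetical model, not the estimator's actual logic: it assumes impact decays by multiplying dependency strengths along a path, keeps the strongest path per service, and ignores tier multipliers; `blast_radius` and `min_impact` are illustrative names.

```python
from collections import deque


def blast_radius(dependents, failed, min_impact=0.05):
    """Estimate which services are affected when `failed` goes down.

    dependents: service -> list of (dependent_service, strength in [0, 1]).
    Returns {service: (impact, depth)} using the strongest path to each service.
    """
    best = {}
    queue = deque([(failed, 1.0, 0)])
    while queue:
        service, impact, depth = queue.popleft()
        for child, strength in dependents.get(service, []):
            child_impact = impact * strength
            if child == failed or child_impact < min_impact:
                continue
            # Only revisit a service when a strictly stronger path is found,
            # which also guarantees termination on cyclic graphs.
            if child not in best or child_impact > best[child][0]:
                best[child] = (child_impact, depth + 1)
                queue.append((child, child_impact, depth + 1))
    return best
```

With `auth-service` feeding `user-service` at strength 0.9 and `user-service` feeding `search-service` at 0.9, the walk reports `search-service` at depth 2 with an impact of about 0.81.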
Cross-Service Correlation Discovery
Automatically discovers which services influence each other by cross-correlating their metrics:
prophet fleet discover --min 0.6
# Output:
# ✅ Discovered 5 cross-service correlations:
#
# api-gateway.response_time
# → user-service.response_time
# [████████░░] 82% (lag: 2.5m, n=150)
#
# auth-service.cpu
# → api-gateway.error_rate
# [███████░░░] 71% (lag: 1.0m, n=200)
Service Topology
# ASCII topology
prophet fleet topology
# Mermaid diagram (for docs/dashboards)
prophet fleet topology --mermaid
Risk Matrix
Combines failure likelihood (from current status) with blast radius impact:
prophet fleet risk
# Output:
# 🎯 Risk Matrix
# ────────────────────────────────────────────────
# Service Tier Status Risk Blast Affected
# user-service T2 degraded █████░ widespread 3
# auth-service T1 healthy █░░░░░ widespread 4
# payment-service T1 healthy █░░░░░ moderate 2
Fleet Events
Track fleet-level events (deployments, incidents, config changes):
# Record events
prophet fleet event deployment api-gateway "Deployed v2.3.1"
prophet fleet event incident auth-service "Elevated latency"
# View recent events
prophet fleet event
Comparative Service Analysis
Compare the same metric across two services:
prophet fleet compare api-gateway user-service response_time
# Returns: mean, std, p50, p95, p99, correlation, lag
Service Tiers
| Tier | Meaning | Blast Multiplier |
|---|---|---|
| 1 | Critical (auth, payments) | 1.5× impact |
| 2 | Important (user, search) | 1.2× impact |
| 3 | Normal (analytics, notif.) | 1.0× impact |
Integration with Other Prophet Modules
Fleet Intelligence feeds into existing Prophet modules:
- Health Score — fleet health contributes to overall system health
- Drift Detection — cross-service correlation changes signal architectural drift
- Incident Memory — fleet-wide incidents are stored with cascade paths
- SLO Budget — blast radius informs SLO budget burn predictions
All Fleet Commands
prophet fleet status # Fleet dashboard
prophet fleet status --json # JSON output
prophet fleet register <name> # Register a service
prophet fleet remove <name> # Remove a service
prophet fleet blast <name> # Blast radius estimation
prophet fleet discover # Auto-discover correlations
prophet fleet discover --min 0.7 # Higher correlation threshold
prophet fleet topology # ASCII topology
prophet fleet topology --mermaid # Mermaid diagram
prophet fleet risk # Risk matrix
prophet fleet deps # All dependencies
prophet fleet deps <name> # Dependencies for a service
prophet fleet compare <a> <b> <m> # Compare metric across services
prophet fleet event # Recent events
prophet fleet event <type> <svc> <desc> # Record event
Gen 7 Ideas (Future)
- Fleet Anomaly Correlation — When multiple services degrade simultaneously, auto-identify the common root
- Dependency Health Score — Per-dependency health tracking (is this link healthy?)
- Fleet Simulation — "What if we lose an entire AZ?" chaos engineering simulation
- Service SLA Composition — Calculate composite SLA from dependency chain
- Fleet Drift — Detect when the dependency graph itself is changing
Source Anchor
homelab/clawdbot/skills/performance-prophet/SKILL.md