Graph Kernel Monitoring Guide

Overview

The Graph Kernel service exposes metrics, structured logs, and health endpoints for production observability. This guide covers monitoring setup, key metrics, alerting, and troubleshooting.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     Observability Stack                          │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────────────┐   │
│  │ Cloud       │   │ Cloud       │   │ Cloud Run           │   │
│  │ Logging     │   │ Monitoring  │   │ Metrics             │   │
│  │ (JSON logs) │   │ (Alerts)    │   │ (Built-in)          │   │
│  └──────┬──────┘   └──────┬──────┘   └──────────┬──────────┘   │
│         │                  │                     │              │
│         └──────────────────┼─────────────────────┘              │
│                            │                                     │
│  ┌─────────────────────────┴─────────────────────────────────┐  │
│  │                  Graph Kernel Service                      │  │
│  │  ┌─────────────┐  ┌──────────────┐  ┌─────────────────┐   │  │
│  │  │ /health/*   │  │ JSON Logging │  │ Custom Metrics  │   │  │
│  │  │ endpoints   │  │ (tracing)    │  │ (metrics crate) │   │  │
│  │  └─────────────┘  └──────────────┘  └─────────────────┘   │  │
│  └────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Health Endpoints

| Endpoint | Purpose | Response |
|---|---|---|
| `GET /health` | Detailed status | Full service state including DB |
| `GET /health/live` | Liveness probe | 200 if process is running |
| `GET /health/ready` | Readiness probe | 200 if accepting traffic, 503 otherwise |
| `GET /health/startup` | Startup probe | 200 when fully initialized |
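
The three probes can be combined into a single coarse state when checking a service by hand. The sketch below is one plausible reading of the table, not the service's own logic: the `ServiceState` names and the `classify_probes` function are hypothetical, and it assumes each probe returns exactly 200 on success.

```python
from enum import Enum


class ServiceState(Enum):
    """Hypothetical coarse states derived from the three probe endpoints."""
    DOWN = "down"
    STARTING = "starting"
    NOT_READY = "not_ready"
    SERVING = "serving"


def classify_probes(live: int, startup: int, ready: int) -> ServiceState:
    """Map HTTP status codes from /health/live, /health/startup and
    /health/ready to a coarse state, checking the most basic probe first."""
    if live != 200:
        return ServiceState.DOWN        # process is not answering at all
    if startup != 200:
        return ServiceState.STARTING    # running, but not fully initialized
    if ready != 200:
        return ServiceState.NOT_READY   # initialized, but refusing traffic (e.g. 503)
    return ServiceState.SERVING
```

For example, `classify_probes(200, 200, 503)` yields `ServiceState.NOT_READY`, which corresponds to the "Service Returns 503" case in Troubleshooting.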

Health Response Example

json
{
  "status": "healthy",
  "version": "0.1.0",
  "schema_version": "1.0.0",
  "policy_count": 1,
  "registry_fingerprint": "a1b2c3d4e5f6",
  "database": {
    "connected": true,
    "pool_size": 5,
    "pool_idle": 3,
    "pool_max": 10
  }
}
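
The `database` block is the quickest way to see how close the pool is to its limit. The following is a minimal sketch, assuming `pool_size` counts currently open connections and `pool_idle` counts the open connections not in use; the function names are illustrative and not part of the service.

```python
import json

EXAMPLE = """
{"status": "healthy",
 "database": {"connected": true, "pool_size": 5, "pool_idle": 3, "pool_max": 10}}
"""


def pool_utilization(health: dict) -> float:
    """Fraction of the maximum pool that is currently busy.

    ASSUMPTION: pool_size = open connections, pool_idle = open but unused.
    """
    db = health["database"]
    in_use = db["pool_size"] - db["pool_idle"]
    return in_use / db["pool_max"]


def pool_exhausted(health: dict) -> bool:
    """True when the pool is at its maximum size and nothing is idle."""
    db = health["database"]
    return db["pool_size"] >= db["pool_max"] and db["pool_idle"] == 0


health = json.loads(EXAMPLE)
print(pool_utilization(health), pool_exhausted(health))
```

With the example response above, two of ten possible connections are busy, so utilization is 0.2 and the pool is not exhausted.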

Metrics

Request Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `graph_kernel_requests_total` | Counter | path, method, status | Total HTTP requests |
| `graph_kernel_request_duration_seconds` | Histogram | path, method | Request latency (seconds) |

Slice Metrics

| Metric | Type | Description |
|---|---|---|
| `graph_kernel_slices_generated_total` | Counter | Total slices generated |
| `graph_kernel_slice_turns_count` | Histogram | Turns per slice |
| `graph_kernel_slice_edges_count` | Histogram | Edges per slice |
| `graph_kernel_slice_latency_ms` | Histogram | Slice generation time (ms) |

Token Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `graph_kernel_token_verifications_total` | Counter | result | Token verification attempts |

Database Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `graph_kernel_db_queries_total` | Counter | query_type, status | Database query count |
| `graph_kernel_db_query_duration_ms` | Histogram | query_type, status | Query latency (ms) |

Structured Logging

Logs are emitted in JSON format for Cloud Logging integration:

json
{
  "timestamp": "2026-01-02T10:00:00.000Z",
  "level": "INFO",
  "target": "graph_kernel_service::access",
  "trace_id": "abc123def456",
  "method": "POST",
  "path": "/api/slice",
  "status": 200,
  "latency_ms": 45,
  "message": "request completed"
}

Log Fields

| Field | Description |
|---|---|
| `trace_id` | Cloud Trace context or generated UUID |
| `method` | HTTP method |
| `path` | Request path |
| `status` | Response status code |
| `latency_ms` | Request duration in milliseconds |
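
Because each log entry is a single JSON object, exported logs can be filtered with a few lines of code. The sketch below assumes one JSON object per line with the fields listed above; `slow_requests` is a hypothetical helper and its 1000 ms default is only an example threshold.

```python
import json
from typing import Iterable, Iterator


def slow_requests(lines: Iterable[str], threshold_ms: float = 1000) -> Iterator[dict]:
    """Yield parsed access-log entries whose latency_ms exceeds threshold_ms."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue  # valid JSON, but not a log object
        latency = entry.get("latency_ms")
        if isinstance(latency, (int, float)) and latency > threshold_ms:
            yield entry


sample = [
    '{"method": "POST", "path": "/api/slice", "status": 200, "latency_ms": 45}',
    "plain text line that is not JSON",
    '{"method": "POST", "path": "/api/slice", "status": 200, "latency_ms": 1500}',
]
for entry in slow_requests(sample):
    print(entry["path"], entry["latency_ms"])
```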

Log Levels

  • `ERROR`: Service errors, database failures
  • `WARN`: Degraded operation, missing HMAC secret
  • `INFO`: Request completion, startup/shutdown
  • `DEBUG`: Detailed execution traces (not enabled by default)

Alerting

Recommended Alert Policies

1. Service Unavailable

yaml
alert: GraphKernelDown
condition: probe_success{job="graph-kernel"} == 0
for: 2m
severity: critical
annotations:
  summary: Graph Kernel service is down
  action: Check Cloud Run logs, verify database connectivity

2. High Error Rate

yaml
alert: GraphKernelHighErrorRate
condition: |
  sum(rate(graph_kernel_requests_total{status=~"5.."}[5m])) /
  sum(rate(graph_kernel_requests_total[5m])) > 0.05
for: 5m
severity: warning
annotations:
  summary: Graph Kernel error rate above 5%
  action: Check logs for error patterns
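
To make the ratio in this condition concrete, the same calculation can be done by hand from two readings of `graph_kernel_requests_total`. This is a simplified sketch: `error_ratio` is a hypothetical helper, the snapshots are assumed to be keyed by the `status` label, and counter resets between readings are ignored.

```python
from typing import Dict, Optional


def error_ratio(before: Dict[str, float], after: Dict[str, float]) -> Optional[float]:
    """Share of 5xx responses among all requests between two counter snapshots.

    Returns None when no requests were served in the interval.
    """
    deltas = {status: after[status] - before.get(status, 0) for status in after}
    total = sum(deltas.values())
    if total <= 0:
        return None
    # Mirrors the status=~"5.." matcher: three characters, starting with 5.
    errors = sum(d for status, d in deltas.items()
                 if len(status) == 3 and status.startswith("5"))
    return errors / total


before = {"200": 1000, "500": 10}
after = {"200": 1900, "500": 110}
print(error_ratio(before, after))  # 100 errors out of 1000 requests
```

In this example the ratio is 0.1, which is above the 0.05 threshold used by the alert.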

3. Slow Slice Generation

yaml
alert: GraphKernelSlowSlices
condition: |
  histogram_quantile(0.95, sum by (le) (rate(graph_kernel_slice_latency_ms_bucket[5m]))) > 5000
for: 10m
severity: warning
annotations:
  summary: Graph Kernel p95 slice latency above 5s
  action: Check database performance, consider scaling
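
The p95 in this alert is an estimate interpolated from cumulative histogram buckets, not an exact percentile. The sketch below is a simplified Python version of that interpolation, written for illustration only; the bucket boundaries in the example are invented and do not reflect the service's actual configuration.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("last bucket must have upper bound +inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies in the overflow bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return bound  # empty bucket, nothing to interpolate
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency buckets in milliseconds.
buckets = [(100, 50), (500, 90), (1000, 98), (5000, 100), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))
```

One consequence of this method is that the estimate can never exceed the highest finite bucket bound, so the bucket layout needs a boundary at or above 5000 ms for the alert threshold to be reachable.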

4. Database Connection Issues

yaml
alert: GraphKernelDatabaseUnhealthy
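# Assumes a gauge that mirrors `database.connected` from /health;
# this metric is not listed in the metric tables above.
condition: |
  graph_kernel_health_database_connected == 0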
for: 1m
severity: critical
annotations:
  summary: Graph Kernel lost database connectivity
  action: Check Supabase status, verify DATABASE_URL

5. Token Verification Failures

yaml
alert: GraphKernelTokenFailures
condition: |
  sum(rate(graph_kernel_token_verifications_total{result="invalid"}[5m])) /
  sum(rate(graph_kernel_token_verifications_total[5m])) > 0.10
for: 5m
severity: warning
annotations:
  summary: High rate of invalid token verifications
  action: Check HMAC secret consistency across services

Cloud Run Monitoring

Built-in Metrics

Cloud Run automatically provides:

  • Request count and latency
  • Container instance count
  • Memory and CPU utilization
  • Cold start frequency

Access via: Cloud Console → Cloud Run → graph-kernel → Metrics

Log Explorer Queries

Recent Errors

resource.type="cloud_run_revision"
resource.labels.service_name="graph-kernel"
severity>=ERROR

Slow Requests (>1s)

resource.type="cloud_run_revision"
resource.labels.service_name="graph-kernel"
jsonPayload.latency_ms > 1000

Slice Boundary Violations (Critical)

resource.type="cloud_run_revision"
resource.labels.service_name="graph-kernel"
jsonPayload.message:"SLICE_BOUNDARY_VIOLATION"
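
All three queries share the same two resource lines and differ only in the last condition. A small helper can compose them, relying on the fact that the Logging query language treats comparisons on separate lines as joined by AND. `build_filter` is a hypothetical name used for illustration.

```python
def build_filter(service: str, *conditions: str) -> str:
    """Compose a Cloud Logging filter for one Cloud Run service.

    Each extra condition is placed on its own line, which the query
    language interprets as an implicit AND.
    """
    base = [
        'resource.type="cloud_run_revision"',
        f'resource.labels.service_name="{service}"',
    ]
    return "\n".join(base + list(conditions))


print(build_filter("graph-kernel", "severity>=ERROR"))
print(build_filter("graph-kernel", "jsonPayload.latency_ms > 1000"))
```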

Troubleshooting

Common Issues

Service Returns 503

Symptoms: `/health/ready` returns 503, requests fail

Causes:
1. Database connection failed
2. Connection pool exhausted
3. Supabase rate limiting

Resolution:
1. Check `DATABASE_URL` secret is correct
2. Verify Supabase is accessible
3. Check connection pool stats in `/health`
4. Increase `DB_MAX_CONNECTIONS` if needed

High Latency

Symptoms: Slice generation >5s

Causes:
1. Large slice sizes (many turns)
2. Database query performance
3. Cold start overhead

Resolution:
1. Check `graph_kernel_slice_turns_count` histogram
2. Review slow query logs
3. Consider `--min-instances 1` to keep one instance warm and avoid cold starts
4. Tune policy `max_nodes` parameter

Token Verification Failures

Symptoms: `graph_kernel_token_verifications_total{result="invalid"}` increasing

Causes:
1. `KERNEL_HMAC_SECRET` mismatch between services
2. Token format corruption
3. Clock skew (if using time-based tokens)

Resolution:
1. Verify secret is identical in Graph Kernel and RAG++
2. Check token format (should be 32 hex chars)
3. Rotate secret and redeploy all services
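
The format check in step 2 and the effect of a secret mismatch can both be illustrated in a few lines. This is a sketch under a stated assumption: the guide does not specify how tokens are derived, so `hypothetical_sign` uses HMAC-SHA256 truncated to 32 hex characters purely for demonstration. Only the 32-hex-character format comes from this guide.

```python
import hashlib
import hmac
import re

# Format stated in this guide: 32 hexadecimal characters.
TOKEN_PATTERN = re.compile(r"[0-9a-fA-F]{32}")


def has_valid_format(token: str) -> bool:
    """True if the token is exactly 32 hex characters."""
    return TOKEN_PATTERN.fullmatch(token) is not None


def hypothetical_sign(secret: bytes, payload: bytes) -> str:
    """ASSUMPTION: HMAC-SHA256 truncated to 32 hex chars; the real scheme may differ."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()[:32]


def verify(secret: bytes, payload: bytes, token: str) -> bool:
    """Check format first, then compare in constant time."""
    if not has_valid_format(token):
        return False
    return hmac.compare_digest(hypothetical_sign(secret, payload), token.lower())


token = hypothetical_sign(b"secret-a", b"slice-123")
print(verify(b"secret-a", b"slice-123", token))  # same secret on both sides
print(verify(b"secret-b", b"slice-123", token))  # mismatched secret
```

A token signed with one secret fails verification under any other secret, which is why a mismatch between services shows up as a sustained rise in `result="invalid"` rather than as occasional failures.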

Memory Issues

Symptoms: Container restarts, OOM errors

Causes:
1. Large batch slice requests
2. Connection leak
3. Unbounded caching

Resolution:
1. Increase `--memory` in Cloud Run
2. Check for connection leaks in logs
3. Monitor pool stats over time

Runbook

1. Service Not Starting

bash
# Check recent logs
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=graph-kernel" --limit 50

# Check revision status
gcloud run revisions list --service graph-kernel --region us-central1

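# Check that the secret has an enabled version (does not print the value)
gcloud secrets versions list kernel-hmac-secret

# Print the secret value only if you need to compare it; this writes it to your terminal
gcloud secrets versions access latest --secret kernel-hmac-secret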

2. Force Deployment

bash
# Trigger new deployment
cd core/cc-graph-kernel
gcloud builds submit --config=cloudbuild-service.yaml

3. Rollback

bash
# List revisions
gcloud run revisions list --service graph-kernel --region us-central1

# Route traffic to previous revision
gcloud run services update-traffic graph-kernel \
  --region us-central1 \
  --to-revisions PREVIOUS_REVISION=100

4. Scale Manually

bash
# Set minimum instances (avoid cold starts)
gcloud run services update graph-kernel \
  --region us-central1 \
  --min-instances 1

# Set back to 0 for cost savings
gcloud run services update graph-kernel \
  --region us-central1 \
  --min-instances 0

Key Performance Indicators

| KPI | Target | Critical |
|---|---|---|
| Availability | >99.9% | |
| p50 Latency | <100ms | >500ms |
| p95 Latency | <500ms | >2s |
| Error Rate | <0.1% | |
| Token Verification Success | >99% | |
References

  • [Cloud Run Monitoring](https://cloud.google.com/run/docs/monitoring)
  • [Cloud Logging Queries](https://cloud.google.com/logging/docs/view/logging-query-language)
  • [Production Invariants](../../../cc-rag-plus-plus/docs/PRODUCTION_INVARIANTS.md)
  • [Deployment Guide](./DEPLOYMENT.md)

Source Anchor

Comp-Core/core/semantic/cc-graph-kernel/docs/MONITORING.md