Firewall: Monitoring & Health Checks

Monitor the TruthVouch Governance Gateway with Prometheus metrics, Grafana dashboards, and health check endpoints.

Health Check Endpoints

HTTP Health Endpoint

curl http://localhost:8080/health

Response:

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "embeddings": "ok"
  }
}
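A deployment script can gate on this endpoint before routing traffic. A minimal sketch using only the Python standard library (the `is_healthy` helper and its strictness about individual checks are illustrative, not part of the gateway):

```python
import json
import urllib.request

def is_healthy(payload: dict) -> bool:
    """Healthy only if the overall status is good and every subsystem check passed."""
    return (
        payload.get("status") == "healthy"
        and all(v == "ok" for v in payload.get("checks", {}).values())
    )

def check(url: str = "http://localhost:8080/health") -> bool:
    """Fetch the health endpoint and evaluate it."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return is_healthy(json.load(resp))

# Evaluate the sample response from above without hitting the network
sample = {"status": "healthy", "checks": {"database": "ok", "redis": "ok", "embeddings": "ok"}}
print(is_healthy(sample))  # True
```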

Readiness Probe

curl http://localhost:8080/ready

Response (ready):

{
  "ready": true,
  "checks": {
    "database": "connected",
    "cache": "connected"
  }
}

Prometheus Metrics

Access metrics at:

curl http://localhost:9090/metrics

Request Metrics

# HELP gateway_requests_total Total requests processed
# TYPE gateway_requests_total counter
gateway_requests_total{method="scan",status="success"} 15423
gateway_requests_total{method="scan",status="blocked"} 234
gateway_requests_total{method="scan",status="error"} 12
# HELP gateway_request_duration_seconds Request latency
# TYPE gateway_request_duration_seconds histogram
gateway_request_duration_seconds_bucket{method="scan",le="0.01"} 10000
gateway_request_duration_seconds_bucket{method="scan",le="0.1"} 14900
gateway_request_duration_seconds_bucket{method="scan",le="1"} 15410
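Bucket counts are cumulative, which is how `histogram_quantile()` recovers percentiles. The same linear interpolation can be sketched in plain Python (the `+Inf` bucket count of 15423 is assumed here for illustration; the real exposition always includes one):

```python
import math

def histogram_quantile(q: float, buckets: list) -> float:
    """Approximate a quantile from cumulative Prometheus-style (le, count) buckets,
    using the same linear interpolation as PromQL's histogram_quantile()."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return lower_bound  # quantile falls in the +Inf bucket
            in_bucket = count - lower_count
            frac = (rank - lower_count) / in_bucket if in_bucket else 0.0
            return lower_bound + (le - lower_bound) * frac
        lower_bound, lower_count = le, count
    return lower_bound

# Buckets from the exposition above, plus an assumed +Inf bucket
p95 = histogram_quantile(
    0.95,
    [(0.01, 10000), (0.1, 14900), (1.0, 15410), (math.inf, 15423)],
)
print(p95)  # roughly 0.095 seconds
```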

Scan Metrics

# HELP gateway_pii_detections_total PII detections
# TYPE gateway_pii_detections_total counter
gateway_pii_detections_total{type="ssn"} 45
gateway_pii_detections_total{type="credit_card"} 23
gateway_pii_detections_total{type="email"} 156
# HELP gateway_hallucinations_detected Hallucination detections
# TYPE gateway_hallucinations_detected counter
gateway_hallucinations_detected 234
# HELP gateway_injection_attempts_total Injection attempts
# TYPE gateway_injection_attempts_total counter
gateway_injection_attempts_total{type="prompt_injection"} 67
gateway_injection_attempts_total{type="sql_injection"} 12
# HELP gateway_toxicity_detections_total Toxicity detections
# TYPE gateway_toxicity_detections_total counter
gateway_toxicity_detections_total{category="hate_speech"} 23
gateway_toxicity_detections_total{category="harassment"} 45

Database Metrics

# HELP gateway_database_queries_total Total queries
# TYPE gateway_database_queries_total counter
gateway_database_queries_total{type="scan"} 15423
# HELP gateway_database_query_duration_seconds Query latency
# TYPE gateway_database_query_duration_seconds histogram
gateway_database_query_duration_seconds_bucket{le="0.01"} 10000
gateway_database_query_duration_seconds_bucket{le="0.1"} 14900
# HELP gateway_database_pool_size Current pool size
# TYPE gateway_database_pool_size gauge
gateway_database_pool_size 15

Cache Metrics

# HELP gateway_cache_hits_total Cache hits
# TYPE gateway_cache_hits_total counter
gateway_cache_hits_total{type="embedding"} 8234
# HELP gateway_cache_misses_total Cache misses
# TYPE gateway_cache_misses_total counter
gateway_cache_misses_total{type="embedding"} 1234
# HELP gateway_cache_hit_ratio Cache hit ratio
# TYPE gateway_cache_hit_ratio gauge
gateway_cache_hit_ratio 0.87
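As a sanity check, the exported `gateway_cache_hit_ratio` gauge should agree with the two counters above; a quick sketch:

```python
def hit_ratio(hits: float, misses: float) -> float:
    """Hit ratio from raw counters; 0.0 when the cache has seen no traffic."""
    total = hits + misses
    return hits / total if total else 0.0

# Counter values from the exposition above
print(round(hit_ratio(8234, 1234), 2))  # 0.87, matching the gauge
```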

Grafana Dashboards

Import Dashboard

# Download dashboard JSON
curl -o gateway-dashboard.json \
https://grafana.com/api/dashboards/18234/revisions/2/download
# Import via UI: Grafana → Dashboards → Import → Upload JSON

Key Panels

Request Rate (requests/sec):

rate(gateway_requests_total[1m])

Scan Duration (percentiles):

histogram_quantile(0.95, rate(gateway_request_duration_seconds_bucket[5m]))

Error Rate:

rate(gateway_requests_total{status="error"}[1m]) / rate(gateway_requests_total[1m])

PII Detections (last hour):

sum(increase(gateway_pii_detections_total[1h]))

Cache Hit Ratio:

gateway_cache_hits_total / (gateway_cache_hits_total + gateway_cache_misses_total)

Kubernetes Health Checks

Liveness Probe (gRPC)

livenessProbe:
  grpc:
    port: 50052
  initialDelaySeconds: 30
  periodSeconds: 10

Readiness Probe (HTTP)

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

Alerting Rules

Example Prometheus Rules

groups:
  - name: gateway
    rules:
      - alert: HighErrorRate
        expr: |
          rate(gateway_requests_total{status="error"}[5m]) /
          rate(gateway_requests_total[5m]) > 0.05
        for: 5m
        annotations:
          summary: "Gateway error rate > 5%"
      - alert: SlowScan
        expr: |
          histogram_quantile(0.95, rate(gateway_request_duration_seconds_bucket[5m])) > 5
        for: 10m
        annotations:
          summary: "95th percentile scan latency > 5s"
      - alert: DatabaseConnectionFailed
        expr: |
          gateway_database_pool_size == 0
        for: 1m
        annotations:
          summary: "Gateway cannot connect to database"
      - alert: HighPIIDetectionRate
        expr: |
          rate(gateway_pii_detections_total[5m]) > 10
        for: 5m
        annotations:
          summary: "PII detection rate > 10/s"
      - alert: CacheHitRatioDegraded
        expr: |
          gateway_cache_hit_ratio < 0.5
        for: 10m
        annotations:
          summary: "Cache hit ratio degraded below 50%"

Logs

Log Levels

Configure in config.yaml:

logging:
  level: INFO  # DEBUG, INFO, WARN, ERROR

Log Format

Example JSON log:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "INFO",
  "message": "Scan completed",
  "request_id": "req-abc123",
  "duration_ms": 45,
  "detections": {
    "pii": 1,
    "injection": 0,
    "toxicity": 0
  },
  "result": "allowed"
}
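Because each entry is structured JSON, ad-hoc triage is straightforward. A sketch that tallies scan results and detection counts from JSON-lines log input (field names follow the example above; the `summarize` helper is illustrative):

```python
import json
from collections import Counter

def summarize(log_lines):
    """Tally scan results and total detections from JSON-per-line gateway logs."""
    results, detections = Counter(), Counter()
    for line in log_lines:
        entry = json.loads(line)
        results[entry.get("result", "unknown")] += 1
        for kind, count in entry.get("detections", {}).items():
            detections[kind] += count
    return results, detections

lines = [
    '{"result": "allowed", "detections": {"pii": 1, "injection": 0}}',
    '{"result": "blocked", "detections": {"pii": 2, "injection": 1}}',
]
results, detections = summarize(lines)
print(dict(results))     # {'allowed': 1, 'blocked': 1}
print(dict(detections))  # {'pii': 3, 'injection': 1}
```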

Log Aggregation

Send logs to external service (ELK, Datadog, Splunk):

logging:
  # Syslog output
  syslog_enabled: true
  syslog_host: logs.company.local
  syslog_port: 514
  # JSON structured logging
  json_output: true

Distributed Tracing

OpenTelemetry Integration

# In config.yaml
observability:
  tracing_enabled: true
  otel_endpoint: http://jaeger-collector:14268
  sample_rate: 0.1  # 10% sampling

View Traces

Access Jaeger UI:

http://localhost:16686

Performance Tuning

Monitor Key Metrics

# Watch in real-time
watch 'curl -s http://localhost:9090/metrics | grep gateway_request_duration_seconds_bucket'

Scaling Indicators

Scale up when:

  • Request latency (p95) > 1 second
  • Error rate > 5%
  • Database pool exhausted

Scale down when:

  • CPU utilization < 30%
  • Memory utilization < 40%
  • Request rate trending down
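The thresholds above can be encoded in a simple autoscaling policy. A sketch (the thresholds mirror the lists above; the function itself is illustrative, and the "request rate trending down" signal is omitted since it needs time-series history):

```python
def scaling_decision(p95_latency_s: float, error_rate: float,
                     pool_exhausted: bool, cpu_util: float, mem_util: float) -> str:
    """Return 'up', 'down', or 'hold' based on the scaling indicators above."""
    # Scale up on any pressure signal
    if p95_latency_s > 1.0 or error_rate > 0.05 or pool_exhausted:
        return "up"
    # Scale down only when both CPU and memory are underutilized
    if cpu_util < 0.30 and mem_util < 0.40:
        return "down"
    return "hold"

print(scaling_decision(1.5, 0.01, False, 0.5, 0.5))  # up
print(scaling_decision(0.2, 0.01, False, 0.2, 0.3))  # down
print(scaling_decision(0.2, 0.01, False, 0.5, 0.5))  # hold
```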

Custom Dashboards

Node.js Monitoring

const promClient = require('prom-client');

// Create a custom histogram metric
const scanLatency = new promClient.Histogram({
  name: 'custom_scan_latency',
  help: 'Custom scan latency',
  buckets: [0.01, 0.1, 0.5, 1]
});

// Record an observation (duration in seconds, measured by your scan code)
scanLatency.observe(duration);

Python Monitoring

from prometheus_client import Histogram

scan_duration = Histogram(
    'custom_scan_duration_seconds',
    'Scan duration',
    buckets=[0.01, 0.1, 0.5, 1]
)

# Time a block of scan code with the histogram's context manager
with scan_duration.time():
    pass  # scan code here

Troubleshooting

Metrics Not Appearing

# Check Prometheus connectivity
curl -I http://gateway:9090/metrics
# Check metrics are being generated
docker exec gateway-container curl http://localhost:9090/metrics

High Latency

# Find slow queries
histogram_quantile(0.95, rate(gateway_request_duration_seconds_bucket[5m]))
# Check database performance
rate(gateway_database_query_duration_seconds_bucket[5m])

Memory Leak

# Monitor memory over time
watch 'curl -s http://localhost:9090/metrics | grep gateway_memory'

See Network Requirements and Configuration for related settings.