# Firewall: Monitoring & Health Checks

Monitor the TruthVouch Governance Gateway with Prometheus metrics, Grafana dashboards, and health check endpoints.
## Health Check Endpoints

### HTTP Health Endpoint
```bash
curl http://localhost:8080/health
```

Response:

```json
{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "embeddings": "ok"
  }
}
```

### Readiness Probe
```bash
curl http://localhost:8080/ready
```

Response (ready):

```json
{
  "ready": true,
  "checks": {
    "database": "connected",
    "cache": "connected"
  }
}
```

## Prometheus Metrics
Access metrics at:

```bash
curl http://localhost:9090/metrics
```

### Request Metrics
```text
# HELP gateway_requests_total Total requests processed
# TYPE gateway_requests_total counter
gateway_requests_total{method="scan",status="success"} 15423
gateway_requests_total{method="scan",status="blocked"} 234
gateway_requests_total{method="scan",status="error"} 12
```
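Counter families like this can also be consumed outside Prometheus. A minimal sketch using the `prometheus_client` text parser, with the sample values above pasted in as the scraped exposition:

```python
from prometheus_client.parser import text_string_to_metric_families

# Exposition text, as scraped from http://localhost:9090/metrics
exposition = """\
# HELP gateway_requests_total Total requests processed
# TYPE gateway_requests_total counter
gateway_requests_total{method="scan",status="success"} 15423
gateway_requests_total{method="scan",status="blocked"} 234
gateway_requests_total{method="scan",status="error"} 12
"""

# Collect the per-status counter values
counts = {}
for family in text_string_to_metric_families(exposition):
    for sample in family.samples:
        if sample.name == "gateway_requests_total":
            counts[sample.labels["status"]] = sample.value

blocked_ratio = counts["blocked"] / sum(counts.values())
print(f"blocked ratio: {blocked_ratio:.4f}")
```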
```text
# HELP gateway_request_duration_seconds Request latency
# TYPE gateway_request_duration_seconds histogram
gateway_request_duration_seconds_bucket{method="scan",le="0.01"} 10000
gateway_request_duration_seconds_bucket{method="scan",le="0.1"} 14900
gateway_request_duration_seconds_bucket{method="scan",le="1"} 15410
```

### Scan Metrics
```text
# HELP gateway_pii_detections_total PII detections
# TYPE gateway_pii_detections_total counter
gateway_pii_detections_total{type="ssn"} 45
gateway_pii_detections_total{type="credit_card"} 23
gateway_pii_detections_total{type="email"} 156
```
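PromQL's `rate()` over counters like these is just the per-second increase between scrapes, averaged over the window. A toy illustration with two hypothetical scrapes of the `ssn` counter:

```python
# Two hypothetical scrapes of gateway_pii_detections_total{type="ssn"}:
# (unix timestamp in seconds, counter value)
t1, v1 = 1_705_315_800, 45
t2, v2 = 1_705_315_860, 51

# What rate(...) approximates: per-second increase over the window
per_second = (v2 - v1) / (t2 - t1)
per_minute = per_second * 60
print(f"{per_minute:.1f} detections/min")
```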
```text
# HELP gateway_hallucinations_detected Hallucination detections
# TYPE gateway_hallucinations_detected counter
gateway_hallucinations_detected 234
```
```text
# HELP gateway_injection_attempts_total Injection attempts
# TYPE gateway_injection_attempts_total counter
gateway_injection_attempts_total{type="prompt_injection"} 67
gateway_injection_attempts_total{type="sql_injection"} 12
```
```text
# HELP gateway_toxicity_detections_total Toxicity detections
# TYPE gateway_toxicity_detections_total counter
gateway_toxicity_detections_total{category="hate_speech"} 23
gateway_toxicity_detections_total{category="harassment"} 45
```

### Database Metrics
```text
# HELP gateway_database_queries_total Total queries
# TYPE gateway_database_queries_total counter
gateway_database_queries_total{type="scan"} 15423
```
```text
# HELP gateway_database_query_duration_seconds Query latency
# TYPE gateway_database_query_duration_seconds histogram
gateway_database_query_duration_seconds_bucket{le="0.01"} 10000
gateway_database_query_duration_seconds_bucket{le="0.1"} 14900
```
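`histogram_quantile()` estimates a percentile by linearly interpolating inside the cumulative bucket that contains the target rank. A sketch of that calculation, assuming an illustrative `+Inf` bucket count of 15000 for the histogram above:

```python
def histogram_quantile(q, buckets):
    """Approximate Prometheus's histogram_quantile() for cumulative
    (upper_bound, count) buckets sorted by upper bound."""
    total = buckets[-1][1]            # count in the +Inf bucket
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le        # cannot interpolate into +Inf
            # Linear interpolation within this bucket
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

# Buckets from the histogram above, plus an assumed +Inf count of 15000
buckets = [(0.01, 10000), (0.1, 14900), (float("inf"), 15000)]
p95 = histogram_quantile(0.95, buckets)
print(f"p95 query latency ~ {round(p95, 3)}s")
```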
```text
# HELP gateway_database_pool_size Current pool size
# TYPE gateway_database_pool_size gauge
gateway_database_pool_size 15
```

### Cache Metrics
```text
# HELP gateway_cache_hits_total Cache hits
# TYPE gateway_cache_hits_total counter
gateway_cache_hits_total{type="embedding"} 8234
```
```text
# HELP gateway_cache_misses_total Cache misses
# TYPE gateway_cache_misses_total counter
gateway_cache_misses_total{type="embedding"} 1234
```
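As a sanity check, the hit and miss counters above imply the same value that the gateway exports directly as the `gateway_cache_hit_ratio` gauge:

```python
# Counter values from the metrics above
hits, misses = 8234, 1234

hit_ratio = hits / (hits + misses)
print(round(hit_ratio, 2))  # 0.87
```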
```text
# HELP gateway_cache_hit_ratio Cache hit ratio
# TYPE gateway_cache_hit_ratio gauge
gateway_cache_hit_ratio 0.87
```

## Grafana Dashboards
### Import Dashboard
```bash
# Download dashboard JSON
curl -o gateway-dashboard.json \
  https://grafana.com/api/dashboards/18234/revisions/2/download

# Import via UI: Grafana → Dashboards → Import → Upload JSON
```

### Key Panels
Request Rate (requests/sec):

```promql
rate(gateway_requests_total[1m])
```

Scan Duration (percentiles):

```promql
histogram_quantile(0.95, rate(gateway_request_duration_seconds_bucket[5m]))
```

Error Rate:

```promql
rate(gateway_requests_total{status="error"}[1m]) / rate(gateway_requests_total[1m])
```

PII Detections (last hour):

```promql
sum(increase(gateway_pii_detections_total[1h]))
```

Cache Hit Ratio:

```promql
gateway_cache_hits_total / (gateway_cache_hits_total + gateway_cache_misses_total)
```

## Kubernetes Health Checks
### Liveness Probe (gRPC)
```yaml
livenessProbe:
  grpc:
    port: 50052
  initialDelaySeconds: 30
  periodSeconds: 10
```

### Readiness Probe (HTTP)

```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
```

## Alerting Rules
### Example Prometheus Rules
```yaml
groups:
  - name: gateway
    rules:
      - alert: HighErrorRate
        expr: |
          rate(gateway_requests_total{status="error"}[5m])
          / rate(gateway_requests_total[5m]) > 0.05
        for: 5m
        annotations:
          summary: "Gateway error rate > 5%"

      - alert: SlowScan
        expr: |
          histogram_quantile(0.95, rate(gateway_request_duration_seconds_bucket[5m])) > 5
        for: 10m
        annotations:
          summary: "95th percentile scan latency > 5s"

      - alert: DatabaseConnectionFailed
        expr: |
          gateway_database_pool_size == 0
        for: 1m
        annotations:
          summary: "Gateway cannot connect to database"

      - alert: HighPIIDetectionRate
        expr: |
          rate(gateway_pii_detections_total[5m]) > 10
        for: 5m
        annotations:
          summary: "PII detection rate > 10/s"

      - alert: CacheHitRatioDegraded
        expr: |
          gateway_cache_hit_ratio < 0.5
        for: 10m
        annotations:
          summary: "Cache hit ratio degraded below 50%"
```

## Logs
### Log Levels
Configure in `config.yaml`:

```yaml
logging:
  level: INFO  # DEBUG, INFO, WARN, ERROR
```

### Log Format
Example JSON log:

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "INFO",
  "message": "Scan completed",
  "request_id": "req-abc123",
  "duration_ms": 45,
  "detections": {
    "pii": 1,
    "injection": 0,
    "toxicity": 0
  },
  "result": "allowed"
}
```

### Log Aggregation
Send logs to an external service (ELK, Datadog, Splunk):

```yaml
logging:
  # Syslog output
  syslog_enabled: true
  syslog_host: logs.company.local
  syslog_port: 514

  # JSON structured logging
  json_output: true
```

## Distributed Tracing
### OpenTelemetry Integration
```yaml
# In config.yaml
observability:
  tracing_enabled: true
  otel_endpoint: http://jaeger-collector:14268
  sample_rate: 0.1  # 10% sampling
```

### View Traces

Access the Jaeger UI:

```text
http://localhost:16686
```

## Performance Tuning
### Monitor Key Metrics
```bash
# Watch in real time
watch 'curl -s http://localhost:9090/metrics | grep gateway_request_duration_seconds_bucket'
```

### Scaling Indicators
Scale up when:
- Request latency (p95) > 1 second
- Error rate > 5%
- Database pool exhausted
Scale down when:
- CPU utilization < 30%
- Memory utilization < 40%
- Request rate trending down
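The indicators above can be folded into a simple autoscaling decision. A sketch using the thresholds from the lists; the function and its inputs are illustrative, not part of the gateway:

```python
def scaling_decision(p95_latency_s, error_rate, pool_in_use, pool_size,
                     cpu_util, mem_util):
    """Return 'scale-up', 'scale-down', or 'hold' per the indicators above."""
    # Scale up: p95 latency > 1s, error rate > 5%, or connection pool exhausted
    if p95_latency_s > 1.0 or error_rate > 0.05 or pool_in_use >= pool_size:
        return "scale-up"
    # Scale down: CPU < 30% and memory < 40%
    if cpu_util < 0.30 and mem_util < 0.40:
        return "scale-down"
    return "hold"

print(scaling_decision(p95_latency_s=1.4, error_rate=0.01,
                       pool_in_use=9, pool_size=15,
                       cpu_util=0.55, mem_util=0.60))  # scale-up
```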
## Custom Dashboards

### Node.js Monitoring
```javascript
const promClient = require('prom-client');

// Create a custom metric
const scanLatency = new promClient.Histogram({
  name: 'custom_scan_latency',
  help: 'Custom scan latency',
  buckets: [0.01, 0.1, 0.5, 1]
});

// Record an observation
scanLatency.observe(duration);
```

### Python Monitoring
```python
from prometheus_client import Histogram

scan_duration = Histogram(
    'custom_scan_duration_seconds',
    'Scan duration',
    buckets=[0.01, 0.1, 0.5, 1]
)

with scan_duration.time():
    # Scan code
    pass
```

## Troubleshooting
### Metrics Not Appearing
```bash
# Check Prometheus connectivity
curl -I http://gateway:9090/metrics

# Check that metrics are being generated
docker exec gateway-container curl http://localhost:9090/metrics
```

### High Latency
```promql
# Find slow scans (p95 request latency)
histogram_quantile(0.95, rate(gateway_request_duration_seconds_bucket[5m]))

# Check database query latency
rate(gateway_database_query_duration_seconds_bucket[5m])
```

### Memory Leak
```bash
# Monitor memory over time
watch 'curl -s http://localhost:9090/metrics | grep gateway_memory'
```

See Network Requirements and Configuration for related settings.