Firewall: Monitoring & Health Checks

Monitor the TruthVouch Governance Gateway with Prometheus metrics, Grafana dashboards, and health check endpoints.

Health Check Endpoints

HTTP Health Endpoint

curl http://localhost:8080/health

Response:

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "embeddings": "ok"
  }
}
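A deployment script can gate on this endpoint before routing traffic. A minimal sketch using only the Python standard library (the `is_healthy` helper and its strictness about individual checks are illustrative, not part of the gateway):

```python
import json
import urllib.request

def is_healthy(payload: dict) -> bool:
    """Healthy only if the overall status is good and every subsystem check passed."""
    return (
        payload.get("status") == "healthy"
        and all(v == "ok" for v in payload.get("checks", {}).values())
    )

def check(url: str = "http://localhost:8080/health") -> bool:
    """Fetch the health endpoint and evaluate it."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return is_healthy(json.load(resp))

# Evaluate the sample response from above without hitting the network
sample = {"status": "healthy", "checks": {"database": "ok", "redis": "ok", "embeddings": "ok"}}
print(is_healthy(sample))  # True
```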

Readiness Probe

curl http://localhost:8080/ready

Response (ready):

{
  "ready": true,
  "checks": {
    "database": "connected",
    "cache": "connected"
  }
}

Prometheus Metrics

Access metrics at:

curl http://localhost:9090/metrics

Request Metrics

# HELP gateway_requests_total Total requests processed
# TYPE gateway_requests_total counter
gateway_requests_total{method="scan",status="success"} 15423
gateway_requests_total{method="scan",status="blocked"} 234
gateway_requests_total{method="scan",status="error"} 12
# HELP gateway_request_duration_seconds Request latency
# TYPE gateway_request_duration_seconds histogram
gateway_request_duration_seconds_bucket{method="scan",le="0.01"} 10000
gateway_request_duration_seconds_bucket{method="scan",le="0.1"} 14900
gateway_request_duration_seconds_bucket{method="scan",le="1"} 15410
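Bucket counts are cumulative, which is how `histogram_quantile()` recovers percentiles. The same linear interpolation can be sketched in plain Python (the `+Inf` bucket count of 15423 is assumed here for illustration; the real exposition always includes one):

```python
import math

def histogram_quantile(q: float, buckets: list) -> float:
    """Approximate a quantile from cumulative Prometheus-style (le, count) buckets,
    using the same linear interpolation as PromQL's histogram_quantile()."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return lower_bound  # quantile falls in the +Inf bucket
            in_bucket = count - lower_count
            frac = (rank - lower_count) / in_bucket if in_bucket else 0.0
            return lower_bound + (le - lower_bound) * frac
        lower_bound, lower_count = le, count
    return lower_bound

# Buckets from the exposition above, plus an assumed +Inf bucket
p95 = histogram_quantile(
    0.95,
    [(0.01, 10000), (0.1, 14900), (1.0, 15410), (math.inf, 15423)],
)
print(p95)  # roughly 0.095 seconds
```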

Scan Metrics

# HELP gateway_pii_detections_total PII detections
# TYPE gateway_pii_detections_total counter
gateway_pii_detections_total{type="ssn"} 45
gateway_pii_detections_total{type="credit_card"} 23
gateway_pii_detections_total{type="email"} 156
# HELP gateway_hallucinations_detected Hallucination detections
# TYPE gateway_hallucinations_detected counter
gateway_hallucinations_detected 234
# HELP gateway_injection_attempts_total Injection attempts
# TYPE gateway_injection_attempts_total counter
gateway_injection_attempts_total{type="prompt_injection"} 67
gateway_injection_attempts_total{type="sql_injection"} 12
# HELP gateway_toxicity_detections_total Toxicity detections
# TYPE gateway_toxicity_detections_total counter
gateway_toxicity_detections_total{category="hate_speech"} 23
gateway_toxicity_detections_total{category="harassment"} 45

Database Metrics

# HELP gateway_database_queries_total Total queries
# TYPE gateway_database_queries_total counter
gateway_database_queries_total{type="scan"} 15423
# HELP gateway_database_query_duration_seconds Query latency
# TYPE gateway_database_query_duration_seconds histogram
gateway_database_query_duration_seconds_bucket{le="0.01"} 10000
gateway_database_query_duration_seconds_bucket{le="0.1"} 14900
# HELP gateway_database_pool_size Current pool size
# TYPE gateway_database_pool_size gauge
gateway_database_pool_size 15

Cache Metrics

# HELP gateway_cache_hits_total Cache hits
# TYPE gateway_cache_hits_total counter
gateway_cache_hits_total{type="embedding"} 8234
# HELP gateway_cache_misses_total Cache misses
# TYPE gateway_cache_misses_total counter
gateway_cache_misses_total{type="embedding"} 1234
# HELP gateway_cache_hit_ratio Cache hit ratio
# TYPE gateway_cache_hit_ratio gauge
gateway_cache_hit_ratio 0.87
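As a sanity check, the exported `gateway_cache_hit_ratio` gauge should agree with the two counters above; a quick sketch:

```python
def hit_ratio(hits: float, misses: float) -> float:
    """Hit ratio from raw counters; 0.0 when the cache has seen no traffic."""
    total = hits + misses
    return hits / total if total else 0.0

# Counter values from the exposition above
print(round(hit_ratio(8234, 1234), 2))  # 0.87, matching the gauge
```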

Grafana Dashboards

Import Dashboard

# Download dashboard JSON
curl -o gateway-dashboard.json \
https://grafana.com/api/dashboards/18234/revisions/2/download
# Import via UI: Grafana → Dashboards → Import → Upload JSON

Key Panels

Request Rate (requests/sec):

rate(gateway_requests_total[1m])

Scan Duration (percentiles):

histogram_quantile(0.95, rate(gateway_request_duration_seconds_bucket[5m]))

Error Rate:

rate(gateway_requests_total{status="error"}[1m]) / rate(gateway_requests_total[1m])

PII Detections (last hour):

sum(increase(gateway_pii_detections_total[1h]))

Cache Hit Ratio:

gateway_cache_hits_total / (gateway_cache_hits_total + gateway_cache_misses_total)

Kubernetes Health Checks

Liveness Probe (gRPC)

livenessProbe:
  grpc:
    port: 50052
  initialDelaySeconds: 30
  periodSeconds: 10

Readiness Probe (HTTP)

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

Alerting Rules

Example Prometheus Rules

groups:
  - name: gateway
    rules:
      - alert: HighErrorRate
        expr: |
          rate(gateway_requests_total{status="error"}[5m]) /
          rate(gateway_requests_total[5m]) > 0.05
        for: 5m
        annotations:
          summary: "Gateway error rate > 5%"
      - alert: SlowScan
        expr: |
          histogram_quantile(0.95, rate(gateway_request_duration_seconds_bucket[5m])) > 5
        for: 10m
        annotations:
          summary: "95th percentile scan latency > 5s"
      - alert: DatabaseConnectionFailed
        expr: |
          gateway_database_pool_size == 0
        for: 1m
        annotations:
          summary: "Gateway cannot connect to database"
      - alert: HighPIIDetectionRate
        expr: |
          rate(gateway_pii_detections_total[5m]) > 10
        for: 5m
        annotations:
          summary: "PII detection rate > 10/s"
      - alert: CacheHitRatioDegraded
        expr: |
          gateway_cache_hit_ratio < 0.5
        for: 10m
        annotations:
          summary: "Cache hit ratio degraded below 50%"

Logs

Log Levels

Configure in config.yaml:

logging:
  level: INFO  # DEBUG, INFO, WARN, ERROR

Log Format

Example JSON log:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "INFO",
  "message": "Scan completed",
  "request_id": "req-abc123",
  "duration_ms": 45,
  "detections": {
    "pii": 1,
    "injection": 0,
    "toxicity": 0
  },
  "result": "allowed"
}
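Because each entry is structured JSON, ad-hoc triage is straightforward. A sketch that tallies scan results and detection counts from JSON-lines log input (field names follow the example above; the `summarize` helper is illustrative):

```python
import json
from collections import Counter

def summarize(log_lines):
    """Tally scan results and total detections from JSON-per-line gateway logs."""
    results, detections = Counter(), Counter()
    for line in log_lines:
        entry = json.loads(line)
        results[entry.get("result", "unknown")] += 1
        for kind, count in entry.get("detections", {}).items():
            detections[kind] += count
    return results, detections

lines = [
    '{"result": "allowed", "detections": {"pii": 1, "injection": 0}}',
    '{"result": "blocked", "detections": {"pii": 2, "injection": 1}}',
]
results, detections = summarize(lines)
print(dict(results))     # {'allowed': 1, 'blocked': 1}
print(dict(detections))  # {'pii': 3, 'injection': 1}
```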

Log Aggregation

Send logs to external service (ELK, Datadog, Splunk):

logging:
  # Syslog output
  syslog_enabled: true
  syslog_host: logs.company.local
  syslog_port: 514
  # JSON structured logging
  json_output: true

Distributed Tracing

OpenTelemetry Integration

# In config.yaml
observability:
  tracing_enabled: true
  otel_endpoint: http://jaeger-collector:14268
  sample_rate: 0.1  # 10% sampling

View Traces

Access Jaeger UI:

http://localhost:16686

Performance Tuning

Monitor Key Metrics

# Watch in real-time
watch 'curl -s http://localhost:9090/metrics | grep gateway_request_duration_seconds_bucket'

Scaling Indicators

Scale up when:

  • Request latency (p95) > 1 second
  • Error rate > 5%
  • Database pool exhausted

Scale down when:

  • CPU utilization < 30%
  • Memory utilization < 40%
  • Request rate trending down
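The thresholds above can be encoded in a simple autoscaling policy. A sketch (the thresholds mirror the lists above; the function itself is illustrative, and the "request rate trending down" signal is omitted since it needs time-series history):

```python
def scaling_decision(p95_latency_s: float, error_rate: float,
                     pool_exhausted: bool, cpu_util: float, mem_util: float) -> str:
    """Return 'up', 'down', or 'hold' based on the scaling indicators above."""
    # Scale up on any pressure signal
    if p95_latency_s > 1.0 or error_rate > 0.05 or pool_exhausted:
        return "up"
    # Scale down only when both CPU and memory are underutilized
    if cpu_util < 0.30 and mem_util < 0.40:
        return "down"
    return "hold"

print(scaling_decision(1.5, 0.01, False, 0.5, 0.5))  # up
print(scaling_decision(0.2, 0.01, False, 0.2, 0.3))  # down
print(scaling_decision(0.2, 0.01, False, 0.5, 0.5))  # hold
```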

Custom Dashboards

Node.js Monitoring

const promClient = require('prom-client');

// Create a custom histogram metric
const scanLatency = new promClient.Histogram({
  name: 'custom_scan_latency',
  help: 'Custom scan latency',
  buckets: [0.01, 0.1, 0.5, 1]
});

// Record an observation (duration in seconds, measured by your scan code)
scanLatency.observe(duration);

Python Monitoring

from prometheus_client import Histogram

scan_duration = Histogram(
    'custom_scan_duration_seconds',
    'Scan duration',
    buckets=[0.01, 0.1, 0.5, 1]
)

# Time a block of scan code with the histogram's context manager
with scan_duration.time():
    pass  # scan code here

Troubleshooting

Metrics Not Appearing

# Check Prometheus connectivity
curl -I http://gateway:9090/metrics
# Check metrics are being generated
docker exec gateway-container curl http://localhost:9090/metrics

High Latency

# Find slow queries
histogram_quantile(0.95, rate(gateway_request_duration_seconds_bucket[5m]))
# Check database performance
rate(gateway_database_query_duration_seconds_bucket[5m])

Memory Leak

# Monitor memory over time
watch 'curl -s http://localhost:9090/metrics | grep gateway_memory'

See Network Requirements and Configuration for related settings.