Health & Monitoring

Monitor system health, check service availability, and diagnose issues through the health check endpoints. The detailed health check calls MainOrchestrator.health_check() to verify all pipelines and services. These endpoints are intended for production deployments and automated monitoring.

GET /api/v1/health - Detailed Health Check

Comprehensive health check with detailed service information. Routes through MainOrchestrator.health_check() to verify all pipelines and services.

GET /health - Root Health Check

Basic health check for overall system status. No authentication required.

Request

curl -X GET "https://your-api.com/health"

Response

{
  "status": "ok",
  "system": "faith-agentic-kg-rag",
  "version": "2.0.0",
  "timestamp": "2024-01-15T14:30:00Z",
  "uptime_seconds": 86400,
  "pipelines": {
    "ingestion": "healthy",
    "retrieval": "healthy",
    "prediction": "healthy"
  },
  "services": {
    "neo4j": "connected",
    "qdrant": "connected",
    "sql": "connected",
    "redis": "connected"
  }
}

Response Fields

| Field | Type | Description |
|---|---|---|
| status | string | Overall system status: "ok", "degraded", or "down" |
| system | string | System identifier |
| version | string | API version number |
| uptime_seconds | integer | System uptime in seconds |
| pipelines | object | Status of each processing pipeline |
| services | object | Connection status for backend services |
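As a minimal sketch of how a client might read these fields, the helper below lists every pipeline and service that is not in its expected state. It assumes the values "healthy" and "connected" from the example response are the only passing states; the function name `find_unhealthy` is illustrative and not part of the API.

```python
from typing import Any


def find_unhealthy(health: dict[str, Any]) -> dict[str, list[str]]:
    """Return the pipelines and services that are not in their expected state.

    Assumption: "healthy" (pipelines) and "connected" (services) are the
    only passing values, as in the example response.
    """
    pipelines = health.get("pipelines", {})
    services = health.get("services", {})
    return {
        "pipelines": sorted(n for n, s in pipelines.items() if s != "healthy"),
        "services": sorted(n for n, s in services.items() if s != "connected"),
    }


# Payload shaped like the example response, with Redis disconnected.
sample = {
    "status": "degraded",
    "pipelines": {"ingestion": "healthy", "retrieval": "healthy", "prediction": "healthy"},
    "services": {"neo4j": "connected", "qdrant": "connected", "sql": "connected", "redis": "disconnected"},
}
problems = find_unhealthy(sample)
```

Here `problems` names only the Redis service, which is the kind of detail an alert message can include.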

Agent Interaction

API Request → MainOrchestrator.health_check()
Check all pipelines:
├─→ UnifiedIngestionPipeline status
├─→ MultiKGRetrievalPipeline status
├─→ KnowledgeGroundingPipeline status
├─→ Qdrant connectivity
├─→ LLM service status
└─→ Memory managers status
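The fan-out above can be pictured with a small sketch. This is not the actual MainOrchestrator code, which is not shown here; it only illustrates the general pattern of running several named checks and reporting a failing or crashing check as unhealthy instead of aborting the whole health check. The names `run_checks` and `failing_check` are hypothetical.

```python
from typing import Callable


def run_checks(checks: dict[str, Callable[[], bool]]) -> dict[str, str]:
    """Run each named check and report "healthy" or "unhealthy".

    A check that raises is recorded as unhealthy so that one broken
    component does not hide the status of the others.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "healthy" if check() else "unhealthy"
        except Exception:
            results[name] = "unhealthy"
    return results


def failing_check() -> bool:
    # Stand-in for a connectivity probe that cannot reach its backend.
    raise ConnectionError("qdrant unreachable")


report = run_checks({
    "ingestion": lambda: True,
    "qdrant": failing_check,
})
```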

Query Parameters

| Parameter | Type | Description |
|---|---|---|
| detailed | boolean | Include detailed service diagnostics (default: false) |

GET /metrics - System Metrics

Prometheus-format metrics for system monitoring.

Request

curl -X GET "https://your-api.com/metrics"

Response

Returns Prometheus exposition format with route availability, service status, and timestamps.
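For quick inspection outside a Prometheus server, the text format can be read with a few lines of Python. The sketch below is a simplified parser for plain sample lines (name, optional labels, value, optional timestamp); it skips comment lines and does not cover every corner of the exposition format. The metric name `service_up` is a made-up example, since the actual metric names exposed by /metrics are not listed here.

```python
import re

# One sample per line: name{labels} value [timestamp]; the label block is optional.
SAMPLE_LINE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?P<labels>\{.*\})?"
    r"\s+(?P<value>\S+)"
    r"(?:\s+(?P<timestamp>-?\d+))?$"
)


def parse_samples(text: str) -> list[tuple[str, str, float]]:
    """Return (metric name, raw label block, value) for each sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match:
            samples.append((match["name"], match["labels"] or "", float(match["value"])))
    return samples


# Hypothetical /metrics output used only for illustration.
example = """\
# HELP service_up Hypothetical gauge: 1 if the service is reachable.
# TYPE service_up gauge
service_up{service="neo4j"} 1
service_up{service="redis"} 0
"""
parsed = parse_samples(example)
```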

Visualization Service Health

Check the health of the visualization generation service (EnhancedKGVisualizer).

GET /api/visualizations/health

curl -X GET "https://your-api.com/api/visualizations/health"

Response

{
  "status": "healthy",
  "service": "kg_visualization",
  "directory_writable": true,
  "available_kgs": 2,
  "total_visualizations": 8,
  "last_generation": "2024-01-15T10:30:00Z",
  "disk_space_available_gb": 50,
  "timestamp": "2024-01-15T14:30:00Z"
}
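A monitoring script might turn this response into a list of warnings. The sketch below is one way to do that; the 5 GB free-space threshold and the function name `visualization_warnings` are assumptions chosen for illustration, not values defined by the service.

```python
def visualization_warnings(health: dict, min_free_gb: float = 5.0) -> list[str]:
    """Return human-readable warnings derived from the visualization health payload."""
    warnings = []
    if health.get("status") != "healthy":
        warnings.append("service status is not healthy")
    if not health.get("directory_writable", False):
        warnings.append("output directory is not writable")
    if health.get("disk_space_available_gb", 0) < min_free_gb:
        warnings.append("low disk space")
    return warnings


# Payload copied from the example response (trimmed to the fields used here).
sample_ok = {
    "status": "healthy",
    "directory_writable": True,
    "disk_space_available_gb": 50,
}
warnings_ok = visualization_warnings(sample_ok)
```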

Health Status Interpretation

Status Values

| Status | Meaning | Action Required |
|---|---|---|
| healthy (reported as "ok" by GET /health) | All systems operational | None - system is working normally |
| degraded | Some services impaired | Monitor closely, investigate non-critical issues |
| down | Critical failure | Immediate action required |

Service-Specific Status

| Service | Critical? | Impact if Down |
|---|---|---|
| Neo4j | Yes | No KG queries or ingestion possible |
| Qdrant | Yes | No vector search or retrieval |
| SQL Database | Yes | No metadata or source tracking |
| Redis | No | Reduced performance (no caching) |
| Visualization | No | Visualizations unavailable |
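The two tables above suggest a simple rule for combining per-service state into an overall status: any critical service down means "down", any non-critical service down means "degraded", otherwise "healthy". The sketch below encodes that rule as an illustration; whether the server aggregates status exactly this way is an assumption, and treating unknown services as critical is a conservative choice made here.

```python
# Criticality taken from the Service-Specific Status table.
CRITICAL = {
    "neo4j": True,
    "qdrant": True,
    "sql": True,
    "redis": False,
    "visualization": False,
}


def overall_status(service_up: dict[str, bool]) -> str:
    """Derive "healthy", "degraded", or "down" from per-service availability."""
    down = [name for name, up in service_up.items() if not up]
    # Services missing from the table are treated as critical (assumption).
    if any(CRITICAL.get(name, True) for name in down):
        return "down"
    if down:
        return "degraded"
    return "healthy"


status_without_cache = overall_status(
    {"neo4j": True, "qdrant": True, "sql": True, "redis": False}
)
```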

Monitoring Integration

Prometheus Metrics Export

from prometheus_client import Gauge, generate_latest
import requests

# Define metrics
system_status = Gauge('drip_system_status', 'System health status (1=healthy, 0=unhealthy)')
service_status = Gauge('drip_service_status', 'Service status', ['service'])
response_time = Gauge('drip_response_time_ms', 'Service response time', ['service'])


def update_metrics():
    """Update Prometheus metrics from the detailed health endpoint."""
    response = requests.get(
        "https://your-api.com/api/v1/health?detailed=true",
        headers={"X-API-Key": "your-api-key"},
        timeout=10,  # do not let a hung API block the exporter
    )
    data = response.json()

    # System status
    system_status.set(1 if data["overall_status"] == "healthy" else 0)

    # Service status and response times
    for service in data["services"]:
        status_value = 1 if service["status"] == "healthy" else 0
        service_status.labels(service=service["service_name"]).set(status_value)
        response_time.labels(service=service["service_name"]).set(service["response_time_ms"])


# Export metrics endpoint
def metrics_endpoint():
    update_metrics()
    return generate_latest()

Docker Health Check

FROM python:3.10
# ... your Dockerfile content ...
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

Kubernetes Liveness Probe

apiVersion: v1
kind: Pod
metadata:
  name: drip-api
spec:
  containers:
    - name: drip
      image: your-drip-image:latest
      livenessProbe:
        httpGet:
          path: /health
          port: 8000
          httpHeaders:
            - name: X-API-Key
              value: your-api-key
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /api/v1/health
          port: 8000
          httpHeaders:
            - name: X-API-Key
              value: your-api-key
        initialDelaySeconds: 10
        periodSeconds: 5

Best Practices

  • Regular checks: Poll health endpoints every 30-60 seconds
  • Timeout handling: Set appropriate timeouts (5-10 seconds)
  • Retry logic: Implement exponential backoff for transient failures
  • Alerting: Set up alerts for degraded/down status
  • Dashboard: Display real-time health status in admin UI
  • Logging: Log all health check failures for debugging
  • Graceful degradation: Keep serving requests that do not depend on a failed non-critical service (for example, continue without caching when Redis is down)
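The timeout and retry advice above can be sketched as a small helper. This is one possible shape, not a prescribed client: `check_with_backoff` and its parameters are illustrative names, and the fetch function is passed in so the retry logic can be exercised without a network call.

```python
import time
from typing import Callable


def check_with_backoff(
    fetch: Callable[[], dict],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch(), retrying with exponential backoff on failure.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
    and re-raises the last error once all retries are used up.
    """
    if retries < 0:
        raise ValueError("retries must be >= 0")
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


# Simulated endpoint that fails twice before answering.
attempts = {"count": 0}
delays: list[float] = []


def flaky_fetch() -> dict:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return {"status": "ok"}


# Record the delays instead of sleeping so the example runs instantly.
result = check_with_backoff(flaky_fetch, sleep=delays.append)
```

In a real poller, `fetch` could wrap an HTTP call such as `requests.get(url, timeout=5).json()`, with the 5-second timeout taken from the range suggested above.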

Troubleshooting

Service Connection Failed

Problem: neo4j service shows "disconnected"
Solution:
1. Check that the Neo4j container is running
2. Verify network connectivity
3. Check credentials in environment variables
4. Review Neo4j logs for errors
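For step 2, a plain TCP connection attempt is often enough to separate network problems from credential problems. The helper below is a generic sketch using the standard library; `can_connect` is an illustrative name, and the host and port you pass depend on your deployment.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, `can_connect("localhost", 7687)` probes Neo4j's default Bolt port, assuming the database runs locally with default settings. A `True` result with the health check still reporting "disconnected" points toward credentials or the database itself instead of the network.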

High Response Times

Problem: response_time_ms > 1000ms for multiple services
Solution:
1. Check system resources (CPU, memory)
2. Review database query performance
3. Check for connection pool exhaustion
4. Consider scaling up resources
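To see which services cross the 1000 ms line, the detailed health payload can be filtered directly. The sketch below assumes the payload shape used in the Prometheus example (a `services` list whose entries carry `service_name` and `response_time_ms`); the function name `slow_services` is illustrative.

```python
def slow_services(detailed: dict, threshold_ms: float = 1000.0) -> list[str]:
    """Return the names of services whose response time exceeds the threshold."""
    return [
        service["service_name"]
        for service in detailed.get("services", [])
        if service.get("response_time_ms", 0) > threshold_ms
    ]


# Illustrative payload; the timings are made up.
sample_detailed = {
    "overall_status": "degraded",
    "services": [
        {"service_name": "neo4j", "status": "healthy", "response_time_ms": 1450},
        {"service_name": "qdrant", "status": "healthy", "response_time_ms": 35},
        {"service_name": "sql", "status": "healthy", "response_time_ms": 1200},
    ],
}
slow = slow_services(sample_detailed)
```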

Degraded Status

Problem: overall_status shows "degraded"
Solution:
1. Check detailed health endpoint for specific service issues
2. Review service logs
3. Test individual service connections
4. Check recent deployments for breaking changes