Resilience: helvia-rag-pipelines
Error handling and retry patterns for this service. Platform-wide patterns: see docs/architecture/resilience.md
HTTP Retry
- Library: httpx + backoff decorator (Python)
- Attempts: Varies by call type (see the table and sketch below)
- Backoff: Constant or exponential depending on call type
- On failure: Exception raised after retries exhausted
| Call type | Max attempts | Backoff | Trigger |
|---|---|---|---|
| LLM API calls | 5 | Constant 1s | httpx.HTTPError |
| Semantic search indexing | 3 (Exception) + 5 (httpx.HTTPError) | Constant 0.5s | Exception, httpx.HTTPError |
| Translation | 5 | Constant 1s | Exception (in translation_service.py) |
| Pipeline operations | 3 | Exponential | Exception |
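As a rough sketch of how these retry settings could be expressed with the `backoff` decorator — the function names, URLs, and payloads below are placeholders, not the service's actual code:

```python
# Illustrative sketch only; endpoints and function names are placeholders.
import backoff
import httpx


# LLM API calls: up to 5 attempts, constant 1s wait, retried on httpx.HTTPError.
@backoff.on_exception(backoff.constant, httpx.HTTPError, max_tries=5, interval=1)
async def call_llm(client: httpx.AsyncClient, payload: dict) -> dict:
    response = await client.post("https://llm.example.com/v1/chat", json=payload)
    response.raise_for_status()  # 4xx/5xx become httpx.HTTPStatusError, a subclass of HTTPError
    return response.json()


# Semantic search indexing: stacked decorators give 3 attempts on generic
# Exception and 5 on httpx.HTTPError, both with a constant 0.5s wait.
@backoff.on_exception(backoff.constant, Exception, max_tries=3, interval=0.5)
@backoff.on_exception(backoff.constant, httpx.HTTPError, max_tries=5, interval=0.5)
async def index_document(client: httpx.AsyncClient, doc: dict) -> None:
    response = await client.put("https://search.example.com/index", json=doc)
    response.raise_for_status()
```

Note that with stacked decorators the inner one handles HTTP errors first; only when its retry budget is exhausted does the outer, generic-Exception decorator see the failure.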
Queue Retry
Not applicable. No queue consumers in this service.
Timeouts
| Call | Timeout | Configured in |
|---|---|---|
| LLM API calls | 30s | httpx client config |
| Translation calls | 10s | httpx client config |
| HTTP connection pool | max_connections=1000, max_keepalive_connections=20, keepalive_expiry=5s | httpx pool config |
| Vector DB operations (Qdrant/Milvus) | None (no explicit timeout) | Vector DB client |
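A minimal sketch of how these values map onto httpx client construction; the client variable names are illustrative, and in current httpx the keep-alive limit is spelled `max_keepalive_connections`:

```python
import httpx

# LLM client: 30s total timeout, shared connection pool limits.
llm_client = httpx.AsyncClient(
    timeout=httpx.Timeout(30.0),
    limits=httpx.Limits(
        max_connections=1000,          # total concurrent connections
        max_keepalive_connections=20,  # idle connections kept alive
        keepalive_expiry=5.0,          # seconds before an idle connection is dropped
    ),
)

# Translation client: tighter 10s timeout.
translation_client = httpx.AsyncClient(timeout=httpx.Timeout(10.0))
```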
Circuit Breakers
None implemented.
Fallback Strategy
| Failure scenario | Behaviour | User impact |
|---|---|---|
| LLM cache lookup fails | Falls back to direct LLM call ("Fallback to no cache") | Slightly higher latency, no user-visible impact |
| Embedding cache fails | Multi-tier fallback: Redis async, Redis sync, in-memory | Transparent to user |
| Translation provider fails | Round-robin provider selection via NLPProviderService | Next provider used, but selection is not failure-aware |
| LLM permanent failure (all retries exhausted) | Exception raised | Pipeline step fails, caller receives error |
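The embedding-cache fallback chain could look roughly like the sketch below. Class, method, and key names are assumptions; only the tier order (async Redis, sync Redis, in-memory) is taken from the table above:

```python
import redis
import redis.asyncio as aredis


class EmbeddingCache:
    """Illustrative multi-tier cache: async Redis -> sync Redis -> in-memory."""

    def __init__(self, redis_url: str) -> None:
        self._async = aredis.from_url(redis_url)
        self._sync = redis.from_url(redis_url)
        self._memory: dict[str, bytes] = {}

    async def get(self, key: str) -> bytes | None:
        # Tier 1: async Redis (normal path).
        try:
            value = await self._async.get(key)
            if value is not None:
                return value
        except Exception:
            pass  # fall through to the next tier
        # Tier 2: sync Redis (blocking, but keeps the lookup alive).
        try:
            value = self._sync.get(key)
            if value is not None:
                return value
        except Exception:
            pass
        # Tier 3: process-local memory; transparent to the caller.
        return self._memory.get(key)
```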
Health Check
- Endpoint: None exposed for orchestration probes
- Startup checks: DB connection, Alembic migrations, Vector DB connectivity, collection setup (lifespan events)
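Assuming the service is a FastAPI app (implied by the lifespan events), the startup checks hook in roughly as below. The check functions are placeholders, and the /health route is included only to illustrate the probe endpoint that is currently missing (see Known Gaps):

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI


async def check_db_and_migrations() -> None: ...  # placeholder: DB connection + Alembic
async def check_vector_db() -> None: ...           # placeholder: Qdrant/Milvus connectivity
async def ensure_collections() -> None: ...        # placeholder: collection setup


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup checks run once, before the app starts serving traffic.
    await check_db_and_migrations()
    await check_vector_db()
    await ensure_collections()
    yield


app = FastAPI(lifespan=lifespan)


# Not currently exposed (see Known Gaps); a liveness/readiness probe would look like this.
@app.get("/health")
async def health() -> dict:
    return {"status": "ok"}
```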
Known Gaps
- No timeout on Vector DB operations (Qdrant/Milvus), so a hung vector store can block indefinitely
- No /health endpoint for Kubernetes liveness/readiness probes (startup checks exist but are not exposed)
- No circuit breaker for LLM providers, so repeated calls to a degraded provider consume the full retry budget before moving on
- Provider selection (NLPProviderService) uses round-robin, not failure-aware routing. A failing provider keeps receiving traffic
- LLM call retries use a constant 1s backoff instead of exponential backoff, which can amplify load on a stressed provider (possible mitigations for this and the missing vector DB timeout are sketched below)
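A hedged sketch of two of the smaller fixes above: an explicit Qdrant client timeout and exponential backoff for LLM calls. Parameter values are illustrative, not recommendations:

```python
import backoff
import httpx
from qdrant_client import QdrantClient

# Qdrant client with an explicit timeout so a hung vector store fails fast
# instead of blocking indefinitely (value is illustrative).
qdrant = QdrantClient(url="http://qdrant:6333", timeout=10)


# Exponential backoff (1s, 2s, 4s, ... with jitter) for LLM calls, instead of
# the current constant 1s wait; capped at 5 tries or 60s of total waiting.
@backoff.on_exception(backoff.expo, httpx.HTTPError, max_tries=5, max_time=60)
async def call_llm(client: httpx.AsyncClient, payload: dict) -> dict:
    response = await client.post("https://llm.example.com/v1/chat", json=payload)
    response.raise_for_status()
    return response.json()
```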