# Error Handling & Resilience

How the platform handles failures. Last updated: 2026-03-19

## Summary
| Pattern | Present? | Services |
|---|---|---|
| HTTP retry | Partial | hbf-core-api (shared lib, inherited by consumers), hbf-nlp (Azure only), helvia-rag-pipelines (backoff decorator), hbf-client-integrations (got retry), hbf-broadcast (Slack/Facebook), hbf-bot (401 refresh only), semantic-doc-segmenter (OpenAI SDK built-in, max_retries=20), hbf-data-manager (got retry on all methods), hbf-knowledge-manager (hbf-core-api inherited) |
| Queue retry | No | None |
| Kafka consumer retry | Yes | hbf-data-manager (ts-retry-promise, 3 attempts, 1s fixed delay per message) |
| Timeouts | Partial | hbf-core (per-client), hbf-bot (Facebook/Generic), hbf-session-manager, hbf-lcm, hbf-client-integrations, helvia-rag-pipelines (LLM/Translation), hbf-webchat (Direct Line), semantic-doc-segmenter (OpenAI 120-200s, callback 5s, job 600s), open-bot-framework (bot endpoint 5s), hbf-data-manager (HTTP 5s, Kafka session 45s), hbf-lcg (Redis microservice response/heartbeat 5s each) |
| Circuit breaker | No | None |
| Health checks | Partial | hbf-core (/actuator/health), hbf-lcm (/health, shallow), hbf-event-publisher (/ and /health, shallow), hbf-media-manager (/health, shallow), semantic-doc-segmenter (/debug/health, shallow), hbf-data-manager (/ and /health, shallow), hbf-lcg (none) |
| Fallback / graceful degradation | Partial | helvia-rag-pipelines (cache fallback), hbf-stats (UTC default), hbf-reports (graph failure tolerance), hbf-console (401 redirect), semantic-doc-segmenter (empty tags on tagging failure), open-bot-framework (Redis → in-memory fallback), hbf-lcg (leader election failover, GatewayCleaner session expiry) |
| Kafka retry | Partial | hbf-bot (producer only, exponential backoff), hbf-data-manager (consumer message handler, fixed delay) |
| Graceful shutdown | Partial | semantic-doc-segmenter (FastAPI lifespan, but does not drain in-flight jobs), hbf-data-manager (OnApplicationShutdown disconnects all consumers). hbf-data-retention explicitly lacks it. |
## HTTP Retry (hbf-core-api)

The hbf-core-api library is a shared axios wrapper whose retry behaviour is inherited by hbf-notifications, hbf-data-retention, hbf-stats, hbf-reports, hbf-knowledge-manager, and any other service that depends on it.

- Library: axios v1.8.3
- Attempts: 3 total (1 initial + 2 retries)
- Backoff: Exponential with jitter: `(e^attempt - 2*random) * 1000ms`
- Retry condition: Network errors only (no server response received). HTTP 4xx and 5xx responses are returned immediately and are NOT retried.
- On permanent failure: Returns `HBFCoreApiResponse` with status 503.
- Timeout: Not set. The axios default is no timeout, meaning requests can hang indefinitely.
- Circuit breaker: None.

**Limitation:** The network-error-only retry policy means transient upstream errors (502, 503, 429) are never retried. A brief upstream restart causes an immediate failure for the caller.
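The behaviour above can be summarised in a short TypeScript sketch. This is illustrative only, assuming a plain axios instance; `withRetry`, `MAX_ATTEMPTS`, and `backoffMs` are hypothetical names, not hbf-core-api's actual internals.

```typescript
import axios, { AxiosRequestConfig } from "axios";

const MAX_ATTEMPTS = 3; // 1 initial call + 2 retries, as in hbf-core-api

// Backoff from the formula above: (e^attempt - 2*random) * 1000ms
function backoffMs(attempt: number): number {
  return Math.max(0, (Math.exp(attempt) - 2 * Math.random()) * 1000);
}

// Retries only when no response was received (network error); any HTTP status,
// including 5xx, is returned to the caller immediately.
async function withRetry<T>(config: AxiosRequestConfig): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      const res = await axios.request<T>(config);
      return res.data;
    } catch (err) {
      const isNetworkError = axios.isAxiosError(err) && !err.response;
      if (!isNetworkError || attempt >= MAX_ATTEMPTS) {
        throw err; // hbf-core-api instead maps this to a 503 HBFCoreApiResponse
      }
      await new Promise((resolve) => setTimeout(resolve, backoffMs(attempt)));
    }
  }
}
```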
### Per-service retry beyond hbf-core-api
| Service | Library | Retry Details |
|---|---|---|
| hbf-nlp | axios (direct) | Custom retryWithBackoff: 3 attempts, fixed 2000ms delay, 5xx only (Azure client). Raw axios calls have no retry. |
| helvia-rag-pipelines | httpx + backoff | LLM calls: 5 retries, constant 1s on httpx.HTTPError. Semantic search: 3 retries on Exception + 5 on HTTPError, 0.5s constant. Translation: 5 retries, 1s constant. Pipeline ops: 3 retries, exponential. |
| hbf-client-integrations | got | Retry enabled on GET, POST, PUT, PATCH, DELETE. Default 3 attempts. |
| hbf-broadcast | axios / request | Slack: 3 attempts, exponential backoff on 5xx. Facebook: 5 manual retries (deprecated request lib). App logic: 3 retries, exponential. |
| hbf-bot | axios | Microsoft Bot Framework: 1 retry on 401 (token refresh). Slack API: retries defaults to 0. RAG pipelines: no retry, silently returns undefined. |
| hbf-session-manager | got | trainOne(): manual retry loop, configurable attempts, no backoff delay. |
| hbf-data-manager | got | GET/POST/PUT/PATCH/DELETE all have got retry enabled (got default: 2 attempts). getBuffer() has no retry. |
| semantic-doc-segmenter | OpenAI SDK (AsyncOpenAI) | max_retries=20, timeout 120s. SDK-managed exponential backoff. Gemini calls: no retry. Callback POST: no retry, 5s timeout. |
### Services with NO HTTP retry (beyond hbf-core-api inheritance)
| Service | Notes |
|---|---|
| hbf-core | No HTTP retry on outbound calls. Spring @Retryable is MongoDB-only. |
| hbf-lcm | got, retry not enabled. |
| hbf-event-publisher | got, no retry, no timeout. |
| hbf-media-manager | got, no retry, no timeout. |
| hbf-reports | got, no retry. Uses hbf-core-api for some calls. |
| hbf-webchat | Retry delegated to botframework-webchat SDK. |
| hbf-console | Fetch API + XHR, zero retry logic. |
| open-bot-framework | @nestjs/axios HttpService, no retry. 5s timeout on bot endpoint POST only. |
| hbf-knowledge-manager | Uses hbf-core-api (retry inherited). No additional HTTP retry. Azure Blob SDK handles its own internal retries. |
## Queue Resilience

### hbf-bot (Kafka)
- Kafka producer: exponential backoff (300ms initial, multiplier 2, max 30000ms, 5 retries).
- Consumer-side retry behavior not documented in the codebase.
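The producer values above map onto kafkajs retry options (they are also the library's defaults). A minimal sketch, with the client id and broker list as placeholders:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({
  clientId: "hbf-bot",         // placeholder
  brokers: ["localhost:9092"], // placeholder
});

// Producer retry matching the values above.
const producer = kafka.producer({
  retry: {
    initialRetryTime: 300, // first retry after 300ms
    multiplier: 2,         // exponential growth factor
    maxRetryTime: 30000,   // individual delays capped at 30s
    retries: 5,            // give up after 5 retries
  },
});
```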
### hbf-data-manager (Kafka consumer)

- Library: kafkajs + ts-retry-promise v0.8.1
- Consumer group: `hbf-data-manager-consumer` (configurable via `KAFKA_GROUP_ID`)
- Topic: `interaction-metadata` (configurable via `KAFKA_TOPICS`, comma-separated)
- Per-message retry: 3 attempts, fixed 1000ms delay (no exponential backoff)
- On permanent failure (3 attempts exhausted): error logged, message dropped. No DLQ.
- Broker connect retry: recursive with 10s sleep; no upper bound, no alerting.
- Graceful shutdown: `OnApplicationShutdown` disconnects all consumers with per-consumer error catching.
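A minimal sketch of this per-message retry, assuming kafkajs with ts-retry-promise; `handleMessage` and the connection details are placeholders, not hbf-data-manager's actual code:

```typescript
import { Kafka } from "kafkajs";
import { retry } from "ts-retry-promise";

const kafka = new Kafka({ clientId: "hbf-data-manager", brokers: ["localhost:9092"] }); // placeholders
const consumer = kafka.consumer({ groupId: "hbf-data-manager-consumer" });

// Placeholder for the real persistence logic.
async function handleMessage(value: Buffer | null): Promise<void> {}

async function main(): Promise<void> {
  await consumer.connect();
  await consumer.subscribe({ topic: "interaction-metadata" });
  await consumer.run({
    eachMessage: async ({ message }) => {
      try {
        // 3 retry attempts with a fixed 1000ms delay between them.
        await retry(() => handleMessage(message.value), { retries: 3, delay: 1000 });
      } catch (err) {
        // Retries exhausted: the message is only logged, then dropped (no DLQ).
        console.error("interaction-metadata message dropped", err);
      }
    },
  });
}

main();
```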
## Timeouts
| Service | Call | Timeout | Notes |
|---|---|---|---|
| hbf-core | NotificationServiceClient | 2000ms | Hardcoded |
| hbf-core | DataManagerClient | 5000ms | Hardcoded |
| hbf-core | LanguageToolClient | 5000ms | Hardcoded |
| hbf-core | HelviaNLPSpecificationClient | 120000ms (2min) | Hardcoded |
| hbf-core | HelviaRAGPipelineClient | 120000ms (2min) | Hardcoded |
| hbf-core | HelviaGPTPipelineClient | 120000ms (2min) | Hardcoded |
| hbf-core | AzureAIClient | 350000ms (~6min) | Hardcoded |
| hbf-core | Default HTTP client | 30000ms | Hardcoded |
| hbf-bot | Facebook / Generic dispatch | 10000ms | via got |
| hbf-bot | Slack API, RAG pipelines | None | No timeout set |
| hbf-nlp | Azure polling | NLP_PIPELINE_POLL_TIMEOUT_IN_SECS | Env-configurable |
| hbf-nlp | Raw axios calls | None | No timeout set |
| hbf-session-manager | POST/PATCH | SESSION_SERVICE_REQUEST_TIMEOUT or 5000ms | Env-configurable with default |
| hbf-session-manager | GET | None | May have no timeout |
| hbf-session-manager | NLP polling | NLP_PIPELINE_POLL_TIMEOUT_IN_SECS (360s default) | Poll interval 2s |
| hbf-lcm | Bot callbacks | 3000ms | Hardcoded |
| hbf-lcm | Translation | SERVICE_TRANSLATION_TIMEOUT_SECONDS | Env-configurable |
| hbf-client-integrations | Distributor service | DISTRIBUTER_SERVICE_REQUEST_TIMEOUT or 5000ms | Env-configurable |
| hbf-client-integrations | Client module | 15000ms | Hardcoded |
| hbf-client-integrations | kelly-hbf-interaction | 60000ms | Hardcoded |
| hbf-client-integrations | getBuffer() | None | No timeout on binary fetch |
| helvia-rag-pipelines | LLM calls | 30000ms (default) | Configurable |
| helvia-rag-pipelines | Translation | 10000ms | |
| helvia-rag-pipelines | Vector DB (Qdrant/Milvus) | None | No timeout on vector operations |
| hbf-webchat | Direct Line | 20000ms | SDK-managed |
| hbf-core-api | All calls | None | axios default = no timeout |
| hbf-notifications | HTTP calls | None | No timeout set |
| hbf-broadcast | HTTP calls | None | No timeout set |
| hbf-event-publisher | HTTP calls | None | No timeout, no retry |
| hbf-data-retention | HTTP calls | None | No timeout set |
| hbf-stats | HTTP calls | None | No timeout set |
| hbf-reports | HTTP calls | None | No timeout set |
| hbf-media-manager | HTTP calls | None | No timeout, no retry |
| hbf-console | Fetch/XHR | None | No timeout configured |
| open-bot-framework | Bot endpoint POST | 5000ms | Hardcoded in directline-conversation.service.ts |
| open-bot-framework | Redis connect | 2000ms | connectTimeout in atomic-operations.provider.ts |
| open-bot-framework | S3 upload | None | AWS SDK default |
| hbf-data-manager | HTTP calls (GET/POST/PUT/PATCH/DELETE) | DISTRIBUTER_SERVICE_REQUEST_TIMEOUT or 5000ms | HttpClientService |
| hbf-data-manager | getBuffer() binary fetch | None | No timeout set |
| hbf-data-manager | Kafka session timeout | KAFKA_SESSION_TIMEOUT_MS or 45000ms | KafkaConsumerService |
| hbf-knowledge-manager | hbf-core API calls | None | hbf-core-api axios default |
| hbf-knowledge-manager | Azure Blob download / list | SDK default | @azure/storage-blob internal |
| hbf-lcg | Redis microservice response | MICROSERVICE_RESPONSE_TIMEOUT_MILLIS or 5000ms | Env-configurable; awaits NestJS microservice reply |
| hbf-lcg | Redis microservice heartbeat | MICROSERVICE_HEARTBEAT_TIMEOUT_MILLIS or 5000ms | Env-configurable; controls liveness detection |
| hbf-lcg | Cisco polling interval | 5000ms | Fixed interval; no timeout on individual poll calls |
| hbf-lcg | Genesys inactive session check | 10000ms | Fixed polling interval for inactive check |
| semantic-doc-segmenter | OpenAI/Azure (default) | 120s | Hardcoded |
| semantic-doc-segmenter | OpenAI/Azure (title extraction) | 200s | Hardcoded |
| semantic-doc-segmenter | Callback POST | 5s | Hardcoded |
| semantic-doc-segmenter | Google language detection | 10s | Env: LANGUAGE_DETECT_GOOGLE_TIMEOUT_SECONDS |
| semantic-doc-segmenter | Job execution overall | 600s | Env: JOB_TIMEOUT_SECONDS |
| semantic-doc-segmenter | Gemini API | None | No timeout set |
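Several of the env-configurable timeouts above follow the same pattern. A hedged sketch for the got-based services, reusing the `DISTRIBUTER_SERVICE_REQUEST_TIMEOUT` variable from the table and its documented 5000ms default (the option shape assumes got v11 or later):

```typescript
import got from "got";

// Read the timeout from the environment, falling back to the documented 5000ms default.
const requestTimeoutMs = Number(process.env.DISTRIBUTER_SERVICE_REQUEST_TIMEOUT ?? 5000);

// A shared client with a whole-request timeout and got's default retry limit.
const client = got.extend({
  timeout: { request: requestTimeoutMs },
  retry: { limit: 2 },
});
```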
## Circuit Breakers
No service implements a circuit breaker. There is no mechanism to stop sending requests to a downstream service that is known to be failing. Every service will continue sending requests at full rate during an outage, amplifying load on struggling dependencies and delaying recovery.
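As a reference point for the recommendation later in this document, a minimal sketch of a circuit breaker around an outbound call using opossum; the downstream URL, thresholds, and fallback are illustrative assumptions, not existing platform code:

```typescript
import CircuitBreaker from "opossum";
import axios from "axios";

// The wrapped action: any outbound request to a downstream service.
async function callDownstream(path: string) {
  const res = await axios.get(`http://hbf-core.internal${path}`, { timeout: 5000 });
  return res.data;
}

const breaker = new CircuitBreaker(callDownstream, {
  timeout: 5000,                // treat calls slower than 5s as failures
  errorThresholdPercentage: 50, // open the circuit at 50% failures
  resetTimeout: 30000,          // allow a probe request after 30s
});

// Optional fallback returned while the circuit is open.
breaker.fallback(() => ({ degraded: true }));

// Callers use breaker.fire("/some/path") instead of calling callDownstream directly.
```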
## Fallback Strategies
| Scenario | Fallback Behaviour | User Impact |
|---|---|---|
| helvia-rag-pipelines: embedding cache failure | Falls back to direct LLM call (skips Redis/memory cache) | Increased latency, higher LLM cost. Transparent to user. |
| helvia-rag-pipelines: multi-tier cache | Redis async -> Redis sync -> in-memory. Degrades through tiers on failure. | Gradual latency increase. Transparent to user. |
| hbf-stats: getOrganizationTimezone failure | Defaults to 'UTC' | Stats may show incorrect timezone. Silent, no error surfaced. |
| hbf-stats: updateTenantStats failure | Logs error, does not throw | Stats silently stale. Dashboard shows outdated numbers. |
| hbf-reports: generate_graphs failure | Caught, execution continues | Report generated without graphs. Partial output. |
| hbf-bot: RAG pipeline failure | Silently returns undefined | Bot response may lack RAG-augmented context. No error shown to end user. |
| hbf-broadcast: individual send failure | Uses Promise.allSettled() | One recipient failure does not block others. Partial delivery. |
| hbf-console: 401 response | Checks refresh token, clears storage, redirects to login | User is logged out. Session state lost. |
| hbf-webchat: token expiry | Tracks expiry for reconnection | Automatic reconnection attempt. Brief interruption possible. |
| hbf-core-api: permanent network failure | Returns HBFCoreApiResponse with status 503 | Caller receives structured error. User sees failure depending on caller handling. |
| hbf-core: any unhandled exception | ResponseExceptionHandler (@ControllerAdvice) | Standardized error response. Prevents raw stack traces. |
| semantic-doc-segmenter: article tagging LLM failure | Returns empty tags list, applies "Other" tag | Articles tagged as "Other" instead of meaningful categories |
| semantic-doc-segmenter: callback POST failure | Logged, job proceeds to complete | Caller never receives results; must poll GET endpoint |
| semantic-doc-segmenter: Gemini failure | Exception propagates, job marked FAILED | Job fails entirely if Gemini was selected parser |
| open-bot-framework: Redis unavailable at startup | Falls back to in-memory atomic counter manager | Activity watermarks non-durable; IDs will collide across instances |
| open-bot-framework: bot endpoint unreachable on activity POST | BadRequestException with error message | User receives 400 error |
| open-bot-framework: WebSocket client not registered after 3 retries | Transcript dropped, warning logged | Bot reply silently lost; user never receives message |
| hbf-knowledge-manager: webhook processing error after 200 ACK | Error caught and logged; Event Grid is not retried (ACK was sent) | File change permanently lost with no replay mechanism |
| hbf-knowledge-manager: no integration/KB found for webhook key | Returns early, logs warning/debug | No sync occurs. Silent. |
| hbf-knowledge-manager fullSync: per-file download/upload fails | Error added to SyncResult.errors[], continues to next file | Partial sync; caller receives error list in SyncResult |
| hbf-lcg: leader node failure | Leader election detects missing heartbeat; a follower is promoted automatically | Brief window of no GatewayCleaner execution during re-election |
| hbf-lcg: expired sessions (Zendesk, Genesys) | GatewayCleaner cron (leader-only) deletes sessions past TTL | Sessions silently removed; no user notification |
| hbf-lcg: Genesys WebSocket disconnection | Distributed coordination selects one node to own reconnection; others stand by | Brief message gap during reconnect; no mechanism to recover messages missed during the gap |
| hbf-lcg: pending Genesys session not established within 120s | GatewayCleaner removes pending session entry | Session silently abandoned; caller must retry from scratch |
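As one concrete illustration of the table above, a sketch of the open-bot-framework-style startup fallback from Redis to an in-memory counter; the interface and class names are hypothetical, not the service's actual implementation:

```typescript
import { createClient } from "redis";

// Both backends expose the same tiny interface so callers don't care which is active.
interface AtomicCounter {
  next(key: string): Promise<number>;
}

class RedisCounter implements AtomicCounter {
  constructor(private client: ReturnType<typeof createClient>) {}
  next(key: string): Promise<number> {
    return this.client.incr(key);
  }
}

class InMemoryCounter implements AtomicCounter {
  private counts = new Map<string, number>();
  async next(key: string): Promise<number> {
    const value = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, value); // not durable, not shared across instances
    return value;
  }
}

// At startup: prefer Redis, fall back to memory if the connection fails.
export async function createCounter(url: string): Promise<AtomicCounter> {
  try {
    const client = createClient({ url, socket: { connectTimeout: 2000 } });
    await client.connect();
    return new RedisCounter(client);
  } catch {
    console.warn("Redis unavailable, falling back to in-memory counter");
    return new InMemoryCounter();
  }
}
```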
## Health Checks
| Service | Endpoint | Checks |
|---|---|---|
| hbf-core | /actuator/health | Spring Actuator (checks DB, disk, etc.) |
| hbf-core | /{tenant}/health-check | Tenant-scoped health |
| hbf-lcm | GET /health | Returns empty 200. No dependency checks. |
| hbf-event-publisher | GET / | Returns service name + version |
| hbf-event-publisher | GET /health | Stub, returns void |
| hbf-media-manager | GET /health | Returns 200. No dependency checks. |
| hbf-bot | None | No health endpoint |
| hbf-nlp | None | No health endpoint |
| helvia-rag-pipelines | None | No health endpoint |
| hbf-session-manager | None | No health endpoint |
| hbf-notifications | None | No health endpoint |
| hbf-client-integrations | None | No health endpoint |
| hbf-broadcast | None | No health endpoint |
| hbf-data-retention | None | No health endpoint |
| hbf-stats | None | No health endpoint |
| hbf-reports | None | No health endpoint |
| hbf-webchat | N/A | Frontend widget, not applicable |
| hbf-console | N/A | Frontend SPA, not applicable |
| semantic-doc-segmenter | GET /debug/health | Job counts (pending, processing, failed 24h). No dependency checks. |
| semantic-doc-segmenter | GET /debug/health_with_details | Same + detailed job lists. No dependency checks. |
| open-bot-framework | None | GET / returns "Hello World!". No health endpoint. |
| hbf-data-manager | GET / and GET /health | Returns status, timestamp, and uptime seconds. No dependency checks (no DB or Kafka connectivity check). |
| hbf-knowledge-manager | None | No health endpoint. |
| hbf-lcg | None | No health endpoint. |
18 of 19 backend services lack a health endpoint or have only shallow (no-dependency) checks. Only hbf-core has a meaningful health check via Spring Actuator.
## Gaps & Recommendations
### Critical (service outage risk)

- **No circuit breakers anywhere.** A single downstream failure cascades to all callers. Every service hammers failing dependencies at full rate.
  - Recommendation: Add circuit breakers (e.g., opossum for Node.js, resilience4j for Java) to hbf-core, hbf-bot, and helvia-rag-pipelines as a starting priority. These are the highest-fan-out services.
- **hbf-core-api has no timeout.** The shared HTTP library used across many services has no default timeout. A hung downstream will hold connections and pending requests open indefinitely.
  - Recommendation: Set a default timeout (e.g., 30s) in hbf-core-api. Allow per-call overrides.
- **hbf-core-api only retries on network errors.** Transient HTTP 502/503/429 responses are never retried. Brief restarts or load-balancer errors cause immediate caller failure.
  - Recommendation: Extend retry to cover 502, 503, and 429 (respecting the Retry-After header); a sketch follows this list.
- **hbf-data-retention has no graceful shutdown.** It runs as an infinite-loop daemon. Interruption mid-deletion can leave data in an inconsistent state.
  - Recommendation: Add SIGTERM/SIGINT handlers to complete the current batch before exiting.
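A sketch of the recommended retry-condition extension for hbf-core-api, honouring Retry-After on 429/502/503; the function names are illustrative and the existing library internals may differ:

```typescript
import { AxiosError } from "axios";

const RETRYABLE_STATUSES = new Set([429, 502, 503]);

// Decide whether a failed call should be retried, and after how long (null = fail fast).
function retryDelayMs(err: AxiosError, attempt: number): number | null {
  if (!err.response) return backoff(attempt); // network error: keep current behaviour
  if (!RETRYABLE_STATUSES.has(err.response.status)) return null; // other HTTP errors: fail fast

  // Honour Retry-After when the server provides it (seconds form only in this sketch).
  const retryAfter = Number(err.response.headers["retry-after"]);
  return Number.isFinite(retryAfter) ? retryAfter * 1000 : backoff(attempt);
}

// Same backoff formula as the existing library: (e^attempt - 2*random) * 1000ms.
function backoff(attempt: number): number {
  return Math.max(0, (Math.exp(attempt) - 2 * Math.random()) * 1000);
}
```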
### High (degraded reliability)

- **18 of 19 backend services lack meaningful health checks.** Orchestrators (Kubernetes, load balancers) cannot detect unhealthy instances. Failed services continue receiving traffic.
  - Recommendation: Add /health endpoints that check critical dependencies (DB connectivity, Redis, downstream service reachability). Minimum: readiness and liveness probes for all services. A sketch follows this list.
- **Multiple services have no HTTP timeout.** hbf-notifications, hbf-broadcast, hbf-event-publisher, hbf-media-manager, hbf-stats, hbf-reports, hbf-data-retention, and hbf-nlp (raw calls) can all hang indefinitely on downstream calls.
  - Recommendation: Establish a platform-wide default timeout (e.g., 30s) and require explicit opt-in for longer durations.
- **helvia-rag-pipelines provider selection is round-robin, not failure-aware.** A failing LLM provider continues receiving requests in rotation.
  - Recommendation: Track provider health and skip or deprioritize providers with recent failures.
- **helvia-rag-pipelines has no timeout on vector DB operations.** Qdrant/Milvus queries can hang indefinitely.
  - Recommendation: Set a timeout (e.g., 10s) on all vector DB calls.
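A sketch of the recommended dependency-checking health endpoint for the Node services, using Express; the `checkMongo`/`checkRedis` probes are placeholders for each service's real clients:

```typescript
import express from "express";

const app = express();

// Placeholder probes: each service would wire in its real Mongo/Redis/downstream clients.
async function checkMongo(): Promise<boolean> { return true; }
async function checkRedis(): Promise<boolean> { return true; }

// Readiness-style /health: report dependency status and return 503 when any
// critical dependency is down, so orchestrators stop routing traffic here.
app.get("/health", async (_req, res) => {
  const [mongo, redis] = await Promise.all([checkMongo(), checkRedis()]);
  const healthy = mongo && redis;
  res.status(healthy ? 200 : 503).json({ status: healthy ? "ok" : "degraded", mongo, redis });
});

app.listen(3000); // placeholder port
```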
### Medium (operational visibility)

- **hbf-core timeouts are hardcoded.** Changing timeout values requires a code change and redeployment.
  - Recommendation: Move timeouts to environment variables with sensible defaults.
- **hbf-bot silently swallows RAG pipeline failures.** It returns undefined with no logging or metric. Debugging missing RAG context requires manual investigation.
  - Recommendation: Log RAG failures with correlation IDs. Emit a metric for monitoring.
- **Inconsistent retry strategies across services.** Retry attempts range from 0 to 5, and backoff strategies include fixed, exponential, exponential-with-jitter, and none. No platform standard exists.
  - Recommendation: Define a platform retry standard (e.g., 3 attempts, exponential backoff with jitter, retry on 5xx + network errors) and implement it via shared middleware or library configuration.
- **hbf-session-manager trainOne() retries without backoff.** Retries fire immediately, potentially overwhelming the target during recovery.
  - Recommendation: Add exponential backoff or, at minimum, a fixed delay between retries.
- **hbf-data-manager Kafka consumer has no dead-letter queue.** Messages that fail all 3 retries are logged and permanently dropped. There is no replay mechanism. The code itself has a `// TODO: Consider adding DLQ` comment acknowledging this gap.
  - Recommendation: Implement a DLQ (e.g., a separate Kafka topic `interaction-metadata.dlq`) and publish failed messages there instead of dropping them; a sketch follows this section.
- **hbf-data-manager Kafka broker connect retries forever.** `KafkajsConsumer.connect()` recursively retries with no upper bound, no alerting, and no backoff increase beyond the fixed 10s sleep. A permanently unreachable broker will loop silently indefinitely.
  - Recommendation: Cap broker connection retries (e.g., 10 attempts), then throw to allow the process to exit and be restarted by the orchestrator with proper alerting.
- **hbf-data-manager Kafka consumer retry uses fixed delay, not exponential backoff.** `ts-retry-promise` is configured with a fixed 1000ms delay for all 3 retries. Transient DB spikes will cause rapid retries.
  - Recommendation: Switch to exponential backoff (e.g., the `ts-retry-promise` `backoff: 'EXPONENTIAL'` option) to avoid hammering the DB during recovery.
- **hbf-knowledge-manager webhook processing has no retry or DLQ after ACK.** The controller sends HTTP 200 to Event Grid before processing. Any processing failure (hbf-core unreachable, Azure Blob download failure, etc.) is caught and logged, and the event is permanently lost. Event Grid will not retry because it received a 200.
  - Recommendation: Persist the raw event payload to a durable queue (e.g., Bull/Redis or a dedicated DB table) immediately on receipt, before sending the 200, and process from the queue with retry. This preserves the fire-and-forget ACK advantage while making processing recoverable.
- **hbf-knowledge-manager has no timeout on hbf-core API calls.** It inherits the hbf-core-api no-timeout default. A hung hbf-core call during webhook processing blocks the async processing task indefinitely. Because the 200 ACK is already sent, the user sees no error, but the pending request leaks server-side.
  - Recommendation: Set a per-call timeout (e.g., 30s) when constructing `HBFCoreApi` in `hbf-core.service.ts`, or await a platform-wide fix in hbf-core-api.
- **hbf-lcg has no health endpoint.** The service manages live gateway sessions and WebSocket connections but exposes no endpoint for orchestrators to detect failures.
  - Recommendation: Add a /health endpoint that checks Redis connectivity and reports leader election state.
- **hbf-lcg polling adapters use fixed intervals with no timeout on individual calls.** Cisco polls every 5s and Genesys checks inactive sessions every 10s, but individual poll calls have no timeout. A slow upstream will hold the polling loop indefinitely.
  - Recommendation: Add a per-call timeout to each adapter's polling fetch, shorter than the polling interval (e.g., 4s for Cisco, 8s for Genesys).
- **hbf-lcg Genesys WebSocket reconnection has no retry cap.** Distributed coordination picks one node to reconnect, but there is no documented upper bound on reconnection attempts or backoff. A permanently dead Genesys endpoint will loop indefinitely.
  - Recommendation: Apply capped exponential backoff (e.g., 5 attempts, max 60s delay) before marking the connection failed and alerting.
- **hbf-lcg session expiry is cleanup-only, with no caller notification.** Expired or abandoned sessions (Zendesk 3600s, Genesys 3600s inactive, 120s pending) are silently deleted by GatewayCleaner. No event is emitted to inform connected clients.
  - Recommendation: Emit a session-expired event (e.g., via Redis pub/sub or the existing microservice bus) so downstream consumers can react rather than discover expiry on the next request.
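A sketch of the DLQ recommendation for hbf-data-manager; the topic name follows the recommendation above, and the producer wiring is an assumption, not existing code:

```typescript
import { Kafka, KafkaMessage } from "kafkajs";

const kafka = new Kafka({ clientId: "hbf-data-manager", brokers: ["localhost:9092"] }); // placeholders
const dlqProducer = kafka.producer(); // dlqProducer.connect() must be awaited once at startup

// Called after the 3 per-message retries are exhausted, instead of dropping the message.
export async function publishToDlq(message: KafkaMessage, error: Error): Promise<void> {
  await dlqProducer.send({
    topic: "interaction-metadata.dlq",
    messages: [
      {
        key: message.key,
        value: message.value,
        headers: {
          "x-original-topic": "interaction-metadata",
          "x-error": error.message,
        },
      },
    ],
  });
}
```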