
Error Handling & Resilience

How the platform handles failures. Last updated: 2026-03-19

Summary

| Pattern | Present? | Services |
| --- | --- | --- |
| HTTP retry | Partial | hbf-core-api (shared lib, inherited by consumers), hbf-nlp (Azure only), helvia-rag-pipelines (backoff decorator), hbf-client-integrations (got retry), hbf-broadcast (Slack/Facebook), hbf-bot (401 refresh only), semantic-doc-segmenter (OpenAI SDK built-in, max_retries=20), hbf-data-manager (got retry on all methods), hbf-knowledge-manager (hbf-core-api inherited) |
| Queue retry | No | None |
| Kafka consumer retry | Yes | hbf-data-manager (ts-retry-promise, 3 attempts, 1s fixed delay per message) |
| Timeouts | Partial | hbf-core (per-client), hbf-bot (Facebook/Generic), hbf-session-manager, hbf-lcm, hbf-client-integrations, helvia-rag-pipelines (LLM/Translation), hbf-webchat (Direct Line), semantic-doc-segmenter (OpenAI 120-200s, callback 5s, job 600s), open-bot-framework (bot endpoint 5s), hbf-data-manager (HTTP 5s, Kafka session 45s), hbf-lcg (Redis microservice response/heartbeat 5s each) |
| Circuit breaker | No | None |
| Health checks | Partial | hbf-core (/actuator/health), hbf-lcm (/health, shallow), hbf-event-publisher (/ and /health, shallow), hbf-media-manager (/health, shallow), semantic-doc-segmenter (/debug/health, shallow), hbf-data-manager (/ and /health, shallow), hbf-lcg (none) |
| Fallback / graceful degradation | Partial | helvia-rag-pipelines (cache fallback), hbf-stats (UTC default), hbf-reports (graph failure tolerance), hbf-console (401 redirect), semantic-doc-segmenter (empty tags on tagging failure), open-bot-framework (Redis → in-memory fallback), hbf-lcg (leader election failover, GatewayCleaner session expiry) |
| Kafka retry | Partial | hbf-bot (producer only, exponential backoff), hbf-data-manager (consumer message handler, fixed delay) |
| Graceful shutdown | Partial | semantic-doc-segmenter (FastAPI lifespan, but does not drain in-flight jobs), hbf-data-manager (OnApplicationShutdown disconnects all consumers). hbf-data-retention explicitly lacks it. |

HTTP Retry (hbf-core-api)

The hbf-core-api library is a shared axios wrapper inherited by: hbf-notifications, hbf-data-retention, hbf-stats, hbf-reports, hbf-knowledge-manager, and any other service that depends on it.

  • Library: axios v1.8.3
  • Attempts: 3 total (1 initial + 2 retries)
  • Backoff: Exponential with jitter: `(e^attempt - 2 * random) * 1000` ms
  • Retry condition: Network errors only (no server response received). HTTP 4xx and 5xx responses are returned immediately and are NOT retried.
  • On permanent failure: Returns HBFCoreApiResponse with status 503.
  • Timeout: Not set. Axios default is no timeout, meaning requests can hang indefinitely.
  • Circuit breaker: None.

Limitation: because retries fire only on network errors, transient upstream failures (HTTP 502, 503, 429) are never retried. A brief upstream restart therefore surfaces as an immediate failure to the caller.
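
For contrast, a minimal sketch of what the recommended policy could look like as an axios interceptor: retry on network errors plus 502/503/429, with jittered exponential backoff. This is illustrative, not the actual hbf-core-api implementation; all names and limits here are assumptions.

```typescript
import { AxiosError, AxiosInstance, InternalAxiosRequestConfig } from 'axios';

const RETRYABLE_STATUSES = new Set([429, 502, 503]);
const MAX_ATTEMPTS = 3; // 1 initial + 2 retries, matching the current policy

// Full-jitter exponential backoff: 1s base doubling per attempt, scaled by random()
function backoffMs(attempt: number): number {
  return Math.pow(2, attempt) * 1000 * Math.random();
}

export function withRetry(client: AxiosInstance): AxiosInstance {
  client.interceptors.response.use(undefined, async (error: AxiosError) => {
    const config = error.config as InternalAxiosRequestConfig & { _attempt?: number };
    if (!config) throw error;

    const attempt = (config._attempt ?? 0) + 1;
    const isNetworkError = !error.response; // no response received at all
    const isTransient = !!error.response && RETRYABLE_STATUSES.has(error.response.status);

    if (attempt >= MAX_ATTEMPTS || (!isNetworkError && !isTransient)) {
      throw error; // exhausted, or a non-retryable response (e.g. 400, 404)
    }

    config._attempt = attempt;
    await new Promise((resolve) => setTimeout(resolve, backoffMs(attempt)));
    return client.request(config);
  });
  return client;
}
```

A production version should also honour the Retry-After header on 429 responses, as the recommendations below note.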

Per-service retry beyond hbf-core-api

| Service | Library | Retry Details |
| --- | --- | --- |
| hbf-nlp | axios (direct) | Custom retryWithBackoff: 3 attempts, fixed 2000ms delay, 5xx only (Azure client). Raw axios calls have no retry. |
| helvia-rag-pipelines | httpx + backoff | LLM calls: 5 retries, constant 1s on httpx.HTTPError. Semantic search: 3 retries on Exception + 5 on HTTPError, 0.5s constant. Translation: 5 retries, 1s constant. Pipeline ops: 3 retries, exponential. |
| hbf-client-integrations | got | Retry enabled on GET, POST, PUT, PATCH, DELETE. Default 3 attempts. |
| hbf-broadcast | axios / request | Slack: 3 attempts, exponential backoff on 5xx. Facebook: 5 manual retries (deprecated request lib). App logic: 3 retries, exponential. |
| hbf-bot | axios | Microsoft Bot Framework: 1 retry on 401 (token refresh). Slack API: retries defaults to 0. RAG pipelines: no retry, silently returns undefined. |
| hbf-session-manager | got | trainOne(): manual retry loop, configurable attempts, no backoff delay. |
| hbf-data-manager | got | GET/POST/PUT/PATCH/DELETE all have got retry enabled (got default: 2 attempts). getBuffer() has no retry. |
| semantic-doc-segmenter | OpenAI SDK (AsyncOpenAI) | max_retries=20, timeout 120s. SDK-managed exponential backoff. Gemini calls: no retry. Callback POST: no retry, 5s timeout. |
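
As a concrete reference for the got-based rows above, a sketch of enabling retry on all verbs. got retries only idempotent methods by default, so POST and PATCH must be listed explicitly; the instance and endpoint names are illustrative.

```typescript
import got from 'got';

// Shared client: uniform timeout plus retry on all verbs, including
// non-idempotent POST/PATCH (safe only if target endpoints tolerate replays)
const httpClient = got.extend({
  timeout: { request: 5_000 },
  retry: {
    limit: 3,
    methods: ['GET', 'POST', 'PUT', 'PATCH', 'DELETE'],
    statusCodes: [408, 429, 500, 502, 503, 504],
  },
});

// Usage:
// const body = await httpClient.post('https://distributor.internal/api', { json: payload }).json();
```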

Services with NO HTTP retry (beyond hbf-core-api inheritance)

| Service | Notes |
| --- | --- |
| hbf-core | No HTTP retry on outbound calls. Spring @Retryable is MongoDB-only. |
| hbf-lcm | got, retry not enabled. |
| hbf-event-publisher | got, no retry, no timeout. |
| hbf-media-manager | got, no retry, no timeout. |
| hbf-reports | got, no retry. Uses hbf-core-api for some calls. |
| hbf-webchat | Retry delegated to botframework-webchat SDK. |
| hbf-console | Fetch API + XHR, zero retry logic. |
| open-bot-framework | @nestjs/axios HttpService, no retry. 5s timeout on bot endpoint POST only. |
| hbf-knowledge-manager | Uses hbf-core-api (retry inherited). No additional HTTP retry. Azure Blob SDK handles its own internal retries. |

Queue Resilience

hbf-bot (Kafka)

  • Kafka producer: exponential backoff (300ms initial, multiplier 2, max 30000ms, 5 retries).
  • Consumer-side retry behavior not documented in the codebase.
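
The producer numbers above correspond to kafkajs's built-in retry options (they match the kafkajs defaults); a sketch of the equivalent explicit configuration, with the broker list as an assumption:

```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({
  clientId: 'hbf-bot',
  brokers: ['kafka:9092'], // illustrative
  retry: {
    initialRetryTime: 300, // ms before the first retry
    multiplier: 2,         // exponential growth per attempt
    maxRetryTime: 30_000,  // ceiling on the per-retry wait
    retries: 5,
  },
});

const producer = kafka.producer();
```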

hbf-data-manager (Kafka consumer)

  • Library: kafkajs + ts-retry-promise v0.8.1
  • Consumer group: hbf-data-manager-consumer (configurable via KAFKA_GROUP_ID)
  • Topic: interaction-metadata (configurable via KAFKA_TOPICS, comma-separated)
  • Per-message retry: 3 attempts, fixed 1000ms delay (no exponential backoff)
  • On permanent failure (3 attempts exhausted): error logged, message dropped. No DLQ.
  • Broker connect retry: recursive with 10s sleep — no upper bound, no alerting.
  • Graceful shutdown: OnApplicationShutdown disconnects all consumers with per-consumer error catching.
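
Putting those bullets together, a sketch of the per-message retry path; the handler stub and broker address are illustrative, and the retry options mirror the configuration described above:

```typescript
import { Kafka, KafkaMessage } from 'kafkajs';
import { retry } from 'ts-retry-promise';

const kafka = new Kafka({ clientId: 'hbf-data-manager', brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'hbf-data-manager-consumer' });

// Illustrative stand-in for the real metadata persistence logic
async function handleMessage(message: KafkaMessage): Promise<void> {
  /* persist interaction metadata */
}

export async function run(): Promise<void> {
  await consumer.connect();
  await consumer.subscribe({ topics: ['interaction-metadata'] });

  await consumer.run({
    eachMessage: async ({ message }) => {
      try {
        await retry(() => handleMessage(message), {
          retries: 3,
          delay: 1000, // fixed 1s between attempts; no exponential backoff
        });
      } catch (err) {
        // All attempts exhausted: logged and dropped, no DLQ (see Gaps below)
        console.error('interaction-metadata message permanently failed', err);
      }
    },
  });
}
```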

Timeouts

| Service | Call | Timeout | Notes |
| --- | --- | --- | --- |
| hbf-core | NotificationServiceClient | 2000ms | Hardcoded |
| hbf-core | DataManagerClient | 5000ms | Hardcoded |
| hbf-core | LanguageToolClient | 5000ms | Hardcoded |
| hbf-core | HelviaNLPSpecificationClient | 120000ms (2 min) | Hardcoded |
| hbf-core | HelviaRAGPipelineClient | 120000ms (2 min) | Hardcoded |
| hbf-core | HelviaGPTPipelineClient | 120000ms (2 min) | Hardcoded |
| hbf-core | AzureAIClient | 350000ms (~6 min) | Hardcoded |
| hbf-core | Default HTTP client | 30000ms | Hardcoded |
| hbf-bot | Facebook / Generic dispatch | 10000ms | via got |
| hbf-bot | Slack API, RAG pipelines | None | No timeout set |
| hbf-nlp | Azure polling | NLP_PIPELINE_POLL_TIMEOUT_IN_SECS | Env-configurable |
| hbf-nlp | Raw axios calls | None | No timeout set |
| hbf-session-manager | POST/PATCH | SESSION_SERVICE_REQUEST_TIMEOUT or 5000ms | Env-configurable with default |
| hbf-session-manager | GET | None | May have no timeout |
| hbf-session-manager | NLP polling | NLP_PIPELINE_POLL_TIMEOUT_IN_SECS (360s default) | Poll interval 2s |
| hbf-lcm | Bot callbacks | 3000ms | Hardcoded |
| hbf-lcm | Translation | SERVICE_TRANSLATION_TIMEOUT_SECONDS | Env-configurable |
| hbf-client-integrations | Distributor service | DISTRIBUTER_SERVICE_REQUEST_TIMEOUT or 5000ms | Env-configurable |
| hbf-client-integrations | Client module | 15000ms | Hardcoded |
| hbf-client-integrations | kelly-hbf-interaction | 60000ms | Hardcoded |
| hbf-client-integrations | getBuffer() | None | No timeout on binary fetch |
| helvia-rag-pipelines | LLM calls | 30000ms (default) | Configurable |
| helvia-rag-pipelines | Translation | 10000ms | |
| helvia-rag-pipelines | Vector DB (Qdrant/Milvus) | None | No timeout on vector operations |
| hbf-webchat | Direct Line | 20000ms | SDK-managed |
| hbf-core-api | All calls | None | axios default = no timeout |
| hbf-notifications | HTTP calls | None | No timeout set |
| hbf-broadcast | HTTP calls | None | No timeout set |
| hbf-event-publisher | HTTP calls | None | No timeout, no retry |
| hbf-data-retention | HTTP calls | None | No timeout set |
| hbf-stats | HTTP calls | None | No timeout set |
| hbf-reports | HTTP calls | None | No timeout set |
| hbf-media-manager | HTTP calls | None | No timeout, no retry |
| hbf-console | Fetch/XHR | None | No timeout configured |
| open-bot-framework | Bot endpoint POST | 5000ms | Hardcoded in directline-conversation.service.ts |
| open-bot-framework | Redis connect | 2000ms | connectTimeout in atomic-operations.provider.ts |
| open-bot-framework | S3 upload | None | AWS SDK default |
| hbf-data-manager | HTTP calls (GET/POST/PUT/PATCH/DELETE) | DISTRIBUTER_SERVICE_REQUEST_TIMEOUT or 5000ms | HttpClientService |
| hbf-data-manager | getBuffer() binary fetch | None | No timeout set |
| hbf-data-manager | Kafka session timeout | KAFKA_SESSION_TIMEOUT_MS or 45000ms | KafkaConsumerService |
| hbf-knowledge-manager | hbf-core API calls | None | hbf-core-api axios default |
| hbf-knowledge-manager | Azure Blob download / list | SDK default | @azure/storage-blob internal |
| hbf-lcg | Redis microservice response | MICROSERVICE_RESPONSE_TIMEOUT_MILLIS or 5000ms | Env-configurable; awaits NestJS microservice reply |
| hbf-lcg | Redis microservice heartbeat | MICROSERVICE_HEARTBEAT_TIMEOUT_MILLIS or 5000ms | Env-configurable; controls liveness detection |
| hbf-lcg | Cisco polling interval | 5000ms | Fixed interval; no timeout on individual poll calls |
| hbf-lcg | Genesys inactive session check | 10000ms | Fixed polling interval for inactive check |
| semantic-doc-segmenter | OpenAI/Azure (default) | 120s | Hardcoded |
| semantic-doc-segmenter | OpenAI/Azure (title extraction) | 200s | Hardcoded |
| semantic-doc-segmenter | Callback POST | 5s | Hardcoded |
| semantic-doc-segmenter | Google language detection | 10s | Env: LANGUAGE_DETECT_GOOGLE_TIMEOUT_SECONDS |
| semantic-doc-segmenter | Job execution overall | 600s | Env: JOB_TIMEOUT_SECONDS |
| semantic-doc-segmenter | Gemini API | None | No timeout set |
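
Most of the "None" rows above are one-line fixes at client construction time. A sketch of a platform default with per-call override, using axios as in hbf-core-api (constant and paths are illustrative):

```typescript
import axios from 'axios';

const DEFAULT_TIMEOUT_MS = 30_000;

// Fail fast instead of hanging indefinitely on a silent downstream
export const http = axios.create({ timeout: DEFAULT_TIMEOUT_MS });

// Known-slow endpoints (e.g. LLM pipelines) opt in to longer limits per call:
// await http.get('/rag/answer', { timeout: 120_000 });
```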

Circuit Breakers

No service implements a circuit breaker. There is no mechanism to stop sending requests to a downstream service that is known to be failing. Every service will continue sending requests at full rate during an outage, amplifying load on struggling dependencies and delaying recovery.
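
A sketch of the missing pattern with opossum, the Node.js library suggested in the recommendations below; the wrapped call and thresholds here are illustrative:

```typescript
import CircuitBreaker from 'opossum';
import axios from 'axios';

// Illustrative downstream call to protect
async function callDownstream(payload: unknown) {
  return axios.post('https://nlp.internal/classify', payload, { timeout: 5_000 });
}

const breaker = new CircuitBreaker(callDownstream, {
  timeout: 6_000,                // treat slow calls as failures
  errorThresholdPercentage: 50,  // open after 50% of requests fail
  resetTimeout: 30_000,          // attempt a probe request after 30s
});

breaker.fallback(() => ({ status: 'degraded', data: null }));
breaker.on('open', () => console.warn('circuit open: skipping downstream calls'));

// Usage: const result = await breaker.fire({ text: 'hello' });
```

While the circuit is open, calls fail fast (or hit the fallback) instead of stacking up against a dead dependency; after resetTimeout a single probe decides whether the circuit closes again.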

Fallback Strategies

| Scenario | Fallback Behaviour | User Impact |
| --- | --- | --- |
| helvia-rag-pipelines: embedding cache failure | Falls back to direct LLM call (skips Redis/memory cache) | Increased latency, higher LLM cost. Transparent to user. |
| helvia-rag-pipelines: multi-tier cache | Redis async → Redis sync → in-memory. Degrades through tiers on failure. | Gradual latency increase. Transparent to user. |
| hbf-stats: getOrganizationTimezone failure | Defaults to 'UTC' | Stats may show incorrect timezone. Silent, no error surfaced. |
| hbf-stats: updateTenantStats failure | Logs error, does not throw | Stats silently stale. Dashboard shows outdated numbers. |
| hbf-reports: generate_graphs failure | Caught, execution continues | Report generated without graphs. Partial output. |
| hbf-bot: RAG pipeline failure | Silently returns undefined | Bot response may lack RAG-augmented context. No error shown to end user. |
| hbf-broadcast: individual send failure | Uses Promise.allSettled() | One recipient failure does not block others. Partial delivery. |
| hbf-console: 401 response | Checks refresh token, clears storage, redirects to login | User is logged out. Session state lost. |
| hbf-webchat: token expiry | Tracks expiry for reconnection | Automatic reconnection attempt. Brief interruption possible. |
| hbf-core-api: permanent network failure | Returns HBFCoreApiResponse with status 503 | Caller receives structured error. User sees failure depending on caller handling. |
| hbf-core: any unhandled exception | ResponseExceptionHandler (@ControllerAdvice) | Standardized error response. Prevents raw stack traces. |
| semantic-doc-segmenter: article tagging LLM failure | Returns empty tags list, applies "Other" tag | Articles tagged as "Other" instead of meaningful categories |
| semantic-doc-segmenter: callback POST failure | Logged, job proceeds to complete | Caller never receives results; must poll GET endpoint |
| semantic-doc-segmenter: Gemini failure | Exception propagates, job marked FAILED | Job fails entirely if Gemini was selected parser |
| open-bot-framework: Redis unavailable at startup | Falls back to in-memory atomic counter manager | Activity watermarks non-durable; IDs will collide across instances |
| open-bot-framework: bot endpoint unreachable on activity POST | BadRequestException with error message | User receives 400 error |
| open-bot-framework: WebSocket client not registered after 3 retries | Transcript dropped, warning logged | Bot reply silently lost; user never receives message |
| hbf-knowledge-manager: webhook processing error after 200 ACK | Error caught and logged; Event Grid does not retry (it already received the 200) | File change permanently lost with no replay mechanism |
| hbf-knowledge-manager: no integration/KB found for webhook key | Returns early, logs warning/debug | No sync occurs. Silent. |
| hbf-knowledge-manager fullSync: per-file download/upload fails | Error added to SyncResult.errors[], continues to next file | Partial sync; caller receives error list in SyncResult |
| hbf-lcg: leader node failure | Leader election detects missing heartbeat; a follower is promoted automatically | Brief window of no GatewayCleaner execution during re-election |
| hbf-lcg: expired sessions (Zendesk, Genesys) | GatewayCleaner cron (leader-only) deletes sessions past TTL | Sessions silently removed; no user notification |
| hbf-lcg: Genesys WebSocket disconnection | Distributed coordination selects one node to own reconnection; others stand by | Brief message gap during reconnect; no mechanism to recover missed messages |
| hbf-lcg: pending Genesys session not established within 120s | GatewayCleaner removes pending session entry | Session silently abandoned; caller must retry from scratch |
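
As one concrete illustration from the table, the hbf-broadcast partial-delivery pattern in miniature: Promise.allSettled lets one recipient's failure leave the rest of the batch unaffected (sendToRecipient is a hypothetical stand-in):

```typescript
// Dispatch to all recipients; collect failures without aborting the batch
async function broadcast(recipients: string[], message: string) {
  const results = await Promise.allSettled(
    recipients.map((r) => sendToRecipient(r, message)),
  );
  const failed = results
    .map((res, i) => (res.status === 'rejected' ? recipients[i] : null))
    .filter((r): r is string => r !== null);
  return { delivered: recipients.length - failed.length, failed };
}

declare function sendToRecipient(recipient: string, message: string): Promise<void>;
```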

Health Checks

| Service | Endpoint | Checks |
| --- | --- | --- |
| hbf-core | /actuator/health | Spring Actuator (checks DB, disk, etc.) |
| hbf-core | /{tenant}/health-check | Tenant-scoped health |
| hbf-lcm | GET /health | Returns empty 200. No dependency checks. |
| hbf-event-publisher | GET / | Returns service name + version |
| hbf-event-publisher | GET /health | Stub, returns void |
| hbf-media-manager | GET /health | Returns 200. No dependency checks. |
| hbf-bot | None | No health endpoint |
| hbf-nlp | None | No health endpoint |
| helvia-rag-pipelines | None | No health endpoint |
| hbf-session-manager | None | No health endpoint |
| hbf-notifications | None | No health endpoint |
| hbf-client-integrations | None | No health endpoint |
| hbf-broadcast | None | No health endpoint |
| hbf-data-retention | None | No health endpoint |
| hbf-stats | None | No health endpoint |
| hbf-reports | None | No health endpoint |
| hbf-webchat | N/A | Frontend widget, not applicable |
| hbf-console | N/A | Frontend SPA, not applicable |
| semantic-doc-segmenter | GET /debug/health | Job counts (pending, processing, failed 24h). No dependency checks. |
| semantic-doc-segmenter | GET /debug/health_with_details | Same + detailed job lists. No dependency checks. |
| open-bot-framework | None | GET / returns "Hello World!". No health endpoint. |
| hbf-data-manager | GET / and GET /health | Returns status, timestamp, and uptime seconds. No dependency checks (no DB or Kafka connectivity check). |
| hbf-knowledge-manager | None | No health endpoint. |
| hbf-lcg | None | No health endpoint. |

18 of 19 backend services either lack a health endpoint or expose only shallow (no-dependency) checks. Only hbf-core has a meaningful health check, via Spring Actuator.
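
A sketch of what a meaningful (dependency-checking) endpoint could look like in the NestJS services; the injected clients are illustrative placeholders for a service's real Mongo/Redis handles:

```typescript
import { Controller, Get, ServiceUnavailableException } from '@nestjs/common';

@Controller('health')
export class HealthController {
  constructor(
    private readonly mongo: { ping(): Promise<void> },   // e.g. Mongoose connection wrapper
    private readonly redis: { ping(): Promise<string> }, // e.g. ioredis client
  ) {}

  @Get()
  async check() {
    // Probe each critical dependency; report 503 if any is down
    const checks = await Promise.allSettled([this.mongo.ping(), this.redis.ping()]);
    const [mongoOk, redisOk] = checks.map((c) => c.status === 'fulfilled');
    if (!mongoOk || !redisOk) {
      throw new ServiceUnavailableException({ mongo: mongoOk, redis: redisOk });
    }
    return { status: 'ok', mongo: mongoOk, redis: redisOk };
  }
}
```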

Gaps & Recommendations

Critical (service outage risk)

  1. No circuit breakers anywhere. A single downstream failure cascades to all callers. Every service hammers failing dependencies at full rate.

    • Recommendation: Add circuit breakers (e.g., opossum for Node.js, resilience4j for Java) to hbf-core, hbf-bot, and helvia-rag-pipelines as a starting priority. These are the highest-fan-out services.
  2. hbf-core-api has no timeout. The shared HTTP library used across many services has no default timeout. A hung downstream will hold connections and threads indefinitely.

    • Recommendation: Set a default timeout (e.g., 30s) in hbf-core-api. Allow per-call overrides.
  3. hbf-core-api only retries on network errors. Transient HTTP 502/503/429 responses are never retried. Brief restarts or load-balancer errors cause immediate caller failure.

    • Recommendation: Extend retry to cover 502, 503, and 429 (with Retry-After header respect).
  4. hbf-data-retention has no graceful shutdown. Runs as an infinite loop daemon. Interruption mid-deletion can leave data in an inconsistent state.

    • Recommendation: Add SIGTERM/SIGINT handlers to complete the current batch before exiting, as sketched below.
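
A minimal sketch of that handler, assuming the daemon's loop processes one deletion batch per iteration (deleteNextBatch is an illustrative stand-in for the service's actual batch function):

```typescript
let shuttingDown = false;

for (const signal of ['SIGTERM', 'SIGINT'] as const) {
  process.on(signal, () => {
    console.info(`${signal} received, finishing current batch before exit`);
    shuttingDown = true; // loop re-checks the flag only between batches
  });
}

async function runDaemon(): Promise<void> {
  while (!shuttingDown) {
    await deleteNextBatch(); // completes atomically before the flag is re-checked
  }
  process.exit(0);
}

declare function deleteNextBatch(): Promise<void>;
```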

High (degraded reliability)

  1. 18 of 19 backend services lack meaningful health checks. Orchestrators (Kubernetes, load balancers) cannot detect unhealthy instances. Failed services continue receiving traffic.

    • Recommendation: Add /health endpoints that check critical dependencies (DB connectivity, Redis, downstream service reachability). Minimum: readiness and liveness probes for all services.
  2. Multiple services have no HTTP timeout. hbf-notifications, hbf-broadcast, hbf-event-publisher, hbf-media-manager, hbf-stats, hbf-reports, hbf-data-retention, and hbf-nlp (raw calls) can all hang indefinitely on downstream calls.

    • Recommendation: Establish a platform-wide default timeout (e.g., 30s) and require explicit opt-in for longer durations.
  3. helvia-rag-pipelines provider selection is round-robin, not failure-aware. A failing LLM provider continues receiving requests in rotation.

    • Recommendation: Track provider health and skip or deprioritize providers with recent failures.
  4. helvia-rag-pipelines has no timeout on vector DB operations. Qdrant/Milvus queries can hang indefinitely.

    • Recommendation: Set a timeout (e.g., 10s) on all vector DB calls.

Medium (operational visibility)

  1. hbf-core timeouts are hardcoded. Changing timeout values requires a code change and redeployment.

    • Recommendation: Move timeouts to environment variables with sensible defaults.
  2. hbf-bot silently swallows RAG pipeline failures. Returns undefined with no logging or metric. Debugging missing RAG context requires manual investigation.

    • Recommendation: Log RAG failures with correlation IDs. Emit a metric for monitoring.
  3. Inconsistent retry strategies across services. Retry attempts range from 0 to 5, backoff strategies include fixed, exponential, exponential-with-jitter, and none. No platform standard exists.

    • Recommendation: Define a platform retry standard (e.g., 3 attempts, exponential backoff with jitter, retry on 5xx + network errors) and implement via shared middleware or library configuration.
  4. hbf-session-manager trainOne() retries without backoff. Retries fire immediately, potentially overwhelming the target during recovery.

    • Recommendation: Add exponential backoff or at minimum a fixed delay between retries.
  5. hbf-data-manager Kafka consumer has no dead-letter queue. Messages that fail all 3 retries are logged and permanently dropped. There is no replay mechanism. The code itself has a // TODO: Consider adding DLQ comment acknowledging this gap.

    • Recommendation: Implement a DLQ (e.g., a separate Kafka topic interaction-metadata.dlq) and publish failed messages there instead of dropping them; a sketch appears at the end of this section.
  6. hbf-data-manager Kafka broker connect retries forever. KafkajsConsumer.connect() recursively retries with no upper bound, no alerting, and no back-off increase beyond the fixed 10s sleep. A permanently unreachable broker will loop silently indefinitely.

    • Recommendation: Cap broker connection retries (e.g., 10 attempts), then throw to allow the process to exit and be restarted by the orchestrator with proper alerting.
  7. hbf-data-manager Kafka consumer retry uses fixed delay, not exponential backoff. ts-retry-promise is configured with a fixed 1000ms delay for all 3 retries. Transient DB spikes will cause rapid retries.

    • Recommendation: Switch to exponential backoff (e.g., ts-retry-promise backoff: 'EXPONENTIAL' option) to avoid hammering the DB during recovery.
  8. hbf-knowledge-manager webhook processing has no retry or DLQ after ACK. The controller sends HTTP 200 to Event Grid before processing. Any processing failure (hbf-core unreachable, Azure Blob download failure, etc.) is caught, logged, and the event is permanently lost. Event Grid will not retry because it received a 200.

    • Recommendation: Persist the raw event payload to a durable queue (e.g., Bull/Redis or a dedicated DB table) immediately on receipt, before sending the 200. Process from the queue with retry. This preserves the fire-and-forget ACK advantage while making processing recoverable.
  9. hbf-knowledge-manager has no timeout on hbf-core API calls. Inherits the hbf-core-api no-timeout default. A hung hbf-core call during webhook processing blocks that async processing task indefinitely. Because the 200 ACK has already been sent, the user sees no error, but the pending request leaks on the server.

    • Recommendation: Set a per-call timeout (e.g., 30s) when constructing HBFCoreApi in hbf-core.service.ts, or await a platform-wide fix in hbf-core-api.
  10. hbf-lcg has no health endpoint. The service manages live gateway sessions and WebSocket connections but exposes no endpoint for orchestrators to detect failures.

    • Recommendation: Add a /health endpoint that checks Redis connectivity and reports leader election state.
  11. hbf-lcg polling adapters use fixed intervals with no timeout on individual calls. Cisco polls every 5s and Genesys checks inactive sessions every 10s, but individual poll calls have no timeout. A slow upstream will hold the polling loop indefinitely.

    • Recommendation: Add a per-call timeout to each adapter's polling fetch, shorter than the polling interval (e.g., 4s for Cisco, 8s for Genesys).
  12. hbf-lcg Genesys WebSocket reconnection has no retry cap. Distributed coordination picks one node to reconnect, but there is no documented upper bound on reconnection attempts or backoff. A permanently dead Genesys endpoint will loop indefinitely.

    • Recommendation: Apply capped exponential backoff (e.g., 5 attempts, max 60s delay) before marking the connection failed and alerting.
  13. hbf-lcg session expiry is cleanup-only, with no caller notification. Expired or abandoned sessions (Zendesk 3600s, Genesys 3600s inactive, 120s pending) are silently deleted by GatewayCleaner. No event is emitted to inform connected clients.

    • Recommendation: Emit a session-expired event (e.g., via Redis pub/sub or the existing microservice bus) so downstream consumers can react rather than discover expiry on next request.
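
For item 5, a sketch of the recommended DLQ publish path; the broker address is illustrative, the topic name follows the recommendation, and the producer would be connected once during service bootstrap:

```typescript
import { Kafka, KafkaMessage } from 'kafkajs';

const kafka = new Kafka({ clientId: 'hbf-data-manager', brokers: ['kafka:9092'] });
const dlqProducer = kafka.producer();
// await dlqProducer.connect() during application bootstrap

// Called when all per-message retries are exhausted, instead of dropping
export async function sendToDlq(message: KafkaMessage, error: Error): Promise<void> {
  await dlqProducer.send({
    topic: 'interaction-metadata.dlq',
    messages: [{
      key: message.key,
      value: message.value,
      headers: {
        ...message.headers,
        'x-error': error.message,             // why it failed
        'x-failed-at': new Date().toISOString(), // when it failed
      },
    }],
  });
}
```

Replaying is then a matter of consuming interaction-metadata.dlq back into the main handler once the underlying fault is fixed.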