# Error Handling & Resilience

How the platform handles failures. Last updated: 2026-03-19

## Summary
| Pattern | Present? | Services |
|---|---|---|
| HTTP retry | Partial | hbf-core-api (shared lib, inherited by consumers), hbf-nlp (Azure only), helvia-rag-pipelines (backoff decorator), hbf-client-integrations (got retry), hbf-broadcast (Slack/Facebook), hbf-bot (401 refresh only), semantic-doc-segmenter (OpenAI SDK built-in, max_retries=20), hbf-data-manager (got retry on all methods), hbf-knowledge-manager (hbf-core-api inherited) |
| Queue retry | No | None |
| Kafka consumer retry | Yes | hbf-data-manager (ts-retry-promise, 3 attempts, 1s fixed delay per message) |
| Timeouts | Partial | hbf-core (per-client), hbf-bot (Facebook/Generic), hbf-session-manager, hbf-lcm, hbf-client-integrations, helvia-rag-pipelines (LLM/Translation), hbf-webchat (Direct Line), semantic-doc-segmenter (OpenAI 120-200s, callback 5s, job 600s), open-bot-framework (bot endpoint 5s), hbf-data-manager (HTTP 5s, Kafka session 45s), hbf-lcg (Redis microservice response/heartbeat 5s each) |
| Circuit breaker | No | None |
| Health checks | Partial | hbf-core (/actuator/health), hbf-lcm (/health, shallow), hbf-event-publisher (/ and /health, shallow), hbf-media-manager (/health, shallow), semantic-doc-segmenter (/debug/health, shallow), hbf-data-manager (/ and /health, shallow), hbf-lcg (none) |
| Fallback / graceful degradation | Partial | helvia-rag-pipelines (cache fallback), hbf-stats (UTC default), hbf-reports (graph failure tolerance), hbf-console (401 redirect), semantic-doc-segmenter (empty tags on tagging failure), open-bot-framework (Redis → in-memory fallback), hbf-lcg (leader election failover, GatewayCleaner session expiry) |
| Kafka retry | Partial | hbf-bot (producer only, exponential backoff), hbf-data-manager (consumer message handler, fixed delay) |
| Graceful shutdown | Partial | semantic-doc-segmenter (FastAPI lifespan, but does not drain in-flight jobs), hbf-data-manager (OnApplicationShutdown disconnects all consumers). hbf-data-retention explicitly lacks it. |
## HTTP Retry (hbf-core-api)

The hbf-core-api library is a shared axios wrapper whose retry behaviour is inherited by hbf-notifications, hbf-data-retention, hbf-stats, hbf-reports, hbf-knowledge-manager, and any other service that depends on it.

- Library: axios v1.8.3
- Attempts: 3 total (1 initial + 2 retries)
- Backoff: Exponential with jitter: `(e^attempt - 2*random) * 1000ms`
- Retry condition: Network errors only (no server response received). HTTP 4xx and 5xx responses are returned immediately and are NOT retried.
- On permanent failure: Returns `HBFCoreApiResponse` with status 503.
- Timeout: Not set. The axios default is no timeout, meaning requests can hang indefinitely.
- Circuit breaker: None.

**Limitation:** The network-error-only retry policy means transient upstream errors (502, 503, 429) are never retried. A brief upstream restart causes an immediate failure for the caller.
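The behaviour above can be summarised in a short TypeScript sketch. This is illustrative only, assuming a plain axios instance; `withRetry`, `MAX_ATTEMPTS`, and `backoffMs` are hypothetical names, not hbf-core-api's actual internals.

```typescript
import axios, { AxiosRequestConfig } from "axios";

const MAX_ATTEMPTS = 3; // 1 initial call + 2 retries, as in hbf-core-api

// Backoff from the formula above: (e^attempt - 2*random) * 1000ms
function backoffMs(attempt: number): number {
  return Math.max(0, (Math.exp(attempt) - 2 * Math.random()) * 1000);
}

// Retries only when no response was received (network error); any HTTP status,
// including 5xx, is returned to the caller immediately.
async function withRetry<T>(config: AxiosRequestConfig): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      const res = await axios.request<T>(config);
      return res.data;
    } catch (err) {
      const isNetworkError = axios.isAxiosError(err) && !err.response;
      if (!isNetworkError || attempt >= MAX_ATTEMPTS) {
        throw err; // hbf-core-api instead maps this to a 503 HBFCoreApiResponse
      }
      await new Promise((resolve) => setTimeout(resolve, backoffMs(attempt)));
    }
  }
}
```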
### Per-service retry beyond hbf-core-api
| Service | Library | Retry Details |
|---|---|---|
| hbf-nlp | axios (direct) | Custom retryWithBackoff: 3 attempts, fixed 2000ms delay, 5xx only (Azure client). Raw axios calls have no retry. |
| helvia-rag-pipelines | httpx + backoff | LLM calls: 5 retries, constant 1s on httpx.HTTPError. Semantic search: 3 retries on Exception + 5 on HTTPError, 0.5s constant. Translation: 5 retries, 1s constant. Pipeline ops: 3 retries, exponential. |
| hbf-client-integrations | got | Retry enabled on GET, POST, PUT, PATCH, DELETE. Default 3 attempts. |
| hbf-broadcast | axios / request | Slack: 3 attempts, exponential backoff on 5xx. Facebook: 5 manual retries (deprecated request lib). App logic: 3 retries, exponential. |
| hbf-bot | axios | Microsoft Bot Framework: 1 retry on 401 (token refresh). Slack API: retries defaults to 0. RAG pipelines: no retry, silently returns undefined. |
| hbf-session-manager | got | trainOne(): manual retry loop, configurable attempts, no backoff delay. |
| hbf-data-manager | got | GET/POST/PUT/PATCH/DELETE all have got retry enabled (got default: 2 attempts). getBuffer() has no retry. |
| semantic-doc-segmenter | OpenAI SDK (AsyncOpenAI) | max_retries=20, timeout 120s. SDK-managed exponential backoff. Gemini calls: no retry. Callback POST: no retry, 5s timeout. |
### Services with NO HTTP retry (beyond hbf-core-api inheritance)
| Service | Notes |
|---|---|
| hbf-core | No HTTP retry on outbound calls. Spring @Retryable is MongoDB-only. |
| hbf-lcm | got, retry not enabled. |
| hbf-event-publisher | got, no retry, no timeout. |
| hbf-media-manager | got, no retry, no timeout. |
| hbf-reports | got, no retry. Uses hbf-core-api for some calls. |
| hbf-webchat | Retry delegated to botframework-webchat SDK. |
| hbf-console | Fetch API + XHR, zero retry logic. |
| open-bot-framework | @nestjs/axios HttpService, no retry. 5s timeout on bot endpoint POST only. |
| hbf-knowledge-manager | Uses hbf-core-api (retry inherited). No additional HTTP retry. Azure Blob SDK handles its own internal retries. |
## Queue Resilience

### hbf-bot (Kafka)
- Kafka producer: exponential backoff (300ms initial, multiplier 2, max 30000ms, 5 retries).
- Consumer-side retry behavior not documented in the codebase.
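The producer values above map onto kafkajs retry options (they are also the library's defaults). A minimal sketch, with the client id and broker list as placeholders:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({
  clientId: "hbf-bot",         // placeholder
  brokers: ["localhost:9092"], // placeholder
});

// Producer retry matching the values above.
const producer = kafka.producer({
  retry: {
    initialRetryTime: 300, // first retry after 300ms
    multiplier: 2,         // exponential growth factor
    maxRetryTime: 30000,   // individual delays capped at 30s
    retries: 5,            // give up after 5 retries
  },
});
```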
### hbf-data-manager (Kafka consumer)

- Library: kafkajs + ts-retry-promise v0.8.1
- Consumer group: `hbf-data-manager-consumer` (configurable via `KAFKA_GROUP_ID`)
- Topic: `interaction-metadata` (configurable via `KAFKA_TOPICS`, comma-separated)
- Per-message retry: 3 attempts, fixed 1000ms delay (no exponential backoff)
- On permanent failure (3 attempts exhausted): error logged, message dropped. No DLQ.
- Broker connect retry: recursive with 10s sleep; no upper bound, no alerting.
- Graceful shutdown: `OnApplicationShutdown` disconnects all consumers with per-consumer error catching.
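A minimal sketch of this per-message retry, assuming kafkajs with ts-retry-promise; `handleMessage` and the connection details are placeholders, not hbf-data-manager's actual code:

```typescript
import { Kafka } from "kafkajs";
import { retry } from "ts-retry-promise";

const kafka = new Kafka({ clientId: "hbf-data-manager", brokers: ["localhost:9092"] }); // placeholders
const consumer = kafka.consumer({ groupId: "hbf-data-manager-consumer" });

// Placeholder for the real persistence logic.
async function handleMessage(value: Buffer | null): Promise<void> {}

async function main(): Promise<void> {
  await consumer.connect();
  await consumer.subscribe({ topic: "interaction-metadata" });
  await consumer.run({
    eachMessage: async ({ message }) => {
      try {
        // 3 retry attempts with a fixed 1000ms delay between them.
        await retry(() => handleMessage(message.value), { retries: 3, delay: 1000 });
      } catch (err) {
        // Retries exhausted: the message is only logged, then dropped (no DLQ).
        console.error("interaction-metadata message dropped", err);
      }
    },
  });
}

main();
```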
## Timeouts
| Service | Call | Timeout | Notes |
|---|---|---|---|
| hbf-core | NotificationServiceClient | 2000ms | Hardcoded |
| hbf-core | DataManagerClient | 5000ms | Hardcoded |
| hbf-core | LanguageToolClient | 5000ms | Hardcoded |
| hbf-core | HelviaNLPSpecificationClient | 120000ms (2min) | Hardcoded |
| hbf-core | HelviaRAGPipelineClient | 120000ms (2min) | Hardcoded |
| hbf-core | HelviaGPTPipelineClient | 120000ms (2min) | Hardcoded |
| hbf-core | AzureAIClient | 350000ms (~6min) | Hardcoded |
| hbf-core | Default HTTP client | 30000ms | Hardcoded |
| hbf-bot | Facebook / Generic dispatch | 10000ms | via got |
| hbf-bot | Slack API, RAG pipelines | None | No timeout set |
| hbf-nlp | Azure polling | NLP_PIPELINE_POLL_TIMEOUT_IN_SECS | Env-configurable |
| hbf-nlp | Raw axios calls | None | No timeout set |
| hbf-session-manager | POST/PATCH | SESSION_SERVICE_REQUEST_TIMEOUT or 5000ms | Env-configurable with default |
| hbf-session-manager | GET | None | May have no timeout |
| hbf-session-manager | NLP polling | NLP_PIPELINE_POLL_TIMEOUT_IN_SECS (360s default) | Poll interval 2s |
| hbf-lcm | Bot callbacks | 3000ms | Hardcoded |
| hbf-lcm | Translation | SERVICE_TRANSLATION_TIMEOUT_SECONDS | Env-configurable |
| hbf-client-integrations | Distributor service | DISTRIBUTER_SERVICE_REQUEST_TIMEOUT or 5000ms | Env-configurable |
| hbf-client-integrations | Client module | 15000ms | Hardcoded |
| hbf-client-integrations | kelly-hbf-interaction | 60000ms | Hardcoded |
| hbf-client-integrations | getBuffer() | None | No timeout on binary fetch |
| helvia-rag-pipelines | LLM calls | 30000ms (default) | Configurable |
| helvia-rag-pipelines | Translation | 10000ms | |
| helvia-rag-pipelines | Vector DB (Qdrant/Milvus) | None | No timeout on vector operations |
| hbf-webchat | Direct Line | 20000ms | SDK-managed |
| hbf-core-api | All calls | None | axios default = no timeout |
| hbf-notifications | HTTP calls | None | No timeout set |
| hbf-broadcast | HTTP calls | None | No timeout set |
| hbf-event-publisher | HTTP calls | None | No timeout, no retry |
| hbf-data-retention | HTTP calls | None | No timeout set |
| hbf-stats | HTTP calls | None | No timeout set |
| hbf-reports | HTTP calls | None | No timeout set |
| hbf-media-manager | HTTP calls | None | No timeout, no retry |
| hbf-console | Fetch/XHR | None | No timeout configured |
| open-bot-framework | Bot endpoint POST | 5000ms | Hardcoded in directline-conversation.service.ts |
| open-bot-framework | Redis connect | 2000ms | connectTimeout in atomic-operations.provider.ts |
| open-bot-framework | S3 upload | None | AWS SDK default |
| hbf-data-manager | HTTP calls (GET/POST/PUT/PATCH/DELETE) | DISTRIBUTER_SERVICE_REQUEST_TIMEOUT or 5000ms | HttpClientService |
| hbf-data-manager | getBuffer() binary fetch | None | No timeout set |
| hbf-data-manager | Kafka session timeout | KAFKA_SESSION_TIMEOUT_MS or 45000ms | KafkaConsumerService |
| hbf-knowledge-manager | hbf-core API calls | None | hbf-core-api axios default |
| hbf-knowledge-manager | Azure Blob download / list | SDK default | @azure/storage-blob internal |
| hbf-lcg | Redis microservice response | MICROSERVICE_RESPONSE_TIMEOUT_MILLIS or 5000ms | Env-configurable; awaits NestJS microservice reply |
| hbf-lcg | Redis microservice heartbeat | MICROSERVICE_HEARTBEAT_TIMEOUT_MILLIS or 5000ms | Env-configurable; controls liveness detection |
| hbf-lcg | Cisco polling interval | 5000ms | Fixed interval; no timeout on individual poll calls |
| hbf-lcg | Genesys inactive session check | 10000ms | Fixed polling interval for inactive check |
| semantic-doc-segmenter | OpenAI/Azure (default) | 120s | Hardcoded |
| semantic-doc-segmenter | OpenAI/Azure (title extraction) | 200s | Hardcoded |
| semantic-doc-segmenter | Callback POST | 5s | Hardcoded |
| semantic-doc-segmenter | Google language detection | 10s | Env: LANGUAGE_DETECT_GOOGLE_TIMEOUT_SECONDS |
| semantic-doc-segmenter | Job execution overall | 600s | Env: JOB_TIMEOUT_SECONDS |
| semantic-doc-segmenter | Gemini API | None | No timeout set |
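Several of the env-configurable timeouts above follow the same pattern. A hedged sketch for the got-based services, reusing the `DISTRIBUTER_SERVICE_REQUEST_TIMEOUT` variable from the table and its documented 5000ms default (the option shape assumes got v11 or later):

```typescript
import got from "got";

// Read the timeout from the environment, falling back to the documented 5000ms default.
const requestTimeoutMs = Number(process.env.DISTRIBUTER_SERVICE_REQUEST_TIMEOUT ?? 5000);

// A shared client with a whole-request timeout and got's default retry limit.
const client = got.extend({
  timeout: { request: requestTimeoutMs },
  retry: { limit: 2 },
});
```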
## Circuit Breakers
No service implements a circuit breaker. There is no mechanism to stop sending requests to a downstream service that is known to be failing. Every service will continue sending requests at full rate during an outage, amplifying load on struggling dependencies and delaying recovery.
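As a reference point for the recommendation later in this document, a minimal sketch of a circuit breaker around an outbound call using opossum; the downstream URL, thresholds, and fallback are illustrative assumptions, not existing platform code:

```typescript
import CircuitBreaker from "opossum";
import axios from "axios";

// The wrapped action: any outbound request to a downstream service.
async function callDownstream(path: string) {
  const res = await axios.get(`http://hbf-core.internal${path}`, { timeout: 5000 });
  return res.data;
}

const breaker = new CircuitBreaker(callDownstream, {
  timeout: 5000,                // treat calls slower than 5s as failures
  errorThresholdPercentage: 50, // open the circuit at 50% failures
  resetTimeout: 30000,          // allow a probe request after 30s
});

// Optional fallback returned while the circuit is open.
breaker.fallback(() => ({ degraded: true }));

// Callers use breaker.fire("/some/path") instead of calling callDownstream directly.
```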
## Fallback Strategies
| Scenario | Fallback Behaviour | User Impact |
|---|---|---|
| helvia-rag-pipelines: embedding cache failure | Falls back to direct LLM call (skips Redis/memory cache) | Increased latency, higher LLM cost. Transparent to user. |
| helvia-rag-pipelines: multi-tier cache | Redis async -> Redis sync -> in-memory. Degrades through tiers on failure. | Gradual latency increase. Transparent to user. |
| hbf-stats: getOrganizationTimezone failure | Defaults to 'UTC' | Stats may show incorrect timezone. Silent, no error surfaced. |
| hbf-stats: updateTenantStats failure | Logs error, does not throw | Stats silently stale. Dashboard shows outdated numbers. |
| hbf-reports: generate_graphs failure | Caught, execution continues | Report generated without graphs. Partial output. |
| hbf-bot: RAG pipeline failure | Silently returns undefined | Bot response may lack RAG-augmented context. No error shown to end user. |
| hbf-broadcast: individual send failure | Uses Promise.allSettled() | One recipient failure does not block others. Partial delivery. |
| hbf-console: 401 response | Checks refresh token, clears storage, redirects to login | User is logged out. Session state lost. |
| hbf-webchat: token expiry | Tracks expiry for reconnection | Automatic reconnection attempt. Brief interruption possible. |
| hbf-core-api: permanent network failure | Returns HBFCoreApiResponse with status 503 | Caller receives structured error. User sees failure depending on caller handling. |
| hbf-core: any unhandled exception | ResponseExceptionHandler (@ControllerAdvice) | Standardized error response. Prevents raw stack traces. |
| semantic-doc-segmenter: article tagging LLM failure | Returns empty tags list, applies "Other" tag | Articles tagged as "Other" instead of meaningful categories |
| semantic-doc-segmenter: callback POST failure | Logged, job proceeds to complete | Caller never receives results; must poll GET endpoint |
| semantic-doc-segmenter: Gemini failure | Exception propagates, job marked FAILED | Job fails entirely if Gemini was selected parser |
| open-bot-framework: Redis unavailable at startup | Falls back to in-memory atomic counter manager | Activity watermarks non-durable; IDs will collide across instances |
| open-bot-framework: bot endpoint unreachable on activity POST | BadRequestException with error message | User receives 400 error |
| open-bot-framework: WebSocket client not registered after 3 retries | Transcript dropped, warning logged | Bot reply silently lost; user never receives message |
| hbf-knowledge-manager: webhook processing error after 200 ACK | Error caught and logged; Event Grid is not retried (ACK was sent) | File change permanently lost with no replay mechanism |
| hbf-knowledge-manager: no integration/KB found for webhook key | Returns early, logs warning/debug | No sync occurs. Silent. |
| hbf-knowledge-manager fullSync: per-file download/upload fails | Error added to SyncResult.errors[], continues to next file | Partial sync; caller receives error list in SyncResult |
| hbf-lcg: leader node failure | Leader election detects missing heartbeat; a follower is promoted automatically | Brief window of no GatewayCleaner execution during re-election |
| hbf-lcg: expired sessions (Zendesk, Genesys) | GatewayCleaner cron (leader-only) deletes sessions past TTL | Sessions silently removed; no user notification |
| hbf-lcg: Genesys WebSocket disconnection | Distributed coordination selects one node to own reconnection; others stand by | Brief message gap during reconnect; no mechanism to recover messages missed during the gap |
| hbf-lcg: pending Genesys session not established within 120s | GatewayCleaner removes pending session entry | Session silently abandoned; caller must retry from scratch |
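As one concrete illustration of the table above, a sketch of the open-bot-framework-style startup fallback from Redis to an in-memory counter; the interface and class names are hypothetical, not the service's actual implementation:

```typescript
import { createClient } from "redis";

// Both backends expose the same tiny interface so callers don't care which is active.
interface AtomicCounter {
  next(key: string): Promise<number>;
}

class RedisCounter implements AtomicCounter {
  constructor(private client: ReturnType<typeof createClient>) {}
  next(key: string): Promise<number> {
    return this.client.incr(key);
  }
}

class InMemoryCounter implements AtomicCounter {
  private counts = new Map<string, number>();
  async next(key: string): Promise<number> {
    const value = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, value); // not durable, not shared across instances
    return value;
  }
}

// At startup: prefer Redis, fall back to memory if the connection fails.
export async function createCounter(url: string): Promise<AtomicCounter> {
  try {
    const client = createClient({ url, socket: { connectTimeout: 2000 } });
    await client.connect();
    return new RedisCounter(client);
  } catch {
    console.warn("Redis unavailable, falling back to in-memory counter");
    return new InMemoryCounter();
  }
}
```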
## Health Checks
| Service | Endpoint | Checks |
|---|---|---|
| hbf-core | /actuator/health | Spring Actuator (checks DB, disk, etc.) |
| hbf-core | /{tenant}/health-check | Tenant-scoped health |
| hbf-lcm | GET /health | Returns empty 200. No dependency checks. |
| hbf-event-publisher | GET / | Returns service name + version |
| hbf-event-publisher | GET /health | Stub, returns void |
| hbf-media-manager | GET /health | Returns 200. No dependency checks. |
| hbf-bot | None | No health endpoint |
| hbf-nlp | None | No health endpoint |
| helvia-rag-pipelines | None | No health endpoint |
| hbf-session-manager | None | No health endpoint |
| hbf-notifications | None | No health endpoint |
| hbf-client-integrations | None | No health endpoint |
| hbf-broadcast | None | No health endpoint |
| hbf-data-retention | None | No health endpoint |
| hbf-stats | None | No health endpoint |
| hbf-reports | None | No health endpoint |
| hbf-webchat | N/A | Frontend widget, not applicable |
| hbf-console | N/A | Frontend SPA, not applicable |
| semantic-doc-segmenter | GET /debug/health | Job counts (pending, processing, failed 24h). No dependency checks. |
| semantic-doc-segmenter | GET /debug/health_with_details | Same + detailed job lists. No dependency checks. |
| open-bot-framework | None | GET / returns "Hello World!". No health endpoint. |
| hbf-data-manager | GET / and GET /health | Returns status, timestamp, and uptime seconds. No dependency checks (no DB or Kafka connectivity check). |
| hbf-knowledge-manager | None | No health endpoint. |
| hbf-lcg | None | No health endpoint. |
18 of 19 backend services lack a health endpoint or have only shallow (no-dependency) checks. Only hbf-core has a meaningful health check via Spring Actuator.
## Gaps & Recommendations
### Critical (service outage risk)

- **No circuit breakers anywhere.** A single downstream failure cascades to all callers. Every service hammers failing dependencies at full rate.
  - Recommendation: Add circuit breakers (e.g., opossum for Node.js, resilience4j for Java) to hbf-core, hbf-bot, and helvia-rag-pipelines as a starting priority. These are the highest-fan-out services.
- **hbf-core-api has no timeout.** The shared HTTP library used across many services has no default timeout. A hung downstream will hold connections and pending requests open indefinitely.
  - Recommendation: Set a default timeout (e.g., 30s) in hbf-core-api. Allow per-call overrides.
- **hbf-core-api only retries on network errors.** Transient HTTP 502/503/429 responses are never retried. Brief restarts or load-balancer errors cause immediate caller failure.
  - Recommendation: Extend retry to cover 502, 503, and 429 (respecting the Retry-After header); a sketch follows this list.
- **hbf-data-retention has no graceful shutdown.** It runs as an infinite-loop daemon. Interruption mid-deletion can leave data in an inconsistent state.
  - Recommendation: Add SIGTERM/SIGINT handlers to complete the current batch before exiting.
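A sketch of the recommended retry-condition extension for hbf-core-api, honouring Retry-After on 429/502/503; the function names are illustrative and the existing library internals may differ:

```typescript
import { AxiosError } from "axios";

const RETRYABLE_STATUSES = new Set([429, 502, 503]);

// Decide whether a failed call should be retried, and after how long (null = fail fast).
function retryDelayMs(err: AxiosError, attempt: number): number | null {
  if (!err.response) return backoff(attempt); // network error: keep current behaviour
  if (!RETRYABLE_STATUSES.has(err.response.status)) return null; // other HTTP errors: fail fast

  // Honour Retry-After when the server provides it (seconds form only in this sketch).
  const retryAfter = Number(err.response.headers["retry-after"]);
  return Number.isFinite(retryAfter) ? retryAfter * 1000 : backoff(attempt);
}

// Same backoff formula as the existing library: (e^attempt - 2*random) * 1000ms.
function backoff(attempt: number): number {
  return Math.max(0, (Math.exp(attempt) - 2 * Math.random()) * 1000);
}
```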
### High (degraded reliability)

- **18 of 19 backend services lack meaningful health checks.** Orchestrators (Kubernetes, load balancers) cannot detect unhealthy instances. Failed services continue receiving traffic.
  - Recommendation: Add /health endpoints that check critical dependencies (DB connectivity, Redis, downstream service reachability). Minimum: readiness and liveness probes for all services. A sketch follows this list.
- **Multiple services have no HTTP timeout.** hbf-notifications, hbf-broadcast, hbf-event-publisher, hbf-media-manager, hbf-stats, hbf-reports, hbf-data-retention, and hbf-nlp (raw calls) can all hang indefinitely on downstream calls.
  - Recommendation: Establish a platform-wide default timeout (e.g., 30s) and require explicit opt-in for longer durations.
- **helvia-rag-pipelines provider selection is round-robin, not failure-aware.** A failing LLM provider continues receiving requests in rotation.
  - Recommendation: Track provider health and skip or deprioritize providers with recent failures.
- **helvia-rag-pipelines has no timeout on vector DB operations.** Qdrant/Milvus queries can hang indefinitely.
  - Recommendation: Set a timeout (e.g., 10s) on all vector DB calls.
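A sketch of the recommended dependency-checking health endpoint for the Node services, using Express; the `checkMongo`/`checkRedis` probes are placeholders for each service's real clients:

```typescript
import express from "express";

const app = express();

// Placeholder probes: each service would wire in its real Mongo/Redis/downstream clients.
async function checkMongo(): Promise<boolean> { return true; }
async function checkRedis(): Promise<boolean> { return true; }

// Readiness-style /health: report dependency status and return 503 when any
// critical dependency is down, so orchestrators stop routing traffic here.
app.get("/health", async (_req, res) => {
  const [mongo, redis] = await Promise.all([checkMongo(), checkRedis()]);
  const healthy = mongo && redis;
  res.status(healthy ? 200 : 503).json({ status: healthy ? "ok" : "degraded", mongo, redis });
});

app.listen(3000); // placeholder port
```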
### Medium (operational visibility)

- **hbf-core timeouts are hardcoded.** Changing timeout values requires a code change and redeployment.
  - Recommendation: Move timeouts to environment variables with sensible defaults.
- **hbf-bot silently swallows RAG pipeline failures.** It returns undefined with no logging or metric. Debugging missing RAG context requires manual investigation.
  - Recommendation: Log RAG failures with correlation IDs. Emit a metric for monitoring.
- **Inconsistent retry strategies across services.** Retry attempts range from 0 to 5, and backoff strategies include fixed, exponential, exponential-with-jitter, and none. No platform standard exists.
  - Recommendation: Define a platform retry standard (e.g., 3 attempts, exponential backoff with jitter, retry on 5xx + network errors) and implement it via shared middleware or library configuration.
- **hbf-session-manager trainOne() retries without backoff.** Retries fire immediately, potentially overwhelming the target during recovery.
  - Recommendation: Add exponential backoff or, at minimum, a fixed delay between retries.
- **hbf-data-manager Kafka consumer has no dead-letter queue.** Messages that fail all 3 retries are logged and permanently dropped. There is no replay mechanism. The code itself has a `// TODO: Consider adding DLQ` comment acknowledging this gap.
  - Recommendation: Implement a DLQ (e.g., a separate Kafka topic `interaction-metadata.dlq`) and publish failed messages there instead of dropping them; a sketch follows this section.
- **hbf-data-manager Kafka broker connect retries forever.** `KafkajsConsumer.connect()` recursively retries with no upper bound, no alerting, and no backoff increase beyond the fixed 10s sleep. A permanently unreachable broker will loop silently indefinitely.
  - Recommendation: Cap broker connection retries (e.g., 10 attempts), then throw to allow the process to exit and be restarted by the orchestrator with proper alerting.
- **hbf-data-manager Kafka consumer retry uses fixed delay, not exponential backoff.** `ts-retry-promise` is configured with a fixed 1000ms delay for all 3 retries. Transient DB spikes will cause rapid retries.
  - Recommendation: Switch to exponential backoff (e.g., the `ts-retry-promise` `backoff: 'EXPONENTIAL'` option) to avoid hammering the DB during recovery.
- **hbf-knowledge-manager webhook processing has no retry or DLQ after ACK.** The controller sends HTTP 200 to Event Grid before processing. Any processing failure (hbf-core unreachable, Azure Blob download failure, etc.) is caught and logged, and the event is permanently lost. Event Grid will not retry because it received a 200.
  - Recommendation: Persist the raw event payload to a durable queue (e.g., Bull/Redis or a dedicated DB table) immediately on receipt, before sending the 200, and process from the queue with retry. This preserves the fire-and-forget ACK advantage while making processing recoverable.
- **hbf-knowledge-manager has no timeout on hbf-core API calls.** It inherits the hbf-core-api no-timeout default. A hung hbf-core call during webhook processing blocks the async processing task indefinitely. Because the 200 ACK is already sent, the user sees no error, but the pending request leaks server-side.
  - Recommendation: Set a per-call timeout (e.g., 30s) when constructing `HBFCoreApi` in `hbf-core.service.ts`, or await a platform-wide fix in hbf-core-api.
- **hbf-lcg has no health endpoint.** The service manages live gateway sessions and WebSocket connections but exposes no endpoint for orchestrators to detect failures.
  - Recommendation: Add a /health endpoint that checks Redis connectivity and reports leader election state.
- **hbf-lcg polling adapters use fixed intervals with no timeout on individual calls.** Cisco polls every 5s and Genesys checks inactive sessions every 10s, but individual poll calls have no timeout. A slow upstream will hold the polling loop indefinitely.
  - Recommendation: Add a per-call timeout to each adapter's polling fetch, shorter than the polling interval (e.g., 4s for Cisco, 8s for Genesys).
- **hbf-lcg Genesys WebSocket reconnection has no retry cap.** Distributed coordination picks one node to reconnect, but there is no documented upper bound on reconnection attempts or backoff. A permanently dead Genesys endpoint will loop indefinitely.
  - Recommendation: Apply capped exponential backoff (e.g., 5 attempts, max 60s delay) before marking the connection failed and alerting.
- **hbf-lcg session expiry is cleanup-only, with no caller notification.** Expired or abandoned sessions (Zendesk 3600s, Genesys 3600s inactive, 120s pending) are silently deleted by GatewayCleaner. No event is emitted to inform connected clients.
  - Recommendation: Emit a session-expired event (e.g., via Redis pub/sub or the existing microservice bus) so downstream consumers can react rather than discover expiry on the next request.
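A sketch of the DLQ recommendation for hbf-data-manager; the topic name follows the recommendation above, and the producer wiring is an assumption, not existing code:

```typescript
import { Kafka, KafkaMessage } from "kafkajs";

const kafka = new Kafka({ clientId: "hbf-data-manager", brokers: ["localhost:9092"] }); // placeholders
const dlqProducer = kafka.producer(); // dlqProducer.connect() must be awaited once at startup

// Called after the 3 per-message retries are exhausted, instead of dropping the message.
export async function publishToDlq(message: KafkaMessage, error: Error): Promise<void> {
  await dlqProducer.send({
    topic: "interaction-metadata.dlq",
    messages: [
      {
        key: message.key,
        value: message.value,
        headers: {
          "x-original-topic": "interaction-metadata",
          "x-error": error.message,
        },
      },
    ],
  });
}
```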