Resilience: hbf-bot
Timeout and retry policies, health endpoints, error handling patterns, and known gaps. For platform-wide resilience, see docs/architecture/resilience.md.
Outbound HTTP
Base HTTP client
File: app/util/HttpRequest.ts
HttpRequest is a thin wrapper around axios. It has no timeout, retry, or circuit-breaker configuration of its own. All calls resolve or reject directly.
Built-in HTTP action (conversation node)
File: app/util/buildInFunctions.ts
The only place with explicit retry logic is the built-in HTTP node action:
| Property | Value |
|---|---|
| Default timeout | 10 000 ms (configurable per node in seconds via timeout param) |
| Max attempts | configurable per node (default: 1, i.e., no retry) |
| Retry condition | 5xx, timeout (ECONNABORTED), or no response (network error) |
| No-retry condition | 4xx client errors |
| Backoff | none (immediate retry) |
```ts
timeout: requestContext.timeout ? requestContext.timeout * 1000 : 10000,
// Retry on 5xx, timeout, or no response; never on 4xx
```
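The retry policy in the table can be sketched as follows. This is a simplified reconstruction, not the actual buildInFunctions.ts code; `requestWithRetry` and `doRequest` are hypothetical names, and the axios-style error shape (`err.response?.status`, `err.code`) is assumed:

```typescript
// Simplified sketch of the built-in HTTP node's retry policy:
// retry on 5xx, timeout (ECONNABORTED), or no response; never on 4xx;
// no backoff between attempts.
async function requestWithRetry<T>(
  doRequest: () => Promise<T>,  // hypothetical request thunk
  maxAttempts: number = 1,      // default: no retry
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await doRequest();
    } catch (err: any) {
      lastError = err;
      const status = err?.response?.status;
      const retryable =
        (status !== undefined && status >= 500) || // 5xx server error
        err?.code === "ECONNABORTED" ||            // timeout
        !err?.response;                            // no response (network error)
      // 4xx client errors fail immediately.
      if (!retryable) throw err;
    }
  }
  throw lastError;
}
```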
hbf-core-api calls
No explicit timeout or retry configuration. Failure propagates as a thrown exception, which is caught at the HBFTenant.processEvent level and logged. The subscriber update is then skipped for that request.
hbf-nlp calls
No timeout or retry. ExternalNLU wraps the call in a try/catch; on error it logs and the intent is left unset, which causes the conversation flow to fall through to a fallback node.
LiveChatGatewayClient
File: app/clients/livechat/LiveChatGatewayClient.ts
No timeout or retry. All methods catch errors, log them, and re-throw. Callers must handle the thrown error.
hbf-event-publisher
File: app/util/EventPublisherClient.ts
Wrapped in try/catch at handleTriggerFlows. On error: logs and swallows -- publisher failures are silent.
RAG pipelines
File: app/clients/rag/RagPipelinesClient.ts
Known gap: on any error, RagPipelinesClient.search logs the error and returns undefined. The caller receives undefined with no exception. Downstream code that does not explicitly check for undefined will silently produce an empty result.
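Until this gap is closed, callers need an explicit guard. A minimal defensive pattern at the call site (the `RagSearchResult` shape and `toSafeResult` helper are assumptions for illustration):

```typescript
interface RagSearchResult { documents: string[] }

// Hypothetical caller-side guard around RagPipelinesClient.search,
// which returns undefined on any error instead of throwing.
function toSafeResult(result: RagSearchResult | undefined): RagSearchResult {
  if (result === undefined) {
    // Make the failure visible before degrading to an empty result,
    // so "search failed" is distinguishable from "search found nothing".
    console.warn("RAG search failed; falling back to empty result");
    return { documents: [] };
  }
  return result;
}
```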
Kafka producer
File: app/kafka/KafkaEventPublisher.ts
| Property | Value |
|---|---|
| Initial retry time | 300 ms |
| Multiplier | 2x |
| Jitter factor | 0.2 |
| Max retry time | 30 000 ms |
| Max retries | 5 |
| Request timeout | 30 000 ms |
| Producer mode | idempotent |
Connection is lazy and attempted once. If the broker is unreachable at first publish, the producer is not created and all subsequent publishes are silently skipped. Publish errors are logged but not propagated.
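With jitter set aside, the retry schedule from the table works out to roughly 300, 600, 1200, 2400, 4800 ms. A sketch of the backoff arithmetic (illustrative only; the exact kafkajs jitter formula may differ):

```typescript
// Sketch of exponential backoff with jitter matching the table above:
// initial 300 ms, multiplier 2x, jitter factor 0.2, capped at 30 000 ms,
// max 5 retries. Not the exact kafkajs implementation.
function backoffDelays(
  initialMs: number,   // 300
  multiplier: number,  // 2
  jitter: number,      // 0.2 -> each delay varies by up to +/-20%
  maxMs: number,       // 30000
  retries: number,     // 5
): number[] {
  const delays: number[] = [];
  let base = initialMs;
  for (let i = 0; i < retries; i++) {
    const randomized = base * (1 + jitter * (Math.random() * 2 - 1));
    delays.push(Math.min(Math.round(randomized), maxMs));
    base *= multiplier;
  }
  return delays;
}
```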
Redis (BotDeploymentCache)
File: app/system/storage/BotDeploymentCache.ts
| Property | Value |
|---|---|
| Retry strategy | exponential, 1 s * attempt, capped at 30 s |
| Max retries per request | 1 |
| Disconnect detection | tracked via connect / error events on the ioredis client |
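The table corresponds to ioredis options along these lines. `retryStrategy` and `maxRetriesPerRequest` are real ioredis option names; the surrounding wiring is assumed:

```typescript
// Sketch of the ioredis retry strategy implied by the table above:
// delay grows linearly with the reconnect attempt, capped at 30 s.
const retryStrategy = (times: number): number =>
  Math.min(times * 1000, 30_000); // 1 s * attempt, capped at 30 s

// Wiring sketch (requires ioredis, so shown as a comment):
// const client = new Redis({ retryStrategy, maxRetriesPerRequest: 1 });
// client.on("connect", () => { connected = true; });
// client.on("error", () => { connected = false; });
```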
HBFCoreBotDeploymentStorage.safeCacheOperation wraps every cache call:
- If Redis is disconnected: logs a warning, calls the optional `fallbackOperation` (or returns `undefined`).
- If the cache operation throws: logs the error, calls `fallbackOperation` if provided.
Fallback for cache misses: fetch BotDeployment directly from hbf-core. This means the service degrades gracefully when Redis is down -- tenant loading continues at the cost of extra hbf-core calls.
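The wrapper described above can be sketched generically. This is a reconstruction from the description, not the actual HBFCoreBotDeploymentStorage code; `isConnected` is an assumed stand-in for the connect/error event tracking:

```typescript
// Generic sketch of safeCacheOperation: never let a cache failure
// break the request path; degrade to the fallback instead.
async function safeCacheOperation<T>(
  isConnected: () => boolean,
  operation: () => Promise<T>,
  fallbackOperation?: () => Promise<T>,
): Promise<T | undefined> {
  if (!isConnected()) {
    console.warn("Redis disconnected; skipping cache operation");
    return fallbackOperation ? fallbackOperation() : undefined;
  }
  try {
    return await operation();
  } catch (err) {
    console.error("Cache operation failed", err);
    return fallbackOperation ? fallbackOperation() : undefined;
  }
}
```

In this service the fallback for cache misses fetches the BotDeployment directly from hbf-core, which is what makes the Redis outage survivable.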
Health Check Endpoints
| Endpoint | Method | Response | Notes |
|---|---|---|---|
| `/api/status` | GET | `200` with `{ status: "Running", tenants: { loaded: N, data: [...] } }` | Always 200 if Express is running; does not probe Redis or hbf-core |
| `/` | GET | HTML status page | CORS-restricted to `homePageAllowedOrigin` |
There is no liveness/readiness probe that checks downstream dependencies. The /api/status endpoint reflects only whether the Express process is accepting requests and how many tenants are loaded in memory.
Error Handling Patterns
Per-event isolation (HBFTenant.processEvent)
File: app/system/HBFTenant.ts
The entire event lifecycle is wrapped in a try/catch. On error:
- The error is logged with the tenant handle and error string.
- The `ComponentFlow.FlowCycleCompleted` signal is unbound for that event to avoid memory leaks.
- No reply is sent to the user.
- The subscriber state is NOT saved (the `saveState` call after the lifecycle is bypassed because the catch block returns early).
This means a crash mid-lifecycle leaves the subscriber's Redis blackboard in the last-successfully-saved state.
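The isolation described above can be sketched as follows (a reconstruction from the description; the parameter names are assumptions, not the actual HBFTenant.processEvent signature):

```typescript
// Sketch of per-event isolation in the event lifecycle:
// a crash is logged and contained, but state is not saved
// and no reply reaches the user.
async function processEvent(
  tenantHandle: string,
  runLifecycle: () => Promise<void>,
  unbindFlowCycleCompleted: () => void,
  saveState: () => Promise<void>,
): Promise<void> {
  try {
    await runLifecycle();
  } catch (err) {
    console.error(`[${tenantHandle}] event lifecycle failed: ${String(err)}`);
    unbindFlowCycleCompleted(); // avoid leaking the signal binding
    return; // early return: saveState below is bypassed
  }
  await saveState(); // only reached on success
}
```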
Subscriber state save
ConversationStateManager.saveState calls SubscribersClient.save. On failure (network error, hbf-core down): the exception propagates up to HBFTenant.processEvent's catch block and is logged. The HTTP response to the channel has already been sent (for async channels), so the user sees no error.
Tenant loading
HBFTenantsRepository.getByHandle wraps lazyGet in a try/catch. On failure: logs the error and returns undefined. The calling channel handler then treats the event as unroutable and drops it silently.
Channel handlers
BaseChannel.process handles events in a loop. Per-event errors are generally caught and logged at the component or lifecycle level. The channel always returns an HTTP 200 to the upstream platform (to prevent redelivery) regardless of whether processing succeeded.
Graceful Shutdown
There is no explicit graceful shutdown handler in the codebase. HbfApp.stop() calls this.server?.close(), which stops accepting new connections but does not:
- Wait for in-flight requests to complete.
- Flush the Kafka producer.
- Close the Redis connection explicitly (Redis uses `AsyncDisposable`, but the disposer is not invoked from `stop()`).
In practice the service relies on the container orchestrator (Kubernetes) to allow the process to drain via keepAliveTimeout (130 s) / headersTimeout (131 s) set on the HTTP server.
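A shutdown sequence covering the missing steps could look like the sketch below. All of the names are hypothetical; nothing like this exists in the codebase today, and the real implementation would wrap `server.close()`, the Kafka producer's disconnect, and the Redis disposer:

```typescript
// Hypothetical graceful shutdown covering the gaps listed above.
interface Closeable { close(): Promise<void> }

async function gracefulShutdown(
  stopAcceptingConnections: () => Promise<void>, // e.g. wraps server.close()
  kafkaProducer: Closeable,  // flush/disconnect the producer
  redis: Closeable,          // invoke the Redis disposer explicitly
): Promise<void> {
  await stopAcceptingConnections(); // stop taking new work first
  await kafkaProducer.close();      // flush in-flight Kafka messages
  await redis.close();              // release the Redis connection last
}
```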
Known Gaps
| Gap | Location | Impact |
|---|---|---|
| RAG pipeline failure returns `undefined` silently | RagPipelinesClient.search | Semantic search node produces empty result; no error visible to the conversation designer |
| No timeout on hbf-core API calls | @helvia/hbf-core-api HTTP client | A slow hbf-core can block the event lifecycle indefinitely |
| No timeout on hbf-nlp calls | ExternalNLU, GenerationRequestHandler | Same as above for NLU / LLM generation |
| No health probe for downstream services | StatusController | K8s readiness probe cannot detect hbf-core or Redis outages |
| Subscriber state not saved on lifecycle crash | HBFTenant.processEvent catch block | User may experience conversation state rollback after an error |
| No graceful Kafka producer flush on shutdown | KafkaEventPublisher | In-flight Kafka messages may be lost on SIGTERM |