Resilience: hbf-bot

Timeout and retry policies, health endpoints, error handling patterns, and known gaps. For platform-wide resilience, see docs/architecture/resilience.md.

Outbound HTTP

Base HTTP client

File: app/util/HttpRequest.ts

HttpRequest is a thin wrapper around axios. It has no timeout, retry, or circuit-breaker configuration of its own. All calls resolve or reject directly.

Built-in HTTP action (conversation node)

File: app/util/buildInFunctions.ts

The only place with explicit retry logic is the built-in HTTP node action:

| Property | Value |
| --- | --- |
| Default timeout | 10 000 ms (configurable per node in seconds via `timeout` param) |
| Max attempts | configurable per node (default: 1, i.e., no retry) |
| Retry condition | 5xx, timeout (`ECONNABORTED`), or no response (network error) |
| No-retry condition | 4xx client errors |
| Backoff | none (immediate retry) |

```typescript
timeout: requestContext.timeout ? requestContext.timeout * 1000 : 10000,
// Retry on 5xx, timeout, or no response; never on 4xx
```
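The policy above can be sketched as a standalone helper. This is a simplified reconstruction, not the actual buildInFunctions.ts code; the axios-style error shape (a `response` field for HTTP errors, `code: "ECONNABORTED"` for timeouts) is assumed:

```typescript
// Minimal shape of an axios-style error: `response` is present for HTTP
// errors, `code` is "ECONNABORTED" on timeout, and neither is set for a
// plain network failure.
interface HttpError {
  response?: { status: number };
  code?: string;
}

function isRetryable(err: HttpError): boolean {
  if (err.response) {
    // Retry server errors (5xx), never client errors (4xx).
    return err.response.status >= 500;
  }
  // Timeout (ECONNABORTED) or no response at all (network error).
  return true;
}

// Retries `request` up to `maxAttempts` times with no backoff,
// mirroring the built-in HTTP node's policy.
async function requestWithRetry<T>(
  request: () => Promise<T>,
  maxAttempts = 1,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await request();
    } catch (err) {
      lastError = err;
      if (!isRetryable(err as HttpError)) throw err;
      // No backoff: the next attempt starts immediately.
    }
  }
  throw lastError;
}
```

Note that with the default of one attempt the loop runs once and the helper behaves like a plain call, matching the "no retry" default in the table.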

hbf-core-api calls

No explicit timeout or retry configuration. Failure propagates as a thrown exception, which is caught at the HBFTenant.processEvent level and logged. The subscriber update is then skipped for that request.

hbf-nlp calls

No timeout or retry. ExternalNLU wraps the call in a try/catch; on error it logs and the intent is left unset, which causes the conversation flow to fall through to a fallback node.

LiveChatGatewayClient

File: app/clients/livechat/LiveChatGatewayClient.ts

No timeout or retry. All methods catch errors, log them, and re-throw. Callers must handle the thrown error.

hbf-event-publisher

File: app/util/EventPublisherClient.ts

Wrapped in try/catch at handleTriggerFlows. On error: logs and swallows -- publisher failures are silent.

RAG pipelines

File: app/clients/rag/RagPipelinesClient.ts

Known gap: on any error, RagPipelinesClient.search logs the error and returns undefined. The caller receives undefined with no exception. Downstream code that does not explicitly check for undefined will silently produce an empty result.
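Until the gap is fixed at the source, callers can guard against it with a thin wrapper that makes the failure explicit. A sketch under assumptions: `RagHit` and `searchOrThrow` are hypothetical names, and the real `RagPipelinesClient.search` signature may differ:

```typescript
// Hypothetical result shape; the real RagPipelinesClient types may differ.
interface RagHit {
  text: string;
  score: number;
}

interface RagSearchClient {
  search(query: string): Promise<RagHit[] | undefined>;
}

// Defensive wrapper: turns the silent `undefined` into an explicit error
// so downstream code cannot mistake a failed search for an empty result.
async function searchOrThrow(
  client: RagSearchClient,
  query: string,
): Promise<RagHit[]> {
  const hits = await client.search(query);
  if (hits === undefined) {
    throw new Error(`RAG search failed for query: ${query}`);
  }
  return hits;
}
```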

Kafka producer

File: app/kafka/KafkaEventPublisher.ts

| Property | Value |
| --- | --- |
| Initial retry time | 300 ms |
| Multiplier | 2x |
| Jitter factor | 0.2 |
| Max retry time | 30 000 ms |
| Max retries | 5 |
| Request timeout | 30 000 ms |
| Producer mode | idempotent |

Connection is lazy and attempted once. If the broker is unreachable at first publish, the producer is not created and all subsequent publishes are silently skipped. Publish errors are logged but not propagated.
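The retrier settings in the table imply a delay schedule like the one below. This is an illustration of the policy (one common formulation of exponential backoff with symmetric jitter), not KafkaJS's internal code:

```typescript
// Sketch of an exponential backoff schedule matching the table's
// parameters: 300 ms initial, 2x multiplier, 0.2 jitter, 30 s cap.
function backoffDelays(
  maxRetries: number,
  initialMs = 300,
  multiplier = 2,
  maxMs = 30_000,
  jitterFactor = 0.2,
  random: () => number = Math.random,
): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const base = Math.min(initialMs * multiplier ** attempt, maxMs);
    // Jitter spreads each delay across [base * 0.8, base * 1.2].
    const jitter = 1 + jitterFactor * (2 * random() - 1);
    delays.push(Math.round(base * jitter));
  }
  return delays;
}
```

With jitter neutralized (`random` fixed at 0.5) the five delays are 300, 600, 1200, 2400 and 4800 ms, so the 30 s cap only matters for longer retry chains.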

Redis (BotDeploymentCache)

File: app/system/storage/BotDeploymentCache.ts

| Property | Value |
| --- | --- |
| Retry strategy | exponential, 1 s * attempt, capped at 30 s |
| Max retries per request | 1 |
| Disconnect detection | tracked via `connect` / `error` events on the ioredis client |
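The retry strategy in the table corresponds to an ioredis `retryStrategy` callback along these lines. `retryStrategy` and `maxRetriesPerRequest` are standard ioredis options; the concrete values are taken from the table, and the surrounding option object is a sketch, not the file's actual contents:

```typescript
// ioredis-style options matching the table: backoff grows by
// 1 s per attempt, capped at 30 s, with at most one retry per request.
const redisOptions = {
  retryStrategy: (times: number): number => Math.min(times * 1000, 30_000),
  maxRetriesPerRequest: 1,
};
```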

HBFCoreBotDeploymentStorage.safeCacheOperation wraps every cache call:

  • If Redis is disconnected: logs a warning, calls the optional fallbackOperation (or returns undefined).
  • If the cache operation throws: logs error, calls fallbackOperation if provided.

Fallback for cache misses: fetch BotDeployment directly from hbf-core. This means the service degrades gracefully when Redis is down -- tenant loading continues at the cost of extra hbf-core calls.
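The wrapper described above can be sketched as follows. This is a simplified reconstruction; the real signature in HBFCoreBotDeploymentStorage may differ, and the `CacheState` flag stands in for whatever connectivity tracking the ioredis event handlers maintain:

```typescript
// Connectivity flag maintained from the ioredis client's
// connect/error events.
interface CacheState {
  connected: boolean;
}

// Runs `operation` only when Redis is reachable; on disconnect or on a
// thrown error, falls back to `fallbackOperation` (e.g. a direct
// hbf-core fetch) or returns undefined.
async function safeCacheOperation<T>(
  state: CacheState,
  operation: () => Promise<T>,
  fallbackOperation?: () => Promise<T>,
): Promise<T | undefined> {
  if (!state.connected) {
    console.warn("Redis disconnected; using fallback");
    return fallbackOperation?.();
  }
  try {
    return await operation();
  } catch (err) {
    console.error("Cache operation failed", err);
    return fallbackOperation?.();
  }
}
```

The design choice worth noting: failures degrade to extra hbf-core traffic rather than to user-visible errors.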

Health Check Endpoints

| Endpoint | Method | Response | Notes |
| --- | --- | --- | --- |
| `/api/status` | GET | 200 `{ status: "Running", tenants: { loaded: N, data: [...] } }` | Always 200 if Express is running; does not probe Redis or hbf-core |
| `/` | GET | HTML status page | CORS-restricted to `homePageAllowedOrigin` |

There is no liveness/readiness probe that checks downstream dependencies. The /api/status endpoint reflects only whether the Express process is accepting requests and how many tenants are loaded in memory.
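If a dependency-aware readiness probe were added, it might take the shape below. This is a sketch only: `checkReadiness` and the per-dependency probe functions are hypothetical, not existing code:

```typescript
type Probe = () => Promise<void>;

interface ReadinessResult {
  ready: boolean;
  checks: Record<string, "up" | "down">;
}

// Runs each named probe (e.g. a Redis PING, a cheap hbf-core request);
// a probe that rejects marks its dependency "down" and the whole
// service not ready.
async function checkReadiness(
  probes: Record<string, Probe>,
): Promise<ReadinessResult> {
  const checks: Record<string, "up" | "down"> = {};
  for (const [name, probe] of Object.entries(probes)) {
    try {
      await probe();
      checks[name] = "up";
    } catch {
      checks[name] = "down";
    }
  }
  return {
    ready: Object.values(checks).every((s) => s === "up"),
    checks,
  };
}
```

An Express route would then return 200 when `ready` is true and 503 otherwise, which is the contract a Kubernetes readiness probe expects.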

Error Handling Patterns

Per-event isolation (HBFTenant.processEvent)

File: app/system/HBFTenant.ts

The entire event lifecycle is wrapped in a try/catch. On error:

  1. The error is logged with the tenant handle and error string.
  2. The ComponentFlow.FlowCycleCompleted signal is unbound for that event to avoid memory leaks.
  3. No reply is sent to the user.
  4. The subscriber state is NOT saved (the saveState call after the lifecycle is bypassed because the catch block returns early).

This means a crash mid-lifecycle leaves the subscriber's Redis blackboard in the last-successfully-saved state.
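The control flow behind steps 1-4 can be sketched with injected hooks. A simplified reconstruction of the pattern, not the actual HBFTenant code:

```typescript
// On any lifecycle error, the completion signal is unbound, the error
// is logged, and the catch block returns early, so saveState below is
// never reached.
async function processEvent(hooks: {
  runLifecycle: () => Promise<void>;
  unbindFlowCycleCompleted: () => void;
  saveState: () => Promise<void>;
  log: (msg: string) => void;
}): Promise<void> {
  try {
    await hooks.runLifecycle();
  } catch (err) {
    hooks.log(`event failed: ${String(err)}`);
    hooks.unbindFlowCycleCompleted(); // avoid leaking the signal binding
    return; // early return: saveState is skipped
  }
  await hooks.saveState();
}
```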

Subscriber state save

ConversationStateManager.saveState calls SubscribersClient.save. On failure (network error, hbf-core down): the exception propagates up to HBFTenant.processEvent's catch block and is logged. The HTTP response to the channel has already been sent (for async channels), so the user sees no error.

Tenant loading

HBFTenantsRepository.getByHandle wraps lazyGet in a try/catch. On failure: logs the error and returns undefined. The calling channel handler then treats the event as unroutable and drops it silently.

Channel handlers

BaseChannel.process handles events in a loop. Per-event errors are generally caught and logged at the component or lifecycle level. The channel always returns an HTTP 200 to the upstream platform (to prevent redelivery) regardless of whether processing succeeded.

Graceful Shutdown

There is no explicit graceful shutdown handler in the codebase. HbfApp.stop() calls this.server?.close(), which stops accepting new connections but does not:

  • Wait for in-flight requests to complete.
  • Flush the Kafka producer.
  • Close the Redis connection explicitly (Redis uses AsyncDisposable but the disposer is not invoked from stop()).

In practice the service relies on the container orchestrator (Kubernetes) to allow the process to drain via keepAliveTimeout (130 s) / headersTimeout (131 s) set on the HTTP server.
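A shutdown sequence addressing the three missing steps could be sketched as below. Everything here is hypothetical: none of this exists in the codebase today, and the commented-out wiring assumes the usual kafkajs (`producer.disconnect`) and ioredis (`redis.quit`) methods:

```typescript
// Runs shutdown steps in order, logging but not aborting on per-step
// failure, so one broken dependency cannot block the rest of the drain.
async function gracefulShutdown(
  steps: Array<{ name: string; close: () => Promise<void> }>,
): Promise<string[]> {
  const completed: string[] = [];
  for (const step of steps) {
    try {
      await step.close();
      completed.push(step.name);
    } catch (err) {
      console.error(`shutdown step ${step.name} failed`, err);
    }
  }
  return completed;
}

// Intended wiring (hypothetical): on SIGTERM, stop accepting
// connections first, then flush Kafka, then close Redis.
// process.once("SIGTERM", () =>
//   gracefulShutdown([
//     { name: "http", close: () => new Promise((r) => server.close(() => r())) },
//     { name: "kafka", close: () => kafkaProducer.disconnect() },
//     { name: "redis", close: () => redis.quit().then(() => undefined) },
//   ]).then(() => process.exit(0)),
// );
```

Ordering matters: closing the HTTP listener first prevents new events from arriving while the Kafka producer drains.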

Known Gaps

| Gap | Location | Impact |
| --- | --- | --- |
| RAG pipeline failure returns `undefined` silently | `RagPipelinesClient.search` | Semantic search node produces empty result; no error visible to the conversation designer |
| No timeout on hbf-core API calls | `@helvia/hbf-core-api` HTTP client | A slow hbf-core can block the event lifecycle indefinitely |
| No timeout on hbf-nlp calls | `ExternalNLU`, `GenerationRequestHandler` | Same as above for NLU / LLM generation |
| No health probe for downstream services | `StatusController` | K8s readiness probe cannot detect hbf-core or Redis outages |
| Subscriber state not saved on lifecycle crash | `HBFTenant.processEvent` catch block | User may experience conversation state rollback after an error |
| No graceful Kafka producer flush on shutdown | `KafkaEventPublisher` | In-flight Kafka messages may be lost on SIGTERM |