Resilience: hbf-bot
Timeout and retry policies, health endpoints, error handling patterns, and known gaps. For platform-wide resilience, see docs/architecture/resilience.md.
Outbound HTTP
Base HTTP client
File: app/util/HttpRequest.ts
HttpRequest is a thin wrapper around axios. It has no timeout, retry, or circuit-breaker configuration of its own. All calls resolve or reject directly.
Built-in HTTP action (conversation node)
File: app/util/buildInFunctions.ts
The only place with explicit retry logic is the built-in HTTP node action:
| Property | Value |
|---|---|
| Default timeout | 10 000 ms (configurable per node in seconds via timeout param) |
| Max attempts | configurable per node (default: 1, i.e., no retry) |
| Retry condition | 5xx, timeout (ECONNABORTED), or no response (network error) |
| No-retry condition | 4xx client errors |
| Backoff | none (immediate retry) |
```ts
timeout: requestContext.timeout ? requestContext.timeout * 1000 : 10000,
// Retry on 5xx, timeout, or no response; never on 4xx
```
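The retry policy in the table can be sketched as follows. This is a simplified reconstruction, not the actual buildInFunctions.ts code; `requestWithRetry` and `doRequest` are hypothetical names, and the axios-style error shape (`err.response?.status`, `err.code`) is assumed:

```typescript
// Simplified sketch of the built-in HTTP node's retry policy:
// retry on 5xx, timeout (ECONNABORTED), or no response; never on 4xx;
// no backoff between attempts.
async function requestWithRetry<T>(
  doRequest: () => Promise<T>,  // hypothetical request thunk
  maxAttempts: number = 1,      // default: no retry
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await doRequest();
    } catch (err: any) {
      lastError = err;
      const status = err?.response?.status;
      const retryable =
        (status !== undefined && status >= 500) || // 5xx server error
        err?.code === "ECONNABORTED" ||            // timeout
        !err?.response;                            // no response (network error)
      // 4xx client errors fail immediately.
      if (!retryable) throw err;
    }
  }
  throw lastError;
}
```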
hbf-core-api calls
No explicit timeout or retry configuration. Failure propagates as a thrown exception, which is caught at the HBFTenant.processEvent level and logged. The subscriber update is then skipped for that request.
hbf-nlp calls
No timeout or retry. ExternalNLU wraps the call in a try/catch; on error it logs and the intent is left unset, which causes the conversation flow to fall through to a fallback node.
LiveChatGatewayClient
File: app/clients/livechat/LiveChatGatewayClient.ts
No timeout or retry. All methods catch errors, log them, and re-throw. Callers must handle the thrown error.
hbf-event-publisher
File: app/util/EventPublisherClient.ts
Wrapped in try/catch at handleTriggerFlows. On error: logs and swallows -- publisher failures are silent.
RAG pipelines
File: app/clients/rag/RagPipelinesClient.ts
Known gap: on any error, RagPipelinesClient.search logs the error and returns undefined. The caller receives undefined with no exception. Downstream code that does not explicitly check for undefined will silently produce an empty result.
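Until this gap is closed, callers need an explicit guard. A minimal defensive pattern at the call site (the `RagSearchResult` shape and `toSafeResult` helper are assumptions for illustration):

```typescript
interface RagSearchResult { documents: string[] }

// Hypothetical caller-side guard around RagPipelinesClient.search,
// which returns undefined on any error instead of throwing.
function toSafeResult(result: RagSearchResult | undefined): RagSearchResult {
  if (result === undefined) {
    // Make the failure visible before degrading to an empty result,
    // so "search failed" is distinguishable from "search found nothing".
    console.warn("RAG search failed; falling back to empty result");
    return { documents: [] };
  }
  return result;
}
```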
Kafka producer
File: app/kafka/KafkaEventPublisher.ts
| Property | Value |
|---|---|
| Initial retry time | 300 ms |
| Multiplier | 2x |
| Jitter factor | 0.2 |
| Max retry time | 30 000 ms |
| Max retries | 5 |
| Request timeout | 30 000 ms |
| Producer mode | idempotent |
Connection is lazy and attempted once. If the broker is unreachable at first publish, the producer is not created and all subsequent publishes are silently skipped. Publish errors are logged but not propagated.
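With jitter set aside, the retry schedule from the table works out to roughly 300, 600, 1200, 2400, 4800 ms. A sketch of the backoff arithmetic (illustrative only; the exact kafkajs jitter formula may differ):

```typescript
// Sketch of exponential backoff with jitter matching the table above:
// initial 300 ms, multiplier 2x, jitter factor 0.2, capped at 30 000 ms,
// max 5 retries. Not the exact kafkajs implementation.
function backoffDelays(
  initialMs: number,   // 300
  multiplier: number,  // 2
  jitter: number,      // 0.2 -> each delay varies by up to +/-20%
  maxMs: number,       // 30000
  retries: number,     // 5
): number[] {
  const delays: number[] = [];
  let base = initialMs;
  for (let i = 0; i < retries; i++) {
    const randomized = base * (1 + jitter * (Math.random() * 2 - 1));
    delays.push(Math.min(Math.round(randomized), maxMs));
    base *= multiplier;
  }
  return delays;
}
```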
Redis (BotDeploymentCache)
File: app/system/storage/BotDeploymentCache.ts
| Property | Value |
|---|---|
| Retry strategy | exponential, 1 s * attempt, capped at 30 s |
| Max retries per request | 1 |
| Disconnect detection | tracked via connect / error events on the ioredis client |
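The table corresponds to ioredis options along these lines. `retryStrategy` and `maxRetriesPerRequest` are real ioredis option names; the surrounding wiring is assumed:

```typescript
// Sketch of the ioredis retry strategy implied by the table above:
// delay grows linearly with the reconnect attempt, capped at 30 s.
const retryStrategy = (times: number): number =>
  Math.min(times * 1000, 30_000); // 1 s * attempt, capped at 30 s

// Wiring sketch (requires ioredis, so shown as a comment):
// const client = new Redis({ retryStrategy, maxRetriesPerRequest: 1 });
// client.on("connect", () => { connected = true; });
// client.on("error", () => { connected = false; });
```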
HBFCoreBotDeploymentStorage.safeCacheOperation wraps every cache call:
- If Redis is disconnected: logs a warning, calls the optional `fallbackOperation` (or returns `undefined`).
- If the cache operation throws: logs the error, calls `fallbackOperation` if provided.
Fallback for cache misses: fetch BotDeployment directly from hbf-core. This means the service degrades gracefully when Redis is down -- tenant loading continues at the cost of extra hbf-core calls.
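The wrapper described above can be sketched generically. This is a reconstruction from the description, not the actual HBFCoreBotDeploymentStorage code; `isConnected` is an assumed stand-in for the connect/error event tracking:

```typescript
// Generic sketch of safeCacheOperation: never let a cache failure
// break the request path; degrade to the fallback instead.
async function safeCacheOperation<T>(
  isConnected: () => boolean,
  operation: () => Promise<T>,
  fallbackOperation?: () => Promise<T>,
): Promise<T | undefined> {
  if (!isConnected()) {
    console.warn("Redis disconnected; skipping cache operation");
    return fallbackOperation ? fallbackOperation() : undefined;
  }
  try {
    return await operation();
  } catch (err) {
    console.error("Cache operation failed", err);
    return fallbackOperation ? fallbackOperation() : undefined;
  }
}
```

In this service the fallback for cache misses fetches the BotDeployment directly from hbf-core, which is what makes the Redis outage survivable.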
Health Check Endpoints
| Endpoint | Method | Response | Notes |
|---|---|---|---|
| `/api/status` | GET | `200` with `{ status: "Running", tenants: { loaded: N, data: [...] } }` | Always 200 if Express is running; does not probe Redis or hbf-core |
| `/` | GET | HTML status page | CORS-restricted to `homePageAllowedOrigin` |
There is no liveness/readiness probe that checks downstream dependencies. The /api/status endpoint reflects only whether the Express process is accepting requests and how many tenants are loaded in memory.
Error Handling Patterns
Per-event isolation (HBFTenant.processEvent)
File: app/system/HBFTenant.ts
The entire event lifecycle is wrapped in a try/catch. On error:
- The error is logged with the tenant handle and error string.
- The `ComponentFlow.FlowCycleCompleted` signal is unbound for that event to avoid memory leaks.
- No reply is sent to the user.
- The subscriber state is NOT saved (the `saveState` call after the lifecycle is bypassed because the catch block returns early).
This means a crash mid-lifecycle leaves the subscriber's Redis blackboard in the last-successfully-saved state.
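The isolation described above can be sketched as follows (a reconstruction from the description; the parameter names are assumptions, not the actual HBFTenant.processEvent signature):

```typescript
// Sketch of per-event isolation in the event lifecycle:
// a crash is logged and contained, but state is not saved
// and no reply reaches the user.
async function processEvent(
  tenantHandle: string,
  runLifecycle: () => Promise<void>,
  unbindFlowCycleCompleted: () => void,
  saveState: () => Promise<void>,
): Promise<void> {
  try {
    await runLifecycle();
  } catch (err) {
    console.error(`[${tenantHandle}] event lifecycle failed: ${String(err)}`);
    unbindFlowCycleCompleted(); // avoid leaking the signal binding
    return; // early return: saveState below is bypassed
  }
  await saveState(); // only reached on success
}
```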
Subscriber state save
ConversationStateManager.saveState calls SubscribersClient.save. On failure (network error, hbf-core down): the exception propagates up to HBFTenant.processEvent's catch block and is logged. The HTTP response to the channel has already been sent (for async channels), so the user sees no error.
Tenant loading
HBFTenantsRepository.getByHandle wraps lazyGet in a try/catch. On failure: logs the error and returns undefined. The calling channel handler then treats the event as unroutable and drops it silently.
Channel handlers
BaseChannel.process handles events in a loop. Per-event errors are generally caught and logged at the component or lifecycle level. The channel always returns an HTTP 200 to the upstream platform (to prevent redelivery) regardless of whether processing succeeded.
Graceful Shutdown
There is no explicit graceful shutdown handler in the codebase. HbfApp.stop() calls this.server?.close(), which stops accepting new connections but does not:
- Wait for in-flight requests to complete.
- Flush the Kafka producer.
- Close the Redis connection explicitly (Redis uses `AsyncDisposable`, but the disposer is not invoked from `stop()`).
In practice the service relies on the container orchestrator (Kubernetes) to allow the process to drain via keepAliveTimeout (130 s) / headersTimeout (131 s) set on the HTTP server.
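A shutdown sequence covering the missing steps could look like the sketch below. All of the names are hypothetical; nothing like this exists in the codebase today, and the real implementation would wrap `server.close()`, the Kafka producer's disconnect, and the Redis disposer:

```typescript
// Hypothetical graceful shutdown covering the gaps listed above.
interface Closeable { close(): Promise<void> }

async function gracefulShutdown(
  stopAcceptingConnections: () => Promise<void>, // e.g. wraps server.close()
  kafkaProducer: Closeable,  // flush/disconnect the producer
  redis: Closeable,          // invoke the Redis disposer explicitly
): Promise<void> {
  await stopAcceptingConnections(); // stop taking new work first
  await kafkaProducer.close();      // flush in-flight Kafka messages
  await redis.close();              // release the Redis connection last
}
```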
Known Gaps
| Gap | Location | Impact |
|---|---|---|
| RAG pipeline failure returns `undefined` silently | RagPipelinesClient.search | Semantic search node produces empty result; no error visible to the conversation designer |
| No timeout on hbf-core API calls | @helvia/hbf-core-api HTTP client | A slow hbf-core can block the event lifecycle indefinitely |
| No timeout on hbf-nlp calls | ExternalNLU, GenerationRequestHandler | Same as above for NLU / LLM generation |
| No health probe for downstream services | StatusController | K8s readiness probe cannot detect hbf-core or Redis outages |
| Subscriber state not saved on lifecycle crash | HBFTenant.processEvent catch block | User may experience conversation state rollback after an error |
| No graceful Kafka producer flush on shutdown | KafkaEventPublisher | In-flight Kafka messages may be lost on SIGTERM |