Resilience: open-bot-framework
Error handling and retry patterns for this service. Platform-wide patterns: docs/architecture/resilience.md
HTTP Retry
- Library: @nestjs/axios HttpService (raw axios, no shared retry wrapper; does NOT use hbf-core-api)
- Attempts: 1 (no retry)
- Backoff: None
- On failure: Caught in userReplyToConversation() and rethrown as BadRequestException, with the error message surfaced to the client. Bot endpoint failures are immediately visible to the user.
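The single-attempt behaviour described above can be sketched roughly as follows. This is a dependency-free illustration, not the service's code: `BadRequestLike` stands in for the NestJS BadRequestException, and `forwardOnce`, `errorMessage`, and the `post` callback are hypothetical names.

```typescript
// Hypothetical stand-in for NestJS BadRequestException (status 400).
class BadRequestLike {
  readonly status = 400;
  constructor(readonly message: string) {}
}

// Extracts a human-readable message from whatever was thrown.
function errorMessage(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

// One attempt only: no retry, no backoff. Any failure of the bot endpoint
// call is rethrown as a 400-style error carrying the original message.
async function forwardOnce<T>(post: () => Promise<T>): Promise<T> {
  try {
    return await post();
  } catch (err) {
    throw new BadRequestLike(errorMessage(err));
  }
}
```

Because nothing sits between the failed call and the rethrow, a transient network error looks the same to the client as a permanent one.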
Queue Retry
Not applicable. open-bot-framework does not use Bull or any message queue.
WebSocket Delivery Retry
DirectLineGateway.sendToConversation() retries delivery to a connected WebSocket client up to 3 times when the conversation socket is not yet registered (e.g., client connecting slowly).
| Attempt | Delay before attempt |
|---|---|
| 1 | immediate |
| 2 | 1000ms |
| 3 | 2000ms |
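Backoff is linear (not exponential): each wait is 1000 ms multiplied by the number of attempts already made, giving 1000 ms before attempt 2 and 2000 ms before attempt 3. After 3 failed attempts the transcript is dropped and a warning is logged. No persistence or replay mechanism exists.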
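A minimal sketch of the delivery schedule in the table above, assuming a `trySend` callback that returns true once the conversation socket is registered and the send went through. All names here (`delayBeforeAttempt`, `deliverWithLinearBackoff`, `Sleep`) are hypothetical; only the attempt count and the 1000 ms step come from this document.

```typescript
const MAX_ATTEMPTS = 3; // from the table above
const STEP_MS = 1000;   // linear step between attempts

// Delay to wait before the given 1-based attempt: 0, 1000, 2000 ms.
function delayBeforeAttempt(attempt: number): number {
  return STEP_MS * (attempt - 1);
}

type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Tries delivery up to MAX_ATTEMPTS times. Resolves to false when every
// attempt failed, which corresponds to the transcript being dropped.
async function deliverWithLinearBackoff(
  trySend: () => boolean,
  sleep: Sleep = realSleep,
): Promise<boolean> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const delay = delayBeforeAttempt(attempt);
    if (delay > 0) await sleep(delay);
    if (trySend()) return true;
  }
  console.warn(`dropping transcript after ${MAX_ATTEMPTS} failed delivery attempts`);
  return false;
}
```

Passing `sleep` as a parameter keeps the loop testable with a fake timer. In the worst case the caller waits 3000 ms in total before the transcript is dropped.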
Timeouts
| Call | Timeout | Configured in |
|---|---|---|
| Bot endpoint POST (activity forwarding) | 5000ms | directline-conversation.service.ts line 162 |
| Redis connect | 2000ms | atomic-operations.provider.ts (connectTimeout) |
| S3 upload | None | AWS SDK default (no explicit timeout) |
Circuit Breakers
None detected.
Redis Fallback
atomic-operations.provider.ts selects the atomic counter backend at startup:
- If ATOMIC_OPERATIONS_IMPLEMENTATION=memory or REDIS_URI is not set, uses the in-memory backend immediately.
- Attempts redis.connect() with connectTimeout: 2000ms.
- On connection failure, logs a warning and falls back to MemoryAtomicOperationsManager.
The in-memory fallback is explicitly labelled as development/emergency-only. Activity watermark counters lose durability and cross-instance consistency when it is active (single-instance only, resets on restart).
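The startup selection above can be illustrated with a small, simplified sketch. It is not the provider's actual code: `Env`, `shouldTryRedis`, and `selectBackend` are hypothetical names, and `connect` is a synchronous stand-in for redis.connect() (which is asynchronous in practice) that throws when the connection cannot be established.

```typescript
type Backend = "memory" | "redis";

interface Env {
  ATOMIC_OPERATIONS_IMPLEMENTATION?: string;
  REDIS_URI?: string;
}

// Step 1: Redis is only attempted when memory is not forced and a URI is set.
function shouldTryRedis(env: Env): boolean {
  return env.ATOMIC_OPERATIONS_IMPLEMENTATION !== "memory" && !!env.REDIS_URI;
}

// Steps 2 and 3: try to connect; on failure log a warning and fall back.
function selectBackend(
  env: Env,
  connect: () => void,
  warn: (msg: string) => void = (msg) => console.warn(msg),
): Backend {
  if (!shouldTryRedis(env)) return "memory";
  try {
    connect();
    return "redis";
  } catch (err) {
    warn(`Redis connection failed, falling back to in-memory backend: ${String(err)}`);
    return "memory";
  }
}
```

For example, `selectBackend({ REDIS_URI: "redis://localhost:6379" }, connect)` yields "redis" when `connect` succeeds and "memory" when it throws. The decision is made once at startup, so in this sketch a Redis instance that recovers later is not picked up.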
Exception Handling
A global AllExceptionsFilter (src/filters/exception.filter.ts) catches all unhandled exceptions:
- HttpException subclasses: status code preserved, message surfaced to caller as structured JSON { statusCode, timestamp, message, path, description }.
- Non-HTTP exceptions: status 500, full String(exception) as message, logged at error level.
- HttpException (non-500): logged at verbose level only.
No exceptions are silently swallowed at the controller level. Business-logic errors in services throw NestJS HTTP exceptions which propagate through the filter.
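The filter's mapping rules can be sketched as a pure function. This is an illustration under assumptions, not the filter's source: `HttpError` is a hypothetical stand-in for the NestJS HttpException, the `description` field is omitted because this document does not say how it is derived, and logging an HttpException with status 500 at error level is an assumption inferred from the non-500 rule.

```typescript
// Hypothetical stand-in for NestJS HttpException.
class HttpError {
  constructor(readonly status: number, readonly message: string) {}
}

type LogLevel = "verbose" | "error";

interface ErrorBody {
  statusCode: number;
  timestamp: string;
  message: string;
  path: string;
}

function toResponse(
  exception: unknown,
  path: string,
  now: Date = new Date(),
): { body: ErrorBody; level: LogLevel } {
  // Non-HTTP exceptions: status 500 and the full String(exception) as message.
  let statusCode = 500;
  let message = String(exception);
  if (exception instanceof HttpError) {
    // HTTP exceptions keep their own status code and message.
    statusCode = exception.status;
    message = exception.message;
  }
  // Only non-500 HTTP exceptions are logged at verbose level.
  // Assumption: everything else, including an HTTP 500, is logged at error.
  const level: LogLevel =
    exception instanceof HttpError && statusCode !== 500 ? "verbose" : "error";
  return {
    body: { statusCode, timestamp: now.toISOString(), message, path },
    level,
  };
}
```

Note that the non-HTTP branch passes the raw exception string to the caller, so internal error text can reach clients.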
Fallback Strategy
| Failure scenario | Behaviour | User impact |
|---|---|---|
| Bot endpoint unreachable / error on activity POST | BadRequestException thrown, AllExceptionsFilter returns 400 with error details | User receives explicit error response |
| Redis unreachable at startup | Falls back to in-memory atomic counter manager | Watermark counters non-durable; consistency lost across restarts or multiple instances |
| S3 upload failure | HttpException 500 thrown, bubbles to global filter | User receives 500 with error message |
| WebSocket client not connected after 3 retries | Transcript dropped, warning logged | Bot reply silently lost; user does not receive the message |
| Invalid/expired DirectLine token | UnauthorizedException or BadRequestException | 401/400 returned to caller |
Health Check
No dedicated health endpoint. GET / returns the string "Hello World!" (the NestJS scaffold default). There are no checks for database connectivity, Redis availability, or downstream reachability.
Known Gaps
- No HTTP retry on bot endpoint calls. A transient failure when forwarding a user activity to the bot results in an immediate error returned to the client.
- No timeout on S3 uploads. A hung S3 connection can stall the activity creation request indefinitely.
- No health endpoint. Orchestrators cannot detect unhealthy instances (DB disconnected, Redis down).
- WebSocket transcript drops are silent to the end user. The client receives no notification that the bot reply was lost; it simply never arrives.
- Redis fallback to in-memory is not safe for multi-instance deployments. Activity IDs will collide across instances.
- No circuit breaker on the bot endpoint. When the downstream bot is slow or failing, the service keeps accepting user-activity HTTP requests and holds each one for up to the full 5s timeout.
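To make the last gap concrete, the following is a minimal sketch of one way a circuit breaker could work. Nothing like it exists in the service today. The `CircuitBreaker` class, its thresholds, and the injected `now` clock are all hypothetical.

```typescript
type State = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: State = "closed";

  constructor(
    private readonly failureThreshold: number,
    private readonly resetAfterMs: number,
    private readonly now: () => number = () => Date.now(),
  ) {}

  // True when a call may proceed. After the cool-down an open breaker moves
  // to half-open and lets calls through until the next recorded result.
  allowRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetAfterMs) {
      this.state = "half-open";
    }
    return this.state !== "open";
  }

  // A success closes the breaker and clears the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  // A failure in half-open, or reaching the threshold, opens the breaker.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  currentState(): State {
    return this.state;
  }
}
```

A caller would check `allowRequest()` before the bot endpoint POST and fail fast when it returns false, instead of holding the request for the full timeout.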