# Resilience: hbf-data-manager

Error handling and retry patterns for this service. Platform-wide patterns: `docs/architecture/resilience.md`.
## HTTP Retry

- Library: `got` (via internal `HttpClientService`)
- Attempts: got default (2 retries on GET; POST/PUT/PATCH/DELETE also enabled)
- Backoff: got default exponential
- Timeout: `DISTRIBUTER_SERVICE_REQUEST_TIMEOUT` env var or 5000ms default, applied to GET/POST/PUT/PATCH/DELETE. `getBuffer()` has no timeout.
- On failure: throws `HttpException` with APM trace IDs. 409 Conflict on POST is handled as a non-error (returns body + code instead of throwing).
Note: `HttpClientService` is a copy of the shared pattern used in hbf-client-integrations. It does NOT use `@helvia/hbf-core-api`, so it does not inherit that library's retry-on-network-error policy.
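The behaviour above can be sketched as follows. This is a hedged illustration, not the service's real code: the option shape follows got v12 (`timeout.request`, `retry.limit`, `retry.methods`), and `resolveRequestTimeout` / `handlePostResponse` are hypothetical helper names.

```typescript
// DISTRIBUTER_SERVICE_REQUEST_TIMEOUT env var, falling back to 5000ms.
function resolveRequestTimeout(env: Record<string, string | undefined>): number {
  const parsed = Number(env.DISTRIBUTER_SERVICE_REQUEST_TIMEOUT);
  return Number.isFinite(parsed) ? parsed : 5000;
}

// Option shape passed to got (v12 style). The real service would pass
// process.env instead of an empty object.
const gotOptions = {
  timeout: { request: resolveRequestTimeout({}) }, // GET/POST/PUT/PATCH/DELETE
  retry: {
    limit: 2, // got's default retry count
    // got's default retry list omits POST and PATCH; the service
    // enables retries for all five methods.
    methods: ['GET', 'POST', 'PUT', 'PATCH', 'DELETE'],
  },
};

// 409 Conflict on POST is treated as a non-error: the body and status
// code are returned instead of throwing (illustrative shape only).
function handlePostResponse(statusCode: number, body: unknown) {
  if (statusCode === 409 || statusCode < 400) {
    return { body, code: statusCode };
  }
  // The real service wraps this in an HttpException carrying APM trace IDs.
  throw new Error(`HTTP ${statusCode}`);
}
```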
## Kafka Consumer Retry
| Stage | Retry Config | Backoff | On permanent failure |
|---|---|---|---|
| Message processing (per message) | 3 retries via `ts-retry-promise` | Fixed 1000ms delay | Error logged, message dropped |
| Broker connect (recursive) | Unlimited retries | Fixed 10000ms sleep | No limit — retries indefinitely |
Key implementation details:
- `ts-retry-promise` v0.8.1 wraps the `onMessage` handler inside `KafkajsConsumer.consume()`.
- On 3 consecutive handler failures, the error is caught and logged; the consumer continues to the next message. No dead-letter queue exists.
- Connection retries in `KafkajsConsumer.connect()` are recursive with no upper bound. If Kafka is permanently unreachable the process will loop indefinitely with 10s sleep intervals.
- A `// TODO: Consider adding DLQ for messages that fail after retries` comment exists in `kafka.consumer.ts`, acknowledging the gap.
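Minimal sketches of the two retry behaviours described above. These are stand-ins, not the service's real code: `retryFixed` mimics `ts-retry-promise` with a fixed delay, and `connectWithRetry` mirrors the unbounded recursive reconnect in `KafkajsConsumer.connect()`.

```typescript
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Per-message handling: up to `retries` extra attempts, fixed delay between
// them. After the final failure the caller logs and drops the message (no DLQ).
async function retryFixed<T>(
  fn: () => Promise<T>,
  retries = 3,
  delayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < retries) await sleep(delayMs);
    }
  }
  throw lastError;
}

// Broker connect: recurses forever with a fixed sleep; nothing caps attempts.
async function connectWithRetry(
  connect: () => Promise<void>,
  sleepMs = 10_000,
): Promise<void> {
  try {
    await connect();
  } catch (err) {
    console.error('Kafka connect failed, retrying', err);
    await sleep(sleepMs);
    return connectWithRetry(connect, sleepMs);
  }
}
```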
## Timeouts

| Call | Timeout | Configured in |
|---|---|---|
| hbf-core `/users/me` (GET) | `DISTRIBUTER_SERVICE_REQUEST_TIMEOUT` or 5000ms | `HttpClientService.get()` |
| Any POST/PUT/PATCH/DELETE via `HttpClientService` | `DISTRIBUTER_SERVICE_REQUEST_TIMEOUT` or 5000ms | Per-method in `HttpClientService` |
| `getBuffer()` binary fetch | None | No timeout configured |
| Kafka session timeout | `KAFKA_SESSION_TIMEOUT_MS` or 45000ms | `KafkaConsumerService.onModuleInit()` |
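For the Kafka row, a hedged sketch of how the session timeout likely resolves. The helper name and group id are assumptions; `sessionTimeout` is the real kafkajs consumer option.

```typescript
// KAFKA_SESSION_TIMEOUT_MS env var, falling back to 45000ms.
function resolveSessionTimeout(env: Record<string, string | undefined>): number {
  const parsed = Number(env.KAFKA_SESSION_TIMEOUT_MS);
  return Number.isFinite(parsed) ? parsed : 45_000;
}

// Shape passed to kafkajs's kafka.consumer(...). The real service would
// pass process.env instead of an empty object; the groupId is illustrative.
const consumerConfig = {
  groupId: 'hbf-data-manager',
  sessionTimeout: resolveSessionTimeout({}),
};
```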
## Circuit Breakers
None detected.
## Fallback Strategy
| Failure scenario | Behaviour | User impact |
|---|---|---|
| Kafka message fails 3 retries | Error logged, message dropped, consumer continues | Interaction metadata for that message is permanently lost; no replay possible |
| hbf-core `/users/me` unreachable | `HttpException` thrown, request fails | API caller receives 500 with APM trace IDs |
| DB write failure (`saveMetadata`) | TypeORM throws, caught by Kafka retry mechanism (counts as one attempt) | Same as Kafka message failure above |
| DB read failure (`findByHbfEventId`, etc.) | `BadRequestException` thrown | API caller receives 400 |
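An illustrative version of the DB-read fallback row: any repository failure surfaces to the API caller as a 400. The exception class here is a stand-in matching NestJS's `BadRequestException`; the query shape and wrapper are assumptions, not the service's real code.

```typescript
// Stand-in for NestJS's BadRequestException (HTTP 400).
class BadRequestException extends Error {
  readonly status = 400;
}

// Hypothetical wrapper around a repository lookup such as findByHbfEventId.
async function findMetadataOrThrow<T>(
  query: () => Promise<T | null>,
  eventId: string,
): Promise<T> {
  try {
    const row = await query();
    if (row === null) throw new Error(`No metadata for event ${eventId}`);
    return row;
  } catch (err) {
    // Connection errors, bad queries, and missing rows all collapse to a 400.
    throw new BadRequestException(err instanceof Error ? err.message : String(err));
  }
}
```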
## Graceful Shutdown
`KafkaConsumerService` implements `OnApplicationShutdown`. On NestJS shutdown signal, it disconnects all consumers (with per-consumer error catching). This is the strongest shutdown story in the platform for a Kafka consumer.
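A sketch of that shutdown behaviour: every consumer is disconnected, and a failure on one does not prevent the others from closing. The interface is a simplification of kafkajs's `Consumer`; the function body is our assumption of the pattern, not the service's real implementation.

```typescript
interface KafkaConsumerLike {
  disconnect(): Promise<void>;
}

// Disconnect all consumers on the NestJS shutdown signal.
async function onApplicationShutdown(
  consumers: KafkaConsumerLike[],
): Promise<void> {
  await Promise.all(
    consumers.map(async (consumer) => {
      try {
        await consumer.disconnect();
      } catch (err) {
        // Per-consumer catch: log and keep shutting the rest down.
        console.error('Failed to disconnect consumer', err);
      }
    }),
  );
}
```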
## Known Gaps
- No dead-letter queue for Kafka messages that fail after 3 retries. Failed messages are logged and silently dropped.
- `KafkajsConsumer.connect()` retries forever on broker unavailability: no circuit breaker, no max-attempt cap, no alerting.
- `getBuffer()` has no timeout configured.
- `ts-retry-promise` uses a fixed 1000ms delay, not exponential backoff.
- No health endpoint for Kafka broker connectivity.
- `GET /health` returns process uptime only; does not check DB or Kafka reachability.