
Resilience: hbf-data-manager

Error handling and retry patterns for this service. For platform-wide patterns, see docs/architecture/resilience.md.

HTTP Retry

  • Library: got (via internal HttpClientService)
  • Attempts: got default of 2 retries; retries are enabled for GET as well as POST/PUT/PATCH/DELETE
  • Backoff: got default exponential
  • Timeout: DISTRIBUTER_SERVICE_REQUEST_TIMEOUT env var or 5000ms default, applied to GET/POST/PUT/PATCH/DELETE. getBuffer() has no timeout.
  • On failure: throws HttpException with APM trace IDs. 409 Conflict on POST is handled as a non-error (returns body + code instead of throwing).

Note: HttpClientService is a copy of the shared pattern used in hbf-client-integrations. It does NOT use @helvia/hbf-core-api, so it does not inherit that library's retry-on-network-error policy.
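The 409-as-success behaviour above can be sketched as follows. This is illustrative only: HttpClientService internals are assumed, and `request` stands in for the underlying got call.

```typescript
// Illustrative sketch, not the real HttpClientService implementation.

interface HttpResult {
  statusCode: number;
  body: unknown;
}

// Hypothetical low-level request function standing in for got.
type RequestFn = (url: string) => Promise<HttpResult>;

// 2xx and 409 Conflict both resolve; any other status throws. In the real
// service the thrown error is an HttpException carrying APM trace IDs.
async function postAllowingConflict(
  request: RequestFn,
  url: string,
): Promise<HttpResult> {
  const res = await request(url);
  if ((res.statusCode >= 200 && res.statusCode < 300) || res.statusCode === 409) {
    return res; // 409 is returned to the caller (body + code), not thrown
  }
  throw new Error(`HTTP ${res.statusCode} from ${url}`);
}
```

The caller is expected to inspect the returned status code to distinguish a created resource from a conflict.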

Kafka Consumer Retry

| Stage | Retry config | Backoff | On permanent failure |
| --- | --- | --- | --- |
| Message processing (per message) | 3 retries via ts-retry-promise | Fixed 1000ms delay | Error logged, message dropped |
| Broker connect (recursive) | Unlimited retries | Fixed 10000ms sleep | None: retries indefinitely |

Key implementation details:

  • ts-retry-promise v0.8.1 wraps the onMessage handler inside KafkajsConsumer.consume().
  • On 3 consecutive handler failures, the error is caught and logged; the consumer continues to the next message. No dead-letter queue exists.
  • Connection retries in KafkajsConsumer.connect() are recursive with no upper bound. If Kafka is permanently unreachable the process will loop indefinitely with 10s sleep intervals.
  • kafka.consumer.ts contains a "// TODO: Consider adding DLQ for messages that fail after retries" comment, acknowledging the gap.
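The per-message policy described above can be sketched in plain TypeScript. This is a simplified stand-in for the ts-retry-promise call inside KafkajsConsumer.consume(); whether the attempt count includes the first try is an assumption of this sketch.

```typescript
// Sketch only: fixed-delay retry, then log-and-drop. The promise resolves
// either way, so the consumer always moves on to the next message.

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function processWithRetry<T>(
  handler: () => Promise<T>,
  log: (msg: string) => void,
  attempts = 3,   // matches the 3 retries documented above (simplified counting)
  delayMs = 1000, // fixed delay; no exponential backoff
): Promise<T | undefined> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await handler();
    } catch (err) {
      if (attempt === attempts) {
        // No DLQ: the failure is logged and the message is dropped.
        log(`message dropped after ${attempts} attempts: ${String(err)}`);
        return undefined;
      }
      await sleep(delayMs);
    }
  }
  return undefined;
}
```

Because the error is swallowed after the final attempt, a poison message costs roughly attempts x delayMs of latency but never blocks the partition.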

Timeouts

| Call | Timeout | Configured in |
| --- | --- | --- |
| hbf-core /users/me (GET) | DISTRIBUTER_SERVICE_REQUEST_TIMEOUT or 5000ms | HttpClientService.get() |
| Any POST/PUT/PATCH/DELETE via HttpClientService | DISTRIBUTER_SERVICE_REQUEST_TIMEOUT or 5000ms | Per-method in HttpClientService |
| getBuffer() binary fetch | None | No timeout configured |
| Kafka session timeout | KAFKA_SESSION_TIMEOUT_MS or 45000ms | KafkaConsumerService.onModuleInit() |
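Both env-var timeouts in the table resolve to their defaults the same way. A minimal sketch of the fallback logic (the helper name is assumed, not taken from the codebase):

```typescript
// Resolve a millisecond timeout from an env var, falling back to a default
// when the variable is unset or not a number. Both
// DISTRIBUTER_SERVICE_REQUEST_TIMEOUT (default 5000) and
// KAFKA_SESSION_TIMEOUT_MS (default 45000) follow this pattern.
function envTimeoutMs(
  env: Record<string, string | undefined>,
  name: string,
  defaultMs: number,
): number {
  const raw = env[name];
  const parsed = Number(raw);
  return raw !== undefined && Number.isFinite(parsed) ? parsed : defaultMs;
}
```

For example, envTimeoutMs(process.env, "KAFKA_SESSION_TIMEOUT_MS", 45000) could feed the sessionTimeout option of a kafkajs consumer.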

Circuit Breakers

None detected.

Fallback Strategy

| Failure scenario | Behaviour | User impact |
| --- | --- | --- |
| Kafka message fails 3 retries | Error logged, message dropped, consumer continues | Interaction metadata for that message is permanently lost; no replay possible |
| hbf-core /users/me unreachable | HttpException thrown, request fails | API caller receives 500 with APM trace IDs |
| DB write failure (saveMetadata) | TypeORM throws, caught by Kafka retry mechanism (counts as one attempt) | Same as Kafka message failure above |
| DB read failure (findByHbfEventId, etc.) | BadRequestException thrown | API caller receives 400 |

Graceful Shutdown

KafkaConsumerService implements OnApplicationShutdown. On NestJS shutdown signal, it disconnects all consumers (with per-consumer error catching). This is the strongest shutdown story in the platform for a Kafka consumer.
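The shutdown behaviour above can be sketched as follows. The interface is defined locally for illustration; the real service implements OnApplicationShutdown from @nestjs/common, and the class and method names here are assumptions.

```typescript
// Sketch: disconnect every consumer on shutdown, and do not let one
// consumer's failing disconnect prevent the others from being disconnected.

interface DisconnectableConsumer {
  name: string;
  disconnect(): Promise<void>;
}

class KafkaConsumerShutdownSketch {
  constructor(
    private readonly consumers: DisconnectableConsumer[],
    private readonly log: (msg: string) => void,
  ) {}

  // Mirrors NestJS's OnApplicationShutdown hook signature.
  async onApplicationShutdown(signal?: string): Promise<void> {
    this.log(`shutting down (${signal ?? "no signal"})`);
    for (const consumer of this.consumers) {
      try {
        await consumer.disconnect();
      } catch (err) {
        // Per-consumer error catching: log and keep going.
        this.log(`failed to disconnect ${consumer.name}: ${String(err)}`);
      }
    }
  }
}
```

Awaiting each disconnect sequentially keeps the sketch simple; disconnecting in parallel with Promise.allSettled would be an equally valid shape.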

Known Gaps

  • No dead-letter queue for Kafka messages that fail after 3 retries. Failed messages are logged and silently dropped.
  • KafkajsConsumer.connect() retries forever on broker unavailability — no circuit breaker, no max-attempt cap, no alerting.
  • getBuffer() has no timeout configured.
  • ts-retry-promise uses fixed 1000ms delay, not exponential backoff.
  • No health endpoint for Kafka broker connectivity. GET /health returns process uptime only; does not check DB or Kafka reachability.
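The unbounded connect loop called out above can be sketched as follows. This is an assumed shape of KafkajsConsumer.connect(), not the actual code.

```typescript
// Sketch: recursive reconnect with a fixed sleep and no attempt cap.
// If the broker is permanently unreachable, this loops forever.

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function connectWithUnboundedRetry(
  tryConnect: () => Promise<void>,
  log: (msg: string) => void,
  sleepMs = 10000, // fixed 10s sleep between attempts, per the docs above
): Promise<void> {
  try {
    await tryConnect();
  } catch (err) {
    log(`connect failed, retrying in ${sleepMs}ms: ${String(err)}`);
    await sleep(sleepMs);
    return connectWithUnboundedRetry(tryConnect, log, sleepMs); // no cap
  }
}
```

A capped variant would thread an attempt counter through the recursion and fail fast (or alert) once it is exceeded, which is the gap the bullet above describes.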