
Resilience: hbf-data-manager

Error handling and retry patterns for this service. For platform-wide patterns, see docs/architecture/resilience.md.

HTTP Retry

  • Library: got (via internal HttpClientService)
  • Attempts: got default of 2 retries; retries are enabled for GET as well as POST/PUT/PATCH/DELETE
  • Backoff: got default exponential
  • Timeout: DISTRIBUTER_SERVICE_REQUEST_TIMEOUT env var or 5000ms default, applied to GET/POST/PUT/PATCH/DELETE. getBuffer() has no timeout.
  • On failure: throws HttpException with APM trace IDs. 409 Conflict on POST is handled as a non-error (returns body + code instead of throwing).

Note: HttpClientService is a copy of the shared pattern used in hbf-client-integrations. It does NOT use @helvia/hbf-core-api, so it does not inherit that library's retry-on-network-error policy.
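The 409-as-success behaviour above can be sketched as follows. This is illustrative only: HttpClientService internals are assumed, and `request` stands in for the underlying got call.

```typescript
// Illustrative sketch, not the real HttpClientService implementation.

interface HttpResult {
  statusCode: number;
  body: unknown;
}

// Hypothetical low-level request function standing in for got.
type RequestFn = (url: string) => Promise<HttpResult>;

// 2xx and 409 Conflict both resolve; any other status throws. In the real
// service the thrown error is an HttpException carrying APM trace IDs.
async function postAllowingConflict(
  request: RequestFn,
  url: string,
): Promise<HttpResult> {
  const res = await request(url);
  if ((res.statusCode >= 200 && res.statusCode < 300) || res.statusCode === 409) {
    return res; // 409 is returned to the caller (body + code), not thrown
  }
  throw new Error(`HTTP ${res.statusCode} from ${url}`);
}
```

The caller is expected to inspect the returned status code to distinguish a created resource from a conflict.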

Kafka Consumer Retry

| Stage | Retry config | Backoff | On permanent failure |
| --- | --- | --- | --- |
| Message processing (per message) | 3 retries via ts-retry-promise | Fixed 1000ms delay | Error logged, message dropped |
| Broker connect (recursive) | Unlimited retries | Fixed 10000ms sleep | None: retries indefinitely |

Key implementation details:

  • ts-retry-promise v0.8.1 wraps the onMessage handler inside KafkajsConsumer.consume().
  • On 3 consecutive handler failures, the error is caught and logged; the consumer continues to the next message. No dead-letter queue exists.
  • Connection retries in KafkajsConsumer.connect() are recursive with no upper bound. If Kafka is permanently unreachable the process will loop indefinitely with 10s sleep intervals.
  • kafka.consumer.ts contains a "// TODO: Consider adding DLQ for messages that fail after retries" comment, acknowledging the gap.
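The per-message policy described above can be sketched in plain TypeScript. This is a simplified stand-in for the ts-retry-promise call inside KafkajsConsumer.consume(); whether the attempt count includes the first try is an assumption of this sketch.

```typescript
// Sketch only: fixed-delay retry, then log-and-drop. The promise resolves
// either way, so the consumer always moves on to the next message.

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function processWithRetry<T>(
  handler: () => Promise<T>,
  log: (msg: string) => void,
  attempts = 3,   // matches the 3 retries documented above (simplified counting)
  delayMs = 1000, // fixed delay; no exponential backoff
): Promise<T | undefined> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await handler();
    } catch (err) {
      if (attempt === attempts) {
        // No DLQ: the failure is logged and the message is dropped.
        log(`message dropped after ${attempts} attempts: ${String(err)}`);
        return undefined;
      }
      await sleep(delayMs);
    }
  }
  return undefined;
}
```

Because the error is swallowed after the final attempt, a poison message costs roughly attempts x delayMs of latency but never blocks the partition.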

Timeouts

| Call | Timeout | Configured in |
| --- | --- | --- |
| hbf-core /users/me (GET) | DISTRIBUTER_SERVICE_REQUEST_TIMEOUT or 5000ms | HttpClientService.get() |
| Any POST/PUT/PATCH/DELETE via HttpClientService | DISTRIBUTER_SERVICE_REQUEST_TIMEOUT or 5000ms | Per-method in HttpClientService |
| getBuffer() binary fetch | None | No timeout configured |
| Kafka session timeout | KAFKA_SESSION_TIMEOUT_MS or 45000ms | KafkaConsumerService.onModuleInit() |
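Both env-var timeouts in the table resolve to their defaults the same way. A minimal sketch of the fallback logic (the helper name is assumed, not taken from the codebase):

```typescript
// Resolve a millisecond timeout from an env var, falling back to a default
// when the variable is unset or not a number. Both
// DISTRIBUTER_SERVICE_REQUEST_TIMEOUT (default 5000) and
// KAFKA_SESSION_TIMEOUT_MS (default 45000) follow this pattern.
function envTimeoutMs(
  env: Record<string, string | undefined>,
  name: string,
  defaultMs: number,
): number {
  const raw = env[name];
  const parsed = Number(raw);
  return raw !== undefined && Number.isFinite(parsed) ? parsed : defaultMs;
}
```

For example, envTimeoutMs(process.env, "KAFKA_SESSION_TIMEOUT_MS", 45000) could feed the sessionTimeout option of a kafkajs consumer.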

Circuit Breakers

None detected.

Fallback Strategy

| Failure scenario | Behaviour | User impact |
| --- | --- | --- |
| Kafka message fails 3 retries | Error logged, message dropped, consumer continues | Interaction metadata for that message is permanently lost; no replay possible |
| hbf-core /users/me unreachable | HttpException thrown, request fails | API caller receives 500 with APM trace IDs |
| DB write failure (saveMetadata) | TypeORM throws, caught by Kafka retry mechanism (counts as one attempt) | Same as Kafka message failure above |
| DB read failure (findByHbfEventId, etc.) | BadRequestException thrown | API caller receives 400 |

Graceful Shutdown

KafkaConsumerService implements OnApplicationShutdown. On NestJS shutdown signal, it disconnects all consumers (with per-consumer error catching). This is the strongest shutdown story in the platform for a Kafka consumer.
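The shutdown behaviour above can be sketched as follows. The interface is defined locally for illustration; the real service implements OnApplicationShutdown from @nestjs/common, and the class and method names here are assumptions.

```typescript
// Sketch: disconnect every consumer on shutdown, and do not let one
// consumer's failing disconnect prevent the others from being disconnected.

interface DisconnectableConsumer {
  name: string;
  disconnect(): Promise<void>;
}

class KafkaConsumerShutdownSketch {
  constructor(
    private readonly consumers: DisconnectableConsumer[],
    private readonly log: (msg: string) => void,
  ) {}

  // Mirrors NestJS's OnApplicationShutdown hook signature.
  async onApplicationShutdown(signal?: string): Promise<void> {
    this.log(`shutting down (${signal ?? "no signal"})`);
    for (const consumer of this.consumers) {
      try {
        await consumer.disconnect();
      } catch (err) {
        // Per-consumer error catching: log and keep going.
        this.log(`failed to disconnect ${consumer.name}: ${String(err)}`);
      }
    }
  }
}
```

Awaiting each disconnect sequentially keeps the sketch simple; disconnecting in parallel with Promise.allSettled would be an equally valid shape.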

Known Gaps

  • No dead-letter queue for Kafka messages that fail after 3 retries. Failed messages are logged and silently dropped.
  • KafkajsConsumer.connect() retries forever on broker unavailability — no circuit breaker, no max-attempt cap, no alerting.
  • getBuffer() has no timeout configured.
  • ts-retry-promise uses fixed 1000ms delay, not exponential backoff.
  • No health endpoint for Kafka broker connectivity. GET /health returns process uptime only; does not check DB or Kafka reachability.
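The unbounded connect loop called out above can be sketched as follows. This is an assumed shape of KafkajsConsumer.connect(), not the actual code.

```typescript
// Sketch: recursive reconnect with a fixed sleep and no attempt cap.
// If the broker is permanently unreachable, this loops forever.

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function connectWithUnboundedRetry(
  tryConnect: () => Promise<void>,
  log: (msg: string) => void,
  sleepMs = 10000, // fixed 10s sleep between attempts, per the docs above
): Promise<void> {
  try {
    await tryConnect();
  } catch (err) {
    log(`connect failed, retrying in ${sleepMs}ms: ${String(err)}`);
    await sleep(sleepMs);
    return connectWithUnboundedRetry(tryConnect, log, sleepMs); // no cap
  }
}
```

A capped variant would thread an attempt counter through the recursion and fail fast (or alert) once it is exceeded, which is the gap the bullet above describes.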