Resilience: hbf-core

Error handling and retry patterns for this service. Platform-wide patterns: docs/architecture/resilience.md

HTTP Retry

  • Library: Spring Retry (@Retryable annotations with @EnableRetry)
  • Attempts: 3 in total, including the first call (Spring Retry default)
  • Backoff: fixed 300ms on AnalyticsService.addMessageTags; Spring Retry default (fixed 1000ms) on ChatSessionService.addMessages
  • On failure: Exception propagates to controller; ResponseExceptionHandler (@ControllerAdvice) returns structured ErrorResponse

Important: Despite the section title, retry is applied ONLY to specific MongoDB operations that throw UncategorizedMongoDbException, NOT to outbound HTTP calls.
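
The retry semantics described above (a bounded number of attempts, a fixed pause between them, retry on one exception type only, then propagate) can be illustrated without Spring. The following is a plain-Java sketch of that behaviour, not the service's code and not Spring Retry's implementation. FixedBackoffRetry and TransientStoreException are hypothetical names; the latter stands in for UncategorizedMongoDbException.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for the retryable MongoDB exception. */
class TransientStoreException extends RuntimeException {
    TransientStoreException(String message) {
        super(message);
    }
}

/** Plain-Java illustration of "N attempts with a fixed backoff"; not Spring Retry itself. */
class FixedBackoffRetry {
    private final int maxAttempts;    // total attempts, including the first call
    private final long backoffMillis; // fixed pause between attempts

    FixedBackoffRetry(int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.maxAttempts = maxAttempts;
        this.backoffMillis = backoffMillis;
    }

    /** Runs the operation, retrying only when it throws TransientStoreException. */
    <T> T execute(Supplier<T> operation) {
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.get();
            } catch (TransientStoreException e) {
                if (attempt >= maxAttempts) {
                    throw e; // attempts exhausted: propagate to the caller
                }
                pause();
            }
            // Any other exception type is not caught and propagates immediately.
        }
    }

    private void pause() {
        try {
            Thread.sleep(backoffMillis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted during backoff", ie);
        }
    }
}
```

With new FixedBackoffRetry(3, 300), an operation that keeps failing is called three times with two 300ms pauses in between, after which the exception reaches the caller.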

Queue Retry

Not applicable. This service does not consume queues.

Timeouts

Call | Timeout | Configured in
HttpClient base default | 30000ms | HttpClient constructor
NotificationServiceClient | 2000ms | Client constructor (hardcoded)
DataManagerClient | 5000ms | Client constructor (hardcoded)
LanguageToolClient | 5000ms | Client constructor (hardcoded)
HelviaNLPSpecificationClient | 120000ms | Client constructor (hardcoded)
HelviaRAGPipelineClient | 120000ms | Configurable
HelviaGPTPipelineClient | 120000ms | Client constructor (hardcoded)
AzureAIClient | 350000ms | Client constructor (hardcoded)
OpenAIClient | 350000ms | Client constructor (hardcoded)
IntegrationAuthenticationServiceOIDC | 30000ms | Client constructor (hardcoded)
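
The difference between a hardcoded and a configurable timeout in the table above can be sketched as follows. This is a minimal illustration, assuming a property-with-default lookup and the JDK's java.net.http API as a stand-in for the service's own HttpClient. ClientTimeouts, its methods, and any property key passed to it are hypothetical and not taken from the codebase; only the 120000ms default comes from the table.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.util.Properties;

/** Hypothetical helper: reads a timeout from properties, falling back to a built-in default. */
class ClientTimeouts {
    // Default taken from the table above for HelviaRAGPipelineClient.
    static final long RAG_PIPELINE_DEFAULT_MS = 120_000L;

    /** Returns the configured timeout, or the default if the key is absent or not a positive number. */
    static Duration resolve(Properties props, String key, long defaultMillis) {
        String raw = props.getProperty(key);
        if (raw == null) {
            return Duration.ofMillis(defaultMillis);
        }
        try {
            long millis = Long.parseLong(raw.trim());
            return millis > 0 ? Duration.ofMillis(millis) : Duration.ofMillis(defaultMillis);
        } catch (NumberFormatException e) {
            return Duration.ofMillis(defaultMillis);
        }
    }

    /** Builds a GET request; sending it fails with HttpTimeoutException if no response arrives in time. */
    static HttpRequest requestWithTimeout(String url, Duration timeout) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(timeout)
                .GET()
                .build();
    }
}
```

A hardcoded client corresponds to always passing the default; a configurable one passes the resolved value, so the timeout can be changed per environment without a code change.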

Circuit Breakers

None. No circuit breaker library is configured or used.

Fallback Strategy

Failure scenario | Behaviour | User impact
MongoDB write fails (retryable) | Spring Retry makes up to 3 attempts in total | Transparent if a retry succeeds
Outbound HTTP call fails | Exception propagates to ResponseExceptionHandler | Structured error response returned to caller
AI client call fails (Azure/OpenAI) | Exception propagates, no retry | Operation fails; a hung call can block for up to 350s before timing out
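
Every failure path in the table above ends in a single place that turns an exception into a structured response. The sketch below shows that idea in plain Java, without Spring. It is an illustration only: the fields of ErrorResponseSketch and the status codes chosen here (504 for a timed-out call, 500 otherwise) are assumptions, since the real ErrorResponse and ResponseExceptionHandler mappings are not shown in this document.

```java
import java.net.http.HttpTimeoutException;

/** Hypothetical shape of a structured error body; the real ErrorResponse fields are not documented here. */
final class ErrorResponseSketch {
    final int status;
    final String error;
    final String message;

    ErrorResponseSketch(int status, String error, String message) {
        this.status = status;
        this.error = error;
        this.message = message;
    }
}

/** Plain-Java stand-in for a @ControllerAdvice handler: one place that maps exceptions to responses. */
class ExceptionMapperSketch {
    static ErrorResponseSketch toResponse(Throwable failure) {
        if (failure instanceof HttpTimeoutException) {
            // Assumed mapping: a timed-out dependency call is reported as 504.
            return new ErrorResponseSketch(504, "UPSTREAM_TIMEOUT", failure.getMessage());
        }
        // Assumed mapping: every other failure is reported as 500.
        return new ErrorResponseSketch(500, "INTERNAL_ERROR", failure.getMessage());
    }
}
```

Because nothing retries the outbound call first, the caller sees this response as soon as the exception is thrown, or only after the client's timeout elapses if the dependency hangs.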

Health Check

  • /actuator/health (Spring Actuator) with management.endpoint.health.probes.enabled=true
  • /{tenant}/health-check custom endpoint

Known Gaps

  • No HTTP retry on external service calls (only MongoDB operations are retried)
  • No circuit breaker on any dependency
  • Timeout values hardcoded in client constructors, not configurable via properties (except RAG pipeline)
  • AI client calls (AzureAI, OpenAI) have a 350s timeout but no retries
  • NLU/pipeline clients have long timeouts (120s) with no recovery strategy if the dependency is degraded