Skip to main content

Resilience: hbf-knowledge-manager

Error handling and retry patterns for this service. Platform-wide patterns: docs/architecture/resilience.md

HTTP Retry

  • Library: hbf-core-api (inherited via HBFCoreApi in hbf-core.service.ts)
  • Attempts: 3 total (1 initial + 2 retries) — inherited from hbf-core-api
  • Backoff: Exponential with jitter: (e^attempt - 2*random) * 1000ms — inherited
  • Retry condition: Network errors only. HTTP 4xx/5xx are not retried.
  • On permanent failure: Returns HBFCoreApiResponse with status 503.
  • Timeout: None. hbf-core-api sets no axios timeout.
  • Azure Blob SDK (@azure/storage-blob): No explicit retry config. SDK applies its own internal retry policy by default (StorageRetryPolicy: 4 attempts, exponential).

Queue Retry

Not applicable. hbf-knowledge-manager has no Bull, Kafka, or other queue consumers.

Timeouts

CallTimeoutConfigured in
hbf-core API (all methods)Nonehbf-core-api axios default
Azure Blob download / listSDK default@azure/storage-blob internal

Circuit Breakers

None detected.

Fallback Strategy

Failure scenarioBehaviourUser impact
Azure Event Grid webhook processing errorError caught and logged after 200 ACK has already been sent to Event GridEvent Grid does not retry (ACK was sent). File change silently lost.
No integration found for webhook keyReturns early, logs debug messageNo sync occurs. Silent.
No KB linked to integrationLogs warning, skips integrationNo sync occurs. Silent.
fullSync: individual file download/upload failsError caught per-file, added to SyncResult.errors[], continues to next filePartial sync. Caller receives SyncResult with error list.
Azure Blob validateConnection failureReturns false, logs warningCaller handles. No throw.

Fire-and-Forget Pattern (Key Resilience Design)

POST /webhooks/azure-blob responds 200 OK before processing. The flow is:

  1. Validate payload shape.
  2. Send res.status(200).send() — ACKs Event Grid immediately.
  3. Call processAzureBlobEvent() asynchronously (still awaited within the handler, but after the response is sent).
  4. Any error in step 3 is caught and logged; it never affects the HTTP response.

Purpose: Prevents Event Grid from treating slow processing as a failure and triggering its exponential retry storm (up to 24 hours of retries).

Risk: If processing fails after ACK, the event is permanently lost. There is no retry, no dead-letter queue, and no replay mechanism.

SharePoint webhooks (POST /webhooks/sharepoint) follow the same fire-and-forget pattern: ACK 202 immediately, then process asynchronously. The same risk of silent event loss applies.

Health Checks

None. No /health or readiness endpoint exists.

Known Gaps

  1. No timeout on hbf-core API calls. A hung downstream (hbf-core) will block the request thread indefinitely.
  2. No retry or DLQ for failed webhook event processing. Events lost after ACK are unrecoverable.
  3. No health endpoint. Orchestrators (Kubernetes) cannot detect unhealthy instances.
  4. fullSync errors per-file are returned in SyncResult.errors[] but the caller (SyncController) does not surface them in the HTTP response — it returns the full SyncResult object, so the client does receive the error list.
  5. Azure Blob SDK retry behavior is implicit (SDK default). No explicit configuration or visibility into retry attempts.