Resilience: hbf-knowledge-manager
Error handling and retry patterns for this service. Platform-wide patterns:
docs/architecture/resilience.md
HTTP Retry
- Library: hbf-core-api (inherited via
HBFCoreApiinhbf-core.service.ts) - Attempts: 3 total (1 initial + 2 retries) — inherited from hbf-core-api
- Backoff: Exponential with jitter:
(e^attempt - 2*random) * 1000ms— inherited - Retry condition: Network errors only. HTTP 4xx/5xx are not retried.
- On permanent failure: Returns
HBFCoreApiResponsewith status 503. - Timeout: None. hbf-core-api sets no axios timeout.
- Azure Blob SDK (
@azure/storage-blob): No explicit retry config. SDK applies its own internal retry policy by default (StorageRetryPolicy: 4 attempts, exponential).
Queue Retry
Not applicable. hbf-knowledge-manager has no Bull, Kafka, or other queue consumers.
Timeouts
| Call | Timeout | Configured in |
|---|---|---|
| hbf-core API (all methods) | None | hbf-core-api axios default |
| Azure Blob download / list | SDK default | @azure/storage-blob internal |
Circuit Breakers
None detected.
Fallback Strategy
| Failure scenario | Behaviour | User impact |
|---|---|---|
| Azure Event Grid webhook processing error | Error caught and logged after 200 ACK has already been sent to Event Grid | Event Grid does not retry (ACK was sent). File change silently lost. |
| No integration found for webhook key | Returns early, logs debug message | No sync occurs. Silent. |
| No KB linked to integration | Logs warning, skips integration | No sync occurs. Silent. |
fullSync: individual file download/upload fails | Error caught per-file, added to SyncResult.errors[], continues to next file | Partial sync. Caller receives SyncResult with error list. |
Azure Blob validateConnection failure | Returns false, logs warning | Caller handles. No throw. |
Fire-and-Forget Pattern (Key Resilience Design)
POST /webhooks/azure-blob responds 200 OK before processing. The flow is:
- Validate payload shape.
- Send
res.status(200).send()— ACKs Event Grid immediately. - Call
processAzureBlobEvent()asynchronously (still awaited within the handler, but after the response is sent). - Any error in step 3 is caught and logged; it never affects the HTTP response.
Purpose: Prevents Event Grid from treating slow processing as a failure and triggering its exponential retry storm (up to 24 hours of retries).
Risk: If processing fails after ACK, the event is permanently lost. There is no retry, no dead-letter queue, and no replay mechanism.
SharePoint webhooks (POST /webhooks/sharepoint) follow the same fire-and-forget pattern: ACK 202 immediately, then process asynchronously. The same risk of silent event loss applies.
Health Checks
None. No /health or readiness endpoint exists.
Known Gaps
- No timeout on hbf-core API calls. A hung downstream (hbf-core) will block the request thread indefinitely.
- No retry or DLQ for failed webhook event processing. Events lost after ACK are unrecoverable.
- No health endpoint. Orchestrators (Kubernetes) cannot detect unhealthy instances.
fullSyncerrors per-file are returned inSyncResult.errors[]but the caller (SyncController) does not surface them in the HTTP response — it returns the fullSyncResultobject, so the client does receive the error list.- Azure Blob SDK retry behavior is implicit (SDK default). No explicit configuration or visibility into retry attempts.