Resilience: hbf-core-api
Retry logic, timeout behavior, and error handling patterns in the HTTP client layer. For platform-wide resilience patterns, see docs/architecture/resilience.md.
Retry Policy
Retries are implemented in makeApiRequest in src/core.ts.
| Property | Value |
|---|---|
| Max attempts | 3 (initial attempt + 2 retries) |
| Trigger | Network error only: axios error.request set, error.response absent |
| Backoff | Exponential with jitter: Math.round((Math.exp(attempt) - 2 * Math.random()) * 1000) ms |
| Max delay (approx.) | ~20s on attempt 3 (e^3 ≈ 20.1s; jitter subtracts up to 2s, so the delay falls between about 18s and 20s) |
| Retry on HTTP 4xx/5xx | No. Any response from the server (including 500, 503) is returned immediately without retry. |
| Retry on connection refused | Yes (no response received). |
| Retry on DNS failure | Yes (no response received). |
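The backoff row above can be checked in isolation. The following is a minimal sketch that reuses the documented formula; the helper names `backoffDelayMs` and `backoffBoundsMs` and the injectable `random` parameter are hypothetical and are not taken from src/core.ts.

```typescript
// Hypothetical helper mirroring the documented formula:
//   Math.round((Math.exp(attempt) - 2 * Math.random()) * 1000)
// `random` is injectable only so the bounds can be computed deterministically.
function backoffDelayMs(attempt: number, random: () => number = Math.random): number {
  return Math.round((Math.exp(attempt) - 2 * random()) * 1000);
}

// The jitter term subtracts between 0 and 2 seconds from e^attempt seconds.
function backoffBoundsMs(attempt: number): { min: number; max: number } {
  return {
    min: backoffDelayMs(attempt, () => 1), // limiting case: Math.random() is always < 1
    max: backoffDelayMs(attempt, () => 0), // no jitter subtracted
  };
}

for (const attempt of [1, 2, 3]) {
  const { min, max } = backoffBoundsMs(attempt);
  console.log(`attempt ${attempt}: ${min} ms to ${max} ms`);
}
```

Under this formula the delay is roughly 0.7s to 2.7s for attempt 1, 5.4s to 7.4s for attempt 2, and 18.1s to 20.1s for attempt 3.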
Retry Logic (simplified)
attempt 1: make request
-> got response (any status): return HBFCoreApiResponse
-> no response (network error) AND attempt <= 3: wait, increment, recurse
-> no response AND attempt > 3: return synthetic 503
-> setup error (bad config): return synthetic 503
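The flow above could be sketched roughly as follows. This is not the code in src/core.ts: the `Outcome` and `ApiResponse` types, the `sendWithRetry` name, and the injected `send` and `sleep` functions are hypothetical stand-ins for the axios call and the real HBFCoreApiResponse, and the sketch uses a loop rather than recursion. The `maxAttempts` default of 3 is an assumption taken from the "Max attempts" row of the policy table.

```typescript
// Hypothetical types; the real library works with axios errors and HBFCoreApiResponse.
type Outcome =
  | { kind: "response"; status: number; body: unknown } // server answered, any status
  | { kind: "network-error"; message: string } // request sent, no response received
  | { kind: "setup-error"; message: string }; // request never left the client

interface ApiResponse {
  status: number;
  body: unknown;
}

async function sendWithRetry(
  send: () => Promise<Outcome>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts = 3, // assumption: value from the "Max attempts" table row
): Promise<ApiResponse> {
  for (let attempt = 1; ; attempt++) {
    const outcome = await send();
    if (outcome.kind === "response") {
      // Any status, including 500 or 503, is returned without retrying.
      return { status: outcome.status, body: outcome.body };
    }
    if (outcome.kind === "setup-error" || attempt >= maxAttempts) {
      // Never throws: the failure collapses into a synthetic 503.
      return { status: 503, body: { message: outcome.message } };
    }
    // Network error with attempts left: back off, then try again.
    await sleep(Math.round((Math.exp(attempt) - 2 * Math.random()) * 1000));
  }
}
```

Injecting `sleep` keeps the sketch testable without real delays; in a service it would be a promise-wrapped `setTimeout`.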
Timeout Behavior
There is no timeout configured on any request.
axios defaults timeout to 0, which disables the timeout entirely, so requests can hang indefinitely waiting for hbf-core to respond. If hbf-core becomes unresponsive without dropping the TCP connection (e.g., a stuck thread holding the socket open), the calling service will block on that request indefinitely.
This affects every service that uses this library.
Error Handling
makeApiRequest never throws. It returns an HBFCoreApiResponse in all cases:
| Scenario | Result |
|---|---|
| HTTP 2xx | HBFCoreApiResponse with status from server |
| HTTP 4xx/5xx | HBFCoreApiResponse with error status from server |
| Network error (no response), attempts exhausted | HBFCoreApiResponse with status: 503, body { message: <error.message> } |
| axios setup error (bad config, pre-request) | HBFCoreApiResponse with status: 503, body { message: <error.message> } |
Extractors on HBFCoreApiResponse throw synchronously if the status is outside 2xx (except getOptionalValue and getFile, which return undefined on 404).
Sub-client methods propagate whatever the extractor returns or throws. There is no additional error wrapping at the client layer.
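The extractor contract described above can be illustrated with a small stand-in class. Only the status rules come from this document; the class name `SketchResponse` and the strict extractor name `getValue` are hypothetical, and the real HBFCoreApiResponse may be shaped differently.

```typescript
// Hypothetical stand-in for HBFCoreApiResponse; only the status rules come from the text.
class SketchResponse<T> {
  readonly status: number;
  readonly body: T | { message: string };

  constructor(status: number, body: T | { message: string }) {
    this.status = status;
    this.body = body;
  }

  private isSuccess(): boolean {
    return this.status >= 200 && this.status < 300;
  }

  // `getValue` is an assumed name for a strict extractor: throws outside 2xx.
  getValue(): T {
    if (!this.isSuccess()) {
      throw new Error(`hbf-core request failed with status ${this.status}`);
    }
    return this.body as T;
  }

  // Lenient extractor: 404 maps to undefined, other non-2xx statuses still throw.
  getOptionalValue(): T | undefined {
    if (this.status === 404) return undefined;
    return this.getValue();
  }
}

const missing = new SketchResponse<{ handle: string }>(404, { message: "not found" });
console.log(missing.getOptionalValue()); // undefined

try {
  // A synthetic 503 from exhausted retries surfaces here as a thrown error.
  new SketchResponse<string>(503, { message: "connect ECONNREFUSED" }).getValue();
} catch (e) {
  console.log((e as Error).message);
}
```

The practical consequence for callers is that a network failure never rejects the request promise itself; it shows up later, when an extractor is called on the synthetic 503 response.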
Known Gaps
- No request timeout. A hung hbf-core connection blocks the caller indefinitely. All 13 consuming services are exposed to this.
- HTTP 5xx not retried. If hbf-core returns 500 or 503, the library returns that response immediately. Only connectivity failures (no response at all) trigger a retry. A transient hbf-core crash that returns a 503 before closing the connection will not be retried.
- No circuit breaker. There is no open/half-open/closed state machine. If hbf-core is down, every call goes through the full retry cycle (3 attempts, up to ~30s total) before failing.
- No per-client or per-method timeout override. Timeout must be handled at the consuming service level (e.g., wrapping calls in `Promise.race` with a `setTimeout`).
- Token not refreshed. The token is set once at construction. If the token expires during the lifetime of the `HBFCoreApi` instance, all subsequent requests will receive 401 responses. The library does not detect this or trigger a token refresh.
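To make the missing closed/open/half-open state machine concrete, here is a minimal sketch of what a consuming service could wrap around its calls. Everything in it is hypothetical (`SimpleBreaker`, its thresholds, the injectable `now` clock); it is not part of the library. Because makeApiRequest returns a synthetic 503 instead of throwing, the sketch takes an `isFailure` predicate rather than relying on rejected promises.

```typescript
// Hypothetical minimal breaker, for illustration only.
type BreakerState = "closed" | "open" | "half-open";

class SimpleBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetAfterMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetAfterMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  getState(): BreakerState {
    // An open breaker becomes half-open once the cool-down has elapsed.
    if (this.state === "open" && this.now() - this.openedAt >= this.resetAfterMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  async call<T>(fn: () => Promise<T>, isFailure: (result: T) => boolean): Promise<T> {
    if (this.getState() === "open") {
      throw new Error("circuit open: call skipped");
    }
    const result = await fn();
    if (isFailure(result)) {
      this.failures++;
      // A failed trial call, or too many consecutive failures, opens the circuit.
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
    } else {
      this.failures = 0;
      this.state = "closed";
    }
    return result;
  }
}
```

With a breaker like this in front of the client, calls made while the circuit is open fail immediately instead of each paying for the full retry cycle.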
Recommendations
- Add a default axios timeout in `requestHBFCore`: `timeout: 30000` (30s) at minimum. Make it configurable via a constructor option.
- Retry on 503: Extend the retry condition to include `error.response.status === 503` with a short fixed delay.
- Implement circuit breaker in the calling service if hbf-core is a critical dependency (e.g., using `opossum`).
- Wrap calls with a timeout in services where a hung call is unacceptable:

      const result = await Promise.race([
        coreApi.BotDeploymentClient.findByHandle(handle),
        new Promise((_, reject) => setTimeout(() => reject(new Error("timeout")), 10000)),
      ]);

- Recreate `HBFCoreApi` on token refresh rather than holding a long-lived instance.
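The Promise.race wrapper recommended above leaves its timer running after the call settles. A reusable variant that clears the timer might look like the following sketch; `withTimeout` is a hypothetical helper name, not something the library provides.

```typescript
// Hypothetical helper: rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // Clear the timer whichever side wins, so it cannot keep the event loop alive.
  return Promise.race([work, deadline]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

A call site would then read `await withTimeout(coreApi.BotDeploymentClient.findByHandle(handle), 10000)`. Note that this only stops the caller from waiting; the underlying HTTP request is not cancelled and may still hold a socket until hbf-core or the network closes it.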