Resilience: hbf-core-api

Retry logic, timeout behavior, and error handling patterns in the HTTP client layer. Platform-wide resilience patterns: docs/architecture/resilience.md

Retry Policy

Retries are implemented in makeApiRequest in src/core.ts.

Property	Value
Max attempts	3 (initial attempt + 2 retries)
Trigger	Network error only: axios `error.request` set, `error.response` absent
Backoff	Exponential with jitter: `Math.round((Math.exp(attempt) - 2 * Math.random()) * 1000)` ms
Max delay (approx.)	~18s on attempt 3 (`e^3 ≈ 20s` minus jitter)
Retry on HTTP 4xx/5xx	No. Any response from the server (including 500, 503) is returned immediately without retry.
Retry on connection refused	Yes (no response received).
Retry on DNS failure	Yes (no response received).

Retry Logic (simplified)

attempt 1: make request
  -> got response (any status): return HBFCoreApiResponse
  -> no response (network error) AND attempt <= 3: wait, increment, recurse
  -> no response AND attempt > 3: return synthetic 503
  -> setup error (bad config): return synthetic 503

Timeout Behavior

There is no timeout configured on any request.

axios uses undefined for timeout by default, meaning requests can hang indefinitely waiting for hbf-core to respond. If hbf-core becomes unresponsive without dropping the TCP connection (e.g., a stuck thread holding the socket open), the calling service will block on that request forever.

This affects every service that uses this library.

Error Handling

makeApiRequest never throws. It returns an HBFCoreApiResponse in all cases:

Scenario	Result
HTTP 2xx	`HBFCoreApiResponse` with status from server
HTTP 4xx/5xx	`HBFCoreApiResponse` with error status from server
Network error (no response), attempts exhausted	`HBFCoreApiResponse` with `status: 503`, body `{ message: <error.message> }`
axios setup error (bad config, pre-request)	`HBFCoreApiResponse` with `status: 503`, body `{ message: <error.message> }`

Extractors on HBFCoreApiResponse throw synchronously if the status is outside 2xx (except getOptionalValue and getFile, which return undefined on 404).

Sub-client methods propagate whatever the extractor returns or throws. There is no additional error wrapping at the client layer.

Known Gaps

No request timeout. A hung hbf-core connection blocks the caller indefinitely. All 13 consuming services are exposed to this.
HTTP 5xx not retried. If hbf-core returns 500 or 503, the library returns that response immediately. Only connectivity failures (no response at all) trigger a retry. A transient hbf-core crash that returns a 503 before closing the connection will not be retried.
No circuit breaker. There is no open/half-open/closed state machine. If hbf-core is down, every call goes through the full retry cycle (3 attempts, up to ~30s total) before failing.
No per-client or per-method timeout override. Timeout must be handled at the consuming service level (e.g., wrapping calls in Promise.race with a setTimeout).
Token not refreshed. The token is set once at construction. If the token expires during the lifetime of the HBFCoreApi instance, all subsequent requests will receive 401 responses. The library does not detect this or trigger a token refresh.

Recommendations

Add a default axios timeout in requestHBFCore: timeout: 30000 (30s) at minimum. Make it configurable via a constructor option.
Retry on 503: Extend the retry condition to include error.response.status === 503 with a short fixed delay.
Implement circuit breaker in the calling service if hbf-core is a critical dependency (e.g., using opossum).

Wrap calls with a timeout in services where a hung call is unacceptable:

const result = await Promise.race([
    coreApi.BotDeploymentClient.findByHandle(handle),
    new Promise((_, reject) => setTimeout(() => reject(new Error("timeout")), 10000)),
]);

Recreate HBFCoreApi on token refresh rather than holding a long-lived instance.

Retry Policy​

Retry Logic (simplified)​

Timeout Behavior​

Error Handling​

Known Gaps​

Recommendations​