
Resilience: semantic-doc-segmenter

Job timeout and cancellation

Jobs run in subprocesses managed by JobExecutionManager (app/services/job_execution_manager.py). The polling loop (JobPollingService) checks for timed-out jobs on each iteration, which runs once per second.

Timeout: JOB_TIMEOUT_SECONDS env var, default 600 seconds (10 minutes).

Cancellation mechanism (cooperative, not forceful):

  1. JobPollingService detects elapsed time > threshold.
  2. Sets cancellation_dict[job_id] = True (shared multiprocessing.Manager().dict()).
  3. The worker subprocess checks this flag at multiple checkpoints in process_document via check_cancellation().
  4. On flag detection, the worker raises asyncio.CancelledError; the exception handler marks the job as FAILED in the DB, then re-raises.
  5. The manager marks the future as CANCEL_REQUESTED and calls future.cancel() (best-effort; ineffective for already-running futures).
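
The handshake between steps 2-4 can be sketched as follows. In production cancellation_dict is a multiprocessing.Manager().dict() shared between the polling loop and the worker subprocess; a plain dict is enough to demonstrate the mechanism, and check_cancellation mirrors the checkpoint function named above.

```python
import asyncio

def check_cancellation(cancellation_dict, job_id):
    # Called at each checkpoint in process_document; a cheap dict lookup,
    # so it can run between every article in the tagging loop.
    if cancellation_dict.get(job_id):
        raise asyncio.CancelledError(f"job {job_id} cancelled by polling loop")

cancellation_dict = {}
check_cancellation(cancellation_dict, "job-1")       # no flag -> no-op
cancellation_dict["job-1"] = True                    # polling loop flags a timeout
try:
    check_cancellation(cancellation_dict, "job-1")   # next checkpoint raises
    cancelled = None
except asyncio.CancelledError as exc:
    cancelled = str(exc)
```

Because the flag is only ever read at checkpoints, a long stretch of code between two checkpoints delays cancellation by exactly that stretch.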

Checkpoints in doc_processing_service.process_document:

  • After format-to-Markdown conversion
  • Before segmentation
  • Before and during per-article tagging loop (every article)

Note: A subprocess that is blocked in a synchronous call (e.g. a PyMuPDF or Docling routine, or a Gemini HTTP request) will not observe the cooperative flag until that call returns. Cancellation is cooperative, not pre-emptive.

LLM retries

Both call_openai() and call_azure_openai() in app/services/llm_service.py pass max_retries=20 to the OpenAI client constructor. The client uses its built-in exponential backoff for retryable errors (rate limits, 5xx). There is no additional retry layer above this.

Per-call timeouts:

  • Tagging: 120 seconds (2 minutes)
  • Language detection: LANGUAGE_DETECT_LLM_TIMEOUT_SECONDS (default 30 seconds)
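
The retry behaviour lives inside the OpenAI client, but its shape is easy to illustrate. The sketch below is an illustrative stand-in for that built-in loop, not the library's actual schedule: the retryable status set matches the text (rate limits and 5xx), while the backoff constants and the `send` injection point are assumptions.

```python
import random
import time

RETRYABLE = {429, 500, 502, 503, 504}  # rate limits and 5xx, as described above

def call_with_retries(send, max_retries=20, base=0.5, cap=8.0):
    """Retry send() on retryable HTTP statuses with exponential backoff.

    Illustrative stand-in for the backoff the OpenAI client applies
    internally when constructed with max_retries=20.
    """
    for attempt in range(max_retries + 1):
        status, body = send()
        if status not in RETRYABLE or attempt == max_retries:
            return status, body
        # Full-jitter exponential backoff: sleep somewhere in [0, base * 2^n].
        time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Simulated endpoint: two rate-limit responses, then success.
responses = iter([(429, ""), (429, ""), (200, "ok")])
status, body = call_with_retries(lambda: next(responses), base=0.0)
```

Note that with max_retries=20 a persistently failing call can occupy a worker for a long time; the per-call timeouts above bound each individual attempt, not the whole retry loop.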

Language detection fallback chain

USE_GOOGLE_LANGUAGE_DETECTION=true?
Yes -> Google Cloud Translate API (timeout: LANGUAGE_DETECT_GOOGLE_TIMEOUT_SECONDS, default 10s)
| success -> return language code
| timeout / error -> fall through
No -> skip Google

LLM detection (LLM_BACKEND: openai or azure)
| success -> return language code
| empty result -> fall through
| exception -> fall through

Default: SYSTEM_DEFAULT_LANGUAGE (default "en")
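
The chain above can be sketched as a single function. google_detect and llm_detect are hypothetical callables standing in for the Google Translate and LLM detection paths (each returns a language code, returns an empty string, or raises); the timeouts named above bound the real calls.

```python
SYSTEM_DEFAULT_LANGUAGE = "en"

def detect_language(text, use_google=True, google_detect=None, llm_detect=None):
    if use_google and google_detect is not None:
        try:
            code = google_detect(text)   # bounded by the Google timeout
            if code:
                return code
        except Exception:
            pass                         # timeout / error -> fall through
    if llm_detect is not None:
        try:
            code = llm_detect(text)      # bounded by the LLM timeout
            if code:
                return code
        except Exception:
            pass                         # empty result / exception -> fall through
    return SYSTEM_DEFAULT_LANGUAGE

def failing_google(text):
    raise TimeoutError("Google Translate timed out")

# Google times out, the LLM path answers, and the default is never reached.
lang = detect_language("bonjour", google_detect=failing_google,
                       llm_detect=lambda t: "fr")
```

Every branch degrades rather than fails: the worst case is a detection that silently lands on the system default.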

Database connection resilience

  • pool_pre_ping=True: issues a lightweight ping before each checkout; stale connections are transparently discarded and replaced.
  • pool_reset_on_return="commit": commits any open transaction on connection return.
  • pool_recycle=3600: connections are recycled hourly to prevent MySQL wait_timeout disconnections.
  • Each polling loop iteration opens a fresh ForegroundSessionLocal() context manager to prevent session poisoning across iterations.
  • DB errors in the polling loop are caught and logged; the loop continues.
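
The pool settings above correspond to engine construction roughly as follows. Only the keyword arguments are taken from the text; the sqlite URL is a placeholder for the real MySQL DSN.

```python
from sqlalchemy import create_engine

engine = create_engine(
    "sqlite://",                    # placeholder; production uses a MySQL DSN
    pool_pre_ping=True,             # ping before handing out a connection
    pool_recycle=3600,              # recycle hourly, under MySQL wait_timeout
    pool_reset_on_return="commit",  # commit open transactions on return
)
```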

Callback delivery

post_results_to_callback() and progressive_post_to_callback() in app/services/job_service.py:

  • Use requests.post(..., timeout=5) -- a hard 5-second timeout (applied to the connect and to each read).
  • No retries on callback failure.
  • Failure is logged as ERROR but does not affect job status.
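
The fire-and-forget shape of callback delivery can be sketched as below. The `poster` parameter is a hypothetical injection point standing in for requests.post; in the real service the call is requests.post(url, json=payload, timeout=5).

```python
import logging

logger = logging.getLogger("job_service")

def post_results_to_callback(url, payload, poster):
    try:
        poster(url, json=payload, timeout=5)   # 5 s hard timeout, no retries
        return True
    except Exception:
        # Logged as ERROR, but the job's status in the DB is untouched.
        logger.error("callback to %s failed", url)
        return False

def unreachable(url, json, timeout):
    raise ConnectionError("callback endpoint down")

delivered = post_results_to_callback("https://example.test/cb",
                                     {"job": 1}, unreachable)
```

A False return here is invisible to the caller, which is exactly the "no callback retries" gap listed at the end of this document: the segments exist in the DB, but nobody is told.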

Error handling in the processing pipeline

All exceptions in process_document are caught in a top-level except Exception block:

  • Job is marked FAILED in DB with the full traceback stored in job.errormessage.
  • asyncio.CancelledError (a BaseException, so not caught by the generic handler) is handled separately: the job is marked FAILED with the message "Job cancelled due to timeout".
  • Both paths open a fresh DB session to ensure the status update succeeds even if the original session is poisoned.
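
The overall shape of that top-level handler is sketched below; Job is a minimal stand-in for the ORM model, `work` stands in for the real pipeline, and the fresh-session detail is elided.

```python
import asyncio
import traceback

class Job:
    def __init__(self):
        self.status = "PROCESSING"
        self.errormessage = None

def process_document(job, work):
    try:
        work()
        job.status = "COMPLETED"
    except asyncio.CancelledError:
        # Cancellation path: distinct message, then re-raise so the
        # caller still sees the cancellation.
        job.status = "FAILED"
        job.errormessage = "Job cancelled due to timeout"
        raise
    except Exception:
        # Any other failure: store the full traceback for post-mortems.
        job.status = "FAILED"
        job.errormessage = traceback.format_exc()

job = Job()
process_document(job, lambda: 1 / 0)
```

Ordering matters here: asyncio.CancelledError must be caught explicitly, since `except Exception` would let it propagate without marking the job.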

File upload validation

  • Max file size: MAX_FILE_SIZE_LIMIT (default 64 MB). Returns HTTP 413 on violation.
  • Filename sanitized via werkzeug.utils.secure_filename to prevent path traversal.
  • MIME type auto-detected via python-magic on the first MIMETYPE_MAGIC_LOOKAHEAD bytes (default 5000).
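
A simplified sketch of the first two checks follows. The sanitizer is a stand-in for werkzeug.utils.secure_filename, not its real algorithm, and the MIME sniffing via python-magic is elided.

```python
import os
import re

MAX_FILE_SIZE_LIMIT = 64 * 1024 * 1024  # 64 MB default, per the setting above

def validate_upload(filename, data):
    if len(data) > MAX_FILE_SIZE_LIMIT:
        return 413, None                      # HTTP 413 Payload Too Large
    # Drop any directory components, then strip characters outside a
    # conservative allow-list, so "../../etc/passwd" cannot escape the
    # upload directory.
    name = os.path.basename(filename.replace("\\", "/"))
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    return 200, name

status, name = validate_upload("../../etc/passwd", b"pdf bytes")
```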

Job queue isolation

SELECT ... FOR UPDATE SKIP LOCKED is used when fetching NEW jobs. This prevents two polling loops (or future horizontal replicas) from picking up the same job. The job is marked PROCESSING and committed before the subprocess is spawned, releasing the row lock immediately.
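
The claim-a-job transaction looks roughly like the following pair of statements (MySQL 8+ syntax; the table and column names are assumptions based on the status values used in this document).

```python
CLAIM_JOB_SQL = """
SELECT id FROM job
WHERE status = 'NEW'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED
"""

MARK_PROCESSING_SQL = "UPDATE job SET status = 'PROCESSING' WHERE id = :id"

# Both statements run in one short transaction; the commit releases the
# row lock before the worker subprocess is spawned, so a second poller
# skips the locked row instead of blocking on it.
```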

Known gaps

  • No callback retries: If the callback URL is unavailable, segments are stored in DB but the caller is never notified. The caller must poll /jobs/{id} for status.
  • No forceful subprocess kill: If a worker is blocked in a synchronous system call, JOB_TIMEOUT_SECONDS cancellation is delayed until the call returns.
  • Single uvicorn worker: The --workers 1 constraint is intentional (the JobExecutionManager Singleton uses in-memory state). Horizontal scaling requires reworking the cancellation mechanism.
  • No dead letter queue: Failed jobs stay in the job table with status=FAILED. There is no automatic retry or alerting beyond Elastic APM.
  • Callback timeout 5s: A slow callback endpoint can exceed the 5-second timeout, so delivery is logged as failed even when the endpoint eventually processes the request.