Resilience: semantic-doc-segmenter
Job timeout and cancellation
Jobs run in subprocesses managed by JobExecutionManager (app/services/job_execution_manager.py). The polling loop (JobPollingService) checks for timed-out jobs on every iteration (every 1 second).
Timeout: JOB_TIMEOUT_SECONDS env var, default 600 seconds (10 minutes).
Cancellation mechanism (cooperative, not forceful):
- JobPollingService detects elapsed time > threshold.
- Sets cancellation_dict[job_id] = True (shared multiprocessing.Manager().dict()).
- The worker subprocess checks this flag at multiple checkpoints in process_document via check_cancellation().
- On flag detection, the worker raises asyncio.CancelledError, marks job as FAILED in DB, then re-raises.
- The manager marks the future as CANCEL_REQUESTED and calls future.cancel() (best-effort; ineffective for already-running futures).
Checkpoints in doc_processing_service.process_document:
- After format-to-Markdown conversion
- Before segmentation
- Before and during per-article tagging loop (every article)
Note: A subprocess that is blocked in a synchronous call (e.g. PyMuPDF, Docling, or a Gemini HTTP request) will not respond to the cooperative flag until that call returns. Cancellation is not pre-emptive.
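The cooperative flow described above can be sketched roughly as follows. This is a minimal illustration, not the service's code: the signature of check_cancellation, the process_articles helper, and the tag callable are assumptions made for the example. A plain dict can stand in for the shared multiprocessing.Manager().dict(), since the proxy exposes the same mapping interface.

```python
import asyncio
from typing import Callable, Iterable, List, MutableMapping


def check_cancellation(cancellation_dict: MutableMapping[str, bool], job_id: str) -> None:
    """Raise asyncio.CancelledError if the poller has flagged this job.

    The signature is an assumption; the document only names the function.
    """
    if cancellation_dict.get(job_id, False):
        raise asyncio.CancelledError(f"job {job_id} cancelled")


def process_articles(
    job_id: str,
    articles: Iterable[str],
    cancellation_dict: MutableMapping[str, bool],
    tag: Callable[[str], str],
) -> List[str]:
    """Hypothetical tagging loop with a checkpoint before every article."""
    tagged: List[str] = []
    for article in articles:
        # Checkpoint: the flag is only observed here, between units of work.
        check_cancellation(cancellation_dict, job_id)
        # If tag() blocks for a long time, cancellation waits until it returns.
        tagged.append(tag(article))
    return tagged
```

The sketch makes the limitation visible: the flag is read only between calls to tag(), so a long blocking call postpones cancellation by its full duration.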
LLM retries
Both call_openai() and call_azure_openai() in app/services/llm_service.py pass max_retries=20 to the OpenAI client constructor. The client uses its built-in exponential backoff for retryable errors (rate limits, 5xx). There is no additional retry layer above this.
Per-call timeouts:
- Tagging: 120 seconds (2 minutes)
- Language detection: LANGUAGE_DETECT_LLM_TIMEOUT_SECONDS (default 30 seconds)
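Because retries multiply the per-call timeout, it can help to estimate how long a single LLM call might block. The sketch below is a rough upper bound under stated assumptions: every attempt runs into its timeout, timeouts count as retryable, and the wait between attempts is capped at max_backoff_seconds. That parameter and its 8-second value are assumptions for illustration, not values taken from this document.

```python
def worst_case_call_seconds(
    max_retries: int,
    per_call_timeout_seconds: float,
    max_backoff_seconds: float = 8.0,  # assumed cap on the wait between attempts
) -> float:
    """Upper bound on wall time for one call if every attempt times out."""
    attempts = max_retries + 1  # the initial attempt plus each retry
    waiting = max_retries * max_backoff_seconds  # one wait before each retry
    return attempts * per_call_timeout_seconds + waiting


# Tagging: max_retries=20 with a 120-second per-call timeout.
tagging_worst_case = worst_case_call_seconds(max_retries=20, per_call_timeout_seconds=120)
```

Under these assumptions a single tagging call could block for far longer than the default JOB_TIMEOUT_SECONDS of 600, which matters because cancellation is only observed between calls.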
Language detection fallback chain
USE_GOOGLE_LANGUAGE_DETECTION=true?
  Yes -> Google Cloud Translate API (timeout: LANGUAGE_DETECT_GOOGLE_TIMEOUT_SECONDS, default 10s)
    | success -> return language code
    | timeout / error -> fall through
  No -> skip Google
LLM detection (LLM_BACKEND: openai or azure)
  | success -> return language code
  | empty result -> fall through
  | exception -> fall through
Default: SYSTEM_DEFAULT_LANGUAGE (default "en")
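The chain above could be expressed along these lines. This is a sketch under assumptions: detect_language and the two detector callables are hypothetical names, the detectors are injected so the example runs without network access, and treating an empty Google result as a fall-through is a guess, since the chain only specifies timeout and error for that stage.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)

# A detector takes text and returns a language code, or None/"" if unsure.
Detector = Callable[[str], Optional[str]]


def detect_language(
    text: str,
    google_detect: Optional[Detector],  # None when USE_GOOGLE_LANGUAGE_DETECTION is not true
    llm_detect: Detector,
    default_language: str = "en",  # stands in for SYSTEM_DEFAULT_LANGUAGE
) -> str:
    if google_detect is not None:
        try:
            code = google_detect(text)
            if code:  # assumption: an empty result also falls through
                return code
        except Exception:
            logger.warning("Google detection failed; falling through", exc_info=True)

    try:
        code = llm_detect(text)
        if code:
            return code
    except Exception:
        logger.warning("LLM detection failed; falling through", exc_info=True)

    return default_language
```

Every stage fails soft, so language detection alone cannot fail a job; the cost is that a misconfigured detector silently degrades to the default language.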
Database connection resilience
- pool_pre_ping=True: tests connections before use; stale connections are recycled.
- pool_reset_on_return="commit": commits any open transaction on connection return.
- pool_recycle=3600: connections are recycled hourly to prevent MySQL wait_timeout disconnections.
- Each polling loop iteration opens a fresh ForegroundSessionLocal() context manager to prevent session poisoning across iterations.
- DB errors in the polling loop are caught and logged; the loop continues.
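The last two points, a fresh session per iteration and a loop that survives DB errors, can be illustrated with a small sketch. The names run_polling_iterations, session_factory, and poll_once are hypothetical; in the service, session_factory would correspond to ForegroundSessionLocal. The loop is bounded here and omits the 1-second sleep so that it terminates.

```python
import logging
from typing import Callable, ContextManager

logger = logging.getLogger(__name__)


def run_polling_iterations(
    session_factory: Callable[[], ContextManager[object]],
    poll_once: Callable[[object], None],
    iterations: int,
) -> int:
    """Run a bounded number of polling iterations; return how many failed."""
    failures = 0
    for _ in range(iterations):
        try:
            # New session each time: a failed transaction in one iteration
            # cannot leave a broken session behind for the next one.
            with session_factory() as session:
                poll_once(session)
        except Exception:
            failures += 1
            logger.exception("Polling iteration failed; continuing")
    return failures
```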
Callback delivery
post_results_to_callback() and progressive_post_to_callback() in app/services/job_service.py:
- Use requests.post(..., timeout=5) -- 5 second hard timeout.
- No retries on callback failure.
- Failure is logged as ERROR but does not affect job status.
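A single-attempt delivery with these properties might look like the sketch below. It is illustrative only: deliver_callback is a hypothetical name, the post argument is injected and expected to behave like requests.post, and both sending the payload as JSON and treating non-2xx responses as failures (via raise_for_status) are assumptions not stated above.

```python
import logging
from typing import Any, Callable, Mapping

logger = logging.getLogger(__name__)

CALLBACK_TIMEOUT_SECONDS = 5


def deliver_callback(
    url: str,
    payload: Mapping[str, Any],
    post: Callable[..., Any],  # e.g. requests.post in the real service
) -> bool:
    """Make one delivery attempt; return True on success, False otherwise."""
    try:
        response = post(url, json=payload, timeout=CALLBACK_TIMEOUT_SECONDS)
        response.raise_for_status()  # assumption: non-2xx counts as a failure
        return True
    except Exception:
        # No retry: the failure is logged and job status is left untouched.
        logger.error("Callback delivery to %s failed", url, exc_info=True)
        return False
```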
Error handling in the processing pipeline
All exceptions in process_document are caught in a top-level except Exception block:
- Job is marked FAILED in DB with the full traceback stored in job.errormessage.
- asyncio.CancelledError is handled separately: marks job FAILED with message "Job cancelled due to timeout".
- Both paths open a fresh DB session to ensure the status update succeeds even if the original session is poisoned.
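The two error paths can be sketched as follows. On Python 3.8 and later, asyncio.CancelledError derives from BaseException rather than Exception, which is why it needs its own clause. The names run_job, work, and mark_failed are hypothetical; mark_failed stands in for "open a fresh DB session and set the job to FAILED", and not re-raising ordinary exceptions is an assumption.

```python
import asyncio
import traceback
from typing import Callable


def run_job(
    job_id: str,
    work: Callable[[], None],
    mark_failed: Callable[[str, str], None],  # hypothetical: uses a fresh DB session
) -> None:
    try:
        work()
    except asyncio.CancelledError:
        # Not caught by `except Exception` on Python 3.8+.
        mark_failed(job_id, "Job cancelled due to timeout")
        raise
    except Exception:
        # Store the full traceback as the error message.
        mark_failed(job_id, traceback.format_exc())
```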
File upload validation
- Max file size: MAX_FILE_SIZE_LIMIT (default 64 MB). Returns HTTP 413 on violation.
- Filename sanitized via werkzeug.utils.secure_filename to prevent path traversal.
- MIME type auto-detected via python-magic on the first MIMETYPE_MAGIC_LOOKAHEAD bytes (default 5000).
Job queue isolation
SELECT ... FOR UPDATE SKIP LOCKED is used when fetching NEW jobs. This prevents two polling loops (or future horizontal replicas) from picking up the same job. The job is marked PROCESSING and committed before the subprocess is spawned, releasing the row lock immediately.
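A claim step of this shape could be written against a plain DB-API connection as sketched below. The column names id and status, the ORDER BY clause, and the claim_next_job helper are assumptions for illustration; the %s placeholder follows the parameter style commonly used by MySQL drivers.

```python
from typing import Any, Optional

# Lock one NEW row, skipping rows already locked by another poller.
CLAIM_SQL = (
    "SELECT id FROM job WHERE status = 'NEW' "
    "ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED"
)
MARK_SQL = "UPDATE job SET status = 'PROCESSING' WHERE id = %s"


def claim_next_job(conn: Any) -> Optional[int]:
    """Claim one NEW job on a DB-API connection; return its id or None."""
    cursor = conn.cursor()
    try:
        cursor.execute(CLAIM_SQL)
        row = cursor.fetchone()
        if row is None:
            conn.rollback()  # nothing to claim; end the transaction
            return None
        job_id = row[0]
        cursor.execute(MARK_SQL, (job_id,))
        conn.commit()  # releases the row lock before any subprocess is spawned
        return job_id
    finally:
        cursor.close()
```

Committing before spawning keeps the lock window short: a second poller skips the row while it is locked and no longer sees it as NEW once the commit lands.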
Known gaps
- No callback retries: If the callback URL is unavailable, segments are stored in DB but the caller is never notified. The caller must poll /jobs/{id} for status.
- No forceful subprocess kill: If a worker is blocked in a synchronous system call, JOB_TIMEOUT_SECONDS cancellation is delayed until the call returns.
- Single uvicorn worker: The --workers 1 constraint is intentional (the JobExecutionManager Singleton uses in-memory state). Horizontal scaling requires reworking the cancellation mechanism.
- No dead letter queue: Failed jobs stay in the job table with status=FAILED. There is no automatic retry or alerting beyond Elastic APM.
- Callback timeout 5s: Long-running callback endpoints will cause false negatives in delivery confirmation.