Architecture: semantic-doc-segmenter
Module layout
app/
main.py -- FastAPI app, lifespan hooks, router registration
config/
config.py -- All env vars with defaults
database.py -- SQLAlchemy engine, session factory, get_db dependency
auth/
jwt_bearer.py -- JWTBearer (HTTPBearer subclass), role check
routers/
jobs.py -- /jobs CRUD + cancel
documents.py -- /documents + /segments read endpoints
debug.py -- /debug/jobs/{id}, /debug/health*
models/
models.py -- SQLAlchemy ORM: Document, Job, DocumentSegment + enums
schemas/
document_schemas.py -- Pydantic schemas for document responses
jobs_schemas.py -- Pydantic schemas for job responses
segment_schema.py -- Pydantic schemas for segment responses
services/
job_polling_service.py -- JobPollingService: async polling loop (1s interval)
job_execution_manager.py -- JobExecutionManager (Singleton): ProcessPoolExecutor
doc_processing_service.py -- process_document(): full pipeline logic
doc_converter_service.py -- Format-to-Markdown converters (PDF/DOCX/PPTX/HTML)
llm_service.py -- call_openai(), call_azure_openai()
gemini_service.py -- generate_gemini_content() via google-genai
job_service.py -- create_job(), change_job_status(), callback POSTs
documents_service.py -- create_document(), change_document_status()
markdown_service.py -- Markdown tree parser, article generator, title extractor
image_upload_provider.py -- ImageUploadProvider ABC + ImageUploadMode type
s3_image_upload_provider.py -- S3ImageUploadProvider
tmp_image_upload_provider.py -- TmpImageUploadProvider
parsers/
PDFParser.py -- PDFParser ABC + PDFParserOptions dataclass
PyMuPDFParser.py -- PyMuPDF-based implementation
DoclingPDFParser.py -- Docling-based implementation
utils/
utils.py -- ProcessingStage enum, detect_language_for_text, string_size, Singleton, timeit, autodetect_mime_type
logger.py -- app_logger (ECS or simple format via .ini)
exceptions.py -- DocumentNotFound, JobNotFound
prompts/
gemini_text_only.txt -- Gemini prompt for text-only PDF/HTML parsing
gemini_text_and_images.txt -- Gemini prompt for PDF parsing with image extraction
language_detection.txt -- LLM prompt for language detection
Key classes
| Class | File | Notes |
|---|---|---|
| JWTBearer | auth/jwt_bearer.py | FastAPI dependency; validates HS256 JWT, checks role == "admin" |
| JobPollingService | services/job_polling_service.py | Async loop; uses SELECT FOR UPDATE SKIP LOCKED to pick up NEW jobs |
| JobExecutionManager | services/job_execution_manager.py | Singleton; wraps ProcessPoolExecutor; holds the cancellation dict shared with worker processes |
| PDFParser / PyMuPDFParser / DoclingPDFParser | parsers/ | Strategy pattern for PDF-to-Markdown |
| ProcessingStage | utils/utils.py | Enum tracking pipeline progress: UPLOADED -> CONVERTING -> PROCESSING_MARKDOWN -> SEGMENTING -> TAGGING -> FINALIZING -> COMPLETED |
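
As an illustration of the stage sequence in the last row, here is a minimal sketch of such an enum. The string values and the `next_stage` helper are assumptions made for this example; the actual definition in utils/utils.py may differ.

```python
from enum import Enum
from typing import Optional


class ProcessingStage(str, Enum):
    """Pipeline stages, declared in the order given in the table above."""

    # The string values are assumed; only the member names come from the doc.
    UPLOADED = "UPLOADED"
    CONVERTING = "CONVERTING"
    PROCESSING_MARKDOWN = "PROCESSING_MARKDOWN"
    SEGMENTING = "SEGMENTING"
    TAGGING = "TAGGING"
    FINALIZING = "FINALIZING"
    COMPLETED = "COMPLETED"


# Enum iteration preserves definition order, so this list is the pipeline order.
_ORDER = list(ProcessingStage)


def next_stage(stage: ProcessingStage) -> Optional[ProcessingStage]:
    """Hypothetical helper: return the following stage, or None after COMPLETED."""
    idx = _ORDER.index(stage)
    return _ORDER[idx + 1] if idx + 1 < len(_ORDER) else None
```

Deriving from `str` keeps the members JSON-serializable, which is convenient if the stage is exposed through the debug endpoints.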
Request flow: POST /jobs
LLM backend selection
LLM_BACKEND env var selects which client is used for tagging and language detection:
- openai (default): AsyncOpenAI with OPENAI_API_KEY
- azure: AsyncAzureOpenAI with AZURE_LLM_API_KEY, AZURE_LLM_ENDPOINT, AZURE_LLM_API_VERSION
Gemini is used only when pdf_parsing_backend=gemini-3 is passed in the job options. It is always called synchronously via the google-genai client (not async).
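
The LLM_BACKEND selection described above could be sketched roughly as follows. The function names `resolve_llm_backend` and `make_client` are hypothetical and not taken from llm_service.py; only the env var names and the two client classes come from this document.

```python
import os
from typing import Mapping

# Required variables per backend, as listed above.
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "azure": ["AZURE_LLM_API_KEY", "AZURE_LLM_ENDPOINT", "AZURE_LLM_API_VERSION"],
}


def resolve_llm_backend(env: Mapping[str, str] = os.environ) -> str:
    """Hypothetical helper: return the backend name after checking its env vars."""
    backend = env.get("LLM_BACKEND", "openai").strip().lower()
    if backend not in REQUIRED_ENV:
        raise ValueError(f"Unsupported LLM_BACKEND: {backend!r}")
    missing = [name for name in REQUIRED_ENV[backend] if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing env vars for {backend}: {', '.join(missing)}")
    return backend


def make_client(env: Mapping[str, str] = os.environ):
    """Hypothetical factory: build the async client for the selected backend."""
    backend = resolve_llm_backend(env)
    # Imported lazily so the validation above works without the openai package.
    from openai import AsyncAzureOpenAI, AsyncOpenAI

    if backend == "azure":
        return AsyncAzureOpenAI(
            api_key=env["AZURE_LLM_API_KEY"],
            azure_endpoint=env["AZURE_LLM_ENDPOINT"],
            api_version=env["AZURE_LLM_API_VERSION"],
        )
    return AsyncOpenAI(api_key=env["OPENAI_API_KEY"])
```

Failing fast on a missing variable at startup is one possible design choice; the service may instead defer the check to the first LLM call.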
PDF parsing backend selection
The pdf_parsing_backend option on a job (or PDF_PARSING_BACKEND env default) selects:
| Value | Parser class |
|---|---|
| pymupdf (default) | PyMuPDFParser -- no OCR support |
| docling | DoclingPDFParser -- supports OCR via tesseract or easyocr |
| gemini-3 | Direct call to gemini_pdf_to_markdown() -- bypasses parser classes |
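
The dispatch in the table above might look something like the sketch below. The method name `to_markdown`, the `ocr_engine` field, the `pdf_to_markdown` dispatcher, and all placeholder return values are assumptions for illustration; the real parsers wrap PyMuPDF, Docling, and Gemini respectively.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class PDFParserOptions:
    # Field name is assumed; the real dataclass may carry other options.
    ocr_engine: Optional[str] = None


class PDFParser(ABC):
    """Strategy interface shared by the parser implementations."""

    def __init__(self, options: Optional[PDFParserOptions] = None) -> None:
        self.options = options or PDFParserOptions()

    @abstractmethod
    def to_markdown(self, pdf_bytes: bytes) -> str:
        """Convert a PDF to Markdown (method name is an assumption)."""


class PyMuPDFParser(PDFParser):
    def to_markdown(self, pdf_bytes: bytes) -> str:
        return "<markdown from pymupdf>"  # placeholder for the real conversion


class DoclingPDFParser(PDFParser):
    def to_markdown(self, pdf_bytes: bytes) -> str:
        return "<markdown from docling>"  # placeholder for the real conversion


def gemini_pdf_to_markdown(pdf_bytes: bytes) -> str:
    return "<markdown from gemini>"  # placeholder for the real Gemini call


_PARSERS = {"pymupdf": PyMuPDFParser, "docling": DoclingPDFParser}


def pdf_to_markdown(
    pdf_bytes: bytes,
    backend: str = "pymupdf",
    options: Optional[PDFParserOptions] = None,
) -> str:
    """Hypothetical dispatcher: gemini-3 bypasses the parser classes."""
    if backend == "gemini-3":
        return gemini_pdf_to_markdown(pdf_bytes)
    try:
        parser_cls = _PARSERS[backend]
    except KeyError:
        raise ValueError(f"Unknown pdf_parsing_backend: {backend!r}") from None
    return parser_cls(options).to_markdown(pdf_bytes)
```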
Concurrency model
- One uvicorn worker process (HTTP request handling is async).
- JobPollingService runs as an asyncio task inside the uvicorn process.
- Each job is executed in a separate subprocess spawned by ProcessPoolExecutor (spawn method, not fork).
- Inter-process cancellation uses a multiprocessing.Manager().dict() (the cancellation_dict).
- BACKGROUND_TASK_LIMIT env var (default: 2) controls max concurrent job subprocesses.
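
To make the subprocess and cancellation mechanics above concrete, here is a minimal sketch using only the standard library. `run_job`, `main`, and the job IDs are hypothetical stand-ins; the real JobExecutionManager and process_document() are more involved.

```python
import multiprocessing as mp
import time
from concurrent.futures import ProcessPoolExecutor
from typing import MutableMapping


def run_job(job_id: str, cancellation_dict: MutableMapping[str, bool], steps: int = 3) -> str:
    """Stand-in for the pipeline: checks the cancellation flag between steps."""
    for _ in range(steps):
        if cancellation_dict.get(job_id):
            return "cancelled"
        time.sleep(0.01)  # placeholder for one unit of work
    return "completed"


def main(max_workers: int = 2) -> None:
    """Run two jobs in spawned subprocesses, one of them pre-cancelled."""
    ctx = mp.get_context("spawn")  # spawn, not fork
    with ctx.Manager() as manager:
        # The dict lives in the manager process; workers receive a proxy to it.
        cancellation_dict = manager.dict()
        cancellation_dict["job-2"] = True  # request cancellation of job-2
        with ProcessPoolExecutor(max_workers=max_workers, mp_context=ctx) as pool:
            futures = {
                job_id: pool.submit(run_job, job_id, cancellation_dict)
                for job_id in ("job-1", "job-2")
            }
            for job_id, future in futures.items():
                print(job_id, future.result())
```

Because the spawn start method re-imports the main module in each child, `main()` should be called under an `if __name__ == "__main__":` guard when this is run as a script. Here `max_workers` plays the role that BACKGROUND_TASK_LIMIT plays in the service.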