Architecture: semantic-doc-segmenter

Module layout

app/
    main.py                             -- FastAPI app, lifespan hooks, router registration
    config/
        config.py                       -- All env vars with defaults
        database.py                     -- SQLAlchemy engine, session factory, get_db dependency
    auth/
        jwt_bearer.py                   -- JWTBearer (HTTPBearer subclass), role check
    routers/
        jobs.py                         -- /jobs CRUD + cancel
        documents.py                    -- /documents + /segments read endpoints
        debug.py                        -- /debug/jobs/{id}, /debug/health*
    models/
        models.py                       -- SQLAlchemy ORM: Document, Job, DocumentSegment + enums
    schemas/
        document_schemas.py             -- Pydantic schemas for document responses
        jobs_schemas.py                 -- Pydantic schemas for job responses
        segment_schema.py               -- Pydantic schemas for segment responses
    services/
        job_polling_service.py          -- JobPollingService: async polling loop (1s interval)
        job_execution_manager.py        -- JobExecutionManager (Singleton): ProcessPoolExecutor
        doc_processing_service.py       -- process_document(): full pipeline logic
        doc_converter_service.py        -- Format-to-Markdown converters (PDF/DOCX/PPTX/HTML)
        llm_service.py                  -- call_openai(), call_azure_openai()
        gemini_service.py               -- generate_gemini_content() via google-genai
        job_service.py                  -- create_job(), change_job_status(), callback POSTs
        documents_service.py            -- create_document(), change_document_status()
        markdown_service.py             -- Markdown tree parser, article generator, title extractor
        image_upload_provider.py        -- ImageUploadProvider ABC + ImageUploadMode type
        s3_image_upload_provider.py     -- S3ImageUploadProvider
        tmp_image_upload_provider.py    -- TmpImageUploadProvider
    parsers/
        PDFParser.py                    -- PDFParser ABC + PDFParserOptions dataclass
        PyMuPDFParser.py                -- PyMuPDF-based implementation
        DoclingPDFParser.py             -- Docling-based implementation
    utils/
        utils.py                        -- ProcessingStage enum, detect_language_for_text,
                                           string_size, Singleton, timeit, autodetect_mime_type
        logger.py                       -- app_logger (ECS or simple format via .ini)
        exceptions.py                   -- DocumentNotFound, JobNotFound
    prompts/
        gemini_text_only.txt            -- Gemini prompt for text-only PDF/HTML parsing
        gemini_text_and_images.txt      -- Gemini prompt for PDF parsing with image extraction
        language_detection.txt          -- LLM prompt for language detection

Key classes

| Class | File | Notes |
| --- | --- | --- |
| JWTBearer | auth/jwt_bearer.py | FastAPI dependency; validates HS256 JWT, checks role == "admin" |
| JobPollingService | services/job_polling_service.py | Async loop; SELECT FOR UPDATE SKIP LOCKED to pick up NEW jobs |
| JobExecutionManager | services/job_execution_manager.py | Singleton; wraps ProcessPoolExecutor; holds the cancellation dict shared with worker processes |
| PDFParser / PyMuPDFParser / DoclingPDFParser | parsers/ | Strategy pattern for PDF-to-Markdown |
| ProcessingStage | utils/utils.py | Enum tracking pipeline progress: UPLOADED -> CONVERTING -> PROCESSING_MARKDOWN -> SEGMENTING -> TAGGING -> FINALIZING -> COMPLETED |
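
The stage order listed for ProcessingStage can be sketched as a plain Python enum. This is a minimal illustration, not the code in utils/utils.py: the string values, the str mixin, and the next_stage() helper are assumptions made for the example; only the member names and their order come from the table above.

```python
from enum import Enum
from typing import Optional


class ProcessingStage(str, Enum):
    """Pipeline stages, declared in the order a job moves through them."""

    # The string values are assumed; only the names and order are documented.
    UPLOADED = "UPLOADED"
    CONVERTING = "CONVERTING"
    PROCESSING_MARKDOWN = "PROCESSING_MARKDOWN"
    SEGMENTING = "SEGMENTING"
    TAGGING = "TAGGING"
    FINALIZING = "FINALIZING"
    COMPLETED = "COMPLETED"


def next_stage(stage: ProcessingStage) -> Optional[ProcessingStage]:
    """Hypothetical helper: return the following stage, or None after COMPLETED."""
    ordered = list(ProcessingStage)  # Enum iteration follows definition order
    index = ordered.index(stage)
    return ordered[index + 1] if index + 1 < len(ordered) else None
```

Declaring the members in pipeline order lets a helper like this derive transitions from the enum itself instead of keeping a separate transition table.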

Request flow: POST /jobs

LLM backend selection

The LLM_BACKEND env var selects which client is used for tagging and language detection:

  • openai (default): AsyncOpenAI with OPENAI_API_KEY
  • azure: AsyncAzureOpenAI with AZURE_LLM_API_KEY, AZURE_LLM_ENDPOINT, AZURE_LLM_API_VERSION

Gemini is used only when pdf_parsing_backend=gemini-3 is passed in the job options. It is always called synchronously through the google-genai client, not through an async client.
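
The LLM_BACKEND selection for tagging and language detection could look roughly like the sketch below. It only resolves which client class and env vars apply and does not import the OpenAI SDK. The names LLMBackendSpec, BACKENDS, and resolve_llm_backend are hypothetical, and failing early on missing variables is an assumption rather than documented behaviour.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class LLMBackendSpec:
    """Hypothetical description of one LLM backend."""

    client_class: str               # name of the SDK client class to instantiate
    required_env: Tuple[str, ...]   # env vars that must be set for this backend


# Values taken from the backend list above.
BACKENDS = {
    "openai": LLMBackendSpec("AsyncOpenAI", ("OPENAI_API_KEY",)),
    "azure": LLMBackendSpec(
        "AsyncAzureOpenAI",
        ("AZURE_LLM_API_KEY", "AZURE_LLM_ENDPOINT", "AZURE_LLM_API_VERSION"),
    ),
}


def resolve_llm_backend(env: Optional[Mapping[str, str]] = None) -> LLMBackendSpec:
    """Pick the backend named by LLM_BACKEND, defaulting to "openai"."""
    env = os.environ if env is None else env
    name = env.get("LLM_BACKEND", "openai")
    if name not in BACKENDS:
        raise ValueError(f"Unknown LLM_BACKEND: {name!r}")
    spec = BACKENDS[name]
    # Assumed behaviour: report missing credentials before any client is built.
    missing = [var for var in spec.required_env if not env.get(var)]
    if missing:
        raise RuntimeError(f"Missing env vars for {name}: {', '.join(missing)}")
    return spec
```

Passing the environment as a mapping keeps the function easy to exercise without touching the real process environment.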

PDF parsing backend selection

The pdf_parsing_backend option on a job (or PDF_PARSING_BACKEND env default) selects:

| Value | Parser class |
| --- | --- |
| pymupdf (default) | PyMuPDFParser -- no OCR support |
| docling | DoclingPDFParser -- supports OCR via tesseract or easyocr |
| gemini-3 | Direct call to gemini_pdf_to_markdown() -- bypasses parser classes |
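
As a rough sketch of the lookup described by the table, the job option can be resolved first, then the PDF_PARSING_BACKEND env default, then pymupdf. The function names resolve_pdf_backend and describe_pdf_route are hypothetical, the precedence order is inferred from the sentence above the table, and the routes are returned as names only so the example runs without the real parser classes.

```python
from typing import Mapping

DEFAULT_PDF_PARSING_BACKEND = "pymupdf"
GEMINI_BACKEND = "gemini-3"

# Backends that go through a PDFParser subclass (Strategy pattern).
PARSER_CLASS_BY_BACKEND = {
    "pymupdf": "PyMuPDFParser",
    "docling": "DoclingPDFParser",
}


def resolve_pdf_backend(job_options: Mapping[str, str], env: Mapping[str, str]) -> str:
    """Assumed precedence: job option, then env default, then pymupdf."""
    return (
        job_options.get("pdf_parsing_backend")
        or env.get("PDF_PARSING_BACKEND")
        or DEFAULT_PDF_PARSING_BACKEND
    )


def describe_pdf_route(backend: str) -> str:
    """Return the name of the parser class or function that handles `backend`."""
    if backend == GEMINI_BACKEND:
        # Gemini bypasses the parser classes and is called directly.
        return "gemini_pdf_to_markdown"
    try:
        return PARSER_CLASS_BY_BACKEND[backend]
    except KeyError:
        raise ValueError(f"Unknown pdf_parsing_backend: {backend!r}") from None
```

For example, describe_pdf_route(resolve_pdf_backend({}, {"PDF_PARSING_BACKEND": "docling"})) resolves to the Docling parser, while a job that passes pdf_parsing_backend=gemini-3 takes the direct Gemini path regardless of the env default.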

Concurrency model

  • One uvicorn worker process (HTTP request handling is async).
  • JobPollingService runs as an asyncio task inside the uvicorn process.
  • Each job is executed in a separate subprocess spawned by ProcessPoolExecutor (spawn method, not fork).
  • Inter-process cancellation uses a multiprocessing.Manager().dict() (the cancellation_dict).
  • BACKGROUND_TASK_LIMIT env var (default: 2) controls max concurrent job subprocesses.
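
The subprocess and cancellation pieces of this model can be sketched with the standard library alone. This is an illustration under assumptions, not the JobExecutionManager code: worker_limit(), run_job(), and main() are hypothetical names, the job body is a placeholder loop, and checking the dict between steps is one plausible way a worker could observe cancellation.

```python
import multiprocessing as mp
import os
import time
from concurrent.futures import ProcessPoolExecutor
from typing import Mapping, Optional


def worker_limit(env: Optional[Mapping[str, str]] = None) -> int:
    """Read BACKGROUND_TASK_LIMIT, falling back to the documented default of 2."""
    env = os.environ if env is None else env
    return int(env.get("BACKGROUND_TASK_LIMIT", "2"))


def run_job(job_id: str, cancellation_dict: Mapping[str, bool], steps: int = 5) -> str:
    """Placeholder job: check the cancellation flag before each unit of work."""
    for _ in range(steps):
        if cancellation_dict.get(job_id):
            return "cancelled"
        time.sleep(0.01)  # stands in for one pipeline step
    return "completed"


def main() -> None:
    # Use the spawn start method explicitly, as the document describes.
    ctx = mp.get_context("spawn")
    with ctx.Manager() as manager:
        # Proxy dict served by the manager process; picklable, so it can be
        # passed to worker subprocesses as a task argument.
        cancellation_dict = manager.dict()
        with ProcessPoolExecutor(max_workers=worker_limit(), mp_context=ctx) as pool:
            cancellation_dict["job-2"] = True  # request cancellation up front
            futures = {
                job_id: pool.submit(run_job, job_id, cancellation_dict)
                for job_id in ("job-1", "job-2")
            }
            for job_id, future in futures.items():
                print(job_id, future.result())
```

With the spawn start method each worker re-imports the module, so main() should be called only under an `if __name__ == "__main__":` guard in a script file.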