Deployment: semantic-doc-segmenter

Docker

Base image: python:3.11.2-slim

Entrypoint (production):

poetry run uvicorn app.main:app --host 0.0.0.0 --port 8081 --workers 1 --log-config app/config/logging_ecs.ini

Port: 8081 (mapped 8081:8081 in docker-compose)

Build:

docker build -t semantic-doc-segmenter .
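
For a quick smoke test outside compose, something like the following should work, assuming the .env file described under Environment variables below:

docker run --rm --env-file .env -p 8081:8081 semantic-doc-segmenter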

The Dockerfile installs system packages (pandoc, libmagic1, cmake, libgoogle-perftools-dev) at build time, and the Poetry virtualenv is created inside the image.

docker-compose

docker-compose.yaml -- production-style compose with bundled MySQL:

  • Service: semantic-doc-segmenter
  • Service: mysql (image: mysql:latest)
  • Service: test-semantic-doc-segmenter (runs pytest)

docker-compose.local.yml -- local dev overlay (generated by scripts/local-dev/generate-env.sh):

  • Disables the bundled MySQL service, replacing it with a no-op busybox container
  • Adds an entrypoint that runs alembic upgrade head before starting uvicorn (see the sketch after this list)
  • Joins the platform-local external Docker network to reach the shared platform MySQL
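
A minimal sketch of what that overlay entrypoint does (hypothetical; the script generated by scripts/local-dev/generate-env.sh may differ in detail):

#!/bin/sh
# Hypothetical local-dev entrypoint: apply migrations, then start the server.
set -e
poetry run alembic upgrade head
exec poetry run uvicorn app.main:app --host 0.0.0.0 --port 8081 --workers 1 --log-config app/config/logging_ecs.ini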

Local dev start command:

docker-compose --env-file .env -f docker-compose.yaml -f docker-compose.local.yml up --build

Health check

Defined in docker-compose.yaml:

healthcheck:
  test: ["CMD", "curl", "-f", "http://semantic-doc-segmenter:8081"]
  interval: 15s
  timeout: 10s
  retries: 3
  start_period: 1s

The health check hits GET /, which returns {"vendorName": ..., "latestVersion": ...} and requires no auth.

/debug/health (auth required) returns structured job queue counts.
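
For manual checks, roughly the following (assuming bearer-token auth; $JWT is a placeholder for a token signed with JWT_SECRET):

# Unauthenticated root endpoint (same call the compose healthcheck makes)
curl -f http://localhost:8081/

# Authenticated debug endpoint returning job queue counts
curl -f -H "Authorization: Bearer $JWT" http://localhost:8081/debug/health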

Environment variables

Required

| Var | Example | Purpose |
| --- | --- | --- |
| MYSQL_URL | mysql+pymysql://segmenter:pw@host/pdfdocuments?charset=utf8mb4 | Database connection |
| JWT_SECRET | <random 64-char string> | JWT signing secret |
| OPENAI_API_KEY | sk-... | OpenAI API key (required if LLM_BACKEND=openai) |
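
A minimal .env with placeholder values (host and credentials depend on your setup):

MYSQL_URL=mysql+pymysql://segmenter:pw@host/pdfdocuments?charset=utf8mb4
JWT_SECRET=<random 64-char string>
OPENAI_API_KEY=sk-...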

LLM backend

| Var | Default | Purpose |
| --- | --- | --- |
| LLM_BACKEND | openai | openai or azure |
| OPENAI_LLM_MODEL_FOR_TAGGER | gpt-5.4 | Model for article tagging |
| OPENAI_LLM_MODEL_FOR_REFORMAT | gpt-5.4 | Model for markdown reformatting |
| OPENAI_LLM_MODEL_FOR_HEADINGS | gpt-5.4 | Model for heading insertion |
| OPENAI_LLM_MODEL_FOR_LANGUAGE_DETECTION | gpt-5.4-nano | Model for language detection |
| AZURE_LLM_API_KEY | | Azure OpenAI API key |
| AZURE_LLM_API_VERSION | 2023-07-01-preview | Azure API version |
| AZURE_LLM_ENDPOINT | https://helvia-francecentral-0.openai.azure.com | Azure endpoint |
| AZURE_LLM_DEPLOYMENT_NAME | helvia-francecentral-0 | Azure deployment name |
| AZURE_LLM_MODEL_FOR_TAGGER | gpt-5.4 | |
| AZURE_LLM_MODEL_FOR_REFORMAT | gpt-5.4 | |
| AZURE_LLM_MODEL_FOR_HEADINGS | gpt-5.4 | |
| AZURE_LLM_MODEL_FOR_LANGUAGE_DETECTION | gpt-5.4-nano | |
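
For example, to point the service at Azure OpenAI instead of OpenAI (the endpoint and deployment values shown are the table defaults, not necessarily yours; the API key is a placeholder):

LLM_BACKEND=azure
AZURE_LLM_API_KEY=<azure-openai-key>
AZURE_LLM_ENDPOINT=https://helvia-francecentral-0.openai.azure.com
AZURE_LLM_DEPLOYMENT_NAME=helvia-francecentral-0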

Gemini

| Var | Default | Purpose |
| --- | --- | --- |
| GEMINI_API_KEY | | Google Gemini API key |
| GEMINI_MODEL | gemini-3.1-pro-preview | Model name |
| GEMINI_PROMPT_TEXT_ONLY_PATH | app/prompts/gemini_text_only.txt | Path to text-only prompt |
| GEMINI_PROMPT_TEXT_AND_IMAGES_PATH | app/prompts/gemini_text_and_images.txt | Path to image prompt |
| GEMINI_LOG_PROMPT | false | Log the prompt at INFO level |
| GEMINI_DEBUG_SAVE_OVERLAY_PDF | false | Save image-overlay PDFs for debugging |
| GEMINI_DEBUG_OVERLAY_DIR | /tmp | Directory for overlay PDFs |

Language detection

| Var | Default | Purpose |
| --- | --- | --- |
| USE_GOOGLE_LANGUAGE_DETECTION | false | Use Google Cloud Translate for language detection |
| GOOGLE_APPLICATION_CREDENTIALS | | Path to GCP service account JSON |
| GOOGLE_CLOUD_TRANSLATE_PROJECT_ID | hbf-language-translation-dev | GCP project for translation API |
| LANGUAGE_DETECT_GOOGLE_TIMEOUT_SECONDS | 10 | Timeout for Google detection |
| LANGUAGE_DETECT_LLM_TIMEOUT_SECONDS | 30 | Timeout for LLM detection |
| LANGUAGE_DETECTION_PROMPT_PATH | app/prompts/language_detection.txt | |
| LANGUAGE_DETECT_LOOKAHEAD | 500 | Chars to inspect for detection |
| SYSTEM_DEFAULT_LANGUAGE | en | Fallback language code |
| SYSTEM_NATIVE_LANGUAGES | en | Comma-separated list of native language codes |
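
For example, to route language detection through Google Cloud Translate rather than the LLM (the credentials path is a placeholder):

USE_GOOGLE_LANGUAGE_DETECTION=true
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
GOOGLE_CLOUD_TRANSLATE_PROJECT_ID=hbf-language-translation-dev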

Segmenter

| Var | Default | Purpose |
| --- | --- | --- |
| SEGMENTER_MAX_ARTICLE_SIZE | 2000 | Default max segment size |
| SEGMENTER_SIZE_UNITS | non_ws_chars | words, chars, non_ws_chars, or tokens |
| MAX_FILE_SIZE_LIMIT | 67108864 (64 MB) | Max upload size in bytes |

PDF / OCR

| Var | Default | Purpose |
| --- | --- | --- |
| PDF_PARSING_BACKEND | pymupdf | pymupdf, docling, or gemini-3 |
| OCR_BACKEND | tesseract | tesseract or easyocr |
| DOCX_TO_MD_CONVERTER | mammoth | mammoth or pypandoc |

Image handling

| Var | Default | Purpose |
| --- | --- | --- |
| ENABLE_IMAGE_HANDLING | false | Extract and store images from documents |
| IMAGE_HANDLING_MODE | s3 | s3 or tmp_file |
| AWS_S3_BUCKET_NAME | doc-segmenter-images | S3 bucket |
| AWS_REGION_NAME | us-east-1 | AWS region |
| AWS_ACCESS_KEY_ID | | AWS credentials |
| AWS_SECRET_ACCESS_KEY | | AWS credentials |

Database pool

| Var | Default | Purpose |
| --- | --- | --- |
| SQLALCHEMY_POOL_SIZE | 5 | Connection pool size |
| SQLALCHEMY_OVERFLOW | 10 | Max overflow connections beyond the pool size |
| SQLALCHEMY_POOL_RECYCLE | 3600 | Seconds before a pooled connection is recycled |
| MYSQL_USE_SSL | false | Enable SSL to MySQL (no cert validation) |

Worker concurrency / timeouts

| Var | Default | Purpose |
| --- | --- | --- |
| BACKGROUND_TASK_LIMIT | 2 | Max concurrent job subprocesses |
| JOB_TIMEOUT_SECONDS | 600 | Seconds before a running job is force-cancelled |

Logging / observability

| Var | Default | Purpose |
| --- | --- | --- |
| LOG_LEVEL_APP | INFO | App logger level |
| LOG_LEVEL_ROOT | INFO | Root logger level |
| LOG_LEVEL_SQL | ERROR | SQLAlchemy logger level |
| ENABLE_APM | false | Enable Elastic APM middleware |

Other

| Var | Default | Purpose |
| --- | --- | --- |
| VERSION | 1.1.0 | Reported in GET / and OpenAPI |
| MIMETYPE_MAGIC_LOOKAHEAD | 5000 | Bytes inspected by python-magic for MIME detection |

Alembic migrations

Migrations run automatically on startup (via the app/main.py lifespan handler) and are also run explicitly by the local dev entrypoint:

MYSQL_URL=... poetry run alembic upgrade head
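
The standard Alembic commands also apply for inspecting state or creating a new migration, assuming the project's Alembic env supports autogenerate (the revision message is a placeholder):

MYSQL_URL=... poetry run alembic current
MYSQL_URL=... poetry run alembic revision --autogenerate -m "describe the schema change"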

Local development (without Docker)

poetry install --no-root
PYTHONPATH=. poetry run uvicorn app.main:app --workers 1 --log-config app/config/logging_simple.ini

Requires a .env file with at minimum MYSQL_URL, JWT_SECRET, and OPENAI_API_KEY.

Tests

# Integration tests (requires running service and DB)
poetry run pytest --envfile .env.test -vvv

# HURL tests
./bin/run_hurl_tests.sh