Deployment: semantic-doc-segmenter

Docker

Base image: python:3.11.2-slim

Entrypoint (production):

poetry run uvicorn app.main:app --host 0.0.0.0 --port 8081 --workers 1 --log-config app/config/logging_ecs.ini

Port: 8081 (mapped 8081:8081 in docker-compose)

Build:

docker build -t semantic-doc-segmenter .
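
For a quick smoke test outside compose, something like the following should work, assuming the .env file described under Environment variables below:

docker run --rm --env-file .env -p 8081:8081 semantic-doc-segmenter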

The Dockerfile installs system packages (pandoc, libmagic1, cmake, libgoogle-perftools-dev) at build time, and the Poetry virtualenv is created inside the image.

docker-compose

docker-compose.yaml -- production-style compose with bundled MySQL:

  • Service: semantic-doc-segmenter
  • Service: mysql (image: mysql:latest)
  • Service: test-semantic-doc-segmenter (runs pytest)

docker-compose.local.yml -- local dev overlay (generated by scripts/local-dev/generate-env.sh):

  • Disables the bundled MySQL service, replacing it with a no-op busybox container
  • Adds an entrypoint that runs alembic upgrade head before starting uvicorn (see the sketch after this list)
  • Joins the platform-local external Docker network to reach the shared platform MySQL
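
A minimal sketch of what that overlay entrypoint does (hypothetical; the script generated by scripts/local-dev/generate-env.sh may differ in detail):

#!/bin/sh
# Hypothetical local-dev entrypoint: apply migrations, then start the server.
set -e
poetry run alembic upgrade head
exec poetry run uvicorn app.main:app --host 0.0.0.0 --port 8081 --workers 1 --log-config app/config/logging_ecs.ini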

Local dev start command:

docker-compose --env-file .env -f docker-compose.yaml -f docker-compose.local.yml up --build

Health check

Defined in docker-compose.yaml:

healthcheck:
  test: ["CMD", "curl", "-f", "http://semantic-doc-segmenter:8081"]
  interval: 15s
  timeout: 10s
  retries: 3
  start_period: 1s

The health check hits GET /, which returns {"vendorName": ..., "latestVersion": ...} and requires no auth.

/debug/health (auth required) returns structured job queue counts.
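
For manual checks, roughly the following (assuming bearer-token auth; $JWT is a placeholder for a token signed with JWT_SECRET):

# Unauthenticated root endpoint (same call the compose healthcheck makes)
curl -f http://localhost:8081/

# Authenticated debug endpoint returning job queue counts
curl -f -H "Authorization: Bearer $JWT" http://localhost:8081/debug/health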

Environment variables

Required

| Var | Example | Purpose |
| --- | --- | --- |
| MYSQL_URL | mysql+pymysql://segmenter:pw@host/pdfdocuments?charset=utf8mb4 | Database connection |
| JWT_SECRET | <random 64-char string> | JWT signing secret |
| OPENAI_API_KEY | sk-... | OpenAI API key (required if LLM_BACKEND=openai) |
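
A minimal .env with placeholder values (host and credentials depend on your setup):

MYSQL_URL=mysql+pymysql://segmenter:pw@host/pdfdocuments?charset=utf8mb4
JWT_SECRET=<random 64-char string>
OPENAI_API_KEY=sk-...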

LLM backend

| Var | Default | Purpose |
| --- | --- | --- |
| LLM_BACKEND | openai | openai or azure |
| OPENAI_LLM_MODEL_FOR_TAGGER | gpt-5.4 | Model for article tagging |
| OPENAI_LLM_MODEL_FOR_REFORMAT | gpt-5.4 | Model for markdown reformatting |
| OPENAI_LLM_MODEL_FOR_HEADINGS | gpt-5.4 | Model for heading insertion |
| OPENAI_LLM_MODEL_FOR_LANGUAGE_DETECTION | gpt-5.4-nano | Model for language detection |
| AZURE_LLM_API_KEY | | Azure OpenAI API key |
| AZURE_LLM_API_VERSION | 2023-07-01-preview | Azure API version |
| AZURE_LLM_ENDPOINT | https://helvia-francecentral-0.openai.azure.com | Azure endpoint |
| AZURE_LLM_DEPLOYMENT_NAME | helvia-francecentral-0 | Azure deployment name |
| AZURE_LLM_MODEL_FOR_TAGGER | gpt-5.4 | |
| AZURE_LLM_MODEL_FOR_REFORMAT | gpt-5.4 | |
| AZURE_LLM_MODEL_FOR_HEADINGS | gpt-5.4 | |
| AZURE_LLM_MODEL_FOR_LANGUAGE_DETECTION | gpt-5.4-nano | |
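
For example, to point the service at Azure OpenAI instead of OpenAI (the endpoint and deployment values shown are the table defaults, not necessarily yours; the API key is a placeholder):

LLM_BACKEND=azure
AZURE_LLM_API_KEY=<azure-openai-key>
AZURE_LLM_ENDPOINT=https://helvia-francecentral-0.openai.azure.com
AZURE_LLM_DEPLOYMENT_NAME=helvia-francecentral-0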

Gemini

| Var | Default | Purpose |
| --- | --- | --- |
| GEMINI_API_KEY | | Google Gemini API key |
| GEMINI_MODEL | gemini-3.1-pro-preview | Model name |
| GEMINI_PROMPT_TEXT_ONLY_PATH | app/prompts/gemini_text_only.txt | Path to text-only prompt |
| GEMINI_PROMPT_TEXT_AND_IMAGES_PATH | app/prompts/gemini_text_and_images.txt | Path to image prompt |
| GEMINI_LOG_PROMPT | false | Log the prompt at INFO level |
| GEMINI_DEBUG_SAVE_OVERLAY_PDF | false | Save image-overlay PDFs for debugging |
| GEMINI_DEBUG_OVERLAY_DIR | /tmp | Directory for overlay PDFs |

Language detection

| Var | Default | Purpose |
| --- | --- | --- |
| USE_GOOGLE_LANGUAGE_DETECTION | false | Use Google Cloud Translate for language detection |
| GOOGLE_APPLICATION_CREDENTIALS | | Path to GCP service account JSON |
| GOOGLE_CLOUD_TRANSLATE_PROJECT_ID | hbf-language-translation-dev | GCP project for translation API |
| LANGUAGE_DETECT_GOOGLE_TIMEOUT_SECONDS | 10 | Timeout for Google detection |
| LANGUAGE_DETECT_LLM_TIMEOUT_SECONDS | 30 | Timeout for LLM detection |
| LANGUAGE_DETECTION_PROMPT_PATH | app/prompts/language_detection.txt | |
| LANGUAGE_DETECT_LOOKAHEAD | 500 | Chars to inspect for detection |
| SYSTEM_DEFAULT_LANGUAGE | en | Fallback language code |
| SYSTEM_NATIVE_LANGUAGES | en | Comma-separated list of native language codes |
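
For example, to route language detection through Google Cloud Translate rather than the LLM (the credentials path is a placeholder):

USE_GOOGLE_LANGUAGE_DETECTION=true
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
GOOGLE_CLOUD_TRANSLATE_PROJECT_ID=hbf-language-translation-dev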

Segmenter

| Var | Default | Purpose |
| --- | --- | --- |
| SEGMENTER_MAX_ARTICLE_SIZE | 2000 | Default max segment size |
| SEGMENTER_SIZE_UNITS | non_ws_chars | words, chars, non_ws_chars, or tokens |
| MAX_FILE_SIZE_LIMIT | 67108864 (64 MB) | Max upload size in bytes |

PDF / OCR

| Var | Default | Purpose |
| --- | --- | --- |
| PDF_PARSING_BACKEND | pymupdf | pymupdf, docling, or gemini-3 |
| OCR_BACKEND | tesseract | tesseract or easyocr |
| DOCX_TO_MD_CONVERTER | mammoth | mammoth or pypandoc |

Image handling

| Var | Default | Purpose |
| --- | --- | --- |
| ENABLE_IMAGE_HANDLING | false | Extract and store images from documents |
| IMAGE_HANDLING_MODE | s3 | s3 or tmp_file |
| AWS_S3_BUCKET_NAME | doc-segmenter-images | S3 bucket |
| AWS_REGION_NAME | us-east-1 | AWS region |
| AWS_ACCESS_KEY_ID | | AWS credentials |
| AWS_SECRET_ACCESS_KEY | | AWS credentials |

Database pool

| Var | Default | Purpose |
| --- | --- | --- |
| SQLALCHEMY_POOL_SIZE | 5 | Connection pool size |
| SQLALCHEMY_OVERFLOW | 10 | Max overflow connections beyond the pool size |
| SQLALCHEMY_POOL_RECYCLE | 3600 | Seconds before a pooled connection is recycled |
| MYSQL_USE_SSL | false | Enable SSL to MySQL (no cert validation) |

Worker concurrency / timeouts

| Var | Default | Purpose |
| --- | --- | --- |
| BACKGROUND_TASK_LIMIT | 2 | Max concurrent job subprocesses |
| JOB_TIMEOUT_SECONDS | 600 | Seconds before a running job is force-cancelled |

Logging / observability

| Var | Default | Purpose |
| --- | --- | --- |
| LOG_LEVEL_APP | INFO | App logger level |
| LOG_LEVEL_ROOT | INFO | Root logger level |
| LOG_LEVEL_SQL | ERROR | SQLAlchemy logger level |
| ENABLE_APM | false | Enable Elastic APM middleware |

Other

| Var | Default | Purpose |
| --- | --- | --- |
| VERSION | 1.1.0 | Reported in GET / and OpenAPI |
| MIMETYPE_MAGIC_LOOKAHEAD | 5000 | Bytes inspected by python-magic for MIME detection |

Alembic migrations

Migrations run automatically on startup (via the app/main.py lifespan handler) and are also run explicitly by the local dev entrypoint:

MYSQL_URL=... poetry run alembic upgrade head
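
The standard Alembic commands also apply for inspecting state or creating a new migration, assuming the project's Alembic env supports autogenerate (the revision message is a placeholder):

MYSQL_URL=... poetry run alembic current
MYSQL_URL=... poetry run alembic revision --autogenerate -m "describe the schema change"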

Local development (without Docker)

poetry install --no-root
PYTHONPATH=. poetry run uvicorn app.main:app --workers 1 --log-config app/config/logging_simple.ini

Requires a .env file with at minimum MYSQL_URL, JWT_SECRET, and OPENAI_API_KEY.

Tests

# Integration tests (requires running service and DB)
poetry run pytest --envfile .env.test -vvv

# HURL tests
./bin/run_hurl_tests.sh