# Deployment: semantic-doc-segmenter
## Docker

Base image: `python:3.11.2-slim`

Entrypoint (production):

```shell
poetry run uvicorn app.main:app --host 0.0.0.0 --port 8081 --workers 1 --log-config app/config/logging_ecs.ini
```

Port: 8081 (mapped `8081:8081` in docker-compose)

Build:

```shell
docker build -t semantic-doc-segmenter .
```

The Dockerfile installs system packages (pandoc, libmagic1, cmake, libgoogle-perftools-dev) at build time. The Poetry virtualenv is created inside the image.
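A Dockerfile along these lines would match the description above. This is a sketch, not the repository's actual file — the Poetry install flags and layer ordering are assumptions:

```dockerfile
FROM python:3.11.2-slim

# System packages needed by pandoc-based conversion, python-magic, and native builds
RUN apt-get update && apt-get install -y --no-install-recommends \
        pandoc libmagic1 cmake libgoogle-perftools-dev \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir poetry

WORKDIR /app
COPY pyproject.toml poetry.lock ./
# Virtualenv is created inside the image, as noted above
RUN poetry install --no-root --only main

COPY . .
EXPOSE 8081
CMD ["poetry", "run", "uvicorn", "app.main:app", "--host", "0.0.0.0", \
     "--port", "8081", "--workers", "1", "--log-config", "app/config/logging_ecs.ini"]
```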
## docker-compose

`docker-compose.yaml` -- production-style compose with bundled MySQL:

- Service: `semantic-doc-segmenter`
- Service: `mysql` (image: `mysql:latest`)
- Service: `test-semantic-doc-segmenter` (runs pytest)
`docker-compose.local.yml` -- local dev overlay (generated by `scripts/local-dev/generate-env.sh`):

- Disables the bundled MySQL, replacing it with a no-op busybox container
- Adds an entrypoint that runs `alembic upgrade head` before uvicorn
- Joins the `platform-local` external Docker network for the shared platform MySQL
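An overlay implementing those three points could look roughly like this. The exact keys are assumptions based on the description above, not the generated file:

```yaml
services:
  mysql:
    # Replace the bundled MySQL with a no-op container
    image: busybox
    command: ["sleep", "infinity"]
  semantic-doc-segmenter:
    # Run migrations before starting the server
    entrypoint: ["sh", "-c",
      "poetry run alembic upgrade head && poetry run uvicorn app.main:app --host 0.0.0.0 --port 8081"]
    networks:
      - platform-local

networks:
  platform-local:
    external: true
```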
Local dev start command:

```shell
docker-compose --env-file .env -f docker-compose.yaml -f docker-compose.local.yml up --build
```
## Health check

Defined in `docker-compose.yaml`:

```yaml
healthcheck:
  test: ["CMD", "curl", "-f", "http://semantic-doc-segmenter:8081"]
  interval: 15s
  timeout: 10s
  retries: 3
  start_period: 1s
```

The health check hits `GET /`, which returns `{"vendorName": ..., "latestVersion": ...}` with no auth required.

`GET /debug/health` (auth required) returns structured job queue counts.
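For manual checks against a locally running instance, the same endpoints can be exercised with curl (the hostname, port mapping, and token value are placeholders):

```shell
# Unauthenticated liveness check (the same request the compose healthcheck makes)
curl -f http://localhost:8081/

# Authenticated job queue counts
curl -H "Authorization: Bearer <token>" http://localhost:8081/debug/health
```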
## Environment variables

### Required

| Var | Example | Purpose |
|---|---|---|
| MYSQL_URL | mysql+pymysql://segmenter:pw@host/pdfdocuments?charset=utf8mb4 | Database connection |
| JWT_SECRET | &lt;random 64-char string&gt; | JWT signing secret |
| OPENAI_API_KEY | sk-... | OpenAI API key (required if LLM_BACKEND=openai) |
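A fail-fast startup check along these lines catches missing required variables early. The function name and error text are illustrative, not the app's actual code:

```python
import os

# Always-required variables; OPENAI_API_KEY is conditional on the backend
REQUIRED_VARS = ["MYSQL_URL", "JWT_SECRET"]

def validate_required_env() -> None:
    """Raise if a required variable is unset or empty."""
    missing = [v for v in REQUIRED_VARS if not os.environ.get(v)]
    # OPENAI_API_KEY is only required for the OpenAI backend
    if os.environ.get("LLM_BACKEND", "openai") == "openai" and not os.environ.get("OPENAI_API_KEY"):
        missing.append("OPENAI_API_KEY")
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
```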
### LLM backend

| Var | Default | Purpose |
|---|---|---|
| LLM_BACKEND | openai | openai or azure |
| OPENAI_LLM_MODEL_FOR_TAGGER | gpt-5.4 | Model for article tagging |
| OPENAI_LLM_MODEL_FOR_REFORMAT | gpt-5.4 | Model for markdown reformatting |
| OPENAI_LLM_MODEL_FOR_HEADINGS | gpt-5.4 | Model for heading insertion |
| OPENAI_LLM_MODEL_FOR_LANGUAGE_DETECTION | gpt-5.4-nano | Model for language detection |
| AZURE_LLM_API_KEY | | Azure OpenAI API key |
| AZURE_LLM_API_VERSION | 2023-07-01-preview | Azure API version |
| AZURE_LLM_ENDPOINT | https://helvia-francecentral-0.openai.azure.com | Azure endpoint |
| AZURE_LLM_DEPLOYMENT_NAME | helvia-francecentral-0 | Azure deployment name |
| AZURE_LLM_MODEL_FOR_TAGGER | gpt-5.4 | Model for article tagging |
| AZURE_LLM_MODEL_FOR_REFORMAT | gpt-5.4 | Model for markdown reformatting |
| AZURE_LLM_MODEL_FOR_HEADINGS | gpt-5.4 | Model for heading insertion |
| AZURE_LLM_MODEL_FOR_LANGUAGE_DETECTION | gpt-5.4-nano | Model for language detection |
### Gemini

| Var | Default | Purpose |
|---|---|---|
| GEMINI_API_KEY | | Google Gemini API key |
| GEMINI_MODEL | gemini-3.1-pro-preview | Model name |
| GEMINI_PROMPT_TEXT_ONLY_PATH | app/prompts/gemini_text_only.txt | Path to text-only prompt |
| GEMINI_PROMPT_TEXT_AND_IMAGES_PATH | app/prompts/gemini_text_and_images.txt | Path to image prompt |
| GEMINI_LOG_PROMPT | false | Log the prompt at INFO level |
| GEMINI_DEBUG_SAVE_OVERLAY_PDF | false | Save image-overlay PDFs for debugging |
| GEMINI_DEBUG_OVERLAY_DIR | /tmp | Directory for overlay PDFs |
### Language detection

| Var | Default | Purpose |
|---|---|---|
| USE_GOOGLE_LANGUAGE_DETECTION | false | Use Google Cloud Translate for language detection |
| GOOGLE_APPLICATION_CREDENTIALS | | Path to GCP service account JSON |
| GOOGLE_CLOUD_TRANSLATE_PROJECT_ID | hbf-language-translation-dev | GCP project for translation API |
| LANGUAGE_DETECT_GOOGLE_TIMEOUT_SECONDS | 10 | Timeout for Google detection |
| LANGUAGE_DETECT_LLM_TIMEOUT_SECONDS | 30 | Timeout for LLM detection |
| LANGUAGE_DETECTION_PROMPT_PATH | app/prompts/language_detection.txt | Path to detection prompt |
| LANGUAGE_DETECT_LOOKAHEAD | 500 | Chars to inspect for detection |
| SYSTEM_DEFAULT_LANGUAGE | en | Fallback language code |
| SYSTEM_NATIVE_LANGUAGES | en | Comma-separated list of native language codes |
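The lookahead-and-fallback behavior these variables describe can be sketched as follows. The `detector` callable is a stand-in for the Google or LLM backend; the function itself is illustrative, not the app's code:

```python
from typing import Callable, Optional

LOOKAHEAD = 500          # LANGUAGE_DETECT_LOOKAHEAD
DEFAULT_LANGUAGE = "en"  # SYSTEM_DEFAULT_LANGUAGE

def detect_language(text: str,
                    detector: Optional[Callable[[str], Optional[str]]] = None) -> str:
    """Detect the language of the first LOOKAHEAD chars, falling back to the default."""
    sample = text[:LOOKAHEAD]
    if detector is not None:
        try:
            code = detector(sample)
            if code:
                return code
        except Exception:
            pass  # timeouts / backend errors fall through to the default
    return DEFAULT_LANGUAGE
```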
### Segmenter

| Var | Default | Purpose |
|---|---|---|
| SEGMENTER_MAX_ARTICLE_SIZE | 2000 | Default max segment size |
| SEGMENTER_SIZE_UNITS | non_ws_chars | words, chars, non_ws_chars, or tokens |
| MAX_FILE_SIZE_LIMIT | 67108864 (64 MB) | Max upload size in bytes |
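How a segment is measured against SEGMENTER_MAX_ARTICLE_SIZE depends on SEGMENTER_SIZE_UNITS. A minimal sketch of the three simple units (the tokens unit is omitted since it needs the model's tokenizer; this is not the app's actual code):

```python
def segment_size(text: str, units: str = "non_ws_chars") -> int:
    """Measure a segment in the configured SEGMENTER_SIZE_UNITS."""
    if units == "words":
        return len(text.split())
    if units == "chars":
        return len(text)
    if units == "non_ws_chars":
        return sum(1 for c in text if not c.isspace())
    raise ValueError(f"unsupported unit: {units}")
```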
### PDF / OCR

| Var | Default | Purpose |
|---|---|---|
| PDF_PARSING_BACKEND | pymupdf | pymupdf, docling, or gemini-3 |
| OCR_BACKEND | tesseract | tesseract or easyocr |
| DOCX_TO_MD_CONVERTER | mammoth | mammoth or pypandoc |
### Image handling

| Var | Default | Purpose |
|---|---|---|
| ENABLE_IMAGE_HANDLING | false | Extract and store images from documents |
| IMAGE_HANDLING_MODE | s3 | s3 or tmp_file |
| AWS_S3_BUCKET_NAME | doc-segmenter-images | S3 bucket |
| AWS_REGION_NAME | us-east-1 | AWS region |
| AWS_ACCESS_KEY_ID | | AWS credentials |
| AWS_SECRET_ACCESS_KEY | | AWS credentials |
### Database pool

| Var | Default | Purpose |
|---|---|---|
| SQLALCHEMY_POOL_SIZE | 5 | Connection pool size |
| SQLALCHEMY_OVERFLOW | 10 | Max connections beyond the pool size |
| SQLALCHEMY_POOL_RECYCLE | 3600 | Seconds before a pooled connection is recycled |
| MYSQL_USE_SSL | false | Enable SSL to MySQL (no cert validation) |
### Worker concurrency / timeouts

| Var | Default | Purpose |
|---|---|---|
| BACKGROUND_TASK_LIMIT | 2 | Max concurrent job subprocesses |
| JOB_TIMEOUT_SECONDS | 600 | Seconds before a running job is force-cancelled |
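A sketch of how these two limits could interact, using asyncio primitives. The actual service runs jobs as subprocesses; this only models the limiting logic, and all names here are illustrative:

```python
import asyncio

BACKGROUND_TASK_LIMIT = 2    # max concurrent jobs
JOB_TIMEOUT_SECONDS = 600    # force-cancel threshold

# Bound concurrency: at most BACKGROUND_TASK_LIMIT jobs run at once
_semaphore = asyncio.Semaphore(BACKGROUND_TASK_LIMIT)

async def run_job(job, timeout: float = JOB_TIMEOUT_SECONDS):
    """Run one job, bounded by the concurrency limit and the job timeout."""
    async with _semaphore:
        try:
            return await asyncio.wait_for(job(), timeout=timeout)
        except asyncio.TimeoutError:
            return "cancelled"
```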
### Logging / observability

| Var | Default | Purpose |
|---|---|---|
| LOG_LEVEL_APP | INFO | App logger level |
| LOG_LEVEL_ROOT | INFO | Root logger level |
| LOG_LEVEL_SQL | ERROR | SQLAlchemy logger level |
| ENABLE_APM | false | Enable Elastic APM middleware |
### Other

| Var | Default | Purpose |
|---|---|---|
| VERSION | 1.1.0 | Reported in GET / and OpenAPI |
| MIMETYPE_MAGIC_LOOKAHEAD | 5000 | Bytes inspected by python-magic for MIME detection |
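The lookahead idea is that only a bounded prefix of the upload is handed to the MIME detector. The real detector is python-magic; this stdlib-only sketch just illustrates signature matching over such a prefix (the signature table is a small illustrative subset):

```python
MAGIC_LOOKAHEAD = 5000  # MIMETYPE_MAGIC_LOOKAHEAD

# A few well-known file signatures (illustrative subset)
_SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",   # also the container format of .docx
    b"\x89PNG\r\n\x1a\n": "image/png",
}

def sniff_mimetype(data: bytes) -> str:
    """Guess a MIME type from the first MAGIC_LOOKAHEAD bytes."""
    head = data[:MAGIC_LOOKAHEAD]
    for sig, mime in _SIGNATURES.items():
        if head.startswith(sig):
            return mime
    return "application/octet-stream"
```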
## Alembic migrations

Migrations run automatically on startup (via the `app/main.py` lifespan), and also explicitly in the local dev entrypoint:

```shell
MYSQL_URL=... poetry run alembic upgrade head
```
## Local development (without Docker)

```shell
poetry install --no-root
PYTHONPATH=. poetry run uvicorn app.main:app --workers 1 --log-config app/config/logging_simple.ini
```

Requires a `.env` file with at minimum `MYSQL_URL`, `JWT_SECRET`, and `OPENAI_API_KEY`.
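A minimal `.env` for local runs, reusing the example values from the tables above (all values are placeholders):

```shell
# .env -- minimal local configuration
MYSQL_URL=mysql+pymysql://segmenter:pw@localhost/pdfdocuments?charset=utf8mb4
JWT_SECRET=<random 64-char string>
OPENAI_API_KEY=sk-...
```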
## Tests

```shell
# Integration tests (requires a running service and DB)
poetry run pytest --envfile .env.test -vvv

# HURL tests
./bin/run_hurl_tests.sh
```