AI_BRIEF: semantic-doc-segmenter
What it does
semantic-doc-segmenter is a Python/FastAPI service that accepts documents (PDF, DOCX, PPTX, HTML, Markdown, TXT), converts them to structured Markdown, and splits them into semantically coherent segments (articles) sized to a configurable token/character limit. Segments are stored in MySQL and optionally delivered to a caller-supplied callback URL.
It is the document ingestion and pre-processing layer of the Helvia platform. Downstream consumers (e.g. RAG pipelines) read the resulting DocumentSegment rows to build knowledge bases.
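The size-limited splitting described above can be illustrated with a minimal sketch. It is not the service's actual algorithm: the function name `split_markdown`, the heading-based section boundaries, and the use of a character limit only (no token counting) are all assumptions made for illustration.

```python
import re


def split_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Hypothetical splitter: one segment per heading section, size-capped."""
    # Split right before every Markdown heading so each heading stays with its body.
    sections = [s.strip() for s in re.split(r"(?m)^(?=#{1,6} )", markdown) if s.strip()]
    segments: list[str] = []
    for section in sections:
        if len(section) <= max_chars:
            segments.append(section)
            continue
        # Oversized section: pack its paragraphs greedily up to the limit.
        current = ""
        for paragraph in (p.strip() for p in section.split("\n\n")):
            if not paragraph:
                continue
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if len(candidate) <= max_chars or not current:
                # Note: a single paragraph longer than max_chars is kept whole.
                current = candidate
            else:
                segments.append(current)
                current = paragraph
        if current:
            segments.append(current)
    return segments
```

A section that fits under the limit stays intact, which keeps a heading and its body in the same segment; only oversized sections are broken at paragraph boundaries.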
Role in the platform
Caller (e.g. hbf-core / RAG pipeline)
|
| POST /jobs (multipart file upload)
v
semantic-doc-segmenter (this service)
|-- stores Document + Job in MySQL
|-- background worker converts to Markdown
|-- splits Markdown into segments
|-- optionally tags segments via LLM
|-- stores DocumentSegment rows
|-- POSTs results to the callback URL (if provided)
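From the caller's side, the first step of this flow might look like the sketch below, which builds a multipart `POST /jobs` request using only the standard library. The form field names `file` and `callback_url`, the helper name `build_job_request`, and the base URL are assumptions for illustration; check `app/routers/jobs.py` for the real request schema.

```python
import urllib.request
import uuid
from typing import Optional


def build_job_request(base_url: str, filename: str, content: bytes,
                      callback_url: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST to /jobs."""
    boundary = uuid.uuid4().hex
    body = bytearray()
    if callback_url is not None:
        # Hypothetical field name; the real service may call it something else.
        body += (
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="callback_url"\r\n\r\n'
            f"{callback_url}\r\n"
        ).encode()
    # File part. The filename is assumed to contain no double quotes.
    body += (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    body += content + f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        f"{base_url}/jobs",
        data=bytes(body),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the job; the response format is not specified in this brief.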
Tech stack
| Component | Technology |
|---|---|
| Runtime | Python 3.11, FastAPI, Uvicorn |
| Database | MySQL 8 via SQLAlchemy 2 + PyMySQL |
| Migrations | Alembic |
| LLM backends | OpenAI (AsyncOpenAI), Azure OpenAI (AsyncAzureOpenAI), Google Gemini (google-genai) |
| PDF parsing | PyMuPDF (pymupdf4llm), Docling, or Gemini |
| OCR | Tesseract (tesserocr), EasyOCR |
| DOCX conversion | mammoth or pypandoc |
| PPTX conversion | python-pptx |
| HTML conversion | markdownify |
| Image storage | AWS S3 (boto3) or local temp file |
| Language detection | Google Cloud Translate API or LLM |
| Observability | Elastic APM (elastic-apm), ECS-format structured logging |
| Packaging | Poetry |
Key entry points
| Path | Purpose |
|---|---|
| app/main.py | FastAPI app creation, lifespan (migrations + job poller startup) |
| app/routers/jobs.py | POST /jobs, GET /jobs, GET /jobs/{id}, PATCH /jobs/{id}, DELETE /jobs/{id} |
| app/routers/documents.py | GET /documents, GET /documents/{id}, GET /documents/{id}/segments |
| app/routers/debug.py | GET /debug/jobs/{id}, GET /debug/health, GET /debug/health_with_details |
| app/services/job_polling_service.py | Async polling loop that picks up NEW jobs every second |
| app/services/doc_processing_service.py | Core document-to-segment pipeline (process_document) |
| app/services/job_execution_manager.py | ProcessPoolExecutor-based job runner (Singleton) |
| app/config/config.py | All environment variable bindings with defaults |
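The polling behaviour listed for `job_polling_service.py` can be pictured with a small asyncio sketch. This is a simplified illustration, not the service's code: it uses an in-memory list instead of MySQL, runs the handler inline instead of in a ProcessPoolExecutor, and the `Job` class, the `poll_jobs` function, and every status other than `NEW` are hypothetical names.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class Job:
    id: int
    status: str = "NEW"


async def poll_jobs(jobs: list[Job],
                    handle: Callable[[Job], Awaitable[None]],
                    interval: float = 1.0,
                    max_ticks: Optional[int] = None) -> None:
    """Every `interval` seconds, pick up jobs whose status is NEW."""
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        for job in [j for j in jobs if j.status == "NEW"]:
            # Claim the job before processing so a later tick cannot pick it again.
            job.status = "RUNNING"
            try:
                await handle(job)
                job.status = "DONE"
            except Exception:
                job.status = "FAILED"
        ticks += 1
        await asyncio.sleep(interval)
```

The `max_ticks` parameter exists only so the loop can be stopped in tests; a long-running service would leave it as `None` and cancel the task on shutdown.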