
AI_BRIEF: semantic-doc-segmenter

What it does

semantic-doc-segmenter is a Python/FastAPI service that accepts documents (PDF, DOCX, PPTX, HTML, Markdown, TXT), converts them to structured Markdown, and splits them into semantically coherent segments (articles) sized to a configurable token/character limit. Segments are stored in MySQL and optionally delivered to a caller-supplied callback URL.

It is the document ingestion and pre-processing layer of the Helvia platform. Downstream consumers (e.g. RAG pipelines) read the resulting DocumentSegment rows to build knowledge bases.
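The splitting step described above can be illustrated with a minimal sketch. It is not the service's actual algorithm: it assumes a character limit only (no token counting), breaks at Markdown headings, and falls back to paragraph boundaries when a heading block is too large. The function name `split_markdown` and the `max_chars` parameter are hypothetical.

```python
import re


def split_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown into heading-aligned segments, aiming for max_chars each.

    Illustrative only: a single paragraph longer than max_chars is kept whole.
    """
    # Break before every ATX heading so each block starts a new topic.
    raw_blocks = re.split(r"(?m)^(?=#{1,6} )", markdown)
    blocks = [b.strip() for b in raw_blocks if b.strip()]

    segments: list[str] = []
    for block in blocks:
        if len(block) <= max_chars:
            pieces = [block]
        else:
            # Oversized heading block: fall back to paragraph-level pieces.
            pieces = [p.strip() for p in block.split("\n\n") if p.strip()]

        current = ""
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if not current or len(candidate) <= max_chars:
                # Keep packing paragraphs into the current segment.
                current = candidate
            else:
                # Limit reached: close the segment and start a new one.
                segments.append(current)
                current = piece
        if current:
            segments.append(current)
    return segments
```

Segments produced this way never cross a heading boundary, which is one simple way to keep each segment on a single topic.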

Role in the platform

Caller (e.g. hbf-core / RAG pipeline)
|
| POST /jobs (multipart file upload)
v
semantic-doc-segmenter (this service)
|-- stores Document + Job in MySQL
|-- background worker converts to Markdown
|-- splits Markdown into segments
|-- optionally tags segments via LLM
|-- stores DocumentSegment rows
|-- POSTs results to callbackurl (if provided)
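As a rough illustration of the first step in the flow above, the sketch below builds a multipart upload request for POST /jobs using only the standard library. The form field names `file` and `callbackurl`, the base URL, and the helper name `build_job_request` are assumptions made for this example; the service's actual request schema may differ.

```python
import uuid
import urllib.request
from typing import Optional


def build_job_request(
    base_url: str,
    filename: str,
    content: bytes,
    callback_url: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST to {base_url}/jobs."""
    boundary = uuid.uuid4().hex
    parts: list[bytes] = []

    if callback_url:
        # Assumed field name, taken from the "callbackurl" label in the diagram.
        header = (
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="callbackurl"\r\n\r\n'
        )
        parts.append(header.encode() + callback_url.encode() + b"\r\n")

    # Assumed field name "file" for the uploaded document.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(file_header.encode() + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())

    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/jobs",
        data=b"".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the job; when a callback URL is supplied, the service would later POST the results there.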

Tech stack

| Component | Technology |
|---|---|
| Runtime | Python 3.11, FastAPI, Uvicorn |
| Database | MySQL 8 via SQLAlchemy 2 + PyMySQL |
| Migrations | Alembic |
| LLM backends | OpenAI (AsyncOpenAI), Azure OpenAI (AsyncAzureOpenAI), Google Gemini (google-genai) |
| PDF parsing | PyMuPDF (pymupdf4llm), Docling, or Gemini |
| OCR | Tesseract (tesserocr), EasyOCR |
| DOCX conversion | mammoth or pypandoc |
| PPTX conversion | python-pptx |
| HTML conversion | markdownify |
| Image storage | AWS S3 (boto3) or local temp file |
| Language detection | Google Cloud Translate API or LLM |
| Observability | Elastic APM (elastic-apm), ECS-format structured logging |
| Packaging | Poetry |

Key entry points

| Path | Purpose |
|---|---|
| app/main.py | FastAPI app creation, lifespan (migrations + job poller startup) |
| app/routers/jobs.py | POST /jobs, GET /jobs, GET /jobs/{id}, PATCH /jobs/{id}, DELETE /jobs/{id} |
| app/routers/documents.py | GET /documents, GET /documents/{id}, GET /documents/{id}/segments |
| app/routers/debug.py | GET /debug/jobs/{id}, GET /debug/health, GET /debug/health_with_details |
| app/services/job_polling_service.py | Async polling loop that picks up NEW jobs every second |
| app/services/doc_processing_service.py | Core document-to-segment pipeline (process_document) |
| app/services/job_execution_manager.py | ProcessPoolExecutor-based job runner (Singleton) |
| app/config/config.py | All environment variable bindings with defaults |
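
The polling behaviour listed for app/services/job_polling_service.py can be sketched as a small asyncio loop. This is a hedged illustration, not the service's code: `poll_new_jobs`, `fetch_new_jobs`, and `submit` are hypothetical names, and the database query and process pool are replaced by injected callables.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


async def poll_new_jobs(
    fetch_new_jobs: Callable[[], Awaitable[List[int]]],
    submit: Callable[[int], object],
    interval: float = 1.0,
    stop: Optional[asyncio.Event] = None,
) -> None:
    """Hand every job returned by fetch_new_jobs to submit, once per interval."""
    stop = stop or asyncio.Event()
    while not stop.is_set():
        # Hypothetical query: IDs of jobs whose status is still NEW.
        for job_id in await fetch_new_jobs():
            # In a real runner this might wrap ProcessPoolExecutor.submit.
            submit(job_id)
        try:
            # Sleep for one interval, but wake immediately if stop is set.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
```

Waiting on the stop event instead of calling `asyncio.sleep` lets the loop shut down promptly, for example when the application's lifespan handler exits.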