Data Model: semantic-doc-segmenter

Database

MySQL 8, InnoDB engine, utf8mb4_unicode_ci collation throughout. Database name: pdfdocuments (conventional; set via MYSQL_URL).

SQLAlchemy models: app/models/models.py
Alembic config: alembic.ini, migrations in alembic/versions/

Migration chain

47cd9e9e0aa6  Base -- creates document, job, documentsegment tables
|
3a0135f92ea8 Add process_images field to job
|
97c5f3f7f0ac Add options column to job
|
ac627aa39080 Add processing_stage and correlation_id to job (HEAD)
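Assuming a standard Alembic setup pointed at the alembic.ini above, the chain can be inspected and applied with the stock Alembic commands (a sketch of the usual workflow, not project-specific tooling):

```shell
# Show the full revision chain (base -> head)
alembic history --verbose

# Apply all pending migrations up to ac627aa39080
alembic upgrade head

# Roll back one revision if needed
alembic downgrade -1
```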

Tables

document

Stores the raw uploaded file body alongside metadata.

| Column   | Type                | Notes                                   |
|----------|---------------------|-----------------------------------------|
| id       | VARCHAR(255) PK     | UUID v4, generated on insert            |
| status   | ENUM                | NEW, PROCESSED, CANCELED                |
| created  | DATETIME            | UTC timestamp on insert                 |
| updated  | DATETIME            | UTC timestamp, updated on change        |
| body     | MEDIUMBLOB (16 MB)  | Raw file bytes                          |
| filename | VARCHAR(255)        | Sanitized via werkzeug secure_filename  |
| filesize | INT                 | Bytes                                   |
| mimetype | VARCHAR(255)        | e.g. application/pdf                    |
| doctype  | VARCHAR(10)         | pdf, docx, pptx, html, md, txt          |

Index: ix_document_id on id.
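A minimal SQLAlchemy sketch of the table above. This is illustrative, not the actual app/models/models.py; LargeBinary stands in for the MySQL-specific MEDIUMBLOB so the snippet runs on any backend:

```python
import enum
import uuid
from datetime import datetime, timezone

from sqlalchemy import Column, DateTime, Enum, Integer, LargeBinary, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class DocumentStatus(enum.Enum):
    NEW = "NEW"
    PROCESSED = "PROCESSED"
    CANCELED = "CANCELED"

class Document(Base):
    __tablename__ = "document"

    # UUID v4 primary key, generated on insert
    id = Column(String(255), primary_key=True, index=True,
                default=lambda: str(uuid.uuid4()))
    status = Column(Enum(DocumentStatus), default=DocumentStatus.NEW)
    created = Column(DateTime, default=lambda: datetime.now(timezone.utc))
    updated = Column(DateTime, default=lambda: datetime.now(timezone.utc),
                     onupdate=lambda: datetime.now(timezone.utc))
    body = Column(LargeBinary)        # MEDIUMBLOB in MySQL
    filename = Column(String(255))    # sanitized before insert
    filesize = Column(Integer)
    mimetype = Column(String(255))
    doctype = Column(String(10))
```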

job

One job per document submission. Tracks processing status and configuration.

| Column           | Type            | Notes                                                           |
|------------------|-----------------|-----------------------------------------------------------------|
| id               | VARCHAR(255) PK | UUID v4                                                         |
| ip               | VARCHAR(20)     | Client IP (X-Forwarded-For or direct)                           |
| documentid       | VARCHAR(255)    | FK -> document.id                                               |
| status           | ENUM            | NEW, PENDING, PROCESSING, COMPLETED, FAILED, CANCELED           |
| created          | DATETIME        | UTC                                                             |
| updated          | DATETIME        | UTC                                                             |
| completed        | DATETIME        | Set when status becomes COMPLETED                               |
| timeit           | INT             | Total processing time in seconds                                |
| callbackurl      | VARCHAR(2048)   | Optional webhook URL                                            |
| maxsize          | INT             | Max segment size in configured units                            |
| usetags          | JSON            | List of tag strings requested by caller                         |
| maxtags          | INT             | Max tags per segment (default 1)                                |
| errormessage     | TEXT            | Traceback on failure                                            |
| process_images   | BOOLEAN         | Whether to extract and upload images                            |
| options          | JSON            | pdf_parsing_backend, enable_ocr, ocr_backend, force_full_page_ocr |
| progress         | INT             | 0-100, updated during processing                                |
| correlation_id   | VARCHAR(255)    | UUID for end-to-end log tracing                                 |
| processing_stage | VARCHAR(50)     | Current ProcessingStage value                                   |

Indexes: ix_job_id, ix_job_documentid.
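The options JSON carries per-job parser configuration. Only the key names below come from the schema; the values are hypothetical illustrations, since the set of valid backend identifiers is not documented here:

```python
import json

# Key names from the job.options column; values are hypothetical examples.
options = {
    "pdf_parsing_backend": "default",  # hypothetical backend name
    "enable_ocr": True,
    "ocr_backend": "default",          # hypothetical backend name
    "force_full_page_ocr": False,
}

# Serialized form as it would land in the JSON column
payload = json.dumps(options)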

documentsegment

Output segments produced by the pipeline.

| Column     | Type            | Notes                                          |
|------------|-----------------|------------------------------------------------|
| id         | VARCHAR(255) PK | UUID v4                                        |
| documentid | VARCHAR(255)    | FK -> document.id                              |
| status     | ENUM            | COMPLETED, FAILED                              |
| body       | TEXT            | Markdown content of the segment                |
| pagenr     | INT             | Source page number                             |
| group      | VARCHAR(255)    | Heading group from Markdown tree               |
| title      | VARCHAR(255)    | Extracted or LLM-generated title               |
| tags       | JSON            | LLM-assigned tags (list of strings)            |
| created    | DATETIME        | UTC                                            |
| updated    | DATETIME        | UTC                                            |
| timeit     | INT             | Per-segment timing (legacy, not actively used) |
| lang       | VARCHAR(10)     | ISO language code detected for the document    |
| ordinal    | INT             | 1-based position in document order             |

Indexes: ix_documentsegment_id, ix_documentsegment_documentid.
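Because ordinal records each segment's 1-based position, a caller can reassemble the document's markdown in order. A hypothetical helper (the function name and dict shape are illustrative, not part of the API):

```python
def assemble_markdown(segments):
    """Join segment bodies in document order, using the 1-based `ordinal`."""
    ordered = sorted(segments, key=lambda seg: seg["ordinal"])
    return "\n\n".join(seg["body"] for seg in ordered)
```

For example, segments fetched in arbitrary order come back as one markdown document:

```python
segments = [
    {"ordinal": 2, "body": "## Methods"},
    {"ordinal": 1, "body": "# Intro"},
]
assemble_markdown(segments)  # "# Intro\n\n## Methods"
```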

Enums (SQLAlchemy / Python)

```python
import enum

class JobStatus(enum.Enum):
    NEW = "NEW"
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"

class DocumentStatus(enum.Enum):
    NEW = "NEW"
    PROCESSED = "PROCESSED"
    CANCELED = "CANCELED"

class SegmentStatus(enum.Enum):
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
```

ProcessingStage (not persisted as enum, stored as VARCHAR in job.processing_stage):

UPLOADED -> CONVERTING -> PROCESSING_MARKDOWN -> SEGMENTING -> TAGGING -> FINALIZING -> COMPLETED
\-> FAILED
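A sketch of how the stage flow above could be modeled in Python. The member names are taken from the diagram; the actual class lives in the application code, and only the string value reaches job.processing_stage:

```python
import enum

class ProcessingStage(enum.Enum):
    UPLOADED = "UPLOADED"
    CONVERTING = "CONVERTING"
    PROCESSING_MARKDOWN = "PROCESSING_MARKDOWN"
    SEGMENTING = "SEGMENTING"
    TAGGING = "TAGGING"
    FINALIZING = "FINALIZING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"

# Terminal stages: once reached, the job no longer advances.
TERMINAL_STAGES = {ProcessingStage.COMPLETED, ProcessingStage.FAILED}

def stage_for_db(stage: ProcessingStage) -> str:
    # job.processing_stage is a VARCHAR(50), so only the value is persisted.
    return stage.value
```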

Connection pool

Configured in app/config/database.py via app/config/config.py:

| Env var                 | Default | Purpose                               |
|-------------------------|---------|---------------------------------------|
| SQLALCHEMY_POOL_SIZE    | 5       | Pool size                             |
| SQLALCHEMY_OVERFLOW     | 10      | Max overflow connections              |
| SQLALCHEMY_POOL_RECYCLE | 3600    | Seconds before recycling connections  |
| MYSQL_USE_SSL           | false   | Enable SSL (no certificate verification) |

pool_pre_ping=True and pool_reset_on_return="commit" are set to guard against stale connections.
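A sketch of how these settings translate into create_engine arguments. The env var names and defaults come from the table above; the SQLite URL and explicit QueuePool are stand-ins so the snippet runs anywhere, whereas the real wiring in app/config/database.py uses MYSQL_URL:

```python
import os

from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

def engine_kwargs() -> dict:
    # Defaults mirror the configuration table above.
    return {
        "pool_size": int(os.environ.get("SQLALCHEMY_POOL_SIZE", "5")),
        "max_overflow": int(os.environ.get("SQLALCHEMY_OVERFLOW", "10")),
        "pool_recycle": int(os.environ.get("SQLALCHEMY_POOL_RECYCLE", "3600")),
        "pool_pre_ping": True,             # probe connections before checkout
        "pool_reset_on_return": "commit",  # reset state when returned to the pool
    }

# In the real service the URL comes from MYSQL_URL; SQLite is used here
# only so the sketch is self-contained.
engine = create_engine("sqlite://", poolclass=QueuePool, **engine_kwargs())
```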