Data Model: semantic-doc-segmenter

Database

MySQL 8, InnoDB engine, utf8mb4_unicode_ci collation throughout. Database name: pdfdocuments (conventional; set via MYSQL_URL).

SQLAlchemy models: app/models/models.py
Alembic config: alembic.ini, migrations in alembic/versions/

Migration chain

47cd9e9e0aa6  Base -- creates document, job, documentsegment tables
|
3a0135f92ea8 Add process_images field to job
|
97c5f3f7f0ac Add options column to job
|
ac627aa39080 Add processing_stage and correlation_id to job (HEAD)
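Assuming a standard Alembic setup pointed at the alembic.ini above, the chain can be inspected and applied with the stock Alembic commands (a sketch of the usual workflow, not project-specific tooling):

```shell
# Show the full revision chain (base -> head)
alembic history --verbose

# Apply all pending migrations up to ac627aa39080
alembic upgrade head

# Roll back one revision if needed
alembic downgrade -1
```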

Tables

document

Stores the raw uploaded file body alongside metadata.

| Column   | Type                | Notes                                   |
|----------|---------------------|-----------------------------------------|
| id       | VARCHAR(255) PK     | UUID v4, generated on insert            |
| status   | ENUM                | NEW, PROCESSED, CANCELED                |
| created  | DATETIME            | UTC timestamp on insert                 |
| updated  | DATETIME            | UTC timestamp, updated on change        |
| body     | MEDIUMBLOB (16 MB)  | Raw file bytes                          |
| filename | VARCHAR(255)        | Sanitized via werkzeug secure_filename  |
| filesize | INT                 | Bytes                                   |
| mimetype | VARCHAR(255)        | e.g. application/pdf                    |
| doctype  | VARCHAR(10)         | pdf, docx, pptx, html, md, txt          |

Index: ix_document_id on id.
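A minimal SQLAlchemy sketch of the table above. This is illustrative, not the actual app/models/models.py; LargeBinary stands in for the MySQL-specific MEDIUMBLOB so the snippet runs on any backend:

```python
import enum
import uuid
from datetime import datetime, timezone

from sqlalchemy import Column, DateTime, Enum, Integer, LargeBinary, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class DocumentStatus(enum.Enum):
    NEW = "NEW"
    PROCESSED = "PROCESSED"
    CANCELED = "CANCELED"

class Document(Base):
    __tablename__ = "document"

    # UUID v4 primary key, generated on insert
    id = Column(String(255), primary_key=True, index=True,
                default=lambda: str(uuid.uuid4()))
    status = Column(Enum(DocumentStatus), default=DocumentStatus.NEW)
    created = Column(DateTime, default=lambda: datetime.now(timezone.utc))
    updated = Column(DateTime, default=lambda: datetime.now(timezone.utc),
                     onupdate=lambda: datetime.now(timezone.utc))
    body = Column(LargeBinary)        # MEDIUMBLOB in MySQL
    filename = Column(String(255))    # sanitized before insert
    filesize = Column(Integer)
    mimetype = Column(String(255))
    doctype = Column(String(10))
```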

job

One job per document submission. Tracks processing status and configuration.

| Column           | Type            | Notes                                                           |
|------------------|-----------------|-----------------------------------------------------------------|
| id               | VARCHAR(255) PK | UUID v4                                                         |
| ip               | VARCHAR(20)     | Client IP (X-Forwarded-For or direct)                           |
| documentid       | VARCHAR(255)    | FK -> document.id                                               |
| status           | ENUM            | NEW, PENDING, PROCESSING, COMPLETED, FAILED, CANCELED           |
| created          | DATETIME        | UTC                                                             |
| updated          | DATETIME        | UTC                                                             |
| completed        | DATETIME        | Set when status becomes COMPLETED                               |
| timeit           | INT             | Total processing time in seconds                                |
| callbackurl      | VARCHAR(2048)   | Optional webhook URL                                            |
| maxsize          | INT             | Max segment size in configured units                            |
| usetags          | JSON            | List of tag strings requested by caller                         |
| maxtags          | INT             | Max tags per segment (default 1)                                |
| errormessage     | TEXT            | Traceback on failure                                            |
| process_images   | BOOLEAN         | Whether to extract and upload images                            |
| options          | JSON            | pdf_parsing_backend, enable_ocr, ocr_backend, force_full_page_ocr |
| progress         | INT             | 0-100, updated during processing                                |
| correlation_id   | VARCHAR(255)    | UUID for end-to-end log tracing                                 |
| processing_stage | VARCHAR(50)     | Current ProcessingStage value                                   |

Indexes: ix_job_id, ix_job_documentid.
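The options JSON carries per-job parser configuration. Only the key names below come from the schema; the values are hypothetical illustrations, since the set of valid backend identifiers is not documented here:

```python
import json

# Key names from the job.options column; values are hypothetical examples.
options = {
    "pdf_parsing_backend": "default",  # hypothetical backend name
    "enable_ocr": True,
    "ocr_backend": "default",          # hypothetical backend name
    "force_full_page_ocr": False,
}

# Serialized form as it would land in the JSON column
payload = json.dumps(options)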

documentsegment

Output segments produced by the pipeline.

| Column     | Type            | Notes                                          |
|------------|-----------------|------------------------------------------------|
| id         | VARCHAR(255) PK | UUID v4                                        |
| documentid | VARCHAR(255)    | FK -> document.id                              |
| status     | ENUM            | COMPLETED, FAILED                              |
| body       | TEXT            | Markdown content of the segment                |
| pagenr     | INT             | Source page number                             |
| group      | VARCHAR(255)    | Heading group from Markdown tree               |
| title      | VARCHAR(255)    | Extracted or LLM-generated title               |
| tags       | JSON            | LLM-assigned tags (list of strings)            |
| created    | DATETIME        | UTC                                            |
| updated    | DATETIME        | UTC                                            |
| timeit     | INT             | Per-segment timing (legacy, not actively used) |
| lang       | VARCHAR(10)     | ISO language code detected for the document    |
| ordinal    | INT             | 1-based position in document order             |

Indexes: ix_documentsegment_id, ix_documentsegment_documentid.
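Because ordinal records each segment's 1-based position, a caller can reassemble the document's markdown in order. A hypothetical helper (the function name and dict shape are illustrative, not part of the API):

```python
def assemble_markdown(segments):
    """Join segment bodies in document order, using the 1-based `ordinal`."""
    ordered = sorted(segments, key=lambda seg: seg["ordinal"])
    return "\n\n".join(seg["body"] for seg in ordered)
```

For example, segments fetched in arbitrary order come back as one markdown document:

```python
segments = [
    {"ordinal": 2, "body": "## Methods"},
    {"ordinal": 1, "body": "# Intro"},
]
assemble_markdown(segments)  # "# Intro\n\n## Methods"
```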

Enums (SQLAlchemy / Python)

```python
import enum

class JobStatus(enum.Enum):
    NEW = "NEW"
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"

class DocumentStatus(enum.Enum):
    NEW = "NEW"
    PROCESSED = "PROCESSED"
    CANCELED = "CANCELED"

class SegmentStatus(enum.Enum):
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
```

ProcessingStage (not persisted as enum, stored as VARCHAR in job.processing_stage):

UPLOADED -> CONVERTING -> PROCESSING_MARKDOWN -> SEGMENTING -> TAGGING -> FINALIZING -> COMPLETED
\-> FAILED
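A sketch of how the stage flow above could be modeled in Python. The member names are taken from the diagram; the actual class lives in the application code, and only the string value reaches job.processing_stage:

```python
import enum

class ProcessingStage(enum.Enum):
    UPLOADED = "UPLOADED"
    CONVERTING = "CONVERTING"
    PROCESSING_MARKDOWN = "PROCESSING_MARKDOWN"
    SEGMENTING = "SEGMENTING"
    TAGGING = "TAGGING"
    FINALIZING = "FINALIZING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"

# Terminal stages: once reached, the job no longer advances.
TERMINAL_STAGES = {ProcessingStage.COMPLETED, ProcessingStage.FAILED}

def stage_for_db(stage: ProcessingStage) -> str:
    # job.processing_stage is a VARCHAR(50), so only the value is persisted.
    return stage.value
```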

Connection pool

Configured in app/config/database.py via app/config/config.py:

| Env var                 | Default | Purpose                               |
|-------------------------|---------|---------------------------------------|
| SQLALCHEMY_POOL_SIZE    | 5       | Pool size                             |
| SQLALCHEMY_OVERFLOW     | 10      | Max overflow connections              |
| SQLALCHEMY_POOL_RECYCLE | 3600    | Seconds before recycling connections  |
| MYSQL_USE_SSL           | false   | Enable SSL (no certificate verification) |

pool_pre_ping=True and pool_reset_on_return="commit" are set to guard against stale connections.
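A sketch of how these settings translate into create_engine arguments. The env var names and defaults come from the table above; the SQLite URL and explicit QueuePool are stand-ins so the snippet runs anywhere, whereas the real wiring in app/config/database.py uses MYSQL_URL:

```python
import os

from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

def engine_kwargs() -> dict:
    # Defaults mirror the configuration table above.
    return {
        "pool_size": int(os.environ.get("SQLALCHEMY_POOL_SIZE", "5")),
        "max_overflow": int(os.environ.get("SQLALCHEMY_OVERFLOW", "10")),
        "pool_recycle": int(os.environ.get("SQLALCHEMY_POOL_RECYCLE", "3600")),
        "pool_pre_ping": True,             # probe connections before checkout
        "pool_reset_on_return": "commit",  # reset state when returned to the pool
    }

# In the real service the URL comes from MYSQL_URL; SQLite is used here
# only so the sketch is self-contained.
engine = create_engine("sqlite://", poolclass=QueuePool, **engine_kwargs())
```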