Data Model: semantic-doc-segmenter
Database
MySQL 8, InnoDB engine, utf8mb4_unicode_ci collation throughout.
Database name: pdfdocuments (conventional; set via MYSQL_URL).
SQLAlchemy models: app/models/models.py
Alembic config: alembic.ini, migrations in alembic/versions/
Migration chain
```
47cd9e9e0aa6  Base -- creates document, job, documentsegment tables
  |
3a0135f92ea8  Add process_images field to job
  |
97c5f3f7f0ac  Add options column to job
  |
ac627aa39080  Add processing_stage and correlation_id to job (HEAD)
```
Tables
document
Stores the raw uploaded file body alongside metadata.
| Column | Type | Notes |
|---|---|---|
| id | VARCHAR(255) PK | UUID v4, generated on insert |
| status | ENUM | NEW, PROCESSED, CANCELED |
| created | DATETIME | UTC timestamp on insert |
| updated | DATETIME | UTC timestamp, updated on change |
| body | MEDIUMBLOB (16 MB) | Raw file bytes |
| filename | VARCHAR(255) | Sanitized via werkzeug secure_filename |
| filesize | INT | Bytes |
| mimetype | VARCHAR(255) | e.g. application/pdf |
| doctype | VARCHAR(10) | pdf, docx, pptx, html, md, txt |
Index: ix_document_id on id.
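The columns above map onto a SQLAlchemy model along these lines — a minimal sketch, not a copy of `app/models/models.py` (the actual base class, defaults, and index declarations there may differ):

```python
import enum
import uuid
from datetime import datetime, timezone

from sqlalchemy import Column, DateTime, Enum, Integer, String
from sqlalchemy.dialects.mysql import MEDIUMBLOB
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class DocumentStatus(enum.Enum):
    NEW = "NEW"
    PROCESSED = "PROCESSED"
    CANCELED = "CANCELED"


class Document(Base):
    __tablename__ = "document"

    # UUID v4 generated on insert, stored as a string
    id = Column(String(255), primary_key=True, index=True,
                default=lambda: str(uuid.uuid4()))
    status = Column(Enum(DocumentStatus), default=DocumentStatus.NEW)
    created = Column(DateTime, default=lambda: datetime.now(timezone.utc))
    updated = Column(DateTime, default=lambda: datetime.now(timezone.utc),
                     onupdate=lambda: datetime.now(timezone.utc))
    body = Column(MEDIUMBLOB)          # raw file bytes, up to 16 MB
    filename = Column(String(255))     # sanitized upstream via secure_filename
    filesize = Column(Integer)         # bytes
    mimetype = Column(String(255))
    doctype = Column(String(10))       # pdf, docx, pptx, html, md, txt
```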
job
One job per document submission. Tracks processing status and configuration.
| Column | Type | Notes |
|---|---|---|
| id | VARCHAR(255) PK | UUID v4 |
| ip | VARCHAR(20) | Client IP (X-Forwarded-For or direct) |
| documentid | VARCHAR(255) FK -> document.id | |
| status | ENUM | NEW, PENDING, PROCESSING, COMPLETED, FAILED, CANCELED |
| created | DATETIME | UTC |
| updated | DATETIME | UTC |
| completed | DATETIME | Set when status becomes COMPLETED |
| timeit | INT | Total processing time in seconds |
| callbackurl | VARCHAR(2048) | Optional webhook URL |
| maxsize | INT | Max segment size in configured units |
| usetags | JSON | List of tag strings requested by caller |
| maxtags | INT | Max tags per segment (default 1) |
| errormessage | TEXT | Traceback on failure |
| process_images | BOOLEAN | Whether to extract and upload images |
| options | JSON | pdf_parsing_backend, enable_ocr, ocr_backend, force_full_page_ocr |
| progress | INT | 0-100, updated during processing |
| correlation_id | VARCHAR(255) | UUID for end-to-end log tracing |
| processing_stage | VARCHAR(50) | Current ProcessingStage value |
Indexes: ix_job_id, ix_job_documentid.
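The `options` column is free-form JSON at the database level, so any key checking has to happen in application code. A sketch of the kind of validation the API layer might apply before storing it — the key names come from the table above, but the type rules here are illustrative assumptions, not the service's actual whitelist:

```python
# Keys documented for job.options; the validation rules below are
# assumptions for illustration, not the service's real checks.
KNOWN_OPTION_KEYS = {
    "pdf_parsing_backend",
    "enable_ocr",
    "ocr_backend",
    "force_full_page_ocr",
}


def validate_job_options(options: dict) -> list:
    """Return a list of problems; an empty list means the options look sane."""
    problems = []
    for key in options:
        if key not in KNOWN_OPTION_KEYS:
            problems.append(f"unknown option: {key}")
    for flag in ("enable_ocr", "force_full_page_ocr"):
        if flag in options and not isinstance(options[flag], bool):
            problems.append(f"{flag} must be a boolean")
    return problems
```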
documentsegment
Output segments produced by the pipeline.
| Column | Type | Notes |
|---|---|---|
| id | VARCHAR(255) PK | UUID v4 |
| documentid | VARCHAR(255) FK -> document.id | |
| status | ENUM | COMPLETED, FAILED |
| body | TEXT | Markdown content of the segment |
| pagenr | INT | Source page number |
| group | VARCHAR(255) | Heading group from Markdown tree |
| title | VARCHAR(255) | Extracted or LLM-generated title |
| tags | JSON | LLM-assigned tags (list of strings) |
| created | DATETIME | UTC |
| updated | DATETIME | UTC |
| timeit | INT | Per-segment timing (legacy, not actively used) |
| lang | VARCHAR(10) | ISO language code detected for the document |
| ordinal | INT | 1-based position in document order |
Indexes: ix_documentsegment_id, ix_documentsegment_documentid.
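Because each segment carries a 1-based `ordinal`, a consumer can reconstruct the document text in order without relying on insertion order. A sketch over plain dicts standing in for `documentsegment` rows:

```python
def reassemble(segments: list) -> str:
    """Join segment bodies in document order using the 1-based ordinal."""
    ordered = sorted(segments, key=lambda s: s["ordinal"])
    return "\n\n".join(s["body"] for s in ordered)
```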
Enums (SQLAlchemy / Python)
```python
import enum


class JobStatus(enum.Enum):
    NEW = "NEW"
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"


class DocumentStatus(enum.Enum):
    NEW = "NEW"
    PROCESSED = "PROCESSED"
    CANCELED = "CANCELED"


class SegmentStatus(enum.Enum):
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
```
ProcessingStage (not persisted as enum, stored as VARCHAR in job.processing_stage):
UPLOADED -> CONVERTING -> PROCESSING_MARKDOWN -> SEGMENTING -> TAGGING -> FINALIZING -> COMPLETED
\-> FAILED
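Since `job.processing_stage` is stored as a plain VARCHAR, the stage sequence above can be mirrored as a string-valued enum on the Python side — a sketch, not necessarily the shape used in the codebase; the `next_stage` helper is a hypothetical addition for illustration:

```python
import enum


class ProcessingStage(str, enum.Enum):
    UPLOADED = "UPLOADED"
    CONVERTING = "CONVERTING"
    PROCESSING_MARKDOWN = "PROCESSING_MARKDOWN"
    SEGMENTING = "SEGMENTING"
    TAGGING = "TAGGING"
    FINALIZING = "FINALIZING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Happy-path order; FAILED can be entered from any stage.
STAGE_ORDER = [
    ProcessingStage.UPLOADED,
    ProcessingStage.CONVERTING,
    ProcessingStage.PROCESSING_MARKDOWN,
    ProcessingStage.SEGMENTING,
    ProcessingStage.TAGGING,
    ProcessingStage.FINALIZING,
    ProcessingStage.COMPLETED,
]


def next_stage(current: str):
    """Next happy-path stage, or None once COMPLETED/FAILED is reached."""
    stage = ProcessingStage(current)
    if stage in (ProcessingStage.COMPLETED, ProcessingStage.FAILED):
        return None
    return STAGE_ORDER[STAGE_ORDER.index(stage) + 1]
```

Subclassing `str` keeps `stage.value` round-trippable through the VARCHAR column without explicit conversion.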
Connection pool
Configured in app/config/database.py via app/config/config.py:
| Env var | Default | Purpose |
|---|---|---|
| SQLALCHEMY_POOL_SIZE | 5 | Pool size |
| SQLALCHEMY_OVERFLOW | 10 | Max overflow connections |
| SQLALCHEMY_POOL_RECYCLE | 3600 | Seconds before recycling connections |
| MYSQL_USE_SSL | false | Enable SSL (no cert verification) |
pool_pre_ping=True and pool_reset_on_return="commit" are set to guard against stale connections.
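A sketch of how these env vars could be turned into `create_engine()` keyword arguments, with the defaults from the table — the real wiring lives in app/config/database.py and may differ, and the `connect_args` shape for SSL is an assumption about a PyMySQL-style driver:

```python
import os


def engine_kwargs() -> dict:
    """Build SQLAlchemy create_engine() pool arguments from the environment."""
    kwargs = {
        "pool_size": int(os.getenv("SQLALCHEMY_POOL_SIZE", "5")),
        "max_overflow": int(os.getenv("SQLALCHEMY_OVERFLOW", "10")),
        "pool_recycle": int(os.getenv("SQLALCHEMY_POOL_RECYCLE", "3600")),
        "pool_pre_ping": True,             # probe connections before handing them out
        "pool_reset_on_return": "commit",  # guard against stale transaction state
    }
    if os.getenv("MYSQL_USE_SSL", "false").lower() == "true":
        # SSL without certificate verification, as noted in the table above
        # (driver-specific shape; assumed here for a PyMySQL-style DBAPI)
        kwargs["connect_args"] = {"ssl": {"verify_cert": False}}
    return kwargs


# engine = create_engine(os.environ["MYSQL_URL"], **engine_kwargs())
```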