Database Schema Reference
Keep this file up to date. When models or migrations change, update this document accordingly.
MySQL database with InnoDB engine, utf8mb4 charset, utf8mb4_unicode_ci collation on all tables.
Models defined in: app/models/models.py
Migrations: alembic/versions/
Tables
document
Stores uploaded document binary data and metadata.
| Column | Type | Nullable | Default | Description |
|---|---|---|---|---|
| id | String(255) PK | no | UUID | Primary key |
| status | Enum(NEW, PROCESSED, CANCELED) | no | NEW | Processing status |
| created | DateTime | no | UTC now | Creation timestamp |
| updated | DateTime | no | UTC now | Last update (auto-updated) |
| body | LargeBinary(16MB) | no | - | Raw document bytes |
| filename | String(255) | no | - | Original filename |
| filesize | Integer | no | - | File size in bytes |
| mimetype | String(255) | no | - | MIME type |
| doctype | String(10) | no | - | Document type (pdf, docx, pptx, xlsx, txt, html, md) |
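
The defaults and constraints in the table above can be illustrated with a small standard-library sketch. This is not the model in app/models/models.py: `DocumentRow`, `DOCTYPES`, `MAX_BODY_BYTES`, and `utc_now` are hypothetical names, and the assumptions that "16MB" means 16 * 1024 * 1024 bytes and that `filesize` equals the byte length of `body` are ours.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Document types listed in the table above.
DOCTYPES = {"pdf", "docx", "pptx", "xlsx", "txt", "html", "md"}
# Assumption: "16MB" means 16 * 1024 * 1024 bytes; the real column limit may differ.
MAX_BODY_BYTES = 16 * 1024 * 1024


def utc_now() -> datetime:
    """Current time in UTC, used for the created/updated defaults."""
    return datetime.now(timezone.utc)


@dataclass
class DocumentRow:
    """Hypothetical stand-in for one row of the document table."""
    body: bytes
    filename: str
    mimetype: str
    doctype: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: str = "NEW"
    created: datetime = field(default_factory=utc_now)
    updated: datetime = field(default_factory=utc_now)
    filesize: int = field(init=False)

    def __post_init__(self) -> None:
        if self.doctype not in DOCTYPES:
            raise ValueError(f"unsupported doctype: {self.doctype!r}")
        if len(self.body) > MAX_BODY_BYTES:
            raise ValueError("body exceeds the assumed 16 MB limit")
        # Assumption: filesize is the byte length of body.
        self.filesize = len(self.body)


doc = DocumentRow(body=b"hello", filename="note.txt",
                  mimetype="text/plain", doctype="txt")
print(doc.status, doc.filesize)
```

Only the four columns without a default have to be supplied; everything else is filled in the way the Default column describes.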
job
Tracks document processing jobs. One job per document submission.
| Column | Type | Nullable | Default | Description |
|---|---|---|---|---|
| id | String(255) PK | no | UUID | Primary key |
| ip | String(20) | no | - | Client IP address |
| documentid | String(255) FK | no | - | References document.id (indexed) |
| status | Enum(NEW, PENDING, PROCESSING, COMPLETED, FAILED, CANCELED) | no | NEW | Job status |
| created | DateTime | no | UTC now | Creation timestamp |
| updated | DateTime | no | UTC now | Last update (auto-updated) |
| completed | DateTime | yes | null | Completion timestamp |
| timeit | Integer | yes | null | Processing time in seconds |
| callbackurl | VARCHAR(2048) | yes | - | URL for progress/result callbacks |
| maxsize | Integer | no | - | Max article size (in SEGMENTER_SIZE_UNITS) |
| article_size | String(20) | yes | null | Named article size preset (small, medium, large, xlarge). Null for custom numeric sizes. |
| usetags | JSON | no | - | Tag list for LLM tagging |
| maxtags | Integer | no | 1 | Max tags per segment |
| errormessage | TEXT | yes | - | Error details on failure |
| process_images | Boolean | no | - | Whether to handle images |
| options | JSON | no | - | Additional job options (mode, backend, OCR settings, etc.) |
| progress | Integer | no | 0 | Processing progress (0-100) |
| correlation_id | String(255) | yes | null | UUID for log correlation |
| processing_stage | String(50) | yes | null | Current processing stage (e.g. UPLOADED, CONVERTING, SEGMENTING, COMPLETED) |
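
Several of the constraints in the job table are easy to check before a row is written. The following is a minimal sketch under our own assumptions: `check_job_row` and the constant names are hypothetical, and nothing here is taken from the application code. It checks only what the table documents (status values, the 0-100 progress range, the named `article_size` presets, and the 20-character `ip` column).

```python
import ipaddress
from typing import Any, Dict, List

# Values copied from the job table above.
JOB_STATUSES = {"NEW", "PENDING", "PROCESSING", "COMPLETED", "FAILED", "CANCELED"}
ARTICLE_SIZES = {"small", "medium", "large", "xlarge"}
IP_MAX_LEN = 20  # ip is String(20)


def check_job_row(row: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the row fits the table."""
    problems: List[str] = []
    if row.get("status", "NEW") not in JOB_STATUSES:
        problems.append("status is not a known job status")
    if not 0 <= row.get("progress", 0) <= 100:
        problems.append("progress must be between 0 and 100")
    size = row.get("article_size")
    if size is not None and size not in ARTICLE_SIZES:
        problems.append("article_size must be a named preset or None")
    ip = str(row.get("ip", ""))
    try:
        ipaddress.ip_address(ip)
    except ValueError:
        problems.append("ip is not a valid IP address")
    else:
        if len(ip) > IP_MAX_LEN:
            problems.append("ip does not fit in String(20)")
    return problems


print(check_job_row({"ip": "203.0.113.7", "status": "PENDING", "progress": 40}))
print(check_job_row({"ip": "2001:0db8:85a3:0000:0000:8a2e:0370:7334"}))
```

The second call is flagged because a fully expanded IPv6 address is 39 characters long, which is more than the `ip` column holds.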
documentsegment
Stores the output segments/articles produced by document processing.
| Column | Type | Nullable | Default | Description |
|---|---|---|---|---|
| id | String(255) PK | no | UUID | Primary key |
| documentid | String(255) FK | no | - | References document.id (indexed) |
| status | Enum(COMPLETED, FAILED) | no | - | Segment status |
| body | TEXT | no | - | Segment text content (markdown) |
| pagenr | Integer | no | - | Source page number |
| group | String(255) | yes | null | Segment grouping |
| title | String(255) | yes | null | Segment/article title |
| tags | JSON | yes | - | LLM-assigned tags |
| created | DateTime | no | UTC now | Creation timestamp |
| updated | DateTime | no | UTC now | Last update (auto-updated) |
| timeit | Integer | yes | null | Deprecated, no longer used |
| lang | String(10) | no | - | Detected language code |
| ordinal | Integer | no | - | Segment ordering index |
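
Reading a document's segments back in order might look like the sketch below. It uses an in-memory SQLite table purely for illustration, with only a subset of the columns; the production database is MySQL, and `completed_segments` is a hypothetical helper. Note that `group` is a reserved word in SQL, so raw queries have to quote it (double quotes in SQLite, backticks in MySQL).

```python
import sqlite3


def completed_segments(conn: sqlite3.Connection, document_id: str) -> list:
    """Return (ordinal, title, group, body) for COMPLETED segments, in order."""
    cursor = conn.execute(
        'SELECT ordinal, title, "group", body FROM documentsegment '
        "WHERE documentid = ? AND status = 'COMPLETED' "
        "ORDER BY ordinal",
        (document_id,),
    )
    return cursor.fetchall()


# Illustration only: a cut-down table in SQLite, not the real MySQL schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documentsegment ("
    "id TEXT PRIMARY KEY, documentid TEXT NOT NULL, status TEXT NOT NULL, "
    'body TEXT NOT NULL, title TEXT, "group" TEXT, ordinal INTEGER NOT NULL)'
)
conn.executemany(
    "INSERT INTO documentsegment VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("s2", "d1", "COMPLETED", "Second part", "Part 2", None, 2),
        ("s1", "d1", "COMPLETED", "First part", "Part 1", None, 1),
        ("s3", "d1", "FAILED", "", None, None, 3),
    ],
)

for ordinal, title, group, body in completed_segments(conn, "d1"):
    print(ordinal, title, body)
```

Filtering on `status` here is a choice made for the example; whether FAILED segments should be returned to callers is not specified in this document.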
Relationships
document (1) ──< job (N)
│
└──< documentsegment (N)
- A `document` can have multiple `job` records (reprocessing)
- A `document` can have multiple `documentsegment` records (the output articles)
- `DELETE /jobs/{jobid}` cascades manually: deletes job, all segments for its document, and the document itself
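
The manual cascade described above can be sketched as three deletes in one transaction, children before parent so that the foreign keys are never violated. This is an illustration against an in-memory SQLite database with cut-down tables, not the endpoint's actual implementation; `delete_job_cascade` is a hypothetical helper. What happens to other jobs that reference the same document is not stated in this document, so the sketch does not handle that case.

```python
import sqlite3


def delete_job_cascade(conn: sqlite3.Connection, job_id: str) -> bool:
    """Delete a job, its document's segments, and the document itself.

    Returns False if the job does not exist.
    """
    row = conn.execute(
        "SELECT documentid FROM job WHERE id = ?", (job_id,)
    ).fetchone()
    if row is None:
        return False
    document_id = row[0]
    # One transaction: commits on success, rolls back if any delete fails.
    with conn:
        conn.execute("DELETE FROM job WHERE id = ?", (job_id,))
        conn.execute(
            "DELETE FROM documentsegment WHERE documentid = ?", (document_id,)
        )
        conn.execute("DELETE FROM document WHERE id = ?", (document_id,))
    return True


# Illustration only: minimal tables with the same foreign keys.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE document (id TEXT PRIMARY KEY);
    CREATE TABLE job (
        id TEXT PRIMARY KEY,
        documentid TEXT NOT NULL REFERENCES document(id)
    );
    CREATE TABLE documentsegment (
        id TEXT PRIMARY KEY,
        documentid TEXT NOT NULL REFERENCES document(id)
    );
    INSERT INTO document VALUES ('d1');
    INSERT INTO job VALUES ('j1', 'd1');
    INSERT INTO documentsegment VALUES ('s1', 'd1'), ('s2', 'd1');
    """
)

print(delete_job_cascade(conn, "j1"))
```

Wrapping the deletes in a single transaction means a failure partway through leaves all three tables unchanged rather than orphaning segments or a document.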
Migration History
| Revision | Description |
|---|---|
| 47cd9e9e0aa6 | Base schema (document, job, documentsegment) |
| 3a0135f92ea8 | Add process_images field to job |
| 97c5f3f7f0ac | Add options JSON column to job |
| ac627aa39080 | Add processing_stage and correlation_id to job |
| 854f0244c449 | Add article_size column to job (current head) |