API Schema Reference
Keep this file up to date. When any endpoint, parameter, or response shape changes, update this document accordingly.
All endpoints require JWT authentication via the Authorization: Bearer <token> header. The token payload must contain the claim {"role": "admin"}.
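As a client-side convenience, the role requirement above can be checked before any request is sent. The sketch below is illustrative only: the helper names `jwt_payload` and `auth_headers` are hypothetical, and the payload is decoded without verifying the signature, so this is a sanity check and not a security control. Signature verification remains the server's job.

```python
import base64
import json


def jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a three-part JWT (header.payload.signature)")
    payload_b64 = parts[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def auth_headers(token: str) -> dict:
    """Build the Authorization header, failing fast if the role claim is not admin."""
    if jwt_payload(token).get("role") != "admin":
        raise PermissionError('token payload lacks {"role": "admin"}')
    return {"Authorization": f"Bearer {token}"}
```

The returned dict can be passed as the headers argument of whichever HTTP client is in use.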
Jobs API (/jobs)
POST /jobs - Submit a document for processing
Accepts a multipart form upload. Starts asynchronous document segmentation.
Form parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| uploadfile | file | required | The document to process |
| doctype | string | "" (auto-detect) | Document type: pdf, docx, pptx, xlsx, txt, html, md |
| article_size | string | null | Named article size selector: small, medium, large, xlarge |
| maxsize | int | SEGMENTER_MAX_ARTICLE_SIZE (2000) | Legacy numeric article size override (non-whitespace characters by default) |
| usetags | string | "[]" | JSON array of tag strings for LLM tagging |
| maxtags | int | 1 | Maximum number of tags per segment |
| callbackurl | string | null | URL to POST progress/results to |
| process_images | bool | null | Legacy image handling toggle |
| include_images | bool | null | Preferred image toggle (overrides process_images) |
| mode | string | null | Parsing mode selector (e.g. agentic) |
| pdf_parsing_backend | string | null | Backend for PDF parsing (e.g. pymupdf, docling, agentic) |
| extra_instructions | string | null | Extra instructions for agentic mode prompt |
| enable_ocr | bool | false | Enable OCR text extraction |
| ocr_backend | string | null | OCR backend to use |
| force_full_page_ocr | bool | false | Force OCR on full page images |
article_size mappings:
| Name | maxsize (non-ws chars) |
|---|---|
| small | 750 |
| medium | 1500 |
| large | 5000 |
| xlarge | 9000 |
Precedence for max article size (highest first): usetags override > article_size > maxsize form parameter > environment variable default (SEGMENTER_MAX_ARTICLE_SIZE).
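The precedence chain can be sketched as a small resolver. This is an illustration under stated assumptions, not the service's actual code: the function name and the `usetags_override` parameter are hypothetical (this document does not say how the usetags override is derived, so it is modeled as an optional precomputed number), and rejecting an unknown article_size name with ValueError is also an assumption.

```python
import os
from typing import Optional

# Preset mapping copied from the article_size table.
ARTICLE_SIZES = {"small": 750, "medium": 1500, "large": 5000, "xlarge": 9000}


def resolve_max_article_size(
    article_size: Optional[str] = None,
    maxsize: Optional[int] = None,
    usetags_override: Optional[int] = None,  # hypothetical: origin not documented here
) -> int:
    """Return the effective max article size, highest-precedence source first."""
    if usetags_override is not None:
        return usetags_override
    if article_size is not None:
        if article_size not in ARTICLE_SIZES:
            # Assumption: unknown preset names are rejected rather than ignored.
            raise ValueError(f"unknown article_size: {article_size!r}")
        return ARTICLE_SIZES[article_size]
    if maxsize is not None:
        return maxsize
    # Fall back to the environment variable, then to the documented default of 2000.
    return int(os.environ.get("SEGMENTER_MAX_ARTICLE_SIZE", "2000"))
```

For example, sending both article_size=medium and maxsize=3000 would resolve to 1500 under this reading, because the named preset outranks the legacy numeric parameter.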
Response (ProcessDocReplySchema):
{
"success": true,
"jobid": "uuid",
"documentid": "uuid"
}
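A minimal sketch of preparing a submission and reading the reply, using only the standard library. The helper names `build_job_fields` and `parse_job_reply` are hypothetical, and sending booleans as the strings "true"/"false" is an assumption about how the form is parsed. The file itself (uploadfile) would be attached by the HTTP client's multipart support.

```python
import json
from typing import Optional, Sequence


def build_job_fields(
    article_size: Optional[str] = None,
    usetags: Sequence[str] = (),
    maxtags: int = 1,
    callbackurl: Optional[str] = None,
    include_images: Optional[bool] = None,
) -> dict:
    """Return the non-file form fields for POST /jobs as strings."""
    fields = {
        # usetags is documented as a JSON array encoded in a string field.
        "usetags": json.dumps(list(usetags)),
        "maxtags": str(maxtags),
    }
    if article_size is not None:
        fields["article_size"] = article_size
    if callbackurl is not None:
        fields["callbackurl"] = callbackurl
    if include_images is not None:
        # Assumption: booleans are sent as the strings "true"/"false".
        fields["include_images"] = "true" if include_images else "false"
    return fields


def parse_job_reply(reply: dict) -> tuple:
    """Extract (jobid, documentid) from a ProcessDocReplySchema-shaped dict."""
    if not reply.get("success"):
        raise RuntimeError(f"job submission failed: {reply!r}")
    return reply["jobid"], reply["documentid"]
```

With the `requests` library, for instance, these fields could be passed as `data=` alongside `files={"uploadfile": fh}`. Optional parameters that are left unset are omitted so the server-side defaults apply.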
GET /jobs/ - List all jobs
Response (JobsReplySchema):
{
"items": [<JobSchema>, ...]
}
GET /jobs/{jobid} - Get job details
Response (JobSchema):
{
"id": "uuid",
"ip": "127.0.0.1",
"documentid": "uuid",
"status": "NEW|PROCESSING|COMPLETED|FAILED|CANCELED",
"created": "datetime",
"updated": "datetime",
"completed": "datetime|null",
"maxsize": 1500,
"article_size": "medium|null",
"timeit": "int|null",
"callbackurl": "string|null",
"progress": 0
}
Note: article_size is stored on the job at creation time. It is null for legacy jobs or custom numeric maxsize values that don't match a preset.
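Because processing is asynchronous, a client that does not use callbackurl will typically poll GET /jobs/{jobid}. The sketch below assumes that COMPLETED, FAILED, and CANCELED are terminal statuses (the schema lists them but does not say so explicitly). `wait_for_job` and its `fetch_job` callable are hypothetical; `fetch_job` stands in for whatever performs the HTTP request and returns a JobSchema-shaped dict.

```python
import time
from typing import Callable

# Assumption: these statuses never change once reached.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELED"}


def wait_for_job(
    fetch_job: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_job() until the job reaches a terminal status or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {job['status']} after {timeout} seconds")
        sleep(interval)
```

Injecting `fetch_job` and `sleep` keeps the loop independent of any particular HTTP client and makes it easy to exercise without a running service.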
PATCH /jobs/{jobid} - Cancel a job
Request body:
{"status": "CANCELED"}
Response: JobSchema (same as GET).
Only jobs with status NEW or PROCESSING can be canceled. Canceling an already-canceled job returns success (the operation is idempotent). Jobs in any other status return 406 Not Acceptable.
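The cancellation rules amount to a small state check, sketched here on plain dicts. The names `apply_cancel` and `NotCancelable` are hypothetical, and the exception simply stands in for the 406 response; this models the documented behavior and is not the server's implementation.

```python
class NotCancelable(Exception):
    """Stands in for the 406 response returned for non-cancelable jobs."""


def apply_cancel(job: dict) -> dict:
    """Return the job as it would look after PATCH {"status": "CANCELED"}."""
    status = job["status"]
    if status == "CANCELED":
        # Idempotent: canceling again succeeds and changes nothing.
        return job
    if status in {"NEW", "PROCESSING"}:
        # Copy the job so the caller's dict is left untouched.
        return {**job, "status": "CANCELED"}
    raise NotCancelable(f"cannot cancel a job with status {status}")
```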
DELETE /jobs/{jobid} - Delete a job and related data
Deletes the job, its document, and all segments. Returns 204 No Content.
Jobs with status NEW, PROCESSING, or PENDING cannot be deleted (returns 406).
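Combining the delete restriction with the cancel rules suggests a simple client-side plan for removing a job. This is a sketch under assumptions: `deletion_plan` is a hypothetical helper, and it assumes a job becomes deletable as soon as it is CANCELED (CANCELED is not among the blocked statuses listed above). PENDING jobs are neither cancelable nor deletable according to this document, so the plan for them is empty and the caller would have to wait.

```python
def deletion_plan(status: str) -> list:
    """Return the HTTP methods to call on /jobs/{jobid}, in order, to delete a job."""
    if status in {"NEW", "PROCESSING"}:
        # Active jobs must be canceled first; only then is DELETE accepted.
        return ["PATCH", "DELETE"]
    if status == "PENDING":
        # Neither cancel nor delete is documented as allowed; nothing to do yet.
        return []
    # COMPLETED, FAILED, and CANCELED jobs can be deleted directly.
    return ["DELETE"]
```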
Documents API (/documents)
GET /documents/ - List all documents
Response (DocumentsReplySchema):
{
"items": [<DocumentSchema>, ...]
}
GET /documents/{documentid} - Get document metadata
Response (DocumentSchema):
{
"id": "uuid",
"status": "NEW|PROCESSED|FAILED",
"created": "datetime",
"updated": "datetime",
"filename": "string",
"filesize": 12345,
"mimetype": "string|null"
}
Note: This endpoint returns metadata only, NOT segments.
GET /documents/{documentid}/segments - Get all segments for a document
This endpoint returns the actual segmentation results.
Response (SegmentsReply):
{
  "segments": [
    {
      "id": "uuid",
      "documentid": "uuid",
      "status": "NEW|PROCESSED",
      "body": "the segment text content",
      "pagenr": 1,
      "group": "string|null",
      "title": "string|null",
      "tags": ["tag1", "tag2"],
      "lang": "en",
      "created": "datetime",
      "updated": "datetime",
      "timeit": 0,
      "ordinal": 0
    }
  ]
}
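A common use of this response is to rebuild readable text per page. The sketch below assumes that ordinal gives the reading order of segments within a document, which the schema suggests but does not state. `text_by_page` is a hypothetical helper operating on a SegmentsReply-shaped dict.

```python
from collections import defaultdict


def text_by_page(reply: dict) -> dict:
    """Group segment bodies by pagenr, ordered by ordinal within each page."""
    pages = defaultdict(list)
    # Assumption: a lower ordinal means earlier in the document.
    for seg in sorted(reply["segments"], key=lambda s: s["ordinal"]):
        pages[seg["pagenr"]].append(seg["body"])
    # Join each page's segments with a blank line, pages in ascending order.
    return {page: "\n\n".join(bodies) for page, bodies in sorted(pages.items())}
```

The same function works on the single-segment response, since it has the same shape with a one-element segments array.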
GET /documents/{documentid}/segments/{segmentid} - Get a single segment
Response: Same shape as above (SegmentsReply) with a single-element segments array.
Debug API (/debug)
GET /debug/jobs/{job_id} - Detailed job debug info
Returns extensive job information including correlation ID, processing stage, options, and document details. Useful for troubleshooting.
Response (untyped dict):
{
  "job_id": "uuid",
  "correlation_id": "uuid",
  "status": "COMPLETED",
  "processing_stage": "COMPLETED",
  "progress": 100,
  "created": "iso-datetime",
  "updated": "iso-datetime",
  "completed": "iso-datetime",
  "processing_duration_seconds": 13.0,
  "total_time_seconds": 6,
  "document": {
    "id": "uuid",
    "filename": "file.pdf",
    "doctype": "pdf",
    "filesize": 12345,
    "status": "PROCESSED"
  },
  "options": {"pdf_parsing_backend": "...", "article_size": "medium", ...},
  "process_images": false,
  "maxsize": 1500,
  "article_size": "medium|null",
  "usetags": [],
  "maxtags": 1,
  "callback_url": null,
  "error_message": "",
  "ip": "127.0.0.1"
}
GET /debug/health - Service health summary
Returns counts of pending, processing, and recently failed jobs.
GET /debug/health_with_details - Detailed service health
Same as GET /debug/health, but also includes the job ID and timestamps for each pending, processing, or failed job.