API Schema Reference

Keep this file up to date. When any endpoint, parameter, or response shape changes, update this document accordingly.

All endpoints require JWT authentication via the Authorization: Bearer <token> header. The token payload must contain the claim {"role": "admin"}.
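As a rough client-side illustration of the requirement above, the sketch below decodes a token's payload and builds the header only if the role claim is present. It does not verify the signature, so it is a convenience check and not a security measure; the helper names `jwt_payload` and `auth_headers` are hypothetical and not part of this API.

```python
import base64
import json


def jwt_payload(token: str) -> dict:
    """Decode the payload (second segment) of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    segment = parts[1]
    # base64url segments omit '=' padding; restore it before decoding.
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def auth_headers(token: str) -> dict:
    """Build the Authorization header, failing early if the role claim is not admin."""
    if jwt_payload(token).get("role") != "admin":
        raise PermissionError('token payload lacks {"role": "admin"}')
    return {"Authorization": f"Bearer {token}"}
```

The server remains the authority on whether a token is accepted; this only catches an obviously wrong token before a request is sent.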

Jobs API (/jobs)

POST /jobs - Submit a document for processing

Accepts a multipart form upload. Starts asynchronous document segmentation.

Form parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| uploadfile | file | required | The document to process |
| doctype | string | "" (auto-detect) | Document type: pdf, docx, pptx, xlsx, txt, html, md |
| article_size | string | null | Named article size selector: small, medium, large, xlarge |
| maxsize | int | SEGMENTER_MAX_ARTICLE_SIZE (2000) | Legacy numeric article size override (non-ws chars by default) |
| usetags | string | "[]" | JSON array of tag strings for LLM tagging |
| maxtags | int | 1 | Maximum number of tags per segment |
| callbackurl | string | null | URL to POST progress/results to |
| process_images | bool | null | Legacy image handling toggle |
| include_images | bool | null | Preferred image toggle (overrides process_images) |
| mode | string | null | Parsing mode selector (e.g. agentic) |
| pdf_parsing_backend | string | null | Backend for PDF parsing (e.g. pymupdf, docling, agentic) |
| extra_instructions | string | null | Extra instructions for agentic mode prompt |
| enable_ocr | bool | false | Enable OCR text extraction |
| ocr_backend | string | null | OCR backend to use |
| force_full_page_ocr | bool | false | Force OCR on full page images |

article_size mappings:

| Name | maxsize (non-ws chars) |
|---|---|
| small | 750 |
| medium | 1500 |
| large | 5000 |
| xlarge | 9000 |

Precedence for max article size: usetags override > article_size > maxsize form param > env var default.
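The precedence chain above can be sketched as a small resolver. This is an illustration under assumptions, not the service's actual code: the function name `resolve_maxsize` is hypothetical, `tags_override` stands in for whatever value the usetags override supplies (the document does not say how it is derived), and raising on an unknown preset name is a guess at reasonable behaviour.

```python
import os
from typing import Optional

# Preset names and sizes in non-whitespace characters, from the article_size mappings.
ARTICLE_SIZES = {"small": 750, "medium": 1500, "large": 5000, "xlarge": 9000}


def resolve_maxsize(
    article_size: Optional[str] = None,
    maxsize: Optional[int] = None,
    tags_override: Optional[int] = None,  # hypothetical stand-in for the usetags override
) -> int:
    """Pick the effective max article size: override > preset > numeric param > env default."""
    if tags_override is not None:
        return tags_override
    if article_size is not None:
        if article_size not in ARTICLE_SIZES:
            # Assumption: unknown preset names are rejected rather than ignored.
            raise ValueError(f"unknown article_size: {article_size!r}")
        return ARTICLE_SIZES[article_size]
    if maxsize is not None:
        return maxsize
    # Documented default for SEGMENTER_MAX_ARTICLE_SIZE is 2000.
    return int(os.environ.get("SEGMENTER_MAX_ARTICLE_SIZE", "2000"))
```

For example, submitting both article_size=medium and maxsize=300 resolves to 1500 under this reading, because the named preset outranks the legacy numeric parameter.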

Response (ProcessDocReplySchema):

{
  "success": true,
  "jobid": "uuid",
  "documentid": "uuid"
}
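A minimal sketch of preparing the non-file form fields for POST /jobs. The helper `build_job_form` is hypothetical, it covers only a subset of the parameters, and encoding booleans as the strings "true"/"false" is an assumption about how the server parses form values.

```python
import json
from typing import Dict, Iterable, Optional


def build_job_form(
    doctype: str = "",
    article_size: Optional[str] = None,
    usetags: Iterable[str] = (),
    maxtags: int = 1,
    callbackurl: Optional[str] = None,
    include_images: Optional[bool] = None,
    enable_ocr: bool = False,
) -> Dict[str, str]:
    """Serialize job options as form fields; optional fields left as None are omitted."""
    form = {
        "doctype": doctype,
        # usetags is sent as a JSON array encoded in a string field.
        "usetags": json.dumps(list(usetags)),
        "maxtags": str(maxtags),
        # Assumption: booleans are sent as "true"/"false".
        "enable_ocr": "true" if enable_ocr else "false",
    }
    if article_size is not None:
        form["article_size"] = article_size
    if callbackurl is not None:
        form["callbackurl"] = callbackurl
    if include_images is not None:
        form["include_images"] = "true" if include_images else "false"
    return form
```

With the third-party requests library, the result could be sent roughly as `requests.post(base_url + "/jobs", headers=headers, data=form, files={"uploadfile": fh})`, where `base_url`, `headers`, and the open file handle `fh` are supplied by the caller.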

GET /jobs/ - List all jobs

Response (JobsReplySchema):

{
  "items": [<JobSchema>, ...]
}

GET /jobs/{jobid} - Get job details

Response (JobSchema):

{
  "id": "uuid",
  "ip": "127.0.0.1",
  "documentid": "uuid",
  "status": "NEW|PROCESSING|COMPLETED|FAILED|CANCELED",
  "created": "datetime",
  "updated": "datetime",
  "completed": "datetime|null",
  "maxsize": 1500,
  "article_size": "medium|null",
  "timeit": "int|null",
  "callbackurl": "string|null",
  "progress": 0
}

Note: article_size is stored on the job at creation time. It is null for legacy jobs and for jobs submitted with a custom numeric maxsize that doesn't match a preset.
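Because processing is asynchronous, a client that does not use callbackurl typically polls GET /jobs/{jobid} until the job settles. A minimal polling sketch follows; `wait_for_job` and its `fetch_job` callable are hypothetical, and treating COMPLETED, FAILED, and CANCELED as final is an assumption read off the status enum.

```python
import time
from typing import Callable, Dict

# Assumption: these statuses are final and the job will not change afterwards.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELED"}


def wait_for_job(
    fetch_job: Callable[[str], Dict],
    jobid: str,
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch_job(jobid) until the job reaches a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(jobid)  # expected to return a parsed JobSchema dict
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {jobid} still {job['status']} after {timeout}s")
        sleep(interval)
```

Passing the HTTP call in as `fetch_job` keeps the loop independent of any particular HTTP client and makes it easy to exercise with canned responses.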

PATCH /jobs/{jobid} - Cancel a job

Request body:

{"status": "CANCELED"}

Response: JobSchema (same as GET).

Only jobs with status NEW or PROCESSING can be canceled. Already-canceled jobs return success (idempotent). Jobs with other statuses return 406.

DELETE /jobs/{jobid} - Delete a job

Deletes the job, its document, and all segments. Returns 204 No Content.

Cannot delete jobs with status NEW, PROCESSING, or PENDING (returns 406).
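The cancel and delete rules above reduce to two small lookups, sketched here so a client can predict the outcome before issuing a request. The function names are hypothetical, and the 200 returned for a successful cancel is an assumption: the document only says the response is a JobSchema.

```python
# Statuses from which PATCH {"status": "CANCELED"} is accepted.
CANCELABLE = {"NEW", "PROCESSING"}
# Statuses for which deletion is refused.
UNDELETABLE = {"NEW", "PROCESSING", "PENDING"}


def expected_cancel_status(job_status: str) -> int:
    """HTTP status to expect when canceling a job that is currently in job_status."""
    if job_status in CANCELABLE or job_status == "CANCELED":
        # Already-canceled jobs succeed too (idempotent).
        # Assumption: success is reported as 200 with a JobSchema body.
        return 200
    return 406


def expected_delete_status(job_status: str) -> int:
    """HTTP status to expect when deleting a job that is currently in job_status."""
    return 406 if job_status in UNDELETABLE else 204
```

Under this reading, a COMPLETED job cannot be canceled (406) but can be deleted (204), while a PROCESSING job is the reverse.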

Documents API (/documents)

GET /documents/ - List all documents

Response (DocumentsReplySchema):

{
  "items": [<DocumentSchema>, ...]
}

GET /documents/{documentid} - Get document metadata

Response (DocumentSchema):

{
  "id": "uuid",
  "status": "NEW|PROCESSED|FAILED",
  "created": "datetime",
  "updated": "datetime",
  "filename": "string",
  "filesize": 12345,
  "mimetype": "string|null"
}

Note: This endpoint returns metadata only, NOT segments.

GET /documents/{documentid}/segments - Get all segments for a document

This is where the actual segmentation results live.

Response (SegmentsReply):

{
  "segments": [
    {
      "id": "uuid",
      "documentid": "uuid",
      "status": "NEW|PROCESSED",
      "body": "the segment text content",
      "pagenr": 1,
      "group": "string|null",
      "title": "string|null",
      "tags": ["tag1", "tag2"],
      "lang": "en",
      "created": "datetime",
      "updated": "datetime",
      "timeit": 0,
      "ordinal": 0
    }
  ]
}
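A short sketch of consuming a SegmentsReply: it sorts segments and joins their bodies back into one text. The helper names are hypothetical, and using ordinal as the document-order key is an assumption, since the field's meaning is not spelled out here.

```python
from typing import Dict, List


def segments_in_order(reply: Dict) -> List[Dict]:
    """Return the reply's segments sorted by ordinal (assumed to be document order)."""
    return sorted(reply["segments"], key=lambda seg: seg["ordinal"])


def document_text(reply: Dict, separator: str = "\n\n") -> str:
    """Join the segment bodies into a single string, in ordinal order."""
    return separator.join(seg["body"] for seg in segments_in_order(reply))
```

The same helpers also work on the single-segment reply, since it has the same shape with a one-element segments array.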

GET /documents/{documentid}/segments/{segmentid} - Get a single segment

Response: Same shape as above (SegmentsReply) with a single-element segments array.

Debug API (/debug)

GET /debug/jobs/{job_id} - Detailed job debug info

Returns extensive job information including correlation ID, processing stage, options, and document details. Useful for troubleshooting.

Response (untyped dict):

{
  "job_id": "uuid",
  "correlation_id": "uuid",
  "status": "COMPLETED",
  "processing_stage": "COMPLETED",
  "progress": 100,
  "created": "iso-datetime",
  "updated": "iso-datetime",
  "completed": "iso-datetime",
  "processing_duration_seconds": 13.0,
  "total_time_seconds": 6,
  "document": {
    "id": "uuid",
    "filename": "file.pdf",
    "doctype": "pdf",
    "filesize": 12345,
    "status": "PROCESSED"
  },
  "options": {"pdf_parsing_backend": "...", "article_size": "medium", ...},
  "process_images": false,
  "maxsize": 1500,
  "article_size": "medium|null",
  "usetags": [],
  "maxtags": 1,
  "callback_url": null,
  "error_message": "",
  "ip": "127.0.0.1"
}

GET /debug/health - Service health summary

Returns counts of pending, processing, and recently failed jobs.

GET /debug/health_with_details - Detailed service health

Same as /debug/health, but also includes the job IDs and timestamps of each pending, processing, or failed job.