API Schema Reference

Keep this file up to date. When any endpoint, parameter, or response shape changes, update this document accordingly.

All endpoints require JWT authentication via the `Authorization: Bearer <token>` header. The token payload must contain the claim `{"role": "admin"}`.
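
As a client-side sanity check, the sketch below builds the header and inspects a token's payload for the admin role before any request is sent. It uses only the Python standard library and does not verify the signature; the helper names (`jwt_payload`, `has_admin_role`, `auth_headers`) are illustrative and not part of the service.

```python
import base64
import json


def jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload_b64 = parts[1]
    # base64url segments omit padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def has_admin_role(token: str) -> bool:
    """Return True if the token payload carries {"role": "admin"}."""
    return jwt_payload(token).get("role") == "admin"


def auth_headers(token: str) -> dict:
    """Build the Authorization header that every endpoint expects."""
    return {"Authorization": f"Bearer {token}"}
```

The server remains the authority on whether a token is valid; this check only catches an obviously wrong token early.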

Jobs API (`/jobs`)

POST `/jobs` - Submit a document for processing

Accepts a multipart form upload. Starts asynchronous document segmentation.

Form parameters:

Parameter	Type	Default	Description
`uploadfile`	file	required	The document to process
`doctype`	string	`""` (auto-detect)	Document type: `pdf`, `docx`, `pptx`, `xlsx`, `txt`, `html`, `md`
`article_size`	string	`null`	Named article size selector: `small`, `medium`, `large`, `xlarge`
`maxsize`	int	`SEGMENTER_MAX_ARTICLE_SIZE` (2000)	Legacy numeric article size override (counted in non-whitespace characters by default)
`usetags`	string	`"[]"`	JSON array of tag strings for LLM tagging
`maxtags`	int	`1`	Maximum number of tags per segment
`callbackurl`	string	`null`	URL to POST progress/results to
`process_images`	bool	`null`	Legacy image handling toggle
`include_images`	bool	`null`	Preferred image toggle (overrides `process_images`)
`mode`	string	`null`	Parsing mode selector (e.g. `agentic`)
`pdf_parsing_backend`	string	`null`	Backend for PDF parsing (e.g. `pymupdf`, `docling`, `agentic`)
`extra_instructions`	string	`null`	Extra instructions for agentic mode prompt
`enable_ocr`	bool	`false`	Enable OCR text extraction
`ocr_backend`	string	`null`	OCR backend to use
`force_full_page_ocr`	bool	`false`	Force OCR on full page images

`article_size` mappings:

Name	`maxsize` (non-whitespace characters)
`small`	750
`medium`	1500
`large`	5000
`xlarge`	9000

Precedence for the maximum article size, highest first: usetags override, then `article_size`, then the `maxsize` form parameter, then the environment variable default (`SEGMENTER_MAX_ARTICLE_SIZE`).
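
The mapping and precedence above can be sketched as a small resolver. This is an illustration, not the service's code: the function name `resolve_maxsize` is hypothetical, the usetags override is left out because this document does not describe how it works, and raising on an unknown preset name is an assumption.

```python
import os
from typing import Mapping, Optional

# Preset names and sizes taken from the article_size table above.
ARTICLE_SIZE_PRESETS = {"small": 750, "medium": 1500, "large": 5000, "xlarge": 9000}
DEFAULT_MAX_ARTICLE_SIZE = 2000  # documented default of SEGMENTER_MAX_ARTICLE_SIZE


def resolve_maxsize(
    article_size: Optional[str] = None,
    maxsize: Optional[int] = None,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Pick the effective max article size: preset > numeric override > env default."""
    env = os.environ if env is None else env
    if article_size is not None:
        if article_size not in ARTICLE_SIZE_PRESETS:
            # Assumption: unknown names are rejected; the real behavior is not documented.
            raise ValueError(f"unknown article_size: {article_size!r}")
        return ARTICLE_SIZE_PRESETS[article_size]
    if maxsize is not None:
        return maxsize
    return int(env.get("SEGMENTER_MAX_ARTICLE_SIZE", DEFAULT_MAX_ARTICLE_SIZE))
```

For example, `resolve_maxsize("large", 1234, env={})` yields 5000 because the named preset wins over the numeric override.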

Response (ProcessDocReplySchema):

{
  "success": true,
  "jobid": "uuid",
  "documentid": "uuid"
}
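
When preparing a submission, the non-file form fields can be assembled as in the sketch below. It covers only a subset of the parameters; the helper name `build_job_form` is hypothetical, and sending booleans as the strings `true`/`false` is an assumption about how the form is parsed. Note that `usetags` is sent as a JSON-encoded array string, as the parameter table specifies.

```python
import json
from typing import Optional, Sequence


def build_job_form(
    doctype: str = "",
    article_size: Optional[str] = None,
    usetags: Sequence[str] = (),
    maxtags: int = 1,
    callbackurl: Optional[str] = None,
    include_images: Optional[bool] = None,
    enable_ocr: bool = False,
) -> dict:
    """Return the text form fields for POST /jobs (the file part is added separately)."""
    fields = {
        "doctype": doctype,                    # "" asks the service to auto-detect
        "usetags": json.dumps(list(usetags)),  # JSON array of tag strings
        "maxtags": str(maxtags),
        "enable_ocr": "true" if enable_ocr else "false",  # assumed boolean encoding
    }
    # Optional parameters default to null, so omit them unless set.
    if article_size is not None:
        fields["article_size"] = article_size
    if callbackurl is not None:
        fields["callbackurl"] = callbackurl
    if include_images is not None:
        fields["include_images"] = "true" if include_images else "false"
    return fields
```

With a multipart-capable HTTP client, these fields would be sent together with the document under the `uploadfile` part and the `Authorization` header.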

GET `/jobs/` - List all jobs

Response (JobsReplySchema):

{
  "items": [<JobSchema>, ...]
}

GET `/jobs/{jobid}` - Get job details

Response (JobSchema):

{
  "id": "uuid",
  "ip": "127.0.0.1",
  "documentid": "uuid",
  "status": "NEW|PROCESSING|COMPLETED|FAILED|CANCELED",
  "created": "datetime",
  "updated": "datetime",
  "completed": "datetime|null",
  "maxsize": 1500,
  "article_size": "medium|null",
  "timeit": "int|null",
  "callbackurl": "string|null",
  "progress": 0
}

Note: `article_size` is stored on the job at creation time. It is `null` for legacy jobs and for custom numeric `maxsize` values that don't match a preset.
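
Because processing is asynchronous, a client without a `callbackurl` would typically poll `GET /jobs/{jobid}` until the job settles. The sketch below assumes that COMPLETED, FAILED, and CANCELED are the terminal statuses (inferred from their names), and takes the HTTP call as an injected `fetch_job` callable so it stays independent of any particular client library. `wait_for_job` and its parameters are hypothetical.

```python
import time
from typing import Callable

# Assumption: these statuses never change again once reached.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELED"}


def wait_for_job(
    fetch_job: Callable[[str], dict],
    jobid: str,
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll fetch_job(jobid) until the job reaches a terminal status or time runs out."""
    deadline = clock() + timeout
    while True:
        job = fetch_job(jobid)  # expected to return a JobSchema-shaped dict
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {jobid} still {job['status']} after {timeout}s")
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the loop easy to exercise in tests without real waiting.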

PATCH `/jobs/{jobid}` - Cancel a job

Request body:

{"status": "CANCELED"}

Response: JobSchema (same as GET).

Only jobs with status NEW or PROCESSING can be canceled. Already-canceled jobs return success (idempotent). Jobs with other statuses return 406.

DELETE `/jobs/{jobid}` - Delete a job and related data

Deletes the job, its document, and all segments. Returns 204 No Content.

Cannot delete jobs with status NEW, PROCESSING, or PENDING (returns 406).
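
The cancel and delete rules can be summarized as a client-side pre-check, sketched below. The function names and the returned labels are illustrative; the service remains the authority and responds with 406 for disallowed transitions.

```python
CANCELABLE_STATUSES = {"NEW", "PROCESSING"}
UNDELETABLE_STATUSES = {"NEW", "PROCESSING", "PENDING"}


def expected_cancel_result(status: str) -> str:
    """Describe what PATCH /jobs/{jobid} with CANCELED should do for a job in `status`."""
    if status in CANCELABLE_STATUSES:
        return "canceled"
    if status == "CANCELED":
        return "already canceled"  # idempotent success
    return "rejected (406)"


def can_delete(status: str) -> bool:
    """Return True if DELETE /jobs/{jobid} should be accepted for a job in `status`."""
    return status not in UNDELETABLE_STATUSES
```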

Documents API (`/documents`)

GET `/documents/` - List all documents

Response (DocumentsReplySchema):

{
  "items": [<DocumentSchema>, ...]
}

GET `/documents/{documentid}` - Get document metadata

Response (DocumentSchema):

{
  "id": "uuid",
  "status": "NEW|PROCESSED|FAILED",
  "created": "datetime",
  "updated": "datetime",
  "filename": "string",
  "filesize": 12345,
  "mimetype": "string|null"
}

Note: This endpoint returns metadata only, NOT segments.

GET `/documents/{documentid}/segments` - Get all segments for a document

This is where the actual segmentation results live.

Response (SegmentsReply):

{
  "segments": [
    {
      "id": "uuid",
      "documentid": "uuid",
      "status": "NEW|PROCESSED",
      "body": "the segment text content",
      "pagenr": 1,
      "group": "string|null",
      "title": "string|null",
      "tags": ["tag1", "tag2"],
      "lang": "en",
      "created": "datetime",
      "updated": "datetime",
      "timeit": 0,
      "ordinal": 0
    }
  ]
}
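
A common use of this response is to reassemble the document text from its segments. The sketch below assumes that `ordinal` gives the reading order of segments within a document, which the schema suggests but does not state; `document_text` is a hypothetical helper.

```python
def document_text(segments_reply: dict, separator: str = "\n\n") -> str:
    """Join segment bodies from a SegmentsReply-shaped dict in ordinal order."""
    # Assumption: lower ordinal means earlier in the document.
    ordered = sorted(segments_reply["segments"], key=lambda seg: seg["ordinal"])
    return separator.join(seg["body"] for seg in ordered)
```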

GET `/documents/{documentid}/segments/{segmentid}` - Get a single segment

Response: Same shape as above (SegmentsReply) with a single-element segments array.

Debug API (`/debug`)

GET `/debug/jobs/{job_id}` - Detailed job debug info

Returns extensive job information including correlation ID, processing stage, options, and document details. Useful for troubleshooting.

Response (untyped dict):

{
  "job_id": "uuid",
  "correlation_id": "uuid",
  "status": "COMPLETED",
  "processing_stage": "COMPLETED",
  "progress": 100,
  "created": "iso-datetime",
  "updated": "iso-datetime",
  "completed": "iso-datetime",
  "processing_duration_seconds": 13.0,
  "total_time_seconds": 6,
  "document": {
    "id": "uuid",
    "filename": "file.pdf",
    "doctype": "pdf",
    "filesize": 12345,
    "status": "PROCESSED"
  },
  "options": {"pdf_parsing_backend": "...", "article_size": "medium", ...},
  "process_images": false,
  "maxsize": 1500,
  "article_size": "medium|null",
  "usetags": [],
  "maxtags": 1,
  "callback_url": null,
  "error_message": "",
  "ip": "127.0.0.1"
}

GET `/debug/health` - Service health summary

Returns counts of pending, processing, and recently failed jobs.

GET `/debug/health_with_details` - Detailed service health

Same as `/debug/health`, but includes job IDs and timestamps for each pending, processing, or failed job.
