Document extraction
Extract structured data from PDF files
Send any PDF (invoices, contracts, reports, SEC filings) and get structured JSON back. The API handles table detection, OCR for scanned documents, and progressive reading that stops once your schema is filled.
How it works
Send your PDF file
Upload the file directly or pass a URL that points to it. The API auto-detects the format.
Define your schema
Describe the fields you want as a JSON object that maps each field name to a type, such as "string" or "number". The API maps the document's content onto that structure.
Get structured JSON
Receive typed data with confidence scores and citations back to the source document.
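As a rough sketch of how a client might act on the confidence scores, the snippet below splits extracted fields into an accepted group and a needs-review group. The response shape (a `fields` object whose entries carry `value` and `confidence`), the sample values, and the 0.8 threshold are all assumptions for illustration; this page does not document the actual response format.

```python
from typing import Any

# Hypothetical response shape and made-up values; the real API response may differ.
SAMPLE_RESULT = {
    "fields": {
        "vendor": {"value": "Acme Corp", "confidence": 0.97},
        "total": {"value": 1280.5, "confidence": 0.62},
        "date": {"value": "2024-03-01", "confidence": 0.91},
    }
}


def split_by_confidence(result: dict[str, Any], threshold: float = 0.8):
    """Return (accepted, needs_review) dicts mapping field name -> value."""
    accepted, needs_review = {}, {}
    for name, field in result.get("fields", {}).items():
        # A missing score is treated as zero confidence so the field gets reviewed.
        score = field.get("confidence", 0.0)
        target = accepted if score >= threshold else needs_review
        target[name] = field.get("value")
    return accepted, needs_review


accepted, needs_review = split_by_confidence(SAMPLE_RESULT)
print("accepted:", accepted)
print("needs review:", needs_review)
```

A threshold like this is a policy choice: a stricter value sends more fields to manual review, a looser one accepts more automatically.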
Example request
curl -X POST https://dev.thedrive.ai/api/v1/extract \
  -H "X-API-Key: your_key" \
  -F "file=@document.pdf" \
  -F 'schema={"vendor": "string", "total": "number", "date": "string"}'
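For callers who prefer Python, here is a minimal sketch of the same request using only the standard library. The URL, the `X-API-Key` header, and the `file` and `schema` form fields come from the curl example above; the helper name `build_extract_request` is hypothetical, and the assumption that the response body is JSON follows from the page's description rather than from a documented contract.

```python
import json
import urllib.request
import uuid

# Endpoint taken from the curl example above.
EXTRACT_URL = "https://dev.thedrive.ai/api/v1/extract"


def build_extract_request(pdf_bytes: bytes, schema: dict, api_key: str,
                          filename: str = "document.pdf") -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST, like `curl -F`."""
    boundary = uuid.uuid4().hex
    # Part 1: the PDF itself, under the form field name "file".
    file_head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    # Part 2: the schema, serialized as a JSON string in the "schema" field.
    schema_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="schema"\r\n\r\n'
        f"{json.dumps(schema)}\r\n"
    ).encode()
    closing = f"--{boundary}--\r\n".encode()
    body = file_head + pdf_bytes + b"\r\n" + schema_part + closing
    return urllib.request.Request(
        EXTRACT_URL,
        data=body,
        method="POST",
        headers={
            "X-API-Key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Placeholder bytes stand in for a real PDF read with open(path, "rb").
request = build_extract_request(
    b"%PDF-1.4 placeholder",
    {"vendor": "string", "total": "number", "date": "string"},
    "your_key",
)
print(request.get_method(), request.full_url)

# To actually send it (assumes a valid key and a JSON response body):
# with urllib.request.urlopen(request) as resp:
#     result = json.load(resp)
```

Building the request separately from sending it keeps the multipart encoding easy to inspect and test without network access.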
PDF processing features
Table-aware parsing
Rows and columns stay structured. Tables aren't collapsed into garbled text.
OCR + vision proofreading
Scanned PDFs go through OCR, then a vision model checks the OCR output against the original page image.
Progressive reading
Documents with 1,000+ pages return quickly because the API stops reading once your schema is filled.
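The early-stop idea can be illustrated with a short sketch. This is not the service's actual implementation, which the page does not describe; `read_until_filled` and `toy_extractor` are hypothetical names, and the toy extractor simply parses "key: value" lines so the example can run on its own.

```python
from typing import Callable, Iterable


def read_until_filled(pages: Iterable[str], schema_fields: set[str],
                      extract_from_page: Callable[[str], dict]) -> tuple[dict, int]:
    """Scan pages in order and stop once every schema field has a value.

    Returns the collected fields and the number of pages actually read.
    """
    found: dict = {}
    pages_read = 0
    for page in pages:
        pages_read += 1
        for name, value in extract_from_page(page).items():
            # Keep the first value seen for each requested field.
            if name in schema_fields and name not in found:
                found[name] = value
        if schema_fields.issubset(found):
            break  # schema filled: the remaining pages are never read
    return found, pages_read


def toy_extractor(page: str) -> dict:
    """Stand-in extractor: treats each 'key: value' line as a field."""
    out = {}
    for line in page.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            out[key.strip()] = value.strip()
    return out


pages = ["vendor: Acme", "total: 42", "date: 2024-03-01", "appendix: ignored"]
fields, pages_read = read_until_filled(pages, {"vendor", "total", "date"}, toy_extractor)
print(fields, "pages read:", pages_read)
```

Because `pages` can be any iterable, passing a generator means later pages are never even loaded once the schema is complete.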
Document structure mapping
The API auto-generates a table of contents for the document and navigates directly to the relevant pages.
Sandboxed computation
Use the analyze endpoint to calculate totals, growth rates, and ratios from PDF data.
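The analyze endpoint's request format is not shown on this page, so the sketch below only illustrates the kind of arithmetic involved, run locally on values that have already been extracted. The function names and the sample figures are made up for illustration.

```python
def total(values: list[float]) -> float:
    """Sum of line items."""
    return sum(values)


def growth_rate(previous: float, current: float) -> float:
    """Relative change from previous to current: (current - previous) / previous."""
    if previous == 0:
        raise ValueError("growth rate is undefined when the previous value is 0")
    return (current - previous) / previous


def ratio(numerator: float, denominator: float) -> float:
    """Simple ratio, e.g. debt to equity."""
    if denominator == 0:
        raise ValueError("ratio is undefined when the denominator is 0")
    return numerator / denominator


# Illustrative numbers only, not taken from any real document.
line_items = [10.0, 20.5, 4.5]
print("total:", total(line_items))
print("growth:", growth_rate(previous=100.0, current=125.0))
print("ratio:", ratio(50.0, 200.0))
```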
Citations
Every extracted field includes a reference back to the page and location in the PDF.
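One way a client could use these references is to group fields by the page they were found on, so a reviewer can open each page once. The response shape here (a `citation` object with `page` and `bbox` keys on each field) and the sample values are assumptions for illustration; the page only states that each field carries a page and a location.

```python
from typing import Any

# Hypothetical response shape and made-up values; the real API response may differ.
SAMPLE_RESULT = {
    "fields": {
        "vendor": {"value": "Acme Corp", "citation": {"page": 1, "bbox": [72, 90, 210, 104]}},
        "total": {"value": 1280.5, "citation": {"page": 3, "bbox": [400, 610, 470, 624]}},
        "date": {"value": "2024-03-01", "citation": {"page": 1, "bbox": [400, 90, 480, 104]}},
    }
}


def fields_by_page(result: dict[str, Any]) -> dict[int, list[str]]:
    """Map each cited page number to the names of the fields found on it."""
    pages: dict[int, list[str]] = {}
    for name, field in result.get("fields", {}).items():
        citation = field.get("citation")
        if citation is None:
            continue  # skip fields that carry no citation in this sketch
        pages.setdefault(citation["page"], []).append(name)
    return pages


for page, names in sorted(fields_by_page(SAMPLE_RESULT).items()):
    print(f"page {page}: {', '.join(names)}")
```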
Start extracting from PDF files
Free tier includes 100 credits/month. No credit card required.