June 10, 2026
How to extract structured data from PDFs with Python in 2026
From pdfplumber to LLM-powered extraction — the approaches that work, the ones that don't, and when to use an API instead.
You have a PDF. You need structured data out of it — vendor name, invoice total, line items, dates. JSON, not text. Here are the approaches that actually work in 2026, when they break, and what to reach for instead.
Approach 1: pdfplumber + regex
The most common starting point. Parse the PDF, get text, write regex or string matching:
import re
import pdfplumber
with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text() or ""  # guard against a page with no text layer
    tables = page.extract_tables()
# Now write regex for every field you need...
total = re.search(r"Total:?\s*\$?([\d,]+\.?\d*)", text)
Works when: Documents have consistent layouts. One vendor, one format, predictable structure.
Breaks when: You process invoices from 50 different vendors, documents are scanned (pdfplumber reads the PDF's text layer and does no OCR), tables span pages, or pdfplumber merges or splits columns incorrectly. You end up maintaining a set of regexes per vendor, and that doesn't scale.
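The regex above gives you a match object, not a number. As a minimal sketch of the step that usually follows, here is one way to turn the capture into a Decimal. The helper name parse_total is made up for this example, and it assumes US-style amounts (comma as thousands separator, dot as decimal point).

```python
import re
from decimal import Decimal
from typing import Optional

# Same pattern as in the snippet above; the capture group holds the digits.
TOTAL_RE = re.compile(r"Total:?\s*\$?([\d,]+\.?\d*)")


def parse_total(text: Optional[str]) -> Optional[Decimal]:
    """Return the first 'Total' amount found in text, or None."""
    if not text:  # extracted text may be empty or None for image-only pages
        return None
    match = TOTAL_RE.search(text)
    if match is None:
        return None
    digits = match.group(1).replace(",", "")  # drop thousands separators
    if not digits:  # the pattern can match a lone comma
        return None
    return Decimal(digits)


print(parse_total("Invoice 1042\nTotal: $1,550.00"))
```

Decimal rather than float keeps cents exact, which matters as soon as you compare the parsed total against a sum of line items.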
Approach 2: Tesseract OCR + LLM
For scanned documents, add an OCR layer:
import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open("scan.png"))
# Feed OCR text to GPT/Claude with a prompt
# Parse the LLM response as JSON
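Works when: Scans are clean, high-resolution, and contain printed text.
Breaks when: The input includes handwriting, stamps, low-quality phone photos, or faded text. OCR misreads characters ("$1,550" becomes "$1,5S0"), and the LLM then works from bad input. image_to_string also returns no confidence scores, so nothing tells you when the text is wrong. (pytesseract.image_to_data does report per-word confidence, but that signal is lost unless you carry it through the pipeline yourself.)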
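One partial mitigation, if you call pytesseract.image_to_data instead of image_to_string: Tesseract reports a confidence value per word, and you can flag the shaky ones before they reach the LLM. The sketch below is a plain function over that output. The helper name and the threshold of 60 are arbitrary choices for illustration, and the sample dictionary is invented to mirror the "$1,5S0" misread above.

```python
from typing import Any, Dict, List, Tuple


def low_confidence_words(
    data: Dict[str, List[Any]], threshold: float = 60.0
) -> List[Tuple[str, float]]:
    """Return (word, confidence) pairs whose confidence is below threshold.

    `data` is expected to have the shape returned by
    pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT):
    parallel lists under the keys "text" and "conf".
    """
    flagged = []
    for word, conf in zip(data["text"], data["conf"]):
        score = float(conf)  # conf may arrive as int, float, or str
        if score < 0 or not word.strip():
            continue  # -1 marks layout rows (blocks, lines) with no word
        if score < threshold:
            flagged.append((word, score))
    return flagged


# Invented sample in the same shape, so this runs without Tesseract installed.
sample = {"text": ["", "Total:", "$1,5S0"], "conf": ["-1", 96, 41.5]}
print(low_confidence_words(sample))
```

This only tells you which words the OCR engine was unsure about. It says nothing about whether the LLM later read the right field from that text.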
Approach 3: send the PDF directly to an LLM
import anthropic
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    messages=[{
        "role": "user",
        "content": [
            # remaining source fields elided
            {"type": "document", "source": {"type": "base64", ...}},
            {"type": "text", "text": "Extract vendor, total, line items as JSON"}
        ]
    }]
)
Works when: Simple documents, few pages, well-formatted. Great for prototyping.
Breaks when: Filings run to 500 pages and exceed the context window. Tables have complex layouts. The answer requires arithmetic, which LLMs can get wrong while sounding confident. A plain prompt also gives you no typed output enforcement, no confidence scores, and no citations. And you pay full token cost for every page, even if the answer is on page 3.
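If you stay with this approach, you have to enforce types yourself after the model replies. Here is a minimal sketch of that check using only the standard library. The EXPECTED_TYPES mapping and the parse_extraction helper are hypothetical names for this example, and it assumes the model returned a bare JSON object with no surrounding prose.

```python
import json
from typing import Any, Dict

# Hypothetical schema for illustration: field name -> accepted Python type(s).
EXPECTED_TYPES = {
    "vendor": str,
    "total": (int, float),
    "line_items": list,
}


def parse_extraction(raw: str) -> Dict[str, Any]:
    """Parse a model's JSON reply and check field presence and types."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for field, expected in EXPECTED_TYPES.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for {field}: {type(value).__name__}")
    return data


reply = '{"vendor": "Acme", "total": 1550.0, "line_items": []}'
print(parse_extraction(reply)["total"])
```

A check like this catches a total that came back as the string "1,550". It cannot catch a total that is well-typed but wrong, which is why confidence scores and citations still matter.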
Approach 4: schema-based extraction API
Define what you need. Send any file. Get typed JSON back:
from thedriveai import TheDriveAI
client = TheDriveAI(api_key="tda_live_...")
result = client.extract(
    file="invoice.pdf",
    schema={
        "vendor": {"type": "string", "description": "Company name"},
        "total": {"type": "number", "description": "Total amount due"},
        "line_items": {"type": "array", "description": "All items with amount"},
    },
)
print(result.data)        # typed fields
print(result.confidence)  # per-field confidence
print(result.citations)   # source text for each field
What you get: Typed output matching your schema. Confidence scores per field. Source citations. Works on scanned documents, spreadsheets, images, websites — 107+ formats. Handles 1000+ page documents. 1 credit per page.
When you need computed answers (cross-checks, growth rates, verification), switch to /analyze — same schema, but it reasons over the document instead of just reading it.
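To make "cross-check" concrete, here is a sketch of the simplest one: do the line items add up to the stated total? It runs locally on values you have already extracted and is not a description of what /analyze does internally. The function name, the string inputs, and the one-cent tolerance are all assumptions made for this example.

```python
from decimal import Decimal
from typing import Iterable


def totals_match(
    line_amounts: Iterable[str], stated_total: str, tolerance: str = "0.01"
) -> bool:
    """Return True if the line amounts sum to the stated total.

    Amounts are passed as strings and converted to Decimal so that
    cents are compared exactly rather than as binary floats.
    """
    computed = sum((Decimal(a) for a in line_amounts), Decimal("0"))
    return abs(computed - Decimal(stated_total)) <= Decimal(tolerance)


print(totals_match(["1000.00", "550.00"], "1550.00"))
```

A failed check does not tell you which side is wrong. It only tells you the document, or the extraction, deserves a second look.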
Which approach should you use?
Build it yourself when
- Single document format, single vendor
- Full control over parsing logic matters
- You have time to maintain the pipeline
Use an API when
- Multiple formats, vendors, or document types
- You need confidence scores and citations
- Scanned documents are in the mix
- You're building an AI agent that encounters files
Try the playground with your own PDF — no signup needed for the first few calls.