How to extract structured data from PDFs with Python in 2026

You have a PDF. You need structured data out of it — vendor name, invoice total, line items, dates. JSON, not text. Here are the approaches that actually work in 2026, when they break, and what to reach for instead.

Approach 1: pdfplumber + regex

The most common starting point. Parse the PDF, extract the text, then pull fields out with regex or string matching:

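import re

import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text() or ""   # empty when the page has no text layer (e.g. a scan)
    tables = page.extract_tables()     # list of tables, each a list of rows

# Now write a regex for every field you need...
match = re.search(r"Total:?\s*\$?([\d,]+\.?\d*)", text)
total = match.group(1) if match else None  # captured string such as "1,550.00", or None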

Works when: Documents have consistent layouts. One vendor, one format, predictable structure.

Breaks when: You process invoices from 50 different vendors. Scanned documents. Tables that span pages. Columns that pdfplumber merges or splits incorrectly. You end up maintaining regex per vendor — that doesn't scale.
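To make the maintenance cost concrete, here is a minimal sketch of what a per-vendor setup tends to look like. The vendor names, their label formats, and the extract_total helper are all hypothetical, invented for illustration; only the first pattern comes from the snippet above.

```python
import re
from decimal import Decimal, InvalidOperation

# Hypothetical vendors and label formats, for illustration only.
# Every new vendor layout means another entry here (and another test).
VENDOR_TOTAL_PATTERNS = {
    "acme": re.compile(r"Total:?\s*\$?([\d,]+\.?\d*)"),
    "globex": re.compile(r"Amount Due\s*\$?([\d,]+\.?\d*)"),
    "initech": re.compile(r"BALANCE\s+USD\s+([\d,]+\.?\d*)"),
}


def extract_total(vendor, text):
    """Return the invoice total as a Decimal, or None if it cannot be found."""
    pattern = VENDOR_TOTAL_PATTERNS.get(vendor)
    if pattern is None or not text:
        return None  # unknown vendor or empty page text
    match = pattern.search(text)
    if match is None:
        return None  # layout changed, or the label is missing
    try:
        # Strip thousands separators before converting.
        return Decimal(match.group(1).replace(",", ""))
    except InvalidOperation:
        return None  # matched something that is not a number, e.g. a lone comma


acme_total = extract_total("acme", "Invoice 42\nTotal: $1,550.00")
unknown_total = extract_total("umbrella", "Total: $10.00")
```

Returning None instead of raising keeps one bad document from stopping a batch, but it also means every caller has to handle the "no total found" case explicitly.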

Approach 2: Tesseract OCR + LLM

For scanned documents, add an OCR layer:

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open("scan.png"))
# Feed OCR text to GPT/Claude with a prompt
# Parse the LLM response as JSON

Works when: Scans are clean, high-resolution, printed text.

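Breaks when: Handwriting, stamps, low-quality phone photos, faded text. OCR misreads characters ("$1,550" becomes "$1,5S0") and the LLM gets bad input. Tesseract's word-level confidence is lost once you pass plain text to the LLM, so the extracted fields carry no confidence scores and you don't know when they're wrong.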
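The "parse the LLM response as JSON" step is where bad OCR input tends to surface, so it helps to parse defensively and sanity-check amounts. A minimal sketch, assuming the model replies with a single JSON object possibly wrapped in extra prose; parse_llm_json and looks_like_amount are hypothetical helper names, and the amount pattern only covers US-style formatting.

```python
import json
import re

# Accepts "$1,550", "1550.00", "$12.50"; rejects OCR damage such as "$1,5S0".
AMOUNT_RE = re.compile(r"\$?(\d{1,3}(,\d{3})+|\d+)(\.\d{2})?")


def parse_llm_json(reply):
    """Return the JSON object embedded in an LLM reply, or None if parsing fails."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        return None  # no object-like span in the reply
    try:
        return json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None  # the span between the braces was not valid JSON


def looks_like_amount(value):
    """True if the string is a plausibly formatted monetary amount."""
    return AMOUNT_RE.fullmatch(value.strip()) is not None


reply = 'Sure! {"vendor": "Acme", "total": "$1,5S0"} Hope that helps.'
data = parse_llm_json(reply)
suspect_total = data is not None and not looks_like_amount(data["total"])
```

A format check like this catches "$1,5S0" but not a misread that is still a valid number ("$1,550" read as "$1,650"), so it reduces silent errors rather than eliminating them.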

Approach 3: send the PDF directly to an LLM

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    messages=[{
        "role": "user",
        "content": [
            {"type": "document", "source": {"type": "base64", ...}},
            {"type": "text", "text": "Extract vendor, total, line items as JSON"}
        ]
    }]
)

Works when: Simple documents, few pages, well-formatted. Great for prototyping.

Breaks when: 500-page filings (they exceed the context window). Tables with complex layouts. Math: LLMs hallucinate calculations. With a plain prompt like this one, nothing enforces typed output, and you get no confidence scores and no citations. And you're paying full token cost for every page, even if the answer is on page 3.
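Since the arithmetic is the part an LLM is most likely to get wrong, one cheap safeguard is to recompute it yourself after extraction. A minimal sketch, assuming the parsed response is a dict with a "total" and a list of "line_items" that each carry an "amount", and that the total is a plain sum of the line items (no tax, shipping, or discounts). The field names and the check_totals helper are assumptions, not something the prompt above guarantees.

```python
from decimal import Decimal


def check_totals(extracted, tolerance=Decimal("0.01")):
    """Compare the stated total with the sum of line-item amounts."""
    # Go through str() so floats like 5.5 become Decimal("5.5") rather than
    # a long binary expansion.
    items_sum = sum(
        (Decimal(str(item["amount"])) for item in extracted["line_items"]),
        Decimal("0"),
    )
    total = Decimal(str(extracted["total"]))
    return {
        "items_sum": items_sum,
        "total": total,
        "ok": abs(items_sum - total) <= tolerance,
    }


consistent = check_totals(
    {"total": 25.5, "line_items": [{"amount": 20.0}, {"amount": 5.5}]}
)
inconsistent = check_totals(
    {"total": 35.5, "line_items": [{"amount": 20.0}, {"amount": 5.5}]}
)
```

A failed check does not tell you which number is wrong, only that the document needs a second look; real invoices with tax or discounts would need those fields added to the comparison.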

Approach 4: schema-based extraction API

Define what you need. Send any file. Get typed JSON back:

from thedriveai import TheDriveAI

client = TheDriveAI(api_key="tda_live_...")

result = client.extract(
    file="invoice.pdf",
    schema={
        "vendor": {"type": "string", "description": "Company name"},
        "total": {"type": "number", "description": "Total amount due"},
        "line_items": {"type": "array", "description": "All items with amount"},
    },
)

print(result.data)        # typed fields
print(result.confidence)  # per-field confidence
print(result.citations)   # source text for each field

What you get: Typed output matching your schema. Confidence scores per field. Source citations. Works on scanned documents, spreadsheets, images, websites — 107+ formats. Handles 1000+ page documents. 1 credit per page.

When you need computed answers (cross-checks, growth rates, verification), switch to /analyze — same schema, but it reasons over the document instead of just reading it.

Which approach should you use?

Build it yourself when

Single document format, single vendor
Full control over parsing logic matters
You have time to maintain the pipeline

Use an API when

Multiple formats, vendors, or document types
You need confidence scores and citations
Scanned documents are in the mix
You're building an AI agent that encounters files

Try the playground with your own PDF — no signup needed for the first few calls.