June 10, 2026

How to extract structured data from PDFs with Python in 2026

From pdfplumber to LLM-powered extraction — the approaches that work, the ones that don't, and when to use an API instead.

You have a PDF. You need structured data out of it — vendor name, invoice total, line items, dates. JSON, not text. Here are the approaches that actually work in 2026, when they break, and what to reach for instead.

Approach 1: pdfplumber + regex

The most common starting point. Parse the PDF, get text, write regex or string matching:

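import re

import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text() or ""   # guard against pages with no text layer
    tables = page.extract_tables()     # list of tables, each a list of rows

# Now write regex for every field you need...
match = re.search(r"Total:?\s*\$?(\d[\d,]*(?:\.\d+)?)", text)
total = float(match.group(1).replace(",", "")) if match else None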

Works when: Documents have consistent layouts. One vendor, one format, predictable structure.

Breaks when: You process invoices from 50 different vendors. Documents are scanned, so there is no text layer to extract. Tables span pages. pdfplumber merges or splits columns incorrectly. You end up maintaining a set of regexes per vendor, and that doesn't scale.
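To make the maintenance cost concrete, here is a minimal sketch of what the regex route tends to grow into. The helper names (`parse_amount`, `find_total`, `rows_to_line_items`), the `description`/`amount` keys, and the expected header row are illustrative assumptions, not part of pdfplumber. The only thing taken from the library is the shape of the `extract_tables()` output, where each table is a list of rows and each cell is a string or `None`.

```python
import re
from decimal import Decimal, InvalidOperation

# Hypothetical pattern for one vendor's layout; other vendors need their own.
# The \b keeps "Subtotal" from matching as "Total".
TOTAL_RE = re.compile(r"\bTotal:?\s*\$?(\d[\d,]*(?:\.\d+)?)", re.IGNORECASE)


def parse_amount(raw):
    """Turn '$1,550.00' into Decimal('1550.00'); return None if it is not a number."""
    if raw is None:
        return None
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def find_total(text):
    """Return the first 'Total' amount found in the page text, or None."""
    match = TOTAL_RE.search(text or "")
    return parse_amount(match.group(1)) if match else None


def rows_to_line_items(table):
    """Map one extracted table (header row first) to a list of dicts.

    Assumes the header contains 'Description' and 'Amount' columns.
    """
    if not table:
        return []
    header = [(cell or "").strip().lower() for cell in table[0]]
    try:
        desc_idx = header.index("description")
        amount_idx = header.index("amount")
    except ValueError:
        return []  # unexpected layout: this is where per-vendor code creeps in
    items = []
    for row in table[1:]:
        amount = parse_amount(row[amount_idx])
        if amount is None:
            continue  # skip blank rows, merged cells, and label-only rows
        items.append({"description": (row[desc_idx] or "").strip(), "amount": amount})
    return items
```

Every branch that returns `None` or an empty list is a place where a new vendor layout would need its own handling, which is the scaling problem described above.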

Approach 2: Tesseract OCR + LLM

For scanned documents, add an OCR layer:

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open("scan.png"))
# Feed OCR text to GPT/Claude with a prompt
# Parse the LLM response as JSON

Works when: Scans are clean, high-resolution, printed text.

Breaks when: Handwriting, stamps, low-quality phone photos, faded text. OCR misreads characters ("$1,550" becomes "$1,5S0") and the LLM gets bad input. image_to_string returns no confidence scores, so you don't know when it's wrong.
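One partial mitigation, sketched here under assumptions: Tesseract does report a per-word confidence when called through `pytesseract.image_to_data` rather than `image_to_string`. The helper below flags words under a threshold so they can be reviewed before the text reaches the LLM. The function names, the threshold of 60, and the sample data are illustrative. The dict shape (parallel `text` and `conf` lists, with `-1` for non-word boxes) follows what `image_to_data(..., output_type=Output.DICT)` returns.

```python
def low_confidence_words(ocr_data, threshold=60.0):
    """Return (word, confidence) pairs whose OCR confidence is below threshold."""
    flagged = []
    for word, conf in zip(ocr_data["text"], ocr_data["conf"]):
        score = float(conf)  # conf may arrive as int, float, or str by version
        if score < 0 or not word.strip():
            continue  # -1 marks layout boxes (blocks, lines), not words
        if score < threshold:
            flagged.append((word, score))
    return flagged


def ocr_with_review_list(image_path, threshold=60.0):
    """Run Tesseract and return (full_text, flagged_words).

    Needs pytesseract, Pillow, and the tesseract binary installed.
    """
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    text = " ".join(w for w in data["text"] if w.strip())
    return text, low_confidence_words(data, threshold)


# Hand-written sample in the same shape, standing in for a real scan.
sample = {
    "text": ["", "Total:", "$1,5S0", "due"],
    "conf": [-1, 96, 41, 88],
}
flagged = low_confidence_words(sample)
```

Word-level confidence is only a heuristic: a misread character can still score high, so this narrows the problem without removing it.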

Approach 3: send the PDF directly to an LLM

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    messages=[{
        "role": "user",
        "content": [
            {"type": "document", "source": {"type": "base64", ...}},
            {"type": "text", "text": "Extract vendor, total, line items as JSON"}
        ]
    }]
)

Works when: Simple documents, few pages, well-formatted. Great for prototyping.

Breaks when: 500-page filings (they exceed the context window). Tables with complex layouts. Math, because LLMs hallucinate calculations. No typed output enforcement. No confidence scores. No citations. And you're paying full token cost for every page, even if the answer is on page 3.
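Two of these gaps, untyped output and unreliable arithmetic, can be narrowed with checks on your side of the call. The sketch below is an illustration under assumptions rather than anything the Anthropic SDK provides: it assumes the model was asked to reply with a JSON object holding `vendor`, `total`, and `line_items` (each item with an `amount`), and the function name and error messages are made up for this example.

```python
import json
from decimal import Decimal


def _is_number(value):
    """True for int or Decimal, but not bool (which is a subclass of int)."""
    return isinstance(value, (int, Decimal)) and not isinstance(value, bool)


def validate_extraction(raw_reply):
    """Parse an LLM reply and return (data, problems). Empty problems means it passed."""
    # Models often wrap JSON in prose, so slice to the outermost braces.
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end <= start:
        return None, ["no JSON object found in reply"]
    try:
        # parse_float=Decimal avoids float rounding when summing money
        data = json.loads(raw_reply[start : end + 1], parse_float=Decimal)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]

    problems = []
    if not isinstance(data.get("vendor"), str):
        problems.append("vendor must be a string")
    total = data.get("total")
    if not _is_number(total):
        problems.append("total must be a number")
    items = data.get("line_items")
    if not isinstance(items, list):
        problems.append("line_items must be a list")

    # Recompute the sum ourselves instead of trusting the model's arithmetic.
    if not problems:
        amounts = [item.get("amount") for item in items if isinstance(item, dict)]
        if len(amounts) != len(items) or not all(_is_number(a) for a in amounts):
            problems.append("every line item needs a numeric amount")
        elif sum(amounts, Decimal(0)) != total:
            problems.append("line items do not add up to total")
    return data, problems
```

A sum mismatch is not proof of an error, since tax or shipping may sit outside the line items, but it is a cheap signal that a document deserves a second look.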

Approach 4: schema-based extraction API

Define what you need. Send any file. Get typed JSON back:

from thedriveai import TheDriveAI

client = TheDriveAI(api_key="tda_live_...")

result = client.extract(
    file="invoice.pdf",
    schema={
        "vendor": {"type": "string", "description": "Company name"},
        "total": {"type": "number", "description": "Total amount due"},
        "line_items": {"type": "array", "description": "All items with amount"},
    },
)

print(result.data)        # typed fields
print(result.confidence)  # per-field confidence
print(result.citations)   # source text for each field

What you get: Typed output matching your schema. Confidence scores per field. Source citations. Works on scanned documents, spreadsheets, images, websites — 107+ formats. Handles 1000+ page documents. 1 credit per page.

When you need computed answers (cross-checks, growth rates, verification), switch to /analyze — same schema, but it reasons over the document instead of just reading it.
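How you act on per-field confidence is up to you. One common pattern is to auto-accept high-confidence fields and route the rest to a person. The sketch below assumes `result.data` and `result.confidence` behave like plain dicts keyed by field name, with confidence as a float between 0 and 1. That shape, the 0.9 threshold, and the function name are assumptions for illustration, not documented behavior of the SDK.

```python
def split_by_confidence(data, confidence, threshold=0.9):
    """Split extracted fields into (accepted, needs_review) by confidence.

    Fields with no confidence entry are treated as needing review.
    """
    accepted, needs_review = {}, {}
    for field, value in data.items():
        score = confidence.get(field)
        if score is not None and score >= threshold:
            accepted[field] = value
        else:
            needs_review[field] = value
    return accepted, needs_review


# Hypothetical values standing in for result.data and result.confidence.
data = {"vendor": "Acme Corp", "total": 1550.0, "line_items": []}
confidence = {"vendor": 0.98, "total": 0.72}
accepted, needs_review = split_by_confidence(data, confidence)
```

Treating a missing score as "needs review" is the conservative choice: a field the service could not score is handled the same way as one it scored low.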

Which approach should you use?

Build it yourself when

  • Single document format, single vendor
  • Full control over parsing logic matters
  • You have time to maintain the pipeline

Use an API when

  • Multiple formats, vendors, or document types
  • You need confidence scores and citations
  • Scanned documents are in the mix
  • You're building an AI agent that encounters files

Try the playground with your own PDF — no signup needed for the first few calls.
