How to extract tables from complex PDFs in 2026

Table extraction from PDFs is a solved problem — until you encounter a real-world document. Financial statements with merged cells. Balance sheets spanning 3 pages. Tables with nested sub-headers and footnotes. Legal documents with indented clause tables.

Simple tables work everywhere. Complex tables break everything. Here's what actually works in 2026.

Why PDF table extraction is hard

PDFs have no concept of "table." They're a visual format — text is positioned at coordinates on a canvas. When you see a table, you're seeing:

Text strings placed at specific x,y coordinates
Lines drawn between coordinates (maybe — some tables have no visible borders)
No metadata saying "this is a cell" or "this row spans two columns"

Every table extraction tool is essentially guessing structure from visual position. That works for simple grids. It fails when:

Cells are merged: a header spanning 3 columns looks like orphaned text
Tables span pages: page 2 has no header row, just continuation data
There are no visible borders: alignment is the only structural cue
Headers are nested: "Q1 2024" sits under "Revenue", which sits under "Financial Results"
Footnotes sit inside tables: small text that disrupts row detection
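To make "guessing structure from visual position" concrete, here is a minimal sketch of the idea, not how any particular library implements it. It assumes each word has already been reduced to a hypothetical `(x, y, text)` tuple, and the tolerances `x_tol` and `y_tol` are made-up defaults.

```python
from typing import List, Tuple

# Hypothetical simplified word record: left x, baseline y, text.
Word = Tuple[float, float, str]


def cluster(values: List[float], tol: float) -> List[float]:
    """Return one anchor per group of coordinates lying within `tol` of each other."""
    anchors: List[float] = []
    last = None
    for v in sorted(values):
        if last is None or v - last > tol:
            anchors.append(v)  # gap is large enough: start a new row/column
        last = v
    return anchors


def nearest(anchors: List[float], v: float) -> int:
    """Index of the anchor closest to coordinate `v`."""
    return min(range(len(anchors)), key=lambda i: abs(anchors[i] - v))


def words_to_grid(words: List[Word], x_tol: float = 5.0, y_tol: float = 3.0) -> List[List[str]]:
    """Guess a grid purely from word positions."""
    cols = cluster([w[0] for w in words], x_tol)
    rows = cluster([w[1] for w in words], y_tol)
    grid = [["" for _ in cols] for _ in rows]
    for x, y, text in words:
        r, c = nearest(rows, y), nearest(cols, x)
        grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid


simple = [
    (50, 100, "Segment"), (200, 100, "Q1"), (300, 100, "Q2"),
    (50, 120, "Cloud"), (201, 120, "10"), (299, 121, "12"),
]
simple_grid = words_to_grid(simple)

# A header centred over Q1 and Q2 does not line up with either column.
with_merged_header = [(250, 80, "Revenue")] + simple
merged_grid = words_to_grid(with_merged_header)
```

On the simple grid this recovers two rows of three columns. Adding the centred "Revenue" header produces a fourth, mostly empty column, which is the orphaned-text failure described above.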

Approach 1: pdfplumber (rule-based)

import pdfplumber

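with pdfplumber.open("financial_statement.pdf") as pdf:
    page = pdf.pages[4]  # pages are zero-indexed, so this is page 5
    tables = page.extract_tables()
    # Returns a list of tables; each table is a list of rows,
    # and each row is a list of cell values (strings or None)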

Works on: Tables with visible borders, simple grids, single-page tables.

Fails on: Borderless tables under the default line-based detection (most financial statements; the text-based strategies help but need tuning), merged cells (spanned cells come back as None), and multi-page tables (each page is processed independently). You get fragmented data with missing headers from page 2 onward.

Accuracy on complex financial tables: ~55-65%
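One common workaround for the None values left by spanned header cells is to forward-fill them and then collapse stacked header rows into one label per column. The sketch below is a heuristic, not part of pdfplumber: the function names are hypothetical, and it assumes a None in a header row means "same span as the cell to the left", which will not hold for every layout.

```python
from typing import List, Optional

Row = List[Optional[str]]


def fill_spanned(header: Row) -> List[str]:
    """Replace None/blank cells with the nearest non-empty value to their left."""
    filled: List[str] = []
    current = ""
    for cell in header:
        if cell is not None and cell.strip():
            current = cell.strip()
        filled.append(current)
    return filled


def flatten_headers(header_rows: List[Row]) -> List[str]:
    """Combine stacked header rows into one label per column."""
    filled = [fill_spanned(row) for row in header_rows]
    labels: List[str] = []
    for parts in zip(*filled):
        unique: List[str] = []
        for part in parts:
            # Skip empty levels and repeats of the level directly above.
            if part and (not unique or unique[-1] != part):
                unique.append(part)
        labels.append(" / ".join(unique))
    return labels


# Shape of a two-level header as a rule-based extractor might return it.
header_rows = [
    ["Segment", "Revenue", None, "Costs"],
    [None, "Q1 2024", "Q2 2024", "Q1 2024"],
]
labels = flatten_headers(header_rows)
```

For this input the labels come out as "Segment", "Revenue / Q1 2024", "Revenue / Q2 2024" and "Costs / Q1 2024".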

Approach 2: Camelot (specialized table extraction)

import camelot

# Stream mode for borderless tables; pages is a 1-based string, so "5" is the fifth page
tables = camelot.read_pdf("financial_statement.pdf", pages="5", flavor="stream")
df = tables[0].df  # Returns pandas DataFrame

Works on: Borderless tables, where stream mode does better than pdfplumber, and single-page tables with consistent column spacing. Returns DataFrames directly.

Fails on: Same multi-page problem. Merged cells still break column alignment. Stream mode is sensitive to text spacing — slight variations in column positions cause misalignment. Requires per-document tuning of detection parameters.

Accuracy on complex financial tables: ~65-75%
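Since both rule-based tools process each page independently, multi-page tables have to be stitched by hand. A minimal sketch of that step follows, assuming each page's table has already been converted to a list of string rows and that the first row of the first page is the header. `stitch_pages` is a hypothetical helper, not a Camelot or pdfplumber function.

```python
from typing import List

Table = List[List[str]]


def stitch_pages(page_tables: List[Table]) -> Table:
    """Concatenate per-page tables into one, keeping a single header row."""
    page_tables = [t for t in page_tables if t]  # ignore pages with no rows
    if not page_tables:
        return []
    header = page_tables[0][0]
    stitched: Table = [header]
    for page_no, table in enumerate(page_tables, start=1):
        for row in table:
            if row == header:
                continue  # the original header or a repeat on a later page
            if len(row) != len(header):
                # Column drift between pages: better to stop than to misalign silently.
                raise ValueError(f"page {page_no}: expected {len(header)} columns, got {len(row)}")
            stitched.append(row)
    return stitched


pages = [
    [["Segment", "Q1", "Q2"], ["Cloud", "10", "12"]],     # first page, with header
    [["Devices", "7", "9"]],                              # continuation, no header
    [["Segment", "Q1", "Q2"], ["Services", "3", "4"]],    # continuation, header repeated
]
full_table = stitch_pages(pages)
```

Raising on a column-count mismatch is a deliberate choice here: when detection yields a different number of columns on a continuation page, appending the rows anyway would shift values into the wrong columns.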

Approach 3: AWS Textract tables

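import boto3

textract = boto3.client("textract")
# pdf_bytes: raw bytes of an image or a single-page PDF.
# Multi-page PDFs need the asynchronous start_document_analysis API (document in S3).
response = textract.analyze_document(
    Document={"Bytes": pdf_bytes},
    FeatureTypes=["TABLES"]
)
# Returns cells with row/column indices and confidence scores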

Works on: Scanned documents and tables with some merged cells. Detects cells with their row/column positions and returns a confidence score per cell.

Fails on: Still page-by-page — no automatic stitching of multi-page tables. Returns raw cells, not semantic data — you write the reconstruction code. Complex nested headers confuse the row/column assignment. Expensive at scale ($0.015/page).

Accuracy on complex financial tables: ~78-85%
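The reconstruction code mentioned above can be fairly short for the simple case. This sketch walks the `Blocks` list of an `analyze_document` response and rebuilds one table as a grid, following TABLE to CELL to WORD through the `CHILD` relationships. It ignores merged-cell and span information and confidence scores, and the sample data is hand-written in the shape of a response, not real Textract output.

```python
from typing import Dict, List


def child_ids(block: dict) -> List[str]:
    """Collect the Ids listed under a block's CHILD relationships."""
    ids: List[str] = []
    for rel in block.get("Relationships") or []:
        if rel["Type"] == "CHILD":
            ids.extend(rel["Ids"])
    return ids


def table_to_grid(blocks: List[dict], table: dict) -> List[List[str]]:
    """Rebuild one TABLE block as a row-major grid of cell text."""
    by_id: Dict[str, dict] = {b["Id"]: b for b in blocks}
    cells = [by_id[i] for i in child_ids(table) if by_id[i]["BlockType"] == "CELL"]
    if not cells:
        return []
    n_rows = max(c["RowIndex"] for c in cells)
    n_cols = max(c["ColumnIndex"] for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        words = [by_id[i]["Text"] for i in child_ids(cell)
                 if by_id[i]["BlockType"] == "WORD"]
        # RowIndex and ColumnIndex are 1-based in the response.
        grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
    return grid


# Hand-written sample in the shape of response["Blocks"].
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},  # empty cell
    {"Id": "w1", "BlockType": "WORD", "Text": "Segment"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Q1"},
    {"Id": "w3", "BlockType": "WORD", "Text": "2024"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Cloud"},
]
grid = table_to_grid(sample_blocks, sample_blocks[0])
```

With a real response you would pass `response["Blocks"]` and each block whose `BlockType` is `"TABLE"`. Stitching tables across pages remains a separate step.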

Approach 4: Vision model (send the page image directly)

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
# page_image: one PDF page rendered to PNG and base64-encoded beforehand
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,  # required by the Messages API
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": page_image}},
            {"type": "text", "text": "Extract the revenue table as JSON"}
        ]
    }]
)

Works on: Understands visual layout naturally. Handles merged cells, nested headers, borderless tables. Good accuracy on single-page tables.

Fails on: Multi-page tables still require manual stitching across pages. No typed output enforcement — LLM might return inconsistent JSON. No confidence scores. Expensive for large documents (full token cost per page image). Sometimes hallucinates cell values instead of reading them accurately.

Accuracy on complex financial tables: ~85-92%
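Because nothing enforces the shape of the model's JSON, it helps to validate the reply before using it. The sketch below assumes you asked for an array of objects with `segment` (string) and `revenue` (number) keys. Those key names and the `parse_revenue_rows` helper are illustrative assumptions, not something the API guarantees.

```python
import json
from typing import Any, Dict, List


def parse_revenue_rows(raw: str) -> List[Dict[str, Any]]:
    """Parse model output and check that every row has the expected fields and types."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON text
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of rows")
    rows: List[Dict[str, Any]] = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"row {i}: expected an object")
        segment = item.get("segment")
        revenue = item.get("revenue")
        if not isinstance(segment, str) or not segment:
            raise ValueError(f"row {i}: missing or non-string 'segment'")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(revenue, bool) or not isinstance(revenue, (int, float)):
            raise ValueError(f"row {i}: 'revenue' must be a number, got {revenue!r}")
        rows.append({"segment": segment, "revenue": float(revenue)})
    return rows


good = parse_revenue_rows('[{"segment": "Cloud", "revenue": 10.5}]')
```

A reply such as `[{"segment": "Cloud", "revenue": "10.5M"}]` raises a ValueError instead of flowing downstream as a string. Validation catches inconsistent structure only; it cannot detect a hallucinated value that happens to have the right type.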

Approach 5: Schema-based extraction with reasoning

# Don't extract the table; extract what you need FROM the table.
# Here `client` is the extraction API's SDK client, not the Anthropic client from Approach 4.
result = client.analyze(
    file="financial_statement.pdf",
    schema={
        "revenue_by_segment": {
            "type": "array",
            "description": "Each business segment and its quarterly revenue"
        },
        "total_revenue": {
            "type": "number",
            "description": "Total consolidated revenue"
        },
        "largest_segment": {
            "type": "string",
            "description": "Segment with highest revenue"
        }
    }
)

Key insight: You probably don't need the raw table. You need specific data from it. The /analyze endpoint navigates to the right pages, reads across page boundaries, and returns the answer you need — with reasoning traces showing exactly which cells were read.

Works on: Multi-page tables (reads across page boundaries). Complex headers. Merged cells. Any table layout. Returns typed JSON with citations.

Trade-off: Returns computed answers from the table, not the raw table structure itself. If you need the full table as a DataFrame, use approaches 1-4. If you need specific values or computed answers from the table, this is more accurate and handles complexity better.

Which approach to use

Scenario	Best approach
Simple single-page table, local processing	pdfplumber or Camelot
Need raw table structure as DataFrame	Camelot + manual stitching for multi-page
Scanned documents with tables	Textract or vision model
Need specific values from complex tables	Schema-based extraction (/extract)
Need computed answers across multi-page tables	Reasoning API (/analyze)
Compare table data across documents	Cross-analysis (/cross-analyze)

The paradigm shift: instead of extracting tables and then processing them, define what you need and let the API figure out how to get it. You skip the fragile table reconstruction step entirely.

Upload a financial statement and ask a question that spans a multi-page table — see the difference.