Multi-page tables in PDFs: why every extraction tool breaks (and how to fix it)

If you've built a document extraction pipeline, you've hit this: a table starts on page 5 and continues on page 6. Your parser returns two separate, broken tables. The header row is only on page 5. The totals row is only on page 6. Your structured output is garbage.

This is one of the most common failure modes in PDF extraction, and almost every tool handles it the same way: badly.

Why PDF table extraction breaks across pages

PDF is a visual format, not a semantic one. There's no concept of "this table continues on the next page." Each page is an independent canvas. When a tool such as pdfplumber or Textract processes a PDF, it works page by page:

Page 5: detects a table with 15 rows and a header
Page 6: detects a table with 10 rows and no header
Result: two separate tables instead of one 25-row table

The second table fragment might have shifted columns, repeated headers, or different formatting. Simple concatenation doesn't work because column alignment varies between pages.
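To make the problem concrete, here is a minimal Python sketch. The two fragments are hypothetical, hand-written stand-ins for what a page-by-page extractor might return (one list of rows per page); they are not output from any real tool, and the helper names `naive_concat` and `diagnose` are made up for illustration.

```python
from typing import Dict, List, Optional

Row = List[Optional[str]]
Table = List[Row]

# Hypothetical fragment from page 5: header plus two data rows.
page_5: Table = [
    ["Segment", "Q1", "Q2"],
    ["Cloud", "120", "135"],
    ["Hardware", "80", "75"],
]

# Hypothetical fragment from page 6: the header is repeated and one row
# picked up a spurious empty cell, so its columns are shifted.
page_6: Table = [
    ["Segment", "Q1", "Q2"],
    ["Services", None, "40", "45"],
    ["Total", "240", "255"],
]


def naive_concat(fragments: List[Table]) -> Table:
    """Append every row of every fragment, with no alignment checks."""
    return [row for fragment in fragments for row in fragment]


def diagnose(table: Table) -> Dict[str, object]:
    """Count repeated header rows and rows whose width differs from the header."""
    header = table[0]
    repeated = sum(1 for row in table[1:] if row == header)
    ragged = [i for i, row in enumerate(table) if len(row) != len(header)]
    return {
        "rows": len(table),
        "repeated_headers": repeated,
        "ragged_row_indexes": ragged,
    }


merged = naive_concat([page_5, page_6])
report = diagnose(merged)
print(report)
# {'rows': 6, 'repeated_headers': 1, 'ragged_row_indexes': [4]}
```

The merged table has a header row sitting in the middle of the data, and the "Services" row no longer lines up with the Q1 and Q2 columns.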

Common workarounds (and why they fail)

Concatenate page text: Merge all pages into one text block. Loses table structure entirely — columns collapse into jumbled text.

Heuristic stitching: Match column widths between page breaks and merge tables that "look" continuous. Works on clean documents, breaks on real-world PDFs with varying margins, footnotes between table segments, or rotated pages.

Send the whole PDF to an LLM: Works for small documents. A 50-page financial statement with multiple multi-page tables can exceed the model's context window, and the LLM might hallucinate row values instead of reading them.
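For reference, the heuristic-stitching workaround can be sketched roughly as follows. This is one simplified variant, not how any particular tool does it: it matches on column count rather than measured column widths, and the function names and sample data are hypothetical.

```python
from typing import List, Optional

Row = List[Optional[str]]
Table = List[Row]


def drop_empty_columns(table: Table) -> Table:
    """Pad rows to equal width, then remove columns where every cell is blank."""
    if not table:
        return []
    width = max(len(row) for row in table)
    padded = [row + [None] * (width - len(row)) for row in table]
    keep = [c for c in range(width) if any((row[c] or "").strip() for row in padded)]
    return [[row[c] for c in keep] for row in padded]


def stitch(fragments: List[Table]) -> List[Table]:
    """Merge consecutive fragments that have the same column count.

    A fragment with a different column count starts a new table.
    A repeated header row at the top of a continuation is dropped.
    """
    tables: List[Table] = []
    for fragment in fragments:
        fragment = drop_empty_columns(fragment)
        if not fragment:
            continue
        if tables and len(fragment[0]) == len(tables[-1][0]):
            header = tables[-1][0]
            body = fragment[1:] if fragment[0] == header else fragment
            tables[-1].extend(body)
        else:
            tables.append([list(row) for row in fragment])
    return tables


# Hypothetical per-page fragments (one list of rows per detected table).
page_5: Table = [["Segment", "Revenue"], ["Cloud", "120"], ["Hardware", "80"]]
page_6: Table = [["Segment", "Revenue", None], ["Services", "40", ""], ["Total", "240", None]]
footnotes: Table = [["Note", "Ref", "Text"], ["1", "a", "Unaudited"]]

tables = stitch([page_5, page_6, footnotes])
print(len(tables))     # 2
print(len(tables[0]))  # 5
```

This works here only because the stray column on page 6 is entirely empty. If a single row is shifted, or an unrelated table on the next page happens to have the same number of columns, the same heuristic silently produces a wrong merge.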

How document reasoning handles spanning tables

The /analyze endpoint takes a different approach. Instead of stitching tables programmatically, it reads across page ranges and reasons over the combined content:

# Assumes `client` is an already-initialized API client for the service.
result = client.analyze(
    file="financial_statement.pdf",
    schema={
        "total_revenue_by_segment": {
            "type": "array",
            "description": "Each business segment and its revenue"
        },
        "largest_segment": {
            "type": "string",
            "description": "Which segment has the highest revenue?"
        }
    }
)

In this example, the API navigates to the right section of the document, reads pages 12-14 (the pages the revenue table spans), and returns the structured answer with citations pointing to specific pages. It doesn't try to merge table fragments. It reads the data and reasons over it like a person would.

When you need structured table data

For simple extraction from single-page tables, /extract works well — 1 credit per page, typed output, fast.

For multi-page tables where you need computed answers (sums, comparisons, cross-checks), /analyze handles the cross-page reasoning. 2 credits per page.
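As a rough way to compare the two options, the per-page rates above can be turned into a quick estimate. This sketch assumes every page of the submitted document is billed at the flat rate; whether billing actually counts all pages or only the pages read is not stated here, so treat the numbers as an upper-level approximation. The `estimate_credits` helper is hypothetical, not part of any SDK.

```python
# Per-page rates as quoted in the text above.
CREDITS_PER_PAGE = {"/extract": 1, "/analyze": 2}


def estimate_credits(endpoint: str, pages: int) -> int:
    """Estimate credits for a document, assuming every page is billed."""
    if endpoint not in CREDITS_PER_PAGE:
        raise ValueError(f"unknown endpoint: {endpoint}")
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return CREDITS_PER_PAGE[endpoint] * pages


extract_cost = estimate_credits("/extract", 50)
analyze_cost = estimate_credits("/analyze", 50)
print(extract_cost)  # 50
print(analyze_cost)  # 100
```

Under that assumption, a 50-page financial statement costs 50 credits through /extract and 100 through /analyze.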

The key insight: don't try to reconstruct the table programmatically. Let the reasoning layer read across pages and extract what you actually need. You probably don't want the raw table — you want the answer.

Test it on your own multi-page documents — upload a financial statement and ask a question that spans pages.