What your AI agent actually sees when you send it a PDF

Your agent can't read a PDF. It reads text. The question is what text it actually gets. The answer depends entirely on your extraction pipeline, and many pipelines discard the layout and table structure that gave the text its meaning.

Layer 1: raw text extraction

The most basic approach is to pull the plain text out of the PDF:

# What pdfplumber returns for a typical invoice
"Invoice #4521
Acme Corp
123 Main St
Consulting services
3200.00
Travel expenses 1550.00
Equipment rental
890.00
Subtotal 5740.00
Tax (8%) 459.20
Total Due
6199.20"

No structure. No columns. Some amounts share a line with their description; others sit alone on the line below. Your agent sees a wall of text and has to guess which number belongs to which field.
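To make the guessing concrete, here is a minimal sketch of the kind of pairing heuristic raw text forces on you. The function name `pair_amounts` and its rule (an amount belongs to the label on its own line, or else to the previous line) are illustrative assumptions, not part of pdfplumber or any other library. The wrapped-description input at the end is invented for the example.

```python
import re

RAW_TEXT = """Invoice #4521
Acme Corp
123 Main St
Consulting services
3200.00
Travel expenses 1550.00
Equipment rental
890.00
Subtotal 5740.00
Tax (8%) 459.20
Total Due
6199.20"""

# An amount is a number with exactly two decimals at the end of a line,
# optionally preceded by a label on the same line.
AMOUNT_AT_END = re.compile(r"^(?P<label>.*?)\s*(?P<amount>\d+\.\d{2})$")


def pair_amounts(text: str) -> dict[str, str]:
    """Hypothetical heuristic: attach each amount to the label on the same
    line, or to the previous line when the amount stands alone."""
    pairs: dict[str, str] = {}
    previous_line = ""
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = AMOUNT_AT_END.match(line)
        if match:
            label = match.group("label") or previous_line
            pairs[label] = match.group("amount")
        previous_line = line
    return pairs


pairs = pair_amounts(RAW_TEXT)
print(pairs["Consulting services"], pairs["Travel expenses"], pairs["Total Due"])

# The same rule breaks as soon as a description wraps onto a second line:
# the amount attaches to the tail of the description only.
print(pair_amounts("Consulting services for\nQ3 migration project\n3200.00"))
```

The heuristic happens to work on this invoice, which is the problem: nothing in the text tells you when it has stopped working.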

Layer 2: table-aware extraction

Better tools preserve table structure:

# Table detected and structured
| Description        | Amount   |
|--------------------|----------|
| Consulting services| 3,200.00 |
| Travel expenses    | 1,550.00 |
| Equipment rental   |   890.00 |

This is an improvement: columns are preserved. But it is still text. Your agent has to parse the markdown table, handle edge cases (merged cells, empty columns, multi-line cell values), and convert each cell to typed data.
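As a rough illustration of that parsing step, the sketch below converts the table above into typed rows. The function name `parse_amount_table` is hypothetical, and the code assumes the easy case only: one header row, one separator row, exactly two columns, and no merged or multi-line cells.

```python
from decimal import Decimal

TABLE = """| Description        | Amount   |
|--------------------|----------|
| Consulting services| 3,200.00 |
| Travel expenses    | 1,550.00 |
| Equipment rental   |   890.00 |"""


def parse_amount_table(markdown: str) -> list[dict]:
    """Turn a two-column markdown table into rows with a Decimal amount."""
    lines = [line.strip() for line in markdown.splitlines() if line.strip()]
    rows = []
    for line in lines[2:]:  # skip the header row and the |---| separator row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        description, amount_text = cells  # raises ValueError if not two cells
        # Remove the thousands separator before converting; Decimal avoids
        # binary floating-point rounding on money values.
        amount = Decimal(amount_text.replace(",", ""))
        rows.append({"description": description, "amount": amount})
    return rows


rows = parse_amount_table(TABLE)
for row in rows:
    print(row["description"], row["amount"])
```

Every edge case listed above (a merged cell, an empty column, a description containing a pipe character) breaks the two-cell unpacking, so real tables need considerably more handling than this.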

Layer 3: OCR for scanned documents

If the PDF is a scan (no embedded text), you need OCR first:

# What Tesseract returns for a scanned invoice
"lnvoice #4521
Acme Corp
l23 Main 5t
Consulting services
32OO.OO
Travel expenses 1S50.00"

# Errors: "Invoice" → "lnvoice", "123" → "l23", "St" → "5t",
# "3200.00" → "32OO.OO", "1550.00" → "1S50.00"

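OCR errors cascade. A strict number parser rejects "32OO.OO" outright, while a lenient one may keep the leading "32" and report a valid-looking amount. Plain text carries no signal that a value is suspect, so your agent gets confidently wrong data with no way to know it's wrong.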
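The difference between the two parsing behaviors can be sketched in a few lines. Both functions below are hypothetical helpers written for this example; `lenient_amount` imitates parsers that read as many leading digits as they can and ignore the rest.

```python
import re
from decimal import Decimal

# A clean amount: digits, a dot, exactly two decimals, nothing else.
STRICT_AMOUNT = re.compile(r"\d+\.\d{2}")


def parse_amount(text: str):
    """Strict: return a Decimal for a clean amount, or None otherwise."""
    if not STRICT_AMOUNT.fullmatch(text):
        return None
    return Decimal(text)


def lenient_amount(text: str):
    """Lenient: keep the leading numeric prefix and silently drop the rest."""
    match = re.match(r"\d+(\.\d+)?", text)
    return Decimal(match.group()) if match else None


for value in ["3200.00", "32OO.OO", "1S50.00"]:
    print(value, parse_amount(value), lenient_amount(value))
```

The strict version at least fails loudly. The lenient version turns "32OO.OO" into 32 and "1S50.00" into 1, both of which look like perfectly valid amounts downstream.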

Layer 4: structured JSON with confidence

What your agent actually needs:

{
  "data": {
    "invoice_number": "4521",
    "vendor": "Acme Corp",
    "total": 6199.20,
    "line_items": [
      {"description": "Consulting services", "amount": 3200.00},
      {"description": "Travel expenses", "amount": 1550.00},
      {"description": "Equipment rental", "amount": 890.00}
    ]
  },
  "confidence": {
    "invoice_number": "high",
    "vendor": "high",
    "total": "high"
  },
  "citations": {
    "total": "Total Due $6,199.20",
    "vendor": "Acme Corp, 123 Main St"
  }
}

Typed fields. Numbers are numbers, not strings. Per-field confidence scores (reported here as labels such as "high") tell your agent when to trust the data and when to flag it for review. Citations show the source text each value was read from.
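On the agent side, acting on those confidence values could look like the sketch below. The function name `fields_needing_review` is hypothetical, and two things are assumptions: that labels other than "high" (for example "low") can appear, and that nested values such as `line_items` carry no confidence entry of their own, as in the response above.

```python
RESPONSE = {
    "data": {
        "invoice_number": "4521",
        "vendor": "Acme Corp",
        "total": 6199.20,
        "line_items": [
            {"description": "Consulting services", "amount": 3200.00},
        ],
    },
    "confidence": {"invoice_number": "high", "vendor": "high", "total": "high"},
}


def fields_needing_review(response: dict, trusted=("high",)) -> list[str]:
    """Return scalar field names whose confidence is missing or not trusted."""
    confidence = response.get("confidence", {})
    flagged = []
    for name, value in response["data"].items():
        if isinstance(value, (list, dict)):
            continue  # assumption: nested values have no confidence entry
        if confidence.get(name) not in trusted:
            flagged.append(name)
    return flagged


print(fields_needing_review(RESPONSE))  # []

# Assumed variation: "total" comes back as "low" and "vendor" has no entry.
doubtful = {**RESPONSE, "confidence": {"invoice_number": "high", "total": "low"}}
print(fields_needing_review(doubtful))  # ['vendor', 'total']
```

Treating a missing confidence entry the same as a low one is a deliberately conservative choice: the agent only acts on values it was explicitly told to trust.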

The information loss problem

Most pipelines go: PDF → text → LLM prompt → unstructured response → JSON parsing. Every step loses information:

PDF → text: table structure lost
Text → LLM prompt: context window limits force truncation
LLM response → JSON: parsing failures on malformed output
No confidence scores at any step
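The last parsing step is easy to underestimate. The sketch below runs a few hypothetical LLM replies (invented for this example, not captured from a real model) through Python's standard `json` module. The helper name `parse_reply` is also an assumption.

```python
import json

# Hypothetical replies to a prompt asking for the invoice as JSON.
REPLIES = [
    '{"total": 6199.20, "vendor": "Acme Corp"}',    # clean
    'Here is the data:\n{"total": 6199.20}',        # prose before the JSON
    '{"total": 6199.20, "vendor": "Acme Corp",}',   # trailing comma
    '{"total": "6,199.20"}',                        # valid JSON, wrong type
]


def parse_reply(reply: str):
    """Return (data, problem); exactly one of the two is None."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as error:
        return None, f"invalid JSON: {error.msg}"
    if not isinstance(data, dict) or not isinstance(data.get("total"), (int, float)):
        return None, "total is not a number"
    return data, None


for reply in REPLIES:
    print(parse_reply(reply))
```

Only the first reply survives. The last one is the most dangerous of the four, because it parses without error and fails only if you also check the type.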

With /extract, you skip the intermediate steps: PDF in, typed JSON out. Table detection, OCR correction, and output enforcement happen inside the API. Your agent gets structured data it can act on, with a confidence score for each field so it knows what to trust.

Upload a document and see the difference.