The $100 invoice that broke 4 extraction APIs

Here's a simple test: create an invoice where the line items don't add up to the stated total. A $100 gap. Then send it to every document extraction API and see what comes back.

We did exactly that. The results were telling.

The test document

A two-page PDF invoice from a fictional company. Three line items:

Consulting services    $3,200.00
Travel expenses        $1,550.00
Equipment rental         $890.00
──────────────────────────────
Subtotal               $5,740.00    ← actual sum: $5,640
Tax (8%)                 $459.20
Total Due              $6,199.20

The subtotal says $5,740. The actual sum is $5,640. The $100 gap cascades into tax and total: 8% of the inflated subtotal is $459.20 instead of $451.20, so the total due is overstated by $108. Everything downstream is wrong, but the document looks completely normal.
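For readers who want to check the arithmetic themselves, here is a minimal sketch that recomputes the invoice from its line items. The variable names are illustrative, and rounding tax to the cent is an assumption about how the invoice was prepared.

```python
from decimal import Decimal

# Line items exactly as printed on the test invoice.
line_items = {
    "Consulting services": Decimal("3200.00"),
    "Travel expenses": Decimal("1550.00"),
    "Equipment rental": Decimal("890.00"),
}
stated_subtotal = Decimal("5740.00")
stated_total = Decimal("6199.20")
tax_rate = Decimal("0.08")

# Add the line items instead of trusting the printed subtotal.
computed_subtotal = sum(line_items.values(), Decimal("0"))
subtotal_gap = stated_subtotal - computed_subtotal

# What tax and total would be if the subtotal had been summed correctly
# (assumption: tax is rounded to the nearest cent).
correct_tax = (computed_subtotal * tax_rate).quantize(Decimal("0.01"))
correct_total = computed_subtotal + correct_tax
total_gap = stated_total - correct_total

print(computed_subtotal, subtotal_gap, correct_tax, correct_total, total_gap)
# prints: 5640.00 100.00 451.20 6091.20 108.00
```

Decimal is used rather than float so that the cent values compare exactly.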

What extraction APIs returned

All four extraction APIs we tested returned the same thing:

{ "subtotal": 5740.00, "total": 6199.20 }

Correct extraction. Wrong numbers. The tools did their job — they read what the document said. But what the document said was wrong, and no extraction API flagged it.

This is the fundamental problem: extraction reads. It doesn't verify.

What document reasoning returned

We sent the same invoice to the /analyze endpoint with a verification schema:

schema: {
  "line_items_match_subtotal": {
    "type": "boolean",
    "description": "Do the line items add up to the stated subtotal?"
  },
  "discrepancy_amount": {
    "type": "number",
    "description": "If there's a mismatch, what's the dollar difference?"
  }
}

The response:

{
  "data": {
    "line_items_match_subtotal": false,
    "discrepancy_amount": 100.00
  },
  "reasoning": {
    "line_items_match_subtotal": "$3,200 + $1,550 + $890 = $5,640. Stated subtotal: $5,740. Difference: $100.",
    "discrepancy_amount": "Computed sum $5,640 vs stated $5,740."
  },
  "confidence": {
    "line_items_match_subtotal": 0.99,
    "discrepancy_amount": 1.0
  }
}

The math was computed, not read. The $100 gap was caught because the API added the numbers instead of trusting what was printed.
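One way a caller might act on a response of this shape is sketched below. The function name verification_status and the 0.9 confidence threshold are assumptions made for illustration, not part of the API. The sample dict copies the values from the response above, with the reasoning field left out.

```python
def verification_status(response, min_confidence=0.9):
    """Classify an /analyze-style response as 'ok', 'mismatch', or 'uncertain'.

    min_confidence is a hypothetical threshold; tune it for your own workload.
    """
    field = "line_items_match_subtotal"
    # A missing confidence score is treated as zero confidence.
    confidence = response.get("confidence", {}).get(field, 0.0)
    if confidence < min_confidence:
        return "uncertain"
    return "ok" if response["data"][field] else "mismatch"


# Values copied from the response shown above.
sample = {
    "data": {"line_items_match_subtotal": False, "discrepancy_amount": 100.00},
    "confidence": {"line_items_match_subtotal": 0.99, "discrepancy_amount": 1.0},
}

print(verification_status(sample))
# prints: mismatch
```

Keeping "uncertain" separate from "mismatch" lets a low-confidence answer go to a reviewer without being reported as a confirmed error.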

Why this matters for production AI agents

This isn't a contrived edge case. In production, invoices have errors. Vendors make mistakes. Fraud happens. If your agent processes 1,000 invoices a month and trusts every stated total, sooner or later it will approve a payment that doesn't add up.

The pattern that works:

1. Use /extract to pull the fields (fast, 1 credit/page)
2. Use /analyze to verify the math (2 credits/page)
3. Flag discrepancies for human review
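
The three steps above can be sketched as a small pipeline. The extract and analyze arguments are hypothetical callables that wrap whatever client you use for /extract and /analyze. The stand-ins below simply return the values reported in this post and make no network calls. The credit estimate assumes both calls run on every page and that the two prices simply add.

```python
from typing import Any, Callable, Dict

Json = Dict[str, Any]

# Per-page prices quoted in this post.
EXTRACT_CREDITS_PER_PAGE = 1
ANALYZE_CREDITS_PER_PAGE = 2


def credits_for(pages: int) -> int:
    """Credits spent when both calls run on every page (assumed additive)."""
    return pages * (EXTRACT_CREDITS_PER_PAGE + ANALYZE_CREDITS_PER_PAGE)


def process_invoice(
    pdf: bytes,
    extract: Callable[[bytes], Json],
    analyze: Callable[[bytes], Json],
) -> Json:
    fields = extract(pdf)          # step 1: pull the fields
    check = analyze(pdf)["data"]   # step 2: verify the math
    return {
        "fields": fields,
        "discrepancy_amount": check.get("discrepancy_amount"),
        # step 3: flag any mismatch for human review
        "needs_human_review": not check["line_items_match_subtotal"],
    }


# Hypothetical stand-ins returning the values shown earlier in this post.
def fake_extract(pdf: bytes) -> Json:
    return {"subtotal": 5740.00, "total": 6199.20}


def fake_analyze(pdf: bytes) -> Json:
    return {"data": {"line_items_match_subtotal": False,
                     "discrepancy_amount": 100.00}}


result = process_invoice(b"%PDF-", fake_extract, fake_analyze)
print(result["needs_human_review"], credits_for(pages=2))
# prints: True 6
```

Passing the two calls in as arguments keeps the flagging logic testable without real API calls.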

Your agent doesn't need to be right 100% of the time. It needs to know when it's not confident. That's what confidence scores and computed verification give you.

Try it with your own invoice in the playground.