June 13, 2026
The $100 invoice that broke 4 extraction APIs
We crafted an invoice where the line items don't add up. Every extraction tool returned the wrong answer. Only one approach caught it.
Here's a simple test: create an invoice where the line items don't add up to the stated total. A $100 gap. Then send it to every document extraction API and see what comes back.
We did exactly that. The results were telling.
The test document
A two-page PDF invoice from a fictional company. Three line items:
Consulting services     $3,200.00
Travel expenses         $1,550.00
Equipment rental          $890.00
─────────────────────────────────
Subtotal                $5,740.00   ← actual sum: $5,640.00
Tax (8%)                  $459.20
Total Due               $6,199.20
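The subtotal says $5,740. The actual sum is $5,640. The $100 gap cascades into tax and total: 8% of the correct subtotal is $451.20, not $459.20, so the total due should be $6,091.20 and the invoice overstates it by $108. Everything downstream is wrong, but the document looks completely normal, because the stated tax and total are consistent with the stated subtotal.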
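The cascade can be checked mechanically. Here is a minimal Python sketch that recomputes the figures from the three line items, assuming the 8% tax is applied to the subtotal and rounded to the cent. The variable names are illustrative only.

```python
from decimal import Decimal

# Figures copied from the test invoice.
line_items = {
    "Consulting services": Decimal("3200.00"),
    "Travel expenses": Decimal("1550.00"),
    "Equipment rental": Decimal("890.00"),
}
TAX_RATE = Decimal("0.08")
stated_subtotal = Decimal("5740.00")
stated_tax = Decimal("459.20")
stated_total = Decimal("6199.20")

# Recompute everything from the line items instead of trusting the printed values.
computed_subtotal = sum(line_items.values(), Decimal("0"))
computed_tax = (computed_subtotal * TAX_RATE).quantize(Decimal("0.01"))
computed_total = computed_subtotal + computed_tax

subtotal_gap = stated_subtotal - computed_subtotal
total_gap = stated_total - computed_total

# The printed figures agree with each other, which is why the invoice looks normal.
stated_figures_consistent = (
    stated_subtotal * TAX_RATE == stated_tax
    and stated_subtotal + stated_tax == stated_total
)

print(computed_subtotal)          # 5640.00
print(computed_total)             # 6091.20
print(subtotal_gap, total_gap)    # 100.00 108.00
print(stated_figures_consistent)  # True
```

A check that only compares subtotal, tax, and total against each other would pass this invoice. Only summing the line items exposes the error.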
What extraction APIs returned
Every extraction tool we tested returned the same thing:
{ "subtotal": 5740.00, "total": 6199.20 }
Correct extraction. Wrong numbers. The tools did their job: they read what the document said. But what the document said was wrong, and no extraction API flagged it.
This is the fundamental problem: extraction reads. It doesn't verify.
What document reasoning returned
We sent the same invoice to the /analyze endpoint with a verification schema. The schema comes first, followed by the response:
schema: {
"line_items_match_subtotal": {
"type": "boolean",
"description": "Do the line items add up to the stated subtotal?"
},
"discrepancy_amount": {
"type": "number",
"description": "If there's a mismatch, what's the dollar difference?"
}
}
{
"line_items_match_subtotal": {
"answer": false,
"reasoning": "$3,200 + $1,550 + $890 = $5,640. Stated subtotal: $5,740. Difference: $100.",
"confidence": "high"
},
"discrepancy_amount": {
"answer": 100.00,
"reasoning": "Computed sum $5,640 vs stated $5,740.",
"confidence": "high"
}
}
The math was computed, not read. The $100 gap was caught because the API added the numbers instead of trusting what was printed.
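An agent still has to act on that response. The sketch below parses the JSON shown above and decides whether to escalate. It assumes the response shape is exactly as printed (answer, reasoning, confidence). The function name `needs_review` and the rule that anything other than "high" confidence is escalated are hypothetical choices, not part of the API.

```python
import json

# The /analyze response from the test, as a JSON string.
response_text = """
{
  "line_items_match_subtotal": {
    "answer": false,
    "reasoning": "$3,200 + $1,550 + $890 = $5,640. Stated subtotal: $5,740. Difference: $100.",
    "confidence": "high"
  },
  "discrepancy_amount": {
    "answer": 100.00,
    "reasoning": "Computed sum $5,640 vs stated $5,740.",
    "confidence": "high"
  }
}
"""


def needs_review(result: dict) -> bool:
    """Escalate on a reported mismatch, or when the check is not high-confidence."""
    match = result["line_items_match_subtotal"]
    # Assumption: any confidence value other than "high" is treated as uncertain.
    return match["answer"] is not True or match["confidence"] != "high"


result = json.loads(response_text)
flagged = needs_review(result)
gap = result["discrepancy_amount"]["answer"]

print(flagged)  # True
print(gap)      # 100.0
```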
Why this matters for production AI agents
This isn't a contrived edge case. In production, invoices have errors. Vendors make mistakes. Fraud happens. If your agent processes 1,000 invoices a month and trusts every stated total, it's approving payments that don't add up.
The pattern that works:
- Use /extract to pull the fields (fast, 1 credit/page)
- Use /analyze to verify the math (2 credits/page)
- Flag discrepancies for human review
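The three steps above can be sketched as one function. This is an outline under stated assumptions, not a client for the real API: `extract` and `analyze` are passed in as plain callables, and the stubs below return the canned values from this post rather than making HTTP calls. The credit prices are the ones listed above.

```python
from typing import Callable

# Credit prices per page, as stated in the list above.
EXTRACT_CREDITS_PER_PAGE = 1
ANALYZE_CREDITS_PER_PAGE = 2

ApiCall = Callable[[str], dict]  # hypothetical: takes a file path, returns parsed JSON


def process_invoice(pdf_path: str, pages: int, extract: ApiCall, analyze: ApiCall) -> dict:
    """Extract the fields, verify the math, and flag the invoice if the check fails."""
    fields = extract(pdf_path)
    check = analyze(pdf_path)["line_items_match_subtotal"]
    # Assumption: anything short of a high-confidence match goes to a human.
    flagged = check["answer"] is not True or check["confidence"] != "high"
    credits = pages * (EXTRACT_CREDITS_PER_PAGE + ANALYZE_CREDITS_PER_PAGE)
    return {
        "fields": fields,
        "flagged_for_review": flagged,
        "reason": check["reasoning"],
        "credits_used": credits,
    }


# Stubs standing in for the real endpoints, returning the results from the test.
def fake_extract(pdf_path: str) -> dict:
    return {"subtotal": 5740.00, "total": 6199.20}


def fake_analyze(pdf_path: str) -> dict:
    return {
        "line_items_match_subtotal": {
            "answer": False,
            "reasoning": "Computed sum $5,640 vs stated $5,740.",
            "confidence": "high",
        }
    }


outcome = process_invoice("invoice.pdf", 2, fake_extract, fake_analyze)
print(outcome["flagged_for_review"])  # True
print(outcome["credits_used"])        # 6
```

At these prices, running both steps on the two-page test invoice costs 6 credits.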
Your agent doesn't need to be right 100% of the time. It needs to know when it's not confident. That's what confidence scores and computed verification give you.