
June 13, 2026

The $100 invoice that broke 4 extraction APIs

We crafted an invoice where the line items don't add up. Every extraction tool returned the wrong answer. Only one approach caught it.

Here's a simple test: create an invoice where the line items don't add up to the stated total. A $100 gap. Then send it to every document extraction API and see what comes back.

We did exactly that. The results were telling.

The test document

A two-page PDF invoice from a fictional company. Three line items:

Consulting services    $3,200.00
Travel expenses        $1,550.00
Equipment rental         $890.00
──────────────────────────────
Subtotal               $5,740.00    ← actual sum: $5,640
Tax (8%)                 $459.20
Total Due              $6,199.20

The subtotal says $5,740. The actual sum is $5,640. The $100 gap cascades: tax is charged on the inflated subtotal, so the tax is overstated by $8 and the total due by $108. Everything downstream is wrong, but the document looks completely normal.
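The cascade is easy to reproduce by hand. Here is a minimal sketch using the figures from the test invoice; the variable names are ours, and it assumes tax is a flat 8% of the subtotal, as the invoice states.

```python
from decimal import Decimal

# Figures copied from the test invoice.
line_items = {
    "Consulting services": Decimal("3200.00"),
    "Travel expenses": Decimal("1550.00"),
    "Equipment rental": Decimal("890.00"),
}
stated_subtotal = Decimal("5740.00")
stated_total = Decimal("6199.20")
tax_rate = Decimal("0.08")  # assumption: flat 8% on the subtotal

# Add the line items instead of trusting the printed subtotal.
actual_subtotal = sum(line_items.values(), Decimal("0"))
subtotal_gap = stated_subtotal - actual_subtotal

# What tax and total would be if the subtotal matched the line items.
correct_tax = (actual_subtotal * tax_rate).quantize(Decimal("0.01"))
correct_total = actual_subtotal + correct_tax
total_gap = stated_total - correct_total

print(f"actual subtotal:     {actual_subtotal}")  # 5640.00
print(f"subtotal gap:        {subtotal_gap}")     # 100.00
print(f"correct total:       {correct_total}")    # 6091.20
print(f"total overstated by: {total_gap}")        # 108.00
```

Decimal is used rather than float so the currency arithmetic stays exact to the cent.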

What extraction APIs returned

All four extraction APIs we tested returned the same values:

{ "subtotal": 5740.00, "total": 6199.20 }

Correct extraction. Wrong numbers. The tools did their job — they read what the document said. But what the document said was wrong, and no extraction API flagged it.

This is the fundamental problem: extraction reads. It doesn't verify.

What document reasoning returned

We sent the same invoice to the /analyze endpoint with a verification schema:

schema: {
  "line_items_match_subtotal": {
    "type": "boolean",
    "description": "Do the line items add up to the stated subtotal?"
  },
  "discrepancy_amount": {
    "type": "number",
    "description": "If there's a mismatch, what's the dollar difference?"
  }
}

The response:

{
  "line_items_match_subtotal": {
    "answer": false,
    "reasoning": "$3,200 + $1,550 + $890 = $5,640. Stated subtotal: $5,740. Difference: $100.",
    "confidence": "high"
  },
  "discrepancy_amount": {
    "answer": 100.00,
    "reasoning": "Computed sum $5,640 vs stated $5,740.",
    "confidence": "high"
  }
}

The math was computed, not read. The $100 gap was caught because the API added the numbers instead of trusting what was printed.
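On the caller's side, acting on that response takes only a few lines. The sketch below is illustrative rather than official client code: the field names (answer, reasoning, confidence) come from the response above, while the function name needs_review and the rule that anything other than "high" confidence goes to a human are our assumptions.

```python
import json

# Response body in the shape shown above.
raw = """
{
  "line_items_match_subtotal": {
    "answer": false,
    "reasoning": "$3,200 + $1,550 + $890 = $5,640. Stated subtotal: $5,740. Difference: $100.",
    "confidence": "high"
  },
  "discrepancy_amount": {
    "answer": 100.00,
    "reasoning": "Computed sum $5,640 vs stated $5,740.",
    "confidence": "high"
  }
}
"""

def needs_review(result: dict) -> bool:
    """Return True when the invoice should go to a human."""
    match = result["line_items_match_subtotal"]
    # A reported mismatch always needs review.
    if match["answer"] is False:
        return True
    # Assumption: any confidence other than "high" is not trusted.
    return match["confidence"] != "high"

result = json.loads(raw)
print(needs_review(result))                    # True
print(result["discrepancy_amount"]["answer"])  # 100.0
```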

Why this matters for production AI agents

The test invoice was crafted, but the failure mode is real. In production, invoices have errors. Vendors make mistakes. Fraud happens. If your agent processes 1,000 invoices a month and trusts every stated total, sooner or later it approves a payment that doesn't add up.

The pattern that works:

  1. Use /extract to pull the fields (fast, 1 credit/page)
  2. Use /analyze to verify the math (2 credits/page)
  3. Flag discrepancies for human review
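The three steps above can be sketched as a single function. This is a hedged outline, not a client library: extract and analyze are passed in as plain callables because the real request format isn't shown in this post, and the stubs below simply return the values from our test. The names process_invoice, fake_extract, and fake_analyze are hypothetical.

```python
from typing import Callable

# Hypothetical signatures: each callable takes PDF bytes and returns parsed JSON.
ExtractFn = Callable[[bytes], dict]
AnalyzeFn = Callable[[bytes], dict]

def process_invoice(pdf: bytes, extract: ExtractFn, analyze: AnalyzeFn) -> dict:
    fields = extract(pdf)  # step 1: pull the fields
    check = analyze(pdf)   # step 2: verify the math
    mismatch = not check["line_items_match_subtotal"]["answer"]
    # Step 3: flag discrepancies for human review.
    return {
        "fields": fields,
        "status": "needs_review" if mismatch else "ok",
        "discrepancy": check["discrepancy_amount"]["answer"] if mismatch else None,
    }

# Stubs standing in for the real HTTP calls, returning the values from this post.
def fake_extract(pdf: bytes) -> dict:
    return {"subtotal": 5740.00, "total": 6199.20}

def fake_analyze(pdf: bytes) -> dict:
    return {
        "line_items_match_subtotal": {"answer": False, "confidence": "high"},
        "discrepancy_amount": {"answer": 100.00, "confidence": "high"},
    }

outcome = process_invoice(b"%PDF-", fake_extract, fake_analyze)
print(outcome["status"])       # needs_review
print(outcome["discrepancy"])  # 100.0
```

Passing the two calls in as parameters keeps the flagging logic testable without network access; in a real agent they would wrap the /extract and /analyze requests.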

Your agent doesn't need to be right 100% of the time. It needs to know when it's not confident. That's what confidence ratings and computed verification give you.

Try it with your own invoice in the playground.
