OCR accuracy in 2026: what actually works on scanned documents

OCR isn't solved. It's better than it was five years ago, but if you're processing scanned documents in production, you're still dealing with misread characters, lost formatting, and confident-sounding wrong answers.

We process thousands of scanned documents through our API. Here's what we've learned about what actually works in 2026.

The accuracy hierarchy for scanned documents

Roughly from worst to best overall, based on our testing across invoices, receipts, contracts, and handwritten forms:

Approach	Printed text	Phone photos	Handwriting	Faded/stamps
Tesseract 5	~92%	~78%	~45%	~60%
Google Cloud Vision	~97%	~91%	~72%	~85%
Vision LLM (GPT-4o)	~96%	~93%	~80%	~88%
Hybrid (OCR + vision)	~99%	~96%	~83%	~93%

Figures are approximate character-level accuracy on our test set of 200 real-world documents. Results vary by document quality.

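The key insight: no single engine wins across all document types. Cloud Vision edges out the vision LLM on printed text, while the vision LLM leads on phone photos, handwriting, and faded or stamped documents. Hybrid approaches that combine OCR speed with vision model accuracy outperform either one alone in every column.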
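For readers who want to reproduce this kind of number on their own documents, here is a minimal sketch of one common definition of character-level accuracy: one minus the edit distance divided by the length of the ground-truth text. It is an assumption that this matches how the figures above were computed; the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def char_accuracy(truth: str, ocr: str) -> float:
    """1 - (edit distance / ground-truth length), floored at 0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - levenshtein(truth, ocr) / len(truth))


# One wrong character out of nine: about 89% character accuracy.
score = char_accuracy("$1,550.00", "$1,5S0.00")
```

Note that a single misread character in a short field still scores high on this metric, which is why character accuracy alone can be misleading.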

Where each approach fails

Tesseract: "$1,550.00" → "$1,5S0.00". Confuses similar characters. No way to know it's wrong without the original image.

Cloud Vision APIs: Table structure gets flattened. Columns merge. Multi-column documents return text in wrong reading order.

Vision LLMs: A faded "8" gets read as "6", and the model confidently returns the wrong total. No confidence signal. Expensive for high-volume processing.

Hybrid: Slower (two passes). Costs more. But catches the errors that matter.
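Cheap sanity checks can help decide which fields deserve the slower second pass. The sketch below is an illustration, not a description of any particular pipeline: a strict format check that rejects the "$1,5S0.00" misread, and an arithmetic check that catches a wrong total when the line items were read correctly. The function names and the US-style amount format are assumptions.

```python
import re
from decimal import Decimal
from typing import List, Optional

# Assumed format: optional "$", digits with optional thousands separators,
# optional two-digit cents. Adjust for other locales.
AMOUNT_RE = re.compile(r"\$?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?")


def parse_amount(text: str) -> Optional[Decimal]:
    """Return the amount as a Decimal, or None if the text is not a clean amount."""
    cleaned = text.strip()
    if not AMOUNT_RE.fullmatch(cleaned):
        return None
    return Decimal(cleaned.lstrip("$").replace(",", ""))


def total_reconciles(line_items: List[Decimal], total: Decimal) -> bool:
    """True when the line items add up exactly to the extracted total."""
    return sum(line_items, Decimal("0")) == total


# A letter inside a numeric field fails the format check.
bad = parse_amount("$1,5S0.00")    # None
good = parse_amount("$1,550.00")   # Decimal("1550.00")

# A faded 8 read as 6 passes the format check but fails reconciliation.
items = [Decimal("1200.00"), Decimal("650.00")]
suspicious = not total_reconciles(items, Decimal("1650.00"))
```

Neither check is complete. A digit-for-digit swap passes the format check, and reconciliation only works when the document has line items to add up, so these narrow the set of fields to re-examine rather than replace the second pass.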

What matters for production OCR

Raw character accuracy isn't the full picture. In production, what matters is:

Field-level accuracy: Did you get the invoice total right? Getting 99% of characters right but misreading the total is a 100% failure for that field.
Confidence signals: Can you tell when OCR is uncertain? A confidence score lets your agent flag documents for human review instead of processing bad data.
Structured output: Raw OCR text is useless to an agent. It needs typed fields — vendor name as a string, total as a number, date as ISO 8601.
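To make the difference between character-level and field-level scoring concrete, here is a small sketch that normalizes raw OCR strings into typed values and then scores whole fields. The field names and the MM/DD/YYYY input date format are assumptions chosen for illustration.

```python
from datetime import datetime
from decimal import Decimal


def normalize_fields(raw: dict) -> dict:
    """Convert raw OCR strings into typed values."""
    return {
        "vendor": raw["vendor"].strip(),
        "total": Decimal(raw["total"].replace("$", "").replace(",", "")),
        # Assumes US-style dates on the document; output is ISO 8601.
        "date": datetime.strptime(raw["date"], "%m/%d/%Y").date().isoformat(),
    }


def field_accuracy(expected: dict, actual: dict) -> float:
    """Fraction of expected fields whose typed value matches exactly."""
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return hits / len(expected)


truth = normalize_fields(
    {"vendor": "Acme Corp", "total": "$1,850.00", "date": "03/14/2026"}
)
ocr = normalize_fields(
    {"vendor": " Acme Corp ", "total": "$1,650.00", "date": "03/14/2026"}
)
score = field_accuracy(truth, ocr)  # two of three fields match
```

Here a single wrong digit costs one field out of three, and the field it costs is the one that matters most.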

This is why we built OCR as a layer inside the extraction pipeline, not a standalone step. When you send a scanned document to /extract, OCR happens automatically — the API detects that it's a scan, runs the hybrid pipeline, and returns typed fields with confidence scores. You don't manage OCR separately.
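On the consuming side, per-field confidence can drive a simple review queue. The response shape below (a "fields" object whose entries carry "value" and "confidence") and the 0.90 threshold are hypothetical, used only to show the routing logic; check the actual /extract response format before relying on any of these names.

```python
REVIEW_THRESHOLD = 0.90  # hypothetical cutoff; tune it on your own documents


def fields_needing_review(response: dict, threshold: float = REVIEW_THRESHOLD) -> list:
    """Names of fields whose confidence is missing or below the threshold."""
    flagged = []
    for name, field in response.get("fields", {}).items():
        confidence = field.get("confidence")
        if confidence is None or confidence < threshold:
            flagged.append(name)
    return flagged


# Hypothetical response, not real API output.
sample_response = {
    "fields": {
        "vendor": {"value": "Acme Corp", "confidence": 0.98},
        "total": {"value": 1650.00, "confidence": 0.62},
        "date": {"value": "2026-03-14", "confidence": 0.95},
    }
}

to_review = fields_needing_review(sample_response)
```

Treating a missing confidence value as "needs review" is a deliberate choice in this sketch: a field with no signal should not be trusted by default.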

Practical advice for scanned document processing

Don't use OCR alone. Always validate OCR output against the original image or use a hybrid approach.
Require confidence scores. If your pipeline doesn't tell you when it's uncertain, you're flying blind.
Test on your worst documents. The faded fax, the phone photo taken at an angle, the form with stamps over text. That's what production looks like.
Demand structured output, not raw text. Your agent needs JSON fields, not a wall of OCR text to parse with regex.
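One way to act on the "worst documents" advice is to report accuracy per document category instead of a single average, so a weak category cannot hide behind the easy ones. A minimal sketch, with made-up category labels and results:

```python
from collections import defaultdict


def accuracy_by_category(results):
    """results: iterable of (category, is_correct) pairs -> {category: accuracy}."""
    counts = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for category, is_correct in results:
        counts[category][0] += 1 if is_correct else 0
        counts[category][1] += 1
    return {category: hits / total for category, (hits, total) in counts.items()}


# Illustrative data only: one entry per extracted field that was checked.
results = [
    ("clean_scan", True),
    ("clean_scan", True),
    ("faded_fax", True),
    ("faded_fax", False),
]

per_category = accuracy_by_category(results)
worst = min(per_category, key=per_category.get)
```

The overall average for this toy data is 75%, but the faded faxes sit at 50%, and that is the number to watch.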

Upload a scanned document to the playground and see the difference between raw OCR and structured extraction with confidence scores.