June 30, 2026
PDF to markdown API: 5 approaches compared for 2026
From open-source libraries to managed APIs — which PDF-to-markdown approach actually preserves tables, headers, and structure?
Converting PDFs to markdown is the first step in most document AI pipelines. Clean markdown feeds into LLMs, RAG systems, and downstream extraction. But the quality of that conversion varies wildly — some approaches lose table structure, miss headers, or produce garbage on scanned documents.
We tested five approaches on the same set of documents: a financial statement with complex tables, a contract with nested headers, and a scanned invoice. Here's what each produced.
What makes good PDF-to-markdown conversion?
- Table preservation: Tables should render as proper markdown tables, not flattened text
- Header hierarchy: H1, H2, H3 should match the document's visual hierarchy
- Reading order: Multi-column layouts should read in the correct order
- Scanned documents: OCR should be transparent — you shouldn't need a separate step
- List detection: Bulleted and numbered lists should render as markdown lists
- Code/pre-formatted: Monospace or pre-formatted sections should use code blocks
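Several of these criteria can be spot-checked automatically on a converter's output. The following is a minimal sketch, assuming the output is GitHub-flavored markdown; the function name `structure_report` and the regular expressions are illustrative choices, not part of any tool discussed here. It only counts structural elements and cannot judge reading order or OCR quality.

```python
import re

FENCE = "`" * 3  # the markdown code-fence marker, built here to keep this example readable
HEADING = re.compile(r"^(#{1,6})\s+\S")
# A table delimiter row such as |---|---| or :--- | ---:
TABLE_DELIM = re.compile(r"^\|?(\s*:?-+:?\s*\|)+(\s*:?-+:?\s*)?$")
LIST_ITEM = re.compile(r"^\s*([-*+]|\d+[.)])\s+\S")


def structure_report(markdown: str) -> dict:
    """Count headings (per level), tables, list items, and code blocks."""
    report = {"headings": {}, "tables": 0, "list_items": 0, "code_blocks": 0}
    in_code = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            if not in_code:
                report["code_blocks"] += 1
            in_code = not in_code
            continue
        if in_code:
            continue  # ignore '#' or '-' characters inside code blocks
        match = HEADING.match(stripped)
        if match:
            level = len(match.group(1))
            report["headings"][level] = report["headings"].get(level, 0) + 1
        elif TABLE_DELIM.match(stripped):
            report["tables"] += 1  # one delimiter row per table
        elif LIST_ITEM.match(line):
            report["list_items"] += 1
    return report


sample = "\n".join([
    "# Statement",
    "## Q1",
    "| Item | Amount |",
    "|---|---|",
    "| Revenue | 100 |",
    "- note one",
    "1. step",
    FENCE,
    "# not a heading",
    FENCE,
])
report = structure_report(sample)
print(report)
```

Comparing such a report against what you know is in the source PDF (for example, "this statement has four tables") gives a quick signal that a converter flattened a table or dropped a heading level.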
The 5 approaches
1. PyMuPDF (open-source, local)
import pymupdf4llm
md_text = pymupdf4llm.to_markdown("financial_statement.pdf")
Strengths: Fast, free, runs locally. No API calls. Good for simple documents with clear structure. The pymupdf4llm extension specifically targets LLM-friendly output.
Weaknesses: Tables with merged cells break. Header detection is heuristic (font-size based) and misses styled headers. Cannot handle scanned documents — requires embedded text. Multi-column layouts produce jumbled output.
Table accuracy: ~60% on complex financial tables. Simple tables work. Spanning cells don't.
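To see why font-size heuristics miss styled headers, consider a simplified sketch of the idea. This is an illustration only, not pymupdf4llm's actual implementation; the `(text, font_size)` input format and the function `infer_heading_levels` are assumptions made for the example.

```python
from collections import Counter


def infer_heading_levels(spans, max_levels=3):
    """Map text spans to heading levels by font size alone.

    spans: list of (text, font_size) tuples.
    Returns a list of (level, text); level 0 means body text.
    """
    # Treat the most common font size as body text.
    body_size = Counter(size for _, size in spans).most_common(1)[0][0]
    # Every distinct larger size becomes a heading level, largest first.
    larger = sorted({size for _, size in spans if size > body_size}, reverse=True)
    level_for = {size: min(i + 1, max_levels) for i, size in enumerate(larger)}
    return [(level_for.get(size, 0), text) for text, size in spans]


spans = [
    ("Annual Report", 24.0),
    ("Revenue", 16.0),
    ("Revenue grew in Q1.", 11.0),
    ("Risk Factors", 11.0),  # a bold heading set at body size
    ("Risks include currency exposure.", 11.0),
]
levels = infer_heading_levels(spans)
for level, text in levels:
    print(level, text)
```

"Risk Factors" comes back as level 0 because nothing but its weight distinguishes it from body text, which is the failure mode described above for headers styled with bold or color rather than size.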
2. Marker (open-source, local, ML-based)
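from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
converter = PdfConverter(artifact_dict=create_model_dict())  # loads the layout/OCR models
rendered = converter("financial_statement.pdf")
md_text, _, images = text_from_rendered(rendered)  # markdown string plus extracted images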
Strengths: Uses ML models for layout detection. Better header inference than rule-based approaches. Handles multi-column layouts. Open-source and runs locally.
Weaknesses: Slow without GPU (~30s per page on CPU). Table detection improved in recent versions but still struggles with complex financial tables. OCR quality depends on the underlying model. Large model downloads required.
Table accuracy: ~75% on complex tables. Good on standard layouts.
3. LlamaParse (API, RAG-optimized)
from llama_parse import LlamaParse
parser = LlamaParse(result_type="markdown")
documents = parser.load_data("financial_statement.pdf")
Strengths: Purpose-built for RAG pipelines. Good table preservation. Handles scanned documents. Native LlamaIndex integration. Generous free tier (1,000 pages/day).
Weaknesses: API dependency, so it can't run locally. Optimized for embedding/chunking, not human readability. Complex tables with nested headers sometimes produce markdown that is syntactically valid but hard to parse downstream. No URL-to-markdown conversion for JavaScript-rendered web pages.
Table accuracy: ~85% on complex tables. Strong on standard layouts.
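Header hierarchy matters for RAG because chunks are usually cut along headings. As a rough sketch of that step (a generic splitter written for illustration, not LlamaIndex's or LlamaParse's own chunking logic; `chunk_by_headings` is a hypothetical name), each chunk can carry the path of headings above it:

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown: str) -> list:
    """Split markdown into chunks, each tagged with its heading path.

    Note: this sketch does not skip '#' lines inside fenced code blocks.
    """
    chunks = []
    path = []  # stack of (level, title) for the current position
    body = []  # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = "\n".join([
    "# Contract",
    "Intro text.",
    "## Terms",
    "Term body.",
    "### Payment",
    "Net 30.",
    "## Termination",
    "Either party.",
])
chunks = chunk_by_headings(doc)
for chunk in chunks:
    print(" > ".join(chunk["path"]), "|", chunk["text"])
```

If a converter emits "Payment" as plain bold text instead of an H3, the "Net 30." clause ends up merged into the "Terms" chunk and loses its context, which is why header accuracy affects retrieval quality.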
4. Reducto (API, high-accuracy)
import reducto
client = reducto.Reducto(api_key="...")
result = client.parse.run(
    document_url="financial_statement.pdf",
    options={"output_mode": "markdown"}
)
Strengths: Multi-pass OCR + vision model pipeline. Excellent accuracy on complex layouts. Chart detection and description. SOC 2/HIPAA compliant. Highest table accuracy of the five approaches we tested.
Weaknesses: Enterprise pricing — more expensive than alternatives. Focused on document parsing, not combined with extraction or reasoning. No website support. Latency can be higher due to multi-pass processing.
Table accuracy: ~92% on complex tables. Leading accuracy for dense financial documents.
5. The Drive AI (API, multi-format)
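import requests
# Open the file in a context manager so the handle is closed after the upload
with open("financial_statement.pdf", "rb") as f:
    response = requests.post(
        "https://dev.thedrive.ai/api/v1/markdown/convert",
        headers={"X-API-Key": "tda_live_..."},
        files={"file": f},
    )
response.raise_for_status()  # surface HTTP errors before reading the body
markdown = response.json()["markdown"]
# Or convert a URL directly:
# GET https://dev.thedrive.ai/md/https://example.com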
Strengths: Handles 107+ file formats — not just PDFs but DOCX, XLSX, PPTX, images, and live websites. URL-to-markdown with JavaScript rendering. Automatic OCR for scanned documents. 1 credit per conversion ($0.01). When you need more than markdown, the same API offers structured extraction and reasoning.
Weaknesses: Table accuracy on very complex financial tables isn't quite at Reducto's level. Newer product with less enterprise track record.
Table accuracy: ~87% on complex tables. Strong across standard and moderately complex layouts.
Results comparison
| Approach | Table accuracy (complex tables) | Header detection | Scanned PDFs | URL input | Cost per page |
|---|---|---|---|---|---|
| PyMuPDF | ~60% | Heuristic | No | No | Free |
| Marker | ~75% | ML-based | Basic | No | Free (GPU recommended) |
| LlamaParse | ~85% | Good | Yes | No | ~$0.003 |
| Reducto | ~92% | Excellent | Yes | No | ~$0.01+ |
| The Drive AI | ~87% | Good | Yes | Yes | $0.01 |
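If you want to score table preservation on your own documents, one simple option is cell-level matching against a hand-checked ground-truth table. The sketch below is an illustrative metric under that assumption and is not necessarily how the percentages in this comparison were computed; `parse_markdown_table` and `cell_accuracy` are hypothetical helper names.

```python
def parse_markdown_table(md: str) -> list:
    """Parse a pipe-delimited markdown table into rows of cell strings.

    Simplification: escaped pipes inside cells are not handled.
    """
    rows = []
    for line in md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the delimiter row (cells made only of '-', ':' and spaces).
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    return rows


def cell_accuracy(expected: list, actual: list) -> float:
    """Fraction of expected cells found at the same row and column in actual.

    Extra cells in the actual table are not penalized in this sketch.
    """
    total = sum(len(row) for row in expected)
    if total == 0:
        return 0.0
    correct = 0
    for i, row in enumerate(expected):
        for j, cell in enumerate(row):
            if i < len(actual) and j < len(actual[i]) and actual[i][j] == cell:
                correct += 1
    return correct / total


ground_truth = """
| Item | Q1 | Q2 |
|---|---|---|
| Revenue | 100 | 120 |
| Costs | 80 | 90 |
"""
# Converter output where the last row lost a cell boundary.
converted = """
| Item | Q1 | Q2 |
|---|---|---|
| Revenue | 100 | 120 |
| Costs 80 | 90 |
"""
score = cell_accuracy(parse_markdown_table(ground_truth), parse_markdown_table(converted))
print(f"{score:.0%}")
```

A single merged cell shifts every value after it in that row, so one boundary error costs the whole row. That is why spanning or merged cells pull accuracy down so sharply on dense financial tables.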
Which approach should you use?
Use PyMuPDF when: Budget is zero, documents are simple, and you don't need OCR. Good for prototyping.
Use Marker when: You need local processing, have a GPU, and want better quality than PyMuPDF. Good for privacy-sensitive workflows.
Use LlamaParse when: You're building a RAG pipeline with LlamaIndex and need clean text for embeddings at scale.
Use Reducto when: You need the highest possible accuracy on complex financial documents and compliance matters.
Use The Drive AI when: You need markdown from more than just PDFs — DOCX, XLSX, websites, images — and want the option to go beyond markdown into structured extraction and reasoning with the same API. One tool for markdown, extraction, and analysis.
Beyond markdown: when you need more
Markdown is a stepping stone, not the destination. If your pipeline is PDF → markdown → LLM → parsed LLM response → structured data, every one of those four hand-offs can fail. Consider whether you actually need markdown, or whether you need structured extraction directly.
The Drive AI offers both. Use /markdown/convert when you genuinely need clean text. Use /extract when you need typed fields. Use /analyze when you need computed answers. Same API, same auth, same format support.
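To make the "parse LLM response" step of a markdown-first pipeline concrete, here is a minimal sketch of the validation it usually needs. The `Invoice` fields and the reply strings are made up for illustration, and the sketch assumes the LLM was asked to answer with a single JSON object.

```python
import json
from dataclasses import dataclass


@dataclass
class Invoice:
    invoice_number: str
    total: float


def parse_llm_invoice(response_text: str) -> Invoice:
    """Extract one JSON object from an LLM reply and coerce it to typed fields."""
    # LLM replies often wrap the JSON in extra prose, so locate the braces first.
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in LLM response")
    try:
        data = json.loads(response_text[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM response is not valid JSON: {exc}") from exc
    try:
        return Invoice(
            invoice_number=str(data["invoice_number"]),
            total=float(data["total"]),
        )
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"missing or mistyped field: {exc}") from exc


good_reply = 'Here is the data:\n{"invoice_number": "INV-42", "total": 1250.0}'
invoice = parse_llm_invoice(good_reply)
print(invoice)

# A formatted number copied straight from the document fails the type check.
bad_reply = '{"invoice_number": "INV-42", "total": "1,250.00"}'
try:
    parse_llm_invoice(bad_reply)
except ValueError as exc:
    print("rejected:", exc)
```

Each `raise` in this function corresponds to a way the multi-step pipeline can break after the markdown conversion has already succeeded, which is the argument for extracting typed fields directly when that is the real goal.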
Try the markdown converter — paste a URL or upload a file.