June 6, 2026
Building AI agent tools: how to give your agent file understanding
Your agent encounters a PDF attachment, a spreadsheet, a URL. Here's how to make it understand any of them with one tool definition.
If you're building an AI agent, there's a moment you'll hit: your agent receives a file attachment or needs to read a webpage, and it has no idea what to do with it.
The typical solution is duct tape. pdfplumber for PDFs, openpyxl for spreadsheets, playwright for websites, pytesseract for scans, whisper for audio. Each format needs its own parsing logic, error handling, and output normalization. Your agent's tool list grows. Your codebase grows. Your bugs grow.
Here's the alternative: one tool that handles everything.
The tool definition
Whether you're using LangChain, CrewAI, OpenAI function calling, or your own agent framework, the tool looks the same:
# LangChain tool example
from langchain.tools import tool
from thedriveai import TheDriveAI
client = TheDriveAI(api_key="tda_live_...")
@tool
def understand_file(file_path: str, questions: dict) -> dict:
"""Extract structured data from any file or URL.
Pass a file path or URL and a schema describing what you need."""
return client.extract(file=file_path, schema=questions)
That's it. One tool. Your agent can now handle PDFs, DOCX, XLSX, images, audio, video, and URLs. The API detects the format, picks the right parser, handles OCR if needed, and returns typed JSON matching your schema.
What the agent sees
When your agent calls this tool, it gets back structured data it can reason about:
# Agent receives an invoice PDF attachment
result = understand_file(
file_path="invoice.pdf",
questions={
"vendor": {"type": "string", "description": "Company name"},
"total": {"type": "number", "description": "Total amount due"},
"due_date": {"type": "string", "description": "Payment due date"},
}
)
# result.data = {"vendor": "AWS", "total": 971.73, "due_date": "2024-12-01"}
# result.confidence = {"vendor": "high", "total": "high", "due_date": "high"}
# Agent receives a competitor URL
result = understand_file(
file_path="https://competitor.com",
questions={
"pricing": {"type": "array", "description": "Plan names and prices"},
"free_tier": {"type": "boolean", "description": "Is there a free plan?"},
}
)
Same tool, same interface. The agent doesn't need to know if it's dealing with a PDF, a HEIC photo, or a JavaScript-rendered website.
CrewAI tool example
If you're using CrewAI, the tool definition is similar:
from crewai.tools import tool
from thedriveai import TheDriveAI
client = TheDriveAI(api_key="tda_live_...")
@tool("File Understanding")
def understand_file(file_path: str, schema: dict) -> dict:
"""Extract structured data from any file or URL.
Returns typed JSON with confidence scores."""
return client.extract(file=file_path, schema=schema)
OpenAI function calling example
For agents built directly on OpenAI's function calling:
# Define as an OpenAI function
tools = [{
"type": "function",
"function": {
"name": "understand_file",
"description": "Extract structured data from any file or URL",
"parameters": {
"type": "object",
"properties": {
"file_path": {"type": "string", "description": "File path or URL"},
"schema": {"type": "object", "description": "Fields to extract"}
},
"required": ["file_path", "schema"]
}
}
}]
# When the model calls it, execute:
def handle_tool_call(file_path, schema):
return client.extract(file=file_path, schema=schema)
Adding a reasoning tool
For questions that require computation — cross-checking totals, calculating growth rates, comparing clauses — add a second tool using /analyze:
@tool
def analyze_file(file_path: str, questions: dict) -> dict:
"""Reason over a document. Use for computations,
cross-checks, and questions that require thinking."""
return client.analyze(file=file_path, schema=questions)
Your agent now has two tools: one for reading, one for thinking. It decides which to use based on the task. Extract the vendor name? Use understand_file. Verify the invoice math? Use analyze_file.
Why one tool beats five libraries
- No format routing: Your agent doesn't need to figure out if it's a PDF, image, or spreadsheet
- Confidence scores: Your agent knows when it's not sure — it can ask a human instead of guessing
- Citations: Your agent can show the exact text it used to reach a conclusion
- One failure mode: Instead of debugging pdfplumber, tesseract, playwright, and whisper separately
- 107+ formats: Including formats you haven't thought of yet
The best agent architectures minimize tool count. One tool for files, one tool for reasoning. Everything else is handled by the API.
Try it in the playground — test with your own files before writing any integration code.