TL;DR
Mistral OCR 4 shipped on June 23, 2026, and it’s the first OCR model I’ve used that returns bounding boxes, block types, and per-word confidence scores alongside the extracted text. This tutorial walks through the Python SDK from basic PDF extraction to a structured invoice pipeline that outputs typed JSON. At $4 per 1,000 pages ($2 in batch mode), it undercuts most enterprise alternatives while posting a 72% win rate in blind tests.
Why I Switched From MarkItDown
I’ve been running MarkItDown, Docling, and Marker in my RAG pipelines for about two months. They’re solid for clean, text-heavy PDFs. But the moment I fed them scanned invoices, multi-column research papers, or documents with mixed tables and figures, the output got messy. Tables collapsed into paragraph text. Headers merged with body content. And none of them told me how confident the extraction was, so I had to eyeball every page.
When Mistral dropped OCR 4, I spent an afternoon porting one of my pipelines over. The difference was immediate: each extracted block comes with a type label (title, table, equation, signature, code), a bounding box, and a confidence score. My downstream chunker stopped guessing where sections began and ended.
Here’s what I built, step by step.
Setup
Install the Mistral Python SDK and set your API key:
pip install mistralai
export MISTRAL_API_KEY="your-key-here"
Get your key from Mistral’s platform. The free tier gives you enough credits to process a few hundred pages, which is plenty for this tutorial.
import os
from mistralai import Mistral
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
That’s it. No extra dependencies, no model downloads, no GPU setup.
Step 1: Basic PDF Extraction
Start with the simplest call. Feed a PDF URL and get markdown back:
response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    }
)
for page in response.pages:
    print(f"--- Page {page.index} ---")
    print(page.markdown[:500])
Output:
--- Page 0 ---
# LaMDA: Language Models for Dialog Applications
**Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer...**
## Abstract
We present LaMDA: Language Models for Dialog Applications.
LaMDA is a family of Transformer-based neural language models
specialized for dialog, which have up to 137B parameters...
The extracted markdown preserves headings, bold text, and paragraph structure. Tables come through as HTML by default, but you can switch that with the table_format parameter.
Step 2: Get Tables as Markdown or HTML
If your downstream pipeline prefers markdown tables over HTML:
response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    table_format="markdown"
)
For structured data pipelines where you need to parse table cells programmatically, "html" is usually the better choice because the <tr>/<td> tags are easier to traverse than markdown pipes.
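To make the "easier to traverse" point concrete, here is a minimal sketch that turns an HTML table into a list of rows using only the standard library. The sample HTML string is made up for illustration, and I'm assuming the returned tables are plain, non-nested <table> markup; nested tables or rowspan/colspan cells would need more handling than this.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_html_table(html):
    """Return the table as a list of rows; each row is a list of cell texts."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample input, not real API output.
sample = (
    "<table><tr><th>Model</th><th>Params</th></tr>"
    "<tr><td>LaMDA</td><td>137B</td></tr></table>"
)
rows = parse_html_table(sample)
print(rows)
```

From here the first row can serve as column names and the rest as records, which is harder to do reliably when you are splitting markdown rows on pipe characters.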
Step 3: Block Extraction With Bounding Boxes
Set include_blocks=True and every extracted region gets a bounding box, a type label, and its own text:
response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    include_blocks=True
)
page = response.pages[0]
for block in page.blocks:
    print(f"Type: {block.type:12s} | "
          f"BBox: ({block.x}, {block.y}, {block.width}, {block.height}) | "
          f"Text: {block.text[:60]}...")
Output:
Type: title | BBox: (142, 89, 728, 42) | Text: LaMDA: Language Models for Dialog Applications...
Type: text | BBox: (142, 145, 728, 38) | Text: Romal Thoppilan, Daniel De Freitas, Jamie Hall...
Type: title | BBox: (142, 210, 180, 24) | Text: Abstract...
Type: text | BBox: (142, 240, 350, 180) | Text: We present LaMDA: Language Models for Dialog Ap...
Type: table | BBox: (90, 520, 430, 220) | Text: | Model | Params | SSA | Quality |...
Type: equation | BBox: (160, 780, 300, 40) | Text: P(y|x) = \prod_{t=1}^{T} P(y_t | y_{<t}, x)...
The block types OCR 4 recognizes: text, title, list, table, image, equation, caption, code, references, aside_text, header, footer, and signature.
I use these labels in my chunking pipeline to split documents along structural boundaries: titles and headers start new chunks, tables and equations stay intact, and captions get attached to the preceding image or figure.
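A stripped-down sketch of that chunking rule, not my production chunker. It works on plain dicts with "type" and "text" keys rather than SDK objects, and the sample blocks are invented for the demo. Blocks are only ever moved whole, which is what keeps tables and equations intact.

```python
def chunk_blocks(blocks):
    """Group blocks into chunks along structural boundaries.

    - "title" and "header" blocks start a new chunk.
    - Every block is appended whole, so tables and equations are never split.
    - A "caption" directly after an "image" is stored on that image block.
    """
    chunks = []
    for block in blocks:
        kind = block["type"]
        if kind in ("title", "header") or not chunks:
            chunks.append([])
        current = chunks[-1]
        if kind == "caption" and current and current[-1]["type"] == "image":
            # Attach the caption to the preceding image instead of
            # keeping it as a separate block.
            current[-1] = {**current[-1], "caption": block["text"]}
        else:
            current.append(dict(block))
    return chunks


# Hypothetical blocks for illustration only.
blocks = [
    {"type": "title", "text": "Abstract"},
    {"type": "text", "text": "We present LaMDA."},
    {"type": "image", "text": ""},
    {"type": "caption", "text": "Figure 1"},
    {"type": "title", "text": "Results"},
    {"type": "table", "text": "| Model | Params |"},
]
chunks = chunk_blocks(blocks)
for i, chunk in enumerate(chunks):
    print(i, [b["type"] for b in chunk])
```

In a real pipeline you would build the dicts from page.blocks and probably cap chunk size as well; this sketch leaves both out.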
Step 4: Confidence Scores
Two granularity levels: per-page averages or per-word scores.
response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    confidence_scores_granularity="word"
)
page = response.pages[0]
scores = page.confidence_scores
print(f"Page avg: {scores.average_page_confidence_score:.3f}")
print(f"Page min: {scores.minimum_page_confidence_score:.3f}")
low_confidence = [
    w for w in scores.word_confidence_scores
    if w.confidence < 0.8
]
print(f"Low-confidence words: {len(low_confidence)}")
for w in low_confidence[:5]:
    print(f" '{w.text}' → {w.confidence:.2f}")
For a clean digital PDF, you’ll see average scores above 0.95. Scanned documents or low-resolution faxes drop into the 0.7–0.85 range. I flag any page below 0.85 for human review. It takes 30 seconds to verify vs. hours to debug a bad extraction downstream.
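The review rule is simple enough to sketch without the API. The helper below is hypothetical and takes a plain dict of page index to average confidence; in a real run you would fill that dict from page.index and the average_page_confidence_score field shown above. The scores in the example are invented.

```python
REVIEW_THRESHOLD = 0.85  # the cutoff I use; tune it for your own documents


def pages_needing_review(page_scores, threshold=REVIEW_THRESHOLD):
    """Return indices of pages scoring below the threshold, worst first."""
    flagged = [
        (score, index)
        for index, score in page_scores.items()
        if score < threshold
    ]
    return [index for _score, index in sorted(flagged)]


# Hypothetical per-page averages: page index -> average confidence.
page_scores = {0: 0.97, 1: 0.82, 2: 0.91, 3: 0.74}
review_queue = pages_needing_review(page_scores)
print(review_queue)
```

Sorting worst-first means a reviewer with limited time sees the riskiest pages before anything else.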
For page-level scores only (faster, cheaper if you don’t need word-level detail):
response = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url", "document_url": "..."},
    confidence_scores_granularity="page"
)
Step 5: Structured Output With JSON Schema
This is the feature that made me port my whole pipeline. You pass a JSON schema alongside the document, and the model returns structured data that conforms to your spec.
Say you’re processing invoices. Define a schema:
invoice_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "invoice",
        "schema": {
            "type": "object",
            "properties": {
                "vendor_name": {"type": "string"},
                "invoice_number": {"type": "string"},
                "invoice_date": {"type": "string"},
                "due_date": {"type": "string"},
                "line_items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "quantity": {"type": "number"},
                            "unit_price": {"type": "number"},
                            "total": {"type": "number"}
                        },
                        "required": ["description", "quantity",
                                     "unit_price", "total"]
                    }
                },
                "subtotal": {"type": "number"},
                "tax": {"type": "number"},
                "total_due": {"type": "number"}
            },
            "required": ["vendor_name", "invoice_number",
                         "total_due"]
        }
    }
}
Then pass it as document_annotation_format:
import json

response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://example.com/invoice.pdf"
    },
    document_annotation_format=invoice_schema,
    document_annotation_prompt="Extract invoice details from this document."
)
invoice = json.loads(response.document_annotation)
print(json.dumps(invoice, indent=2))
Output:
{
  "vendor_name": "Acme Cloud Services",
  "invoice_number": "INV-2026-0847",
  "invoice_date": "2026-06-15",
  "due_date": "2026-07-15",
  "line_items": [
    {
      "description": "GPU compute (A100, 720 hrs)",
      "quantity": 720,
      "unit_price": 3.50,
      "total": 2520.00
    },
    {
      "description": "Object storage (2.4 TB)",
      "quantity": 2.4,
      "unit_price": 23.00,
      "total": 55.20
    }
  ],
  "subtotal": 2575.20,
  "tax": 489.29,
  "total_due": 3064.49
}
The document_annotation_prompt parameter lets you guide the extraction. For messy scans, a prompt like “This is a handwritten invoice. Amounts are in EUR. Dates use DD/MM/YYYY format.” makes a real difference. I’ve found that being specific about currency and date format in the prompt cuts extraction errors by roughly half on my invoice dataset.
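Schema-conformant JSON can still contain misread digits, so a cheap arithmetic check on the extracted numbers is worth having. The sketch below is a hypothetical helper, not part of the SDK. It assumes the field names from the invoice schema above and a one-cent tolerance, and it runs against the sample output shown earlier.

```python
def check_invoice_totals(invoice, tolerance=0.01):
    """Return a list of arithmetic problems; an empty list means consistent."""
    problems = []
    items = invoice.get("line_items", [])

    # Each line: quantity x unit_price should equal the line total.
    for i, item in enumerate(items):
        expected = item["quantity"] * item["unit_price"]
        if abs(expected - item["total"]) > tolerance:
            problems.append(
                f"line_items[{i}]: expected {expected:.2f}, "
                f"got {item['total']:.2f}"
            )

    # Line totals should add up to the subtotal, when both are present.
    subtotal = invoice.get("subtotal")
    if items and subtotal is not None:
        items_sum = sum(item["total"] for item in items)
        if abs(items_sum - subtotal) > tolerance:
            problems.append(
                f"subtotal: expected {items_sum:.2f}, got {subtotal:.2f}"
            )

    # Subtotal plus tax should equal the amount due, when both are present.
    tax = invoice.get("tax")
    if subtotal is not None and tax is not None:
        expected_due = subtotal + tax
        if abs(expected_due - invoice["total_due"]) > tolerance:
            problems.append(
                f"total_due: expected {expected_due:.2f}, "
                f"got {invoice['total_due']:.2f}"
            )
    return problems


# The sample extraction from above.
invoice = {
    "vendor_name": "Acme Cloud Services",
    "invoice_number": "INV-2026-0847",
    "line_items": [
        {"description": "GPU compute (A100, 720 hrs)",
         "quantity": 720, "unit_price": 3.50, "total": 2520.00},
        {"description": "Object storage (2.4 TB)",
         "quantity": 2.4, "unit_price": 23.00, "total": 55.20},
    ],
    "subtotal": 2575.20,
    "tax": 489.29,
    "total_due": 3064.49,
}
print(check_invoice_totals(invoice))
```

Invoices with discounts, shipping, or rounding per line will trip this check, so treat a non-empty result as a reason to look at the page rather than proof of an OCR error.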
Step 6: Pydantic Models for Type Safety
If you’re building a production pipeline (and you should be using uv to manage it), define the schema as a Pydantic model instead of raw dicts. The Mistral SDK accepts model_json_schema() output directly:
from pydantic import BaseModel


class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    total: float


class Invoice(BaseModel):
    vendor_name: str
    invoice_number: str
    invoice_date: str | None = None
    due_date: str | None = None
    line_items: list[LineItem] = []
    subtotal: float | None = None
    tax: float | None = None
    total_due: float


schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "invoice",
        "schema": Invoice.model_json_schema()
    }
}
response = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url",
              "document_url": "https://example.com/invoice.pdf"},
    document_annotation_format=schema,
    document_annotation_prompt="Extract all invoice fields."
)
invoice = Invoice.model_validate_json(response.document_annotation)
print(f"Vendor: {invoice.vendor_name}")
print(f"Total: ${invoice.total_due:.2f}")
print(f"Items: {len(invoice.line_items)}")
The Pydantic model gives you validation, default values, and type coercion out of the box. A missing due_date returns None instead of crashing your pipeline.
Step 7: Local Files and Selective Pages
For documents that aren’t publicly accessible via URL, encode them as base64 data URIs:
import base64
from pathlib import Path

pdf_path = Path("contract.pdf")
pdf_bytes = pdf_path.read_bytes()
pdf_b64 = base64.standard_b64encode(pdf_bytes).decode()
response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": f"data:application/pdf;base64,{pdf_b64}"
    },
    include_blocks=True,
    confidence_scores_granularity="page"
)
Yes, you pass base64 data URIs through the document_url field. The naming is a bit odd, but it works for PDFs, DOCX, PPTX, and images (PNG, JPEG, AVIF).
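If you handle more than one file type, hardcoding application/pdf gets old quickly. Here is a small sketch of a helper (to_data_uri is my own name, not an SDK function) that guesses the MIME type from the file extension with the standard library. It only builds the string; whether the API accepts every type it can produce is not something this code checks.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_uri(path):
    """Build a base64 data URI for a local file, guessing its MIME type."""
    path = Path(path)
    mime, _encoding = mimetypes.guess_type(path.name)
    if mime is None:
        # Fail loudly rather than send an unlabeled payload.
        raise ValueError(f"Cannot guess MIME type for {path.name}")
    encoded = base64.standard_b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Usage (assumes contract.pdf exists next to the script):
# document = {"type": "document_url", "document_url": to_data_uri("contract.pdf")}
```

Keep in mind that base64 inflates the payload by about a third, so very large files are better served from a URL when that is an option.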
For large documents, you can also limit processing to specific pages:
response = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url",
              "document_url": "https://example.com/report.pdf"},
    pages="0-4"
)
The pages parameter accepts a range string ("0-4") or a list of integers ([0, 2, 7]). Pages are zero-indexed. This cuts your costs proportionally. If a 100-page report only has relevant data on pages 3–7, you pay for 5 pages instead of 100.
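Before sending a request, it can help to know exactly how many pages a given selection covers. The sketch below is a hypothetical local helper that expands either form of the pages value into a sorted list. It assumes a range string is a single inclusive "start-end" pair, which is all the example above shows.

```python
def expand_pages(spec):
    """Expand "0-4" or [0, 2, 7] into a sorted list of unique page indices."""
    if isinstance(spec, str):
        start, _sep, end = spec.partition("-")
        first = int(start)
        last = int(end) if end else first  # "3" alone means a single page
        if last < first:
            raise ValueError(f"Range runs backwards: {spec!r}")
        pages = range(first, last + 1)     # ranges are inclusive
    else:
        pages = spec
    return sorted(set(pages))


selected = expand_pages("3-7")
total_pages = 100
print(f"{len(selected)} of {total_pages} pages "
      f"({len(selected) / total_pages:.0%} of the full-document cost)")
```

For the 100-page report above, the selection covers 5 pages, so the bill is 5% of what the whole document would cost.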
Pricing
| Tier | Cost | Use Case |
|---|---|---|
| Standard API | $4 / 1K pages | Real-time extraction, low latency |
| Batch API | $2 / 1K pages | Bulk processing, higher latency |
| Document AI | $5 / 1K pages | Structured output + annotations |
| Self-hosted | Contact sales | On-prem, data residency |
The Document AI tier adds the structured annotation features (JSON schema extraction, bbox annotations). If you only need raw text and tables, the standard $4 tier works fine.
For comparison: AWS Textract charges $1.50 per 1,000 pages for plain text and $15 per 1,000 pages for tables. Google Document AI runs $1.50–$30 per 1,000 pages depending on the processor. Mistral’s pricing sits in a competitive range, especially with the batch discount.
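To turn the per-1,000-page rates into a budget number, a quick calculation is enough. This sketch hardcodes the three Mistral prices from the table above; the 50,000-page volume is an arbitrary example, and the tier keys are my own labels rather than API identifiers.

```python
# USD per 1,000 pages, copied from the pricing table above.
PRICE_PER_1K_PAGES = {
    "standard": 4.00,
    "batch": 2.00,
    "document_ai": 5.00,
}


def estimate_cost(pages, tier):
    """Return the estimated cost in USD for a page count on a given tier."""
    return pages * PRICE_PER_1K_PAGES[tier] / 1000


monthly_pages = 50_000  # hypothetical volume
for tier in PRICE_PER_1K_PAGES:
    print(f"{tier:12s} ${estimate_cost(monthly_pages, tier):,.2f}")
```

At that volume the gap between batch and standard is $100 a month, which is the number to weigh against the extra latency.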
What OCR 4 Changed From OCR 3
| Feature | OCR 3 | OCR 4 |
|---|---|---|
| Bounding boxes | No | Yes (paragraph-level) |
| Block classification | No | 13 block types |
| Confidence scores | No | Per-word and per-page |
| Structured JSON output | Via separate model call | Built-in via document_annotation_format |
| Self-hosting | Yes (contact sales) | Single container deployment |
| Supported formats | PDF, images | PDF, DOC, PPT, OpenDocument, images |
| Languages | 30+ | 170 across 10 language groups |
| Batch API | Yes ($1/1K pages) | Yes ($2/1K pages) |
OCR 4 goes beyond text extraction. The bounding boxes and block types turn it into a document intelligence model that returns a structured representation of each page, not just raw text.
When OCR 4 Struggles
I’ve run about 400 pages through it over the past two days. Three patterns caused trouble:
Handwritten text tanks confidence scores below 0.6, and the extraction starts hallucinating words. If your pipeline handles handwritten documents regularly, you’ll need a human review step.
Multi-column academic papers sometimes get their reading order scrambled. The bounding boxes are correct, but the markdown concatenation can interleave paragraphs from different columns instead of flowing top-to-bottom within each column.
Equations in scans below 200 DPI lose their LaTeX reconstruction. The equation blocks are still labeled correctly, but the content inside them gets garbled.
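Since the bounding boxes stay correct when reading order goes wrong, one workaround for the multi-column case is to rebuild the order from the boxes yourself. This is a rough sketch under strong assumptions: exactly two columns, a page width you supply, and blocks as dicts with the x, y, and width fields from Step 3. Full-width blocks such as a title spanning both columns are not handled.

```python
def reorder_two_columns(blocks, page_width):
    """Sort blocks left column first, then right, top-to-bottom within each."""
    midpoint = page_width / 2

    def sort_key(block):
        # A block belongs to the column its horizontal center falls in.
        center_x = block["x"] + block["width"] / 2
        column = 0 if center_x < midpoint else 1
        return (column, block["y"])

    return sorted(blocks, key=sort_key)


# Hypothetical interleaved blocks, as they might appear in a scrambled page.
blocks = [
    {"text": "L1", "x": 50, "y": 100, "width": 400, "height": 50},
    {"text": "R1", "x": 550, "y": 100, "width": 400, "height": 50},
    {"text": "L2", "x": 50, "y": 200, "width": 400, "height": 50},
    {"text": "R2", "x": 550, "y": 200, "width": 400, "height": 50},
]
ordered = reorder_two_columns(blocks, page_width=1000)
print([b["text"] for b in ordered])
```

Joining the text of the reordered blocks gives a markdown stream that follows each column in turn, at the cost of an extra include_blocks=True request.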
For clean digital PDFs, invoices, contracts, and presentations, it works extremely well. I’d estimate 80% of the documents I process fall into this category, and for those, the extraction quality is better than anything I’ve gotten from Textract or Google Document AI without post-processing.
FAQ
How do I install the Mistral OCR Python SDK?
Run pip install mistralai. The OCR functionality is part of the main SDK, so no separate package is needed. You’ll need a Mistral API key from console.mistral.ai.
What’s the difference between Mistral OCR 3 and OCR 4?
OCR 4 adds bounding boxes, block classification (13 types), per-word and per-page confidence scores, built-in structured JSON output, support for DOC/PPT/OpenDocument formats, 170 languages (up from 30+), and single-container self-hosting. OCR 3 only extracted text and tables. Both versions offer a batch API ($1 per 1,000 pages for OCR 3, $2 for OCR 4).
Can Mistral OCR 4 process images, not just PDFs?
Yes. OCR 4 accepts PNG, JPEG, and AVIF images via the image_url document type. The same include_blocks and confidence_scores_granularity parameters work on images.
How does Mistral OCR 4 compare to AWS Textract?
Mistral OCR 4 costs $4 per 1,000 pages for text + tables (vs. Textract’s $15 per 1,000 for tables). OCR 4 returns bounding boxes and block types in a single call. Textract requires separate API calls for different analysis types (text, tables, forms, queries). Mistral’s structured JSON output via document_annotation_format eliminates the need for a separate LLM call to parse the extraction.
Can I self-host Mistral OCR 4?
Yes. Mistral offers a single-container deployment for enterprise customers who need data residency. Contact their sales team for pricing; it’s not available on the standard API tier.
Sources
- Mistral OCR 4 announcement — official launch post with benchmarks and pricing
- Mistral OCR processor documentation — API reference and Python SDK examples
- Mistral OCR data extraction cookbook — structured output examples with annotations
- Mistral OCR API endpoint reference — full parameter specification
Bottom Line
Before OCR 4, I was stitching together three services to get bounding boxes, block types, and structured JSON extraction from one document. Now it’s a single API call. The Python SDK is clean, the pricing is reasonable at $2–5 per 1,000 pages, and the 72% blind-test win rate tracks with what I’ve seen processing real documents.
The weak spots are real: handwritten text, multi-column reading order, and low-res equation scans still need human review. But for the 80% of documents that are clean-enough PDFs with mixed content, OCR 4 handles the structural extraction that tools like MarkItDown and Docling punt on, and the confidence scores alone save hours of debugging. If you want to expose this pipeline as a tool for coding agents, wrapping it in a FastMCP server takes about 20 minutes.