Mistral OCR 4 Tutorial: Build a Document Pipeline in Python

TL;DR

Mistral OCR 4 shipped on June 23, 2026, and it’s the first OCR model I’ve used that returns bounding boxes, block types, and per-word confidence scores alongside the extracted text. This tutorial walks through the Python SDK from basic PDF extraction to a structured invoice pipeline that outputs typed JSON. At $4 per 1,000 pages ($2 in batch mode), it undercuts most enterprise alternatives while posting a 72% win rate in blind tests.

Why I Switched From MarkItDown

I’ve been running MarkItDown, Docling, and Marker in my RAG pipelines for about two months. They’re solid for clean, text-heavy PDFs. But the moment I fed them scanned invoices, multi-column research papers, or documents with mixed tables and figures, the output got messy. Tables collapsed into paragraph text. Headers merged with body content. And none of them told me how confident the extraction was, so I had to eyeball every page.

When Mistral dropped OCR 4, I spent an afternoon porting one of my pipelines over. The difference was immediate: each extracted block comes with a type label (title, table, equation, signature, code), a bounding box, and a confidence score. My downstream chunker stopped guessing where sections began and ended.

Here’s what I built, step by step.

Setup

Install the Mistral Python SDK and set your API key:

pip install mistralai
export MISTRAL_API_KEY="your-key-here"

Get your key from Mistral’s platform. The free tier gives you enough credits to process a few hundred pages, which is plenty for this tutorial.

import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

That’s it. No extra dependencies, no model downloads, no GPU setup.

Step 1: Basic PDF Extraction

Start with the simplest call. Feed a PDF URL and get markdown back:

response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    }
)

for page in response.pages:
    print(f"--- Page {page.index} ---")
    print(page.markdown[:500])

Output:

--- Page 0 ---
# LaMDA: Language Models for Dialog Applications

**Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer...**

## Abstract

We present LaMDA: Language Models for Dialog Applications.
LaMDA is a family of Transformer-based neural language models
specialized for dialog, which have up to 137B parameters...

The extracted markdown preserves headings, bold text, and paragraph structure. Tables come through as HTML by default, but you can switch that with the table_format parameter.

Step 2: Get Tables as Markdown or HTML

If your downstream pipeline prefers markdown tables over HTML:

response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    table_format="markdown"
)

For structured data pipelines where you need to parse table cells programmatically, "html" is usually the better choice because the <tr>/<td> tags are easier to traverse than markdown pipes.
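To make the "easier to traverse" point concrete, here is a minimal sketch that turns an HTML table into a list of rows using only the standard library. The sample fragment is made up for illustration and is not real OCR output; in practice you would feed it whatever table HTML the response contains.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built
        self._cell = None   # text pieces of the cell currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical table fragment for illustration only.
sample = (
    "<table><tr><th>Model</th><th>Params</th></tr>"
    "<tr><td>LaMDA</td><td>137B</td></tr></table>"
)
rows = parse_table(sample)
print(rows)
```

The same traversal on a markdown table means splitting on pipes and skipping the separator row, which gets fragile as soon as a cell contains a pipe character.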

Step 3: Block Extraction With Bounding Boxes

Set include_blocks=True and every extracted region gets a bounding box, a type label, and its own text:

response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    include_blocks=True
)

page = response.pages[0]
for block in page.blocks:
    print(f"Type: {block.type:12s} | "
          f"BBox: ({block.x}, {block.y}, {block.width}, {block.height}) | "
          f"Text: {block.text[:60]}...")

Output:

Type: title        | BBox: (142, 89, 728, 42)  | Text: LaMDA: Language Models for Dialog Applications...
Type: text         | BBox: (142, 145, 728, 38)  | Text: Romal Thoppilan, Daniel De Freitas, Jamie Hall...
Type: title        | BBox: (142, 210, 180, 24)  | Text: Abstract...
Type: text         | BBox: (142, 240, 350, 180) | Text: We present LaMDA: Language Models for Dialog Ap...
Type: table        | BBox: (90, 520, 430, 220)  | Text: | Model | Params | SSA | Quality |...
Type: equation     | BBox: (160, 780, 300, 40)  | Text: P(y|x) = \prod_{t=1}^{T} P(y_t | y_{<t}, x)...

The block types OCR 4 recognizes: text, title, list, table, image, equation, caption, code, references, aside_text, header, footer, and signature.

I use these labels in my chunking pipeline to split documents along structural boundaries: titles and headers start new chunks, tables and equations stay intact, and captions get attached to the preceding image or figure.
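The chunking rules above can be sketched roughly as follows. This is a simplified illustration, not my production chunker: it assumes blocks have already been reduced to plain (type, text) pairs, and the function name chunk_blocks is hypothetical.

```python
# Block types that open a new chunk, per the rule described above.
STARTS_CHUNK = {"title", "header"}


def chunk_blocks(blocks):
    """Group (type, text) pairs into chunks along structural boundaries.

    Each block is kept whole, so tables and equations are never split.
    """
    chunks = []
    for block_type, text in blocks:
        if block_type in STARTS_CHUNK or not chunks:
            chunks.append([])  # open a new chunk
        current = chunks[-1]
        if block_type == "caption" and current and current[-1][0] == "image":
            # Fold the caption into the image entry it describes.
            _, image_text = current.pop()
            current.append(("image", f"{image_text}\n{text}".strip()))
        else:
            current.append((block_type, text))
    return chunks


# Hypothetical block sequence for illustration.
blocks = [
    ("title", "Introduction"),
    ("text", "Body paragraph."),
    ("image", ""),
    ("caption", "Figure 1: Setup"),
    ("title", "Results"),
    ("table", "| a | b |"),
    ("equation", "E = mc^2"),
]
chunks = chunk_blocks(blocks)
for chunk in chunks:
    print([block_type for block_type, _ in chunk])
```

A real chunker would also enforce a size limit on text blocks; the point here is only that the type labels decide where boundaries fall.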

Step 4: Confidence Scores

Two granularity levels: per-page averages or per-word scores.

response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    confidence_scores_granularity="word"
)

page = response.pages[0]
scores = page.confidence_scores
print(f"Page avg:  {scores.average_page_confidence_score:.3f}")
print(f"Page min:  {scores.minimum_page_confidence_score:.3f}")

low_confidence = [
    w for w in scores.word_confidence_scores
    if w.confidence < 0.8
]
print(f"Low-confidence words: {len(low_confidence)}")
for w in low_confidence[:5]:
    print(f"  '{w.text}' → {w.confidence:.2f}")

For a clean digital PDF, you’ll see average scores above 0.95. Scanned documents or low-resolution faxes drop into the 0.7–0.85 range. I flag any page below 0.85 for human review. It takes 30 seconds to verify vs. hours to debug a bad extraction downstream.
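The review rule is easy to automate. Here is a minimal sketch that assumes you have already collected (page index, average confidence) pairs, for example from page.index and the average_page_confidence_score field shown above. The sample scores are invented.

```python
REVIEW_THRESHOLD = 0.85  # the cutoff used in this tutorial


def pages_needing_review(page_scores, threshold=REVIEW_THRESHOLD):
    """Return the indices of pages whose average confidence is below threshold."""
    return [index for index, score in page_scores if score < threshold]


# Hypothetical (page_index, average_confidence) pairs.
scores = [(0, 0.97), (1, 0.82), (2, 0.91), (3, 0.74)]
flagged = pages_needing_review(scores)
print(flagged)
```

Tune the threshold against your own documents; 0.85 is what works for my mix of digital PDFs and scans.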

For page-level scores only (faster, cheaper if you don’t need word-level detail):

response = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url", "document_url": "..."},
    confidence_scores_granularity="page"
)

Step 5: Structured Output With JSON Schema

This is the feature that made me port my whole pipeline. You pass a JSON schema alongside the document, and the model returns structured data that conforms to your spec.

Say you’re processing invoices. Define a schema:

invoice_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "invoice",
        "schema": {
            "type": "object",
            "properties": {
                "vendor_name": {"type": "string"},
                "invoice_number": {"type": "string"},
                "invoice_date": {"type": "string"},
                "due_date": {"type": "string"},
                "line_items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "quantity": {"type": "number"},
                            "unit_price": {"type": "number"},
                            "total": {"type": "number"}
                        },
                        "required": ["description", "quantity",
                                     "unit_price", "total"]
                    }
                },
                "subtotal": {"type": "number"},
                "tax": {"type": "number"},
                "total_due": {"type": "number"}
            },
            "required": ["vendor_name", "invoice_number",
                         "total_due"]
        }
    }
}

Then pass it as document_annotation_format:

response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://example.com/invoice.pdf"
    },
    document_annotation_format=invoice_schema,
    document_annotation_prompt="Extract invoice details from this document."
)

import json
invoice = json.loads(response.document_annotation)
print(json.dumps(invoice, indent=2))

Output:

{
  "vendor_name": "Acme Cloud Services",
  "invoice_number": "INV-2026-0847",
  "invoice_date": "2026-06-15",
  "due_date": "2026-07-15",
  "line_items": [
    {
      "description": "GPU compute (A100, 720 hrs)",
      "quantity": 720,
      "unit_price": 3.50,
      "total": 2520.00
    },
    {
      "description": "Object storage (2.4 TB)",
      "quantity": 2.4,
      "unit_price": 23.00,
      "total": 55.20
    }
  ],
  "subtotal": 2575.20,
  "tax": 489.29,
  "total_due": 3064.49
}

The document_annotation_prompt parameter lets you guide the extraction. For messy scans, a prompt like “This is a handwritten invoice. Amounts are in EUR. Dates use DD/MM/YYYY format.” makes a real difference. I’ve found that being specific about currency and date format in the prompt cuts extraction errors by roughly half on my invoice dataset.
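A prompt reduces errors but does not remove them, so it helps to check the arithmetic of whatever comes back. The sketch below is one possible sanity check, assuming the invoice is a plain dict shaped like the sample output above; the helper names money and reconcile are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def money(value):
    """Convert a JSON number (or Decimal) to a Decimal rounded to cents."""
    return Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def reconcile(invoice):
    """Return a list of arithmetic mismatches found in an extracted invoice."""
    problems = []
    items = invoice.get("line_items", [])
    for i, item in enumerate(items):
        expected = money(Decimal(str(item["quantity"])) * Decimal(str(item["unit_price"])))
        if expected != money(item["total"]):
            problems.append(f"line {i}: expected {expected}, got {money(item['total'])}")

    subtotal = invoice.get("subtotal")
    tax = invoice.get("tax")
    if subtotal is not None:
        line_sum = sum((money(item["total"]) for item in items), Decimal("0"))
        if line_sum != money(subtotal):
            problems.append(f"subtotal: lines sum to {line_sum}, got {money(subtotal)}")
    if subtotal is not None and tax is not None:
        expected_total = money(subtotal) + money(tax)
        if expected_total != money(invoice["total_due"]):
            problems.append(
                f"total_due: expected {expected_total}, got {money(invoice['total_due'])}"
            )
    return problems


# The sample invoice from the output above.
sample_invoice = {
    "line_items": [
        {"description": "GPU compute (A100, 720 hrs)", "quantity": 720,
         "unit_price": 3.50, "total": 2520.00},
        {"description": "Object storage (2.4 TB)", "quantity": 2.4,
         "unit_price": 23.00, "total": 55.20},
    ],
    "subtotal": 2575.20,
    "tax": 489.29,
    "total_due": 3064.49,
}
print(reconcile(sample_invoice))
```

An empty list means the line items, subtotal, tax, and total agree with each other. Any mismatch is a good candidate for the human review queue.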

Step 6: Pydantic Models for Type Safety

If you’re building a production pipeline (and you should be using uv to manage it), define the schema as a Pydantic model instead of a raw dict. The output of model_json_schema() slots into the same json_schema wrapper used in Step 5:

from pydantic import BaseModel

class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    total: float

class Invoice(BaseModel):
    vendor_name: str
    invoice_number: str
    invoice_date: str | None = None
    due_date: str | None = None
    line_items: list[LineItem] = []
    subtotal: float | None = None
    tax: float | None = None
    total_due: float

schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "invoice",
        "schema": Invoice.model_json_schema()
    }
}

response = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url",
              "document_url": "https://example.com/invoice.pdf"},
    document_annotation_format=schema,
    document_annotation_prompt="Extract all invoice fields."
)

invoice = Invoice.model_validate_json(response.document_annotation)
print(f"Vendor: {invoice.vendor_name}")
print(f"Total:  ${invoice.total_due:.2f}")
print(f"Items:  {len(invoice.line_items)}")

The Pydantic model gives you validation, default values, and type coercion out of the box. A missing due_date returns None instead of crashing your pipeline.

Step 7: Local Files and Selective Pages

For documents that aren’t publicly accessible via URL, encode them as base64 data URIs:

import base64
from pathlib import Path

pdf_path = Path("contract.pdf")
pdf_bytes = pdf_path.read_bytes()
pdf_b64 = base64.standard_b64encode(pdf_bytes).decode()

response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": f"data:application/pdf;base64,{pdf_b64}"
    },
    include_blocks=True,
    confidence_scores_granularity="page"
)

Yes, you pass base64 data URIs through the document_url field. The naming is a bit odd, but it works for PDFs, DOCX, PPTX, and images (PNG, JPEG, AVIF).
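If you handle both PDFs and images, a small helper can pick the MIME type and the document type for you. This is a sketch under assumptions: build_document is a hypothetical name, the MIME type is guessed from the file extension, and the "image_url" key for images is modeled on the document_url pattern, so verify it against the API reference before relying on it.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path


def build_document(path):
    """Build the `document` argument for a local file as a base64 data URI."""
    path = Path(path)
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None:
        # Older Python versions may not know newer extensions such as .avif.
        raise ValueError(f"Cannot determine MIME type for {path.name}")
    encoded = base64.standard_b64encode(path.read_bytes()).decode("ascii")
    uri = f"data:{mime};base64,{encoded}"
    if mime.startswith("image/"):
        # Assumed shape for images; check the API reference.
        return {"type": "image_url", "image_url": uri}
    return {"type": "document_url", "document_url": uri}


# Demo with a throwaway file so the sketch runs without a real PDF.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "contract.pdf"
    sample.write_bytes(b"%PDF-1.4 placeholder")
    document = build_document(sample)

print(document["type"], document["document_url"][:28])
```

The returned dict can be passed as the document argument of client.ocr.process in place of the hand-built one above.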

For large documents, you can also limit processing to specific pages:

response = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url",
              "document_url": "https://example.com/report.pdf"},
    pages="0-4"
)

The pages parameter accepts a range string ("0-4") or a list of integers ([0, 2, 7]). Pages are zero-indexed. This cuts your costs proportionally. If a 100-page report only has relevant data on pages 3–7, you pay for 5 pages instead of 100.
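Zero-indexing is an easy place to slip by one page. Here is a small sketch that converts a human-readable, 1-based page range into the zero-indexed list form. The helper name is hypothetical, and it assumes the "pages 3–7" in the example refers to page numbers as a reader would count them.

```python
def zero_indexed_pages(first, last):
    """Convert a 1-based inclusive page range to a zero-indexed list."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return list(range(first - 1, last))


pages = zero_indexed_pages(3, 7)
print(pages)       # the list to pass as the pages argument
print(len(pages))  # number of pages you would be billed for
```

For the 100-page report in the example, that is 5 billed pages out of 100, or 5% of the full-document cost.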

Pricing

$4 per 1K pages (API) | $2 per 1K pages (Batch) | 72% blind test win rate | 170 languages supported

Tier	Cost	Use Case
Standard API	$4 / 1K pages	Real-time extraction, low latency
Batch API	$2 / 1K pages	Bulk processing, higher latency
Document AI	$5 / 1K pages	Structured output + annotations
Self-hosted	Contact sales	On-prem, data residency

The Document AI tier adds the structured annotation features (JSON schema extraction, bbox annotations). If you only need raw text and tables, the standard $4 tier works fine.

For comparison: AWS Textract charges $1.50 per 1,000 pages for plain text and $15 per 1,000 pages for tables. Google Document AI runs $1.50–$30 per 1,000 pages depending on the processor. Mistral’s pricing sits in a competitive range, especially with the batch discount.
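To see what these rates mean for a given workload, here is a minimal cost estimator. The rates are copied from the table above; the tier keys and the function name are hypothetical, and the 25,000-page volume is just an example.

```python
from decimal import Decimal

# USD per 1,000 pages, copied from the pricing table in this article.
RATES = {
    "standard": Decimal("4"),
    "batch": Decimal("2"),
    "document_ai": Decimal("5"),
}


def estimate_cost(pages, tier):
    """Estimate the cost in USD of processing `pages` pages on a given tier."""
    return RATES[tier] * pages / 1000


monthly_pages = 25_000  # example volume
for tier in RATES:
    print(f"{tier:12s} ${estimate_cost(monthly_pages, tier):.2f}")
```

At that volume, moving latency-tolerant jobs from the standard tier to batch halves the bill.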

What OCR 4 Changed From OCR 3

Feature	OCR 3	OCR 4
Bounding boxes	No	Yes (paragraph-level)
Block classification	No	13 block types
Confidence scores	No	Per-word and per-page
Structured JSON output	Via separate model call	Built-in via `document_annotation_format`
Self-hosting	Yes (contact sales)	Single container deployment
Supported formats	PDF, images	PDF, DOC, PPT, OpenDocument, images
Languages	30+	170 across 10 language groups
Batch API	Yes ($1/1K pages)	Yes ($2/1K pages)

OCR 4 goes beyond text extraction. The bounding boxes and block types turn it into a document intelligence model that returns a structured representation of each page, not just raw text.

When OCR 4 Struggles

I’ve run about 400 pages through it over the past two days. Three patterns caused trouble:

Handwritten text tanks confidence scores below 0.6, and the extraction starts hallucinating words. If your pipeline handles handwritten documents regularly, you’ll need a human review step.
Multi-column academic papers sometimes get their reading order scrambled. The bounding boxes are correct, but the markdown concatenation can interleave paragraphs from different columns instead of flowing top-to-bottom within each.
Equations in low-resolution scans lose their LaTeX reconstruction below 200 DPI. The equation block types are still labeled correctly, but the content inside them gets garbled.
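Because the bounding boxes stay correct even when the markdown order does not, one workaround for the multi-column case is to rebuild the reading order from the boxes. The sketch below is a rough heuristic for a strict two-column layout, not something the API provides. It assumes blocks are dicts with x, y, width, and height in the same units as page_width, with y growing downward. Full-width blocks such as titles would need separate handling.

```python
def reorder_two_columns(blocks, page_width):
    """Sort blocks left column first, then right, top-to-bottom within each."""
    midline = page_width / 2

    def sort_key(block):
        center_x = block["x"] + block["width"] / 2
        column = 0 if center_x < midline else 1
        return (column, block["y"])

    return sorted(blocks, key=sort_key)


# Hypothetical blocks in the interleaved order described above.
interleaved = [
    {"text": "L1", "x": 50, "y": 100, "width": 300, "height": 80},
    {"text": "R1", "x": 450, "y": 100, "width": 300, "height": 80},
    {"text": "L2", "x": 50, "y": 200, "width": 300, "height": 80},
    {"text": "R2", "x": 450, "y": 200, "width": 300, "height": 80},
]
ordered = reorder_two_columns(interleaved, page_width=800)
print([block["text"] for block in ordered])
```

Joining the text of the reordered blocks gives a column-aware version of the page that you can use instead of page.markdown for affected documents.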

For clean digital PDFs, invoices, contracts, and presentations, it works extremely well. I’d estimate 80% of the documents I process fall into this category, and for those, the extraction quality is better than anything I’ve gotten from Textract or Document AI without post-processing.

FAQ

How do I install the Mistral OCR Python SDK?

Run pip install mistralai. The OCR functionality is part of the main SDK, no separate package needed. You’ll need a Mistral API key from console.mistral.ai.

What’s the difference between Mistral OCR 3 and OCR 4?

OCR 4 adds bounding boxes, block classification (13 types), per-word and per-page confidence scores, built-in structured JSON output, support for DOC/PPT/OpenDocument formats, and 170 languages (up from 30+). Both versions offer a batch API and self-hosting; OCR 4’s self-hosted option ships as a single container. OCR 3 only extracted text and tables.

Can Mistral OCR 4 process images, not just PDFs?

Yes. OCR 4 accepts PNG, JPEG, and AVIF images via the image_url document type. The same include_blocks and confidence_scores_granularity parameters work on images.

How does Mistral OCR 4 compare to AWS Textract?

Mistral OCR 4 costs $4 per 1,000 pages for text + tables (vs. Textract’s $15 per 1,000 for tables). OCR 4 returns bounding boxes and block types in a single call. Textract requires separate API calls for different analysis types (text, tables, forms, queries). Mistral’s structured JSON output via document_annotation_format eliminates the need for a separate LLM call to parse the extraction.

Can I self-host Mistral OCR 4?

Yes. Mistral offers a single-container deployment for enterprise customers who need data residency. Contact their sales team for pricing; it’s not available on the standard API tier.

Sources

Mistral OCR 4 announcement — official launch post with benchmarks and pricing
Mistral OCR processor documentation — API reference and Python SDK examples
Mistral OCR data extraction cookbook — structured output examples with annotations
Mistral OCR API endpoint reference — full parameter specification

Bottom Line

Before OCR 4, I was stitching together three services to get bounding boxes, block types, and structured JSON extraction from one document. Now it’s a single API call. The Python SDK is clean, the pricing is reasonable at $2–5 per 1,000 pages, and the 72% blind-test win rate tracks with what I’ve seen processing real documents.

The weak spots are real: handwritten text, multi-column reading order, and low-res equation scans still need human review. But for the 80% of documents that are clean-enough PDFs with mixed content, OCR 4 handles the structural extraction that tools like MarkItDown and Docling punt on, and the confidence scores alone save hours of debugging. If you want to expose this pipeline as a tool for coding agents, wrapping it in a FastMCP server takes about 20 minutes.
