TL;DR
DuckDB and Polars are the two tools that finally made “just use Pandas” a bad default for anything past a gigabyte. DuckDB is an embedded SQL engine that queries Parquet and CSV in place and sips memory. Polars is a Rust-backed DataFrame library with a lazy API that flies on transform-heavy Python pipelines. If your team thinks in SQL and your data lives in files, lean DuckDB. If your pipeline is a long chain of joins, group-bys, and expressions written in Python, lean Polars. Most serious pipelines I’ve built in the last year use both, and they compose cleanly over Arrow with zero copy.
Why this comparison, and why now
I spent most of 2025 slowly deleting Pandas from a batch pipeline that ingests a few hundred gigabytes of Parquet a day. It took two tools to replace it: DuckDB for the read-and-aggregate front end, Polars for the messy per-row transforms downstream. Getting there meant running both against the same data, on the same box, more times than I’d like to admit.
That experience is the lens for this piece. Both projects shipped new releases in June 2026: DuckDB 1.5.4 on June 17, Polars 1.42.1 on June 30. So the usual “here’s a 2024 benchmark” articles are already out of date. I’ll pull real published numbers where I have them, flag them as reported, and tell you where my own hands-on experience lines up with the benchmarks and where it doesn’t.
If you’re still weighing the DataFrame side of this, I wrote a separate Polars vs Pandas breakdown that covers the migration story in more detail. This article is about the newer, sharper question: once you’ve outgrown Pandas, do you reach for a SQL engine or a DataFrame library?
The 30-second mental model
DuckDB is SQLite for analytics. It runs in your process, has no server, and its happy place is scanning columnar files and answering OLAP questions with SQL. You point it at a folder of Parquet and write SELECT.
Polars is a DataFrame library, closer in spirit to Pandas, but built in Rust with a query optimizer underneath. You write method chains in Python, and its lazy mode plans the whole computation before touching data.
Here’s the same task in both. Say you want the top five customers by revenue from a directory of Parquet files.
import duckdb

duckdb.sql("""
    SELECT customer_id, sum(amount) AS revenue
    FROM 'data/orders/*.parquet'
    GROUP BY customer_id
    ORDER BY revenue DESC
    LIMIT 5
""").show()

import polars as pl

(
    pl.scan_parquet("data/orders/*.parquet")
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("revenue"))
    .sort("revenue", descending=True)
    .head(5)
    .collect()
)
Both read the files lazily, both push the aggregation down, and on my laptop both finish a 12 GB version of this in single-digit seconds. The difference is entirely which one reads more naturally to you and your team. That sounds like a soft criterion, but it’s the one that will actually decide your choice, so I’m putting it up front.
Feature comparison at a glance
| Dimension | DuckDB 1.5.4 | Polars 1.42.1 |
|---|---|---|
| Core interface | SQL (plus Python/R/Java/etc. bindings) | DataFrame method chains (Python/Rust) |
| Written in | C++ | Rust |
| License | MIT | MIT |
| Execution | Vectorized, out-of-core with spill-to-disk | Vectorized, streaming engine (stable in 2026) |
| Larger-than-RAM data | Yes, automatic spill | Yes, via the streaming engine |
| Reads files in place | Parquet, CSV, JSON; Iceberg and Delta via extensions | Parquet, CSV, JSON; Iceberg and Delta via optional dependencies |
| Query optimizer | Yes | Yes (lazy mode) |
| Persistent storage | Native .duckdb database + DuckLake | No, it’s compute-only |
| Best fit | SQL analytics, ad-hoc file querying | Transform-heavy Python pipelines |
Both are MIT-licensed and both run entirely in-process, which is the whole reason this comparison exists. Neither needs a cluster, a server, or a JVM babysitter.
Performance: closer than the marketing suggests
The headline is that on modern hardware, for most workloads under a few hundred gigabytes, these two finish within a rounding error of each other. The interesting differences show up at the tails.
The most thorough public benchmark I trust is codecentric’s Parquet test, which scaled a single table from 2 GB up to 2 TB. On the largest workload, DuckDB finished in roughly 45 seconds against Polars’ approximately one minute in default mode, and Polars slowed to near 100 seconds when async reads were forced on. On a single 140 GB file, DuckDB held a reported edge of about one second. That’s a real win for DuckDB, but it’s not the 10x gap that benchmark headlines love to imply.
Operation type flips the result, though. Across independent benchmarks the pattern is consistent: Polars tends to win CSV reads and joins, DuckDB pulls ahead on window functions, and group-bys land close to even. MotherDuck’s own run on a roughly 5 GB dataset clocked DuckDB at 2.3 seconds against Polars’ 3.3 seconds, with Pandas running out of memory entirely. The answer to “which is faster” depends on what your pipeline spends its time doing.
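Since the answer depends on where your pipeline spends its time, it’s worth measuring that per stage before arguing about engines. Here is a minimal timing harness, standard library only. The `time_stages` helper and the stage names are hypothetical, and the lambdas are stand-in workloads; in practice you would pass callables that run your real DuckDB query or Polars plan.

```python
import time
from typing import Callable


def time_stages(
    stages: dict[str, Callable[[], object]], repeats: int = 3
) -> dict[str, float]:
    """Return the best wall-clock time in seconds for each named stage."""
    timings = {}
    for name, run in stages.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run()
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings


# Stand-in workloads; swap in your real DuckDB and Polars stages.
timings = time_stages(
    {
        "scan_stage": lambda: sum(range(10_000)),
        "join_stage": lambda: sorted(range(10_000), reverse=True),
    }
)
slowest = max(timings, key=timings.get)
```

Taking the best of a few repeats filters out some cold-cache noise, but it is still a rough measurement, not a benchmark.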
My own experience matches the shape of this. When I swapped a group-by-heavy stage from DuckDB SQL to a Polars lazy plan, wall-clock time barely budged. The real wins came from dropping Pandas; choosing between these two was almost a wash.
Memory is the real divide
Speed comes out roughly even. Memory is where the gap opens, and on constrained hardware it’s the number that settles your choice.
In that same codecentric test, DuckDB peaked around 1.3 GB on the 140 GB file. Polars in its default mode peaked near 17 GB. That looks damning until you read the footnote: Polars defaults to memory-mapped I/O for local Parquet, so the operating system loads file pages on demand and the number you see is inflated by page cache, not true working set. Force async reads and Polars drops to around 750 MB on the same file, below DuckDB.
There’s a more important lesson buried in that benchmark. Partitioning your data cut DuckDB’s peak memory by about 8x and Polars’ by about 4x. File layout mattered more than engine choice. If you take one thing from this section, make it that: a well-partitioned dataset will save you more RAM than agonizing over which engine to run.
Where DuckDB has a durable edge is spill-to-disk. It’s automatic and it just works: when a query needs more memory than you have, DuckDB writes intermediate state to a temp directory and keeps going. Polars stabilized its streaming engine in 2026 for out-of-core execution, and it’s solid now, but you have to opt into it and structure your query to stay streamable. DuckDB’s version is the one you get for free.
SQL vs DataFrame: the ergonomics fork
This is the decision that usually settles it, and performance has almost nothing to do with it.
DuckDB gives you SQL. If your team already writes SQL against Snowflake or BigQuery, DuckDB is almost zero onboarding cost. The dialect is PostgreSQL-flavored rather than identical to either warehouse, but the joins, CTEs, and window functions you already write carry over, pointed at files instead of a warehouse. Joining three Parquet files with a window function is a query you already know how to write. And because it’s SQL, it’s readable by analysts who will never touch a Python method chain.
Polars gives you expressions. Once the API clicks, the chaining is a real pleasure to write, and the type system catches mistakes that SQL happily runs. Complex per-row logic (conditional columns, string manipulation, custom windowing) reads cleaner as Polars expressions than as nested SQL CASE statements. Here’s a transform that’s awkward in SQL and natural in Polars:
import polars as pl

df = (
    pl.scan_parquet("data/events/*.parquet")
    # Gaps are only meaningful when each user's events are in time order.
    .sort("user_id", "ts")
    .with_columns(
        session_gap=(pl.col("ts") - pl.col("ts").shift(1))
        .over("user_id")
        .dt.total_seconds(),
    )
    .with_columns(
        is_new_session=(pl.col("session_gap") > 1800).fill_null(True)
    )
    .with_columns(
        session_id=pl.col("is_new_session").cum_sum().over("user_id")
    )
    .collect()
)
That’s sessionization: assign a new session ID whenever a user is idle for more than 30 minutes. It’s three tidy expression blocks in Polars. The SQL equivalent works, but it’s a tower of window functions that most people copy-paste and pray over.
The flip side: exploratory “let me just count these grouped by that” work is faster to type as SQL. I keep a DuckDB REPL open for exactly this kind of poking around.
They’re better together
The framing of “DuckDB vs Polars” is mostly a headline convenience. In practice they share Apache Arrow as an in-memory format, so handing data from one to the other is a zero-copy operation. You don’t have to choose.
import duckdb
import polars as pl

# Heavy scan + aggregation in DuckDB, straight into a Polars frame
orders = duckdb.sql("""
    SELECT customer_id, product, sum(amount) AS revenue
    FROM 'data/orders/*.parquet'
    WHERE order_date >= '2026-01-01'
    GROUP BY customer_id, product
""").pl()

# Continue in Polars for the transform-heavy part
result = (
    orders
    .with_columns(
        revenue_rank=pl.col("revenue").rank("dense", descending=True).over("customer_id")
    )
    .filter(pl.col("revenue_rank") <= 3)
)
.pl() on a DuckDB relation returns a Polars DataFrame without a serialization round-trip. It goes the other way too. You can register a Polars frame as a DuckDB table and run SQL on it. My production pipeline does exactly the first pattern: DuckDB owns the file scan and the coarse aggregation, Polars owns the ranking and business logic. Each does the part it’s best at.
This is the pattern I lean on whenever a pipeline feeds structured output downstream. The document pipeline I built with Mistral OCR, for instance, dumps its extracted rows straight into DuckDB for aggregation and hands the result to Polars. Arrow-native tools are my default now for exactly that reason.
Recommendation matrix
| Your situation | Reach for |
|---|---|
| Team writes SQL, data lives in Parquet/CSV files | DuckDB |
| Ad-hoc querying, notebooks, “just count this” | DuckDB |
| Larger-than-RAM data, don’t want to think about it | DuckDB (automatic spill) |
| Transform-heavy Python pipeline (joins, expressions, windowing) | Polars |
| You want static typing and IDE autocomplete on transforms | Polars |
| Replacing a Pandas ETL script | Polars (closest migration path) |
| Sessionization, custom per-row logic, feature engineering | Polars |
| Serious production pipeline | Both, over Arrow |
Where each one still bites
Neither tool is finished. DuckDB’s Python API is thinner than Polars’ when you want to stay in DataFrame-land. You’re either writing SQL strings or using its relational API, which is less expressive than Polars expressions. Its error messages for malformed SQL over globbed files can be cryptic.
Polars’ pain is API churn and the learning curve. Coming from Pandas, the expression model takes a week to stop fighting, and .over() versus .group_by() semantics trip up almost everyone at first. The 2.0 roadmap is in discussion, which is exciting but also a signal that some APIs will shift. And the memory-mapping default I mentioned earlier will surprise you the first time you watch RAM spike on a file you thought was tiny.
For the wider Python tooling picture (package managers, type checkers, and the rest of the modern stack these two slot into), our Modern Python Tooling guide ties it all together, and the uv vs pip vs Poetry comparison covers how I install both in a reproducible environment.
FAQ
Is DuckDB faster than Polars?
For most workloads they’re within a rounding error. On very large Parquet scans DuckDB has a modest reported edge (about 45s vs roughly 60s on a 2 TB test, and about one second on a single 140 GB file) and tends to lead on window functions. Polars tends to lead on CSV reading and joins. Operation type decides it, not the engine name.
Which uses less memory, DuckDB or Polars?
DuckDB, by default. Reported peaks were around 1.3 GB where Polars hit 17 GB on the same 140 GB file. But Polars’ number is inflated by memory-mapped I/O; force async reads and it drops to around 750 MB, below DuckDB. DuckDB’s durable advantage is automatic spill-to-disk when a query outgrows RAM. Partitioning your data cuts both dramatically.
When should I use DuckDB vs Polars?
Use DuckDB when you think in SQL and your data lives in files, or when you want larger-than-RAM handling for free. Use Polars when your pipeline is a chain of Python transforms (joins, expressions, custom windowing) and you want static typing and autocomplete.
Can DuckDB and Polars work together?
Yes, and you should let them. They share Apache Arrow in memory, so passing data between them is zero-copy. Call .pl() on a DuckDB result to get a Polars frame, or register a Polars frame as a DuckDB table and query it with SQL. A common pattern is DuckDB for the file scan and aggregation, Polars for the downstream transforms.
Do I still need Pandas?
For small data, notebooks, and the enormous pile of libraries that expect a Pandas DataFrame, yes. But for anything performance-sensitive past a gigabyte, both DuckDB and Polars will run circles around it. Most of my new pipelines don’t import Pandas at all.
Is Polars or DuckDB better for large datasets?
Both handle larger-than-RAM data now. DuckDB does it automatically with spill-to-disk. Polars does it through its streaming engine, which stabilized in 2026 but requires you to opt in and keep your query streamable. For the “I don’t want to think about it” case, DuckDB wins.
Bottom line
Stop thinking of this as a fight. DuckDB is the best way to query files with SQL and the safest choice when memory is tight. Polars is the best way to write fast, typed, transform-heavy pipelines in Python. They run in the same process, over the same Arrow buffers, and the strongest data pipelines I’ve shipped this year use each for the half it’s good at. Pick based on how your team thinks, SQL or Python, and reach for the other when the first one gets awkward. I’ve stopped expecting one of them to win outright, because in a real pipeline they end up splitting the work anyway.
Sources
- DuckDB official site and 1.5.4 release notes — current version, license, and feature set
- Polars on PyPI (1.42.1) — current version, release date, and license
- DuckDB vs. Polars: Performance & Memory on Parquet Data — codecentric — the 2 GB–2 TB benchmark this post’s numbers are drawn from
- DuckDB vs Pandas vs Polars — MotherDuck — operation-by-operation speed comparison
- Integration with Polars — DuckDB docs — the Arrow-based interop between the two