TL;DR
Polars 1.x is the better engine on every benchmark above ~1 GB. Joins land 3-9× faster, group-bys 5-10×, CSV reads 5×. Pandas 2.x still wins where it always did: small interactive datasets, anything that has to round-trip through scikit-learn or matplotlib, and string-heavy work. After porting two production pipelines this year I run both: Polars for the bulk transforms, Pandas for the last mile. The “should I rewrite everything?” answer is no.
Why this comparison keeps coming up
Polars hit 1.0 in mid-2024. By early 2026 it’s at 1.x with a stable API, lazy execution that actually works, and a streaming engine that handles tables bigger than RAM. The performance gap was always real, but in 2024 the API kept moving and the surrounding tooling was sparse. Now neither of those is a blocker, which is why every Python data team I talk to is asking the same question: do we keep paying the Pandas tax?
I spent the last two months running a serious comparison on a real workload: about 240 million rows of clickstream data spread across 18 Parquet files, with the kinds of joins, aggregations, and filtering that show up in actual ETL. Below is what I measured and where I landed.
At a glance
| Dimension | Pandas 2.2 | Polars 1.18 |
|---|---|---|
| Engine | NumPy / PyArrow backend | Rust + Apache Arrow |
| Parallelism | Single-threaded by default | Multi-core by default |
| Execution | Eager only | Eager + lazy with query optimizer |
| Memory model | NumPy block manager, object dtype for mixed data | Strict columnar (Arrow) |
| Streaming (>RAM) | Not natively | Yes, via LazyFrame.collect(engine="streaming") |
| API style | Method chains + indexer madness | Expressions, no .loc/.iloc |
| Null handling | NaN-as-null mess | First-class null |
| ML libraries | Native everywhere | Convert to Pandas/Arrow |
| Plotting | matplotlib, seaborn, plotly all native | to_pandas() first, mostly |
That table summarizes most of the friction. The performance difference is in the numbers below.
The benchmarks I actually ran
Workload: 240M-row Parquet dataset, 7 numeric columns, 3 string columns, ~14 GB on disk. Hardware: M2 Pro, 16-core, 32 GB RAM. Each operation ran 5 times; I report the median.
| Operation | Pandas 2.2 | Polars 1.18 (eager) | Polars 1.18 (lazy) | Speedup (lazy) |
|---|---|---|---|---|
| Read 14 GB Parquet | 41.2 s | 9.1 s | 8.7 s | 4.7× |
| Filter rows (single predicate) | 3.8 s | 0.71 s | 0.34 s | 11× |
| Group-by + 4 aggregates | 18.4 s | 2.9 s | 1.8 s | 10× |
| Inner join (5M × 240M) | 22.6 s | 3.4 s | 2.1 s | 10.7× |
| Sort by 2 columns | 14.1 s | 1.3 s | 1.3 s | 10.8× |
| String contains + filter | 6.2 s | 4.9 s | 4.6 s | 1.3× |
| Window function | 11.7 s | 1.6 s | 1.1 s | 10.6× |
| Write Parquet | 24.8 s | 6.4 s | 6.4 s | 3.9× |
A few observations from running this. First, lazy mode is not optional. Eager Polars is already fast, but the query optimizer routinely shaved another 30-60% off by reordering filters, pushing predicates into the Parquet reader, and skipping unused columns. Second, the string operation gap is small. If your pipeline is 80% regex parsing, the speedup story falls apart. Third, the join numbers held even when one side was big enough to make Pandas swap to disk.
What the API actually looks like
This is where most of the “should we switch” debate lives. People look at one cherry-picked snippet and conclude either “trivial” or “rewrite everything.” Neither is right.
Here’s the same operation in both: read a CSV, filter to a date window, group by user and event type, average a value, sort the result.
```python
# Pandas
import pandas as pd

df = pd.read_csv("events.csv", parse_dates=["ts"])
out = (
    df[df["ts"].between("2026-01-01", "2026-03-31")]
    .groupby(["user_id", "event_type"], as_index=False)["value"]
    .mean()
    .sort_values("value", ascending=False)
)
```
```python
# Polars (lazy)
import polars as pl

out = (
    pl.scan_csv("events.csv", try_parse_dates=True)
    .filter(pl.col("ts").is_between(pl.date(2026, 1, 1), pl.date(2026, 3, 31)))
    .group_by(["user_id", "event_type"])
    .agg(pl.col("value").mean())
    .sort("value", descending=True)
    .collect()
)
```
The shape is similar. The differences that bite during a port:
- Polars uses expressions (`pl.col("value").mean()`) where Pandas uses string column names plus a method on a Series. Expressions compose, which means complex transforms become readable; the cost is a learning curve and a lot of `pl.col(...)` typing.
- No `.loc`, `.iloc`, or boolean masking on the frame itself. Filtering is `.filter(expr)`, period. After two months I miss boolean-mask filtering exactly zero times; `.filter` is clearer.
- `scan_*` is lazy, `read_*` is eager. Use `scan_*` for anything non-trivial; `collect()` runs the optimized plan.
- Date handling is its own thing. `try_parse_dates=True` works most of the time but occasionally needs `.str.to_datetime()` afterward.
When Pandas is the right answer
I’ve watched too many teams rewrite working pipelines because Polars looked cool. A few cases where I push back on the switch:
Small interactive notebooks. If your DataFrame fits in 1 GB and your team thinks in .loc, the speedup is microseconds you’ll never notice. The real bottleneck is Jupyter cell execution and developer fluency.
ML pipelines that touch scikit-learn, statsmodels, or anything plotly. These libraries return Pandas, expect Pandas, and document Pandas. Yes, Polars has .to_pandas() and zero-copy via Arrow, but you’ll be calling it constantly, and every conversion is a tax on readability. If you’re doing more to_pandas() than aggregation, you’re using the wrong tool.
Heavy string manipulation. Polars has solid string operations now, but Pandas with PyArrow string dtype (dtype_backend="pyarrow" since 2.0) has closed most of the gap. For complex regex extraction, I’ve seen Pandas come out ahead by 10-20%.
Code your team already understands. The API gap is real. A senior who can write any Pandas operation in their sleep needs 2-3 weeks of daily use to get fluent in Polars expressions. If the pipeline runs in 90 seconds and isn’t on the critical path, “fast enough” is fast enough.
When Polars is obviously right
The flip side. These are the cases where I rewrite without hesitating:
ETL on tables larger than 5 GB. This is where the 10× speedup compounds. A 30-minute Pandas job becomes a 3-minute Polars job and your pipeline goes from “schedule it overnight” to “run on demand.”
Anything memory-constrained. Streaming via LazyFrame.collect(engine="streaming") lets you process tables that don’t fit in RAM. Pandas can’t do this without DuckDB, Dask, or chunking gymnastics.
Pipelines with many sequential transforms. The lazy query planner reorders, fuses, and prunes. The same code in Pandas materializes intermediate results at every step, which is both slow and a memory liability.
New code with no Pandas debt. If you’re starting a project today and the data is non-trivial, Polars is the default. The API is more consistent, null semantics are sane, and you don’t have to remember which methods mutate.
The “use both” pattern
After two months of porting I landed on a pattern most teams will recognize once they’ve tried it: Polars for the bulk transform, Pandas for the last mile. Concretely:
```python
import polars as pl
import seaborn as sns

# Heavy lifting in Polars
result = (
    pl.scan_parquet("events/*.parquet")
    .filter(pl.col("ts") > "2026-03-01")
    .group_by(["country", "device"])
    .agg([
        pl.col("revenue").sum().alias("rev"),
        pl.col("user_id").n_unique().alias("users"),
    ])
    .sort("rev", descending=True)
    .collect(engine="streaming")
)

# Hand off to the ML / plotting world
df = result.to_pandas()
sns.barplot(df.head(20), x="rev", y="country", hue="device")
```
Zero-copy through Arrow means to_pandas() is effectively free for numeric and Arrow-backed string columns. The cost shows up only when you’re converting list or struct columns, which most analytics workloads don’t have.
Migration cost, honestly
I’d been told the migration was “trivial, APIs are similar.” That was half-true. Here’s what actually took time on a 4,000-line internal ETL package:
- Index removal. Pandas indexes don’t translate. Anywhere we used the index for joins or alignment, we had to make it an explicit column. About 200 lines of changes, mostly mechanical.
- NaN vs null. Pandas treats NaN as null in some contexts but not others; Polars has real null. Everywhere we relied on `.isna()` semantics needed verification.
- `apply`/`transform` patterns. These mostly become `pl.col(...).map_elements()`, or much better, native expressions. About 30% of our `.apply` calls turned out to be expressible natively, which made them 50-100× faster. The rest still work but lose most of the speedup.
- Test fixtures. All our test data was Pandas. Rewriting fixtures took longer than the actual code port.
Total time: about 9 working days for a senior who already knew Polars. Net result: the daily ETL went from 47 minutes to 4 minutes, and we cut the EC2 instance size in half. Worth it. Would not have been worth it for a smaller pipeline.
Common pitfalls I hit
A short list, in case you’re starting:
- `pl.col` everywhere is verbose. Use `pl.col(["a", "b", "c"])` to act on multiple columns at once, and `pl.exclude(...)` to act on everything else.
- `group_by` doesn’t sort by default. Pandas does, Polars doesn’t. Add `.sort(...)` after if you care about deterministic ordering.
- `with_columns` returns a new frame. Forgetting to assign the result is the most common bug during a port. Polars frames are immutable.
- String columns with PyArrow backend in Pandas are a real thing. If you haven’t tried `dtype_backend="pyarrow"` on your `read_*` calls, do that before benchmarking against Polars. It closes the string gap meaningfully.
- Plot libraries hate Polars frames. Always `.to_pandas()` before plotting. The conversion is fast.
Decision flow
Here’s the actual question I ask when a new pipeline lands on my desk.
```mermaid
flowchart TD
    A[New data pipeline] --> B{Data size}
    B -->|< 1 GB| C[Use Pandas]
    B -->|1-10 GB| D{Hot path?}
    B -->|> 10 GB| E[Use Polars]
    D -->|Yes| E
    D -->|No| F{Team familiar with Polars?}
    F -->|Yes| E
    F -->|No| C
    E --> G{Need ML / plotting?}
    G -->|Yes| H[Polars for transforms,<br/>to_pandas for last mile]
    G -->|No| I[Polars end to end]
```
That covers about 90% of the calls. The remaining 10% are odd corner cases (heavy regex, geospatial work, libraries that only speak Pandas) and they usually answer themselves once you spike the conversion cost.
FAQ
Is Polars faster than Pandas?
For most operations on datasets above 1 GB, yes. Typically 5-10× on group-bys and joins, up to 11× on sorting and filtering. The gap shrinks on small data and on heavy string manipulation. Pandas with the PyArrow backend has closed most of the string gap.
Should I switch from Pandas to Polars?
Switch if you’re starting a new pipeline, working with tables above 5 GB, or hitting memory limits. Don’t switch a working sub-second Pandas script just because Polars exists. The migration cost is real, especially around index removal and NaN-vs-null semantics.
When should I use Polars instead of Pandas?
Use Polars when data is large (>5 GB), when you need streaming/lazy evaluation, when you care about ETL latency, or when you’re starting fresh. Stick with Pandas for interactive small-data work, ML pipelines that depend on scikit-learn, and code your team already understands.
Does Polars work with scikit-learn?
Not directly. Most scikit-learn estimators expect NumPy arrays or Pandas DataFrames. Polars has .to_pandas() and .to_numpy() for the handoff, both of which are zero-copy for numeric columns thanks to the Arrow backend. The pattern is: Polars for preprocessing, convert at the boundary, fit with scikit-learn.
What are the main differences between Polars and Pandas?
Polars is multi-threaded, columnar (Arrow), and supports lazy evaluation with a query optimizer. Pandas is single-threaded by default, has a NumPy-rooted memory model, and only supports eager evaluation. Polars uses an expressions API; Pandas uses indexers and method chains. Polars has first-class null handling; Pandas has the NaN-as-null mess.
Is Pandas being replaced by Polars?
Not in 2026, and probably not soon. Pandas has 15 years of library inertia. Every Python data tutorial, ML library, and BI tool speaks it. Polars is winning the new-pipeline battle and the performance battle, but Pandas isn’t going anywhere. The realistic outcome is coexistence: Polars where you need speed, Pandas where the surrounding libraries are.
Bottom line
Polars in 2026 is the right default for new data pipelines and a serious option for any production ETL above a few GB. The 10× speedup is real and durable across the operations you actually run in production. The migration cost is also real and shouldn’t be hand-waved.
If you’re sitting on Pandas pipelines that run fine, leave them alone. If you’re writing something new, start in Polars. If you’re somewhere in between (a slow daily job that nobody loves), port it and reclaim the time. Just don’t expect the API to be a drop-in. Treat it like learning a sibling language rather than a renamed library.
If you want to see how the Python tooling story is shifting more broadly, I wrote about uv vs pip vs Poetry for package management and Python 3.14 free-threading benchmarks. Polars and free-threading actually compose better than I expected, and that’s worth its own post.
