Skip to content

Essentials of polars for pandas experts

Abstract

pandas is a standard tool for every data professional, although it does not scale well in production. Yet, being a standard is a strategic position to be, as libraries coming to solve the scale issue tend to meet data professionals where they are, by mimicing the pandas API (think: dask, pyspark.pandas).

polars is a new-ish tool that is probably replacing pandas at the time of writing. The goal of this post is to introduce the kind of mindset change needed to fully exploit polars in production.

What does a dataframe library do? A few things came to mind.

  • merge/join
  • group by
  • aggregate
  • windows function
  • rolling windows
  • etc

There are a plethora of dataframe libraries doing all these things. Yet polars seems to me a clear winner in the game of "finding the successor of pandas".

Update Feb 2025

polars may well replace pyspark with their annoucement of the cloud offering for vertical and horizontal scaling. Indeed, polars solves the same scaling problem as pyspark does and its API is very close to pyspark.sql although their implementations are very different.

Born in 2020, polars released its version 1.0 in mid-2024, officially marking its production readiness. A popular saying about polars is that people "came for the speed, stay for the syntax". This pronounces two ways in which polars is awesome

  • fast
  • elegant syntax: both intuitve and expressive.

The goal of this tutorial is to present some basic concetps for effective use of the library. But first, let's briefly mention

Why polars is fast in a nutshell

The speed is achieved by lazy execution, query optimizer, vectorized query engine, parallelism. It would be too arrogant of me to claim that I can explain all these terms with complete precision. The interested reader is invited to watch talks1 2 3 by the creator of the library Ritchie Vink for details.

In a nutshell,

  • Lazy execution builds the computation graph of data transformations without loading any data into memory. Think of this as composition of functions where no input is required (the schema of the input must be known though).
  • Query optimizer optimizes user's query, making them more efficient. The laziness leaves the room for optimising user's query e.g. changing the order of certain operations, fusing them, and all sorts of smart tricks that can boost the computation efficiency before any data is loaded. This is similar to what machine learnig compilers would do (think torch.compile and jax.jit).
  • Vectorized query engine leverages columnar memory format (arrow) and hardware optimizations (e.g. SIMD) for array manipulations.
  • Parallelism is a paradigm to distribute computation workloads effectively across all the CPU/GPU cores.

Eager vs Lazy

Eager is the "opposite" of lazy. In eager mode, data is loaded into memory before the first operation, and all operations are executed sequentially, one after another as user's query. This can be both wasteful and inefficient.

pandas only operates in eager mode, while polars operates in eager/lazy as per user's needs. The lazy API of polars is almost identical to the eager one so there is little mental overhead to users.

Lazy is good for speed, but this does not mean that Eager is useless. Data professionals rarely build data pipelines in one go, rahter, they build iteratively, one step at a time. This is where eager mode shines.

Example

Think about this query in the diagram where the input in is processed in some ways to achieve the result out.

graph LR
    A[in] --> B[result1];
    A --> C[result2];
    A --> D[result3];
    C --> D;
    B --> E[out];
    D --> E;
With the eager API, result2, result1 must be computed fully and stored in memory, then result3, before producing the output. In developement, it indeed makes sense to produce all these intermediate results and stored them in memory to check the accuracy of the results. However, it can be the case that the out is what really matters in production, and the intermediate results are just implementation details. Lazy execution would be the way to go in such cases. With the lazy API, the query is optimized, vectorized, multi-threaded so the execution can be ~10x faster.

Concretely, the data structure to operate on with the eager API is called DataFrame, and for the lazy API it is called LazyFrame.

There is one caveat/gotcha when working with lazy API. One might use blindly all data operations that are available in the eager API and surprised that they are not availabe in the lazy API. To understand why this happens, it is crucial to know that LazyFrame must be agnostic of the data values and it must be aware of the schema. The schema of the output of pivot operation can NOT be determined if the data values are not known (how many columns are there?). It is the same story with transpose.

Once the query reaches a point where lazy API cannot do what the user wants, it is time to switch to the eager API, do that operation eagerly, and switch back to lazy for the speed.

In code, the pattern looks like this

lf.collect().pivot(...).lazy().other_lazy_operations()

Here .collect() would turn a LazyFrame into DataFrame (so lf is materialized/stored in memory). After the pivot operation, .lazy() would turn the dataframe back to LazyFrame.

Expression

polars offers a beloved Expression API. In essence, expressions are functions that associate an ouput array to an input array (think numpy operators). As operators, they can/should be isolated from data (Frame/Series). Expression can be appiled to an existing Expression to obtain a new Expression in the sense of function composition. This allows users to build complicated queries by composing building blocks offered in the Expression API.

Syntax difference

The migration guide from official docs serves as a good summary4.

Go further

To make full advantage of polars, using lazy execution whenever possible and getting familiar with Expression are probably sufficient for most of the everyday data processing jobs.

To go further, the official User Guide is the best source, which is being revised and improved actively (~ end of 2024).