Intro to Snowpark API
Polars users might find the Snowpark/PySpark API verbose.
But the high-level abstractions roughly map: snowpark.Column is like polars.Expr, and snowpark.DataFrame is like polars.LazyFrame.
Caveat: the list of methods on Column is much smaller than on Expr, and there’s no clear namespace separating methods by dtype.
If you don’t see a method for your use case, check the gigantic list under snowpark.functions.
Let’s explore.
First, the imports:
from snowflake import snowpark as sp
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T
Column class
Accessible via F.col, e.g. lf.select(F.col("A")).
A few notable methods on Column instances:
- cast
- name(alias)
- asc / desc
- is_null
- is_in
- over
- substr
Unlike Polars, there’s no method namespace separation by dtype.
Still, Column feels a lot like polars.Expr.
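As a sketch of how a few of these compose (assumes snowflake-snowpark-python is installed; the table and column names are invented, and session-bound usage is shown in comments):

```python
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T

# Build expressions client-side; nothing is sent to Snowflake yet.
a_str = F.col("A").cast(T.StringType()).name("A_STR")  # cast + rename, like Expr.cast/alias
b_missing = F.col("B").is_null().name("B_MISSING")     # boolean predicate
b_prefix = F.col("B").substr(1, 3).name("B_PREFIX")    # substring; start position is 1-based

# With an active session (hypothetical table name):
# df = session.table("T1")
# df.select(a_str, b_missing, b_prefix).collect()
```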
F namespace (functions)
The F namespace provides many ways to manipulate Columns.
Note: all of these functions return a Column, not just F.col, e.g.
- F.concat
- F.contains
- F.array_size
- F.year
- F.when(...).otherwise(...)
- F.explode
- F.flatten
Again, no subnamespaces by dtype — everything is dumped into functions.
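For instance, conditional logic mirrors Polars' when/then/otherwise; a minimal sketch (column names invented):

```python
from snowflake.snowpark import functions as F

# CASE WHEN logic, analogous to pl.when(...).then(...).otherwise(...).
bucket = (
    F.when(F.col("SCORE") >= 90, F.lit("high"))
     .when(F.col("SCORE") >= 50, F.lit("mid"))
     .otherwise(F.lit("low"))
     .name("BUCKET")
)
# df.select(bucket)  # requires a DataFrame bound to a session
```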
Window class
Provides a bunch of class methods:
- Window.partition_by
- Window.order_by
- Window.range_between (takes time or int)
And constants like:
Window.CURRENT_ROW
used as bounds in rows_between / range_between; the resulting window spec is then passed to Column's over method.
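Putting the pieces together, a running total per group might look like this (sketch; table and column names are invented):

```python
from snowflake.snowpark import Window
from snowflake.snowpark import functions as F

# Cumulative sum per customer, ordered by date.
w = (
    Window.partition_by("CUSTOMER_ID")
          .order_by("ORDER_DATE")
          .rows_between(Window.UNBOUNDED_PRECEDING, Window.CURRENT_ROW)
)
running_total = F.sum("AMOUNT").over(w).name("RUNNING_TOTAL")
# df.select("CUSTOMER_ID", "ORDER_DATE", running_total)
```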
T (types)
For type hints and casting, use:
- T.ArrayType
- T.DateType
- T.StructType
Caveat: these are classes, so instantiate them, e.g. T.DateType().
DataFrame class
It’s called a DataFrame, but really behaves like a LazyFrame — call .collect() to materialize.
Notable methods (all return a new DataFrame):
- select
- filter
- pivot / unpivot
- join
- union
- with_columns (takes list[str], list[Column])
- with_column
- distinct
- fillna
and the .columns attribute.
Note: after a join, columns that share a name must be aliased explicitly in order to save or view the joined table, e.g. t1["A"].alias("C1"), t2["A"].alias("C2").
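A sketch of that aliasing after a join (assumes an active Snowpark session; table and column names are invented):

```python
# Assumes T1 and T2 both have columns ID and NAME.
t1 = session.table("T1")
t2 = session.table("T2")

joined = t1.join(t2, t1["ID"] == t2["ID"]).select(
    t1["ID"].alias("ID"),
    t1["NAME"].alias("NAME_LEFT"),   # disambiguate the shared column name
    t2["NAME"].alias("NAME_RIGHT"),
)
joined.write.mode("overwrite").save_as_table("T_JOINED", table_type="temporary")
```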
IO
Ways to read data from Snowflake:
- session.table("TABLE_NAME")
- session.sql("SELECT ...")
- session.read.option("infer_schema", False).parquet("@mystage/test.parquet")
Multiple file formats are supported; simply replace parquet with the file extension, e.g. csv.
Ways to push in-memory data to Snowflake:
- session.create_dataframe(df: pd.DataFrame).write.format(...).save(...)
- session.create_dataframe(df: list[tuple | Row]).write.mode("overwrite").save_as_table("my_table", table_type="temporary")
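A sketch of both paths (assumes an active Snowpark session; the table names are invented):

```python
import pandas as pd

# From a pandas DataFrame (assumes an active Snowpark `session`):
pdf = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
session.create_dataframe(pdf).write.mode("overwrite").save_as_table(
    "MY_TABLE", table_type="temporary"
)

# From plain Python data, naming the columns via `schema`:
session.create_dataframe([(1, "x"), (2, "y")], schema=["A", "B"]) \
    .write.mode("overwrite").save_as_table("MY_TABLE", table_type="temporary")
```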
Row class
collect() returns a list of Rows.
The .as_dict() method on Row makes it easy to interoperate with Polars:
just pass a list of dicts to construct a Polars DataFrame.
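A minimal sketch of the round-trip; the helper name rows_to_dicts is mine, and the Polars side is shown in comments:

```python
def rows_to_dicts(rows):
    """Convert Row-like objects (anything with .as_dict()) into plain dicts."""
    return [row.as_dict() for row in rows]

# Usage sketch, assuming snowpark and polars are installed:
# import polars as pl
# rows = df.collect()                       # list[Row]
# pldf = pl.DataFrame(rows_to_dicts(rows))  # list[dict] -> Polars DataFrame
```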
Testing
sp.testing.assert_dataframe_equal
Horizontal ops in Snowpark
Horizontal ops such as a horizontal any (think np.any(..., axis=1)) can be achieved with a chain of logical ORs, e.g. (a | b | c), or with F.when().otherwise(). As the number of conditions/columns grows, I would like something similar to Polars' any_horizontal, or its general-purpose reduce/fold functions for horizontal ops. Snowpark has neither an any_horizontal nor a reduce function, but Python's functools.reduce fills the gap.
from functools import reduce

# OR together the boolean columns listed in col_names
any_expr = reduce(lambda a, b: a | b, map(F.col, col_names))
lf.select(any_expr)
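Since Snowpark Columns overload `|` the same way Python booleans do, this reduce can be packaged as a small generic helper (the name any_horizontal is mine, borrowed from Polars):

```python
from functools import reduce
import operator

def any_horizontal(exprs):
    """OR-reduce a non-empty sequence of boolean expressions or plain bools."""
    return reduce(operator.or_, exprs)

# With Snowpark columns:  lf.select(any_horizontal(map(F.col, col_names)))
# The same helper also works on plain Python booleans:
print(any_horizontal([False, True, False]))  # True
```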
Exceptions
The most common is sp.exceptions.SnowparkSQLException, corresponding to the 13xx SQL errors.
NULL handling
In aggregate functions like sum, avg, etc., NULL values are ignored; if all values in a column are NULL, such an aggregate returns NULL (count is the exception: it returns 0).
In comparison logic (>, <, <>, =, >=, <=), comparing NULL with any value yields NULL. Therefore in filters, NULL must be caught explicitly with A IS NULL or A IS NOT NULL; otherwise the filter WHERE A < 1 silently excludes all records where A is NULL.
Several functions are dedicated to NULL handling, e.g. coalesce, ifnull, nullif, equal_null.
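A sketch of a NULL-safe filter (assumes an active Snowpark session and a table with a numeric column A; names are invented):

```python
from snowflake.snowpark import functions as F

df = session.table("T1")  # hypothetical table

# Keep rows where A < 1 OR A is NULL; a bare `A < 1` would drop the NULL rows.
kept = df.filter((F.col("A") < 1) | F.col("A").is_null())

# Alternatively, substitute a default so the comparison never sees NULL:
kept2 = df.filter(F.coalesce(F.col("A"), F.lit(0)) < 1)
```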