Working with Large Datasets in Snowflake
This post covers how to work with large datasets in Snowflake, meaning roughly 100+ GB or billions of rows. That's beyond what a personal laptop handles comfortably (mine has 6 cores and 16 GB of RAM), so most people turn to the cloud for this kind of workload.
We'll go over some practical tips, using the Snowpark Python API (version 1.32) to illustrate.
from snowflake import snowpark as sp
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T
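The snippets below assume you already have a Snowpark session. For completeness, here's a minimal sketch of creating one with Session.builder; the connection parameters are placeholders, not real values.

# A minimal sketch of creating a session; replace the placeholders with your own details.
connection_parameters = {
    "account": "<your_account>",
    "user": "<your_user>",
    "password": "<your_password>",
    "warehouse": "<your_warehouse>",
    "database": "<your_database>",
    "schema": "<your_schema>",
}
session = sp.Session.builder.configs(connection_parameters).create()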
Choose the Right Warehouse
There’s no one-size-fits-all. Larger warehouses cost more, so don’t scale up unless you actually need the extra horsepower. Match the warehouse size to the workload. Start small and scale up only if performance is a problem.
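If a job really does need more horsepower, you can resize the warehouse just for that workload and scale it back down afterwards. A sketch, where MY_WH is a placeholder warehouse name:

# Scale up for the heavy query, then back down (MY_WH is a placeholder).
session.sql("alter warehouse MY_WH set warehouse_size = 'LARGE'").collect()
# ... run the expensive workload ...
session.sql("alter warehouse MY_WH set warehouse_size = 'XSMALL'").collect()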
Cache the Results
Snowpark DataFrames are lazily evaluated. That means if you reuse part of a query in multiple branches, Snowflake will recompute that shared part unless you cache it explicitly.
Example:
a = session.table("...")
query = a.group_by("A").agg(...)  # expensive intermediate step; nothing executes yet due to lazy evaluation
# branching out
res1 = query.filter(...).collect()
res2 = query.select(...).collect()
Both res1 and res2 will trigger the same expensive computation. To avoid that, materialize the intermediate result:
Temp Table (session scoped)
query.write.mode("overwrite").save_as_table("temp_table", table_type="temporary")
lf = session.table("temp_table")
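If you'd rather not manage the temp table yourself, DataFrame.cache_result() does essentially the same thing: it runs the query once into a session-scoped temporary table and returns a DataFrame backed by it. A minimal sketch:

# cache_result() materializes the result into a temporary table behind the scenes.
cached = query.cache_result()
res1 = cached.filter(...).collect()
res2 = cached.select(...).collect()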
Materialized View (persistent, auto-refreshed, costs money)
session.sql("""
create materialized view my_view as
select id, count(*) as cnt
from my_table
group by id
""").collect()
Parquet File (in stage)
path = f"{session.get_session_stage()}/f.parquet"
a.group_by("A").agg(...).write.format("parquet").save(path)
lf = sp.DataFrameReader.parquet(path)
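One caveat: get_session_stage() points at a temporary stage that is dropped when the session ends. If the files need to outlive the session, write to a named stage instead (my_stage below is a placeholder you'd create yourself):

# Hypothetical named stage; create it once with: create stage if not exists my_stage
path = "@my_stage/f.parquet"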
Predicate and Projection Pushdown
"Predicate" = filter. "Projection" = column selection.
Snowflake tries to optimize automatically, but it only knows what you tell it, and with lazy evaluation it's easy to lose track of what the final query looks like. If you're building up a complex query, be explicit (see the sketch after this list):
- Filter early, especially before joins
- Drop columns you don’t need
- Use DataFrame.explain() to inspect the query plan
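Here's a minimal sketch of those three points, using hypothetical orders and customers tables:

# Hypothetical tables, just to illustrate the pattern.
orders = session.table("orders")
customers = session.table("customers")

result = (
    orders
    .filter(F.col("ORDER_DATE") >= "2024-01-01")  # filter early, before the join
    .select("CUSTOMER_ID", "AMOUNT")              # keep only the columns you need
    .join(customers.select("CUSTOMER_ID", "REGION"), on="CUSTOMER_ID")
    .group_by("REGION")
    .agg(F.sum("AMOUNT").alias("TOTAL_AMOUNT"))
)

result.explain()  # prints the generated SQL and the query plan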
Monitoring
Use QUERY_HISTORY to monitor performance. Look at:
- bytes_scanned
- partitions_scanned
Lower is better. If you're scanning too much data, it's a sign that pushdowns or clustering could be improved.
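For example, here's a sketch that pulls the most scan-heavy queries from the last day out of the ACCOUNT_USAGE view (which has some ingestion latency):

# Top scan-heavy queries from the last day.
heavy = session.sql("""
    select query_id, query_text, bytes_scanned, partitions_scanned, partitions_total
    from snowflake.account_usage.query_history
    where start_time > dateadd('day', -1, current_timestamp())
    order by bytes_scanned desc
    limit 20
""").collect()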