Practical patterns for when pandas runs out of memory.
Your ocean dataset is too big for pandas to load at once. Dask keeps the same DataFrame syntax but builds a lazy task graph and only reads data when you ask for a result:
```python
import dask.dataframe as dd

# Don't load yet — just build the task graph
df = dd.read_parquet("s3://bucket/ocean-data/*.parquet")

# Still lazy
filtered = df[df.depth > 100]

# NOW it loads (only the filtered rows)
result = filtered.compute()
```
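Before computing anything large, it helps to check how the data was partitioned and to preview a few rows. A quick sketch, continuing from the snippet above (these are standard Dask DataFrame attributes):

```python
# How many partitions did read_parquet create?
print(df.npartitions)

# head() only computes the first partition, so it's cheap
print(filtered.head())

# Column dtypes are known without loading any data
print(df.dtypes)
```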
For gridded NetCDF output, xarray gives you the same lazy, chunked pattern on top of Dask arrays:

```python
import xarray as xr

# Open many files as one dataset, chunked along time, lat and lon
ds = xr.open_mfdataset(
    "data/*.nc",
    chunks={"time": 100, "lat": 500, "lon": 500}
)

# Operations are lazy
mean_sst = ds.sst.mean(dim="time")

# Compute when ready
mean_sst.compute()
```
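The same laziness applies to subsetting. Here's a sketch of a monthly mean for one region; the variable name `sst` comes from the snippet above, and the lat/lon bounds are illustrative only:

```python
# Select a region lazily, then average by calendar month
region = ds.sst.sel(lat=slice(-10, 10), lon=slice(120, 180))  # bounds are illustrative
monthly_mean = region.groupby("time.month").mean("time")

# Nothing has been read from disk yet; this triggers the work
monthly_mean = monthly_mean.compute()
```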
When one machine isn't enough, the same code runs against a Dask cluster:

```python
from dask.distributed import Client

# Local cluster (uses all cores)
client = Client()

# Or connect to a remote cluster
client = Client("scheduler-address:8786")

# Same code, now distributed
result = df.groupby("region").mean().compute()
```
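If the default local cluster oversubscribes your laptop, you can size it explicitly. A sketch; the worker count and memory limit are assumptions to tune for your machine:

```python
from dask.distributed import Client

# Explicit resources instead of the defaults
client = Client(
    n_workers=4,           # assumption: match your core count
    threads_per_worker=2,
    memory_limit="4GB",    # per worker
)

# Release the cluster when you're done
client.close()
```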
Common problems and quick fixes:

| Issue | Solution |
|---|---|
| Too many small files | Repartition: `df.repartition(npartitions=100)` |
| Memory spikes | Reduce chunk size |
| Slow shuffles | Persist before groupby: `df.persist()` |
| Silent failures | Check the dashboard: `client.dashboard_link` |
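A minimal sketch of the repartition-then-persist pattern from the table; the partition count is an assumption, aim for partitions of roughly 100 MB:

```python
# Fewer, larger partitions reduce per-task overhead
df = df.repartition(npartitions=100)

# Keep the repartitioned data in cluster memory so the
# groupby shuffle doesn't re-read it from storage
df = df.persist()

result = df.groupby("region").mean().compute()
```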
Always run with the dashboard open:
```python
client = Client()
print(client.dashboard_link)  # http://localhost:8787
```
Watch for:

- Orange or grey bars in the worker memory plot: workers near their limit, spilling to disk
- White gaps in the task stream: idle workers, often a sign of too few partitions
- Red bars dominating the task stream: excessive data transfer between workers
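If the job runs while you're not watching, you can capture the same diagnostics to a file with dask.distributed's `performance_report`; a sketch, with a hypothetical filename and the `df` from earlier:

```python
from dask.distributed import Client, performance_report

client = Client()

# Record the dashboard's diagnostics for the enclosed work
with performance_report(filename="dask-report.html"):
    result = df.groupby("region").mean().compute()
```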
Pandas syntax, terabyte scale.