Dask Patterns for Large Ocean Datasets

Practical patterns for when pandas runs out of memory.

When Dask?

Your ocean dataset is too big for RAM, spread across many files, or both. Pandas needs the whole thing in memory at once; Dask keeps the same API but splits the work into chunks and only computes what you ask for.
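
A quick way to decide, as a rough sketch: compare the uncompressed size of the dataset against the machine's RAM. The file name is hypothetical and psutil is an extra dependency assumed here.

import psutil
import xarray as xr

# Hypothetical file; xarray's netCDF backend opens it lazily, so nothing large loads yet
ds = xr.open_dataset("data/sst_2020.nc")

data_gb = ds.nbytes / 1e9                     # uncompressed size of all variables
ram_gb = psutil.virtual_memory().total / 1e9  # total RAM on this machine

print(f"dataset: {data_gb:.1f} GB, RAM: {ram_gb:.1f} GB")
if data_gb > 0.5 * ram_gb:
    print("Reach for Dask and leave headroom for intermediates.")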

Pattern 1: Lazy Loading

import dask.dataframe as dd

# Don't load yet — just build the task graph
df = dd.read_parquet("s3://bucket/ocean-data/*.parquet")

# Still lazy
filtered = df[df.depth > 100]

# NOW it loads (only the filtered rows)
result = filtered.compute()
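
A related trick worth sketching, with hypothetical column names: push column selection and row filters into the read itself, so Dask never materializes data it is about to discard.

import dask.dataframe as dd

# Column names are illustrative; read only what's needed and prune row groups at the file level
df = dd.read_parquet(
    "s3://bucket/ocean-data/*.parquet",
    columns=["depth", "temperature", "region"],
    filters=[("depth", ">", 100)],
)

result = df.groupby("region").temperature.mean().compute()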

Pattern 2: Chunked NetCDF with Xarray

import xarray as xr

# Chunk along time dimension
ds = xr.open_mfdataset(
    "data/*.nc",
    chunks={"time": 100, "lat": 500, "lon": 500}
)

# Operations are lazy
mean_sst = ds.sst.mean(dim="time")

# Compute when ready
mean_sst.compute()
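
The laziness carries through chained analysis. A sketch of a monthly climatology and anomaly using the same sst variable; the output file name is illustrative.

import xarray as xr

ds = xr.open_mfdataset("data/*.nc", chunks={"time": 100})

# Still lazy: the groupby, mean, and subtraction only extend the task graph
climatology = ds.sst.groupby("time.month").mean("time")
anomaly = ds.sst.groupby("time.month") - climatology

# Writing triggers the computation, streaming through the chunks
anomaly.to_netcdf("sst_anomaly.nc")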

Pattern 3: Distributed Processing

from dask.distributed import Client

# Local cluster (uses all cores)
client = Client()

# Or remote cluster
client = Client("scheduler-address:8786")

# Same code, now distributed
result = df.groupby("region").mean().compute()
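
For local work it often pays to be explicit about workers and memory, so one bad task can't take the whole machine down. A sketch with illustrative sizing:

from dask.distributed import Client, LocalCluster

# Illustrative sizing: 4 workers, 2 threads each, 4 GB per worker
cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="4GB")
client = Client(cluster)

print(client.dashboard_link)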

Gotchas

Too many small files: repartition, e.g. df = df.repartition(npartitions=100)
Memory spikes: reduce the chunk size
Slow shuffles: persist before the groupby, e.g. df = df.persist()
Silent failures: check the dashboard at client.dashboard_link
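
The repartition and persist fixes in practice, as a rough sketch reusing the Pattern 1 read (the 100 MB target is illustrative):

import dask.dataframe as dd

df = dd.read_parquet("s3://bucket/ocean-data/*.parquet")

# Collapse thousands of tiny partitions into roughly 100 MB chunks
df = df.repartition(partition_size="100MB")

# Materialize once in cluster memory so the groupby doesn't re-read from S3
df = df.persist()

summary = df.groupby("region").mean().compute()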

The Dashboard

Always run with the dashboard open:

client = Client()
print(client.dashboard_link)  # http://localhost:8787

Watch for:

Worker memory bars creeping toward the limit (spilling to disk slows everything down)
Red bars in the task stream (data being shuffled between workers)
Long white gaps in the task stream (workers sitting idle, often waiting on a shuffle or a straggler)
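
For batch jobs where nobody is watching the dashboard live, the same information can be captured to a file with dask's performance_report. A sketch:

import dask.dataframe as dd
from dask.distributed import Client, performance_report

client = Client()
df = dd.read_parquet("s3://bucket/ocean-data/*.parquet")

# Everything computed inside the block is recorded into an HTML report
with performance_report(filename="dask-report.html"):
    result = df.groupby("region").mean().compute()
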
Pandas syntax, terabyte scale.

Source: C4IROcean-OceanDataPlatform