Downsampling geospatial xarrays too large for memory

Question:

I have a very large (86400 × 43200) xarray dataset with geospatial coordinates (lat, lon) and elevation, and I would like to downsample this map to one twentieth of the resolution per dimension (4320 × 2160, which is equivalent to a 1/12-degree grid).

At first I used the interp method directly, but it tries to load the whole dataset into memory and hence requests 28 GB, which I can't allocate.

My current approach, detailed below:

  1. Create a new downsampled dataset and fill it with zeros
  2. Slice the large dataset into 36 equal slices
  3. Load and interpolate one slice
  4. Add each downsampled slice to my downsampled dataset

-> I get all NaN values. Step 2 works; step 3 does not.
How can I downsample such a large dataset, or populate my new dataset in the correct way?

import numpy as np
import xarray as xr


def format_spatial_resolution(
    xarray, res_lat=1 / 12, res_lon=1 / 12, sample_array=None
):
    # Build the target coordinate grids at the coarser resolution
    new_lon = np.arange(xarray.lon[0], xarray.lon[-1], res_lon)
    new_lat = np.arange(xarray.lat[0], xarray.lat[-1], res_lat)
    xarray_new = xarray.interp(
        lat=new_lat,
        lon=new_lon,
        # Extrapolate to avoid NaN values at the edges
        kwargs={
            "fill_value": "extrapolate",
        },
    )
    return xarray_new


xarray = xr.open_dataset(filename)
# Setting resolution
res_lat = 1/12 # 1/12 degree
res_lon = 1/12
# Create a downsampled xarray with zero values
new_lon = np.arange(xarray.lon[0], xarray.lon[-1], res_lon)
new_lat = np.arange(xarray.lat[0], xarray.lat[-1], res_lat)

xarray_downsampled = xr.Dataset(
    dict(elevation=(["lat", "lon"], np.zeros((len(new_lat), len(new_lon))))),
    coords=dict(lon=new_lon, lat=new_lat),
)
# Interpolate each slice and update downsampled xarray
for i in range(-180, 180, 10):
    xarray_downsampled.update(format_spatial_resolution(xarray.sel(lon=slice(i, i + 10))))
Asked By: a-doering


Answers:

I’d recommend using dask in combination with xr.DataArray.coarsen.

Start by chunking the data on read, with chunk sizes that are multiples of your coarsening window (so multiples of 20 on each dim) and that land in the low hundreds of MB. Chunks of 8640 × 4320 float64 values are about 285 MB each, which would drop to roughly 142 MB if converted to float32; this is a reasonable size. Each chunk spans 1/10 of each dimension, so we'll have 100 chunks in total:

ds = xr.open_dataset(filename, chunks={"lat": 8640, "lon": 4320})
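
As a quick sanity check (assuming the data variable is named elevation, as in the question), you can inspect the chunk layout and per-chunk memory before computing anything:

print(ds["elevation"].chunks)  # block lengths along each dimension, e.g. ((8640, ...), (4320, ...))
chunk_bytes = 8640 * 4320 * ds["elevation"].dtype.itemsize
print(f"{chunk_bytes / 2**20:.0f} MiB per chunk")  # ~285 MiB per float64 chunk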

Optionally, we can reduce the precision:

ds["elevation"] = ds["elevation"].astype("float32")

Finally, let’s reduce the pixel density with coarsen:

coarsened = ds.coarsen(lat=20, lon=20, boundary='exact').mean()
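
As an aside, boundary='exact' (the default) raises an error if a dimension length is not an exact multiple of the window; that is fine here because 86400 and 43200 are both divisible by 20. If your grid did not divide evenly, a sketch of the alternative would be:

# 'trim' drops the leftover cells at the edge instead of raising an error;
# 'pad' would instead pad the edge with NaN before averaging.
coarsened = ds.coarsen(lat=20, lon=20, boundary='trim').mean()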

The result is a dask-backed array; no computation has been executed yet. To trigger it, we can either write the result to disk or compute it in memory with:

coarsened = coarsened.compute()
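
Writing to disk instead streams the computation chunk by chunk rather than materialising the full result in memory; a minimal sketch (the output filename is hypothetical):

# to_netcdf triggers the lazy dask computation and writes the coarsened grid as it goes.
coarsened.to_netcdf("elevation_coarsened.nc")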
Answered By: Michael Delgado