Downsampling geospatial xarrays too large for memory
Question:
I have a very large (86400×43200) xarray dataset with geospatial coordinates (lat, lon) and elevation, and would like to downsample this map to one twentieth of the resolution per dimension (4320×2160, which is equivalent to 1/12th of a degree).
I first used the interp method directly, but it tries to load the whole dataset and hence requests 28 GB of memory, which I can't allocate.
Current approach that is detailed below:
- Create new downsampled dataset and fill with 0s
- Slice large dataset into 36 equal slices
- Load and interpolate one slice
- Add downsampled slice to my downsampled dataset.
-> Result: all NaN values. The slicing (step 2) works; it is the update step that fails to populate my dataset.
How can I downsample such a large dataset or populate my new dataset in the correct way?
import numpy as np
import xarray as xr


def format_spatial_resolution(
    xarray, res_lat=1 / 12, res_lon=1 / 12, sample_array=None
):
    new_lon = np.arange(xarray.lon[0], xarray.lon[-1], res_lon)
    new_lat = np.arange(xarray.lat[0], xarray.lat[-1], res_lat)
    xarray_new = xarray.interp(
        lat=new_lat,
        lon=new_lon,
        # Extrapolate to avoid nan values at edges
        kwargs={
            "fill_value": "extrapolate",
        },
    )
    return xarray_new


xarray = xr.open_dataset(filename)

# Setting resolution
res_lat = 1 / 12  # 1/12 degree
res_lon = 1 / 12

# Create a downsampled xarray with zero values
new_lon = np.arange(xarray.lon[0], xarray.lon[-1], res_lon)
new_lat = np.arange(xarray.lat[0], xarray.lat[-1], res_lat)
xarray_downsampled = xr.Dataset(
    dict(elevation=(["lat", "lon"], np.zeros((len(new_lat), len(new_lon))))),
    coords=dict(lon=new_lon, lat=new_lat),
)

# Interpolate each slice and update downsampled xarray
for i in range(-180, 180, 10):
    xarray_downsampled.update(
        format_spatial_resolution(xarray.sel(lon=slice(i, i + 10)))
    )
Answer:
I'd recommend using dask in combination with xr.DataArray.coarsen.
Start by chunking the data on read, with chunk sizes that are multiples of your coarsening window (so multiples of 20 on each dim), targeting chunk sizes in the low hundreds of MBs. Chunks of (8640 × 4320) float64 data are 285 MB each, which would drop to 142 MB if converted to float32 – a reasonable size. Each chunk spans 1/10 of each dimension, so we'll have 100 chunks in total:
ds = xr.open_dataset(filename, chunks={"lat": 8640, "lon": 4320})
Optionally, we can reduce the precision:
ds["elevation"] = ds["elevation"].astype("float32")
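Casting to float32 halves the in-memory footprint of each chunk. A quick sanity check on a toy array (illustrative only, not the elevation data itself):

```python
import numpy as np
import xarray as xr

# float64 is the NumPy default; each element takes 8 bytes.
da = xr.DataArray(np.zeros((100, 100)))
print(da.nbytes)                    # 80000 bytes
print(da.astype("float32").nbytes)  # 40000 bytes, half the size
```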
Finally, let’s reduce the pixel density with coarsen:
coarsened = ds.coarsen(lat=20, lon=20, boundary='exact').mean()
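To see what coarsen does, here is a tiny stand-alone example: each output cell is the mean of one non-overlapping window of input cells (a 2×2 window here, rather than the 20×20 used above):

```python
import numpy as np
import xarray as xr

# A 4x4 grid of values 0..15, averaged over 2x2 windows.
da = xr.DataArray(
    np.arange(16, dtype="float64").reshape(4, 4),
    dims=("lat", "lon"),
)
coarse = da.coarsen(lat=2, lon=2, boundary="exact").mean()
print(coarse.values)
# [[ 2.5  4.5]
#  [10.5 12.5]] -- each cell is the mean of one 2x2 block
```

boundary="exact" raises an error if a dimension is not an exact multiple of the window; with 86400 and 43200 both divisible by 20, that condition holds here.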
The result is a dask-backed array; nothing has been computed yet. To trigger the computation, we can either write the result to disk (e.g. with to_netcdf) or compute it in memory with:
coarsened = coarsened.compute()
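Putting the pipeline together on a small synthetic grid (a stand-in for the real elevation file; the sizes are scaled down but the steps are the same, and dask must be installed for chunk to work):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in: 400 x 200 grid instead of 86400 x 43200.
ds = xr.Dataset(
    {"elevation": (("lat", "lon"), np.random.rand(400, 200))},
    coords={"lat": np.linspace(-90, 90, 400), "lon": np.linspace(-180, 180, 200)},
)
ds = ds.chunk({"lat": 40, "lon": 20})  # dask-backed, like open_dataset(..., chunks=...)
ds["elevation"] = ds["elevation"].astype("float32")
coarsened = ds.coarsen(lat=20, lon=20, boundary="exact").mean().compute()
print(coarsened["elevation"].shape)  # (20, 10): each dimension reduced 20x
```

Because each chunk is coarsened independently and only a few chunks are in memory at once, peak memory stays far below the size of the full dataset.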