Combining xarray datasets with combine_by_coords() along two dimensions simultaneously in Python

Question:

I have multiple xarray datasets, each with the dimensions target-latitude (180) and target-longitude (360) and one variable, variable1. Each dataset represents one source gridcell and thus corresponds to a particular source-latitude and source-longitude. For example, the dataset sourcelat25_sourcelon126_mm3_per_yr.nc corresponds to the gridcell with a source-latitude of 25 and a source-longitude of 126, and looks like this:

<xarray.Dataset>
Dimensions:        (targetlongitude: 360, targetlatitude: 180)
Coordinates:
    sourcelatitude      float64 25.0
    sourcelongitude     float64 126.0
  * targetlongitude     (targetlongitude) float64 0.0 1.0 2.0 3.0 ... 357.0 358.0 359.0
  * targetlatitude      (targetlatitude) float64 90.0 89.0 88.0 87.0 ... -87.0 -88.0 -89.0
Data variables:
    variable1           (targetlatitude, targetlongitude) float64 ...

My goal is to combine all datasets to obtain a dataset with complete source-latitude (180) and source-longitude (360) dimensions (as well as the target-latitude and target-longitude dimensions), like this:

<xarray.Dataset> Dimensions: (sourcelongitude: 360, sourcelatitude: 180, targetlongitude: 360, targetlatitude: 180)

I first tried to combine the datasets with xr.concat(), but that gave some issues. Then I tried xr.combine_by_coords(), as you can see in the code example below:

import os

import xarray as xr

directory = 'specified_directory'
filenames = [f for f in os.listdir(directory) if f.startswith('start') and f.endswith('end.nc')]

combined_ds = None
for filename in filenames:
    ds = xr.open_dataset(os.path.join(directory, filename))

    if combined_ds is None:
        combined_ds = ds.copy()

    else:
        if 'sourcelatitude' in combined_ds.dims:
            ds = ds.expand_dims(dim = ['sourcelatitude', 'sourcelongitude'])
            combined_ds = xr.combine_by_coords([combined_ds, ds], join= 'exact')
        else:
            ds = ds.expand_dims(dim=['sourcelatitude', 'sourcelongitude'])
            combined_ds = combined_ds.expand_dims(dim=['sourcelatitude', 'sourcelongitude'])
            combined_ds = xr.combine_by_coords([combined_ds, ds], join='exact')

This works for the first and the second iteration of the loop, and then gives me the error:

ValueError: Resulting object does not have monotonic global indexes along dimension sourcelongitude

Does anyone have any insights about how to solve this or perhaps another way to combine these datasets? I would appreciate it very much, thank you for reading!

Asked By: Freek

||

Answers:

The xarray docs have a section on combining along multiple dimensions that describes the available options.

I am partial to combine_nested as it allows you to be explicit about the ordering of data along each dim. But combine_by_coords works great too!
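For reference, here is a minimal sketch of the combine_nested route on synthetic data. The make_tile helper, the tiny 2 x 3 target grid and the source coordinates are made up for illustration; with real files you would build the nested list from the coordinates parsed out of each file instead.

```python
import numpy as np
import xarray as xr


def make_tile(src_lat, src_lon):
    """Tiny stand-in for one source-gridcell file (hypothetical data)."""
    return xr.Dataset(
        {
            "variable1": (
                ("targetlatitude", "targetlongitude"),
                np.full((2, 3), src_lat * 10.0 + src_lon),
            )
        },
        coords={
            "targetlatitude": [1.0, 0.0],
            "targetlongitude": [0.0, 1.0, 2.0],
            "sourcelatitude": src_lat,
            "sourcelongitude": src_lon,
        },
    )


src_lats = [25.0, 24.0]
src_lons = [126.0, 127.0, 128.0]

# The outer list runs over source latitude, the inner list over source
# longitude; the order of the lists is the order of the result.
grid = [
    [
        make_tile(lat, lon).expand_dims(["sourcelatitude", "sourcelongitude"])
        for lon in src_lons
    ]
    for lat in src_lats
]

# concat_dim lists one dimension name per nesting level, outermost first.
nested = xr.combine_nested(
    grid, concat_dim=["sourcelatitude", "sourcelongitude"]
)
```

Here the ordering comes entirely from how the nested list is built, which is what makes this variant explicit: nothing is inferred from the coordinate values.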

The biggest changes to your code I’d make are:

  1. When combining two or more datasets with any of these methods, they need to have compatible shapes. So the first iteration of your loop also needs to expand dims. But more importantly,
  2. As with pandas and numpy, you should never iteratively expand an array in a for loop. Allocating memory for the array and copying the data into it is slow compared with array operations, and when you loop over the files and concatenate iteratively you end up reallocating and copying the whole array once for each file, i.e. once per source gridcell. Instead, append to a list or dict and do the concatenation only once at the end.
  3. Also, use dask. You don’t want to build the full array in memory with a single core: 180 × 360 × 180 × 360 float64 values come to roughly 34 GB. If you really want it in memory, just call combined_ds = combined_ds.compute() at the end of the code below.

Here are my updates to your code:

import os

import xarray as xr

directory = 'specified_directory'
filenames = [
    f for f in os.listdir(directory)
    if f.startswith('start') and f.endswith('end.nc')
]

# Collect the expanded datasets in a list instead of combining inside the loop
datasets = []
for filename in filenames:
    fp = os.path.join(directory, filename)
    # .chunk() turns the variables into lazy dask arrays (requires dask)
    ds = xr.open_dataset(fp).chunk()
    datasets.append(
        ds.expand_dims(
            ['sourcelatitude', 'sourcelongitude']
        )
    )

# Combine once, at the end
combined_ds = xr.combine_by_coords(
    datasets, join='exact'
)
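To try the pattern without the real files, here is a small self-contained sketch on synthetic tiles. The make_tile helper, the 2 x 3 target grid and the six source gridcells are assumptions for illustration only; the list is shuffled on purpose, since the order returned by os.listdir is arbitrary as well.

```python
import random

import numpy as np
import xarray as xr


def make_tile(src_lat, src_lon):
    """Tiny stand-in for one source-gridcell file (hypothetical data)."""
    return xr.Dataset(
        {
            "variable1": (
                ("targetlatitude", "targetlongitude"),
                np.full((2, 3), src_lat * 1000.0 + src_lon),
            )
        },
        coords={
            "targetlatitude": [1.0, 0.0],
            "targetlongitude": [0.0, 1.0, 2.0],
            "sourcelatitude": src_lat,
            "sourcelongitude": src_lon,
        },
    )


tiles = [
    make_tile(lat, lon)
    for lat in (24.0, 25.0)
    for lon in (126.0, 127.0, 128.0)
]
random.Random(0).shuffle(tiles)  # mimic an arbitrary file order

# Every tile gets both source dimensions before anything is combined.
expanded = [
    t.expand_dims(["sourcelatitude", "sourcelongitude"]) for t in tiles
]

# A single combine call; the ordering is inferred from the coordinate values.
combined = xr.combine_by_coords(expanded, join="exact")
```

The result has both source dimensions sorted by coordinate value regardless of the input order. Note that the tiles have to form a complete rectangular grid of source gridcells for this to work.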

Alternatively, you could do this in one step and open the files in parallel with open_mfdataset:

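def preprocess(ds):
    # Promote the scalar source coordinates to length-1 dimensions
    # in each file before the datasets are combined
    return ds.expand_dims(
        ['sourcelatitude', 'sourcelongitude']
    )

fps = [
    os.path.join(directory, filename)
    for filename in filenames
]

# parallel=True opens the files in parallel via dask
ds = xr.open_mfdataset(
    fps,
    combine="by_coords",
    parallel=True,
    preprocess=preprocess,
)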

Separately, I’d also consider using float32. It’ll halve the size of your data and I doubt you have such high precision estimates of moisture flows that the difference would be significant.
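A minimal sketch of that cast, using random numbers as a stand-in for one of your files (the data and the variable names ds64 and ds32 are assumptions for illustration). Only the data variable is converted; the coordinates stay float64 so that coordinate matching during the combine step is unaffected.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Stand-in for a single 180 x 360 target grid stored as float64
ds64 = xr.Dataset(
    {
        "variable1": (
            ("targetlatitude", "targetlongitude"),
            rng.random((180, 360)),
        )
    },
    coords={
        "targetlatitude": np.arange(90.0, -90.0, -1.0),
        "targetlongitude": np.arange(0.0, 360.0, 1.0),
    },
)

# Cast only the data variable, leaving the coordinates untouched
ds32 = ds64.assign(variable1=ds64["variable1"].astype("float32"))

# Memory ratio of the cast variable to the original one
size_ratio = ds32["variable1"].nbytes / ds64["variable1"].nbytes

# Largest absolute rounding error introduced by the cast
max_abs_err = float(abs(ds64["variable1"] - ds32["variable1"]).max())
```

Comparing max_abs_err against the magnitude of your own values is a quick way to decide whether the lost precision matters for your data.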

Answered By: Michael Delgado