Combine xarray Datasets on stacked coordinate

Question:

I have two datasets that share a coordinate, but in the first dataset that coordinate is stacked. I want to merge the two datasets so that the second one is also indexed by the stacked coordinate.

I’ve managed to get this working by manually restructuring the second dataset, but I feel like there has to be a better way to do this in xarray. Any ideas?

Example data setup

import numpy as np
import xarray as xr

# set up dataset A
seed = 33
np.random.seed(seed)
a = np.arange(3)
b = np.arange(4)

shape = [len(a), len(b)]
c = np.arange(np.prod(shape)).reshape(shape)
x = np.random.random(shape)
y = np.random.random(shape)

dataset_a = xr.Dataset.from_dict({
    'dims': ['a', 'b'],
    'coords': {
        'a': {'dims': 'a', 'data': a},
        'b': {'dims': 'b', 'data': b},
        'c': {'dims': ['a', 'b'], 'data': c}
    },
    'data_vars': {
        'x': {'dims': ['a', 'b'], 'data': x},
        'y': {'dims': ['a', 'b'], 'data': y}
    }
})

# set up dataset_b
np.random.seed(seed)
mask = np.random.choice([True, False], shape)
c_new = c[mask]
z = np.random.random(len(c_new))

dataset_b = xr.Dataset.from_dict({
    'dims': 'c',
    'coords': {
        'c': {'dims': 'c', 'data': c_new}
    },
    'data_vars': {
        'z': {'dims': 'c', 'data': z}
    }
})

Datasets to merge

In [15]: dataset_a
Out[15]: 
<xarray.Dataset>
Dimensions:  (a: 3, b: 4)
Coordinates:
  * a        (a) int64 0 1 2
  * b        (b) int64 0 1 2 3
    c        (a, b) int64 0 1 2 3 4 5 6 7 8 9 10 11
Data variables:
    x        (a, b) float64 0.2485 0.45 0.4109 0.2603 ... 0.4866 0.965 0.3934
    y        (a, b) float64 0.07956 0.3514 0.1636 ... 0.2478 0.6228 0.1424

In [16]: dataset_b
Out[16]: 
<xarray.Dataset>
Dimensions:  (c: 6)
Coordinates:
  * c        (c) int64 0 2 3 4 8 11
Data variables:
    z        (c) float64 0.01966 0.9533 0.6805 0.4866 0.965 0.3934

Attempt at merging with ds.merge

When I use xr.Dataset.merge(), the original coordinate c is overwritten by the new, unstacked version.

In [17]: dataset_c = dataset_a.merge(dataset_b)
Out[17]:
<xarray.Dataset>
Dimensions:  (a: 3, b: 4, c: 6)
Coordinates:
  * a        (a) int64 0 1 2
  * b        (b) int64 0 1 2 3
  * c        (c) int64 0 2 3 4 8 11
Data variables:
    x        (a, b) float64 0.2485 0.45 0.4109 0.2603 ... 0.4866 0.965 0.3934
    y        (a, b) float64 0.07956 0.3514 0.1636 ... 0.2478 0.6228 0.1424
    z        (c) float64 0.01966 0.9533 0.6805 0.4866 0.965 0.3934

Attempt #2 – manual alignment

Should I be doing something like creating z filled with nan, and then use xr.Dataset.where() to add the values from dataset_b?

dataset_d = dataset_a.copy()

z_new = np.full_like(dataset_d['x'].values.flatten(), np.nan)
z_shape = dataset_d['x'].shape
z_dims = dataset_d['x'].dims

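c_b = dataset_b['c'].values
# walk the flattened c labels; labels missing from dataset_b keep their NaN
for pos, label in enumerate(dataset_d['c'].values.flatten()):
    if label in c_b:
        # .item() extracts a Python scalar instead of assigning a length-1 array
        z_new[pos] = dataset_b['z'].sel(c=label).item()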

z_new = z_new.reshape(z_shape)

dataset_d['z'] = (z_dims, z_new)
<xarray.Dataset>
Dimensions:  (a: 3, b: 4)
Coordinates:
  * a        (a) int64 0 1 2
  * b        (b) int64 0 1 2 3
    c        (a, b) int64 0 1 2 3 4 5 6 7 8 9 10 11
Data variables:
    x        (a, b) float64 0.2485 0.45 0.4109 0.2603 ... 0.4866 0.965 0.3934
    y        (a, b) float64 0.07956 0.3514 0.1636 ... 0.2478 0.6228 0.1424
    z        (a, b) float64 0.01966 nan 0.9533 0.6805 ... 0.965 nan nan 0.3934

This works, but is there a cleaner/faster way to do this?

Asked By: Mattias Thalén

Answers:

Great question! It’s actually much easier than this. Under xarray’s More Advanced Indexing rules, passing a DataArray to a selection method (.sel, .isel, .loc, .interp, …) uses that DataArray’s values as the labels to select, and the output is reshaped to conform to the DataArray’s own dimensions and coordinates.

So in the absence of any mismatched coordinates, you can just do this:

dataset_b.sel(c=dataset_a.c)
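
To make the reshaping behaviour concrete, here is a minimal sketch on toy data. The names table and selector and their values are made up for illustration and are not part of the question's datasets.

```python
import xarray as xr

# Hypothetical 1-D lookup table indexed by the coordinate "c".
table = xr.DataArray(
    [10.0, 20.0, 30.0, 40.0], dims="c", coords={"c": [0, 1, 2, 3]}
)

# Hypothetical 2-D selector: its values are labels along "c",
# while its own dimensions are ("a", "b").
selector = xr.DataArray(
    [[0, 1], [2, 3]], dims=("a", "b"), coords={"a": [0, 1], "b": [0, 1]}
)

# Vectorized selection: look up each label, keep the selector's shape.
result = table.sel(c=selector)
print(result.dims)    # ('a', 'b')
print(result.values)  # 2x2 array: 10, 20 in the first row, 30, 40 in the second
```

The result no longer has a c dimension; it takes the (a, b) layout of the selector, which is exactly what is needed to line dataset_b up with dataset_a.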

In your case, some levels in dataset_a.c are not present in dataset_b.c. To fully align them, you could align a with b or vice versa in any number of ways. I’ll stick with a left join on dataset_a.c’s levels, using xarray.Dataset.reindex:

all_c_levels = np.unique(dataset_a.c.values)
dataset_b_reindexed = dataset_b.reindex(c=all_c_levels)
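
As a rough illustration of why the reindex step matters, the sketch below uses a small made-up dataset (small, with z known only for labels 0, 2 and 3). It assumes that label-based selection raises a KeyError for labels missing from the index, which is the behaviour I would expect from .sel without a method argument.

```python
import numpy as np
import xarray as xr

# Hypothetical toy data: "z" is known only for labels 0, 2 and 3.
small = xr.Dataset({"z": ("c", [1.0, 2.0, 3.0])}, coords={"c": [0, 2, 3]})

# Selecting a label that is missing from the index (here 1) fails...
try:
    small.sel(c=xr.DataArray([0, 1], dims="k"))
    missing_label_raised = False
except KeyError:
    missing_label_raised = True

# ...so first extend the index. The new labels 1 and 4 are filled with NaN.
expanded = small.reindex(c=np.arange(5))
print(expanded["z"].values)  # z along c is now 1, NaN, 2, 3, NaN
```

After reindexing, every label that appears in the selector exists in the index, so the vectorized .sel can go through and the gaps simply show up as NaN.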

Now the reindexed dataset can be reshaped to conform to the shape of dataset_a.c:

In [6]: dataset_b_reindexed.sel(c=dataset_a.c)
Out[6]:
<xarray.Dataset>
Dimensions:  (a: 3, b: 4)
Coordinates:
    c        (a, b) int64 0 1 2 3 4 5 6 7 8 9 10 11
  * a        (a) int64 0 1 2
  * b        (b) int64 0 1 2 3
Data variables:
    z        (a, b) float64 0.01966 nan 0.9533 0.6805 ... 0.965 nan nan 0.3934

There are some NaNs in there, because we had to add new levels that weren’t previously in dataset_b. You can handle these however you like, e.g. with xr.Dataset.fillna. But the data is now in a shape that can be broadcast against dataset_a:

In [8]: xr.merge([dataset_a, dataset_b_reindexed.sel(c=dataset_a.c)])
Out[8]:
<xarray.Dataset>
Dimensions:  (a: 3, b: 4)
Coordinates:
  * a        (a) int64 0 1 2
  * b        (b) int64 0 1 2 3
    c        (a, b) int64 0 1 2 3 4 5 6 7 8 9 10 11
Data variables:
    x        (a, b) float64 0.2485 0.45 0.4109 0.2603 ... 0.4866 0.965 0.3934
    y        (a, b) float64 0.07956 0.3514 0.1636 ... 0.2478 0.6228 0.1424
    z        (a, b) float64 0.01966 nan 0.9533 0.6805 ... 0.965 nan nan 0.3934
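
The three steps above (reindex, vectorized select, merge) can be wrapped into one small helper. This is only a sketch: merge_on_stacked_coord is a hypothetical name, not an xarray function, and the toy datasets below are invented so that the result is easy to check by eye.

```python
import numpy as np
import xarray as xr


def merge_on_stacked_coord(ds_nd, ds_1d, coord):
    """Hypothetical helper: left-join ds_1d onto ds_nd via a shared coordinate.

    ds_nd holds `coord` as a multi-dimensional coordinate; ds_1d is indexed by it.
    """
    # All labels used by the multi-dimensional coordinate.
    levels = np.unique(ds_nd[coord].values)
    # Add missing labels (as NaN), then reshape to the layout of ds_nd[coord].
    aligned = ds_1d.reindex({coord: levels}).sel({coord: ds_nd[coord]})
    return xr.merge([ds_nd, aligned])


# Toy stand-ins for dataset_a and dataset_b.
ds_a = xr.Dataset(
    {"x": (("a", "b"), np.ones((2, 3)))},
    coords={
        "a": [0, 1],
        "b": [0, 1, 2],
        "c": (("a", "b"), np.arange(6).reshape(2, 3)),
    },
)
ds_b = xr.Dataset({"z": ("c", [10.0, 30.0, 50.0])}, coords={"c": [0, 3, 5]})

merged = merge_on_stacked_coord(ds_a, ds_b, "c")
print(merged["z"].values)  # 10 at c=0, 30 at c=3, 50 at c=5, NaN elsewhere

# Optional: replace the gaps, for example with zeros.
z_filled = merged["z"].fillna(0.0)
```

Taking the levels from the multi-dimensional coordinate makes this a left join on ds_nd, matching the choice made above; labels that exist only in ds_1d are dropped by the selection.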

Using DataArrays in selection, covered in the More Advanced Indexing section of the User Guide, is one of the more powerful features of xarray, and I use it constantly. I think it’s worth committing to memory and working through some of the stranger reshape use cases (like the one you have here), because it’s a feature of xarray that is truly unique: this is hard to do in either pandas or numpy, and xarray really knocks it out of the park!

Answered By: Michael Delgado