How to replace a 'for' loop with 'xarray.apply_ufunc' to perform a linear regression between "x" and "y" over an 11-day moving window on an xr.Dataset?

Question:

Estimate the linear slope between ‘x’ and ‘y’ for each 11-day moving window, with a 1-day stride.

from sklearn import linear_model
import numpy as np
import xarray as xr
import pandas as pd

# Create a dataset as an example
site = np.linspace(0,3,num=4,dtype='int8')
time= pd.date_range('2018-01-01','2020-12-31',freq='d')
x = np.random.randint(0,500,size=[len(site), len(time)])
y = np.random.randint(0,500,size=[len(site), len(time)])

_ds = xr.Dataset(data_vars=dict(
                    x=(["site", "time"], x),
                    y=(["site", "time"], y)),
                coords=dict(
                    site=site,
                    time=time))

# define the linear regression model
def ransac_fit(xi, yi, **ransac_kwargs):
    Xi = xi.reshape(-1, 1)
    ransac = linear_model.RANSACRegressor(**ransac_kwargs)
    ransac.fit(Xi, yi)
    slope = ransac.estimator_.coef_  # note: a length-1 array, not a scalar
    b = ransac.estimator_.intercept_
    return slope, b
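For reference, a quick sanity check of this function on noiseless synthetic data with slope 2 and intercept 5 (the definition is repeated so the snippet stands alone):

```python
from sklearn import linear_model
import numpy as np

def ransac_fit(xi, yi, **ransac_kwargs):
    Xi = xi.reshape(-1, 1)
    ransac = linear_model.RANSACRegressor(**ransac_kwargs)
    ransac.fit(Xi, yi)
    # coef_ is a length-1 array; intercept_ is a scalar
    return ransac.estimator_.coef_, ransac.estimator_.intercept_

xi = np.arange(20, dtype=float)
yi = 2.0 * xi + 5.0          # exactly linear, no outliers
slope, b = ransac_fit(xi, yi)
# slope.item() recovers the scalar slope from the length-1 array
```

On clean data like this, RANSAC should recover the slope and intercept essentially exactly; the `.item()` call matters later, because `apply_ufunc` with scalar output core dims expects scalars, not arrays.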

At present I use nested ‘for’ loops over ‘site’ and ‘time’ to do this, which is extremely clumsy…

def clc_slope(_ds, window_size=11):
    slps    = []
    bs      = []
    mean_xs = []
    mean_ys = []

    var_x = _ds['x']
    var_y = _ds['y']

    # loop over each year and day of year
    for year in np.unique(_ds.time.dt.year.values):
        for doy in np.unique(_ds.sel(time=str(year)).time.dt.dayofyear.values):

            # define the start and end index of the window
            inorg = doy - int(window_size / 2 + 1)
            enorg = doy + int(window_size / 2)

            # calculate mean values of x and y for each moving window
            mean_x = np.nanmean(var_x.sel(time=str(year))[inorg:enorg].values)
            mean_y = np.nanmean(var_y.sel(time=str(year))[inorg:enorg].values)

            mean_xs = np.append(mean_xs, mean_x)
            mean_ys = np.append(mean_ys, mean_y)

            # start to estimate slope and intercept
            _x = var_x.sel(time=str(year))[inorg:enorg].values
            _y = var_y.sel(time=str(year))[inorg:enorg].values

            # if there are too many NaNs, assign NaN to slope and intercept
            if (np.isfinite(_x) & np.isfinite(_y)).sum() < (int(window_size / 2) + 2):
                _slp = _b = np.nan
            else:
                try:
                    _slp, _b = ransac_fit(_x, _y, min_samples=0.6,
                                          stop_n_inliers=int(window_size / 2))
                except Exception:
                    _slp = _b = np.nan

            slps = np.append(slps, _slp)
            bs   = np.append(bs, _b)

    outs = [slps, bs, mean_xs, mean_ys]
    return outs

# run the slope and intercept estimation for each site and concat afterwards
_dss = []
for st in _ds.site.values:
    ds_site = _ds.sel(site=st)
    outs = clc_slope(ds_site)
    ds_site['slp']     = ('time', outs[0])
    ds_site['b']       = ('time', outs[1])
    ds_site['mean_xs'] = ('time', outs[2])
    ds_site['mean_ys'] = ('time', outs[3])
    _dss.append(ds_site)
dss = xr.concat(_dss, dim='site')

I know xarray.apply_ufunc can save an enormous amount of time, but I have not managed to work out this tricky approach. I would super appreciate it if you could give a hint! Thank you!

Asked By: William Jose Zabka


Answers:

This doesn’t use apply_ufunc, but I expect it speeds up the implementation quite a bit.

xarray’s rolling objects have a really powerful method called construct. It takes a rolling window and, rather than reducing it, expands it into a new dimension. xarray does this without ever copying data – it just provides one slice into the array for each element along the rolled dimension, with each slice being the length of the window and offset by one from the previous slice:

In [3]: rolled = _ds.rolling(time=11).construct("window")

In [4]: rolled
Out[4]:
<xarray.Dataset>
Dimensions:  (site: 50, time: 1096, window: 11)
Coordinates:
  * site     (site) int8 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2 2 2 3
  * time     (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2020-12-31
Dimensions without coordinates: window
Data variables:
    x        (site, time, window) float64 nan nan nan nan ... 350.0 9.0 303.0
    y        (site, time, window) float64 nan nan nan nan ... 246.0 351.0 310.0

You can use this to perform arbitrary operations along each window. It’s also really helpful for prototyping complex windowed operations, because you can see exactly what’s going on in each slice.
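As a quick illustration (a minimal sketch using a fresh one-dimensional random DataArray, not the question's dataset): reducing the constructed window dimension reproduces the ordinary rolling reduction, which makes it easy to verify what each slice contains.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2018-01-01", "2018-02-01", freq="d")
da = xr.DataArray(np.random.rand(len(time)), coords={"time": time}, dims="time")

# expand each 11-element window into a new "window" dimension
rolled = da.rolling(time=11).construct("window")

# reducing over "window" matches the built-in rolling reduction;
# skipna=False so the NaN-padded edge windows stay NaN, as rolling().mean() does
manual = rolled.mean("window", skipna=False)
builtin = da.rolling(time=11).mean()
```

Both results are NaN for the first 10 time steps (incomplete windows) and identical everywhere else.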

Next, for each slice, we can stack the site and window dimension, to get all the observations you’d like for each regression in one vector:

In [5]: stacked = rolled.stack(obs=("window", "site"))

In [6]: stacked
Out[6]:
<xarray.Dataset>
Dimensions:  (time: 1096, obs: 550)
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2020-12-31
  * obs      (obs) MultiIndex
  - window   (obs) int64 0 0 0 0 0 0 0 0 0 0 0 ... 10 10 10 10 10 10 10 10 10 10
  - site     (obs) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 3
Data variables:
    x        (time, obs) float64 nan nan nan nan nan ... 81.0 136.0 194.0 303.0
    y        (time, obs) float64 nan nan nan nan nan ... 470.0 300.0 329.0 310.0

Now that we have this, we can wrap your regression function so it accepts and returns a Dataset. I’ll add a new dimension coeff because slope is a vector (you could alternatively just grab the scalar slope with slope.item() and skip the extra dim):

def ransac_fit_xr(ds, **ransac_kwargs):
    xi, yi = ds.x.values.ravel(), ds.y.values.ravel()
    mask = np.isfinite(xi) & np.isfinite(yi)
    # you could apply your masking rule here if you'd like:
    # if mask.sum() < len(mask) / 2:
    #     return xr.Dataset({"slope": (("coeff", ), [np.nan]), "b": np.nan})
    xi, yi = xi[mask], yi[mask]
    slope, b = ransac_fit(xi, yi, **ransac_kwargs)
    return xr.Dataset({"slope": (("coeff", ), slope), "b": b})

Now, we can loop over the elements of time to build our regression results:

In [22]: results = []
    ...: for i in stacked.time.values:
    ...:     results.append(ransac_fit_xr(stacked.sel(time=i, drop=True)))
    ...: res_ds = xr.concat(results, dim=stacked.time)

In [23]: res_ds
Out[23]:
<xarray.Dataset>
Dimensions:  (time: 1096, coeff: 1)
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-02 ... 2020-12-31
Dimensions without coordinates: coeff
Data variables:
    slope    (time, coeff) float64 -0.1954 0.3 -0.0878 ... -0.1385 0.05444
    b        (time) float64 413.6 303.5 366.0 271.4 ... 342.1 256.4 362.8 303.

This runs reasonably quickly. A persistent challenge I have with sklearn’s estimators is that there is no good way to run tensor regressions – passing in an array of arguments, running the regression along some subset of the dimensions, and receiving an array of outputs. xarray’s polyfit does this, but it currently only supports polynomial regressions. So for something more complex like your RANSACRegressor, you have to accept the performance hit of an outer loop. You could speed this up by parallelizing it with map_blocks if you’d like.
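As a point of comparison – a sketch under the assumption that a plain least-squares slope (not a robust RANSAC fit) is acceptable – ordinary OLS can be written entirely with closed-form moments over the constructed window dimension, with no Python-level loop at all:

```python
import numpy as np
import pandas as pd
import xarray as xr

# small random dataset shaped like the question's
time = pd.date_range("2018-01-01", "2018-03-01", freq="d")
site = np.arange(4)
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {
        "x": (("site", "time"), rng.integers(0, 500, (len(site), len(time))).astype(float)),
        "y": (("site", "time"), rng.integers(0, 500, (len(site), len(time))).astype(float)),
    },
    coords={"site": site, "time": time},
)

rolled = ds.rolling(time=11, center=True).construct("window")

# closed-form OLS per (site, time) window: slope = cov(x, y) / var(x)
mx = rolled["x"].mean("window")
my = rolled["y"].mean("window")
cov = ((rolled["x"] - mx) * (rolled["y"] - my)).mean("window")
var = ((rolled["x"] - mx) ** 2).mean("window")
slope = cov / var
intercept = my - slope * mx
```

For interior windows this matches an OLS fit of each 11-day slice exactly; the NaN-padded edge windows simply use fewer points because `mean` skips NaNs by default.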

Answered By: Michael Delgado
rolled = _ds.rolling(time=11, center=True).construct("window")

# apply_ufunc vectorizes the function over the remaining (site, time) dims;
# with scalar output_core_dims the function must return scalars, so wrap
# ransac_fit to drop NaNs (the padded edge windows) and unpack the
# length-1 coef_ array
def _fit(xi, yi):
    mask = np.isfinite(xi) & np.isfinite(yi)
    if mask.sum() < 2:
        return np.nan, np.nan
    try:
        slope, b = ransac_fit(xi[mask], yi[mask], min_samples=0.6)
        return slope.item(), b
    except Exception:
        return np.nan, np.nan

slps, bs = xr.apply_ufunc(
    _fit,
    rolled['x'],
    rolled['y'],
    input_core_dims=[['window'], ['window']],
    output_core_dims=[[], []],
    vectorize=True,
    dask='parallelized',
)

This is the approach I ended up with: create a third dimension (window) with the rolling construct method, then use apply_ufunc to broadcast over the site and time dimensions.

Answered By: William Jose Zabka