Masking NaN values from an xarray dataset for scikit-learn multiple linear regression following scipy

Question:

I’m attempting to use LinearRegression from sklearn.linear_model to find the multiple linear regression coefficients for several variables at each latitude and longitude point, fitting along the time dimension, like so:

for i in range(len(data.lat)):
    for j in range(len(data.lon)):
        # predictors as columns, shape (n_time, 2); target has shape (n_time,)
        X = np.column_stack((data.ivar1.values[:, i, j], data.ivar2.values[:, i, j]))
        y = data.dvar.values[:, i, j]
        storage_dframe[i, j, :] = LinearRegression().fit(X, y).coef_

While this general form works, my data contain abundant NaN values because they come from real observations. I would rather not impute data if I can avoid it, so as to preserve whatever real relationships exist. Is it possible to copy the behavior of scipy.stats.linregress, where "Missing values are considered pair-wise: if a value is missing in x, the corresponding value in y is masked"? That feels like the best route; otherwise, could I add a conditional clause along the lines of

if data.ivar1[:, i, j].isnull() or data.ivar2[:, i, j].isnull() == True:
    storage_dframe[i, j, :] = np.nan
else:
    X = np.column_stack((data.ivar1.values[:, i, j], data.ivar2.values[:, i, j]))
    y = data.dvar.values[:, i, j]
    storage_dframe[i, j, :] = LinearRegression().fit(X, y).coef_

I’ve attempted essentially that, with no success. Please feel free to chime in!

Asked By: Logan


Answers:

Your code is difficult to read, especially without context, so here’s a simpler example of what I think you’re trying to do:

import numpy as np

# generate some fake input and output data
inp = np.arange(10)
# inp -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

outp = np.array([x*x if x%2 else np.nan for x in inp])
# outp -> [nan,  1., nan,  9., nan, 25., nan, 49., nan, 81.]

mask = ~np.isnan(outp)
# mask -> [False, True, False, True, False, True, False, True, False, True]

masked_inp = inp[mask]
# masked_inp -> [1, 3, 5, 7, 9]

masked_outp = outp[mask]
# masked_outp -> [ 1.,  9., 25., 49., 81.]

Answered By: Woodford
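
The same masking idea can be extended to several predictors. The sketch below is only an illustration, not part of the original answer: the helper name fit_masked and its min_samples threshold are hypothetical, and the data are synthetic. It drops every time step where any predictor or the target is NaN, which mimics the pair-wise behavior quoted in the question, and then fits LinearRegression on the remaining rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def fit_masked(X, y, min_samples=3):
    """Fit on rows where neither X nor y is NaN (hypothetical helper).

    Returns the coefficients, or an all-NaN array when fewer than
    `min_samples` complete rows remain (threshold is an assumption).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # A row is valid only if every predictor and the target are present
    valid = ~np.isnan(X).any(axis=1) & ~np.isnan(y)
    if valid.sum() < min_samples:
        return np.full(X.shape[1], np.nan)
    return LinearRegression().fit(X[valid], y[valid]).coef_


# Synthetic data with a known linear relation: y = 2*x1 - 3*x2 + 1
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 1.0

# Punch gaps into one predictor and into the target
X[::7, 0] = np.nan
y[::5] = np.nan

coefs = fit_masked(X, y)
```

Inside the lat/lon loop from the question, X would be the column-stacked predictor series for one grid cell and y the matching dvar series, so each cell is fitted on whatever complete time steps it has.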

This condition handles it by setting a grid cell's coefficients to NaN whenever any of the variables has a missing value along time:

if (data.isel(lat=i, lon=j).ivar1.isnull().any()
        or data.isel(lev=2, lat=i, lon=j).ivar2.isnull().any()
        or data.isel(lev=2, lat=i, lon=j).ivar3.isnull().any()
        or data.isel(lev=0, lat=i, lon=j).ivar4.isnull().any()
        or data2.isel(lat=i, lon=j).dvar.isnull().any()):
    storage_dframe[i, j, :] = np.nan
else:
    storage_dframe[i, j, :] = LinearRegression(...)

where ivarx is the xth independent variable and dvar is the dependent variable.

Answered By: Logan
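
For reference, here is a small runnable sketch of this all-or-nothing check. It is an illustration under assumptions rather than the code actually used: the dataset is synthetic, it has only two predictors and no lev dimension, and the output array name coefs is made up. It collapses the per-variable checks into one test with Dataset.to_array().

```python
import numpy as np
import xarray as xr
from sklearn.linear_model import LinearRegression

# Synthetic (time, lat, lon) dataset with a known relation (assumption)
rng = np.random.default_rng(1)
dims = ("time", "lat", "lon")
shape = (20, 2, 3)
ivar1 = rng.normal(size=shape)
ivar2 = rng.normal(size=shape)
dvar = 0.5 * ivar1 + 4.0 * ivar2
ivar1[3, 0, 0] = np.nan  # a single gap makes grid cell (0, 0) incomplete

data = xr.Dataset(
    {"ivar1": (dims, ivar1), "ivar2": (dims, ivar2), "dvar": (dims, dvar)}
)

n_lat, n_lon = data.sizes["lat"], data.sizes["lon"]
coefs = np.full((n_lat, n_lon, 2), np.nan)  # cells stay NaN unless fitted

for i in range(n_lat):
    for j in range(n_lon):
        cell = data.isel(lat=i, lon=j)
        # True if any variable has a missing value anywhere along time
        if bool(cell.to_array().isnull().any()):
            continue
        X = np.column_stack((cell.ivar1.values, cell.ivar2.values))
        coefs[i, j, :] = LinearRegression().fit(X, cell.dvar.values).coef_
```

Note the trade-off: this discards an entire grid cell for a single missing time step, whereas masking individual time steps keeps the cell and fits on the remaining observations.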