# Masking NaN values from an xarray dataset for scikit-learn multiple linear regression following scipy

## Question:

I’m attempting to use `LinearRegression` from scikit-learn’s `linear_model` module to find the multiple linear regression coefficients for different variables at each latitude and longitude point along the time dimension, like so:

``````
import numpy as np
from sklearn.linear_model import LinearRegression

for i in range(len(data.lat)):
    for j in range(len(data.lon)):
        # stack the two predictors as columns -> shape (time, 2)
        X = np.column_stack((data.ivar1.values[:, i, j],
                             data.ivar2.values[:, i, j]))
        y = data.dvar.values[:, i, j]
        storage_dframe[i, j, :] = LinearRegression().fit(X, y).coef_
``````

While this general form works, there are abundant NaN values in my data because it comes from real observations. I would rather not impute data if I can avoid it, so as to preserve whatever real relationships there might be. Is it possible to copy a behavior from scipy.stats.linregress, where "Missing values are considered pair-wise: if a value is missing in x, the corresponding value in y is masked"? This feels like the best route; otherwise, could I add a conditional clause along the lines of

``````
if data.ivar1[:, i, j].isnull() or data.ivar2[:, i, j].isnull() == True:
    storage_dframe[i, j, :] = np.nan
else:
    X = np.column_stack((data.ivar1.values[:, i, j],
                         data.ivar2.values[:, i, j]))
    y = data.dvar.values[:, i, j]
    storage_dframe[i, j, :] = LinearRegression().fit(X, y).coef_
``````

I’ve attempted essentially that, with no success. Please feel free to chime in!

## Answer:

Your code is difficult to read, especially without context, so here’s a simpler example of what I think you’re trying to do:

``````
import numpy as np

# generate some fake input and output data
inp = np.array(range(10))
# inp -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

outp = np.array([x*x if x%2 else np.nan for x in inp])
# outp -> [nan,  1., nan,  9., nan, 25., nan, 49., nan, 81.]

# keep only the positions where the output is not NaN
mask = ~np.isnan(outp)
# mask -> [False, True, False, True, False, True, False, True, False, True]

masked_inp = inp[mask]
# masked_inp -> [1, 3, 5, 7, 9]

masked_outp = outp[mask]
# masked_outp -> [ 1.,  9., 25., 49., 81.]
``````
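The same kind of mask extends to several predictors: a time step is kept only when every independent variable and the dependent variable are present, which mirrors the pair-wise masking quoted from scipy. The following is a minimal sketch for a single grid point, assuming two predictors. The function name `masked_coefs` and the synthetic data are hypothetical and not from the original post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def masked_coefs(x1, x2, y):
    """Fit y ~ x1 + x2 using only time steps where all three values are finite."""
    X = np.column_stack([x1, x2])                         # shape (time, 2)
    valid = np.isfinite(X).all(axis=1) & np.isfinite(y)   # row-wise mask
    model = LinearRegression().fit(X[valid], y[valid])
    return model.coef_

# hypothetical synthetic series with an exact linear relation
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 2.0 * x1 - 3.0 * x2 + 1.0

# knock out some observations in a predictor and in the target
x1[::7] = np.nan
y[3::11] = np.nan

coefs = masked_coefs(x1, x2, y)
print(coefs)
```

Because only the incomplete time steps are dropped, the remaining observations still recover the slopes used to build `y`, and no values are imputed.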

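This boolean clause handles it: `.any()` collapses each element-wise `isnull()` result to a single truth value, so any grid point with a missing value in any variable is set to NaN and skipped: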

``````
if (data.isel(lat=i, lon=j).ivar1.isnull().any()
        or data.isel(lev=2, lat=i, lon=j).ivar2.isnull().any()
        or data.isel(lev=2, lat=i, lon=j).ivar3.isnull().any()
        or data.isel(lev=0, lat=i, lon=j).ivar4.isnull().any()
        or data2.isel(lat=i, lon=j).dvar.isnull().any()):
    storage_dframe[i, j, :] = np.nan
else:
    storage_dframe[i, j, :] = LinearRegression(...)
``````

where ivarx is the xth independent variable and dvar is the dependent variable.
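As an alternative to discarding a whole grid point whenever any value is missing, the check can be relaxed so that only the incomplete time steps are dropped, and a point is left as NaN only when too few complete time steps remain. This is a sketch under assumptions, not the poster's method: plain NumPy arrays shaped `(time, lat, lon)` stand in for arrays such as `data.ivar1.values`, and `fit_grid` and `min_samples` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_grid(predictors, target, min_samples=3):
    """Return regression coefficients with shape (lat, lon, n_predictors).

    predictors: list of arrays shaped (time, lat, lon)
    target:     array shaped (time, lat, lon)
    """
    n_time, n_lat, n_lon = target.shape
    coefs = np.full((n_lat, n_lon, len(predictors)), np.nan)
    for i in range(n_lat):
        for j in range(n_lon):
            X = np.column_stack([p[:, i, j] for p in predictors])
            y = target[:, i, j]
            # keep a time step only if every variable is present
            valid = np.isfinite(X).all(axis=1) & np.isfinite(y)
            if valid.sum() < min_samples:
                continue  # too little data: leave this grid point as NaN
            coefs[i, j] = LinearRegression().fit(X[valid], y[valid]).coef_
    return coefs

# hypothetical synthetic grid: 40 time steps, 2 latitudes, 3 longitudes
rng = np.random.default_rng(1)
ivar1 = rng.normal(size=(40, 2, 3))
ivar2 = rng.normal(size=(40, 2, 3))
dvar = 0.5 * ivar1 + 4.0 * ivar2

ivar1[::5, 0, 0] = np.nan   # scattered gaps at one grid point
dvar[:, 1, 2] = np.nan      # one grid point with no observations at all

coefs = fit_grid([ivar1, ivar2], dvar)
print(coefs[0, 0], coefs[1, 2])
```

The threshold `min_samples` is a judgment call: with two predictors plus an intercept, at least three complete time steps are needed for the fit to be determined, and a higher value may be more appropriate for noisy observations.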
