Masking NaN values from an xarray dataset for scikit-learn multiple linear regression, following scipy
Question:
I’m trying to use LinearRegression from sklearn.linear_model to find the multiple linear regression coefficients for several variables at each latitude and longitude point, along the time dimension, like so:
import numpy as np
from sklearn.linear_model import LinearRegression

for i in range(len(data.lat)):
    for j in range(len(data.lon)):
        # stack the predictors as columns: shape (time, 2)
        X = np.column_stack((data.ivar1.values[:, i, j], data.ivar2.values[:, i, j]))
        y = data.dvar.values[:, i, j]
        storage_dframe[i, j, :] = LinearRegression().fit(X, y).coef_
While this general form works, my data contain abundant NaN values because they come from real observations. I would rather not impute data if I can avoid it, so as to preserve whatever real relationships exist. Is it possible to copy the behavior of scipy.stats.linregress, where "Missing values are considered pair-wise: if a value is missing in x, the corresponding value in y is masked"? This feels like the best route; otherwise, could I add a conditional clause along the lines of:
# attempted check, inside the i/j loops
if data.ivar1[:, i, j].isnull() or data.ivar2[:, i, j].isnull() == True:
    storage_dframe[i, j, :] = np.nan
else:
    X = np.column_stack((data.ivar1.values[:, i, j], data.ivar2.values[:, i, j]))
    y = data.dvar.values[:, i, j]
    storage_dframe[i, j, :] = LinearRegression().fit(X, y).coef_
I’ve attempted essentially that, with no success. Please feel free to chime in!
Answers:
Your code is difficult to read, especially without context, so here’s a simpler example of what I think you’re trying to do:
import numpy as np

# generate some fake input and output data
inp = np.arange(10)
# inp -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
outp = np.array([x*x if x%2 else np.nan for x in inp])
# outp -> [nan, 1., nan, 9., nan, 25., nan, 49., nan, 81.]
mask = ~np.isnan(outp)
# mask -> [False, True, False, True, False, True, False, True, False, True]
masked_inp = inp[mask]
# masked_inp -> [1, 3, 5, 7, 9]
masked_outp = outp[mask]
# masked_outp -> [ 1., 9., 25., 49., 81.]
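To carry this over to a multiple regression, one option is to build a single row-wise mask that is True only where every predictor and the response are finite, and then fit on the surviving rows. The following is a minimal sketch with made-up data; the names X, y and valid are illustrative and do not come from the question:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2*x1 - 3*x2 + 1 exactly, so the fit is easy to check.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 2.0 * x1 - 3.0 * x2 + 1.0

# Knock out some values to mimic missing observations.
x1[[3, 10]] = np.nan
x2[[7]] = np.nan
y[[10, 20, 30]] = np.nan

X = np.column_stack((x1, x2))  # shape (time, n_predictors)

# A time step is usable only if all predictors and the response are finite.
valid = np.isfinite(X).all(axis=1) & np.isfinite(y)

model = LinearRegression().fit(X[valid], y[valid])
coefs = model.coef_            # approximately [2, -3]
intercept = model.intercept_   # approximately 1
n_used = int(valid.sum())      # number of time steps that entered the fit
```

Nothing is imputed here: rows with a gap in any variable are simply left out of the fit, which is the closest analogue to pair-wise masking when there is more than one predictor.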
The following boolean clause handles it by writing NaN for any grid point where one of the variables has a missing value:
if (data.isel(lat=i, lon=j).ivar1.isnull().any()
        or data.isel(lev=2, lat=i, lon=j).ivar2.isnull().any()
        or data.isel(lev=2, lat=i, lon=j).ivar3.isnull().any()
        or data.isel(lev=0, lat=i, lon=j).ivar4.isnull().any()
        or data2.isel(lat=i, lon=j).dvar.isnull().any()):
    storage_dframe[i, j, :] = np.nan
else:
    storage_dframe[i, j, :] = LinearRegression(...)
where ivarx is the xth independent variable and dvar is the dependent variable.
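Note that the clause above discards a whole grid point as soon as a single time step is missing. If keeping partially observed points matters, the same check can be applied per time step instead. The sketch below assumes a small synthetic dataset with dimensions (time, lat, lon) and two predictors; fit_point and min_samples are hypothetical names, and the threshold of 10 is arbitrary:

```python
import numpy as np
import xarray as xr
from sklearn.linear_model import LinearRegression

def fit_point(X, y, min_samples=10):
    """Return regression coefficients for one grid point, ignoring time
    steps where any predictor or the response is NaN."""
    valid = np.isfinite(X).all(axis=1) & np.isfinite(y)
    if valid.sum() < min_samples:
        return np.full(X.shape[1], np.nan)  # too few observations to fit
    return LinearRegression().fit(X[valid], y[valid]).coef_

# Synthetic stand-in for the real dataset.
rng = np.random.default_rng(1)
shape = (40, 3, 4)
dims = ("time", "lat", "lon")
ivar1 = rng.normal(size=shape)
ivar2 = rng.normal(size=shape)
dvar = 0.5 * ivar1 + 1.5 * ivar2
dvar[::4, 0, 0] = np.nan      # scattered gaps at one grid point
ivar1[:, 2, 3] = np.nan       # one grid point with no usable data
data = xr.Dataset({"ivar1": (dims, ivar1), "ivar2": (dims, ivar2), "dvar": (dims, dvar)})

coefs = np.full((data.sizes["lat"], data.sizes["lon"], 2), np.nan)
for i in range(data.sizes["lat"]):
    for j in range(data.sizes["lon"]):
        point = data.isel(lat=i, lon=j)
        X = np.column_stack((point.ivar1.values, point.ivar2.values))
        coefs[i, j, :] = fit_point(X, point.dvar.values)
```

With this layout, the point with scattered gaps still gets coefficients from its remaining time steps, while the fully missing point stays NaN.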