Fill NaNs with per-column max in dask dataframe
Question:
I need to impute missing values in a dataframe by replacing each np.nan with the maximum of its column. Unfortunately, SimpleImputer does not support this strategy according to the documentation:
https://ml.dask.org/modules/generated/dask_ml.impute.SimpleImputer.html
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
So I’m trying to do this manually with fillna. This is my attempt:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({
    'height': [6.21, 5.12, 5.85, 5.78, 5.98, np.nan],
    'weight': [np.nan, 150, 126, 133, 164, 203]
})
df_dask = dd.from_pandas(df, npartitions=2)
meta = [('height', 'float'), ('weight', 'float')]
# axis=1 hands each *row* to the lambda, so x.max() is the row maximum
df_dask = df_dask.apply(lambda x: x.fillna(x.max()), axis=1, meta=meta)
df_dask.compute()
   height  weight
0    6.21    6.21
1    5.12  150.00
2    5.85  126.00
3    5.78  133.00
4    5.98  164.00
5  203.00  203.00
I’m using axis=1 to work by column; however, dask is taking the max of the row. How do I fix this?
Answers:
The axis argument works the same way in dask.dataframe as it does in pandas: axis=1 passes each row to the function, which is why you are getting row maxima. To work on each column, you need axis=0, as in pandas:
In [9]: df.apply(lambda x: x.fillna(x.max()), axis=0)
Out[9]:
   height  weight
0    6.21   203.0
1    5.12   150.0
2    5.85   126.0
3    5.78   133.0
4    5.98   164.0
5    6.21   203.0
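To make the axis semantics concrete, here is a small pandas-only sketch using the dataframe from the question. The variable names (col_lengths, by_column, and so on) are only illustrative. It shows what the function receives under each axis value, and how that changes the fill value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [6.21, 5.12, 5.85, 5.78, 5.98, np.nan],
    "weight": [np.nan, 150, 126, 133, 164, 203],
})

# axis=0: the function is called once per column (a Series of 6 values)
col_lengths = df.apply(len, axis=0)
# axis=1: the function is called once per row (a Series of 2 values)
row_lengths = df.apply(len, axis=1)

# Fill with the per-column maximum (what the question wants)
by_column = df.apply(lambda s: s.fillna(s.max()), axis=0)
# Fill with the per-row maximum (what the question's attempt computes)
by_row = df.apply(lambda s: s.fillna(s.max()), axis=1)

print(col_lengths.tolist(), row_lengths.tolist())
print(by_column)
print(by_row)
```

With axis=1 the missing weight in row 0 is filled from that row's height, which matches the unexpected output in the question.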
However, in dask.dataframe, you cannot currently apply a function column-wise. See the dask.dataframe.apply docs:
Parallel version of pandas.DataFrame.apply
This mimics the pandas version except for the following:
- Only axis=1 is supported (and must be specified explicitly).
- The user should provide output metadata via the meta keyword.
You can easily do this without apply, though, by passing the column maxima straight to fillna:
In [19]: df_dask.fillna(df_dask.max(), axis=0).compute()
Out[19]:
   height  weight
0    6.21   203.0
1    5.12   150.0
2    5.85   126.0
3    5.78   133.0
4    5.98   164.0
5    6.21   203.0