Fill NaNs with per-column max in dask dataframe

Question:

I need to replace each np.nan in a dataframe with the maximum value of its column. Unfortunately, SimpleImputer does not support this strategy according to the documentation:

https://ml.dask.org/modules/generated/dask_ml.impute.SimpleImputer.html

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

So I’m trying to do this manually with fillna. This is my attempt:

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({
    'height': [6.21, 5.12, 5.85, 5.78, 5.98, np.nan],
    'weight': [np.nan, 150, 126, 133, 164, 203]
})

df_dask = dd.from_pandas(df, npartitions=2)
meta = [('height', 'float'), ('weight', 'float')]
df_dask = df_dask.apply(lambda x: x.fillna(x.max()), axis=1, meta=meta)

df_dask.compute()

    height  weight
0   6.21    6.21
1   5.12    150.00
2   5.85    126.00
3   5.78    133.00
4   5.98    164.00
5   203.00  203.00

I’m using axis=1 expecting it to work by column, but dask is taking the max of each row instead. How can I fix this?

Asked By: ps0604


Answers:

The axis argument means the same thing in dask.dataframe as it does in pandas: axis=0 applies the function to each column, and axis=1 applies it to each row. In pandas, the column-wise call gives the result you want:

In [9]: df.apply(lambda x: x.fillna(x.max()), axis=0)
Out[9]:
   height  weight
0    6.21   203.0
1    5.12   150.0
2    5.85   126.0
3    5.78   133.0
4    5.98   164.0
5    6.21   203.0
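
To make the difference between the two axes concrete, here is a minimal pandas-only sketch that runs the same lambda both ways on the question's data. The variable names by_column and by_row are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [6.21, 5.12, 5.85, 5.78, 5.98, np.nan],
    "weight": [np.nan, 150, 126, 133, 164, 203],
})

# axis=0: the lambda receives one whole column at a time,
# so x.max() is the maximum of that column.
by_column = df.apply(lambda x: x.fillna(x.max()), axis=0)

# axis=1: the lambda receives one row at a time (a two-element Series
# indexed by 'height' and 'weight'), so x.max() is the maximum of that row.
by_row = df.apply(lambda x: x.fillna(x.max()), axis=1)

print(by_column)
print(by_row)
```

The by_row result reproduces the unwanted output from the question: each NaN is filled with the other value in the same row.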

However, in dask.dataframe, you cannot currently apply a function column-wise. See the dask.dataframe.apply docs:

Parallel version of pandas.DataFrame.apply

This mimics the pandas version except for the following:

  • Only axis=1 is supported (and must be specified explicitly).
  • The user should provide output metadata via the meta keyword.

You don't need apply for this, though. Pass the per-column maxima straight to fillna:

In [19]: df_dask.fillna(df_dask.max(), axis=0).compute()
Out[19]:
   height  weight
0    6.21   203.0
1    5.12   150.0
2    5.85   126.0
3    5.78   133.0
4    5.98   164.0
5    6.21   203.0
Answered By: Michael Delgado