Pandas Dataframe: Replacing NaN with row average

Question:

I am trying to learn pandas but I have been puzzled with the following. I want to replace NaNs in a DataFrame with the row average. Hence something like df.fillna(df.mean(axis=1)) should work but for some reason it fails for me. Am I missing anything, is there something wrong with what I’m doing? Is it because its not implemented? see link here

import pandas as pd
import numpy as np
​
pd.__version__
Out[44]:
'0.15.2'

In [45]:
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]
df

Out[45]:
    c1  c2  c3
0   1   4   7
1   2   5   NaN
2   3   6   9

In [46]:  
df.fillna(df.mean(axis=1)) 

Out[46]:
    c1  c2  c3
0   1   4   7
1   2   5   NaN
2   3   6   9

However something like this looks to work fine

df.fillna(df.mean(axis=0)) 

Out[47]:
    c1  c2  c3
0   1   4   7
1   2   5   8
2   3   6   9
Asked By: Aenaon

||

Answers:

As commented the axis argument to fillna is NotImplemented.

df.fillna(df.mean(axis=1), axis=1)

Note: this would be critical here as you don’t want to fill in your nth columns with the nth row average.

For now you’ll need to iterate through:

m = df.mean(axis=1)
for i, col in enumerate(df):
    # using i allows for duplicate columns
    # inplace *may* not always work here, so IMO the next line is preferred
    # df.iloc[:, i].fillna(m, inplace=True)
    df.iloc[:, i] = df.iloc[:, i].fillna(m)

print(df)

   c1  c2   c3
0   1   4  7.0
1   2   5  3.5
2   3   6  9.0

An alternative is to fillna the transpose and then transpose, which may be more efficient…

df.T.fillna(df.mean(axis=1)).T
Answered By: Andy Hayden

As an alternative, you could also use an apply with a lambda expression like this:

df.apply(lambda row: row.fillna(row.mean()), axis=1)

yielding also

    c1   c2   c3
0  1.0  4.0  7.0
1  2.0  5.0  3.5
2  3.0  6.0  9.0
Answered By: Cleb

I’ll propose an alternative that involves casting into numpy arrays. Performance wise, I think this is more efficient and probably scales better than the other proposed solutions so far.

The idea being to use an indicator matrix (df.isna().values which is 1 if the element is N/A, 0 otherwise) and broadcast-multiplying that to the row averages.
Thus, we end up with a matrix (exactly the same shape as the original df), which contains the row-average value if the original element was N/A, and 0 otherwise.

We add this matrix to the original df, making sure to fillna with 0 so that, in effect, we have filled the N/A’s with the respective row averages.

# setup code
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]

# fillna row-wise
row_avgs = df.mean(axis=1).values.reshape(-1,1)
df = df.fillna(0) + df.isna().values * row_avgs
df

giving

    c1   c2   c3
0   1.0  4.0  7.0
1   2.0  5.0  3.5
2   3.0  6.0  9.0
Answered By: Troy

Just had the same problem. I found this workaround to be working:

df.transpose().fillna(df.mean(axis=1)).transpose()

I’m not sure though about the efficiency of this solution.

Answered By: LKho

You can broadcast the mean to a DataFrame with the same index as the original and then use update with overwrite=False to get the behavior of .fillna. Unlike .fillna, update allows for filling when the Indices have duplicated labels. Should be faster than the looping .fillna for smaller than 50,000 rows or so.

fill = pd.DataFrame(np.broadcast_to(df.mean(1).to_numpy()[:, None], df.shape), 
                    columns=df.columns,
                    index=df.index)

df.update(fill, overwrite=False)
print(df)

     1    1    1
0  1.0  4.0  7.0
0  2.0  5.0  3.5
0  3.0  6.0  9.0
Answered By: ALollz

For an efficient solution, use DataFrame.where:

We could use where on axis=0:

df.where(df.notna(), df.mean(axis=1), axis=0)

or mask on axis=0:

df.mask(df.isna(), df.mean(axis=1), axis=0)

By using axis=0, we can fill in the missing values in each column with the row averages.

These methods perform very similarly (where does slightly better on large DataFrames (300_000, 20)) and is ~35-50% faster than the numpy methods posted here and is 110x faster than the double transpose method.

Some benchmarks:

df = creator()

>>> %timeit df.where(df.notna(), df.mean(axis=1), axis=0)
542 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df.mask(df.isna(), df.mean(axis=1), axis=0)
555 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df.fillna(0) + df.isna().values * df.mean(axis=1).values.reshape(-1,1)
751 ms ± 22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit fill = pd.DataFrame(np.broadcast_to(df.mean(1).to_numpy()[:, None], df.shape), columns=df.columns, index=df.index); df.update(fill, overwrite=False)
848 ms ± 22.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit df.apply(lambda row: row.fillna(row.mean()), axis=1)
1min 4s ± 5.32 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit df.T.fillna(df.mean(axis=1)).T
1min 5s ± 2.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

def creator():
    A = np.random.rand(300_000, 20)
    A.ravel()[np.random.choice(A.size, 300_000, replace=False)] = np.nan
    return pd.DataFrame(A)
Answered By: user7864386