Taking the mean of a row of a pandas dataframe with NaN and arrays

Question:

Here is my reproducible example:

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'x': [np.nan, np.array([0, 2])],
    'y': [np.array([3, 2]), np.nan],
    'z': [np.array([4, 5]), np.nan],
    't': [np.array([3, 4]), np.array([4, 5])],
})

I would like to compute the mean array for each row, excluding NaN.

I have tried df.mean(axis=1), which gives NaN for both rows. This is particularly surprising to me, as df.sum(axis=1) appears to work the way I would have expected.

[df.loc[i,:].mean() for i in df.index] does work, but I am sure there is a more straightforward solution.

Asked By: user1627466


Answers:

Your DataFrame uses the object dtype, which is always a bit of a bodge. It's slower than native types, and it doesn't always behave the way you'd expect.

Since Pandas removed the "Panel" type which was used for 3D data, I’d recommend you not store this data in a DataFrame. Instead, store it in a 3D NumPy array, then you can use np.nanmean() to easily calculate averages while ignoring NaN.
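As a rough sketch of what that could look like (the layout is an assumption: axis 0 for rows, axis 1 for the columns x, y, z, t, axis 2 for the array elements, with missing cells padded as NaN pairs):

```python
import numpy as np

# Same data as the question's DataFrame, rearranged into a 3D array.
# Missing cells become [nan, nan] so every slot has the same shape.
arr = np.array([
    [[np.nan, np.nan], [3, 2], [4, 5], [3, 4]],            # row 0: x missing
    [[0, 2], [np.nan, np.nan], [np.nan, np.nan], [4, 5]],  # row 1: y, z missing
])

# Mean over the "columns" axis, ignoring NaN entries.
row_means = np.nanmean(arr, axis=1)
print(row_means)
```

This reproduces the per-row mean arrays ([10/3, 11/3] for the first row, [2.0, 3.5] for the second) without any object-dtype overhead.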

Answered By: John Zwinck

Another possible solution:

df.apply(lambda x: np.mean(x[x.notnull()]), axis=1)

Output:

0    [3.3333333333333335, 3.6666666666666665]
1                                  [2.0, 3.5]
dtype: object
Answered By: PaulS