Taking the mean of a row of a pandas dataframe with NaN and arrays
Question:
Here is my reproducible example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'x': [np.nan, np.array([0, 2])], 'y': [np.array([3, 2]), np.nan], 'z': [np.array([4, 5]), np.nan], 't': [np.array([3, 4]), np.array([4, 5])]})
I would like to compute the mean array for each row, excluding NaN.
I have tried df.mean(axis=1), which gives NaN for both rows. This is particularly surprising to me, as df.sum(axis=1) appears to work as I would have expected.
[df.loc[i,:].mean() for i in df.index]
does work, but I am sure there is a more straightforward solution.
Answers:
Your DataFrame uses the object dtype, which is always a bit of a bodge. It’s slower than native types, and doesn’t always behave the way you’d expect.
Since Pandas removed the "Panel" type, which was used for 3D data, I’d recommend you not store this data in a DataFrame. Instead, store it in a 3D NumPy array; then you can use np.nanmean() to easily calculate averages while ignoring NaN.
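A minimal sketch of that suggestion, assuming every cell holds a length-2 vector and that a missing cell can be represented as a pair of NaNs. The name `data` and the (row, column, component) axis layout are illustrative choices, not something taken from the question.

```python
import numpy as np

# Placeholder for a missing cell: one NaN per vector component (assumed length 2).
missing = [np.nan, np.nan]

# Axis 0: rows of the original frame, axis 1: columns x, y, z, t,
# axis 2: components of each stored vector.
data = np.array([
    [missing, [3, 2], [4, 5], [3, 4]],   # row 0
    [[0, 2],  missing, missing, [4, 5]], # row 1
], dtype=float)

# Average across the column axis; NaN entries are skipped.
row_means = np.nanmean(data, axis=1)
print(row_means)
```

The result has one mean vector per row, and it stays in a native float array rather than an object-dtype column.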
Another possible solution:
df.apply(lambda x: np.mean(x[x.notnull()]), axis=1)
Output:
0 [3.3333333333333335, 3.6666666666666665]
1 [2.0, 3.5]
dtype: object