pandas groupby mean with nan
Question:
I have the following dataframe:
date id cars
2012 1 4
2013 1 6
2014 1 NaN
2012 2 10
2013 2 20
2014 2 NaN
Now I want to get the mean of cars over the years for each id, ignoring the NaNs. The result should look like this:
date id cars result
2012 1 4 5
2013 1 6 5
2014 1 NaN 5
2012 2 10 15
2013 2 20 15
2014 2 NaN 15
I have the following command:
df["result"]=df.groupby("id")["cars"].mean()
The command runs without errors, but the result column contains only NaNs.
What did I do wrong?
Answers:
Use transform, which returns a Series the same size as the original:
df["result"]=df.groupby("id")["cars"].transform('mean')
print(df)
date id cars result
0 2012 1 4.0 5.0
1 2013 1 6.0 5.0
2 2014 1 NaN 5.0
3 2012 2 10.0 15.0
4 2013 2 20.0 15.0
5 2014 2 NaN 15.0
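To see why the two calls behave differently, it can help to compare the index of each result. The sketch below rebuilds the question's dataframe (assuming a default 0..5 row index, as in the printed output above) and shows that mean() returns one value per id, indexed by id, while transform('mean') returns one value per row, indexed like df. Assignment to a column aligns on index labels, so any row whose label does not match an id value ends up NaN. The names per_id and per_row are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the dataframe from the question (default RangeIndex assumed)
df = pd.DataFrame({
    "date": [2012, 2013, 2014, 2012, 2013, 2014],
    "id": [1, 1, 1, 2, 2, 2],
    "cars": [4, 6, np.nan, 10, 20, np.nan],
})

# Aggregation: one row per group, indexed by the group key (id)
per_id = df.groupby("id")["cars"].mean()

# Transform: one row per original row, indexed like df
per_row = df.groupby("id")["cars"].transform("mean")

print(per_id.index.tolist())   # group keys
print(per_row.index.tolist())  # original row labels

# Safe to assign, because the index matches df's index
df["result"] = per_row
```

Both calls skip NaN by default when computing the mean, so the only difference here is the shape and index of what comes back.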
Hello, good old 2017 question. This is just another way, and one with a lot of overhead. You write about getting only NaN values as the mean (as soon as one of the numbers is NaN) with df["result"]=df.groupby("id")["cars"].mean(). In 2023, I did not run into this problem. Perhaps it has been fixed in later versions? Anyway, if you ever face this again, you might want to know how to get the mean per id without NaN skewing everything:
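import numpy as np
# Silence NumPy floating-point warnings (divide-by-zero, invalid value)
np.seterr(divide='ignore', invalid='ignore')
# Per-id average of cars, computed on the non-NaN values only
df.groupby(['id']).apply(lambda x: np.average(x['cars'].dropna()))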
After this, join the result back on id. I will not take the time to show this, since this answer has a lot of overhead for the question at hand and should not be used for it. There may just be some who are searching for a way to get the means without NaNs in the first place.
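For completeness, the join step mentioned above could look roughly like the following. This is only a sketch, not the answerer's own code: it selects the cars column before calling apply (a small variation on the snippet above), names the resulting Series result, and uses DataFrame.join with on="id" to attach it to every row. The variable names means and out are made up for illustration.

```python
import numpy as np
import pandas as pd

# Dataframe from the question
df = pd.DataFrame({
    "date": [2012, 2013, 2014, 2012, 2013, 2014],
    "id": [1, 1, 1, 2, 2, 2],
    "cars": [4, 6, np.nan, 10, 20, np.nan],
})

# Per-id average over the non-NaN values; the result is indexed by id
means = df.groupby("id")["cars"].apply(lambda s: np.average(s.dropna()))

# Left-join the per-id means onto the rows, matching df["id"] to means.index
out = df.join(means.rename("result"), on="id")

print(out)
```

The transform('mean') answer gives the same column in a single step, which is why this route is mostly of interest when the per-id means are needed on their own as well.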