How to select the first value that is not NaN after using groupby().agg()?
Question:
I have the following code:
df.groupby(["id", "year"], as_index=False).agg({"brand":"first", "color":"first"})
However, I have some values that are NaN. I want to select the first value that is not NaN.
Suppose my dataframe looks like this:
id
year
brand
color
001
2010
NaN
Blue
001
2010
Audi
NaN
001
2010
Audo
Blue
001
2011
Bmw
NaN
001
2011
NaN
NaN
001
2012
BMW
Green
002
2010
Tesla
White
I want to find all unique combinations of id and year,i.e. df.groupby(["id", "year"])
and then find the first non-NaN value. The motivation behind this is that a have a large and messy data set with many missing values and many typos. In the example table I also simulated a typo. Note that it is irrelevant whether the typo is first and gets chosen, as long as I keep track of the data per combination of id and year. Typos are a completely separate problem for now.
The desired output would be:
id
year
brand
color
001
2010
Audi
Blue
001
2011
BMW
NaN
001
2012
BMW
Green
002
2010
Tesla
White
Answers:
This approach will produce the below output.
df1 = df.groupby(['id', 'year']).agg({
'brand': lambda x: x.dropna().iloc[0] if x.dropna().any() else np.nan,
'color': lambda x: x.dropna().iloc[0] if x.dropna().any() else np.nan,
}).reset_index()
print(df1)
Using a costume function:
def get_first_non_nan(x):
return x.dropna().iloc[0] if x.dropna().any() else np.nan
df1 = df.groupby(['id', 'year']).agg({
'brand': get_first_non_nan,
'color': get_first_non_nan,
}).reset_index()
print(df1)
id year brand color
0 001 2010 Audi Blue
1 001 2011 Bmw NaN
2 001 2012 BMW Green
3 002 2010 Tesla White
I have the following code:
df.groupby(["id", "year"], as_index=False).agg({"brand":"first", "color":"first"})
However, I have some values that are NaN. I want to select the first value that is not NaN.
Suppose my dataframe looks like this:
id | year | brand | color |
---|---|---|---|
001 | 2010 | NaN | Blue |
001 | 2010 | Audi | NaN |
001 | 2010 | Audo | Blue |
001 | 2011 | Bmw | NaN |
001 | 2011 | NaN | NaN |
001 | 2012 | BMW | Green |
002 | 2010 | Tesla | White |
I want to find all unique combinations of id and year,i.e. df.groupby(["id", "year"])
and then find the first non-NaN value. The motivation behind this is that a have a large and messy data set with many missing values and many typos. In the example table I also simulated a typo. Note that it is irrelevant whether the typo is first and gets chosen, as long as I keep track of the data per combination of id and year. Typos are a completely separate problem for now.
The desired output would be:
id | year | brand | color |
---|---|---|---|
001 | 2010 | Audi | Blue |
001 | 2011 | BMW | NaN |
001 | 2012 | BMW | Green |
002 | 2010 | Tesla | White |
This approach will produce the below output.
df1 = df.groupby(['id', 'year']).agg({
'brand': lambda x: x.dropna().iloc[0] if x.dropna().any() else np.nan,
'color': lambda x: x.dropna().iloc[0] if x.dropna().any() else np.nan,
}).reset_index()
print(df1)
Using a costume function:
def get_first_non_nan(x):
return x.dropna().iloc[0] if x.dropna().any() else np.nan
df1 = df.groupby(['id', 'year']).agg({
'brand': get_first_non_nan,
'color': get_first_non_nan,
}).reset_index()
print(df1)
id year brand color
0 001 2010 Audi Blue
1 001 2011 Bmw NaN
2 001 2012 BMW Green
3 002 2010 Tesla White