How to select the first value that is not NaN after using groupby().agg()?

Question

I have the following code:

df.groupby(["id", "year"], as_index=False).agg({"brand":"first", "color":"first"})

However, I have some values that are NaN. I want to select the first value that is not NaN.

Suppose my dataframe looks like this:

id	year	brand	color
001	2010	NaN	Blue
001	2010	Audi	NaN
001	2010	Audo	Blue
001	2011	Bmw	NaN
001	2011	NaN	NaN
001	2012	BMW	Green
002	2010	Tesla	White

I want to find all unique combinations of id and year,i.e. df.groupby(["id", "year"]) and then find the first non-NaN value. The motivation behind this is that a have a large and messy data set with many missing values and many typos. In the example table I also simulated a typo. Note that it is irrelevant whether the typo is first and gets chosen, as long as I keep track of the data per combination of id and year. Typos are a completely separate problem for now.

The desired output would be:

id	year	brand	color
001	2010	Audi	Blue
001	2011	BMW	NaN
001	2012	BMW	Green
002	2010	Tesla	White

Asked By: Xtiaan

||

Source

Answer 1

This approach will produce the below output.

df1 = df.groupby(['id', 'year']).agg({
    'brand': lambda x: x.dropna().iloc[0] if x.dropna().any() else np.nan,
    'color': lambda x: x.dropna().iloc[0] if x.dropna().any() else np.nan,
}).reset_index()
print(df1)

Using a costume function:

def get_first_non_nan(x):
    return x.dropna().iloc[0] if x.dropna().any() else np.nan
    
df1 = df.groupby(['id', 'year']).agg({
    'brand': get_first_non_nan,
    'color': get_first_non_nan,
}).reset_index()
print(df1)

    id  year  brand  color
0  001  2010   Audi   Blue
1  001  2011    Bmw    NaN
2  001  2012    BMW  Green
3  002  2010  Tesla  White

Answered By: Jamiu S.

How to select the first value that is not NaN after using groupby().agg()?

Question:

Answers: