How to select the first value that is not NaN after using groupby().agg()?

Question:

I have the following code:

df.groupby(["id", "year"], as_index=False).agg({"brand":"first", "color":"first"})

However, I have some values that are NaN. I want to select the first value that is not NaN.

Suppose my dataframe looks like this:

id year brand color
001 2010 NaN Blue
001 2010 Audi NaN
001 2010 Audo Blue
001 2011 Bmw NaN
001 2011 NaN NaN
001 2012 BMW Green
002 2010 Tesla White

I want to find all unique combinations of id and year,i.e. df.groupby(["id", "year"]) and then find the first non-NaN value. The motivation behind this is that a have a large and messy data set with many missing values and many typos. In the example table I also simulated a typo. Note that it is irrelevant whether the typo is first and gets chosen, as long as I keep track of the data per combination of id and year. Typos are a completely separate problem for now.

The desired output would be:

id year brand color
001 2010 Audi Blue
001 2011 BMW NaN
001 2012 BMW Green
002 2010 Tesla White
Asked By: Xtiaan

||

Answers:

This approach will produce the below output.

df1 = df.groupby(['id', 'year']).agg({
    'brand': lambda x: x.dropna().iloc[0] if x.dropna().any() else np.nan,
    'color': lambda x: x.dropna().iloc[0] if x.dropna().any() else np.nan,
}).reset_index()
print(df1)

Using a costume function:

def get_first_non_nan(x):
    return x.dropna().iloc[0] if x.dropna().any() else np.nan
    
df1 = df.groupby(['id', 'year']).agg({
    'brand': get_first_non_nan,
    'color': get_first_non_nan,
}).reset_index()
print(df1)

    id  year  brand  color
0  001  2010   Audi   Blue
1  001  2011    Bmw    NaN
2  001  2012    BMW  Green
3  002  2010  Tesla  White
Answered By: Jamiu S.