Pandas : Fill rows and drop duplicates, but keep different values

Question

I’ll try to be as clear as possible with my example.

df_old

user, col1,col2,col3
a   ,  X  ,    ,
a   ,     ,  Y ,
a   ,     ,    , 6
b   ,  A  ,    ,
b   ,     ,  B , C
b   ,     ,  D ,

This dataframe is ordered by user. I would like to fill the blanks and drop the duplicates, so for user a I would get only one row in the final dataframe.
I’m struggling with cases like user b: since col2 holds two different values for user b, I want the final dataframe to have two separate rows:

df_new

user, col1,col2,col3
a   ,  X  ,  Y , 6
b   ,  A  ,  B , C
b   ,  A  ,  D , C

Note that I want the rows to be "consistent" so B and C stay at the same index.

Thanks a lot for any help !

Asked By: Movilla

Answer 1

You could ffill within each user group, then dropna to keep only the fully populated rows:

df.groupby('user', group_keys=False).apply(lambda g: g.ffill()).dropna(how='any')

Output:

  user col1 col2 col3
2    a    X    Y    6
4    b    A    B    C
5    b    A    D    C
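
As a hedged variant of the same idea: newer pandas versions warn when groupby.apply operates on the grouping column, so one possible sketch is to call GroupBy.ffill on the value columns and re-attach user afterwards. The column list and the names filled and out are only for illustration; this assumes the default integer index of the example input.

```python
import pandas as pd

df = pd.DataFrame({'user': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'col1': ['X', None, None, 'A', None, None],
                   'col2': [None, 'Y', None, None, 'B', 'D'],
                   'col3': [None, None, '6', None, 'C', None]})

# Forward fill the value columns within each user group;
# the result keeps df's original index
filled = df.groupby('user')[['col1', 'col2', 'col3']].ffill()

# Re-attach the user column by index, then keep only fully populated rows
out = df[['user']].join(filled).dropna(how='any')
print(out)
```

This should leave the same three rows (index 2, 4 and 5) as the one-liner above.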

Alternatively, if you don’t want to fill the NaNs, you could use groupby.transform to shift the non-NaN values up, then dropna the all-NaN rows:

out = (df.set_index('user')
         .groupby(level=0)
         .transform(lambda s: s.sort_values(key=lambda x: x.isna(),
                                            kind='stable').values)
         .dropna(how='all').reset_index()
      )

Output:

  user col1 col2 col3
0    a    X    Y    6
1    b    A    B    C
2    b  NaN    D  NaN
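
To see what the transform does, here is a small sketch applied to the three rows of user b on their own. The helper name push_up and the frame group_b are made up for this illustration; the sort call is the same one used in the lambda above.

```python
import pandas as pd

# Value columns of user b from the question, taken on their own
group_b = pd.DataFrame({'col1': ['A', None, None],
                        'col2': [None, 'B', 'D'],
                        'col3': [None, 'C', None]})

def push_up(s):
    # Hypothetical helper, not part of pandas.
    # Sort on "is missing": False (present) sorts before True (missing);
    # kind='stable' keeps the present values in their original order.
    # .values drops the reordered index so values are placed by position.
    return s.sort_values(key=lambda x: x.isna(), kind='stable').values

# Apply the helper column by column and rebuild the frame
shifted = pd.DataFrame({c: push_up(group_b[c]) for c in group_b.columns})
print(shifted)

# The last row is now entirely missing, so dropna(how='all') removes it
print(shifted.dropna(how='all'))
```

Each column is shifted independently, which is why B and C end up in the same row while D is left with missing neighbours.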

Used input:

import pandas as pd

df = pd.DataFrame({'user': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'col1': ['X', None, None, 'A', None, None],
                   'col2': [None, 'Y', None, None, 'B', 'D'],
                   'col3': [None, None, '6', None, 'C', None]})

Answered By: mozway

Answer 2

Use GroupBy.transform: within each user group, set duplicated values to NaN with Series.mask and Series.duplicated, sort so that the non-NaN values come first, forward fill the missing values, and finally remove the duplicate rows:

out = (df.set_index('user')
         .groupby('user')
         .transform(lambda x: x.mask(x.duplicated()).sort_values(key=pd.isna, kind='stable').ffill())
         .reset_index()
         .drop_duplicates(ignore_index=True)
         )
print (out)
  user col1 col2 col3
0    a    X    Y    6
1    b    A    B    C
2    b    A    D    C
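
The mask/duplicated step does nothing visible on the sample data, because no value repeats within a group. A minimal sketch on a hypothetical column where 'A' appears twice shows what it is for:

```python
import pandas as pd

# Hypothetical column in which 'A' is repeated within one user group
s = pd.Series(['A', 'A', None, 'B'])

# duplicated() flags every repeat after the first occurrence
print(s.duplicated().tolist())   # [False, True, False, False]

# mask() replaces the flagged entries with a missing value
deduped = s.mask(s.duplicated())
print(deduped.isna().tolist())   # [False, True, True, False]
```

The repeated value becomes a gap, so the sort that follows pushes it to the bottom of the group together with the other missing values.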

EDIT: If you want to keep the missing values in rows that have at least one non-missing value, omit ffill and add DataFrame.dropna with the how='all' parameter:

out = (df.set_index('user')
         .groupby('user')
         .transform(lambda x: x.mask(x.duplicated()).sort_values(key=pd.isna, kind='stable'))
         .dropna(how='all')
         .reset_index()
         )
print (out)
  user col1 col2 col3
0    a    X    Y    6
1    b    A    B    C
2    b  NaN    D  NaN

Answered By: jezrael
