Pandas: Fill rows and drop duplicates, but keep different values

Question:

I’ll try to be as clear as possible with my example.

df_old

user, col1,col2,col3
a   ,  X  ,    ,
a   ,     ,  Y ,
a   ,     ,    , 6
b   ,  A  ,    ,
b   ,     ,  B , C
b   ,     ,  D ,

This dataframe is ordered by user. I would like to fill the blanks and drop the duplicates, so that user a ends up with only one row in the final dataframe.
I’m struggling with cases like user b. Since there are 2 different values in col2 for user b, I want the final dataframe to have 2 different rows:

df_new

user, col1,col2,col3
a   ,  X  ,  Y , 6
b   ,  A  ,  B , C
b   ,  A  ,  D , C

Note that I want the rows to be "consistent" so B and C stay at the same index.

Thanks a lot for any help !

Asked By: Movilla


Answers:

You could forward-fill within each user with ffill, then use dropna to remove the rows that still contain a NaN:

df.groupby('user', group_keys=False).apply(lambda g: g.ffill()).dropna(how='any')

Output:

  user col1 col2 col3
2    a    X    Y    6
4    b    A    B    C
5    b    A    D    C
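
For reference, here is a minimal sketch of the same idea that does not pass the grouping column through apply, a pattern that, as far as I know, newer pandas versions warn about. It assumes the sample data from the question, uses GroupBy.ffill (which returns only the non-grouping columns) and joins user back on by index:

```python
import pandas as pd

df = pd.DataFrame({'user': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'col1': ['X', None, None, 'A', None, None],
                   'col2': [None, 'Y', None, None, 'B', 'D'],
                   'col3': [None, None, '6', None, 'C', None]})

# Forward-fill inside each user; the result keeps df's index
# but does not contain the 'user' column.
filled = df.groupby('user').ffill()

# Reattach 'user' by index, then keep only the fully populated rows.
out = df[['user']].join(filled).dropna(how='any')
print(out)
```

Keep in mind that this relies on row order: a value is only carried down to later rows of the same user, never up to earlier ones.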

Alternatively, if you don’t want to fill the NaNs, you could use groupby.transform to shift the non-NaN values up, then dropna the all-NaN rows:

out = (df.set_index('user')
         .groupby(level=0)
         .transform(lambda s: s.sort_values(key=lambda x: x.isna(),
                                            kind='stable').values)
         .dropna(how='all').reset_index()
      )

Output:

  user col1 col2 col3
0    a    X    Y    6
1    b    A    B    C
2    b  NaN    D  NaN
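
If the transform one-liner is hard to follow, the same "shift the values up" idea can be spelled out with a plain loop. This is only a sketch, assuming the sample df from this answer; push_values_up is a hypothetical helper name, not a pandas function:

```python
import pandas as pd

df = pd.DataFrame({'user': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'col1': ['X', None, None, 'A', None, None],
                   'col2': [None, 'Y', None, None, 'B', 'D'],
                   'col3': [None, None, '6', None, 'C', None]})


def push_values_up(group):
    """Hypothetical helper: move the non-missing values of each column to the top."""
    cols = {}
    for name in group.columns:
        present = group[name].dropna().tolist()
        # Pad with None so every column keeps the group's length.
        cols[name] = present + [None] * (len(group) - len(present))
    return pd.DataFrame(cols)


pieces = []
for user, group in df.groupby('user'):
    compact = push_values_up(group.drop(columns='user'))
    compact = compact.dropna(how='all')   # drop rows that ended up empty
    compact.insert(0, 'user', user)       # put the user label back
    pieces.append(compact)

out = pd.concat(pieces, ignore_index=True)
print(out)
```

The loop is slower than transform on large frames, but it makes each step explicit.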

Used input:

import pandas as pd

df = pd.DataFrame({'user': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'col1': ['X', None, None, 'A', None, None],
                   'col2': [None, 'Y', None, None, 'B', 'D'],
                   'col3': [None, None, '6', None, 'C', None]})
Answered By: mozway

Use GroupBy.transform to process each column per user: replace repeated values with NaN using Series.mask and Series.duplicated, sort so that the non-NaN values come first, forward-fill the remaining missing values, and finally remove the duplicated rows:

out = (df.set_index('user')
         .groupby('user')
         .transform(lambda x: x.mask(x.duplicated())
                               .sort_values(key=pd.isna, kind='stable')  # stable: keep original order of values
                               .ffill())
         .reset_index()
         .drop_duplicates(ignore_index=True)
         )
print(out)
  user col1 col2 col3
0    a    X    Y    6
1    b    A    B    C
2    b    A    D    C
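
To see what each step of the lambda does, here is a small sketch on a single made-up column (the values are an assumption for illustration, not taken from the question). kind='stable' is passed so that the non-missing values keep their original relative order:

```python
import pandas as pd

# Hypothetical column for one user: 'B' is repeated and one value is missing.
s = pd.Series(['B', None, 'B', 'D'])

# Step 1: turn every repeated value into NaN, keeping the first occurrence.
masked = s.mask(s.duplicated())

# Step 2: sort on "is missing", so non-missing values move to the top.
ordered = masked.sort_values(key=pd.isna, kind='stable')

# Step 3: fill the trailing gaps with the last seen value.
filled = ordered.ffill()

print(masked.isna().tolist())
print(ordered.index.tolist())
print(filled.tolist())
```

After these steps every column of a user holds its distinct values followed by repeats of the last one, which is why the final drop_duplicates collapses the padded rows.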

EDIT: If you want to keep the missing values in rows that have at least one non-missing value, omit ffill and add DataFrame.dropna with the how='all' parameter:

out = (df.set_index('user')
         .groupby('user')
         .transform(lambda x: x.mask(x.duplicated())
                               .sort_values(key=pd.isna, kind='stable'))  # stable: keep original order of values
         .dropna(how='all')
         .reset_index()
         )
print(out)
  user col1 col2 col3
0    a    X    Y    6
1    b    A    B    C
2    b  NaN    D  NaN
Answered By: jezrael