Pandas: Fill rows and drop duplicates, but keep different values
Question:
I’ll try to be as clear as possible with my example.
df_old
user, col1,col2,col3
a , X , ,
a , , Y ,
a , , , 6
b , A , ,
b , , B , C
b , , D ,
This dataframe is ordered by user. I would like to fill the blanks and drop the duplicates, so for user a I would get only one row in the final dataframe.
I’m struggling with cases like user b. As there are 2 different values in col2 for user b, I want the final dataframe to have 2 different rows:
df_new
user, col1,col2,col3
a , X , Y , 6
b , A , B , C
b , A , D , C
Note that I want the rows to be "consistent" so B and C stay at the same index.
Thanks a lot for any help !
Answers:
df.groupby('user', group_keys=False).apply(lambda g: g.ffill()).dropna(how='any')
Output:
  user col1 col2 col3
2    a    X    Y    6
4    b    A    B    C
5    b    A    D    C
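A minimal sketch of the same forward-fill idea that avoids apply, assuming the input frame shown under "Used input". GroupBy.ffill returns only the non-grouping columns, so the user column is joined back by index (the name filled is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'user': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'col1': ['X', None, None, 'A', None, None],
                   'col2': [None, 'Y', None, None, 'B', 'D'],
                   'col3': [None, None, '6', None, 'C', None]})

# Forward fill within each user; the result has col1..col3 and the original index.
filled_cols = df.groupby('user').ffill()

# Put the user column back, then keep only the fully populated rows.
filled = df[['user']].join(filled_cols).dropna(how='any')
print(filled)
```

Note that forward filling depends on row order: a value can only be carried down, so a row that appears before its group's other values are seen stays incomplete and is dropped.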
Alternatively, if you don’t want to fill the NaNs, you could use groupby.transform to shift the non-NaN values up, then dropna the all-NaN rows:
out = (df.set_index('user')
         .groupby(level=0)
         .transform(lambda s: s.sort_values(key=lambda x: x.isna(),
                                            kind='stable').values)
         .dropna(how='all').reset_index()
      )
Output:
  user col1 col2 col3
0    a    X    Y    6
1    b    A    B    C
2    b  NaN    D  NaN
Used input:
import pandas as pd

df = pd.DataFrame({'user': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'col1': ['X', None, None, 'A', None, None],
                   'col2': [None, 'Y', None, None, 'B', 'D'],
                   'col3': [None, None, '6', None, 'C', None]})
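To see what "shift the non-NaN values up" does inside the transform, here is a small sketch on a single column, using the col2 values of user b as sample data. Sorting by the boolean isna key with a stable sort puts the non-missing values first while keeping their original relative order; taking the raw values afterwards discards the reordered index, which I assume is why the answer uses .values:

```python
import pandas as pd

s = pd.Series([None, 'B', 'D'], name='col2')

# False (not missing) sorts before True (missing); 'stable' keeps B before D.
shifted = s.sort_values(key=lambda x: x.isna(), kind='stable')

# Drop the reordered index so the values are used positionally.
values = shifted.to_numpy()
print(shifted)
```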
Use GroupBy.transform to set duplicated values to NaN with Series.mask and Series.duplicated, sort each column so that the non-NaN values come first, forward fill the missing values, and finally remove the duplicated rows:
out = (df.set_index('user')
         .groupby('user')
         .transform(lambda x: x.mask(x.duplicated()).sort_values(key=pd.isna).ffill())
         .reset_index()
         .drop_duplicates(ignore_index=True)
      )
print(out)
  user col1 col2 col3
0    a    X    Y    6
1    b    A    B    C
2    b    A    D    C
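The lambda in this transform can be followed step by step on one column. This is only a sketch with made-up sample values (a col2-like Series where B is repeated), and it passes kind='stable' explicitly, which the answer does not do:

```python
import pandas as pd

s = pd.Series([None, 'B', 'D', 'B'], name='col2')

# 1. Replace repeated values with NaN so each value is kept once.
masked = s.mask(s.duplicated())

# 2. Move the non-missing values to the top, preserving their order.
ordered = masked.sort_values(key=pd.isna, kind='stable')

# 3. Fill the remaining gaps with the last seen value.
filled = ordered.ffill()
print(filled.tolist())
```

After the forward fill the trailing rows repeat the last value, which is why drop_duplicates is still needed at the end of the chain.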
EDIT: If you need to keep the missing values in rows that have at least one non-missing value, omit ffill and add DataFrame.dropna with the how='all' parameter:
out = (df.set_index('user')
         .groupby('user')
         .transform(lambda x: x.mask(x.duplicated()).sort_values(key=pd.isna))
         .dropna(how='all')
         .reset_index()
      )
print(out)
  user col1 col2 col3
0    a    X    Y    6
1    b    A    B    C
2    b  NaN    D  NaN
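The difference between the two dropna modes used in these answers can be checked on a tiny frame. This is a sketch with illustrative data only (the names rows, kept_all and kept_any are not from the answers):

```python
import pandas as pd

rows = pd.DataFrame({'col1': ['A', None, None],
                     'col2': ['B', 'D', None]})

# how='all' removes a row only when every column is missing.
kept_all = rows.dropna(how='all')

# how='any' removes a row as soon as one column is missing.
kept_any = rows.dropna(how='any')

print(kept_all)
print(kept_any)
```

So how='all' keeps the partially filled row (NaN, D), while how='any' keeps only the complete row (A, B).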