How to drop_duplicates but remain a specified value in pandas dataframe?
Question:
date price_bl price_ss price_bs price_sl
0 2022-03-09 03:00:00 41198.5 NaN NaN NaN
0 2022-03-10 01:00:00 NaN NaN NaN 40931.0
0 2022-03-10 01:00:00 NaN NaN 40931.0 NaN
1 2022-03-16 02:00:00 40867.8 NaN NaN NaN
0 2022-03-16 02:00:00 NaN 40867.8 NaN NaN
# after drop_duplicates...
date price_bl price_ss price_bs price_sl
0 2022-03-09 03:00:00 41198.5 NaN NaN NaN
0 2022-03-10 01:00:00 NaN NaN NaN 40931.0
1 2022-03-16 02:00:00 40867.8 NaN NaN NaN
1 2022-03-16 03:00:00 NaN NaN NaN 39137.9
1 2022-03-16 08:00:00 NaN 40289.0 NaN NaN
As you can see, price_bs at 2022-03-10 01:00:00 has two values: 40931.0 and NaN, and after drop_duplicates only one value remains, the NaN. I want it to keep the normal value instead of the NaN (whenever a timestamp has both a normal value and a NaN). What should I do? (I mean not only for the price_bs column, but also for the other three price_* columns.)
Answers:
You can try combining rows via groupby and an aggregation function like max or first:
df = df.groupby('date', as_index=False).max()
# or
df = df.groupby('date', as_index=False).first()  # uses the first non-null value in each column per group
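A minimal runnable sketch of the first() approach, using a toy frame modeled on the question's data (only three of the four price_* columns, to keep it short). first() skips NaN within each group, so the two 01:00:00 rows collapse into one row that keeps both non-null prices:

```python
import pandas as pd
import numpy as np

# Toy frame mirroring the question: the 01:00:00 timestamp appears twice,
# each duplicate row holding its value in a different price_* column.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-03-09 03:00:00",
        "2022-03-10 01:00:00",
        "2022-03-10 01:00:00",
    ]),
    "price_bl": [41198.5, np.nan, np.nan],
    "price_bs": [np.nan, np.nan, 40931.0],
    "price_sl": [np.nan, 40931.0, np.nan],
})

# first() returns the first non-null value per column within each group,
# so the duplicate timestamps merge without losing either price.
out = df.groupby("date", as_index=False).first()
print(out)
```

After the groupby, the 2022-03-10 01:00:00 row carries both 40931.0 values, which drop_duplicates alone could not achieve.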