How to convert max values of a multiple groupby dataframe to nan?
Question:
I have this df:
CODE MONTH PP
0 100007 01 22.1
1 100007 01 20
2 100007 01 5
3 100007 01 10
4 100007 01 12
... .. ..
10542747 155217 02 11
10542748 155217 02 12
10542749 155217 02 15
10542750 155217 02 18
10542751 155217 02 3
[10542752 rows x 3 columns]
I want to first group the df by df['CODE'] and df['MONTH'], and then convert the max value of the 'PP' column in each group to NaN.
So I wrote this code:
grouped_df = pd.DataFrame()
for i, data in df.groupby(['CODE','MONTH']):
    data.loc[data['PP']==data['PP'].max(), 'PP'] = np.nan
    grouped_df = grouped_df.append(data)
But it takes too long to run, about 15 minutes, probably because the df has 10542752 rows x 3 columns. Is there any way to make this code faster?
Thanks in advance
Answers:
There is no need for the loop. Perform boolean indexing directly, using groupby.transform('max') as the reference:
m = df.groupby(['CODE','MONTH'])['PP'].transform('max')
df.loc[df['PP'].eq(m), 'PP'] = np.nan
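For anyone who wants to try this out, here is a minimal self-contained sketch of the transform approach. The ten-row frame is made up from the rows shown in the question, and `group_max` is just an illustrative name. Note that if several rows tie for a group's maximum, all of them become NaN, which matches what the original loop does.

```python
import numpy as np
import pandas as pd

# Small made-up sample with the same columns as the question's df.
df = pd.DataFrame({
    "CODE": [100007] * 5 + [155217] * 5,
    "MONTH": ["01"] * 5 + ["02"] * 5,
    "PP": [22.1, 20, 5, 10, 12, 11, 12, 15, 18, 3],
})

# Maximum of PP within each (CODE, MONTH) group, broadcast back to every row
# so it lines up with df's index.
group_max = df.groupby(["CODE", "MONTH"])["PP"].transform("max")

# Set every row that equals its group's maximum to NaN (ties are all replaced).
df.loc[df["PP"].eq(group_max), "PP"] = np.nan
print(df)
```

Because `transform` returns a Series aligned to the original index, the comparison is a single vectorized operation instead of one Python-level iteration per group.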
Using mask:
df['PP'] = df['PP'].mask(df.groupby(['CODE','MONTH'])['PP'].transform('max').eq(df['PP']), np.nan)
df
CODE MONTH PP
0 100007 1 NaN
1 100007 1 20.0
2 100007 1 5.0
3 100007 1 10.0
4 100007 1 12.0
10542747 155217 2 11.0
10542748 155217 2 12.0
10542749 155217 2 15.0
10542750 155217 2 NaN
10542751 155217 2 3.0
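As a sanity check, the sketch below compares the question's loop against a vectorized version on a small random frame. The data, the seed, and the names `looped`, `vectorized` and `group_max` are all made up for illustration. The loop collects the groups in a list and uses pd.concat, since DataFrame.append was removed in pandas 2.0.

```python
import numpy as np
import pandas as pd

# Random stand-in data; sizes and value ranges are arbitrary assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "CODE": rng.integers(0, 50, size=2000),
    "MONTH": rng.integers(1, 13, size=2000),
    "PP": rng.integers(0, 100, size=2000).astype(float),
})

# Loop version, equivalent to the code in the question.
parts = []
for _, data in df.groupby(["CODE", "MONTH"]):
    data = data.copy()  # work on a copy of the group, not a view of df
    data.loc[data["PP"] == data["PP"].max(), "PP"] = np.nan
    parts.append(data)
looped = pd.concat(parts).sort_index()  # restore the original row order

# Vectorized version: compare each row with its group's maximum, then mask.
group_max = df.groupby(["CODE", "MONTH"])["PP"].transform("max")
vectorized = df.assign(PP=df["PP"].mask(df["PP"].eq(group_max)))

# Raises an AssertionError if the two results differ.
pd.testing.assert_frame_equal(looped, vectorized)
```

The two approaches should agree, including on ties; only the per-group Python loop is gone.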