calculate sum of a column after groupby based on unique values of second column
Question:
I have a dataframe with columns usr, gp2, gp3, id, sub_id, activity:
usr gp2 gp3 id sub_id activity
1 IN ASIA 1 1 1
1 IN ASIA 1 2 1
1 IN ASIA 2 9 0
2 IN ASIA 3 4 1
2 IN ASIA 3 5 1
2 IN ASIA 4 6 1
2 IN ASIA 4 7 0
2 IN ASIA 4 8 0
I want to aggregate the above dataframe by grouping on usr, gp2, gp3 and calculate two columns: 'Account (id)', the number of unique ids in each group, and 'Actuals (Activity)', the activity summed once per unique id.
For example, for id = 1 the activity sum would be 1, not 2.
usr gp2 gp3 id Activity
1 IN ASIA 2 1
2 IN ASIA 2 2
df.groupby(['usr', 'gp2', 'gp3']).agg({'id': pd.Series.nunique, 'activity': LOGIC_REQUIRED})
Answers:
Use GroupBy.apply
to operate on multiple (dependent) columns:
(df.drop(columns='sub_id')
   .groupby(['usr', 'gp2', 'gp3'])
   .apply(lambda x: pd.DataFrame(
       {'id': [x['id'].nunique()],
        'activity': [x[x.activity.ne(0)]
                     .drop_duplicates(subset='id')['activity'].sum()]})
       .set_index('id'))
   .reset_index())
usr gp2 gp3 id activity
0 1 IN ASIA 2 1
1 2 IN ASIA 2 2
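The same result can also be reached without apply. A minimal vectorized sketch, assuming an id counts toward the activity sum whenever any of its rows has activity == 1: collapse to one row per id first, then aggregate per group with named aggregation.

```python
import pandas as pd

df = pd.DataFrame({'usr': [1, 1, 1, 2, 2, 2, 2, 2],
                   'gp2': ['IN'] * 8,
                   'gp3': ['ASIA'] * 8,
                   'id': [1, 1, 2, 3, 3, 4, 4, 4],
                   'sub_id': [1, 2, 9, 4, 5, 6, 7, 8],
                   'activity': [1, 1, 0, 1, 1, 1, 0, 0]})

# Collapse to one row per id: the id is active (1) if any of its
# sub_id rows has activity == 1, so duplicates are not double counted.
per_id = (df.groupby(['usr', 'gp2', 'gp3', 'id'], as_index=False)['activity']
            .max())

# Aggregate per group: count unique ids and sum the per-id flags.
out = (per_id.groupby(['usr', 'gp2', 'gp3'])
             .agg(id=('id', 'nunique'), activity=('activity', 'sum'))
             .reset_index())
print(out)
```

This avoids the per-group Python call that GroupBy.apply incurs, which can matter on large frames.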
import pandas as pd
df = pd.DataFrame({'usr':[1, 1, 1, 2, 2, 2, 2, 2],
'gp2':['IN', 'IN', 'IN', 'IN', 'IN', 'IN', 'IN', 'IN'],
'gp3':['ASIA', 'ASIA', 'ASIA', 'ASIA', 'ASIA', 'ASIA', 'ASIA', 'ASIA'],
'id':[1, 1, 2, 3, 3, 4, 4, 4],
'sub_id':[1, 2, 9, 4, 5, 6, 7, 8],
'activity':[1, 1, 0, 1, 1, 1, 0, 0],
})
# Sum the activity once per unique id: an id counts as active if any
# of its rows has activity == 1, so sub_id rows are not double counted.
per_id = df.groupby(['usr', 'gp2', 'gp3', 'id'])['activity'].max()
df = (df.groupby(['usr', 'gp2', 'gp3'])
        .agg({'id': 'nunique'})
        .reset_index()
      )
df['Activity'] = per_id.groupby(['usr', 'gp2', 'gp3']).sum().values
usr gp2 gp3 id Activity
0 1 IN ASIA 2 1
1 2 IN ASIA 2 2