loop through a group by, aggregate, and make new columns based on the groups
Question:
I am trying to get by-student columns that count up activities by their progress level.
STUDENT_ID STUDENT_ACTIVITY_SESSION_ID NODE_NAME ACTIVITY_NAME prog_level
FredID gobbledeegook1 Node1 MyActivity1 pass
FredID gobbledeegook2 Node1 MyActivity1 pass
FredID gobbledeegook3 Node2 MyActivity2 pass
JaniceID gobbledeegook4 Node3 MyActivity3 stay
JaniceID gobbledeegook5 Node3 MyActivity3 stay
JaniceID gobbledeegook5 Node3 MyActivity3 fail
STUDENT_ID attempts_pass attempts_fail attempts_stay
FredID 3
JaniceID 1 2
- I attempt to loop through so variable names are automatic. I want each row to be a STUDENT_ID, and the count to be a column
def std_attempts_by_prog_level(df):
dict_fields = {}
df_by_prog_level = df.groupby('prog_level')['STUDENT_ACTIVITY_SESSION_ID']
for name, group in df_by_prog_level:
x = group.count()
dict_fields["attempts_" + name] = x
return pd.Series(dict_fields)
df.groupby('STUDENT_ID').apply(std_attempts_by_prog_level).reset_index()
result:
STUDENT_ID level_1 0
0 Fred attempts_cancel 104
1 Fred attempts_fail 96
2 Fred attempts_in_progress 30
…so this would need to be pivoted and messed with, so I tried just taking it from a pivot approach
- Pivot approach and naming the fields manually: the resulting multi-index won’t let me readily merge back with the other by-student metrics
df_temp=df.groupby(['STUDENT_ID', 'prog_level'],as_index=False)['STUDENT_ACTIVITY_SESSION_ID'].count().pivot(index='STUDENT_ID', columns='prog_level').rename({'cancel':'attempts_cancel', 'fail':'attempts_fail', 'in_progress':'attempts_in_progress', 'pass':'attempts_pass'}, axis=1)
print(df_temp.columns)
result:
MultiIndex([('STUDENT_ACTIVITY_SESSION_ID', 'attempts_cancel'),
('STUDENT_ACTIVITY_SESSION_ID', 'attempts_fail'),
('STUDENT_ACTIVITY_SESSION_ID', 'attempts_in_progress'),
('STUDENT_ACTIVITY_SESSION_ID', 'attempts_pass')],
names=[None, 'prog_level'])
Answers:
You can use a combination pd.get_dummies()
, groupby()
and sum()
:
df2 = (pd.get_dummies(df, columns=['prog_level']).groupby('STUDENT_ID', as_index=False).sum())
df2
:
STUDENT_ID prog_level_fail prog_level_pass prog_level_stay
0 FredID 0 3 0
1 JaniceID 1 0 2
We can use rename
to get the desired column names:
df2.rename(columns={"prog_level_fail":"attempts_fail", "prog_level_pass":"attempts_pass", "prog_level_stay":"attempts_stay"}, inplace = True)
We can index df2
to get the desired column order:
df2 = df2[['STUDENT_ID', 'attempts_pass', 'attempts_fail', 'attempts_stay']]
This is the final result of df2
:
STUDENT_ID attempts_pass attempts_fail attempts_stay
0 FredID 3 0 0
1 JaniceID 0 1 2
You could use .pivot_table
:
result = df.pivot_table(
index="STUDENT_ID", columns="prog_level", values="ACTIVITY_NAME",
aggfunc="count", fill_value=0
).rename(lambda c: f"prog_level_{c}", axis=1).rename_axis(None, axis=1)
Result:
prog_level_fail prog_level_pass prog_level_stay
STUDENT_ID
FredID 0 3 0
JaniceID 1 0 2
If you want the index as column then add .reset_index()
at the end of the pipeline.
I am trying to get by-student columns that count up activities by their progress level.
STUDENT_ID STUDENT_ACTIVITY_SESSION_ID NODE_NAME ACTIVITY_NAME prog_level FredID gobbledeegook1 Node1 MyActivity1 pass FredID gobbledeegook2 Node1 MyActivity1 pass FredID gobbledeegook3 Node2 MyActivity2 pass JaniceID gobbledeegook4 Node3 MyActivity3 stay JaniceID gobbledeegook5 Node3 MyActivity3 stay JaniceID gobbledeegook5 Node3 MyActivity3 fail
STUDENT_ID attempts_pass attempts_fail attempts_stay FredID 3 JaniceID 1 2
- I attempt to loop through so variable names are automatic. I want each row to be a STUDENT_ID, and the count to be a column
def std_attempts_by_prog_level(df):
dict_fields = {}
df_by_prog_level = df.groupby('prog_level')['STUDENT_ACTIVITY_SESSION_ID']
for name, group in df_by_prog_level:
x = group.count()
dict_fields["attempts_" + name] = x
return pd.Series(dict_fields)
df.groupby('STUDENT_ID').apply(std_attempts_by_prog_level).reset_index()
result:
STUDENT_ID level_1 0 0 Fred attempts_cancel 104 1 Fred attempts_fail 96 2 Fred attempts_in_progress 30
…so this would need to be pivoted and messed with, so I tried just taking it from a pivot approach
- Pivot approach and naming the fields manually: the resulting multi-index won’t let me readily merge back with the other by-student metrics
df_temp=df.groupby(['STUDENT_ID', 'prog_level'],as_index=False)['STUDENT_ACTIVITY_SESSION_ID'].count().pivot(index='STUDENT_ID', columns='prog_level').rename({'cancel':'attempts_cancel', 'fail':'attempts_fail', 'in_progress':'attempts_in_progress', 'pass':'attempts_pass'}, axis=1)
print(df_temp.columns)
result:
MultiIndex([('STUDENT_ACTIVITY_SESSION_ID', 'attempts_cancel'),
('STUDENT_ACTIVITY_SESSION_ID', 'attempts_fail'),
('STUDENT_ACTIVITY_SESSION_ID', 'attempts_in_progress'),
('STUDENT_ACTIVITY_SESSION_ID', 'attempts_pass')],
names=[None, 'prog_level'])
You can use a combination pd.get_dummies()
, groupby()
and sum()
:
df2 = (pd.get_dummies(df, columns=['prog_level']).groupby('STUDENT_ID', as_index=False).sum())
df2
:
STUDENT_ID prog_level_fail prog_level_pass prog_level_stay
0 FredID 0 3 0
1 JaniceID 1 0 2
We can use rename
to get the desired column names:
df2.rename(columns={"prog_level_fail":"attempts_fail", "prog_level_pass":"attempts_pass", "prog_level_stay":"attempts_stay"}, inplace = True)
We can index df2
to get the desired column order:
df2 = df2[['STUDENT_ID', 'attempts_pass', 'attempts_fail', 'attempts_stay']]
This is the final result of df2
:
STUDENT_ID attempts_pass attempts_fail attempts_stay
0 FredID 3 0 0
1 JaniceID 0 1 2
You could use .pivot_table
:
result = df.pivot_table(
index="STUDENT_ID", columns="prog_level", values="ACTIVITY_NAME",
aggfunc="count", fill_value=0
).rename(lambda c: f"prog_level_{c}", axis=1).rename_axis(None, axis=1)
Result:
prog_level_fail prog_level_pass prog_level_stay
STUDENT_ID
FredID 0 3 0
JaniceID 1 0 2
If you want the index as column then add .reset_index()
at the end of the pipeline.