loop through a group by, aggregate, and make new columns based on the groups

Question:

I am trying to get by-student columns that count up activities by their progress level.

Data looks like
enter image description here

STUDENT_ID STUDENT_ACTIVITY_SESSION_ID NODE_NAME   ACTIVITY_NAME   prog_level
FredID  gobbledeegook1  Node1   MyActivity1 pass
FredID  gobbledeegook2  Node1   MyActivity1 pass
FredID  gobbledeegook3  Node2   MyActivity2 pass
JaniceID    gobbledeegook4  Node3   MyActivity3 stay
JaniceID    gobbledeegook5  Node3   MyActivity3 stay
JaniceID    gobbledeegook5  Node3   MyActivity3 fail

Here is what I want:
enter image description here

STUDENT_ID attempts_pass   attempts_fail   attempts_stay
FredID  3       
JaniceID        1   2
  1. I attempt to loop through so variable names are automatic. I want each row to be a STUDENT_ID, and the count to be a column
def std_attempts_by_prog_level(df):
    dict_fields = {}
    df_by_prog_level = df.groupby('prog_level')['STUDENT_ACTIVITY_SESSION_ID']
    for name, group in df_by_prog_level:
        x = group.count() 
        dict_fields["attempts_" + name] = x

    return pd.Series(dict_fields)     
  
df.groupby('STUDENT_ID').apply(std_attempts_by_prog_level).reset_index()

result:

STUDENT_ID level_1 0
0   Fred    attempts_cancel 104
1   Fred    attempts_fail   96
2   Fred    attempts_in_progress    30

…so this would need to be pivoted and messed with, so I tried just taking it from a pivot approach

  1. Pivot approach and naming the fields manually: the resulting multi-index won’t let me readily merge back with the other by-student metrics
df_temp=df.groupby(['STUDENT_ID', 'prog_level'],as_index=False)['STUDENT_ACTIVITY_SESSION_ID'].count().pivot(index='STUDENT_ID', columns='prog_level').rename({'cancel':'attempts_cancel', 'fail':'attempts_fail', 'in_progress':'attempts_in_progress', 'pass':'attempts_pass'}, axis=1)

print(df_temp.columns)

result:

MultiIndex([('STUDENT_ACTIVITY_SESSION_ID',      'attempts_cancel'),
            ('STUDENT_ACTIVITY_SESSION_ID',        'attempts_fail'),
            ('STUDENT_ACTIVITY_SESSION_ID', 'attempts_in_progress'),
            ('STUDENT_ACTIVITY_SESSION_ID',        'attempts_pass')],
           names=[None, 'prog_level'])
Asked By: CiviLearner

||

Answers:

You can use a combination pd.get_dummies(), groupby() and sum():

df2 = (pd.get_dummies(df, columns=['prog_level']).groupby('STUDENT_ID', as_index=False).sum())

df2:

  STUDENT_ID  prog_level_fail  prog_level_pass  prog_level_stay
0     FredID                0                3                0
1   JaniceID                1                0                2

We can use rename to get the desired column names:

df2.rename(columns={"prog_level_fail":"attempts_fail", "prog_level_pass":"attempts_pass", "prog_level_stay":"attempts_stay"}, inplace = True)

We can index df2 to get the desired column order:

df2 = df2[['STUDENT_ID', 'attempts_pass', 'attempts_fail', 'attempts_stay']]

This is the final result of df2:

  STUDENT_ID  attempts_pass  attempts_fail  attempts_stay
0     FredID              3              0              0
1   JaniceID              0              1              2
Answered By: Marcelo Paco

You could use .pivot_table:

result = df.pivot_table(
    index="STUDENT_ID", columns="prog_level", values="ACTIVITY_NAME",
    aggfunc="count", fill_value=0
).rename(lambda c: f"prog_level_{c}", axis=1).rename_axis(None, axis=1)

Result:

            prog_level_fail  prog_level_pass  prog_level_stay
STUDENT_ID                                                   
FredID                    0                3                0
JaniceID                  1                0                2

If you want the index as column then add .reset_index() at the end of the pipeline.

Answered By: Timus
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.