I have to pass multiple columns to the one aggregate function

Question

I have a very large but simple 2D dataframe, and I need to reshape it into the following form (no need to look at the data, only the shape matters. Data processing and filtering are already implemented in some functions. The index values are taken from the time_spend_company column):

              max                                           mean        
  last_evaluation       satisfaction_level       last_evaluation
                0     1                  0     1               0     1
2             1.0  0.99                1.0  0.94            0.72  0.69
7             1.0  0.90                1.0  0.94            0.77  0.75

                                       min
  satisfaction_level       last_evaluation       satisfaction_level
                   0     1               0     1                  0     1
2               0.70  0.65            0.37  0.52               0.09  0.29
7               0.48  0.60            0.36  0.42               0.09  0.15

The only columns you need to know about are last_evaluation, satisfaction_level and left (left is 0 or 1 and indicates whether the employee has left the company or still works there). Converted to a dictionary, the target df looks like this

{('max', 'last_evaluation', 0): {2: 1.0, 7: 1.0},
 ('max', 'last_evaluation', 1): {2: 0.99, 7: 0.9},
 ('max', 'satisfaction_level', 0): {2: 1.0, 7: 1.0},
 ('max', 'satisfaction_level', 1): {2: 0.94, 7: 0.94},
 ('mean', 'last_evaluation', 0): {2: 0.72, 7: 0.77},
 ('mean', 'last_evaluation', 1): {2: 0.69, 7: 0.75},
 ('mean', 'satisfaction_level', 0): {2: 0.7, 7: 0.48},
 ('mean', 'satisfaction_level', 1): {2: 0.65, 7: 0.6},
 ('min', 'last_evaluation', 0): {2: 0.37, 7: 0.36},
 ('min', 'last_evaluation', 1): {2: 0.52, 7: 0.42},
 ('min', 'satisfaction_level', 0): {2: 0.09, 7: 0.09},
 ('min', 'satisfaction_level', 1): {2: 0.29, 7: 0.15}}

I have tried to do something like this

reformated_df = (
    df.groupby("time_spend_company")
    .agg(min=(("satisfaction_level", "last_eval"), max_last_eval_and_satis_level))
)

I thought that if I passed a tuple of columns, those columns would be passed to my aggregate function, but I’m getting KeyError: "Column(s) [array(['last_eval', 'satisfaction_level'], dtype=object)] do not exist".

I also wondered whether I should use pivot_table, but I don’t know how to apply pivot_table to achieve my goal.

Asked By: Бодя паук

Answer 1

You could try the following:

cols = ["last_evaluation", "satisfaction_level"]
res = (
    df.groupby(["time_spend_company", "left"])[cols]
    .agg(["max", "min", "mean"]).unstack()
    .swaplevel(0, 1, axis=1).sort_index(axis=1)
)

First, group by the time_spend_company and left columns and apply the max, min, and mean functions to the grouped columns last_evaluation and satisfaction_level.
The result is pretty close to what you want, except that the left values run along the row axis instead of along the columns. Use .unstack() to fix that: it’s a bit like pivoting. Without arguments, it takes the innermost index level (here left) and makes it the innermost column level, moving the values into the corresponding positions.
Finally, use .swaplevel with axis=1 to swap the two outermost column levels (the columns form a MultiIndex), so that the function name comes first, and .sort_index(axis=1) to put the columns in order.
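
To make the three steps easier to follow, here is a minimal sketch that runs them one at a time. The sample values and the names step1 and step2 are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({
    "time_spend_company": [2, 2, 2, 2, 7, 7, 7, 7],
    "left":               [0, 0, 1, 1, 0, 0, 1, 1],
    "last_evaluation":    [0.5, 1.0, 0.6, 0.8, 0.4, 0.6, 0.7, 0.9],
    "satisfaction_level": [0.2, 0.4, 0.1, 0.3, 0.9, 0.5, 0.6, 0.2],
})
cols = ["last_evaluation", "satisfaction_level"]

# Step 1: rows are (time_spend_company, left) pairs,
# columns are (column name, function name) pairs.
step1 = df.groupby(["time_spend_company", "left"])[cols].agg(["max", "min", "mean"])
print(step1.shape)

# Step 2: move "left" from the row index to the innermost column level.
step2 = step1.unstack()
print(step2.shape)

# Step 3: put the function name on top, then sort the columns.
res = step2.swaplevel(0, 1, axis=1).sort_index(axis=1)
print(res.columns.tolist())
```

After step 3 every column is keyed by a (function, column, left) tuple, which matches the dictionary keys shown in the question.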

For a dataframe like

from random import choice, randint, random

import pandas as pd

df = pd.DataFrame({
    "time_spend_company": [randint(1, 5) for _ in range(100)],
    "last_evaluation": [random() for _ in range(100)],
    "satisfaction_level": [random() for _ in range(100)],
    "left": [choice([0, 1]) for _ in range(100)]
})

    time_spend_company  last_evaluation  satisfaction_level  left
0                    3         0.290137            0.322621     0
1                    3         0.907717            0.071477     1
2                    5         0.486425            0.392126     1
..                 ...              ...                 ...   ...
97                   1         0.846142            0.992779     0
98                   4         0.043582            0.127273     0
99                   4         0.158687            0.624325     0

[100 rows x 4 columns]

the result looks like

                               max                                         
                   last_evaluation           satisfaction_level             
left                             0         1                  0         1   
time_spend_company                                                          
1                         0.993958  0.955763           0.992779  0.984983   
2                         0.996302  0.918176           0.469955  0.580548   
3                         0.851398  0.915317           0.888917  0.929749   
4                         0.819249  0.710011           0.870759  0.792035   
5                         0.862168  0.927405           0.847103  0.968584   

                              mean                                         
                   last_evaluation           satisfaction_level             
left                             0         1                  0         1   
time_spend_company                                                          
1                         0.673433  0.766494           0.539506  0.746730   
2                         0.395296  0.580339           0.217164  0.339102   
3                         0.413337  0.570722           0.616578  0.424184   
4                         0.320173  0.468517           0.412261  0.542356   
5                         0.500759  0.472259           0.468105  0.610215   

                               min                                         
                   last_evaluation           satisfaction_level            
left                             0         1                  0         1  
time_spend_company                                                         
1                         0.011188  0.579060           0.072963  0.504767  
2                         0.008818  0.116667           0.022334  0.104738  
3                         0.290137  0.071113           0.322621  0.038693  
4                         0.013239  0.229026           0.003464  0.094058  
5                         0.019908  0.039934           0.012394  0.057485
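
Since .unstack() behaves much like a pivot here, the same layout can probably also be reached with pivot_table, which the question mentions. The following is only a sketch on made-up sample data (the values and the name alt are assumptions, not part of the original answer); it has not been checked against the real dataset.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({
    "time_spend_company": [2, 2, 2, 2, 7, 7, 7, 7],
    "left":               [0, 0, 1, 1, 0, 0, 1, 1],
    "last_evaluation":    [0.5, 1.0, 0.6, 0.8, 0.4, 0.6, 0.7, 0.9],
    "satisfaction_level": [0.2, 0.4, 0.1, 0.3, 0.9, 0.5, 0.6, 0.2],
})
cols = ["last_evaluation", "satisfaction_level"]

# A list of aggregation functions makes the function name the outermost
# column level, followed by the value column and then the "left" value.
alt = df.pivot_table(
    index="time_spend_company",
    columns="left",
    values=cols,
    aggfunc=["max", "min", "mean"],
).sort_index(axis=1)

print(alt)
```

One difference to keep in mind: by default pivot_table drops columns whose entries are all NaN, so the two approaches may differ when some combinations are missing from the data.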

Answered By: Timus
