How do I pass multiple columns to one aggregate function?
Question:
I have a very large but simple 2D dataframe, and I have to reshape it into the following form (only the shape matters, not the data; data processing and filtering are already implemented in some functions, and the index is taken from the time_spend_company column):
               max                                             mean
   last_evaluation        satisfaction_level        last_evaluation
                 0     1                   0     1                0     1
2              1.0  0.99                 1.0  0.94             0.72  0.69
7              1.0  0.90                 1.0  0.94             0.77  0.75

                                         min
   satisfaction_level        last_evaluation        satisfaction_level
                    0     1                0     1                   0     1
2                0.70  0.65             0.37  0.52                0.09  0.29
7                0.48  0.60             0.36  0.42                0.09  0.15
The only columns you need to know about are last_evaluation, satisfaction_level and left (left is 0 or 1 and indicates whether the employee has left the company or still works there). If you convert the target df to a dictionary, it looks like this:
{('max', 'last_evaluation', 0): {2: 1.0, 7: 1.0},
('max', 'last_evaluation', 1): {2: 0.99, 7: 0.9},
('max', 'satisfaction_level', 0): {2: 1.0, 7: 1.0},
('max', 'satisfaction_level', 1): {2: 0.94, 7: 0.94},
('mean', 'last_evaluation', 0): {2: 0.72, 7: 0.77},
('mean', 'last_evaluation', 1): {2: 0.69, 7: 0.75},
('mean', 'satisfaction_level', 0): {2: 0.7, 7: 0.48},
('mean', 'satisfaction_level', 1): {2: 0.65, 7: 0.6},
('min', 'last_evaluation', 0): {2: 0.37, 7: 0.36},
('min', 'last_evaluation', 1): {2: 0.52, 7: 0.42},
('min', 'satisfaction_level', 0): {2: 0.09, 7: 0.09},
('min', 'satisfaction_level', 1): {2: 0.29, 7: 0.15}}
I tried something like this:
reformated_df = (
    df.groupby("time_spend_company")
    .agg(min=(("satisfaction_level", "last_eval"), max_last_eval_and_satis_level))
)
I thought that if I passed a tuple of columns, those columns would be handed to my aggregate function, but I'm getting KeyError: "Column(s) [array(['last_eval', 'satisfaction_level'], dtype=object)] do not exist".
I also wondered whether I should use pivot_table, but I don't know how to apply pivot_table to achieve my goal.
Answers:
You could try the following:
cols = ["last_evaluation", "satisfaction_level"]
res = (
    df.groupby(["time_spend_company", "left"])[cols]
    .agg(["max", "min", "mean"]).unstack()
    .swaplevel(0, 1, axis=1).sort_index(axis=1)
)
- First group by the time_spend_company and left columns and apply the max, min, and mean functions to the grouped columns last_evaluation and satisfaction_level.
- The result is pretty close to what you want, except that the left values are listed along the row axis and not along the columns. Use .unstack() to fix that: it's a bit like pivoting. Without arguments it takes the innermost index level (here left) and adds it as the innermost column level, moving the values into the corresponding positions.
- Finally use .swaplevel with axis=1 to swap the two outermost column levels (the columns are a MultiIndex with three levels), and .sort_index(axis=1) to order the columns so that those belonging to the same aggregate sit next to each other.
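To make the level order at each step easier to follow, here is a minimal sketch of the same chain split into three steps. The eight-row dataframe and the names step1 and step2 are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Tiny hypothetical dataset, only to make the index/column levels visible.
df = pd.DataFrame({
    "time_spend_company": [2, 2, 2, 2, 7, 7, 7, 7],
    "left":               [0, 0, 1, 1, 0, 0, 1, 1],
    "last_evaluation":    [0.5, 0.9, 0.4, 0.8, 0.6, 0.7, 0.3, 0.5],
    "satisfaction_level": [0.2, 0.6, 0.1, 0.3, 0.9, 0.7, 0.4, 0.8],
})
cols = ["last_evaluation", "satisfaction_level"]

# Step 1: rows are (time_spend_company, left); columns are (column, aggregate).
step1 = df.groupby(["time_spend_company", "left"])[cols].agg(["max", "min", "mean"])

# Step 2: move "left" from the row index to the innermost column level,
# so the columns become (column, aggregate, left).
step2 = step1.unstack()

# Step 3: swap the two outermost column levels to get (aggregate, column, left),
# then sort the columns so each aggregate's columns are adjacent.
res = step2.swaplevel(0, 1, axis=1).sort_index(axis=1)

print(step1.index.names)       # row levels after grouping
print(step2.columns.names)     # column level names after unstacking
print(res.columns.tolist())    # final (aggregate, column, left) labels
```

Printing the intermediate index and column names is a quick way to check which level .unstack() moved and which two levels .swaplevel exchanged.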
For a dataframe like
import pandas as pd
from random import randint, random, choice

# 100 rows of random sample data with the same columns as in the question
df = pd.DataFrame({
    "time_spend_company": [randint(1, 5) for _ in range(100)],
    "last_evaluation": [random() for _ in range(100)],
    "satisfaction_level": [random() for _ in range(100)],
    "left": [choice([0, 1]) for _ in range(100)]
})
    time_spend_company  last_evaluation  satisfaction_level  left
0                    3         0.290137            0.322621     0
1                    3         0.907717            0.071477     1
2                    5         0.486425            0.392126     1
..                 ...              ...                 ...   ...
97                   1         0.846142            0.992779     0
98                   4         0.043582            0.127273     0
99                   4         0.158687            0.624325     0

[100 rows x 4 columns]
the result looks like
                                max
                    last_evaluation            satisfaction_level
left                              0         1                   0         1
time_spend_company
1                          0.993958  0.955763            0.992779  0.984983
2                          0.996302  0.918176            0.469955  0.580548
3                          0.851398  0.915317            0.888917  0.929749
4                          0.819249  0.710011            0.870759  0.792035
5                          0.862168  0.927405            0.847103  0.968584

                               mean
                    last_evaluation            satisfaction_level
left                              0         1                   0         1
time_spend_company
1                          0.673433  0.766494            0.539506  0.746730
2                          0.395296  0.580339            0.217164  0.339102
3                          0.413337  0.570722            0.616578  0.424184
4                          0.320173  0.468517            0.412261  0.542356
5                          0.500759  0.472259            0.468105  0.610215

                                min
                    last_evaluation            satisfaction_level
left                              0         1                   0         1
time_spend_company
1                          0.011188  0.579060            0.072963  0.504767
2                          0.008818  0.116667            0.022334  0.104738
3                          0.290137  0.071113            0.322621  0.038693
4                          0.013239  0.229026            0.003464  0.094058
5                          0.019908  0.039934            0.012394  0.057485
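Since the question also mentions pivot_table: the same layout can probably be obtained with a single pivot_table call, because passing a list to aggfunc puts the function name on the outermost column level. The sketch below uses a small made-up dataframe, and the names res_pivot and res_groupby are only for illustration; the built-in string aggregates stand in for the asker's own functions.

```python
import pandas as pd

# Small hypothetical dataset with the same column names as in the question.
df = pd.DataFrame({
    "time_spend_company": [2, 2, 2, 2, 7, 7, 7, 7],
    "left":               [0, 0, 1, 1, 0, 0, 1, 1],
    "last_evaluation":    [0.5, 0.9, 0.4, 0.8, 0.6, 0.7, 0.3, 0.5],
    "satisfaction_level": [0.2, 0.6, 0.1, 0.3, 0.9, 0.7, 0.4, 0.8],
})
cols = ["last_evaluation", "satisfaction_level"]

# pivot_table: index -> rows, columns -> innermost column level,
# values -> middle column level, aggfunc list -> outermost column level.
res_pivot = df.pivot_table(
    index="time_spend_company",
    columns="left",
    values=cols,
    aggfunc=["max", "mean", "min"],
)

# The groupby/unstack chain from above, for comparison.
res_groupby = (
    df.groupby(["time_spend_company", "left"])[cols]
    .agg(["max", "min", "mean"]).unstack()
    .swaplevel(0, 1, axis=1).sort_index(axis=1)
)

# Check whether both approaches produce the same column labels.
print(res_pivot.columns.tolist() == res_groupby.columns.tolist())
print(res_pivot)
```

Which of the two reads better is mostly a matter of taste; the groupby version makes each reshaping step explicit, while pivot_table hides them behind its arguments.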