Iterating through a data frame and aggregating sums
Question:
Hope you can help me.
I have the following dataframe named df (see table below).
I have two sales reps: sales_rep_1 and sales_rep_2.
I need to assign one of them to each letter in the table (A to Z).
For each row I would like to assign either sales_rep_1 or sales_rep_2 to a new column, df['Rep'].
The only condition is that the sum of each column (DM Bookings, Revenue and AtL Opps) needs to be split as evenly as possible between the two sales reps.
Letter | DM Bookings | Revenue | AtL Opps |
---|---|---|---|
A | 6.0 | 42506.0 | 34 |
B | 2.0 | 21055.0 | 41 |
C | 1.0 | 6307.0 | 36 |
D | 0.0 | 8254.0 | 14 |
E | 1.0 | 29878.0 | 38 |
F | 0.0 | 6911.0 | 10 |
G | 1.0 | 6735.0 | 19 |
H | 0.0 | 0.0 | 80 |
Is there a way to do this? The result does not need to be perfect or exact.
Any help would be great.
Thanks in advance
Answers:
This is an optimization problem. Does SciPy have a ready-made solution for it? Close, but not quite: it has linear programming solvers, which are very general but not that easy to use.
With thanks to the SO answer "Linear sum assignment (SciPy) and balancing the costs", we can use a mixed-integer linear programming solver (PuLP here) to set up our conditions and solve for the best assignment with two workers. Here we just adapt that answer by @joni.
We'll need to define a cost per task (here, per row) so that the solver can balance the total task cost between the reps, or "workers".
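For reproducibility, here is one way to rebuild the question's table as a pandas DataFrame. This is only a sketch: the values are copied from the table in the question, and the dtypes (floats for DM Bookings and Revenue, integers for AtL Opps) are an assumption based on how the numbers are printed there.

```python
import pandas as pd

# Values copied from the table in the question; dtypes are assumed
df = pd.DataFrame(
    {
        "Letter": list("ABCDEFGH"),
        "DM Bookings": [6.0, 2.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
        "Revenue": [42506.0, 21055.0, 6307.0, 8254.0, 29878.0, 6911.0, 6735.0, 0.0],
        "AtL Opps": [34, 41, 36, 14, 38, 10, 19, 80],
    }
)

# Column totals that the two reps should share as evenly as possible
print(df[["DM Bookings", "Revenue", "AtL Opps"]].sum())
```

The totals are 11 DM Bookings, 121646 Revenue and 272 AtL Opps, so a perfectly even split would give each rep half of each.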
import pulp
import numpy as np
import pandas as pd  # used for the evaluation at the end
workers = ["A", "B"]
n_workers = len(workers)
n_tasks = len(df)
columns = ["Revenue", "AtL Opps", "DM Bookings"]
Rescale each column by its maximum, then weight the columns: matrix-multiplying by [1, 1/3, 1/9] gives decreasing weight to the columns in the order they were listed (Revenue, AtL Opps, DM Bookings).
task_weights = (df[columns] / df[columns].max(axis=0))
## combine column weights by multiplying them by [1, 1/3, 1/9] to de-prioritize later columns
task_weight = task_weights @ 3.**(-np.arange(len(columns)))
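To make the weighting concrete, the sketch below recomputes the per-row cost on the question's data, outside of the solver. The DataFrame is rebuilt from the table in the question (an assumption about how df was created); the scaling and the [1, 1/3, 1/9] weights are the ones described above.

```python
import numpy as np
import pandas as pd

# Rebuilt from the question's table (assumed construction of df)
df = pd.DataFrame(
    {
        "Letter": list("ABCDEFGH"),
        "DM Bookings": [6.0, 2.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
        "Revenue": [42506.0, 21055.0, 6307.0, 8254.0, 29878.0, 6911.0, 6735.0, 0.0],
        "AtL Opps": [34, 41, 36, 14, 38, 10, 19, 80],
    }
)

columns = ["Revenue", "AtL Opps", "DM Bookings"]
scaled = df[columns] / df[columns].max(axis=0)    # every column now lies in [0, 1]
col_weights = 3.0 ** (-np.arange(len(columns)))   # array([1, 1/3, 1/9])
task_weight = scaled @ col_weights                # one combined cost per row

print(task_weight.round(3))
print("total:", round(task_weight.sum(), 3))
```

Row A, for example, gets 1 + (34/80)/3 + (6/6)/9 ≈ 1.253. The total over all rows is about 4.199, so an ideal split would give each rep a scaled cost of roughly 2.10.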
Then define the constrained optimization problem
# actual cost matrix - one row per worker
# in our case, cost is the same regardless of which rep handles it
c = np.stack([task_weight] * len(workers), axis=0)
# create the model
mdl = pulp.LpProblem("even_assignment")
# decision variables
x = {}
for w in workers:
    for t in range(n_tasks):
        x[w, t] = pulp.LpVariable(f"x[{w}, {t}]", cat="Binary")
max_val = pulp.LpVariable("max_val", cat="Continuous")
min_val = pulp.LpVariable("min_val", cat="Continuous")
# objective: minimize the difference between the maximum and the minimum
# costs per worker
mdl.setObjective(max_val - min_val)
# constraint: each task can only be assigned once
for task in range(n_tasks):
    mdl.addConstraint(sum(x[w, task] for w in workers) == 1)
# constraint: evenly distribute the tasks
for i_w, w in enumerate(workers):
    assignment_cost = sum(x[w, task] * c[i_w, task] for task in range(n_tasks))
    mdl.addConstraint(assignment_cost <= max_val)
    mdl.addConstraint(assignment_cost >= min_val)
# solve the problem
mdl.solve()
df_result = df.copy()
# Output
for i_w, w in enumerate(workers):
    worker_cost = sum(x[w, t].varValue*c[i_w, t] for t in range(n_tasks))
    print(f"scaled cost for rep {w}: {worker_cost:.2f}")
    worker_task = [x[w, t].varValue for t in range(n_tasks)]
    df_result[w] = worker_task
print()
df_result = (df_result.assign(Rep = lambda df: np.select([df[w] == 1 for w in workers], [w.upper() for w in workers], default=None))
             .drop(columns=workers))
df_result
scaled cost for rep A: 2.10
scaled cost for rep B: 2.09
Letter DM Bookings Revenue AtL Opps Rep
0 A 6.0 42506.0 34 B
1 B 2.0 21055.0 41 A
2 C 1.0 6307.0 36 A
3 D 0.0 8254.0 14 B
4 E 1.0 29878.0 38 A
5 F 0.0 6911.0 10 A
6 G 1.0 6735.0 19 B
7 H 0.0 0.0 80 B
Evaluation time: how even is the split?
pd.concat([df_result.groupby("Rep").size().rename("count"), df_result.groupby("Rep").sum(numeric_only=True)], axis=1)
count DM Bookings Revenue AtL Opps
Rep
A 4 4.0 64151.0 125
B 4 7.0 57495.0 147
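With only eight rows there are just 2^8 = 256 possible assignments, so the split can be sanity-checked by exhaustive enumeration. The following is a plain-Python sketch, not part of the original answer: it uses the same scaling and [1, 1/3, 1/9] weights, the data is copied from the question's table, and it does not scale beyond a few dozen rows.

```python
from itertools import product

# (Letter, DM Bookings, Revenue, AtL Opps), copied from the question's table
rows = [
    ("A", 6.0, 42506.0, 34),
    ("B", 2.0, 21055.0, 41),
    ("C", 1.0, 6307.0, 36),
    ("D", 0.0, 8254.0, 14),
    ("E", 1.0, 29878.0, 38),
    ("F", 0.0, 6911.0, 10),
    ("G", 1.0, 6735.0, 19),
    ("H", 0.0, 0.0, 80),
]
max_dm = max(r[1] for r in rows)
max_rev = max(r[2] for r in rows)
max_atl = max(r[3] for r in rows)

# Same per-row cost as above: scale by column maximum, weight by [1, 1/3, 1/9]
weights = [rev / max_rev + (atl / max_atl) / 3 + (dm / max_dm) / 9
           for _, dm, rev, atl in rows]
total = sum(weights)

best_diff, best_mask = float("inf"), None
for mask in product((0, 1), repeat=len(rows)):
    cost_b = sum(w for w, m in zip(weights, mask) if m)
    diff = abs(total - 2 * cost_b)  # equals |cost_a - cost_b|
    if diff < best_diff:
        best_diff, best_mask = diff, mask

# Which group is called "A" or "B" is arbitrary; only the partition matters
rep_a = [r[0] for r, m in zip(rows, best_mask) if not m]
rep_b = [r[0] for r, m in zip(rows, best_mask) if m]
print(f"smallest gap in scaled cost: {best_diff:.4f}")
print("group 1:", rep_a, "group 2:", rep_b)
```

The gap found this way can be compared with the difference between the two scaled costs printed by the solver; the enumeration can never do worse than the solver's split.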
Different weights per column will adjust the split differently. For example, using [1, 0.1, 0.01] would have prioritized Revenue more.
If balancing the number of items assigned to each rep is also wanted, then add a synthetic column with all ones and give it an appropriate priority and weight.
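As a sketch of that last suggestion, a synthetic column of ones can be added before computing the per-row cost. The column name Count and its weight of 1/3 are assumptions chosen for illustration; the DataFrame is rebuilt from the question's table.

```python
import numpy as np
import pandas as pd

# Rebuilt from the question's table (assumed construction of df)
df = pd.DataFrame(
    {
        "Letter": list("ABCDEFGH"),
        "DM Bookings": [6.0, 2.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
        "Revenue": [42506.0, 21055.0, 6307.0, 8254.0, 29878.0, 6911.0, 6735.0, 0.0],
        "AtL Opps": [34, 41, 36, 14, 38, 10, 19, 80],
    }
)

# "Count" is a hypothetical helper column: every row contributes 1
df_bal = df.assign(Count=1)
columns = ["Revenue", "AtL Opps", "DM Bookings", "Count"]
scaled = df_bal[columns] / df_bal[columns].max(axis=0)

# Illustrative weights: row count is given the same priority as AtL Opps
col_weights = np.array([1, 1 / 3, 1 / 9, 1 / 3])
task_weight = scaled @ col_weights
print(task_weight.round(3))
```

The resulting task_weight can be passed to the same model in place of the original one. A larger weight on Count pushes the solver toward equal row counts at the expense of the other columns.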