Comparing three data frames to evaluate multiple criteria
Question:
I have three dataframes:
- ob (Orderbook) – an orderbook containing Part Numbers, the week they are due and the hours it takes to build them.

Part Number | Due Week | Build Hours |
---|---|---|
A | 2022-46 | 4 |
A | 2022-46 | 5 |
B | 2022-46 | 8 |
C | 2022-47 | 1.6 |
- osm (Operator Skill Matrix) – a skills matrix containing operator names and the part numbers they can build.

Operator | Part number |
---|---|
Mr.One | A |
Mr.One | B |
Mr.Two | A |
Mr.Two | B |
Mrs. Three | C |
- ah (Available Hours) – a list containing how many hours an operator can work in a given week.

Operator | YYYYWW | Hours |
---|---|---|
Mr.One | 2022-45 | 40 |
Mr.One | 2022-46 | 35 |
Mr.Two | 2022-46 | 37 |
Mr.Two | 2022-47 | 39 |
Mrs. Three | 2022-47 | 40 |
Mrs. Three | 2022-48 | 45 |
I am trying to work out, for each week, whether there are enough operators with the right skills working enough hours to complete all of the orders on the orderbook, and if not, to identify the orders that can't be completed.
Step by step it would look like this:
- Take the part number of the first row of the orderbook.
- Search the skills matrix to find a list of operators who can build that part.
- Search the hours list and check whether those operators have any hours available for the week the order is due.
- If an operator has hours available, add their name to that row of the orderbook.
- Subtract the Build Hours in the orderbook from that operator's available hours in the Available Hours df.
- Repeat this for each row in the orderbook until all orders have a name against them or there are no available hours left.
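The steps above can be sketched on the sample data as follows. This is a minimal sketch, not a definitive implementation: the function name `allocate` and the dictionaries `skills` and `remaining` are hypothetical, and it assumes an order is only given to an operator whose remaining hours cover the whole build, taking the first such operator in skills-matrix order.

```python
import pandas as pd

# Sample data copied from the tables above.
ob = pd.DataFrame({
    "Part Number": ["A", "A", "B", "C"],
    "Due Week": ["2022-46", "2022-46", "2022-46", "2022-47"],
    "Build Hours": [4, 5, 8, 1.6],
})
osm = pd.DataFrame({
    "Operator": ["Mr.One", "Mr.One", "Mr.Two", "Mr.Two", "Mrs. Three"],
    "Part number": ["A", "B", "A", "B", "C"],
})
ah = pd.DataFrame({
    "Operator": ["Mr.One", "Mr.One", "Mr.Two", "Mr.Two", "Mrs. Three", "Mrs. Three"],
    "YYYYWW": ["2022-45", "2022-46", "2022-46", "2022-47", "2022-47", "2022-48"],
    "Hours": [40, 35, 37, 39, 40, 45],
})


def allocate(ob, osm, ah):
    """First-fit allocation (hypothetical helper, assumption: whole order must fit)."""
    # part number -> list of operators who can build it
    skills = osm.groupby("Part number")["Operator"].apply(list).to_dict()
    # (operator, week) -> hours still available
    remaining = {
        (op, wk): float(h)
        for op, wk, h in zip(ah["Operator"], ah["YYYYWW"], ah["Hours"])
    }
    assigned = []
    for part, week, hours in zip(ob["Part Number"], ob["Due Week"], ob["Build Hours"]):
        chosen = None  # stays None when nobody can take the order
        for op in skills.get(part, []):
            if remaining.get((op, week), 0.0) >= hours:
                remaining[(op, week)] -= hours
                chosen = op
                break
        assigned.append(chosen)
    return ob.assign(Operator=assigned), remaining


result, remaining = allocate(ob, osm, ah)
print(result)
```

Each order costs one dictionary lookup per skilled operator instead of a scan over every row of the other two frames. Note that on the sample data a strict first-fit rule gives order B to Mr.One, who still has 26 hours left in 2022-46 after the two A orders, so any load-balancing between operators would need an extra rule.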
The only thing I could think to try was a bunch of nested for loops, but as there are thousands of rows it takes ~45 minutes to complete one iteration, and the whole thing would take days if not weeks.
# for each row in the orderbook
for i, rowi in ob_sum_hours.iterrows():
    # for each row in the operator skill matrix
    for j, rowj in osm.iterrows():
        # for each row in the available operator hours
        for y, rowy in aoh.iterrows():
            if (rowi['Material'] == rowj['MATERIAL'] and rowi['ProdYYYYWW'] == rowy['YYYYWW']
                    and rowj['Operator'] == rowy['Operator'] and rowy['Hours'] > 0):
                # write through .loc: iterrows() yields copies, so assigning to rowy/rowi is lost
                aoh.loc[y, 'Hours'] -= rowi['PlanHrs']
                ob_sum_hours.loc[i, 'HoursAllocated'] = rowj['Operator']
The final result would look like this:
Part Number | Due Week | Build Hours | Operator |
---|---|---|---|
A | 2022-46 | 4 | Mr.One |
A | 2022-46 | 5 | Mr.One |
B | 2022-46 | 8 | Mr.Two |
C | 2022-47 | 1.6 | Mrs.Three |
Is there a better way to achieve this?
Answers:
Done with a single loop plus apply on each row.
Orderbook.groupby(Orderbook.index) groups by index, so my_func is called once per row; this is still better than the nested loops.
In 'aaa' we get the unique operators who can build the part. In 'bbb' we filter Avaliable by 'YYYYWW', by 'Operator' (using isin against those operators) and by 'Hours' greater than 0. Then, looping over the 'bbb' indices, we compute the hours that would remain ('ava'); if 'ava' is zero or more, we write the values back with explicit loc indexing and stop.
import pandas as pd

Orderbook = pd.read_csv('Orderbook.csv', header=0)
Operator = pd.read_csv('Operator.csv', header=0)
Avaliable = pd.read_csv('Avaliable.csv', header=0)
Avaliable['Hours'] = Avaliable['Hours'].astype(float)  # build hours can be fractional
Orderbook['Operator'] = 'no'  # placeholder for orders nobody can take

def my_func(x):
    # x is a one-row group of Orderbook
    # operators who can build this part
    aaa = Operator.loc[Operator['Part number'] == x['Part Number'].values[0], 'Operator'].unique()
    # their rows in Avaliable for the due week, with hours left
    bbb = Avaliable[(Avaliable['YYYYWW'] == x['Due Week'].values[0]) &
                    (Avaliable['Operator'].isin(aaa)) & (Avaliable['Hours'] > 0)]
    for i in bbb.index:
        # hours this operator would have left after taking the order
        ava = Avaliable.loc[i, 'Hours'] - x['Build Hours'].values[0]
        if ava >= 0:
            Avaliable.loc[i, 'Hours'] = ava
            Orderbook.loc[x.index, 'Operator'] = Avaliable.loc[i, 'Operator']
            break  # stop at the first operator with enough hours

Orderbook.groupby(Orderbook.index).apply(my_func)
print(Orderbook)
print(Avaliable)
Update 18.11.2022
I did it without an explicit loop inside the function, but it needs checking; if you find something incorrect, please let me know. You can also measure the exact processing time by putting this at the beginning:
import datetime
now = datetime.datetime.now()
and printing the elapsed time at the end:
time_ = datetime.datetime.now() - now
print('elapsed time', time_)
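As an alternative to datetime, the standard library's `time.perf_counter` is a monotonic clock intended for measuring short durations. A minimal sketch, where the sum is only a stand-in for the allocation code being timed:

```python
import time

start = time.perf_counter()  # monotonic, high-resolution clock
total = sum(i * i for i in range(100_000))  # stand-in for the code being timed
elapsed = time.perf_counter() - start  # elapsed seconds as a float
print(f"elapsed time: {elapsed:.4f} s")
```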
The code:
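Orderbook = pd.read_csv('Orderbook.csv', header=0)
Operator = pd.read_csv('Operator.csv', header=0)
Avaliable = pd.read_csv('Avaliable.csv', header=0)
Avaliable['Hours'] = Avaliable['Hours'].astype(float)  # build hours can be fractional
Orderbook['Operator'] = 'no'  # placeholder for orders nobody can take
# for each order, the operators who can build its part
# (assumes Orderbook has the default 0..n-1 index)
aaa = [Operator.loc[Operator['Part number'] == Orderbook.loc[i, 'Part Number'], 'Operator'].unique()
       for i in range(len(Orderbook))]

def my_func(x):
    bbb = Avaliable[(Avaliable['YYYYWW'] == x['Due Week'].values[0]) &
                    (Avaliable['Operator'].isin(aaa[x.index[0]])) & (Avaliable['Hours'] > 0)]
    # hours each candidate would have left after taking the order
    fff = Avaliable.loc[bbb.index, 'Hours'] - x['Build Hours'].values[0]
    ind = fff[fff.ge(0)].index
    if len(ind) == 0:
        return  # nobody has enough hours; the order keeps the 'no' placeholder
    Avaliable.loc[ind[0], 'Hours'] = fff[ind[0]]
    Orderbook.loc[x.index, 'Operator'] = Avaliable.loc[ind[0], 'Operator']

Orderbook.groupby(Orderbook.index).apply(my_func)
print(Orderbook)
print(Avaliable)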
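Since the question also asks which orders cannot be completed, a vectorised pre-check with two merges can flag orders that have no chance before any allocation runs. This is only a sketch under assumptions: the names `pairs`, `best`, `check` and the `Feasible` column are hypothetical, the fifth order (C due in 2022-46) is invented to show a failing case, and the test is a necessary condition only, because it ignores hours consumed by other orders.

```python
import pandas as pd

# Sample data from the question, plus one invented order (C due in 2022-46).
ob = pd.DataFrame({
    "Part Number": ["A", "A", "B", "C", "C"],
    "Due Week": ["2022-46", "2022-46", "2022-46", "2022-47", "2022-46"],
    "Build Hours": [4, 5, 8, 1.6, 2],
})
osm = pd.DataFrame({
    "Operator": ["Mr.One", "Mr.One", "Mr.Two", "Mr.Two", "Mrs. Three"],
    "Part number": ["A", "B", "A", "B", "C"],
})
ah = pd.DataFrame({
    "Operator": ["Mr.One", "Mr.One", "Mr.Two", "Mr.Two", "Mrs. Three", "Mrs. Three"],
    "YYYYWW": ["2022-45", "2022-46", "2022-46", "2022-47", "2022-47", "2022-48"],
    "Hours": [40, 35, 37, 39, 40, 45],
})

# Pair every order with every operator skilled in its part.
pairs = ob.reset_index().merge(
    osm, left_on="Part Number", right_on="Part number", how="left"
)
# Attach that operator's hours in the due week (NaN when there is no such row).
pairs = pairs.merge(
    ah, left_on=["Operator", "Due Week"], right_on=["Operator", "YYYYWW"], how="left"
)
# Largest number of hours any skilled operator has in the due week, per order.
best = pairs.groupby("index")["Hours"].max()
# An order is flagged infeasible when no single operator could cover it at all.
check = ob.assign(Feasible=best.reindex(ob.index).ge(ob["Build Hours"]))
print(check)
```

Orders flagged `False` here can be reported immediately; orders flagged `True` still have to go through the allocation itself, since several of them may compete for the same operator's hours.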