How can I reduce the time needed to run my code and what is the cause of the slow speed?
Question:
The code works well on small datasets, as illustrated in the example code, but the data I have to process are two lists, picker and order, each of length 1592798, with 288 and 528510 unique values respectively.
For the sake of the example I have replaced these with two short lists, but the concept is the same. I am wondering if the long time required to run the code is due to the sheer amount of data, or if the code is inefficient at processing the data and can be improved.
The purpose of the code is to group all pickers associated with a unique order into a list (hold) within a list (pairs). The order in which the elements occur in the pairs list must be determined by the first entry in each element; for instance, [1, 'a'] must come before [2, 'b', 'k'] because 1 is smaller than 2. Within an element such as [2, 'b', 'k'], the order of 'b' and 'k' is determined by which of them occurs first in the list picker: 'b' comes before 'k' because 'b' has a lower index.
The current code looks like this:
order = [ 1, 2, 3, 4, 1, 5, 3, 6, 7, 1, 8, 9, 4, 4, 2, 8, 4, 4, 2 ]
picker = ['a', 'b', 'c', 'd', 'a', 'e', 'c', 'f', 'g', 'a', 'h', 'i', 'j', 'k', 'b', 'h', 'j', 'j', 'k']
pairs = []
order_picker = list(zip(order, picker))
for x in set(order):
    hold = []
    hold.append(x)
    for i in range(len(order_picker)):
        if x == list(order_picker[i])[0]:
            if list(order_picker[i])[1] not in hold:
                hold.append(list(order_picker[i])[1])
    pairs.append(hold)
print(pairs)
The output from print(pairs):
>>> print(pairs)
[[1, 'a'], [2, 'b', 'k'], [3, 'c'], [4, 'd', 'j', 'k'], [5, 'e'], [6, 'f'], [7, 'g'], [8, 'h'], [9, 'i']]
The output must be in this format so that I can later write it to Excel.
I suspect that the long run time comes from scanning the entire list of length 1592798 each time a new value must be identified, but I have been unable to create a faster solution. How can I reduce the time required to run the code?
Answers:
Perhaps you can speed up your code by looping over the elements in picker and order only once.
In the example I made, I zip the two lists and use a defaultdict of sets to collect each element. Finally, the dictionary is converted to your desired output format:
from collections import defaultdict
order = [1, 2, 3, 4, 1, 5, 3, 6, 7, 1, 8, 9, 4, 4, 2, 8, 4, 4, 2]
picker = ['a', 'b', 'c', 'd', 'a', 'e', 'c', 'f', 'g', 'a', 'h', 'i', 'j', 'k', 'b', 'h', 'j', 'j', 'k']
pairs = defaultdict(set)
for o, p in zip(order, picker):
    pairs[o].add(p)
pairs = [[k, *v] for k, v in pairs.items()]
print(pairs)
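One caveat with the set-based version: Python sets do not preserve insertion order, so the pickers within each order may come out in a different sequence than they first appear in picker, and the orders themselves come out in first-seen rather than sorted order. Below is a minimal sketch of an order-preserving variant, assuming Python 3.7+ (where dicts keep insertion order). The name grouped is only illustrative.

```python
from collections import defaultdict

order = [1, 2, 3, 4, 1, 5, 3, 6, 7, 1, 8, 9, 4, 4, 2, 8, 4, 4, 2]
picker = ['a', 'b', 'c', 'd', 'a', 'e', 'c', 'f', 'g', 'a', 'h', 'i', 'j', 'k', 'b', 'h', 'j', 'j', 'k']

# Inner dicts act as insertion-ordered sets (the values are unused).
grouped = defaultdict(dict)
for o, p in zip(order, picker):
    grouped[o][p] = None

# Sort by order number, then unpack each inner dict's keys in first-seen order.
pairs = [[o, *grouped[o]] for o in sorted(grouped)]
print(pairs)
```

For the sample lists this prints the same nested list as the code in the question.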
You can use a dictionary to store the orders and their associated pickers and solve it in O(n) complexity instead of O(n^2):
order = [1, 2, 3, 4, 1, 5, 3, 6, 7, 1, 8, 9, 4, 4, 2, 8, 4, 4, 2]
picker = ['a', 'b', 'c', 'd', 'a', 'e', 'c', 'f', 'g', 'a', 'h', 'i', 'j', 'k', 'b', 'h', 'j', 'j', 'k']
pairs = []
order_picker = list(zip(order, picker))
orders_dict = {}
# Loop variables are named o and p so they do not overwrite the input lists.
for o, p in order_picker:
    if o in orders_dict:
        if p not in orders_dict[o]:
            orders_dict[o].append(p)
    else:
        orders_dict[o] = [p]
for o, pickers in orders_dict.items():
    pairs.append([o] + pickers)
print(pairs)
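To put the complexity remark in numbers: strictly speaking, the original nested loops do (number of unique orders) x (list length) inner iterations, because every unique order triggers a full scan. A rough back-of-the-envelope sketch using the sizes quoted in the question (the variable names are illustrative):

```python
n = 1592798             # length of order and picker (from the question)
unique_orders = 528510  # unique values in order (from the question)

# Original approach: one full scan of all n pairs per unique order.
nested_steps = unique_orders * n

# Dictionary approach: a single pass over the n pairs.
single_pass_steps = n

print(f"nested loops: about {nested_steps:.1e} inner iterations")
print(f"single pass:  about {single_pass_steps:.1e} iterations")
print(f"ratio: {nested_steps // single_pass_steps}")
```

That is on the order of 8e11 inner iterations against roughly 1.6e6, which suggests the slowness comes from the algorithm rather than from the amount of data alone.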
If your dataset is very large and performance is critical, you can consider using Pandas:
import pandas as pd
df = pd.DataFrame({'order': order, 'picker': picker})
# Each resulting row is [order, set_of_pickers]; sets do not keep the pickers' original sequence.
pairs = df.groupby('order')['picker'].apply(set).reset_index().values.tolist()
It takes long because you iterate multiple times over the same data: zip, for and for.
Try to optimize by iterating less; something like this produces the same output with only one for loop:
order = [ 1, 2, 3, 4, 1, 5, 3, 6, 7, 1, 8, 9, 4, 4, 2, 8, 4, 4, 2 ]
picker = ['a', 'b', 'c', 'd', 'a', 'e', 'c', 'f', 'g', 'a', 'h', 'i', 'j', 'k', 'b', 'h', 'j', 'j', 'k']
order_indexes = {}  # stores indexes of orders
pairs = []
for i in range(0, len(order)):
    order_item = order[i]
    picker_item = picker[i]
    if (order_item not in order_indexes):
        order_indexes[order_item] = len(pairs)  # the index it will be inserted in
        pairs.append([order_item])  # insertion of new order
    if (picker_item not in pairs[order_indexes[order_item]]):
        pairs[order_indexes[order_item]].append(picker_item)  # add picker if not already present
print(pairs)
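Note that this version emits the orders in the sequence in which they first appear, which coincides with ascending order for the sample data but need not in general. A hedged sketch of restoring ascending order afterwards, using a small made-up input in which the orders arrive unsorted:

```python
# Hypothetical input in which order 3 appears before order 1.
order = [3, 1, 3, 2, 1]
picker = ['c', 'a', 'd', 'b', 'a']

order_indexes = {}  # maps each order to its row index in pairs
pairs = []
for order_item, picker_item in zip(order, picker):
    if order_item not in order_indexes:
        order_indexes[order_item] = len(pairs)
        pairs.append([order_item])
    row = pairs[order_indexes[order_item]]
    if picker_item not in row[1:]:  # skip the order number stored in slot 0
        row.append(picker_item)

# Rows are in first-seen order here: [[3, 'c', 'd'], [1, 'a'], [2, 'b']]
pairs.sort(key=lambda row: row[0])  # sort rows by their order number
print(pairs)  # [[1, 'a'], [2, 'b'], [3, 'c', 'd']]
```

After the sort, order_indexes no longer matches the row positions, so the sort belongs at the very end, once all pairs have been processed.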
Fast solution with the desired orders:
def pairs(order, picker):
    d = {o: {} for o in sorted(set(order))}
    for o, p in zip(order, picker):
        d[o][p] = None
    return [[o, *p] for o, p in d.items()]
order = [ 1, 2, 3, 4, 1, 5, 3, 6, 7, 1, 8, 9, 4, 4, 2, 8, 4, 4, 2 ]
picker = ['a', 'b', 'c', 'd', 'a', 'e', 'c', 'f', 'g', 'a', 'h', 'i', 'j', 'k', 'b', 'h', 'j', 'j', 'k']
print(pairs(order, picker))
Output (Attempt This Online!):
[[1, 'a'], [2, 'b', 'k'], [3, 'c'], [4, 'd', 'j', 'k'], [5, 'e'], [6, 'f'], [7, 'g'], [8, 'h'], [9, 'i']]
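To check on your own machine whether the slowdown comes from the algorithm rather than the data volume, a small timing comparison can help. The sketch below is only illustrative: the synthetic data, its sizes, and the function names pairs_nested and pairs_dict are made up, and the nested version iterates over sorted(set(order)) so that its row order is deterministic.

```python
import random
import timeit

def pairs_nested(order, picker):
    """The question's approach: rescan every pair for each unique order."""
    order_picker = list(zip(order, picker))
    result = []
    for x in sorted(set(order)):
        hold = [x]
        for o, p in order_picker:
            if o == x and p not in hold:
                hold.append(p)
        result.append(hold)
    return result

def pairs_dict(order, picker):
    """The single-pass approach: nested dicts used as ordered sets."""
    d = {o: {} for o in sorted(set(order))}
    for o, p in zip(order, picker):
        d[o][p] = None
    return [[o, *p] for o, p in d.items()]

# Synthetic data (made up for illustration, far smaller than the real lists).
random.seed(0)
n = 5_000
order = [random.randrange(1, 501) for _ in range(n)]
picker = [random.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(n)]

# Both approaches should agree before their timings are compared.
assert pairs_nested(order, picker) == pairs_dict(order, picker)

t_nested = timeit.timeit(lambda: pairs_nested(order, picker), number=1)
t_dict = timeit.timeit(lambda: pairs_dict(order, picker), number=1)
print(f"nested: {t_nested:.3f} s, dict: {t_dict:.3f} s")
```

The exact timings depend on the machine, but the gap should widen quickly as n and the number of unique orders grow.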