Performance problem: code works but takes a long time on a long list
Question:
Why is the following code considered inefficient in terms of time complexity, and how can I improve it? The membership test ia in n takes O(n) time, hence the problem.
What have I tried? I initially sorted n, a and b, but performance did not change.
Objective: find the value of h - m.
Note: len(a) always equals len(b).
n = [1, 5, 3]  # can have 100K+ items
a = set([3, 1])  # can have 50K+ items
b = set([5, 7])
h = 0
m = 0
for ia, ib in zip(a, b):
    if ia in n:
        h += 1
    if ib in n:
        m += 1
print(h - m)
Edit: I realized that it is insufficient to discuss only conceptual ideas, such as why the code is considered inefficient, without explicitly addressing time/space complexity. I have changed the question accordingly.
Answers:
Speculating, the if x in y test is the slowest part. It probably doesn’t help much to have a and b as sets; you’re just zipping and iterating over them. But if n were a set, the membership test would likely be faster.
It’s probably not necessary to zip, given that ia and ib never interact, but I doubt that introduces much overhead.
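One way to check this speculation is to time the same membership tests against a list and against a set. The sketch below is only illustrative: the sizes, the probes values and the count_hits helper are assumptions made for the demo, not part of the original code, and the absolute timings will vary by machine.

```python
import timeit

# Hypothetical sizes, chosen only so the demo finishes quickly.
n_list = list(range(20_000))
n_set = set(n_list)
probes = list(range(0, 40_000, 40))  # 1,000 probes, half of them absent from n


def count_hits(container):
    """Count how many probes are found in the given container."""
    return sum(1 for x in probes if x in container)


# Each call runs the same 1,000 membership tests; only the container differs.
list_seconds = timeit.timeit(lambda: count_hits(n_list), number=1)
set_seconds = timeit.timeit(lambda: count_hits(n_set), number=1)

print(f"list: {list_seconds:.4f}s  set: {set_seconds:.6f}s")
```

Both containers give the same count; only the time spent differs, and the gap should widen as n grows.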
Since n is a list, and it’s huge (100K+ items), each if WHATEVER in n: test does O(n) work, which means up to 100K+ equality checks whenever the value is absent or near the end of the list.
You basically have your types backwards here: you’re using sets for the things you iterate (where being a set saves you little aside from perhaps removing duplicates from your inputs) and a list for the thing you membership test (where O(n) containment checks on a large list are much more expensive than the average-case O(1) containment checks on a set of any size).
Assuming the elements of n are hashable, convert them to a set before the loop and use containment tests against the set:
n = [1, 5, 3]  # can have 100K+ items
nset = set(n)  # Cache set view of n
a = set([3, 1])  # can have 50K+ items
b = set([5, 7])
h = 0
m = 0
for ia, ib in zip(a, b):
    if ia in nset:  # Check against set in O(1)
        h += 1
    if ib in nset:  # Check against set in O(1)
        m += 1
print(h - m)
Note that zipping is doing nothing except possibly excluding some elements from being iterated at all; if len(a) != len(b), you’ll fail to check the elements beyond the length of the shorter set. If you want to count them all, the simplest solution is to split the loops, replacing the single loop with just:
h = sum(1 for ia in a if ia in nset) # sum(ia in nset for ia in a) also works, but it's somewhat slower/less intuitive
m = sum(1 for ib in b if ib in nset)
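The question states that len(a) always equals len(b), so the truncation only matters if that assumption ever stops holding. As a small illustration with made-up inputs of unequal length (the values and the h_zip/m_zip/h_all/m_all names are chosen just for this sketch), the zipped loop and the split counts disagree:

```python
nset = {1, 5, 3}
a = {3, 1, 5}  # three elements, all present in nset
b = {7}        # one element, absent from nset

# zip stops at the shorter input, so only one element of a is ever checked.
h_zip = 0
m_zip = 0
for ia, ib in zip(a, b):
    if ia in nset:
        h_zip += 1
    if ib in nset:
        m_zip += 1

# Separate passes visit every element of each set.
h_all = sum(1 for ia in a if ia in nset)
m_all = sum(1 for ib in b if ib in nset)

print(h_zip - m_zip)  # 1
print(h_all - m_all)  # 3
```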
Here’s an easy way, using set intersection (the & operator, which is equivalent to set.intersection for sets), without a for loop or the zip function:
n = [1, 5, 3] # can be with 100K+ items
a = {3, 1} # can be with 50K+ items
b = {5, 7}
nset = set(n) # cache set view of n
h = len(nset & a)
m = len(nset & b)
print(h - m)
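If this needs to be reused, the same idea can be wrapped in a small function. This is a sketch only: the name hit_difference is hypothetical, and it uses the set.intersection method instead of & so that a and b may be any iterables, not just sets.

```python
def hit_difference(n, a, b):
    """Return the number of distinct items of a found in n, minus the same for b."""
    nset = set(n)  # one O(len(n)) pass; the elements must be hashable
    # set.intersection accepts any iterable, so a and b need not be sets.
    return len(nset.intersection(a)) - len(nset.intersection(b))


print(hit_difference([1, 5, 3], {3, 1}, {5, 7}))  # 1
```

One design point to keep in mind: an intersection counts distinct values, so if a or b were lists containing duplicates, the result could differ from the loop-based count, which counts every occurrence.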