Quick way to extend a set if we know elements are unique

Question:

I am performing multiple iterations of the type:

masterSet=masterSet.union(setA)

As the set grows the length of time taken to perform these operations is growing (as one would expect, I guess).

I expect that the time is taken up checking whether each element of setA is already in masterSet?

My question is: if I KNOW that masterSet does not already contain any of the elements in setA, can I do this more quickly?

[UPDATE]

Given that this question is still attracting views I thought I would clear up a few of the things from the comments and answers below:

While iterating, there were many iterations where I knew setA would be distinct from masterSet because of how it was constructed (without having to run any checks), but in a few iterations I needed the uniqueness check.

I wondered if there was a way to ‘tell’ the masterSet.union() procedure not to bother with the uniqueness check this time around: I know this set is distinct from masterSet, so just add these elements quickly, trusting the programmer’s assertion that they are definitely distinct. Perhaps through calling some different “.unionWithDistinctSet()” procedure or something.

I think the responses have suggested that this isn’t possible (and that set operations should really be quick enough anyway), but that I should use masterSet.update(setA) instead of union, as it’s slightly quicker still.

I have accepted the clearest response along those lines, resolved the issue I was having at the time and got on with my life, but I would still love to hear whether my hypothesised .unionWithDistinctSet() could ever exist.

Asked By: Stewart_R


Answers:

You can use set.update to update your master set in place. This saves allocating a new set every time, so it should be a little faster than set.union.

>>> s = set(range(3))
>>> s.update(range(4))
>>> s
set([0, 1, 2, 3])

Of course, if you’re doing this in a loop:

masterSet = set()
for setA in iterable:
    masterSet = masterSet.union(setA)

You might get a performance boost by doing something like:

masterSet = set().union(*iterable)

Ultimately, membership testing of a set is O(1) (in the average case), so testing if the element is already contained in the set isn’t really a big performance hit.
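
As a rough sketch of the three approaches above side by side (here `chunks` is a hypothetical list of sets standing in for `iterable`, and the snake_case names are illustrative only):

```python
# Hypothetical input: ten small sets standing in for `iterable`.
chunks = [set(range(i, i + 5)) for i in range(0, 50, 5)]

# 1. Rebinding with union(): a new set is allocated on every iteration.
master_union = set()
for set_a in chunks:
    master_union = master_union.union(set_a)

# 2. In-place update(): the same set object grows.
master_update = set()
for set_a in chunks:
    master_update.update(set_a)

# 3. A single call: union() accepts any number of iterables.
master_once = set().union(*chunks)
```

All three end up with the same contents; they differ only in how many intermediate sets get created along the way.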

Answered By: mgilson

As mgilson points out, you can use update to update a set in-place from another set. That actually works out slightly quicker:

def union():
    i = set(range(10000))
    j = set(range(5000, 15000))
    return i.union(j)

def update():
    i = set(range(10000))
    j = set(range(5000, 15000))
    i.update(j)
    return i

import timeit

timeit.Timer(union).timeit(10000)   # 10.351907968521118
timeit.Timer(update).timeit(10000)  # 8.83384895324707

Answered By: Daniel Roseman

If you know your elements are unique, a set is not necessarily the best structure.

A simple list is way faster to extend.

masterList = list(masterSet)
masterList.extend(setA)
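
A minimal sketch of that idea used across several iterations, assuming the incoming sets really are disjoint (`chunks` is a hypothetical stand-in for the sets produced by each iteration). Nothing here checks uniqueness, so a wrong assumption would silently leave duplicates in the list:

```python
# Hypothetical disjoint inputs, one per iteration.
chunks = [{1, 2}, {3, 4}, {5}]

# Accumulate in a list: extend() appends without any hashing or lookups.
master_list = []
for set_a in chunks:
    master_list.extend(set_a)

# If set semantics are needed at the end, convert once.
master_set = set(master_list)
```

The final conversion still hashes every element once, so this mainly helps when the set is not needed until all the data has been collected.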

Answered By: njzk2

For sure, forgoing this check could be a big saving when the __eq__(..) method is very expensive. In the CPython implementation, __eq__(..) is called with every element already in the set that hashes to the same number. (Reference: source code for set.)
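
To illustrate the point about `__eq__` being called on hash collisions, here is a small sketch. The `Key` class and the `count_eq_calls` helper are hypothetical and exist only for this demonstration; the behaviour shown is that of CPython and may differ on other implementations:

```python
class Key:
    """Hypothetical key type that counts how often __eq__ runs."""
    eq_calls = 0

    def __init__(self, value, hash_value):
        self.value = value
        self.hash_value = hash_value

    def __hash__(self):
        return self.hash_value

    def __eq__(self, other):
        Key.eq_calls += 1
        return self.value == other.value


def count_eq_calls(keys):
    """Add every key to a fresh set and report how many __eq__ calls occurred."""
    Key.eq_calls = 0
    s = set()
    for k in keys:
        s.add(k)
    return Key.eq_calls


n = 100
# Every key shares one hash value, so each add must compare against earlier keys.
colliding = count_eq_calls([Key(i, 42) for i in range(n)])
# Every key has its own hash value, so comparisons are rarely or never needed.
distinct = count_eq_calls([Key(i, i) for i in range(n)])
print(colliding, distinct)
```

With well-distributed hashes the equality check is seldom reached at all, which is why skipping it would usually save little.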

However, this functionality will never exist, not in a million years, because it opens up another way to violate the integrity of a set. The trouble associated with that far outweighs the (typically negligible) performance gain. If this does turn out to be a performance bottleneck, it’s not hard to write a C++ extension and use its STL <set>, which should be faster by one or more orders of magnitude.

Answered By: Evgeni Sergeev