|= vs update with a subclass of collections.abc.Set

Question:

I need to subclass set so I subclassed collections.abc.Set, as suggested here: https://stackoverflow.com/a/6698723/211858.
Please find my simple implementation below.
It essentially wraps a set of integers.

I generate list of 10,000 MySet instances consisting of 100 random integers.
I would like to take the union of these wrapped sets.
I have two implementations below.
For some reason, the first using update is very fast, yet the second using |= is slow.
The tqdm wrapper is to conduct nonrigorous benchmarks.
Is there some way to correct the definition of the class to fix this performance issue?
Thanks!

I’m on Python 3.10.5.

from collections.abc import Iterable, Iterator, Set

from tqdm import tqdm


class MySet(Set):
    def __init__(self, integers: Iterable[int]) -> None:
        self.data: set[int] = set(integers)

    def __len__(self) -> int:
        return len(self.data)

    def __iter__(self) -> Iterator[int]:
        return iter(self.data)

    def __contains__(self, x: object) -> bool:
        if isinstance(x, int):
            return x in self.data
        else:
            raise NotImplemented

    def my_func(self):
        ...

    def my_other_func(self):
        ...


# %%
import random

# Make some mock data
my_sets: list[MySet] = [
    MySet(random.sample(range(1_000_000), 100)) for _ in range(10_000)
]

# %%
universe: set[int] = set()
universe2: set[int] = set()

# %%
# Nearly instant
for my_set in tqdm(my_sets):
    universe.update(my_set)

# %%
# Takes well over 5 minutes on my laptop
for my_set in tqdm(my_sets):
    universe2 |= my_set

Asked By: hwong557

||

Answers:

Conclusion: The way to add the least code is to implement the __ior__ method.

What happens when there is no implementation:

  1. When binary inplace or operation is performed for the first time, because universe2 is set and my_set is MySet, set cannot recognize the MySet class, so the binary inplace or operation will degenerate into a binary or operation.
  2. As in point 1, the binary or operation of set will fail, so Python will try to call the __ror__ method of MySet.
  3. Because MySet has no __ror__ method, Python will fall back to the collections.abc.Set. The __ror__ method of it is the same as the __or__ method and returns the result of type MySet. You can find it in the _collections_abc.py file:
class Set(Collection):
    ...
    @classmethod
    def _from_iterable(cls, it):
        '''Construct an instance of the class from any iterable input.

        Must override this method if the class constructor signature
        does not accept an iterable for an input.
        '''
        return cls(it)
    ...
    def __or__(self, other):
        if not isinstance(other, Iterable):
            return NotImplemented
        chain = (e for s in (self, other) for e in s)
        return self._from_iterable(chain)

    __ror__ = __or__
    ...
  1. For the subsequent binary inplace or operation, because the first __ror__ operation changes universe2 to MySet type and neither MySet nor collections.abc.Set has the __ior__ method, so the collections.abc.Set.__or__ function will be called repeatly, and a copy will be made per loop. This is the root cause of the slow speed of the second loop. Therefore, as long as the __ior__ method is implemented to avoid copying of subsequent operations, the performance will be greatly improved.

Suggestions for better implementation: The abstract class collections.abc.Set represents an immutable set. For this reason, it does not implement the inplace operation method. If you need your subclass to support inplace operation, you should consider inheriting collections.abc.MutableSet and implementing the add and discard abstract methods. Mutableset implements the inplace operation methods such as __ior__ through these two abstract methods (of course, it is still not efficient compared with the built-in set, so it is better to implement them by yourself):

class MutableSet(Set):
    ...
    def __ior__(self, it):
        for value in it:
            self.add(value)
        return self
    ...
Answered By: Mechanic Pig
Categories: questions Tags:
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.