|= vs update with a subclass of collections.abc.Set
Question:
I need to subclass set, so I subclassed collections.abc.Set, as suggested here: https://stackoverflow.com/a/6698723/211858.
Please find my simple implementation below. It essentially wraps a set of integers.
I generate a list of 10,000 MySet instances, each consisting of 100 random integers. I would like to take the union of these wrapped sets, and I have two implementations below. For some reason, the first, using update, is very fast, yet the second, using |=, is slow. The tqdm wrapper is only there to conduct nonrigorous benchmarks.
Is there some way to correct the definition of the class to fix this performance issue?
Thanks!
I’m on Python 3.10.5.
from collections.abc import Iterable, Iterator, Set

from tqdm import tqdm


class MySet(Set):
    def __init__(self, integers: Iterable[int]) -> None:
        self.data: set[int] = set(integers)

    def __len__(self) -> int:
        return len(self.data)

    def __iter__(self) -> Iterator[int]:
        return iter(self.data)

    def __contains__(self, x: object) -> bool:
        if isinstance(x, int):
            return x in self.data
        else:
            # NotImplemented is not an exception, so `raise NotImplemented`
            # would itself fail with a TypeError; a non-int is simply absent.
            return False

    def my_func(self):
        ...

    def my_other_func(self):
        ...
# %%
import random

# Make some mock data
my_sets: list[MySet] = [
    MySet(random.sample(range(1_000_000), 100)) for _ in range(10_000)
]

# %%
universe: set[int] = set()
universe2: set[int] = set()

# %%
# Nearly instant
for my_set in tqdm(my_sets):
    universe.update(my_set)

# %%
# Takes well over 5 minutes on my laptop
for my_set in tqdm(my_sets):
    universe2 |= my_set
Answers:
Conclusion: The fix that requires the least code is to implement the __ior__ method on MySet.
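As a minimal sketch of that fix (not the only way to write it), the class from the question can gain an __ior__ that delegates to the wrapped built-in set. The body of __ior__ below is my own assumption about a reasonable implementation, and it assumes that mutating the wrapped set in place is acceptable even though collections.abc.Set models an immutable set. The small my_sets list is made-up data for illustration.

```python
from collections.abc import Iterable, Iterator, Set


class MySet(Set):
    def __init__(self, integers: Iterable[int]) -> None:
        self.data: set[int] = set(integers)

    def __len__(self) -> int:
        return len(self.data)

    def __iter__(self) -> Iterator[int]:
        return iter(self.data)

    def __contains__(self, x: object) -> bool:
        return isinstance(x, int) and x in self.data

    def __ior__(self, other):
        # Assumption: in-place mutation of the wrapped set is acceptable here.
        if not isinstance(other, Iterable):
            return NotImplemented
        self.data.update(other)  # delegate to the fast built-in set.update
        return self  # same object, so nothing is copied per iteration


# Tiny illustrative data set (not the 10,000 sets from the question).
my_sets = [MySet([1, 2]), MySet([2, 3]), MySet([4])]

universe2 = set()
for my_set in my_sets:
    universe2 |= my_set

print(sorted(universe2))  # [1, 2, 3, 4]
```

Note that universe2 still ends up as a MySet rather than a built-in set, because the very first |= goes through the reflected __ror__; only the later iterations use __ior__.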
What happens when there is no implementation:
- The first time the in-place OR (|=) is performed, universe2 is a set and my_set is a MySet. set.__ior__ does not recognize the MySet class (it only accepts built-in set and frozenset operands), so it returns NotImplemented and the in-place OR degenerates into an ordinary binary OR.
- For the same reason, the binary OR of set (set.__or__) also fails, so Python tries to call the reflected __ror__ method of MySet.
- MySet does not define __ror__ itself, so Python falls back to the one inherited from collections.abc.Set. That __ror__ is the same function as __or__, and it returns a result of type MySet. You can find it in the _collections_abc.py file:
class Set(Collection):
    ...

    @classmethod
    def _from_iterable(cls, it):
        '''Construct an instance of the class from any iterable input.

        Must override this method if the class constructor signature
        does not accept an iterable for an input.
        '''
        return cls(it)

    ...

    def __or__(self, other):
        if not isinstance(other, Iterable):
            return NotImplemented
        chain = (e for s in (self, other) for e in s)
        return self._from_iterable(chain)

    __ror__ = __or__

    ...
- For every subsequent in-place OR, universe2 is already a MySet, because the first __ror__ call rebound it to one. Neither MySet nor collections.abc.Set has an __ior__ method, so collections.abc.Set.__or__ is called repeatedly, and each call copies every element accumulated so far into a new MySet. This is the root cause of the slow speed of the second loop. Therefore, as long as __ior__ is implemented so that the subsequent operations avoid this copy, performance improves greatly.
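The rebinding described in these steps can be observed directly. The snippet below is a small sketch that uses a trimmed copy of the question's MySet (without __ior__); the tiny input sets are made up for illustration.

```python
from collections.abc import Iterable, Iterator, Set


class MySet(Set):
    """Trimmed copy of the question's class, deliberately without __ior__."""

    def __init__(self, integers: Iterable[int]) -> None:
        self.data: set[int] = set(integers)

    def __len__(self) -> int:
        return len(self.data)

    def __iter__(self) -> Iterator[int]:
        return iter(self.data)

    def __contains__(self, x: object) -> bool:
        return isinstance(x, int) and x in self.data


universe2 = set()
universe2 |= MySet([1, 2])       # set gives up; Set.__ror__ builds the result
print(type(universe2).__name__)  # MySet

before = universe2
universe2 |= MySet([3])          # no __ior__, so Set.__or__ builds a new MySet
print(universe2 is before)       # False
```

Because every iteration produces a brand-new object holding all elements seen so far, the total work grows roughly quadratically with the number of sets, whereas set.update only touches the new elements.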
Suggestions for better implementation: The abstract class collections.abc.Set represents an immutable set, which is why it does not implement the in-place operator methods. If you need your subclass to support in-place operations, consider inheriting from collections.abc.MutableSet and implementing the add and discard abstract methods. MutableSet implements the in-place operator methods such as __ior__ on top of these two abstract methods (of course, this is still not efficient compared with the built-in set, so it is better to implement them yourself):
class MutableSet(Set):
    ...

    def __ior__(self, it):
        for value in it:
            self.add(value)
        return self

    ...
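A rough sketch of that suggestion follows. The class name MyMutableSet and the default argument of its constructor are hypothetical choices of mine, not part of the question, and the overridden __ior__ is just one plausible way to hand the work to the built-in set instead of calling add once per element.

```python
from collections.abc import Iterable, Iterator, MutableSet


class MyMutableSet(MutableSet):  # hypothetical name, for illustration only
    def __init__(self, integers: Iterable[int] = ()) -> None:
        self.data: set[int] = set(integers)

    def __len__(self) -> int:
        return len(self.data)

    def __iter__(self) -> Iterator[int]:
        return iter(self.data)

    def __contains__(self, x: object) -> bool:
        return isinstance(x, int) and x in self.data

    # The two abstract methods that MutableSet requires.
    def add(self, value: int) -> None:
        self.data.add(value)

    def discard(self, value: int) -> None:
        self.data.discard(value)

    # Optional override: one bulk update instead of one add() call per element.
    def __ior__(self, other):
        if not isinstance(other, Iterable):
            return NotImplemented
        self.data.update(other)
        return self


universe = MyMutableSet()
start = universe
for chunk in (MyMutableSet([1, 2]), MyMutableSet([2, 3]), MyMutableSet([4])):
    universe |= chunk

print(universe is start)  # True
print(sorted(universe))   # [1, 2, 3, 4]
```

Starting the accumulator as a MyMutableSet rather than a built-in set means that even the first |= is handled in place, so no intermediate copy is ever built.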