Comparing two generators in Python
Question:
I am wondering about the use of == when comparing two generators.
For example:
x = ['1','2','3','4','5']
gen_1 = (int(ele) for ele in x)
gen_2 = (int(ele) for ele in x)
gen_1 and gen_2 are the same for all practical purposes, and yet when I compare them:
>>> gen_1 == gen_2
False
My guess is that == is treated here the way is normally behaves, and since gen_1 and gen_2 are located in different places in memory:
>>> gen_1
<generator object <genexpr> at 0x01E8BAA8>
>>> gen_2
<generator object <genexpr> at 0x01EEE4B8>
their comparison evaluates to False. Am I right on this guess? And any other insight is welcome.
And btw, I do know how to compare two generators:
>>> all(a == b for a,b in zip(gen_1, gen_2))
True
or even
>>> list(gen_1) == list(gen_2)
True
But if there is a better way, I’d love to know.
Answers:
Because generators generate their values on-demand, there isn’t any way to “compare” them without actually consuming them. And if your generators generate an infinite sequence of values, such an equality test as you propose would be useless.
In order to do an item-wise comparison of two generators as with lists and other containers, Python would have to consume them both entirely (well, the shorter one, anyway). I think it’s good that you must do this explicitly, especially since one or the other may be infinite.
You are right with your guess: the fallback for comparison of types that don’t define == is comparison based on object identity.
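A small sketch of that identity fallback, reusing the names from the question (the alias variable is only there for illustration):

```python
x = ['1', '2', '3', '4', '5']
gen_1 = (int(ele) for ele in x)
gen_2 = (int(ele) for ele in x)
alias = gen_1  # a second name for the very same generator object

# Generators do not define their own ==, so the comparison falls back to identity.
print(gen_1 == gen_2)  # False: two distinct objects
print(gen_1 == alias)  # True: same object
print(gen_1 is alias)  # True
```

Neither comparison advances the generators, so no values are lost by it.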
A better way to compare the values they generate would be
from itertools import zip_longest, tee
sentinel = object()
all(a == b for a, b in zip_longest(gen_1, gen_2, fillvalue=sentinel))
(For Python 2.x, use izip_longest instead of zip_longest.)
This can actually short-circuit without necessarily having to look at all values. As pointed out by larsmans in the comments, we can’t use zip() here, since it might give wrong results if the generators produce a different number of elements: zip() stops at the shortest iterator. We use a newly created object instance as the fill value for zip_longest(), since object instances compare unequal to any sane value that could appear in one of the generators (including other object instances).
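Wrapped into a function, the comparison might look like the sketch below. The name iter_equal is made up for this example, and it assumes the elements have a sane ==.

```python
from itertools import count, zip_longest

def iter_equal(it_1, it_2):
    """Item-wise comparison of two iterators; consumes them as it goes."""
    sentinel = object()  # fresh object, unequal to any ordinary value
    return all(a == b for a, b in zip_longest(it_1, it_2, fillvalue=sentinel))

same = iter_equal((int(s) for s in "123"), (int(s) for s in "123"))
shorter = iter_equal(iter([1, 2, 3, 4]), iter([1, 2, 3]))
early = iter_equal(count(0), count(1))  # infinite, but they differ at once

print(same)     # True
print(shorter)  # False: the missing fourth item is filled with the sentinel
print(early)    # False: all() stops at the first mismatch
```

The last call only terminates because the two infinite iterators differ; on two equal infinite iterators this function would never return.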
Note that there is no way to compare generators without changing their state. You could store the items that were consumed if you need them later on:
gen_1, gen_1_teed = tee(gen_1)
gen_2, gen_2_teed = tee(gen_2)
all(a == b for a, b in zip_longest(gen_1_teed, gen_2_teed, fillvalue=sentinel))
This leaves the state of gen_1 and gen_2 essentially unchanged, because all() reads from the teed copies. All values consumed by all() are stored inside the tee objects.
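To make the tee() idea concrete, here is a minimal sketch. The helper name peek_equal and its return shape are assumptions for this example, not part of itertools.

```python
from itertools import tee, zip_longest

def peek_equal(it_1, it_2):
    """Compare two iterators and hand back replayable copies.

    Returns (equal, keep_1, keep_2); use keep_1 and keep_2 afterwards
    instead of the iterators that were passed in.
    """
    keep_1, probe_1 = tee(it_1)
    keep_2, probe_2 = tee(it_2)
    sentinel = object()
    equal = all(a == b for a, b in
                zip_longest(probe_1, probe_2, fillvalue=sentinel))
    return equal, keep_1, keep_2

gen_1 = (int(ele) for ele in ['1', '2', '3'])
gen_2 = (int(ele) for ele in ['1', '9', '3'])
equal, gen_1, gen_2 = peek_equal(gen_1, gen_2)

print(equal)        # False
print(list(gen_1))  # [1, 2, 3]
print(list(gen_2))  # [1, 9, 3]
```

The values read during the comparison are buffered until the kept copies are consumed, so for long generators the memory cost approaches that of simply building lists.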
At that point, you might ask yourself if it is really worth it to use lazy generators for the application at hand — it might be better to simply convert them to lists and work with the lists instead.
== is indeed the same as is on two generators, because that’s the only check that can be made without changing their state and thus losing elements.
list(gen_1) == list(gen_2) is the reliable and general way of comparing two finite generators (but obviously consumes both); your zip-based solution fails when they do not generate an equal number of elements:
>>> list(zip([1,2,3,4], [1,2,3]))
[(1, 1), (2, 2), (3, 3)]
>>> all(a == b for a, b in zip([1,2,3,4], [1,2,3]))
True
The list-based solution still fails when either generator generates an infinite number of elements. You can devise a workaround for that, but when both generators are infinite, you can only devise a semi-algorithm for non-equality.
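One way to picture such a semi-algorithm is to bound the search: a difference found within the first limit items proves the generators are unequal, while finding none proves nothing. The function name find_difference and the limit parameter are assumptions for this sketch.

```python
from itertools import count, islice, zip_longest

def find_difference(it_1, it_2, limit):
    """Return the index of the first mismatch among the first `limit`
    positions, or None if that prefix shows no difference."""
    sentinel = object()
    pairs = zip_longest(it_1, it_2, fillvalue=sentinel)
    for index, (a, b) in enumerate(islice(pairs, limit)):
        if a != b:
            return index
    return None  # inconclusive: equal so far, or both ended early

evens = (2 * n for n in count())
almost_evens = (2 * n if n != 5 else -1 for n in count())

print(find_difference(evens, almost_evens, 1000))  # 5
print(find_difference(count(0), count(0), 1000))   # None
```

A result of None only means that no difference showed up in the inspected prefix; for two infinite generators, equality itself can never be confirmed this way.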