Zipped Python generators with 2nd one being shorter: how to retrieve element that is silently consumed
Question:
I want to parse 2 generators of (potentially) different lengths with `zip`:

for el1, el2 in zip(gen1, gen2):
    print(el1, el2)

However, if `gen2` has fewer elements, one extra element of `gen1` is “consumed”.
For example:

def my_gen(n: int):
    for i in range(n):
        yield i

gen1 = my_gen(10)
gen2 = my_gen(8)
list(zip(gen1, gen2))  # Last tuple is (7, 7)
print(next(gen1))      # printed value is "9" => 8 is missing

gen1 = my_gen(8)
gen2 = my_gen(10)
list(zip(gen1, gen2))  # Last tuple is (7, 7)
print(next(gen2))      # printed value is "8" => OK
Apparently, a value is missing (`8` in my previous example) because `gen1` is read (thus generating the value `8`) before `zip` realizes `gen2` has no more elements. But this value disappears into the universe. When `gen2` is “longer”, there is no such “problem”.

QUESTION: Is there a way to retrieve this missing value (i.e. `8` in my previous example)? … ideally with a variable number of arguments (like `zip` takes).

NOTE: I have currently implemented it another way using `itertools.zip_longest`, but I really wonder how to get this missing value using `zip` or equivalent.
NOTE 2: I have created some tests of the different implementations in this REPL in case you want to submit and try a new implementation: https://repl.it/@jfthuong/MadPhysicistChester
Answers:
This is the equivalent `zip` implementation given in the docs:

def zip(*iterables):
    # zip('ABCD', 'xy') --> Ax By
    sentinel = object()
    iterators = [iter(it) for it in iterables]
    while iterators:
        result = []
        for it in iterators:
            elem = next(it, sentinel)
            if elem is sentinel:
                return
            result.append(elem)
        yield tuple(result)
In your 1st example, gen1 = my_gen(10) and gen2 = my_gen(8). Both generators are consumed in lockstep up to the tuple (7, 7). On the next iteration, `elem = next(it, sentinel)` for `gen1` returns 8, but the same call for `gen2` returns `sentinel` (because at this point `gen2` is exhausted), so `if elem is sentinel` is satisfied and the function executes `return` and stops. The 8 already appended to `result` is discarded, which is why `next(gen1)` now returns 9.

In your 2nd example, gen1 = my_gen(8) and gen2 = my_gen(10). Both generators are consumed up to the tuple (7, 7). On the next iteration, `elem = next(it, sentinel)` for `gen1` returns `sentinel` (because at this point `gen1` is exhausted), so `if elem is sentinel` is satisfied and the function executes `return` and stops before `gen2` is touched at all. That is why `next(gen2)` returns 8.
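The walkthrough above can be reproduced directly with the docs' equivalent (renamed `zip_equiv` here to avoid shadowing the builtin):

```python
def zip_equiv(*iterables):
    # Pure-Python equivalent of zip() from the docs.
    sentinel = object()
    iterators = [iter(it) for it in iterables]
    while iterators:
        result = []
        for it in iterators:
            elem = next(it, sentinel)
            if elem is sentinel:
                # The shorter input ran dry; whatever is already
                # sitting in `result` is silently discarded here.
                return
            result.append(elem)
        yield tuple(result)

gen1, gen2 = iter(range(10)), iter(range(8))
pairs = list(zip_equiv(gen1, gen2))
print(pairs[-1])   # (7, 7)
print(next(gen1))  # 9 -- the 8 was consumed inside zip_equiv and lost
```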
Inspired by Mad Physicist’s answer, you could use this `Gen` wrapper to counter it.

Edit: to handle the cases pointed out by Jean-Francois T.: once a value is consumed from an iterator it is gone from the iterator for good, and there is no in-place method to push it back. One workaround is to store the last consumed value.
class Gen:
    _sentinel = object()   # nothing consumed yet
    _exhausted = object()  # iterator ran dry

    def __init__(self, iterable):
        self.d = iter(iterable)
        self.prev = self._sentinel

    def __iter__(self):
        return self

    @property
    def last_val_consumed(self):
        if self.prev is self._exhausted:
            raise StopIteration
        if self.prev is self._sentinel:
            raise ValueError('Nothing has been consumed')
        return self.prev

    def __next__(self):
        # A private sentinel (rather than None) marks exhaustion, so
        # iterables that legitimately yield None still work.
        self.prev = next(self.d, self._exhausted)
        if self.prev is self._exhausted:
            raise StopIteration
        return self.prev
Examples:

# 1. When `gen1` is longer than `gen2`
gen1 = Gen(range(10))
gen2 = Gen(range(8))
list(zip(gen1, gen2))
# [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7)]
gen1.last_val_consumed
# 8, as it was the last value consumed
next(gen1)
# 9
gen1.last_val_consumed
# 9

# 2. When `gen1` or `gen2` is empty
gen1 = Gen(range(0))
gen2 = Gen(range(5))
list(zip(gen1, gen2))
gen1.last_val_consumed
# StopIteration is raised
gen2.last_val_consumed
# ValueError is raised: `ValueError: Nothing has been consumed`
One way would be to implement a generator that lets you cache the last value:

import collections.abc

class cache_last(collections.abc.Iterator):
    """
    Wraps an iterable in an iterator that can retrieve the last value.

    .. attribute:: obj

       A reference to the wrapped iterable. Provided for convenience
       of one-line initializations.
    """
    def __init__(self, iterable):
        self.obj = iterable
        self._iter = iter(iterable)
        self._sentinel = object()

    @property
    def last(self):
        """
        The last object yielded by the wrapped iterator.

        Uninitialized iterators raise a `ValueError`. Exhausted
        iterators raise a `StopIteration`.
        """
        if self.exhausted:
            raise StopIteration
        return self._last

    @property
    def exhausted(self):
        """
        `True` if there are no more elements in the iterator.

        Violates EAFP, but a convenient way to check if `last` is valid.
        Raises a `ValueError` if the iterator has not been started yet.
        """
        if not hasattr(self, '_last'):
            raise ValueError('Not started!')
        return self._last is self._sentinel

    def __next__(self):
        """
        Retrieve, record, and return the next value of the iteration.
        """
        try:
            self._last = next(self._iter)
        except StopIteration:
            self._last = self._sentinel
            raise
        # An alternative with fewer lines of code, but which checks
        # the return value one extra time and loses the underlying
        # StopIteration:
        #self._last = next(self._iter, self._sentinel)
        #if self._last is self._sentinel:
        #    raise StopIteration
        return self._last

    def __iter__(self):
        """
        This object is already an iterator.
        """
        return self
To use this, wrap the inputs to `zip`:

gen1 = cache_last(range(10))
gen2 = iter(range(8))

list(zip(gen1, gen2))
print(gen1.last)   # 8
print(next(gen1))  # 9

It is important to make `gen2` an iterator rather than an iterable, so that you can tell which one was exhausted. If `gen2` is exhausted, you don’t need to check `gen1.last`.
Another approach would be to override zip to accept a mutable sequence of iterables instead of separate iterables. That would allow you to replace iterables with a chained version that includes your “peeked” item:

import itertools

def myzip(iterables):
    iterators = [iter(it) for it in iterables]
    while True:
        items = []
        for it in iterators:
            try:
                items.append(next(it))
            except StopIteration:
                # Push the already-consumed items back in front of
                # their (partially-consumed) iterators.
                for i, peeked in enumerate(items):
                    iterables[i] = itertools.chain([peeked], iterators[i])
                return
        else:
            yield tuple(items)

gens = [range(10), range(8)]
list(myzip(gens))
print(next(gens[0]))  # 8

This approach is problematic for many reasons. Not only does it lose the original iterable, it also loses any useful properties the original object may have had, by replacing it with a `chain` object.
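One way around that drawback (a sketch of my own, not part of the original answer; `myzip2` and `leftovers` are illustrative names) is to hand the partially-consumed round back to the caller instead of mutating the input list:

```python
def myzip2(*iterables):
    """Like zip(), but saves the items consumed in the final,
    incomplete round instead of silently discarding them."""
    leftovers = []

    def gen():
        iterators = [iter(it) for it in iterables]
        while True:
            items = []
            for it in iterators:
                try:
                    items.append(next(it))
                except StopIteration:
                    leftovers.extend(items)  # keep the partial round
                    return
            yield tuple(items)

    return gen(), leftovers

pairs, leftovers = myzip2(range(10), range(8))
print(list(pairs))  # [(0, 0), ..., (7, 7)]
print(leftovers)    # [8] -- populated only once the zip is exhausted
```

The original iterables are untouched; the caller just has to remember that `leftovers` is filled in lazily, when the zip generator finishes.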
I can see you’ve found this answer already and it got brought up in the comments, but I figured I’d make an answer out of it. You want to use `itertools.zip_longest()`, which will replace the missing values of the shorter generator with `None`:

import itertools

def my_gen(n: int):
    for i in range(n):
        yield i

gen1 = my_gen(10)
gen2 = my_gen(8)

for i, j in itertools.zip_longest(gen1, gen2):
    print(i, j)

Prints:

0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 None
9 None

You can also supply a `fillvalue` argument when calling `zip_longest` to replace the `None` with a default value. But basically, for your problem, once you hit a `None` (in either `i` or `j`) in the for loop, the other variable will hold your `8`.
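To make that recovery concrete, here is a small sketch (the names `zip_recover` and `_MISSING` are mine) that uses a private sentinel as the `fillvalue`, so it also works when the generators legitimately yield `None`:

```python
import itertools

_MISSING = object()  # private fillvalue; assumes the inputs never yield it

def zip_recover(gen1, gen2):
    """Zip two iterators; also return the first unmatched element
    (if any) that zip_longest pulled from the longer one."""
    pairs, leftover = [], None
    for a, b in itertools.zip_longest(gen1, gen2, fillvalue=_MISSING):
        if b is _MISSING:
            leftover = a  # gen2 ended; `a` is the value zip would lose
            break
        if a is _MISSING:
            leftover = b
            break
        pairs.append((a, b))
    return pairs, leftover

gen1 = iter(range(10))
pairs, leftover = zip_recover(gen1, iter(range(8)))
print(pairs[-1])   # (7, 7)
print(leftover)    # 8
print(next(gen1))  # 9 -- the rest of gen1 is still available
```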
Inspired by @GrandPhuba’s elucidation of `zip`, let’s create a “safe” variant (unit-tested here):

from itertools import chain

def safe_zip(*args):
    """
    Safe zip that restores the last consumed element of each generator
    when it is not able to consume an element from all of them.

    Returns:
        * the generators, in a tuple
        * a generator of the zipped tuples
    """
    continue_ = True
    n = len(args)
    result = (_ for _ in [])
    while continue_:
        addend = []
        for i, gen in enumerate(args):
            try:
                value = next(gen)
                addend.append(value)
            except StopIteration:
                # Prepend each already-consumed value back onto its generator
                genlist = list(args)
                args = tuple(
                    [chain([v], g) for v, g in zip(addend, genlist[:i])]
                    + genlist[i:]
                )
                continue_ = False
                break
        if len(addend) == n:
            result = chain(result, [tuple(addend)])
    return args, result
Here is a basic test:

g1, g2 = (i for i in range(10)), (i for i in range(4))

# Create (g1, g2), g3 first, then loop over g3 as one would with zip
(g1, g2), g3 = safe_zip(g1, g2)
for a, b in g3:
    print(a, b)  # (0, 0) to (3, 3)
for x in g1:
    print(x)     # 4 to 9
I don’t think you can retrieve the dropped value with a basic for loop, because the iterators held inside the zip object are discarded along with it once the loop ends, and you can’t access them.

If you keep a reference to the zip object, though, you can get the position of the dropped item with some hacky, CPython-specific code:

import itertools

z = zip(range(10), range(8))
for _ in iter(z.__next__, None):
    ...
_, (one, other) = z.__reduce__()
_, (i_one,), p_one = one.__reduce__()  # p_one == current pos, 1-based
val = next(itertools.islice(iter(i_one), p_one - 1, p_one))  # val == 8
If you want to reuse code, the easiest solution is:

from more_itertools import peekable

a = peekable(a)
b = peekable(b)
while True:
    try:
        a.peek()
        b.peek()
    except StopIteration:
        break
    x = next(a)
    y = next(b)
    print(x, y)
print(list(a), list(b))  # Misses nothing.
You can test this code out using your setup:

def my_gen(n: int):
    yield from range(n)

a = my_gen(10)
b = my_gen(8)
It will print:
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
[8, 9] []
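`more_itertools` is a third-party package; if it is not available, the same peek-before-consume idea can be sketched with the stdlib only, and made variadic like `zip` (the `Peekable` and `lossless_zip` names are mine):

```python
class Peekable:
    """Minimal stand-in for more_itertools.peekable (stdlib only)."""
    _EMPTY = object()

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._cache = self._EMPTY

    def peek(self):
        # Pull one item ahead and hold on to it; raises StopIteration
        # when the underlying iterator is dry.
        if self._cache is self._EMPTY:
            self._cache = next(self._it)
        return self._cache

    def __iter__(self):
        return self

    def __next__(self):
        if self._cache is not self._EMPTY:
            value, self._cache = self._cache, self._EMPTY
            return value
        return next(self._it)

def lossless_zip(*peekables):
    """Variadic zip over Peekable objects: probe every input with
    peek() before consuming from any of them, so nothing is lost."""
    while True:
        try:
            for p in peekables:
                p.peek()
        except StopIteration:
            return
        yield tuple(next(p) for p in peekables)

a, b = Peekable(range(10)), Peekable(range(8))
print(list(lossless_zip(a, b))[-1])  # (7, 7)
print(list(a))                       # [8, 9] -- the 8 stays in a's cache
```

The key point is that the peeked value parks in the wrapper's cache rather than being thrown away, so iterating the leftover wrapper yields it again.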
You could use `itertools.tee` and `itertools.islice`:

from itertools import islice, tee

def zipped(gen1, gen2, pred=list):
    g11, g12 = tee(gen1)
    z = pred(zip(g11, gen2))
    # Skip the elements of the tee'd copy that were actually paired up
    return (islice(g12, len(z), None), gen2), z

gen1 = iter(range(10))
gen2 = iter(range(5))
(gen1, gen2), output = zipped(gen1, gen2)
print(output)      # [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
print(next(gen1))  # 5
Right out of the box, zip() is hardwired to dispose of the unmatched item, so you need a way to remember values before they get consumed.

The itertool called tee() was designed for this purpose. You can use it to create a "shadow" of the first input iterator. If the second iterator terminates, you can fetch the first iterator’s value from the shadow iterator.

Here’s one way to do it that uses existing tooling, runs at C speed, and is memory efficient:
>>> from itertools import tee
>>> from operator import itemgetter
>>> iterable1, iterable2 = 'abcde', 'xyz'
>>> it1, shadow1 = tee(iterable1)
>>> it2 = iter(iterable2)
>>> combined = map(itemgetter(0, 1), zip(it1, it2, shadow1))
>>> list(combined)
[('a', 'x'), ('b', 'y'), ('c', 'z')]
>>> next(shadow1)
'd'
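Applied to the question’s generators, the same shadow trick recovers the lost 8:

```python
from itertools import tee
from operator import itemgetter

def my_gen(n: int):
    yield from range(n)

it1, shadow1 = tee(my_gen(10))  # shadow1 replays whatever it1 consumes
it2 = my_gen(8)

# zip advances it1 first, then it2, then shadow1; when it2 dies,
# shadow1 has not yet been advanced past the value it1 just consumed.
combined = map(itemgetter(0, 1), zip(it1, it2, shadow1))
pairs = list(combined)
print(pairs[-1])      # (7, 7)
print(next(shadow1))  # 8 -- recovered from the shadow
print(next(it1))      # 9 -- the rest of the long generator is intact
```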