Why can't I iterate twice over the same iterator? How can I "reset" the iterator or reuse the data?

Question:

Consider the code:

def test(data):
    for row in data:
        print("first loop")
    for row in data:
        print("second loop")

When data is an iterator, for example a list iterator or a generator expression, this does not work:

>>> test(iter([1, 2]))
first loop
first loop
>>> test((_ for _ in [1, 2]))
first loop
first loop

This prints first loop once for each element, since data is non-empty. However, it never prints second loop. Why does iterating over data work the first time, but not the second time? How can I make it work a second time?

Aside from for loops, the same problem appears to occur with any kind of iteration: list/set/dict comprehensions, passing the iterator to list(), sum() or reduce(), etc.
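
For instance, once the iterator has been drained, other consumers see nothing left either (a quick check in the interactive interpreter):

>>> it = iter([1, 2, 3])
>>> sum(it)   # consumes the whole iterator
6
>>> sum(it)   # nothing left to add
0
>>> list(it)  # and nothing left to collect
[]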

On the other hand, if data is another kind of iterable, such as a list or a range (which are both sequences), both loops run as expected:

>>> test([1, 2])
first loop
first loop
second loop
second loop
>>> test(range(2))
first loop
first loop
second loop
second loop

For general theory and terminology explanation, see What are iterator, iterable, and iteration?.

To detect whether the input is an iterator or a "reusable" iterable, see Ensure that an argument can be iterated twice.

Asked By: JSchwartz


Answers:

An iterator can only be consumed once. For example:

lst = [1, 2, 3]
it = iter(lst)

next(it)
# => 1
next(it)
# => 2
next(it)
# => 3
next(it)
# => StopIteration

When the iterator is supplied to a for loop instead, that final StopIteration is what makes the loop exit after the first pass. Trying to use the same iterator in another for loop raises StopIteration again immediately, because the iterator has already been consumed.

A simple way to work around this is to save all the elements to a list, which can be traversed as many times as needed. For example:

data = list(data)
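
Applied to the test function from the question, the caching might look like this (a minimal sketch):

def test(data):
    data = list(data)  # cache the elements so they can be traversed repeatedly
    for row in data:
        print("first loop")
    for row in data:
        print("second loop")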

If the iterator yields many elements, however, it’s a better idea to create independent iterators using tee():

import itertools
it1, it2 = itertools.tee(data, 2) # create as many as needed

Now each one can be iterated over in turn:

for e in it1:
    print("first loop")

for e in it2:
    print("second loop")
Answered By: Óscar López

Once an iterator is exhausted, it will not yield any more.

>>> it = iter([3, 1, 2])
>>> for x in it: print(x)
...
3
1
2
>>> for x in it: print(x)
...
>>>
Answered By: falsetru

Iterators (e.g. from calling iter, from generator expressions, or from generator functions which yield) are stateful and can only be consumed once.

This is explained in Óscar López’s answer; however, that answer’s recommendation to use itertools.tee(data) instead of list(data) for performance reasons is misleading.
In most cases, where you want to iterate through the whole of data and then iterate through the whole of it again, tee takes more time and uses more memory than simply consuming the whole iterator into a list and then iterating over it twice. According to the documentation:

This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().

tee may be preferred if you will only consume the first few elements of each iterator, or if you will alternate between consuming a few elements from one iterator and then a few from the other.
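
For example, here is a sketch of the alternating pattern where tee avoids materialising the whole input at once (the data source and the alternation are made up for illustration):

import itertools

data = (n * n for n in range(1_000_000))  # a large, lazy source
a, b = itertools.tee(data, 2)

# Consume the two iterators in step; tee only buffers the gap between them,
# so memory use stays small even though the source is never fully listed.
for _ in range(3):
    print(next(a), next(b))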

Answered By: kaya3

How do I loop over an iterator twice?

It is usually impossible. (Explained later.) Instead, do one of the following:

  • Collect the iterator into something that can be looped over multiple times.

    items = list(iterator)
    
    for item in items:
        ...
    

    Downside: This costs memory.

  • Create a new iterator. It usually takes only a microsecond to make a new iterator. (One way to write such a create_iterator factory is sketched after this list.)

    for item in create_iterator():
        ...
    
    for item in create_iterator():
        ...
    

    Downside: Iteration itself may be expensive (e.g. reading from disk or network).

  • Reset the "iterator". For example, with file iterators:

    with open(...) as f:
        for item in f:
            ...
    
        f.seek(0)
    
        for item in f:
            ...
    

    Downside: Most iterators cannot be "reset".
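
For the second option above, create_iterator can be any callable that produces a fresh iterator on each call; a generator function is a common choice. A minimal sketch (the data file name is a made-up placeholder):

def create_iterator():
    # Re-open and re-read the source every time, producing a brand-new iterator.
    with open("records.txt") as f:  # hypothetical data source
        for line in f:
            yield line.rstrip("\n")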


Philosophy of an Iterator

Typically, though not technically [1]:

  • Iterable: A for-loopable object that represents data. Examples: list, tuple, str.
  • Iterator: A pointer to some element of an iterable.

If we were to define a sequence iterator, it might look something like this:

class SequenceIterator:
    index: int
    items: Sequence  # Sequences can be randomly indexed via items[index].

    def __next__(self):
        """Increment index, and return the latest item."""

The important thing here is that typically, an iterator does not store any actual data inside itself.

Iterators usually model a temporary "stream" of data. That data source is consumed by the process of iteration. This is a good hint as to why one cannot loop over an arbitrary source of data more than once. We need to open a new temporary stream of data (i.e. create a new iterator) to do that.

Exhausting an Iterator

What happens when we extract items from an iterator, starting with the current element of the iterator, and continuing until it is entirely exhausted? That’s what a for loop does:

iterable = "ABC"
iterator = iter(iterable)

for item in iterator:
    print(item)

Let’s support this functionality in SequenceIterator by telling the for loop how to extract the next item:

class SequenceIterator:
    def __next__(self):
        item = self.items[self.index]
        self.index += 1
        return item

Hold on. What if index goes past the last element of items? We should raise a safe exception for that:

class SequenceIterator:
    def __next__(self):
        try:
            item = self.items[self.index]
        except IndexError:
            raise StopIteration  # Safely says, "no more items in iterator!"
        self.index += 1
        return item

Now, the for loop knows when to stop extracting items from the iterator.
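
Putting the pieces together, a runnable version of this sketch might look like the following (the __init__ and __iter__ methods are assumed additions for completeness):

class SequenceIterator:
    def __init__(self, items):
        self.items = items  # any sequence that supports items[index]
        self.index = 0

    def __iter__(self):
        return self  # an iterator is its own iterator

    def __next__(self):
        try:
            item = self.items[self.index]
        except IndexError:
            raise StopIteration  # no more items
        self.index += 1
        return item

for item in SequenceIterator("ABC"):
    print(item)  # prints A, B, C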

What happens if we now try to loop over the iterator again?

iterable = "ABC"
iterator = iter(iterable)

# iterator.index == 0

for item in iterator:
    print(item)

# iterator.index == 3

for item in iterator:
    print(item)

# iterator.index == 3

Since the second loop starts from the current iterator.index, which is 3, it does not have anything else to print and so iterator.__next__ raises the StopIteration exception, causing the loop to end immediately.


[1] Technically:

  • Iterable: An object that returns an iterator when __iter__ is called on it.
  • Iterator: An object that one can repeatedly call __next__ on in a loop in order to extract items. Furthermore, calling __iter__ on it should return itself.


Answered By: Mateen Ulhaq

Whenever possible, don’t use an iterator in the for loop in the first place; it’s unnecessary. Use an iterable instead: Python calls iter(iterable) to generate an iterator while it executes the for loop. An iterable can be looped over as many times as you want.

If you’re writing the code and you find yourself trying to iterate over an iterator, there is probably something unnatural in your code; try to fix that first.
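
For the code in the question, that simply means passing the iterable itself rather than an iterator built from it:

test([1, 2])        # works: each for loop gets a fresh iterator from the list
test(iter([1, 2]))  # the single iterator is exhausted after the first loop, so the second prints nothing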

Answered By: pltc325

Why doesn’t iterating work the second time for iterators?

It does "work", in the sense that the for loop in the examples does run. It simply performs zero iterations. This happens because the iterator is "exhausted"; it has already iterated over all of the elements.

Why does it work for other kinds of iterables?

Because, behind the scenes, a new iterator is created for each loop, based on that iterable. Creating the iterator from scratch means that it starts at the beginning.

This happens because iterating requires an iterator. If an iterator was already provided, it will be used as-is; but otherwise, a conversion is necessary, which creates a new object.
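
This is easy to observe directly: calling iter on a list produces a brand-new iterator object each time, whereas calling it on an existing iterator just returns that same iterator:

lst = [1, 2]
print(iter(lst) is iter(lst))  # False: each loop over the list gets its own iterator

it = iter(lst)
print(iter(it) is it)          # True: the same (possibly exhausted) iterator is reused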

Given an iterator, how can we iterate twice over the data?

By caching the data; starting over with a new iterator (assuming we can re-create the initial condition); or, if the iterator was specifically designed for it, seeking or resetting the iterator. Relatively few iterators offer seeking or resetting.

Caching

The only fully general approach is to remember what elements were seen (or determine what elements will be seen) the first time and iterate over them again. The simplest way is by creating a list or tuple from the iterator:

elements = list(iterator)
for element in elements:
    ...

for element in elements:
    ...

Since the list is a non-iterator iterable, each loop will create a new iterator that iterates over all the elements. If the iterator is already "part way through" an iteration when we do this, the list will only contain the "following" elements:

abstract = (x for x in range(10)) # represents integers from 0 to 9 inclusive
next(abstract) # skips the 0
concrete = list(abstract) # makes a list with the rest
for element in concrete:
    print(element) # starts at 1, because the list does

for element in concrete:
    print(element) # also starts at 1, because a new iterator is created

A more sophisticated way is using itertools.tee. This essentially creates a "buffer" of elements from the original source as they're iterated over, and then creates and returns several custom iterators that work by remembering an index, fetching from the buffer if possible, and appending to the buffer (using the original iterable) when necessary. (In CPython, the reference implementation, this is implemented in C rather than in pure Python.)

from itertools import tee
concrete = list(range(10)) # `tee` works on any iterable, iterator or not
x, y = tee(concrete, 2) # the second argument is the number of instances.
for element in x:
    print(element)
    if element == 3:
        break

for element in y:
    print(element) # starts over at 0, taking 0, 1, 2, 3 from a buffer

Starting over

If we know, and can recreate, the conditions that held when the iteration started, that also solves the problem. This is implicitly what happens when iterating multiple times over a list: the "starting conditions for the iterator" are just the contents of the list, and all the iterators created from it give the same results. For another example, if a generator function does not depend on external state, we can simply call it again with the same parameters:

def powers_of(base, *range_args):
    for i in range(*range_args):
        yield base ** i

exhaustible = powers_of(2, 1, 12)

for value in exhaustible:
    print(value)

print('exhausted')

for value in exhaustible: # no results from here
    print(value)

# Want the same values again? Then use the same generator again:
print('replenished')
for value in powers_of(2, 1, 12):
    print(value)

Seekable or resettable iterators

Some specific iterators may make it possible to "reset" iteration to the beginning, or even to "seek" to a specific point in the iteration. In general, iterators need to have some kind of internal state in order to keep track of "where" they are in the iteration. Making an iterator "seekable" or "resettable" simply means allowing external access to, respectively, modify or re-initialize that state.

Nothing in Python disallows this, but in many cases it’s not feasible to provide a simple interface; in most other cases, it just isn’t supported even though it might be trivial. For generator functions, on the other hand, the internal state in question is quite complex, and it protects itself against modification.

The classic example of a seekable iterator is an open file object created using the built-in open function. The state in question is a position within the underlying file on disk; the .tell and .seek methods allow us to inspect and modify that position value – e.g. .seek(0) will set the position to the beginning of the file, effectively resetting the iterator. Similarly, csv.reader is a wrapper around a file; seeking within that file will therefore affect the subsequent results of iteration.

In all but the simplest, deliberately-designed cases, rewinding an iterator will be difficult to impossible. Even if the iterator is designed to be seekable, this leaves the question of figuring out where to seek to – i.e., what the internal state was at the desired point in the iteration. In the case of the powers_of generator shown above, that’s straightforward: just modify i. For a file, we’d need to know what the file position was at the beginning of the desired line, not just the line number. That’s why the file interface provides .tell as well as .seek.
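
As a short sketch of that interface (the file name is a made-up placeholder), .tell records a position before a line is read, and .seek returns to it later:

with open("example.txt") as f:
    f.readline()         # skip the first line
    mark = f.tell()      # remember where the second line starts
    print(f.readline())  # read the second line
    f.seek(mark)         # rewind to the recorded position
    print(f.readline())  # the second line again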

Here’s a re-worked example of powers_of representing an unbounded sequence, and designed to be seekable, rewindable and resettable via an exponent property:

class PowersOf:
    def __init__(self, base):
        self._exponent = 0
        self._base = base
    def __iter__(self):
        return self
    def __next__(self):
        result = self._base ** self._exponent
        self._exponent += 1
        return result
    @property
    def exponent(self):
        return self._exponent
    @exponent.setter
    def exponent(self, new_value):
        if not isinstance(new_value, int):
            raise TypeError("must set with an integer")
        if new_value < 0:
            raise ValueError("can't set to negative value")
        self._exponent = new_value

Examples:

pot = PowersOf(2)
for i in pot:
    if i > 1000:
        break
    print(i)

pot.exponent = 5 # jump to this point in the (unbounded) sequence
print(next(pot)) # 32
print(next(pot)) # 64

Technical detail

Iterators vs. iterables

Recall that, briefly:

  • "iteration" means looking at each element in turn, of some abstract, conceptual sequence of values. This can include:
  • "iterable" means an object that represents such a sequence. (What the Python documentation calls a "sequence" is in fact more specific than that – basically it also needs to be finite and ordered.). Note that the elements do not need to be "stored" – in memory, disk or anywhere else; it is sufficient that we can determine them during the process of iteration.
  • "iterator" means an object that represents a process of iteration; in some sense, it keeps track of "where we are" in the iteration.

Combining the definitions, an iterable is something that represents elements that can be examined in a specified order; an iterator is something that allows us to examine elements in a specified order. Certainly an iterator "represents" those elements – since we can find out what they are, by examining them – and certainly they can be examined in a specified order – since that’s what the iterator enables. So, we can conclude that an iterator is a kind of iterable – and Python’s definitions agree.
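
Python's abstract base classes reflect this relationship, and a quick check confirms it:

from collections.abc import Iterable, Iterator

it = iter([1, 2, 3])
print(isinstance(it, Iterator))         # True
print(isinstance(it, Iterable))         # True: every iterator is also an iterable
print(isinstance([1, 2, 3], Iterator))  # False: a list is iterable, but not an iterator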

How iteration works

In order to iterate, we need an iterator; but in normal cases (i.e. except in poorly written user-defined code), any iterable may be supplied. Behind the scenes, Python will convert other iterables to corresponding iterators; the logic for this is available via the built-in iter function. To iterate, Python repeatedly asks the iterator for a "next element" until the iterator raises a StopIteration exception. The logic for this is available via the built-in next function.

Generally, when iter is given a single argument that already is an iterator, that same object is returned unchanged. But if it’s some other kind of iterable, a new iterator object will be created. This directly leads to the problem in the OP. User-defined types can break both of these rules, but they probably shouldn’t.

The iterator protocol

Python roughly defines an "iterator protocol" that specifies how it decides whether a type is an iterable (or specifically an iterator), and how types can provide the iteration functionality. The details have changed slightly over the years, but the modern setup works like so:

  • Anything that has an __iter__ or a __getitem__ method is an iterable. Anything that defines an __iter__ method and a __next__ method is specifically an iterator. (Note in particular that if there is a __getitem__ and a __next__ but no __iter__, the __next__ has no particular meaning, and the object is a non-iterator iterable.)

  • Given a single argument, iter will attempt to call the __iter__ method of that argument, verify that the result has a __next__ method, and return that result. (It does not ensure the presence of an __iter__ method on the result; such objects can often be used in places where an iterator is expected, but will fail if e.g. iter is called on them.) If there is no __iter__, it will look for __getitem__, and use that to create an instance of a built-in iterator type (this fallback is demonstrated in a short sketch at the end of this list). That iterator is roughly equivalent to

class Iterator:
    def __init__(self, bound_getitem):
        self._index = 0
        self._bound_getitem = bound_getitem
    def __iter__(self):
        return self
    def __next__(self):
        try:
            result = self._bound_getitem(self._index)
        except IndexError:
            raise StopIteration
        self._index += 1
        return result
  • Given a single argument, next will attempt to call the __next__ method of that argument, allowing any StopIteration to propagate.

  • With all of this machinery in place, it is possible to implement a for loop in terms of while. Specifically, a loop like

for element in iterable:
    ...

will approximately translate to:

iterator = iter(iterable)
while True:
    try:
        element = next(iterator)
    except StopIteration:
        break
    ...

except that the iterator is not actually assigned any name (the syntax here is to emphasize that iter is only called once, and is called even if there are no iterations of the ... code).
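
As promised above, here is a short sketch of the __getitem__ fallback in action (the Squares class is a made-up example):

class Squares:
    # No __iter__ here; only __getitem__, taking integer indices from 0 upward.
    def __getitem__(self, index):
        if index >= 5:
            raise IndexError  # tells the fallback iterator to stop
        return index * index

for value in Squares():
    print(value)  # prints 0, 1, 4, 9, 16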

Answered By: Karl Knechtel