# Getting number of elements in an iterator in Python

## Question:

Is there an efficient way to know how many elements are in an iterator in Python, in general, without iterating through each and counting?

No, any method will require you to resolve every result. You can do

``````
iter_length = len(list(iterable))
``````

but running that on an infinite iterator will of course never return. It also consumes the iterator, which will have to be recreated if you want to use its contents again.
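A quick illustration of that consumption: the first pass counts the items, and a second pass over the same iterator finds nothing left.

```python
it = iter([1, 2, 3])

# First pass drains the iterator while counting.
print(len(list(it)))  # → 3

# The iterator is now exhausted; a second pass sees nothing.
print(len(list(it)))  # → 0
```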

Telling us what real problem you’re trying to solve might help us find you a better way to accomplish your actual goal.

Edit: Using `list()` will read the whole iterable into memory at once, which may be undesirable. Another way is to do

``````
sum(1 for _ in iterable)
``````

as another person posted. That will avoid keeping it in memory.

This code should work:

``````
>>> iter = (i for i in range(50))
>>> sum(1 for _ in iter)
50
``````

Although it does iterate through each item and count them, it is the fastest way to do so.

It also works when the iterator yields no items:

``````
>>> sum(1 for _ in range(0))
0
``````

Of course, it runs forever for an infinite input, so remember that iterators can be infinite:

``````
>>> sum(1 for _ in itertools.count())
[nothing happens, forever]
``````

Also, be aware that the iterator will be exhausted by doing this, and further attempts to use it will see no elements. That’s an unavoidable consequence of the Python iterator design. If you want to keep the elements, you’ll have to store them in a list or something.
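If you need both the count and the elements, materializing the iterator once is the usual pattern; a minimal sketch:

```python
it = (n * n for n in range(5))

items = list(it)   # materialize once, keeping the elements
count = len(items)

print(count)   # → 5
print(items)   # → [0, 1, 4, 9, 16]
```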

No. It’s not possible.

Example:

``````
import random

def gen(n):
    for i in xrange(n):  # Python 2's xrange; use range() on Python 3
        if random.randint(0, 1) == 0:
            yield i

iterator = gen(10)
``````

Length of `iterator` is unknown until you iterate through it.

An iterator is just an object with a pointer to the next object, read from some kind of buffer or stream. It is like a linked list: you don't know how many items there are until you walk through them. Iterators are efficient precisely because they only tell you what comes next, by reference, rather than by index; the trade-off, as you saw, is that you lose the ability to see how many entries remain.

There are two ways to get the length of “something” on a computer.

The first way is to store a count. This requires anything that touches the file or data to modify it (or a class that exposes only interfaces, which boils down to the same thing).

The other way is to iterate over it and count how big it is.

Kinda. You could check the `__length_hint__` method, but be warned that (at least up to Python 3.4, as gsnedders helpfully points out) it's an undocumented implementation detail that could very well vanish or summon nasal demons instead.
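Since Python 3.4 (PEP 424), the `operator.length_hint` function offers a supported way to ask for that hint, falling back to a default when the iterator provides none:

```python
import operator

# Iterators over sized built-in containers usually provide a hint.
print(operator.length_hint(iter([1, 2, 3])))  # → 3

# A plain generator offers no hint, so the default is returned.
print(operator.length_hint((x for x in range(10)), 0))  # → 0
```

Note that this is only a hint: it may be inaccurate or unavailable, so it cannot replace actually counting.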

Otherwise, no. An iterator is just an object that exposes the `__next__()` method (`next()` in Python 2). You can call it as many times as required, and it may or may not eventually raise `StopIteration`. Luckily, this behaviour is transparent to the coder most of the time. 🙂

It's common practice to put this type of information in the file header, and for pysam to give you access to it. I don't know the format, but have you checked the API?

As others have said, you can’t know the length from the iterator.

Regarding your original question, the answer is still that there is no way in general to know the length of an iterator in Python.

Given that your question is motivated by an application of the pysam library, I can give a more specific answer: I'm a contributor to PySAM, and the definitive answer is that SAM/BAM files do not provide an exact count of aligned reads. Nor is this information easily available from a BAM index file. The best one can do is to estimate the approximate number of alignments from the location of the file pointer after reading a number of alignments, extrapolating based on the total size of the file. This is enough to implement a progress bar, but not to count alignments in constant time.

You cannot (unless the type of the particular iterator implements some specific method that makes it possible).

Generally, you may count iterator items only by consuming the iterator. One of probably the most efficient ways:

``````
import itertools
from collections import deque

def count_iter_items(iterable):
    """
    Consume an iterable not reading it into memory; return the number of items.
    """
    counter = itertools.count()
    deque(itertools.izip(iterable, counter), maxlen=0)  # (consume at C speed)
    return next(counter)
``````

(For Python 3.x replace `itertools.izip` with `zip`).
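For reference, the Python 3 version of the same helper would look like this (a straightforward port, with `zip` in place of `itertools.izip`):

```python
import itertools
from collections import deque

def count_iter_items(iterable):
    """Consume an iterable without storing it; return the number of items."""
    counter = itertools.count()
    # zip pulls from `iterable` first, so `counter` advances exactly once
    # per item; the zero-length deque discards the pairs at C speed.
    deque(zip(iterable, counter), maxlen=0)
    return next(counter)

print(count_iter_items(x for x in range(1000)))  # → 1000
```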

This is against the very definition of an iterator, which is a pointer to an object, plus information about how to get to the next object.

An iterator does not know how many more times it will be able to iterate until terminating. This could be infinite, so infinity might be your answer.

``````
def count_iter(iterable):
    total = 0
    for _ in iterable:
        total += 1
    return total
``````

I like the cardinality package for this; it is very lightweight and tries to use the fastest possible implementation available depending on the iterable.

Usage:

``````
>>> import cardinality
>>> cardinality.count([1, 2, 3])
3
>>> cardinality.count(i for i in range(500))
500
>>> def gen():
...     yield 'hello'
...     yield 'world'
>>> cardinality.count(gen())
2
``````

The actual `count()` implementation is as follows:

``````
def count(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0
``````
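The `deque(..., maxlen=1)` trick works because `enumerate(iterable, 1)` pairs each item with a 1-based index, and the one-slot deque retains only the final pair, whose index equals the total count. A self-contained sketch of the same idea (hypothetical helper name):

```python
import collections

def ilen_style_count(iterable):
    # enumerate pairs each item with its 1-based index; the deque keeps
    # only the final pair, whose index equals the number of items.
    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

print(ilen_style_count(x for x in 'abcde'))  # → 5
print(ilen_style_count(iter([])))            # → 0
```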

A quick benchmark:

``````
import collections
import itertools

def count_iter_items(iterable):
    counter = itertools.count()
    collections.deque(itertools.izip(iterable, counter), maxlen=0)
    return next(counter)

def count_lencheck(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

def count_sum(iterable):
    return sum(1 for _ in iterable)

it = lambda y: (x for x in xrange(y))  # Python 2 (xrange, izip)

%timeit count_iter_items(it(1000))
%timeit count_lencheck(it(1000))
%timeit count_sum(it(1000))
``````

The results:

``````
10000 loops, best of 3: 37.2 µs per loop
10000 loops, best of 3: 47.6 µs per loop
10000 loops, best of 3: 61 µs per loop
``````

I.e., the simple `count_iter_items` is the way to go.

``````
61.9 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
74.4 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
82.6 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
``````

Although it’s not possible in general to do what’s been asked, it’s still often useful to have a count of how many items were iterated over after having iterated over them. For that, you can use jaraco.itertools.Counter or similar. Here’s an example using Python 3 and rwt to load the package.

``````
$ rwt -q jaraco.itertools -- -q
>>> import jaraco.itertools
>>> items = jaraco.itertools.Counter(range(100))
>>> _ = list(items)
>>> items.count
100
>>> import random
>>> def gen(n):
...     for i in range(n):
...         if random.randint(0, 1) == 0:
...             yield i
...
>>> items = jaraco.itertools.Counter(gen(100))
>>> _ = list(items)
>>> items.count
48
``````

So, for those who would like a summary of that discussion: the final top scores for counting a generator expression of 50 million items using:

• `len(list(gen))`,
• `len([_ for _ in gen])`,
• `sum(1 for _ in gen)`,
• `ilen(gen)` (from more_itertools),
• `reduce(lambda c, i: c + 1, gen, 0)`,

sorted by execution performance (memory consumption is shown as well), may surprise you:

``````
# 1: test_list.py:8: 0.492 KiB
gen = (i for i in data*1000); t0 = monotonic(); len(list(gen))
('list, sec', 1.9684218849870376)

# 2: test_list_compr.py:8: 0.867 KiB
gen = (i for i in data*1000); t0 = monotonic(); len([i for i in gen])
('list_compr, sec', 2.5885991149989422)

# 3: test_sum.py:8: 0.859 KiB
gen = (i for i in data*1000); t0 = monotonic(); sum(1 for i in gen); t1 = monotonic()
('sum, sec', 3.441088170016883)

# 4: more_itertools/more.py:413: 1.266 KiB
d = deque(enumerate(iterable, 1), maxlen=1)

test_ilen.py:10: 0.875 KiB
gen = (i for i in data*1000); t0 = monotonic(); ilen(gen)
('ilen, sec', 9.812256851990242)

# 5: test_reduce.py:8: 0.859 KiB
gen = (i for i in data*1000); t0 = monotonic(); reduce(lambda counter, i: counter + 1, gen, 0)
('reduce, sec', 13.436614598002052)
``````

So, `len(list(gen))` is both the most performant and the least memory-consuming of these approaches.

Presumably, you want to count the number of items without iterating through the iterator, so that it is not exhausted and you can use it again later. This is possible with `copy` or `deepcopy`:

``````
import copy

def get_iter_len(iterator):
    return sum(1 for _ in copy.copy(iterator))

###############################################

iterator = range(0, 10)
print(get_iter_len(iterator))

if len(tuple(iterator)) > 1:
    print("Finding the length did not exhaust the iterator!")
else:
    print("oh no! it's all gone")
``````

The output is `Finding the length did not exhaust the iterator!`
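Note, however, that this relies on the iterator being copyable, as the example's `range` object and the iterators of built-in containers are. A plain generator cannot be copied, so this trick fails for the most common kind of iterator:

```python
import copy

gen = (i for i in range(10))  # a generator, the most common iterator type
try:
    copy.copy(gen)
except TypeError as exc:
    # copy.copy falls back to pickling, which generators do not support
    print("cannot copy a generator:", exc)
```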

Optionally (and unadvisedly), you can shadow the built-in `len` function as follows:

``````
import copy

def len(obj, *, len=len):
    try:
        if hasattr(obj, "__len__"):
            r = len(obj)
        elif hasattr(obj, "__next__"):
            r = sum(1 for _ in copy.copy(obj))
        else:
            r = len(obj)
    finally:
        pass
    return r
``````

One simple way is to use the built-in functions `set()` or `list()`:

A: use `set()` when the iterator yields no duplicate items (fastest way)

``````
>>> iter = zip([1,2,3],['a','b','c'])
>>> len(set(iter))   # set(iter) = {(1, 'a'), (2, 'b'), (3, 'c')}
3
``````

or

``````
>>> iter = range(1,10)
>>> len(set(iter))   # set(iter) = {1, 2, 3, 4, 5, 6, 7, 8, 9}
9
``````

B: use `list()` when the iterator may yield duplicate items

``````
>>> iter = (1,2,1,2,1,2,1,2)
>>> len(list(iter))  # list(iter) = [1, 2, 1, 2, 1, 2, 1, 2]
8
>>> # compare with set()
>>> len(set(iter))   # set(iter) = {1, 2}
2
``````

I thought it could be worthwhile to have a micro-benchmark comparing the run-times of the different approaches mentioned here.

Disclaimer: I’m using `simple_benchmark` (a library written by me) for the benchmarks and also include `iteration_utilities.count_items` (a function in a third-party-library written by me).

To provide a more differentiated result I’ve done two benchmarks, one only including the approaches that don’t build an intermediate container just to throw it away and one including these:

``````
from simple_benchmark import BenchmarkBuilder
import more_itertools as mi
import iteration_utilities as iu

b1 = BenchmarkBuilder()
b2 = BenchmarkBuilder()

# Each approach has to be registered with the builders it takes part in.
@b1.add_function()
@b2.add_function()
def summation(it):
    return sum(1 for _ in it)

@b2.add_function()
def len_list(it):
    return len(list(it))

@b2.add_function()
def len_listcomp(it):
    return len([_ for _ in it])

@b1.add_function()
@b2.add_function()
def more_itertools_ilen(it):
    return mi.ilen(it)

@b1.add_function()
@b2.add_function()
def iteration_utilities_count_items(it):
    return iu.count_items(it)

@b1.add_arguments('number of items')
@b2.add_arguments('number of items')
def argument_provider():
    for exp in range(2, 18):
        size = 2**exp
        yield size, [0]*size

r1 = b1.run()
r2 = b2.run()

import matplotlib.pyplot as plt

f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=[15, 18])
r1.plot(ax=ax2)
r2.plot(ax=ax1)
plt.savefig('result.png')
``````

The results were:

It uses log-log axes so that all ranges (small values, large values) can be inspected. Since the plots are intended for qualitative comparison, the actual values aren't too interesting. In general, the y-axis (vertical) represents time and the x-axis (horizontal) represents the number of elements in the input "iterable". Lower on the vertical axis means faster.

The upper plot shows the approaches that use no intermediate list. It shows that the `iteration_utilities` approach was fastest, followed by `more_itertools`; the slowest was `sum(1 for _ in iterator)`.

The lower plot also includes the approaches that call `len()` on an intermediate list, once with `list` and once with a list comprehension. The approach with `len(list(...))` was fastest here, but the difference from the `iteration_utilities` approach is almost negligible. The approach using the comprehension was significantly slower than using `list` directly.

# Summary

Every approach mentioned here showed a dependency on the length of the input and iterated over every element in the iterable. There is no way to get the length without the iteration (even if the iteration is hidden).

If you don't want third-party extensions, then using `len(list(iterable))` is definitely the fastest of the tested approaches; however, it generates an intermediate list which could use significantly more memory.

If you don't mind additional packages, then `iteration_utilities.count_items` is almost as fast as `len(list(...))` but doesn't require additional memory.

However, it's important to note that this micro-benchmark used a list as input. The results could differ depending on the iterable whose length you want. I also tested with `range` and a simple generator expression and the trends were very similar; still, I cannot rule out that the timings change depending on the type of input.

This is theoretically impossible: this is, in fact, the Halting Problem.

## Proof

Assume, for contradiction, that it were possible to determine the length (or infinite length) of any generator `g`, using a function `len(g)`.

For any program `P`, let us now convert `P` into a generator `g(P)`:
For every return or exit point in `P`, yield a value instead of returning it.

If `len(g(P)) == infinity`, P doesn’t stop.

This solves the Halting Problem, which is known to be impossible, see Wikipedia. Contradiction.

Thus, it is impossible to count the elements of a generic generator without iterating over it (i.e., actually running through the program).

More concretely, consider

``````
def g():
    while True:
        yield "more?"
``````

The length is infinite. There are infinitely many such generators.
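In practice, if you only need to know whether an iterator yields at least (or at most) N items, you can bound the count with `itertools.islice` so that even an infinite generator terminates. A defensive sketch (the helper name is illustrative, not from any library):

```python
import itertools

def count_at_most(iterable, limit):
    """Count items, but stop after `limit` so infinite iterators terminate."""
    return sum(1 for _ in itertools.islice(iterable, limit))

print(count_at_most(itertools.count(), 100))  # → 100 (a plain count would never return)
print(count_at_most(range(5), 100))           # → 5
```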

I decided to re-run the benchmarks on a modern version of Python and found a near-complete inversion of the earlier benchmark results.

I ran the following commands:

``````
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s "  return len(tuple(x))" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s "  return len(list(x))" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s "  return sum(map(lambda i: 1, x))" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s "  return sum(1 for _ in x)" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s "  d = deque(enumerate(x, 1), maxlen=1)" -s "  return d[0][0] if d else 0" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s "  counter = count()" -s "  deque(zip(x, counter), maxlen=0)" -s "  return next(counter)" -- "itlen(it)"
``````

These are equivalent to timing each of the following `itlen*(it)` functions:

``````
it = iter(range(1000000))
from collections import deque
from itertools import count

def itlen1(x):
    return len(tuple(x))

def itlen2(x):
    return len(list(x))

def itlen3(x):
    return sum(map(lambda i: 1, x))

def itlen4(x):
    return sum(1 for _ in x)

def itlen5(x):
    d = deque(enumerate(x, 1), maxlen=1)
    return d[0][0] if d else 0

def itlen6(x):
    counter = count()
    deque(zip(x, counter), maxlen=0)
    return next(counter)
``````

On a Windows 11, Python 3.11 machine with an AMD Ryzen 7 5800H and 16 GB RAM, I got the following output:

``````
10000000 loops, best of 5: 103 nsec per loop
10000000 loops, best of 5: 107 nsec per loop
10000000 loops, best of 5: 138 nsec per loop
10000000 loops, best of 5: 164 nsec per loop
10000000 loops, best of 5: 338 nsec per loop
10000000 loops, best of 5: 425 nsec per loop
``````

Which suggests that `len(list(x))` and `len(tuple(x))` are tied; followed by `sum(map(lambda i: 1, x))`; then closely by `sum(1 for _ in x)`; then the other more complicated methods mentioned by other answers and/or used in cardinality are at least a factor of two slower still.
