Getting number of elements in an iterator in Python

Question:

Is there an efficient way to know how many elements are in an iterator in Python, in general, without iterating through each and counting?

Asked By: user248237

||

Answers:

No, any method will require you to resolve every result. You can do

iter_length = len(list(iterable))

but running that on an infinite iterator will of course never return. It also will consume the iterator and it will need to be reset if you want to use the contents.

Telling us what real problem you’re trying to solve might help us find you a better way to accomplish your actual goal.

Edit: Using list() will read the whole iterable into memory at once, which may be undesirable. Another way is to do

sum(1 for _ in iterable)

as another person posted. That will avoid keeping it in memory.

Answered By: Daenyth

This code should work:

>>> iter = (i for i in range(50))
>>> sum(1 for _ in iter)
50

Although it does iterate through each item and count them, it is the fastest way to do so.

It also works for when the iterator has no item:

>>> sum(1 for _ in range(0))
0

Of course, it runs forever for an infinite input, so remember that iterators can be infinite:

>>> sum(1 for _ in itertools.count())
[nothing happens, forever]

Also, be aware that the iterator will be exhausted by doing this, and further attempts to use it will see no elements. That’s an unavoidable consequence of the Python iterator design. If you want to keep the elements, you’ll have to store them in a list or something.

Answered By: John Howard

No. It’s not possible.

Example:

import random

def gen(n):
    for i in xrange(n):
        if random.randint(0, 1) == 0:
            yield i

iterator = gen(10)

Length of iterator is unknown until you iterate through it.

Answered By: Tomasz Wysocki

An iterator is just an object which has a pointer to the next object to be read by some kind of buffer or stream, it’s like a LinkedList where you don’t know how many things you have until you iterate through them. Iterators are meant to be efficient because all they do is tell you what is next by references instead of using indexing (but as you saw you lose the ability to see how many entries are next).

Answered By: Jesus Ramos

There are two ways to get the length of “something” on a computer.

The first way is to store a count – this requires anything that touches the file/data to modify it (or a class that only exposes interfaces — but it boils down to the same thing).

The other way is to iterate over it and count how big it is.

Answered By: Wayne Werner

Kinda. You could check the __length_hint__ method, but be warned that (at least up to Python 3.4, as gsnedders helpfully points out) it’s a undocumented implementation detail (following message in thread), that could very well vanish or summon nasal demons instead.

Otherwise, no. Iterators are just an object that only expose the next() method. You can call it as many times as required and they may or may not eventually raise StopIteration. Luckily, this behaviour is most of the time transparent to the coder. 🙂

Answered By: badp

It’s common practice to put this type of information in the file header, and for pysam to give you access to this. I don’t know the format, but have you checked the API?

As others have said, you can’t know the length from the iterator.

Answered By: tom10

Regarding your original question, the answer is still that there is no way in general to know the length of an iterator in Python.

Given that you question is motivated by an application of the pysam library, I can give a more specific answer: I’m a contributer to PySAM and the definitive answer is that SAM/BAM files do not provide an exact count of aligned reads. Nor is this information easily available from a BAM index file. The best one can do is to estimate the approximate number of alignments by using the location of the file pointer after reading a number of alignments and extrapolating based on the total size of the file. This is enough to implement a progress bar, but not a method of counting alignments in constant time.

Answered By: Kevin Jacobs

You cannot (except the type of a particular iterator implements some specific methods that make it possible).

Generally, you may count iterator items only by consuming the iterator. One of probably the most efficient ways:

import itertools
from collections import deque

def count_iter_items(iterable):
    """
    Consume an iterable not reading it into memory; return the number of items.
    """
    counter = itertools.count()
    deque(itertools.izip(iterable, counter), maxlen=0)  # (consume at C speed)
    return next(counter)

(For Python 3.x replace itertools.izip with zip).

Answered By: zuo

This is against the very definition of an iterator, which is a pointer to an object, plus information about how to get to the next object.

An iterator does not know how many more times it will be able to iterate until terminating. This could be infinite, so infinity might be your answer.

Answered By: FCAlive
def count_iter(iter):
    sum = 0
    for _ in iter: sum += 1
    return sum
Answered By: hasen

I like the cardinality package for this, it is very lightweight and tries to use the fastest possible implementation available depending on the iterable.

Usage:

>>> import cardinality
>>> cardinality.count([1, 2, 3])
3
>>> cardinality.count(i for i in range(500))
500
>>> def gen():
...     yield 'hello'
...     yield 'world'
>>> cardinality.count(gen())
2

The actual count() implementation is as follows:

def count(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0
Answered By: Erwin Mayer

A quick benchmark:

import collections
import itertools

def count_iter_items(iterable):
    counter = itertools.count()
    collections.deque(itertools.izip(iterable, counter), maxlen=0)
    return next(counter)

def count_lencheck(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

def count_sum(iterable):           
    return sum(1 for _ in iterable)

iter = lambda y: (x for x in xrange(y))

%timeit count_iter_items(iter(1000))
%timeit count_lencheck(iter(1000))
%timeit count_sum(iter(1000))

The results:

10000 loops, best of 3: 37.2 µs per loop
10000 loops, best of 3: 47.6 µs per loop
10000 loops, best of 3: 61 µs per loop

I.e. the simple count_iter_items is the way to go.

Adjusting this for python3:

61.9 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
74.4 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
82.6 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Answered By: Michael

Although it’s not possible in general to do what’s been asked, it’s still often useful to have a count of how many items were iterated over after having iterated over them. For that, you can use jaraco.itertools.Counter or similar. Here’s an example using Python 3 and rwt to load the package.

$ rwt -q jaraco.itertools -- -q
>>> import jaraco.itertools
>>> items = jaraco.itertools.Counter(range(100))
>>> _ = list(counted)
>>> items.count
100
>>> import random
>>> def gen(n):
...     for i in range(n):
...         if random.randint(0, 1) == 0:
...             yield i
... 
>>> items = jaraco.itertools.Counter(gen(100))
>>> _ = list(counted)
>>> items.count
48
Answered By: Jason R. Coombs

So, for those who would like to know the summary of that discussion. The final top scores for counting a 50 million-lengthed generator expression using:

  • len(list(gen)),
  • len([_ for _ in gen]),
  • sum(1 for _ in gen),
  • ilen(gen) (from more_itertool),
  • reduce(lambda c, i: c + 1, gen, 0),

sorted by performance of execution (including memory consumption), will make you surprised:

“`

1: test_list.py:8: 0.492 KiB

gen = (i for i in data*1000); t0 = monotonic(); len(list(gen))

(‘list, sec’, 1.9684218849870376)

2: test_list_compr.py:8: 0.867 KiB

gen = (i for i in data*1000); t0 = monotonic(); len([i for i in gen])

(‘list_compr, sec’, 2.5885991149989422)

3: test_sum.py:8: 0.859 KiB

gen = (i for i in data*1000); t0 = monotonic(); sum(1 for i in gen); t1 = monotonic()

(‘sum, sec’, 3.441088170016883)

4: more_itertools/more.py:413: 1.266 KiB

d = deque(enumerate(iterable, 1), maxlen=1)

test_ilen.py:10: 0.875 KiB
gen = (i for i in data*1000); t0 = monotonic(); ilen(gen)

(‘ilen, sec’, 9.812256851990242)

5: test_reduce.py:8: 0.859 KiB

gen = (i for i in data*1000); t0 = monotonic(); reduce(lambda counter, i: counter + 1, gen, 0)

(‘reduce, sec’, 13.436614598002052)
“`

So, len(list(gen)) is the most frequent and less memory consumable

Answered By: Alex-Bogdanov

Presumably, you want count the number of items without iterating through, so that the iterator is not exhausted, and you use it again later. This is possible with copy or deepcopy

import copy

def get_iter_len(iterator):
    return sum(1 for _ in copy.copy(iterator))

###############################################

iterator = range(0, 10)
print(get_iter_len(iterator))

if len(tuple(iterator)) > 1:
    print("Finding the length did not exhaust the iterator!")
else:
    print("oh no! it's all gone")

The output is “Finding the length did not exhaust the iterator!

Optionally (and unadvisedly), you can shadow the built-in len function as follows:

import copy

def len(obj, *, len=len):
    try:
        if hasattr(obj, "__len__"):
            r = len(obj)
        elif hasattr(obj, "__next__"):
            r = sum(1 for _ in copy.copy(obj))
        else:
            r = len(obj)
    finally:
        pass
    return r
Answered By: Toothpick Anemone

One simple way is using built-in function set() or list():

A: set() in case of having no duplicated item in the iterator (fastest way)

iter = zip([1,2,3],['a','b','c'])
print(len(set(iter)) # set(iter) = {(1, 'a'), (2, 'b'), (3, 'c')}
Out[45]: 3

or

iter = range(1,10)
print(len(set(iter)) # set(iter) = {1, 2, 3, 4, 5, 6, 7, 8, 9}
Out[47]: 9

B: list()in case of having duplicated item in the iterator

iter = (1,2,1,2,1,2,1,2)
print(len(list(iter)) # list(iter) = [1, 2, 1, 2, 1, 2, 1, 2]
Out[49]: 8
# compare with set function
print(len(set(iter)) # set(iter) = {1, 2}
Out[51]: 2
Answered By: imanzabet

I thought it could be worthwhile to have a micro-benchmark comparing the run-times of the different approaches mentioned here.

Disclaimer: I’m using simple_benchmark (a library written by me) for the benchmarks and also include iteration_utilities.count_items (a function in a third-party-library written by me).

To provide a more differentiated result I’ve done two benchmarks, one only including the approaches that don’t build an intermediate container just to throw it away and one including these:

from simple_benchmark import BenchmarkBuilder
import more_itertools as mi
import iteration_utilities as iu

b1 = BenchmarkBuilder()
b2 = BenchmarkBuilder()

@b1.add_function()
@b2.add_function()
def summation(it):
    return sum(1 for _ in it)

@b1.add_function()
def len_list(it):
    return len(list(it))

@b1.add_function()
def len_listcomp(it):
    return len([_ for _ in it])

@b1.add_function()
@b2.add_function()
def more_itertools_ilen(it):
    return mi.ilen(it)

@b1.add_function()
@b2.add_function()
def iteration_utilities_count_items(it):
    return iu.count_items(it)

@b1.add_arguments('length')
@b2.add_arguments('length')
def argument_provider():
    for exp in range(2, 18):
        size = 2**exp
        yield size, [0]*size

r1 = b1.run()
r2 = b2.run()

import matplotlib.pyplot as plt

f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=[15, 18])
r1.plot(ax=ax2)
r2.plot(ax=ax1)
plt.savefig('result.png')

The results were:

enter image description here

It uses log-log-axis so that all ranges (small values, large values) can be inspected. Since the plots are intended for qualitative comparison the actual values aren’t too interesting. In general the y-axis (vertical) represents the time and the x-axis (horizontal) represents the number of elements in the input "iterable". Lower on the vertical axis means faster.

The upper plot shows the approaches where no intermediate list was used. Which shows that the iteration_utilities approach was fastest, followed by more_itertools and the slowest was using sum(1 for _ in iterator).

The lower plot also included the approaches that used len() on an intermediate list, once with list and once with a list comprehension. The approach with len(list) was fastest here, but the difference to the iteration_utilities approach is almost negligible. The approach using the comprehension was significantly slower than using list directly.

Summary

Any approach mentioned here did show a dependency on the length of the input and iterated over ever element in the iterable. There is no way to get the length without the iteration (even if the iteration is hidden).

If you don’t want third-party extensions then using len(list(iterable)) is definitely the fastest approach of the tested approaches, it however generates an intermediate list which could use significant more memory.

If you don’t mind additional packages then iteration_utilities.count_items would be almost as fast as the len(list(...)) function but doesn’t require additional memory.

However it’s important to note that the micro-benchmark used a list as input. The result of the benchmark could be different depending on the iterable you want to get the length of. I also tested with range and a simple genertor-expression and the trends were very similar, however I cannot exclude that the timing won’t change depending on the type of input.

Answered By: MSeifert

This is theoretically impossible: this is, in fact, the Halting Problem.

Proof

Assume in contradiction it were possible to determine the length (or infinite length) of any generator g, using a function len(g).

For any program P, let us now convert P into a generator g(P):
For every return or exit point in P, yield a value instead of returning it.

If len(g(P)) == infinity, P doesn’t stop.

This solves the Halting Problem, which is known to be impossible, see Wikipedia. Contradiction.


Thus, it is impossible to count the elements of a generic generator without iterating over it (==actually running through the program).

More concretely, consider

def g():
    while True:
        yield "more?"

The length is infinite. There are infinitely many such generators.

Answered By: Gulzar

I decided to re-run benchmarks on this on a modern version of Python and found a near-complete inversion of the benchmarks by

I ran the following commands:

py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s "  return len(tuple(x))" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s "  return len(list(x))" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s "  return sum(map(lambda i: 1, x))" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s "  return sum(1 for _ in x)" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s "  d = deque(enumerate(x, 1), maxlen=1)" -s "  return d[0][0] if d else 0" -- "itlen(it)"
py -m timeit -n 10000000 -s "it = iter(range(1000000))" -s "from collections import deque" -s "from itertools import count" -s "def itlen(x):" -s "  counter = count()" -s "  deque(zip(x, counter), maxlen=0)" -s "  return next(counter)" -- "itlen(it)"

Which are equivalent to timing each of the the following itlen*(it) functions:

it = iter(range(1000000))
from collections import deque
from itertools import count

def itlen1(x):
  return len(tuple(x))
def itlen2(x):
  return len(list(x))
def itlen3(x):
  return sum(map(lambda i: 1, x))
def itlen4(x):
  return sum(1 for _ in x)
def itlen5(x):
  d = deque(enumerate(x, 1), maxlen=1)
  return d[0][0] if d else 0
def itlen6(x):
  counter = count()
  deque(zip(x, counter), maxlen=0)
  return next(counter)

On a Windows 11, Python 3.11 machine with an AMD Ryzen 7 5800H and 16 GB RAM, I got the following output:

10000000 loops, best of 5: 103 nsec per loop
10000000 loops, best of 5: 107 nsec per loop
10000000 loops, best of 5: 138 nsec per loop
10000000 loops, best of 5: 164 nsec per loop
10000000 loops, best of 5: 338 nsec per loop
10000000 loops, best of 5: 425 nsec per loop

Which suggests that len(list(x)) and len(tuple(x)) are tied; followed by sum(map(lambda i: 1, x)); then closely by sum(1 for _ in x); then the other more complicated methods mentioned by other answers and/or used in cardinality are at least a factor of two slower still.

Answered By: AbyxDev
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.