Efficient way to sum Counter objects in Python

Question:

Is there a more efficient way or a library that makes addition of Counter objects faster?

So far, I’m using the following code, and I need something faster:

cnt = sum([Counter(objects) for objects in object_list], Counter())
Asked By: Ian Herve Chu Te

Answers:

I couldn’t figure out exactly how the sum call behaves in the code you posted, but a quick benchmark against a simpler function shows a clear improvement over that block of code:

from collections import Counter
import random
import timeit

def func1(objs):
    count = Counter()
    for obj in objs:
        count.update(obj)
    return count

def func2(objs):
    return sum([Counter(obj) for obj in objs], Counter())

length = 100
objs = [[random.randint(0, 100) for i in range(length)] for i in range(length)]
time1 = timeit.timeit(lambda: func1(objs), number=100)
time2 = timeit.timeit(lambda: func2(objs), number=100)

print(f"Proposed Solution (t1): {time1}")
print(f"Question Solution (t2): {time2}")
print(f"t1 < t2: {time1 < time2}")
print(f"f1 == f2 {func1(objs) == func2(objs)}")
Proposed Solution (t1): 0.047416953
Question Solution (t2): 0.433098309
t1 < t2: True
f1 == f2 True
Answered By: Felipe

Don’t make a ton of temporary Counters, just make one, and have it count everything:

from collections import Counter
from itertools import chain

cnt = Counter(chain.from_iterable(object_list))
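For a quick sanity check, here is a minimal sketch with made-up sample data (the contents of object_list below are hypothetical, not from the question):

from collections import Counter
from itertools import chain

object_list = [["a", "b", "a"], ["b", "c"], ["a"]]  # hypothetical sample data
cnt = Counter(chain.from_iterable(object_list))
print(cnt)  # Counter({'a': 3, 'b': 2, 'c': 1})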

Making a bunch of individual Counters from smaller inputs is expensive, and denies you some of the performance benefits that Counter’s C-accelerator for counting input iterables gives you. Using sum to combine them makes it a Schlemiel the Painter’s algorithm, as it makes tons of temporary Counters of progressively increasing size (the work ends up being roughly O(m * n), where n is the total number of items counted and m is the number of objects they’re split over). Counting once over a flattened input iterable gets the work down to O(n).

Flattening your iterable of iterables to a single stream of inputs and counting it all once dramatically reduces runtime, especially for large numbers of smaller objects.
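As a rough illustration of that O(m * n) versus O(n) difference, here is a minimal timing sketch (the workload is made up for demonstration; absolute numbers will vary by machine and Python version):

from collections import Counter
from itertools import chain
import timeit

# Hypothetical workload: 100 objects of 100 items each
object_list = [list(range(100)) for _ in range(100)]

t_sum = timeit.timeit(
    lambda: sum([Counter(obj) for obj in object_list], Counter()), number=100)
t_chain = timeit.timeit(
    lambda: Counter(chain.from_iterable(object_list)), number=100)

print(f"sum of Counters: {t_sum:.3f}s, flattened count: {t_chain:.3f}s")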

Using chain.from_iterable like this is equivalent to:

cnt = Counter(item for obj in object_list for item in obj)

but pushes the work to the C layer on the CPython reference interpreter; if the contents of object_list are all built-in types implemented in C as well, then no bytecode gets executed at all when you use chain.from_iterable, removing a lot of interpreter overhead.
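If you want to see that interpreter overhead in isolation, a small sketch along the same lines (again with made-up data) compares the generator-expression spelling against chain.from_iterable; both count the same items, so any gap is largely per-item interpreter cost:

from collections import Counter
from itertools import chain
import timeit

object_list = [list(range(100)) for _ in range(100)]  # hypothetical data

t_genexpr = timeit.timeit(
    lambda: Counter(item for obj in object_list for item in obj), number=100)
t_chain = timeit.timeit(
    lambda: Counter(chain.from_iterable(object_list)), number=100)

print(f"genexpr: {t_genexpr:.3f}s, chain.from_iterable: {t_chain:.3f}s")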

If you must have a bunch of Counters, at least avoid the Schlemiel the Painter’s algorithm by doing in-place updates of an accumulator Counter. You can one-line this in an ugly way with functools.reduce and operator.iadd (which maps to Counter’s in-place +=); it still makes one temporary Counter per input, but it no longer builds progressively larger temporaries that it throws away each time:

import functools
import operator

cnt = functools.reduce(operator.iadd, map(Counter, object_list), Counter())

or make it more readable (and avoid any additional temporaries):

cnt = Counter()
for obj in object_list:
    cnt.update(obj)  # cnt += Counter(obj) also works, but builds an unnecessary temporary
Answered By: ShadowRanger