Why is `np.sum(range(N))` very slow?

Question:

I saw a video about the speed of loops in Python, where it was explained that `sum(range(N))` is much faster than manually looping through the range and adding the variables together, since the former runs in C thanks to built-in functions, while in the latter the summation is done in (slow) Python. I was curious what happens when NumPy is added to the mix. As I expected, `np.sum(np.arange(N))` is the fastest, but `sum(np.arange(N))` and `np.sum(range(N))` are even slower than the naive for loop.

Why is this?

Here’s the script I used for testing, with some comments about the supposed cause of the slowdown where I know it (taken mostly from the video), and the results I got on my machine (Python 3.10.0, NumPy 1.21.2):

updated script:

import numpy as np
from timeit import timeit

N = 10_000_000
repetition = 10

def sum0(N = N):
    s = 0
    i = 0
    while i < N: # condition is checked in python
        s += i
        i += 1 # both additions are done in python
    return s

def sum1(N = N):
    s = 0
    for i in range(N): # increment in C
        s += i # addition in python
    return s

def sum2(N = N):
    return sum(range(N)) # everything in C

def sum3(N = N):
    return sum(list(range(N)))

def sum4(N = N):
    return np.sum(range(N)) # very slow np.array conversion

def sum5(N = N):
    # much faster np.array conversion
    return np.sum(np.fromiter(range(N),dtype = int))

def sum5v2_(N = N):
    # much faster np.array conversion
    return np.sum(np.fromiter(range(N),dtype = np.int_))

def sum6(N = N):
    # possibly slow conversion to Py_long from np.int
    return sum(np.arange(N))

def sum7(N = N):
    # list returns a list of np.int-s
    return sum(list(np.arange(N)))

def sum7v2(N = N):
    # tolist conversion to python int seems faster than the implicit conversion
    # in sum(list()) (tolist returns a list of python int-s)
    return sum(np.arange(N).tolist())

def sum8(N = N):
    return np.sum(np.arange(N)) # everything in numpy (np.add.reduce in C)

def sum9(N = N):
    return np.arange(N).sum() # remove dispatch overhead

def array_basic(N = N):
    return np.array(range(N))

def array_dtype(N = N):
    return np.array(range(N),dtype = np.int_)

def array_iter(N = N):
    # np.sum's source code mentions to use fromiter to convert from generators
    return np.fromiter(range(N),dtype = np.int_)

print(f"while loop:         {timeit(sum0, number = repetition)}")
print(f"for loop:           {timeit(sum1, number = repetition)}")
print(f"sum_range:          {timeit(sum2, number = repetition)}")
print(f"sum_rangelist:      {timeit(sum3, number = repetition)}")
print(f"npsum_range:        {timeit(sum4, number = repetition)}")
print(f"npsum_iterrange:    {timeit(sum5, number = repetition)}")
print(f"npsum_iterrangev2:  {timeit(sum5v2_, number = repetition)}")
print(f"sum_arange:         {timeit(sum6, number = repetition)}")
print(f"sum_list_arange:    {timeit(sum7, number = repetition)}")
print(f"sum_arange_tolist:  {timeit(sum7v2, number = repetition)}")
print(f"npsum_arange:       {timeit(sum8, number = repetition)}")
print(f"nparangenpsum:      {timeit(sum9, number = repetition)}")
print(f"array_basic:        {timeit(array_basic, number = repetition)}")
print(f"array_dtype:        {timeit(array_dtype, number = repetition)}")
print(f"array_iter:         {timeit(array_iter,  number = repetition)}")

print(f"npsumarangeREP:     {timeit(lambda : sum8(N//1000), number = 100000*repetition)}")
print(f"nparangesumREP:     {timeit(lambda : sum9(N//1000), number = 100000*repetition)}")

# Example output:
#
# while loop:         11.493371912998555
# for loop:           7.385945574002108
# sum_range:          2.4605720699983067
# sum_rangelist:      4.509678105998319
# npsum_range:        11.85120212900074
# npsum_iterrange:    4.464334709002287
# npsum_iterrangev2:  4.498494338993623
# sum_arange:         9.537815956995473
# sum_list_arange:    13.290120724996086
# sum_arange_tolist:  5.231948580003518
# npsum_arange:       0.241889145996538
# nparangenpsum:      0.21876695199898677
# array_basic:        11.736577274998126
# array_dtype:        8.71628468400013
# array_iter:         4.303306431000237
# npsumarangeREP:     21.240833958996518
# nparangesumREP:     16.690092379001726

Asked By: fbence


Answers:

From the CPython source code for `sum`: `sum` initially attempts a fast path that assumes all inputs are the same type. If that assumption fails, it falls back to iterating generically:

/* Fast addition by keeping temporary sums in C instead of new Python objects.
   Assumes all inputs are the same type.  If the assumption fails, default
   to the more general routine.
*/

I’m not entirely certain what is happening under the hood, but it is likely the repeated creation/conversion of C types to Python objects that is causing these slow-downs. It’s worth noting that both sum and range are implemented in C.
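One way to see the effect of that fast path (a rough sketch of my own, not from the original post) is to time `sum` over a list of plain Python ints against the same values held as NumPy scalar objects, for which the generic path is taken:

```python
import timeit
import numpy as np

n = 100_000
py_ints = list(range(n))      # homogeneous Python ints: fast path applies
np_ints = list(np.arange(n))  # NumPy scalar objects: generic object path

t_py = timeit.timeit(lambda: sum(py_ints), number=10)
t_np = timeit.timeit(lambda: sum(np_ints), number=10)

# Both compute the same value, but via different code paths inside sum()
assert int(sum(py_ints)) == int(sum(np_ints)) == n * (n - 1) // 2
print(f"sum over Python ints:   {t_py:.4f}s")
print(f"sum over NumPy scalars: {t_np:.4f}s")
```

On typical machines the NumPy-scalar version is several times slower, since every addition allocates a fresh NumPy scalar object instead of accumulating in a C temporary.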


This next bit is not really an answer to the question, but I wondered if we could speed up sum for python ranges as range is quite a smart object.

To do this I’ve used functools.singledispatch to override the built-in sum function specifically for the range type; then implemented a small function to calculate the sum of an arithmetic progression.

from functools import singledispatch

def sum_range(range_, /, start=0):
    """Overloaded `sum` for range, compute arithmetic sum"""
    n = len(range_)
    if not n:
        return start
    # integer arithmetic: n * (first + last) is always even, so // 2 is exact,
    # and this avoids float precision loss for very large ranges
    return start + n * (range_[0] + range_[-1]) // 2

sum = singledispatch(sum)
sum.register(range, sum_range)

def test():
    """
    >>> sum(range(0, 100))
    4950
    >>> sum(range(0, 10, 2))
    20
    >>> sum(range(0, 9, 2))
    20
    >>> sum(range(0, -10, -1))
    -45
    >>> sum(range(-10, 10))
    -10
    >>> sum(range(-1, -100, -2))
    -2500
    >>> sum(range(0, 10, 100))
    0
    >>> sum(range(0, 0))
    0
    >>> sum(range(0, 100), 50)
    5000
    >>> sum(range(0, 0), 10)
    10
    """

if __name__ == "__main__":
    import doctest
    doctest.testmod()

I’m not sure if this is complete, but it’s definitely faster than looping.

Answered By: Alex

Let’s see if I can summarize the results.

sum can work with any iterable, repeatedly asking for the next value and adding it. range is a lazy sequence (not a true generator) that is happy to supply the next value on demand.

# sum_range:          1.4830789409988938

Making a list from a range takes time:

# sum_rangelist:      3.6745876889999636

Summing a pregenerated list is actually faster than summing the range:

%%timeit x = list(range(N))
    ...: sum(x)

np.sum is designed to sum arrays; it’s a wrapper around np.add.reduce.
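This is easy to check directly; for a plain array the two give identical results:

```python
import numpy as np

x = np.arange(10)

# np.sum on a plain array dispatches to the same C-level reduction
assert int(np.sum(x)) == int(np.add.reduce(x)) == 45
```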

np.sum has a deprecation warning for np.sum(generator), recommending the use of fromiter or Python sum:

# npsum_range:        16.216972655000063

fromiter is the best way of making an array from a generator. Using np.array on a range is legacy behaviour and may go away in the future; range is about the only generator-like input that np.array will accept.

np.array is a general purpose function that can handle many cases, including nested arrays, and conversion to various dtypes. As such it has to process the whole input argument, deducing both shape and dtype.

# npsum_fromiterrange:3.47655400199983

Iterating over a numpy array is slower than iterating over a list, since it has to "unbox" each element into a Python object.

# sum_arange:         16.656015603000924

Similarly, making a list from an array is slow; it involves the same sort of Python-level iteration.

# sum_list_arange:    19.500842117000502

arr.tolist() is relatively fast, creating a pure python list in compiled code. So speed is similar to making a list from range.

# sum_arange_tolist:  4.004777374000696
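The type difference behind that speed gap can be checked directly: iterating the array yields NumPy scalar objects, while `tolist()` produces plain Python ints, which `sum`’s fast path can handle:

```python
import numpy as np

arr = np.arange(5)

# list() iterates the array at the Python level, yielding NumPy scalars
assert isinstance(list(arr)[0], np.integer)

# tolist() converts to plain Python ints in compiled code
assert type(arr.tolist()[0]) is int
```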

np.sum of an array is pure numpy and quite fast. np.sum(x) on a pre-built x = np.arange(N) is even faster (by about 4x), since it skips the array creation.

# npsum_arange:       0.2332638230000157

np.sum from range or list is dominated by the cost of creating the array first:

# array_basic:        16.1631146109994
# array_dtype:        16.550737804000164
# array_iter:         3.9803170430004684
Answered By: hpaulj

np.sum(range(N)) is slow mostly because the current NumPy implementation does not use enough information about the exact type and content of the values provided by range(N). The heart of the problem is Python’s dynamic typing and its arbitrary-precision integers, although NumPy could optimize this specific case.

First of all, range(N) returns a dynamically-typed Python object, a (special kind of) lazy sequence. The objects it provides are also dynamically typed: in practice they are pure-Python integers.

The thing is, NumPy is written in the statically-typed language C, so it cannot efficiently work on dynamically-typed pure-Python objects. NumPy’s strategy is to convert such objects into C types when it can. One big problem in this case is that the integers provided by the iterable can theoretically be huge: NumPy does not know whether the values will overflow an np.int32 or even an np.int64. Thus, NumPy first has to detect the right type to use and then compute the result using that type.

This translation process can be quite expensive and appears unnecessary here, since all the values provided by range(10_000_000) fit in an np.int32. However, range(5_000_000_000) returns the same object type with pure-Python integers overflowing np.int32, and NumPy needs to detect this case automatically so as not to return wrong results. Moreover, even when the input type is correctly identified (np.int32 on my machine), that does not mean the output will be correct, because overflows can occur during the computation of the sum. This is sadly the case on my machine.
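A sketch of this overflow behaviour (my own illustration, using a forced `int32` accumulator, since by default `np.sum` upcasts small integer inputs to the platform integer):

```python
import numpy as np

N = 10_000_000
true_sum = N * (N - 1) // 2        # 49999995000000, too big for int32

arr = np.arange(N, dtype=np.int32) # every single element fits in int32

# With a 64-bit accumulator the result is exact:
ok = int(np.sum(arr, dtype=np.int64))
assert ok == true_sum

# Forcing an int32 accumulator makes the sum wrap around (modular arithmetic):
wrapped = int(np.sum(arr, dtype=np.int32))
expected_wrap = ((true_sum + 2**31) % 2**32) - 2**31
assert wrapped == expected_wrap != true_sum
```

So even a correctly detected input type is not enough; the accumulator type matters too.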

NumPy developers decided to deprecate such use; the documentation says np.fromiter should be used instead. np.fromiter has a required dtype parameter that lets the user specify the right type to use.

One way to check this behaviour in practice is to simply create a temporary list:

tmp = list(range(10_000_000))

# NumPy implicitly converts the list to an array but
# still has to detect the element type automatically
np.sum(tmp)

A faster implementation is the following:

tmp = list(range(10_000_000))

# The list is explicitly converted using a well-defined type, so
# no automatic type detection is needed
# (note that the result is still wrong since it does not fit in an np.int32)
tmp2 = np.array(tmp, dtype=np.int32)
result = np.sum(tmp2)

The first case takes 476 ms on my machine while the second takes 289 ms. Note that np.sum itself takes only 4 ms. Thus, a large part of the time is spent converting pure-Python integer objects to internal int32 values (more specifically, managing pure-Python integers). list(range(10_000_000)) is expensive too, taking 205 ms. This is again due to the overhead of pure-Python integers (i.e. allocations, deallocations, reference counting, incrementing variable-sized integers, memory indirections and branches due to dynamic typing) as well as the overhead of the iterable.
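A rough way to reproduce this breakdown yourself (with a smaller N here; the absolute timings will of course differ from those above):

```python
import timeit
import numpy as np

N = 1_000_000

# cost of building the Python list of integers
t_list = timeit.timeit(lambda: list(range(N)), number=10)

# cost of converting that list to an explicitly-typed array
tmp = list(range(N))
t_convert = timeit.timeit(lambda: np.array(tmp, dtype=np.int64), number=10)

# cost of the reduction itself on an already-built array
arr = np.array(tmp, dtype=np.int64)
t_sum = timeit.timeit(lambda: np.sum(arr), number=10)

# The reduction is a tiny fraction of the total; most of the time goes
# into building and converting Python integer objects.
print(f"list: {t_list:.3f}s  convert: {t_convert:.3f}s  np.sum: {t_sum:.3f}s")
assert int(np.sum(arr)) == N * (N - 1) // 2
```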

sum(np.arange(N)) is slow because sum is a generic function working at the Python-object level on NumPy-defined objects. The CPython interpreter needs to call NumPy functions to perform each basic addition. Moreover, NumPy integer scalars are still Python objects, and so they are subject to reference counting, allocation, deallocation, etc. Not to mention that NumPy and CPython add many checks to functions that ultimately just add two native numbers together. A NumPy-aware just-in-time compiler such as Numba can solve this issue: on my machine, Numba takes 23 ms to compute the sum of np.arange(10_000_000) (with the code still written in Python), while the CPython interpreter takes 556 ms.

Answered By: Jérôme Richard