Why are for-loops much slower than broadcasting?


Comparing two chunks of code for a simple matrix operation, the one with a nested for loop is much slower. I wonder: what is the underlying reason for this?

This loop runs for 2.5 seconds:

import numpy as np

m = np.zeros((800, 8000))
for i in range(800):
    for j in range(8000):
        m[i, j] = i + j

Whereas this one takes only 0.05 seconds:

a = np.tile(np.arange(8000), (800, 1))
b = np.arange(800)
b = b[:, None]
print(a + b)
Asked By: Oleg Zheleznyy



The short answer is that the second snippet uses NumPy's vectorized operations, which run in optimized compiled code, whereas the nested for loops execute one iteration at a time in Python's interpreter, which is much slower.

Disclaimer: I'm not an expert on the intricacies of vectorized operations, so take the following explanation with a grain of salt; if you want to dive deeper into the subject, look for more authoritative sources.

What Are Vectorized operations?

Vectorized operations are operations that apply to whole arrays or data sequences instead of individual elements. They use NumPy's broadcasting rules to match the shapes of different arrays and perform element-wise calculations. Vectorized operations are much faster than for loops because they use optimized, pre-compiled functions implemented in low-level languages like C or Fortran.
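As a minimal sketch of broadcasting applied to the question's own task: a column vector of row indices plus a row vector of column indices expands to the full matrix in one compiled operation, with no Python-level loop.

```python
import numpy as np

# Broadcasting: shape (800, 1) + shape (8000,) -> shape (800, 8000).
rows = np.arange(800)[:, None]   # column vector of row indices
cols = np.arange(8000)           # row vector of column indices
m = rows + cols                  # m[i, j] == i + j for every element
```

This computes the same matrix as the nested loops, but the element-wise addition happens entirely inside NumPy's C code.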

Numpy ND-Arrays Are Homogeneous

Another characteristic, related to the previous explanation, is that NumPy ND-arrays are homogeneous: an array can only contain data of a single type. For instance, an array can contain 8-bit integers or 32-bit floating-point numbers, but not a mix of the two. This is in stark contrast to Python's lists and tuples, which are entirely unrestricted in what they can hold; a single list can simultaneously contain strings, integers, and other objects.

"Knowing" that an array's contents are homogeneous in data type lets NumPy delegate the mathematical operations on the array's contents to optimized, compiled C code, a process referred to as vectorization. The outcome can be a tremendous speedup relative to the analogous computation in Python, which must painstakingly check the data type of every item as it iterates, since Python typically (though not always) works with lists of unrestricted contents.
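A quick illustration of that homogeneity: when you build an array from mixed inputs, NumPy upcasts everything to one common dtype, while a plain list keeps each object as-is.

```python
import numpy as np

ints = np.array([1, 2, 3])       # homogeneous integer array
mixed = np.array([1, 2.5, 3])    # the ints are upcast to float64
print(mixed.dtype)               # float64 -- one dtype for all elements

# A Python list, by contrast, can hold anything at once:
heterogeneous = [1, "two", 3.0]
print([type(x).__name__ for x in heterogeneous])
```

Because every element of `mixed` is a float64, NumPy never has to re-check types while operating on it.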

Note: How NumPy can Perform Multiple Operations at the Same Time

The following example was extracted from the following source: Vectorization with NumPy

Many calculations require repeatedly doing the same operation on all items in one or several sequences, e.g., multiplying two vectors a = [1, 2, 3, 4, 5] and b = [6, 7, 8, 9, 10]. This is usually implemented with a loop (e.g., a for loop or while loop) where each item is treated one by one: 1 * 6, then 2 * 7, and so on. Modern computers have special registers for such operations that let them operate on several items at once: a part of the data, say 4 items, is loaded and multiplied simultaneously. For the example above, where both vectors have a size of 5, this means that instead of 5 operations only 2 are necessary (one for the first 4 elements and one for the "left over" element). With 12 items to be multiplied on each side we need 3 operations instead of 12, with 40 we need 10, and so on.
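The vector multiplication above can be sketched both ways; a rough timing comparison on larger inputs (exact numbers will vary by machine) shows the gap this note describes:

```python
import timeit
import numpy as np

a = np.arange(1, 6)      # [1, 2, 3, 4, 5]
b = np.arange(6, 11)     # [6, 7, 8, 9, 10]

def loop_multiply(x, y):
    # One Python-level multiplication per element pair.
    return [xi * yi for xi, yi in zip(x, y)]

vectorized = a * b       # whole-array multiply in compiled code

# Compare on a larger input; timings are machine-dependent.
big_a = np.arange(1_000_000)
big_b = np.arange(1_000_000)
t_loop = timeit.timeit(lambda: loop_multiply(big_a, big_b), number=1)
t_vec = timeit.timeit(lambda: big_a * big_b, number=1)
print(f"loop: {t_loop:.4f}s, vectorized: {t_vec:.5f}s")
```

Both paths produce the same products; only the per-element dispatch overhead differs.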

(Figure: multiplication, vectorized vs. not vectorized)

Answered By: Ingwersen_erik