why isn't numpy.mean multithreaded?

Question:

I’ve been looking for ways to easily multithread some of my simple analysis code since I had noticed numpy it was only using one core, despite the fact that it is supposed to be multithreaded.

I know that numpy is configured for multiple cores, since I can see tests using numpy.dot use all my cores, so I just reimplemented mean as a dot product, and it runs way faster. Is there some reason mean can’t run this fast on its own? I find similar behavior for larger arrays, although the ratio is close to 2 than the 3 shown in my example.

I’ve been reading a bunch of posts on similar numpy speed issues, and apparently its way more complicated than I would have thought. Any insight would be helpful, I’d prefer to just use mean since it’s more readable and less code, but I might switch to dot based means.

In [27]: data = numpy.random.rand(10,10)

In [28]: a = numpy.ones(10)

In [29]: %timeit numpy.dot(data,a)/10.0
100000 loops, best of 3: 4.8 us per loop

In [30]: %timeit numpy.mean(data,axis=1)
100000 loops, best of 3: 14.8 us per loop

In [31]: numpy.dot(data,a)/10.0 - numpy.mean(data,axis=1)
Out[31]: 
array([  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   1.11022302e-16,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
        -1.11022302e-16])
Asked By: gggg

||

Answers:

Basically, because the BLAS library has an optimized dot product that they can easily call for dot that is inherently parallel. They admit they could extend numpy to parallelize other operations, but opted not to go that route. However, they give several tips on how to parallelize your numpy code (basically to divide work among N cores (e.g., N=4), split your array into N sub-arrays and send jobs for each sub-array to its own thread and then combine your results).

See http://wiki.scipy.org/ParallelProgramming :

Use parallel primitives

One of the great strengths of numpy is that you can express array operations very cleanly. For example to compute the product of the matrix A and the matrix B, you just do:

>>> C = numpy.dot(A,B)

Not only is this simple and clear to read and write, since numpy knows you want to do a matrix dot product it can use an optimized implementation obtained as part of “BLAS” (the Basic Linear Algebra Subroutines). This will normally be a library carefully tuned to run as fast as possible on your hardware by taking advantage of cache memory and assembler implementation. But many architectures now have a BLAS that also takes advantage of a multicore machine. If your numpy/scipy is compiled using one of these, then dot() will be computed in parallel (if this is faster) without you doing anything. Similarly for other matrix operations, like inversion, singular value decomposition, determinant, and so on. For example, the open source library ATLAS allows compile time selection of the level of parallelism (number of threads). The proprietary MKL library from Intel offers the possibility to chose the level of parallelism at runtime. There is also the GOTO library that allow run-time selection of the level of parallelism. This is a commercial product but the source code is distributed free for academic use.

Finally, scipy/numpy does not parallelize operations like

>>> A = B + C

>>> A = numpy.sin(B)

>>> A = scipy.stats.norm.isf(B)

These operations run sequentially, taking no advantage of multicore machines (but see below). In principle, this could be changed without too much work. OpenMP is an extension to the C language which allows compilers to produce parallelizing code for appropriately-annotated loops (and other things). If someone sat down and annotated a few core loops in numpy (and possibly in scipy), and if one then compiled numpy/scipy with OpenMP turned on, all three of the above would automatically be run in parallel. Of course, in reality one would want to have some runtime control – for example, one might want to turn off automatic parallelization if one were planning to run several jobs on the same multiprocessor machine.

Answered By: dr jimbob

I’ve been looking for ways to easily multithread some of my simple analysis code since I had noticed numpy it was only using one core, despite the fact that it is supposed to be multithreaded.

Who says it’s supposed to be multithreaded?

numpy is primarily designed to be as fast as possible on a single core, and to be as parallelizable as possible if you need to do so. But you still have to parallelize it.

In particular, you can operate on independent sub-objects at the same time, and slow operations release the GIL when possible—although “when possible” may not be nearly enough. Also, numpy objects are designed to be shared or passed between processes as easily as possible, to facilitate using multiprocessing.

There are some specialized methods that are automatically parallelized, but most of the core methods are not. In particular, dot is implemented on top of BLAS when possible, and BLAS is automatically parallelized on most platforms, but mean is implemented in plain C code.

See Parallel Programming with numpy and scipy for details.


So, how do you know which methods are parallelized and which aren’t? And, of those which aren’t, how do you know which ones can be nicely manually-threaded and which need multiprocessing?

There’s no good answer to that. You can make educated guesses (X seems like it’s probably implemented on top of ATLAS, and my copy of ATLAS is implicitly threaded), or you can read the source.

But usually, the best thing to do is try it and test. If the code is using 100% of one core and 0% of the others, add manual threading. If it’s now using 100% of one core and 10% of the others and barely running faster, change the multithreading to multiprocessing. (Fortunately, Python makes this pretty easy, especially if you use the Executor classes from concurrent.futures or the Pool classes from multiprocessing. But you still often need to put some thought into it, and test the relative costs of sharing vs. passing if you have large arrays.)

Also, as kwatford points out, just because some method doesn’t seem to be implicitly parallel doesn’t mean it won’t be parallel in the next version of numpy, or the next version of BLAS, or on a different platform, or even on a machine with slightly different stuff installed on it. So, be prepared to re-test. And do something like my_mean = numpy.mean and then use my_mean everywhere, so you can just change one line to my_mean = pool_threaded_mean.

Answered By: abarnert