How to parallelize list-comprehension calculations in Python?

Question:

Both list comprehensions and map() calculations should, at least in theory, be relatively easy to parallelize: each calculation inside a list comprehension can be done independently of the calculation of all the other elements. For example, in the expression

[ x*x for x in range(1000) ]

each x*x calculation could (at least in theory) be done in parallel.

My question is: is there any Python module, Python implementation, or Python programming trick to parallelize a list-comprehension calculation (in order to use all 16/32/… cores, or to distribute the calculation over a computer grid or a cloud)?

Asked By: phynfo


Answers:

There is a comprehensive list of parallel packages for Python here:

http://wiki.python.org/moin/ParallelProcessing

I’m not sure if any handle the splitting of a list comprehension construct directly, but it should be trivial to formulate the same problem in a non-list comprehension way that can be easily forked to a number of different processors. I’m not familiar with cloud computing parallelization, but I’ve had some success with mpi4py on multi-core machines and over clusters. The biggest issue that you’ll have to think about is whether the communication overhead is going to kill any gains you get from parallelizing the problem.
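For what it’s worth, here is a rough mpi4py sketch of that reformulation, applied to the question’s x*x example; the contiguous chunking is just one arbitrary way to decompose the data, and the script name is made up. Launch it with something like mpiexec -n 4 python square_mpi.py:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    data = list(range(1000))
    # contiguous chunks, one per rank, so order is preserved on gather
    n = len(data)
    chunks = [data[i * n // size:(i + 1) * n // size] for i in range(size)]
else:
    chunks = None

chunk = comm.scatter(chunks, root=0)       # each rank receives one chunk
partial = [x * x for x in chunk]           # the original list comprehension, per rank
gathered = comm.gather(partial, root=0)    # rank 0 collects all partial lists

if rank == 0:
    results = [y for part in gathered for y in part]
    print(results[:5], '...', results[-1])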

Edit: The following might also be of interest:

http://www.mblondel.org/journal/2009/11/27/easy-parallelization-with-data-decomposition/

Answered By: JoshAdel

Not within a list comprehension AFAIK.

You could certainly do it with a traditional for loop and the multiprocessing/threading modules.

Answered By: mluebke

No, because a list comprehension itself is a sort of C-optimized macro. If you pull it out and parallelize it, then it’s not a list comprehension any more; it’s just a good old-fashioned MapReduce.

But you can easily parallelize your example. Here’s a good tutorial on using MapReduce with Python’s parallelization library:

http://mikecvet.wordpress.com/2010/07/02/parallel-mapreduce-in-python/

Answered By: Ken Kinder

As Ken said, it can’t, but with 2.6’s multiprocessing module, it’s pretty easy to parallelize computations.

import multiprocessing

try:
    cpus = multiprocessing.cpu_count()
except NotImplementedError:
    cpus = 2   # arbitrary default


def square(n):
    return n * n

if __name__ == '__main__':
    # The __main__ guard is needed on platforms that spawn rather than
    # fork workers (e.g. Windows), so children can import this module safely.
    pool = multiprocessing.Pool(processes=cpus)
    print(pool.map(square, range(1000)))

There are also examples in the documentation that show how to do this using Managers, which should allow for distributed computations as well.
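For illustration, here is a rough sketch of the remote-manager pattern from those docs; the address, port, authkey, and the jobs/results queue names are placeholders of mine, not prescribed by the documentation. Run it with the argument server on one machine and worker on the others:

import sys
import queue
from multiprocessing.managers import BaseManager

class QueueManager(BaseManager):
    pass

if __name__ == '__main__':
    if sys.argv[1] == 'server':
        jobs, results = queue.Queue(), queue.Queue()
        for x in range(1000):
            jobs.put(x)
        # expose the two queues to remote processes
        QueueManager.register('jobs', callable=lambda: jobs)
        QueueManager.register('results', callable=lambda: results)
        manager = QueueManager(address=('', 50000), authkey=b'secret')
        manager.get_server().serve_forever()
    else:  # worker
        QueueManager.register('jobs')
        QueueManager.register('results')
        manager = QueueManager(address=('server-host', 50000), authkey=b'secret')
        manager.connect()
        jobs, results = manager.jobs(), manager.results()
        while True:
            x = jobs.get()          # blocks until a job is available
            results.put(x * x)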

Answered By: Mahmoud Abdelkader

Using the futures.{Thread,Process}PoolExecutor.map(func, *iterables, timeout=None) and futures.as_completed(future_instances, timeout=None) functions from the concurrent.futures package, new in Python 3.2, could help.

It’s also available as a 2.6+ backport.
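For example, a minimal sketch using ProcessPoolExecutor (the square function and wrapping map() in list() are just illustrative choices):

from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

if __name__ == '__main__':
    # max_workers defaults to the machine's core count
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(square, range(1000)))
    print(results[:5])   # [0, 1, 4, 9, 16]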

Answered By: Georges Martin

On automatic parallelisation of list comprehensions

IMHO, effective automatic parallelisation of list comprehensions would be impossible without additional information (such as that provided via directives in OpenMP), or without limiting it to expressions that involve only built-in types/methods.

Unless there is a guarantee that the processing done on each list item has no side effects, there is a possibility that the results will be invalid (or at least different) if done out of order.

# Artificial example
counter = 0

def g(x): # func with side-effect
    global counter
    counter = counter + 1
    return x + counter

vals = [g(i) for i in range(100)] # diff result when not done in order

There is also the issue of task distribution. How should the problem space be decomposed?

If the processing of each element forms a task (~ task farm), then when there are many elements each involving a trivial calculation, the overhead of managing the tasks will swamp the performance gains of parallelisation.

One could also take the data decomposition approach where the problem space is divided equally among the available processes.

The fact that list comprehensions also work with generators makes this slightly tricky; however, this is probably not a show-stopper if the overhead of pre-iterating them is acceptable. Of course, there is also the possibility of generators with side effects, which could change the outcome if subsequent items are prematurely iterated. Very unlikely, but possible.

A bigger concern would be load imbalance across processes. There is no guarantee that each element takes the same amount of time to process, so statically partitioned data may result in one process doing most of the work while the others idle their time away.

Breaking the list down into smaller chunks and handing them out as each child process becomes available is a good compromise; however, a good choice of chunk size is application dependent and hence not doable without more information from the user.
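For illustration, multiprocessing.Pool.map() already exposes this compromise through its chunksize argument; a minimal sketch, where 50 is an arbitrary value:

import multiprocessing

def square(n):
    return n * n

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        # each worker grabs 50 items at a time; the right value is,
        # as noted above, application dependent
        results = pool.map(square, range(1000), chunksize=50)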

Alternatives

As mentioned in several other answers, there are many approaches and parallel computing modules/frameworks to choose from, depending on one’s requirements.

Having used only MPI (in C) and having no experience using Python for parallel processing, I am not in a position to vouch for any (although, upon a quick scan through, multiprocessing, jug, pp and pyro stand out).

If a requirement is to stick as closely as possible to a list comprehension, then jug seems to be the closest match. From the tutorial, distributing tasks across multiple instances can be as simple as:

from glob import glob
from jug.task import Task
from yourmodule import process_data

tasks = [Task(process_data, infile) for infile in glob('*.dat')]

While that does something similar to multiprocessing.Pool.map(), jug can use different backends for synchronising processes and storing intermediate results (redis, filesystem, in-memory), which means the processes can span nodes in a cluster.
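If I read the jug tutorial correctly, you then start one or more workers with jug execute yourscript.py, on the same machine or on several nodes that share the chosen backend, and jug distributes the pending Tasks among them.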

Answered By: Shawn Chin

For shared-memory parallelism, I recommend joblib:

from joblib import delayed, Parallel

def square(x):
    return x * x

NUM_CPUS = 4  # e.g. your number of physical cores; n_jobs=-1 would use all of them
values = Parallel(n_jobs=NUM_CPUS)(delayed(square)(x) for x in range(1000))

Answered By: Fred Foo

As the above answers point out, this is actually pretty hard to do automatically. So I think the question is really how to do it in the easiest way possible. Ideally, a solution wouldn’t require you to know things like “how many cores do I have”. Another property you might want is to still be able to write the list comprehension in a single readable line.

Some of the given answers already seem to have nice properties like this, but another alternative is Ray (docs), which is a framework for writing parallel Python. In Ray, you would do it like this:

import ray

# Start Ray. This creates some processes that can do work in parallel.
ray.init()

# Add this line to signify that the function can be run in parallel (as a
# "task"). Ray will load-balance different `square` tasks automatically.
@ray.remote
def square(x):
    return x * x

# Create some parallel work using a list comprehension, then block until the
# results are ready with `ray.get`.
ray.get([square.remote(x) for x in range(1000)])
Answered By: Stephanie Wang

You can use asyncio (documentation here: https://docs.python.org/3/library/asyncio.html). It is used as a foundation for multiple Python asynchronous frameworks that provide high-performance network and web servers, database connection libraries, distributed task queues, etc. Plus, it has both high-level and low-level APIs to accommodate any kind of problem.

import asyncio
import functools

def background(f):
    def wrapped(*args, **kwargs):
        # run_in_executor() only forwards positional arguments, so keyword
        # arguments are bound with functools.partial
        return asyncio.get_event_loop().run_in_executor(
            None, functools.partial(f, *args, **kwargs))

    return wrapped

@background
def your_function(argument):
    ...  # your code here

Now this function will run in parallel whenever it is called, without putting the main program into a wait state. You can use it to parallelize a for loop as well: the loop itself still runs sequentially, but every iteration starts running in parallel with the main program as soon as the interpreter reaches it. (Note that run_in_executor(None, …) uses the default thread pool, so CPU-bound functions still contend for the GIL; it shines for I/O-bound or sleeping work like the demo below.)

For your specific case, you can do:

import asyncio
import functools
import time


def background(f):
    def wrapped(*args, **kwargs):
        return asyncio.get_event_loop().run_in_executor(
            None, functools.partial(f, *args, **kwargs))
    return wrapped


@background
def op(x):                                           # Do any operation you want
    time.sleep(1)
    print(f"function called for {x=}\n", end='')
    return x*x


loop = asyncio.get_event_loop()                      # Have a new event loop

looper = asyncio.gather(*[op(i) for i in range(20)]) # Gather the futures; 20 items for a better demo

results = loop.run_until_complete(looper)            # Wait until all finish

print('List comprehension has finished and results are gathered!')
print(results)

This produces following output:

function called for x=5
function called for x=4
function called for x=2
function called for x=0
function called for x=6
function called for x=1
function called for x=7
function called for x=3
function called for x=8
function called for x=9
function called for x=10
function called for x=12
function called for x=11
function called for x=15
function called for x=13
function called for x=14
function called for x=16
function called for x=17
function called for x=18
function called for x=19
List comprehension has finished and results are gathered!
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]

Note that all the function calls ran in parallel, hence the shuffled prints; the original order is nevertheless preserved in the resulting list.

Answered By: Hamza