How to combine parallelization and concurrency in a single python library?

Question:

I want to speed up my code by combining concurrency and parallelization within one Python library.

My current setup runs in a single thread:

- input a, b, c
- downloading data into a list: data = [downloadedData(a), downloadedData(b), downloadedData(c)]
- transforming the data one item at a time. The output of the previous transformation is needed as input for the next one, but the order of the items doesn't matter:
result = transformer(data[0], None)
result = transformer(data[1], result)
result = transformer(data[2], result)

So as you can see, I can download the data in parallel and in any order; how long each download takes depends on the input (a, b, c).
The transformations can also be done in any order, but only one at a time.
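
For reference, here is a minimal runnable sketch of that single-threaded flow (downloadedData and transformer are just placeholders simulating my real download and transformation code):

from time import sleep


def downloadedData(source: str) -> str:
    # Placeholder: pretend to download data for the given input
    sleep(0.5)
    return f"data({source})"


def transformer(data: str, previous):
    # Placeholder: each transformation takes the previous result as input
    sleep(0.5)
    return f"transformed({data}, prev={previous})"


# Everything runs sequentially: first all downloads, then the
# transformations one at a time, chaining the previous result.
data = [downloadedData(x) for x in ("a", "b", "c")]

result = None
for item in data:
    result = transformer(item, result)

print(result)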

So ideally, I would like to start transforming the first piece of downloaded data even while the rest of the data is still downloading.

So I can find a way to make multiple requests, and a way to do the transformations one by one, but I failed to combine them within a single Python library (concurrent and subprocesses).

Asked By: Krzysztof C


Answers:

First, there is no such thing as "combining parallelisation and concurrency", because parallelism is already a form of concurrency. Concurrency roughly means that execution can be out of order and interleaved with other computations, while parallelism means that multiple executions can happen at the same time (they overlap in time, and are therefore also concurrent).

What you likely want is:

  • downloads can be executed in parallel,
  • transformations must be executed sequentially (not in parallel).

To achieve this, you can use a thread pool:

from concurrent.futures import ThreadPoolExecutor
from time import sleep


def download_data(url: str) -> str:
    print(f"Downloading {url}")
    sleep(0.5)
    return url


def transform(data: str) -> str:
    print(f"Transforming {data}")
    sleep(0.5)
    return data


def main():
    urls = ["foo", "bar", "baz"]

    with ThreadPoolExecutor() as pool:
        # Schedule all `download_data` calls to run concurrently
        downloads = pool.map(download_data, urls)
        results = []

        # Consume the results and execute `transform` one at a time
        for dl in downloads:
            result = transform(dl)
            results.append(result)

        print(f"Output: {results}")


if __name__ == "__main__":
    main()

Here, the .map() call schedules the execution of download_data in parallel for each item of urls. It does not wait for the downloads to finish; instead, it returns an iterator (named downloads here) that yields the downloaded results one by one, in the order the URLs were submitted, as each becomes available. The results are consumed sequentially in the main thread, so the transformations are not executed concurrently.

If you execute this code, you’ll notice that the three "Downloading {url}" messages are printed right away, while the "Transforming {data}" messages are printed one at a time.
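As an aside, Executor.map yields results in the order the tasks were submitted, so if the first download happens to be the slowest, the transformations cannot start until it finishes. If you really want to transform whichever download finishes first, a possible sketch using concurrent.futures.as_completed (reusing the same download_data and transform functions defined above) looks like this:

from concurrent.futures import ThreadPoolExecutor, as_completed


def main():
    urls = ["foo", "bar", "baz"]
    results = []

    with ThreadPoolExecutor() as pool:
        # Submit every download and keep the Future objects
        futures = [pool.submit(download_data, url) for url in urls]

        # as_completed yields each future as soon as it finishes,
        # regardless of submission order, so a slow download never
        # blocks the transformation of the ones that are already done.
        for future in as_completed(futures):
            results.append(transform(future.result()))

    print(f"Output: {results}")

The transformations still run one at a time, because they are executed sequentially in the main thread that consumes the as_completed iterator.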

Answered By: Louis Lac