Appending to a list with multithreading ThreadPoolExecutor and map

Question:

I have the following code

import random
import csv
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import pandas as pd

def generate_username(id, job_location):
    number = "{:03d}".format(random.randrange(1, 999))
    return "".join([id, job_location.strip(), str(number)])

def append_to_list(l, idx, job_location):
    l.append([idx, generate_username(str(idx), job_location)])

def generate_csv(filepath, df):
    rows = [["EMP_ID", "username"]]
    ids, locations = df.EMP_ID, df["Job Location"]
    for idx, location in zip(ids, locations):
        rows.append([idx, generate_username(str(idx), location)])
    with open(filepath, 'w') as file:
        writer = csv.writer(file)
        writer.writerows(rows)

And this is the multithreading implementation

def generate_csv_threads(filepath, df, n):
    rows = [["EMP_ID", "username"]]
    ids, locations = df.EMP_ID, df["Job Location"]

    with ThreadPoolExecutor(max_workers=n) as executor:
        executor.map(append_to_list, rows, ids, locations)
        executor.shutdown(wait=True)
        
    with open(filepath, 'w') as file:
        writer = csv.writer(file)
        writer.writerows(rows)

I have several questions regarding this. I saw that append is thread-safe, so I would not need a lock. However, the generated CSV contains the following:

[['EMP_ID', 'username', [234687, '234687Oregon696']]]

(I have more than one user to generate)

Asked By: Norhther

Answers:

If generate_username is a very fast, CPU-bound operation (as it seems to be), then you won't get any benefit from multithreading in Python. Worse, it will likely make the program slower and introduce subtle concurrency issues.

The list.append() method is thread-safe in CPython (the most common Python implementation) because of the GIL (Global Interpreter Lock), but you generally should not rely on this: the GIL may be removed in a future version, and it does not exist in some other Python implementations.
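If you would rather not depend on that detail, here is a minimal sketch of a variant of your append_to_list that guards the shared list with an explicit lock (the rows_lock name is just for illustration):

import threading

rows_lock = threading.Lock()

def append_to_list(l, idx, job_location):
    # Build the username outside the lock; hold the lock only for
    # the shared-list mutation itself.
    username = generate_username(str(idx), job_location)
    with rows_lock:
        l.append([idx, username])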

Regarding the malformed output: executor.map(append_to_list, rows, ids, locations) zips its iterables together, so rows[0] (the header list) is passed as the l argument of a single call, and map stops after that one element because rows is the shortest iterable. That is why the generated user ends up nested inside the header row. A cleaner approach is to have the workers return values and collect them in the main thread, for instance with executor.submit() and the concurrent.futures.as_completed() function, which yields each future as soon as its task finishes.
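A minimal sketch of that pattern, reusing generate_username and the df from the question (function and file names here are just for illustration):

import csv
from concurrent.futures import ThreadPoolExecutor, as_completed

def generate_csv_threads(filepath, df, n):
    rows = [["EMP_ID", "username"]]
    ids, locations = df.EMP_ID, df["Job Location"]

    with ThreadPoolExecutor(max_workers=n) as executor:
        # One task per employee; each future holds its own result,
        # so no shared list is mutated from the worker threads.
        futures = {
            executor.submit(generate_username, str(idx), location): idx
            for idx, location in zip(ids, locations)
        }
        # as_completed() yields each future as soon as it finishes
        # (completion order, not submission order).
        for future in as_completed(futures):
            rows.append([futures[future], future.result()])

    with open(filepath, "w", newline="") as file:
        csv.writer(file).writerows(rows)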

The first thing to do before reaching for multithreading or multiprocessing is to benchmark your program. Only choose multithreading or multiprocessing if you have clearly identified parts that can be accelerated by such tools.
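For example, a rough timing comparison along these lines (assuming the sequential generate_csv from the question and a fixed generate_csv_threads as sketched above):

import time

start = time.perf_counter()
generate_csv("out_sequential.csv", df)
print(f"sequential: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
generate_csv_threads("out_threaded.csv", df, 4)
print(f"threaded:   {time.perf_counter() - start:.3f}s")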

Answered By: Louis Lac