Pandas dataframe applymap parallel execution

Question:

I have the following functions to apply a bunch of regexes to each element in a dataframe. The dataframe that I am applying the regexes to is a 5MB chunk.

import re
from functools import partial

def apply_all_regexes(data, regexes):
    # Apply every regex to each cell of the dataframe
    new_df = data.applymap(
        partial(apply_re_to_cell, regexes))
    return new_df

def apply_re_to_cell(regexes, cell):
    cell = str(cell)
    regex_matches = []
    for regex in regexes:
        regex_matches.extend(re.findall(regex, cell))
    return regex_matches

Because applymap executes serially, the processing time is roughly (number of elements) * (time to run all the regexes serially on one element). Is there any way to run this in parallel? I tried ProcessPoolExecutor, but that appeared to take longer than executing serially.

Asked By: Sushim Mukul Dutta

||

Answers:

Have you tried splitting your one big dataframe into as many small dataframes as there are threads, applying the regex map to them in parallel, and then sticking the small dataframes back together?

I was able to do something similar with a gene expression dataframe.
I would run it on a small scale first and check that you get the expected output.

Unfortunately I don’t have enough reputation to comment

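from multiprocessing import Pool, cpu_count

import numpy as np
import pandas as pd

# Assumed values; the original snippet did not define these two names
num_cores = cpu_count()       # number of worker processes
num_partitions = num_cores    # number of chunks to split the dataframe into

def parallelize_dataframe(df, func):
    # func takes one dataframe chunk and returns the processed chunk
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    for x in df_split:
        print(x.shape)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df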

This is the general function I used
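
To connect this to the question, here is a rough sketch of how the regex functions could be plugged into the same split, map, concat pattern. It is only an illustration under a few assumptions: the names split_rows, parallel_regexes, n_workers and executor_cls are made up for this example, the regexes are compiled once up front, and the worker is built with functools.partial from module-level functions so that it can be pickled and sent to worker processes. It uses Series.map per column instead of applymap and plain iloc slicing instead of np.array_split, which are my substitutions, not part of the original code.

```python
import re
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from functools import partial

import pandas as pd


def apply_re_to_cell(regexes, cell):
    """Return every match of every compiled regex in the cell's string form."""
    cell = str(cell)
    matches = []
    for regex in regexes:
        matches.extend(regex.findall(cell))
    return matches


def apply_all_regexes(regexes, chunk):
    """Apply all regexes to each cell of one dataframe chunk."""
    cell_func = partial(apply_re_to_cell, regexes)
    return chunk.apply(lambda col: col.map(cell_func))


def split_rows(df, n_chunks):
    """Hypothetical helper: split df into at most n_chunks contiguous row blocks."""
    if len(df) == 0:
        return [df]
    n_chunks = max(1, min(n_chunks, len(df)))
    size = -(-len(df) // n_chunks)  # ceiling division
    return [df.iloc[i:i + size] for i in range(0, len(df), size)]


def parallel_regexes(df, patterns, n_workers=4,
                     executor_cls=ProcessPoolExecutor):
    """Hypothetical wrapper: run the regexes over row chunks in parallel.

    Pass executor_cls=ThreadPoolExecutor for a quick in-process check.
    """
    regexes = [re.compile(p) for p in patterns]   # compile once, not per cell
    worker = partial(apply_all_regexes, regexes)  # picklable for processes
    chunks = split_rows(df, n_workers)
    with executor_cls(max_workers=n_workers) as executor:
        results = list(executor.map(worker, chunks))
    return pd.concat(results)
```

When running this as a script with the process-based executor, the call to parallel_regexes should sit under an `if __name__ == "__main__":` guard. For a dataframe as small as 5MB, the cost of starting processes and pickling the chunks may well outweigh the gain, so it is worth timing it against the serial version on your own data.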

Answered By: Mibi

A bit more modern version:

from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

tqdm.pandas()  # registers progress_applymap on dataframes

def parallel_applymap(df, func, worker_count):
    def _apply(shard):
        return shard.progress_applymap(func)

    shards = np.array_split(df, worker_count)
    with ThreadPoolExecutor(max_workers=worker_count) as e:
        # e.map yields the processed shards in input order
        futures = e.map(_apply, shards)
    return pd.concat(list(futures))

Answered By: gebbissimo