Python multiprocessing: sharing large read-only global data without reloading it from disk in child processes

Question:

Say I need to read a large array of data from disk and do some read-only work on it.

I need to use multiprocessing, but sharing the data across processes with multiprocessing.Manager() or Array() is way too slow. Since my operation on this data is read-only, according to this answer, I can declare it in the global scope, so that each child process gets its own copy in memory:

# main.py
import argparse
import numpy as np
import multiprocessing as mp
import time

parser = argparse.ArgumentParser()
parser.add_argument('-p', '--path', type=str)
args = parser.parse_args()
print('loading data from disk... may take a long time...')
global_large_data = np.load(args.path)

def worker(row_id):
    # some read-only work on global_large_data
    time.sleep(0.01)
    print(row_id, np.sum(global_large_data[row_id]))

def main():
    pool = mp.Pool(mp.cpu_count())
    pool.map(worker, range(global_large_data.shape[0]))
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()

And in the terminal,

$ python3 main.py -p /path/to/large_data.npy

This is fast and almost works for me. However, one shortcoming is that each child process reloads the large file from disk, and the loading wastes a lot of time.

Is there any way (e.g., a wrapper) so that only the parent process loads the file from disk once, and then sends a copy directly to each child process's memory?

Note that my memory is abundant; many copies of this data in memory are fine. I just don't want to reload it from disk many times.

Asked By: graphitump

Answers:

I suspect you want to read the section "Contexts and start methods" in the multiprocessing documentation.

A new process is created either by spawning or by forking. If spawned, the child is a completely new Python process and has to re-import and re-load everything it needs to run. If forked, the child starts as a clone of the parent and inherits the parent's memory.

The documentation describes which start method is the default on your OS (you didn't specify), how to change it, and which methods are available. If you can manage to use 'fork' on your machine, then once you've read the file in the parent, it will already be in every child process.
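
For example, here is a minimal sketch of your code using an explicit 'fork' context. Note the assumption: this only works on POSIX systems; 'fork' is unavailable on Windows, and 'spawn' has been the default on macOS since Python 3.8.

# fork_main.py -- load once in the parent, inherit via fork (POSIX only)
import multiprocessing as mp
import numpy as np
import time

# Loaded once here; forked children inherit this memory as-is,
# so the module-level load does not run again in the workers.
global_large_data = np.load('/path/to/large_data.npy')  # illustrative path

def worker(row_id):
    # read-only access to the inherited array
    time.sleep(0.01)
    print(row_id, np.sum(global_large_data[row_id]))

def main():
    # get_context('fork') requests forking for this pool only,
    # without changing the process-wide default start method.
    ctx = mp.get_context('fork')
    with ctx.Pool(mp.cpu_count()) as pool:
        pool.map(worker, range(global_large_data.shape[0]))

if __name__ == '__main__':
    main()

Alternatively, calling mp.set_start_method('fork') once at the top of the if __name__ == '__main__' block changes the default for the whole program.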

If you cannot use 'fork', then what you're looking for is very difficult. As the documentation says, every child process starts anew.

You are correct that you do not want to use a managed array. That means that all requests for data are routed through the main process, which then replies with the requested bytes. Yes, very slow.

You might consider looking at mmap. In this case, each process maps the file and reads only the pages it actually touches rather than the whole thing. The file itself is still on disk, but thanks to the OS page cache, pages already read by one process are typically served from memory for the others.
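
For a .npy file like the one in your question, numpy exposes this directly through np.load's mmap_mode argument. A minimal sketch follows; the PATH constant and the per-process initializer are illustrative, not from your code:

# mmap_main.py -- memory-map the array instead of loading it fully
import multiprocessing as mp
import numpy as np
import time

PATH = '/path/to/large_data.npy'  # illustrative path

data = None  # set in each worker process by the initializer

def init_worker():
    global data
    # mmap_mode='r' maps the file read-only; pages are read lazily
    # on first access instead of all at once.
    data = np.load(PATH, mmap_mode='r')

def worker(row_id):
    time.sleep(0.01)
    print(row_id, np.sum(data[row_id]))

def main():
    # mapping just to read the shape is cheap; no bulk data is loaded
    n_rows = np.load(PATH, mmap_mode='r').shape[0]
    with mp.Pool(mp.cpu_count(), initializer=init_worker) as pool:
        pool.map(worker, range(n_rows))

if __name__ == '__main__':
    main()

This works under both 'spawn' and 'fork', since each worker maps the file itself rather than relying on inherited memory.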

Answered By: Frank Yellin