Which multiprocessing method: map or apply_async?

Question:

I have a function:

def movingWinStretch(u0,u1):
     # u0,u1 are 1D arrays
     # do a bunch of stuff to u0 and u1 

     return C , epsArray, tSamp

When I run this workflow on smaller amounts of data, I use a couple of nested for loops to step through the data matrices and get the inputs u0, u1. I then append the outputs C, epsArray, and tSamp to lists after each call to movingWinStretch. That looks something like:

cArray =[]
dtArray = []
tArray = []
for i in range(1,seisdata.shape[0]):
    for j in range(seisdata.shape[3]):

        u0 = seisdata[0,-1,:,j]
        u1 = seisdata[i,-1,:,j]
        
        C, dtot, tSamp = movingWinStretch(u0, u1)

        cArray.append(C)
        dtArray.append(dtot)
        tArray.append(tSamp)

Now I need to do this on a much larger amount of data and would like to get speed up from the mp package if possible. I’ve written an iterator:

def traceIterator(seisdatarray):
    for i in range(1,seisdatarray.shape[0]):
        for j in range(seisdatarray.shape[3]):
            u0 = seisdatarray[0,-1,:,j] 
            u1 = seisdatarray[i,-1,:,j]
            yield u0, u1

that yields the input to my function.

I’ve used the multiprocessing package once or twice and thought I would try something like

num_proc = 8
pool = mp.Pool(processes = num_proc)
proc = [pool.apply_async(movingWinStretch, args=(u0, u1)) for u0, u1 in traceIterator(seisdata)]
results = [p.get() for p in proc]

My issue is how do I do the append step for each call to movingWinStretch now? Additionally, I don’t think apply_async is the correct method to use. Perhaps map or starmap would be better choices since I have multiple inputs?

Asked By: magmadaddy

Answers:

Yes, you can use map or starmap instead of apply_async. apply_async is used when you want to submit a single function call as a background task and get the results later using get(). On the other hand, map and starmap are used when you want to apply a function to a collection of arguments in parallel, and get the results as a list.
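As a rough illustration of that difference, the sketch below runs the same toy function both ways. Here add and pairs are hypothetical stand-ins for movingWinStretch and its inputs, not the asker's real code.

```python
from multiprocessing import Pool


def add(u0, u1):
    # Hypothetical stand-in for movingWinStretch
    return u0 + u1


def main():
    pairs = [(1, 2), (3, 4), (5, 6)]
    with Pool(2) as pool:
        # apply_async: one task per call; each call returns an AsyncResult
        handles = [pool.apply_async(add, args=pair) for pair in pairs]
        # get() blocks until that particular task has finished
        via_apply = [h.get() for h in handles]
        # starmap: one call for the whole collection; returns a list
        via_starmap = pool.starmap(add, pairs)
    return via_apply, via_starmap


if __name__ == "__main__":
    print(main())
```

Both variants produce the same values here; starmap just does the bookkeeping of submitting and collecting for you.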

With starmap you do not need to change movingWinStretch at all: starmap unpacks each tuple in the input into separate arguments. Only map would require a function that takes a single argument (for example, one tuple of inputs). You can apply the function to a list of input tuples like this:

inputs = [(u0, u1) for u0, u1 in traceIterator(seisdata)]
results = pool.starmap(movingWinStretch, inputs)

This will give you a list of tuples, where each tuple contains the outputs of the movingWinStretch function for a given pair of inputs.
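To make the shape of that result concrete, here is a minimal sketch with toy numbers instead of arrays. The names stretch, stretch_packed and pairs are hypothetical. It also shows that starmap unpacks each tuple into separate arguments, whereas map passes each tuple as a single argument and so needs a small wrapper.

```python
from multiprocessing import Pool


def stretch(u0, u1):
    # Hypothetical stand-in returning three outputs, like movingWinStretch
    return u0 + u1, u0 - u1, u0 * u1


def stretch_packed(pair):
    # Wrapper needed only for Pool.map, which passes one argument per item
    return stretch(*pair)


def main():
    pairs = [(1, 2), (3, 4)]
    with Pool(2) as pool:
        from_starmap = pool.starmap(stretch, pairs)
        from_map = pool.map(stretch_packed, pairs)
    return from_starmap, from_map


if __name__ == "__main__":
    from_starmap, from_map = main()
    print(from_starmap)  # one 3-tuple of outputs per input pair
    print(from_map)
```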

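If you want to track which input pair each result belongs to, you can modify movingWinStretch to accept the index of the input pair and return it along with the outputs. The input list must then hold (idx, u0, u1) tuples, which you can build with enumerate. The function would look like this: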

def movingWinStretch(idx, u0, u1):
     # do a bunch of stuff to u0 and u1 
     C, epsArray, tSamp = ...

     return idx, C, epsArray, tSamp

You can then modify the loop that processes the results to append the outputs to the appropriate lists using the index:

cArray = []
dtArray = []
tArray = []
for idx, C, dtot, tSamp in results:
    cArray.append((idx, C))
    dtArray.append((idx, dtot))
    tArray.append((idx, tSamp))

# sort the lists by the index to restore the original order
cArray.sort()
dtArray.sort()
tArray.sort()

# extract the outputs from the sorted lists
cArray = [C for idx, C in cArray]
dtArray = [dtot for idx, dtot in dtArray]
tArray = [tSamp for idx, tSamp in tArray]

Note that Pool.map and Pool.starmap return results in the same order as the inputs, so the index and the sort are not needed with them. They only become necessary with methods that hand back results in completion order, such as Pool.imap_unordered.
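A minimal sketch of the index-and-sort pattern in a setting where it does matter: Pool.imap_unordered yields results as workers finish them, so the index is what restores the input order. The names work and tasks are hypothetical, and imap_unordered is used here for illustration; it is not what the code above calls.

```python
from multiprocessing import Pool


def work(task):
    # Hypothetical worker: carries the index through alongside the result
    idx, u0, u1 = task
    return idx, u0 + u1


def main():
    tasks = [(idx, u0, u0 + 1) for idx, u0 in enumerate(range(5))]
    with Pool(2) as pool:
        # Results may arrive in any order, depending on which worker finishes first
        unordered = list(pool.imap_unordered(work, tasks))
    # Sorting by the leading index restores the input order
    return [value for _, value in sorted(unordered)]


if __name__ == "__main__":
    print(main())
```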

Answered By: South Sponge

Pool.starmap returns its results in the same order as the inputs, so there’s no need for artificial indices and sorting. You could also let zip do the extraction into the 3 output lists, something like the following:

from multiprocessing import Pool

# Your movingWinStretch
def foo(u0, u1):
    return u0, u1, u0 + u1

# Your traceIterator
def arguments(n, m):
    for u0 in range(n):
        for u1 in range(m):
            yield u0, u1

if __name__ == "__main__":    
    num_proc = 8
    # a, b, c correspond to your cArray, dtArray, tArray
    with Pool(num_proc) as pool:
        a, b, c = zip(*pool.starmap(foo, arguments(2, 3)))
    print(f"{a = }, {b = }, {c = }")

Result here:

a = (0, 0, 0, 1, 1, 1), b = (0, 1, 2, 0, 1, 2), c = (0, 1, 2, 1, 2, 3)

If you need lists, then do

    ...
        a, b, c = map(list, zip(*pool.starmap(foo, arguments(2, 3))))

instead (or use a comprehension if you don’t like map).
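For completeness, a sketch of the comprehension variant, reusing the same toy foo and arguments; the main wrapper and the name results are only there to keep the example self-contained.

```python
from multiprocessing import Pool


def foo(u0, u1):
    # Toy stand-in for movingWinStretch
    return u0, u1, u0 + u1


def arguments(n, m):
    # Toy stand-in for traceIterator
    for u0 in range(n):
        for u1 in range(m):
            yield u0, u1


def main():
    with Pool(2) as pool:
        results = pool.starmap(foo, arguments(2, 3))
    # zip(*results) transposes the list of 3-tuples into 3 columns
    a, b, c = [list(column) for column in zip(*results)]
    return a, b, c


if __name__ == "__main__":
    a, b, c = main()
    print(f"{a = }, {b = }, {c = }")
```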

Answered By: Timus