Why is my parallel code slower than the sequential version?

Question:

I am trying to implement an online recursive algorithm which is highly parallelizable. My problem is that my Python implementation does not perform the way I want. I have two 2D matrices, and I want to update every column of each recursively whenever a new observation arrives at time-step t.
My parallel code looks like this:

def apply_async(t):
    worker = mp.Pool(processes=4)
    for i in range(4):
        X[:,i,np.newaxis], b[:,i,np.newaxis] = worker.apply_async(OULtraining, args=(train[t,i], X[:,i,np.newaxis], b[:,i,np.newaxis])).get()


    worker.close()
    worker.join()      




for t in range(p,T):
    count = 0 
    for l in range(p):
        for k in range(4):
            gn[count]=train[t-l-1,k]
            count+=1
    G = G*v + gn @ gn.T
    Gt = (1/(t-p+1))*G

    if __name__ == '__main__':
        apply_async(t)

The two matrices are X and b. I want the updates written directly into the master process's memory, since each worker recursively updates only one specific column of each matrix.

Why is this implementation slower than the sequential one?

Is there any way to reuse the worker processes at every time-step, rather than killing them and creating them again? Could this be the reason it is slower?

Asked By: Bekromoularo


Answers:

The reason is that your program is, in practice, sequential. Here is an example code snippet that, from a parallelism standpoint, is identical to yours:

from multiprocessing import Pool
from time import sleep

def gwork(qq):
    print(qq)
    sleep(1)
    return 42

p = Pool(processes=4)

for q in range(1, 10):
    p.apply_async(gwork, args=(q,)).get()
p.close()
p.join()

Run this and you will notice the numbers 1-9 appearing exactly one per second. Why is that? The reason is your .get(). It means that every call to apply_async will in practice block in get() until a result is available. It submits one task, waits a second while the processing delay is emulated, and then returns the result; only after that is another task submitted to your pool. This means there is no parallel execution going on at all.

Try replacing the pool management part with this:

results = []
for q in range(1, 10):
    res = p.apply_async(gwork, args=(q,))
    results.append(res)
p.close()
p.join()
for r in results:
    print(r.get())

You can now see parallelism at work, as four of your tasks are processed simultaneously. Your submission loop no longer blocks in get(), because get() has been moved out of the loop and results are collected only after all the tasks have been submitted.
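
If you want to see the difference in numbers, a rough timing sketch along these lines (the sleep-based worker and the expected timings are illustrative, not from the original post) should take roughly nine seconds for the blocking variant and roughly three seconds for the non-blocking one with four workers:

from multiprocessing import Pool
from time import perf_counter, sleep

def gwork(qq):
    sleep(1)          # emulate one second of processing
    return qq

if __name__ == '__main__':
    # Blocking variant: calling .get() right after each submit serializes the work.
    p = Pool(processes=4)
    start = perf_counter()
    blocking = [p.apply_async(gwork, args=(q,)).get() for q in range(1, 10)]
    print(f"blocking:     {perf_counter() - start:.1f}s")   # roughly 9 s
    p.close()
    p.join()

    # Non-blocking variant: submit everything first, collect results afterwards.
    p = Pool(processes=4)
    start = perf_counter()
    handles = [p.apply_async(gwork, args=(q,)) for q in range(1, 10)]
    parallel = [h.get() for h in handles]
    print(f"non-blocking: {perf_counter() - start:.1f}s")   # roughly 3 s with 4 workers
    p.close()
    p.join()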

NB: If the arguments passed to your workers or the return values coming back from them are large data structures, you will lose some performance. In practice Python implements these transfers as queues, and transmitting a lot of data through a queue is slow in relative terms compared with the in-memory copy of a data structure that a subprocess gets when it is forked.
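
For the asker's specific wish of writing each column straight back into the master's matrices without shipping the arrays around, one commonly used option (not part of the original answer) is multiprocessing.shared_memory (Python 3.8+): the master and the workers map the same buffers, each worker writes only its own column in place, and only a small column index and the new observation travel through the queue. A rough sketch, with placeholder shapes and a dummy update rule standing in for OULtraining:

import numpy as np
from multiprocessing import Pool, shared_memory

SHAPE = (8, 4)   # placeholder size for X and b; the real shapes come from the model

def init_worker(x_name, b_name):
    # Each worker attaches once to the shared blocks created by the master.
    global X, b, _shms
    _shms = [shared_memory.SharedMemory(name=x_name),
             shared_memory.SharedMemory(name=b_name)]
    X = np.ndarray(SHAPE, dtype=np.float64, buffer=_shms[0].buf)
    b = np.ndarray(SHAPE, dtype=np.float64, buffer=_shms[1].buf)

def update_column(i, obs):
    # Stand-in for OULtraining: recursively update column i in place.
    X[:, i] += obs            # dummy update rule, for illustration only
    b[:, i] += 0.5 * obs

if __name__ == '__main__':
    nbytes = int(np.prod(SHAPE)) * 8
    x_shm = shared_memory.SharedMemory(create=True, size=nbytes)
    b_shm = shared_memory.SharedMemory(create=True, size=nbytes)
    X = np.ndarray(SHAPE, dtype=np.float64, buffer=x_shm.buf); X[:] = 0.0
    b = np.ndarray(SHAPE, dtype=np.float64, buffer=b_shm.buf); b[:] = 0.0
    train = np.random.rand(100, 4)
    p_lags, T = 2, 100

    # The pool is created once and reused for every time-step.
    pool = Pool(processes=4, initializer=init_worker,
                initargs=(x_shm.name, b_shm.name))
    for t in range(p_lags, T):
        handles = [pool.apply_async(update_column, args=(i, train[t, i]))
                   for i in range(4)]
        for h in handles:
            h.get()           # wait for all four columns before the next step
    pool.close()
    pool.join()

    print(X[:, 0])            # the master sees the workers' in-place updates
    for shm in (x_shm, b_shm):
        shm.close()
        shm.unlink()

Because the update is recursive in t, each step still has to wait for all four columns before moving on, so the achievable speedup is bounded by the cost of a single column update per step.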

Answered By: Hannu

I kept having a problem implementing Hannu's code:

results = []
for q in range(1, 10):
    res = p.apply_async(gwork, args=(q,))
    results.append(res)
p.close()
p.join()
for r in results:
    print(r.get())

The problem is that when the loop hits the first r.get() that raises an exception, the whole program exits because the exception is not handled. I have seen this method posted many times in almost the same form, but it always resulted in the same problem.
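
For instance, a minimal reproduction of that failure mode (the worker and the failing input are made up for illustration):

from multiprocessing import Pool

def gwork(qq):
    if qq == 5:
        raise ValueError(f"bad input: {qq}")   # one task fails on purpose
    return qq * 2

if __name__ == '__main__':
    p = Pool(processes=4)
    results = [p.apply_async(gwork, args=(q,)) for q in range(1, 10)]
    p.close()
    p.join()
    for r in results:
        print(r.get())   # .get() re-raises the worker's ValueError and the loop dies here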

I ended up wrapping r.get() in a try/except block, and that allowed the program to handle all the exceptions in the list and continue on as designed.

from multiprocessing.pool import Pool
import logging
import traceback

logger = logging.getLogger(__name__)

results = []
pool = Pool(32)
results.append(pool.apply_async(TSMDataProcessor().process_schedule_data, args=("Schedule",)))
# a bunch more calls like the one above, but to different methods of the same class
pool.close()
pool.join()

for r in results:
    try:
        r.get()
    except BaseException:
        logger.error(f"data processor exception: {traceback.format_exc()}")

Answered By: Morgan Mains