Multiprocessing. How to output the computed records of a text file in the same order as they were read?

Question:

I have a text file of about 300GiB in size that has a header followed by data records. Here’s a dummy input.txt:

# header
# etc... (the number of lines in the header can vary) 
record #1
record #2
record #3
record #4
record #5
record #6
record #7
record #8
record #9
...

Given the size of the input file, processing it one line at a time is slow, and the work is CPU-bound, so I decided to add some parallelism to the code.

Here’s a sample of the code (it took me some time to get it right; now it works as expected):

#!/usr/bin/env python
import sys
import multiprocessing

def worker(queue, lock, process_id):

    while True:
        data = queue.get()
        if data is None:
            # sentinel value: no more records to process
            break

        # whatever processing that takes some time
        for x in range(1000000):
            data.split()

        # serialize the writes to stdout across processes
        with lock:
            print(data.rstrip() + " computed by process #" + process_id)
            sys.stdout.flush()

if __name__ == '__main__':

    queue = multiprocessing.Queue()
    lock = multiprocessing.Lock()
    workers = []
    num_processes = 4

    for process_id in range(num_processes):
        p = multiprocessing.Process(target=worker, args=(queue, lock, str(process_id)))
        p.start()
        workers.append(p)

    with open('input.txt') as handler:
        try:
            # read the header
            line = next(handler)
            while line.startswith('#'):
                line = next(handler)

            # send the records to the queue
            while True:
                queue.put(line)
                line = next(handler)

        except StopIteration:
            # end of file: send one sentinel per worker so they all stop
            for p in workers:
                queue.put(None)

        finally:
            for p in workers:
                p.join()

Output:

record #4 computed by process #2
record #3 computed by process #1
record #1 computed by process #0
record #2 computed by process #3
record #5 computed by process #2
record #6 computed by process #1
record #7 computed by process #0
record #8 computed by process #3
record #9 computed by process #2

My problem is that the ordering of the records in the output should be the same as in the input. How can I achieve that efficiently?

Also: the architecture of the code may seem weird (I didn’t find any example that looks like mine), so if there’s another, more standard and efficient way of doing the same thing, it would be great if you could share it.

Asked By: Fravadona


Answers:

pool.imap() seems to do exactly what you want, with a lot less hassle: it hands the lines out to the worker processes as they are read, but yields the results back in input order.

import multiprocessing

def worker(line):
    # do_whatever_to() stands in for the actual per-record computation
    result = do_whatever_to(line)
    return result


def line_iterator():
    '''Yields the lines we need to process, skipping the header.'''
    with open('input.txt') as handler:
        seen_real_line = False
        for line in handler:
            if seen_real_line or not line.startswith('#'):
                yield line
                seen_real_line = True


def main():
    with multiprocessing.Pool(processes=4) as pool:
        # imap() yields results in the same order the lines were submitted
        for result in pool.imap(worker, line_iterator()):
            print(result)


if __name__ == '__main__':
    main()

If you don’t include processes=4, it will use however many CPUs you have on your machine. That’s probably what you want. It won’t create a new process for each line.
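Since preserving the input order is the crux of the question, it may help to see the difference directly. Here is a minimal, self-contained sketch (square() is just a hypothetical stand-in for the real computation) contrasting imap(), which yields results in input order, with imap_unordered(), which yields them as soon as each worker finishes:

import multiprocessing

def square(n):
    return n * n

if __name__ == '__main__':
    with multiprocessing.Pool(processes=4) as pool:
        # imap() always yields the results in input order
        print(list(pool.imap(square, range(8))))
        # imap_unordered() yields them as each one becomes ready
        print(list(pool.imap_unordered(square, range(8))))

imap_unordered() can be a bit faster when the order doesn’t matter, but here the order is the whole point.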

line_iterator() could probably be made more efficient, but that’s not going to be the bottleneck.
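That said, if the per-line round-trips between the main process and the workers ever do become a bottleneck on a 300 GiB file, Pool.imap() accepts a chunksize argument that batches the lines handed to each worker while still returning results in order. A sketch, reusing worker() and line_iterator() from above (1000 is an arbitrary illustrative value, not a tuned one):

import multiprocessing

def main():
    with multiprocessing.Pool(processes=4) as pool:
        # chunksize batches the records sent to each worker, cutting the
        # per-line IPC overhead; results still come back in input order
        for result in pool.imap(worker, line_iterator(), chunksize=1000):
            print(result)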


Update:
I just realized that the heart of line_iterator() could easily be replaced with a simple itertools.dropwhile() call. That is precisely what that function is for! Oh well, it’s not the important part of this answer.
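For completeness, here is what that might look like; a minimal sketch that keeps the same semantics (once a real record has been seen, later lines are yielded even if they start with '#'):

import itertools

def line_iterator():
    '''Yields the data records, skipping the leading header lines.'''
    with open('input.txt') as handler:
        # dropwhile() discards lines while the predicate holds, then yields
        # every remaining line, including later ones that start with '#'
        yield from itertools.dropwhile(lambda line: line.startswith('#'), handler)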

Answered By: Frank Yellin