How to parse a large file taking advantage of threading in Python?

Question:

I have a huge file and need to read and process it.

with open(source_filename) as source, open(target_filename) as target:
    for line in source:
        target.write(do_something(line))

    do_something_else()

Can this be accelerated with threads? If I spawn a thread per line, will this have a huge overhead cost?

edit: To keep this question from turning into a discussion: what should the code look like?

with open(source_filename) as source, open(target_filename) as target:
   ?

@Nicoretti: In each iteration I need to read a line of several KB of data.

update 2: the file may be a bz2, so Python may have to wait for unpacking:

$ bzip2 -dc country.osm.bz2 | ./my_script.py
Asked By: culebrón

Answers:

Does the processing stage take a relatively long time, i.e. is it CPU-intensive? If not, then no, you don’t win much by threading or multiprocessing it. If your processing is expensive, then yes. So you need to profile to know for sure.

If you spend relatively more time reading the file than processing it (i.e. it is big), then you can’t gain performance by using threads: the bottleneck is just the I/O, which threads don’t improve.
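
For example, a rough timing sketch like the following (assuming do_something and source_filename from the question) shows how the wall-clock time splits between reading and processing:

# Minimal timing sketch: measure read time and processing time separately.
# `do_something` and `source_filename` are assumed from the question.
import time

read_time = 0.0
process_time = 0.0

with open(source_filename) as source:
    while True:
        start = time.time()
        line = source.readline()
        read_time += time.time() - start
        if not line:
            break
        start = time.time()
        do_something(line)
        process_time += time.time() - start

print("reading: %.2fs, processing: %.2fs" % (read_time, process_time))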

Answered By: rapadura

This is exactly the sort of thing that you should not try to analyse a priori, but should instead profile.

Bear in mind that threading will only help if the per-line processing is heavy. An alternative strategy would be to slurp the whole file into memory, and process it in memory, which may well obviate threading.

Whether you have a thread per line is, once again, something for fine-tuning, but my guess is that unless parsing the lines is pretty heavy, you may want to use a fixed number of worker threads.
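
For example, a fixed-size pool of worker threads could look like the sketch below (assuming do_something, source_filename and target_filename from the question). multiprocessing.dummy.Pool is a pool of threads with the familiar Pool interface, and imap yields results in input order, so the output file stays in order:

# Sketch of a fixed-size pool of worker threads; `do_something`,
# `source_filename` and `target_filename` are assumed from the question.
from multiprocessing.dummy import Pool as ThreadPool

pool = ThreadPool(4)  # the worker count is a tuning knob
with open(source_filename) as source, open(target_filename, 'w') as target:
    # imap() yields results in input order, so the output stays ordered
    for result in pool.imap(do_something, source, chunksize=100):
        target.write(result)
pool.close()
pool.join()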

There is another alternative: spawn sub-processes and have them do the reading and the processing. Given your description of the problem, I would expect this to give you the greatest speed-up. You could even use some sort of in-memory caching system to speed up the reading, such as memcached (or any of the similar-ish systems out there, or even a relational database).
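
One way to sketch the sub-process idea is to keep the reading and writing in the main process and farm the per-line processing out to a pool of worker processes (again assuming do_something, source_filename and target_filename from the question). Unlike threads, this sidesteps the GIL, but do_something must be a picklable, module-level function, and the pool should be created under the __main__ guard:

# Sketch of processing lines in a pool of sub-processes; `do_something`,
# `source_filename` and `target_filename` are assumed from the question.
import multiprocessing

def main():
    pool = multiprocessing.Pool()  # defaults to one worker per CPU core
    with open(source_filename) as source, open(target_filename, 'w') as target:
        # imap() keeps results in input order; chunksize amortizes the
        # inter-process communication overhead over batches of lines
        for result in pool.imap(do_something, source, chunksize=100):
            target.write(result)
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()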

Answered By: Marcin

In CPython, threading is limited by the global interpreter lock — only one thread at a time can actually be executing Python code. So threading only benefits you if either:

  1. you are doing processing that doesn’t require the global interpreter lock; or

  2. you are spending time blocked on I/O.

Examples of (1) include applying a filter to an image in the Python Imaging Library, or finding the eigenvalues of a matrix in numpy. Examples of (2) include waiting for user input, or waiting for a network connection to finish sending data.
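
To see the limitation concretely, a small experiment like this (pure CPU-bound Python, unrelated to the question's file handling) takes roughly as long with two threads as it does sequentially, because only one thread can hold the interpreter lock at a time:

# Pure-Python CPU-bound work does not get faster with threads in CPython,
# because only one thread can execute Python bytecode at a time.
import threading
import time

def busy():
    total = 0
    for i in range(10 ** 7):
        total += i

start = time.time()
busy()
busy()
print("sequential:  %.2fs" % (time.time() - start))

start = time.time()
threads = [threading.Thread(target=busy) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("two threads: %.2fs" % (time.time() - start))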

So whether your code can be accelerated using threads in CPython depends on what exactly you are doing in the do_something call. (But if you are parsing the line in Python then it is very unlikely that you can speed this up by launching threads.) You should also note that if you do start launching threads, you will face a synchronization problem when writing the results to the target file: there is no guarantee that threads will complete in the same order they were started, so you will have to take care to ensure that the output comes out in the right order.

Here’s a maximally threaded implementation that has one thread for reading the input, one for writing the output, and one thread for processing each line. Only testing will tell you whether this is faster or slower than the single-threaded version (or Janne’s version with only three threads).

from threading import Thread
from Queue import Queue

def process_file(f, source_filename, target_filename):
    """
    Apply the function `f` to each line of `source_filename` and write
    the results to `target_filename`. Each call to `f` is evaluated in
    a separate thread.
    """
    worker_queue = Queue()
    finished = object()

    def process(queue, line):
        "Process `line` and put the result on `queue`."
        queue.put(f(line))

    def read():
        """
        Read `source_filename`, create an output queue and a worker
        thread for every line, and put that worker's output queue onto
        `worker_queue`.
        """
        with open(source_filename) as source:
            for line in source:
                queue = Queue()
                Thread(target=process, args=(queue, line)).start()
                worker_queue.put(queue)
        worker_queue.put(finished)

    Thread(target=read).start()
    with open(target_filename, 'w') as target:
        for output in iter(worker_queue.get, finished):
            target.write(output.get())

Answered By: Gareth Rees

You could use three threads: one each for reading, processing and writing. The possible advantage is that the processing can take place while waiting for I/O, but you need to take some timings yourself to see if there is an actual benefit in your situation.

import threading
import Queue

QUEUE_SIZE = 1000
sentinel = object()

def read_file(name, queue):
    with open(name) as f:
        for line in f:
            queue.put(line)
    queue.put(sentinel)

def process(inqueue, outqueue):
    for line in iter(inqueue.get, sentinel):
        outqueue.put(do_something(line))
    outqueue.put(sentinel)

def write_file(name, queue):
    with open(name, "w") as f:
        for line in iter(queue.get, sentinel):
            f.write(line)

inq = Queue.Queue(maxsize=QUEUE_SIZE)
outq = Queue.Queue(maxsize=QUEUE_SIZE)

threading.Thread(target=read_file, args=(source_filename, inq)).start()
threading.Thread(target=process, args=(inq, outq)).start()
write_file(target_filename, outq)

It is a good idea to set a maxsize for the queues to prevent ever-increasing memory consumption. The value of 1000 is an arbitrary choice on my part.

Answered By: Janne Karila