Parallelization of un-bzipping millions of files
Question:
I have millions of compressed .bz2 files which I need to uncompress.
Can the decompression be parallelized? I have access to a server with many CPU cores for this purpose.
I worked with the following code, which is correct but extremely slow.
import glob, bz2, shutil

files = glob.glob("/data01/*.bz2")
for fi in files:
    fo = fi[:-4]  # strip the ".bz2" suffix
    with bz2.BZ2File(fi) as fr, open(fo, "wb") as fw:
        shutil.copyfileobj(fr, fw)
Answers:
What you are probably looking for is async.
Refer to this other question, which has almost the same context as yours:
asynchronous data extraction from an archive.
If your files are already on the server, you can skip the download part and use just the part that opens and extracts the file.
Another alternative is async-unzip: https://pypi.org/project/async-unzip/
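Since bz2 decompression is itself blocking, a minimal sketch of the async idea for local files (an assumption here, not code from the linked question) is to push each blocking call onto a thread via asyncio.to_thread, which requires Python 3.9+; the glob pattern is taken from the question:

```python
import asyncio
import bz2
import glob
import shutil

def decompress(path):
    # Blocking bz2 decompression; strips the ".bz2" suffix for the output name.
    with bz2.BZ2File(path) as fr, open(path[:-4], "wb") as fw:
        shutil.copyfileobj(fr, fw)

async def main():
    # asyncio.to_thread runs each blocking call in the default thread
    # pool, so many decompressions can be in flight at once.
    paths = glob.glob("/data01/*.bz2")
    await asyncio.gather(*(asyncio.to_thread(decompress, p) for p in paths))

if __name__ == "__main__":
    asyncio.run(main())
```

This keeps the event loop free, but the actual parallelism still comes from threads, not from asyncio itself.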
You can use the following code:
import pathlib
import itertools
import bz2
import shutil
import multiprocessing as mp
def extract(files):
    for filename in files:
        print(f'Processing {filename}')
        # Parenthesised with-statements need Python 3.10+.
        # with_suffix('') keeps the directory part, unlike .stem,
        # so the output lands next to the input file.
        with (bz2.BZ2File(filename) as fr,
              open(filename.with_suffix(''), 'wb') as fw):
            shutil.copyfileobj(fr, fw)
# Backport of itertools.batched (built in from Python 3.12);
# the walrus operator needs Python 3.8+
def batched(iterable, n):
    it = iter(iterable)
    while (batch := tuple(itertools.islice(it, n))):
        yield batch
if __name__ == '__main__':
    num_cpus = mp.cpu_count() - 1  # leave one CPU free
    files = pathlib.Path('data').glob('*.bz2')
    with mp.Pool(num_cpus) as pool:
        pool.map(extract, batched(files, num_cpus))
Instead of submitting one task per file, we split the list of files into batches of "num_cpus" files each and process a whole batch per task, which reduces per-file handover between processes. Since every worker stays fully busy, I leave one processor free for the rest of the system.
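The batched helper can be checked in isolation; note how the final batch is shorter when the total is not a multiple of n:

```python
import itertools

def batched(iterable, n):
    # Yield successive tuples of up to n items (backport of
    # itertools.batched, which is built in from Python 3.12).
    it = iter(iterable)
    while (batch := tuple(itertools.islice(it, n))):
        yield batch

print(list(batched(range(7), 3)))  # [(0, 1, 2), (3, 4, 5), (6,)]
```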
Multithreading would be ideal for this: the work is partly I/O-bound, and CPython's bz2 module releases the GIL while decompressing, so threads can decompress in parallel without the overhead of spawning processes.
from concurrent.futures import ThreadPoolExecutor
import glob
import bz2
import shutil
def process(filename):
    with bz2.BZ2File(filename) as fr, open(filename[:-4], "wb") as fw:
        shutil.copyfileobj(fr, fw)

def main():
    with ThreadPoolExecutor() as tpe:
        tpe.map(process, glob.glob('/data01/*.bz2'))

if __name__ == '__main__':
    main()
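One caveat with Executor.map: exceptions raised inside process only surface when the results are iterated, and the code above never iterates them, so a corrupt archive would fail silently. A hedged variant (the failure-collecting logic is an addition, not part of the answer above) uses submit and as_completed to report failures per file:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import bz2
import glob
import shutil

def process(filename):
    with bz2.BZ2File(filename) as fr, open(filename[:-4], "wb") as fw:
        shutil.copyfileobj(fr, fw)

def main():
    failures = []
    with ThreadPoolExecutor() as tpe:
        # Map each future back to its filename so errors can be reported.
        futures = {tpe.submit(process, f): f for f in glob.glob('/data01/*.bz2')}
        for fut in as_completed(futures):
            if fut.exception() is not None:
                failures.append((futures[fut], fut.exception()))
    return failures

if __name__ == '__main__':
    for name, exc in main():
        print(f'{name}: {exc}')
```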