Parallelization of un-bzipping millions of files
Question:
I have millions of compressed .bz2 files which I need to uncompress.
Can the decompression be parallelized? I have access to a server with many CPU cores for this purpose.
I worked with the following code, which is correct but extremely slow.
import glob, bz2, shutil

files = glob.glob("/data01/*.bz2")
for fi in files:
    fo = fi[:-4]  # strip the ".bz2" suffix
    with bz2.BZ2File(fi) as fr, open(fo, "wb") as fw:
        shutil.copyfileobj(fr, fw)
Answers:
What you are probably looking for is async.
Refer to this other question, which has almost the same context as yours:
asynchronous data extraction from an archive.
If your files are already on the server, you can skip the download part and use just the part that opens and extracts the file.
Another alternative is async-unzip: https://pypi.org/project/async-unzip/
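Since bz2 decompression is itself blocking, a minimal sketch of the async idea for local files (an assumption here, not code from the linked question) is to push each blocking call onto a thread via asyncio.to_thread, which requires Python 3.9+; the glob pattern is taken from the question:

```python
import asyncio
import bz2
import glob
import shutil

def decompress(path):
    # Blocking bz2 decompression; strips the ".bz2" suffix for the output name.
    with bz2.BZ2File(path) as fr, open(path[:-4], "wb") as fw:
        shutil.copyfileobj(fr, fw)

async def main():
    # asyncio.to_thread runs each blocking call in the default thread
    # pool, so many decompressions can be in flight at once.
    paths = glob.glob("/data01/*.bz2")
    await asyncio.gather(*(asyncio.to_thread(decompress, p) for p in paths))

if __name__ == "__main__":
    asyncio.run(main())
```

This keeps the event loop free, but the actual parallelism still comes from threads, not from asyncio itself.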
You can use the following code:
import pathlib
import itertools
import bz2
import shutil
import multiprocessing as mp
def extract(files):
    for filename in files:
        print(f'Processing {filename}')
        # Parenthesised with-statements need Python 3.10+.
        # with_suffix('') keeps the directory part, unlike .stem,
        # so the output lands next to the input file.
        with (bz2.BZ2File(filename) as fr,
              open(filename.with_suffix(''), 'wb') as fw):
            shutil.copyfileobj(fr, fw)
# Backport of itertools.batched (built in from Python 3.12);
# the walrus operator needs Python 3.8+
def batched(iterable, n):
    it = iter(iterable)
    while (batch := tuple(itertools.islice(it, n))):
        yield batch
if __name__ == '__main__':
    num_cpus = mp.cpu_count() - 1  # leave one CPU free
    files = pathlib.Path('data').glob('*.bz2')
    with mp.Pool(num_cpus) as pool:
        pool.map(extract, batched(files, num_cpus))
Instead of submitting one task per file, we split the list of files into batches of "num_cpus" files each and process a whole batch per task, which reduces per-file handover between processes. Since every worker stays fully busy, I leave one processor free for the rest of the system.
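The batched helper can be checked in isolation; note how the final batch is shorter when the total is not a multiple of n:

```python
import itertools

def batched(iterable, n):
    # Yield successive tuples of up to n items (backport of
    # itertools.batched, which is built in from Python 3.12).
    it = iter(iterable)
    while (batch := tuple(itertools.islice(it, n))):
        yield batch

print(list(batched(range(7), 3)))  # [(0, 1, 2), (3, 4, 5), (6,)]
```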
Multithreading would be ideal for this: the work is partly I/O-bound, and CPython's bz2 module releases the GIL while decompressing, so threads can decompress in parallel without the overhead of spawning processes.
from concurrent.futures import ThreadPoolExecutor
import glob
import bz2
import shutil
def process(filename):
    with bz2.BZ2File(filename) as fr, open(filename[:-4], "wb") as fw:
        shutil.copyfileobj(fr, fw)

def main():
    with ThreadPoolExecutor() as tpe:
        tpe.map(process, glob.glob('/data01/*.bz2'))

if __name__ == '__main__':
    main()
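One caveat with Executor.map: exceptions raised inside process only surface when the results are iterated, and the code above never iterates them, so a corrupt archive would fail silently. A hedged variant (the failure-collecting logic is an addition, not part of the answer above) uses submit and as_completed to report failures per file:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import bz2
import glob
import shutil

def process(filename):
    with bz2.BZ2File(filename) as fr, open(filename[:-4], "wb") as fw:
        shutil.copyfileobj(fr, fw)

def main():
    failures = []
    with ThreadPoolExecutor() as tpe:
        # Map each future back to its filename so errors can be reported.
        futures = {tpe.submit(process, f): f for f in glob.glob('/data01/*.bz2')}
        for fut in as_completed(futures):
            if fut.exception() is not None:
                failures.append((futures[fut], fut.exception()))
    return failures

if __name__ == '__main__':
    for name, exc in main():
        print(f'{name}: {exc}')
```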