Fastest way to extract tar files using Python
Question:
I have to extract hundreds of tar.bz files, each about 5 GB in size. So I tried the following code:
import glob
import tarfile
from multiprocessing import Pool

files = glob.glob('D:\*.tar.bz')  ## All my files are in D
for f in files:
    tar = tarfile.open(f, 'r:bz2')
    pool = Pool(processes=5)
    pool.map(tar.extractall('E:\\'))  ### I want to extract them in E
    tar.close()
But this code raises a TypeError:
TypeError: map() takes at least 3 arguments (2 given)
How can I solve it?
Any further ideas to accelerate extracting?
Answers:
You need to change pool.map(tar.extractall('E:\\')) to something like pool.map(extraction_function, list_of_files). Note that map() takes two arguments: the first is a function and the second is an iterable; it applies the function to every item of the iterable and returns a list of the results. In your code, tar.extractall('E:\\') is called immediately, so pool.map receives its return value (None) as the function and no iterable at all, which is exactly the TypeError you see.
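As a minimal illustration of that signature (the square function here is made up for the example):

```python
from multiprocessing import Pool

def square(x):
    # Each item of the iterable is sent to a worker process.
    return x * x

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        # Pass the function object itself, then the iterable.
        # pool.map(square(3)) would raise the same TypeError as in
        # the question, because it passes a result, not a function.
        print(pool.map(square, [1, 2, 3, 4]))  # [1, 4, 9, 16]
```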
Edit: you need to pass the archive file names into the worker processes:
def test_multiproc():
    files = glob.glob('D:\*.tar.bz2')
    pool = Pool(processes=5)
    result = pool.map(read_files, files)

def read_files(name):
    t = tarfile.open(name, 'r:bz2')
    t.extractall('E:\\')
    t.close()

>>> test_multiproc()
Define a function that extracts a single tar file, then pass that function and the list of tar files to multiprocessing.Pool.map:
import glob
import tarfile
from functools import partial
from multiprocessing import Pool

def extract(path, dest):
    with tarfile.open(path, 'r:bz2') as tar:
        tar.extractall(dest)

if __name__ == '__main__':
    files = glob.glob('D:\*.tar.bz')
    pool = Pool(processes=5)
    pool.map(partial(extract, dest='E:\\'), files)
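On the follow-up question of accelerating extraction further: a rough sketch, assuming the same drive letters as the question. bz2 decompression is CPU-bound, so sizing the pool to the core count and using imap_unordered (which hands out archives as workers become free) may help; with only a few huge archives, disk throughput can become the bottleneck instead.

```python
import glob
import os
import tarfile
from functools import partial
from multiprocessing import Pool

def extract(path, dest):
    # Decompress one archive; bz2 decoding dominates the run time.
    with tarfile.open(path, 'r:bz2') as tar:
        tar.extractall(dest)

if __name__ == '__main__':
    files = glob.glob('D:\*.tar.bz')
    # One worker per CPU core; imap_unordered yields results as
    # archives finish, so one slow archive does not stall the rest.
    with Pool(processes=os.cpu_count()) as pool:
        for _ in pool.imap_unordered(partial(extract, dest='E:\\'), files):
            pass
```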