Is there a way to increase the file reading speed of PyPDF2.PdfFileReader? It takes too much time to read multiple files
Question:
I have code that searches for .pdf files by reading the data inside them. My solution finds the correct files, but it is slow. Is there a way to make it quicker?
import os
import PyPDF2

tflist = []
keyword = keyword.lower()
for subdir, dirs, files in os.walk(folder_path):
    for file in files:
        filepath = os.path.join(subdir, file)
        # Match the keyword against the filename itself.
        if keyword in file.lower():
            if filepath not in tflist:
                tflist.append(filepath)
        # For PDFs not already matched, search the text of each page.
        if filepath.endswith(".pdf"):
            if filepath not in tflist:
                with open(filepath, "rb") as f:
                    reader = PyPDF2.PdfFileReader(f)
                    for i in range(reader.getNumPages()):
                        page = reader.getPage(i)
                        page_content = page.extractText().lower()
                        if keyword in page_content:
                            tflist.append(filepath)
                            break
print(tflist)
Answers:
What you could do is use multiprocessing.Pool.
Split your code into two pieces. The first piece generates a list of paths using os.walk. Let's call this list_of_filenames.
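A minimal sketch of that first piece, assuming folder_path is the same variable as in your code:

import os

# Collect the paths of all PDF files under folder_path.
list_of_filenames = []
for subdir, dirs, files in os.walk(folder_path):
    for file in files:
        if file.endswith(".pdf"):
            list_of_filenames.append(os.path.join(subdir, file))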
The second part is a function that reads the file and returns the filename and True or False for each page, depending on your criteria:
def worker(path):
    # keyword is assumed to be defined at module level (as in the
    # question) so that the worker processes inherit it.
    rv = {}
    with open(path, "rb") as f:
        reader = PyPDF2.PdfFileReader(f)
        for i in range(reader.getNumPages()):
            page = reader.getPage(i)
            page_content = page.extractText().lower()
            rv[i] = keyword in page_content  # True if the keyword appears on page i
    return (path, rv)
Use it like this:
import multiprocessing as mp

if __name__ == '__main__':  # guard needed so the child processes can safely re-import this module
    with mp.Pool() as p:  # one worker process per CPU core by default
        for path, rv in p.imap_unordered(worker, list_of_filenames):
            print('File:', path)
            print('Results:', rv)
Given that your CPU has n cores, this will run approximately n times faster than just processing one file at a time.
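Pool() starts os.cpu_count() worker processes by default; pass a number (for example mp.Pool(4)) if you want fewer. Also note that worker only searches the page text, while your original loop matched the keyword against the filename as well; if you need that, you could test keyword in os.path.basename(path).lower() inside worker before opening the file.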