Is there a way to increase the file reading speed of PyPDF2.PdfFileReader? It takes too much time to read multiple files

Question:

I have code that searches .pdf files by reading the data inside them. My solution finds the correct files, but it is slow. Is there a way to make it quicker?

keyword = keyword.lower()

for subdir, dirs, files in os.walk(folder_path):
    for file in files:
        filepath = os.path.join(subdir, file)
        if keyword in file.lower():
            if filepath not in tflist:
                tflist.append(filepath)
        if filepath.lower().endswith(".pdf"):
            if filepath not in tflist:
                with open(filepath, "rb") as f:
                    reader = PyPDF2.PdfFileReader(f)
                    for i in range(reader.getNumPages()):
                        page = reader.getPage(i)
                        page_content = page.extractText().lower()
                        if keyword in page_content:
                            tflist.append(filepath)
                            break

print(tflist)
Asked By: Thilina Bandara


Answers:

What you could do is use multiprocessing.Pool.

Split your code into two pieces. The first piece generates a list of paths using os.walk. Let’s call this list_of_filenames.
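That first piece could look like the sketch below; the helper name collect_pdf_paths is an illustration, not something from the original code (the lowercase extension check also catches files named .PDF):

```python
import os

def collect_pdf_paths(folder_path):
    """Walk folder_path and return the paths of all .pdf files."""
    list_of_filenames = []
    for subdir, dirs, files in os.walk(folder_path):
        for file in files:
            # Compare case-insensitively so ".PDF" files are found too.
            if file.lower().endswith(".pdf"):
                list_of_filenames.append(os.path.join(subdir, file))
    return list_of_filenames
```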

The second part is a function that reads the file and returns the filename and True or False for each page depending on your criteria:

def worker(path):
    rv = {}
    with open(path, "rb") as f:
        reader = PyPDF2.PdfFileReader(f)
        for i in range(reader.getNumPages()):
            page = reader.getPage(i)
            page_content = page.extractText().lower()
            # Record for each page whether the keyword occurs on it.
            rv[i] = keyword in page_content
    return (path, rv)

Use it like this:

import multiprocessing as mp

p = mp.Pool()
for path, rv in p.imap_unordered(worker, list_of_filenames):
    print('File:', path)
    print('Results:', rv)
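Note that worker reads keyword as a global. If you prefer to pass it explicitly (for example, so the worker does not depend on module-level state), functools.partial can bind it once; this is a sketch with a stand-in body, not the original answer's code:

```python
import functools

def worker(keyword, path):
    # Stand-in body for illustration: the real worker would open the
    # PDF at `path` with PyPDF2 and scan each page for `keyword`.
    return (path, keyword in path.lower())

# Bind the keyword once; the pool then only needs to send each path.
check_for_invoice = functools.partial(worker, "invoice")
```

p.imap_unordered(check_for_invoice, list_of_filenames) then works exactly as before, since the partial takes a single path argument.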

Given that your CPU has n cores, this can run up to roughly n times faster than processing one file at a time, assuming the work is CPU-bound rather than limited by disk I/O.

Answered By: Roland Smith