How do I extract text in the right order from a PDF using PyPDF2?

Question:

I am currently working on a project to extract the contents of a PDF. The code runs without errors and I am able to extract the text, but the extracted text is not in the right order. It does not go from top to bottom, which makes the output really confusing.

I looked online but found very little help on how to control the order of text extraction. Most tutorials produced the same result. For reference, this is the PDF that I am currently testing it on (page 5): https://www.pidm.gov.my/PIDM/files/13/134b5c79-5319-4199-ac68-99f62aca6047.pdf

import PyPDF2

with open('pdftest2.pdf', 'rb') as pdfTest:
    reader = PyPDF2.PdfFileReader(pdfTest)
    page5 = reader.getPage(4)
    text = page5.extractText()
    print(text)

The extracted text always starts with the footer of the page and then works its way from bottom to top. On the next page, it starts from top to bottom, but only for a few sentences. Then it extracts text from a different position on the page instead of continuing from where it left off.

All of the text does get extracted, but the order in which it is extracted is all over the place. Is there any solution for this problem?

Asked By: Aldin Yusmar

||

Answers:

I had to deal with a similar problem, and it turned out that the pdfplumber module worked better than PyPDF2. I guess it depends on the document itself, so it is worth trying.
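For what it is worth, here is a minimal sketch of the pdfplumber route. The simplest call is page.extract_text(); the version below instead takes the word boxes from page.extract_words() and orders them explicitly, top to bottom and then left to right. The function names and the y_tolerance value are my own assumptions, not part of pdfplumber, and the grouping only suits single-column pages.

```python
def words_to_lines(words, y_tolerance=3):
    """Group pdfplumber-style word dicts into lines of text.

    Each word is a dict with at least "text", "x0" and "top" keys,
    which is the shape returned by pdfplumber's page.extract_words().
    """
    # Sort by vertical position first, then by horizontal position.
    ordered = sorted(words, key=lambda w: (w["top"], w["x0"]))

    lines = []  # each entry is [top_of_first_word, [words on that line]]
    for word in ordered:
        if lines and abs(word["top"] - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(word)
        else:
            lines.append([word["top"], [word]])

    # Within a line, read the words left to right.
    return [
        " ".join(w["text"] for w in sorted(line_words, key=lambda w: w["x0"]))
        for _, line_words in lines
    ]


def extract_page_in_reading_order(pdf_path, page_index):
    """Hypothetical helper: text of one page, ordered top to bottom."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]  # zero-based, so 4 is page 5
        return "\n".join(words_to_lines(page.extract_words()))
```

For the file in the question this would be called as extract_page_in_reading_order('pdftest2.pdf', 4). If the page has several columns, the lines from different columns will be merged, so the grouping would need to be adapted.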

Otherwise, another approach would be to treat the PDF pages as images with the pdf2image module and extract the text from them using pytesseract. However, it might not be a perfect method, as the pdf2image function convert_from_path can take quite a long time to run.

I have included some code below in case you are interested.

First of all, make sure you install all the necessary dependencies as well as Tesseract and ImageMagick. You can find information on installation on the website. If you are working on Windows, there’s a good Medium article here.

To convert PDFs to images using pdf2image:

Don’t forget to pass your poppler path if you are working on Windows. It should look something like this: r'C:\<your_path>\poppler-21.02.0\Library\bin'

import os

from pdf2image import convert_from_path

def pdftoimg(fic, output_folder, poppler_path):
    # Convert every page of the PDF to an image; pdf2image also writes
    # intermediate .ppm files into output_folder
    pages = convert_from_path(fic, dpi=500, output_folder=output_folder,
                              thread_count=9, poppler_path=poppler_path)

    # Save each page as a JPEG named page_0.jpg, page_1.jpg, ...
    for image_counter, page in enumerate(pages):
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(os.path.join(output_folder, filename), 'JPEG')

    # Remove the intermediate .ppm files
    for i in os.listdir(output_folder):
        if i.endswith('.ppm'):
            os.remove(os.path.join(output_folder, i))

To extract text from the image:

Your Tesseract path is going to be something like this: r'C:\Program Files\Tesseract-OCR\tesseract.exe'

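import pytesseract
from PIL import Image

def imgtotext(img, tesseract_path):
    # Point pytesseract at the Tesseract executable
    pytesseract.pytesseract.tesseract_cmd = tesseract_path
    # Recognize the text in the image as a string
    text = pytesseract.image_to_string(Image.open(img))
    # Rejoin words that were hyphenated across a line break
    text = text.replace('-\n', '')

    return text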
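
Since converting the whole document at a high DPI is the slow part, one possible variation is to convert only the page you need and pass the image straight to pytesseract without saving it to disk. This is only a sketch: ocr_pdf_page and join_hyphenated_breaks are hypothetical names of mine, and the dpi of 300 is an assumption you may need to tune for your document.

```python
def join_hyphenated_breaks(text):
    """Rejoin words that were split across lines with a trailing hyphen."""
    return text.replace("-\n", "")


def ocr_pdf_page(pdf_path, page_number, poppler_path=None, dpi=300):
    """OCR a single page of a PDF (page_number is 1-based)."""
    # Third-party imports are kept local so the helper above can be
    # used without them: pip install pdf2image pytesseract
    from pdf2image import convert_from_path
    import pytesseract

    # Restrict the conversion to one page instead of the whole file.
    images = convert_from_path(
        pdf_path,
        dpi=dpi,
        first_page=page_number,
        last_page=page_number,
        poppler_path=poppler_path,
    )
    # image_to_string accepts the PIL image returned by pdf2image.
    return join_hyphenated_breaks(pytesseract.image_to_string(images[0]))
```

For page 5 of the file in the question, the call would be ocr_pdf_page('pdftest2.pdf', 5), with poppler_path set as described above if you are on Windows.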
Answered By: zanga

I recently started using PyMuPDF. Its licensing is a little confusing, but some of its methods can sort the text into its natural reading order (left to right, top to bottom). Something like page.get_text("words", sort=True) is all it takes.
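
To make that a bit more concrete, here is a rough sketch of how the sorted words could be turned back into lines. With the "words" option each item is a tuple of the form (x0, y0, x1, y1, word, block_no, line_no, word_no). The helper names and the y_tolerance value are assumptions of mine, not part of PyMuPDF.

```python
def lines_from_sorted_words(words, y_tolerance=3.0):
    """Join already-sorted word tuples into lines of text.

    Expects tuples shaped like PyMuPDF's "words" output:
    (x0, y0, x1, y1, word, block_no, line_no, word_no).
    """
    lines, current, last_y1 = [], [], None
    for x0, y0, x1, y1, text, *_ in words:
        # Start a new line when the bottom coordinate jumps.
        if last_y1 is not None and abs(y1 - last_y1) > y_tolerance:
            lines.append(" ".join(current))
            current = []
        current.append(text)
        last_y1 = y1
    if current:
        lines.append(" ".join(current))
    return lines


def page_text_sorted(pdf_path, page_index):
    """Hypothetical helper: one page's text in reading order."""
    import fitz  # PyMuPDF: pip install pymupdf

    with fitz.open(pdf_path) as doc:
        # sort=True asks PyMuPDF to order words top to bottom, left to right.
        words = doc[page_index].get_text("words", sort=True)
    return "\n".join(lines_from_sorted_words(words))
```

If you only need plain text, page.get_text("text", sort=True) should be enough on its own; the word tuples are mainly useful when you want the coordinates as well.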

Answered By: Martin Noah