Data Wrangling of text extracted from PDF using PyMuPDF possible? (alternating colors for each row) – text positioned in the middle for each row

Question:

I extracted data from a PDF file. I am sharing a sample of the page here.

I extracted the data using tabula-py. These are the arguments I used to extract the text from the PDF page.

import numpy as np
import pandas as pd
from tabula.io import read_pdf

import warnings
warnings.filterwarnings("ignore")


def parse_pdf_pages(pdf):
    page = read_pdf(
        pdf,
        pages = "all",
        guess = False,
        stream = True,
        silent = True,
        columns = [75, 135, 230, 347, 425, 602, 640, 705], 
        area = [0, 0, 1100, 1100], 
        pandas_options = {'header': None}
    )[0]
    
    return page

df = parse_pdf_pages(pdf="page.pdf")

# Remove unnecessary header and footer rows
df = df[~df[0].str.contains(r'Building|Permit|Report Date', na = False)].reset_index(drop = True)

The output for the data frame containing the text extracted is as follows:

[Image: screenshot of the extracted data frame]

I am happy with the way the text has been extracted. My main challenge is to group rows of data, especially column 5 (Work Description), as shown in the PDF.

Ideally, if the text in column 0 (Permit #) were at the top of each row in the PDF, grouping the rows would have been much easier: I would only have had to forward-fill column 0 and then use pandas' groupby to join all the text in column 5.
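For illustration, that fill-forward approach might look like the sketch below on a toy frame. The permit numbers and descriptions are made up, and the frame assumes the easy layout where the permit number sits on the first line of each row.

```python
import pandas as pd

# Toy frame (hypothetical values): column 0 holds the permit number only on
# the FIRST line of each row; column 5 holds the wrapped work description.
toy = pd.DataFrame({
    0: ["B-001", None, None, "B-002", None],
    5: ["Replace roof", "shingles and", "gutters", "Install new", "water heater"],
})

# Propagate each permit number down to the continuation lines below it.
toy[0] = toy[0].ffill()

# Join the description fragments per permit, keeping the original row order.
grouped = (
    toy.groupby(0, sort=False)[5]
       .agg(" ".join)
       .reset_index()
)
print(grouped)
```

This only works because every continuation line comes after its permit number; when the number sits in the middle of the row, forward-filling attaches the upper lines to the previous permit.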

Unfortunately, in this PDF column 0 is positioned in the middle of each row. Therefore, I have a hard time figuring out how to group the text in column 5 based on column 0.

I have tried extracting the data by setting lattice = True for tabula's read_pdf function, but it does not return anything because there are no visible border lines in the PDF. Instead, the rows alternate in background color.

How do I approach this problem? Are there other Python packages that can extract the text from the PDF properly by recognizing the colored band of each row as shown in the PDF file? Or is there a way to wrangle and group the data correctly?

I read a few posts saying that the Python package PyMuPDF can potentially detect colors in a PDF document. But I have no idea whether it's possible to use PyMuPDF to help extract and group text from the PDF meaningfully. I would appreciate any guidance regarding this. I have been struggling with this issue for months.

Thank you for your kind help.

Asked By: Joe

||

Answers:

I have posted a PyMuPDF-based solution on the mentioned discussion thread on GitHub.

Answered By: Jorj McKie
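
For readers without access to that thread, here is a rough sketch of the general idea. It is not the code posted there. It assumes the shaded row backgrounds are drawn as filled rectangles that PyMuPDF reports through page.get_drawings(), and that any gap between two shaded stripes is a single unshaded row. The helper names (row_bands, group_words, read_page) and the width threshold are hypothetical. Reusing the tabula column positions from the question as column_edges is also an assumption that should be checked against the actual word coordinates.

```python
from collections import defaultdict


def row_bands(shaded, top, bottom):
    """Turn shaded (y0, y1) stripes into a full list of row bands.

    Assumption: rows alternate, so a gap between two shaded stripes (or
    between a table edge and a stripe) is treated as ONE unshaded row.
    """
    bands, cursor = [], top
    for y0, y1 in sorted(shaded):
        if y0 - cursor > 1:          # gap wide enough to be an unshaded row
            bands.append((cursor, y0))
        bands.append((y0, y1))
        cursor = y1
    if bottom - cursor > 1:
        bands.append((cursor, bottom))
    return bands


def group_words(words, bands, column_edges):
    """Assign each word to a (row band, column) cell and join the text.

    words: tuples starting with (x0, y0, x1, y1, text), the layout that
           page.get_text("words") returns.
    column_edges: sorted x-coordinates that separate the columns.
    """
    cells = defaultdict(list)
    for x0, y0, x1, y1, text, *_ in words:
        y_mid = (y0 + y1) / 2
        row = next(
            (i for i, (b0, b1) in enumerate(bands) if b0 <= y_mid < b1), None
        )
        if row is None:
            continue                 # outside every band, e.g. header/footer
        col = sum(x0 >= edge for edge in column_edges)
        cells[(row, col)].append((y0, x0, text))

    n_cols = len(column_edges) + 1
    table = []
    for row in range(len(bands)):
        line = []
        for col in range(n_cols):
            # Reading order inside a cell: top to bottom, then left to right.
            ordered = sorted(cells[(row, col)], key=lambda w: (round(w[0]), w[1]))
            line.append(" ".join(text for _, _, text in ordered))
        table.append(line)
    return table


def read_page(pdf_path, page_number=0):
    """Collect shaded stripes and words from one page using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the helpers above run without it

    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        shaded = [
            (d["rect"].y0, d["rect"].y1)
            for d in page.get_drawings()
            # Assumed filter: keep filled shapes wider than half the page.
            if d.get("fill") is not None and d["rect"].width > page.rect.width / 2
        ]
        words = page.get_text("words")
    return shaded, words
```

With a real file, one would call read_page("page.pdf"), pass the stripes to row_bands together with the top and bottom of the table area, and then call group_words with the column positions. Because each word is placed by its vertical midpoint inside a band, it no longer matters whether the permit number sits at the top or in the middle of its row.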