tabula-py not reading all rows for PDFs with alternating row colors when lattice is set to True

Question:

I am trying to extract all rows from the PDF attached here.

Here is the code I used:

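import pandas as pd
from tabula import read_pdf

def parse_latticepdf_pages(pdf):
    # read_pdf returns a list of DataFrames, one per detected table
    pages = read_pdf(
        pdf,
        pages="all",
        guess=False,
        lattice=True,
        silent=True,
        area=[43, 5, 568, 774],  # top, left, bottom, right (in points)
        pandas_options={"header": None},
    )

    # Stack the per-table DataFrames into a single DataFrame
    return pd.concat(pages)

parse_latticepdf_pages(pdf="file.pdf")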

The output contains only the rows that have a grey background. It does not include the rows with a white background. How do I get all rows regardless of their background color?

Note: Initially I tried stream=True, but that caused other problems: each line of text appears as a separate row, and it is impossible to group the lines back into rows as needed. Hence, I set lattice=True. Also, the issue is the same whether or not multiple_tables is enabled.

I would appreciate any help regarding this. Thank you!

Asked By: Joe

Answers:

Not sure what’s happening, but I confirmed that it works with the multiple_tables=False option, as follows:

In [41]: tabula.read_pdf(fname, pages=1, lattice=True, area = [43, 5, 568, 774], multiple_tables=False)
Out[41]:
[  Issued Date      Permit No.  ...                                       Proposed Use       Valuation
 0    4/1/2019  P025361-032119  ...  New office and restroom addition to existingr...      $45,000.00
 1   4/12/2019  P025502-041219  ...  Isolate chapel from fire damaged area 4000 sq....       $1,000.00
 2   4/12/2019  P025487-041019  ...  Interior finish-out for new meat market 2500r...      $35,000.00
 3   4/15/2019  P025520-041519  ...       New 8-unit apartment building 10,800 sq. ft.     $350,000.00
 4   4/25/2019  P025101-020719  ...                New Five Story Hotel 93,501 sq. ft.  $12,327,000.00
 5    4/9/2019  P025475-040919  ...                 Mobile Home Placement 1216 sq. ft.       $1,250.00
 6    4/9/2019  P025477-040919  ...                 Mobile Home Placement 1216 sq. ft.       $1,250.00
 7    4/9/2019  P025479-040919  ...                 Mobile Home Placement 1216 sq. ft.       $1,250.00
 8    4/8/2019  P025459-040519  ...                                   Build a carport.       $1,000.00

 [9 rows x 7 columns]]

It might cause another issue with pages="all", though.

Answered By: chezou

I finally managed to solve this. For this particular PDF format, it is better to use another Python package such as PyMuPDF. I posted a similar question on Stack Overflow and am linking it here. I hope this helps others who are struggling with a similar problem.

Data Wrangling of text extracted from PDF using PyMuPDF possible? (alternating colors for each row) – text positioned in the middle for each row
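
For readers who want a starting point, here is a minimal sketch of the general idea, not the exact solution from the linked post. PyMuPDF's page.get_text("words") returns tuples of the form (x0, y0, x1, y1, word, block_no, line_no, word_no). If you know the top y-coordinate of every table row (grey and white alike), you can assign each word to its row by its vertical center. The helper names group_words_into_rows and words_from_page, and the row_tops argument, are hypothetical; how you obtain the row boundaries (for example from the shaded rectangles) depends on the PDF and is assumed here.

```python
from bisect import bisect_right


def group_words_into_rows(words, row_tops):
    """Assign each word to the row band whose top edge lies just above it.

    words    -- tuples shaped like PyMuPDF's "words" output:
                (x0, y0, x1, y1, text, block_no, line_no, word_no)
    row_tops -- top y-coordinate of every table row (assumed to be known)
    Returns one string per row, words ordered top-to-bottom, left-to-right.
    """
    tops = sorted(row_tops)
    rows = [[] for _ in tops]
    for x0, y0, x1, y1, text, *_ in words:
        y_mid = (y0 + y1) / 2
        idx = bisect_right(tops, y_mid) - 1
        if idx < 0:
            continue  # word sits above the first row, e.g. a page header
        rows[idx].append((round(y0), x0, text))
    return [" ".join(text for _, _, text in sorted(row)) for row in rows]


def words_from_page(pdf_path, page_number=0):
    """Read the word tuples of one page with PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return doc[page_number].get_text("words")


# Synthetic words standing in for words_from_page("file.pdf") output
sample_words = [
    (10, 100, 30, 110, "New", 0, 0, 0),
    (35, 100, 60, 110, "office", 0, 0, 1),
    (10, 112, 40, 122, "addition", 0, 1, 0),
    (10, 130, 50, 140, "Carport", 1, 0, 0),
    (10, 50, 40, 60, "Header", 2, 0, 0),
]
sample_rows = group_words_into_rows(sample_words, row_tops=[95, 125])
print(sample_rows)
```

This only groups text by row; splitting each row into columns would need a similar pass over the x-coordinates, using column boundaries that you would again have to determine for your PDF.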

Answered By: Joe