How to extract a table using camelot

Question:

I’m trying to extract a table from a sample PDF. The problem is that the table has no lines between its columns.

This is a picture of the document:

[Image: screenshot of the document]

When I try running:

tables = camelot.read_pdf(filename)
print(tables[0].df)

It prints the right table content but ignores the columns and treats the whole table as a single column, like this:

0  Credit \nDate \nReference No. \nDescription...
1  09/10/2016 \n3194949206 \nOnline Banking tra...
2  09/20/2016 \n3194749206 \nOnline Banking tra...
3  09/20/2016 \n34236757678 \nCA TLR cash withd...
4  10/08/2016 \n5444 \nInterest \n            ...

And when I run this:

print(tables[0].df.shape)

The result is (5, 1).

I also tried specifying the stream flavor, like this:

tables = camelot.read_pdf(filename, flavor='stream')

But then it extracts and prints the wrong data. This is the result:

                                0                                   1
0                  Bank of Domino                                    
1                                        Customer service information
2                  P.O. Box 15001                        1.888.DOMINO
3             Arlington, VA 18505  TDD/TTY users only: 1.800.288.4101
4                                          En Espanol: 1.800.688.6229
5  Account Number: 00000970987652                    bankofdomino.com
6                     ROBERT BELL                Bank of Domino, N.A.
7              HOLLOW WAY,APT 503                      P.O. Box 25125
8         SAN MESA, CA 92627-5125                   San Mesa,CA 33390

And the shape of the df is (9, 2).

I also tried specifying the columns' x-coordinates, but to no avail:

tables = camelot.read_pdf(filename, flavor='stream', columns="10, 120, 230, 470, 520, 650, 720, 800")

It still gets the wrong data.

Any help is appreciated.

Thanks in advance.

Edit: here’s the pdf sample.

https://mega.nz/file/cNtwDSpI#KuhG03P1Qg5kLa69jZ7ohb3FF8G6ITNuMFsFyQVMudw

Asked By: aziz aon


Answers:

OK, you have a few options. Here are some examples using your sample PDF file:

Option 1, using tabula-py:

import tabula

pdf_path = dir_path + "/bankk.pdf"  # dir_path: the folder that contains the PDF
tb = tabula.read_pdf(pdf_path, pages='all')

This returns a list of DataFrames, one for each table it detected in your PDF. Using tb[0] gives the following result:

+----------+-------------+--------------------+---------+-------+----------+
|      Date|Reference No.|         Description|   Credit|  Debit|   Balance|
+----------+-------------+--------------------+---------+-------+----------+
|09/10/2016|   3194949206|Online Banking tr...|$2,500.00|   null|$14,000.49|
|09/20/2016|   3194749206|Online Banking tr...|$3,800.00|   null|$17,800.49|
|09/20/2016|  34236757678|CA TLR cash withd...|     null|$300.00|$17,500.49|
|10/08/2016|         5444|            Interest|    $0.64|   null|$17,501.13|
+----------+-------------+--------------------+---------+-------+----------+
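
One quick way to sanity-check an extraction like this is to confirm that each balance equals the previous balance plus the credit minus the debit. The sketch below does that in plain Python on the rows shown above. The helper names (parse_money, balances_consistent) are illustrative and not part of tabula-py, and treating a null cell as zero is an assumption.

```python
from decimal import Decimal

# (credit, debit, balance) copied from the tb[0] output above.
# None stands in for the "null" cells.
rows = [
    ("$2,500.00", None, "$14,000.49"),
    ("$3,800.00", None, "$17,800.49"),
    (None, "$300.00", "$17,500.49"),
    ("$0.64", None, "$17,501.13"),
]

def parse_money(cell):
    """Turn a string like "$2,500.00" into a Decimal; empty cells count as 0."""
    if cell is None:
        return Decimal("0")
    return Decimal(cell.replace("$", "").replace(",", ""))

def balances_consistent(rows):
    """Check that every balance equals the previous balance + credit - debit."""
    for prev, cur in zip(rows, rows[1:]):
        expected = parse_money(prev[2]) + parse_money(cur[0]) - parse_money(cur[1])
        if expected != parse_money(cur[2]):
            return False
    return True

print(balances_consistent(rows))  # prints True for the rows above
```

tabula-py returns pandas DataFrames, where missing cells are NaN rather than None, so the None check would need adapting (for example with pandas.isna) before running this on tb[0] directly.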

Option 2, using pandas and pdfplumber:

import pdfplumber
import pandas as pd

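# pdf_path is the same path used in Option 1
with pdfplumber.open(pdf_path) as pdf:
    page = pdf.pages[0]
    tb = page.extract_table(table_settings={"horizontal_strategy": "lines",
                                            "vertical_strategy": "text",
                                            "keep_blank_chars": True,  # boolean flag
                                            "snap_tolerance": 5})
# The first extracted row is the header, the rest are data rows
df = pd.DataFrame(tb[1:], columns=tb[0])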

Note that you might need to tune the table_settings for your document; the pdfplumber documentation describes the available options.
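
If text-based column detection still merges or splits columns, pdfplumber also accepts explicit column boundaries through the "explicit" vertical strategy and the explicit_vertical_lines setting. A minimal sketch, assuming the x-coordinates below are placeholders that have to be measured on the actual page; extract_with_columns and rows_to_records are illustrative helper names, not pdfplumber functions:

```python
def extract_with_columns(pdf_path, column_xs, page_index=0):
    """Extract one table, using column_xs as the vertical cell boundaries."""
    import pdfplumber  # imported here so rows_to_records works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_table(table_settings={
            "vertical_strategy": "explicit",
            "explicit_vertical_lines": column_xs,
            "horizontal_strategy": "text",
        })

def rows_to_records(rows):
    """Use the first extracted row as the header and map each later row to a dict."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]

# Hypothetical boundaries in PDF points; replace with values measured on your page.
COLUMN_XS = [40, 110, 190, 330, 400, 460, 540]

# rows = extract_with_columns(pdf_path, COLUMN_XS)
# records = rows_to_records(rows)
```

extract_table returns None when no table is found, so check the result before passing it to rows_to_records.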

Option 3, using AWS Textract (Thomas edit):

You can use the amazon-textract-textractor package to call textract and parse its output. The advantage of Textract is that it works for native pdfs, scanned pdfs, or images. For example with your image:

from textractor import Textractor
from textractor.data.constants import TextractFeatures
extractor = Textractor(profile_name="default")
document = extractor.analyze_document(
    file_source="./N6klE.png",
    features=[TextractFeatures.TABLES],
)

Textract detects the two tables:

document.tables[1].to_pandas(use_columns=True)


Option 4, using deep learning:

Train a deep learning model to detect the tables on every page of the PDF, then use pytesseract to read the table data. For getting the table data after detection, see my Medium article: Image Table to DataFrame using Python OCR


Of course, you can still use camelot, but I prefer tabula-py for simple cases and deep learning for more complex ones.
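
To stay with camelot, the stream flavor usually needs to be told where the table is; otherwise it may pick up other text on the page, such as the address block in the question's output. camelot expects both table_areas and columns as lists of comma-separated strings (one string per table area), not as a single string. The sketch below builds those arguments. All coordinates are placeholders that must be measured on the real page, and stream_kwargs and read_statement_table are illustrative names, not camelot functions.

```python
def stream_kwargs(table_area, column_xs):
    """Build read_pdf keyword arguments for the stream flavor.

    table_area: (x1, y1, x2, y2), the top-left and bottom-right corners
                of the table in PDF coordinates.
    column_xs:  x-coordinates of the column separators inside that area.
    """
    return {
        "flavor": "stream",
        "table_areas": [",".join(str(v) for v in table_area)],
        "columns": [",".join(str(x) for x in column_xs)],
    }

def read_statement_table(filename, table_area, column_xs):
    """Read the first table found in the given area as a DataFrame."""
    import camelot  # imported here so stream_kwargs can be used on its own

    tables = camelot.read_pdf(filename, **stream_kwargs(table_area, column_xs))
    return tables[0].df

# Hypothetical values; replace with coordinates measured on your PDF.
AREA = (30, 450, 580, 330)
COLUMNS = [100, 180, 320, 400, 460]

print(stream_kwargs(AREA, COLUMNS))
# df = read_statement_table(filename, AREA, COLUMNS)
```

PDF coordinates have their origin at the bottom-left of the page, which is why the top-left corner of the area has the larger y value.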
