How to extract a table using camelot
Question:
I’m trying to extract a table from a sample PDF. The problem is that the table has no lines separating its columns.
This is a picture of the document:
When I try running:
import camelot
tables = camelot.read_pdf(filename)
print(tables[0].df)
It prints the right table content but ignores the columns: the whole table is treated as a single column, like this:
0 Credit \nDate \nReference No. \nDescription...
1 09/10/2016 \n3194949206 \nOnline Banking tra...
2 09/20/2016 \n3194749206 \nOnline Banking tra...
3 09/20/2016 \n34236757678 \nCA TLR cash withd...
4 10/08/2016 \n5444 \nInterest \n ...
And when I run this:
print(tables[0].df.shape)
The result is (5, 1).
I also tried specifying the stream flavor, like this:
tables = camelot.read_pdf(filename, flavor='stream')
But then it extracts the wrong part of the page. This is the result:
0 1
0 Bank of Domino
1 Customer service information
2 P.O. Box 15001 1.888.DOMINO
3 Arlington, VA 18505 TDD/TTY users only: 1.800.288.4101
4 En Espanol: 1.800.688.6229
5 Account Number: 00000970987652 bankofdomino.com
6 ROBERT BELL Bank of Domino, N.A.
7 HOLLOW WAY,APT 503 P.O. Box 25125
8 SAN MESA, CA 92627-5125 San Mesa,CA 33390
And the shape of the df is (9, 2).
I also tried specifying the x-coordinates of the columns, but to no avail:
tables = camelot.read_pdf(filename, flavor='stream', columns="10, 120, 230, 470, 520, 650, 720, 800")
It still gets the wrong data.
Any help is appreciated.
Thanks in advance.
Edit: here’s the pdf sample.
https://mega.nz/file/cNtwDSpI#KuhG03P1Qg5kLa69jZ7ohb3FF8G6ITNuMFsFyQVMudw
Answers:
OK, you have a few options. Here are some examples using your sample PDF file:
Option 1, using tabula-py:
import tabula
# dir_path is the directory that contains the sample PDF
pdf_path = dir_path + "/bankk.pdf"
# read_pdf returns a list of DataFrames, one per detected table
tb = tabula.read_pdf(pdf_path, pages='all')
This gives you a list of DataFrames, one for each table detected in the PDF. tb[0] gives the following result:
+----------+-------------+--------------------+---------+-------+----------+
| Date|Reference No.| Description| Credit| Debit| Balance|
+----------+-------------+--------------------+---------+-------+----------+
|09/10/2016| 3194949206|Online Banking tr...|$2,500.00| null|$14,000.49|
|09/20/2016| 3194749206|Online Banking tr...|$3,800.00| null|$17,800.49|
|09/20/2016| 34236757678|CA TLR cash withd...| null|$300.00|$17,500.49|
|10/08/2016| 5444| Interest| $0.64| null|$17,501.13|
+----------+-------------+--------------------+---------+-------+----------+
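If tabula-py's automatic detection picks the wrong region on other statements, its read_pdf function also accepts stream, guess, area and columns arguments. The sketch below only collects those keyword arguments; the helper names stream_options and read_tables and the coordinates are hypothetical, and the real values have to be measured on your own PDF (in points).

```python
def stream_options(area=None, columns=None):
    """Collect keyword arguments for tabula.read_pdf in stream mode."""
    options = {"pages": "all", "stream": True}
    if area is not None:
        # area is [top, left, bottom, right] in points
        options["area"] = list(area)
        # turn off automatic table guessing so the given area is used
        options["guess"] = False
    if columns is not None:
        # x-coordinates of the column boundaries
        options["columns"] = list(columns)
    return options


def read_tables(pdf_path, area=None, columns=None):
    """Return the list of DataFrames tabula-py finds with these options."""
    import tabula  # imported here so the helper above works without tabula-py

    return tabula.read_pdf(pdf_path, **stream_options(area, columns))


# Hypothetical coordinates, for illustration only
opts = stream_options(area=[250, 30, 400, 580], columns=[120, 230, 470, 520])
print(opts)
```

Calling read_tables(pdf_path, area=..., columns=...) would then replace the plain read_pdf call above.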
Option 2, using pandas and pdfplumber:
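import pdfplumber
import pandas as pd
# pdf_path is the same path used in Option 1
pdf = pdfplumber.open(pdf_path)
page = pdf.pages[0]
tb = page.extract_table(table_settings={
    "horizontal_strategy": "lines",
    "vertical_strategy": "text",
    # boolean flag; newer pdfplumber releases may expect "text_keep_blank_chars" instead
    "keep_blank_chars": True,
    "snap_tolerance": 5,
})
# the first extracted row is the header, the rest are data rows
df = pd.DataFrame(tb[1:], columns=tb[0])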
Note that you might need to experiment with the table_settings; to read more, refer to the pdfplumber documentation on table-extraction settings.
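One setting worth trying when the "text" strategy guesses the column boundaries badly is pdfplumber's "explicit" vertical strategy, where you pass the x positions yourself through explicit_vertical_lines. This is only a sketch: the helper names and the x positions are hypothetical, and the positions must be measured on your own page.

```python
def explicit_column_settings(column_edges):
    """Build pdfplumber table settings that split columns at fixed x positions."""
    edges = sorted(float(x) for x in column_edges)
    if len(edges) < 2:
        raise ValueError("need at least two x positions (left and right table edges)")
    return {
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": edges,
        "horizontal_strategy": "lines",
        "snap_tolerance": 5,
    }


def extract_table(pdf_path, column_edges, page_number=0):
    """Return the first table on the page as a list of rows."""
    import pdfplumber  # imported here so the helper above works without pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        return page.extract_table(
            table_settings=explicit_column_settings(column_edges)
        )


# Hypothetical x positions in PDF points, for illustration only
settings = explicit_column_settings([230, 40, 120, 470, 520])
print(settings["explicit_vertical_lines"])
```

The rows returned by extract_table can be turned into a DataFrame the same way as before, with the first row used as the header.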
Option 3, using AWS Textract (Thomas edit):
You can use the amazon-textract-textractor package to call Textract and parse its output. The advantage of Textract is that it works for native PDFs, scanned PDFs, and images. For example, with your image:
from textractor import Textractor
from textractor.data.constants import TextractFeatures
extractor = Textractor(profile_name="default")
document = extractor.analyze_document(
    file_source="./N6klE.png",
    features=[TextractFeatures.TABLES],
)
Textract detects the two tables:
document.tables[1].to_pandas(use_columns=True)
Option 4, using Deep Learning
Train a deep learning model to detect the tables on each page of the PDF, then use pytesseract to read the table data. My Medium article, Image Table to DataFrame using Python OCR, covers getting the table data after detection.
Of course, you can still use camelot, but I prefer tabula-py for simple cases and deep learning for more complex ones.
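If you do stay with camelot, the stream flavor can be restricted to the table region with table_areas, which may stop it from picking up the address block at the top of the page. camelot's documentation shows both table_areas and columns as lists of comma-separated strings, not a single bare string. The sketch below assumes that form; the helper names and all coordinates are hypothetical and have to be read off your own page (camelot.plot(tables[0], kind='text') can help with that).

```python
def to_coord_string(values):
    """Join numbers into the comma-separated string form camelot expects."""
    return ",".join(str(v) for v in values)


def read_statement_table(pdf_path, table_area, column_separators, page="1"):
    """Read one table with camelot's stream flavor inside a fixed region."""
    import camelot  # imported here so the helper above works without camelot

    tables = camelot.read_pdf(
        pdf_path,
        pages=page,
        flavor="stream",
        # "x1,y1,x2,y2": top-left and bottom-right corners in PDF coordinates
        table_areas=[to_coord_string(table_area)],
        # x-coordinates of the column separators, one string per table area
        columns=[to_coord_string(column_separators)],
        # split text that spans several columns at the separators
        split_text=True,
    )
    return tables[0].df


# Hypothetical separators, for illustration only
print(to_coord_string([120, 230, 470, 520]))
```

A call such as read_statement_table(filename, table_area=(...), column_separators=(...)) would then return the DataFrame for that region.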