How to extract a table from a PDF in Python?
Question:
I have thousands of PDF files consisting only of tables, with this structure:
However, although the files are fairly structured, I cannot read the tables without losing the structure.
I tried PyPDF2, but the data comes out completely jumbled.
import PyPDF2
pdfFileObj = open('pdf_file.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
print(pageObj.extractText().split('\n')[0])
print(pageObj.extractText().split('/')[0])
I also tried Tabula, but it only reads the header (and not the content of the tables).
from tabula import read_pdf
pdfFile1 = read_pdf('pdf_file.pdf', output_format='json')  # Option 1: reads all the headers
pdfFile2 = read_pdf('pdf_file.pdf', multiple_tables=True)  # Option 2: reads only the first header and a few lines of content
Any thoughts?
Answers:
Try this: pip install tabula-py
from tabula import read_pdf
df = read_pdf("file_name.pdf")
After struggling a little bit, I found a way.
For each page of the file, I had to pass the area of the table and the column boundaries to tabula's read_pdf function.
Here is the working code:
import pypdf
from tabula import read_pdf
pdf_file = "pdf_file.pdf"  # path to the PDF, as in the question
# Get the number of pages in the file
pdf_reader = pypdf.PdfReader(pdf_file)
n_pages = len(pdf_reader.pages)
# For each page the table can be read with the following code
table_pdf = read_pdf(
    pdf_file,
    guess=False,
    pages=1,
    stream=True,
    encoding="utf-8",
    area=(96, 24, 558, 750),  # table region: (top, left, bottom, right)
    columns=(24, 127, 220, 274, 298, 325, 343, 364, 459, 545, 591, 748),  # x-coordinates of the column boundaries
)
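The code above reads only page 1 and does not yet use n_pages. A minimal sketch of applying the same call to every page follows, assuming all pages share the layout above (the area and column values are the ones from this answer). The helper names page_options and read_all_pages are hypothetical, and the isinstance check is there because read_pdf may return either a list of DataFrames or a single DataFrame depending on the tabula-py version and options.

```python
from typing import Any

# Values taken from the answer above; assumed to be valid for every page.
AREA = (96, 24, 558, 750)  # (top, left, bottom, right)
COLUMNS = (24, 127, 220, 274, 298, 325, 343, 364, 459, 545, 591, 748)


def page_options(page: int) -> dict:
    """Build the read_pdf keyword arguments for one 1-based page number."""
    options: dict = {
        "guess": False,
        "pages": page,
        "stream": True,
        "encoding": "utf-8",
        "area": AREA,
        "columns": COLUMNS,
    }
    return options


def read_all_pages(pdf_file: str) -> list:
    """Read the table on every page and return all results in one list."""
    # Imported here so the helper above can be used without Java/tabula installed.
    import pypdf
    from tabula import read_pdf

    n_pages = len(pypdf.PdfReader(pdf_file).pages)
    tables: list[Any] = []
    for page in range(1, n_pages + 1):
        result = read_pdf(pdf_file, **page_options(page))
        # Normalise: some versions/options return one DataFrame, others a list.
        tables.extend(result if isinstance(result, list) else [result])
    return tables
```

Calling read_all_pages("pdf_file.pdf") would then give one entry per extracted table. If the layout differs between pages, the area and columns would need to be chosen per page instead.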
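Use the tabula library:
pip install tabula-py
Then extract the tables:
import tabula
# url is the path or URL of the PDF
# this reads page 63
dfs = tabula.read_pdf(url, pages=63, stream=True)
# if you want to read all pages
dfs = tabula.read_pdf(url, pages="all")
# by default read_pdf returns a list of DataFrames; this is the second table
dfs[1]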
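Because read_pdf can return several tables from one call, a hard-coded index may not always point at the table you want. A minimal sketch that picks the table with the most rows instead; largest_table is a hypothetical helper, and the plain lists below are only stand-ins that mimic DataFrames through len().

```python
from typing import Sequence, Sized, TypeVar

T = TypeVar("T", bound=Sized)


def largest_table(tables: Sequence[T]) -> T:
    """Return the table with the most rows (len() of a DataFrame is its row count)."""
    if not tables:
        raise ValueError("no tables were extracted")
    return max(tables, key=len)


# Stand-ins for DataFrames: any object that supports len() works here.
candidates = [["header"], ["row1", "row2", "row3"], ["row1", "row2"]]
print(largest_table(candidates))
```

With real output this would be called as largest_table(dfs) on the list returned by read_pdf.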
By the way, I also tried reading the PDF files another way, and it worked better than the tabula library. I will post it soon.
@fmarques
You could also try a new Python package (SLICEmyPDF), developed by StatCan specifically for extracting tabular data from PDFs:
https://github.com/StatCan/SLICEmyPDF
In my experience, SLICEmyPDF outperforms other free Python or R packages.
The catch is that it requires installing a few additional free programs. The installation instructions can be found at
https://dataworldofredhairedgirl.blogspot.com/2022/04/how-to-install-statcan-slicemypdf-on.html