Python PDF read straight across as how it looks in the PDF
Question:
If I use the code in the answer here:
Extracting text from a PDF file using PDFMiner in python?
I can get the text to extract when applying to this pdf: https://www.tencent.com/en-us/articles/15000691526464720.pdf
However, you see under “CONSOLIDATED INCOME STATEMENT”, it reads down … ie… Revenues VAS Online advertising
then later it reads the numbers… I want it to read across, ie:
Revenues 73,528 49,552 73,528 66,392 VAS 46,877 35,108
etc… is there a way to do this?
I am also looking for possible solutions other than pdfminer.
And if I try using this code for PyPDF2, not all of the text even shows up:
# importing required modules
import PyPDF2
# creating a pdf file object
pdfFileObj = open(file, 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# getting the number of pages in the pdf file
a = pdfReader.numPages
# creating a page object for each page and printing its text
for i in range(0, a):
    pageObj = pdfReader.getPage(i)
    print(pageObj.extractText())
Answers:
Your issue is more to do with how PDF files are constructed than an issue with pyPDF2. I ran into many of the same problems while parsing PDFs to re-construct a page layout.
When a PDF is generated, each text block is positioned on the page and rendered based on the font rules applied (similar to constructing an HTML document using nothing but absolute positioning and CSS). A simple PDF library will simply return the text from each block in the order the blocks are defined in the file (I’ve had documents where the pages were generated in reverse, with the last paragraph defined first).
Either you will need to use a more advanced PDF library (likely one that will build on top of the simple libraries) that will take the X, Y location of each text block along with its font information to determine the vertical positioning, or develop this yourself. It looks like the software that JosephA is talking about is doing exactly this.
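To illustrate the idea of ordering text by position rather than by file order, here is a minimal sketch. It assumes the blocks have already been obtained from some PDF library as (x, y, text) tuples; the function name blocks_to_lines, the tuple layout, and the y_tolerance value are all hypothetical choices for this example, not part of any particular library.

```python
def blocks_to_lines(blocks, y_tolerance=3.0):
    """Group (x, y, text) blocks into visual rows and join each row left to right."""
    # PDF coordinates usually grow upward, so a larger y is higher on the page.
    ordered = sorted(blocks, key=lambda block: -block[1])
    rows = []  # each entry is [reference_y, [(x, text), ...]]
    for x, y, text in ordered:
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            # Close enough vertically: treat it as part of the current row.
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    # Within a row, order the cells by their horizontal position.
    return [" ".join(text for _, text in sorted(cells, key=lambda cell: cell[0]))
            for _, cells in rows]


# Blocks listed column by column, the way a simple extractor might return them.
blocks = [
    (50, 700, "Revenues"), (50, 685, "VAS"),
    (300, 700.5, "73,528"), (300, 685, "46,877"),
    (400, 700, "49,552"), (400, 684.5, "35,108"),
]
for line in blocks_to_lines(blocks):
    print(line)
```

The tolerance matters because blocks on the same visual row rarely share exactly the same y value; a value that is too large would merge neighbouring rows, so it would need tuning per document.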
I first looked up the extractText function of PyPDF2 and tried to “strip” any new lines from the output to give you the “across” the page one-liner.
The output wasn’t so desirable.
Also, it doesn’t seem reliable in terms of your output.
From the PyPDF2 documentation:
“Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.”
So I went and explored the options of using Tesseract. So this is a bit of a deviation on using a “pdf extraction library” and it’s basically “build your own extraction script”.
It’s not too difficult once you have a grasp of Tesseract. It took me about an hour’s research, with existing knowledge of Tesseract.
Here are my results from my own code extracting the pdf page by page: https://gist.github.com/Benehiko/60862a6be13b3b652b7d506121b95811
Please note my code has a drawback. It does not extract the pages in order.
Just in case the link dies:
from PIL import Image
import pytesseract
import subprocess
import pathlib
import glob
import os
# point Tesseract at the training data stored next to this script
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
os.environ["TESSDATA_PREFIX"] = ROOT_DIR + "/tessdata"
# convert the pdf to one tiff per page using ImageMagick's convert
pathlib.Path("pages").mkdir(parents=False, exist_ok=True)
params = ['convert', "-density", "300", 'test.pdf', '-depth', '8',
          'pages/test_%02d.tiff']
subprocess.check_call(params)
# OCR each page image and put its text on a single line
images = glob.glob("pages/*.tiff")
for image in images:
    image = Image.open(image)
    text = pytesseract.image_to_string(image, lang='eng', nice=0,
        output_type=pytesseract.Output.STRING).replace("\n", " ")
    print(text)
An Explanation of the code:
This first converts the pdf to separate “tiff” images, since reading a multi-page tiff with pytesseract for some reason only reads the first page. The tiff files are saved in a separate directory called “pages”. Pytesseract reads each file and returns its text, which is then passed through “.replace” to turn every line break into a space, so the text of each page comes out as one line.
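One possible way to address the page-order drawback mentioned above is to sort the image paths by the page number embedded in each file name before looping over them. This is only a sketch: it assumes the names follow the test_%02d.tiff pattern used in the script, and page_index is a hypothetical helper name.

```python
import re


def page_index(path):
    """Return the page number embedded in a name such as 'pages/test_07.tiff'."""
    match = re.search(r"_(\d+)\.tiff$", path)
    # Files that do not match the pattern sort before all numbered pages.
    return int(match.group(1)) if match else -1


# glob.glob() makes no promise about ordering, so sort explicitly.
unordered = ["pages/test_10.tiff", "pages/test_02.tiff", "pages/test_00.tiff"]
ordered = sorted(unordered, key=page_index)
print(ordered)
```

In the script this would amount to wrapping the glob call, for example sorted(glob.glob("pages/*.tiff"), key=page_index). Sorting by the numeric value rather than by the string keeps the order right even if the page count outgrows the zero padding.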
A place to start: Tesseract install
Using tesseract in python: pytesseract
Training data used: eng.traineddata
Extra Source: pdf to tiff
Pytesseract: documentation
I hope this helps you. Not sure if this was something you were looking for.
You can use PDFMiner to do the job and in my experience it works better than other open source Python tools out there.
The key is to specify the laparams parameter correctly and not leave it at its default values. This parameter is used to give PDFMiner more information about the layout of the page. Since the text here corresponds to tables with wide spaces, we need to instruct PDFMiner to use a large character margin (char_margin).
The code for the layout is here. Play around with the parameters to find the values that give the best results for this particular document.
Here’s some sample code for the pdf in question. I am using only a single page for demonstration here:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
def convert_pdf_to_txt(path, pages):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams(all_texts=True, detect_vertical=True,
                        line_overlap=0.5, char_margin=1000.0,  # set char_margin to a large number
                        line_margin=0.5, word_margin=2,
                        boxes_flow=1)
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set(pages)
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text
pdf_text_page6 = convert_pdf_to_txt("15000691526464720.pdf", pages=[6])
The output for the given page (page 6 corresponding to page 7 in the document) looks like the block below. It is not perfect but all the numerical components of the table are captured in the same line as the text.
Page 7 of 11
Unaudited Unaudited
1Q2018 1Q2017 1Q2018 4Q2017
Revenues 73,528 49,552 73,528 66,392
VAS 46,877 35,108 46,877 39,947
Online advertising 10,689 6,888 10,689 12,361
Others 15,962 7,556 15,962 14,084
Cost of revenues (36,486) (24,109) (36,486) (34,897)
Gross profit 37,042 25,443 37,042 31,495
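Once each row arrives on a single line like this, the label and the figures can be separated with a little string handling. The following is a rough sketch rather than a complete parser: parse_row and to_number are hypothetical helper names, and it assumes figures are whole numbers with thousands separators, with parentheses marking negative values as is conventional in financial statements.

```python
import re

# A figure as it appears in the statement: 73,528 or (36,486).
NUMBER = re.compile(r"\(?\d[\d,]*\)?")


def to_number(token):
    """Convert '73,528' to 73528 and '(36,486)' to -36486."""
    negative = token.startswith("(") and token.endswith(")")
    value = int(token.strip("()").replace(",", ""))
    return -value if negative else value


def parse_row(line):
    """Split a row such as 'VAS 46,877 35,108' into a label and its figures."""
    tokens = line.split()
    values = []
    # Walk in from the right for as long as the tokens look like figures.
    while tokens and NUMBER.fullmatch(tokens[-1]):
        values.append(to_number(tokens.pop()))
    values.reverse()
    return " ".join(tokens), values


print(parse_row("Revenues 73,528 49,552 73,528 66,392"))
print(parse_row("Cost of revenues (36,486) (24,109) (36,486) (34,897)"))
```

Reading the figures from the right-hand end means labels that contain commas or parentheses, such as "Other gains, net", are left intact. Header lines like "1Q2018 1Q2017" come back with an empty list of figures.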
As shown in one of the answers above, to keep all the text that sits on one line of the page on one line of the output, you need to provide LAParams with char_margin set to a high number. Just keep trying higher numbers until you get the output that you want.
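The trial-and-error search for a char_margin value can be scripted. The sketch below assumes the pdfminer.six package is installed and uses its high-level extract_text function; the helper names try_char_margins and count_table_rows, the candidate margins, and the scoring rule (count the lines that start with a label and end in at least two figures) are illustrative assumptions rather than anything prescribed by PDFMiner.

```python
def is_figure(token):
    """True for tokens such as 73,528 or (36,486)."""
    return token.strip("()").replace(",", "").isdigit()


def count_table_rows(text):
    """Count lines that start with a label and end in at least two figures."""
    rows = 0
    for line in text.splitlines():
        tokens = line.split()
        if (len(tokens) >= 3 and not is_figure(tokens[0])
                and all(is_figure(t) for t in tokens[-2:])):
            rows += 1
    return rows


def try_char_margins(pdf_path, page_index, margins=(2.0, 10.0, 50.0, 200.0, 1000.0)):
    """Extract one page with several char_margin values and score each result."""
    # Imported here so the helpers above work without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    scores = {}
    for margin in margins:
        text = extract_text(pdf_path, page_numbers=[page_index],
                            laparams=LAParams(char_margin=margin))
        scores[margin] = count_table_rows(text)
    return scores
```

A call such as try_char_margins("15000691526464720.pdf", 6) would return a dictionary mapping each margin to the number of table-like rows found on that zero-indexed page; the smallest margin that reaches the highest count is a reasonable starting point.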
“straight across as how it looks in the PDF”
In a PDF, the plain text available for extraction is inset from the left margin, so copy and paste is usually a problem:
CONSOLIDATED INCOME STATEMENT
RMB in million, unless specified
Unaudited Unaudited
1Q2018 1Q2017 1Q2018 4Q2017
Revenues 73,528 49,552 73,528 66,392
VAS 46,877 35,108 46,877 39,947
Online advertising 10,689 6,888 10,689 12,361
Others 15,962 7,556 15,962 14,084
Cost of revenues (36,486) (24,109) (36,486) (34,897)
Gross profit 37,042 25,443 37,042 31,495
Gross margin 50% 51% 50% 47%
Interest income 1,065 808 1,065 1,156
Other gains, net 7,585 3,191 7,585 7,906
One way to bypass that issue is to extract the text with its layout preserved, and the simplest way to do that is to call pdftotext. The output can also be saved automatically as an inputname.txt file, ready to insert into a spreadsheet or document editor: simply remove the console redirect (the trailing -) from the end of the command.
Python/conda installs usually include pdftotext as part of poppler utils.
CONSOLIDATED INCOME STATEMENT
RMB in million, unless specified
Unaudited Unaudited
1Q2018 1Q2017 1Q2018 4Q2017
Revenues 73,528 49,552 73,528 66,392
VAS 46,877 35,108 46,877 39,947
Online advertising 10,689 6,888 10,689 12,361
Others 15,962 7,556 15,962 14,084
Cost of revenues (36,486) (24,109) (36,486) (34,897)
Gross profit 37,042 25,443 37,042 31,495
Gross margin 50% 51% 50% 47%
Interest income 1,065 808 1,065 1,156
Other gains, net 7,585 3,191 7,585 7,906
Selling and marketing expenses (5,570) (3,158) (5,570) (6,022)
General and administrative expenses (9,430) (7,012) (9,430) (8,811)
Operating profit 30,692 19,272 30,692 25,724
Operating margin 42% 39% 42% 39%
Finance costs, net (654) (691) (654) (859)
Share of profit/(loss) of associates and joint ventures (319) (375) (319) (120)
Profit before income tax 29,719 18,206 29,719 24,745
Income tax expense (5,746) (3,658) (5,746) (3,123)
Depending on how you loop over file names and which OS shell you use, the command can be varied, and there are many options for language or area of interest. However, at its most basic, start from:
pdftotext -nopgbrk -layout pathtofile.pdf
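Since the question is about Python, the same command can be driven through subprocess. This is a minimal sketch that assumes the pdftotext binary from poppler is on the PATH; pdftotext_command and extract_layout_text are hypothetical helper names, and the -f and -l flags select the first and last page to convert.

```python
import subprocess


def pdftotext_command(pdf_path, first_page=None, last_page=None):
    """Build a pdftotext command that prints layout-preserved text to stdout."""
    cmd = ["pdftotext", "-nopgbrk", "-layout"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    # The trailing "-" sends the text to stdout instead of writing a .txt file.
    return cmd + [pdf_path, "-"]


def extract_layout_text(pdf_path, first_page=None, last_page=None):
    """Run pdftotext and return its output as a string."""
    result = subprocess.run(pdftotext_command(pdf_path, first_page, last_page),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For the income statement in the question, extract_layout_text("15000691526464720.pdf", first_page=7, last_page=7) would limit the output to the page holding the table; check=True makes a failed conversion raise an exception instead of silently returning an empty string.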