Read PDF in Python and convert to text in PDF
Question:
I have used this code to convert pdf to text.
input1 = '//Home//Sai Krishna Dubagunta.pdf'
output = '//Home//Me.txt'
os.system(("pdftotext %s %s") %( input1, output))
I have created the Home directory and pasted the source file in it.
The output I get is
1
And no file with .txt was created. Where is the Problem?
Answers:
I think pdftotext command takes only one argument. Try using:
os.system(("pdftotext %s") % input1)
and see what happens. Hope this helps.
Your expression
("pdftotext %s %s") %( input1, output)
will translate to
pdftotext //Home//Sai Krishna Dubagunta.pdf //Home//Me.txt
which means that the first parameter passed to pdftotext is //Home//Sai, and the second parameter is Krishna. That obviously won’t work.
Enclose the parameters in quotes:
os.system("pdftotext '%s' '%s'" % (input1, output))
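As a side note, the single quotes above only help in a POSIX-style shell; cmd.exe on Windows does not treat them as quoting characters. A way to avoid shell quoting altogether is to pass the arguments as a list to subprocess.run, so a path with spaces stays one argument. This is a minimal sketch, not the only way to do it: the helper names build_pdftotext_cmd and run_pdftotext are made up for illustration, and it assumes the pdftotext binary is installed and on your PATH. The 1 in the question also looks like a non-zero exit status rather than real output, so the sketch returns the exit code for you to check.

```python
import subprocess
from typing import List


def build_pdftotext_cmd(pdf_path: str, txt_path: str) -> List[str]:
    """Build the argument list; each path is one argument, spaces included."""
    # Hypothetical helper name, used here only for illustration.
    return ["pdftotext", pdf_path, txt_path]


def run_pdftotext(pdf_path: str, txt_path: str) -> int:
    """Run pdftotext without a shell and return its exit code (0 on success)."""
    # No shell is involved, so no quoting is needed.
    # Raises FileNotFoundError if the pdftotext binary is not on PATH.
    result = subprocess.run(build_pdftotext_cmd(pdf_path, txt_path))
    return result.returncode


# Example call (not executed here, it needs pdftotext and the input file):
# exit_code = run_pdftotext("//Home//Sai Krishna Dubagunta.pdf", "//Home//Me.txt")
```

If the returned code is not 0, pdftotext failed, and its error message on stderr usually says why (for example, that the input file could not be opened).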
There are various Python packages to extract the text from a PDF with Python. You can see a speed/quality benchmark.
As the maintainer of pypdf and PyPDF2 I am biased, but I would recommend pypdf for people to start. It’s pure Python and has a BSD 3-clause license. That should work for most people. Also pypdf can do way more with PDF files (e.g. transformations).
If you feel comfortable with the C-dependency and don’t want to modify the PDF, give pypdfium2 a shot. pypdfium2 is really fast and has an amazing extraction quality.
I previously recommended Poppler’s pdftotext. Don’t use that. Its quality is worse than PDFium/PyPDF2.
Tika and PyMuPDF work about as well as PDFium, but they also have a non-Python dependency. PyMuPDF might not work for you due to the commercial license.
I would NOT use pdfminer / pdfminer.six / pdfplumber / pdftotext / borb / PyPDF2 / PyPDF3 / PyPDF4.
pypdf: Pure Python
Installation: pip install pypdf
(more instructions)
from pypdf import PdfReader

reader = PdfReader("example.pdf")
text = ""
for page in reader.pages:
    # append each page's text, separated by a newline
    text += page.extract_text() + "\n"
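Since the question was about getting a .txt file, here is a rough sketch of how the extracted text could be written to disk. The function names join_page_texts and pdf_to_txt are my own, not part of pypdf, and treating a None page as empty is a defensive assumption rather than documented behaviour. pypdf is imported inside the function so the joining helper can be used without it installed.

```python
from pathlib import Path
from typing import Iterable, Optional


def join_page_texts(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page strings with newlines, treating None as an empty page."""
    # Hypothetical helper; the None handling is a defensive assumption.
    return "\n".join(text or "" for text in page_texts)


def pdf_to_txt(pdf_path: str, txt_path: str) -> None:
    """Extract text with pypdf and write it to txt_path as UTF-8."""
    from pypdf import PdfReader  # lazy import; requires `pip install pypdf`

    reader = PdfReader(pdf_path)
    text = join_page_texts(page.extract_text() for page in reader.pages)
    Path(txt_path).write_text(text, encoding="utf-8")


# Example call (not executed here, it needs pypdf and the input file):
# pdf_to_txt("//Home//Sai Krishna Dubagunta.pdf", "//Home//Me.txt")
```

Joining once at the end also avoids building the string with repeated += in the loop, which matters little for small files but is tidier for long documents.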
PDFium: High quality and very fast, but with C-dependency
Installation: pip install pypdfium2
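import pypdfium2 as pdfium

# `data` is the PDF input, e.g. a file path or the raw bytes of the document
text = ""
pdf = pdfium.PdfDocument(data)
for i in range(len(pdf)):
    page = pdf.get_page(i)
    textpage = page.get_textpage()
    # get_text() matches the pypdfium2 release this answer was written for;
    # newer releases may name it get_text_range()
    text += textpage.get_text()
    text += "\n"
    # release the native handles explicitly, child object first
    textpage.close()
    page.close()
pdf.close()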