Reading a PDF with PyPDF2 returns no text
Question:
Here is my code, adapted from http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-converter/ .
I modified it to use PyPDF2, the newer version of pyPdf.
    import PyPDF2

    def getPDFContent(path):
        content = ""
        # Load PDF into pyPDF
        pdf = PyPDF2.PdfFileReader(file(path, "rb"))
        # Iterate pages
        print "Number of pages is ", pdf.getNumPages()
        for i in range(0, pdf.getNumPages()):
            # Extract text from page and add to content
            content += pdf.getPage(i).extractText() + "\n"
            print (content)
        # Collapse whitespace
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content

    print getPDFContent("RL.pdf").encode("ascii", "xmlcharrefreplace")
The file I am reading is here.
http://dmc.kar.nic.in/RL.pdf
All I get is this:
Number of pages is 1
The output is blank after this.
Is this a problem with the PDF or am I going wrong somewhere?
All help appreciated!
Answers:
The file turned out to be corrupt.
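If you want to rule out a damaged file before blaming the extraction code, a rough sanity check along these lines may help. This is only a minimal sketch in Python 3 using the standard library; the function name `looks_like_intact_pdf` and the 1024-byte windows are my own assumptions, not part of PyPDF2. It only checks that a `%PDF-` header appears near the start and a `%%EOF` marker near the end, so a file can pass and still be corrupt internally.

```python
import os


def looks_like_intact_pdf(path, window=1024):
    """Heuristic only: True if '%PDF-' appears in the first `window` bytes
    and '%%EOF' appears in the last `window` bytes of the file.

    `looks_like_intact_pdf` and `window` are hypothetical names chosen
    for this sketch; they are not PyPDF2 API.
    """
    with open(path, "rb") as f:
        head = f.read(window)
        # Jump to the end to find the size, then read the trailing window.
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - window))
        tail = f.read()
    return b"%PDF-" in head and b"%%EOF" in tail
```

Calling `looks_like_intact_pdf("RL.pdf")` on a truncated or partially downloaded file would typically return False because the `%%EOF` marker is missing. A True result does not guarantee that text can be extracted.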