UnicodeEncodeError when reading pdf with pyPdf

Question

Guys i had posted a question earlier pypdf python tool .dont mark this as duplicate as i get this error indicated below

  import sys
  import pyPdf

  def convertPdf2String(path):
      content = ""
      # load PDF file
      pdf = pyPdf.PdfFileReader(file(path, "rb"))
      # iterate pages
      for i in range(0, pdf.getNumPages()):
          # extract the text from each page
          content += pdf.getPage(i).extractText() + " n"
      # collapse whitespaces
      content = u" ".join(content.replace(u"xa0", u" ").strip().split())
      return content

  # convert contents of a PDF file and store retult to TXT file
  f = open('a.txt','w+')
  f.write(convertPdf2String(sys.argv[1]))
  f.close()

  # or print contents to the standard out stream
  print convertPdf2String("/home/tom/Desktop/Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")

I get this error for a the 1st pdf file
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
and the following error for this pdf http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf

UnicodeEncodeError: 'ascii' codec can't encode character u'xe7' in position 38: ordinal not in range(128)

How to resolve this

Asked By: Hulk

||

Source

Answer 1

I tried it myself and got the same result. Ignore my comment, I hadn’t seen that you’re writing the output to a file as well. This is the problem:

f.write(convertPdf2String(sys.argv[1]))

As convertPdf2String returns a Unicode string, but file.write can only write bytes, the call to f.write tries to automatically convert the Unicode string using ASCII encoding. As the PDF obviously contains non-ASCII characters, that fails. So it should be something like

f.write(convertPdf2String(sys.argv[1]).encode("utf-8"))
# or
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))

EDIT:

The working source code, only one line changed.

# Execute with "Hindi_Book.pdf" in the same directory
import sys
import pyPdf

def convertPdf2String(path):
    content = ""
    # load PDF file
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # iterate pages
    for i in range(0, pdf.getNumPages()):
        # extract the text from each page
        content += pdf.getPage(i).extractText() + " n"
    # collapse whitespaces
    content = u" ".join(content.replace(u"xa0", u" ").strip().split())
    return content

# convert contents of a PDF file and store retult to TXT file
f = open('a.txt','w+')
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
f.close()

# or print contents to the standard out stream
print convertPdf2String("Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")

Answered By: AndiDog

UnicodeEncodeError when reading pdf with pyPdf

Question:

Answers: