Python Unicode Error Reading an Arabic PDF into txt

Question:

Goal

To convert a PDF file that contains some Arabic text into a UTF-8 text file in Python using pyPdf.

Code

What I have tried:

import pyPdf
import codecs
input_filepath = "hans_wehr_searchable_pdf.pdf"  # PDF file path
output_filepath = "output.txt"  # output text file path
output_file = open(output_filepath, "wb")  # open output file
pdf = pyPdf.PdfFileReader(open(input_filepath, "rb"))  # read PDF
for page in pdf.pages:  # loop through pages
    page_text = page.extractText()  # get text from page
    page_text = page_text.decode(encoding='utf-8')  # decode
    output_file.write(page_text)  # write to file
output_file.close()  # close

Error

However, I receive this error:

Traceback (most recent call last):
  File "pdf2txt.py", line 9, in <module>
    page_text = page_text.decode(encoding='windows-1256')#decode 
  File "/usr/lib/python2.7/encodings/cp1256.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 98: ordinal not in range(128)
Asked By: user3137329


Answers:

Rather than decoding the extracted text yourself, open the output file with codecs.open and specify its encoding (you have already imported codecs). extractText() returns a unicode object, so calling .decode() on it makes Python 2 first encode the text with the ASCII codec, and that implicit step is what raises the UnicodeEncodeError. The PDF itself is binary data and should still be opened with the built-in open in "rb" mode. Your code would change to:

import pyPdf
import codecs

input_filepath = "hans_wehr_searchable_pdf.pdf"  # PDF file path
output_filepath = "output.txt"  # output text file path

# Open the output through codecs so unicode text is encoded as UTF-8 on write
output_file = codecs.open(output_filepath, "w", encoding="utf-8")
# A PDF is binary data, so the input stays opened in plain "rb" mode
pdf = pyPdf.PdfFileReader(open(input_filepath, "rb"))
for page in pdf.pages:  # loop through pages
    page_text = page.extractText()  # already unicode; do not decode it again
    output_file.write(page_text)  # codecs encodes to UTF-8 here
output_file.close()  # close
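
The underlying issue can be reproduced without a PDF at all: text that is already decoded cannot pass through the ASCII codec, which is what an extra .decode() call forces in Python 2. Below is a minimal sketch that runs on Python 2.7 and Python 3. The helper names write_pages_as_utf8 and ascii_encode_fails, and the sample strings, are assumptions made for illustration; the list of strings stands in for what page.extractText() returns.

```python
import io
import os
import tempfile


def write_pages_as_utf8(pages, output_path):
    """Write an iterable of already-decoded text strings to a UTF-8 file.

    Hypothetical helper for illustration; `pages` stands in for the
    strings returned by page.extractText().
    """
    # io.open with an encoding accepts text and encodes it on write,
    # so no manual .encode()/.decode() calls are needed.
    with io.open(output_path, "w", encoding="utf-8") as output_file:
        for page_text in pages:
            output_file.write(page_text)


def ascii_encode_fails(text):
    """Return True if `text` cannot be represented in ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


# Sample text: an Arabic word plus the trademark sign (U+2122) from the traceback
sample_pages = [u"\u0643\u062a\u0627\u0628 \u2122\n", u"page two\n"]
output_path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_pages_as_utf8(sample_pages, output_path)
```

Here ascii_encode_fails(u"\u2122") is the same failure the traceback reports, while write_pages_as_utf8 avoids it by letting the file object do the encoding.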
Answered By: Cory Shay