Python – convert pdf to text, encoding error

Question:

I tried to convert a PDF document to a text file.
(example of pdf file link)

I tried the code below, but the extracted text comes out as garbage, like ??챘#?遏?h첨챦_철?‾n?~w??¬?k
How can I fix it?

#!/usr/bin/python
# -*- coding: cp949 -*-
# -*- coding: utf-8 -*-
# -*- coding: latin-1 -*-
# -*- coding: euc-kr -*-

import codecs
import pyPdf
filename = "d:/data/processed_data/paper/iscram/2006/iscram1.pdf"
#pdf = codecs.open(filename, "rb", encoding = 'utf-8') 
pdf = codecs.open(filename, "rb", encoding = 'latin1')
for page in pdf:
    print page.encode('utf-8')

I am using the Korean version of 64-bit Windows 7.

I also tried another way, using pyPdf, as shown below:

import os
import glob
from pyPdf import PdfFileReader
import pdfminer

f=open("d:/data/processed_data/paper/iscram/2006/iscram1.txt",'w')
parent = "d:/data/processed_data/paper/iscram/2006"
os.chdir(parent)
filename = os.path.abspath('iscram1.pdf')

input = PdfFileReader(file(filename, "rb"))
for page in input.pages:
    f.write(page.extractText())

But it doesn’t work either; it fails with the error "'ascii' codec can't encode character u'\u0152' in position 602: ordinal not in range(128)".

Asked By: user3704652


Answers:

The first snippet cannot work at all: a PDF file does not necessarily contain directly readable text, so reading its raw bytes as latin-1 only produces garbage. The second snippet, which uses pyPdf, looks more promising.

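The error is raised because extractText() returns a unicode string, while f.write expects a byte string. Python 2 therefore tries to encode the text as ASCII first, and that fails on the first non-ASCII character (here u'\u0152').

Thus you might try encoding the extracted text explicitly before writing it: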

for page in input.pages:
    f.write(page.extractText().encode('UTF-8'))
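
As an alternative sketch, io.open (available in Python 2.6+ and in Python 3) lets the file object do the encoding, so unicode text can be written directly. The helper name write_pages and the sample strings are made up for illustration; in the real script the strings would come from page.extractText().

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_pages(path, page_texts, encoding="utf-8"):
    """Write an iterable of unicode strings to path, one per line.

    write_pages is a hypothetical helper, not part of pyPdf.
    """
    # io.open encodes on write, so Python never falls back to ASCII.
    with io.open(path, "w", encoding=encoding) as out:
        for text in page_texts:
            out.write(text)
            out.write(u"\n")


if __name__ == "__main__":
    # Stand-in for the text of each page; u"\u0152" is the character
    # named in the error message from the question.
    sample_pages = [u"C\u0152UR", u"second page"]
    target = os.path.join(tempfile.mkdtemp(), "iscram1.txt")
    write_pages(target, sample_pages)
    with io.open(target, "r", encoding="utf-8") as check:
        print(repr(check.read()))
```

Using a with block also closes the output file, which the original script never does.
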
  1. The PDF command stream is encoded with an encoding similar to latin-1.
  2. The command stream includes instructions to display stuff on the page.
  3. Where this stuff is “text”, it is actually instructions to display character shapes, i.e. glyphs taken from a font (or a subset of a font, or a combination of bits of several fonts).
  4. Most of the time the information needed to translate the bytes in these instructions to (say) Unicode text is stored within the PDF, but sometimes it is not, and sometimes the translation is not possible at all (for example, where the font prints a logo).
  5. PyPDF2 (and many other open source PDF packages) does not include functionality to deal with the full complexity of this. Fortunately, many creators of documents rely on a small set of “standard encodings”, which include a number of latin-1 variants, and the ‘extract text’ function does provide usable results in these cases. I have also found PDFs where the font definitions have replacement mappings that give you the name of the glyph for each byte used, and found it easy to modify PyPDF2 to take care of this. Other cases are not so simple.
  6. Finally, there are two other factors that need to be taken into account when trying to extract readable text from PDFs. The first is that some PDF streams are compressed and some are encrypted; PyPDF2 can take care of both of these cases. The second is that PDF instructions only put characters at specific points on the page. In most cases PDF writers write the data in reading order, but they may make positioning changes within words as well as at word breaks.
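
Points 3 to 6 can be illustrated with a small sketch that uses only the standard library. The content stream, the font name F1 and the byte-to-Unicode table below are all made up for illustration. A real PDF stores such a mapping (when it has one) inside its font definitions, and real content streams need a proper parser rather than this naive regular expression.

```python
import re
import zlib

# A tiny hand-written content stream: begin text, select font F1 at
# 12pt, show a three-byte string with the Tj operator, end text.
content = b"BT /F1 12 Tf (\x01\x02\x03) Tj ET"

# Many PDFs store streams deflate-compressed (the FlateDecode filter).
stored = zlib.compress(content)

# Decoding the stored bytes as latin-1 never raises, but the result is
# noise, much like the output of the first script in the question.
as_latin1 = stored.decode("latin-1")

# Hypothetical per-font table from byte code to Unicode character.
FONT_F1_TO_UNICODE = {1: u"P", 2: u"D", 3: u"F"}


def show_strings(stream_bytes):
    """Return the raw byte strings passed to Tj (naive: no escapes)."""
    return re.findall(br"\((.*?)\)\s*Tj", stream_bytes, re.S)


def decode_with_font(raw, table):
    """Map each byte through the font table; unknown codes become U+FFFD."""
    return u"".join(table.get(code, u"\ufffd") for code in bytearray(raw))


if __name__ == "__main__":
    plain = zlib.decompress(stored)
    for raw in show_strings(plain):
        print(decode_with_font(raw, FONT_F1_TO_UNICODE))
```

The bytes inside the parentheses only become readable once they are looked up in the table that belongs to the selected font, which is why decoding the whole file with a single text encoding cannot recover the text.
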

Answered By: user13526470