PDFminer gives strange letters

Question:

I am using Python 2.7 and PDFMiner to extract text from PDFs. I noticed that PDFMiner sometimes gives me words with strange letters where PDF viewers do not. For some PDF documents, PDFMiner and PDF viewers return the same (strange) result, but there are documents where PDF viewers can recognize the text (copy-paste) and PDFMiner cannot. Here is an example of the returned values:

from pdf viewer: ‫فتــح بـــاب ا�ستيــراد البيــ�ض والدجــــاج المجمـــد‬
from PDFMiner: óªéªdG êÉ````LódGh ¢†``«ÑdG OGô``«à°SG ÜÉH í“àa

So my question is: can I get the same result as the PDF viewer, and what is wrong with PDFMiner? Is it missing encodings? I don't know.

Asked By: Milan Kocic


Answers:

Yes.

This happens when custom font encodings have been used (e.g. Identity-H, Identity-V) but the fonts have not been embedded properly.

pdfminer gives garbage output in such cases because the encoding is required to map the character codes stored in the file back to readable text.
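
One way to notice this failure automatically is to check whether the extracted text is in the script you expect. The following is a minimal sketch and not part of pdfminer: `script_ratio` and `looks_garbled` are hypothetical helper names, and the 0.5 threshold is an assumption you would need to tune. It relies only on the standard `unicodedata` module.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import unicodedata


def script_ratio(text, script_prefix):
    """Fraction of alphabetic characters whose Unicode name starts with script_prefix."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters
               if unicodedata.name(ch, '').startswith(script_prefix))
    return hits / float(len(letters))


def looks_garbled(text, script_prefix='ARABIC', threshold=0.5):
    """Heuristic (assumed threshold): too few letters in the expected script."""
    return script_ratio(text, script_prefix) < threshold


# A fragment of the PDFMiner output quoted in the question (Latin-1 letters only)
pdfminer_sample = 'óªéªdG êÉ'
# Two Arabic words written with escapes to avoid bidi display issues
arabic_sample = '\u0641\u062a\u062d \u0628\u0627\u0628'

print(looks_garbled(pdfminer_sample))
print(looks_garbled(arabic_sample))
```

For the sample in the question, the PDFMiner output contains no Arabic letters at all, so a check like this could be used to decide when to fall back to another extractor.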

Answered By: codingscientist

Maybe the PDF file you are trying to read uses an encoding not yet supported by PDFMiner.

I had a similar problem last month and finally solved it by using a Java library named PDFBox and calling it from Python. PDFBox supported the encoding that I needed and worked like a charm!

First I downloaded PDFBox from the official site and then referenced the path to the .jar file from my code.

Here is a simplified version of the code I used (untested, but based on my original tested code). You will need subprocess32, which you can install by running pip install subprocess32:

import subprocess32 as subprocess
import os
import tempfile


class RunnableError(Exception):
    """Raised when the PDFBox subprocess fails or times out."""


def extractPdf(file_path, pdfboxPath, timeout=30, encoding='UTF-8'):
    # Run PDFBox's ExtractText tool; -console sends the extracted text to stdout
    command_args = ['java', '-jar', os.path.expanduser(pdfboxPath), 'ExtractText', '-console', '-encoding', encoding, file_path]
    try:
        status, stdout, stderr = external_process(command_args, timeout=timeout)
    except subprocess.TimeoutExpired:
        raise RunnableError('PDFBox timed out while processing document')

    if status != 0:
        raise RunnableError('PDFBox returned error status code {0}.\nPossible error:\n{1}'.format(status, stderr))

    # We can use result from PDFBox directly, no manipulation needed
    pdf_plain_text = stdout
    return pdf_plain_text

def external_process(process_args, input_data='', timeout=None):
    process = subprocess.Popen(process_args,
                               stdout=subprocess.PIPE,
                               stdin=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    try:
        (stdout, stderr) = process.communicate(input_data, timeout)
    except subprocess.TimeoutExpired:
        # cleanup process
        # see https://docs.python.org/3.3/library/subprocess.html?highlight=subprocess#subprocess.Popen.communicate
        process.kill()
        process.communicate()
        raise

    exit_status = process.returncode
    return (exit_status, stdout, stderr)


def temp_file(data, suffix=''):
    handle, file_path = tempfile.mkstemp(suffix=suffix)
    f = os.fdopen(handle, 'w')
    f.write(data)
    f.close()
    return file_path

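if __name__ == '__main__':
    import sys
    # The PDF path is taken from the command line; adjust the jar path to your download
    text = extractPdf(sys.argv[1], 'pdfbox-app-2.0.3.jar')
    print(text)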

This code was not entirely written by me. I followed the suggestions of other Stack Overflow answers, but that was a month ago and I have lost the original sources. If anyone finds the original posts where I got the pieces of this code, please let me know so I can give them the credit they deserve.
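
For readers on Python 3.5 or later, the standard `subprocess` module already supports timeouts, so the subprocess32 backport should not be needed. The following is a rough sketch of the same PDFBox call under that assumption: `build_pdfbox_command` and `extract_text` are hypothetical names, the command-line flags are copied from the code above, and it has not been run against a real PDFBox jar.

```python
import os
import subprocess


def build_pdfbox_command(file_path, pdfbox_path, encoding='UTF-8'):
    """Build the ExtractText command line, using the same flags as the code above."""
    return ['java', '-jar', os.path.expanduser(pdfbox_path),
            'ExtractText', '-console', '-encoding', encoding, file_path]


def extract_text(file_path, pdfbox_path, timeout=30, encoding='UTF-8'):
    """Run PDFBox and return the extracted text as a str."""
    # subprocess.run kills the child and raises TimeoutExpired if the timeout elapses
    result = subprocess.run(
        build_pdfbox_command(file_path, pdfbox_path, encoding),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(
            'PDFBox returned error status code {0}.\nPossible error:\n{1}'.format(
                result.returncode, result.stderr.decode(encoding, errors='replace')))
    return result.stdout.decode(encoding)


# Show the command that would be executed, without needing Java installed
print(build_pdfbox_command('document.pdf', 'pdfbox-app-2.0.3.jar'))
```

Separating the command construction from the call keeps the part that depends on PDFBox's command-line interface in one place, which makes it easier to adjust if a different PDFBox version expects different arguments.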

Answered By: caspillaga