Python PdfMiner – How to get the info on the orientation of each word/sentence included in a pdf?

Question

Target:
I want to extract the orientation of each word or sentence from a PDF like the attached one. The reason is that I want to keep only the text with an orientation of 0 degrees, not the text at 90, 180 or 270 degrees.

What I have tried:
The first thing I tried was the detect_vertical parameter of PDFMiner's LAParams, but this does not help me.

With "detect_vertical=True" I get the text from all of the orientations, but the sentences rotated by 180 degrees (the inverted ones) come out in the wrong order:

*Upper side, third line
Upper side, second line
This is the upper side of the box. *

With "detect_vertical=False" I get the text from the sides one character at a time, and the text rotated by 180 degrees (the inverted one) is still in the wrong order.

Since I only want to keep the text with an orientation of 0 degrees, neither setting helps me.

The code used for this is the following:

from pdfminer.high_level import extract_pages 
from pdfminer.layout import LTTextContainer, LAParams

page_info = list(extract_pages('pdfminer/text_with_orientation.pdf',
                               laparams=LAParams(detect_vertical=True)))

for page in page_info:
    for element in page:
        if isinstance(element, LTTextContainer): 
            print(element.get_text())

The second thing I tried was to get this info from the lowest level of the PDF layout (LTChar), as described here: https://pdfminersix.readthedocs.io/en/latest/topic/converting_pdf_to_text.html#working-with-rotated-characters

This is the code I used for this attempt. Unfortunately I can only get the font name, the font size and the coordinates of each character, not the orientation:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LAParams , LTChar
 
page_info = list(extract_pages('pdfminer/text_with_orientation.pdf',
                               laparams=LAParams(detect_vertical=True)))
for page in page_info:
    for element in page:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        print('======================')
                        print('text:',character.get_text()) 
                        print('fontname:',character.fontname[7:])
                        print('size:',character.size)   
                        print('adv:',character.adv)   # textwidth * fontsize * scaling  
                        print('matrix:',character.matrix)  
                        (_,_,x,y) = character.bbox 
                        print('x dim:',x,'and y dim:',y) 
                        print('\n')

What I do not want to use:

I do not want to use Tesseract, as I have already tried it and the results are not as good as with PDFMiner.

Any suggestions on this?

Asked By: Vagelis

Answer 1

After a lot of investigation I finally found a way to do this at the character level, using the matrix included in LTChar.

To get all of the characters with an orientation of 0 degrees, I do the following:

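# label_pages is presumably the list of pages returned by extract_pages(),
# like page_info in the question; the imports are the same as there.
for page in label_pages:
    for element in page:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        # matrix[0] is positive at 0 degrees, zero at 90/270
                        # degrees and negative at 180 degrees
                        if character.matrix[0] > 0:
                            print('======================')
                            print('text:', character.get_text())
                            print('matrix:', character.matrix)
                            (_, _, x, y) = character.bbox
                            print('x dim:', x, 'and y dim:', y)
                            print('\n')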
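
To rebuild readable text from the upright characters instead of printing them one by one, the check above could be wrapped in a small helper. This is only a sketch: `upright_text` and its `tol` parameter are hypothetical names, not part of pdfminer, and it assumes that objects without a `matrix` attribute (such as the LTAnno spaces and newlines that pdfminer inserts) should be kept only when the character just before them was kept.

```python
def upright_text(chars, tol=1e-3):
    """Join the text of the characters whose matrix indicates roughly 0 degrees.

    chars: an iterable of layout objects, e.g. one text line from pdfminer.
    tol:   allowed ratio |b| / a, so tiny numeric noise still counts as upright.
    """
    kept = []
    keep_prev = False
    for ch in chars:
        matrix = getattr(ch, "matrix", None)
        if matrix is None:
            # No matrix (e.g. an inserted space): follow the previous character.
            if keep_prev:
                kept.append(ch.get_text())
            continue
        a, b = matrix[0], matrix[1]
        # Upright text has a > 0 (cos 0 = 1, possibly scaled) and b close to 0.
        keep_prev = a > 0 and abs(b) <= tol * a
        if keep_prev:
            kept.append(ch.get_text())
    return "".join(kept)
```

Inside the loops above it would be called as `upright_text(text_line)` for each text line. Compared with testing `matrix[0] > 0` alone, it also rejects text at angles such as 45 degrees, where `matrix[0]` is still positive.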

Answered By: Vagelis

Answer 2

As stated before, the orientation of a character is encoded in a 6-element array that describes all the transformations applied to the character (translation, scaling, rotation and skewing).

A rotation by an angle θ is encoded as follows: [cos θ sin θ −sin θ cos θ 0 0]

By looking at the orientation of the first character of an LTTextBox, you can assume the general orientation of the box like this:
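
Given that form, the angle can be recovered from the first two elements with atan2, which also works when the matrix carries a scale factor (for example a font size). A minimal sketch, where `rotation_degrees` and `snap` are hypothetical names and not part of pdfminer:

```python
import math


def rotation_degrees(matrix, snap=90):
    """Return the rotation in degrees encoded in a 6-element matrix (a, b, c, d, e, f).

    With a = s*cos(theta) and b = s*sin(theta), theta = atan2(b, a) for any
    positive scale s. If snap is non-zero, the result is rounded to the
    nearest multiple of snap degrees.
    """
    a, b = matrix[0], matrix[1]
    angle = math.degrees(math.atan2(b, a)) % 360
    if snap:
        angle = (round(angle / snap) * snap) % 360
    return angle


print(rotation_degrees((0, 1, -1, 0, 0, 0)))     # 90
print(rotation_degrees((-12, 0, 0, -12, 50, 70)))  # 180
```

Skew is ignored here, since only the first two elements are used.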

from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTChar
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator

fp = open(<filepath>, 'rb')
rsrcmgr = PDFResourceManager()
laparams = LAParams(detect_vertical=True, all_texts=True)
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
pages = PDFPage.get_pages(fp)

for page in pages:
    interpreter.process_page(page)
    layout = device.get_result()

    for lobj in layout:
        if isinstance(lobj, LTTextBox):

            # LTTextBox, LTTextLine, LTChar are not subscriptable
            # we can access the first character using the list function:
            first_matrix = list(list(lobj)[0])[0].matrix

            # a single if/elif chain, so exactly one branch sets the rotation;
            # the exact comparisons assume the matrix carries no scaling
            if first_matrix[0] == 0 and first_matrix[1] == 1:
                rotation = 90
            elif first_matrix[0] == -1 and first_matrix[1] == 0:
                rotation = 180
            elif first_matrix[0] == 0 and first_matrix[1] == -1:
                rotation = 270
            else:
                rotation = 0
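
If the first character is not representative of the box (or is not an LTChar at all), one possible variation is to let every character vote and take the most common orientation. This is a sketch under assumptions: `dominant_rotation` is a hypothetical helper, not a pdfminer function, and it treats any object without a `matrix` attribute as non-voting.

```python
import math
from collections import Counter


def dominant_rotation(chars):
    """Return the most common rotation (0, 90, 180 or 270) among chars.

    Objects without a matrix attribute are skipped. Returns None when no
    object carries a matrix.
    """
    votes = Counter()
    for ch in chars:
        matrix = getattr(ch, "matrix", None)
        if matrix is None:
            continue
        angle = math.degrees(math.atan2(matrix[1], matrix[0]))
        # Snap to the nearest quarter turn and fold negatives into 0..270.
        votes[(round(angle / 90) % 4) * 90] += 1
    return votes.most_common(1)[0][0] if votes else None
```

For a text box from the loop above, the characters could be flattened first with `[ch for line in lobj for ch in line]` and passed to the helper.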

ref:

https://pdfminersix.readthedocs.io/en/latest/tutorial/extract_pages.html
https://github.com/pdfminer/pdfminer.six/issues/454
https://ghostscript.com/~robin/pdf_reference17.pdf (SECTION 4.2 – 4.2.2 Common Transformations)

Answered By: Thom