Find text position in PDF file

Question:

I have a PDF file and I am trying to find a specific text in the PDF and highlight it using Python.
I found pypdf, which can highlight part of a PDF when we give the coordinates of the wanted highlight position in the file.

I am trying to find a tool which can give me the position of a given text in the PDF.

Asked By: Simdan

||

Answers:

PyMuPDF can find text by coordinates. You can use this in conjunction with the PyPDF2 highlighting method to accomplish what you’re describing. Or you can just use PyMuPDF to highlight the text.

Here is sample code for finding text and highlighting with PyMuPDF:

import fitz

### READ IN PDF
doc = fitz.open("input.pdf")

for page in doc:
    ### SEARCH
    text = "Sample text"
    text_instances = page.search_for(text)

    ### HIGHLIGHT
    for inst in text_instances:
        highlight = page.add_highlight_annot(inst)
        highlight.update()


### OUTPUT
doc.save("output.pdf", garbage=4, deflate=True, clean=True)
Answered By: Cilantro Ditrek

If you are on Windows and have Acrobat Pro (not reader), you can try the old Component Object Model with Python or VBA.

enter image description here

import win32com, winerror, os
from win32com.client.dynamic import ERRORS_BAD_CONTEXT
ERRORS_BAD_CONTEXT.append(winerror.E_NOTIMPL)
win32com.client.gencache.EnsureModule('{E64169B3-3592-47d2-816E-602C5C13F328}', 0, 1, 1)
avDoc = win32com.client.DispatchEx('AcroExch.AVDoc')
avDoc.Open(src, src)

avDoc.BringToFront()
pdDoc = avDoc.GetPDDoc()
jsoObject = pdDoc.GetJSObject()

for pageNo in range(1):
    pdfPage = pdDoc.AcquirePage(pageNo)
    pageHL = win32com.client.DispatchEx('AcroExch.HiliteList')
    _ = pageHL.Add(0, 9000)
    pageSel = pdfPage.CreatePageHilite(pageHL)

    pdfText = ""
    for wordNo in range(pageSel.GetNumText()):
        word = pageSel.GetText(wordNo)
        pdfText += word

        if keyword in pdfText:
            wordToHl = win32com.client.DispatchEx('AcroExch.HiliteList')
            wordToHl.Add(wordNo, 1)
            wordHl = pdfPage.CreateWordHilite(wordToHl)
            rect = wordHl.GetBoundingRect()
            annot = jsoObject.AddAnnot()
            props = annot.GetProps()
            props.Type = "Square"
            props.Page = pageNo
            props.Hidden = False
            props.Lock = True
            props.Name = word
            props.NoView = False
            props.Opacity = 0.3
            props.ReadOnly = True
            props.Style = "S"
            props.ToggleNoView = False
            props.PopupOpen = False
            popupRect = [rect.Left - 5, rect.Top + 5, rect.Left + 40, rect.Top - 20]
            props.Rect = popupRect
            props.PopupRect = popupRect
            props.StrokeColor = jsoObject.Color.Red
            props.FillColor = jsoObject.Color.Yellow

            annot.SetProps(props)
            print(f'Found {keyword}')
Answered By: Yiping

With the new version of PyMuPDF, some methods got depreciated.
Here is the sample code as per the recent version. Secondly, I’ve also added a comment for each highlight which facilities the user to transverse.

pdfIn = fitz.open("page-4.pdf")

for page in pdfIn:
    print(page)
    texts = ["SEPA", "voorstelnummer"]
    text_instances = [page.search_for(text) for text in texts] 
    
    # coordinates of each word found in PDF-page
    print(text_instances)  

    # iterate through each instance for highlighting
    for inst in text_instances:
        annot = page.add_highlight_annot(inst)
        # annot = page.add_rect_annot(inst)
        
        ## Adding comment to the highlighted text
        info = annot.info
        info["title"] = "word_diffs"
        info["content"] = "diffs"
        annot.set_info(info)
        annot.update()


# Saving the PDF Output
pdfIn.save("page-4_output.pdf")

Answered By: RevolverRakk
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.