List matches of page.search_for() with PyMuPDF

Question:

I’m writing a script to highlight text from a list of quotes in a PDF. The quotes are in the list text_list. I use this code to highlight the text in the PDF:

import fitz  # PyMuPDF

# Load the document
doc = fitz.open(filename)

# Running total of hit quads returned by search_for
hits = 0

# Iterate over pages
for page in doc:
    # Search for each quote on this page and annotate the hits
    for text in text_list:
        rl = page.search_for(text, quads=True)
        page.add_highlight_annot(rl)
        hits += len(rl)

# Print how many results were found
print(str(hits) + " instances highlighted in pdf")

I now want to get a list of the quotes that were not found and highlighted and was wondering if there is any simple way to get a list of the matches page.search_for found (or of those quotes it didn’t find).

Asked By: SamVimes


Answers:

The list of hit rectangles / quads rl will be empty if nothing was found.
I suggest you check if rl == []: and, depending on the result, either add the highlight or append the respective text to some no_hit list.
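
A minimal sketch of that check, assuming doc is the opened document and text_list is the list of quotes from the question. The helper name highlight_and_collect_misses is made up for illustration. Because a quote can be present on one page and absent on another, the sketch records each miss together with its page index:

```python
def highlight_and_collect_misses(doc, text_list):
    """Highlight every hit; return (page_index, text) pairs that had no hit."""
    no_hit = []
    for page_index, page in enumerate(doc):
        for text in text_list:
            rl = page.search_for(text, quads=True)
            if rl:
                # Non-empty list: at least one hit on this page
                page.add_highlight_annot(rl)
            else:
                # Empty list: this quote was not found on this page
                no_hit.append((page_index, text))
    return no_hit
```

Called as no_hit = highlight_and_collect_misses(doc, text_list), this yields one entry per page and quote that did not match, so a quote that occurs only on page 3 still shows up as a miss for every other page.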

Probably better the other way round:
Your text list should preferably be a Python set. Whenever a text is found, put it in another set, found_set. At the end of processing, subtract found_set from the text_list set (set difference); what remains are the quotes that were never found on any page.
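
The set-based variant could look roughly like this. It is only a sketch: the function name find_missing_quotes and the variable wanted are hypothetical, and doc is assumed to be the opened document from the question:

```python
def find_missing_quotes(doc, text_list):
    """Highlight every hit; return the set of quotes found on no page."""
    wanted = set(text_list)   # duplicates in text_list collapse here
    found_set = set()
    for page in doc:
        for text in wanted:
            rl = page.search_for(text, quads=True)
            if rl:
                page.add_highlight_annot(rl)
                found_set.add(text)
    # Set difference: everything wanted that was never found
    return wanted - found_set
```

With missing = find_missing_quotes(doc, text_list), the result contains each unmatched quote exactly once, regardless of how many pages the document has.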

Answered By: Jorj McKie