Extract annotations by layer from a PDF in Python

Question:

I have a PDF with annotations (markups) stored in different layers. Each layer has a specific name. I need to extract the annotations with their layer name. In particular, I’m interested only in the location of the annotation (as in, the bounding box of it) and the name of their layer, i.e. an output like:

{ "layerName": "myLayer01", "location" : [ 10, 5, 4, 2 ] }

Using a library like PyPDF2 (I’m on the latest version, 3.0.1), I can extract the annotations’ locations like this:

from PyPDF2 import PdfReader
reader = PdfReader("myFile.pdf")

for page in reader.pages:
    if "/Annots" in page:
        for annot in page["/Annots"]:
            obj = annot.get_object()
            annotation = { "layerName": ???, "location": obj["/Rect"] } # how do I get the layer Name?

While it’s easy to get the location, I am struggling to figure out how to get the layerName of the annotation.

If I look into the properties of the extracted obj (for example serializing it entirely with jsonPickle and CTRL+F in the entire result) I cannot find any mention of the layer the annotations are located on.

I know it’s possible to get a list of all existing layers with something like:

# Get the list of all layers (optional content groups).
# /OCProperties lives in the document catalog, not on a page.
layers = reader.trailer["/Root"]["/OCProperties"]["/OCGs"]

but this doesn’t help grouping the annotations per layer.

Any suggestion is appreciated. I’d prefer a concise solution; libraries other than PyPDF2 are fine if they help.

Asked By: alelom

Answers:

This is easy in PyMuPDF.

import fitz  # PyMuPDF
from pprint import pprint

doc = fitz.open("input.pdf")
for page in doc:
    for annot in page.annots():
        oc_xref = annot.get_oc()  # xref of its OCG or OCMD
        if oc_xref > 0:  # it indeed has an OCG/OCMD
            ocg_dict = doc.get_ocgs()[oc_xref]  # describes the OCG's properties
            pprint(ocg_dict)

# the output would be something like this:
{'on': True,
'intent': ['View', 'Design'],
'name': 'Circle',
'usage': 'Artwork'}

...
Answered By: Jorj McKie
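
For readers who would rather stay with PyPDF2: in the PDF format, an annotation dictionary may carry an optional /OC entry that references its optional content group (OCG) or membership dictionary (OCMD), and an OCG's /Name holds the layer name. The sketch below reads that entry directly. The helper names `layer_name` and `annotations_with_layers` are made up for illustration, and the code assumes the layer is attached through /OC; annotations without that entry simply get None.

```python
def _resolve(obj):
    """Follow a PyPDF2 indirect reference; plain values are returned as-is."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def layer_name(annot_obj):
    """Return the layer name of an annotation dictionary, or None.

    Assumption: the layer is referenced through the optional /OC entry.
    """
    oc = _resolve(annot_obj.get("/OC"))
    if oc is None:
        return None
    if _resolve(oc.get("/Type")) == "/OCMD":
        # A membership dictionary points at one OCG or an array of OCGs;
        # this sketch only looks at the first one.
        groups = _resolve(oc.get("/OCGs"))
        if isinstance(groups, (list, tuple)):
            groups = _resolve(groups[0]) if groups else None
        if groups is None:
            return None
        oc = groups
    name = _resolve(oc.get("/Name"))
    return None if name is None else str(name)


def annotations_with_layers(reader):
    """Yield {"layerName": ..., "location": ...} for every annotation.

    `reader` is expected to behave like PyPDF2's PdfReader (a `pages` sequence
    of dictionary-like page objects).
    """
    for page in reader.pages:
        for ref in _resolve(page.get("/Annots")) or []:
            obj = _resolve(ref)
            rect = _resolve(obj["/Rect"])
            yield {
                "layerName": layer_name(obj),
                "location": [float(_resolve(v)) for v in rect],
            }
```

With PyPDF2 this would be called as `list(annotations_with_layers(PdfReader("myFile.pdf")))`.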

The answer from @jorj worked fine for getting the layer name, but it did not extract the location. For completeness, I’m posting my solution, which uses PyMuPDF and builds on his answer.

I am opting for the following output shape: a dictionary keyed by page number, whose values are dictionaries keyed by layer name, each holding the list of annotation locations in that layer. The location is the bounding box of the annotation.

{ page_number : { layer_name : [ locations ] } } 
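
To make the nesting concrete, here is a small sketch with made-up sample values. It also shows how the nested dictionary could be flattened back into the one-record-per-annotation form asked for in the question. The `flatten` helper and the extra "page" key are illustrative additions, not part of any library.

```python
def flatten(all_annotations):
    """Turn {page_num: {layer_name: [locations]}} into a flat list of records."""
    records = []
    for page_num, layers in all_annotations.items():
        for layer_name, locations in layers.items():
            for location in locations:
                records.append(
                    {"page": page_num, "layerName": layer_name, "location": location}
                )
    return records


# Made-up values, only to illustrate the shape.
sample = {
    0: {"myLayer01": [[10, 5, 40, 20], [50, 5, 80, 20]]},
    2: {"myLayer02": [[0, 0, 15, 15]]},
}

for record in flatten(sample):
    print(record)
```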

Solution:

import fitz
from collections import defaultdict

# Open the PDF file
pdf_document = fitz.open("myfile.pdf")

# Output shape:
# { page_num : { layer_name : [ locations ] } }

all_annotations = dict()
for page in pdf_document:
    page_num = page.number
    page_annotations = list(page.annots())
    if len(page_annotations) == 0:
        continue
    
    all_annotations[page_num] = defaultdict(list)
    for annot in page_annotations:
        # get the location (bounding box of the annotation)
        pix_map = annot.get_pixmap().irect
        location = [ pix_map.x0, pix_map.y0, pix_map.x1, pix_map.y1 ]

        # get the layer's name
        oc_xref = annot.get_oc()  # xref of its OCG or OCMD, 0 if there is none
        ocg_dict = pdf_document.get_ocgs().get(oc_xref)
        if ocg_dict is None:
            continue  # annotation is not on a named layer (OCG): skip it
        layer_name = ocg_dict["name"]

        all_annotations[page_num][layer_name].append(location)
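
A possible variation, offered as a sketch rather than a drop-in replacement: it reads `annot.rect` (the annotation rectangle in page coordinates) instead of rendering a pixmap for every annotation, looks up the OCG table once, and keeps annotations that have no OCG under a placeholder key. The names `NO_LAYER`, `group_by_layer` and `annotations_by_layer` are made up for this example.

```python
from collections import defaultdict

NO_LAYER = "<no layer>"  # hypothetical placeholder key for annotations without an OCG


def group_by_layer(records, ocgs):
    """Group (page_num, oc_xref, bbox) records as {page_num: {layer_name: [bbox]}}.

    `ocgs` maps OCG xrefs to property dicts with a "name" key, i.e. the shape
    of the get_ocgs() output shown earlier in this thread.
    """
    grouped = {}
    for page_num, oc_xref, bbox in records:
        ocg = ocgs.get(oc_xref)  # None for xref 0 or for a non-OCG xref
        name = ocg["name"] if ocg else NO_LAYER
        grouped.setdefault(page_num, defaultdict(list))[name].append(bbox)
    return grouped


def annotations_by_layer(path):
    """Collect every annotation's bounding box in the PDF at `path`, grouped by layer."""
    import fitz  # PyMuPDF; imported here so group_by_layer can be used without it

    with fitz.open(path) as doc:
        records = []
        for page in doc:
            for annot in page.annots():
                # [x0, y0, x1, y1] as floats, in page coordinates
                records.append((page.number, annot.get_oc(), list(annot.rect)))
        return group_by_layer(records, doc.get_ocgs())
```

Calling `annotations_by_layer("myfile.pdf")` would return the same `{ page_num : { layer_name : [ locations ] } }` shape, with float coordinates instead of integer pixel coordinates.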
Answered By: alelom