How to calculate bounding box using PyPDF2 in Python 3

Question:

This question relates to PyPDF2 used with Python 3.

Ghostscript can apparently calculate the bounding box of the content within a PDF page as follows:

gs -dBATCH -dSAFER -dNOPAUSE -sDEVICE=bbox document1.pdf

The result returned in the example above appears to be correct and is:

GPL Ghostscript 9.10 (2013-08-30)
Copyright (C) 2013 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1
%%BoundingBox: 88 525 521 718
%%HiResBoundingBox: 88.145997 525.401984 520.397984 717.533978

My question is: can PyPDF2 calculate this bounding box? If yes, any guidance on how to do so would be appreciated. I have searched extensively but so far cannot see how to do it.

I am able to get PyPDF2 to give me the coordinates for the mediaBox, the cropBox, the artBox and the trimBox, but these appear to be unrelated to the content bounding box.

Asked By: Duke Dougal


Answers:

The boxes you listed are associated with page objects. PyPDF2 allows you to access and modify the coordinates for these boxes.

You’re correct that bounding boxes are unrelated; a page may have none or many bounding boxes. I believe each bbox represents a region for a graphic, font, etc., rather than a whole page.

To answer your question, PyPDF2 does not currently provide access to the coordinates for bounding boxes. It is something that should be considered, though.

pyPdf and, by extension, PyPDF2, don’t focus on specific content extraction as much as they do page manipulation. But this is a concept we will look into developing!
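
In the meantime, one possible workaround is to call Ghostscript from Python with the same command shown in the question and parse its output. The following is a minimal sketch rather than a PyPDF2 feature: it assumes the `gs` executable is on your PATH, and the function names `parse_hires_bboxes` and `ghostscript_bboxes` are hypothetical helpers made up for illustration. The bbox device typically writes its results to stderr, so the sketch merges stderr into stdout before parsing.

```python
import re
import subprocess

# Matches lines such as:
# %%HiResBoundingBox: 88.145997 525.401984 520.397984 717.533978
_HIRES_RE = re.compile(
    r"^%%HiResBoundingBox:\s+([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s*$",
    re.MULTILINE,
)


def parse_hires_bboxes(gs_output):
    """Return one (llx, lly, urx, ury) tuple of floats per page found in the text."""
    return [
        tuple(float(value) for value in match.groups())
        for match in _HIRES_RE.finditer(gs_output)
    ]


def ghostscript_bboxes(pdf_path, gs_executable="gs"):
    """Run Ghostscript's bbox device on pdf_path and parse the reported boxes.

    Assumes gs_executable is installed and reachable; raises
    subprocess.CalledProcessError if Ghostscript exits with an error.
    """
    result = subprocess.run(
        [gs_executable, "-dBATCH", "-dSAFER", "-dNOPAUSE", "-sDEVICE=bbox", pdf_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so the bbox lines are captured
        universal_newlines=True,
        check=True,
    )
    return parse_hires_bboxes(result.stdout)
```

Calling `ghostscript_bboxes("document1.pdf")` should then return a list with one tuple per page, in the same lower-left / upper-right order that Ghostscript prints.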

Answered By: Matthew Stamy