How to get the diff of two PDF files using Python?

Question:

I need to find the difference between two PDF files. Does anybody know of any Python-related tool which has a feature that directly gives the diff of the two PDFs?

Asked By: Goutham

Answers:

What do you mean by “difference”? A difference in the text of the PDF, or some layout change (e.g. an embedded graphic was resized)? The first is easy to detect; the second is almost impossible to get (PDF is a VERY complicated file format that offers endless formatting capabilities).

If you want to get the text diff, just run a pdf to text utility on the two PDFs and then use Python’s built-in diff library (difflib) to get the difference of the converted texts.
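
As a minimal sketch of the diffing step, assuming the text has already been extracted from both PDFs into plain strings (the helper name `text_diff` and the file labels are illustrative, not part of any library), the standard `difflib` module can produce a unified diff:

```python
import difflib


def text_diff(old_text, new_text, old_label="old.pdf", new_label="new.pdf"):
    """Return a unified diff of two extracted texts as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical extracted texts standing in for real pdf-to-text output
    old = "Invoice 42\nTotal: 100 EUR\n"
    new = "Invoice 42\nTotal: 120 EUR\n"
    print("\n".join(text_diff(old, new)))
```

An empty list means the extracted texts are identical line by line.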

This question deals with pdf to text conversion in python: Python module for converting PDF to text.

The reliability of this method depends on the PDF generators you are using. If you use, e.g., Adobe Acrobat and some Ghostscript-based PDF creator to make two PDFs from the SAME Word document, you might still get a diff even though the source document was identical.

This is because there are dozens of ways to encode the information of the source document to a PDF and each converter uses a different approach. Often the pdf to text converter can’t figure out the correct text flow, especially with complex layouts or tables.
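
One way to reduce some of these spurious differences, sketched here under the assumption that the two converters disagree only on whitespace and line wrapping, is to compare the texts word by word instead of line by line. The name `normalized_words` is illustrative, and this does nothing for cases where the converter gets the reading order wrong:

```python
def normalized_words(text):
    """Split extracted text into words, ignoring spacing and line breaks."""
    # str.split() with no arguments splits on any run of whitespace
    return text.split()


if __name__ == "__main__":
    # Hypothetical outputs of two converters for the same source text
    a = "Total:   100 EUR\nDue: 1 May"
    b = "Total: 100\nEUR Due: 1 May"
    print(normalized_words(a) == normalized_words(b))
```

The resulting word lists can be passed directly to `difflib.unified_diff`, which accepts any sequences of strings.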

Answered By: fbuchinger

Check this out, it can be useful: pypdf

Answered By: mtasic85

I do not know your use case, but for regression tests of a script that generates PDFs using reportlab, I diff the PDFs by

  1. Converting each page to an image using Ghostscript
  2. Diffing each page against the corresponding page image of a reference PDF, using PIL

e.g.

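from PIL import Image, ImageChops

# imagePath1 / imagePath2: paths to the page images rendered by Ghostscript
im1 = Image.open(imagePath1)
im2 = Image.open(imagePath2)

# Per-pixel absolute difference; an all-black result means the pages match
imDiff = ImageChops.difference(im1, im2)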

This works in my case for flagging any changes introduced due to code changes.
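
To turn the difference image into a pass/fail check, one possible sketch (assuming Pillow is installed; `pages_identical` is an illustrative name, not something from the answer above) relies on `getbbox()`, which returns `None` when the difference image is entirely black. Both images are converted to RGB first so that the bounding box reflects colour differences rather than an alpha channel:

```python
from PIL import Image, ImageChops


def pages_identical(im1, im2):
    """Return True if two page images have the same size and no differing pixels."""
    if im1.size != im2.size:
        return False
    diff = ImageChops.difference(im1.convert("RGB"), im2.convert("RGB"))
    # getbbox() is None when every pixel of the difference image is zero
    return diff.getbbox() is None


if __name__ == "__main__":
    # In-memory stand-ins for rendered page images
    page_a = Image.new("RGB", (4, 4), "white")
    page_b = page_a.copy()
    page_b.putpixel((1, 1), (0, 0, 0))
    print(pages_identical(page_a, page_a.copy()), pages_identical(page_a, page_b))
```

This is an exact comparison, so it flags even a single changed pixel.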

Answered By: Anurag Uniyal

I ran into the same problem in a unit test for encrypted PDFs; neither pdfminer nor pyPdf worked well for me.

Two commands, pdftocairo and pdftotext, worked perfectly in my tests (install on Ubuntu: apt-get install poppler-utils).

You can get the PDF content with:

from subprocess import Popen, PIPE

def get_formatted_content(pdf_content):
    cmd = 'pdftocairo -pdf - -' # you can replace "pdftocairo -pdf" with "pdftotext" if you want to get diff info
    ps = Popen(cmd, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE)
    stdout, stderr = ps.communicate(input=pdf_content)
    if ps.returncode != 0:
        raise OSError(ps.returncode, cmd, stderr)
    return stdout

It seems pdftocairo can re-render PDF files, while pdftotext extracts all the text.

And then you can compare two pdf files:

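with open('f1.pdf', 'rb') as f1, open('f2.pdf', 'rb') as f2:
    c1 = get_formatted_content(f1.read())
    c2 = get_formatted_content(f2.read())
print(c1 == c2)  # binary compare: True if the outputs are identical
# For a text compare (when using pdftotext), diff the decoded lines:
# import difflib
# print('\n'.join(difflib.unified_diff(c1.decode().splitlines(), c2.decode().splitlines(), lineterm='')))

Answered By: gzerone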

Even though this question is quite old, I think I can still contribute to the topic.

We have several applications generating tons of PDFs. One of these apps is written in Python and recently I wanted to write integration tests to check if the PDF generation was working correctly.

Testing PDF generation is HARD, because the specs for PDF files are very complicated and the output is non-deterministic. Two PDFs generated from exactly the same input data will usually not be byte-for-byte identical, so direct file comparison is ruled out.

The solution: we have to test how they look (because THAT should be deterministic!).

In our case, the PDFs are being generated with the reportlab package, but this doesn’t matter from the test perspective; we just need a filename or the PDF blob (bytes) from the generator. We also need an expectation file containing a “good” PDF to compare with the one coming from the generator.

The PDFs are converted to images and then compared. This can be done in multiple ways, but we decided to use ImageMagick, because it is extremely versatile and very mature, with bindings for almost every programming language out there. For Python 3, the bindings are offered by the Wand package.

The test looks something like the following. Specific details of our implementation were removed and the example was simplified:

import os
from unittest import TestCase
from wand.image import Image
from app.generators.pdf import PdfGenerator


DIR = os.path.dirname(__file__)


class PdfGeneratorTest(TestCase):

    def test_generated_pdf_should_match_expectation(self):
        # `pdf` is the blob of the generated PDF
        # If using reportlab, this is what you get calling `getpdfdata()`
        # on a Canvas instance, after all the drawing is complete
        pdf = PdfGenerator().generate()

        # PDFs are vectorial, so we need to set a resolution when
        # converting to an image
        actual_img = Image(blob=pdf, resolution=150)

        filename = os.path.join(DIR, 'expected.pdf')

        # Make sure to use the same resolution as above
        with Image(filename=filename, resolution=150) as expected:
            diff = actual_img.compare(expected, metric='root_mean_square')
            self.assertLess(diff[1], 0.01)

The 0.01 threshold is the largest difference we are willing to tolerate. Since diff[1] ranges from 0 to 1 with the root_mean_square metric, we accept a difference of up to 1% across all channels relative to the expected sample file.
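
To get a feel for what a threshold of 0.01 means, here is a minimal pure-Python sketch of a root-mean-square error normalized to the 0..1 range. It is only an illustration under the assumption of 8-bit channel values; `normalized_rmse` is a hypothetical helper, and ImageMagick's own computation may differ in detail:

```python
import math


def normalized_rmse(values_a, values_b, max_value=255):
    """RMS error between two equal-length sequences of channel values, scaled to 0..1."""
    if len(values_a) != len(values_b):
        raise ValueError("both sequences must have the same length")
    if not values_a:
        return 0.0
    # Scale each difference to 0..1 before squaring, then average and take the root
    squared = sum(((a - b) / max_value) ** 2 for a, b in zip(values_a, values_b))
    return math.sqrt(squared / len(values_a))


if __name__ == "__main__":
    # Hypothetical flattened channel values of two tiny images
    expected = [0] * 10000
    actual = [0] * 10000
    actual[0] = 255  # a single channel value flips from 0 to 255
    print(normalized_rmse(expected, actual))
```

With this definition, a score of 0.01 corresponds, for example, to one channel value out of 10,000 flipping from 0 to 255, or to every channel value being off by about 2.55 levels.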

Answered By: Victor Schröder