PyPDF2.errors.PdfReadError: PDF starts with '♣▬', but '%PDF-' expected

Question:

I have a folder containing a lot of sub-folders, with PDF files inside. It’s a real mess to find information in these files, so I’m making a program to parse these folders and files, searching for a keyword in the PDF files, and returning the names of the PDF files containing the keyword.

And it’s working. Almost, actually.

I get this error when my program reaches some folders (it is hard to know which ones exactly): PyPDF2.errors.PdfReadError: PDF starts with '♣▬', but '%PDF-' expected. As far as I can tell, all the PDF files in my folders are the same kind of file, so I don’t understand why my program works with some files and not with others.

Thank you in advance for your responses.

Asked By: SejAC


Answers:

Disclaimer: I am the author of borb, the library mentioned in this answer.

PDF documents caught in the wild will sometimes start with non-PDF bytes (a header that is not really part of the PDF spec). This can cause all kinds of problems.

A PDF keeps track (internally) of the byte offsets of all objects in the file (e.g. "object 10 starts at byte 10202"). A stray header makes it harder to know where an object starts.

  • Do we start counting at the start of the file?
  • Or at the start of where the file behaves like a PDF?
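To see which files are affected, you can check where the '%PDF-' marker actually sits in each file. The following is a minimal sketch using only the standard library. The function names and the 1 MB search window are assumptions chosen for illustration; this is not how borb or PyPDF2 implement their own checks.

```python
from pathlib import Path

PDF_MARKER = b"%PDF-"
SEARCH_LIMIT = 1024 * 1024  # assumption: only look at the first 1 MB


def find_pdf_header_offset(path, search_limit=SEARCH_LIMIT):
    """Return the byte offset of '%PDF-' in the file, or -1 if not found."""
    with open(path, "rb") as handle:
        head = handle.read(search_limit)
    return head.find(PDF_MARKER)


def report_suspect_files(root):
    """List (path, offset) for every *.pdf below root not starting with '%PDF-'."""
    suspects = []
    for pdf_path in sorted(Path(root).rglob("*.pdf")):
        offset = find_pdf_header_offset(pdf_path)
        if offset != 0:
            # offset > 0: junk bytes before the header; offset == -1: no header
            suspects.append((pdf_path, offset))
    return suspects
```

Calling report_suspect_files("my_folder") returns the files that a strict parser is likely to reject, together with the number of junk bytes in front of the header. Note that rglob("*.pdf") is case-sensitive on most systems, so files named ".PDF" would need a second pattern.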

If you just want to extract text from a PDF (to be able to check it for content and keywords), you can try to use borb.

borb will look for the start of the PDF within the first 1 MB of the file (thus potentially ignoring your faulty header). If this turns out to corrupt the XREF (the cross-reference table, which contains the byte addresses of all objects), it will simply build a new one.

This is an example of how to extract text from a PDF using borb:

import typing
from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction


def main():
    # read the Document, attaching a listener that collects the text
    doc: typing.Optional[Document] = None
    extractor: SimpleTextExtraction = SimpleTextExtraction()
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [extractor])

    # check whether we have read a Document
    assert doc is not None

    # print the text on the first Page
    print(extractor.get_text_for_page(0))


if __name__ == "__main__":
    main()

You can find more examples in the examples repository.

Answered By: Joris Schellekens

PdfFileReader is deprecated. Use PdfReader instead! (source)

PdfFileReader has a strict parameter. If you are still using PdfFileReader, set it to False:

reader = PdfFileReader("example.pdf", strict=False)

PdfReader is the same as PdfFileReader, but by default it has strict=False. Most people want strict=False. In the next major release, I will remove PdfFileReader from PyPDF2 in favor of PdfReader.

If you’re still having problems, please open an issue on GitHub, but only if you can share both the PDF and the code that caused the issue: https://github.com/py-pdf/PyPDF2

Answered By: Martin Thoma