Create a partial PDF from bytes in Python

Question:

I have a PDF file somewhere. It is being sent to the destination in chunks of equal size (apart from the last chunk).

Let’s say the PDF file is read like this in Python:

import asyncio

async def send_file(filename):
    with open(filename, 'rb') as file:
        chunk = file.read(3000)
        while chunk:
            # the sending method here
            await asyncio.sleep(0.5)
            chunk = file.read(3000)

The question is:
Can I construct a partial PDF file in the destination, while the leftover part of the document is being sent?

I tried it with pypdfium2 / PyPDF2, but they throw errors until the whole PDF file has arrived:

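import io
import pypdfium2

class Receiver:
    def __init__(self):
        self.full_pdf = b''

    def process(self, message):
        # append the new chunk and try to parse everything received so far
        self.full_pdf += message
        partial = io.BytesIO(self.full_pdf)
        try:
            pdf = pypdfium2.PdfDocument(partial)
            print(len(pdf))
        except Exception as e:
            print("error", e)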

Basically, I’d like to get the pages of the document that have already arrived, even if the whole document is not there yet.

Asked By: Patrick Visi


Answers:

It’s not possible to stream a PDF and do anything useful with it before the whole file is present.

According to the PDF 1.7 standard, the structure is:

  1. A one-line header identifying the version of the PDF specification to which the file conforms
  2. A body containing the objects that make up the document contained in the file
  3. A cross-reference table containing information about the indirect objects in the file
  4. A trailer giving the location of the cross-reference table and of certain special objects within the body of the file

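The problem is that the cross-reference (x-ref) table and the trailer are at the end of the file. A reader starts from the trailer to find the cross-reference table, which in turn gives the byte offset of every object, so the page objects cannot be located reliably until the last chunk has arrived.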
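To illustrate why the end of the file matters, here is a minimal sketch that checks whether the bytes received so far already end with the `startxref` offset and the `%%EOF` marker. The function name `find_startxref` is made up for this example, and the 1024-byte tail window is an assumption chosen for illustration. It only detects that the end of the file has arrived; it does not validate the PDF.

```python
import re
from typing import Optional


def find_startxref(buffer: bytes) -> Optional[int]:
    """Return the cross-reference offset if the file ending has arrived, else None.

    Hypothetical helper: it only looks for 'startxref <offset> %%EOF'
    at the very end of the bytes received so far.
    """
    tail = buffer[-1024:]  # assumption: the end marker sits in the last 1024 bytes
    match = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", tail)
    if match is None:
        return None
    return int(match.group(1))


# A tiny hand-written byte string that only mimics the layout of a PDF ending.
fake_pdf = (
    b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\n"
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<< /Size 1 >>\nstartxref\n30\n%%EOF\n"
)

complete_offset = find_startxref(fake_pdf)       # the whole file is present
partial_offset = find_startxref(fake_pdf[:40])   # only the first 40 bytes arrived
```

In a receiver like the one in the question, a check of this kind could be called on the accumulated bytes before trying to open them, so that the parser is only invoked once the trailer is there.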

PDF Linearization: "fast web view"

The above part is true for arbitrary PDFs. However, it’s possible to create so-called "linearized PDF files" (also called "fast web view"). Those files re-order the internal structure of PDF files to make them streamable.
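As a rough way to tell whether an incoming file claims to be linearized, the receiver can inspect the first chunk: the PDF specification requires the linearization parameter dictionary (which contains the `/Linearized` key) to sit within the first 1024 bytes. The sketch below is a heuristic only, and `looks_linearized` is a name invented for this example. A match does not guarantee that a given library can open the truncated file.

```python
def looks_linearized(first_chunk: bytes) -> bool:
    """Heuristic: does the start of the file declare a linearization dictionary?

    Hypothetical helper. It only searches the first 1024 bytes for the
    '/Linearized' key and does not verify that the file is still validly
    linearized (for example after an incremental update).
    """
    head = first_chunk[:1024]
    return head.startswith(b"%PDF-") and b"/Linearized" in head


# Hand-written byte strings that only mimic the beginning of a PDF.
linearized_head = b"%PDF-1.7\n1 0 obj\n<< /Linearized 1 /L 1234 >>\nendobj\n"
plain_head = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"

is_linear = looks_linearized(linearized_head)
is_plain_linear = looks_linearized(plain_head)
```

With 3000-byte chunks as in the question, the first chunk is already large enough for this check.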

At the moment, pypdf==3.4.0 does not support PDF linearization.

pikepdf claims to support that:

import pikepdf  # pip install pikepdf

with pikepdf.open("input.pdf") as pdf:
    pdf.save("out.pdf", linearize=True)
Answered By: Martin Thoma