How do I split a PDF in google cloud storage using Python

Question:

I have a single PDF that I would like to create different PDFs for each of its pages. How would I be able to so without downloading anything locally? I know that Document AI has a file splitting module (which would actually identify different files.. that would be most ideal) but that is not available publicly.

I am currently using PyPDF2 to do this:

    list_of_blobs = list(bucket.list_blobs(prefix = 'tmp/'))
    print(len(list_of_blobs))
    list_of_blobs[1].download_to_filename('/' + list_of_blobs[1].name)
    
    inputpdf = PdfFileReader(open('/' + list_of_blobs[1].name, "rb"))

    individual_files = []
    stream = io.StringIO()
    
    for i in range(inputpdf.numPages):
        output = PdfFileWriter()
        output.addPage(inputpdf.getPage(i))
        individual_files.append(output)
        with open("document-page%s.pdf" % (i + 1), "a") as outputStream:
            outputStream.write(stream.getvalue())
            #print(outputStream.read())
            with open(outputStream.name, 'rb') as f:
                data = f.seek(85)
                data = f.read()
                individual_files.append(data)
                bucket.blob('processed/' +  "doc%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')

In the output, I see various PyPDF2 objects, such as
<PyPDF2.pdf.PdfFileWriter object at 0x12a2037f0>, but I have no idea how to proceed next. I am also open to using other libraries if they work better.

Asked By: saladass4254


Answers:

To split a PDF file into several smaller files (one per page), you need to download its data. You can materialize the data in a file (in the writable directory /tmp) or simply keep it in memory in a Python variable.

In both cases:

  • The data will reside in memory
  • You need to get the data to perform the PDF split.

If you absolutely want to read the data as a stream (I don't know whether that is even possible with the PDF format!), you can use the streaming feature of GCS. But, because there is no CRC check on the downloaded data, I don't recommend this solution unless you are ready to handle corrupted data, retries, and all the related machinery.
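As an illustration of the in-memory approach, here is a sketch that never writes to local disk, assuming a `bucket` object from the google-cloud-storage client and PyPDF2 as in the question; the `processed/page-N.pdf` output names are just an example:

```python
import io


def output_blob_name(page_number):
    # Illustrative naming scheme for the per-page output objects.
    return "processed/page-%s.pdf" % page_number


def split_pdf_in_memory(bucket, source_blob_name):
    """Split one PDF object into per-page PDFs without touching local disk."""
    # PyPDF2 matches the question's library choice; imported here so the
    # naming helper above stays importable even without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    # Download the whole object into memory as bytes (no local file).
    pdf_bytes = bucket.blob(source_blob_name).download_as_bytes()
    reader = PdfFileReader(io.BytesIO(pdf_bytes))

    for i in range(reader.numPages):
        writer = PdfFileWriter()
        writer.addPage(reader.getPage(i))

        # Serialize the single-page PDF into an in-memory buffer.
        buffer = io.BytesIO()
        writer.write(buffer)

        bucket.blob(output_blob_name(i + 1)).upload_from_string(
            buffer.getvalue(), content_type="application/pdf"
        )
```

The whole source PDF and each single-page output live in memory at once, so this trades disk I/O for RAM; that is fine for typical documents but worth keeping in mind for very large PDFs.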

Answered By: guillaume blaquiere

There were two reasons why my program was not working:

  1. I was trying to read a file opened in append mode (I fixed this by moving the second with open(...) block outside of the first one).
  2. I should have been writing bytes (I fixed this by opening the file in 'wb' mode instead of 'a').

Below is the corrected code:

    if inputpdf.numPages > 2:
        for i in range(inputpdf.numPages):
            output = PdfFileWriter()
            output.addPage(inputpdf.getPage(i))
            with open("/tmp/document-page%s.pdf" % (i + 1), "wb") as outputStream:
                output.write(outputStream)
            with open(outputStream.name, 'rb') as f:
                data = f.read()
                bucket.blob(prefix + '/processed/' + "page-%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')
Answered By: saladass4254

FYI, Document AI questions are actively monitored under the [cloud-document-ai] tag.


The Document AI Toolbox SDK has been released as experimental, and it includes the ability to split PDF files based on the output of a splitter/classifier processor in Document AI.

This documentation page lists the supported features and code samples.

https://cloud.google.com/document-ai/docs/handle-response#toolbox
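Based on that documentation page, a minimal sketch of splitting with the Toolbox might look like the following; it assumes the `google-cloud-documentai-toolbox` package is installed and that you already have the JSON output of a splitter/classifier processor, and since the SDK is experimental these names may change:

```python
def split_with_toolbox(document_path, pdf_path, output_path):
    """Split a local PDF using a Document AI splitter/classifier output."""
    # Imported inside the function so this sketch stays importable
    # even where the experimental SDK is not installed.
    from google.cloud.documentai_toolbox import document

    # Wrap the processor's JSON output stored at document_path.
    wrapped = document.Document.from_document_path(document_path=document_path)

    # Write one PDF per detected sub-document into output_path and
    # return the list of files that were created.
    return wrapped.split_pdf(pdf_path=pdf_path, output_path=output_path)
```

Unlike the page-by-page approaches above, this splits on the logical document boundaries that the Document AI processor identified, which is what the question described as "most ideal".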

Answered By: Holt Skinner