Read PDF in base64 format with a PDF library in Python

Question:

I have a PDF as a base64 string and I need to read it with a Python library. I can do that with the following steps:

  1. Decode the PDF in base64
  2. Save it into a new file
  3. Read it with libraries like PyPDF2

But since I can't create a new file, I need to read it some other way. I tried the BufferedWriter class from the io module, but I don't think that is the right approach.

Edit 1

I can't create new files because the code will run on a serverless API host. What I need is to take the Base64 string and read it in a way that lets me split each page into a new file and then save those files to blob storage. The split and save parts are easy; the problem is reading the Base64 string without creating a new file.

Asked By: Kotynho


Answers:

PDF is a binary file format, not a base64 string. Base64 is a way of encoding binary data as ASCII text.

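What you need to do is decode the base64 string with base64.b64decode, which returns a bytes object, then hand those bytes to a PDF library. Some libraries accept the bytes directly; pypdf (the successor to PyPDF2) expects a path or a file-like object, so wrap the bytes in an io.BytesIO object: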

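import base64
import io

from pypdf import PdfReader

# thatString holds the base64-encoded PDF
buffer = base64.b64decode(thatString)  # raw PDF bytes
f = io.BytesIO(buffer)                 # in-memory file-like object
reader = PdfReader(f)
page = reader.pages[0]                 # first page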
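
The page-splitting step from the question can stay in memory as well. The following is only a sketch under a few assumptions: the helper names pdf_stream_from_base64 and split_pages are made up for illustration, the input may contain line breaks that need stripping, and pypdf is installed for the splitting part. Each page is written into its own BytesIO buffer instead of a file on disk.

```python
import base64
import binascii
import io


def pdf_stream_from_base64(b64_text: str) -> io.BytesIO:
    """Decode base64 text into an in-memory, file-like PDF stream."""
    # Drop line breaks and spaces that often appear in transported base64.
    compact = "".join(b64_text.split())
    try:
        raw = base64.b64decode(compact, validate=True)
    except binascii.Error as exc:
        raise ValueError("input is not valid base64") from exc
    # PDF files normally begin with the "%PDF-" marker; fail early otherwise.
    if not raw.startswith(b"%PDF-"):
        raise ValueError("decoded data does not look like a PDF")
    return io.BytesIO(raw)


def split_pages(stream: io.BytesIO) -> list:
    """Return each page as the bytes of its own single-page PDF."""
    # pypdf is third-party; importing it here keeps the decoder stdlib-only.
    from pypdf import PdfReader, PdfWriter

    pages = []
    for page in PdfReader(stream).pages:
        writer = PdfWriter()
        writer.add_page(page)
        out = io.BytesIO()   # in-memory target, no file on disk
        writer.write(out)
        pages.append(out.getvalue())
    return pages
```

Calling split_pages(pdf_stream_from_base64(thatString)) should give one bytes object per page, and each of those can then be passed to whatever upload call the blob storage SDK provides.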
Answered By: Panagiotis Kanavos