How to use the Python pdf2image module (and thus poppler) on Google Cloud Functions?

Question:

I tried to convert a PDF to JPEG on Google Cloud Functions using the Python module pdf2image, but I have no idea how to solve the errors "No such file or directory: 'pdfinfo'" and "Unable to get page count. Is poppler installed and in PATH?" in the cloud function.

The error is very similar to the one in this question. pdf2image is a wrapper around poppler's "pdftoppm" and "pdftocairo". But how can I install the poppler package on Google Cloud Functions and add it to PATH? I can’t find any relevant references. Is it even possible? If not, what could be done?

There is also this question, but it isn’t useful.

The code looks something like the following. The entry point is process_image.

import requests
from pdf2image import convert_from_path

def process_image(event, context):
    # Download sample pdf file
    url = 'https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf'
    r = requests.get(url, allow_redirects=True)
    open('/tmp/sample.pdf', 'wb').write(r.content)

    # The error occurs on this line
    pages = convert_from_path('/tmp/sample.pdf')

    # Save pages to /tmp
    for idx, page in enumerate(pages):
        output_file_path = f"/tmp/{str(idx)}.jpg"
        page.save(output_file_path, 'JPEG')
        # To be saved to cloud storage

requirements.txt:

requests==2.25.1
pdf2image==1.14.0

This is the error I get:

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pdf2image/pdf2image.py", line 441, in pdfinfo_from_path
    proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
  File "/opt/python3.8/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/opt/python3.8/lib/python3.8/subprocess.py", line 1706, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/functions_framework/__init__.py", line 149, in view_func
    function(data, context)
  File "/workspace/main.py", line 11, in process_image
    pages = convert_from_path('/tmp/sample.pdf')
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pdf2image/pdf2image.py", line 97, in convert_from_path
    page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pdf2image/pdf2image.py", line 467, in pdfinfo_from_path
    raise PDFInfoNotInstalledError(
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Thanks in advance for any help.

Asked By: lamty101


Answers:

This error occurs because the poppler package doesn’t work in Cloud Functions, as it requires certain files to be written to the system. Unfortunately, you cannot write to the file system (outside of /tmp) in serverless products like Cloud Functions.

You may want to try the methods described in another thread, "Cloud Functions for Firebase – Converting PDF to image", or consider using GCP Compute Engine, which gives you access to the whole system.

Answered By: Farid Shumbar

Cloud Functions does not support installing custom system-level packages (even though it supports third-party libraries for the relevant programming language through a package manager such as npm or pip). As shown in https://cloud.google.com/functions/docs/reference/system-packages, there is no "poppler" package.

However, you can still make use of the other pre-installed packages. Ghostscript can be used to convert PDFs to images.
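
Before settling on one tool, it can help to check which command-line binaries the runtime actually exposes on PATH. The following is a minimal sketch using the standard library's shutil.which; the helper name find_missing_binaries is hypothetical and the list of binaries is only an example.

```python
import shutil


def find_missing_binaries(names):
    """Return the entries of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # pdfinfo and pdftoppm belong to poppler; gs is Ghostscript.
    missing = find_missing_binaries(["pdfinfo", "pdftoppm", "gs"])
    print("Missing binaries:", missing or "none")
```

If "pdfinfo" shows up as missing, pdf2image will presumably fail with the PDFInfoNotInstalledError shown in the question.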

First, save the PDF file inside the cloud function (e.g. download it from Cloud Storage). You only have disk write access to /tmp (https://cloud.google.com/functions/docs/concepts/exec#file_system).
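
As a rough illustration of the /tmp restriction, the sketch below writes raw bytes to the temporary directory and returns the resulting path. The helper name save_to_tmp is hypothetical, and it assumes that tempfile.gettempdir() resolves to /tmp inside the function runtime.

```python
import os
import tempfile


def save_to_tmp(data, file_name):
    """Write `data` (bytes) under the temp directory and return the full path."""
    # Assumption: tempfile.gettempdir() resolves to /tmp in the function runtime.
    # basename() drops any directory part so the file always lands in that directory.
    path = os.path.join(tempfile.gettempdir(), os.path.basename(file_name))
    with open(path, "wb") as f:
        f.write(data)
    return path
```

According to the linked documentation, files in /tmp are held in memory, so it is probably worth deleting them with os.remove once they have been processed.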

An example terminal command to convert a PDF to JPEG looks like this:

gs -dSAFER -dNOPAUSE -dBATCH -sDEVICE=jpeg -dJPEGQ=100 -r300 -sOutputFile=output/file/path input/file/path

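Sample code to run the command from Python:

import logging
import subprocess

from google.cloud import storage

# Hypothetical wrapper; all four arguments are supplied by the caller.
def convert_pdf_to_jpeg(bucket_name, file_name, input_file_path, output_file_path):
    # Download the file from Google Cloud Storage (only /tmp is writable).
    # storage.Client() infers the project from the environment.
    gcs = storage.Client()
    bucket = gcs.bucket(bucket_name)
    blob = bucket.blob(file_name)
    blob.download_to_filename(input_file_path)

    # Run Ghostscript. Passing the arguments as a list avoids splitting
    # paths that contain spaces and avoids literal quotes in the file name.
    cmd = ['gs', '-dSAFER', '-dNOPAUSE', '-dBATCH', '-sDEVICE=jpeg',
           '-dJPEGQ=100', '-r300', f'-sOutputFile={output_file_path}',
           input_file_path]
    p = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
    stdout, stderr = p.communicate()
    if p.returncode != 0:
        # Ghostscript may report errors on either stream.
        logging.error(stderr.decode('utf8') or stdout.decode('utf8'))
        return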
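
The command above writes to a single output path. For multi-page PDFs, Ghostscript accepts a printf-style page counter such as %03d in -sOutputFile, producing one file per page. Here is a minimal sketch of that variant; the function names build_gs_command and convert_pdf_pages and the "page-" file prefix are hypothetical, and it assumes the gs binary is available on PATH.

```python
import glob
import os
import subprocess


def build_gs_command(input_path, output_pattern, dpi=300, quality=100):
    """Build the Ghostscript argument list for a PDF-to-JPEG conversion."""
    return [
        "gs", "-dSAFER", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=jpeg",
        f"-dJPEGQ={quality}",
        f"-r{dpi}",
        f"-sOutputFile={output_pattern}",
        input_path,
    ]


def convert_pdf_pages(input_path, output_dir):
    """Convert every page to a JPEG and return the sorted output paths."""
    # Ghostscript replaces %03d with the page number (001, 002, ...).
    pattern = os.path.join(output_dir, "page-%03d.jpg")
    result = subprocess.run(
        build_gs_command(input_path, pattern),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Ghostscript may report errors on either stream.
        raise RuntimeError(f"Ghostscript failed: {result.stderr or result.stdout}")
    return sorted(glob.glob(os.path.join(output_dir, "page-*.jpg")))
```

Called as convert_pdf_pages('/tmp/sample.pdf', '/tmp'), this would return the list of generated JPEG paths, which could then be uploaded to Cloud Storage.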

Note:
You might want to use the ImageMagick package instead, which itself uses Ghostscript. However, as mentioned in "Can't load PDF with Wand/ImageMagick in Google Cloud Function", PDF reading by ImageMagick has been disabled because of a security vulnerability in Ghostscript, and this was still the case at the time of writing (2021-07-12). The solution provided there is essentially another way to run Ghostscript.

Reference:
https://www.the-swamp.info/blog/google-cloud-functions-system-packages/

Answered By: lamty101

You can write the images directly to GCS using the code below:

import io
from PIL import Image
from google.cloud import storage
from pdf2image import convert_from_bytes

storage_client = storage.Client()

def convert_pil_image_to_byte_array(img):
    img_byte_array = io.BytesIO()
    img.save(img_byte_array, format='JPEG', subsampling=0, quality=100)
    img_byte_array = img_byte_array.getvalue()
    return img_byte_array

def write_to_gcs_bucket(bucket_name, source_prefix, target_prefix):
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.get_blob(source_prefix)
    # download_as_string() is deprecated in favour of download_as_bytes()
    contents = blob.download_as_bytes()
    # first_page=5 starts the conversion at page 5; omit it to convert all pages
    images = convert_from_bytes(contents, first_page=5)
    for i, image in enumerate(images):
        object_byte = convert_pil_image_to_byte_array(image)
        file_name = 'slide' + str(i) + '.jpg'
        blob = bucket.blob(target_prefix + file_name)
        blob.upload_from_string(object_byte, content_type='image/jpeg')

Answered By: Shashank Tripathi