In Django, how do I convert an uploaded PDF file to an image file and save it to the corresponding column in the database?

Question:

I am creating an HTML template to show the cover of a PDF file (the first page, or a page the user chooses). I want Django to create the cover image automatically, without an extra upload.

The PDF file is uploaded using a Django ModelForm. Here is the structure of my code:

models.py

class Pdffile(models.Model):
    pdf = models.FileField(upload_to='pdfdirectory/')
    filename = models.CharField(max_length=20)
    pagenumforcover = models.IntegerField()
    coverpage = models.FileField(upload_to='coverdirectory/')

form.py

class PdffileForm(ModelForm):
    class Meta:
        model = Pdffile
        fields = (
            'pdf',
            'filename',
            'pagenumforcover',
        )

views.py

def upload(request):
    if request.method == 'POST':
        form = PdffileForm(request.POST, request.FILES)
        if form.is_valid():
            form.save()
            return redirect('pdffilelist')
    else:
        form = PdffileForm()
    return render(request, "uploadform.html", {'form': form})


def pdfcover(request, pk):
    thispdf = get_object_or_404(Pdffile, pk=pk)

    return render(request, 'pdfcover.html', {'thispdf': thispdf})

In ‘pdfcover.html’, I want to use the Django template language so I can render different HTML for different uploaded PDF files. That’s why I want to save the image file in the same database record as the PDF file.

I am new to Python, new to Django, and obviously new to Stack Overflow. I have tried pdf2image and PyPDF2, and I believe either could work, but I cannot find the right code. If you can enlighten me, I will be thankful.

Asked By: codefarmer


Answers:

In the pdf2image package there is a function called convert_from_path.

The package’s own docstring describes what each parameter of the function does.

Parameters:
            pdf_path -> Path to the PDF that you want to convert
            dpi -> Image quality in DPI (default 200)
            output_folder -> Write the resulting images to a folder (instead of directly in memory)
            first_page -> First page to process
            last_page -> Last page to process before stopping
            fmt -> Output image format
            jpegopt -> jpeg options `quality`, `progressive`, and `optimize` (only for jpeg format)
            thread_count -> How many threads we are allowed to spawn for processing
            userpw -> PDF's password
            use_cropbox -> Use cropbox instead of mediabox
            strict -> When a Syntax Error is thrown, it will be raised as an Exception
            transparent -> Output with a transparent background instead of a white one.
            single_file -> Uses the -singlefile option from pdftoppm/pdftocairo
            output_file -> What is the output filename or generator
            poppler_path -> Path to look for poppler binaries
            grayscale -> Output grayscale image(s)
            size -> Size of the resulting image(s), uses the Pillow (width, height) standard
            paths_only -> Don't load image(s), return paths instead (requires output_folder)
            use_pdftocairo -> Use pdftocairo instead of pdftoppm, may help performance
            timeout -> Raise PDFPopplerTimeoutError after the given time

Because convert_from_path is designed to turn every page of a PDF into an image, the function returns a list of PIL Image objects.

If you set the output_folder parameter, each image is also written to that folder. Pass a full path in this case, e.g. 'path/from/root/to/output_folder'. If you don’t set it, the converted images are not written to disk and exist only in memory.

If you do not set the output_file parameter, the function generates a random filename such as 0a15a918-59ba-4f15-90f0-2ed5fbd0c36c-1.ext. If you do set it, the name is used as a prefix for every converted page, so with output_file='file_name' the files are named starting from 'file_name0001-1.ext'.

Beware: if you set both output_file and output_folder and convert two different PDFs with the same values, the image files of the second PDF will overwrite those of the first.
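One way to sidestep that collision, as a minimal sketch: derive the output_file prefix from the PDF’s own name plus a random suffix. The helper name unique_output_name is hypothetical and uses only the standard library; its result would be passed as output_file= to convert_from_path.

```python
import uuid
from pathlib import Path


def unique_output_name(pdf_path):
    """Build an output_file prefix that changes on every call.

    Hypothetical helper: joins the PDF's base name with a short random
    hex suffix, so two conversions are very unlikely to share a prefix.
    """
    stem = Path(pdf_path).stem              # 'report' for 'docs/report.pdf'
    return '{}-{}'.format(stem, uuid.uuid4().hex[:8])


first = unique_output_name('docs/report.pdf')
second = unique_output_name('docs/report.pdf')
print(first, second)   # two different prefixes, both starting with 'report-'
```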

Here is some code modelled around yours in the question. This code assumes you have pdf2image installed.

I’ve added a built-in validator on the pdf FileField because otherwise the code will crash if anything other than a PDF is uploaded.

validators=[FileExtensionValidator(allowed_extensions=['pdf'])]
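Note that FileExtensionValidator only looks at the file name, so a renamed non-PDF file would still pass. As a hedged sketch of an extra check, the hypothetical helper below tests whether a file-like object starts with the PDF signature '%PDF-'. It is a simple heuristic, not a full validation, and it assumes the object supports read, seek and tell (as Django’s uploaded files do).

```python
import io


def looks_like_pdf(fileobj):
    """Return True if the stream starts with the PDF signature '%PDF-'."""
    position = fileobj.tell()
    header = fileobj.read(5)
    fileobj.seek(position)      # leave the stream where we found it
    return header == b'%PDF-'


print(looks_like_pdf(io.BytesIO(b'%PDF-1.7\n...')))   # True
print(looks_like_pdf(io.BytesIO(b'GIF89a')))          # False
```

A function like this could be called from a custom validator or from the form’s clean method, raising ValidationError when it returns False.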

I also created three constants for the upload directories and file format. If you need to change any of them then the rest of the code can remain the same.

COVER_PAGE_DIRECTORY = 'coverdirectory/'
PDF_DIRECTORY = 'pdfdirectory/'
COVER_PAGE_FORMAT = 'jpg'

Also I’m assuming you have the default settings set up for saving files.

settings.py

MEDIA_URL = '/media/'
MEDIA_ROOT = os.path.join(BASE_DIR, 'media')

models.py

import os

from django.conf import settings
from django.core.validators import FileExtensionValidator
from django.db import models
from django.db.models.signals import post_save
from pdf2image import convert_from_path


COVER_PAGE_DIRECTORY = 'coverdirectory/'
PDF_DIRECTORY = 'pdfdirectory/'
COVER_PAGE_FORMAT = 'jpg'

# this function is used to rename the pdf to the name specified by filename field
def set_pdf_file_name(instance, filename):
    return os.path.join(PDF_DIRECTORY, '{}.pdf'.format(instance.filename))

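# upload_to callable for coverpage; Django only calls it when a file is saved through
# the field, so it is bypassed below, where coverpage is assigned a path directly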
def set_cover_file_name(instance, filename):
    return os.path.join(COVER_PAGE_DIRECTORY, '{}.{}'.format(instance.filename, COVER_PAGE_FORMAT))

class Pdffile(models.Model):
    # validator checks file is pdf when form submitted
    pdf = models.FileField(
        upload_to=set_pdf_file_name, 
        validators=[FileExtensionValidator(allowed_extensions=['pdf'])]
        )
    filename = models.CharField(max_length=20)
    pagenumforcover = models.IntegerField()
    coverpage = models.FileField(upload_to=set_cover_file_name)

def convert_pdf_to_image(sender, instance, created, **kwargs):
    if created:
        # check if COVER_PAGE_DIRECTORY exists, create it if it doesn't
        # have to do this because of setting coverpage attribute of instance programmatically
        cover_page_dir = os.path.join(settings.MEDIA_ROOT, COVER_PAGE_DIRECTORY)

        os.makedirs(cover_page_dir, exist_ok=True)

        # convert page cover (in this case) to jpg and save
        cover_page_image = convert_from_path(
            pdf_path=instance.pdf.path,
            dpi=200, 
            first_page=instance.pagenumforcover, 
            last_page=instance.pagenumforcover, 
            fmt=COVER_PAGE_FORMAT, 
            output_folder=cover_page_dir,
            )[0]

        # get name of pdf_file 
        pdf_filename, extension = os.path.splitext(os.path.basename(instance.pdf.name))
        new_cover_page_path = '{}.{}'.format(os.path.join(cover_page_dir, pdf_filename), COVER_PAGE_FORMAT)
        # rename the file that was saved to be the same as the pdf file
        os.rename(cover_page_image.filename, new_cover_page_path)
        # get the relative path to the cover page to store in model
        new_cover_page_path_relative = '{}.{}'.format(os.path.join(COVER_PAGE_DIRECTORY, pdf_filename), COVER_PAGE_FORMAT)
        instance.coverpage = new_cover_page_path_relative

        # call save on the model instance to update database record
        instance.save()

post_save.connect(convert_pdf_to_image, sender=Pdffile)

convert_pdf_to_image is connected to the post_save signal of the Pdffile model. It runs after your PdffileForm is saved in your upload view, so that the cover image file can be created from the saved pdf file.

cover_page_image = convert_from_path(
            pdf_path=instance.pdf.path,
            dpi=200, 
            first_page=instance.pagenumforcover, 
            last_page=instance.pagenumforcover, 
            fmt=COVER_PAGE_FORMAT, 
            output_folder=cover_page_dir,
            )[0]

Changing dpi changes the resolution, and therefore the quality, of the image. To convert only one page, the first_page and last_page parameters are set to the same value. Because the result is a list, we take its first and only element and store it in cover_page_image.
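One case the call above does not guard against: page numbers in pdf2image start at 1, and a pagenumforcover of 0 or one past the last page can come back as an empty list, so the [0] raises IndexError. A minimal sketch of a guard follows. The helper names clamp_page_number and convert_cover are hypothetical; pdfinfo_from_path is pdf2image’s function for reading the page count.

```python
def clamp_page_number(requested, total_pages):
    """Force a 1-based page number into the range 1..total_pages."""
    if total_pages < 1:
        raise ValueError('PDF has no pages')
    return max(1, min(requested, total_pages))


def convert_cover(pdf_path, requested_page, fmt='jpg', dpi=200):
    """Convert one (clamped) page; needs pdf2image and poppler installed."""
    # Imported here so clamp_page_number can be used without pdf2image.
    from pdf2image import convert_from_path, pdfinfo_from_path

    total_pages = pdfinfo_from_path(pdf_path)['Pages']
    page = clamp_page_number(requested_page, total_pages)
    return convert_from_path(
        pdf_path, dpi=dpi, first_page=page, last_page=page, fmt=fmt
    )[0]


print(clamp_page_number(0, 5))   # 1
print(clamp_page_number(9, 5))   # 5
print(clamp_page_number(3, 5))   # 3
```

Whether to clamp silently or to reject the form with an error message is a design choice; clamping is just the shorter option to show here.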

Minor change to your upload view.

views.py

from django.shortcuts import redirect, render

from .form import PdffileForm  # module name taken from form.py in the question


def upload(request):

    form = PdffileForm()

    if request.method == 'POST':
        form = PdffileForm(request.POST, request.FILES)
        # if the form is not valid, it is re-rendered below with its error messages
        if form.is_valid():
            form.save()
            return redirect('pdffilelist')

    return render(request, "uploadform.html", {'form': form})

I don’t know what your uploadform.html file looks like, but I used the following, which works with the code provided.

uploadform.html

<h1>Upload PDF</h1>

<form method="POST" enctype="multipart/form-data">
    {% csrf_token %}
    {{ form.as_p }}
    <button type="submit">Upload</button>
</form>

With an example pdf

[image: example pdf]

Uploaded through the form

[image: upload form]

The resulting database record

[image: db record]

The resulting file locations once uploaded

[image: file directory with images]

Final note:

Because a FileField makes sure existing files don’t get overwritten (Django’s default storage changes the name of an upload when that name is already taken), the code

# get name of pdf_file 
pdf_filename, extension = os.path.splitext(os.path.basename(instance.pdf.name))
new_cover_page_path = '{}.{}'.format(os.path.join(cover_page_dir, pdf_filename), COVER_PAGE_FORMAT)
# rename file to be the same as the pdf file
os.rename(cover_page_image.filename, new_cover_page_path)
# get the relative path to the cover page to store in model
new_cover_page_path_relative = '{}.{}'.format(os.path.join(COVER_PAGE_DIRECTORY, pdf_filename), COVER_PAGE_FORMAT)
instance.coverpage = new_cover_page_path_relative

names the cover page after the stored pdf file. That name is already unique within its directory, so cover images from different uploads do not overwrite each other.

[image: duplicate filenames]
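To make the naming step concrete, here is the same derivation as a small standalone function with sample values. It is only a sketch: cover_paths is a hypothetical helper, '/srv/media' is a made-up MEDIA_ROOT, and 'report_a1B2c3D.pdf' stands in for a name Django made unique on upload.

```python
import os


def cover_paths(pdf_name, media_root, cover_dir='coverdirectory/', fmt='jpg'):
    """Return (absolute_path, relative_path) for the cover of a stored PDF."""
    stem, _extension = os.path.splitext(os.path.basename(pdf_name))
    file_name = '{}.{}'.format(stem, fmt)
    return (
        os.path.join(media_root, cover_dir, file_name),   # where the file lives
        os.path.join(cover_dir, file_name),               # what the model stores
    )


absolute, relative = cover_paths('pdfdirectory/report_a1B2c3D.pdf', '/srv/media')
print(absolute)
print(relative)
```

The relative path is the value assigned to instance.coverpage; the absolute path is the target of os.rename.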

Answered By: Danoram

I used the explanation here and everything works fine, except for one case: when I open a saved Pdffile object in the admin panel, change pagenumforcover to another integer, and save it, a new cover page is not generated.

Answered By: Anxious
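A likely reason, with a hedged sketch of one fix: the handler in the first answer only does its work when created is true, so later edits are skipped. One approach is to remember the stored page number in pre_save and rebuild the cover in post_save when it differs. The names cover_needs_regeneration, connect_cover_signals and build_cover are hypothetical; build_cover stands for the conversion code inside convert_pdf_to_image.

```python
def cover_needs_regeneration(created, old_page, new_page):
    """Return True when a cover image should be built or rebuilt."""
    return created or old_page != new_page


def connect_cover_signals(model_cls, build_cover):
    """Wire the check into Django signals (requires Django)."""
    # Imported here so the pure helper above can be used without Django.
    from django.db.models.signals import post_save, pre_save

    def remember_old_page(sender, instance, **kwargs):
        # Read the page number currently stored, before it is overwritten.
        old = None
        if instance.pk:
            old = (sender.objects.filter(pk=instance.pk)
                   .values_list('pagenumforcover', flat=True).first())
        instance._old_pagenumforcover = old

    def maybe_rebuild(sender, instance, created, **kwargs):
        old = getattr(instance, '_old_pagenumforcover', None)
        if cover_needs_regeneration(created, old, instance.pagenumforcover):
            build_cover(instance)

    # weak=False keeps these nested functions alive after this call returns.
    pre_save.connect(remember_old_page, sender=model_cls, weak=False)
    post_save.connect(maybe_rebuild, sender=model_cls, weak=False)


print(cover_needs_regeneration(False, 1, 2))   # True
print(cover_needs_regeneration(False, 3, 3))   # False
```

In models.py this could replace the post_save.connect line with connect_cover_signals(Pdffile, build_cover). If build_cover calls instance.save(), the nested save sees equal old and new page numbers and does not rebuild again.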