In Django how to convert an uploaded pdf file to an image file and save to the corresponding column in database?
Question:
I am creating an HTML template to show the cover of a PDF file (the first page, or a page the user chooses). I want Django to create the cover image automatically, without an extra upload.
The PDF file is uploaded using a Django ModelForm. Here is the structure of my code:
models.py
class Pdffile(models.Model):
    pdf = models.FileField(upload_to='pdfdirectory/')
    filename = models.CharField(max_length=20)
    pagenumforcover = models.IntegerField()
    coverpage = models.FileField(upload_to='coverdirectory/')
form.py
class PdffileForm(ModelForm):
    class Meta:
        model = Pdffile
        fields = (
            'pdf',
            'filename',
            'pagenumforcover',
        )
views.py
def upload(request):
    if request.method == 'POST':
        form = PdffileForm(request.POST, request.FILES)
        if form.is_valid():
            form.save()
            return redirect('pdffilelist')
    else:
        form = PdffileForm()
    return render(request, "uploadform.html", {'form': form})

def pdfcover(request, pk):
    thispdf = get_object_or_404(Pdffile, pk=pk)
    return render(request, 'pdfcover.html', {'thispdf': thispdf})
In 'pdfcover.html' I want to use the Django template language so I can render different HTML for different uploaded PDF files. That's why I want to save the image file in the same record as the PDF file (in the coverpage column).
I am new to Python, new to Django, and obviously new to Stack Overflow. I have tried pdf2image and PyPDF2, and I believe either could work, but I just cannot find the right code. If you can enlighten me, I will be thankful.
Answers:
In the pdf2image package there is a function called convert_from_path.
This is the description inside the package of what each of the parameters of the function does.
Parameters:
pdf_path -> Path to the PDF that you want to convert
dpi -> Image quality in DPI (default 200)
output_folder -> Write the resulting images to a folder (instead of directly in memory)
first_page -> First page to process
last_page -> Last page to process before stopping
fmt -> Output image format
jpegopt -> jpeg options `quality`, `progressive`, and `optimize` (only for jpeg format)
thread_count -> How many threads we are allowed to spawn for processing
userpw -> PDF's password
use_cropbox -> Use cropbox instead of mediabox
strict -> When a Syntax Error is thrown, it will be raised as an Exception
transparent -> Output with a transparent background instead of a white one.
single_file -> Uses the -singlefile option from pdftoppm/pdftocairo
output_file -> What is the output filename or generator
poppler_path -> Path to look for poppler binaries
grayscale -> Output grayscale image(s)
size -> Size of the resulting image(s), uses the Pillow (width, height) standard
paths_only -> Don't load image(s), return paths instead (requires output_folder)
use_pdftocairo -> Use pdftocairo instead of pdftoppm, may help performance
timeout -> Raise PDFPopplerTimeoutError after the given time
Because convert_from_path is designed to be able to turn every page in a PDF into an image, the function returns a list of Image objects.
If you set the output_folder parameter, each image will be saved to that folder. output_folder should be a full path in this case, e.g. 'path/from/root/to/output_folder'. If you don't set it, the images are not written to disk when converted; they are only held in memory.
By default, if you do not set the output_file parameter, the function generates a random filename such as 0a15a918-59ba-4f15-90f0-2ed5fbd0c36c-1.ext. If you do set a filename, it is used as a prefix for every converted page, so with an output_file of 'file_name' the files would be named starting from 'file_name0001-1.ext'.
Beware that if you set both output_file and output_folder and convert two different PDFs, the second PDF will overwrite the image files of the first if they are written to the same directory.
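One way to sidestep that overwrite is to give every conversion its own output_file prefix. The following is only a sketch: unique_output_prefix and convert_without_collisions are hypothetical helper names, not part of pdf2image, and the 8-character random suffix is an arbitrary choice.

```python
import os
import uuid


def unique_output_prefix(pdf_path):
    """Build a per-conversion prefix: the PDF's base name plus a random suffix."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return '{}-{}'.format(stem, uuid.uuid4().hex[:8])


def convert_without_collisions(pdf_path, output_folder):
    # Imported lazily so the naming helper above works without pdf2image installed.
    from pdf2image import convert_from_path

    return convert_from_path(
        pdf_path,
        output_folder=output_folder,
        output_file=unique_output_prefix(pdf_path),
        fmt='jpg',
    )
```

Two PDFs that share a base name then still end up with different image files, at the cost of filenames that are less predictable.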
Here is some code modelled on the code in your question. It assumes you have pdf2image installed, along with the poppler utilities it wraps.
I've added a built-in validator to the pdf FileField, because otherwise the code will crash if anything other than a PDF is uploaded.
validators=[FileExtensionValidator(allowed_extensions=['pdf'])]
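Note that FileExtensionValidator only looks at the file name, so a renamed non-PDF would still get through. As a rough extra check you could also look at the first bytes of the upload, since PDF files normally start with the signature %PDF-. This is a sketch with a hypothetical helper name (looks_like_pdf); in a real Django validator you would raise ValidationError when it returns False.

```python
import io

PDF_MAGIC = b'%PDF-'


def looks_like_pdf(fileobj):
    """Return True if the file-like object starts with the PDF signature."""
    start = fileobj.tell()
    header = fileobj.read(len(PDF_MAGIC))
    fileobj.seek(start)  # leave the read position where we found it
    return header == PDF_MAGIC


print(looks_like_pdf(io.BytesIO(b'%PDF-1.7 rest of file')))  # True
print(looks_like_pdf(io.BytesIO(b'GIF89a')))                 # False
```

It works on anything with read, seek and tell, which should include the file object Django hands to a validator.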
I also created three constants for the upload directories and file format. If you need to change any of them then the rest of the code can remain the same.
COVER_PAGE_DIRECTORY = 'coverdirectory/'
PDF_DIRECTORY = 'pdfdirectory/'
COVER_PAGE_FORMAT = 'jpg'
Also, I'm assuming you have the default settings set up for saving files.
settings.py
MEDIA_URL = '/media/'
MEDIA_ROOT = os.path.join(BASE_DIR, 'media')
models.py
import os

from django.conf import settings
from django.core.validators import FileExtensionValidator
from django.db import models
from django.db.models.signals import post_save
from pdf2image import convert_from_path

COVER_PAGE_DIRECTORY = 'coverdirectory/'
PDF_DIRECTORY = 'pdfdirectory/'
COVER_PAGE_FORMAT = 'jpg'


# this function is used to rename the pdf to the name specified by filename field
def set_pdf_file_name(instance, filename):
    return os.path.join(PDF_DIRECTORY, '{}.pdf'.format(instance.filename))


# upload_to for coverpage; not called in this example because the path is set directly
def set_cover_file_name(instance, filename):
    return os.path.join(COVER_PAGE_DIRECTORY, '{}.{}'.format(instance.filename, COVER_PAGE_FORMAT))


class Pdffile(models.Model):
    # validator checks file is pdf when form submitted
    pdf = models.FileField(
        upload_to=set_pdf_file_name,
        validators=[FileExtensionValidator(allowed_extensions=['pdf'])]
    )
    filename = models.CharField(max_length=20)
    pagenumforcover = models.IntegerField()
    coverpage = models.FileField(upload_to=set_cover_file_name)


def convert_pdf_to_image(sender, instance, created, **kwargs):
    if created:
        # make sure COVER_PAGE_DIRECTORY exists; this is needed because the
        # coverpage attribute of the instance is set programmatically
        cover_page_dir = os.path.join(settings.MEDIA_ROOT, COVER_PAGE_DIRECTORY)
        os.makedirs(cover_page_dir, exist_ok=True)
        # convert page cover (in this case) to jpg and save
        cover_page_image = convert_from_path(
            pdf_path=instance.pdf.path,
            dpi=200,
            first_page=instance.pagenumforcover,
            last_page=instance.pagenumforcover,
            fmt=COVER_PAGE_FORMAT,
            output_folder=cover_page_dir,
        )[0]
        # get name of pdf_file
        pdf_filename, extension = os.path.splitext(os.path.basename(instance.pdf.name))
        new_cover_page_path = '{}.{}'.format(os.path.join(cover_page_dir, pdf_filename), COVER_PAGE_FORMAT)
        # rename the file that was saved to be the same as the pdf file
        os.rename(cover_page_image.filename, new_cover_page_path)
        # get the relative path to the cover page to store in model
        new_cover_page_path_relative = '{}.{}'.format(os.path.join(COVER_PAGE_DIRECTORY, pdf_filename), COVER_PAGE_FORMAT)
        instance.coverpage = new_cover_page_path_relative
        # call save on the model instance to update database record
        instance.save()


post_save.connect(convert_pdf_to_image, sender=Pdffile)
convert_pdf_to_image is a function that runs on the post_save signal of the Pdffile model. It runs after your PdffileForm is saved in your upload view, so that we can create the cover image file from the saved PDF file.
cover_page_image = convert_from_path(
    pdf_path=instance.pdf.path,
    dpi=200,
    first_page=instance.pagenumforcover,
    last_page=instance.pagenumforcover,
    fmt=COVER_PAGE_FORMAT,
    output_folder=cover_page_dir,
)[0]
Changing dpi changes the quality of the image. To convert only one page, the first_page and last_page parameters are set to the same value. Because the result is a list, we take its first and only element and store it in cover_page_image.
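One thing to watch with the [0] index: pdf2image counts pages from 1, and if pagenumforcover falls outside the document, convert_from_path can return an empty list, in which case [0] raises IndexError. A small guard along these lines may help. clamp_page_number and cover_page_for are hypothetical names, and the use of pdfinfo_from_path to read the page count is an assumption worth checking against your installed pdf2image version.

```python
def clamp_page_number(requested, page_count):
    """Force a 1-based page number into the range 1..page_count."""
    if page_count < 1:
        raise ValueError('PDF has no pages')
    return max(1, min(requested, page_count))


def cover_page_for(pdf_path, requested):
    # Imported lazily so clamp_page_number can be used without pdf2image installed.
    from pdf2image import pdfinfo_from_path

    # Assumption: pdfinfo_from_path returns a dict with an integer 'Pages' entry.
    page_count = pdfinfo_from_path(pdf_path)['Pages']
    return clamp_page_number(requested, page_count)
```

You could then pass cover_page_for(instance.pdf.path, instance.pagenumforcover) as both first_page and last_page. Alternatively, reject out-of-range values in the form instead of silently clamping them.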
Minor change to your upload view.
views.py
def upload(request):
    form = PdffileForm()
    if request.method == 'POST':
        form = PdffileForm(request.POST, request.FILES)
        # if form is not valid then form data will be sent back to view to show error message
        if form.is_valid():
            form.save()
            return redirect('pdffilelist')
    return render(request, "uploadform.html", {'form': form})
I don't know what your uploadform.html file looks like, but I used the following, which works with the code provided.
uploadform.html
<h1>Upload PDF</h1>
<form method="POST" enctype="multipart/form-data">
    {% csrf_token %}
    {{ form.as_p }}
    <button type="submit">Upload</button>
</form>
(Screenshots omitted: an example PDF, the upload through the form, the resulting database record, and the resulting file locations once uploaded.)
Final note:
Because FileFields have code to ensure that existing files don't get overwritten, the name Django stores for the pdf file is almost certainly unique. The code
# get name of pdf_file
pdf_filename, extension = os.path.splitext(os.path.basename(instance.pdf.name))
new_cover_page_path = '{}.{}'.format(os.path.join(cover_page_dir, pdf_filename), COVER_PAGE_FORMAT)
# rename file to be the same as the pdf file
os.rename(cover_page_image.filename, new_cover_page_path)
# get the relative path to the cover page to store in model
new_cover_page_path_relative = '{}.{}'.format(os.path.join(COVER_PAGE_DIRECTORY, pdf_filename), COVER_PAGE_FORMAT)
instance.coverpage = new_cover_page_path_relative
therefore reuses the pdf FileField's filename to name the cover page.
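The same naming step can be pulled into a small function that returns both paths at once. This is a sketch, with cover_paths as a hypothetical helper name. It builds the relative name with posixpath, on the assumption that names stored on a FileField should use forward slashes even on Windows, where os.path.join would insert backslashes.

```python
import os
import posixpath


def cover_paths(media_root, pdf_name, cover_dir='coverdirectory/', fmt='jpg'):
    """Return (absolute path on disk, relative name to store on the model)."""
    # pdf_name is the stored FileField name, e.g. 'pdfdirectory/report.pdf'
    stem = os.path.splitext(posixpath.basename(pdf_name))[0]
    cover_file = '{}.{}'.format(stem, fmt)
    relative = posixpath.join(cover_dir, cover_file)
    # Re-join the pieces with the platform separator for the filesystem path.
    absolute = os.path.join(media_root, *relative.split('/'))
    return absolute, relative
```

In the signal handler, absolute would be the rename target and relative the value assigned to instance.coverpage.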
I used the explanation here and everything works fine, except that when I open a saved Pdffile object in the admin panel, change pagenumforcover to another integer and save, it doesn't generate a new coverpage.
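That behaviour follows from the if created: check in the handler: it only runs the conversion when the record is first inserted, so later edits are skipped. One possible approach, sketched below, is to remember the stored page number in a pre_save receiver and regenerate whenever it differs. remember_old_page, cover_is_stale and the _old_pagenum attribute are hypothetical names, not Django API.

```python
def remember_old_page(sender, instance, **kwargs):
    """pre_save receiver: stash the page number currently stored in the database."""
    instance._old_pagenum = None
    if instance.pk is not None:
        instance._old_pagenum = (
            sender.objects.filter(pk=instance.pk)
            .values_list('pagenumforcover', flat=True)
            .first()
        )


def cover_is_stale(instance, created):
    """True when a cover has to be generated or regenerated for this save."""
    if created:
        return True
    return getattr(instance, '_old_pagenum', None) != instance.pagenumforcover


# Wiring in models.py (where Django is importable):
#   from django.db.models.signals import pre_save
#   pre_save.connect(remember_old_page, sender=Pdffile)
# and in convert_pdf_to_image replace `if created:` with
#   if cover_is_stale(instance, created):
```

The instance.save() call at the end of the handler fires the signals again, but by then the stored page number matches the current one, so cover_is_stale returns False and the handler does not loop. When regenerating, the rename target already exists, so os.replace is a safer choice than os.rename because it overwrites on every platform.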