What is the fastest way to convert a PDF to JPG images?

Question:

I am trying to convert multiple PDFs (10k+) to JPG images and extract text from them. I am currently using the pdf2image Python library, but it is rather slow. Is there a faster library, and which is the fastest?

from pdf2image import convert_from_bytes
images = convert_from_bytes(open(path,"rb").read())

Note: I am using Ubuntu 18.04
CPU: 4 cores, 8 threads (Ryzen 3 3100)
Memory: 8 GB

Asked By: Sahil Lohiya

||

Answers:

Try the following:

  1. pypdfium2
  2. The Poppler command-line tools, called through Python's subprocess module; see https://blog.alivate.com.au/poppler-windows/
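
A minimal sketch of the subprocess route, assuming Poppler's pdftoppm is installed and on the PATH (on Ubuntu it is typically provided by the poppler-utils package). The helper names build_pdftoppm_cmd and convert are made up for illustration; -r, -jpeg, -png, -f and -l are standard pdftoppm options.

```python
import subprocess
from pathlib import Path


def build_pdftoppm_cmd(pdf_path, out_prefix, dpi=150, fmt="jpeg", first=None, last=None):
    """Build the pdftoppm argument list (hypothetical helper, not part of any library)."""
    if fmt not in ("jpeg", "png"):
        raise ValueError("fmt must be 'jpeg' or 'png'")
    cmd = ["pdftoppm", "-r", str(dpi), f"-{fmt}"]
    if first is not None:
        cmd += ["-f", str(first)]  # first page to render
    if last is not None:
        cmd += ["-l", str(last)]   # last page to render
    # pdftoppm appends the page number and file extension to the output prefix.
    cmd += [str(pdf_path), str(out_prefix)]
    return cmd


def convert(pdf_path, out_dir, **options):
    """Render one PDF into out_dir by running pdftoppm directly."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / Path(pdf_path).stem
    # check=True raises CalledProcessError if pdftoppm exits with an error.
    subprocess.run(build_pdftoppm_cmd(pdf_path, prefix, **options), check=True)


# Example call (assumes scan.pdf exists): convert("scan.pdf", "out", dpi=150)
```

This skips the step where every page is loaded into Python as an image object, which may or may not be the bottleneck in your case, so it is worth timing against your own files.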
Answered By: Jitesh

With any converter, speed is generally relative to file size and complexity, since the content has to be rendered from scratch on every run. PDFs that you are not generating yourself can require different solutions. However, the tools you mention go through several layers, so the "fastest" option is the core machine-code binary, which is usually the CLI version, without any slower wrapping application.

As a rough rule of thumb, 100 PNG pages per minute at 150 dpi is reasonable. For example, a run I started 10 minutes ago has just finished 947 pages (i.e. 1.578 pages per second, or 0.6336 seconds per page).

In a recent stress test with a single complex page (on kit not too different from yours), resolution was the biggest factor: one complex chart page took from 1.6 to 14+ seconds depending on output resolution, and multithreading only reduced that to 12 seconds. See https://stackoverflow.com/a/73060439/10802527

pdf2image is built around Poppler, whose tools include pdfimages, pdftotext and pdftoppm. Rather than JPG, I would recommend pdftoppm -png, since the results should be crisper, giving fast, lean output that still looks good.
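
Since the question also asks for text extraction, here is a rough sketch of calling Poppler's pdftotext from Python, assuming it is on the PATH. build_pdftotext_cmd and extract_text are hypothetical helper names; -layout, -f and -l are real pdftotext options, and passing "-" as the output file sends the text to stdout.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, first=None, last=None, layout=True):
    """Build the pdftotext argument list (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")      # try to keep the physical page layout
    if first is not None:
        cmd += ["-f", str(first)]  # first page to extract
    if last is not None:
        cmd += ["-l", str(last)]   # last page to extract
    cmd += [str(pdf_path), "-"]    # "-" writes the text to stdout
    return cmd


def extract_text(pdf_path, **options):
    """Return the text of a PDF as a string by running pdftotext."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, **options),
        check=True,
        capture_output=True,
    )
    # pdftotext emits UTF-8 unless told otherwise with -enc.
    return result.stdout.decode("utf-8", errors="replace")


# Example call (assumes report.pdf exists): text = extract_text("report.pdf", first=1, last=10)
```

This only recovers text that is actually stored in the PDF; scanned pages with no text layer would still need OCR on the rendered images.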

ImageMagick cannot convert PDFs without Ghostscript, nor can it output text, so the fast route there is to use Artifex Ghostscript directly. Also consider and compare its sister application MuPDF (mutool), which has both image and text outputs, multi-threading and banding.

The core of the Chrome/Edge/Chromium and Foxit/Skia solutions is the PDFium binaries, which can be found in various forms for different platforms.

Some rough times on my kit for one large file, all at 150 dpi:

Poppler: pdftoppm -f 1 -l 100 -png (100 pages out of the 13,234 in us-public-health-and-welfare-code.pdf)
or, at a similar speed,
pdftocairo -f 1 -l 100 -png -r 150 us-public-health-and-welfare-code.pdf time/out
The current time is: 17:17:17
The current time is: 17:18:08
100 pages as png = 51 seconds

100+ pages per minute (better than most high-speed printers, but still around 2 hours for just this one file)

PDFium via a CLI exe took around 30 seconds for the 100 pages, but the resolution would still need setting via EXIF, which means a second pass. However, let's be generous and say that's
approx. 200 pages per minute (Est. 1 hour 6 mins total)

Xpdf pdftopng with settings for 150 dpi, 100 pages from 13234pages.pdf
The current time is: 17:25:27
The current time is: 17:25:42
100 pages as png = 15 seconds

400 pages per minute (Est. 33 mins total)

mutool convert -o time/out%d.png -O resolution=150, 100 pages from 13234pages.pdf
The current time is: 17:38:14
The current time is: 17:38:25
100 pages as png = 11 seconds

545 pages per minute (Est. 24.3 mins total)

That can be bettered

mutool draw -st -P -T 4 -B 2048 -r 150 -F png -o ./time/out%d.png 13234pages.pdf 1-100
total 5076ms (0ms layout) / 100 pages for an average of 50ms

1,182 pages per minute (Est. 11.2 mins total)
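
The estimates above all come from the same arithmetic: scale the time for the 100-page sample up to the full 13,234 pages. A small sketch that reproduces it from the timings quoted above (the function names are illustrative):

```python
def pages_per_minute(pages, seconds):
    """Throughput of a timed sample, in pages per minute."""
    return pages * 60 / seconds


def estimated_minutes(total_pages, sample_pages, sample_seconds):
    """Extrapolate a timed sample to the whole document, in minutes."""
    return total_pages * sample_seconds / sample_pages / 60


# (tool, seconds for the 100-page sample), copied from the timings above.
SAMPLES = [
    ("pdftoppm / pdftocairo", 51),
    ("PDFium CLI", 30),
    ("Xpdf pdftopng", 15),
    ("mutool convert", 11),
    ("mutool draw -T 4", 5.076),
]
TOTAL_PAGES = 13234


def report(samples=SAMPLES, total_pages=TOTAL_PAGES, sample_pages=100):
    """Return one formatted line per tool: throughput and estimated total time."""
    lines = []
    for tool, seconds in samples:
        ppm = pages_per_minute(sample_pages, seconds)
        eta = estimated_minutes(total_pages, sample_pages, seconds)
        lines.append(f"{tool:<22} {ppm:8.0f} pages/min  {eta:6.1f} min total")
    return lines


# Example: print("\n".join(report()))
```

The extrapolation assumes the first 100 pages are representative of the rest of the file, which will not hold if later pages are much more complex.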

Note a comment by @jcupitt

I tried time parallel mutool convert -A 8 -o page-%d.png -O resolution=150 us-public-health-and-welfare-code.pdf {}-{} ::: {1..100} and it’s 100 pages in 600ms. If you use pgm, it’s 300ms (!!).

That would be 10,000 or 20,000 pages per minute (Est. 0.66-1.32 mins total)
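
For anyone without GNU parallel, the same idea can be approximated from Python by giving each mutool process its own block of pages. This is only a sketch under assumptions: mutool must be on the PATH, the helper names are invented, and it uses mutool draw (as in the command further up) with -r, -F, -o and a page range, where %d in the output pattern is replaced by the page number.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def page_chunks(first, last, workers):
    """Split the inclusive range first..last into at most `workers` contiguous (start, end) blocks."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if last < first:
        return []
    total = last - first + 1
    size = -(-total // workers)  # ceiling division
    return [(start, min(start + size - 1, last)) for start in range(first, last + 1, size)]


def build_mutool_cmd(pdf_path, out_pattern, start, end, dpi=150):
    """Argument list for rendering pages start..end to PNG with mutool draw."""
    return ["mutool", "draw", "-r", str(dpi), "-F", "png",
            "-o", out_pattern, str(pdf_path), f"{start}-{end}"]


def render_parallel(pdf_path, out_pattern, first, last, workers=4, dpi=150):
    """Run one mutool process per page block and wait for all of them."""
    # Threads are enough here: each one only waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(subprocess.run,
                        build_mutool_cmd(pdf_path, out_pattern, start, end, dpi),
                        check=True)
            for start, end in page_chunks(first, last, workers)
        ]
        for future in futures:
            future.result()  # re-raises if a mutool process failed


# Example call (assumes the file exists):
# render_parallel("13234pages.pdf", "time/out%d.png", 1, 100, workers=4)
```

Whether this scales on a single machine depends on memory and disk contention, so measure it against a plain sequential run first.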

There are other good libraries that render just as quickly, but since each generally demands the same core GPU/CPU/memory/font resources, multiple parallel processes on one device can often fail. One application that looked good for the task fell over with a memory failure after only 2 pages.
If you must use one device, you can try separate invocations in parallel, but my attempts in native Windows always seemed to be thwarted by file locks on resources when there were conflicting demands for the bus or support files.
The only reliable way to multiprocess is to batch blocks of sequential sets of files across parallel devices, i.e. scale up by farming out across multiple real CPUs/GPUs and their dedicated drives.
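
As an illustration of the "blocks of sequential sets of files" idea, the sketch below splits a file list into contiguous blocks and lets each worker process its block in order. It is a simplification that runs on one machine; the function names are invented, and convert_one stands for whatever per-file converter you choose (for example a function that shells out to a CLI tool). To farm out across machines, each block would instead be handed to a different host.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(items, n_batches):
    """Split items into up to n_batches contiguous blocks, preserving order."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    items = list(items)
    size = -(-len(items) // n_batches) or 1  # ceiling division, never zero
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_batch(batch, convert_one):
    """Process one block sequentially, collecting failures instead of aborting the block."""
    failed = []
    for path in batch:
        try:
            convert_one(path)
        except Exception as exc:
            failed.append((path, repr(exc)))
    return failed


def run_all(files, convert_one, workers=4):
    """Run one sequential block per worker and return the list of (path, error) failures."""
    failed = []
    # Threads suit converters that mostly wait on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        batches = make_batches(files, workers)
        for result in pool.map(lambda batch: run_batch(batch, convert_one), batches):
            failed.extend(result)
    return failed
```

Collecting failures per file matters for a 10k-file run: one corrupt PDF should be logged and retried later, not stop the whole batch.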

Note this developer's comparison, where the three best of their bunch were:

  1. MuPDF
  2. Xpdf
  3. PDFium (their selection; as tested above, it has a more permissive license)
Answered By: K J

pyvips is a bit quicker than pdf2image. I made a tiny benchmark:

#!/usr/bin/python3

import sys
from pdf2image import convert_from_bytes

images = convert_from_bytes(open(sys.argv[1], "rb").read())
for i in range(len(images)):
    images[i].save(f"page-{i}.jpg")

With this test document I see:

$ /usr/bin/time -f %M:%e ./pdf.py nipguide.pdf 
1991624:4.80

So 2GB of memory and 4.8s of elapsed time.

You could write this in pyvips as:

#!/usr/bin/python3

import sys
import pyvips

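filename = sys.argv[1]
# Open once just to read the page count from the metadata.
n_pages = pyvips.Image.new_from_file(filename).get('n-pages')
for i in range(n_pages):
    # Load and save one page at a time to keep memory use low.
    page = pyvips.Image.new_from_file(filename, page=i)
    page.write_to_file(f"page-{i}.jpg")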

I see:

$ /usr/bin/time -f %M:%e ./vpdf.py nipguide.pdf[dpi=200]
676436:2.57

670MB of memory and 2.6s elapsed time.

They are both using poppler behind the scenes, but pyvips calls directly into the library rather than using processes and temp files, and can overlap load and save.

You can configure pyvips to use pdfium rather than poppler, but they are roughly the same speed in my experience.

You can use multiprocessing to get a further speedup. This will work better with pyvips because of the lower memory use, and the fact that it’s not using huge temp files.

If I modify the pyvips code to only render a single page, I can use GNU parallel to render each page in a separate process:

$ time parallel ../vpdf.py us-public-health-and-welfare-code.pdf[dpi=150] ::: {1..100}
real    0m1.846s
user    0m38.200s
sys     0m6.371s

So 100 pages at 150dpi in 1.8s.
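
The single-page variant of the script is not shown above, so the following is only a guess at what it might look like. It assumes the page number arrives as the second argument and counts from 1 (to match {1..100}), while the pyvips page option counts from 0. The names parse_args, render_page and main are illustrative.

```python
def parse_args(argv):
    """Return (filename, page_number) from argv; page_number is 1-based."""
    if len(argv) != 3:
        raise SystemExit(f"usage: {argv[0]} file.pdf[dpi=150] page-number")
    page_number = int(argv[2])
    if page_number < 1:
        raise SystemExit("page-number starts at 1")
    return argv[1], page_number


def render_page(filename, page_number):
    """Render one page of the PDF to page-N.jpg."""
    # Imported here so that parse_args can be exercised without libvips installed.
    import pyvips

    # pyvips numbers pages from 0, the command line counts from 1.
    image = pyvips.Image.new_from_file(filename, page=page_number - 1)
    image.write_to_file(f"page-{page_number}.jpg")


def main(argv):
    """Entry point: parse the arguments and render the requested page."""
    filename, page_number = parse_args(argv)
    render_page(filename, page_number)
```

To use it as the script in the parallel command, end the file with "import sys" and "main(sys.argv)".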

Answered By: jcupitt