Download PDFs and join them using Python

Question:

I have a list, named links_to_announcement, of URLs for different PDFs.

How do I download them and join them together? My code generated a corrupt PDF that doesn’t open in a PDF reader at all.

with open('joined_pdfs.pdf', 'wb') as f:
    for l in links_to_announcement:
        response = requests.get(l)
        f.write(response.content)
Asked By: KawaiKx

Answers:

Many file formats have a specific internal structure (rather than being simply lines of arbitrary text), so appending one file to another isn’t sufficient!

Instead, for PDFs and many other formats, the files have to be rewritten by something that understands their structure.

For a simple example, if two CSVs were blindly appended, the second CSV’s header would be spliced into the middle of the new document. Any columns that don’t match exactly would fail to parse or be misinterpreted, even if the stray header line were removed.

file1.csv

colA,colB
1,2

file2.csv

colC,colD,colA
3,4,5

Blindly appending the two files gives:

colA,colB
1,2
colC,colD,colA
3,4,5

How should this document be interpreted?
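
To make the ambiguity concrete, here is a small sketch of what Python's standard csv module does with the blindly appended text. The variable names are only for illustration; the point is that the second header is read as an ordinary data row and the extra third column has no name to land under.

```python
import csv
import io

# The two example files from above, glued together byte for byte.
appended = "colA,colB\n1,2\ncolC,colD,colA\n3,4,5\n"

# DictReader takes the first line as the header for the whole document.
rows = list(csv.DictReader(io.StringIO(appended)))

for row in rows:
    # The second header shows up as data, and surplus fields are
    # collected under the key None because no column name exists for them.
    print(row)
```

Nothing raises an error here, which is arguably worse than a crash: the values 3, 4 and 5 are silently filed under the wrong columns.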

Instead, a context-aware parser can merge the documents correctly:

colA,colB,colC,colD
1,2,,
5,,3,4
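
As a rough sketch of what "context-aware" means for the CSV case, the following uses the standard csv module to merge by column name rather than by position. The helper name merge_csv_texts is made up for this example, and filling missing cells with empty strings is an assumption chosen to match the merged output shown above.

```python
import csv
import io


def merge_csv_texts(*texts):
    """Merge CSV documents by column name; missing cells become empty strings."""
    rows = []
    columns = []
    for text in texts:
        reader = csv.DictReader(io.StringIO(text))
        # Collect column names in first-seen order across all inputs.
        for name in reader.fieldnames or []:
            if name not in columns:
                columns.append(name)
        rows.extend(reader)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


file1 = "colA,colB\n1,2\n"
file2 = "colC,colD,colA\n3,4,5\n"
merged = merge_csv_texts(file1, file2)
print(merged)
```

The same idea applies to PDFs: a library that understands the format has to read each input and write a new, consistent document.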

As suggested by @esqew, PDF files can be merged with logic like that in the linked question "Merge PDF files".

You show how to download the files, but it’s probably significantly faster to read each response into a BytesIO and combine them all in memory (see also "python requests return file-like object for streaming").
NOTE: this will frustrate attempts to restart after a failed request. If requests fail frequently, consider writing each PDF to disk as an intermediate step and checking your local cache for the file before downloading it again.

import io

import requests  # aiohttp might be better to asyncio.gather()
# newer pypdf releases deprecate PdfMerger in favor of PdfWriter, which also has append()
from pypdf import PdfMerger

merger = PdfMerger()

with open("links_to_announcement.txt") as fh:
    for line in fh:
        url = line.strip()  # drop the trailing newline
        if not url:
            continue  # skip blank lines
        r = requests.get(url)
        r.raise_for_status()  # TODO further error handling: backoff, etc.
        # pypdf needs a seekable stream, so buffer the whole body in memory
        merger.append(io.BytesIO(r.content))

merger.write("combined.pdf")
merger.close()
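
The on-disk cache mentioned in the note could look roughly like the sketch below. Everything in it is an assumption for illustration: the helper name fetch_cached, the choice to name cache files after a SHA-256 hash of the URL, and the download parameter, which stands in for whatever actually performs the request (for instance a small wrapper around requests.get). A fake downloader is used so the sketch runs without network access.

```python
import hashlib
import tempfile
from pathlib import Path


def fetch_cached(url, cache_dir, download):
    """Return the bytes for url, calling download(url) only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any URL maps to a safe, fixed-length file name.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / (digest + ".pdf")
    if path.exists():
        return path.read_bytes()

    data = download(url)
    # Write under a temporary name first so an interrupted write never
    # leaves a truncated file that looks like a valid cache entry.
    tmp = path.with_suffix(".part")
    tmp.write_bytes(data)
    tmp.replace(path)
    return data


# Demonstration with a stand-in downloader (hypothetical, no real request).
calls = []


def fake_download(url):
    calls.append(url)
    return b"%PDF-1.4 placeholder bytes"


with tempfile.TemporaryDirectory() as cache:
    first = fetch_cached("https://example.com/a.pdf", cache, fake_download)
    second = fetch_cached("https://example.com/a.pdf", cache, fake_download)

print(len(calls))  # the second call is served from disk
```

The returned bytes can then be wrapped in io.BytesIO and handed to the merger, so a rerun after a failure only downloads the files that are still missing.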
Answered By: ti7