Why do I scrape corrupted PDFs of the same size with BeautifulSoup?

Question:

I went through similar topics here but did not find anything helpful for my case.

I managed to download all the PDFs (for personal learning purposes) into a local folder, but I cannot open them. They also all have the same size (310 kB). Perhaps you can spot a mistake in my code. Thanks.

import os
import requests
from bs4 import BeautifulSoup

# define the URL to scrape
url = 'https://www.apotheken-umschau.de/medikamente/arzneimittellisten/medikamente_i.html'

# define the folder to save the PDFs to
save_path = r'C:\PDFs'

# create the folder if it doesn't exist
if not os.path.exists(save_path):
    os.makedirs(save_path)

# make a request to the URL
response = requests.get(url)

# parse the HTML content of the page
soup = BeautifulSoup(response.content, 'html.parser')

# find all links on the page that contain 'href="/medikamente/beipackzettel/"'
links = soup.find_all('a', href=lambda href: href and '/medikamente/beipackzettel/' in href)

# loop through each link and download the PDF
for link in links:
    href = link['href']
    file_name = href.split('?')[0].split('/')[-1] + '.pdf'
    pdf_url = 'https://www.apotheken-umschau.de' + href + '&file=pdf'
    response = requests.get(pdf_url)
    with open(os.path.join(save_path, file_name), 'wb') as f:
        f.write(response.content)
        f.close()
    print(f'Downloaded {file_name} to {save_path}')
Asked By: Mr.Slow


Answers:

There are some issues here:

  • Select your elements from the list more specifically, e.g. with CSS selectors:

    soup.select('article li a[href*="/medikamente/beipackzettel/"]')
    
  • Check the responses you get from your requests to see whether the expected elements are present and how the pages actually behave (a quick check for this is sketched right after this list).

    • You will notice that you have to iterate over more levels than you did.

      for link in soup.select('article li a[href*="/medikamente/beipackzettel/"]'):
          soup_detail_page = BeautifulSoup(requests.get('https://www.apotheken-umschau.de' + link.get('href')).content, 'html.parser')
      
          for file in soup_detail_page.select('a:-soup-contains("Original Beipackzettel")'):
              soup_file_page = BeautifulSoup(requests.get('https://www.apotheken-umschau.de' + file.get('href')).content, 'html.parser')
      
    • You will notice that the PDF is displayed in an <iframe> and you have to fetch it via an external URL

      pdf_url = soup_file_page.iframe.get('src').split('?file=')[-1]
      
    • You will notice that not only Beipackzettel (package leaflets) are offered for download
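
As a quick check for the second point above, the snippet below rebuilds a download URL the same way the original script did and inspects what the server actually returns. This is only a diagnostic sketch (it assumes the listing page still contains at least one matching link): a real PDF starts with the magic bytes %PDF, so if that check fails, the saved files were never PDFs in the first place, which would also explain why they all have the same size.

import requests
from bs4 import BeautifulSoup

url = 'https://www.apotheken-umschau.de/medikamente/arzneimittellisten/medikamente_i.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# take the first link the original script would have used
link = soup.select_one('a[href*="/medikamente/beipackzettel/"]')

# rebuild the URL exactly like the original script ('&file=pdf' appended to the .html href)
check = requests.get('https://www.apotheken-umschau.de' + link['href'] + '&file=pdf')

print(check.status_code)
print(check.headers.get('Content-Type'))  # most likely text/html, not application/pdf
print(check.content[:4] == b'%PDF')       # True only for a real PDF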

Example

import os
import requests
from bs4 import BeautifulSoup

# define the URL to scrape
url = 'https://www.apotheken-umschau.de/medikamente/arzneimittellisten/medikamente_i.html'

# define the folder to save the PDFs to
save_path = r'C:\PDFs'

# create the folder if it doesn't exist
if not os.path.exists(save_path):
    os.makedirs(save_path)

# parse the HTML content of the page
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# loop through each link and download the PDF
for link in soup.select('article li a[href*="/medikamente/beipackzettel/"]'):
    soup_detail_page = BeautifulSoup(requests.get('https://www.apotheken-umschau.de' + link.get('href')).content, 'html.parser')

    for file in soup_detail_page.select('a:-soup-contains("Original Beipackzettel")'):
        soup_file_page = BeautifulSoup(requests.get('https://www.apotheken-umschau.de' + file.get('href')).content, 'html.parser')
        pdf_url = soup_file_page.iframe.get('src').split('?file=')[-1]
        file_name = file.get('href').split('.html')[0].split('/')[-1] + '.pdf'

        with open(os.path.join(save_path, file_name), 'wb') as f:
            f.write(requests.get(pdf_url).content)
        print(f'Downloaded {file_name} to {save_path}')
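
If you download many files like this, it can also help to reuse a single requests.Session, set a timeout, and only write out responses that really are PDFs. The helper below is just a sketch of that idea (save_pdf is a hypothetical name, not part of the example above):

import os
import requests

session = requests.Session()

def save_pdf(pdf_url, file_name, save_path):
    """Hypothetical helper: download pdf_url and save it only if the payload is a PDF."""
    response = session.get(pdf_url, timeout=30)
    response.raise_for_status()
    if not response.content.startswith(b'%PDF'):
        # not a PDF (probably an HTML page), skip instead of writing a corrupt file
        return False
    with open(os.path.join(save_path, file_name), 'wb') as f:
        f.write(response.content)
    return True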
Answered By: HedgeHog