Pagination with BeautifulSoup

Question

I am trying to get some data from the following website: https://www.drugbank.ca/drugs

For every drug in the table, I need to follow the link to its detail page and collect the name and some other specific features, such as categories and structured indication (click on a drug name to see the features I will use).

I wrote the following code, but I can't make it handle pagination (as you can see, there are more than 2000 pages!).

import requests
from bs4 import BeautifulSoup


def drug_data():
    url = 'https://www.drugbank.ca/drugs/'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    for link in soup.select('name-head a'):
        href = 'https://www.drugbank.ca/drugs/' + link.get('href')
        pages_data(href)


def pages_data(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup(r.text, "lxml")
    g_data = soup.select('div.content-container')

    for item in g_data:
        print(item.contents[1].text)
        print(item.contents[3].findAll('td')[1].text)
        try:
            print(item.contents[5].findAll('td', {'class': 'col-md-2 col-sm-4'})[0].text)
        except:
            pass
        print(item_url)


drug_data()

How can I scrape all of the data and handle pagination properly?

Asked By: Lizou

Answer 1

All pages share almost the same URL and differ only in the page number, so you can generate the URLs with a for loop:

def drug_data(page_number):
    url = 'https://www.drugbank.ca/drugs/?page=' + str(page_number)
    #... rest ...

# --- later ---

for x in range(1, 2001):
    drug_data(x)

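Or use while with try/except to go beyond 2000 pages. Note that this stops only if drug_data() raises an exception once it runs past the last page; requests.get() does not raise for HTTP error statuses unless you call r.raise_for_status().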

def drug_data(page_number):
    url = 'https://www.drugbank.ca/drugs/?page=' + str(page_number)
    #... rest ...

# --- later ---

page = 0

while True:
    try:
        page += 1
        drug_data(page)
    except Exception as ex:
        print(ex)
        print("probably last page:", page)
        break # exit `while` loop
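
If drug_data() never raises, the loop above would not terminate. A minimal sketch of an alternative stop condition, assuming each listing page can be reduced to a list of rows and that a page past the end comes back empty. The names crawl_pages, fetch_rows, and max_pages are hypothetical, and fake_fetch_rows only stands in for a real request:

```python
def crawl_pages(fetch_rows, max_pages=5000):
    """Call fetch_rows(page) for page = 1, 2, ... until a page comes back empty.

    fetch_rows is a hypothetical callable that returns the list of rows
    found on one listing page (for example, the drug links).
    max_pages is an arbitrary safety cap so that a site which never
    returns an empty page cannot cause an endless loop.
    """
    all_rows = []
    for page in range(1, max_pages + 1):
        rows = fetch_rows(page)
        if not rows:  # empty page: assume we ran past the last one
            break
        all_rows.extend(rows)
    return all_rows


# Stand-in for a real fetcher: pretends the site has exactly three pages.
def fake_fetch_rows(page):
    if page > 3:
        return []
    return ["drug-%d-%d" % (page, i) for i in range(2)]


print(len(crawl_pages(fake_fetch_rows)))  # prints 6
```

In real use, fetch_rows would request the page and return whatever soup.select(...) finds on it.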

You can also find the URL of the next page in the HTML:

<a rel="next" class="page-link" href="/drugs?approved=1&amp;c=name&amp;d=up&amp;page=2">›</a>

so you can use BeautifulSoup to get this link and follow it.

The following code prints the current URL, finds the link to the next page (using class="page-link" and rel="next"), and loads it:

import requests
from bs4 import BeautifulSoup

def drug_data():
    url = 'https://www.drugbank.ca/drugs/'

    while url:
        print(url)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "lxml")

        #data = soup.select('name-head a')
        #for link in data:
        #    href = 'https://www.drugbank.ca/drugs/' + link.get('href')
        #    pages_data(href)

        # next page url: an empty result means this was the last page
        next_links = soup.find_all('a', {'class': 'page-link', 'rel': 'next'})
        if next_links:
            url = 'https://www.drugbank.ca' + next_links[0].get('href')
        else:
            url = None  # ends the `while url` loop

drug_data()
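
Joining the host and the href by string concatenation works only while every href has the same shape. A small sketch of a more forgiving approach, assuming Python 3 and the standard-library urljoin; resolve_link and BASE_URL are hypothetical names, and the second href is a made-up value for illustration:

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.drugbank.ca/drugs/'


def resolve_link(page_url, href):
    """Turn an href found on page_url into an absolute URL (or None)."""
    if not href:
        return None
    return urljoin(page_url, href)


# The href of the "next" link shown above, as BeautifulSoup returns it
# (entities such as &amp; are already decoded).
print(resolve_link(BASE_URL, '/drugs?approved=1&c=name&d=up&page=2'))

# A relative href (hypothetical value) resolves against the page URL.
print(resolve_link(BASE_URL, 'DB00001'))
```

The same helper handles both root-relative and page-relative links, so it could be used for the next-page link and for the links to individual drug pages.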

BTW: never use except: pass, because you can get an error you didn't expect and you won't know why the code doesn't work. It is better to display the error:

except Exception as ex:
    print('Error:', ex)
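
One step further is to catch only the errors you expect and let everything else propagate. A minimal sketch, assuming the failures of interest are a missing table cell or an item without a .text attribute; first_cell_text and FakeCell are hypothetical names used only for illustration:

```python
def first_cell_text(cells):
    """Return the stripped text of the first cell, or None if it is missing."""
    try:
        return cells[0].text.strip()
    except (IndexError, AttributeError) as ex:
        # IndexError: the list is empty. AttributeError: the item has no .text
        # (for example, it is None). Any other error is unexpected and propagates.
        print('Error:', repr(ex))
        return None


class FakeCell(object):
    """Stand-in for a parsed <td> element, used only for this demonstration."""

    def __init__(self, text):
        self.text = text


print(first_cell_text([FakeCell('  Example drug  ')]))
print(first_cell_text([]))
```

With BeautifulSoup, cells would be the list returned by a call such as findAll('td', ...).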

Answered By: furas
