BeautifulSoup & Pagination

Question:

I am trying to scrape a website that lists multiple cities, each of which contains multiple companies. I am trying to build a scraper (or crawler, I guess) that goes through all the cities and collects every company from them. The issue I'm facing is that I cannot figure out how to go through all of the pages in each specific city, since every city has a different number of pages. Here is my code:

    import requests
    from bs4 import BeautifulSoup

    html_text = requests.get('https://partnercarrier.com/IL').text
    soup = BeautifulSoup(html_text, 'lxml')
    cities = soup.find_all('div', class_ = 'col-md-4 col-sm-6 col-xs-12 form-group')[10:]
    
    for city in cities:
        cityInfo = city.find('a', class_ = 'city-link-font-size')
        url = cityInfo.get('href')
        current_city = url.split('/')[2]
        with open(f'cities/{current_city}.csv', 'a') as c:
            city_html = requests.get(f'https://partnercarrier.com/IL/{current_city}').text
            stew = BeautifulSoup(city_html, 'lxml')
            for company in stew.find_all('div', class_ = 'col-md-12 col-sm-12 col-xs-12 div-border'):
                if 'MC :N/A' in company.text:
                    continue
                print(company.text.replace('\n', ''))
                c.writelines(company.text.replace('\n', ''))
                c.writelines('\n')

I’ve tried the following, but it didn’t produce the output I wanted. It just went through the first page of the first city and printed the same companies over and over again; it didn’t even write to the CSV.

    for city in cities:
        cityInfo = city.find('a', class_ = 'city-link-font-size')
        url = cityInfo.get('href')
        current_city = url.split('/')[2]
        with open(f'cities/{current_city}.csv', 'a') as c:
            page_count = 1
            while page_count >= 1:
                try:
                    city_html = requests.get(f'https://partnercarrier.com/IL/{current_city}').text
                    stew = BeautifulSoup(city_html, 'lxml')
                    for company in stew.find_all('div', class_ = 'col-md-12 col-sm-12 col-xs-12 div-border'):
                        if 'MC :N/A' in company.text:
                            continue
                        print(company.text.replace('\n', ''))
                        c.writelines(company.text.replace('\n', ''))
                        c.writelines('\n')
                        page_count += 1
                except:
                    continue

Any help is appreciated!

Edit:
I fixed it with the help of the kind commenter below, and by changing the following:

        if pagination:
            try:
                num_pages = int(pagination.find_all('a')[-2].text)
            except:
                num_pages = int(pagination.find_all('a')[-4].text)
        else:
            num_pages = 1

Edit 2: Well, my fix in fact did not work, but the edit by the commenter absolutely did! Thank you so much, kind strangers!

Asked By: Vladeta


Answers:

To scrape all the pages in each city, you will need to identify the pagination element and extract the total number of pages. You can then loop through all the pages using a for loop, appending the page number to the URL for each request. Here is an updated version of your code that includes pagination logic:

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = 'https://partnercarrier.com'

    html_text = requests.get(f'{BASE_URL}/IL').text
    soup = BeautifulSoup(html_text, 'lxml')
    cities = soup.find_all('div', class_='col-md-4 col-sm-6 col-xs-12 form-group')[10:]

    for city in cities:
        cityInfo = city.find('a', class_='city-link-font-size')
        url = cityInfo.get('href')
        current_city = url.split('/')[2]
        with open(f'cities/{current_city}.csv', 'a') as c:
            city_html = requests.get(f'{BASE_URL}/IL/{current_city}').text
            stew = BeautifulSoup(city_html, 'lxml')

            # Extract the pagination element and get the total number of pages
            pagination = stew.find('ul', class_='pagination')
            if pagination:
                num_pages = int(pagination.find_all('a')[-2].text)
            else:
                num_pages = 1

            # Loop through all the pages in the current city
            for page in range(1, num_pages + 1):
                page_html = requests.get(f'{BASE_URL}/IL/{current_city}/{page}').text
                page_soup = BeautifulSoup(page_html, 'lxml')
                for company in page_soup.find_all('div', class_='col-md-12 col-sm-12 col-xs-12 div-border'):
                    if 'MC :N/A' in company.text:
                        continue
                    print(company.text.replace('\n', ''))
                    c.writelines(company.text.replace('\n', ''))
                    c.writelines('\n')
In this updated code, we first define a BASE_URL variable to store the base URL for the website. We then extract the pagination element from the city page and get the total number of pages. If there is no pagination element, we assume that there is only one page.

We then loop through all the pages in the current city using a range function, appending the page number to the URL for each request. Finally, we scrape the companies from each page as before and write them to the CSV file.
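To make the pagination step concrete, here is a small, self-contained sketch run against made-up markup (the real site's HTML is an assumption here; in particular, the last `<a>` is assumed to be a "next" arrow, which is why index `[-2]` holds the highest page number):

    from bs4 import BeautifulSoup

    # Hypothetical pagination markup -- the live site's structure may differ.
    sample_html = '''
    <ul class="pagination">
      <li><a href="/IL/Chicago/1">1</a></li>
      <li><a href="/IL/Chicago/2">2</a></li>
      <li><a href="/IL/Chicago/3">3</a></li>
      <li><a href="/IL/Chicago/2">&raquo;</a></li>
    </ul>
    '''

    pagination = BeautifulSoup(sample_html, 'lxml').find('ul', class_='pagination')
    links = pagination.find_all('a')
    # The final <a> is the "next" arrow, so the second-to-last anchor
    # holds the number of the last page.
    print(links[-2].text)  # 3

If the site renders its pagination differently on some pages, the index to grab will differ too, which is exactly what the EDIT below guards against.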

EDIT: On some city pages the second-to-last pagination link is not a page number, so int() raises a ValueError. The version below checks isdigit() before converting and falls back to a single page:

    for city in cities:
        cityInfo = city.find('a', class_='city-link-font-size')
        url = cityInfo.get('href')
        current_city = url.split('/')[2]
        with open(f'cities/{current_city}.csv', 'a') as c:
            city_html = requests.get(f'{BASE_URL}/IL/{current_city}').text
            stew = BeautifulSoup(city_html, 'lxml')

            # Extract the pagination element and get the total number of pages
            pagination = stew.find('ul', class_='pagination')
            if pagination:
                last_page_link = pagination.find_all('a')[-2].text
                if last_page_link.isdigit():
                    num_pages = int(last_page_link)
                else:
                    num_pages = 1
            else:
                num_pages = 1

            # Loop through all the pages in the current city
            for page in range(1, num_pages + 1):
                page_html = requests.get(f'{BASE_URL}/IL/{current_city}/{page}').text
                page_soup = BeautifulSoup(page_html, 'lxml')
                for company in page_soup.find_all('div', class_='col-md-12 col-sm-12 col-xs-12 div-border'):
                    if 'MC :N/A' in company.text:
                        continue
                    print(company.text.replace('\n', ''))
                    c.writelines(company.text.replace('\n', ''))
                    c.writelines('\n')
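As a variation, you can avoid parsing the pagination element entirely by requesting successive page numbers until a page comes back with no company listings. This is only a sketch under the same URL-scheme assumption (/IL/<city>/<page>) and is untested against the live site:

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = 'https://partnercarrier.com'

    def scrape_city(state, city):
        """Yield cleaned company entries, walking pages until one is empty."""
        page = 1
        while True:
            html = requests.get(f'{BASE_URL}/{state}/{city}/{page}').text
            soup = BeautifulSoup(html, 'lxml')
            companies = soup.find_all('div', class_='col-md-12 col-sm-12 col-xs-12 div-border')
            if not companies:  # walked past the last page
                break
            for company in companies:
                if 'MC :N/A' not in company.text:
                    yield company.text.replace('\n', ' ').strip()
            page += 1

    # Example usage (hypothetical city):
    # for entry in scrape_city('IL', 'Chicago'):
    #     print(entry)

One caveat: this assumes the site returns an empty listing (rather than repeating page 1 or serving an error page) for out-of-range page numbers, so verify that behavior before relying on it.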
Answered By: Omnishroom