BeautifulSoup & Pagination
Question:
I am trying to scrape a website that has multiple cities, and within each city there are multiple companies. I am trying to make a scraper (or crawler, I guess) that can go through all the cities and collect all of the companies from them. The issue I'm facing is that I cannot figure out how to go through all of the pages in each specific city, since each city has a different number of pages. Here is my code:
import requests
from bs4 import BeautifulSoup

html_text = requests.get('https://partnercarrier.com/IL').text
soup = BeautifulSoup(html_text, 'lxml')
cities = soup.find_all('div', class_='col-md-4 col-sm-6 col-xs-12 form-group')[10:]

for city in cities:
    cityInfo = city.find('a', class_='city-link-font-size')
    url = cityInfo.get('href')
    current_city = url.split('/')[2]
    with open(f'cities/{current_city}.csv', 'a') as c:
        city_html = requests.get(f'https://partnercarrier.com/IL/{current_city}').text
        stew = BeautifulSoup(city_html, 'lxml')
        for company in stew.find_all('div', class_='col-md-12 col-sm-12 col-xs-12 div-border'):
            if 'MC :N/A' in company.text:
                continue
            print(company.text.replace('\n', ''))
            c.write(company.text.replace('\n', ''))
            c.write('\n')
I’ve tried the following, but it didn’t produce the output I wanted. It only went through the first page of the first city and printed the same companies over and over again, and it never wrote to the CSV.
for city in cities:
    cityInfo = city.find('a', class_='city-link-font-size')
    url = cityInfo.get('href')
    current_city = url.split('/')[2]
    with open(f'cities/{current_city}.csv', 'a') as c:
        page_count = 1
        while page_count >= 1:
            try:
                city_html = requests.get(f'https://partnercarrier.com/IL/{current_city}').text
                stew = BeautifulSoup(city_html, 'lxml')
                for company in stew.find_all('div', class_='col-md-12 col-sm-12 col-xs-12 div-border'):
                    if 'MC :N/A' in company.text:
                        continue
                    print(company.text.replace('\n', ''))
                    c.write(company.text.replace('\n', ''))
                    c.write('\n')
                page_count += 1
            except:
                continue
Any help is appreciated!
Edit:
I fixed it with the help of the kind commenter below, and by changing the following:
if pagination:
    try:
        num_pages = int(pagination.find_all('a')[-2].text)
    except:
        num_pages = int(pagination.find_all('a')[-4].text)
else:
    num_pages = 1
Edit2: Well, my fix in fact did not work, but the edit by the commenter absolutely did! Thank you so much, kind strangers!
Answers:
To scrape all the pages in each city, you need to find the pagination element and extract the total number of pages. You can then loop through the pages with a for loop, appending the page number to the URL for each request. Here is an updated version of your code with pagination logic:
import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://partnercarrier.com'

html_text = requests.get(f'{BASE_URL}/IL').text
soup = BeautifulSoup(html_text, 'lxml')
cities = soup.find_all('div', class_='col-md-4 col-sm-6 col-xs-12 form-group')[10:]

for city in cities:
    cityInfo = city.find('a', class_='city-link-font-size')
    url = cityInfo.get('href')
    current_city = url.split('/')[2]
    with open(f'cities/{current_city}.csv', 'a') as c:
        city_html = requests.get(f'{BASE_URL}/IL/{current_city}').text
        stew = BeautifulSoup(city_html, 'lxml')

        # Extract the pagination element and get the total number of pages
        pagination = stew.find('ul', class_='pagination')
        if pagination:
            num_pages = int(pagination.find_all('a')[-2].text)
        else:
            num_pages = 1

        # Loop through all the pages in the current city
        for page in range(1, num_pages + 1):
            page_html = requests.get(f'{BASE_URL}/IL/{current_city}/{page}').text
            page_soup = BeautifulSoup(page_html, 'lxml')
            for company in page_soup.find_all('div', class_='col-md-12 col-sm-12 col-xs-12 div-border'):
                if 'MC :N/A' in company.text:
                    continue
                print(company.text.replace('\n', ''))
                c.write(company.text.replace('\n', ''))
                c.write('\n')
In this updated code, we first define a BASE_URL constant for the website's base URL. We then extract the pagination element from the city page and read the total number of pages; if there is no pagination element, we assume there is only one page.
We then loop through all the pages in the current city with range(), appending the page number to the URL of each request. Finally, we scrape the companies from each page as before and write them to the CSV file.
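If the position of the last-page link ever shifts (prev/next arrows, ellipsis items), indexing with [-2] can grab the wrong element. A more defensive sketch, assuming the same ul.pagination markup, is to take the largest numeric link instead of a fixed index:

```python
from bs4 import BeautifulSoup

def total_pages(soup):
    """Return the highest page number in the pagination bar, or 1 if absent."""
    pagination = soup.find('ul', class_='pagination')
    if not pagination:
        return 1
    # Keep only links whose text is a page number, ignoring arrows/ellipses.
    numbers = [int(a.text) for a in pagination.find_all('a')
               if a.text.strip().isdigit()]
    return max(numbers) if numbers else 1
```

This makes the edit below unnecessary, since non-numeric links are filtered out up front rather than guarded against after the fact.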
EDIT: the [-2] link is not always a page number, so check it with isdigit() before converting:
for city in cities:
    cityInfo = city.find('a', class_='city-link-font-size')
    url = cityInfo.get('href')
    current_city = url.split('/')[2]
    with open(f'cities/{current_city}.csv', 'a') as c:
        city_html = requests.get(f'{BASE_URL}/IL/{current_city}').text
        stew = BeautifulSoup(city_html, 'lxml')

        # Extract the pagination element and get the total number of pages
        pagination = stew.find('ul', class_='pagination')
        if pagination:
            last_page_link = pagination.find_all('a')[-2].text
            if last_page_link.isdigit():
                num_pages = int(last_page_link)
            else:
                num_pages = 1
        else:
            num_pages = 1

        # Loop through all the pages in the current city
        for page in range(1, num_pages + 1):
            page_html = requests.get(f'{BASE_URL}/IL/{current_city}/{page}').text
            page_soup = BeautifulSoup(page_html, 'lxml')
            for company in page_soup.find_all('div', class_='col-md-12 col-sm-12 col-xs-12 div-border'):
                if 'MC :N/A' in company.text:
                    continue
                print(company.text.replace('\n', ''))
                c.write(company.text.replace('\n', ''))
                c.write('\n')
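One more suggestion: dumping the raw div text into the file produces one unquoted blob per company, which breaks as soon as a company name contains a comma. Python's csv module handles quoting for you. A minimal sketch; the field-splitting heuristic (one field per non-empty line of the div's text) is hypothetical, since it depends on how the site lays out each company block:

```python
import csv

def write_company_rows(path, company_texts):
    """Append one properly quoted CSV row per company text block."""
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        for text in company_texts:
            # One field per non-empty line of the company's text block.
            fields = [line.strip() for line in text.splitlines() if line.strip()]
            writer.writerow(fields)
```

Inside the page loop you would collect company.text values and pass them to this helper instead of calling c.write() directly.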