How do I find out how many pages a forum thread has? (Web scraping)
Question:
I have a website (https://www.diabetesdaily.com/forum/threads/had-a-friend-with-type-one.136015/)
and I found that the thread spans many forum pages.
I want to use a for loop for web scraping, so how can I get the maximum number of forum pages for this thread with BeautifulSoup?
Many thanks.
Answers:
You can try something like this:
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.diabetesdaily.com/forum/threads/had-a-friend-with-type-one.136015/"
req = requests.get(url)
soup = bs(req.content, 'html.parser')

# Count the direct <li> children of the page navigation list
navs = soup.find("ul", {"class": "pageNav-main"}).find_all("li", recursive=False)
print(navs)
print(f'Length: {len(navs)}')
Result
[<li class="pageNav-page pageNav-page--current"><a href="/forum/threads/had-a-friend-with-type-one.136015/">1</a></li>, <li class="pageNav-page pageNav-page--later"><a href="/forum/threads/had-a-friend-with-type-one.136015/page-2">2</a></li>, <li class="pageNav-page pageNav-page--later"><a href="/forum/threads/had-a-friend-with-type-one.136015/page-3">3</a></li>, <li class="pageNav-page"><a href="/forum/threads/had-a-friend-with-type-one.136015/page-4">4</a></li>]
Length: 4
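Once you have that count, you can build the URL for each page in a for loop. A minimal sketch, based on the hrefs in the output above (page 1 is served at the bare thread URL, while later pages use the page-N suffix):

```python
def page_urls(base_url, num_pages):
    # Page 1 is the bare thread URL; pages 2..N append "page-N"
    return [base_url if n == 1 else f"{base_url}page-{n}"
            for n in range(1, num_pages + 1)]

base = "https://www.diabetesdaily.com/forum/threads/had-a-friend-with-type-one.136015/"
for url in page_urls(base, 4):  # 4 = len(navs) from the snippet above
    print(url)
    # requests.get(url) and parse each page with BeautifulSoup here
```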
You don’t need BeautifulSoup to count the number of pages.
URL of page 1 : https://www.diabetesdaily.com/forum/threads/had-a-friend-with-type-one.136015/page-1
URL of page 2 : https://www.diabetesdaily.com/forum/threads/had-a-friend-with-type-one.136015/page-2
URL of page 3 : https://www.diabetesdaily.com/forum/threads/had-a-friend-with-type-one.136015/page-3
And so on…
So you need to increment the value X
in https://www.diabetesdaily.com/forum/threads/had-a-friend-with-type-one.136015/page-X
to move to the next page. You can then check the response's status code and the page title to make sure the same page is not visited twice.
import requests
import re

def getPageTitle(response_text):
    # Extract the contents of the <title> tag
    match = re.search(r'<title[^>]*>(.*?)</title', response_text,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1) if match else None

def count_pages():
    count = 0
    uniquePages = set()
    while True:
        count += 1
        url = ('https://www.diabetesdaily.com/forum/threads/' +
               f'had-a-friend-with-type-one.136015/page-{count}')
        response = requests.get(url)
        # Stop on a failed request before trying to parse the body
        if response.status_code != 200:
            break
        title = getPageTitle(response.text)
        # A repeated title means we are being served a page we already saw
        if title in uniquePages:
            break
        uniquePages.add(title)
    return len(uniquePages)

print(count_pages())  # 4