How do I find out how many pages a forum thread has? (Web scraping)

Question:

I have a website (https://www.diabetesdaily.com/forum/threads/had-a-friend-with-type-one.136015/)

and I found that it has many forum pages.

I want to use a for loop for web scraping, so could I ask: how do I get the maximum number of forum pages on this page with BeautifulSoup?
Many thanks.

Asked By: KiuSandy


Answers:

You can try something like this:

import requests
from bs4 import BeautifulSoup as bs

url = "https://www.diabetesdaily.com/forum/threads/had-a-friend-with-type-one.136015/"
req = requests.get(url)
soup = bs(req.content, 'html.parser')

# The pagination bar is a <ul class="pageNav-main"> with one direct <li> per page link.
navs = soup.find("ul", {"class": "pageNav-main"}).find_all("li", recursive=False)
print(navs)
print(f'Length: {len(navs)}')

Result

[<li class="pageNav-page pageNav-page--current"><a href="/forum/threads/had-a-friend-with-type-one.136015/">1</a></li>, <li class="pageNav-page pageNav-page--later"><a href="/forum/threads/had-a-friend-with-type-one.136015/page-2">2</a></li>, <li class="pageNav-page pageNav-page--later"><a href="/forum/threads/had-a-friend-with-type-one.136015/page-3">3</a></li>, <li class="pageNav-page"><a href="/forum/threads/had-a-friend-with-type-one.136015/page-4">4</a></li>]
Length: 4
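Counting the `<li>` items works on this thread, but on very long threads a pagination bar can collapse middle pages into an ellipsis separator, in which case the item count no longer equals the page count. A slightly more robust sketch is to keep only the numeric items and take the largest page number. The markup below is adapted from the result above, with a hypothetical ellipsis item added for illustration:

```python
from bs4 import BeautifulSoup as bs

# Pagination markup adapted from the result above; the "…" item is a
# hypothetical example of a collapsed-range separator on a long thread.
html = '''<ul class="pageNav-main">
<li class="pageNav-page pageNav-page--current"><a href="/forum/threads/had-a-friend-with-type-one.136015/">1</a></li>
<li class="pageNav-page"><a href="/forum/threads/had-a-friend-with-type-one.136015/page-2">2</a></li>
<li>…</li>
<li class="pageNav-page"><a href="/forum/threads/had-a-friend-with-type-one.136015/page-4">4</a></li>
</ul>'''

soup = bs(html, "html.parser")
nav = soup.find("ul", {"class": "pageNav-main"})

# Keep only items whose text is a number, so ellipsis separators are skipped.
numbers = [int(li.get_text(strip=True))
           for li in nav.find_all("li", recursive=False)
           if li.get_text(strip=True).isdigit()]
print(max(numbers))  # 4
```

For the live page you would fetch `html` with `requests.get(url).content` as in the answer above.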
Answered By: Coderio

You don’t need BeautifulSoup to count the number of pages.

URL of page 1 : https://www.diabetesdaily.com/forum/threads/had-a-friend-with-type-one.136015/page-1

URL of page 2 : https://www.diabetesdaily.com/forum/threads/had-a-friend-with-type-one.136015/page-2

URL of page 3 : https://www.diabetesdaily.com/forum/threads/had-a-friend-with-type-one.136015/page-3

And so on…

So you need to increment the value X in https://www.diabetesdaily.com/forum/threads/had-a-friend-with-type-one.136015/page-X to move to the next page. You can then check the status code of the response and the page title to ensure that you are not visiting the same page twice.

import requests
import re


def getPageTitle(response_text):
    # Raw string keeps the backslashes: \W matches the punctuation around
    # the tag name, e.g. the ">" in "<title>".
    match = re.search(r'<\W*title\W*(.*)</title', response_text, re.IGNORECASE)
    return match.group(1) if match else None


def count_pages():
    count = 0
    unique_pages = set()
    while True:
        count += 1
        url = ('https://www.diabetesdaily.com/forum/threads/' +
               f'had-a-friend-with-type-one.136015/page-{count}')
        response = requests.get(url)
        # Check the status before parsing: an error page may have no title.
        if response.status_code != 200:
            break
        title = getPageTitle(response.text)
        # An out-of-range page number serves a title we have already seen.
        if title is None or title in unique_pages:
            break
        unique_pages.add(title)
    return len(unique_pages)


print(count_pages())  # 4
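Once the page count is known (by either answer), the for loop the question asks about reduces to formatting the page number into the URL pattern shown above. A minimal sketch, where `page_urls` is an illustrative helper, not part of either answer:

```python
# URL pattern taken from the answers above.
BASE = 'https://www.diabetesdaily.com/forum/threads/had-a-friend-with-type-one.136015'

def page_urls(total_pages):
    """Return the URL of every page in the thread, in order."""
    return [f'{BASE}/page-{n}' for n in range(1, total_pages + 1)]

for url in page_urls(4):
    print(url)
    # response = requests.get(url)  # fetch and scrape each page here
```

Page 1 is also reachable at the bare thread URL; XenForo-style forums generally accept the explicit `page-1` form as well.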
Answered By: Bunny