Web Scraper failing after 3rd page ('NoneType' object has no attribute 'find_all')

Question:

I’ve written a function to get the names of authors and their respective links from a sandbox website (https://quotes.toscrape.com/); it should move on to the next page once every author on the current page has been covered.
It works for the first two pages but fails when moving on to the third with the error ‘NoneType’ object has no attribute ‘find_all’.

Why would it break at the start of a new page when it has already moved between pages successfully?

Here’s the function:

def AuthorLink(url):
    a = 0
    url = url
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    divContainer = soup.find("div", class_="container")
    divRow = divContainer.find_all("div", class_= "row")

    for result in divRow:
        divQuotes = result.find_all("div", class_="quote")
        for quotes in divQuotes:
            for el in quotes.find_all("small", class_="author"):
                print(el.get_text())
            for link in quotes.find_all("a"):
                if link['href'][1:7] == "author":
                    print(url + link['href'])
    a += 1
    print("Page:", a)
    nav = soup.find("li", class_="next")
    nextPage = nav.find("a")
    AuthorLink(url + nextPage['href'])
    

Here’s the code that it broke on:

      5     soup = BeautifulSoup(page.content, "html.parser")
      6     divContainer = soup.find("div", class_="container")
----> 7     divRow = divContainer.find_all("div", class_= "row")

I don’t see why this is happening if it ran for the first two pages successfully.

I’ve checked the structure of the website and it seems to change very little from one page to the next.

I’ve also tried changing the code so that, instead of following the link from "Next" at the bottom of the page, it just appends the next page number to the URL, but this doesn’t work either.

Asked By: caasi


Answers:

You are facing this error because each new request URL is appended onto the previous one. Across successive calls, the url value becomes:

  1. "https://quotes.toscrape.com/", which works;
  2. "https://quotes.toscrape.com/page/2/", which also works;
  3. "https://quotes.toscrape.com/page/2//page/3/", which the website cannot serve, so soup.find("div", class_="container") returns None and calling find_all on it raises the error.

The exact solution could take different forms, but here is a slightly modified version of your code:

import requests
from bs4 import BeautifulSoup


base_url = "https://quotes.toscrape.com"

def AuthorLink(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    divContainer = soup.find("div", class_="container")
    # the container holds two "row" divs: the header row and the row
    # with the quotes; index 1 selects the row with the quotes
    divRow = divContainer.find_all("div", class_="row")[1]

    divQuotes = divRow.find_all("div", class_="quote")
    for quotes in divQuotes:
        for el in quotes.find_all("small", class_="author"):
            print(el.get_text())
        for link in quotes.find_all("a"):
            if link['href'][1:7] == "author":
                print(base_url + link['href'])

for i in range(1, 5):  # pages 1 through 4
    AuthorLink(f"{base_url}/page/{i}")

I have defined a new base_url variable to store the actual website root. The next page is always "/page/[i]", so a for loop can generate i = 1, 2, 3, …. The other change is print(base_url + link['href']): you had used url instead of base_url, which leads to the same URL-concatenation problem described above.
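As a side note, range(1, 5) hardcodes the first four pages. If you don't know the page count in advance, you can instead keep requesting pages until the "Next" button disappears. A minimal sketch of that idea, assuming the same page structure (scrape_all_authors is just an illustrative name):

import requests
from bs4 import BeautifulSoup

base_url = "https://quotes.toscrape.com"

def scrape_all_authors():
    path = "/"
    while path:
        soup = BeautifulSoup(requests.get(base_url + path).content, "html.parser")
        for quote in soup.find_all("div", class_="quote"):
            print(quote.find("small", class_="author").get_text())
        # the "Next" button is absent on the last page, so find() returns None
        nav = soup.find("li", class_="next")
        path = nav.find("a")["href"] if nav else None

scrape_all_authors()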

Answered By: imxitiz