Why does my code only sometimes find the HTML object?

Question

I am trying to scrape the Amazon product review count and convert it to an integer. The same code, unchanged, works only about 50% of the time. It seems that the object assigned to the variable review_count is not always found, which gives an error when html2text runs.

I don’t understand why the object is not found every time. Is there a better way of going about this? I appreciate any help.

from requests_html import HTMLSession
from bs4 import BeautifulSoup
import html2text
import re

url = 'https://www.amazon.com/product-reviews/B07V37GVY9/pageNumber=1'
s = HTMLSession()
r = s.get(url)

soup1 = BeautifulSoup(r.content, "html.parser")

review_count = soup1.find(string=re.compile("with reviews"))
review_txt = (html2text.html2text(review_count))
reviews_list = review_txt.split()
reviews = reviews_list.pop(3)
reviews = reviews.replace(",","")
reviews = int(reviews)

print(reviews)

Asked By: BKANE

Answer 1

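You are getting rate limited, so you need to slow down your requests. When that happens, the response is a CAPTCHA page rather than the reviews page, so the "with reviews" string is missing and find() returns None.

You can, however, detect the rate limit in code and retry: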

import time

CAPTCHA_TEXT = "Sorry, we just need to make sure you're not a robot."
r = s.get(url)

# If we get rate limited
while CAPTCHA_TEXT in r.text:
    # Wait for a bit
    time.sleep(30)
    # And try again :)
    r = s.get(url)
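
The loop above retries forever if the block never lifts, and the parsing step in the question still fails when the string is not found. As a minimal sketch of one way to guard against both, the helpers below cap the number of attempts and return None instead of raising. The names fetch_with_retries and parse_review_count are hypothetical, and the regular expression assumes the count appears directly before the words "with reviews", which may not match every page layout.

```python
import re
import time

CAPTCHA_TEXT = "Sorry, we just need to make sure you're not a robot."


def fetch_with_retries(fetch, max_attempts=5, delay=30.0, sleep=time.sleep):
    """Call fetch() until the page is not the CAPTCHA page, or give up.

    fetch is any zero-argument callable that returns the page text,
    for example: lambda: s.get(url).text
    """
    for attempt in range(max_attempts):
        text = fetch()
        if CAPTCHA_TEXT not in text:
            return text
        # Wait a little longer after each blocked attempt,
        # but do not sleep after the final one.
        if attempt < max_attempts - 1:
            sleep(delay * (attempt + 1))
    return None


def parse_review_count(text):
    """Return the number just before 'with reviews' as an int, or None."""
    if text is None:
        return None
    # Assumption: the count directly precedes the words "with reviews".
    match = re.search(r"(\d[\d,]*)\s+with reviews", text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))
```

With the session from the question, this could be called as parse_review_count(fetch_with_retries(lambda: s.get(url).text)); a result of None then means the page was still blocked or the text was not present, rather than an exception.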

Answered By: Xiddoc
