How to find body of a webpage using beautifulsoup

Question

I want to check whether more than 500 webpages have any content, using Beautiful Soup. This is the script I wrote. It works, but it stops partway through, and when I fix one error a different one appears. Below is the code I tried. I just want to be sure each page has a body. I'm also unsure how to handle timeouts; maybe the website needs more time.

method 1:

res  = requests.get(full_https_url, timeout=40)
soup  = bs4.BeautifulSoup(res.text, 'html.parser')
elems = soup.select('body')
if elems == '':
    pass
else:
    print('body found')

method 2:

soup  = bs4.BeautifulSoup(res.text, 'html.parser')
elems = soup.select('body')
if elems != '':
    print('body found')
else:
    pass

Asked By: stack overflow

Answer 1

select() returns a list, not a string, so it will always compare not equal to '', whether it’s successful or not. Just test if the result is not empty.
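
To illustrate the point above, here is a small sketch (the two HTML strings are made-up samples, not pages from the question). It should show that the result of select() is a list subclass, that it does not compare equal to '' even when nothing matched, and that plain truthiness separates the two cases.

```python
import bs4

# Hypothetical sample documents, used only for illustration.
with_body = bs4.BeautifulSoup("<html><body><p>hi</p></body></html>", "html.parser")
without_body = bs4.BeautifulSoup("<p>fragment only</p>", "html.parser")

found = with_body.select("body")       # one matching <body> tag
missing = without_body.select("body")  # html.parser does not add a <body> for you

print(isinstance(missing, list))   # the result behaves like a list
print(missing == "")               # a list is never equal to a string
print(bool(found), bool(missing))  # truthiness is the useful test
```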

Use try/except to catch the timeout error.

import bs4
import requests

try:
    res = requests.get(full_https_url, timeout=40)
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('body')
    if elems:
        print('body found')  # do stuff with the page here
    else:
        print(f"No body in {full_https_url}")
except requests.exceptions.Timeout:
    print(f"Timeout on {full_https_url}, skipping")

Answered By: Barmar
