Beautiful Soup Nested Tag Search

Question:

I am trying to write a Python program that counts the words on a web page. I use Beautiful Soup 4 to scrape the page, but I have difficulty accessing nested HTML tags (for example, <p class="hello"> inside a <div>).

Every time I try to find such a tag using the page.findAll() method (page is the Beautiful Soup object containing the whole page), it doesn't find any, although they are there. Is there a simple method or another way to do it?

Asked By: Asafwr


Answers:

UPDATE: I noticed that .text does not always return the expected result. Reading the docs, I found that there is a built-in method for getting the text, get_text(). Use it like this:

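from bs4 import BeautifulSoup

# Read the page from a local file; the with-block closes it automatically
with open('index.html', 'r') as fd:
    website = fd.read()

soup = BeautifulSoup(website, 'html.parser')
contents = soup.get_text(separator=" ")
# split() with no argument also collapses runs of whitespace
print("number of words %d" % len(contents.split()))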
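
To try the get_text() approach without a local file, here is a minimal sketch that runs on an inline HTML string. SAMPLE_HTML and count_words are hypothetical names made up for this illustration, and the markup is invented to mirror the nested <p class="hello"> inside a <div> from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a real page
SAMPLE_HTML = """
<body>
  <div class="outer">
    <p class="hello">Hello   nested world</p>
    <p>Second paragraph</p>
  </div>
</body>
"""

def count_words(html):
    """Count whitespace-separated words in the text of an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text() collects the text of every tag, however deeply nested
    text = soup.get_text(separator=" ")
    return len(text.split())

word_count = count_words(SAMPLE_HTML)
print(word_count)
```

The nesting depth does not matter here, because get_text() walks the whole tree under the object it is called on.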

INCORRECT, please read above. Supposing you have your HTML file saved locally as index.html, you can do the following:

from bs4 import BeautifulSoup
import re

BLACKLIST = ["html", "head", "title", "script"]  # tags to be ignored

with open('index.html', 'r') as fd:
    website = fd.read()

soup = BeautifulSoup(website, 'html.parser')
tags = soup.find_all(True)  # find every tag
print("there are %d tags" % len(tags))

count = 0
# Non-capturing group, so split() does not return the separators themselves
matcher = re.compile(r"(?:\s|<br>)+")
for tag in tags:
    if tag.name.lower() in BLACKLIST:
        continue
    temp = matcher.split(tag.text)  # split on whitespace tokens such as \s and \n
    temp = [word for word in temp if word]  # remove empty elements from the list
    count += len(temp)
print("number of words in the document %d" % count)

Please note that the count may not be accurate, for example because of formatting errors, false positives (it counts any word, even if it is code), text that is shown dynamically using JavaScript or CSS, or other reasons.
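
One of the false positives mentioned above, code being counted as words, can be reduced by dropping <script> and <style> tags before extracting the text. This is only a sketch under that assumption: SAMPLE_HTML is invented for the illustration, and it does nothing about text that is shown or hidden dynamically.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup mixing code and visible text
SAMPLE_HTML = """
<div>
  <script>var hidden = "not page text";</script>
  <style>p { color: red; }</style>
  <p class="hello">Visible words here</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Remove the tags whose contents are code rather than page text
for tag in soup.find_all(["script", "style"]):
    tag.decompose()

words = soup.get_text(separator=" ").split()
print(len(words))
```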

Answered By: Melardev

I'm guessing that what you are trying to do is first look in a specific div tag, then search for all the p tags inside it and count them or do whatever you want with them. For example:

import bs4

# content is assumed to hold the page's HTML as a string
soup = bs4.BeautifulSoup(content, 'html.parser')

# This will get the div (find() returns None if no such div exists)
div_container = soup.find('div', class_='some_class')

# Then search in that div_container for all p tags with class "hello"
for ptag in div_container.find_all('p', class_='hello'):
    # prints the p tag content
    print(ptag.text)
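
The same two-step search can also be sketched with a CSS selector through select(), which matches descendants at any depth. SAMPLE_HTML is a made-up document for this illustration; the class names some_class and hello are taken from the example above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: two matching p tags inside the div, one outside
SAMPLE_HTML = """
<div class="some_class">
  <section>
    <p class="hello">first</p>
    <p class="other">skipped</p>
  </section>
  <p class="hello">second</p>
</div>
<p class="hello">outside the div</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "div.some_class p.hello" means: p.hello anywhere below div.some_class
nested = soup.select("div.some_class p.hello")
nested_texts = [p.get_text() for p in nested]

# find_all() on the whole soup is also recursive, so it sees all three
all_hello = soup.find_all("p", class_="hello")

print(nested_texts)
print(len(all_hello))
```

Note that find_all() searches nested tags by default, so a search that returns nothing usually points at the filter (for example the class name) or at the HTML that was actually downloaded, rather than at the nesting.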

Hope that helps

Answered By: Mario Kirov

Try this:

data = []
for nested_soup in soup.find_all('xyz'):
    data = data + nested_soup.find_all('abc')

Maybe you can turn it into a lambda and make it cool, but this works. Thanks.
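
As one possible compact form of the loop above, a nested list comprehension collects the same tags. The tag names xyz and abc are the placeholders from the answer, and SAMPLE_HTML is invented for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical markup using the placeholder tag names from the answer
SAMPLE_HTML = "<xyz><abc>1</abc><abc>2</abc></xyz><xyz><abc>3</abc></xyz><abc>4</abc>"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# For every outer xyz tag, gather each abc tag nested inside it
data = [inner
        for outer in soup.find_all("xyz")
        for inner in outer.find_all("abc")]

texts = [tag.get_text() for tag in data]
print(texts)
```

The last abc tag is left out because it is not inside any xyz tag.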

Answered By: Maifee Ul Asad

You can find all <p> tags using regular expressions (the re module).
Note that r.text is a string containing the whole HTML of the site (r.content holds the same data as raw bytes).

For example:

import re
import requests

# url and headers are assumed to be defined elsewhere
r = requests.get(url, headers=headers)
# [^>]* allows attributes such as class="hello"; re.DOTALL lets a match span several lines
p_tags = re.findall(r'<p\b[^>]*>.*?</p>', r.text, flags=re.DOTALL)

This should get you all the <p> tags, whether they are nested or not. If you want the <a> tags inside a specific tag, pass that tag's HTML as the second argument instead of r.text.
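
A small offline sketch of the regex idea, run on an inline string instead of a downloaded page. SAMPLE_HTML, P_TAG and A_TAG are hypothetical names for this illustration, and the patterns are one possible choice: they allow attributes on the opening tag and let a match span several lines. Regular expressions are fragile on real-world HTML, so treat this as a quick approximation.

```python
import re

# Hypothetical sample markup: one p tag with attributes and a link, one spanning two lines
SAMPLE_HTML = '<div><p class="hello">one <a href="/x">link</a></p>\n<p>two\nlines</p></div>'

# \b keeps <p ...> from also matching tags like <pre>; DOTALL lets . match newlines
P_TAG = re.compile(r"<p\b[^>]*>.*?</p>", re.DOTALL)
A_TAG = re.compile(r"<a\b[^>]*>.*?</a>", re.DOTALL)

p_tags = P_TAG.findall(SAMPLE_HTML)

# Second pass: search inside each matched p tag instead of the whole page
a_tags = [a for p in p_tags for a in A_TAG.findall(p)]

print(p_tags)
print(a_tags)
```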

Alternatively, if you just want the text, you can try this:

from readability import Document #pip install readability-lxml
import requests
r = requests.get(url,headers=headers)
doc = Document(r.content)
simplified_html = doc.summary()

This will give you a more bare-bones form of the site's HTML, which you can then go on to parse.

Answered By: jayee