BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are

Question:

I’m trying to scrape all the inner html from the <p> elements in a web page using BeautifulSoup. There are internal tags, but I don’t care, I just want to get the internal text.

For example, for:

<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>

How can I extract:

Red
Blue
Yellow
Light green

Neither .string nor .contents[0] does what I need. Nor does .extract(), because I don’t want to have to specify the internal tags in advance – I want to deal with any that may occur.

Is there a ‘just get the visible HTML’ type of method in BeautifulSoup?

UPDATE:

On advice, trying:

soup = BeautifulSoup(open("test.html"))
p_tags = soup.findAll('p',text=True)
for i, p_tag in enumerate(p_tags): 
    print str(i) + p_tag

But that doesn’t help – it prints out:

0Red
1

2Blue
3

4Yellow
5

6Light 
7green
8
Asked By: AP257


Answers:

Short answer: soup.findAll(text=True)

This has already been answered, here on StackOverflow and in the BeautifulSoup documentation.

UPDATE:

To clarify, a working piece of code:

>>> txt = """
... <p>Red</p>
... <p><i>Blue</i></p>
... <p>Yellow</p>
... <p>Light <b>green</b></p>
... """
>>> import BeautifulSoup
>>> BeautifulSoup.__version__
'3.0.7a'
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> for node in soup.findAll('p'):
...     print ''.join(node.findAll(text=True))

Red
Blue
Yellow
Light green
Answered By: taleinat

The accepted answer is great but it is 6 years old now, so here’s the current Beautiful Soup 4 version of this answer:

>>> txt = """
... <p>Red</p>
... <p><i>Blue</i></p>
... <p>Yellow</p>
... <p>Light <b>green</b></p>
... """
>>> from bs4 import BeautifulSoup, __version__
>>> __version__
'4.5.1'
>>> soup = BeautifulSoup(txt, "html.parser")
>>> print("".join(soup.strings))

Red
Blue
Yellow
Light green
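
Note that soup.strings walks the whole document, so the join above also picks up any text outside the <p> elements. A minimal sketch that restricts the join to each paragraph, assuming Beautiful Soup 4 and the same sample with an extra heading added for illustration (the name paragraph_texts is made up here):

```python
from bs4 import BeautifulSoup

# Same sample as above, plus a heading that should not be included.
txt = """
<h1>Colours</h1>
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

soup = BeautifulSoup(txt, "html.parser")

# Join the text nodes of each <p> separately, however deeply they are nested.
paragraph_texts = ["".join(p.strings) for p in soup.find_all("p")]

for text in paragraph_texts:
    print(text)
```

Each entry in paragraph_texts holds the text of one paragraph, so the heading is left out.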
Answered By: Jaymon

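First, convert the matched tags to strings using str. Then strip the markup with a regular expression:

import re
# Join the paragraphs first; calling str() on the whole result list would keep its brackets and commas
x = "\n".join(str(p) for p in soup.find_all('p'))
content = re.sub("<.*?>", "", x)

This is called a regex. The pattern <.*?> matches each HTML tag non-greedily, from < to the nearest >, so the substitution removes the tags themselves and leaves the text between them.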
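
As a rough illustration using only the standard library, the same pattern can be applied directly to the sample HTML from the question. The helper name strip_tags is hypothetical, not part of any library:

```python
import re

html = """<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>"""

# Non-greedy match: each substitution consumes a single tag, from "<" to the nearest ">".
TAG = re.compile(r"<.*?>")

def strip_tags(fragment):
    """Remove the tags from an HTML fragment, keeping the text between them."""
    return TAG.sub("", fragment)

# One paragraph per line in this sample, so process it line by line.
lines = [strip_tags(line) for line in html.splitlines()]

for line in lines:
    print(line)
```

This works for simple markup like the sample, but a regex is not an HTML parser: a > inside an attribute value or a comment will break it, so a parser-based approach is safer on real pages.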

Answered By: toyotasupra

Data scraped from a website normally contains tags. To drop the tags and show only the text content, you can use the text attribute.

For example,

    # Beautiful Soup 3 on Python 2
    from BeautifulSoup import BeautifulSoup
    import urllib2

    url = urllib2.urlopen("https://www.python.org")
    content = url.read()
    soup = BeautifulSoup(content)

    title = soup.findAll("title")
    paragraphs = soup.findAll("p")

    print paragraphs[1]       # second paragraph, with tags
    print paragraphs[1].text  # second paragraph, without tags

In this example, I collect all the paragraphs from the Python site and display the second one both with and without tags.

Answered By: Codemaker

I have stumbled upon this very same problem and wanted to share the 2019 version of this solution. Maybe it helps somebody out.

# importing the modules
from bs4 import BeautifulSoup
from urllib.request import urlopen

# setting up your BeautifulSoup Object
webpage = urlopen("https://insertyourwebpage.com")
soup = BeautifulSoup(webpage.read(), features="lxml")
p_tags = soup.find_all('p')

# get_text() already returns a string, so no str() call is needed
for each in p_tags:
    print(each.get_text())

Notice that we loop over the list of tags and call get_text() on each one before printing. get_text() strips the tags, including any nested ones, so only the text is printed.

Also:

  • it is better to use the updated ‘find_all()’ in bs4 than the older findAll()
  • urllib2 was replaced by urllib.request and urllib.error in Python 3

For the HTML in the question, your output should be:

  • Red
  • Blue
  • Yellow
  • Light green
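
If the real page has line breaks or indentation inside its paragraphs, get_text() keeps that whitespace. A small sketch of one way to tidy it up, assuming an inline HTML string in place of a live URL and the built-in html.parser so that lxml is not required; get_text() accepts a separator and a strip flag:

```python
from bs4 import BeautifulSoup

# Inline sample standing in for a downloaded page (an assumption for illustration).
html = """
<p>Red</p>
<p>
    Light <b>green</b>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text fragment and drops the empty ones;
# the separator is then placed between the remaining fragments.
cleaned = [p.get_text(" ", strip=True) for p in soup.find_all("p")]

for text in cleaned:
    print(text)
```

A separator of " " also puts a space before punctuation that directly follows an inline tag, so it is worth checking the result on real pages.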

Hope this helps someone looking for an updated solution.

Answered By: erdin