Using BeautifulSoup to find a HTML tag that contains certain text

Question:

I’m trying to get the elements in an HTML doc that contain the following pattern of text: #S{11}

<h2> this is cool #12345678901 </h2>

So, the previous would match by using:

soup('h2',text=re.compile(r' #S{11}'))

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

I’m able to get all the text that matches (see line above). But I want the parent element of the text to match, so I can use that as a starting point for traversing the document tree. In this case, I’d want all the h2 elements to return, not the text matches.

Ideas?

Asked By: sotangochips

||

Answers:

from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)


for elem in soup(text=re.compile(r' #S{11}')):
    print elem.parent

Prints:

<h2>this is cool #12345678901</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
Answered By: nosklo

BeautifulSoup search operations deliver [a list of] BeautifulSoup.NavigableString objects when text= is used as a criteria as opposed to BeautifulSoup.Tag in other cases. Check the object’s __dict__ to see the attributes made available to you. Of these attributes, parent is favored over previous because of changes in BS4.

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
Answered By: Bruno Bronosky

With bs4 (Beautiful Soup 4), the OP’s attempt works exactly like expected:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>")
soup('h2',text=re.compile(r' #S{11}'))

returns [<h2> this is cool #12345678901 </h2>].

Answered By: T.C. Proctor