How to find tag with particular text with Beautiful Soup?

Question:

How to find text I am looking for in the following HTML (line breaks marked with \n)?

...
<tr>
  <td class="pos">\n
      "Some text:"\n
      <br>\n
      <strong>some value</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Fixed text:"\n
      <br>\n
      <strong>text I am looking for</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Some other text:"\n
      <br>\n
      <strong>some other value</strong>\n
  </td>
</tr>
...

The code below returns the first value found, so I need to filter by "Fixed text:" somehow.

result = soup.find('td', {'class' :'pos'}).find('strong').text

UPDATE: If I use the following code:

title = soup.find('td', text = re.compile(ur'Fixed text:(.*)', re.DOTALL), attrs = {'class': 'pos'})
self.response.out.write(str(title.string).decode('utf8'))

then it returns just Fixed text:, not the <strong>-highlighted text in that same element.

Asked By: LA_

Answers:

You can pass a regular expression to the text parameter of findAll, like so:

import BeautifulSoup
import re

columns = soup.findAll('td', text = re.compile('your regex here'), attrs = {'class' : 'pos'})
Answered By: user130076

This post got me to my answer even though the answer is missing from this post. I felt I should give back.

The challenge here is in the inconsistent behavior of BeautifulSoup.find when searching with and without text.

Note:
If you have BeautifulSoup, you can test this locally via:

curl https://gist.githubusercontent.com/RichardBronosky/4060082/raw/test.py | python

Code: https://gist.github.com/4060082

# Taken from https://gist.github.com/4060082
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen
from pprint import pprint
import re

soup = BeautifulSoup(urlopen('https://gist.githubusercontent.com/RichardBronosky/4060082/raw/test.html').read())
# I'm going to assume that Peter knew that re.compile is meant to cache a computation result for a performance benefit. However, I'm going to do that explicitly here to be very clear.
pattern = re.compile('Fixed text')

# Peter's suggestion here returns a list of what appear to be strings
columns = soup.findAll('td', text=pattern, attrs={'class' : 'pos'})
# ...but it is actually a BeautifulSoup.NavigableString
print type(columns[0])
#>> <class 'BeautifulSoup.NavigableString'>

# you can reach the tag using one of the convenience attributes seen here
pprint(columns[0].__dict__)
#>> {'next': <br />,
#>>  'nextSibling': <br />,
#>>  'parent': <td class="pos">\n
#>>       "Fixed text:"\n
#>>       <br />\n
#>>       <strong>text I am looking for</strong>\n
#>>   </td>,
#>>  'previous': <td class="pos">\n
#>>       "Fixed text:"\n
#>>       <br />\n
#>>       <strong>text I am looking for</strong>\n
#>>   </td>,
#>>  'previousSibling': None}

# I feel that 'parent' is safer to use than 'previous' based on http://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names
# So, if you want to find the 'text' in the 'strong' element...
pprint([t.parent.find('strong').text for t in soup.findAll('td', text=pattern, attrs={'class' : 'pos'})])
#>> [u'text I am looking for']

# Here is what we have learned:
print soup.find('strong')
#>> <strong>some value</strong>
print soup.find('strong', text='some value')
#>> u'some value'
print soup.find('strong', text='some value').parent
#>> <strong>some value</strong>
print soup.find('strong', text='some value') == soup.find('strong')
#>> False
print soup.find('strong', text='some value') == soup.find('strong').text
#>> True
print soup.find('strong', text='some value').parent == soup.find('strong')
#>> True

Though it is most certainly too late to help the OP, I hope they will mark this as the answer since it does satisfy all quandaries around finding by text.

Answered By: Bruno Bronosky
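
The same string-then-parent idea can be sketched for Python 3 and bs4. This is only a sketch under stated assumptions: the inline html sample is a cut-down stand-in for the question's markup, and html.parser is chosen just to avoid extra dependencies. It searches for the text node first and then walks up to the enclosing td.

```python
import re
from bs4 import BeautifulSoup

# Cut-down stand-in for the question's markup (an assumption, not the real page).
html = """
<table>
<tr><td class="pos">
  "Some text:"
  <br>
  <strong>some value</strong>
</td></tr>
<tr><td class="pos">
  "Fixed text:"
  <br>
  <strong>text I am looking for</strong>
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching by string alone returns the matching text node (a NavigableString).
label = soup.find(string=re.compile("Fixed text:"))

# Walk up to the enclosing cell, then down to the <strong> inside it.
cell = label.find_parent("td", class_="pos")
result = cell.find("strong").get_text(strip=True)
print(result)
```

Using find_parent rather than .parent keeps the lookup working if the label ends up wrapped in another inline element.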

A solution for finding an anchor tag whose text contains a particular keyword would be the following:

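from bs4 import BeautifulSoup
from urllib.request import urlopen,Request
from urllib.parse import urljoin,urlparse

# Assumes soup is an already parsed BeautifulSoup object and keyword is the search string.
rawLinks=soup.findAll('a',href=True)
for link in rawLinks:
    # Compare case-insensitively against the visible link text.
    innercontent=link.text
    if keyword.lower() in innercontent.lower():
        print(link)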
Answered By: Prasad Giri
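
A self-contained variant of the keyword filter above might look like this. The helper name links_containing and the sample HTML are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup


def links_containing(soup, keyword):
    """Return <a href=...> tags whose visible text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [a for a in soup.find_all("a", href=True) if needle in a.get_text().lower()]


# Hypothetical sample markup; the last anchor has no href and is skipped.
html = '<p><a href="/docs">Read the Docs</a> <a href="/blog">Blog</a> <a>no href docs</a></p>'
soup = BeautifulSoup(html, "html.parser")

matches = links_containing(soup, "docs")
print([a["href"] for a in matches])
```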

With bs4 4.7.1+ you can use the :contains pseudo-class to select the td containing your (filter) search string. You can then use a descendant combinator, in this case, to move to the strong element containing the target text:

from bs4 import BeautifulSoup as bs

html = '''
<tr>
  <td class="pos">\n
      "Some text:"\n
      <br>\n
      <strong>some value</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Fixed text:"\n
      <br>\n
      <strong>text I am looking for</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Some other text:"\n
      <br>\n
      <strong>some other value</strong>\n
  </td>
</tr>'''
soup = bs(html, 'lxml')
print(soup.select_one('td:contains("Fixed text:") strong').text)

soupsieve 2.1.0 onwards:

NEW: In order to avoid conflicts with future CSS specification
changes, non-standard pseudo classes will now start with the :-soup-
prefix. As a consequence, :contains() will now be known as
:-soup-contains(), though for a time the deprecated form of
:contains() will still be allowed with a warning that users should
migrate over to :-soup-contains().

NEW: Added new non-standard pseudo class :-soup-contains-own() which
operates similar to :-soup-contains() except that it only looks at
text nodes directly associated with the currently scoped element and
not its descendants.

Quote from @facelessuser github page.
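
A minimal sketch of the prefixed pseudo-classes described in the quote, assuming soupsieve 2.1.0 or later is installed alongside bs4. The sample markup is a condensed stand-in for the question's HTML.

```python
from bs4 import BeautifulSoup

# Condensed stand-in for the question's markup (an assumption).
html = """<table>
<tr><td class="pos">"Some text:"<br><strong>some value</strong></td></tr>
<tr><td class="pos">"Fixed text:"<br><strong>text I am looking for</strong></td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# :-soup-contains() looks at the element's text including its descendants.
strong = soup.select_one('td.pos:-soup-contains("Fixed text:") > strong')
print(strong.get_text())

# The td matches on descendant text, because the <strong> lives inside it...
with_descendants = soup.select('td:-soup-contains("text I am looking for")')
# ...but not with :-soup-contains-own(), which only checks the td's own text nodes.
own_only = soup.select('td:-soup-contains-own("text I am looking for")')
print(len(with_descendants), len(own_only))
```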

Answered By: QHarr

result = soup.find('strong', text='text I am looking for').text
Answered By: alek vertysh

Since Beautiful Soup 4.4.0, a parameter called string does the work that text used to do in previous versions.

Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the tags whose .string is “Elsie”:

soup.find_all("td", string="Elsie")

For more information about string, have a look at this section: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument
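
One caveat is worth illustrating against the question's markup: a tag's .string is None when the tag has more than one child, so combining string with td would not be expected to match a cell that holds text, a br and a strong. The sketch below uses a condensed sample (an assumption) and reflects my understanding of bs4 behaviour rather than a guarantee for every release.

```python
import re
from bs4 import BeautifulSoup

# Condensed one-row stand-in for the question's markup (an assumption).
html = '<table><tr><td class="pos">"Fixed text:"<br><strong>text I am looking for</strong></td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# The td has three children (text, <br>, <strong>), so its .string is None.
td = soup.find("td")
print(td.string)

# Combining a tag name with string= therefore does not match this td...
cells = soup.find_all("td", string=re.compile("Fixed text"))
print(cells)

# ...but it does match the <strong>, whose only child is the text itself.
strongs = soup.find_all("strong", string="text I am looking for")
print(strongs)
```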

Answered By: Memin

You could solve this with some simple gazpacho parsing:

from gazpacho import Soup

soup = Soup(html)
tds = soup.find("td", {"class": "pos"})
tds[1].find("strong").text

Which will output:

text I am looking for

Answered By: emehex

You can use Beautiful Soup’s CSS selector method.

from bs4 import BeautifulSoup
from bs4.element import Tag
from typing import List

# This will work as of BeautifulSoup 4.9.1.
result: List[Tag] = BeautifulSoup(html_string, 'lxml').select(
    'tr td strong:contains("text I am looking for")'
    )
print(result)

[<strong>text I am looking for</strong>]

Answered By: Zachary Chiodini