How to use lxml to find an element by text?

Question:

Assume we have the following html:

<html>
    <body>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">TEXT B</a>
        <a href="/7445.html">TEXT C</a>
    </body>
</html>

How do I make it find the element "a", which contains "TEXT A"?

So far I’ve got:

import lxml.html

root = lxml.html.document_fromstring(the_html_above)
e = root.find('.//a')

I’ve tried:

e = root.find('.//a[@text="TEXT A"]')

but that didn’t work, as the "a" tags have no attribute "text".

Is there any way I can solve this in a similar fashion to what I’ve tried?

Asked By: user1973386

||

Answers:

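You are very close. Use text()= rather than @text (which indicates an attribute), and call xpath() instead of find(), because find() supports only a limited subset of XPath that does not include text().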

e = root.xpath('.//a[text()="TEXT A"]')

Or, if you know only that the text contains “TEXT A”,

e = root.xpath('.//a[contains(text(),"TEXT A")]')

Or, if you know only that the text starts with “TEXT A”,

e = root.xpath('.//a[starts-with(text(),"TEXT A")]')

See the docs for more on the available string functions.
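
One caveat with the expressions above: text() only looks at the element's own text nodes, so surrounding whitespace or text wrapped in a child tag can make a match fail. The sketch below uses a made-up variation of the question's HTML (the extra whitespace and the <b> tag are assumptions added for illustration) to show how normalize-space() and "." might help in those cases.

```python
import lxml.html as LH

# Hypothetical variation of the HTML from the question
html = '''
<html>
    <body>
        <a href="/1234.html">  TEXT A
        </a>
        <a href="/3243.html"><b>TEXT</b> B</a>
    </body>
</html>'''

root = LH.fromstring(html)

# An exact comparison fails here because of the surrounding whitespace
exact = root.xpath('.//a[text()="TEXT A"]')

# normalize-space() trims and collapses whitespace before comparing
trimmed = root.xpath('.//a[normalize-space(text())="TEXT A"]')

# "." is the full string value of the element, including text in child tags
nested = root.xpath('.//a[contains(., "TEXT B")]')

print(exact)
print([a.get('href') for a in trimmed])
print([a.get('href') for a in nested])
```

Which form fits best depends on how clean the markup is; for the exact HTML in the question, the plain text()= comparison is enough.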


For example,

import lxml.html as LH

text = '''
<html>
    <body>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">TEXT B</a>
        <a href="/7445.html">TEXT C</a>
    </body>
</html>'''

root = LH.fromstring(text)
e = root.xpath('.//a[text()="TEXT A"]')
print(e)

yields

[<Element a at 0xb746d2cc>]
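
Note that xpath() returns a list of matching elements, whereas find() returns a single element or None. As a small sketch of how the result might be used (the names matches and first are just illustrative), you could pick the first match and read its attributes:

```python
import lxml.html as LH

text = '''
<html>
    <body>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">TEXT B</a>
    </body>
</html>'''

root = LH.fromstring(text)
matches = root.xpath('.//a[text()="TEXT A"]')

# The list is empty when nothing matches, so guard before indexing
first = matches[0] if matches else None
if first is not None:
    print(first.get('href'), first.text)
```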
Answered By: unutbu

Another way that looks more straightforward to me:

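import lxml.html

results = []
root = lxml.html.fromstring(the_html_above)
for tag in root.iter():
    # tag.text is None for elements without leading text, so check it first
    if tag.text and "TEXT A" in tag.text:
        results.append(tag)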
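
A possible variation on the loop above, assuming only the "a" tags are of interest: iter() accepts a tag name, and text_content() on lxml.html elements always returns a string (including text from child tags), so no None check is needed. The HTML is repeated here only to keep the sketch self-contained.

```python
import lxml.html

the_html_above = '''
<html>
    <body>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">TEXT B</a>
        <a href="/7445.html">TEXT C</a>
    </body>
</html>'''

root = lxml.html.fromstring(the_html_above)

# Visit only <a> elements and keep those whose text contains "TEXT A"
results = [tag for tag in root.iter('a') if "TEXT A" in tag.text_content()]

print([tag.get('href') for tag in results])
```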
Answered By: ToonAlfrink