How to get the text from a cell after a <br/> tag?

Question:

I’m crawling through a simple, but long HTML chunk, which is similar to this:

<table>
  <tbody>
    <tr>
      <td> Some text </td>
      <td> Some text </td>
    </tr>
    <tr>
      <td> Some text 
        <br/>
           Some more text
      </td>
    </tr>
  </tbody>
</table>

I’m collecting the data with the following little Python code (using lxml):

for element in root.iter():
  if element.tag == 'td':
    print element.text

Some of the texts are divided into two rows, but mostly they fit in a single row. The problem is within the divided rows.

The root element is the ‘table’ tag. That little code prints all the other texts, but not what comes after the ‘br’ tags. If I don’t exclude non-td tags, the code tries to print the text inside the ‘br’ tags, but of course there’s nothing in there, so it prints just an empty line.

However, after this ‘br’ the iteration moves on to the next tag and ignores the data that is still inside the previous ‘td’ tag.

How can I also get the data after those tags?

Edit: It seems that some of the ‘br’ tags are self-closing, but some are left open:

<td> 
     Some text
  <br>
     Some more text
</td>

The element.tail attribute, suggested in the first answer, does not seem to return the data after that open tag.

Edit2: Actually, it works; it was my own mistake. I forgot to mention that the “print element.text” part was wrapped in a try-except, which in the case of the br tag caught an AttributeError, because there’s nothing inside the br tags. I had set the exception handler to just pass and print nothing. I also tried to print the tail inside the same try-except, but that line was never reached because the exception was raised before it.

Asked By: zaplec


Answers:

Because <br/> is a self-closing tag, it does not have any text content. Instead, you need to access its tail content. The tail is the text that directly follows the element’s closing tag, up to the next tag. To access this content in your for loop you will need to use the following:

for element in root.iter():
    element_text = element.text
    element_tail = element.tail

Even if the br tag is written as an unclosed <br>, this approach still works, because etree.HTML parses the content as HTML:

from lxml import etree

content = '''
<table>
  <tbody>
    <tr>
      <td> Some text </td>
      <td> Some text </td>
    </tr>
    <tr>
      <td> Some text 
        <br>
           Some more text
      </td>
    </tr>
  </tbody>
</table>
'''

root = etree.HTML(content)

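for element in root.iter():
    # tail is None or whitespace for most elements, so skip those
    if element.tail and element.tail.strip():
        print(element.tail.strip())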

Output

Some more text
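
To collect the whole content of each cell rather than only the tails, the text and tail values can be combined per td. The following is a minimal sketch, not part of the original answer: the helper name cell_parts and the sample markup are made up for illustration, and it assumes the cells contain only plain text and br tags.

```python
from lxml import etree

# Hypothetical sample markup with both <br/> and <br> variants
html = """
<table>
  <tr>
    <td> Some text </td>
    <td> Some text <br/> Some more text </td>
    <td> Some text <br> Some more text </td>
  </tr>
</table>
"""

root = etree.HTML(html)


def cell_parts(td):
    """Return the non-empty text fragments of one cell, in document order."""
    # td.text is the text before the first child; each child's tail is the
    # text that follows that child. Either can be None, hence the `or ""`.
    # Text nested inside child elements is not collected by this helper.
    fragments = [td.text or ""]
    for child in td:
        fragments.append(child.tail or "")
    return [f.strip() for f in fragments if f.strip()]


cells = [cell_parts(td) for td in root.iter("td")]
for parts in cells:
    print(parts)

# Alternative: itertext() yields every text and tail inside the element,
# including text nested in child tags; split/join collapses the whitespace.
joined = [" ".join("".join(td.itertext()).split()) for td in root.iter("td")]
```

Using `or ""` also avoids wrapping the prints in a broad try-except, which is what hid the tail in the question's second edit.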
Answered By: gtlambert

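For me, the following XPath expression works to extract the text after the br (normalize-space() uses only the first matching node, so it returns the text that follows the first br in the table):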

normalize-space(//table//br/following::text()[1])

Working example is at.
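
With lxml, an expression like this can be evaluated through the xpath() method. The sketch below is an illustration under assumptions: the sample markup is invented, and the second query (following-sibling) is a variation added here to get one result per br, not something from the answer above.

```python
from lxml import etree

# Hypothetical sample markup with one unclosed and one self-closing br
html = """
<table>
  <tr><td> Some text <br> Some more text </td></tr>
  <tr><td> Other text <br/> Other extra text </td></tr>
</table>
"""

root = etree.HTML(html)

# normalize-space() converts only the first node of the node-set to a string,
# so this returns the trimmed text that follows the first br.
first = root.xpath("normalize-space(//table//br/following::text()[1])")
print(first)

# Variation: select the first text node after every br, then trim in Python.
after_each = [
    text.strip()
    for text in root.xpath("//table//br/following-sibling::text()[1]")
]
print(after_each)
```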

Answered By: SIslam

With jQuery, you can target the br element and use .get(index) to fetch the underlying DOM element, then use nextSibling to reach the text node that follows it. The nodeValue property of that node gives you the text.

Answered By: Harshita Jain