lxml .text returns None when string contains tags

Question:

I am traversing a complex XML file with millions of TU nodes and extracting strings from <seg> elements. Whenever a <seg> element contains serialized tags, I get a None object instead of a string.

Code that returns None:

source_segment = ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg').text

Sample content of <seg> element that causes the issue:

<seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>

Expected value of string variable source_segment:

<bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" />

I can't serialize ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg').text because it is a None object. If I serialize only the element, ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg'), I get this:

b'<seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>\n      '

Sample XML content:

<?xml version="1.0" encoding="utf-8"?>
<tmx version="1.4">
  <header creationtool="XXXXXXXX" creationtoolversion="100" o-tmf="XXXXXXXX" datatype="xml" segtype="sentence" adminlang="en-GB" srclang="en-GB" creationdate="XXXXXXXX" creationid="XXXXXXXX">
    <prop type="x-Note:SingleString"></prop>
    <prop type="x-Recognizers">RecognizeAll</prop>
    <prop type="x-IncludesContextContent">True</prop>
    <prop type="x-TMName">XXXXXXXX</prop>
    <prop type="x-TokenizerFlags">DefaultFlags</prop>
    <prop type="x-WordCountFlags">DefaultFlags</prop>
  </header>
  <body>
    <tu creationdate="XXXXXXXX" creationid="XXXXXXXX" changedate="XXXXXXXX" changeid="XXXXXXXX" lastusagedate="XXXXXXXX" usagecount="1">
      <prop type="x-LastUsedBy">XXXXXXXX</prop>
      <prop type="x-Context">0, 0</prop>
      <prop type="x-Origin">TM</prop>
      <prop type="x-ConfirmationLevel">Translated</prop>
      <prop type="x-StructureContext:MultipleString">sdl:cdata</prop>
      <prop type="x-Note:SingleString">XXXXXXXX</prop>
      <tuv xml_lang="en-GB">
        <seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>
      </tuv>
      <tuv xml_lang="lt-LT">
        <seg><bpt i="1" type="14" x="1" />YYYYYYYYYYYYY<ept i="1" /><ph x="4" type="33" /></seg>
      </tuv>
    </tu>
  </body>
</tmx>

How do I extract the string from <seg> element when it contains serialized tags?

Asked By: wilkas

Answers:

You can iterate over the <seg> elements; what you do with each one depends on what you are interested in:

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse('seg.xml')
root = tree.getroot()

def elem_to_string(child):
    print("Your wish as a string", ET.tostring(child).decode())

data = []
for elem in root:
    if elem.tag == "body":
        for child in elem.findall(".//seg"):
            elem_to_string(child)
            for sub_c in child.iter():
                print(sub_c.tag, sub_c.attrib, sub_c.text, sub_c.tail)
                row = sub_c.tag, sub_c.attrib, sub_c.text, sub_c.tail
                data.append(row)
                
df = pd.DataFrame(data)
print(df.to_string())

Output:

     0                                   1     2                  3
0  seg                                  {}  None           \n      
1  bpt  {'i': '1', 'type': '14', 'x': '1'}  None  Coded glass plate
2  ept                          {'i': '1'}  None               None
3   ph            {'x': '4', 'type': '33'}  None               None
4  seg                                  {}  None           \n      
5  bpt  {'i': '1', 'type': '14', 'x': '1'}  None      YYYYYYYYYYYYY
6  ept                          {'i': '1'}  None               None
7   ph            {'x': '4', 'type': '33'}  None               None

Optionally, as a string:

Your wish as a string <seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>
Your wish as a string <seg><bpt i="1" type="14" x="1" />YYYYYYYYYYYYY<ept i="1" /><ph x="4" type="33" /></seg>
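The table above reflects how ElementTree stores mixed content: text that comes before an element's first child is kept in .text, while text that follows a child is kept in that child's .tail. A minimal sketch using the sample <seg> from the question (the variable names are only illustrative):

```python
import xml.etree.ElementTree as ET

# Sample <seg> taken from the question.
seg = ET.fromstring(
    '<seg><bpt i="1" type="14" x="1" />Coded glass plate'
    '<ept i="1" /><ph x="4" type="33" /></seg>'
)

# No text precedes the first child, so .text is None.
text_before_first_child = seg.text
# The visible text follows <bpt>, so it is stored as that child's tail.
text_after_bpt = seg[0].tail
# itertext() walks every text and tail inside <seg>, skipping the tags.
plain_text = "".join(seg.itertext())

print(repr(text_before_first_child), repr(text_after_bpt), repr(plain_text))
```

If only the human-readable text is needed, itertext() is probably enough; keeping the inline tags requires serializing the children.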
Answered By: Hermann12

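The best approach I found is to convert the <seg> element to a string, passing encoding='unicode' so that tostring returns a str directly. This avoids decoding a bytes object and preserves non-ASCII characters. (lxml also accepts encoding=str; the standard library's xml.etree.ElementTree documents 'unicode'.) Then use a regex to strip the enclosing <seg> tags from the resulting string.

import re
import xml.etree.ElementTree as ET

root = ET.parse('seg.xml').getroot()

seg_elem = root.find('body').findall('tu')[0].findall('tuv')[0].find('seg')

# encoding='unicode' returns a str instead of bytes
seg_string = ET.tostring(seg_elem, encoding='unicode')

# Regex that matches everything between <seg> and </seg>
seg_pattern = r'(?<=<seg>).*?(?=</seg>)'
# Strip <seg> tags; re.DOTALL lets the match span line breaks inside a segment
final_string = re.search(seg_pattern, seg_string, re.DOTALL).group()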
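A regex-free variant is also possible, since tostring on a child element includes that child's tail text. The sketch below assumes the standard library's xml.etree.ElementTree on Python 3.8 or later (where attribute order is preserved); inner_xml is a hypothetical helper name, not part of any library:

```python
import xml.etree.ElementTree as ET


def inner_xml(elem):
    """Return the markup inside elem, without elem's own start and end tags."""
    # Text before the first child (None when the element starts with a tag).
    parts = [elem.text or ""]
    for child in elem:
        # Serializing a child also emits its tail, i.e. the text after it.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


# Sample <seg> taken from the question.
seg = ET.fromstring(
    '<seg><bpt i="1" type="14" x="1" />Coded glass plate'
    '<ept i="1" /><ph x="4" type="33" /></seg>'
)
source_segment = inner_xml(seg)
print(source_segment)
```

Because the element's own tags are never serialized, there is nothing to strip afterwards, and the whitespace after </seg> does not end up in the result.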
Answered By: wilkas