lxml .text returns None when string contains tags
Question:
I am traversing a complex XML file with millions of TU nodes and extracting strings from <seg> elements. Whenever a <seg> element contains serialized tags, I get a None object instead of a string.
Code that returns None:
source_segment = ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg').text
Sample content of a <seg> element that causes the issue:
<seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>
Expected value of the string variable source_segment:
<bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" />
I can't serialize ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg').text because it is a None object. If I serialize only ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg'), I get this:
b'<seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>\n '
Sample XML content:
<?xml version="1.0" encoding="utf-8"?>
<tmx version="1.4">
<header creationtool="XXXXXXXX" creationtoolversion="100" o-tmf="XXXXXXXX" datatype="xml" segtype="sentence" adminlang="en-GB" srclang="en-GB" creationdate="XXXXXXXX" creationid="XXXXXXXX">
<prop type="x-Note:SingleString"></prop>
<prop type="x-Recognizers">RecognizeAll</prop>
<prop type="x-IncludesContextContent">True</prop>
<prop type="x-TMName">XXXXXXXX</prop>
<prop type="x-TokenizerFlags">DefaultFlags</prop>
<prop type="x-WordCountFlags">DefaultFlags</prop>
</header>
<body>
<tu creationdate="XXXXXXXX" creationid="XXXXXXXX" changedate="XXXXXXXX" changeid="XXXXXXXX" lastusagedate="XXXXXXXX" usagecount="1">
<prop type="x-LastUsedBy">XXXXXXXX</prop>
<prop type="x-Context">0, 0</prop>
<prop type="x-Origin">TM</prop>
<prop type="x-ConfirmationLevel">Translated</prop>
<prop type="x-StructureContext:MultipleString">sdl:cdata</prop>
<prop type="x-Note:SingleString">XXXXXXXX</prop>
<tuv xml_lang="en-GB">
<seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>
</tuv>
<tuv xml_lang="lt-LT">
<seg><bpt i="1" type="14" x="1" />YYYYYYYYYYYYY<ept i="1" /><ph x="4" type="33" /></seg>
</tuv>
</tu>
</body>
</tmx>
How do I extract the string from a <seg> element when it contains serialized tags?
Answers:
You can iterate over <seg>, depending on what you are interested in:
import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse('seg.xml')
root = tree.getroot()

def elem_to_string(child):
    print("Your wish as a string", ET.tostring(child).decode())

data = []
for elem in root:
    if elem.tag == "body":
        for child in elem.findall(".//seg"):
            elem_to_string(child)
            # iter() yields <seg> itself first, then its descendants
            for sub_c in child.iter():
                print(sub_c.tag, sub_c.attrib, sub_c.text, sub_c.tail)
                row = sub_c.tag, sub_c.attrib, sub_c.text, sub_c.tail
                data.append(row)

df = pd.DataFrame(data)
print(df.to_string())
Output:
     0                                   1     2                  3
0  seg                                  {}  None                 \n
1  bpt  {'i': '1', 'type': '14', 'x': '1'}  None  Coded glass plate
2  ept                          {'i': '1'}  None               None
3   ph            {'x': '4', 'type': '33'}  None               None
4  seg                                  {}  None                 \n
5  bpt  {'i': '1', 'type': '14', 'x': '1'}  None      YYYYYYYYYYYYY
6  ept                          {'i': '1'}  None               None
7   ph            {'x': '4', 'type': '33'}  None               None
Optionally, as a string:
Your wish as a string <seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>
Your wish as a string <seg><bpt i="1" type="14" x="1" />YYYYYYYYYYYYY<ept i="1" /><ph x="4" type="33" /></seg>
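The table above also shows why .text is None: no text precedes the first child <bpt>, so the visible text is stored in that child's .tail instead. If only the human-readable text is wanted (without the inline tags), a minimal sketch using itertext() could look like this. The inline sample string stands in for one <seg> from the question's file; it is an illustration, not the asker's actual data pipeline.

```python
import xml.etree.ElementTree as ET

# One <seg> element copied from the sample in the question.
seg = ET.fromstring(
    '<seg><bpt i="1" type="14" x="1" />Coded glass plate'
    '<ept i="1" /><ph x="4" type="33" /></seg>'
)

# seg.text is None because nothing precedes the first child;
# the text is attached to <bpt> as its tail.
first_child_tail = seg[0].tail

# itertext() yields every text and tail value in document order.
plain_text = "".join(seg.itertext())
print(plain_text)
```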
The best approach I found is to convert the <seg> element to a string, passing encoding="unicode" so that tostring() returns a str directly. This avoids decoding a bytes object and preserves non-ASCII characters. Then use a regex to strip the enclosing <seg> tags from the resulting string. Note that the standard-library ElementTree expects encoding="unicode", while lxml.etree also accepts encoding=str.
import re
import xml.etree.ElementTree as ET

root = ET.parse('seg.xml').getroot()
seg_elem = root.find('body').findall('tu')[0].findall('tuv')[0].find('seg')
# encoding="unicode" makes tostring() return str rather than bytes
seg_string = ET.tostring(seg_elem, encoding="unicode")
# Regex that matches everything between <seg> and </seg>
seg_pattern = '(?<=<seg>).*?(?=</seg>)'
# Strip the <seg> tags
final_string = re.search(seg_pattern, seg_string).group()
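As an alternative to the regex, the inner content can probably be rebuilt from the element itself: the leading .text (if any) followed by each serialized child. This is only a sketch; inner_xml is a hypothetical helper name, and it relies on the standard-library tostring() including each child's tail text in its output.

```python
import xml.etree.ElementTree as ET

def inner_xml(elem):
    """Serialize the content of elem without its own start/end tags.

    Hypothetical helper for illustration, not part of ElementTree.
    """
    parts = [elem.text or ""]
    for child in elem:
        # tostring() on a child also emits that child's tail text
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

seg = ET.fromstring(
    '<seg><bpt i="1" type="14" x="1" />Coded glass plate'
    '<ept i="1" /><ph x="4" type="33" /></seg>'
)
result = inner_xml(seg)
print(result)
```

Unlike the regex, this does not depend on how the <seg> start tag is written (for example, if it carried attributes) and it handles content that spans several lines. For a file with millions of TU nodes, the same helper could be called from an ET.iterparse() loop, clearing each <tu> after it has been processed so the whole tree is not kept in memory.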