Extracting data from an XLIFF file and creating a data frame

Question:

I have an XLIFF file with the following structure.

<?xml version="1.0" encoding="UTF-8"?>
<xliff >
TAG             SOURCE       TARGET
Title           Source text  Target text
Description     Source text  Target text
Summary         Source text  Target text
Relevant        Source text  Target text
From area code  Source text  Target text

I tried building a data frame of all tags and their text with the following code, so that I could then filter for the rows that contain the data I need.

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('583197.xliff')
root = tree.getroot()

all_items = []

# Walk every element in the document
for elem in tree.iter():
    # tag, attrib and text are plain attributes, not methods
    all_items.append([elem.tag, elem.attrib, elem.text])

xmlToDf = pd.DataFrame(all_items, columns=['Tag', 'Attri', 'Text'])

print(xmlToDf.to_string(index=False))

How can I extract specific tags, attributes, and text from an XLIFF file so I can build a data frame?

Asked By: Lola Ro

||

Answers:

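Assuming the file is XLIFF 1.2 (its elements then live in the urn:oasis:names:tc:xliff:document:1.2 namespace), you can look up each trans-unit with namespace-qualified tag names. Try: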

import pandas as pd
import xml.etree.ElementTree as ET

# XLIFF 1.2 namespace in ElementTree's {uri}tag notation
NS = "{urn:oasis:names:tc:xliff:document:1.2}"

tree = ET.parse("your_file.xml")
root = tree.getroot()

data = []
for tu in root.findall(f".//{NS}trans-unit"):
    source = tu.find(f".//{NS}source")
    target = tu.find(f".//{NS}target")
    data.append(
        {
            # keep only the part of resname after the last "::"
            "TAG": tu.attrib["resname"].split("::")[-1],
            "SOURCE": source.text,
            "TARGET": target.text,
        }
    )

df = pd.DataFrame(data)
print(df)

Prints:

              TAG                                                                      SOURCE                                                                                     TARGET
0           title                                                                     Name 1                                                                     Name 1 target language 
1         summary  Lorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.
2        relevant  Lorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.
3     description  Lorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.
4  from_area_code                                                                Lorem Ipsum                                                                           Lorem Ipsum local
5         content  Lorem Ipsum is simply dummy text of the printing and typesetting industry.           Lorem Ipsum is simply dummy text of the printing and typesetting industry local.
6  from_area_code                                                                Lorem Ipsum                                                                           Lorem Ipsum local
7         content  Lorem Ipsum is simply dummy text of the printing and typesetting industry.           Lorem Ipsum is simply dummy text of the printing and typesetting industry local.
Answered By: Andrej Kesely
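
To pull out only specific tags, as the question asks, the same lookup can be wrapped in a small helper. The following is a minimal sketch, not a definitive implementation: the sample document, the "node::title" style of the resname values, the helper name extract_units, and its wanted parameter are all assumptions made for illustration, since the real file's contents are not shown. It uses a prefix-to-URI mapping instead of spelling out the namespace in every path, and findtext so that a missing target yields None rather than raising an error.

```python
import xml.etree.ElementTree as ET

# Assumed sample document; the real file's contents are not shown in the question.
SAMPLE_XLIFF = """<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="example" source-language="en" target-language="fr" datatype="plaintext">
    <body>
      <trans-unit id="1" resname="node::title">
        <source>Name 1</source>
        <target>Name 1 target language</target>
      </trans-unit>
      <trans-unit id="2" resname="node::summary">
        <source>Lorem Ipsum</source>
      </trans-unit>
    </body>
  </file>
</xliff>
"""

# The prefix "x" is arbitrary; only the URI has to match the document.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def extract_units(xml_text, wanted=None):
    """Return one dict per trans-unit, optionally keeping only tags in `wanted`."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    rows = []
    for tu in root.findall(".//x:trans-unit", XLIFF_NS):
        # Hypothetical convention: the tag is the part of resname after "::"
        tag = tu.get("resname", "").split("::")[-1]
        if wanted is not None and tag not in wanted:
            continue
        rows.append(
            {
                "TAG": tag,
                # findtext returns the default (None) when the child is absent
                "SOURCE": tu.findtext("x:source", default=None, namespaces=XLIFF_NS),
                "TARGET": tu.findtext("x:target", default=None, namespaces=XLIFF_NS),
            }
        )
    return rows


rows = extract_units(SAMPLE_XLIFF)
```

The returned list of dicts can be passed straight to pandas, for example pd.DataFrame(extract_units(text, wanted={"title", "summary"})), which would give a frame limited to those two tags.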