Extracting data from an XLIFF file and creating a data frame

Question

I have an XLIFF file with the following structure.

<?xml version="1.0" encoding="UTF-8"?>
<xliff >



TAG             SOURCE       TARGET
Title           Source text  Target text
Description     Source text  Target text
Summary         Source text  Target text
Relevant        Source text  Target text
From area code  Source text  Target text


I tried building a data frame of all tags and their text with the following code, so that I could then filter the rows containing the data I need:
import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('583197.xliff')
root = tree.getroot()

# print(root)
all_items = []

for elem in tree.iter():
    # tag, attrib and text are plain attributes, not methods
    tag = elem.tag
    attri = elem.attrib
    text = elem.text

    store_items = [tag, attri, text]
    all_items.append(store_items)

xmlToDf = pd.DataFrame(all_items, columns=['Tag', 'Attri', 'Text'])

print(xmlToDf.to_string(index=False))

How can I extract specific tags, attributes, and text from an XLIFF file so I can build a data frame?

Answer 1

Try:

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse("your_file.xml")
root = tree.getroot()

data = []
for tu in root.findall(".//{urn:oasis:names:tc:xliff:document:1.2}trans-unit"):
    source = tu.find(".//{urn:oasis:names:tc:xliff:document:1.2}source")
    target = tu.find(".//{urn:oasis:names:tc:xliff:document:1.2}target")
    data.append(
        {
            "TAG": tu.attrib["resname"].split("::")[-1],
            "SOURCE": source.text,
            "TARGET": target.text,
        }
    )

df = pd.DataFrame(data)
print(df)

Prints:

              TAG                                                                      SOURCE                                                                                     TARGET
0           title                                                                     Name 1                                                                     Name 1 target language 
1         summary  Lorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.
2        relevant  Lorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.
3     description  Lorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.
4  from_area_code                                                                Lorem Ipsum                                                                           Lorem Ipsum local
5         content  Lorem Ipsum is simply dummy text of the printing and typesetting industry.           Lorem Ipsum is simply dummy text of the printing and typesetting industry local.
6  from_area_code                                                                Lorem Ipsum                                                                           Lorem Ipsum local
7         content  Lorem Ipsum is simply dummy text of the printing and typesetting industry.           Lorem Ipsum is simply dummy text of the printing and typesetting industry local.

Answered By: Andrej Kesely
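
The loop in the answer assumes that every trans-unit carries a resname attribute and has both a source and a target element; if any of these is missing, it raises a KeyError or AttributeError. Below is a minimal sketch of a more defensive variant, not a definitive implementation. The SAMPLE document, the "node::title" resname value, and the helper name xliff_rows are made up for illustration. Only the namespace URI and the "::" split come from the answer.

```python
import xml.etree.ElementTree as ET

# XLIFF 1.2 namespace, as used in the answer; "x" is an arbitrary local prefix.
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Made-up sample document, for illustration only.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="demo" source-language="en" target-language="de" datatype="plaintext">
    <body>
      <trans-unit id="1" resname="node::title">
        <source>Name 1</source>
        <target>Name 1 target language</target>
      </trans-unit>
      <trans-unit id="2">
        <source>No resname, no target</source>
      </trans-unit>
    </body>
  </file>
</xliff>
"""


def xliff_rows(root):
    """Return one dict per trans-unit, using None for anything missing."""
    rows = []
    for tu in root.findall(".//x:trans-unit", NS):
        resname = tu.get("resname")  # None if the attribute is absent
        rows.append(
            {
                "TAG": resname.split("::")[-1] if resname else None,
                # findtext returns the default when the child element is missing
                "SOURCE": tu.findtext("x:source", default=None, namespaces=NS),
                "TARGET": tu.findtext("x:target", default=None, namespaces=NS),
            }
        )
    return rows


# Parse from bytes so the encoding declaration in the sample is honoured.
rows = xliff_rows(ET.fromstring(SAMPLE.encode("utf-8")))
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) just like the data list in the answer. For a real file, replace ET.fromstring(...) with ET.parse(path).getroot(). Note that this sketch only looks at source and target elements that are direct children of each trans-unit, whereas the answer's ".//" paths also match nested ones.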
