Import data from XML using lxml

Question:

I’m trying to extract data from several thousand XML files and combine it into a single df.

The code I have so far is for a single XML extraction.

from lxml import etree
import pandas as pd

serial = ["S1.xml"]


content = serial.encode('utf-8')
doc = etree.XML(content)

targets = doc.xpath('/reiXmlPrenos')
data = []
for target in targets:
   data.append(target.xpath("./@A")[0])
   data.append(target.xpath("./@z")[0])

   
columns = ['A', 'Z']
pd.DataFrame([data],columns=columns)

The XML file looks like this:

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
  <Qf>255340</Qf>
  <Qp>597451</Qp>
  <CO2>126660</CO2>
  <A>2362.8</A>
  <Ht>0.336</Ht>
  <f0>0.59</f0>
  <z>0.105891</z>
</reiXmlPrenos>

I’d like the final df to look like this:

         A      z
S1.xml   2362  0.105891
S2.xml    ...   ...
...

The error that I’m getting is:

line 16, in <module>
    content = serial.encode('utf-8')

AttributeError: 'list' object has no attribute 'encode'

Can you please point out the error I’m making, and then expand the code so it can load all XML files in the same folder?

Asked By: energyMax


Answers:
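
On the error itself: serial is a list of file names, and encode is a string method, so calling it on the list raises the AttributeError shown in the question. Encoding a single item would not help either, because that only encodes the file name, not the file's content. A small sketch of the difference, assuming a throwaway temporary file that stands in for S1.xml (its content is shortened from the sample for illustration):

```python
import tempfile
from pathlib import Path

serial = ["S1.xml"]

# A list has no encode method, which is what the traceback reports.
try:
    serial.encode("utf-8")
except AttributeError as exc:
    message = str(exc)

# Encoding one list item only yields the file *name* as bytes, not the XML.
name_bytes = serial[0].encode("utf-8")

# The content has to be read from disk; read_bytes returns bytes directly.
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / serial[0]
    path.write_bytes(b"<reiXmlPrenos><A>2362.8</A><z>0.105891</z></reiXmlPrenos>")
    content = path.read_bytes()
```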

from lxml import etree
import pandas as pd

serial = ["tmp.xml", "S2.xml"]
columns = ["file",'A', 'Z']

all_data = []
for item in serial:
    data = []
    data.append(item)
    # Read raw bytes so lxml can honor the encoding declared in the XML header
    with open(item, 'rb') as file:
        content = file.read()

    doc = etree.XML(content)

    # add a predicate to make sure both A and z exist
    targets = doc.xpath('/reiXmlPrenos[A and z]')
    for target in targets:
        data.append(target.xpath("./A")[0].text)
        data.append(target.xpath("./z")[0].text)
        all_data.append(data)

df = pd.DataFrame(all_data,columns=columns)

print(df)

Result

      file       A         Z
0  tmp.xml  2362.8  0.105891
1   S2.xml  2362.8  0.105891
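
The question also asks about loading every XML file in a folder, whereas the list above is hard-coded. A minimal sketch of one way to do that, assuming the files all share the reiXmlPrenos layout from the question; collect_rows, its pattern parameter, and the temporary demo files are hypothetical names introduced only for illustration, and the standard-library fallback is an assumption for machines without lxml:

```python
import tempfile
from pathlib import Path

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, whose parse, getroot
    # and findtext calls behave the same way for this use.
    import xml.etree.ElementTree as etree


def collect_rows(folder, pattern="*.xml"):
    """Return one (file name, A, z) tuple per matching XML file in folder."""
    rows = []
    for path in sorted(Path(folder).glob(pattern)):
        root = etree.parse(str(path)).getroot()
        a, z = root.findtext("A"), root.findtext("z")
        # Skip files that are not reiXmlPrenos documents or lack A or z.
        if root.tag == "reiXmlPrenos" and a is not None and z is not None:
            rows.append((path.name, float(a), float(z)))
    return rows


# Demonstration with two throwaway files that mimic the sample in the question.
SAMPLE = "<reiXmlPrenos><A>2362.8</A><Ht>0.336</Ht><z>0.105891</z></reiXmlPrenos>"
with tempfile.TemporaryDirectory() as folder:
    for name in ("S1.xml", "S2.xml"):
        Path(folder, name).write_text(SAMPLE, encoding="utf-8")
    rows = collect_rows(folder)
```

Passing the result to pd.DataFrame(rows, columns=["file", "A", "z"]).set_index("file") would then give a table indexed by file name, as requested in the question.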
Answered By: LMC

To import data from an XML file using lxml, simply create an lxml.etree.ElementTree instance, and pass it the file name of the XML file. The data will be automatically parsed and stored in the instance:

import lxml.etree
tree = lxml.etree.ElementTree(file='myfile.xml')

To access the data, simply use the instance’s methods and attributes. For example, to get the root element of the XML file, use the getroot() method:

root = tree.getroot()
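
Once the root is available, child values such as A and z from the question's file can be read by tag name. A minimal sketch, assuming a temporary myfile.xml with shortened sample content (both only for illustration) and a standard-library fallback in case lxml is not installed:

```python
import tempfile
from pathlib import Path

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library offers the same calls used here.
    import xml.etree.ElementTree as etree

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "myfile.xml"
    path.write_text(
        "<reiXmlPrenos><A>2362.8</A><z>0.105891</z></reiXmlPrenos>",
        encoding="utf-8",
    )
    tree = etree.ElementTree(file=str(path))
    root = tree.getroot()
    # findtext returns the text of the first matching child, or None.
    values = {tag: root.findtext(tag) for tag in ("A", "z")}
    missing = root.findtext("nope")
```

The values come back as strings, so they would still need float() before any numeric work.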
Answered By: Speezy

Using only Pandas (lxml under the hood):

import pandas as pd

# file S1 same as S2, for demonstration
serial = ["S1.xml", "S2.xml"]
# To save memory, build the DataFrames lazily in a generator, then concatenate them.
df = pd.concat((pd.read_xml(file, xpath='//reiXmlPrenos')[['A', 'z']] for file in serial))
# Adding a column for indexing.
df['serial'] = serial
df = df.set_index('serial')
print(df)

             A         z
serial                  
S1.xml  2362.8  0.105891
S2.xml  2362.8  0.105891
Answered By: Сергей Кох