Python XML Parsing without root v2

Question:

I have 100,000 XML files that look like this

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXml>
  <program>BB</program>
  <nazivStavbe>Test build</nazivStavbe>
  <X>101000</X>
  <Y>462000</Y>
  <QNH>24788</QNH>
  <QNC>9698</QNC>
  <Qf>255340</Qf>
  <Qp>597451</Qp>
  <CO2>126660</CO2>
  <An>1010.7</An>
  <Vc>3980</Vc>
  <A>2362.8</A>
  <Ht>0.336</Ht>
  <f0>0.59</f0>
...
</reiXml>

I want to extract around 10 numbers from each, e.g. An, Vc… but I have a problem since the XML files don’t have a root name.
I looked at other cases on this forum, but I can’t seem to replicate their solutions (e.g. link).

So I basically have two problems: 1) how to read multiple XML files and 2) how to extract certain values from them and put them in one txt file with 100,000 rows 🙁

The final result would be something like:

          An     Vc
XMLfile1 1010.7  3980
XMLfile2 ...     ...
XMLfile3 ...     ...
Asked By: energyMax

||

Answers:

You can try BeautifulSoup to parse the XML files:

xml_doc = """
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXml>
  <program>BB</program>
  <nazivStavbe>Test build</nazivStavbe>
  <X>101000</X>
  <Y>462000</Y>
  <QNH>24788</QNH>
  <QNC>9698</QNC>
  <Qf>255340</Qf>
  <Qp>597451</Qp>
  <CO2>126660</CO2>
  <An>1010.7</An>
  <Vc>3980</Vc>
  <A>2362.8</A>
  <Ht>0.336</Ht>
  <f0>0.59</f0>
</reiXml>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(xml_doc, "xml")

print(soup.An.text)
print(soup.Vc.text)

Prints:

1010.7
3980

EDIT: To create a dataframe:

import pandas as pd
from bs4 import BeautifulSoup

files = ["file1.xml"]  # ... other files

all_data = []
for file in files:
    with open(file, "r") as f_in:
        soup = BeautifulSoup(f_in.read(), "xml")
        all_data.append({"file": file, "An": soup.An.text, "Vc": soup.Vc.text})

df = pd.DataFrame(all_data).set_index("file")
df.index.name = None
print(df)

Prints:

               An    Vc
file1.xml  1010.7  3980
Answered By: Andrej Kesely

This is not answering my own question, but I’ll edit the solution later.

If I use the solution from the previous answer, I get an error:

AttributeError: 'NoneType' object has no attribute 'text'

The values are in the XML file, so I really don’t know what to do…

Answered By: energyMax
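
The failing file is not shown, so the cause is a guess. One common reason for this AttributeError is that a tag is missing from at least one file, so the lookup returns None and .text then fails. Below is a minimal sketch that returns None for a missing tag instead of raising. It uses the standard-library xml.etree.ElementTree rather than bs4. SAMPLE, extract_values and the tag name "Missing" are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question (illustration only)
SAMPLE = """<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXml>
  <program>BB</program>
  <An>1010.7</An>
  <Vc>3980</Vc>
</reiXml>"""


def extract_values(xml_text, tags):
    """Return {tag: text or None} for the first matching element at any depth."""
    # Parse from bytes so the encoding declaration in the prolog is honoured
    root = ET.fromstring(xml_text.encode("utf-8"))
    # ".//tag" searches all descendants; findtext returns None when nothing matches
    return {tag: root.findtext(f".//{tag}") for tag in tags}


values = extract_values(SAMPLE, ["An", "Vc", "Missing"])
missing = [tag for tag, text in values.items() if text is None]
print(values)
print("missing tags:", missing)
```

Running this over the file that raises the error should show which tag is absent. If every tag comes back as None, the element names in that file may differ from the sample, for example in spelling or case.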

Since the bs4 solution fails on my XML files, would it be possible to rewrite this code with lxml?

Answered By: energyMax
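
Here is a rough sketch of how the same loop might look without bs4. It uses the standard-library xml.etree.ElementTree. lxml.etree offers parse, getroot and findtext under the same names, so swapping the import should be close to a drop-in, though that is untested here. The function name xml_dir_to_table, the tab-separated layout, the result.txt name and the demo file contents are all assumptions for illustration.

```python
import csv
import glob
import os
import tempfile
import xml.etree.ElementTree as ET

TAGS = ["An", "Vc"]  # extend with the other tags you need


def xml_dir_to_table(xml_dir, out_path, tags=TAGS):
    """Write one tab-separated row per XML file in xml_dir; return the file count."""
    paths = sorted(glob.glob(os.path.join(xml_dir, "*.xml")))
    with open(out_path, "w", newline="", encoding="utf-8") as f_out:
        writer = csv.writer(f_out, delimiter="\t", lineterminator="\n")
        writer.writerow(["file"] + list(tags))
        for path in paths:
            try:
                root = ET.parse(path).getroot()
            except ET.ParseError:
                # Malformed file: keep the row but leave its values empty
                values = [""] * len(tags)
            else:
                # A missing tag yields "" instead of raising AttributeError
                values = [root.findtext(f".//{tag}", default="") for tag in tags]
            writer.writerow([os.path.basename(path)] + values)
    return len(paths)


# Demo on two throwaway files (contents are made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    for name, an in [("a.xml", "1010.7"), ("b.xml", "5.5")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(f"<reiXml><An>{an}</An><Vc>3980</Vc></reiXml>")
    out_file = os.path.join(tmp, "result.txt")
    n_rows = xml_dir_to_table(tmp, out_file)
    with open(out_file, encoding="utf-8") as f:
        table_text = f.read()

print(table_text)
```

Rows are written one at a time, so memory use stays flat with 100,000 files. Empty cells in the output mark files where a tag was missing or the XML could not be parsed.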