How to parse an XML file to xlsx in Python
Question:
I have an XML file like this (input):
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <obs id="0">
> <dim name="Column1" value="a"/>
> <dim name="Column2" value="b"/>
> </obs>
> <obs id="1">
> <dim name="Column1" value="tr"/>
> <dim name="Column2" value="yu"/>
> </obs>
How can I convert it to an xlsx file?
I would like to have an xlsx file like this:
Column1|Column2
a |b
tr |yu
Thanks a lot.
I’ve tried other XML parsers, but I could not work out a solution.
Answers:
You can use BeautifulSoup to parse the XML document and pandas to save the DataFrame to CSV and/or Excel format:
import pandas as pd
from bs4 import BeautifulSoup

with open("your_file.xml", "r") as f_in:
    # you can ignore the warning or use a different parser, such as `xml` (which requires lxml)
    soup = BeautifulSoup(f_in.read(), "html.parser")

all_data = []
for obs in soup.select("obs"):
    # build one row per <obs>: {name: value} for every <dim> that has both attributes
    d = {}
    for dim in obs.select("dim[name][value]"):
        d[dim["name"]] = dim["value"]
    all_data.append(d)

df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv', index=False)
This prints:
  Column1 Column2
0       a       b
1      tr      yu
and saves data.csv.
Input file was:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<obs id="0">
<dim name="Column1" value="a" />
<dim name="Column2" value="b" />
</obs>
<obs id="1">
<dim name="Column1" value="tr" />
<dim name="Column2" value="yu" />
</obs>
You need well-formed XML with only one root element, like:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<root>
<obs id="0">
<dim name="Column1" value="a"/>
<dim name="Column2" value="b"/>
</obs>
<obs id="1">
<dim name="Column1" value="tr"/>
<dim name="Column2" value="yu"/>
</obs>
</root>
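If the file cannot be edited by hand, the wrapping can probably be done in code before parsing. The following is only a sketch under assumptions: the helper name `parse_fragments` and the wrapper tag `root` are made up for illustration, and it assumes the file holds nothing but an optional XML declaration followed by sibling `<obs>` elements.

```python
import re
import xml.etree.ElementTree as ET

# Sample text mirroring the question's input: several top-level <obs> elements, no single root.
RAW = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<obs id="0">
<dim name="Column1" value="a"/>
<dim name="Column2" value="b"/>
</obs>
<obs id="1">
<dim name="Column1" value="tr"/>
<dim name="Column2" value="yu"/>
</obs>
"""

def parse_fragments(text):
    """Wrap sibling top-level elements in a synthetic <root> and parse the result.

    `parse_fragments` and the `root` tag are hypothetical names chosen for this sketch.
    """
    # The XML declaration is only allowed at the very start of a document,
    # so remove it before putting a wrapper element in front of the content.
    body = re.sub(r"^\s*<\?xml[^>]*\?>", "", text)
    return ET.fromstring(f"<root>{body}</root>")

root = parse_fragments(RAW)
print([obs.get("id") for obs in root])
```

For a file on disk, the text would be read first (for example with `open(path, encoding="utf-8").read()`) and passed to the helper.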
You can then parse the well-formed XML into columns and rows for a pandas DataFrame and write the DataFrame to an Excel sheet with pandas ExcelWriter():
import xml.etree.ElementTree as ET
import pandas as pd
import openpyxl  # not called directly; pandas uses it as the engine for .xlsx files

tree = ET.parse('Excel.xml')
root = tree.getroot()

columns = []
data = []
for elem in root.iter('dim'):
    # collect each distinct column name in order of first appearance
    if elem.get('name') not in columns:
        columns.append(elem.get('name'))

    # assumes every <obs> holds exactly Column1 followed by Column2
    if elem.get('name') == "Column1":
        c1 = elem.get('value')
    else:
        c2 = elem.get('value')
        row = (c1, c2)
        data.append(row)

df = pd.DataFrame(data, columns=columns)
print(df)

with pd.ExcelWriter("Excel.xlsx") as writer:
    df.to_excel(writer)
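The loop above relies on the two column names from the question. A variant that does not hard-code them could build one dictionary per `<obs>` element, roughly as sketched below. The helper names `obs_to_rows` and `write_xlsx` and the inline sample are assumptions for illustration; pandas is imported inside `write_xlsx` so the parsing step only needs the standard library.

```python
import xml.etree.ElementTree as ET

# Inline sample with a single root element, matching the well-formed XML shown earlier.
SAMPLE = """<root>
<obs id="0">
<dim name="Column1" value="a"/>
<dim name="Column2" value="b"/>
</obs>
<obs id="1">
<dim name="Column1" value="tr"/>
<dim name="Column2" value="yu"/>
</obs>
</root>"""

def obs_to_rows(root):
    """Return one {name: value} dict per <obs>, whatever the <dim> names are."""
    return [
        {dim.get("name"): dim.get("value") for dim in obs.iter("dim")}
        for obs in root.iter("obs")
    ]

def write_xlsx(rows, path):
    """Write the rows to an .xlsx file; needs pandas and openpyxl installed."""
    import pandas as pd  # imported lazily, only when a file is actually written
    pd.DataFrame(rows).to_excel(path, index=False)

rows = obs_to_rows(ET.fromstring(SAMPLE))
print(rows)
# write_xlsx(rows, "Excel.xlsx")  # uncomment to create the file; the path is an example
```

Passing `index=False` leaves out the 0, 1, ... index column, which seems closer to the sheet layout asked for in the question. If an `<obs>` lacks one of the names, pandas fills that cell with NaN.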