Python XML/Pandas: How to merge nested XML?
Question:
How can I join two different pieces of information together from this XML file?
# data
xml1 = ('''<?xml version="1.0" encoding="utf-8"?>
<TopologyDefinition xmlns_xsd="http://www.w3.org/2001/XMLSchema" xmlns_xsi="http://www.w3.org/2001/XMLSchema-instance">
<RSkus>
<RSku ID="V1" Deprecated="true" Owner="Unknown" Generation="1">
<Devices>
<Device ID="1" SkuID="Switch" Role="xD" />
</Devices>
<Blades>
<Blade ID="{1-20}" SkuID="SBlade" />
</Blades>
<Interfaces>
<Interface ID="COM" HardwareID="NS1" SlotID="COM1" Type="serial" />
<Interface ID="LINK" HardwareID="TS1" SlotID="UPLINK_1" Type="serial" />
</Interfaces>
<Wires>
<WireGroup Type="network">
<Wire LocationA="NS1" SlotA="{1-20}" LocationB="{1-20}" SlotB="NIC1" />
</WireGroup>
<WireGroup Type="serial">
<Wire LocationA="TS1" SlotA="{7001-7020}" LocationB="{1-20}" SlotB="COM1" />
</WireGroup>
</Wires>
</RSku>
</RSkus>
</TopologyDefinition>
''')
While this is a single case and trivial in the instance below; if I run the below commands on the full file, I get shapes that do not match and therefore cannot be joined so easily.
How can I extract the XML information such that for every row, I get all the RSku
information PLUS its Blade
information. Each xpath contains no information that would let me join it to another xpath so that I may combine the information.
# how to have them joined?
pd.read_xml(xml1, xpath = ".//RSku")
pd.read_xml(xml1, xpath = ".//Blade")
# expected
pd.concat([pd.read_xml(xml1, xpath = ".//RSku"), pd.read_xml(xml1, xpath = ".//Blade")], axis=1)
Answers:
Consider transforming the XML with XSLT by flattening the document with information you need. Specifically, retrieve only Blade attributes using descendant::*
axis and corresponding RSku attributes using the ancestor::*
axis. Python’ lxml
(default parser of pandas.read_xml
) can run XSLT 1.0 scripts.
Below XSLT’s <xsl:for-each>
is used to prefix RSku_
and Blade_
to attribute names since they share same attribute such as ID
. Otherwise template would be much less wordy.
import pandas as pd
xml1 = ...
xsl = ('''<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet rel="nofollow noreferrer">Online XSLT Demo
How can I join two different pieces of information together from this XML file?
# data
xml1 = ('''<?xml version="1.0" encoding="utf-8"?>
<TopologyDefinition xmlns_xsd="http://www.w3.org/2001/XMLSchema" xmlns_xsi="http://www.w3.org/2001/XMLSchema-instance">
<RSkus>
<RSku ID="V1" Deprecated="true" Owner="Unknown" Generation="1">
<Devices>
<Device ID="1" SkuID="Switch" Role="xD" />
</Devices>
<Blades>
<Blade ID="{1-20}" SkuID="SBlade" />
</Blades>
<Interfaces>
<Interface ID="COM" HardwareID="NS1" SlotID="COM1" Type="serial" />
<Interface ID="LINK" HardwareID="TS1" SlotID="UPLINK_1" Type="serial" />
</Interfaces>
<Wires>
<WireGroup Type="network">
<Wire LocationA="NS1" SlotA="{1-20}" LocationB="{1-20}" SlotB="NIC1" />
</WireGroup>
<WireGroup Type="serial">
<Wire LocationA="TS1" SlotA="{7001-7020}" LocationB="{1-20}" SlotB="COM1" />
</WireGroup>
</Wires>
</RSku>
</RSkus>
</TopologyDefinition>
''')
While this is a single case and trivial in the instance below; if I run the below commands on the full file, I get shapes that do not match and therefore cannot be joined so easily.
How can I extract the XML information such that for every row, I get all the RSku
information PLUS its Blade
information. Each xpath contains no information that would let me join it to another xpath so that I may combine the information.
# how to have them joined?
pd.read_xml(xml1, xpath = ".//RSku")
pd.read_xml(xml1, xpath = ".//Blade")
# expected
pd.concat([pd.read_xml(xml1, xpath = ".//RSku"), pd.read_xml(xml1, xpath = ".//Blade")], axis=1)
Consider transforming the XML with XSLT by flattening the document with information you need. Specifically, retrieve only Blade attributes using descendant::*
axis and corresponding RSku attributes using the ancestor::*
axis. Python’ lxml
(default parser of pandas.read_xml
) can run XSLT 1.0 scripts.
Below XSLT’s <xsl:for-each>
is used to prefix RSku_
and Blade_
to attribute names since they share same attribute such as ID
. Otherwise template would be much less wordy.
import pandas as pd
xml1 = ...
xsl = ('''<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet rel="nofollow noreferrer">Online XSLT Demo