Python – Construct DF From Nested XML Response

Question:

What would be the best way to construct a DF from the below nested XML data?

Each "properties" element has three "property" elements nested containing the "name" and "value" of our data. I tried doing two for loops, pandas read_xml option, and a few other pieces but haven’t quite gotten the nested logic figured out. My current approach below is closer, but does not keep the names and values together.

Using Python 3.7+ in Jupyter on windows

Sample XML Data:

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope
    xmlns_soap="http://www.w3.org/2003/05/soap-envelope"
    xmlns_xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns_xsd="http://www.w3.org/2001/XMLSchema"
    xmlns_wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing"
    xmlns_wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"
    xmlns_wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd">
    <env:Header
        xmlns_env="http://www.w3.org/2003/05/soap-envelope">
        <wsa:Action>RetrieveResponse</wsa:Action>
        <wsa:MessageID>urn:uuid:1234</wsa:MessageID>
        <wsa:RelatesTo>urn:uuid:1234</wsa:RelatesTo>
        <wsa:To>http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:To>
        <wsse:Security>
            <wsu:Timestamp wsu_Id="Timestamp-45333">
                <wsu:Created>2022-11-07T17:02:44Z</wsu:Created>
                <wsu:Expires>2022-11-07T17:07:44Z</wsu:Expires>
            </wsu:Timestamp>
        </wsse:Security>
    </env:Header>
    <soap:Body>
        <RetrieveResponseMsg
            xmlns="http://exacttarget.com/wsdl/partnerAPI">
            <OverallStatus>MoreDataAvailable</OverallStatus>
            <RequestID>asdfds455</RequestID>
            <Results xsi_type="DataExtensionObject">
                <PartnerKey xsi_nil="true" />
                <ObjectID xsi_nil="true" />
                <Type>DataExtensionObject</Type>
                <Properties>
                    <Property>
                        <Name>FIELD_NAME</Name>
                        <Value>asdfdfd12</Value>
                    </Property>
                    <Property>
                        <Name>FIELD_NAME_2</Name>
                        <Value>asdf</Value>
                    </Property>
                    <Property>
                        <Name>FIELD_NAME_3</Name>
                        <Value>fasdsa</Value>
                    </Property>
                </Properties>
            </Results>
            <Results xsi_type="DataExtensionObject">
                <PartnerKey xsi_nil="true" />
                <ObjectID xsi_nil="true" />
                <Type>DataExtensionObject</Type>
                <Properties>
                    <Property>
                        <Name>FIELD_NAME</Name>
                        <Value>fasd123</Value>
                    </Property>
                    <Property>
                        <Name>FIELD_NAME_2</Name>
                        <Value>asdfd</Value>
                    </Property>
                    <Property>
                        <Name>FIELD_NAME_3</Name>
                        <Value>a0A4f</Value>
                    </Property>
                </Properties>
            </Results>
            <Results xsi_type="DataExtensionObject">
                <PartnerKey xsi_nil="true" />
                <ObjectID xsi_nil="true" />
                <Type>DataExtensionObject</Type>
                <Properties>
                    <Property>
                        <Name>FIELD_NAME</Name>
                        <Value>0034P00</Value>
                    </Property>
                    <Property>
                        <Name>FIELD_NAME_2</Name>
                        <Value>fasdfs</Value>
                    </Property>
                    <Property>
                        <Name>FIELD_NAME_3</Name>
                        <Value>a0fasd</Value>
                    </Property>
                </Properties>
            </Results>
        </RetrieveResponseMsg>
    </soap:Body>
</soap:Envelope>

What I’ve Attempted So Far:

data_output = []
for el in soup_de.find_all('Property'):
    
    dict_ = {el.find('Name').text:el.find('Value').text}

    data_output.append(dict_)
    

print(len(data_output))
# print(data_output)


testing_de_df = pd.DataFrame(data_output)
    
    
display(testing_de_df.info())
display(testing_de_df.head(25))

Desired Output:

details = {'FIELD_NAME': ['asdfdfd12', 'fasd123', '0034P00'],
           'FIELD_NAME_2': ['asdf', 'asdfd', 'fasdfs'],
           'FIELD_NAME_3': ['fasdsa', 'a0A4f', 'a0fasd']}

desired_output = pd.DataFrame(details)

print(desired_output)
Asked By: Carson Whitley

||

Answers:

Since <Property> sits at a shallow part of the XML, simply call pandas.read_xml narrowing in on that set of nodes while acknowledging the default namespace (http://exacttarget.com/wsdl/partnerAPI):

property_df = pd.read_xml(
    "Input.xml", 
    xpath = ".//rrm:Property",
    namespaces = {"rrm": "http://exacttarget.com/wsdl/partnerAPI"}
)

print(property_df)
#            Name      Value
# 0    FIELD_NAME  asdfdfd12
# 1  FIELD_NAME_2       asdf
# 2  FIELD_NAME_3     fasdsa
# 3    FIELD_NAME    fasd123
# 4  FIELD_NAME_2      asdfd
# 5  FIELD_NAME_3      a0A4f
# 6    FIELD_NAME    0034P00
# 7  FIELD_NAME_2     fasdfs
# 8  FIELD_NAME_3     a0fasd

To delineate by property, consider creating a property group number with groupby().cumcount() and reshaping data wide with pivot_table:

property_wide_df = (
    property_df
      .assign(property_no = lambda x: x.groupby("Name").cumcount().add(1))
      .pivot_table(index="property_no", columns="Name", values="Value", aggfunc="sum")
)

print(property_wide_df)
# Name        FIELD_NAME FIELD_NAME_2 FIELD_NAME_3
# property_no
# 1            asdfdfd12         asdf       fasdsa
# 2              fasd123        asdfd        a0A4f
# 3              0034P00       fasdfs       a0fasd
Answered By: Parfait
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.