Use Python to remove unneeded elements from XML file

Question:

I’m writing a Python program against an API that doesn’t seem to filter users by whether they are active. When I ask the API for a list of active users, I get a much longer XML document like the one below, and it still includes users whose <active> tag is false.

<ArrayOfuser xmlns="WebsiteWhereDataComesFrom.com" xmlns_i="http://www.w3.org/2001/XMLSchema-instance">
    <user>
        <active>false</active>
        <datelastlogin>2/3/2014 10:21:13 PM</datelastlogin>
        <dept>0</dept>
        <email/>
        <firstname>userfirstname</firstname>
        <lastname>userlastname</lastname>
        <lastupdated/>
        <lastupdatedby/>
        <loginemail>userloginemail</loginemail>
        <phone1/>
        <phone2/>
        <rep>userinitials</rep>
    </user>
    <user>
        <active>true</active>
        <datelastlogin>8/21/2019 9:16:30 PM</datelastlogin>
        <dept>3</dept>
        <email>useremail</email>
        <firstname>userfirstname</firstname>
        <lastname>userlastname</lastname>
        <lastupdated>2/6/2019 11:10:29 PM</lastupdated>
        <lastupdatedby>userinitials</lastupdatedby>
        <loginemail>userloginemail</loginemail>
        <phone1>userphone</phone1>
        <phone2/>
        <rep>userinitials</rep>
    </user>
</ArrayOfuser>

The program eventually needs to return a list of the <rep> values for active users only.

Here is the code I tried as a starting point. I may have overcomplicated this: my plan was to parse users.xml for active users, save a file containing all the XML data about those users, then loop over that file to collect the text of each <rep> tag into a list:

to_remove = ['<active>false</active>']
with open('users.xml') as xmlfile, open('activeusers.xml','w') as newfile:
    for line in xmlfile:
        if not any(remo in line for remo in to_remove):
            newfile.write(line)

In activeusers.xml I expected to see the following:

<ArrayOfuser xmlns="WebsiteWhereDataComesFrom.com" xmlns_i="http://www.w3.org/2001/XMLSchema-instance">
    <user>
        <active>true</active>
        <datelastlogin>8/21/2019 9:16:30 PM</datelastlogin>
        <dept>3</dept>
        <email>useremail</email>
        <firstname>userfirstname</firstname>
        <lastname>userlastname</lastname>
        <lastupdated>2/6/2019 11:10:29 PM</lastupdated>
        <lastupdatedby>userinitials</lastupdatedby>
        <loginemail>userloginemail</loginemail>
        <phone1>userphone</phone1>
        <phone2/>
        <rep>userinitials</rep>
    </user>
</ArrayOfuser>

The result is an identical copy of users.xml. My guess is that the program is reading the file correctly, since it copies everything, but it’s definitely not removing what I need, so my approach must not be correct.
This is just the solution I thought of; the program doesn’t have to create a new file called activeusers.xml. The end goal is a list of the <rep> values for active users only, so if there is a better way to do this I would love to know, because I’m a complete newbie with XML and a novice with Python.

Asked By: Wes Graham

||

Answers:

Since you’re dealing with XML, you should use a proper XML parser rather than filtering the file line by line. Note that in this case you have to deal with namespaces as well, because the root element declares a default namespace.
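To illustrate the namespace point, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml, and the one-user SAMPLE string is made up in the same shape as your document; the prefix xx is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the same shape as the question's document.
SAMPLE = """<ArrayOfuser xmlns="WebsiteWhereDataComesFrom.com">
    <user><active>true</active><rep>CD</rep></user>
</ArrayOfuser>"""

root = ET.fromstring(SAMPLE)

# The default namespace becomes part of every tag name.
print(root.tag)  # {WebsiteWhereDataComesFrom.com}ArrayOfuser

# A bare name does not match a namespaced element.
plain = root.findall("user")

# A prefix mapped to the namespace URI does match.
ns = {"xx": "WebsiteWhereDataComesFrom.com"}
qualified = root.findall("xx:user", ns)

print(len(plain), len(qualified))  # 0 1
```

The bare search finds nothing, which is why the queries need a prefix bound to the namespace URI.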

So try this:

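from lxml import etree

# load your file
doc = etree.parse("users.xml")
# declare namespaces; the prefix 'xx' is arbitrary, the URI must match the file
ns = {'xx': 'WebsiteWhereDataComesFrom.com'}

# locate your deletion targets: every <user> whose <active> is "false"
targets = doc.xpath('//xx:user[xx:active="false"]', namespaces=ns)
for target in targets:
    target.getparent().remove(target)

# save your file; 'w' overwrites, whereas 'a' would append a second
# document to an existing file and leave it malformed
with open("newfile.xml", 'w') as file:
    file.write(etree.tostring(doc).decode())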

This should produce your expected output.
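If the end goal is only the list of <rep> values for active users, the intermediate file may not be needed at all. The following is a rough sketch of that idea, not a definitive implementation: it uses the standard library's xml.etree.ElementTree instead of lxml, the helper name active_reps is hypothetical, and the SAMPLE data (reps AB, CD, EF) is invented to mirror the structure in the question.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns attribute in the question's sample.
NS = {"xx": "WebsiteWhereDataComesFrom.com"}


def active_reps(xml_text):
    """Return the <rep> text of every <user> whose <active> is 'true'."""
    root = ET.fromstring(xml_text)
    reps = []
    for user in root.findall("xx:user", NS):
        active = user.findtext("xx:active", default="", namespaces=NS)
        if active.strip().lower() == "true":
            reps.append(user.findtext("xx:rep", default="", namespaces=NS))
    return reps


# Invented sample data in the same shape as the question's document.
SAMPLE = """<ArrayOfuser xmlns="WebsiteWhereDataComesFrom.com">
    <user><active>false</active><rep>AB</rep></user>
    <user><active>true</active><rep>CD</rep></user>
    <user><active>true</active><rep>EF</rep></user>
</ArrayOfuser>"""

print(active_reps(SAMPLE))  # ['CD', 'EF']
```

To run it against the real data, pass the API response text (or the contents of users.xml) to active_reps in place of SAMPLE.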

Answered By: Jack Fleeting