Extracting XML Attributes
Question:
I have an XML file with several thousand records in it in the form of:
<custs>
<record cust_ID="B123456@Y1996" l_name="Jungle" f_name="George" m_name="OfThe" city="Fairbanks" zip="00010" current="1" />
<record cust_ID="Q975697@Z2000" l_name="Freely" f_name="I" m_name="P" city="Yellow River" zip="03010" current="1" />
<record cust_ID="M7803@J2323" l_name="Jungle" f_name="Jim" m_name="" city="Fallen Arches" zip="07008" current="0" />
</custs>
# (I know it's not normalized. This is just sample data)
How can I convert this into a CSV or tab-delimited file? I know I can hard-code it in Python using re.compile() statements, but there has to be something easier, and more portable across different XML file layouts.
I’ve found a couple threads here about attribs, (Beautifulsoup unable to extract data using attrs=class, Extracting an attribute value with beautifulsoup) and they have gotten me almost there with:
# Python 3.30
#
from bs4 import BeautifulSoup
import fileinput
Input = open("C:/Python/XML Tut/MinGrp.xml", encoding = "utf-8", errors = "backslashreplace")
OutFile = open('C:/Python/XML Tut/MinGrp_Out.ttxt', 'w', encoding = "utf-8", errors = "backslashreplace")
soup = BeautifulSoup(Input, features="xml")
results = soup.findAll('custs', attrs={})
# output = results [0]#[0]
for each_tag in results:
    cust_attrb_value = results[0]
    # print (cust_attrb_value)
    OutFile.write(cust_attrb_value)
OutFile.close()
What’s the next (last?) step?
Answers:
If this data is formatted correctly (that is, it is well-formed XML), you should consider lxml rather than BeautifulSoup. With lxml, you read the file and then apply DOM-style logic to it, including XPath queries. The XPath queries give you the lxml objects that represent each node you're interested in; you can then extract the data you need from them and write it out in whatever format you choose, using something like the csv module.
Specifically, check out the tutorials in the lxml documentation.
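A minimal sketch of the read, query, and write-with-csv approach described in this answer, not a definitive implementation. It assumes lxml is installed and falls back to the standard library's ElementTree otherwise, since both accept the same calls used here. The function name records_to_csv and the inline SAMPLE string are made up for illustration; with lxml you could swap the findall() call for root.xpath("//record") to use full XPath.

```python
import csv
import io

try:
    from lxml import etree  # third-party; assumed to be installed
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same calls below

# Two records from the question, inlined so the sketch is self-contained.
SAMPLE = """<custs>
<record cust_ID="B123456@Y1996" l_name="Jungle" f_name="George" m_name="OfThe" city="Fairbanks" zip="00010" current="1" />
<record cust_ID="M7803@J2323" l_name="Jungle" f_name="Jim" m_name="" city="Fallen Arches" zip="07008" current="0" />
</custs>"""


def records_to_csv(xml_text, out):
    """Write one CSV row per <record> element, one column per attribute."""
    root = etree.fromstring(xml_text)
    # Path expression selecting the <record> children of the root element.
    records = root.findall("record")
    # The header comes from the first record's attribute names.
    columns = list(records[0].attrib.keys())
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for rec in records:
        # get() returns "" for an attribute this record does not have.
        writer.writerow([rec.get(col, "") for col in columns])


buf = io.StringIO()
records_to_csv(SAMPLE, buf)
csv_text = buf.getvalue()
print(csv_text)
```

For a real file you would pass an open file object (opened with newline="") instead of the StringIO buffer, and parse with etree.parse(path).getroot() rather than fromstring().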
I (also) wouldn't use BeautifulSoup for this. I like lxml, but it's an extra install; if you don't want to bother, this is simple enough to do with the standard library's ElementTree module.
Something like:
import xml.etree.ElementTree as ET
import sys

tree = ET.parse('test.xml')
root = tree.getroot()
rs = list(root)  # the <record> children; getchildren() was removed in Python 3.9
keys = rs[0].attrib.keys()  # column names come from the first record
# header row
for a in keys: sys.stdout.write(a); sys.stdout.write('\t')
sys.stdout.write('\n')
# one tab-delimited row per record
for r in rs:
    assert keys == r.attrib.keys()
    for k in keys: sys.stdout.write(r.attrib[k]); sys.stdout.write('\t')
    sys.stdout.write('\n')
will, under Python 3, produce:
zip m_name current city cust_ID l_name f_name
00010 OfThe 1 Fairbanks B123456@Y1996 Jungle George
03010 P 1 Yellow River Q975697@Z2000 Freely I
07008 0 Fallen Arches M7803@J2323 Jungle Jim
Note that with Python 2.7 the order of the attributes will be different, and on Python 3.7 and later, where dicts preserve insertion order, the columns should come out in document order (cust_ID first) rather than the order shown above. If you want them output in a specific order, sort or reorder the list "keys".
The assert checks that all rows have the same attributes. If some elements actually have missing or different attributes, you'll have to remove it and add some code to deal with the differences and supply defaults for missing values. (In your sample data you have an empty value (m_name=""), rather than a missing one. You might want to check that the consumer of this output handles that case OK, or else add some special handling for it.)
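If you do hit records with missing attributes, here is one possible sketch of the "fixed column order plus defaults" idea using csv.DictWriter from the standard library. The COLUMNS list, the "N/A" default, the records_to_tsv name, and the second sample record (with m_name deliberately dropped) are all assumptions made up for illustration, so adjust them to your data.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: the second record has no m_name attribute at all.
SAMPLE = """<custs>
<record cust_ID="B123456@Y1996" l_name="Jungle" f_name="George" m_name="OfThe" city="Fairbanks" zip="00010" current="1" />
<record cust_ID="M7803@J2323" l_name="Jungle" f_name="Jim" city="Fallen Arches" zip="07008" current="0" />
</custs>"""

# Explicit column order, so the output does not depend on attribute order.
COLUMNS = ["cust_ID", "l_name", "f_name", "m_name", "city", "zip", "current"]


def records_to_tsv(xml_text, out, columns=COLUMNS, default="N/A"):
    """Write tab-delimited rows, filling missing attributes with `default`."""
    root = ET.fromstring(xml_text)
    writer = csv.DictWriter(
        out,
        fieldnames=columns,
        delimiter="\t",
        restval=default,          # value written when a record lacks a column
        extrasaction="ignore",    # skip attributes that are not in `columns`
        lineterminator="\n",
    )
    writer.writeheader()
    for rec in root.iter("record"):
        writer.writerow(dict(rec.attrib))


buf = io.StringIO()
records_to_tsv(SAMPLE, buf)
tsv_text = buf.getvalue()
print(tsv_text)
```

An attribute that is present but empty (m_name="") is still written as an empty field, so the consumer can tell it apart from the "N/A" used for a genuinely missing attribute.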
In BeautifulSoup, find the element first:
product = soup.find("product", attrs={})
then read its attributes by indexing the tag like a dictionary, e.g. product["name"].