ElementTree better way to search out nodes (XPATH) using AND and 'parent'
Question:
I need to find ITEM elements that match two criteria, and then get the name attribute of the parent NODE of each match.
Two issues:
- I can’t find a way for XPath to do an ‘and’, for example:
  item = node.findall('./ITEM[@name="toppas_type" and @value="output file list"]')
- Getting the parent NODE info without having to explicitly search and save it in advance of finding the ITEM, for example something like:
  parent_name = item.parent.attrib['name']
This is the code I have now:
node_names = []
for node in tree.findall('NODE[@name="vertices"]/NODE'):
    for item in node.findall('./ITEM[@name="toppas_type"]'):
        if item.attrib['name'] == 'toppas_type' and item.attrib['value'] == 'output file list':
            node_names.append(node.attrib['name'])
…to parse a file like this (snippet only) …
<?xml version="1.0" encoding="ISO-8859-1"?>
<PARAMETERS version="1.6.2" xsi_noNamespaceSchemaLocation="http://open-ms.sourceforge.net/schemas/Param_1_6_2.xsd" >
<NODE name="vertices" description="">
<NODE name="23" description="">
<ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
<ITEM name="toppas_type" value="tool" type="string" description="" required="false" advanced="false" />
<ITEM name="tool_name" value="FileConverter" type="string" description="" required="false" advanced="false" />
<ITEM name="tool_type" value="" type="string" description="" required="false" advanced="false" />
<ITEM name="x_pos" value="-620" type="double" description="" required="false" advanced="false" />
<ITEM name="y_pos" value="-1380" type="double" description="" required="false" advanced="false" />
</NODE>
<NODE name="24" description="">
<ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
<ITEM name="toppas_type" value="output file list" type="string" description="" required="false" advanced="false" />
<ITEM name="x_pos" value="-440" type="double" description="" required="false" advanced="false" />
<ITEM name="y_pos" value="-1480" type="double" description="" required="false" advanced="false" />
<ITEM name="output_folder_name" value="" type="string" description="" required="false" advanced="false" />
</NODE>
<NODE name="33" description="">
<ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
<ITEM name="toppas_type" value="merger" type="string" description="" required="false" advanced="false" />
<ITEM name="x_pos" value="-620" type="double" description="" required="false" advanced="false" />
<ITEM name="y_pos" value="-1540" type="double" description="" required="false" advanced="false" />
<ITEM name="round_based" value="false" type="string" description="" required="false" advanced="false" />
</NODE>
<!--(snip)-->
</NODE>
</PARAMETERS>
UPDATE:
@Mathias Müller
Great suggestion – unfortunately when I try to load the XML file, I get an error. I’m not familiar with lxml…so I’m not sure if I’m using it right.
from lxml import etree
root = etree.DTD("/Users/mikes/Documents/Eclipseworkspace/Bioproximity/Assay-Workflows-Mikes/protein_lfq/protein_lfq-1.1.2.toppas")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/dtd.pxi", line 294, in lxml.etree.DTD.__init__ (src/lxml/lxml.etree.c:187024)
lxml.etree.DTDParseError: Content error in the external subset, line 2, column 1
Unfortunately, ElementTree will not accept that XPath expression in tree.find(xpath) or tree.findall(xpath).
Answers:
Perhaps you do not need nested loops at all; a single XPath expression would suffice. I am not exactly sure what you would like the final result to be, but here is an example with lxml:
>>> import lxml.etree
>>> s = '''<NODE name="vertices" description="">
...
... <NODE name="23" description="">
... <ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
... <ITEM name="toppas_type" value="tool" type="string" description="" required="false" advanced="false" />
... <ITEM name="tool_name" value="FileConverter" type="string" description="" required="false" advanced="false" />
... <ITEM name="tool_type" value="" type="string" description="" required="false" advanced="false" />
... <ITEM name="x_pos" value="-620" type="double" description="" required="false" advanced="false" />
... <ITEM name="y_pos" value="-1380" type="double" description="" required="false" advanced="false" />
... </NODE>
...
... <NODE name="24" description="">
... <ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
... <ITEM name="toppas_type" value="output file list" type="string" description="" required="false" advanced="false" />
... <ITEM name="x_pos" value="-440" type="double" description="" required="false" advanced="false" />
... <ITEM name="y_pos" value="-1480" type="double" description="" required="false" advanced="false" />
... <ITEM name="output_folder_name" value="" type="string" description="" required="false" advanced="false" />
... </NODE>
...
... <NODE name="33" description="">
... <ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
... <ITEM name="toppas_type" value="merger" type="string" description="" required="false" advanced="false" />
... <ITEM name="x_pos" value="-620" type="double" description="" required="false" advanced="false" />
... <ITEM name="y_pos" value="-1540" type="double" description="" required="false" advanced="false" />
... <ITEM name="round_based" value="false" type="string" description="" required="false" advanced="false" />
... </NODE>
... <!--(snip)-->
... </NODE>'''
>>> root = lxml.etree.fromstring(s)
>>> root.xpath('/NODE[@name="vertices"]/NODE/ITEM[@name = "toppas_type" and @value = "output file list"]')
[<Element ITEM at 0x102b5f788>]
And if you actually need the name of the parent element, you can move to the parent node with the ".." step:
>>> root.xpath('/NODE[@name="vertices"]/NODE/ITEM[@name = "toppas_type" and @value = "output file list"]/../@name')
['24']
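If staying with the standard library matters, the same result seems reachable without lxml. The following is a minimal sketch, assuming Python's xml.etree.ElementTree: its limited XPath subset has no "and" operator, but predicates can be chained (which acts as a logical AND), and a trailing ".." step selects the parent. The SAMPLE string is a trimmed, hypothetical stand-in for the .toppas file.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical stand-in for the .toppas file from the question.
SAMPLE = """<PARAMETERS version="1.6.2">
  <NODE name="vertices" description="">
    <NODE name="23" description="">
      <ITEM name="toppas_type" value="tool" type="string" />
    </NODE>
    <NODE name="24" description="">
      <ITEM name="recycle_output" value="false" type="string" />
      <ITEM name="toppas_type" value="output file list" type="string" />
    </NODE>
    <NODE name="33" description="">
      <ITEM name="toppas_type" value="merger" type="string" />
    </NODE>
  </NODE>
</PARAMETERS>"""

root = ET.fromstring(SAMPLE)  # root is the PARAMETERS element

# Two chained predicates stand in for "and"; the final ".." steps up
# from each matching ITEM to its parent NODE.
path = (
    'NODE[@name="vertices"]/NODE'
    '/ITEM[@name="toppas_type"][@value="output file list"]/..'
)
node_names = [node.attrib['name'] for node in root.findall(path)]
print(node_names)
```

Note that ElementTree elements keep no parent pointer, so ".." only works for parents that lie inside the subtree of the element on which findall is called.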
Parsing an XML document from a file
The function etree.DTD is the wrong choice if you would like to parse an XML document from a file. A DTD is not an XML document. Here is how you can do it with lxml:
>>> import lxml.etree
>>> root = lxml.etree.parse("example.xml")
>>> root
<lxml.etree._ElementTree object at 0x106593b00>
Second Update
If the outermost element is PARAMETERS, you need to search like this:
>>> root.xpath('/PARAMETERS/NODE[@name="vertices"]/NODE/ITEM[@name = "toppas_type" and @value = "output file list"]')
[<Element ITEM at 0x106593e18>]
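Putting the file-parsing step and the PARAMETERS-rooted query together, a rough end-to-end sketch might look like the following. It assumes lxml is installed; the XML bytes and the temporary file are hypothetical stand-ins for the real .toppas file, whose path is not reproduced here.

```python
import os
import tempfile

import lxml.etree

# Hypothetical stand-in for the .toppas file, trimmed to a single inner NODE.
XML = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<PARAMETERS version="1.6.2">
  <NODE name="vertices" description="">
    <NODE name="24" description="">
      <ITEM name="toppas_type" value="output file list" type="string" />
    </NODE>
  </NODE>
</PARAMETERS>
"""

# Write the sample to disk so that parse() is exercised with a real file path.
with tempfile.NamedTemporaryFile(suffix=".toppas", delete=False) as handle:
    handle.write(XML)
    file_path = handle.name

try:
    tree = lxml.etree.parse(file_path)  # an _ElementTree for the whole document
    root_tag = tree.getroot().tag       # tag of the outermost element
    # An absolute path is evaluated from the document root.
    parent_names = tree.xpath(
        '/PARAMETERS/NODE[@name="vertices"]/NODE'
        '/ITEM[@name="toppas_type" and @value="output file list"]/../@name'
    )
finally:
    os.remove(file_path)

print(root_tag, parent_names)
```

Checking the tag of the outermost element first is a cheap way to find out how the absolute path has to start.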
In XPath, everything inside [] is a predicate (a filtering criterion), and predicates are not restricted to attribute filtering.
Without any criteria, the XPath expression:
//NODE/@name
will produce all name attribute values of all NODE nodes.
In your case, you care only about the NODE nodes having specific children, so you have to filter the NODE nodes:
//NODE[‹predicate here›]/@name
Specifically, for NODE nodes having ITEM nodes with the same criteria as per issue №1 of your question, the predicate would be:
ITEM[@name="toppas_type" and @value="output file list"]
i.e. match direct ITEM children with specific values for their name and value attributes.
The full XPath would then be:
//NODE[ITEM[@name="toppas_type" and @value="output file list"]]/@name
Applying this with lxml to your sample XML in a Python REPL:
>>> doc.xpath('//NODE[ITEM[@name="toppas_type" and @value="output file list"]]/@name')
['24']
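The REPL line above does not show how doc was created. A self-contained sketch, assuming lxml is installed and that doc is simply the parsed sample (the SAMPLE string is a trimmed, hypothetical version of it), compares the unfiltered and the filtered expression:

```python
import lxml.etree

# Trimmed, hypothetical version of the sample XML from the question.
SAMPLE = """<PARAMETERS version="1.6.2">
  <NODE name="vertices" description="">
    <NODE name="23" description="">
      <ITEM name="toppas_type" value="tool" type="string" />
    </NODE>
    <NODE name="24" description="">
      <ITEM name="toppas_type" value="output file list" type="string" />
      <ITEM name="x_pos" value="-440" type="double" />
    </NODE>
    <NODE name="33" description="">
      <ITEM name="toppas_type" value="merger" type="string" />
    </NODE>
  </NODE>
</PARAMETERS>"""

doc = lxml.etree.fromstring(SAMPLE)

# Without a predicate: the name of every NODE, at any depth.
all_names = doc.xpath('//NODE/@name')

# With the nested predicate: only NODEs that have a matching ITEM child.
matching = doc.xpath(
    '//NODE[ITEM[@name="toppas_type" and @value="output file list"]]/@name'
)

print(all_names)
print(matching)
```

Because the predicate is attached to NODE rather than ITEM, no parent step is needed: the expression never leaves the NODE it is filtering.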