ElementTree better way to search out nodes (XPATH) using AND and 'parent'

Question:

I need to find the ITEM elements that match two criteria, and then get the name attribute of each match's parent NODE element.

Two issues:

  1. I can’t find a way for XPath to do an ‘and’, for example

    item = node.findall('./ITEM[@name="toppas_type" and @value="output file list"]')
    
  2. Getting the parent NODE info without having to explicitly search for it and save it before finding the ITEM, for example something like

    parent_name = item.parent.attrib['name']
    

This is the code I have now:

node_names = []
for node in tree.findall('NODE[@name="vertices"]/NODE'): 
    for item in node.findall('./ITEM[@name="toppas_type"]'):
        if item.attrib['name'] == 'toppas_type' and item.attrib['value'] == 'output file list':
            node_names.append(node.attrib['name'])

…to parse a file like this (snippet only) …

<?xml version="1.0" encoding="ISO-8859-1"?>
<PARAMETERS version="1.6.2" xsi_noNamespaceSchemaLocation="http://open-ms.sourceforge.net/schemas/Param_1_6_2.xsd" >
    <NODE name="vertices" description="">   
        <NODE name="23" description="">
          <ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
          <ITEM name="toppas_type" value="tool" type="string" description="" required="false" advanced="false" />
          <ITEM name="tool_name" value="FileConverter" type="string" description="" required="false" advanced="false" />
          <ITEM name="tool_type" value="" type="string" description="" required="false" advanced="false" />
          <ITEM name="x_pos" value="-620" type="double" description="" required="false" advanced="false" />
          <ITEM name="y_pos" value="-1380" type="double" description="" required="false" advanced="false" />
        </NODE>

        <NODE name="24" description="">
          <ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
          <ITEM name="toppas_type" value="output file list" type="string" description="" required="false" advanced="false" />
          <ITEM name="x_pos" value="-440" type="double" description="" required="false" advanced="false" />
          <ITEM name="y_pos" value="-1480" type="double" description="" required="false" advanced="false" />
          <ITEM name="output_folder_name" value="" type="string" description="" required="false" advanced="false" />
        </NODE>

        <NODE name="33" description="">
          <ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
          <ITEM name="toppas_type" value="merger" type="string" description="" required="false" advanced="false" />
          <ITEM name="x_pos" value="-620" type="double" description="" required="false" advanced="false" />
          <ITEM name="y_pos" value="-1540" type="double" description="" required="false" advanced="false" />
          <ITEM name="round_based" value="false" type="string" description="" required="false" advanced="false" />
        </NODE>
    <!--(snip)-->
    </NODE>
</PARAMETERS>

UPDATE:

@Mathias Müller

Great suggestion – unfortunately when I try to load the XML file, I get an error. I’m not familiar with lxml…so I’m not sure if I’m using it right.

from lxml import etree
root = etree.DTD("/Users/mikes/Documents/Eclipseworkspace/Bioproximity/Assay-Workflows-Mikes/protein_lfq/protein_lfq-1.1.2.toppas")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/lxml/dtd.pxi", line 294, in lxml.etree.DTD.__init__ (src/lxml/lxml.etree.c:187024)
lxml.etree.DTDParseError: Content error in the external subset, line 2, column 1

Unfortunately, ElementTree will not accept that XPath expression in tree.find(xpath) or tree.findall(xpath).

Asked By: RightmireM


Answers:

Perhaps you do not need nested loops at all; a single XPath expression would suffice. I am not exactly sure what you would like the final result to be, but here is an example with lxml:

>>> import lxml.etree
>>> s = '''<NODE name="vertices" description="">
...
...     <NODE name="23" description="">
...       <ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
...       <ITEM name="toppas_type" value="tool" type="string" description="" required="false" advanced="false" />
...       <ITEM name="tool_name" value="FileConverter" type="string" description="" required="false" advanced="false" />
...       <ITEM name="tool_type" value="" type="string" description="" required="false" advanced="false" />
...       <ITEM name="x_pos" value="-620" type="double" description="" required="false" advanced="false" />
...       <ITEM name="y_pos" value="-1380" type="double" description="" required="false" advanced="false" />
...     </NODE>
...
...     <NODE name="24" description="">
...       <ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
...       <ITEM name="toppas_type" value="output file list" type="string" description="" required="false" advanced="false" />
...       <ITEM name="x_pos" value="-440" type="double" description="" required="false" advanced="false" />
...       <ITEM name="y_pos" value="-1480" type="double" description="" required="false" advanced="false" />
...       <ITEM name="output_folder_name" value="" type="string" description="" required="false" advanced="false" />
...     </NODE>
...
...     <NODE name="33" description="">
...       <ITEM name="recycle_output" value="false" type="string" description="" required="false" advanced="false" />
...       <ITEM name="toppas_type" value="merger" type="string" description="" required="false" advanced="false" />
...       <ITEM name="x_pos" value="-620" type="double" description="" required="false" advanced="false" />
...       <ITEM name="y_pos" value="-1540" type="double" description="" required="false" advanced="false" />
...       <ITEM name="round_based" value="false" type="string" description="" required="false" advanced="false" />
...     </NODE>
... <!--(snip)-->
... </NODE>'''
>>> root = lxml.etree.fromstring(s)
>>> root.xpath('/NODE[@name="vertices"]/NODE/ITEM[@name = "toppas_type" and @value = "output file list"]')
[<Element ITEM at 0x102b5f788>]

And if you actually need the name of the parent element, you can move to the parent node with ..:

>>> root.xpath('/NODE[@name="vertices"]/NODE/ITEM[@name = "toppas_type" and @value = "output file list"]/../@name')
['24']

Parsing an XML document from a file

The function etree.DTD is the wrong choice if you would like to parse an XML document from a file. A DTD is not an XML document. Here is how you can do it with lxml:

>>> import lxml.etree
>>> root = lxml.etree.parse("example.xml")
>>> root
<lxml.etree._ElementTree object at 0x106593b00>
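
Note that parse() returns a tree object that wraps the whole document, not the outermost element. The standard-library ElementTree follows the same parse()/getroot() pattern, so here is a small sketch with xml.etree.ElementTree. It assumes a cut-down version of the file and writes it to a temporary file first so that it runs on its own. The temporary file and the variable names are illustrative only.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Cut-down stand-in for the .toppas file from the question.
XML = """<?xml version="1.0" encoding="ISO-8859-1"?>
<PARAMETERS version="1.6.2">
  <NODE name="vertices" description="">
    <NODE name="24" description="">
      <ITEM name="toppas_type" value="output file list" type="string" />
    </NODE>
  </NODE>
</PARAMETERS>"""

# Write the sample to a temporary file so there is something to parse.
with tempfile.NamedTemporaryFile(
    "w", suffix=".toppas", delete=False, encoding="ISO-8859-1"
) as fh:
    fh.write(XML)
    file_path = fh.name

try:
    tree = ET.parse(file_path)  # ElementTree object for the whole document
    root = tree.getroot()       # the outermost element (PARAMETERS)
finally:
    os.remove(file_path)        # clean up the temporary file

root_tag = root.tag
vertex_names = [n.get("name") for n in root.findall('./NODE[@name="vertices"]/NODE')]
print(root_tag, vertex_names)  # PARAMETERS ['24']
```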

Second Update

If the outermost element is PARAMETERS, you need to search like this:

>>> root.xpath('/PARAMETERS/NODE[@name="vertices"]/NODE/ITEM[@name = "toppas_type" and @value = "output file list"]')
[<Element ITEM at 0x106593e18>]
Answered By: Mathias Müller

In XPath, everything inside [] is a predicate (a filtering criterion), and predicates are not restricted to attribute filtering.

Without any criteria, the XPath expression:

//NODE/@name

will produce all name attribute values of all NODE nodes.

In your case, you care only about the NODE nodes having specific children. So this means you have to filter the NODE nodes:

//NODE[‹predicate here›]/@name

Specifically, to select the NODE nodes that have ITEM children matching the criteria from issue 1 of your question, the predicate would be:

ITEM[@name="toppas_type" and @value="output file list"]

i.e., match direct ITEM children with specific values for their name and value attributes.

The full XPath would then be:

//NODE[ITEM[@name="toppas_type" and @value="output file list"]]/@name

Applying this with lxml to your sample XML in a Python REPL:

>>> doc.xpath('//NODE[ITEM[@name="toppas_type" and @value="output file list"]]/@name')
['24']
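
The same "filter the NODE elements" idea can be approximated without lxml. As far as I know, ElementTree's limited XPath does not support a predicate that itself contains predicates, so this sketch moves the outer filter into Python and keeps only the inner ITEM test as a path. It assumes a trimmed copy of the sample document, and the names XML, wanted and node_names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document from the question.
XML = """<PARAMETERS version="1.6.2">
  <NODE name="vertices" description="">
    <NODE name="23" description="">
      <ITEM name="toppas_type" value="tool" type="string" />
    </NODE>
    <NODE name="24" description="">
      <ITEM name="recycle_output" value="false" type="string" />
      <ITEM name="toppas_type" value="output file list" type="string" />
    </NODE>
    <NODE name="33" description="">
      <ITEM name="toppas_type" value="merger" type="string" />
    </NODE>
  </NODE>
</PARAMETERS>"""

root = ET.fromstring(XML)

# Inner test: a direct ITEM child with both attribute values.
wanted = 'ITEM[@name="toppas_type"][@value="output file list"]'

# Outer filter: keep every NODE (at any depth) that has such a child.
node_names = [
    node.get("name")
    for node in root.iter("NODE")
    if node.find(wanted) is not None
]
print(node_names)  # ['24']
```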
Answered By: tzot