How to best iterate (breadth-first) over an lxml etree using Python

Question:

I’m trying to wrap my head around lxml (new to this) and how I can use it to do what I want to do. I’ve got a well-formed and valid XML file

<root>
  <a>
    <b>Text</b>
    <c>More text</c>
  </a>
  <!-- some comment -->
  <a>
    <d id="10" />
  </a>
</root>

something like this. Now I’d like to visit the children breadth-first, and the best I can come up with is something like this:

for e in xml.getroot()[0].itersiblings():
    print(e.tag, e.attrib)

and then take it from there. However, this gives me all nodes, including comments:

a {}
<built-in function Comment> {}
a {}

How do I skip over comments? Is there a better way to iterate over the direct children of a node?

In general, what are the recommendations to parse an XML tree vs. event-driven pull-parsing using, say, iterparse()?

Asked By: Jens


Answers:

This works for your case. Passing "*" as the tag filter makes iterchildren() yield only elements, so comments are skipped:

for child in doc.getroot().iterchildren("*"):
    print(child.tag, child.attrib)
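
To check this without a file on disk, here is a minimal sketch that parses the XML from the question from an inline string and compares plain child iteration with iterchildren("*"). It assumes lxml is installed; the names XML, all_children and elements_only are only for illustration.

```python
from lxml import etree

# The sample document from the question, inlined so the sketch is self-contained.
XML = b"""<root>
  <a>
    <b>Text</b>
    <c>More text</c>
  </a>
  <!-- some comment -->
  <a>
    <d id="10" />
  </a>
</root>"""

root = etree.fromstring(XML)

# Plain iteration over a node yields every direct child, comments included.
all_children = list(root)

# With the "*" tag filter only element nodes are yielded.
elements_only = [child.tag for child in root.iterchildren("*")]

for child in root.iterchildren("*"):
    print(child.tag, child.attrib)
```

Comparing len(all_children) with len(elements_only) shows that the only node dropped by the filter is the comment.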
Answered By: spiralx

This question was asked over 9 years ago, but I just ran into this issue myself, and I solved it with the following:

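import xml.etree.ElementTree as ET
from collections import deque

xmlfile = ET.parse("file.xml")
root = xmlfile.getroot()

# Queue of nodes still to visit, starting with the root.
visit = deque([root])
while visit:
    # Take the oldest node from the front of the queue (FIFO order).
    curr = visit.popleft()
    print(curr.tag, curr.attrib, curr.text)
    # Append the node's immediate children to the back of the queue.
    visit.extend(list(curr))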

list(node) gives a list of all the immediate children of that node. By appending those children to the back of a queue and repeating the process with whatever is at the front of the queue (removing it at the same time), we end up with a standard breadth-first traversal. Taking nodes from the same end they were added to, as a stack does, would give a depth-first traversal instead.
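
The same traversal can be wrapped in a generator that also skips comments, which ties it back to the original question. This is only a sketch under a few assumptions: the helper name breadth_first is made up for illustration, the XML is inlined instead of read from file.xml, and TreeBuilder(insert_comments=True) needs Python 3.8 or later (it is used here only so that the standard-library parser keeps the comment, as lxml does by default). Comment nodes have a function rather than a string as their tag, so the isinstance check filters them out.

```python
import xml.etree.ElementTree as ET
from collections import deque

XML = """<root>
  <a>
    <b>Text</b>
    <c>More text</c>
  </a>
  <!-- some comment -->
  <a>
    <d id="10" />
  </a>
</root>"""


def breadth_first(root):
    """Yield elements level by level, skipping comments and processing instructions."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # Real elements have a string tag; comments and PIs do not.
        if not isinstance(node.tag, str):
            continue
        yield node
        # Children go to the back of the queue, so siblings come before grandchildren.
        queue.extend(node)


# Keep comments in the tree so that the filter has something to skip.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(XML, parser=parser)

order = [el.tag for el in breadth_first(root)]
print(order)
```

Both a elements are visited before any of their children, which is what distinguishes this from the document-order (depth-first) walk that iter() performs.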

Answered By: pattymills