How do I use xml namespaces with find/findall in lxml?

Question:

I’m trying to parse content in an OpenOffice ODS spreadsheet. The ods format is essentially just a zipfile with a number of documents. The content of the spreadsheet is stored in ‘content.xml’.

import zipfile
from lxml import etree

zf = zipfile.ZipFile('spreadsheet.ods')
root = etree.parse(zf.open('content.xml'))

The content of the spreadsheet is in a cell:

table = root.find('.//{urn:oasis:names:tc:opendocument:xmlns:table:1.0}table')

We can also go straight for the rows:

rows = root.findall('.//{urn:oasis:names:tc:opendocument:xmlns:table:1.0}table-row')

The individual elements know about the namespaces:

>>> table.nsmap['table']
'urn:oasis:names:tc:opendocument:xmlns:table:1.0'

How do I use the namespaces directly in find/findall?

The obvious solution does not work.

Trying to get the rows from the table:

>>> root.findall('.//table:table')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 1792, in lxml.etree._ElementTree.findall (src/lxml/lxml.etree.c:41770)
  File "lxml.etree.pyx", line 1297, in lxml.etree._Element.findall (src/lxml/lxml.etree.c:37027)
  File "/usr/lib/python2.6/dist-packages/lxml/_elementpath.py", line 225, in findall
    return list(iterfind(elem, path))
  File "/usr/lib/python2.6/dist-packages/lxml/_elementpath.py", line 200, in iterfind
    selector = _build_path_iterator(path)
  File "/usr/lib/python2.6/dist-packages/lxml/_elementpath.py", line 184, in _build_path_iterator
    selector.append(ops[token[0]](_next, token))
KeyError: ':'
Asked By: saffsd

||

Answers:

If root.nsmap contains the table namespace prefix then you could:

root.xpath('.//table:table', namespaces=root.nsmap)

findall(path) accepts {namespace}name syntax instead of namespace:name. Therefore path should be preprocessed using namespace dictionary to the {namespace}name form before passing it to findall().

Answered By: jfs

Here’s a way to get all the namespaces in the XML document (and supposing there’s no prefix conflict).

I use this when parsing XML documents where I do know in advance what the namespace URLs are, and only the prefix.

        doc = etree.XML(XML_string)

        # Getting all the name spaces.
        nsmap = {}
        for ns in doc.xpath('//namespace::*'):
            if ns[0]: # Removes the None namespace, neither needed nor supported.
                nsmap[ns[0]] = ns[1]
        doc.xpath('//prefix:element', namespaces=nsmap)
Answered By: ChrisR

Maybe the first thing to notice is that the namespaces
are defined at Element level, not Document level.

Most often though, all namespaces are declared in the document’s
root element (office:document-content here), which saves us parsing it all to collect inner rel="noreferrer">dict comprehension to filter it out
in a more compact expression.

You have a slightly different syntax for xpath and
ElementPath.


So here's the code you could use to get all your first table's rows
(tested with: lxml=3.4.2) :

import zipfile
from lxml import etree

# Open and parse the document
zf = zipfile.ZipFile('spreadsheet.ods')
tree = etree.parse(zf.open('content.xml'))

# Get the root element
root = tree.getroot()

# get its namespace map, excluding default namespace
nsmap = {k:v for k,v in root.nsmap.iteritems() if k}

# use defined prefixes to access elements
table = tree.find('.//table:table', nsmap)
rows = table.findall('table:table-row', nsmap)

# or, if xpath is needed:
table = tree.xpath('//table:table', namespaces=nsmap)[0]
rows = table.xpath('table:table-row', namespaces=nsmap)
Answered By: RockyRoad

Etree won't find namespaced elements if there are no

tree = etree.fromstring(xml_doc)

# finds nothing:
tree.find('.//ns:root', {'ns': 'foo'})
tree.find('.//{foo}root', {'ns': 'foo'})
tree.find('.//ns:root')
tree.find('.//ns:root')

Sometimes that is the data you are given. So, what can you do when there is no namespace?

My solution: add one.

import lxml.etree as etree

xml_doc = '<ns:root><ns:child></ns:child></ns:root>'
xml_doc_with_ns = '<ROOT >%s</ROOT>' % xml_doc

tree = etree.fromstring(xml_doc_with_ns)

# finds what you're looking for:
tree.find('.//{foo}root')
Answered By: dsummersl