Python: lxml XPath tag name with colon

Question:

I have to parse a feed, but one of the elements (tags) has a colon in its name: <dc:creator>leemore23</dc:creator>

How can I parse it using lxml? So far I have done it this way:

r = requests.get('http://www.site.com/feed/')
foo = (r.content).replace("dc:creator","dc")
tree = lxml.etree.fromstring(foo)
for article_node in tree.xpath('//item'):
    data['dc'] = article_node.xpath('.//dc')[0].text.strip()

But I think there is a better way, something like

data['dc'] = article_node.xpath('.//dc:creator')[0].text.strip()

or

data['dc'] = article_node.xpath('.//dc|creator')[0].text.strip()

that is, without the replace step.

What would you advise?

Asked By: yital9

||

Answers:

The dc: prefix indicates an XML namespace. Use the ElementTree API's namespace support to deal with it rather than stripping the prefix from your input. As it happens, dc usually refers to Dublin Core metadata.

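You need to determine the full namespace URL, then use that URL in your queries. The {namespace}tag notation is understood by the ElementTree-style find(), findall() and iter() methods; it is not valid XPath, so xpath() rejects it:

DCNS = 'http://purl.org/dc/elements/1.1/'
creator = article_node.findall('.//{{{0}}}creator'.format(DCNS))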

Here I used the recommended http://purl.org/dc/elements/1.1/ namespace URL for the Dublin Core prefix.
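
As a minimal sketch of the {namespace}tag notation, the snippet below parses a small made-up feed (the SAMPLE_FEED content is an assumption for illustration, not the asker's real feed) and looks up the creator elements with findall(), one of the ElementTree-style methods that accept this notation.

```python
from lxml import etree

# Hypothetical sample feed, made up for illustration only.
SAMPLE_FEED = b"""<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>First post</title>
      <dc:creator>leemore23</dc:creator>
    </item>
  </channel>
</rss>"""

DCNS = 'http://purl.org/dc/elements/1.1/'

tree = etree.fromstring(SAMPLE_FEED)

# '{namespace-url}tag' is the form accepted by find()/findall()/iter().
creators = tree.findall('.//{{{0}}}creator'.format(DCNS))
creator_names = [el.text.strip() for el in creators]
print(creator_names)
```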

You can normally determine the URL from the .nsmap property; your root element probably has the following .nsmap attribute:

{'dc': 'http://purl.org/dc/elements/1.1/'}
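
To check what a feed actually declares, you can print .nsmap yourself. The sketch below uses a made-up one-item document (SAMPLE is an assumption for illustration) and shows that child elements also see the declarations made on their ancestors.

```python
from lxml import etree

# Hypothetical sample document, made up for illustration only.
SAMPLE = (
    b'<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<channel><item><dc:creator>leemore23</dc:creator></item></channel>'
    b'</rss>'
)

root = etree.fromstring(SAMPLE)
root_nsmap = root.nsmap

# nsmap on a child includes the declarations inherited from its parents.
item = root.find('.//item')
item_nsmap = item.nsmap

print(root_nsmap)
print(item_nsmap)
```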

and thus you can change your code to the following, using findall() for the {namespace}tag form:

creator = article_node.findall('.//{{{0}}}creator'.format(article_node.nsmap['dc']))

This can be simplified further still by passing the nsmap dictionary to the xpath() method as the namespaces keyword, at which point you can use the prefix in your xpath expression:

creator = article_node.xpath('.//dc:creator', namespaces=article_node.nsmap)
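
One caveat when passing .nsmap straight through: if the document declares a default namespace, .nsmap contains a None key, which XPath has no way to express and which xpath() may reject. A defensive sketch, using a made-up Atom-style document (SAMPLE is an assumption for illustration), drops that key first:

```python
from lxml import etree

# Hypothetical sample with a default namespace, made up for illustration only.
SAMPLE = (
    b'<feed xmlns="http://www.w3.org/2005/Atom" '
    b'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<entry><dc:creator>leemore23</dc:creator></entry>'
    b'</feed>'
)

root = etree.fromstring(SAMPLE)

# Keep only the prefixed declarations; the default namespace has key None.
ns = {prefix: uri for prefix, uri in root.nsmap.items() if prefix is not None}

creators = root.xpath('.//dc:creator', namespaces=ns)
names = [el.text.strip() for el in creators]
print(names)
```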
Answered By: Martijn Pieters

The dc: indicates a namespace. When using lxml's xpath method, use the namespaces parameter to search for elements in a namespace.

So, in your case, using the Dublin Core namespace URL supplied by @MartijnPieters,

import requests
import lxml.etree

r = requests.get('http://www.site.com/feed/')
tree = lxml.etree.fromstring(r.content)
ns = {'dc': 'http://purl.org/dc/elements/1.1/'}
data = {}
for article_node in tree.xpath('//item'):
    # The dc: prefix in the expression is resolved through the ns mapping.
    data['dc'] = article_node.xpath('.//dc:creator', namespaces=ns)[0].text.strip()
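
Note that indexing with [0] raises IndexError for any item that has no dc:creator element. If that can happen in your feed, one possible variation (a sketch on a made-up feed, not part of the original answer) is findtext(), which also accepts a namespaces mapping and returns a default when nothing matches:

```python
from lxml import etree

# Hypothetical sample feed, made up for illustration only.
SAMPLE_FEED = b"""<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item><title>Has author</title><dc:creator> leemore23 </dc:creator></item>
    <item><title>No author</title></item>
  </channel>
</rss>"""

ns = {'dc': 'http://purl.org/dc/elements/1.1/'}
tree = etree.fromstring(SAMPLE_FEED)

authors = []
for article_node in tree.xpath('//item'):
    # findtext() returns the default instead of raising when nothing matches.
    creator = article_node.findtext('dc:creator', default='', namespaces=ns)
    authors.append(creator.strip())

print(authors)
```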
Answered By: unutbu