python : lxml xpath tag name with colon

Question

I have to parse a feed, but one of its elements has a colon in the tag name: <dc:creator>leemore23</dc:creator>

How can I parse it using lxml? So far I have done it this way:

r = requests.get('http://www.site.com/feed/')
foo = (r.content).replace("dc:creator","dc")
tree = lxml.etree.fromstring(foo)
for article_node in tree.xpath('//item'):
    data['dc'] = article_node.xpath('.//dc')[0].text.strip()

But I think there must be a better way, something like

data['dc'] = article_node.xpath('.//dc:creator')[0].text.strip()

or

data['dc'] = article_node.xpath('.//dc|creator')[0].text.strip()

that is, without the string replacement.

What would you advise?

Asked By: yital9


Answer 1

The dc: prefix indicates an XML namespace. Use the namespace support in the ElementTree API to deal with it, rather than stripping the prefix from your input. As it happens, dc usually refers to Dublin Core metadata.

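You need to determine the full namespace URL, then use that URL in your queries. Note that the {namespace}tag form is accepted by find(), findall() and iter(), but not by xpath(), which needs a prefix mapping instead:

DCNS = 'http://purl.org/dc/elements/1.1/'
creator = article_node.findall('.//{{{0}}}creator'.format(DCNS))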

Here I used the recommended http://purl.org/dc/elements/1.1/ namespace URL for Dublin Core, which is what the dc prefix conventionally maps to.
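As a minimal sketch of the {namespace}tag form, the snippet below runs against a made-up one-item document (the ITEM bytes are an assumption for illustration, not the asker's feed). It looks the element up with findall() and then checks whether xpath() accepts the same string, without assuming the outcome.

```python
from lxml import etree

DCNS = 'http://purl.org/dc/elements/1.1/'

# Hypothetical item element, invented for this example.
ITEM = (b'<item xmlns:dc="http://purl.org/dc/elements/1.1/">'
        b'<dc:creator> leemore23 </dc:creator></item>')
item = etree.fromstring(ITEM)

# The doubled braces escape str.format(); the result is './/{URL}creator'.
clark_path = './/{{{0}}}creator'.format(DCNS)

# findall() understands the {namespace}tag form directly.
creators = item.findall(clark_path)
creator_text = creators[0].text.strip()

# xpath() expects prefixes plus a namespaces mapping, so probe it defensively.
try:
    item.xpath(clark_path)
    xpath_accepts_clark = True
except etree.XPathError:
    xpath_accepts_clark = False
```

If xpath_accepts_clark comes out False on your lxml version, stick to findall() for this notation or switch to a prefix mapping.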

You can normally determine the URL from the .nsmap property; your root element probably has the following .nsmap attribute:

{'dc': 'http://purl.org/dc/elements/1.1/'}

and thus you can look the URL up instead of hard-coding it (using findall(), since xpath() does not accept the {namespace}tag form):

creator = article_node.findall('.//{{{0}}}creator'.format(article_node.nsmap['dc']))
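To see what .nsmap actually holds, here is a small sketch against a hypothetical RSS document (SAMPLE_FEED is invented for illustration). It assumes the feed declares the dc prefix on its root element, as many RSS feeds do; a child element reports the declarations it inherits from its ancestors.

```python
from lxml import etree

# Hypothetical minimal feed, invented for this example.
SAMPLE_FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>First post</title>
      <dc:creator>leemore23</dc:creator>
    </item>
  </channel>
</rss>"""

root = etree.fromstring(SAMPLE_FEED)

# Prefix -> URL mapping declared on the root element.
root_nsmap = root.nsmap

# <item> declares nothing itself but inherits the mapping from <rss>.
item = root.find('.//item')
item_nsmap = item.nsmap

# Look the URL up by prefix rather than hard-coding it.
dc_url = item_nsmap['dc']
creator = item.find('{%s}creator' % dc_url)
creator_text = creator.text.strip()
```

If the feed does not declare a dc prefix at all, item_nsmap['dc'] raises KeyError, so a hard-coded URL may be the safer choice for feeds you do not control.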

This can be simplified further still by passing the nsmap dictionary to the xpath() method as the namespaces keyword, at which point you can use the prefix in your xpath expression:

creator = article_node.xpath('.//dc:creator', namespaces=article_node.nsmap)
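One caveat with passing .nsmap straight through: if the document declares a default namespace (as Atom feeds do), .nsmap contains a None key, and xpath() may reject a mapping with an empty prefix. The sketch below, using a made-up Atom-style document (ATOM_FEED is an assumption for illustration), filters that key out first.

```python
from lxml import etree

# Hypothetical Atom-style feed with a default namespace, invented for this example.
ATOM_FEED = b"""<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <entry>
    <title>First post</title>
    <dc:creator>leemore23</dc:creator>
  </entry>
</feed>"""

root = etree.fromstring(ATOM_FEED)

# The default namespace shows up under the key None.
has_default_ns = None in root.nsmap

# Keep only the prefixed entries; XPath 1.0 has no notion of a default namespace.
xpath_ns = {prefix: url for prefix, url in root.nsmap.items() if prefix is not None}

creators = root.xpath('.//dc:creator', namespaces=xpath_ns)
names = [c.text.strip() for c in creators]
```

For a plain RSS feed with no default namespace the filter is a no-op, so it is harmless to apply unconditionally.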

Answered By: Martijn Pieters

Answer 2

The dc: prefix indicates a namespace. When using lxml's xpath method, use the namespaces parameter to search for elements in a namespace.

So, in your case, using the Dublin Core namespace URL supplied by @MartijnPieters,

import lxml.etree
import requests

data = {}
r = requests.get('http://www.site.com/feed/')
tree = lxml.etree.fromstring(r.content)
# Map the dc prefix used in the XPath expression to its namespace URL
ns = {'dc': 'http://purl.org/dc/elements/1.1/'}
for article_node in tree.xpath('//item'):
    data['dc'] = article_node.xpath('.//dc:creator', namespaces=ns)[0].text.strip()
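The loop above indexes [0] on every item, which raises IndexError for an item that has no dc:creator. A hedged variant that runs without network access is sketched below; the FEED bytes and the extract_creators helper are hypothetical names introduced for illustration.

```python
from lxml import etree

NS = {'dc': 'http://purl.org/dc/elements/1.1/'}

# Hypothetical feed, invented for this example; the second item has no creator.
FEED = b"""<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item><title>With author</title><dc:creator> leemore23 </dc:creator></item>
    <item><title>No author</title></item>
  </channel>
</rss>"""


def extract_creators(xml_bytes):
    """Return one entry per <item>: the creator text, or None if absent."""
    tree = etree.fromstring(xml_bytes)
    creators = []
    for item in tree.xpath('//item'):
        # text() yields the element's text nodes; the list is empty when
        # the item has no dc:creator child.
        matches = item.xpath('./dc:creator/text()', namespaces=NS)
        creators.append(matches[0].strip() if matches else None)
    return creators


result = extract_creators(FEED)
```

With real data, the bytes from r.content would take the place of FEED.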

Answered By: unutbu
