how to either return string or match empty if node/tag not found in xpath (lxml)

Question

I have the following XPath to match authors name in an Amazon page:

//div[@class='pTitle']/span[@class='small itemByline'] | //div[@class='pTitle']/span[not(text())]

The first part of this XPath matches just fine. However, some items on the page have no span after the div with class pTitle, so there’s nothing to match. I’d like to get either '' or some other placeholder, so I know the author really was not found instead of the item just being skipped. I suppose the second part of the XPath is invalid, as it does not work…

For instance, the 3 titles starting with ‘A Ditadura’ should return '' for the author entry using the XPath I’m building, but they don’t. As a result, the above XPath returns 179 items instead of 209.

Target is http://www.amazon.com/wishlist/3MCYFXCFDH4FA/ref=cm_wl_act_print_o?_encoding=UTF8&layout=standard-print&disableNav=1&visitor-view=1&items-per-page=1000

This is part of the code of my Python module https://github.com/caio1982/Amazon-Wishlist (thanks by the way for all the good answers in SO so far, I’ve learned XPath thanks to you guys).

For sake of info, I’m trying this using Firefox’s XPath Checker extension, implementing it with Python (lxml).

It sounds similar to “How do I return '' for an empty node's text() in XPath?”, but I’m not sure.

I suspect the answer may be something around XPath axes and a [notcontains] restriction of some sort?

EDIT1: rephrasing it a bit after Dimitre’s suggestion… is it possible to use Becker’s XPath method with lxml and, if so, do you have a working example?

EDIT2: sample tree and expected results:

    <html>
        <body>
            <h1>Title</h1>
            <p>First Paragraph</p>
            <p>Second paragraph: <span>value</span></p>
            <p>Third paragraph: <span>value</span></p>
            <p>Forth paragraph:</p>
        </body>
    </html>

XPath //p/span returns the Second and the Third paragraph ‘value’ strings accordingly. That’s ok, but I’m looking for 4 results instead of 2, like this:

    None
    value
    value
    None

I know //p/span does not work for this, hence I’m looking for some string-magic, node comparison or conditionals etc.

Asked By: caio1982

Answer 1

You can use an XPath expression like this one:

concat(
//div[@class='pTitle']/span[@class='small itemByline'],
substring('UNKNOWN',
          1 + 7*boolean(//div[@class='pTitle']/span[@class='small itemByline'])
          )
      )

When this XPath expression is evaluated, and if //div[@class='pTitle']/span[@class='small itemByline'] exists, then its string value (concatenated with the empty string) is produced.

When //div[@class='pTitle']/span[@class='small itemByline'] doesn’t exist, then the result is the string 'UNKNOWN' — the empty string is concatenated with substring('UNKNOWN', 1+0).

Here we use the fact that in XPath 1.0 whenever a boolean value is an argument of an arithmetic operator, it is first converted to a number, using the rule that:

   number(true()) = 1

and

   number(false()) = 0
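
The conversion rule can be checked directly from Python. The following is a minimal sketch, assuming lxml is installed; the one-element document is only a placeholder so that there is a context node to evaluate against, and the variable names are made up for the illustration.

```python
from lxml import etree

# Placeholder document: the expressions below do not depend on its content.
doc = etree.fromstring("<root/>")

# Booleans are converted to numbers when used in arithmetic.
as_true = doc.xpath("number(true())")    # 1.0
as_false = doc.xpath("number(false())")  # 0.0

# 'UNKNOWN' has 7 characters, so a start position of 8 is past its end
# and yields the empty string; a start position of 1 yields the whole word.
when_found = doc.xpath("substring('UNKNOWN', 1 + 7 * true())")
when_missing = doc.xpath("substring('UNKNOWN', 1 + 7 * false())")

print(repr(when_found), repr(when_missing))
```

The factor 7 is simply the length of the placeholder string; a different placeholder would need its own length there.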

Update: Here is an XSLT-based verification, using the XML document from EDIT2 by the OP and producing exactly the wanted result. The same XPath expression (only an index is updated) is evaluated 4 times, and each produced value is output on a separate line:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:for-each select="(//node())[not(position() > count(//p))]">
   <xsl:variable name="vPos" select="position()"/>
   <xsl:value-of select=
     "concat((//p)[position() = $vPos]/span,
             substring('UNKNOWN',
                       1 +7*boolean((//p)[position() = $vPos]/span)
                       )
             )
     "/>

     <xsl:text>&#xA;</xsl:text>
  </xsl:for-each>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied on the latest provided XML document:

<html>
    <body>
        <h1>Title</h1>
        <p>First Paragraph</p>
        <p>Second paragraph: 
            <span>value</span>
        </p>
        <p>Third paragraph: 
            <span>value</span>
        </p>
        <p>Forth paragraph:</p>
    </body>
</html>

the XPath expression is evaluated N (4) times and the results of this evaluation are produced — as we see, these are exactly the wanted results:

UNKNOWN
value
value
UNKNOWN

Answered By: Dimitre Novatchev
