Exclude Xpath from other Xpath

Question:

If you have two Xpaths you can join them with the | operator to return both their results in one result set. This essentially gives back the union of the two sets of elements. The example below gives back all divs and all spans on a website:

//div | //span

What I need is the set difference: all elements in the first XPath group that are not in the second XPath group. So far I have seen that there is an except operator, but it only works in XPath 2.0. I need an XPath 1.0 solution. I have seen that the not function might help, but I was not able to make it work.

As an example imagine the following:

<tr>
    <td>1</td>
    <td>2</td>
    <td>3</td>
    <td>4</td>
    <td>5</td>
</tr> 

In this example I would have the Xpath group //tr/td. I would want to exclude <td>1</td> and <td>4</td>. Although there are many ways to solve the problem I am specifically looking for a solution where I can say in an Xpath: "Here is a group of elements and exclude this group of elements from it".

UPDATE:
As was pointed out, the example above did not properly represent what I wanted to achieve, so I am giving another example here.

On a website I collect all HTML div elements with: //div. Here I get 645 div elements. I also have 3 specific XPaths for 3 div elements I want to exclude from the result of 645:

//span[@class="sp"]/p/div[1]
//html/body/div/span/div
//p[@id="para"]/div/p/div

So what I want is a single XPath that returns all divs on the webpage, excluding the ones specified by the other XPaths. The result should be 642 div elements.

This has been answered; however, I am leaving the update here for future readers.

Asked By: CaptainCsaba

||

Answers:

You can combine the not() function with a logical and here (chained predicates act as an and).
For your specific example you can use the following XPath:

"//tr/td[not(text()='1')][not(text()='4')]"
Answered By: Prophet
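
The predicate-chaining expression from the answer above can be tried out with a short Python sketch. It assumes the third-party lxml library is installed; the variable names are made up for illustration, and the string literals are written with single quotes, as XPath 1.0 requires.

```python
from lxml import etree  # assumption: lxml is available (pip install lxml)

# The <tr> row from the question.
row = etree.fromstring(
    "<tr><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td></tr>"
)

# Each predicate filters the result of the previous one,
# so the two not(...) tests are effectively combined with "and".
kept = row.xpath("//tr/td[not(text()='1')][not(text()='4')]")

print([td.text for td in kept])
```

This only works when the cells to exclude can be described by their own content, so it does not generalize to "exclude whatever this other XPath selects".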

One approach is to use the self:: axis and the not() function in a predicate.
For example, with XML like this

<root>
    <tr>
        <td>1</td>
        <td>2</td>
        <td>3</td>
        <td>4</td>
        <td>5</td>
    </tr>    
    <dr>
        <td>1</td>
        <td>4</td>
    </dr>    
</root>

you can use this XPath-1.0 expression:

//tr/td[not(self::*=//dr/td)]

which can be shortened to

//tr/td[not(.=//dr/td)]

The resulting nodeset is as desired

<td>2</td>
<td>3</td>
<td>5</td>

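The XPath expression selects all elements of the first part and, in the predicate, compares each element itself (self::* or .) against the second part. The = comparison is true when the element's string value equals the string value of at least one node in the second part; in that case the element is excluded (not(...)). Note that this compares values, not node identity.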

You can also apply this approach to attribute nodes. In that case you have to use ., because self::* is more specific and only matches elements. So you can replace self::* with ., but not the other way round. (The most general form would be self::node().)
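
As a minimal sketch, the expression can be checked against the sample XML with Python and lxml (a third-party library, assumed to be installed; the variable names are hypothetical):

```python
from lxml import etree  # assumption: lxml is available (pip install lxml)

doc = etree.fromstring(
    "<root>"
    "<tr><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td></tr>"
    "<dr><td>1</td><td>4</td></dr>"
    "</root>"
)

# Keep every td under tr whose string value differs from
# the string value of every td under dr.
kept = doc.xpath("//tr/td[not(. = //dr/td)]")

print([td.text for td in kept])
```

The td elements under tr and under dr are distinct nodes; the exclusion happens only because their text matches. Two different elements with identical text would therefore both be dropped.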

Answered By: zx485

In XPath 2.0+ there is an operator for this: except. If E and F are general expressions returning sets of nodes, then E except F returns all nodes selected by E that are not selected by F.

There’s no convenient way of doing the same thing in XPath 1.0, but the rather cumbersome (and potentially expensive) expression E[count(.|F) != count(F)] is equivalent (though you need to take care with the context in which F is evaluated).
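
As a rough sketch of the count-based idiom, the following Python snippet evaluates it with lxml (a third-party library, assumed to be installed). The tiny document, the id attributes, and the two exclusion paths are invented for illustration and only loosely mirror the question's div example. Both exclusion paths are absolute, so the context in which F is evaluated does not matter here.

```python
from lxml import etree  # assumption: lxml is available (pip install lxml)

# Hypothetical document: four divs, two of which should be excluded.
doc = etree.fromstring(
    "<html><body>"
    '<div id="a"><span><div id="b">same text</div></span></div>'
    '<div id="c">same text</div>'
    '<p id="para"><div id="d">other</div></p>'
    "</body></html>"
)

E = "//div"                                # the full group
F = "//span/div | //p[@id='para']/div"     # the nodes to exclude (b and d)

# A node is in F exactly when adding it to F does not grow the set,
# so count(. | F) != count(F) keeps only the nodes that are NOT in F.
expr = f"{E}[count(. | {F}) != count({F})]"
kept = doc.xpath(expr)

print([div.get("id") for div in kept])
```

Because the count trick tests node identity, div c is kept even though it has the same text as the excluded div b; a comparison with = would have dropped it.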

In many practical cases you can achieve the desired effect with a filter predicate, for example //td[not(ancestor::tr)].

Answered By: Michael Kay