XPath on lxml's iterparse matches elements outside its scope

Question:

I have huge corpora that I am parsing with lxml, so I am using iterparse which makes it easy to read XML on-the-fly. By using iterparse(fh, tag="your_tag") we can efficiently iterate over nodes in large files.

I wish to do some XPath matching for each major tag in the file, in my case alpino_ds. For each alpino_ds node I want to check whether a given XPath matches. I found, however, that the XPath reports a match for an element when it is actually matching something elsewhere in the document: not the alpino_ds element currently being iterated, but a later one.

I am puzzled as to why this happens: in the example below, I would expect only one match (in the last alpino_ds node) but as you can see it matches three times and the matched XPath result is the same item in all three cases (part of the last node)!

from io import BytesIO
import lxml.etree as ET

xml = """<treebank>
<alpino_ds version="1.3" id="WR-P-P-D-0000000006.p.34.s.1">
    <node begin="0" cat="top" end="4" id="0" rel="top">
      <node begin="0" cat="du" end="3" id="1" rel="--">
        <node begin="0" conjtype="neven" end="1" frame="complementizer(root)" id="2" lcat="du" lemma="en" pos="comp" postag="VG(neven)" pt="vg" rel="dlink" root="en" sc="root" sense="en" word="en"/>
        <node begin="1" cat="np" end="3" id="3" rel="nucl">
          <node begin="1" end="2" frame="number(hoofd(sg_num))" id="4" infl="sg_num" lcat="detp" lemma="een" numtype="hoofd" pos="num" positie="vrij" postag="TW(hoofd,vrij)" pt="tw" rel="det" root="één" sense="één" special="hoofd" word="één"/>
          <node begin="2" end="3" frame="noun(de,count,sg)" gen="de" genus="zijd" getal="ev" graad="basis" id="5" lcat="np" lemma="printer" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="hd" root="printer" sense="printer" word="printer"/>
        </node>
      </node>
      <node begin="3" end="4" frame="punct(punt)" id="6" lcat="punct" lemma="." pos="punct" postag="LET()" pt="let" rel="--" root="." sense="." special="punt" word="."/>
    </node>
    <sentence>en één printer .</sentence>
    <comments>
      <comment>Q#WR-P-P-D-0000000006.p.34.s.1|en één printer .|1|1|1.2960516563900006</comment>
    </comments>
  </alpino_ds>
  <alpino_ds version="1.3" id="WR-P-P-D-0000000006.p.34.s.2">
    <node begin="0" cat="top" end="20" id="0" rel="top">
      <node begin="0" cat="smain" end="19" id="1" rel="--">
        <node begin="0" cat="np" end="2" id="2" index="1" rel="su">
          <node begin="0" end="1" frame="determiner(de,nwh,nmod,pro,nparg)" getal="getal" id="3" infl="de" lcat="detp" lemma="die" naamval="stan" pdtype="pron" persoon="3" pos="det" postag="VNW(aanw,pron,stan,vol,3,getal)" pt="vnw" rel="det" root="die" sense="die" status="vol" vwtype="aanw" wh="nwh" word="Die"/>
          <node begin="1" end="2" frame="noun(de,count,sg)" gen="de" genus="zijd" getal="ev" graad="basis" id="4" lcat="np" lemma="printer" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="hd" root="printer" sense="printer" word="printer"/>
        </node>
        <node begin="2" end="3" frame="verb(unacc,sg3,passive)" id="5" infl="sg3" lcat="smain" lemma="worden" pos="verb" postag="WW(pv,tgw,met-t)" pt="ww" pvagr="met-t" pvtijd="tgw" rel="hd" root="word" sc="passive" sense="word" tense="present" word="wordt" wvorm="pv"/>
        <node begin="0" cat="ppart" end="19" id="6" rel="vc">
          <node begin="0" end="2" id="7" index="1" rel="obj1"/>
          <node begin="3" buiging="zonder" end="4" frame="verb(hebben,psp,np_pc_pp(voor))" id="8" infl="psp" lcat="ppart" lemma="gebruiken" pos="verb" positie="vrij" postag="WW(vd,vrij,zonder)" pt="ww" rel="hd" root="gebruik" sc="np_pc_pp(voor)" sense="gebruik-voor" word="gebruikt" wvorm="vd"/>
          <node begin="4" cat="pp" end="19" id="9" rel="pc">
            <node begin="4" end="5" frame="preposition(voor,[aan,door,uit,[in,de,plaats]])" id="10" lcat="pp" lemma="voor" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="voor" sense="voor" vztype="init" word="voor"/>
            <node begin="5" cat="np" end="19" id="11" rel="obj1">
              <node begin="5" end="6" frame="determiner(het,nwh,nmod,pro,nparg,wkpro)" id="12" infl="het" lcat="detp" lemma="het" lwtype="bep" naamval="stan" npagr="evon" pos="det" postag="LID(bep,stan,evon)" pt="lid" rel="det" root="het" sense="het" wh="nwh" word="het"/>
              <node begin="6" end="7" frame="v_noun(intransitive)" getal="mv" graad="basis" id="13" lcat="np" lemma="druk" ntype="soort" pos="verb" postag="N(soort,mv,basis)" pt="n" rel="hd" root="druk" sc="intransitive" sense="druk" special="v_noun" word="drukken"/>
              <node begin="7" cat="pp" end="19" id="14" rel="mod">
                <node begin="7" end="8" frame="preposition(van,[af,uit,vandaan,[af,aan]])" id="15" lcat="pp" lemma="van" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="van" sense="van" vztype="init" word="van"/>
                <node begin="8" cat="np" end="19" id="16" rel="obj1">
                  <node begin="8" end="9" frame="determiner(de)" id="17" infl="de" lcat="detp" lemma="de" lwtype="bep" naamval="stan" npagr="rest" pos="det" postag="LID(bep,stan,rest)" pt="lid" rel="det" root="de" sense="de" word="de"/>
                  <node begin="9" end="10" frame="noun(de,count,sg)" gen="de" genus="zijd" getal="ev" graad="basis" id="18" lcat="np" lemma="tekst" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="hd" root="tekst" sense="tekst" word="tekst"/>
                  <node begin="10" cat="pp" end="19" id="19" rel="mod">
                    <node begin="10" end="11" frame="preposition(van,[af,uit,vandaan,[af,aan]])" id="20" lcat="pp" lemma="van" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="van" sense="van" vztype="init" word="van"/>
                    <node begin="11" cat="conj" end="19" id="21" rel="obj1">
                      <node begin="14" conjtype="neven" end="15" frame="conj(en)" id="22" lcat="vg" lemma="en" pos="vg" postag="VG(neven)" pt="vg" rel="crd" root="en" sense="en" word="en"/>
                      <node begin="11" cat="np" end="19" id="23" rel="cnj">
                        <node begin="11" end="12" frame="modal_adverb" id="24" index="2" lcat="advp" lemma="bijvoorbeeld" pos="adv" postag="BW()" pt="bw" rel="mod" root="bijvoorbeeld" sc="modal" sense="bijvoorbeeld" word="bijvoorbeeld"/>
                        <node begin="12" end="13" frame="determiner(de)" id="25" index="3" infl="de" lcat="detp" lemma="de" lwtype="bep" naamval="stan" npagr="rest" pos="det" postag="LID(bep,stan,rest)" pt="lid" rel="det" root="de" sense="de" word="de"/>
                        <node begin="13" end="14" frame="noun(de,count,sg)" gen="de" genus="zijd" getal="ev" graad="basis" id="26" lcat="np" lemma="naam" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="hd" root="naam" sense="naam" word="naam"/>
                        <node begin="16" cat="pp" end="19" id="27" index="4" rel="mod">
                          <node begin="16" end="17" frame="preposition(op,[af,na])" id="28" lcat="pp" lemma="op" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="op" sense="op" vztype="init" word="op"/>
                          <node begin="17" cat="np" end="19" id="29" rel="obj1">
                            <node begin="17" end="18" frame="determiner(de)" id="30" infl="de" lcat="detp" lemma="de" lwtype="bep" naamval="stan" npagr="rest" pos="det" postag="LID(bep,stan,rest)" pt="lid" rel="det" root="de" sense="de" word="de"/>
                            <node begin="18" end="19" frame="noun(de,count,sg)" gen="de" genus="zijd" getal="ev" graad="basis" id="31" lcat="np" lemma="cd" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="hd" root="cd" sense="cd" word="cd"/>
                          </node>
                        </node>
                      </node>
                      <node begin="11" cat="np" end="19" id="32" rel="cnj">
                        <node begin="11" end="12" id="33" index="2" rel="mod"/>
                        <node begin="12" end="13" id="34" index="3" rel="det"/>
                        <node begin="15" end="16" frame="noun(het,count,pl)" gen="het" getal="mv" graad="basis" id="35" lcat="np" lemma="adresgegevens" ntype="soort" num="pl" pos="noun" postag="N(soort,mv,basis)" pt="n" rel="hd" root="adres_gegeven" sense="adres_gegeven" word="adresgegevens"/>
                        <node begin="16" end="19" id="36" index="4" rel="mod"/>
                      </node>
                    </node>
                  </node>
                </node>
              </node>
            </node>
          </node>
        </node>
      </node>
      <node begin="19" end="20" frame="punct(punt)" id="37" lcat="punct" lemma="." pos="punct" postag="LET()" pt="let" rel="--" root="." sense="." special="punt" word="."/>
    </node>
    <sentence>Die printer wordt gebruikt voor het drukken van de tekst van bijvoorbeeld de naam en adresgegevens op de cd .</sentence>
    <comments>
      <comment>Q#WR-P-P-D-0000000006.p.34.s.2|Die printer wordt gebruikt voor het drukken van de tekst van bijvoorbeeld de naam en adresgegevens op de cd .|1|1|0.11022457209000547</comment>
    </comments>
  </alpino_ds>
  <alpino_ds version="1.3" id="WR-P-P-D-0000000006.p.34.s.3">
    <node begin="0" cat="top" end="25" id="0" rel="top">
      <node begin="15" end="16" frame="punct(komma)" id="1" lcat="punct" lemma="," pos="punct" postag="LET()" pt="let" rel="--" root="," sense="," special="komma" word=","/>
      <node begin="22" end="23" frame="punct(komma)" id="2" lcat="punct" lemma="," pos="punct" postag="LET()" pt="let" rel="--" root="," sense="," special="komma" word=","/>
      <node begin="0" cat="smain" end="25" id="3" rel="--">
        <node begin="0" cat="np" end="2" id="4" rel="su">
          <node begin="0" end="1" frame="determiner(een)" id="5" infl="een" lcat="detp" lemma="een" lwtype="onbep" naamval="stan" npagr="agr" pos="det" postag="LID(onbep,stan,agr)" pt="lid" rel="det" root="een" sense="een" word="Een"/>
          <node begin="1" end="2" frame="noun(het,count,sg)" gen="het" genus="onz" getal="ev" graad="dim" id="6" lcat="np" lemma="robot-arm" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,dim,onz,stan)" pt="n" rel="hd" root="robot_arm_DIM" sense="robot_arm_DIM" word="robot-armpje"/>
        </node>
        <node begin="2" end="3" frame="verb(hebben,sg3,er_pp_sbar(voor))" id="7" infl="sg3" lcat="smain" lemma="zorgen" pos="verb" postag="WW(pv,tgw,met-t)" pt="ww" pvagr="met-t" pvtijd="tgw" rel="hd" root="zorg" sc="er_pp_sbar(voor)" sense="zorg-voor" tense="present" word="zorgt" wvorm="pv"/>
        <node begin="3" cat="pp" end="25" id="8" rel="pc">
          <node begin="3" end="4" frame="er_adverb(voor)" id="9" lcat="pp" lemma="ervoor" pos="pp" postag="BW()" pt="bw" rel="hd" root="ervoor" sense="ervoor" special="er" word="ervoor"/>
          <node begin="4" cat="cp" end="25" id="10" rel="vc">
            <node begin="4" conjtype="onder" end="5" frame="complementizer(dat)" id="11" lcat="cp" lemma="dat" pos="comp" postag="VG(onder)" pt="vg" rel="cmp" root="dat" sc="dat" sense="dat" word="dat"/>
            <node begin="5" cat="conj" end="25" id="12" rel="body">
              <node begin="5" cat="ssub" end="13" id="13" rel="cnj">
                <node begin="5" cat="np" end="7" id="14" index="1" rel="su">
                  <node begin="5" end="6" frame="determiner(de)" id="15" infl="de" lcat="detp" lemma="de" lwtype="bep" naamval="stan" npagr="rest" pos="det" postag="LID(bep,stan,rest)" pt="lid" rel="det" root="de" sense="de" word="de"/>
                  <node begin="6" end="7" frame="noun(de,count,pl)" gen="de" getal="mv" graad="basis" id="16" lcat="np" lemma="brander" ntype="soort" num="pl" pos="noun" postag="N(soort,mv,basis)" pt="n" rel="hd" root="brander" sense="brander" word="branders"/>
                </node>
                <node begin="9" end="10" frame="verb(unacc,pl,passive)" id="17" infl="pl" lcat="ssub" lemma="worden" pos="verb" postag="WW(pv,tgw,mv)" pt="ww" pvagr="mv" pvtijd="tgw" rel="hd" root="word" sc="passive" sense="word" tense="present" word="worden" wvorm="pv"/>
                <node begin="5" cat="ppart" end="13" id="18" rel="vc">
                  <node begin="5" end="7" id="19" index="1" rel="obj1"/>
                  <node begin="7" end="8" frame="adverb" id="20" lcat="advp" lemma="steeds" pos="adv" postag="BW()" pt="bw" rel="mod" root="steeds" sense="steeds" word="steeds"/>
                  <node begin="8" buiging="zonder" end="9" frame="verb(hebben,psp,np_pc_pp(met))" id="21" infl="psp" lcat="ppart" lemma="laden" pos="verb" positie="vrij" postag="WW(vd,vrij,zonder)" pt="ww" rel="hd" root="laad" sc="np_pc_pp(met)" sense="laad-met" word="geladen" wvorm="vd"/>
                  <node begin="10" cat="pp" end="13" id="22" rel="pc">
                    <node begin="10" end="11" frame="preposition(met,[mee,[en,al]])" id="23" lcat="pp" lemma="met" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="met" sense="met" vztype="init" word="met"/>
                    <node begin="11" cat="np" end="13" id="24" rel="obj1">
                      <node aform="base" begin="11" buiging="met-e" end="12" frame="adjective(e)" graad="basis" id="25" infl="e" lcat="ap" lemma="leeg" naamval="stan" pos="adj" positie="prenom" postag="ADJ(prenom,basis,met-e,stan)" pt="adj" rel="mod" root="leeg" sense="leeg" vform="adj" word="lege"/>
                      <node begin="12" end="13" frame="noun(de,count,pl)" gen="de" getal="mv" graad="basis" id="26" lcat="np" lemma="cd" ntype="soort" num="pl" pos="noun" postag="N(soort,mv,basis)" pt="n" rel="hd" root="cd" sense="cd" word="cd&apos;s"/>
                    </node>
                  </node>
                </node>
              </node>
              <node begin="13" conjtype="neven" end="14" frame="conj(en)" id="27" lcat="vg" lemma="en" pos="vg" postag="VG(neven)" pt="vg" rel="crd" root="en" sense="en" word="en"/>
              <node begin="14" cat="ssub" end="25" id="28" rel="cnj">
                <node begin="14" end="15" frame="determiner(het,nwh,nmod,pro,nparg)" getal="ev" id="29" infl="het" lcat="np" lemma="dat" naamval="stan" pdtype="pron" persoon="3o" pos="det" postag="VNW(aanw,pron,stan,vol,3o,ev)" pt="vnw" rel="su" root="dat" sense="dat" status="vol" vwtype="aanw" wh="nwh" word="dat"/>
                <node begin="16" cat="cp" end="22" id="30" rel="mod">
                  <node begin="16" conjtype="onder" end="17" frame="complementizer(als)" id="31" lcat="cp" lemma="als" pos="comp" postag="VG(onder)" pt="vg" rel="cmp" root="als" sc="als" sense="als" word="als"/>
                  <node begin="17" cat="ssub" end="22" id="32" rel="body">
                    <node begin="17" case="both" def="def" end="18" frame="pronoun(nwh,thi,both,de,both,def,wkpro)" gen="de" getal="mv" id="33" index="2" lcat="np" lemma="ze" naamval="stan" num="both" pdtype="pron" per="thi" persoon="3" pos="pron" postag="VNW(pers,pron,stan,red,3,mv)" pt="vnw" rel="su" root="ze" sense="ze" special="wkpro" status="red" vwtype="pers" wh="nwh" word="ze"/>
                    <node begin="19" end="20" frame="verb(unacc,pl,passive)" id="34" infl="pl" lcat="ssub" lemma="zijn" pos="verb" postag="WW(pv,tgw,mv)" pt="ww" pvagr="mv" pvtijd="tgw" rel="hd" root="ben" sc="passive" sense="ben" tense="present" word="zijn" wvorm="pv"/>
                    <node begin="17" cat="ppart" end="22" id="35" rel="vc">
                      <node begin="17" end="18" id="36" index="2" rel="obj1"/>
                      <node begin="18" end="19" frame="verb(hebben,psp,np_pc_pp(van))" id="37" infl="psp" lcat="ppart" lemma="voorzien" pos="verb" postag="WW(pv,tgw,mv)" pt="ww" pvagr="mv" pvtijd="tgw" rel="hd" root="voorzie" sc="np_pc_pp(van)" sense="voorzie-van" word="voorzien" wvorm="pv"/>
                      <node begin="20" cat="pp" end="22" id="38" rel="pc">
                        <node begin="20" end="21" frame="preposition(van,[af,uit,vandaan,[af,aan]])" id="39" lcat="pp" lemma="van" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="van" sense="van" vztype="init" word="van"/>
                        <node begin="21" end="22" frame="noun(de,mass,sg)" gen="de" genus="zijd" getal="ev" graad="basis" id="40" lcat="np" lemma="audio" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="obj1" root="audio" sense="audio" word="audio"/>
                      </node>
                    </node>
                  </node>
                </node>
                <node begin="23" case="both" def="def" end="24" frame="pronoun(nwh,thi,both,de,both,def,wkpro)" gen="de" getal="mv" id="41" lcat="np" lemma="ze" naamval="stan" num="both" pdtype="pron" per="thi" persoon="3" pos="pron" postag="VNW(pers,pron,stan,red,3,mv)" pt="vnw" rel="obj1" root="ze" sense="ze" special="wkpro" status="red" vwtype="pers" wh="nwh" word="ze"/>
                <node begin="24" buiging="zonder" end="25" frame="verb(hebben,sg3,transitive)" id="42" infl="sg3" lcat="ssub" lemma="verplaatsen" pos="verb" positie="vrij" postag="WW(vd,vrij,zonder)" pt="ww" rel="hd" root="verplaats" sc="transitive" sense="verplaats" tense="present" word="verplaatst" wvorm="vd"/>
              </node>
            </node>
          </node>
        </node>
      </node>
    </node>
    <sentence>Een robot-armpje zorgt ervoor dat de branders steeds geladen worden met lege cd&apos;s en dat , als ze voorzien zijn van audio , ze verplaatst</sentence>
    <comments>
      <comment>Q#WR-P-P-D-0000000006.p.34.s.3|Een robot-armpje zorgt ervoor dat de branders steeds geladen worden met lege cd&apos;s en dat , als ze voorzien zijn van audio , ze verplaatst|1|1|-0.4347218970399951</comment>
    </comments>
  </alpino_ds>
  </treebank>
"""

xpath = '//node[@cat="cp" and node[@rel="cmp" and @pt="vg" and number(@begin) < number(../node[@rel="body" and @cat="ssub"]/node[@rel="vc" and @cat="ppart"]/node[@rel="hd" and @pt="ww"]/@begin)] and node[@rel="body" and @cat="ssub" and node[@rel="vc" and @cat="ppart" and node[@rel="hd" and @pt="ww" and number(@begin) < number(../../node[@rel="hd" and @pt="ww"]/@begin)]] and node[@rel="hd" and @pt="ww"]]]'


for _, element in ET.iterparse(BytesIO(str.encode(xml)), tag="alpino_ds", events=("end", )):
    result = element.xpath(xpath)
    if result:
        print("match", ET.tostring(result[0]))

What am I missing here?

Asked By: Bram Vanroy

||

Answers:

In XPath, an absolute path (one starting with /) is evaluated from the document node (sometimes also called the root node), regardless of which element you call the xpath function on. So a path starting with //node selects node elements anywhere in the document that contains the context node, not just those inside the context node itself.
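To make the difference concrete, here is a minimal sketch on a fully parsed tree. The two-sentence document and the simplified `@cat="cp"` test are made up for illustration and stand in for the much longer XPath in the question.

```python
import lxml.etree as ET

# Tiny stand-in document (hypothetical data, not the original corpus).
root = ET.fromstring(
    "<treebank>"
    "<alpino_ds id='s1'><node cat='np'/></alpino_ds>"
    "<alpino_ds id='s2'><node cat='cp'/></alpino_ds>"
    "</treebank>"
)
first = root[0]  # the alpino_ds element with id="s1"

# Absolute path: evaluated from the document node, so it also
# finds the node that lives inside the *second* alpino_ds.
absolute = first.xpath('//node[@cat="cp"]')

# Relative path: only descendants of `first` are considered.
relative = first.xpath('.//node[@cat="cp"]')

print(len(absolute), len(relative))  # prints: 1 0
```

The single hit from the absolute path belongs to the `s2` sentence even though `xpath` was called on `s1`, which mirrors the effect described in the question.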

So to select relative to (that is, inside of) each alpino_ds element you iterate over, use a path starting with .//node instead of //node.
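Applied to the iterparse loop, the change could look like the sketch below. The three-sentence sample and the short `@cat="cp"` expression are simplified assumptions for illustration; with the question's real expression, only the leading `//` would become `.//`.

```python
from io import BytesIO
import lxml.etree as ET

# Hypothetical miniature treebank: only the last sentence has a "cp" node.
data = b"""<treebank>
  <alpino_ds id="s1"><node cat="np"/></alpino_ds>
  <alpino_ds id="s2"><node cat="np"/></alpino_ds>
  <alpino_ds id="s3"><node cat="cp"/></alpino_ds>
</treebank>"""

matches = []
for _, element in ET.iterparse(BytesIO(data), tag="alpino_ds", events=("end",)):
    # The leading "." anchors the search at the current alpino_ds element.
    if element.xpath('.//node[@cat="cp"]'):
        matches.append(element.get("id"))

print(matches)  # prints: ['s3']
```

Only the sentence that actually contains the matching node is reported, which is the single match the question expected.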

Answered By: Martin Honnen