Python and LXML: changing attributes of different contiguous elements if certain conditions are met

Question:

I have to process XML files with structures such as the following:

<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT1</tok> 
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT2</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXT3</tok>
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXTA</tok> 
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXTB</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXTC</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT7</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXTX</tok>

I want to change attribute values on more than one element within a specific sequence of elements whenever certain conditions are met.

For instance, in the example document above I’d like to change the value of attribute ‘tag1’ in the first element of the sequence to "changed" and the value of the same attribute in the following element to "changed2" whenever three consecutive elements have the text values ‘TEXTA’, ‘TEXTB’ and ‘TEXTC’. The output would then be:

<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT1</tok> 
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT2</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXT3</tok>
<tok tag1="changed" tag2="blah2" tag3="blah3">TEXTA</tok> 
<tok tag1="changed2" tag2="blahB" tag3="blahC">TEXTB</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXTC</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT7</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXTX</tok>

Right now I only know how to do this when the modification affects a single element:

for el in root.xpath('//tok[text()="TEXTA"][following-sibling::tok[1][text()="TEXTB"]][following-sibling::tok[2][text()="TEXTC"]]'):
    el.set('tag1', 'changed')

I’ve read the documentation and tutorials I’ve found for the XPath syntax but I cannot figure out how to add the instructions for the modifications to affect also other elements following or preceding the one that is specified as ‘el’ in the code. Any help with this would be greatly appreciated.

Asked By: jfontana


Answers:

Ok, this time try it this way:

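from lxml import etree  # `root` is assumed to be the already-parsed document root

# Anchor on the middle element (TEXTB) and check its neighbours on both sides.
for el in root.xpath('//tok[.="TEXTB"][preceding-sibling::tok[1][.="TEXTA"]][following-sibling::tok[1][.="TEXTC"]]'):
    el.set('tag1', 'changed2')
    # The immediately preceding <tok> is the TEXTA element.
    el.xpath('preceding-sibling::tok[1]')[0].set('tag1', 'changed')
print(etree.tostring(root).decode())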

and the output should be your expected output.
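
For completeness, here is a minimal self-contained sketch of the same idea that keeps the XPath from the question (anchored on the first element, TEXTA) and then steps forward to the next sibling with a relative XPath. The <root> wrapper element, the SAMPLE and FIRST_OF_SEQUENCE constants, and the retag_sequence function name are assumptions made for illustration; the question only shows a fragment of the real file.

```python
from lxml import etree

# Assumption: the question shows only a fragment, so it is wrapped in a
# hypothetical <root> element here to make it parseable.
SAMPLE = """<root>
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT1</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT2</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXT3</tok>
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXTA</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXTB</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXTC</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT7</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXTX</tok>
</root>"""

# Select the first element of every TEXTA, TEXTB, TEXTC run of siblings.
FIRST_OF_SEQUENCE = (
    '//tok[.="TEXTA"]'
    '[following-sibling::tok[1][.="TEXTB"]]'
    '[following-sibling::tok[2][.="TEXTC"]]'
)


def retag_sequence(root):
    """Set tag1 on the first two elements of each matching run.

    Returns the number of runs that were modified.
    """
    count = 0
    for first in root.xpath(FIRST_OF_SEQUENCE):
        # The predicate above guarantees that this sibling exists.
        second = first.xpath('following-sibling::tok[1]')[0]
        first.set('tag1', 'changed')
        second.set('tag1', 'changed2')
        count += 1
    return count


root = etree.fromstring(SAMPLE)
matches = retag_sequence(root)
print(etree.tostring(root).decode())
```

Calling xpath() on an element evaluates the expression relative to that element, so any neighbour can be reached the same way, for example 'following-sibling::tok[2]' for the third element of the run.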

Answered By: Jack Fleeting