Python and LXML: changing an attribute only in specific contexts

Question:

I have to process XML files with structures such as the following:

<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT1</tok> 
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT2</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXT3</tok>

Up to now I’ve been using the following code to change attribute values on certain elements whenever a particular condition is met. For instance, to change the ‘tag1’ and ‘tag2’ attributes only in the ‘tok’ elements where ‘tag1’ has the value ‘blah1’, this does the job:

def xml_change(root_element):
    # Update tag1 and tag2 on every tok whose tag1 is "blah1" (modifies the tree in place).
    for el in root_element.xpath('//tok'):
        if el.get('tag1') == "blah1":
            el.set('tag1', 'Blah1-TEXT1')
            el.set('tag2', 'Blah2-TEXT1')

Applied to the XML above, it produces:

<tok tag1="Blah1-TEXT1" tag2="Blah2-TEXT1" tag3="blah3">TEXT1</tok> 
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT2</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXT3</tok>

What I need to do next, though, is a bit more complicated and I’m totally stumped. Let me try to describe the problem to see if you can point me to a satisfactory solution.

In some cases I need to change certain attributes of a ‘tok’ element only if the elements preceding or following it have certain attribute values. So, say I have the following XML:

<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT1</tok> 
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT2</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT3</tok>
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT4</tok> 
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT5</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXT6</tok>

What would I have to change in my code to set the ‘tag1’ attribute to, say, "newattrib", only in cases where the ‘tag1’ attribute of the previous element is "blah1" and the ‘tag2’ attribute of the following element is "blahY"? Using the XML above, this would affect only the element with text ‘TEXT5’ and would have to return:

<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT1</tok> 
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT2</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT3</tok>
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT4</tok> 
<tok tag1="newattrib" tag2="blahB" tag3="blahC">TEXT5</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXT6</tok>

In essence, what I don’t know is how to specify the context for the elements that I want to modify.

Asked By: jfontana

Answers:

You’ll have to use a somewhat complicated XPath expression involving preceding and following siblings, but it’s doable. Try something like this:

from lxml import etree

blahs = """<root>
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT1</tok> 
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT2</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT3</tok>
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT4</tok> 
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT5</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXT6</tok>
</root>"""

doc = etree.fromstring(blahs)
for el in doc.xpath('//tok[preceding-sibling::tok[1][@tag1="blah1"]][following-sibling::tok[1][@tag2="blahY"]]'):
    el.set('tag1', 'newattrib')
print(etree.tostring(doc).decode())

This should print your expected output, with only the TEXT5 element changed.
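If you would rather keep the context test in Python than in XPath, the same idea can be sketched with lxml's getprevious() and getnext(), which return the adjacent sibling node (or None at either end). This is only a sketch under some assumptions: the function name retag_by_neighbours is made up for illustration, and it treats "previous" and "following" as the immediately adjacent node, so a comment or other element sitting between two tok elements would block a match, whereas the XPath version would skip over it to the nearest tok.

```python
from lxml import etree

XML = """<root>
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT1</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT2</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT3</tok>
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT4</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT5</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXT6</tok>
</root>"""


def retag_by_neighbours(root):
    """Hypothetical helper: set tag1="newattrib" on each tok whose adjacent
    siblings satisfy the two conditions. Returns the changed elements."""
    matches = []
    for el in root.iter('tok'):
        prev, nxt = el.getprevious(), el.getnext()
        # First and last children have no neighbour on one side.
        if prev is None or nxt is None:
            continue
        # Assumption: only an adjacent tok counts as context.
        if prev.tag != 'tok' or nxt.tag != 'tok':
            continue
        if prev.get('tag1') == 'blah1' and nxt.get('tag2') == 'blahY':
            matches.append(el)
    # Modify only after all tests are done, so one change cannot
    # influence the test for a later element.
    for el in matches:
        el.set('tag1', 'newattrib')
    return matches


root = etree.fromstring(XML)
changed = retag_by_neighbours(root)
print([el.text for el in changed])
```

Collecting the matches first and editing afterwards mirrors what the XPath version does, since doc.xpath() returns the full result list before the loop starts setting attributes.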

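The [1] in each predicate restricts the test to the nearest preceding or following tok sibling. Dropping the [1]s changes the meaning to "any preceding sibling" and "any following sibling", which on the sample above would also match TEXT2, TEXT3 and TEXT4, so only drop them if that broader match is what your actual structure calls for.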
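To see what the [1] actually does, here is a small comparison on the same sample data. The variable names (strict, loose, and so on) are just for illustration. On the preceding-sibling and following-sibling axes, position 1 is the sibling closest to the context node, so the strict form looks only at the nearest tok on each side.

```python
from lxml import etree

XML = """<root>
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT1</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT2</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT3</tok>
<tok tag1="blah1" tag2="blah2" tag3="blah3">TEXT4</tok>
<tok tag1="blahA" tag2="blahB" tag3="blahC">TEXT5</tok>
<tok tag1="blahX" tag2="blahY" tag3="blahZ">TEXT6</tok>
</root>"""

doc = etree.fromstring(XML)

# With [1]: only the nearest tok sibling on each side is tested.
strict = ('//tok[preceding-sibling::tok[1][@tag1="blah1"]]'
          '[following-sibling::tok[1][@tag2="blahY"]]')

# Without [1]: any tok sibling on that side may satisfy the test.
loose = ('//tok[preceding-sibling::tok[@tag1="blah1"]]'
         '[following-sibling::tok[@tag2="blahY"]]')

strict_hits = [el.text for el in doc.xpath(strict)]
loose_hits = [el.text for el in doc.xpath(loose)]

print(strict_hits)  # ['TEXT5']
print(loose_hits)   # ['TEXT2', 'TEXT3', 'TEXT4', 'TEXT5']
```

TEXT2 through TEXT4 match the loose form because TEXT1 is somewhere before them with tag1="blah1" and TEXT6 is somewhere after them with tag2="blahY".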

Answered By: Jack Fleeting