I want to remove the unwanted sub-level duplicate tags using lxml etree

Question

This is the input sample text. I want to do an object-based cleanup (rather than string manipulation) to avoid hierarchy issues.

<p><b><b><i><b><i><b><i>sample text</i></b></i></b></i></b></b></p>

Required Output

<p><b><i>sample text</i></b></p>

Asked By: Rajeshkanna Purushothaman

Answer 1

The markup itself provides enough structure to extract the elements inside it.

Using Python's re module, you can extract the elements and recombine them.

For example:

import re


html = """<p><b><b><i><b><i><b>

<i>sample text</i>

</b></i></b></i></b></b></p>"""


regex_object = re.compile("<(.*?)>")
html_objects = regex_object.findall(html)
set_html = []

for obj in html_objects:
    if obj[0] != "/" and obj not in set_html:
        set_html.append(obj)


regex_text = re.compile(">(.*?)<")
text = [result for result in regex_text.findall(html) if result][0]

# Recombine
result = ""
for obj in set_html:
    result += f"<{obj}>"
result += text
for obj in set_html[::-1]:
    result += f"</{obj}>"
    
# result = '<p><b><i>sample text</i></b></p>'

Answered By: Prem Chotepanit

Answer 2

You can use the regex library re to create a function to search for the matching opening tag and closing tag pair and everything else in between. Storing tags in a dictionary will remove duplicate tags and maintain the order they were found in (if order isn’t important then just use a set). Once all pairs of tags are found, wrap what’s left with the keys of the dictionary in reverse order.

import re

def remove_duplicates(string):
    
    tags = {}
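    # Peel off the outermost <tag>...</tag> pair until no pair is left
    while (match := re.findall(r'<(.+)>([\w\W]*)</\1>', string)):
        tag, string = match[0][0], match[0][1]   # match is [(group 1, group 2)]
        tags.update({tag: None})

    # Re-wrap the remaining text, innermost tag first
    for tag in reversed(tags):
        string = f'<{tag}>{string}</{tag}>'

    return string

Note: I’ve used [\w\W]* as a cheat to match everything, including newlines.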
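A quick check of this approach on the question's sample may help. The sketch below restates the function with the pattern written out in full as `<(.+)>([\w\W]*)</\1>` (backslashes included); the variable names `sample` and `cleaned` are only for illustration. It assumes Python 3.8 or later, since it relies on the walrus operator and on `reversed()` over a dict.

```python
import re


def remove_duplicates(string):
    tags = {}  # dict keys keep insertion order and drop repeats
    # Each pass strips the outermost matching <tag>...</tag> pair
    while (match := re.findall(r'<(.+)>([\w\W]*)</\1>', string)):
        tag, string = match[0]
        tags[tag] = None
    # Wrap what is left, starting with the innermost tag
    for tag in reversed(tags):
        string = f'<{tag}>{string}</{tag}>'
    return string


sample = '<p><b><b><i><b><i><b><i>sample text</i></b></i></b></i></b></b></p>'
cleaned = remove_duplicates(sample)
print(cleaned)  # <p><b><i>sample text</i></b></p>
```

Because the backreference must match the whole opening tag, this only works for tags without attributes and for a single chain of nested tags around one piece of text.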

Answered By: bn_ln

Answer 3

I wrote this object-based cleanup using lxml for sub-level duplicate tags. It may help others.

import lxml.etree as ET

textcont = '<p><b><b><i><b><i><b><i>sample text</i></b></i></b></i></b></b></p>'

soup = ET.fromstring(textcont)

for tname in ['i','b']:
    for tagn in soup.iter(tname):
        # Grandparent has the same tag: move the parent up beside it, then drop the grandparent
        if tagn.getparent().getparent() is not None and tagn.getparent().getparent().tag == tname:
            iparOfParent = tagn.getparent().getparent()
            iParent = tagn.getparent()
            if iparOfParent.text is None:
                iparOfParent.addnext(iParent)
                iparOfParent.getparent().remove(iparOfParent)
        # Direct parent has the same tag: move this element up beside it, then drop the parent
        elif tagn.getparent() is not None and tagn.getparent().tag == tname:
            iParent = tagn.getparent()
            if iParent.text is None:
                iParent.addnext(tagn)
                iParent.getparent().remove(iParent)

            
print(ET.tostring(soup))

output:

b'<p><b><i>sample text</i></b></p>'
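As an alternative sketch (not the code above), the same cleanup can probably be expressed with `lxml.etree.strip_tags`, which removes descendant elements with the given tag names while merging their text and children into the parent, and leaves the element it is called on in place. The helper name `collapse_nested` and the `INLINE_TAGS` set are hypothetical, and limiting the cleanup to `b` and `i` is an assumption based on the sample.

```python
import lxml.etree as ET

# Assumption: only these formatting tags should be collapsed
INLINE_TAGS = {'b', 'i'}


def collapse_nested(el):
    """Walk top-down and unwrap same-tag descendants of each inline element."""
    if el.tag in INLINE_TAGS:
        # Removes every descendant <tag> of el, keeping their content;
        # el itself is not removed even though its tag matches
        ET.strip_tags(el, el.tag)
    # Snapshot the children after stripping, then recurse into them
    for child in list(el):
        collapse_nested(child)


textcont = '<p><b><b><i><b><i><b><i>sample text</i></b></i></b></i></b></b></p>'
root = ET.fromstring(textcont)
collapse_nested(root)
cleaned = ET.tostring(root, encoding='unicode')
print(cleaned)  # <p><b><i>sample text</i></b></p>
```

Unlike the parent and grandparent checks above, this treats a repeated tag at any depth as redundant, which may or may not be what you want for tags other than simple formatting ones.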

Answered By: Rajeshkanna Purushothaman
