How to delete the words between two delimiters?

Question:

I have some noisy data, something like:

<@ """@$ FSDF >something something <more noise>

Now I just want to extract "something something".
Is there a way to delete the text between the two delimiters "<" and ">"?

Asked By: frazman


Answers:

Use regular expressions:

>>> import re
>>> s = '<@ """@$ FSDF >something something <more noise>'
>>> re.sub('<[^>]+>', '', s)
'something something '

[Update]

If you tried a pattern like <.+>, where the dot means any character and the plus sign means one or more, you will have found that it does not work:

>>> re.sub(r'<.+>', '', s)
''

Why? Regular expression quantifiers are “greedy” by default: .+ matches as many characters as it can, so the match runs from the first < to the last > in the string, swallowing every > in between, and this is not what we want. We want to match < and stop at the next >, so we use the [^x] pattern, which means “any character but x” (x being >).

Putting ? after a quantifier makes it “non-greedy”, so this has the same effect:

>>> re.sub(r'<.+?>', '', s)
'something something '

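The [^>] form is more explicit; the non-greedy form is less typing. Be aware that ? has two roles: directly after a quantifier (as in .+?) it makes that quantifier non-greedy, while after anything else x? means zero or one occurrence of x.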
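
To make the difference concrete, here is a small side-by-side comparison of the greedy, negated-class and non-greedy patterns. The multi-line sample at the end is a hypothetical input added for illustration; it shows one practical difference between the two working patterns: the dot does not match a newline unless re.DOTALL is passed, while [^>] does.

```python
import re

s = '<@ """@$ FSDF >something something <more noise>'

# Greedy: .+ runs to the last '>' in the string, so everything is removed.
greedy = re.sub(r'<.+>', '', s)

# Negated class: [^>]+ cannot cross a '>', so each <...> is removed on its own.
negated = re.sub(r'<[^>]+>', '', s)

# Non-greedy: .+? stops at the first '>' that lets the match succeed.
lazy = re.sub(r'<.+?>', '', s)

print(repr(greedy))   # ''
print(repr(negated))  # 'something something '
print(repr(lazy))     # 'something something '

# Hypothetical input where the noise spans a line break.
multiline = 'keep <noise\nmore noise> this'
lazy_ml = re.sub(r'<.+?>', '', multiline)                      # unchanged
negated_ml = re.sub(r'<[^>]+>', '', multiline)                 # noise removed
dotall_ml = re.sub(r'<.+?>', '', multiline, flags=re.DOTALL)   # noise removed
```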

Answered By: Paulo Scardine

Of course, you can use regular expressions.

import re
s = '<@ """@$ FSDF >something something <more noise>'  # your string here
t = re.sub('<.*?>', '', s)

The above code should do it.

Answered By: Sufian Latif

>>> import re
>>> my_str = '<@ """@$ FSDF >something something <more noise>'
>>> re.sub('<.*?>', '', my_str)
'something something '

The re.sub function takes a regular expression and replaces every match in the string with the second parameter. In this case, we are searching for all characters between < and > ('<.*?>') and replacing them with nothing ('').

The ? after * makes the search non-greedy: it stops at the first > rather than the last.
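
The same idea extends to any pair of delimiters, not only < and >. Below is a minimal sketch, assuming the delimiters should be treated as literal text; strip_between is a hypothetical helper name, not part of the re module. It uses re.escape so that characters such as [ or ( are not interpreted as regex syntax, and re.DOTALL so that a span may cross a line break.

```python
import re

def strip_between(text, start, end):
    """Remove every start...end span (delimiters included) from text."""
    # re.escape lets delimiters such as '[' or '(' be matched literally.
    pattern = re.escape(start) + r'.*?' + re.escape(end)
    # DOTALL allows a span to continue across line breaks.
    return re.sub(pattern, '', text, flags=re.DOTALL)

my_str = '<@ """@$ FSDF >something something <more noise>'
print(repr(strip_between(my_str, '<', '>')))           # 'something something '
print(repr(strip_between('a [x] b [y] c', '[', ']')))  # 'a  b  c'
```

Note that the surrounding whitespace is left alone, which is why the second result contains double spaces.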

More about the re module.


If the “noise” is actually HTML tags, I suggest looking into BeautifulSoup.
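
For real HTML, a parser is usually a safer choice than a regular expression. The sketch below uses the standard library's html.parser instead of BeautifulSoup, purely so that it runs without installing anything; the TextExtractor class, the strip_tags function and the sample markup are assumptions introduced for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are skipped.
        self.parts.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

sample = '<p>something <b>something</b></p>'
print(repr(strip_tags(sample)))  # 'something something'
```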

Answered By: juliomalegria

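Just out of interest, you could also write the filter by hand. Note that this version keeps the delimiters themselves and removes only what is between them: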

with open('blah.txt','w') as f:
    f.write("""<sdgsa>one<as<>asfd<asdf>
<asdf>two<asjkdgai><iasj>three<fasdlojk>""")

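def filter_line(line):
    """Drop the characters between < and >, keeping the delimiters."""
    ignore = False  # True while we are between a < and its closing >
    result = []
    for c in line:
        # A > closes the current span; it is kept in the output.
        if c == ">" and ignore:
            ignore = False
        if not ignore:
            result.append(c)
        # A < opens a span; it was appended above, what follows is skipped.
        if c == "<" and not ignore:
            ignore = True
    return "".join(result)

with open('blah.txt') as f:
    print("".join(map(filter_line, f)))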

>>> 
<>one<>asfd<>
<>two<><>three<>
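
A character-by-character scan can also drop the delimiters themselves and keep track of nested delimiters, which the regular expressions in the other answers do not do. The following is a minimal sketch of that idea rather than the code above; remove_delimited and its depth counter are names introduced here for illustration, and it assumes two different single-character delimiters.

```python
def remove_delimited(text, start='<', end='>'):
    """Drop everything inside start...end pairs, delimiters included."""
    depth = 0  # number of start delimiters that are still unclosed
    result = []
    for c in text:
        if c == start:
            depth += 1
        elif c == end and depth > 0:
            depth -= 1
        elif depth == 0:
            # Outside every span: keep the character (including a stray end).
            result.append(c)
    return ''.join(result)

print(repr(remove_delimited('<@ """@$ FSDF >something something <more noise>')))
# 'something something '
print(repr(remove_delimited('a<b<c>d>e')))  # 'ae'
```

An unmatched closing delimiter is kept as ordinary text, while an unmatched opening delimiter swallows the rest of the input.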

Answered By: Rusty Rob

First, thank you Paulo Scardine: I put your regular expression to good use. The idea was to produce a tag-free LibreOffice po file for printing. I wrote the following script, which strips the tags from the help file to leave a smaller, easier-to-read one.

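import re

# Read the whole input file.
with open('a.csv') as f:
    text = f.read()

# Replace every <...> tag with a space so adjacent words stay separated.
clean = re.sub('<[^>]+>', ' ', text)

# Write the cleaned text to a new file.
with open('b.csv', 'w') as f:
    f.write(clean)

Answered By: user1993440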