Python regex to remove text between some pattern
Question:
I have text in the following format.
|start| this is first para to remove |end|.
this is another text.
|start| this is another para to remove |end|. Again some free text
I want to remove all text between |start| and |end|.
I have tried the following regex:
regex = r'(?<=\|start\|).+(?=\|end\|)'
re.sub(regex, '', text)
It returns
“Again some free text”
But I expect to return
this is another text. Again some free text
Answers:
Note that the start/end delimiters are inside lookaround constructs in your pattern, so they are not consumed and will remain in the resulting string after re.sub. You should convert the lookbehind and lookahead into consuming patterns.
Also, you seem to want to remove special chars after the right-hand delimiter, so you need to add [^\w\s]* at the end of the regex.
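To illustrate the difference between asserting and consuming the delimiters, here is a minimal sketch. The sample string and the variable names are made up for the illustration, and a lazy .+? is used in both patterns so that only the lookarounds differ.

```python
import re

# Hypothetical one-line sample, not the text from the question.
sample = "|start| drop me |end|. keep me"

# Lookarounds only assert that the delimiters are there; they are not
# part of the match, so re.sub leaves them in the result.
lookaround = re.sub(r'(?<=\|start\|).+?(?=\|end\|)', '', sample)

# Consuming patterns make the delimiters part of the match, so they are
# removed together with the enclosed text.
consuming = re.sub(r'\|start\|.+?\|end\|', '', sample)

print(repr(lookaround))
print(repr(consuming))
```

The first result still contains both delimiters, while the second keeps only what follows |end|.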
You may use
import re
text = """|start| this is first para to remove |end|.
this is another text.
|start| this is another para to remove |end|. Again some free text"""
print(re.sub(r'(?s)\|start\|.*?\|end\|[^\w\s]*', '', text).replace('\n', ''))
# => this is another text. Again some free text
See the Python demo.
Regex details
(?s) – inline DOTALL modifier, so . also matches line breaks
\|start\| – the literal text |start|
.*? – any 0+ chars, as few as possible
\|end\| – the literal text |end|
[^\w\s]* – 0 or more chars other than word and whitespace chars.
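If the pattern is reused, it could be precompiled once and wrapped in a small helper. This is only a sketch: the names BLOCK and strip_blocks are made up here, and re.DOTALL is passed as a flag in place of the inline (?s), which has the same effect.

```python
import re

# Same pattern as in the breakdown, with re.DOTALL as a flag instead of (?s).
BLOCK = re.compile(r'\|start\|.*?\|end\|[^\w\s]*', re.DOTALL)

def strip_blocks(text):
    """Remove every |start| ... |end| block, then drop the line breaks."""
    return BLOCK.sub('', text).replace('\n', '')

text = """|start| this is first para to remove |end|.
this is another text.
|start| this is another para to remove |end|. Again some free text"""

print(strip_blocks(text))
# => this is another text. Again some free text
```

Because of DOTALL, a block that spans several lines is removed as a whole.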
Try this:
import re
your_string = """|start| this is first para to remove |end|.
this is another text.
|start| this is another para to remove |end|. Again some free text"""
regex = r'(\|start\|).+(\|end\|.)'
result = re.sub(regex, '', your_string).replace('\n', '')
print(result)
Outputs:
this is another text. Again some free text
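One caveat about this pattern: .+ is greedy and, without DOTALL, stops at line breaks, so it works on the sample because each block sits on its own line. A sketch of what may happen when two blocks share a line, using a made-up input string and a lazy .+? for comparison:

```python
import re

# Hypothetical input with two blocks on the same line.
one_line = "|start| a |end|. keep this |start| b |end|. and this"

# Greedy: .+ runs to the last |end| on the line, so the text between
# the two blocks is removed as well.
greedy = re.sub(r'(\|start\|).+(\|end\|.)', '', one_line)

# Lazy: .+? stops at the first |end|, so each block is removed separately.
lazy = re.sub(r'(\|start\|).+?(\|end\|.)', '', one_line)

print(repr(greedy))
print(repr(lazy))
```

With the greedy version "keep this" is lost; the lazy version keeps it.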