Removing all text within double quotes
Question:
I am preprocessing some text in Python and would like to remove all text that appears inside double quotes. I am unsure how to do that and would appreciate your help. A minimal reproducible example is below for your reference. Thank you in advance.
x='The frog said "All this needs to get removed" something'
So what I want to get is 'The frog said something', by removing the text in the double quotes from x above, and I am not sure how to do that. Thanks once again.
Answers:
Use regex substitution:
import re
x='The frog said "All this needs to get removed" something'
res = re.sub(r'\s*"[^"]+"\s*', ' ', x)
print(res)
The frog said something
\s* – matches optional whitespace characters
" – matches the " character as is
[^"]+ – matches one or more characters other than " (the leading ^ negates the character class)
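The same pattern also handles strings with several quoted passages. Below is a minimal sketch, assuming the quotes are balanced and do not nest; the helper name remove_quoted is illustrative and not part of the answer above. It uses [^"]* instead of [^"]+ so that empty quotes ("") are removed as well, and it strips the space that the substitution leaves behind when a quoted passage sits at the very start or end of the string.

```python
import re

def remove_quoted(text):
    """Remove every "..." passage and the whitespace around it (hypothetical helper)."""
    # Replace each quoted passage, plus any surrounding whitespace, with one space.
    cleaned = re.sub(r'\s*"[^"]*"\s*', ' ', text)
    # A quoted passage at the start or end leaves a stray space behind; strip it.
    # Note: this also strips leading/trailing whitespace that was in the input.
    return cleaned.strip()

print(remove_quoted('The frog said "All this needs to get removed" something'))
print(remove_quoted('"Hi" said the frog, "bye" said the toad ""'))
```

A lone, unbalanced " is left in place by this sketch, since the pattern only matches a complete pair.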
If you want to use index and slicing:
s='The frog said "All this needs to get removed" something'
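# Get the indices of both quotes
[i for i, x in enumerate(s) if x == '"']
# [14, 44]
# Keep everything before the space that precedes the opening quote,
# and everything after the closing quote
s[:13] + s[45:]
# 'The frog said something'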
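The slice bounds above are hard-coded for this particular string. A possible generalization, sketched under the assumption that only the first quoted passage needs removing, looks the quote positions up with str.find; remove_first_quoted is a hypothetical name used for illustration.

```python
def remove_first_quoted(s):
    """Cut out the first "..." passage using str.find and slicing (hypothetical helper)."""
    start = s.find('"')            # index of the opening quote, -1 if there is none
    end = s.find('"', start + 1)   # index of the closing quote, -1 if there is none
    if start == -1 or end == -1:
        return s                   # no complete pair of quotes: nothing to remove
    # Trim whitespace on both sides of the cut, then rejoin with a single space.
    left = s[:start].rstrip()
    right = s[end + 1:].lstrip()
    return ' '.join(part for part in (left, right) if part)

s = 'The frog said "All this needs to get removed" something'
print(remove_first_quoted(s))
```

For strings with more than one quoted passage the function would have to be applied repeatedly, at which point the regex approach is likely simpler.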
A quick fix, assuming the " characters in the string are balanced (i.e. there is an even number of them) and the original string contains no double spaces that need to be preserved.
x = 'The frog said "All this needs to get removed" something'
x_new = ''.join(x.split('"')[::2]).replace('  ', ' ')
If needed, these conditions can be checked on the original string with str.count:
if x.count('"') % 2 != 0:
    raise ValueError('Double quotes are not balanced')
if x.count("  ") > 0:
    raise ValueError('Double spaces are present')
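The split-based approach and the balance check can be wrapped into one function. This is a rough sketch, not the answerer's code: drop_quoted_by_split is a made-up name, and instead of requiring that no double spaces are present it collapses every run of whitespace into a single space, which also alters whitespace runs that were in the original text.

```python
def drop_quoted_by_split(text):
    """Remove quoted passages by splitting on the quote character (hypothetical helper)."""
    if text.count('"') % 2 != 0:
        raise ValueError('Double quotes are not balanced')
    # After splitting on ", the pieces at even positions lie outside the quotes.
    outside = text.split('"')[::2]
    # str.split() without arguments splits on any whitespace run and drops
    # empty pieces, so joining with one space collapses repeated spaces.
    return ' '.join(''.join(outside).split())

x = 'The frog said "All this needs to get removed" something'
print(drop_quoted_by_split(x))
```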