Remove or replace only the word after a specific word in a pandas column using regex
Question:
I’m trying to use regex to remove or replace only the word after specific word(s) in a column of strings in a dataframe. This means I don’t want the spaces to be replaced, just the word that follows the specific word(s).
import pandas as pd
df = pd.DataFrame({'STRING': [r"THERE IS NO REASON WHY THIS SHOULDN'T WORK!", r"I AM WITHOUT DOUBT VERY BAD AT REGEX", r"I CAN'T SOLVE A PROBLEM HAT HAS NO INTRINSIC VALUE"]})
df.STRING.str.replace(r'/(?<=NO|WITHOUT)(\s+)\w','', regex=True)  # this doesn't work
here’s my output:
String
0 THERE IS NO REASON WHY THIS SHOULDN'T WORK!
1 I AM WITHOUT DOUBT VERY BAD AT REGEX
2 I CAN'T SOLVE A PROBLEM THAT HAS NO INT...
desired_output
0 THERE IS NO WHY THIS SHOULDN'T WORK!
1 I AM WITHOUT VERY BAD AT REGEX
2 I CAN'T SOLVE A PROBLEM THAT HAS NO VALUE
Again, I don’t want the spaces between them to be removed. I only want the one word after NO or WITHOUT to be removed/replaced.
Answers:
Note that your regex, /(?<=NO|WITHOUT)(\s+)\w, contains several issues:
/ - is a typo; it was probably a regex delimiter that got into the pattern
(?<=NO|WITHOUT) - is a lookbehind whose alternatives match strings of different lengths, and Python lookbehind patterns must be fixed-width
\w - matches a single word char, not one or more. There must be a quantifier after \w, either * (zero or more occurrences) or + (one or more occurrences).
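The fixed-width restriction on lookbehinds can be checked directly with Python's re module. The sketch below is only an illustration: the helper name compiles is made up, and the alternative pattern, which gives each keyword its own lookbehind, assumes exactly one whitespace character follows the keyword. Neither is part of the original answer.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# One lookbehind whose alternatives have different lengths (2 vs 7 chars).
variable_width = r'(?<=NO|WITHOUT)\s+\w+'

# Assumption: exactly one whitespace char follows the keyword, so each
# lookbehind on its own has a fixed width.
fixed_width = r'(?:(?<=\bNO\s)|(?<=\bWITHOUT\s))\w+'

print(compiles(variable_width))  # False
print(compiles(fixed_width))     # True

# Only the word is matched, so the surrounding spaces stay in place.
print(repr(re.sub(fixed_width, '', "I AM WITHOUT DOUBT VERY BAD AT REGEX")))
```

Because the lookbehinds consume nothing, this variant needs no capturing group or backreference, at the cost of not handling runs of several whitespace characters.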
You can use
import pandas as pd
df = pd.DataFrame({'STRING': [r"THERE IS NO REASON WHY THIS SHOULDN'T WORK!", r"I AM WITHOUT DOUBT VERY BAD AT REGEX", r"I CAN'T SOLVE A PROBLEM HAT HAS NO INTRINSIC VALUE"]})
pattern = r'\b((?:NO|WITHOUT)\s+)\w+'
df['STRING'] = df['STRING'].str.replace(pattern, r'\1', regex=True)
Output:
>>> print(df.to_string())
STRING
0 THERE IS NO WHY THIS SHOULDN'T WORK!
1 I AM WITHOUT VERY BAD AT REGEX
2 I CAN'T SOLVE A PROBLEM HAT HAS NO VALUE
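Since the question also asks about replacing the word rather than removing it, here is a hedged sketch of that case using the standard re module on plain strings. The placeholder text [REDACTED] and the variable names are invented for illustration. The same pattern and replacement strings should carry over to Series.str.replace(..., regex=True), which follows re.sub semantics.

```python
import re

rows = [
    "THERE IS NO REASON WHY THIS SHOULDN'T WORK!",
    "I AM WITHOUT DOUBT VERY BAD AT REGEX",
]

# Group 1 keeps the keyword and the whitespace after it; \w+ is the word to swap out.
pattern = r'\b((?:NO|WITHOUT)\s+)\w+'

# \g<1> is the unambiguous spelling of \1; it stays safe even if the
# replacement text were to start with a digit.
replaced = [re.sub(pattern, r'\g<1>[REDACTED]', row) for row in rows]

for row in replaced:
    print(row)
# THERE IS NO [REDACTED] WHY THIS SHOULDN'T WORK!
# I AM WITHOUT [REDACTED] VERY BAD AT REGEX
```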
Details:
\b - a word boundary
((?:NO|WITHOUT)\s+) - Group 1 (\1 in the replacement pattern refers to this group's value): NO or WITHOUT, and then one or more whitespace chars
\w+ - one or more word chars (replace with \S+ if you plan to remove one or more non-whitespace chars, or even \S+\b to leave trailing punctuation out of the match).
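To make the difference between the three possible endings of the pattern concrete, the sketch below runs each of them on one sample sentence. The sentence and the variable names are made up for illustration and do not come from the question's data.

```python
import re

text = "SAY NO ROCK'N'ROLL, PLEASE"

# Shared prefix: keyword plus the following whitespace, captured as group 1.
prefix = r'\b((?:NO|WITHOUT)\s+)'

# \w+ stops at the first non-word char (the apostrophe).
word_chars = re.sub(prefix + r'\w+', r'\1', text)

# \S+ takes everything up to the next whitespace, including the comma.
non_space = re.sub(prefix + r'\S+', r'\1', text)

# \S+\b backtracks to the last word boundary, so the comma is kept.
non_space_boundary = re.sub(prefix + r'\S+\b', r'\1', text)

print(repr(word_chars))          # "SAY NO 'N'ROLL, PLEASE"
print(repr(non_space))           # 'SAY NO  PLEASE'
print(repr(non_space_boundary))  # 'SAY NO , PLEASE'
```

In all three cases the whitespace after the keyword survives because it is part of group 1, which the replacement puts back.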