Remove or replace only the word after a specific word in a pandas column using regex

Question:

I’m trying to use regex to remove or replace only the word after specific word(s) in a column of strings in a dataframe. This means I don’t want the spaces to be replaced, just the word that follows the specific word(s).

import pandas as pd

df = pd.DataFrame({'STRING': [r"THERE IS NO REASON WHY THIS SHOULDN'T WORK!", r"I AM WITHOUT DOUBT     VERY BAD AT REGEX", r"I CAN'T SOLVE A PROBLEM HAT HAS NO INTRINSIC VALUE"]})
 
df.STRING.str.replace(r'/(?<=NO|WITHOUT)(\s+)\w','', regex=True)  #this doesn't work

here’s my output:

                                              String  
0        THERE IS NO REASON WHY THIS SHOULDN'T WORK!   
1           I AM WITHOUT DOUBT     VERY BAD AT REGEX   
2        I CAN'T SOLVE A PROBLEM THAT HAS NO INT...   

                                      desired_output  
0              THERE IS NO  WHY THIS SHOULDN'T WORK!  
1                I AM WITHOUT      VERY BAD AT REGEX  
2         I CAN'T SOLVE A PROBLEM THAT HAS NO  VALUE  

Again, I don’t want the spaces between them to be removed. I only want the one word after NO or WITHOUT to be removed/replaced.

Asked By: Ankhnesmerira


Answers:

Note that your regex, /(?<=NO|WITHOUT)(\s+)\w, contains several issues:

  • / – is a typo; it was probably a regex delimiter that got into the pattern
  • (?<=NO|WITHOUT) – is a lookbehind whose alternatives match strings of different lengths, but lookbehind patterns in Python's re module must be fixed-width
  • \w – matches a single word char, not one or more. There must be a quantifier after \w: * (zero or more occurrences) or + (one or more occurrences).
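
To illustrate the lookbehind point, here is a minimal sketch using the standard re module directly (pandas' str.replace with regex=True relies on the same engine). The variable names are only for illustration, and the fixed-width variant is one possible workaround, not the approach recommended below.

```python
import re

# The question's lookbehind (leading slash dropped, quantifier added):
# alternatives of different lengths inside one lookbehind are rejected
# by Python's built-in re module.
try:
    re.compile(r'(?<=NO|WITHOUT)(\s+)\w+')
    lookbehind_compiles = True
except re.error:
    lookbehind_compiles = False

# One fixed-width lookbehind per keyword is accepted instead.
fixed = re.compile(r'(?:(?<=\bNO)|(?<=\bWITHOUT))(\s+)\w+')

# Putting group 1 back keeps the whitespace and drops only the word.
result = fixed.sub(r'\1', "I AM WITHOUT DOUBT     VERY BAD AT REGEX")

print(lookbehind_compiles)  # False
print(repr(result))
```

Capturing the keyword together with the whitespace, as in the solution that follows, avoids the lookbehind altogether and is usually easier to read.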

You can use

import pandas as pd
df = pd.DataFrame({'STRING': [r"THERE IS NO REASON WHY THIS SHOULDN'T WORK!", r"I AM WITHOUT DOUBT     VERY BAD AT REGEX", r"I CAN'T SOLVE A PROBLEM HAT HAS NO INTRINSIC VALUE"]})
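# Group 1 captures the keyword plus the whitespace after it; \w+ is the word to drop
pattern = r'\b((?:NO|WITHOUT)\s+)\w+'
# Put Group 1 back (\1) so only the following word is removed
df['STRING'] = df['STRING'].str.replace(pattern, r'\1', regex=True)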

Output:

>>> print(df.to_string())
                                      STRING
0      THERE IS NO  WHY THIS SHOULDN'T WORK!
1        I AM WITHOUT      VERY BAD AT REGEX
2  I CAN'T SOLVE A PROBLEM HAT HAS NO  VALUE 
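
The question also mentions replacing the word rather than removing it. A minimal sketch of that variant follows, again with plain re.sub so it runs without pandas. The helper name replace_next_word and the placeholder ___ are hypothetical and not part of the original answer.

```python
import re

# Same pattern as in the answer: group 1 is the keyword plus its whitespace.
PATTERN = r'\b((?:NO|WITHOUT)\s+)\w+'

def replace_next_word(text, new_word="___"):
    """Replace the word after NO/WITHOUT with new_word, keeping the spaces."""
    # \g<1> is the unambiguous form of \1; it stays correct even when the
    # replacement text begins with a digit. new_word is assumed to contain
    # no backslashes, since re.sub would treat them as escapes.
    return re.sub(PATTERN, r'\g<1>' + new_word, text)

print(replace_next_word("THERE IS NO REASON WHY THIS SHOULDN'T WORK!"))
# THERE IS NO ___ WHY THIS SHOULDN'T WORK!
```

With pandas the same replacement string should work as df['STRING'].str.replace(PATTERN, r'\g<1>___', regex=True).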

Details:

  • \b – a word boundary
  • ((?:NO|WITHOUT)\s+) – Group 1 (\1 refers to this group value from the replacement pattern): NO or WITHOUT and then one or more whitespaces
  • \w+ – one or more word chars (replace with \S+ if you plan to remove one or more non-whitespace chars, or even \S+\b to cut off trailing punctuation).
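
The difference between the three endings (\w+, \S+ and \S+\b) shows up when the word after the keyword contains an apostrophe or is followed by punctuation. The sample string below is made up for illustration.

```python
import re

sample = "I AM WITHOUT JOHN'S, SADLY"
prefix = r'\b((?:NO|WITHOUT)\s+)'

# \w+ stops at the apostrophe, so "'S," is left behind.
word_chars = re.sub(prefix + r'\w+', r'\1', sample)

# \S+ consumes everything up to the next whitespace, comma included.
non_space = re.sub(prefix + r'\S+', r'\1', sample)

# \S+\b backtracks to the last word boundary, so the trailing comma survives.
trimmed = re.sub(prefix + r'\S+\b', r'\1', sample)

print(word_chars)  # I AM WITHOUT 'S, SADLY
print(non_space)   # I AM WITHOUT  SADLY
print(trimmed)     # I AM WITHOUT , SADLY
```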
Answered By: Wiktor Stribiżew