Python Pandas – Remove a word from a cell

Question:

I have a column containing names. I want to remove a name, together with its ; separator, when the name is marked with (Retired) or (retired). The problem is that the marker does not always appear in the same position. Sometimes the cell has multiple names and only one of them is retired. In other cases the cell has the first name, then the retired marker, then the last name.

Dataframe = df

Sample column values – Current state

Owner Name
George (Georgy) (Retired) Clooney
Meghan (retired) Markle
Harry Porter (Retired)
Hermione Granger; Harry Porter (Retired)
Ginny Weasley; Ron Weasley; Harry Porter (retired); Luna Lovegood

Sample column values – Future state

Owner Name
Null
Null
Null
Hermione Granger
Ginny Weasley; Ron Weasley; Luna Lovegood

I thought of using replace with "" but it does not work. I would appreciate any pointers.

Asked By: Ziggy


Answers:

split, filter, join again with groupby.agg:

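df['Owner Name'] = (df['Owner Name']
 # one row per name: split on ';' plus any following whitespace
 .str.split(r';\s*', expand=True).stack()
 # drop every name carrying the literal "(Retired)" marker, in any case
 .loc[lambda s: ~s.str.contains(r'\(Retired\)', case=False)]
 # glue the surviving names of each original row back together
 .groupby(level=0).agg('; '.join)
)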

Output:

                                  Owner Name
0                                        NaN
1                                        NaN
2                                        NaN
3                           Hermione Granger
4  Ginny Weasley; Ron Weasley; Luna Lovegood
Answered By: mozway
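
For anyone who wants to run the split/filter/join idea end to end, here is a minimal sketch that builds the sample column from the question. It is not the answer's exact code: it uses explode() in place of expand=True plus stack(), a swap chosen here so the sketch does not depend on how stack() treats the padding values created by expand=True. The intermediate names parts and kept are purely illustrative.

```python
import pandas as pd

# Sample column copied from the question
df = pd.DataFrame({
    "Owner Name": [
        "George (Georgy) (Retired) Clooney",
        "Meghan (retired) Markle",
        "Harry Porter (Retired)",
        "Hermione Granger; Harry Porter (Retired)",
        "Ginny Weasley; Ron Weasley; Harry Porter (retired); Luna Lovegood",
    ]
})

# One row per name; explode() repeats the original row label for each piece
parts = df["Owner Name"].str.split(r";\s*").explode()

# Keep only names without the literal "(retired)" marker (case-insensitive)
kept = parts[~parts.str.contains(r"\(retired\)", case=False)]

# Re-join per original row; rows with no surviving name are absent from the
# result, so index alignment on assignment turns them into NaN
df["Owner Name"] = kept.groupby(level=0).agg("; ".join)

print(df)
```

Rows 0 to 2 end up as NaN only because of index alignment during the assignment, so this relies on df having a unique index.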

With a single regex replacement:

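import numpy as np

# remove the whole ';'-delimited segment that contains "(retired)",
# then trim a leftover ';' and turn emptied cells into NaN
df['Owner Name'] = (df['Owner Name']
    .str.replace(r'[^;]*\(retired\)[^;]*;?', "", regex=True, case=False)
    .str.strip(';').replace("", np.nan)
)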

                                  Owner Name
0                                        NaN
1                                        NaN
2                                        NaN
3                           Hermione Granger
4  Ginny Weasley; Ron Weasley; Luna Lovegood
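
The replacement can also be tried on a standalone Series. The sketch below makes two assumptions that are not part of the answer: it appends a hypothetical sixth row where the retired name comes first, and it strips "; " (semicolon and space) instead of only ";" so that the leading space left behind in that case is removed as well.

```python
import numpy as np
import pandas as pd

owners = pd.Series(
    [
        "George (Georgy) (Retired) Clooney",
        "Meghan (retired) Markle",
        "Harry Porter (Retired)",
        "Hermione Granger; Harry Porter (Retired)",
        "Ginny Weasley; Ron Weasley; Harry Porter (retired); Luna Lovegood",
        # hypothetical extra row, not in the question: retired name first
        "Harry Porter (Retired); Hermione Granger",
    ],
    name="Owner Name",
)

# [^;]*        everything in the segment before the marker
# \(retired\)  the literal marker (parentheses escaped)
# [^;]*;?      the rest of the segment plus its trailing ';', if any
pattern = r"[^;]*\(retired\)[^;]*;?"

cleaned = (
    owners.str.replace(pattern, "", regex=True, case=False)
    .str.strip("; ")        # trim leftover separators and spaces at both ends
    .replace("", np.nan)    # cells emptied by the replacement become NaN
)

print(cleaned)
```

Because the pattern never crosses a ";", each match is confined to a single name segment, which is why neighbouring names are left alone.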

Execution time comparison (just in case):

In [364]: %timeit df['Owner Name'].str.replace(r'[^;]*\(retired\)[^;]*;?', "", regex=True, case=False
     ...: ).str.strip(';').replace("", np.nan)
322 µs ± 1.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [365]: %timeit df['Owner Name'].str.split(r';\s*', expand=True).stack().loc[lambda s: ~s.str.contains(
     ...: r'\(Retired\)', case=False)].groupby(level=0).agg('; '.join)
1.19 ms ± 8.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Answered By: RomanPerekhrest
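
The timings above were taken on the five-row sample. To repeat the comparison outside IPython on a larger input, something like the sketch below could be used. The 10,000-row size, the function names with_regex and with_split, and the use of explode() in the split variant are all assumptions made for illustration. No timings are quoted here, since they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

sample = [
    "George (Georgy) (Retired) Clooney",
    "Meghan (retired) Markle",
    "Harry Porter (Retired)",
    "Hermione Granger; Harry Porter (Retired)",
    "Ginny Weasley; Ron Weasley; Harry Porter (retired); Luna Lovegood",
]
# The 5 sample rows repeated 2,000 times (arbitrary size for illustration)
owners = pd.Series(sample * 2000, name="Owner Name")


def with_regex(s):
    """Single regex replacement, then cleanup of leftovers."""
    return (
        s.str.replace(r"[^;]*\(retired\)[^;]*;?", "", regex=True, case=False)
        .str.strip(";")
        .replace("", np.nan)
    )


def with_split(s):
    """Split into names, drop retired ones, join again per original row."""
    parts = s.str.split(r";\s*").explode()
    kept = parts[~parts.str.contains(r"\(retired\)", case=False)]
    # reindex restores the rows that lost all their names, as NaN
    return kept.groupby(level=0).agg("; ".join).reindex(s.index)


# Sanity check before timing: both variants should agree on this input
same = (
    with_regex(owners).fillna("").tolist()
    == with_split(owners).fillna("").tolist()
)
print("same result:", same)

for func in (with_regex, with_split):
    seconds = timeit.timeit(lambda: func(owners), number=3) / 3
    print(f"{func.__name__}: {seconds * 1e3:.1f} ms per run")
```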