Remove combination of string in dataset in Python

Question

I have a dataset in Python where I want to remove certain combinations of words from columnX and store the result in a new columnY.

Example of 2 rows of columnX:

what is good: the weather what needs improvement: the house
what is good: everything what needs improvement: nothing

I want to delete the following combinations of words: "what is good" & "what needs improvement".

In the end the following text should remain in the columnY:

the weather the house
everything nothing

I have the following script:

stoplist={'what is good', 'what needs improvement'}
dataset['columnY']=dataset['columnX'].apply(lambda x: ''.join([item in x.split() if item not in stoplist]))

But it doesn’t work. What am I doing wrong here?

Asked By: marita

Answer 1

Maybe you can operate on the column itself.

df["Y"] = df["X"]

df.Y = df.Y.str.replace("what is good", "")

You would have to do this for every item in your stop list, and I am not sure how many items you have.

So for example

replacement_map = {"what needs improvement": "", "what is good": ""}

for old, new in replacement_map.items():
    df.Y = df.Y.str.replace(old, new)

if you need to specify a different replacement for each item, or

items_to_replace = ["what needs improvement", "what is good"]

for item_to_replace in items_to_replace:
    df.Y = df.Y.str.replace(item_to_replace, "")

if the item should always be deleted.

Or you can skip the loop if you express it as a regex:

items_to_replace = ["what needs improvement", "what is good"]

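replace_regex = "|".join(items_to_replace)

# regex=True is needed in pandas 2.0 and later, where str.replace
# treats the pattern as a literal string by default
df.Y = df.Y.str.replace(replace_regex, "", regex=True)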

(Credits: @MatBailie & @romanperekhrest)
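
A minimal sketch of the regex approach, assuming pandas is installed and using the two sample rows from the question. Three details are additions rather than part of the answer above: `re.escape` protects any phrase that happens to contain regex metacharacters, `regex=True` is passed explicitly because recent pandas versions treat the pattern as a literal string by default, and the optional `:?\s*` suffix removes the colon and space that follow each phrase.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "X": [
        "what is good: the weather what needs improvement: the house",
        "what is good: everything what needs improvement: nothing",
    ]
})

items_to_replace = ["what needs improvement", "what is good"]

# Escape each phrase so it is matched literally, then join into one alternation.
phrases = "|".join(re.escape(item) for item in items_to_replace)

# Also consume an optional colon and any whitespace right after the phrase.
replace_regex = rf"(?:{phrases}):?\s*"

df["Y"] = df["X"].str.replace(replace_regex, "", regex=True).str.strip()
print(df["Y"].tolist())
```

If the colons should stay in the text, dropping the `:?\s*` suffix should leave only the phrases themselves removed.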

Answered By: Ken Jiiii

Answer 2

In your case the replacement won’t happen because the condition if item not in stoplist (in item in x.split() if item not in stoplist) checks whether a single word matches any phrase in the stoplist, and a single word can never equal a multi-word phrase.
Instead, combine your stop phrases into a regex pattern (for replacement) as shown below:

df['columnY'] = df.columnX.replace(rf"({'|'.join(f'({i})' for i in stoplist)}): ", "", regex=True)

                                             columnX                columnY
0  what is good: the weather what needs improveme...  the weather the house
1  what is good: everything what needs improvemen...     everything nothing
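
To see why the word-by-word filter from the question removes nothing, here is a small sketch in plain Python using the first sample row. The list comprehension is a syntactically corrected version of the one in the question, which is an assumption about what was intended.

```python
stoplist = {"what is good", "what needs improvement"}
text = "what is good: the weather what needs improvement: the house"

# split() yields single words, so no token can equal a multi-word phrase.
tokens = text.split()
kept = [item for item in tokens if item not in stoplist]

# Nothing was filtered out: the kept list is identical to the token list.
print(kept == tokens)
```

Because each token is a single word such as "what" or "good:", the membership test against the phrases is always false, which is why matching on the whole string is needed instead.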

Answered By: RomanPerekhrest

Answer 3

Another way, without a regex and still using apply, is a simple function:

def func(s):
    for item in stoplist:
        s = s.replace(item, '')
    return s
# read from columnX (the source column) and write the result to the new columnY
df['columnY'] = df['columnX'].apply(func)
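
Note that removing only the phrases leaves the colons behind (the first row would become ": the weather : the house"). A possible variant, sketched in plain Python on the sample rows, removes each phrase together with its colon and then normalizes whitespace. The name `remove_phrases` is hypothetical, and the assumption that every phrase is followed directly by a colon comes from the two example rows only.

```python
stoplist = ["what is good", "what needs improvement"]


def remove_phrases(s):
    # Remove each phrase together with the colon that follows it.
    for item in stoplist:
        s = s.replace(item + ":", "")
    # Collapse leftover runs of whitespace into single spaces.
    return " ".join(s.split())


rows = [
    "what is good: the weather what needs improvement: the house",
    "what is good: everything what needs improvement: nothing",
]
cleaned = [remove_phrases(row) for row in rows]
print(cleaned)
```

With a DataFrame, the same function could be passed to apply on the source column, for example `df['columnX'].apply(remove_phrases)`.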

Answered By: user19077881
