Remove combinations of words from strings in a dataset in Python
Question:
I have a dataset in Python and want to remove certain combinations of words from columnX, storing the result in a new columnY.
Example of 2 rows of columnX:
what is good: the weather what needs improvement: the house
what is good: everything what needs improvement: nothing
I want to delete the following combinations of words: "what is good" & "what needs improvement".
In the end, the following text should remain in columnY:
the weather the house
everything nothing
I have the following script:
stoplist={'what is good', 'what needs improvement'}
dataset['columnY'] = dataset['columnX'].apply(lambda x: ''.join([item for item in x.split() if item not in stoplist]))
But it doesn’t work. What am I doing wrong here?
Answers:
Maybe you can operate on the column itself.
df["Y"] = df["X"]
df.Y = df.Y.str.replace("what is good", "")
You would have to do this for every item in your stop list, though I am not sure how many items you have.
For example:
replacement_map = {"what needs improvement": "", "what is good": ""}
for old, new in replacement_map.items():
    df.Y = df.Y.str.replace(old, new)
if you need to specify different replacements, or
items_to_replace = ["what needs improvement", "what is good"]
for item_to_replace in items_to_replace:
    df.Y = df.Y.str.replace(item_to_replace, "")
if the item should always be deleted.
Or you can skip the loop if you express it as a regex:
items_to_replace = ["what needs improvement", "what is good"]
replace_regex = "|".join(items_to_replace)
# regex=True is needed: since pandas 2.0, str.replace treats the pattern as a literal string by default
df.Y = df.Y.str.replace(replace_regex, "", regex=True)
(Credits: @MatBailie & @romanperekhrest)
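A joined pattern like the one above only works as long as no phrase contains regex metacharacters, and it leaves the colon after each phrase in place. The following is a minimal sketch of a more defensive variant, using only the standard re module. The helper name remove_phrases is hypothetical, and the assumption that each phrase may be followed by a colon and whitespace comes from the sample rows in the question.

```python
import re

items_to_replace = ["what needs improvement", "what is good"]

# re.escape guards against regex metacharacters inside a phrase;
# sorting longest-first keeps a short phrase from shadowing a longer one.
alternatives = "|".join(
    re.escape(item) for item in sorted(items_to_replace, key=len, reverse=True)
)

# Also consume an optional colon and any whitespace right after the phrase
# (an assumption based on the sample data).
pattern = re.compile(rf"(?:{alternatives}):?\s*")


def remove_phrases(text):
    """Hypothetical helper: delete every stop phrase from one string."""
    return pattern.sub("", text).strip()
```

On a DataFrame this could be applied per row, for example with df["Y"] = df["X"].apply(remove_phrases).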
In your case nothing is replaced, because the condition if item not in stoplist
(in item for item in x.split() if item not in stoplist)
compares a single word against each whole phrase in the stoplist, so it never matches.
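The mismatch between single words and whole phrases can be checked on one sample string without pandas. This is only an illustrative sketch; the variable names tokens, kept, nothing_removed and joined are made up for the demonstration.

```python
stoplist = {"what is good", "what needs improvement"}
text = "what is good: the weather what needs improvement: the house"

# split() breaks the text into single words, so no token can ever
# equal a multi-word phrase from the stoplist.
tokens = text.split()
kept = [tok for tok in tokens if tok not in stoplist]

# Every token survives the filter, so nothing is removed.
nothing_removed = kept == tokens

# Joining with '' (as in the question) additionally drops the spaces.
joined = "".join(kept)
```

So the original lambda returns the full text with all spaces stripped, rather than the text without the stop phrases.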
Instead combine your stop phrases into a regex pattern (for replacement) as shown below:
df['columnY'] = df.columnX.replace(rf"({'|'.join(f'({i})' for i in stoplist)}): ", "", regex=True)
                                             columnX                columnY
0  what is good: the weather what needs improveme...  the weather the house
1  what is good: everything what needs improvemen...     everything nothing
Another way, without a regex and still using apply, is a simple function:
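def func(s):
    # Remove every stop phrase from the string, one at a time
    for item in stoplist:
        s = s.replace(item, '')
    return s

# Read from columnX (the source text) and write the cleaned text to columnY
df['columnY'] = df['columnX'].apply(func)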
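Replacing only the phrase leaves the colon and extra spaces behind (for example ": the weather : the house"). Below is a rough end-to-end sketch that also cleans those up. It assumes every phrase is followed by a colon, as in the sample rows, and the function name strip_phrases is hypothetical.

```python
import pandas as pd

stoplist = {"what is good", "what needs improvement"}

# Sample data taken from the question.
df = pd.DataFrame({
    "columnX": [
        "what is good: the weather what needs improvement: the house",
        "what is good: everything what needs improvement: nothing",
    ]
})


def strip_phrases(s):
    # Remove each phrase together with the colon that follows it
    # (assumption: the colon is always present, as in the sample data).
    for item in stoplist:
        s = s.replace(item + ":", "")
    # Collapse leftover runs of whitespace into single spaces.
    return " ".join(s.split())


df["columnY"] = df["columnX"].apply(strip_phrases)
```

If some rows lack the colon, the regex-based approach with an optional colon is probably the more robust choice.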