Remove words in a list from strings, but keep strings made up only of words from the list

Question:

I have one dataframe containing strings and one list of words that I want to remove from the dataframe. However, I would like to also keep the strings from the df which are entirely made up of words from the list.

Here is an example:

strings_variable
Avalon Toyota loan
Blazer Chevrolet
Suzuki Vitara sales
Vauxhall Astra
Buick Special car
Ford Aerostar
car refund
car loan

import pandas as pd

data = {'strings_variable': ['Avalon Toyota loan', 'Blazer Chevrolet', 'Suzuki Vitara sales', 'Vauxhall Astra', 'Buick Special car', 'Ford Aerostar', 'car refund', 'car loan']}
df = pd.DataFrame(data)
words_to_remove = ('car', 'sales', 'loan', 'refund')

The final output should look like this:

strings_variable
Avalon Toyota
Blazer Chevrolet
Suzuki Vitara
Vauxhall Astra
Buick Special
Ford Aerostar
car refund
car loan

data = {'strings_variable': ['Avalon Toyota', 'Blazer Chevrolet', 'Suzuki Vitara', 'Vauxhall Astra', 'Buick Special', 'Ford Aerostar', 'car refund', 'car loan']}
df = pd.DataFrame(data)

Note that the words I want to remove appear in addition to the car names. However, I would like to keep unchanged any row whose string is made up only of words in words_to_remove.

Here is my code (Python) so far:

def remove_words(df):
   df = [word for words in df if word not in words_to_remove]
   return df

strings_variable = strings_variable.apply(remove_words)

I hope it makes sense – thank you in advance!

Asked By: RoyalPotatoe

||

Answers:

You can use a set difference to check whether every word in a string is in words_to_remove. If it is, leave the string alone; otherwise, remove the listed words.

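def remove_words(text, words=words_to_remove):
    # A non-empty difference means at least one word is not in the removal list
    if set(text.split()).difference(set(words)):
        for word in words:
            text = text.replace(word, '')
        text = text.strip()
    # Strings made up only of listed words fall through unchanged
    return text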

df['strings_variable'].map(remove_words)
0       Avalon Toyota
1    Blazer Chevrolet
2       Suzuki Vitara
3      Vauxhall Astra
4       Buick Special
5       Ford Aerostar
6          car refund
7            car loan
Name: strings_variable, dtype: object
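
One thing to watch with text.replace(word, ''): it matches substrings, so a listed word inside a longer word is removed too (for example 'car' inside 'carpet'), and a word removed from the middle of a string leaves a double space behind. Below is a minimal sketch of a whole-word variant that keeps the same set-difference check. The name remove_whole_words and the sample string 'Oscar carpet car' are made up for illustration and are not part of the question.

```python
words_to_remove = ('car', 'sales', 'loan', 'refund')

def remove_whole_words(text, words=words_to_remove):
    tokens = text.split()
    # No token falls outside the removal list: keep the string as it is.
    if not set(tokens).difference(words):
        return text
    # Otherwise drop only the tokens that match a listed word exactly.
    return ' '.join(token for token in tokens if token not in words)

# 'Oscar carpet car' is a made-up string, not from the question's data.
samples = ['Avalon Toyota loan', 'Oscar carpet car', 'car loan']
print([remove_whole_words(s) for s in samples])
# ['Avalon Toyota', 'Oscar carpet', 'car loan']
```

It can be passed to df['strings_variable'].map() in the same way as remove_words.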
Answered By: Ignatius Reilly

You could build a temporary list of the words that survive the filter and, if it ends up empty, keep the original string instead. Something like this:

df = {'strings_variable': ['Avalon Toyota loan', 'Blazer Chevrolet', 'Suzuki Vitara sales', 'Vauxhall Astra', 'Buick Special car', 'Ford Aerostar', 'car refund', 'car loan']}
words_to_remove = ('car', 'sales', 'loan', 'refund')

new_df = []
for word in df["strings_variable"]:
    # Collect the words that are not in the removal list (case-insensitive)
    temp = []
    for w in word.split():
        if w.lower() not in words_to_remove:
            temp.append(w)
    # If nothing is left, keep the original string
    new_df.append(" ".join(temp) if temp else word)
print(new_df)

Result:

['Avalon Toyota', 'Blazer Chevrolet', 'Suzuki Vitara', 'Vauxhall Astra', 'Buick Special', 'Ford Aerostar', 'car refund', 'car loan']

I also made the comparison case-insensitive, hence the lower() in the condition.
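
If a regular expression is acceptable, the same case-insensitive idea can be sketched with the standard re module instead of the explicit loop. This is an alternative technique, not the code above: the function name remove_words_ci and the mixed-case sample strings are assumptions for illustration. Note that \b also treats punctuation such as a hyphen as a word boundary, so the results can differ from a split-based approach on strings like 'car-loan'.

```python
import re

words_to_remove = ('car', 'sales', 'loan', 'refund')

# \b restricts matches to whole words; re.IGNORECASE plays the role of lower().
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(w) for w in words_to_remove) + r')\b',
    re.IGNORECASE,
)

def remove_words_ci(text):
    # Replace listed words with a space, then collapse the leftover whitespace.
    cleaned = ' '.join(pattern.sub(' ', text).split())
    # An empty result means the string held only listed words: keep it unchanged.
    return cleaned if cleaned else text

# Made-up mixed-case inputs to show the effect of ignoring case.
print([remove_words_ci(s) for s in ['Buick Special CAR', 'Car Loan', 'Ford Aerostar']])
# ['Buick Special', 'Car Loan', 'Ford Aerostar']
```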

Answered By: palvarez

I’m assuming you’re using pandas, because of your use of df and the .apply() method. However, you first need to create the DataFrame itself. Then you can write a function and either apply it to the Series (if you’re only changing that one column) or apply-map it over the whole DataFrame (probably not what you’re looking for).

import pandas as pd

df = pd.DataFrame({
    'strings_variable': [
        'Avalon Toyota loan',
        'Blazer Chevrolet',
        'Suzuki Vitara sales', 
        'Vauxhall Astra', 
        'Buick Special car', 
        'Ford Aerostar', 
        'car refund', 
        'car loan'
    ]
})

words_to_remove = ('car', 'sales', 'loan', 'refund')

def remove_words(text: str) -> str:
    """Remove stop words unless the string is made up entirely of them"""

    new_text = ' '.join([
        word
        for word in text.split()
        if word not in words_to_remove
    ])

    # Nothing left means every word was a stop word: keep the original
    if not new_text:
        new_text = text

    return new_text

df['strings_variable'] = df['strings_variable'].apply(remove_words)
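# or, to run it on every cell of the DataFrame (probably not what you want;
# pandas 2.1+ deprecates applymap in favor of DataFrame.map):
df = df.applymap(remove_words)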
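
The .apply() call runs a Python function once per row. For a larger column, a vectorized version built on the pandas string methods may be worth trying. The following is a rough sketch under that assumption, not a benchmarked replacement: the names pattern, original, and cleaned are made up here, and the regex matches whole words only.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'strings_variable': [
        'Avalon Toyota loan', 'Blazer Chevrolet', 'Suzuki Vitara sales',
        'Vauxhall Astra', 'Buick Special car', 'Ford Aerostar',
        'car refund', 'car loan',
    ]
})
words_to_remove = ('car', 'sales', 'loan', 'refund')

# Whole-word alternation, e.g. \b(?:car|sales|loan|refund)\b
pattern = r'\b(?:' + '|'.join(map(re.escape, words_to_remove)) + r')\b'

original = df['strings_variable']
cleaned = (
    original
    .str.replace(pattern, ' ', regex=True)   # drop listed words
    .str.replace(r'\s+', ' ', regex=True)    # collapse repeated whitespace
    .str.strip()
)
# Rows that ended up empty contained only listed words: restore the original.
df['strings_variable'] = cleaned.where(cleaned != '', original)
print(df['strings_variable'].tolist())
```

Series.where keeps the cleaned value where the condition holds and falls back to the original string elsewhere, which covers the "made up only of listed words" rows such as 'car refund' and 'car loan'.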
Answered By: pashri