Search DataFrame column for words in list
Question:
I am trying to create a new DataFrame column that contains words that match between a list of keywords and strings in a df column…
import pandas as pd

data = {
    'Sandwich Opinions': ['Roast beef is overrated', 'Toasted bread is always best', 'Hot sandwiches are better than cold']
}
df = pd.DataFrame(data)
keywords = ['bread', 'bologna', 'toast', 'sandwich']
df['Matches'] = df.apply(lambda x: ' '.join([i for i in df['Sandwich Opinions'].str.split() if i in keywords]), axis=1)
This seems like it should do the job but it’s getting stuck in endless processing.
Answers:
import numpy as np

# Add one temporary 0/1 flag column per keyword (case-sensitive substring match)
for kw in keywords:
    df[kw] = np.where(df['Sandwich Opinions'].str.contains(kw), 1, 0)

def add_contain_row(row):
    # Collect the keywords whose flag column is set for this row
    contains = []
    for kw in keywords:
        if row[kw] == 1:
            contains.append(kw)
    return contains

df['contains'] = df.apply(add_contain_row, axis=1)
# if you want to drop the temp columns
df.drop(columns=keywords, inplace=True)
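The same substring check can probably be done without the temporary columns. This is only a sketch, assuming a case-sensitive substring match is what you want (as with str.contains above); find_keywords is a helper name made up for this example:

```python
import pandas as pd

df = pd.DataFrame({
    'Sandwich Opinions': [
        'Roast beef is overrated',
        'Toasted bread is always best',
        'Hot sandwiches are better than cold',
    ]
})
keywords = ['bread', 'bologna', 'toast', 'sandwich']

def find_keywords(text):
    # Hypothetical helper: keep every keyword that occurs as a
    # case-sensitive substring of the text
    return [kw for kw in keywords if kw in text]

# Apply to the single column instead of building one flag column per keyword
df['contains'] = df['Sandwich Opinions'].apply(find_keywords)
print(df)
```

Note that 'toast' is not found in 'Toasted' because the comparison is case-sensitive, while 'sandwich' is found inside 'sandwiches'.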
Create a regex pattern from your list of words:
import re
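# \b marks a word boundary, so only whole words match
pattern = fr"\b({'|'.join(re.escape(k) for k in keywords)})\b"
# expand=False returns a Series holding the first match in each row
df['contains'] = df['Sandwich Opinions'].str.extract(pattern, flags=re.IGNORECASE, expand=False)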
Output:
>>> df
                     Sandwich Opinions contains
0              Roast beef is overrated      NaN
1         Toasted bread is always best    bread
2  Hot sandwiches are better than cold      NaN
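Since str.extract keeps only the first match per row, a variant with str.findall may be closer to the space-joined 'Matches' column from the question. This is a sketch, not a definitive answer; the 'Prefix matches' column and the prefix_pattern name are assumptions added here to show what happens when the trailing word boundary is dropped:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Sandwich Opinions': [
        'Roast beef is overrated',
        'Toasted bread is always best',
        'Hot sandwiches are better than cold',
    ]
})
keywords = ['bread', 'bologna', 'toast', 'sandwich']

alternatives = '|'.join(re.escape(k) for k in keywords)

# Whole-word matches only: \b on both sides of the alternation
pattern = r"\b(?:" + alternatives + r")\b"
# findall returns a list of every match per row; str.join flattens each list
df['Matches'] = (
    df['Sandwich Opinions']
    .str.findall(pattern, flags=re.IGNORECASE)
    .str.join(' ')
)

# Assumed variant: no trailing \b, so a keyword also matches the start
# of a longer word such as 'Toasted' or 'sandwiches'
prefix_pattern = r"\b(?:" + alternatives + r")"
df['Prefix matches'] = (
    df['Sandwich Opinions']
    .str.findall(prefix_pattern, flags=re.IGNORECASE)
    .str.join(' ')
)
print(df)
```

Rows without any match end up with an empty string rather than NaN, and the matched text keeps the casing it has in the original sentence.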