Return value if value in a column is found in another dataframe column pandas
Question:
I have two dfs.
df1:
Summary
0 This is a basket of red apples.
1 We found a bushel of fruit. They are red.
2 There is a peck of pears that taste sweet.
3 We have a box of plums.
4 This is bag of green apples.
df2:
Fruits
0 plum
1 pear
2 apple
3 orange
I want the output to be:
df2:
Fruits Summary
0 plum We have a box of plums.
1 pear There is a peck of pears that taste sweet.
2 apple This is a basket of red apples, This is bag of green apples
3 orange
In simple terms, if a fruit is found in a summary, the matching summary should be returned; otherwise nothing or NaN.
EDIT: If multiple instances were found then all instances should be returned separated by a comma.
Answers:
- I think it is faster to find the unique fruits in each sentence than to find each sentence for every fruit.
- Finding each sentence for every fruit requires iterating over every sentence, for every fruit.
- Presumably, there are fewer unique fruits than sentences, so it's faster to find the fruits in each sentence.
- The speed of one way compared to the other is an assumption that has not been tested.
- For every 'Summary', add all found 'Fruits' to a list, because there may be more than one fruit in a sentence.
- Explode the lists to separate rows.
- Merge df1 and df2.
- Groupby 'Fruits' and combine each sentence into a comma-separated string.
import pandas as pd
# sample dataframes
df1 = pd.DataFrame({'Summary': ['This is a basket of red apples. They are sour.', 'We found a bushel of fruit. They are red.', 'There is a peck of pears that taste sweet.', 'We have a box of plums.', 'This is bag of green apples.', 'We have apples and pears']})
df2 = pd.DataFrame({'Fruits': ['plum', 'pear', 'apple', 'orange']})
# display(df1)
Summary
0 This is a basket of red apples. They are sour.
1 We found a bushel of fruit. They are red.
2 There is a peck of pears that taste sweet.
3 We have a box of plums.
4 This is bag of green apples.
5 We have apples and pears
# set all values to lowercase in Fruits
df2.Fruits = df2.Fruits.str.lower()
# create an array of unique Fruits from df2
unique_fruits = df2.Fruits.unique()
# for each sentence check if a fruit is in the sentence and create a list
df1['Fruits'] = df1.Summary.str.lower().apply(lambda x: [v for v in unique_fruits if v in x])
# explode the lists into separate rows; if sentences contain more than one fruit, there will be more than one row
df1 = df1.explode('Fruits', ignore_index=True)
# merge df1 to df2
df2_ = df2.merge(df1, on='Fruits', how='left')
# groupby fruit, into a string
df2_ = df2_.groupby('Fruits').Summary.agg(list).str.join(', ').reset_index()
# display(df2_)
Fruits Summary
0 apple This is a basket of red apples. They are sour., This is bag of green apples., We have apples and pears
1 orange NaN
2 pear There is a peck of pears that taste sweet., We have apples and pears
3 plum We have a box of plums.
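Note that groupby sorts the group keys by default, so the result above comes out in alphabetical order rather than in the order of df2 that the question shows. The following is a minimal sketch of one way to keep the df2 order, assuming the question's original five sentences and passing sort=False to groupby. The names matches and ordered are only for illustration and are not part of the code above.

```python
import pandas as pd

# the question's original sample data
df1 = pd.DataFrame({'Summary': [
    'This is a basket of red apples.',
    'We found a bushel of fruit. They are red.',
    'There is a peck of pears that taste sweet.',
    'We have a box of plums.',
    'This is bag of green apples.',
]})
df2 = pd.DataFrame({'Fruits': ['plum', 'pear', 'apple', 'orange']})

unique_fruits = df2['Fruits'].str.lower().unique()

# list the fruits found in each sentence, then one row per (sentence, fruit)
matches = df1.assign(
    Fruits=df1['Summary'].str.lower().apply(
        lambda s: [f for f in unique_fruits if f in s]
    )
).explode('Fruits', ignore_index=True)

# sort=False keeps the groups in order of first appearance,
# which after a left merge is the row order of df2
ordered = (
    df2.merge(matches, on='Fruits', how='left')
       .groupby('Fruits', sort=False)['Summary']
       .agg(list)
       .str.join(', ')
       .reset_index()
)

print(ordered)
```

Fruits without a match (orange here) still end up as NaN, because joining a list that contains NaN returns NaN.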
Alternative
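- As previously stated, my assumption is that this will be the slower option, even though there is less code, because it requires iterating through every sentence, for every fruit.
- Run it against the original df1, before the explode step; otherwise sentences containing more than one fruit are repeated. Fruits with no match get an empty string instead of NaN.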
df2['Summary'] = df2.Fruits.str.lower().apply(lambda x: ', '.join([v for v in df1.Summary if x in v.lower()]))
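Since the speed difference is only an assumption, here is a rough sketch of how the two approaches could be timed with the standard library's timeit. The function names, the repetition factor of 500 and the number of runs are arbitrary choices for illustration; the timings depend heavily on how many sentences and fruits there are, so no result is claimed here.

```python
import timeit
import pandas as pd

sentences = [
    'This is a basket of red apples. They are sour.',
    'We found a bushel of fruit. They are red.',
    'There is a peck of pears that taste sweet.',
    'We have a box of plums.',
    'This is bag of green apples.',
    'We have apples and pears',
]
df1 = pd.DataFrame({'Summary': sentences})
df2 = pd.DataFrame({'Fruits': ['plum', 'pear', 'apple', 'orange']})


def fruit_per_sentence(df1, df2):
    """Find the fruits in each sentence, then explode, merge and group."""
    fruits = df2['Fruits'].str.lower()
    unique_fruits = fruits.unique()
    found = df1.assign(
        Fruits=df1['Summary'].str.lower().apply(
            lambda s: [f for f in unique_fruits if f in s]
        )
    ).explode('Fruits', ignore_index=True)
    merged = df2.assign(Fruits=fruits).merge(found, on='Fruits', how='left')
    return (
        merged.groupby('Fruits', sort=False)['Summary']
              .agg(list).str.join(', ').reset_index()
    )


def sentences_per_fruit(df1, df2):
    """For every fruit, scan every sentence (the shorter alternative)."""
    out = df2.copy()
    out['Summary'] = out['Fruits'].str.lower().apply(
        lambda f: ', '.join(s for s in df1['Summary'] if f in s.lower())
    )
    return out


# results on the small sample, to check both give the same sentences
result_a = fruit_per_sentence(df1, df2)
result_b = sentences_per_fruit(df1, df2)

# a larger frame for timing: the sample sentences repeated 500 times
big_df1 = pd.DataFrame({'Summary': sentences * 500})

t_a = timeit.timeit(lambda: fruit_per_sentence(big_df1, df2), number=3)
t_b = timeit.timeit(lambda: sentences_per_fruit(big_df1, df2), number=3)

print(f'fruit per sentence:  {t_a:.3f} s')
print(f'sentences per fruit: {t_b:.3f} s')
```

The only difference in the results is that the first function returns NaN for a fruit with no match, while the second returns an empty string.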