Return value if value in a column is found in another dataframe column pandas

Question:

I have two dfs.
df1:

              Summary
0        This is a basket of red apples.
1        We found a bushel of fruit. They are red.
2        There is a peck of pears that taste sweet.
3        We have a box of plums.
4        This is bag of green apples.

df2:

      Fruits        
0    plum     
1    pear     
2    apple     
3    orange

I want the output to be:

df2:

      Fruits     Summary   
0    plum        We have a box of plums.
1    pear        There is a peck of pears that taste sweet.
2    apple       This is a basket of red apples, This is bag of green apples
3    orange

In simple terms, if a fruit is found in a Summary, the matching Summary value should be returned; otherwise nothing or NaN.

EDIT: If multiple instances were found then all instances should be returned separated by a comma.

Asked By: Brussel


Answers:

  • I think it is faster to find the fruits in each sentence than to find the sentences for each fruit.
    • Finding the sentences for each fruit requires iterating over every sentence, for every fruit.
    • Presumably, there are fewer unique fruits than sentences, so it’s faster to look for the fruits in each sentence.
    • That one way is faster than the other is an assumption that has not been tested.
  • For every 'Summary', add all found 'Fruits' to a list, because there may be more than one fruit in a sentence.
  • Explode the lists into separate rows.
  • Merge df1 and df2.
  • Group by 'Fruits' and combine the sentences into a comma-separated string.
import pandas as pd

# sample dataframes
df1 = pd.DataFrame({'Summary': ['This is a basket of red apples. They are sour.', 'We found a bushel of fruit. They are red.', 'There is a peck of pears that taste sweet.', 'We have a box of plums.', 'This is bag of green apples.', 'We have apples and pears']})

df2 = pd.DataFrame({'Fruits': ['plum', 'pear', 'apple', 'orange']})

# display(df1)
#                                           Summary
# 0  This is a basket of red apples. They are sour.
# 1       We found a bushel of fruit. They are red.
# 2      There is a peck of pears that taste sweet.
# 3                         We have a box of plums.
# 4                    This is bag of green apples.
# 5                        We have apples and pears

# set all values to lowercase in Fruits
df2.Fruits = df2.Fruits.str.lower()

# create an array of unique Fruits from df2
unique_fruits = df2.Fruits.unique()

# for each sentence check if a fruit is in the sentence and create a list
df1['Fruits'] = df1.Summary.str.lower().apply(lambda x: [v for v in unique_fruits if v in x])

# explode the lists into separate rows; if sentences contain more than one fruit, there will be more than one row
df1 = df1.explode('Fruits', ignore_index=True)

# merge df1 to df2
df2_ = df2.merge(df1, on='Fruits', how='left')

# groupby fruit, into a string
df2_ = df2_.groupby('Fruits').Summary.agg(list).str.join(', ').reset_index()

# display(df2_)
#    Fruits                                                                                                 Summary
# 0   apple  This is a basket of red apples. They are sour., This is bag of green apples., We have apples and pears
# 1  orange                                                                                                     NaN
# 2    pear                                    There is a peck of pears that taste sweet., We have apples and pears
# 3    plum                                                                                 We have a box of plums.
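
The result above comes back sorted alphabetically by fruit, because groupby sorts its keys by default, while the question lists the fruits in df2's original order. Here is a minimal sketch of the same steps, assuming the sample data from this answer. The helper name `summaries_by_fruit` is hypothetical; it passes `sort=False` to keep df2's order and works on a copy so df1 is left untouched.

```python
import pandas as pd


def summaries_by_fruit(sentences: pd.DataFrame, fruits: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per fruit, matching sentences joined by ', '."""
    fruits = fruits.assign(Fruits=fruits['Fruits'].str.lower())
    unique_fruits = fruits['Fruits'].unique()

    # work on a copy so the caller's dataframe is not modified
    found = sentences.copy()
    found['Fruits'] = found['Summary'].str.lower().apply(
        lambda s: [f for f in unique_fruits if f in s]
    )
    found = found.explode('Fruits', ignore_index=True)

    merged = fruits.merge(found, on='Fruits', how='left')

    # sort=False keeps groups in order of first appearance, i.e. the order of `fruits`
    return (
        merged.groupby('Fruits', sort=False)['Summary']
        .agg(list)
        .str.join(', ')
        .reset_index()
    )


df1 = pd.DataFrame({'Summary': [
    'This is a basket of red apples. They are sour.',
    'We found a bushel of fruit. They are red.',
    'There is a peck of pears that taste sweet.',
    'We have a box of plums.',
    'This is bag of green apples.',
    'We have apples and pears',
]})
df2 = pd.DataFrame({'Fruits': ['plum', 'pear', 'apple', 'orange']})

result = summaries_by_fruit(df1, df2)
print(result)
```

Note that the match is a plain substring test, so 'apple' would also match a sentence containing 'pineapple'.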

Alternative

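  • As previously stated, my assumption is that this will be the slower option, even though there is less code, because it requires iterating through every sentence for every fruit.
  • Run it against the original df1, before the explode step; otherwise a sentence that mentions several fruits is repeated in the result. Fruits with no match get an empty string rather than NaN.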
df2['Summary'] = df2.Fruits.str.lower().apply(lambda x: ', '.join([v for v in df1.Summary if x in v.lower()]))
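
Since the speed claim is untested, one way to check it is with `timeit`. This is a rough sketch, assuming the sample sentences repeated 100 times (600 rows, an arbitrary size); the function names `per_sentence` and `per_fruit` are made up for the comparison. Timings depend on the machine and on the shape of the data, so no result is claimed here.

```python
import timeit

import pandas as pd

# sample sentences repeated to get a slightly larger, still artificial, dataset
sentences = pd.DataFrame({'Summary': [
    'This is a basket of red apples. They are sour.',
    'We found a bushel of fruit. They are red.',
    'There is a peck of pears that taste sweet.',
    'We have a box of plums.',
    'This is bag of green apples.',
    'We have apples and pears',
] * 100})
fruits = pd.DataFrame({'Fruits': ['plum', 'pear', 'apple', 'orange']})


def per_sentence(sentences, fruits):
    # main approach: list the fruits found in each sentence, explode, merge, group
    unique_fruits = fruits['Fruits'].str.lower().unique()
    found = sentences.copy()
    found['Fruits'] = found['Summary'].str.lower().apply(
        lambda s: [f for f in unique_fruits if f in s]
    )
    found = found.explode('Fruits', ignore_index=True)
    merged = fruits.merge(found, on='Fruits', how='left')
    return merged.groupby('Fruits')['Summary'].agg(list).str.join(', ').reset_index()


def per_fruit(sentences, fruits):
    # alternative: scan every sentence once per fruit
    out = fruits.copy()
    out['Summary'] = out['Fruits'].str.lower().apply(
        lambda f: ', '.join(s for s in sentences['Summary'] if f in s.lower())
    )
    return out


t_sentence = timeit.timeit(lambda: per_sentence(sentences, fruits), number=20)
t_fruit = timeit.timeit(lambda: per_fruit(sentences, fruits), number=20)
print(f'per sentence: {t_sentence:.4f}s, per fruit: {t_fruit:.4f}s')
```

Both functions take the unmodified sentences as input, so the comparison is not affected by the explode step.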
Answered By: Trenton McKinney