Return value if value in a column is found in another dataframe column pandas

Question

I have two dfs.
df1:

              Summary
0        This is a basket of red apples.
1        We found a bushel of fruit. They are red.
2        There is a peck of pears that taste sweet.
3        We have a box of plums.
4        This is bag of green apples.

df2:

      Fruits        
0    plum     
1    pear     
2    apple     
3    orange

I want the output to be:

df2:

      Fruits     Summary   
0    plum        We have a box of plums.
1    pear        There is a peck of pears that taste sweet.
2    apple       This is a basket of red apples, This is bag of green apples
3    orange

In simple terms, if a fruit is found in a Summary, the matching Summary value should be returned; otherwise nothing or NaN.

EDIT: If multiple matches are found, all of them should be returned, separated by commas.

Asked By: Brussel

Answer 1

I think it is faster to find the unique fruits in each sentence than to find each sentence for every fruit.
- Finding each sentence for every fruit requires iterating over every sentence, for every fruit.
- Presumably there are fewer unique fruits than sentences, so it’s faster to find the fruits in each sentence.
- The relative speed of the two approaches is an assumption that has not been tested.

- For every 'Summary', add all found 'Fruits' to a list, because a sentence may contain more than one fruit.
- Explode the lists into separate rows.
- Merge df1 and df2.
- Group by 'Fruits' and combine the sentences into a comma-separated string.

import pandas as pd

# sample dataframes
df1 = pd.DataFrame({'Summary': ['This is a basket of red apples. They are sour.', 'We found a bushel of fruit. They are red.', 'There is a peck of pears that taste sweet.', 'We have a box of plums.', 'This is bag of green apples.', 'We have apples and pears']})

df2 = pd.DataFrame({'Fruits': ['plum', 'pear', 'apple', 'orange']})

# display(df1)
                                          Summary
0  This is a basket of red apples. They are sour.
1       We found a bushel of fruit. They are red.
2      There is a peck of pears that taste sweet.
3                         We have a box of plums.
4                    This is bag of green apples.
5                        We have apples and pears

# set all values to lowercase in Fruits
df2.Fruits = df2.Fruits.str.lower()

# create an array of unique Fruits from df2
unique_fruits = df2.Fruits.unique()

# for each sentence check if a fruit is in the sentence and create a list
df1['Fruits'] = df1.Summary.str.lower().apply(lambda x: [v for v in unique_fruits if v in x])

# explode the lists into separate rows; if sentences contain more than one fruit, there will be more than one row
df1 = df1.explode('Fruits', ignore_index=True)

# merge df1 to df2
df2_ = df2.merge(df1, on='Fruits', how='left')

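# group by fruit and join the sentences into one string
# a fruit with no match has the list [NaN], for which .str.join returns NaN
df2_ = df2_.groupby('Fruits').Summary.agg(list).str.join(', ').reset_index()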

# display(df2_)
   Fruits                                                                                                 Summary
0   apple  This is a basket of red apples. They are sour., This is bag of green apples., We have apples and pears
1  orange                                                                                                     NaN
2    pear                                    There is a peck of pears that taste sweet., We have apples and pears
3    plum                                                                                 We have a box of plums.
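One caveat about the `v in x` check above: it is a plain substring test, so 'apple' would also match inside 'pineapple'. If that matters for your data, a minimal sketch using a regular expression with word boundaries could look like this. The helper name `find_fruits` is hypothetical, and the optional trailing "s" is an assumption that only covers simple plurals.

```python
import re

fruits = ['plum', 'pear', 'apple', 'orange']

# \b requires a word boundary before the fruit name;
# "s?" allows a simple plural (an assumption, it will not handle e.g. "peaches")
pattern = re.compile(
    r'\b(' + '|'.join(map(re.escape, fruits)) + r')s?\b',
    flags=re.IGNORECASE,
)

def find_fruits(sentence):
    """Return the unique fruit names in a sentence, lowercased, in order of first appearance."""
    return list(dict.fromkeys(m.lower() for m in pattern.findall(sentence)))

print(find_fruits('We have apples and pears'))
print(find_fruits('Fresh pineapple juice'))
```

It could replace the list comprehension in the apply step, for example `df1.Summary.apply(find_fruits)`.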

Alternative

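As stated above, I assume this is the slower option, even though it is less code, because it iterates through every sentence for every fruit.

# run this against the original df1, before the explode step above;
# otherwise sentences containing more than one fruit appear twice
# fruits with no match get an empty string, not NaN
df2['Summary'] = df2.Fruits.str.lower().apply(lambda x: ', '.join([v for v in df1.Summary if x in v.lower()]))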
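The speed assumption has not been measured in this answer. A minimal sketch for timing both approaches yourself is shown below. The function names `fruits_per_sentence` and `sentences_per_fruit` are hypothetical wrappers around the two snippets, and repeating the sample rows 1000 times is an arbitrary assumption, so results on real data may differ.

```python
import timeit
import pandas as pd

df1 = pd.DataFrame({'Summary': [
    'This is a basket of red apples. They are sour.',
    'We found a bushel of fruit. They are red.',
    'There is a peck of pears that taste sweet.',
    'We have a box of plums.',
    'This is bag of green apples.',
    'We have apples and pears',
]})
df2 = pd.DataFrame({'Fruits': ['plum', 'pear', 'apple', 'orange']})

def fruits_per_sentence(df1, df2):
    """Main approach: list fruits per sentence, explode, merge, group."""
    fruits = df2.Fruits.str.lower().unique()
    found = df1.Summary.str.lower().apply(lambda s: [f for f in fruits if f in s])
    tmp = df1.assign(Fruits=found).explode('Fruits', ignore_index=True)
    merged = df2.merge(tmp, on='Fruits', how='left')
    return merged.groupby('Fruits').Summary.agg(list).str.join(', ').reset_index()

def sentences_per_fruit(df1, df2):
    """Alternative: scan every sentence once for each fruit."""
    out = df2.copy()
    out['Summary'] = out.Fruits.str.lower().apply(
        lambda f: ', '.join(s for s in df1.Summary if f in s.lower())
    )
    return out

# synthetic larger input: the sample sentences repeated 1000 times
big_df1 = pd.concat([df1] * 1000, ignore_index=True)

for func in (fruits_per_sentence, sentences_per_fruit):
    seconds = timeit.timeit(lambda: func(big_df1, df2), number=5)
    print(f'{func.__name__}: {seconds:.3f} s for 5 runs')
```

Both functions leave the input dataframes unmodified, so they can be timed in either order.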

Answered By: Trenton McKinney
