How would I merge two dataframes using a "contains" condition?

Question:

I have the following two tables df1:

id description
1 Foo
2 Bar

and df2:

description category
Foo cat1
Barrista cat2

I now want to merge the dfs based on the "description" column if the string in df2 contains the string of df1.

Right now I could only make it work with exact matches but not for contains cases:

df3 = df1.merge(df2, on='description', how='left', suffixes=('_1', '_2'))

which returns

id description category
1 Foo cat1
2 Bar

but the desired output df3 should look like:

id description category
1 Foo cat1
2 Bar cat2
Asked By: JSRB

||

Answers:

You can craft a regex to extract the patterns in df1, then merge:

import re
s = df1['description'].sort_values(key=lambda s: s.str.len(), ascending=False)
regex = '|'.join(map(re.escape, s))
# 'Foo|Bar'

df3 = df1.merge(df2, left_on='description',
                right_on=df2['description'].str.extract(fr'({regex})',
                                                        expand=False),
                how='left', suffixes=('_1', '_2'))

NB. I am sorting the words of df1 by decreasing length to ensure that a longer match would be used in case of multiple matches.

output:

  description  id description_1 description_2 category
0         Foo   1           Foo           Foo     cat1
1         Bar   2           Bar      Barrista     cat2
Answered By: mozway

use pandasql

df1.sql("select self.*,df2.category from self left join df2 on instr(df2.description,self.description)>0",df2=df2)

out

    id description category
0   1         Foo     cat1
1   2         Bar     cat2
Answered By: G.G
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.