How would I merge two dataframes using a "contains" condition?

Question

I have the following two tables df1:

id	description
1	Foo
2	Bar

and df2:

description	category
Foo	cat1
Barrista	cat2

I now want to merge the dfs based on the "description" column if the string in df2 contains the string of df1.

Right now I could only make it work with exact matches but not for contains cases:

df3 = df1.merge(df2, on='description', how='left', suffixes=('_1', '_2'))

which returns

id	description	category
1	Foo	cat1
2	Bar

but the desired output df3 should look like:

id	description	category
1	Foo	cat1
2	Bar	cat2

Asked By: JSRB

||

Source

Answer 1

You can craft a regex to extract the patterns in df1, then merge:

import re
s = df1['description'].sort_values(key=lambda s: s.str.len(), ascending=False)
regex = '|'.join(map(re.escape, s))
# 'Foo|Bar'

df3 = df1.merge(df2, left_on='description',
                right_on=df2['description'].str.extract(fr'({regex})',
                                                        expand=False),
                how='left', suffixes=('_1', '_2'))

NB. I am sorting the words of df1 by decreasing length to ensure that a longer match would be used in case of multiple matches.

output:

  description  id description_1 description_2 category
0         Foo   1           Foo           Foo     cat1
1         Bar   2           Bar      Barrista     cat2

Answered By: mozway

Answer 2

use pandasql

df1.sql("select self.*,df2.category from self left join df2 on instr(df2.description,self.description)>0",df2=df2)

out

    id description category
0   1         Foo     cat1
1   2         Bar     cat2

Answered By: G.G

How would I merge two dataframes using a "contains" condition?

Question:

Answers: