Merge 2 dfs, with the row if it is the only row that contains the word

Question:

I have 2 pandas data frames:

df1 = pd.DataFrame({'keyword': ['Sox','Sox','Jays','D', 'Jays'],
                   'val':[1,2,3,4,5]})

df2 = pd.DataFrame({'name': ['a b c', 'Sox Red', 'Blue Jays White Sox'],
                   'city':[f'city-{i}' for i in [1,2,3]],
                   'info': [5, 6, 7]})
>>> df1
    keyword val
0   Sox     1
1   Sox     2
2   Jays    3
3   D       4
4   Jays    5

>>> df2
    name                 city       info
0   a b c                city-1     5
1   Sox Red              city-2     6
2   Blue Jays White Sox  city-3     7

For each row of df1 the merge should be taking the exact element of df1['keyword'] and see if it is present in each of the df2['name'] elements (e.g. using .str.contains). Now there can be the following options:

  • if it is present in exactly 1 row of df2['name']: match the current row of df1 with this 1 row of df2.
  • otherwise (if it is present in more than 1 or in 0 rows of df2['name']): don’t match the current row of df1 with anything – the values will be NaN.

The result should look like this:

    keyword  val  name                 city    info
0   Sox      1    NaN                  NaN     NaN
1   Sox      2    NaN                  NaN     NaN
2   Jays     3    Blue Jays White Sox  city-3  7.0
3   D        4    NaN                  NaN     NaN
4   Jays     5    Blue Jays White Sox  city-3  7.0

Here in the column "keyword":

  • "Sox" matches multiples lines of df2 (lines 1 and 2), so its merged with NaNs,
  • "D" matches 0 lines of df2, so it’s also merged with NaNs,
  • "Jays" matches exactly 1 line in df2 (line 2), so it’s merged with this line.

How to do this using pandas?

Asked By: Vladimir Fokow

||

Answers:

One way to do this is to use a combination of .apply() and .str.contains() to find the rows in df2 that match the rows in df1. Then, we can use .merge() to merge the resulting data frames:

def merge_dfs(row):
    keyword = row['keyword']
    df2_match = df2[df2['name'].str.contains(keyword)]
    return df2_match.iloc[0] if len(df2_match) == 1 else pd.Series(dtype='float64')
result = df1.apply(merge_dfs, axis=1).reset_index(drop=True)
result = df1.merge(result, left_index=True, right_index=True, how='left')

This should give the desired result:

>>> result
    keyword val city    info    name
0   Sox     1   NaN     NaN     NaN
1   Sox     2   NaN     NaN     NaN
2   Jays    3   city-3  7.0     Blue Jays White Sox
3   D       4   NaN     NaN     NaN
4   Jays    5   city-3  7.0     Blue Jays White Sox
Answered By: Vladimir Fokow

Use a regex and str.extractall to extract the keywords, remove the duplicates with drop_duplicates, and finally merge:

import re

pattern = '|'.join(map(re.escape, df1['keyword']))
# 'Sox|Sox|Jays|D|Jays

key = (df2['name'].str.extractall(fr'b({pattern})b')[0]
      .droplevel('match')
      .drop_duplicates(keep=False)
      )

out = df1.merge(df2.assign(keyword=key),
                on='keyword', how='left')

print(out) 

NB. I’m assuming you want to match full words only, if not remove the word boundaries (b).

Output:


  keyword  val                 name    city  info
0     Sox    1                  NaN     NaN   NaN
1     Sox    2                  NaN     NaN   NaN
2    Jays    3  Blue Jays White Sox  city-3   7.0
3       D    4                  NaN     NaN   NaN
4    Jays    5  Blue Jays White Sox  city-3   7.0
Answered By: mozway