How to compare column in different dataframes for fragment of characters in python

Question

I have two dataframes of different size but common columns, with ID numbers as key column.

I want to compare the column "names" in both dataframes.

But sometimes the "names" does not match because there are upper cases or there are additional uncommon words.

Is there a way to "match" the column with something like a "tolerance", 70% matching words?

Here is an example:

import pandas as pd


data1 = {'key_column':  ['1457', '2356', '2268', '1497','2298'],
        'names': ['Birds are flying', 'Flowers', 'pink bird', 'pink bird', 'Sun Beach'],
       'random_column_in_df1':['1', '2', '2', '2', '2'],

        }

data2 = {'key_column':  ['2268', '2356', '2298', '1497'],
        'names': ['bird', 'flowers here', 'Sun','some text'],
       'random_column_in_df2':['1', '3', '2', '3'],

        }

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df1
df2

enter image description here

expected output:

Asked By: yoopiyo

||

Source

Answer 1

You can use TheFuzz; Is a package that implements Levenshtein distance in python.

from thefuzz import fuzz

df3 = df1.merge(df2, how='inner', on='key_column', suffixes=['_df1', '_df2'])

threshold = 50 # Value range: 0-100
row_selection = df3.apply(lambda row: fuzz.ratio(row['names_df1'], row['names_df2']) >= threshold, axis=1)

df3 = df3.iloc[row_selection.values]

df3.head(10)

Output:

Answered By: ggeop

How to compare column in different dataframes for fragment of characters in python

Question:

Answers: