I need a way to compare two strings in python without using sets in a pandas dataframe


I am currently working on a huge csv file with pandas, and I need to find and print similarity between the selected row and every other row. For example if the string is "Card" and the second string is "Credit Card Debit Card" it should return 2 or if the first string is "Credit Card" and the second string is "Credit Card Debit Card" it should return 3 because 3 of the words match with the first string. I tried solving this using sets but because of sets being unique and not containing duplicates in the first example instead of 2 it returns 1. Because in a set "Credit Card Debit Card" is {"Credit", "Card", "Debit"}. Is there anyway that I can calculate this? The formula of similarity is ((numberOfSameWords)/whichStringisLonger)*100 as explained in this photo:

I tried so many things like Jaccard Similarity but they all work with sets and return wrong answers. Thanks for any help.
The code I tried running:

def test(row1, row2):
    return str(round(len(np.intersect1d(row1.split(), row2.split())) / max(len(row1.split()), len(row2.split()))*100, 2))

data = int(input("Which index should be tested:"))
for j in range(0,10):
    print(test(dff['Product'].iloc[data], dff['Product'].iloc[j]))

and my dataframe currently looks like this:

print(df.sample(10).to_dict("list")) returned me:

{'Product': ['Bank account or service', 'Credit card', 'Credit reporting', 'Credit reporting credit repair services or other personal consumer reports', 'Credit reporting', 'Mortgage', 'Debt collection', 'Mortgage', 'Mortgage', 'Credit reporting'], 'Issue': ['Deposits and withdrawals', 'Billing disputes', 'Incorrect information on credit report', "Problem with a credit reporting company's investigation into an existing problem", 'Incorrect information on credit report', 'Applying for a mortgage or refinancing an existing mortgage', 'Disclosure verification of debt', 'Loan servicing payments escrow account', 'Loan servicing payments escrow account', 'Incorrect information on credit report'], 'Company': ['CITIBANK NA', 'FIRST NATIONAL BANK OF OMAHA', 'EQUIFAX INC', 'Experian Information Solutions Inc', 'Experian Information Solutions Inc', 'BANK OF AMERICA NATIONAL ASSOCIATION', 'AllianceOne Recievables Management', 'SELECT PORTFOLIO SERVICING INC', 'OCWEN LOAN SERVICING LLC', 'Experian Information Solutions Inc'], 'State': ['CA', 'WA', 'FL', 'UT', 'MI', 'CA', 'WA', 'IL', 'TX', 'CA'], 'ZIP_code': ['92606', '98272', '329XX', '84321', '486XX', '94537', '984XX', '60473', '76247', '91401'], 'Complaint_ID': [90452, 2334443, 1347696, 2914771, 1788024, 2871939, 1236424, 1619712, 2421373, 1803691]}
Asked By: Berk Sunduri



You can use numpy.intersect1d to get the common words but the % is different for the third row.

import numpy as np

df["Similarity_%"] = (
                        df.apply(lambda x: "%" + str(round(len(np.intersect1d(x['Col1'].split(), x['Col2'].split()))
                                                          / max(len(x["Col1"].split()), len(x["Col2"].split()))
                                                          *100, 2)), axis=1)

# Output :

                                                                Col1                                  Col2 Similarity_%
0                                                    Debt collection                       Debt collection       %100.0
1                                                    Debt collection                              Mortgage         %0.0
2                                                Managing loan lease               Problems end loan lease        %50.0
3                                                Managing loan lease                   Struggling pay loan       %33.33
4  Credit reporting credit repair services personal consumer reports  Payday loan title loan personal loan        %12.5

# Input used:

import pandas as pd

df= pd.DataFrame({'Col1': ['Debt collection', 'Debt collection', 'Managing loan lease', 'Managing loan lease', 
                           'Credit reporting credit repair services personal consumer reports'],
                  'Col2': ['Debt collection', 'Mortgage', 'Problems end loan lease', 'Struggling pay loan',
                           'Payday loan title loan personal loan']})

# Update:

Based on the second given dataframe in your question, you can use a cross join (with pandas.DataFrame.merge) to compare each row of the column Product to the rest of the rows of the same colum.

Try this :

out = df[["Product"]].merge(df[["Product"]], how="cross", suffixes=("", "_cross"))

out["Similarity_%"] = (
                        out.apply(lambda x: "%" + str(round(len(np.intersect1d(x['Product'].split(), x['Product_cross'].split()))
                                                          / max(len(x["Product"].split()), len(x["Product_cross"].split()))
                                                          *100, 2)), axis=1)

With a dataframe/colum of 10 rows, the result will have 100 rows plus a similarity column.

Answered By: abokey

You can try this:

import pandas as pd

l1 = ["Debt collection", "Debt collection", "Managing loan lease", "Managing loan lease",
      "Credit reporting credit repair services personal consumer reports", "Credit reporting credit repair services personal consumer report"]
l2 = ["Debt collection", "Mortgage", "Problems end loan lease", "Struggling pay loan",
      "Payday loan title loan personal loan", "Credit card prepaid card"]

df = pd.DataFrame(l1, columns=["col1"])
df["col2"] = l2

def similarity(row1, row2):
    # calculate longest row
    longestSentence = 0
    commonWords = 0
    wordsRow1 = [x.upper() for x in row1.split()]
    wordsRow2 = [x.upper() for x in row2.split()]
    # calculate similar words in both sentences
    common = list(set(wordsRow1).intersection(wordsRow2))
    if len(wordsRow1) > len(wordsRow2):
        longestSentence = len(wordsRow1)
        commonWords = calculate(common, wordsRow1)
        longestSentence = len(wordsRow2)
        commonWords = calculate(common, wordsRow2)
    return (commonWords / longestSentence) * 100

def calculate(common, longestRow):
    sum = 0
    for word in common:
        sum += longestRow.count(word)
    return sum

df['similarity'] = df.apply(lambda x: similarity(x.col1, x.col2), axis=1)

Answered By: Tomer S

You may try this:

def countCommonWords( string1, string2):
  words1 = string1.lower().split()
  words2 = string2.lower().split()
  n = 0
  for word1 in words1:
    if word1 in words2: n+=1
  return n

Please note that:
countCommonWords( ‘a b c a’, ‘c b a’) will return 3,
countCommonWords( ‘c b a’, ‘a b c a’) will return 4 and may be your solution.
We do not know if your search string may contain doubled words

Answered By: user3435121
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.