Compare each string with all other strings in a dataframe

Question:

I have this dataframe:

mylist = [
    "₹67.00 to Rupam Sweets using Bank Account XXXXXXXX5343<br>11 Feb 2023, 20:42:25",
    "₹66.00 to Rupam Sweets using Bank Account XXXXXXXX5343<br>10 Feb 2023, 21:09:23",
    "₹32.00 to Nagori Sajjad Mohammed Sayyed using Bank Account XXXXXXXX5343<br>9 Feb 2023, 07:06:52",
    "₹110.00 to Vikram Manohar Jsohi using Bank Account XXXXXXXX5343<br>9 Feb 2023, 06:40:08",
    "₹120.00 to Winner Dinesh Gupta using Bank Account XXXXXXXX5343<br>30 Jan 2023, 06:23:55",
]
import pandas as pd

df = pd.DataFrame(mylist)
df.columns = ["full_text"]
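# Split at the first " to " only, so a "to" inside a client name cannot create extra columns.
ndf = df.full_text.str.split(" to ", n=1, expand=True)
ndf.columns = ["amt", "full_text"]
# Include the leading space in the separator so client names carry no trailing whitespace.
ndf2 = ndf.full_text.str.split(" using Bank Account XXXXXXXX5343<br>", expand=True)
ndf2.columns = ["client", "date"]
df = ndf.join(ndf2)[["date", "client", "amt"]]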

I have created embeddings for each client name:

from openai.embeddings_utils import get_embedding, cosine_similarity
import openai

openai.api_key = 'xxx'
embedding_model = "text-embedding-ada-002"
# Pass the function itself (not a list) so apply returns a Series of embedding vectors.
embeddings = df.client.apply(lambda x: get_embedding(x, engine=embedding_model))
df["embeddings"] = embeddings

I can now calculate the similarity score for a given string, for example "Rupam Sweet", using:

query_embedding = get_embedding("Rupam Sweet", engine="text-embedding-ada-002")
df["similarity"] = df.embeddings.apply(lambda x: cosine_similarity(x, query_embedding))

But I need the similarity score of each client against every other client. In other words, the client names should appear as both the row labels and the column labels, with the scores as the cell values. How do I achieve this?

Asked By: shantanuo

||

Answers:

I managed to get the expected results using:

for k, i in enumerate(df.client):
    query_embedding = get_embedding(i, engine="text-embedding-ada-002")
    df[i + str(k)] = df.embeddings.apply(
        lambda x: cosine_similarity(x, query_embedding)
    )

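I am not sure whether this is efficient for large datasets, since it makes one embedding API call per row even though the embeddings are already stored in df.embeddings.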
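
One possible way to avoid the per-row loop is to stack the stored embeddings into a single matrix, normalise each row, and take one matrix product. The sketch below is only an illustration under stated assumptions: random vectors stand in for the real API embeddings, and the helper name pairwise_cosine is hypothetical, not part of any library.

```python
import numpy as np
import pandas as pd

# Assumption: random vectors stand in for the embeddings returned by the API.
rng = np.random.default_rng(0)
clients = [
    "Rupam Sweets",
    "Rupam Sweets",
    "Nagori Sajjad Mohammed Sayyed",
    "Vikram Manohar Jsohi",
    "Winner Dinesh Gupta",
]
# Identical names share one vector, mimicking identical input text.
fake_embedding = {name: rng.normal(size=8) for name in set(clients)}
df = pd.DataFrame({"client": clients})
df["embeddings"] = [fake_embedding[name] for name in clients]


def pairwise_cosine(vectors):
    """Return the matrix of cosine similarities between all rows (hypothetical helper)."""
    m = np.asarray(vectors, dtype=float)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)  # scale every row to unit length
    return m @ m.T  # dot products of unit vectors are cosine similarities


emb = np.array(df["embeddings"].tolist())  # shape: (n_clients, embedding_dim)
sim = pd.DataFrame(pairwise_cosine(emb), index=df["client"], columns=df["client"])
print(sim.round(2))
```

With real data, only the construction of df["embeddings"] would change; the matrix product reuses the stored vectors and needs no further API calls.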

Answered By: shantanuo

If you have a vectorized similarity function f(x, y) and want to apply it to all pairs of a series, you can make use of numpy broadcasting. If f is not a vectorized function, you can turn it into one by calling f_vec = np.vectorize(f) on it. In the example below, I’m using the ratio function from the fuzzywuzzy module for illustration purposes, but it works the same way with any other comparison function.

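import numpy as np
import pandas as pd
from fuzzywuzzy.fuzz import ratio

ratio_vec = np.vectorize(ratio)
# Convert to a NumPy array first: recent pandas versions reject s[:, None] on a Series.
s = pd.Series(mylist).to_numpy()
# Broadcasting shape (5,) against shape (5, 1) evaluates every pair of strings.
df = pd.DataFrame(ratio_vec(s, s[:, None]))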

The result is a similarity matrix:

     0    1    2    3    4
0  100   92   74   76   71
1   92  100   74   73   72
2   70   74  100   74   67
3   73   73   73  100   72
4   71   72   64   74  100
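
To get the client names as both row and column labels, as the question asks, the same broadcasting pattern can be combined with index and columns arguments. This is a minimal sketch, assuming a stand-in scorer built on difflib.SequenceMatcher from the standard library in place of fuzzywuzzy's ratio; the names array is typed in by hand for illustration.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd


def ratio(a, b):
    """Stand-in scorer (assumption): string similarity as an integer percentage."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


names = np.array(
    [
        "Rupam Sweets",
        "Nagori Sajjad Mohammed Sayyed",
        "Vikram Manohar Jsohi",
        "Winner Dinesh Gupta",
    ]
)

ratio_vec = np.vectorize(ratio)
# names[:, None] has shape (4, 1) and names has shape (4,), so broadcasting
# yields a (4, 4) array where scores[i, j] == ratio(names[i], names[j]).
scores = ratio_vec(names[:, None], names)

matrix = pd.DataFrame(scores, index=names, columns=names)
print(matrix)
```

Note that np.vectorize is a convenience wrapper around a Python loop, so it keeps the code short but does not make the comparison itself faster.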

Answered By: fsimonjetz