Cosine similarity of two columns in a DataFrame

Question:

I have a dataframe with 2 columns, and I am trying to get a cosine similarity score for each pair of sentences.

Dataframe (df)

       A                   B
0    Lorem ipsum ta      lorem ipsum
1    Excepteur sint      occaecat excepteur
2    Duis aute irure     aute irure 

Some of the code snippets that I’ve tried are:

1. df["cosine_sim"] = df[["A","B"]].apply(lambda x1,x2:cosine_sim(x1,x2))

2. from spicy.spatial.distance import cosine
df["cosine_sim"] = df.apply(lambda row: 1 - cosine(row['A'], row['B']), axis = 1)

The above code didn’t work. I am still trying different approaches, but in the meantime I would appreciate any guidance. Thank you in advance!

Desired output:

       A                   B                     cosine_sim
0    Lorem ipsum ta      lorem ipsum                 0.8
1    Excepteur sint      occaecat excepteur          0.5
2    Duis aute irure     aute irure                  0.4
Asked By: stack


Answers:

You first need to convert your sentences into vectors; this process is referred to as text vectorization. There are many ways to perform text vectorization, depending on the level of sophistication you require, what your corpus looks like, and the intended application. The simplest is the "Bag of Words" (BoW) approach, which I’ve implemented below. Once you have an understanding of what it means to represent a sentence as a vector, you can progress to other, more complex methods of representing lexical similarity. For example:

  • tf-idf, which weights words based on how frequently they occur across many documents (or sentences, in your case). You can think of this as a weighted BoW approach.
  • BM25, which fixes a shortcoming of tf-idf whereby a single mention of a word in a short document produces a high "relevance" score. It does this by taking the length of the document into account.
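As an illustrative sketch of the tf-idf option, here is how it could look on the question's toy sentences using scikit-learn's TfidfVectorizer (the exact scores depend on the fitted vocabulary, so treat the numbers as indicative only):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({
    "A": ["Lorem ipsum ta", "Excepteur sint", "Duis aute irure"],
    "B": ["lorem ipsum", "occaecat excepteur", "aute irure"],
})

# Fit on all sentences so columns A and B share a single vocabulary
vectorizer = TfidfVectorizer()
vectorizer.fit(pd.concat([df["A"], df["B"]], ignore_index=True))

A_vecs = vectorizer.transform(df["A"])
B_vecs = vectorizer.transform(df["B"])

# Full pairwise matrix, then keep only the row-i-vs-row-i diagonal
df["cosine_sim"] = cosine_similarity(A_vecs, B_vecs).diagonal()
print(df)
```

Fitting once on the combined corpus is important: fitting separately on A and B would give the two columns incompatible vocabularies (and therefore incomparable vectors).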

Advancing to measures of semantic similarity, you can employ methods such as Doc2Vec [1], which start to use "embedding spaces" to represent the semantics of text. Finally, recent methods like SentenceBERT [2] and Dense Passage Retrieval [3] employ techniques based on the Transformer (encoder-decoder) architecture [4] to allow "context-aware" representations to be formed.

Solution

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from numpy.linalg import norm

df = pd.DataFrame({
    "A": [
    "I'm not a party animal, but I do like animal parties.",
    "That must be the tenth time I've been arrested for selling deep-fried cigars.",
    "He played the game as if his life depended on it and the truth was that it did."
    ],
    "B": [
    "The mysterious diary records the voice.",
    "She had the gift of being able to paint songs.",
    "The external scars tell only part of the story."
    ]
    })

# Combine all to make single corpus of text (i.e. list of sentences)
corpus = pd.concat([df["A"], df["B"]], axis=0, ignore_index=True).to_list()
# print(corpus)  # Display list of sentences

# Vectorization using basic Bag of Words (BoW) approach
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# print(vectorizer.get_feature_names_out())  # Display features
vect_sents = X.toarray()

cosine_sim_scores = []
# Iterate over each vectorised sentence in the A-B pairs from the original dataframe
for A_vect, B_vect in zip(vect_sents, vect_sents[len(vect_sents) // 2:]):
    # Calculate cosine similarity and store result
    cosine_sim_scores.append(np.dot(A_vect, B_vect)/(norm(A_vect)*norm(B_vect)))
# Append results to original dataframe
df.insert(2, 'cosine_sim', cosine_sim_scores)
print(df)

Output

                                A                                         B  cosine_sim
0  I'm not a party animal, but...          The mysterious diary records ...    0.000000
1  That must be the tenth time...   She had the gift of being able to pa...    0.084515
2  He played the game as if hi...  The external scars tell only part of ...    0.257130
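As an aside, the manual dot-product/norm loop can also be replaced by scikit-learn's cosine_similarity, which operates directly on the sparse matrix returned by fit_transform. A sketch reproducing the same scores:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({
    "A": [
        "I'm not a party animal, but I do like animal parties.",
        "That must be the tenth time I've been arrested for selling deep-fried cigars.",
        "He played the game as if his life depended on it and the truth was that it did."
    ],
    "B": [
        "The mysterious diary records the voice.",
        "She had the gift of being able to paint songs.",
        "The external scars tell only part of the story."
    ]
})

# One shared vocabulary across both columns
X = CountVectorizer().fit_transform(pd.concat([df["A"], df["B"]], ignore_index=True))

n = len(df)
# The first n rows of X are column A, the last n rows are column B;
# the diagonal of the pairwise matrix pairs row i of A with row i of B
df["cosine_sim"] = cosine_similarity(X[:n], X[n:]).diagonal()
print(df)
```

This avoids densifying the matrix with toarray(), which matters once the vocabulary grows beyond toy size.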

References

[1] Le, Q. and Mikolov, T., 2014, June. Distributed representations of sentences and documents. In International conference on machine learning (pp. 1188-1196). PMLR.

[2] Reimers, N. and Gurevych, I., 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.

[3] Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D. and Yih, W.T., 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906.

[4] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. Advances in neural information processing systems, 30.

Answered By: Kyle F Hartzenberg