Match most similar string pairs from two lists of strings?

Question:

I have two lists of strings (of equal sizes):

l1 = [
"Which of the following products have you used",
"Provide a rating to the product",
"Will you purchase the product again"
]

l2 = [
"Please give a rating to the product",
"Will you buy the product again",
"Please select a product you have used"
]

I have to write a program that matches each sentence in l1 with the most similar sentence in l2, for example "Provide a rating to the product" should be paired with "Please give a rating to the product".

There are basically two sub-problems here:

  1. How exactly to quantify/score the similarity between two sentences. I have thought of using a simple character-matching algorithm (fuzzywuzzy) to keep the program's running time low, but I am open to suggestions for a better algorithm.

  2. How to apply the chosen algorithm to find the actual matches between the strings.
    Once an algorithm has been selected in step 1, how do we apply it to the two lists? One possible way I thought of is to generate all possible string-pair combinations and pick the pairs with the highest scores (see the brute-force sketch after this list), but this approach fails once the lists have 10 or more items, since a single run takes more than 10 minutes on my machine.
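
This is roughly what I had in mind for the brute-force version (assuming the fuzzywuzzy package is installed; fuzz.ratio is one of the character-level scorers it provides):

from fuzzywuzzy import fuzz

def match_brute_force(l1, l2):
    # for each sentence in l1, keep the l2 candidate with the highest
    # character-level similarity score (0-100)
    return {s1: max(l2, key=lambda s2: fuzz.ratio(s1, s2)) for s1 in l1}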

Asked By: Ritik Kamra


Answers:

Instead of using fuzzy matching, why not use sentence similarity as a criterion for matching sentences semantically (contextual matching)?

You can use a GloVe model (similar to word2vec) that is already trained on Wikipedia, where each word is represented as a 50-dimensional vector. You can choose a model other than the one I used from here – https://github.com/RaRe-Technologies/gensim-data

Once you have embedded each sentence as a vector (50-dimensional in this case), you can use cosine similarity to determine which sentence embeddings are closest to each other, for example with scipy.spatial.distance.cosine.
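
For example, the cosine similarity of two vectors can be computed like this (a small illustrative snippet with made-up vectors):

import numpy as np
from scipy.spatial.distance import cosine

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.5, 1.8, 3.2])

# cosine() returns the distance; similarity = 1 - distance
print(1 - cosine(a, b))  # close to 1.0 when the vectors point in similar directions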

Unlike a fuzzy match, which is essentially edit (Levenshtein) distance applied at the character level, word2vec (and other models such as fastText and GloVe) represents each word as a vector in an n-dimensional Euclidean space. The vector that represents a word is called a word vector or word embedding.
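
To see the difference, here is a small illustrative comparison (assuming fuzzywuzzy and the GloVe model used below are available; the exact numbers depend on the model):

from fuzzywuzzy import fuzz
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")

# character-level score (0-100): low, since the strings share few letters
print(fuzz.ratio("buy", "purchase"))

# cosine similarity of the word vectors (-1 to 1): relatively high,
# since the two words are close in meaning
print(model.similarity("buy", "purchase"))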

A more advanced option would be a pre-trained sentence embedding model such as a BERT-based model, but those are meant for more advanced situations and require a deeper understanding of how the models and their parameters work, so I wouldn't recommend them at this point, at least until you are comfortable with how this approach works.

Here is an example –

import numpy as np
from scipy import spatial
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")
# choose from multiple models: https://github.com/RaRe-Technologies/gensim-data

def get_vector(s):
    # embed a sentence by summing the word vectors of its lower-cased tokens
    s1 = [i.lower() for i in s.split()]
    v1 = np.sum(np.array([model[i] for i in s1]), axis=0)
    return v1

l1 =["Which of the following products have you used",
     "Provide a rating to the product", 
     "Will you purchase the product again"]

l2 =["Please give a rating to the product",
     "Will you buy the product again",
     "Please select a product you have used"]

embeddings1 = np.array([get_vector(i) for i in l1])
embeddings2 = np.array([get_vector(i) for i in l2])

# for each l1 sentence, find the index of the most cosine-similar l2 sentence
similarity = (1 - spatial.distance.cdist(embeddings1, embeddings2, 'cosine')).argmax(axis=1)
sorted_l2 = [l2[i] for i in similarity]

print(l1)
print(sorted_l2)
# original l1 list
['Which of the following products have you used', 
'Provide a rating to the product', 
'Will you purchase the product again']

# l2 reordered so that each item matches the corresponding l1 item
['Please select a product you have used', 
'Please give a rating to the product', 
'Will you buy the product again']

As you can see, this is the same result you were expecting.

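Note that the argmax step picks the best l2 sentence independently for each l1 sentence, so in principle two l1 sentences could be matched to the same l2 sentence. If you need a strict one-to-one pairing between the two equal-sized lists, one option (not part of the approach above, just a sketch) is scipy.optimize.linear_sum_assignment, reusing the embeddings and the spatial import from the example:

from scipy.optimize import linear_sum_assignment

# cost matrix: cosine distance between every l1/l2 embedding pair
cost = spatial.distance.cdist(embeddings1, embeddings2, 'cosine')

# assignment of l1 rows to l2 columns minimising the total distance
row_ind, col_ind = linear_sum_assignment(cost)
one_to_one = {l1[r]: l2[c] for r, c in zip(row_ind, col_ind)}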

Answered By: Akshay Sehgal