How to compare sentence similarities using embeddings from BERT

Question:

I am using the HuggingFace Transformers package to access pretrained models. As my use case needs functionality for both English and Arabic, I am using the bert-base-multilingual-cased pretrained model. I need to be able to compare the similarity of sentences using something such as cosine similarity. To use this, I first need to get an embedding vector for each sentence, and can then compute the cosine similarity.

Firstly, what is the best way to extract the semantic embedding from the BERT model? Would taking the last hidden state of the model after being fed the sentence suffice?

import torch
from transformers import BertModel, BertTokenizer

model_class = BertModel
tokenizer_class = BertTokenizer
pretrained_weights = 'bert-base-multilingual-cased'

tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

sentence = 'this is a test sentence'

input_ids = torch.tensor([tokenizer.encode(sentence, add_special_tokens=True)])
with torch.no_grad():
    output_tuple = model(input_ids)
    last_hidden_states = output_tuple[0]

print(last_hidden_states.size(), last_hidden_states)

Secondly, if this is a sufficient way to get embeddings from my sentence, I now have another problem where the embedding vectors have different lengths depending on the length of the original sentence. The output shape is [1, n, hidden_size], where n is the number of tokens and can have any value.

In order to compute two vectors’ cosine similarity, they need to be the same length. How can I do this here? Could something as naive as first summing across axis=1 still work? What other options do I have?

Asked By: KOB

Answers:

You can use the [CLS] token as a representation for the entire sequence. This token is typically prepended to your sentence during the preprocessing step and is typically used for classification tasks (see figure 2 and paragraph 3.2 in the BERT paper).

Its vector is the very first position along the token axis of the output, i.e. last_hidden_states[:, 0, :] in your code.

Alternatively, you can take the average vector of the sequence (as you suggest, over axis 1, the token axis), which can yield better results according to the huggingface documentation (3rd tip).

Note that BERT was not designed for sentence similarity using the cosine distance, though in my experience it does yield decent results.
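
Both options can be sketched on stand-in tensors as follows. This is a minimal sketch rather than the method from the BERT paper or the HuggingFace docs: the random tensors stand in for last_hidden_states from the question, and cls_embedding, mean_embedding and the hand-built attention masks are names introduced here for illustration. With a real batch you would pass the attention_mask returned by the tokenizer, so that padding tokens do not enter the average.

```python
import torch
import torch.nn.functional as F


def cls_embedding(last_hidden_states):
    # Take the vector at position 0 ([CLS]) for every sentence in the batch.
    return last_hidden_states[:, 0, :]


def mean_embedding(last_hidden_states, attention_mask):
    # Average over the token axis (dim=1), counting only positions where mask == 1.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_states.dtype)
    summed = (last_hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts


torch.manual_seed(0)
# Assumption: random stand-ins for BERT outputs of a 6-token and a 9-token sentence.
hidden_a = torch.randn(1, 6, 768)
hidden_b = torch.randn(1, 9, 768)
mask_a = torch.ones(1, 6, dtype=torch.long)
mask_b = torch.ones(1, 9, dtype=torch.long)

vec_a = mean_embedding(hidden_a, mask_a)
vec_b = mean_embedding(hidden_b, mask_b)

# Both vectors now have shape [1, 768] regardless of sentence length.
print(vec_a.shape, vec_b.shape)
print(F.cosine_similarity(vec_a, vec_b, dim=1).item())
print(F.cosine_similarity(cls_embedding(hidden_a), cls_embedding(hidden_b), dim=1).item())
```

Either way the result is one fixed-size vector per sentence, which is what cosine similarity needs.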

Answered By: Swier

In addition to an already great accepted answer, I want to point you to sentence-BERT, which discusses the similarity aspect and implications of specific metrics (like cosine similarity) in greater detail.
They also have a very convenient implementation online. The main advantage here is that they seemingly gain a lot of processing speed compared to a “naive” sentence embedding comparison, but I am not familiar enough with the implementation itself.

Importantly, there is also generally a more fine-grained distinction in what kind of similarity you want to look at. Specifically for that, there is also a great discussion in one of the task papers from SemEval 2014 (SICK dataset), which goes into more detail about this. From your task description, I am assuming that you are already using data from one of the later SemEval tasks, which also extended this to multilingual similarity.

Answered By: dennlinger

You should NOT use BERT’s output as sentence embeddings for semantic similarity. BERT is not pretrained for semantic similarity, so the results will be poor, even worse than simple GloVe embeddings. See below a comment from Jacob Devlin (first author of the BERT paper) and a passage from the Sentence-BERT paper, which discusses sentence embeddings in detail.

Jacob Devlin’s comment: I’m not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations. And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn’t mean that they will be meaningful in terms of cosine distance. (Since cosine distance is a linear space where all dimensions are weighted equally). (https://github.com/google-research/bert/issues/164#issuecomment-441324222)
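
One way to read the remark about all dimensions being weighted equally: cosine similarity has no notion of which dimensions carry meaning, so a single coordinate with a large magnitude can dominate the score. A toy illustration with made-up vectors (not BERT outputs; the cosine helper is written here only for this example):

```python
import math


def cosine(u, v):
    # Plain cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# The two vectors disagree on every small dimension but share one large one.
u = [10.0, 1.0, -1.0, 1.0]
v = [10.0, -1.0, 1.0, -1.0]

print(round(cosine(u, v), 4))          # 0.9417
print(round(cosine(u[1:], v[1:]), 4))  # -1.0 once the large dimension is dropped
```

A model trained specifically for similarity learns a space where this kind of raw geometry lines up with meaning; plain BERT was never trained for that.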

From Sentence-BERT paper: The results show that directly using the output of BERT leads to rather poor performances. Averaging the BERT embeddings achieves an average correlation of only 54.81, and using the CLS token output only achieves an average correlation of 29.19. Both are worse than computing average GloVe embeddings. (https://arxiv.org/pdf/1908.10084.pdf)

Instead, you should use a model pre-trained specifically for sentence similarity, such as Sentence-BERT. Sentence-BERT and several other pretrained models for sentence similarity are available in the sentence-transformers library (https://www.sbert.net/docs/pretrained_models.html), which is fully compatible with the amazing HuggingFace transformers library. With these libraries, you can obtain sentence embeddings in just a line of code.

Answered By: Cristian Arteaga

As a complement to dennlinger's answer, I’ll add a code example from https://www.sbert.net/docs/usage/semantic_textual_similarity.html to compare sentence similarities using embeddings from BERT:

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-MiniLM-L12-v2')

# Two lists of sentences
sentences1 = ['The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']

sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

#Compute embedding for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

#Compute cosine similarities
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)

#Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))

The library contains the state-of-the-art sentence embedding models.
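
Note that the loop above only prints the diagonal, cosine_scores[i][i], but the result is a full matrix with one row per sentence in sentences1 and one column per sentence in sentences2. The sketch below reproduces that kind of matrix with plain torch so it runs without downloading a model; the random tensors, the 384-dimensional size and the pairwise_cosine helper are assumptions made for illustration, not part of the sentence-transformers API.

```python
import torch
import torch.nn.functional as F


def pairwise_cosine(a, b):
    # Normalise each row to unit length; the matrix product then holds
    # the cosine similarity of every row of a with every row of b.
    a_norm = F.normalize(a, p=2, dim=1)
    b_norm = F.normalize(b, p=2, dim=1)
    return a_norm @ b_norm.T


torch.manual_seed(0)
# Assumption: random stand-ins for the embeddings of two lists of 3 sentences.
emb1 = torch.randn(3, 384)
emb2 = torch.randn(3, 384)

scores = pairwise_cosine(emb1, emb2)  # shape [3, 3]
best = scores.argmax(dim=1)           # closest sentence in list 2 for each in list 1

for i, j in enumerate(best.tolist()):
    print("sentence1[{}] is closest to sentence2[{}] (score {:.4f})".format(i, j, scores[i, j].item()))
```

With real embeddings, the same argmax over cosine_scores gives the best match across the two lists rather than only the aligned pairs.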

See https://stackoverflow.com/a/68728666/395857 to perform sentence clustering.

Answered By: Franck Dernoncourt

Here is an illustration, with some explanation, of how to use the BERT architecture for sentence embeddings.

It also illustrates Cristian Arteaga's point about choosing the right model for the right task.

I am using the BERT model and tokenizer from Hugging Face directly instead of the sentence_transformers wrapper, as this gives a better idea of how these work for users who are starting off with NLP.

Bert Model – https://huggingface.co/transformers/v3.0.2/model_doc/bert.html

Note – this is only illustrative code; see also https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

'''
 Adapted and extended from 
 https://github.com/huggingface/transformers/issues/1950#issuecomment-558679189

'''
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import torch

def get_sentence_similarity(tokenizer,model,s1,s2):

    s1 = tokenizer.encode(s1)  
    s2 = tokenizer.encode(s2)

    print("1 len(s1) s1",len(s1),s1) # prints length of tokens - input_ids 8 [101, 7592...
    print("1 len(s2) s2",len(s2),s2)
    s1 = torch.tensor(s1)
    #print("2",s1) # prints tensor([ 101, 7592, ...
    s1 = s1.unsqueeze(0) # add a batch dimension: the model expects batched input, so we use a batch of size 1
    #print("3",s1) # prints tensor([[ 101, 7592, 
    s2 = torch.tensor(s2).unsqueeze(0)

    # Pass it to the model for inference
    with torch.no_grad():
        output_1 = model(s1)
        output_2 = model(s2)

    logits_s1 = output_1[0]  # The last hidden state is the first element of the output tuple (hidden states, not logits, despite the variable name)
    logits_s2 = output_2[0].detach()
    # Without torch.no_grad() the printed tensor would end in grad_fn=<NativeLayerNormBackward0>;
    # detach() removes that. Inside torch.no_grad() there is no graph, so the call changes nothing.
    logits_s1 = logits_s1.detach()

    print("logits_s1.shape",logits_s1.shape ) # prints ([1, <number of tokens>, 768]): each token is represented by a 768-dimensional vector in the base BERT model
    print("logits_s2.shape",logits_s2.shape ) # the leading 1 is the dummy batch dimension we added with unsqueeze
    logits_s1 = torch.squeeze(logits_s1) # remove the batch dimension again
    logits_s2 = torch.squeeze(logits_s2)
    print("logits_s1.shape",logits_s1.shape ) # prints ([<number of tokens>, 768]), e.g. torch.Size([8, 768])
    print("logits_s2.shape",logits_s2.shape )
    a = logits_s1.reshape(1,logits_s1.numel()) # flatten to shape (1, <number of tokens> * 768); numel() is the total number of elements
    b = logits_s2.reshape(1,logits_s2.numel())
    print("a.shape",a.shape ) # e.g. torch.Size([1, 6144]) for 8 tokens
    print("b.shape",b.shape ) # (1, 768 * number of tokens in the second sentence), which need not match a

    # We could take the mean over the rows instead of flattening, but the attempt below gave poor output
    # a = sentence_vector_1.mean(axis=1)  # this gave a cosine similarity of 1
    # b = sentence_vector_2.mean(axis=1)
    # cos_sim = F.cosine_similarity(a.reshape(1,-1),b.reshape(1,-1), dim=1)

    # so we pad the tensors to be same shape
    if  a.shape[1] <  b.shape[1]:
        pad_size = (0, b.shape[1] - a.shape[1]) 
        a = torch.nn.functional.pad(a, pad_size, mode='constant', value=0)
    else:
        pad_size = (0, a.shape[1] - b.shape[1]) 
        b = torch.nn.functional.pad(b, pad_size, mode='constant', value=0)

    print("After padding")
    print("a.shape",a.shape ) # 1,N
    print("b.shape",b.shape ) # 1, N


    # Calculate the cosine similarity
    cos_sim = cosine_similarity(a,b)
    #print("got cosine similarity",cos_sim) # output [[0.80432487]]
    return cos_sim



if __name__ == "__main__":


    s1 = "John loves dogs" 
    s2 = "dogs love John"

    # Tokenize the text using BERT tokenizer
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased") #Not good for sentence similarity
    model.eval()
    
    cos_sim = get_sentence_similarity(tokenizer,model,s1,s2)
    print("got cosine similarity",cos_sim) # output [[0.738616]]

    # Let's try the same with a better model, one trained for sentence embeddings
    # From https://www.sbert.net/docs/pretrained_models.html
    # These models have been extensively evaluated for their ability to embed sentences
    # (Performance Sentence Embeddings) and to embed search queries & paragraphs

    # better to use AutoTokenizer for other models; see https://github.com/huggingface/transformers/issues/5587
    tokenizer = BertTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
    model = BertModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
    model.eval()
    cos_sim = get_sentence_similarity(tokenizer,model,s1,s2)
    print("got cosine similarity",cos_sim) # output [[0.5646803]]

Answered By: Alex Punnen
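
The function above flattens the token vectors and zero-pads the shorter one, which makes the score depend on token position and sentence length. Mean pooling over the token axis avoids the padding entirely, but the axis matters. The sketch below uses random stand-in tensors (an assumption, so it runs without a model) to show both cases. Averaging a flattened (1, N) tensor over axis 1 leaves a single number per sentence, and the cosine similarity of two single numbers is always +1 or -1. That may be what the commented-out attempt ran into, though this is a guess, since sentence_vector_1 is not defined in that snippet.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Assumption: random stand-ins for the squeezed hidden states of a
# 5-token and an 8-token sentence, shape [number of tokens, 768].
h1 = torch.randn(5, 768)
h2 = torch.randn(8, 768)

# Wrong axis: flattening first and averaging over axis 1 gives one scalar per sentence.
flat1 = h1.reshape(1, -1).mean(dim=1)
flat2 = h2.reshape(1, -1).mean(dim=1)
scalar_sim = F.cosine_similarity(flat1.reshape(1, -1), flat2.reshape(1, -1), dim=1)
print(flat1.shape, scalar_sim)  # magnitude of the similarity is 1, whatever the sentences

# Token axis: averaging over dim=0 keeps the 768 features and drops the length.
v1 = h1.mean(dim=0, keepdim=True)  # shape [1, 768]
v2 = h2.mean(dim=0, keepdim=True)  # shape [1, 768]
sim = F.cosine_similarity(v1, v2, dim=1)
print(v1.shape, v2.shape, sim)
```

In get_sentence_similarity this would mean replacing the reshape and padding steps with logits_s1.mean(dim=0, keepdim=True) and the same for logits_s2; whether the resulting scores are meaningful still depends on the model, as discussed in the other answers.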