How can I check similarity in meaning, and not just shared words, between two texts with spaCy

Question:

I’m trying to compare two different texts: one coming from a Curriculum Vitae (CV) and the other from a job announcement.

After cleaning the texts, I compare them to determine whether a job announcement is a better match for a particular CV.

I am trying to do this using similarity matching in spaCy via the following code:

similarity = pdf_text.similarity(final_text_from_annonce)
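
(For reference, here is a simplified sketch of how the two variables are built; the placeholder strings and the en_core_web_lg model below stand in for my actual pipeline.)

import spacy

# Simplified sketch - the real cleaning steps are omitted; en_core_web_lg is an assumption
nlp = spacy.load("en_core_web_lg")

pdf_text = nlp("...cleaned text extracted from the CV...")
final_text_from_annonce = nlp("...cleaned text from the job announcement...")

similarity = pdf_text.similarity(final_text_from_annonce)
print(similarity)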

This runs fine, but I’m getting strange results from two different CVs compared against the same job announcement: both yield roughly the same similarity score (~0.6), even though one should clearly be higher than the other.

I checked the spaCy website and found this very important sentence:

Vector averaging means that the vector of multiple tokens is insensitive to the order of the words. Two documents expressing the same meaning with dissimilar wording will return a lower similarity score than two documents that happen to contain the same words while expressing different meanings.
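
To illustrate what that means (a small example, assuming a model with word vectors such as en_core_web_md is installed): two sentences made of exactly the same words in a different order average to the same vector, so they score as identical even though their meanings differ:

import spacy

nlp = spacy.load("en_core_web_md")  # any pipeline that ships with word vectors

# Same words, different meaning: the averaged document vectors are identical
a = nlp("the cat chased the dog")
b = nlp("the dog chased the cat")
print(a.similarity(b))  # ~1.0, because word order is ignored by vector averaging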

So, what do I need to use or code to make spaCy compare my two texts based on their meaning instead of the occurrence of words?

I am expecting a parameter for spaCy’s similarity function, or another function, that will compare both of my texts and calculate a similarity score based on the meaning of the texts rather than on whether the same words are used.

Asked By: Adrien Villemin


Answers:

By default, spaCy determines semantic similarity by averaging the word embeddings of the words in a sentence. This can be thought of as a naive sentence-embedding approach. It can work, but if you use it, it is recommended that you first filter out non-meaningful words (e.g. stop words) to prevent them from undesirably influencing the final sentence embedding.
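
As a minimal sketch of that filtering step (assuming the en_core_web_md model, which ships with word vectors, is installed; filtered_doc is just an illustrative helper name):

import spacy

# Assumes a pipeline with word vectors is installed:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

def filtered_doc(text):
    """Re-run the pipeline on only the content-bearing tokens."""
    doc = nlp(text)
    kept = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return nlp(" ".join(kept))

cv = filtered_doc("Experienced Python developer with a background in data analysis.")
announcement = filtered_doc("We are looking for a data analyst proficient in Python.")

# Similarity of the averaged word vectors of the remaining tokens
print(cv.similarity(announcement))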

The alternative (and more reliable) solution is to use a different spaCy pipeline designed around sentence embeddings produced by a dedicated sentence encoder, e.g. the Universal Sentence Encoder (USE) [1] by Cer et al. Martino Mensio created a package called spacy-universal-sentence-encoder that wraps this model. Install it via the following command in your command prompt:

pip install spacy-universal-sentence-encoder

Then you can compute the semantic similarity between sentences as follows:

import spacy_universal_sentence_encoder

# Load one of the models: ['en_use_md', 'en_use_lg', 'xx_use_md', 'xx_use_lg']
nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')

# Create two documents
doc_1 = nlp('Hi there, how are you?')
doc_2 = nlp('Hello there, how are you doing today?')

# Use the similarity method to compare the full documents (i.e. sentences)
print(doc_1.similarity(doc_2))  # Output: 0.9356049733134972
# Or make the comparison using a predefined span of the second document 
print(doc_1.similarity(doc_2[0:7])) # Output: 0.9739387861159459
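
Applied to the original use case, the idea is the same; the strings below are placeholders for your cleaned CV and job announcement text:

# Reuses the nlp object loaded above (en_use_lg)
cv_doc = nlp("...cleaned CV text...")
announcement_doc = nlp("...cleaned job announcement text...")

# Higher scores should now reflect closeness in meaning rather than word overlap
print(cv_doc.similarity(announcement_doc))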

As a side note, the first time you run nlp = spacy_universal_sentence_encoder.load_model('en_use_lg'), you may have to do so with administrator rights so that TensorFlow can create the models folder in C:\Program Files\Python310\Lib\site-packages\spacy_universal_sentence_encoder and download the appropriate model. Otherwise, you may get a PermissionDeniedError and the code will not run.

References

[1] Cer, D., Yang, Y., Kong, S.Y., Hua, N., Limtiaco, N., John, R.S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C. and Sung, Y.H., 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.

Answered By: Kyle F Hartzenberg