How can I calculate the Jaccard Similarity of two lists containing strings in Python?
Question:
I have two lists with usernames and I want to calculate the Jaccard similarity. Is it possible?
This thread shows how to calculate the Jaccard Similarity between two strings, however I want to apply this to two lists, where each element is one word (e.g., a username).
Answers:
Assuming your usernames don’t repeat, you can use the same idea:
def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

list1 = ['dog', 'cat', 'rat']
list2 = ['dog', 'cat', 'mouse']
# The intersection is ['dog', 'cat']
# The union is ['dog', 'cat', 'rat', 'mouse']
words1 = set(list1)
words2 = set(list2)
jaccard(words1, words2)
>>> 0.5
I ended up writing my own solution after all:
def jaccard_similarity(list1, list2):
    intersection = len(list(set(list1).intersection(list2)))
    union = (len(set(list1)) + len(set(list2))) - intersection
    return float(intersection) / union
@aventinus I don’t have enough reputation to add a comment to your answer, but just to make things clearer: your solution measures the jaccard_similarity, but the function is misnamed as jaccard_distance, which is actually 1 - jaccard_similarity.
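The relationship described in this comment can be written down directly. A minimal sketch, assuming the set-based definition of similarity; the function names and sample lists here are only illustrative:

```python
def jaccard_similarity(list1, list2):
    s1, s2 = set(list1), set(list2)
    return len(s1 & s2) / len(s1 | s2)

def jaccard_distance(list1, list2):
    # The distance is the complement of the similarity.
    return 1 - jaccard_similarity(list1, list2)

users_a = ['dog', 'cat', 'rat']
users_b = ['dog', 'cat', 'mouse']
print(jaccard_similarity(users_a, users_b))  # 0.5
print(jaccard_distance(users_a, users_b))    # 0.5
```

Identical lists give a similarity of 1.0 and a distance of 0.0; lists with nothing in common give the reverse.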
For Python 3:
def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    # In Python 3, / is true division, so the result is already a float.
    return len(s1.intersection(s2)) / len(s1.union(s2))

list1 = ['dog', 'cat', 'cat', 'rat']
list2 = ['dog', 'cat', 'mouse']
jaccard_similarity(list1, list2)
>>> 0.5
For Python 2, use return len(s1.intersection(s2)) / float(len(s1.union(s2))) instead, since / on two integers truncates there.
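One edge case the set-based version does not cover: if both lists are empty, the union is empty and the division raises ZeroDivisionError. A sketch that guards against this is below. Treating two empty lists as identical (returning 1.0) is an assumption, not something the answers above specify, and the name jaccard_similarity_safe is hypothetical:

```python
def jaccard_similarity_safe(list1, list2):
    s1, s2 = set(list1), set(list2)
    union = s1 | s2
    if not union:
        # Both inputs are empty. Returning 1.0 is a convention chosen
        # for this sketch; returning 0.0 or raising are also defensible.
        return 1.0
    return len(s1 & s2) / len(union)

print(jaccard_similarity_safe([], []))  # 1.0
print(jaccard_similarity_safe(['dog', 'cat', 'cat', 'rat'],
                              ['dog', 'cat', 'mouse']))  # 0.5
```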
If you’d like to include repeated elements, you can use Counter, which I would imagine is relatively quick since it’s just an extended dict under the hood:
from collections import Counter

def jaccard_repeats(a, b):
    """Jaccard similarity measure between input iterables,
    allowing repeated elements"""
    _a = Counter(a)
    _b = Counter(b)
    c = (_a - _b) + (_b - _a)
    n = sum(c.values())
    return n/(len(a) + len(b) - n)

list1 = ['dog', 'cat', 'rat', 'cat']
list2 = ['dog', 'cat', 'rat']
list3 = ['dog', 'cat', 'mouse']
jaccard_repeats(list1, list3)
>>> 0.75
jaccard_repeats(list1, list2)
>>> 0.16666666666666666
jaccard_repeats(list2, list3)
>>> 0.5
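Note that jaccard_repeats returns its lowest value (0.1667) for the two most similar lists, list1 and list2, so its output behaves more like a dissimilarity score than a similarity. A common way to define Jaccard similarity for multisets is the sum of the per-element minimum counts divided by the sum of the per-element maximum counts. A sketch under that definition, using Counter's & and | operators; the name jaccard_multiset is hypothetical:

```python
from collections import Counter

def jaccard_multiset(a, b):
    ca, cb = Counter(a), Counter(b)
    intersection = sum((ca & cb).values())  # minimum count per element
    union = sum((ca | cb).values())         # maximum count per element
    return intersection / union

list1 = ['dog', 'cat', 'rat', 'cat']
list2 = ['dog', 'cat', 'rat']
list3 = ['dog', 'cat', 'mouse']
print(jaccard_multiset(list1, list2))  # 0.75
print(jaccard_multiset(list1, list3))  # 0.4
print(jaccard_multiset(list2, list3))  # 0.5
```

With this definition, the closest pair (list1 and list2) gets the highest score. Like the set-based versions, it raises ZeroDivisionError when both inputs are empty.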
You can use the Distance library:
#pip install Distance
import distance
distance.jaccard("decide", "resize")
# Returns
0.7142857142857143
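The value returned here is a distance rather than a similarity, which can be checked with plain sets: "decide" and "resize" share 2 of their 7 distinct characters, so the similarity is 2/7 and the distance is 5/7. A small sketch of that arithmetic using only built-ins (it does not call the Distance library):

```python
s1, s2 = set("decide"), set("resize")

similarity = len(s1 & s2) / len(s1 | s2)  # 2 shared / 7 distinct characters
dist = 1 - similarity                      # complement of the similarity

print(len(s1 & s2), len(s1 | s2))  # 2 7
print(similarity, dist)
```

The same computation works on lists of usernames, since set() accepts any iterable of hashable items.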
@Aventinus (I also cannot comment): Note that Jaccard similarity is an operation on sets, so the denominator should also use sets (instead of lists). For example, jaccard_similarity('aa', 'ab') should result in 0.5.
def jaccard_similarity(list1, list2):
    intersection = len(set(list1).intersection(list2))
    union = len(set(list1)) + len(set(list2)) - intersection
    return intersection / union
Note that in the intersection, there is no need to cast to list first. Also, the cast to float is not needed in Python 3.
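To see why the denominator matters, here is a small comparison on the 'aa' / 'ab' example. Both helper names are hypothetical: the first uses the raw sequence lengths in the denominator, the second uses set sizes as in the function above:

```python
def jaccard_with_list_lengths(seq1, seq2):
    intersection = len(set(seq1).intersection(seq2))
    # len() on the raw sequences counts duplicates, inflating the union.
    union = len(seq1) + len(seq2) - intersection
    return intersection / union

def jaccard_with_set_sizes(seq1, seq2):
    intersection = len(set(seq1).intersection(seq2))
    union = len(set(seq1)) + len(set(seq2)) - intersection
    return intersection / union

print(jaccard_with_list_lengths('aa', 'ab'))  # 1/3, too low
print(jaccard_with_set_sizes('aa', 'ab'))     # 0.5
```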
To avoid counting repeated elements in the union (the denominator), and to make it a little bit faster, I propose:
def jaccard_score(lista_1, lista_2):
    # Parameter names now match the names used in the body.
    inter = len(set(lista_1) & set(lista_2))
    union = len(set(lista_1) | set(lista_2))
    return inter / union
Creator of the Simphile NLP text similarity package here. Simphile contains several text similarity methods, Jaccard being one of them.
In the terminal install the package:
pip install simphile
Then your code could be something like:
from simphile import jaccard_list_similarity
list_a = ['cat', 'cat', 'dog']
list_b = ['dog', 'dog', 'cat']
print(f"Jaccard Similarity: {jaccard_list_similarity(list_a, list_b)}")
The output being:
Jaccard Similarity: 0.5
Note that this solution accounts for repeated elements, which is critical for text similarity; without it, the above example would show 100% similarity, because both lists, as sets, reduce to {'dog', 'cat'}.