Calculating the similarity of two lists

Question:

I have two lists:

eg.
a = [1,8,3,9,4,9,3,8,1,2,3]
and
b = [1,8,1,3,9,4,9,3,8,1,2,3]

Both contain ints. There is no meaning behind the ints (eg. 1 is not ‘closer’ to 3 than it is to 8).

I’m trying to devise an algorithm to calculate the similarity between two ORDERED lists. Ordered is keyword right here (so I can’t just take the set of both lists and calculate their set_difference percentage). Sometimes numbers do repeat (for example 3, 8, and 9 above, and I cannot ignore the repeats).

In the example above, the function I would call would tell me that a and b are ~90% similar for example. How can I do that? Edit distance was something which came to mind. I know how to use it with strings but I’m not sure how to use it with a list of ints. Thanks!

Asked By: aerain

||

Answers:

Just use the same algorithm for calculating edit distance on strings if the values don’t have any particular meaning.

Answered By: jason

It sounds like edit (or Levenshtein) distance is precisely the right tool for the job.

Here is one Python implementation that can be used on lists of integers: http://hetland.org/coding/python/levenshtein.py

Using that code, levenshtein([1,8,3,9,4,9,3,8,1,2,3], [1,8,1,3,9,4,9,3,8,1,2,3]) returns 1, which is the edit distance.

Given the edit distance and the lengths of the two arrays, computing a “percentage similarity” metric should be pretty trivial.

Answered By: NPE

Unless im missing the point.

from __future__ import division

def similar(x,y):
    si = 0
    for a,b in zip(x, y):
        if a == b:
            si += 1
    return (si/len(x)) * 100


if __name__ in '__main__':
    a = [1,8,3,9,4,9,3,8,1,2,3] 
    b = [1,8,1,3,9,4,9,3,8,1,2,3]
    result = similar(a,b)
    if result is not None:
        print "%s%s Similar!" % (result,'%')
Answered By: Jakob Bowyer

You can use the difflib module

ratio()
Return a measure of the sequences’ similarity as a float in the range [0, 1].

Which gives :

 >>> s1=[1,8,3,9,4,9,3,8,1,2,3]
 >>> s2=[1,8,1,3,9,4,9,3,8,1,2,3]
 >>> sm=difflib.SequenceMatcher(None,s1,s2)
 >>> sm.ratio()
 0.9565217391304348
Answered By: kraymer

One way to tackle this is to utilize histogram. As an example (demonstration with numpy):

In []: a= array([1,8,3,9,4,9,3,8,1,2,3])
In []: b= array([1,8,1,3,9,4,9,3,8,1,2,3])

In []: a_c, _= histogram(a, arange(9)+ 1)
In []: a_c
Out[]: array([2, 1, 3, 1, 0, 0, 0, 4])

In []: b_c, _= histogram(b, arange(9)+ 1)
In []: b_c
Out[]: array([3, 1, 3, 1, 0, 0, 0, 4])

In []: (a_c- b_c).sum()
Out[]: -1

There exist now plethora of ways to harness a_c and b_c.

Where the (seemingly) simplest similarity measure is:

In []: 1- abs(-1/ 9.)
Out[]: 0.8888888888888888

Followed by:

In []: norm(a_c)/ norm(b_c)
Out[]: 0.92796072713833688

and:

In []: a_n= (a_c/ norm(a_c))[:, None]
In []: 1- norm(b_c- dot(dot(a_n, a_n.T), b_c))/ norm(b_c)
Out[]: 0.84445724579043624

Thus, you need to be much more specific to find out most suitable similarity measure suitable for your purposes.

Answered By: eat

I’ve implemented something for a similar task a long time ago. Now, I have only a blog entry for that. It was simple: you had to compute the pdf of both sequences then it would find the common area covered by the graphical representation of pdf.

Sorry for the broken images on link, the external server that I’ve used back then is dead now.

Right now, for your problem the code translates to

def overlap(pdf1, pdf2):
    s = 0
    for k in pdf1:
        if pdf2.has_key(k):
            s += min(pdf1[k], pdf2[k])
    return s

def pdf(l):
    d = {}
    s = 0.0
    for i in l:
        s += i
        if d.has_key(i):
            d[i] += 1
        else:
            d[i] = 1
    for k in d:
        d[k] /= s
    return d

def solve():
    a = [1, 8, 3, 9, 4, 9, 3, 8, 1, 2, 3]
    b = [1, 8, 1, 3, 9, 4, 9, 3, 8, 1, 2, 3]
    pdf_a = pdf(a)
    pdf_b = pdf(b)
    print pdf_a
    print pdf_b
    print overlap(pdf_a, pdf_b)
    print overlap(pdf_b, pdf_a)

if __name__ == '__main__':
    solve()

Unfortunately, it gives an unexpected answer, only 0.212292609351

Answered By: Mihai Maruseac

The solution proposed by @kraymer does not work in the case of

s1=[1,2,3,4,5,6,7,8,9,10]
s2=[2,1,3,4,5,6,7,8,9,9]  

since it returns 0.8 even though there are 3 different elements and not 2.

A workaround could be:

def find_percentage_agreement(s1, s2):
    assert len(s1)==len(s2), "Lists must have the same shape"
    nb_agreements = 0  # initialize counter to 0
    for idx, value in enumerate(s1):
        if s2[idx] == value:
            nb_agreements += 1

    percentage_agreement = nb_agreements/len(s1)

    return percentage_agreement

Which returns the expected result:

>>> s1=[1,2,3,4,5,6,7,8,9,10]
>>> s2=[2,1,3,4,5,6,7,8,9,9]
>>> find_percentage_agreement(s1, s2)
0.7
Answered By: Tommaso Di Noto
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.