Find Least Similar Vectors Python and Numpy

Question:

If I have a list of arrays, e.g.:

[0,1,1,0,1,1,0]
[1,0,1,0,0,0,1]
[1,0,1,0,1,1,1]
[1,0,1,0,1,0,1]
[0,1,0,0,0,0,0]
[1,0,0,0,0,0,1]
[1,0,1,0,1,1,1]
[1,0,1,0,0,0,1]

and I want to find the n arrays that are least like the others, what would be the best method?

For example, I want the two arrays that are least similar to the group as a whole.

Asked By: Howard Zoopaloopa


Answers:

You would first need to make some methodology choices before you can actually implement a solution.

  1. How do you define "most different"? You need to choose or define the distance measure you would like to use. The appropriate measure depends on the problem you are trying to solve.
  2. How do you define the difference between a single array and the group of arrays? For example, you could leave one array out, take the average of the rest, and then compute the distance between the array you left out and that average. Alternatively, you could compute the distance between all pairs of arrays in your group, and then choose the two whose average distance to the rest is the largest.
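The two options in (2) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function names are made up for this example, the leave-one-out variant uses the L2 norm, and the pairwise variant uses the Hamming distance; any other distance measure could be swapped in.

```python
import numpy as np

# The eight arrays from the question, stacked as rows.
arrays = np.array([
    [0, 1, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 1, 1, 1],
    [1, 0, 1, 0, 0, 0, 1],
])

def leave_one_out_scores(x):
    """L2 distance from each row to the mean of all the other rows."""
    x = x.astype(float)
    n = len(x)
    # Mean of the other rows = (column totals - this row) / (n - 1).
    mean_rest = (x.sum(axis=0) - x) / (n - 1)
    return np.linalg.norm(x - mean_rest, axis=1)

def pairwise_scores(x):
    """Average Hamming distance from each row to every other row."""
    # Broadcasting compares every row with every row: shape (n, n, d).
    mismatches = x[:, None, :] != x[None, :, :]
    dist = mismatches.sum(axis=2)          # (n, n) Hamming distance matrix
    # The diagonal is zero, so the row sum covers only the other rows.
    return dist.sum(axis=1) / (len(x) - 1)

def least_similar(scores, n):
    """Indices of the n rows with the highest scores, most different first."""
    return np.argsort(scores)[::-1][:n]

print(least_similar(leave_one_out_scores(arrays), 2))
print(least_similar(pairwise_scores(arrays), 2))
```

On the question's data, both variants pick index 4 and then index 0, i.e. [0,1,0,0,0,0,0] and [0,1,1,0,1,1,0]. The two approaches will not always agree on other data.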

Some ideas for (1):

  • Hamming distance, which simply counts the number of entries that do not match between two arrays. Since your example is binary, it may be appropriate.
  • The L2 norm of the difference of the vectors (the square root of the sum of squared differences between entries). Probably the most popular, at least in some domains. Note that it is more sensitive to outliers than the L1 norm, which you may or may not want. You can also compute the L1 norm instead, which is just the sum of the absolute differences, and in the binary case will match the Hamming distance.
  • Many, many more. Try searching for distance measures for arrays, clustering distance measures, and so on. It really comes down to the interpretation of your data.
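As a small illustration of the measures listed above, here is how the Hamming, L1 and L2 distances could be computed with NumPy for the first two arrays from the question. The variable names are chosen for this example only.

```python
import numpy as np

a = np.array([0, 1, 1, 0, 1, 1, 0])
b = np.array([1, 0, 1, 0, 0, 0, 1])

# Hamming distance: number of positions where the entries differ.
hamming = np.count_nonzero(a != b)

# L1 norm of the difference: sum of absolute differences.
l1 = np.abs(a - b).sum()

# L2 norm of the difference: square root of the sum of squared differences.
l2 = np.linalg.norm(a - b)

print(hamming, l1, l2)
```

For these two arrays the Hamming and L1 distances are both 5, and the L2 distance is the square root of 5, which shows the binary-case equivalence mentioned above.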

Once you have chosen a methodology for (1) and (2), the implementation should not be too difficult.

Answered By: LarryBird