Python: nearest neighbour (or closest match) filtering on data records (list of tuples)

Question:

I am trying to write a function that will filter a list of tuples (mimicking an in-memory database), using a “nearest neighbour” or “nearest match” type algorithm.

I want to know the best (i.e. most Pythonic) way to go about doing this. The sample code below hopefully illustrates what I am trying to do.

datarows = [(10,2.0,3.4,100),
            (11,2.0,5.4,120),
            (17,12.9,42,123)]

filter_record = (9,1.9,2.9,99) # record that we are seeking to retrieve from 'database' (or nearest match)
weights = (1,1,1,1) # weights to apportion to each field in the filter

def get_nearest_neighbour(data, criteria, weights):
    # for each row in data:
    #     calculate a 'distance metric' (e.g. simple differencing)
    #     and multiply by the relevant weight
    # determine the row which was either an exact match or 'least dissimilar'
    # return the match (or nearest match)
    pass

if __name__ == '__main__':
    result = get_nearest_neighbour(datarows, filter_record, weights)
    print(result)

For the snippet above, the output should be:

(10,2.0,3.4,100)

since it is the ‘nearest’ to the sample data passed to the function get_nearest_neighbour().

My question then is: what is the best way to implement get_nearest_neighbour()? For brevity, assume that we are only dealing with numeric values, and that the ‘distance metric’ we use is simply an arithmetic subtraction of the input data from the current row.

Answers:

Use heapq.nsmallest on a generator calculating the distance*weight for each record (the nearest match is the one with the smallest weighted distance, so nsmallest rather than nlargest).

Something like:

import heapq
import operator

heapq.nsmallest(N, ((row, dist_function(row, criteria, weights)) for row in data),
                key=operator.itemgetter(1))
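
For instance, a runnable version of that idea; dist_function here is a stand-in weighted absolute-difference metric, since the snippet above leaves it undefined:

import heapq
import operator

datarows = [(10, 2.0, 3.4, 100), (11, 2.0, 5.4, 120), (17, 12.9, 42, 123)]
filter_record = (9, 1.9, 2.9, 99)
weights = (1, 1, 1, 1)

def dist_function(row, criteria, weights):
    # weighted sum of absolute per-field differences
    return sum(w * abs(a - b) for a, b, w in zip(row, criteria, weights))

scored = ((row, dist_function(row, filter_record, weights)) for row in datarows)
# nsmallest returns a list of (row, distance) pairs; take the closest one
best_row, best_dist = heapq.nsmallest(1, scored, key=operator.itemgetter(1))[0]
print(best_row)  # (10, 2.0, 3.4, 100)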
Answered By: Not_a_Golfer

Simple out-of-the-box solution:

import math

def distance(row_a, row_b, weights):
    # weighted sum of the absolute per-field differences
    diffs = [math.fabs(a - b) for a, b in zip(row_a, row_b)]
    return sum(v * w for v, w in zip(diffs, weights))

def get_nearest_neighbour(data, criteria, weights):
    def sort_func(row):
        return distance(row, criteria, weights)
    # min() with a distance key returns the least-dissimilar row
    return min(data, key=sort_func)
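
With the sample data from the question, this returns the expected row:

>>> get_nearest_neighbour(datarows, filter_record, weights)
(10, 2.0, 3.4, 100)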

If you need to work with huge datasets, you should consider switching to NumPy and using a k-d tree (scipy.spatial.cKDTree in SciPy) to find nearest neighbors. The advantage is that not only does it use a more advanced algorithm (queries take roughly O(log N) instead of a full scan), but it is also implemented on top of highly optimized compiled code.
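
A minimal sketch of that approach, assuming SciPy is installed; cKDTree supports Minkowski distances, so the per-field weights are applied by scaling the columns before building the tree:

import numpy as np
from scipy.spatial import cKDTree

datarows = [(10, 2.0, 3.4, 100), (11, 2.0, 5.4, 120), (17, 12.9, 42, 123)]
filter_record = (9, 1.9, 2.9, 99)
weights = (1, 1, 1, 1)

w = np.asarray(weights, dtype=float)
tree = cKDTree(np.asarray(datarows, dtype=float) * w)  # build the index once

# p=1 gives the weighted Manhattan distance used in the answers above
dist, idx = tree.query(np.asarray(filter_record, dtype=float) * w, p=1)
print(datarows[idx])  # (10, 2.0, 3.4, 100)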

Answered By: vartec

About naive-NN:

Many of these other answers propose “naive nearest-neighbor”, which is an O(N*d)-per-query algorithm (d is the dimensionality, which in this case seems constant, so it’s O(N)-per-query).

While an O(N)-per-query algorithm is pretty bad, you might be able to get away with it if the product #queries * #points is small enough, for example any of:

  • 10 queries and 100000 points
  • 100 queries and 10000 points
  • 1000 queries and 1000 points
  • 10000 queries and 100 points
  • 100000 queries and 10 points

Doing better than naive-NN:

Otherwise you will want to use one of the standard nearest-neighbor data structures (k-d trees, ball trees, and the like), especially if you plan to run your program more than once. There are most likely libraries available. Doing without a NN data structure would take too much time if you have a large product of #queries * #points. As user ‘dsign’ points out in the comments, you can probably squeeze out a large additional constant factor of speed by using the numpy library, as sketched below.
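
As an illustration of that constant-factor point, the naive scan collapses to a few array operations in NumPy (reusing the sample data from the question):

import numpy as np

datarows = [(10, 2.0, 3.4, 100), (11, 2.0, 5.4, 120), (17, 12.9, 42, 123)]
filter_record = (9, 1.9, 2.9, 99)
weights = (1, 1, 1, 1)

data = np.asarray(datarows, dtype=float)  # shape (N, d)
q = np.asarray(filter_record, dtype=float)
w = np.asarray(weights, dtype=float)

# one weighted L1 distance per row, computed in C rather than a Python loop
dists = (np.abs(data - q) * w).sum(axis=1)
print(datarows[int(dists.argmin())])  # (10, 2.0, 3.4, 100)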

However, if you can get away with the simple-to-implement naive-NN, you should use it.

Answered By: ninjagecko