Fastest way to find nearest neighbours in NumPy array

Question

What is the fastest way to perform operations on adjacent elements of an m x n array within distance $l$ (where m and n are large)? If this were an image, it would equate to an operation on the surrounding pixels. To make things clearer, I’ve created a new array with the neighbours of the corresponding source element.

Given some array like

x = [[1,2,3],
     [4,5,6], 
     [7,8,9]]

if I were to take the [0,0] element and wanted the surrounding elements at $l$=1, I’d need the [0,1] and [1,0] elements (namely 2 and 4). The desired output would look something like this

y = [[[2,4], [1,3,5], [2,6]], 
     [[1,5,7], [4,6,2,8], [3,9,5]],
     [[4,8], [7,5,9], [8,6]]]

I’ve tried playing around with KDTree from scipy.spatial, and am aware of https://stackoverflow.com/a/45742628/20451990, but as far as I can tell this is actually finding the nearest data points, whereas I want to find the nearest array elements. I guess it could be done naively by iterating through, but that is very slow…

The end goal here is to generate combinations of nearby array elements which I will be taking the product of. For the example above this could be

 [[[1*2, 1*4], [2*1, 2*3, 2*5], [3*2, 3*6]], ...]

Asked By: notastringtheorist

Answer 1

Key takeaways

With numba, it is possible to get code that runs roughly 690x faster than naïve python code with for-loops and list appends.
With numba, functions can be given an explicit signature, in which you state the datatypes of the arguments and the return value.
Avoid memory (re-)allocations. Try to allocate memory for any arrays in advance, and reuse the data containers whenever possible (see cell_result in the numbafied process_cell()).
Numba is not super handy with classes (at least, OOP-style code), dynamically typed code, containers with mixed types, or containers that change in size. Prefer simple functions and typed structures with a defined size. See also: Supported Python features
Numba likes for-loops, and they’re fast!

Prewords

You asked for the fastest way to calculate this. I had no baseline, so I first created a pure python for-loop solution as one. Then I used numba to make the code run fast. It is most probably not the fastest implementation, but at least it is way faster than the naïve pure python for-loop approach.

So, if you are not familiar with numba, this is a good way to learn a bit about it 🙂

Used test data

I use two pieces of test data. First, the simple array given in the question. I call this myarr, and it is used for easy comparison of the output:

import numpy as np

myarr = np.array(
    [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9],
    ],
    dtype=np.float32,
)

The second dataset is for benchmarking. You mentioned that the arrays will be of size 30 x 30 and the distance I will be less than 4.

arr_large = np.arange(1, 30 * 30 + 1, 1, dtype=np.float32).reshape(30, 30)

In other words, the arr_large is a 30 x 30 2d-array:

>>> arr_large
array([[  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
         12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,  22.,
         23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.],
        ...
       [871., 872., 873., 874., 875., 876., 877., 878., 879., 880., 881.,
        882., 883., 884., 885., 886., 887., 888., 889., 890., 891., 892.,
        893., 894., 895., 896., 897., 898., 899., 900.]], dtype=float32)

I specified the dtype because specifying datatype is needed at the optimization step. For the pure python solution this is of course not necessary at all.

Baseline solution: Pure python with for-loops

I implemented the baseline solution with a python class and for-loops. The output from it looks like this (source for NeighbourProcessor below):

Example output with 3 x 3 input array (I=1)

n = NeighbourProcessor()
output = n.process(myarr, max_distance=1)

The output is then

>>> output
{(0, 0): [2, 4],
 (0, 1): [2, 6, 10],
 (0, 2): [6, 18],
 (1, 0): [4, 20, 28],
 (1, 1): [10, 20, 30, 40],
 (1, 2): [18, 30, 54],
 (2, 0): [28, 56],
 (2, 1): [40, 56, 72],
 (2, 2): [54, 72]}

which is the same as

{(0, 0): [1 * 2, 1 * 4],
 (0, 1): [2 * 1, 2 * 3, 2 * 5],
 (0, 2): [3 * 2, 3 * 6],
 (1, 0): [4 * 1, 4 * 5, 4 * 7],
 (1, 1): [5 * 2, 5 * 4, 5 * 6, 5 * 8],
 (1, 2): [6 * 3, 6 * 5, 6 * 9],
 (2, 0): [7 * 4, 7 * 8],
 (2, 1): [8 * 5, 8 * 7, 8 * 9],
 (2, 2): [9 * 6, 9 * 8]}

This is basically what was asked in the question; the target output was

 [[[1*2, 1*4], [2*1, 2*3, 2*5], [3*2, 3*6]], ...]

Here I used a dictionary with (row, column) as the key because that way you can more easily find the output for each cell.
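
If the nested list layout from the question is preferred, the dictionary can be converted back. The snippet below is a minimal sketch that assumes every (row, column) key of a rectangular grid is present; the variable names are only for illustration.

```python
# Example data: the dictionary output for the 3 x 3 array shown above.
output = {
    (0, 0): [2, 4], (0, 1): [2, 6, 10], (0, 2): [6, 18],
    (1, 0): [4, 20, 28], (1, 1): [10, 20, 30, 40], (1, 2): [18, 30, 54],
    (2, 0): [28, 56], (2, 1): [40, 56, 72], (2, 2): [54, 72],
}

# Infer the grid size from the largest row and column index among the keys.
n_rows = max(row for row, _ in output) + 1
n_cols = max(col for _, col in output) + 1

# Rebuild the nested [row][column] layout used in the question.
nested = [[output[(row, col)] for col in range(n_cols)] for row in range(n_rows)]

print(nested[0])  # [[2, 4], [2, 6, 10], [6, 18]]
```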

Baseline performance

For the largest input of 30 x 30, and largest distance (I=4), the calculation takes about 0.188 seconds on my laptop:

>>> %timeit n.process(arr_large, max_distance=4)
188 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Code for NeighbourProcessor

import math
import numpy as np


class NeighbourProcessor:
    def __init__(self):
        self.arr = None

    def process(self, arr, max_distance=1):
        self.arr = arr
        output = dict()
        rows, columns = self.arr.shape
        for current_row in range(rows):
            for current_col in range(columns):
                cell_result = self.process_cell(current_row, current_col, max_distance)
                output[(current_row, current_col)] = cell_result
        return output

    def row_col_is_within_array(self, row, col):
        if row < 0 or col < 0:
            return False
        if row > self.arr.shape[0] - 1 or col > self.arr.shape[1] - 1:
            return False
        return True

    def distance(self, row, col, current_row, current_col):
        distance_squared = (current_row - row) ** 2 + (current_col - col) ** 2
        return np.sqrt(distance_squared)

    def are_neighbours(self, row, col, current_row, current_col, max_distance):
        if row == current_row and col == current_col:
            return False
        if not self.row_col_is_within_array(row, col):
            return False
        return self.distance(row, col, current_row, current_col) <= max_distance

    def neighbours(self, current_row, current_col, max_distance):
        start_row = math.floor(current_row - max_distance)
        start_col = math.floor(current_col - max_distance)
        end_row = math.ceil(current_row + max_distance)
        end_col = math.ceil(current_col + max_distance)
        for row in range(start_row, end_row + 1):
            for col in range(start_col, end_col + 1):
                if self.are_neighbours(
                    row, col, current_row, current_col, max_distance
                ):
                    yield row, col

    def process_cell(self, current_row, current_col, max_distance):
        cell_output = []
        current_cell_value = self.arr[current_row][current_col]
        for row, col in self.neighbours(current_row, current_col, max_distance):
            neighbour_cell_value = self.arr[row][col]
            cell_output.append(current_cell_value * neighbour_cell_value)
        return cell_output

Short explanation

NeighbourProcessor.process goes through the rows and columns of the input array, starting from (0,0), the top left corner, and processing from left to right, top to bottom until the bottom right corner, (n_rows - 1, n_columns - 1). Each visited cell in turn becomes the current cell, (current_row, current_column).
Each current cell is handled in process_cell. It uses the generator neighbours(), which iterates over all the neighbours within the maximum distance I from the current cell. You can check how the logic goes in are_neighbours.
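
The neighbour search described above can be sketched as a single standalone function. This is only an illustration, not the class from this answer: the name neighbour_indices is hypothetical, and it clamps the search window to the array bounds instead of checking each candidate with row_col_is_within_array.

```python
import math


def neighbour_indices(row, col, n_rows, n_cols, max_distance):
    """Return the (row, col) pairs within max_distance of the given cell."""
    # No neighbour can be further than this many cells away along one axis.
    reach = math.floor(max_distance)
    found = []
    # Clamp the search window so that it never leaves the array.
    for r in range(max(row - reach, 0), min(row + reach, n_rows - 1) + 1):
        for c in range(max(col - reach, 0), min(col + reach, n_cols - 1) + 1):
            if (r, c) == (row, col):
                continue  # a cell is not its own neighbour
            if math.hypot(r - row, c - col) <= max_distance:
                found.append((r, c))
    return found


print(neighbour_indices(0, 0, 3, 3, 1))  # [(0, 1), (1, 0)]
print(neighbour_indices(1, 1, 3, 3, 1))  # [(0, 1), (1, 0), (1, 2), (2, 1)]
```

For the 3 x 3 example, the corner cell (0,0) gets the cells holding 2 and 4, which matches the question.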

Faster solution: Using numba and memory pre-allocation

Now I will make a functions-only version with numba and try to make the processing as fast as possible. It is also possible to use classes in numba, but they are still a bit more experimental and complex, and this problem can be solved with functions only. The readability of the code suffers a bit, but that’s the price we sometimes pay for speed optimization.

I’ll start with the process function. Now it will have to create a three-dimensional array instead of a dict. We want to create the array ahead of time because memory allocation is a costly process and we want to do it exactly once. So, instead of having this as output for myarr:

# output[(row,column)]
#
output[(0,0)] # [2,4]
output[(0,1)] # [2, 6, 10]
#..etc

I want constant-sized output:

# output[row][column]
#
output[0][0] # [2, 4, nan, nan]
output[0][1] # [2, 6, 10, nan]
#..etc

Notice that after all the "pairs", the remaining slots are filled with np.nan (not a number). Any postprocessing script must then simply ignore the extra nans.
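
As a rough sketch of such postprocessing, the nan padding can be masked out with np.isnan, or skipped with a nan-aware reduction such as np.nansum. The array padded below is made-up example data shaped like two cells of the constant-sized output.

```python
import numpy as np

# One padded row per cell, as in the constant-sized output described above.
padded = np.array(
    [[2.0, 4.0, np.nan, np.nan], [2.0, 6.0, 10.0, np.nan]], dtype=np.float32
)

# Keep only the real products of each cell by masking out the nan padding.
products_per_cell = [row[~np.isnan(row)].tolist() for row in padded]
print(products_per_cell)  # [[2.0, 4.0], [2.0, 6.0, 10.0]]

# Nan-aware reductions skip the padding without any explicit masking.
sums = np.nansum(padded, axis=-1)
print(sums.tolist())  # [6.0, 18.0]
```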

Solving for the required size for the pre-allocated array

How do I know the size of the third dimension, i.e. the number of neighbours for a given max. distance I? Well, I don’t. It seems this is quite a complicated problem; see, for example, the Gauss circle problem in Wikipedia. Nevertheless, I can quite easily calculate an upper bound for the number of neighbours. In the following I assume that a cell is a neighbour if and only if the distance between the middle points of the cells is less than or equal to I. If you create sketches with pen and paper, you will notice that when you increase the distance, the maximum number of neighbours grows as:

I = 1 -> max_number_neighbours = 4
I = 2 -> max_number_neighbours = 12
I = 3 -> max_number_neighbours = 28

For example, in a sketch of a 10 x 10 2d-array with distance I=3 and current cell (4,5), the number of neighbours must be less than or equal to 28.

This pattern is represented as a function of max distance (I): (2*I-1)**2 + 4 -1, or

n_third_dimension = max_number_neighbours = (2*I-1)**2 + 3
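
The bound can be checked by brute force for small distances. The following is a minimal sketch with hypothetical helper names (count_neighbours, upper_bound); it counts the grid cells within the distance of a cell that sits far from any array edge and compares the count with the formula above.

```python
import math


def count_neighbours(max_distance):
    """Count grid cells within max_distance of a cell, excluding the cell itself."""
    reach = math.floor(max_distance)
    count = 0
    for d_row in range(-reach, reach + 1):
        for d_col in range(-reach, reach + 1):
            if (d_row, d_col) == (0, 0):
                continue  # skip the current cell
            # Compare squared distances to avoid the square root.
            if d_row**2 + d_col**2 <= max_distance**2:
                count += 1
    return count


def upper_bound(max_distance):
    """Pre-allocation size from the formula (2*I-1)**2 + 3."""
    return (2 * math.ceil(max_distance) - 1) ** 2 + 3


for distance in (1, 2, 3, 4):
    print(distance, count_neighbours(distance), upper_bound(distance))
```

For these four distances the exact counts are 4, 12, 28 and 48, while the formula gives 4, 12, 28 and 52, so the pre-allocated third dimension is large enough in each case.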

Refactoring the code to work with numba

We start with creating the function signature of the entry point. In this case, we create a function process with the function signature:

@numba.jit("f4[:,:,:](f4[:,:], f4)")
def process(arr, max_distance):
    ...

See the docs for the other available types. The f4[:,:] just means that the input is 2d-array of float32 and f4[:,:,:](....) means that the function output is 3d-array of float32. Next, we create the output with the formula we invented above. Here is one part of the magic: memory pre-allocation with np.empty:

    n_third_dimension = (2 * math.ceil(max_distance) - 1) ** 2 + 3
    output = np.empty((*arr.shape, n_third_dimension), dtype=np.float32)
    cell_result = np.empty(n_third_dimension, dtype=np.float32)

Numbafied code

I will not walk through the rest of the code step by step, but you can see below that it is a slightly modified version of the pure python for-loop baseline.

import math

import numba
import numpy as np


@numba.njit("f4(i4,i4,i4,i4)")
def distance(row, col, current_row, current_col):
    distance_squared = (current_row - row) ** 2 + (current_col - col) ** 2
    return np.sqrt(distance_squared)


@numba.njit("boolean(i4,i4, i4,i4)")
def row_col_is_within_array(
    row,
    col,
    arr_rows,
    arr_cols,
):
    if row < 0 or col < 0:
        return False
    if row > arr_rows - 1 or col > arr_cols - 1:
        return False
    return True


@numba.njit("boolean(i4,i4,i4,i4,f4,i4,i4)")
def are_neighbours(
    neighbour_row,
    neighbour_col,
    current_row,
    current_col,
    max_distance,
    arr_rows,
    arr_cols,
):
    if neighbour_row == current_row and neighbour_col == current_col:
        return False
    if not row_col_is_within_array(
        neighbour_row,
        neighbour_col,
        arr_rows,
        arr_cols,
    ):
        return False
    return (
        distance(neighbour_row, neighbour_col, current_row, current_col) <= max_distance
    )


@numba.njit("f4[:](f4[:,:], f4[:], i4,i4,i4,f4)")
def process_cell(
    arr, cell_result, current_row, current_col, n_third_dimension, max_distance
):
    for i in range(n_third_dimension):
        cell_result[i] = np.nan

    current_cell_value = arr[current_row][current_col]

    # Potential cell neighbour area
    start_row = math.floor(current_row - max_distance)
    start_col = math.floor(current_col - max_distance)
    end_row = math.ceil(current_row + max_distance)
    end_col = math.ceil(current_col + max_distance)

    arr_rows, arr_cols = arr.shape

    cell_pointer = 0
    for neighbour_row in range(start_row, end_row + 1):
        for neighbour_col in range(start_col, end_col + 1):
            if are_neighbours(
                neighbour_row,
                neighbour_col,
                current_row,
                current_col,
                max_distance,
                arr_rows,
                arr_cols,
            ):
                neighbour_cell_value = arr[neighbour_row][neighbour_col]
                cell_result[cell_pointer] = current_cell_value * neighbour_cell_value
                cell_pointer += 1
    return cell_result


@numba.njit("f4[:,:,:](f4[:,:], f4)")
def process(arr, max_distance):
    n_third_dimension = (2 * math.ceil(max_distance) - 1) ** 2 + 3
    output = np.empty((*arr.shape, n_third_dimension), dtype=np.float32)
    cell_result = np.empty(n_third_dimension, dtype=np.float32)

    rows, columns = arr.shape

    for current_row in range(rows):
        for current_col in range(columns):
            cell_result = process_cell(
                arr,
                cell_result,
                current_row,
                current_col,
                n_third_dimension,
                max_distance,
            )
            output[current_row][current_col][:] = cell_result
    return output

Example output

>>> output = process(myarr, max_distance=1.0)
>>> output
array([[[ 2.,  4., nan, nan],
        [ 2.,  6., 10., nan],
        [ 6., 18., nan, nan]],

       [[ 4., 20., 28., nan],
        [10., 20., 30., 40.],
        [18., 30., 54., nan]],

       [[28., 56., nan, nan],
        [40., 56., 72., nan],
        [54., 72., nan, nan]]], dtype=float32)

>>> output[0]
array([[ 2.,  4., nan, nan],
       [ 2.,  6., 10., nan],
       [ 6., 18., nan, nan]], dtype=float32)


>>> output[0][1]
array([ 2.,  6., 10., nan], dtype=float32)
# Above is the same as target: [2 * 1, 2 * 3, 2 * 5]

Speed of the numbafied code and closing words

The execution time of the baseline approach was 188 ms. Now it is 271 µs. That is only 0.00144 times what the original code took! (A 99.85% reduction in execution time. Some would say 693x faster.)

>>> %timeit process(arr_large, max_distance=4.0)
271 µs ± 2.88 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Note that you might want to calculate the distance differently, add weighting, or add some more complex logic, aggregation functions, etc. This could still be optimized a bit further, for example by creating a better estimate for the maximum number of neighbours. Have fun with numba, and I hope you learned something! 🙂

Bonus tip: numba also supports ahead-of-time compilation, which you can use to make the first function call fast as well!

Answered By: np8