Why is KNN slow with custom metric?

Question:

I work with a data set of about 200k objects, each with 4 features. I classify them with K nearest neighbors (KNN) using the Euclidean metric, and the process finishes in about 20 seconds.

Lately I have had a reason to use a custom metric, which will probably give better results. I implemented the custom metric and KNN started taking more than an hour; I didn't wait for it to finish.

I assumed the cause of the issue was my metric, so I replaced its body with return 1. KNN still ran for more than an hour. I then suspected the ball tree algorithm, but KNN with ball tree and the Euclidean metric finishes in about 20 seconds.

Right now I have no idea what's wrong. I use Python 3 and sklearn 0.17.1, and the process never finishes with the custom metric. I also tried the brute algorithm, with the same effect. Upgrading and downgrading scikit-learn made no difference, and running the classification with the custom metric on Python 2 didn't help either. I even implemented the metric (just return 1) in Cython, with the same result.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Dummy metric: always returns 1, so the metric itself does no real work
def custom_metric(x: np.ndarray, y: np.ndarray) -> float:
    return 1

clf = KNeighborsClassifier(n_jobs=1, metric=custom_metric)
clf.fit(X, Y)  # X, Y: my 200k x 4 data and labels

Is there a way to speed up KNN classification with a custom metric?
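For reference, this is roughly the kind of comparison I am describing, scaled down so it finishes (synthetic data; the sizes and dummy metric are just for illustration, my real data has ~200k rows):

import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for my data (scaled down from ~200k x 4 so the script finishes)
X = np.random.rand(20000, 4)
Y = np.random.randint(0, 2, size=len(X))

def dummy_metric(x, y):
    return 1  # does no real work, yet the slowdown remains

for metric in ['euclidean', dummy_metric]:
    clf = KNeighborsClassifier(n_jobs=1, metric=metric)
    start = time.time()
    clf.fit(X, Y)
    clf.predict(X[:200])  # classify a small slice just to compare timings
    print(metric, time.time() - start)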

Sorry if my English is not clear.

Asked By: ANtlord


Answers:

Scikit-learn is heavily optimized: it uses Cython and multiple processes to run as fast as possible. Falling back to pure Python code, especially code that is called millions of times, is what slows your code down. I recommend writing your custom metric in Cython.
There is a tutorial you can follow here: https://blog.sicara.com/https-medium-com-redaboumahdi-speed-sklearn-algorithms-custom-metrics-using-cython-de92e5a325c
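A rough way to see how often the callable is invoked (an illustrative sketch on small synthetic data):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

calls = 0

def counting_metric(x, y):
    # Same dummy metric as in the question, but counting how often it is called
    global calls
    calls += 1
    return 1.0

X = np.random.rand(2000, 4)
Y = np.random.randint(0, 2, size=len(X))

clf = KNeighborsClassifier(n_jobs=1, algorithm='brute', metric=counting_metric)
clf.fit(X, Y)
clf.predict(X[:100])
print(calls)  # on the order of n_test * n_train Python-level calls

Every one of those calls pays Python function-call overhead, which a compiled (Cython) metric avoids.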

Answered By: Réda Boumahdi

As rightly pointed out by @Réda Boumahdi, the cause is that the custom metric is defined in Python. This is a known issue, discussed here. It was closed as "wontfix" at the end of the discussion, so the only suggested solution is to write your custom metric in Cython, which avoids the GIL overhead that slows things down when a Python metric is used.
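Before writing a Cython metric, it can also be worth checking whether the distance you need is already implemented natively, since built-in metrics never call back into Python. In recent scikit-learn versions the supported names per algorithm are listed in sklearn.neighbors.VALID_METRICS (a quick check, assuming that import path is available in your version):

from sklearn.neighbors import VALID_METRICS

# Metric names implemented in compiled code for each neighbor algorithm;
# using any of these avoids the Python-callback (and GIL) overhead entirely.
print(VALID_METRICS['ball_tree'])
print(VALID_METRICS['kd_tree'])
print(VALID_METRICS['brute'])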

Answered By: Rohin Kumar
  1. You can use numba. It gave a faster run time than Cython for me (most likely because I don't know Cython well enough and the numba version took essentially zero effort); see the usage sketch after the numba snippet below for how it plugs into KNeighborsClassifier.
  2. Why is KNN so fast by default? Because it uses a KD-tree. However, a KD-tree can't be used with a custom metric for some reason, so it falls back to brute force. I tried to set it manually, but it didn't work. BUT 'ball_tree' worked fine and sped the algorithm up even more.

I used a dataset with ~5k train rows, ~20 features, and ~1k rows for inference (validation). I compared the following:

  1. scipy correlation function https://github.com/scipy/scipy/blob/v1.10.0/scipy/spatial/distance.py#L577-L624

  2. numba:

import numpy as np
import numba
from numba import jit

@jit(nopython=True)
def corr_numba(u, v):
    # Correlation distance: 1 - Pearson correlation between u and v
    umu = np.average(u)
    vmu = np.average(v)
    u = u - umu
    v = v - vmu
    uv = np.average(u * v)
    uu = np.average(np.square(u))
    vv = np.average(np.square(v))
    dist = 1.0 - uv / np.sqrt(uu * vv)
    return dist

corr_numba(np.array([0, 1, 1]), np.array([1, 0, 0]))  # 2.0
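Plugging the jitted metric into the classifier then looks like this (X_train, y_train, X_val are placeholders for your own train/validation split):

from sklearn.neighbors import KNeighborsClassifier

# ball_tree accepts a callable metric and was the fastest combination in my runs
clf = KNeighborsClassifier(algorithm='ball_tree', metric=corr_numba)
clf.fit(X_train, y_train)
pred = clf.predict(X_val)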
  3. Cython (I haven't tried to optimize it, so it's not the best version possible):
%%cython --annotate

import numpy as np
from libc.math cimport sqrt

def corr(double[:] u, double[:] v):
    # Correlation distance: 1 - Pearson correlation between u and v
    cdef double u_sum = 0
    cdef double v_sum = 0
    cdef double uv_sum = 0
    cdef double uu_sum = 0
    cdef double vv_sum = 0
    cdef Py_ssize_t n_elems = u.shape[0]
    cdef Py_ssize_t i
    cdef double dist

    # Means of u and v
    for i in range(n_elems):
        u_sum += u[i]
        v_sum += v[i]
    u_sum = u_sum / n_elems
    v_sum = v_sum / n_elems

    # Centered second moments
    for i in range(n_elems):
        uv_sum += (u[i] - u_sum) * (v[i] - v_sum)
        uu_sum += (u[i] - u_sum) * (u[i] - u_sum)
        vv_sum += (v[i] - v_sum) * (v[i] - v_sum)
    uv_sum = uv_sum / n_elems
    uu_sum = uu_sum / n_elems
    vv_sum = vv_sum / n_elems

    dist = 1.0 - uv_sum / sqrt(uu_sum * vv_sum)
    return dist

corr(np.array([0., 1, 1]), np.array([1., 0, 0]))  # 2.0

The results:

  1. No ball tree (brute force):

     metric               score   train_time   infer_time
     scipy correlation    0.934   0.001        182.970
     corr_numba           0.934   0.001        6.239
     corr (cython)        0.934   0.001        32.485

  2. Ball tree:

     metric               score   train_time   infer_time
     scipy correlation    0.935   1.252        80.716
     corr_numba           0.935   0.049        2.336
     corr (cython)        0.935   0.249        12.725
Answered By: ckorzhik