How to get the centroids in DBSCAN sklearn?

Question:

I am using DBSCAN for clustering. I now want to pick one point from each cluster to represent it, but DBSCAN does not provide centroids the way k-means does.

However, I noticed that DBSCAN has something called core points. Is it possible to use these core points, or some other alternative, to obtain a representative point from each cluster?

The code I have used is below.

import numpy as np
from math import pi
from sklearn.cluster import DBSCAN

#points containing time value in minutes
points = [100, 200, 600, 659, 700]

def convert_to_radian(x):
    return((x / (24 * 60)) * 2 * pi)

rad_function = np.vectorize(convert_to_radian)
points_rad = rad_function(points)

#generate distance matrix from each point
dist = points_rad[None,:] - points_rad[:, None]

#Assign shortest distances from each point
dist[((dist > pi) & (dist <= (2*pi)))] = dist[((dist > pi) & (dist <= (2*pi)))] -(2*pi)
dist[((dist > (-2*pi)) & (dist <= (-1*pi)))] = dist[((dist > (-2*pi)) & (dist <= (-1*pi)))] + (2*pi) 
dist = abs(dist)

#check dist
print(dist)

#eps is 100 minutes converted to radians; metric is 'precomputed' because dist is a distance matrix
db = DBSCAN(eps=((100 / (24*60)) * 2 * pi ), min_samples = 2, metric='precomputed')

#check db
print(db)

db.fit(dist)

#get labels
labels = db.labels_

#get number of clusters
no_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print('No of clusters:', no_clusters)
print('Cluster 0 : ', np.nonzero(labels == 0)[0])
print('Cluster 1 : ', np.nonzero(labels == 1)[0])

print(db.core_sample_indices_)

I am happy to provide more details if needed.

Asked By: EmJ

||

Answers:

Why don't you estimate the centroids of the resulting clusters yourself?

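# dist is the precomputed distance matrix, so average the original
# coordinates (points_rad) rather than rows of dist
points_of_cluster_0 = points_rad[labels == 0]
centroid_of_cluster_0 = np.mean(points_of_cluster_0, axis=0)
print(centroid_of_cluster_0)

points_of_cluster_1 = points_rad[labels == 1]
centroid_of_cluster_1 = np.mean(points_of_cluster_1, axis=0)
print(centroid_of_cluster_1)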
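
Because the data in the question are times of day, an arithmetic mean can mislead for a cluster that straddles midnight (for example 23:50 and 00:10). A minimal sketch of a circular mean per cluster follows. The helper name circular_centroids is hypothetical, and the labels array is hard-coded as an assumed stand-in for db.labels_ so that the snippet runs with NumPy alone.

```python
import numpy as np

def circular_centroids(minutes, labels, period=24 * 60):
    """Return {cluster_label: circular mean in minutes}, skipping noise (-1)."""
    minutes = np.asarray(minutes, dtype=float)
    labels = np.asarray(labels)
    # Map each time onto the unit circle
    angles = minutes / period * 2 * np.pi
    centroids = {}
    for k in np.unique(labels):
        if k == -1:  # DBSCAN marks noise points with -1
            continue
        a = angles[labels == k]
        # Average the unit vectors, then take the angle of the average
        mean_angle = np.arctan2(np.sin(a).mean(), np.cos(a).mean())
        # Wrap into [0, 2*pi] and convert back to minutes
        centroids[int(k)] = float(mean_angle % (2 * np.pi) / (2 * np.pi) * period)
    return centroids

points = [100, 200, 600, 659, 700]
labels = np.array([0, 0, 1, 1, 1])  # assumed stand-in for db.labels_
centroids = circular_centroids(points, labels)
print(centroids)
```

Note that the result is a time value, not necessarily one of the original points. If the representative has to be an actual observation, pick the cluster member closest to this value.
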
Answered By: seralouk

You could also do a crude density estimate row by row within each cluster, e.g. density_i = np.where(cdist(x[i:i+1], x[inds]) - cut_off < 0, 1, 0).sum(1) for each i in inds, where inds = np.flatnonzero(cluster_results == cluster_index). The point with the highest density in each cluster is then the most representative "centroid". This can still be slow if the dataset is massive.
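
The idea above can be sketched against a precomputed distance matrix like the one in the question, so no call to cdist is needed. This is only an illustration under some assumptions: the helper name densest_point_per_cluster, the 60-minute cut_off, and the hard-coded labels (standing in for db.labels_) are all hypothetical, and ties in density are broken by the smallest total distance to the other cluster members, which is a choice made here and not part of the original suggestion.

```python
import numpy as np

def densest_point_per_cluster(dist, labels, cut_off):
    """Return {cluster_label: index of the member with most neighbours within cut_off}."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    representatives = {}
    for k in np.unique(labels):
        if k == -1:  # skip DBSCAN noise
            continue
        inds = np.flatnonzero(labels == k)
        # Distances between members of this cluster only
        sub = dist[np.ix_(inds, inds)]
        # Count members closer than cut_off (each point counts itself)
        density = (sub < cut_off).sum(axis=1)
        # Tie-break: smallest summed distance to the rest of the cluster
        total = sub.sum(axis=1)
        best = np.lexsort((total, -density))[0]
        representatives[int(k)] = int(inds[best])
    return representatives

minutes = np.array([100, 200, 600, 659, 700])
# Circular distance in minutes on a 24-hour clock
d = np.abs(minutes[:, None] - minutes[None, :])
dist_minutes = np.minimum(d, 24 * 60 - d)

labels = np.array([0, 0, 1, 1, 1])  # assumed stand-in for db.labels_
representatives = densest_point_per_cluster(dist_minutes, labels, cut_off=60)
print(representatives)
```

Unlike a mean, this always returns the index of an actual data point. The cost grows quadratically with cluster size, which is the slowness the answer warns about.
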

Answered By: Chonk