Getting the center point of a cluster for latitude and longitude in Python


I have a list of coordinates that have areas mapped out as follows:


For the following latitude longitude pairs I am using DBSCAN to cluster them

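import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

X = np.array(df[['latitude', 'longitude']])

kms_per_radian = 6371.0088
epsilon = 1 / kms_per_radian  # 1 km expressed in radians
# assumed fit step: eps is in radians, so fit on the coordinates converted to radians
db = DBSCAN(eps=epsilon, min_samples=5).fit(np.radians(X))
cluster_labels = db.labels_
num_clusters = len(set(cluster_labels))  # note: this count includes the noise label -1

# Mark noise points (label -1) as NaN
cluster_labels = cluster_labels.astype(float)
cluster_labels[cluster_labels == -1] = np.nan

# One array of points per cluster label
clusters = pd.Series([X[cluster_labels == n] for n in range(num_clusters)])

labels = pd.DataFrame(db.labels_, columns=['CLUSTER_LABEL'])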


How do I get the center point of each cluster and map it back to the dataset, so that when I display the data in folium with a marker, the summary starts there?

So far I have tried

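from geopy.distance import great_circle
from shapely.geometry import MultiPoint

def get_centermost_point(cluster):
    centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
    # Pick the actual data point closest to the centroid
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return tuple(centermost_point)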

centermost_points =

which gives me an IndexError: list index out of range error.

Asked By: Ani



To get the coordinates of each cluster’s centroid:

from shapely.geometry import MultiPoint

for ea in clusters:
    print(MultiPoint(ea).centroid)

POINT (12.85585784912 77.79859915316)
POINT (12.86547048333333 77.79709629166666)
POINT (13.1982603551 77.70706457576)
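
For a set of points, the centroid that MultiPoint reports is the arithmetic mean of the coordinates. The same computation can be sketched without any dependencies. This is only an illustration: cluster_centroid is a hypothetical helper name, the sample clusters are made up, and the helper returns None for an empty cluster such as the extra one created when the noise label -1 is counted in num_clusters.

```python
def cluster_centroid(points):
    """Return the mean (first, second) coordinate of a cluster, or None if it is empty."""
    points = list(points)
    if not points:
        return None  # e.g. the empty cluster created for the noise label
    n = len(points)
    mean_first = sum(p[0] for p in points) / n
    mean_second = sum(p[1] for p in points) / n
    return (mean_first, mean_second)


# Hypothetical sample clusters in (latitude, longitude) order
sample_clusters = [
    [(12.0, 77.0), (14.0, 79.0)],
    [],
]
centroids = [cluster_centroid(c) for c in sample_clusters]
```

Returning None for the empty cluster makes it easy to filter those rows out before plotting or merging.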

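To create a GeoDataFrame from the centroids and plot it:
(This assumes the coordinates are in long/lat order. The array X in the question is in lat/long order, so the columns would need swapping for a geographically correct plot.)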

import geopandas as gpd
from shapely.geometry import MultiPoint

# To create a geodataframe of the centroids
clusters_centroids = [MultiPoint(ea).centroid for ea in clusters]
crs = "EPSG:4326"  # the older {'init': 'epsg:4326'} form is deprecated
cgdf = gpd.GeoDataFrame(clusters, crs=crs, geometry=clusters_centroids)
# Eliminate some empty row(s)
good_cgdf = cgdf[~cgdf['geometry'].is_empty]

# plot to see the centroids
good_cgdf.plot()

[Figure: output plot of the cluster centroids]
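
Separately from the plot, the question's goal of picking an actual data point nearest to each centroid can be sketched without shapely or geopy. The version below is a minimal sketch under stated assumptions, not the asker's or answerer's code: haversine_km and centermost_point are hypothetical helper names, points are assumed to be (latitude, longitude) pairs in degrees, and empty clusters return None, which may avoid the IndexError if it comes from the empty noise cluster.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # same mean Earth radius as kms_per_radian in the question


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def centermost_point(points):
    """Return the member of the cluster closest to its mean coordinate, or None if empty."""
    points = [tuple(p) for p in points]
    if not points:
        return None  # skip empty clusters instead of raising
    centroid = (sum(p[0] for p in points) / len(points),
                sum(p[1] for p in points) / len(points))
    return min(points, key=lambda p: haversine_km(p, centroid))


# Hypothetical cluster: the middle point is nearest to the mean
sample = [(13.195, 77.705), (13.198, 77.707), (13.201, 77.709)]
representative = centermost_point(sample)
```

Using a real member of the cluster rather than the raw centroid guarantees that the marker lands on a location that exists in the data.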


Answered By: swatchai

To add the center points back into the original dataframe df, start by checking dfnew, which is simply df with an added CLUSTER_LABEL column.


    user_id   latitude  longitude  CLUSTER_LABEL
0        55  13.263394  75.434141             -1
1        55  13.263396  75.434138             -1
2       356  12.809677  77.695516             -1
3       356  12.809921  77.695234             -1
4       356  12.810059  77.695263             -1
..      ...        ...        ...            ...
76     9271  13.064171  77.746333             -1
77     9896  13.201384  77.708284              2
78     9991  13.115466  77.606998             -1
79     9991  13.195747  77.705557              2
80     9991  13.232903  77.695669             -1

[81 rows x 4 columns]

The CLUSTER_LABEL column will be used as the join key to pull values from the cgdf dataframe.

Add a CLUSTER_LABEL column to cgdf holding each cluster’s label. The last row is the empty cluster, so it gets the noise label -1:

cgdf["CLUSTER_LABEL"] = [0, 1, 2, -1]

Drop column 0 of cgdf, which holds the raw point arrays:

cgdf.drop(columns=[0], inplace=True)

Check the current cgdf:


                geometry  CLUSTER_LABEL
0  POINT (12.856 77.799)              0
1  POINT (12.865 77.797)              1
2  POINT (13.198 77.707)              2
3            POINT EMPTY             -1

Merge the two dataframes into a new dataframe, dfnew2:

dfnew2 = dfnew.merge(cgdf, on='CLUSTER_LABEL')

Check the current state of dfnew2. It should look like this:

    user_id   latitude  longitude  CLUSTER_LABEL               geometry
0        55  13.263394  75.434141             -1            POINT EMPTY
1        55  13.263396  75.434138             -1            POINT EMPTY
2       356  12.809677  77.695516             -1            POINT EMPTY
3       356  12.809921  77.695234             -1            POINT EMPTY
4       356  12.810059  77.695263             -1            POINT EMPTY
..      ...        ...        ...            ...                    ...
76     4594  13.198635  77.706593              2  POINT (13.198 77.707)
77     6886  13.196168  77.705323              2  POINT (13.198 77.707)
78     6886  13.199368  77.709566              2  POINT (13.198 77.707)
79     9896  13.201384  77.708284              2  POINT (13.198 77.707)
80     9991  13.195747  77.705557              2  POINT (13.198 77.707)

[81 rows x 5 columns]

dfnew2 should be equivalent to the original dataframe, with two additional columns: CLUSTER_LABEL and geometry (the cluster’s center point).
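
In the merged output above, the rows are no longer in their original order. If the original order matters, a left join is one option. The following is a minimal sketch with made-up miniature data: the frames points and centers and the columns center_lat and center_lon are hypothetical stand-ins for dfnew and the centroid table, and the noise label -1 is deliberately left out of centers.

```python
import pandas as pd

# Hypothetical miniature versions of dfnew and the centroid table
points = pd.DataFrame({
    "user_id": [55, 9896, 356, 9991],
    "CLUSTER_LABEL": [-1, 2, -1, 2],
})
centers = pd.DataFrame({
    "CLUSTER_LABEL": [0, 1, 2],
    "center_lat": [12.856, 12.865, 13.198],
    "center_lon": [77.799, 77.797, 77.707],
})

# A left join keeps every original row in its original order;
# noise rows (label -1) have no match and get NaN centers.
merged = points.merge(centers, on="CLUSTER_LABEL", how="left")
print(merged)
```

Noise rows can then be filtered with merged["center_lat"].notna() before adding markers.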

Answered By: swatchai

import pandas as pd
from sklearn.cluster import KMeans

def kmeans_centers(list_of_lats_lngs):  # input: list of [lat, lng] pairs
    try:
        data = pd.DataFrame(list_of_lats_lngs, columns=['lat', 'lng'])
        # Fit k-means with 3 clusters on the coordinate columns
        kmeans = KMeans(n_clusters=3, init='k-means++')
        data['cluster_label'] = kmeans.fit_predict(data[['lat', 'lng']])
        centers = kmeans.cluster_centers_  # Coordinates of cluster centers.
        # labels = kmeans.predict(data[['lat', 'lng']])  # Labels of each point
        return centers
    except Exception as e:
        print("kmeans - Clustering exception", e)
        return None

Ready to use.


Answered By: gamingflexer