Get inertia for nltk k means clustering using cosine_similarity

Question:

I have used nltk for k-means clustering because I would like to change the distance metric. Does nltk's k-means have an inertia value similar to sklearn's? I can't seem to find it in their documentation or online…

The code below is how people usually find inertia using sklearn k means.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# features: the feature matrix being clustered (defined elsewhere)
inertia = []
k_values = range(2, 26)
for n_clusters in k_values:
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(features)
    centers = clusterer.cluster_centers_
    inertia.append(clusterer.inertia_)

plt.plot(list(k_values), inertia, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
Asked By: atjw94


Answers:

You can write your own function to obtain the inertia for the KMeansClusterer in nltk.

This follows on from your earlier question, "How do I obtain individual centroids of K mean cluster using nltk (python)". I use the same dummy data, after clustering it into 2 clusters.

Referring to the docs (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), inertia is the sum of squared distances of samples to their closest cluster center.
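
To make that definition concrete, here is a minimal pure-Python sketch on a tiny made-up dataset. The names `inertia`, `points`, `labels` and `centers` are illustrative only and do not come from nltk or scikit-learn.

```python
def inertia(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centers[label]  # the center of the cluster this point belongs to
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total

# Hypothetical toy data: two clusters of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
centers = [(0.0, 1.0), (10.0, 1.0)]

toy_inertia = inertia(points, labels, centers)
print(toy_inertia)  # 4.0, since every point is exactly 1 unit from its center
```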

import numpy as np
from sklearn.cluster import KMeans

# df is the dummy DataFrame from the earlier question. Its 'centroid' column is
# assumed to hold, for every row, the centroid of the cluster that row belongs to.
feature_matrix = df[['feature1', 'feature2', 'feature3']].to_numpy()
centroid = df['centroid'].to_numpy()

def nltk_inertia(feature_matrix, centroid):
    # Inertia as given in the scikit-learn docs: the sum of squared distances
    # between each sample and its cluster centroid.
    sum_ = []
    for i in range(feature_matrix.shape[0]):
        sum_.append(np.sum((feature_matrix[i] - centroid[i]) ** 2))
    return sum(sum_)

nltk_inertia(feature_matrix, centroid)
# output: 27.495250000000002

# Now use scikit-learn KMeans on feature1, feature2 and feature3 with the same number of clusters (2)
scikit_kmeans = KMeans(n_clusters=2)
scikit_kmeans.fit(vectors)  # vectors = [np.array(f) for f in df.values], which contains feature1, feature2, feature3
scikit_kmeans.inertia_
# output: 27.495250000000006
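
Since the question is about a non-Euclidean metric, here is a rough end-to-end sketch with nltk's `KMeansClusterer` and `cosine_distance`. The toy vectors are made up, and summing squared cosine distances is only an assumed analogue of sklearn's `inertia_` (which is defined for Euclidean distance), not something nltk reports itself.

```python
import random

import numpy as np
from nltk.cluster import KMeansClusterer
from nltk.cluster.util import cosine_distance

# Hypothetical toy data: two groups of vectors pointing in different directions.
vectors = [np.array(v) for v in (
    [1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
    [0.1, 1.0], [0.2, 0.9], [0.0, 1.0],
)]

clusterer = KMeansClusterer(
    2,
    distance=cosine_distance,
    repeats=5,
    rng=random.Random(0),       # fixed seed so runs are repeatable
    avoid_empty_clusters=True,
)
labels = clusterer.cluster(vectors, assign_clusters=True)  # one cluster index per vector
means = clusterer.means()                                  # one centroid per cluster

# Inertia-like score: squared cosine distance of each vector to its own centroid.
cosine_inertia = sum(
    cosine_distance(v, means[label]) ** 2 for v, label in zip(vectors, labels)
)
print(labels, cosine_inertia)
```

Running this for several values of the cluster count and plotting `cosine_inertia` would give an elbow plot comparable in spirit to the sklearn one in the question.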
Answered By: qaiser

The previous answer is actually missing a small detail:

feature_matrix = df[['feature1','feature2','feature3']].to_numpy()
centroid = df['centroid'].to_numpy()
cluster = df['predicted_cluster'].to_numpy()

def nltk_inertia(feature_matrix, centroid, cluster):
    sum_ = []
    for i in range(feature_matrix.shape[0]):
        # cluster[i] picks the centroid of the cluster that sample i was assigned to
        sum_.append(np.sum((feature_matrix[i] - centroid[cluster[i]])**2))

    return sum(sum_)

You have to select the corresponding cluster centroid when calculating distance between centroids and data points. Notice the cluster variable in the above code.
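
Which indexing is right depends on how the centroids are stored. A small NumPy sketch with made-up data (all names here are illustrative): if there is one centroid per cluster, it has to be looked up through the cluster label; if every row already carries its own centroid, indexing by row gives the same number.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 clusters.
features = np.array([[1.0, 1.0], [1.0, 3.0], [5.0, 5.0], [7.0, 5.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[1.0, 2.0], [6.0, 5.0]])  # one row per cluster

# Layout A: one centroid per cluster, selected through each sample's label.
inertia_by_label = float(sum(
    np.sum((features[i] - centers[labels[i]]) ** 2) for i in range(len(features))
))

# Layout B: one centroid per sample (e.g. a per-row 'centroid' column), indexed by row.
per_row_centroids = centers[labels]  # shape (4, 2), row i is the centroid of sample i
inertia_per_row = float(sum(
    np.sum((features[i] - per_row_centroids[i]) ** 2) for i in range(len(features))
))

print(inertia_by_label, inertia_per_row)  # 4.0 4.0
```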

Answered By: Salih Kılıçlı

@qaiser's answer is the simple solution. @Salih Kilicli, if you pay attention to how the centroids were kept in the sample dataframe, you will see that qaiser's solution is correct.

Answered By: mert özlütıraş