How do I automate the number of clusters?

Question:

I’ve been playing with the script below:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import textract
import os

folder_to_scan = '/media/sf_Documents/clustering'
dict_of_docs = {}

# Gets all the files to scan with textract
for root, sub, files in os.walk(folder_to_scan):
    for file in files:
        full_path = os.path.join(root, file)
        print(f'Processing {file}')
        try:
            text = textract.process(full_path)
            dict_of_docs[file] = text
        except Exception as e:
            print(e)


vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dict_of_docs.values())

true_k = 3
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i,)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind],)

It scans a folder of scanned document images, extracts the text with textract, and then clusters the text. I know for a fact there are 3 different types of documents, so I set true_k to 3. But what if I had a folder of unknown documents, where there could be anything from 1 to hundreds of different document types?

Asked By: Ari


Answers:

This is a slippery field, because it is very difficult to measure how well a clustering algorithm performs without any ground-truth labels. To make the selection automatic, you need a metric that compares how KMeans performs for different values of n_clusters.

A popular choice is the silhouette score. Here is how the scikit-learn documentation defines it:

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that Silhouette Coefficient is only defined if number of labels is 2 <= n_labels <= n_samples - 1.

As a result, you can only compute the silhouette score for n_clusters >= 2 (which, given your problem description, might unfortunately be a limitation for you).
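To make the definition above concrete, here is a minimal sketch (the toy points and labels are made up purely for illustration) showing that silhouette_score is simply the mean of the per-sample coefficients returned by silhouette_samples:

import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two tight, well-separated blobs of three points each (toy data for illustration)
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])

# Per-sample coefficients, each computed as (b - a) / max(a, b)
per_sample = silhouette_samples(X_toy, labels_toy)
print(per_sample)         # all close to 1: compact, well-separated clusters

# silhouette_score is the mean of the per-sample coefficients
print(silhouette_score(X_toy, labels_toy))
print(per_sample.mean())  # same value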

Here is how you would use it on a dummy dataset (you can then adapt it to your code; this is just to have a reproducible example):

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

iris = load_iris()
X = iris.data

sil_score_max = -1  # this is the minimum possible score

for n_clusters in range(2, 10):
    model = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    labels = model.fit_predict(X)
    sil_score = silhouette_score(X, labels)
    print("The average silhouette score for %i clusters is %0.2f" % (n_clusters, sil_score))
    # Keep track of the number of clusters with the highest score
    if sil_score > sil_score_max:
        sil_score_max = sil_score
        best_n_clusters = n_clusters

This will return:

The average silhouette score for 2 clusters is 0.68
The average silhouette score for 3 clusters is 0.55
The average silhouette score for 4 clusters is 0.50
The average silhouette score for 5 clusters is 0.49
The average silhouette score for 6 clusters is 0.36
The average silhouette score for 7 clusters is 0.46
The average silhouette score for 8 clusters is 0.34
The average silhouette score for 9 clusters is 0.31

And thus you will have best_n_clusters = 2 (NB: in reality, Iris has three classes…)
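To adapt this to your original script, you can run the same search over the TF-IDF matrix X you already build from dict_of_docs. This is only a sketch: the upper bound of 20 clusters is an assumption you should adjust to however many document types you expect, and the range must stay below the number of documents you have.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X is the TF-IDF matrix from your script:
# X = vectorizer.fit_transform(dict_of_docs.values())

best_n_clusters = 2
sil_score_max = -1  # silhouette scores lie in [-1, 1]

# The upper bound of the search range (20 here) is an assumption;
# it must not exceed the number of documents minus one.
for n_clusters in range(2, 21):
    model = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    labels = model.fit_predict(X)
    sil_score = silhouette_score(X, labels)
    if sil_score > sil_score_max:
        sil_score_max = sil_score
        best_n_clusters = n_clusters

print('Best number of clusters by silhouette score:', best_n_clusters)

You can then rerun KMeans with n_clusters=best_n_clusters and print the top terms per cluster exactly as in your original loop.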

Answered By: MaximeKan