Learning: KMeans clustering inconsistent results

Question:

Learning ML and I’m new to KMeans clustering. How do I know if my model is accurate with the consistently inconsistent results that I’m getting?

What I mean by consistently inconsistent is I get the exact same set of 4 results but they appear randomly.

Setup (Jupyter Notebook):
I’m using the iris dataset from sklearn

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix

data = load_iris()
X = data.data
y = data.target

# I run this "cell" repeatedly and get varying results
model = KMeans(n_clusters=3)
model.fit(X)
print(model.cluster_centers_)
print(classification_report(y, model.labels_))
print(confusion_matrix(y, model.labels_))

Results are consistently inconsistent but here are all the results that I get:

[[6.85       3.07368421 5.74210526 2.07105263]
 [5.006      3.428      1.462      0.246     ]
 [5.9016129  2.7483871  4.39354839 1.43387097]]

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        50
           1       0.00      0.00      0.00        50
           2       0.23      0.28      0.25        50

    accuracy                           0.09       150
   macro avg       0.08      0.09      0.08       150
weighted avg       0.08      0.09      0.08       150

[[ 0 50  0]
 [ 2  0 48]
 [36  0 14]]

[[5.006      3.428      1.462      0.246     ]
 [5.9016129  2.7483871  4.39354839 1.43387097]
 [6.85       3.07368421 5.74210526 2.07105263]]

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       0.77      0.96      0.86        50
           2       0.95      0.72      0.82        50

    accuracy                           0.89       150
   macro avg       0.91      0.89      0.89       150
weighted avg       0.91      0.89      0.89       150

[[50  0  0]
 [ 0 48  2]
 [ 0 14 36]]


[[5.9016129  2.7483871  4.39354839 1.43387097]
 [5.006      3.428      1.462      0.246     ]
 [6.85       3.07368421 5.74210526 2.07105263]]

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        50
           1       0.00      0.00      0.00        50
           2       0.95      0.72      0.82        50

    accuracy                           0.24       150
   macro avg       0.32      0.24      0.27       150
weighted avg       0.32      0.24      0.27       150

[[ 0 50  0]
 [48  0  2]
 [14  0 36]]

[[5.006      3.428      1.462      0.246     ]
 [6.85       3.07368421 5.74210526 2.07105263]
 [5.9016129  2.7483871  4.39354839 1.43387097]]

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       0.05      0.04      0.05        50
           2       0.23      0.28      0.25        50

    accuracy                           0.44       150
   macro avg       0.43      0.44      0.43       150
weighted avg       0.43      0.44      0.43       150

[[50  0  0]
 [ 0  2 48]
 [ 0 36 14]]

My question is how do I know if my model is accurate or not? Is there a pattern in the results that I’m not seeing or don’t know how to interpret?

Any help would be appreciated. Thank you in advance 🙂

PS I know that this is an unsupervised algorithm, so classification reports and confusion matrices have little to no value, but they did highlight an oddity that prompted this question. I’ve also added the cluster centers to the output, and they are consistently inconsistent as well.

Asked By: Linus


Answers:

  1. Your results look consistent to me. Every time you run K-Means, you get the same centroids. The only thing that changes is their order, which is arbitrary: there’s no special reason for any particular cluster to be labeled first, second, or third.

  2. To evaluate it, since your data is labeled (iris dataset), I would recommend checking how many items from each cluster correspond to the same label, or how many items with the same label end up in the same cluster. For example: are all Iris setosa in the same cluster, or are they distributed across more than one cluster?
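A minimal sketch of that check, assuming sklearn's `contingency_matrix` helper is an acceptable way to count the overlaps. The `n_init` and `random_state` values are my own illustrative choices, not something from the question:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics.cluster import contingency_matrix

data = load_iris()
X, y = data.data, data.target

# n_init and random_state are illustrative assumptions, chosen for repeatability
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# table[i, j] = number of flowers of species i that landed in cluster j.
# The column order depends on the run, because cluster ids are arbitrary.
table = contingency_matrix(y, model.labels_)

for name, row in zip(data.target_names, table):
    print(f"{name:<12}", row)
```

Each row should have most of its count in a single column. Which column that is can change from run to run, and that is exactly the "inconsistency" in the question.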

I guess you can compute precision/recall/F1 if you want, but you should first define which cluster corresponds to each species. I would start with a visual evaluation, since you have only three labels. But basically, you want a correlation between clusters and species (can the cluster predict the species?).
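One possible way to define that cluster-to-species mapping is sketched below. It uses the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) on the confusion matrix, which is my own choice of technique rather than the only option. The variable names and the `n_init`/`random_state` values are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix

X, y = load_iris(return_X_y=True)

# n_init and random_state are illustrative assumptions
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# cm[i, j] = number of samples of species i assigned to cluster j
cm = confusion_matrix(y, labels)

# Choose the one-to-one pairing of species and clusters with the largest overlap
species_idx, cluster_idx = linear_sum_assignment(cm, maximize=True)

# Lookup table: cluster id -> species id
cluster_to_species = np.empty(3, dtype=int)
cluster_to_species[cluster_idx] = species_idx

# Relabel every sample with the species matched to its cluster
aligned = cluster_to_species[labels]

print(confusion_matrix(y, aligned))
print(classification_report(y, aligned))
```

After the relabeling, the report no longer depends on the order in which KMeans happened to number the clusters, so repeated runs should print the same table.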

But in general, remember that KMeans forces structure onto your data, even if there is none (it was originally thought of as a compression algorithm, not as a clustering one). So, in many cases you don’t actually evaluate its performance, but just whether it is useful or not (for example, for feature generation).
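To illustrate the "forces structure" point, here is a small sketch on made-up data: uniformly random points with no real groups. The data, sizes, and seeds are all assumptions for the demo:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: 300 points spread uniformly over the unit square,
# so there is no genuine cluster structure to find.
rng = np.random.default_rng(0)
X_noise = rng.uniform(size=(300, 2))

noise_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_noise)

# KMeans still reports three clusters and assigns every point to one of them
print(np.unique(noise_labels))
print(np.bincount(noise_labels))
```

The algorithm always returns `n_clusters` groups, so getting three clusters back says nothing by itself about whether three groups exist.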

PS (after some additional research):

There are two measures that can be quite useful for evaluating a clustering if you know the ground truth: mutual information (MI) and the adjusted Rand index. MI measures the gain in information about one variable from knowing the other one (how many yes/no questions about one variable you can answer by knowing the other). It is a non-linear measure of correlation between variables. The Rand index is a measure of the similarity between two clusterings of the same data (similar to an accuracy metric), and the adjusted version is corrected for agreement by chance. Both mutual information and the adjusted Rand index can be computed with sklearn.
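A short sketch of both measures with sklearn. I picked `adjusted_mutual_info_score` as the MI variant (sklearn also has `mutual_info_score` and `normalized_mutual_info_score`). The seeds and `n_init` are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

X, y = load_iris(return_X_y=True)

scores = []
for seed in range(4):
    # Different seeds may number the clusters differently
    labels = KMeans(n_clusters=3, n_init=10, random_state=seed).fit_predict(X)
    ari = adjusted_rand_score(y, labels)
    ami = adjusted_mutual_info_score(y, labels)
    scores.append((ari, ami))
    print(f"seed={seed}  ARI={ari:.3f}  AMI={ami:.3f}")

# Renaming the cluster ids (0 -> 2, 1 -> 0, 2 -> 1) leaves the score unchanged
perm = np.array([2, 0, 1])
ari_renamed = adjusted_rand_score(y, perm[labels])
print(f"ARI after renaming clusters: {ari_renamed:.3f}")
```

Both scores compare the grouping itself and ignore the cluster ids, so they avoid the ordering problem that makes the classification report jump around.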

Answered By: Ignatius Reilly