What do these lines of code in K-means clustering mean?

Question:

I am learning K-means clustering and am confused about how plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1') works. What is the purpose of X[y_kmeans == 0, 0] and X[y_kmeans == 0, 1] in the code?

Full code:

#k-means

#importing libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#importing the dataset
dataset = pd.read_csv("mall_customers.csv")
X = dataset.iloc[:,[3,4]].values

#using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
wcss = []  # within-cluster sum of squares (WCSS) for each candidate number of clusters

for i in range(1,11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++',max_iter = 300,n_init=10,random_state = 0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.plot(range(1,11),wcss)
plt.title("The elbow method")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
plt.show()

#applying k-means to the whole dataset
kmeans = KMeans(n_clusters = 5,init = 'k-means++', max_iter=300,n_init=10,random_state=0)
y_kmeans = kmeans.fit_predict(X)

#Visualising the clusters: one scatter call per cluster label, then the centroids
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')

plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

I have added the output images for reference: the elbow graph and the final cluster plot.

Asked By: user10064176

||

Answers:

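That’s a filter (a boolean mask). y_kmeans == 0 produces a boolean array that is True at every position i where y_kmeans[i] is equal to 0. X[y_kmeans == 0, 0] then selects the rows of X whose corresponding y_kmeans value is 0 and, from those rows, takes index 0 along the second dimension, that is, the first column.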
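
To make the filter concrete, here is a minimal sketch on a made-up 5-row array. The values in X and y_kmeans below are invented for illustration and are not taken from mall_customers.csv; only the indexing pattern matches the question.

```python
import numpy as np

# Toy stand-ins for the question's X and y_kmeans (values are made up).
X = np.array([[15, 39],
              [16, 81],
              [17,  6],
              [18, 77],
              [19, 40]])
y_kmeans = np.array([0, 1, 0, 1, 2])

mask = y_kmeans == 0      # True where the row was assigned to cluster 0
x_coords = X[mask, 0]     # column 0 of the selected rows
y_coords = X[mask, 1]     # column 1 of the same rows

print(mask)               # [ True False  True False False]
print(x_coords, y_coords) # [15 17] [39  6]
```

Rows 0 and 2 carry label 0, so only their two column values survive the filter.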

Originally answered by tim roberts

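In X[y_hc == 1, 0], the trailing 0 selects the column that is plotted on the x-axis; in X[y_hc == 0, 1], the trailing 1 selects the column that is plotted on the y-axis.
The number compared against y_hc (for example the 1 in y_hc == 1) is the cluster label, that is, the value of y_hc[i] a row must have to be selected.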
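
As a rough sketch of how the two roles differ, the loop below varies the cluster label while keeping the column indices fixed at 0 (x-axis) and 1 (y-axis). The arrays and the names labels and coords are made up for illustration; in the real script each (xs, ys) pair is what gets passed to plt.scatter.

```python
import numpy as np

# Made-up data: six points with two features and their cluster labels.
X = np.array([[15, 39], [16, 81], [17, 6], [18, 77], [19, 40], [20, 76]])
labels = np.array([0, 1, 0, 1, 2, 1])

coords = {}
for cluster in np.unique(labels):
    in_cluster = labels == cluster   # the compared value is the cluster label
    xs = X[in_cluster, 0]            # trailing 0 -> x-axis column
    ys = X[in_cluster, 1]            # trailing 1 -> y-axis column
    coords[int(cluster)] = (xs, ys)  # the pair one would hand to plt.scatter(xs, ys, ...)

for cluster, (xs, ys) in coords.items():
    print(cluster, xs, ys)
```

A loop like this could also replace the five near-identical plt.scatter lines in the question, at the cost of picking colours and labels from a list.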

Answered By: user10064176

X[y_kmeans == 0, 0] :

It’s a filter that works as explained below:

Remember that y_kmeans contains the result of your clustering model, where the 5 clusters are represented as cluster 0, cluster 1 … cluster 4.

First, y_kmeans == 0 compares every element of y_kmeans with 0. It returns a boolean array with True for the elements assigned to cluster 0 and False for all others. The expression therefore becomes X[[True, False, etc…], 0]: the first item in the brackets is the boolean array just described, and the second item (the 0) is the column index, i.e. the feature. Also remember that a scatter plot needs two values per point (x and y). In the Iris dataset, for example, x could be the sepal length and y the sepal width; in your code, column 0 is plotted as Annual Income and column 1 as Spending Score.

So the first line
X[y_kmeans == 0,0],X[y_kmeans == 0,1]
is evaluated as X[[True, False…],0] and X[[True, False…],1], where the trailing 0 or 1 is the column index in your array X. Each boolean value is mapped to the corresponding row of X: if the value is True, that row is selected and its value in the given column is returned. So you will have something like this:

X[[False, False, False, False, False, False, False, False, False,
   False, False, False, False, False, False, False, False, False,
   False, False, False, False, False, False, False, False, False,
   False, False, False, False, False, False, False, False, False,
   False, False, False, False, False, False, False, False, False,
   False, False, False, False, False,  True,  True, False,  True,
    True,  True,  True,  True,  True,  True,  True,  True,  True,
    True,  True,  True,  True,  True,  True,  True,  True,  True,
    True,  True,  True,  True,  True, False,  True,  True,  True,
    True,  True,  True,  True,  True,  True,  True,  True,  True,
    True,  True,  True,  True,  True,  True,  True,  True,  True,
    True, False,  True, False, False, False, False,  True, False,
   False, False, False, False, False,  True,  True, False, False,
   False, False,  True, False,  True, False,  True, False, False,
    True,  True, False, False, False, False, False,  True, False,
   False, False, False,  True, False, False, False,  True, False,
   False, False,  True, False, False,  True],0]
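
One way to read the combined index is as two separate steps: first keep the rows marked True, then take one column from what is left. The sketch below checks that both readings give the same result; the small arrays are made up, and np.flatnonzero is used only to show which row positions the mask picks.

```python
import numpy as np

# Made-up data standing in for X and one boolean mask such as y_kmeans == 0.
X = np.array([[10, 1], [20, 2], [30, 3], [40, 4]])
mask = np.array([False, True, False, True])

selected_rows = X[mask]              # step 1: keep rows 1 and 3 (both columns)
first_column = selected_rows[:, 0]   # step 2: take column 0 of those rows
one_step = X[mask, 0]                # the combined form used in the question

print(np.flatnonzero(mask))          # row positions where mask is True: [1 3]
print(first_column, one_step)        # [20 40] [20 40]
```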

Note that the number of rows in X must be equal to the number of elements in y_kmeans, because each boolean value is matched to exactly one row.
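
The length requirement can be seen with a small experiment. This is only a sketch with made-up arrays (the names good and bad are hypothetical): a mask with one entry per row works, while a mask that is one element short is rejected by NumPy with an IndexError.

```python
import numpy as np

X = np.arange(10).reshape(5, 2)   # 5 rows, 2 columns: [[0, 1], [2, 3], ...]
good = np.array([True, False, True, False, False])  # one flag per row
bad = good[:4]                    # one flag too few

print(X[good, 0])                 # [0 4]

error = None
try:
    X[bad, 0]                     # mask length 4 does not match 5 rows
except IndexError as exc:
    error = exc
    print("IndexError:", exc)
```

Because fit_predict(X) returns one label per row of X, the lengths match automatically in the question's code; a mismatch usually means X was filtered or reloaded after the labels were computed.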

Answered By: junior