How to find the within-cluster sum of squares for each cluster after normalizing the data with the scale function and building a K-means model?

Question:

Consider the dataset “USArrests.csv”. This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percentage of the population living in urban areas.

Variables Description

  • States: The state where the incident occurred
  • Murder: No. of arrests for murder (per 100,000 residents)
  • Assault: No. of arrests for assault (per 100,000 residents)
  • UrbanPop: Percentage of urban population
  • Rape: No. of arrests for rape (per 100,000 residents)

Set the column States as the index of the data frame while reading the data. Set the random seed with set.seed(123). Normalize the data using the scale function and build the K-means model with the given conditions:

  • number of clusters = 4
  • nstart=20

According to the built model, the within-cluster sum of squares for each cluster is __
(the order of values in each option could be different)

I am stuck after importing the data and setting the seed. I am struggling to fit the K-means model.

Asked By: wildDog


Answers:

I am happy, in some way, after looking at the dataset. If I am not wrong, this dataset is taken from Kaggle.

Anyway, I am using R to run the code here; I hope you are familiar with it. If not, the concept is very similar in other languages. Try to understand the code and rewrite it in the language you are comfortable with.

After the necessary setup and import:

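# Read the CSV and use the States column as row names
data <- read.csv("USArrests.csv", header = TRUE, row.names = "States")
df <- scale(data)    # standardise each column (mean 0, sd 1)
set.seed(123)        # make the random starts reproducible
fit <- kmeans(df, centers = 4, nstart = 20)
print(fit$withinss)  # within-cluster sum of squares, one value per cluster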

The output should be 8.316061 11.952463 16.212213 19.922437, although the order of the values may differ because the cluster numbering is arbitrary.
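If you want to see where those numbers come from, here is a minimal sketch that recomputes the within-cluster sum of squares by hand and compares it with fit$withinss. It assumes that R's built-in USArrests data set (from the datasets package) holds the same values as the CSV in the question, so no file is needed; manual_wss, members and centred are names made up for this illustration.

```r
# Assumption: the built-in USArrests data set matches "USArrests.csv".
df <- scale(USArrests)   # standardise each column (mean 0, sd 1)

set.seed(123)
fit <- kmeans(df, centers = 4, nstart = 20)

# For each cluster, sum the squared deviations of its rows
# from the cluster's own column means.
manual_wss <- sapply(seq_len(4), function(k) {
  members <- df[fit$cluster == k, , drop = FALSE]
  centred <- sweep(members, 2, colMeans(members))
  sum(centred^2)
})

# Cluster numbering is arbitrary, so sort before comparing with the options.
print(sort(fit$withinss))
print(all.equal(manual_wss, fit$withinss))
```

Sorting the values makes it easier to match them against the answer options, since the question notes that the order may differ.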

Feel free to comment if you don’t understand or find a mistake.

Answered By: 01001010