Calculating optimal K value in K-means clustering with elbow curve

Question:

I performed K-means clustering with a variety of k values and recorded the inertia for each one (inertia being, to my knowledge, the sum of squared distances from each sample to its closest cluster center):

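import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# trialsX is the feature matrix being clustered
ks = range(1, 30)
inertias = []
for k in ks:
    # fit K-means with k clusters and record its inertia
    km = KMeans(n_clusters=k).fit(trialsX)
    inertias.append(km.inertia_)

plt.plot(ks, inertias)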

[Figure: inertia plotted against k, which is an elbow plot]

Based on my reading, the optimal k value lies at the ‘elbow’ of this plot, but calculating the elbow has proven elusive. How can I use this data to calculate k programmatically?

Asked By: Warlax56


Answers:

I’ll post this, because it’s the best I have come up with thus far:

It seems like a threshold scaled to the range of the first derivative along the curve might do a good job. The derivative can be obtained by fitting a spline:

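import numpy as np
from scipy.interpolate import UnivariateSpline

# interpolating quartic spline through the (k, inertia) points
y_spl = UnivariateSpline(ks, inertias, s=0, k=4)
x_range = np.linspace(ks[0], ks[-1], 1000)

# first derivative of the spline
y_spl_1d = y_spl.derivative(n=1)

plt.plot(x_range, y_spl_1d(x_range))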

[Figure: first derivative of the inertia curve]

Then you can probably define k as the point that sits, say, 90% of the way up this curve. I would imagine this is a pretty consistent way to do it, but there may be a better option.
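One way to read "90% up this curve" is: take the range of the first derivative (from its most negative value to its largest), and pick the first k at which the slope has climbed 90% of the way through that range. The sketch below follows that reading. The helper name `elbow_from_spline`, the `frac` and `n_points` parameters, and the synthetic `1000 / k` inertia values are all assumptions made for illustration, not part of the original code.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline


def elbow_from_spline(ks, inertias, frac=0.9, n_points=1000):
    """Hypothetical helper: first k where the spline's slope has risen
    `frac` of the way from its minimum to its maximum."""
    ks = np.asarray(ks, dtype=float)
    inertias = np.asarray(inertias, dtype=float)

    # Same interpolating quartic spline as above (s=0 passes through every point).
    spline = UnivariateSpline(ks, inertias, s=0, k=4)
    x = np.linspace(ks[0], ks[-1], n_points)
    slope = spline.derivative(n=1)(x)

    # Threshold scaled to the range of the first derivative.
    threshold = slope.min() + frac * (slope.max() - slope.min())

    # np.argmax on a boolean array returns the index of the first True.
    idx = int(np.argmax(slope >= threshold))
    return int(round(x[idx]))


# Synthetic stand-in for the real inertias (assumption: a smooth 1/k decay).
ks = np.arange(1, 30)
inertias = 1000.0 / ks
k_est = elbow_from_spline(ks, inertias)
print(k_est)
```

Because the spline interpolates every point exactly, noisy inertias can make the derivative wiggle, so the chosen k may be sensitive to `frac`; treat the result as a starting point rather than a definitive answer.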

EDIT: 2 years later, just use np.diff to generate this plot without fitting a spline, then find the point where the slope equals -1. See the comments for more info.
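A minimal sketch of the np.diff approach might look like the following. A slope of -1 only means something if the two axes are on comparable scales, so this version rescales both k and inertia to [0, 1] first; that rescaling is an assumption of this sketch, as are the helper name `elbow_from_diff`, the `normalize` flag, and the synthetic inertia values.

```python
import numpy as np


def elbow_from_diff(ks, inertias, normalize=True):
    """Hypothetical helper: first k at which the finite-difference slope
    of the inertia curve is no steeper than -1."""
    ks = np.asarray(ks, dtype=float)
    inertias = np.asarray(inertias, dtype=float)

    x, y = ks, inertias
    if normalize:
        # Assumption: rescale both axes to [0, 1] so that a slope of -1
        # corresponds to a 45-degree descent on the normalized plot.
        x = (ks - ks.min()) / (ks.max() - ks.min())
        y = (inertias - inertias.min()) / (inertias.max() - inertias.min())

    # slopes[i] is the slope of the segment between ks[i] and ks[i + 1].
    slopes = np.diff(y) / np.diff(x)

    flat_enough = slopes >= -1.0
    if not flat_enough.any():
        # The curve never flattens to -1; fall back to the largest k tried.
        return int(ks[-1])
    # np.argmax on a boolean array returns the index of the first True.
    return int(ks[np.argmax(flat_enough)])


# Synthetic stand-in for the real inertias (assumption: a smooth 1/k decay).
ks = np.arange(1, 30)
inertias = 1000.0 / ks
k_est = elbow_from_diff(ks, inertias)
print(k_est)
```

Since an exact slope of -1 rarely falls on an integer k, the sketch returns the left end of the first segment whose slope is at or above -1.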

Answered By: Warlax56