Using cophenetic distance to choose best linkage method?

Question:

I have a dataset generated by the following code.

from sklearn import datasets
X_moons, y_moons = datasets.make_moons(n_samples=1000, noise=.07, random_state=42)

I would like to build a bottom-up (agglomerative) dendrogram in Python, so I must choose a linkage criterion. The available methods are listed in the documentation of the linkage function: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html

Any suggestions on how I can move forward? Is there a foolproof way to determine the best linkage?

I have tested the cophenetic distance for my dataset with each of the methods.
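
For reference, a test like the one described might look roughly like the sketch below. It assumes the cophenetic correlation coefficient from scipy.cluster.hierarchy.cophenet is the quantity being compared; the choice of the four methods and the variable names pairwise and coph_corr are illustrative, not taken from the original post.

```python
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist
from sklearn import datasets

X_moons, y_moons = datasets.make_moons(n_samples=1000, noise=.07, random_state=42)

# Condensed vector of pairwise Euclidean distances between the samples
pairwise = pdist(X_moons)

coph_corr = {}  # hypothetical name: linkage method -> cophenetic correlation
for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X_moons, method=method)
    # With the original distances supplied, cophenet returns the correlation
    # coefficient and the condensed cophenetic distance vector
    c, _ = cophenet(Z, pairwise)
    coph_corr[method] = c
    print(f"{method:>8}: {c:.3f}")
```

Note that this coefficient measures how faithfully the dendrogram preserves the original pairwise distances. It does not directly measure whether the resulting clusters follow the shape of the data, so the method with the highest value is not necessarily the one that separates the two moons.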

Answers:

There is no direct way to know in advance which linkage is best. However, the shape and spread of the data usually allow a good guess. For your data, single linkage will produce the best result.

  1. Single linkage works best if the clusters form chains. Complete linkage is more appropriate for data with globular (roughly spherical) clusters.
  2. If your data contains categorical variables, average, centroid, and Ward linkage may not work properly. Single or complete linkage is a better choice for data with categorical variables.
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

# Cluster X_moons (from the question) with each linkage and plot the labels side by side
linkages = ['single', 'complete', 'average', 'ward']
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
for ax, method in zip(axes, linkages):
    model = AgglomerativeClustering(n_clusters=2, linkage=method)
    labels = model.fit_predict(X_moons)
    ax.scatter(X_moons[:, 0], X_moons[:, 1], c=labels)
    ax.set_title(method)

plt.show()
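
Because make_moons also returns the true labels, the visual comparison can be backed by a number. The sketch below is an addition to the answer, not part of it: it assumes the adjusted Rand index from sklearn.metrics is an acceptable agreement measure, and the dictionary name ari is illustrative.

```python
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

X_moons, y_moons = datasets.make_moons(n_samples=1000, noise=.07, random_state=42)

ari = {}  # hypothetical name: linkage method -> adjusted Rand index
for method in ['single', 'complete', 'average', 'ward']:
    labels = AgglomerativeClustering(n_clusters=2, linkage=method).fit_predict(X_moons)
    # 1.0 means the clustering matches the true moons exactly;
    # values near 0 mean agreement no better than chance
    ari[method] = adjusted_rand_score(y_moons, labels)
    print(f"{method:>8}: {ari[method]:.3f}")
```

This check only works when ground-truth labels are available, as they are for synthetic data like this. For real data without labels, the visual inspection above remains the more practical route.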

Further Reading: https://www.youtube.com/watch?v=VMyXc3SiEqs

Answered By: amol goel