In Python hierarchical clustering from pairwise distances, how can I cut the tree at a specific distance and get the clusters and the list of members of each cluster?

Question:

I have pairwise distances data like this:

obj_distances = {
    ('DN1357_i2', 'DN1357_i5'): 1.0,
    ('DN1357_i2', 'DN10172_i1'): 28.0,
    ('DN1357_i2', 'DN1357_i1'): 8.0,
    ('DN1357_i5', 'DN1357_i1'): 2.0,
    ('DN1357_i5', 'DN10172_i1'): 34.0,
    ('DN1357_i1', 'DN10172_i1'): 38.0,
}

So I have 4 objects, which I clustered using this code:

from scipy.cluster.hierarchy import dendrogram, linkage

keys = [sorted(k) for k in obj_distances.keys()]
values = obj_distances.values()
sorted_keys, distances = zip(*sorted(zip(keys, values)))

Z = linkage(distances)

labels = sorted(set([key[0] for key in sorted_keys] + [sorted_keys[-1][-1]]))
dendro = dendrogram(Z, labels=labels)

It gives me a dendrogram. What code would give me the clusters and the names of the objects in each cluster if I cut the dendrogram at distance 2?

Asked By: MySky


Answers:

You can use SciPy's cut_tree function. Here’s an example for your data:

from scipy.cluster.hierarchy import cut_tree, dendrogram, linkage

obj_distances = {
    ('DN1357_i2', 'DN1357_i5'): 1.0,
    ('DN1357_i2', 'DN10172_i1'): 28.0,
    ('DN1357_i2', 'DN1357_i1'): 8.0,
    ('DN1357_i5', 'DN1357_i1'): 2.0,
    ('DN1357_i5', 'DN10172_i1'): 34.0,
    ('DN1357_i1', 'DN10172_i1'): 38.0,
}

keys = [sorted(k) for k in obj_distances.keys()]
values = obj_distances.values()
sorted_keys, distances = zip(*sorted(zip(keys, values)))

Z = linkage(distances)

labels = sorted(set([key[0] for key in sorted_keys] + [sorted_keys[-1][-1]]))
dendro = dendrogram(Z, labels=labels)

members = dendro['ivl']
clusters = cut_tree(Z, height=2)
cluster_ids = [c[0] for c in clusters]

for k in range(max(cluster_ids) + 1):
    print(f"Cluster {k}")
    for i, c in enumerate(cluster_ids):
        if c == k:
            print(f"{members[i]}")

    print('\n')

For cutting the tree at a height of 2, the output is:

Cluster 0
DN10172_i1


Cluster 1
DN1357_i1


Cluster 2
DN1357_i2
DN1357_i5
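
One assumption in the code above is worth making explicit: sorting the keys yields a valid condensed distance vector only when every pair of objects appears exactly once. A minimal sketch of a more defensive alternative, assuming the same obj_distances dict, fills a square matrix and lets squareform produce the condensed order. The names index, square and condensed are introduced here for illustration only.

```python
import numpy as np
from scipy.spatial.distance import squareform

obj_distances = {
    ('DN1357_i2', 'DN1357_i5'): 1.0,
    ('DN1357_i2', 'DN10172_i1'): 28.0,
    ('DN1357_i2', 'DN1357_i1'): 8.0,
    ('DN1357_i5', 'DN1357_i1'): 2.0,
    ('DN1357_i5', 'DN10172_i1'): 34.0,
    ('DN1357_i1', 'DN10172_i1'): 38.0,
}

# Every object name that occurs in any pair, in sorted order.
labels = sorted({name for pair in obj_distances for name in pair})
index = {name: i for i, name in enumerate(labels)}

# Fail loudly if a pair is missing instead of silently treating it as 0.
n = len(labels)
if len({frozenset(pair) for pair in obj_distances}) != n * (n - 1) // 2:
    raise ValueError("obj_distances does not contain every pair exactly once")

# Fill a symmetric matrix with a zero diagonal.
square = np.zeros((n, n))
for (a, b), d in obj_distances.items():
    square[index[a], index[b]] = d
    square[index[b], index[a]] = d

# squareform converts the square matrix to the condensed vector that
# linkage expects; row i of the matrix corresponds to labels[i].
condensed = squareform(square)
print(labels)
print(condensed)
```

The resulting condensed array can be passed to linkage in place of distances, with labels giving the observation order.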
Answered By: Leonardo Sirino

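The answer from @Leonardo Sirino gives me the right dendrogram but the wrong cluster results. The likely cause is that dendro['ivl'] lists the labels in the dendrogram's left-to-right leaf order, while cut_tree returns cluster ids in the original observation order (the order of labels), so the two only line up by coincidence.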

To reproduce this, rename the entities in obj_distances (DN1357_i2 becomes A, DN1357_i5 becomes B, DN10172_i1 becomes C, and DN1357_i1 becomes D),

i.e.

obj_distances = {
    ("A", "B"): 1.0,
    ("A", "C"): 28.0,
    ("A", "D"): 8.0,
    ("B", "D"): 2.0,
    ("B", "C"): 34.0,
    ("D", "C"): 38.0,
}

which is essentially the same obj_distances as in the question, with each entity replaced by A, B, C or D accordingly. This messes up the cluster result, giving

Cluster 0

  • C
  • D

Cluster 1

  • A

Cluster 2

  • B

But A and B should be together according to the dendrogram:

[Figure: dendrogram]
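
The mismatch can be checked directly. The sketch below, which assumes the A, B, C, D data from above written out as a condensed vector, compares the dendrogram's leaf order with the label order and pairs the cut_tree ids with labels instead of dendro['ivl']. The names condensed and leaf_order are introduced here for illustration; no_plot=True is used only so that nothing is drawn.

```python
import numpy as np
from scipy.cluster.hierarchy import cut_tree, dendrogram, linkage

# Observation order used by linkage and cut_tree.
labels = ["A", "B", "C", "D"]
# Condensed distances in the order (A,B), (A,C), (A,D), (B,C), (B,D), (C,D).
condensed = np.array([1.0, 28.0, 8.0, 34.0, 2.0, 38.0])

Z = linkage(condensed)  # single linkage is the default method

# 'ivl' holds the labels in the order the leaves appear in the plot,
# which need not be the observation order.
dendro = dendrogram(Z, labels=labels, no_plot=True)
leaf_order = dendro["ivl"]

# cut_tree returns one row per observation, in observation order.
cluster_ids = cut_tree(Z, height=2).ravel()

# Pair each id with `labels`, not with `leaf_order`.
clusters = {}
for label, cid in zip(labels, cluster_ids):
    clusters.setdefault(int(cid), []).append(label)

print("leaf order: ", leaf_order)
print("label order:", labels)
print(clusters)
```

With the ids matched against labels, A and B end up in the same cluster, which agrees with the dendrogram.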

Here’s what gives me the correct cluster result, consistent with the dendrogram:

Replace:

members = dendro['ivl']
clusters = cut_tree(Z, height=2)
cluster_ids = [c[0] for c in clusters]

for k in range(max(cluster_ids) + 1):
    print(f"Cluster {k}")
    for i, c in enumerate(cluster_ids):
        if c == k:
            print(f"{members[i]}")

    print('\n')

with this:

import pandas as pd
from scipy.cluster.hierarchy import fcluster
cluster_result = list(zip(labels, fcluster(Z, t=1, criterion="distance")))
dict(pd.DataFrame(cluster_result, columns=["user", "cluster_num"]).groupby("cluster_num").user.apply(list))
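
A note on the threshold: with criterion="distance", fcluster keeps observations together when their cophenetic distance is no greater than t, so t=1 and t=2 give different results here because D joins A and B at exactly 2. By contrast, cut_tree(Z, height=2) in the first answer left that merge unapplied. The sketch below shows both thresholds without pandas, assuming the same A, B, C, D data; members_by_cluster is a hypothetical helper name, not a SciPy function.

```python
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

labels = ["A", "B", "C", "D"]
# Condensed distances in the order (A,B), (A,C), (A,D), (B,C), (B,D), (C,D).
Z = linkage(np.array([1.0, 28.0, 8.0, 34.0, 2.0, 38.0]))


def members_by_cluster(Z, labels, t):
    """Group labels by the flat cluster id fcluster assigns at threshold t."""
    # fcluster returns one id per observation, in observation order,
    # so the ids line up with `labels`.
    ids = fcluster(Z, t=t, criterion="distance")
    groups = defaultdict(list)
    for label, cid in zip(labels, ids):
        groups[int(cid)].append(label)
    return dict(groups)


# t=1 keeps only the A-B merge; t=2 also includes D, which joins at 2.
print(members_by_cluster(Z, labels, t=1))
print(members_by_cluster(Z, labels, t=2))
```

Which threshold is right depends on whether a merge at exactly the cut distance should count as part of the cluster.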

Thank you @Leonardo Sirino for the answer that took me this far!

Answered By: Fangyuan Cao