Matching up the output of scipy linkage() and dendrogram()

Question:

I’m drawing dendrograms from scratch using the Z and P outputs of code like the following (see below for a fuller example):

Z = scipy.cluster.hierarchy.linkage(...)
P = scipy.cluster.hierarchy.dendrogram(Z, ..., no_plot=True)

To do what I want, I need to match a given index in P["icoord"]/P["dcoord"] (which contain the coordinates for drawing each cluster link in a plot) with the corresponding index in Z (which records which data elements are in which cluster), or vice versa. Unfortunately, the positions of clusters in P["icoord"]/P["dcoord"] do not, in general, match the corresponding positions in Z (the output of the code below demonstrates this).

The Question: what is a way that I could match them up? I need either a function Z_i = f(P_coords_i) or its inverse P_coords_i = g(Z_i) so that I can iterate over one list and easily access the corresponding elements in the other.


The code below generates 26 random points and labels them with the letters of the alphabet. It then prints the letters corresponding to the clusters represented by the rows of Z, followed by the points in P where dcoord is zero (i.e. the leaf nodes), to show that in general they don't match up. For example, the first element of Z corresponds to cluster iu, but the first set of points in P["icoord"]/P["dcoord"] draws the cluster for jy, and the one for iu doesn't come until a few elements later.

import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance
import string

# let's make some random data
np.random.seed(1)
data = np.random.multivariate_normal([0,0],[[5, 0], [0, 1]], 26)
letters = list(string.ascii_lowercase)
X = distance.pdist(data)


# here's the code I need to run for my use-case
Z = hierarchy.linkage(X)
P = hierarchy.dendrogram(Z, labels=letters, no_plot=True)


# let's look at the order of Z
print("Z:")

clusters = letters.copy()

for c1, c2, _, _ in Z:
    clusters.append(clusters[int(c1)]+clusters[int(c2)])
    print(clusters[-1])

# now let's look at the order of P["icoord"] and P["dcoord"]
print("\nP:")

def lookup(y, x):
    return "?" if y else P["ivl"][int((x-5)/10)]

for ((x1,x2,x3,x4),(y1,y2,y3,y4)) in zip(P["icoord"], P["dcoord"]):
    print(lookup(y1, x1)+lookup(y4, x4))

Output:

------Z:
iu
ez
niu
jy
ad
pr
bq
prbq
wniu
gwniu
ezgwniu
hm
ojy
prbqezgwniu
ks
ojyprbqezgwniu
vks
ojyprbqezgwniuvks
lhm
adlhm
fadlhm
cfadlhm
tcfadlhm
ojyprbqezgwniuvkstcfadlhm
xojyprbqezgwniuvkstcfadlhm

------P:
jy
o?
pr
bq
??
ez
iu
n?
w?
g?
??
??
??
ks
v?
??
ad
hm
l?
??
f?
c?
t?
??
x?
Asked By: nicolaskruchten

||

Answers:

Key idea: imitate the code that constructs R['icoord']/R['dcoord'] (R here is the dictionary returned by dendrogram, called P in the question). Append each cluster index to an initially empty list, cluster_id_list, in the same order in which the link information is appended. The elements of cluster_id_list and of R['icoord']/R['dcoord'] are then aligned.

Consider the following code:

def append_index(Z, n, i, cluster_id_list):
    # Mirrors the recursion in
    # https://github.com/scipy/scipy/blob/4cf21e753cf937d1c6c2d2a0e372fbc1dbbeea81/scipy/cluster/hierarchy.py#L3549

    # i is a cluster index counted over all 2 * n - 1 clusters,
    # so i - n is the corresponding row index in Z.
    if i < n:
        # Leaves (original observations) do not produce a link.
        return
    aa = int(Z[i - n, 0])
    ab = int(Z[i - n, 1])

    # The links of both subtrees are appended before the link joining them.
    append_index(Z, n, aa, cluster_id_list)
    append_index(Z, n, ab, cluster_id_list)

    # i - n is appended in the same order in which hierarchy.dendrogram
    # appends the elements of 'icoord' and 'dcoord'.
    cluster_id_list.append(i - n)

def get_linkid_clusterid_relation(Z):
    n = Z.shape[0] + 1
    i = 2 * n - 2  # index of the root cluster
    cluster_id_list = []
    append_index(Z, n, i, cluster_id_list)
    # cluster_id_list[k] is the row of Z that P['icoord'][k] / P['dcoord'][k] corresponds to

    dict_linkid_2_clusterid = {linkid: clusterid for linkid, clusterid in enumerate(cluster_id_list)}
    dict_clusterid_2_linkid = {clusterid: linkid for linkid, clusterid in enumerate(cluster_id_list)}
    return dict_linkid_2_clusterid, dict_clusterid_2_linkid

This simply imitates the recursive process in the _dendrogram_calculate_info function, which is called by the dendrogram function. dict_linkid_2_clusterid tells you which cluster each link belongs to: dict_linkid_2_clusterid[i] is the cluster that P["icoord"][i]/P["dcoord"][i] corresponds to, i.e. its row index in the Z array. dict_clusterid_2_linkid is the inverse map.
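To see why the two orders can differ, here is a minimal sketch on a hand-made linkage for five observations. The values in Z_toy and the helper name post_order_rows are made up for illustration and are not part of the question's data; the traversal itself is the same post-order recursion as above, written against plain lists so it runs without SciPy.

```python
# Hand-made linkage for 5 observations (hypothetical values for illustration).
# Each row: [child_a, child_b, merge_height, observation_count].
Z_toy = [
    [0, 1, 0.5, 2],  # row 0 -> cluster 5 = {0, 1}
    [2, 3, 0.7, 2],  # row 1 -> cluster 6 = {2, 3}
    [4, 5, 1.0, 3],  # row 2 -> cluster 7 = {4, 0, 1}
    [6, 7, 2.0, 5],  # row 3 -> cluster 8 = root
]


def post_order_rows(Z):
    """Return the rows of Z in post-order: left subtree, right subtree, then the merge."""
    n = len(Z) + 1
    order = []

    def visit(i):
        if i < n:  # original observation: no link to record
            return
        visit(int(Z[i - n][0]))
        visit(int(Z[i - n][1]))
        order.append(i - n)

    visit(2 * n - 2)  # start at the root cluster
    return order


link_to_row = post_order_rows(Z_toy)
row_to_link = {row: link for link, row in enumerate(link_to_row)}
print(link_to_row)  # [1, 0, 2, 3]
print(row_to_link)  # {1: 0, 0: 1, 2: 2, 3: 3}
```

Row 1 comes first because cluster 6 is the left child of the root, so its link is visited before anything in the right subtree, even though row 0 was merged earlier.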

NOTE: If you use count_sort or distance_sort, the order in which links are added changes. You can extend this answer by adding the corresponding logic from the scipy source code. The truncate_mode parameter would also need to be taken into account.
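As a rough sketch of that extension, the version below picks which child to visit first depending on count_sort or distance_sort. The branch conditions, including how ties are broken, are an assumption based on a reading of _dendrogram_calculate_info and may differ between SciPy versions; truncate_mode is not handled. The names link_order_sorted and matches_scipy are hypothetical helpers, and matches_scipy is there so you can check the assumption against your installed SciPy by comparing merge heights.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance


def link_order_sorted(Z, count_sort=False, distance_sort=False):
    """Rows of Z in the order dendrogram() is assumed to append links."""
    n = Z.shape[0] + 1
    order = []

    def size_and_height(c):
        # A leaf counts as one observation merged at height 0.
        if c < n:
            return 1, 0.0
        return Z[c - n, 3], Z[c - n, 2]

    def visit(i):
        if i < n:
            return
        a, b = int(Z[i - n, 0]), int(Z[i - n, 1])
        (na, da), (nb, db) = size_and_height(a), size_and_height(b)
        # Assumed to mirror the branch order in the SciPy source.
        if count_sort == "ascending" or count_sort is True:
            first, second = (b, a) if na > nb else (a, b)
        elif count_sort == "descending":
            first, second = (a, b) if na > nb else (b, a)
        elif distance_sort == "ascending" or distance_sort is True:
            first, second = (b, a) if da > db else (a, b)
        elif distance_sort == "descending":
            first, second = (a, b) if da > db else (b, a)
        else:
            first, second = a, b
        visit(first)
        visit(second)
        order.append(i - n)

    visit(2 * n - 2)
    return order


def matches_scipy(Z, **sort_kwargs):
    """Compare the height of each drawn link with the height of its mapped Z row."""
    P = hierarchy.dendrogram(Z, no_plot=True, **sort_kwargs)
    drawn = np.array([d[1] for d in P["dcoord"]])
    return bool(np.allclose(drawn, Z[link_order_sorted(Z, **sort_kwargs), 2]))


rng = np.random.RandomState(1)
Z_demo = hierarchy.linkage(distance.pdist(rng.normal(size=(26, 2))))
modes = [{}, {"count_sort": "ascending"}, {"count_sort": "descending"},
         {"distance_sort": "ascending"}, {"distance_sort": "descending"}]
for kwargs in modes:
    print(kwargs, matches_scipy(Z_demo, **kwargs))
```

If any mode prints False on your installation, the branch logic above does not match that SciPy version and should be adjusted against its source.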


Test code:

dict_linkid_2_clusterid, dict_clusterid_2_linkid = get_linkid_clusterid_relation(Z)
for linkid, _ in enumerate(zip(P["icoord"], P["dcoord"])):
    clusterid = dict_linkid_2_clusterid[linkid]
    c1, c2, _, _ = Z[clusterid]
    print(clusters[int(c1)] + clusters[int(c2)])

As the output shows, this lets you fill in the unknown "?" entries from your original code.
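One way to check the alignment without reading letters by eye is to compare heights: the two middle values of each P["dcoord"] entry are the height of the horizontal bar, which should equal column 2 of the matching row of Z. The sketch below assumes a default dendrogram call (no sorting, no truncation); link_order is a hypothetical standalone helper that repeats the post-order traversal, and the data is regenerated with its own seeded generator rather than taken from the question.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance


def link_order(Z):
    """Post-order list of Z row indices, assumed to match icoord/dcoord order."""
    n = Z.shape[0] + 1
    order = []

    def visit(i):
        if i < n:  # leaf: nothing is drawn for it
            return
        visit(int(Z[i - n, 0]))
        visit(int(Z[i - n, 1]))
        order.append(i - n)

    visit(2 * n - 2)
    return order


rng = np.random.RandomState(1)
data = rng.multivariate_normal([0, 0], [[5, 0], [0, 1]], 26)
Z_check = hierarchy.linkage(distance.pdist(data))
P_check = hierarchy.dendrogram(Z_check, no_plot=True)

order = link_order(Z_check)
heights_from_P = np.array([d[1] for d in P_check["dcoord"]])  # bar height of each link
heights_from_Z = Z_check[order, 2]                            # merge height of mapped row
print(np.allclose(heights_from_P, heights_from_Z))
```

With random data the merge heights are almost surely distinct, so agreement of the two height sequences is strong evidence that the mapping is right.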

Answered By: hellohawaii

First define the leaf label function.

def llf(id):
    if id < n:
        return str(id)
    else:
        return '[%d %d %1.2f]' % (id, count, R[n-id,3])