cluster nodes of graph around specific nodes

Question:

Considering a graph of nodes from networkx how can I apply a kmean cluster of all the nodes where specific nodes are considered the centroids of the clusters. In other words, assume we have this graph:

import networkx as nx

s = [0,3,2,3,4,5,1]
t = [1,2,7,4,6,6,5]
dist = [3,2,5,1,5,4,2]

G = nx.Graph()
for i in range(len(s)):
    G.add_edge(s[i],t[i],weight=dist[i])

I want to apply a kmean clustering on the network where for example I choose the centroids to be 3 and 6 and the graph will be clustered accordingly to produce two subgraphs (or as many centroids as I input)

I have been looking at the kmean clustering here https://www.learndatasci.com/tutorials/k-means-clustering-algorithms-python-intro/ and what that doesn’t cover is the inputted centroids but rather it only considers number of clusters without a centroid node.

Asked By: mb567

||

Answers:

Note that you cannot directly apply k-means clustering to a network, as there does not necessarily exist a metric to measure distances between nodes and centroids. But

.. provided you assume:

  • The path length of the weighted shortest-path is a distance measure between a pair of nodes.
  • centroids are nodes. Note: In traditional k-means clustering centroids are not necessarily data points themselves.

Under these assumptions the sum of distances to the centroids is minimal if you associate to each node the centroid with the shortest weighted shortest-path.

So the procedure could be:

  • Associate each node to a centroid such that the sum of the distances from each node to its centroid is minimal (i.e. the within cluster sum of distances)
  • Update the centroids
  • Repeat previous two steps until the centroids are stable.

This procedure corresponds loosely to the procedure of k-mean clustering, that is to minimize the within-cluster sum of squares (WCSS).

Although this procedure is similar to k-means clustering in data points in a metric-space, I would not call it k-means clustering. Especially because the position of the centroids is restricted to nodes in the network.


Here is how you could approach this with python:

  1. Define the initial centroids:

    centroids = [3, 6]
    
  2. For each node, get all shortest paths to all centroids .

    For example:

     shortest_paths = [[(
         cent,
         nx.shortest_path(G,
                          source=n,
                          target=cent,
                          weight='weight')
     ) for cent in centroids] for n in G.nodes]
    

    This gives (here they are reported together with the id of the centroid):

    In [26]: shortest_paths                                                         
    Out[26]: 
    [[(3, [0, 1, 5, 6, 4, 3]), (6, [0, 1, 5, 6])],
    [(3, [1, 5, 6, 4, 3]), (6, [1, 5, 6])],
    [(3, [3]), (6, [3, 4, 6])],
    [(3, [2, 3]), (6, [2, 3, 4, 6])],
    [(3, [7, 2, 3]), (6, [7, 2, 3, 4, 6])],
    [(3, [4, 3]), (6, [4, 6])],
    [(3, [6, 4, 3]), (6, [6])],
    [(3, [5, 6, 4, 3]), (6, [5, 6])]]
    
  3. Calculate the actual distance, i.e. sum the weights over the paths, for all shortest paths for all nodes:

    For example:

    distances = [[(
        sp[0],  # this is the id of the centroid
        sum(
            [G[sp[1][i]][sp[1][i+1]]['weight'] 
             for i in range(len(sp[1]) - 1)]
        ) if len(sp[1]) > 1 else 0
    ) for sp in sps] for sps in shortest_paths]
    

    So the distances are:

    In [28]: distances                                                              
    Out[28]: 
    [[(3, 15), (6, 9)],
    [(3, 12), (6, 6)],
    [(3, 0), (6, 6)],
    [(3, 2), (6, 8)],
    [(3, 7), (6, 13)],
    [(3, 1), (6, 5)],
    [(3, 6), (6, 0)],
    [(3, 10), (6, 4)]]
    
  4. Get the centroid with the minimal distance for all nodes:

    For example:

    closest_centroid = [
        min(dist, key=lambda d: d[1])[0] for dist in distances
    ]
    

    Leading to the grouping according to the centroids:

    In [30]: closest_centroid                                                       
    Out[30]: [6, 6, 3, 3, 3, 3, 6, 6]
    
  5. Update the centroids as the current centroids might no longer be the actual centroids of the group:

    Approach:

    # for each group
        # for each member of the group
            # get the distance of shortest paths to all the other members of the group
            # sum this distances
        # find the node with the minimal summed distance > this is the new centroid of the group
    

Iteration: If the new centroids are not the same as the old one, use the new centroids and repeat steps 2.- 5.

Final step: If the new centroids found in step 5. are the same as the old ones or you’ve reached a iteration limit, associate the closest centroid to each node:

For example:

nodes = [n for n in G]  # the actual id of the nodes
cent_dict = {nodes[i]: closest_centroid[i] for i in range(len(nodes))}
nx.set_node_attributes(G, cent_dict, 'centroid')

Or nx.set_node_attributes(G, 'centroid', cent_dict), if you are still at v1.x.

This would be an approach to do a sort-of k-means clustering for a network.

Answered By: jojo