Spectral Clustering a graph in Python
Question:
I’d like to cluster a graph in Python using spectral clustering.
Spectral clustering is a more general technique that can be applied not only to graphs but also to images or any sort of data; however, it’s considered an exceptional graph clustering technique. Sadly, I can’t find examples of spectral clustering graphs in Python online.

scikit-learn documents two spectral clustering APIs, SpectralClustering and spectral_clustering, which do not appear to be aliases of each other.

Both of those methods mention that they could be used on graphs, but do not offer specific instructions. Neither does the user guide. I’ve asked for such an example from the developers, but they’re overworked and haven’t gotten to it.

A good network to document this against is the Karate Club Network. It’s included as a method in networkx.
I’d love some direction on how to go about this. If someone can help me figure it out, I can add the documentation to scikit-learn.
Answers:
Without much experience with spectral clustering, and just going by the docs (skip to the end for the results!):
Code:
import numpy as np
import networkx as nx
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(1)
# Get your mentioned graph
G = nx.karate_club_graph()
# Get ground truth: club labels -> transform to 0/1 np-array
# (possibly overcomplicated networkx usage here)
gt_dict = nx.get_node_attributes(G, 'club')
gt = [gt_dict[i] for i in G.nodes()]
gt = np.array([0 if i == 'Mr. Hi' else 1 for i in gt])
# Get adjacency matrix as a numpy array
# (nx.to_numpy_matrix was removed in NetworkX 3.0; to_numpy_array replaces it)
adj_mat = nx.to_numpy_array(G)
print('ground truth')
print(gt)
# Cluster
sc = SpectralClustering(2, affinity='precomputed', n_init=100)
sc.fit(adj_mat)
# Compare ground truth and clustering results
print('spectral clustering')
print(sc.labels_)
print('just for bettervisualization: invert clusters (permutation)')
print(np.abs(sc.labels_ - 1))
# Calculate some clustering metrics
print(metrics.adjusted_rand_score(gt, sc.labels_))
print(metrics.adjusted_mutual_info_score(gt, sc.labels_))
Output:
ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[1 1 0 1 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
just for bettervisualization: invert clusters (permutation)
[0 0 1 0 0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
0.204094758281
0.271689477828
The general idea:
Introduction on the data and task from here:
The nodes in the graph represent the 34 members in a college Karate club. (Zachary is a sociologist, and he was one of the members.) An edge between two nodes indicates that the two members spent significant time together outside normal club meetings. The dataset is interesting because while Zachary was collecting his data, there was a dispute in the Karate club, and it split into two factions: one led by “Mr. Hi”, and one led by “John A”. It turns out that using only the connectivity information (the edges), it is possible to recover the two factions.
Using sklearn & spectral clustering to tackle this:
If affinity is the adjacency matrix of a graph, this method can be used to find normalized graph cuts.
This describes normalized graph cuts as:
Find two disjoint partitions A and B of the vertices V of a graph, so that A ∪ B = V and A ∩ B = ∅.
Given a similarity measure w(i,j) between two vertices (e.g. identity when they are connected), a cut value (and its normalized version) is defined as:
cut(A, B) = SUM u in A, v in B: w(u, v)
…
we seek the minimization of disassociation between the groups A and B and the maximization of the association within each group
Sounds alright. So we create the adjacency matrix (nx.to_numpy_array(G), or nx.to_numpy_matrix(G) on NetworkX versions before 3.0) and set the param affinity to precomputed (as our adjacency matrix is our precomputed similarity measure).
Alternatively, using precomputed, a user-provided affinity matrix can be used.
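To make the cut value quoted above concrete, here is a minimal sketch on a hypothetical toy graph (two triangles joined by one edge; the graph and the helper name cut_value are illustrative assumptions, not part of scikit-learn). It only evaluates the plain cut(A, B) sum for two candidate partitions, not the normalized version:

```python
import numpy as np

# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5},
# joined by a single edge between nodes 2 and 3 (all weights are 1).
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0

def cut_value(W, A, B):
    """Sum of the weights w(u, v) over all u in A and v in B."""
    return W[np.ix_(A, B)].sum()

# Splitting between the triangles cuts only the edge (2, 3).
good_cut = cut_value(W, [0, 1, 2], [3, 4, 5])
# Splitting inside a triangle cuts the edges (0, 2) and (1, 2).
bad_cut = cut_value(W, [0, 1], [2, 3, 4, 5])
print(good_cut, bad_cut)
```

The partition that follows the two triangles has the smaller cut, which is the kind of split the method is looking for.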
Edit: While unfamiliar with this, I looked for parameters to tune and found assign_labels:
The strategy to use to assign labels in the embedding space. There are two ways to assign labels after the Laplacian embedding. k-means can be applied and is a popular choice. But it can also be sensitive to initialization. Discretization is another approach which is less sensitive to random initialization.
So trying the less sensitive approach:
sc = SpectralClustering(2, affinity='precomputed', n_init=100, assign_labels='discretize')
Output:
ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
just for bettervisualization: invert clusters (permutation)
[1 1 0 1 1 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
0.771725032425
0.722546051351
That’s a close fit to the ground truth: only 2 of the 34 nodes (indices 2 and 8) end up in a different group.
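Since cluster ids are arbitrary, a direct element-wise comparison can be misleading (the first run above only looks reasonable after inverting the labels). A small sketch, using the two label arrays printed above and a hypothetical helper named mismatches, counts the disagreements under both possible labelings and keeps the better one:

```python
# Labels copied from the second output above.
gt = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
      0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
pred = [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0,
        0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

def mismatches(a, b):
    """Number of positions where two label lists disagree."""
    return sum(x != y for x, y in zip(a, b))

# With two clusters there are only two labelings: as-is and inverted.
flipped = [1 - p for p in pred]
best = min(pred, flipped, key=lambda labels: mismatches(gt, labels))

n_wrong = mismatches(gt, best)
wrong_nodes = [i for i, (g, p) in enumerate(zip(gt, best)) if g != p]
print(n_wrong, wrong_nodes)  # 2 [2, 8]
```

The adjusted Rand and adjusted mutual information scores used above are already invariant to such relabeling, so they need no flipping.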
Here is a dummy example, inspired by sascha’s answer, just to see what it does to a simple similarity matrix.
Code
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(0)
adj_mat = [[3,2,2,0,0,0,0,0,0],
           [2,3,2,0,0,0,0,0,0],
           [2,2,3,1,0,0,0,0,0],
           [0,0,1,3,3,3,0,0,0],
           [0,0,0,3,3,3,0,0,0],
           [0,0,0,3,3,3,1,0,0],
           [0,0,0,0,0,1,3,1,1],
           [0,0,0,0,0,0,1,3,1],
           [0,0,0,0,0,0,1,1,3]]
adj_mat = np.array(adj_mat)
sc = SpectralClustering(3, affinity='precomputed', n_init=100)
sc.fit(adj_mat)
print('spectral clustering')
print(sc.labels_)
Output
spectral clustering
[0 0 0 1 1 1 2 2 2]
Let’s first cluster a graph G into K=2 clusters and then generalize for all K.

We can use the function linalg.algebraicconnectivity.fiedler_vector() from networkx to compute the Fiedler vector of the graph (the eigenvector corresponding to the second smallest eigenvalue of the graph Laplacian matrix), under the assumption that the graph is connected and undirected. Then we can threshold the values of the eigenvector to compute the cluster index each node corresponds to, as shown in the next code block:
import networkx as nx
import numpy as np
A = np.zeros((11,11))
A[0,1] = A[0,2] = A[0,3] = A[0,4] = 1
A[5,6] = A[5,7] = A[5,8] = A[5,9] = A[5,10] = 1
A[0,5] = 5
# nx.from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array replaces it
G = nx.from_numpy_array(A)
ev = nx.linalg.algebraicconnectivity.fiedler_vector(G)
labels = [0 if v < 0 else 1 for v in ev] # using threshold 0
labels
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
nx.draw(G, pos=nx.drawing.layout.spring_layout(G), with_labels=True, node_color=labels)

We can obtain the same clustering with eigen analysis of the graph Laplacian and then by choosing the eigenvector corresponding to the 2nd smallest eigenvalue too:
L = nx.laplacian_matrix(G)
e, v = np.linalg.eig(L.todense())
idx = np.argsort(e)
e = e[idx]
v = v[:,idx]
labels = [0 if x < 0 else 1 for x in v[:,1]] # using threshold 0
labels
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
drawing the graph again with the clusters labeled:
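Note that the two label lists above are mirror images of each other. That is expected: if v is an eigenvector then so is -v, so the sign a solver returns is arbitrary and thresholding at 0 can swap the two cluster ids. A minimal NumPy-only sketch on a hypothetical toy graph (two triangles joined by one edge, chosen here purely for illustration):

```python
import numpy as np

# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5}, bridge (2, 3).
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

# Unnormalized graph Laplacian L = D - A.
L = np.diag(A.sum(axis=1)) - A

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]  # eigenvector of the second smallest eigenvalue

# Both v and -v satisfy L v = lambda v.
is_eigvec = np.allclose(L @ fiedler, eigvals[1] * fiedler)
is_eigvec_neg = np.allclose(L @ (-fiedler), eigvals[1] * (-fiedler))

# Thresholding at 0 yields the same partition, with the ids swapped.
labels = (fiedler >= 0).astype(int)
labels_flipped = (-fiedler >= 0).astype(int)
print(labels, labels_flipped)
```

Which of the two labelings is printed first depends on the solver, so only the grouping of the nodes should be compared, not the ids.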

With SpectralClustering from sklearn.cluster we can get the exact same result:
from sklearn.cluster import SpectralClustering
sc = SpectralClustering(2, affinity='precomputed', n_init=100)
sc.fit(A)
sc.labels_
# [0 0 0 0 0 1 1 1 1 1 1]
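One caveat: the matrix A built above only has its upper triangle filled in, while the scikit-learn documentation for spectral_clustering describes the affinity matrix as one that must be symmetric. NetworkX reads A as an undirected graph, so the Fiedler-vector code is unaffected. For the scikit-learn call it may be safer to symmetrize first. A minimal sketch, assuming (as here) a zero diagonal and each edge stored only once:

```python
import numpy as np

# Same matrix as above: each edge is stored once, in the upper triangle.
A = np.zeros((11, 11))
A[0, 1] = A[0, 2] = A[0, 3] = A[0, 4] = 1
A[5, 6] = A[5, 7] = A[5, 8] = A[5, 9] = A[5, 10] = 1
A[0, 5] = 5

was_symmetric = np.allclose(A, A.T)

# Adding the transpose mirrors every edge; this is only valid because the
# diagonal is zero and no edge appears in both triangles already.
A_sym = A + A.T
is_symmetric = np.allclose(A_sym, A_sym.T)
print(was_symmetric, is_symmetric)  # False True
```

A_sym can then be passed to sc.fit in place of A.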

We can generalize the above for K > 2 clusters by using k-means clustering to partition the Fiedler vector instead of thresholding it. The following code demonstrates how k-means clustering can be used to partition the Fiedler vector and obtain a 3-clustering of a graph defined by the following adjacency matrix:
A = np.array([[3,2,2,0,0,0,0,0,0],
              [2,3,2,0,0,0,0,0,0],
              [2,2,3,1,0,0,0,0,0],
              [0,0,1,3,3,3,0,0,0],
              [0,0,0,3,3,3,0,0,0],
              [0,0,0,3,3,3,1,0,0],
              [0,0,0,0,0,1,3,1,1],
              [0,0,0,0,0,0,1,3,1],
              [0,0,0,0,0,0,1,1,3]])
K = 3 # K clusters
# nx.from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array replaces it
G = nx.from_numpy_array(A)
ev = nx.linalg.algebraicconnectivity.fiedler_vector(G)
from sklearn.cluster import KMeans
# reshape(-1, 1) turns the vector into a single-feature column for KMeans
kmeans = KMeans(n_clusters=K, random_state=0).fit(ev.reshape(-1,1))
kmeans.labels_
# array([2, 2, 2, 0, 0, 0, 1, 1, 1])
Now draw the clustered graph, labeling the nodes with the clusters obtained above:
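A common variant for K > 2 is to run k-means on the first K eigenvectors of the Laplacian instead of on the Fiedler vector alone. This differs from the approach in this answer and is only a sketch: it uses the unnormalized Laplacian L = D - A for simplicity, which is an assumption and not necessarily what scikit-learn does internally.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same 9-node adjacency matrix as above (three blocks joined by weak links).
A = np.array([[3,2,2,0,0,0,0,0,0],
              [2,3,2,0,0,0,0,0,0],
              [2,2,3,1,0,0,0,0,0],
              [0,0,1,3,3,3,0,0,0],
              [0,0,0,3,3,3,0,0,0],
              [0,0,0,3,3,3,1,0,0],
              [0,0,0,0,0,1,3,1,1],
              [0,0,0,0,0,0,1,3,1],
              [0,0,0,0,0,0,1,1,3]])
K = 3

# Unnormalized Laplacian; the diagonal (self-loop) weights cancel out in D - A.
L = np.diag(A.sum(axis=1)) - A

# eigh returns eigenvalues in ascending order for a symmetric matrix.
eigvals, eigvecs = np.linalg.eigh(L)

# Each node becomes a point in R^K: its entries in the K smallest eigenvectors.
embedding = eigvecs[:, :K]

labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(embedding)
print(labels)
```

On this matrix the three blocks of nodes (0 to 2, 3 to 5, 6 to 8) should come out as separate clusters; the numeric ids assigned to them are arbitrary.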