Spectral Clustering a graph in python


Without much experience with Spectral-clustering and just going by the docs (skip to the end for the results!):

Code:

import numpy as np
import networkx as nx
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(1)

# Get your mentioned graph
G = nx.karate_club_graph()

# Get ground-truth: club-labels -> transform to 0/1 np-array
#     (possible overcomplicated networkx usage here)
gt_dict = nx.get_node_attributes(G, 'club')
gt = [gt_dict[i] for i in G.nodes()]
gt = np.array([0 if i == 'Mr. Hi' else 1 for i in gt])

# Get adjacency-matrix as numpy-array
# (note: networkx >= 3.0 removed to_numpy_matrix; use nx.to_numpy_array there)
adj_mat = nx.to_numpy_matrix(G)

print('ground truth')
print(gt)

# Cluster
sc = SpectralClustering(2, affinity='precomputed', n_init=100)
sc.fit(adj_mat)

# Compare ground-truth and clustering-results
print('spectral clustering')
print(sc.labels_)
print('just for better-visualization: invert clusters (permutation)')
print(np.abs(sc.labels_ - 1))

# Calculate some clustering metrics
print(metrics.adjusted_rand_score(gt, sc.labels_))
print(metrics.adjusted_mutual_info_score(gt, sc.labels_))

Output:

ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[1 1 0 1 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
just for better-visualization: invert clusters (permutation)
[0 0 1 0 0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
0.204094758281
0.271689477828
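A note on the inverted labels above: the cluster IDs returned by SpectralClustering are arbitrary, which is why permutation-invariant scores (ARI, AMI) are used and why the labels are flipped by hand for easier visual comparison. If you want to align predicted labels with the ground truth automatically, a small Hungarian-matching sketch like the one below works; the helper name align_labels is my own and not part of the code above:

import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(gt, pred):
    # Relabel `pred` so that it agrees with `gt` as often as possible
    n_labels = max(gt.max(), pred.max()) + 1
    # Contingency table: how often predicted label j coincides with true label i
    counts = np.zeros((n_labels, n_labels), dtype=int)
    for t, p in zip(gt, pred):
        counts[t, p] += 1
    # Hungarian algorithm on the negated counts finds the best label permutation
    row_ind, col_ind = linear_sum_assignment(-counts)
    mapping = {c: r for r, c in zip(row_ind, col_ind)}
    return np.array([mapping[p] for p in pred])

# e.g. print(align_labels(gt, sc.labels_))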

The general idea:

An introduction to the data and the task, from here:

The nodes in the graph represent the 34 members in a college Karate club. (Zachary is a sociologist, and he was one of the members.) An edge between two nodes indicates that the two members spent significant time together outside normal club meetings. The dataset is interesting because while Zachary was collecting his data, there was a dispute in the Karate club, and it split into two factions: one led by “Mr. Hi”, and one led by “John A”. It turns out that using only the connectivity information (the edges), it is possible to recover the two factions.

Using sklearn & spectral-clustering to tackle this:

If affinity is the adjacency matrix of a graph, this method can be used to find normalized graph cuts.

This describes normalized graph cuts as:

Find two disjoint partitions A and B of the vertices V of a graph, so that A ∪ B = V and A ∩ B = ∅

Given a similarity measure w(i, j) between two vertices (e.g. identity when they are connected), a cut value (and its normalized version) is defined as: cut(A, B) = ∑_{u ∈ A, v ∈ B} w(u, v)

...

we seek the minimization of disassociation between the groups A and B and the maximization of the association within each group
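To make that definition concrete, here is a minimal sketch on a toy 4-node weighted graph (my own example, not from the question). It evaluates cut(A, B) together with the normalized cut of Shi & Malik, Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), where assoc(A, V) sums the weights from A to all vertices:

import numpy as np

# Vertices {0, 1} are tightly connected, {2, 3} as well,
# with one weak edge between the two groups.
W = np.array([[0, 3, 1, 0],
              [3, 0, 0, 0],
              [1, 0, 0, 3],
              [0, 0, 3, 0]], dtype=float)

A = [0, 1]
B = [2, 3]

cut = W[np.ix_(A, B)].sum()            # sum of weights crossing the partition -> 1.0
assoc_A = W[A, :].sum()                # assoc(A, V) -> 7.0
assoc_B = W[B, :].sum()                # assoc(B, V) -> 7.0
ncut = cut / assoc_A + cut / assoc_B   # normalized cut -> ~0.286

print(cut, ncut)

Minimizing Ncut captures exactly the trade-off quoted above: a small cut keeps the disassociation between A and B low, while the assoc terms in the denominators penalize splitting off tiny, weakly connected groups.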

Sounds alright. So we create the adjacency matrix (nx.to_numpy_matrix(G)) and set the parameter affinity to 'precomputed' (as our adjacency matrix is our precomputed similarity measure).

Alternatively, using precomputed, a user-provided affinity matrix can be used.

Edit: While unfamiliar with this, I looked for parameters to tune and found assign_labels:

The strategy to use to assign labels in the embedding space. There are two ways to assign labels after the laplacian embedding. k-means can be applied and is a popular choice. But it can also be sensitive to initialization. Discretization is another approach which is less sensitive to random initialization.

So trying the less sensitive approach:

sc = SpectralClustering(2, affinity='precomputed', n_init=100, assign_labels='discretize')

Output:

ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
just for better-visualization: invert clusters (permutation)
[1 1 0 1 1 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
0.771725032425
0.722546051351

That's a near-perfect fit to the ground truth: only two of the 34 nodes end up in a different cluster!
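A quick way to confirm that, assuming gt and sc from the snippets above are still in scope:

# Count the nodes assigned differently from the ground truth,
# taking the better of the two possible label orderings.
print(min(np.sum(gt != sc.labels_), np.sum(gt != 1 - sc.labels_)))   # prints 2 for the run above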


Here is a dummy example just to see what it does to a simple similarity matrix -- inspired by sascha's answer.

Code

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(0)

adj_mat = [[3, 2, 2, 0, 0, 0, 0, 0, 0],
           [2, 3, 2, 0, 0, 0, 0, 0, 0],
           [2, 2, 3, 1, 0, 0, 0, 0, 0],
           [0, 0, 1, 3, 3, 3, 0, 0, 0],
           [0, 0, 0, 3, 3, 3, 0, 0, 0],
           [0, 0, 0, 3, 3, 3, 1, 0, 0],
           [0, 0, 0, 0, 0, 1, 3, 1, 1],
           [0, 0, 0, 0, 0, 0, 1, 3, 1],
           [0, 0, 0, 0, 0, 0, 1, 1, 3]]
adj_mat = np.array(adj_mat)

sc = SpectralClustering(3, affinity='precomputed', n_init=100)
sc.fit(adj_mat)

print('spectral clustering')
print(sc.labels_)

Output

spectral clustering
[0 0 0 1 1 1 2 2 2]
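If you are curious what roughly happens under the hood on this toy matrix, the sketch below (my own addition, not part of the answer above) builds the symmetric normalized Laplacian by hand and looks at the eigenvectors belonging to the smallest eigenvalues. Rows coming from the same block land close together in that embedding, which is what the final k-means/discretization step then separates. sklearn's actual spectral embedding differs in some details, so treat this only as an illustration:

import numpy as np
from scipy.linalg import eigh

# Same block-structured similarity matrix as above
adj_mat = np.array([[3, 2, 2, 0, 0, 0, 0, 0, 0],
                    [2, 3, 2, 0, 0, 0, 0, 0, 0],
                    [2, 2, 3, 1, 0, 0, 0, 0, 0],
                    [0, 0, 1, 3, 3, 3, 0, 0, 0],
                    [0, 0, 0, 3, 3, 3, 0, 0, 0],
                    [0, 0, 0, 3, 3, 3, 1, 0, 0],
                    [0, 0, 0, 0, 0, 1, 3, 1, 1],
                    [0, 0, 0, 0, 0, 0, 1, 3, 1],
                    [0, 0, 0, 0, 0, 0, 1, 1, 3]], dtype=float)

# Symmetric normalized Laplacian: L = I - D^(-1/2) A D^(-1/2)
deg = adj_mat.sum(axis=1)
d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
lap = np.eye(len(adj_mat)) - d_inv_sqrt @ adj_mat @ d_inv_sqrt

# eigh returns eigenvalues in ascending order; the eigenvectors of the
# few smallest eigenvalues form the embedding in which the three blocks separate.
vals, vecs = eigh(lap)
print(np.round(vals[:3], 3))
print(np.round(vecs[:, :3], 2))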