How to match K-Means cluster labels with true labels in Python
I had the same problem: my clustering (KMeans) returned cluster numbers that differed from the true classes, so the true labels and the predicted labels didn't match. The solution that worked for me was this code (scroll to 'Permutation maximizing the sum of the diagonal elements'). Although this method works well, I think there can be situations where it gives the wrong result.
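One way to find such a permutation (not necessarily what the linked code does) is to treat it as an assignment problem and solve it with `scipy.optimize.linear_sum_assignment`. The sketch below is only an illustration: the 3x3 confusion matrix and the `y_pred` values are made up, and `cluster_to_label` is a name chosen here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical confusion matrix: rows are true labels, columns are cluster ids.
cm = np.array([[ 2, 45,  3],
               [40,  5,  5],
               [ 4,  6, 40]])

# Pair every true label with a distinct cluster id so that the summed
# counts of the chosen cells (the future diagonal) are as large as possible.
true_idx, cluster_idx = linear_sum_assignment(cm, maximize=True)

# Lookup table: cluster id -> true label.
cluster_to_label = np.empty(cm.shape[1], dtype=int)
cluster_to_label[cluster_idx] = true_idx

# Hypothetical cluster ids as returned by a clustering step.
y_pred = np.array([1, 0, 2, 1])
y_aligned = cluster_to_label[y_pred]
print(cluster_to_label, y_aligned)
```

Because each cluster id is paired with a different label, two clusters can never end up with the same label, which a plain per-column argmax does not guarantee.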
Here is a concrete example showing how to match KMeans cluster ids with training data labels. The underlying idea is that the confusion matrix should have large values on its diagonal, assuming the classification is done correctly. Here is the confusion matrix before associating cluster ids with training labels:
    cm = array([[  0, 395,   0,   5,   0],
                [  0,   2,   5, 391,   2],
                [  2,   0,   0,   0, 398],
                [  0,   0, 400,   0,   0],
                [398,   0,   0,   0,   2]])
Now we just need to reorder the confusion matrix so that its large values land on the diagonal. Taking the argmax of each column gives, for every cluster id, the true label it overlaps with most, so this can be done easily with:
    cm_argmax = cm.argmax(axis=0)
    cm_argmax
    y_pred_ = np.array([cm_argmax[i] for i in y_pred])
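To make the mapping concrete, here is a small sketch that applies the same argmax step to the confusion matrix shown above. The `y_pred` values are hypothetical cluster ids chosen for illustration, and the `is_permutation` check is an addition that is not part of the original answer.

```python
import numpy as np

# Confusion matrix from above: rows are true labels, columns are cluster ids.
cm = np.array([[  0, 395,   0,   5,   0],
               [  0,   2,   5, 391,   2],
               [  2,   0,   0,   0, 398],
               [  0,   0, 400,   0,   0],
               [398,   0,   0,   0,   2]])

# For each cluster id (column), pick the true label (row) with the largest count.
cm_argmax = cm.argmax(axis=0)

# The mapping is one-to-one only if every true label is picked exactly once.
is_permutation = len(np.unique(cm_argmax)) == cm.shape[0]

# Hypothetical cluster ids; fancy indexing is equivalent to the list comprehension.
y_pred = np.array([1, 3, 4, 2, 0])
y_pred_ = cm_argmax[y_pred]
print(cm_argmax, is_permutation, y_pred_)
```

If `is_permutation` comes out `False`, two clusters were mapped to the same label and the argmax shortcut should not be trusted for that run.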
Here is the new confusion matrix, which looks much more familiar now, right?
    cm_ = array([[395,   5,   0,   0,   0],
                 [  2, 391,   2,   5,   0],
                 [  0,   0, 398,   0,   2],
                 [  0,   0,   0, 400,   0],
                 [  0,   0,   2,   0, 398]])
You can further verify the result with accuracy_score:
    y_pred_ = np.array([cm_argmax[i] for i in y_pred])
    accuracy_score(y, y_pred_)
    # 0.991
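The same numbers can be cross-checked without the raw predictions. As a rough sketch, the remapped confusion matrix can be rebuilt from the original one by moving each cluster's column to the label it was mapped to, and the accuracy is then the diagonal sum divided by the total count. The loop below is an illustrative addition, not part of the original answer.

```python
import numpy as np

# Original confusion matrix: rows are true labels, columns are cluster ids.
cm = np.array([[  0, 395,   0,   5,   0],
               [  0,   2,   5, 391,   2],
               [  2,   0,   0,   0, 398],
               [  0,   0, 400,   0,   0],
               [398,   0,   0,   0,   2]])
cm_argmax = cm.argmax(axis=0)

# Add each cluster's column to the column of the label it maps to.
# Using += keeps the counts correct even if two clusters share a label.
cm_ = np.zeros_like(cm)
for cluster_id, label in enumerate(cm_argmax):
    cm_[:, label] += cm[:, cluster_id]

# Accuracy is the share of samples that sit on the diagonal.
accuracy = np.trace(cm_) / cm_.sum()
print(cm_)
print(accuracy)
```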
The entire standalone code is here:
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import confusion_matrix, accuracy_score

    # Five blobs with different spreads; y holds the true labels 0..4.
    blob_centers = np.array(
        [[ 0.2, 2.3],
         [-1.5, 2.3],
         [-2.8, 1.8],
         [-2.8, 2.8],
         [-2.8, 1.3]])
    blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
    X, y = make_blobs(n_samples=2000, centers=blob_centers,
                      cluster_std=blob_std, random_state=7)

    def plot_clusters(X, y=None):
        plt.scatter(X[:, 0], X[:, 1], c=y, s=1)
        plt.xlabel("$x_1$", fontsize=14)
        plt.ylabel("$x_2$", fontsize=14, rotation=0)

    plt.figure(figsize=(8, 4))
    plot_clusters(X)
    plt.show()

    k = 5
    kmeans = KMeans(n_clusters=k, random_state=42)
    y_pred = kmeans.fit_predict(X)

    # Rows are true labels, columns are the arbitrary cluster ids.
    cm = confusion_matrix(y, y_pred)
    print(cm)

    # Map each cluster id to the true label it overlaps with most.
    cm_argmax = cm.argmax(axis=0)
    print(cm_argmax)
    y_pred_ = np.array([cm_argmax[i] for i in y_pred])

    # Recompute the confusion matrix with the remapped predictions.
    cm_ = confusion_matrix(y, y_pred_)
    print(cm_)
    print(accuracy_score(y, y_pred_))