How to Match Cluster Labels with True Labels in K-Means Using Python


I had the same problem: my clustering (KMeans) returned different classes (cluster numbers) than the true classes, so the true labels and the predicted labels didn't match. The solution that worked for me was this code (scroll to 'Permutation maximizing the sum of the diagonal elements'). Although this method works well, I think there can be situations where it gives the wrong result.
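To make the idea of a "permutation maximizing the sum of the diagonal elements" concrete, here is a minimal sketch. It is not the linked code; the function name best_permutation and the 3-class confusion matrix are hypothetical, chosen so that a simple per-column argmax would assign two clusters to the same label. The search is brute force over all permutations, so it is only practical for a small number of clusters.

```python
from itertools import permutations

def best_permutation(cm):
    """Return (mapping, score), where mapping[j] is the label given to cluster j.

    cm[i][j] = number of samples with true label i that landed in cluster j.
    Tries every one-to-one mapping and keeps the one with the largest
    number of matched samples (the diagonal sum after reordering).
    """
    k = len(cm)
    best_perm, best_score = None, -1
    for perm in permutations(range(k)):
        # perm[j] is the label proposed for cluster j
        score = sum(cm[perm[j]][j] for j in range(k))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm), best_score

# Hypothetical confusion matrix (rows: true labels, columns: cluster ids)
cm = [[50, 40,  0],
      [ 0, 30,  5],
      [ 0,  0, 45]]

k = len(cm)
# Per-column argmax: clusters 0 and 1 both pick label 0, label 1 is never used
argmax_map = [max(range(k), key=lambda i: cm[i][j]) for j in range(k)]

# One-to-one mapping that maximizes the matched count
mapping, score = best_permutation(cm)
print(argmax_map, mapping, score)
```

For larger k, SciPy's scipy.optimize.linear_sum_assignment (with maximize=True) solves the same assignment problem without enumerating every permutation.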


Here is a concrete example showing how to match KMeans cluster ids with training data labels. The underlying idea is that the confusion_matrix should have large values on its diagonal, assuming that the classification is done correctly. Here is the confusion matrix before associating cluster center ids with training labels:

cm = array([[  0, 395,   0,   5,   0],
            [  0,   2,   5, 391,   2],
            [  2,   0,   0,   0, 398],
            [  0,   0, 400,   0,   0],
            [398,   0,   0,   0,   2]])

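Now we just need to reorder the confusion matrix so that its large values land on the diagonal. Each column corresponds to a cluster id, so taking the row index of the largest value in each column gives the label to assign to that cluster. This can be achieved easily with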

cm_argmax = cm.argmax(axis=0)
cm_argmax
y_pred_ = np.array([cm_argmax[i] for i in y_pred])

Here we get the new confusion matrix, which looks much more familiar now, right?

cm_ = array([[395,   5,   0,   0,   0],
             [  2, 391,   2,   5,   0],
             [  0,   0, 398,   0,   2],
             [  0,   0,   0, 400,   0],
             [  0,   0,   2,   0, 398]])
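As a quick check that does not need the data or the fitted model, the reordered matrix can be rebuilt from the original confusion matrix alone. This is only a sketch that assumes the matrix values quoted above; the loop variable names are mine. Each cluster's column is moved to the position of the label it was assigned.

```python
import numpy as np

# Confusion matrix quoted above (rows: true labels, columns: cluster ids)
cm = np.array([[  0, 395,   0,   5,   0],
               [  0,   2,   5, 391,   2],
               [  2,   0,   0,   0, 398],
               [  0,   0, 400,   0,   0],
               [398,   0,   0,   0,   2]])

# For each cluster id (column), the true label (row) with the most samples
cm_argmax = cm.argmax(axis=0)
print(cm_argmax)

# Move each cluster's column to the column of its assigned label.
# Using += keeps the counts correct even if two clusters share a label.
cm_ = np.zeros_like(cm)
for cluster_id, label in enumerate(cm_argmax):
    cm_[:, label] += cm[:, cluster_id]
print(cm_)
```

Here every cluster is assigned a different label, so the result is a plain column permutation of the original matrix.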

You can further verify the result with accuracy_score:

y_pred_ = np.array([cm_argmax[i] for i in y_pred])
accuracy_score(y, y_pred_)
# 0.991
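The same number can be read off the reordered confusion matrix, since accuracy is the share of samples on the diagonal. A small sketch, assuming the reordered matrix values shown earlier:

```python
import numpy as np

# Reordered confusion matrix from above
cm_ = np.array([[395,   5,   0,   0,   0],
                [  2, 391,   2,   5,   0],
                [  0,   0, 398,   0,   2],
                [  0,   0,   0, 400,   0],
                [  0,   0,   2,   0, 398]])

# Correctly matched samples (diagonal) divided by all samples
accuracy = np.trace(cm_) / cm_.sum()
print(accuracy)
```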

The entire standalone code is here:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix, accuracy_score

# Generate five blobs with known labels y
blob_centers = np.array(
    [[ 0.2,  2.3],
     [-1.5,  2.3],
     [-2.8,  1.8],
     [-2.8,  2.8],
     [-2.8,  1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, y = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

def plot_clusters(X, y=None):
    plt.scatter(X[:, 0], X[:, 1], c=y, s=1)
    plt.xlabel("$x_1$", fontsize=14)
    plt.ylabel("$x_2$", fontsize=14, rotation=0)

plt.figure(figsize=(8, 4))
plot_clusters(X)
plt.show()

# Cluster the data; the cluster ids are arbitrary at this point
k = 5
kmeans = KMeans(n_clusters=k, random_state=42)
y_pred = kmeans.fit_predict(X)

# Confusion matrix before matching (rows: true labels, columns: cluster ids)
cm = confusion_matrix(y, y_pred)
print(cm)

# For each cluster id, pick the true label with the most samples
cm_argmax = cm.argmax(axis=0)
print(cm_argmax)

# Relabel the predictions and recompute the confusion matrix
y_pred_ = np.array([cm_argmax[i] for i in y_pred])
cm_ = confusion_matrix(y, y_pred_)
print(cm_)

print(accuracy_score(y, y_pred_))