How to Match Cluster Labels with True Labels in K-Means Using Python

python numpy flask scikit-learn k-means

I had the same problem: my clustering (KMeans) returned different classes (cluster numbers) than the true classes, so the true labels and the predicted labels didn't match. The solution that worked for me was this code (scroll to 'Permutation maximizing the sum of the diagonal elements'). Although this method works well, I think there can be situations where it is wrong.
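For illustration, the idea of a "permutation maximizing the sum of the diagonal elements" can be sketched as a brute-force search over all one-to-one assignments of cluster ids to labels. This is only a minimal sketch, not the linked code: the 3x3 matrix is hypothetical (rows are true labels, columns are cluster ids), and `diagonal_sum`, `best_perm`, and `argmax_map` are names made up for the example. It also shows a case where a plain column-wise argmax assigns two clusters to the same label, while the permutation search cannot.

```python
from itertools import permutations

# Hypothetical confusion matrix: rows = true labels, columns = cluster ids.
cm = [
    [60, 30, 10],
    [55,  5, 40],
    [ 5, 45, 50],
]
k = len(cm)


def diagonal_sum(perm):
    """Sum of matched counts when cluster c is assigned the label perm[c]."""
    return sum(cm[perm[c]][c] for c in range(k))


# Try every one-to-one assignment and keep the one with the largest total.
best_perm = max(permutations(range(k)), key=diagonal_sum)
best_total = diagonal_sum(best_perm)

# For comparison: per-column argmax, which may reuse a label.
argmax_map = [max(range(k), key=lambda r: cm[r][c]) for c in range(k)]

print(best_perm, best_total)  # (0, 2, 1) 145
print(argmax_map)             # [0, 2, 2] -> clusters 1 and 2 collide on label 2
```

Trying every permutation costs k! evaluations, so it is only practical for a handful of clusters. For larger k, `scipy.optimize.linear_sum_assignment` solves the same assignment problem efficiently.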

Here is a concrete example showing how to match KMeans cluster ids with training data labels. The underlying idea is that confusion_matrix should have large values on its diagonal, assuming the classification is done correctly. Here is the confusion matrix before associating cluster center ids with training labels:

    cm = array([[  0, 395,   0,   5,   0],
                [  0,   2,   5, 391,   2],
                [  2,   0,   0,   0, 398],
                [  0,   0, 400,   0,   0],
                [398,   0,   0,   0,   2]])

Now we just need to reorder the confusion matrix so that its large values land on the diagonal. This can be achieved easily with:

    cm_argmax = cm.argmax(axis=0)
    cm_argmax
    y_pred_ = np.array([cm_argmax[i] for i in y_pred])
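As a side note, the list comprehension can also be written as a single NumPy indexing operation. A minimal sketch, assuming the confusion matrix shown above; `y_pred_sample` is a short hypothetical stand-in for the real `y_pred`:

```python
import numpy as np

# Confusion matrix from the example: rows = true labels, columns = cluster ids.
cm = np.array([[  0, 395,   0,   5,   0],
               [  0,   2,   5, 391,   2],
               [  2,   0,   0,   0, 398],
               [  0,   0, 400,   0,   0],
               [398,   0,   0,   0,   2]])

# For each cluster id (column), the true label (row) with the most samples.
cm_argmax = cm.argmax(axis=0)
print(cm_argmax)  # [4 0 3 1 2]

# Hypothetical cluster ids, standing in for the real y_pred.
y_pred_sample = np.array([0, 1, 2, 3, 4, 1, 1])

# Loop form used in the answer, and the equivalent integer-array indexing.
loop_version = np.array([cm_argmax[i] for i in y_pred_sample])
vectorized = cm_argmax[y_pred_sample]

print(vectorized)  # [4 0 3 1 2 0 0]
```

Both forms give the same result; the indexing form just avoids the Python-level loop.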

After the remapping we get the new confusion matrix, which looks much more familiar now, right?

    cm_ = array([[395,   5,   0,   0,   0],
                 [  2, 391,   2,   5,   0],
                 [  0,   0, 398,   0,   2],
                 [  0,   0,   0, 400,   0],
                 [  0,   0,   2,   0, 398]])
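The same matrix can be derived from `cm` alone by moving each cluster's column to the position of its assigned label, which is also a convenient place to check that the argmax mapping is one-to-one. A sketch under the assumption that `cm` is the matrix printed earlier; `is_one_to_one` and `cm_reordered` are names introduced only for this example:

```python
import numpy as np

# Confusion matrix before remapping: rows = true labels, columns = cluster ids.
cm = np.array([[  0, 395,   0,   5,   0],
               [  0,   2,   5, 391,   2],
               [  2,   0,   0,   0, 398],
               [  0,   0, 400,   0,   0],
               [398,   0,   0,   0,   2]])

cm_argmax = cm.argmax(axis=0)

# The mapping is a valid relabeling only if every label is used exactly once.
is_one_to_one = np.array_equal(np.sort(cm_argmax), np.arange(cm.shape[1]))
print(is_one_to_one)  # True

# Add each cluster's column to the column of the label it was mapped to.
# Accumulating with += keeps the counts correct even if two clusters collide.
cm_reordered = np.zeros_like(cm)
for cluster, label in enumerate(cm_argmax):
    cm_reordered[:, label] += cm[:, cluster]

print(cm_reordered)
```

If `is_one_to_one` came out `False`, two clusters would share a label and some label would never be predicted, which is the situation where the plain argmax approach breaks down.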

You can further verify the result with accuracy_score:

    y_pred_ = np.array([cm_argmax[i] for i in y_pred])
    accuracy_score(y, y_pred_)
    # 0.991
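The 0.991 figure can be cross-checked by hand from the reordered confusion matrix, since accuracy is the diagonal sum divided by the total number of samples. A small sketch using the matrix values shown above; `correct`, `total`, and `accuracy` are names chosen for the example:

```python
# Reordered confusion matrix from the example, as plain Python lists.
cm_ = [
    [395,   5,   0,   0,   0],
    [  2, 391,   2,   5,   0],
    [  0,   0, 398,   0,   2],
    [  0,   0,   0, 400,   0],
    [  0,   0,   2,   0, 398],
]

# Correctly matched samples sit on the diagonal.
correct = sum(cm_[i][i] for i in range(len(cm_)))
total = sum(sum(row) for row in cm_)
accuracy = correct / total

print(correct, total)      # 1982 2000
print(round(accuracy, 3))  # 0.991
```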

The entire standalone code is here:

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import confusion_matrix, accuracy_score

    # Five blob centers with different spreads
    blob_centers = np.array(
        [[ 0.2,  2.3],
         [-1.5,  2.3],
         [-2.8,  1.8],
         [-2.8,  2.8],
         [-2.8,  1.3]])
    blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
    X, y = make_blobs(n_samples=2000, centers=blob_centers,
                      cluster_std=blob_std, random_state=7)

    def plot_clusters(X, y=None):
        plt.scatter(X[:, 0], X[:, 1], c=y, s=1)
        plt.xlabel("$x_1$", fontsize=14)
        plt.ylabel("$x_2$", fontsize=14, rotation=0)

    plt.figure(figsize=(8, 4))
    plot_clusters(X)
    plt.show()

    k = 5
    kmeans = KMeans(n_clusters=k, random_state=42)
    y_pred = kmeans.fit_predict(X)

    # Confusion matrix before remapping: rows are true labels, columns are cluster ids
    cm = confusion_matrix(y, y_pred)
    print(cm)

    # For each cluster id, the training label it overlaps with most
    cm_argmax = cm.argmax(axis=0)
    print(cm_argmax)

    # Translate cluster ids into training labels
    y_pred_ = np.array([cm_argmax[i] for i in y_pred])

    # Confusion matrix after remapping, computed from the remapped predictions
    cm_ = confusion_matrix(y, y_pred_)
    print(cm_)

    print(accuracy_score(y, y_pred_))
