How to set k-Means clustering labels from highest to lowest with Python?
Transforming the labels through a lookup table is a straightforward way to achieve what you want.
To begin with I generate some mock data:
    import numpy as np

    np.random.seed(1000)
    n = 38
    X_morning = np.random.uniform(low=.02, high=.18, size=38)
    X_afternoon = np.random.uniform(low=.05, high=.20, size=38)
    X_night = np.random.uniform(low=.025, high=.175, size=38)
    X = np.vstack([X_morning, X_afternoon, X_night]).T
Then I perform clustering on the data:
    from sklearn.cluster import KMeans

    k = 4
    kmeans = KMeans(n_clusters=k, random_state=0).fit(X)
And finally I use NumPy's argsort to create a lookup table like this:
    idx = np.argsort(kmeans.cluster_centers_.sum(axis=1))
    lut = np.zeros_like(idx)
    lut[idx] = np.arange(k)
Sample run:
    In [70]: kmeans.cluster_centers_.sum(axis=1)
    Out[70]: array([ 0.3214523 ,  0.40877735,  0.26911353,  0.25234873])

    In [71]: idx
    Out[71]: array([3, 2, 0, 1], dtype=int64)

    In [72]: lut
    Out[72]: array([2, 3, 1, 0], dtype=int64)

    In [73]: kmeans.labels_
    Out[73]: array([1, 3, 1, ..., 0, 1, 0])

    In [74]: lut[kmeans.labels_]
    Out[74]: array([3, 0, 3, ..., 2, 3, 2], dtype=int64)
idx shows the cluster center labels ordered from lowest to highest consumption level. The apartments for which lut[kmeans.labels_] is 0 belong to the cluster with the lowest consumption level, and those for which it is 3 belong to the cluster with the highest.
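The question asks for labels running from highest to lowest, whereas the lookup table above numbers the clusters from lowest to highest. A minimal sketch of the reversed mapping follows, reusing the center sums from the sample run. The helper name make_lut and the short label array are made up for illustration; they are not part of the original answer.

```python
import numpy as np

def make_lut(center_scores, descending=False):
    """Map each original cluster label to its rank by center score."""
    scores = np.asarray(center_scores)
    # Negating the scores makes argsort return the largest score first.
    order = np.argsort(-scores if descending else scores)
    lut = np.empty_like(order)
    lut[order] = np.arange(len(order))
    return lut

# Center sums copied from the sample run above.
center_sums = np.array([0.3214523, 0.40877735, 0.26911353, 0.25234873])
# Made-up labels, standing in for kmeans.labels_.
labels = np.array([1, 3, 1, 0, 1, 0])

lut_asc = make_lut(center_sums)
lut_desc = make_lut(center_sums, descending=True)

print(lut_asc)           # [2 3 1 0]
print(lut_desc)          # [1 0 2 3]
print(lut_desc[labels])  # [0 3 0 1 0 1]
```

With descending=True, label 0 goes to the cluster with the highest consumption and label k - 1 to the lowest. The same table can also be obtained as k - 1 - lut from the ascending version.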
Sorting the centroids by their vector magnitude may be a better option, since the same model can then be used to predict labels for other data. Here is my implementation from my repo:
    import numpy as np
    from sklearn.cluster import KMeans

    def sorted_cluster(x, model=None):
        # Fall back to a default KMeans model if none is given.
        if model is None:
            model = KMeans()
        model = sorted_cluster_centers_(model, x)
        model = sorted_labels_(model, x)
        return model

    def sorted_cluster_centers_(model, x):
        model.fit(x)
        # Euclidean norm of each centroid, one value per cluster.
        magnitude = np.linalg.norm(model.cluster_centers_, axis=1)
        idx_argsort = np.argsort(magnitude)
        # Reorder the centroids from smallest to largest magnitude.
        model.cluster_centers_ = model.cluster_centers_[idx_argsort]
        return model

    def sorted_labels_(sorted_model, x):
        # Recompute the labels so they match the reordered centroids.
        sorted_model.labels_ = sorted_model.predict(x)
        return sorted_model
Example:
    import numpy as np

    arr = np.vstack([
        100 + np.random.random((2,3)),
        np.random.random((2,3)),
        5 + np.random.random((3,3)),
        10 + np.random.random((2,3))
    ])
    print('Data:')
    print(arr)

    cluster = KMeans(n_clusters=4)

    print('\n Without sort:')
    cluster.fit(arr)
    print(cluster.cluster_centers_)
    print(cluster.labels_)
    print(cluster.predict([[5,5,5],[1,1,1]]))

    print('\n With sort:')
    cluster = sorted_cluster(arr, cluster)
    print(cluster.cluster_centers_)
    print(cluster.labels_)
    print(cluster.predict([[5,5,5],[1,1,1]]))
Output:
    Data:
    [[100.52656263 100.57376566 100.63087757]
     [100.70144046 100.94095196 100.57095386]
     [  0.21284187   0.75623797   0.77349013]
     [  0.28241023   0.89878796   0.27965047]
     [  5.14328748   5.37025887   5.26064209]
     [  5.21030632   5.09597417   5.29507699]
     [  5.81531591   5.11629056   5.78542656]
     [ 10.25686526  10.64181304  10.45651994]
     [ 10.14153211  10.28765705  10.20653228]]

     Without sort:
    [[ 10.19919868  10.46473505  10.33152611]
     [100.61400155 100.75735881 100.60091572]
     [  0.24762605   0.82751296   0.5265703 ]
     [  5.38963657   5.19417453   5.44704855]]
    [1 1 2 2 3 3 3 0 0]
    [3 2]

     With sort:
    [[  0.24762605   0.82751296   0.5265703 ]
     [  5.38963657   5.19417453   5.44704855]
     [ 10.19919868  10.46473505  10.33152611]
     [100.61400155 100.75735881 100.60091572]]
    [3 3 0 0 1 1 1 2 2]
    [1 0]
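The sorted run returns [1 0] for the two new points because k-means prediction assigns each point to its nearest centroid, so a label is just the row index of that centroid. The NumPy-only sketch below illustrates the idea with the centroids printed above, rounded to two decimals. The function nearest_centroid is a hypothetical stand-in written for this illustration, not scikit-learn's implementation.

```python
import numpy as np

def nearest_centroid(points, centers):
    """Return, for each point, the row index of the closest center."""
    # Pairwise squared Euclidean distances, shape (n_points, n_centers).
    diff = points[:, None, :] - centers[None, :, :]
    return np.argmin((diff ** 2).sum(axis=2), axis=1)

# Centroids from the "Without sort" output, rounded to two decimals.
centers = np.array([
    [10.20, 10.46, 10.33],
    [100.61, 100.76, 100.60],
    [0.25, 0.83, 0.53],
    [5.39, 5.19, 5.45],
])

# Reorder the rows from smallest to largest vector magnitude.
order = np.argsort(np.linalg.norm(centers, axis=1))
sorted_centers = centers[order]

new_points = np.array([[5.0, 5.0, 5.0], [1.0, 1.0, 1.0]])
print(nearest_centroid(new_points, centers))         # [3 2]
print(nearest_centroid(new_points, sorted_centers))  # [1 0]
```

After the reordering, label i always refers to the centroid with the i-th smallest magnitude, which is why labels for unseen data come out sorted as well. It may be worth checking on your scikit-learn version that predict still reads the overwritten cluster_centers_ attribute.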