Kmeans without knowing the number of clusters? [duplicate] Kmeans without knowing the number of clusters? [duplicate] python python

Kmeans without knowing the number of clusters? [duplicate]


One approach is cross-validation.

In essence, you pick a subset of your data and cluster it into k clusters, and you ask how well it clusters, compared with the rest of the data: Are you assigning data points to the same cluster memberships, or are they falling into different clusters?

If the memberships are roughly the same, the data fit well into k clusters. Otherwise, you try a different k.

Also, you could do PCA (principal component analysis) to reduce your 50 dimensions to some more tractable number. If a PCA run suggests that most of your variance is coming from, say, 4 out of the 50 dimensions, then you can pick k on that basis, to explore how the four cluster memberships are assigned.


Take a look at this wikipedia page on determining the number of clusters in a data set.

Also you might want to try Agglomerative hierarchical clustering out. This approach does not need to know the number of clusters, it will incrementally form clusters of cluster till only one exists. This technique also exists in SciPy (scipy.cluster.hierarchy).


One interesting approach is that of evidence accumulation by Fred and Jain. This is based on combining multiple runs of k-means with a large number of clusters, aggregating them into an overall solution. Nice aspects of the approach include that the number of clusters is determined in the process and that the final clusters don't have to be spherical.