Python k-means algorithm


Update: (Eleven years after this original answer, it's probably time for an update.)

First off, are you sure you want k-means? This page gives an excellent graphical summary of some different clustering algorithms. Beyond the graphic, look especially at the parameters each method requires, and decide whether you can actually provide them (e.g., k-means requires the number of clusters, but maybe you don't know that before you start clustering).
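As a concrete modern starting point (the update above doesn't name a library, so this is just one common option): scikit-learn ships a k-means implementation. A minimal sketch, assuming scikit-learn is installed and using made-up toy data:

    import numpy as np
    from sklearn.cluster import KMeans

    points = np.random.rand(100, 2)           # toy data; use your own observations
    kmeans = KMeans(n_clusters=3, n_init=10)  # k must still be chosen up front
    labels = kmeans.fit_predict(points)       # cluster index for each point
    centers = kmeans.cluster_centers_         # the k centroids

Note that n_clusters has to be chosen ahead of time, which is exactly the parameter question raised above.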

Here are some resources:

Old answer:

SciPy's clustering implementations work well, and they include a k-means implementation.
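For illustration, a minimal sketch of that SciPy route (the scipy.cluster.vq module; the toy data here is made up):

    import numpy as np
    from scipy.cluster.vq import whiten, kmeans, vq

    features = np.random.rand(200, 2)            # toy data; use your own observations
    whitened = whiten(features)                  # scale each feature to unit variance
    centroids, distortion = kmeans(whitened, 3)  # find 3 cluster centroids
    labels, _ = vq(whitened, centroids)          # assign each point to its nearest centroid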

There's also scipy-cluster, which does agglomerative clustering; this has the advantage that you don't need to decide on the number of clusters ahead of time.
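(scipy-cluster's hierarchical clustering was later folded into scipy.cluster.hierarchy; a sketch using that module, where you cut the cluster tree at a distance threshold instead of fixing the number of clusters up front — the threshold here is an arbitrary choice for toy data:)

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    points = np.random.rand(50, 2)         # toy data
    Z = linkage(points, method='average')  # build the cluster tree bottom-up
    # cut the tree at a distance threshold rather than at a fixed k
    labels = fcluster(Z, t=0.5, criterion='distance')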


SciPy's kmeans2() has some numerical problems: others have reported error messages such as "Matrix is not positive definite - Cholesky decomposition cannot be computed" in version 0.6.0, and I just encountered the same in version 0.7.1.
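(If you'd rather stay with SciPy, one possible workaround — an assumption on my part, not something verified against those SciPy versions — is to seed kmeans2 from sampled data points rather than its default random initialization, which is where the Cholesky decomposition comes in:)

    import numpy as np
    from scipy.cluster.vq import kmeans2

    points = np.random.rand(100, 2)  # toy data
    # minit='points' picks initial centroids from the observations themselves,
    # avoiding the covariance-based default initialization
    centroids, labels = kmeans2(points, 3, minit='points')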

For now, I would recommend using PyCluster instead. Example usage:

    >>> import numpy
    >>> import Pycluster
    >>> points = numpy.vstack([numpy.random.multivariate_normal(mean,
                                                                0.03 * numpy.diag([1, 1]),
                                                                20)
                               for mean in [(1, 1), (2, 4), (3, 2)]])
    >>> labels, error, nfound = Pycluster.kcluster(points, 3)
    >>> labels  # Cluster number for each point
    array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
    >>> error   # The within-cluster sum of distances for the solution
    1.7721661785401261
    >>> nfound  # Number of times this solution was found
    1


For continuous data, k-means is very easy.

You need a list of your means, and for each data point, find the mean it's closest to and average the new data point into it. Your means will then represent the recent salient clusters of points in the input data.

I do the averaging continuously, so there is no need to keep the old data around to compute the new average. Given the old average k, the next data point x, and a constant n (the number of past data points to keep the average of), the new average is

k*(1-(1/n)) + x*(1/n)
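For example, with an old average k = 2.0, the next data point x = 3.0, and n = 10, the new average is 2.0*(9/10) + 3.0*(1/10) = 2.1. (In the code below, param plays the role of 1/n.)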

Here is the full code in Python:

    from __future__ import division
    from random import random

    # init means and data to random values
    # use real data in your code
    means = [random() for i in range(10)]
    data = [random() for i in range(1000)]

    param = 0.01  # bigger numbers make the means change faster
                  # must be between 0 and 1

    for x in data:
        # find the mean closest to this data point
        closest_k = 0
        smallest_error = float('inf')
        for i, mean in enumerate(means):
            error = abs(x - mean)
            if error < smallest_error:
                smallest_error = error
                closest_k = i
        # update only the winning mean, as a running average
        means[closest_k] = means[closest_k] * (1 - param) + x * param

You could just print the means when all the data has passed through, but it's much more fun to watch them change in real time. I used this on frequency envelopes of 20 ms bits of sound, and after talking to it for a minute or two, it had consistent categories for the short 'a' vowel, the long 'o' vowel, and the 's' consonant. Weird!