Clustering values by their proximity in python (machine learning?) [duplicate]



Don't use clustering for 1-dimensional data

Clustering algorithms are designed for multivariate data. When you have 1-dimensional data, sort it and look for the largest gaps. This is trivial and fast in 1d, but it does not generalize to 2d, where there is no total order to sort by. If you want something more advanced, use Kernel Density Estimation (KDE) and split the data set at the local minima of the estimated density.
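As an illustration (not from the original answer), here is a minimal sketch of the sort-and-split idea; the choice of k = 5 groups is an assumption you would tune for your data:

```python
import numpy as np

x = np.array([1, 1, 5, 6, 1, 5, 10, 22, 23, 23, 50, 51, 51, 52,
              100, 112, 130, 500, 512, 600, 12000, 12230])

x_sorted = np.sort(x)
gaps = np.diff(x_sorted)  # distances between consecutive sorted values

# Split at the k-1 largest gaps to form k groups (k=5 is an arbitrary choice)
k = 5
split_points = np.sort(np.argsort(gaps)[-(k - 1):]) + 1
clusters = np.split(x_sorted, split_points)

for i, c in enumerate(clusters):
    print("cluster {0}: {1}".format(i, c))
```

The same idea extends to KDE: instead of raw gaps, you would split wherever the estimated density has a local minimum.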

There are a number of duplicates of this question.


A good option if you don't know the number of clusters is MeanShift:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

x = [1, 1, 5, 6, 1, 5, 10, 22, 23, 23, 50, 51, 51, 52,
     100, 112, 130, 500, 512, 600, 12000, 12230]
X = np.array(list(zip(x, np.zeros(len(x)))), dtype=int)

bandwidth = estimate_bandwidth(X, quantile=0.1)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)

for k in range(n_clusters_):
    my_members = labels == k
    print("cluster {0}: {1}".format(k, X[my_members, 0]))

Output for this algorithm:

cluster 0: [ 1  1  5  6  1  5 10 22 23 23 50 51 51 52]
cluster 1: [100 112 130]
cluster 2: [500 512]
cluster 3: [12000]
cluster 4: [12230]
cluster 5: [600]

By modifying the quantile parameter you can change the criterion that determines how many clusters are found.
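To see how quantile drives the result, you can run a small sweep (the quantile values below are illustrative, not prescriptive); a larger quantile yields a larger bandwidth and therefore fewer, coarser clusters:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

x = [1, 1, 5, 6, 1, 5, 10, 22, 23, 23, 50, 51, 51, 52,
     100, 112, 130, 500, 512, 600, 12000, 12230]
X = np.array(list(zip(x, np.zeros(len(x)))), dtype=int)

n_clusters = {}
for q in (0.1, 0.3, 0.5):
    bandwidth = estimate_bandwidth(X, quantile=q)
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)
    n_clusters[q] = len(np.unique(ms.labels_))
    print("quantile={0}: {1} clusters".format(q, n_clusters[q]))
```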


You can use clustering to group these. The trick is to understand that there are two dimensions to your data: the dimension you can see, and a "spatial" dimension that looks like [0, 1, 2, ..., 21]. You can create this matrix in numpy like so:

import numpy as np

y = [1, 1, 5, 6, 1, 5, 10, 22, 23, 23, 50, 51, 51, 52,
     100, 112, 130, 500, 512, 600, 12000, 12230]
x = range(len(y))
m = np.matrix([x, y]).transpose()

Then you can perform clustering on the matrix, with:

from scipy.cluster.vq import kmeans

kclust = kmeans(m, 5)

kclust's output will look like this:

(array([[   11,    51],
        [   15,   114],
        [   20, 12115],
        [    4,     9],
        [   18,   537]]), 21.545126372346271)

For you, the most interesting part is the first column of the matrix, which says what the centers are along that x dimension:

kclust[0][:, 0]
# [20 18 15  4 11]

You can then assign your points to a cluster based on which of the five centers they are closest to:

cluster_indices = kclust[0][:, 0]
assigned_clusters = [abs(cluster_indices - e).argmin() for e in x]
# [3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 2, 2, 2, 2, 1, 1, 0, 0, 0]
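As a usage example (a sketch that repeats the setup so it runs on its own; the fixed seed is only for reproducibility, since kmeans initialization is random), you can collect the original values under each cluster label:

```python
import numpy as np
from scipy.cluster.vq import kmeans

np.random.seed(42)  # kmeans picks random initial centroids; seed for reproducibility

y = [1, 1, 5, 6, 1, 5, 10, 22, 23, 23, 50, 51, 51, 52,
     100, 112, 130, 500, 512, 600, 12000, 12230]
x = list(range(len(y)))
m = np.array([x, y], dtype=float).transpose()

centroids, distortion = kmeans(m, 5)
cluster_indices = centroids[:, 0]  # centroid positions along the "spatial" dimension
assigned_clusters = [int(abs(cluster_indices - e).argmin()) for e in x]

# Group the original values by their assigned cluster label
groups = {}
for label, value in zip(assigned_clusters, y):
    groups.setdefault(label, []).append(value)
for label in sorted(groups):
    print("cluster {0}: {1}".format(label, groups[label]))
```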