Grid search for hyperparameter evaluation of clustering in scikit-learn
The clusteval library helps you evaluate the data and find the optimal number of clusters. It provides five methods that can be used to evaluate clusterings: silhouette, dbindex, derivative, dbscan, and hdbscan.
pip install clusteval
Which evaluation method is appropriate depends on your data.
# Import library
from clusteval import clusteval

# Set parameters, as an example dbscan
ce = clusteval(method='dbscan')

# Fit to find the optimal number of clusters using dbscan
results = ce.fit(X)

# Make a plot of the cluster evaluation
ce.plot()

# Make a scatter plot. Note that the first two coordinates are used for plotting.
ce.scatter(X)

# results is a dict with various output statistics. One of them is the labels.
cluster_labels = results['labx']
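The snippet above assumes you already have a feature matrix X. For experimentation, one quick way to build one (using scikit-learn's make_blobs, my addition, not part of the original answer) is:

```python
from sklearn.datasets import make_blobs

# 500 samples in 4 well-separated groups; any (n_samples, n_features)
# array works as the input X for ce.fit(X)
X, _ = make_blobs(n_samples=500, centers=4, n_features=2, random_state=1)
```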
OK, this might be an old question, but I use this kind of code:
First, we want to generate all the possible combinations of parameters:
def make_generator(parameters):
    if not parameters:
        yield dict()
    else:
        key_to_iterate = list(parameters.keys())[0]
        next_round_parameters = {p: parameters[p]
                                 for p in parameters if p != key_to_iterate}
        for val in parameters[key_to_iterate]:
            for pars in make_generator(next_round_parameters):
                temp_res = pars
                temp_res[key_to_iterate] = val
                yield temp_res
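A shorter equivalent (my addition, not part of the original answer) uses itertools.product from the standard library to build the same combination dicts:

```python
from itertools import product

def make_generator(parameters):
    # Yield every combination of the parameter values,
    # one dict per combination (same contract as the recursive version).
    keys = list(parameters.keys())
    for values in product(*(parameters[k] for k in keys)):
        yield dict(zip(keys, values))
```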
Then create a loop out of this:
from sklearn.cluster import KMeans

# Add fixed parameters - here it's just a random one
fixed_params = {"max_iter": 300}
param_grid = {"n_clusters": range(2, 11)}

for params in make_generator(param_grid):
    params.update(fixed_params)
    ca = KMeans(**params)
    ca.fit(_data)
    labels = ca.labels_
    # Evaluate your clustering labels and
    # make a decision to save or discard them!
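One concrete way to fill in the "make a decision" step is to keep the parameters that give the best silhouette score. This is a sketch under my own assumptions (silhouette_score as the criterion, make_blobs toy data), not the original answerer's method:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data just for illustration
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

best_score, best_params = -1.0, None
for n_clusters in range(2, 11):
    km = KMeans(n_clusters=n_clusters, max_iter=300, n_init=10, random_state=0)
    labels = km.fit_predict(X)
    score = silhouette_score(X, labels)  # higher is better, in [-1, 1]
    if score > best_score:
        best_score, best_params = score, {"n_clusters": n_clusters}
```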
Of course, this can be wrapped up in a tidy function, so this solution is mostly an example.
Hope it helps someone!
Recently I ran into a similar problem. I defined a custom iterable cv_custom
which defines the splitting strategy and is passed as the cross-validation parameter cv
. This iterable should contain one pair per fold, with samples identified by their indices, e.g. ([fold1_train_ids], [fold1_test_ids]), ([fold2_train_ids], [fold2_test_ids]), ...
In our case, we need just one pair for one fold, with the indices of all examples in both the train part and the test part: ([train_ids], [test_ids])
from sklearn.model_selection import cross_val_score

N = len(distance_matrix)
cv_custom = [(range(0, N), range(0, N))]
scores = cross_val_score(clf, X, y, cv=cv_custom)
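The same single-fold trick can also drive GridSearchCV for a clusterer. This is a hedged sketch under my own assumptions: the silhouette scorer, toy data, and KMeans are my additions, not part of the original answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

def silhouette_scorer(estimator, X_fold, y_fold=None):
    # GridSearchCV calls the scorer with the already-fitted estimator
    # and the "test" part of the fold - here, the whole dataset.
    return silhouette_score(X_fold, estimator.predict(X_fold))

N = len(X)
# Train on and score the same samples: one fold covering everything
cv_single_fold = [(np.arange(N), np.arange(N))]

search = GridSearchCV(
    KMeans(n_init=10, random_state=0),
    param_grid={"n_clusters": range(2, 8)},
    scoring=silhouette_scorer,
    cv=cv_single_fold,
)
search.fit(X)
```

After fitting, search.best_params_ holds the n_clusters with the highest silhouette score on the full dataset.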