Grid search for hyperparameter evaluation of clustering in scikit-learn


The clusteval library can help you evaluate your data and find the optimal number of clusters. It contains five methods for evaluating clusterings: silhouette, dbindex, derivative, dbscan and hdbscan.

pip install clusteval

Depending on your data, you can choose the appropriate evaluation method.

# Import library
from clusteval import clusteval

# Set parameters, using dbscan as an example
ce = clusteval(method='dbscan')

# Fit to find the optimal number of clusters using dbscan
results = ce.fit(X)

# Make a plot of the cluster evaluation
ce.plot()

# Make a scatter plot. Note that the first two coordinates are used for plotting.
ce.scatter(X)

# results is a dict with various output statistics. One of them is the labels.
cluster_labels = results['labx']
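
If you want to compare several of these evaluation methods on the same data, you can loop over them using the same API shown above. A minimal sketch (the comparison loop is my own illustration, not part of the library docs):

from clusteval import clusteval

# Compare how many clusters each of the five evaluation methods finds on X
for method in ['silhouette', 'dbindex', 'derivative', 'dbscan', 'hdbscan']:
    ce = clusteval(method=method)
    results = ce.fit(X)
    n_clusters = len(set(results['labx']))
    print(method, n_clusters)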


OK, this might be an old question, but I use this kind of code:

First, we want to generate all the possible combinations of parameters:

def make_generator(parameters):
    if not parameters:
        yield dict()
    else:
        key_to_iterate = list(parameters.keys())[0]
        next_round_parameters = {p: parameters[p]
                                 for p in parameters if p != key_to_iterate}
        for val in parameters[key_to_iterate]:
            for pars in make_generator(next_round_parameters):
                temp_res = pars
                temp_res[key_to_iterate] = val
                yield temp_res
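
As an alternative, scikit-learn ships a ParameterGrid utility that performs the same enumeration; a minimal sketch:

from sklearn.model_selection import ParameterGrid

# ParameterGrid yields one dict per parameter combination, like make_generator above
param_grid = {"n_clusters": range(2, 11), "init": ["k-means++", "random"]}
for params in ParameterGrid(param_grid):
    print(params)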

Then create a loop out of this:

from sklearn.cluster import KMeans

# Add fixed parameters - here it's just an arbitrary one
fixed_params = {"max_iter": 300}
param_grid = {"n_clusters": range(2, 11)}

for params in make_generator(param_grid):
    params.update(fixed_params)
    ca = KMeans(**params)
    ca.fit(_data)
    labels = ca.labels_
    # Evaluate your clustering labels and
    # decide whether to save or discard them!

Of course, this can be wrapped up in a neat function; the code above is mostly an example.
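
For instance, a minimal sketch of such a function, scoring each parameter combination with scikit-learn's silhouette_score and keeping the best one (the helper name and the scoring choice are mine, not part of the original answer):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def grid_search_kmeans(X, param_grid, fixed_params):
    """Fit KMeans for every parameter combination and return the best result."""
    best_score, best_params, best_labels = -1.0, None, None
    for params in make_generator(param_grid):
        params.update(fixed_params)
        labels = KMeans(**params).fit(X).labels_
        score = silhouette_score(X, labels)  # higher is better, in [-1, 1]
        if score > best_score:
            best_score, best_params, best_labels = score, params, labels
    return best_params, best_score, best_labels

best_params, best_score, best_labels = grid_search_kmeans(
    X, {"n_clusters": range(2, 11)}, {"max_iter": 300})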

Hope it helps someone!


Recently I ran into a similar problem. I defined a custom iterable cv_custom which defines the splitting strategy and is an input for the cross-validation parameter cv. This iterable should contain one pair per fold, with the samples identified by their indices, e.g. ([fold1_train_ids], [fold1_test_ids]), ([fold2_train_ids], [fold2_test_ids]), ... In our case, we need just one pair for a single fold, with the indices of all examples in both the train and the test part: ([train_ids], [test_ids])

from sklearn.model_selection import cross_val_score

# One "fold" whose train and test parts both contain all samples
N = len(distance_matrix)
cv_custom = [(range(0, N), range(0, N))]
scores = cross_val_score(clf, X, y, cv=cv_custom)
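
The same trick makes GridSearchCV usable for clustering. A minimal sketch, assuming KMeans and a silhouette-based scoring callable (both my choices, not from the original answer); with the single all-data fold, each candidate is fit and scored on the full dataset:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV

def silhouette_scoring(estimator, X, y=None):
    # Score the cluster assignment the fitted estimator produces for X
    return silhouette_score(X, estimator.predict(X))

N = len(X)
cv_custom = [(np.arange(N), np.arange(N))]  # one fold: train = test = all samples

search = GridSearchCV(KMeans(max_iter=300),
                      param_grid={"n_clusters": range(2, 11)},
                      scoring=silhouette_scoring,
                      cv=cv_custom)
search.fit(X)
print(search.best_params_, search.best_score_)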