Will pandas dataframe object work with sklearn kmeans clustering?

python pandas scikit-learn cluster-analysis k-means

Assuming all the values in the dataframe are numeric,

    import pandas
    import sklearn.cluster

    # Convert DataFrame to matrix
    mat = dataset.values

    # Using sklearn
    km = sklearn.cluster.KMeans(n_clusters=5)
    km.fit(mat)

    # Get cluster assignment labels
    labels = km.labels_

    # Format results as a DataFrame
    results = pandas.DataFrame([dataset.index, labels]).T

Alternatively, you could try KMeans++ for Pandas.


To check whether your DataFrame dataset has suitable content, you can explicitly convert it to a NumPy array and inspect its dtype:

    dataset_array = dataset.values
    print(dataset_array.dtype)
    print(dataset_array)

If the array has a homogeneous numerical dtype (typically numpy.float64), then it should be fine for scikit-learn 0.15.2 and later. You might still need to normalize the data, for instance with sklearn.preprocessing.StandardScaler.
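As a minimal sketch of that normalization step: the DataFrame below is hypothetical (the column names income and age and their values are made up for illustration), and the choice of n_clusters=2, n_init=10 and random_state=0 is an assumption, not something from the question. It standardizes each column before clustering so that the large-valued column does not dominate the distance computation.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two numeric columns on very different scales.
dataset = pd.DataFrame({
    "income": [20000.0, 21000.0, 80000.0, 82000.0],
    "age": [25.0, 60.0, 27.0, 58.0],
})

# Rescale every column to zero mean and unit variance.
# fit_transform accepts a DataFrame and returns a NumPy array by default.
scaled = StandardScaler().fit_transform(dataset)

# Cluster the scaled values; the parameters here are illustrative only.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(scaled)

# Attach the labels to the original index.
results = pd.DataFrame({"cluster": labels}, index=dataset.index)
print(results)
```

Whether scaling helps depends on the data: if the columns are already in comparable units, it may be unnecessary.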

If your DataFrame is heterogeneously typed, the dtype of the corresponding NumPy array will be object, which is not suitable for scikit-learn. You need to extract a numerical representation for all the relevant features (for instance by extracting dummy variables for categorical features) and drop the columns that are not suitable features (e.g. sample identifiers).
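One way this could look in practice is sketched below. The DataFrame and its column names (sample_id, colour, weight) are hypothetical, and pandas.get_dummies is just one possible way to build dummy variables; the KMeans parameters are likewise only illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical mixed-type data: an identifier, a categorical and a numeric column.
dataset = pd.DataFrame({
    "sample_id": ["a1", "a2", "a3", "a4"],
    "colour": ["red", "blue", "red", "blue"],
    "weight": [1.0, 1.2, 5.0, 5.3],
})

# Drop the identifier column: it labels samples but is not a feature.
features = dataset.drop(columns=["sample_id"])

# Replace the categorical column with one 0/1 dummy column per category.
features = pd.get_dummies(features, columns=["colour"], dtype=float)

# Every remaining column is float, so the array has a numeric dtype.
features_array = features.to_numpy()
print(features_array.dtype)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(features_array)
```

Note that dummy columns and continuous columns end up on different scales, so scaling may still be worth considering before clustering.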
