Will pandas dataframe object work with sklearn kmeans clustering?
Assuming all the values in the dataframe are numeric,
    import pandas
    import sklearn.cluster

    # Convert DataFrame to matrix
    mat = dataset.values
    # Using sklearn
    km = sklearn.cluster.KMeans(n_clusters=5)
    km.fit(mat)
    # Get cluster assignment labels
    labels = km.labels_
    # Format results as a DataFrame
    results = pandas.DataFrame([dataset.index, labels]).T
Alternatively, you could try KMeans++ for Pandas.
To check whether your DataFrame (here called dataset) has suitable content, you can explicitly convert it to a NumPy array:
    dataset_array = dataset.values
    print(dataset_array.dtype)
    print(dataset_array)
If the array has a homogeneous numerical dtype (typically numpy.float64), then it should be fine for scikit-learn 0.15.2 and later. You might still need to normalize the data, for instance with sklearn.preprocessing.StandardScaler.
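As a rough illustration of the normalization step, here is a minimal sketch that standardizes the columns before clustering. The column names, the random data, and the choice of 3 clusters are assumptions made up for the example, not part of the question.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data: two columns on very different scales.
rng = np.random.default_rng(0)
dataset = pd.DataFrame({
    "height_m": rng.normal(1.7, 0.1, size=100),
    "income": rng.normal(50_000, 15_000, size=100),
})

# Standardize each column to zero mean and unit variance so that
# the large-valued column does not dominate the Euclidean distances.
scaled = StandardScaler().fit_transform(dataset)

# Cluster the scaled values (3 clusters is an arbitrary choice here).
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(scaled)

# Attach the cluster labels to the original index.
results = pd.DataFrame({"cluster": labels}, index=dataset.index)
print(results.head())
```

Whether scaling helps depends on the data: if all columns are already in comparable units, you may prefer to cluster the raw values.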
If your DataFrame is heterogeneously typed, the dtype of the corresponding NumPy array will be object, which is not suitable for scikit-learn. You need to extract a numerical representation for all the relevant features (for instance by extracting dummy variables for categorical features) and drop the columns that are not suitable features (e.g. sample identifiers).
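One possible way to do that, sketched with pandas.get_dummies on a made-up frame. The column names sample_id, color, and weight are hypothetical and only serve to show an identifier, a categorical feature, and a numeric feature side by side.

```python
import pandas as pd

# Hypothetical mixed-type frame: an identifier, a categorical
# feature, and a numeric feature.
dataset = pd.DataFrame({
    "sample_id": ["a1", "a2", "a3", "a4"],
    "color": ["red", "blue", "red", "green"],
    "weight": [1.2, 3.4, 2.2, 0.7],
})

# The raw values have a mixed dtype, which scikit-learn cannot use.
print(dataset.values.dtype)

# Drop the identifier column, which is not a feature.
features = dataset.drop(columns=["sample_id"])

# Replace the categorical column with one 0/1 column per category.
features = pd.get_dummies(features, columns=["color"], dtype=float)

# The result is a homogeneous floating-point array.
feature_array = features.to_numpy()
print(features.columns.tolist())
print(feature_array.dtype)
```

The resulting feature_array can then be passed to KMeans in the same way as mat in the first answer.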