Apply StandardScaler to parts of a data set


scikit-learn v0.20 introduced ColumnTransformer, which applies transformers to a specified set of columns of an array or pandas DataFrame.

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler

    data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})

    col_names = ['Name', 'Age', 'Weight']
    features = data[col_names]

    ct = ColumnTransformer([
            ('somename', StandardScaler(), ['Age', 'Weight'])
        ], remainder='passthrough')

    ct.fit_transform(features)

NB: Like Pipeline, it also has a shorthand version, make_column_transformer, which doesn't require naming the transformers.

Output

    array([[-1.41100443,  1.20270298,  3.        ],
           [ 0.62304092,  0.04295368,  4.        ],
           [ 0.78796352, -1.24565666,  6.        ]])
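The make_column_transformer shorthand mentioned in the NB above could look roughly like this. This is a minimal sketch, assuming a reasonably recent scikit-learn in which each tuple is written as (transformer, columns); the variable name `result` is only illustrative.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})

# No explicit transformer name is given; one is generated
# automatically from the transformer's class name.
ct = make_column_transformer(
    (StandardScaler(), ['Age', 'Weight']),
    remainder='passthrough',
)

# Scaled Age and Weight come first, the untouched Name column is appended last.
result = ct.fit_transform(data)
print(result)
```

As with the named version, the passed-through column ends up after the transformed ones, so the column order of the result differs from the input.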


Update:

Currently the best way to handle this is to use ColumnTransformer as explained here.


First create a copy of your dataframe:

    scaled_features = data.copy()

Don't include the Name column in the transformation:

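    from sklearn.preprocessing import StandardScaler

    col_names = ['Age', 'Weight']
    features = scaled_features[col_names]

    # Fit on the selected columns only, then transform them
    scaler = StandardScaler().fit(features.values)
    features = scaler.transform(features.values)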

Now, don't create a new dataframe but assign the result to those two columns:

    scaled_features[col_names] = features
    print(scaled_features)

            Age  Name    Weight
    0 -1.411004     3  1.202703
    1  0.623041     4  0.042954
    2  0.787964     6 -1.245657
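The copy, scale, and assign-back steps above can be bundled into one small function. This is only a sketch: `scale_columns` is a hypothetical helper name, not part of pandas or scikit-learn, and it assumes the listed columns are numeric.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_columns(df, columns):
    """Return a copy of df in which only `columns` are standardized.

    Hypothetical helper for illustration; the input frame is left unchanged.
    """
    scaled = df.copy()
    # Fit and transform the selected columns, then write them back in place
    scaled[columns] = StandardScaler().fit_transform(df[columns].to_numpy())
    return scaled


data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})
scaled_features = scale_columns(data, ['Age', 'Weight'])
print(scaled_features)
```

Because the function works on a copy, the original `data` keeps its unscaled values, and the Name column is carried over untouched.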


Another option would be to drop the Name column before scaling, then add it back afterwards:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    data = pd.DataFrame({'Name': [3, 4, 6], 'Age': [18, 92, 98], 'Weight': [68, 59, 49]})

    # Save the variable you don't want to scale
    name_var = data['Name']

    # Fit scaler to your data
    scaler = StandardScaler()
    scaler.fit(data.drop('Name', axis=1))

    # Calculate scaled values and store them in a separate object
    scaled_values = scaler.transform(data.drop('Name', axis=1))

    # Rebuild the DataFrame from the scaled values, then restore the Name column
    data = pd.DataFrame(scaled_values, index=data.index, columns=data.drop('Name', axis=1).columns)
    data['Name'] = name_var

    print(data)