Can anyone explain StandardScaler to me?

python machine-learning scikit-learn scaling standardized

Intro

I assume that you have a matrix X where each row is a sample/observation and each column is a variable/feature (this is the input layout that scikit-learn estimators expect, by the way -- X.shape should be (number_of_samples, number_of_features)).

Core of method

The main idea is to standardize the features/variables/columns of X individually, i.e. give each of them μ = 0 and σ = 1, before applying any machine learning model.

StandardScaler() will standardize the features, i.e. each column of X, INDIVIDUALLY, so that each column/feature/variable will have μ = 0 and σ = 1.

P.S.: I find the most upvoted answer on this page wrong. I am quoting: "each value in the dataset will have the sample mean value subtracted". This is not correct.

Example with code

    from sklearn.preprocessing import StandardScaler
    import numpy as np

    # 4 samples/observations and 2 variables/features
    data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(data)

    print(data)
    # [[0 0]
    #  [1 0]
    #  [0 1]
    #  [1 1]]

    print(scaled_data)
    # [[-1. -1.]
    #  [ 1. -1.]
    #  [-1.  1.]
    #  [ 1.  1.]]

Verify that the mean of each feature (column) is 0:

    scaled_data.mean(axis=0)
    # array([0., 0.])

Verify that the std of each feature (column) is 1:

    scaled_data.std(axis=0)
    # array([1., 1.])
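To see the INDIVIDUALLY part more clearly, here is a small sketch with two columns on very different scales. The array X_demo and its values are made up for illustration; mean_ and scale_ are the per-column statistics that a fitted StandardScaler stores.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 holds small values, column 1 holds large ones
X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0]])

demo_scaler = StandardScaler().fit(X_demo)

# One mean and one standard deviation per column, not one for the whole matrix
print(demo_scaler.mean_)   # per-column means
print(demo_scaler.scale_)  # per-column standard deviations

# Each column is shifted and scaled with its own statistics
X_demo_scaled = demo_scaler.transform(X_demo)
print(X_demo_scaled)
```

Because each column is standardized with its own mean and standard deviation, the two columns end up identical after scaling even though their raw values differ by a factor of 100.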

Appendix: The maths
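
As a sketch of the maths, the usual z-score definition can be written as follows. The index notation (i over the N samples, j over the features) is an assumption made here for illustration; σ_j is the population standard deviation, i.e. it divides by N, which is also what NumPy's std(axis=0) computes by default.

```latex
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}, \qquad
\mu_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}, \qquad
\sigma_j = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x_{ij} - \mu_j \right)^2}
```

For the first column of the example, (0, 1, 0, 1), this gives μ = 0.5 and σ = 0.5, so 0 maps to (0 - 0.5) / 0.5 = -1 and 1 maps to (1 - 0.5) / 0.5 = 1.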

UPDATE 08/2020: Concerning setting the input parameters with_mean and with_std to False/True, I have provided an answer here: StandardScaler difference between “with_std=False or True” and “with_mean=False or True”
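
As a rough illustration of those two parameters (a minimal sketch, not a substitute for the linked answer), the snippet below reuses the 4x2 array from the example and switches off one step at a time. The variable names centered_only and scaled_only are chosen here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)

# with_std=False: only subtract the column mean, do not divide by the std
centered_only = StandardScaler(with_mean=True, with_std=False).fit_transform(data)

# with_mean=False: only divide by the column std, do not subtract the mean
scaled_only = StandardScaler(with_mean=False, with_std=True).fit_transform(data)

print(centered_only)  # equals data - data.mean(axis=0)
print(scaled_only)    # equals data / data.std(axis=0)
```

With both parameters left at their default of True, the two steps are combined and you get the fully standardized result.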

The idea behind StandardScaler is that it will transform your data such that its distribution will have a mean of 0 and a standard deviation of 1.
In the case of multivariate data, this is done feature-wise (in other words, independently for each column of the data).
Given the distribution of the data, each value in the dataset will have the mean value subtracted, and is then divided by the standard deviation of the whole dataset (or of the feature, in the multivariate case).

How to calculate it:
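
A minimal sketch of the calculation with plain NumPy, compared against StandardScaler. The array X_small is hypothetical data chosen for illustration, and the comparison assumes no column is constant (a constant column would make the manual division divide by zero).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 4 samples, 2 features, neither column constant
X_small = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 40.0],
                    [4.0, 50.0]])

# Per-column mean and standard deviation (ddof=0, NumPy's default)
mu = X_small.mean(axis=0)
sigma = X_small.std(axis=0)

# z = (x - mean) / std, applied column by column via broadcasting
z_manual = (X_small - mu) / sigma

# The same transformation done by scikit-learn
z_sklearn = StandardScaler().fit_transform(X_small)

print(np.allclose(z_manual, z_sklearn))
```

The two results agree because StandardScaler also uses the population standard deviation (dividing by the number of samples) for each column.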

You can read more here:

http://sebastianraschka.com/Articles/2014_about_feature_scaling.html#standardization-and-min-max-scaling
