What is the difference between 'transform' and 'fit_transform' in sklearn?
In the scikit-learn estimator API:
fit(): learns the model parameters from the training data.
transform(): applies the parameters learned by fit() to a dataset to produce the transformed data.
fit_transform(): combines fit() and transform() in a single call on the same dataset.
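A minimal sketch of the three methods, assuming StandardScaler as the transformer and a small made-up array X (neither comes from the answer itself). It only illustrates that fitting and then transforming gives the same result as a single fit_transform() call on the same data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration: 4 samples, 2 features.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

# Two-step version: learn the parameters, then apply them.
two_step = StandardScaler()
two_step.fit(X)
X_two_step = two_step.transform(X)

# One-step version on the same data.
X_one_step = StandardScaler().fit_transform(X)

# Both routes should produce the same scaled array.
same_result = np.allclose(X_two_step, X_one_step)
print(same_result)
```

The one-step form is a convenience for the training data; the fitted object from the two-step form can be reused on other datasets.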
See Chapter 4 of this book and this answer on StackExchange for more clarity.
These methods are used to center and scale the features of a given dataset, which brings all features onto a comparable scale.
For this, we use the Z-score method: z = (x - μ) / σ.
The parameters μ and σ are computed on the training set.
1. fit(): calculates the parameters μ and σ and stores them as attributes of the object.
2. transform(): applies the transformation to a given dataset using these stored parameters.
3. fit_transform(): combines fit() and transform() to fit and transform a dataset in one call.
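To make the stored parameters concrete, here is a small sketch that computes μ and σ by hand with NumPy and compares them with what StandardScaler stores after fit(). The array X_train is made up for illustration; mean_ and scale_ are the attributes scikit-learn uses for the per-feature mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training data: 3 samples, 2 features on very different scales.
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

sc = StandardScaler()
sc.fit(X_train)  # learns mu and sigma per feature

# Manual Z-score parameters (population standard deviation, ddof=0).
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Manual transformation: z = (x - mu) / sigma
manual = (X_train - mu) / sigma

params_match = np.allclose(sc.mean_, mu) and np.allclose(sc.scale_, sigma)
output_match = np.allclose(sc.transform(X_train), manual)
print(params_match, output_match)
```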
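Code snippet for feature scaling/standardisation (after train_test_split):
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # learn μ and σ from the training set, then scale it
X_test = sc.transform(X_test)        # scale the test set with the training μ and σ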
We apply the same transformation to the test set, using the same two parameters (μ and σ) that were computed from the training set.
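A short sketch of why the training parameters are reused, with made-up one-feature arrays. Calling transform() on the test set keeps it on the training scale, while refitting on the test set (shown only as a contrast) rescales it with its own μ and σ and hides how far the test values lie from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: one feature.
X_train = np.array([[0.0], [10.0]])   # mean 5, std 5
X_test = np.array([[5.0], [20.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)

# Correct: reuse the training mu and sigma, so (x - 5) / 5.
X_test_scaled = sc.transform(X_test)

# Contrast: refitting on the test set uses the test set's own mu and sigma.
X_test_refit = StandardScaler().fit_transform(X_test)

print(X_test_scaled.ravel())
print(X_test_refit.ravel())
```

With the training parameters, the second test value maps to 3 (three training standard deviations above the training mean); after refitting, it maps to 1 and looks unremarkable.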
The .transform method is meant for when you have already computed PCA, i.e. if you have already called its .fit method.
In [12]: pc2 = RandomizedPCA(n_components=3)

In [13]: pc2.transform(X) # can't transform because it does not know how to do it.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-e3b6b8ea2aff> in <module>()
----> 1 pc2.transform(X)

/usr/local/lib/python3.4/dist-packages/sklearn/decomposition/pca.py in transform(self, X, y)
    714         # XXX remove scipy.sparse support here in 0.16
    715         X = atleast2d_or_csr(X)
--> 716         if self.mean_ is not None:
    717             X = X - self.mean_
    718

AttributeError: 'RandomizedPCA' object has no attribute 'mean_'

In [14]: pc2.ftransform(X)
pc2.fit            pc2.fit_transform

In [14]: pc2.fit_transform(X)
Out[14]:
array([[-1.38340578, -0.2935787 ],
       [-2.22189802,  0.25133484],
       [-3.6053038 , -0.04224385],
       [ 1.38340578,  0.2935787 ],
       [ 2.22189802, -0.25133484],
       [ 3.6053038 ,  0.04224385]])
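The session above comes from an old scikit-learn release. As a hedged sketch of the same behaviour on a recent release, the example below uses the plain PCA class (RandomizedPCA was removed in later versions, with PCA(svd_solver="randomized") as its replacement) and a made-up array X. There, calling transform() before fit() raises NotFittedError rather than AttributeError.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.exceptions import NotFittedError

# Made-up data: 6 samples, 2 features.
X = np.array([[-1.0, -1.0], [-2.0, -1.0], [-3.0, -2.0],
              [1.0, 1.0], [2.0, 1.0], [3.0, 2.0]])

pca = PCA(n_components=2)

# transform() needs the mean and components learned by fit().
try:
    pca.transform(X)
    raised = False
except NotFittedError:
    raised = True

# fit_transform() learns the parameters and applies them in one call.
X_projected = pca.fit_transform(X)

print(raised)
print(X_projected.shape)
```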
So you want to fit RandomizedPCA and then transform, as follows:
In [20]: pca = RandomizedPCA(n_components=3)

In [21]: pca.fit(X)
Out[21]: RandomizedPCA(copy=True, iterated_power=3, n_components=3, random_state=None, whiten=False)

In [22]: pca.transform(z)
Out[22]:
array([[ 2.76681156,  0.58715739],
       [ 1.92831932,  1.13207093],
       [ 0.54491354,  0.83849224],
       [ 5.53362311,  1.17431479],
       [ 6.37211535,  0.62940125],
       [ 7.75552113,  0.92297994]])
In particular, PCA's .transform applies the change of basis obtained through the PCA decomposition of the matrix X to the matrix Z (z in the session above).
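The change of basis can be written out by hand. The sketch below assumes the current PCA class with its default whiten=False, and made-up arrays X and Z; it checks that transform() on Z matches centering Z with the mean learned from X and projecting onto the components learned from X.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: X is used to learn the basis, Z is new data to project.
X = np.array([[-1.0, -1.0], [-2.0, -1.0], [-3.0, -2.0],
              [1.0, 1.0], [2.0, 1.0], [3.0, 2.0]])
Z = X + 4.0  # hypothetical second dataset, shifted away from X

pca = PCA(n_components=2).fit(X)

# Library call: project Z using the basis learned from X.
projected = pca.transform(Z)

# Manual version: center with X's mean, then project onto X's components.
manual = (Z - pca.mean_) @ pca.components_.T

basis_match = np.allclose(projected, manual)
print(basis_match)
```

Nothing about Z is learned here: both the mean and the components come from the fit() call on X.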