Using fit_transform() and transform()
Question:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
What I know is that the fit() method calculates the mean and standard deviation of each feature, and the transform() method then uses them to turn the feature into a new scaled feature. fit_transform() is nothing but calling fit() and transform() in a single line.
But why are we calling fit() only for the training data and not for the testing data?
Does that mean we are using the mean and standard deviation of the training data to transform our testing data?
Answers:
fit computes the mean and stdev to be used for later scaling. Note that it is just a computation; no scaling is done.
transform uses the previously computed mean and stdev to scale the data (subtract the mean from every value, then divide by the stdev).
fit_transform does both at the same time, so you can do it with just one line of code.
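To make the three methods concrete, here is a minimal sketch on a made-up 4x2 array (the values are only for illustration). It checks that transform() is the same as subtracting mean_ and dividing by scale_, and that fit_transform() matches fit() followed by transform(). Note that StandardScaler uses the population standard deviation (ddof=0), which is what NumPy's std() computes by default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, invented for this example: 4 samples, 2 features.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

sc = StandardScaler()
sc.fit(X)  # only computes and stores the statistics; X is not changed

mean, std = sc.mean_, sc.scale_  # per-feature mean and standard deviation

manual = (X - mean) / std                      # scaling done by hand
scaled = sc.transform(X)                       # scaling done by the fitted scaler
one_step = StandardScaler().fit_transform(X)   # fit + transform in one call

print(np.allclose(manual, scaled), np.allclose(scaled, one_step))
```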
For the X_train dataset, we call fit_transform because we need to compute the mean and stdev and then use them to scale X_train. For the X_test dataset, since we already have the mean and stdev, we only do the transformation part.
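As a rough illustration with tiny made-up arrays, the sketch below confirms that the scaled test set is computed from the training mean and stdev, not from the test set's own statistics. One consequence is that the scaled test data does not have to end up with mean 0.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data, invented for this example.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0], [5.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learn mean/stdev from train, then scale train
X_test_scaled = sc.transform(X_test)        # reuse the train mean/stdev on test

# What we expect if the test set is scaled with the *training* statistics.
expected_test = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(X_test_scaled, expected_test))
print(X_train_scaled.mean(), X_test_scaled.mean())
```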
Edit: The X_test data should be totally unseen and unknown (i.e., no information is extracted from it), so we can only derive information from X_train. The model is trained on features scaled with the X_train mean and stdev, so we apply those same values to transform X_test as well. That gives an "apples-to-apples" comparison between y_test and y_pred.
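A small sketch of what goes wrong otherwise, again with invented numbers: if a second scaler is fitted on X_test, the same raw value (3.0 here) is mapped to a different scaled value than it got during training, so the model would see inconsistent inputs.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, invented for this example; the raw value 3.0 appears in both sets.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[3.0], [5.0], [7.0]])

train_scaler = StandardScaler().fit(X_train)

# Consistent: test data scaled with the training statistics.
consistent = train_scaler.transform(X_test)

# Inconsistent: a separate scaler fitted on the test data itself.
refit_on_test = StandardScaler().fit_transform(X_test)

# How the raw value 3.0 was represented when the model was trained.
seen_in_training = train_scaler.transform(np.array([[3.0]]))

print(seen_in_training[0, 0], consistent[0, 0], refit_on_test[0, 0])
```

The first two printed values agree; the third does not, even though all three come from the same raw input.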
By the way, if the train/test split is done properly without bias and the data is sufficiently large, both datasets will have approximately the same mean and stdev, each close to the population values.
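This can be checked empirically. The sketch below assumes synthetic data (100,000 draws from a normal distribution with mean 50 and stdev 10, chosen arbitrarily) and a random 80/20 split. The second scaler is fitted on the test set only to inspect its statistics, never to scale anything.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data; distribution and size are assumptions for the demo.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100_000, 1))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

train_stats = StandardScaler().fit(X_train)
test_stats = StandardScaler().fit(X_test)  # for inspection only

# Gaps between the train and test estimates of mean and stdev.
mean_gap = abs(train_stats.mean_[0] - test_stats.mean_[0])
std_gap = abs(train_stats.scale_[0] - test_stats.scale_[0])

print(mean_gap, std_gap)
```

Both gaps come out small relative to the stdev of 10, which is why scaling the test set with the training statistics costs very little when the split is random and the data is large.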