Using fit_transform() and transform()

Question:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

What I know is that fit() calculates the mean and standard deviation of each feature, and transform() then uses them to turn the feature into a new, scaled feature. fit_transform() is nothing but calling fit() and transform() in a single line.

But why are we calling fit() only for the training data and not for the testing data?

Does that mean we are using the mean and standard deviation of the training data to transform our testing data?

Asked By: Satyam Puranik

||

Answers:

fit computes the mean and stdev to be used for later scaling. Note that it is only a computation; no scaling is done.

transform uses the previously computed mean and stdev to scale the data (subtract the mean from every value, then divide by the stdev).
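
As a rough illustration of the two steps, here is a minimal sketch on a made-up one-column array (the data is hypothetical, not from the question). It uses the fitted attributes mean_ and scale_ of StandardScaler and checks that transform matches the manual formula. Note that scale_ is the population standard deviation (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, four samples.
X = np.array([[1.0], [2.0], [3.0], [4.0]])

sc = StandardScaler()
sc.fit(X)  # only computes statistics; X itself is not changed

print(sc.mean_)   # per-feature mean learned by fit
print(sc.scale_)  # per-feature standard deviation learned by fit

# transform applies (x - mean) / stdev using the stored statistics.
scaled = sc.transform(X)
manual = (X - sc.mean_) / sc.scale_
print(np.allclose(scaled, manual))
```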

fit_transform does both in one call, so you can do it with just one line of code.
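
A small sketch of that equivalence, again on hypothetical toy data: calling fit followed by transform should give the same result as a single fit_transform call on the same array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features, three samples.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# Two steps: fit returns the scaler itself, so the calls can be chained.
two_step = StandardScaler().fit(X).transform(X)

# One step.
one_step = StandardScaler().fit_transform(X)

print(np.allclose(two_step, one_step))
```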

For X_train dataset, we do fit_transform because we need to compute mean and stdev, and then use it to scale the X_train dataset. For X_test dataset, since we already have the mean and stdev, we only do the transformation part.
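
To make this concrete, here is a minimal sketch with made-up numbers (the arrays are assumptions for illustration only). The scaler learns its mean and stdev from X_train, and those same values are reused for X_test, so the scaled test data does not have to end up with mean 0 and stdev 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy split: training mean is 5, training stdev is 5.
X_train = np.array([[0.0], [10.0]])
X_test = np.array([[20.0], [30.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learn mean/stdev from train, then scale train
X_test_scaled = sc.transform(X_test)        # reuse the training mean/stdev; no refit

print(X_train_scaled.ravel())  # [-1.  1.]
print(X_test_scaled.ravel())   # [3. 5.]  i.e. (20 - 5) / 5 and (30 - 5) / 5
```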

Edit: X_test should be treated as totally unseen and unknown (i.e., no information is extracted from it), so we can only derive information from X_train. We apply the mean and stdev derived from X_train to X_test as well so that the model receives test features on the same scale it was trained on, which gives an "apple-to-apple" comparison between y_test and y_pred.

By the way, if the train/test split is done properly without bias and the data is sufficiently large, both datasets will have a mean and stdev close to those of the population, and therefore close to each other.
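
A quick way to see this is the sketch below, which assumes synthetic normally distributed data and a random 80/20 split (the distribution, sample size, and seeds are arbitrary choices for illustration). With this many samples, the train and test statistics should differ only slightly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: 100,000 samples of one feature.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100_000, 1))

# Random (unbiased) split into 80% train and 20% test.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

mean_gap = abs(X_train.mean() - X_test.mean())
std_gap = abs(X_train.std() - X_test.std())

print("train mean/stdev:", X_train.mean(), X_train.std())
print("test  mean/stdev:", X_test.mean(), X_test.std())
print("gaps:", mean_gap, std_gap)
```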

Answered By: perpetualstudent