What does calling fit() multiple times on the same model do?

Question:

After I instantiate a scikit-learn model (e.g. LinearRegression), if I call its fit() method multiple times (with different X and y data), what happens? Does it fit the model on the new data as if I had just re-instantiated the model (i.e. from scratch), or does it take into account the data already fitted in the previous call to fit()?

Trying with LinearRegression (and also looking at its source code), it seems to me that every time I call fit(), it fits from scratch, ignoring the result of any previous call to the same method. I wonder if this is true in general, and whether I can rely on this behavior for all models/pipelines in scikit-learn.

Asked By: Fanta

||

Answers:

If you execute model.fit(X_train, y_train) a second time, it will overwrite all previously fitted coefficients, weights, intercept (bias), etc.

If you want to fit just a portion of your data set and then improve your model by fitting new data, you can use estimators that support "Incremental learning" (those that implement the partial_fit() method).
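The overwrite behavior can be checked directly. The following is a minimal sketch, assuming synthetic random data (the array names and shapes are made up for illustration): it fits one LinearRegression twice and compares it with a fresh instance fitted only on the second data set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two unrelated synthetic data sets (illustrative assumption, not from the question)
rng = np.random.default_rng(0)
X1, y1 = rng.normal(size=(50, 3)), rng.normal(size=50)
X2, y2 = rng.normal(size=(50, 3)), rng.normal(size=50)

model = LinearRegression()
model.fit(X1, y1)
model.fit(X2, y2)  # second call replaces everything learned from X1, y1

# A brand-new estimator that has only ever seen the second data set
fresh = LinearRegression().fit(X2, y2)

same_as_fresh = (
    np.allclose(model.coef_, fresh.coef_)
    and np.allclose(model.intercept_, fresh.intercept_)
)
print(same_as_fresh)
```

One documented exception worth keeping in mind: some estimators accept warm_start=True, in which case a later fit() reuses the previous solution as its starting point.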

The terms "fit" and "train" are used interchangeably in machine learning. Depending on the classification model you have instantiated, for example clf = GaussianNB() or clf = SVC(), your model uses the corresponding machine learning technique.
As soon as you call clf.fit(features_train, label_train), your model starts training using the features and labels that you have passed.

You can then use clf.predict(features_test) to predict.
If you call clf.fit(features_train2, label_train2) again, it will start training from scratch on the newly passed data and discard the previous results. The model resets the following internal state:

  • Weights
  • Fitted Coefficients
  • Bias
  • Any other state learned during training

If you want to keep what was learned previously and continue training on the next batch of data, use the partial_fit() method instead (available only on estimators that implement it).
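As a rough illustration of the difference, here is a sketch that assumes GaussianNB (one of the estimators that implements partial_fit()) and synthetic data. It feeds the data in four batches, then shows that a plain fit() afterwards resets the accumulated state.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

clf = GaussianNB()
classes = np.unique(y)  # all classes must be known on the first partial_fit() call

# Train incrementally, 50 samples at a time; earlier batches are kept
for start in range(0, len(X), 50):
    batch = slice(start, start + 50)
    clf.partial_fit(X[batch], y[batch], classes=classes)

samples_seen = int(clf.class_count_.sum())  # counts samples from every batch

# A plain fit() afterwards starts over and only sees the data passed to it
clf.fit(X[:50], y[:50])
samples_after_refit = int(clf.class_count_.sum())

print(samples_seen, samples_after_refit)
```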

Answered By: sgrpwr

Beware that fit() returns the estimator itself, so the fitted model is effectively passed "by reference". Here, model1 and model2 are the same object, and model1 will be overwritten:

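import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df1 = pd.DataFrame(np.random.rand(100).reshape(10,10))
df2 = df1.copy()
df2.iloc[0,0] = df2.iloc[0,0] - 2 # change one value

pca = PCA()
model1 = pca.fit(df1) # fit() returns pca itself
model2 = pca.fit(df2) # refits the same object, so model1 changes too

np.unique(model1.explained_variance_ == model2.explained_variance_)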

Returns

array([ True])

To avoid this use

from copy import deepcopy
model1 = deepcopy(pca.fit(df1)) # independent copy, unaffected by later fits
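
An alternative to deepcopy, sketched here under the assumption that you want two independently fitted models with the same hyperparameters, is sklearn.base.clone, which returns a new unfitted estimator. The arrays A and B are made up for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.decomposition import PCA

# Illustrative data: B differs from A in a single entry
rng = np.random.default_rng(0)
A = rng.random((10, 10))
B = A.copy()
B[0, 0] -= 2

pca = PCA()
model1 = pca.fit(A)
returns_self = model1 is pca  # fit() hands back the very same object

# clone() copies the hyperparameters only, so each fit gets its own object
model_a = clone(pca).fit(A)
model_b = clone(pca).fit(B)

independent = model_a is not model_b
differ = not np.allclose(model_a.explained_variance_, model_b.explained_variance_)
print(returns_self, independent, differ)
```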
Answered By: Servus

In Keras (unlike scikit-learn), yes: successive calls to fit() will incrementally train the model, continuing from the current weights.

https://github.com/keras-team/keras/issues/4446

Answered By: Utsab Khakurel