Set parameters for classifier and use it without fitting

Question:

I’m using Python and scikit-learn to do some classification.

Is it possible to reuse the parameters learned by a classifier?

For example:

from sklearn.svm import SVC

cl = SVC(...)    # create svm classifier with some hyperparameters
cl.fit(X_train, y_train)
params = cl.get_params()

Let’s store these params somewhere as a dictionary of strings, or even write them to a JSON file. Assume we later want to use this trained classifier to make predictions on some data. Try to restore it:

params = ...  # retrieve these parameters stored somewhere as a dictionary
data = ...    # the data we want to make predictions on
cl = SVC(...)
cl.set_params(**params)
predictions = cl.predict(data)

If I do it this way, I get a NotFittedError with the following stack trace:

File "C:UsersviacheslavPythonPython36-32libsite-packagessklearnsvmbase.py", line 548, in predict
    y = super(BaseSVC, self).predict(X)
  File "C:UsersviacheslavPythonPython36-32libsite-packagessklearnsvmbase.py", line 308, in predict
    X = self._validate_for_predict(X)
  File "C:UsersviacheslavPythonPython36-32libsite-packagessklearnsvmbase.py", line 437, in _validate_for_predict
    check_is_fitted(self, 'support_')
  File "C:UsersviacheslavPythonPython36-32libsite-packagessklearnutilsvalidation.py", line 768, in check_is_fitted
    raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.exceptions.NotFittedError: This SVC instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.

Is it possible to set parameters on a classifier and make predictions without fitting? How do I do that?

Answers:

The root of the problem: get_params() returns only the hyperparameters you passed to the constructor, not the learned state (support vectors, coefficients, and so on), so set_params() can never produce a fitted estimator. To reuse a trained classifier, persist the whole fitted object. Please read about model persistence in scikit-learn:

import joblib  # sklearn.externals.joblib is deprecated; use the standalone package
joblib.dump(clf, 'filename.pkl') 

and later on:

clf = joblib.load('filename.pkl')
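
A minimal end-to-end sketch of this (the filename is arbitrary): persisting the fitted estimator keeps the learned state, so the restored object can predict immediately:

import joblib
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X_train, y_train = make_classification(random_state=0)

cl = SVC().fit(X_train, y_train)
joblib.dump(cl, 'filename.pkl')         # saves hyperparameters AND fitted state

restored = joblib.load('filename.pkl')
print(restored.predict(X_train[:5]))    # works immediately, no refit needed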

This may be possible, but it is object-dependent. The scikit-learn developers are generally good about packaging all parameters/weights in the estimator itself, with no external dependencies in the serialized object. Since everything lives on the object, you should be able to manually transfer the weights to an initialized but untrained object, thereby skipping training.
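
As a minimal sketch of that idea: rather than guessing which fitted attributes a given estimator needs, you can copy the whole instance dictionary, which is essentially what pickling does for you:

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(random_state=0)
fitted = SVC().fit(X, y)

blank = SVC()                            # initialized but never fitted
blank.__dict__.update(fitted.__dict__)   # copy hyperparameters and all fitted state

# The blank object now predicts without ever calling fit():
assert np.array_equal(blank.predict(X), fitted.predict(X))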

I’m not sure of your use case, but mine was that I didn’t want to retrain: an unsupervised clustering model is very unstable, and we wanted to match clusters over existing data as close to exactly as possible; even with all the same parameters, the clusters could shift. On top of that, in my case the algorithm itself changed slightly between scikit-learn’s NMF in 0.21 and in 1.3.1.

You’ll have to step through the source yourself, either with a debugger that lets you step into library code or by reading it on GitHub (all the scikit-learn source is there). I was getting something similar to the NotFittedError above because I didn’t have EVERY required attribute set. If you step through the source you’ll see exactly what raises that error and how to avoid it.
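
A shortcut that can save some source-diving: the error is raised by sklearn.utils.validation.check_is_fitted (visible in the stack trace above), and you can call it directly to test whether your manually assembled estimator passes the fitted check. In recent scikit-learn (0.22+) the attribute names are optional:

from sklearn.decomposition import NMF
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

model = NMF(n_components=9)
try:
    check_is_fitted(model)    # the same check predict/transform run internally
except NotFittedError as e:
    print(e)                  # "This NMF instance is not fitted yet. ..."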

Here’s an example of how I did it in case it helps you or anyone else:

"""
Unsupervised topics model was far out of date (2020).  Retraining was not an option
because of the risk of topics changing.  Manually loaded the old model and copied all
attributes/parameters into a blank model.  Weights are saved as npy files to s3, in
the bucket below.

"""
import os
import muriel # in-house convenience library
import joblib
import numpy as np
from muriel.s3 import upload_file
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

version = 2

cdir = os.path.dirname(os.path.realpath(__file__))


def murielPreprocessor(x):
    return muriel.tokenizer.preprocessor(x)


def murielTokenizer(x):
    return [
        a.svalue if a.svalue != "" else a.value
        for a in muriel.tokenizer.tokenizer(
            x, do_stop=True, regex=muriel.tokenizer.reFull
        )
        if a.svalue != "<>" and len(a.svalue) > 1
    ]


topic_s3_prefix = "models/topic_modeling"


nmf_model_new = NMF(
    alpha_W=0.1,
    alpha_H='same',
    beta_loss='frobenius',
    init='nndsvd',
    l1_ratio=0.5,
    max_iter=200,
    n_components=9,
    random_state=133573,
    shuffle=False,
    solver='cd',
    tol=0.0001,
    verbose=0,
)

# this npy was exported from the pickled model while it was loaded under the old scikit-learn version
nmf_model_new.components_ = np.load(
    f'{cdir}/unsupervised_topics_nmf_model_h_coefficients_nopkl.npy',
    allow_pickle=False,
)
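# For reference, that export was done on the *old* scikit-learn install,
# roughly like this (hypothetical filename for the old pickle):
#
#   old_model = joblib.load("unsupervised_topics_nmf_model.pkl")
#   np.save("unsupervised_topics_nmf_model_h_coefficients_nopkl.npy",
#           old_model.components_, allow_pickle=False)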
# you need ALL of these or you will get an error that it hasn't been fit
nmf_model_new.n_components_ = 9
nmf_model_new.reconstruction_err_ = 4698.519666568644
nmf_model_new.n_iter_ = 199

vectorizer_new = CountVectorizer(
    input='content',
    encoding='utf-8',
    decode_error='strict',
    strip_accents='unicode',
    lowercase=False,
    preprocessor=murielPreprocessor,
    tokenizer=murielTokenizer,
    stop_words=None,
    token_pattern=r'(?u)\b\w\w+\b',  # raw string, otherwise \b is a backspace; unused anyway since tokenizer is set
    ngram_range=(1, 1),
    analyzer='word',
    max_df=1.0,
    min_df=30,
    max_features=None,
    vocabulary=None,
    binary=False,
    dtype=np.int64,
)

vectorizer_new.vocabulary_ = np.load(
    f'{cdir}/unsupervised_topics_vectorizer_vocabulary.npy', allow_pickle=True
).item()
vectorizer_new.fixed_vocabulary_ = False
vectorizer_new.stop_words_ = np.load(
    f'{cdir}/unsupervised_topics_vectorizer_stop_words.npy', allow_pickle=True
).item()

nmf_save_name = f"NMF{version:0>2}.joblib.model"
vec_save_name = f"Vectorizer{version:0>2}.joblib.model"

joblib.dump(nmf_model_new, nmf_save_name, compress=9)
joblib.dump(vectorizer_new, vec_save_name, compress=9)

upload_file(nmf_save_name, topic_s3_prefix)
upload_file(vec_save_name, topic_s3_prefix)
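
For completeness, a sketch of how the persisted artifacts are used later, with no call to fit() anywhere. (This assumes the muriel preprocessor/tokenizer functions above are importable wherever the vectorizer is unpickled, since joblib stores references to them.)

nmf = joblib.load(nmf_save_name)
vec = joblib.load(vec_save_name)

docs = ["some text to assign a topic to"]
W = nmf.transform(vec.transform(docs))  # document-topic weight matrix
print(W.argmax(axis=1))                 # dominant topic per document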
Answered By: diyer0001