TfidfVectorizer: predict with a saved classifier

Question:

I trained my model using TfidfVectorizer and MultinomialNB and saved it to a pickle file.

Now that I am trying to use the classifier from another file to predict on unseen data, I cannot do it because it tells me that the number of features of the classifier is not the same as the number of features of my current corpus.

This is the code where I am trying to predict. The function do_vectorize is exactly the same one used in training.

from sklearn.feature_extraction.text import TfidfVectorizer

def do_vectorize(data, stop_words=[], tokenizer_fn=tokenize):
    # NOTE: fit_transform builds a brand-new vocabulary from `data`
    vectorizer = TfidfVectorizer(stop_words=stop_words, tokenizer=tokenizer_fn)
    X = vectorizer.fit_transform(data)
    return X, vectorizer

import pickle

# Vectorizing the unseen documents
matrix, vectorizer = do_vectorize(corpus, stop_words=stop_words)

# Predicting with the trained model
clf = pickle.load(open('../data/classifier_0.5_function.pkl', 'rb'))
predictions = clf.predict(matrix)

However, I receive an error saying that the number of features is different:

ValueError: Expected input with 65264 features, got 472546 instead

Does this mean I also have to save my vocabulary from training in order to test? What will happen if there are terms that did not exist in training?

I tried to use pipelines from scikit-learn with the same vectorizer and classifier, and the same parameters for both. However, it became far too slow, going from 1 hour to more than 6 hours, so I prefer to do it manually.

Asked By: user2288043


Answers:

Does this mean I also have to save my vocabulary from training in order to test?

Yes, you have to save the whole TF-IDF vectorizer, which in particular means saving its vocabulary.
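
For example, a minimal sketch of this workflow (the toy data and file names below are placeholders, not from the original post) would pickle both the fitted vectorizer and the classifier, and then call transform rather than fit_transform on the unseen documents:

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# --- training side (illustrative toy data) ---
train_corpus = ["good movie great acting", "terrible plot bad acting"]
y_train = [1, 0]

vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_corpus)   # the vocabulary is learned here
clf = MultinomialNB().fit(X_train, y_train)

# save both the classifier AND the fitted vectorizer
with open('classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

# --- prediction side ---
with open('classifier.pkl', 'rb') as f:
    clf = pickle.load(f)
with open('vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)

unseen = ["great movie with awful ending"]
# transform (not fit_transform) reuses the training vocabulary,
# so the feature count matches what the classifier expects
matrix = vectorizer.transform(unseen)
print(clf.predict(matrix))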

What will happen if there are terms that did not exist in training?

They will be ignored, which makes sense: you have no training data for them, so there is nothing to take into account (there are more complex methods that could still make use of such terms, but they go beyond simple approaches like TF-IDF).
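
A small illustrative sketch (not from the original post) shows this: tokens absent from the training vocabulary simply get no column when you call transform:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(["the cat sat on the mat"])

# "unicorn" and "near" were never seen during fitting, so they map to no
# column and are silently dropped; only known tokens get non-zero weights
row = vectorizer.transform(["a unicorn sat near the cat and the mat"])
print(sorted(vectorizer.vocabulary_))   # ['cat', 'mat', 'on', 'sat', 'the']
print(row.toarray())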

I tried to use pipelines from scikit-learn with the same vectorizer and classifier, and the same parameters for both. However, it became far too slow, going from 1 hour to more than 6 hours, so I prefer to do it manually.

There should be little to no overhead when using pipelines; however, doing things manually is fine as long as you remember to store the vectorizer as well.
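
If you do want a single object to pickle, a Pipeline bundles the fitted vectorizer and classifier together, and predict then accepts raw text (a sketch with the same toy-data assumptions as above):

import pickle
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('nb', MultinomialNB()),
])
pipe.fit(["good movie great acting", "terrible plot bad acting"], [1, 0])

# one pickle file now holds both steps
with open('model.pkl', 'wb') as f:
    pickle.dump(pipe, f)

with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

# predict takes raw text; the pipeline vectorizes internally
print(model.predict(["great movie with awful ending"]))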

Answered By: lejlot

You have to set a maximum feature limit when initializing the TF-IDF vectorizer,
like this:

tfidf_vectorizer = TfidfVectorizer(max_features=1200)

and then use the same feature limit when converting the test data to TF-IDF.

Answered By: Almas Zia