SVC text classification - TypeError: unhashable type: 'csr_matrix'

Question:

I’m quite new to machine learning. I’m trying to build an SVC text classifier. However, when I try to make a single prediction, I get the error TypeError: unhashable type: 'csr_matrix'. I’m not sure why this is happening.

The objective is to make a binary classification from a dataset with the columns [text, label], where the first one is a sentence and the second one is 0 or 1.

I can make predictions on X_test, but I can’t get it to work for a single new sentence.

Sample code:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np

tfid = TfidfVectorizer(encoding='utf-8', lowercase=True, analyzer='word')
X = tfid.fit_transform(df['text'])
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

# Training the SVM model on the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state=42)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)
# array([0, 1, 1, ..., 0, 0, 1])

## Making the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)
# [[3762   61]
#  [  43 3919]]
# 0.9866409762363519

And here is the code for the single prediction, followed by the traceback:

# Loading tfid with model.feature_names as vocabulary
tfid = TfidfVectorizer(encoding='utf-8', lowercase=True, analyzer='word', vocabulary=X_train)

## Predicting a new result
to_pred = tfid.fit_transform([df['text'].iloc[0]])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-9be72cc31a52> in <module>()
      1 ## Predicting a new result
----> 2 to_pred = tfid.fit_transform([df['text'].iloc[0]])

2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/feature_extraction/text.py in _validate_vocabulary(self)
    469                 vocab = {}
    470                 for i, t in enumerate(vocabulary):
--> 471                     if vocab.setdefault(t, i) != i:
    472                         msg = "Duplicate term in vocabulary: %r" % t
    473                         raise ValueError(msg)

TypeError: unhashable type: 'csr_matrix'

This is what df['text'].iloc[0] looks like:

df['text'].iloc[0]
'coming up with a baby name is hard being lazy is much easier'
Asked By: GUNTER

||

Answers:

There are a few things wrong with this code.

First of all, you’re fitting your tf-idf on both the train and test data. That’s not good practice, because the vectorizer’s vocabulary and IDF weights then carry information from the test set into training. In real life you do not have access to the test data. You’re supposed to split into train and test first, then fit_transform your tfidf on the train set and only transform the test set (pretending you don’t know what’s in the test set, just like in real life).
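A minimal sketch of that order of operations, assuming a small made-up list of sentences in place of df['text'] and df['label'] (the toy data and the variable names X_train_text / X_test_text are only for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical toy data standing in for df['text'] and df['label'].
texts = [
    'coming up with a baby name is hard',
    'being lazy is much easier',
    'apples are red and healthy',
    'i love apples and oranges',
    'naming a baby takes a long time',
    'lazy days are the best days',
    'oranges are healthy and fun',
    'red apples are fun to eat',
]
labels = [0, 0, 1, 1, 0, 0, 1, 1]

# 1) Split the RAW text first, before any vectorizing.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# 2) Fit the vectorizer on the training text only.
tfid = TfidfVectorizer(lowercase=True, analyzer='word')
X_train = tfid.fit_transform(X_train_text)

# 3) Only transform the test text, reusing the training vocabulary and IDF.
X_test = tfid.transform(X_test_text)

classifier = SVC(kernel='linear', random_state=42)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(y_pred)
```

Because the test text goes through transform only, X_train and X_test end up with the same number of columns, and words that appear only in the test set are simply ignored.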

Another problem is that you created a new tfidf instance to convert the sentence you want to predict. You should reuse the tfidf instance you already fitted, and call transform (not fit_transform) on it:

# Imagine this comes after the code above (so the tfidf here is already fitted on the train data)
to_pred = tfid.transform(['that thing you said about being lazy'])
# Then predict
print(classifier.predict(to_pred))
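One way to make it hard to forget this step is to bundle the vectorizer and the classifier together. This is not what the question's code does; it is a sketch using scikit-learn's make_pipeline, with made-up toy sentences and a variable name (model) chosen only for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for the training split of df.
texts = [
    'coming up with a baby name is hard',
    'being lazy is much easier',
    'apples are red and healthy',
    'i love apples and oranges',
    'naming a baby takes a long time',
    'red apples are fun to eat',
]
labels = [0, 0, 1, 1, 0, 1]

# The pipeline fits the vectorizer and the SVC together on raw strings.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, analyzer='word'),
    SVC(kernel='linear', random_state=42),
)
model.fit(texts, labels)

# predict() takes raw text; the fitted vectorizer is applied internally
# with transform, so there is no second TfidfVectorizer to construct.
single_pred = model.predict(['that thing you said about being lazy'])
print(single_pred)
```

The input to predict still has to be a list (or other iterable) of strings, even for a single sentence.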

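You’re getting this error because the vocabulary parameter does not expect a csr matrix (i.e. your text data after transforming it with tfidf, which returns a sparse matrix object for efficiency). As the traceback shows, the vectorizer iterates over whatever you pass and uses each item as a dictionary key; iterating over a csr matrix yields sparse rows, which are not hashable. It expects a mapping from terms to column indices (or an iterable of terms), like: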

{'love': 5, 'apples': 1, 'are': 2, 'healthy': 4, 'and': 0, 'fun': 3, 'red': 6}

But that shouldn’t matter, because building a second vectorizer for prediction is the wrong approach anyway.
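To see both points concretely, here is a small sketch with a made-up three-sentence corpus (the names fitted, rebuilt, reused and refit are only for illustration). It shows that a fitted vectorizer's vocabulary_ attribute is the kind of dictionary the parameter accepts, and that even with the right vocabulary, calling fit_transform on a single sentence re-estimates the IDF weights from that one sentence, so the features no longer match what the classifier was trained on:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the training text.
corpus = ['apples are red', 'apples are healthy', 'love and fun']

fitted = TfidfVectorizer()
fitted.fit(corpus)

# vocabulary_ is a dict mapping each term to its column index.
vocab = fitted.vocabulary_
print(vocab)

sentence = ['apples are red']

# Recommended: reuse the fitted vectorizer and only transform.
reused = fitted.transform(sentence)

# A new vectorizer with the same vocabulary runs without the TypeError,
# but fit_transform recomputes IDF from this single sentence alone.
rebuilt = TfidfVectorizer(vocabulary=vocab)
refit = rebuilt.fit_transform(sentence)

# Same columns, different weights.
print(reused.shape == refit.shape)
print(np.allclose(reused.toarray(), refit.toarray()))
```

So passing a proper dictionary would silence the error, but the resulting vector would still be weighted differently from the training data.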

Answered By: Gaussian Prior