ValueError: X has 3 features, but LinearSVC is expecting 64852 features as input

Question:

I get the following error when I try to deploy this model.

ValueError: X has 3 features, but LinearSVC is expecting 64852 features as input

Example of data below.

data = [[3409, False, 'Lorum Ipsum'], [409, True, 'dolor sit amet consectetuer'], [7869, False, 'Aenean commodo ligula eget dolor']]
df = pd.DataFrame(data, columns=['id', 'booleanv', 'text'])

The code that creates the model is below.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

df = pd.read_csv('cleandata.csv')

# Split dataset into training and validation set
train_size = int(df.shape[0] * 0.8)

train_df = df[:train_size]
val_df = df[train_size:]

# split text and labels
X_train = train_df.text.to_numpy()
Y_train = train_df.booleanv.to_numpy()
X_test = val_df.text.to_numpy()
Y_test = val_df.booleanv.to_numpy()


tfidf = TfidfVectorizer(ngram_range=(1,1))
X_train_tf = tfidf.fit_transform(X_train)
X_test_tf = tfidf.transform(X_test)

model1 = LinearSVC(random_state=0, tol=1e-5)
model1.fit(X_train_tf, Y_train)

import pickle

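# Save the fitted classifier and vectorizer; the with-blocks close the files
with open('classification.pickle', 'wb') as f:
    pickle.dump(model1, f)
with open('vectorizer.pickle', 'wb') as f:
    pickle.dump(tfidf, f)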

X_train and X_test are both arrays. The input I feed to the API I created is in JSON format. I suspect that I need to transform my input somehow. Is this correct? If so, how can I do that?

Asked By: EvitaSchaap

||

Answers:

To obtain predictions from your model, you need to follow the same transformation steps that were undertaken during the training phase.

The ValueError you are encountering indicates that you are passing the raw data (three columns: id, booleanv, text) to the classifier without vectorizing it. Because the model was trained on a sparse matrix with 64852 features (the output of tfidf.fit_transform(X_train)), it expects vectorized input with the same number of features. Here is how it can be done:

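import pickle

# Load the fitted vectorizer and classifier saved at training time
with open('vectorizer.pickle', 'rb') as f:
    tfidf = pickle.load(f)
with open('classification.pickle', 'rb') as f:
    model1 = pickle.load(f)

input_data = {
    'id': 1234,
    'booleanv': False,
    'text': 'your input text goes here'
}

# Vectorize only the text field, using the already fitted vectorizer
input_vectorized = tfidf.transform([input_data['text']])

# Get predictions
predictions = model1.predict(input_vectorized)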

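This can, of course, be adapted to batches by passing a list of texts to tfidf.transform instead of a single one. Moreover, using a scikit-learn Pipeline is highly recommended: it bundles the vectorizer and the classifier into one object, so the same transformation is always applied at training and prediction time.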
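As a minimal sketch of the pipeline and batch ideas, the example below wraps a TfidfVectorizer and a LinearSVC with make_pipeline, round-trips the fitted pipeline through pickle, and predicts on a JSON batch. The training texts, labels, and JSON payload are made-up stand-ins for cleandata.csv and the real API input, and the names clf, loaded, and payload are only illustrative.

```python
import json
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the 'text' and 'booleanv' columns
train_texts = [
    "lorem ipsum dolor",
    "dolor sit amet consectetuer",
    "aenean commodo ligula eget dolor",
    "lorem ipsum sit amet",
]
train_labels = [False, True, False, True]

# One object that vectorizes raw text and then classifies it
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 1)),
    LinearSVC(random_state=0, tol=1e-5),
)
clf.fit(train_texts, train_labels)

# A single pickle now holds both steps (round trip done in memory here)
loaded = pickle.loads(pickle.dumps(clf))

# Hypothetical JSON body as an API might receive it: a list of records
payload = json.loads(
    '[{"id": 1234, "booleanv": false, "text": "lorem ipsum"},'
    ' {"id": 5678, "booleanv": true, "text": "dolor sit amet"}]'
)

# Only the text field goes to the model; the pipeline vectorizes it internally
texts = [record["text"] for record in payload]
predictions = loaded.predict(texts)

# The classifier's feature count matches the fitted vocabulary size
n_features = len(loaded.named_steps["tfidfvectorizer"].vocabulary_)
```

With this layout the API only has to load one file and call predict on a list of raw strings, which should make the feature-count mismatch from the question much harder to reproduce.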

Answered By: A.T.B