Logistic regression: X has 667 features per sample; expecting 74869

Question:

Using an IMDB movie reviews dataset, I have built a logistic regression model to predict the sentiment of a review.

tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None,
                        tokenizer=fill, use_idf=True, norm='l2', smooth_idf=True)
y = df.sentiment.values
X = tfidf.fit_transform(df.review)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.3, shuffle=False)
clf = LogisticRegressionCV(cv=5, scoring="accuracy", random_state=1, n_jobs=-1, verbose=3, max_iter=300).fit(X_train, y_train)

yhat = clf.predict(X_test)


print("accuracy:")
print(clf.score(X_test, y_test))

model_performance(X_train, y_train, X_test, y_test, clf)

Prior to this, text preprocessing has been applied.
model_performance is just a function to create a confusion matrix.
This all works well, with good accuracy.

I now scrape new IMDB reviews:

#The movie "Joker" IMDB review page
url_link='https://www.imdb.com/title/tt7286456/reviews'
html=urlopen(url_link)

content_bs=BeautifulSoup(html)

JokerReviews = []
#All the reviews end in a div with class "text"; this can be found in the IMDB page source
for b in content_bs.find_all('div',class_='text'):
  JokerReviews.append(b)

df = pd.DataFrame.from_records(JokerReviews)
df['sentiment'] = "0" 
jokerData=df[0]
jokerData = jokerData.apply(preprocessor)

Problem: Now I wish to use the same logistic regression to predict the sentiment:

tfidf2 = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None, tokenizer=fill, use_idf=True, norm='l2', smooth_idf=True)
y = df.sentiment.values
Xjoker = tfidf2.fit_transform(jokerData)

yhat = Clf.predict(Xjoker)

But I get the error:
ValueError: X has 667 features per sample; expecting 74869

I don't get why it has to have the same number of features as X_test.

Asked By: Ronnie


Answers:

The problem is that the preprocessing used to train your model identified 74,869 unique words, while the preprocessing of your inference data identified only 667 words, and you are supposed to send the model data with the same number of columns. Besides that, some of the 667 words identified for inference may not be known to the model at all.

To create a valid input for your model, you have to use an approach such as:

# note: this assumes X and Xjoker are DataFrames whose columns are feature names
# check which columns are expected by the model but missing from the inference DataFrame
not_existing_cols = [c for c in X.columns.tolist() if c not in Xjoker.columns]
# add these columns to the DataFrame
Xjoker = Xjoker.reindex(columns=Xjoker.columns.tolist() + not_existing_cols)
# the new columns have no values, so replace null by 0
Xjoker.fillna(0, inplace=True)
# use the original X structure as a mask for the new inference DataFrame
Xjoker = Xjoker[X.columns.tolist()]

After these steps, you can call the predict() method.
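As a minimal illustration of this alignment step (with tiny made-up DataFrames, not the question's data — note the approach assumes the feature matrices are DataFrames whose columns are the vectorizer's feature names):

```python
import pandas as pd

# Hypothetical training matrix X and inference matrix Xjoker
X = pd.DataFrame([[1, 2, 3]], columns=["good", "bad", "movie"])
Xjoker = pd.DataFrame([[5, 7]], columns=["movie", "joker"])

# columns the model expects but the inference frame lacks
not_existing_cols = [c for c in X.columns.tolist() if c not in Xjoker.columns]
# add them, fill the empty cells with 0
Xjoker = Xjoker.reindex(columns=Xjoker.columns.tolist() + not_existing_cols)
Xjoker.fillna(0, inplace=True)
# reorder/select to match the training layout; unseen words ("joker") are dropped
Xjoker = Xjoker[X.columns.tolist()]

print(Xjoker.columns.tolist())  # ['good', 'bad', 'movie']
```

After the mask step, Xjoker has exactly the columns the model was trained on, in the same order; words the model never saw are silently discarded.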

Answered By: Daniel Labbe

You need to use transform instead of fit_transform in the code below:

tfidf2 = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None, tokenizer=fill, use_idf=True, norm='l2', smooth_idf=True)
y = df.sentiment.values
Xjoker = tfidf2.fit_transform(jokerData)

yhat = Clf.predict(Xjoker)

so that it becomes:

tfidf2 = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None, tokenizer=fill, use_idf=True, norm='l2', smooth_idf=True)
y = df.sentiment.values
Xjoker = tfidf2.transform(jokerData)

yhat = Clf.predict(Xjoker)
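To see the difference, here is a self-contained sketch with toy documents (invented for illustration): fitting a second vectorizer produces a different feature space, while transform reuses the training vocabulary, so the column counts match:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["a wonderful film with a great cast",
              "a dull plot and terrible acting"]
new_docs = ["joker is great"]

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_docs)      # learns the training vocabulary

# Wrong: a freshly fitted vectorizer only knows the words in new_docs
X_wrong = TfidfVectorizer().fit_transform(new_docs)
# Right: reuse the fitted vectorizer so the feature count matches training
X_right = tfidf.transform(new_docs)

print(X_train.shape[1], X_wrong.shape[1], X_right.shape[1])
```

A classifier trained on X_train would reject X_wrong for exactly the reason in the question's ValueError, but accepts X_right.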
Answered By: Osama AbuSitta

I faced the same problem while making a predictive system for fake news detection. It's a kind of silly mistake 🙂

Reason:

  • We recreate a TfidfVectorizer() [tfidf2 in your case] when testing new input.
  • But we should not do this, because we already fitted the vectorizer [tfidf in your case] on our training content (X) with fit_transform(). Hence, the model was trained with that vectorizer's dimensions.
  • Hence, we should transform our new input_data (string) with the already fitted vectorizer for prediction, not with another vectorizer.

Solution:

  • Don't create a new vectorizer; transform your input_data with the already fitted tfidf.

Your code:

tfidf2 = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None, tokenizer=fill, use_idf=True, norm='l2', smooth_idf=True)  
y = df.sentiment.values
Xjoker = tfidf2.fit_transform(jokerData)
yhat = Clf.predict(Xjoker)

Altered code:

y = df.sentiment.values
Xjoker = tfidf.transform(jokerData)
yhat = Clf.predict(Xjoker)
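As a side note not raised in the question: in practice the fitted vectorizer is often persisted together with the classifier, so the same vocabulary is available at inference time. A sketch using joblib, with hypothetical toy data and a temporary file path:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data (not from the question)
docs = ["great movie", "terrible movie", "great acting", "terrible plot"]
labels = [1, 0, 1, 0]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

# Persist vectorizer and model together so they stay in sync
path = os.path.join(tempfile.mkdtemp(), "sentiment.joblib")
joblib.dump((tfidf, clf), path)

# Later, at inference time: load both and reuse the fitted vectorizer
tfidf_loaded, clf_loaded = joblib.load(path)
pred = clf_loaded.predict(tfidf_loaded.transform(["great plot"]))
```

Saving the pair as one artifact avoids ever refitting a vectorizer on new data by accident.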

Hope this will help you.

Answered By: Sujitha