A problem with the user input during the random forest classifier demonstration

Question:

I got over 90% accuracy with the Random Forest classifier, but I am worried that the other algorithms give much lower results:
[Image: a table with the results]
But this is not my main concern. The problem is that when I predict on user-supplied inputs, every prediction is wrong. The columns of the user input are in the same order as the columns of the training dataset.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

model = RandomForestClassifier()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
acc = accuracy_score(y_test, prediction)   # output: 0.91

X_test_user = df_user_compounds_1.to_numpy()
user_input_predictions_1 = model.predict(X_test_user)
user_input_predictions_1
# output:   array([0, 0, 0, 0, 0], dtype=int64)
# expected: array([1, 1, 1, 1, 1], dtype=int64)

Does anyone have any idea why this is happening?

The dataset is preprocessed: it has no missing values, no duplicates, and no negative values; it was balanced with RandomOverSampler and scaled with MinMaxScaler; and it contains 11 features and 7K rows.

Asked By: Mariya Ivanova

||

Answers:

First of all, it is normal for different algorithms to give different accuracy scores.

Secondly, there are several possible reasons for the wrong predictions in your case:

  1. You scaled the training data but not df_user_compounds_1.
  2. Your model might be overfitted.
  3. The training dataset was preprocessed differently from df_user_compounds_1.
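
The first and third points can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data is synthetic, and the names X_raw, scaler and user_raw do not come from the question. The idea is only that the scaler fitted on the training data should also be applied to the user input before calling predict.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data: two features in raw units (0..1000).
X_raw = rng.uniform(0, 1000, size=(500, 2))
# Hypothetical labelling rule: class 1 only when both features are below 500.
y = ((X_raw[:, 0] < 500) & (X_raw[:, 1] < 500)).astype(int)

# Fit the scaler on the training data and keep it for later reuse.
scaler = MinMaxScaler().fit(X_raw)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(scaler.transform(X_raw), y)

# User rows in raw units; by the rule above they all belong to class 1.
user_raw = np.array([[100.0, 300.0], [200.0, 100.0], [50.0, 10.0]])

# Mistake: raw values are far outside the [0, 1] range the model was trained on.
unscaled_predictions = model.predict(user_raw)

# Fix: transform the user input with the scaler fitted on the training data.
scaled_predictions = model.predict(scaler.transform(user_raw))

print("without scaling:", unscaled_predictions)
print("with scaling:   ", scaled_predictions)
```

Note that the user input is passed through scaler.transform, not fit_transform: fitting a new scaler on a handful of user rows would map them to a different range than the one the model saw during training.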

Thirdly, comparing a single accuracy score per algorithm is not a sound way to choose a model. Use k-fold cross-validation and hyperparameter tuning.
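
As a rough sketch of what that could look like with scikit-learn: the example below uses a synthetic dataset from make_classification (an assumption, standing in for the real 11-feature data) and a small, arbitrary parameter grid. Putting the scaler inside a Pipeline is a design choice made here so that it is refitted on each training fold only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real dataset (11 features, as in the question).
X, y = make_classification(
    n_samples=400, n_features=11, n_informative=5, random_state=0
)

# Scaling lives inside the pipeline, so each fold fits its own scaler.
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("forest", RandomForestClassifier(random_state=0)),
])

# 5-fold stratified cross-validation instead of a single train/test split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Hyperparameter tuning over a small illustrative grid (values are arbitrary).
param_grid = {
    "forest__n_estimators": [50, 100],
    "forest__max_depth": [None, 5],
}
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV accuracy: %.3f" % search.best_score_)
```

The same cross-validation object can be reused to compare other classifiers on equal terms by swapping the "forest" step of the pipeline.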

Answered By: Elvin Jafarov