A problem with the user input during the random forest classifier demonstration
Question:
I got over 90% accuracy with the Random Forest classifier, but I am worried that the rest of the algorithms give much lower results:
A table with the results
But this is not the main concern. The problem is that when I used user inputs, the prediction was 100 percent wrong. The column order of the user input matches the column order of the training data set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

model = RandomForestClassifier()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
acc = accuracy_score(y_test, prediction)  # output: 0.91

X_test_user = df_user_compounds_1.to_numpy()
user_input_predictions_1 = model.predict(X_test_user)
user_input_predictions_1  # output: array([0, 0, 0, 0, 0], dtype=int64), but it should be: array([1, 1, 1, 1, 1], dtype=int64)
Does anyone have any idea why this is happening?
The dataset is preprocessed: no missing values, no duplicates, no negative values. It was balanced with RandomOverSampler and scaled with MinMaxScaler, and it contains 11 features and about 7K rows.
Answers:
First of all, it is okay that different algorithms give different accuracy rates.
Secondly, in your case there might be several reasons:
- You scaled your training data but not df_user_compounds_1
- Your model might be overfitted
- The training dataset was preprocessed differently than df_user_compounds_1
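The first and third points usually come down to the same mistake: fitting a fresh scaler on the user data instead of reusing the scaler fitted on the training data. A minimal sketch of the difference (the data here is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training data only
X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

# New/user data must go through the SAME fitted scaler.
# Calling fit_transform on the user data rescales it to its own
# min/max and silently shifts every feature the model sees.
X_user = np.array([[5.0, 20.0]])
X_user_scaled = scaler.transform(X_user)             # correct: [[0.5, 0.5]]
X_user_wrong = MinMaxScaler().fit_transform(X_user)  # wrong:   [[0.0, 0.0]]
```

If the user input was never scaled at all, or was scaled with a new scaler like this, the model receives values on a completely different range than it was trained on, and predictions can be consistently wrong.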
Thirdly, this is not how you should approach choosing a model. You should use K-Fold cross-validation and hyperparameter tuning.
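A short sketch of what that can look like with scikit-learn, using synthetic data as a stand-in for the 11-feature dataset and an illustrative (not prescriptive) parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the dataset in the question
X, y = make_classification(n_samples=700, n_features=11, random_state=0)

# 5-fold stratified cross-validation combined with a small grid search
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)   # best combination found by the grid search
print(search.best_score_)    # mean cross-validated accuracy of that model
```

The mean cross-validated score is a far more reliable basis for comparing algorithms than a single train/test split.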