A problem with the user input during the random forest classifier demonstration
Question:
I got over 90% accuracy with the Random Forest classifier, but I am worried that the rest of the algorithms give much lower results:
A table with the results
But this is not the main concern. The problem is that when I used user inputs, the prediction was 100 percent wrong. The column order of the user input matches the column order of the training data set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

model = RandomForestClassifier()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
acc = accuracy_score(y_test, prediction)  # output: 0.91

X_test_user = df_user_compounds_1.to_numpy()
user_input_predictions_1 = model.predict(X_test_user)
user_input_predictions_1  # output: array([0, 0, 0, 0, 0], dtype=int64), but it should be: array([1, 1, 1, 1, 1], dtype=int64)
Does anyone have any idea why this is happening?
The dataset is preprocessed: no missing values, no duplicates, no negative values. It was balanced with RandomOverSampler and scaled with MinMaxScaler, and it contains 11 features and about 7K rows.
Answers:
First of all, it is okay that different algorithms give different accuracy rates.
Secondly, in your case there might be several reasons:
- You scaled your training data but not df_user_compounds_1
- Your model might be overfitted
- The training dataset was preprocessed differently than df_user_compounds_1
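The first and third points usually come down to the same mistake: fitting a fresh scaler on the user data instead of reusing the scaler fitted on the training data. A minimal sketch of the difference (the data here is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training data only
X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

# New/user data must go through the SAME fitted scaler.
# Calling fit_transform on the user data rescales it to its own
# min/max and silently shifts every feature the model sees.
X_user = np.array([[5.0, 20.0]])
X_user_scaled = scaler.transform(X_user)             # correct: [[0.5, 0.5]]
X_user_wrong = MinMaxScaler().fit_transform(X_user)  # wrong:   [[0.0, 0.0]]
```

If the user input was never scaled at all, or was scaled with a new scaler like this, the model receives values on a completely different range than it was trained on, and predictions can be consistently wrong.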
Thirdly, this is not how you should approach choosing a model. You should use K-Fold cross-validation and hyperparameter tuning.
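A short sketch of what that can look like with scikit-learn, using synthetic data as a stand-in for the 11-feature dataset and an illustrative (not prescriptive) parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the dataset in the question
X, y = make_classification(n_samples=700, n_features=11, random_state=0)

# 5-fold stratified cross-validation combined with a small grid search
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)   # best combination found by the grid search
print(search.best_score_)    # mean cross-validated accuracy of that model
```

The mean cross-validated score is a far more reliable basis for comparing algorithms than a single train/test split.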