How can I optimize the KNN, GNB and SVC sklearn algorithms to reduce execution time?
Question:
I'm currently evaluating which classifier performs best on a movie review sentiment analysis task. So far I have evaluated Logistic Regression, Linear Regression, Random Forest and Decision Tree, but I also want to consider KNN, GNB and SVC models. The problem is that each run of those algorithms (particularly KNN) takes a very long time. Even using RandomizedSearchCV for KNN with 10 iterations, I have to wait about an hour. Here are some snippets:
KNN Classifier
# K-Nearest Neighbors -> large execution time
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import confusion_matrix

knn = KNeighborsClassifier()
k_range = list(range(1, 50))
options = ['uniform', 'distance']
param_grid = dict(n_neighbors=k_range, weights=options)
rand_knn = RandomizedSearchCV(knn, param_grid, cv=10, scoring='accuracy',
                              n_iter=10, random_state=0)
rand_knn.fit(x_train_bow, y_train)
print(rand_knn.best_score_)
print(rand_knn.best_params_)

# predict with the best estimator before computing metrics
y_pred_knn = rand_knn.predict(x_test_bow)
confm_knn = confusion_matrix(y_test, y_pred_knn)
print_confm(confm_knn)
print("=============K NEAREST NEIGHBORS============")
print_metrics(y_test, y_pred_knn)
print("============================================")
I waited about 85 minutes for the code above to finish, but it never did and I had to kill the execution. To get any result at all, I tried to choose the best k manually with a for loop, but each iteration still takes 12–17 minutes.
from sklearn.metrics import accuracy_score

def testing_k_neighbors(x_train_bow, y_train, x_test_bow, y_test):
    accuracy_hist = []
    for i in range(1, 21):
        knn = KNeighborsClassifier(n_neighbors=i)
        knn.fit(x_train_bow, y_train)
        yi_pred_knn = knn.predict(x_test_bow)
        acc_i = accuracy_score(y_test, yi_pred_knn)
        accuracy_hist.append(acc_i)
        print(f"K: {i}, accuracy: {acc_i}")
    print(accuracy_hist)
output:
K: 1, accuracy: 0.7384384634613782
K: 2, accuracy: 0.7435213732188984
K: 3, accuracy: 0.7574368802599784
K: 4, accuracy: 0.7678526789434214
K: 5, accuracy: 0.7681859845012916
K: 6, accuracy: 0.7745187901008249
K: 7, accuracy: 0.7729355887009416
K: 8, accuracy: 0.7774352137321889
K: 9, accuracy: 0.7742688109324223
K: 10, accuracy: 0.7810182484792934
K: 11, accuracy: 0.7776851929005916
K: 12, accuracy: 0.7854345471210732
K: 13, accuracy: 0.783101408215982
K: 14, accuracy: 0.7866844429630864
K: 15, accuracy: 0.784934588784268
K: 16, accuracy: 0.78860094992084
K: 17, accuracy: 0.7873510540788268
K: 18, accuracy: 0.7893508874260479
K: 19, accuracy: 0.7856011999000083
K: 20, accuracy: 0.7916006999416715
SVC and GNB also take a similarly long time to produce any result:
# Support Vector Machine -> large execution time
# from sklearn.svm import SVC
# svc = SVC(C=100, kernel='linear', random_state=123)
# svc.fit(x_train_bow, y_train)
# y_pred_svc = svc.predict(x_test_bow)
# print("=============SUPPORT VECTOR MACHINE============")
# print_metrics(y_test, y_pred_svc)
# print("============================================")

# Gaussian Naive Bayes (does not accept sparse input, hence .toarray())
from sklearn.naive_bayes import GaussianNB

gnbc = GaussianNB()
gnbc.fit(x_train_bow.toarray(), y_train)
y_pred_gnbc = gnbc.predict(x_test_bow.toarray())
print("=============GAUSSIAN NAIVE BAYES============")
print_metrics(y_test, y_pred_gnbc)
print("============================================")
Is there any way to tune my code to reduce execution time while maintaining or improving model performance? I'm looking to tune my code prioritizing both efficiency and performance.
Answers:
I tried your code and printed x_train_bow:
<28000x122447 sparse matrix of type '<class 'numpy.float64'>'
with 2796291 stored elements in Compressed Sparse Row format>
You have 122447 columns because TfidfVectorizer kept the full vocabulary. This is a dimensionality problem, and it is why these models take so long. There is no model-side fix (KNN, SVC, trees): you need to reduce the dimensionality. Extract only the relevant words and then apply TfidfVectorizer.
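As a rough sketch of that idea (assuming the raw review texts live in variables such as train_texts and test_texts, which are not shown in the question), you can cap the vocabulary with TfidfVectorizer's max_features, min_df and max_df parameters, and optionally project the result onto a few hundred dense components with TruncatedSVD; the values below are illustrative, not tuned:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Cap the vocabulary instead of keeping all ~122k terms
# (max_features, min_df and max_df are illustrative values to tune)
vectorizer = TfidfVectorizer(max_features=20000, min_df=5, max_df=0.8,
                             stop_words='english')
x_train_bow = vectorizer.fit_transform(train_texts)  # fit on the training set only
x_test_bow = vectorizer.transform(test_texts)

# Optional: project the sparse TF-IDF matrix onto a few hundred dense
# components; distance computations for KNN become much cheaper, and
# GaussianNB no longer needs .toarray() on a huge matrix
svd = TruncatedSVD(n_components=300, random_state=0)
x_train_red = svd.fit_transform(x_train_bow)
x_test_red = svd.transform(x_test_bow)

With the reduced matrices, the KNeighborsClassifier and GaussianNB code from the question should run in a small fraction of the time. Passing n_jobs=-1 to RandomizedSearchCV will also parallelize the cross-validation folds, and LinearSVC is usually far faster than SVC(kernel='linear') on high-dimensional sparse text.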