Why is scikit-learn SVM.SVC() extremely slow?

Question:

I tried to use an SVM classifier to train on a dataset with about 100k samples, but I found it to be extremely slow; even after two hours there was no response. With a dataset of around 1k samples, I get the result immediately. I also tried SGDClassifier and naive Bayes, which are quite fast and gave results within a couple of minutes. Could you explain this phenomenon?

Asked By: C. Gary


Answers:

General remarks about SVM-learning

SVM training with a nonlinear kernel, which is the default in sklearn's SVC, scales approximately as O(n_samples^2 * n_features) (an approximation given by one of sklearn's devs). This applies to the SMO algorithm used within libsvm, which is the core solver in sklearn for this type of problem.

This changes considerably when no kernel is used and one switches to sklearn.svm.LinearSVC (based on liblinear) or sklearn.linear_model.SGDClassifier, both of which scale much better with the number of samples.
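For illustration, here is a minimal sketch of those linear alternatives (the synthetic dataset via make_classification and its sizes are just assumptions for the example):

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

# Synthetic data roughly the size of the problematic dataset
X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)

# liblinear-based linear SVM: no kernel, scales far better with n_samples
linear_svm = LinearSVC(dual=False).fit(X, y)

# SGD with hinge loss is also a linear SVM, trained one sample at a time
sgd_svm = SGDClassifier(loss="hinge").fit(X, y)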

Going back to the kernelized SVC, we can do some math to approximate the time difference between 1k and 100k samples:

1k samples:   1,000^2   = 1,000,000 steps          = time X
100k samples: 100,000^2 = 10,000,000,000 steps     = time X * 10,000 !!!

This is only an approximation, and the real behaviour can be somewhat better or worse (e.g. setting the kernel cache size trades memory for speed gains)!
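As a small illustration of that cache-size trade-off (the 1000 MB value is just an example; the sklearn default is 200 MB):

from sklearn.svm import SVC

# A larger kernel cache (in MB) can speed up training at the cost of memory,
# but it does not change the quadratic scaling in n_samples.
clf = SVC(kernel="rbf", cache_size=1000)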

Scikit-learn specific remarks

The situation can be even more complex because of all the nice stuff scikit-learn does for us behind the scenes. The above holds for the classic 2-class SVM. If you happen to be learning from multi-class data, scikit-learn automatically decomposes the problem into multiple binary ones (one-vs-one inside SVC, one-vs-rest for LinearSVC), as the core SVM algorithm only supports two classes. Read scikit-learn's docs to understand this part.
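A rough sketch of what that decomposition looks like when done explicitly (the 3-class toy data and the explicit OneVsRestClassifier wrapper are only for illustration):

from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=2_000, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)

# SVC handles multi-class internally via one-vs-one: k*(k-1)/2 binary SVMs
ovo = SVC().fit(X, y)

# The same problem decomposed explicitly into k one-vs-rest binary SVMs
ovr = OneVsRestClassifier(SVC()).fit(X, y)

Either way, several binary SVMs are trained instead of one, which multiplies the cost discussed above.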

The same warning applies to producing probabilities: SVMs do not naturally output probabilities for their final predictions. To get them (activated via the probability parameter), scikit-learn runs a heavy internal cross-validation procedure called Platt scaling, which also takes a lot of time!
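A sketch of the difference (probability=True triggers the internal cross-validated Platt scaling; decision_function avoids it entirely):

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

# Slower: fits extra internal cross-validation folds for Platt scaling
clf_proba = SVC(probability=True).fit(X, y)
probs = clf_proba.predict_proba(X[:5])

# Faster: raw decision values, no probability calibration
clf_plain = SVC().fit(X, y)
scores = clf_plain.decision_function(X[:5])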

Scikit-learn documentation

Because sklearn has some of the best documentation around, there is usually a section in the docs explaining exactly this kind of behaviour:

[Screenshot of the scikit-learn SVC documentation describing its fit-time complexity]

Answered By: sascha

If you are using an Intel CPU, Intel provides a solution for this.
The Intel Extension for Scikit-learn offers a way to accelerate existing scikit-learn code. The acceleration is achieved through patching: replacing the stock scikit-learn algorithms with optimized versions provided by the extension.
Follow these steps:

First, install the Intel extension package for scikit-learn:

pip install scikit-learn-intelex

Now add the following lines at the top of your program:

from sklearnex import patch_sklearn 

patch_sklearn()

Now run the program; it will be much faster than before.
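Putting it together, a minimal sketch (the dataset is a placeholder; per the linked docs, patch_sklearn() should be called before importing the scikit-learn estimators you want accelerated):

from sklearnex import patch_sklearn
patch_sklearn()  # replaces supported estimators with optimized versions

# Import sklearn estimators only after patching so the accelerated
# implementations are picked up
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
clf = SVC(kernel="rbf").fit(X, y)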

You can read more about it at the following link:
https://intel.github.io/scikit-learn-intelex/

Answered By: AsadMajeed