Python scikit learn pca.explained_variance_ratio_ cutoff

Question:

When choosing the number of principal components (k), we choose k to be the smallest value so that for example, 99% of variance, is retained.

However, in the Python Scikit learn, I am not 100% sure pca.explained_variance_ratio_ = 0.99 is equal to “99% of variance is retained”? Could anyone enlighten? Thanks.

  • The Python Scikit learn PCA manual is here

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA

Asked By: Chubaka

||

Answers:

Yes, you are nearly right. The pca.explained_variance_ratio_ parameter returns a vector of the variance explained by each dimension. Thus pca.explained_variance_ratio_[i] gives the variance explained solely by the i+1st dimension.

You probably want to do pca.explained_variance_ratio_.cumsum(). That will return a vector x such that x[i] returns the cumulative variance explained by the first i+1 dimensions.

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(0)
my_matrix = np.random.randn(20, 5)

my_model = PCA(n_components=5)
my_model.fit_transform(my_matrix)

print my_model.explained_variance_
print my_model.explained_variance_ratio_
print my_model.explained_variance_ratio_.cumsum()

[ 1.50756565  1.29374452  0.97042041  0.61712667  0.31529082]
[ 0.32047581  0.27502207  0.20629036  0.13118776  0.067024  ]
[ 0.32047581  0.59549787  0.80178824  0.932976    1.        ]

So in my random toy data, if I picked k=4 I would retain 93.3% of the variance.

Answered By: Curt F.

Although this question is older than 2 years i want to provide an update on this.
I wanted to do the same and it looks like sklearn now provides this feature out of the box.

As stated in the docs

if 0 < n_components < 1 and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components

So the code required is now

my_model = PCA(n_components=0.99, svd_solver='full')
my_model.fit_transform(my_matrix)
Answered By: Yannic Klem

This worked for me with even less typing in the PCA section.
The rest is added for convenience. Only ‘data’ needs to be defined in an earlier stage.

import sklearn as sl
from sklearn.preprocessing import StandardScaler as ss
from sklearn.decomposition import PCA 

st = ss().fit_transform(data)
pca = PCA(0.80)
pc = pca.fit_transform(st) # << to retain the components in an object
pc

#pca.explained_variance_ratio_
print ( "Components = ", pca.n_components_ , ";nTotal explained variance = ",
      round(pca.explained_variance_ratio_.sum(),5)  )
Answered By: Julian
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.