Python scikit learn pca.explained_variance_ratio_ cutoff

Question

When choosing the number of principal components (k), we choose k to be the smallest value such that, for example, 99% of the variance is retained.

However, in Python's scikit-learn, I am not 100% sure whether pca.explained_variance_ratio_ = 0.99 means the same as “99% of variance is retained”. Could anyone enlighten me? Thanks.

The scikit-learn PCA documentation is here:

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA

Asked By: Chubaka


Answer 1

Yes, you are nearly right. pca.explained_variance_ratio_ is an attribute of the fitted model (not a parameter): a vector holding the fraction of the total variance explained by each component. Thus pca.explained_variance_ratio_[i] gives the fraction of variance explained solely by the (i+1)-th component.

You probably want to do pca.explained_variance_ratio_.cumsum(). That will return a vector x such that x[i] is the cumulative fraction of variance explained by the first i+1 components.

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(0)
my_matrix = np.random.randn(20, 5)

my_model = PCA(n_components=5)
my_model.fit_transform(my_matrix)

print(my_model.explained_variance_)
print(my_model.explained_variance_ratio_)
print(my_model.explained_variance_ratio_.cumsum())

[ 1.50756565  1.29374452  0.97042041  0.61712667  0.31529082]
[ 0.32047581  0.27502207  0.20629036  0.13118776  0.067024  ]
[ 0.32047581  0.59549787  0.80178824  0.932976    1.        ]

So in my random toy data, if I picked k=4 I would retain 93.3% of the variance.
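
To pick k automatically from the cumulative sum, one option is to look for the first position where it reaches the target. This is only a minimal sketch: the 0.90 target and the names X, target and k are chosen for illustration and are not part of the original answer.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data, same shape as in the example above (illustrative only)
rng = np.random.RandomState(0)
X = rng.randn(20, 5)

# Fit with all components so the cumulative ratios cover the full variance
pca = PCA(n_components=5).fit(X)
cumulative = pca.explained_variance_ratio_.cumsum()

target = 0.90  # assumed threshold: retain at least 90% of the variance

# searchsorted returns the first index i with cumulative[i] >= target;
# add 1 to turn the index into a component count, capped at the total
k = min(int(np.searchsorted(cumulative, target)) + 1, len(cumulative))

print("smallest k reaching the target:", k)
print("variance retained with k components:", cumulative[k - 1])
```

The cap only matters if floating-point rounding leaves the last cumulative value slightly below a target very close to 1.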

Answered By: Curt F.

Answer 2

Although this question is more than two years old, I want to provide an update on it.
I wanted to do the same thing, and it looks like sklearn now provides this feature out of the box.

As stated in the docs

if 0 < n_components < 1 and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components

So the code required is now

my_model = PCA(n_components=0.99, svd_solver='full')
my_model.fit_transform(my_matrix)
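
After fitting with a fractional n_components, the number of components that were actually kept can be read from the n_components_ attribute. The sketch below is a hedged, self-contained check on toy data; the 0.90 threshold and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data (illustrative only)
rng = np.random.RandomState(0)
X = rng.randn(20, 5)

threshold = 0.90  # assumed fraction of variance to retain

# Let PCA choose the number of components from the variance threshold
auto_model = PCA(n_components=threshold, svd_solver='full').fit(X)

# Reference: fit all components and inspect the cumulative ratios by hand
full_model = PCA(n_components=5, svd_solver='full').fit(X)
cumulative = full_model.explained_variance_ratio_.cumsum()
manual_k = int(np.argmax(cumulative > threshold)) + 1

print("components kept by PCA:", auto_model.n_components_)
print("components chosen by hand:", manual_k)
print("variance retained:", auto_model.explained_variance_ratio_.sum())
```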

Answered By: Yannic Klem

Answer 3

This worked for me with even less typing in the PCA section.
The rest is added for convenience. Only ‘data’ needs to be defined in an earlier stage.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize the features before PCA ('data' must be defined beforehand)
st = StandardScaler().fit_transform(data)
pca = PCA(0.80)  # keep enough components to explain at least 80% of the variance
pc = pca.fit_transform(st)  # << to retain the components in an object
pc

# pca.explained_variance_ratio_
print("Components = ", pca.n_components_, ";\nTotal explained variance = ",
      round(pca.explained_variance_ratio_.sum(), 5))

Answered By: Julian
