Why do I need to indicate the number of components to be kept in Principal Component Analysis?

Question:

I found that PCA seems to require specifying up front the number of components to keep, as in the following code:

model = pca(n_components=3, normalize=True)

Is there any way to specify only the fraction of variance to be explained and let the algorithm pick the most important components?

Asked By: baddy


Answers:

You don’t necessarily need to specify the number of components in advance. You can fit all components and then keep only the smallest number of leading components that together explain a given fraction of the total variance. See the code below for an example.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import make_spd_matrix
from sklearn.preprocessing import StandardScaler

# generate the data
np.random.seed(100)

N = 1000  # number of samples
K = 10    # number of features

mean = np.zeros(K)
cov = make_spd_matrix(K)
X = np.random.multivariate_normal(mean, cov, N)
print(X.shape)
# (1000, 10)

# rescale the data
scaler = StandardScaler()
X = scaler.fit_transform(X)

# perform the PCA
pca = PCA(n_components=None)
pca.fit(X)

# find the smallest number of components whose cumulative
# explained variance ratio is at least p (e.g. 0.80, i.e. 80%)
p = 0.80
n_components = 1 + np.argmax(np.cumsum(pca.explained_variance_ratio_) >= p)
print(n_components)
# 6

# extract the values of the selected components
Z = pca.transform(X)[:, :n_components]
print(Z.shape)
# (1000, 6)
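
As a possible shortcut, scikit-learn's PCA also accepts a float between 0 and 1 for n_components when svd_solver="full", in which case it chooses the number of components needed to exceed that fraction of explained variance. The sketch below assumes synthetic data similar to the example above. The names rng, pca_auto, pca_full and n_manual are illustrative only, and because the data is generated differently the component count may not match the one printed above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import make_spd_matrix
from sklearn.preprocessing import StandardScaler

# synthetic data (assumption: same shape as the example above)
rng = np.random.default_rng(100)
N, K = 1000, 10
cov = make_spd_matrix(K, random_state=100)
X = rng.multivariate_normal(np.zeros(K), cov, N)
X = StandardScaler().fit_transform(X)

# pass the target variance fraction directly as n_components
p = 0.80
pca_auto = PCA(n_components=p, svd_solver="full")
Z_auto = pca_auto.fit_transform(X)

# number of components chosen by scikit-learn
print(pca_auto.n_components_)
# the output has one column per retained component
print(Z_auto.shape)

# cross-check against the manual cumulative-sum approach
pca_full = PCA(n_components=None, svd_solver="full").fit(X)
n_manual = 1 + np.argmax(np.cumsum(pca_full.explained_variance_ratio_) >= p)
print(n_manual)
```

The number of components selected is available as pca_auto.n_components_ after fitting. Both approaches should agree except in the edge case where the cumulative ratio equals p exactly, since scikit-learn requires the explained variance to be strictly greater than the requested fraction.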
Answered By: Flavia Giammarino