How to interpret explained variance ratio plot from principal components of PCA with sklearn

Question:

I am trying to use PCA to reduce the dimensionality of my data before applying K-means clustering.

In the dataset below, I have points, assists and rebounds columns. According to the plot, the first three principal components contain the highest % of the variance.

Is there a way to tell what each of the first 3 components corresponds to? For example, does it correspond to the column "points" in the year 2021? Or, what is the correct way to interpret this plot?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df_full = pd.DataFrame({
    'year': [2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021,
             2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022],
    'store': ['store1', 'store2', 'store3', 'store4', 'store5',
              'store6', 'store7', 'store8', 'store9', 'store10',
              'store1', 'store2', 'store3', 'store4', 'store5',
              'store6', 'store7', 'store8', 'store9', 'store10'],
    'points': [18, 33, 19, 14, 14, 11, 20, 28, 30, 31,
               35, 33, 29, 25, 25, 27, 29, 30, 19, 23],
    'assists': [3, 3, 4, 5, 4, 7, 8, 7, 6, 9, 12, 14,
                5, 9, 4, 3, 4, 12, 15, 11],
    'rebounds': [15, 14, 14, 10, 8, 14, 13, 9, 5, 4,
                 11, 6, 5, 5, 3, 8, 12, 7, 6, 5]})

# create pivot table for clustering analysis
df = df_full.pivot(index=['store'],columns=['year']).reset_index()

# set index for clustering analysis
df.set_index(['store'], inplace=True)

# standardize the df
scaled_df = StandardScaler().fit_transform(df)

# check what the best number of components is
pca = PCA(n_components=6)
pca.fit(scaled_df)

var = pca.explained_variance_ratio_
feature = range(pca.n_components_)
plt.bar(feature, var)
plt.xlabel('PCA features')
plt.ylabel('variance %')
plt.xticks(feature)

# use the optimized number of components, which is 3 in this case
pca = PCA(n_components=3)
pca.fit(scaled_df)

df_transform = pca.transform(scaled_df)

# apply k-means clustering to the PCA-transformed data
kmeans = KMeans(init="random", n_clusters=3, n_init=10, random_state=1)
kmeans.fit(df_transform)

[Bar plot of the explained variance ratio for each principal component]

pca.components_
array([[ 0.35130535, -0.50070859,  0.29700875,  0.26964774, -0.59032958,
        -0.34126579],
       [ 0.56248993,  0.3654443 , -0.30040924,  0.65744874,  0.09535234,
         0.13593718],
       [ 0.18181155,  0.05593549,  0.69079082, -0.0149547 , -0.00170045,
         0.69742189]])
Asked By: user032020


Answers:

The PCs are just a new set of axes.

Imagine x, y axes and a cloud of points shaped like an ellipse with the long axis of the ellipse at 45 degrees to the original axes.

The first PC will correspond to the long axis. The second and subsequent ones will each be at right angles to the previous ones. The second PC will be the minor ellipse axis. And so on. Each PC will contain successively less of the variance.

So if the first PC contains a high percentage of the variance we are talking about a long major axis and a short minor axis, i.e. a long thin cigar. If the first PC is only just more than the second we are looking at something more circular or disk-like in shape.

So the PCs don’t necessarily correspond to any of the original features.
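
To see this geometry in code, here is a minimal sketch (the point cloud, rotation angle and random seed are made up for the illustration): it fits PCA to an elongated cloud tilted at 45 degrees and prints the axes it finds.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# an elongated cloud: long axis roughly 5x the short axis, rotated by 45 degrees
points = rng.normal(size=(500, 2)) * [5, 1]
angle = np.deg2rad(45)
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
points = points @ rotation.T

pca = PCA(n_components=2).fit(points)

print(pca.explained_variance_ratio_)  # first PC carries almost all the variance
print(pca.components_)                # rows point along the tilted ellipse axes

The first row of components_ comes out as roughly ±(0.71, 0.71), i.e. the long diagonal axis, and its explained variance ratio is close to 1: the "long thin cigar" case.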

Answered By: Mark Setchell

Wikipedia summarizes the definition of PCA pretty well, in my opinion:

PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

As you can see from this definition, the principal components are just vectors in your original feature space.
As an example, let’s say you have measured the mental state of some people in two dimensions: "happiness" (dimension 1) and "boredom" (dimension 2). Now you do PCA and get the vector (0.6, 0.4) as your first principal component. You can interpret this as your sample of people being best described by a mental state that combines "happiness" with 60% relevance and "boredom" with 40% relevance, if you only want one dimension to describe them.

In sklearn, you can get the principal components via pca.components_.
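
Applied to the data in the question, one readable way to look at the loadings (a sketch, assuming the pivoted df and the fitted 3-component pca from the question are still in scope) is to wrap pca.components_ in a DataFrame labelled with the pivoted column names:

# rows = principal components, columns = the pivoted features,
# e.g. ('points', 2021), ('points', 2022), ('assists', 2021), ...
loadings = pd.DataFrame(pca.components_,
                        columns=df.columns,
                        index=['PC1', 'PC2', 'PC3'])
print(loadings)

The entries with the largest absolute values in a row tell you which original columns (and years) dominate that component, so a component is usually a mix of the original features rather than a single one.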

Mathematically there are different interpretations and derivations. From a statistical point of view, the principal components are the eigenvectors of the covariance matrix of your random variables (feature vectors). In linear algebra, PCA is described via the singular value decomposition (SVD), which is also the common method for computing it.
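
As a rough, self-contained check of the eigenvector statement (the random data and variable names are made up for the sketch; component signs can differ between the two methods, which is expected):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

pca = PCA().fit(X)

# eigen-decomposition of the covariance matrix of the centered data
cov = np.cov(X - X.mean(axis=0), rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reverse to match PCA's ordering
eigenvectors = eigenvectors[:, ::-1].T

# same directions up to sign
print(np.allclose(np.abs(pca.components_), np.abs(eigenvectors)))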

Answered By: tierriminator