Plot PCA loadings and loading labels in a biplot in sklearn (like R's autoplot)

Question:

I saw this tutorial in R w/ autoplot. They plotted the loadings and loading labels:

autoplot(prcomp(df), data = iris, colour = 'Species',
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)

[example output: R autoplot biplot of the iris data, with loading arrows and labels]
https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html

I prefer Python 3 w/ matplotlib, scikit-learn, and pandas for my data analysis. However, I don't know how to add the loading vectors and their labels to my plot.

How can you plot these vectors w/ matplotlib?

I've been reading "Recovering features names of explained_variance_ratio_ in PCA with sklearn" but haven't figured it out yet.

Here's how I plot it in Python:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid' : False})

%matplotlib inline
np.random.seed(0)

# Iris dataset
DF_data = pd.DataFrame(load_iris().data, 
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       columns = load_iris().feature_names)

Se_targets = pd.Series(load_iris().target, 
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])], 
                       name = "Species")

# Scaling mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data), 
                           index = DF_data.index,
                           columns = DF_data.columns)

# Sklearn for Principal Component Analysis
# Dims
m = DF_standard.shape[1]
K = 2

# PCA (How I tend to set it up)
Mod_PCA = decomposition.PCA(n_components=m)
DF_PCA = pd.DataFrame(Mod_PCA.fit_transform(DF_standard), 
                      columns=["PC%d" % k for k in range(1,m + 1)]).iloc[:,:K]
# Color classes
color_list = [{0:"r",1:"g",2:"b"}[x] for x in Se_targets]

fig, ax = plt.subplots()
ax.scatter(x=DF_PCA["PC1"], y=DF_PCA["PC2"], color=color_list)

[current output: scatter plot of PC1 vs PC2, colored by species, without the loading vectors]

Asked By: O.rka


Answers:


You could do something like the following by creating a biplot function.

Nice article here: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f

In this example I am using the iris data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target

# In general, it's a good idea to scale the data prior to PCA.
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
pca = PCA()
x_new = pca.fit_transform(X)

def myplot(score, coeff, labels=None):
    # score: sample coordinates in PC space (n_samples x 2)
    # coeff: loadings (n_features x 2), one arrow per original feature
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    # scale the scores by their range so they fit inside the [-1, 1] plot limits
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley, c=y)  # y (class labels) comes from the enclosing scope
    # draw one arrow per feature, with its label slightly beyond the tip
    for i in range(n):
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        if labels is None:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1),
                     color='g', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i],
                     color='g', ha='center', va='center')
    plt.xlim(-1, 1)
    plt.ylim(-1, 1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()

# Call the function. Use only the first 2 PCs.
myplot(x_new[:, 0:2], np.transpose(pca.components_[0:2, :]))
plt.show()

[resulting biplot: iris samples with the four loading arrows]
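
If you want the arrows labeled with the actual feature names instead of Var1–Var4, you can pass them through the function's labels argument (iris.feature_names is already loaded above):

# Same call as before, but label the arrows with the original feature names
myplot(x_new[:, 0:2], np.transpose(pca.components_[0:2, :]), labels=iris.feature_names)
plt.show()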


Answered By: seralouk

Try the ‘pca’ library. It can plot the explained variance and create a biplot.

pip install pca

from pca import pca

# Initialize: reduce the data to the number of components that explains 95% of the variance.
model = pca(n_components=0.95)

# Or reduce the data towards 2 PCs
model = pca(n_components=2)

# Fit transform
results = model.fit_transform(X)

# Plot explained variance
fig, ax = model.plot()

# Scatter first 2 PCs
fig, ax = model.scatter()

# Make biplot with the number of features
fig, ax = model.biplot(n_feat=4)
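
The snippet above assumes X already holds your data matrix; here is a minimal sketch of one way to set it up, assuming the standardized iris data used in the other answers:

# Example setup for X: the standardized iris data (assumption for illustration)
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
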
Answered By: erdogant

I'd like to add a generic solution to this topic. After doing some careful research on existing solutions (including Python and R) and datasets (especially biological "omic" datasets), I came up with the following Python solution, which has two main advantages:

  1. It scales the scores (samples) and loadings (features) properly so they are visually pleasing in a single plot. It should be pointed out that the relative scales of samples and features do not have any mathematical meaning (though their relative directions do); however, making them similarly sized facilitates exploration.

  2. It can handle high-dimensional data, where there are so many features that one can only afford to visualize the top few (arrows) that drive the most variance of the data. This involves explicit selection and scaling of the top features.

An example of final output (using "Moving Pictures", a classical dataset in my research field):

[biplot of the "Moving Pictures" dataset]

Preparation:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

Basic example: displaying all features (arrows)

We will use the iris dataset (150 samples by 4 features).

# load data
iris = datasets.load_iris()
X = iris.data
y = iris.target
targets = iris.target_names
features = iris.feature_names

# standardization
X_scaled = StandardScaler().fit_transform(X)

# PCA
pca = PCA(n_components=2).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

# coordinates of samples (i.e., scores; let's take the first two axes)
scores = X_reduced[:, :2]

# coordinates of features (i.e., loadings; note the transpose)
loadings = pca.components_[:2].T

# proportions of variance explained by axes
pvars = pca.explained_variance_ratio_[:2] * 100

Here comes the critical part: Scale the features (arrows) properly to match the samples (points). The following code scales by the maximum absolute value of samples on each axis.

arrows = loadings * np.abs(scores).max(axis=0)

Another way, as discussed in seralouk's answer, is to scale by the range (max - min), but that will make the arrows larger than the points.

# arrows = loadings * np.ptp(scores, axis=0)

Then plot out the points and arrows:

plt.figure(figsize=(5, 5))

# samples as points
for i, name in enumerate(targets):
    plt.scatter(*zip(*scores[y == i]), label=name)
plt.legend(title='Species')

# empirical formula to determine arrow width
width = -0.0075 * np.min([np.subtract(*plt.xlim()), np.subtract(*plt.ylim())])

# features as arrows
for i, arrow in enumerate(arrows):
    plt.arrow(0, 0, *arrow, color='k', alpha=0.5, width=width, ec='none',
              length_includes_head=True)
    plt.text(*(arrow * 1.05), features[i],
             ha='center', va='center')

# axis labels
for i, axis in enumerate('xy'):
    getattr(plt, f'{axis}ticks')([])
    getattr(plt, f'{axis}label')(f'PC{i + 1} ({pvars[i]:.2f}%)')

[biplot of the iris data]

Compare the result with that of the R solutions below; you can see that they are quite consistent. (Note: it is known that PCA axes from R and scikit-learn can have opposite signs. You can flip one of them to make the directions consistent.)

iris.pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
biplot(iris.pca, scale = 0)

[base-R biplot of the iris data]

library(ggfortify)
autoplot(iris.pca, data = iris, colour = 'Species',
         loadings = TRUE, loadings.colour = 'dimgrey',
         loadings.label = TRUE, loadings.label.colour = 'black')

[ggfortify autoplot biplot of the iris data]

Advanced example: displaying only the top k features

We will use the digits dataset (1797 samples by 64 features).

# load data
digits = datasets.load_digits()
X = digits.data
y = digits.target
targets = digits.target_names
features = digits.feature_names

# analysis
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

# results
scores = X_reduced[:, :2]
loadings = pca.components_[:2].T
pvars = pca.explained_variance_ratio_[:2] * 100

Now, we will find the top k features that best explain our data.

k = 8

Method 1: Find the top k arrows that appear longest (i.e., furthest from the origin) in the visible plot:

  • Note that every feature's loading vector has the same length in the full m-by-m loading matrix (m is the total number of features), but the lengths differ in the 2-by-m matrix of the visible PCs; the following code finds the longest ones in the latter.
  • This method is consistent with the microbiome program QIIME 2 / EMPeror (source code).

tops = (loadings ** 2).sum(axis=1).argsort()[-k:]
arrows = loadings[tops]

Method 2: Find the top k features that drive the most variance in the visible PCs:

# tops = (loadings * pvars).sum(axis=1).argsort()[-k:]
# arrows = loadings[tops]

Now there is a new problem: when the number of features is large, the top k features are only a very small portion of all features, so their contribution to the data variance is tiny and they will look tiny in the plot.

To solve this, I came up with the following code. The rationale is: across all features, the sum of squared loadings is always 1 per PC. When keeping only a small subset of features, we should scale them up so that their sum of squared loadings is also 1. This method is tested and working, and generates nice plots.

arrows /= np.sqrt((arrows ** 2).sum(axis=0))
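
As a quick sanity check of that rationale (a small sketch using the variables defined above): each row of components_ is a unit vector, so per PC the squared loadings over all features sum to 1, and after the rescaling the same holds for the selected top-k arrows:

# Per PC, the squared loadings over all 64 features sum to 1 ...
print((pca.components_ ** 2).sum(axis=1))  # ~[1. 1.]
# ... and after rescaling, so do the squared loadings of the k selected arrows
print((arrows ** 2).sum(axis=0))           # ~[1. 1.]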

Then we will scale the arrows to match the samples (as discussed above):

arrows *= np.abs(scores).max(axis=0)

Now we can render the biplot:

plt.figure(figsize=(5, 5))
for i, name in enumerate(targets):
    plt.scatter(*zip(*scores[y == i]), label=name, s=8, alpha=0.5)
plt.legend(title='Class')

width = -0.005 * np.min([np.subtract(*plt.xlim()), np.subtract(*plt.ylim())])
for i, arrow in zip(tops, arrows):
    plt.arrow(0, 0, *arrow, color='k', alpha=0.75, width=width, ec='none',
              length_includes_head=True)
    plt.text(*(arrow * 1.15), features[i], ha='center', va='center')

for i, axis in enumerate('xy'):
    getattr(plt, f'{axis}ticks')([])
    getattr(plt, f'{axis}label')(f'PC{i + 1} ({pvars[i]:.2f}%)')

[biplot of the digits data showing the top k features]

I hope my answer is useful to the community.

Answered By: Qiyun Zhu

To plot the PCA loadings and loading labels in a biplot using matplotlib and scikit-learn, you can follow these steps:

After fitting the PCA model using decomposition.PCA, retrieve the loadings from the components_ attribute of the model. Its rows are the principal components and its columns are the original features, so taking the first two rows and transposing gives one 2-D loading vector per original feature.

Create a list of labels from the names of the original features.

Normalize the loadings so that each loading vector has length 1. This makes the loadings easier to visualize on the biplot.

Plot the loadings as arrows from the origin using pyplot.quiver, passing the PC1 and PC2 components of each loading vector as the arrows' x and y components (angles='xy', scale_units='xy', scale=1 keeps the arrows in data coordinates).

Label the axes with the principal components using pyplot.xlabel and pyplot.ylabel.

Add the loading labels to the biplot using pyplot.text. You can position each label at the tip of the corresponding loading vector, and set the font size and color with the fontsize and color parameters.

Plot the data points on the biplot using pyplot.scatter.

Add a legend to the plot using pyplot.legend to distinguish the different species.

Here is the complete code with the above modifications applied:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid' : False})

%matplotlib inline
np.random.seed(0)

# Iris dataset
DF_data = pd.DataFrame(load_iris().data, 
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       columns = load_iris().feature_names)

Se_targets = pd.Series(load_iris().target, 
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])], 
                       name = "Species")

# Scaling mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data), 
                           index = DF_data.index,
                           columns = DF_data.columns)

# Sklearn for Principal Component Analysis
# Dims
m = DF_standard.shape[1]
K = 2

# PCA (How I tend to set it up)
Mod_PCA = decomposition.PCA(n_components=m)
DF_PCA = pd.DataFrame(Mod_PCA.fit_transform(DF_standard), 
                      columns=["PC%d" % k for k in range(1,m + 1)]).iloc[:,:K]

# Retrieve the loadings: one row per original feature, columns = PC1, PC2
loadings = Mod_PCA.components_[:K].T
feature_labels = DF_data.columns

# Normalize so that each loading vector has length 1
loadings = loadings / np.linalg.norm(loadings, axis=1)[:, np.newaxis]
# (Optionally rescale the arrows, e.g. by DF_PCA.abs().max().values,
#  to make them comparable in size to the scores; see the answers above.)

# Plot the loadings as arrows from the origin
origin = np.zeros(loadings.shape[0])
plt.quiver(origin, origin, loadings[:, 0], loadings[:, 1],
           angles='xy', scale_units='xy', scale=1, color='blue')

# Label the axes
plt.xlabel("PC1")
plt.ylabel("PC2")

# Add the loading labels at the tips of the arrows
for i, txt in enumerate(feature_labels):
    plt.text(loadings[i, 0], loadings[i, 1], txt, fontsize=12, color='blue')

# Plot the data points, one scatter call per species so plt.legend works
color_map = {0: "r", 1: "g", 2: "b"}
for class_idx, name in enumerate(load_iris().target_names):
    mask = (Se_targets == class_idx).values
    plt.scatter(DF_PCA.loc[mask, "PC1"], DF_PCA.loc[mask, "PC2"],
                color=color_map[class_idx], label=name)
plt.legend(title="Species")