How to find best fit line using PCA in Python?

Question:

I have this code that finds the best-fit line using SVD, but I want to know how to do the same using PCA. All I can find online is that the two are related, not how they are related or how they differ in code when doing the exact same thing.

I just want to see how PCA does this differently than SVD.

import numpy as np

points = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Centering the points
mean_point = np.mean(points, axis=0)
centered_points = points - mean_point

# Calculating the covariance matrix
covariance_matrix = np.cov(centered_points, rowvar=False)

# Performing the SVD
U, s, V = np.linalg.svd(covariance_matrix)

# The first column of U is the direction of the best fit line
# (the line itself passes through mean_point)
direction = U[:, 0]

print("Best fit line direction:", direction)
Asked By: Joan Venge


Answers:

tl;dr: SVD and PCA are often used as synonyms.

While Singular Value Decomposition refers to a mathematical operation (a matrix factorization, strictly speaking), Principal Component Analysis is more loosely defined as a method for finding mutually orthogonal directions of maximum variance in the high-dimensional space where the data lives. This can be achieved by performing an SVD on the (centered) data matrix.
In some scientific communities the two terms are used interchangeably.
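
To make the relation concrete, here is a short derivation sketch. The symbols are my own notation, not something from the question: X is the centered n-by-d data matrix, C its sample covariance matrix.

```latex
X = U \Sigma V^{\top}
\quad\Longrightarrow\quad
C = \frac{1}{n-1} X^{\top} X
  = \frac{1}{n-1} V \Sigma^{\top} U^{\top} U \Sigma V^{\top}
  = V \left( \frac{\Sigma^{\top} \Sigma}{n-1} \right) V^{\top}
```

So the right singular vectors of X (the columns of V) are the eigenvectors of C, i.e. the principal directions, and the eigenvalues of C are \lambda_i = \sigma_i^2 / (n-1). Since C is symmetric and positive semi-definite, an SVD of C itself coincides with its eigendecomposition, which is why applying np.linalg.svd to the covariance matrix also yields the principal directions.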

Regarding your question:
The line

U, s, V = np.linalg.svd(covariance_matrix)

performs an SVD, while the lines

# Centering the points
mean_point = np.mean(points, axis=0)
centered_points = points - mean_point

# Calculating the covariance matrix
covariance_matrix = np.cov(centered_points, rowvar=False)

# Performing the SVD
U, s, V = np.linalg.svd(covariance_matrix)

perform a PCA, since PCA is usually computed on a zero-mean data matrix.
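
For comparison, here is a minimal sketch of the two routes side by side, using the points from the question. The names `direction_pca`, `direction_svd` and the helper `align_sign` are mine, introduced only for illustration. Route 1 is the textbook PCA recipe (eigendecomposition of the covariance matrix); route 2 applies the SVD directly to the centered data, without forming the covariance matrix at all.

```python
import numpy as np

points = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])

# Center the data: both routes assume zero-mean columns.
mean_point = points.mean(axis=0)
centered = points - mean_point

# Route 1 (PCA as usually described): eigendecomposition of the covariance matrix.
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
direction_pca = eigvecs[:, np.argmax(eigvals)]

# Route 2: SVD of the centered data matrix itself.
# np.linalg.svd returns V transposed, so the principal directions are the ROWS of Vt.
_, s, Vt = np.linalg.svd(centered, full_matrices=False)
direction_svd = Vt[0]


def align_sign(v):
    """Hypothetical helper: the sign of a principal direction is arbitrary,
    so flip it such that its largest-magnitude entry is positive."""
    return v if v[np.argmax(np.abs(v))] > 0 else -v


direction_pca = align_sign(direction_pca)
direction_svd = align_sign(direction_svd)

print("PCA (eig of covariance):", direction_pca)
print("SVD of centered data:   ", direction_svd)

# Eigenvalues of the covariance matrix equal squared singular values / (n - 1).
print("largest eigenvalue:", eigvals.max(), " s[0]**2 / (n - 1):", s[0] ** 2 / (len(points) - 1))
```

Both routes give the same direction up to sign. The best fit line is then the set of points `mean_point + t * direction` for real `t`.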

Answered By: hschoell

I would not use PCA as a way to generate a best fit.

It’s for looking at a high dimensional data space and figuring out which dimensions are most significant. It tells you how much of the total variance is captured by each dimension. I would run it as a pre-processing step prior to fitting data.

I would use the output of PCA to restrict my fit to only use the most significant dimensions. I’d further divide them into train and test sets and then perform the fit using my algorithm of choice (e.g. linear or logistic regression).
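
A rough sketch of that workflow, using NumPy only. Everything here is an assumption made for illustration: the synthetic data, the choice of `k = 2` components, the 150/50 split, and ordinary least squares as the "algorithm of choice".

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples with 5 features that mostly vary along 2 latent directions.
n_samples, n_features, n_latent = 200, 5, 2
latent = rng.normal(size=(n_samples, n_latent))
mixing = rng.normal(size=(n_latent, n_features))
X = latent @ mixing + 0.01 * rng.normal(size=(n_samples, n_features))
y = latent @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=n_samples)

# Split into train and test sets.
idx = rng.permutation(n_samples)
train, test = idx[:150], idx[150:]

# PCA on the training data only: center, then SVD.
mean = X[train].mean(axis=0)
_, s, Vt = np.linalg.svd(X[train] - mean, full_matrices=False)
explained = s**2 / np.sum(s**2)  # fraction of total variance per component
print("explained variance ratio:", np.round(explained, 4))

# Keep the k most significant directions (k = 2 is an assumption for this data).
k = 2
components = Vt[:k]

# Project both sets onto the retained components, reusing the training mean.
Z_train = (X[train] - mean) @ components.T
Z_test = (X[test] - mean) @ components.T

# Fit a linear regression (with intercept) in the reduced space.
A_train = np.column_stack([Z_train, np.ones(len(train))])
coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)

# Evaluate on the held-out test set.
pred = np.column_stack([Z_test, np.ones(len(test))]) @ coef
r2 = 1 - np.sum((y[test] - pred) ** 2) / np.sum((y[test] - y[test].mean()) ** 2)
print("test R^2:", r2)
```

Fitting the PCA on the training rows only, and reusing its mean and components for the test rows, keeps information from the test set out of the pre-processing step.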

Answered By: duffymo