How to find the best fit line using PCA in Python?
Question:
I have this code that does it using SVD. But I want to know how to do the same using PCA. Online, all I can find is that they are related, but I'm not sure how they are related and how they differ in code when doing the exact same thing.
I just want to see how PCA does this differently than SVD.
import numpy as np
points = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Centering the points
mean_point = np.mean(points, axis=0)
centered_points = points - mean_point
# Calculating the covariance matrix
covariance_matrix = np.cov(centered_points, rowvar=False)
# Performing the SVD
U, s, V = np.linalg.svd(covariance_matrix)
# The first column of U is the direction of the best fit line
direction = U[:, 0]
print("Best fit line:", direction)
Answers:
tl;dr: SVD and PCA are often used as synonyms.
While Singular Value Decomposition refers to the mathematical operation (a factorization, strictly speaking), Principal Component Analysis is more loosely defined as a method for finding linearly independent directions of maximum variability in the high-dimensional space where the data lives. This can be achieved by performing an SVD on the data matrix.
Both terms are used as synonyms depending on the scientific community.
Regarding your question:
The line
U, s, V = np.linalg.svd(covariance_matrix)
performs an SVD, while the lines
# Centering the points
mean_point = np.mean(points, axis=0)
centered_points = points - mean_point
# Calculating the covariance matrix
covariance_matrix = np.cov(centered_points, rowvar=False)
# Performing the SVD
U, s, V = np.linalg.svd(covariance_matrix)
perform a PCA, since a zero-mean data matrix is usually used.
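To see the two routes side by side: classic PCA eigendecomposes the covariance matrix, while the SVD can be applied directly to the centered data with no covariance matrix at all. A minimal sketch, using only NumPy and the same `points` as above, showing that both yield the same best-fit direction up to sign:

```python
import numpy as np

points = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
centered = points - points.mean(axis=0)

# Route 1: classic PCA -- eigendecomposition of the covariance matrix
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
pc_eig = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue

# Route 2: SVD applied directly to the centered data (no covariance matrix)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
pc_svd = Vt[0]                           # first right singular vector

# Both are unit vectors along the same line, so their dot product is +/-1
print(np.allclose(np.abs(pc_eig @ pc_svd), 1.0))  # True
```

Note that when the SVD is run on the centered data itself (rather than on the covariance matrix, as in your code), the principal directions come out of the *right* singular vectors `Vt`, and the singular values relate to the eigenvalues of the covariance matrix via `s**2 / (n - 1)`.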
I would not use PCA as a way to generate a best fit.
It’s for looking at a high dimensional data space and figuring out which dimensions are most significant. It tells you how much of the total variance is captured by each dimension. I would run it as a pre-processing step prior to fitting data.
I would use the output of PCA to restrict my fit to only use the most significant dimensions. I’d further divide them into train and test sets and then perform the fit using my algorithm of choice (e.g. linear or logistic regression).
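As a sketch of this pre-processing use: the eigenvalues of the covariance matrix, normalized to sum to one, give the fraction of total variance each component captures. The dataset below is hypothetical, constructed so that one direction dominates; in that case nearly all the variance lands on the first component, and a fit could reasonably be restricted to that one dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples in 3-D that mostly vary along one direction,
# plus a small amount of isotropic noise
data = rng.normal(size=(200, 1)) @ np.array([[3.0, 2.0, 0.5]]) \
       + rng.normal(scale=0.1, size=(200, 3))

centered = data - data.mean(axis=0)
# Eigenvalues of the covariance matrix, sorted in descending order
eigvals = np.linalg.eigvalsh(np.cov(centered, rowvar=False))[::-1]
explained = eigvals / eigvals.sum()
print("Variance explained per component:", explained)
# The first entry is close to 1, so the remaining dimensions
# contribute almost nothing and could be dropped before fitting
```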