# Mean centering before PCA

## Question:

I am unsure whether this kind of question (related to PCA) is acceptable here or not.

As is well known, it is suggested to **mean-center** the data before PCA. I have 2 different classes (**each class has different participants**). My aim is to distinguish and classify those 2 classes. Still, I am not sure whether mean centering should be applied to the whole data set or to each class.

Is it better to do it separately? (If so, should the other preprocessing steps also be done separately?) Or does that not make any sense?

## Answers:

PCA is, more or less by definition, an SVD applied to mean-centered data.

Depending on the implementation (if you use a PCA from a library), the centering is applied automatically, e.g. in sklearn, because, as said, the data has to be centered by definition.

So with sklearn you do not need this preprocessing step, and in general you apply it over your whole data set.

PCA is unsupervised and can be used to find a representation that is more meaningful and representative for your classes *afterwards*. So you need all your samples in the same feature space, via the same PCA.

In short: you do the PCA once, over your whole (training) data, and you must center over your whole (training) data. Libraries like sklearn do the centering automatically.
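As a quick sanity check (a minimal sketch using numpy and sklearn, with synthetic data), you can verify that sklearn's `PCA` centers internally: its scores match an SVD of the manually centered data, up to the sign of each component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(100, 4))  # synthetic data with a nonzero mean

# sklearn's PCA centers the data internally ...
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# ... so it agrees with an SVD of the manually centered matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
manual_scores = Xc @ Vt[:2].T

# Agreement up to the (arbitrary) sign of each component:
for j in range(2):
    assert (np.allclose(scores[:, j], manual_scores[:, j])
            or np.allclose(scores[:, j], -manual_scores[:, j]))
```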

PCA is just a rotation, optionally accompanied by a projection onto a lower-dimensional space. It finds the axes of maximal variance (which happen to be the principal axes of inertia of your point cloud) and then rotates the dataset to align those axes with your coordinate system. You get to decide how many such axes you'd like to retain, which means the rotation is then followed by a projection onto the first `k` axes of greatest variance, with `k` the dimensionality of the representation space you'll have chosen.

With this in mind, again like for calculating axes of inertia, you could decide to look for such axes through the center of mass of your cloud (the mean), or through any arbitrary origin of choice. In the former case, you would mean-center your data, and in the latter you may translate the data to any arbitrary point, with the result being to diminish the importance of the intrinsic cloud shape itself and increase the importance of the distance between the center of mass and the arbitrary point. Thus, in practice, **you would almost always center your data**.

You may also want to **standardize** your data (center and divide by standard deviation so as to make variance 1 on each coordinate), or even whiten your data.
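For example (a sketch assuming sklearn, with synthetic data), standardization is typically chained with PCA in a pipeline; passing `whiten=True` to `PCA` would additionally rescale each component's scores to unit variance.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic features with very different scales.
X = rng.normal(size=(100, 5)) * np.array([1.0, 10.0, 0.1, 1.0, 1.0])

# Standardize (center + unit variance per feature), then project with PCA.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
X2 = pipe.fit_transform(X)

assert X2.shape == (100, 2)
```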

In any case, **you will want to apply the same transformations to the entire dataset, not class by class**. If you were to apply the transformation class by class, whatever distance exists between the centers of gravity of each class would be reduced to 0, and you would likely observe a collapsed representation with the two classes overlapping. This may be interesting if you want to observe the intrinsic shape of each class, but then you would also apply PCA separately to each class.
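A quick numeric illustration of this collapse (a sketch with synthetic Gaussian classes): centering each class separately sends both class means to the origin, so the distance between them vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated synthetic classes.
X0 = rng.normal(loc=0.0, size=(50, 3))
X1 = rng.normal(loc=10.0, size=(50, 3))

# Distance between class means before per-class centering:
gap_before = np.linalg.norm(X0.mean(axis=0) - X1.mean(axis=0))

# Centering each class separately moves both class means to the origin,
# so the between-class distance collapses to (numerically) zero.
X0c = X0 - X0.mean(axis=0)
X1c = X1 - X1.mean(axis=0)
gap_after = np.linalg.norm(X0c.mean(axis=0) - X1c.mean(axis=0))

assert gap_before > 10
assert gap_after < 1e-9
```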

Please note that PCA *may* make it easier for you to **visualize** the two classes (without guarantees, if the data are truly n-dimensional without much of a lower-dimensional embedding). But **in no circumstances would it make it easier to discriminate between the two**. If anything, PCA will reduce how discriminable your classes are, and it is often the case that the projection will intermingle classes (increase ambiguity) that are otherwise quite distinct and e.g. separable by a simple hypersurface.

The k-nearest-neighbor classifier will help you distinguish between the two classes. Also try t-SNE to visualize the classes in higher-dimensional data.

```
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier


def pca_classifier(X, y, n_components=2, n_neighbors=1):
    """
    X: numpy array of shape (n_samples, n_features)
    y: numpy array of shape (n_samples, )
    n_components: int, number of components to keep
    n_neighbors: int, number of neighbors to use in the knn classifier
    """
    # 1. PCA (sklearn centers the data internally)
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X)
    # 2. KNN fitted on the projected data
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_pca, y)
    # 3. plot the first two principal components
    plt.figure(figsize=(8, 6))
    plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap=plt.cm.Set1, edgecolor='k')
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.title('PCA')
    plt.show()
    return knn
```