PCA For categorical features?

Question:

My understanding was that PCA can only be performed on continuous features. But while trying to understand the difference between one-hot encoding and label encoding, I came across a post at the following link:

When to use One Hot Encoding vs LabelEncoder vs DictVectorizor?

It states that one-hot encoding followed by PCA is a very good method, which basically means PCA is applied to categorical features.
I am therefore confused; please advise me on this.

Asked By: data_person


Answers:

PCA is a dimensionality reduction method that can be applied to any set of features. Here is an example using one-hot-encoded (i.e. categorical) data:

from sklearn.preprocessing import OneHotEncoder

# each distinct value in each of the 3 categorical columns becomes its own 0/1 indicator column
enc = OneHotEncoder()
X = enc.fit_transform([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]).toarray()

print(X)

> array([[ 1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.],
       [ 0.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,  0.]])


from sklearn.decomposition import PCA
pca = PCA(n_components=3)  # project the 9 one-hot columns onto 3 principal components
X_pca = pca.fit_transform(X)

print(X_pca)

> array([[-0.70710678,  0.79056942,  0.70710678],
       [ 1.14412281, -0.79056942,  0.43701602],
       [-1.14412281, -0.79056942, -0.43701602],
       [ 0.70710678,  0.79056942, -0.70710678]])
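
As a follow-up, you can also check how much of the total variance the retained components capture (the exact numbers depend on the data above):

print(pca.explained_variance_ratio_)   # fraction of the variance captured by each component
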
Answered By: Alex

Basically, PCA finds and eliminates less informative (duplicate) information in the feature set and reduces the dimension of the feature space. In other words, imagine an N-dimensional hyperspace: PCA finds the M (M < N) directions along which the data varies most. In this way the data can be represented as M-dimensional feature vectors. Mathematically, it is essentially an eigenvalue and eigenvector computation on the covariance matrix of the feature space.
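
As a rough illustration of that eigenvector view, here is a minimal numpy sketch on a made-up 0/1 matrix (scikit-learn's PCA computes the equivalent result via an SVD of the centered data):

import numpy as np

# Toy data: the procedure is the same whether the columns are one-hot indicators or continuous
X = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 0., 1.],
              [0., 1., 1., 0.]])

Xc = X - X.mean(axis=0)                  # center each feature
cov = np.cov(Xc, rowvar=False)           # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues/eigenvectors, ascending order
top_m = eigvecs[:, ::-1][:, :2]          # keep the M = 2 highest-variance directions
X_reduced = Xc @ top_m                   # data represented as M-dimensional vectors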

So, it is not important whether the features are continuous or not.

PCA is widely used in many applications, mostly for eliminating noisy, less informative data coming from sensors or other hardware before classification/recognition.

Edit:

Statistically speaking, categorical features can be seen as discrete random variables in the interval [0, 1]. The computations of the expectation E{X} and the variance E{(X - E{X})^2} are still valid and meaningful for discrete random variables, so I still stand by the applicability of PCA to categorical features.

Consider a case where you would like to predict whether it is going to rain on a given day or not. You have a categorical feature X, “Do I have to go to work on the given day”, with 1 for yes and 0 for no. Clearly weather conditions do not depend on our work schedule, so P(R|X) = P(R). Assuming 5 days of work every week, we have more 1s than 0s for X in our randomly collected dataset. PCA would probably lead to dropping this low-variance dimension in your feature representation.
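
A tiny numeric sketch of that intuition (made-up numbers, assuming roughly 5 work days out of every 7): the expectation and variance formulas above apply directly to the 0/1 feature, and the variance comes out low, so PCA, which ranks directions by variance, gives it little weight.

import numpy as np

# "Do I have to go to work today?" over two weeks: mostly 1s
x = np.array([1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0])

p = x.mean()                  # E{X} = 5/7, about 0.71
var = ((x - p) ** 2).mean()   # E{(X - E{X})^2} = p(1 - p), about 0.20, versus 0.25 for a balanced 0/1 feature
print(p, var)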

At the end of the day, PCA is for dimension reduction with minimal loss of information. Intuitively, we rely on the variance of the data along a given axis to measure its usefulness for the task. I don’t think there is any theoretical limitation to applying it to categorical features. Its practical value depends on the application and the data, which is also the case for continuous variables.

Answered By: Ockhius

I disagree with the others.

While you can use PCA on binary data (e.g. one-hot encoded data), that does not mean it is a good thing or that it will work very well.

PCA is designed for continuous variables. It is built around variance (= squared deviations): it keeps the directions of maximal variance, which is equivalent to minimizing the squared reconstruction error. The concept of squared deviations breaks down when you have binary variables.

So yes, you can use PCA. And yes, you get an output. It is even a least-squares output: it’s not as if PCA would segfault on such data. It works, but it is just much less meaningful than you’d want it to be; and supposedly less meaningful than, e.g., frequent pattern mining.
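
To make that concrete, here is a small sketch on made-up one-hot data: PCA runs and gives a least-squares answer, but the low-rank reconstruction is no longer made of clean 0/1 indicators, which is part of why the result is hard to interpret in terms of categories.

import numpy as np
from sklearn.decomposition import PCA

# One-hot encoding of a single 3-level categorical variable (toy data: a, a, a, b, b, c)
X = np.array([[1., 0., 0.],
              [1., 0., 0.],
              [1., 0., 0.],
              [0., 1., 0.],
              [0., 1., 0.],
              [0., 0., 1.]])

pca = PCA(n_components=1)
X_low = pca.fit_transform(X)
X_back = pca.inverse_transform(X_low)  # least-squares reconstruction from one component
print(np.round(X_back, 2))             # values are no longer exact 0/1 indicators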

MCA (multiple correspondence analysis) is a known technique for dimensionality reduction of categorical data. In R there are many packages for MCA, some of which even mix it with PCA in mixed-data contexts. In Python there is an mca library too. MCA applies mathematics similar to PCA; indeed, the French statisticians used to say that “data analysis is finding the correct matrix to diagonalize”.

http://gastonsanchez.com/visually-enforced/how-to/2012/10/13/MCA-in-R/
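
MCA is essentially correspondence analysis applied to the one-hot indicator matrix: instead of diagonalizing the raw covariance, you decompose a matrix of standardized residuals. Here is a minimal numpy sketch of that idea on made-up data (the R packages and the Python mca library mentioned above do this, with further corrections, for you):

import numpy as np

# Indicator (one-hot) matrix Z for two 2-level categorical variables (toy data)
Z = np.array([[1., 0., 0., 1.],
              [0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.]])

P = Z / Z.sum()         # correspondence matrix
r = P.sum(axis=1)       # row masses
c = P.sum(axis=0)       # column masses

# "The correct matrix to diagonalize": the standardized residuals
S = np.diag(r ** -0.5) @ (P - np.outer(r, c)) @ np.diag(c ** -0.5)
U, s, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * s) / np.sqrt(r)[:, None]   # principal coordinates of the rows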

Answered By: joscani

The following publication shows great and meaningful results when computing PCA on categorical variables treated as simplex vertices:

Niitsuma H., Okada T. (2005) Covariance and PCA for Categorical Variables. In: Ho T.B., Cheung D., Liu H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2005. Lecture Notes in Computer Science, vol 3518. Springer, Berlin, Heidelberg

https://doi.org/10.1007/11430919_61

It is available via https://arxiv.org/abs/0711.4452 (including as a PDF).

Answered By: Oleg Melnikov

I think PCA reduces variables by leveraging the linear relations between them.
If there is only one categorical variable, encoded as one-hot, there is no linear relation between the one-hot columns, so they can’t be reduced by PCA.

But if other variables exist, the one-hot columns may be representable as linear combinations of those other variables.

So maybe they can be reduced by PCA; it depends on the relations between the variables.

Answered By: NicolasLi

In this paper, the authors use PCA to combine categorical features of high cardinality. If I understood correctly, they first calculate the conditional probabilities of each target class. Then they choose a threshold hyperparameter and create a new binary variable from each conditional class probability for each categorical feature to be combined. PCA is performed to combine the new binary variables, with the number of retained components specified as a hyperparameter.
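
The link to the paper is not preserved here, so the details may differ, but a rough sketch of the pipeline as described above could look like this (the column names, data, and 0.5 threshold are all made up for illustration):

import pandas as pd
from sklearn.decomposition import PCA

# Toy data: one high-cardinality categorical feature and a binary target
df = pd.DataFrame({
    "city":   ["a", "a", "b", "b", "c", "c", "d", "d"],
    "target": [ 1,   0,   1,   1,   0,   0,   1,   0 ],
})

# 1) Conditional probability of each target class given the category
cond = pd.crosstab(df["city"], df["target"], normalize="index")

# 2) Threshold each conditional probability into a new binary variable
threshold = 0.5                        # hyperparameter
binary = (cond >= threshold).astype(float)

# 3) Map the binary variables back onto the rows and combine them with PCA
Z = binary.loc[df["city"]].to_numpy()
Z_pca = PCA(n_components=1).fit_transform(Z)   # number of components is a hyperparameter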

Answered By: michen00