Is correlation an important factor in unsupervised learning (clustering)?

Question:

I am working with a dataset of size (500, 33).

In particular, the dataset contains 9 features, say

[X_High, X_medium, X_low, Y_High, Y_medium, Y_low, Z_High, Z_medium, Z_low]

Both visually and from the correlation matrix, I observed that

[X_High, Y_High, Z_High], [X_medium, Y_medium, Z_medium], and [X_low, Y_low, Z_low] are highly correlated (coefficients above 0.85).
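For context, this is a minimal sketch of how such pairs can be found from a correlation matrix with pandas. The column names follow the question, but the values here are synthetic stand-ins, not the actual dataset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for three correlated "High" columns:
# one shared signal observed three times with noise.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
df = pd.DataFrame({
    "X_High": base,
    "Y_High": base + rng.normal(scale=0.3, size=500),
    "Z_High": base + rng.normal(scale=0.3, size=500),
})

corr = df.corr().abs()
# Keep only the upper triangle to skip the diagonal and duplicate pairs
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [(a, b, round(upper.loc[a, b], 2))
              for a in upper.index for b in upper.columns
              if upper.loc[a, b] > 0.85]  # NaN entries compare False
```

`high_pairs` then lists each correlated pair once with its coefficient.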

I would like to apply a clustering algorithm (say, k-means, GMM, or DBSCAN).

In that case,

Is it necessary to remove the correlated features for unsupervised learning?
Does removing or modifying the correlated features have any impact on the result?

Asked By: Mari


Answers:

My assumption here is that you're asking this question because, in linear modeling, highly collinear variables can cause issues.

The short answer is no, you don't need to remove highly correlated variables from clustering over collinearity concerns. Clustering doesn't rely on linear assumptions, so collinearity by itself won't cause problems.

That doesn't mean that using a bunch of highly correlated variables is a good thing. Your features may be redundant: you may be carrying more data than you need to find the same patterns. At your data size that's probably not an issue, but for large data you could apply PCA or another dimensionality-reduction technique to the correlated variables to reduce your computational overhead.
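As a sketch of that suggestion, PCA can be chained in front of a clusterer so that the correlated triples collapse into a few components before k-means runs. The data below is synthetic (three latent signals each observed three times, mimicking the question's 9-feature structure); the pipeline itself is standard scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Three latent signals, each seen three times with noise ->
# nine columns forming three highly correlated triples, like the question's data
latent = rng.normal(size=(500, 3))
X = np.hstack([latent + rng.normal(scale=0.2, size=(500, 3)) for _ in range(3)])

pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),  # keep enough components for 95% of the variance
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipe.fit_predict(X)
n_components = pipe.named_steps["pca"].n_components_
```

Because the nine columns carry only three underlying signals, PCA keeps far fewer than nine components, and k-means runs on that smaller representation.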

Answered By: Julian Drago

Feature removal in unsupervised learning is not a complicated matter: include the features you want to analyze and remove the ones you don't. Including too many features makes inference much more difficult. I can typically do inference on around 10 to 20 features at most; beyond that you have a mess of a diagram to explain to someone. If you don't need to do inference, you could consider adding more features, but it is still not advisable because it blows up your vector space.

An objective way to determine whether, say, 20 features versus 100 features improved your segmentation is to use supervised learning to validate the segments. This is one approach to validating an unsupervised method.
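One way to sketch that validation, on made-up data with an assumed group structure: fit the clusterer, then train a classifier to predict the cluster labels from the features. High cross-validated accuracy suggests the segments are at least well separated in feature space; the specific models here (k-means, random forest) are illustrative choices, not the only option:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Synthetic data with three genuinely separated groups in 9 dimensions
centers = rng.normal(scale=5, size=(3, 9))
X = np.vstack([c + rng.normal(size=(150, 9)) for c in centers])

# Step 1: unsupervised segmentation
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: supervised check -- can a classifier recover the segments?
scores = cross_val_score(RandomForestClassifier(random_state=0), X, labels, cv=5)
```

Comparing the mean score across feature sets (e.g. 20 vs. 100 features) gives a concrete number for the comparison the answer describes.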

Answered By: addi wei