Gaussian Mixture Model cross-validation

Question

I’d like to cross-validate my Gaussian mixture model. Currently I use sklearn’s cross_validate function as below.

clf = GaussianMixture(n_components=len(np.unique(y)), covariance_type='full')
cv_ortho = cross_validate(clf, parameters_train, y, cv=10, n_jobs=-1, scoring=scorer)

I see that cross_validate trains my classifier with y_train, making it a supervised classifier.

try:
    if y_train is None:
        estimator.fit(X_train, **fit_params)
    else:
        estimator.fit(X_train, y_train, **fit_params)

However, I want to cross-validate an unsupervised classifier, i.e. clf.fit(parameters_train). I understand that the classifier then assigns its own class labels. Since I have two distinct clusters (see image) and y, I can work out which assigned label corresponds to which true label, and then cross-validate. Is there a routine in sklearn which does this?

A routine similar to this example: https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html
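
For context on the fit(X_train, y_train) call quoted above: GaussianMixture.fit accepts a y argument only for API consistency and does not use it, so the fitted model is the same with or without labels. A minimal check, assuming synthetic two-cluster data from make_blobs as a stand-in for parameters_train and y (not the data from the question):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (an assumption; not the asker's parameters_train / y).
X, y = make_blobs(n_samples=300, centers=2, random_state=0)

# Same settings and seed; the only difference is whether y is passed to fit().
gmm_without_y = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
gmm_with_y = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X, y)

# If y were used during fitting, the estimated parameters would differ.
same_fit = np.allclose(gmm_without_y.means_, gmm_with_y.means_) and np.allclose(
    gmm_without_y.weights_, gmm_with_y.weights_
)
print(same_fit)
```

Under these assumptions the two fits should coincide, which suggests that y only matters at scoring time, not during training.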

Asked By: newkid


Answer 1

It seems that typical cross-validation neither makes sense for unsupervised learning nor has been used for it (see this question on Cross Validated Stack Exchange).

Why does it not make sense?

In the strict sense, cross-validation requires some ground truth: the “correct” labels or values against which the model's output is compared. This is typically denoted y in scikit-learn method signatures.
When you train in an unsupervised way, the very notion of the training being unsupervised means that there are no y labels: no true labels, no “ground truth”.
This point is also raised in this answer to a question on the evaluation of unsupervised learning (which is a broader topic than cross-validation).
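
That said, in the situation described in the question y is available, so the mapping step the asker describes can be written by hand. The following is only a sketch under stated assumptions, not a built-in sklearn routine: the data comes from make_blobs as a stand-in for the asker's features, cv_gmm_accuracy is a hypothetical helper name, and the component-to-class matching uses scipy's linear_sum_assignment on a contingency table built from the training fold.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the asker's data: two well-separated clusters.
X, y = make_blobs(n_samples=400, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=0)


def cv_gmm_accuracy(X, y, n_splits=5, seed=0):
    """Hypothetical helper: fit without labels, map components to classes, score."""
    classes = np.unique(y)
    n = len(classes)
    scores = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        gmm = GaussianMixture(n_components=n, covariance_type="full",
                              random_state=seed)
        gmm.fit(X[train_idx])  # labels are not used for fitting

        # Contingency table on the training fold:
        # rows = true classes, columns = mixture components.
        train_pred = gmm.predict(X[train_idx])
        table = np.array([[np.sum((y[train_idx] == c) & (train_pred == k))
                           for k in range(n)] for c in classes])

        # One-to-one matching that maximizes agreement (negate to minimize cost).
        rows, cols = linear_sum_assignment(-table)
        component_to_class = np.empty(n, dtype=classes.dtype)
        component_to_class[cols] = classes[rows]

        # Translate component indices on the held-out fold into class labels.
        y_pred = component_to_class[gmm.predict(X[test_idx])]
        scores.append(accuracy_score(y[test_idx], y_pred))
    return np.array(scores)


scores = cv_gmm_accuracy(X, y)
print(scores.mean())
```

The labels enter only when matching components to classes and when scoring, so the model itself stays unsupervised. The matching is learned on the training fold and applied to the held-out fold, which keeps the test labels out of the mapping step.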

Answered By: sophros
