Why does sklearn KMeans change my dataset after fitting?

Question:

I am using KMeans from sklearn to cluster the College.csv dataset, but after I fit the KMeans model, my dataset has changed! Before using KMeans, I standardize the numerical variables with StandardScaler and use OneHotEncoder to dummy-encode the categorical variable "Private".

My code is:

num_vars = data.columns[1:]
scaler = StandardScaler()
data[num_vars] = scaler.fit_transform(data[num_vars])

ohe = OneHotEncoder()
data["Private"] = ohe.fit_transform(data.Private.values.reshape(-1,1)).toarray()

km = KMeans(n_clusters = 6)
km.fit(data)

The dataset before using KMeans:
[image: DataFrame before fitting]

The dataset after using KMeans:
[image: DataFrame after fitting]

Asked By: Sara


Answers:

The data is the same but shifted over by one column. The Apps column did not exist before, and everything is shifted to the right.
It has something to do with your line
data[num_vars] = scaler.fit_transform(data[num_vars])
which effectively performs a nested selection, data[data.columns[1:]].

Instead, you can follow a method like this:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# data is a pandas DataFrame, so positional slicing needs .iloc
data.iloc[:, 1:] = sc.fit_transform(data.iloc[:, 1:])

Answered By: Raie

It appears that when you run km.fit(data), the .fit method modifies data in place by inserting a column that is the opposite of your one-hot encoded column. Also confusing is the fact that the "Terminal" column disappears.

[image: DataFrame after calling km.fit]

For now, you can use this workaround that copies your data:

data1 = data.copy()
km = KMeans(n_clusters = 6, n_init = 'auto')
km.fit(data1)

Edit: When you run km.fit, the first method that runs is km._validate_data, a validation step that modifies the DataFrame you pass.

For example, if I add the following to the end of your code:

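import numpy as np

# _validate_data is a private method and may change between scikit-learn versions
km._validate_data(
    data,
    accept_sparse="csr",
    dtype=[np.float64, np.float32],
    order="C",
    accept_large_sparse=False,
)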

Running this changes your data, but I don't know exactly why. It may have something to do with the data itself.
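
One way to investigate, as a rough sketch: compare the number of column labels with the width of the underlying array after each preprocessing step. The helper name columns_match_values and the demo values are made up for this illustration, and newer pandas versions may reject the problematic assignment with an error rather than produce a mismatched frame.

```python
import pandas as pd


def columns_match_values(frame: pd.DataFrame) -> bool:
    """Hypothetical helper: True when the label count equals the array width."""
    return len(frame.columns) == frame.to_numpy().shape[1]


# Stand-in data; the real College.csv columns are not reproduced here.
demo = pd.DataFrame({"Private": [1.0, 1.0, 0.0], "Apps": [0.5, -1.2, 0.7]})

# A healthy frame reports the same number of columns both ways.
print(columns_match_values(demo))
```

If this check fails right after a preprocessing step, that step (not KMeans) is where the extra column came from.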

Answered By: Derek O

There’s a subtle bug in the posted code. Let’s demonstrate it:

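import numpy as np
import pandas as pd

new_df = pd.DataFrame({"Private": ["Yes", "Yes", "No"]})

For this column, OneHotEncoder returns a two-column array (one column per category), something like this: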

new_data = np.array(
    [[0, 1],
     [0, 1],
     [1, 0]])

What happens if we assign our new (3, 2) array to new_df["Private"]?

>>> new_df["Private"] = new_data
>>> print(new_df)
   Private
0        0
1        0
2        1

Wait, where’d the other column go?

Uh oh, it’s still in there …

… but it’s invisible until we look at the actual values:

>>> print(new_df.values)
[[0 1]
 [0 1]
 [1 0]]

As @Derek hinted in his answer, KMeans has to validate the data, which usually converts a pandas DataFrame into its underlying array. When this happens, all your "columns" appear shifted to the right by one, because the two-column OneHotEncoder output left an invisible extra column in the DataFrame.
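
If you want to keep the manual approach, one possible fix (a sketch on the toy frame from this answer, not the only option) is to ask the encoder for a single column when the variable is binary, so the result fits in one DataFrame column. This assumes a scikit-learn version that supports drop="if_binary".

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

new_df = pd.DataFrame({"Private": ["Yes", "Yes", "No"]})

# drop="if_binary" keeps only one column for a two-category feature,
# so the encoder output has shape (3, 1) instead of (3, 2).
ohe = OneHotEncoder(drop="if_binary")
encoded = ohe.fit_transform(new_df[["Private"]]).toarray()

# Flatten to 1-D before assigning, so exactly one column is stored.
new_df["Private"] = encoded.ravel()

# The frame and its underlying array now agree on the number of columns.
print(new_df.shape, new_df.to_numpy().shape)
```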


Is there a better way?

Yep, use a pipeline!

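from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

pipe = make_pipeline(
    ColumnTransformer(
        transformers=[
            # Map "No" -> 0 and "Yes" -> 1 in a single column
            ("ordinal", OrdinalEncoder(categories=[["No", "Yes"]]), ["Private"]),
        ],
        # Standardize every other column
        remainder=StandardScaler(),
    ),
    KMeans(n_clusters=6),
)

# data is the original, unmodified DataFrame
out = pipe.fit(data)
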
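
A minimal usage sketch on made-up data, since College.csv is not reproduced here. The column values, n_clusters=2, n_init=10 and random_state=0 are assumptions chosen only so the example runs on six rows:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Made-up stand-in for the College data
toy = pd.DataFrame({
    "Private": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "Apps": [1660, 2186, 1428, 417, 193, 587],
    "Accept": [1232, 1924, 1097, 349, 146, 479],
})
original = toy.copy()

toy_pipe = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("ordinal", OrdinalEncoder(categories=[["No", "Yes"]]), ["Private"]),
        ],
        remainder=StandardScaler(),
    ),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
toy_pipe.fit(toy)

labels = toy_pipe[-1].labels_          # one cluster label per row
features = toy_pipe[0].transform(toy)  # encoded and scaled array, shape (6, 3)

# The input DataFrame is left untouched by the pipeline.
print(toy.equals(original))
```

Because the encoding and scaling happen inside the pipeline, the DataFrame itself is never overwritten.
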
Answered By: Alexander L. Hayes