How to make KMeans Clustering more Meaningful for Titanic Data?

Question:

I’m running this code.

import pandas as pd
titanic = pd.read_csv('titanic.csv')
titanic.head()


# Import required modules
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = titanic['Name']

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# initialize kmeans with 20 centroids
kmeans = KMeans(n_clusters=20, random_state=42)
# fit the model
kmeans.fit(X)
# store cluster labels in a variable
clusters = kmeans.labels_
titanic['kmeans'] = clusters
titanic.tail()

Finally...

from sklearn.decomposition import PCA

documents = titanic['Name']

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# initialize PCA with 2 components
pca = PCA(n_components=2, random_state=42)
# pass our X to the pca and store the reduced vectors into pca_vecs
pca_vecs = pca.fit_transform(X.toarray())

# save our two dimensions into x0 and x1
x0 = pca_vecs[:, 0]
x1 = pca_vecs[:, 1]

# assign clusters and pca vectors to our dataframe 
titanic['cluster'] = clusters
titanic['x0'] = x0
titanic['x1'] = x1

titanic.head()

import plotly.express as px

fig = px.scatter(titanic, x='x0', y='x1', color='kmeans', text='Name')
fig.show()

Here is the plot that I see.

[Image: scatter plot of x0 vs. x1, colored by cluster, with passenger names drawn as text on every point]

It seems to work, but how can I make the text labels less crowded and/or remove outliers so that the chart is easier to read? I assume the clustering itself is correct, since I’m not doing anything special here, but is there a way to make the clusters more meaningful?

Data is sourced from here.

https://www.kaggle.com/competitions/titanic/data?select=test.csv

Asked By: ASH


Answers:

You could show each name only when the mouse hovers over its data point. Currently, you’re drawing every passenger’s name next to its point with the text parameter. Because many points lie close together, the names end up on top of each other. You can fix this by changing the plot code to something like:

fig = px.scatter(titanic, x='x0', y='x1', color='kmeans', hover_name='Name')
fig.update_layout(title_text="KMeans Clustering of Titanic Passengers",
                  title_font_size=30)
fig.show()

The only thing that changed in the code above is the parameter that carries the ‘Name’ information: hover_name instead of text. Here’s how it looks after this change:

[Image: the same scatter plot without text labels on the points]

Now, the names are only shown when you hover your mouse over the data point.

Complete code

Here’s your complete code, considering the above-mentioned change:

# Import required modules
import pandas as pd
import plotly.express as px
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Where the downloaded data is located on our machine
train_data_filepath = '/Users/erikingwersen/Downloads/train.csv'
test_data_filepath = '/Users/erikingwersen/Downloads/test.csv'

# Read the train data from downloaded file
titanic = pd.read_csv(train_data_filepath)

documents = titanic['Name']

X = TfidfVectorizer(stop_words='english').fit_transform(documents)

# Initialize kmeans with 20 centroids
kmeans = KMeans(n_clusters=20, random_state=42)

# Fit the model
kmeans.fit(X)

# Store cluster labels in a variable
clusters = kmeans.labels_
titanic['kmeans'] = clusters

# Initialize PCA with 2 components
pca = PCA(n_components=2, random_state=42)

# Pass our X to the pca and store the reduced vectors into pca_vecs
pca_vecs = pca.fit_transform(X.toarray())

# Save our two dimensions into x0 and x1
x0, x1 = pca_vecs[:, 0], pca_vecs[:, 1]

# Assign clusters and pca vectors to our dataframe
# (separate assignments keep the integer dtype of the cluster labels)
titanic['cluster'] = clusters
titanic['x0'] = x0
titanic['x1'] = x1


titanic.head()

fig = px.scatter(titanic, x='x0', y='x1', color='kmeans', hover_name='Name')
fig.update_layout(title_text="KMeans Clustering of Titanic Passengers",
                  title_font_size=30)
fig.show()

Answered By: Ingwersen_erik