How to Show "Fuzziness" in Fuzzy Clustering Plot Python

Question:

I have a normally distributed 2D dataset, and I am using the fcmeans library to perform fuzzy c-means clustering. I am able to plot the clusters with the red point indicating the cluster center. However, I need to show a gradient where the fuzziness occurs. I am not sure how to implement this in Python and I haven’t been able to find anything like it online.

from fcmeans import FCM

data = pd.read_csv("my_data.csv")

model = FCM(n_clusters=2) 
model.fit(data) 

cntrs = my_model.centers
hard_prediction_labels = my_model.predict(data)
soft_prediction_labels = my_model.soft_predict(data)

plt.scatter(data[:, 0], data[:, 1], c=hard_prediction_labels, s=30);

I believe that my mistake is coming from the fact that my labels are 1 or 0; however, I’m not sure how to define it in a way that would allow me to determine which points are borderline. I am able to obtain the probabilities from the soft prediction (as to the probability of the data point belonging to each cluster) using the soft_predict() function, but I am unsure of how to create a color gradient with it.

Asked By: CasperCodes

||

Answers:

Let’s say that you have computed the (n_points, n_clusters) array prob, that for n_clusters being 2, looks like

print(prob)
# 0.61 0.39
# 0.55 0.45
# .... ....
# 0.07 0.93

Then you can do the following, to have a point that is more opaque when you have a higher probability of belonging to its cluster, and more transparent when it has a lower probabilty. I think this is what you want…

n_clusters = 2
...
for cluster in range(n_clusters):
    plt.scatter(data[hard_prediction_labels==cluster, 0],
                data[hard_prediction_labels==cluster, 1],
                s=30,
                alpha=0.8*prob[hard_prediction_labels==cluster, cluster])
    plt.scatter(data[hard_prediction_labels==cluster, 0],
                data[hard_prediction_labels==cluster, 1],
                s=5,
                alpha=1.0)

NB I haven’t fcmeans so my code is untested, and is essentially based on informed speculation. You may need to adjust something to make it work.


EDIT

plotting twice each data cluster, with different sizes and transparencies, possibly give you a better visualization of your data.

Answered By: gboffi