Python k-means algorithm

Question:

I am looking for Python implementation of k-means algorithm with examples to cluster and cache my database of coordinates.

Asked By: Eeyore

||

Answers:

From wikipedia, you could use scipy, K-means clustering an vector quantization

Or, you could use a Python wrapper for OpenCV, ctypes-opencv.

Or you could OpenCV’s new Python interface, and their kmeans implementation.

Answered By: Jacob

You can also use GDAL, which has many many functions to work with spatial data.

Answered By: George Silva

Update: (Eleven years after this original answer, it’s probably time for an update.)

First off, are you sure you want k-means? This page gives an excellent graphical summary of some different clustering algorithms. I’d suggest that beyond the graphic, look especially at the parameters that each method requires and decide whether you can provide the required parameter (eg, k-means requires the number of clusters, but maybe you don’t know that before you start clustering).

Here are some resources:

Old answer:

Scipy’s clustering implementations work well, and they include a k-means implementation.

There’s also scipy-cluster, which does agglomerative clustering; ths has the advantage that you don’t need to decide on the number of clusters ahead of time.

Answered By: tom10

SciPy’s kmeans2() has some numerical problems: others have reported error messages such as “Matrix is not positive definite – Cholesky decomposition cannot be computed” in version 0.6.0, and I just encountered the same in version 0.7.1.

For now, I would recommend using PyCluster instead. Example usage:

>>> import numpy
>>> import Pycluster
>>> points = numpy.vstack([numpy.random.multivariate_normal(mean, 
                                                            0.03 * numpy.diag([1,1]),
                                                            20) 
                           for mean in [(1, 1), (2, 4), (3, 2)]])
>>> labels, error, nfound = Pycluster.kcluster(points, 3)
>>> labels  # Cluster number for each point
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
>>> error   # The within-cluster sum of distances for the solution
1.7721661785401261
>>> nfound  # Number of times this solution was found
1
Answered By: Vebjorn Ljosa

For continuous data, k-means is very easy.

You need a list of your means, and for each data point, find the mean its closest to and average the new data point to it. your means will represent the recent salient clusters of points in the input data.

I do the averaging continuously, so there is no need to have the old data to obtain the new average. Given the old average k,the next data point x, and a constant n which is the number of past data points to keep the average of, the new average is

k*(1-(1/n)) + n*(1/n)

Here is the full code in Python

from __future__ import division
from random import random

# init means and data to random values
# use real data in your code
means = [random() for i in range(10)]
data = [random() for i in range(1000)]

param = 0.01 # bigger numbers make the means change faster
# must be between 0 and 1

for x in data:
    closest_k = 0;
    smallest_error = 9999; # this should really be positive infinity
    for k in enumerate(means):
        error = abs(x-k[1])
        if error < smallest_error:
            smallest_error = error
            closest_k = k[0]
        means[closest_k] = means[closest_k]*(1-param) + x*(param)

you could just print the means when all the data has passed through, but its much more fun to watch it change in real time. I used this on frequency envelopes of 20ms bits of sound and after talking to it for a minute or two, it had consistent categories for the short ‘a’ vowel, the long ‘o’ vowel, and the ‘s’ consonant. wierd!

Answered By: Nathan

(Years later) this kmeans.py under is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means is straightforward and reasonably fast; it uses any of the 20-odd metrics in scipy.spatial.distance.

Answered By: denis

Python’s Pycluster and pyplot can be used for k-means clustering and for visualization of 2D data. A recent blog post Stock Price/Volume Analysis Using Python and PyCluster gives an example of clustering using PyCluster on stock data.

Answered By: Guest

SciKit Learn’s KMeans() is the simplest way to apply k-means clustering in Python. Fitting clusters is simple as:
kmeans = KMeans(n_clusters=2, random_state=0).fit(X).

This code snippet shows how to store centroid coordinates and predict clusters for an array of coordinates.

>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> kmeans.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
>>> kmeans.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])

(courtesy of SciKit Learn’s documentation, linked above)

Answered By: gsilv