Handling k-means with a large 6 GB dataset in scikit-learn?

Question:

I am using scikit-learn. I want to cluster a 6 GB dataset of documents into groups of similar documents.

I only have about 4 GB of RAM, though. Is there a way to get k-means to handle large datasets in scikit-learn?

Thank you. Please let me know if you have any questions.

Asked By: High schooler


Answers:

Clustering is not in itself a well-defined problem (what counts as a "good" clustering depends on your application), and the k-means algorithm only gives locally optimal solutions that depend on its random initialization. I therefore doubt that the results you would get from clustering a random 2 GB subsample of the dataset would be qualitatively different from the results you would get by clustering the entire 6 GB. I would certainly try clustering on the reduced dataset as a first port of call. Further options are to subsample more intelligently, or to do multiple training runs on different subsets and perform some kind of selection or averaging across the runs.
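A minimal sketch of the subsampling idea, assuming the documents are available in memory-friendly form; `all_documents` and `other_batch_of_docs` are hypothetical names, and the sample size, vocabulary limit, and cluster count are assumptions you would tune:

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

random.seed(0)

# Take a random subsample that fits comfortably in RAM (size is an assumption).
sample_docs = random.sample(all_documents, k=50_000)

# Vectorize only the subsample; capping the vocabulary keeps memory bounded.
vectorizer = TfidfVectorizer(max_features=20_000)
X_sample = vectorizer.fit_transform(sample_docs)

# Fit k-means on the subsample (the number of clusters is a guess).
km = KMeans(n_clusters=20, n_init=10, random_state=0)
km.fit(X_sample)

# The remaining documents can then be assigned to the learned clusters in batches.
labels = km.predict(vectorizer.transform(other_batch_of_docs))
```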

Answered By: John Greenall

Use MiniBatchKMeans together with HashingVectorizer; that way, you can learn a cluster model in a single pass over the data, assigning cluster labels as you go or in a second pass. There is an example script in the scikit-learn documentation that demonstrates MiniBatchKMeans (MBKM).
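A rough sketch of this streaming approach; `iter_documents()` is a hypothetical generator that yields raw text documents from disk, and the batch size, hash dimensionality, and cluster count are assumptions:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

vectorizer = HashingVectorizer(n_features=2**18)   # stateless, nothing to fit
km = MiniBatchKMeans(n_clusters=20, random_state=0)  # cluster count is a guess

def batches(iterable, size=10_000):
    """Group an iterable of documents into lists of at most `size` items."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# First pass: learn the cluster centres incrementally, one mini-batch at a time.
for batch in batches(iter_documents()):
    km.partial_fit(vectorizer.transform(batch))

# Second pass: assign cluster labels with the fitted model.
for batch in batches(iter_documents()):
    labels = km.predict(vectorizer.transform(batch))
```

Because HashingVectorizer keeps no vocabulary, the full 6 GB corpus never has to be held in memory; only one mini-batch of documents is materialized at a time.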

Answered By: Fred Foo