Community detection for a larger-than-memory embeddings dataset
Question:
I currently have a dataset of textual embeddings (768 dimensions). The current number of records is ~1 million. I am looking to detect related embeddings through a community detection algorithm. For small data sets, I have been able to use this one:
It works great, but it doesn't really scale once the data set grows larger than memory.
The key here is that I am able to specify a threshold for community matches. I have found clustering algorithms that scale to larger-than-memory data, but they all require a fixed number of clusters up front. I need the system to detect the number of clusters for me.
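For reference, a minimal sketch of the kind of threshold-based community detection described here (greedy grouping of points whose cosine similarity exceeds a threshold, with no preset cluster count). The function name, ordering heuristic, and parameters are illustrative, not any particular library's API, and this in-memory version is exactly what stops scaling past RAM:

```python
import numpy as np

def community_detection(embeddings, threshold=0.75, min_community_size=3):
    """Greedy threshold-based community detection (illustrative sketch).

    embeddings: (n, d) array of L2-normalized vectors.
    Returns a list of communities, each a list of row indices.
    """
    # For normalized vectors, cosine similarity is just a dot product.
    # This (n, n) matrix is the memory bottleneck at ~1M rows.
    sim = embeddings @ embeddings.T
    assigned = set()
    communities = []
    # Visit points in order of how many neighbors clear the threshold,
    # so the densest regions form communities first.
    order = np.argsort(-(sim >= threshold).sum(axis=1))
    for i in order:
        if i in assigned:
            continue
        members = [j for j in np.flatnonzero(sim[i] >= threshold)
                   if j not in assigned]
        if len(members) >= min_community_size:
            communities.append(members)
            assigned.update(members)
    return communities
```

The number of communities falls out of the threshold rather than being specified in advance, which is the property the question is after.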
I’m certain there is a class of algorithms – and hopefully a Python library – that can handle this situation, but I have been unable to locate it. Does anyone know of an algorithm or a solution I could use?
Answers:
That seems small enough that you could just rent a bigger computer.
Nevertheless, to answer the question: the typical play is to partition the data into a few chunks (overlapping or not) that each fit in memory, and then apply a higher-quality in-memory clustering algorithm within each chunk. One common strategy for cosine similarity is to bucket by SimHashes, but
- there’s a whole literature out there;
- if you already have a scalable clustering algorithm you like, you can use that.
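To make the SimHash idea above concrete, here is a small sketch: bucket vectors by the sign pattern of a few random projections, so that vectors with high cosine similarity tend to land in the same bucket, and each bucket can then be clustered in memory. The function name and parameters are illustrative, not from a specific library:

```python
import numpy as np

def simhash_buckets(embeddings, n_bits=8, seed=0):
    """Partition vectors into buckets keyed by their SimHash: the sign
    pattern of projections onto n_bits random hyperplanes. Vectors that
    are close in cosine similarity tend to share a bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(embeddings.shape[1], n_bits))
    bits = (embeddings @ planes) >= 0        # (n, n_bits) sign pattern
    keys = bits @ (1 << np.arange(n_bits))   # pack each row of bits into an int
    buckets = {}
    for i, k in enumerate(keys):
        buckets.setdefault(int(k), []).append(i)
    return buckets
```

In practice you would tune the number of bits (more bits means smaller, purer buckets) or use several hash tables with different seeds to reduce the chance that near neighbors are split across buckets; the expensive in-memory community detection then runs only within each bucket.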
The approach in the following link could be a good fit for your problem, but be aware that the result is sub-optimal compared to clustering everything in one pass.
https://ntropy.com/post/clustering-millions-of-sentences-to-optimize-the-ml-workflow