Community detection for a larger-than-memory embeddings dataset
Question:
I currently have a dataset of textual embeddings (768 dimensions). The current number of records is ~1 million. I am looking to detect related embeddings through a community detection algorithm. For small data sets, I have been able to use this one:
It works great, but it doesn't really scale once the data set grows larger than memory.
The key here is that I am able to specify a threshold for community matches. I have found clustering algorithms that scale to larger-than-memory data, but they all require a fixed number of clusters up front. I need the system to detect the number of clusters for me.
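For reference, a minimal sketch of the kind of threshold-based community detection described here (greedy grouping of points whose cosine similarity exceeds a threshold, with no preset cluster count). The function name, ordering heuristic, and parameters are illustrative, not any particular library's API, and this in-memory version is exactly what stops scaling past RAM:

```python
import numpy as np

def community_detection(embeddings, threshold=0.75, min_community_size=3):
    """Greedy threshold-based community detection (illustrative sketch).

    embeddings: (n, d) array of L2-normalized vectors.
    Returns a list of communities, each a list of row indices.
    """
    # For normalized vectors, cosine similarity is just a dot product.
    # This (n, n) matrix is the memory bottleneck at ~1M rows.
    sim = embeddings @ embeddings.T
    assigned = set()
    communities = []
    # Visit points in order of how many neighbors clear the threshold,
    # so the densest regions form communities first.
    order = np.argsort(-(sim >= threshold).sum(axis=1))
    for i in order:
        if i in assigned:
            continue
        members = [j for j in np.flatnonzero(sim[i] >= threshold)
                   if j not in assigned]
        if len(members) >= min_community_size:
            communities.append(members)
            assigned.update(members)
    return communities
```

The number of communities falls out of the threshold rather than being specified in advance, which is the property the question is after.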
I’m certain there is a class of algorithms – and hopefully a Python library – that can handle this situation, but I have been unable to locate it. Does anyone know of an algorithm or a solution I could use?
Answers:
That seems small enough that you could just rent a bigger computer.
Nevertheless, to answer the question: the typical play is to partition the data into a few chunks (overlapping or not) that each fit in memory, and then apply a higher-quality in-memory clustering algorithm within each chunk. One common strategy for cosine similarity is to bucket by SimHashes, but
- there’s a whole literature out there;
- if you already have a scalable clustering algorithm you like, you can use that.
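To make the SimHash idea above concrete, here is a small sketch: bucket vectors by the sign pattern of a few random projections, so that vectors with high cosine similarity tend to land in the same bucket, and each bucket can then be clustered in memory. The function name and parameters are illustrative, not from a specific library:

```python
import numpy as np

def simhash_buckets(embeddings, n_bits=8, seed=0):
    """Partition vectors into buckets keyed by their SimHash: the sign
    pattern of projections onto n_bits random hyperplanes. Vectors that
    are close in cosine similarity tend to share a bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(embeddings.shape[1], n_bits))
    bits = (embeddings @ planes) >= 0        # (n, n_bits) sign pattern
    keys = bits @ (1 << np.arange(n_bits))   # pack each row of bits into an int
    buckets = {}
    for i, k in enumerate(keys):
        buckets.setdefault(int(k), []).append(i)
    return buckets
```

In practice you would tune the number of bits (more bits means smaller, purer buckets) or use several hash tables with different seeds to reduce the chance that near neighbors are split across buckets; the expensive in-memory community detection then runs only within each bucket.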
The approach in the following link could be a good fit for your problem, but be aware that the result is sub-optimal compared to clustering everything in one pass.
https://ntropy.com/post/clustering-millions-of-sentences-to-optimize-the-ml-workflow