Having one vector column for multiple text columns on Qdrant

Question:

I have a products table that has a lot of columns, of which the following are important for our search:

  1. Title 1 to Title 6 (title in 6 different languages)
  2. Brand name (in 6 different languages)
  3. Category name (in 6 different languages)
  4. Product attributes like size, color, etc. (in 6 different languages)

We are planning on using Qdrant vector search to implement fast vector queries. But the problem is that all the data that matters for search is spread across different columns, and I do not think (correct me if I am wrong) that generating vector embeddings separately for every column is the best solution.

I came up with the idea of combining the columns together and creating separate collections; I chose this approach because the title, category, brand, and attribute columns are essentially the same data, just in different languages.

Also, I use the "BAAI/bge-m3" model, a multilingual text embedding model that supports more than 100 languages.

So, in short, I created a separate collection for each language. In each collection, every product has a single vector: the embedding of the combined text of its title, brand, color, and category in that language. At search time, since we already know which language the website is in, we query only that language's collection.
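
Roughly, the indexing looks something like this (a simplified sketch, not our real code; the collection name, payload fields, sample product, and the 1024-dimension dense size of bge-m3 are just for illustration):

```python
# Simplified sketch of the per-language setup: one collection per language,
# one combined-text vector per product. Names and data below are placeholders.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")      # multilingual dense embeddings
client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="products_en",              # one collection per language
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

products = [  # hypothetical rows, already restricted to one language
    {"id": 1, "title": "Koala patterned hoodie", "brand": "Bubito",
     "category": "children", "attrs": "blue"},
]

points = []
for p in products:
    combined = " ".join([p["title"], p["category"], p["attrs"], p["brand"]])
    vector = model.encode(combined, normalize_embeddings=True).tolist()
    points.append(PointStruct(id=p["id"], vector=vector, payload=p))

client.upsert(collection_name="products_en", points=points)

# At query time we already know the site language, so we hit only that collection.
query_vector = model.encode("blue hoodie for men", normalize_embeddings=True).tolist()
hits = client.search(collection_name="products_en", query_vector=query_vector, limit=10)
```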

Now, the question is: is this a valid method? What are the pros and cons of this method? I know for sure that, when the text is combined, I cannot give different weights to different parts of the vector. For example, one combined text of title, category, color, and brand may look like this:

"Koala patterned hoodie children blue Bubito"

or something like:

"Striped t-shirt men navy blue Zara"

Now, a user may search for "blue hoodie for men", but due to the un-weighted structure of the combined vector, it may not retrieve the best results.

I may be wrong and this may actually be one of the better approaches, but please tell me more about the pros and cons of this method and, if you can, suggest a better idea.

It is important to note that we currently have more than 300,000 (300K) products, and this will grow to more than 1,000,000 (1M) in the near future.

Asked By: Vahid


Answers:

It seems like you have thought this through already, and your method is valid, practical, simple, and scalable. Here is a quick overview of what I think about your particular question.


Pros of your method

  1. By segregating data into collections based on language, you ensure that searches are conducted within the correct linguistic context. It is quite rare for users to mix languages in search terms, so I think you are right on this point.
  2. In terms of scalability, your approach seems sound, since it can grow linearly with your database. The per-language separation could also let you split the databases by region (Chinese in China, English in England, and so on) and query only the one for the right region.
  3. Combining the relevant fields into a single vector for each language streamlines the search process. This reduces the complexity of managing multiple vectors per product, which can lower overhead and improve efficiency.

Cons of your method

  1. As you stated previously, combining fields without weighting can lead to less precise search outcomes because there is no way to tell which keywords are important.
  2. The combined vector approach might not always capture the nuances of the data. For instance, a product's title, brand, and category might not align perfectly with the user's search intent, especially if the brand name is a common word in the user's language, which could make results feel like Google's "Verbatim" mode.

Alternative approaches

Weighted Vector Combination

Instead of merging all fields into a single vector, consider creating separate vectors for each field (title, brand, category, attributes) and then combining them with weights that reflect their importance. This method allows for more precise control over search relevance but requires more computational resources and complexity, and a fair bit of judgement if you fine-tune the weights yourself.
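
For illustration, here is a minimal sketch of what the weighted combination could look like, assuming each field is embedded separately with bge-m3 and the weights are hand-picked placeholders:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

# Hand-picked placeholder weights -- tuning these is the "fair bit of judgement" part.
FIELD_WEIGHTS = {"title": 0.5, "category": 0.25, "attrs": 0.15, "brand": 0.10}

def weighted_product_vector(product: dict) -> np.ndarray:
    """Embed each field separately, then take a weighted sum and re-normalize."""
    combined = np.zeros(1024)                    # bge-m3 dense embedding size
    for field, weight in FIELD_WEIGHTS.items():
        emb = model.encode(product[field], normalize_embeddings=True)
        combined += weight * emb
    return combined / np.linalg.norm(combined)

product = {"title": "Striped t-shirt", "category": "men",
           "attrs": "navy blue", "brand": "Zara"}
vector = weighted_product_vector(product)        # store this single vector in Qdrant
```

A variant in the same spirit is to store each field as a separate named vector on the same Qdrant point and combine the per-field scores on the client side, at the cost of more storage and more queries.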

Another solution in a similar vein would be to hard-code some "important" keywords, or pin them so that they are searched in a specific column. This might be doable if your catalog has a few "main" categories of products, but it can become very tedious, or outright unworkable, if your products are very diverse.
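
One simple reading of the pinning idea is a small hand-maintained map from known keywords to payload fields, turned into Qdrant payload filters before the vector search. The keyword list below is purely illustrative:

```python
from qdrant_client.models import FieldCondition, Filter, MatchValue

# Hand-maintained, illustrative keyword -> (payload field, value) map.
PINNED_KEYWORDS = {
    "hoodie": ("category", "hoodie"),
    "skirt":  ("category", "skirt"),
    "zara":   ("brand", "Zara"),
}

def build_pinned_filter(query: str):
    """Turn recognized query keywords into must-match payload conditions."""
    conditions = []
    for token in query.lower().split():
        if token in PINNED_KEYWORDS:
            field, value = PINNED_KEYWORDS[token]
            conditions.append(FieldCondition(key=field, match=MatchValue(value=value)))
    return Filter(must=conditions) if conditions else None
```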

Semantic Search with Fine-Tuning

Utilize the BAAI/bge-m3 model to generate embeddings for each field individually, then combine these embeddings in a manner that allows for weighting. This could involve training a custom model on your data to better understand the significance of different fields in the context of your products. This approach essentially automates the previous one, but requires you to already have data about the search intent and the keywords used by the clients.

This method is also fairly complicated to implement but could yield good results if you combine it with analytics from the websites so that it can learn over time.
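
As a very rough sketch of the "learn from analytics" idea: if you log, for each query/product pair, the per-field cosine similarities and whether the user clicked, a simple classifier over those similarities gives you data-driven field weights. The field names and numbers below are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

FIELDS = ["title", "category", "attrs", "brand"]

# Assumed analytics data: per-field query/product cosine similarities plus a
# click label. The values here are invented toy data.
features = np.array([
    [0.82, 0.44, 0.31, 0.12],   # clicked
    [0.35, 0.71, 0.28, 0.09],   # clicked
    [0.30, 0.22, 0.18, 0.65],   # not clicked
    [0.25, 0.30, 0.20, 0.70],   # not clicked
])
labels = np.array([1, 1, 0, 0])

clf = LogisticRegression().fit(features, labels)

# The learned coefficients indicate how predictive each field's similarity is
# of a click; normalize the positive ones to use as weights in the combination step.
raw = np.clip(clf.coef_[0], a_min=0, a_max=None)
weights = dict(zip(FIELDS, raw / raw.sum()))
print(weights)
```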


I hope this helps; I would be interested to know which method you end up using.

Answered By: LiteApplication

Beta Answer (not implemented yet, posted for discussion)

As expected, there is indeed a weight distribution problem with the search query in the previous method. For example, if you search for something like "women red skirt", instead of retrieving only "women red skirts", it also retrieves "women red shoes" or something similar.

But with weighted importance levels assigned to different fields, this issue would not occur. Let me explain what I think can be done to implement the weighted importance distribution method.

First Step: Tokenizing the search query

First, we have to tokenize the search query, "women red skirt", to see which keywords are included. We have to somehow (I do not know how) figure out that "women" is the gender, "red" is an attribute (color), and "skirt" is the category.
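
One option I can think of (not tested) is to embed each query token and compare it against embeddings of the known values of each field, assigning the token to the field it resembles most. The vocabularies below are placeholders; in practice they would come from our catalog:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

# Illustrative vocabularies; real ones would be extracted from the product data.
FIELD_VALUES = {
    "gender":   ["women", "men", "children"],
    "color":    ["red", "blue", "green", "black"],
    "category": ["skirt", "hoodie", "t-shirt", "shoes"],
}

# Pre-compute normalized prototype embeddings for every known value.
prototypes = {
    field: (values, model.encode(values, normalize_embeddings=True))
    for field, values in FIELD_VALUES.items()
}

def classify_token(token: str, threshold: float = 0.6):
    """Assign a query token to the field whose known value it resembles most."""
    token_emb = model.encode(token, normalize_embeddings=True)
    best = (None, None, threshold)
    for field, (values, embs) in prototypes.items():
        sims = embs @ token_emb                  # cosine similarity (vectors are normalized)
        i = int(np.argmax(sims))
        if sims[i] > best[2]:
            best = (field, values[i], float(sims[i]))
    return best[:2]                              # (field, matched value) or (None, None)

print([classify_token(t) for t in "women red skirt".split()])
```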

Second Step: Step-by-step filtering

Then, according to the importance level of each field, we would filter the data step by step. For example, first search the category vector column and fetch everything with the category "skirt"; then, from that result list, filter for the gender "women"; and in the final step, filter for the color attribute "red".
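
If the category, gender, and color end up as payload fields in Qdrant, I think the step-by-step narrowing could even be collapsed into a single filtered vector search, since Qdrant lets you attach payload filters to a search. A rough sketch (collection and field names are placeholders):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
client = QdrantClient(url="http://localhost:6333")

query_vector = model.encode("women red skirt", normalize_embeddings=True).tolist()

hits = client.search(
    collection_name="products_en",               # illustrative collection name
    query_vector=query_vector,
    query_filter=Filter(must=[                   # structured constraints from step one
        FieldCondition(key="category", match=MatchValue(value="skirt")),
        FieldCondition(key="gender",   match=MatchValue(value="women")),
        FieldCondition(key="color",    match=MatchValue(value="red")),
    ]),
    limit=20,
)
```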

The Problem

Now, the problem is that I do not know whether this method is practical, feasible, or optimized. I would appreciate any kind of input on this matter.

Answered By: Vahid