How can I project similar text and image data into the same semantic space using Jina?

Question:

Can anyone tell me how I can project similar text and image data into the same semantic space using the open-source tools available from Jina?

I have been searching the internet and have not found an answer to the question above. I would appreciate any help.

Asked By: Prometheus


Answers:

TL;DR: Jina’s CLIP-as-service is arguably the best way to go.

Let me reply in more detail by first sharing a live use-case implementation. You can experience it for yourself in under 30 seconds by accessing this open platform, ActInsight, from your mobile device (or laptop). We wrote it earlier this month in the context of a hackathon. You can take any picture (e.g. the AC unit in your office, a flower, a company building, anything…) and the platform will provide you with associated relevant insights (related to Climate Change actions in our case).
The point is that we have implemented exactly what you describe: projecting text and image data into the same semantic space and finding the "closest" matches (in the sense of cosine distance), so you can get a feel for the end result.
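
For context, "closest" here just means ranking candidates by cosine similarity between embedding vectors. Below is a minimal sketch of that ranking step, assuming you already have the embeddings as NumPy arrays (the variable names are illustrative, not part of any library):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity is the dot product of the two L2-normalized vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_candidates(query_embedding: np.ndarray, candidate_embeddings: np.ndarray) -> np.ndarray:
    # query_embedding: embedding of the uploaded picture, shape (D,)
    # candidate_embeddings: embeddings of the candidate texts, shape (N, D)
    scores = [cosine_similarity(query_embedding, c) for c in candidate_embeddings]
    # Highest similarity (i.e. smallest cosine distance) first.
    return np.argsort(scores)[::-1]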

The underlying model that allows this "magic" is CLIP, brainchild of OpenAI. In their words:

CLIP: Connecting Text and Images – We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.

Now, from a developer’s point of view, you can deploy CLIP into production yourself (Docker, Kubernetes…), or you can leverage what I would call the "production-grade awesomeness" of Jina’s open-source CLIP-as-service. Note that what the Jina team has achieved is a lot more than a Python wrapper around CLIP: they have packaged elegant solutions to many of the thorny issues developers traditionally face when deploying their workloads to production in the cloud, all of which come out of the box with the open-source CLIP-as-service.
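
To give you an idea of what the client side looks like, here is a minimal sketch using the clip_client package. It assumes you have already started a clip_server instance and that it is reachable at the address shown, which you would adapt to your deployment; the example inputs are placeholders:

from clip_client import Client

# Adjust the address/port to wherever your clip_server instance is listening.
c = Client("grpc://0.0.0.0:51000")

# Text and images go through the same model, so their embeddings land in the
# same semantic space and can be compared directly.
text_embeddings = c.encode(
    ["an air-conditioning unit on an office wall", "a flower in a garden"]
)
image_embeddings = c.encode(["https://example.com/photo.jpg"])  # placeholder URL

print(text_embeddings.shape)   # (2, D), where D depends on the model the server loads
print(image_embeddings.shape)  # (1, D)

Because the same model encodes both modalities, you can feed these vectors straight into a cosine-similarity ranking like the one sketched earlier.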

For ActInsight, I used a combination of the "raw" OpenAI API and Jina’s CaaS, for a couple of reasons linked to my architecture, but you don’t have to: I think Jina is pretty much all you need.

One last important note: CLIP will allow you to basically connect text and images, but these models come in different flavours. You have to make sure that your embeddings are generated within the same semantic space by using the same model for all your inputs, which also means making sure your embeddings (vectors) all have the same shape, so that you can compare and rank them down the road. Jina easily lets you select your model of choice, but I used the chart further down this page to understand the trade-offs between them. It also shows that CLIP is basically the best currently available:

the best CLIP model outperforms the best publicly available ImageNet model, the Noisy Student EfficientNet-L2, on 20 out of 26 different transfer datasets we tested.

To get started, I would suggest just going with the "ViT-L/14@336px" model, which is currently the best. It will project your text/images into a 768-dimensional space, which means your vectors should all have shape (1, 768) in that case.
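
If you want to sanity-check those shapes locally with OpenAI’s open-source clip package rather than through a running clip_server, a sketch along these lines should work (the image path and prompts are placeholders):

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)  # placeholder path
texts = clip.tokenize(["an air-conditioning unit", "a flower"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)  # shape (1, 768)
    text_features = model.encode_text(texts)    # shape (2, 768)

print(image_features.shape, text_features.shape)

Whichever route you take, just make sure every vector you compare was produced by the same model variant; embeddings from different CLIP flavours are not interchangeable.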

Answered By: Marc