How to get token or code embedding using Codex API?

Question:

For a given code snippet, how can I get an embedding using the Codex API?

import os
import openai
import config


openai.api_key = config.OPENAI_API_KEY

def runSomeCode():
    response = openai.Completion.create(
      engine="code-davinci-001",
      prompt=""""n1. Get a reputable free news apin2. Make a request to the api for the latest news storiesn"""",
      temperature=0,
      max_tokens=1500,
      top_p=1,
      frequency_penalty=0,
      presence_penalty=0)

    if 'choices' in response:
        x = response['choices']
        if len(x) > 0:
            return x[0]['text']
        else:
            return ''
    else:
        return ''



answer = runSomeCode()
print(answer)

But I want to figure out: given a Python code block like the following, can I get an embedding from Codex?

Input:

import Random
a = random.randint(1,12)
b = random.randint(1,12)
for i in range(10):
    question = "What is "+a+" x "+b+"? "
    answer = input(question)
    if answer = a*b
        print (Well done!)
    else:
        print("No.")

Output:

  • Embedding of the input code
Asked By: Exploring


Answers:

Yes, OpenAI can create an embedding for any input text, even if it’s code. You only need to pass the correct engine (model) to its get_embedding() function call. I tested this code:

# Third-party imports
import openai

from openai.embeddings_utils import get_embedding


openai.api_key = OPENAI_SEC_KEY  # your OpenAI secret API key


embedding = get_embedding("""
    def sample_code():
        print("Hello from IamAshKS !!!")
""", engine="code-search-babbage-code-001")

print()
print(f"{embedding=}")
print(f"{len(embedding)=}")

# OUTPUT:
# embedding=[-0.007094269152730703, 0.006055716425180435, -0.005044757854193449, ...]
# len(embedding)=2048


embedding = get_embedding("""
import Random
a = random.randint(1,12)
b = random.randint(1,12)
for i in range(10):
    question = "What is "+a+" x "+b+"? "
    answer = input(question)
    if answer = a*b
        print (Well done!)
    else:
        print("No.")
""", engine="code-search-babbage-code-001")

print()
print(f"{embedding=}")
print(f"{len(embedding)=}")

# OUTPUT:
# embedding=[-0.011341490782797337, -0.005919027142226696, 0.0011923711281269789, ...]
# len(embedding)=2048

NOTE: You can swap in a different model by changing the engine parameter of get_embedding().

The code above gets you embeddings for any code. There is another code-search engine, code-search-ada-code-001, but it is less powerful than code-search-babbage-code-001, which I used for this answer. If you also want to do code search (ranking code snippets against a natural-language query), a minimal sketch follows.
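The sketch below is just one way to wire it up; it assumes the paired query engine code-search-babbage-text-001 and the cosine_similarity helper from openai.embeddings_utils, both shipped with the same openai package version used above. Adapt the snippets and the query to your own data:

import openai
from openai.embeddings_utils import get_embedding, cosine_similarity

# openai.api_key is assumed to be set as in the snippet above.

snippets = [
    "def add(a, b):\n    return a + b",
    "def read_file(path):\n    with open(path) as f:\n        return f.read()",
]

# Embed each code snippet with the *code* engine ...
snippet_embeddings = [
    get_embedding(s, engine="code-search-babbage-code-001") for s in snippets
]

# ... and the natural-language query with the matching *text* engine.
query_embedding = get_embedding(
    "function that reads a file from disk",
    engine="code-search-babbage-text-001",
)

# Rank snippets by cosine similarity to the query.
scores = [cosine_similarity(query_embedding, e) for e in snippet_embeddings]
print(snippets[scores.index(max(scores))])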


Answered By: IamAshKS

The get_embedding function returns an embedding for a given input text.

Canonical code from OpenAI here: https://github.com/openai/openai-python/blob/main/examples/embeddings/Get_embeddings.ipynb

import openai
from typing import List

from tenacity import retry, wait_random_exponential, stop_after_attempt

@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def get_embedding(text: str, engine="text-similarity-davinci-001") -> List[float]:

    # Replace newlines, which can negatively affect performance.
    text = text.replace("\n", " ")

    return openai.Embedding.create(input=[text], engine=engine)["data"][0]["embedding"]

embedding = get_embedding("Sample query text goes here", engine="text-search-ada-query-001")
print(len(embedding))
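
The same helper works for code, too: pass the snippet as the text and choose a code-search engine. A short sketch, assuming the code-search-babbage-code-001 engine mentioned in the first answer:

code_snippet = """
import random

a = random.randint(1, 12)
b = random.randint(1, 12)
"""

# Embed the code block itself with a code-search engine.
code_embedding = get_embedding(code_snippet, engine="code-search-babbage-code-001")
print(len(code_embedding))  # 2048 dimensions for the babbage code-search engines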
Answered By: Exploring