PyTorch embedding index out of range

Question:

I'm following this tutorial: https://cs230-stanford.github.io/pytorch-nlp.html. In it, a neural model is created using nn.Module, with an embedding layer that is initialized here:

self.embedding = nn.Embedding(params['vocab_size'], params['embedding_dim'])

vocab_size is the total number of training samples, which is 4000. embedding_dim is 50. The relevant piece of the forward method is below:

def forward(self, s):
        # apply the embedding layer that maps each token to its embedding
        s = self.embedding(s)   # dim: batch_size x batch_max_len x embedding_dim

I get the following exception when passing a batch to the model like so:

model(train_batch)

train_batch is a numpy array of dimension batch_size x batch_max_len. Each sample is a sentence, and each sentence is padded so that it has the length of the longest sentence in the batch.

File "/Users/liam_adams/Documents/cs512/research_project/custom/model.py", line 34, in forward
    s = self.embedding(s)   # dim: batch_size x batch_max_len x embedding_dim
File "/Users/liam_adams/Documents/cs512/venv_research/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
File "/Users/liam_adams/Documents/cs512/venv_research/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 117, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/Users/liam_adams/Documents/cs512/venv_research/lib/python3.7/site-packages/torch/nn/functional.py", line 1506, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range at ../aten/src/TH/generic/THTensorEvenMoreMath.cpp:193

Is the problem here that the embedding is initialized with different dimensions than those of my batch array? My batch_size will be constant, but batch_max_len will change with every batch. This is how it's done in the tutorial.

Asked By: gary69


Answers:

You’ve got some things wrong. Please correct those and re-run your code:

  • params['vocab_size'] is the total number of unique tokens, so it should be len(vocab) in the tutorial (see the sketch after this list).

  • params['embedding_dim'] can be 50, 100, or whatever you choose. Most people use something in the range [50, 1000], both extremes inclusive. Both Word2Vec and GloVe use 300-dimensional embeddings for words.

  • self.embedding() accepts an arbitrary batch size, so that doesn't matter. By the way, in the tutorial, comments such as # dim: batch_size x batch_max_len x embedding_dim indicate the shape of the output tensor of that specific operation, not of the input.
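
To make the first point concrete, here is a minimal sketch (the vocabulary and batch are made up for illustration) of an embedding sized to the number of unique tokens; any index at or above num_embeddings triggers exactly the error above:

import torch
import torch.nn as nn

vocab = ["<pad>", "<unk>", "the", "cat", "sat"]      # 5 unique tokens
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=50)

# Indices 0..4 are valid, and the batch shape is arbitrary.
batch = torch.tensor([[2, 3, 4, 0], [2, 4, 1, 0]])   # dim: batch_size x batch_max_len
out = embedding(batch)                               # dim: batch_size x batch_max_len x 50

bad = torch.tensor([[5]])                            # 5 >= num_embeddings
embedding(bad)                                       # raises "index out of range"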

Answered By: kmario23

Found the answer here https://discuss.pytorch.org/t/embeddings-index-out-of-range-error/12582

I was converting words to indices, but I had based the indices on the total number of words, not on vocab_size, which is a smaller set containing only the most frequent words.
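
In other words, every word outside the vocab_size most frequent words has to be mapped to a reserved index such as <unk> before it reaches the embedding. A minimal sketch of that mapping (the word_to_idx dict here is illustrative, not taken from the tutorial):

# Hypothetical vocabulary of the vocab_size most frequent words.
word_to_idx = {"<pad>": 0, "<unk>": 1, "the": 2, "cat": 3}
UNK = word_to_idx["<unk>"]

def encode(sentence):
    # Out-of-vocabulary words fall back to <unk>, so every index
    # stays within [0, vocab_size) and the embedding lookup is safe.
    return [word_to_idx.get(w, UNK) for w in sentence.split()]

print(encode("the cat meowed"))  # [2, 3, 1] -- "meowed" maps to <unk>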

Answered By: gary69

The embedding size (num_embeddings) in nn.Embedding must be greater than max(input_data), since indices are zero-based. Also check the datatype of input_data, as embedding indices have to be integers (torch.long).
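
A quick diagnostic sketch that checks both conditions before the forward pass (the layer and batch here are placeholders for your own):

import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10, embedding_dim=50)
input_data = torch.tensor([[2, 3, 4], [1, 0, 9]])

# Indices must be integers and strictly less than num_embeddings.
assert input_data.dtype == torch.long, "indices must be torch.long"
assert input_data.max().item() < embedding.num_embeddings, "index out of range"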

Answered By: shivanand naduvin

If you are using a tokenizer from Hugging Face transformers, this is how you set up your embedding.

torch.nn.Embedding accepts two mandatory parameters (see the PyTorch documentation):

  1. num_embeddings
  2. embedding_dim

num_embeddings is the vocab size associated with your tokenizer, and embedding_dim is the size of each embedding vector; it can be anything you like, but try not to unnecessarily use values that are too big.

So you define your embedding as follows.

import torch

embedding = torch.nn.Embedding(num_embeddings=tokenizer.vocab_size,
                               embedding_dim=embedding_dim)
output = embedding(input_ids)   # input_ids: integer token ids from the tokenizer

Note that you may add additional parameters as per your requirements and adjust the embedding dimension to your needs.
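
For example, putting it all together (a sketch assuming the transformers library and the bert-base-uncased checkpoint; any tokenizer exposing vocab_size works the same way):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
embedding = torch.nn.Embedding(num_embeddings=tokenizer.vocab_size,
                               embedding_dim=128)

# The tokenizer guarantees every id is < vocab_size, so no index can overflow.
input_ids = tokenizer("index out of range no more", return_tensors="pt")["input_ids"]
output = embedding(input_ids)   # dim: batch_size x seq_len x 128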

Answered By: codeslord