PyTorch embedding index out of range

Question:

I'm following this tutorial: https://cs230-stanford.github.io/pytorch-nlp.html. In it, a neural model is created using nn.Module, with an embedding layer that is initialized here:

self.embedding = nn.Embedding(params['vocab_size'], params['embedding_dim'])

vocab_size is the total number of training samples, which is 4000. embedding_dim is 50. The relevant piece of the forward method is below:

def forward(self, s):
        # apply the embedding layer that maps each token to its embedding
        s = self.embedding(s)   # dim: batch_size x batch_max_len x embedding_dim

I get the following exception when passing a batch to the model like so:

model(train_batch)

train_batch is a numpy array of dimension batch_size x batch_max_len. Each sample is a sentence, and each sentence is padded so that it has the length of the longest sentence in the batch.

File "/Users/liam_adams/Documents/cs512/research_project/custom/model.py", line 34, in forward
    s = self.embedding(s)   # dim: batch_size x batch_max_len x embedding_dim
File "/Users/liam_adams/Documents/cs512/venv_research/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
File "/Users/liam_adams/Documents/cs512/venv_research/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 117, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/Users/liam_adams/Documents/cs512/venv_research/lib/python3.7/site-packages/torch/nn/functional.py", line 1506, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range at ../aten/src/TH/generic/THTensorEvenMoreMath.cpp:193

Is the problem here that the embedding is initialized with different dimensions than those of my batch array? My batch_size will be constant, but batch_max_len will change with every batch. This is how it's done in the tutorial.

Asked By: gary69


Answers:

You’ve got some things wrong. Please correct those and re-run your code:

  • params['vocab_size'] is the total number of unique tokens, so it should be len(vocab) in the tutorial (see the sketch after this list).

  • params['embedding_dim'] can be 50, 100, or whatever you choose. Most people use something in the range [50, 1000], both extremes inclusive. Both Word2Vec and GloVe use 300-dimensional embeddings for words.

  • self.embedding() accepts an arbitrary batch size, so that doesn't matter. By the way, in the tutorial, comments such as # dim: batch_size x batch_max_len x embedding_dim indicate the shape of the output tensor of that specific operation, not of the input.
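
To make the first point concrete, here is a minimal sketch (the vocabulary and batch are made up for illustration) of an embedding sized to the number of unique tokens; any index at or above num_embeddings triggers exactly the error above:

import torch
import torch.nn as nn

vocab = ["<pad>", "<unk>", "the", "cat", "sat"]      # 5 unique tokens
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=50)

# Indices 0..4 are valid, and the batch shape is arbitrary.
batch = torch.tensor([[2, 3, 4, 0], [2, 4, 1, 0]])   # dim: batch_size x batch_max_len
out = embedding(batch)                               # dim: batch_size x batch_max_len x 50

bad = torch.tensor([[5]])                            # 5 >= num_embeddings
embedding(bad)                                       # raises "index out of range"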

Answered By: kmario23

Found the answer here https://discuss.pytorch.org/t/embeddings-index-out-of-range-error/12582

I was converting words to indices, but I had based the indices on the total number of words, not on vocab_size, which is a smaller set containing only the most frequent words.
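
In other words, every word outside the vocab_size most frequent words has to be mapped to a reserved index such as <unk> before it reaches the embedding. A minimal sketch of that mapping (the word_to_idx dict here is illustrative, not taken from the tutorial):

# Hypothetical vocabulary of the vocab_size most frequent words.
word_to_idx = {"<pad>": 0, "<unk>": 1, "the": 2, "cat": 3}
UNK = word_to_idx["<unk>"]

def encode(sentence):
    # Out-of-vocabulary words fall back to <unk>, so every index
    # stays within [0, vocab_size) and the embedding lookup is safe.
    return [word_to_idx.get(w, UNK) for w in sentence.split()]

print(encode("the cat meowed"))  # [2, 3, 1] -- "meowed" maps to <unk>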

Answered By: gary69

The embedding size (num_embeddings) in nn.Embedding must be greater than max(input_data), since indices are zero-based. Also check the datatype of input_data, as embedding indices have to be integers (torch.long).
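
A quick diagnostic sketch that checks both conditions before the forward pass (the layer and batch here are placeholders for your own):

import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10, embedding_dim=50)
input_data = torch.tensor([[2, 3, 4], [1, 0, 9]])

# Indices must be integers and strictly less than num_embeddings.
assert input_data.dtype == torch.long, "indices must be torch.long"
assert input_data.max().item() < embedding.num_embeddings, "index out of range"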

Answered By: shivanand naduvin

If you are using a tokenizer from Hugging Face transformers, this is how you set up your embedding.

torch.nn.Embedding accepts two mandatory parameters (see the PyTorch documentation):

  1. num_embeddings
  2. embedding_dim

num_embeddings is the vocab size associated with your tokenizer, and embedding_dim is the size of each embedding vector; it can be anything you like, but try not to unnecessarily use values that are too big.

So you define your embedding as follows.

import torch

embedding = torch.nn.Embedding(num_embeddings=tokenizer.vocab_size,
                               embedding_dim=embedding_dim)
output = embedding(input_ids)   # input_ids: integer token ids from the tokenizer

Note that you may add additional parameters as per your requirements and adjust the embedding dimension to your needs.
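
For example, putting it all together (a sketch assuming the transformers library and the bert-base-uncased checkpoint; any tokenizer exposing vocab_size works the same way):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
embedding = torch.nn.Embedding(num_embeddings=tokenizer.vocab_size,
                               embedding_dim=128)

# The tokenizer guarantees every id is < vocab_size, so no index can overflow.
input_ids = tokenizer("index out of range no more", return_tensors="pt")["input_ids"]
output = embedding(input_ids)   # dim: batch_size x seq_len x 128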

Answered By: codeslord