PyTorch embedding index out of range
Question:
I’m following this tutorial: https://cs230-stanford.github.io/pytorch-nlp.html. In it, a neural model is created using nn.Module, with an embedding layer that is initialized here:
self.embedding = nn.Embedding(params['vocab_size'], params['embedding_dim'])
vocab_size is the total number of training samples, which is 4000. embedding_dim is 50. The relevant piece of the forward method is below:
def forward(self, s):
    # apply the embedding layer that maps each token to its embedding
    s = self.embedding(s)  # dim: batch_size x batch_max_len x embedding_dim
I get this exception when passing a batch to the model like so
model(train_batch)
train_batch is a numpy array of dimension batch_size x batch_max_len. Each sample is a sentence, and each sentence is padded so that it has the length of the longest sentence in the batch.
  File "/Users/liam_adams/Documents/cs512/research_project/custom/model.py", line 34, in forward
    s = self.embedding(s) # dim: batch_size x batch_max_len x embedding_dim
  File "/Users/liam_adams/Documents/cs512/venv_research/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/liam_adams/Documents/cs512/venv_research/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 117, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/Users/liam_adams/Documents/cs512/venv_research/lib/python3.7/site-packages/torch/nn/functional.py", line 1506, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range at ../aten/src/TH/generic/THTensorEvenMoreMath.cpp:193
Is the problem here that the embedding is initialized with different dimensions than those of my batch array? My batch_size will be constant but batch_max_len will change with every batch. This is how it's done in the tutorial.
Answers:
You’ve got some things wrong. Please correct those and re-run your code:
- params['vocab_size'] is the total number of unique tokens, not the number of training samples. So, it should be len(vocab) in the tutorial.
- params['embedding_dim'] can be 50 or 100 or whatever you choose. Most folks would use something in the range [50, 1000], both extremes inclusive. Widely used pretrained Word2Vec and GloVe vectors are 300-dimensional.
- self.embedding() accepts an arbitrary batch size and sequence length, so a changing batch_max_len doesn’t matter. BTW, in the tutorial, comments such as # dim: batch_size x batch_max_len x embedding_dim indicate the shape of the output tensor of that specific operation, not of the inputs.
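The points above can be sketched as follows. This is a minimal illustration, not the tutorial's code: the toy `vocab` dict and the batch contents are made up, and only the `params` keys come from the question. It shows an embedding table sized by the number of unique tokens, and the same layer accepting batches with different `batch_max_len`.

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary: token -> index (not the tutorial's data).
vocab = {"<pad>": 0, "<unk>": 1, "the": 2, "cat": 3, "sat": 4}

# One embedding row per unique token, as the answer suggests.
params = {"vocab_size": len(vocab), "embedding_dim": 50}
embedding = nn.Embedding(params["vocab_size"], params["embedding_dim"])

# Same batch_size, different batch_max_len (shorter sentences padded with 0).
batch_a = torch.tensor([[2, 3, 4], [2, 3, 0]], dtype=torch.long)
batch_b = torch.tensor([[2, 3, 4, 2, 3], [2, 1, 0, 0, 0]], dtype=torch.long)

# Output shape: batch_size x batch_max_len x embedding_dim
out_a = embedding(batch_a)
out_b = embedding(batch_b)

print(out_a.shape)
print(out_b.shape)
```

Only the values inside the batch matter for the lookup: every index has to be smaller than `params["vocab_size"]`, whatever the batch shape is.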
Found the answer here: https://discuss.pytorch.org/t/embeddings-index-out-of-range-error/12582
I’m converting words to indexes, but I had the indexes based off the total number of words, not vocab_size, which is a smaller set of the most frequent words.
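One way to keep indexes consistent with a reduced vocabulary is to map every word outside it to a shared unknown token. The sketch below is only an assumption about how that could look: the `<pad>` and `<unk>` tokens, the helper names `build_vocab` and `encode`, and the sample sentences are all hypothetical and not taken from the tutorial.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def build_vocab(sentences, max_size):
    """Keep the most frequent words; indices 0 and 1 are reserved."""
    counts = Counter(word for sent in sentences for word in sent)
    vocab = {PAD: 0, UNK: 1}
    for word, _ in counts.most_common(max_size - len(vocab)):
        vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Convert words to indices, falling back to UNK for words not kept."""
    return [vocab.get(word, vocab[UNK]) for word in sentence]


sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "bird", "flew"]]
vocab = build_vocab(sentences, max_size=4)
encoded = [encode(s, vocab) for s in sentences]

# Every index is now guaranteed to be below len(vocab),
# which is the value to pass as vocab_size to nn.Embedding.
largest_index = max(i for sent in encoded for i in sent)
```

Because the indices are produced from the same dict whose length sizes the embedding, the lookup cannot go out of range.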
The number of embeddings (num_embeddings) in nn.Embedding should be at least max(input_data) + 1, because indices start at 0. Also check the datatype of input_data: the indices have to be integers (typically a torch.long tensor).
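A check along these lines can be run before the lookup to get a more descriptive error than the one in the traceback. This is a sketch under assumptions: `check_indices` is a hypothetical helper, not part of PyTorch, and the table size and batch values are made up.

```python
import numpy as np
import torch
import torch.nn as nn


def check_indices(indices, embedding):
    """Raise a descriptive error if indices cannot be looked up in embedding."""
    if indices.dtype not in (torch.int32, torch.int64):
        raise TypeError(f"expected integer indices, got {indices.dtype}")
    lo, hi = int(indices.min()), int(indices.max())
    if lo < 0 or hi >= embedding.num_embeddings:
        raise IndexError(
            f"indices span [{lo}, {hi}] but the table only has "
            f"{embedding.num_embeddings} rows"
        )


embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# Hypothetical padded batch stored as a numpy array, as in the question.
train_batch = np.array([[1, 5, 9], [2, 0, 0]])
indices = torch.from_numpy(train_batch).long()

check_indices(indices, embedding)  # passes: the largest index is 9, below 10
output = embedding(indices)
```

An index of 10 would fail the check here, since a table with 10 rows only covers indices 0 through 9.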
If you are using a tokenizer from Hugging Face transformers, this is how you can set up your embedding.
torch.nn.Embedding takes two mandatory parameters (see the PyTorch documentation):
- num_embeddings
- embedding_dim
num_embeddings is the vocab size associated with your tokenizer. embedding_dim is the size of each embedding vector; it is independent of the sequence length, so choose whatever suits your model, but try not to use values that are unnecessarily big.
So you define your embedding as follows.
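import torch

# tokenizer and embedding_dim are assumed to be defined already
embedding = torch.nn.Embedding(num_embeddings=tokenizer.vocab_size,
                               embedding_dim=embedding_dim)
output = embedding(input_ids)  # input_ids: integer tensor of token ids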
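Note that you may pass additional parameters as required and adjust the embedding dimension to your needs. If you have added tokens to the tokenizer, size the table with len(tokenizer) instead, since tokenizer.vocab_size does not count added tokens.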