TorchText Vocab TypeError: Vocab.__init__() got an unexpected keyword argument 'min_freq'

Question:

I am working on a CNN sentiment analysis model which uses the IMDb dataset provided by the Torchtext library.
On the following line of code

vocab = Vocab(counter, min_freq = 1, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))

I am getting a TypeError for the min_freq argument, even though I am certain that it is one of the accepted arguments for the function. I am also getting a UserWarning: "Lambda function is not supported for pickle, please use regular python function or functools.partial instead". Full code:

from torchtext.datasets import IMDB
from collections import Counter
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import Vocab
tokenizer = get_tokenizer('basic_english')  
train_iter = IMDB(split='train')
test_iter = IMDB(split='test')
counter = Counter()
for (label, line) in train_iter:
    counter.update(tokenizer(line))
vocab = Vocab(counter, min_freq = 1, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))

Source links:

  • towardsdatascience
  • github: Legacy to new

I have tried removing the min_freq argument and using the function's default, as follows:

vocab = Vocab(counter, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))

However, I end up getting the same TypeError, but for the specials argument rather than min_freq.

Any help will be much appreciated

Thank you.

Asked By: James B


Answers:

As https://github.com/pytorch/text/issues/1445 mentions, you should change "Vocab" to "vocab". I think they mistyped it in the legacy-to-new notebook.

Correct code:

from torchtext.datasets import IMDB
from collections import Counter
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import vocab  # lowercase factory function, not the Vocab class
tokenizer = get_tokenizer('basic_english')  
train_iter = IMDB(split='train')
test_iter = IMDB(split='test')
counter = Counter()
for (label, line) in train_iter:
    counter.update(tokenizer(line))
vocab = vocab(counter, min_freq = 1, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))
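
The object returned by the lowercase vocab() factory maps tokens to indices. A small usage sketch, assuming the code above has run (set_default_index and the callable token lookup are part of the torchtext 0.12 Vocab API):

# make out-of-vocabulary tokens fall back to '<unk>'
vocab.set_default_index(vocab['<unk>'])
# convert a list of tokens to a list of integer ids
print(vocab(['this', 'movie', 'was', 'great']))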

my environment:

  • python 3.9.12
  • torchtext 0.12.0
  • pytorch 1.11.0
Answered By: razzberry

You can try torchtext.legacy.vocab instead of torchtext.vocab, which might solve the issue (note that the legacy module only exists up to torchtext 0.11; it was removed in 0.12). This worked for me:

from torchtext.datasets import IMDB
from collections import Counter
from torchtext.data.utils import get_tokenizer
from torchtext.legacy.vocab import Vocab  # the legacy module exposes the Vocab class (capital V)
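
With the legacy class, the original constructor call from the question should then work unchanged. A sketch, assuming torchtext 0.11 or earlier and the counter built in the question:

# the legacy Vocab class accepts both min_freq and specials
vocab = Vocab(counter, min_freq=1, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))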
Answered By: Sawmya

Sorry, it doesn’t work for me. 🙁
Vocab is the correct name of the object, and vocab is not.

The simple solution I found is that the "specials" tuple was removed from the experimental Vocab and is no longer in use. That’s all.

https://github.com/pytorch/text/issues/890

my environment:

  • python 3.8.16
  • torchtext 0.15.1
  • pytorch 2.0.0
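
For completeness: on torchtext 0.12 and later, the specials can instead be passed to build_vocab_from_iterator, which accepts both min_freq and specials. A minimal sketch, assuming the tokenizer and IMDB iterator from the question:

from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')
train_iter = IMDB(split='train')

def yield_tokens(data_iter):
    # yield one token list per review; the label is ignored here
    for label, line in data_iter:
        yield tokenizer(line)

vocab = build_vocab_from_iterator(
    yield_tokens(train_iter),
    min_freq=1,
    specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'),
)
vocab.set_default_index(vocab['<unk>'])  # out-of-vocabulary tokens map to '<unk>'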