Unable to build vocab for a torchtext text classification

Question:

I’m trying to prepare a custom dataset, loaded from a csv file, for use in a torchtext binary text classification problem. It’s a basic dataset with news headlines and a market sentiment label of "positive" or "negative". I’ve been following some online PyTorch tutorials to get this far, but the latest torchtext release introduced significant API changes, so most of that material is out of date.

Below I’ve successfully parsed my csv file into a pandas dataframe with two columns (the headline text, and a label that is either 0 or 1 for positive/negative), split it into training and test sets, and wrapped each in a PyTorch Dataset class:

from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset

train, test = train_test_split(eurusd_df, test_size=0.2)

class CustomTextDataset(Dataset):
    def __init__(self, text, labels):
        self.text = text
        self.labels = labels

    def __getitem__(self, idx):
        label = self.labels.iloc[idx]
        text = self.text.iloc[idx]
        sample = {"Label": label, "Text": text}
        return sample

    def __len__(self):
        return len(self.labels)

train_dataset = CustomTextDataset(train['Text'], train['Labels'])
test_dataset = CustomTextDataset(test['Text'], test['Labels'])
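
For reference, indexing the wrapped dataset returns each sample as a dict:

print(train_dataset[0])
# e.g. {'Label': 0, 'Text': 'some headline text'}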

I’m now trying to build a vocabulary of tokens following this tutorial https://coderzcolumn.com/tutorials/artificial-intelligence/pytorch-simple-guide-to-text-classification and the official PyTorch tutorial https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html.

However, the code below

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')
train_iter = train_dataset

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)
        
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

yields a very small vocabulary, and applying the example vocab(['here', 'is', 'an', 'example']) to a text field taken from the original dataframe returns a list of 0s, which suggests the vocab is being built from the label field (containing only 0s and 1s) rather than the text field. Could anyone review this and show me how to build the vocab from the text field?

Asked By: suiprocs1


Answers:

The very small vocabulary is because, under the hood, build_vocab_from_iterator uses a Counter from the collections standard library, and more specifically its update method. That method is called in a way that assumes what you pass to build_vocab_from_iterator is an iterable of iterables of words/tokens.

This means that in its current state, because strings are themselves iterable, your code creates a vocab that encodes the individual letters making up your dataset, not the words, hence the very small vocab size.
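
To see the difference, compare how Counter.update treats a bare string versus a list of tokens (a quick illustration, independent of torchtext):

from collections import Counter

c = Counter()
c.update("hello world")        # a string is an iterable of characters
print(c)                       # counts individual letters: {'l': 3, 'o': 2, ...}

c = Counter()
c.update(["hello", "world"])   # a list of tokens
print(c)                       # counts whole words: {'hello': 1, 'world': 1}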

I do not know if that is intended by the Python/PyTorch devs, but because of this you need to wrap your iterator in a list, for example like this:

vocab = build_vocab_from_iterator([yield_tokens(train_iter)], specials=["<unk>"])

Note: if your vocab returns only zeros, it is not because it is drawing from the label field; it is simply returning the integer corresponding to an unknown token, since every word longer than a single character is unknown to it.
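
For example, with a toy vocab built from two hypothetical tokens, any out-of-vocabulary word maps back to the <unk> index:

from torchtext.vocab import build_vocab_from_iterator

toy_vocab = build_vocab_from_iterator([["hello", "world"]], specials=["<unk>"])
toy_vocab.set_default_index(toy_vocab["<unk>"])
print(toy_vocab(["hello", "example"]))  # e.g. [1, 0]: "example" is unknown, so it falls back to <unk> at index 0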

Hope this helps!

Answered By: Callim Ethée

It turned out the issue was with the __getitem__ method in my CustomTextDataset class: it was returning a dict, which first caused problems when building the vocab, and then raised a TypeError once the iterator was wrapped in a list.
Thank you Callim Ethée for your answer, it definitely pointed me in the right direction!
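
For anyone hitting the same issue, here is a minimal sketch of a token generator that unpacks the dict returned by the __getitem__ above (reusing the names from the question):

def yield_tokens(data_iter):
    # each sample is a dict like {"Label": 0, "Text": "..."},
    # so take the text by key instead of tuple-unpacking the sample
    for sample in data_iter:
        yield tokenizer(sample["Text"])

vocab = build_vocab_from_iterator(yield_tokens(train_dataset), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])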

Answered By: suiprocs1