Train T5/BART to convert a string into multiple strings

Question:

Is it possible to train a seq2seq model like T5 or BART to convert a string into a list of strings? On my first attempt, the tokenizer complained that my 2D list of labels isn’t the correct data type:

File "/home/matt/miniconda3/envs/nlp/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 429, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

I suppose I could concatenate the multiple strings in each of my training examples, but then I’d have to use a potentially error-prone splitter to split them up again. Maybe using a special character as a delimiter is the answer here?

It’s not super relevant, but here’s how I’m invoking the tokenizer. Also, I’m using a subclass of torch.utils.data.Dataset:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(args.model_name)
encodings = tokenizer(texts, truncation=True, padding=True, return_tensors='pt')
decodings = tokenizer(labels, truncation=True, padding=True, return_tensors='pt')
dataset_tokenized = Dataset(encodings, decodings)

What is relevant is that my texts variable is a list of strings, and my labels variable is a 2D list of strings (one list of target strings per example), which the tokenizer doesn’t accept.

Asked By: mph


Answers:

Using a special delimiter worked great! I chose the pipe character.

pairs = [(source, ' | '.join(target)) for source, target in pairs]
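
For anyone who also needs to get the list back out of the model's output, here is a minimal sketch of the round trip. The helper names (join_targets, split_prediction) and the sample data are made up for illustration and are not part of the original code. It assumes the pipe character never appears inside a real target string, which is why the join step checks for it.

```python
DELIMITER = " | "  # assumed separator; pick a character that never occurs in your targets


def join_targets(targets):
    """Flatten one example's list of target strings into a single label string."""
    for target in targets:
        if "|" in target:
            # A pipe inside a target would make the later split ambiguous.
            raise ValueError(f"target contains the delimiter character: {target!r}")
    return DELIMITER.join(targets)


def split_prediction(text):
    """Recover the list of strings from decoded model output."""
    # Split on the bare pipe and strip whitespace, because generated text
    # may not reproduce the spaces around the delimiter exactly.
    # Empty pieces (e.g. from a trailing pipe) are dropped.
    return [part.strip() for part in text.split("|") if part.strip()]


# Hypothetical sample data: one source string mapped to two target strings.
pairs = [("some input", ["first output", "second output"])]
flat_pairs = [(source, join_targets(target)) for source, target in pairs]
```

After this step the labels are a flat list of strings, the same shape as texts, so they can go through the tokenizer the same way. Splitting on the bare pipe rather than on ' | ' is a deliberate choice here, since a model may drop or add whitespace around the delimiter when generating.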
Answered By: mph