Problem tokenizing with HuggingFace's library when fine-tuning BLOOM

Question:

I have a problem with my tokenizer function. To be honest I am quite lost, since I do not really understand what's happening inside the transformers library. Here is what I wanted to do:

I would like to fine-tune the BLOOM model into a conversational bot. When tokenizing, I don't really understand what is happening and therefore how the data is supposed to look. All examples I find online use plain text, but none of them cover training on a conversation dataset.

In HuggingFace's example they simply put ['text'] at the end of their tokenizer function. Since I don't have the feature text but ['dialog'], I thought replacing it there would work. But apparently, it does not.
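For reference, the pattern I tried to adapt from the HuggingFace language-modeling example looks roughly like this (their dataset has a text column, mine only has dialog):

def tokenize_function(examples):
    # the HF example tokenizes a plain "text" column
    return tokenizer(examples["text"])

tokenized = dataset.map(tokenize_function, batched=True)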

I would really appreciate it if someone could say a few words about what exactly went wrong in my code and how to fix it. Since I want to train various models over the next few months, an explanation of the mistake would help a lot in the future.

Here is my code, and below it the exact error as well as my notebook:

import torch
import random
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
import datasets

# Load the model and tokenizer
model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load the dataset
dataset = datasets.load_dataset('conv_ai_2')

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["dialog"])

tokenized_dataset = dataset.map(tokenize_function, batched=True, num_proc=4)

# Split into training and validation sets
train_dataset = tokenized_dataset['train']
val_dataset = tokenized_dataset['valid']

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    logging_steps=500,
    save_steps=500,
    seed=42,
    learning_rate=5e-5,
    report_to="none"
)

# Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Fine-tune the model
trainer.train()

# Generate a response
def generate_response(input_text, model, tokenizer):
    input_ids = tokenizer.encode(input_text, return_tensors='pt')
    chat_history_ids = model.generate(
        input_ids=input_ids,
        max_length=1000,
        do_sample=True,
        top_p=0.9,
        top_k=50
    )
    return tokenizer.decode(chat_history_ids[0], skip_special_tokens=True)

# Test the conversational bot
while True:
    user_input = input("You: ")
    response = generate_response(user_input, model, tokenizer)
    print("Bot: " + response)

Error:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.9/dist-packages/datasets/utils/py_utils.py", line 1349, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/usr/local/lib/python3.9/dist-packages/datasets/arrow_dataset.py", line 3329, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/usr/local/lib/python3.9/dist-packages/datasets/arrow_dataset.py", line 3210, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "<ipython-input-18-25d239b4d59f>", line 17, in tokenize_function
    return tokenizer(examples["dialog"])
  File "/usr/local/lib/python3.9/dist-packages/datasets/formatting/formatting.py", line 280, in __getitem__
    value = self.data[key]
KeyError: 'dialog'
"""

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-18-25d239b4d59f> in <module>
     17     return tokenizer(examples["dialog"])
     18 
---> 19 tokenized_dataset = dataset.map(tokenize_function, batched=True, num_proc=4)
     20 
     21 # Aufteilen in Trainings- und Validierungsset

/usr/local/lib/python3.9/dist-packages/datasets/formatting/formatting.py in __getitem__()
    278 
    279     def __getitem__(self, key):
--> 280         value = self.data[key]
    281         if key in self.keys_to_format:
    282             value = self.format(key)

KeyError: 'dialog'
Asked By: Max


Answers:

In the original tokenize_function, you were directly tokenizing the "dialog" key from the examples. However, this did not ensure that the dimensions of the input and label tensors were consistent, and that mismatch caused the error you encountered during training. I converted each dialog entry into a single string by joining the "text" values of its turns. Then I tokenized the dialog strings with truncation, padding, and a fixed maximum length, which produces tokenized input tensors with consistent dimensions. Finally, I built the labels by shifting the input_ids one position to the left, so the model learns to predict the next token in the sequence, and I cloned the shifted tensor to avoid modifying the original in place.

def tokenize_function(examples):
    # Join each dialog (a list of {"text": ...} turns) into one string per example
    dialog_texts = [' '.join(turn["text"] for turn in dialog) for dialog in examples["dialog"]]
    # Tokenize with truncation and padding so every example has the same length
    tokenized = tokenizer(dialog_texts, truncation=True, padding='max_length', max_length=128, return_tensors="pt")
    # Labels are the input_ids shifted left by one position (next-token prediction),
    # padded at the end so input_ids and labels keep the same shape
    shifted = tokenized["input_ids"][:, 1:].clone()
    pad_column = torch.full((shifted.size(0), 1), tokenizer.pad_token_id, dtype=torch.long)
    tokenized["labels"] = torch.cat([shifted, pad_column], dim=1)

    return tokenized
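
A minimal sketch of applying it, assuming the splits really do expose a dialog column (you can verify with dataset["train"].column_names):

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["train"].column_names,  # keep only the tokenized tensors for the Trainer
)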
Answered By: Abdulmajeed
