How to fine-tune GPT-J using the Hugging Face Trainer

Question:

I’m attempting to fine-tune GPT-J using the Hugging Face Trainer and failing miserably. I followed the example that references BERT, but of course, the GPT-J model isn’t exactly like the BERT model.

The error indicates that the model isn’t producing a loss, which is great, except that I have no idea how to make it generate a loss or how to change what the trainer is expecting.

I’m using Transformers 4.22.2. I would like to get this working on a CPU before I try to do anything on Paperspace with a GPU. I did make an initial attempt there on a GPU, with slightly different code to use CUDA, and it produced the same error.

I suspect that my approach is entirely wrong. I found a very old example of fine-tuning GPT-J with 8-bit quantization, but even that repository says it is deprecated.

I’m unsure whether my mistake is in using the compute_metrics() function I found in the BERT example or something else. Maybe it is an issue with the labels I provide in the configuration, but I’ve tried different permutations. Any advice would be appreciated.

I understand what a loss function is, but I don’t know how it is supposed to be configured in this case.
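
For reference, my rough understanding is that a causal language model only returns a loss when a labels tensor is passed in along with input_ids (for next-token prediction, the labels can simply be a copy of the input IDs). The snippet below is just an illustration of that idea, using a small stand-in model (gpt2) rather than gpt-j; it is not my actual training code:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustration only: gpt2 stands in for gpt-j to keep the example small
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hello world", return_tensors="pt")
# Passing labels (here just a copy of input_ids) makes the model compute and return a loss
outputs = model(input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                labels=inputs["input_ids"])
print(outputs.loss)  # a scalar tensor; without labels, outputs.loss is None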

My Code:

from transformers import Trainer, TrainingArguments, AutoModelForCausalLM
from transformers import GPTJForCausalLM, AutoTokenizer
from datasets import load_dataset
import time
import torch
import os
import numpy as np
import evaluate
import sklearn

start = time.time()

GPTJ_FINE_TUNED_FILE = "./fine_tuned_models/gpt-j-6B"

print("Loading model")
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", low_cpu_mem_usage=True)
model.config.pad_token_id = model.config.eos_token_id

print("Loading tokenizer")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.pad_token = tokenizer.eos_token

print("Loading dataset")
current_dataset = load_dataset("wikitext", 'wikitext-103-v1')
current_dataset['train'] = current_dataset['train'].select(range(1200))


def tokenize_function(examples):
    current_tokenizer_result = tokenizer(examples["text"], padding="max_length", truncation=True)
    return current_tokenizer_result


print("Splitting and tokenizing dataset")
tokenized_datasets = current_dataset.map(tokenize_function, batched=True)
small_train_dataset = tokenized_datasets["train"].select(range(100))

print("Preparing training arguments")

training_args = TrainingArguments(output_dir=GPTJ_FINE_TUNED_FILE,
                                  report_to='all',
                                  logging_dir='./logs',
                                  per_device_train_batch_size=1,
                                  label_names=['input_ids', 'attention_mask'],  # 'logits', 'past_key_values'
                                  num_train_epochs=1,
                                  no_cuda=True
                                  )

metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset
)

print("Starting training")
trainer.train()
print(f"Finished fine-tuning in {time.time() - start}")

Which leads to the error and stacktrace:

  File "xxxft_v3.py", line 66, in <module>
  File "xxxvenvlibsite-packagestransformerstrainer.py", line 1521, in train
    return inner_training_loop(
  File "xxxvenvlibsite-packagestransformerstrainer.py", line 1763, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "xxxvenvlibsite-packagestransformerstrainer.py", line 2499, in training_step
    loss = self.compute_loss(model, inputs)
  File "xxxvenvlibsite-packagestransformerstrainer.py", line 2544, in compute_loss
    raise ValueError(
ValueError: The model did not return a loss from the inputs, only the following keys: logits,past_key_values. For reference, the inputs it received are input_ids,attention_mask.
Asked By: Erik Hyrkas


Answers:

I found what appears to work, though now I’m running low on memory and working through ways of handling it.

The data_collator parameter seems to take care of the exact issue that I was having.

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
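
For anyone else who hits the same error, here is a minimal, self-contained sketch of the setup that ended up working for me (same model and dataset names as in my question; evaluation and metrics are omitted to keep it short). The key point is that DataCollatorForLanguageModeling with mlm=False copies input_ids into labels, which is what gives the Trainer a loss to optimize. Treat this as a sketch rather than a polished recipe:

from datasets import load_dataset
from transformers import (GPTJForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", low_cpu_mem_usage=True)
model.config.pad_token_id = model.config.eos_token_id

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.pad_token = tokenizer.eos_token

dataset = load_dataset("wikitext", "wikitext-103-v1")
dataset["train"] = dataset["train"].select(range(1200))  # small slice for a CPU smoke test


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
small_train_dataset = tokenized["train"].select(range(100))

# mlm=False means causal language modeling: the collator adds a "labels" key
# (a copy of input_ids with padding positions set to -100), so the model
# computes and returns a loss that the Trainer can use.
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./fine_tuned_models/gpt-j-6B",
    per_device_train_batch_size=1,
    num_train_epochs=1,
    no_cuda=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    data_collator=data_collator,
)

trainer.train()
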
Answered By: Erik Hyrkas

Thank you for sharing this, as I am also looking for a reference for fine-tuning GPT-J. A couple of questions about what you have shared:

  1. Could you please share the format of the data you prepared and passed to the GPT-J model? In this case,

    current_dataset = load_dataset("wikitext", 'wikitext-103-v1')
    current_dataset['train'] = current_dataset['train'].select(range(1200))

What is "wikitext" and ‘wikitext-103-v1’ here and how does ‘current_dataset[‘train’]’ looks like. If you could please share it.

  2. Also, have you by any chance tried to run GPT-J on Google Colab (free, Pro, or Pro+)? If yes, were you able to run it successfully, and have you tried fine-tuning GPT-J on Colab as well?

Thank you for reading my comment.

Answered By: Archaeolexicologist