Losing all training gains when switching to another PC

Question:

I am losing all of my training gains when moving to another PC and I can’t figure out why. The script should be saving the model after each chunk, and it appears to do so: when I restart it on the same PC, the loss is still the same (low). But when I move it to another PC with the same data, I lose everything I had gained; the predictions don’t work nearly as well, and training starts from a high loss again. When I move to a new PC, I’m just copying the ./gpt2_folder, is that right?

import os
import math
import numpy as np
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
import random


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

batch_size = 32  # Batch size for training.
epochs = 1  # Number of epochs to train for.
chunk_size = 250000  # Number of lines per training chunk.
data_path = "data.txt"

start_token = "<start>"
end_token = "<end>"

# Read the number of lines in the data file
with open(data_path, "r", encoding="utf-8") as f:
    num_lines = sum(1 for line in f)

# Calculate the number of chunks needed
num_chunks = math.ceil(num_lines / chunk_size)

# Use GPT-2's tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("./gpt2_model")

# Add the start and end tokens to the tokenizer
tokenizer.add_tokens([start_token, end_token])

# Load GPT-2's pre-trained model
model = GPT2LMHeadModel.from_pretrained("./gpt2_model")

# Resize the model's token embeddings to include the new tokens
model.resize_token_embeddings(len(tokenizer))

for chunk in range(num_chunks):
    # Clear previous data
    input_texts = []
    filename = ""
    # ...

    with open(data_path, "r", encoding="utf-8") as f:
        # Skip lines that have already been read
        for _ in range(chunk * chunk_size):
            next(f)

        # Read the lines for this chunk
        for _ in range(chunk_size):
            line = f.readline().strip()
            if not line:
                break

            input_text, target_text = line.split(":", 1)  # split on the first colon only
            input_texts.append(start_token + input_text + ":" + target_text + end_token)
    filename = "processed_data.txt"
    # Save the processed data into a new file
    with open(str(chunk) + filename, "w", encoding="utf-8") as f:
        for text in input_texts:
            f.write(text + "\n")
    print("file created:" + str(chunk) + filename)
    # Create the dataset using the tokenizer
    dataset = TextDataset(tokenizer=tokenizer, file_path=str(chunk) + filename, block_size=128)

    # Create a data collator for language modeling
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    # Define training arguments
    training_args = TrainingArguments(
        output_dir="./gpt2_model",
        overwrite_output_dir=True,
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        save_steps=10_000,
        save_total_limit=2,
        # Add these arguments to enable multi-GPU training
    )

    # Create a Trainer instance with the model, dataset, and training arguments
    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=dataset,
    )

    # Train the model on this chunk
    trainer.train()
    model.save_pretrained("./gpt2_model")
    tokenizer.save_pretrained("./gpt2_model")

Answers:

The output directory is not an absolute path but a relative one (./gpt2_model), so it resolves against whatever directory the script is launched from.
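
As a quick sanity check (a minimal sketch; the path is the same relative one the training script uses), you can print where that relative path actually resolves on each machine:

import os

# Run this on both PCs from the directory you launch training from;
# if the printed paths differ, you are saving and loading in different places.
print("working directory:", os.getcwd())
print("model directory resolves to:", os.path.abspath("./gpt2_model"))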

Perhaps the installation is slightly different on the new machine and you are looking in the wrong place.

Alternatively, you may have different software (e.g. different dependency versions) causing trouble.
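
A quick way to compare the two environments (a minimal sketch; it only assumes both machines have the same packages importable) is to print the versions that matter for this script on each PC:

import torch
import transformers

# Compare this output between the two machines; a mismatch in either
# library, or in CUDA availability, can explain behavioural differences.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())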

I don’t believe anything is kept in memory, so as long as you transfer everything on disk and refer to it correctly, the code should not notice that it is running on a different computer.
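
To rule out a broken or partial copy, you could checksum the transferred directory on both machines (a minimal sketch; the directory name is the one from the question, and md5 is just a convenient choice here):

import hashlib
import os

def dir_checksums(path):
    # Hash every file under the directory so the two copies can be diffed.
    sums = {}
    for root, _, files in os.walk(path):
        for name in sorted(files):
            full = os.path.join(root, name)
            h = hashlib.md5()
            with open(full, "rb") as f:
                for block in iter(lambda: f.read(1 << 20), b""):
                    h.update(block)
            sums[os.path.relpath(full, path)] = h.hexdigest()
    return sums

for name, digest in sorted(dir_checksums("./gpt2_model").items()):
    print(digest, name)

If the two machines print identical digests, the model weights really did arrive intact and the problem lies elsewhere (e.g. in the environment).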

The easy way to exclude the hypothetical root cause that this specific code implicitly writes to a second location is a simple test: have a script write something to the path you use here, migrate that data, and try to read it back on the new computer.
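
For example (a minimal sketch; marker.txt is a made-up filename used only for this test):

import os

# On the old PC: drop a marker file into the directory you are copying.
with open(os.path.join("./gpt2_model", "marker.txt"), "w") as f:
    f.write("written on the old PC")

# On the new PC, after migrating: read it back from the same relative path.
with open(os.path.join("./gpt2_model", "marker.txt")) as f:
    print(f.read())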

Answered By: Dennis Jaheruddin