Can't train model from checkpoint on Google Colab as session expires

Question:

I’m using Google Colab to fine-tune a pre-trained model.

I successfully preprocessed a dataset and created an instance of the Seq2SeqTrainer class:

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

The problem is resuming training from the last checkpoint after the session has ended.

If I run trainer.train(), it runs correctly. As it takes a long time, I sometimes come back to the Colab tab after a few hours, and I know that if the session has crashed I can continue training from the last checkpoint like this: trainer.train("checkpoint-5500")

However, if I come back too late, the checkpoint data no longer exists on Google Colab, so even though I know how far the training got, I have to start all over again.

Is there any way to solve this problem, e.g. by extending the session?

Asked By: Seungjun


Answers:

To fix your problem, save your checkpoints (such as checkpoint-5500) to a full, fixed path that survives the session, for example a folder on your Google Drive.

Using your trainer you can set the output directory as your Google Drive path when creating an instance of the Seq2SeqTrainingArguments.
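As a rough sketch of that setup, the snippet below builds a Drive-backed output directory and passes it to Seq2SeqTrainingArguments. The mount point /content/drive and the "MyDrive" folder are assumed to be what Colab uses after mounting; the helper names, the "checkpoints" subfolder, and the save_steps value are made up for illustration and should be adapted.

```python
# Assumption: Drive is mounted at /content/drive, so files live under "MyDrive".
DRIVE_ROOT = "/content/drive/MyDrive"


def checkpoint_dir(run_name, drive_root=DRIVE_ROOT):
    """Return a Drive-backed folder for one training run (hypothetical layout)."""
    return f"{drive_root}/checkpoints/{run_name}"


def build_training_args(run_name):
    """Create training arguments whose checkpoints are written to Drive."""
    # Imported here so the path helper above works without transformers installed.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(
        output_dir=checkpoint_dir(run_name),
        save_strategy="steps",   # write a checkpoint every `save_steps` steps
        save_steps=500,          # illustrative value, not taken from the question
        save_total_limit=2,      # keep only the newest two checkpoints on Drive
    )
```

You would then call args = build_training_args("my-run") after mounting Drive and pass args to the Seq2SeqTrainer as before. Limiting the number of kept checkpoints is optional, but it helps avoid filling up Drive storage.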

When you come back to your code, if the session is indeed over, you’ll just need to load checkpoint-5500 from your Google Drive instead of retraining everything.

Add the following code:

from google.colab import drive
drive.mount('/content/drive')

Then, after your trainer.train("checkpoint-5500") call has finished (or as its last step), save your checkpoint to your Google Drive.
Or, if you prefer, you can add a callback to your trainer (via its callbacks argument) that saves an updated checkpoint after every epoch, so that if the session crashes before training finishes you’ll still have some progress saved.
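When you return in a new session, you may not remember the exact checkpoint number. The sketch below looks for the newest checkpoint-<step> folder in the output directory; the helper name latest_checkpoint is hypothetical, and it assumes checkpoints use the "checkpoint-<step>" folder naming shown in the question.

```python
import os
import re

# Matches folder names such as "checkpoint-5500" and captures the step number.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the path of the highest-step checkpoint folder, or None if absent."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        # Only directories count; stray files with a matching name are skipped.
        if match and os.path.isdir(path):
            step = int(match.group(1))
            if step > best_step:
                best_step, best_path = step, path
    return best_path


# Usage inside Colab, after mounting Drive and rebuilding `trainer`:
# resume = latest_checkpoint(args.output_dir)
# trainer.train(resume_from_checkpoint=resume)  # None starts from scratch
```

The transformers library also ships its own helper for this (get_last_checkpoint in transformers.trainer_utils), and trainer.train(resume_from_checkpoint=True) resumes from the last checkpoint in the output directory, so the hand-written version is mainly useful to see what is going on.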