ValueError: The state dictionary of the model you are trying to load is corrupted. Are you sure it was properly saved?

Question:

Goal: amend this notebook to work with the albert-base-v2 model.

Kernel: conda_pytorch_p36.

Section 1.2 instantiates a model from files in ./MRPC/ dir.

However, I think it is for a BERT model, not ALBERT. So, I downloaded an ALBERT config.json file from here. It is this change that causes the error.

What else do I need to do in order to instantiate an Albert model?


./MRPC/ dir:

!curl https://download.pytorch.org/tutorial/MRPC.zip --output MRPC.zip
!unzip -n MRPC.zip
from os import listdir
from os.path import isfile, join
mypath = './MRPC/'
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
onlyfiles
---

['tokenizer_config.json',
 'special_tokens_map.json',
 'pytorch_model.bin',
 'config.json',
 'training_args.bin',
 'added_tokens.json',
 'vocab.txt']
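One way to see why the files in ./MRPC/ won't load as ALBERT is to inspect the parameter names in pytorch_model.bin: Hugging Face checkpoints prefix parameter names with the architecture they were saved from. A minimal sketch (the helper name is mine, and the keys below are simulated; in the notebook you would load the real file with `torch.load("./MRPC/pytorch_model.bin", map_location="cpu")`):

```python
def guess_model_type(state_dict):
    """Guess the architecture from the prefix of the first parameter name,
    e.g. 'bert.embeddings.word_embeddings.weight' -> 'bert'."""
    first_key = next(iter(state_dict))
    return first_key.split(".")[0]

# Simulated keys, standing in for a BERT-based MRPC checkpoint:
sd = {
    "bert.embeddings.word_embeddings.weight": None,
    "classifier.weight": None,
}
print(guess_model_type(sd))  # prints "bert", not "albert"
```

If the prefix is "bert", swapping in an ALBERT config.json cannot fix it: the weights themselves belong to a different architecture.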

Configs:

# The output directory for the fine-tuned model, $OUT_DIR.
configs.output_dir = "./MRPC/"

# The data directory for the MRPC task in the GLUE benchmark, $GLUE_DIR/$TASK_NAME.
configs.data_dir = "./glue_data/MRPC"

# The model name or path for the pre-trained model.
configs.model_name_or_path = "albert-base-v2"
# The maximum length of an input sequence
configs.max_seq_length = 128

# Prepare GLUE task.
configs.task_name = "MRPC".lower()
configs.processor = processors[configs.task_name]()
configs.output_mode = output_modes[configs.task_name]
configs.label_list = configs.processor.get_labels()
configs.model_type = "albert".lower()
configs.do_lower_case = True

# Set the device, batch size, topology, and caching flags.
configs.device = "cpu"
configs.eval_batch_size = 1
configs.n_gpu = 0
configs.local_rank = -1
configs.overwrite_cache = False

Model:

model = AlbertForSequenceClassification.from_pretrained(configs.output_dir)  # !
model.to(configs.device)

Traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-0936fd8cbb17> in <module>
      1 # load model
----> 2 model = AlbertForSequenceClassification.from_pretrained(configs.output_dir)
      3 model.to(configs.device)
      4 
      5 # quantize model

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   1460                     pretrained_model_name_or_path,
   1461                     ignore_mismatched_sizes=ignore_mismatched_sizes,
-> 1462                     _fast_init=_fast_init,
   1463                 )
   1464 

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/modeling_utils.py in _load_state_dict_into_model(cls, model, state_dict, pretrained_model_name_or_path, ignore_mismatched_sizes, _fast_init)
   1601             if any(key in expected_keys_not_prefixed for key in loaded_keys):
   1602                 raise ValueError(
-> 1603                     "The state dictionary of the model you are trying to load is corrupted. Are you sure it was "
   1604                     "properly saved?"
   1605                 )

ValueError: The state dictionary of the model you are trying to load is corrupted. Are you sure it was properly saved?
Asked By: DanielBell99


Answers:

Exactly what I was looking for: textattack/albert-base-v2-MRPC, an ALBERT model already fine-tuned on the MRPC task.

How to use it from the transformers library:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("textattack/albert-base-v2-MRPC")

model = AutoModelForSequenceClassification.from_pretrained("textattack/albert-base-v2-MRPC")

Or just clone the model repo:

git lfs install
git clone https://huggingface.co/textattack/albert-base-v2-MRPC

# To clone without the large weight files (just their LFS pointers),
# prefix the clone command with this environment variable:
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/textattack/albert-base-v2-MRPC
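For completeness, a quick inference sketch with this model (the sentence pair is my own toy example; the weights are downloaded on first run):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("textattack/albert-base-v2-MRPC")
model = AutoModelForSequenceClassification.from_pretrained("textattack/albert-base-v2-MRPC")
model.eval()

# MRPC takes a pair of sentences; the task is binary:
# 0 = not a paraphrase, 1 = paraphrase
inputs = tokenizer(
    "The company posted strong quarterly results.",
    "Quarterly results for the company were strong.",
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1).item()
print(pred)
```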
Answered By: DanielBell99