tokenizer.save_pretrained TypeError: Object of type property is not JSON serializable

Question:

I am trying to save the GPT2 tokenizer as follows:

import pandas as pd
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = GPT2Tokenizer.eos_token
dataset_file = "x.csv"
df = pd.read_csv(dataset_file, sep=",")
input_ids = tokenizer.batch_encode_plus(list(df["x"]), max_length=1024, padding='max_length', truncation=True)["input_ids"]

# saving the tokenizer
tokenizer.save_pretrained("tokenfile")

I am getting the following error:
TypeError: Object of type property is not JSON serializable

More details:

TypeError                                 Traceback (most recent call last)
Cell In[x], line 3
      1 # Save the fine-tuned model
----> 3 tokenizer.save_pretrained("tokenfile")

File /3tb/share/anaconda3/envs/ak_env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2130, in PreTrainedTokenizerBase.save_pretrained(self, save_directory, legacy_format, filename_prefix, push_to_hub, **kwargs)
   2128 write_dict = convert_added_tokens(self.special_tokens_map_extended, add_type_field=False)
   2129 with open(special_tokens_map_file, "w", encoding="utf-8") as f:
-> 2130     out_str = json.dumps(write_dict, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
   2131     f.write(out_str)
   2132 logger.info(f"Special tokens file saved in {special_tokens_map_file}")

File /3tb/share/anaconda3/envs/ak_env/lib/python3.10/json/__init__.py:238, in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    232 if cls is None:
    233     cls = JSONEncoder
    234 return cls(
    235     skipkeys=skipkeys, ensure_ascii=ensure_ascii,
    236     check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    237     separators=separators, default=default, sort_keys=sort_keys,
--> 238     **kw).encode(obj)

File /3tb/share/anaconda3/envs/ak_env/lib/python3.10/json/encoder.py:201, in JSONEncoder.encode(self, o)
    199 chunks = self.iterencode(o, _one_shot=True)
...
    178     """
--> 179     raise TypeError(f'Object of type {o.__class__.__name__} '
    180                     f'is not JSON serializable')

TypeError: Object of type property is not JSON serializable

How can I solve this issue?

Asked By: AKMalkadi


Answers:

The problem is in this line:

tokenizer.pad_token = GPT2Tokenizer.eos_token

Here eos_token is being read from the class instead of the instance. On the class, eos_token is a property object, not the token string, so that property object gets stored as the pad token, and save_pretrained fails when it tries to JSON-serialize it into special_tokens_map.json.
The fix is to read the token from the tokenizer instance instead:
tokenizer.pad_token = tokenizer.eos_token
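
A quick check makes the difference visible (a minimal sketch, assuming the standard "gpt2" checkpoint):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Accessed on the class, eos_token is still the property descriptor itself
print(type(GPT2Tokenizer.eos_token))   # <class 'property'>

# Accessed on the instance, the property resolves to the actual token string
print(type(tokenizer.eos_token))       # <class 'str'>
print(tokenizer.eos_token)             # <|endoftext|>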

For reference, your final code will look like this:

import pandas as pd
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
dataset_file = "x.csv"
df = pd.read_csv(dataset_file, sep=",")
input_ids = tokenizer.batch_encode_plus(list(df["x"]), max_length=1024, padding='max_length', truncation=True)["input_ids"]

# saving the tokenizer
tokenizer.save_pretrained("tokenfile")
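
As a sanity check (a minimal sketch, assuming the same "tokenfile" directory as above), you can reload the saved tokenizer and confirm the pad token comes back as a plain string:

# Reload from the directory written by save_pretrained and inspect the pad token
reloaded = GPT2Tokenizer.from_pretrained("tokenfile")
print(reloaded.pad_token)   # <|endoftext|>
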
Answered By: EvilReboot