How to interpret the model_max_len attribute of the PreTrainedTokenizer object in Huggingface Transformers

Question:

I’ve been trying to check the maximum length allowed by emilyalsentzer/Bio_ClinicalBERT, and after these lines of code:

from transformers import AutoTokenizer

model_name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer

I’ve obtained the following:

PreTrainedTokenizerFast(name_or_path='emilyalsentzer/Bio_ClinicalBERT', vocab_size=28996, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

Is that true? Is the max length of the model (in number of tokens, as it says here) really that high? If so, how am I supposed to interpret that?

Cheers!

Asked By: ignacioct


Answers:

This issue thread addresses a similar question.
According to it, the value appears because the maximum length was not specified in the tokenizer config file (tokenizer_config.json), so the tokenizer falls back to a very large default.
According to the same thread, one solution is to add the correct value to that config file.
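
If you would rather not edit the config file, you can also override the value when loading the tokenizer. A minimal sketch, assuming the underlying BERT checkpoint has the usual 512 position embeddings (check the model card or config to confirm):

from transformers import AutoTokenizer

model_name = "emilyalsentzer/Bio_ClinicalBERT"

# This checkpoint's tokenizer_config.json does not set model_max_length,
# so pass it explicitly; 512 is assumed here as the usual BERT limit.
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=512)

print(tokenizer.model_max_length)  # 512

With this override, calls such as tokenizer(text, truncation=True) will cut inputs at 512 tokens instead of effectively never truncating.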

The docs also say the following:

If no value is provided, will default to VERY_LARGE_INTEGER (int(1e30))
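
So the huge number you see is just that sentinel default, not a real limit. One way to recover the model's actual limit is to fall back to the model config's position-embedding size. A sketch, assuming a BERT-style config that exposes max_position_embeddings:

from transformers import AutoConfig, AutoTokenizer

model_name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)

# The attribute is model_max_length; it equals int(1e30)
# (VERY_LARGE_INTEGER) when no limit was specified in the config.
if tokenizer.model_max_length >= int(1e30):
    # Fall back to the model's position-embedding size
    # (512 for standard BERT checkpoints).
    tokenizer.model_max_length = config.max_position_embeddings

print(tokenizer.model_max_length)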

You can find other similar issues reporting the same behaviour.

Answered By: cmgchess