huggingface-tokenizers

Question about data_collator throwing a key error in Hugging Face

Question about data_collator throwing a key error in Hugging Face Question: I am trying to use the data_collator function in Hugging Face with this code: datasets = dataset.train_test_split(test_size=0.1) train_dataset = datasets["train"] val_dataset = datasets["test"] print(type(train_dataset)) def data_collator(data): # Initialize lists to store pixel values and input ids pixel_values_list = [] input_ids_list = [] # Iterate over …
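A minimal sketch of a collator for this setup, assuming each dataset row really does carry "pixel_values" and "input_ids" keys (a KeyError here usually means the rows expose different key names); the stacking and the "labels" field are assumptions, not taken from the question:

```python
import torch

def data_collator(features):
    # Assumes every example is a dict with "pixel_values" and "input_ids";
    # a KeyError means the dataset rows do not contain these exact keys.
    pixel_values = torch.stack([torch.as_tensor(f["pixel_values"]) for f in features])
    input_ids = torch.tensor([f["input_ids"] for f in features], dtype=torch.long)
    return {"pixel_values": pixel_values, "labels": input_ids}
```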

Total answers: 1

How to interpret the model_max_len attribute of the PreTrainedTokenizer object in Huggingface Transformers

How to interpret the model_max_len attribute of the PreTrainedTokenizer object in Huggingface Transformers Question: I've been trying to check the maximum length allowed by emilyalsentzer/Bio_ClinicalBERT, and after these lines of code: model_name = "emilyalsentzer/Bio_ClinicalBERT" tokenizer = AutoTokenizer.from_pretrained(model_name) tokenizer I've obtained the following: PreTrainedTokenizerFast(name_or_path='emilyalsentzer/Bio_ClinicalBERT', vocab_size=28996, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': …
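The enormous number is the library's placeholder for "no maximum recorded in the tokenizer config". A short check like the one below is one way to work with it; the 512 cap is assumed from BERT-style position embeddings and should be verified against the checkpoint's config:

```python
from transformers import AutoTokenizer, AutoConfig

model_name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)

# The practical ceiling for BERT-style models comes from the position embeddings.
print(config.max_position_embeddings)              # typically 512 for BERT variants
tokenizer.model_max_length = config.max_position_embeddings

enc = tokenizer("a clinical note ...", truncation=True)
print(len(enc["input_ids"]))
```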

Total answers: 1

tokenizer.push_to_hub(repo_name) is not working

tokenizer.push_to_hub(repo_name) is not working Question: I'm trying to push my tokenizer to my Hugging Face repo… it consists of the model's vocab.json (I'm making a speech recognition model). My code: vocab_dict["|"] = vocab_dict[" "] del vocab_dict[" "] vocab_dict["[UNK]"] = len(vocab_dict) vocab_dict["[PAD]"] = len(vocab_dict) len(vocab_dict) import json with open('vocab.json', 'w') as vocab_file: json.dump(vocab_dict, vocab_file) from transformers import …
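A sketch of the usual flow for a CTC speech tokenizer, assuming a vocab_dict like the one built in the question; the vocabulary contents and repo name below are placeholders, and push_to_hub also needs an authenticated session (e.g. a prior huggingface-cli login):

```python
import json
from transformers import Wav2Vec2CTCTokenizer

# Placeholder vocabulary; in the question vocab_dict is built from the dataset.
vocab_dict = {"a": 0, "b": 1, "|": 2, "[UNK]": 3, "[PAD]": 4}

with open("vocab.json", "w") as vocab_file:
    json.dump(vocab_dict, vocab_file)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)

# Requires a logged-in session or token; the repo name is hypothetical.
tokenizer.push_to_hub("your-username/wav2vec2-demo")
```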

Total answers: 3

How to split input text into equal size of tokens, not character length, and then concatenate the summarization results for Hugging Face transformers

How to split input text into equal size of tokens, not character length, and then concatenate the summarization results for Hugging Face transformers Question: I am using the methodology below to summarize texts longer than the 1024-token limit. The current method splits the text in half. I took this from another user's post and modified …
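One way to chunk on token boundaries rather than characters is to encode once, slice the id list, and decode each slice before summarizing; the model name and chunk size below are assumptions, not taken from the question:

```python
from transformers import AutoTokenizer, pipeline

model_name = "facebook/bart-large-cnn"  # assumed; substitute the model from the question
tokenizer = AutoTokenizer.from_pretrained(model_name)
summarizer = pipeline("summarization", model=model_name, tokenizer=tokenizer)

def summarize_long(text, max_tokens=900):
    # Split by token count, not characters, so every chunk fits the model limit.
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [token_ids[i:i + max_tokens] for i in range(0, len(token_ids), max_tokens)]
    partials = []
    for ids in chunks:
        chunk_text = tokenizer.decode(ids, skip_special_tokens=True)
        partials.append(summarizer(chunk_text, truncation=True)[0]["summary_text"])
    # Concatenate the per-chunk summaries into one result.
    return " ".join(partials)
```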

Total answers: 1

Issue when importing BloomTokenizer from transformers in Python

Issue when importing BloomTokenizer from transformers in Python Question: I am trying to import BloomTokenizer from transformers: from transformers import BloomTokenizer and I receive the following error: Traceback (most recent call last): File "<stdin>", line 1, in <module> ImportError: cannot import name 'BloomTokenizer' from 'transformers' (/root/miniforge3/envs/pytorch/lib/python3.8/site-packages/transformers/__init__.py) my version of transformers: transformers 4.20.1. What could I …
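The package exposes a fast tokenizer class for BLOOM rather than a slow BloomTokenizer, and AutoTokenizer resolves it as well; both routes below assume a transformers version that already ships BLOOM support, and the checkpoint name is only an example:

```python
from transformers import AutoTokenizer, BloomTokenizerFast

# There is no slow `BloomTokenizer` class to import; use the fast one ...
tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")

# ... or let AutoTokenizer pick the right class for the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
```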

Total answers: 1

Hugging Face – Efficient tokenization of unknown token in GPT2

Hugging Face – Efficient tokenization of unknown token in GPT2 Question: I am trying to train a dialog system using GPT2. For tokenization, I am using the following configuration for adding the special tokens. from transformers import ( AdamW, AutoConfig, AutoTokenizer, PreTrainedModel, PreTrainedTokenizer, get_linear_schedule_with_warmup, ) SPECIAL_TOKENS = { "bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>", "pad_token": "[PAD]", "additional_special_tokens": …
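A minimal sketch of registering such tokens with GPT-2 so newly added strings (e.g. [PAD]) receive their own ids instead of being split or treated as unknown; everything beyond the tokens visible in the excerpt is assumed:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

SPECIAL_TOKENS = {
    "bos_token": "<|endoftext|>",
    "eos_token": "<|endoftext|>",
    "pad_token": "[PAD]",
    # "additional_special_tokens": [...]  # truncated in the question
}

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# add_special_tokens registers new ids; the embedding matrix must grow to match.
num_added = tokenizer.add_special_tokens(SPECIAL_TOKENS)
model.resize_token_embeddings(len(tokenizer))
print(num_added, tokenizer.pad_token_id)
```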

Total answers: 2

Huggingface MarianMT translators lose content, depending on the model

Huggingface MarianMT translators lose content, depending on the model Question: Context: I am using MarianMT from Huggingface via Python in order to translate text from a source to a target language. Expected behaviour: I enter a sequence into the MarianMT model and get this sequence translated back. For this, I use a corresponding language model …
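A baseline call for reproducing this kind of issue, with a German-to-English checkpoint chosen purely as an example (substitute the language pair and input from the question):

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-de-en"  # assumed language pair
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translate one sequence and decode the generated ids back to text.
batch = tokenizer(["Das ist ein Beispielsatz."], return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```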

Total answers: 1

How do the max_length, padding and truncation arguments work in HuggingFace's BertTokenizerFast.from_pretrained('bert-base-uncased')?

How do the max_length, padding and truncation arguments work in HuggingFace's BertTokenizerFast.from_pretrained('bert-base-uncased')? Question: I am working on a text classification problem where I want to use the BERT model as the base followed by dense layers. I want to know how the 3 arguments work. For example, if I have 3 sentences as: 'My name is …
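An illustration of how the three arguments interact, with placeholder sentences (not the ones from the question) and an arbitrary max_length of 10 chosen only to make the padding and truncation visible:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Placeholder sentences, not the question's examples.
sentences = ["a short sentence", "another slightly longer example sentence", "one more"]

enc = tokenizer(
    sentences,
    max_length=10,         # upper bound on the tokenized length
    padding="max_length",  # pad every sequence up to max_length
    truncation=True,       # cut anything longer than max_length
)
for ids in enc["input_ids"]:
    print(len(ids), ids)   # every row comes out exactly 10 ids long
```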

Total answers: 1

How to disable TOKENIZERS_PARALLELISM=(true | false) warning?

How to disable TOKENIZERS_PARALLELISM=(true | false) warning? Question: I use PyTorch to train a huggingface-transformers model, but every epoch it outputs the warning: The current process just got forked. Disabling parallelism to avoid deadlocks… To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false) How can I disable this warning? Asked By: snowzjy Answers: I …
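The usual fix is to set the variable before the tokenizers library spawns its worker threads, i.e. at the very top of the training script (or in the shell environment):

```python
import os

# Must run before any tokenizer is used and before DataLoader workers are forked.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```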

Total answers: 4