nlp

How to prevent the transformer generate function from producing certain words?

How to prevent the transformer generate function from producing certain words? Question: I have the following code: from transformers import T5Tokenizer, T5ForConditionalGeneration tokenizer = T5Tokenizer.from_pretrained("t5-small") model = T5ForConditionalGeneration.from_pretrained("t5-small") input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids sequence_ids = model.generate(input_ids) sequences = tokenizer.batch_decode(sequence_ids) sequences Currently it produces this: ['<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>'] Is there a …

Total answers: 1
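For orientation, transformers exposes a bad_words_ids argument on generate() that blocks given token sequences. A minimal sketch, assuming the words to ban ("park", "offers") are placeholders rather than anything from the answer:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park",
                      return_tensors="pt").input_ids

# Encode the banned words without special tokens; generate() will then
# avoid emitting these token sequences.
bad_words = ["park", "offers"]  # hypothetical words to block
bad_words_ids = tokenizer(bad_words, add_special_tokens=False).input_ids

sequence_ids = model.generate(input_ids, bad_words_ids=bad_words_ids)
print(tokenizer.batch_decode(sequence_ids))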

Count words in a sentence controlling for negations

Count words in a sentence controlling for negations Question: I am trying to count the number of times some words occur in a sentence while controlling for negations. In the example below, I write some very basic code to count the number of times "w" appears in "txt". Yet, I fail to control for …

Total answers: 1
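As a rough illustration of the idea, a sketch that counts target words while skipping any occurrence immediately preceded by a negation cue (the word lists and whitespace tokenization are assumptions, not the posted solution):

# Count target words, ignoring matches directly preceded by a negation.
negations = {"not", "no", "never"}        # assumed negation cues
targets = {"good", "happy"}               # hypothetical words to count

def count_with_negation(txt):
    tokens = txt.lower().split()
    count = 0
    for i, tok in enumerate(tokens):
        if tok in targets:
            if i > 0 and tokens[i - 1] in negations:
                continue  # negated occurrence, do not count it
            count += 1
    return count

print(count_with_negation("I am not happy but the food was good"))  # -> 1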

How to collect similar strings from a column into a list in a new column?

How to collect similar strings from a column into a list in a new column? Question: I've been trying to add a new column to a pandas dataframe which encapsulates, as a list, all the strings similar to its original matching row. This is the original pandas dataframe: import pandas as pd d = {'product_name': …

Total answers: 1
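One possible reading of the task, sketched with difflib's fuzzy matching as the notion of "similar" (the column name and data are guesses based on the truncated snippet):

import pandas as pd
from difflib import get_close_matches

# Hypothetical data modelled on the truncated example.
df = pd.DataFrame({"product_name": ["coca cola 1l", "coca cola 1 litre",
                                    "pepsi max", "pepsi max 330ml"]})

def similar_names(name, all_names, cutoff=0.6):
    # Return every other product name that difflib considers close to this one.
    matches = get_close_matches(name, all_names, n=len(all_names), cutoff=cutoff)
    return [m for m in matches if m != name]

df["similar_products"] = df["product_name"].apply(
    lambda x: similar_names(x, df["product_name"].tolist())
)
print(df)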

How to use marisa-trie in Python for NLP processing

How to use marisa-trie in Python for NLP processing Question: I'm working on an NLP function to store tokens in a trie. This is my working tokenization code: import spacy def preprocess_text_spacy(text): stop_words = ["a", "the", "is", "are"] nlp = spacy.load('en_core_web_sm') tokens = set() doc = nlp(text) print(doc) for word in doc: if word.is_currency: …

Total answers: 1
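For context, a minimal sketch of storing the resulting tokens in marisa-trie (assuming pip install marisa-trie; the stop-word handling mirrors the snippet above and is not necessarily the posted answer):

import marisa_trie
import spacy

nlp = spacy.load("en_core_web_sm")
stop_words = {"a", "the", "is", "are"}

def build_trie(text):
    # Tokenize with spaCy, drop stop words, and store the tokens in a trie.
    tokens = {tok.text.lower() for tok in nlp(text) if tok.text.lower() not in stop_words}
    return marisa_trie.Trie(list(tokens))

trie = build_trie("The price is 10 dollars and the offer is great")
print("price" in trie)    # fast membership test
print(trie.keys("gr"))    # all stored tokens starting with "gr"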

Loading a Hugging Face model is taking too much memory

Loading a Hugging Face model is taking too much memory Question: I am trying to load a large Hugging Face model with code like the following: model_from_disc = AutoModelForCausalLM.from_pretrained(path_to_model) tokenizer_from_disc = AutoTokenizer.from_pretrained(path_to_model) generator = pipeline("text-generation", model=model_from_disc, tokenizer=tokenizer_from_disc) The program quickly crashes after the first line because it runs out of memory. Is there a way …

Total answers: 1
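The usual mitigations are half precision, low_cpu_mem_usage, and accelerate's device_map offloading. A hedged sketch (path_to_model is the placeholder from the question; whether this is enough depends on the model size and available hardware):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

path_to_model = "path/to/model"  # placeholder

# Load weights in float16 and stream them in instead of first materialising
# a full randomly initialised copy; device_map="auto" (requires accelerate)
# spreads layers across GPU, CPU and disk as needed.
model_from_disc = AutoModelForCausalLM.from_pretrained(
    path_to_model,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
tokenizer_from_disc = AutoTokenizer.from_pretrained(path_to_model)
generator = pipeline("text-generation", model=model_from_disc,
                     tokenizer=tokenizer_from_disc)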

Python NLP processing: if statement not in stop words list

Python NLP processing: if statement not in stop words list Question: I'm working with the spaCy NLP library and I created a function to return a list of tokens from a text. import spacy def preprocess_text_spacy(text): stop_words = ["a", "the", "is", "are"] nlp = spacy.load('en_core_web_sm') tokens = set() doc = nlp(text) for word in doc: if …

Total answers: 4
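A minimal sketch of the membership test being asked about: compare the token's lowercased text, not the spaCy Token object itself, against the stop list (this mirrors the snippet above rather than any particular answer):

import spacy

def preprocess_text_spacy(text):
    stop_words = {"a", "the", "is", "are"}
    nlp = spacy.load("en_core_web_sm")
    tokens = set()
    for word in nlp(text):
        # word is a spaCy Token; compare its text, not the object itself.
        if word.text.lower() not in stop_words:
            tokens.add(word.text.lower())
    return list(tokens)

print(preprocess_text_spacy("The price is 10 dollars"))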

How to normalise keywords extracted with Named Entity Recognition

How to normalise keywords extracted with Named Entity Recognition Question: I'm trying to employ NER to extract keywords (tags) from job postings. These can be anything, such as React, AWS, Team Building, or Marketing. After training a custom model in spaCy I'm presented with a problem: extracted tags are not unified/normalized across all of the …

Total answers: 2
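One common post-processing step, sketched here with a hand-maintained alias table (the table entries are hypothetical and this is not taken from the answers):

# Map raw NER spans to canonical tag names via lowercasing plus an alias table.
aliases = {
    "reactjs": "React",
    "react.js": "React",
    "aws": "AWS",
    "amazon web services": "AWS",
    "team-building": "Team Building",
}

def normalize_tag(raw_tag):
    key = raw_tag.strip().lower()
    return aliases.get(key, raw_tag.strip().title())

extracted = ["ReactJS", "react.js", "AWS", "team-building", "marketing"]
print(sorted({normalize_tag(t) for t in extracted}))
# -> ['AWS', 'Marketing', 'React', 'Team Building']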

What does config inside "super().__init__(config)" actually do?

What does config inside "super().__init__(config)" actually do? Question: I have the following code to create a custom model for Named Entity Recognition. Using ChatGPT and Copilot, I've commented it to understand its functionality. However, the role of config inside super().__init__(config) is not clear to me. What role does it play, since we have already used XLMRobertaConfig at …

Total answers: 1
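For context, a simplified sketch of the pattern being asked about, based on the standard transformers recipe for a custom XLM-R token-classification head (class and variable names are illustrative):

import torch.nn as nn
from transformers import XLMRobertaConfig
from transformers.models.roberta.modeling_roberta import RobertaModel, RobertaPreTrainedModel

class XLMRobertaForTokenClassificationCustom(RobertaPreTrainedModel):
    config_class = XLMRobertaConfig

    def __init__(self, config):
        # super().__init__(config) stores the config as self.config on the
        # base PreTrainedModel, so weight initialisation and from_pretrained()
        # loading use the same settings (hidden_size, num_labels, dropout, ...)
        # that also size the submodules below.
        super().__init__(config)
        self.num_labels = config.num_labels
        self.roberta = RobertaModel(config, add_pooling_layer=False)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.init_weights()

# e.g. XLMRobertaForTokenClassificationCustom.from_pretrained("xlm-roberta-base", num_labels=7)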

Hugging Face Trainer throws an AttributeError: 'Namespace' object has no attribute 'get_process_log_level'

Hugging Face Trainer throws an AttributeError: 'Namespace' object has no attribute 'get_process_log_level' Question: I am trying to run the Trainer from Hugging Face (PyTorch) with an argument parser. My code looks like: if __name__ == '__main__': parser = HfArgumentParser(TrainingArguments) parser.add_argument('--model_name_or_path', type=str, required=True) . . . . training_args = parser.parse_args() print('args', training_args) os.makedirs(training_args.output_dir, exist_ok=True) random.seed(training_args.seed) set_seed(training_args.seed) dataset_train = … . . …

Total answers: 1
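This error typically appears when the result of parse_args() (a plain argparse Namespace) is handed to Trainer instead of a real TrainingArguments object. A sketch of the usual pattern (the extra ModelArguments dataclass is hypothetical):

from dataclasses import dataclass, field
from transformers import HfArgumentParser, TrainingArguments, set_seed

@dataclass
class ModelArguments:
    # Custom options go in their own dataclass instead of add_argument().
    model_name_or_path: str = field(default="bert-base-uncased")

if __name__ == "__main__":
    parser = HfArgumentParser((ModelArguments, TrainingArguments))
    # parse_args_into_dataclasses() returns a genuine TrainingArguments, which
    # has get_process_log_level(); parse_args() returns a bare Namespace.
    model_args, training_args = parser.parse_args_into_dataclasses()
    set_seed(training_args.seed)
    print(training_args.output_dir, model_args.model_name_or_path)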

How to convert a txt file to a JSON Lines (jsonl) file with Hungarian characters?

How to convert a txt file to a JSON Lines (jsonl) file with Hungarian characters? Question: I have a txt file that contains two columns (filename and text); the separator used when generating the txt file is a tab. Example of the input file below: text.txt 23.jpg még 24.jpg több The expected output_file.jsonl is in JSON Lines format: {"file_name": "23.jpg", "text": "még"} {"file_name": …

Total answers: 2
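A minimal sketch of the conversion, assuming the input really is tab-separated and using ensure_ascii=False so Hungarian characters such as "é" and "ö" are written as-is:

import json

# Read the tab-separated txt file and write one JSON object per line.
with open("text.txt", encoding="utf-8") as src, \
     open("output_file.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.rstrip("\n")
        if not line:
            continue
        file_name, text = line.split("\t", 1)
        # ensure_ascii=False keeps accented characters instead of \u escapes.
        dst.write(json.dumps({"file_name": file_name, "text": text},
                             ensure_ascii=False) + "\n")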