tokenize

TorchText Vocab TypeError: Vocab.__init__() got an unexpected keyword argument 'min_freq'

TorchText Vocab TypeError: Vocab.__init__() got an unexpected keyword argument 'min_freq' Question: I am working on a CNN sentiment-analysis machine learning model which uses the IMDb dataset provided by the Torchtext library. On one of my lines of code, vocab = Vocab(counter, min_freq=1, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>')), I am getting a TypeError for …

Total answers: 3
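
A minimal sketch of the usual fix, assuming torchtext 0.12 or newer, where Vocab is no longer constructed directly from a Counter and min_freq/specials go to the vocab() factory instead (the installed version and this behaviour are assumptions, not stated in the question):

from collections import Counter, OrderedDict
from torchtext.vocab import vocab

counter = Counter(['the', 'movie', 'was', 'great', 'the'])   # toy counter for illustration
ordered = OrderedDict(sorted(counter.items(), key=lambda kv: kv[1], reverse=True))

# In recent torchtext releases the vocab() factory takes min_freq and specials;
# Vocab(counter, min_freq=..., specials=...) no longer does.
v = vocab(ordered, min_freq=1, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))
v.set_default_index(v['<unk>'])   # map out-of-vocabulary tokens to <unk>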

BertTokenizer – when encoding and decoding sequences extra spaces appear

BertTokenizer – when encoding and decoding sequences extra spaces appear Question: When using Transformers from HuggingFace I am facing a problem with the encoding and decoding method. I have the following string: test_string = 'text with percentage%' Then I am running the following code: import torch from transformers import BertTokenizer tokenizer = BertTokenizer.from_pretrained('bert-base-cased') test_string …

Total answers: 3
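
A short sketch reproducing the behaviour described above: bert-base-cased splits the string into word pieces, and decode() rejoins them with spaces, so punctuation such as % picks up a space.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
test_string = 'text with percentage%'

ids = tokenizer.encode(test_string, add_special_tokens=False)
print(tokenizer.decode(ids))   # typically 'text with percentage %'

# A common workaround (an assumption, not taken from the question): use a fast
# tokenizer with return_offsets_mapping=True and slice the original string,
# rather than relying on decode() to restore the exact input text.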

Using nlp.pipe() with pre-segmented and pre-tokenized text with spaCy

Using nlp.pipe() with pre-segmented and pre-tokenized text with spaCy Question: I am trying to tag and parse text that has already been split up into sentences and has already been tokenized. As an example: sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']] The fastest approach to process batches of text is .pipe(). However, it …

Total answers: 4
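
One way this is commonly handled in spaCy 3.x (a sketch, assuming the en_core_web_sm model is installed): build Doc objects from the pre-tokenized words and feed them to nlp.pipe(), which skips the tokenizer for Doc input.

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')
sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]

# Wrap each pre-tokenized sentence in a Doc so the pipeline's own tokenizer is bypassed.
docs = (Doc(nlp.vocab, words=words) for words in sents)

for doc in nlp.pipe(docs):
    print([(t.text, t.pos_, t.dep_) for t in doc])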

Is there a way to get entire constituents using SpaCy?

Is there a way to get entire constituents using SpaCy? Question: I guess I’m trying to navigate SpaCy’s parse tree in a more blunt way than is provided. For instance, if I have sentences like: “He was a genius” or “The dog was green,” I want to be able to save the objects to variables …

Total answers: 2
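
spaCy produces a dependency parse rather than a constituency tree, but a token's subtree is often a workable stand-in for the constituent it heads. A rough sketch (using the 'attr' label for predicate nominals is an assumption about the English models):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('He was a genius')

for token in doc:
    if token.dep_ == 'attr':                                    # e.g. the predicate "a genius"
        constituent = ' '.join(t.text for t in token.subtree)   # whole phrase under that head
        print(constituent)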

Only Get Tokenized Sentences as Output from Stanford Core NLP

Only Get Tokenized Sentences as Output from Stanford Core NLP Question: I need to split sentences. I’m using the pycorenlp wrapper for python3. I’ve started the server from my jar directory using: java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 I’ve run the following commands: from pycorenlp import StanfordCoreNLP nlp = StanfordCoreNLP('http://localhost:9000') text = …

Total answers: 2
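
A hedged sketch with the same pycorenlp wrapper: request only the tokenize and ssplit annotators, then rebuild each sentence from its tokens (joining on spaces is a simplification of the original spacing).

from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')
text = 'This is the first sentence. This is the second.'

output = nlp.annotate(text, properties={
    'annotators': 'tokenize,ssplit',   # no tagging or parsing, just sentence splitting
    'outputFormat': 'json',
})

sentences = [' '.join(tok['word'] for tok in sent['tokens'])
             for sent in output['sentences']]
print(sentences)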

Can a line of Python code know its indentation nesting level?

Can a line of Python code know its indentation nesting level? Question: From something like this: print(get_indentation_level()) print(get_indentation_level()) print(get_indentation_level()) I would like to get something like this: 1 2 3 Can the code read itself in this way? All I want is the output from the more nested parts of the code to be more …

Total answers: 5
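
An illustrative sketch of one approach (it assumes the source file is readable and indented with a fixed width, here 4 spaces): read the caller's source line via inspect and count its leading whitespace.

import inspect

def get_indentation_level(indent_width=4):
    # Look at the source line that called us and count its leading spaces.
    caller = inspect.currentframe().f_back
    line = inspect.getframeinfo(caller).code_context[0].expandtabs()
    return (len(line) - len(line.lstrip(' '))) // indent_width + 1

print(get_indentation_level())               # 1
if True:
    print(get_indentation_level())           # 2
    if True:
        print(get_indentation_level())       # 3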

How do you extract only the date from a python datetime?

How do you extract only the date from a python datetime? Question: I have a dataframe in python. One of its columns is labelled time, which is a timestamp. Using the following code, I have converted the timestamp to datetime: milestone['datetime'] = milestone.apply(lambda x: datetime.datetime.fromtimestamp(x['time']), axis=1) Now I want to separate (tokenize) date …

Total answers: 1
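
A short sketch with made-up timestamps (the new column names are assumptions): once the column is converted with pandas, the .dt accessor splits out the date and the time-of-day without a row-wise apply.

import pandas as pd

milestone = pd.DataFrame({'time': [1672617600, 1690000000]})    # example Unix timestamps
milestone['datetime'] = pd.to_datetime(milestone['time'], unit='s')

milestone['date'] = milestone['datetime'].dt.date    # datetime.date objects only
milestone['clock'] = milestone['datetime'].dt.time   # time-of-day component
print(milestone)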

Python – RegEx for splitting text into sentences (sentence-tokenizing)

Python – RegEx for splitting text into sentences (sentence-tokenizing) Question: I want to make a list of sentences from a string and then print them out. I don’t want to use NLTK to do this. So it needs to split on a period at the end of the sentence and not at decimals or abbreviations …

Total answers: 10
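
A rough regex-only sketch (no NLTK): split after ., ! or ? only when what follows is whitespace and a capital letter, which keeps decimals such as 2.5 intact; abbreviations like "Mr." would still need extra negative lookbehinds.

import re

text = 'The rate rose by 2.5 points. Impressive! Is it sustainable? Time will tell.'

# Split at sentence-ending punctuation followed by whitespace and an upper-case letter.
sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
print(sentences)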

How do I tokenize a string sentence in NLTK?

How do I tokenize a string sentence in NLTK? Question: I am using nltk, so I want to create my own custom texts just like the default ones on nltk.books. However, I’ve only got as far as my_text = ['This', 'is', 'my', 'text']. I’d like to discover any way to input my “text” …

Total answers: 2
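
A minimal sketch of the standard route (downloading the punkt tokenizer data is assumed to be acceptable): nltk.word_tokenize turns a raw string into a token list, and nltk.Text wraps it like the built-in nltk.book texts.

import nltk
nltk.download('punkt')                       # tokenizer models, needed once

sentence = 'This is my text.'
tokens = nltk.word_tokenize(sentence)        # ['This', 'is', 'my', 'text', '.']
my_text = nltk.Text(tokens)                  # behaves like the texts in nltk.book
print(tokens)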