tokenize

Can't Initialise Two Different Tokenizers with Keras

Question: For a spelling-correction task, I built a seq2seq model with an LSTM and an attention mechanism. I do char-level tokenisation with Keras. I initialised two different tokenizers, one for the typo sentences and the other for the corrected sentences. After testing, I see that the model produces an empty string, and I believe …

Total answers: 1
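A minimal sketch of the usual two-tokenizer setup, assuming hypothetical typo_sentences and corrected_sentences lists: each Keras Tokenizer is an independent object, so each must be fitted on its own corpus. An empty decoded string at test time often means the target tokenizer's word_index never saw the characters the model predicts.

from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical sample corpora for illustration.
typo_sentences = ["helo wrld", "god mornin"]
corrected_sentences = ["hello world", "good morning"]

# Two independent char-level tokenizers, each fitted on its own corpus.
typo_tokenizer = Tokenizer(char_level=True, lower=True)
typo_tokenizer.fit_on_texts(typo_sentences)

target_tokenizer = Tokenizer(char_level=True, lower=True)
target_tokenizer.fit_on_texts(corrected_sentences)

typo_seqs = typo_tokenizer.texts_to_sequences(typo_sentences)
target_seqs = target_tokenizer.texts_to_sequences(corrected_sentences)

print(typo_tokenizer.word_index)    # char -> id learned from the typo corpus
print(target_tokenizer.word_index)  # char -> id learned from the target corpus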

Tokenize phrases in tokenized tuple

Question: I have a dataset consisting of tokenized tuples. My pre-processing steps were to first tokenize the words and then normalize slang words. But the normalized slang words can consist of phrases containing white space. I'm trying to do another round of tokenizing, but I couldn't figure out how. …

Total answers: 1
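One way to handle the second round, sketched under the assumption that normalization replaced slang tokens with multi-word strings: split any token that still contains whitespace and flatten the result back into a tuple.

# Hypothetical rows where slang normalization produced phrase tokens.
normalized = [("by the way", "i", "love", "it"),
              ("to be honest", "not", "bad")]

retokenized = [
    tuple(word for token in row for word in token.split())
    for row in normalized
]
print(retokenized)
# [('by', 'the', 'way', 'i', 'love', 'it'), ('to', 'be', 'honest', 'not', 'bad')]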

How to Tokenize block of text as one token in python?

Question: I have recently been working on a genome data set which consists of many blocks of genomes. In previous work on natural language processing, I used sent_tokenize and word_tokenize from nltk to tokenize sentences and words. But when I use these functions …

Total answers: 1
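nltk's tokenizers are tuned for natural language, so for genome blocks a plain split on the block delimiter is usually enough. A sketch, assuming blocks are separated by blank lines (the delimiter is an assumption):

# Hypothetical file contents: two genome blocks separated by a blank line.
text = "ATGCGT\nACGTTA\n\nTTGACA\nGGCATC"

# Each whole block becomes a single token: split on the blank line,
# then strip the internal newlines.
blocks = [block.replace("\n", "") for block in text.split("\n\n")]
print(blocks)  # ['ATGCGTACGTTA', 'TTGACAGGCATC']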

With python what is the most efficient way to tokenize a string (SELFIES) to a list?

Question: I am currently working with SELFIES (self-referencing embedded strings, GitHub: https://github.com/aspuru-guzik-group/selfies), which is a string representation of a molecule. It is a sequence of tokens delimited by brackets, e.g. propane would be …

Total answers: 3
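Since every SELFIES token is a bracketed unit, a precompiled regex tokenizes the string in one pass. A sketch (the propane string is an assumption based on the question's example):

import re

s = "[C][C][C]"  # propane in SELFIES notation

TOKEN_RE = re.compile(r"\[[^\]]*\]")
tokens = TOKEN_RE.findall(s)
print(tokens)  # ['[C]', '[C]', '[C]']

# The selfies package also ships its own splitter:
# import selfies as sf
# tokens = list(sf.split_selfies(s))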

Python set.add() is triggering outside a conditional statement

Question: I'm tokenizing some documents and I want to find out which tokens are shared across tokenizations. To do this, for each tokenization I loop through the set of all tokens in all tokenizations, called all_tokens, and check whether a given token exists …

Total answers: 2
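The usual culprit in this pattern is an add() call indented outside the if block, so it runs on every iteration. A set intersection sidesteps the conditional entirely; a sketch with hypothetical tokenizations:

# Hypothetical token sets from three tokenizations.
tokenizations = [
    {"the", "cat", "sat"},
    {"the", "dog", "sat"},
    {"the", "sat", "mat"},
]

# Tokens shared by every tokenization, no explicit loop or conditional.
shared = set.intersection(*tokenizations)
print(shared)  # {'the', 'sat'}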

splitting string made out of dataframe row wise

Question: I'm trying to tokenize the words within a dataframe which looks like:

   A                 B           C           D               E    F
0  Orange robot      x eyes      discomfort  striped tee     nan
1  orange robot      blue beams  grin        vietnam jacket  nan
2  aquamarine robot  3d          bored       cigarette       nan

After removing all the …

Total answers: 1
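A sketch of row-wise splitting, with column values assumed from the question's sample: join each row's non-null cells into one string, then split that string into tokens.

import pandas as pd

# Values assumed from the question's sample frame.
df = pd.DataFrame({
    "A": ["Orange robot", "orange robot", "aquamarine robot"],
    "B": ["x eyes", "blue beams", "3d"],
    "C": [None, None, None],
})

tokens_per_row = (
    df.fillna("")
      .agg(" ".join, axis=1)   # one string per row
      .str.split()             # no argument: splits on runs of whitespace
)
print(tokens_per_row.tolist())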

Delete brackets from column values

Question: I have the following dataframe:

df = pd.DataFrame({'column1': ['Severe weather Not Severe weather kind of severe weather']})

I tokenized this dataframe:

from nltk.tokenize import word_tokenize
df['column1'] = df['column1'].apply(lambda x: word_tokenize(x))

The output is enclosed inside brackets:

column1
0 [Severe, weather, Not, Severe, weather, kind, of, severe, weather]

I want …

Total answers: 3
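The brackets are just how pandas renders a Python list; the tokens themselves are clean. If the goal is a bracket-free string column, join each list back into a string (a sketch; nltk's punkt data must be available for word_tokenize):

import pandas as pd
from nltk.tokenize import word_tokenize

# nltk.download("punkt") may be required once beforehand.
df = pd.DataFrame({"column1": ["Severe weather Not Severe weather kind of severe weather"]})
df["column1"] = df["column1"].apply(word_tokenize)

# Series.str.join concatenates each row's token list into one string.
df["column1"] = df["column1"].str.join(" ")
print(df)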

How to keep structure of text after feeding it to a pipeline for NER

Question: I've built an NER (named entity recognition) model based on an existing HuggingFace model that I fine-tuned to recognize my custom entities. The text I want to run my model on is in a txt file. The code of …

Total answers: 1
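One common way to preserve the file's structure is to run the pipeline line by line, so entity offsets stay relative to each line. A sketch; the model name and input path are placeholders:

from transformers import pipeline

# Placeholder path to the fine-tuned model.
ner = pipeline("token-classification",
               model="path/to/finetuned-ner",
               aggregation_strategy="simple")

with open("input.txt", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

for line_no, line in enumerate(lines):
    if line.strip():  # skip blank lines but keep their numbering
        for ent in ner(line):
            print(line_no, ent["word"], ent["entity_group"],
                  ent["start"], ent["end"])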

How to create a list of tokenized words from dataframe column using spaCy?

Question: I'm trying to apply spaCy's tokenizer to a dataframe column to get a new column containing a list of tokens. Assume we have the following dataframe:

import pandas as pd
details = {
    'Text_id': [23, 21, 22, 21],
    'Text': ['All roads …

Total answers: 2
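A sketch using nlp.pipe to stream the column through spaCy; the sample texts beyond the excerpt's 'All roads …' are assumptions:

import pandas as pd
import spacy

details = {
    "Text_id": [23, 21, 22, 21],
    "Text": ["All roads lead to Rome",            # from the question's excerpt
             "Rome was not built in a day",        # remaining texts assumed
             "When in Rome, do as the Romans do",
             "Rome is a city"],
}
df = pd.DataFrame(details)

nlp = spacy.blank("en")  # tokenizer only; no trained model download needed

# nlp.pipe streams the texts efficiently; each Doc becomes a list of token strings.
df["Tokens"] = [[t.text for t in doc] for doc in nlp.pipe(df["Text"])]
print(df)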