n-gram

Why is appending my list of tuples changing their content?

Question: I am trying to make a list of tuples that contain a string and a dictionary. The string is a filename and the dictionary is a frequency list of n-grams. ('story.txt', {'back': 12, 'been': 13, 'bees': 58, 'buzz': 13, 'cant': 30, 'come': 12, 'dont': …

Total answers: 1

Multiprocess error while using map function in python with N-Gram language model

Question: I want to increase the accuracy of my speech2text model by using an n-gram language model. I'm using this line of code to apply the function to the whole dataset: result = dataset.map(predict, batch_size=5, num_proc=int(os.environ.get('cpu_core'))) The CPU core I set for 'cpu_core' …

Total answers: 1

Find trigrams for all groupby clusters in a Pandas Dataframe and return in a new column

Question: I’m trying to return the highest frequency trigram in a new column in a pandas dataframe for each group of keywords. (Essentially something like a groupby with transform, returning the highest trigram in a new column). An example …

Total answers: 1

Calculate TF-IDF using sklearn for variable-n-grams in python

Question: Problem: using scikit-learn to find the number of hits of variable n-grams of a particular vocabulary. Explanation: I got examples from here. Imagine I have a corpus and I want to count how many hits a vocabulary like the following one has: myvocabulary = [(window=4, …

Total answers: 1

Understanding the `ngram_range` argument in a CountVectorizer in sklearn

Question: I’m a little confused about how to use ngrams in the scikit-learn library in Python, specifically, how the ngram_range argument works in a CountVectorizer. Running this code: from sklearn.feature_extraction.text import CountVectorizer vocabulary = ['hi ', 'bye', 'run away'] cv = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2)) print cv.vocabulary_ …

Total answers: 1

n-grams in python, four, five, six grams?

Question: I’m looking for a way to split a text into n-grams. Normally I would do something like: import nltk from nltk import bigrams string = "I really like python, it's pretty awesome." string_bigrams = bigrams(string) print string_bigrams I am aware that nltk only offers bigrams and trigrams, …

Total answers: 17
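
For readers skimming this entry, here is a minimal sketch of one common way to build n-grams of any size without nltk. It assumes simple whitespace tokenization, and the helper name ngrams is chosen here for illustration; it is not taken from the question or its answers.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as a list of tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Build n shifted views of the token list and zip them together;
    # zip stops at the shortest view, so only complete n-grams are kept.
    return list(zip(*(tokens[i:] for i in range(n))))


# Assumption: whitespace splitting stands in for a real tokenizer.
text = "I really like python, it's pretty awesome."
tokens = text.split()

four_grams = ngrams(tokens, 4)
five_grams = ngrams(tokens, 5)
```

The same function covers bigrams and trigrams with n=2 and n=3, and returns an empty list when the text has fewer than n tokens.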

counting n-gram frequency in python nltk

Question: I have the following code. I know that I can use the apply_freq_filter function to filter out collocations that occur fewer times than a given frequency count. However, I don’t know how to get the frequencies of all the n-gram tuples (in my case bi-gram) in a document, before I decide …

Total answers: 4

Computing N Grams using Python

Question: I needed to compute the Unigrams, BiGrams and Trigrams for a text file containing text like: “Cystic fibrosis affects 30,000 children and young adults in the US alone Inhaling the mists of salt water can reduce the pus and infection that fills the airways of cystic fibrosis sufferers, although …

Total answers: 8

Counting bigrams (pair of two words) in a file using Python

Question: I want to count the number of occurrences of all bigrams (pairs of adjacent words) in a file using Python. I am dealing with very large files, so I am looking for an efficient way. I tried using the count method with regex …

Total answers: 6
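
As a rough illustration of the problem in this entry, the sketch below counts adjacent word pairs while reading the input one line at a time, so the whole file never has to be held in memory. It assumes whitespace-separated words and treats a pair that spans a line break as a bigram; the function name count_bigrams is hypothetical and not drawn from the answers.

```python
from collections import Counter


def count_bigrams(lines):
    """Count adjacent word pairs across an iterable of text lines."""
    counts = Counter()
    prev = None  # last word seen, carried across line boundaries
    for line in lines:
        for word in line.split():
            if prev is not None:
                counts[(prev, word)] += 1
            prev = word
    return counts


# A small in-memory sample; any iterable of lines works the same way.
sample = ["the cat sat on", "the mat the cat"]
counts = count_bigrams(sample)
```

For a large file, an open file object can be passed directly, since iterating over it yields one line at a time.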

Python: Reducing memory usage of dictionary

Question: I’m trying to load a couple of files into memory. The files have one of the following 3 formats: string TAB int, string TAB float, int TAB float. They are n-gram statistics files, in case this helps with the solution. For instance: i_love TAB 10 love_you …

Total answers: 6