How to get PMI scores for trigrams with NLTK Collocations? python

Question:

I know how to get bigram and trigram collocations using NLTK and I apply them to my own corpora. The code is below.

My only problem is how to print out the bigram with the PMI value. I have searched the NLTK documentation multiple times. Either I'm missing something or it's not there.

import nltk
from nltk.collocations import TrigramCollocationFinder

# Read the corpus and split it into whitespace-delimited tokens.
with open("large.txt", "r") as f:
    myList = f.read().split()
myCorpus = nltk.Text(myList)
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(myCorpus)

# Ignore trigrams that occur fewer than 3 times.
finder.apply_freq_filter(3)
print(finder.nbest(trigram_measures.pmi, 500000))
Asked By: Sabba

Answers:

I think you’re looking for score_ngram. Anyway, you don’t need a printing function. Just munge the output yourself…

trigrams = finder.nbest(trigram_measures.pmi, 500000)
print([(x, finder.score_ngram(trigram_measures.pmi, x[0], x[1], x[2])) for x in trigrams])
Answered By: dmvianna

If you take a look at the source code for nltk.collocations.TrigramCollocationFinder (see http://www.nltk.org/_modules/nltk/collocations.html), you'll find that nbest() simply returns the n-grams from the first n entries of score_ngrams():

def nbest(self, score_fn, n):
    """Returns the top n ngrams when scored by the given function."""
    return [p for p,s in self.score_ngrams(score_fn)[:n]]

So you can call score_ngrams() directly instead of nbest(), since it already returns a list of (ngram, score) pairs sorted by score:

def score_ngrams(self, score_fn):
    """Returns a sequence of (ngram, score) pairs ordered from highest to
    lowest score, as determined by the scoring function provided.
    """
    return sorted(self._score_ngrams(score_fn),
                  key=_itemgetter(1), reverse=True)
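
To make the relationship between the two methods concrete, here is a minimal sketch that checks it on a toy token list. The sample text and the use of str.split() instead of a real tokenizer are assumptions made for illustration only.

```python
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

# Toy token list; str.split() is used so no tokenizer data is required.
tokens = ("this is a foo bar bar black sheep foo bar bar black sheep "
          "foo bar bar black sheep shep bar bar black sentence").split()

trigram_measures = TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(tokens)

scored = finder.score_ngrams(trigram_measures.pmi)  # [(trigram, score), ...]
top5 = finder.nbest(trigram_measures.pmi, 5)        # [trigram, ...], no scores

# nbest() is the first n trigrams of score_ngrams() with the scores dropped.
same_as_nbest = [ngram for ngram, _score in scored[:5]]
print(top5 == same_as_nbest)
```

So if you need the scores, score_ngrams() gives you everything nbest() does plus the value for each n-gram.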

Try:

import nltk
from nltk.collocations import *
from nltk.tokenize import word_tokenize

text = "this is a foo bar bar black sheep  foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"

trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(word_tokenize(text))

for i in finder.score_ngrams(trigram_measures.pmi):
    print(i)

[out]:

(('this', 'is', 'a'), 9.047123912114026)
(('is', 'a', 'foo'), 7.46216141139287)
(('black', 'sheep', 'shep'), 5.46216141139287)
(('black', 'sheep', 'foo'), 4.877198910671714)
(('a', 'foo', 'bar'), 4.462161411392869)
(('sheep', 'shep', 'bar'), 4.462161411392869)
(('bar', 'black', 'sheep'), 4.047123912114026)
(('bar', 'black', 'sentence'), 4.047123912114026)
(('sheep', 'foo', 'bar'), 3.877198910671714)
(('bar', 'bar', 'black'), 3.047123912114026)
(('foo', 'bar', 'bar'), 3.047123912114026)
(('shep', 'bar', 'bar'), 3.047123912114026)
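
If you want to sanity-check these numbers, the sketch below recomputes them without NLTK. It assumes trigram PMI is log2(count(w1, w2, w3) * N^2 / (count(w1) * count(w2) * count(w3))), with N the total number of tokens, which is consistent with the values printed above. The function name trigram_pmi is hypothetical, and the tokens come from a plain str.split().

```python
import math
from collections import Counter


def trigram_pmi(tokens):
    """Return {trigram: PMI} for every adjacent trigram in `tokens`."""
    n_total = len(tokens)
    word_counts = Counter(tokens)
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    scores = {}
    for (w1, w2, w3), n_iii in trigram_counts.items():
        # Observed trigram count scaled by N^2 ...
        joint = n_iii * n_total ** 2
        # ... against the product of the three unigram counts.
        independent = word_counts[w1] * word_counts[w2] * word_counts[w3]
        scores[(w1, w2, w3)] = math.log2(joint) - math.log2(independent)
    return scores


tokens = ("this is a foo bar bar black sheep foo bar bar black sheep "
          "foo bar bar black sheep shep bar bar black sentence").split()
scores = trigram_pmi(tokens)
print(scores[("this", "is", "a")])         # roughly 9.047
print(scores[("bar", "black", "sheep")])   # roughly 4.047
```

Note how ('this', 'is', 'a') scores highest even though it occurs only once: PMI rewards rare words that always appear together, which is why a frequency filter is usually applied first.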
Answered By: alvas

NLTK has a dedicated documentation page that shows how to use different collocations https://www.nltk.org/howto/collocations.html

Below is a sample that uses BigramCollocationFinder and BigramAssocMeasures to score bigrams by Pointwise Mutual Information.

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.tokenize import word_tokenize


text = "Collocations are expressions of multiple words which commonly co-occur. For example, the top ten bigram collocations in Genesis are listed below, as measured using Pointwise Mutual Information."
words = word_tokenize(text)

finder = BigramCollocationFinder.from_words(words)
bgm = BigramAssocMeasures()
score = bgm.pmi
# Join the two words of each bigram with `_` into a single string key.
bigram_collocations = {"_".join(bigram): pmi for bigram, pmi in finder.score_ngrams(score)}
print(f"bigram collocations: {bigram_collocations}")

Output

{'For_example': 4.954196310386875, 'Mutual_Information': 4.954196310386875, 'Pointwise_Mutual': 4.954196310386875, 'as_measured': 4.954196310386875, 'bigram_collocations': 4.954196310386875, 'collocations_in': 4.954196310386875, 'commonly_co-occur': 4.954196310386875, 'expressions_of': 4.954196310386875, 'in_Genesis': 4.954196310386875, 'listed_below': 4.954196310386875, 'measured_using': 4.954196310386875, 'multiple_words': 4.954196310386875, 'of_multiple': 4.954196310386875, 'ten_bigram': 4.954196310386875, 'the_top': 4.954196310386875, 'top_ten': 4.954196310386875, 'using_Pointwise': 4.954196310386875, 'which_commonly': 4.954196310386875, 'words_which': 4.954196310386875, ',_as': 3.954196310386875, ',_the': 3.954196310386875, '._For': 3.954196310386875, 'Collocations_are': 3.954196310386875, 'Genesis_are': 3.954196310386875, 'Information_.': 3.954196310386875, 'are_expressions': 3.954196310386875, 'are_listed': 3.954196310386875, 'below_,': 3.954196310386875, 'co-occur_.': 3.954196310386875, 'example_,': 3.954196310386875}

NLTK also provides TrigramCollocationFinder and QuadgramCollocationFinder in nltk.collocations.
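
For the trigram case asked about in the question, the same pattern might look like the sketch below. The toy text, the str.split() tokenization and the variable name trigram_collocations are assumptions for illustration. The frequency filter of 3 mirrors the one in the question.

```python
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

# Toy token list; str.split() avoids needing any tokenizer data.
tokens = ("this is a foo bar bar black sheep foo bar bar black sheep "
          "foo bar bar black sheep shep bar bar black sentence").split()

trigram_measures = TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)  # drop trigrams seen fewer than 3 times

# Join the three words of each trigram with `_` and map them to their PMI.
trigram_collocations = {
    "_".join(trigram): pmi
    for trigram, pmi in finder.score_ngrams(trigram_measures.pmi)
}
print(f"trigram collocations: {trigram_collocations}")
```

With this text, only the trigrams that occur at least three times survive the filter, each paired with its PMI score.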

Answered By: abdullahselek