Is it more correct to export bigrams from the bigram model or the trigram model in Gensim?

Question:

After I train a bigram model and a trigram model using Gensim, I can export the bigrams from either the bigram model or the trigram model. The two resulting lists can be quite different: there is a large overlap, but a large number of bigrams appear in only one of the lists. Which is the right way? Thanks!

import gensim

# texts_unigram: tokenized sentences, i.e. a list of lists of str
bigram_model = gensim.models.Phrases(texts_unigram)
# Apply the bigram model to the same sentences it was trained on
texts_bigram = [bigram_model[sent] for sent in texts_unigram]
trigram_model = gensim.models.Phrases(texts_bigram)

# Get from the bigram model (export_phrases() returns a dict in gensim 4.x)
bigrams1 = list(bigram_model.export_phrases().keys())

# Get from the trigram model
ngrams = list(trigram_model.export_phrases().keys())  # This includes both bigrams and trigrams
bigrams2 = [g for g in ngrams if g.count("_") == 1]
Asked By: Victor Wang

Answers:

When you apply the Phrases class’s statistical bigram combination multiple times, you’re in experimental territory that doesn’t have well-established rules of thumb.

So you should be guided by your own project’s evaluations of model effectiveness: for whatever your downstream purposes are, which set of n-grams works better?
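Before running a downstream evaluation, it can help to see exactly how the two exported lists differ. The following is only a sketch using plain set operations: `compare_phrase_lists` is a hypothetical helper (not part of Gensim), and the phrases are made up for illustration. In practice you would pass in `bigrams1` and `bigrams2` from the question.

```python
def compare_phrase_lists(first, second):
    """Summarize how two phrase lists overlap using plain set operations."""
    a, b = set(first), set(second)
    shared = a & b
    union = a | b
    return {
        "shared": sorted(shared),
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
        # Jaccard similarity: size of the overlap relative to the union
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Made-up phrases for illustration only; substitute your own exported lists.
bigrams1 = ["new_york", "machine_learning", "ice_cream"]
bigrams2 = ["new_york", "machine_learning", "hot_dog", "san_francisco"]

report = compare_phrase_lists(bigrams1, bigrams2)
print("only in first: ", report["only_first"])
print("only in second:", report["only_second"])
print("jaccard:", report["jaccard"])
```

Inspecting the phrases that appear in only one list is usually more informative than the overlap figure alone, since those are the ones your downstream task will treat differently.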

Note also:

  • Applying bigram-combinations twice may create not just trigrams (unigrams that are found to combine well with a neighboring bigram) but even quad-grams (bigrams that combine well with neighboring bigrams).
  • The crude statistical thresholds used by the Phrases class will often combine things that don’t match human intuitions and miss other things you might see as useful multiword n-grams. Tuning will often improve some pairings only at the expense of others. Ultimately, the n-grams created this way may not be appropriate or attractive for end-user display, but might still help as the input for classification/info-retrieval tasks.
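To check whether a second pass actually produced quad-grams, one rough approach is to bucket the exported phrases by how many tokens they join. This is a sketch, not a Gensim feature: `group_by_order` is a hypothetical helper, the phrases are made up, and it assumes the delimiter is `_` and that no original token already contains an underscore (otherwise the counts will be off).

```python
from collections import defaultdict


def group_by_order(phrases, delimiter="_"):
    """Bucket phrases by token count (2 = bigram, 3 = trigram, 4 = quad-gram)."""
    groups = defaultdict(list)
    for phrase in phrases:
        # N delimiters join N + 1 tokens; assumes tokens contain no delimiter
        groups[phrase.count(delimiter) + 1].append(phrase)
    return dict(groups)


# Made-up phrases for illustration only; substitute the keys exported
# from your second-pass model.
ngrams = ["new_york", "new_york_city", "new_york_stock_exchange", "ice_cream"]

groups = group_by_order(ngrams)
for order in sorted(groups):
    print(order, groups[order])
```

With this in hand, the question’s `g.count("_") == 1` filter corresponds to `groups.get(2, [])`, and anything under key 4 is a quad-gram formed by merging two bigrams.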
Answered By: gojomo