'pseudocorpus' no longer available from 'gensim.models.phrases'?

Question:

Several months ago, I used "pseudocorpus" to create a fake corpus as part of phrase training using Gensim with the following code:

from gensim.models.phrases import pseudocorpus 

corpus = pseudocorpus(bigram_model.vocab, bigram_model.delimiter, bigram_model.common_terms)
bigrams = []
for bigram, score in bigram_model.export_phrases(corpus, bigram_model.delimiter, as_tuples=False):
    if score >= bigram_model.threshold:
        bigrams.append(bigram.decode('utf-8'))

Now when I run the code, I get the following error message:

ImportError: cannot import name 'pseudocorpus' from 'gensim.models.phrases'

I’m using Gensim 4.2.0. Is pseudocorpus() no longer available with Gensim 4.2.0?

Thanks a lot!

Asked By: Victor Wang

||

Answers:

I believe the main internal consumer of a pseudocorpus() result, the .export_phrases() method, was improved to achieve the same goals more efficiently, so the pseudocorpus() function was removed, as it had never really been promoted as part of the public functionality of the module.

Can you make use of .export_phrases() for your purposes?
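As a rough sketch of what that might look like: this assumes that in Gensim 4.x, .export_phrases() takes no arguments and returns a dict mapping each phrase (as a str) to its score. The helper name phrases_above_threshold is made up for illustration. The explicit threshold check just mirrors the code in the question and may be redundant if the exported phrases are already filtered.

```python
def phrases_above_threshold(model):
    """Collect phrases whose score meets the model's threshold.

    `model` is anything exposing `export_phrases()` and `threshold`,
    e.g. a trained gensim Phrases model (assumed Gensim 4.x behaviour:
    export_phrases() returns a dict of phrase -> score).
    """
    exported = model.export_phrases()
    # Phrases are assumed to be str already, so no .decode('utf-8') is needed.
    return [phrase for phrase, score in exported.items()
            if score >= model.threshold]


# Hypothetical usage with an already-trained model:
# bigrams = phrases_above_threshold(bigram_model)
```

If this covers what you need, the separate pseudocorpus step goes away entirely.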

If not, can you say a bit more about how you were using the (odd synthetic) ‘pseudocorpus’?

If all else fails, the prior functionality was a pretty simple extraction from the model’s state, and you can view the last version of the function, before it was refactored away, in the project’s open source repository:

https://github.com/RaRe-Technologies/gensim/blob/da8847a04f9ee56702cb81a0218cd5a57e1f24e6/gensim/models/phrases.py#L750

So, you could simply use that as a guide to reimplementing equivalent extraction in your own code.
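For example, a reimplementation might look something like the sketch below. This is not the original function verbatim, just a simplified reconstruction of the idea: split each multi-word vocab key back into tokens, so each key yields a short fake "sentence". It assumes the vocab keys and the delimiter are plain str (as I believe they are in Gensim 4.x) and that the old common_terms set corresponds to what is now called connector_words; the function name pseudocorpus_sketch is made up.

```python
from itertools import takewhile


def pseudocorpus_sketch(vocab, delimiter, connector_words=frozenset()):
    """Yield token lists rebuilt from the multi-word keys of `vocab`.

    Simplified, unofficial reconstruction; names and str-typed keys
    are assumptions, not gensim's API.
    """
    for key in vocab:
        if delimiter not in key:
            continue  # single-word entries cannot form a phrase
        unigrams = key.split(delimiter)
        for i in range(1, len(unigrams)):
            if unigrams[i - 1] in connector_words:
                continue  # do not end the left part on a connector word
            # Connector words directly after the split stay as separate tokens.
            connectors = list(
                takewhile(lambda w: w in connector_words, unigrams[i:])
            )
            tail = unigrams[i + len(connectors):]
            if tail:
                yield ([delimiter.join(unigrams[:i])]
                       + connectors
                       + [delimiter.join(tail)])


# Hypothetical usage with a trained model:
# corpus = pseudocorpus_sketch(bigram_model.vocab, bigram_model.delimiter)
```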

Answered By: gojomo