Word embedding with gensim and FastText, training on pretrained vectors

Question:

I am trying to load the pretrained .vec file of Facebook's fastText, crawl-300d-2M.vec, with the following code:

from gensim.models.fasttext import load_facebook_model, load_facebook_vectors

model_facebook = load_facebook_vectors('fasttext/crawl-300d-2M.vec')

But it fails with the following error:

NotImplementedError: Supervised fastText models are not supported

Is it not possible to load these vectors?

If it is possible, can I afterwards continue training it with my own sentences?

Thanks in advance.

Full error traceback:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-181-f8262e0857b8> in <module>
----> 1 model_facebook = load_facebook_vectors('fasttext/crawl-300d-2M.vec')

/opt/conda/lib/python3.7/site-packages/gensim/models/fasttext.py in load_facebook_vectors(path, encoding)
   1196 
   1197     """
-> 1198     model_wrapper = _load_fasttext_format(path, encoding=encoding, full_model=False)
   1199     return model_wrapper.wv
   1200 

/opt/conda/lib/python3.7/site-packages/gensim/models/fasttext.py in _load_fasttext_format(model_file, encoding, full_model)
   1220     """
   1221     with gensim.utils.open(model_file, 'rb') as fin:
-> 1222         m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)
   1223 
   1224     model = FastText(

/opt/conda/lib/python3.7/site-packages/gensim/models/_fasttext_bin.py in load(fin, encoding, full_model)
    339         model.update(dim=magic, ws=version)
    340 
--> 341     raw_vocab, vocab_size, nwords, ntokens = _load_vocab(fin, new_format, encoding=encoding)
    342     model.update(raw_vocab=raw_vocab, vocab_size=vocab_size, nwords=nwords, ntokens=ntokens)
    343 

/opt/conda/lib/python3.7/site-packages/gensim/models/_fasttext_bin.py in _load_vocab(fin, new_format, encoding)
    192     # Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
    193     if nlabels > 0:
--> 194         raise NotImplementedError("Supervised fastText models are not supported")
    195     logger.info("loading %s words for fastText model from %s", vocab_size, fin.name)
    196 

NotImplementedError: Supervised fastText models are not supported
Asked By: IMB


Answers:

I believe, but am not certain, that in this particular case you’re getting the error because you’re trying to load a file of plain word-vectors (which FastText projects tend to name with a .vec extension) using a method designed for FastText’s native binary format, which includes subword/model info.

As a result, the loader misinterprets the file’s leading bytes as declaring a model trained in FastText’s ‘-supervised’ mode. (Gensim genuinely doesn’t support full models in that less-common mode, though it could load the end-vectors from such a model; in any case, your file isn’t actually from that mode.)

Released files that will work with load_facebook_vectors() typically end with .bin. See the docs for this method for more details:

https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_vectors

So, you could either:

  • Supply an alternate .bin-named, Facebook-FastText-formatted set of vectors (with subword info) to this method. (From a quick look at their download options, I believe the file analogous to your first try would be named crawl-300d-2M-subword.bin and be about 7.24GB in size.)

  • Load the file you have, with just its full-word vectors, via:

    from gensim.models import KeyedVectors
    model = KeyedVectors.load_word2vec_format('fasttext/crawl-300d-2M.vec', binary=False)
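
Once loaded this way, the result is an ordinary KeyedVectors instance; here is a minimal usage sketch (the query words are just illustrative):

    # Vector lookup for an in-vocabulary word: a 300-dimensional numpy array
    vector = model['king']

    # Nearest neighbours by cosine similarity
    print(model.most_similar('king', topn=5))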

In this latter case, no FastText-specific features (like the synthesis of guess-vectors for out-of-vocabulary words using subword vectors) will be available – but that info isn’t in the ‘crawl-300d-2M.vec’ file, anyway. (Those features would be available if you used the larger .bin file & .load_facebook_vectors() method above.)
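
For comparison, here is a minimal sketch of the .bin route, assuming you've downloaded the crawl-300d-2M-subword.bin file mentioned above (the out-of-vocabulary query word is made up, purely for illustration):

    from gensim.models.fasttext import load_facebook_vectors

    # The Facebook-native .bin format includes the subword n-gram vectors
    wv = load_facebook_vectors('fasttext/crawl-300d-2M-subword.bin')

    # Even a word absent from the training vocabulary gets a vector,
    # synthesized from its character n-grams; the plain .vec-based
    # KeyedVectors above would raise a KeyError for the same lookup
    vec = wv['fluffernutterizing']
    print(vec.shape)  # (300,)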

Answered By: gojomo