Continue training a FastText model
Question:
I have downloaded a .bin FastText model, and I use it with gensim as follows:

model = FastText.load_fasttext_format("cc.fr.300.bin")
I would like to continue training the model to adapt it to my domain. After checking FastText’s GitHub and the Gensim documentation, it seems this is not currently feasible apart from using this person’s proposed modification (not yet merged).
Am I missing something?
Answers:
The official FastText implementation currently doesn’t support that, although there is an open ticket related to this issue which you can find here.
You can continue training in some versions of Gensim’s fastText (for example, v3.7.*). Here is an example of loading, inferring, and continuing training:
from gensim.test.utils import datapath
from gensim.models.fasttext import load_facebook_model

model = load_facebook_model(datapath("crime-and-punishment.bin"))
sent = [['lord', 'of', 'the', 'rings'], ['lord', 'of', 'the', 'semi-groups']]
model.build_vocab(sent, update=True)
model.train(sentences=sent, total_examples=len(sent), epochs=5)
For some reason, gensim.models.fasttext.load_facebook_model() has been reported missing on Windows installations while present on macOS. Alternatively, one can use gensim.models.FastText.load_fasttext_format() (deprecated in later Gensim releases) to load a pre-trained model and continue training.
Here are various pre-trained Wiki word models and vectors (or here).
Another example: “Note: As in the case of Word2Vec, you can continue to train your model while using Gensim’s native implementation of fastText.”
Pull request #1327 (https://github.com/facebookresearch/fastText/pull/1327) allows for:
- test after each epoch
- checkpointing
- training on data too large to fit into memory (the largest I tested was 1.6 TB)
- finetuning already trained models
The trained model is indistinguishable from one created by the original tool and can be used for inference by the old code.