How to test masked language model after training it?

Question:

I have followed this tutorial for masked language modelling from Hugging Face using BERT, but I am unsure how to actually deploy the model.

Tutorial: https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb

I have trained the model on my own dataset, which worked fine, but I don't know how to actually use the model, as the notebook sadly does not include an example of how to do this.

Example of what I want to do with my trained model

On the Hugging Face website, this is the code used in the example; hence, I want to do this exact thing but with my model:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("Hello I'm a [MASK] model.")

[{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.1073106899857521,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a role model. [SEP]",
  'score': 0.08774490654468536,
  'token': 2535,
  'token_str': 'role'},
 {'sequence': "[CLS] hello i'm a new model. [SEP]",
  'score': 0.05338378623127937,
  'token': 2047,
  'token_str': 'new'},
 {'sequence': "[CLS] hello i'm a super model. [SEP]",
  'score': 0.04667217284440994,
  'token': 3565,
  'token_str': 'super'},
 {'sequence': "[CLS] hello i'm a fine model. [SEP]",
  'score': 0.027095865458250046,
  'token': 2986,
  'token_str': 'fine'}]

Any help on how to do this would be great.

Asked By: user14946125


Answers:

This depends a lot on your task. Your task seems to be masked language modelling, that is, predicting one or more masked words:

today I ate ___ .

(pizza) or (pasta) could be equally correct, so you cannot use a metric such as accuracy. But (water) should be less "correct" than the other two.
So what you normally do is check how "surprised" the language model is on an evaluation dataset. This metric is called perplexity.
Therefore, before and after you fine-tune a model on your specific dataset, you calculate the perplexity, and you would expect it to be lower after fine-tuning, because the model has become more used to your specific vocabulary. That is how you test your model.

As you can see, the perplexity is already calculated in the tutorial you mentioned:

import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}") 

To predict on new samples, you need to tokenize them and prepare the input for the model. The fill-mask pipeline can do this for you:

from transformers import pipeline

# if you trained your model on a GPU, move it back to the CPU first:
trainer.model.to('cpu')

# <mask> is the mask token for RoBERTa-style models like the one in the tutorial;
# BERT-style models use [MASK] instead
unmasker = pipeline('fill-mask', model=trainer.model, tokenizer=tokenizer)
unmasker("today I ate <mask>")

which results in the following output:

[{'score': 0.23618391156196594,
  'sequence': 'today I ate it.',
  'token': 24,
  'token_str': ' it'},
 {'score': 0.03940323367714882,
  'sequence': 'today I ate breakfast.',
  'token': 7080,
  'token_str': ' breakfast'},
 {'score': 0.033759087324142456,
  'sequence': 'today I ate lunch.',
  'token': 4592,
  'token_str': ' lunch'},
 {'score': 0.025962186977267265,
  'sequence': 'today I ate pizza.',
  'token': 9366,
  'token_str': ' pizza'},
 {'score': 0.01913984678685665,
  'sequence': 'today I ate them.',
  'token': 106,
  'token_str': ' them'}]
Answered By: chefhose

Closely related to perplexity, and a bit more specific to masked language model evaluation:
https://aclanthology.org/2020.acl-main.240.pdf
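
The paper scores a sentence by masking each token in turn and summing the masked-token log-probabilities (pseudo-log-likelihood), which also gives a pseudo-perplexity. A rough sketch of that idea, assuming model and tokenizer are the fine-tuned masked LM and its tokenizer:

import math
import torch

def pseudo_perplexity(text):
    input_ids = tokenizer(text, return_tensors='pt')['input_ids'][0]
    log_probs = []
    model.eval()
    with torch.no_grad():
        # skip the special tokens at the start and end of the sequence
        for i in range(1, len(input_ids) - 1):
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs.append(torch.log_softmax(logits, dim=-1)[input_ids[i]].item())
    return math.exp(-sum(log_probs) / len(log_probs))

print(pseudo_perplexity('today I ate pizza.'))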

Answered By: xtof54