Masking layer vs attention_mask parameter in MultiHeadAttention

Question:

I use a MultiHeadAttention layer in my transformer model (the model is very similar to named entity recognition models). Because my data comes with different lengths, I use padding and the attention_mask parameter of MultiHeadAttention to mask the padding. If I used a Masking layer before MultiHeadAttention instead, would it have the same effect as the attention_mask parameter? Or should I use both: attention_mask and the Masking layer?

Asked By: Mykola Zotko


Answers:

The Masking layer keeps the input values as they are and creates a mask tensor that is propagated to the following layers if they can consume a mask (for example, RNN layers). You can use it if you implement your own model. If you use models from Hugging Face, you can add a Masking layer, for example, if you want to keep the mask tensor for later use; otherwise the masking operations are already built in, so there is no need to add a Masking layer at the beginning.
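A minimal sketch of that behavior (the toy shapes and values below are illustrative, not taken from the question):

```python
import tensorflow as tf

# Toy batch of shape (batch, timesteps, features); 0.0 marks padded timesteps.
x = tf.constant([[[1.0], [2.0], [0.0]],
                 [[3.0], [0.0], [0.0]]])

masking = tf.keras.layers.Masking(mask_value=0.0)
y = masking(x)  # values pass through unchanged

# A boolean mask is attached and propagated to downstream layers
# that know how to consume it (e.g. RNN layers).
print(masking.compute_mask(x))
# tf.Tensor([[ True  True False]
#            [ True False False]], shape=(2, 3), dtype=bool)
```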

Answered By: Tou You

The TensorFlow documentation on Masking and padding with Keras may be helpful.
The following is an excerpt from that guide.

When using the Functional API or the Sequential API, a mask generated
by an Embedding or Masking layer will be propagated through the
network for any layer that is capable of using them (for example, RNN
layers). Keras will automatically fetch the mask corresponding to an
input and pass it to any layer that knows how to use it.

tf.keras.layers.MultiHeadAttention also supports automatic mask propagation as of TF 2.10.0; see the sketch after the release-notes excerpt below.

Improved masking support for tf.keras.layers.MultiHeadAttention.

  • Implicit masks for query, key and value inputs will automatically be
    used to compute a correct attention mask for the layer. These padding
    masks will be combined with any attention_mask passed in directly when
    calling the layer. This can be used with tf.keras.layers.Embedding
    with mask_zero=True to automatically infer a correct padding mask.
  • Added a use_causal_mask call time argument to the layer. Passing
    use_causal_mask=True will compute a causal attention mask, and
    optionally combine it with any attention_mask passed in directly when
    calling the layer.
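A minimal sketch of the automatic mask propagation described above (layer sizes and names are illustrative assumptions; requires TF >= 2.10):

```python
import tensorflow as tf

vocab_size, d_model = 100, 32
tokens = tf.keras.Input(shape=(None,), dtype="int32")

# mask_zero=True attaches a padding mask that MultiHeadAttention picks up
# automatically, so no explicit attention_mask is needed for padding.
x = tf.keras.layers.Embedding(vocab_size, d_model, mask_zero=True)(tokens)
attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=d_model)

y = attn(query=x, value=x)  # padding mask inferred implicitly
# y = attn(query=x, value=x, use_causal_mask=True)  # additionally apply a causal mask

model = tf.keras.Model(tokens, y)
model.summary()
```

Any attention_mask passed explicitly when calling the layer is combined with the inferred padding mask rather than replacing it.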
Answered By: satojkovic