Multi-instance classification using a transformer model
Question:
I use the transformer from this Keras documentation example for multi-instance classification. The class of each instance depends on the other instances in the same bag. I use a transformer model because:
It makes no assumptions about the temporal/spatial relationships across the data. This is ideal for processing a set of objects
For example, each bag may have at most 5 instances, with 3 features per instance.
# Generate data
import numpy as np

max_length = 5
x_lst = []
y_lst = []
for _ in range(10):
    num_instances = np.random.randint(2, max_length + 1)
    x_bag = np.random.randint(0, 9, size=(num_instances, 3))
    y_bag = np.random.randint(0, 2, size=num_instances)
    x_lst.append(x_bag)
    y_lst.append(y_bag)
Features and labels of the first 2 bags (with 5 and 2 instances):
x_lst[:2]
[array([[8, 0, 3],
        [8, 1, 0],
        [4, 6, 8],
        [1, 6, 4],
        [7, 4, 6]]),
 array([[5, 8, 4],
        [2, 1, 1]])]
y_lst[:2]
[array([0, 1, 1, 1, 0]), array([0, 0])]
Next, I pad features with zeros and targets with -1:
x_padded = []
y_padded = []
for x, y in zip(x_lst, y_lst):
    x_p = np.zeros((max_length, 3))
    x_p[:x.shape[0], :x.shape[1]] = x
    x_padded.append(x_p)
    y_p = np.negative(np.ones(max_length))
    y_p[:y.shape[0]] = y
    y_padded.append(y_p)
X = np.stack(x_padded)
y = np.stack(y_padded)
where X.shape is equal to (10, 5, 3) and y.shape is equal to (10, 5).
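One caveat with zero-padding the features: a real instance whose three features are all 0 looks exactly like padding, and np.random.randint(0, 9, ...) can produce one. A minimal sketch, assuming the bag sizes are still available, derives the validity mask from the lengths instead of from the feature values. The names lengths and valid and the example sizes are illustrative, not part of the original code.

```python
import numpy as np

max_length = 5
# Hypothetical bag sizes; in the question these would be len(x) for x in x_lst.
lengths = np.array([5, 2, 3])

# valid[i, j] is True when position j of bag i holds a real instance.
valid = np.arange(max_length)[None, :] < lengths[:, None]

# Padded targets in the question's convention: -1 marks padding.
y_demo = np.full(valid.shape, -1.0)
y_demo[valid] = 0.0  # placeholder labels for the real instances

print(valid.astype(int))
```

With targets padded by -1, the same mask can also be recovered as y != -1, which is what the custom loss further down relies on.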
I made two changes to the original model: added the Masking layer after the Input layer and set the number of units in the last Dense layer to the maximal size of the bag (plus ‘sigmoid’ activation):
from tensorflow import keras
from tensorflow.keras import layers

def transformer_encoder(inputs, head_size, num_heads, ff_dim, dropout=0):
    # Attention and Normalization
    x = layers.MultiHeadAttention(
        key_dim=head_size, num_heads=num_heads, dropout=dropout
    )(inputs, inputs)
    x = layers.Dropout(dropout)(x)
    x = layers.LayerNormalization(epsilon=1e-6)(x)
    res = x + inputs
    # Feed Forward Part
    x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(res)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
    x = layers.LayerNormalization(epsilon=1e-6)(x)
    return x + res
def build_model(
    input_shape,
    head_size,
    num_heads,
    ff_dim,
    num_transformer_blocks,
    mlp_units,
    dropout=0,
    mlp_dropout=0,
):
    inputs = keras.Input(shape=input_shape)
    # ADDED MASKING LAYER (stored in x so that `inputs` still refers to the
    # Input tensor that is passed to keras.Model below)
    x = keras.layers.Masking(mask_value=0)(inputs)
    for _ in range(num_transformer_blocks):
        x = transformer_encoder(x, head_size, num_heads, ff_dim, dropout)
    x = layers.GlobalAveragePooling1D(data_format="channels_first")(x)
    for dim in mlp_units:
        x = layers.Dense(dim, activation="relu")(x)
        x = layers.Dropout(mlp_dropout)(x)
    outputs = layers.Dense(5, activation='sigmoid')(x)  # CHANGED ACCORDING TO MY OUTPUT
    return keras.Model(inputs, outputs)
input_shape = (5, 3)
model = build_model(
    input_shape,
    head_size=256,
    num_heads=4,
    ff_dim=4,
    num_transformer_blocks=4,
    mlp_units=[128],
    mlp_dropout=0.4,
    dropout=0.25,
)
model.compile(
    loss="binary_crossentropy",
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    metrics=["binary_accuracy"],
)
model.summary()
It looks like my model doesn’t learn much. If I use the number of true values for each bag (y.sum(axis=1) and Dense(1)) as the target instead of classifying each instance, the model learns well. Where is my error? How should I build the output layer in this case? Do I need a custom loss function?
UPDATE:
I made a custom loss function:
import tensorflow as tf

def my_loss_fn(y_true, y_pred):
    # 1 for real labels, 0 for the -1 padding
    mask = tf.cast(tf.math.not_equal(y_true, tf.constant(-1.)), tf.float32)
    y_true, y_pred = tf.expand_dims(y_true, axis=-1), tf.expand_dims(y_pred, axis=-1)
    bce = tf.keras.losses.BinaryCrossentropy(reduction='none')
    return tf.reduce_sum(tf.cast(bce(y_true, y_pred), tf.float32) * mask)
mask = (y_test != -1).astype(int)
pd.DataFrame({'n_labels': mask.sum(axis=1), 'preds': ((preds * mask) >= .5).sum(axis=1)}).plot(figsize=(20, 5))
And it looks like the model learns:
But it predicts all nonmasked labels as 1.
@thushv89 This is my problem. I take 2 time points: t1 and t2 and look for all vehicles that are in maintenance at the time t1 and for all vehicles that are planned to be in maintenance at the time t2. So, this is my bag of items. Then I calculate features like how much time t1 vehicles have already spent in maintenance, how much time from t1 to the plan start for t2 vehicle etc. My model learns well if I try to predict the number of vehicles in maintenance at the time t2, but I would like to predict which of them will leave and which of them will come in (3 vs [True, False, True, True] for 4 vehicles in the bag).
Answers:
There are three important improvements:
- Remove the GlobalAveragePooling1D. It’s a kind of bottleneck (data compression) if you make a prediction for each item. Without this layer, you also get a two-dimensional tensor with the max number of items in the first dimension (5 in my case) for free in the output. So, you can set the last Dense layer to the number of categories (1 in my case).
- Add a custom loss function to exclude target padding from calculation (already added to my question) and a custom metric function if you want to see the real metric.
- Add an attention_mask to the MultiHeadAttention (instead of Masking layer) to mask the padding.
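The three points above can be sketched together as follows. This is only a minimal sketch under stated assumptions, not the exact code used for the answer: the names encoder_block, build_per_instance_model and masked_bce, the small hyperparameters, the second model input carrying the attention mask, and the final Reshape are all illustrative choices. Dropout is left out for brevity, and the loss is averaged over the real labels rather than summed.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN, N_FEATURES = 5, 3


def encoder_block(x, attn_mask, head_size, num_heads, ff_dim):
    # Self-attention; padded keys are hidden through attention_mask.
    attn = layers.MultiHeadAttention(key_dim=head_size, num_heads=num_heads)(
        x, x, attention_mask=attn_mask
    )
    attn = layers.LayerNormalization(epsilon=1e-6)(attn)
    res = layers.Add()([attn, x])
    # Position-wise feed-forward part, as in the question.
    ff = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(res)
    ff = layers.Conv1D(filters=N_FEATURES, kernel_size=1)(ff)
    ff = layers.LayerNormalization(epsilon=1e-6)(ff)
    return layers.Add()([ff, res])


def build_per_instance_model(num_blocks=2):
    features = keras.Input(shape=(MAX_LEN, N_FEATURES))
    attn_mask = keras.Input(shape=(MAX_LEN, MAX_LEN), dtype="bool")
    x = features
    for _ in range(num_blocks):
        x = encoder_block(x, attn_mask, head_size=16, num_heads=2, ff_dim=4)
    # No GlobalAveragePooling1D: Dense acts on the last axis, so every
    # instance keeps its own sigmoid output.
    x = layers.Dense(1, activation="sigmoid")(x)   # (batch, MAX_LEN, 1)
    outputs = layers.Reshape((MAX_LEN,))(x)        # (batch, MAX_LEN)
    return keras.Model([features, attn_mask], outputs)


def masked_bce(y_true, y_pred):
    # Binary cross-entropy averaged over real labels only (-1 marks padding).
    y_true = tf.cast(y_true, y_pred.dtype)
    mask = tf.cast(tf.not_equal(y_true, -1.0), y_pred.dtype)
    p = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
    labels = y_true * mask  # padded labels become 0; they are masked out below
    bce = -(labels * tf.math.log(p) + (1.0 - labels) * tf.math.log(1.0 - p))
    return tf.reduce_sum(bce * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)


# Toy data in the question's layout: zero-padded features, -1-padded labels.
lengths = np.array([5, 2, 3, 4])
valid = np.arange(MAX_LEN)[None, :] < lengths[:, None]            # (batch, MAX_LEN)
X = (np.random.rand(4, MAX_LEN, N_FEATURES) * valid[..., None]).astype("float32")
y = np.where(valid, np.random.randint(0, 2, size=valid.shape), -1).astype("float32")
# Every query may attend only to real instances: shape (batch, MAX_LEN, MAX_LEN).
attn_mask = np.repeat(valid[:, None, :], MAX_LEN, axis=1)

model = build_per_instance_model()
model.compile(loss=masked_bce, optimizer=keras.optimizers.Adam(learning_rate=1e-3))
model.fit([X, attn_mask], y, epochs=1, verbose=0)
preds = model.predict([X, attn_mask], verbose=0)   # one probability per position
```

Predictions at padded positions are still produced, so they should be ignored (for example by multiplying with valid) when reading the results.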
Just a simple add-on to @Mykola_Zotko's answer, for new users who are learning deep learning with keras and tensorflow.
Remove the GlobalAveragePooling1D
For context, GlobalAveragePooling1D is a global average pooling operation for temporal data. When you remove this layer, you remove the "pooling" step, or, in @Mykola_Zotko's words:
… you get a two-dimensional tensor with the max number of items in the first dimension (5 in my case) for free in the output
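To make the shapes concrete, here is a small NumPy sketch that emulates what the pooling layer computes on the question's input of shape (10, 5, 3). It uses plain means rather than the Keras layer itself, and the variable names are illustrative.

```python
import numpy as np

X = np.zeros((10, 5, 3))  # (bags, instances per bag, features), as in the question

# data_format="channels_first" treats the input as (batch, features, steps)
# and averages over the last axis, so the 3 features of each instance
# collapse into a single number.
pooled_channels_first = X.mean(axis=-1)

# The default data_format="channels_last" averages over the steps axis,
# so the 5 instances collapse into one feature vector per bag.
pooled_channels_last = X.mean(axis=1)

print(pooled_channels_first.shape)  # (10, 5)
print(pooled_channels_last.shape)   # (10, 3)
```

Either way, information is averaged away before the Dense layers see it, which is the bottleneck the answer refers to. Without pooling, the tensor keeps its (10, 5, 3) shape and a Dense(1) layer yields one output per instance.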
The alias is tf.keras.layers.GlobalAvgPool1D, and the layer's signature is:
tf.keras.layers.GlobalAveragePooling1D(
    data_format="channels_last", **kwargs
)
Add a custom loss function
The purpose of a loss function is "to generate the quantity that a model should seek to minimize during training time" (Source).
Or in other terms:
In mathematical optimization, statistics, machine learning and Deep Learning the Loss Function (also known as Cost Function or Error Function) is a function that defines a correlation between a series of values and a real number. That number represents conceptually the cost associated with an event or a set of values. In general, the goal of an optimization procedure is to minimize the loss function. Towardsdatascience – custom loss function in tensorflow
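As a worked example of excluding the target padding, the following NumPy sketch computes a masked binary cross-entropy and a masked accuracy for one bag with two real instances. The numbers are made up for illustration, and dividing by the number of real labels (rather than summing, as the loss in the question does) is a choice made here so that bags of different sizes are comparable.

```python
import numpy as np

y_true = np.array([[1.0, 0.0, -1.0, -1.0, -1.0]])  # -1 marks padding
y_pred = np.array([[0.8, 0.4, 0.9, 0.9, 0.9]])     # model probabilities

mask = (y_true != -1).astype(float)                 # 1 for real labels, 0 for padding
p = np.clip(y_pred, 1e-7, 1 - 1e-7)                 # avoid log(0)
labels = y_true * mask                              # padded labels become 0, masked out below

# Element-wise binary cross-entropy.
bce = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))

# Average only over real labels.
masked_loss = (bce * mask).sum() / mask.sum()

# Accuracy over real labels versus accuracy that also counts the padding.
correct = ((p >= 0.5) == (labels == 1)).astype(float)
masked_acc = (correct * mask).sum() / mask.sum()
naive_acc = correct.mean()

print(round(masked_loss, 4), masked_acc, naive_acc)  # 0.367 1.0 0.4
```

The gap between the masked and the naive accuracy shows why a custom metric is needed to see the real performance.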
Add an attention_mask to the MultiHeadAttention
Alias:
tf.keras.layers.MultiHeadAttention
Signature:
tf.keras.layers.MultiHeadAttention(
    num_heads,
    key_dim,
    value_dim=None,
    dropout=0.0,
    use_bias=True,
    output_shape=None,
    attention_axes=None,
    kernel_initializer='glorot_uniform',
    bias_initializer='zeros',
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)
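Note that attention_mask is not a constructor argument; it is passed when the layer is called, together with the query and value tensors. What the mask does can be illustrated with a small NumPy sketch of a masked softmax for a single bag. This is an illustrative re-implementation of the idea (masked keys receive a very large negative score before the softmax), not the library's code, and masked_softmax is a hypothetical helper name.

```python
import numpy as np


def masked_softmax(scores, key_mask):
    # Masked keys get a very large negative score, so their weight becomes ~0.
    scores = np.where(key_mask, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


max_length = 5
valid = np.array([True, True, False, False, False])  # bag with 2 real instances

# Equal raw attention scores between all (query, key) pairs of one bag.
scores = np.zeros((max_length, max_length))

# Attention mask of shape (T, S): every query may look only at real keys.
key_mask = np.repeat(valid[None, :], max_length, axis=0)

weights = masked_softmax(scores, key_mask)
print(weights[0])  # [0.5 0.5 0.  0.  0. ]
```

For a batch, the same construction gives a boolean array of shape (batch, T, S), which is the shape attention_mask expects.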
Previous improvements that were made to the code:
- Changing metrics=["accuracy"] to metrics=["binary_accuracy"]:
model.compile(
    loss="binary_crossentropy",
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    metrics=["binary_accuracy"],
)
- Using Crossentropy in the custom loss function