Multi-instance classification using a transformer model

Question:

I use the transformer from this Keras documentation example for multi-instance classification. The class of each instance depends on the other instances that come in the same bag. I use a transformer model because:

It makes no assumptions about the temporal/spatial relationships across the data. This is ideal for processing a set of objects

For example, each bag may have at most 5 instances, with 3 features per instance.

import numpy as np

# Generate data
max_length = 5  # maximum number of instances per bag
x_lst = []
y_lst = []
for _ in range(10):
    # each bag holds between 2 and max_length instances
    num_instances = np.random.randint(2, max_length + 1)
    x_bag = np.random.randint(0, 9, size=(num_instances, 3))  # 3 features per instance
    y_bag = np.random.randint(0, 2, size=num_instances)       # binary label per instance
    
    x_lst.append(x_bag)
    y_lst.append(y_bag)

Features and labels of the first 2 bags (with 5 and 2 instances):

x_lst[:2]

[array([[8, 0, 3],
        [8, 1, 0],
        [4, 6, 8],
        [1, 6, 4],
        [7, 4, 6]]),
 array([[5, 8, 4],
        [2, 1, 1]])]

y_lst[:2]

[array([0, 1, 1, 1, 0]), array([0, 0])]

Next, I pad features with zeros and targets with -1:

x_padded = []
y_padded = []

for x, y in zip(x_lst, y_lst):
    # pad the features with zeros up to max_length instances
    x_p = np.zeros((max_length, 3))
    x_p[:x.shape[0], :x.shape[1]] = x
    x_padded.append(x_p)

    # pad the labels with -1 so the padding can be masked out later
    y_p = np.negative(np.ones(max_length))
    y_p[:y.shape[0]] = y
    y_padded.append(y_p)

X = np.stack(x_padded)
y = np.stack(y_padded)

where X.shape is equal to (10, 5, 3) and y.shape is equal to (10, 5).
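Equivalently, the padding loop above can be replaced with Keras' built-in pad_sequences helper (just an alternative sketch, producing the same shapes):

from tensorflow.keras.preprocessing.sequence import pad_sequences

X = pad_sequences(x_lst, maxlen=max_length, padding='post', value=0.0, dtype='float32')
y = pad_sequences(y_lst, maxlen=max_length, padding='post', value=-1.0, dtype='float32')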

I made two changes to the original model: I added a Masking layer after the Input layer and set the number of units in the last Dense layer to the maximum bag size (plus a ‘sigmoid’ activation):

from tensorflow import keras
from tensorflow.keras import layers

def transformer_encoder(inputs, head_size, num_heads, ff_dim, dropout=0):
    # Attention and Normalization
    x = layers.MultiHeadAttention(
        key_dim=head_size, num_heads=num_heads, dropout=dropout
    )(inputs, inputs)
    x = layers.Dropout(dropout)(x)
    x = layers.LayerNormalization(epsilon=1e-6)(x)
    res = x + inputs

    # Feed Forward Part
    x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(res)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
    x = layers.LayerNormalization(epsilon=1e-6)(x)
    return x + res

def build_model(
    input_shape,
    head_size,
    num_heads,
    ff_dim,
    num_transformer_blocks,
    mlp_units,
    dropout=0,
    mlp_dropout=0,
):
    inputs = keras.Input(shape=input_shape)
    # ADDED MASKING LAYER; assigned to a new variable so that keras.Model
    # below still receives the actual Input tensor
    x = layers.Masking(mask_value=0)(inputs)
    for _ in range(num_transformer_blocks):
        x = transformer_encoder(x, head_size, num_heads, ff_dim, dropout)

    x = layers.GlobalAveragePooling1D(data_format="channels_first")(x)
    for dim in mlp_units:
        x = layers.Dense(dim, activation="relu")(x)
        x = layers.Dropout(mlp_dropout)(x)
    outputs = layers.Dense(5, activation='sigmoid')(x) # CHANGED ACCORDING TO MY OUTPUT
    return keras.Model(inputs, outputs)

input_shape = (5, 3)

model = build_model(
    input_shape,
    head_size=256,
    num_heads=4,
    ff_dim=4,
    num_transformer_blocks=4,
    mlp_units=[128],
    mlp_dropout=0.4,
    dropout=0.25,
)

model.compile(
    loss="binary_crossentropy",
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    metrics=["binary_accuracy"],
)
model.summary()
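(Training is a plain fit call on the padded arrays; the epochs and batch size in this sketch are placeholders, not the exact values used:)

model.fit(X, y, epochs=100, batch_size=4, validation_split=0.2)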

It looks like my model doesn’t learn much. If I use the number of true values in each bag (y.sum(axis=1) and Dense(1)) as the target instead of classifying each instance, the model learns well. Where is my error? How should I build the output layer in this case? Do I need a custom loss function?

UPDATE:
I made a custom loss function:

import tensorflow as tf

def my_loss_fn(y_true, y_pred):
    # mask out the padded positions (labelled -1)
    mask = tf.cast(tf.math.not_equal(y_true, tf.constant(-1.)), tf.float32)
    y_true, y_pred = tf.expand_dims(y_true, axis=-1), tf.expand_dims(y_pred, axis=-1)
    bce = tf.keras.losses.BinaryCrossentropy(reduction='none')
    # sum the per-instance losses, keeping only the real (non-padded) instances
    return tf.reduce_sum(tf.cast(bce(y_true, y_pred), tf.float32) * mask)
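To use it, the model is compiled with this loss in place of the built-in one (a minimal sketch, with the same optimizer settings as above):

model.compile(
    loss=my_loss_fn,
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
)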

import pandas as pd

# y_test and preds come from a held-out split and model.predict (not shown)
mask = (y_test != -1).astype(int)
# compare the true number of labels per bag with the number of predicted positives
pd.DataFrame({'n_labels': mask.sum(axis=1),
              'preds': ((preds * mask) >= .5).sum(axis=1)}).plot(figsize=(20, 5))

And it looks like the model learns (training plot omitted).

But it predicts all non-masked labels as 1 (plot omitted).

@thushv89 This is my problem: I take two time points, t1 and t2, and look for all vehicles that are in maintenance at time t1 and for all vehicles that are planned to be in maintenance at time t2. This is my bag of items. Then I calculate features such as how much time the t1 vehicles have already spent in maintenance, how much time remains from t1 to the planned start for a t2 vehicle, etc. My model learns well if I try to predict the number of vehicles in maintenance at time t2, but I would like to predict which of them will leave and which will come in (3 vs [True, False, True, True] for 4 vehicles in the bag).

Asked By: Mykola Zotko


Answers:

There are three important improvements:

  1. Remove the GlobalAveragePooling1D. It’s a kind of bottleneck (data compression) if you make a prediction for each item. Without this layer, you also get a two-dimensional tensor with the max number of items in the first dimension (5 in my case) for free in the output, so you can set the last Dense layer to the number of categories (1 in my case).
  2. Add a custom loss function that excludes the target padding from the calculation (already added to my question), and a custom metric function if you want to see the real metric.
  3. Add an attention_mask to the MultiHeadAttention (instead of the Masking layer) to mask the padding. A sketch combining all three changes follows this list.
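A minimal sketch combining the three improvements (TF 2.x / tf.keras; the mask construction and the loss normalization are illustrative choices, not the only way to do it):

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def transformer_encoder(inputs, attn_mask, head_size, num_heads, ff_dim, dropout=0):
    # attention with an explicit attention_mask instead of a Masking layer
    x = layers.MultiHeadAttention(
        key_dim=head_size, num_heads=num_heads, dropout=dropout
    )(inputs, inputs, attention_mask=attn_mask)
    x = layers.Dropout(dropout)(x)
    x = layers.LayerNormalization(epsilon=1e-6)(x)
    res = x + inputs

    x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(res)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
    x = layers.LayerNormalization(epsilon=1e-6)(x)
    return x + res

def build_model(input_shape, head_size, num_heads, ff_dim,
                num_transformer_blocks, mlp_units, dropout=0, mlp_dropout=0):
    inputs = keras.Input(shape=input_shape)
    # (batch, seq) padding mask: an instance is real if any feature is non-zero
    pad = tf.reduce_any(tf.not_equal(inputs, 0.0), axis=-1)
    # (batch, seq, seq) attention mask: attend only between real instances
    attn_mask = tf.logical_and(pad[:, tf.newaxis, :], pad[:, :, tf.newaxis])

    x = inputs
    for _ in range(num_transformer_blocks):
        x = transformer_encoder(x, attn_mask, head_size, num_heads, ff_dim, dropout)

    # no GlobalAveragePooling1D: keep one vector per instance
    for dim in mlp_units:
        x = layers.Dense(dim, activation="relu")(x)
        x = layers.Dropout(mlp_dropout)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # (batch, seq, 1)
    return keras.Model(inputs, outputs)

def masked_bce(y_true, y_pred):
    # y_true: (batch, seq) with -1 at padded positions; y_pred: (batch, seq, 1)
    y_pred = tf.squeeze(y_pred, axis=-1)
    mask = tf.cast(tf.not_equal(y_true, -1.0), tf.float32)
    y_true = tf.clip_by_value(y_true, 0.0, 1.0)  # neutralize the -1 padding
    bce = keras.losses.binary_crossentropy(
        y_true[..., tf.newaxis], y_pred[..., tf.newaxis]
    )  # -> (batch, seq)
    return tf.reduce_sum(bce * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)

model = build_model((5, 3), head_size=256, num_heads=4, ff_dim=4,
                    num_transformer_blocks=4, mlp_units=[128],
                    mlp_dropout=0.4, dropout=0.25)
model.compile(loss=masked_bce, optimizer=keras.optimizers.Adam(learning_rate=1e-4))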
Answered By: Mykola Zotko

Just a simple add-on to @Mykola_Zotko’s answer for new users who are learning deep learning with Keras and TensorFlow.


Remove the GlobalAveragePooling1D

For context, GlobalAveragePooling1D performs global average pooling over temporal data. When you remove this layer, you remove the "pooling" (compression) step; in the simpler terms of @Mykola_Zotko:

… you get a two-dimensional tensor with the max number of items in the first dimension (5 in my case) for free in the output

The alias is:

tf.keras.layers.GlobalAvgPool1D

and the signature of the layer:

tf.keras.layers.GlobalAveragePooling1D(
    data_format="channels_last", **kwargs
)

The source can be found in the TensorFlow API documentation.
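To see what the layer does (and hence what removing it buys you), note that the pooling collapses the instance axis, so per-instance outputs are only possible without it; a quick shape check:

import tensorflow as tf

x = tf.random.normal((2, 5, 3))  # (batch, instances, features)
pooled = tf.keras.layers.GlobalAveragePooling1D()(x)
print(pooled.shape)  # (2, 3) -- the instance axis is gone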


Add a custom loss function

What a loss function does is "generate the quantity that a model should seek to minimize during training time" (source: Keras documentation).

Or in other terms:

In mathematical optimization, statistics, machine learning and deep learning, the loss function (also known as cost function or error function) is a function that defines a correlation between a series of values and a real number. That number conceptually represents the cost associated with an event or a set of values. In general, the goal of an optimization procedure is to minimize the loss function. (Towards Data Science – custom loss function in TensorFlow)
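In Keras, any callable taking (y_true, y_pred) and returning a loss value can be passed to compile. A minimal template (model here stands for any already-built Keras model):

import tensorflow as tf

def my_mae(y_true, y_pred):
    # per-sample mean absolute error over the last axis
    return tf.reduce_mean(tf.abs(y_true - y_pred), axis=-1)

model.compile(optimizer="adam", loss=my_mae)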


Add an attention_mask to the MultiHeadAttention

Alias:

tf.keras.layers.MultiHeadAttention

The signature:

tf.keras.layers.MultiHeadAttention(
    num_heads,
    key_dim,
    value_dim=None,
    dropout=0.0,
    use_bias=True,
    output_shape=None,
    attention_axes=None,
    kernel_initializer='glorot_uniform',
    bias_initializer='zeros',
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)

The source can be found in the TensorFlow API documentation.
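A short example of passing an attention_mask so that padded instances are ignored (the mask values here are made up for illustration):

import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=256)
x = tf.random.normal((2, 5, 3))  # (batch, instances, features)
# True = real instance, False = padding (the second bag has only 2 instances)
keep = tf.constant([[True, True, True, True, True],
                    [True, True, False, False, False]])
attn_mask = tf.logical_and(keep[:, tf.newaxis, :], keep[:, :, tf.newaxis])  # (2, 5, 5)
out = mha(x, x, attention_mask=attn_mask)
print(out.shape)  # (2, 5, 3)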


Previous improvements that were made to the code:

  • metrics=["accuracy"] changed to metrics=["binary_accuracy"]:

model.compile(
    loss="binary_crossentropy",
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    metrics=["binary_accuracy"],
)

  • Using crossentropy in the custom loss function
Answered By: DialFrost