Loss is NaN for SegFormer vision transformer trained on BDD10k

Question:

I’m trying to fine-tune a SegFormer model pretrained from the mit-b0 checkpoint to perform semantic segmentation on images from the BDD100K dataset. Only a 10k subset of the 100k images has segmentation masks, where each pixel value of the mask is a label between 0 and 18, or 255 for unknown labels. I’m also following this Colab example that does a simple segmentation with three labels.

The problem is that any training I do on this data ends up with NaN as the loss, and inspecting the predicted masks shows they are full of NaN values, which is not right. I’ve made sure the input training images are normalized, reduced the learning rate, increased the number of epochs, and changed the pretrained model, but I still end up with NaN as the loss right away.

I have my datasets as:

dataset = tf.data.Dataset.from_tensor_slices((image_train_paths, mask_train_paths))
val_dataset = tf.data.Dataset.from_tensor_slices((image_val_paths, mask_val_paths))

with this function to preprocess and normalize the data:

from tensorflow.keras import backend  # provides backend.epsilon() used below

height = 512
width = 512
mean = tf.constant([0.485, 0.456, 0.406])  # ImageNet channel means
std = tf.constant([0.229, 0.224, 0.225])   # ImageNet channel stds

def normalize(input_image):
    input_image = tf.image.convert_image_dtype(input_image, tf.float32)
    input_image = (input_image - mean) / tf.maximum(std, backend.epsilon())
    return input_image

# Define a function to load and preprocess each example
def load_and_preprocess(image_path, mask_path):
    # Load the image and mask
    image = tf.image.decode_jpeg(tf.io.read_file(image_path), channels=3)
    mask = tf.image.decode_jpeg(tf.io.read_file(mask_path), channels=1)

    # Preprocess the image and mask
    image = tf.image.resize(image, (height, width))
    mask = tf.image.resize(mask, (height, width), method='nearest')
    image = normalize(image)
    mask = tf.squeeze(mask, axis=-1)
    image = tf.transpose(image, perm=(2, 0, 1))
    return {'pixel_values': image, 'labels': mask}

Then I create the batched datasets:

batch_size = 4
train_dataset = (
    dataset
    .cache()
    .shuffle(batch_size * 10)
    .map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(batch_size)
    .prefetch(tf.data.AUTOTUNE)
)

validation_dataset = (
    val_dataset
    .map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(batch_size)
    .prefetch(tf.data.AUTOTUNE)
)
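
For reference, a quick sanity check on one batch from this pipeline (not part of the training code) looks like this:

batch = next(iter(train_dataset))
# Expecting shapes (4, 3, 512, 512) for images and (4, 512, 512) for masks
print(batch['pixel_values'].shape, batch['labels'].shape)
# Any NaN here would point at the image preprocessing rather than the loss
print(bool(tf.reduce_any(tf.math.is_nan(batch['pixel_values']))))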

Setting up the labels and pre-trained model:

id2label = {
    0:  'road',
    1:  'sidewalk',
    2:  'building',
    3:  'wall',
    4:  'fence',
    5:  'pole',
    6:  'traffic light',
    7:  'traffic sign',
    8:  'vegetation',
    9:  'terrain',
    10: 'sky',
    11: 'person',
    12: 'rider',
    13: 'car',
    14: 'truck',
    15: 'bus',
    16: 'train',
    17: 'motorcycle',
    18: 'bicycle',
}
label2id = { label: id for id, label in id2label.items() }
num_labels = len(id2label)

model = TFSegformerForSemanticSegmentation.from_pretrained(
    'nvidia/mit-b0',
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00001))

Finally, I fit the model, using only 1 epoch just to see if I can figure out why the loss is NaN:

epochs = 1

history = model.fit(train_dataset, validation_data=validation_dataset, epochs=epochs)

SegFormer implements its own loss function, so I don’t need to supply one. The Colab example I was following gets a normal loss, but I can’t figure out why mine is NaN.

Did I approach this correctly, or am I missing something along the way? What else can I try to figure out why the loss is NaN? I also made sure the labels match between the validation and training datasets; the pixel values range from 0 to 18, with 255 for unknown, as described in the docs.
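
For reference, a quick way to inspect which label values actually occur in a raw mask (a sanity check only, not part of the training code):

# Decode one mask and list the unique pixel values it contains
mask = tf.io.decode_image(tf.io.read_file(mask_train_paths[0]), channels=1)
values, _ = tf.unique(tf.reshape(tf.cast(mask, tf.int32), [-1]))
print(values.numpy())  # expecting values in 0-18 plus 255 for unknown pixels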

Edit: 3/16

I did find this example, which pointed out some flaws in my approach, but even after following it for everything except how the dataset is gathered, I was still unable to produce any loss other than NaN.

My new code is mostly the same, other than how I preprocess the data with numpy before converting it to tensors.

Dataset dict definition for training and validation data:

dataset = DatasetDict({
    'train': Dataset.from_dict({'pixel_values': image_train_paths, 'label': mask_train_paths})
                    .cast_column('pixel_values', Image())
                    .cast_column('label', Image()),
    'val': Dataset.from_dict({'pixel_values': image_val_paths, 'label': mask_val_paths})
                  .cast_column('pixel_values', Image())
                  .cast_column('label', Image()),
})

train_dataset = dataset['train']
val_dataset = dataset['val']

train_dataset.set_transform(preprocess)
val_dataset.set_transform(preprocess)

where preprocess is the function that processes the images with the AutoImageProcessor to get the inputs:

image_processor  = AutoImageProcessor.from_pretrained('nvidia/mit-b0', semantic_loss_ignore_index=255) # This is a SegformerImageProcessor 

def transforms(image):
    image = tf.keras.utils.img_to_array(image)
    image = image.transpose((2, 0, 1))  # Since vision models in transformers are channels-first layout
    return image


def preprocess(example_batch):
    images = [transforms(x.convert('RGB')) for x in example_batch['pixel_values']]
    labels = [x for x in example_batch['label']]
    inputs = image_processor(images, labels)
    # print(type(inputs))
    return inputs
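
To see what the processor actually produces, including whether the 255 pixels survive the processing, one transformed example can be inspected like this (a quick check; set_transform applies preprocess on access):

import numpy as np

sample = train_dataset[0]
print(sample.keys())  # expecting 'pixel_values' and 'labels'
# Min/max of the processed label map shows whether 255 is still present
print(np.asarray(sample['labels']).min(), np.asarray(sample['labels']).max())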

Transforming the sets into TensorFlow datasets:

batch_size = 4

data_collator = DefaultDataCollator(return_tensors="tf")

train_set = dataset['train'].to_tf_dataset(
    columns=['pixel_values', 'label'],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
val_set = dataset['val'].to_tf_dataset(
    columns=['pixel_values', 'label'],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)

Fitting the model:

history = model.fit(
    train_set,
    validation_data=val_set,
    epochs=10,
)

1750/1750 [==============================] - ETA: 0s - loss: nan

Asked By: Jimenemex


Answers:

This might be due to a variety of factors. One common cause is a mismatch in the number of classes: check that the prediction dimension of your model matches the number of classes in the dataset you’re working on, and double-check your dataset labels.
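
For example, a quick way to compare the model’s configured classes against one batch of labels (a rough sketch using the train_dataset and model names from the question; adjust to your own setup):

batch = next(iter(train_dataset))
outputs = model(pixel_values=batch['pixel_values'])
print(model.config.num_labels)              # number of classes the model predicts
print(outputs.logits.shape)                 # (batch, num_labels, height/4, width/4)
print(int(tf.reduce_max(batch['labels'])))  # must be < num_labels (or equal to the ignore index)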

Try this colab notebook: shorturl.at/hBHLR

Answered By: Faisal Shahbaz

As there is no reproducible code, it is hard to spot the main issue. But I’ve successfully trained a SegFormer model on the BDD10K semantic segmentation task, so I’m sharing my solution. I used the bdd100k-dataset on Kaggle and tested in the Kaggle environment; here is the complete code. To make the test more complete, I trained both the Hugging Face model (SegFormer) and an open-source segmentation model (unet-efficientnet-b0) on this dataset.

Perhaps one key difference in the gist above is how the data is processed. Utilities like DatasetDict, DefaultDataCollator, and AutoImageProcessor are absent from it; only TFSegformerForSemanticSegmentation is used. It is better to perform the data processing yourself, by looking at the code, rather than relying on such auto tools.

Another difference with the Hugging Face model is that we can’t pass a custom or built-in loss function (or metrics) to it, which is really unfortunate. Their docs show an approach for evaluating the model, but the procedure is for the torch model; to get clarity on this, it would be best to ask on the Hugging Face forum. One hedged way to compute a metric manually anyway is sketched after the training log below.

Here is the full code; below are some highlights.

Preprocess and Dataloader (BDD100K)

import os
from glob import glob

import numpy as np
import tensorflow as tf

IMAGE_SIZE = 512   # resize target; value assumed here, see the linked full code
BATCH_SIZE = 4     # value assumed here, see the linked full code

data_path = 'bdd100k/seg/'
image_path = os.path.join(data_path, 'images', 'train')
label_path = os.path.join(data_path, 'labels', 'train')

def read_files(image_path, mask=False):
    image = tf.io.read_file(image_path)
    if mask:
        image = tf.image.decode_png(image, channels=1)
        image.set_shape([None, None, 1])
        image = tf.image.resize(
            images=image, 
            size=[IMAGE_SIZE, IMAGE_SIZE], 
            method=tf.image.ResizeMethod.NEAREST_NEIGHBOR
        )
        # Remap the 255 "unknown" pixels to a valid class id (here 0) so that
        # every label falls inside the class range; stray 255s are what make the loss NaN
        image = tf.where(image == 255, np.dtype('uint8').type(0), image)
        image = tf.cast(image, tf.int32)
    else:
        image = tf.image.decode_png(image, channels=3)
        image.set_shape([None, None, 3])
        image = tf.image.resize(images=image, size=[IMAGE_SIZE, IMAGE_SIZE])
        image = image / 255.
    return image

def load_data_HF(image_list, mask_list):
    image = read_files(image_list)
    mask  = read_files(mask_list, mask=True)
    image = tf.transpose(image, (2, 0, 1))
    mask = tf.squeeze(mask)
    return {"pixel_values": image, "labels": mask}

def data_generator_HF(image_list, mask_list, split='train'):
    dataset = tf.data.Dataset.from_tensor_slices((image_list, mask_list))
    dataset = dataset.shuffle(10 * BATCH_SIZE) if split == 'train' else dataset 
    dataset = dataset.map(load_data_HF, num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.batch(BATCH_SIZE, drop_remainder=False)
    return dataset.prefetch(tf.data.AUTOTUNE)

# Build matching, sorted lists of image and mask file paths
# (the glob pattern is assumed; adjust it to the dataset layout on disk)
images = sorted(glob(os.path.join(image_path, '*')))
masks = sorted(glob(os.path.join(label_path, '*')))

train_ds_hf = data_generator_HF(images, masks)
val_ds_hf = data_generator_HF(images, masks, split='validation')

Segformer (Huggingface-Transformer)

from transformers import TFSegformerForSemanticSegmentation

model_checkpoint = "nvidia/mit-b0"
num_labels = n_classes
model_hf = TFSegformerForSemanticSegmentation.from_pretrained(
    model_checkpoint,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
)

optim = tf.keras.optimizers.Adam(0.001)
model_hf.compile(optimizer=optim)

model_hf.fit(
    train_ds_hf, 
    validation_data=val_ds_hf,
    epochs=5
)

Epoch 1/5
73s 954ms/step - loss: 1.3698 - val_loss: 1.0942
Epoch 2/5
7s 382ms/step - loss: 0.8068 - val_loss: 1.0197
Epoch 3/5
7s 375ms/step - loss: 0.7110 - val_loss: 0.8641
Epoch 4/5
7s 388ms/step - loss: 0.6365 - val_loss: 0.8025
Epoch 5/5
7s 377ms/step - loss: 0.5678 - val_loss: 0.7920
<keras.callbacks.History at 0x7f5f11bf9710>
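
Since metrics can’t be passed to compile for this model (as noted above), a hedged workaround is to compute mean IoU manually over the validation set after training. This sketch reuses the names from the snippets above and is not part of the linked code:

miou = tf.keras.metrics.MeanIoU(num_classes=num_labels)

for batch in val_ds_hf:
    logits = model_hf(pixel_values=batch['pixel_values'], training=False).logits
    # Logits come out at a reduced resolution, so upsample them to the label size
    logits = tf.transpose(logits, [0, 2, 3, 1])
    logits = tf.image.resize(logits, size=tf.shape(batch['labels'])[1:], method='bilinear')
    preds = tf.argmax(logits, axis=-1)
    miou.update_state(batch['labels'], preds)

print('val mean IoU:', float(miou.result()))
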
---

Some resource

Answered By: Innat

You didn’t set a loss function in your model.compile. Change your model.compile line to this:

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00001), loss='categorical_crossentropy', metrics=['accuracy'])

This should make your code work just fine

Answered By: mail_liw

I recently ran into this problem as well. It stems from the use of 255 as the background class label and from how SparseCategoricalCrossentropy works under the hood: the Hugging Face code attempts to mask out the offending values, but NaN * mask is still NaN.

>>> y_true = [255, 1]  # any label index >= the number of predicted classes will produce NaN
>>> y_pred = [[0.05, 0, 0.95], [0.1, 0.8, 0.1]]
>>> scce = tf.keras.losses.SparseCategoricalCrossentropy(reduction="none")
>>> scce(y_true, y_pred).numpy()
array([       nan, 0.22314355], dtype=float32)

You have two ways to address this issue:

  • Include the background class as part of your labels (not ideal); a sketch of this option follows after the list.
  • Fix the loss function under the hood so that it properly ignores the background class.
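
For reference, a minimal sketch of the first option, remapping the 255 value to an extra background class (the names here are illustrative, not from the original code):

NUM_CLASSES = 19              # original BDD labels 0-18
BACKGROUND_ID = NUM_CLASSES   # extra class that replaces the 255 ignore value

def remap_ignore_label(mask):
    # Give the unknown pixels their own class so every label index is valid;
    # the model then needs num_labels = NUM_CLASSES + 1
    return tf.where(mask == 255, tf.cast(BACKGROUND_ID, mask.dtype), mask)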

I went with the second option because I also wanted to include class weights in my loss. My hacky but working solution overrides the underlying loss function:

# `weights` is a per-class weight list/array of length num_labels, defined elsewhere
constant_weights = tf.constant(weights)

#based off the currently used loss https://github.com/huggingface/transformers/blob/v4.27.1/src/transformers/models/segformer/modeling_tf_segformer.py#L800
def custom_loss(logits, labels):
    # `labels` is of shape (batch_size, height, width)
    # logits are predicted as (batch_size, classes, height//4, width//4)
    logits = tf.transpose(logits, [0, 2, 3, 1])
    # `train_batch` is a batch taken from the training set beforehand; its label
    # shape gives the target size for upsampling the logits
    label_interp_shape = train_batch['labels'].shape[1:]
    upsampled_logits = tf.image.resize(logits, size=label_interp_shape, method="bilinear")
    
    loss_fct = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True,
        reduction="none",
        ignore_class=model.config.semantic_loss_ignore_index # missing from original 
    )
    
    def masked_loss(real, pred):
        label_weights = tf.one_hot(real, len(weights))*constant_weights
        label_weights = tf.reduce_sum(label_weights, axis=-1)
        unmasked_loss = loss_fct(real, pred)
        unmasked_loss *= label_weights
        mask = tf.cast(real != model.config.semantic_loss_ignore_index, dtype=unmasked_loss.dtype)
        masked_loss = unmasked_loss * mask
        reduced_masked_loss = tf.reduce_sum(masked_loss) /  tf.reduce_sum(mask)
        return tf.reshape(reduced_masked_loss, (1,))

    return masked_loss(labels, upsampled_logits)

model.hf_compute_loss = custom_loss
Answered By: koontz