Sigmoid activation output layer produces many near-1 values

Question:

🙂

I have a dataset of ~16,000 .wav recordings from 70 bird species.
I’m training a TensorFlow model to classify the mel-spectrograms of these recordings using convolution-based architectures.

One of the architectures used is a simple multi-layer convolutional network, described below.
The pre-processing phase includes:

  1. Extract mel-spectrograms and convert them to dB scale.
  2. Segment the audio into 1-second segments (pad with zeros or Gaussian noise if the residual is longer than 250 ms, discard it otherwise).
  3. Z-score normalization of the training data: subtract the mean and divide the result by the standard deviation.

Pre-processing at inference time:

  1. Same as described above.
  2. Z-score normalization using the training statistics: subtract the training-set mean and divide the result by the training-set standard deviation (a code sketch of the whole pipeline follows this list).
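
A rough sketch of the pipeline above, for illustration only. The sample rate, mel-band count, and the use of librosa are assumptions (the question does not state them), and segmentation is done here in the audio domain before spectrogram extraction:

import numpy as np
import librosa

SR = 22050       # assumed sample rate (not stated in the question)
N_MELS = 128     # assumed number of mel bands
SEG_LEN = SR     # 1-second segments

def preprocess(path, train_mean=None, train_std=None):
    # Load audio and cut it into 1-second segments; a residual longer than
    # 250 ms is zero-padded (Gaussian-noise padding works the same way),
    # a shorter residual is discarded.
    audio, _ = librosa.load(path, sr=SR)
    segments = []
    for start in range(0, len(audio), SEG_LEN):
        seg = audio[start:start + SEG_LEN]
        if len(seg) == SEG_LEN:
            segments.append(seg)
        elif len(seg) > SR // 4:
            segments.append(np.pad(seg, (0, SEG_LEN - len(seg))))

    # Mel-spectrogram per segment, converted to dB scale.
    specs = np.stack([
        librosa.power_to_db(librosa.feature.melspectrogram(y=s, sr=SR, n_mels=N_MELS))
        for s in segments
    ])

    # Z-score normalization: compute statistics on the training data,
    # then reuse the *same* training statistics at inference time.
    if train_mean is None:
        train_mean, train_std = specs.mean(), specs.std()
    specs = (specs - train_mean) / train_std
    return specs[..., np.newaxis], train_mean, train_std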

I understand that with sigmoid activation the output layer’s probabilities are not supposed to sum to 1, but I get many (8-10) very high (~0.999) predicted probabilities, and some are exactly 0.5.
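
For reference, sigmoid treats each output unit independently, so several large logits all saturate near 1, and a logit of exactly 0 gives exactly 0.5. A minimal illustration (the logit values here are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Several saturated positive logits all map to ~0.999, a zero logit maps
# to exactly 0.5, and nothing forces the outputs to sum to 1.
logits = np.array([8.0, 7.0, 6.5, 0.0, -5.0])
print(sigmoid(logits))        # ~[0.9997, 0.9991, 0.9985, 0.5, 0.0067]
print(sigmoid(logits).sum())  # ~3.5, i.e., far from 1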

The current test-set correct classification rate is ~84%, measured with 10-fold cross-validation, so the network seems to operate mostly well.

Notes:
1. I understand that there are similar features in the vocalizations of different bird species, but the received probabilities don’t seem to reflect them correctly.
2. Example probabilities for a recording of natural noise:
Natural noise: 0.999
Mallard: 0.981

I’m trying to understand the reason for these results: whether it is related to the data (e.g., extensive mislabeling, which seems unlikely) or comes from another source.

Any help will be much appreciated! 🙂

EDIT: I use sigmoid because I need the probabilities of all classes, and I don’t need them to sum to 1.

import tensorflow as tf
from tensorflow.keras import optimizers
from tensorflow.keras.layers import (InputLayer, Conv2D, BatchNormalization,
                                     MaxPooling2D, Flatten, Dropout, Dense)


def convnet1(input_shape, numClasses, activation='sigmoid'):

    # Define the network
    model = tf.keras.Sequential()
    model.add(InputLayer(input_shape=input_shape))
    # model.add(Augmentations1(p=0.5, freq_type='mel', max_aug=2))

    model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 1)))
    model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 1)))
    model.add(Conv2D(128, (5, 5), activation='relu', padding='same'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(256, (5, 5), activation='relu', padding='same'))
    model.add(BatchNormalization())

    model.add(Flatten())
    # model.add(Dense(numClasses, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(numClasses, activation=activation))  # sigmoid output layer

    model.compile(
        loss='categorical_crossentropy',
        metrics=['accuracy'],
        optimizer=optimizers.Adam(learning_rate=0.001),
        run_eagerly=False)  # run_eagerly=True allows debugging with regular Python functions (print(), save(), etc.) inside layers
    return model
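
A minimal usage sketch (the input shape is a made-up example; it depends on the actual mel-spectrogram settings):

import numpy as np

# Hypothetical shape: 128 mel bands x 87 time frames x 1 channel per 1-second segment.
model = convnet1(input_shape=(128, 87, 1), numClasses=70)
model.summary()

# Forward pass on random data: each of the 70 sigmoid outputs is an
# independent probability, so a row of predictions need not sum to 1.
dummy = np.random.randn(4, 128, 87, 1).astype('float32')
print(model.predict(dummy).shape)  # (4, 70)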
Asked By: Ronen


Answers:

For future searches: this problem was solved, and the reason was found(!).

The initial batch size was 256 or 512; reducing it to 16 or 32 solved the problem. The differences in probabilities are now as expected for both training- and test-set samples: very high for the correct label and very low for the other classes.
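
In Keras the batch size is set in the fit call, so the fix amounts to one argument. A hypothetical sketch with dummy stand-in data (shapes and epoch count are placeholders):

import numpy as np
import tensorflow as tf

# Dummy stand-ins for the real spectrogram segments and one-hot labels.
X_train = np.random.randn(1024, 128, 87, 1).astype('float32')
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 70, 1024), 70)

model = convnet1(input_shape=(128, 87, 1), numClasses=70)
# The fix: batch_size reduced from 256/512 down to 32 (or 16).
model.fit(X_train, y_train, epochs=5, batch_size=32, validation_split=0.1)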

Answered By: Ronen