Convolutional Neural Networks Intuition – Difference in outcome between a large kernel size and a large number of filters

Question:

I wanted to understand architectural intuition behind the differences of:

tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28, 28, 1))

and

tf.keras.layers.Conv2D(32, (7,7), activation='relu', input_shape=(28, 28, 1))

Assuming,

  1. As kernel size increases, more complex feature-pattern matching can be performed in the convolution step.
  2. As the number of filters increases, a larger variety of smaller features can define a particular layer.

How and when (if possible kindly give scenarios) do we justify the tradeoff at an abstract level?

Asked By: Akash Sonthalia


Answers:

This can be answered from three different perspectives.

Parameters:
Since you are comparing two Conv2D layers with different sizes, it's important to look at the number of trainable parameters, filters * (kernel_height * kernel_width * input_channels) + filters, needed for each, which in turn determines how complex your model is and how easy or difficult it is to train.

Here, the number of trainable parameters is 2.5 times higher with the second conv2d configuration:

first conv2d layer: 64*(3*3*1)+64 = 640

second conv2d layer: 32*(7*7*1)+32 = 1600
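These counts are easy to verify. Below is a small sketch (assuming a recent TensorFlow/Keras installation) that builds each layer in isolation and prints its trainable parameter count:

import tensorflow as tf

for filters, kernel in [(64, (3, 3)), (32, (7, 7))]:
    inputs = tf.keras.Input(shape=(28, 28, 1))
    layer = tf.keras.layers.Conv2D(filters, kernel, activation='relu')
    _ = layer(inputs)  # calling the layer on a symbolic input builds its weights
    # trainable params = filters * (kernel_h * kernel_w * input_channels) + filters bias terms
    print(filters, kernel, layer.count_params())
# Expected: 640 for the 64-filter 3x3 layer, 1600 for the 32-filter 7x7 layer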

Input:
Another way of deciding what filter size to use, and why, is to analyze the input data itself. Since the goal of the first conv2d layer (applied to the input) is to capture the most basic patterns in the image, ask yourself whether the most basic patterns in the image really need a larger filter to be learned.

If you think that a large number of pixels is necessary for the network to recognize the object, use large filters (such as 11×11 or 9×9). If you think that what differentiates objects are small, local features, use small filters (3×3 or 5×5).

Usually, the better practice is to stack conv2d layers: bigger patterns in the image are combinations of smaller patterns, and those smaller patterns are more easily captured by smaller filters.
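To see why stacking is also cheaper, here is a rough sketch comparing the weight count of a single 7×7 conv against three stacked 3×3 convs, which cover the same 7×7 receptive field. The channel count C is a hypothetical value chosen only to make the comparison concrete, and bias terms are ignored:

# Hypothetical channel count, kept the same for input and output of each layer
C = 32
single_7x7 = 7 * 7 * C * C          # one large filter: 50,176 weights
stacked_3x3 = 3 * (3 * 3 * C * C)   # three stacked 3x3 convs, same 7x7 receptive field: 27,648 weights
print(single_7x7, stacked_3x3)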

End goal:
Usually the goal of a conv network is to compress the image's height and width while expanding it into a large number of channels, each produced by one of the filters.


This process of downsampling the image into its representative features allows us to finally add a few dense layers at the end to perform the classification task.

The first conv2d will downsample the image only a little and generate a large number of channels, while the second conv2d will downsample it much more (a larger filter shrinks the spatial output more as it slides over the image) and produce fewer channels.
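To make the difference concrete, here is a minimal sketch (assuming the default 'valid' padding and a stride of 1) that prints the output shape of each configuration:

import tensorflow as tf

x = tf.keras.Input(shape=(28, 28, 1))
y1 = tf.keras.layers.Conv2D(64, (3, 3), activation='relu')(x)
y2 = tf.keras.layers.Conv2D(32, (7, 7), activation='relu')(x)
print(y1.shape)  # (None, 26, 26, 64): small spatial shrink, many channels
print(y2.shape)  # (None, 22, 22, 32): larger spatial shrink, fewer channels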

But downsampling to a smaller image with fewer channels (filters) immediately causes a loss of information. Therefore it is recommended to do it gradually, to retain as much information as possible from the original image.

The layer can then be stacked with further conv2d layers to arrive at a near-vector representation of the image before classification.
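A minimal sketch of that stacking idea, with the layer sizes and the 10-class output chosen purely for illustration, might look like this:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),  # -> 26x26x32
    tf.keras.layers.MaxPooling2D((2, 2)),                   # -> 13x13x32
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),  # -> 11x11x64
    tf.keras.layers.MaxPooling2D((2, 2)),                   # -> 5x5x64
    tf.keras.layers.Flatten(),                               # near-vector representation
    tf.keras.layers.Dense(10, activation='softmax'),         # hypothetical 10-class head
])
model.summary()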

Summary:

  • The second conv2d will be able to capture larger, more complex patterns at once, compared to what the first conv2d can capture at that step.

  • The second conv2d will lose more information from the original image, as it skips over features that come from much smaller and simpler patterns. The first conv2d will be able to capture more basic patterns in the image and use combinations of those (in stacked Conv layers) to build a more robust set of features for your end task.

  • The second conv2d needs more parameters to learn the structure of the image than the first conv2d.

In practice, it is recommended to use a stack of Conv layers with smaller filters, which together can detect larger, more complex patterns in the image.

Answered By: Akshay Sehgal