What happens in a convolution when the stride is larger than the kernel?

Question:

I was recently experimenting with convolutions and transposed convolutions in PyTorch. I noticed that with the nn.ConvTranspose2d API (I haven’t tried with the normal convolution API yet), you can specify a stride that is larger than the kernel size and the convolution will still work.

What is happening in this case? I’m confused because if the stride is larger than the kernel, that means some pixels in the input image will not be convolved. So what happens to them?

I have the following snippet where I manually set the weights of an nn.ConvTranspose2d layer:

import numpy as np
import torch
import torch.nn as nn

IN = 1
OUT = 1
KERNEL_SIZE = 2
# stride (4) is deliberately larger than kernel_size (2)
proof_conv = nn.ConvTranspose2d(IN, OUT, kernel_size=KERNEL_SIZE, stride=4)
assert proof_conv.weight.shape == (IN, OUT, KERNEL_SIZE, KERNEL_SIZE)

FILTER = [
    [1., 2.],
    [0., 1.]
]
# nested to match the weight shape (IN, OUT, KERNEL_SIZE, KERNEL_SIZE)
weights = [
    [FILTER]
]

weights_as_tensor = torch.from_numpy(np.asarray(weights)).float()
assert weights_as_tensor.shape == proof_conv.weight.shape
proof_conv.weight = nn.Parameter(weights_as_tensor)

img = [[
    [1., 2.],
    [3., 4.]
]]
img_as_tensor = torch.from_numpy(np.asarray(img)).float()
out_img = proof_conv(img_as_tensor)
assert out_img.shape == (OUT, 6, 6)

The stride is larger than the KERNEL_SIZE of 2. Yet the transposed convolution still occurs and we get an output of 6×6. What is happening under the hood?

This post: Understanding the PyTorch implementation of Conv2DTranspose is helpful, but it does not cover the edge case where the stride is greater than the kernel size.

Asked By: Foobar


Answers:

As you already guessed: when the stride is larger than the kernel size, there are input pixels that do not participate in the convolution operation.
It’s up to you, the designer of the architecture, to decide whether this property is a bug or a feature. In some cases I have taken advantage of it to ignore portions of the input.
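You can see this skipping with a regular convolution (a minimal sketch using nn.Conv2d on an arbitrary 6×6 input, not code from the question): with kernel_size=2 and stride=4, each 2×2 window is followed by a jump of 4 pixels, so the rows and columns in between are never read:

import torch
import torch.nn as nn

# kernel_size=2 with stride=4: windows start at rows/cols 0 and 4,
# so rows/cols 2 and 3 of a 6x6 input are never visited.
conv = nn.Conv2d(1, 1, kernel_size=2, stride=4, bias=False)

x = torch.zeros(1, 1, 6, 6)
x[..., 2, 2] = 100.0  # place a value inside the skipped gap

# The output is identical whether or not that pixel is set:
print(torch.equal(conv(x), conv(torch.zeros(1, 1, 6, 6))))  # True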

Update:
I think you are being confused by the bias term in proof_conv. Try to eliminate it:

proof_conv = nn.ConvTranspose2d(IN, OUT, kernel_size=KERNEL_SIZE, stride=4, bias=False)

Now you’ll get out_img to be:

[[[[1., 2., 0., 0., 2., 4.],
   [0., 1., 0., 0., 0., 2.],
   [0., 0., 0., 0., 0., 0.],
   [0., 0., 0., 0., 0., 0.],
   [3., 6., 0., 0., 4., 8.],
   [0., 3., 0., 0., 0., 4.]]]]

This represents four copies of the kernel, each weighted by the corresponding input pixel and spaced 4 pixels apart according to stride=4.
The rest of the output image is filled with zeros, representing output pixels to which no input contributes.
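These mechanics can be reproduced by hand (a sketch using the same kernel and input as the question): each input pixel scales a copy of the kernel, which is stamped into the output at an offset of stride per input step:

import torch

kernel = torch.tensor([[1., 2.],
                       [0., 1.]])
img = torch.tensor([[1., 2.],
                    [3., 4.]])
stride, k = 4, 2

manual = torch.zeros(6, 6)
for i in range(2):
    for j in range(2):
        # stamp a kernel copy scaled by the input pixel at (i, j)
        manual[i * stride:i * stride + k,
               j * stride:j * stride + k] += img[i, j] * kernel

print(manual)  # matches out_img above (up to the leading batch/channel dims)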

ConvTranspose follows the same "logic" as the regular convolution, only in a "transposed" fashion. If you look at the formula for computing the output shape, you’ll see that the behavior you get is consistent with it.
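For reference, the output-height formula from the nn.ConvTranspose2d documentation, evaluated for the question's configuration:

# H_out = (H_in - 1) * stride - 2 * padding
#         + dilation * (kernel_size - 1) + output_padding + 1
h_out = (2 - 1) * 4 - 2 * 0 + 1 * (2 - 1) + 0 + 1
print(h_out)  # 6 -- consistent with the observed 6x6 output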

Answered By: Shai

My understanding is that ConvTranspose2d always uses all of the pixels in the input image, regardless of the stride and kernel_size. This is different from Conv2d. As you can see by looking at the actual values in out_img (shown in @Shai’s answer), each input value generates one of the four 2×2 blocks at the corners of the output. stride in ConvTranspose2d instead affects the output image size and spacing: because stride=4 here, the four 2×2 results of the 2×2 input and 2×2 kernel are spaced 4 units apart. The intervening spaces are filled with zeros, since some output pixels receive no input when stride > kernel_size.

This is essentially the corollary of some input cells going unused in Conv2d when stride > kernel_size. I think this may be what you were trying to get at with your question.
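One way to confirm that every input pixel participates (a small check, not from the original answers): take the gradient of the summed output with respect to the input; every entry is nonzero for generic weights:

import torch
import torch.nn as nn

deconv = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=4, bias=False)
x = torch.randn(1, 1, 2, 2, requires_grad=True)
deconv(x).sum().backward()

# Each entry of x.grad equals the sum of the kernel weights,
# so every input pixel influenced the output.
print(x.grad)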

Answered By: bhenn1983