Pytorch softmax: What dimension to use?

Question:

The function torch.nn.functional.softmax takes two parameters: input and dim. According to its documentation, the softmax operation is applied to all slices of input along the specified dim, and will rescale them so that the elements lie in the range (0, 1) and sum to 1.

Let input be:

input = torch.randn((3, 4, 5, 6))

Suppose I want the following, so that every entry in that array is 1:

sum = torch.sum(input, dim = 3) # sum's size is (3, 4, 5)

How should I apply softmax?

softmax(input, dim = 0) # Way Number 0
softmax(input, dim = 1) # Way Number 1
softmax(input, dim = 2) # Way Number 2
softmax(input, dim = 3) # Way Number 3

My intuition tells me that it is the last one, but I am not sure: English is not my first language, and the use of the word "along" seemed confusing to me because of that.

I am not very clear on what "along" means, so I will use an example that could clarify things: suppose we have a tensor of size (s1, s2, s3, s4), and I want the sums along the last axis to be 1, as above.
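Concretely, the property being asked about can be checked directly. This is a sketch using the shapes from the question; `torch.allclose` is used here only to verify the sums:

```python
import torch
import torch.nn.functional as F

input = torch.randn(3, 4, 5, 6)
out = F.softmax(input, dim=3)   # Way Number 3: normalize along the last axis
sums = out.sum(dim=3)           # shape (3, 4, 5)
print(torch.allclose(sums, torch.ones(3, 4, 5)))  # True
```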

Asked By: Jadiel de Armas


Answers:

Let’s consider the example in two dimensions

x = [[1, 2],
     [3, 4]]

do you want your final result to be

y = [[0.27, 0.73],
     [0.27, 0.73]]

or

y = [[0.12, 0.12],
     [0.88, 0.88]]

If it’s the first option then you want dim = 1. If it’s the second option you want dim = 0.

Notice that in the second example the columns (the zeroth dimension) are normalized, hence the result is normalized along the zeroth dimension.

Updated 2018-07-10 to reflect that the zeroth dimension refers to columns in PyTorch.

Answered By: Steven

The easiest way I can think of to make you understand is: say you are given a tensor of shape (s1, s2, s3, s4) and as you mentioned you want to have the sum of all the entries along the last axis to be 1.

sum = torch.sum(input, dim = 3) # input is of shape (s1, s2, s3, s4)

Then you should call the softmax as:

softmax(input, dim = 3)

To see this easily, you can view a 4d tensor of shape (s1, s2, s3, s4) as a 2d tensor or matrix of shape (s1*s2*s3, s4). Now if you want the matrix to contain values in each row (axis=0) or column (axis=1) that sum to 1, then you can simply call the softmax function on the 2d tensor as follows:

softmax(input, dim = 0) # normalizes values along axis 0
softmax(input, dim = 1) # normalizes values along axis 1

You can see the example that Steven mentioned in his answer.
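The reshape intuition above can be verified directly: softmax over the last axis of a 4d tensor matches softmax over axis 1 of its flattened (s1*s2*s3, s4) view. The shapes below are arbitrary examples:

```python
import torch
import torch.nn.functional as F

t = torch.randn(2, 3, 4, 5)                        # shape (s1, s2, s3, s4)
a = F.softmax(t, dim=3)                            # normalize along the last axis
# Same computation via the flattened 2d view, normalized along axis 1:
b = F.softmax(t.reshape(-1, 5), dim=1).reshape(2, 3, 4, 5)
print(torch.allclose(a, b))                        # True
# dim=-1 is an equivalent way to address the last axis:
print(torch.allclose(a, F.softmax(t, dim=-1)))     # True
```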

Answered By: Wasi Ahmad

Steven’s answer as originally written was not correct; it was actually the reverse way, as the snapshot below shows. (His answer has since been updated and now agrees.)

Image transcribed as code:

>>> x = torch.tensor([[1,2],[3,4]],dtype=torch.float)
>>> F.softmax(x,dim=0)
tensor([[0.1192, 0.1192],
        [0.8808, 0.8808]])
>>> F.softmax(x,dim=1)
tensor([[0.2689, 0.7311],
        [0.2689, 0.7311]])
Answered By: sww

I am not 100% sure what your question means, but I think your confusion is simply that you don’t understand what the dim parameter means. So I will explain it and provide examples.

If we have:

m0 = nn.Softmax(dim=0)

what that means is that m0 will normalize elements along the zeroth coordinate of the tensor it receives. Formally, given a tensor b of size (d0, d1), the following will be true:

sum_{i0=0}^{d0-1} m0(b)[i0, i1] = 1, for all i1 in {0, ..., d1-1}

you can easily check this with a Pytorch example:

>>> b = torch.arange(0,4,1.0).view(-1,2)
>>> b 
tensor([[0., 1.],
        [2., 3.]])
>>> m0 = nn.Softmax(dim=0) 
>>> b0 = m0(b)
>>> b0 
tensor([[0.1192, 0.1192],
        [0.8808, 0.8808]])

Now, since dim=0 means going through i0 in {0, 1} (i.e. going down the rows), if we choose any column i1 and sum its elements over the rows, we should get 1. Check it:

>>> b0[:,0].sum()
tensor(1.0000)
>>> b0[:,1].sum()
tensor(1.0000)

as expected.

Note that we can check all columns sum to 1 at once by “summing out the rows” with torch.sum(b0, dim=0); check it out:

>>> torch.sum(b0,0)
tensor([1.0000, 1.0000])

We can create a more complicated example to make sure it’s really clear.

a = torch.arange(0,24,1.0).view(-1,3,4)
>>> a
tensor([[[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]],

        [[12., 13., 14., 15.],
         [16., 17., 18., 19.],
         [20., 21., 22., 23.]]])
>>> a0 = m0(a)
>>> a0[:,0,0].sum()
tensor(1.0000)
>>> a0[:,1,0].sum()
tensor(1.0000)
>>> a0[:,2,0].sum()
tensor(1.0000)
>>> a0[:,1,1].sum()
tensor(1.0000)
>>> a0[:,2,3].sum()
tensor(1.0000)

So, as expected, if we sum all the elements along the zeroth coordinate, from the first value to the last, we get 1. Everything is normalized along the zeroth dimension (the coordinate i0).

>>> torch.sum(a0,0)
tensor([[1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000]])

Also, “along dimension 0” means that you vary the coordinate of that dimension and consider each such element. It is like having a for loop going through the values the zeroth coordinate can take, i.e.

for i0 in range(0, d0):
    a[i0, i1, i2, i3]  # i1, i2, i3 held fixed
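That for-loop picture can be made concrete: compute the softmax along dim 0 by hand, one fixed (i1, i2) position at a time, and compare with nn.Softmax. This is a sketch of the idea, not how PyTorch implements it (PyTorch uses a numerically stabilized version):

```python
import torch
import torch.nn as nn

a = torch.arange(0, 24, 1.0).view(-1, 3, 4)   # shape (2, 3, 4)
a0 = nn.Softmax(dim=0)(a)

# Manual softmax along dim 0: for each fixed (i1, i2), normalize the
# values obtained by varying only the dim-0 coordinate.
manual = torch.empty_like(a)
for i1 in range(a.size(1)):
    for i2 in range(a.size(2)):
        col = a[:, i1, i2]
        manual[:, i1, i2] = col.exp() / col.exp().sum()

print(torch.allclose(a0, manual))  # True
```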
Answered By: Charlie Parker

>>> import torch
>>> import torch.nn.functional as F
>>> x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float)
>>> s1 = F.softmax(x, dim=0)
>>> s1
tensor([[0.1192, 0.1192],
        [0.8808, 0.8808]])
>>> s2 = F.softmax(x, dim=1)
>>> s2
tensor([[0.2689, 0.7311],
        [0.2689, 0.7311]])
>>> torch.sum(s1, dim=0)
tensor([1., 1.])
>>> torch.sum(s2, dim=1)
tensor([1., 1.])
Answered By: mrgloom

Think of what softmax is trying to achieve: it outputs the probability of one outcome against the others. Say you are trying to predict between two outcomes, A or B. The outputs are probabilities, so they should add up to 1, and the predicted outcome is the one with the larger probability (e.g. A if p(A) > 50%). With one (p(A), p(B)) pair per row, you want the probabilities in each row to sum to 1, so you specify dim=1 (the row dimension).

On the other hand, if your model predicts among more than two classes and the output is a one-dimensional tensor [p(a), p(b), p(c), ..., p(i)], what matters is that p(a) + p(b) + p(c) + ... + p(i) = 1, and you would use dim=0 (the only dimension).

It all depends on how you define your output layer.
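For the common batched case (shapes here are illustrative assumptions, not from the question): with logits of shape (batch, num_classes), the class axis is dim 1, so that is the dimension to softmax over:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 10)        # 8 samples, 10 classes
probs = F.softmax(logits, dim=1)   # same as dim=-1 for this 2d shape
# Each sample's class probabilities sum to 1:
print(torch.allclose(probs.sum(dim=1), torch.ones(8)))  # True
```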

Answered By: yury_gurevich