Softmax Activation Function
A Softmax Activation Function is a neuron activation function that is based on a Softmax function (which converts a vector of inputs into posterior probabilities, i.e. [math]\displaystyle{ f_i(x)=\dfrac{\exp(x_i)}{\sum_j\exp(x_j)} }[/math]).
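The formula above can be sketched directly in NumPy. A minimal illustration (the function name softmax, the max-subtraction stability trick, and the example logits are illustrative assumptions, not part of the definition):

import numpy as np

def softmax(x):
    # Exponentiate and normalize so the outputs lie in (0, 1) and sum to 1.
    # Subtracting max(x) first is a standard numerical-stability trick;
    # the shift cancels in the ratio, so the result is unchanged.
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # e.g. approximately [0.659 0.242 0.099]
print(probs.sum())  # 1.0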
- Context:
- It can (typically) be used in the activation of Softmax Neurons.
- Example(s):
- torch.nn.Softmax,
- torch.nn.Softmax2d,
- ...
- Counter-Example(s):
- a LogSoftmax Activation Function,
- a Rectified-based Activation Function,
- a Heaviside Step Activation Function,
- a Ramp Function-based Activation Function,
- a Logistic Sigmoid-based Activation Function,
- a Hyperbolic Tangent-based Activation Function,
- a Gaussian-based Activation Function,
- a Softmin Activation Function,
- a Softsign Activation Function,
- a Softshrink Activation Function,
- an Adaptive Piecewise Linear Activation Function,
- a Bent Identity Activation Function,
- a Maxout Activation Function.
- See: Softmax Regression, Softmax Function, Artificial Neural Network, Artificial Neuron, Neural Network Topology, Neural Network Layer, Neural Network Learning Rate.
References
2018a
- (PyTorch, 2018) ⇒ http://pytorch.org/docs/master/nn.html#softmax
- QUOTE:
class torch.nn.Softmax(dim=None)
Applies the Softmax function to an n-dimensional input Tensor, rescaling them so that the elements of the n-dimensional output Tensor lie in the range (0, 1) and sum to 1.
Softmax is defined as [math]\displaystyle{ f_i(x)=\dfrac{\exp(x_i)}{\sum_j\exp(x_j)} }[/math]
Shape:
- Input: any shape
- Output: same as input
- Returns: a Tensor of the same dimension and shape as the input with values in the range [0, 1]
Parameters: dim (int) – A dimension along which Softmax will be computed (so every slice along dim will sum to 1).
Note
This module doesn’t work directly with NLLLoss, which expects the Log to be computed between the Softmax and itself. Use LogSoftmax instead (it’s faster and has better numerical properties).
Examples:
>>> m = nn.Softmax()
>>> input = autograd.Variable(torch.randn(2, 3))
>>> print(input)
>>> print(m(input))
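The quoted example uses the older autograd.Variable API. A minimal sketch of the same call against a more recent PyTorch API (the explicit dim=1 argument is an assumption about which axis should be normalized):

import torch
import torch.nn as nn

# Softmax over the last dimension: each row of the output sums to 1.
m = nn.Softmax(dim=1)
x = torch.randn(2, 3)
y = m(x)

print(y)             # values in (0, 1)
print(y.sum(dim=1))  # each row sums to 1, up to floating-point error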
2018b
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Softmax_function#Artificial_neural_networks Retrieved:2018-2-11.
- The softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression. Since the function maps a vector and a specific index i to a real value, the derivative needs to take the index into account: [math]\displaystyle{ \frac{\partial}{\partial q_k}\sigma(\textbf{q}, i) = \cdots = \sigma(\textbf{q}, i)(\delta_{ik} - \sigma(\textbf{q}, k)) }[/math]
Here, the Kronecker delta is used for simplicity (cf. the derivative of a sigmoid function, being expressed via the function itself).
See Multinomial logit for a probability model which uses the softmax activation function.
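The quoted derivative can be checked numerically. A minimal sketch (variable names and the test vector are illustrative): it builds the Jacobian from the identity σ(q, i)(δ_ik − σ(q, k)) and compares it against central finite differences.

import numpy as np

def softmax(q):
    z = np.exp(q - np.max(q))
    return z / z.sum()

def softmax_jacobian(q):
    # J[i, k] = s_i * (delta_ik - s_k), as in the quoted derivative.
    s = softmax(q)
    return np.diag(s) - np.outer(s, s)

q = np.array([0.5, -1.2, 2.0])
J = softmax_jacobian(q)

# Central finite-difference estimate of the same Jacobian, column by column.
eps = 1e-6
J_num = np.zeros_like(J)
for k in range(len(q)):
    dq = np.zeros_like(q)
    dq[k] = eps
    J_num[:, k] = (softmax(q + dq) - softmax(q - dq)) / (2 * eps)

print(np.allclose(J, J_num, atol=1e-6))  # True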
2018c
- (Yang, Dai et al., 2018) ⇒ Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. (2018). “Breaking the Softmax Bottleneck: A High-rank RNN Language Model.” In: Proceedings of 6th International Conference on Learning Representations (ICLR-2018).
- QUOTE: ... We formulate language modeling as a matrix factorization problem, and show that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck. ...
2017
- (Mate Labs, 2017) ⇒ Mate Labs. (2017, August 23). “Secret Sauce behind the beauty of Deep Learning: Beginners guide to Activation Functions.”
- QUOTE: Softmax: Softmax functions convert a raw value into a posterior probability. This provides a measure of certainty. It squashes the outputs of each unit to be between 0 and 1, just like a sigmoid function. But it also divides each output such that the total sum of the outputs is equal to 1.
[math]\displaystyle{ \sigma(\mathbf{Z})_j = \dfrac{e^{z_j}}{\sum_{k=1}^Ke^{z_k}} \mbox{ for } j=1,\cdots, K }[/math]
The output of the softmax function is equivalent to a categorical probability distribution; it tells you the probability that any of the classes are true.
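The categorical-distribution reading of the quote can be illustrated with a short sketch (the raw scores and the sample size are arbitrary assumptions): the softmax outputs are used as class probabilities and sampled from.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Raw scores for K = 4 hypothetical classes.
z = np.array([1.0, 2.0, 3.0, 0.5])
p = softmax(z)
print(p)             # each value in (0, 1); the values sum to 1
print(np.argmax(p))  # index of the most probable class (here 2)

# Treat p as a categorical distribution and sample from it.
rng = np.random.default_rng(0)
samples = rng.choice(len(p), size=10000, p=p)
print(np.bincount(samples) / len(samples))  # empirical frequencies approximate p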
2013
- (Graves et al., 2013) ⇒ Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. (2013). “Speech Recognition with Deep Recurrent Neural Networks.” In: Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013), pp. 6645-6649. IEEE.
- QUOTE: ... The first method, known as Connectionist Temporal Classification (CTC) [8, 9], uses a softmax layer to define a separate output distribution Pr(k|t) at every step t along the input sequence … ...
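A minimal sketch of the per-timestep output distribution described in the quote (tensor shapes and names are illustrative; this is not the authors' implementation): a softmax over the class dimension yields a separate distribution Pr(k|t) at every step t.

import torch
import torch.nn as nn

# Hypothetical network outputs: T timesteps, K output classes.
T, K = 5, 4
outputs = torch.randn(T, K)

# Softmax over the class dimension gives one distribution per timestep.
softmax = nn.Softmax(dim=1)
per_step_probs = softmax(outputs)

print(per_step_probs.shape)       # torch.Size([5, 4])
print(per_step_probs.sum(dim=1))  # each timestep's distribution sums to 1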