Sigmoid Neuron
A Sigmoid Neuron is an artificial neuron that uses a Logistic Sigmoid Activation Function.
- AKA: Sigmoidal Neuron, Logistic Neuron, Log-Sigmoid Neuron, Sigmoid Neural Unit.
- Context:
- It can be mathematically described as
[math]\displaystyle{ y_j=\sigma(z_j)=1/(1+e^{-z_j})\quad \text{with} \quad z_j=\sum_{i=0}^nw_{ji}x_i+b \quad \text{for}\quad j=0,\cdots, p }[/math]
where [math]\displaystyle{ x_i }[/math] are the elements of the Neural Network Input vector, [math]\displaystyle{ y_j }[/math] are the elements of the Neural Network Output vector, [math]\displaystyle{ w_{ji} }[/math] are the Neural Network Weights, and [math]\displaystyle{ b }[/math] is the Bias Neuron value.
- Example(s):
- Let's consider a sigmoid neuron with 3 inputs [math]\displaystyle{ \{x_1,x_2,x_3\} }[/math], 3 neural network weight values [math]\displaystyle{ \{w_1,w_2,w_3\} }[/math] and bias value [math]\displaystyle{ b }[/math]. The output is given by [math]\displaystyle{ y=1/(1+e^{-z}) }[/math] with [math]\displaystyle{ z=w_1x_1+w_2x_2+w_3x_3 + b }[/math].
- Let's consider a sigmoid neuron with 3 inputs [math]\displaystyle{ X=\{0.5699, 0.1250, 0.5925\} }[/math], 3 neural network weight values [math]\displaystyle{ W=\{0.2217, 0.5029, 0.1168\} }[/math] and bias value [math]\displaystyle{ b=0.02 }[/math]. The output is [math]\displaystyle{ y=1/(1+e^{-0.2784})=0.5692 }[/math] as [math]\displaystyle{ z=0.2217\times 0.5699+0.5029\times 0.1250+0.1168\times 0.5925 + 0.02=0.2784 }[/math] (see the code sketch below, after the See list).
- …
- Counter-Example(s):
- See: Artificial Neural Network, Perceptron.
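Below is a minimal, illustrative Python sketch (not taken from any of the cited references) that reproduces the worked example above; the helper names sigmoid and sigmoid_neuron are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: squashes any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(x, w, b):
    """Weighted sum of the inputs plus the bias, passed through the sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Worked example from above
x = [0.5699, 0.1250, 0.5925]
w = [0.2217, 0.5029, 0.1168]
b = 0.02
print(round(sigmoid_neuron(x, w, b), 4))  # ~0.5692 (z ~ 0.2784)
```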
References
2018
- (CS231n, 2018) ⇒ Commonly used activation functions. In: CS231n Convolutional Neural Networks for Visual Recognition. Retrieved: 2018-01-14.
- QUOTE: Sigmoid. The sigmoid non-linearity has the mathematical form [math]\displaystyle{ \sigma(x)=1/(1+e^{−x}) }[/math] and is shown in the image above on the left. As alluded to in the previous section, it takes a real-valued number and “squashes” it into range between 0 and 1. In particular, large negative numbers become 0 and large positive numbers become 1. The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1). In practice, the sigmoid non-linearity has recently fallen out of favor and it is rarely ever used. It has two major drawbacks:
- Sigmoids saturate and kill gradients. A very undesirable property of the sigmoid neuron is that when the neuron’s activation saturates at either tail of 0 or 1, the gradient at these regions is almost zero. Recall that during backpropagation, this (local) gradient will be multiplied to the gradient of this gate’s output for the whole objective. Therefore, if the local gradient is very small, it will effectively “kill” the gradient and almost no signal will flow through the neuron to its weights and recursively to its data. Additionally, one must pay extra caution when initializing the weights of sigmoid neurons to prevent saturation. For example, if the initial weights are too large then most neurons would become saturated and the network will barely learn.
- Sigmoid outputs are not zero-centered. This is undesirable since neurons in later layers of processing in a Neural Network (more on this soon) would be receiving data that is not zero-centered. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g. [math]\displaystyle{ x \gt 0 }[/math] elementwise in [math]\displaystyle{ f=w^Tx+b }[/math]), then the gradient on the weights [math]\displaystyle{ w }[/math] will during backpropagation become either all positive, or all negative (depending on the gradient of the whole expression [math]\displaystyle{ f }[/math]). This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but it has less severe consequences compared to the saturated activation problem above.
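As an illustration of the saturation point above, here is a small Python sketch (not part of the CS231n notes) that evaluates the sigmoid's local gradient σ(z)(1-σ(z)) at a few activations; away from zero it quickly approaches zero, which is what "kills" the backpropagated signal:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Local gradient of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The gradient peaks at 0.25 for z = 0 and is nearly zero in the saturated tails.
for z in [0.0, 2.0, 5.0, 10.0]:
    print(z, round(sigmoid_grad(z), 6))
# 0.0  0.25
# 2.0  0.104994
# 5.0  0.006648
# 10.0 0.000045
```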
2017
- (Mate Labs, 2017) ⇒ Mate Labs (2017, Aug 23). Secret Sauce behind the beauty of Deep Learning: Beginners guide to Activation Functions.
- QUOTE: Sigmoid or Logistic activation function (Soft Step): It is mostly used for binary classification problems (i.e. it outputs values in the range 0–1). It has the problem of vanishing gradients: the network refuses to learn, or learning becomes very slow after a certain number of epochs, because the input (X) causes only a very small change in the output (Y). It is a widely used activation function for classification problems, but recently it has fallen out of favor since it is more prone to saturation of the later layers, making training more difficult. Calculating the derivative of the Sigmoid function is very easy.
For the backpropagation process in a neural network, your errors will be squeezed by (at least) a quarter at each layer. Therefore, the deeper your network is, the more knowledge from the data will be “lost”. Some “big” errors we get from the output layer might not be able to affect the synapse weights of a neuron in a relatively shallow layer much (“shallow” means it’s close to the input layer) — Source https://github.com/Kulbear/deep-learning-nano-foundation/wiki/ReLU-and-Softmax-Activation-Functions
Sigmoid or Logistic activation function:
[math]\displaystyle{ f(x)=\dfrac{1}{1 + e^{-x}} }[/math]
[math]\displaystyle{ f'(x)=f(x)(1-f(x)) }[/math]
Range: (0, 1)
Examples: f(4) = 0.982, f(-3) = 0.0474, f(-5) = 0.0067
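As a quick check of the quoted example values and of the derivative identity f'(x) = f(x)(1 - f(x)), the following illustrative Python snippet (not from the Mate Labs post) compares the closed-form derivative against a central finite difference; it also confirms that the derivative never exceeds 0.25, which is where the "squeezed by (at least) a quarter" remark comes from:

```python
import math

def f(x):
    return 1.0 / (1.0 + math.exp(-x))

def f_prime(x):
    return f(x) * (1.0 - f(x))

# Example values quoted above
print(round(f(4), 3), round(f(-3), 4), round(f(-5), 4))  # 0.982 0.0474 0.0067

# Closed-form derivative vs. central finite difference at an arbitrary point
h, x0 = 1e-6, 0.7
numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)
print(round(f_prime(x0), 4), round(numeric, 4))  # both ~0.2217

# Maximum of the derivative, attained at x = 0
print(f_prime(0.0))  # 0.25
```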
2016a
- (Ruozzi, 2016) ⇒ Nicholas Ruozzi (2016). Neural Networks: http://www.utdallas.edu/~nrr150130/cs7301/2016fa/lects/Lecture_22_NN.pdf
- QUOTE: A sigmoid neuron is an artificial neuron that takes a collection of inputs in the interval [0,1] and produces an output in the interval [0,1]. The output is determined by summing up the weighted inputs plus the bias and applying the sigmoid function to the result:
[math]\displaystyle{ y=\sigma(w_1x_1+w_2x_2+w_3x_3+b) }[/math]
where [math]\displaystyle{ \sigma }[/math] is the sigmoid function.
2016b
- (Garcia et al., 2016) ⇒ García Benítez, S. R., López Molina, J. A., & Castellanos Pedroza, V. (2016). Neural networks for defining spatial variation of rock properties in sparsely instrumented media. Boletín de la Sociedad Geológica Mexicana, 68(3), 553-570.
- QUOTE: The activation function of the neurons in NN implementing the backpropagation algorithm is a weighted sum (the sum of the inputs [math]\displaystyle{ x_i }[/math] multiplied by their respective weights [math]\displaystyle{ w_{ji} }[/math]):
[math]\displaystyle{ A_j(\hat{x},\hat{w})=\sum_{i=0}^n x_iw_{ji} }[/math]
As can be seen, the neuron activation depends only on the inputs and the weights. If the output function were the identity (activation = output), then the neuron would be called linear. But linear neurons have severe limitations; the most common output function is the sigmoidal function:
[math]\displaystyle{ O_j=\dfrac{1}{1+e^{-A_j(\hat{x},\hat{w})}} }[/math]
The sigmoidal function is very close to one for large positive numbers, 0.5 at zero, and very close to zero for large negative numbers. This allows a smooth transition between the low and high outputs (close to zero or close to one). The goal of the training process is to obtain a desired output when certain inputs are given. Since the error is the difference between the actual and the desired output, the error depends on the weights, and we need to adjust the weights in order to minimize the error.
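The training goal described above (adjusting the weights to reduce the error) can be illustrated with a minimal sketch. The snippet below is not the authors' implementation but a generic single-neuron gradient-descent step for a squared-error loss, reusing the sigmoid derivative σ(z)(1-σ(z)); the function names, data, and learning rate are assumptions for illustration only:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x, w, b, target, lr=0.5):
    """One gradient-descent update for a single sigmoid neuron
    with squared error E = 0.5 * (y - target) ** 2."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    y = sigmoid(z)
    # dE/dz = (y - target) * sigma'(z), where sigma'(z) = y * (1 - y)
    delta = (y - target) * y * (1.0 - y)
    w = [wi - lr * delta * xi for wi, xi in zip(w, x)]
    b = b - lr * delta
    return w, b, y

# Illustrative usage: the output y moves toward the target as the weights adjust.
w, b = [0.2, -0.1, 0.4], 0.0
x, target = [0.5, 0.8, 0.1], 1.0
for _ in range(5):
    w, b, y = train_step(x, w, b, target)
    print(round(y, 4))
```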