Inverse Square Root Unit (ISRU) Activation Function
An Inverse Square Root Unit (ISRU) Activation Function is a neuron activation function based on the reciprocal square root (the inverse square root), [math]\displaystyle{ f(x) = x/\sqrt{1 + \alpha x^2} }[/math], where [math]\displaystyle{ \alpha }[/math] is a tunable parameter that bounds the output range to [math]\displaystyle{ \left(-1/\sqrt{\alpha}, 1/\sqrt{\alpha}\right) }[/math].
- Context:
- It can (typically) be used as a computationally efficient activation function in RNNs, in place of the tanh and sigmoid activations used in LSTM and GRU units (a minimal code sketch is given below, just before the References).
- It is closely related to the Inverse Square Root Linear Unit (ISRLU) Activation Function, [math]\displaystyle{ f(x) = \begin{cases} \frac{x}{\sqrt{1 + \alpha x^2}} & \text{for } x \lt 0\\ x & \text{for } x \ge 0\end{cases} }[/math], which applies the ISRU curve for [math]\displaystyle{ x \lt 0 }[/math] and the identity for [math]\displaystyle{ x \ge 0 }[/math].
- Example(s):
- …
- Counter-Example(s):
- a Softmax-based Activation Function,
- a Rectified-based Activation Function,
- a Heaviside Step Activation Function,
- a Ramp Function-based Activation Function,
- a Logistic Sigmoid-based Activation Function,
- a Hyperbolic Tangent-based Activation Function,
- a Gaussian-based Activation Function,
- a Softsign Activation Function,
- a Softshrink Activation Function,
- an Adaptive Piecewise Linear Activation Function,
- a Maxout Activation Function,
- a Long Short-Term Memory Unit-based Activation Function,
- a Bent Identity Activation Function,
- a Soft Exponential Activation Function,
- a Sinusoid-based Activation Function.
- See: Artificial Neural Network, Artificial Neuron, Neural Network Topology, Neural Network Layer, Neural Network Learning Rate.
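The following is a minimal NumPy sketch of the ISRU formula above and of its ISRLU relative. The function names, the default α = 1.0, and the sample inputs are illustrative assumptions rather than settings taken from the source.

```python
import numpy as np

def isru(x, alpha=1.0):
    """Inverse square root unit: f(x) = x / sqrt(1 + alpha * x^2)."""
    return x / np.sqrt(1.0 + alpha * x * x)

def isrlu(x, alpha=1.0):
    """Inverse square root linear unit: ISRU for x < 0, identity for x >= 0."""
    return np.where(x < 0.0, x / np.sqrt(1.0 + alpha * x * x), x)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(isru(x))   # bounded in (-1/sqrt(alpha), 1/sqrt(alpha))
print(isrlu(x))  # saturating for negative inputs, identity for non-negative inputs
```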
References
2018
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions Retrieved:2018-2-18.
- The following table compares the properties of several activation functions that are functions of a single input x from the previous layer or layers:
Name | Equation | Derivative (with respect to x) | Range | Order of continuity | Monotonic | Derivative Monotonic | Approximates identity near the origin |
---|---|---|---|---|---|---|---|
Identity | [math]\displaystyle{ f(x)=x }[/math] | [math]\displaystyle{ f'(x)=1 }[/math] | [math]\displaystyle{ (-\infty,\infty) }[/math] | [math]\displaystyle{ C^\infty }[/math] | Yes | Yes | Yes | |
Binary step | [math]\displaystyle{ f(x) = \begin{cases} 0 & \text{for } x \lt 0\\ 1 & \text{for } x \ge 0\end{cases} }[/math] | [math]\displaystyle{ f'(x) = \begin{cases} 0 & \text{for } x \ne 0\\ ? & \text{for } x = 0\end{cases} }[/math] | [math]\displaystyle{ \{0,1\} }[/math] | [math]\displaystyle{ C^{-1} }[/math] | Yes | No | No | |
Logistic (a.k.a. Soft step) | [math]\displaystyle{ f(x)=\frac{1}{1+e^{-x}} }[/math] | [math]\displaystyle{ f'(x)=f(x)(1-f(x)) }[/math] | [math]\displaystyle{ (0,1) }[/math] | [math]\displaystyle{ C^\infty }[/math] | Yes | No | No | |
(...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) |
Inverse square root unit (ISRU)[1] | [math]\displaystyle{ f(x) = \frac{x}{\sqrt{1 + \alpha x^2}} }[/math] | [math]\displaystyle{ f'(x) = \left(\frac{1}{\sqrt{1 + \alpha x^2}}\right)^3 }[/math] | [math]\displaystyle{ \left(-\frac{1}{\sqrt{\alpha}},\frac{1}{\sqrt{\alpha}}\right) }[/math] | [math]\displaystyle{ C^\infty }[/math] | Yes | No | Yes | |
(...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) |
Inverse square root linear unit (ISRLU) | [math]\displaystyle{ f(x) = \begin{cases} \frac{x}{\sqrt{1 + \alpha x^2}} & \text{for } x \lt 0\\ x & \text{for } x \ge 0\end{cases} }[/math] | [math]\displaystyle{ f'(x) = \begin{cases} \left(\frac{1}{\sqrt{1 + \alpha x^2}}\right)^3 & \text{for } x \lt 0\\ 1 & \text{for } x \ge 0\end{cases} }[/math] | [math]\displaystyle{ \left(-\frac{1}{\sqrt{\alpha}},\infty\right) }[/math] | [math]\displaystyle{ C^2 }[/math] | Yes | Yes | Yes | |
Adaptive piecewise linear (APL) [2] | [math]\displaystyle{ f(x) = \max(0,x) + \sum_{s=1}^{S}a_i^s \max(0, -x + b_i^s) }[/math] | [math]\displaystyle{ f'(x) = H(x) - \sum_{s=1}^{S}a_i^s H(-x + b_i^s) }[/math] | [math]\displaystyle{ (-\infty,\infty) }[/math] | [math]\displaystyle{ C^0 }[/math] | No | No | No | |
SoftPlus[3] | [math]\displaystyle{ f(x)=\ln(1+e^x) }[/math] | [math]\displaystyle{ f'(x)=\frac{1}{1+e^{-x}} }[/math] | [math]\displaystyle{ (0,\infty) }[/math] | [math]\displaystyle{ C^\infty }[/math] | Yes | Yes | No | |
Bent identity | [math]\displaystyle{ f(x)=\frac{\sqrt{x^2 + 1} - 1}{2} + x }[/math] | [math]\displaystyle{ f'(x)=\frac{x}{2\sqrt{x^2 + 1}} + 1 }[/math] | [math]\displaystyle{ (-\infty,\infty) }[/math] | [math]\displaystyle{ C^\infty }[/math] | Yes | Yes | Yes | |
SoftExponential [4] | [math]\displaystyle{ f(\alpha,x) = \begin{cases} -\frac{\ln(1-\alpha (x + \alpha))}{\alpha} & \text{for } \alpha \lt 0\\ x & \text{for } \alpha = 0\\ \frac{e^{\alpha x} - 1}{\alpha} + \alpha & \text{for } \alpha \gt 0\end{cases} }[/math] | [math]\displaystyle{ f'(\alpha,x) = \begin{cases} \frac{1}{1-\alpha (\alpha + x)} & \text{for } \alpha \lt 0\\ e^{\alpha x} & \text{for } \alpha \ge 0\end{cases} }[/math] | [math]\displaystyle{ (-\infty,\infty) }[/math] | [math]\displaystyle{ C^\infty }[/math] | Yes | Yes | Yes for α = 0 |
Sinusoid[5] | [math]\displaystyle{ f(x)=\sin(x) }[/math] | [math]\displaystyle{ f'(x)=\cos(x) }[/math] | [math]\displaystyle{ [-1,1] }[/math] | [math]\displaystyle{ C^\infty }[/math] | No | No | Yes | |
Sinc | [math]\displaystyle{ f(x)=\begin{cases} 1 & \text{for } x = 0\\ \frac{\sin(x)}{x} & \text{for } x \ne 0\end{cases} }[/math] | [math]\displaystyle{ f'(x)=\begin{cases} 0 & \text{for } x = 0\\ \frac{\cos(x)}{x} - \frac{\sin(x)}{x^2} & \text{for } x \ne 0\end{cases} }[/math] | [math]\displaystyle{ [\approx-.217234,1] }[/math] | [math]\displaystyle{ C^\infty }[/math] | No | No | No | |
Gaussian | [math]\displaystyle{ f(x)=e^{-x^2} }[/math] | [math]\displaystyle{ f'(x)=-2xe^{-x^2} }[/math] | [math]\displaystyle{ (0,1] }[/math] | [math]\displaystyle{ C^\infty }[/math] | No | No | No |
Here, H is the Heaviside step function.
In some of the activation functions elided above, α is a stochastic variable sampled from a uniform distribution at training time and fixed to the expectation value of the distribution at test time.
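As a quick check of the ISRU row above (not part of the source table), the sketch below confirms the tabulated derivative by central finite differences and shows the output approaching the ±1/√α bounds; α = 2 and the test points are arbitrary illustrative choices.

```python
import numpy as np

def isru(x, alpha):
    # f(x) = x / sqrt(1 + alpha * x^2)
    return x / np.sqrt(1.0 + alpha * x * x)

def isru_grad(x, alpha):
    # Closed-form derivative from the table: (1 / sqrt(1 + alpha * x^2))^3
    return (1.0 / np.sqrt(1.0 + alpha * x * x)) ** 3

alpha = 2.0                      # illustrative value
x = np.linspace(-5.0, 5.0, 11)

# Central finite differences should agree with the closed-form derivative.
eps = 1e-6
numeric = (isru(x + eps, alpha) - isru(x - eps, alpha)) / (2.0 * eps)
print(np.max(np.abs(numeric - isru_grad(x, alpha))))  # small numerical error

# Large inputs approach the range bounds +/- 1/sqrt(alpha) listed in the table.
print(isru(np.array([1e6, -1e6]), alpha), 1.0 / np.sqrt(alpha))
```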
2017
- (Carlile et al., 2017) ⇒ Carlile, B., Delamarter, G., Kinney, P., Marti, A., & Whitney, B. (2017). Improving Deep Learning by Inverse Square Root Linear Units (ISRLUs). arXiv preprint arXiv:1710.09967.
- ABSTRACT: We introduce the “inverse square root linear unit” (ISRLU) to speed up learning in deep neural networks. ISRLU has better performance than ELU but has many of the same benefits. ISRLU and ELU have similar curves and characteristics. Both have negative values, allowing them to push mean unit activation closer to zero, and bring the normal gradient closer to the unit natural gradient, ensuring a noise-robust deactivation state, lessening the over fitting risk. The significant performance advantage of ISRLU on traditional CPUs also carry over to more efficient HW implementations on HW/SW codesign for CNNs/RNNs. In experiments with TensorFlow, ISRLU leads to faster learning and better generalization than ReLU on CNNs. This work also suggests a computationally efficient variant called the “inverse square root unit” (ISRU) which can be used for RNNs. Many RNNs use either long short-term memory (LSTM) and gated recurrent units (GRU) which are implemented with tanh and sigmoid activation functions. ISRU has less computational complexity but still has a similar curve to tanh and sigmoid.
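To illustrate the abstract's point that ISRU tracks the tanh curve while requiring only elementary operations, here is a small comparison sketch. Using α = 1, which makes the ISRU range match tanh's (−1, 1), is an illustrative assumption, not a setting prescribed by the paper.

```python
import numpy as np

def isru(x, alpha=1.0):
    return x / np.sqrt(1.0 + alpha * x * x)

x = np.linspace(-4.0, 4.0, 9)
# With alpha = 1, ISRU shares tanh's (-1, 1) range and general S-shape,
# but each evaluation needs only a multiply-add, a square root, and a division.
print(np.column_stack([x, isru(x), np.tanh(x)]))
```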
- ↑ Carlile, Brad; Delamarter, Guy; Kinney, Paul; Marti, Akiko; Whitney, Brian (2017-11-09). "Improving Deep Learning by Inverse Square Root Linear Units (ISRLUs)". arXiv:1710.09967 [cs.LG].
- ↑ Agostinelli, Forest; Hoffman, Matthew; Sadowski, Peter; Baldi, Pierre (2014-12-21). "Learning Activation Functions to Improve Deep Neural Networks". arXiv:1412.6830 [cs.NE].
- ↑ Glorot, Xavier; Bordes, Antoine; Bengio, Yoshua (2011). "Deep sparse rectifier neural networks" (PDF). International Conference on Artificial Intelligence and Statistics.
- ↑ Godfrey, Luke B.; Gashler, Michael S. (2016-02-03). "A continuum among logarithmic, linear, and exponential functions, and its potential to improve generalization in neural networks". 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management: KDIR 1602: 481–486. arXiv:1602.01321. Bibcode: 2016arXiv160201321G.
- ↑ Gashler, Michael S.; Ashmore, Stephen C. (2014-05-09). "Training Deep Fourier Neural Networks To Fit Time-Series Data". arXiv:1405.2262 [cs.NE].