Long Short-Term Memory Unit Activation Function
A Long Short-Term Memory Unit Activation Function is a neuron activation function that implements LSTM units with forget gates.
- AKA: LSTM Activation Function.
- Context:
- It can (typically) be used in Recurrent Neural Networks.
- Example(s):
- chainer.functions.lstm(), Chainer's implementation,
- …
- Counter-Example(s):
- an S-LSTM Unit Activation Function,
- a Tree-LSTM Unit Activation Function,
- a Softmax-based Activation Function,
- a Rectified-based Activation Function,
- a Heaviside Step Activation Function,
- a Ramp Function-based Activation Function,
- a Logistic Sigmoid-based Activation Function,
- a Hyperbolic Tangent-based Activation Function,
- a Gaussian-based Activation Function,
- a Softsign Activation Function,
- a Softshrink Activation Function,
- an Adaptive Piecewise Linear Activation Function,
- a Maxout Activation Function.
- See: Artificial Neural Network, Recurrent Neural Network (RNN), Artificial Neuron, Neural Network Topology, Neural Network Layer, Neural Network Learning Rate.
References
2018a
- (Chainer, 2018) ⇒ http://docs.chainer.org/en/stable/reference/generated/chainer.functions.lstm.html Retrieved:2018-2-24
- QUOTE:
chainer.functions.lstm(c_prev, x)
Long Short-Term Memory units as an activation function. This function implements LSTM units with forget gates. Let the previous cell state be c_prev and the input array be x.
First, the input array x is split into four arrays [math]\displaystyle{ a, i, f, o }[/math] of the same shapes along the second axis. This means that x's second axis must be 4 times as long as c_prev's second axis. The split input arrays correspond to:
- [math]\displaystyle{ a }[/math] : sources of cell input
- [math]\displaystyle{ i }[/math] : sources of input gate
- [math]\displaystyle{ f }[/math] : sources of forget gate
- [math]\displaystyle{ o }[/math] : sources of output gate
Second, it computes the updated cell state c and the outgoing signal h as:
- [math]\displaystyle{ c = \tanh(a)\sigma(i) + c_{prev}\sigma(f) }[/math],
- [math]\displaystyle{ h = \tanh(c)\sigma(o) }[/math],
where [math]\displaystyle{ \sigma }[/math] is the elementwise sigmoid function. These are returned as a tuple of two variables.
This function supports variable length inputs. The mini-batch size of the current input must be equal to or smaller than that of the previous one. When the mini-batch size of x is smaller than that of c, this function only updates c[0:len(x)] and doesn't change the rest of c, i.e. c[len(x):]. So, please sort input sequences in descending order of lengths before applying the function (...)
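For illustration, a minimal usage sketch of chainer.functions.lstm() as quoted above; the batch size, state size, and random input are illustrative choices, not part of the Chainer documentation:

```python
import numpy as np
import chainer.functions as F

batch_size, state_size = 3, 4

# Previous cell state: shape (batch_size, state_size).
c_prev = np.zeros((batch_size, state_size), dtype=np.float32)

# Input array: its second axis must be 4 * state_size, holding the
# sources a, i, f, o for the cell input and the three gates.
x = np.random.randn(batch_size, 4 * state_size).astype(np.float32)

# lstm() returns the updated cell state c and the outgoing signal h.
c, h = F.lstm(c_prev, x)
print(c.shape, h.shape)  # both (batch_size, state_size)
```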
2018b
- (Chainer, 2018) ⇒ http://docs.chainer.org/en/stable/reference/generated/chainer.links.LSTM.html Retrieved:2018-2-24
- QUOTE:
class chainer.links.LSTM(in_size, out_size=None, lateral_init=None, upward_init=None, bias_init=None, forget_bias_init=None)
Fully-connected LSTM layer.
This is a fully-connected LSTM layer as a chain. Unlike the lstm() function, which is defined as a stateless activation function, this chain holds upward and lateral connections as child links. It also maintains states, including the cell state and the output at the previous time step. Therefore, it can be used as a stateful LSTM.
This link supports variable length inputs. The mini-batch size of the current input must be equal to or smaller than that of the previous one. The mini-batch size of c and h is determined as that of the first input x. When the mini-batch size of the i-th input is smaller than that of the previous input, this link only updates c[0:len(x)] and h[0:len(x)] and doesn't change the rest of c and h. So, please sort input sequences in descending order of lengths before applying the function (...)
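For illustration, a minimal sketch of driving the stateful chainer.links.LSTM link quoted above; the layer sizes, sequence length, and random inputs are illustrative assumptions:

```python
import numpy as np
import chainer.links as L

# A stateful, fully-connected LSTM layer that holds its own upward/lateral
# weights as well as the cell state c and the previous output h.
lstm = L.LSTM(in_size=10, out_size=4)

# Five time steps of a mini-batch of 3 input vectors.
xs = [np.random.randn(3, 10).astype(np.float32) for _ in range(5)]

lstm.reset_state()       # clear c and h before feeding a new batch of sequences
for x in xs:
    h = lstm(x)          # each call consumes one time step and updates c and h
print(h.shape)           # (3, 4)
```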
2018c
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Long_short-term_memory Retrieved:2018-2-25.
- Long short-term memory (LSTM) units (or blocks) are a building unit for layers of a recurrent neural network (RNN). A RNN composed of LSTM units is often called an LSTM network. A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell is responsible for "remembering" values over arbitrary time intervals; hence the word "memory" in LSTM. Each of the three gates can be thought of as a "conventional" artificial neuron, as in a multi-layer (or feedforward) neural network: that is, they compute an activation (using an activation function) of a weighted sum. Intuitively, they can be thought of as regulators of the flow of values that goes through the connections of the LSTM; hence the denotation "gate". There are connections between these gates and the cell.
The expression long short-term refers to the fact that LSTM is a model for the short-term memory which can last for a long period of time. An LSTM is well-suited to classify, process and predict time series given time lags of unknown size and duration between important events. LSTMs were developed to deal with the exploding and vanishing gradient problem when training traditional RNNs. Relative insensitivity to gap length gives an advantage to LSTM over alternative RNNs, hidden Markov models and other sequence learning methods in numerous applications.
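The "activation of a weighted sum" view of the gates can be made concrete with a short NumPy sketch of a single LSTM time step; this is a minimal illustration with made-up parameter names (Wx, Wh, b) and sizes, not tied to any particular library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wx, Wh, b):
    """One LSTM time step.  Wx, Wh and b stack the parameters of the cell
    input (a) and the input/forget/output gates (i, f, o) along axis 0."""
    z = Wx @ x + Wh @ h_prev + b                        # weighted sums
    a, i, f, o = np.split(z, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(a)   # cell "remembers"
    h = sigmoid(o) * np.tanh(c)                         # gated outgoing signal
    return h, c

# Illustrative sizes: 3 inputs, 2 state units.
rng = np.random.default_rng(0)
n_in, n_state = 3, 2
Wx = rng.standard_normal((4 * n_state, n_in))
Wh = rng.standard_normal((4 * n_state, n_state))
b = np.zeros(4 * n_state)
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_state), np.zeros(n_state), Wx, Wh, b)
```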
2001
- (Gers, 2001) ⇒ Gers, F. (2001). Long short-term memory in recurrent neural networks. Unpublished PhD dissertation, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.
- ABSTRACT: For a long time, recurrent neural networks (RNNs) were thought to be theoretically fascinating. Unlike standard feed-forward networks, RNNs can deal with arbitrary input sequences instead of static input data only. This, combined with the ability to memorize relevant events over time, makes recurrent networks in principle more powerful than standard feed-forward networks. The set of potential applications is enormous: any task that requires learning how to use memory is a potential task for recurrent networks. Potential application areas include time series prediction, motor control in non-Markovian environments and rhythm detection (in music and speech).
Previous successes in real-world applications with recurrent networks were limited, however, due to practical problems when long time lags between relevant events make learning difficult. For these applications, conventional gradient-based recurrent network algorithms for learning to store information over extended time intervals take too long. The main reason for this failure is the rapid decay of back-propagated error. The "Long Short Term Memory" (LSTM) algorithm overcomes this and related problems by enforcing constant error flow. Using gradient descent, LSTM explicitly learns when to store information and when to access it. In this thesis, we extend, analyze, and apply the LSTM algorithm. In particular, we identify two weaknesses of LSTM, offer solutions and modify the algorithm accordingly: (1) We recognize a weakness of LSTM networks processing continual input streams that are not a priori segmented into subsequences with explicitly marked ends at which the network's internal state could be reset. Without resets, the state may grow indefinitely and eventually cause the network to break down. Our remedy is a novel, adaptive "forget gate" that enables an LSTM cell to learn to reset itself at appropriate times, thus releasing internal resources. (2) We identify a weakness in LSTM's connection scheme, and extend it by introducing "peephole connections" from LSTM's "Constant Error Carousel" to the multiplicative gates protecting them. These connections provide the gates with explicit information about the state to which they control access. We show that peephole connections are necessary for numerous tasks and do not significantly affect LSTM's performance on previously solved tasks.
We apply the extended LSTM with forget gates and peephole connections to tasks that no other RNN algorithm can solve (including traditional LSTM): grammar tasks and temporal order tasks involving continual input streams, arithmetic operations on continual input streams, tasks that require precise, continual timing, periodic function generation, and context-free and context-sensitive language tasks. Finally, we establish limits of LSTM on time series prediction problems solvable by time window approaches.
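As a worked summary of the two extensions described in this abstract, a common formulation of the LSTM step with a forget gate and peephole connections is (the notation below is introduced here for illustration and is not taken from the thesis; [math]\displaystyle{ \odot }[/math] denotes elementwise multiplication):
- [math]\displaystyle{ i_t = \sigma(W_i x_t + U_i h_{t-1} + p_i \odot c_{t-1} + b_i) }[/math]
- [math]\displaystyle{ f_t = \sigma(W_f x_t + U_f h_{t-1} + p_f \odot c_{t-1} + b_f) }[/math]
- [math]\displaystyle{ c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) }[/math]
- [math]\displaystyle{ o_t = \sigma(W_o x_t + U_o h_{t-1} + p_o \odot c_t + b_o) }[/math]
- [math]\displaystyle{ h_t = o_t \odot \tanh(c_t) }[/math]
Here the peephole weights [math]\displaystyle{ p_i, p_f, p_o }[/math] give each gate direct access to the cell state it protects, and the forget gate [math]\displaystyle{ f_t }[/math] lets the cell learn to reset its own state.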