REINFORCE Algorithm
A REINFORCE Algorithm is a reinforcement learning algorithm that updates neural network weight parameters at the end of each trial by an increment of the form: Reward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility.
- AKA: Williams Reinforce Algorithm.
- Context:
- It was initially developed by Williams (1992).
- It can be expressed as $\Delta w_{i j}=\alpha_{i j}\left(r-b_{i j}\right) e_{i j}$, where $\alpha_{ij}$ is a learning rate factor, $b_{ij}$ is a reinforcement baseline, and $e_{i j}=\partial \ln g_{i} / \partial w_{i j}$ is called the characteristic eligibility of neural network weight $w_{ij}$.
- Example(s):
- an Episodic REINFORCE Algorithm.
- For a network of Bernoulli-logistic units, a REINFORCE algorithm updates each weight according to the increment rule: $\Delta w_{i j}=\alpha r\left(y_{i}-p_{i}\right) x_{j}$.
- …
- Counter-Example(s):
- See: LeakGAN System, Reinforcement Learning Neural Network, FeUdal Network (FuN), Reinforcement Learning, Associative Reinforcement Task, Immediate Reinforcement Task, Delayed Reinforcement Task.
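The Bernoulli-logistic update rule above can be sketched as a short simulation. This is a minimal illustration, not Williams' original code: the bit-matching task, the single fixed input, and all hyperparameter values are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_out = 3, 2
W = rng.normal(scale=0.1, size=(n_out, n_in))  # weights w_ij
alpha = 0.1  # learning rate (the nonnegative factor)

def reinforce_step(W, x, target):
    p = sigmoid(W @ x)                          # firing probabilities p_i
    y = (rng.random(n_out) < p).astype(float)   # stochastic Bernoulli outputs y_i
    # Hypothetical task: reinforcement r = 1 iff the output matches the target.
    r = float(np.all(y == target))
    # REINFORCE increment with zero baseline: dw_ij = alpha * r * (y_i - p_i) * x_j
    W = W + alpha * r * np.outer(y - p, x)
    return W, r

# Repeated trials on one fixed input-target pair.
x = np.array([1.0, 0.0, 1.0])
target = np.array([1.0, 0.0])
rewards = []
for _ in range(2000):
    W, r = reinforce_step(W, x, target)
    rewards.append(r)

print(f"mean reward over last 200 trials: {np.mean(rewards[-200:]):.2f}")
```

Because the update is applied only on rewarded trials and moves each $p_i$ toward the sampled $y_i$, the network's outputs drift toward the rewarded pattern over repeated trials.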
References
1992
- (Williams, 1992) ⇒ Ronald J. Williams (1992). "Simple Statistical Gradient-following Algorithms for Connectionist Reinforcement Learning". In: Machine learning. DOI: https://doi.org/10.1007/BF00992696
- QUOTE: Consider a network facing an associative immediate-reinforcement learning task. Recall that weights are adjusted in this network following receipt of the reinforcement value $r$ at each trial. Suppose that the learning algorithm for this network is such that at the end of each trial each parameter $w_{ij}$ in the network is incremented by an amount: $\Delta w_{i j}=\alpha_{i j}\left(r-b_{i j}\right) e_{i j}$
where $\alpha_{ij}$ is a learning rate factor, $b_{ij}$ is a reinforcement baseline, and $e_{i j}=\partial \ln g_{i} / \partial w_{i j}$ is called the characteristic eligibility of $w_{ij}$. Suppose further that the reinforcement baseline $b_{ij}$ is conditionally independent of $y_i$, given $\mathbf{W}$ and $\mathbf{x}^i$, and the rate factor $\alpha_{ij}$ is nonnegative and depends at most on $\mathbf{w}^i$ and $t$. (Typically, $\alpha_{ij}$ will be taken to be a constant.) Any learning algorithm having this particular form will be called a REINFORCE algorithm. The name is an acronym for "Reward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility", which describes the form of the algorithm.