FeUdal Network (FuN)

AKA: Manager-Worker Neural Network.
Context:
- It is trained using a FeUdal Network Training System, which is based on the Feudal Reinforcement Learning (FRL) System introduced by Dayan & Hinton (1993).
- It is a fully differentiable neural network with two levels of hierarchy.
- It consists of the following architecture, as first introduced by Vezhnevets et al. (2017):
  - a perceptual module ($f^{percept}$) that computes an intermediate representation ($z_t$), shared by the Manager and Worker modules, from the input vector ($x_t$).
  - a Manager Module that is composed of:
    - a Dilated LSTM-RNN ($f^{Mrnn}$) that takes the latent state representation ($s_t$) together with its previous hidden state ($h^M_{t-1}$) and outputs a goal vector ($\hat{g}_t$), which is normalized to unit length to give $g_t$.
    - a fully connected neural network layer ($f^{Mspace}$) followed by a non-linear rectifier that computes $s_t$ from $z_t$ and feeds it to the Dilated LSTM-RNN.
  - a Worker Module that is composed of:
    - an LSTM-RNN ($f^{Wrnn}$) that produces the FuN Worker action embedding matrix ($U_t$) from $z_t$.
    - a summation and linear projection layer ($\phi$) that pools the Manager's goals ($g_t$) over the last $c$ steps and produces a goal embedding vector ($w_t$).
    - a matrix-vector product layer that combines $U_t$ and $w_t$, followed by a SoftMax, to produce the FuN Worker action policy ($\pi_t$), from which the actions ($a_t$) are drawn.
Example(s):
- LeakGAN Model.
- …
Counter-Example(s):
See: Reinforcement Learning, Hierarchical Reinforcement Learning System, Transition policy Gradient, Policy Gradient Training System, Latent State Representation.

References

(Vezhnevets et al., 2017) ⇒ Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. (2017). “FeUdal Networks for Hierarchical Reinforcement Learning.” In: Proceedings of the 34th International Conference on Machine Learning (ICML2017).
- QUOTE: What is FuN? FuN is a modular neural-network consisting of two modules – the Worker and the Manager. The Manager internally computes a latent state representation $s_t$ and outputs a goal vector $g_t$. The Worker produces actions conditioned on external observation, its own state, and the Managers goal. The Manager and the Worker share a perceptual module which takes an observation from the environment $x_t$ and computes a shared intermediate representation $z_t$. The Manager's goals $g_t$ are trained using an approximate transition policy gradient. This is a particularly efficient form of policy gradient training that exploits the knowledge that the Worker's behaviour will ultimately align with the goal directions it has been set. The Worker is then trained via intrinsic reward to produce actions that cause these goal directions to be achieved. Figure 1a illustrates the overall design and the following equations describe the forward dynamics of our network:

$z_{t}=f^{\text{percept}}\left(x_{t}\right) ; \quad s_{t}=f^{\text{Mspace}}\left(z_{t}\right)$	(1)
$h_{t}^{M}, \hat{g}_{t}=f^{Mrnn}\left(s_{t}, h_{t-1}^{M}\right) ; \quad g_{t}=\dfrac{\hat{g}_{t}}{\|\hat{g}_{t}\|}$	(2)
$w_{t}=\phi\left(\sum_{i=t-c}^{t} g_{i}\right)$	(3)
$h^{W}, U_{t}=f^{Wrnn}\left(z_{t}, h_{t-1}^{W}\right) ; \quad \pi_{t}=\operatorname{SoftMax}\left(U_{t} w_{t}\right)$	(4)

**Figure 1**. The schematic illustration of FuN.
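
The forward dynamics in equations (1) to (4) above can be sketched in plain Python as follows. This is only an illustrative sketch, not the authors' implementation: the dimensions, the random weights, the helper names (`fun_step`, `rnn_step`, `init_params`), and the use of a single `tanh` layer and a plain `tanh` recurrence as stand-ins for the perceptual module, the Dilated LSTM-RNN, and the Worker LSTM-RNN are all assumptions made to keep the example short and dependency-free.

```python
import math
import random

# Hypothetical sizes, chosen only for illustration.
X_DIM, D, K, N_ACTIONS, C = 6, 4, 3, 5, 2

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(v):
    top = max(v)
    exps = [math.exp(a - top) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(cell, x, h):
    # Assumption: a plain tanh recurrence stands in for the (Dilated) LSTM cells.
    w_in, w_h, bias = cell
    pre = zip(matvec(w_in, x), matvec(w_h, h), bias)
    return [math.tanh(a + b + c) for a, b, c in pre]

def init_params(seed=0):
    rng = random.Random(seed)
    hw = N_ACTIONS * K  # Worker state is reshaped into U_t (|A| x k)
    return {
        "percept": rand_matrix(D, X_DIM, rng),
        "mspace": rand_matrix(D, D, rng),
        "m_rnn": (rand_matrix(D, D, rng), rand_matrix(D, D, rng),
                  rand_matrix(1, D, rng)[0]),
        "w_rnn": (rand_matrix(hw, D, rng), rand_matrix(hw, hw, rng),
                  rand_matrix(1, hw, rng)[0]),
        "phi": rand_matrix(K, D, rng),
    }

def fun_step(p, x_t, h_m, h_w, past_goals):
    # Eq. (1): shared representation z_t, then the Manager's latent state s_t.
    z_t = [math.tanh(a) for a in matvec(p["percept"], x_t)]  # stand-in percept
    s_t = [max(0.0, a) for a in matvec(p["mspace"], z_t)]    # layer + rectifier
    # Eq. (2): Manager recurrence, then normalise the raw goal to unit length.
    h_m = rnn_step(p["m_rnn"], s_t, h_m)
    g_hat = h_m  # assumption: the raw goal is read directly from the state
    norm = max(math.sqrt(sum(a * a for a in g_hat)), 1e-12)
    g_t = [a / norm for a in g_hat]
    # Eq. (3): sum the goals g_{t-c}, ..., g_t and project them with phi.
    goals = (past_goals + [g_t])[-(C + 1):]
    pooled = [sum(col) for col in zip(*goals)]
    w_t = matvec(p["phi"], pooled)
    # Eq. (4): Worker recurrence gives U_t; the policy is SoftMax(U_t w_t).
    h_w = rnn_step(p["w_rnn"], z_t, h_w)
    u_t = [h_w[i * K:(i + 1) * K] for i in range(N_ACTIONS)]
    pi_t = softmax(matvec(u_t, w_t))
    return pi_t, g_t, h_m, h_w, goals

params = init_params()
h_m, h_w, goals = [0.0] * D, [0.0] * (N_ACTIONS * K), []
for t in range(4):
    x_t = [math.sin(t + i) for i in range(X_DIM)]  # dummy observation
    pi_t, g_t, h_m, h_w, goals = fun_step(params, x_t, h_m, h_w, goals)
```

After the loop, `pi_t` is a probability distribution over the `N_ACTIONS` hypothetical actions, `g_t` has unit norm, and `goals` holds the most recent `C + 1` goal vectors that enter the sum in equation (3).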