FeUdal Network Manager Module
A FeUdal Network Manager Module is a FeUdal Network module that generates a goal vector and feeds it to a lower-level worker module, which then produces the required action.
- AKA: Manager Neural Network.
- Context:
- It can (typically) be part of FeUdal Network (FuN).
- …
- Example(s):
- Counter-Example(s):
- See: Feudal Reinforcement Learning (FRL) System, LeakGAN, Policy Gradient Training System, Hierarchical Reinforcement Learning System.
References
2018
- (Guo et al., 2018) ⇒ Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. (2018). “Long Text Generation via Adversarial Training with Leaked Information.” In: Proceedings of the Thirty-Second (AAAI) Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th (AAAI) Symposium on Educational Advances in Artificial Intelligence (EAAI-18).
- QUOTE: As illustrated in Figure 1, we specifically introduce a hierarchical generator $G$, which consists of a high-level MANAGER module and a low-level WORKER module. The MANAGER is a long short-term memory network (LSTM) (Hochreiter and Schmidhuber 1997) and serves as a mediator. In each step, it receives discriminator $D$’s high-level feature representation, e.g., the feature map of the CNN, and uses it to form the guiding goal for the WORKER module in that timestep. Since the information from $D$ is internally maintained and, in an adversarial game, $D$ is not supposed to provide $G$ with such information, we call it a leakage of information from $D$.
{|style="border: 0px; text-align:center; border-spacing: 1px; margin: 1em auto; width: 80%"
|+ align="bottom" style="caption-side:top;text-align:center;font-weight:bold"|Discriminator
|-
|$f =\mathcal{F}\left(s ; \phi_{f}\right)$
|style="width:5%;text-align:right"|(1)
|-
|$D_{\phi}(s) =\operatorname{sigmoid}\left(\phi_{l} \cdot \mathcal{F}\left(s ; \phi_{f}\right)\right)=\operatorname{sigmoid}\left(\phi_{l} \cdot f\right)$
|style="width:5%;text-align:right"|(2)
|}
{|style="border: 0px; text-align:center; border-spacing: 1px; margin: 1em auto; width: 80%"
|+ align="bottom" style="caption-side:top;text-align:center;font-weight:bold"|MANAGER of Generator
|-
|$\hat{g}_{t}, h_{t}^{M} =\mathcal{M}\left(f_{t}, h_{t-1}^{M} ; \theta_{m}\right)$
|style="width:5%;text-align:right"|(3)
|-
|$g_{t} =\hat{g}_{t} /\left\|\hat{g}_{t}\right\|$
|style="width:5%;text-align:right"|(4)
|-
|$w_{t}=\psi\left(\sum_{i=1}^{c} g_{t-i}\right)=W_{\psi}\left(\sum_{i=1}^{c} g_{t-i}\right)$
|style="width:5%;text-align:right"|(5)
|-
|$O_{t}, h_{t}^{W}= \mathcal{W}\left(x_{t}, h_{t-1}^{W} ; \theta_{w}\right)$
|style="width:5%;text-align:right"|(6)
|-
|$G_{\theta}\left(\cdot \mid s_{t}\right)= \operatorname{softmax}\left(O_{t} \cdot w_{t} / \alpha\right)$
|style="width:5%;text-align:right"|(7)
|-
|$Q\left(f_{t}, g_{t}\right)=\mathbb{E}\left[r_{t}\right]$
|style="width:5%;text-align:right"|(8)
|-
|$\nabla_{\theta_{m}}^{\mathrm{adv}} g_{t}=-Q\left(f_{t}, g_{t}\right) \nabla_{\theta_{m}} d_{\cos }\left(\mathcal{F}\left(s_{t+c}\right)-\mathcal{F}\left(s_{t}\right), g_{t}\left(\theta_{m}\right)\right)$
|style="width:5%;text-align:right"|(9)
|}
{|style="border: 0px; text-align:center; border-spacing: 1px; margin: 1em auto; width: 80%"
|+ align="bottom" style="caption-side:top;text-align:center;font-weight:bold"|WORKER of Generator
|-
|$\nabla_{\theta_{w}} \mathbb{E}_{s_{t-1} \sim G}\left[\sum_{x_{t}} r_{t}^{I} \mathcal{W}\left(x_{t} \mid s_{t-1} ; \theta_{w}\right)\right] =\mathbb{E}_{s_{t-1} \sim G, x_{t} \sim \mathcal{W}\left(x_{t} \mid s_{t-1}\right)}\left[r_{t}^{I} \nabla_{\theta_{w}} \log \mathcal{W}\left(x_{t} \mid s_{t-1} ; \theta_{w}\right)\right]$
|style="width:5%;text-align:right"|(10)
|-
|$r_{t}^{I}=\frac{1}{c} \sum_{i=1}^{c} d_{\cos }\left(\mathcal{F}\left(s_{t}\right)-\mathcal{F}\left(s_{t-i}\right), g_{t-i}\right)$
|style="width:5%;text-align:right"|(11)
|}
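The WORKER's intrinsic reward in Eq. (11) — the average cosine similarity between the feature change $\mathcal{F}(s_t)-\mathcal{F}(s_{t-i})$ and the goal $g_{t-i}$ over the last $c$ steps — can be sketched in a few lines. The following numpy sketch uses hypothetical 3-d feature vectors purely for illustration:

```python
import numpy as np

def d_cos(a, b):
    # Cosine similarity between two feature-space vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def intrinsic_reward(features, goals, t, c):
    """Eq. (11): average cosine similarity between the feature change
    F(s_t) - F(s_{t-i}) and the goal g_{t-i}, over the last c steps."""
    return sum(
        d_cos(features[t] - features[t - i], goals[t - i]) for i in range(1, c + 1)
    ) / c

# Toy example: the features drift exactly along the goal direction,
# so every cosine term, and hence the reward, equals 1.
g = np.array([1.0, 0.0, 0.0])
features = [i * g + np.array([0.0, 0.01, 0.0]) for i in range(5)]
goals = [g] * 5
r = intrinsic_reward(features, goals, t=4, c=3)  # → 1.0
```

A WORKER that moves the discriminator's features in the MANAGER's goal direction maximizes this reward.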
2017
- (Vezhnevets et al., 2017) ⇒ Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. (2017). “FeUdal Networks for Hierarchical Reinforcement Learning.” In: Proceedings of the 34th International Conference on Machine Learning (ICML2017).
- QUOTE: What is FuN? FuN is a modular neural-network consisting of two modules – the Worker and the Manager. The Manager internally computes a latent state representation $s_t$ and outputs a goal vector $g_t$. The Worker produces actions conditioned on external observation, its own state, and the Manager's goal. The Manager and the Worker share a perceptual module which takes an observation from the environment $x_t$ and computes a shared intermediate representation $z_t$. The Manager's goals $g_t$ are trained using an approximate transition policy gradient. This is a particularly efficient form of policy gradient training that exploits the knowledge that the Worker's behaviour will ultimately align with the goal directions it has been set. The Worker is then trained via intrinsic reward to produce actions that cause these goal directions to be achieved. Figure 1a illustrates the overall design and the following equations describe the forward dynamics of our network:
$z_{t}=f^{\text{percept}}\left(x_{t}\right) ; \quad s_{t}=f^{\text{Mspace}}\left(z_{t}\right)$ (1)
$h_{t}^{M}, \hat{g}_{t}=f^{\text{Mrnn}}\left(s_{t}, h_{t-1}^{M}\right) ; \quad g_{t}=\dfrac{\hat{g}_{t}}{\parallel\hat{g}_{t}\parallel}$ (2)
$w_{t}=\phi\left(\sum_{i=t-c}^{t} g_{i}\right)$ (3)
$h_{t}^{W}, U_{t}=f^{\text{Wrnn}}\left(z_{t}, h_{t-1}^{W}\right) ; \quad \pi_{t}=\operatorname{SoftMax}\left(U_{t} w_{t}\right)$ (4)
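The forward dynamics in Eqs. (1)–(4) can be sketched with plain numpy. This is a minimal illustration only: the dimensions, the random fixed weights, and the stateless linear stand-ins for $f^{\text{percept}}$, $f^{\text{Mrnn}}$, and $f^{\text{Wrnn}}$ (which in FuN are a CNN, a dilated LSTM, and an LSTM) are all assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: latent dim d, goal-embedding dim k, horizon c, action count.
d, k, c, num_actions = 16, 8, 4, 6

# Random fixed weights standing in for the learned modules.
W_percept = rng.standard_normal((d, d)) * 0.1   # f^percept (linear stand-in)
W_mspace  = rng.standard_normal((d, d)) * 0.1   # f^Mspace
W_mrnn    = rng.standard_normal((d, d)) * 0.1   # f^Mrnn (stateless stand-in)
W_wrnn    = rng.standard_normal((num_actions * k, d)) * 0.1  # f^Wrnn, produces U_t
phi       = rng.standard_normal((k, d)) * 0.1   # linear goal embedding phi (no bias)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def fun_forward(x_t, past_goals):
    z_t = W_percept @ x_t                          # (1) z_t = f^percept(x_t)
    s_t = W_mspace @ z_t                           # (1) s_t = f^Mspace(z_t)
    g_hat = W_mrnn @ s_t                           # (2) Manager proposes a goal
    g_t = g_hat / np.linalg.norm(g_hat)            # (2) normalise to a direction
    goals = past_goals + [g_t]
    w_t = phi @ np.sum(goals[-(c + 1):], axis=0)   # (3) pool goals over i = t-c..t
    U_t = (W_wrnn @ z_t).reshape(num_actions, k)   # (4) Worker action embeddings
    pi_t = softmax(U_t @ w_t)                      # (4) policy over actions
    return pi_t, goals

pi, goals = fun_forward(rng.standard_normal(d), [])
```

Note how the Worker's policy depends on the Manager only through the pooled, normalised goal directions — the property the transition policy gradient exploits.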
1992
- (Dayan & Hinton, 1992) ⇒ Peter Dayan, and Geoffrey E. Hinton. (1992). “Feudal Reinforcement Learning.” In: Proceedings of Advances in Neural Information Processing Systems 5 (NIPS 1992).