Replicator Dynamics with a Variable Learning Rate (ReDVaLeR) Algorithm
A Replicator Dynamics with a Variable Learning Rate (ReDVaLeR) Algorithm is a Multi-Agent Reinforcement Learning (MARL) Algorithm that can be implemented by a ReDVaLeR System to solve a ReDVaLeR Task.
- AKA: ReDVaLeR Algorithm.
- Example(s):
- …
- Counter-Example(s):
- Adapt When Everybody is Stationary Otherwise Move to Equilibrium (AWESOME) Algorithm,
- Enhanced Cooperative Multi-Agent Learning Algorithm (ECMLA) Algorithm,
- Learn or Exploit for Adversary Induced Markov Decision Process (LoE-AIM) Algorithm,
- Weighted Policy Learner (WPL) Algorithm,
- Win or Learn Fast (WoLF) Algorithm.
- See: Game Theory, Machine Learning System, Q-Learning, Reinforcement Learning, Nash Equilibrium.
References
2004
- (Banerjee & Peng, 2004) ⇒ Bikramjit Banerjee, and Jing Peng. (2004). “Performance Bounded Reinforcement Learning in Strategic Interactions.” In: Proceedings of the 19th National Conference on Artificial Intelligence. ISBN:0-262-51183-5
- QUOTE: We propose to use the Replicator (Fudenberg & Levine 1998) rule for policy update of our target algorithm with a WoLF-like modification, which we call ReDVaLeR (Replicator Dynamics with a Variable Learning Rate).
(...) which unlike the previous algorithms has a specific desirable property against opponents that are neither stationary nor using the same algorithm as the learner. It does not attempt to explicitly identify the opponents’ algorithms, and as such is unaware of any desirable point of convergence (in policies) against such opponents. Therefore, instead of policy convergence, it achieves constant bounded regret at any time (no-regret payoff asymptotically) against such opponents, while preserving the earlier properties of convergence against stationary opponents and in self-play. We have established the following properties of ReDVaLeR in arbitrary sized general-sum non-negative games: (i) It converges to a stationary best response against an arbitrary number of stationary opponents, (ii) It achieves no-regret payoff against an arbitrary number of opponents playing arbitrary fixed sequences of non-stationary policies, and (iii) If all the players are following ReDVaLeR, then they converge to their portions of a NE, but one case requires a learning rate schedule that conflicts with property (ii). A simple experiment shows that even in this case, the learning rates sometimes can be such that convergence in self-play is maintained and no-regret payoff is achieved.
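The quoted passage describes a replicator-dynamics policy update combined with a WoLF-like ("win or learn fast") variable learning rate. The sketch below is a minimal, simplified illustration of that idea in Python, not the exact per-opponent learning-rate schedule of the ReDVaLeR paper; the function name, parameter names, and the use of a single reference payoff (e.g., the player's Nash-equilibrium value) to decide "winning" vs. "losing" are illustrative assumptions.

```python
import numpy as np

def redvaler_style_update(policy, payoffs, ref_value, eta_win=0.01, eta_lose=0.04):
    """One replicator-dynamics policy update with a WoLF-style variable
    learning rate (a simplified sketch, not the exact ReDVaLeR rule).

    policy    -- current mixed strategy over the player's actions (sums to 1)
    payoffs   -- expected payoff of each pure action against the opponents'
                 current (possibly non-stationary) joint policy
    ref_value -- reference payoff (assumed here: the player's equilibrium value),
                 used to decide whether the player is 'winning' or 'losing'
    """
    expected = float(policy @ payoffs)                    # payoff of the current mixed policy
    eta = eta_win if expected > ref_value else eta_lose   # learn faster when losing (WoLF-like)
    # Replicator rule: increase the probability of actions that beat the current average payoff.
    new_policy = policy + eta * policy * (payoffs - expected)
    new_policy = np.clip(new_policy, 1e-12, None)         # keep a valid probability distribution
    return new_policy / new_policy.sum()

# Example: in a two-action game where action 0 currently pays better,
# the update shifts probability mass toward action 0.
policy = np.array([0.5, 0.5])
payoffs = np.array([1.0, 0.0])
policy = redvaler_style_update(policy, payoffs, ref_value=0.5)
```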