Weighted Policy Learner (WPL) Algorithm
A Weighted Policy Learner (WPL) Algorithm is a Multi-Agent Reinforcement Learning (MARL) Algorithm that enables agents to converge to a Nash Equilibrium, assuming each agent is oblivious to the other agents and receives only one type of feedback: the reward associated with choosing a given action.
- Example(s):
- as in Abdallah & Lesser (2007).
- …
- Counter-Example(s):
- Adapt When Everybody is Stationary Otherwise Move to Equilibrium (AWESOME) Algorithm,
- Enhanced Cooperative Multi-Agent Learning Algorithm (ECMLA) Algorithm,
- Learn or Exploit for Adversary Induced Markov Decision Process (LoE-AIM) Algorithm,
- Replicator Dynamics with a Variable Learning Rate (ReDVaLeR) Algorithm,
- Win or Learn Fast (WoLF) Algorithm.
- See: Game Theory, Machine Learning System, Q-Learning, Reinforcement Learning, Nash Equilibrium.
References
2008
- (Abdallah & Lesser, 2008) ⇒ Sherief Abdallah, and Victor Lesser. (2008). “A Multiagent Reinforcement Learning Algorithm with Non-linear Dynamics.” In: Journal of Artificial Intelligence Research, 33(1).
- QUOTE: In this paper we propose a new MARL algorithm that enables agents to converge to a Nash Equilibrium, in benchmark games, assuming each agent is oblivious to other agents and receives only one type of feedback: the reward associated with choosing a given action. The new algorithm is called the Weighted Policy Learner or WPL for reasons that will become clear shortly. We experimentally show that WPL converges in well-known benchmark two-player-two-action games.
2007
- (Abdallah & Lesser, 2007) ⇒ Sherief Abdallah, and Victor Lesser. (2007). “Multiagent Reinforcement Learning and Self-organization in a Network of Agents.” In: Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2007). ISBN:978-81-904262-7-5 doi:10.1145/1329125.1329172
- QUOTE: The advantage of using this gradient ascent approach is that agents can learn stochastic policies, which is necessary for most of the convergence guarantees. Algorithm 1 describes the Weighted Policy Learner (WPL) algorithm [1], which we have chosen as the accompanying learning algorithm for our self-organizing mechanism. It should be noted, however, that our mechanism does not depend on the accompanying learning algorithm. In fact, the interaction between WPL and our self-organizing mechanism is encapsulated through the Q and π data structures, which are common to learning algorithms other than WPL.
WPL achieves convergence using an intuitive idea: slow down learning when moving away from a stable policy and speed up learning when moving towards the stable policy. In that respect, the idea has similarity with the Win or Learn Fast (WoLF) heuristic [3], but the WPL algorithm is more intuitive and achieves higher performance than algorithms using WoLF.
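The weighted update described in this quote can be written compactly. The following is a minimal NumPy sketch of one WPL policy-update step, loosely following the update rule described by Abdallah & Lesser (2008): the policy gradient for each action is weighted by π(a) when the action's value is below the policy's expected value (moving away) and by 1 − π(a) when it is above (moving towards). The function name wpl_update, the learning rate eta, and the simplex-projection step are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def wpl_update(pi, Q, eta=0.01):
    """One Weighted Policy Learner (WPL) policy-update step (sketch).

    pi  : current stochastic policy over actions (1-D array summing to 1)
    Q   : current estimated reward (Q-value) for each action
    eta : policy learning rate
    """
    # Gradient estimate: how much better each action is than the
    # policy's current expected reward.
    delta = Q - np.dot(pi, Q)

    # WPL weighting: scale by (1 - pi[a]) when the gradient pushes an
    # action's probability up, and by pi[a] when it pushes it down, so
    # learning slows near the boundary being approached.
    weights = np.where(delta > 0, 1.0 - pi, pi)

    pi = pi + eta * delta * weights

    # Project back onto the probability simplex, keeping a small
    # minimum probability so the policy stays stochastic (assumed here).
    pi = np.clip(pi, 1e-3, None)
    return pi / pi.sum()
```

In a two-player-two-action benchmark game such as matching pennies, each agent would keep its own pi and Q arrays, update Q from its observed rewards, and call a step like this after each play; the weighting is what damps the oscillation around the equilibrium policy.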