Weighted Policy Learner (WPL) Algorithm
A Weighted Policy Learner (WPL) Algorithm is a Multi-Agent Reinforcement Learning (MARL) Algorithm that enables agents to converge to a Nash Equilibrium, assuming each agent is oblivious to the other agents and receives only one type of feedback: the reward associated with choosing a given action.
- Example(s):
- as in Abdallah & Lesser (2007).
- …
- Counter-Example(s):
- Adapt When Everybody is Stationary Otherwise Move to Equilibrium (AWESOME) Algorithm,
- Enhanced Cooperative Multi-Agent Learning Algorithm (ECMLA) Algorithm,
- Learn or Exploit for Adversary Induced Markov Decision Process (LoE-AIM) Algorithm,
- Replicator Dynamics with a Variable Learning Rate (ReDVaLeR) Algorithm,
- Win or Learn Fast (WoLF) Algorithm.
- See: Game Theory, Machine Learning System, Q-Learning, Reinforcement Learning, Nash Equilibrium.
References
2008
- (Abdallah & Lesser, 2008) ⇒ Sherief Abdallah, and Victor Lesser. (2008). “A Multiagent Reinforcement Learning Algorithm with Non-linear Dynamics.” In: Journal of Artificial Intelligence Research, 33(1).
- QUOTE: In this paper we propose a new MARL algorithm that enables agents to converge to a Nash Equilibrium, in benchmark games, assuming each agent is oblivious to other agents and receives only one type of feedback: the reward associated with choosing a given action. The new algorithm is called the Weighted Policy Learner or WPL for reasons that will become clear shortly. We experimentally show that WPL converges in well-known benchmark two-player-two-action games.
2007
- (Abdallah & Lesser, 2007) ⇒ Sherief Abdallah, and Victor Lesser. (2007). “Multiagent Reinforcement Learning and Self-organization in a Network of Agents.” In: Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2007). ISBN:978-81-904262-7-5 doi:10.1145/1329125.1329172
- QUOTE: The advantage of using this gradient ascent approach is that agents can learn stochastic policies, which is necessary for most of the convergence guarantees. Algorithm 1 describes the Weighted Policy Learner (WPL) algorithm [1], which we have chosen as the accompanying learning algorithm for our self-organizing mechanism. It should be noted, however, that our mechanism does not depend on the accompanying learning algorithm. In fact, the interaction between WPL and our self-organizing mechanism is encapsulated through the Q and π data structures, which are common to learning algorithms other than WPL.
WPL achieves convergence using an intuitive idea: slow down learning when moving away from a stable policy and speed up learning when moving towards the stable policy. In that respect, the idea has similarity with the Win or Learn Fast (WoLF) heuristic [3], but the WPL algorithm is more intuitive and achieves higher performance than algorithms using WoLF.
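The weighted update described in this quote can be written compactly. The following is a minimal NumPy sketch of one WPL policy-update step, loosely following the update rule described by Abdallah & Lesser (2008): the policy gradient for each action is weighted by π(a) when the action's value is below the policy's expected value (moving away) and by 1 − π(a) when it is above (moving towards). The function name wpl_update, the learning rate eta, and the simplex-projection step are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def wpl_update(pi, Q, eta=0.01):
    """One Weighted Policy Learner (WPL) policy-update step (sketch).

    pi  : current stochastic policy over actions (1-D array summing to 1)
    Q   : current estimated reward (Q-value) for each action
    eta : policy learning rate
    """
    # Gradient estimate: how much better each action is than the
    # policy's current expected reward.
    delta = Q - np.dot(pi, Q)

    # WPL weighting: scale by (1 - pi[a]) when the gradient pushes an
    # action's probability up, and by pi[a] when it pushes it down, so
    # learning slows near the boundary being approached.
    weights = np.where(delta > 0, 1.0 - pi, pi)

    pi = pi + eta * delta * weights

    # Project back onto the probability simplex, keeping a small
    # minimum probability so the policy stays stochastic (assumed here).
    pi = np.clip(pi, 1e-3, None)
    return pi / pi.sum()
```

In a two-player-two-action benchmark game such as matching pennies, each agent would keep its own pi and Q arrays, update Q from its observed rewards, and call a step like this after each play; the weighting is what damps the oscillation around the equilibrium policy.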