Discounted Infinite Horizon Reinforcement Learning Task
A Discounted Infinite Horizon Reinforcement Learning Task is an Infinite Horizon Reinforcement Learning Task that is a Discounted Reinforcement Learning Task, i.e. one in which rewards accumulated over an unbounded number of time steps are weighted by a discount factor strictly less than one.
- …
- Counter-Example(s):
- …
- See: Finite Markov Decision Process, Q-Learning, Infinite Markov Decision Process, Partially Observable Markov Decision Processes (POMDPs).
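For reference, the objective optimized in such a task is usually written as an expected discounted return; the notation below (discount factor γ, per-step reward r_{t+1}) is the standard formulation and is not taken from this page:

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\middle|\; s_0 = s \right],
\qquad 0 \le \gamma < 1,
```

where the strict bound γ < 1 is what keeps the infinite-horizon sum finite.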
References
2004
- (Ferns et al., 2004) ⇒ Norm Ferns, Prakash Panangaden, and Doina Precup. (2004). “Metrics for Finite Markov Decision Processes.” In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 162-169. AUAI Press.
- ABSTRACT: We present metrics for measuring the similarity of states in a finite Markov decision process (MDP). The formulation of our metrics is based on the notion of bisimulation for MDPs, with an aim towards solving discounted infinite horizon reinforcement learning tasks. Such metrics can be used to aggregate states, as well as to better structure other value function approximators (e.g., memory-based or nearest-neighbor approximators). We provide bounds that relate our metric distances to the optimal values of states in the given MDP.
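- The metric described in this abstract can be approximated by iterating its defining operator to a fixed point. The sketch below is a minimal Python illustration, not the authors' implementation: it assumes small dense reward and transition arrays, uses the weights (1 − γ) and γ for the reward and transition terms (one common instantiation of the paper's constants), and computes the Kantorovich term with a generic linear-program solver.

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich(p, q, d):
    """Kantorovich (Wasserstein-1) distance between distributions p and q
    over n states with ground metric d (n x n), solved as a linear program."""
    n = len(p)
    c = d.flatten()                      # cost of moving mass from i to j
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0  # row sums of the flow equal p
        A_eq[n + i, i::n] = 1.0           # column sums of the flow equal q
    b_eq = np.concatenate([p, q])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

def bisimulation_metric(R, P, gamma, iters=50):
    """Iterate a bisimulation-style metric operator toward its fixed point.
    R: (S, A) rewards; P: (S, A, S) transition probabilities.
    The operator is a gamma-contraction, so a few dozen iterations suffice
    for illustration."""
    S, A = R.shape
    d = np.zeros((S, S))
    for _ in range(iters):
        d_new = np.zeros((S, S))
        for s in range(S):
            for t in range(s + 1, S):
                vals = [(1 - gamma) * abs(R[s, a] - R[t, a])
                        + gamma * kantorovich(P[s, a], P[t, a], d)
                        for a in range(A)]
                d_new[s, t] = d_new[t, s] = max(vals)
        d = d_new
    return d
```

- For a toy MDP with a handful of states this runs in well under a second; pairs of states at distance zero are bisimilar and are the ones a state-aggregation scheme would merge.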
2001
- (Baxter & Bartlett, 2001) ⇒ Jonathan Baxter, and Peter L. Bartlett. (2001). “Infinite-horizon Policy-gradient Estimation.” Journal of Artificial Intelligence Research 15
- ABSTRACT: Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by (Kimura et al. 1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter β (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter β is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple-agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter et al., this volume) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
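- As a rough illustration of the kind of update this abstract describes, the sketch below accumulates a β-discounted eligibility trace of policy log-likelihood gradients along a single simulated trajectory and averages reward-weighted traces into a gradient estimate. The simulator and policy callables (env_reset, env_step, policy_sample, policy_grad_log) are hypothetical placeholders, not part of the paper, and this is not the authors' implementation.

```python
import numpy as np

def gpomdp_gradient(env_reset, env_step, policy_sample, policy_grad_log,
                    theta, beta=0.9, T=100_000):
    """Single-trajectory GPOMDP-style estimate of the gradient of the
    average reward with respect to the policy parameters theta.

    Hypothetical callables (assumptions, not from the paper):
      env_reset() -> first observation
      env_step(action) -> (next observation, reward)
      policy_sample(theta, obs) -> action drawn from the stochastic policy
      policy_grad_log(theta, obs, action) -> grad of log pi(action | obs)
    """
    z = np.zeros_like(theta)         # beta-discounted eligibility trace
    grad_est = np.zeros_like(theta)  # running average of reward-weighted traces
    # Only two parameter-sized vectors are stored, matching the abstract's
    # claim of storage twice the number of policy parameters.
    obs = env_reset()
    for t in range(T):
        action = policy_sample(theta, obs)
        z = beta * z + policy_grad_log(theta, obs, action)
        obs, reward = env_step(action)
        grad_est += (reward * z - grad_est) / (t + 1)
    return grad_est
```

- A larger β reduces the bias of the estimate but increases its variance, which is the trade-off the abstract refers to.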