Markov Decision Process
A Markov Decision Process is a discrete-time stochastic decision process defined over a finite state space and a finite action space.
- Context:
- It can be represented by a 4-tuple [math]\displaystyle{ (\mathcal{S},\mathcal{A},P_{()},R_{()}) }[/math], where
- [math]\displaystyle{ \mathcal{S} }[/math] is a finite set of states,
- [math]\displaystyle{ \mathcal{A} }[/math] is a finite set of actions (alternatively, [math]\displaystyle{ A_s }[/math] is the finite set of actions available from state [math]\displaystyle{ s }[/math]),
- [math]\displaystyle{ P_a(s,s') = \Pr(s_{t+1}=s' \mid s_t = s, a_t=a) }[/math] is the probability that action [math]\displaystyle{ a }[/math] in state [math]\displaystyle{ s }[/math] at time [math]\displaystyle{ t }[/math] will lead to state [math]\displaystyle{ s' }[/math] at time [math]\displaystyle{ t+1 }[/math],
- [math]\displaystyle{ R_a(s,s') }[/math] is the immediate reward (or expected immediate reward) received after transition to state [math]\displaystyle{ s' }[/math] from state [math]\displaystyle{ s }[/math] with transition probability [math]\displaystyle{ P_a(s,s') }[/math] (a minimal code sketch of this tuple appears just before the References section below).
- It can range from a Fully Observable Markov Decision Process to being a Partially Observable Markov Decision Process.
- It can range from a Finite Markov Decision Process to being an Infinite Markov Decision Process.
- It can be solved by a Markov Decision Process Solving System (that implements a Markov Decision Process solving algorithm).
- Example(s):
- a Markovian Arrival Process.
- an Optimal Dispatch Policy?
- a Policy Iteration Algorithm (Howard, 1960), for communicating MDPs.
- a Stochastic Shortest Path Problem, under the assumption that there is an absorbing state and a proper strategy that reaches it.
- …
- Counter-Example(s):
- a Hidden Markov Model, because it is not a Decision Process.
- See: Decision Process, Markov Principle, Markov Field, Continuous Markov Decision Process, Q-Learning.
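As a concrete, purely illustrative reading of the Context above, the 4-tuple can be written down directly as a small data structure. The following Python sketch assumes an invented two-state, two-action MDP; the specific states, actions, probabilities, and rewards are hypothetical and are not drawn from any of the references below:
```python
# A minimal sketch of an MDP written out as the 4-tuple (S, A, P, R).
# All states, actions, probabilities, and rewards are hypothetical and
# exist only to illustrate the shape of the tuple.

S = ["s0", "s1"]          # finite set of states
A = ["stay", "move"]      # finite set of actions

# P[a][(s, s_next)] = Pr(s_{t+1} = s_next | s_t = s, a_t = a)
P = {
    "stay": {("s0", "s0"): 1.0, ("s1", "s1"): 1.0},
    "move": {("s0", "s1"): 0.9, ("s0", "s0"): 0.1,
             ("s1", "s0"): 0.9, ("s1", "s1"): 0.1},
}

# R[a][(s, s_next)] = immediate reward for reaching s_next from s under a
R = {
    "stay": {("s0", "s0"): 0.0, ("s1", "s1"): 1.0},
    "move": {("s0", "s1"): 5.0, ("s0", "s0"): 0.0,
             ("s1", "s0"): -1.0, ("s1", "s1"): 1.0},
}

mdp = (S, A, P, R)        # the 4-tuple (S, A, P_a(s,s'), R_a(s,s'))
```
Representing P and R as per-action dictionaries keyed by (s, s') pairs mirrors the [math]\displaystyle{ P_a(s,s') }[/math] and [math]\displaystyle{ R_a(s,s') }[/math] notation directly; omitted pairs are read as zero-probability transitions.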
References
2015
- (Bäuerle & Riess, 2015) ⇒ Nicole Bäuerle, and Viola Riess. (2015). “On Markov Decision Processes.” In: SIAM News Journal, June 01, 2015.
- QUOTE: To obtain a tractable problem, it is often assumed that the transition law of the underlying state process is Markovian, i.e., that only the current state has an influence on future states. Such a situation leads to a Markov decision process (MDP); textbooks on MDPs include [1, 3, 5, 7]. MDPs differ from general stochastic control problems in that the actions are taken at discrete time points, rather than continuously. Stochastic shortest-path problems, consumption and investment of money, allocation of resources, production planning, and harvesting problems are a few examples of MDPs. The formulation and development of MDPs started in the 1950s with Shapley, Bellman, and, later, Howard, Dubins, Savage, and Blackwell. The early achievements are closely related to the study of stochastic dynamic games. Subtle mathematical problems in the theory include measurability issues with arbitrary Borel state spaces, which naturally arise in, for example, partially observable Markov decision processes.
2014
- http://en.wikipedia.org/wiki/Markov_decision_process
- Markov decision processes (MDPs), named after Andrey Markov, provide a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying a wide range of optimization problems solved via dynamic programming and reinforcement learning. MDPs were known at least as early as the 1950s (cf. Bellman 1957). Much research in the area was spawned due to Ronald A. Howard's book, Dynamic Programming and Markov Processes, in 1960. Today they are used in a variety of areas, including robotics, automated control, economics and manufacturing.
More precisely, a Markov Decision Process is a discrete time stochastic control process. At each time step, the process is in some state [math]\displaystyle{ s }[/math], and the decision maker may choose any action [math]\displaystyle{ a }[/math] that is available in state [math]\displaystyle{ s }[/math]. The process responds at the next time step by randomly moving into a new state [math]\displaystyle{ s' }[/math], and giving the decision maker a corresponding reward [math]\displaystyle{ R_a(s,s') }[/math].
The probability that the process moves into its new state [math]\displaystyle{ s' }[/math] is influenced by the chosen action. Specifically, it is given by the state transition function [math]\displaystyle{ P_a(s,s') }[/math]. Thus, the next state [math]\displaystyle{ s' }[/math] depends on the current state [math]\displaystyle{ s }[/math] and the decision maker's action [math]\displaystyle{ a }[/math]. But given [math]\displaystyle{ s }[/math] and [math]\displaystyle{ a }[/math], it is conditionally independent of all previous states and actions; in other words, the state transitions of an MDP possess the Markov property.
Markov decision processes are an extension of Markov chains; the difference is the addition of actions (allowing choice) and rewards (giving motivation). Conversely, if only one action exists for each state and all rewards are zero, a Markov decision process reduces to a Markov chain.
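The dynamics described in the quoted passage (observe state [math]\displaystyle{ s }[/math], choose action [math]\displaystyle{ a }[/math], sample the next state from [math]\displaystyle{ P_a(s,\cdot) }[/math], collect [math]\displaystyle{ R_a(s,s') }[/math]) can be illustrated with a short simulation. This is only a sketch, reusing the hypothetical S, A, P, R dictionaries from the earlier sketch; it is not code from the quoted article:
```python
import random

def step(P, R, s, a):
    """Sample s_{t+1} ~ P_a(s, .) and return (next_state, reward).

    Only the current state s and the chosen action a are consulted,
    which is exactly the Markov property described in the quote above.
    """
    successors = [(s2, p) for (s1, s2), p in P[a].items() if s1 == s]
    states, probs = zip(*successors)
    s_next = random.choices(states, weights=probs, k=1)[0]
    return s_next, R[a][(s, s_next)]

def rollout(P, R, policy, s0, horizon=10):
    """Follow a deterministic policy (a dict state -> action) for `horizon` steps."""
    s, trajectory = s0, []
    for _ in range(horizon):
        a = policy[s]
        s_next, r = step(P, R, s, a)
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory

# Hypothetical policy: always 'move' out of s0, always 'stay' in s1
# (P and R are the dictionaries from the earlier sketch).
policy = {"s0": "move", "s1": "stay"}
print(rollout(P, R, policy, "s0", horizon=5))
```
Setting the policy to a single action per state and all rewards to zero would reduce the rollout to a plain Markov chain, matching the reduction noted in the quote.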
2013
- http://en.wikipedia.org/wiki/Markov_decision_process#Definition
- A Markov decision process is a 4-tuple [math]\displaystyle{ (S,A,P_\cdot(\cdot,\cdot),R_\cdot(\cdot,\cdot)) }[/math], where
- [math]\displaystyle{ S }[/math] is a finite set of states,
- [math]\displaystyle{ A }[/math] is a finite set of actions (alternatively, [math]\displaystyle{ A_s }[/math] is the finite set of actions available from state [math]\displaystyle{ s }[/math]),
- [math]\displaystyle{ P_a(s,s') = \Pr(s_{t+1}=s' \mid s_t = s, a_t=a) }[/math] is the probability that action [math]\displaystyle{ a }[/math] in state [math]\displaystyle{ s }[/math] at time [math]\displaystyle{ t }[/math] will lead to state [math]\displaystyle{ s' }[/math] at time [math]\displaystyle{ t+1 }[/math],
- [math]\displaystyle{ R_a(s,s') }[/math] is the immediate reward (or expected immediate reward) received after transition to state [math]\displaystyle{ s' }[/math] from state [math]\displaystyle{ s }[/math] with transition probability [math]\displaystyle{ P_a(s,s') }[/math].
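One standard way to solve a finite MDP given by such a 4-tuple is value iteration, the dynamic-programming recursion behind the Bellman optimality equation. The sketch below is a generic, hedged illustration rather than code from the cited page; the discount factor gamma is an extra assumption beyond the 4-tuple, and S, A, P, R are the hypothetical dictionaries from the first sketch:
```python
def value_iteration(S, A, P, R, gamma=0.9, tol=1e-8):
    """Iterate V(s) <- max_a sum_{s'} P_a(s,s') * (R_a(s,s') + gamma * V(s'))."""

    def q_value(s, a, V):
        # Expected immediate reward plus discounted value of the successor state.
        return sum(p * (R[a][(s1, s2)] + gamma * V[s2])
                   for (s1, s2), p in P[a].items() if s1 == s)

    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            best = max(q_value(s, a, V) for a in A)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break

    # Greedy policy extracted from the converged value function.
    policy = {s: max(A, key=lambda a, s=s: q_value(s, a, V)) for s in S}
    return V, policy

# S, A, P, R are the hypothetical dictionaries from the first sketch.
V, policy = value_iteration(S, A, P, R, gamma=0.9)
print(V, policy)
```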
2012
- (Mousam & Kolobov, 2012) ⇒ Mousam, and Andrey Kolobov. (2012). “Planning with Markov Decision Processes: An AI Perspective.” In: Morgan & Claypool Publishers. ISBN: 1608458865, 9781608458868.
- QUOTE: Markov Decision Processes (MDPs) are widely popular in Artificial Intelligence for modeling sequential decision-making scenarios with probabilistic dynamics. They are the framework of choice when designing an intelligent agent that needs to act for long periods of time in an environment where its actions could have uncertain outcomes.
1994
- (Puterman, 1994) ⇒ Martin L. Puterman. (1994). “Markov Decision Processes: Discrete Stochastic Dynamic Programming.” In: John Wiley & Sons, Inc. ISBN: 0471619779.
- QUOTE: … Markov decision process models. It discusses all major research directions in the field, highlights many significant applications of Markov decision process models, and explores numerous important topics that have previously been neglected or given cursory coverage in the literature. Markov Decision Processes focuses primarily on infinite horizon discrete time models and models with discrete state spaces while also examining models with arbitrary state spaces, finite horizon models, and continuous-time discrete state models.