Markov Decision Process

From GM-RKB

A Markov Decision Process is a discrete-time stochastic decision process that has a finite state space and a finite action space.



References

2015

2014

  • http://en.wikipedia.org/wiki/Markov_decision_process
    • Markov decision processes (MDPs), named after Andrey Markov, provide a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying a wide range of optimization problems solved via dynamic programming and reinforcement learning. MDPs were known at least as early as the 1950s (cf. Bellman 1957). Much of the research in the area was spurred by Ronald A. Howard's 1960 book, Dynamic Programming and Markov Processes. Today they are used in a variety of areas, including robotics, automated control, economics and manufacturing.

      More precisely, a Markov Decision Process is a discrete-time stochastic control process. At each time step, the process is in some state [math]\displaystyle{ s }[/math], and the decision maker may choose any action [math]\displaystyle{ a }[/math] that is available in state [math]\displaystyle{ s }[/math]. The process responds at the next time step by randomly moving into a new state [math]\displaystyle{ s' }[/math], and giving the decision maker a corresponding reward [math]\displaystyle{ R_a(s,s') }[/math].

      The probability that the process moves into its new state [math]\displaystyle{ s' }[/math] is influenced by the chosen action. Specifically, it is given by the state transition function [math]\displaystyle{ P_a(s,s') }[/math]. Thus, the next state [math]\displaystyle{ s' }[/math] depends on the current state [math]\displaystyle{ s }[/math] and the decision maker's action [math]\displaystyle{ a }[/math]. But given [math]\displaystyle{ s }[/math] and [math]\displaystyle{ a }[/math], it is conditionally independent of all previous states and actions; in other words, the state transitions of an MDP possess the Markov property.

      Markov decision processes are an extension of Markov chains; the difference is the addition of actions (allowing choice) and rewards (giving motivation). Conversely, if only one action exists for each state and all rewards are zero, a Markov decision process reduces to a Markov chain.
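
The dynamics described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the two-state example, the state and action names, the reward rule, and the helpers `step` and `to_markov_chain` are all hypothetical and do not come from the quoted source.

```python
import random

# Hypothetical two-state MDP used only for illustration.
# P[a][s] maps each next state s' to the probability P_a(s, s').
P = {
    "stay": {"low": {"low": 0.9, "high": 0.1}, "high": {"low": 0.2, "high": 0.8}},
    "move": {"low": {"low": 0.3, "high": 0.7}, "high": {"low": 0.6, "high": 0.4}},
}


def reward(a, s, s_next):
    """R_a(s, s'): assumed reward of 1 for landing in 'high', else 0."""
    return 1.0 if s_next == "high" else 0.0


def step(s, a, rng):
    """Sample s' from P_a(s, .) and return (s', R_a(s, s')).

    Only the current state and action are consulted, never the history,
    which is exactly the Markov property.
    """
    next_states = list(P[a][s])
    weights = [P[a][s][n] for n in next_states]
    s_next = rng.choices(next_states, weights=weights, k=1)[0]
    return s_next, reward(a, s, s_next)


def to_markov_chain(action_for_state):
    """Keep exactly one action per state and ignore rewards.

    What remains is a plain Markov chain: one transition row per state.
    """
    return {s: dict(P[a][s]) for s, a in action_for_state.items()}


rng = random.Random(0)  # seeded only to make runs repeatable
next_state, r = step("low", "move", rng)
chain = to_markov_chain({"low": "move", "high": "stay"})
```

Calling `step` repeatedly with the returned state produces a trajectory, while `to_markov_chain` shows the reduction mentioned above: once each state has a single action and rewards are dropped, only a transition matrix is left.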

2013

  • http://en.wikipedia.org/wiki/Markov_decision_process#Definition
    • A Markov decision process is a 4-tuple [math]\displaystyle{ (S,A,P_\cdot(\cdot,\cdot),R_\cdot(\cdot,\cdot)) }[/math], where
      • [math]\displaystyle{ S }[/math] is a finite set of states,
      • [math]\displaystyle{ A }[/math] is a finite set of actions (alternatively, [math]\displaystyle{ A_s }[/math] is the finite set of actions available from state [math]\displaystyle{ s }[/math]),
      • [math]\displaystyle{ P_a(s,s') = \Pr(s_{t+1}=s' \mid s_t = s, a_t=a) }[/math] is the probability that action [math]\displaystyle{ a }[/math] in state [math]\displaystyle{ s }[/math] at time [math]\displaystyle{ t }[/math] will lead to state [math]\displaystyle{ s' }[/math] at time [math]\displaystyle{ t+1 }[/math],
      • [math]\displaystyle{ R_a(s,s') }[/math] is the immediate reward (or expected immediate reward) received after the transition from state [math]\displaystyle{ s }[/math] to state [math]\displaystyle{ s' }[/math] under action [math]\displaystyle{ a }[/math], a transition that occurs with probability [math]\displaystyle{ P_a(s,s') }[/math].
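
One way to make the 4-tuple concrete is to store it as a small data structure and check that every [math]\displaystyle{ P_a(s,\cdot) }[/math] is a probability distribution over [math]\displaystyle{ S }[/math]. The sketch below is an assumption-laden illustration: the class name `MDP`, its field layout, the convention that missing rewards count as 0, and the example values are all hypothetical choices, not part of the quoted definition.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

State = str
Action = str


@dataclass
class MDP:
    states: FrozenSet[State]      # S
    actions: FrozenSet[Action]    # A
    # transition[(s, a)][s'] = P_a(s, s')
    transition: Dict[Tuple[State, Action], Dict[State, float]]
    # reward[(s, a, s')] = R_a(s, s'); missing entries are assumed to be 0
    reward: Dict[Tuple[State, Action, State], float]

    def validate(self, tol: float = 1e-9) -> None:
        """Raise ValueError unless each P_a(s, .) is a distribution over S."""
        for (s, a), dist in self.transition.items():
            if s not in self.states or a not in self.actions:
                raise ValueError(f"unknown state or action in {(s, a)}")
            if any(n not in self.states or p < 0 for n, p in dist.items()):
                raise ValueError(f"invalid entry in P_{a}({s}, .)")
            if abs(sum(dist.values()) - 1.0) > tol:
                raise ValueError(f"P_{a}({s}, .) does not sum to 1")

    def available_actions(self, s: State) -> FrozenSet[Action]:
        """A_s: the actions that have a transition distribution from s."""
        return frozenset(a for (s0, a) in self.transition if s0 == s)

    def expected_reward(self, s: State, a: Action) -> float:
        """Expected immediate reward: sum over s' of P_a(s, s') * R_a(s, s')."""
        return sum(
            p * self.reward.get((s, a, n), 0.0)
            for n, p in self.transition[(s, a)].items()
        )


# Hypothetical example values.
example = MDP(
    states=frozenset({"s0", "s1"}),
    actions=frozenset({"a"}),
    transition={("s0", "a"): {"s0": 0.5, "s1": 0.5}, ("s1", "a"): {"s1": 1.0}},
    reward={("s0", "a", "s1"): 2.0},
)
example.validate()
```

Keying the transition table by (state, action) pairs also covers the alternative formulation with per-state action sets [math]\displaystyle{ A_s }[/math]: an action is simply unavailable in a state that has no entry for it.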

2012

1994