Sequential Decision-Making Task
A Sequential Decision-Making Task is a Sequence Learning Task that consists of selecting a sequence of actions to accomplish a goal in the future.
- AKA: Sequence Generation Through Actions.
- Context:
- It can be solved by a Sequential Decision-Making System that implements a Sequential Decision-Making Algorithm.
- It can range from being a Goal-Oriented Sequential Decision-Making Task, to being a Trajectory-Oriented Sequential Decision-Making Task, and to being a Reinforcement-Maximizing Sequential Decision-Making Task.
- Example(s):
- Counter-Example(s):
- See: Decision-Making Task, Markov Decision Process, Complex Input Classification Task, Partially Observable Markov Decision Process, Reinforcement Learning.
References
2019
- (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/Sequence_learning#Sequence_learning_problems Retrieved: 2019-02-24.
- Sequence learning problems are used to better understand the different types of sequence learning. There are four basic sequence learning problems: sequence prediction, sequence generation, sequence recognition, and sequential decision making. These “problems” show how sequences are formulated. They show the patterns sequences follow and how these different sequence learning problems are related to each other.
Sequence prediction attempts to predict the next immediate element of a sequence based on all the preceding elements. Sequence generation is basically the same as sequence prediction: an attempt to piece together a sequence one by one the way it naturally occurs. Sequence recognition takes certain criteria and determines whether the sequence is legitimate. Sequential decision making or sequence generation through actions breaks down into three variations: goal-oriented, trajectory-oriented, and reinforcement-maximizing. These three variations all want to pick the action(s) or step(s) that will lead to the goal in the future. These sequence learning problems reflect hierarchical organization of plans because each element in the sequences builds on the previous elements. In a classic experiment published in 1967, Alfred L. Yarbus demonstrated that though subjects viewing portraits reported apprehending the portrait as a whole, their eye movements successively fixated on the most informative parts of the image. These observations suggest that underlying an apparently parallel process of face perception, a serial oculomotor process is concealed. [1] It is a common observation that when a skill is being acquired, we are more attentive in the initial phase, but after repeated practice, the skill becomes nearly automatic; [2] this is also known as unconscious competence. We can then concentrate on learning a new action while performing previously learned actions skillfully. Thus, it appears that a neural code or representation for the learned skill is created in our brain, which is usually called procedural memory. The procedural memory encodes procedures or algorithms rather than facts.
- ↑ Yarbus, Alfred L., "Eye movements during perception of complex objects", Yarbus, Alfred L., tr. Basil Haigh, ed. Lorrin A. Riggs, Eye Movements and Vision, New York: Plenum, 1967, OCLC 220267263, ch. 7, pp. 171–96.
- ↑ Fitts, P. M., "Perceptual motor skill learning", in Arthur W. Melton (ed.), Categories of Human Learning, New York: Academic Press, 1964, OCLC 180195, pp. 243–85.
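The four sequence learning problems quoted above can be summarized informally as input-output mappings. The following is a minimal sketch using Python type signatures; the names are illustrative only and do not come from the quoted source.
```python
from typing import Callable, Sequence, TypeVar

E = TypeVar("E")  # a sequence element (e.g., a state or symbol)
A = TypeVar("A")  # an action

# Sequence prediction: from the preceding elements, predict the next element.
SequencePrediction = Callable[[Sequence[E]], E]

# Sequence generation: the same mapping, applied repeatedly to extend a
# sequence one element at a time.
SequenceGeneration = Callable[[Sequence[E]], E]

# Sequence recognition: decide whether a given sequence is legitimate.
SequenceRecognition = Callable[[Sequence[E]], bool]

# Sequential decision making: from the preceding elements (and possibly a goal
# or desired next state), select the action to take now.
SequentialDecisionMaking = Callable[[Sequence[E]], A]
```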
2001
- (Sun, 2001) ⇒ Ron Sun. (2001). “Introduction to Sequence Learning.” In: Sequence Learning - Paradigms, Algorithms, and Applications. ISBN:978-3-540-44565-4 doi:10.1007/3-540-44565-X_1
- QUOTE: (...) sequential decision making involves selecting a sequence of actions, to accomplish a goal, to follow a trajectory, or to maximize (or minimize) a reinforcement (or cost) function that is normally the (discounted) sum of reinforcements (costs) that are received along the way (see Bellman 1957, Bertsekas and Tsitsiklis 1995) (...)
- Sequential decision making (that is, sequence generation through actions): there are several possible variations. In the goal-oriented case, we have [math]\displaystyle{ s_i,s_{i+1}, \cdots, s_j; s_G \to a_j }[/math], where [math]\displaystyle{ 1\lt i\lt j\lt \infty }[/math]; that is, given [math]\displaystyle{ s_i,s_{i+1}, \cdots, s_j }[/math] and the goal state [math]\displaystyle{ s_G }[/math], we want to choose an action [math]\displaystyle{ a_j }[/math] at time step [math]\displaystyle{ j }[/math] that will likely lead to [math]\displaystyle{ s_G }[/math] in the future. In the trajectory-oriented case, we have [math]\displaystyle{ s_i,s_{i+1}, \cdots, s_j; s_{j+1} \to a_j }[/math], where [math]\displaystyle{ 1\lt i\lt j\lt \infty }[/math]; that is, given [math]\displaystyle{ s_i,s_{i+1}, \cdots, s_j }[/math] and the desired next state [math]\displaystyle{ s_{j+1} }[/math], we want to choose an action [math]\displaystyle{ a_j }[/math] at time step [math]\displaystyle{ j }[/math] that will likely lead to [math]\displaystyle{ s_{j+1} }[/math] in the next step. In the reinforcement-maximizing case, we have [math]\displaystyle{ s_i,s_{i+1}, \cdots, s_j \to a_j }[/math], where [math]\displaystyle{ 1\lt i\lt j\lt \infty }[/math]; that is, given [math]\displaystyle{ s_i,s_{i+1}, \cdots, s_j }[/math] we want to choose an action [math]\displaystyle{ a_j }[/math] at time step [math]\displaystyle{ j }[/math] that will likely lead to receiving maximum total reinforcement in the future. The calculation of total reinforcement can be in terms of discounted or undiscounted cumulative reinforcement, in terms of average reinforcement, or in terms of some other functions of reinforcement (Bertsekas and Tsitsiklis 1996, Kaelbling et al. 1996, Sutton and Barto 1997).
The action selection can be based on the immediately preceding element (i.e., [math]\displaystyle{ i=j }[/math]), in which case a Markovian action policy is in force; otherwise, a non-Markovian action policy is involved (McCallum 1996, Sun and Sessions 2000). Yet another possibility is that the action decisions are not based on preceding elements at all (i.e., [math]\displaystyle{ s_i,s_{i+1}, \cdots, s_j }[/math] are irrelevant), in which case an open-loop policy is used (Sun and Sessions 1998), while in the previous two cases, closed-loop policies are involved, as will be discussed later.
Note that on this view, sequence generation/prediction can be viewed as a special case of sequential decision making.
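Sun's three variations can be illustrated as action-selection routines. The sketch below is a hypothetical rendering of the formulas above; the helper callables (reach_prob, step_prob, q_value) stand in for a learned model or value function and are not part of Sun (2001).
```python
from typing import Callable, Sequence, TypeVar

S = TypeVar("S")  # a state s_i
A = TypeVar("A")  # an action a_j

def goal_oriented_action(history: Sequence[S], goal: S, actions: Sequence[A],
                         reach_prob: Callable[[Sequence[S], A, S], float]) -> A:
    """Choose the action a_j judged most likely to eventually reach the goal state s_G."""
    return max(actions, key=lambda a: reach_prob(history, a, goal))

def trajectory_oriented_action(history: Sequence[S], next_state: S, actions: Sequence[A],
                               step_prob: Callable[[S, A, S], float]) -> A:
    """Choose the action a_j judged most likely to produce the desired next state s_{j+1}."""
    return max(actions, key=lambda a: step_prob(history[-1], a, next_state))

def reinforcement_maximizing_action(history: Sequence[S], actions: Sequence[A],
                                    q_value: Callable[[Sequence[S], A], float]) -> A:
    """Choose the action a_j with the highest estimated total future reinforcement."""
    return max(actions, key=lambda a: q_value(history, a))
```
In this rendering, a Markovian policy would let the helpers depend only on history[-1], a non-Markovian policy on the longer history, and an open-loop policy would ignore the state history altogether.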
1996
- (Littman, 1996) ⇒ Michael L. Littman. (1996). “Algorithms for Sequential Decision Making.” PhD diss., Brown University.
- ABSTRACT: Sequential decision making is a fundamental task faced by any intelligent agent in an extended interaction with its environment; it is the act of answering the question "What should I do now?". In this thesis, I show how to answer this question when "now" is one of a finite set of states, "do" is one of a finite set of actions, "should" is maximize a long-run measure of reward, and "I" is an automated planning or learning system (agent). In particular, I collect basic results concerning methods for finding optimal (or near-optimal) behavior in several different kinds of model environments: Markov decision processes, in which the agent always knows its state; partially observable Markov decision processes (POMDPs), in which the agent must piece together its state on the basis of observations it makes; and Markov games, in which the agent is in direct competition with an opponent. The thesis is written from a computer science perspective, meaning that many mathematical details are not discussed, and descriptions of algorithms and the complexity of problems are emphasized. New results include an improved algorithm for solving POMDPs exactly over finite horizons, a method for learning minimax-optimal policies for Markov games, a pseudopolynomial bound for policy iteration, and a complete complexity theory for finding zero-reward POMDP policies.
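For the fully observable setting mentioned in the abstract (Markov decision processes), optimal long-run behavior can be computed by standard dynamic programming. The following is a textbook value-iteration sketch, not code from the thesis; the array shapes and tolerance are assumptions of this example.
```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Value iteration for a finite MDP.

    P: transition probabilities, shape (n_actions, n_states, n_states)
    R: expected immediate rewards, shape (n_actions, n_states)
    Returns the optimal state values V and a greedy policy.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] * V[s']
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=0)  # action maximizing long-run discounted reward in each state
    return V, policy
```
The partially observable and game-theoretic cases treated in the thesis (POMDPs and Markov games) require more elaborate machinery and are not covered by this sketch.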