Thompson Sampling Algorithm
A Thompson Sampling Algorithm is a heuristic sampling algorithm that addresses the exploration-exploitation dilemma in a multi-armed bandit task by choosing, in each round, the action that maximizes the expected reward with respect to a belief drawn at random from the current posterior distribution.
- Example(s): Thompson Sampling with Bernoulli rewards and Beta priors (a Beta-Bernoulli bandit), as sketched below.
- See: Multi-Armed Bandit, Sampling Algorithm.
References
2017
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Thompson_sampling Retrieved:2017-4-24.
- In artificial intelligence, Thompson sampling, named after William R. Thompson, is a heuristic for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. It consists in choosing the action that maximizes the expected reward with respect to a randomly drawn belief.
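The following is a minimal illustrative sketch (not taken from the references on this page) of this heuristic for the classic case of a Bernoulli multi-armed bandit with an independent Beta(1, 1) prior on each arm's mean reward; the arm probabilities in true_means and the horizon are purely illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.10, 0.20, 0.35]        # illustrative Bernoulli reward probabilities
successes = np.ones(len(true_means))   # Beta alpha parameters, starting from a Beta(1, 1) prior
failures = np.ones(len(true_means))    # Beta beta parameters

for t in range(10_000):
    # Draw one belief per arm from its Beta posterior, then act greedily on the draw.
    sampled_means = rng.beta(successes, failures)
    arm = int(np.argmax(sampled_means))
    reward = rng.random() < true_means[arm]    # Bernoulli reward
    # Conjugate update: Beta(alpha, beta) -> Beta(alpha + r, beta + 1 - r).
    successes[arm] += reward
    failures[arm] += 1 - reward

print(successes / (successes + failures))      # posterior mean estimate of each arm's reward
```

Because a randomly drawn belief overestimates a poorly explored arm with non-negligible probability, the rule keeps trying such arms until their posteriors concentrate, while increasingly exploiting the arm with the highest estimated reward.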
2017
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Thompson_sampling#Description Retrieved:2017-4-24.
- Consider a set of contexts [math]\displaystyle{ \mathcal{X} }[/math], a set of actions [math]\displaystyle{ \mathcal{A} }[/math], and rewards in [math]\displaystyle{ \mathbb{R} }[/math]. In each round, the player obtains a context [math]\displaystyle{ x \in \mathcal{X} }[/math], plays an action [math]\displaystyle{ a \in \mathcal{A} }[/math] and receives a reward [math]\displaystyle{ r \in \mathbb{R} }[/math] drawn from a distribution that depends on the context and the chosen action. The aim of the player is to play actions so as to maximize the cumulative reward. For example, in news recommendation the context describes the current user, the actions are the candidate articles, and the reward is 1 if the user clicks and 0 otherwise.
The elements of Thompson sampling are as follows:
- a likelihood function [math]\displaystyle{ P(r|\theta,a,x) }[/math] ;
- a set [math]\displaystyle{ \Theta }[/math] of parameters [math]\displaystyle{ \theta }[/math] of the distribution of [math]\displaystyle{ r }[/math] ;
- a prior distribution [math]\displaystyle{ P(\theta) }[/math] on these parameters;
- a set of past observation triplets [math]\displaystyle{ \mathcal{D} = \{(x; a; r)\} }[/math];
- a posterior distribution [math]\displaystyle{ P(\theta|\mathcal{D}) \propto P(\mathcal{D}|\theta)P(\theta) }[/math] , where [math]\displaystyle{ P(\mathcal{D}|\theta) }[/math] is the likelihood function.
- Thompson sampling consists in playing the action [math]\displaystyle{ a^\ast \in \mathcal{A} }[/math] according to the probability that it maximizes the expected reward, i.e. : [math]\displaystyle{ \int \mathbb{I}[\mathbb{E}(r|a^\ast,x,\theta) = \max_{a'} \mathbb{E}(r|a',x,\theta)] P(\theta|\mathcal{D}) \, d\theta, }[/math] where [math]\displaystyle{ \mathbb{I} }[/math] is the indicator function.
In practice, the rule is implemented by sampling, in each round, a parameter [math]\displaystyle{ \theta^\ast }[/math] from the posterior [math]\displaystyle{ P(\theta|\mathcal{D}) }[/math], and choosing the action [math]\displaystyle{ a^\ast }[/math] that maximizes [math]\displaystyle{ \mathbb{E}[r|\theta^\ast,a^\ast,x] }[/math], i.e. the expected reward given the sampled parameter, the action and the current context. Conceptually, this means that the player instantiates their beliefs randomly in each round and then acts optimally according to them; a minimal sketch of this rule follows.
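The following is a minimal sketch of this sample-then-maximize rule, assuming a two-armed Bernoulli bandit with a singleton context set (so the context can be dropped); the posterior [math]\displaystyle{ P(\theta|\mathcal{D}) \propto P(\mathcal{D}|\theta)P(\theta) }[/math] is maintained explicitly on a discrete grid of parameter values, and the grid resolution, horizon, and true_means are illustrative choices rather than part of the cited description.

```python
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(0.01, 0.99, 99)                 # candidate values for each arm's mean reward
P0, P1 = np.meshgrid(grid, grid, indexing="ij")    # theta = (p0, p1): joint parameter grid
log_posterior = np.zeros_like(P0)                  # uniform prior P(theta), kept in log space
true_means = [0.3, 0.6]                            # illustrative ground truth, unknown to the player

for t in range(2_000):
    # Sample theta* from the current posterior P(theta | D).
    posterior = np.exp(log_posterior - log_posterior.max())
    posterior /= posterior.sum()
    idx = rng.choice(posterior.size, p=posterior.ravel())
    p0_star, p1_star = P0.ravel()[idx], P1.ravel()[idx]

    # Play the action a* that maximizes E[r | theta*, a]; here that expectation is the sampled mean.
    arm = 0 if p0_star > p1_star else 1
    reward = rng.random() < true_means[arm]

    # Bayes update: log P(theta | D) += log P(r | theta, a).
    p_arm = P0 if arm == 0 else P1
    log_posterior += np.log(p_arm if reward else 1.0 - p_arm)
```

With Bernoulli rewards the same rule admits a closed-form implementation via the conjugate Beta posterior, as in the earlier sketch; the grid version is shown only to make the generic posterior update and the sampling of [math]\displaystyle{ \theta^\ast }[/math] explicit.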
2012
- (Kaufmann et al., 2012) ⇒ Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. (2012). “Thompson Sampling: An Asymptotically Optimal Finite-time Analysis.” In: Proceedings of the 23rd International Conference on Algorithmic Learning Theory. ISBN:978-3-642-34105-2 doi:10.1007/978-3-642-34106-9_18
- QUOTE: The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret.
2011
- (Chapelle & Li, 2011) ⇒ Olivier Chapelle, and Lihong Li. (2011). “An Empirical Evaluation of Thompson Sampling.” In: Proceedings of the 24th International Conference on Neural Information Processing Systems. ISBN:978-1-61839-599-3
- ABSTRACT: Thompson sampling is one of oldest heuristic to address the exploration/exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against.