Double Thompson Sampling Algorithm
A Double Thompson Sampling Algorithm is a multi-armed bandit algorithm that extends the Thompson Sampling strategy by maintaining two separate probability models for each action, aimed at reducing variance in the action selection process and improving exploration efficiency.
- Context:
- It can (typically) be applied in online learning and decision-making processes where the goal is to balance exploration of new actions with exploitation of known rewards.
- It can (often) use Bayesian updating to adjust the probability models based on the observed outcomes of actions, allowing for a dynamic and probabilistic approach to action selection.
- It can provide a more robust exploration mechanism by using one model to propose the action to explore and a second, independent model to evaluate it, thus mitigating the risk of over-exploiting suboptimal actions due to sampling noise (see the illustrative sketch under Example(s) below).
- It can be particularly effective in environments with high uncertainty and where the cost of exploration is significant, as it aims to make more informed decisions about which actions to explore.
- It can enhance the performance of reinforcement learning algorithms by integrating a double-sampling technique into the action selection process, thereby improving the efficiency of the exploration-exploitation trade-off.
- ...
- Example(s):
- ...
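- A minimal Python sketch (illustrative only) of the two-phase sampling idea in the dueling-bandit setting of (Wu & Liu, 2016): one posterior sample selects the first arm of a duel, and an independent second sample evaluates the remaining arms against it. The sketch assumes Beta-Bernoulli posteriors over pairwise preferences and omits the confidence-interval pruning of the published algorithm, so it should be read as a hedged sketch rather than a reference implementation.

```python
import numpy as np

class DoubleThompsonSamplingDueling:
    """Illustrative sketch of Double Thompson Sampling for dueling bandits.

    Follows the two-phase sampling structure of Wu & Liu (2016) but omits
    the confidence-interval (RUCB-style) pruning of the published algorithm.
    """

    def __init__(self, n_arms, seed=None):
        self.n_arms = n_arms
        self.rng = np.random.default_rng(seed)
        # wins[i, j] = number of duels in which arm i beat arm j
        self.wins = np.zeros((n_arms, n_arms))

    def _sample_preferences(self):
        # One independent Beta posterior sample per pairwise preference P(i beats j)
        theta = self.rng.beta(self.wins + 1.0, self.wins.T + 1.0)
        np.fill_diagonal(theta, 0.5)
        return theta

    def select_duel(self):
        # Phase 1: the first posterior sample picks the first arm by Copeland score
        theta1 = self._sample_preferences()
        copeland = (theta1 > 0.5).sum(axis=1)
        first = int(self.rng.choice(np.flatnonzero(copeland == copeland.max())))

        # Phase 2: an independent second sample evaluates challengers against the first arm
        theta2 = self._sample_preferences()
        challengers = np.delete(np.arange(self.n_arms), first)
        second = int(challengers[np.argmax(theta2[challengers, first])])
        return first, second

    def update(self, winner, loser):
        # Bayesian update: record the duel outcome, shifting the Beta posteriors
        self.wins[winner, loser] += 1
```

- Usage: call select_duel() to obtain a pair of arms to compare, observe which arm wins the duel, and call update(winner, loser); repeating this loop concentrates the pairwise posteriors on the preferred arms.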
- Counter-Example(s):
- A simple Thompson Sampling algorithm that uses a single probability model for action selection.
- A greedy algorithm in multi-armed bandit problems that always selects the action with the highest estimated reward without exploration.
- See: Bayesian Updating, Online Learning, Exploration-Exploitation Trade-Off, Reinforcement Learning, Thompson Sampling, Linear Contextual Bandits, Context-Attentive Thompson Sampling (CATS).
References
2024
- (Dwaracherla et al., 2024) ⇒ Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, and Benjamin Van Roy. (2024). “Efficient Exploration for LLMs.” doi:10.48550/arXiv.2402.00396
- NOTE: It compares passive exploration with several active exploration algorithms, highlighting the effectiveness of double Thompson sampling with an epistemic neural network.
2023
- (Kim et al., 2023) ⇒ W Kim, K Lee, and MC Paik. (2023). “Double Doubly Robust Thompson Sampling for Generalized Linear Contextual Bandits.” In: Proceedings of the AAAI Conference on Artificial Intelligence. [1](http://ojs.aaai.org/index.php/AAAI/article/view/21556)
- NOTE: It presents the double doubly robust Thompson sampling algorithm for generalized linear contextual bandits, extending doubly robust (DR) Thompson sampling with enhanced performance in uncertain environments.
2021
- (Bouneffouf et al., 2021) ⇒ D Bouneffouf, R Féraud, S Upadhyay, et al. (2021). “Double-Linear Thompson Sampling for Context-Attentive Bandits.” In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). [2](https://ieeexplore.ieee.org/document/9414864)
2016
- (Wu & Liu, 2016) ⇒ H Wu, and X Liu. (2016). “Double Thompson Sampling for Dueling Bandits.” In: Advances in Neural Information Processing Systems. [3](https://proceedings.neurips.cc/paper/2016/hash/1a9a3129e7c3c8334b75339a2f10c9e9-Abstract.html)