Upper Confidence Bound (UCB) Algorithm
An Upper Confidence Bound (UCB) Algorithm is a multi-armed bandit algorithm that balances exploration and exploitation by estimating the mean reward of each arm and choosing the arm with the highest upper confidence bound, computed from the estimated mean and its uncertainty.
- Context:
  - It can be composed of steps such as (see the Python sketch below):
    - Initialize the estimated reward and uncertainty for each action.
    - Select the action with the highest upper confidence bound.
    - Observe the reward for the selected action and update the estimated reward and uncertainty accordingly.
    - Repeat steps 2 and 3 until the desired number of time steps has elapsed.
  - It can range from being an Exact k-Armed Bandit Algorithm to being an Approximate k-Armed Bandit Algorithm.
  - ...
- Example(s):
  - UCB1 Algorithm.
  - GP-BUCB Algorithm (Desautels et al., 2014).
  - ...
- Counter-Example(s):
  - an Epsilon-Greedy Algorithm.
  - a Thompson Sampling Algorithm.
- See: Optimization Algorithm.
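Below is a minimal, illustrative Python sketch of the steps listed above, in the style of UCB1. The arm count, exploration constant c, and the Bernoulli reward probabilities are assumptions made for illustration, not taken from any referenced source.

```python
import math
import random

class UCB1:
    """Minimal UCB sketch: initialize, select, observe, update, repeat."""

    def __init__(self, n_arms, c=2.0):
        self.c = c                    # exploration constant (assumed value)
        self.counts = [0] * n_arms    # N(a): times each arm was chosen
        self.values = [0.0] * n_arms  # Q(a): estimated mean reward per arm
        self.t = 0                    # current time step

    def select_arm(self):
        self.t += 1
        # Play every arm once before applying the UCB formula.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        # UCB(a) = Q(a) + sqrt(c * ln(t) / N(a))
        ucb = [q + math.sqrt(self.c * math.log(self.t) / n)
               for q, n in zip(self.values, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the running mean Q(a).
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Usage on three hypothetical Bernoulli arms (success probabilities assumed):
probs = [0.2, 0.5, 0.8]
bandit = UCB1(n_arms=3)
for _ in range(1000):
    arm = bandit.select_arm()
    reward = 1.0 if random.random() < probs[arm] else 0.0
    bandit.update(arm, reward)
print(bandit.counts)  # pulls should concentrate on the 0.8 arm
```

Over many rounds the pull counts concentrate on the best arm, while the ln(t)/N(a) term guarantees every arm is still sampled occasionally.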
References
2023
- GBard
- The Upper Confidence Bound (UCB) algorithm is a commonly used algorithm in machine learning. It works by estimating the mean reward of each arm and choosing the arm with the highest upper confidence bound, which combines the estimated mean with an uncertainty term.
- Multi-Armed Bandit Problem: The UCB algorithm is used to solve the multi-armed bandit problem (MAB problem). The MAB problem is a sequential decision-making problem where an agent has a set of actions (arms) to choose from, and each arm has a different reward distribution. The agent's goal is to maximize its cumulative reward over time.
- UCB Algorithm: The UCB algorithm works by maintaining an estimate of the mean reward for each arm. At each time step, the algorithm chooses the arm with the highest upper confidence bound. The upper confidence bound is calculated as follows:
- UCB(a) = Q(a) + sqrt(c * ln(t) / N(a))
- where:
  - a is the arm,
  - Q(a) is the estimated mean reward for arm a,
  - c is a constant that controls the trade-off between exploration and exploitation,
  - t is the current time step,
  - N(a) is the number of times arm a has been chosen.
The UCB algorithm balances exploration and exploitation by choosing arms with both high estimated rewards and high uncertainty. By choosing arms with high uncertainty, the algorithm is able to learn more about the reward distributions of the arms.
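As an illustrative worked example (with assumed numbers): for c = 2, t = 100, N(a) = 10, and Q(a) = 0.5, the bound is UCB(a) = 0.5 + sqrt(2 · ln(100) / 10) ≈ 0.5 + 0.96 ≈ 1.46, so a rarely pulled arm can outrank an arm with a higher empirical mean that has already been pulled many times.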
2014
- (Desautels et al., 2014) ⇒ Thomas Desautels, Andreas Krause, and Joel W. Burdick. (2014). “Parallelizing Exploration-exploitation Tradeoffs in Gaussian Process Bandit Optimization.” In: The Journal of Machine Learning Research, 15(1).
- QUOTE: ... We introduce Gaussian Process Batch Upper Confidence Bound (GP-BUCB), an upper confidence bound-based algorithm, which models the reward function as a sample from a Gaussian process and which can select batches of experiments to run in parallel. We prove a general regret bound for GP-BUCB, as well as the surprising result that for some common kernels, the asymptotic average regret can be made independent of the batch size. ...
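As context for the quote, here is a minimal sketch of the sequential GP-UCB selection rule on which GP-BUCB builds (this is not the authors' batch algorithm; the kernel choice, lengthscale, noise level, beta, and data are all assumptions made for illustration):

```python
import numpy as np

def rbf(A, B, lengthscale=0.5):
    # Squared-exponential kernel between rows of A (n, d) and B (m, d).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, candidates, noise=1e-3):
    # Standard GP regression posterior mean and standard deviation.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, candidates)
    mu = Ks.T @ np.linalg.solve(K, y)
    cov = rbf(candidates, candidates) - Ks.T @ np.linalg.solve(K, Ks)
    return mu, np.sqrt(np.clip(np.diag(cov), 0.0, None))

def gp_ucb_select(X, y, candidates, beta=2.0):
    # GP-UCB rule: pick the candidate maximizing mean + beta * std.
    mu, sigma = gp_posterior(X, y, candidates)
    return candidates[np.argmax(mu + beta * sigma)]

# Usage with made-up 1-D observations:
X = np.array([[0.1], [0.4], [0.9]])        # inputs evaluated so far
y = np.array([0.2, 0.7, 0.1])              # their observed rewards
grid = np.linspace(0.0, 1.0, 101)[:, None] # candidate inputs
print(gp_ucb_select(X, y, grid))           # next input to evaluate
```

Roughly, GP-BUCB extends this rule to batches by feeding the posterior mean back in as a hallucinated observation for each pending selection, so the predictive variance shrinks within a batch before any real feedback arrives.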