Contextual Multi-Armed Bandit Task
A Contextual Multi-Armed Bandit Task is a k-armed bandit task in which a context vector is observed before each arm choice, and the arms' reward distributions are not independent of that context.
- Context:
- It can be solved by a Contextual Bandit System (that implements a contextual bandit algorithm).
- Unlike a supervised learning task, it does not provide a Loss Function from which a gradient can be calculated; only the reward of the chosen arm is observed.
- It can range from being a Stateless Contextual Bandit Task to being a Stateful Contextual Bandit Task (i.e., a reinforcement learning task).
- Example(s):
- a Personalized News Article Recommendation Task, in which articles are selected for a user based on user and article context (Li et al., 2010).
- an Ad-Placement Task that uses a user's buying history as context when choosing which promotion to show.
- Counter-Example(s):
- a (context-free) Multi-Armed Bandit Task, in which no context vector is observed before choosing an arm.
- See: Item Recommendations, Gaussian Mixture, Reinforcement Learning, Singular-Value Decomposition, Nonparametric Regression.
References
2020
- (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/Multi-armed_bandit#Contextual_bandit Retrieved:2020-3-25.
- QUOTE: A particularly useful version of the multi-armed bandit is the contextual multi-armed bandit problem. In this problem, in each iteration an agent has to choose between arms. Before making the choice, the agent sees a d-dimensional feature vector (context vector), associated with the current iteration. The learner uses these context vectors along with the rewards of the arms played in the past to make the choice of the arm to play in the current iteration. Over time, the learner's aim is to collect enough information about how the context vectors and rewards relate to each other, so that it can predict the next best arm to play by looking at the feature vectors.
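- The loop described above can be simulated in a few lines. The following is an illustrative sketch (not taken from the cited source), assuming a synthetic environment whose arm rewards are linear in the context; the ε-greedy agent and all names (n_arms, true_theta, etc.) are hypothetical:
```python
import numpy as np

rng = np.random.default_rng(0)

n_arms, d, n_rounds = 5, 10, 5000   # hypothetical sizes
epsilon = 0.1

# Hypothetical synthetic environment: each arm's expected reward is a
# linear function of the d-dimensional context vector, plus Gaussian noise.
true_theta = rng.normal(size=(n_arms, d))

# Per-arm ridge-regression statistics: A_a = I + sum x x^T, b_a = sum r x.
A = np.stack([np.eye(d) for _ in range(n_arms)])
b = np.zeros((n_arms, d))

total_reward = 0.0
for t in range(n_rounds):
    x = rng.normal(size=d)                         # observe context vector

    if rng.random() < epsilon:                     # explore uniformly
        arm = int(rng.integers(n_arms))
    else:                                          # exploit current estimates
        theta_hat = np.array([np.linalg.solve(A[a], b[a]) for a in range(n_arms)])
        arm = int(np.argmax(theta_hat @ x))

    # Bandit feedback: only the chosen arm's reward is revealed.
    reward = true_theta[arm] @ x + rng.normal(scale=0.1)
    total_reward += reward

    A[arm] += np.outer(x, x)                       # update only the played arm's model
    b[arm] += reward * x

print(f"average reward over {n_rounds} rounds: {total_reward / n_rounds:.3f}")
```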
2016a
- Bay Area Deep Learning School Day 2 https://youtu.be/9dXiAecyJrY?t=16m2s
- QUOTE: Contextual Bandits provide the agent a little less information than in supervised learning
- Environment samples input [math]\displaystyle{ x_t \sim \rho }[/math]
- Agent takes action [math]\displaystyle{ \hat{y}_t = f(x_t) }[/math]
- Agent receives cost [math]\displaystyle{ c_t \sim P(c_t \mid x_t, \hat{y}_t) }[/math] where P is an unknown probability distribution.
- Environment asks the agent a question, and gives it a noisy score on its answer (a minimal code rendering of this loop is sketched below).
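- A one-round rendering of this sampling loop, mapping the notation above onto code. This is a hypothetical sketch (the distribution ρ, the cost model P, the scorer f, and all names are assumptions, not part of the source):
```python
import numpy as np

rng = np.random.default_rng(1)

def rho(d=4):
    """Environment samples an input x_t ~ rho (here a standard normal, purely illustrative)."""
    return rng.normal(size=d)

def f(x, W):
    """Agent's current policy y_hat_t = f(x_t): a linear scorer over 3 hypothetical actions."""
    return int(np.argmax(W @ x))

def sample_cost(x, y_hat, W_true):
    """Noisy score c_t ~ P(c_t | x_t, y_hat_t); P is unknown to the agent."""
    return float(-(W_true @ x)[y_hat] + rng.normal(scale=0.1))

W_true = rng.normal(size=(3, 4))          # hidden environment parameters
W = rng.normal(size=(3, 4))               # the agent's (untrained) parameters

x_t = rho()                               # environment poses a "question"
y_hat_t = f(x_t, W)                       # agent answers
c_t = sample_cost(x_t, y_hat_t, W_true)   # only the chosen answer's cost is revealed
print(x_t, y_hat_t, c_t)
```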
2016b
- https://getstream.io/blog/introduction-contextual-bandits/
- QUOTE: In most real-life applications, we have access to information that can be used to make a better decision when choosing amongst all actions in a MAB setting; this extra information, or context, is what gives Contextual Bandits their name. For the ad-placement example, having access to historical data about the user’s buying habits can be highly informative of what type of products or promotions they will engage with in the future.
For really large datasets, the highly optimized Contextual Bandit Algorithms in Vowpal Wabbit are the way to go. There are four bandit algorithms that are ready to use in VW:
- ε-greedy: Exploit the best strategy with probability 1-epsilon; keep exploring uniformly over all the other actions with probability epsilon.
- Explore-first: Starts with a phase of pure exploration in the first K trials; after the exploration phase the algorithm exploits the leading strategy.
- Bagging: Different policies are trained using bootstrapping (sampling with replacement).
- Online cover: Explores all available actions by keeping only a small subset of policies active; this approach is fully described in the 2014 paper by Agarwal et al.
- Bandit datasets are highly proprietary and hard to come by; in order to test some of the bandit algorithms that are implemented in VW, we can use the --cbify option to transform any multiclass classification training set into a contextual bandit training set (a pure-Python sketch of this idea follows below).
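- The idea behind --cbify can be illustrated in pure Python. This is not VW's implementation; the function name, the ε-greedy logging policy, and the 0/1 cost encoding are assumptions made for the sketch:
```python
import numpy as np

rng = np.random.default_rng(2)

def cbify_example(features, label, n_classes, epsilon=0.2, predict=None):
    """Turn one multiclass example into contextual-bandit feedback:
    choose an action (epsilon-greedy around an optional predictor),
    reveal only that action's cost (0 if it equals the true label, 1 otherwise),
    and record the probability with which the action was chosen."""
    greedy = int(predict(features)) if predict else int(rng.integers(n_classes))
    explore = rng.random() < epsilon
    action = int(rng.integers(n_classes)) if explore else greedy
    prob = epsilon / n_classes + (1.0 - epsilon) * (action == greedy)
    cost = 0.0 if action == label else 1.0   # costs of the other actions stay hidden
    return features, action, cost, prob

# Hypothetical usage on a toy example.
x, y = np.array([0.3, -1.2, 0.7]), 2
print(cbify_example(x, y, n_classes=4))
```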
2010
- (Li et al., 2010) ⇒ Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. (2010). “A Contextual-bandit Approach to Personalized News Article Recommendation.” In: Proceedings of the 19th International Conference on World Wide Web. doi:10.1145/1772690.1772758
- QUOTE: we model personalized recommendation of news articles as a contextual bandit problem, a principled approach in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its article selection strategy based on user-click feedback to maximize total user clicks.
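- The paper's approach can be sketched as a disjoint LinUCB-style policy. The following is a minimal illustration in that spirit, not the authors' implementation; the class name, dimensions, and α value are assumptions:
```python
import numpy as np

class DisjointLinUCB:
    """Minimal sketch in the spirit of the disjoint LinUCB algorithm of Li et al. (2010).
    Each arm (article) keeps its own ridge-regression statistics; arms are scored by
    estimated expected reward plus an upper-confidence-bound exploration bonus."""

    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = np.stack([np.eye(d) for _ in range(n_arms)])  # A_a = I + sum x x^T
        self.b = np.zeros((n_arms, d))                          # b_a = sum r x

    def select(self, x):
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta = A_inv @ b_a                                 # ridge estimate of the arm's coefficients
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)                           # only the served arm is updated
        self.b[arm] += reward * x

# Hypothetical usage: 10 candidate articles, 6-dimensional user/article feature vector.
policy = DisjointLinUCB(n_arms=10, d=6, alpha=0.5)
x = np.random.default_rng(3).normal(size=6)
arm = policy.select(x)
policy.update(arm, x, reward=1.0)   # e.g., reward 1.0 for a click, 0.0 otherwise
```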