Thompson sampling is not just a heuristic with nice properties; under closer scrutiny, it reveals some interesting aspects of the reinforcement learning problem that have not been analyzed before. Two aspects are particularly interesting: the intimate connection to Bayesian inference (in fact, to adaptive compression) and the intricate relation to causality.
Also check out the MDP and the Belief Flows examples.
If you find this useful, please cite it as: Ortega, P. A., and Braun, D. A. “A minimum relative entropy principle for learning and acting.” Journal of Artificial Intelligence Research 38 (2010): 475-511. Note that the paper never uses the term “Thompson sampling” because AFAIK it only got that name later in Chapelle & Li's superb 2011 paper.
The Bayesian control rule is an extension of Bayes' rule that is obtained by combining probability theory with causal interventions. Simply stated, it says \[ P(\theta|\hat{A},O) = \frac{ P(\theta) P(\hat{A}, O|\theta) }{ P(\hat{A}, O) }, \] where the “hat”-notation $\hat{A}$ denotes a causal intervention rather than a condition. At first glance, it seems to be just Bayes' rule. But there is a subtle difference: the Bayesian control rule makes a distinction between seeing (= observing = conditioning) and doing (= acting = manipulating). Or, in other words, it allows us to model learning when we have the ability to control our world - hence the name.
…and why do we need it?
For our purposes, a causal intervention is a transformation of a probability distribution, such that a chosen event $B$ ends up having probability one. This is done in order to inform a Bayesian reasoner that the event $B$ carries no evidence. This is important, because a Bayesian reasoner must tag his own actions (assuming they're random variables) as causal interventions in order to prevent himself from interpreting them as evidence about the state of the world.
The problem is that setting the probability of $B$ to one can be done in many different ways, and singling out the correct way of doing so requires causal understanding, that is, knowing how the random variables functionally depend on each other.
The classical example to illustrate this is the barometer. Imagine you have a barometer in your house. The atmospheric pressure changes the height of the (say) mercury column in your barometer: if it rises, we expect good weather; and if it drops rapidly, we expect rain. A simple Bayesian model captures this relation: \[ P(W|B) = \frac{ P(B|W) P(W) }{ P(B) }, \] where $P(W)$ is the prior probability of the weather (say, good or bad) and $P(B|W)$ is the likelihood of the barometer change given the weather - more likely to be high when good weather is approaching and more likely to be low in the case of bad weather.
Choosing reasonable values for the prior probabilities and the likelihoods, we can calculate the posterior probability of having good weather given a rapid rise of the mercury level: \[ P(W=good|B=high) = \frac{ P(B=high|W=good) P(W=good) }{ P(B=high) } \] where $P(B=high)$ is just the normalizing constant \begin{align*} P(B=high) &= P(B=high|W=bad) P(W=bad) \\&\quad + P(B=high|W=good) P(W=good). \end{align*}
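For concreteness, here is a minimal sketch of this computation in Python; the prior and likelihood values are assumed purely for illustration:

```python
# A minimal worked example of the barometer update, with assumed,
# purely illustrative numbers for the prior and the likelihoods.

p_w = {'good': 0.5, 'bad': 0.5}                      # prior P(W)
p_b_given_w = {'good': {'high': 0.9, 'low': 0.1},    # likelihood P(B|W)
               'bad':  {'high': 0.2, 'low': 0.8}}

# Posterior after *observing* B = high (plain Bayes' rule).
evidence = sum(p_b_given_w[w]['high'] * p_w[w] for w in p_w)       # P(B=high)
posterior_good = p_b_given_w['good']['high'] * p_w['good'] / evidence
print(posterior_good)   # 0.9*0.5 / (0.9*0.5 + 0.2*0.5) ≈ 0.818
```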
Now, imagine you decide to change the level of the mercury yourself, say (using a bit of imagination) by using a pressurizing device. This time you set the value of the random variable - and intuition tells us that we cannot predict the weather anymore. Apparently, our previous Bayesian model is useless now. This is because intervening on the barometer changes the joint probability distribution $P(B,W)$ to \[ P(\hat{B},W) = \delta(B) P(W), \] that is, $P(B|W)$ has been replaced by $\delta(B)$, here corresponding to the Kronecker delta function that evaluates to one whenever the value of $B$ is equal to the value we have chosen. The hat-notation is just a shorthand referring to this particular transformation of the probability distribution. Hence, the posterior, assuming we set $B \leftarrow high$, is \[ P(W=good|\hat{B}=high) = \frac{ 1 \cdot P(W=good) }{ 1 \cdot P(W=bad) + 1 \cdot P(W=good) } = P(W=good). \] In other words, we do not gain any knowledge about the weather - as expected. The reason for this special treatment of actions is that when we set the value of a random variable, we change Nature's probability law. This has important consequences for the philosophy behind Bayesian modelling. The Bayesian interpretation of probability theory treats probabilities as degrees of belief, but it is incomplete if not enriched with causal interventions: it only models belief updates after observations, not after actions.
To see why causal reasoning is important, assume now (wrongly) that $B$ causes $W$ instead. Then the intervention has a very different outcome: instead of replacing $P(B|W)$ by $\delta(B)$, we now replace $P(B)$ by $\delta(B)$. These two replacements obviously do not yield the same intervened joint probability distribution, and as a result, the posterior $P(W|\hat{B}=high)$ would have looked different. To summarize, the following picture shows four cases: a) the unintervened graphical model; b) after observing $B$; c) after intervening on $B$ under the assumption that $W \rightarrow B$; and d) after intervening on $B$ under the assumption that $B \rightarrow W$.
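To make cases b), c), and d) concrete, here is a minimal sketch (reusing the illustrative numbers assumed above) that computes the resulting beliefs over the weather:

```python
# Cases b), c), d): belief over W after seeing vs. doing B = high,
# under the two possible causal directions (same assumed numbers as above).

p_w = {'good': 0.5, 'bad': 0.5}
p_b_given_w = {'good': {'high': 0.9, 'low': 0.1},
               'bad':  {'high': 0.2, 'low': 0.8}}

# b) Observing B = high: ordinary conditioning.
evidence = sum(p_b_given_w[w]['high'] * p_w[w] for w in p_w)
p_w_seen = {w: p_b_given_w[w]['high'] * p_w[w] / evidence for w in p_w}

# c) Intervening on B under W -> B: P(B|W) is replaced by delta(B),
#    so the belief over W stays at the prior.
p_w_do_correct = dict(p_w)

# d) Intervening on B under the (wrong) assumption B -> W: P(B) is replaced
#    by delta(B), and the result coincides with ordinary conditioning.
p_w_do_wrong = dict(p_w_seen)

print(p_w_seen)        # {'good': ~0.82, 'bad': ~0.18}
print(p_w_do_correct)  # {'good': 0.5,   'bad': 0.5}
print(p_w_do_wrong)    # {'good': ~0.82, 'bad': ~0.18}
```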
If you find this interesting, you can find more material in the references listed at the end of this post.
The most important application of the Bayesian control rule is adaptive control. That is, the rule says that \[ P(\theta|\hat{A},O) = \frac{ P(\theta) P(\hat{A},O|\theta) }{ P(\hat{A}, O) }, \] i.e. the posterior over the hypothesis $\theta$ is obtained by multiplying the prior $P(\theta)$ with a likelihood $P(\hat{A}, O|\theta)$ that is derived from $P(A, O|\theta)$ by intervening on the random variable $A$. Tagging actions as causal interventions makes sure that we do not learn the hypothesis $\theta$ from our actions themselves, but from the effects of our actions.
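As a minimal sketch of the resulting control loop (all function and variable names below are hypothetical placeholders, not from the paper), one update cycle could look like this; note that the posterior update uses only the observation likelihood $P(o|\theta, a)$ and contains no factor for the action, which is exactly what treating the action as an intervention amounts to:

```python
import numpy as np

def bayesian_control_rule_step(hypotheses, posterior, optimal_action,
                               likelihood, environment):
    """One cycle of the Bayesian control rule (a sketch with placeholder inputs).

    hypotheses     : list of candidate parameters theta
    posterior      : array of weights P(theta | past interventions and observations)
    optimal_action : function theta -> action that is optimal if theta were true
    likelihood     : function (o, theta, a) -> P(o | theta, a)
    environment    : function a -> o (the true, unknown world)
    """
    # 1. Sample a hypothesis from the current posterior.
    theta_idx = np.random.choice(len(hypotheses), p=posterior)
    theta = hypotheses[theta_idx]
    # 2. Act as if the sampled hypothesis were the true one.
    a = optimal_action(theta)
    # 3. Observe the consequence of the action.
    o = environment(a)
    # 4. Update the posterior. The action enters only through P(o|theta, a);
    #    there is no P(a|theta) factor, because a was an intervention.
    new_posterior = np.array([posterior[i] * likelihood(o, th, a)
                              for i, th in enumerate(hypotheses)])
    return new_posterior / new_posterior.sum()
```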
To illustrate the causal aspect of Thompson sampling, we review the multi-armed bandit problem, following the discussion in (Ortega & Braun, JAIR, 2010). A multi-armed bandit is a slot machine with many levers. Each time we pull a lever, the machine produces a Bernoulli-distributed reward with a bias specific to that lever. Our goal is to produce a sequence of actions $a_1, a_2, \ldots$ in order to maximize the cumulative sum of the rewards $r_1, r_2, \ldots$. In this case, if we knew the biases of the levers, the optimal strategy would be to always pull the lever with the highest bias. This knowledge of the optimal strategy given the characterization of the multi-armed bandit is what allows us to use the Bayesian control rule. The model is specified as follows:

- a prior over the unknown biases, $P(\theta) = \prod_{n=1}^N \mathcal{B}(\theta_n; \alpha_n, \beta_n)$, i.e. a product of Beta distributions, one per lever;
- a reward likelihood $P(r_k|\theta, a_k)$, where pulling lever $a_k = n$ yields a Bernoulli-distributed reward with bias $\theta_n$;
- a policy that is optimal for a given parameter, $P(a_{t+1}|\theta) = \delta[A_{t+1} = a^\ast(\theta)]$, where $a^\ast(\theta)$ is the index of the lever with the highest bias.
Together, these three components yield a Bayesian model having the following graphical representation:
In other words, we have created a Bayesian model that, for instance, can be used to infer the parameters of the bandit given a sequence of plays and rewards that we observe.
Now, how can we use this model not only to estimate the parameters of the bandit by observing an optimal player, but also to suggest new actions? It turns out that we can simply condition on all known variables and ask the model to suggest the next action. In other words, the $N$-armed bandit problem can now be solved by sampling from the predictive distribution over actions, which in this case translates to \[ P(a_{t+1}|\hat{a}_{1:t}, r_{1:t}) = \int P(a_{t+1}|\theta, \hat{a}_{1:t}, r_{1:t}) P(\theta|\hat{a}_{1:t}, r_{1:t}) \, d\theta. \] Importantly, this has to be done, again, distinguishing between actions and observations: previous actions are treated as interventions. For instance, assume we have already issued two actions and made two observations. Then the original model a) becomes an intervened model b) in the next picture:
To obtain the third action $a_3$, we can just sample it from model b). Notice that we always remove the incoming arrows from the action variables after we have sampled them. Intuitively, in this case we are saying that there cannot be any information flowing from the actions $\{a_t\}$ to the parameter variable $\theta$, because these actions were sampled from the probabilistic model itself.
To make this expression concrete, we need to figure out what the interventions do to our original joint distribution. This amounts to evaluating the two intervened distributions under the integral: the sampled action depends only on the parameters, \[ P(a_{t+1}|\theta, \hat{a}_{1:t}, r_{1:t}) = P(a_{t+1}|\theta), \] and the posterior over the parameters is computed using only the reward likelihoods, \[ P(\theta|\hat{a}_{1:t}, r_{1:t}) = \frac{ P(\theta) \prod_{k=1}^t P(r_k|\theta, a_k) }{ \int P(\theta') \prod_{k=1}^t P(r_k|\theta', a_k) \, d\theta' }. \]
Using these substitutions, the sampling distribution becomes \begin{align} P(a_{t+1}|\hat{a}_{1:t}, r_{1:t}) &= \int P(a_{t+1}|\theta) \biggl\{ \frac{ P(\theta) \prod_{k=1}^t P(r_k|\theta, a_k) }{ \int P(\theta') \prod_{k=1}^t P(r_k|\theta', a_k) \, d\theta' } \biggr\} \, d\theta \\&= \int \delta[A_{t+1} = a^\ast(\theta)] \prod_{n=1}^N \mathcal{B}(\theta_n; \alpha_n + s_n, \beta_n + f_n) \, d\theta, \end{align} where $s_n$ counts the number of times a reward has been obtained whenever lever $n$ was pulled, and $f_n$ the number of times no reward has been obtained. A closer look at this equation reveals that the integral can be thought of as a two-step procedure: first, draw a parameter vector $\theta$ from the posterior, which is a product of Beta distributions; then, pick the action that would be optimal if the sampled $\theta$ were the true bias vector, $a_{t+1} = a^\ast(\theta)$.
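In code, sampling the next action from this distribution reduces to exactly these two steps. Here is a minimal sketch (not the original MATLAB implementation), with the Beta parameters $\alpha_n, \beta_n$ and counts $s_n, f_n$ as defined above:

```python
import numpy as np

def sample_next_action(alpha, beta, s, f):
    """Sample a_{t+1} from P(a_{t+1} | intervened a_{1:t}, r_{1:t}).

    alpha, beta : prior Beta parameters, one per lever (arrays of length N)
    s, f        : per-lever counts of rewards and non-rewards so far
    """
    # 1. Draw theta from the posterior, a product of Beta distributions.
    theta = np.random.beta(alpha + s, beta + f)
    # 2. Pull the lever that would be optimal if theta were the true bias vector.
    return int(np.argmax(theta))

# Example: run the sampler against a bandit with (unknown to the agent) biases p.
p = np.array([0.3, 0.5, 0.7])                 # illustrative true biases
alpha, beta = np.ones(3), np.ones(3)          # uniform Beta(1,1) priors
s, f = np.zeros(3), np.zeros(3)
for t in range(1000):
    a = sample_next_action(alpha, beta, s, f)
    r = np.random.rand() < p[a]               # Bernoulli reward
    s[a] += r; f[a] += 1 - r                  # update only the reward counts
print(s + f)                                  # most pulls should go to lever 2
```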
As we anticipated before, this is precisely what Thompson sampling does. But what is interesting about this particular derivation is that it highlights an important difference from standard Bayesian reinforcement learning: rather than using Bayesian probability theory to build a predictive model that is then used by a separate decision-making strategy, here it is the whole behavior of the agent that is cast in Bayesian terms. More precisely, we write a probabilistic model over behaviors, and use causal Bayesian inference to derive adaptive controls.
The figure shows the probability of pulling the optimal lever and the regret as a function of time for the optimal solution (Gittins' indices), UCB1 and Thompson sampling. These curves were estimated using Monte Carlo simulations. Note that Thompson sampling performs significantly better than UCB1, and gets very close to the optimal solution. Download the MATLAB script here: [ZIP].
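The script linked above is in MATLAB; as a rough, illustrative sketch of the same kind of Monte Carlo estimate, the following Python snippet compares the average cumulative regret of Thompson sampling and UCB1 on an assumed three-armed bandit (the optimal Gittins-index policy is omitted here):

```python
import numpy as np

def run(policy, p, T, runs=200):
    """Average cumulative regret of a bandit policy, estimated by Monte Carlo."""
    N, best = len(p), np.max(p)
    regret = np.zeros(T)
    for _ in range(runs):
        s, f = np.zeros(N), np.zeros(N)
        for t in range(T):
            a = policy(s, f, t)
            r = np.random.rand() < p[a]
            s[a] += r; f[a] += 1 - r
            regret[t] += best - p[a]           # expected per-step regret
    return np.cumsum(regret / runs)

def thompson(s, f, t):
    # Sample biases from the Beta posterior and act greedily on the sample.
    return int(np.argmax(np.random.beta(1 + s, 1 + f)))

def ucb1(s, f, t):
    n = s + f
    if np.any(n == 0):                         # play each lever once first
        return int(np.argmin(n))
    return int(np.argmax(s / n + np.sqrt(2 * np.log(t + 1) / n)))

p = np.array([0.3, 0.5, 0.7])                  # illustrative biases
print(run(thompson, p, 1000)[-1], run(ucb1, p, 1000)[-1])
```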
It is not hard to see that this way of modelling adaptive controllers is very general. In particular, notice the following:
We have successfully applied the Bayesian control rule to many different problem classes, including bandits, MDPs, LQRs and even games with adversarial players.
Ortega, P.A. and Kim, K.-E. and Lee, D.D.
Reactive bandits with attitude
18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
[PDF] [Poster]
Ortega, P.A. and Crammer, K. and Lee, D.D.
Belief Flows for Robust Online Learning
Information Theory and Applications (ITA), 2015.
[PDF]
Ortega, P.A. and Braun, D.A.
Generalized Thompson Sampling for Sequential Decision-Making and Causal Inference
Complex Adaptive Systems Modeling 2:2, 2014.
[PDF]
Ortega, P.A.
Bayesian Control Rule - Talk Slides, 2012
[PDF]
Ortega, P.A. and Braun, D.A.
Adaptive Coding of Actions and Observations
NIPS Workshop on Information in Perception and Action, 2012.
[PDF]
Ortega, P.A., Grau-Moya, J., Genewein, T., Balduzzi, D. and Braun, D.A.
A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function
Neural Information Processing Systems (NIPS), 2012.
[PDF] [CODE]
Ortega, P.A. and Braun, D.A. and Godsill, S.J.
Reinforcement Learning and the Bayesian Control Rule
The Fourth Conference on Artificial General Intelligence, pp. 281-285, 2011.
[PDF]
Ortega, P.A.
A Unified Framework for Resource-Bounded Autonomous Agents Interacting with Unknown Environments
PhD Thesis, Dept. of Engineering, University of Cambridge, 2011.
Thesis supervisor: Zoubin Ghahramani
Thesis committee: Marcus Hutter and Carl E. Rasmussen
[PDF]
Ortega, P.A. and Braun, D.A.
A minimum relative entropy principle for learning and acting
Journal of Artificial Intelligence Research 38, pp. 475-511, 2010.
[PDF]
Braun, D.A. and Ortega, P.A.
A minimum relative entropy principle for adaptive control in linear quadratic regulators
Proceedings of the 7th International Conference on Informatics in Control, Automation and Robotics, pp. 103-108, 2010.
[PDF]
Ortega, P.A. and Braun, D.A.
A Bayesian rule for adaptive control based on causal interventions
The Third Conference on Artificial General Intelligence, pp. 121-126, 2010.
[PDF]