Thompson Sampling & Bayesian Control Rule

Thompson sampling is not just a heuristic with nice properties, but, under closer scrutiny, reveals some interesting aspects about the reinforcement learning problem that have not been analyzed before. Two aspects that are particularly interesting are the intimate connection to Bayesian inference (in fact, to adaptive compression) and the intricate relation to causality.

Also check out the MDP and the Belief Flows examples.

If you find this useful, please cite it as: Ortega, P. A., and Braun, D. A. “A minimum relative entropy principle for learning and acting.” Journal of Artificial Intelligence Research (2010): 475-511. Note that the paper never uses the term “Thompson sampling” because AFAIK it only got that name later in Chappelle & Li's superb 2011 paper.

The Bayesian control rule is an extension to Bayes' Rule that is obtained by combining probability theory and causal interventions. Simply stated, it says \[ P(\theta|\hat{A},O) = \frac{ P(\theta) P(\hat{A}, O|\theta) }{ P(\hat{A}, O) }, \] where the “hat”-notation $\hat{A}$ denotes a causal intervention rather than a condition. At a first glance, it seems to be just Bayes' rule. But there is a subtle difference: the Bayesian control rule makes a distinction between seeing (= observing = conditioning) and doing (= acting = manipulating). Or, in other words, it allows us to model learning when we have the ability to control our world - hence the name.

…and why do we need it?

For our purposes, a causal intervention is a transformation of a probability distribution, such that a chosen event $B$ ends up having probability one. This is done in order to inform a Bayesian reasoner that the event $B$ carries no evidence. This is important, because a Bayesian reasoner must tag his own actions (assuming they're random variables) as causal interventions in order to prevent himself from interpreting them as evidence about the state of the world.

The problem is that setting the probability of $B$ to one can be done in many different ways, and singling out the correct way of doing so requires causal understanding, that is, knowing how the random variables functionally depend on each other.

The classical example to illustrate this is the barometer. Imagine you have a barometer in your house. The atmospheric pressure changes the height of the (say) mercury column in your barometer: if it rises, we expect good weather; and if it drops rapidly, we expect rain. A simple Bayesian model captures this relation: \[ P(W|B) = \frac{ P(B|W) P(W) }{ P(B) }, \] where $P(W)$ is the prior probability of the weather (say, good or bad) and $P(B|W)$ is the likelihood of the barometer change given the weather - more likely to be high when good weather is approaching and more likely to the low in the case of bad weather.

Choosing reasonable values for the prior probabilities and the likelihoods, we can calculate the posterior probability of having good weather given a rapid rise of the mercury level: \[ P(W=good|B=high) = \frac{ P(B=high|W=good) P(W=good) }{ P(B=high) } \] where $P(B=high)$ is just the normalizing constant \begin{align*} P(B=high) &= P(B=high|W=bad) P(W=bad) \\&\quad + P(B=high|W=good) P(W=good). \end{align*}

Now, imagine you decide to change the level of the mercury yourself, say (using a bit of imagination) by using a pressurizing device. Now, you set the value of the random variable - and intuition tells us that we cannot predict the weather anymore. Apparently, our previous Bayesian model is useless now. This is because the intervention of the barometer changes the joint probability distribution $P(B,W)$ to \[ P(\hat{B},W) = \delta(B) P(W), \] that is, where $P(B|W)$ has been replaced by $\delta(B)$, here corresponding to the Kronecker delta function that evaluates to one whenever the value of $B$ is equal to the value we have chosen. The hat-notation is just a shorthand referring to this particular transformation of the probability distribution. Hence, the posterior, assuming we set $B \leftarrow high$, is \[ P(W=good|\hat{B}=high) = \frac{ 1 \cdot P(W=good) }{ 1 \cdot P(W=bad) + 1 \cdot P(W=good) } = P(W=good). \] In other words, we don't gain knowledge about the weather - as expected. The reason for this special treatment of actions is that when we set the value of a random variable, we change Nature's probability law. This has important consequences to the philosophy behind Bayesian modelling. The Bayesian interpretation of probability theory interprets probabilities as degrees of belief, but it is incomplete if not enriched with causal interventions: it only models belief updates after observations, not actions.

To see why causal reasoning is important, assume now (wrongly) that $B$ causes $W$ instead. Then, the intervention has a very different outcome: instead of replacing $P(B|W)$ by $\delta(B)$, we now replace $P(B)$ by $\delta(B)$. These two replacements obviously do not yield the same intervened joint probability distribution, and as a result, the posterior $P(W|\hat{B}=high)$ would have looked differently. To summarize, the following picture shows four cases: a) the unintervened graphical model; b) after observing $B$; c) after intervening $B$ under the assumption that $W \rightarrow B$; and d) after intervening $B$ under the assumption that $B \rightarrow W$.

If you find this interesting, then you can find more material here:

causality, if you are interested in standard causality (but incomplete);
and Bayesian causal induction, if you want to do know how to learn a causal law from examples.

The most important application of the Bayesian control rule is adaptive control. That is, the rule says that \[ P(\theta|\hat{A},O) = \frac{ P(\theta) P(\hat{A},O|\theta) }{ P(\hat{A}, O) }, \] i.e. the posterior of the hypothesis $\theta$ is given by multiplying the prior $P(\theta)$ with a likelihood $P(\hat{A}, O|\theta)$ obtained from $P(A, O|\theta)$ by intervening the random variable $A$. Tagging actions as causal interventions makes sure that we do not learn the hypothesis $\theta$ from our actions, but from the effects of our actions.

To illustrate the causal aspect of Thompson sampling, we review the multi-armed bandit problem, following the discussion in (Ortega & Braun, JAIR, 2010). A multi-armed bandit is a slot machine with many levers. Each time we pull a lever, the machine produces a Bernoulli-distributed reward with a bias specific to the lever. Our goal is to produce a sequence of actions $a_1, a_2, \ldots$ in order to maximize the cumulative sum of the rewards $r_1, r_2, \ldots$. In this case, if we knew the biases of the levers, then the optimal strategy would be to always pull the lever with the highest bias. This knowledge of the optimal strategy given the characterization of the multi-armed bandit allows us to use the Bayesian control rule. The model is specified as follows:

Actions: First, we specify the causal model over the optimal actions. The optimal action $a_t$ depends functionally only on the vector of biases $\theta = [\theta_1, \ldots, \theta_N]$: \[ P(a_t|\theta, a_{1:t-1}, r_{1:t-1}) = P(a_t|\theta) = \delta[A_t = a^\ast(\theta)]. \] where $a^\ast(\theta)$ is the index of the lever with the highest bias in $\theta$, and where $\delta$ is the Kronecker delta function.
Observations: Second, we specify the causal model over the rewards. In a manner analogous to the case of actions, each reward $r_t$ depends functionally only on the bias vector $\theta$ and the previous action $a_t$: \[ P(r_t|\theta,a_{1:t}, r_{1:t-1}) = P(r_t|\theta, a_t) = \theta_{a_t}, \] where $\theta_{a_t}$ denotes the bias component corresponding to lever $a_t$. Together with the action model, this fully determines a causal model over optimal interactions between an agent and the multi-armed bandit: \[ P(a_1, r_1, a_2, r_2, \ldots|\theta) = P(a_1|\theta) P(r_1|\theta, a_1) P(a_2|\theta) P(r_2|\theta, a_2) \cdots \]
Prior: The last modelling step consists in determining a prior distribution $P(\theta)$ over the behavioral hypotheses $\theta$. This is straightforward: since each lever behaves like a Bernoulli distribution, we can place a Beta prior over $\theta$—because it is conjugate to the likelihood. Since we need one Beta per lever, the resulting prior is a product of $N$ Beta distributions \[ P(\theta) = \prod_{n=1}^N \mathcal{B}(\theta_n; \alpha_n, \beta_n). \]

Together, these 3 components yield a Bayesian model having the following graphical representation:

In other words, we have created a Bayesian model that, for instance, can be used to infer the parameters of the bandit given a sequence of plays and rewards that we observe.

Now, how can we use this model not only to estimate the parameters of the bandit by observing an optimal player, but also to suggest new actions? It turns out that we can simply condition on all known variables and ask the model to suggest the next action. In other words, the $N$-armed bandit problem can now be solved by sampling from the predictive distribution over actions, which in this case translates to \[ P(a_{t+1}|\hat{a}_{1:t}, r_{1:t}) = \int P(a_{t+1}|\theta, \hat{a}_{1:t}, r_{1:t}) P(\theta|\hat{a}_{1:t}, r_{1:t}) \, d\theta. \] Importantly, this has to be done, again, distinguishing between actions and observations: previous actions are treated as interventions. For instance, assume we have already issued two actions and made two observations. Then the original model a) becomes an intervened model b) in the next picture:

To obtain the third action $a_3$, we can just sample it from model b). Notice that we always remove the incoming arrows from the action variables after we have sampled them. Intuitively, in this case we are saying that there cannot be any information flowing from the actions $\{a_t\}$ to the parameter variable $\theta$, because these actions where sampled from the probabilistic model itself.

To make this expression concrete, we need to figure out what the interventions do to our original joint distribution. This is done by evaluating the two intervened distributions under the integral:

First term: In $P(a_{t+1}|\theta, \hat{a}_{1:t}, r_{1:t})$, the intervened random variables $a_{1:t}$ functionally precede the probability argument $a_{t+1}$. Therefore, the intervened distribution is identical to the unintervened distribution, i.e. \[ P(a_{t+1}|\theta, \hat{a}_{1:t}, r_{1:t}) = P(a_{t+1}|\theta, a_{1:t}, r_{1:t}) = P(a_{t+1}|\theta). \]
Second term: In $P(\theta|\hat{a}_{1:t}, r_{1:t})$, the intervened random variables $a_{1:t}$ are functionally dependent on the probability argument $\theta$. Therefore, using Bayes' rule, we rewrite this term as \[ P(\theta|\hat{a}_{1:t}, r_{1:t}) = \frac{ P(\hat{a}_{1:t}, r_{1:t}|\theta) P(\theta) }{ \int P(\hat{a}_{1:t}, r_{1:t}|\theta') P(\theta') \, d\theta' }, \] where the likelihood term factorizes as \[ P(\hat{a}_{1:t}, r_{1:t}|\theta) = P(\hat{a}_1|\theta) P(r_1|\theta, \hat{a}_1) P(\hat{a}_2|\theta, \hat{a}_1, r_1) \cdots P(r_t|\theta, \hat{a}_{1:t}, r_{1:t-1}). \] This rewriting exposes the causal structure of the posterior, allowing us to safely evaluate the causal operations. The way I typically calculate the result of a causal intervention has three steps. First, whenever the probability argument depends causally on the probability conditions, it is safe to remove the hats from the conditions. This is precisely the case in the factorization above. Thus, \begin{gather} P(\hat{a}_k|\theta, \hat{a}_{1:k-1}, r_{1:k-1}) = P(\hat{a}_k|\theta, a_{1:k-1}, r_{1:k-1}) = P(\hat{a}_k|\theta) \\ P(r_k|\theta, \hat{a}_{1:k}, r_{1:k-1}) = P(r_k|\theta, a_{1:k}, r_{1:k-1}) = P(r_k|\theta, a_k). \end{gather} Second, we substitute the intervened mechanisms \[ P(a_k|\theta) = \delta[A_t = a^\ast(\theta)] \quad\longleftrightarrow\quad P(\hat{a}_k|\theta) = \delta[A_k = a_k] = 1. \] These values are all equal to one because we know that $A_k = a_k$ for all the random variables that ocurred in the past.

Using these substitutions, the sampling distribution becomes, \begin{align} P(a_{t+1}|\hat{a}_{1:t}, r_{1:t}) &= \int P(a_{t+1}|\theta) \biggl\{ \frac{ P(\theta) \prod_{k=1}^t P(r_k|\theta, a_k) }{ \int P(\theta') \prod_{k=1}^t P(r_k|\theta', a_k) \, d\theta' } \biggr\} \, d\theta \\&= \int \delta[A_{t+1} = a^\ast(\theta)] \prod_{n=1}^N \mathcal{B}(\theta_n; \alpha_n + s_n, \beta_n + f_n), \end{align} where $s_n$ counts the number of times a reward has been obtained whenever lever $n$ was pulled, and $f_n$ the number of times no reward has been obtained. A closer look at this equation reveals that the integral can be thought as:

obtain a candidate $\theta'$ from the posterior over the parameters where the actions are treated as mute conditionals;
and using the optimal action assuming that this simulated parameter $\theta'$ is the true parameter (Alternatively, one could directly draw the action from the marginal $P(a_{t+1}|\hat{a}_{1:t}, r_{1:t})$, but this integral is in general intractable.).

As we have anticipated before, this is precisely what Thompson sampling does. But what is interesting in this particular derivation is that it highlights an important difference to standard Bayesian reinforcement learning: rather than using Bayesian probability theory as a predictive model that is then used by a decision making strategy, here it is the whole behavior of the agent that is cast in Bayesian terms. More precisely, we can write a probabilistic model over behaviors, and use causal Bayesian inference to derive adaptive controls.

The figure shows the probability of pulling the optimal lever and the regret as a function of time for the optimal solution (Gittins' indices), UCB1 and Thompson sampling. These curves were estimated using Monte Carlo simulations. Note that Thompson sampling performs significantly better than UCB1, and gets very close to the optimal solution. Download the MATLAB script here: [ZIP].

It is not hard to see that this way of modelling adaptive controllers is very general. In particular, notice the following:

Generality of Modelling approach: Essentially identical to Bayesian estimation, the previous modelling approach suggests that we can simply take a class of desirable behaviours and combine them into a Bayesian mixture to create a new adaptive controller that is universal over this class. These behaviours will constitute the modes of operation that the controller can converge to.
Operation Modes do not need to be derived from Variational Principles: The set of desirable behaviours does not necessarily have to correspond to optimal controllers in the classical sense. In the previously given bandit example, the policy $P(a_t|\theta)$ was derived by choosing the action that maximizes the reward. However, these modes of operation can be created in any desirable way, like a likelihood model in Bayesian estimation. In particular, the behavioural models can: (1) simply be stochastic trajectories that you want to track; (2) constructed from real data (think of collecting the data from actual robots and skipping the inverse reinforcement learning problem, for instance); (3) obtained as solutions to equilibria (for instance, as solutions of Nash equilibria).
In what sense is this approach optimal? Although it has been shown to perform pretty well in many examples, this approach is not optimal in the maximum expected utility sense. However, it is optimal in the adaptive coding sense: like Bayesian inference, it is the optimal solution to the adaptive lossless source coding problem, i.e. it is the optimal adaptive compressor of actions and observations.
Can it fail? Yes. An important requirement of any Bayesian model is (some form of) ergodicity and identifiability. Like in Bayesian inference, if the behavioural (likelihood) models are non-ergodic or non-identifiable, then the data might look different every time we see a realization, and worse, enter a regime from which it becomes indistinguishable from others behaviours. Our paper (2010) and my thesis discusses this issue in more detail. In i.i.d. data we do not encounter these problems, but in more complex stochastic time series this becomes crucial.

We have successfully applied the Bayesian control rule to many different problem classes, including bandits, MDPs, LQRs and even games with adversarial players.

[27] Ortega, P.A. and Kim, K.-E. and Lee, D.D.
Reactive bandits with attitude
18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015
[PDF] [Poster]

Ortega, P.A. and Crammer, K. and Lee, D.D.
Belief Flows for Robust Online Learning
Information Theory and Applications (ITA), 2015.
[PDF]

Ortega, P.A. and Braun, D.A.
Generalized Thompson Sampling for Sequential Decision-Making and Causal Inference
Complex Adaptive Systems Modeling 2:2, 2014.
[PDF]

Ortega, P.A.
Bayesian Control Rule - Talk Slides, 2012
[PDF]

Ortega, P.A. and Braun, D.A.
Adaptive Coding of Actions and Observations
NIPS Workshop on Information in Perception and Action, 2012.
[PDF]

Ortega, P.A., Grau-Moya, J., Genewein, T., Balduzzi, D. and Braun, D.A.
A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function
Neural Information Processing Systems (NIPS), 2012
[PDF] [CODE]

Ortega, P.A. and Braun, D.A. and Godsill, S.J.
Reinforcement Learning and the Bayesian Control Rule
The fourth conference on artificial general intelligence, pp. 281-285, 2011.
[PDF]

Ortega, P.A.
A Unified Framework for Resource-Bounded Autonomous Agents Interacting with Unknown Environments
PhD Thesis, Dept. of Engineering, University of Cambridge, 2011.
Thesis supervisor: Zoubin Ghahramani
Thesis committee: Marcus Hutter and Carl E. Rasmussen
[PDF]

Ortega, P.A. and Braun, D.A.
A minimum relative entropy principle for learning and acting
Journal of Artificial Intelligence Research 38, pp. 475-511, 2010.
[PDF]

Braun, D.A. and Ortega, P.A.
A minimum relative entropy principle for adaptive control in linear quadratic regulators
Proceedings of the 7th international conference on informatics in control, automation and robotics, pp. 103-108, 2010
[PDF]

Ortega, P.A. and Braun, D.A.
A Bayesian rule for adaptive control based on causal interventions
The third conference on artificial general intelligence, pp. 121-126, 2010
[PDF]

Thompson Sampling & Bayesian Control Rule

Bayesian Control Rule

What is a causal intervention?

Applications to Adaptive Control

Example: Multi-Armed Bandit

Evaluating the Causal Expression

Thompson Sampling

More Applications

References (In Reverse Chronological Order)