# MDPs Using Bayesian Control Rule/Thompson Sampling

*This is the model-free reinforcement learning algorithm that we originally used as an example to showcase the Bayesian control rule, inspired by the “Bayesian Q-Learning” paper by Dearden et al. ^{1)} Please cite as: Ortega, P.A. and Braun D.A. “A minimum relative entropy principle for learning and acting”. Journal of Artificial Intelligence Research 38, pp. 475-511, 2010.*

This page is under construction!

*Before you move on, please make sure you understand the example with the bandits.*

This time we want to do something more fun: controlling a Markov Decision Process (MDP) using Thompson sampling. In particular, our goal is to:

- learn to control an MDP with unknown transition probabilities;
- and optimize the
**average reward**, rather than the the discounted cumulative reward.

If we use Thompson sampling, then the straightforward way to do this consists in:

- placing a probabilistic model over the transition probabilities and reward distributions;
- iterating by sampling an MDP from this model, solving it using your favorite method (e.g. policy iteration), and use the policy for one or more iterations;
- and finally, collecting the environmental responses and conditioning your probabilistic model.

This works, but we will end up solving a much harder problem than necessary: system identification plus optimal control, or model-based reinforcement learning. It's important to realize that all we really need to know about the environment is how to control it optimally, and this information is nicely encapsulated in the Q-function [following Watkins 1989]. So let's just place a probabilistic model over the Q-function instead. And, to do some different, let's have a look at how to cope with undiscounted MDPs.

### What we will achieve here

We will derive a **Bayes-causal** algorithm for **off-policy**, **model-free** reinforcement learning. At any time, the agent holds beliefs about the Q-table (and the average reward) which it updates using its experience. In order to act, the agent instantiates a random Q-table that it takes as the true one, and thus issues actions that are optimal w.r.t. these simulated beliefs. Q-tables are sampled using MCMC: more precisely, it uses a Gibbs sampler that samples each Q-value from a Normal distribution.

## Undiscounted MDPs

Let's introduce some notation. An MDP is a tuple $(\mathcal{X}, \mathcal{A}, T, r)$, where

- $\mathcal{X}$ is the state space;
- $\mathcal{A}$ is the action space;
- $T_a(x;x') = P(x'|a,x)$ is the probability that an action $a \in \mathcal{A}$ taken in state $x \in \mathcal{X}$ will lead to state $x' \in \mathcal{X}$;
- and $r(x,a) \in \mathcal{R} := \mathbb{R}$ is the immediate reward obtained in state $x \in \mathcal{X}$ under action $a \in \mathcal{A}$.

The agent must play actions to maximize the average reward:
\[
\lim_{t\rightarrow \infty} \frac{1}{T} \sum_{t=1}^T r_t.
\]
These MDPs are (rather oddly, IMHO) called **undiscounted MDPs**, and they are useful to build controllers that have to operate indefinitely.

## Optimal policies

For every given MDP there's always an optimal stationary policy, hence we can center our attention on them and ignore the stochastic and nonstationary ones. Thus, we only consider policies $\pi$ that are mappings $\pi: \mathcal{X} \rightarrow \mathcal{A}$.

### Ergodicity

Now we need to make a technical assumption. Imagine we pick a policy $\pi$ and run it on a given MDP for a very long time until we feel that the average reward has converged. Let's say we repeat this procedure several times with different initial states. If the MDP has different “absorbing regions” that the agent could enter but never leave again, then it is possible that the average reward converges to different values depending on the initial state. We now rule out this possibility: we assume that, for every policy, the average reward converges to the same value no matter the initial state (although of course different policies can yield different values). Such an MDP is said to be **ergodic** (a Greek-ish term dreamt up by Boltzmann).

### Characterizing Optimal Policies

Before we move on to place a probabilistic model over optimal policies, we need to come up with a way of characterizing them. Fortunately, Bellman optimality equations come to the rescue. For undiscounted MDPs, the optimality equations are: \begin{align*} Q(x,a) + \rho &= r(x,a) + \sum_{x' \in \mathcal{X}} P(x'|x,a) \max_{a'} Q(x',a') \\ &= r(x,a) + \mathbb{E}\Bigl[\max_{a'} Q(X',a') \Bigl|x,a \Bigr] \end{align*} where $Q(x,a), Q(x',a')$ are Q-values and $\rho$ is the average reward of the MDP under the optimal policy. It's important to realize here that the set of optimality equations form a system of nonlinear equations that are strongly coupled. If we knew the transition probabilities $P(x'|x,a)$ and rewards $r(x,a)$ then the problem would simply consist in finding a Q-table $\mathbf{Q}$ with entries $Q(x,a)$ and an optimal average reward $\rho$ that solve the system of equations.

The previous optimality equations look very similar to the discounted case, safe for the average reward $\rho$. Therefore, given the Q-table, the optimal action $a^\ast(x)$ in state $x$ is given by \[ a^\ast(x) = \arg \max_a Q(x,a). \]

## Causal-probabilistic model

In order to apply Thompson sampling (using the Bayesian control rule), we now need to describe the stochastic processes that would result when the agent and the environment interact with each other under the assumption that both the Q-table $\mathbf{Q}$ and the average reward $\rho$ are known. To shorten our notation, we collect these parameters into a single parameter structure $\theta = (\rho,\mathbf{Q})$.

In the case of model-free reinforcement learning we consider **experience tuples** of the form
\[
\langle x, a, r, x' \rangle,
\]
containing the current state $x$, the current action $a$, the current reward $r$ and the next state $x'$. In particular, since we are not interested in the transition model, we will take

- the states $x$ and $x'$ as
**given**, - the action $a$ as drawn from a policy $P(a|\theta,x)$;
- and the reward $r$ as drawn from $P(r|\theta,x,a,x')$.

Our next goal is to write down these distributions.

### Actions

This distribution is the policy used by the agent to generate the actions, and from our characterization of the optimal policy from above, we know that the optimal action is the one that maximizes the Q-value of the current state: \[ P(a|\theta,x) = \begin{cases} 1 & \text{if } a = \arg\max_{a'} Q(x,a') \\ 0 & \text{otherwise.} \end{cases} \]

### Rewards

To model the distribution over the rewards, we take the Bellman optimality equation and add and subtract $\max_{a'} Q(x',a')$ to get the equation
\begin{align*}
r(x,a) = &Q(x,a) + \rho - \max_{a'} Q(x',a') \\
&+ \max_{a'} Q(x',a') - \mathbb{E}\Bigl[ \max_{a'} Q(X',a')\Bigl|x,a \Bigr].
\end{align*}
Here, we identity two parts. The first is the **mean instantaneous reward** defined as
\[
\bar{r}(x,a,x') := Q(x,a) + \rho - \max_{a'} Q(x',a').
\]
The second is a **noise term**
\[
\nu := \max_{a'} Q(x',a') - \mathbb{E}\Bigl[ \max_{a'} Q(X',a') \Bigl| x,a \Bigr]
\]
representing the deviation of the Q-value in the experienced transition from the expected Q-value. This peculiar interpretation might look a bit odd at first, but remember that here we are ignoring the transition model. Hence, we need to purge every mention of transition probabilities from our representation, and in particular also the expectations. Instead, we focus on representing every quantity solely in terms of the average reward $\rho$ and the Q-table $\mathbf{Q}$. Thus, in this restricted representation, the reward becomes a **random variable**
\[
r(x,a) = \bar{r}(x,a,x') + \nu.
\]
To simplify things, we will assume that the noise $\nu$ is normal $\mathcal{N}(0,1/p)$ with a **known** precision $p$. That's a fairly reasonable assumption. The good news is that we can now write down a likelihood model for the immediate reward $r$ using the $Q$-values and the average reward, i.e.
\begin{align*}
P(r|\theta,x,a,x')
&= \mathcal{N}\bigl(\bar{r}(x,a,x'), 1/p \bigr) \\
&= \sqrt{\frac{p}{2\pi}} \exp\Bigl\{ -\frac{p}{2} (r- \bar{r}(x,a,x'))^2 \Bigr\}.
\end{align*}

## Learning to act optimally

The previous model assumes that the Q-table and the average reward are known and encapsulated in the big parameter structure $\theta = (\rho,\mathbf{Q})$. Here we are interested in learning it, so we place a prior over the parameter space $\Theta$ to represent our uncertainty. Fortunately, doing so is fairly straightforward.

### Bayesian Model for Rewards

First, we focus on the reward $r$ for a given transition $(x,a,x')$. The likelihood
\[
P(r|\theta,x,a,x') = P(r|\bar{r}(x,a,x')) = \mathcal{N}(r; \bar{r}(x,a,x'), 1/p)
\]
is a Normal distribution with unknown mean and known precision, and so we can place another Normal
\[
P\bigl(\bar{r}(x,a,x')|\mu(x,a,x'), \lambda(x,a,x')\bigr)
= \mathcal{N}\Bigl(\bar{r}(x,a,x'); \mu(x,a,x'), 1/\lambda(x,a,x')\Bigr)
\]
to model the uncertainty over $\bar{r}(x,a,x')$, because it is **conjugate** to the likelihood. In this prior, the hyperparameters $\mu(x,a,x')$ and $\lambda(x,a,x')$ represent the mean and precision of the Normal distribution.

When we experience the tuple $(x,a,r,x')$, the new posterior distribution becomes \[ P\bigl(\bar{r}(x,a,x')|r, \lambda(x,a,x'), \mu(x,a,x')\bigr) = P\bigl(\bar{r}(x,a,x')|\mu'(x,a,x'),\lambda'(x,a,x')\bigr) \] where $\mu'(x,a,x')$ and $\lambda(x,a,x')$ are the new hyperparameters calculated as \[ \mu'(x,a,x') = \frac{\lambda(x,a,x') \mu(x,a,x') + p r}{ \lambda(x,a,x') + p } \quad\text{and}\quad \lambda'(x,a,x') = \lambda(x,a,x') + p \] that is, the standard update for a Gaussian belief model.

To maintain a model over all the rewards in the MDP, we need to keep track of $\mu(x,a,x')$ and $\lambda(x,a,x')$ for all possible state transitions. In a brute-force implementation, this bookkeeping is done with two 3-dimensional matrices, one for the means $\mu$ and the other for precisions $\lambda$.

### Gibbs Sampler for Q-values and Average Reward

So far, we have only modelled the rewards but not the Q-values, which are really what we're interested in. Fortunately, we can use the previous Bayesian model over instantaneous rewards to derive an approximate Gibbs sampler for the parameter structure $\theta = (\rho,\mathbf{Q})$. For this, we recall that mean instantaneous reward $\bar{r}(x,a,x')$ is equal to $Q(x,a) - \rho + \max_{a'} Q(x',a')$, and hence **posterior over the paramater structure** is just equal to the product
\begin{align*}
P(\theta|\mu,\lambda)
&= \prod_{x,a,x'} P(\bar{r}(x,a,x')|\mu,\lambda) \\
&\propto \prod_{x,a,x'} \exp\Bigl\{
-\frac{1}{2} \lambda(x,a,x') \bigl(\bar{r}(x,a,x')-\mu(x,a,x')\bigr)^2\Bigr\}\\
&\propto \exp\Bigl\{
-\frac{1}{2} \sum_{x,a,x'} \lambda(x,a,x') \bigl(\bar{r}(x,a,x')-\mu(x,a,x')\bigr)^2\Bigr\}.
\end{align*}
By replacing each mean instantaneous reward with its representation in terms of $\rho$ and $\mathbf{Q}$, we can calculate various conditional distributions through rearranging the terms and completing the square.

#### Average Reward

With the previous replacement and rearranging the terms, we get that the conditional distribution over the average reward is given by \[ P(\rho|\mathbf{Q},\mu,\lambda) = \mathcal{N}(\rho; \bar{\rho}, 1/S) \] where the mean and precision are calculated as \begin{align*} \bar{\rho} &:= \frac{1}{S} \sum_{x,a,x'} \lambda(x,a,x)\Bigl(\mu(x,a,x') - Q(x,a) + \max_{a'}Q(x',a')\Bigr) \\ S &:= \sum_{x,a,x'} \lambda(x,a,x'). \end{align*}

The remarkable thing is that the average reward can be just sampled from a **Normal distribution** with parameters that are simple functions of the hyperparameters $\mu$ and $\lambda$.

#### Q-Values

The conditional distribution over the Q-values is bit uglier though because each $Q(x,a)$ enters the posterior distribution both linearly and non-linearly through $\bar{r}(x,a,x')$, in the form of $Q(x,a)$ and $\max_{a'} Q(x,a')$ terms. However, if we treat $\max_{a'}Q(x,a')$ as a constant for the Gibbs step, then the conditional distribution can be approximated by a Normal distribution: \[ P\bigl(Q(x,a)|\rho,\mathbf{Q}^{-(x,a)},\lambda,\mu \bigr) \approx \mathcal{N}\Bigr( Q(x,a); \bar{Q}(x,a), 1/S(x,a) \Bigr) \] where $\mathbf{Q}^{-(x,a)}$ denotes the Q-table without the entry $Q(x,a)$. Here, the mean and precision are given by \begin{align*} \bar{Q}(x,a) &:= \frac{1}{\lambda(x,a)} \sum_{x'} \lambda(x,a,x)\Bigl(\mu(x,a,x') - \rho + \max_{a'}Q(x',a')\Bigr) \\ S(x,a) &:= \sum_{x'} \lambda(x,a,x'). \end{align*} We have found this approximation to work surprisingly well in our simulations.

## Putting everything together

We now have all the ingredients we need for our reinforcement learning algorithm:

- a policy $P(a|\theta,x)$;
- a reward model $P(r|\theta,x,a,x')$;
- and a prior distribution over the parameter structure $P(\theta)$.

This probabilistic model can be represented with a causal DAG, where two subsequent states $x_t$ and $x_{t+1}$ are treated as mute conditionals of the joint probability distribution:

To use it, we proceed as follows. In every step $t=1,2,\ldots$, we carry out the following three steps:

**Current state:**We condition the model on the current state $x_t$.**Current action:**Now we draw an action $a_t$ directly from the model. Since the parameter $\theta$ is unknown, the distribution over the action is \[ P(a_t|x_t,q_{1:t-1}) = \int P(a_t|\theta,x_t) P(\theta|q_{1:t-1}) \, d\theta\] where we use the shorthand $q_\tau = (x_\tau,a_\tau,r_\tau)$ to represent the past experience. Algorithmically, this can be done in two steps: first by sampling a parameter structure $\theta_t \sim P(\theta|q_{1:t-1})$ and then by sampling the action $a_t \sim P(a_t|\theta_t)$, where we have used $\theta_t$ as a plugin parameter. However, since we are using a value generated by the model itself, we have to the random variable by performing the causal intervention before conditioning.**Current reward and next state:**We condition the model on the current reward $r_t$ and the next state $x_{t+1}$. This will update the posterior distribution over the parameter structure $\theta$.

Afer performing these three steps, the unrolled model will look as shown below:

The pseudocode below explains how to do all of this.

### Bayesian Control Rule for Undiscounted MDPs

- Initialize the entries of $\lambda$ and $\mu$ to zero or any other desired value.
- Set the initial state to $x \leftarrow x_0$.
- For every time step $t = 1, 2, 3, \ldots$ do:
- Sample a new average reward and Q-table using Gibbs sampling:
- $\rho \sim \mathcal{N}(\bar{\rho}, 1/S)$.
- For each $(x,a)$, do $Q(x,a) \sim \mathcal{N}(\bar{Q}(x,a),1/S(x,a))$.

- Interact with the world:
- Pick the next action $a \leftarrow \max_{a'} Q(x,a')$ and issue it.
- Get the next observation $o = (r,x')$ from the environment.

- Update the hyperparameters:
- $\mu(x,a,x') \leftarrow \frac{ \lambda(x,a,x') \mu(x,a,x') + pr }{ \lambda(x,a,x') + p }$
- $\lambda(x,a,x') \leftarrow \lambda(x,a,x') + p$

- Update the current state:
- Set $x \leftarrow x'$.

^{1)}