# Measure-Theoretic Causality

My super-old causality slides can be found here. Try out the Colab tutorial with a causal reasoning engine in it.

## Paper

Subjectivity, Bayesianism, and Causality

Ortega, P.A.

Special Issue on Philosophical Aspects of Pattern Recognition

Pattern Recognition Letters, pp. 63-70, 2015

[PDF]

Algorithms for Causal Reasoning in Probability Trees

Genewein T., McGrath T., Delétang G., Mikulik V., Martic M., Legg S., **Ortega P.A.**

ArXiv:2010.12237, 2020

[PDF][Colab Tutorial]

## In a Nutshell

The paper discusses the relationship between agency and probability theory. It then introduces an abstract formalisation of causal dependencies in **measure-theoretic** terms that looks pretty much like the skeleton of a **probabilistic program**, that is, a causal generative process that gives rise to the things we can measure and control. The main points of the paper are:

- Main part: Bayesian probability theory is a model of subjectivity (skip this if you don't like continental philosophy!).
- Appendix A: Bayesian probability theory can be regarded as a model of passive observers of the world. To model interactive agents, actions have to be treated as causal interventions.
- Appendix B: There are interventions that **cannot be modelled** using directed acyclic graphs alone (e.g. in Bayesian causal induction). Thus, the paper introduces causal probability spaces, which extend the familiar probability spaces with causal information. Technically, the causal information is supplied as a partial order over a subset of privileged events called realisations. Causal spaces tell a story of how events are brought about.
- Appendix B: Finally, causal interventions are introduced. These are abstract manipulations that can have different implementations depending on the underlying causal structure of the world.

Below I give an informal description of the ideas in the paper. You will find the more formal stuff (all the rigorous definitions & proofs) in the paper.

## Intuition

Imagine we want to build a robot that interacts with the world. To model the agent's beliefs, we use the Bayesian framework. We do so by creating a causal generative model over all the possible lives of the robot, including all the things that could potentially happen to it during its life. This includes everything from the cycles of the internal battery, the orientation of the robot's servos, to the natural images that the sensors could pick up when the robot is running in the wild. The robot then learns about the world by conditioning this generative model on the data generated by Nature.

In this model, however, the robot is just a passive observer of the world with no ability to change it. The model can generate a sequence of random variables in which the robot features, but only as an embedded pattern, like a character in a film. In other words, it's as if the robot is watching itself exploring the environment, picking up stones, etc., **but without being the cause of its own actions**.

To implement the sense of agency, an additional ingredient is needed: the random variables that correspond to the robot's actions have to be treated as **causal interventions**. This creates an asymmetry in the random variables, delimiting the boundaries of the robot and defining an outward and inward flow of information corresponding to the robot's actions and observations respectively. (Note that this is also the crucial assumption that is necessary to derive **Thompson sampling** from causal principles, as explained here.)

To further insist on this point, let's look at another situation. The robot dreams about a conversation it is having with a friend who recounts the difficulties it had fixing a coffee machine earlier in the day. Since it's a dream, the dialogue is entirely generated by the robot's generative model. But then, why did the robot not anticipate the friend's story?

One explanation is that, even though the robot's brain simulates the entire dialogue, it carefully separates the information that the robot is supposed to know **from the information that is private to the friend**, in this case the friend's memories about fixing the machine. It's an information barrier that is kept up by the robot's brain in order to maintain the illusion of agency.

In game-theoretic jargon, we say that the conversation between the robot and the friend is implemented as an **extensive-form game with imperfect information** in which the robot doesn't know the friend's private information. Strikingly, the robot's actions cannot depend on the friend's private information, which mathematically translates into (essentially) the same conditional independences as in causal interventions (see Von Neumann & Morgenstern's classical game theory book, Chapter 7)!

Now, when the robot is awake, there is no reason to believe that the mechanisms that manage the distinction between “inside” and “outside” change in any fundamental way. The life of the robot can still be regarded as a (sensorimotor) dialogue between itself and Nature (and other agents within) modelled entirely by the robot's generative model.

Again, the distinction between the events generated by the robot and those generated by Nature is made by virtue of how the generative model is conditioned on the experience: the robot's actions must be treated as causal interventions. As a consequence, the robot will always perceive itself as the originator of its actions, even though these actions might internally have been generated through deterministic functions of the input (i.e. fully determined by the robot's state). For a formal treatment of this argument, check Appendix A of the paper at the top.

## Causal Probability Space

In the paper, the mathematical structure to represent the robot's generative model is the **causal probability space**, a measure-theoretic model of causality. Basically, it consists of a **sample space**, a privileged set of **states of realisations**, and a **causal probability measure**. Their definition is somewhat technical and beyond the scope of this tutorial, but to give you the gist of what this looks like, see the box below.

### Set of (States of) Realisations: Axioms

A set $\mathcal{R}$ of non-empty subsets of $\Omega$ is called a set of realisation states iff

- *The sure event is a realisation state*: $\Omega \in \mathcal{R}$.
- *Realisation states form a tree*: for each distinct $U, V \in \mathcal{R}$, either $U \cap V = \varnothing$, or $U \subset V$, or $V \subset U$.
- *The tree is complete*: for each $U, V \in \mathcal{R}$ where $V \subset U$, there exists a sequence $(V_n)_{n \in \mathbb{N}}$ in $\mathcal{R}$ such that $U \setminus V = \bigcup_n V_n$.
- *Every branch has a starting and an end point*: let $(V_n)_{n \in \mathbb{N}}$ be a sequence in $\mathcal{R}$ such that $V_n \uparrow V$ or $V_n \downarrow V$. Then $V \in \mathcal{R}$.
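For a finite sample space, the first three axioms can be checked mechanically (the fourth only bites for infinite sequences). The sketch below is an illustration, not code from the paper: the function name `is_realisation_set` and the example family of sets are made up here, and $V \subset U$ is read as strict inclusion.

```python
from itertools import combinations

def is_realisation_set(omega, family):
    """Check the first three axioms for a finite family of subsets of omega."""
    omega = frozenset(omega)
    states = {frozenset(s) for s in family}
    # Members must be non-empty subsets of the sample space.
    if any(not s or not s <= omega for s in states):
        return False
    # Axiom 1: the sure event is a realisation state.
    if omega not in states:
        return False
    # Axiom 2: two distinct states are either disjoint or nested.
    for u, v in combinations(states, 2):
        if u & v and not (u < v or v < u):
            return False
    # Axiom 3: for V strictly inside U, the remainder U \ V must be a union
    # of states. It is enough to take the union of all states that fit inside it.
    for u in states:
        for v in states:
            if v < u:
                rest = u - v
                covered = frozenset().union(*(s for s in states if s <= rest))
                if covered != rest:
                    return False
    return True

# Hypothetical example: eight outcomes grouped like the nodes of a binary tree.
OMEGA = range(1, 9)
NODES = [OMEGA, {1, 2, 3, 4}, {5, 6, 7, 8}, {1, 2}, {3, 4}, {5, 6}, {7, 8}]
NODES += [{i} for i in OMEGA]

print(is_realisation_set(OMEGA, NODES))                  # nodes of a tree
print(is_realisation_set(OMEGA, [OMEGA, {1, 2, 3, 4}]))  # {5, ..., 8} is left uncovered
```

The second family fails the completeness axiom: removing $\{1, 2, 3, 4\}$ from $\Omega$ leaves a remainder that no realisation states add up to.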

For our current exposition we adopt a simplified route. A convenient way of representing a causal probability space is in terms of a probability tree like the one below. It can be regarded as the tree of all possible executions of a probabilistic program.

In the example, the model describes the generative process of four binary random variables, namely $\theta$ (the physics of the world), $W$ (the weather), $B$ (the barometer), and $U$ (the strange variable). Every node in the tree corresponds to a state of realisation of the world.

This particular tree describes a total of 8 possible lives of the robot, each one starting at the root and ending in a leaf, where events higher up are the causes of the events further down. The tree models the typical barometer example used in causality textbooks, but with a twist: $\theta$ can choose the causal dependency between the weather and the barometer. The left side models our normal physics, in which the weather $W$ precedes the barometer $B$, whereas the right side represents an alternative physics in which the causal dependency between $W$ and $B$ is inverted. The two stories are necessary if we want the robot to learn this from experience. (Notice that this also suggests that causal dependencies are **imprinted onto/matched to** the data rather than “out there in the world”, in adherence to the Bayesian philosophy.)

The probability of each life is calculated by multiplying the probabilities along the path. These are given in the next table.

| Life | $\theta$ | $W$ | $B$ | $U$ | Prob. |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 3/16 |
| 2 | 0 | 0 | 1 | 0 | 1/16 |
| 3 | 0 | 1 | 0 | 1 | 1/16 |
| 4 | 0 | 1 | 1 | 1 | 3/16 |
| 5 | 1 | 0 | 0 | 1 | 3/16 |
| 6 | 1 | 1 | 0 | 1 | 1/16 |
| 7 | 1 | 0 | 1 | 1 | 1/16 |
| 8 | 1 | 1 | 1 | 1 | 3/16 |
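The path-product rule can be sketched in a few lines of Python. The nested-list encoding, the helper names `leaf` and `lives`, and the individual branch probabilities are assumptions of this sketch: the branch probabilities are the ones implied by the table (1/2 at the root, 1/2 for the first variable on each side, 3/4 for the second variable agreeing with the first), and $U$ is read off the table as 0 only when $\theta = 0$ and $W = 0$.

```python
from fractions import Fraction as F

def leaf(theta, w, b):
    # U is deterministic in the table: 0 only when theta = 0 and W = 0.
    return {"theta": theta, "W": w, "B": b, "U": 0 if (theta == 0 and w == 0) else 1}

# An inner node is a list of (branch probability, subtree); a leaf is a dict.
TREE = [
    (F(1, 2), [  # theta = 0: W is drawn before B
        (F(1, 2), [(F(3, 4), leaf(0, 0, 0)), (F(1, 4), leaf(0, 0, 1))]),
        (F(1, 2), [(F(1, 4), leaf(0, 1, 0)), (F(3, 4), leaf(0, 1, 1))]),
    ]),
    (F(1, 2), [  # theta = 1: B is drawn before W
        (F(1, 2), [(F(3, 4), leaf(1, 0, 0)), (F(1, 4), leaf(1, 1, 0))]),
        (F(1, 2), [(F(1, 4), leaf(1, 0, 1)), (F(3, 4), leaf(1, 1, 1))]),
    ]),
]

def lives(tree, prob=F(1)):
    """Yield (life, probability), multiplying the probabilities along each path."""
    if isinstance(tree, dict):
        yield tree, prob
        return
    for p, subtree in tree:
        yield from lives(subtree, prob * p)

for number, (life, prob) in enumerate(lives(TREE), start=1):
    print(number, life, prob)
```

The traversal visits the leaves in the same order as the table, so the printed probabilities can be compared row by row.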

## Observations

An observation constrains the possible realisations to the subset of lives that are consistent with it. For instance, if the robot observes $W = 0$, then the compatible lives are given by 1, 2, 5 and 7. These are all the lives that pass through the nodes highlighted in red:

The generative model is then conditioned by the observation $W = 0$ by removing the incompatible paths and then renormalising the rest—this is just the good old Bayes' rule. The resulting probabilities are shown in the table below.

| Life | $\theta$ | $W$ | $B$ | $U$ | Prob. |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 3/8 |
| 2 | 0 | 0 | 1 | 0 | 1/8 |
| 3 | 0 | 1 | 0 | 1 | - |
| 4 | 0 | 1 | 1 | 1 | - |
| 5 | 1 | 0 | 0 | 1 | 3/8 |
| 6 | 1 | 1 | 0 | 1 | - |
| 7 | 1 | 0 | 1 | 1 | 1/8 |
| 8 | 1 | 1 | 1 | 1 | - |
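Conditioning on an observation needs nothing more than the flat table of lives. The following is a minimal sketch; the dictionary `PRIOR` copies the prior probabilities of the eight lives, and the function name `condition` is chosen here for illustration.

```python
from fractions import Fraction as F

# (theta, W, B, U) -> prior probability of that life.
PRIOR = {
    (0, 0, 0, 0): F(3, 16), (0, 0, 1, 0): F(1, 16),
    (0, 1, 0, 1): F(1, 16), (0, 1, 1, 1): F(3, 16),
    (1, 0, 0, 1): F(3, 16), (1, 1, 0, 1): F(1, 16),
    (1, 0, 1, 1): F(1, 16), (1, 1, 1, 1): F(3, 16),
}

def condition(dist, event):
    """Bayes' rule on a finite set of lives: filter, then renormalise."""
    kept = {life: p for life, p in dist.items() if event(life)}
    total = sum(kept.values())  # probability of the observed event
    return {life: p / total for life, p in kept.items()}

# Observe W = 0 (W is the second coordinate of a life).
posterior = condition(PRIOR, lambda life: life[1] == 0)
for life, p in posterior.items():
    print(life, p)
```

Only the four lives with $W = 0$ survive, and each of their prior probabilities is divided by $P(W = 0) = 1/2$.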

## Interventions

The situation changes when the robot emits an action. Actions are a reflection of the robot's current state of knowledge, and thus should carry no surprise value, that is, no information. For this, the generative model has to be intervened: after the action is issued, the model has to be changed **so that the action appears to have been intended all along**. Subsequently, the robot can condition its model on the action. One of the main contributions of the paper above is the definition of interventions in these abstract causal probability spaces.

### A simple intervention

Loosely speaking, an **intervention** is a minimal change to the generative model (minimal in terms of information) such that a desired event is produced with probability one. In practice, this is implemented as follows. Let's say the robot enforces $W \leftarrow 0$. Then, the intervention is carried out in two steps:

1. **Critical bifurcations**: First we mark all the **critical bifurcations**, defined as the nodes that have:
   - at least one branch containing only incompatible lives (*e.g.* $W \neq 0$);
   - and at least one branch containing at least one compatible life (*e.g.* $W = 0$).
2. **Pruning**: For each critical bifurcation, prune all the branches that have only incompatible lives, renormalising the remaining branches afterwards.

Once the tree has been intervened, the robot can condition on its own action. For the case $W \leftarrow 0$, this looks as follows.

The critical bifurcations are highlighted in red. In this case, they were rather easy to identify because they have exactly two branches, a compatible and an incompatible one. The incompatible branches are pruned and the probability mass is placed onto the compatible branches. The probabilities of the robot's lives are now as follows:

| Life | $\theta$ | $W$ | $B$ | $U$ | Prob. |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 3/8 |
| 2 | 0 | 0 | 1 | 0 | 1/8 |
| 3 | 0 | 1 | 0 | 1 | - |
| 4 | 0 | 1 | 1 | 1 | - |
| 5 | 1 | 0 | 0 | 1 | 1/4 |
| 6 | 1 | 1 | 0 | 1 | - |
| 7 | 1 | 0 | 1 | 1 | 1/4 |
| 8 | 1 | 1 | 1 | 1 | - |

In particular, notice that the resulting probabilities of the intervention $W \leftarrow 0$ are different from the ones of the observation $W = 0$ that we have seen before. The difference is that an observation acts merely as a **filter** on the lives of the robot, whereas an action is an actual **change of the generative process** of the robot's life. Through the intervention, the robot creates a statistical asymmetry that allows it to disentangle the two causal hypotheses ($\theta = 0$ and $\theta = 1$); see the discussion on Bayesian causal induction.
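The two-step recipe (mark critical bifurcations, then prune and renormalise) can be sketched as a recursive function on a probability tree. This is an illustrative sketch rather than the paper's or the Colab's implementation: the nested-list encoding, the helper names, and the branch probabilities (inferred from the table of lives) are assumptions made here.

```python
from fractions import Fraction as F

def leaf(theta, w, b):
    # U is deterministic in the table: 0 only when theta = 0 and W = 0.
    return {"theta": theta, "W": w, "B": b, "U": 0 if (theta == 0 and w == 0) else 1}

# An inner node is a list of (branch probability, subtree); a leaf is a dict.
TREE = [
    (F(1, 2), [  # theta = 0: W is drawn before B
        (F(1, 2), [(F(3, 4), leaf(0, 0, 0)), (F(1, 4), leaf(0, 0, 1))]),
        (F(1, 2), [(F(1, 4), leaf(0, 1, 0)), (F(3, 4), leaf(0, 1, 1))]),
    ]),
    (F(1, 2), [  # theta = 1: B is drawn before W
        (F(1, 2), [(F(3, 4), leaf(1, 0, 0)), (F(1, 4), leaf(1, 1, 0))]),
        (F(1, 2), [(F(1, 4), leaf(1, 0, 1)), (F(3, 4), leaf(1, 1, 1))]),
    ]),
]

def lives(tree, prob=F(1)):
    """Yield (life, probability), multiplying the probabilities along each path."""
    if isinstance(tree, dict):
        yield tree, prob
        return
    for p, subtree in tree:
        yield from lives(subtree, prob * p)

def has_compatible(tree, event):
    """True if at least one life below this node satisfies the event."""
    if isinstance(tree, dict):
        return event(tree)
    return any(has_compatible(subtree, event) for _, subtree in tree)

def intervene(tree, event):
    """Prune incompatible branches at critical bifurcations and renormalise."""
    if isinstance(tree, dict):
        return tree
    keep = [(p, subtree) for p, subtree in tree if has_compatible(subtree, event)]
    if not keep:
        # Nothing compatible below: this node is not critical, so it is left
        # as is (a critical ancestor prunes the whole branch).
        return tree
    total = sum(p for p, _ in keep)  # equals 1 when no branch was pruned
    return [(p / total, intervene(subtree, event)) for p, subtree in keep]

after = intervene(TREE, lambda life: life["W"] == 0)
for life, prob in lives(after):
    print(life, prob)
```

With the numbers above, the intervened model gives $P(B = 0 \mid \theta = 0) = 3/4$ but $P(B = 0 \mid \theta = 1) = 1/2$, whereas after merely observing $W = 0$ both are $3/4$. A later reading of the barometer therefore carries information about $\theta$ only in the intervened model.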

Let's have a look at two more examples.

### Choosing the laws of physics

Perhaps the robot has the ability to determine the laws of physics in the world. It can choose between either a world like ours in which the weather determines the measurement of the barometer ($W \rightarrow B$), or an “alternative world” in which it can control the weather through the barometer ($B \rightarrow W$).

In the generative model, this is done by setting the value of $\theta$. In particular, let's imagine that the robot picks the alternative world. Then, it must set $\theta \leftarrow 1$, because this chooses the right half of the tree in which $B$ causally precedes $W$. The corresponding intervention looks as follows:

And the resulting probabilities are:

| Life | $\theta$ | $W$ | $B$ | $U$ | Prob. |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | - |
| 2 | 0 | 0 | 1 | 0 | - |
| 3 | 0 | 1 | 0 | 1 | - |
| 4 | 0 | 1 | 1 | 1 | - |
| 5 | 1 | 0 | 0 | 1 | 3/8 |
| 6 | 1 | 1 | 0 | 1 | 1/8 |
| 7 | 1 | 0 | 1 | 1 | 1/8 |
| 8 | 1 | 1 | 1 | 1 | 3/8 |
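As a worked check, assume the branch probabilities implied by the original table of lives: under $\theta = 1$, each value of $B$ has probability $1/2$, and $W$ agrees with $B$ with probability $3/4$. The only critical bifurcation is the root, so the intervention replaces the root factor $P(\theta = 1) = 1/2$ by $1$ and leaves the rest of each path untouched:

$$
\begin{aligned}
P(\text{life 5}) &= 1 \cdot \tfrac{1}{2} \cdot \tfrac{3}{4} = \tfrac{3}{8}, &
P(\text{life 6}) &= 1 \cdot \tfrac{1}{2} \cdot \tfrac{1}{4} = \tfrac{1}{8}, \\
P(\text{life 7}) &= 1 \cdot \tfrac{1}{2} \cdot \tfrac{1}{4} = \tfrac{1}{8}, &
P(\text{life 8}) &= 1 \cdot \tfrac{1}{2} \cdot \tfrac{3}{4} = \tfrac{3}{8}.
\end{aligned}
$$

Equivalently, the prior probabilities of lives 5 to 8 are simply doubled.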

The variable $\theta$ is an illustration of a **hyper-cause** or **higher-order cause**, *i.e.* a cause that controls the very causal dependency of other variables. It's important to point out that hyper-causes are not just pathological examples; on the contrary, they are necessary for modelling Bayesian causal induction, playing a role that is analogous to the one played by the latent parameters in a Bayesian model. This example shows that modelling and intervening on hyper-causes is straightforward in a probability tree.

### Macro-interventions

Next, we consider macro-interventions. The robot's action sets the value of the **strange variable** as $U \leftarrow 0$. In the probability tree, the intervention looks as follows:

Interestingly, through the application of the definition of a causal intervention, we get that the intervention $U \leftarrow 0$ chains together two simpler interventions, namely $\theta \leftarrow 0$ and $W \leftarrow 0$. This happened because the intervention $U \leftarrow 0$ has not one, but two critical bifurcations lying on the **same path leading up to the desired event** $U = 0$.

The table below lists the probabilities of the 2 possible lives that are left after the intervention.

| Life | $\theta$ | $W$ | $B$ | $U$ | Prob. |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 3/4 |
| 2 | 0 | 0 | 1 | 0 | 1/4 |
| 3 | 0 | 1 | 0 | 1 | - |
| 4 | 0 | 1 | 1 | 1 | - |
| 5 | 1 | 0 | 0 | 1 | - |
| 6 | 1 | 1 | 0 | 1 | - |
| 7 | 1 | 0 | 1 | 1 | - |
| 8 | 1 | 1 | 1 | 1 | - |
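To see why $U \leftarrow 0$ chains two interventions, it helps to list where its critical bifurcations sit. The sketch below is an illustration under assumptions: the tree encoding and branch probabilities are inferred from the table of lives, the function names are invented here, and a node is identified by the tuple of branch indices leading to it (so `()` is the root and `(0,)` is the node reached through $\theta = 0$).

```python
from fractions import Fraction as F

def leaf(theta, w, b):
    # U is deterministic in the table: 0 only when theta = 0 and W = 0.
    return {"theta": theta, "W": w, "B": b, "U": 0 if (theta == 0 and w == 0) else 1}

# An inner node is a list of (branch probability, subtree); a leaf is a dict.
TREE = [
    (F(1, 2), [  # theta = 0: W is drawn before B
        (F(1, 2), [(F(3, 4), leaf(0, 0, 0)), (F(1, 4), leaf(0, 0, 1))]),
        (F(1, 2), [(F(1, 4), leaf(0, 1, 0)), (F(3, 4), leaf(0, 1, 1))]),
    ]),
    (F(1, 2), [  # theta = 1: B is drawn before W
        (F(1, 2), [(F(3, 4), leaf(1, 0, 0)), (F(1, 4), leaf(1, 1, 0))]),
        (F(1, 2), [(F(1, 4), leaf(1, 0, 1)), (F(3, 4), leaf(1, 1, 1))]),
    ]),
]

def has_compatible(tree, event):
    """True if at least one life below this node satisfies the event."""
    if isinstance(tree, dict):
        return event(tree)
    return any(has_compatible(subtree, event) for _, subtree in tree)

def critical_bifurcations(tree, event, path=()):
    """Paths of nodes with both a compatible and a fully incompatible branch."""
    if isinstance(tree, dict):
        return []
    flags = [has_compatible(subtree, event) for _, subtree in tree]
    found = [path] if any(flags) and not all(flags) else []
    for index, (_, subtree) in enumerate(tree):
        found += critical_bifurcations(subtree, event, path + (index,))
    return found

print(critical_bifurcations(TREE, lambda life: life["U"] == 0))  # [(), (0,)]
print(critical_bifurcations(TREE, lambda life: life["W"] == 0))  # [(0,), (1, 0), (1, 1)]
```

For $U \leftarrow 0$ the two critical bifurcations are the root and its $\theta = 0$ child, which lie on one path and are therefore pruned in sequence. For $W \leftarrow 0$ the three critical bifurcations sit on different paths.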

This kind of **macro-intervention** is important when the agent wants to execute plans, abstracting away from how they are implemented. For instance, let's say the robot is driving a car. When the robot wants to steer the car to the left, it performs the **action directly**. The details, such as rotating the steering wheel to the left, pushing down on the brake pedal, etc., are then just consequences of the bigger plan. How the big plan is translated into a final chain of actions is **figured out automatically** by the generative model.

## Conclusions

So why is this important? Let's quickly review what we have seen:

- **Modelling the whole interaction stream:** An agent that wants to act in the world must be able to predict both its actions and its observations. In particular, agents can also be uncertain about their own policy, and about the causal dependencies in the world.
- **Internally versus externally generated evidence:** To learn, the agent must carefully distinguish between the random variables generated by itself (which do not generate evidence) and the random variables generated by the world (which provide evidence). It does so using an accounting trick: it treats actions as causal interventions, which erase the information.
- **Interventions in a probability tree:** We've also defined a very general type of intervention. We can even use it to intervene on probabilistic programs.

Equipped with this knowledge, we can now avoid generating spurious evidence when training our probabilistic models.