Article under construction.
Imitation is a powerful learning mechanism in animals (including humans) which, unlike operant conditioning, the core mechanism of reinforcement learning, does not require first-person experience. If understood, imitation could open up new avenues for artificial intelligence, such as training highly general adaptive agents from observed demonstrations, e.g. behaviors acquired within a social context or conveyed through analogies and metaphors. Unfortunately, imitation learning is currently not well understood except in very special cases.
The policy learning paradigm behind today's reinforcement learning (RL) algorithms is based on operant conditioning (OC). OC works well when the learner can gather first-person experience and evaluate it against a known reward signal.
Imitation is another form of learning which is ubiquitous in animals1). In this paradigm, an animal can learn to solve a task by matching its own behavior to that observed in another animal. At the root of this learning scheme is a mechanism which converts third-person (other) experience into first-person (self) experience. In this narrow form the imitated behavior is literal, but more advanced forms of imitation include acquiring new behaviors through analogies, metaphors, and language in general.
Because the third-person requirement violates the first-person experience-acquisition assumption of OC, imitation learning cannot be cast in terms of RL. First-person experience contains causal information which third-person experience simply lacks. Therefore, transforming third-person into first-person experience requires the addition of the missing causal information by the learner themself.
For concreteness, let's consider the simplest imitation scenario. The goal of the learner is to imitate the demonstrator as closely as possible after seeing multiple demonstrations. We'll assume each demonstration consists of many repeated turns between the demonstrator and the environment. In each turn, the demonstrator emits a choice $X$ and the environment replies with an answer $Y$ (time indices are omitted), as in a multi-armed bandit problem. The difference from bandits is that there is no known preference/reward function associated with the responses $Y$ (in fact, it could be context-dependent).
The root of the challenge in transforming third-person into first-person experience lies in the difference between actions and observations. From a subjective point of view, actions are random variables chosen by the learner (self) and observations are random variables chosen by the demonstrator or environment (other). For imitation, the learner must answer questions of the form: If I see $Y$ change when they manipulate $X$, then what will happen to $Y$ when I manipulate $X$?
At first glance, this task seems trivial: the learner could regress the joint probability distribution $P(X,Y)$ and then use $P(X)$ as a model for the correct choice and $P(Y|X)$ as a model for how the choice $X$ impacts the consequence $Y$. But often there is a confounding variable $\theta$, hidden from the learner's view, which informs the choice of the observed manipulation: causally, $\theta \rightarrow X$ and $(\theta, X) \rightarrow Y$. This confounder holds the choice intention about the state of the world, linking the manipulation $X$ and the outcome $Y$: $$ P(x, y) = \sum_\theta P(\theta)P(x|\theta)P(y|x, \theta). \qquad (1) $$
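To make the structure of (1) concrete, here is a minimal simulation sketch in Python; all distributions and numbers are hypothetical, chosen only for illustration. The intention $\theta$ is sampled in every turn but never shown to the learner, and it biases both the choice and the consequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical confounded schema: two intentions, two choices, two outcomes.
P_theta = np.array([0.5, 0.5])                  # P(theta)
P_x_given_theta = np.array([[0.9, 0.1],         # P(x | theta=0)
                            [0.1, 0.9]])        # P(x | theta=1)
P_y1_given_x_theta = np.array([[0.9, 0.1],      # P(y=1 | x, theta); rows: x, cols: theta
                               [0.1, 0.9]])     # the choice pays off only under the
                                                # intention that motivated it

def demonstration_turn():
    """One third-person turn: theta is sampled but never revealed."""
    theta = rng.choice(2, p=P_theta)
    x = rng.choice(2, p=P_x_given_theta[theta])
    y = int(rng.random() < P_y1_given_x_theta[x, theta])
    return x, y                                  # the learner only sees (x, y)

demonstrations = [demonstration_turn() for _ in range(10_000)]
```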
As an example, it could be that the demonstrator chose $X=x$ because they knew it would lead to the outcome $Y=y$, the most desirable consequence given the state of the world and intentions hidden from view. Here, both the relevant features of the state of the world and the preferences are encapsulated in the choice intention $\theta$.
This implies that $P(Y|X)$ will predict well what will happen when the demonstrator chooses $X$, but it won't predict what will happen when the learner chooses $X$. This last prediction differs because the learner's choices, even when imitating, are based on their own subjective information state, which is ignorant of the unobserved intention $\theta$ and thus unable to implement the necessary causal dependency between $X$ and $\theta$ the same way the demonstrator did.
To understand what will happen when we substitute the learner for the demonstrator, we need $P(Y|\text{do}(X))$, i.e. the distribution over $Y$ when $X$ is chosen independently of $\theta$, also known in causal lingo as the effect on $Y$ of intervening on $X$.
Let's compare them. Formally, $P(Y|X)$ is $$ P(y|x) = \sum_\theta P(\theta) P(y|x, \theta) R(x, \theta) \qquad (2) $$ where the $R(x, \theta)$ are 'information coupling terms' 2) defined as $$ R(x, \theta) = \frac{ P(x,\theta) }{ P(x)P(\theta) } \qquad (3) $$ whereas $P(Y|\text{do}(X))$ is $$ P(y|\text{do}(x)) = \sum_\theta P(\theta) P(y|x, \theta), \qquad (4) $$ that is, without coupling terms (or, $R(x,\theta)=1$, meaning that no information is transmitted between $x$ and $\theta$).
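With the toy numbers from the sketch above (again, purely hypothetical), both quantities can be computed in closed form, and the gap created by the coupling terms (3) is easy to see:

```python
import numpy as np

# Same hypothetical schema as in the earlier sketch.
P_theta = np.array([0.5, 0.5])
P_x_given_theta = np.array([[0.9, 0.1], [0.1, 0.9]])     # rows: theta, cols: x
P_y1_given_x_theta = np.array([[0.9, 0.1], [0.1, 0.9]])  # rows: x, cols: theta

P_x = P_theta @ P_x_given_theta                           # marginal P(x)
P_theta_given_x = (P_x_given_theta * P_theta[:, None]) / P_x[None, :]

for x in (0, 1):
    # (2): observational prediction, theta weighted by P(theta | x)
    obs = sum(P_theta_given_x[t, x] * P_y1_given_x_theta[x, t] for t in (0, 1))
    # (4): interventional prediction, theta weighted by the prior P(theta)
    do = sum(P_theta[t] * P_y1_given_x_theta[x, t] for t in (0, 1))
    print(f"x={x}:  P(y=1|x) = {obs:.2f}   P(y=1|do(x)) = {do:.2f}")
```

With these numbers the demonstrator's choices look highly effective, $P(y=1|x) = 0.82$, yet the same choices made blindly only achieve $P(y=1|\text{do}(x)) = 0.5$: exactly the gap a naive imitator falls into.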
The core mechanism of imitation learning then consists of the following. If the behavioral schema is known, by which we mean that the learner knows the joint $P(\theta, X, Y)$ and the underlying causal graph, then instead of sampling the choice $X \sim P(X|\theta)$ (which is impossible because $\theta$ is latent), the first-person version chooses $X$ from its nearest distribution, namely the marginal $P(X)$ which averages over all the settings of the latent variable $\theta$: $$ P(x) = \sum_\theta P(\theta) P(x|\theta). \qquad (5) $$ The response of the environment will then come from $P(Y|\text{do}(X))$, which allows the learner to refine its estimate of the unknown intention from the prior $P(\theta)$ to the posterior $P(\theta|\text{do}(X), Y)$. This transformation avoids the self-delusion problem pointed out in Ortega et al. 2021.
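A minimal sketch of one such first-person turn, continuing with the same hypothetical numbers: the learner samples its choice from the marginal (5), the world answers from $P(Y|\text{do}(X))$, and the posterior over the intention is updated without a $P(x|\theta)$ factor, since the learner's own choice carries no information about $\theta$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same hypothetical schema as before.
P_theta = np.array([0.5, 0.5])
P_x_given_theta = np.array([[0.9, 0.1], [0.1, 0.9]])     # rows: theta, cols: x
P_y1_given_x_theta = np.array([[0.9, 0.1], [0.1, 0.9]])  # rows: x, cols: theta

# (5): the first-person policy is the marginal over the latent intention.
P_x = P_theta @ P_x_given_theta
x = rng.choice(2, p=P_x)

# Nature draws theta and answers from P(y | do(x)) = sum_theta P(theta) P(y|x,theta).
theta = rng.choice(2, p=P_theta)
y = int(rng.random() < P_y1_given_x_theta[x, theta])

# Posterior update after the interventional turn:
# P(theta | do(x), y)  ∝  P(theta) P(y | x, theta)   -- no P(x|theta) term.
lik = P_y1_given_x_theta[x] if y == 1 else 1.0 - P_y1_given_x_theta[x]
posterior = P_theta * lik
posterior /= posterior.sum()
print(f"do(x={x}), y={y}:  P(theta | do(x), y) = {posterior.round(2)}")
```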
Next let's discuss approaches to imitation learning. It is important to keep in mind that the learner does not know the reward function of the demonstrator.
If there is no confounding variable, or if the confounding variable $\theta$ is a deterministic function of the observable context $\theta = f(C)$, then $$ P(Y|X,C) = P(Y|\text{do}(X), C). \qquad (6) $$ Hence watching a demonstration is like acting oneself. Therefore in this case it is safe for the agent to learn the joint $P(C, X, Y)$, and then use the conditional $P(X | C)$ for choosing its actions to obtain the effect $P(Y|X,C)$. This is an important special case. In causal lingo, we say that $C$ screens off $\theta$.
This causal sufficiency is typically the (tacit) assumption in the imitation learning literature. For instance, if the demonstrator chooses optimal actions in an MDP, then their policy will be a function of the state, which can be safely learned and used. But this assumption won't hold in a typical POMDP because the learner can't see the beliefs of the demonstrator.
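When the context is causally sufficient, plain behavioral cloning is justified. Here is a small sketch under that assumption; the context variable, the demonstrator's policy, and the consequence rule are all made up for illustration. The learner simply regresses $P(X|C)$ from the recorded triples.

```python
from collections import Counter, defaultdict
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical demonstrations where the observable context c determines the
# intention (theta = f(c)), so c screens off theta.
def demo_turn():
    c = rng.choice(2)                        # observable context
    x = c if rng.random() < 0.9 else 1 - c   # demonstrator's policy P(x | c)
    y = int(x == c)                          # consequence depends only on (x, c)
    return c, x, y

demos = [demo_turn() for _ in range(5_000)]

# Tabular behavioral cloning: estimate P(x | c) by counting.
counts = defaultdict(Counter)
for c, x, _ in demos:
    counts[c][x] += 1
policy = {c: {x: n / sum(cnt.values()) for x, n in cnt.items()}
          for c, cnt in counts.items()}
print(policy)   # safe to execute, because here P(y | x, c) = P(y | do(x), c)
```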
A limited form of supervision can be achieved when a teacher directly supervises the learner while the learner performs the task: think of a driving instructor sitting next to the student during a lesson, providing instantaneous feedback, or a parent teaching their child how to walk.
This is a hybrid first/third-person setting, because the learner chooses the actions themself, and then gets immediately told the best action in hindsight by the teacher. The rationale here is as follows: since the learner chooses the actions themself, they get to observe the effect of their own action $P(Y|\text{do}(X))$; and since the teacher then provides the best action in hindsight, the learner also separately observes the policy $P(X)$. In this case, it is safe for the learner to regress $P(X)$ and $P(Y|\text{do}(X))$ as long as no information flows from the consequence $Y$ back through the choice $X$ into the policy parameters of $P(X)$. This last constraint is typically achieved via a stop-gradient in a deep learning implementation. This makes sure the learner's policy is acquired exclusively using the teacher's instructions, and not from the action's consequences.
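A sketch of what such a hybrid training step could look like in PyTorch; the network shapes, names, and losses are assumptions made for illustration, not a prescribed implementation. The policy is trained only on the teacher's hindsight label, while the consequence model conditions on the learner's choice through a stop-gradient (`detach`), so nothing about $y$ leaks back into the policy parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

policy = nn.Linear(4, 2)            # logits over 2 choices from a 4-d context (hypothetical sizes)
consequence = nn.Linear(4 + 2, 1)   # predicts the consequence y from (context, choice)
opt = torch.optim.Adam(list(policy.parameters()) + list(consequence.parameters()), lr=1e-3)

def hybrid_step(context, teacher_x, env_y):
    """context: (B, 4) float, teacher_x: (B,) long, env_y: (B,) float."""
    logits = policy(context)

    # The policy P(X) is regressed exclusively from the teacher's instruction.
    policy_loss = F.cross_entropy(logits, teacher_x)

    # The consequence model P(Y | do(X)) sees the learner's own choice through
    # a stop-gradient, so the loss on y cannot shape the policy parameters.
    choice = logits.softmax(-1).detach()
    y_pred = consequence(torch.cat([context, choice], dim=-1)).squeeze(-1)
    consequence_loss = F.mse_loss(y_pred, env_y)

    opt.zero_grad()
    (policy_loss + consequence_loss).backward()
    opt.step()
```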
A grand challenge of imitation learning is to learn from pre-recorded demonstrations without access to the first-person experience available in the hybrid case. Intuitively, this problem is akin to inferring the implicit intentions of the demonstrator, i.e. finding an explanation that justifies their choices.
Since $\theta$ is in general unobservable and arbitrary, the best the learner can do is to propose a subjective explanation $\psi$ as a proxy for $\theta$, for instance using a generative model. The objective is to obtain a complete behavioral schema characterizing the demonstration. In the ideal case the learner would attain $$ \begin{align*} P(X|\psi) &= P(X|\theta), \\ P(Y|X, \psi) &= P(Y|X, \theta). \end{align*} $$ This can then be converted into first-person view by sampling the choice $X$ from the marginal, analogous to (5). The subsequent update will refine the estimate of $\psi$: in the limit case, $P(\psi)$ could concentrate so that it effectively provides a causally sufficient context.
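One simple way to propose such an explanation is to fit a latent-variable model to the recorded pairs, for instance by expectation-maximization. The sketch below, reusing the earlier hypothetical schema, fits a two-valued explanation $\psi$ with parameters $P(\psi)$, $P(x|\psi)$ and $P(y|x,\psi)$. Note that with such coarse observations $\psi$ is not guaranteed to match the true $\theta$; it is only a subjective proxy that reproduces the demonstrations.

```python
import numpy as np

rng = np.random.default_rng(3)

# Recorded demonstrations from the earlier hypothetical confounded schema.
P_theta = np.array([0.5, 0.5])
P_x_given_theta = np.array([[0.9, 0.1], [0.1, 0.9]])     # rows: theta, cols: x
P_y1_given_x_theta = np.array([[0.9, 0.1], [0.1, 0.9]])  # rows: x, cols: theta
theta = rng.choice(2, size=20_000, p=P_theta)
x = (rng.random(20_000) < P_x_given_theta[theta, 1]).astype(int)
y = (rng.random(20_000) < P_y1_given_x_theta[x, theta]).astype(int)

# EM for a two-valued explanation psi: pi = P(psi), a = P(x=1|psi), b = P(y=1|x,psi).
K = 2
pi, a = np.full(K, 0.5), rng.uniform(0.3, 0.7, K)
b = rng.uniform(0.3, 0.7, (K, 2))                        # b[psi, x]
for _ in range(200):
    # E-step: responsibilities r[n, k] = P(psi=k | x_n, y_n)
    lik_x = np.where(x[:, None] == 1, a, 1 - a)
    lik_y = np.where(y[:, None] == 1, b[:, x].T, 1 - b[:, x].T)
    r = pi * lik_x * lik_y
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate the behavioral schema under the explanation psi
    pi = r.mean(axis=0)
    a = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    for xv in (0, 1):
        mask = x == xv
        b[:, xv] = (r[mask] * y[mask][:, None]).sum(axis=0) / r[mask].sum(axis=0)

print("P(psi) ≈", pi.round(2), " P(x=1|psi) ≈", a.round(2), " P(y=1|x,psi) ≈", b.round(2))
```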
The two previous settings, supervision by a teacher and learning from recorded demonstrations, can be combined. Training can be done seamlessly when $x = f_X(\psi)$ and $y = f_Y(x, \psi)$, using feedback $x$ from the teacher and $y$ from the environment, and it requires only one stop-gradient from $y$ to $x$, preventing the information about $y$ from flowing into $\psi$.
I made a case for the study of imitation learning. It goes beyond reinforcement learning in that it allows for the acquisition of new behavioral schemas (including new purposes) from observation. We looked at literal imitation here, but in a more advanced interpretation, imitation is what makes language so powerful, allowing for the on-the-fly synthesis of novel behaviors based on descriptions, analogies, and metaphors.
At the heart of imitation is a mechanism translating third-person into first-person experience. This translation requires the learner to subjectively supply causal information (in the form of explanations) that is not present in the data. In this sense, imitation requires a supporting cognitive structure, and cannot work purely based on data.
But even when successful, imitation comes with new challenges, especially from the point of view of safety. New schemas—because they encode new preferences—bring along new behavioral drives and objectives (potentially of moral or ethical nature) that operate independently of the hard-wired reward function a learner might have. Thus, careful data curation, perhaps in the form of designing special curricula, might become a central concern for training imitation-based agents.