Article under construction.
Imitation is a potent learning mechanism observed across the animal kingdom, enabling individuals to acquire behaviors without direct, first-person experience. Unlike operant conditioning (the core mechanism behind reinforcement learning), which relies on personal actions and their consequences, imitation allows for learning through observation, bypassing the need for direct reinforcement.
Understanding imitation could revolutionize artificial intelligence by facilitating the development of adaptive agents capable of learning from demonstrations, analogies, and metaphors. This approach would enable AI systems to grasp complex tasks through observation, much like humans do, enhancing their versatility and efficiency.
Cite as: Ortega, P.A. “How to translate third-person into first-person?”, Tech Note 4, Daios, 2024.
Operant conditioning versus imitation: The dominant paradigm in AI for policy learning is reinforcement learning (RL). In turn, RL is based on operant conditioning (OC), which necessitates:
These elements are essential for learning through direct interaction with the environment.
But learning like this is extremely limited. Most of what we know about the world does not come from first-person experience!
Imitation is another form of learning which is ubiquitous in animals. Imitation learning, however, involves translating third-person (observed) experiences into first-person (self) knowledge. This process requires the learner to infer causal relationships from observations, effectively reconstructing the underlying principles behind observed behaviors. Such a transformation is challenging because third-person observations lack the direct causal feedback inherent in personal experience.
To bridge this gap, learners must augment observed information with inferred causal understanding, enabling them to internalize and replicate behaviors accurately. This capability not only broadens the scope of learning beyond personal experience but also opens new pathways for developing AI systems that learn and adapt in more human-like ways.
For concreteness, let's consider the simplest imitation scenario. The goal of the learner is to imitate the demonstrator as closely as possible after seeing multiple demonstrations. We'll assume each demonstration consists of many repeated turns between the demonstrator and the environment. In each turn, the demonstrator emits a choice $X$ and the environment replies with an answer $Y$ (time indices are omitted), as in a multi-armed bandit problem. The difference from standard bandits is that there is no known preference/reward function associated with the responses $Y$ (in fact, it could be context-dependent).
The root of the challenge in transforming third-person into first-person experience lies in the difference between actions and observations. From a subjective point of view, actions are random variables chosen by the learner (self) and observations are random variables chosen by the demonstrator or environment (other). For imitation, the learner must answer questions of the form:
If I see $Y$ change when they manipulate $X$, then what will happen to $Y$ when I manipulate $X$?
At first glance, this task seems trivial: the learner could regress the joint probability distribution $P(X,Y)$ and then use $P(X)$ as a model for the correct choice and $P(Y|X)$ as a model for how the choice $X$ impacts the consequence $Y$. But often there is a confounding variable $\theta$, hidden from the learner's view, which informs the choice of the observed manipulation: causally, $\theta \rightarrow X$, $(\theta, X) \rightarrow Y$. This confounder holds the choice intention about the state of the world, linking the manipulation $X$ and the outcome $Y$: $$ P(x, y) = \sum_\theta P(\theta)P(x|\theta)P(y|x, \theta). \qquad (1) $$
As an example, it could be that the demonstrator chose $X=x$ because they knew it would lead to the outcome $Y=y$, the demonstrator's most desirable consequence given the state of the world and their (hidden) intentions. Here, both the relevant features of the state of the world and the preferences are encapsulated in the choice intention $\theta$.
This implies that $P(Y|X)$ will predict well what will happen when the demonstrator chooses $X$, but it won't predict what will happen when the learner chooses $X$. This last prediction differs because the learner's choices, even when imitating, are based on their own subjective information state, which is ignorant of the unobserved intention $\theta$ and thus unable to implement the necessary causal dependency between $X$ and $\theta$ the way the demonstrator did.
To understand what will happen when we replace the demonstrator with the learner, we need $P(Y|\text{do}(X))$, i.e. the distribution over $Y$ when $X$ is chosen independently of $\theta$, also known in causal lingo as the effect on $Y$ of the intervention on $X$.
Let's compare them. Formally, $P(Y|X)$ is $$ P(y|x) = \sum_\theta P(\theta) P(y|x, \theta) R(x, \theta) \qquad (2) $$ where the $R(x, \theta)$ are terms that couple information, defined as $$ R(x, \theta) = \frac{ P(x,\theta) }{ P(x)P(\theta) } \qquad (3) $$ whereas $P(Y|\text{do}(X))$ is $$ P(y|\text{do}(x)) = \sum_\theta P(\theta) P(y|x, \theta), \qquad (4) $$ that is, without coupling terms (or, equivalently, with $R(x,\theta)=1$, meaning that no information is transmitted between $x$ and $\theta$).
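To make the gap between (2) and (4) tangible, here is a minimal sketch on a toy binary model. The numbers (a uniform prior over $\theta$, a demonstrator who picks $x=\theta$ with probability 0.9, and an outcome that succeeds with probability 0.9 when $x=\theta$) are illustrative assumptions, not taken from the note.

```python
# Toy confounded bandit; every number below is an illustrative assumption.
THETAS = (0, 1)
P_THETA = {0: 0.5, 1: 0.5}          # prior over the hidden intention


def p_x_given_theta(x, theta):
    """Demonstrator policy: mostly picks the choice matching theta."""
    return 0.9 if x == theta else 0.1


def p_y_given_x_theta(y, x, theta):
    """Environment: outcome y=1 is likely only when x matches theta."""
    p_success = 0.9 if x == theta else 0.1
    return p_success if y == 1 else 1.0 - p_success


def p_x(x):
    """Marginal over choices, averaging out theta."""
    return sum(P_THETA[t] * p_x_given_theta(x, t) for t in THETAS)


def coupling(x, theta):
    """R(x, theta) = P(x, theta) / (P(x) P(theta)), as in (3)."""
    return P_THETA[theta] * p_x_given_theta(x, theta) / (p_x(x) * P_THETA[theta])


def p_y_given_x(y, x):
    """Observational prediction, as in (2)."""
    return sum(P_THETA[t] * p_y_given_x_theta(y, x, t) * coupling(x, t)
               for t in THETAS)


def p_y_do_x(y, x):
    """Interventional prediction, as in (4): no coupling terms."""
    return sum(P_THETA[t] * p_y_given_x_theta(y, x, t) for t in THETAS)


observational = p_y_given_x(1, 1)
interventional = p_y_do_x(1, 1)
print(f"P(y=1 | x=1)     = {observational:.2f}")
print(f"P(y=1 | do(x=1)) = {interventional:.2f}")
```

Under these assumed numbers, watching the demonstrator pick $x=1$ predicts success with probability 0.82, while the learner picking $x=1$ blindly succeeds only half the time.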
The core mechanism of imitation learning then consists of the following. If the behavior schema is known, by which we mean that the learner knows the joint $P(\theta, X, Y)$ and the underlying causal graph, then instead of choosing $X$ by sampling it according to $P(X|\theta)$ (which is impossible because $\theta$ is unknown), the first-person version chooses $X$ from its nearest distribution, namely the marginal $P(X)$ which averages over all the settings of the latent variable $\theta$: $$ P(x) = \sum_\theta P(\theta) P(x|\theta). \qquad (5) $$ The response of the environment will then come from $P(Y|\text{do}(X))$, which allows the learner to refine its estimate of the unknown intention from the prior $P(\theta)$ to the posterior $P(\theta|\text{do}(X), Y)$. This transformation avoids the self-delusion problem pointed out in Ortega et al. 2021.
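The belief update described above can be sketched as follows, again on an assumed toy binary model (the probabilities and function names are hypothetical). The point of the sketch is the difference between the two updates: when the learner chose $x$ itself, the choice carries no evidence about $\theta$, so the factor $P(x|\theta)$ must be left out. Including it is one way to read the self-delusion problem mentioned above.

```python
# Toy model; all probabilities are illustrative assumptions.
P_THETA = {0: 0.5, 1: 0.5}


def p_x_given_theta(x, theta):
    return 0.9 if x == theta else 0.1


def p_y_given_x_theta(y, x, theta):
    p_success = 0.9 if x == theta else 0.1
    return p_success if y == 1 else 1.0 - p_success


def normalize(weights):
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def posterior_after_own_action(prior, x, y):
    """P(theta | do(x), y): x was an intervention, so only y is evidence."""
    return normalize({t: prior[t] * p_y_given_x_theta(y, x, t) for t in prior})


def posterior_after_observed_action(prior, x, y):
    """P(theta | x, y): valid only when the demonstrator chose x."""
    return normalize({t: prior[t] * p_x_given_theta(x, t)
                      * p_y_given_x_theta(y, x, t) for t in prior})


# The learner tried x=1 on its own and the outcome was a failure (y=0).
first_person = posterior_after_own_action(P_THETA, x=1, y=0)
third_person = posterior_after_observed_action(P_THETA, x=1, y=0)
print("P(theta | do(x=1), y=0):", first_person)
print("P(theta | x=1, y=0):    ", third_person)
```

With these numbers the interventional update moves most of the belief to $\theta=0$, whereas the observational update applied to the learner's own action stays at the prior, because the action is wrongly counted as evidence for $\theta=1$ and cancels the evidence from the outcome.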
Next let's discuss approaches to imitation learning. Throughout our discussion below, it is important to always keep in mind that the learner does not know the reward function of the demonstrator.
If there is no confounding variable, or if the confounding variable $\theta$ is a deterministic function of the observable context, $\theta = f(C)$, then $$ P(Y|X,C) = P(Y|\text{do}(X), C). \qquad (6) $$ Hence watching a demonstration is like acting oneself. Therefore in this case it is safe for the agent to learn the joint $P(C, X, Y)$, and then use the conditional $P(X | C)$ for choosing its actions to obtain the effect $P(Y|X,C)$. This is an important special case. In causal lingo, we say that $C$ screens off $\theta$.
This causal sufficiency is typically the (tacit) assumption in the imitation learning literature. For instance, if the demonstrator chooses optimal actions in an MDP, then their policy is a function of the observable state, and hence it can be safely learned and used. But this assumption won't hold in a typical POMDP because the learner can't see the beliefs of the demonstrator.
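Equality (6) can be checked numerically on a small example. The sketch below assumes a three-valued context $C$, a hypothetical map $f(c) = c \bmod 2$, and the same kind of toy policy and environment as before. None of these specifics come from the note.

```python
from itertools import product

# Toy model; contexts, f, and all probabilities are illustrative assumptions.
CONTEXTS = (0, 1, 2)
THETAS = (0, 1)
P_C = {c: 1.0 / 3.0 for c in CONTEXTS}


def f(c):
    """Hypothetical deterministic map from context to intention."""
    return c % 2


def p_theta_given_c(theta, c):
    return 1.0 if theta == f(c) else 0.0


def p_x_given_theta(x, theta):
    return 0.9 if x == theta else 0.1


def p_y_given_x_theta(y, x, theta):
    p_success = 0.9 if x == theta else 0.1
    return p_success if y == 1 else 1.0 - p_success


def joint(c, theta, x, y):
    return (P_C[c] * p_theta_given_c(theta, c)
            * p_x_given_theta(x, theta) * p_y_given_x_theta(y, x, theta))


def p_y_given_x_c(y, x, c):
    """Observational conditional, computed from the joint."""
    num = sum(joint(c, t, x, y) for t in THETAS)
    den = sum(joint(c, t, x, yy) for t in THETAS for yy in (0, 1))
    return num / den


def p_y_do_x_c(y, x, c):
    """Interventional conditional: x is set independently of theta."""
    return sum(p_theta_given_c(t, c) * p_y_given_x_theta(y, x, t)
               for t in THETAS)


max_gap = max(abs(p_y_given_x_c(y, x, c) - p_y_do_x_c(y, x, c))
              for c, x, y in product(CONTEXTS, (0, 1), (0, 1)))
print("largest |P(y|x,c) - P(y|do(x),c)| =", max_gap)
```

Because the context pins down $\theta$ exactly, the observed choice adds no further information about it, and the two conditionals coincide up to floating-point error.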
A limited form of supervision can be achieved when a teacher directly supervises the learner while performing the task—think of it as having the driving instructor sitting next to the student during a driving lesson, providing instantaneous feedback, or a parent teaching their child how to walk.
This is a hybrid first/third-person setting, because the learner chooses the actions themselves and is then immediately told the best action in hindsight by the teacher. The rationale here is as follows:
In this case, it is safe for the learner to regress $P(X)$ and $P(Y|\text{do}(X))$ as long as there is no information flowing from the choice $X$ back into the policy parameters of $P(X)$. The last constraint is typically achieved via a stop-gradient in a deep learning implementation. This makes sure the learner's policy is acquired exclusively using the teacher's instructions, and not from the action's consequences.
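A minimal sketch of this hybrid setting follows, under several assumptions that are not in the note: a binary toy environment, a teacher whose hindsight label is simply $\theta$, a softmax policy trained by a hand-written cross-entropy gradient, and a count-based outcome model. There is no autodiff here, so the stop-gradient is represented structurally: the policy update reads only the teacher's label, never the sampled choice $x$ or its consequence $y$.

```python
import math
import random

# Toy setup; all constants and the teacher rule are illustrative assumptions.
rng = random.Random(0)
P_THETA_1 = 0.7      # probability that the hidden intention is 1
LEARNING_RATE = 0.01
NUM_TURNS = 5000


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


logits = [0.0, 0.0]                  # policy parameters for P(X)
counts = {0: [0, 0], 1: [0, 0]}      # counts[x][y] for P(Y | do(X))

for _ in range(NUM_TURNS):
    theta = 1 if rng.random() < P_THETA_1 else 0
    probs = softmax(logits)

    # First person: the learner acts from its own policy, blind to theta.
    x = 1 if rng.random() < probs[1] else 0
    p_success = 0.9 if x == theta else 0.1
    y = 1 if rng.random() < p_success else 0

    # Third person: the teacher reveals the best action in hindsight.
    x_teacher = theta

    # Policy update uses the teacher's label only (cross-entropy gradient).
    # Neither x nor y enters here, which plays the role of the stop-gradient.
    for a in (0, 1):
        target = 1.0 if a == x_teacher else 0.0
        logits[a] += LEARNING_RATE * (target - probs[a])

    # Outcome model is regressed from the learner's own interventions.
    counts[x][y] += 1

policy = softmax(logits)
p_y1_do_x1 = counts[1][1] / sum(counts[1])
print(f"learned P(x=1)           ~ {policy[1]:.2f}")
print(f"learned P(y=1 | do(x=1)) ~ {p_y1_do_x1:.2f}")
```

In this assumed setup the policy should drift toward the teacher's marginal (about 0.7 for $x=1$), and the outcome estimate toward $0.7 \cdot 0.9 + 0.3 \cdot 0.1 = 0.66$. Exact values depend on the random seed and learning rate.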
A grand challenge of imitation learning is to learn from pre-recorded demonstrations without access to first-person experience as in the hybrid case. Intuitively, this problem is like inferring the implicit intentions of the demonstrator, i.e. finding an explanation in order to justify their choices.
Since $\theta$ is in general unobservable and arbitrary, the best the learner can do is to propose a subjective explanation $\psi$ as a proxy for $\theta$, for instance using a generative model. The objective is to obtain a complete behavioral schema characterizing the demonstration. In the ideal case the learner would attain $$ \begin{align*} P(X|\psi) &= P(X|\theta) \\ P(Y|X, \psi) &= P(Y|X, \theta) \end{align*} $$ This can then be converted into first-person view by sampling the choice $X$ from the marginal, analogous to (5). The subsequent update will refine the estimate of $\psi$: in the limit case, $P(\psi)$ could concentrate so that it effectively provides a causally sufficient context.
Combine cases 2 and 3! Training can be done seamlessly if $x = f_X(\psi)$ and $y = f_Y(x, \psi)$, using feedback $x$ from the teacher and $y$ from the environment, and requires only one stop gradient from $y$ to $x$, without letting the information in $y$ flow into $\psi$.
I made a case for the study of imitation learning. It goes beyond reinforcement learning in that it allows for the acquisition of new behavioral schemas (including new purposes) from observation. We looked at literal imitation here, but in a more advanced interpretation, imitation is what makes language so powerful, allowing for the on-the-fly synthesis of novel behaviors based on descriptions, analogies, and metaphors.
At the heart of imitation is a mechanism translating third-person into first-person experience. This translation requires the learner to subjectively supply causal information (in the form of explanations) that is not present in the data. In this sense, imitation requires a supporting cognitive structure, and cannot work purely based on data.
But even when successful, imitation comes with new challenges, especially from the point of view of safety. New schemas—because they encode new preferences—bring along new behavioral drives and objectives (potentially of moral or ethical nature) that operate independently of the hard-wired reward function a learner might have. Thus, careful data curation, perhaps in the form of designing special curricula, might become a central concern for training imitation-based agents.