{{ ::third-person.webp |}}

====== How to translate third-person into first-person? ======
  
:!: Article under construction.
  
> Imitation is a potent learning mechanism observed across the animal kingdom, enabling individuals to acquire behaviors without direct, first-person experience. Unlike operant conditioning, the core mechanism behind reinforcement learning, which relies on personal actions and their consequences, imitation allows for learning through observation, bypassing the need for direct reinforcement.\\ \\ Understanding imitation could revolutionize artificial intelligence by facilitating the development of adaptive agents capable of learning from demonstrations, analogies, and metaphors. This approach would enable AI systems to grasp complex tasks through observation, much like humans do, enhancing their versatility and efficiency.
  
//Cite as: Ortega, P.A. “How to translate third-person into first-person?”, Tech Note 4, Daios, 2024.//
  
**Operant conditioning versus imitation:** The dominant paradigm in AI for policy learning is reinforcement learning (RL). In turn, RL is based on [[https://en.wikipedia.org/wiki/Operant_conditioning|operant conditioning]] (OC), which necessitates:
  - an information context,
  - first-person behavior,
  - and reinforcing consequences (rewards and punishments).
These elements are essential for learning through direct interaction with the environment.
  
But learning like this is extremely limited. Most of what we know about the world does not come from first-person experience!

[[https://en.wikipedia.org/wiki/Imitation|Imitation]] is another form of learning which is ubiquitous in animals((In addition, there is evidence suggesting animals have dedicated neural circuitry for imitation---see e.g. mirror neurons (Kilner and Lemon, 2012).)). Imitation learning, however, involves translating third-person (observed) experiences into first-person (self) knowledge. This process requires the learner to infer causal relationships from observations, effectively reconstructing the underlying principles behind observed behaviors. Such a transformation is challenging because third-person observations lack the direct causal feedback inherent in personal experience.

To bridge this gap, learners must augment observed information with inferred causal understanding, enabling them to internalize and replicate behaviors accurately. This capability not only broadens the scope of learning beyond personal experience but also opens new pathways for developing AI systems that learn and adapt in more human-like ways.
  
  
===== The problem: actions versus observations =====
This implies that $P(Y|X)$ will predict well what will happen when the demonstrator chooses $X$, but it won't predict what will happen when the learner chooses $X$. This last prediction differs because the learner's choice---even when imitating---is based on their own subjective information state, which is ignorant about the unobserved intention $\theta$, and thus unable to implement the necessary causal dependency between $X$ and $\theta$ the same way the demonstrator did.
  
==== The math: why does this happen? ====
  
To understand what will happen when we substitute the learner for the demonstrator, we need $P(Y|\text{do}(X))$, i.e. the distribution over $Y$ when $X$ is chosen independently, also known as the effect on $Y$ under the //intervention $X$// in causal lingo.
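To make the difference concrete, here is a minimal numerical sketch. The probability tables and variable names below are invented purely for illustration (they are not taken from the article): it builds a toy schema over a binary intention $\theta$, action $X$, and outcome $Y$, and contrasts the observational conditional $P(y|x)$ with the interventional $P(y|\text{do}(x))$ that governs the learner's own choices.

<code python>
import numpy as np

# Toy behavior schema (made-up numbers for illustration).
# theta: latent intention, x: action, y: outcome -- all binary.
P_theta = np.array([0.5, 0.5])                 # P(theta)
P_x_given_theta = np.array([[0.9, 0.1],        # P(x | theta=0)
                            [0.1, 0.9]])       # P(x | theta=1)
P_y_given_x_theta = np.array([                 # P(y=1 | x, theta)
    [0.9, 0.2],                                # x=0: theta=0, theta=1
    [0.2, 0.9],                                # x=1: theta=0, theta=1
])

# Joint P(theta, x) and observational conditional P(y=1 | x),
# i.e. what a passive observer of the demonstrator would regress.
P_theta_x = P_theta[:, None] * P_x_given_theta          # shape (theta, x)
P_x = P_theta_x.sum(axis=0)                             # marginal policy P(x)
P_theta_given_x = P_theta_x / P_x                       # P(theta | x)
P_y1_given_x = (P_theta_given_x * P_y_given_x_theta.T).sum(axis=0)

# Interventional prediction P(y=1 | do(x)): theta is drawn independently
# of the chosen x, which is the learner's situation.
P_y1_do_x = (P_theta[None, :] * P_y_given_x_theta).sum(axis=1)

print("P(y=1|x):     ", P_y1_given_x)   # predicts the demonstrator well
print("P(y=1|do(x)): ", P_y1_do_x)      # what the imitating learner gets
</code>

In this toy model the demonstrator, who matches its action to $\theta$, sees much better outcomes than an imitator who reproduces only the action statistics: the two printed vectors differ, which is exactly the gap discussed here.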
$$
P(y|x) = \sum_\theta P(\theta) P(y|x, \theta) R(x, \theta) \qquad (2)
$$
where the $R(x, \theta)$ are terms that couple information((Yes, the same as in the definition of mutual information.)), defined as
$$
R(x, \theta) = \frac{ P(x,\theta) }{ P(x)P(\theta) } \qquad (3)
$$
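For orientation (stated here as a standard adjustment argument, not as a quote from the article): under the assumed causal graph, in which $\theta$ is the only confounder of $X$ and $Y$, intervening on $X$ severs the dependence of $X$ on $\theta$, which amounts to replacing every coupling term $R(x, \theta)$ in equation (2) by $1$, giving
$$
P(y|\text{do}(x)) = \sum_\theta P(\theta) P(y|x, \theta).
$$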
==== Core mechanism of imitation ====
  
The core mechanism of imitation learning then consists of the following. If the //behavior schema// is known, by which we mean that the learner knows the joint $P(\theta, X, Y)$ and the underlying causal graph, then instead of choosing $X$ by sampling it according to $P(X|\theta)$ (which is impossible because $\theta$ is unknown), the first-person version chooses $X$ from its nearest distribution, namely the marginal $P(X)$, which averages over all the settings of the latent variable $\theta$:
$$
P(x) = \sum_\theta P(\theta) P(x|\theta). \qquad (5)
$$
  
===== Imitation learning =====
Next let's discuss approaches to imitation learning. Throughout the discussion below, it is important to keep in mind that the learner does not know the reward function of the demonstrator.
  
==== Case 1: No confounding, or causally sufficient context ====
  
If there is no confounding variable, or if the confounding variable $\theta$ is a deterministic function of the observable context $\theta = f(C)$, then
Hence watching a demonstration is like acting oneself. Therefore in this case it is safe for the agent to learn the joint $P(C, X, Y)$, and then use the conditional $P(X | C)$ for choosing its actions to obtain the effect $P(Y|X,C)$. This is an important special case. In causal lingo, we say that //$C$ screens off $\theta$//.
  
This causal sufficiency is typically the (tacit) assumption in the imitation learning literature. For instance, if the demonstrator chooses optimal actions in an MDP, then their policy is a function of the observable state, and hence it can be safely learned and used. But this assumption won't hold in a typical POMDP because the learner can't see the beliefs of the demonstrator.
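As a sketch of this special case (a tabular toy problem with invented names and an invented expert rule, not code from the article), behavioral cloning here reduces to estimating $P(X|C)$ by counting over demonstrations and then acting from that estimate, which is safe precisely because $C$ screens off $\theta$:

<code python>
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
n_contexts, n_actions = 3, 3

def expert_action(c):
    # The expert's policy depends only on the observable context
    # (a noisy "match the context" rule, purely illustrative).
    return c if rng.random() < 0.8 else (c + 1) % n_actions

# Toy demonstrations: (context, action) pairs produced by the expert.
contexts = rng.integers(0, n_contexts, size=1000)
demos = [(int(c), int(expert_action(c))) for c in contexts]

# Behavioral cloning: estimate P(x | c) by counting.
counts = defaultdict(lambda: np.zeros(n_actions))
for c, x in demos:
    counts[c][x] += 1
policy = {c: counts[c] / counts[c].sum() for c in counts}

# Since c screens off theta, conditioning and intervening coincide,
# so the learner may simply sample its actions from P(x | c).
def learner_action(c):
    return rng.choice(n_actions, p=policy[c])

print({c: np.round(policy[c], 2) for c in sorted(policy)})
print("learner acts in context 1:", learner_action(1))
</code>

Here the context plays the role of the state in the MDP example above: because the expert's choice depends on nothing the learner cannot observe, copying the conditional is safe.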
  
==== Case 2: Teacher supervision ====
A limited form of supervision can be achieved when a teacher directly supervises the learner while performing the task---think of it as having the driving instructor sitting next to the student during a driving lesson, providing instantaneous feedback, or a parent teaching their child how to walk.
  
This is a hybrid first/third-person setting, because the learner chooses the actions themself, and then gets immediately told the best action in hindsight by the teacher. The rationale here is as follows:

  - since the learner chooses the actions themself, they get to observe and learn the effect of their own action $P(Y|\text{do}(X))$;
  - and since the teacher then provides the best action in hindsight, the learner also separately observes the desired expert policy $P(X)$.

In this case, it is safe for the learner to regress $P(X)$ and $P(Y|\text{do}(X))$ as long as there is no information flowing from the choice $X$ back into the policy parameters of $P(X)$. The last constraint is typically achieved via a stop-gradient in a deep learning implementation. This makes sure the learner's policy is acquired exclusively using the teacher's instructions, and not from the action's consequences.
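A minimal sketch of how the stop-gradient constraint might look in practice (assuming a PyTorch-style setup; the network sizes, toy environment, and teacher rule are invented for illustration): the policy head is fit only to the teacher's corrective actions, while the outcome model is fit to the learner's own executed actions, with the executed action treated as detached data so that the outcome loss cannot update the policy parameters.

<code python>
import torch
import torch.nn as nn

torch.manual_seed(0)
obs_dim, n_actions = 4, 3

policy = nn.Linear(obs_dim, n_actions)             # models P(X | observation)
outcome_model = nn.Linear(obs_dim + n_actions, 1)  # models E[Y | do(X), observation]
opt = torch.optim.Adam(
    list(policy.parameters()) + list(outcome_model.parameters()), lr=1e-2)

def teacher_best_action(obs):
    # Stand-in for the teacher's hindsight-optimal action (invented rule).
    return obs.argmax(dim=-1) % n_actions

for step in range(200):
    obs = torch.randn(32, obs_dim)

    # 1) The learner acts itself: sample X from its current policy.
    logits = policy(obs)
    action = torch.distributions.Categorical(logits=logits).sample()

    # 2) The environment responds with a toy outcome Y (invented dynamics).
    y = (obs.sum(dim=-1) + action.float()).unsqueeze(-1) + 0.1 * torch.randn(32, 1)

    # 3) Policy loss: supervised on the teacher's corrective action only.
    policy_loss = nn.functional.cross_entropy(logits, teacher_best_action(obs))

    # 4) Outcome-model loss: regress Y from (obs, executed action).
    #    detach() makes the stop-gradient explicit: the executed action is
    #    treated as data, so this loss cannot shape the policy parameters.
    action_onehot = nn.functional.one_hot(action.detach(), n_actions).float()
    y_pred = outcome_model(torch.cat([obs, action_onehot], dim=-1))
    model_loss = nn.functional.mse_loss(y_pred, y)

    opt.zero_grad()
    (policy_loss + model_loss).backward()
    opt.step()
</code>

The essential design point, as described above, is that the two regressions stay decoupled: the teacher's labels shape the policy, while the learner's own interventions shape the outcome model.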
  
==== Case 3: General case with no supervision ====