third_person

  - and reinforcing consequences (rewards and punishments).
These elements are essential for learning through direct interaction with the environment.

But learning like this is extremely limited. Most of what we know about the world does not come from first-person experience!
  
[[https://en.wikipedia.org/wiki/Imitation|Imitation]] is another form of learning which is ubiquitous in animals((In addition, there is evidence suggesting animals have dedicated neural circuitry for imitation---see e.g. mirror neurons (Kilner and Lemon, 2012).)). Imitation learning, however, involves translating third-person (observed) experiences into first-person (self) knowledge. This process requires the learner to infer causal relationships from observations, effectively reconstructing the underlying principles behind observed behaviors. Such a transformation is challenging because third-person observations lack the direct causal feedback inherent in personal experience.
This implies that $P(Y|X)$ will predict well what will happen when the demonstrator chooses $X$, but it won't predict what will happen when the learner chooses $X$. This last prediction differs because the learner's choices---even when imitating---are based on their own subjective information state, which is ignorant about the unobserved intention $\theta$, and thus unable to implement the necessary causal dependency between $X$ and $\theta$ the same way the demonstrator did.
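To make the gap concrete, here is a minimal numpy simulation of a confounded demonstrator. The binary intention, the 0.9 and 0.95 probabilities, and all variable names are invented for illustration; the point is only that the conditional estimated from demonstrations overestimates what an imitating learner, ignorant of $\theta$, will actually achieve.

<code python>
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Toy confounded demonstrations; all numbers are invented for illustration.
# theta: the demonstrator's hidden intention.
# X:     the demonstrator's action, which depends on theta.
# Y:     the outcome, which depends on both X and theta.
theta = rng.integers(0, 2, N)                              # P(theta = 1) = 0.5
X = np.where(rng.random(N) < 0.9, theta, 1 - theta)        # demonstrator mostly acts on its intention
Y = ((X == theta) & (rng.random(N) < 0.95)).astype(int)    # success mostly when action matches intention

# What the demonstration data suggest: X = 1 is strong evidence that theta = 1.
p_y_given_x1 = Y[X == 1].mean()                            # approx 0.9 * 0.95 = 0.855

# What an imitating learner achieves: it samples X from the marginal P(X) without
# knowing theta, so its action carries no information about the intention.
X_imit = (rng.random(N) < (X == 1).mean()).astype(int)
Y_imit = ((X_imit == theta) & (rng.random(N) < 0.95)).astype(int)
p_y_do_x1 = Y_imit[X_imit == 1].mean()                     # approx 0.5 * 0.95 = 0.475

print(f"P(Y=1 | X=1)     ~ {p_y_given_x1:.2f}")
print(f"P(Y=1 | do(X=1)) ~ {p_y_do_x1:.2f}")
</code>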
  
==== The math: why does this happen? ====
  
To understand what will happen when we replace the demonstrator with the learner, we need $P(Y|\text{do}(X))$, i.e. the distribution over $Y$ when $X$ is chosen independently, also known as the effect on $Y$ of the //intervention $X$// in causal lingo.
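As a sanity check on the notation (assuming, as the story above suggests, that the hidden intention $\theta$ is the only confounder, influencing both $X$ and $Y$): the intervention averages over the prior, $P(Y|\text{do}(X)) = \sum_\theta P(\theta)\, P(Y|X,\theta)$, whereas the observational conditional averages over the posterior, $P(Y|X) = \sum_\theta P(\theta|X)\, P(Y|X,\theta)$. The two coincide only when $P(\theta|X) = P(\theta)$, i.e. when the chosen action reveals nothing about the intention behind it.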
  
===== Imitation learning =====
Next, let's discuss approaches to imitation learning. Throughout the discussion below, it is important to keep in mind that the learner does not know the reward function of the demonstrator.
  
==== Case 1: No confounding, or causally sufficient context ====
  
If there is no confounding variable, or if the confounding variable $\theta$ is a deterministic function of the observable context $\theta = f(C)$, then the interventional and the observational conditionals coincide: $P(Y|\text{do}(X), C) = P(Y|X, C)$.
Hence watching a demonstration is like acting oneself. Therefore, in this case it is safe for the agent to learn the joint $P(C, X, Y)$, and then use the conditional $P(X | C)$ for choosing its actions to obtain the effect $P(Y|X,C)$. This is an important special case. In causal lingo, we say that //$C$ screens off $\theta$//.
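As a concrete (if simplistic) reading of this recipe, the sketch below estimates the joint $P(C, X, Y)$ by counting over discrete values and then acts with the conditional $P(X|C)$. All function names and the toy traffic-light data are invented for illustration.

<code python>
from collections import Counter, defaultdict
import random

def fit_joint(demonstrations):
    """Estimate the joint P(C, X, Y) from observed (context, action, outcome) triples by counting."""
    counts = Counter(demonstrations)
    total = sum(counts.values())
    return {cxy: n / total for cxy, n in counts.items()}

def conditional_policy(joint):
    """Derive the imitation policy P(X | C) from the estimated joint."""
    by_context = defaultdict(Counter)
    for (c, x, y), p in joint.items():
        by_context[c][x] += p
    return {c: {x: p / sum(actions.values()) for x, p in actions.items()}
            for c, actions in by_context.items()}

def act(policy, c):
    """Sample an action from P(X | C = c), as the imitating learner would."""
    actions, probs = zip(*policy[c].items())
    return random.choices(actions, weights=probs)[0]

# Toy demonstrations (context, action, outcome); the values are invented for illustration.
demos = [("red", "stop", "safe"), ("red", "stop", "safe"), ("green", "go", "safe"),
         ("green", "go", "safe"), ("green", "stop", "safe")]
policy = conditional_policy(fit_joint(demos))
print(act(policy, "green"))   # "go" about 2/3 of the time, "stop" about 1/3
</code>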
  
This causal sufficiency is typically the (tacit) assumption in the imitation learning literature. For instance, if the demonstrator chooses optimal actions in an MDP, then their policy is a function of the observable state, and hence it can be safely learned and used. But this assumption won't hold in a typical POMDP because the learner can't see the beliefs of the demonstrator.
  
==== Case 2: Teacher supervision ====
A limited form of supervision can be achieved when a teacher directly supervises the learner while they perform the task---think of it as having the driving instructor sitting next to the student during a driving lesson, providing instantaneous feedback, or a parent teaching their child how to walk.
  
This is a hybrid first/third-person setting, because the learner chooses the actions themselves, and then gets immediately told the best action in hindsight by the teacher. The rationale here is as follows:

  - since the learner chooses the actions themselves, they get to observe and learn the effect of their own action $P(Y|\text{do}(X))$;
  - and since the teacher then provides the best action in hindsight, the learner also separately observes the desired expert policy $P(X)$.

In this case, it is safe for the learner to regress $P(X)$ and $P(Y|\text{do}(X))$ as long as there is no information flowing from the choice $X$ back into the policy parameters of $P(X)$. The last constraint is typically achieved via a stop-gradient in a deep learning implementation. This makes sure the learner's policy is acquired exclusively using the teacher's instructions, and not from the action's consequences.
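Below is a minimal PyTorch-flavoured sketch of this hybrid setup, not an implementation from the text: the network shapes, losses, and the names ''policy_net'' and ''effect_net'' are all assumptions. The policy head is regressed only on the teacher's labels, the effect head is regressed on the consequences of the learner's own (sampled) actions, and the explicit ''detach()'' marks the stop-gradient that keeps information about those consequences from flowing back into the policy parameters.

<code python>
import torch
import torch.nn as nn

# Hypothetical networks; sizes and architectures are illustrative only.
obs_dim, n_actions, out_dim = 8, 4, 1
policy_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
effect_net = nn.Sequential(nn.Linear(obs_dim + n_actions, 32), nn.ReLU(), nn.Linear(32, out_dim))
optimizer = torch.optim.Adam(list(policy_net.parameters()) + list(effect_net.parameters()), lr=1e-3)

def training_step(obs, teacher_action, outcome):
    """One step of the hybrid setting: the learner acts, the teacher labels, the world answers.

    obs            -- the learner's observation, shape (batch, obs_dim)
    teacher_action -- the best action in hindsight provided by the teacher, shape (batch,)
    outcome        -- the observed consequence Y of the learner's own action, shape (batch, out_dim)
    """
    logits = policy_net(obs)                                         # learner's policy P(X | obs)
    action = torch.distributions.Categorical(logits=logits).sample()

    # Policy loss: regress P(X) on the teacher's labels only.
    policy_loss = nn.functional.cross_entropy(logits, teacher_action)

    # Effect loss: regress P(Y | do(X)) from the learner's own (action, outcome) pairs.
    # The action fed into the effect model must carry no gradient back into the policy.
    # Sampling a discrete action already cuts the graph; detach() makes the stop-gradient
    # explicit (and is what you would rely on if the action representation were a
    # differentiable function of the policy output).
    action_onehot = nn.functional.one_hot(action, n_actions).float().detach()
    effect_pred = effect_net(torch.cat([obs, action_onehot], dim=-1))
    effect_loss = nn.functional.mse_loss(effect_pred, outcome)

    optimizer.zero_grad()
    (policy_loss + effect_loss).backward()
    optimizer.step()

# Illustrative dummy batch.
obs = torch.randn(16, obs_dim)
teacher_action = torch.randint(0, n_actions, (16,))
outcome = torch.randn(16, out_dim)
training_step(obs, teacher_action, outcome)
</code>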
  
 ==== Case 3: General case with no supervision ==== ==== Case 3: General case with no supervision ====