

third_person [2024/12/25 12:15] – [Case 1: No confounding or causally sufficient context] pedroortega third_person [2024/12/25 12:40] (current) – [Why does this happen?] pedroortega
Line 14: Line 14:
   - and reinforcing consequences (rewards and punishments).   - and reinforcing consequences (rewards and punishments).
 These elements are essential for learning through direct interaction with the environment. These elements are essential for learning through direct interaction with the environment.
 +But learning like this is extremely limited. Most of what we know about the world does not come from first-person experience!
 [[|Imitation]] is another form of learning which is ubiquitous in animals((In addition, there is evidence suggesting animals have dedicated neural circuitry for imitation---see e.g. mirror neurons (Kilner and Lemon, 2012).)). Imitation learning, however, involves translating third-person (observed) experiences into first-person (self) knowledge. This process requires the learner to infer causal relationships from observations, effectively reconstructing the underlying principles behind observed behaviors. Such a transformation is challenging because third-person observations lack the direct causal feedback inherent in personal experience.  [[|Imitation]] is another form of learning which is ubiquitous in animals((In addition, there is evidence suggesting animals have dedicated neural circuitry for imitation---see e.g. mirror neurons (Kilner and Lemon, 2012).)). Imitation learning, however, involves translating third-person (observed) experiences into first-person (self) knowledge. This process requires the learner to infer causal relationships from observations, effectively reconstructing the underlying principles behind observed behaviors. Such a transformation is challenging because third-person observations lack the direct causal feedback inherent in personal experience. 
Line 36: Line 38:
 This implies that $P(Y|X)$ will predict well what will happen when the demonstrator chooses $X$, but it won't predict what will happen when the learner chooses $X$. This last prediction differs because the learner's choices---even when imitating---are based on their own subjective information state, which is ignorant of the unobserved intention $\theta$; the learner thus cannot implement the causal dependency between $X$ and $\theta$ the way the demonstrator did. This implies that $P(Y|X)$ will predict well what will happen when the demonstrator chooses $X$, but it won't predict what will happen when the learner chooses $X$. This last prediction differs because the learner's choices---even when imitating---are based on their own subjective information state, which is ignorant of the unobserved intention $\theta$; the learner thus cannot implement the causal dependency between $X$ and $\theta$ the way the demonstrator did.
-==== Why does this happen? ====+==== The math: why does this happen? ====
 To understand what will happen when we replace the demonstrator with the learner, we need $P(Y|\text{do}(X))$, i.e. the distribution over $Y$ when $X$ is chosen independently, also known as the effect on $Y$ under the //intervention $X$// in causal lingo.  To understand what will happen when we replace the demonstrator with the learner, we need $P(Y|\text{do}(X))$, i.e. the distribution over $Y$ when $X$ is chosen independently, also known as the effect on $Y$ under the //intervention $X$// in causal lingo. 
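The gap between $P(Y|X)$ and $P(Y|\text{do}(X))$ can be made concrete with a small numerical sketch. Everything in it is an illustrative assumption rather than part of the page: a binary intention $\theta$ that influences both $X$ and $Y$, the specific probabilities, and the helper names `observational` and `interventional`. Under those assumptions, conditioning on $X$ weights $\theta$ by the posterior $P(\theta|X)$, whereas intervening on $X$ leaves $\theta$ at its prior $P(\theta)$.

```python
# Hypothetical toy model (not from the page): binary intention theta,
# action X and outcome Y, with theta -> X, theta -> Y and X -> Y.
P_THETA = {0: 0.5, 1: 0.5}

# Assumed demonstrator policy P(X=1 | theta): the action tracks the intention.
P_X1_GIVEN_THETA = {0: 0.1, 1: 0.9}


def p_y1(x, theta):
    """Assumed outcome model P(Y=1 | X=x, theta): matching theta pays off."""
    return 0.9 if x == theta else 0.1


def p_x(x, theta):
    """P(X=x | theta) under the demonstrator's policy."""
    p1 = P_X1_GIVEN_THETA[theta]
    return p1 if x == 1 else 1.0 - p1


def observational(x):
    """P(Y=1 | X=x): theta is weighted by the posterior P(theta | X=x)."""
    joint = {t: P_THETA[t] * p_x(x, t) for t in P_THETA}
    norm = sum(joint.values())
    return sum(joint[t] / norm * p_y1(x, t) for t in P_THETA)


def interventional(x):
    """P(Y=1 | do(X=x)): theta keeps its prior P(theta)."""
    return sum(P_THETA[t] * p_y1(x, t) for t in P_THETA)


if __name__ == "__main__":
    for x in (0, 1):
        print(f"x={x}  P(Y=1|X)={observational(x):.2f}  "
              f"P(Y=1|do(X))={interventional(x):.2f}")
```

In this toy model, watching the demonstrator choose $X=1$ suggests a success probability of 0.82, while a learner who picks $X=1$ without knowing $\theta$ only gets 0.5.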
Line 73: Line 75:
 Hence watching a demonstration is like acting oneself. Therefore in this case it is safe for the agent to learn the joint $P(C, X, Y)$, and then use the conditional $P(X | C)$ for choosing its actions to obtain the effect $P(Y|X,C)$. This is an important special case. In causal lingo, we say that //$C$ screens off $\theta$//. Hence watching a demonstration is like acting oneself. Therefore in this case it is safe for the agent to learn the joint $P(C, X, Y)$, and then use the conditional $P(X | C)$ for choosing its actions to obtain the effect $P(Y|X,C)$. This is an important special case. In causal lingo, we say that //$C$ screens off $\theta$//.
-This causal sufficiency is typically the (tacit) assumption in the imitation learning literature. For instance, if the demonstrator chooses optimal actions in an MDP, then their policy will be a function of the state, which can safely learned and used. But this assumption won't hold in a typical POMDP because the learner can't see the beliefs of the demonstrator.+This causal sufficiency is typically the (tacit) assumption in the imitation learning literature. For instance, if the demonstrator chooses optimal actions in an MDP, then their policy is a function of the observable state, and hence it can be safely learned and used. But this assumption won't hold in a typical POMDP because the learner can't see the beliefs of the demonstrator.
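The screening-off claim can be checked by brute-force enumeration. The sketch below assumes a particular causal structure that the page may or may not use: $\theta \to C$, $C \to X$, and $(X, C) \to Y$, so that $\theta$ reaches $X$ and $Y$ only through the observable context $C$. The probabilities and the helper names `joint` and `p_y1_given` are likewise made up for illustration.

```python
from itertools import product

# Hypothetical model (an assumption, not the page's): theta -> C -> X and
# (X, C) -> Y, so theta affects X and Y only through the observable context C.
P_THETA = {0: 0.5, 1: 0.5}


def p_c(c, theta):
    """P(C=c | theta): the context is a noisy readout of the intention."""
    return 0.8 if c == theta else 0.2


def p_x(x, c):
    """P(X=x | C=c): the demonstrator's policy depends on the context only."""
    return 0.9 if x == c else 0.1


def p_y(y, x, c):
    """P(Y=y | X=x, C=c): the outcome depends on action and context only."""
    p1 = 0.9 if x == c else 0.1
    return p1 if y == 1 else 1.0 - p1


def joint(do_x=None):
    """Return {(theta, c, x, y): prob}; if do_x is given, X is forced to it."""
    table = {}
    for theta, c, x, y in product((0, 1), repeat=4):
        if do_x is None:
            px = p_x(x, c)                      # demonstrator chooses X
        else:
            px = 1.0 if x == do_x else 0.0      # learner sets X by intervention
        table[(theta, c, x, y)] = (
            P_THETA[theta] * p_c(c, theta) * px * p_y(y, x, c)
        )
    return table


def p_y1_given(table, x, c):
    """P(Y=1 | X=x, C=c) computed from a joint table, marginalizing theta."""
    num = sum(p for (_, c_, x_, y_), p in table.items()
              if c_ == c and x_ == x and y_ == 1)
    den = sum(p for (_, c_, x_, y_), p in table.items()
              if c_ == c and x_ == x)
    return num / den


for x, c in product((0, 1), repeat=2):
    obs = p_y1_given(joint(), x, c)
    act = p_y1_given(joint(do_x=x), x, c)
    print(f"x={x} c={c}  P(Y=1|X,C)={obs:.3f}  P(Y=1|do(X),C)={act:.3f}")
```

Under these assumptions the two columns agree for every pair $(x, c)$: once $C$ is conditioned on, observing the demonstrator's choice and making the choice oneself predict the same outcome.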
 ==== Case 2: Teacher supervision ==== ==== Case 2: Teacher supervision ====
Line 79: Line 81:
 A limited form of supervision can be achieved when a teacher directly supervises the learner while performing the task---think of it as having the driving instructor sitting next to the student during a driving lesson, providing instantaneous feedback, or a parent teaching their child how to walk.  A limited form of supervision can be achieved when a teacher directly supervises the learner while performing the task---think of it as having the driving instructor sitting next to the student during a driving lesson, providing instantaneous feedback, or a parent teaching their child how to walk. 
-This is a hybrid first/third-person setting, because the learner chooses the actions themself, and then gets immediately told the best action in hindsight by the teacher. The rationale here is as follows: since the learner chooses the actions themself, they get to observe the effect of their own action $P(Y|\text{do}(X))$; and since the teacher then provides the best action in hindsight, the learner also separately observes the policy $P(X)$. In this case, it is safe for the learner to regress $P(X)$ and $P(Y|\text{do}(X))$ as long as there is no information flowing from the choice $X$ back into the policy parameters of $P(X)$. The last constraint is typically achieved via a stop-gradient in a deep learning implementation. This makes sure the learner's policy is acquired exclusively using the teacher's instructions, and not from the action's consequences.+This is a hybrid first/third-person setting, because the learner chooses the actions themself, and then gets immediately told the best action in hindsight by the teacher. The rationale here is as follows:  
 +  - since the learner chooses the actions themself, they get to observe and learn the effect of their own action $P(Y|\text{do}(X))$;  
 +  - and since the teacher then provides the best action in hindsight, the learner also separately observes the desired expert policy $P(X)$.  
 +In this case, it is safe for the learner to regress $P(X)$ and $P(Y|\text{do}(X))$ as long as there is no information flowing from the choice $X$ back into the policy parameters of $P(X)$. The last constraint is typically achieved via a stop-gradient in a deep learning implementation. This makes sure the learner's policy is acquired exclusively using the teacher's instructions, and not from the action's consequences.
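The two separate learning signals can be illustrated with a tabular sketch in plain Python. It is a simplified stand-in for the stop-gradient mentioned above, not a deep learning implementation: the policy is a table of counts that is updated only from the teacher's hindsight labels, while a separate outcome table is updated from the learner's own actions and their consequences. The environment (a hidden binary $\theta$ with $Y=1$ exactly when $X=\theta$, and a teacher who reveals $\theta$ as the best action) and all names are hypothetical.

```python
import random


def run_supervised_learner(rounds=20000, p_theta1=0.7, seed=0):
    """Simulate teacher supervision in a hypothetical two-action task."""
    rng = random.Random(seed)
    # Policy parameters: Laplace-smoothed counts of the teacher's labels.
    teacher_counts = [1.0, 1.0]
    # Outcome model: [failures, successes] for each action the learner took.
    outcome_counts = {0: [1.0, 1.0], 1: [1.0, 1.0]}

    for _ in range(rounds):
        theta = 1 if rng.random() < p_theta1 else 0   # hidden from the learner

        # The learner acts from its current policy, so X is chosen without
        # access to theta and the observed outcomes estimate P(Y | do(X)).
        p_x1 = teacher_counts[1] / sum(teacher_counts)
        x = 1 if rng.random() < p_x1 else 0
        y = 1 if x == theta else 0

        # The teacher reveals the best action in hindsight.
        x_star = theta

        # Outcome model learns from the learner's own action and its effect.
        outcome_counts[x][y] += 1.0
        # Policy learns from the teacher's label only; neither x nor y enters
        # this update, which plays the role of the stop-gradient.
        teacher_counts[x_star] += 1.0

    policy_p1 = teacher_counts[1] / sum(teacher_counts)
    effect = {a: c[1] / sum(c) for a, c in outcome_counts.items()}
    return policy_p1, effect


if __name__ == "__main__":
    policy_p1, effect = run_supervised_learner()
    print(f"learned P(X=1)         = {policy_p1:.3f}")
    print(f"learned P(Y=1|do(X=0)) = {effect[0]:.3f}")
    print(f"learned P(Y=1|do(X=1)) = {effect[1]:.3f}")
```

With the assumed $P(\theta=1)=0.7$, the learned policy should settle near $P(X=1)=0.7$, and the outcome estimates near 0.3 and 0.7 for $X=0$ and $X=1$ respectively. Feeding `y` into the policy update would instead mix the action's consequences into the policy, which is what the constraint above rules out.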
 ==== Case 3: General case with no supervision ==== ==== Case 3: General case with no supervision ====
  • Last modified: 2024/12/25 12:15
  • by pedroortega