  - and reinforcing consequences (rewards and punishments).
These elements are essential for learning through direct interaction with the environment.

But learning like this is extremely limited. Most of what we know about the world does not come from first-person experience!
  
[[https://en.wikipedia.org/wiki/Imitation|Imitation]] is another form of learning which is ubiquitous in animals((In addition, there is evidence suggesting animals have dedicated neural circuitry for imitation---see e.g. mirror neurons (Kilner and Lemon, 2012).)). Imitation learning, however, involves translating third-person (observed) experiences into first-person (self) knowledge. This process requires the learner to infer causal relationships from observations, effectively reconstructing the underlying principles behind observed behaviors. Such a transformation is challenging because third-person observations lack the direct causal feedback inherent in personal experience.
This implies that $P(Y|X)$ will predict well what happens when the demonstrator chooses $X$, but it won't predict what happens when the learner chooses $X$. The two predictions differ because the learner's choices, even when imitating, are based on the learner's own subjective information state, which is ignorant of the unobserved intention $\theta$ and therefore unable to implement the causal dependency between $X$ and $\theta$ the same way the demonstrator did.
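
As a sketch of this gap (assuming, as in the model above, that the intention $\theta$ is the only unobserved cause linking $X$ and $Y$), both predictions are mixtures over $\theta$, but with different weights:

$$
P(Y|X) = \sum_\theta P(Y|X,\theta)\,P(\theta|X)
\qquad\text{versus}\qquad
\sum_\theta P(Y|X,\theta)\,P(\theta).
$$

The demonstrator's action is informative about their intention, so their data weights each $\theta$ by $P(\theta|X)$; the learner's action carries no such information, so the weights collapse to the prior $P(\theta)$. The two expressions coincide only when $X$ and $\theta$ are independent.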
  
==== The math: why does this happen? ====
  
To understand what will happen when we replace the demonstrator with the learner, we need $P(Y|\text{do}(X))$, i.e. the distribution over $Y$ when $X$ is chosen independently of $\theta$, also known in causal lingo as the effect on $Y$ of the //intervention// $X$.
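
To see the difference numerically, here is a toy simulation (a sketch, not from the article: the binary intention, the action-matches-intention success rule, and all numbers are assumptions for illustration). The demonstrator always acts on its intention, so observationally $P(Y=1|X=1)=1$; the imitating learner cannot condition on $\theta$, and its success rate under the intervention drops to the prior $P(\theta=1)$:

<code python>
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hidden intention, unobserved by the learner.
theta = rng.integers(0, 2, size=n)

# Demonstrator acts on the intention (X = theta); success iff X matches theta.
x_demo = theta
y_demo = (x_demo == theta).astype(int)

# Observational estimate P(Y=1 | X=1) from demonstration data.
p_obs = y_demo[x_demo == 1].mean()   # -> 1.00

# Learner imitates the marginal P(X) but cannot see theta, so its
# action is independent of theta: effectively an intervention do(X).
x_learn = rng.integers(0, 2, size=n)
y_learn = (x_learn == theta).astype(int)

# Interventional estimate P(Y=1 | do(X=1)).
p_do = y_learn[x_learn == 1].mean()  # -> ~0.50

print(f"P(Y=1 | X=1)     = {p_obs:.2f}")
print(f"P(Y=1 | do(X=1)) = {p_do:.2f}")
</code>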
  
  - since the learner chooses the actions themselves, they get to observe and learn the effect of their own action, $P(Y|\text{do}(X))$;
  - and since the teacher then provides the best action in hindsight, the learner also separately observes the desired expert policy $P(X)$.
  
In this case, it is safe for the learner to regress $P(X)$ and $P(Y|\text{do}(X))$ as long as no information flows from the choice $X$ back into the policy parameters of $P(X)$. This constraint is typically enforced via a stop-gradient in a deep learning implementation, which makes sure the learner's policy is acquired exclusively from the teacher's instructions and not from the actions' consequences.
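
A minimal sketch of this training step (hypothetical network sizes, losses, and data, chosen only for illustration; PyTorch's ''detach()'' plays the role of the stop-gradient):

<code python>
import torch
import torch.nn as nn
import torch.nn.functional as F

policy = nn.Linear(4, 2)      # learner's policy head, models P(X)
effect = nn.Linear(4 + 2, 1)  # effect head, models P(Y | do(X))
opt = torch.optim.Adam([*policy.parameters(), *effect.parameters()], lr=1e-3)

s = torch.randn(32, 4)                  # batch of states
x_teacher = torch.randint(0, 2, (32,))  # teacher's best-action-in-hindsight labels
y = torch.randn(32, 1)                  # consequences observed after acting

logits = policy(s)
probs = F.softmax(logits, dim=-1)

# Learn P(X) by regressing the teacher's labels only.
loss_policy = F.cross_entropy(logits, x_teacher)

# Learn P(Y|do(X)) from the learner's own action and its consequence.
# The stop-gradient (detach) blocks information from flowing from the
# action's consequences back into the policy parameters, so the policy
# is shaped by the teacher's instructions alone.
x_feat = probs.detach()
loss_effect = F.mse_loss(effect(torch.cat([s, x_feat], dim=-1)), y)

opt.zero_grad()
(loss_policy + loss_effect).backward()
opt.step()
</code>

Without the ''detach()'', gradients from ''loss_effect'' would flow through ''probs'' into the policy parameters, letting the actions' consequences shape $P(X)$.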