- and reinforcing consequences (rewards and punishments).
These elements are essential for learning through direct interaction with the environment.

But learning like this is extremely limited. Most of what we know about the world does not come from first-person experience!
[[https://
This implies that $P(Y|X)$ will predict well what will happen when the demonstrator chooses $X$, but it won't predict what will happen when the learner chooses $X$. This last prediction differs because the learner's choice of $X$ is an intervention rather than an observation.
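A minimal numerical sketch of this gap, assuming a hypothetical binary confounder $U$ (the demonstrator's private information) that drives both the demonstrator's choice of $X$ and the outcome $Y$: the conditional $P(Y=1|X=1)$ estimated from the demonstrator's logs overestimates what the learner will actually see when it sets $X=1$ itself.

<code python>
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hidden confounder U: stands in for the demonstrator's private knowledge.
U = rng.binomial(1, 0.5, size=n)

# Demonstrator policy: X depends on U (this is what creates the confounding).
X_demo = rng.binomial(1, np.where(U == 1, 0.9, 0.1))

# Outcome: Y depends on both X and U.
def sample_y(x, u):
    return rng.binomial(1, 0.2 + 0.3 * x + 0.4 * u)

Y_demo = sample_y(X_demo, U)

# Observational estimate P(Y=1 | X=1) from the demonstrator's logs.
p_obs = Y_demo[X_demo == 1].mean()

# Interventional estimate P(Y=1 | do(X=1)): the learner sets X=1
# regardless of U, so U keeps its marginal distribution.
Y_do = sample_y(np.ones(n, dtype=int), U)
p_do = Y_do.mean()

print(f"P(Y=1 | X=1)     ~ {p_obs:.3f}")   # ~ 0.86 (confounded estimate)
print(f"P(Y=1 | do(X=1)) ~ {p_do:.3f}")    # ~ 0.70 (what the learner experiences)
</code>

With these made-up numbers the logs suggest roughly $0.86$, while the learner actually experiences roughly $0.70$, because conditioning on $X=1$ also selects for $U=1$.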
==== The math: why does this happen? ====
To understand what will happen when we substitute the demonstrator by the learner, we need $P(Y|\text{do}(X))$, the distribution over outcomes when $X$ is set by intervention rather than merely observed.
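To see the difference concretely, consider the simplest confounded setting, with a single hidden variable $U$ influencing both the demonstrator's choice $X$ and the outcome $Y$. In that case the two quantities expand as

$$P(Y|X=x) = \sum_{u} P(Y|x,u)\,P(u|x), \qquad P(Y|\text{do}(X=x)) = \sum_{u} P(Y|x,u)\,P(u).$$

The observational expression weights the confounder by $P(u|x)$, reflecting how the demonstrator's private information correlates with its own choices, whereas the interventional expression keeps the confounder at its marginal $P(u)$; the two coincide only when $X$ and $U$ are independent.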
- since the learner chooses the actions themselves, they get to observe and learn the effect of their own action $P(Y|\text{do}(X))$;
- and since the teacher then provides the best action in hindsight, the learner also separately observes the desired policy $P(X)$.
In this case, it is safe for the learner to regress $P(X)$ and $P(Y|\text{do}(X))$ as long as there is no information flowing from the choice $X$ back into the policy parameters of $P(X)$. The last constraint is typically achieved via a stop-gradient in a deep learning implementation. This makes sure the learner's own choices do not feed back into, and thereby bias, the learned policy $P(X)$.
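A rough PyTorch sketch of this training step, under assumptions the text leaves open (discrete actions, a scalar outcome, and hypothetical helpers ''env_outcome_fn'' and ''teacher_fn'' standing in for the environment and for the teacher's best-action-in-hindsight): the outcome head is regressed on the learner's own action, the policy head is regressed only on the teacher's correction, and ''detach()'' plays the role of the stop-gradient that keeps the chosen action from flowing back into the policy parameters of $P(X)$.

<code python>
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS = 8, 4   # hypothetical sizes

policy_net  = nn.Linear(STATE_DIM, N_ACTIONS)        # models P(X | s)
outcome_net = nn.Linear(STATE_DIM + N_ACTIONS, 1)    # models P(Y | do(X), s)
opt = torch.optim.Adam(
    list(policy_net.parameters()) + list(outcome_net.parameters()), lr=1e-3)

def training_step(s, env_outcome_fn, teacher_fn):
    logits = policy_net(s)

    # The learner picks its own action. detach() is the stop-gradient:
    # no information flows from the chosen X back into the parameters of P(X).
    x = torch.distributions.Categorical(logits=logits.detach()).sample()

    # Observe the effect of the learner's own intervention do(X = x).
    y = env_outcome_fn(s, x)

    # The teacher provides the best action in hindsight.
    x_star = teacher_fn(s, x, y)

    # Regress P(Y | do(X)) on the learner's own action and observed outcome.
    x_onehot = F.one_hot(x, N_ACTIONS).float()
    y_pred = outcome_net(torch.cat([s, x_onehot], dim=-1)).squeeze(-1)
    outcome_loss = F.mse_loss(y_pred, y)

    # Regress the policy P(X) on the teacher's correction only.
    policy_loss = F.cross_entropy(logits, x_star)

    opt.zero_grad()
    (outcome_loss + policy_loss).backward()
    opt.step()
    return outcome_loss.item(), policy_loss.item()
</code>

Sampling from ''logits.detach()'' rather than ''logits'' is what enforces the constraint above: the policy parameters are updated only through the cross-entropy against the teacher's action, never through the learner's own choices.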