- and reinforcing consequences (rewards and punishments).
These elements are essential for learning through direct interaction with the environment.

But learning like this is extremely limited. Most of what we know about the world does not come from first-person experience!

[[https://
This implies that $P(Y|X)$ will predict well what will happen when the demonstrator chooses $X$, but it won't predict what will happen when the learner chooses $X$. This last prediction differs because the learner's choice of $X$, unlike the demonstrator's, is not informed by the hidden confounder.
==== The math: why does this happen? ====
To understand what will happen when we replace the demonstrator with the learner, we need the interventional distribution $P(Y|\text{do}(X))$, not the observational conditional $P(Y|X)$.
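To make the gap concrete, here is a minimal toy simulation (my own illustration, not from this page; the policy and probabilities are made up) in which a hidden confounder $\theta$ drives both the demonstrator's action $X$ and the outcome $Y$. Conditioning gives $P(Y|X)=\sum_\theta P(Y|X,\theta)P(\theta|X)$, while intervening gives $P(Y|\text{do}(X))=\sum_\theta P(Y|X,\theta)P(\theta)$, and the two disagree whenever $X$ is informative about $\theta$:

<code python>
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Hidden confounder theta: seen by the demonstrator, not by the learner.
theta = rng.integers(0, 2, size=n)

# Toy demonstrator policy: picks X = theta with probability 0.9.
x = np.where(rng.random(n) < 0.9, theta, 1 - theta)

# Outcome: success iff the action matches the hidden confounder.
y = (x == theta).astype(int)

# Observational conditional P(Y=1 | X=1): high, because observing X=1
# is strong evidence that theta = 1.
print("P(Y=1 | X=1)     ~", y[x == 1].mean())      # ~0.9

# Interventional P(Y=1 | do(X=1)): the learner forces X=1 regardless of
# theta, which destroys that evidence.
y_do = (theta == 1).astype(int)
print("P(Y=1 | do(X=1)) ~", y_do.mean())           # ~0.5
</code>

The demonstrator's 90% success rate simply does not transfer to a learner who copies the action without access to $\theta$.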
Next, let's discuss approaches to imitation learning. Throughout the discussion below, it is important to keep in mind that the learner does not know the demonstrator's reward function.
==== Case 1: No confounding, or causally sufficient context ====
If there is no confounding variable, or if the confounding variable $\theta$ is a deterministic function of the observable context, $\theta = f(C)$, then $P(Y|\text{do}(X), C) = P(Y|X, C)$.
Hence watching a demonstration is like acting oneself. Therefore, in this case it is safe for the agent to learn the joint $P(C, X, Y)$ and then use the conditional $P(X | C)$ for choosing its actions to obtain the effect $P(Y|X,C)$. This is an important special case. In causal lingo, we say that //$C$ screens off $\theta$//.
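Continuing the toy simulation from above (again my own illustration, not the page's): once the confounder is a deterministic function of an observed context $C$, conditioning on $C$ gives the same answer as intervening:

<code python>
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Observable context C; the confounder is a deterministic function theta = f(C).
c = rng.integers(0, 2, size=n)
theta = c  # f is the identity here, purely for illustration

# Same toy demonstrator and outcome as before.
x = np.where(rng.random(n) < 0.9, theta, 1 - theta)
y = (x == theta).astype(int)

# Observational conditional, holding the context fixed.
print("P(Y=1 | X=1, C=1)     ~", y[(c == 1) & (x == 1)].mean())   # ~1.0

# Interventional: the learner forces X=1 in every context.
y_do = (theta == 1).astype(int)
print("P(Y=1 | do(X=1), C=1) ~", y_do[c == 1].mean())             # ~1.0
</code>

The two quantities now coincide, which is exactly why regressing $P(X|C)$ from demonstrations is safe in this case.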
This causal sufficiency is typically the (tacit) assumption in the imitation learning literature. For instance, if the demonstrator chooses optimal actions in an MDP, then their policy is a function of the fully observed state, so the state plays the role of a causally sufficient context $C$.
==== Case 2: Teacher supervision ====
A limited form of supervision can be achieved when a teacher directly supervises the learner while performing the task: think of it as having the driving instructor sitting next to the student during a driving lesson, providing instantaneous feedback, or a parent teaching their child how to walk.
This is a hybrid first/third-person setting:

- since the learner chooses the actions themself, they get to observe the consequences of their own choices (the first-person part);
- and since the teacher then provides the best action in hindsight, the learner also separately observes the desired expert action (the third-person part).

In this case, it is safe for the learner to regress $P(X)$ and $P(Y|\text{do}(X))$ as long as there is no information flowing from the choice $X$ back into the policy parameters of $P(X)$. The last constraint is typically achieved via a stop-gradient in a deep learning implementation. This makes sure the learner's choices remain genuine interventions, so the regressed conditional really is the interventional $P(Y|\text{do}(X))$.
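Below is a minimal sketch of how such a training loop might look, under assumptions of my own: the tiny networks, the ''teacher_action'' and ''environment_outcome'' stand-ins, and the use of PyTorch are illustrative choices, not part of this page. The only essential ingredient is the ''detach()'' (stop-gradient) on the chosen action when fitting the outcome model:

<code python>
import torch
import torch.nn as nn

# Illustrative stand-ins: a policy network for P(X|C) and an outcome model
# for P(Y|do(X), C), both tiny MLPs.
policy = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
outcome_model = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam([*policy.parameters(), *outcome_model.parameters()], lr=1e-3)

def teacher_action(context):               # hypothetical best-action-in-hindsight
    return context.mean(dim=-1, keepdim=True)

def environment_outcome(context, action):  # hypothetical outcome signal
    return -(action - context.mean(dim=-1, keepdim=True)) ** 2

for step in range(1000):
    context = torch.randn(32, 4)

    # First-person part: the learner picks its own action and sees the outcome.
    action = policy(context)
    outcome = environment_outcome(context, action).detach()

    # Regress the outcome model on (C, X, Y).  The stop-gradient (detach) on
    # the chosen action blocks any information flow from X back into the
    # policy parameters through this loss.
    pred = outcome_model(torch.cat([context, action.detach()], dim=-1))
    outcome_loss = ((pred - outcome) ** 2).mean()

    # Third-person part: the policy is regressed towards the teacher's
    # correction only.
    policy_loss = ((action - teacher_action(context)) ** 2).mean()

    opt.zero_grad()
    (outcome_loss + policy_loss).backward()
    opt.step()
</code>

Without the ''detach()'', the outcome loss would also shape the policy through the learner's own exploratory actions, reintroducing exactly the feedback loop the constraint is meant to rule out.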
==== Case 3: General case with no supervision ====