====== How to translate third-person into first-person? ======
:!: Article under construction.
> Imitation is a potent …

//Cite as: Ortega, P.A. "How to translate third-person into first-person?"//
**Operant conditioning versus imitation:** Operant conditioning (OC), the learning paradigm underlying reinforcement learning (RL), requires:
  - an information context,
  - first-person behavior,
  - and reinforcing feedback.
These elements are essential for learning through direct interaction with the environment.
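To make these elements concrete, here is a minimal sketch of the first-person interaction loop that OC/RL presumes. All names and the toy reward rule are illustrative, not from this article:

<code python>
import random

def environment(context, action):
    """Returns reinforcing feedback for an action taken in a context."""
    return 1.0 if action == context % 2 else 0.0  # toy reward rule

# First-person behavior: one action per context, initialized at random.
policy = {c: random.choice([0, 1]) for c in range(4)}

for step in range(100):
    context = random.randrange(4)          # 1. information context
    action = policy[context]               # 2. first-person behavior
    reward = environment(context, action)  # 3. reinforcing feedback
    if reward == 0.0:                      # crude trial-and-error update
        policy[context] = 1 - policy[context]
</code>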
But learning like this is extremely limited. Most of what we know about the world does not come from first-person experience!

To bridge this gap, learners must augment observed information with inferred causal understanding, enabling them to internalize third-person experience as first-person experience. Because the third-person requirement violates the first-person experience-acquisition assumption of OC, imitation learning cannot be cast in terms of RL. First-person experience contains causal information which third-person experience simply lacks. Therefore, transforming third-person into first-person experience requires the addition of the missing causal information //by the learner themself//.
===== The problem: actions versus observations =====
This implies that $P(Y|X)$ will predict well what will happen when the demonstrator chooses $X$, but it won't predict what will happen when the learner chooses $X$. This last prediction differs because the learner' | This implies that $P(Y|X)$ will predict well what will happen when the demonstrator chooses $X$, but it won't predict what will happen when the learner chooses $X$. This last prediction differs because the learner' | ||
==== The math: why does this happen? ====
To understand what will happen when we substitute the learner for the demonstrator, we need $P(Y|\text{do}(X))$, the distribution over effects when $X$ is set by intervention rather than merely observed. The observational conditional, by contrast, expands as
$$
P(y|x) = \sum_\theta P(\theta) P(y|x, \theta) R(x, \theta) \qquad (2)
$$
where the $R(x, \theta)$ are information-coupling terms((Yes, the same as in the definition of mutual information.)) defined as
$$
R(x, \theta) = \frac{ P(x,\theta) }{ P(x)P(\theta) } \qquad (3)
$$
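As a quick numerical check, here is a small sketch that evaluates equations (2) and (3) on a toy confounded demonstrator (all numbers made up for illustration) and compares the observational prediction against the interventional one:

<code python>
import numpy as np

# Toy confounded demonstrator: theta, x, y are each binary.
P_theta = np.array([0.5, 0.5])                        # P(theta)
P_x_given_theta = np.array([[0.9, 0.1],               # P(x|theta=0)
                            [0.1, 0.9]])              # P(x|theta=1)
P_y_given_x_theta = np.array([[[0.8, 0.2],            # theta=0, x=0
                               [0.5, 0.5]],           # theta=0, x=1
                              [[0.5, 0.5],            # theta=1, x=0
                               [0.2, 0.8]]])          # theta=1, x=1

P_x = P_theta @ P_x_given_theta                       # P(x)

def R(x, th):
    # Information-coupling term of eq. (3): P(x,theta) / (P(x) P(theta)).
    return P_x_given_theta[th, x] / P_x[x]

def P_y_given_x(y, x):
    # Observational prediction, eq. (2).
    return sum(P_theta[th] * P_y_given_x_theta[th, x, y] * R(x, th)
               for th in range(2))

def P_y_given_do_x(y, x):
    # Interventional prediction: same mixture, coupling terms dropped.
    return sum(P_theta[th] * P_y_given_x_theta[th, x, y]
               for th in range(2))

print(P_y_given_x(1, 1))     # 0.77 -- what demonstrations show
print(P_y_given_do_x(1, 1))  # 0.65 -- what the learner would experience
</code>

The gap between the two numbers is produced entirely by the coupling terms: replacing every $R(x, \theta)$ with $1$ in eq. (2) recovers the interventional distribution.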
==== Core mechanism of imitation ====
The core mechanism of imitation learning then consists of the following. If the //behavior schema// is known, by which we mean that the learner knows the joint $P(\theta, X, Y)$ and the underlying causal graph, then instead of choosing $X$ by sampling it according to $P(X|\theta)$ (which is impossible because $\theta$ is unknown), the first-person version chooses $X$ from its nearest distribution,
$$
P(x) = \sum_\theta P(\theta) P(x|\theta). \qquad (5)
$$
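Continuing the toy example, a sketch of this first-person policy: the learner samples from the marginal of eq. (5) and then experiences the interventional effect of its own choice (names reused from the snippet above):

<code python>
rng = np.random.default_rng(0)

def learner_action():
    # Eq. (5): X is sampled from the marginal P(x), since theta is unknown.
    return rng.choice(2, p=P_x)

# The outcome distribution the imitating learner actually experiences:
# P(y) = sum_x P(x) P(y|do(x)).
P_y_learner = [sum(P_x[x] * P_y_given_do_x(y, x) for x in range(2))
               for y in range(2)]
</code>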
===== Imitation learning =====
Next, let's discuss approaches to imitation learning.
==== Case 1: No confounding, or causally sufficient context ====
If there is no confounding variable, or if the confounding variable $\theta$ is a deterministic function of the observable context $\theta = f(C)$, then conditioning on the context renders observation and intervention equivalent:
$$
P(y|\text{do}(x), c) = P(y|x, c).
$$
Hence watching a demonstration is like acting oneself. Therefore in this case it is safe for the agent to learn the joint $P(C, X, Y)$, and then use the conditional $P(X | C)$ for choosing its actions to obtain the effect $P(Y|X,C)$. This is an important special case. In causal lingo, we say that //$C$ screens off $\theta$//.
This causal sufficiency is typically the (tacit) assumption in the imitation learning literature. For instance, if the demonstrator chooses optimal actions in an MDP, then their policy is a function of the observed state alone, so the state acts as a causally sufficient context.
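A sketch of the screening-off argument, reusing the toy distributions above and assuming (hypothetically) the simplest causally sufficient context, an observed $C$ with $\theta = f(C) = C$:

<code python>
def f(c):
    # The confounder is a deterministic function of the observed context.
    return c

P_c = P_theta  # with C = theta, the context inherits theta's marginal

def P_y_given_x_c(y, x, c):
    # Observational conditional computed from the joint P(c, x, y).
    joint = P_c[c] * P_x_given_theta[f(c), x] * P_y_given_x_theta[f(c), x, y]
    return joint / (P_c[c] * P_x_given_theta[f(c), x])

def P_y_given_do_x_c(y, x, c):
    # Interventional conditional: conditioning on C already pins down theta.
    return P_y_given_x_theta[f(c), x, y]

# C screens off theta: observation and intervention agree in every context.
assert all(np.isclose(P_y_given_x_c(y, x, c), P_y_given_do_x_c(y, x, c))
           for y in range(2) for x in range(2) for c in range(2))
</code>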
==== Case 2: Teacher supervision ====
A limited form of supervision can be achieved when a teacher directly supervises the learner while the learner performs the task: think of a driving instructor sitting next to the student during a driving lesson, providing instantaneous feedback, or a parent teaching their child how to walk.
This is a hybrid first/third-person setting:

  - since the learner chooses the actions themself, they get to observe the causal effects of their own choices,
  - and since the teacher then provides the best action in hindsight, the learner also separately observes the desired expert behavior.

In this case, it is safe for the learner to regress $P(X)$ and $P(Y|\text{do}(X))$ as long as there is no information flowing from the choice $X$ back into the policy parameters of $P(X)$. The last constraint is typically achieved via a stop-gradient in a deep learning implementation. This makes sure the learner's own choices cannot feed back into the policy and act as a confounder.
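A minimal sketch of this stop-gradient constraint in a deep learning setting (hypothetical model, shapes, and losses; `jax.lax.stop_gradient` is one standard way to cut the feedback path):

<code python>
import jax
import jax.numpy as jnp

# Hypothetical two-head model: a policy head for P(X|C) and an
# outcome head for P(Y|do(X), C). Nothing here is from the article.

def init_params(key):
    k1, k2 = jax.random.split(key)
    return {"policy": jax.random.normal(k1, (4, 2)),    # action logits per context
            "outcome": jax.random.normal(k2, (4, 2))}   # outcome weights per context

def loss(params, c, teacher_x, y):
    # Imitation term: regress the teacher's corrected action (learning P(X|C)).
    logp = jax.nn.log_softmax(params["policy"][c])
    imitation = -logp[teacher_x]

    # Outcome term: regress the observed effect of the executed action
    # (learning P(Y|do(X), C)). stop_gradient cuts the information flow
    # from the choice X back into the policy parameters, as required.
    executed = jax.lax.stop_gradient(jax.nn.softmax(params["policy"][c]))
    pred_y = params["outcome"][c] @ executed
    outcome = (pred_y - y) ** 2
    return imitation + outcome

grads = jax.grad(loss)(init_params(jax.random.PRNGKey(0)), 1, 0, 1.0)
</code>

Gradients from the outcome loss update only the outcome head; the policy head is trained purely on the teacher's corrections, so no information flows from the executed choice back into $P(X)$.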
==== Case 3: General case with no supervision ==== | ==== Case 3: General case with no supervision ==== |