====== Universal Artificial Intelligence as Imitation ====== **Pedro A. Ortega**\\ //Keywords: Solomonoff induction, universal imitation, causal interventions, adaptive control. //\\ Technical Report\\ March 2026\\ ===== Abstract ===== Modern AI often defines agency as reward maximization: specify an objective, then learn to optimize it through interaction. This paper argues for an alternative foundation in which agency is inference: purposeful behavior emerges from learning compact generative explanations of how outcomes depend on chosen interventions. We extend universal induction to interactions by placing a Solomonoff universal mixture over computable generators of complete action-observation histories, with a crucial epistemic rule: actions are interventions, not evidence, so beliefs update only through the world’s responses to what the agent does. The resulting posterior over generators is a first-person belief state, and behavior follows from sampling from the posterior predictive over actions (probability matching). To connect “what happens” to “what the agent should do,” we formalize a counterfactual target action: the action the world would have emitted in the agent’s place. We prove a finite cumulative divergence bound between the agent’s actions and these counterfactual actions, implying only finitely many large deviations (and finitely many mismatches for deterministic targets). We also show that the agent can be taught to behave like any computable function and in particular to learn arbitrary computable behavioral schemas, including reward maximization. In this view, rewards are one kind of observation among many, alongside demonstrations, language, tool outputs, and feedback, rather than the primitive definition of purpose. 
===== Introduction ===== Reinforcement learning has become the default language for “purposeful” behavior in machine learning: an agent is modeled as maximizing expected cumulative reward [SuttonBarto2018RL; KaelblingLittmanMoore1996RLSurvey]. This is a powerful formalization, but it is also a historically contingent modeling choice. The word //reinforcement// comes from psychology’s operant-conditioning picture of behavior shaped by consequences, while the modern mathematical template inherits the optimal-control and economics view that “purpose” is the optimization of a specified objective [Bellman1957DP; Puterman1994MDP]. Because this template is so successful, we rarely ask a prior question: //must// purposeful behavior be characterized as reward maximization? A competing intuition is both older and more general: much of intelligent behavior is acquired //second-hand//. Animals and humans do not learn only from first-person trial and error; they learn by absorbing patterns in what others do and say. In the broadest sense, //imitation// is the uptake of a //pattern//, a regularity in behavior, language, demonstration, analogy, or metaphor, and its conversion into a first-person capacity to act. This is not merely “copying actions” but schema acquisition: the learner internalizes a generative rule that can be applied in new situations, often without ever observing an explicit reward. A child learns how to apologize, how to take turns, and what counts as “rude”; a novice learns laboratory practice; a driver learns norms of merging. These are not fixed objectives revealed by scalar feedback. They are behavioral schemas and justifications acquired on the fly from observed structure. **Imitation as continuation.** In imitation-style settings, the learner often receives third-person demonstrations but no explicit reward or teacher corrections. 
In such cases, learning is best read as //continuation under a schema//: after observing a structured prefix (examples, demonstrations, instructions), the learner must itself produce the next action that extends the demonstrated regularity. This is why here we represent each hypothesis as a //joint interaction generator//, inducing both an action channel and an observation channel: imitation requires hypotheses that can generate behavior, not only predict world responses. Seen this way, imitation is inseparable from compression. To imitate a pattern is to find a short program that generates it: a reusable explanation that captures what is essential and discards what is incidental. Language is a particularly powerful carrier of such patterns: a verbal description can specify a behavior never before encountered; an analogy can transfer a schema from one domain to another; a metaphor can compress a complex policy into a memorable rule. In all these cases, the learner acquires not only actions but also the //reasons// that make the actions make sense in context. Purposeful behavior, on this view, is grounded in learned generative structure rather than in a pre-declared reward signal. The obstacle is that third-person patterns do not automatically translate into first-person competence. The reason is causal. From a first-person perspective, //actions are choices// and //observations are evidence//. In third-person data, however, the demonstrator’s “actions” are themselves observations for the learner: they are entangled with the demonstrator’s information and latent intentions. 
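This entanglement can be made concrete in a minimal sketch (a hypothetical two-variable example, not part of the formal development below): a hidden intent $H$ drives both the demonstrator's action $X$ and the outcome $Y$, so conditioning on the observed action and intervening on it give different answers.

```python
# Hypothetical confounded model: hidden intent H drives both the
# demonstrator's action X and the outcome Y.
# H ~ Bernoulli(1/2); the demonstrator sets X = H; the world sets Y = H.

def joint():
    """Enumerate (prob, h, x, y) tuples of the observational joint."""
    for h in (0, 1):
        yield 0.5, h, h, h  # X = H and Y = H, each case with prob 1/2

# Conditioning: P(Y=1 | X=1) treats the observed action as evidence about H.
num = sum(p for p, h, x, y in joint() if x == 1 and y == 1)
den = sum(p for p, h, x, y in joint() if x == 1)
p_cond = num / den

# Intervening: P(Y=1 | do(X=1)) cuts the H -> X edge and sets X = 1 by fiat;
# Y still depends on H, which keeps its prior Bernoulli(1/2).
p_do = sum(0.5 for h in (0, 1) if h == 1)

print(p_cond, p_do)  # 1.0 0.5
```

Observationally, seeing $X=1$ reveals $H=1$ and hence $Y=1$; setting $X=1$ by intervention reveals nothing about $H$, so the outcome stays at its prior.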
Treating observed actions as if they were first-person interventions produces a characteristic error: updating beliefs as if self-generated actions were evidence about which hypothesis is true (“learning from one’s own actions”) or, in imitation, predicting consequences using $P(Y \mid X)$ when what one needs is $P(Y \mid \mathrm{do}(X))$ [pearl2009causality; Ortega2021shaking; OrtegaBraun2009BayesianControlInterventions; OrtegaBraun2010BayesianControlRuleInterventions]. Translating a pattern from “what //they// do” to “what happens when //I// do it” requires an explicit intervention/evidence asymmetry. This paper develops a universal account of interactive learning that makes this asymmetry primary and ties it directly to compression. Solomonoff induction constructs a universal predictor by mixing over all computable hypotheses with weights given by description length [Solomonoff1964FTII; Solomonoff1964FTII2; Levin1974LawsInfoConservation; LiVitanyi2019KC]. To extend this universality to agents, the object of prediction must be interaction histories, not passive observation strings. Therefore, we construct a Solomonoff-style universal semimeasure directly on interactions. This yields a single universal joint model over alternating action-observation sequences: a universal compressor of interaction patterns. A joint universal model is still not a first-person agent. The agent-world distinction is enforced epistemically by the belief update rule: only observations contribute evidence, while actions do //not// [OrtegaBraun2009BayesianControlInterventions; OrtegaBraun2010BayesianControlRuleInterventions]. Thus, hypothesis weights are updated only by observation likelihoods, while action probabilities are excluded from evidence. 
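The update rule just described can be sketched in a few lines (the two toy hypotheses and their channels are illustrative assumptions): both hypotheses share the same world-response rule but differ in their action channels, so the intervention posterior never moves, whereas a naive rule that conditions on the agent's own sampled action drifts.

```python
# Sketch of the intervention posterior: weights update only on observation
# likelihoods; the agent's own actions are do()'s and carry no evidence.
# The two toy hypotheses share a world-response rule but differ on actions.

action_channel = {"p": {"L": 0.9, "R": 0.1}, "q": {"L": 0.1, "R": 0.9}}
obs_channel = {"p": {"hit": 0.5, "miss": 0.5}, "q": {"hit": 0.5, "miss": 0.5}}

def update_on_action(post, a):
    """Correct rule: an action is an intervention, so no reweighting."""
    return dict(post)

def update_on_observation(post, o):
    """Observations are evidence: reweight by likelihood and renormalize."""
    new = {h: w * obs_channel[h][o] for h, w in post.items()}
    z = sum(new.values())
    return {h: w / z for h, w in new.items()}

def naive_update_on_action(post, a):
    """Mistaken rule: treating one's own sampled action as evidence."""
    new = {h: w * action_channel[h][a] for h, w in post.items()}
    z = sum(new.values())
    return {h: w / z for h, w in new.items()}

post = {"p": 0.5, "q": 0.5}
post = update_on_action(post, "L")         # odds unchanged by the choice
post = update_on_observation(post, "hit")  # identical obs channels: still 0.5
print(post)

print(naive_update_on_action({"p": 0.5, "q": 0.5}, "L"))  # drifts toward p
```

Under the naive rule, sampling "L" pushes the posterior to $0.9$ on the hypothesis that happened to favor that action, even though no information about world responses arrived.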
The resulting //intervention posterior// is a universal first-person belief state: it is the agent’s compressed explanation of “how the world responds when I do things.” Behavior then follows by mixing, or probability matching, the hypotheses’ induced action channels under that posterior. In this sense, purposeful behavior can be grounded in universal inference under interventions, with reward maximization appearing as one optional way of selecting among schemas, rather than as the defining semantic core.

**Contributions.**
  - **Universal induction for interaction:** We extend Solomonoff’s mixture idea from passive strings to action-observation transcripts, by mixing over computable programs that generate complete interactive histories [Solomonoff1964FTII; Solomonoff1964FTII2; LiVitanyi2019KC].
  - **First-person learning rule:** We give an explicit belief update in which the agent’s own actions are treated as choices rather than evidence. Beliefs update only through the world’s responses to what the agent does, avoiding “learning from one’s own actions” [OrtegaBraun2009BayesianControlInterventions; OrtegaBraun2010BayesianControlRuleInterventions; Ortega2021shaking].
  - **Behavior from beliefs:** We derive a simple control rule: at each step, act by mixing, or sampling, the action tendencies implied by the currently plausible programs [OrtegaBraun2009BayesianControlInterventions; OrtegaBraun2010BayesianControlRuleInterventions].
  - **Interface and continuation targets:** We formalize variable-length turns with a computable gate that defines token boundaries, and use it to define what the world //would have written// at a would-be action slot, and when the same continuation is instead revealed by the world [pearl2009causality].
  - **Universal imitation guarantee:** In a protocol where such continuations are sometimes revealed and sometimes demanded, we prove a finite cumulative divergence bound to any computable target continuation on the realized history; hence large deviations occur only finitely often, and for deterministic targets the expected number of mismatches is finite [Solomonoff1964FTII; Solomonoff1964FTII2; Levin1974LawsInfoConservation; LiVitanyi2019KC].

===== Setup and the Universal Prior =====

We start by defining the agent’s prior beliefs over possible generators of a single-stream substrate. Following Solomonoff [Solomonoff1964FTII; Solomonoff1964FTII2], we avoid committing to a parametric family and instead mix over computable generators, with an inductive bias toward shorter descriptions. Here we work with generators presented by explicit //one-step rules// $\nu(\cdot \mid \cdot)$, because the interaction setting will require conditioning on, and intervening into, these conditional rules directly. Throughout, we use the following shorthand for sequences: $$ x_{n:m} = x_n, x_{n+1}, \ldots, x_m $$ for $n \le m$, and similarly $$ x_{\le n} = x_{1:n}, \qquad x_{<n} = x_{1:n-1}. $$

**Generators.** Fix a finite base alphabet $\Sigma$. A //chronological generator// $\nu$ is specified by one-step conditional rules $\nu(x_{n+1} \mid x_{\le n})$ over $\Sigma$ with $$ \sum_{x_{n+1} \in \Sigma} \nu(x_{n+1} \mid x_{\le n}) \le 1, $$ where the missing mass is the probability that generation halts after $x_{\le n}$. The induced prefix mass $$ \nu(x_{1:n}) := \prod_{k=0}^{n-1} \nu(x_{k+1} \mid x_{\le k}) $$ makes $\nu$ a //semimeasure// on $\Sigma^\ast$; it is a //measure// when every one-step rule sums to exactly $1$. Let $\mathcal{M}$ denote the class of generators whose one-step rules are uniformly lower-semicomputable.

**Deviation measures.** For distributions $P, Q$ on a countable set $Z$, define the Kullback-Leibler divergence by $$ D_{\mathrm{KL}}(P \| Q) := \sum_{z \in Z} P(z) \log \frac{P(z)}{Q(z)}, $$ with the convention $D_{\mathrm{KL}}(P \| Q) = \infty$ if $P(z) > 0$ but $Q(z) = 0$ for some $z$. Define total variation distance by $$ \mathrm{TV}(P,Q) := \sup_{E \subseteq Z} |P(E) - Q(E)| = \frac{1}{2} \sum_{z \in Z} |P(z) - Q(z)|. $$ When $Q$ comes from a semimeasure rule (total mass at most $1$), $\mathrm{TV}$ and $D_{\mathrm{KL}}$ are understood between the completed rules $\bar{P}, \bar{Q}$ on $Z \cup \{\bot\}$, and a measure assigns probability $0$ to $\bot$. Throughout, $\log$ is the natural logarithm.

**The universal prior.** Fix an effective listing $(\nu_p)_{p \in \mathbb{N}}$ of $\mathcal{M}$, so every $\nu \in \mathcal{M}$ appears at least once, and fix positive weights $(w(p))_{p \in \mathbb{N}}$ with $$ \sum_{p=1}^\infty w(p) \le 1. $$ We interpret $w(p)$ as the agent’s //universal prior// weight on generator $p$, with a built-in bias toward shorter descriptions.
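As a finite stand-in for this construction, the weighted mixture can be computed explicitly (a sketch: the three-generator Bernoulli class and the particular weights are illustrative assumptions, not the full computable class):

```python
# Finite sketch of the mixture M(x) = sum_p w(p) * nu_p(x) over a toy class:
# three i.i.d. Bernoulli(theta) generators stand in for all computable ones.

def bernoulli_generator(theta):
    """Return nu(x): the prefix mass of a binary tuple x under Bernoulli(theta)."""
    def nu(x):
        mass = 1.0
        for bit in x:
            mass *= theta if bit == 1 else 1.0 - theta
        return mass
    return nu

thetas = (0.25, 0.5, 0.75)
generators = [bernoulli_generator(t) for t in thetas]
weights = [2.0 ** -(i + 2) for i in range(len(generators))]  # sum = 7/16 <= 1

def M(x):
    """Mixture mass of the prefix x under the weighted toy class."""
    return sum(w * nu(x) for w, nu in zip(weights, generators))

x = (1, 1, 0, 1)
# M assigns every prefix at least w_p times the mass nu_p gives it:
# the (here trivial, but structurally faithful) dominance property.
assert all(M(x) >= w * nu(x) for w, nu in zip(weights, generators))
print(M(x))
```

The dominance constant for each generator is simply its prior weight, which is the finite-class shadow of the constant $c_\nu$ below.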
For example, choose a prefix-free code for indices $p$ and set $w(p) = 2^{-|p|}$ [LiVitanyi2019KC]. Define the //universal semimeasure// on substrate prefixes by $$ M(x) := \sum_{p=1}^\infty w(p)\,\nu_p(x), \qquad x \in \Sigma^\ast. $$ Then $M$ is itself a lower-semicomputable semimeasure on $\Sigma^\ast$, and it //dominates// every $\nu \in \mathcal{M}$: for each $\nu$ there exists a constant $c_\nu > 0$ such that $$ M(x) \ge c_\nu\,\nu(x) $$ for all $x \in \Sigma^\ast$. This dominance property is the sense in which $M$ is //universal//: up to a fixed constant factor, it assigns at least as much mass to every substrate prefix as any computable generator in the class [LiVitanyi2019KC]. Note that this universality is relative to $\mathcal{M}$ as defined above: $\mathcal{M}$ is strictly smaller than the full class of lower-semicomputable semimeasures on $\Sigma^\ast$, because a lower-semicomputable prefix-mass function need not admit uniformly lower-semicomputable one-step conditionals [Levin1974LawsInfoConservation].

===== Interaction Systems =====

We now make the interaction protocol explicit. The guiding intuition is simple: the agent and the world write into a shared substrate stream, but //who// gets to write next is decided by an interface rule. From the agent’s first-person perspective, what it writes are //choices//, and therefore not evidence, while what the world writes are //outcomes//, and therefore evidence.

==== Definition: Interaction system ====

An //interaction system// is a tuple $(\Sigma,\Gamma,\pi,\mu)$ where:
  - $\Sigma$ is a fixed finite base alphabet and the substrate is a single stream $x_{1:\infty} \in \Sigma^\infty$.
  - $\Gamma$ is a //gate// that produces a binary process $\gamma_1,\gamma_2,\ldots$ indexed by substrate position, where $\gamma_k = 1$ means the agent writes the next substrate symbol at position $k$ and $\gamma_k = 0$ means the world writes it.
- $\pi$ and $\mu$ are (semi)measures on $\Sigma^\ast$ (typically lower-semicomputable semimeasures) used as symbol-level generators for the agent and world, respectively, when they hold the gate. Each comes with a chronological one-step decomposition in the sense of the previous section: at each substrate position, the next symbol is drawn from a conditional distribution given the written prefix so far and the already-emitted prefix of the current block. **Gating and token boundaries.** The gate $\Gamma$ is a //chronological// (non-anticipating) and //computable// rule for producing a binary process $\gamma_1,\gamma_2,\ldots \in \{0,1\}$ indexed by substrate position. We fix $\gamma_1 = 1$ so the agent writes first, and for each $k \ge 1$ we sample $$ \gamma_{k+1} \sim \Gamma(\cdot \mid \gamma_{\le k}, x_{\le k}), $$ so the next gate value may depend on the entire written prefix and past gate values but never on future symbols. While $\gamma_k$ remains constant, the same side continues emitting symbols into the shared stream. A //token boundary// occurs exactly when $\gamma$ switches value; this partitions the substrate into maximal constant-$\gamma$ blocks. Reading these blocks in order yields an alternating sequence of interface-level tokens $$ a_1, o_1, a_2, o_2, \ldots $$ with $(a_t, o_t) \in \mathcal{A} \times \mathcal{O}$, where $a_t$ is the $t$-th agent-written block and $o_t$ is the subsequent world-written block. This is how the sets $\mathcal{A}$ and $\mathcal{O}$ are defined. **Interface-level tokens.** The gate $\Gamma$ induces the segmentation into blocks; the sets $\mathcal{A}$ and $\mathcal{O}$ are therefore //interface-level// token sets rather than intrinsic properties of $\Sigma^\ast$. In particular, action and observation tokens need not be self-delimiting when viewed as raw substrings of $\Sigma^\ast$ in isolation: the block boundaries are provided by the gate, not by an assumed prefix-free code for $\mathcal{A}$ or $\mathcal{O}$. 
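The block segmentation induced by the gate can be sketched directly (the transcript and gate values below are illustrative):

```python
# Segmenting the substrate stream into interface-level tokens: a token
# boundary occurs exactly where the gate value switches.
# gamma[k] = 1 means the agent wrote x[k]; gamma[k] = 0 means the world did.

def tokens(x, gamma):
    """Split the stream into maximal constant-gamma blocks, in order."""
    blocks, start = [], 0
    for k in range(1, len(x) + 1):
        if k == len(x) or gamma[k] != gamma[start]:
            blocks.append((gamma[start], x[start:k]))
            start = k
    return blocks

# Illustrative transcript: agent writes "ab", world replies "cde",
# agent writes "f", world replies "g".
x = "abcdefg"
gamma = [1, 1, 0, 0, 0, 1, 0]
print(tokens(x, gamma))  # [(1, 'ab'), (0, 'cde'), (1, 'f'), (0, 'g')]
```

Reading the blocks in order yields the alternating tokens $a_1, o_1, a_2, o_2$; the raw string alone does not determine the boundaries.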
In other words, to uniquely decode actions and observations from the symbol transcript, we need the gating transcript.

**First-person accounting in block generation.** The key discipline is that each side treats its //own// completed tokens as interventions and the other side’s completed tokens as evidence. We write $\hat{a}_t$ for an agent-produced action token (an intervention from the agent’s view) and $\hat{o}_t$ for a world-produced observation token (an intervention from the world’s view), where $\hat{z} := \mathrm{do}(Z=z)$ is bookkeeping for “this token was chosen by the side recording it” [pearl2009causality]. Let $k$ lie inside the current block, let $\tau$ be the substrate index of the most recent token boundary, and let $x_{\tau:k}$ denote the prefix of the current block emitted so far. Then the next symbol is drawn from the generator that currently holds the gate. Writing the completed token history up to the current interaction round as $\underline{ao}_{<t}$ (hatted according to the side doing the accounting), the agent draws $x_{k+1} \sim \pi(\cdot \mid \underline{\hat{a}o}_{<t},\, x_{\tau:k})$ when it holds the gate, and the world draws $x_{k+1} \sim \mu(\cdot \mid \underline{a\hat{o}}_{<t},\, x_{\tau:k})$ otherwise. We call a substrate position $k$ a //potential action index// if, under the tokenization convention, an $\mathcal{A}$-block could begin at $k$, whether or not the on-path gate actually opens one there.

==== Definition: Counterfactual action ====

Let $k$ be a potential action index with associated transcript $\underline{\hat{a}o}_{\le t}$, at which on-path the agent writes ($\gamma_k = 1$). Consider the counterfactual branch in which the position is still tokenized as an $\mathcal{A}$-block but the world writes its content: let $\dot{\gamma}$ and $\dot{x}$ denote the gate and substrate processes of this branch, with $\dot{\gamma}_k = 1$ and the block symbols drawn from $\mu$, and let $k' > k$ be the first position such that $\dot{\gamma}_{k'} = 0$. Equivalently, $[k,k')$ is the maximal block in the branch with $\dot{\gamma} = 1$. The //counterfactual action// $\dot{a}_{t+1} \in \mathcal{A}$ is defined to be the random block content written over positions $k,\ldots,k'-1$ in that branch: $$ \dot{a}_{t+1} := \dot{x}_{k:k'-1}. $$ Note that $k'$ is determined inside the branch and therefore the length of $\dot{a}_{t+1}$ need not match the length of the on-path $\mathcal{A}$-token written by the agent starting at $k$.

**Diagram note.** The intended picture is that after the shared on-path prefix ending at $a_3,o_3$, the on-path transcript has factual action $a_4$, while the counterfactual branch replaces that with the world-generated block $\dot{a}_4$ occupying the same would-be action slot.

==== Definition: Third-party action ====

Let $k$ be a potential action index with associated transcript $\underline{\hat{a}o}_{<t}\,\hat{a}_t\,w$, at which on-path the world writes ($\gamma_k = 0$), so that $k$ falls inside the world-written token $o_t$ and $w$ is the portion of $o_t$ preceding position $k$. Re-tag the positions from $k$ onward as an $\mathcal{A}$-block under the same tokenization convention: let $\dot{\gamma}$ denote the re-tagged gate values with $\dot{\gamma}_k = 1$, and let $k' > k$ be the first position such that $\dot{\gamma}_{k'} = 0$.
Equivalently, $[k,k')$ is the maximal block in the branch with $\dot{\gamma} = 1$. The //third-party action// $\dot{a}_{t+1} \in \mathcal{A}$ is defined to be the random block content written in the transcript over positions $k,\ldots,k'-1$: $$ \dot{a}_{t+1} := x_{k:k'-1}. $$ If $\dot{a}_{t+1}$ ends before its embedding $\mathcal{O}$-token, we can write $v$ for the remaining non-empty suffix of the same world-written token after position $k'-1$, so that $$ o_t = w\,\dot{a}_{t+1}\,v. $$

**Diagram note.** The intended picture is that inside a long world-written observation token, there is an embedded block $\dot{a}_4$ that occupies an $\mathcal{A}$-position under the tokenization convention, even though on-path it is still written by the world and therefore appears as evidence.

It is not hard to see that, for a given potential index $k$, the counterfactual action $\dot{a}_{t+1}$ and the third-party action $\dot{a}_{t+1}$ are the same random block: the difference is only whether the gate sampled $\gamma_k = 1$ (counterfactual, not observed) or $\gamma_k = 0$ (third-party, observed) at position $k$. The precise distinction between the different types of $\mathcal{A}$-tokens is important; we will also refer to them as //factual// (first-person, agent-generated), //counterfactual//, and //third-party// $\mathcal{A}$-tokens.

===== Observations on the model =====

**Why actions cannot be evidence.** Hypotheses may assign probabilities to actions. If one naively updated on realized actions as if they were evidence, the posterior would spuriously reward hypotheses that happened to predict the agent’s own sampled choices [OrtegaBraun2010BayesianControlRuleInterventions; Ortega2021shaking]. Concretely, suppose two programs $p,q$ have identical tested world-response rules, $$ \nu_p(o \mid h,a) = \nu_q(o \mid h,a) \quad \text{for all } (h,a,o), $$ but different action channels $$ \nu_p(a \mid h) \ne \nu_q(a \mid h).
$$ Then the world responses provide no evidence to distinguish them. Under the intervention posterior, appending an action alone leaves odds unchanged, so neither program is self-reinforced by the agent’s own behavior. If one instead conditioned on $\nu_p(a_{t+1} \mid h)$ as evidence, the posterior would drift toward whichever program better predicted the sampled action, even though no new information about world responses was observed. **The gate as a boundary-uncertainty device.** The gate $\Gamma$ is an information-structure device: it formalizes that, from the agent’s perspective, the action-observation boundary is not itself a learnable regularity of the interaction history. If the agent could reliably detect a pattern in when it must write versus when the world will write, then $\mathcal{A}$-blocks would split into distinguishable regimes, and evidence obtained when the world writes a would-be $\mathcal{A}$-continuation need not constrain behavior when the agent is later required to write one. Transfer can fail by selection effects. We therefore treat slot assignment as a Harsanyi-style move by Nature in a Bayesian game [Harsanyi1967BayesianGamesI]: at each would-be $\mathcal{A}$-block start, the gate outcome is chosen chronologically from the already-written transcript, including past gate values, and is not revealed to the agent at decision time; hence the agent’s continuation rule is defined invariantly, without conditioning on whether the next $\mathcal{A}$-block will be revealed as evidence or demanded as an intervention. **Meaning of tokens (experiments and reported outcomes).** The tokens $a_t$ and $o_t$ are interface-level abstractions, not literal motor and sensor variables. An “action” $a_t$ may denote a temporally extended procedure, such as issuing a tool call, running a controller, or choosing an experimental protocol. 
An “observation” $o_t$ may be a summary of the resulting outcomes, such as tool output, success or failure, diagnostics, or feedback. From the agent’s first-person perspective, the atomic unit of evidence is therefore the intervention-outcome pair $(\hat{a}_t,o_t)$: an experiment together with its reported result. **Where are policies, predictors, and preferences?** Our setup is unusual because we use a single-model semantics: each hypothesis is a complete chronological generator of substrate strings and therefore, together with the gate, induces simultaneously an action and an observation channel. That is, we do not factor hypotheses into “policy” and “environment” components. The agent-world distinction is imposed externally by the interface and, crucially, by the epistemic rule used for distinguishing actions from evidence. We also do not take rewards and utilities as primitive. Rewards can be included as one kind of observation token, alongside demonstrations, language, tool outputs, and feedback. They could potentially be ignored by the agent. As we will see later, we let the patterns contained in the single-stream substrate plus the intervention-evidence split induce the behavioral schemas directly: from the agent’s point of view, interaction becomes a “gap-filling” or “pattern completion” problem. **Sampling from a semimeasure (symbol-level oracle assumption).** As a simplifying modeling assumption, we posit access to a symbol-level sampling procedure for any lower-semicomputable generator $\nu$ on $\Sigma^\ast$: given the current written prefix, we can draw the next substrate symbol according to $\nu(\cdot \mid \text{prefix})$. Together with the gate, this induces token-level sampling: whichever side holds the gate samples symbols until the gate switches and the block closes. If $\nu$ stops early, because of the semimeasure’s missing mass, sampling may halt mid-block. **Counterfactual vs. 
third-party actions.** We have shown how certain substrate positions admit an //interpretation// as an $\mathcal{A}$-token: namely, the $\mathcal{A}$-block that would be produced starting at $k$ under a fixed counterfactual convention in which the world is taken to write through the next $\mathcal{A}$-block. This choice of counterfactual is not meant to be unique; it is adopted for mathematical convenience, because it yields a single canonical continuation object that is sometimes observable, as a third-party action, and sometimes not, as a counterfactual action.

===== Universal Imitation =====

We now treat purposeful interaction as //intrinsic continuation//: the agent extends a rule that is already present in its experience. The key trick is //equating predicting with acting//. In passive prediction, the agent observes a prefix and predicts the next symbol. In interaction, the transcript contains alternating //gaps// (action slots) and //evidence// (world outputs). If the agent chooses its next action by sampling from its current predictive distribution for “what should occupy the next action slot,” then the agent is literally a //pattern completer//: it extends the transcript according to the regularities compressed by its current beliefs.

===== Intrinsic continuation and the identifiability problem =====

A tempting picture of interaction is //intrinsic continuation//: given what has been written so far, the agent computes a “natural next piece” by predicting what should come next in the transcript. In many imitation-style situations this is exactly the computation we want: the agent extends a regularity that is already present in its experience. To see the intuition in a minimal form, consider a supervised-learning style prompt embedded in a single observation token: $$ o_{t-1} = (u_1,v_1), (u_2,v_2), \ldots, (u_n,\cdot). $$ The last pair is incomplete: the label $v_n$ is missing. Now imagine running the agent’s internal continuation simulation on this token.
After each prompt $u_i$, the agent runs the same computation to form a prediction for the next label $v_i$. The same simulation also predicts future prompts, but that part is typically much harder; the salient object here is the next label. Nothing about this changes at the last prompt: after seeing $u_n$, the same internal computation could simply predict, or sample, $v_n$, which makes the task sound straightforward. But there is an //identifiability// pitfall: the example assumes the agent is in the //same epistemic situation// at every label. If, before each $v_i$, an “act-observe” signal indicates whether the label will be //shown by the world// or //demanded of the agent//, then the first $n-1$ labels are typically learned under “observe,” while the final gap may be the first time “act” appears. The signal itself could be explicit (a codeword) or implicit (a coding pattern). Regularities under “observe” need not transfer to the “act” regime. Worse, under a first-person treatment, past “act” labels are not evidence and thus cannot inform future behavior. In short, the problem is that different hypotheses can agree on all observed outcomes yet prescribe different actions. Acting is therefore unidentifiable even when “observe” prediction is well supported. To avoid the identifiability problem, the agent must commit to its continuation //before// any “act-observe” assignment is revealed or identified. This is what the stochastic gate enforces: the agent does not observe the assignment, so it uses the same intrinsic continuation rule whether the next slot is revealed or demanded. If the slot is assigned to the agent, it simply outputs the completion. To evaluate an action, we compare it to a //counterfactual action// [GibbardHarper1978TwoKindsEU; pearl2009causality]: the token the world would have written in the same position if it had kept writing.
In the labeled-pairs example $(u_1,v_1),\ldots,(u_n,\cdot)$, we can segment the stream and thereby identify each label location as a //potential action position//. For $i=1,\ldots,n-1$, the world fills that potential action position, so each $v_i$ is a //third-party action//: a token in an $\mathcal{A}$-position, but written by the world and therefore available as evidence. The same potential action position after $u_n$ defines the final missing label: the //counterfactual action// is the label the world would have written there, while the agent’s emitted label is the //factual action//. This makes the relation between the three types of actions explicit: third-party actions are the previously revealed labels, and their shared slot structure defines the counterfactual target for the final gap against which the factual action can be compared. The same intuition extends beyond labeled examples. The conditioning context is the entire experience, which may mix extensional content (examples, demonstrations, traces) with intensional content (instructions, rules, constraints, verifier descriptions, program sketches). As long as the agent focuses on predicting the next continuation from whatever structure is present, no special handling is required: the gate determines whether that continuation is scored as evidence (world writes) or realized as an intervention (agent writes), while the agent’s intrinsic continuation computation stays the same.

===== From third-party evidence to behavioral convergence =====

We now formalize the setting in which third-party actions provide evidence about the continuation rule, and in which this evidence //transfers// to the agent’s own actions. We then show that an agent driven by the universal mixture $M$ converts this evidence into behavior: acting by intrinsic completion induces a policy whose divergence from the counterfactual targets is bounded in cumulative expectation. We begin by listing definitions and assumptions.
We need to specify //which// substrate positions count as “action-slot starts.” The definition below ensures:
  - each chosen position is a valid potential action index, so $\dot{a}^{(k)}$ is well-defined;
  - slots do not overlap and are separated by at least some world-written material, so that evidence from one slot is part of the agent-visible history at later slots.

==== Definition: Action schedule ====

Fix an interaction system $(\Sigma,\Gamma,\pi,\mu)$. An //action-slot schedule// is an infinite random sequence of substrate positions $$ k_1 < k_2 < \cdots $$ such that each $k_i$ is a potential action index, and the $\mathcal{A}$-token beginning at $k_i$ ends strictly before $k_{i+1}$ begins. Moreover, between the end of the $\mathcal{A}$-token beginning at $k_i$ and the start of the $\mathcal{A}$-token beginning at $k_{i+1}$, the world writes at least one nonempty substrate block $w_i$, so the agent-visible transcript is $$ h_i := \hat{a}_1 o_1 \cdots \hat{a}_{t(i)} w_i, $$ where $t(i)$ is the interaction time of the last completed action. Let $\dot{a}^{(k_i)}$ denote the $\mathcal{A}$-token the world would write at position $k_i$, that is, the counterfactual or third-party action when $\gamma(k_i)=1$ or $\gamma(k_i)=0$ respectively.

To make “revealed” slots informative about “demanded” slots, we need the gate assignment at each slot to behave like a randomized masking device: it should decide whether the $\mathcal{A}$-token becomes evidence or an agent intervention using only the agent-visible past, and without peeking at what the world continuation will be. We also restrict the world to primitive measures.

==== Assumption: Standard setup ====

Assume $(\Sigma,\Gamma,\pi,\mu)$ is an interaction system where $\pi := M$ is the //universal semimeasure// and $\mu$ is a //primitive measure//. Let $(k_i)_{i \ge 1}$ be an action-slot schedule.
The following conditions hold:
  * **Action-slot is chosen by coin flip.** At each $k_i$, the gate draws $\gamma(k_i) \sim \mathrm{Bernoulli}(\rho_i)$, $\rho_i \in (0,1)$, where $\rho_i$ is a chronological function of the agent-visible history $h_i$. Conditional on $h_i$, the bit $\gamma(k_i)$ is independent of the world’s $\mathcal{A}$-token $\dot{a}^{(k_i)}$ at $k_i$.
  * **Gate held fixed through action-slot.** The gate holds the value of $\gamma(k_i)$ fixed throughout the $\mathcal{A}$-token beginning at $k_i$. If $\gamma(k_i)=0$, the world writes the $\mathcal{A}$-token, so it is a third-party action. If $\gamma(k_i)=1$, the agent writes the $\mathcal{A}$-token, so it becomes an intervention $\hat{a}$ from the agent’s view.
  * **Infinitely many agent-written slots.** With probability $1$, $\gamma(k_i)=1$ occurs for infinitely many $i$.

**Induced agent interventions and world targets.** Before we proceed, we need to clarify the indexing of action slots, and in particular their substrate position versus agent-time. According to the standard setup, the schedule specifies substrate positions $k_1 < k_2 < \cdots$. Then $\dot{a}^{(k_i)} \in \mathcal{A}$ denotes the $\mathcal{A}$-token the world would write starting at $k_i$. If $\gamma(k_i)=0$ this token is realized on-path as an embedded third-party action; if $\gamma(k_i)=1$ it is only a counterfactual target. To index only the factual actions, the slots assigned to the agent, let $i_1 < i_2 < \cdots$ be the random indices with $\gamma(k_{i_t}) = 1$. For each $t \ge 1$, define $a_{t+1} \in \mathcal{A}$ as the $\mathcal{A}$-token the agent actually writes at $k_{i_t}$, and define the corresponding counterfactual target by $\dot{a}_{t+1} := \dot{a}^{(k_{i_t})}$. Notice that in this case the previous observation token was completed, and hence $$ h_i = \hat{a}_1 o_1 \cdots \hat{a}_{t(i)} w_i = \underline{\hat{a}o}_{\le t}.
$$ **Deviation measures.** To quantify how closely intrinsic completion tracks the target continuation, we use $D_{\mathrm{KL}}$ and $\mathrm{TV}$. Since $M(\cdot \mid \cdot)$ may have missing mass, we complete it by adding a stop outcome $\bot \notin \mathcal{A}$ and writing $\overline{\mathcal{A}} := \mathcal{A} \cup \{\bot\}$. Define $\overline{M}(a \mid \cdot) := M(a \mid \cdot) \quad \text{for } a \in \mathcal{A}$, and $\overline{M}(\bot \mid \cdot) := 1 - \sum_{a \in \mathcal{A}} M(a \mid \cdot)$. For the measure $\mu$ set $\overline{\mu}(a \mid \cdot) := \mu(a \mid \cdot) \quad \text{for } a \in \mathcal{A}$, and $ \overline{\mu}(\bot \mid \cdot) := 0$. For distributions $P,Q$ on a countable set, define $$ D_{\mathrm{KL}}(P \| Q) := \sum_x P(x)\log\frac{P(x)}{Q(x)} $$ and $$ \mathrm{TV}(P,Q) := \frac{1}{2}\sum_x |P(x)-Q(x)|. $$ In this section, $D_{\mathrm{KL}}(\mu \| M)$ and $\mathrm{TV}(\mu,M)$ are shorthand for $D_{\mathrm{KL}}(\overline{\mu}\|\overline{M})$ and $\mathrm{TV}(\overline{\mu},\overline{M})$ on $\overline{\mathcal{A}}$. ===== The transfer lemma ===== The next lemma is the basic bookkeeping step that links third-party and counterfactual actions. Whenever a nonnegative quantity of interest is determined by the agent-visible history immediately before an action slot begins, then the expected total of that quantity over agent-assigned slots can be rewritten exactly as a reweighted expected total over world-assigned slots. The only ingredient is that the slot assignment is a coin flip based on the agent-visible past. ==== Lemma: Transfer ==== Under the standard setup, for any nonnegative sequence $(\Delta_i)_{i \ge 1}$ such that each $\Delta_i$ is determined by the agent-visible history immediately before $k_i$, we have the exact identity $$ \mathbb{E}\left[\sum_{i:\,\gamma(k_i)=1}\Delta_i\right] = \mathbb{E}\left[\sum_{i:\,\gamma(k_i)=0}\frac{\rho_i}{1-\rho_i}\,\Delta_i\right]. 
$$ In particular, if there exists $r < \infty$ such that $$ \frac{\rho_i}{1-\rho_i} \le r $$ with probability $1$ for all $i$, then $$ \mathbb{E}\left[\sum_{i:\,\gamma(k_i)=1}\Delta_i\right] \le r\,\mathbb{E}\left[\sum_{i:\,\gamma(k_i)=0}\Delta_i\right]. $$ **Proof.** Fix $i$ and condition on the agent-visible history $h_i$ available immediately before $k_i$. Under this conditioning, $\Delta_i$ and $\rho_i$ are fixed and $\gamma(k_i) \sim \mathrm{Bernoulli}(\rho_i)$, hence $$ \mathbb{E}[\mathbf{1}\{\gamma(k_i)=1\}\Delta_i \mid h_i] = \rho_i \Delta_i = \mathbb{E}\left[\mathbf{1}\{\gamma(k_i)=0\}\frac{\rho_i}{1-\rho_i}\Delta_i \mid h_i\right]. $$ Taking expectation gives $$ \mathbb{E}[\mathbf{1}\{\gamma(k_i)=1\}\Delta_i] = \mathbb{E}\left[\mathbf{1}\{\gamma(k_i)=0\}\frac{\rho_i}{1-\rho_i}\Delta_i\right]. $$ Summing over $i=1,\ldots,N$ and letting $N \to \infty$, which is justified by monotone convergence because all terms are nonnegative, yields the identity. If $\frac{\rho_i}{1-\rho_i} \le r$ almost surely for all $i$, then the inequality follows by bounding the right-hand side termwise by $r\,\mathbf{1}\{\gamma(k_i)=0\}\Delta_i$ and taking expectation. ===== Universal bound on third-party actions ===== We now instantiate the abstract $\Delta_i$ with a quantity that measures how much //evidence about the continuation// the agent would obtain if the world writes it. The key design constraint is twofold. First, $\Delta_i$ must be determined from the agent-visible transcript available immediately before the action slot begins, so the Transfer lemma applies. Second, when the continuation is actually revealed as a third-party action, $\Delta_i$ should be chargeable to a single global “evidence budget” implied by universality: along the world-written stream, the universal mixture $M$ cannot fall behind the true world $\mu$ by more than a fixed constant in cumulative log-loss [Solomonoff1964FTII; Levin1974LawsInfoConservation].
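As a sanity check on the Transfer lemma, its identity can be verified numerically. The following sketch is a toy Monte Carlo simulation (names such as `rho` and `n_slots` are illustrative, not from the construction): $\Delta_i$ evolves by a simple recursion that depends only on past gate outcomes, so it is determined by the agent-visible history before each slot, and the two expectations in the identity are estimated by averaging over episodes.

```python
import random

def run_episode(rho, n_slots, rng):
    """One episode of the gated protocol with a history-measurable Delta_i."""
    lhs = 0.0    # sum of Delta_i over agent-written slots (gamma = 1)
    rhs = 0.0    # reweighted sum over world-written slots (gamma = 0)
    delta = 1.0  # Delta_1; any nonnegative history-measurable quantity works
    for _ in range(n_slots):
        gamma = 1 if rng.random() < rho else 0  # gate coin with bias rho
        if gamma == 1:
            lhs += delta
        else:
            rhs += rho / (1.0 - rho) * delta
        # The next Delta depends only on the agent-visible past (here: past gates).
        delta = 0.5 * delta + (0.7 if gamma == 0 else 0.3)
    return lhs, rhs

rng = random.Random(0)
rho, n_slots, n_episodes = 0.3, 50, 20000
avg_lhs = avg_rhs = 0.0
for _ in range(n_episodes):
    lhs, rhs = run_episode(rho, n_slots, rng)
    avg_lhs += lhs / n_episodes
    avg_rhs += rhs / n_episodes
print(avg_lhs, avg_rhs)
```

With this many episodes the two averages typically agree to within a fraction of a percent, as the exact identity predicts; making $\rho$ a function of the realized gate history leaves the agreement intact, since the proof only uses that the coin is fair with respect to the agent-visible past.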
Since third-party actions are literal substrings of the world-written stream, their total contribution can be bounded by the same constant. For each action slot $k_i$, define the //action evidence divergence// at the agent’s decision time by $$ \Delta_i := \mathbb{E}\left[ \left. \log\frac{\mu(\dot{a}^{(k_i)} \mid h_i)}{M(\dot{a}^{(k_i)} \mid h_i)} \right| h_i \right]. $$ On slots where $\gamma(k_i)=0$, $\dot{a}^{(k_i)}$ is a third-party action, hence it contributes genuine evidence to the intervention posterior; on slots where $\gamma(k_i)=1$, $\dot{a}^{(k_i)}$ remains counterfactual and is not observed on-path. ==== Lemma: Universal bound on third-party actions ==== Assume $(\Sigma,\Gamma,M,\mu)$ is as in the standard setup, with $\mu$ a measure, and consider any action-slot setup satisfying that assumption. Let $\Delta_i$ be defined as above. Then there exists a constant $C_\mu < \infty$, depending only on $\mu$ and the chosen universal prior weights, such that $$ \mathbb{E}\left[\sum_{i:\,\gamma(k_i)=0}\Delta_i\right] \le C_\mu. $$ **Proof.** Because $\mu$ is primitive, hats can be removed from the chronological conditionals: $$ \mu(\dot{a}^{(k_i)} \mid h_i) = \mu(\dot{a}^{(k_i)} \mid \underline{ao}_{\le t(i)}). $$ By universality, $M$ dominates every primitive measure: the program $p_\mu$ generating $\mu$ carries prior weight $w(p_\mu) > 0$, so for every realized first-person history $\underline{\hat{a}o}_{\le T}$ we have $$ M(\underline{\hat{a}o}_{\le T}) \ge w(p_\mu)\,\mu(\underline{\hat{a}o}_{\le T}), $$ hence $$ \sum_{t=1}^T \log\frac{\mu(o_t \mid \underline{ao}_{<t}\,a_t)}{M(o_t \mid \underline{\hat{a}o}_{<t}\,\hat{a}_t)} \le \log\frac{1}{w(p_\mu)} =: C_\mu. $$ Each third-party action token is a substring of the world-written stream, so by the chain rule each $\Delta_i$ with $\gamma(k_i)=0$ is the conditional expectation of a group of summands on the left-hand side, while every remaining summand has nonnegative expectation (it is a conditional Kullback-Leibler divergence). Taking expectations and letting $T \to \infty$ gives $\mathbb{E}[\sum_{i:\,\gamma(k_i)=0}\Delta_i] \le C_\mu$. Combining this lemma with the Transfer lemma, applied with $\Delta_i$ as above, yields the //universal bound on actions//: if $\rho_i/(1-\rho_i) \le r$ almost surely for all $i$, then $$ \sum_{t \ge 1}\mathbb{E}\left[ D_{\mathrm{KL}}\left( \mu(\dot{a}_{t+1} \mid \underline{ao}_{\le t}) \;\middle\|\; M(a_{t+1} \mid \underline{\hat{a}o}_{\le t}) \right) \right] \le r\,C_\mu. $$ In particular, the cumulative divergence between the agent’s actions and the counterfactual targets is finite, which gives the following corollary. ==== Corollary: Finite mistakes ==== Under the standard setup with $\rho_i/(1-\rho_i) \le r$ almost surely for all $i$ and some $r < \infty$: for every $\varepsilon > 0$, the number $N_\varepsilon$ of times $$ \mathrm{TV}\left( \mu(\dot{a}_{t+1} \mid \underline{ao}_{\le t}), \, M(a_{t+1} \mid \underline{\hat{a}o}_{\le t}) \right) > \varepsilon $$ satisfies $$ \mathbb{E}[N_\varepsilon] \le \frac{1}{2\varepsilon^2} \sum_{t \ge 1}\mathbb{E}\left[ D_{\mathrm{KL}}\left( \mu(\dot{a}_{t+1} \mid \underline{ao}_{\le t}) \;\middle\|\; M(a_{t+1} \mid \underline{\hat{a}o}_{\le t}) \right) \right], $$ and $N_\varepsilon < \infty$ with probability $1$.
If moreover for each $t$ the target $\mu(\dot{a}_t \mid \underline{ao}_{<t})$ is deterministic, that is, a point mass on $\dot{a}_t$, then the number of mismatches $$ N := \sum_{t>1}\mathbf{1}\{a_t \ne \dot{a}_t\} $$ satisfies $$ \mathbb{E}[N] \le \sum_{t \ge 1}\mathbb{E}\left[ D_{\mathrm{KL}}\left( \mu(\dot{a}_{t+1} \mid \underline{ao}_{\le t}) \;\middle\|\; M(a_{t+1} \mid \underline{\hat{a}o}_{\le t}) \right) \right], $$ and hence $N < \infty$ with probability $1$. **Proof.** Write $$ T_t := \mu(\dot{a}_{t+1} \mid \underline{ao}_{\le t}), \qquad P_t := M(a_{t+1} \mid \underline{\hat{a}o}_{\le t}), $$ and $$ D_t := D_{\mathrm{KL}}(T_t \| P_t). $$ Pinsker gives $$ \mathrm{TV}(T_t,P_t)^2 \le \frac{1}{2}D_t, $$ hence by Markov’s inequality $$ \Pr\left(\mathrm{TV}(T_t,P_t) > \varepsilon\right) \le \frac{\mathbb{E}[D_t]}{2\varepsilon^2}. $$ Summing over $t$ yields the bound on $\mathbb{E}[N_\varepsilon]$ and implies $N_\varepsilon < \infty$ almost surely. For the deterministic clause, let $\dot{a}_{t+1}$ denote the point-mass target and set $$ p_t := P_t(\dot{a}_{t+1}) = M(\dot{a}_{t+1} \mid \underline{\hat{a}o}_{\le t}). $$ Then $$ D_t = -\log p_t $$ and $$ \Pr(a_{t+1} \ne \dot{a}_{t+1} \mid \underline{\hat{a}o}_{\le t}) = 1-p_t \le -\log p_t. $$ Taking expectations and summing over $t$ gives the bound on $\mathbb{E}[N]$, and finiteness of $\sum_t \Pr(a_t \ne \dot{a}_t) = \mathbb{E}[N]$ implies $N < \infty$ almost surely. ===== Universal imitation of computable stochastic functions ===== The previous results are stated for an abstract randomized interaction protocol. We now show this is not a vacuous idealization by constructing a computable interaction system in which the agent learns to implement //any// computable, possibly stochastic, function. Intuitively, this is the supervised example-learning setup from the earlier subsection: after each prompt $u_i$, there is a designated next output position in $\mathcal{A}$.
For instance, the world may present labeled examples followed by a new query, $$ o_{t-1} = (u_1,v_1), (u_2,v_2), \ldots, (u_n,\cdot), $$ so that the designated next output is the missing label and the agent’s response is $$ a_t = v_n \in \mathcal{A}. $$ Crucially, this happens //after each prompt//: with a fixed nonzero probability, the protocol routes the next output position to the agent, so every prompt has a positive chance of requiring an on-path agent completion. To make generalization to unseen examples explicit, we choose the prompt schedule by dovetailing over $\mathcal{U}$, that is, a computable schedule in which every $u \in \mathcal{U}$ appears infinitely often, so any prompt not yet seen will still occur later and be tested and reused infinitely many times. ==== Corollary: Universal imitation of computable stochastic functions ==== Let $f(\cdot \mid u)$ be any computable measure on $\mathcal{A}$ indexed by prompts $u \in \mathcal{U}$. Then there exists a computable interaction system $(\Sigma,\Gamma,M,\mu)$ with $\mu$ a primitive computable measure, such that the following holds on every slot assigned to the agent. At each decision time $t \ge 1$, the most recent world token $o_t$ contains a computably decodable prompt $u_t \in \mathcal{U}$, and the counterfactual action $\dot{a}_{t+1} \in \mathcal{A}$ satisfies $$ \mu(\dot{a}_{t+1}=a \mid \underline{ao}_{\le t}) = f(a \mid u_t), \qquad a \in \mathcal{A}. $$ Each prompt $u \in \mathcal{U}$ appears infinitely many times. There exists a constant $C < \infty$ such that $$ \sum_{t \ge 1}\mathbb{E}\left[ D_{\mathrm{KL}}\left( \mu(\dot{a}_{t+1} \mid \underline{ao}_{\le t}) \;\middle\|\; M(a_{t+1} \mid \underline{\hat{a}o}_{\le t}) \right) \right] \le C. $$ Consequently, by the finite mistakes corollary, only finitely many large deviations from the counterfactual can occur, and if $f(\cdot \mid u)$ is deterministic, only finitely many literal mismatches occur. 
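Before turning to the proof, the protocol behind this corollary can be made concrete in a toy simulation. In the sketch below, a finite Bayesian mixture over all deterministic rules on three prompts stands in for the universal mixture $M$, the schedule simply cycles through the prompts (a trivial dovetailing), and the gate coin routes each output slot either to the world (the revealed label is evidence) or to the agent (probability matching, with no weight update on the agent's own action). All identifiers are illustrative, not taken from the construction in the proof.

```python
import itertools
import random

rng = random.Random(7)
prompts = [0, 1, 2]

# Hypothesis class: all deterministic rules prompts -> {0, 1};
# a finite stand-in for the universal mixture M (the target f is a member).
hypotheses = list(itertools.product([0, 1], repeat=len(prompts)))
weights = {h: 1.0 / len(hypotheses) for h in hypotheses}
f = (1, 0, 1)   # target rule the world follows (illustrative)
rho = 0.5       # gate bias: probability that a slot is agent-written

mismatch_steps = []
for step in range(200):
    u = prompts[step % len(prompts)]  # cycling schedule: every prompt recurs
    # Posterior predictive P(next A-token = 1 | prompt u).
    pred_one = sum(w for h, w in weights.items() if h[u] == 1)
    if rng.random() >= rho:
        # gamma = 0: third-party action; the revealed label IS evidence.
        a = f[u]
        for h in weights:
            if h[u] != a:
                weights[h] = 0.0
        total = sum(weights.values())
        for h in weights:
            weights[h] /= total
    else:
        # gamma = 1: agent-written slot; sample from the posterior predictive
        # (probability matching) and do NOT update on the agent's own action.
        a = 1 if rng.random() < pred_one else 0
        if a != f[u]:
            mismatch_steps.append(step)

print("agent mismatches at steps:", mismatch_steps)
```

Once each prompt has been revealed at least once on a world-written slot, the posterior predictive is exact on that prompt and mismatches cease, matching the finite-mismatch conclusion for deterministic targets; replacing `f` by a stochastic rule and the hard weight update by a likelihood update illustrates the general stochastic case.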
**Proof.** **Construction.** Organize the world-written stream into an infinite sequence of prompt-gap pairs indexed by $i \ge 1$. Before the $i$-th gap, the world writes a nonempty block from which $u_i$ is uniquely decodable at decision time. Choose the $u_i$ according to a dovetailing strategy, so each prompt appears infinitely many times. Let $k_i$ be a potential action index at the start of that gap, and ensure gaps do not overlap and are separated by at least one further nonempty world-written block. At each $k_i$, draw the gate coin $$ \gamma(k_i) \sim \mathrm{Bernoulli}(\rho) $$ with a fixed bias $\rho \in (0,1)$, and hold $\gamma(k_i)$ fixed throughout the ensuing $\mathcal{A}$-block. Define $\mu$ to be a primitive computable measure that, at the start of the $i$-th gap, generates the world continuation target according to $f(\cdot \mid u_i)$: the target is revealed on-path when $\gamma(k_i)=0$, and otherwise remains counterfactual, serving as $\dot{a}$ in the counterfactual-action definition. By construction, conditional on the agent-visible history immediately before $k_i$, the coin $\gamma(k_i)$ is independent of the ensuing world-generated $\mathcal{A}$-token, and the gate is held fixed through the slot. Then $\gamma(k_i)=1$ occurs infinitely often with probability $1$. **Apply the universal bound.** The setup satisfies the standard assumptions with $\rho_i \equiv \rho$ and $\mu$ primitive, so the universal bound on actions gives $$ \sum_{t \ge 1}\mathbb{E}\left[ D_{\mathrm{KL}}\left( \mu(\dot{a}_{t+1} \mid \underline{ao}_{\le t}) \;\middle\|\; M(a_{t+1} \mid \underline{\hat{a}o}_{\le t}) \right) \right] \le \frac{\rho}{1-\rho}\,C_\mu. $$ Taking $$ C := \frac{\rho}{1-\rho}\,C_\mu $$ yields the claim, and the identity $$ \mu(\dot{a}_{t+1}=a \mid \underline{ao}_{\le t}) = f(a \mid u_t) $$ holds by construction on each agent-assigned slot. ===== Preference pluralism ===== A useful way to read the preceding results is as //schema acquisition with context identification//.
The interaction stream need not encode a single uniform behavioral regularity: it can contain many qualitatively different continuation principles, such as task procedures, tool protocols, dialogue norms, and choice rules, each applicable only in certain situations. Formally, we can view this as a computable partition of the prompt space $$ \mathcal{U} = \bigsqcup_{j=1}^J \mathcal{U}^{(j)} $$ together with a family of computable continuation rules $f^{(j)}(\cdot \mid u)$ on $\mathcal{A}$, where the operative rule depends on which cell contains the current prompt. The substantive learning problem is therefore not merely to fit a single rule, but to infer //which// rule is in force by extracting the right situational cues from the transcript, that is, to learn the partitioning of contexts and the corresponding schema within each cell. ==== Corollary: Preference pluralism ==== Let $\mathcal{U}$ be a countably infinite prompt set with a computable partition $$ \mathcal{U} = \bigsqcup_{j=1}^J \mathcal{U}^{(j)}, $$ where each $\mathcal{U}^{(j)}$ is infinite. For each $j \in \{1,\ldots,J\}$ let $f^{(j)}(\cdot \mid u)$ be a computable measure on $\mathcal{A}$ indexed by $u \in \mathcal{U}^{(j)}$. Define the combined rule $$ f(a \mid u) := f^{(j)}(a \mid u) $$ for the unique $j$ with $u \in \mathcal{U}^{(j)}$. Then there exists a computable interaction system $(\Sigma,\Gamma,M,\mu)$ with $\mu$ a primitive computable measure, and a computable prompt schedule $(u_t)_{t \ge 1}$, such that on every slot assigned to the agent the most recent world token computably decodes a prompt $u_t \in \mathcal{U}$ and the counterfactual target satisfies $$ \mu(\dot{a}_{t+1}=a \mid \underline{ao}_{\le t}) = f(a \mid u_t) \qquad \text{for all } a \in \mathcal{A}. $$ Moreover, each $u \in \mathcal{U}$ occurs infinitely often in $(u_t)$, and new prompts appear arbitrarily late, meaning that for every $N$ there exists $t > N$ such that $$ u_t \notin \{u_1,\ldots,u_{t-1}\}. 
$$ There exists $C < \infty$ such that $$ \sum_{t \ge 1}\mathbb{E}\left[ D_{\mathrm{KL}}\left( \mu(\dot{a}_{t+1} \mid \underline{ao}_{\le t}) \;\middle\|\; M(\cdot \mid \underline{\hat{a}o}_{\le t}) \right) \right] \le C. $$ Consequently, the finite-deviation and deterministic-mismatch conclusions above apply to these agent-assigned slots. **Proof.** By computability of the partition and of each $f^{(j)}$, the combined rule $f(\cdot \mid u)$ is a computable measure on $\mathcal{A}$ indexed by $u \in \mathcal{U}$. Because $\mathcal{U}$ is countably infinite and each cell $\mathcal{U}^{(j)}$ is infinite, we can fix a computable enumeration $$ \phi : \{1,\ldots,J\} \times \mathbb{N} \to \mathcal{U} $$ with $\phi(j,n) \in \mathcal{U}^{(j)}$ and $$ \{\phi(j,n) : n \in \mathbb{N}\} = \mathcal{U}^{(j)} $$ for each $j$. Choose a computable prompt schedule $(u_t)$ by dovetailing over pairs $(j,n)$ through $\phi$ in a way that revisits each pair infinitely often. Then each $u \in \mathcal{U}$ occurs infinitely often, and since $\mathcal{U}$ is infinite, first occurrences are unbounded, so new prompts appear arbitrarily late. Apply the previous corollary to $f$ using this schedule to obtain a computable interaction system $(\Sigma,\Gamma,M,\mu)$ with $\mu$ primitive and the stated target identity on agent-assigned slots, together with the finite cumulative divergence bound. The final sentence follows by invoking the finite mistakes corollary. The point is not merely eventual accuracy on previously seen prompts. Because the prompt schedule dovetails over $\mathcal{U}$, genuinely new prompts appear at arbitrarily late times while every prompt is revisited infinitely often. Together with the finite-deviation guarantee, and for deterministic targets the finite-mismatch guarantee, this implies that only finitely many demanded slots can be substantially wrong along the realized schedule. 
Hence after a finite transient the agent tracks the target counterfactual even when a prompt has just appeared for the first time. This cannot be explained by finite memorization of prompt-completion pairs; it requires learning reusable //situational structure// on $\mathcal{U}$, a partition into context classes, together with the corresponding per-class continuation rules. In this sense, “learning preferences” is one instance of a more general capability: acquiring multiple heterogeneous schemas and applying each one in the situations where it governs the next $\mathcal{A}$-token. Notice that we can combine a variety of schema rules that instantiate well-known decision principles and other preference structures, such as: * //Bayes-optimal finite-horizon POMDP control:// $u$ encodes an executable finite POMDP, horizon or discount, and reward parameters; $f$ outputs a Bayes-optimal adaptive controller. * //Safety-first / constrained control:// $u$ specifies dynamics plus a hard safety predicate, or budget constraint, and a secondary objective; $f$ outputs a controller that enforces the constraint when feasible and otherwise follows the specified fallback rule. * //Multi-objective tradeoffs:// $u$ provides multiple reward components and weights, or a specified scalarization; $f$ outputs the optimal controller under that tradeoff. * //Choice from comparisons:// $u$ contains a finite set of candidates plus computable pairwise comparisons or rankings; $f$ outputs the candidate, or action program, selected by a computable revealed-preference rule. * //Rule- / constitution-following:// $u$ encodes a finite set of rules, hard constraints, plus a computable tie-breaker; $f$ outputs an action or program satisfying the rules when feasible, and otherwise follows an explicitly encoded fallback. 
* //Program synthesis / tool protocol:// $u$ encodes a specification together with a computable evaluator or tool interface; $f$ outputs a program, or macro-action, that passes the evaluator, or an explicit next repair step in an iterative protocol. * //Norms / dialogue acts:// $u$ encodes an interaction context together with a computable norm taxonomy; $f$ outputs an appropriate dialogue act, such as apologize, clarify, refuse, or defer, consistent with the norms and the stated context. On prompts of the corresponding type, the agent will behave “as if” following the schema. ===== Discussion ===== **Adaptive compression under interventions.** A useful way to read the construction is as //adaptive compression// of the //realized// interaction history. In the passive case, a Bayesian mixture is an optimal adaptive code in expected log-loss [Rissanen1978MDL; Grunwald2007MDLBook; Dawid1984Prequential]. In interaction, the same coding interpretation is recovered only after the first-person correction: evidence is the sequence of world responses under the agent’s interventions, so mixture weights update by the interventional likelihood and the atomic unit of evidence is the completed pair $(\hat{a}_t,o_t)$ [OrtegaBraun2010MinRelEnt; DawidVovk1999PrequentialProbability]. The crucial additional point is that this compression view already fixes how to //act// at a potential action index. Along an action-slot schedule $(k_i)$, the decision-time history $h_i$ determines a distribution for the next $\mathcal{A}$-token. The gate bit $\gamma(k_i)$ decides whether the next $\mathcal{A}$-token is realized as a third-party action, when $\gamma(k_i)=0$, or must be produced as a factual action, when $\gamma(k_i)=1$, but the agent’s decision-time information is $h_i$ in either case. 
Hence the same mixture conditional used to predict, and therefore code, the next $\mathcal{A}$-token from $h_i$ must also be used to generate it when the agent writes: the action rule draws the next factual action from that conditional. Operationally, this is exactly sampling from the mixture’s conditional at decision time, equivalently posterior sampling as an implementation detail [LeikeLattimoreOrseauHutter2016Thompson]. **Demonstrations are observations; why actions are not evidence.** Demonstration data enters the agent only as part of the observation tokens $o_t$: in particular, the demonstrator’s own actions appear to the agent as substrings of some $o_t$ and therefore count as evidence only insofar as they are predictable as world output. By contrast, the agent’s own completed outputs $\hat{a}_t$ are interventions and therefore cannot be used as evidence to update mixture weights. This distinction is essential for joint interaction generators that assign probabilities to both action and observation tokens: if the agent were to update on realized $a_{t+1}$ using the factor $\nu_p(a_{t+1} \mid \underline{\hat{a}o}_{\le t})$ as if it were evidence, then hypotheses would be spuriously reinforced merely for assigning higher probability to the action that the agent itself sampled, even when all hypotheses agree on the tested world-response terms $\nu_p(o_t \mid \underline{\hat{a}o}_{<t}\,\hat{a}_t)$.