====== And, Or, and the Two KL Projections ======
  
> I discuss the difference between minimizing the KL-divergence with respect to the first and the second argument, and conclude that these correspond to AND and OR operations on distributions, respectively.

//Cite as: Ortega, P.A. "And, Or, and the Two KL Projections", Tech Note 1, DAIOS, 2024.//
  
  
Oftentimes I see people wondering about the meaning of the two KL-projections:
{{ ::forward-reverse-kl.png?800 |}}
  
Personally, I find this explanation somewhat vague and unsatisfying. I've looked at the difference from various angles, including [[https://en.wikipedia.org/wiki/Information_geometry|Information Geometry]]. However, I was never truly happy with how the material was presented in the literature. I think the simplified explanation below, which took me years to get to, encapsulates the core insight without relying on the complexities of information geometry. I hope you agree!
  
  
their application on two distributions can be quite challenging. Instead, a
clearer grasp of the difference can be attained through the examination
of mixture distributions. Let's look into this approach.
  
==== Linear mixture ====

Let's say we have $N$ distributions $q_1, q_2, \ldots, q_N$ over a finite set $\mathcal{X}$.
Given a set of positive weights $w_1, w_2, \ldots, w_N$ that sum up to one, their
//linear mixture// is
  
$$
p(x) = \sum_i w_i \, q_i(x).
$$
  
The //linear mixture// expresses $N$ mutually exclusive hypotheses $q_i(x)$ that
could be true with probabilities $w_i$. That is, either $q_1$ **or** $q_2$ **or**
... **or** $q_N$ is true, with probability $w_1$, $w_2$, ..., $w_N$ respectively,
expressing a **disjunction** of probability distributions.
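
As a quick numerical illustration (my own sketch, not part of the original note; the distributions and weights below are made up), the linear mixture is just a weighted average of the probability vectors:

<code python>
import numpy as np

# Three hypothetical distributions q_i over a 4-element set X, and weights w_i
# that sum to one (illustrative values only).
q = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.10, 0.70, 0.10, 0.10],
              [0.25, 0.25, 0.25, 0.25]])
w = np.array([0.5, 0.3, 0.2])

p_or = w @ q                 # p(x) = sum_i w_i q_i(x): the disjunction (OR)
print(p_or, p_or.sum())      # already normalized, since the weights sum to one
</code>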
==== Exponential mixture ====
  
Given a set of positive coefficients $\alpha_1, \alpha_2, \ldots, \alpha_N$ (not necessarily summing up to one), their //exponential mixture// (a.k.a. geometric mixture) is
  
$$
p(x) \propto \prod_i q_i(x)^{\alpha_i}.
$$
It's important to highlight that in order for the exponential mixture to yield a valid probability distribution, it has to be normalized.
  
The //exponential mixture// expresses $N$ constraints $q_i(x)$ that must be true
simultaneously with precisions $\alpha_i$. That is, $q_1$ **and** $q_2$ **and** ...
**and** $q_N$ are true, with precisions $\alpha_1$, $\alpha_2$, ..., $\alpha_N$
respectively, expressing a **conjunction** of probability distributions.
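
The corresponding sketch for the exponential mixture (again my own illustration with made-up precisions; note the explicit normalization step):

<code python>
import numpy as np

# The same three hypothetical distributions, now treated as simultaneous
# constraints with precisions alpha_i (illustrative values only).
q = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.10, 0.70, 0.10, 0.10],
              [0.25, 0.25, 0.25, 0.25]])
alpha = np.array([1.0, 2.0, 0.5])

unnorm = np.prod(q ** alpha[:, None], axis=0)   # prod_i q_i(x)^alpha_i
p_and = unnorm / unnorm.sum()                   # the conjunction (AND), normalized
print(p_and)
</code>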
  
===== Building conjunctions and disjunctions =====
  
So now that we've seen how to express conjunctions and disjunctions of distributions as linear and exponential mixtures, let's see how to build them as KL-projections. Minimizing the weighted sum of divergences with $p$ in the second argument recovers the linear mixture:
  
$$
\arg\min_p \sum_i w_i D(q_i \| p) = \sum_i w_i q_i(x).\qquad(1)
$$
  
Conversely, to build the conjunction we minimize

$$
\sum_i \alpha_i D(p \| q_i).
$$
  
That is, using the KL-divergences where $p$ is in the first argument. And in fact,
the minimizer is precisely the exponential mixture:
  
$$
\arg\min_p \sum_i \alpha_i D(p \| q_i) \propto \prod q_i(x)^{\alpha_i}.\qquad(2)
$$
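
Both identities are easy to check numerically. Here is a small sanity check (my own sketch, not from the original note) that minimizes each weighted objective over the probability simplex via a softmax parametrization and compares the minimizer against the corresponding mixture:

<code python>
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(5), size=3)     # three random distributions q_i
w = np.array([0.5, 0.3, 0.2])             # weights (sum to one)
alpha = np.array([0.2, 0.5, 0.3])         # precisions (chosen to sum to one here)

kl = lambda a, b: np.sum(a * np.log(a / b))
softmax = lambda z: np.exp(z - z.max()) / np.sum(np.exp(z - z.max()))

# Equation (1): argmin_p sum_i w_i D(q_i || p) should be the linear mixture.
p_or = softmax(minimize(lambda z: sum(wi * kl(qi, softmax(z))
                                      for wi, qi in zip(w, q)), np.zeros(5)).x)
print(np.abs(p_or - w @ q).max())         # close to zero, up to optimizer tolerance

# Equation (2): argmin_p sum_i alpha_i D(p || q_i) should be the exponential mixture.
p_and = softmax(minimize(lambda z: sum(ai * kl(softmax(z), qi)
                                       for ai, qi in zip(alpha, q)), np.zeros(5)).x)
geo = np.prod(q ** alpha[:, None], axis=0)
print(np.abs(p_and - geo / geo.sum()).max())  # close to zero, up to optimizer tolerance
</code>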
  
Equations (1) and (2) form the core
of my argument. Basically, we have found a relation between the two KL-projections
and the two logical operators **and** and **or**. The two KL-divergences then measure

$$
p(x) = \sum_h q(h) \, q(x \mid h).
$$
  
The resulting estimator $\sum_h q(h) q(x \mid h)$ is the Bayesian estimator.
  
**Belief update:** The second step is to understand how to update our beliefs after
making an observation. It turns out that an observation is a constraint on the prior
beliefs, i.e. we obtain the posterior through a conjunction between the prior
and the likelihood function. If $\dot{x}$ is our observation,
then the conjunction is
  
$$
p(h) \propto q(h)^{\alpha} \, q(\dot{x} \mid h)^{\beta}.
$$
  
In this operation we use precisions equal to $\alpha=\beta=1$ so that the result
is proportional to $q(h)q(\dot{x}|h)$. We can easily verify that the right hand
side is indeed the Bayesian posterior after normalization.
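
To make the two steps concrete, here is a toy sketch (my own, with made-up numbers): the predictor as a disjunction over observations, and the posterior as a conjunction of prior and likelihood with $\alpha=\beta=1$:

<code python>
import numpy as np

# Hypothetical prior q(h) over 3 hypotheses and likelihood table q(x|h)
# over 4 observations (rows indexed by h); the numbers are made up.
prior = np.array([0.5, 0.3, 0.2])
lik = np.array([[0.70, 0.10, 0.10, 0.10],
                [0.10, 0.70, 0.10, 0.10],
                [0.25, 0.25, 0.25, 0.25]])

predictor = prior @ lik              # disjunction over x: sum_h q(h) q(x|h)

x_dot = 2                            # the observed symbol
unnorm = prior * lik[:, x_dot]       # conjunction over h with alpha = beta = 1
posterior = unnorm / unnorm.sum()    # normalizing recovers Bayes' rule
print(predictor, posterior)
</code>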
  
Notice how building the predictor involves taking the disjunction of distributions
over the observations $x$, while computing the posterior amounts to computing
the conjunction of functions over the hypotheses $h$. In doing so, we interpret
the likelihood as two different functions: in the disjunction, it is regarded as a function
of $x$ whereas in the conjunction it is a function of $h$.
  
Thus, it turns out that sequential predictions can be regarded as an alternation
between OR and AND operations, first to express our uncertainty over the hypotheses,
and second to incorporate new evidence, respectively. What class of algorithms does this remind you of?