====== And, Or, and the Two KL Projections ======
  
> I discuss the difference between minimizing the KL-divergence with respect to the first and second argument, and conclude that they correspond to AND and OR operations on distributions, respectively.

//Cite as: Ortega, P.A. "And, Or, and the Two KL Projections", Tech Note 1, DAIOS, 2024.//
  
Oftentimes I see people wondering about the meaning of the two KL-projections:

$$
  \arg\min_p D(p \| q)
  \qquad \text{and} \qquad
  \arg\min_p D(q \| p),
$$

i.e., minimizing the KL-divergence with respect to its first and its second argument. The usual explanation contrasts their mode-seeking and mass-covering behavior:

{{ ::forward-reverse-kl.png?800 |}}
  
Personally, I find this explanation somewhat vague and unsatisfying. I've looked at the difference from various angles, including [[https://en.wikipedia.org/wiki/Information_geometry|Information Geometry]]. However, I was never truly happy with how the material was presented in the literature. I think the simplified explanation below, which took me years to get to, encapsulates the core insight without relying on the complexities of information geometry. I hope you agree!
  
  
Understanding the application of these projections to two distributions can be quite challenging. Instead, a clearer grasp of the difference can be attained through the examination of mixture distributions. Let's look into this approach.
  
==== Linear mixture ====
Let's say we have $N$ distributions $q_1, q_2, \ldots, q_N$ over a finite set $\mathcal{X}$.
Given a set of positive weights $w_1, w_2, \ldots, w_N$ that sum up to one, their
//linear mixture// is
  
$$
  \sum_i w_i q_i(x).
$$
  
The //linear mixture// expresses $N$ mutually exclusive hypotheses $q_i(x)$ that
could be true with probabilities $w_i$. That is, either $q_1$ **or** $q_2$ **or**
... **or** $q_N$ is true, with probability $w_1$, $w_2$, ..., $w_N$ respectively.
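
As a quick numerical illustration, here is a minimal sketch of the linear mixture (the distributions and weights below are made up for the example):

<code python>
import numpy as np

# Two made-up distributions over a three-element set X = {0, 1, 2}.
q1 = np.array([0.7, 0.2, 0.1])
q2 = np.array([0.1, 0.3, 0.6])
w  = np.array([0.25, 0.75])          # weights summing to one

# Linear mixture: "q1 OR q2", weighted by w.
p_or = w[0] * q1 + w[1] * q2
print(p_or, p_or.sum())              # a valid distribution (sums to 1)
</code>
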
==== Exponential mixture ====
  
Given a set of positive coefficients $\alpha_1, \alpha_2, \ldots, \alpha_N$ (not necessarily summing up to one), their //exponential mixture// (a.k.a. geometric mixture) is
  
$$
  \frac{\prod_i q_i(x)^{\alpha_i}}{\sum_{x' \in \mathcal{X}} \prod_i q_i(x')^{\alpha_i}}.
$$

It's important to highlight that in order for the exponential mixture to yield a valid probability distribution, it has to be normalized.
  
The //exponential mixture// expresses $N$ constraints $q_i(x)$ that must be true
simultaneously with precisions $\alpha_i$. That is, $q_1$ **and** $q_2$ **and** ...
**and** $q_N$ are true, with precisions $\alpha_1$, $\alpha_2$, ..., $\alpha_N$ respectively.
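
And the corresponding sketch for the exponential mixture, again with made-up numbers; note the explicit normalization step:

<code python>
import numpy as np

q1 = np.array([0.7, 0.2, 0.1])
q2 = np.array([0.1, 0.3, 0.6])
alpha = np.array([2.0, 0.5])         # precisions, need not sum to one

# Exponential (geometric) mixture: product of powers, then normalize.
unnorm = q1**alpha[0] * q2**alpha[1]
p_and = unnorm / unnorm.sum()        # "q1 AND q2"
print(p_and, p_and.sum())
</code>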
  
**Disjunctions:** If I have $N$ mutually exclusive hypotheses $q_1(x)$ through $q_N(x)$ that could be true with probabilities $w_1$ through $w_N$, how do I build a single distribution $p(x)$ that captures my uncertainty over them? I do so by minimizing the weighted sum of KL-divergences with $p$ in the second argument, and the minimizer is precisely the linear mixture:

$$
  \arg\min_p \sum_i w_i D(q_i \| p) = \sum_i w_i q_i(x).\qquad(1)
$$
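
Equation (1) is easy to check numerically. The following sketch (example numbers only) brute-forces the minimizer over distributions on a two-element set and compares it with the linear mixture:

<code python>
import numpy as np

def kl(a, b):
    """KL-divergence D(a || b) for strictly positive probability vectors."""
    return np.sum(a * np.log(a / b))

q = np.array([[0.9, 0.1],            # q1
              [0.3, 0.7]])           # q2
w = np.array([0.4, 0.6])             # weights summing to one

# Brute-force search over candidates p = (t, 1 - t).
ts = np.linspace(1e-3, 1 - 1e-3, 9999)
objective = [sum(wi * kl(qi, np.array([t, 1 - t])) for wi, qi in zip(w, q))
             for t in ts]
best = ts[int(np.argmin(objective))]

print(best)                          # approx. 0.54
print(w @ q[:, 0])                   # linear mixture: 0.4*0.9 + 0.6*0.3 = 0.54
</code>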
  
**Conjunctions:** If I have a set of simultaneous constraints $q_1(x)$ through $q_N(x)$
with precisions $\alpha_1$ through $\alpha_N$, how do I build a single distribution $p(x)$ that satisfies all of them at once? I do so by minimizing the weighted sum

$$
  \sum_i \alpha_i D(p \| q_i).
$$
  
That is, using the KL-divergences where $p$ is in the first argument. And in fact,
the minimizer is precisely the exponential mixture, with the precisions rescaled so that they sum to one:
  
$$
  \arg\min_p \sum_i \alpha_i D(p \| q_i) \propto \prod_i q_i(x)^{\alpha_i / \sum_j \alpha_j}.\qquad(2)
$$
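
A similar brute-force check for equation (2), again with made-up numbers; the exponents are the precisions divided by their sum, as in the formula above:

<code python>
import numpy as np

def kl(a, b):
    return np.sum(a * np.log(a / b))

q = np.array([[0.9, 0.1],
              [0.3, 0.7]])
alpha = np.array([2.0, 0.5])         # precisions, not summing to one

# Brute-force minimizer of sum_i alpha_i * D(p || q_i) over p = (t, 1 - t).
ts = np.linspace(1e-3, 1 - 1e-3, 9999)
objective = [sum(ai * kl(np.array([t, 1 - t]), qi) for ai, qi in zip(alpha, q))
             for t in ts]
best = ts[int(np.argmin(objective))]

# Exponential mixture with exponents alpha_i / sum_j alpha_j.
unnorm = np.prod(q ** (alpha / alpha.sum())[:, None], axis=0)
p_exp = unnorm / unnorm.sum()

print(best, p_exp[0])                # both approx. 0.83
</code>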
  
Equations (1) and (2) form the core
of my argument. Basically, we have found a relation between the two KL-projections
and the two logical operators **and** and **or**. The two KL-divergences then measure

Thus, it turns out that sequential predictions can be regarded as an alternation
between OR and AND operations, first to express our uncertainty over the hypotheses,
and second to incorporate new evidence, respectively. What class of algorithms does this remind you of?
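
One family of algorithms that fits this description is sequential Bayesian prediction over a set of hypotheses. The sketch below (the hypotheses, prior, and observation sequence are all invented for illustration) makes the alternation explicit: the prediction step forms a linear mixture (OR), and the update step multiplies the weights by the likelihood of the observed symbol and renormalizes (AND):

<code python>
import numpy as np

# Two made-up i.i.d. hypotheses over a binary alphabet {0, 1}.
hypotheses = np.array([[0.8, 0.2],   # q1: favours symbol 0
                       [0.3, 0.7]])  # q2: favours symbol 1
w = np.array([0.5, 0.5])             # prior weights over the hypotheses

for x in [1, 1, 0, 1]:               # an arbitrary observation sequence
    # OR step: the predictive distribution is the linear mixture of hypotheses.
    predictive = w @ hypotheses
    # AND step: incorporate the evidence x by multiplying the weights by the
    # likelihoods and renormalizing (Bayes' rule as a product-then-normalize).
    w = w * hypotheses[:, x]
    w /= w.sum()
    print(f"x={x}  predictive={predictive}  posterior weights={w}")
</code>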