====== And, Or, and the Two KL Projections ======
  
> I discuss the difference between minimizing the KL-divergence with respect to the first and second argument, and conclude that they correspond to AND and OR operations on distributions, respectively.

//Cite as: Ortega, P.A. "And, Or, and the Two KL Projections", Tech Note 1, DAIOS, 2024.//
  
Oftentimes I see people wondering about the meaning of the two KL-projections:

$$
  \arg\min_p D(p \| q)
  \qquad \text{and} \qquad
  \arg\min_p D(q \| p),
$$

i.e., minimizing the KL-divergence with respect to its first and its second argument. The usual explanation contrasts their mode-seeking and mass-covering behavior:

{{ ::forward-reverse-kl.png?800 |}}
  
Personally, I find this explanation somewhat vague and unsatisfying. I've looked at the difference from various angles, including [[https://en.wikipedia.org/wiki/Information_geometry|Information Geometry]]. However, I was never truly happy with how the material was presented in the literature. I think the simplified explanation below, which took me years to get to, encapsulates the core insight without relying on the complexities of information geometry. I hope you agree!
  
  
Understanding the application of these projections to two distributions can be quite challenging. Instead, a clearer grasp of the difference can be attained through the examination of mixture distributions. Let's look into this approach.
  
==== Linear mixture ====
Let's say we have $N$ distributions $q_1, q_2, \ldots, q_N$ over a finite set $\mathcal{X}$.
Given a set of positive weights $w_1, w_2, \ldots, w_N$ that sum up to one, their
//linear mixture// is
  
$$
  \sum_i w_i q_i(x).
$$
  
The //linear mixture// expresses $N$ mutually exclusive hypotheses $q_i(x)$ that
could be true with probabilities $w_i$. That is, either $q_1$ **or** $q_2$ **or**
... **or** $q_N$ is true, with probability $w_1$, $w_2$, ..., $w_N$ respectively.
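
As a quick numerical illustration, here is a minimal sketch of the linear mixture (the distributions and weights below are made up for the example):

<code python>
import numpy as np

# Two made-up distributions over a three-element set X = {0, 1, 2}.
q1 = np.array([0.7, 0.2, 0.1])
q2 = np.array([0.1, 0.3, 0.6])
w  = np.array([0.25, 0.75])          # weights summing to one

# Linear mixture: "q1 OR q2", weighted by w.
p_or = w[0] * q1 + w[1] * q2
print(p_or, p_or.sum())              # a valid distribution (sums to 1)
</code>
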
==== Exponential mixture ====
  
Given a set of positive coefficients $\alpha_1, \alpha_2, \ldots, \alpha_N$ (not necessarily summing up to one), their //exponential mixture// (a.k.a. geometric mixture) is
  
$$
  \frac{\prod_i q_i(x)^{\alpha_i}}{\sum_{x' \in \mathcal{X}} \prod_i q_i(x')^{\alpha_i}}.
$$

It's important to highlight that in order for the exponential mixture to yield a valid probability distribution, it has to be normalized.
  
The //exponential mixture// expresses $N$ constraints $q_i(x)$ that must be true
simultaneously with precisions $\alpha_i$. That is, $q_1$ **and** $q_2$ **and** ...
**and** $q_N$ are true, with precisions $\alpha_1$, $\alpha_2$, ..., $\alpha_N$ respectively.
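
And the corresponding sketch for the exponential mixture, again with made-up numbers; note the explicit normalization step:

<code python>
import numpy as np

q1 = np.array([0.7, 0.2, 0.1])
q2 = np.array([0.1, 0.3, 0.6])
alpha = np.array([2.0, 0.5])         # precisions, need not sum to one

# Exponential (geometric) mixture: product of powers, then normalize.
unnorm = q1**alpha[0] * q2**alpha[1]
p_and = unnorm / unnorm.sum()        # "q1 AND q2"
print(p_and, p_and.sum())
</code>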
  
**Disjunctions:** If I have $N$ mutually exclusive hypotheses $q_1(x)$ through $q_N(x)$ that could be true with probabilities $w_1$ through $w_N$, how do I build a single distribution $p(x)$ that captures my uncertainty over them? I do so by minimizing the weighted sum of KL-divergences with $p$ in the second argument, and the minimizer is precisely the linear mixture:

$$
  \arg\min_p \sum_i w_i D(q_i \| p) = \sum_i w_i q_i(x).\qquad(1)
$$
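
Equation (1) is easy to check numerically. The following sketch (example numbers only) brute-forces the minimizer over distributions on a two-element set and compares it with the linear mixture:

<code python>
import numpy as np

def kl(a, b):
    """KL-divergence D(a || b) for strictly positive probability vectors."""
    return np.sum(a * np.log(a / b))

q = np.array([[0.9, 0.1],            # q1
              [0.3, 0.7]])           # q2
w = np.array([0.4, 0.6])             # weights summing to one

# Brute-force search over candidates p = (t, 1 - t).
ts = np.linspace(1e-3, 1 - 1e-3, 9999)
objective = [sum(wi * kl(qi, np.array([t, 1 - t])) for wi, qi in zip(w, q))
             for t in ts]
best = ts[int(np.argmin(objective))]

print(best)                          # approx. 0.54
print(w @ q[:, 0])                   # linear mixture: 0.4*0.9 + 0.6*0.3 = 0.54
</code>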
  
**Conjunctions:** If I have a set of simultaneous constraints $q_1(x)$ through $q_N(x)$
with precisions $\alpha_1$ through $\alpha_N$, how do I build a single distribution $p(x)$ that satisfies all of them at once? I do so by minimizing the weighted sum

$$
  \sum_i \alpha_i D(p \| q_i).
$$
  
That is, using the KL-divergences where $p$ is in the first argument. And in fact,
the minimizer is precisely the exponential mixture, with the precisions rescaled so that they sum to one:
  
$$
  \arg\min_p \sum_i \alpha_i D(p \| q_i) \propto \prod_i q_i(x)^{\alpha_i / \sum_j \alpha_j}.\qquad(2)
$$
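
A similar brute-force check for equation (2), again with made-up numbers; the exponents are the precisions divided by their sum, as in the formula above:

<code python>
import numpy as np

def kl(a, b):
    return np.sum(a * np.log(a / b))

q = np.array([[0.9, 0.1],
              [0.3, 0.7]])
alpha = np.array([2.0, 0.5])         # precisions, not summing to one

# Brute-force minimizer of sum_i alpha_i * D(p || q_i) over p = (t, 1 - t).
ts = np.linspace(1e-3, 1 - 1e-3, 9999)
objective = [sum(ai * kl(np.array([t, 1 - t]), qi) for ai, qi in zip(alpha, q))
             for t in ts]
best = ts[int(np.argmin(objective))]

# Exponential mixture with exponents alpha_i / sum_j alpha_j.
unnorm = np.prod(q ** (alpha / alpha.sum())[:, None], axis=0)
p_exp = unnorm / unnorm.sum()

print(best, p_exp[0])                # both approx. 0.83
</code>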
  
Equations (1) and (2) form the core
of my argument. Basically, we have found a relation between the two KL-projections
and the two logical operators **and** and **or**. The two KL-divergences then measure

Thus, it turns out that sequential predictions can be regarded as an alternation
between OR and AND operations, first to express our uncertainty over the hypotheses,
and second to incorporate new evidence, respectively. What class of algorithms does this remind you of?
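
One family of algorithms that fits this description is sequential Bayesian prediction over a set of hypotheses. The sketch below (the hypotheses, prior, and observation sequence are all invented for illustration) makes the alternation explicit: the prediction step forms a linear mixture (OR), and the update step multiplies the weights by the likelihood of the observed symbol and renormalizes (AND):

<code python>
import numpy as np

# Two made-up i.i.d. hypotheses over a binary alphabet {0, 1}.
hypotheses = np.array([[0.8, 0.2],   # q1: favours symbol 0
                       [0.3, 0.7]])  # q2: favours symbol 1
w = np.array([0.5, 0.5])             # prior weights over the hypotheses

for x in [1, 1, 0, 1]:               # an arbitrary observation sequence
    # OR step: the predictive distribution is the linear mixture of hypotheses.
    predictive = w @ hypotheses
    # AND step: incorporate the evidence x by multiplying the weights by the
    # likelihoods and renormalizing (Bayes' rule as a product-then-normalize).
    w = w * hypotheses[:, x]
    w /= w.sum()
    print(f"x={x}  predictive={predictive}  posterior weights={w}")
</code>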