And, Or, and the Two KL Projections

I discuss the difference between minimizing the KL-divergence with respect to its first and its second argument, and conclude that the two projections correspond to AND and OR operations on distributions, respectively.

Oftentimes I see people wondering about the meaning of the two KL-projections: $$\min_p D(p \| q) \quad \text{versus} \quad \min_p D(q \| p),$$

where of course, in the discrete case,

$$D(p \| q) := \sum_x p(x) \log \frac{p(x)}{q(x)}.$$
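For concreteness, here is a minimal numpy sketch of this definition (the example distributions are made up, and I assume strictly positive entries so the logarithm is well defined):

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D(p || q) for strictly positive distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

print(kl([0.5, 0.3, 0.2], [0.2, 0.3, 0.5]))  # positive
print(kl([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))  # zero
```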

One popular intuitive explanation is that “the first projection finds the mode and the second the support”, motivated by depictions such as the following figure (from Manisha and Gujar, 2019).

Personally, I find this explanation somewhat unclear and unsatisfying. I've looked at the difference from various perspectives, including Information Geometry. However, I was never truly happy with how the material was presented in the literature. I think the simplified explanation below, which took me years to arrive at, encapsulates a substantial amount of insight without relying on the complexities of information geometry. I hope you agree!

Understanding the distinction between the two projections solely by examining their application to two distributions can be quite challenging. Instead, a clearer grasp of the difference can be gained by examining mixture distributions. Let's look into this approach.

Let's say we have $N$ distributions $q_1, q_2, \ldots, q_N$ over a finite set $\mathcal{X}$. Given a set of positive weights $w_1, w_2, \ldots, w_N$ that sum up to one, their *linear mixture* is

$$q(x) = \sum_i w_i q_i(x).$$

The *linear mixture* expresses $N$ mutually exclusive hypotheses $q_i(x)$ that could be true with probabilities $w_i$. That is, either $q_1$ or $q_2$ or … or $q_N$ is true, with probability $w_1, w_2, \ldots, w_N$ respectively, expressing a disjunction of probability distributions.
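A quick numpy sketch (the component distributions and weights below are made up for illustration):

```python
import numpy as np

# Three hypothetical distributions over three outcomes, and weights summing to one.
q = np.array([[0.7, 0.2, 0.1],   # q_1
              [0.1, 0.8, 0.1],   # q_2
              [0.2, 0.2, 0.6]])  # q_3
w = np.array([0.5, 0.3, 0.2])    # w_1, w_2, w_3

linear_mixture = w @ q           # q(x) = sum_i w_i q_i(x)
print(linear_mixture, linear_mixture.sum())  # sums to one without extra normalization
```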

Given a set of positive coefficients $\alpha_1, \alpha_2, \ldots, \alpha_N$ (not necessarily summing up to one), their *exponential mixture* (a.k.a. geometric mixture) is

$$q(x) \propto \prod_i q_i(x)^{\alpha_i}.$$

It's important to highlight that in order for the exponential mixture to yield a valid probability distribution, it has to be normalized.

The exponential mixture expresses $N$ constraints $q_i(x)$ that must be true simultaneously with precisions $\alpha_i$. That is, $q_1$ and $q_2$ and … and $q_N$ are true, with precisions $\alpha_1, \alpha_2, \ldots, \alpha_N$ respectively, expressing a conjunction of probability distributions.
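Continuing the sketch above with some illustrative precisions, the exponential mixture requires an explicit normalization step:

```python
import numpy as np

q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
alpha = np.array([1.0, 2.0, 0.5])                        # precisions; need not sum to one

unnormalized = np.prod(q ** alpha[:, None], axis=0)      # prod_i q_i(x)^{alpha_i}
exponential_mixture = unnormalized / unnormalized.sum()  # explicit normalization
print(exponential_mixture)
```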

So now that we've seen how to express conjunctions and disjunctions of distributions as exponential and linear mixtures respectively, we'll try to find objective functions (or divergence measures, if you prefer) to build them.

Disjunctions: If I have a set of alternative hypotheses $q_1(x)$ through $q_N(x)$ which are true with probabilities $w_1$ through $w_N$ respectively, how divergent is $p$ from this knowledge?

The answer is

$$\sum_i w_i D(q_i \| p).$$

That is, using the KL-divergences where $p$ is in the second argument. Indeed, the minimizer is precisely the linear mixture:

$$\arg\min_p \sum_i w_i D(q_i \| p) = \sum_i w_i q_i(x). \tag{1}$$
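Indeed, up to a constant the objective is the cross-entropy between the linear mixture and $p$, which is minimized when $p$ equals the mixture itself. Below is a minimal numerical check of (1) with randomly drawn $q_i$ and $w_i$, using a softmax parameterization to stay on the simplex (the setup is illustrative only):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, K = 4, 6                                    # number of components, outcomes
q = rng.dirichlet(np.ones(K), size=N)          # hypotheses q_1, ..., q_N
w = rng.dirichlet(np.ones(N))                  # weights summing to one

kl = lambda a, b: np.sum(a * np.log(a / b))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

# Minimize sum_i w_i D(q_i || p) over the simplex.
objective = lambda z: sum(w[i] * kl(q[i], softmax(z)) for i in range(N))
p_star = softmax(minimize(objective, np.zeros(K)).x)

print(np.allclose(p_star, w @ q, atol=1e-3))   # True: the minimizer is the linear mixture
```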

Conjunctions: If I have a set of simultaneous constraints $q_1(x)$ through $q_N(x)$ with precisions $\alpha_1$ through $\alpha_N$, how far off is $p$ from this knowledge?

The answer is

$$\sum_i \alpha_i D(p \| q_i).$$

That is, using the KL-divergences where $p$ is in the first argument. And in fact, the minimizer is precisely the exponential mixture:

$$\arg\min_p \sum_i \alpha_i D(p \| q_i) \propto \prod_i q_i(x)^{\alpha_i}. \tag{2}$$
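Here is the analogous numerical check of (2), again with randomly drawn distributions; for simplicity the precisions are drawn so that they sum to one:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, K = 4, 6
q = rng.dirichlet(np.ones(K), size=N)          # constraints q_1, ..., q_N
alpha = rng.dirichlet(np.ones(N))              # precisions (chosen here to sum to one)

kl = lambda a, b: np.sum(a * np.log(a / b))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

# Minimize sum_i alpha_i D(p || q_i) over the simplex.
objective = lambda z: sum(alpha[i] * kl(softmax(z), q[i]) for i in range(N))
p_star = softmax(minimize(objective, np.zeros(K)).x)

geo = np.prod(q ** alpha[:, None], axis=0)     # prod_i q_i(x)^{alpha_i}
geo /= geo.sum()                               # normalize
print(np.allclose(p_star, geo, atol=1e-3))     # True: the minimizer is the exponential mixture
```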

Equations (1) and (2) form the core of my argument. Basically, we have found a relation between the two KL-projections and the two logical operators AND and OR. The two KL-divergences then measure the divergence of $p$ from a conjunction and a disjunction, respectively.

As a final comment, I'd like to point out the connection to the Bayesian framework. Let's consider hypotheses $h \in \mathcal{H}$ with i.i.d. likelihoods $q(x|h)$ over $\mathcal{X}$ and prior $q(h)$.

Prediction: The first step is to build the predictor. Since we believe that one and only one hypothesis is true, we take the disjunction over the likelihoods (or expert predictors), where the weights are given by the prior probabilities:

$$\arg\min_{p(x)} \sum_h q(h)\, D(q(x|h) \| p(x)) = \sum_h q(h)\, q(x|h).$$

The result, $\sum_h q(h)\, q(x|h)$, is the Bayesian estimator.
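A small sketch of this step, with a made-up prior and likelihood table (two hypotheses, three outcomes):

```python
import numpy as np

prior = np.array([0.6, 0.4])               # q(h)
likelihood = np.array([[0.7, 0.2, 0.1],    # q(x | h = 0)
                       [0.1, 0.3, 0.6]])   # q(x | h = 1)

predictor = prior @ likelihood             # sum_h q(h) q(x|h): a disjunction over x
print(predictor, predictor.sum())          # a proper distribution over x
```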

Belief update: The second step is to understand how to update our beliefs after making an observation. It turns out that an observation is a constraint on the prior beliefs, i.e. we obtain the posterior through a conjunction between the prior and the likelihood function. If $\dot{x}$ is our observation, then the conjunction is

$$\arg\min_{p(h)} \Bigl\{ \alpha\, D(p(h) \| q(h)) + \beta\, D(p(h) \| q(\dot{x}|h)) \Bigr\} \propto q(h)^{\alpha}\, q(\dot{x}|h)^{\beta}.$$

In this operation we use precisions equal to $\alpha = \beta = 1$ so that the result is proportional to $q(h)\, q(\dot{x}|h)$. We can easily verify that the right-hand side is indeed the Bayesian posterior after normalization.
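Continuing the sketch, the belief update with $\alpha = \beta = 1$ is the normalized product of prior and likelihood, which coincides with Bayes' rule (the observed outcome x_obs is illustrative):

```python
import numpy as np

prior = np.array([0.6, 0.4])               # q(h)
likelihood = np.array([[0.7, 0.2, 0.1],    # q(x | h = 0)
                       [0.1, 0.3, 0.6]])   # q(x | h = 1)
x_obs = 2                                  # the observed outcome (illustrative)

# Conjunction of prior and likelihood function with precisions alpha = beta = 1.
unnormalized = prior * likelihood[:, x_obs]
posterior = unnormalized / unnormalized.sum()

# Bayes' rule gives the same distribution over h.
bayes = prior * likelihood[:, x_obs] / (prior @ likelihood[:, x_obs])
print(np.allclose(posterior, bayes))       # True
```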

Notice how building the predictor involves taking the disjunction of distributions over the observations $x$, while computing the posterior amounts to computing the conjunction of functions over the hypotheses $h$. In doing so, we interpret the likelihood as two different functions: in the disjunction, it is regarded as a function of $x$, whereas in the conjunction it is a function of $h$.

Thus, it turns out that sequential predictions can be regarded as an alternation between OR and AND operations: first to express our uncertainty over the hypotheses, and second to incorporate new evidence.
