Ortega, P.A., Grau-Moya, J., Genewein, T., Balduzzi, D. and Braun, D.A.
“A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function.”
Neural Information Processing Systems (NIPS) 2012
[PDF]
The arg-max prior is a Bayesian model of the maximizing argument (argmax in short) of a noisy function. For instance, consider the noisy function depicted in the next figure, and assume you are given a set of observations of the y-value for a number of x-locations. Then, our model allows computing the posterior probability density over the argmax.
The Arg-Max Prior is useful for optimizing a noisy black-box function, which is something that you encounter in many real-world applications. The brute-force approach for solving this problem consists in writing down a probabilistic model over all the possible functions implemented by this black-box (e.g. a Gaussian Process Prior), calculating the posterior distribution given the input-output data, and choosing the next test location (using e.g. a heuristic such as UCB). This is a lot of additional modelling when in reality all we look for is a model over the optimum.
The Arg-Max Prior addresses this problem by providing a direct probabilistic model based on simulated annealing and a temperature calculated from a measure of the effective number of data points we have observed so far (e.g. the Rényi entropy). Given this probabilistic model, we can choose the next test point by just sampling from it (i.e. Thompson sampling). This surprisingly simple approach turns out to work very well in practice.
The model is intuitively straightforward and easy to implement. Let $h(x): \mathcal{X} \rightarrow \mathcal{R}$ be an estimate of the mean $\bar{f}(x)$ constructed from data $\mathcal{D}_t := \{(x_i, y_i)\}_{i=1}^t$. This estimate can easily be converted into a posterior pdf over the location of the maximum by first multiplying it with a precision parameter $\alpha>0$ and then taking the normalized exponential \[ P(x^\ast|\mathcal{D}_t) \propto \exp\{ \alpha \cdot h(x^\ast) \}. \] In this transformation, the precision parameter $\alpha$ controls the certainty we have over our estimate of the maximizing argument: $\alpha \approx 0$ expresses almost no certainty, while $\alpha \rightarrow \infty$ expresses certainty. The rationale for the precision is: the more distinct inputs we test, the higher the precision—testing the same (or similar) inputs only provides local information and therefore should not increase our knowledge about the global maximum. A simple and effective way of implementing this idea is given by \[ P(x^\ast| \mathcal{D}) \propto \exp \biggl\{ \rho \underbrace{ \biggl( \xi + t \cdot \frac{ \sum_i K(x_i, x_i) }{ \sum_{i,j} K(x_i, x_j) } \biggr) }_\text{effective number of observations} \underbrace{ \biggl( \frac{ \sum_i K(x_i, x^\ast) y_i + K_0(x^\ast) y_0(x^\ast) } { \sum_i K(x_i, x^\ast) + K_0(x^\ast) } \biggr) }_\text{mean function estimate} \biggr\} \] where $\rho$, $\xi$, $K$, $K_0$ and $y_0$ are parameters of the estimator: $\rho > 0$ is the precision we gain for each new distinct observation; $\xi > 0$ is the number of prior points; $K: \mathcal{X} \times \mathcal{X} \rightarrow \mathcal{R}^{+}$ is a symmetric kernel function; $K_0: \mathcal{X} \rightarrow \mathcal{R}^{+}$ is a prior precision function; and $y_0: \mathcal{X} \rightarrow \mathcal{R}$ is a prior estimate of $\bar{f}$.
The two components of the model are:
The expression for the posterior can be calculated up to a constant factor in $\mathcal{O}(t^2)$ time. The computation of the normalizing constant is in general intractable. Therefore, our proposed posterior can be easily combined with Markov chain Monte Carlo methods (MCMC) to implement stochastic optimizers. For more information, please refer to the paper.