Bayesian Optimalities

I’m sometimes asked in conferences and meetings around Montreal: why Bayes?

This always strikes me. I do sometimes identify with “Bayesianism”, because I want to do research related to Bayesian statistics, but I don’t feel that I’m a Bayesian in the sense of adhering to one particular school of thought. First and foremost, I am a mathematician and a statistician. I want to understand the world through simulation and modelling, with the certainty provided by mathematical lenses, and I want to study and develop the statistical tools that we use to understand the world when certainty is out of reach. That’s my scientific identity.

Yet I’m also driven to answer: what else? What else but probabilities to model uncertainty and randomness? What else but probabilistic conditioning to account for new information? (Ok, I’m exaggerating a bit here. There are tons of other fun answers to these questions, but Bayes certainly stands out.) I’m drawn to Bayesian stats because, more often than not, it’s about providing meaningful answers to the questions that we really care about. I want, when possible, posterior probabilities of hypotheses; not 0-1 decisions and wishful thinking.

That’s for the “let’s infer stuff” part. In a prediction or point estimation context, I can also try to explain how Bayesian procedures are well suited to hierarchical modelling and convenient computational algorithms, how they can be used to incorporate prior information, and how they have nice properties and follow from conceptually simple principles which are well suited to mathematical analysis. They may or may not help solve a particular problem and that’s perfectly fine, but they are still a subject of study worthy of specialisation. (Yes, I currently often have to defend my choice of studying Bayesian stats. I’m eager to begin my PhD.)

This post is about some of these nice properties of posterior distributions that I like to talk about. I’ll leave the non-nice things for another time.

1. Point estimation and minimal expected risk

This first section is not directly about properties of the posterior distribution, but is rather concerned with some summaries of the posterior which have nice statistical properties in different contexts.

Squared error loss

Suppose $\pi$ is a prior on a Euclidean parameter space $\Theta \subset \mathbb{R}^d$ with the norm $|\theta|^2 = \theta^T \theta$ induced by the dot product. Given a likelihood $p _ \theta(X)$ for data $X$, the posterior distribution is defined as

$$\pi(\theta \mid X) = \frac{p _ \theta(X)\, \pi(\theta)}{\int _ \Theta p _ \theta(X)\, \pi(\theta)\, d\theta},$$

and the mean of the posterior distribution is

$$\hat \theta _ \pi = \hat \theta _ \pi(X) = \int _ \Theta \theta\, \pi(\theta \mid X)\, d\theta.$$

If we define the risk of an estimator $\hat \theta$ for the estimation of a parameter $\theta _ 0$ as

$$R(\hat \theta; \theta _ 0) = \mathbb{E} _ {X \sim p _ {\theta _ 0}}\big[\, |\hat \theta(X) - \theta _ 0|^2 \,\big],$$

and if

$$B _ \pi(\hat \theta) = \int _ \Theta R(\hat \theta; \theta _ 0)\, \pi(\theta _ 0)\, d\theta _ 0$$

is the expected risk of $\hat \theta$ with respect to the prior $\pi$ (also called the Bayes risk), then we have that the posterior mean estimate $\hat \theta _ {\pi}$ satisfies

$$B _ \pi(\hat \theta _ \pi) \leq B _ \pi(\hat \theta)$$

for any estimator $\hat \theta$. That is, the posterior mean estimate minimizes the expected risk.

The proof follows from the fact that, for any estimator $\hat \theta$ and with $\theta \sim \pi(\cdot \mid X)$,

$$\mathbb{E}\big[\, |\theta - \hat \theta(X)|^2 \mid X \,\big] = \mathbb{E}\big[\, |\theta - \hat \theta _ \pi(X)|^2 \mid X \,\big] + |\hat \theta _ \pi(X) - \hat \theta(X)|^2,$$

the cross term vanishing upon expanding the square since $\hat \theta _ \pi(X) = \mathbb{E}[\theta \mid X]$.

Writing the expected risk as an expected posterior loss, i.e. using the fact that

$$B _ \pi(\hat \theta) = \mathbb{E} _ {X \sim m}\Big[\, \mathbb{E} _ {\theta \sim \pi(\cdot \mid X)}\big[\, |\hat \theta(X) - \theta|^2 \,\big] \,\Big],$$

where $m$ has density $m(x) = \int p _ \theta(x) \pi(\theta)\,d\theta$, and since

$$\mathbb{E} _ {\theta \sim \pi(\cdot \mid X)}\big[\, |\hat \theta(X) - \theta|^2 \,\big] \geq \mathbb{E} _ {\theta \sim \pi(\cdot \mid X)}\big[\, |\hat \theta _ \pi(X) - \theta|^2 \,\big],$$

we obtain the result.
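
To see this in action, here is a quick Monte Carlo sketch (my own toy example, with an arbitrary conjugate Normal-Normal model) comparing the Bayes risk of the posterior mean with that of the maximum likelihood estimate $\hat \theta(X) = X$:

```python
# Toy Monte Carlo check (Normal-Normal model): the posterior mean has a
# smaller Bayes risk (expected squared error under the prior) than the MLE.
import numpy as np

rng = np.random.default_rng(0)
tau2, sigma2, n_rep = 1.0, 2.0, 200_000  # prior variance, noise variance, replications

theta = rng.normal(0.0, np.sqrt(tau2), size=n_rep)  # theta ~ pi = N(0, tau2)
x = rng.normal(theta, np.sqrt(sigma2))              # X | theta ~ N(theta, sigma2)

post_mean = tau2 / (tau2 + sigma2) * x              # posterior mean estimate
mle = x                                             # maximum likelihood estimate

print(np.mean((post_mean - theta) ** 2))  # ~ sigma2 * tau2 / (sigma2 + tau2) = 2/3
print(np.mean((mle - theta) ** 2))        # ~ sigma2 = 2
```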

A few remarks:

  1. The expected risk has stability properties. If $\tilde \pi$ and $\pi$ are two priors that are absolutely continuous with respect to each other, and if $|\log \frac{d\tilde \pi}{d\pi}| _ \infty \leq C$, then

$$e^{-C}\, B _ \pi(\hat \theta) \leq B _ {\tilde \pi}(\hat \theta) \leq e^{C}\, B _ \pi(\hat \theta).$$

If the risk $R(\hat \theta; \theta _ 0)$ is uniformly bounded by some constant $M$ over $\theta _ 0\in \Theta$, then

$$\big|\, B _ {\tilde \pi}(\hat \theta) - B _ \pi(\hat \theta) \,\big| \leq M \int _ \Theta \Big|\, \tfrac{d\tilde \pi}{d\pi}(\theta _ 0) - 1 \,\Big|\, \pi(\theta _ 0)\, d\theta _ 0 \leq M\, (e^{C} - 1).$$

This shows how small changes in the prior do not result in a dramatic change in the expected loss of an estimator, as long as the priors have “compatible tails” (i.e. a manageable likelihood ratio).

  2. It is sometimes advocated to choose the prior $\pi$ so that the risk $R(\hat \theta _ \pi; \theta _ 0)$ is constant over $\theta _ 0$: the resulting estimator $\hat \theta _ \pi$ is then agnostic, from a risk point of view, to $\theta _ 0$. This may result in a sample-size dependent prior (which is arguably not in the Bayesian spirit), but the fun thing is that such a prior maximizes the Bayes risk (it is least favourable) and the Bayes estimator $\hat \theta _ \pi$ is minimax: $\hat \theta _ \pi \in \arg\min _ {\hat\theta} \sup _ {\theta _ 0}R(\hat \theta;\theta _ 0)$. Indeed, in that case we have for any estimator $\hat \theta$ that $\sup _ {\theta _ 0} R(\hat \theta; \theta _ 0) \geq B _ \pi(\hat \theta) \geq B _ \pi(\hat \theta _ \pi) = \sup _ {\theta _ 0}R(\hat \theta _ \pi;\theta _ 0)$, from which it follows that $\hat \theta _ \pi$ is minimax. (A classical binomial example is spelled out right after this list.)

  3. The idea of minimizing expected risk is not quite Bayesian, since it requires us to first average over all data possibilities when computing the risk. One of the main advantages of the Bayesian framework is that it allows us to condition on the observed data, rather than pre-emptively considering all possibilities, and we can try to make use of that. Define the posterior expected loss (or posterior risk) of an estimator $\hat \theta$, conditionally on $X$, as

$$r(\hat \theta \mid X) = \mathbb{E} _ {\theta \sim \pi(\cdot \mid X)}\big[\, |\hat \theta(X) - \theta|^2 \,\big].$$

  • It is clear from the previous computations that the posterior mean estimate minimizes the posterior risk, and hence the two approaches are equivalent. It turns out that, whatever the loss function we consider (under some regularity conditions ensuring that stuff is finite and minimizers exist), minimizing the posterior risk is equivalent to minimizing the Bayes risk. In other words, for any loss function $L$ we have

$$\hat \theta \in \arg\min _ {\tilde \theta} \int \mathbb{E} _ {X \sim p _ {\theta _ 0}}\big[ L(\tilde \theta(X), \theta _ 0) \big]\, \pi(\theta _ 0)\, d\theta _ 0 \quad \Longleftrightarrow \quad \hat \theta(X) \in \arg\min _ {a} \mathbb{E} _ {\theta \sim \pi(\cdot \mid X)}\big[ L(a, \theta) \big] \text{ for ($m$-)almost every } X.$$
  • This is roughly self-evident if we think about it. An interesting consequence is that any estimator minimizing a Bayes risk is a function of the posterior distribution.
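
To make remark 2 concrete, here is the classical binomial example (a standard textbook fact, written in the notation of this post): if $X \sim \mathrm{Binomial}(n, \theta)$ and the prior is $\mathrm{Beta}(\sqrt{n}/2, \sqrt{n}/2)$, then the posterior mean and its risk are

$$\hat \theta _ \pi = \frac{X + \sqrt{n}/2}{n + \sqrt{n}}, \qquad R(\hat \theta _ \pi; \theta _ 0) = \frac{n\theta _ 0(1-\theta _ 0) + n(1/2 - \theta _ 0)^2}{(n + \sqrt{n})^2} = \frac{1}{4(1 + \sqrt{n})^2},$$

which is constant in $\theta _ 0$, so that $\hat \theta _ \pi$ is minimax. Note the sample-size dependent prior.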

2. Randomized estimation and information risk minimization

Let $\Theta$ be a model, let $X \sim Q$ be some data and let $\ell _ \theta(X)$ be a loss associated with using $\theta$ to fit the data $X$. For instance, we could have $\Theta = \{f:\mathcal{X} \rightarrow \mathbb{R}\}$ a set of functions, $X =\{(U _ i, Y _ i)\} _ {i=1}^n \subset \mathcal{X}\times \mathbb{R}$ a set of features with associated responses, and $\ell _ \theta(X) = \sum _ {i}(Y _ i -\theta(U _ i))^2$ the sum of squared errors.

There may be a parameter $\theta _ 0\in\Theta$ minimizing the risk $R(\theta) = \mathbb{E} _ {X\sim Q}[\ell _ \theta(X)]$, which will then be our learning target. Now we consider randomized estimators taking the form $\theta\sim \hat \pi _ X$, where $\hat\pi _ X$ is a data-dependent distribution, and the performance of this estimation method can then be evaluated by the empirical risk $R _ X (\hat\pi _ X) = \mathbb{E} _ {\theta \sim \hat \pi _ X}[\ell _ \theta(X)]$.

Here we should be raising an eyebrow. There is typically no point in having the estimator $\theta$ be random, i.e. we will typically prefer to take $\hat \pi _ X$ to be a point mass rather than anything else. But bear with me for a sec. The cool thing is that if we choose

$$d\hat \pi _ X(\theta) \propto e^{-\ell _ \theta(X)}\, d\pi(\theta),$$

where $D(\hat \pi _ X| \pi) = \int \log \frac{d\hat \pi _ X}{d\pi} \,d\hat \pi _ X$ is the Kullback-Leibler divergence, then this distribution will satisfy

$$\hat \pi _ X \in \arg\min _ {\hat \pi} \Big\{\, \mathbb{E} _ {\theta \sim \hat \pi}\big[ \ell _ \theta(X) \big] + D(\hat \pi | \pi) \,\Big\},$$

where the minimum is over all distributions $\hat \pi$ on $\Theta$.

That is, Bayesian-type posteriors arise by minimizing the empirical risk of a randomized estimation scheme penalized by the Kullback-Leibler divergence from prior to posterior (Zhang, 2006).

For the proof, write

$$\mathbb{E} _ {\theta \sim \hat \pi}\big[ \ell _ \theta(X) \big] + D(\hat \pi | \pi) = \int \Big( \ell _ \theta(X) + \log \frac{d\hat \pi}{d\pi}(\theta) \Big)\, d\hat \pi(\theta) = \int \log \frac{d\hat \pi}{e^{-\ell _ \theta(X)}\, d\pi}\, d\hat \pi,$$

which is the Kullback-Leibler divergence $D(\hat \pi | e^{-\ell _ \theta(X)} d\pi)$ to the (unnormalized) measure $e^{-\ell _ \theta(X)}\, d\pi$ and is, by properties of the Kullback-Leibler divergence, minimized at $d\hat \pi \propto e^{-\ell _ \theta(X)}d\pi(\theta)$, i.e. at $\hat \pi = \hat \pi _ X$. (Normalizing $e^{-\ell _ \theta(X)}\, d\pi$ only shifts the objective by a constant that does not depend on $\hat \pi$.)
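
As a sanity check, here is a small numerical sketch (my own, on an arbitrary finite parameter grid with made-up losses) verifying that this Gibbs-type choice minimizes the penalized empirical risk:

```python
# Toy numerical check: on a finite parameter set, the Gibbs posterior
# minimizes  E_{theta ~ rho}[loss(theta)] + KL(rho | prior).
import numpy as np

rng = np.random.default_rng(1)
K = 8
prior = np.full(K, 1.0 / K)           # uniform prior on K parameter values
loss = rng.uniform(0.0, 5.0, size=K)  # empirical losses ell_theta(X) for a fixed X

gibbs = prior * np.exp(-loss)
gibbs /= gibbs.sum()                  # Gibbs posterior, proportional to exp(-loss) * prior

def objective(rho):
    # penalized empirical risk: expected loss + KL(rho | prior)
    return rho @ loss + np.sum(rho * np.log(rho / prior))

print(objective(gibbs))               # equals -log(sum_k prior_k * exp(-loss_k))
for _ in range(5):
    rho = rng.dirichlet(np.ones(K))   # random competitor distribution
    assert objective(rho) >= objective(gibbs) - 1e-12
```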

Is this practically useful and insightful? Possibly. But at least this approach is suited to a general theory, as shown in Zhang (2006) and as I reproduce below.

Let us introduce a Rényi-type generalization error $d _ \alpha(\theta; Q)$ defined, for $\alpha \in (0,1)$, by

This is a measure of loss associated with the use of a parameter $\theta$ to fit new data $X' \sim Q$. We also write

$$d _ \alpha(\hat \pi _ X; Q) = \mathbb{E} _ {\theta \sim \hat \pi _ X}\big[\, d _ \alpha(\theta; Q) \,\big]$$

for the expected Rényi generalization error when using the randomization scheme $\theta \sim \hat \pi _ X$.

In order to get interesting bounds on this generalization error, we can follow the approach of Zhang (2006).

Change of measure inequality

We’ll need the change of measure inequality, which states that for any function $f$ and distributions $\pi$, $\hat \pi$ on $\Theta$,

$$\mathbb{E} _ {\theta \sim \hat \pi}\big[ f(\theta) \big] \leq D(\hat \pi | \pi) + \log \mathbb{E} _ {\theta \sim \pi}\big[ e^{f(\theta)} \big].$$

Indeed, with some sloppiness and Jensen’s inequality we can compute

$$\mathbb{E} _ {\theta \sim \hat \pi}\big[ f(\theta) \big] - D(\hat \pi | \pi) = \int \log\Big( e^{f(\theta)}\, \frac{d\pi}{d\hat \pi}(\theta) \Big)\, d\hat \pi(\theta) \leq \log \int e^{f(\theta)}\, \frac{d\pi}{d\hat \pi}(\theta)\, d\hat \pi(\theta) \leq \log \mathbb{E} _ {\theta \sim \pi}\big[ e^{f(\theta)} \big].$$

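
Here is the same kind of toy numerical check for the change of measure inequality, on a finite parameter set with arbitrary $\pi$, $\hat \pi$ and $f$ (again my own sketch):

```python
# Toy numerical check of the change of measure inequality
#   E_{theta ~ rho}[f] <= KL(rho | pi) + log E_{theta ~ pi}[exp(f)]
# on a finite parameter set.
import numpy as np

rng = np.random.default_rng(2)
K = 10
pi = rng.dirichlet(np.ones(K))     # "prior" pi
f = rng.normal(0.0, 3.0, size=K)   # arbitrary function f(theta)

for _ in range(5):
    rho = rng.dirichlet(np.ones(K))  # arbitrary distribution rho ("pi hat")
    lhs = rho @ f
    rhs = np.sum(rho * np.log(rho / pi)) + np.log(pi @ np.exp(f))
    assert lhs <= rhs + 1e-12
```
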
Generalization error bound

We can now attempt to bound $d _ \alpha(\hat \pi _ X;Q)$. Consider the difference $\Delta _ X (\theta) = d _ \alpha(\theta;Q) - \ell _ \theta(X)$ between the generalization error and the empirical loss corresponding to the use of a fixed parameter $\theta$. Then by the change of measure inequality (applied with $f = \Delta _ X$),

$$\mathbb{E} _ {\theta \sim \hat \pi _ X}\big[ \Delta _ X(\theta) \big] \leq D(\hat \pi _ X | \pi) + \log \mathbb{E} _ {\theta \sim \pi}\big[ e^{\Delta _ X(\theta)} \big],$$

and hence for any $\pi$,

By Markov’s inequality, this implies that $\forall t > 0$,

Rewriting yields

with probability at least $1-e^{-t}$. To recap: the term $d _ \alpha(\hat \pi _ X;Q)$ is understood as a generalization error, on the right hand side $R _ X(\hat \pi _ X) = \mathbb{E} _ {\theta \sim \hat \pi _ X}[\ell _ \theta(X)]$ is the empirical risk, the Kullback-Leibler divergence $D(\hat \pi _ X|\pi)$ penalizes the complexity of $\hat\pi _ X$ seen as a divergence from a “prior” $\pi$, and $t$ is a tuning parameter.

3. Online learning, regret and Kullback-Leibler divergence

Following Barron (1998), suppose we sequentially observe data points $X _ 1, X _ 2, X _ 3, \dots$ which are, say, i.i.d. with common distribution $Q$ with density $q$. At each time step $n$, the goal is to predict $X _ {n+1}$ using the data $X^n = (X _ 1, \dots, X _ n)$. Our prediction is not a point estimate of $X _ {n+1}$, but, somewhat similarly to the randomized estimation scenario, we output a density estimate $\hat p _ n = p(\cdot \mid X^n)$, the goal being that $p(X _ {n+1}\mid X^n)$ be as large as possible. A bit more precisely, we individually score a density estimate $\hat p _ n$ through the risk $\ell _ q(\hat p _ n) = \mathbb{E} _ {X _ {n+1}\sim q}[\log(q(X _ {n+1})/\hat p _ n(X _ {n+1} ))] = D(q| \hat p _ n)$, which is the Kullback-Leibler divergence between $q$ and $\hat p _ n$. The regret over times $n=1, 2,\dots, N$ is the sum of the expected risks over the whole process, i.e.

$$\sum _ {n=0}^{N-1} \mathbb{E} _ {X^n \sim q^n}\big[\, \ell _ q(\hat p _ n) \,\big] = \sum _ {n=0}^{N-1} \mathbb{E} _ {X^n \sim q^n}\big[\, D(q | \hat p _ n) \,\big].$$

Formally, this process is equivalent to estimating the distribution of $X^N$ all at once: our density estimate $\hat p^N$ of $X^N$ would simply be

$$\hat p^N(x^N) = \prod _ {n=1}^{N} p(x _ n \mid x^{n-1}),$$

and the regret is, by the chain rule, simply $D(q^N | \hat p^N)$, where $q^N$ is the $N$-fold independent product of $q$.
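
In case it is helpful, here is the chain rule computation behind this identity, spelled out with the conventions above (nothing beyond the definitions is used):

$$D(q^N | \hat p^N) = \mathbb{E} _ {X^N \sim q^N}\Big[ \sum _ {n=1}^{N} \log \frac{q(X _ n)}{p(X _ n \mid X^{n-1})} \Big] = \sum _ {n=0}^{N-1} \mathbb{E} _ {X^n \sim q^n}\big[\, D(q | \hat p _ n) \,\big],$$

which is exactly the sum of expected risks defining the regret.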

Given a prior $\pi$ over a space of distributions for $q$, our problem is then to minimize the Bayes risk

$$\int D(q^N | \hat p^N)\, \pi(dq)$$

over density estimates $\hat p^N$.

This is achieved by choosing $\hat p^N(x) = \hat p _ \pi^N(x) = \int q^N(x) \,\pi(dq)$, the prior predictive density. This is equivalent to using, at each time step $n$, the posterior predictive density $\hat p _ {n, \pi}(x) = \int q(x) \,\pi(dq\mid \{X _ i\} _ {i=1}^n)$.
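
Here is a toy numerical illustration of this equivalence in a Beta-Bernoulli model (my own example; the $\mathrm{Beta}(a, b)$ prior and the binary sequence are arbitrary): the prior predictive density of the full sequence coincides with the product of the sequential posterior predictive probabilities.

```python
# Toy check (Beta-Bernoulli): the prior predictive density of the whole
# sequence equals the product of the sequential posterior predictives.
import math

a, b = 1.5, 2.0            # Beta(a, b) prior on the Bernoulli parameter
x = [1, 0, 0, 1, 1, 0, 1]  # an observed binary sequence

def log_beta(u, v):
    return math.lgamma(u) + math.lgamma(v) - math.lgamma(u + v)

# Prior predictive of the full sequence: B(a + k, b + N - k) / B(a, b)
k, N = sum(x), len(x)
log_prior_pred = log_beta(a + k, b + N - k) - log_beta(a, b)

# Product of the posterior predictives p(x_{n+1} | x^n)
log_seq, ones = 0.0, 0
for n, xn in enumerate(x):
    p_one = (a + ones) / (a + b + n)  # posterior predictive P(x_{n+1} = 1 | x^n)
    log_seq += math.log(p_one if xn == 1 else 1.0 - p_one)
    ones += xn

print(log_prior_pred, log_seq)  # identical up to rounding
```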

To see this minimizing property of the Bayes average, it suffices to write

$$\int D(q^N | \hat p^N)\, \pi(dq) - \int D(q^N | \hat p _ \pi^N)\, \pi(dq) = \int\!\!\int q^N(x)\, \log \frac{\hat p _ \pi^N(x)}{\hat p^N(x)}\, dx\, \pi(dq) = D(\hat p _ \pi^N | \hat p^N) \geq 0.$$

Note that a consequence of this analysis is also that the posterior predictive distribution $\hat p _ {n, \pi}$ will minimize the expected posterior risk:

$$\hat p _ {n, \pi} \in \arg\min _ {\hat p} \mathbb{E} _ {q \sim \pi(\cdot \mid \{X _ i\} _ {i=1}^n)}\big[\, D(q | \hat p) \,\big].$$

Following section 1, this furthermore means that the posterior predictive distribution minimizes the Bayes risk associated with the Kullback-Leibler loss.