Olivier Binette: Bayesian Optimalities

1. Point estimation and minimal expected risk

This first section is not directly about properties of the posterior distribution, but it is rather concerned with some summaries of the posterior which have nice statistical properties in different contexts.

Squared error loss

Suppose $\pi$ is a prior on an euclidean parameter space $\Theta \subset \mathbb{R}^d$ with norm $\|\theta\|^2 = \theta^T \theta$ defined through the dot product. Given a likelihood $p _ \theta(X)$ for data $X$, the posterior distribution is defined as

\[ \pi(A \mid X) \propto \int _ A p _ \theta(X) \pi(d\theta) \]

and the mean of the posterior distribution is

\[ \hat \theta _ {\pi} = \int \theta \,\pi(d\theta \mid X) = \mathbb{E} _ {\theta \sim \pi}[\theta \mid X]. \]

If we define the risk of an estimator $\hat \theta$ for the estimation of a parameter $\theta _ 0$ as

\[ R(\hat \theta; \theta _ 0) = \mathbb{E} _ {X \sim p _ \theta}[\|\theta _ 0 - \hat \theta(X)\|^2], \]

and if

\[ B _ \pi(\hat \theta) = \mathbb{E} _ {\theta _ 0 \sim \pi}[R(\hat \theta; \theta _ 0)] \]

is the expected risk of $\hat \theta$ with respect to the prior $\pi$ (also called the Bayes risk), then we have that the posterior mean estimate $\hat \theta _ {\pi}$ satisfies

\[ B _ \pi(\hat \theta) \geq B _ \pi(\hat \theta _ \pi) \]

for any estimator $\hat \theta$. That is, the posterior mean estimate minimizes the expected risk.

The proof follows from the fact that

\[ \| \theta _ 0 - \hat \theta(X) \|^2 \geq \|\theta _ 0 - \hat \theta _ \pi(X) \|^2 + \langle \theta _ 0 - \hat \theta _ \pi(X), \hat \theta _ \pi(X) - \hat \theta(X)\rangle. \]

Writing the expected risk as an expected posterior loss, i.e. using the fact that

\[ \mathbb{E} _ {\theta _ 0 \sim \pi}\left[\mathbb{E} _ {X \sim p _ {\theta _ 0}}[\,\cdot\,]\right] = \mathbb{E} _ {X \sim m}\left[\mathbb{E} _ {\theta _ 0 \sim \pi(\cdot \mid X)}[\,\cdot\,]\right] \]

where $m$ has density $m(x) = \int p _ \theta(x) \pi(\theta)\,d\theta$, and since

\[ \mathbb{E} _ {\theta _ 0 \sim \pi(\cdot \mid X)}\left[\langle \theta _ 0 - \hat \theta _ \pi(X), \hat \theta _ \pi(X) - \hat \theta(X)\rangle\right] = 0, \]

we obtain the result.

A few remarks:

The expected risk has stability properties. If $\tilde \pi$ and $\pi$ are two priors that are absolutely continuous with respect to each other, and if $\|\log \frac{d\tilde \pi}{d\pi}\| _ \infty \leq C$, then

\[ e^{-C}B _ \pi(\hat\theta) \leq B _ {\tilde \pi}(\hat \theta) \leq e^C B _ {\pi}(\hat \theta). \]

If the risk $R(\hat \theta; \theta _ 0)$ is uniformly bounded by some constant $M$ over $\theta _ 0\in \Theta$, then

\[ B _ {\tilde \pi}(\hat \theta) \leq \sqrt{M B _ {\pi}(\hat \theta)} \left\|d\tilde\pi/d\pi\right\| _ {L^2(\pi)}. \]

This shows how small chances in the prior does not result in a dramatic change in the expected loss of an estimator, as long as the priors have “compatible tails” (i.e. a manageable likelihood ratio).

It is sometimes advocated to choose the prior $\pi$ so that the risk $R(\hat \theta _ \pi; \theta _ 0)$ is constant over $\theta _ 0$: the resulting estimator $\hat \theta _ \pi$ is then agnostic, from a risk point of view, to $\theta _ 0$. This may result in a sample-size dependent prior (which is arguably not in the Bayesian spirit), but the fun thing is that it makes the expected risk maximal and the Bayes estimator $\hat \theta _ \pi$ minimax: $\hat \theta _ \pi \in \arg\min _ {\hat\theta} \sup _ {\theta _ 0}R(\hat \theta;\theta _ 0)$. Indeed, in that case we have for any estimator $\hat \theta$ that $\sup _ {\theta _ 0} R(\hat \theta; \theta _ 0) \geq B _ \pi(\hat \theta) \geq B _ \pi(\theta _ \pi) = \sup _ {\theta _ 0}R(\hat \theta _ \pi;\theta _ 0)$, from which it follows that $\hat \theta _ \pi$ is minimax.
The idea of minimizing expected risk is not quite Bayesian, since it required us to first average over all data possibilities when computing the risk. One of the main advantage of the Bayesian framework is that it allows us to condition over the observed data, rather than pre-emptively considering all possibilities, and we can try to make use of that. Define the posterior expected loss (or posterior risk) or an estimator $\hat \theta$, conditionally on $X$, as

\[ R _ \pi(\hat \theta\mid X) = \mathbb{E} _ {\theta _ 0 \sim \pi(\cdot \mid X)}\left[(\hat \theta(X) - \theta _ 0)^2\right]. \]

It is clear from the previous computations that the posterior mean estimate minimizes the posterior risk, and hence the two approaches are equivalent. It turns out that, whatever the loss function we consider (under some regularity condition ensuring that stuff is finite and minimizers exist), minimizing the posterior risk is equivalent to minimizing the Bayes risk. In other words, we have that for any loss function (again under some regularity conditions ensuring finiteness and existence of stuff), we have

\[ \arg\min _ {\hat \theta}\mathbb{E} _ {X \sim m}\left[\mathbb{E} _ {\theta _ 0 \sim \pi(\cdot \mid X)}[\ell(\hat \theta(X), \theta _ 0)]\right] = \arg\min _ {\hat\theta}\mathbb{E} _ {\theta _ 0 \sim \pi(\cdot \mid X)}[\ell(\hat \theta(X), \theta _ 0)]. \]

This is roughly self-evident if we think about it. An interesting consequence is that any estimator minimizing a Bayes risk is a function of the posterior distribution.

2. Randomized estimation and information risk minimization

Let $\Theta$ be a model, let $X \sim Q$ be some data and let $\ell _ \theta(X)$ be a loss associated with using $\theta$ to fitting the data $X$. For instance, we could have $\Theta = \{f:\mathcal{X} \rightarrow \mathbb{R}\}$ a set of functions, $X =\{(U _ i, Y _ i)\} _ {i=1}^n \subset \mathcal{X}\times \mathbb{R}$ a set of features with associated responses, and $\ell _ \theta(X) = \sum _ {i}(Y _ i -\theta(U _ i))^2$ the sum of squared loss.

There may be a parameter $\theta _ 0\in\Theta$ minimizing the risk $R(\theta) = \mathbb{E} _ {X\sim Q}[\ell _ \theta(X)]$, which will then be our learning target. Now we consider randomized estimators taking the form $\theta\sim \hat \pi _ X$, where $\hat\pi _ X$ is a data-dependent distribution, and the performance of this estimation method can then be evaluated by the empirical risk $R _ X (\hat\pi _ X) = \mathbb{E} _ {\theta \sim \hat \pi _ X}[\ell _ \theta(X)]$.

Here we should be raising an eyebrow. There is typically no point in having the estimator $\theta$ being random, i.e. we typically will prefer to take $\hat \pi _ X$ a point mass rather than anything else. But bear with me for a sec. The cool thing is that if we choose

\[ \hat \pi _ X = \arg\min _ {\hat \pi _ X} \left\{R(\hat \pi _ X) + D(\hat \pi _ X \| \pi)\right\}, \tag{$*$} \]

where $D(\hat \pi _ X\| \pi) = \int \log \frac{d\hat \pi _ X}{d\pi} \,d\hat \pi _ X$ is the Kullback-Leibler divergence, then this distribution will satisfy

\[ d\hat \pi _ X(\theta) \propto e^{-\ell _ \theta(X)}d\pi(\theta). \]

That is, Bayesian-type posteriors arise by minimizing the empirical risk of a randomized estimation scheme penalized by the Kullback-Leibler divergence form prior to posterior (Zhang, 2006).

For the proof, write

\[ R _ X(\hat \pi _ X) + D(\hat \pi _ X \| \pi) = \int \left(\ell _ \theta(X) + \log\frac{d\hat \pi _ X(\theta)}{d\pi(\theta)}\right) d\hat \pi _ X (\theta)=\int\left(\log\frac{d\hat\pi _ X(\theta)}{e^{-\ell _ \theta(X)}d\pi(\theta)}\right)d\hat \pi _ X(\theta) \]

which is also equal to $D(d\hat \pi _ X \| e^{-\ell _ \theta(X)} d\pi)$ and, by properties of the Kullback-Leibler divergence, obviously minimized at $d\hat \pi _ X \propto e^{\ell _ \theta(X)}d\pi(\theta)$.

Is this practically useful and insightful? Possibly. But at least this approach is suited to a general theory, as shown in Zhang (2006) and as I reproduce below.

Let us introduce a Rényi-type generalization error defined, for $\alpha \in (0,1)$, by

\[ d _ \alpha(\theta; Q) = -\alpha^{-1}\log\mathbb{E} _ {X' \sim Q}[e^{-\alpha \ell _ \theta(X')}]. \]

This is a measure of loss associated with the use of a parameter $\theta$ to fit new data $X' \sim Q$. We also write

\[ d _ \alpha(\hat \pi _ X; Q) = -\mathbb{E} _ {\theta \sim \hat \pi _ X}\left[ \alpha^{-1}\log\mathbb{E} _ {X' \sim Q}[e^{-\alpha \ell _ \theta(X')}] \right] \]

for the expected Rényi generalization error when using the randomization scheme $\theta \sim \hat \pi _ X$.

In order to get interesting bounds on this generalization error, we can follow the approach of Zhang (2006).

Change of measure inequality

We’ll need the change of measure inequality, which states that for any function $f$ and distributions $\pi$, $\hat \pi$ on $\Theta,$

\[ \mathbb{E} _ {\theta \sim \hat\pi}[f(\theta)] \leq D(\hat \pi \| \pi) + \log \mathbb{E} _ {\theta \sim \pi}\left[e^{f(\theta)}\right]. \]

Indeed, with some sloppyness and Jensen’s inequality we can compute

\[ \log \int e^{f(\theta)}\pi(d\theta)\geq \int f(\theta)\log(d\pi/d\hat\pi(\theta))d\hat \pi = \mathbb{E} _ {\theta \sim \hat \pi}[f(\theta)] - D(\hat \pi\|\pi). \]

Generalization error bound

We can now attempt bounding $d _ \alpha(\hat \pi _ X;Q)$. Consider the difference $\Delta _ X (\theta) = d _ \alpha(\theta;Q) - \ell _ \theta(X)$ between the generalization error and the empirical loss corresponding to the use of a fixed parameter $\theta$. Then by the change of measure inequality,

\[ \exp\{\mathbb{E} _ {\theta \sim \hat \pi _ X}[\Delta _ X(\theta)] - D(\hat \pi _ X\|\pi)\} \leq \mathbb{E} _ {\theta \sim \pi}\left[e^{\Delta _ X(\theta)}\right] \]

and hence for any $\pi$,

\[ \mathbb{E} _ {X \sim Q}\left[\exp\left\{\mathbb{E} _ {\theta \sim \hat \pi _ X}[\Delta _ X(\theta)] - D(\hat \pi _ X\|\pi)\right\}\right] \leq \mathbb{E} _ {X \sim Q}\left[\mathbb{E} _ {\theta \sim \pi}\left[e^{\Delta _ X(\theta)}\right]\right] = 1 \]

By Markov’s inequality, this implies that $\forall t > 0$,

\[ \mathbb{P}\left(\mathbb{E} _ {\theta \sim \hat \pi _ X}[\Delta _ X(\theta)] - D(\hat \pi _ X\|\pi) \geq t\right) \leq e^{-t}. \]

Rewriting yields

\[ d _ \alpha(\hat \pi _ X;Q) \leq R _ X(\hat \pi _ X) + D(\hat \pi _ X\|\pi) + t \]

with probability at least $1-e^{-t}$. To recap: the term $d _ \alpha(\hat \pi _ X;Q)$ is understood as a generalization error, on the right hand side $R _ X(\hat \pi _ X) = \mathbb{E} _ {\theta \sim \hat \pi _ X}[\ell _ \theta(X)]$ is the empirical risk, the Kullback-Leibler divergence $D(\hat \pi _ X\|\pi)$ penalizes the complexity of $\hat\pi _ X$ seen as a divergence from a “prior” $\pi$, and $t$ is a tuning parameter.

3. Online learning, regret and Kullback-Leibler divergence

Following Barron (1998), suppose we sequentially observe data points $X _ 1, X _ 2, X _ 3, \dots$ which are say i.i.d. with common distribution $Q$ with density $q$. At each time step $n$, the goal is to predict $X _ {n+1}$ using the data $X^n = (X _ 1, \dots, X _ n)$. Our prediction is not a point estimate of $X _ {n+1}$, but somewhat similarly as in the randomized estimation scenario we output a density estimate $\hat p _ n = p(\cdot \mid X^n)$, the goal being that $p(X _ {n+1}\mid X^n)$ be as large as possible. A bit more precisely, we individually score a density estimate $\hat p _ n$ through the risk $\ell _ q(\hat p _ n) = \mathbb{E} _ {X _ {n+1}\sim q}[\log(q(X _ {n+1})/\hat p _ n(X _ {n+1} ))] = D(q\| \hat p _ n)$ which is the Kullback-Leibler divergence between $\hat p _ n$ and $q$. The regret over times $n=1, 2,\dots, N$ is the sum of the risk over the whole process, i.e.

\[ \text{regret} = \sum _ {n=1}^N D(q\| \hat p _ n). \]

Formally, this process is equivalent to estimating the distribution of $X^N$ all at once: our density estimate $\hat p^N$ of $X^N$ would simply be

\[ \hat p^N(X^N) = \prod _ {n=1}^N \hat p _ n(X _ n) \]

and the regret is, by the chain rule, simply $D(q^N \| \hat p^N)$, where $q^N$ is the $N$th independent product of $q$.

Given a prior $\pi$ over a space of distributions for $q$, our problem then to minimize the Bayes risk

\[ B _ \pi(\hat p^N) = \mathbb{E} _ {q\sim \pi} D(q^N\|\hat p^N). \]

This is achieved by choosing $\hat p^N(x) = \hat p _ \pi^N(x) = \int q^N(x) \pi(dq)$ the prior predictive density. This is equivalent to using, at each time step $n$, the poterior predictive density $\hat p _ {n, \pi}(x) = \int q(x) \,\pi(dq\mid \{X _ i\} _ {i=1}^n)$.

To see this minimizing property of the Bayes average, it suffices to write

\[ B _ \pi(\hat p^N) = \mathbb{E} _ {q \sim \pi} \left[D(q^N\| \hat p _ \pi^N)\right] + D(\hat p _ \pi^N \| \hat p^N). \]

Note that an consequence of this analysis is also that the posterior predictive distribution $\hat p _ {n, \pi}$ will minimize the expected posterior risk:

\[ \hat p _ {n, \pi} \in \arg\min _ {\hat p _ {n}} \mathbb{E} _ {q \sim \pi(\cdot\mid X^n)}\left[D(q\|\hat p _ n)\right]. \]

Following section 1, this furthermore means that the posterior predictive distribution minimizes the Bayes risk associated with the Kullback-Leibler loss.