Bayesian Optimalities
I’m sometimes asked in conferences and meetings around Montreal: why Bayes?
This always strikes me. I do sometimes identify with “bayesianism”, because I want to do research related to Bayesian statistics, but I don’t feel that I’m a Bayesian in the sense that I adhere to one particular school of thought. First and foremost, I am a mathematician and a statistician. I want to understand the world through simulation and modelling, with the certainty provided by mathematical lenses, and I want to study and develop the statistical tools that we use to understand the world when certainty is not at reach. That’s my scientific identity.
Yet I’m also driven to answer: what else? What else than probabilities to model uncertainty and randomness? What else than probabilistic conditioning to account for new information? (Ok, I’m exaggerating a bit here. There are tons of other fun “else” to answer that question, but Bayes certainly stands out.) I’m drawn to Bayesian stats because, more often than not, it’s about providing meaningful answers to the questions that we really care about. I want, when possible, posterior probabilities of hypotheses; not 01 decisions and wishful thinking.
That’s for the “let’s infer stuff” part. In a prediction or point estimation context, I can also try to explain how Bayesian procedures are well suited to hierarchical modelling and convenient computational algorithms, how they can be used to incorporate prior information, have nice properties and follow from conceptually simple principles which are well suited to mathematical analysis. They may or may not help solving a particular problem and that’s perfectly fine, but they are still a subject of study worthwhile of specialisation. (Yes, I currently often have to defend my choice of studying Bayesian stats. I’m eager to begin my PhD.)
This post is about some of these nice properties of posterior distributions that I like to talk about. I’ll leave the nonnice things for another time.
1. Point estimation and minimal expected risk
This first section is not directly about properties of the posterior distribution, but it rather concerned with some summaries of the posterior which have nice statistical properties in different contexts.
Squared error loss
Suppose $\pi$ is a prior on an euclidean parameter space $\Theta \subset \mathbb{R}^d$ with norm $\theta^2 = \theta^T \theta$ defined through the dot product. Given a likelihood $p _ \theta(X)$ for data $X$, the posterior distribution is defined as
and the mean of the posterior distribution is
If we define the risk of an estimator $\hat \theta$ for the estimation of a parameter $\theta _ 0$ as
and if
is the expected risk of $\hat \theta$ with respect to the prior $\pi$ (also called the Bayes risk), then we have that the posterior mean estimate $\hat \theta _ {\pi}$ satisfies
for any estimator $\hat \theta$. That is, the posterior mean estimate minimizes the expected risk.
The proof follows from the fact that
Writing the expected risk as an expected posterior loss, i.e. using the fact that
where $m$ has density $m(x) = \int p _ \theta(x) \pi(\theta)\,d\theta$, and since
we obtain the result.
A few remarks:
 The expected risk has stability properties. If $\tilde \pi$ and $\pi$ are two priors that are absolutely continuous with respect to each other, and if $\log \frac{d\tilde \pi}{d\pi} _ \infty \leq C$, then
If the risk $R(\hat \theta; \theta _ 0)$ is uniformly bounded by some constant $M$ over $\theta _ 0\in \Theta$, then
This shows how small chances in the prior does not result in a dramatic change in the expected loss of an estimator, as long as the priors have “compatible tails” (i.e. a manageable likelihood ratio).

It is sometimes advocated to choose the prior $\pi$ so that the risk $R(\hat \theta _ \pi; \theta _ 0)$ is constant over $\theta _ 0$: the resulting estimator $\hat \theta _ \pi$ is then agnostic, from a risk point of view, to $\theta _ 0$. This may result in a samplesize dependent prior (which is arguably not in the Bayesian spirit), but the fun thing is that it makes the expected risk maximal and the Bayes estimator $\hat \theta _ \pi$ minimax: $\hat \theta _ \pi \in \arg\min _ {\hat\theta} \sup _ {\theta _ 0}R(\hat \theta;\theta _ 0)$. Indeed, in that case we have for any estimator $\hat \theta$ that $\sup _ {\theta _ 0} R(\hat \theta; \theta _ 0) \geq B _ \pi(\hat \theta) \geq B _ \pi(\theta _ \pi) = \sup _ {\theta _ 0}R(\hat \theta _ \pi;\theta _ 0)$, from which it follows that $\hat \theta _ \pi$ is minimax.

The idea of minimizing expected risk is not quite Bayesian, since it required us to first average over all data possibilities when computing the risk. One of the main advantage of the Bayesian framework is that it allows us to condition over the observed data, rather than preemptively considering all possibilities, and we can try to make use of that. Define the posterior expected loss (or posterior risk) or an estimator $\hat \theta$, conditionally on $X$, as
 It is clear from the previous computations that the posterior mean estimate minimizes the posterior risk, and hence the two approaches are equivalent. It turns out that, whatever the loss function we consider (under some regularity condition ensuring that stuff is finite and minimizers exist), minimizing the posterior risk is equivalent to minimizing the Bayes risk. In other words, we have that for any loss function (again under some regularity conditions ensuring finiteness and existence of stuff), we have
 This is roughly selfevident if we think about it. An interesting consequence is that any estimator minimizing a Bayes risk is a function of the posterior distribution.
2. Randomized estimation and information risk minimization
Let $\Theta$ be a model, let $X \sim Q$ be some data and let $\ell _ \theta(X)$ be a loss associated with using $\theta$ to fitting the data $X$. For instance, we could have $\Theta = {f:\mathcal{X} \rightarrow \mathbb{R}}$ a set of functions, $X ={(U _ i, Y _ i)} _ {i=1}^n \subset \mathcal{X}\times \mathbb{R}$ a set of features with associated responses, and $\ell _ \theta(X) = \sum _ {i}(Y _ i \theta(U _ i))^2$ the sum of squared loss.
There may be a parameter $\theta _ 0\in\Theta$ minimizing the risk $R(\theta) = \mathbb{E} _ {X\sim Q}[\ell _ \theta(X)]$, which will then be our learning target. Now we consider randomized estimators taking the form $\theta\sim \hat \pi _ X$, where $\hat\pi _ X$ is a datadependent distribution, and the performance of this estimation method can then be evaluated by the empirical risk $R _ X (\hat\pi _ X) = \mathbb{E} _ {\theta \sim \hat \pi _ X}[\ell _ \theta(X)]$.
Here we should be raising an eyebrow. There is typically no point in having the estimator $\theta$ being random, i.e. we typically will prefer to take $\hat \pi _ X$ a point mass rather than anything else. But bear with me for a sec. The cool thing is that if we choose
where $D(\hat \pi _ X \pi) = \int \log \frac{d\hat \pi _ X}{d\pi} \,d\hat \pi _ X$ is the KullbackLeibler divergence, then this distribution will satisfy
That is, Bayesiantype posteriors arise by minimizing the empirical risk of a randomized estimation scheme penalized by the KullbackLeibler divergence form prior to posterior (Zhang, 2006).
For the proof, write
which is also equal to $D(d\hat \pi _ X  e^{\ell _ \theta(X)} d\pi)$ and, by properties of the KullbackLeibler divergence, obviously minimized at $d\hat \pi _ X \propto e^{\ell _ \theta(X)}d\pi(\theta)$.
Is this practically useful and insightful? Possibly. But at least this approach is suited to a general theory, as shown in Zhang (2006) and as I reproduce below.
Let us introduce a Rényitype generalization error defined, for $\alpha \in (0,1)$, by
This is a measure of loss associated with the use of a parameter $\theta$ to fit new data $X’ \sim Q$. We also write
for the expected Rényi generalization error when using the randomization scheme $\theta \sim \hat \pi _ X$.
In order to get interesting bounds on this generalization error, we can follow the approach of Zhang (2006).
Change of measure inequality
We’ll need the change of measure inequality, which states that for any function $f$ and distributions $\pi$, $\hat \pi$ on $\Theta,$
Indeed, with some sloppyness and Jensen’s inequality we can compute
Generalization error bound
We can now attempt bounding $d _ \alpha(\hat \pi _ X;Q)$. Consider the difference $\Delta _ X (\theta) = d _ \alpha(\theta;Q)  \ell _ \theta(X)$ between the generalization error and the empirical loss corresponding to the use of a fixed parameter $\theta$. Then by the change of measure inequality,
and hence for any $\pi$,
By Markov’s inequality, this implies that $\forall t > 0$,
Rewriting yields
with probability at least $1e^{t}$. To recap: the term $d _ \alpha(\hat \pi _ X;Q)$ is understood as a generalization error, on the right hand side $R _ X(\hat \pi _ X) = \mathbb{E} _ {\theta \sim \hat \pi _ X}[\ell _ \theta(X)]$ is the empirical risk, the KullbackLeibler divergence $D(\hat \pi _ X\pi)$ penalizes the complexity of $\hat\pi _ X$ seen as a divergence from a “prior” $\pi$, and $t$ is a tuning parameter.
3. Online learning, regret and KullbackLeibler divergence
Following Barron (1998), suppose we sequentially observe data points $X _ 1, X _ 2, X _ 3, \dots$ which are say i.i.d. with common distribution $Q$ with density $q$. At each time step $n$, the goal is to predict $X _ {n+1}$ using the data $X^n = (X _ 1, \dots, X _ n)$. Our prediction is not a point estimate of $X _ {n+1}$, but somewhat similarly as in the randomized estimation scenario we output a density estimate $\hat p _ n = p(\cdot \mid X^n)$, the goal being that $p(X _ {n+1}\mid X^n)$ be as large as possible. A bit more precisely, we individually score a density estimate $\hat p _ n$ through the risk $\ell _ q(\hat p _ n) = \mathbb{E} _ {X _ {n+1}\sim q}[\log(q(X _ {n+1})/\hat p _ n(X _ {n+1} ))] = D(q \hat p _ n)$ which is the KullbackLeibler divergence between $\hat p _ n$ and $q$. The regret over times $n=1, 2,\dots, N$ is the sum of the risk over the whole process, i.e.
Formally, this process is equivalent to estimating the distribution of $X^N$ all at once: our density estimate $\hat p^N$ of $X^N$ would simply be
and the regret is, by the chain rule, simply $D(q^N  \hat p^N)$, where $q^N$ is the $N$th independent product of $q$.
Given a prior $\pi$ over a space of distributions for $q$, our problem then to minimize the Bayes risk
This is achieved by choosing $\hat p^N(x) = \hat p _ \pi^N(x) = \int q^N(x) \pi(dq)$ the prior predictive density. This is equivalent to using, at each time step $n$, the poterior predictive density $\hat p _ {n, \pi}(x) = \int q(x) \,\pi(dq\mid {X _ i} _ {i=1}^n)$.
To see this minimizing property of the Bayes average, it suffices to write
Note that an consequence of this analysis is also that the posterior predictive distribution $\hat p _ {n, \pi}$ will minimize the expected posterior risk:
Following section 1, this furthermore means that the posterior predictive distribution minimizes the Bayes risk associated with the KullbackLeibler loss.