/ / / / /

# Idea to approximate Gibbs posterior means

Consider the statistical learning framework where we have data $X\sim Q$ for some unknown distribution $Q$, a model $\Theta$ and a loss function $\ell_\theta(X)$ measuring a cost associated with fitting the data $X$ using a particular $\theta\in\Theta$. Our goal is to use the data to learn about parameters which minimize the risk $R(\theta) = \mathbb{E}[\ell_\theta(X)]$. Here are two standard examples.

Density estimation. Suppose we observe independent random variables $X_1, X_2, \dots, X_n$. Here the model $\Theta$ parametrizes a set $\mathcal{M} = {p_\theta : \theta \in \Theta }$ of probability density functions (with respect to some dominating measure on the sample space), and our loss for $X = (X_1, \dots, X_n)$ is defined as $\ell_\theta(X) = - \sum_{i=1}^n \log p_\theta(X_i).$ If, for instance, the variables $X_i$ are independent with common distribution with density function $p_{\theta_0}$ for some $\theta_0 \in \mathbb{\Theta}$, then it follows from the positivity of the Kullback-Leibler divergence that $\theta_0 \in \arg\min _ \theta \mathbb{E}[\ell _ \theta(X)]$. That is, under identifiability conditions, our learning target is the true data-generating distribution.

If the model is misspecified, roughly meaning that there is no $\theta_0\in \Theta$ such that $p_{\theta_0}$ is a density of $X_i$, then our framework sets up the learning problem to be about the parameter $\theta_0$ which is such that $p_{\theta_0}$ mininizes the Kullback-Leibler divergence between $p_{\theta_0}$ and the true marginal distribution of the $X_i$’s.

Regression. Here our observations take the form $(Y_i, X_i)$, the model $\Theta$ parameterizes regression functions $f_\theta$ and we can consider a sum of squared errors loss $\ell_\theta(X) = \sum_{i=1}^n(Y_i - f_\theta(X_i))^2.$

### Gibbs posterior distributions

Gibbs Learning approaches this problem from a pseudo Bayesian point of view. While typically a Bayesian approach would require the specification of a full data-generating model, here we replace the likelihood function by the pseudo-likelihood function $\theta \mapsto e^{-\ell_\theta(X)}$. Given a prior $\pi$ on $\Theta$, the Gibbs posterior distribution is then given by $\pi(\theta \mid X) \propto e^{-\ell_\theta(X)} \pi(\theta)$ and satisfies $\pi(\cdot \mid X) \in \text{argmin}_{\hat \pi} \left\{ \mathbb{E}_{\theta \sim \hat \pi}[\ell_\theta(X)] + D_{\text{KL}}(\hat \pi \mid \pi) \right\}$ whenever these expressions are well defined.

In the context of integrable pseudo-likelihoods, the above can be re-interpreted as a regular posterior distributions built from density functions $f _ \theta(x) \propto e^{-\ell _ \theta(x)}$ and with a prior $\tilde \pi$ satisfying $\frac{d\tilde \pi}{d\pi}(\theta) \propto \int e^{-\ell_\theta(x)}\,dx =: c(\theta).$ However, the reason we cannot apply standard asymptotic theory to the analysis of Gibbs posterior is that the quantity $c(\theta)$ will typically be sample-size dependent. That is, if $X=X^n=(X_1, X_2, \dots, X_n)$ for i.i.d. random variables $X_i$ and if the loss $\ell_\theta$ separates as the sum $\ell_\theta(X) = \sum_{i=1}^nl_{\theta}(X_i),$ then $c(\theta) = \left(\int e^{-l_\theta(x_1)} \, dx_1\right)^n$. This data-dependent prior, tilting $\pi$ by the function $c(\theta)^n$, is what allows Gibbs learning to target general risk-minimizing parameters rather than likelihood Kullback-Leibler minimizers.

Some of my ongoing research, presented as a poster at the O’Bayes conference in Warwick last summer, focused on understand the theoretical behaviour of Gibbs posterior distributions. I studied the posterior convergence and finite sample concentration properties of Gibbs posterior distributions under the large sample regime with additive losses $\ell_\theta^{(n)}(X_1, \dots, X_n) = \sum_{i=1}^n\ell_\theta(X_i)$. I’ve attached the poster (joint work with Yu Luo) below and you can find the additional references here.

Note that this is very preliminary work. We’re still in the process of exploring interesting directions (and I have very limited time this semester with the beginning of my PhD at Duke).

### Interpolating between Empirical risk minimization and model averages

An important (both useful and somewhat annoying) feature of Gibbs posterior distributions is that they depend on the scale at which the losses are specified. That is, if considering the alternative losses $\tilde \ell_\theta(X) = \eta \ell_\theta(X)$ for some $\eta > 0$ yields a completely different Gibbs posterior distribution. As $\eta \rightarrow \infty$, we can expect the posterior distribution to converge to the empirical risk minimizer (if it exists), whereas for small $\eta$, the posterior distribution should be more vague. Choosing this scale is an open problem for which a few interesting solutions have been recently proposed.

Ok, enough background material. Let’s get to the meat of this post.

## Approximating Gibbs posterior mean computation (an early idea)

I think (even though I have no precise idea yet) that there is an interesting relationship between the the choice of the scale of the loss functions previously mentionned, the process of iterating Gibbs posterior computation and the gradient descent algorithm used to compute empirical risk minimizers.

What I mean by iterating Gibbs posterior computation is the following: we start with a prior $\pi = \pi_0$, compute the posterior $\pi_1 = \pi(\cdot \mid X)$, then compute the posterior $\pi_2$ obtained by using the same data and by using $\pi_1$ as a prior, and etc. I’ll refer to this as iterated Gibbs. I suspect (although I haven’t checked anything) that this sequence $\pi_n$ converges to a point mass at the empirical risk minimizer, under some assumptions.

This idea came from a talk by Joan Bruna I attended last Wednesday at Duke. They used something similar to this iterated Gibbs method as a “proximal update” step in a gradient descent algorithm, although I’ll have to look more deeply into their results to understand everything they’ve done (the talk was in a statistical physics framework which I’m less familiar with).

The cool thing would be if we were able to use only a few steps of iterated Gibbs, with scaled down loss functions, as an approximation to the Gibbs posterior distribution we care about. Each of these iterated Gibbs step could further be approximated by a few steps of gradient descent initialized following the relevant “prior” distributions. This could be interpreted as (very roughly) approximate Bayes while being very fast and providing some of the benefits of model averaging and incorporating prior information.

This is something I’ll try to look into. I’ll update this post as I either make progress or discover how this doesn’t work.

# Climate Strike

I went to the “climate strike” in Raleigh this afternoon, an event which is part of this week’s protest of political inaction on climate change (or of denial of climate change). It was pretty small in Raleigh, but still interesting to hear from people with all kind of backgrounds and points of views. I also had the opportunity to write to North Carolina’s governor Roy Cooper about this pressing issue. It’s a rather personal letter, written quickly and without a deep knowledge of the topic. But I thought I’d share it here as it, I believe, carries common sentiments on this issue.

# The credibility of confidence sets

Andrew Gelman and Sander Greenman went “head to head” in a discussion on the interpretation of confidence intervals in The BMJ. Greenman stated the following, which doesn’t seem quite right to me.

The label “95% confidence interval” evokes the idea that we should invest the interval with 95/5 (19:1) betting odds that the observed interval contains the true value (which would make the confidence interval a 95% bayesian posterior interval$^{11}$). This view may be harmless in a perfect randomized experiment with no background information to inform the bet (the original setting for the “confidence” concept); more often, however […]

It’s not true that “this view may is harmless in perfect randomized experiments”, and I’m not sure where this “original setting of the confidence concept” is coming from. In fact, even in the simplest possible cases, the posterior probability of a $95\%$ confidence interval can be pretty much anything.

Imagine a “perfect randomized experiment”, where we use a test of the hypothesis $H_0: \mu = 0$ for which, for some reason, has zero power. If $p < 0.05$, meaning that the associated confidence interval excludes $0$, then we are certain that $H_0$ holds and the posterior probability of the confidence interval is zero.

Let this sink in. For some (albeit trivial) statistical tests, observing $p < 0.05$ brings evidence in favor of the null.

The power of the test carries information, and the posterior probability of a confidence interval (or of an hypothesis), depends on this power among other things, even in perfect randomized experiments.

# Bayesian Optimalities

I’m sometimes asked in conferences and meetings around Montreal: why Bayes?

Given a convex function $\varphi : \mathbb{R} \rightarrow \mathbb{R}$ and $X$ a random variable on $\mathbb{R}$, the Jensen functional of $\varphi$ and $X$ is defined as
The well-known Jensen inequality states that $\mathcal{J}(\varphi, X) \geq 0$. For instance, if $\varphi(x) = x^2$, then $\mathcal{J}(\varphi, X) = \text{Var}(X) \geq 0$. If $\mu$ and $\nu$ are two probability measures, $X \sim \nu$ and $\varphi$ is convex with $\varphi(1) = 0$, then $\mathcal{J}(\varphi, \tfrac{d\mu}{d\nu}(X)) =: D_\varphi(\mu, \nu)$ is a so-called $f$-divergence between probability measures such as the total variation distance, the Kullback-Leibler divergence, the $\chi^2$ divergence, etc.
If $X$ is bounded, then a converse to the Jensen inequality can be easily obtained as follows: let $m$ and $M$ be the infimum and maximum of $X$, and write $X = \alpha m + (1-\alpha)M$ for some random variable $\alpha$ taking values in $[0,1]$. Then $\mathbb{E}[\alpha] = (M - \mu)/(M-m)$ and consequently with $\mu:= \mathbb{E}[X]$,