Proofs from the notebook Olivier's open lab book.

/ / / / /

Idea to approximate Gibbs posterior means

Consider the statistical learning framework where we have data $X\sim Q$ for some unknown distribution $Q$, a model $\Theta$ and a loss function $\ell_\theta(X)$ measuring a cost associated with fitting the data $X$ using a particular $\theta\in\Theta$. Our goal is to use the data to learn about parameters which minimize the risk $R(\theta) = \mathbb{E}[\ell_\theta(X)]$. Here are two standard examples.

Density estimation. Suppose we observe independent random variables $X_1, X_2, \dots, X_n$. Here the model $\Theta$ parametrizes a set $\mathcal{M} = {p_\theta : \theta \in \Theta }$ of probability density functions (with respect to some dominating measure on the sample space), and our loss for $X = (X_1, \dots, X_n)$ is defined as If, for instance, the variables $X_i$ are independent with common distribution with density function $p_{\theta_0}$ for some $\theta_0 \in \mathbb{\Theta}$, then it follows from the positivity of the Kullback-Leibler divergence that $\theta_0 \in \arg\min _ \theta \mathbb{E}[\ell _ \theta(X)]$. That is, under identifiability conditions, our learning target is the true data-generating distribution.

If the model is misspecified, roughly meaning that there is no $\theta_0\in \Theta$ such that $p_{\theta_0}$ is a density of $X_i$, then our framework sets up the learning problem to be about the parameter $\theta_0$ which is such that $p_{\theta_0}$ mininizes the Kullback-Leibler divergence between $p_{\theta_0}$ and the true marginal distribution of the $X_i$’s.

Regression. Here our observations take the form $(Y_i, X_i)$, the model $\Theta$ parameterizes regression functions $f_\theta$ and we can consider a sum of squared errors loss

Gibbs posterior distributions

Gibbs Learning approaches this problem from a pseudo Bayesian point of view. While typically a Bayesian approach would require the specification of a full data-generating model, here we replace the likelihood function by the pseudo-likelihood function $\theta \mapsto e^{-\ell_\theta(X)}$. Given a prior $\pi$ on $\Theta$, the Gibbs posterior distribution is then given by and satisfies whenever these expressions are well defined.

In the context of integrable pseudo-likelihoods, the above can be re-interpreted as a regular posterior distributions built from density functions $f _ \theta(x) \propto e^{-\ell _ \theta(x)}$ and with a prior $\tilde \pi$ satisfying However, the reason we cannot apply standard asymptotic theory to the analysis of Gibbs posterior is that the quantity $c(\theta)$ will typically be sample-size dependent. That is, if $X=X^n=(X_1, X_2, \dots, X_n)$ for i.i.d. random variables $X_i$ and if the loss $\ell_\theta$ separates as the sum then $c(\theta) = \left(\int e^{-l_\theta(x_1)} \, dx_1\right)^n$. This data-dependent prior, tilting $\pi$ by the function $c(\theta)^n$, is what allows Gibbs learning to target general risk-minimizing parameters rather than likelihood Kullback-Leibler minimizers.

Some of my ongoing research, presented as a poster at the O’Bayes conference in Warwick last summer, focused on understand the theoretical behaviour of Gibbs posterior distributions. I studied the posterior convergence and finite sample concentration properties of Gibbs posterior distributions under the large sample regime with additive losses $\ell_\theta^{(n)}(X_1, \dots, X_n) = \sum_{i=1}^n\ell_\theta(X_i)$. I’ve attached the poster (joint work with Yu Luo) below and you can find the additional references here.

Note that this is very preliminary work. We’re still in the process of exploring interesting directions (and I have very limited time this semester with the beginning of my PhD at Duke).

Read more

Climate Strike

I went to the “climate strike” in Raleigh this afternoon, an event which is part of this week’s protest of political inaction on climate change (or of denial of climate change). It was pretty small in Raleigh, but still interesting to hear from people with all kind of backgrounds and points of views. I also had the opportunity to write to North Carolina’s governor Roy Cooper about this pressing issue. It’s a rather personal letter, written quickly and without a deep knowledge of the topic. But I thought I’d share it here as it, I believe, carries common sentiments on this issue.

Read more

The credibility of confidence sets

Andrew Gelman and Sander Greenman went “head to head” in a discussion on the interpretation of confidence intervals in The BMJ. Greenman stated the following, which doesn’t seem quite right to me.

The label “95% confidence interval” evokes the idea that we should invest the interval with 95/5 (19:1) betting odds that the observed interval contains the true value (which would make the confidence interval a 95% bayesian posterior interval$^{11}$). This view may be harmless in a perfect randomized experiment with no background information to inform the bet (the original setting for the “confidence” concept); more often, however […]

It’s not true that “this view may is harmless in perfect randomized experiments”, and I’m not sure where this “original setting of the confidence concept” is coming from. In fact, even in the simplest possible cases, the posterior probability of a $95\%$ confidence interval can be pretty much anything.

Imagine a “perfect randomized experiment”, where we use a test of the hypothesis $H_0: \mu = 0$ for which, for some reason, has zero power. If $p < 0.05$, meaning that the associated confidence interval excludes $0$, then we are certain that $H_0$ holds and the posterior probability of the confidence interval is zero.

Let this sink in. For some (albeit trivial) statistical tests, observing $p < 0.05$ brings evidence in favor of the null.

The power of the test carries information, and the posterior probability of a confidence interval (or of an hypothesis), depends on this power among other things, even in perfect randomized experiments.

Global bounds on the Jensen functional

Given a convex function $\varphi : \mathbb{R} \rightarrow \mathbb{R}$ and $X$ a random variable on $\mathbb{R}$, the Jensen functional of $\varphi$ and $X$ is defined as

The well-known Jensen inequality states that $\mathcal{J}(\varphi, X) \geq 0$. For instance, if $\varphi(x) = x^2$, then $\mathcal{J}(\varphi, X) = \text{Var}(X) \geq 0$. If $\mu$ and $\nu$ are two probability measures, $X \sim \nu$ and $\varphi$ is convex with $\varphi(1) = 0$, then $\mathcal{J}(\varphi, \tfrac{d\mu}{d\nu}(X)) =: D_\varphi(\mu, \nu)$ is a so-called $f$-divergence between probability measures such as the total variation distance, the Kullback-Leibler divergence, the $\chi^2$ divergence, etc.

If $X$ is bounded, then a converse to the Jensen inequality can be easily obtained as follows: let $m$ and $M$ be the infimum and maximum of $X$, and write $X = \alpha m + (1-\alpha)M$ for some random variable $\alpha$ taking values in $[0,1]$. Then $\mathbb{E}[\alpha] = (M - \mu)/(M-m)$ and consequently with $\mu:= \mathbb{E}[X]$,

Read more