Olivier Binette: Global bounds for the Jensen functional

Given a convex function \(\varphi : \mathbb{R} \rightarrow \mathbb{R}\) and \(X\) a random variable on \(\mathbb{R}\), the Jensen functional of \(\varphi\) and \(X\) is defined as

\[ \mathcal{J}(\varphi, X) = \mathbb{E}[\varphi(X)] -\varphi(\mathbb{E}[X]).\tag{1} \]

The well-known Jensen inequality states that \(\mathcal{J}(\varphi, X) \geq 0\). For instance, if \(\varphi(x) = x^2\), then \(\mathcal{J}(\varphi, X) = \text{Var}(X) \geq 0\). If \(\mu\) and \(\nu\) are two probability measures, \(X \sim \nu\) and \(\varphi\) is convex with \(\varphi(1) = 0\), then \(\mathcal{J}(\varphi, \tfrac{d\mu}{d\nu}(X)) =: D_\varphi(\mu, \nu)\) is a so-called \(f\)-divergence between probability measures such as the total variation distance, the Kullback-Leibler divergence, the \(\chi^2\) divergence, etc.

If \(X\) is bounded, then a converse to the Jensen inequality can be easily obtained as follows: let \(m\) and \(M\) be the infimum and maximum of \(X\), and write \(X = \alpha m + (1-\alpha)M\) for some random variable \(\alpha\) taking values in \([0,1]\). Then \(\mathbb{E}[\alpha] = (M - \mu)/(M-m)\) and consequently with \(\mu:= \mathbb{E}[X]\),

\[ \mathcal{J}(\varphi, X) \leq \mathbb{E}[\alpha\phi(m) + (1-\alpha)\phi(M)] - \varphi(\mu)\\ \qquad= \frac{(M-\mu)\varphi(m) + (\mu-m)\varphi(M)}{M-m}- \varphi(\mu).\tag{2} \]

When \(\mu\) is unknown in practice, then maximizing the above over all possibilities is the bound \[ \mathcal{J}(\varphi, X) \leq \max_{p \in [0,1]} \left\{p\varphi(m) + (1-p)\varphi(M) - \varphi(pm + (1-p) M)\right\}\tag{3} \]

which is Theorem C in Simic (2011).

Some examples

Variance bound. Consider for example the case where \(\varphi(x) = x^2\), so that \(\mathcal{J}(\varphi, X) = \text{Var}(X)\). Then for \(X\) taking values in say \([0,1]\), the above bounds read as

\[ \text{Var}(X) \leq \mu(1-\mu) \leq 1/4 \]

which is a well-known elementary result.

\(f\)-divergence bounds. In (Binette, 2019), I show how we can use similar ideas to get best-possible reverse Pinsker inequalities: upper bounds on \(f\)-divergences in terms of the total variation distance and likelihood ratio extremums. In particular, with \(D(\mu\|\nu) = \int \log\left(\frac{d\mu}{d\nu}\right) d\mu\) the Kullback-Leibler divergence between the probability measures \(\mu\) and \(\nu\), we find that if \(a = \inf \frac{d\nu}{d\mu}\) and \(b = \sup \frac{d\nu}{d\mu}\), then \[ D(\mu|\nu) \leq \sup_A|\mu(A) - \nu(A)| \left(\frac{\log(a)}{a-1} +\frac{\log(b)}{1-b}\right). \]

Applying again the Jensen functional bound to \(\sup_A \lvert \mu(A)-\nu(A) \rvert = \frac{1}{2}\int\left \lvert \frac{d\mu}{d\nu} - 1\right \rvert d\nu\), we obtain

\[ \sup_A|\mu(A) - \nu(A)| \leq \frac{(M-1)(1-m)}{M-m} \]

and this implies the range of values theorem

\[ D(\mu|\nu) \leq \frac{(a-1)\log(b) + (1-b)\log(a)}{b-a}. \]

Variations

In cases where \(\mu\) is unknown and optimizing over all possibilities is not quite feasible, we can use the following trick.

Let \(f(x) = x\varphi(m) + (1-x)\varphi(M) - \varphi(x m +(1-x)M)\) be the term involved in the maximization step of \((3)\). Then \(f\) is concave with \(f(0) = f(1) = 0\), and hence for any \(p \in (0,1)\) we have that

\[ \max_{x \in [0,1]} f(x) \leq (\min\{p, 1-p\})^{-1}f\left(pm +(1-p)M\right). \]

In particular, taking \(p = 1/ 2\), we obtain the result of Simic (2008) stating that

\[ \mathcal{J}(\varphi, X) \leq \varphi(m) + \varphi(M) - 2\varphi\left(\frac{m+M}{2}\right). \]

When \(\varphi\) is differentiable (this assumption is not strictly necessary but it facilitate the statements), then we can use the concavity of \(f\) (using the fact that \(f(0) = f(1) = 0\)) to very easily obtain

\[ \mathcal{J}(\varphi, X) \leq \frac{f'(1)f'(0)}{f'(1)-f'(0)} \leq \frac{1}{4}(f'(0)-f'(1)) = \frac{1}{4}(M-m)(\varphi'(M) - \varphi'(m)) \]

which is an inequality attributed to S.S. Dragomir (1999), although I haven’t managed to find the original paper yet.