Updating a 250-year-old theorem for the 21st century
If you are reading this, you probably know what Bayes’ theorem is. Here, we are concerned with the use of Bayes’ theorem to perform inference for parameters of a statistical model (a.k.a. Bayesian inference).
Recently, several generalizations of Bayesian inference have been proposed. The aim of this post is to survey some of these generalizations and give an overview of the pros and cons of each with respect to the original Bayes’ theorem. Hopefully, I’ll be able to bring some order to this rapidly expanding literature and provide some intuition on why we need to move beyond the original Bayes’ theorem.
TL;DR: extensions of Bayesian inference work better in some cases which the original Bayes’ theorem does not contemplate (primarily, model misspecification), and come with stronger justification.
Let’s first have a look at Bayes’ original theorem:

\[\pi(\theta\vert x) = \frac{\pi(\theta)\, p(x\vert \theta)}{p(x)}.\]
Here, I am using \(\theta\) to denote a parameter on which we want to perform inference, while \(x\) is the data that we observe.
As the standard textbook description of Bayesian inference says, Bayes’ theorem provides a posterior belief \(\pi(\theta\vert x)\) by updating the prior belief \(\pi(\theta)\) with the information on \(\theta\) carried by \(x\). Crucially, all information on \(\theta\) contained in \(x\) is represented by the likelihood \(p(x\vert \theta)\), the likelihood being the probabilistic model from which we believe \(x\) was generated. This is the gist of Bayesian inference.
Actually, the denominator in the definition of the posterior is independent of \(\theta\). Exploiting this fact, it is common to write:
\[\pi(\theta\vert x) \propto \pi(\theta) p(x\vert \theta),\]where the proportionality sign refers to both sides of the equation being considered as functions of \(\theta\).
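To make this concrete, here is a minimal numerical sketch (the Beta prior and the coin-flip data are hypothetical choices of mine): the posterior is computed on a grid of \(\theta\) values by multiplying prior and likelihood pointwise, and renormalizing only at the end.

```python
import numpy as np
from scipy import stats

# Hypothetical setup: Bernoulli likelihood with a Beta(2, 2) prior.
x = np.array([1, 0, 1, 1, 0, 1])                 # observed coin flips
theta = np.linspace(0.001, 0.999, 999)           # grid of parameter values

prior = stats.beta.pdf(theta, 2, 2)              # pi(theta)
likelihood = theta ** x.sum() * (1 - theta) ** (len(x) - x.sum())  # p(x|theta)

# The product determines the shape of the posterior; the denominator p(x)
# is just the constant that makes it integrate to one.
unnormalized = prior * likelihood
posterior = unnormalized / (unnormalized.sum() * (theta[1] - theta[0]))

print(posterior.sum() * (theta[1] - theta[0]))   # ~1.0
```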
Additionally, Bayes’ theorem is very modular, allowing us to sequentially incorporate new information into our belief: say that we previously had data \(x_1\), from which we obtained \(\pi(\theta\vert x_1)\). If we now get an independent observation \(x_2\), we can apply Bayes’ theorem again by using \(\pi(\theta\vert x_1)\) as a prior and get:
\[\pi(\theta\vert x_1, x_2) \propto \pi(\theta\vert x_1) p(x_2\vert \theta).\]So, if we get \(n\) independent observations, say \(\mathbf {x} = (x_1, x_2, \ldots, x_n)\), the posterior belief is:
\[\pi(\theta\vert \mathbf x) = \frac{\pi(\theta) \prod_{i=1}^n p(x_i\vert \theta)}{p(\mathbf x)} \propto \pi(\theta) \prod_{i=1}^n p(x_i\vert \theta).\]The posterior obtained from Bayes’ theorem satisfies some nice properties; first, there are properties that are intrinsic to Bayes’ update rule:
- given a statistical model, all the evidence in a sample relevant to the model parameters is contained in the likelihood function (this is the likelihood principle).
If you further assume that the observations \(\mathbf x\) are generated from \(p(\cdot\vert\theta_0)\) for some parameter value \(\theta_0\) (i.e., the model is well specified), Bayes’ posterior satisfies some additional properties:
- according to information-theoretic arguments in Zellner, Bayes’ theorem is an optimal information processing rule, in the sense that no information is lost in going from the prior and the likelihood to the posterior.
- an analogue of the Central Limit Theorem in frequentist statistics applies to Bayes’ posterior, called the Bernstein-von Mises theorem. It goes like this: as the number of observations \(n\) goes to infinity, the posterior converges to a normal distribution centered at \(\theta_0\), whose variance decreases as \(1/n\) and is asymptotically equivalent to the sampling variance of the maximum likelihood estimator of \(\theta\).
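As a quick numerical check of this behavior, here is a sketch in the conjugate Beta-Bernoulli model (a toy setup of my choosing), where the posterior is available in closed form and its standard deviation can be compared with the asymptotic \(1/\sqrt{n}\) prediction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta_0, a, b = 0.3, 2.0, 2.0           # true parameter and Beta(a, b) prior

for n in [10, 100, 1000, 10000]:
    k = rng.binomial(n, theta_0)         # successes out of n Bernoulli draws
    post = stats.beta(a + k, b + n - k)  # conjugate posterior
    bvm_sd = np.sqrt(theta_0 * (1 - theta_0) / n)  # asymptotic (MLE) sd
    # The posterior mean approaches theta_0 and the posterior sd matches
    # the Bernstein-von Mises prediction as n grows.
    print(n, round(post.mean(), 3), round(post.std(), 4), round(bvm_sd, 4))
```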
To recap what we said above, the main motivation behind Bayes’ theorem is that, if the model \(p(x\vert \theta)\) is well specified, the posterior distribution is a coherent way to learn about the true parameter value (to which it converges in the limit of infinite data); additionally, that is the best possible way to process information even with a finite amount of observations (by Zellner’s argument).
However, these arguments falter when the model is not an accurate representation of the distribution \(g(\cdot)\) from which observations were generated (or data-generating process, DGP). By this, we mean that there exists no parameter value such that \(p(\cdot\vert\theta)=g(\cdot)\). If that is the case, two things happen: Zellner’s optimality argument no longer applies, and the Bernstein-von Mises theorem no longer guarantees convergence to a true parameter value.
More generally, we need to ask what the aim of Bayesian inference is in such a misspecified setting. In fact, with Bayesian inference we do not learn about the true parameter value anymore, as no such thing exists. Rather, the standard Bayes’ posterior learns about the parameter value \(\theta^\star\) for which the misspecified model is as close as possible to the DGP, in the specific sense of the KL divergence.
That may still be what you want to do in some cases, but I argue below that the KL divergence may behave poorly in some misspecified settings; instead, some generalized Bayesian approaches allow the user to choose the way in which the probabilistic model approximates the DGP (for instance, by replacing the KL with other divergences).
Learning \(\theta^\star\) according to the KL divergence may not behave well when the model is misspecified; for instance, consider the following DGP:
\[g(x) = 0.1\ h(x) + 0.9\ p(x|\theta_0);\]this means that 90% of observations are generated from the model for a given parameter value \(\theta_0\), while the other 10% are generated from the distribution \(h\). If \(h\) has heavier tails than \(p\), \(\theta^\star\) can be far away from \(\theta_0\), as the KL divergence gives large importance to the tail behavior of the distributions.
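Here is a numerical sketch of this phenomenon, with \(p(x\vert\theta)\) a Gaussian model with mean and scale parameters, and \(h\) a heavy-tailed Student-t (both choices are mine, for illustration); for a Gaussian model the maximum likelihood estimate, given by the sample mean and standard deviation, approximates the KL minimizer \(\theta^\star\) when \(n\) is large.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
from_model = rng.normal(0.0, 1.0, size=n)      # draws from p(.|theta_0)
from_h = 5.0 * rng.standard_t(df=3, size=n)    # heavier tails than the model
x = np.where(rng.random(n) < 0.1, from_h, from_model)

# The 10% heavy-tailed observations inflate the fitted scale far beyond the
# model value sigma_0 = 1: the KL divergence is very sensitive to the tails.
print(x.mean(), x.std())  # mean ~0.0, scale ~2.9 instead of 1.0
```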
With more general misspecifications, things may go wrong in different ways.
Finally, Bayesian inference implicitly assumes that the prior is a good representation of previous knowledge, and that enough computational power is available to sample from the posterior (which can be very expensive in some cases).
Some of the generalized Bayesian approaches
In the following, I review extensions of Bayes’ theorem which are better justified and may perform better in a misspecified setting (or can even be used without a probabilistic model); some will also directly tackle the issues regarding the prior and computational power.
Across these works, a recurrent underlying question is: what is the actual aim of inference when we cannot specify the model correctly?
Disclaimer: this overview is non-exhaustive and strongly biased by the papers I’ve read and by my personal research activity.
A first idea to tackle (mild) model misspecification is to reduce the importance of the likelihood term in the definition of Bayes’ posterior. This can be done by raising the likelihood function to a power \(\alpha \in [0,1]\), where \(\alpha=1\) recovers Bayes’ posterior and \(\alpha=0\) corresponds to sticking to the prior. The resulting posterior is therefore:
\[\pi_\alpha(\theta\vert x) \propto \pi(\theta) p(x\vert \theta)^\alpha.\]This idea, which yields the so-called power likelihood posterior, has been discussed in detail in the literature.
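In conjugate models the power posterior stays tractable. Here is a minimal sketch in a Beta-Bernoulli setup of my choosing: tempering the Bernoulli likelihood simply discounts the effective sample size, so smaller \(\alpha\) gives a wider posterior, i.e. slower learning.

```python
import numpy as np
from scipy import stats

a, b = 2.0, 2.0    # Beta(a, b) prior
n, k = 100, 62     # n coin flips, k successes

for alpha in [1.0, 0.5, 0.1, 0.0]:
    # p(x|theta)^alpha = theta^(alpha*k) * (1-theta)^(alpha*(n-k)), so the
    # power posterior is Beta(a + alpha*k, b + alpha*(n - k)).
    post = stats.beta(a + alpha * k, b + alpha * (n - k))
    print(alpha, round(post.mean(), 3), round(post.std(), 4))
```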
With respect to the standard Bayes’ posterior, this strategy does not change the parameter value on which learning is performed; it only changes the speed of learning (i.e., the rate of concentration of the posterior distribution with an increasing number of observations). Also, it still satisfies the likelihood principle and coherence.
A larger leap is taken in the general framework of Bissiri, Holmes and Walker, where the negative log-likelihood is replaced by a generic loss function \(\ell(\theta, x)\), leading to the update:
\begin{equation}\label{Eq:bissiri} \pi_{\ell,w}(\theta\vert x) \propto \pi(\theta) \exp\left\{ - w \, \ell (\theta,x)\right\}, \end{equation}
where \(w\) is a weight which balances the influence of the loss and prior terms. With this update, as argued in Bissiri, Holmes and Walker’s work, inference targets the parameter value minimizing the expected loss:

\[\theta^\star_\ell = \underset{\theta}{\operatorname{argmin}}\ \mathbb{E}_{x \sim g}\left[\ell(\theta, x)\right],\]

where \(g\) is the data-generating process.
In the same work, this update is derived as the unique coherent way of turning a prior belief into a posterior one when the information in the data is expressed through a loss function.
An advantage of this approach is that you do not need to specify a probabilistic model (i.e., a likelihood) here; inference can be performed on “parameter” values that are solely defined through the loss function.
Of course, the power likelihood posterior is obtained as a special case by setting \(\ell(\theta, x) = - \log p(x\vert \theta)\), and \(w=\alpha\); further setting \(w=1\) gives back the original posterior.
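As an illustration, here is a sketch of sampling from the posterior in Eq. \eqref{Eq:bissiri} with a random-walk Metropolis algorithm, using the absolute loss \(\ell(\theta, x) = \vert x - \theta\vert\) (a standard choice, under which the target is the median of the DGP); the Gaussian prior, the value \(w = 1\) and the synthetic data are my own choices.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(3.0, 1.0, size=200)   # data; no likelihood is specified below
w = 1.0

def log_target(theta):
    log_prior = -0.5 * (theta / 10.0) ** 2          # N(0, 10^2) prior
    return log_prior - w * np.abs(x - theta).sum()  # minus w * total loss

theta, samples = 0.0, []
for _ in range(20_000):
    proposal = theta + 0.2 * rng.normal()           # random-walk proposal
    if np.log(rng.random()) < log_target(proposal) - log_target(theta):
        theta = proposal
    samples.append(theta)

print(np.mean(samples[5_000:]))  # concentrates around the median of the DGP
```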
However, with a generic loss, setting the value of \(w\) is an arbitrary choice: the larger \(w\), the faster the concentration of the posterior as the number of observations increases. Several strategies to set \(w\) have been proposed in the literature.
Finally, notice that a Bernstein-von Mises result still holds for this more general posterior (of course, under some regularity conditions); some very general and applicable formulations are given in the literature.
So, the loss-based approach in Eq. \eqref{Eq:bissiri} works without a probabilistic model. But what if you have a misspecified model which you believe carries some meaning about the process you are studying?
As mentioned above, using the original Bayes’ posterior may not be a wise choice. In order to perform inference in a sound way, an idea is to use the loss-based approach and express the loss \(\ell\) as a function of the probabilistic model:
\[\ell(\theta, x) = S(p(\cdot|\theta), x),\]where \(S\) is a function of a probability distribution and an observation, which is usually called a scoring rule in the literature.
This approach has been investigated in some recent works. With this choice of loss, the generalized posterior targets the parameter value minimizing the expected scoring rule:

\[\theta^\star_S = \underset{\theta}{\operatorname{argmin}}\ \mathbb{E}_{x \sim g}\left[S(p(\cdot\vert\theta), x)\right].\]
For some scoring rules \(S\), minimizing the above expectation corresponds to minimizing a statistical divergence between the model and the DGP (for instance, the energy distance or the MMD), so that the generalized posterior learns the model which is closest to the DGP in that divergence.
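For intuition, here is a Monte Carlo sketch of the energy score in one dimension (the Gaussian model and the numbers are mine, for illustration); notice that it only requires samples from the model, not density evaluations:

```python
import numpy as np

rng = np.random.default_rng(3)

def energy_score(model_draws, x):
    """S(P, x) = E||X - x|| - 0.5 E||X - X'||, estimated from draws of P."""
    m = len(model_draws)
    first = np.abs(model_draws - x).mean()
    pair_diffs = np.abs(model_draws[:, None] - model_draws[None, :])
    second = 0.5 * pair_diffs.sum() / (m * (m - 1))  # off-diagonal average
    return first - second

# Hypothetical use: score a N(theta, 1) model against an observation x = 0.2.
# The better-matching theta gets the lower (= better) score.
for theta in [0.0, 3.0]:
    draws = rng.normal(theta, 1.0, size=500)
    print(theta, round(energy_score(draws, 0.2), 3))
```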
Clearly, this still gives a coherent update, but the likelihood principle is not satisfied anymore (as we use the whole model \(p(\cdot\vert\theta)\), and do not just evaluate its density at the observation); however, the likelihood principle itself does not seem very reasonable if the model is misspecified in the first place.
One more step towards generality and we find the Generalized Variational Inference approach of Knoblauch, Jewson and Damoulas.
The idea is to start from the variational formulation of Bayes’ posterior, which is attributed to Donsker and Varadhan:

\[\pi(\cdot\vert \mathbf x) = \underset{q \in \mathcal P (\Theta)}{\operatorname{argmin}}\left\{\mathbb{E}_{q(\theta)}\left[- \sum_{i=1}^{n} \log p(x_i\vert \theta) \right]+D_{\text{KL}}(q \| \pi)\right\},\]
where \(\Theta\) denotes the values that \(\theta\) can take, \(\mathcal P (\Theta)\) is the set of all probability distributions on \(\Theta\), and \(D_{\text{KL}}(q \| \pi)\) is the KL divergence between the distribution \(q\) and the prior \(\pi\).
This formulation leads to an optimization-centric view of Bayesian inference, as the authors of the paper argue.
This formulation opens the door to additional extensions, obtained by changing \(D_{\text{KL}}\) in the optimization objective above to a different divergence, say \(D\). Further, you can also restrict the minimization problem to a constrained space of distributions \(\mathcal \Pi \subseteq \mathcal P(\Theta)\). The resulting update rule is termed the Rule of Three (RoT), as you obtain a posterior distribution by specifying the loss \(\ell\), the space of distributions \(\mathcal \Pi\), and the divergence \(D\):
\[\pi_{\ell, D, \mathcal \Pi}(\cdot|\mathbf x) = \underset{q \in \mathcal \Pi }{\operatorname{argmin}}\left\{\mathbb{E}_{q(\theta)}\left[w \sum_{i=1}^{n} \ell(\theta, x_i) \right]+D(q \| \pi)\right\} \stackrel{\text{def}}{=} P(\ell, D, \Pi).\]However, it is in general impossible to obtain a closed form solution for the above problem. Additionally, you do not have a coherent update anymore.
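To make the RoT concrete, here is a minimal sketch with the three ingredients chosen by me: the squared loss \(\ell(\theta, x) = \frac{1}{2}(x-\theta)^2\), \(D = D_{\text{KL}}\), and \(\mathcal \Pi = \{N(m, s^2)\}\). With a \(N(0,1)\) prior, both terms of the objective are available in closed form, so \((m, s)\) can be found by plain gradient descent; in this conjugate toy case \(\mathcal \Pi\) actually contains the exact Bayes posterior, so the RoT recovers it, while changing any of \(\ell\), \(D\) or \(\mathcal \Pi\) would change the target.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(2.0, 1.0, size=50)
n, w = len(x), 1.0

# Objective: E_q[w * sum_i 0.5*(x_i - theta)^2] + KL(N(m, s^2) || N(0, 1)),
# with q = N(m, s^2). Both terms (and their gradients) are closed-form here.
m, s, lr = 0.0, 1.0, 0.005
for _ in range(2_000):
    grad_m = w * (n * m - x.sum()) + m    # d(expected loss)/dm + d(KL)/dm
    grad_s = w * n * s + s - 1.0 / s      # d(expected loss)/ds + d(KL)/ds
    m, s = m - lr * grad_m, s - lr * grad_s

# Matches the closed-form optimum: m* = w*sum(x)/(w*n+1), s*^2 = 1/(w*n+1).
print(round(m, 3), round(s, 3))
```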
In the paper, standard variational inference is indeed recovered as a special case of the RoT, by keeping the negative log-likelihood loss and the KL divergence while restricting \(\mathcal \Pi\) to a parametric family.
The rest of the work discusses in great detail (more than 100 pages!) the properties of the resulting posterior. The key idea is that with this approach you can not only perform inference with a generic loss (as was already possible with the loss-based approach of Bissiri, Holmes and Walker), but also mitigate the influence of a poorly specified prior (through the choice of \(D\)) and account for a limited computational budget (through the choice of \(\mathcal \Pi\)).
They also provide an axiomatic derivation of the RoT; interestingly, among the required axioms is a generalized likelihood principle, which states that all information on \(\theta\) contained in \(\mathbf x\) is accessible through evaluation of the loss \(\ell\) at the datapoints.
I have gone up the ladder of generality, from the standard Bayes’ posterior to the Generalized Variational Inference formulation; at each step, some features of the standard Bayes’ posterior are lost and some others are retained, but a larger set of tools becomes available. Those may work better, for instance, with misspecified models, or when you do not even want to specify a full probabilistic model, or finally when a full Bayesian analysis is too costly and you want to resort to a variational approach instead.
The following Venn diagram represents the relation between the different techniques presented here:
I have followed here one possible generalization route, but that is by no means the only possibility, and my overview does not include all methods which are a superset of the standard Bayes’ posterior; I have, for instance, excluded some other approaches discussed in the literature.
Still, I think the approaches reviewed here are thought-provoking and allow us to see Bayesian inference in a different light. I also feel they bring the original axiomatic theory closer to a pragmatic toolbox.
Of course, I do not think that standard Bayesian inference should now be thrown in the garbage; there is no need to list the practical results it has made possible. Still, it is good to know that there are ways around (or beyond) it in case a standard Bayesian analysis doesn’t work. Additionally, some of the techniques developed to work with standard posteriors (for instance, MCMC and VI) can be ported to generalized posteriors.
I am sure there will be other extensions proposed in the future, so keep an eye out if you are interested!