A Bayesian Approach to Model Uncertainty

Charalambos Tsangarides
Published Date:
April 2004

I. Introduction

The study of socioeconomic phenomena may be plagued by inconsistent empirical estimates and model uncertainty. Inconsistent estimates typically arise from omitted country-specific effects that, if correlated with other regressors, lead to a misspecification of the underlying dynamic structure, or from endogenous variables that are incorrectly treated as exogenous. A panel data estimator that simultaneously addresses endogeneity and omitted variable bias is the systems Generalized Method of Moments (GMM) estimator, which builds on the GMM framework of Hansen (1982). GMM estimators hold the potential for both consistency and efficiency gains by exploiting additional moment restrictions. The systems GMM involves the estimation of two equations, one in levels and the other in differences. The difference equation, constructed by taking first differences of the levels equation, eliminates the country-specific effect. For both equations, potentially endogenous explanatory variables are instrumented with their own lagged values, which deals with the issue of endogeneity. Because the equations are estimated as a system, the procedure constrains the corresponding coefficients to be equal across equations.2

The case of model uncertainty arises because the lack of clear theoretical guidance on the choice of regressors results in a wide set of possible specifications and, often, contradictory conclusions. The analyst then has three options: (i) arbitrarily select one model as the true model generating the data; (ii) present the results based on all plausible models without selecting between different specifications; and (iii) explicitly account for model uncertainty. While preferable, option (iii) presents enormous challenges at the level of both concept and statistical theory. Option (ii), although unsystematic, is preferable to option (i), but poses substantial logistical challenges. In practice, researchers tend to focus on one “channel” and choose option (i), ignoring model uncertainty altogether and risking overconfident inferences.3 In theory, accounting for model uncertainty requires some version of a “robustness check,” essentially an attempt to account for all possible combinations of predictors. A conceptually attractive solution to the problem of model uncertainty is provided by Bayesian Model Averaging (BMA), although difficulties at the implementation stage sometimes render it impractical.4 In particular, with a large number of regressors, $k^*$, the procedure may be infeasible because of the large number of models to be estimated, $2^{k^*}$. In addition, the researcher is required to specify the prior distributions of all relevant parameters. In practice, most applications of BMA use an arbitrary set of priors, without examining the impact of this choice. Standard Bayesian Model Averaging techniques have been used to investigate growth determinants by Brock and Durlauf (2001), Doppelhofer, Miller, and Sala-i-Martin (2000) in their Bayesian Averaging of Classical Estimates (BACE) approach, and Fernandez, Ley, and Steel (2001).

Taking into consideration the concerns over model uncertainty, this essay develops the theory of a new Limited Information Bayesian Model Averaging estimator (LIBMA). The proposed estimator incorporates a dynamic panel estimator in the context of GMM, and a Bayesian robustness check to explicitly account for model uncertainty in evaluating the results of a universe of models generated by a set of possible regressors. The LIBMA approach provides certain advantages over the existing literature by relaxing the otherwise restrictive underlying assumptions in two ways. First, while standard Bayesian Model Averaging is a full information technique where a complete stochastic specification is assumed, LIBMA is a limited information approach that relies on GMM, a limited information technique based on moment restrictions rather than a complete stochastic specification. Second, while previous literature implicitly assumes exogenous regressors, LIBMA can control for endogeneity through the use of GMM.

The remainder of the paper is organized as follows. Section II introduces some preliminary ideas about the GMM. Section III constructs the GMM estimator in the Bayesian framework and the limited information likelihood. Section IV discusses the concepts of hypothesis testing and model selection in the Bayesian framework, presents the Limited Information Bayesian Information Criterion used in the context of GMM, and completes the derivation of the LIBMA. Section V presents all the calculated quantities and summary statistics on which the robustness analysis is based. The final section concludes.

II. Preliminaries on GMM

The GMM was developed by Hansen (1982) and White (1982) as an extension of the classical method of moments estimator. The basic idea of the GMM is to choose the parameters of the model so as to match the moments of the model to those of the data as closely as possible. A weighting matrix determines the relative importance of matching each moment. Most common estimation procedures are contained in the GMM framework, including ordinary least squares, instrumental variables estimators, and, in some cases, maximum likelihood estimators.

A key advantage to GMM over other estimation procedures is that there is no need to specify a likelihood function. The method of moments (and by extension, GMM) does not require the complete specification of distributions. Given that economic models do not specify joint distributions of economic variables, the method of moments (as well as other limited information inference methods) becomes very appealing in empirical studies. Of course, nothing comes for free. The cost is a loss of efficiency over methods such as Maximum Likelihood (MLE). The MLE can be viewed as a limiting case of GMM where under MLE the distribution of errors is specified (so in a sense all of the moments are incorporated). The trouble with MLE is often that the errors may not follow a known distribution (such as the normal which is almost the universal standard in MLE).5 Thus, GMM offers a compromise between the efficiency of MLE and robustness to deviations from normality (or other distributional forms).

This section follows the presentation in Kim (2000, 2002) to introduce the GMM concepts. Let $x_t$ be an $n \times 1$ vector of stochastic processes defined on a probability space $(\Omega, \mathcal{F}, P)$. Denote by $X_T(\omega) = (x_1(\omega), \ldots, x_T(\omega))$, for $\omega \in \Omega$, a $T$–segment of a particular realization of $\{x_t\}$. Let $\theta$ be a $q \times 1$ vector of parameters from $\Theta \subset \mathbb{R}^q$. Let $\mathcal{G}$ be the Borel σ–algebra of $\Theta$, so that $(\Theta, \mathcal{G})$ is a measurable space.6 In this paper $\Theta$ is a “grand” parameter space on which all the likelihoods, priors and posteriors under consideration are defined.

Let $h(x_t, \theta)$ be an $r \times 1$ vector-valued function, $h: \mathbb{R}^n \times \mathbb{R}^q \to \mathbb{R}^r$. The function $h(x_t, \theta)$ characterizes an econometric relation $h(x_t, \theta_0) = w_t$ for a $\theta_0 \in \Theta$, where $w_t$ is an $r$-vector stochastic disturbance process satisfying the standard GMM conditions of Hansen (1982).

Assumption (A1)

{wt, −∞ < t < ∞} is stationary and ergodic.

Assumption (A2)

(a) EP[wtwt′] exists and is finite, and

(b) $E_P[w_{t+s} \mid w_t, w_{t-1}, \ldots]$ converges in mean square to zero as $s \to \infty$.

Assumptions (A1) and (A2) cover a broad class of models, as shown in Hansen (1982). Using iterated expectations, Assumption (A2) implies the r × 1 moment conditions

$$E_P[h(x_t, \theta_0)] = 0. \qquad (1)$$

Assumption (A3)7

(a) $h(x, \cdot)$ is continuously differentiable in $\Theta$ for each $x \in \mathbb{R}^n$.

(b) h(., θ) and ∂h(., θ)/∂θ are Borel measurable for each θ ∈ Θ.

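Let $g_T(X_T, \theta)$ be the sample average of $h(x_t, \theta)$, where

$$g_T(X_T, \theta) \equiv \frac{1}{T} \sum_{t=1}^{T} h(x_t, \theta).$$

Definition 1. The GMM estimator $\{\hat{\theta}_{G,T}(\omega) : T \geq 1\}$ for some $\omega \in \Omega$ is the value of $\theta$ that minimizes the objective function

$$g_T(X_T, \theta)' \, W_T^G \, g_T(X_T, \theta), \qquad (2)$$

where $\{W_T^G\}_{T=1}^{\infty}$ is a sequence of (r × r) positive definite weighting matrices which are functions of the data $X_T$.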

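Assuming an interior optimum, the GMM estimate $\hat{\theta}_G$ is then the solution to the system of nonlinear equations:

$$\left\{ \frac{\partial g_T(X_T, \hat{\theta}_G)}{\partial \theta'} \right\}' W_T^G \, g_T(X_T, \hat{\theta}_G) = 0. \qquad (3)$$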

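Let $R_w(s) = E_P[w_{s+1} w_1']$. Assumptions (A1) and (A2) ensure that $S = \sum_{s=-\infty}^{\infty} R_w(s)$ is well defined and finite. The matrix S is sometimes interpreted as the long-run variance of $w_t = h(x_t, \theta_0)$ and can alternatively be written as

$$S = \lim_{T \to \infty} T \, E_P\left[ g_T(X_T, \theta_0) \, g_T(X_T, \theta_0)' \right]. \qquad (4)$$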

Conditions (1) and (4) are conditions on the first and second moments of $w_t = h(x_t, \theta_0)$ implied by the probability measure P. The matrix S is the asymptotic variance of $\sqrt{T} \, g_T(X_T, \theta_0)$.

To see how the weighting matrix in (2) works, consider first the situation where there are as many moment conditions as parameters (referred to as the “just-identified” case). The moments will all be perfectly matched and the objective function in (2) will have a value of zero. In the “over-identified” case, where there are more moment conditions than parameters, not all of the moment restrictions will be satisfied, so the weighting matrix $W_T^G$ determines the relative importance of the various moment conditions.
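The just-identified and over-identified cases can be illustrated with a minimal sketch. It assumes a linear moment function $h(x_t, \theta) = z_t (y_t - x_t'\theta)$ with instruments $z_t$; the simulated data, the identity weighting matrix, and the function names `gmm_objective` and `linear_gmm` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def gmm_objective(theta, y, X, Z, W):
    """Quadratic form g_T' W g_T for the moment function h_t = z_t (y_t - x_t' theta)."""
    resid = y - X @ theta
    g = Z.T @ resid / len(y)          # sample average of the moment function
    return float(g @ W @ g)

def linear_gmm(y, X, Z, W):
    """Closed-form minimizer of the objective for the linear moment function."""
    XZ = X.T @ Z                      # (q x r)
    return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))

# Hypothetical simulated data: one regressor, three valid instruments.
rng = np.random.default_rng(0)
T = 500
Z = rng.normal(size=(T, 3))
x = Z @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=T)
X = x[:, None]
y = 2.0 * x + rng.normal(size=T)

# Over-identified case: r = 3 moments, q = 1 parameter.
theta_over = linear_gmm(y, X, Z, np.eye(3))
obj_over = gmm_objective(theta_over, y, X, Z, np.eye(3))

# Just-identified case: keep only the first instrument, so r = q = 1.
theta_just = linear_gmm(y, X, Z[:, :1], np.eye(1))
obj_just = gmm_objective(theta_just, y, X, Z[:, :1], np.eye(1))
```

In the just-identified run the objective is zero up to rounding error, while in the over-identified run it stays strictly positive and its value depends on the chosen weighting matrix.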

Hansen (1982) points out that setting $W_T^G = S^{-1}$, the inverse of an asymptotic covariance matrix, is optimal in the sense that it yields parameter estimates with the smallest asymptotic variance. (Intuitively, more weight is given to the moment conditions with less uncertainty. S is also known as the spectral density matrix evaluated at frequency zero.) There are many approaches to constructing consistent estimators of S that account for various forms of heteroskedasticity and/or serial correlation, including White (1980), the Bartlett kernel used by Newey and West (1987), the truncated kernel of Hansen (1982), and the automatic bandwidth selection of Andrews and Monahan (1992).
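As one concrete instance of such an estimator, the following sketch computes a Bartlett-kernel (Newey-West type) estimate of S from a T × r array of moment contributions. The function name `newey_west`, the demeaning step, and the fixed user-supplied lag length are assumptions made for illustration.

```python
import numpy as np

def newey_west(w, lags):
    """Bartlett-kernel estimate of the long-run variance S of a (T x r) array w."""
    w = np.asarray(w, dtype=float)
    w = w - w.mean(axis=0)                    # demeaning is an optional convention
    T = w.shape[0]
    S = w.T @ w / T                           # lag-0 autocovariance
    for j in range(1, lags + 1):
        gamma_j = w[j:].T @ w[:-j] / T        # sample autocovariance at lag j
        S += (1.0 - j / (lags + 1.0)) * (gamma_j + gamma_j.T)
    return S
```

With `lags=0` the function reduces to the sample covariance matrix; the linearly declining Bartlett weights keep the estimate positive semidefinite for any lag length.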

Let $\hat{S}_T$ be a consistent estimator of S based on a sample of size T. An optimal GMM estimator is obtained with $W_T^G = \hat{S}_T^{-1}$ in (2)

where S is approximated by $\hat{S}_T$.

III. GMM in the Bayesian Framework

In contrast to the classical approach, Bayesian estimation requires the specification of likelihood functions or of the data generating mechanism. For this reason, one may conclude that the Bayesian method cannot be applied to the moment problem. However, recent developments in Bayesian and classical econometrics have made it possible to consider a likelihood interpretation of some non-likelihood problems.8 Innovative work in this area was done by Zellner (1996 and 1997), who developed a finite sample Bayesian Method of Moments (BMOM) based on the principle of maximum entropy.9 One of the distinguishing features of the BMOM approach is that it yields post-data densities for the models’ parameters without use of an assumed likelihood function. Inoue (2001) proposes a semi-parametric Bayesian method of moments approach (which differs from the maximum entropy approach) that enables direct Bayesian inference in the method of moments framework. In that approach, the posterior distribution of strongly identified parameters is asymptotically normal even in the presence of weakly identified parameters. Finally, Kim (2000, 2002) develops a limited information procedure in the Bayesian framework that does not require knowledge of the likelihood function. His procedure is the Bayesian counterpart of the classical GMM but has certain advantages over the classical GMM in practical applications, and it is the approach we closely follow in this essay.

A. The Bayes Estimator and GMM

We now begin the construction of the GMM in the Bayesian framework. In the classical framework, GMM is a limited information procedure. The GMM estimate in Definition 1 is based on the moment condition (1), a set of limited information on the data generating process. The goal is to build a Bayesian counterpart of the classical GMM by constructing a Bayesian limited information procedure based on a set of moments.

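Following Kim (2000, 2002), we begin with some of the basic elements. A Bayesian framework is identified by a posterior density defined on the measurable space $(\Theta, \mathcal{G})$. Let $\pi_T(\theta \mid X_T(\omega))$ be the “true” posterior of $\theta$, which may be unknown.10 Assume that the posterior $\pi_T(\cdot \mid X_T(\cdot))$ is jointly measurable $\mathcal{G} \times \mathcal{F}$. Define

$$P_T^{\pi}(G, \omega) = \int_G \pi_T(\theta \mid X_T(\omega)) \, d\theta \qquad (7)$$

for any $G \in \mathcal{G}$ and $\omega \in \Omega$, where $P_T^{\pi}(\cdot, \omega)$ is a probability measure on $\Theta$ for every $\omega \in \Omega$.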

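Let $l(\theta, \delta)$ be the loss function that penalizes the choice of $\delta$ when $\theta$ is the true parameter value. The Bayes’ estimator is the estimator that minimizes the expected posterior loss

$$\hat{\theta}_B = \arg\min_{\delta \in \Theta} \int_{\Theta} l(\theta, \delta) \, \pi_T(\theta \mid X_T) \, d\theta. \qquad (8)$$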


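We are interested in a loss function that yields an estimator equivalent to the GMM estimator. Since the objective is to study a Bayesian counterpart of the classical GMM, it is natural to adopt a loss function with this property. Consider the following loss function that is quadratic in $g_T$:

$$l(\theta, \delta) = \left[ g_T(X_T, \theta) - g_T(X_T, \delta) \right]' W_T \left[ g_T(X_T, \theta) - g_T(X_T, \delta) \right], \qquad (10)$$

where $\{W_T\}_{T=1}^{\infty}$ is a sequence of positive definite weighting matrices. The loss function in (10) can be transformed to a loss function quadratic in $\theta$:

$$l(\theta, \delta) = (\theta - \delta)' \, \tilde{W}_T \, (\theta - \delta), \qquad (11)$$

where $\tilde{W}_T = \left\{ \partial g_T(\tilde{\theta}) / \partial \theta' \right\}' W_T \left\{ \partial g_T(\tilde{\theta}) / \partial \theta' \right\}$ and $\tilde{\theta} \in (\theta, \delta)$.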

Under some conditions (discussed in Lemma 1 below), the loss functions in (10) and (11) yield an estimator that is the same as the GMM estimator. As discussed in Kim (2000), the choice of the loss functions (10) and (11) does not cause loss of generality. The main results of this essay do not change as long as the chosen loss function can be transformed into a function that is quadratic in $\theta$.

From the minimization problem (8) using the loss function in (10) the first order condition is

This implies a moment condition

where the right hand side is a constant conditional on xT and

Interpreting the GMM estimator as a Bayes’ estimator, the right hand side of (12) is equal to zero. So we have the moment condition:


Lemma 1. Assume the second order conditions hold for the minimization in the GMM estimate in (2) and in the Bayes’ estimate in (8) with the loss function described in (10). Then, under Assumption (A3), the GMM estimator $\hat{\theta}_{GMM}$ is equal to the Bayes’ estimator $\hat{\theta}_B$ if and only if


and $\{W_T\} = \{W_T^G\}$.11

B. Limited Information Likelihood and GMM

In this section we follow the discussion in Section 3 of Kim (2002) to establish a semi-parametric limited information likelihood based on the moment conditions which form a set of limited information on the data generating mechanism. (This limited information likelihood is then used to derive a limited information posterior in Kim (2002).) The approach to get this limited information likelihood function is based on the principle of maximum entropy where the idea is to get a likelihood that is closest to the unknown true likelihood in an information distance.

From the moment condition (1) we have

and from (4) and (6) we have the second moment condition on gT

where S is the long-run variance of wt = h(xt,θ0) described in (4). Under Assumptions (A1) and (A2) it can be shown12 that

Given the true probability measure P with the properties in the moment conditions (13) and (14), we are interested in the probability measure Q that implies the same moment conditions. Let $\mathcal{Q}$ be a family of probability measures that are absolutely continuous with respect to P such that for $\theta \in \Theta$

which (as shown in Kim 2002) reduces to

Usually such a Q is not unique. Among $Q \in \mathcal{Q}$, we are interested in the one that is closest to the true probability measure P in the entropy distance, also known as the Kullback-Leibler information distance (White 1982) or the I–divergence distance (Csiszar 1975). The optimization problem yields such a solution Q*

$$Q^* = \arg\min_{Q \in \mathcal{Q}} \int \ln\left( \frac{dQ}{dP} \right) dQ, \qquad (18)$$

where dQ/dP is the Radon-Nikodym derivative (or density) of Q with respect to P. So, Q* is the solution of the constrained minimization where the constraint is given with respect to the moments implied in the measure P. As in Csiszar (1975), Q* is defined to be the I–projection of P on 𝒬. Further, denote by qP*(θ)=dQ*(θ)/dP the Radon-Nikodym derivative of Q*(θ) with respect to P. Kim (2002) calls qP*(θ) a limited information density or the I–projection density (following Csiszar (1975)).

The solution of (18) qP*(θ) is uniform in θ ∈ Θ that satisfies the moment in (16) or (17), and therefore, qP*(θ) can be interpreted as the likelihood of θ. Thus, we call qP*(θ) a Limited Information Likelihood (LIL) or the I–projection likelihood.

Under the conditions on Q

where κ is a constant and 𝒦 is a normalizing constant. As shown in Kim (2002), κ = −1/2 is a desirable choice. Finally, Theorem 1 of Kim (2002) establishes that qP,T*(XT,θ) is a finite sample analogue of qP*(θ). Therefore, the sample LIL is

When S is not known it is replaced by a consistent estimator S^T.
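A minimal sketch of how a sample LIL could be evaluated is given below. It assumes the exponential quadratic form $\mathcal{K} \exp\{\kappa \, T \, g_T' S^{-1} g_T\}$ with κ = −1/2; that functional form is an assumption here, since the displayed expression is not legible in this text, and the function name `log_lil` is hypothetical. The normalizing constant is dropped, so only differences in the returned value are meaningful.

```python
import numpy as np

def log_lil(g, S, T, kappa=-0.5):
    """Log limited information likelihood, up to the additive constant log K.

    g : sample moment vector g_T(X_T, theta), shape (r,)
    S : long-run variance matrix (or a consistent estimate), shape (r, r)
    T : sample size
    """
    g = np.asarray(g, dtype=float)
    S = np.asarray(S, dtype=float)
    return kappa * T * float(g @ np.linalg.solve(S, g))
```

Under this form the log likelihood is highest where the weighted sample moments are closest to zero, which is what links it to the GMM objective with $W_T^G = \hat{S}_T^{-1}$.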

IV. Model Uncertainty and BMA

Standard statistical practice ignores model uncertainty. The classical approach conditions on a single model and thus leads to underestimation of uncertainty when making inferences about quantities of interest. A complete Bayesian solution to the problem of model uncertainty is the BMA approach, which involves averaging over all possible combinations of predictors when making inferences about the quantities of interest.13 The Bayesian approach avoids conditioning on a single model. No model is assumed to be the “true” model; instead, all possible models are assigned prior probabilities based on the researcher’s beliefs, and inference uses the posterior model probabilities as weights. As noted by Hoeting and others (1994), this is reasonable as it allows for propagation of model uncertainty into the posterior distribution and leads to more sensible uncertainty bands.

The following sections draw from Raftery (1994), Kass and Raftery (1995), and Kim (2000) and (2002). First, we introduce Bayesian hypothesis testing and Bayes factors to test competing models. Then, we derive a limited information model selection criterion, in order to calculate the Bayes factors in the case of a limited information procedure. Finally, we incorporate the derived criterion in the context of BMA to derive the posterior distributions of the parameters of interest.

A. Bayesian Hypothesis Testing

We begin with the general setup for a model selection problem. Let $\mathcal{M}$ be a family of candidate models for $X_T$. A model $M_k \in \mathcal{M}$ is associated with a parameter space $\Theta_k$ of dimension $q_k$ for $k \in \mathcal{I}$, where $\mathcal{I} = \{1, \ldots, I\}$, and is characterized by a relation of the form $h(x_t, \theta) = w_t$ (as described in Section II), with $w_t$ a stochastic process satisfying Assumptions (A1) and (A2). For every $M_k$ a set of moment conditions is defined as in (1) and a likelihood $q_T^k(X_T, \theta_k)$ is defined as in (20).

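Suppose that we want to use data $X_T$ to test competing hypotheses represented by two models $M_1$ and $M_2$ with parameter vectors $\theta_1$ and $\theta_2$. Let $p(M_1 \mid X_T)$ be the posterior probability that $M_1$ is the correct model,

$$p(M_1 \mid X_T) = \frac{q_T(X_T \mid M_1) \, p(M_1)}{q_T(X_T \mid M_1) \, p(M_1) + q_T(X_T \mid M_2) \, p(M_2)}, \qquad (21)$$

where (for k = 1, 2) $q_T(X_T \mid M_k)$ is the marginal probability of the data given $M_k$, and $p(M_k)$ is the prior probability of model $M_k$.14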

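In general, the term $q_T(X_T \mid M_k)$ in (21), the marginal likelihood, is obtained by integrating over the parameter space

$$q_T(X_T \mid M_k) = \int q_T(X_T \mid \theta_k, M_k) \, \phi(\theta_k \mid M_k) \, d\theta_k, \qquad (22)$$

where $q_T(X_T \mid \theta_k, M_k) = q_T^k(X_T, \theta_k)$ is the likelihood of $\theta_k$ under model $M_k$, and $\phi(\theta_k \mid M_k)$ is the prior density associated with model $M_k$.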

The posterior odds ratio for $M_2$ against $M_1$ (i.e., the ratio of their posterior probabilities, $p(M_2 \mid X_T) / p(M_1 \mid X_T)$) can be used to measure the extent to which the data support $M_2$ over $M_1$. Using (21) the posterior odds ratio is

$$\frac{p(M_2 \mid X_T)}{p(M_1 \mid X_T)} = \frac{q_T(X_T \mid M_2)}{q_T(X_T \mid M_1)} \times \frac{p(M_2)}{p(M_1)}, \qquad (23)$$

where the first term on the right-hand side of (23) is the Bayes factor for $M_2$ against $M_1$, denoted by $B_{21}$, and the second term is the prior odds ratio. Sometimes the prior odds ratio is set to 1, representing the lack of prior preference for either model, in which case the posterior odds ratio is equal to the Bayes factor. When the posterior odds ratio is greater (less) than 1, the data favor $M_2$ over $M_1$ ($M_1$ over $M_2$).
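The relation between the Bayes factor, the prior odds, and the posterior odds can be written in a few lines. This is only an illustrative sketch for the two-model case; the function names and the numerical inputs in the usage lines are hypothetical.

```python
def posterior_odds(bayes_factor_21, prior_m2=0.5, prior_m1=0.5):
    """Posterior odds of M2 against M1: Bayes factor times prior odds."""
    return bayes_factor_21 * prior_m2 / prior_m1

def posterior_prob_m2(odds_21):
    """Posterior probability of M2 when M1 and M2 are the only candidates."""
    return odds_21 / (1.0 + odds_21)

# With equal prior probabilities the posterior odds equal the Bayes factor.
odds_equal_priors = posterior_odds(3.0)
# A prior that favors M1 three to one offsets a Bayes factor of 3.
odds_skewed_priors = posterior_odds(3.0, prior_m2=0.25, prior_m1=0.75)
```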

Evaluating the Bayes factor in (23) for hypothesis testing requires calculating the marginal likelihood qT(XT|Mk). This can be a high-dimensional and intractable integral. Various analytic and numerical approximations have been proposed which are reviewed in Kass and Raftery (1995). The Bayesian Information Criterion (BIC) is a simple and accurate method to estimate Bayes factors when the likelihood function is known. This is discussed first in the next section. Then we extend the discussion to the case where only the limited information likelihood is available to derive the Limited Information Bayesian Information Criterion (LIBIC).

B. The Information Criteria: BIC and LIBIC

Following the approach of Raftery (1994), we focus on approximating the marginal likelihood for a single model, that is, the right hand side of (22). We will avoid indexing for a specific model, so a general form of (22) can be written as $q_T(X_T \mid M) = \int q_T(X_T \mid \theta, M) \, \phi(\theta \mid M) \, d\theta$.

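Let $f(\theta) = \log[q_T(X_T \mid \theta, M) \, \phi(\theta \mid M)]$, and consider a Taylor series expansion of $f(\theta)$ about $\tilde{\theta}$, the value of $\theta$ that maximizes $f(\theta)$, that is, the posterior mode. The expansion gives

$$f(\theta) = f(\tilde{\theta}) + (\theta - \tilde{\theta})' f'(\tilde{\theta}) + \tfrac{1}{2} (\theta - \tilde{\theta})' f''(\tilde{\theta}) (\theta - \tilde{\theta}) + o(\|\theta - \tilde{\theta}\|^2), \qquad (24)$$

where $f'(\theta) = \left( \partial f(\theta)/\partial \theta_1, \ldots, \partial f(\theta)/\partial \theta_d \right)'$ is the vector of first partial derivatives of $f(\theta)$, and $f''(\theta)$ is the Hessian matrix of second partial derivatives of $f(\theta)$ whose (i, j) element is $\partial^2 f(\theta) / \partial \theta_i \partial \theta_j$. Since $f(\theta)$ is maximized at $\tilde{\theta}$, $f'(\tilde{\theta}) = 0$, so (24) becomes

$$f(\theta) \approx f(\tilde{\theta}) + \tfrac{1}{2} (\theta - \tilde{\theta})' f''(\tilde{\theta}) (\theta - \tilde{\theta}). \qquad (25)$$

From the definition of $f(\theta)$ and (22) it follows that $q_T(X_T \mid M) = \int \exp[f(\theta)] \, d\theta$, and using (25):

$$q_T(X_T \mid M) \approx \exp[f(\tilde{\theta})] \int \exp\left[ \tfrac{1}{2} (\theta - \tilde{\theta})' f''(\tilde{\theta}) (\theta - \tilde{\theta}) \right] d\theta. \qquad (26)$$

Recognizing the integrand in (26) as proportional to a multivariate normal density gives

$$q_T(X_T \mid M) \approx \exp[f(\tilde{\theta})] \, (2\pi)^{d/2} \, |A|^{-1/2}, \qquad (27)$$

where d is the number of parameters in the model and $A = -f''(\tilde{\theta})$. This is the Laplace approximation method.15 Using (27) and taking logarithms,

$$\log q_T(X_T \mid M) = \log q_T(X_T \mid \tilde{\theta}, M) + \log \phi(\tilde{\theta} \mid M) + \frac{d}{2} \log(2\pi) - \frac{1}{2} \log|A| + O(n^{-1}). \qquad (28)$$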

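In large samples, $\tilde{\theta} \approx \hat{\theta}$, where $\hat{\theta}$ is the maximum likelihood estimator, and $A \approx n \, i$, where $i$ is the expected Fisher information matrix for one observation. This is a (d × d) matrix whose (i, j) element is $-E\left[ \partial^2 \log p(y_1 \mid \theta) / \partial \theta_i \partial \theta_j \,\big|_{\theta = \hat{\theta}} \right]$, with the expectation being taken over values of $y_1$ with $\theta$ held fixed. Thus, $|A| \approx n^d |i|$. With these approximations and an added $O(n^{-1/2})$ error, (28) becomes

$$\log q_T(X_T \mid M) = \log q_T(X_T \mid \hat{\theta}, M) + \log \phi(\hat{\theta} \mid M) + \frac{d}{2} \log(2\pi) - \frac{d}{2} \log n - \frac{1}{2} \log|i| + O(n^{-1/2}). \qquad (29)$$

Removing the terms of order O(1) or less gives16

$$\log q_T(X_T \mid M) = \log q_T(X_T \mid \hat{\theta}, M) - \frac{d}{2} \log n + O(1). \qquad (30)$$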

Equation (30) is the approximation on which the BIC is based and was first derived by Schwarz (1978). As suggested by Raftery (1994), although the O(1) term means that the error does not vanish even with an infinite amount of data, the error tends towards zero as a proportion of $\log q_T(X_T \mid M)$, which ensures that it will not affect the conclusion reached given enough data. For a particular choice of prior, the error term is of much smaller magnitude. Suppose that the prior is multivariate normal with mean $\hat{\theta}$ and variance matrix $i^{-1}$. Under that choice of prior, we have

$$\log \phi(\hat{\theta} \mid M) = \frac{1}{2} \log|i| - \frac{d}{2} \log(2\pi). \qquad (31)$$

Substituting (31) in (29), we get an expression for $\log q_T(X_T \mid M)$ where the error term vanishes as n → ∞:

$$\log q_T(X_T \mid M) = \log q_T(X_T \mid \hat{\theta}, M) - \frac{d}{2} \log n + O(n^{-1/2}). \qquad (32)$$

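Using the approximation in (32) we can derive the Bayes factor $B_{21} = p(X_T \mid M_2) / p(X_T \mid M_1)$ in (23), such that:

$$\log B_{21} \approx \log q_T(X_T \mid \hat{\theta}_2, M_2) - \log q_T(X_T \mid \hat{\theta}_1, M_1) - \frac{1}{2} (d_2 - d_1) \log n. \qquad (33)$$

As discussed in Kass and Raftery (1995), the expression in (33) is the Schwarz criterion (S), and as n → ∞, $(S - \log B_{21}) / \log B_{21} \to 0$. (Based on this result, (33) can be viewed as an approximation to the logarithm of the Bayes factor $B_{21}$.) Twice the Schwarz criterion is the BIC, or

$$\mathrm{BIC} = 2S = 2 \left[ \log q_T(X_T \mid \hat{\theta}_2, M_2) - \log q_T(X_T \mid \hat{\theta}_1, M_1) \right] - (d_2 - d_1) \log n \approx 2 \log B_{21}. \qquad (34)$$

Exact calculation of equation (34) requires knowledge of the likelihood function for each of the models. If $M_1$ is nested within $M_2$, (34) reduces to $2 \log B_{21} \approx \chi^2_{21} - df_{21} \log n$, where $\chi^2_{21}$ is the standard likelihood ratio test statistic for testing $M_1$ against $M_2$ and $df_{21} = d_2 - d_1$ is the number of degrees of freedom.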
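A small numerical check may help fix ideas about the Laplace and Schwarz approximations. The sketch below uses a hypothetical Bernoulli example with a uniform prior, chosen only because its marginal likelihood has a closed form (a Beta function); the data counts and variable names are assumptions for illustration.

```python
import math

a, b = 30, 20            # hypothetical data: 30 successes and 20 failures
n = a + b

def loglik(theta):
    """Bernoulli log likelihood; with a uniform prior this is also f(theta)."""
    return a * math.log(theta) + b * math.log(1.0 - theta)

theta_hat = a / n        # maximizer of f, equal to the MLE under the uniform prior
A = a / theta_hat**2 + b / (1.0 - theta_hat)**2   # minus the second derivative of f

# Exact log marginal likelihood: log of the Beta function B(a + 1, b + 1).
log_exact = math.lgamma(a + 1) + math.lgamma(b + 1) - math.lgamma(n + 2)

# Laplace approximation with d = 1: f(mode) + (1/2) log(2 pi) - (1/2) log A.
log_laplace = loglik(theta_hat) + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(A)

# Schwarz-type approximation: log likelihood at the MLE minus (d / 2) log n.
log_schwarz = loglik(theta_hat) - 0.5 * 1 * math.log(n)

# BIC approximation to 2 log B21, where M1 fixes theta = 0.5 (d1 = 0)
# and M2 leaves theta free (d2 = 1).
two_log_b21 = 2.0 * (loglik(theta_hat) - loglik(0.5)) - (1 - 0) * math.log(n)
```

In this example the Laplace value is much closer to the exact log marginal likelihood than the Schwarz value, whose error is of order O(1), and the negative `two_log_b21` indicates that the penalty for the extra parameter outweighs the gain in fit.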

The full-information likelihood function is not available in the context of GMM. Therefore, in order to calculate the BIC in (34) we need to rely on the LIL developed by Kim (2002) and discussed in Section III.B. Using the LIL in (20) we can replace the likelihood functions on the right hand side of (34) to get a Limited Information Bayesian Information Criterion (LIBIC). First, using (20) (with S replaced by a consistent estimator $\hat{S}_T$) to substitute for the log-likelihood in (34) we have

and substituting in (34) and eliminating the $O(n^{-1/2})$ term we have the expression for the LIBIC


which reduces to

The LIBIC expression in (37) can be calculated from the estimated output and will be used to estimate the Bayes factors required for the hypothesis testing.

C. BMA under Limited Information

Suppose we can divide the parameter space into K regions (models), so we have the space of all possible models $\mathcal{M} = \{M_j : j = 1, \ldots, K\}$. Let Δ be the quantity of interest (such as a parameter, in our case). Then Bayesian inference about Δ is constructed given the data D, based on the law of total probability:

$$p(\Delta \mid D) = \sum_{k=1}^{K} p(\Delta \mid D, M_k) \, p(M_k \mid D), \qquad (38)$$

where $p(\Delta \mid D)$, the posterior distribution of the quantity of interest Δ, is a mixture of the posterior distributions of that quantity under each of the models, with mixing probabilities given by the posterior model probabilities. Thus, the full posterior distribution of Δ is a weighted average of the posterior distributions under each model ($M_1, \ldots, M_K$), where the weights are the posterior model probabilities $p(M_k \mid D)$. This procedure is what is typically referred to as BMA, and it is in fact the standard Bayesian solution under model uncertainty, since it follows from direct application of Bayes’ theorem.

Denoting the data by $X_T$, (38) becomes $p(\Delta \mid X_T) = \sum_{k=1}^{K} p(\Delta \mid X_T, M_k) \, p(M_k \mid X_T)$. Using Bayes’ theorem, the posterior model probabilities are obtained using (21) extended to the case of K models, such that:

$$p(M_k \mid X_T) = \frac{q_T(X_T \mid M_k) \, p(M_k)}{\sum_{j=1}^{K} q_T(X_T \mid M_j) \, p(M_j)}. \qquad (39)$$

Although this fully Bayesian approach is a feasible and very attractive solution to the problem of accounting for model uncertainty, there are certain difficulties in the implementation of BMA that make it (in some cases) a rather unpopular and less practical proposition. First, when the number of regressors $k^*$ is very large, the number of models is $2^{k^*}$ and, as a result, calculations may be rendered infeasible.17 Second, BMA requires specification of the prior distributions of all relevant parameters, so for $k^*$ possible regressors, $2^{k^*}$ priors are needed. In most BMA applications, the choice of priors has essentially been arbitrary and the impact of this choice on the estimated parameters has not been examined. In the context of growth, Brock and Durlauf (2001), Doppelhofer, Miller, and Sala-i-Martin (2000), and Fernandez, Ley, and Steel (2001) are all applications of BMA techniques to investigate the robustness of growth determinants in light of model uncertainty.18

Each of the K models is compared in turn with a baseline model $M_0$ (which could be the null model with no independent variables), yielding Bayes factors $B_{10}, B_{20}, \ldots, B_{K0}$. The value of BIC for model $M_k$, denoted $\mathrm{BIC}_k$, is the approximation to $2 \log B_{0k}$ given by (34), where $B_{0k}$ is the Bayes factor for model $M_0$ against $M_k$.19

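It is possible to write (21) and (39) in terms of the BIC. To see this, rewrite the Bayes factor $B_{12} = p(D \mid M_1) / p(D \mid M_2)$ as

$$B_{12} = \frac{p(D \mid M_1)}{p(D \mid M_2)} \times \frac{p(D \mid M_0)}{p(D \mid M_0)} = \frac{p(D \mid M_1) / p(D \mid M_0)}{p(D \mid M_2) / p(D \mid M_0)} = \frac{B_{10}}{B_{20}} = \frac{B_{02}}{B_{01}}.$$

Using (34), this implies that $2 \log B_{12} = 2(\log B_{02} - \log B_{01}) = \mathrm{BIC}_2 - \mathrm{BIC}_1$. Substituting the Bayes factor $B_{12}$ in (21), we get

$$p(M_1 \mid X_T) = \frac{B_{12} \, q_T(X_T \mid M_2) \, p(M_1)}{B_{12} \, q_T(X_T \mid M_2) \, p(M_1) + q_T(X_T \mid M_2) \, p(M_2)} = \frac{\frac{B_{10}}{B_{20}} p(M_1)}{\frac{B_{10}}{B_{20}} p(M_1) + \frac{B_{20}}{B_{20}} p(M_2)} = \frac{B_{10} \, p(M_1)}{B_{10} \, p(M_1) + B_{20} \, p(M_2)}.$$

Since $2 \log B_{10} = \mathrm{BIC}_0 - \mathrm{BIC}_1 = -\mathrm{BIC}_1$, we have $B_{10} = \exp(-\tfrac{1}{2} \mathrm{BIC}_1)$, and the expression becomes

$$p(M_1 \mid X_T) = \frac{\exp(-\tfrac{1}{2} \mathrm{BIC}_1) \, p(M_1)}{\sum_{j=1}^{2} \exp(-\tfrac{1}{2} \mathrm{BIC}_j) \, p(M_j)}.$$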

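Extending from 2 models to K models, $p(X_T \mid M_k) \propto \exp(-\tfrac{1}{2} \mathrm{BIC}_k)$ and (39) becomes

$$p(M_k \mid X_T) = \frac{\exp(-\tfrac{1}{2} \mathrm{BIC}_k) \, p(M_k)}{\sum_{j=1}^{K} \exp(-\tfrac{1}{2} \mathrm{BIC}_j) \, p(M_j)}. \qquad (40)$$

The expression in (40) uses the “full information” BIC derived in (34). In the framework of our GMM analysis, and following the discussion in Section III.B, we modify (40) to incorporate the “limited information” LIBIC defined in (37):

$$p(M_k \mid X_T) = \frac{\exp(-\tfrac{1}{2} \mathrm{LIBIC}_k) \, p(M_k)}{\sum_{j=1}^{K} \exp(-\tfrac{1}{2} \mathrm{LIBIC}_j) \, p(M_j)}. \qquad (41)$$

Equation (41) defines the LIBMA estimator, an extension of BMA to the case of a limited information likelihood. The LIBMA incorporates a dynamic panel estimator in the context of GMM and a Bayesian robustness check to explicitly account for model uncertainty in evaluating the results of a universe of models generated by a set of possible regressors.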
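The weighting step can be sketched as follows: given one information criterion value per model (BIC or LIBIC, each measured against the same baseline model) and prior model probabilities, posterior model probabilities are proportional to exp(−½ IC) times the prior. The function name `posterior_model_probs`, the uniform default prior, and the subtraction of the maximum for numerical stability are assumptions of this sketch.

```python
import numpy as np

def posterior_model_probs(ic, prior=None):
    """Posterior model probabilities from information criteria and model priors."""
    ic = np.asarray(ic, dtype=float)
    if prior is None:
        prior = np.full(ic.shape, 1.0 / ic.size)   # equal prior model probabilities
    else:
        prior = np.asarray(prior, dtype=float)
    log_w = -0.5 * ic + np.log(prior)
    log_w -= log_w.max()                           # guard against overflow/underflow
    w = np.exp(log_w)
    return w / w.sum()
```

Because only differences in the criterion matter after normalization, adding the same constant to every model's value leaves the resulting probabilities unchanged.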

V. Statistics for the Robustness Analysis

This section summarizes the computational aspects and introduces the statistics on which we will base our robustness analysis.

Suppose we have n independent replications of a linear regression model with an intercept α, and $k^*$ possible regressors grouped in a $k^*$–dimensional vector β. Denote by Z the corresponding $n \times k^*$ design matrix. We have $K = 2^{k^*}$ possible sampling models, depending on whether we include or exclude each of the regressors, so we have the space of all possible models $\mathcal{M} = \{M_j : j = 1, \ldots, 2^{k^*}\}$. In order to deal with model uncertainty in the Bayesian framework, we need to define a prior distribution over the model space $\mathcal{M}$, namely, $p(M_j)$, where $\sum_{j=1}^{2^{k^*}} p(M_j) = 1$.

A model $M_j$ with $0 \leq k_j \leq k^*$ regressors is defined by $y = \alpha + X_j \beta_j + \varepsilon$, where y is the vector of observations, $X_j$ denotes the $n \times k_j$ matrix of the regressors included, and $\beta_j$ is the vector of the relevant coefficients.

From (23) the posterior odds ratio for two models $M_j$, $M_l$ is $\frac{p(M_j \mid X_T)}{p(M_l \mid X_T)} = \frac{q_T(X_T \mid M_j)}{q_T(X_T \mid M_l)} \times \frac{p(M_j)}{p(M_l)}$. The first term on the right-hand side, $q_T(X_T \mid M_j) / q_T(X_T \mid M_l)$, is the Bayes factor and can be approximated using (34). The second term, $p(M_j) / p(M_l)$, is the prior odds ratio. In the case where there is no preference for a specific model, $p(M_1) = p(M_2) = \cdots = p(M_K) = 1/K$ and the posterior odds ratio is equal to the Bayes factor. We do not assume equal prior probability for each model. Instead, following Doppelhofer, Miller, and Sala-i-Martin (2000), we represent a model $M_j$ as a length-$k^*$ binary vector in which a one indicates that a variable is included in the model and a zero indicates that it is not. In addition, following Doppelhofer, Miller, and Sala-i-Martin (2000), we do not require the choice of (arbitrary) priors for all the parameters; instead, only one hyper-parameter is specified, the expected model size, $\bar{k}$.

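Assuming that each variable has an equal inclusion probability, the prior probability for model $M_j$ is

$$p(M_j) = \left( \frac{\bar{k}}{k^*} \right)^{k_j} \left( 1 - \frac{\bar{k}}{k^*} \right)^{k^* - k_j}, \qquad (42)$$

and the prior odds ratio is

$$\frac{p(M_j)}{p(M_l)} = \left( \frac{\bar{k}}{k^*} \right)^{k_j - k_l} \left( 1 - \frac{\bar{k}}{k^*} \right)^{k_l - k_j}, \qquad (43)$$

where $k^*$ is the total number of regressors, $\bar{k}$ is the researcher’s prior about the number of regressors with non-zero coefficients, $k_j$ is the number of included variables in model $M_j$, and $\bar{k}/k^*$ is the prior inclusion probability for each variable. Since $\bar{k}$ is the only prior that is arbitrarily specified in the simulations, robustness checks of the results can be run by changing the value of this parameter.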
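A prior of this kind can be tabulated directly when the model space is small. The sketch below enumerates all models as binary inclusion vectors and assigns each the product of independent inclusion probabilities $\bar{k}/k^*$; the function name `prior_model_prob` and the values $k^* = 4$, $\bar{k} = 2$ are illustrative assumptions.

```python
from itertools import product

def prior_model_prob(k_j, k_star, k_bar):
    """Prior probability of a model with k_j of the k_star regressors included."""
    xi = k_bar / k_star                      # prior inclusion probability per variable
    return xi**k_j * (1.0 - xi)**(k_star - k_j)

k_star, k_bar = 4, 2.0                       # hypothetical values
models = list(product([0, 1], repeat=k_star))            # all 2**k_star inclusion vectors
priors = [prior_model_prob(sum(m), k_star, k_bar) for m in models]

total_prior = sum(priors)                                 # should equal one
expected_size = sum(sum(m) * p for m, p in zip(models, priors))
```

The prior probabilities sum to one over the full model space, and the implied expected model size equals the chosen $\bar{k}$, which is why $\bar{k}$ is the only hyper-parameter that needs to be varied in a robustness check.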

If the set of possible regressions is small enough to allow exhaustive calculation, we can substitute (42) into (41) to calculate the posterior model probabilities (where the weights for different models are assigned based on the posterior probability of each model, essentially normalizing the weight of any model by the sum of the weights of all possible $K = 2^{k^*}$ models) so that:

Next, we can use (44) to estimate the posterior mean and posterior variance as follows:


Other statistics relevant to the study are the posterior mean and variance conditional on inclusion. First we calculate the posterior inclusion probability, which is the sum of the posterior probabilities of all the regressions that include the specific variable (regressor). The posterior inclusion probability is a ranking measure of how much the data favor the inclusion of a variable in the regression, and is calculated as

$$p(\theta_k \neq 0 \mid X_T) = \sum_{j:\, \theta_k \neq 0} p(M_j \mid X_T).$$

If $p(\theta_k \neq 0 \mid X_T) > p(\theta_k \neq 0) = \bar{k}/k^*$, then the variable has a high marginal contribution to the goodness of fit of the regression model. Then, the posterior mean and variance conditional on inclusion are the posterior mean and variance divided by the posterior inclusion probability, $E(\theta_k \mid X_T) / \sum_{\theta_k \neq 0} p(M_j \mid X_T)$ and $Var(\theta_k \mid X_T) / \sum_{\theta_k \neq 0} p(M_j \mid X_T)$, respectively.
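These summary statistics are simple weighted sums once the posterior model probabilities are available. The sketch below assumes that the unconditional posterior mean is the probability-weighted sum of the model-specific estimates, with a coefficient of zero in models that exclude the variable; this weighting, the function name `inclusion_summary`, and the three-model numbers are assumptions made for illustration.

```python
import numpy as np

def inclusion_summary(included, coefs, post_probs):
    """Posterior inclusion probability, posterior mean, and mean conditional on inclusion.

    included   : booleans, whether the variable enters each model
    coefs      : model-specific coefficient estimates (ignored where excluded)
    post_probs : posterior model probabilities, summing to one
    """
    included = np.asarray(included, dtype=bool)
    coefs = np.asarray(coefs, dtype=float)
    post_probs = np.asarray(post_probs, dtype=float)
    pip = post_probs[included].sum()
    post_mean = (post_probs[included] * coefs[included]).sum()
    return pip, post_mean, post_mean / pip

# Hypothetical three-model example: the variable enters models 1 and 3.
pip, post_mean, cond_mean = inclusion_summary(
    included=[True, False, True],
    coefs=[0.4, 0.0, 0.6],
    post_probs=[0.5, 0.3, 0.2],
)
```

Comparing `pip` with the prior inclusion probability $\bar{k}/k^*$ then indicates whether the data raise or lower the support for including the variable.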

Finally, we compute the sign certainty probability. This measures the probability that the coefficient is on the same side of zero as its mean (conditional on inclusion) and is calculated as

VI. Conclusion

This paper develops the theoretical background of the Limited Information Bayesian Model Averaging (LIBMA) approach and the computational aspects of the robustness analysis. The proposed methodology consists of a coherent Bayesian framework that addresses the problems of model uncertainty and restrictive assumptions of certain estimation procedures. The LIBMA technique has many potential applications including investigations of competing hypotheses, and parameter estimation that is robust to model specification.

As is typical in many areas of economic research, empirical work on investigating growth (and poverty) determinants is (i) prone to inconsistent estimates due to bias from omitted country-specific effects and failure to account for endogenous regressors; and (ii) particularly susceptible to model uncertainty arising from the combination of a complex web of relationships and the lack of clear theoretical guidance on the choice of regressors. The first practical application of the LIBMA, by Ghura, Leite, and Tsangarides (2002), is a contribution to the ongoing growth and poverty debate that provides empirical evidence on the elasticity of the income of the poor with respect to average income and on the set of macroeconomic policies that directly influence poverty rates. Further, motivated by the existing empirical evidence on poverty reduction (and more broadly on human development), which strongly supports the primacy of the role of economic growth, a second research project attempts to explain the observed differences in standards of living across countries by identifying robust patterns of cross-country growth behavior, and examines convergence using the LIBMA approach.


    Andrews, D., and J. C. Monahan, 1992, “An Improved Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimator,” Econometrica, Vol. 60 (July), pp. 953-66.

    Brock, W., and S. Durlauf, 2001, “Growth Empirics and Reality,” World Bank Economic Review, Vol. 15 (No. 2), pp. 229-72.

    Csiszar, I., 1975, “I-divergence Geometry of Probability Distributions and Minimization Problems,” Annals of Probability, Vol. 3 (No. 1), pp. 146-58.

    Doppelhofer, G., R. I. Miller, and X. Sala-i-Martin, 2000, “Determinants of Long-Term Growth: A Bayesian Averaging of Classical Estimates (BACE) Approach,” Working Paper No. 7750 (Cambridge, Massachusetts: National Bureau of Economic Research).

    Ghura, D., C. Leite, and C. Tsangarides, 2002, “Is Growth Enough? Macroeconomic Policy and Poverty Reduction,” IMF Working Paper No. 02/118 (Washington: International Monetary Fund).

    Golan, A., G. Judge, and D. Miller, 1996, Maximum Entropy Econometrics: Robust Estimation with Limited Data (Chichester: Wiley & Sons).

    Hansen, L. P., 1982, “Large Sample Properties of Generalized Method of Moments Estimators,” Econometrica, Vol. 50 (July), pp. 1029-54.

    Hoeting, J. A., and others, 1999, “Bayesian Model Averaging: A Tutorial,” Statistical Science, Vol. 14 (No. 4), pp. 382-417.

    Inoue, A., 2001, “A Bayesian Method of Moments in Large Samples,” Working Paper (Raleigh: North Carolina State University).

    Kass, R., and A. Raftery, 1995, “Bayes Factors,” Journal of the American Statistical Association, Vol. 90 (No. 430), pp. 773-95.

    Kim, J. Y., 2000, “The Generalized Method of Moments in the Bayesian Framework and a Model and Moment Selection Criterion,” Working Paper (Albany: State University of New York).

    Kim, J. Y., 2001, “Bayesian Limited Information Analysis in the GMM Framework,” Working Paper (Albany: State University of New York).

    Kim, J. Y., 2002, “Limited Information Likelihood and Bayesian Analysis,” Journal of Econometrics, Vol. 107 (March), pp. 175-93.

    Leamer, E., 1978, Specification Searches: Ad Hoc Inference with Non-experimental Data (New York: Wiley & Sons).

    Madigan, D., and A. Raftery, 1994, “Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam’s Window,” Journal of the American Statistical Association, Vol. 89, pp. 1535-46.

    Newey, W., and K. West, 1987, “Hypothesis Testing with Efficient Method of Moments Estimation,” International Economic Review, Vol. 28 (No. 3), pp. 777-87.

    Raftery, A. E., 1994, “Bayesian Model Selection in Social Research,” in Sociological Methodology, ed. by Peter V. Marsden (Cambridge, Massachusetts: Blackwells), pp. 111-96.

    Raftery, A. E., 1996, “Approximate Bayes Factors and Accounting for Model Uncertainty in Generalized Linear Models,” Biometrika, Vol. 83, pp. 251-66.

    Tierney, L., and J. B. Kadane, 1986, “Accurate Approximations for Posterior Moments and Marginal Densities,” Journal of the American Statistical Association, Vol. 81, pp. 82-86.

    White, H., 1980, “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity,” Econometrica, Vol. 48 (May), pp. 817-38.

    White, H., 1982, “Maximum Likelihood Estimation of Misspecified Models,” Econometrica, Vol. 50 (January), pp. 1-25.

    Zellner, A., 1996, “Bayesian Method of Moments/Instrumental Variable (BMOM/IV) Analysis of Mean and Regression Model,” in Modeling and Prediction Honoring Seymour Geisser, ed. by J. C. Lee, W. C. Johnson, and A. Zellner (New York: Springer), pp. 61-74.

    Zellner, A., 1997, “The Bayesian Method of Moments (BMOM): Theory and Application,” in Advances in Econometrics, Vol. 12, ed. by T. Fomby and R. C. Hill (Greenwich, Connecticut: JAI Press), pp. 85-105.

This paper is a revised version of the first essay of my Ph.D. dissertation. I thank Mike Bradley, Bob Phillips, and Fred Joutz for their guidance, advice and support. Any remaining errors are my responsibility. Financial support from the Economic Club of Washington is gratefully acknowledged.


To the extent that the lagged values of the regressors are valid instruments, this GMM estimator addresses consistently and efficiently both sources of bias.


See Leamer (1978) and Raftery (1988, 1996) for a discussion.


Madigan and Raftery (1994) show that BMA provides optimal predictive ability. Hoeting and others (1999) summarize recent work using BMA. Brock and Durlauf (2000) provide an accessible explanation of criticisms leveled at growth empirics and the contribution of Bayesian analysis in dealing with model uncertainty.


In such a case, one may use quasi-Maximum Likelihood estimation, which does not sacrifice consistency. However, consistency may be an issue for nonlinear models estimated with Maximum Likelihood.


The measurable space (Θ,G) is required to define the posterior density in the Bayesian framework, and it is discussed here for completeness.


Assumption (A3) is described here for completeness although it is used later for Lemma 1 as well as the derivation of the asymptotic normality of the posterior.


Inoue (2001) and Kim (2000, 2001) provide a good literature review.


As Golan, Judge, and Miller (1996) show, in the entropy approach, estimators are chosen to maximize entropy or minimize some distance metric between the true probability measure and artificial probability measures for which the moment condition in question is satisfied. Hence, this approach does not require knowledge of the functional form of the likelihood. Using the Bayesian counterpart of this approach, one can obtain finite-sample post-data moments and distributions of the parameters and conduct post-data inference (see, e.g., Zellner, 1996 and 1997). Many traditional estimators are special cases of entropy.


It is the “true” posterior in the sense that it is obtained from the true likelihood of θ, or it is a posterior of θ that contains a richer set of information than that in the limited information posterior discussed in this paper.


The proof of this lemma is found in Kim (2001).


The proof is discussed in Kim (2002).


Madigan and Raftery (1994) note that the BMA approach provides the optimal predictive ability. Hoeting and others (1999) summarize recent work using BMA.


Similar to (21), $p(M_2 \mid X_T) = \dfrac{p(X_T \mid M_2)\,p(M_2)}{p(X_T \mid M_1)\,p(M_1) + p(X_T \mid M_2)\,p(M_2)}$, and $p(M_1 \mid X_T) + p(M_2 \mid X_T) = 1$.


Tierney and Kadane (1986) show that the error in (27) is $O(n^{-1})$ so that $n\,O(n^{-1}) \to$ constant as $n \to \infty$.


Note that in (29) the terms $\log p(\hat{\theta} \mid M)$, $(d/2)\log(2\pi)$, $\tfrac{1}{2}\log|i|$, and $O(n^{-1/2})$ are all of order $O(1)$ or less.


For example, the summation and implicit integrations in (39) below may be difficult to compute. Proposed solutions to this problem include Markov Chain Monte Carlo techniques such as the MC$^3$ sampler first used by Madigan and York (1995), or averaging over a subset of models that are supported by the data, such as the Occam’s window method of Madigan and Raftery (1994).


In the last section we discuss how our proposed methodology addresses some of the weaknesses of BMA, and how it compares to the approaches of Brock and Durlauf (2001) and Doppelhofer, Miller, and Sala-i-Martin (2000).


Note that $B_{00} = 1$ and $\mathrm{BIC}_0 = 0$.
