Limited Information Bayesian Model Averaging for Dynamic Panels with Short Time Periods

Abstract

Bayesian Model Averaging (BMA) provides a coherent mechanism to address the problem of model uncertainty. In this paper we extend the BMA framework to panel data models where the lagged dependent variable as well as endogenous variables appear as regressors. We propose a Limited Information Bayesian Model Averaging (LIBMA) methodology and then test it using simulated data. Simulation results suggest that asymptotically our methodology performs well both in Bayesian model selection and averaging. In particular, LIBMA recovers the data generating process very well, with high posterior inclusion probabilities for all the relevant regressors, and parameter estimates very close to the true values. These findings suggest that our methodology is well suited for inference in dynamic panel data models with short time periods in the presence of endogenous regressors under model uncertainty.

I. Introduction

Model uncertainty is an issue often encountered in the econometric study of socioeconomic phenomena. Initially pointed out by Leamer (1978) and later elaborated by Durlauf and Quah (1999), model uncertainty arises because the lack of clear theoretical guidance and the tradeoffs in the choice of regressors result in a wide range of possible specifications and often contradictory conclusions. In addition, attempts to deal with model uncertainty by engaging in unsystematic searches over possible model configurations may result in overconfident and often fragile inferences. As a result, a growing number of researchers are turning to the Bayesian Model Averaging (BMA) framework to deal with the problem of model uncertainty.

Conceptually, BMA bases inferences on a weighted average of the full model space instead of on one selected model, and thus incorporates uncertainty in both predictions and parameter estimates.2 Seminal contributions to BMA include those of Moulton (1991), Madigan and Raftery (1994), Kass and Raftery (1995), Raftery (1995), and Raftery, Madigan and Hoeting (1997). The BMA framework has been applied in various areas of social sciences.3 In economics, some of the most notable work includes Brock and Durlauf (2001), Fernández, Ley and Steel (2001a), and Sala-i-Martin, Doppelhofer and Miller (2004). Despite the increasing interest in BMA, most of the work thus far uses static models, focusing mainly on cross section analysis with data averaged over the time dimension, thus ignoring dynamic relationships among variables.4 Moreover, to the best of our knowledge, none of the models allow for the inclusion of endogenous variables.5

In this paper we propose a methodology for dealing with model uncertainty in the context of panel data models with short time periods where the lagged dependent variable and endogenous variables appear as regressors. We use a limited information approach which refines the limited information version of Bayesian Model Averaging (LIBMA) introduced by Tsangarides (2004). The limited information criterion proposed in this paper resembles the BIC model and moment selection criterion (MMSC-BIC) proposed by Andrews and Lu (2001), and parallels the one proposed by Hong and Preston (2008). One key difference of our approach is that we construct the likelihood by data transformation and straightforward Bayesian arguments.6 We also investigate the performance of the proposed framework in both Bayesian model selection and averaging by performing Monte Carlo simulations.

The remainder of the paper is organized as follows. In Section 2 we introduce the concept of model uncertainty in the Bayesian context and then review model selection and model averaging. Section 3 develops the theoretical framework of the LIBMA methodology in the context of dynamic panels with endogenous regressors. It includes the model setup, the moment conditions, the limited information criterion, and estimation. Section 4 discusses the proposed simulation experiment and presents the results. The final section concludes.

II. Model Uncertainty in the Bayesian Context

For completeness, this section briefly reviews the basic treatment of model uncertainty in the Bayesian context. Excellent reviews include Hoeting, Madigan, Raftery and Volinsky (1999), and Chipman, George and McCulloch (2001).

A. Model Selection and Hypothesis Testing

Consider the standard linear regression model

$$Y = Z\theta + u \qquad (1)$$

where Y is the variable of interest, Z is a matrix of explanatory variables, θ is a vector of unknown parameters, and u is the error term. Suppose there is a universe of k possible explanatory variables indexed by U = {1, 2, …, j, j+1, …, k}, and let Z be the matrix of all k possible explanatory variables. For a given model M_j that considers only a subset of the possible explanatory variables, M_j ⊆ U, let $C_{M_j} = \{c_{mn,M_j}\}_{m,n=1}^{k}$ be a k × k diagonal choice matrix whose diagonal entries are 1 if the corresponding variable is included in the model and 0 otherwise; that is, $c_{ii,M_j} = 1\{i \in M_j\}$. Therefore, for a given model M_j, the relevant regressor matrix is $Z C_{M_j}$ and model (1) can be written more generally as

$$Y = Z C_{M_j}\theta + u \qquad (2)$$

where θ = (θ_1, θ_2, …, θ_k)′ is the vector of parameters to be estimated.

Given the universe of k possible explanatory variables, a set of K = 2^k models ℳ = (M_1, …, M_K) is under consideration. In the spirit of Bayesian inference, one can specify priors p(θ | M_j) for the parameters of each model, and a prior p(M_j) for each model in the model space ℳ.
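For concreteness, the 2^k model space and the associated choice matrices can be enumerated with binary masks. The following sketch (our own illustration; the names are not from the paper) shows one way to do this:

```python
import numpy as np

def choice_matrix(model_index, k):
    """Diagonal choice matrix C_Mj: entry (i, i) is 1 iff variable i is in M_j."""
    included = [(model_index >> i) & 1 for i in range(k)]
    return np.diag(np.array(included, dtype=float))

def enumerate_models(k):
    """Yield all K = 2**k models as (index, choice matrix) pairs."""
    for j in range(2 ** k):
        yield j, choice_matrix(j, k)

# Example: with k = 3 candidate regressors there are 8 models;
# model 5 = binary 101 includes variables 0 and 2.
C5 = choice_matrix(5, 3)   # diag(1, 0, 1)
```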

Model selection seeks to find the model M_j in ℳ = (M_1, …, M_K) that actually generated the data. Let D = (Y, Z) denote the data set available to the researcher. The probability that M_j is the correct model, given the data D, is, by Bayes' rule,

$$p(M_j \mid D) = \frac{p(D \mid M_j)\, p(M_j)}{\sum_{l=1}^{K} p(D \mid M_l)\, p(M_l)} \qquad (3)$$

where

$$p(D \mid M_j) = \int p(D \mid \theta_j, M_j)\, p(\theta_j \mid M_j)\, d\theta_j \qquad (4)$$

is the marginal probability of the data given model Mj.

Based on the posterior probabilities, the comparison of model M_j against M_i is expressed by the posterior odds ratio
$$\frac{p(M_j \mid D)}{p(M_i \mid D)} = \frac{p(D \mid M_j)}{p(D \mid M_i)} \cdot \frac{p(M_j)}{p(M_i)}.$$
Essentially, the data update the prior odds ratio p(M_j)/p(M_i) through the Bayes factor p(D | M_j)/p(D | M_i), which measures the extent to which the data support M_j over M_i. When the posterior odds ratio is greater (less) than 1, the data favor M_j over M_i (M_i over M_j). Often the prior odds ratio is set to 1, representing the lack of preference for either model,7 in which case the posterior odds ratio equals the Bayes factor B_ji.
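As a minimal illustration (ours, with hypothetical marginal likelihood values), the comparison reduces to a few lines once the marginal likelihoods are available; working in logs avoids numerical underflow:

```python
import numpy as np

def posterior_odds(log_ml_j, log_ml_i, prior_j=0.5, prior_i=0.5):
    """Posterior odds of M_j over M_i: the Bayes factor times the prior odds.
    Marginal likelihoods are passed in logs for numerical stability."""
    log_bf = log_ml_j - log_ml_i          # log Bayes factor B_ji
    return np.exp(log_bf) * (prior_j / prior_i)

# With equal priors the posterior odds equal the Bayes factor:
odds = posterior_odds(log_ml_j=-512.3, log_ml_i=-515.9)  # hypothetical values
# odds > 1 means the data favor M_j over M_i.
```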

B. Bayesian Model Averaging

A natural strategy for model selection is to choose the most probable model M_j, namely the one with the highest posterior probability p(M_j | D). Alternatively, especially in cases where the posterior mass of the model space ℳ is not concentrated on a single model, it is possible to average across models using the posterior model probabilities as weights. Raftery, Madigan, and Hoeting (1997) show that BMA almost always improves on variable selection in terms of predictive performance.

Using Bayesian Model Averaging, inference for a quantity of interest Γ can be constructed based on the posterior distribution

$$p(\Gamma \mid D) = \sum_{j=1}^{K} p(\Gamma \mid D, M_j)\, p(M_j \mid D) \qquad (5)$$

which follows by the law of total probability. Therefore, the full posterior distribution of Γ is a weighted average of the posterior distributions under each model (M_1, …, M_K), where the weights are the posterior model probabilities p(M_j | D), obtained using (3). Using (5) one can compute the posterior mean and posterior variance of each parameter θ_l as follows:

$$E(\theta_l \mid D) = \sum_{j=1}^{K} p(M_j \mid D)\, E(\theta_l \mid D, M_j) \qquad (6)$$

and

$$Var(\theta_l \mid D) = \sum_{j=1}^{K} p(M_j \mid D)\, Var(\theta_l \mid D, M_j) + \sum_{j=1}^{K} p(M_j \mid D)\left[E(\theta_l \mid D, M_j) - E(\theta_l \mid D)\right]^2. \qquad (7)$$
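A small sketch of the pooling in (6) and (7), assuming the per-model posterior moments have already been computed (the function name and example numbers are ours):

```python
import numpy as np

def bma_moments(post_model_probs, cond_means, cond_vars):
    """Pool per-model posterior means/variances of a coefficient theta_l
    into its BMA posterior mean and variance, following (6) and (7)."""
    p = np.asarray(post_model_probs)   # p(M_j | D), summing to 1
    m = np.asarray(cond_means)         # E(theta_l | D, M_j)
    v = np.asarray(cond_vars)          # Var(theta_l | D, M_j)
    mean = np.sum(p * m)                                   # (6)
    var = np.sum(p * v) + np.sum(p * (m - mean) ** 2)      # (7)
    return mean, var

# Hypothetical three-model example; a variable excluded from a model
# contributes a conditional mean and variance of zero.
mean, var = bma_moments([0.6, 0.3, 0.1], [0.10, 0.08, 0.0], [1e-3, 2e-3, 0.0])
```

Note that the second term in (7) adds the between-model spread of the estimates, so the BMA variance properly reflects model uncertainty, not just within-model sampling variance.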

The implementation of BMA presents a number of challenges, including the evaluation of the marginal probability in (4), the large number of possible models, and the specification of the prior model probabilities p(M_j) as well as the parameter priors p(θ | M_j).

C. Choice of Priors

Evaluating the Bayes factors needed for hypothesis testing and for Bayesian model selection or model averaging requires calculating the marginal likelihood

$$p(D \mid M_j) = \int p(D \mid \theta, M_j)\, p(\theta \mid M_j)\, d\theta. \qquad (8)$$

Here, the dimension of the parameter θ is determined by model M_j. In many cases the likelihood p(D | θ, M_i) is fully specified only up to some nuisance parameter ζ. Therefore we may write

$$p(D \mid M_i) = \iint p(D \mid \theta, \zeta, M_i)\, p(\theta, \zeta \mid M_i)\, d\theta\, d\zeta. \qquad (9)$$

In this case, determining the prior p(θ, ζ | M_i) becomes an important issue.8

For Gaussian models the nuisance parameter is the variance σ_u² of the noise term. A common choice of prior for the pair (θ, σ_u²) is its conjugate prior, the Normal-Gamma distribution, which has the benefit of yielding a closed-form posterior.9 Under this prior, θ given σ_u² is a Normal random variable with mean θ_0 and variance σ_u² V, while σ_u² is a Gamma random variable with mean γλ and variance γλ². Based on this prior, when (1) represents a Gaussian panel data model with fixed effects, the likelihood ratio of two different models, M_i and M_j, becomes

$$\frac{p(D \mid M_i)}{p(D \mid M_j)} = \left( \frac{|I + \tilde{Z}_j'\tilde{Z}_j V_j|}{|I + \tilde{Z}_i'\tilde{Z}_i V_i|} \right)^{1/2} \times \left( \frac{2\lambda + SSE_j + (\hat{\theta}_j - \theta_j)'\big(V_j + (\tilde{Z}_j'\tilde{Z}_j)^{-1}\big)^{-1}(\hat{\theta}_j - \theta_j)}{2\lambda + SSE_i + (\hat{\theta}_i - \theta_i)'\big(V_i + (\tilde{Z}_i'\tilde{Z}_i)^{-1}\big)^{-1}(\hat{\theta}_i - \theta_i)} \right)^{(\gamma + N(T-1))/2}$$

where Z̃ stands for the demeaned values of Z. Because the Bayes factors are sensitive to the prior parameters {θ_0, V, γ, λ}, one often avoids choosing specific values for them so as not to affect the posterior distribution substantially. As discussed in Kass and Wasserman (1995), and Fernández, Ley and Steel (2001a), one possibility is to use a diffuse prior for σ_u with density p(σ_u) ∝ σ_u^{-1}. This prior has a convenient scale-invariance property and is equivalent to setting γ = λ = 0 in the Gamma distribution of σ_u². For the prior distribution of θ conditional on σ_u², one popular choice is Zellner's g-prior with zero mean,

$$p(\theta \mid \sigma_u^2) \sim N\!\left(0,\; g^{-1}(\tilde{Z}'\tilde{Z})^{-1}\sigma_u^2\right)$$

which can be motivated by the fact that the covariance of the OLS estimate θ̂ is proportional to (Z̃′Z̃)^{-1}σ_u². This also leads to a simple likelihood ratio

$$\frac{p(D \mid M_i)}{p(D \mid M_j)} = \frac{(1+g_i^{-1})^{-k_i/2}}{(1+g_j^{-1})^{-k_j/2}} \left( \frac{\frac{1}{g_j+1}\, SSE_j + \frac{g_j}{g_j+1}\,\tilde{y}'\tilde{y}}{\frac{1}{g_i+1}\, SSE_i + \frac{g_i}{g_i+1}\,\tilde{y}'\tilde{y}} \right)^{N(T-1)/2}$$

where ỹ stands for the demeaned values of y. The BACE procedure (proposed by Sala-i-Martin, Doppelhofer, and Miller (2004)) is asymptotically equivalent to setting g = N(T − 1).
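For illustration, the log of this likelihood ratio can be computed directly from the sufficient statistics; the following is a sketch under our reading of the expression above (the function name and input layout are ours):

```python
import numpy as np

def log_lr_gprior(sse_i, sse_j, k_i, k_j, g_i, g_j, yty, n_obs):
    """Log of p(D|M_i)/p(D|M_j) under Zellner's g-prior, following the
    expression above; n_obs = N(T-1), yty = demeaned y'y. All scalars."""
    term_dim = -0.5 * k_i * np.log1p(1.0 / g_i) + 0.5 * k_j * np.log1p(1.0 / g_j)
    fit_j = sse_j / (g_j + 1.0) + g_j / (g_j + 1.0) * yty
    fit_i = sse_i / (g_i + 1.0) + g_i / (g_i + 1.0) * yty
    return term_dim + 0.5 * n_obs * (np.log(fit_j) - np.log(fit_i))
```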

Alternatively, one can use what has been labeled in the literature as the BIC approach, where the likelihood ratio is approximated by

$$\frac{p(D \mid M_1)}{p(D \mid M_2)} \approx \exp\!\left(-\frac{1}{2}(SSE_1 - SSE_2) - \frac{1}{2}(k_1 - k_2)\log\big(N(T-1)\big)\right).$$

Here the approximation is O_p(N^{-1/2}) when the implicit prior for θ is the unit information Normal prior, as discussed in Kass and Wasserman (1995) and Kass and Raftery (1995).
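A minimal sketch of this BIC-style approximation (ours, with hypothetical inputs); here n_obs stands for the effective sample size N(T − 1):

```python
import numpy as np

def log_bf_bic(sse_1, sse_2, k_1, k_2, n_obs):
    """BIC-style approximation to log[p(D|M_1)/p(D|M_2)] as given above;
    n_obs = N(T-1) is the effective sample size."""
    return -0.5 * (sse_1 - sse_2) - 0.5 * (k_1 - k_2) * np.log(n_obs)

# Hypothetical example: M_1 fits slightly worse but uses two fewer regressors.
log_bf = log_bf_bic(sse_1=103.0, sse_2=101.5, k_1=4, k_2=6, n_obs=600)
# log_bf > 0, so the penalty term favors the more parsimonious M_1 here.
```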

Finally, several options exist for the specification of the model priors p(M_j). For example, Fernández, Ley and Steel (2001b) assume a Uniform distribution over the model space, essentially implying that there is no preference for a specific model, so p(M_1) = p(M_2) = ⋯ = p(M_K) = 1/K. Other options include penalizing models with more regressors. Sala-i-Martin, Doppelhofer and Miller (2004) use a prior model probability structure initially proposed by Mitchell and Beauchamp (1988). Assuming that each variable has an equal inclusion probability, the prior probability of model M_j is

$$p(M_j) = \left(\frac{\bar{k}}{k}\right)^{k_j} \left(1 - \frac{\bar{k}}{k}\right)^{k - k_j} \qquad (10)$$

and the prior odds ratio is

$$\frac{p(M_j)}{p(M_l)} = \left(\frac{\bar{k}}{k}\right)^{k_j - k_l} \left(1 - \frac{\bar{k}}{k}\right)^{k_l - k_j} \qquad (11)$$

where k is the total number of regressors, k̄ is the researcher's prior about the size of the model, k_j is the number of variables included in model M_j, and k̄/k is the prior inclusion probability of each variable.
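In code, the prior in (10) is one line; a sketch (names ours) which also shows that setting k̄ = k/2 reproduces the uniform prior over models:

```python
import numpy as np

def log_model_prior(k_j, k, k_bar):
    """Log prior probability of a model with k_j included variables, per (10);
    k_bar / k is the prior inclusion probability of each variable."""
    pi = k_bar / k
    return k_j * np.log(pi) + (k - k_j) * np.log(1.0 - pi)

# With k = 9 candidates and prior model size k_bar = 4.5, the inclusion
# probability is 0.5 and all 2**9 models receive equal prior mass:
equal = log_model_prior(3, 9, 4.5) == log_model_prior(7, 9, 4.5)  # True
```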

III. Limited Information Bayesian Model Averaging

This section provides a discussion of the LIBMA using a dynamic panel data model with endogenous and exogenous regressors and derives the limited information criterion using the moment conditions implied by the GMM framework.

A. A Dynamic Panel Data Model with Endogenous Regressors

Let us consider the case where a researcher is faced with model uncertainty when trying to estimate a dynamic model for panel data. We assume that the universe of potential explanatory variables, indexed by the set U, consists of the lagged dependent variable, indexed by 1, a set of m exogenous variables, indexed by X, as well as a set of q endogenous variables, indexed by W, such that {{1}, X, W} is a partition of U. Therefore, for a given model M_j ⊆ U, (2) becomes

$$y_{it} = (y_{i,t-1}\;\; x_{it}\;\; w_{it})\, C_{M_j}\, (\alpha\;\; \theta_x\;\; \theta_w)' + u_{it}, \qquad u_{it} = \eta_i + \upsilon_{it},$$
$$|\alpha| < 1; \quad i = 1, 2, \ldots, N; \quad t = 1, 2, \ldots, T. \qquad (12)$$

Here yit, xit and wit are observed variables, ηi is the unobserved individual effect while vit is the idiosyncratic random error. The exact distributions for vit and ηi are not specified here, but assumptions about some of their moments and correlation with the regressors are made explicit below. It is assumed that E (vit) = 0 and that vit ’s are not serially correlated. xit is a 1 × m vector of exogenous variables while wit is a 1 × q vector of endogenous variables. Therefore the total number of possible explanatory variables is k = m + q + 1. The observed variables span N individuals and T periods, where T is small relative to N. The unknown parameters α, θx, and θw are to be estimated. In this model, α is a scalar, θx is a 1 × m vector while θw is a 1 × q vector.

Given the assumptions made so far, for any model Mj, and any set of exogenous variables, xit, we have

$$E(x_{it}^l\, \upsilon_{is}) = 0, \quad \forall\, i, t, s; \;\; \forall\, x_{it}^l \in x_{it}. \qquad (13)$$

Similarly, for any endogenous variable we have

$$E(w_{it}^l\, \upsilon_{is}) \begin{cases} \neq 0, & s \leq t \\ = 0, & \text{otherwise}, \end{cases} \qquad \forall\, w_{it}^l \in w_{it}. \qquad (14)$$

Note that, in principle, the correlations between endogenous variables and the idiosyncratic error may change over different individuals and/or periods.

B. Estimation and Moment Conditions

A common approach to estimating model (12) is the system GMM framework developed by Blundell and Bond (1998). This involves constructing the instrument set and moment conditions for the "level equations" (12) and combining them with the moment conditions based on the instruments for the first-differenced equations. The first-differenced (FD) equations corresponding to model (12) are given by

$$\Delta y_{it} = (\Delta y_{i,t-1}\;\; \Delta x_{it}\;\; \Delta w_{it})\, C_{M_j}\, (\alpha\;\; \theta_x\;\; \theta_w)' + \Delta\upsilon_{it},$$
$$|\alpha| < 1; \quad i = 1, 2, \ldots, N; \quad t = 2, 3, \ldots, T. \qquad (15)$$

One assumption required for the FD equations is that the initial value of y, y_{i0}, is predetermined, that is, E(y_{i0} υ_{is}) = 0 for s = 2, 3, …, T. Since y_{i,t−2} is not correlated with Δυ_{it}, we can use it as an instrument; hence E(y_{i,t−2} Δυ_{it}) = 0 for t = 2, 3, …, T. Moreover, y_{i,t−3} is also not correlated with Δυ_{it}, so as long as we have enough observations, that is T ≥ 3, y_{i,t−3} can also be used as an instrument. Assuming that we have more than two observations in the time dimension, the following moment conditions can be used for estimation:

$$E(y_{i,t-s}\, \Delta\upsilon_{it}) = 0, \quad t = 2, 3, \ldots, T;\;\; s = 2, 3, \ldots, t;\;\; \text{for } T \geq 2;\;\; i = 1, 2, \ldots, N. \qquad (16)$$

Similarly, the first difference of each exogenous variable, Δx_{it}^l, x_{it}^l ∈ x_{it}, is not correlated with Δυ_{it}, and therefore we can use it as an instrument.10 This gives us additional moment conditions:

$$E(\Delta x_{it}^l\, \Delta\upsilon_{it}) = 0, \quad t = 2, 3, \ldots, T;\;\; l = 1, \ldots, m;\;\; i = 1, 2, \ldots, N. \qquad (17)$$

The lagged endogenous variable w_{i,t−2}^l, w_{it}^l ∈ w_{it}, is not correlated with Δυ_{it} and therefore can be used as an instrument. We have the following possible moment conditions:

$$E(w_{i,t-s}^l\, \Delta\upsilon_{it}) = 0, \quad t = 3, 4, \ldots, T;\;\; s = 2, \ldots, t-1;\;\; \text{for } T \geq 3;\;\; l = 1, 2, \ldots, q;\;\; i = 1, \ldots, N. \qquad (18)$$

Table 1 summarizes the moment conditions that could be used for the FD equation.

Table 1. Moment Conditions for the FD Equation

    Instruments                        Moment conditions    Number
    y_{i,t-s},  s = 2, ..., t          (16)                 T(T-1)/2
    Δx_{it}^l,  l = 1, ..., m          (17)                 m(T-1)
    w_{i,t-s}^l,  s = 2, ..., t-1      (18)                 q(T-2)(T-1)/2

The FD equation provides T (T − 1)/2 moment conditions for the lagged dependent variable, m(T −1) moment conditions for the exogenous variables, and q (T − 2)(T − 1)/2 moment conditions for the endogenous variables.

Going back to the equation in levels (12), first differences of the lagged dependent variable are not correlated with either the individual effects or the idiosyncratic error term, and hence we can use the following moment conditions:

$$E(\Delta y_{i,t-1}\, u_{it}) = 0, \quad t = 2, 3, \ldots, T. \qquad (19)$$

Similarly, for the endogenous variables, the first difference Δw_{i,t−1}^l is not correlated with u_{it}. Therefore, assuming that w_{i0}^l is observable, and as long as T ≥ 3, we have the following additional moment conditions:

$$E(\Delta w_{i,t-1}^l\, u_{it}) = 0, \quad t = 3, 4, \ldots, T;\;\; l = 1, 2, \ldots, q. \qquad (20)$$

Finally, based on the assumptions made so far, the exogenous variables x_{it}^l ∈ x_{it} are not correlated with current realizations of u_{it}, and hence one can use another set of moment conditions:11

$$E(x_{it}^l\, u_{it}) = 0, \quad t = 1, 2, \ldots, T;\;\; l = 1, 2, \ldots, m. \qquad (21)$$

Table 2 summarizes the moment conditions for the level equation.

Table 2. Moment Conditions for the Level Equation

    Instruments                        Moment conditions    Number
    Δy_{i,t-1}                         (19)                 T-1
    Δw_{i,t-1}^l,  l = 1, ..., q       (20)                 q(T-2)
    x_{it}^l,  l = 1, ..., m           (21)                 mT

The equation in levels provides (T − 1) moment conditions for the lagged dependent variable, mT moment conditions for the exogenous variables, and q(T− 2) moment conditions for the endogenous variables.
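As a cross-check, the counts in Tables 1 and 2 can be tallied programmatically; a sketch (ours) that also verifies the column dimension of the instrument matrix G_i defined in (28) below:

```python
def moment_counts(T, m, q):
    """Tally the moment conditions of Tables 1 and 2."""
    fd = T * (T - 1) // 2 + m * (T - 1) + q * (T - 2) * (T - 1) // 2
    level = (T - 1) + m * T + q * (T - 2)
    # The exogenous-variable conditions are later aggregated over t into a
    # single condition per variable (see below), so the count that matches
    # the column dimension of G_i in (28) replaces m(T-1) + mT with m:
    total = fd + level - m * (T - 1) - m * T + m
    return fd, level, total

# Simulation design of Section IV: T = 4, m = 6, q = 2 -> (30, 31, 25),
# and indeed m - 1 + (T+1)((T-2)q + T)/2 = 25.
assert moment_counts(4, 6, 2) == (30, 31, 25)
```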

We group the moment conditions into matrices in the following way. Let Y_i be the (T − 1) × T(T − 1)/2 matrix of lagged dependent variables used as instruments for the FD equation,

$$Y_i = \begin{pmatrix} y_{i0} & 0 & 0 & \cdots & 0 & \cdots & 0 \\ 0 & y_{i0} & y_{i1} & \cdots & 0 & \cdots & 0 \\ \vdots & & & \ddots & & & \vdots \\ 0 & 0 & 0 & \cdots & y_{i0} & \cdots & y_{i,T-2} \end{pmatrix} \qquad (22)$$

Similarly, Wi denotes the (T − 1) × q (T − 2)(T − 1)/2 matrix of endogenous variables

$$W_i = \begin{pmatrix} 0 & \cdots & 0 & 0 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ w_{i1}^1 & \cdots & w_{i1}^q & 0 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ 0 & \cdots & 0 & w_{i1}^1 & \cdots & w_{i2}^q & \cdots & 0 & \cdots & 0 \\ \vdots & & & & & & \ddots & & & \vdots \\ 0 & \cdots & 0 & 0 & \cdots & 0 & \cdots & w_{i1}^1 & \cdots & w_{i,T-2}^q \end{pmatrix} \qquad (23)$$

For the level equation we have the T × (T − 1) instruments matrix DYi consisting of first differences of the dependent variable and the T × q (T − 2) instruments matrix DWi consisting of first differences of the endogenous variables

$$DY_i = \begin{pmatrix} 0 & 0 & 0 & \cdots & 0 \\ \Delta y_{i1} & 0 & 0 & \cdots & 0 \\ 0 & \Delta y_{i2} & 0 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \Delta y_{i,T-1} \end{pmatrix} \qquad DW_i = \begin{pmatrix} 0 & \cdots & 0 & \cdots & 0 \\ 0 & \cdots & 0 & \cdots & 0 \\ \Delta w_{i2}^1 & \cdots & \Delta w_{i2}^q & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & \Delta w_{i,T-1}^q \end{pmatrix}. \qquad (24)$$

Further let Xi and DXi denote the following T × m and (T −1) × m matrices of exogenous and first differenced exogenous variables, respectively

$$DX_i = \begin{pmatrix} \Delta x_{i2}^1 & \Delta x_{i2}^2 & \Delta x_{i2}^3 & \cdots & \Delta x_{i2}^m \\ \Delta x_{i3}^1 & \Delta x_{i3}^2 & \Delta x_{i3}^3 & \cdots & \Delta x_{i3}^m \\ \vdots & & & & \vdots \\ \Delta x_{iT}^1 & \Delta x_{iT}^2 & \Delta x_{iT}^3 & \cdots & \Delta x_{iT}^m \end{pmatrix} \qquad X_i = \begin{pmatrix} x_{i1}^1 & x_{i1}^2 & x_{i1}^3 & \cdots & x_{i1}^m \\ x_{i2}^1 & x_{i2}^2 & x_{i2}^3 & \cdots & x_{i2}^m \\ \vdots & & & & \vdots \\ x_{iT}^1 & x_{iT}^2 & x_{iT}^3 & \cdots & x_{iT}^m \end{pmatrix}. \qquad (25)$$

For the exogenous variables, we aggregate the moment conditions across all periods from both the first-difference equation and the level equation. Thus, we are left with one moment condition for each exogenous variable:

$$\sum_{t=2}^{T} E(\Delta x_{it}^l\, \Delta\upsilon_{it}) + \sum_{t=1}^{T} E(x_{it}^l\, u_{it}) = 0, \quad l = 1, \ldots, m;\;\; i = 1, 2, \ldots, N.$$

Let u_i and Dυ_i denote the T × 1 and (T − 1) × 1 vectors of the error term and the first-differenced idiosyncratic random error, respectively, as defined in model (12):

$$u_i = \begin{pmatrix} u_{i1} \\ u_{i2} \\ \vdots \\ u_{iT} \end{pmatrix} \qquad D\upsilon_i = \begin{pmatrix} \Delta\upsilon_{i2} \\ \Delta\upsilon_{i3} \\ \vdots \\ \Delta\upsilon_{iT} \end{pmatrix}. \qquad (26)$$

We can define a (2T − 1) × 1 vector U_i = (u_i′, Dυ_i′)′ that stacks the error term and the first-differenced idiosyncratic random error. The moment conditions can now be written in matrix form

$$E[G_i' U_i] = 0 \qquad (27)$$

where Gi is a (2T −1)×(m − 1 + (T + 1)((T − 2)q + T)/2) matrix defined as

$$G_i = \begin{pmatrix} X_i & DY_i & 0_{T \times T(T-1)/2} & DW_i & 0_{T \times q(T-1)(T-2)/2} \\ DX_i & 0_{(T-1) \times (T-1)} & Y_i & 0_{(T-1) \times q(T-2)} & W_i \end{pmatrix}. \qquad (28)$$
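To make the construction concrete, the following sketch assembles G_i from the raw data for one individual. The code and its array-layout conventions are ours, not the authors'; it assumes y, x, and w are observed from t = 0 onward (x is used only from t = 1):

```python
import numpy as np

def build_instruments(y, x, w):
    """Assemble the instrument matrix G_i of (28) for one individual.

    y : shape (T+1,)    levels y_i0, ..., y_iT
    x : shape (T+1, m)  exogenous variables (row t holds x_it)
    w : shape (T+1, q)  endogenous variables

    Returns G_i with 2T-1 rows: the first T rows pair with the level
    errors u_i, the last T-1 rows with the differenced errors Dv_i.
    """
    T = len(y) - 1
    m, q = x.shape[1], w.shape[1]

    # FD-equation blocks, one row per t = 2, ..., T
    Y = np.zeros((T - 1, T * (T - 1) // 2))             # lagged y levels, (22)
    W = np.zeros((T - 1, q * (T - 2) * (T - 1) // 2))   # lagged w levels, (23)
    cy = cw = 0
    for r, t in enumerate(range(2, T + 1)):
        Y[r, cy:cy + t - 1] = y[0:t - 1]                # y_i0, ..., y_{i,t-2}
        cy += t - 1
        if t >= 3:
            block = w[1:t - 1, :].ravel()               # w_i1, ..., w_{i,t-2}
            W[r, cw:cw + block.size] = block
            cw += block.size

    # Level-equation blocks, one row per t = 1, ..., T
    DY = np.zeros((T, T - 1))                           # Δy instruments, (24)
    DW = np.zeros((T, q * (T - 2)))
    for t in range(2, T + 1):
        DY[t - 1, t - 2] = y[t - 1] - y[t - 2]          # Δy_{i,t-1}
    for t in range(3, T + 1):
        DW[t - 1, (t - 3) * q:(t - 2) * q] = w[t - 1, :] - w[t - 2, :]

    Xlev = x[1:T + 1, :]                                # x_i1, ..., x_iT, (25)
    DX = x[2:T + 1, :] - x[1:T, :]                      # Δx_i2, ..., Δx_iT

    top = np.hstack([Xlev, DY, np.zeros((T, Y.shape[1])), DW,
                     np.zeros((T, W.shape[1]))])
    bottom = np.hstack([DX, np.zeros((T - 1, T - 1)), Y,
                        np.zeros((T - 1, q * (T - 2))), W])
    return np.vstack([top, bottom])
```

For T = 4, m = 6, and q = 2, the returned matrix is 7 × 25, matching the dimension (2T − 1) × (m − 1 + (T + 1)((T − 2)q + T)/2) stated above.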

Based on the moment conditions (27) we propose a limited information criterion that can be used in Bayesian model selection and averaging. In the next section we provide details on how to construct this criterion.

C. The Limited Information Criterion

As we pointed out in section II.C, evaluating the Bayes factors needed for hypothesis testing and Bayesian model selection or model averaging requires calculating the marginal likelihood

$$p(D \mid M_j) = \int p(D \mid \theta, M_j)\, p(\theta \mid M_j)\, d\theta. \qquad (29)$$

Given that we choose the Generalized Method of Moments (GMM) for estimating the parameters of the model, the assumptions we have made so far do not yield a fully specified parametric likelihood p(D | θ, M_j). Therefore, we have to build the model likelihood in a fashion consistent with the Bayesian paradigm, using the information provided by the moment conditions.

The construction of non-parametric likelihood functions has lately received a good deal of attention in the literature, and several approaches have been used to derive or estimate them. For example, Back and Brown (1993) provide a method of estimating a distribution function using only information derived from moment restrictions. Kim (2002) uses information projection onto a family of probability measures and constructs the likelihood through a transformation of the GMM objective function. Hong and Preston (2008) build a quasi-likelihood based on objective functions used for extremum estimation (see also Chernozhukov and Hong (2003)). Schennach (2005) builds a likelihood function that is the nonparametric limit of a formal Bayesian procedure in which the prior over data distributions favors those with large entropy and is conditioned on the moment equations; in this fashion it becomes feasible to compute a likelihood function closely related to empirical likelihood. Finally, Ragusa (2008) projects a reference distribution onto the space of distributions consistent with a set of moment restrictions and obtains the likelihood by integrating out the nuisance parameters.

In this section we propose a method of constructing the model likelihoods and posteriors based only on the information elicited from the moment conditions (27). Suppose we have a strictly stationary and ergodic random process {ξ_i}_{i=1}^∞ taking values in a space Ξ, a parameter space Θ ⊂ R^k, and a function g : Ξ × Θ → R^l which satisfies the following conditions:

  1. g(ξ, θ) is continuous in θ on Θ;

  2. E[g(ξ_i, θ)] exists and is finite for every θ ∈ Θ; and

  3. E[g(ξ_i, θ)] is continuous on Θ.

We further assume that the moment conditions E[g(ξ_i, θ)] = 0 hold for a unique unknown θ_0 ∈ Θ. Let $\hat{g}_N(\theta) = N^{-1}\sum_{i=1}^N g(\xi_i, \theta)$ denote the sample mean of the moment conditions, and assume that E[g(ξ_i, θ_0) g′(ξ_i, θ_0)] and $S(\theta_0) \equiv \lim_{N\to\infty} \mathrm{Var}\big[N^{1/2}\hat{g}_N(\theta_0)\big]$ exist and are finite positive definite matrices. Then the following standard result holds (for a proof see Hall (2005), Lemma 3.2).

Lemma 1. Under the above assumptions, $N^{1/2}\hat{g}_N(\theta_0) \stackrel{d}{\to} N(0, S(\theta_0))$.

That is, the random vector $N^{1/2}\hat{g}_N(\theta_0)$ converges in distribution to a multivariate Normal distribution.

For model (12), the moment conditions for individual i discussed in the previous section can be written in the following form

$$g(\xi_i, \theta) = G_i'\big(\tilde{y}_i - \tilde{Z}_i\,\theta\big) \qquad (30)$$

where ξ_i = {ỹ_i, Z̃_i}, Z̃_i = (ỹ_{i,−1} X̃_i W̃_i), and θ = (α θ_x θ_w)′, while G_i is the matrix defined in (28). The vectors ỹ_i and ỹ_{i,−1} for the dependent variable and the lagged dependent variable, respectively, are defined as follows:

$$\tilde{y}_i = \begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{iT} \\ \Delta y_{i2} \\ \Delta y_{i3} \\ \vdots \\ \Delta y_{iT} \end{pmatrix} \qquad \tilde{y}_{i,-1} = \begin{pmatrix} y_{i0} \\ y_{i1} \\ \vdots \\ y_{i,T-1} \\ \Delta y_{i1} \\ \Delta y_{i2} \\ \vdots \\ \Delta y_{i,T-1} \end{pmatrix}. \qquad (31)$$

The matrix X̃_i for the exogenous variables is given by

$$\tilde{X}_i = \begin{pmatrix} x_{i1}^1 & x_{i1}^2 & x_{i1}^3 & \cdots & x_{i1}^m \\ \vdots & & & & \vdots \\ x_{iT}^1 & x_{iT}^2 & x_{iT}^3 & \cdots & x_{iT}^m \\ \Delta x_{i2}^1 & \Delta x_{i2}^2 & \Delta x_{i2}^3 & \cdots & \Delta x_{i2}^m \\ \vdots & & & & \vdots \\ \Delta x_{iT}^1 & \Delta x_{iT}^2 & \Delta x_{iT}^3 & \cdots & \Delta x_{iT}^m \end{pmatrix} \qquad (32)$$

while the matrix W̃_i for the endogenous variables is defined as follows:

$$\tilde{W}_i = \begin{pmatrix} w_{i1}^1 & w_{i1}^2 & w_{i1}^3 & \cdots & w_{i1}^q \\ \vdots & & & & \vdots \\ w_{iT}^1 & w_{iT}^2 & w_{iT}^3 & \cdots & w_{iT}^q \\ \Delta w_{i2}^1 & \Delta w_{i2}^2 & \Delta w_{i2}^3 & \cdots & \Delta w_{i2}^q \\ \vdots & & & & \vdots \\ \Delta w_{iT}^1 & \Delta w_{iT}^2 & \Delta w_{iT}^3 & \cdots & \Delta w_{iT}^q \end{pmatrix}. \qquad (33)$$

Therefore $\hat{g}_N(\theta_0) = N^{-1}\sum_{i=1}^N G_i'\tilde{y}_i - N^{-1}\sum_{i=1}^N G_i'\tilde{Z}_i\,\theta_0$. By Lemma 1, one may write the likelihood for θ as

$$p\!\left(N^{-1}\sum_{i=1}^N G_i'\tilde{y}_i \,\Big|\, \theta,\; N^{-1}\sum_{i=1}^N G_i'\tilde{Z}_i\right) \propto \exp\!\left(-\frac{1}{2}\, N\, \hat{g}_N(\theta)'\, S(\theta)^{-1}\, \hat{g}_N(\theta)\right). \qquad (34)$$

Hence, the model likelihood can be expressed as

$$\int_\Theta p\!\left(N^{-1}\sum_{i=1}^N G_i'\tilde{y}_i \,\Big|\, \theta\right) p(\theta)\, d\theta \propto \int_\Theta \exp\!\left(-\frac{1}{2}\, N\, \hat{g}_N(\theta)'\, S(\theta)^{-1}\, \hat{g}_N(\theta)\right) p(\theta)\, d\theta.$$

Assuming that the prior p(θ) is twice differentiable around θ̂_0 and using the Laplace approximation, we obtain that the model likelihood is proportional to

$$\int_\Theta p\!\left(N^{-1}\sum_{i=1}^N G_i'\tilde{y}_i \,\Big|\, \theta\right) p(\theta)\, d\theta \propto \exp\!\left(-\frac{1}{2} N \hat{g}_N(\hat\theta_0)' S(\hat\theta_0)^{-1} \hat{g}_N(\hat\theta_0) + \log p(\hat\theta_0) + \frac{k}{2}\log 2\pi - \frac{1}{2}\log\det \frac{\partial^2}{\partial\theta^2}\!\left(\frac{1}{2} N \hat{g}_N(\theta)' S(\theta)^{-1} \hat{g}_N(\theta)\right)\Bigg|_{\theta=\hat\theta_0}\right)$$

where $\hat\theta_0 \equiv \arg\min_\theta\, N\, \hat{g}_N(\theta)'\, S(\theta)^{-1}\, \hat{g}_N(\theta)$ is the GMM estimate of θ_0 with weighting matrix S(θ)^{-1}. Noting that $\partial^2\big(\hat{g}_N' S^{-1} \hat{g}_N\big)/\partial\theta^2\big|_{\theta=\hat\theta_0}$ is a k × k matrix of order O_p(1) due to the ergodicity assumption, the model likelihood can be approximated by

$$\int_\Theta p\!\left(N^{-1}\sum_{i=1}^N G_i'\tilde{y}_i \,\Big|\, \theta\right) p(\theta)\, d\theta \propto \exp\!\left(-\frac{1}{2}\, N\, \hat{g}_N(\hat\theta_0)'\, S(\hat\theta_0)^{-1}\, \hat{g}_N(\hat\theta_0) - \frac{k}{2}\log N\right) \qquad (35)$$

where k is the dimension of the vector θ. Alternatively, the above approximation is of order O_p(N^{-1/2}) if the unit information prior for θ is used with $\partial^2\big(\hat{g}_N' S^{-1} \hat{g}_N\big)/\partial\theta^2\big|_{\theta=\hat\theta_0}$ as its variance-covariance matrix; that is, the prior distribution for θ, p(θ), is given by $N\!\left(0,\; \partial^2\big(\hat{g}_N' S^{-1} \hat{g}_N\big)/\partial\theta^2\big|_{\theta=\hat\theta_0}\right)$.

For a given model M_j for which θ has k_j elements different from zero, with the estimate denoted by θ̂_{0,j}, the model likelihood (35) becomes

$$\int_\Theta p\!\left(N^{-1}\sum_{i=1}^N G_i'\tilde{y}_i \,\Big|\, \theta, M_j\right) p(\theta)\, d\theta \propto \exp\!\left(-\frac{1}{2}\, N\, \hat{g}_N(\hat\theta_{0,j})'\, S(\hat\theta_{0,j})^{-1}\, \hat{g}_N(\hat\theta_{0,j}) - \frac{k_j}{2}\log N\right). \qquad (36)$$

Then the moment conditions (27) associated with model M_j can be written as $E[G_i'(\tilde{y}_i - \tilde{Z}_i C_{M_j}\theta_0)] = 0$, where $C_{M_j}$ is the diagonal choice matrix whose diagonal entries are 1 if the corresponding variable is included in the model and 0 otherwise. Recognizing that the estimate differs from model to model, the sample mean of the moment conditions for model M_j can be written as $\hat{g}_N(\hat\theta_{0,j}) = N^{-1}\sum_{i=1}^N G_i'(\tilde{y}_i - \tilde{Z}_i C_{M_j}\hat\theta_{0,j})$. Note that G_i, ỹ_i, and Z̃_i are the same across all models; in other words, the moment conditions and the observable data are identical across the universe of models,12 allowing us to make valid comparisons of posterior probabilities in accordance with the principles of Bayes factor comparison. Therefore, using (36), one can compute the posterior odds ratio of two models M_1 and M_2 as

$$\frac{p\!\left(M_1 \mid N^{-1}\sum_i G_i'\tilde{y}_i\right)}{p\!\left(M_2 \mid N^{-1}\sum_i G_i'\tilde{y}_i\right)} = \frac{p(M_1)}{p(M_2)}\, \frac{p\!\left(N^{-1}\sum_i G_i'\tilde{y}_i \mid M_1\right)}{p\!\left(N^{-1}\sum_i G_i'\tilde{y}_i \mid M_2\right)} = \frac{p(M_1)}{p(M_2)} \exp\!\left(-\frac{1}{2}\Big(N \hat{g}_N(\hat\theta_{0,1})' S(\hat\theta_{0,1})^{-1} \hat{g}_N(\hat\theta_{0,1}) - N \hat{g}_N(\hat\theta_{0,2})' S(\hat\theta_{0,2})^{-1} \hat{g}_N(\hat\theta_{0,2})\Big) - \frac{k_1 - k_2}{2}\log N\right), \qquad (37)$$

which has the same BIC form as for fully specified models. We use iterative GMM estimation with moment conditions $E[G_i'(\tilde{y}_i - \tilde{Z}_i C_{M_j}\theta_{0,j})] = 0$ to approximate the Bayes factors above, with a consistent estimate of the weighting matrix replacing $S(\hat\theta_0)^{-1}$ in (37).
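A sketch of the resulting model scoring (ours, not the authors' code): given the sample moment vector at the iterative GMM estimate for each model and a consistent estimate of the weighting matrix, the approximate log marginal likelihood and the posterior model probabilities follow directly from (36) and (37):

```python
import numpy as np

def libma_log_ml(gbar, S_inv, N, k_j):
    """Limited-information log marginal likelihood of a model, up to a
    constant shared across models, per (36): -J/2 - (k_j/2) log N, where
    J = N * gbar' S^{-1} gbar is the GMM objective at the model's estimate."""
    J = N * gbar @ S_inv @ gbar
    return -0.5 * J - 0.5 * k_j * np.log(N)

def posterior_model_probs(log_mls, log_priors):
    """Normalize exp(log ml + log prior) across the model space,
    subtracting the maximum first for numerical stability."""
    z = np.asarray(log_mls) + np.asarray(log_priors)
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()
```

These posterior model probabilities are the weights used for the model averaging results reported in the next section.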

IV. Monte Carlo Simulation and Results

In this section we describe the Monte Carlo simulations intended to assess the performance of LIBMA. We compute posterior model probabilities, inclusion probabilities for each variable in the universe considered, and parameter statistics. These statistics provide a description of how well our procedure helps the inference process both in a Bayesian model selection and a Bayesian model averaging framework.

A. The Data Generating Process

We consider the case where the universe of potential explanatory variables contains 6 exogenous variables, 2 endogenous variables, and the lagged dependent variable. Throughout our simulations we keep the number of periods constant at T = 4 and vary the number of individuals, N.

For every individual i and period t, the first four exogenous variables are generated as follows

$$(x_{it}^1\;\; x_{it}^2\;\; x_{it}^3\;\; x_{it}^4) = (0.3\;\; 0.4\;\; 0.8\;\; 0.5) + r_t, \qquad r_t \sim N(0, I_4), \quad t = 0, 1, \ldots, T;\;\; i = 1, \ldots, N, \qquad (38)$$

where I_4 is the four-dimensional identity matrix. We allow for some correlation between the first two and the last two exogenous variables. That is, (x_i^5 x_i^6) are correlated with (x_i^1 x_i^2) such that for every individual i and period t, the data generating process is given by

$$(x_{it}^5\;\; x_{it}^6) = \big((x_{it}^1\;\; x_{it}^2) - (0.3\;\; 0.4)\big)\, 0.1 \begin{pmatrix} 1 & 2 \\ 1 & 1 \end{pmatrix} + (1.5\;\; 1.8) + r_t, \qquad r_t \sim N(0, I_2), \quad t = 0, 1, \ldots, T;\;\; i = 1, \ldots, N, \qquad (39)$$

where I_2 is the two-dimensional identity matrix.

Similarly, for the endogenous variables (w_i^1 w_i^2), we have the following data generating process:

$$(w_{it}^1\;\; w_{it}^2) = 0.9\, (w_{i,t-1}^1\;\; w_{i,t-1}^2) + 10\,\upsilon_{it}\, (1\;\; 1) + r_t, \quad t = 1, 2, \ldots, T,$$
$$(w_{i0}^1\;\; w_{i0}^2) = 10\,\upsilon_{i0}\, (1\;\; 1) + r_0, \qquad \upsilon_{it} \sim N(0, \sigma_\upsilon^2),\;\; r_t \sim N(0, I_2), \quad t = 0, 1, \ldots, T. \qquad (40)$$

As the data generating process for the endogenous variables indicates, the overall error term υ_it is assumed here to be normally distributed; we relax the normality assumption later.

For t = 0, the dependent variable is generated by

$$y_{i0} = \frac{1}{1-\alpha}\left(x_{i0}\,\theta_x + w_{i0}\,\theta_w + \eta_i + \upsilon_{i0}\right), \qquad \upsilon_{i0} \sim N(0, \sigma_\upsilon^2),\;\; \eta_i \sim N(0, \sigma_\eta^2), \qquad (41)$$

where θ_x = (0.07 0 0 −0.09 0 0.1)′, θ_w = (0 −0.1)′, w_{i0} = (w_{i0}^1 w_{i0}^2), and x_{i0} = (x_{i0}^1 x_{i0}^2 x_{i0}^3 x_{i0}^4 x_{i0}^5 x_{i0}^6).

For t = 1, 2,…,T the data generating process is given by

$$y_{it} = \alpha\, y_{i,t-1} + x_{it}\,\theta_x + w_{it}\,\theta_w + \eta_i + \upsilon_{it}, \qquad \upsilon_{it} \sim N(0, \sigma_\upsilon^2),\;\; \eta_i \sim N(0, \sigma_\eta^2). \qquad (42)$$
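To make the design concrete, here is a minimal simulation sketch of (38)-(42) in Python. The code is ours, not the authors'; in particular we read the N(0, I) shocks r_t as drawn independently for each individual, and the function name and array layout are our own conventions:

```python
import numpy as np

def simulate_panel(N, T=4, alpha=0.5, sigma_v=0.1, sigma_eta=0.1, seed=0):
    """Simulate one instance of the DGP in (38)-(42)."""
    rng = np.random.default_rng(seed)
    theta_x = np.array([0.07, 0.0, 0.0, -0.09, 0.0, 0.1])
    theta_w = np.array([0.0, -0.1])

    v = rng.normal(0.0, sigma_v, (N, T + 1))       # idiosyncratic errors
    eta = rng.normal(0.0, sigma_eta, N)            # individual effects

    x = np.empty((N, T + 1, 6))                    # exogenous variables, (38)-(39)
    x[:, :, :4] = np.array([0.3, 0.4, 0.8, 0.5]) + rng.standard_normal((N, T + 1, 4))
    A = 0.1 * np.array([[1.0, 2.0], [1.0, 1.0]])   # loading matrix in (39)
    x[:, :, 4:] = (x[:, :, :2] - np.array([0.3, 0.4])) @ A \
                  + np.array([1.5, 1.8]) + rng.standard_normal((N, T + 1, 2))

    w = np.empty((N, T + 1, 2))                    # endogenous variables, (40)
    w[:, 0, :] = 10 * v[:, [0]] + rng.standard_normal((N, 2))
    for t in range(1, T + 1):                      # w_t loads on v_t: endogeneity
        w[:, t, :] = 0.9 * w[:, t - 1, :] + 10 * v[:, [t]] + rng.standard_normal((N, 2))

    y = np.empty((N, T + 1))                       # dependent variable, (41)-(42)
    y[:, 0] = (x[:, 0] @ theta_x + w[:, 0] @ theta_w + eta + v[:, 0]) / (1 - alpha)
    for t in range(1, T + 1):
        y[:, t] = alpha * y[:, t - 1] + x[:, t] @ theta_x + w[:, t] @ theta_w \
                  + eta + v[:, t]
    return y, x, w
```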

We now test the robustness of our procedure with respect to underlying distributions of the error term by relaxing the normality assumption and using discrete distributions instead.

Concretely, to construct the distribution of the random variable υ_it, we first generate its support, S_υ, by drawing N_υ points uniformly over the interval [−1, 1]. Then we draw N_υ i.i.d. random variables ω_k ~ Exponential(1). The probability mass assigned to each point s_k ∈ S_υ is obtained by setting p_k = ω_k / Σ_i ω_i. Finally, we translate each point in S_υ so that υ_it has zero mean. It is well known that the probability vector obtained in this fashion is equivalent to a uniform draw from the simplex in N_υ-dimensional space. The construction of the simulated model follows exactly the case of the Normal distribution, with the only difference being the use of the discrete distribution described above in every place where the Normal distribution is used for υ_it.
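A sketch of this construction (ours); drawing Exponential(1) weights and normalizing by their sum is the standard way to sample uniformly from the probability simplex:

```python
import numpy as np

def discrete_error_dist(n_points, rng):
    """Discrete zero-mean error distribution described above: support drawn
    uniformly on [-1, 1], probabilities uniform on the simplex."""
    support = rng.uniform(-1.0, 1.0, n_points)
    omega = rng.exponential(1.0, n_points)
    probs = omega / omega.sum()
    support -= support @ probs       # translate so the mean is exactly zero
    return support, probs

rng = np.random.default_rng(1)
s, p = discrete_error_dist(10, rng)
draws = rng.choice(s, size=1000, p=p)  # v_it draws replacing the Normal errors
```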

B. Simulation Results

This section reports Monte Carlo simulations of our LIBMA methodology in order to assess its performance. We generate 100 instances of the data generating process with the exogenous variables x_it, endogenous variables w_it, and parameter values (α θ_x θ_w)′ as discussed in the previous section, and we present results in the form of medians, means, variances, and quartiles. We consider several sample sizes, N = 200, 500, 1000, and 2000, and several values for the coefficient of the lagged dependent variable, α = 0.95, 0.50, and 0.30. In the first set of simulations we assume that both the random error term υ_it and the individual effect η_i are drawn from Normal distributions, υ_it ~ N(0, σ_υ²) and η_i ~ N(0, σ_η²), respectively. We consider the cases σ_υ = 0.05, 0.10, and 0.50, while σ_η = 0.10. Since our methodology should not depend on the normality of the random error term, we check for robustness with a second set of simulations where the normality assumption for υ_it is dropped, as discussed earlier.

Model selection

In the Bayesian framework, the posterior model probability is a key indicator of performance. Table 1 presents means, variances, and three quartiles (Q1, median, and Q3) of the posterior probability of the true model across the 100 instances. As expected, the mean posterior probabilities of the true model increase with the sample size. For sample sizes N = 200, 500, 1000, and 2000, average values of the posterior model probability are about 0.31, 0.46, 0.57, and 0.65, respectively. Median posterior model probabilities are slightly higher than the means, with average values of 0.32, 0.50, 0.62, and 0.69. In addition, as the sample increases, the distribution becomes skewed toward 1. Quartiles and distribution plots show that as the sample increases, the distributions of the posterior model probabilities become less and less normal, with long left tails.13

Table 1. Posterior Probability of the True Model

Summary statistics using LIBMA estimation for various N, α, and σ_υ

Notes: 1. For the individual effect, η_i ~ N(0, σ_η²) where σ_η = 0.10. 2. The error term is normally distributed, υ_it ~ N(0, σ_υ²).

As shown in (3), the posterior model probability depends on the prior model probability. Under the assumption that all models have equal prior probability, the more variables are under consideration, the smaller the prior probability of each model, which affects the absolute value of the posterior model probability. Therefore, we also compute a relative measure that helps gauge how well the methodology performs. Table 2 presents the ratio of the posterior probability of the true model to the highest posterior probability among all the other models (excluding the true model). On average this ratio is above unity in all the cases considered, suggesting that the correct model is on average favored over all the other models. As expected, the average ratios increase with the sample size, from about 2.26 for N = 200 to 7.09 for N = 2000.

Table 2. Posterior Probability Ratio of True Model/Best Among the Other Models

Summary statistics using LIBMA estimation for various N, α, and σ_υ

Notes: 1. For the individual effect, η_i ~ N(0, σ_η²) where σ_η = 0.10. 2. The error term is normally distributed, υ_it ~ N(0, σ_υ²).

In Table 3 we examine how often our methodology recovers the true model by reporting how many times, out of 100 instances, the true model has the highest posterior probability. The results indicate that the true model is recovered quite reliably. For the smallest sample size, N = 200, the recovery rate varies from 65 percent to 83 percent. For N = 500 we see an improvement in the selection of the true model, with the success rate ranging from 82 percent to 93 percent. For sample sizes of 1000 and above, the recovery rate stays over 90 percent, reaching 97 percent in a couple of cases.

Table 3. Probability of Retrieving the True Model

Summary statistics using LIBMA estimation for various N, α, and σ_υ

Notes: 1. For the individual effect, η_i ~ N(0, σ_η²) where σ_η = 0.10. 2. The error term is normally distributed, υ_it ~ N(0, σ_υ²).

Model averaging

While model selection properties are desired, researchers are often more interested in making inferences. Table 4 presents the posterior inclusion probabilities for all the variables considered along with the true model (column 2 of the table).14 Given the assumptions made relative to the model priors, the prior probability of inclusion for each variable is the same and equal to 0.5. From Table 4 we see that the median value of the inclusion probability for all the relevant explanatory variables is greater than 0.942 in all cases considered. As the sample size increases the posterior inclusion probabilities approach 1 for all the relevant variables. In fact for sample sizes greater than 500, the median value of the probability of inclusion for all relevant variables is practically 1. For the variables not contained in the true model the median posterior probability of inclusion decreases with the sample size with the upper bound being less than 0.076 for the case when N = 2000.

Table 4. Model Recovery: Medians and Variances of Posterior Inclusion Probability for Each Variable

True model vs. BMA posterior inclusion probability for various N, α, and σ_υ

Notes: 1. For the individual effect, η_i ~ N(0, σ_η²) where σ_η = 0.10. 2. The error term is normally distributed, υ_it ~ N(0, σ_υ²).

We turn now to the parameter estimates and examine how the estimated values compare with the true parameter values. Table 5 presents the median values of the estimated parameters over the 100 replications, compared with the parameters of the true model.15 As with the inclusion probabilities, our methodology performs very well in estimating the parameters, and its performance improves as the sample gets larger. In Figure 3 of Appendix A, we present box plots of the parameter estimates of Table 5 for the case α = 0.95 and σ_υ = 0.1. As the sample increases, the variance of the distribution decreases and the median converges to the true value. Aside from the estimates being very close to the true parameter values, the variance over the 100 replications is also very small across the board, with values below 10^{-5} in many cases.

Table 5. Model Recovery: Medians and Variances of Estimated Parameter Values

True model vs. BMA coefficients' estimated values for various N, α, and σ_υ

Notes: 1. For the individual effect, η_i ~ N(0, σ_η²) where σ_η = 0.10. 2. The error term is normally distributed, υ_it ~ N(0, σ_υ²).