Bayesian VARs

Author(s):
Matteo Ciccarelli and Alessandro Rebucci
Published Date:
May 2003

I. Introduction

Vector autoregressive models (VARs) are a useful starting point for econometric modeling and a standard benchmark for the analysis of dynamic economic problems. In the original formulation of Sims (1972 and 1980), VAR users should rely mostly on the data to study the interaction among the economic variables of interest in order to avoid imposing “incredible restrictions.” This approach, however, may create two problems that have been widely discussed in the literature over the last three decades or so. First, leaving the lag structure of the VAR model unrestricted to avoid imposing the incredible restrictions of traditional simultaneous equation models may result in overfitting: given their generous parameterization, VARs may suffer from a loss of degrees of freedom, which decreases geometrically with the number of variables and proportionally with the number of lags included, resulting in inefficient estimates. Second, by using reduced-form VARs to avoid imposing identification restrictions on the contemporaneous causation among the variables of interest, one may not go beyond a simple description of the data.

This paper reviews the efforts spent over the past several years to address the overfitting problem by adopting a Bayesian approach to the specification and estimation of VARs. The Bayesian approach to estimation regards the true population structure as uncertain and does not assign too much “weight” to any particular value of the model parameters (e.g., zero restrictions on certain coefficients). Instead, it takes this uncertainty into account in the form of a prior probability distribution over the model parameters. The degree of uncertainty represented by this prior distribution can then be altered by the information contained in the data, if the two sources of information are different. As long as the prior information is not too “vague,” it is altered only by the “signal” and not by the “noise” contained in the data sample, thus reducing the risk of overfitting. As a result, Bayesian VARs (BVARs) are known to produce better forecasts than reduced-form VARs estimated in a classical way and, in some cases, even better than univariate autoregressive moving average processes or structural models.2

The choice of a prior distribution summarizing the researcher’s uncertainty over the model parameters is a crucial step in specifying a BVAR. As pointed out by Leamer (1978), for most inference problems in economics, prior information matters in the sense that two economists can legitimately make different inferences from the same data set. In order to deal with this dependence of inference on prior information and to characterize the mapping from the set of alternative priors onto the corresponding inference, a method must be chosen to combine sample information with prior information, and a set of alternative priors must be considered. The Bayesian approach to estimation provides the needed method, while the range of alternative priors depends on the particular economic problem at hand.

We review the Bayesian estimation principle in the specific case of VARs in Section II. In Section III, we review the most relevant prior distributions that have been proposed in the literature, from the original and easily computed “Minnesota prior” to the most recent and computationally more demanding assumptions. In Section IV, we discuss how to extend the basic model to deal with (i) time-varying coefficients, (ii) nonnormal data, and (iii) nonlinear relations. In Section V, we discuss basic issues in forecasting and structural analysis with BVARs. In addition, in Section VI, by drawing on our own applied work (Ciccarelli and Rebucci, 2002), we present an application of these methods to the estimation of a system of monetary reaction functions for four European central banks under the European Monetary System. This application uses some of the results previously discussed and illustrates how flexible the Bayesian approach is. Section VII concludes.

Our contribution, therefore, is to provide the reader with a road map to the general method and a few specific results that may be applied in a flexible manner to the empirical analysis of dynamic multivariate economic problems. The presentation in the paper reflects to some extent the broad chronological progress of research in this area, which in turn followed developments in computer technology permitting the estimation of increasingly complex models at reasonable computing costs.3

The appearance of an increasing number of Bayesian applications today, in particular, seems linked to enhanced and cheaper computing power. As will be evident from the rest of the paper, with statistically and economically richer prior assumptions than the original Minnesota prior, integration problems quickly become intractable analytically and must be solved numerically. But the difficulties of implementing numerical integration have been substantially reduced by recent advances in computer technology and in sampling-based methods. Quoting Hsiao and others (1999), therefore, there is no longer any excuse for not considering Bayesian methods because of practical difficulties.

II. The Basics

A. BVARs as an Answer to the Overfitting Problem

Consider a typical VAR,

where Yt is an n × 1 vector of endogenous variables; εt is an n × 1 vector of error terms that are independently, identically, and normally distributed with variance-covariance matrix Σ, εt ~ IIN(0, Σ); Bl (l = 1, …, p) and D are n × n and n × d matrices of parameters, respectively; and zt is a d × 1 vector of exogenous variables.

Classical estimation of models like (1) may yield imprecisely estimated relations that fit the data well only because of the large number of variables included, a problem known as ‘overfitting’ in the literature. In fact, the number of parameters to be estimated, n(np + d), grows geometrically with the number of variables (n) and proportionally with the number of lags (p). When the number of variables on the right-hand side of (1) is relatively high and the sample information is relatively loose, the estimates are likely to be influenced by noise rather than signal, especially if the estimation method is designed to fit the sample data as closely as possible. When this is the case, it is advisable to estimate models like (1) by imposing some restrictions that reduce the dimension of the parameter space. The problem, therefore, is to find restrictions that are as credible (in the sense of Sims, 1980) as possible.
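To fix ideas, a quick back-of-the-envelope computation (a purely illustrative Python sketch) shows how rapidly the count n(np + d) grows with the number of variables and lags:

```python
# Illustration: the number of mean parameters n(np + d) in an unrestricted
# VAR(p) with n endogenous variables and d exogenous/deterministic terms.
def var_param_count(n, p, d=1):
    """Number of coefficients to estimate in the VAR's conditional mean."""
    return n * (n * p + d)

for n in (3, 6, 10):
    for p in (2, 4):
        print(f"n={n}, p={p}: {var_param_count(n, p)} parameters")
# A modest 10-variable VAR with 4 lags already has 410 coefficients,
# which quickly exhausts the degrees of freedom of macro samples.
```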

A Bayesian approach to VAR estimation was originally advocated by Litterman (1980) as a solution to the ‘overfitting’ problem. The solution he proposed avoids overfitting without necessarily imposing exact zero restrictions on the coefficients: the researcher cannot be sure that some coefficients are zero and should not ignore their possible range of variation. A Bayesian perspective fits precisely this view. Without putting too much weight on particular values, one may think of this uncertainty over the exact values of the model’s parameters as a probability distribution for the parameter vector. The degree of uncertainty represented by this distribution can then be altered by the information contained in the data, if the two sources of information are different. As long as the prior information is not too vague or non-informative, it should be altered only by the ‘signal’ and not by the ‘noise’ contained in the sample, thus reducing the overfitting risk.

More specifically, Litterman (1986) specifies his prior by appealing to three statistical regularities of macroeconomic time series data: (i) the trending behavior typical of macroeconomic time series; (ii) the fact that more recent values of a series usually contain more information on the current value of the series than past values and (iii) the fact that past values of a given variable contain more information on its current state than past values of other variables. If we apply these statistical regularities, (1) becomes a multivariate random walk.4 A Bayesian researcher instead can specify these regularities by assigning a probability distribution to the parameters in such a way that: (i) the mean of the coefficients assigned to all lags other than the first one is equal to zero; (ii) the variance of the coefficients depends inversely on the number of lags; and (iii) the coefficients of variable j in equation g are assigned a lower prior variance than those of variable g.

As we shall see below, these requirements can be expressed formally by introducing a vector of parameters (called hyperparameters), say Π ≡ (π1, …, πH). For instance, π1 may control the value of the mean of the first own-lag coefficient, π2 the variance of the lags of variable g in equation g, π3 the variance of the lags of variable j in equation g, π4 the speed at which the variance decreases as the number of lags increases, π5 the variance of the deterministic/exogenous part, and π6 the overall degree of prior uncertainty. Therefore, the original problem of estimating n(np + d) parameters can be converted into a problem of estimating just six “hyper-parameters.” As we will see, this is only one example, although most BVARs were originally specified by using this particular set of hyperparameters.

Additional data characteristics can be easily accommodated by introducing different hyperparameters to control for other relevant information. Cointegrated systems, for instance, may be treated within the same framework by specifying, a priori, long-run relations among the variables of interest. As shown by Alvarez and Ballabriga (1994), however, a Minnesota-type prior with a hyperparameter search performs well in the presence of cointegration; they also show that adding long-run restrictions to the prior does not improve the small-sample performance of the Bayesian estimation methods commonly applied to VAR models.

B. Bayesian Estimation of VARs in a Nutshell

To state more formally the Bayesian estimation principle, consider (1) rewritten in compact form as:

where Xt = (In ⊗ Wt′) is n × nk, Wt = (Yt−1′, …, Yt−p′, zt′)′ is k × 1, and β = vec(B1, B2, …, Bp, D) is nk × 1. The unknown parameters of the model are β and Σ.

Bayesian estimation of (2) is simple, in principle, and works as follows. Given the probability density function (pdf) of the data conditional on the model’s parameters (the information contained in the data, in the form of a likelihood function),5

and a joint prior distribution on the parameters, p(β, Σ), the joint posterior distribution of the parameters conditional on the data is obtained through the Bayes rule,

noting that, by definition of conditional probability, the joint pdf of the data and the parameters, p(β, Σ, Y), can be written as

where ∝ denotes ‘proportional to’. Given p(β, Σ|Y), the marginal posterior distributions conditional on the data p(Σ |Y) and p(β |Y), can then be obtained by integrating out β and Σ from p(β, Σ|Y), respectively. Finally, location and dispersion of p(Σ|Y) and p(β |Y) can be easily analyzed to yield point estimates of the parameters of interest and measures of precision, comparable to those obtained by using a classical approach to estimation.

C. A Monte Carlo Simulation Method for Numerical Integration: The Gibbs Sampler

In many applications, the analytical integration of p(β, Σ|Y) may be difficult or even impossible to implement. This problem, however, can often be solved by using numerical integration based on Monte Carlo simulation methods.

One particular method used in the literature to solve estimation problems similar to those discussed in this paper is the Gibbs sampler.6 The Gibbs sampler is a recursive Monte Carlo method which requires only knowledge of the full conditional posterior distributions of the parameters of interest, p(β|Σ, Y) and p(Σ|β, Y). Suppose Σ and β are scalars and that the conditional posterior distributions p(Σ|β, Y) and p(β|Σ, Y) are known. Then the Gibbs sampler starts from arbitrary values for β(0) and Σ(0) and samples alternately from the density of each element of the parameter vector, conditional on the value of the other element sampled in the previous iteration and on the data. Thus, the Gibbs sampler samples recursively as follows:

The vectors v(m) = (β(m), Σ(m)) form a Markov chain and, for a sufficiently large number of iterations (say m ≥ M), can be regarded as draws from the true joint posterior distribution. Given a large sample of draws from this limiting distribution, any posterior moment or marginal density of interest can then be estimated consistently with its corresponding sample average.
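As an illustration of these mechanics, the following sketch (with illustrative values, not taken from the paper) runs a Gibbs sampler on a toy two-parameter problem in which both full conditionals are known normal densities, namely the two conditionals of a bivariate normal with correlation ρ:

```python
import numpy as np

# Toy Gibbs sampler: the target is a bivariate normal with correlation rho,
# whose full conditionals are x|y ~ N(rho*y, 1-rho^2) and y|x ~ N(rho*x, 1-rho^2).
# As in the text, we start from arbitrary values, draw each block conditional
# on the latest value of the other, and discard the first M burn-in draws.
rng = np.random.default_rng(0)
rho, n_iter, burn = 0.8, 20_000, 2_000

x, y = 0.0, 0.0                      # arbitrary starting values
draws = np.empty((n_iter, 2))
for m in range(n_iter):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    draws[m] = (x, y)

sample = draws[burn:]                # draws from (approximately) the joint posterior
print(np.corrcoef(sample.T)[0, 1])   # sample correlation, close to rho = 0.8
```

The same recursion carries over to the (β, Σ) blocks of a BVAR once their full conditional densities are available in closed form.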

III. Alternative Prior Distributions

A second problem in implementing Bayesian estimation of (2) is the choice of a prior distribution for the model’s parameters, p(β, Σ). This choice is a fundamental step in the specification of the model. This section discusses several alternatives.

Many priors have been proposed in the literature, according to the specific economic problem, the sample data, and the way the parameters of the prior p(β, Σ) are determined. These priors, however, share common features. Full Bayesian estimation of (2) would require specifying a prior distribution also for the parameters of the prior p(β, Σ) and then integrating them out of the posterior distribution, p(β, Σ|Y). As noted, however, these integrations may be complicated or even impossible to implement, and the specification of the prior must eventually take some hyperparameters as given. The discussion of alternative priors, therefore, may be organized according to the way the problem of the dependence of the posterior moments on the unknown hyperparameters is solved.

One solution is to substitute estimates of the hyperparameters directly into the formulas for the mean and variance of the posterior distributions of the parameters of interest, an approach sometimes called Empirical Bayesian Estimation. These estimates could be obtained in a previous stage by OLS, maximum likelihood, GMM, or other optimization criteria, or by informal methods such as the rules of thumb discussed above. Since the substitution does not account for uncertainty in the estimation of the hyperparameters, the empirical Bayes approach is only an approximation to the full Bayes approach, possibly approaching it in the limit, depending upon the statistical properties of the estimators used or the quality of the rules of thumb. An empirical Bayesian estimate may converge to a full Bayesian estimate if the estimator of the hyperparameters converges to its true value as the sample size increases. An alternative solution, sometimes called Hierarchical Bayesian Estimation in the literature, is to incorporate the prior distribution of the hyperparameters into the model and then obtain the marginal densities of the parameters of interest by integrating the hyperparameters out of the joint posterior density numerically.

We are now going to discuss the empirical and the hierarchical Bayes approaches in turn.

A. Empirical Bayes Estimation

The Minnesota Prior

While the number of possibilities for a linear regression model such as (2) is vast, a commonly used prior in the Bayesian VAR literature is that proposed by Litterman (1986), sometimes called the Minnesota prior.

Consider the problem of estimating the (k × 1) vector βg containing the parameters of the gth equation of (2) when the variance of the error term (σg,g2) is known. Litterman (1986) assumes that

where β¯g and Ω¯g denote the prior mean and variance-covariance matrix of βg, respectively. The residual variance-covariance matrix, Σ, is assumed fixed and diagonal, so that the error term of the gth equation has variance-covariance matrix σg,g2IT. Stacking the time observations of the gth equation, (2) can be written as

where Yg and εg are T × 1 vectors, and X is the stacked version of Xt in (2).

Given the assumed independence of the error terms, the likelihood (3) is just the product of independent normal densities,

The posterior distribution of the parameters of interest, given by

is proportional to:

In fact, |σg,g2|−T/2 and |Ω¯g|−1/2 are constants with respect to the integration for βg in the first proportionality statement above, while Yg′Yg and β¯g′Ω¯g−1β¯g are constants in the second proportionality relation. Now, completing the square in the last exponential, we obtain

with

and

In other words, p(βg|Y) = N(β˜g, Ω˜g), where, once we know Ω¯g−1, β¯g, and σg,g2, we can take β˜g as a point estimate.

Several remarks are in order here. First, note that, given the assumptions made, there is prior and posterior independence between equations, which explains why the equations can be estimated separately. Second, as noted above, Σ is assumed fixed and diagonal here, with its diagonal elements obtained from the estimation of a set of univariate autoregressive models of order p, AR(p). Third, β¯g and Ω¯g are unknown and specified in terms of a few hyperparameters. Finally, by assuming an infinite dispersion of the prior distribution around its mean (Ω¯g−1 = 0), i.e., by assuming a diffuse prior in the terminology used in the literature, the posterior mean of βg becomes β˜g = (X′X)−1X′Yg, which is the OLS estimator of βg.7
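Since the posterior moments β˜g and Ω˜g are simple matrix expressions, a small numerical sketch may help. The code below (all prior settings and data are illustrative, not from the paper) computes the single-equation posterior mean under the standard normal-prior/normal-likelihood formulas, β˜g = (Ω¯g−1 + X′X/σ²)−1(Ω¯g−1β¯g + X′Yg/σ²), and compares it with OLS:

```python
import numpy as np

# Sketch of the single-equation posterior moments: with a normal prior
# N(beta_bar, Omega_bar) and known error variance sigma2, the posterior is
# N(beta_tilde, Omega_tilde) with
#   Omega_tilde = (Omega_bar^{-1} + X'X / sigma2)^{-1}
#   beta_tilde  = Omega_tilde (Omega_bar^{-1} beta_bar + X'Y_g / sigma2).
# With Omega_bar^{-1} -> 0 (a diffuse prior) this collapses to OLS, as noted.
rng = np.random.default_rng(1)
T, k = 200, 3
X = rng.normal(size=(T, k))
beta_true = np.array([0.9, 0.2, -0.4])
sigma2 = 1.0
Yg = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=T)

beta_bar = np.zeros(k)               # prior mean (illustrative)
Omega_bar_inv = 0.5 * np.eye(k)      # prior precision (illustrative)

Omega_tilde = np.linalg.inv(Omega_bar_inv + X.T @ X / sigma2)
beta_tilde = Omega_tilde @ (Omega_bar_inv @ beta_bar + X.T @ Yg / sigma2)

beta_ols = np.linalg.solve(X.T @ X, X.T @ Yg)
print(beta_tilde)   # shrunk slightly toward the zero prior mean, relative to OLS
```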

Litterman (1986) then assigns numerical values to the hyperparameters of the model on the presumption that most macroeconomic time series are well represented by random walk processes, as discussed in the previous section. Specifically, he assumes that Π is a degenerate random variable on the assigned values, with the following structure for the diagonal elements of Ω¯g:

Here, π6 controls the overall prior tightness (or uncertainty); π2 controls the tightness of own lags, while π3 controls the tightness of own lags relative to the tightness of lags of the other variables in the equation; π4 controls the lag decay of the prior variance, with l = 1, …, p denoting the variable’s lags; π5 controls the degree of uncertainty on the coefficients of the deterministic and/or exogenous variables in equation g; and the factors σg,g and σj,j measure the scale of fluctuations in variables g and j, taking the different units of measure of the variables into account. Finally, the mean vector is specified as β¯g = (0, …, 0, π1, 0, …, 0), where π1 is in the gth position and represents the prior mean of the coefficient on the first lag of the endogenous variable in equation g.8
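As an illustration, the following sketch builds the diagonal of Ω¯g under one common parameterization of these rules. The exact functional form varies across papers (for instance, whether the scale factors enter as a ratio of standard deviations or of variances), and the hyperparameter values below are made up for the example:

```python
import numpy as np

# A sketch of one common Minnesota-prior variance scheme: own lags get
# pi6*pi2 / l**pi4, lags of other variables get pi6*pi3*(sigma_g/sigma_j) / l**pi4,
# and the deterministic/exogenous term gets pi6*pi5. This is an assumption-laden
# illustration of the rules described in the text, not the paper's exact formula.
def minnesota_variances(g, n, p, sigma, pi2, pi3, pi4, pi5, pi6):
    """Diagonal of Omega_bar_g for equation g of an n-variable VAR(p) with one
    deterministic term; sigma[j] is the residual scale of an AR(p) for variable j."""
    v = []
    for l in range(1, p + 1):            # lag l = 1, ..., p
        for j in range(n):               # variable j in equation g
            if j == g:
                v.append(pi6 * pi2 / l**pi4)
            else:
                v.append(pi6 * pi3 * (sigma[g] / sigma[j]) / l**pi4)
    v.append(pi6 * pi5)                  # loose prior on the deterministic term
    return np.array(v)

v = minnesota_variances(g=0, n=3, p=2, sigma=np.ones(3),
                        pi2=1.0, pi3=0.5, pi4=1.0, pi5=100.0, pi6=0.2)
print(v)   # variances shrink with the lag and are tighter on other variables' lags
```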

Obviously, in order to provide a closer approximation to a full Bayesian estimation, these hyperparameters may be estimated consistently rather than being assigned arbitrary values. For instance, the pdf of the data (5) can be written in terms of Π and then maximized; this is the approach that we shall take in our application reported in Section VI of the paper.

The Diffuse Prior

The two main features of the Minnesota prior, the posterior independence between equations and the fixed residual variance-covariance matrix, can be relaxed with the priors considered in this subsection and in the next one. In fact, it can be shown (see Kadiyala and Karlsson, 1997) that, with the (diffuse) prior distribution

the joint posterior distribution is given by

where

and

with X (of dimensions T × k) and Y (of dimensions T × n) denoting the matrix versions of Xt and Yt, respectively, B = [B1, …, Bp, D] (of dimensions k × n), and iW(Q, q) denoting an inverted Wishart distribution with scale matrix Q and degrees of freedom q.9 Further, integrating Σ out of the joint posterior distribution, the marginal posterior distribution of B (the matrix form of the parameter vector β), p(B|Y), can be shown to be

which is a generalized t-Student distribution with scales (Y − XB̂ols)′(Y − XB̂ols) and X′X, mean B̂ols, and degrees of freedom T − k (see Kadiyala and Karlsson, 1997, page 104).10
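Sampling from this posterior is straightforward: Σ is drawn from the inverted Wishart and B from the conditional normal given Σ. A minimal sketch, with simulated data and the standard Wishart construction for the inverted-Wishart draw (all values illustrative):

```python
import numpy as np

# Sketch of Monte Carlo sampling from the diffuse-prior posterior above:
#   Sigma | Y ~ iW(S, T - k), S = (Y - X B_ols)'(Y - X B_ols)
#   vec(B) | Sigma, Y ~ N(vec(B_ols), Sigma kron (X'X)^{-1})
# The iW draw uses Sigma = W^{-1}, W ~ Wishart(T - k, S^{-1}), built from
# normal rows, to keep the sketch free of extra dependencies.
rng = np.random.default_rng(2)
T, n, k = 120, 2, 3
X = rng.normal(size=(T, k))
B_true = rng.normal(scale=0.3, size=(k, n))
Y = X @ B_true + rng.normal(size=(T, n))

B_ols = np.linalg.solve(X.T @ X, X.T @ Y)
S = (Y - X @ B_ols).T @ (Y - X @ B_ols)
XtX_inv = np.linalg.inv(X.T @ X)
L = np.linalg.cholesky(np.linalg.inv(S))   # rows of Z @ L.T are N(0, S^{-1})

draws = []
for _ in range(500):
    Z = rng.normal(size=(T - k, n)) @ L.T
    Sigma = np.linalg.inv(Z.T @ Z)         # Sigma ~ iW(S, T - k)
    cov = np.kron(Sigma, XtX_inv)          # vec stacks B column by column
    draws.append(rng.multivariate_normal(B_ols.T.ravel(), cov))
post_mean = np.mean(draws, axis=0)
print(post_mean.reshape(n, k).T)           # close to B_ols under this diffuse prior
```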

Conjugate Prior Distributions

A mathematically convenient way of overcoming the main drawbacks of the Minnesota prior is to use a conjugate family of prior distributions. Conjugacy is the property by which the posterior distribution follows the same parametric form as the prior distribution.11 In what follows we describe two possibilities, the Normal-Wishart and the Normal-Diffuse prior.

The Normal-Wishart Prior

To relax the assumption of a fixed and diagonal variance-covariance matrix of the residuals, a natural conjugate prior for normal data is the Normal-Wishart,

i.e., the unconditional prior distribution of β will be normal with prior mean and variance E(β) = β¯ and V(β) = (α − n − 1)−1 Σ¯ ⊗ Ω¯, respectively, where α denotes the degrees of freedom of the inverse-Wishart and satisfies α > n + 1.

Given this prior assumption, the posterior distribution is obtained as (e.g., Kadiyala and Karlsson, 1997, page 104)

where

and

As before, integrating ∑ out of the joint posterior, the marginal posterior distribution of B is a multivariate t-Student distribution, whose integration can be easily performed numerically (Kadiyala and Karlsson, 1997, page 104).

Note that, when working with this prior, it is necessary to assume β¯, Ω¯, and Σ¯ known. However, a Minnesota-type specification for these matrices can be adapted and used here. In practice, β¯ is specified as depending upon only one hyperparameter, which controls the mean of the first lag of the endogenous variable; Ω¯ is specified as a diagonal matrix according to the scheme given in the previous section; and the diagonal elements of Σ¯ are set as σ¯gg = (α − n − 1)sgg2, with sgg2 estimated from univariate AR(p) models. Finally, the prior degrees of freedom, α, must be chosen so as to ensure the existence of the prior variances of the regression parameters.

The Normal-Diffuse Prior

A slightly different prior, which in addition to allowing for a non-diagonal residual variance-covariance matrix also avoids the Normal-Wishart restrictions on the variance-covariance matrix of β, is the Normal-Diffuse prior first introduced by Zellner (1971). The key assumption made by Zellner here is prior independence between β and Σ, where

The combination of these priors with the data yields the following marginal conditional distributions

with

and where W(Σ˜−1, T) denotes a Wishart distribution, the inverse of the iW.

The conditional posterior of β can then be integrated analytically to obtain the marginal, but the form of the latter is complicated: large differences between the information contained in the prior and in the likelihood can cause the posterior to become bimodal, and thus the posterior mean to have low posterior probability (see, again, Kadiyala and Karlsson, 1997, page 106). For this reason it is preferable to integrate numerically.

B. Hierarchical Bayes Estimation

A prior distribution for the hyperparameters can be incorporated into the specification of the model in the context of the hierarchical version of Zellner’s SUR model, as recently proposed by Chib and Greenberg (1995). In general, this kind of specification does not provide a closed-form solution for estimation, and full Bayesian estimation is approximated by using numerical integration methods. The hierarchical setup is relatively straightforward to work with in a simulation context and provides a great deal of flexibility in modeling, as its increasingly widespread application shows.

Chib’s hierarchical SUR model is specified as follows:

where the nk × 1 parameter vector β is related to an m × 1 parameter vector θ, which in turn is modeled as a function of an r × 1 parameter vector μ. If r ≤ m ≤ nk, then the VAR parameters are projected onto progressively lower-dimensional spaces. The literature refers to (15), (16), and (17) as the first, second, and third stage of the hierarchy, respectively.

The following assumptions are made by Chib: (i) Mo, M1, and D1 are known; (ii) the matrices Σ and Do are unknown; (iii) the errors are mutually independent across the hierarchy, implying that Yt is conditionally independent of θ and Do. As a result, the likelihood of the data, again, is proportional to:

The prior information can then be completed by assuming

with p(Σ−1) = W(Σ¯, s) and p(Do−1) = W(D¯, t), where all the hyperparameters are assumed known. For instance, they may be set in accordance with the Minnesota prior, as before, or more simply estimated from the data.

The posterior density of the parameters of interest ψ = (β, Σ, θ, Do) is given by

where p(ψ) is constructed from (16) and (17) (stages 2 and 3 of the hierarchy, respectively) and the prior distributions on the parameters. The Gibbs sampler can then be run by cycling through the conditional posterior distributions (19)–(22) below, which are in standard form (see Chib and Greenberg, 1995):

where

The number of stages in the hierarchy could be more or fewer than the three discussed here. With more stages, we essentially have to add more conditional posterior distributions to the Gibbs sampler. With fewer stages, as for instance with the Minnesota prior discussed in the previous section, we have to fix the value of the hyperparameters of the omitted stage somewhat arbitrarily.

For example, the Minnesota prior—see equation (4) above—can be written as

where β¯g = Mgθ, θ = (π1, 0)′, and

with the ones in the first column and the zeros in the second referring to the gth row of the VAR. Therefore, the difference between the latter specification and Chib’s three-stage hierarchy is that θ and Ω¯g are either estimated or assumed known based on the Minnesota rules of thumb, whereas in the three-stage hierarchy we place a proper prior on the hyperparameters of the model. Obviously, the hierarchical model is closer to a full Bayesian model than an empirical Bayesian one, in the sense that it incorporates a more general prior assumption for the hyperparameters.

IV. Some Extensions

In this section we consider a few useful extensions of the models discussed thus far. In particular, we shall discuss models with time-varying coefficients (TVC), which have been shown to capture both nonlinearities and nonnormalities in the data; an alternative to the normal data model that captures outliers (or fat tails) in the distribution of the error terms; and, briefly, regime-switching models. The first two extensions are relevant for modeling financial data, while the third is more of interest in macroeconomic applications. The models discussed in this section clearly illustrate the flexibility of a Bayesian approach to estimation, which allows complicated estimation problems to be handled in a fairly simple way.

A. Time-Varying Coefficients

The Kalman Filter with a Minnesota Prior

A standard tool for the estimation of linear regression models with time-varying parameters is the Kalman Filter. The Kalman Filter is an algorithm for recursively updating a linear projection for a dynamic system represented in state-space form, given a set of initial values for the parameters describing its state.12 An empirical Bayesian procedure for the estimation of a time-varying version of (2), equation by equation, with known variance-covariance matrix of the residuals as discussed in Section III.A, has been developed by Doan et al. (1984). Doan et al. suggest specifying the law of motion of βt as a first-order autoregressive process represented in state-space form and placing a Minnesota-type prior on βt−1, so that the Kalman Filter can be applied to update it for all t, recursively.

Consider the problem of estimating the entire parameter vector of (2) when this changes over time in all periods t (thus, we want to obtain a sequence of posterior distributions for β1, β2, …, βT), and assume that βt follows a stationary VAR process of order one, such as:

where

with βt−1, εt, and ηt independently distributed. Here, Σ, A, and Φ are assumed to be known, and It−1 denotes the information available at time t − 1. Thus, we are considering the following specification for (2):

Note that, conditional on It−1, the prior distribution of βt is

where

Substituting for βt in the first line of (25), we obtain a state space model for βt with observation equation given by

Now, it is possible to show (e.g., Ballabriga et al., 1998) that, under these assumptions, the marginal posterior distribution of βt is given by:

where β¯t|t and Ω¯t|t are the one-period-ahead forecast of βt and the variance-covariance matrix of its mean square error, respectively, calculated by the Kalman Filter as:

To make this scheme operational, one needs a consistent estimate of Σ, values for the hyperparameters of the model (A and Φ), and initial values for the elements of Ω1|0 and β1|0. As we saw in the time-invariant case discussed in the previous section, these may be obtained in many different ways.13
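The prediction and updating recursions above are easy to implement. The following sketch (made-up system matrices and simulated data; not the paper's application) runs the Kalman Filter for a small time-varying-coefficient regression yt = xt′βt + εt with βt = Aβt−1 + ηt, treating A, Φ, Σ, and the initial values as known:

```python
import numpy as np

# Minimal Kalman-filter sketch of the scheme in the text, with illustrative
# values: state equation beta_t = A beta_{t-1} + eta_t, eta_t ~ N(0, Phi),
# observation equation y_t = x_t' beta_t + eps_t, eps_t ~ N(0, Sigma).
rng = np.random.default_rng(3)
T, k = 200, 2
A, Phi, Sigma = 0.98 * np.eye(k), 0.01 * np.eye(k), 1.0

# simulate a time-varying-coefficient regression
beta = np.zeros((T, k))
b = np.array([1.0, -0.5])
X = rng.normal(size=(T, k))
y = np.empty(T)
for t in range(T):
    b = A @ b + rng.multivariate_normal(np.zeros(k), Phi)
    beta[t] = b
    y[t] = X[t] @ b + rng.normal(scale=np.sqrt(Sigma))

# filter: prediction step, then measurement update, for each t
b_f, P = np.zeros(k), 10.0 * np.eye(k)       # diffuse-ish initial values
est = np.empty((T, k))
for t in range(T):
    b_p = A @ b_f                            # beta_{t|t-1}
    P_p = A @ P @ A.T + Phi                  # Omega_{t|t-1}
    x = X[t]
    K = P_p @ x / (x @ P_p @ x + Sigma)      # Kalman gain
    b_f = b_p + K * (y[t] - x @ b_p)         # beta_{t|t}
    P = P_p - np.outer(K, x @ P_p)           # Omega_{t|t}
    est[t] = b_f

print(np.mean((est[50:] - beta[50:]) ** 2))  # small tracking error after burn-in
```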

A Hierarchical Time-Varying SUR Model

Parameter time-variation may also be specified in the context of the hierarchical framework proposed by Chib and Greenberg (1995), based on an important result previously obtained by Carter and Kohn (1994), by adding one stage to the hierarchy implicit in the model discussed in the previous section.14

In this case, parameter time-variation is specified as follows:

where εt, ζt, and ηt are assumed to be independent. Compared with the model in Section III.B, we now have θ = θt, μ = θt−1, and M1 = I. The law of motion of θt is specified as a random walk without drift, but it could also be an AR(1). Substituting (28) into (27) and rearranging the resulting expression, we have:

with

which has the same structure as (24) except that the error term now contains a first-order moving average component, MA(1).

Assuming that θo and D1 are known and conditioning on {θt}, t = 0, …, T, the conditional posterior distributions of the remaining parameters can be easily derived. For instance, conditional on {θt}, βt ~ N(Moθt, Do), and its conditional posterior is obtained by combining this prior with the likelihood (18). Proceeding in this way, the following conditional distributions can be obtained:

where

The joint posterior distribution of {θt} conditional on the remaining parameters and the data can then be obtained in two steps, as shown by Chib and Greenberg (1995, pp. 349–350). First, {θt}|Y, ψθ is initialized with the Kalman Filter, whose output {θ^t|t, Rt|t, Ft}, t = 0, …, T, is saved for each t:

where

And second, by sampling p({θt}|Y,ψθ) in reverse time order from

where

Note that with this algorithm we simulate {θt} from the joint distribution p(θo, θ1, …, θT|ψθ), rather than from its full conditional distribution θt|ψθ. The latter simulation strategy produces very slow convergence to the ergodic distribution because it requires adding T + 1 additional blocks to the Gibbs sampler, a problem that is exacerbated as the dimension of θt increases. With the former strategy, instead, only one additional block is introduced in the sampler.

B. Nonnormal Data

The Bayesian framework discussed thus far can be easily adapted to take into account the presence of outliers, or to model data coming from distributions with a higher probability of extreme observations than under the normal distribution, sometimes called fat-tailed distributions in the empirical finance literature. Being able to accommodate outliers or fat tails is particularly important in financial applications dealing with high-frequency data, which are well known to depart from normality.

In a Bayesian framework, the presence of outliers can be accommodated by replacing the normal data model usually assumed with a fatter-tailed family of distributions for the error term and thus the data, such as the t-student, or by using mixture models.15 A nonnormal data specification can then be combined with alternative prior assumptions for the parameters of interest to obtain their posterior distributions along the lines discussed thus far.

Unlike the classical analysis of outliers, the Bayesian analysis makes no distinction between methods that search for outliers (possibly removing them from the analysis) and robust estimation procedures that are not vulnerable to their presence. Bayesian researchers characterize outliers as observations coming from fat-tailed distributions, or from high-variance periods in the context of mixture models of time series, rather than as extreme observations from a normal distribution. As a result, these observations do not distort the point estimates of the population moments. In addition, such a modeling approach is compatible with many of the prior distributions for the parameters of interest seen in the previous sections, including the possibility of time-variation.

To present these ideas more formally, consider the same hierarchical model discussed in Section III.B, generalized to allow for fat-tails in the data as described by a t-student distribution for the vector of error terms:

where tν(μ, σ2) denotes a t-student distribution with location μ, scale σ2, and degrees of freedom ν, ν ∈ (0, ∞), determining the shape of the distribution.16

As for the estimation of the model, the new specification requires only a small modification to what was discussed in Section III.B. Specifically, the prior (37)-(38) in the footnote must be added to the hierarchical scheme (34)-(36), and the conditional distribution of σt must be added to the Gibbs sampler. It is not difficult to show that, assuming ν is fixed, the conditional distribution becomes:

where

with

denoting the scale of the scaled inverse chi-squared distribution.

In practice, if the t-student specification aims at fitting a long-tailed distribution to a long series of observations, then it is generally appropriate to include the degrees of freedom as an unknown parameter to be estimated. If instead the t-student specification is chosen as a robust alternative to the normal to control for outliers, then ν can be fixed at an arbitrarily small value, but no smaller than prior understanding dictates (see Gelman et al., 1995, page 350 for more details).17
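As a concrete sketch of the extra Gibbs step (our own illustration with made-up residuals, not the paper's code), write the t-error model as a scale mixture of normals, et ~ N(0, σ²λt) with ν/λt ~ χ²ν; the conditional posterior of each latent scale λt is then a scaled inverse chi-squared draw of the form (ν + et²/σ²)/χ²ν+1.

```python
# Latent-scale Gibbs step for t-distributed errors: a minimal sketch.
import numpy as np

def draw_lambda(resid, sigma2, nu, rng):
    """One Gibbs draw of the latent scales lam_t (one per observation):
    lam_t | e_t ~ scaled Inv-chi2 with nu+1 degrees of freedom."""
    chi2_draws = rng.chisquare(nu + 1, size=resid.shape)
    return (nu + resid**2 / sigma2) / chi2_draws

rng = np.random.default_rng(0)
resid = rng.standard_t(5, size=1000)   # fake fat-tailed residuals
lam = draw_lambda(resid, sigma2=1.0, nu=5, rng=rng)
# large residuals receive large latent scales, so they stop distorting
# the estimate of sigma2 in the next Gibbs step
```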

C. Regime Switching, Nonlinear Models, and Beyond

Regime-switching time series models are a popular device to model nonlinear dynamics, following the seminal contribution of Hamilton (1989). Shifts between two or more regimes can be easily accommodated in the Bayesian framework discussed thus far.18 Canova (1993), however, has shown that nonlinearities, nonnormalities, and conditional heteroskedasticity may also be modelled by the kind of time-varying coefficient (TVC) models discussed above. In particular, he shows that a TVC model nests specifications generally used to characterize typical departures from standard assumptions, such as conditionally normal ARCH and ARCH-M models, or even Hamilton's regime-switching effects. Interestingly, as we have seen in the previous subsections, a TVC specification is rather easily estimated in a Bayesian framework.

The flexibility of the Bayesian approach does not end here and the modeling possibilities are numerous. For instance, it is possible to specify a t-student prior for the parameters rather than the error terms, thereby allowing for outliers in the parameters, while continuing to use a normal data model. Sometimes, it is difficult to justify a priori the normality of the parameter vector, especially if the VAR includes equations referring to different decision units, as in the case of a panel VAR. Moreover, posterior inference is generally sensitive to the assumed prior, even when the model fits the data well. For this reason, it may be useful to check robustness by assuming alternative specifications for the prior distribution of the parameter vector, such as a t-student as we discussed for the error terms.

Assume for instance that in (34)-(36) we change the assumption on the population structure as

As in previous subsection, the t-distribution can be written as a mixture of a normal and a scaled inverse-χ2 as

The conditional posterior of τ is then added to the Gibbs sampler as we added the conditional posterior of σt in the previous section.

V. Forecasting and Structural Analysis

In order to compute out-of-sample forecasts, impulse response functions (IRFs), and forecast error variance decompositions (FVDs), take the companion form of model (1) with no deterministic components but the constant:

where

Solving forward from period T, we can express the h-step ahead out-of-sample forecast at time T as:

Now, defining the n×np matrix J = [In 0 … 0] as in Lütkepohl (1990, page 13), and using the fact that J′JUt = Ut and that JUt = ut, we obtain the h-step ahead, out-of-sample forecast of YT+h:

where $C_0 = I$, $C_i = I + \sum_{j=1}^{i} B_j C_{i-j}$ $(i = 1, 2, \ldots)$, with $B_j = 0$ for $j > p$, and $\Phi_j = J B^j J'$.

Forecasting (conditional and unconditional), impulse response, and variance decomposition analyses are natural applications of (48), where the first two terms add up to the expected value of $Y_{T+h}$ (i.e., $Y_T(h) = C_{h-1}\mu + J B^h Y_T = E[Y_{T+h} \mid Y_T, \beta, \Sigma]$), while the last term is the forecast error $(Y_{T+h} - Y_T(h))$, with conditional variance equal to $\sum_{j=0}^{h-1} \Phi_j \Sigma \Phi_j'$.

A. Unconditional Forecasting

There are two different ways to forecast future realizations of the vector of variables of interest. If there are conditions or constraints on the future value of the variables or the shocks driving the model, the forecast produced by means of (48) is conditional; if there are no such conditions, it is an unconditional forecast. Here, we focus only on unconditional forecasts.19

The unconditional forecasting function (YT (h)) is given by the first two terms in (48),

Some algebraic manipulation shows that the h-step ahead forecasting function may also be written, recursively, as

where YT(j) = YT+j for j ≤ 0. Point forecasts may now be computed in two ways.

First, by substituting in equation (50) estimates of μ and Bl (l=1,…, p) obtained from the mean of the posterior distribution of β:

Alternatively, by defining ϑ=(β,Σ) and evaluating the integral

where p(ϑ|Y) denotes the posterior distribution of ϑ and the forecasting function is averaged by taking as weights the whole posterior density of the parameters and not just their posterior mean as in (51).
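A minimal numpy sketch of the point forecast computed from the companion form, using hypothetical coefficient values (this is our illustration, not the paper's code):

```python
# h-step forecasting from the companion form: stack a VAR(p) as
# Y_t = mu + B Y_{t-1} + U_t and read off the top n rows with J = [I_n 0 ... 0].
import numpy as np

def companion(B_list, mu):
    """Companion matrix and stacked intercept from VAR(p) coefficients."""
    n, p = B_list[0].shape[0], len(B_list)
    B = np.zeros((n * p, n * p))
    B[:n, :] = np.hstack(B_list)
    B[n:, :-n] = np.eye(n * (p - 1))
    m = np.zeros(n * p); m[:n] = mu
    return B, m

def forecast(B_list, mu, Y_hist, h):
    """Point forecast Y_T(h): iterate the companion form h times."""
    n, p = B_list[0].shape[0], len(B_list)
    B, m = companion(B_list, mu)
    # stack [Y_T', Y_{T-1}', ..., Y_{T-p+1}']' from the rows of Y_hist
    Y = np.concatenate([Y_hist[-1 - i] for i in range(p)])
    for _ in range(h):
        Y = m + B @ Y
    return Y[:n]
```

In a Bayesian setting the coefficients fed to `forecast` would be either the posterior means, as in (51), or successive posterior draws, as in (52).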

If instead one is interested in forecasting the whole density of YT(h) and not just one point, i.e., if one is interested in density forecasts, the following integral may be evaluated either analytically or numerically by means of the Gibbs sampler:20

The procedure is relatively simple. For each draw ϑ(j), we draw YT(j)(h) from p(YT(h)|Y, ϑ(j)). After, say, M iterations of the Gibbs sampler, these draws may be regarded as draws from p(YT(h)|Y). A point forecast can then be obtained using the ergodic mean of the empirical distribution.
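The density-forecast procedure can be sketched as follows (a toy AR(1) with made-up posterior draws standing in for the Gibbs output; all numbers are hypothetical):

```python
# Density forecasts by averaging over posterior draws: for each parameter
# draw we simulate one future path, and the collected end points
# approximate p(Y_T(h) | Y).
import numpy as np

rng = np.random.default_rng(0)
y_T, h, M = 1.0, 4, 4000
draws = np.empty(M)
for j in range(M):
    # pretend these come from the Gibbs sampler's posterior for an AR(1)
    beta = rng.normal(0.8, 0.05)
    sigma = np.sqrt(1.0 / rng.chisquare(20) * 20) * 0.1
    y = y_T
    for _ in range(h):                  # simulate one future path
        y = beta * y + rng.normal(0.0, sigma)
    draws[j] = y
point = draws.mean()                    # ergodic mean = point forecast
band = np.percentile(draws, [25, 75])   # posterior interquartile band
```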

B. Structural Analysis

Computation of unconditional forecasts is related to the calculation of IRFs and FVDs. But while the reduced form is all we need for unconditional forecasting, structural VAR analysis requires the solution of a joint estimation-identification problem. However, if the system is exactly identified, the analysis remains relatively simple.21

For instance, take the model with time-invariant parameters and any of the prior assumptions discussed above. Any such model may be seen as the reduced form of a structural form in which:

where A is an n×n non-singular matrix of contemporaneous correlations, εt=A−1vt, and vt is assumed normally and independently distributed with E[vt|Yt-s, s>0]=0 and E[vtv′t|Yt-s, s>0]=I, for all t. Equation (48) can then be re-written as

where the n×n matrix $\Psi_j = \Phi_j A^{-1}$ is the matrix of the j-th-period-ahead impulse responses, while the forecast error variance is then given by $\sum_{j=0}^{h-1} \Psi_j \Psi_j' = \sum_{j=0}^{h-1} \sum_{i} \Phi_j a_i a_i' \Phi_j'$, where $a_i$ is the i-th column of $A^{-1}$.

Probability distributions of the responses to an impulse in the k-th structural shock may be computed by making random draws for the k-th column of Ψj for periods j = 0, …, h, as explained for unconditional forecasting. The mean response and the percentiles are then used to summarize the posterior distribution of these statistics. Similarly, a distribution for the contribution of the i-th innovation to the forecast error variance of the h-step ahead forecast can be obtained by making draws of $\sum_{j=0}^{h-1} \Phi_j a_i a_i' \Phi_j'$ as a proportion of the total forecast error variance.
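For a single parameter draw, the computation of Ψj and of the FVD shares can be sketched as follows (our illustration for a recursive scheme, with A−1 taken as the Cholesky factor of Σ; in practice this would be repeated for each posterior draw):

```python
# Structural IRFs Psi_j = Phi_j A^{-1} and forecast error variance
# decomposition shares for an exactly identified recursive scheme.
import numpy as np

def irf_fevd(B_list, Sigma, h):
    n, p = B_list[0].shape[0], len(B_list)
    Ainv = np.linalg.cholesky(Sigma)        # recursive identification
    # MA coefficients Phi_j = J B^j J' from the companion matrix B
    B = np.zeros((n * p, n * p))
    B[:n, :] = np.hstack(B_list)
    B[n:, :-n] = np.eye(n * (p - 1))
    J = np.hstack([np.eye(n), np.zeros((n, n * (p - 1)))])
    Psi = []
    Bpow = np.eye(n * p)
    for _ in range(h):
        Psi.append(J @ Bpow @ J.T @ Ainv)   # Psi_j = Phi_j A^{-1}
        Bpow = B @ Bpow
    Psi = np.array(Psi)                     # (h, n, n): response x shock
    # contribution of shock i to the h-step variance of each variable
    contrib = (Psi**2).sum(axis=0)          # (n, n): variable x shock
    shares = contrib / contrib.sum(axis=1, keepdims=True)
    return Psi, shares
```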

In the case of overidentification, the mapping between Σ−1 and A is not one-to-one. This means that we cannot obtain the posterior distribution of A while computing the posterior distribution of Σ, given the chosen identifying scheme. In case of overidentification, as explained in Sims and Zha (1998), a joint prior distribution on the parameters governing contemporaneous and lagged interdependence between the variables of interest must be specified, and the posterior distribution of the parameters governing contemporaneous correlations becomes nonstandard. In this case, one needs to take a second-order Taylor expansion around the maximum of the likelihood to obtain the posterior distribution of these parameters.22

VI. An Application: Estimating a System of Reaction Functions

In this section, by drawing on the work reported by Ciccarelli and Rebucci (2002), we apply some of the results previously reviewed to the estimation of a system of monetary reaction functions of the type discussed and estimated by Clarida, Gali, and Gertler (1997).

Researchers estimate central bank reaction functions either to investigate the behavior of the monetary authorities or to obtain measures of the expected and unexpected component of monetary policy. In the empirical monetary policy literature, examples of use of estimated reaction functions for both purposes abound. For instance, in a seminal contribution, Clarida and others (1997) estimate the reaction function of the largest advanced economies’ central banks to compare their behavior; Dornbusch and others (1998) estimate a system of reaction functions for European central banks to compare the transmission mechanism of monetary policy across these countries; Sims (1999) and Sargent and Cogley (2002) revisit the U.S. postwar monetary policy history by estimating a reaction function for the US Federal Reserve with methods similar to those reviewed in this paper; finally, Ciccarelli and Rebucci (2002) use the system of reaction functions presented in this section to investigate how the transmission mechanism of European monetary policy has changed over time.

More specifically, we consider four central banks in the European Monetary System (EMS)—Germany, France, Italy, and Spain—which are the four largest economies currently in the European Monetary Union (EMU), accounting for about 80 percent of euro-area GDP.23 As these countries’ commitment to the EMS, and hence their monetary policy regime, has changed over time, we also let the model parameters change over time. In addition, as short-term interest rates may reflect not only monetary policy actions and surprises but also exogenous shocks to exchange rate risk premia and money demand shocks not fully accommodated by the money supply, we assume that the error terms are t-distributed.

Therefore, we estimate a time-varying coefficient system for nonnormal data by adapting the models discussed in Sections IV.A and IV.B. The specification, identification, and estimation of the particular econometric model used are presented in the first three subsections. The estimation results are reported and discussed in the following one.

A. Specification

The behavior of the four European central banks considered is modeled by the following time-varying structural VAR:

where $R_t = [r_t^1, \ldots, r_t^4]'$ is a (4×1) vector of monetary policy instruments, $W_t = [w_t^1, \ldots, w_t^4]'$ is a (4×1) vector of monetary policy final objectives and exogenous variables, At(L) and Bt(L) are time-varying polynomial matrices in the lag operator L, with lag lengths p1 and p2 respectively, and Dt is a (4×1) vector of constants. Here, $U_t = [u_t^1, \ldots, u_t^4]'$ is a (4×1) vector of monetary policy shocks such that:

where Zt contains lagged Rt and contemporaneous and lagged Wt, and E denotes the expectation operator.

A short-term interest rate is assumed to be the monetary policy instrument. Consistent with the specification of a standard VAR for the analysis of monetary policy in an open economy, in wti we include contemporaneous and lagged inflation (π), output (y), and the lagged nominal exchange rate (e), each in percent deviation from its target (π*, y*, and e*, respectively). In addition, we include a contemporaneous and lagged index of commodity prices (cp), the U.S. federal funds rate (rtUS), and the lagged value of a broad monetary aggregate (m). Commodity prices and the U.S. money market rate are included to control for external shocks, while monetary aggregates are widely believed to have been monitored closely by most European central banks throughout the period considered. Thus,

All data used are from the International Financial Statistics database of the IMF. As a proxy for the short-term interest rate, following Bernanke and Mihov (1997) and Clarida, Gali, and Gertler (1997), we use the money market rate. Output is measured by an industrial production index. Inflation is measured by the annual change in the consumer price index. We use the bilateral exchange rate vis-à-vis the deutsche mark (DM) for France, Italy, and Spain, and the DM/U.S. dollar rate for Germany. The commodity price index and the monetary aggregate are entered in first differences. The monetary aggregate chosen is a seasonally adjusted M3 series.24 The target variables π* and y* are the fitted values of linear regressions of the actual variables (πi,t and yi,t) on a constant and a linear trend, and on a constant and a quadratic trend, respectively, while e* is the central parity vis-à-vis the DM.25

The specification chosen imposes very few a priori restrictions on the system of reaction functions: all parameters in At(L) and Bt(L) are unrestricted and can vary over time, including those governing the contemporaneous causation among short-term interest rates.26 Leaving Bt(L) unrestricted allows the behavior of the central banks considered to change during the sample period, letting the data reveal which objectives they were actually pursuing in each period. For instance, the Spanish peseta joined the EMS only in 1989, the Italian lira floated more or less freely from September 1992 to November 1996, and the fluctuation bands of all three currencies vis-à-vis the DM changed several times during the period considered. Even the Bundesbank’s focus might have shifted away from strictly domestic objectives, after the early years of German unification, in the run-up to EMU. It is evident that, if the policy changes and possible outliers are not accounted for by the estimated parameters of the system of reaction functions, they will end up in the estimated residuals, thereby potentially undermining their interpretation as well-behaved (i.e., white noise) policy innovations as assumed in (55).

Leaving At(L) unrestricted allows for lagged interdependence among the short-term interest rates of different countries, as well as for varying degrees of interest rate smoothing over time. In addition, as we shall see below, our prior assumptions on the matrix of contemporaneous relations among the interest rates considered (i.e., the matrix At(0)) allow for the possibility that there are exchange rate risk premium shocks and money demand shocks not fully accommodated by the money supply, possibly resulting in large outliers in Ut.

Nonetheless, we do impose some lag-length restrictions by choosing p1 and p2 based on the Schwarz (BIC) criterion and ex post misspecification tests on the estimated residuals. While the BIC criterion suggests setting p1 = 6 and p2 = 1, we find that our residuals pass all misspecification tests using p1 = 2 and p2 = 1 (see below). Therefore, to save computing time, the final results are based on the shorter lag structure.

B. Identification

Identification of (54) may be achieved through exclusion restrictions on the coefficient matrix At(0). The specific scheme used exploits the Bundesbank’s presumed leading role under the EMS and the relative economic size of other countries.27 More specifically, we place the German short term interest rate first in the vector Rt, assuming that it affects other European interest rates contemporaneously without being affected by them. We then assume that French and Italian interest rates affect contemporaneously the Spanish rate without being affected by it. This is plausible given that Spain’s GDP was considerably smaller than that of France and Italy during much of the period considered. (Spain also joined the EMS only in 1989.) Finally, we assume that the impact on France of an increase in interest rates in Italy is the same as the impact on Italy of an increase in French rates.28

Formally, we need six restrictions on At(0) to identify the model. The assumptions above provide the six restrictions that identify the model exactly and translate into the following block recursive structure for At (0):

where A11(0), A31(0), and A33(0) are scalars, A21(0) and A32(0) are 2×1, and A22(0) is 2×2. The leader-follower behavior presumably characterizing the EMS imposes three zero restrictions in the first row of this matrix. The smaller size of Spain relative to that of France and Italy allows us to impose two more zero restrictions in the last column of this matrix, while the last restriction is obtained by imposing symmetry on A22(0).

The structural VAR(54), therefore, can be rewritten as:

where R1t, W1t and U1t are the German monetary policy instrument, objectives and shock, respectively, R2t and U2t(R3t, W3t, and U3t) are the vectors containing the same variables for France and Italy (and Spain).

C. Estimation

Bayesian estimation of (57) exploits its block recursive structure. Following Zha (1999), let kj and Gj be the total number of right-hand-side variables per equation and the total number of equations in block j of (57), respectively, where the same set of variables enters the equations of each block j. If we pre-multiply (57) by the (4×4) matrix

and rearrange terms, the model can be divided into three blocks.

Here, $Z_t^j = \mathrm{diag}[Z_{1,t}^j, Z_{2,t}^j, \ldots, Z_{G_j,t}^j]$ denotes a ($G_j \times k_j G_j$) block-diagonal matrix whose elements are the (1×kj) vectors $Z_{g,t}^j$ containing all contemporaneous (in our case R1t in blocks 2 and 3, and R2t in block 3) and lagged endogenous, exogenous, and deterministic variables of equation g in block j, for g = 1, …, Gj; $\delta_t^j = [\delta_{1,t}^{j\,\prime}, \delta_{2,t}^{j\,\prime}, \ldots, \delta_{G_j,t}^{j\,\prime}]'$ denotes a ($k_j G_j \times 1$) vector whose ($k_j \times 1$) elements $\delta_{g,t}^j$ contain the parameters of equation g in block j (for g = 1, …, Gj); and $v_t^j = A_{t,jj}^{-1}(0)\,U_t^j$.

We make the following prior assumptions on σjt and $A_{t,jj}^{-1}(0)$:

with

Thus,

As explained in Section IV.B, this is equivalent to assuming that $v_t^j \sim t_{\nu}(0, \Sigma_{jj})$, which in turn is the same as assuming that

This assumption is economically plausible because it allows for the possibility that there are other shocks in addition to money supply shocks, such as exchange risk premium and money demand shocks not fully accommodated by monetary authorities, that may possibly result in large outliers in Ut potentially distorting our estimates.29

The other prior assumptions on the model’s parameters generalize those introduced by Zellner (1971, Chapter 8) to take into account the presence of time-varying coefficients: a time-varying Minnesota prior (e.g., Doan, Litterman, and Sims, 1984) for the slope coefficients (δjt) is combined with a diffuse prior on the variance-covariance matrix of the residuals (Σjj) and with the assumption made for the time variation of the error term (σjt), under prior independence.30 Thus:

where

Here, Pj is a ($G_j k_j \times G_j k_j$) matrix governing the law of motion of δjt, δ0j is the unconditional mean of δjt, Φj governs the time variation of δjt, and ηjt is assumed to be independent of vjt.

Now, denoting with RjT the sample data for each block j of (58), the pdf of the data, $L(R_T^j \mid Z_t^j, \delta_t^j, \Sigma_{jj}, \sigma_{jt})$, conditional on the exogenous variables, the initial observation Rj0, and the parameters of the model (δjt and Σjj), is proportional (∝) to:

The posterior distribution of $\Sigma_{jj}^{-1}$, conditional on the entire history of δjt for t = 0, …, T (denoted {δjt}), on σjt, and on the data, is easily obtained by combining (62) with p(Σjj) as the following Wishart distribution with T degrees of freedom and scale matrix S (Zha, 1999, p. 299):

where

The joint posterior distribution of {δjt}, conditional on Σjj, σjt, and the data, is obtained as discussed in Section IV.A. Specifically,

where $\bar{\delta}_{jt|t}$ and $\bar{\Omega}_{jt|t}$ are the one-period-ahead forecast of δjt and the variance-covariance matrix of its mean square error, respectively, calculated by the Kalman filter as in (26).

The conditional posterior distribution of σjt is similar to (43):

where

Given (63), (64), and (65), and initial values for Pj, Φj, $\hat{\Omega}_{j0}$, and $\hat{\delta}_{j0}$, the marginal posterior distributions of $\Sigma_{jj}^{-1}$, {δjt}, and σjt can then be obtained from the Gibbs sampler by drawing alternately from these conditional distributions. Here, note that, given the marginal posterior distributions of σjt and Σjj, we can recover the posterior distribution of At,jj(0), and thus also the posterior distribution of the structural residual Ut, since the matrices Ajj(0) are exactly identified and thus linked to Σjj by a one-to-one mapping.
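For instance, the Wishart step (63) can be sketched as follows (our illustration using scipy's parameterization, in which the posterior of the precision matrix has df = T and scale S−1; this is not the paper's RATS code):

```python
# One Gibbs draw of Sigma^{-1} from its Wishart conditional posterior,
# given the current residuals v_t.
import numpy as np
from scipy.stats import wishart

def draw_precision(resid, rng):
    """resid: (T, n) array of residuals; returns one draw of Sigma^{-1}."""
    T = resid.shape[0]
    S = resid.T @ resid                 # scale matrix: sum_t v_t v_t'
    return wishart.rvs(df=T, scale=np.linalg.inv(S), random_state=rng)
```

Within the full sampler, this draw would alternate with the Kalman-filter draw of {δjt} in (64) and the latent-scale draw of σjt in (65).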

As explained in Section III.A, we define Pj, Φj, $\hat{\Omega}_{j0}$, and $\hat{\delta}_{j0}$ in terms of a few hyperparameters and then maximize the likelihood of the data as a function of this smaller parameter set to obtain numerical values that are then fed into the Gibbs sampler. More specifically, $\hat{\Omega}_{j0}$ and $\hat{\delta}_{j0}$ are defined exactly as $\bar{\Omega}_g$ and $\bar{\beta}_g$ in Section III.A, while the matrices Pj and Φj are defined as:

where Pjg = diag(π8,g) are (kj×kj) matrices, with π8,g controlling the coefficients of the law of motion of each element of δjg, and Φjg = diag(π7,g) are (kj×kj) matrices, with π7,g controlling the variance around the values actually introduced in the model, for g = 1, …, Gj. Finally, for all j we set νj equal to 5.31

Given the values of the model hyperparameters, (32) is run. The Gibbs sampler then starts iterating, switching between (63), (64), and (65) and taking the estimated values of π1, …, π8 as given. The Gibbs sampler iterates 5,000 times, yielding 4,000 draws from the joint and marginal posterior distributions of the parameters of interest after discarding the first 1,000 draws. All the numerical integrations and the statistics reported are based on these last 4,000 draws.

D. Results

In this section we report the estimation results: the estimated reaction function residuals and parameters derived from (57) for each country considered. The posterior medians of the estimated residuals are reported in Figure 1. The posterior medians of the estimated parameters, together with the first and third quartiles for each estimated reaction function, are reported in Figures 2-5.

Figure 1. Reaction Function Residuals

(Standardized, January 1981-December 1998)

Figure 2. Germany: Reaction Function Parameters

(Elasticities, January 1981-December 1998)

Figure 3. France: Reaction Function Parameters

(Elasticities, January 1981-December 1998)

Figure 4. Italy: Reaction Function Parameters

(Elasticities, January 1981-December 1998)

Figure 5. Spain: Reaction Function Parameters

(Elasticities, January 1981-December 1998)

Figure 6 reports the posterior median of the error variance. The sample period is from January 1980 to December 1998, but the reported results are from February 1981 to December 1998 because 24 observations are used to initialize the estimation procedure. In Table 1 we also report the estimated hyperparameters (with the same notation as in Section III.A).

Figure 6. Interest Rate Posterior Volatility

(January 1981-December 1998)
Table 1. Estimated Hyperparameters

         Germany     France      Italy       Spain
π1       0.9782      0.9922      0.9328      0.9814
π2       1.0         1.0         1.0         1.0
π3       0.3039      0.2075      47.0987     0.0668
π4       6.8168      10.659      6.3644      0.9099
π5       363.376     8.89e-003   377848.0    136.187
π6       0.2183      2.7086      1.19e-005   2.4612
π7       4.52e-007   4.61e-009   3.22e-008   9.36e-006
π8       0.9849      0.9429      0.9952      0.9673

The estimated hyperparameters are fed into the Gibbs sampler, affecting the prior assumptions actually used in estimation, and thus also our posterior estimates of the model residuals and parameters. Therefore, a few remarks are in order regarding the estimated hyperparameters in Table 1. First, without loss of generality, to economize on free hyperparameters, we assume that the tightness on the coefficients of the lagged endogenous variables (π2) is equal to 1 (as suggested by Doan, 2000, Chapter 10). Second, time variation, as controlled by π7, does not appear to be a major feature of the data, as will also be discussed below. Third, the prior mean of the coefficient on the first lagged endogenous variable (π1) is estimated to be close to one for all countries, as expected. Fourth, higher-order lags decay rather quickly in the case of Germany, France, and Italy, but less so in the case of Spain (π4). Fifth, overall parameter uncertainty, as controlled by π6, seems higher in the case of Spain and France than in that of Italy and Germany, meaning that the posterior distributions should be more concentrated in the latter cases. Finally, we note that π8 is very close to one for all countries considered, meaning that the process for the coefficient vector δjt is close to a random walk.

The estimated residuals may reflect potential model misspecification and indicate the model’s goodness of fit. As we can see from Figure 1, they appear remarkably well behaved for all countries considered: there are essentially no outliers, and there is also little or no evidence of serial autocorrelation or heteroskedasticity.

This is also borne out by the battery of standard test statistics reported in Table 2, which presents summary and test statistics for the null hypothesis that the estimated posterior medians of the residuals follow a white noise process. The first two lines report the sample mean and the p-value for the null hypothesis that it is zero, respectively. The second two lines report a Kolmogorov-Smirnov statistic for Durbin’s (1969) cumulated periodogram test.32 The following six lines report the Ljung-Box statistics for the null hypothesis of absence of serial correlation of order higher than specified, with their respective p-values. Finally, the last two lines report Engle’s test for the null of absence of second-order autoregressive conditional heteroskedasticity, together with the p-values in brackets.

Table 2. White Noise Test Statistics

               Germany      France       Italy        Spain
Sample Mean    0.0053       0.0062       -0.0301      -0.0044
               (0.937)      (0.927)      (0.659)      (0.948)
Cum. Period.   0.0816       0.0665       0.0617       0.0997
               (0.0981)     (0.0981)     (0.0981)     (0.0981)
Q(4)           6.2035       1.4762       2.3753       5.3435
               (0.184)      (0.831)      (0.667)      (0.254)
Q(8)           7.1921       8.3869       6.5074       12.186
               (0.516)      (0.397)      (0.591)      (0.143)
Q(12)          7.8025       10.430       16.4072      23.8513
               (0.800)      (0.578)      (0.173)      (0.021)
Arch(2)        2.7708       3.5593       7.1603       33.4121
               (0.250)      (0.169)      (0.029)      (0.000)

As we can see from this table, the null hypothesis of zero mean cannot be rejected for any country, and the absence of heteroskedasticity and autocorrelation is also never rejected by any statistic for Germany and France. The results for Spain, and to a lesser extent for Italy, show some evidence of ARCH effects and of autocorrelation at higher orders, though. Nonetheless, compared with other studies, one notable difference is the absence of large outliers in these residuals corresponding to the 1992 EMS crisis and the subsequent periods of financial turbulence documented by Favero and Giavazzi (2002), among others. This is because these episodes have been captured by the time-varying interest rate volatility introduced through our prior specification, whose posterior medians are reported in Figure 6.
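For reference, Ljung-Box Q statistics of the kind reported in Table 2 can be reproduced with a few lines of code (a self-contained sketch, not the original RATS implementation):

```python
# Ljung-Box Q statistic: Q(m) = T(T+2) * sum_{k=1}^{m} rho_k^2 / (T-k),
# compared with a chi-squared distribution with m degrees of freedom.
import numpy as np
from scipy.stats import chi2

def ljung_box(x, m):
    """Return the Q statistic and its chi-squared p-value for lags 1..m."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    T = len(x)
    denom = np.sum(x**2)
    lags = np.arange(1, m + 1)
    rho = np.array([np.sum(x[k:] * x[:-k]) for k in lags]) / denom
    Q = T * (T + 2) * np.sum(rho**2 / (T - lags))
    return Q, chi2.sf(Q, m)

rng = np.random.default_rng(0)
Q, p = ljung_box(rng.normal(size=216), 12)  # 216 monthly observations
# a large p-value means the white-noise null cannot be rejected
```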

We can now focus on the estimated model parameters. Figure 2 reports the posterior median of selected parameters of the Bundesbank’s reaction function together with a band containing 50 percent of the posterior distribution. Interestingly, the estimated parameters appear rather stable over time, except for the structural break around the time of German unification, which does not even appear in all coefficients. This suggests that no major behavioral change took place in the Bundesbank reaction function in the run up to the monetary union (EMU).

These estimation results conform well to standard views of the behavior of the German central bank: they show a high degree of interest rate smoothing or persistence, relatively low weight on other European countries’ targets and a relatively large weight on the domestic inflation target, with some weight also attached to U.S. variables. The coefficient of the German own lagged interest rate remains close to 1 throughout the sample period. Among domestic objectives, the inflation gap is by far the most important variable, although its effect is not estimated very precisely. The U.S. interest rate and the DM/U.S. dollar exchange rate also have notable impacts.33 The coefficient on the U.S. federal funds rate, in particular, is comparable in size to that of the domestic inflation target. The coefficients of other European countries’ targets are generally not significantly different from zero, except for the output gaps and the Italian exchange rate gap. But their magnitudes are quite small compared to other objectives and slightly declining over time. The coefficients of foreign inflation gaps are clearly insignificant statistically.

The parameters of the reaction function of the Bank of France also conform well to what one would expect for a follower country in the EMS (Figure 3): the German policy interest rate is the most important variable, and exchange rate targets have a smaller (and slightly declining) but significant weight. Domestic and other countries’ inflation and output targets, instead, have weights which are either very small or not significant statistically. Only monetary growth appears to have had a stronger role, but this declined markedly over time.

The results for Spain, and to a lesser extent Italy, are consistent with the view that the Bank of Spain and the Bank of Italy were less constrained than the central banks of other European countries by the EMS (Figures 4 and 5). The Bank of Spain, in particular, appears to have been the least constrained among the countries considered. Both reaction functions show a much smaller and insignificant weight attached to the German contemporaneous rate and somewhat larger weights on output gaps compared to France. Interestingly, as in the case of France, the coefficient of the German exchange rate appears to have declined significantly after 1992. The reaction function of the Bank of Spain, in particular, also shows more instability, consistent with Spain’s later entry in the EMS. The weight attached to monetary growth in Italy (Spain) is also smaller (larger) than that in the French reaction function. The domestic output gap affects Spanish short-term interest rates to a much greater extent than in France or even Italy, with an impact comparable to that of the German interest rate. Finally, interest rate persistence is the smallest in Spain.

Overall, these results show how difficult it would have been to choose a restricted and yet uniform econometric specification to describe the different behavior of the central banks considered. The fact that we find relatively “clean” estimated residuals then confirms that most of the behavioral differences across countries and over time are well captured by the adopted econometric specification for the system of reaction functions studied.

VII. Conclusions

In this paper we have reviewed recent developments in the literature on Bayesian VARs and have shown how to apply some of these results with an application to the estimation of a system of monetary reaction functions. Starting from general Bayesian principles applied to the concrete case of VAR estimation, we have described several prior distributions and some useful extensions of the basic model, including nonnormal data, nonlinear models, and time-varying coefficient models. In all cases analyzed, we have provided expressions for the posterior distributions, computed either analytically or numerically. Besides its interesting retrospective economic content, the application presented shows how flexible the Bayesian approach can be in dealing with complex dynamic economic problems.

References

Universitat d’Alicant (Alicante, Spain) and IMF, respectively. Forthcoming in the “Rivista di Politica Economica”. The authors are grateful to Gustavo Piga, two anonymous referees, Fabio Canova and Chris Gilbert for useful comments, and to Luigi Guiso for his suggestions and encouragement.

See Canova (1995) for specific references.

Nonetheless, this survey does not provide specific indications on software implementation. The interested reader may consult Geweke (2000) for references to software freely available on the web for the implementation of some of the methods reviewed in this paper. The application in Section VI of the paper has been implemented in RATS, which has a number of preprogrammed functions and procedures for Bayesian estimation of VARs.

This specification has been criticized on the ground that the resulting estimation procedure might be affected by spurious regression problems as in the classical case, due to the presence of unit roots or stochastic trends in the data. Proponents of the Bayesian approach to VAR estimation, however, argue that this is not the case. See the special issue of the Journal of Applied Econometrics (1991, Vol. 6), in which the Bayesian treatment of unit roots is thoroughly discussed, for a presentation of the arguments and counterarguments.

We shall denote probability density functions with a ‘p’ when they are prior or posterior distributions of the parameters, and with an ‘L’ when they are likelihood functions.

See Gelfand and others (1990) for a detailed discussion of the Gibbs sampler and Canova and Ciccarelli (2000), Ciccarelli and Rebucci (2001) and Hsiao and others (1999), among others, for some applications. The brief summary in the text is borrowed from this latter contribution. For a more general discussion of numerical Bayesian methods see also Geweke (2000) and Gelman and others (1995).

Thus, a diffuse prior may be interpreted as assigning zero probability weight to the prior information relative to the information contained in the data.

Note that by allowing π6 to approach infinity the prior becomes diffuse, while if π3 is set to zero, the prior defines a set of univariate AR(p) models.

The Wishart distribution is the multivariate generalization of the gamma distribution. If W ~ W(Q, q), where Q is of dimension k × k, then the density of W is proportional to |W|^((q−k−1)/2) × exp(−½ tr(Q⁻¹W)). On the other hand, if W⁻¹ ~ W(Q, q), then W has the inverse-Wishart distribution. The inverse-Wishart is the conjugate prior distribution for the covariance matrix of a multivariate normal. More references on univariate and multivariate distributions, and on sampling from them, are Zellner (1971, Appendix B) and Gelman et al. (1995, Appendix A).
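Both distributions can be sampled directly with standard scientific libraries. A minimal sketch in Python (the scale matrix Q and degrees of freedom q below are arbitrary illustrative choices, not values from the paper):

```python
import numpy as np
from scipy.stats import wishart, invwishart

rng = np.random.default_rng(0)
k, q = 2, 10                      # dimension and degrees of freedom (assumed)
Q = np.array([[1.0, 0.3],
              [0.3, 1.0]])        # scale matrix (assumed)

# Wishart draws: the sample mean should approach E[W] = q * Q
W = wishart.rvs(df=q, scale=Q, size=5000, random_state=rng)
print(W.mean(axis=0))

# inverse-Wishart draw: every draw is a symmetric positive definite matrix,
# as a covariance matrix must be
S = invwishart.rvs(df=q, scale=Q, random_state=rng)
print(np.allclose(S, S.T), np.all(np.linalg.eigvalsh(S) > 0))
```

The inverse-Wishart draw can thus be used directly as a candidate covariance matrix in posterior simulation.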

This integration, however, can also be approximated numerically by first drawing Σ from the inverted Wishart (7), and then using these draws to simulate β from (6).
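The two-step simulation can be sketched as follows. The posterior quantities below (scale matrix S, degrees of freedom, posterior mean and the scalar (x′x)⁻¹ of a single-regressor system) are hypothetical placeholders, not the expressions (6) and (7) of the paper:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(1)
k, T = 2, 100
S = np.eye(k)            # hypothetical posterior scale matrix of Sigma
df = T                   # hypothetical posterior degrees of freedom
b_bar = np.zeros(k)      # hypothetical posterior mean of beta
xtx_inv = 1.0 / T        # (x'x)^{-1} for a single regressor (assumed)

beta_draws = np.empty((1000, k))
for i in range(1000):
    # step 1: draw Sigma from its inverted-Wishart posterior
    Sigma = invwishart.rvs(df=df, scale=S, random_state=rng)
    # step 2: draw beta | Sigma from its conditional normal posterior
    beta_draws[i] = rng.multivariate_normal(b_bar, Sigma * xtx_inv)

print(beta_draws.mean(axis=0))   # Monte Carlo estimate of E[beta | data]
```

Averaging the beta draws approximates the marginal posterior mean, which is what the analytical integration delivers in closed form.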

Formally, conjugacy is defined as follows: if F is a class of sampling distributions p(y | θ), and P is a class of prior distributions for θ, then the class P is conjugate for F if p(θ | y) ∈ P for all p(y | θ) ∈ F and p(θ) ∈ P. Natural conjugate priors arise by taking P to be the set of all densities having the same functional form as the likelihood (Gelman et al., 1995).
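A minimal numerical illustration of conjugacy, assuming a normal likelihood with known variance and a normal prior on the mean (all numerical values below are arbitrary): the posterior stays in the normal family, with a precision-weighted mean.

```python
import numpy as np

rng = np.random.default_rng(1)
# likelihood: y_i ~ N(theta, sigma2), with sigma2 known (assumed value)
sigma2, n = 4.0, 50
y = rng.normal(2.0, np.sqrt(sigma2), size=n)

# conjugate prior: theta ~ N(mu0, tau0sq), same functional form in theta
mu0, tau0sq = 0.0, 10.0

# the posterior is again normal: precision adds, means are precision-weighted
post_var = 1.0 / (1.0 / tau0sq + n / sigma2)
post_mean = post_var * (mu0 / tau0sq + y.sum() / sigma2)

print(post_mean, post_var)
```

The posterior mean lies between the prior mean and the sample mean, and the posterior variance is smaller than both the prior variance and the sampling variance of the mean — the usual shrinkage delivered by a conjugate prior.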

See Hamilton (1994, Chapter 13) for more details on the Kalman Filter.

We apply this model to the estimation of a system of reaction functions for four European central banks under the EMS. See also Ballabriga et al. (1998) for a detailed presentation of the methodology.

See Canova and Ciccarelli (2000) and Ciccarelli and Rebucci (2002) for applications of this procedure to the prediction of turning points in the business cycle of G7 countries and to the analysis of the transmission mechanism of European monetary policy, respectively.

Mixture models are statistical models based on the combination of two or more distributions. They are frequently used when the measurements of a random variable are taken under two or more conditions, or when the population is known or assumed to consist of subpopulations, each of which follows a different, simpler model. For more details see, for instance, Gelman et al. (1995).

The t-student distribution can be interpreted as a mixture model, i.e., a mixture of normal distributions with a common mean and time-varying variance (in the case of a time series regression) distributed as a scaled inverse chi-squared distribution. For instance,

εt ~ tν(0, Σ)

is equivalent to

εt | λt ~ N(0, λtΣ), with λt ~ Inv-χ²(ν, 1).

Statistically, therefore, outliers may be interpreted as observations drawn from a distribution with higher variance. Hence, the variance of the error term is heteroskedastic, and the degree of heteroskedasticity depends on the number of degrees of freedom ν. In fact, as ν approaches infinity, εt converges in distribution to N(0, Σ), as E(λt) tends to one and V(λt) tends to zero.
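The equivalence between the t distribution and a scale mixture of normals can be checked numerically. A minimal univariate sketch (ν and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
nu, n = 5, 200_000

# scale mixture: lambda_t ~ scaled Inv-chi^2(nu, 1), i.e. nu / chi^2(nu),
# then eps_t | lambda_t ~ N(0, lambda_t)
lam = nu / rng.chisquare(nu, size=n)
mixture = rng.normal(0.0, np.sqrt(lam))

# direct draws from a t-distribution with nu degrees of freedom
direct = rng.standard_t(nu, size=n)

# both samples share the t_nu moments: mean 0 and variance nu / (nu - 2)
print(mixture.var(), direct.var(), nu / (nu - 2))
```

The occasional large λt inflates the conditional variance for that observation, which is exactly the "outliers as high-variance draws" interpretation given in the text.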

See Ciccarelli and Rebucci (2002) for an application of a non-normal, TVC model to the measurement of financial contagion. Similar models have been used by Sims (1999) and Cogley and Sargent (2002) to study monetary policy issues.

See Kim (1994) and Chib et al. (2001a) for univariate applications and Chib et al. (2001b) and Cogley and Sargent (2002), among others, for multivariate applications.

On conditional forecasting see, among others, Waggoner and Zha (1998). Kadiyala and Karlsson (1997) evaluate the forecasting performance of most of the specifications discussed in the previous section.

Density forecasts are of great help when the researcher wants to compute turning point probabilities, for instance, as in Canova and Ciccarelli (2001).

See Amisano and Giannini (1997) on the general issue of identification in VARs. For a more complete Bayesian treatment of the calculation of impulse response functions in structural VAR models see Koop (1992). For the case of over-identification in a Bayesian framework see Zha (1999) and Sims and Zha (1998).

See Zha (1999, page 300) for details.

There is nothing in our empirical framework that would prevent us from including more than four countries, except additional computing costs.

All variables are transformed into natural logarithms so that estimated coefficients can be interpreted as elasticities. For interest rates, we take the natural logarithm of the gross rates.

The exchange rate gap for Italy and Spain is set to zero before the Spanish peseta joined the EMS and during the period in which the Italian lira was floating following the 1992 ERM crisis. We also computed the inflation and output targets by taking deviations from the German inflation rate and by using the HP filter for the output series, finding similar results (available on request).

Note, however, that the relative tightness of the prior distribution placed on the elements of At(L) and Bt(L) distinguishes between own and other countries’ monetary policy instruments (the endogenous variables), between instruments and objectives, and between own and other countries’ objectives (the exogenous variables).

See Giavazzi and Giovannini (1988) and Kenen (1995) on the Bundesbank’s presumed leading role in the EMS from the mid-1980s onward.

See Amisano and Giannini (1997, pages 166–67) for an example of identification by means of symmetry as assumed here.

By imposing an autoregressive structure on σjt (or its log), instead of assuming a simple t-student distribution for the error term, we could specify a stochastic volatility model such as that used by Cogley and Sargent (2002) or Uhlig (1992). However, the prior assumption chosen is sufficient to model temporary, non-persistent shifts in the conditional variance of vjt.

See the Normal-diffuse case above.

Note that as the first and third blocks of the model contain only one equation, (63) becomes an inverted gamma for j=1.

This test cumulates the periodogram (i.e., the squared Fourier transform of the series) over the frequencies from 0 to π and scales it so as to have an end-value equal to one. If the series examined is white noise, its cumulated periodogram should differ only marginally from the theoretical cumulated periodogram of a white noise, which is a straight line. Concretely, the table reports the maximum gap between the actual and theoretical cumulated periodograms, together with the approximate rejection limit at the 5 percent significance level in brackets.
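The statistic can be sketched as follows; the series below (a white noise and a persistent AR(1)) are simulated illustrations, not the paper's residuals:

```python
import numpy as np

def cumulated_periodogram_gap(x):
    """Maximum gap between the cumulated periodogram of x and the
    theoretical cumulated periodogram of a white noise (a straight line)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    pgram = np.abs(np.fft.rfft(x)[1:]) ** 2   # periodogram on (0, pi]
    cum = np.cumsum(pgram) / pgram.sum()      # scaled to end-value one
    theory = np.arange(1, cum.size + 1) / cum.size
    return np.max(np.abs(cum - theory))

rng = np.random.default_rng(7)
wn = rng.standard_normal(500)                 # white noise
ar = np.zeros(500)                            # persistent AR(1), phi = 0.9
for t in range(1, 500):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()

print(cumulated_periodogram_gap(wn), cumulated_periodogram_gap(ar))
```

For a white noise the gap stays small (the textbook 5 percent limit is roughly 1.36/√m, with m the number of periodogram ordinates), while serial correlation concentrates spectral mass at low frequencies and produces a large gap.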

Note that an increase in the exchange rate gap denotes a depreciating movement.
