Overfitting in Judgment-based Economic Forecasts: The Case of IMF Growth Projections
International Monetary Fund

Contributor Notes

Author’s E-Mail Address: khellwig@imf.org

Abstract

I regress real GDP growth rates on the IMF’s growth forecasts and find that IMF forecasts behave similarly to those generated by overfitted models, placing too much weight on observable predictors and underestimating the forces of mean reversion. I identify several such variables that explain forecasts well but are not predictors of actual growth. I show that, at long horizons, IMF forecasts are little better than a forecasting rule that uses no information other than the historical global sample average growth rate (i.e., a constant). Given the large noise component in forecasts, particularly at longer horizons, the paper calls into question the usefulness of judgment-based medium and long-run forecasts for policy analysis, including for debt sustainability assessments, and points to statistical methods to improve forecast accuracy by taking into account the risk of overfitting.

“In any case, that is what economists do. We are storytellers, operating much of the time in worlds of make believe.”

Robert E. Lucas, Jr.1

1 Introduction

Forecasting economic growth is as difficult as it is important to policy makers and investors around the world. Modern budget frameworks are anchored on medium term planning and therefore lean heavily on growth assumptions over a four to five year horizon. For debt sustainability assessments in low-income countries, the IMF and World Bank make forecasts as far as 20 years out. In many cases, those forecasts are largely based on human judgment. That is, rather than writing down and estimating a formal statistical model to generate their projection, forecasters make an educated guess based on their subjective understanding of how current developments affect an economy’s trajectory. The track record of these forecasts in terms of accuracy, however, has been mixed.2

The often stated rationale for relying on judgment is that it allows the forecaster to take into account many aspects of an economy that are difficult to capture in a formal model. The world is complex, and informal models of the human mind are better equipped to aggregate complex information than an abstract empirical model. For example, in a survey by Genberg and Martinez (2014), country authorities value the ability of IMF forecasts to take into account country specific circumstances. The preoccupation reflected in this view is, in the language of predictive modeling, one about underfitting: A model that ignores country specific features and other complexities cannot really describe the real world, and its usefulness will therefore be limited. By contrast, this paper emphasizes the risk of overfitting, a competing and equally important concern in the literature on learning, both in statistics and in cognitive science.3 Overfitting occurs when forecasters attempt to construct stories about the future based on past experience, not taking into account that this experience is limited. Whenever this occurs, forecasters behave like statistical models that are estimated on small non-representative samples. They pay too much attention to details that explain the data well in a limited sample but turn out to be less informative in a larger sample. As a result, overfitted models (and human forecasters) respond to noise rather than relevant information.

Overfitting is detrimental to policymaking in several ways. The most immediate consequence is poor predictive performance. In addition, if forecasters are overly confident in their ability to predict future developments, they underestimate the uncertainty around their own forecasts. Hence, the forecasting process itself is a source of risk that would imply a need for larger buffers.

For the IMF’s growth forecasts, published semi-annually in the World Economic Outlook (WEO), I find little evidence of overfitting for short-term forecast horizons. However, when looking at projections over longer horizons, I find strong symptoms of overfitting. Following Copas (1983), I regress growth outcomes on forecasts and obtain a coefficient significantly smaller than unity, meaning that, on average, the accuracy of forecasts can be enhanced by shrinking them towards the sample mean.4 In other words, forecasts are informative about future growth, but forecasters place too much weight on the information they deem relevant and thereby overreact. Therefore, the accuracy of forecasts would have been higher if forecasters had put more weight on mean reversion and placed less confidence in their own judgment.5 Indeed, I find that at longer forecast horizons IMF forecasts are barely more informative about future growth than the historical global average growth rate – particularly for low-income countries. Hence, overfitting could explain why, despite frequent forecast revisions, IMF forecasts improve only marginally as the projection horizon shortens (see Figure 1), a stylized fact that is consistent with findings by Tetlock (2017) on the performance of subject experts in predicting political events.

Figure 1:

Average information content of IMF real GDP growth forecast revisions

Citation: IMF Working Papers 2018, 260; 10.5089/9781484386187.001.A001

An alternative way to detect overfitting is to directly identify variables that can explain variation in forecasts without being good predictors of actual growth outcomes. This exercise also highlights that evidence of overfitting complements, rather than contradicts, evidence of underfitting found in earlier work on growth forecasts. While papers like Ball et al. (2014), Blanchard and Leigh (2013), Chaterjee and Nowak (2016), Frankel and Schreger (2016), Jalles et al. (2015), Lahiri et al. (2006), and Loungani and Rodriguez (2008) all ask “are there variables that forecasters should pay more attention to?”, this paper departs from the literature by asking “are there variables that forecasters should pay less attention to?”.6 Of course, the answer to both questions can be yes, as highlighted in Bordalo et al. (2018). Using the LASSO estimator (Tibshirani, 1996), I find that forecasters overreact to investment and past growth outcomes while not responding enough to the real exchange rate. I also find considerable heterogeneity across income groups in the type of information to which forecasters over- or underreact.

The finding that judgment-based forecasts suffer from overfitting is perhaps not surprising, since the risk of overfitting is particularly large when the estimation sample is small or the prediction model is complex. And the informal models underlying our judgment are highly complex – too complex to be formalized – whereas the data that inform these informal models are relatively limited, like any macroeconomic data set. For example, if the forecaster was familiar with 40 years of economic history of 190 countries, that would get us 7600 country-year observations. Not a big number, especially after taking into account that these observations are not independent across countries (for global variables there would be only 40 observations). For five-year growth rates, the sample would only have 1520 non-overlapping observations. For comparison, de Miguel et al. (2007) find that optimal portfolio diversification models over 25 assets require at least 3000 months of data in order to beat a naive (i.e., equal weighting) allocation rule. The finding that forecasts overreact to some information is also consistent with recent evidence by Bordalo et al (2018) on macroeconomic forecasts made by private sector experts.

Of course, the risk of overfitting applies to formal models as well. However, a big advantage of formal models is that they allow us to use statistical learning methods to quantify and reduce the risk of overfitting. Accordingly, I find that there is less overfitting in WEO projections for advanced economies, which are informed by model-based forecasts. Moreover, I show that, for developing countries, a simple LASSO model with few variables can generate forecasts with smaller out-of-sample mean squared error than WEO forecasts.7 Formal models have several other well-known advantages: They are internally consistent and force the forecaster to be transparent about the assumptions made, which enhances transparency and accountability.8

Even so, the conclusion of the paper is not to completely ignore human judgement. Even after controlling for a large number of candidate predictors, WEO forecasts contain valuable information that would otherwise not be captured by a linear prediction model – though less so for developing countries. But the gains in accuracy from reducing the weight of judgement in the IMF’s forecasts appear to be significant.

The rest of the paper is organized as follows: First, I revisit the bias-variance trade-off. Then I apply Copas’ approach to detect overfitting in IMF forecasts. Section 4 benchmarks the accuracy of forecasts against a naive prediction rule and a LASSO estimator. Section 5 identifies dimensions in which IMF forecasts over- and underfit. Section 6 combines Copas’ approach with the LASSO estimator. Section 7 concludes with policy implications.

2 Overfitting and the bias-variance trade-off

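To fix ideas, this section revisits the bias-variance trade-off in predictive modeling. I assume that the data generating process for outcomes $y_{it}$ in country $i$ and year $t$ is

$$y_{it} = g(x_{it}) + \epsilon_{it},$$

where $x_{it}$ is a vector of predictors and $\epsilon_{it}$ is an unobservable independent random disturbance. Prediction model $f$ is estimated on a random sample and, for any $x_{it}$, yields the forecast $f(x_{it})$. The expected squared forecast error is

$$\mathrm{MSE} = E\left[\left(y_{it} - f(x_{it})\right)^2\right],$$

which can be expanded to

$$\mathrm{MSE} = E\left[\left(y_{it} - g(x_{it}) + g(x_{it}) - \bar{f}(x_{it}) + \bar{f}(x_{it}) - f(x_{it})\right)^2\right],$$

where $\bar{f}(x_{it}) \equiv E(f(x_{it}))$. By rearranging and taking into account that both $\epsilon_{it}$ and the sampling error are independent, we obtain

$$\mathrm{MSE} = \underbrace{\mathrm{Var}(\epsilon_{it})}_{\text{irreducible noise}} + \underbrace{E\left[\left(g(x_{it}) - \bar{f}(x_{it})\right)^2\right]}_{\text{bias term}} + \underbrace{E\left[\left(\bar{f}(x_{it}) - f(x_{it})\right)^2\right]}_{\text{variance term}}. \qquad (1)$$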
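
The step from the expanded expression to equation (1) can be spelled out as follows. This is only a sketch: it additionally assumes that the disturbance has mean zero (not stated explicitly above), and the shorthand $a$, $b$, $c$ is introduced here purely for illustration.

```latex
\begin{align*}
a &= y_{it} - g(x_{it}) = \epsilon_{it}, \qquad
b = g(x_{it}) - \bar{f}(x_{it}), \qquad
c = \bar{f}(x_{it}) - f(x_{it}), \\
E\left[(a+b+c)^2\right] &= E[a^2] + E[b^2] + E[c^2] + 2E[ab] + 2E[ac] + 2E[bc], \\
E[ab] &= E[\epsilon_{it}]\, E\left[g(x_{it}) - \bar{f}(x_{it})\right] = 0
  && \text{($\epsilon_{it}$ independent of $x_{it}$, mean zero)}, \\
E[ac] &= E[\epsilon_{it}]\, E\left[\bar{f}(x_{it}) - f(x_{it})\right] = 0
  && \text{($\epsilon_{it}$ independent of the estimation sample)}, \\
E[bc] &= E\left[\left(g(x_{it}) - \bar{f}(x_{it})\right)
  E\left(\bar{f}(x_{it}) - f(x_{it}) \mid x_{it}\right)\right] = 0
  && \text{(by definition of $\bar{f}$)}.
\end{align*}
```

With all cross terms equal to zero and $E[a^2] = \mathrm{Var}(\epsilon_{it})$, only the three squared terms of equation (1) remain.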

Equation (1) separates the three sources of forecast error. First, the exogenous noise term is independent of f and therefore puts an upper bound on the accuracy of forecasts. Second, if the model is misspecified, the parameter estimates of the model are likely to be biased. This bias, for example due to omitted variables or endogeneity in a linear model, is a key concern in most of the applied econometric literature. It should be noted that bias refers to bias in model parameter estimates, not bias in forecasts.9 In small samples, the accuracy of forecasts is affected by a third factor, the sampling error of the model fℐ. Note that the bias term becomes smaller as the set of predictors in f grows. For example, the more variables we add to a linear model, the less severe is the problem of omitted variable bias. On the other hand, as the set of predictors in f grows, the potential error due to sampling variance grows. Hence, there is a trade-off between the bias term and the variance term: simple models have parameter bias and don’t provide enough explanatory power to fit the data well. But complex models fit the data too well.

Due to this trade-off, the prediction error of many forecasting models can be improved by inducing some bias, for example by dropping predictors from the model. Hence, prediction is fundamentally different from the typical econometric problem of inference, where identifying the model parameters means that bias has to be avoided.10

A linear example

Before analyzing the behavior of real-world forecasts it is worth studying overfitting in an illustrative example using simulated data. The data generating process is a linear model with 49 variables and a random normal error term with variance $\sigma_e^2$.11 Using this artificial data set, I estimate 50 separate models by OLS: The first model uses no variables, the second model uses one variable, the third model uses two variables, and so forth. The fiftieth model uses all 49 variables. Each model is then used to make out-of-sample predictions on a separate test sample. Figure 2 illustrates the average performance of each model in large and small estimation samples.

Figure 2:

Overfitting: out-of-sample performance in an artificial linear example

Citation: IMF Working Papers 2018, 260; 10.5089/9781484386187.001.A001

In the large sample, the best model is the one that corresponds to the data generating process. This model’s forecast errors are limited to the irreducible noise term. Dropping predictors leads to omitted variable bias that is the most severe when no variables are used. In a small sample, however, parameter estimates are sensitive to noise, so that out-of-sample forecast errors have an additional component, the variance term. Figure 2b shows how this variance term grows with the number of predictors, whereas the bias term declines as the number of predictors grows. As a result, even if we know that the true model has 49 predictors, the optimal OLS model is one with fewer regressors. Note that the performance of the true model is worse even than the Null model – the model that uses none of the predictors.
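
The experiment described above can be reproduced in miniature. The following sketch is not the code behind Figure 2: it assumes 8 predictors instead of 49, a common coefficient of 0.5, a noise standard deviation of 2, and a single random draw rather than an average over many draws. All function names are hypothetical.

```python
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    XtX = [[sum(r[a] * r[c] for r in X) for c in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    return solve(XtX, Xty)

def simulate(n, beta, sigma, rng):
    """Draw n observations from y = x'beta + e; column 0 of X is the intercept."""
    X = [[1.0] + [rng.gauss(0, 1) for _ in beta] for _ in range(n)]
    y = [sum(b * x for b, x in zip(beta, r[1:])) + rng.gauss(0, sigma) for r in X]
    return X, y

def oos_mse_by_model_size(n_train, K=8, sigma=2.0, n_test=2000, seed=0):
    """Out-of-sample MSE of nested OLS models using 0, 1, ..., K predictors."""
    rng = random.Random(seed)
    beta = [0.5] * K                      # assumed true coefficients
    X_tr, y_tr = simulate(n_train, beta, sigma, rng)
    X_te, y_te = simulate(n_test, beta, sigma, rng)
    results = []
    for k in range(K + 1):                # model k: intercept + first k predictors
        b = ols([r[:k + 1] for r in X_tr], y_tr)
        errs = [yi - sum(bj * xj for bj, xj in zip(b, r)) for r, yi in zip(X_te, y_te)]
        results.append(sum(e * e for e in errs) / n_test)
    return results

if __name__ == "__main__":
    for n in (2000, 12):
        print(n, [round(m, 2) for m in oos_mse_by_model_size(n_train=n)])
```

Comparing the printed rows for a large and a very small training sample gives a rough impression of the pattern in Figure 2, although any single draw is noisy and the small-sample ranking of models can differ from run to run.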

It is also worth pointing out that omitting some variables will lead to forecast errors that are correlated with observables. Hence, a correlation between growth forecast errors and fiscal consolidation like the one documented in Blanchard and Leigh (2013) is not necessarily at odds with good forecasting practice in small samples. Blanchard and Leigh state that “[u]nder rational expectations, fiscal consolidation forecasts should be unrelated to subsequent growth forecast errors.” If rational expectations is understood as the long-run limit in which forecasters’ beliefs have converged (e.g., as in Blume and Easley, 1982), then Blanchard and Leigh’s statement on forecasts is in line with the behavior of forecasts based on large samples in Figure 2a. The world in which IMF forecasters operate, however, is closer to the small sample world of Figure 2b, in which it is optimal to suppress the information content of some predictors because forecasters don’t have sufficient information about the data generating process.

Overfitting and shrinkage

Overfitting in the above models is present whenever a reduction in model fit leads to improved forecasts. A simple way to reduce the in-sample fit of a model $f_{\mathcal{I}}(x_{it})$ is to take a linear combination between that model and the least informative model, i.e., a constant $\kappa$:

$$\tilde{f}(x_{it}, \beta) = \beta f(x_{it}) + (1 - \beta)\kappa.$$

The weight $\beta$ indicates the degree to which the influence of $x_{it}$ is shrunk relative to the original model (e.g., the OLS benchmark). For observed outcomes $y_{it}$, the out-of-sample forecast error then becomes

$$\eta_{it} = y_{it} - \tilde{f}(x_{it}, \beta) = y_{it} - \left(\beta f(x_{it}) + (1 - \beta)\kappa\right).$$

If the weight $\beta$ that minimizes the MSE of model $\tilde{f}(x_{it}, \beta)$ is smaller than unity, then shrinkage improves the quality of forecasts, suggesting that the forecaster’s initial model was overfitted.

To detect overfitting in predictions, Copas (1983) therefore proposes to regress actual values on forecasts and a constant:

$$y_{it} = \alpha + \beta \hat{y}_{it} + \eta_{it}, \qquad (2)$$

where $\hat{y}_{it} \equiv f(x_{it})$ is the forecast for $y_{it}$ and $\eta_{it}$ is an orthogonal error term.12 For the case where $f$ is a linear least-squares model, Copas (1983) shows that $E(\hat{\beta}) < 1$. $E(\hat{\beta})$ is increasing in the size of the estimation sample and decreasing in the number of predictors in $x$, and it approaches 1 as the sample size becomes large. The estimated slope coefficient $\hat{\beta}$ indicates by how much each forecast $\hat{y}_{it}$ needs to be shrunk towards the sample mean to maximize the forecast accuracy (after correcting for $\hat{\alpha}$). Note that $\hat{\beta} < 1$ implies that forecasts and forecast errors are negatively correlated:

$$y_{it} - \hat{y}_{it} = \alpha + (\beta - 1)\hat{y}_{it} + \eta_{it}.$$

Whenever forecasts and forecast errors are correlated, forecast errors are predictable. A positive correlation arises if forecasters underreact to relevant information. A negative correlation arises if forecasters overreact to information or react to irrelevant information. Irrespective of the source of negative correlation, its presence means that forecasters are overly confident in the information content of their forecasts.
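
A minimal sketch of the regression in equation (2), written in plain Python, may help make the mechanics concrete. The simulated forecaster below is an assumption made for illustration only: the forecast responds one-for-one to a genuine signal but also to an equally large piece of irrelevant noise, so the population slope is 0.5. Function names are hypothetical.

```python
import random

def copas_regression(outcomes, forecasts):
    """Regress outcomes on forecasts and a constant; return (alpha, beta)."""
    n = len(outcomes)
    mean_f = sum(forecasts) / n
    mean_y = sum(outcomes) / n
    s_fy = sum((f - mean_f) * (y - mean_y) for f, y in zip(forecasts, outcomes))
    s_ff = sum((f - mean_f) ** 2 for f in forecasts)
    beta = s_fy / s_ff
    alpha = mean_y - beta * mean_f
    return alpha, beta

def shrink(forecast, alpha, beta):
    """Apply the estimated linear transformation to a raw forecast."""
    return alpha + beta * forecast

def simulate_overreaction(n=20000, seed=1):
    """Toy data: forecasts react to a true signal and to irrelevant noise."""
    rng = random.Random(seed)
    outcomes, forecasts = [], []
    for _ in range(n):
        signal = rng.gauss(0, 1)                          # relevant information
        forecasts.append(2.0 + signal + rng.gauss(0, 1))  # plus reaction to noise
        outcomes.append(2.0 + signal + rng.gauss(0, 1))   # plus unforecastable shock
    return outcomes, forecasts

if __name__ == "__main__":
    y, y_hat = simulate_overreaction()
    alpha, beta = copas_regression(y, y_hat)
    print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")
    print("raw forecast 5.0 becomes", round(shrink(5.0, alpha, beta), 2))
```

In this toy setting the estimated slope should come out close to one half, which means that an unusually high raw forecast is pulled roughly halfway back towards the sample mean.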

Figure 3 plots the shrinkage coefficients $\hat{\beta}$ for the linear example introduced in Figure 2 above. If the estimation sample is large, no shrinkage is required. However, for small samples, forecasts need to be shrunk by half to reduce the influence of noise in the estimation sample. The amount of shrinkage required is increasing in the number of predictors.

Figure 3:

Shrinkage coefficients in an artificial linear example

Citation: IMF Working Papers 2018, 260; 10.5089/9781484386187.001.A001

3 Optimal shrinkage for WEO forecasts

To detect overfitting in WEO forecasts, I estimate equation (2) for each forecast horizon of IMF real GDP growth projections. Data for actual growth is based on the April 2018 vintage of the WEO.13 The sample includes WEO forecasts made since 1990.

Figure 4 depicts the estimated shrinkage coefficients $\hat{\beta}$. The estimates in panel (a) indicate that forecasts are highly vulnerable to overfitting and that the optimal degree of shrinkage increases with the forecast horizon. This result is largely driven by emerging and developing economies. For low income countries (panel (d)), five-year ahead forecasts need to be shrunk towards the sample mean by almost 50 percent. For advanced economies (panel (b)), by contrast, there is no evidence of systematic overfitting.

Figure 4:

Optimal shrinkage coefficients for real GDP growth projections, by forecast horizon.

Citation: IMF Working Papers 2018, 260; 10.5089/9781484386187.001.A001

Note: Dashed lines indicate 95 percent confidence intervals based on robust standard errors. Observations with forecast errors greater than 10 percentage points are excluded.

Several factors could explain the differences in results between advanced economies and emerging and developing economies. First, the economic experience of advanced economies is well documented and less volatile, so that forecasters have a larger amount of training data, with fewer outliers, at hand. Second, forecasts for advanced economies are informed by a large set of competing forecasts produced by other organizations, both public and private, so that they are likely to benefit from crowd wisdom.14 And third, for advanced economies, the IMF’s Research Department produces model-based forecasts that can inform desk economists’ judgment.

The increase in shrinkage required at longer horizons is also in line with Copas (1983). For long forecast horizons, the sample of non-overlapping observations is smaller than for one-year ahead forecasts, so that models of the same degree of complexity are more likely to overfit.

When repeating the exercise for cumulative growth rates rather than annual growth rates (Figure 5), a similar picture emerges: The required shrinkage increases with the forecast horizon and decreases with the level of development. The confidence intervals around the coefficients are smaller. Moreover, for advanced economies we find that at a 5-year horizon, cumulative forecasts benefit from shrinkage, suggesting that some overfitting occurs.

Figure 5:

Optimal shrinkage coefficients for cumulative real GDP growth projections, by forecast horizon.

Citation: IMF Working Papers 2018, 260; 10.5089/9781484386187.001.A001

Note: Dashed lines indicate 95 percent confidence intervals based on robust standard errors. Observations with absolute forecast errors in the top percentile are excluded.

4 Benchmarking forecast accuracy

This section compares the accuracy of WEO growth forecasts against several alternative forecasts in terms of mean squared error (MSE). It is worth pointing out that IMF forecasts are not made to minimize the MSE. Economists are in fact instructed to predict the most likely outcome under a certain set of policies – the mode rather than the mean of a conditional distribution of outcomes.15 Evaluating forecasts along this criterion, however, requires strong distributional assumptions. Moreover, since forecasts are contingent on a set of policies, it is difficult to evaluate accuracy if these policies are not implemented.16

Like every choice of loss function, the choice of MSE as loss function for this paper is a subjective one. Loss functions should reflect the user’s preferences, and preferences vary with the policy application and context. Hence, while MSE is a popular measure of accuracy in the literature, it may not always be the most relevant one, depending on the application. For short-term budget forecasts, it is often prudent to err on the side of caution rather than to target the most likely outcome, in order to avoid costly adjustments ex post, particularly when debt is high.17

In the following, I first analyze the gains in accuracy from transforming WEO forecasts using Copas’ shrinkage model from the previous section. Since, in practice, “shrinking” judgment-based forecasts is difficult to operationalize, I then compare the WEO forecasts with an extreme case of shrinkage, the simple cross-country historical average. Finally, I compare the WEO forecasts with those obtained from a linear modeling technique that embraces the principle of shrinkage. All predictions are made out of sample. That is, to make predictions for the growth rate in year t + j, we rely on a model that is estimated using only data up to year t – 1.

4.1 WEO vs. optimal shrinkage

In Figure 6, I ask whether the accuracy of forecasts made in April of year t could have been improved by applying the linear transformation from equation (2), estimated using data up to year t – 1. The transformation would have led to a significant reduction in forecast dispersion and, on average, to an improvement in accuracy. The improvement in accuracy is mainly driven by low income countries. For advanced economies and emerging market economies, the accuracy would have suffered marginally, suggesting that the optimal shrinkage coefficient has changed over time. Even so, the charts illustrate that even for advanced and emerging market economies the forecast dispersion can be reduced without substantially affecting forecast accuracy. This creates a trade-off: by reducing the dispersion of forecasts, we correct for overfitting to noise, but we also reduce the contribution of relevant information and hence introduce underfitting along some dimensions.

Figure 6:

Comparison of WEO forecast performance with and without Copas (1983) transformation (out of sample)

Citation: IMF Working Papers 2018, 260; 10.5089/9781484386187.001.A001

4.2 WEO vs. extreme shrinkage (the Null model)

In Section 2, an extreme benchmark for measuring forecast accuracy was to look at forecasts that rely entirely on the sample mean and ignore any information gained from potential predictors. I call this method of generating forecasts purely based on historical averages the Null model. At each point in time t, the Null model’s forecast for any horizon and for any country corresponds to the unweighted average growth rate across all countries between 1970 and year t-1. It is updated every year to reflect new information about average growth in the historical sample. Note that, at any point in time, there are no cross-country differences in Null model forecasts (i.e., Luxembourg, Liberia, and Laos are all forecast to grow at the same rate). The Null model is situated at the extreme end of the bias-variance trade-off in that it exhibits maximum parameter bias but has the lowest risk of responding to noise. It is more naive than some of the “naive” models that have been used to benchmark forecast accuracy in the literature, such as the auto-regressive models.
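
The Null model rule is simple enough to state in a few lines of code. The sketch below assumes a hypothetical data layout (a dictionary keyed by country and year) and uses made-up growth rates; it is meant only to illustrate the expanding-window average, not to reproduce the paper’s data handling.

```python
def null_forecast(growth, t, start=1970):
    """Unweighted mean of all country-year growth rates from `start` to t-1.

    `growth` maps (country, year) -> growth rate in percent (hypothetical layout).
    The same number is the forecast for every country and every horizon.
    """
    sample = [g for (_, year), g in growth.items() if start <= year <= t - 1]
    if not sample:
        raise ValueError("no observations before year t")
    return sum(sample) / len(sample)

def mse(forecasts, outcomes):
    """Mean squared forecast error."""
    return sum((f - y) ** 2 for f, y in zip(forecasts, outcomes)) / len(outcomes)

if __name__ == "__main__":
    # Made-up growth rates for two fictional countries.
    growth = {
        ("A", 2000): 2.0, ("B", 2000): 4.0,
        ("A", 2001): 6.0, ("B", 2001): 4.0,
        ("A", 2002): 3.0, ("B", 2002): 5.0,
    }
    f_2002 = null_forecast(growth, 2002)   # uses 2000-2001 only
    outcomes = [growth[("A", 2002)], growth[("B", 2002)]]
    print(f_2002, mse([f_2002, f_2002], outcomes))
```

Because the forecast made in year t only uses data up to year t-1, re-running `null_forecast` for each successive t gives the yearly update described above.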

Figure 8a shows the MSE for WEO forecasts and forecasts generated by the Null model. While WEO forecasts are substantially better in the short run, these differences in performance become insignificant as the horizon widens. And for five-year ahead forecasts of growth rates, the Null model is more accurate than the WEO forecast, though not significantly. This is remarkable: Null model forecasts have no cross-country dispersion, whereas the dispersion in WEO forecasts is considerable, as seen in Figure 7. And yet, the Null model is not significantly worse at fitting the data than the WEO forecasts. Figures 8b-c repeat this exercise separately for different income categories. Even though the Null model is still based on the global sample, the pattern is fairly similar across groups. The difference in performance between Null model and WEO is more pronounced for advanced economies, consistent with the finding above that forecasts for these countries suffer less from overfitting. However, at the 5-year horizon, the difference in performance is no longer significant, as can be seen from the more formal Diebold-Mariano (2002) test shown in Table 1. For emerging market and developing economies, we cannot reject the hypothesis that the performance of WEO forecasts and Null model forecasts is similar.

Figure 7:

Standard deviation of WEO annual real GDP growth forecasts, by income group

Citation: IMF Working Papers 2018, 260; 10.5089/9781484386187.001.A001

Figure 8:

Mean squared error for Spring WEO projections and Null model forecasts.

Citation: IMF Working Papers 2018, 260; 10.5089/9781484386187.001.A001

Note: light dashed lines indicate 95 percent confidence intervals. WEO forecasts and actual growth outcomes are trimmed at +/- 20 percent. Null model forecasts made in year t are the global average growth rate from 1970 to year t-1, as reported in the Spring WEO of year t.
Table 1:

Diebold-Mariano tests for squared errors: WEO vs. Null model, by forecast horizon and income group

Note: diff(MSE) reports the coefficient from regressing the difference in squared forecast error between WEO forecasts and Null model forecasts on a constant. Standard errors are clustered by year. Negative coefficients indicate that WEO is more accurate than the Null model. WEO forecasts and actual growth outcomes are trimmed at +/- 20 percent. Null model forecasts made in year t are the global average growth rate from 1970 to year t-1, as reported in the Spring WEO of year t.
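
The test described in the note to Table 1 amounts to estimating the mean difference in squared errors and its standard error with clustering by year. The sketch below shows one way to compute this in plain Python. It assumes the basic cluster-robust variance formula without any finite-sample correction, which may differ from the exact variant used for Table 1; the function name is hypothetical.

```python
import math
from collections import defaultdict

def dm_clustered(sq_err_a, sq_err_b, years):
    """Mean loss differential (a minus b) with a year-clustered standard error.

    Equivalent to regressing d = sq_err_a - sq_err_b on a constant.
    Assumption: no small-sample adjustment to the cluster-robust variance.
    Returns (mean difference, standard error, t-statistic).
    """
    d = [a - b for a, b in zip(sq_err_a, sq_err_b)]
    n = len(d)
    mean_d = sum(d) / n
    # Sum the residuals within each year, then square the cluster sums.
    cluster_sum = defaultdict(float)
    for di, year in zip(d, years):
        cluster_sum[year] += di - mean_d
    var = sum(s * s for s in cluster_sum.values()) / n ** 2
    se = math.sqrt(var)
    t_stat = mean_d / se if se > 0 else float("nan")
    return mean_d, se, t_stat

if __name__ == "__main__":
    # Made-up squared errors for two forecasts over two years.
    weo = [1.0, 1.0, 3.0, 1.0]
    null = [2.0, 4.0, 2.0, 2.0]
    print(dm_clustered(weo, null, [2000, 2000, 2001, 2001]))
```

With forecast `a` taken to be the WEO and `b` the Null model, a negative mean difference indicates that the WEO is more accurate, matching the sign convention in the table note.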

In Appendix A I show how the MSE has evolved over time for each forecast horizon, which illustrates how WEO forecasters do better in same-year forecasts but not at longer horizons. For same-year forecasts, the WEO forecasters are able to respond to significant events such as the financial crisis that was incorporated in the same-year forecast of the April 2009 WEO. The backward-looking Null model, by contrast, missed this event. The one-year ahead forecasts in 2008 were not able to anticipate the crisis, leading to large errors for the WEO and the Null model.

4.3 WEO vs. model forecasts

Since judgment-based forecasts outperform the Null model, the question arises whether, as in Figure 2, there is a model in between – one that incorporates some information, though less than the IMF forecasts – that performs better than both the Null model and the WEO. I therefore let the WEO’s cumulative six-year forecasts compete with those generated using a Least Absolute Shrinkage and Selection Operator (LASSO), a popular technique in predictive modeling (see Appendix B for details and Jung et al. (2018) for other recent work on machine learning methods and WEO forecasts). The LASSO estimator, developed by Tibshirani (1996), builds on the intuition highlighted above that inducing parameter bias by shrinking coefficients towards zero can lead to a better out-of-sample fit. Since the LASSO typically shrinks some coefficients to exactly zero, it provides a data driven variable selection method, which I use in Section 6 below.
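
To illustrate how the LASSO sets some coefficients to exactly zero, the following sketch implements the estimator by cyclic coordinate descent with soft-thresholding. It is a generic textbook-style implementation, not the procedure documented in Appendix B. It assumes the objective (1/2n)·||y − Xb||² + λ·||b||₁, no intercept (the data are taken to be centered), and a fixed penalty λ rather than one chosen by cross-validation. Function names are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam; return exactly 0.0 if |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """LASSO by cyclic coordinate descent (no intercept; center data first).

    Minimizes (1/(2n)) * sum((y - Xb)^2) + lam * sum(|b_j|).
    """
    n, k = len(X), len(X[0])
    b = [0.0] * k
    resid = list(y)                       # residuals y - Xb for the current b
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(k)]
    for _ in range(n_iter):
        for j in range(k):
            if col_ss[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes j's own fit.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * b[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_ss[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new_bj
    return b

if __name__ == "__main__":
    # Two orthogonal predictors: a strong one (coefficient 2) and a weak one (0.1).
    X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
    y = [2.1, 1.9, -1.9, -2.1]
    print(lasso_cd(X, y, lam=0.5))
```

In the toy example the strong predictor keeps a coefficient that is shrunk by the penalty, while the weak predictor is dropped entirely, which is the variable-selection property referred to above.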

Our set of candidate predictors for growth in year t + j consists of the lagged (i.e., t – 1) values of the following macroeconomic variables from the World Economic Outlook: GDP per capita in constant PPP-adjusted US dollars (in logs), annual real GDP growth, the backward-looking 5-year average real GDP growth rate, real GDP growth in the U.S., the current account balance in percent of GDP, the PPP real exchange rate (in logs), and population growth. I also include the current (i.e., time t) value of the change in the overall fiscal balance. To the extent possible, we use the April WEO vintage of year t for all predictors, to align the model’s information set with that of the WEO forecaster. Again, all model predictions are made out of sample.

Figure 9a compares the dispersion of forecasts across countries and years. The unconditional standard deviation of annual GDP growth rates is 4.3 percentage points, and WEO forecasts attempt to predict more than three quarters of this variation at short horizons and almost half of the variation for five-year horizons. By contrast, the LASSO model attempts to predict only a bit more than a quarter of the variation in growth at long horizons. Figure 9b shows the MSE for each forecast horizon, comparing the WEO predictions with out-of-sample LASSO model predictions. Despite the considerably lower forecast dispersion of model-based forecasts, models outperform the WEO at all but the very short horizon. As shown in Table 2, these differences become significant at the five-year horizon.

Figure 9:

Comparison of WEO and model-based forecast performance

Citation: IMF Working Papers 2018, 260; 10.5089/9781484386187.001.A001

Note: all forecasts and actual growth outcomes are trimmed at +/- 20 percent. Results by income group are based on models pooling over all income groups. LASSO model forecasts of growth in year t + h made at time t are based on data up to year t – h – 1.
Table 2:

Diebold-Mariano tests for squared errors: WEO vs. LASSO model, by forecast horizon and income group

Note: diff(MSE) reports the coefficient from regressing the difference in squared out of sample forecast error between WEO forecasts and LASSO model forecasts on a constant. Standard errors are clustered by year. Negative coefficients indicate that WEO is more accurate than the LASSO model. WEO forecasts and actual growth outcomes are trimmed at +/- 20 percent. Results by income group are based on models pooling over all income groups.

There are again considerable differences across income groups. The improvements in accuracy are driven by low income countries, where at horizons longer than two years the model outperforms the WEO.

5 Predictors and noise

In the previous section we looked at forecasts and outcomes to detect whether forecasters respond to noise. In this section, I take a closer look at the nature of the information that is incorporated by IMF economists and whether this information is relevant for predicting real GDP growth. To do so, I estimate a series of LASSO models, first with forecasts and then with growth outcomes on the left-hand side.18 The primary question is which variables are selected as predictors by the LASSO algorithm. Selection probabilities are estimated from selection frequencies in 10,000 regressions using bootstrapped samples.

Selection frequencies for same-year forecasts and five-year ahead forecasts are plotted in Figure 10. Points close to a 45-degree line would suggest that WEO forecasters’ choice of predictors is consistent with their actual relevance. The point in the top right corner indicates that, regarding the most robust predictors, the behavior of WEO forecasts is consistent with that of actual growth outcomes. However, especially at the longer horizon, several variables are above a 45-degree line, indicating that they receive too much attention. In the following discussion, however, I focus only on those variables for which either the selection frequency is high in some regression or where there is a statistically significant inconsistency between the regressions of WEO forecasts and those of actual outcomes. The selection frequencies for all horizons are reported in Appendix C.

Figure 10
Notes: Each point represents a predictor. The axes measure selection frequencies for each predictor in prediction models for GDP growth (horizontal axis) and WEO growth forecasts (vertical axis). Selection frequencies are based on 10,000 LASSO regressions based on bootstrapped samples.

For different forecast horizons, Table 3 reports estimated regression coefficients. The table indicates the frequency at which each variable is selected as well as the frequency at which a variable’s weight in predicting actual growth is larger or smaller than its weight in predicting the WEO forecast. Three variables are robust predictors of growth across all forecast horizons: lagged one-year GDP growth, the country’s historical five-year average growth, and the real exchange rate. According to the LASSO model, WEO forecasts respond to all three of these predictors. However, the coefficients suggest that lagged one-year growth matters much less for actual growth than WEO forecasters think, whereas the real exchange rate matters more than forecasters think, especially at long horizons.19 The overreaction to recent growth momentum, together with underreaction to longer-term trends, again provides evidence that forces of mean reversion are systematically underestimated relative to those of growth momentum. One predictor to which forecasts are too sensitive is population growth. While this variable is a key determinant of growth in the neoclassical growth model, it does not appear to be a robust predictor of growth in a linear model. WEO forecasts also place more weight on foreign direct investment, but the differences are not significant.

Table 3:

LASSO coefficients (selected variables) for predicting cumulative real GDP growth rates and WEO growth forecasts: all countries

Notes: results are average LASSO regression coefficients from 10,000 bootstrapped samples. Shaded rows show results where the dependent variable is the actual growth outturn. White rows show results where the dependent variable is the WEO forecast; bold print indicates that the selection probability ≥90 percent; stars in the shaded rows indicate that the value of the estimated coefficient for predicting actual growth is significantly larger (more positive/ less negative) than the estimated coefficient for predicting the WEO forecast, stars in the white rows indicate that the value of the estimated coefficient for predicting actual growth is significantly smaller (less positive/ more negative) than the estimated coefficient for predicting the WEO forecast. See Table 7 for full results.

A comparison of results across income groups in Tables 4-6 highlights that a linear model does not capture the considerable heterogeneity across groups. For example, forecasts fail to take into account that the persistence in country-specific growth trends is much less pronounced in advanced economies than in emerging market economies, as documented by Aguiar and Gopinath (2007).

Table 4:

LASSO estimates for predicting cumulative real GDP growth rates and WEO growth forecasts: advanced economies

Notes: results are average LASSO regression coefficients from 10,000 bootstrapped samples. Shaded rows show results where the dependent variable is the actual growth outturn. White rows show results where the dependent variable is the WEO forecast; bold print indicates that the selection probability ≥90 percent; stars in the shaded rows indicate that the value of the estimated coefficient for predicting actual growth is significantly larger (more positive/ less negative) than the estimated coefficient for predicting the WEO forecast, stars in the white rows indicate that the value of the estimated coefficient for predicting actual growth is significantly smaller (less positive/ more negative) than the estimated coefficient for predicting the WEO forecast.
Table 5:

LASSO estimates for predicting cumulative real GDP growth rates and WEO growth forecasts: emerging market economies

Notes: results are average LASSO regression coefficients from 10,000 bootstrapped samples. Shaded rows show results where the dependent variable is the actual growth outturn. White rows show results where the dependent variable is the WEO forecast; bold print indicates that the selection probability ≥90 percent; stars in the shaded rows indicate that the value of the estimated coefficient for predicting actual growth is significantly larger (more positive/ less negative) than the estimated coefficient for predicting the WEO forecast, stars in the white rows indicate that the value of the estimated coefficient for predicting actual growth is significantly smaller (less positive/ more negative) than the estimated coefficient for predicting the WEO forecast.
Table 6:

LASSO estimates for predicting cumulative real GDP growth rates and WEO growth forecasts: low income countries

Notes: results are average LASSO regression coefficients from 10,000 bootstrapped samples. Shaded rows show results where the dependent variable is the actual growth outturn. White rows show results where the dependent variable is the WEO forecast; bold print indicates that the selection probability ≥90 percent; stars in the shaded rows indicate that the value of the estimated coefficient for predicting actual growth is significantly larger (more positive/ less negative) than the estimated coefficient for predicting the WEO forecast, stars in the white rows indicate that the value of the estimated coefficient for predicting actual growth is significantly smaller (less positive/ more negative) than the estimated coefficient for predicting the WEO forecast.

In advanced economies (Table 4), population growth, the current pace of fiscal consolidation, and the real exchange rate are robust predictors of growth that IMF forecasts should have paid more attention to. The results also suggest that, in the short run, government expenditure has a crowding out effect that is underestimated by forecasters. At longer horizons, the predictive power of the current account balance is also underestimated, though not by a significant margin. On the other hand, IMF forecasts have overreacted to past growth, investment levels and changes, and export growth. Government debt-to-GDP ratios are consistently chosen as predictors of IMF forecasts but less frequently as predictors of actual growth. However, we cannot say with confidence that IMF forecasters for advanced economies have overreacted to debt levels.

For emerging markets, the main inconsistency between predictors of growth and predictors of WEO forecasts is related to the real exchange rate, to which forecasters do not respond sufficiently. While the previous year’s GDP growth rate receives more weight in predicting WEO forecasts, the difference in weights is statistically significant only for same-year forecasts. Moreover, forecasters again do not take into account the crowding out effects of public spending.

In low-income countries, forecasters underestimate the importance of the real exchange rate and of country-specific growth trends while overfitting to population growth, short-term growth outcomes, FDI, and investment.

6 Are IMF forecasts robust predictors of growth?

I now add the WEO forecast to the set of predictors for real GDP growth used in the previous section and repeat the LASSO regressions to see whether the estimator selects the WEO forecast over alternative predictors. This exercise can be seen as a LASSO version of the Copas (1983) regression from Section 3.

Figure 11 reports the average regression coefficients and selection probabilities. Results for cumulative forecasts are reported in Table 15. The results confirm that, for advanced and emerging economies, IMF forecasts are informative and are, with very high probability, selected as predictors of growth, though with significant shrinkage. For low-income countries, however, the selection probability decreases considerably with the forecast horizon. Moreover, the average weight that the estimator for low-income countries puts on the forecast declines to almost zero for horizons beyond one year.
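To make the notion of shrinkage concrete, the sketch below shows the plain (non-LASSO) version of the idea: regress outcomes on forecasts by OLS and read off the slope. A slope below one means that forecasts vary more than is justified by outcomes and would be more accurate if shrunk toward the mean. The function name and the toy numbers are hypothetical, and the sketch omits the additional predictors and the LASSO penalty used in this section.

```python
def ols_slope_intercept(x, y):
    """OLS fit of y on a constant and x; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical forecasts and realised growth rates (percent).
forecasts = [1.0, 2.0, 3.0, 4.0, 5.0]
outcomes = [2.0, 2.5, 3.0, 3.5, 4.0]

slope, intercept = ols_slope_intercept(forecasts, outcomes)

# Shrunk forecasts: pull each forecast toward the sample mean of outcomes.
shrunk = [intercept + slope * f for f in forecasts]
print(slope, intercept, shrunk)
```

In this toy example the slope is 0.5, so each forecast's deviation from the mean would need to be halved.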

Figure 11:

Selection probabilities and regression coefficients for WEO forecasts as predictors of growth


Note: Results represent average LASSO regression coefficients and selection probabilities from regressions of real GDP growth on WEO forecasts and a set of predictors. Averages from 10,000 regressions using bootstrapped samples are reported.

7 Conclusion

In macroeconomics, small samples are a fact of life. Another fact of life is that the real world is complex. Together, these two facts create a trade-off between bias and variance that every macroeconomic forecaster needs to navigate. In large samples, the world’s complexity puts a premium on human expertise or on rich empirical models, so that forecasts can take into account country- and time-specific information, like the model with 49 predictors in the linear example of Section 2. By contrast, a small sample size limits the usefulness of human expertise or complex models because the stories that economists identify in historical data do not generalize well to the future. Ignoring this trade-off, and the tension between understanding economic outcomes in a limited set of circumstances and predicting future outcomes, will lead to overconfidence and, potentially, to poor policy design.

This paper has explored the trade-off for IMF growth projections, which are largely based on human judgment, and has found strong symptoms of overfitting, particularly at longer horizons. Forecasters incorporate too much information that is irrelevant at the margin. The fact that, particularly for low-income countries, judgment-based long-term forecasts are not significantly more accurate than forecasts that ignore any information beyond the historical average has important implications for policy analysis. The level of noise incorporated in forecasts under the current practice suggests that assessments of fiscal space and debt sustainability could be significantly improved. And while the current paper only assesses forecast horizons up to five years, extrapolating from Figures 5d, 8d, and 14d suggests that the 20-year forecasts underlying the debt sustainability analysis in low-income countries should not be based on judgment alone.

While the paper provides an explanation for the poor performance of the IMF’s medium-term forecasts and shows how the bias-variance trade-off needs to be taken into account, it leaves us with a dilemma. Statistical models typically deliver better forecast accuracy than economic models, but communicating forecasts to decision makers often requires a narrative. As pointed out by Pagan (2003), forecasts are more appealing to decision makers if they are underpinned by an economic rationale, which creates a tension if there is a trade-off between the theoretical and empirical coherence of forecasting models. This tension is magnified in small samples, as the potential gains in forecast accuracy through parameter bias come at the cost of reduced interpretability. Moreover, forecasts are an ingredient in counterfactual policy scenarios for which unbiased parameters are essential, for example when determining the growth payoffs from public investment. More research is needed on how to navigate the bias-variance trade-off in policy applications.

References

  • [1]

    Mark Aguiar and Gita Gopinath. Emerging market business cycles: The cycle is the trend. Journal of Political Economy, 115(1):69-102, 2007.

  • [2]

    Laurence M. Ball, Joao Tovar Jalles, and Prakash Loungani. Do Forecasters Believe in Okun’s Law? An Assessment of Unemployment and Output Forecasts. IMF Working Papers 14/24, International Monetary Fund, February 2014.

  • [3]

    Paul Beaudry and Tim Willems. On the macroeconomic consequences of over-optimism. Technical report, National Bureau of Economic Research, 2018.

  • [4]

    Olivier J Blanchard and Daniel Leigh. Growth forecast errors and fiscal multipliers. American Economic Review, 103(3):117-20, 2013.

  • [5]

    Lawrence E Blume and David Easley. Learning to be rational. Journal of Economic Theory, 26(2):340-351, 1982.

  • [6]

    Pedro Bordalo, Nicola Gennaioli, Yueran Ma, and Andrei Shleifer. Overreaction in macroeconomic expectations. 2018.

  • [7]

    Valerie Cerra and Sweta Chaman Saxena. Growth dynamics: the myth of economic recovery. American Economic Review, 98(1):439-57, 2008.

  • [8]

    Pratiti Chatterjee and Sylwia Nowak. Forecast errors and uncertainty shocks. 2016.

  • [9]

    John B Copas. Regression, prediction and shrinkage. Journal of the Royal Statistical Society. Series B (Methodological), pages 311-354, 1983.

  • [10]

    Victor De Miguel, Lorenzo Garlappi, and Raman Uppal. Optimal versus naive diversification: How inefficient is the 1/n portfolio strategy? The Review of Financial Studies, 22(5):1915-1953, 2007.

  • [11]

    Francis X Diebold and Robert S Mariano. Comparing predictive accuracy. Journal of Business & Economic Statistics, 20(1):134-144, 2002.

  • [12]

    Theo S Eicher, David J Kuenzel, Chris Papageorgiou, and Charis Christofides. Forecasts in times of crises. IMF Working Papers 18/48, International Monetary Fund, 2018.

  • [13]

    Jeffrey A. Frankel and Jesse Schreger. Bias in Official Fiscal Forecasts: Can Private Forecasts Help? NBER Working Papers 22349, National Bureau of Economic Research, Inc, June 2016.

  • [14]

    Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer Series in Statistics, New York, 2009.

  • [15]

    Hans Genberg, Andrew Martinez, and Michael Salemi. The IMF/WEO forecast process. IEO Background Paper No. BP/14/03 (Washington: Independent Evaluation Office of the IMF), 2014.

  • [16]

    Gerd Gigerenzer and Henry Brighton. Homo heuristicus: Why biased minds make better inferences. Topics in Cognitive Science, 1(1):107-143, 2009.

  • [17]

    Giang Ho and Paolo Mauro. Growth—now and forever? IMF Economic Review, 64(3):526-547, Aug 2016.

  • [18]

    Joao Tovar Jalles, Iskander Karibzhanov, and Prakash Loungani. Cross-country evidence on the quality of private sector fiscal forecasts. Journal of Macroeconomics, 45(C):186-201, 2015.

  • [19]

    Jin-Kyu Jung, Manasa Patnam, and Anna Ter-Martirosyan. An algorithmic crystal ball: Forecasts based on machine learning. 2018.

  • [20]

    Kajal Lahiri, Gultekin Isiklar, and Prakash Loungani. How quickly do forecasters incorporate news? Evidence from cross-country surveys. Journal of Applied Econometrics, 21(6):703-725, 2006.

  • [21]

    Prakash Loungani and Jair Rodriguez. Economic Forecasts. World Economics, 9(2):1-12, April 2008.

  • [22]

    Robert E. Lucas. What economists do. Journal of Applied Economics, 14:1-4, May 2011.

  • [23]

    Jacob A Mincer and Victor Zarnowitz. The evaluation of economic forecasts. In Economic forecasts and expectations: Analysis of forecasting behavior and performance, pages 3-46. NBER, 1969.

  • [24]

    Sendhil Mullainathan and Jann Spiess. Machine learning: an applied econometric approach. Journal of Economic Perspectives, 31(2):87-106, 2017.

  • [25]

    Adrian Pagan. Report on modelling and forecasting at the Bank of England/Bank’s response to the Pagan report. Bank of England. Quarterly Bulletin, 43(1):60, 2003.

  • [26]

    Lant Pritchett and Lawrence H Summers. Asiaphoria meets regression to the mean. Technical report, National Bureau of Economic Research, 2014.

  • [27]

    Dani Rodrik. The real exchange rate and economic growth. Brookings Papers on Economic Activity, 2008(2):365-412, 2008.

  • [28]

    Philip E Tetlock. Expert political judgment: How good is it? How can we know? Princeton University Press, 2017.

  • [29]

    Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267-288, 1996.

  • [30]

    Andrew Tiffin. Seeing in the dark: A machine-learning approach to nowcasting in Lebanon. 2016.

  • [31]

    Allan Timmermann. An evaluation of the world economic outlook forecasts. IMF Staff Papers, 54(1):1-33, 2007.

Appendix

A Performance of Null model and WEO forecasts over time

Figure 12:

MSE for Spring WEO projections and Null model forecasts for annual real GDP growth


B The LASSO estimator

Formally, we solve

$$\hat{\beta} = \arg\min_{\beta}\; (y - X\beta)'(y - X\beta) \quad \text{s.t.} \quad \sum_{i=1}^{k} \|\beta_i\| < c,$$

where c is a positive constant parameter.20 If c is large, then $\hat{\beta}$ corresponds to the OLS estimator. For small values of c, the estimator has to manage a binding budget constraint, so that the in-sample model fit is worse than the OLS fit. To assign more explanatory power to some predictor, the estimator has to reduce the explanatory power of some other predictor. The optimal solution sets some predictors’ coefficients to zero, and for all non-zero coefficients the marginal improvement in fit is equal to the Lagrange multiplier associated with the budget constraint.
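For reference, the constrained problem is commonly written in an equivalent penalized (Lagrangian) form, which makes the statement about the marginal improvement in fit explicit. The notation below is a sketch under the assumption that the objective is the unscaled sum of squared residuals; here $x_i$ denotes the i-th column of X and λ ≥ 0 is the multiplier on the budget constraint.

```latex
\hat{\beta}(\lambda) = \arg\min_{\beta}\; (y - X\beta)'(y - X\beta) + \lambda \sum_{i=1}^{k} |\beta_i|

% First-order (KKT) conditions at the optimum:
2\, x_i'(y - X\hat{\beta}) = \lambda \,\operatorname{sign}(\hat{\beta}_i) \quad \text{if } \hat{\beta}_i \neq 0,
\qquad
\left| 2\, x_i'(y - X\hat{\beta}) \right| \le \lambda \quad \text{if } \hat{\beta}_i = 0.
```

A predictor therefore stays at zero whenever the marginal gain in fit from including it is smaller than λ, and every included predictor has a marginal gain exactly equal to λ in absolute value.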

I choose the parameter c (or, equivalently, the Lagrange multiplier λ) to find the optimal cross-validated fit. Figure 13 plots the cross-validated fit for different values of the Lagrange multiplier λ. The numbers above the chart indicate the number of variables with non-zero coefficients. For low values of λ, there is a marginal improvement in MSE for any increase in λ, as the model is overfitting the data. For high values of λ, the model is underfitting and a further tightening of the constraint leads to a worse out-of-sample fit. Hence, Figure 13 is another illustration of the bias-variance trade-off. The left vertical line indicates the parameter for which the average cross-validated MSE is the lowest. In practice, however, it is common to use a more conservative constraint such that the average cross-validated MSE is equal to the minimum average MSE plus one standard deviation (i.e., at the second vertical line). Given that the WEO forecast has an MSE of 330, it is easy to see that the LASSO is substantially superior in performance.
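The two vertical lines described above correspond to two selection rules that can be sketched in a few lines. The sketch takes a table of per-fold cross-validated MSEs as given and returns both the error-minimizing λ and the more conservative choice. It assumes that the "one standard deviation" band is measured as the standard error of the mean across folds, which is a common convention but is not confirmed by the text; the function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def choose_lambda(lambdas, fold_mse):
    """Return (lambda_min, lambda_conservative).

    lambdas  : candidate penalty values
    fold_mse : fold_mse[i] holds the per-fold MSEs obtained with lambdas[i]
    """
    stats = []
    for lam, errs in zip(lambdas, fold_mse):
        avg = mean(errs)
        se = stdev(errs) / sqrt(len(errs))  # assumed: standard error across folds
        stats.append((lam, avg, se))

    # Left vertical line: lowest average cross-validated MSE.
    lam_min, mse_min, se_min = min(stats, key=lambda s: s[1])

    # Second vertical line: largest penalty whose average MSE stays
    # within one standard error of the minimum.
    threshold = mse_min + se_min
    lam_conservative = max(lam for lam, avg, _ in stats if avg <= threshold)
    return lam_min, lam_conservative


# Hypothetical cross-validation results for five penalties and four folds.
lambdas = [0.01, 0.1, 0.5, 1.0, 2.0]
fold_mse = [
    [12.0, 14.0, 13.0, 15.0],  # overfitting region
    [10.0, 11.0, 9.0, 10.0],   # lowest average MSE
    [10.2, 10.8, 9.6, 10.6],   # slightly worse, but within the band
    [11.5, 12.5, 11.0, 13.0],
    [15.0, 16.0, 14.0, 17.0],  # underfitting region
]

lam_min, lam_conservative = choose_lambda(lambdas, fold_mse)
print(lam_min, lam_conservative)
```

The conservative rule accepts a slightly higher cross-validated MSE in exchange for a sparser model, which is the direction that guards against overfitting.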

C Detailed results for Section 5

C.1 LASSO Coefficients

Table 7:

LASSO estimates for predicting cumulative real GDP growth rates and WEO growth forecasts: all countries

Notes: results are average LASSO regression coefficients from 10,000 bootstrapped samples. Shaded rows show results where the dependent variable is the actual growth outturn. White rows show results where the dependent variable is the WEO forecast; bold print indicates that the selection probability ≥90 percent; stars in the shaded rows indicate that the value of the estimated coefficient for predicting actual growth is significantly larger (more positive/ less negative) than the estimated coefficient for predicting the WEO forecast, stars in the white rows indicate that the value of the estimated coefficient for predicting actual growth is significantly smaller (less positive/ more negative) than the estimated coefficient for predicting the WEO forecast.
Table 8:

LASSO estimates for predicting cumulative real GDP growth rates and WEO growth forecasts: advanced economies

Notes: results are average LASSO regression coefficients from 10,000 bootstrapped samples. Shaded rows show results where the dependent variable is the actual growth outturn. White rows show results where the dependent variable is the WEO forecast; bold print indicates that the selection probability ≥90 percent; stars in the shaded rows indicate that the value of the estimated coefficient for predicting actual growth is significantly larger (more positive/ less negative) than the estimated coefficient for predicting the WEO forecast, stars in the white rows indicate that the value of the estimated coefficient for predicting actual growth is significantly smaller (less positive/ more negative) than the estimated coefficient for predicting the WEO forecast.