Deus ex Machina? A Framework for Macro Forecasting with Machine Learning


Abstract

We develop a framework to nowcast (and forecast) economic variables with machine learning techniques. We explain how machine learning methods can address common shortcomings of traditional OLS-based models and use several machine learning models to predict real output growth with lower forecast errors than traditional models. By combining multiple machine learning models into ensembles, we lower forecast errors even further. We also identify measures of variable importance to help improve the transparency of machine learning-based forecasts. Applying the framework to Turkey reduces forecast errors by at least 30 percent relative to traditional models. The framework also better predicts economic volatility, suggesting that machine learning techniques could be an important part of the macro forecasting toolkit of many countries.

Deus ex machina (noun)

deus ex machina \ ˈdā-əs-ˌeks-ˈmä-ki-nə

“An unexpected power or event that saves a situation that seems without hope, especially in a play or novel.”

-Oxford Dictionary

I. Introducing Machine Learning

Traditional forecasting methods often provide poor macro forecasts. Techniques based on ordinary least squares (OLS) struggle to overcome several issues, including collinearity, dimensionality, predictor relevance, and nonlinearity. Some state-of-the-art forecasting models, including dynamic factor models, can help address collinearity and dimensionality problems, but do not address predictor relevance and nonlinearity problems. As a result, even state-of-the-art forecasting models often result in large forecast errors. Furthermore, dynamic factor models perform particularly poorly when the variable to be predicted is volatile, such as output growth in many emerging market and developing economies.

Machine learning (ML) methods present an alternative to traditional forecasting techniques. ML models can outperform traditional forecasting methods because they emphasize out-of-sample (rather than in-sample) performance and better handle nonlinear interactions among a large number of predictors. ML methods are specifically designed to learn complex relationships from past data while resisting the tendency of traditional methods to over-extrapolate historical relationships into the future. Indeed, a literature is beginning to emerge which suggests that ML methods often outperform traditional linear regression-based methods in terms of accuracy and robustness.2

We develop a framework to use ML methods for macro forecasting. We use the framework to nowcast (and forecast) economic growth in Turkey and are able to reduce forecast errors by at least 30 percent relative to traditional models. Importantly for Turkey and other countries with volatile economies, the framework also better predicts large swings in the growth rate, suggesting that machine learning techniques could be an important part of the macro forecasting toolkit of many countries. We also attempt to improve transparency and interpretability of ML forecasts by uncovering the contribution of each predictor to individual forecasts.

II. The Basics of Forecasting—a Bias-Variance Tradeoff

All forecasting methods aim to minimize expected forecast errors. Forecasting consists of selecting a function that maps indicator data to a forecast while minimizing a particular loss function. Suppose a researcher wants to forecast a variable $y_t$ (e.g., real GDP growth) using $K$ predictor variables summarized in the $K \times 1$ vector $X_t$, with the $h$-step-ahead forecast of $y_t$ denoted as $y_{t+h}$:

$$y_{t+h} = f(X_t) + \epsilon_{t+h}$$

where $\epsilon_{t+h}$ is an error term. The goal is to forecast $y_{t+h}$ by choosing the function $f(\cdot)$ that minimizes the average loss:

$$\min_{f(X_t)} L\big(y_{t+h} - f(X_t)\big)$$

where $L(\cdot)$ is a loss function that assigns relative weights to different forecast errors. If the loss function is quadratic, for example, the expected loss to minimize by picking the function $\hat{f}(X_t) = \hat{y}_{t+h}$ can be decomposed as (James et al., 2013):

$$\underbrace{E\big[(f(X_t) - \hat{y}_{t+h})^2\big]}_{\text{expected squared forecast error}} = \underbrace{\big[E(\hat{y}_{t+h}) - f(X_t)\big]^2}_{\text{squared bias}} + \underbrace{\mathrm{Var}\big[\hat{y}_{t+h}\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}} \qquad (1)$$

where the first term on the right-hand side is the squared bias of the forecast, the second term is the variance of the forecast and the third term is the idiosyncratic contribution of the error term to total loss.

The optimal forecast uses the past to predict the future without over-extrapolating. Minimizing the loss function (1) amounts to picking the function $\hat{f}(X_t)$ that minimizes the expected sum of the squared bias and the variance of the forecast. Unfortunately, it is typically impossible to reduce both terms simultaneously (Annex I). This bias-variance tradeoff is a central concept in both the forecasting and the machine learning literatures (James et al., 2013). In general, more complex forecasting models exhibit lower bias, because they better capture nuances in the mapping from $X_t$ to $y_{t+h}$.3 However, as complex models provide sharper predictions, they are also more likely to capture perturbations (or ‘noise’) in the historical data that are uninformative for future predictions. This tendency, known as ‘overfitting’, increases the variance of forecasts, potentially resulting in higher forecast errors.
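The tradeoff in (1) is easy to reproduce in a short simulation. The sketch below is purely illustrative (not part of the paper's framework; NumPy assumed): it fits polynomials of increasing degree to data drawn from a quadratic process, and the out-of-sample error is lowest at an intermediate degree of complexity.

```python
import numpy as np

rng = np.random.default_rng(0)

# True DGP is quadratic; the forecaster does not know this
def dgp(x, rng):
    return 1 + 2 * x - 3 * x**2 + rng.normal(0, 1, x.size)

x_train = rng.uniform(-1, 1, 30)
y_train = dgp(x_train, rng)
x_test = rng.uniform(-1, 1, 200)
y_test = dgp(x_test, rng)

def oos_mse(degree):
    # Fit in sample, evaluate out of sample
    coefs = np.polyfit(x_train, y_train, degree)
    return np.mean((np.polyval(coefs, x_test) - y_test) ** 2)

mse_underfit = oos_mse(0)    # high bias: forecasts the sample mean
mse_balanced = oos_mse(2)    # matches the true complexity
mse_overfit = oos_mse(12)    # high variance: chases noise in the sample
```

For typical draws, the degree-2 model beats both extremes out of sample, mirroring the bias and variance terms in equation (1).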

A. Shortcomings of OLS-Based Forecasting Methods

Forecasting methods based on OLS struggle to optimize the bias-variance tradeoff. Suppose the predictors are mean zero and the error term is i.i.d. $N(0, \sigma^2)$ and independent of $X_t$ (Stock & Watson, 2006). With OLS, the expected loss under the quadratic loss function becomes:

$$E\Big[\big(f(X_t) - \hat{y}^{OLS}_{t+h}\big)^2\Big] = \big[E(\hat{y}_{t+h}) - f(X_t)\big]^2 + \Big(\frac{X_t'X_t}{T} + 1\Big)\sigma^2 \qquad (2)$$

and several issues arise, including:

  • Collinearity. The variance of the OLS forecast is increasing in the degree of correlation between predictors. To see this, note that the expected value of the inner product $X_t'X_t$ (for a given observation) equals the covariance of $X_t$. The more correlated the predictors are, the higher this covariance.

  • Dimensionality. The variance of the OLS forecast is increasing in the number of predictors, $K$. To see this, suppose the predictors $X_t$ are orthogonal such that $\frac{1}{T}\sum_{t=1}^{T} X_t X_t' = I_K$ (a $K \times K$ identity matrix). In this case it can be shown that (Stock & Watson, 2006):

    $$\hat{y}^{OLS}_{t+h} \sim N\Big(E\big(\hat{y}^{OLS}_{t+h}\big),\; \frac{cK\sigma^2}{T}\Big)$$

    where c is a constant. For a given number of historical observations, T, the variance of the forecast is proportional to the number of predictors.

  • Predictor relevance. Related to dimensionality, irrelevant predictors unambiguously increase the forecast error because they do not reduce bias, but increase the forecast variance by increasing $X_t'X_t$.

  • Nonlinearity. If the data-generating process (DGP) is non-linear, the OLS forecast is biased. To see this, note that the first term on the right-hand side of (2) is minimized at zero if $f(X_t) = E(\hat{y}_{t+h})$, which is the case if the underlying model is linear, i.e., $f(X_t) = \beta'X_t$.
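The dimensionality and predictor-relevance problems can be seen in a small Monte Carlo sketch (illustrative only; NumPy assumed). Only the first predictor matters, yet adding irrelevant ones inflates the out-of-sample error of the OLS forecast:

```python
import numpy as np

rng = np.random.default_rng(1)
T, T_test, K = 60, 200, 30

X = rng.normal(size=(T + T_test, K))
y = 1.5 * X[:, 0] + rng.normal(size=T + T_test)   # only predictor 0 is relevant

def oos_mse(k):
    # OLS on the first k predictors, fit on T observations, evaluated out of sample
    beta, *_ = np.linalg.lstsq(X[:T, :k], y[:T], rcond=None)
    return np.mean((X[T:, :k] @ beta - y[T:]) ** 2)

mse_relevant_only = oos_mse(1)
mse_all = oos_mse(K)   # irrelevant predictors add variance, not bias reduction
```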

State-of-the-art forecasting techniques such as dynamic factor models can address some of these issues. Specifically, factor models (Annex II) aim to address collinearity and dimensionality by summarizing the variation in the predictor data using a small set of orthogonal factors.4 In particular, if the selected indicators capture the underlying forces that affect the forecasted variable, and there is a high degree of co-movement among indicators, this variation can be explained by a small set of latent variables (Sargent & Sims, 1977).

Factor models do not, however, address predictor relevance or nonlinearity. In attempting to summarize the information content of a large number of predictors into a small number of factors, there may be settings where the predictors follow a factor structure, but the factors do not predict the forecast variable (Tu & Lee, 2019). While factor models can help reduce dimensionality, they do not provide a means to identify the most relevant predictors. Furthermore, factor models rely on the assumption that the DGP follows a linear factor structure, which may not necessarily be the case.
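The factor-extraction step can be sketched with principal components computed via the singular value decomposition (an illustrative simulation, not the paper's estimation code; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
T, K, Q = 100, 20, 2

# Predictors driven by Q common latent factors plus idiosyncratic noise
F = rng.normal(size=(T, Q))
Lam = rng.normal(size=(Q, K))
X = F @ Lam + 0.3 * rng.normal(size=(T, K))

# Principal components of the standardized predictors
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = s**2 / (s**2).sum()           # variance share of each component
factors = U[:, :Q] * s[:Q]                # estimated factors
```

A handful of components absorbs most of the co-movement in the panel, which is exactly the dimensionality reduction factor models exploit.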

B. The Advantages of Machine Learning Methods

Unlike traditional forecasting techniques, ML methods are specifically designed to optimize the bias-variance tradeoff. In particular, ML models can address the above issues with which traditional forecasts have struggled because they select predictors to optimize out-of-sample (rather than in-sample) performance and are better able to handle nonlinear interactions among a large number of predictors (Annex III). In this study we focus on three specific ML methods: Random Forest; Gradient Boosted Trees; and Support Vector Machines.

Random Forest (RF) is an algorithm that uses forecast combinations of multiple decision trees to construct an aggregate forecast. The key elements of RF include:

  • Decision trees. A decision tree is an algorithm that repeatedly separates categorical data into two groups, with each split chosen by the algorithm to yield the largest reduction in the forecast error of the variable of interest. Regression trees are a type of decision tree used for predicting a continuous variable and are particularly well suited for nonlinear relationships. A regression tree minimizes the forecast error by repeatedly splitting the continuous data into two groups, with a prediction for each group that is based on the mean of that group’s data (Hastie et al., 2009).5 Decision trees can be as complex (i.e., long) as needed to fit the in-sample data well. However, they often ‘overfit’ the in-sample data at the expense of out-of-sample performance. Also, decision trees use local, rather than global, optimization, which can create path dependence and model instability. Modifications to the basic decision tree, such as random sampling, are often made to prevent overfitting and improve model performance.

  • Random sampling. RFs modify the decision tree approach in two ways to maximize the information content of the data by using subsamples of observations and predictors. First, they use bootstrap aggregation (’bagging’) by building each individual tree on only a random sample of the observations in the training data. Second, at each split in the tree, the RF algorithm uses only a random subsample of the predictors. Bagging therefore generates a large number of uncorrelated trees. Individually, the trees tend to have low bias but poor out-of-sample accuracy due to high variance (i.e., they overfit on the training data). However, for a large enough number of uncorrelated trees, these errors tend to average out to zero. RF is one of the most popular ML algorithms available because it is computationally easy to use and requires almost no tuning of model parameters. This makes it an ideal algorithm for forecasting on time-series data with relatively few observations.
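The two ingredients above (bagging and random feature subsets) can be sketched with depth-1 trees (‘stumps’). This is a toy version for exposition only: real Random Forests grow much deeper trees and search splits more exhaustively, and all names and parameter values below are illustrative (NumPy assumed).

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_stump(X, y, feats):
    # Depth-1 regression tree: best single split among candidate features,
    # with a few quantile thresholds considered per feature for brevity
    best = (np.inf, None)
    for j in feats:
        for thr in np.quantile(X[:, j], [0.25, 0.5, 0.75]):
            left = X[:, j] <= thr
            if not left.any() or left.all():
                continue
            pred = np.where(left, y[left].mean(), y[~left].mean())
            sse = ((y - pred) ** 2).sum()
            if sse < best[0]:
                best = (sse, (j, thr, y[left].mean(), y[~left].mean()))
    return best[1]

def stump_predict(stump, X):
    j, thr, lo, hi = stump
    return np.where(X[:, j] <= thr, lo, hi)

def random_forest(X, y, n_trees=200, m=2):
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), len(y))             # bootstrap sample ('bagging')
        feats = rng.choice(X.shape[1], m, replace=False)  # random feature subset
        trees.append(fit_stump(X[idx], y[idx], feats))
    return trees

def rf_predict(trees, X):
    # Aggregate forecast: average over the individual trees
    return np.mean([stump_predict(t, X) for t in trees], axis=0)

X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] > 0, 2.0, -2.0) + 0.5 * rng.normal(size=200)
trees = random_forest(X, y)
mse = np.mean((rf_predict(trees, X) - y) ** 2)
```

Even with such weak individual trees, averaging over the bootstrap ensemble recovers a large share of the nonlinear signal.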

Gradient Boosted Trees (GBT) is an algorithm that constructs sequential decision trees to learn from previous trees’ errors. Just like the RF, GBT combines individually-weak trees into a robust forecast. The algorithm starts out by training an initial decision tree on the historical data. It then uses the prediction errors from the first tree to train a second tree. In turn, the errors from the second tree are used to train the third tree, etc. After the final iteration, the algorithm uses the sum of the individual predictions for the final forecast.6 Whereas RF combines relatively deep trees with low bias and high variance, GBT combines relatively shallow trees with high bias and low variance. As each subsequent tree targets the bias from the previous tree, the bias errors of subsequent trees tend to sum towards zero, resulting in an overall prediction with both low bias and low variance.
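The boosting loop can be summarized in a short sketch, again with stumps as the weak learners and only median splits considered for brevity (illustrative, not the paper's implementation; NumPy assumed). Each round fits the residuals of the ensemble so far, and the shrunken updates steadily drive down the bias:

```python
import numpy as np

rng = np.random.default_rng(4)

def boost(X, y, n_rounds=100, lr=0.1):
    # Start from the unconditional mean; each round fits a median-split
    # stump to the current residuals and adds a shrunken ('learning rate')
    # version of its prediction to the ensemble
    pred = np.full(len(y), y.mean())
    for _ in range(n_rounds):
        resid = y - pred                       # errors of the ensemble so far
        best_sse, best_fit = np.inf, None
        for j in range(X.shape[1]):
            left = X[:, j] <= np.median(X[:, j])
            fit = np.where(left, resid[left].mean(), resid[~left].mean())
            sse = ((resid - fit) ** 2).sum()
            if sse < best_sse:
                best_sse, best_fit = sse, fit
        pred = pred + lr * best_fit
    return pred

X = rng.normal(size=(200, 3))
y = np.sin(2 * X[:, 0]) + 0.3 * rng.normal(size=200)
pred = boost(X, y)
mse_before = np.var(y)                         # error of the mean forecast
mse_after = np.mean((pred - y) ** 2)
```

Because each update fits the previous errors, the in-sample error falls monotonically; the learning rate keeps each step small so that no single tree dominates.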

Support Vector Machine (SVM) is an algorithm that constructs hyperplanes to partition predictor combinations and make a point forecast for each of the sections. Unlike tree-based algorithms, SVM is similar to kernel regression with a penalty imposed on the use of coefficients (i.e., penalized kernel regression). Formally, SVM regressions find the function $f(X_t) = X_t'\beta + b$ and observation-specific slack constants $\zeta_i$ and $\zeta_i^*$ that minimize $\frac{1}{2}\|\beta\|^2 + C\sum_i (\zeta_i + \zeta_i^*)$, subject to $y_i - f(X_i) \leq \varepsilon + \zeta_i$ and $f(X_i) - y_i \leq \varepsilon + \zeta_i^*$. The complexity parameters $\varepsilon$ and $C$ govern the acceptable margin and the penalty imposed on observations that lie outside this margin. The cost parameter, $C$, mainly determines the degree of model complexity. If $C = 0$, the algorithm disregards individual deviations and constructs the simplest hyperplane for which every observation is still within the acceptable margin $\varepsilon$. For sufficiently large $C$, the algorithm will construct the most complex hyperplane that predicts the outcome for the training data with zero error, i.e., the algorithm will fit the training data perfectly. Through cross-validation, SVM finds the optimal value of $C$ that balances this bias-variance tradeoff and maximizes out-of-sample accuracy on the historical data.
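An illustrative linear version of this problem can be solved with plain subgradient descent (real SVM implementations use kernels and quadratic-programming solvers; the sketch below, including all parameter values, is a simplified assumption, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(5)

def svr_fit(X, y, C=10.0, eps=0.1, lr=0.01, iters=2000):
    # Linear eps-insensitive SVR via subgradient descent on
    #   0.5 * ||w||^2 + C * sum_i max(0, |y_i - w'x_i - b| - eps)
    n, k = X.shape
    w, b = np.zeros(k), 0.0
    for _ in range(iters):
        r = y - X @ w - b
        viol = np.abs(r) > eps                 # observations outside the margin
        g_w = w - C * (X[viol].T @ np.sign(r[viol]))
        g_b = -C * np.sign(r[viol]).sum()
        w, b = w - lr * g_w / n, b - lr * g_b / n
    return w, b

X = rng.uniform(-1, 1, size=(100, 1))
y = 2.0 * X[:, 0] + 0.05 * rng.normal(size=100)
w, b = svr_fit(X, y)
```

Residuals inside the $\varepsilon$-margin exert no pull on the coefficients; only margin violations, weighted by $C$, do, which is the hedge against overfitting described above.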

III. A Framework for Macro Forecasting with Machine Learning

A. Limiting Preselection

We apply the framework to Turkey, a country for which traditional forecasting techniques have been unsatisfactory. We collect a database of country-specific and global indicators, with 234 separate series in total (Tables A5.1 and A5.2). The data consist of an array of mixed-frequency (monthly and quarterly) leading and coincident indicators from Haver Analytics. We then apply some basic transformations to each raw indicator. In addition to deflating nominal indicators where appropriate and including 12 lags, we include two transformations of each indicator series. For stationary variables (e.g., capacity utilization, consumer confidence), we use the level and quarter-on-quarter difference. For non-stationary variables (e.g., production, money) we take first- and second-order log differences. Moreover, we construct several indicators such as the sovereign term spread, sovereign yield spread, the US sovereign term spread, and the US high yield spread.7
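The transformation step described above can be expressed compactly. The helper functions below are illustrative (names and signatures are our own; NumPy assumed), mirroring the level/difference treatment for stationary series and the log-difference treatment for non-stationary ones:

```python
import numpy as np

def transform_indicator(x, stationary):
    # Stationary series: level and first difference;
    # non-stationary series: first- and second-order log differences
    if stationary:
        return x, np.diff(x, prepend=np.nan)
    log_diff = np.diff(np.log(x), prepend=np.nan)       # growth rate
    return log_diff, np.diff(log_diff, prepend=np.nan)  # change in growth

def add_lags(x, n_lags):
    # Stack a series with its lags into a (T, n_lags + 1) design block
    T = len(x)
    cols = [np.concatenate([np.full(l, np.nan), x[:T - l]]) for l in range(n_lags + 1)]
    return np.column_stack(cols)

lags = add_lags(np.array([1.0, 2.0, 3.0, 4.0]), n_lags=1)
growth, accel = transform_indicator(np.exp(np.arange(4.0)), stationary=False)
```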

We use hard thresholding to help address the dimensionality problem of a large set of predictors. More data is not always better and can increase forecast errors even when using dimensionality reduction techniques (Boivin & Ng, 2006). Hard thresholding (Bai & Ng, 2008) consists of regressing the forecast variable on its lags and each individual indicator and selecting all indicators with an absolute t-statistic above a certain threshold. In this case, the threshold is obtained by comparing out-of-sample performance of forecasts across a range of thresholds and choosing the threshold that delivers the lowest forecast errors.
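A minimal sketch of the hard-thresholding screen (illustrative; for brevity it regresses on each indicator alone, without the own lags described above, and the 1.65 cutoff is an arbitrary example rather than the tuned value):

```python
import numpy as np

def hard_threshold(X, y, t_min=1.65):
    # Keep indicators whose slope t-statistic in a univariate regression
    # (on a constant and the indicator alone) exceeds the threshold
    T = len(y)
    keep = []
    for j in range(X.shape[1]):
        Z = np.column_stack([np.ones(T), X[:, j]])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        s2 = resid @ resid / (T - 2)
        se = np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[1, 1])
        if abs(beta[1] / se) > t_min:
            keep.append(j)
    return keep

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] + rng.normal(size=100)     # only indicator 0 is informative
selected = hard_threshold(X, y)
```

The screen reliably retains the informative indicator while discarding most of the noise columns.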

B. Identifying Complementary Algorithms

The chosen ML models (RF, GBT, and SVM) are relatively simple and accessible. All three models require little parameter tuning and are thus less likely to overfit than other types of ML models.8 In addition, all three models are computationally relatively inexpensive.

The three models are also complementary. We combine the individual ML models into several ensembles. Ensembles can lower forecast errors relative to any of the individual models by producing a single, weighted forecast of the individual models. Ensembles tend to outperform individual forecasts, especially when the models are relatively independent yet similar in forecast accuracy (Timmermann, 2006). In this case, we combine the forecasts of the three models using equal weights (Ensemble 1), inverse root mean squared error (RMSE) weights (Ensemble 2) and inverse-RMSE rank weights (Ensemble 3).
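The three weighting schemes can be sketched as follows (function and variable names are illustrative; NumPy assumed):

```python
import numpy as np

def ensemble_forecast(preds, hist_preds, y_hist, scheme="equal"):
    # preds: current forecasts of each model; weights from historical RMSEs
    rmse = np.sqrt(((hist_preds - y_hist) ** 2).mean(axis=1))
    if scheme == "equal":                        # Ensemble 1
        w = np.ones(len(preds))
    elif scheme == "inv_rmse":                   # Ensemble 2
        w = 1.0 / rmse
    else:                                        # Ensemble 3: inverse-RMSE rank
        w = 1.0 / (rmse.argsort().argsort() + 1.0)
    w = w / w.sum()
    return w @ preds

y_hist = np.array([1.0, 2.0, 3.0, 4.0])
hist_preds = np.vstack([y_hist + 0.1, y_hist + 1.0, y_hist + 2.0])
preds = np.array([1.0, 2.0, 3.0])
f_equal = ensemble_forecast(preds, hist_preds, y_hist, "equal")
f_inv_rmse = ensemble_forecast(preds, hist_preds, y_hist, "inv_rmse")
f_inv_rank = ensemble_forecast(preds, hist_preds, y_hist, "inv_rank")
```

Both RMSE-based schemes tilt the combined forecast toward the historically most accurate model, with the rank-based weights doing so less aggressively.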

Our ensembles combine the best aspects of the individual models. More complex ML methods such as SVM tend to overfit when training data is relatively limited (e.g., short time series), resulting in predictions that are sensitive to small perturbations in the leading indicators. As such, SVM acts as a counterweight against GBT and RF, which tend to outperform during stable periods of growth, whereas the SVM is more likely to pick up the effect of extreme shocks.

C. Evaluating Performance and Interpreting Results

To evaluate model performance, we use rolling out-of-sample forecasts. This method provides an intuitive test of how the models would have performed in the past. Specifically, for each individual nowcast, we split the historical data available at the time of the nowcast into a training set and a test set and use cross-validation techniques to tune the parameters of the model (Annex III). Once calibrated, we then run the model using all historical data available at that time to obtain each individual nowcast, and ultimately assess the performance of the model.
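The evaluation loop amounts to re-estimating the model at each date using only the data available at that date (a schematic sketch with illustrative names, abstracting from the cross-validation step; NumPy assumed):

```python
import numpy as np

def rolling_oos_rmse(X, y, fit, predict, min_train=20):
    # At each date t, estimate on data available before t, then predict t
    errs = []
    for t in range(min_train, len(y)):
        model = fit(X[:t], y[:t])
        errs.append(predict(model, X[t]) - y[t])
    return np.sqrt(np.mean(np.square(errs)))

# Example: rolling evaluation of a univariate OLS forecast on noiseless data
X = np.arange(40.0).reshape(-1, 1)
y = 3.0 * X[:, 0]
ols_fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
ols_pred = lambda beta, x: float(x @ beta)
rmse = rolling_oos_rmse(X, y, ols_fit, ols_pred)
```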

We also assess the importance of each predictor by constructing variable importance measures for each of the ML models. To improve transparency and interpretability of our ML forecasts, we identify the contribution of each predictor to individual forecasts. Shapley Values provide an intuitive summary of each variable’s contribution to the forecast’s deviation from its historical mean (Annex IV).

IV. Results—More Accurate Forecasts

Individual ML methods can improve forecast performance. Figure 1 plots the RMSE of the benchmark factor model nowcast against the RMSE of the three machine learning models (RF, GBT and SVM) for the 2012–2019 period.9 The benchmark has an RMSE of 1.66, which corresponds to a mean absolute deviation of about 1.2 percentage points per nowcast. Using RF, GBT, or SVM reduces the RMSE by 24, 22, and 18 percent, respectively. We find similar improvements for the forecast models (Figure A5.1), where the RF and GBT outperform the benchmark by 18 and 22 percent.

ML methods not only increase average accuracy, but also better predict economic volatility. Figure 2 plots the rolling out-of-sample nowcasts against actual quarterly real GDP growth. While the forecasts of the benchmark factor model are relatively stable, the three ML methods all better predict the large growth swings seen in 2014, 2016 and 2018–19. Figure 3 plots the RMSE for the different nowcast models for ‘volatile’ quarters only, where we define a volatile quarter as one whose growth rate is more than 3 percentage points above or below that of the previous quarter. In this setting, SVM outperforms all of the other models, improving upon the factor model by 39 percent. Moreover, the ML methods tend to move closer to the actual quarterly growth rate as we get closer to the end of the quarter. These patterns are similar in case of the forecast (Figure A5.2).

Figure 2.

Individual Model Nowcasts vs. Actual Real GDP Growth (percent, quarter on quarter seasonally adjusted)

Citation: IMF Working Papers 2020, 045; 10.5089/9781513531724.001.A001

Figure 3.

Nowcast RMSE, Volatile Quarters


The accuracy of ML methods increases with the availability of training data. Figure 4 plots the smoothed root-squared (or absolute) error of the benchmark model and the ML methods. Relative to the benchmark, nowcast errors decrease substantially over time as more data become available to train and test the models. From 2012 to 2019, RF and GBT gain roughly 60 percent in accuracy relative to the benchmark, while the SVM lowers errors by almost 80 percent. Again, we observe similar patterns for the forecast (Figure A5.1).

Figure 4.

Nowcast Smoothed RSE


Different ML methods have different strengths, making them ideal as combinations in ensembles. RF seems to have good predictive performance overall, but does not fully capture the large swings in growth (Figure 2). Predictions from GBT are a bit more volatile, but also better capture the large swings in growth. SVM appears best at capturing the large swings, but at the expense of even more volatility.

Ensembles exploit the different strengths of the individual ML models to further improve predictions. Figure 5 plots the RMSE of the three ensemble nowcasts and the benchmark factor model for the 2012–2019 period. The ensembles differ little in terms of overall performance. All outperform the benchmark by about 33 percent, which is an improvement of at least 9 percentage points compared to the individual models. The outperformance of the ensembles is also more stable over time.

Figure 5.

Ensemble Model Nowcasts vs. Actual Real GDP Growth (percent, quarter on quarter seasonally adjusted)


Measures of variable importance can improve the economic interpretability of the predictions. Using nowcasts for Turkey as an illustrative example:

  • First, we construct variable importance measures for Ensemble 1 (equal weights between RF, GBT and SVM), which are model-specific estimates of the relative importance of predictors in generating the forecasts.10 These are scaled from 0 to 100 in ascending order of relative importance. Figure 6 plots the 25 most important predictors for the Turkey nowcast model in July 2019. In addition to the previous quarter’s GDP growth, the nowcast mainly relies on changes in the stock market, imports, business confidence, unemployment and the manufacturing PMI.

  • Second, we use Shapley Values to decompose recent Turkey nowcasts into contributions of different predictor categories. Figure 6 also plots the Shapley Values by categories for three Turkey nowcasts. Relative to the historic mean, lower production indicators and higher inflation contributed to lower forecasts in all months. Over time, the nowcast mainly deteriorated due to worsening financial conditions and consumption indicators.

Figure 6.

Variable Importance and Shapley Values


V. Conclusions

Machine learning techniques can improve forecasting performance relative to traditional models. Techniques based on OLS struggle to overcome several issues, including collinearity, dimensionality, predictor relevance, and nonlinearity. As a result, even state-of-the-art forecasting models often result in large forecast errors, especially when the variable to be predicted is volatile, such as output growth in many emerging market and developing economies. ML models can outperform traditional forecasting methods because they emphasize out-of-sample (rather than in-sample) performance and better handle nonlinear interactions among a large number of predictors. ML methods are specifically designed to learn complex relationships from past data while resisting the tendency of traditional methods to over-extrapolate historical relationships into the future.

VI. References

  • Bai, J., & Ng, S., 2008. “Forecasting Economic Time Series Using Targeted Predictors,” Journal of Econometrics, 146 (2), 304–317.

  • Bai, J., & Ng, S., 2008. “Large Dimensional Factor Analysis,” Foundations and Trends in Econometrics, 3 (2), 89–163.

  • Barhoumi, K., Darné, O., & Ferrara, L., 2014. “Dynamic Factor Models: A Review of the Literature,” Journal of Business Cycle Research, 2013(2), 73.

  • Boivin, J., & Ng, S., 2006. “Are More Data Always Better for Factor Analysis?” Journal of Econometrics, 132 (1), 169–194.

  • Carrasco, M., & Rossi, B., 2016. “In-sample Inference and Forecasting in Misspecified Factor Models,” Journal of Business & Economic Statistics, 34 (3), 313–338.

  • Hastie, T., Tibshirani, R., & Friedman, J., 2009. “The Elements of Statistical Learning: Data Mining, Inference, and Prediction,” Springer Science & Business Media.

  • James, G., Witten, D., Hastie, T., & Tibshirani, R., 2013. “An Introduction to Statistical Learning,” (Vol. 112, p. 18). New York: Springer.

  • Jung, J. K., Patnam, M., & Ter-Martirosyan, A., 2018. “An Algorithmic Crystal Ball: Forecasts Based on Machine Learning,” IMF Working Paper 18/230.

  • Kim, H. H., & Swanson, N. R., 2018. “Mining Big Data Using Parsimonious Factor, Machine Learning, Variable Selection and Shrinkage Methods,” International Journal of Forecasting, 34 (2), 339–354.

  • Medeiros, M. C., Vasconcelos, G. F., Veiga, Á., & Zilberman, E., 2019. “Forecasting Inflation in a Data-rich Environment: the Benefits of Machine Learning Methods,” Journal of Business & Economic Statistics, 145.

  • Richardson, A., & Mulder, T., 2018. “Nowcasting New Zealand GDP Using Machine Learning Algorithms,” Mimeo.

  • Sargent, T. J., & Sims, C. A., 1977. “Business Cycle Modeling Without Pretending to Have Too Much A Priori Economic Theory,” New Methods in Business Cycle Research, 1, 145–168.

  • Shapley, L. S., 1953. “A Value For N-person Games,” Contributions to the Theory of Games, 2 (28), 307–317.

  • Smalter Hall, A., 2018. “Machine Learning Approaches to Macroeconomic Forecasting,” Economic Review, Federal Reserve Bank of Kansas City, 103(4), 63.

  • Smeekes, S., & Wijler, E., 2018. “Macroeconomic Forecasting Using Penalized Regression Methods,” International Journal of Forecasting, 34 (3), 408–430.

  • Stock, J. H., & Watson, M. W., 2006. “Forecasting With Many Predictors,” Handbook of Economic Forecasting, 1, 515–554.

  • Stock, J. H., & Watson, M., 2011. “Dynamic Factor Models,” Oxford Handbooks Online.

  • Stock, J. H., & Watson, M., 2012. “Generalized Shrinkage Methods for Forecasting Using Many Predictors,” Journal of Business & Economic Statistics, 30 (4), 481–493.

  • Stock, J. H., & Watson, M., 2017. “Twenty Years of Time Series Econometrics in Ten Pictures,” Journal of Economic Perspectives, 31 (2), 59–86.

  • Tiffin, A., 2016. “Seeing in the Dark: A Machine-Learning Approach to Nowcasting in Lebanon,” IMF Working Paper 16/56.

  • Timmermann, A., 2006. “Forecast Combinations,” Handbook of Economic Forecasting, 1, 135–196.

  • Tu, Y., & Lee, T. H., 2019. “Forecasting Using Supervised Factor Models,” Journal of Management Science and Engineering.

Annex I. The Bias-Variance Tradeoff

We demonstrate the bias-variance tradeoff with two simple examples. Suppose a researcher has T periods of historical data on yt+h and a set of predictors, Xt. The least complex model would simply forecast the historical mean of yt+h. Doing so leads to substantial bias as it is unlikely that yt+h is constant over time. However, the variance of this simple forecast is minimized. At the other extreme, a forecaster could pick one historical observation that it believes to be most representative (’closest’) to the current environment in terms of Xt, and use this observation’s historical outcome as the forecast. Such a complex forecast will have low bias but high variance.

The K-Nearest Neighbors algorithm is one way to minimize the bias-variance tradeoff. The two extreme types of forecasts described above are examples of the K-Nearest Neighbors (KNN) algorithm. This ML method uses observations in the historical data closest to $X_t$ to form the forecast $\hat{y}_{t+h}$, which is formally defined as (Hastie et al., 2009):

$$\hat{y}_{t+h} = \frac{1}{K}\sum_{n \in N_K(X_t)} y_n$$

where $N_K(X_t)$ is the neighborhood of the forecast defined by the $K$ closest points $n$ in the historical sample. This neighborhood is usually constructed using the Euclidean distance. KNN has a convenient closed-form expression for expected loss:

$$E\Big[\big(f(X_t) - \hat{y}_{t+h}\big)^2\Big] = \Big[f(X_t) - \frac{1}{K}\sum_{n \in N_K(X_t)} y_n\Big]^2 + \sigma^2\Big(\frac{1}{K} + 1\Big)$$

which nicely summarizes the bias-variance tradeoff. The squared bias (first term on RHS) is monotonically increasing in K as observations ‘farther’ from Xt tend to be less informative for the forecast. The variance (second term on RHS) is monotonically decreasing in K. As a result, the K that minimizes forecast errors tends to be somewhere in between the two extreme cases. Figure 1.1 expresses this bias-variance tradeoff visually.
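The KNN forecast above is short enough to state directly (illustrative NumPy sketch). Setting K to the full sample reproduces the historical-mean forecast, while K = 1 reproduces the 'closest observation' forecast, the two extremes discussed above:

```python
import numpy as np

def knn_forecast(X_hist, y_hist, x0, K):
    # Average the outcomes of the K historical observations closest to x0
    dist = np.linalg.norm(X_hist - x0, axis=1)   # Euclidean distance
    return y_hist[np.argsort(dist)[:K]].mean()

X_hist = np.array([[0.0], [1.0], [2.0], [3.0]])
y_hist = np.array([0.0, 1.0, 2.0, 3.0])
f_complex = knn_forecast(X_hist, y_hist, np.array([0.1]), K=1)  # low bias, high variance
f_simple = knn_forecast(X_hist, y_hist, np.array([0.1]), K=4)   # historical mean
```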

Figure 1.1.

Model Complexity and the Bias-Variance Tradeoff


Source: Smalter Hall (2018)

Annex II. Static Dynamic Factor Models

Traditionally, the factor model literature assumes predictors take the form (Stock & Watson, 2006; Smeekes & Wijler, 2018):

$$x_{kt} = \lambda_k(L) f_t + e_{kt}$$

where $x_{kt}$ is the predictor $k$ time series observed at time $t$ with zero mean and unit variance, $f_t$ is a $Q \times 1$ vector containing latent factors, and $e_{kt}$ is an idiosyncratic disturbance term. $\lambda_k(L)$ is a lag polynomial in the lag operator $L$, often referred to as the “dynamic factor loadings.” Both the factors and disturbances are assumed to be uncorrelated at all leads and lags. We also assume the forecast variable admits a factor structure:

$$y_{t+h} = \lambda_y(L) f_t + e_{yt}$$

and the single forecasting equation for $y_{t+h}$ implied by this factor structure takes the form:

$$y_{t+h} = \beta(L) f_t + \gamma(L) y_t + \epsilon_{t+h}$$

where $\beta(L)$ is a lag polynomial, and $\epsilon_{t+h}$ is a conditional mean zero disturbance term. This model can be estimated using MLE, although this is computationally demanding and only consistent under somewhat restrictive assumptions. As a result, it is standard in the macro forecasting literature to rewrite the dynamic factor model in its static form, which can be estimated using principal components analysis (PCA).

If the lag polynomials $\beta(L)$ and $\lambda_k(L)$ have finite order, we can rewrite the model as (Stock & Watson, 2006):

$$X_t = \Lambda F_t + u_t$$
$$y_{t+h} = \beta_F' F_t + \gamma(L) y_t + v_{t+h}$$

where $\Lambda$ and $F_t$ represent unobserved factor loadings and factors, and $u_t$ is an error term that is i.i.d. $N(0, \sigma_u^2)$ and independent of $F_t$. We can now recast estimating the general model as (Smeekes & Wijler, 2018):

$$y_{t+h} = f(\Lambda F_t + u_t) + \epsilon_{t+h}$$

For given estimates $\hat{\Lambda}$ and $\hat{F}_t$, a static factor model assumes $f(\cdot)$ is linear and thus runs OLS such that:

$$\hat{y}^{DFM}_{t+h} = (\hat{F}_t'\hat{\Lambda})\hat{\beta}$$

In this case, expected mean loss can be decomposed as:

$$E\Big[\big(f(X_t) - \hat{y}^{DFM}_{t+h}\big)^2\Big] = \big[E(\hat{y}_{t+h}) - f(X_t)\big]^2 + \frac{(\hat{F}_t'\hat{\Lambda})'(\hat{F}_t'\hat{\Lambda})\, V(u_t + \epsilon_{t+h})}{T} + V(u_t + \epsilon_{t+h})$$

Annex III. Machine Learning and Cross Validation

Any ML algorithm can be cast as a series of general steps. ML methods are designed to find the optimal degree of complexity of a model that maximizes out-of-sample forecast accuracy. Suppose a researcher can pick $f(\cdot)$ from a class of models (e.g., linear, nearest neighbors). Given the model class, we can represent this as the researcher selecting parameters $\beta$ and $\alpha$:

$$\min_{\beta,\alpha} L\big(y_{t+h} - f(X_t, \beta)\big) \quad \text{s.t.} \quad \beta \in \Theta(\alpha)$$

where $\beta$ determines the specific function within the model class, and $\alpha$ are ‘tuning parameters’ or ‘regularizers’ that determine the potential model complexity by constraining $\beta$ to lie in $\Theta(\alpha)$. The table below summarizes $\alpha$ and $\beta$ for popular ML algorithms. Any ML algorithm consists of the following steps:

  • (a) For every degree of model complexity $\alpha$, find the model configuration $\beta$ that maximizes forecast accuracy on the training data.

  • (b) Forecast on the test data using this model configuration β.

  • (c) Across all possible α, pick the degree of model complexity α that maximizes forecast accuracy on the test data.

This process of finding the optimal model parameters is called cross validation (CV). With CV, the entire data set is split into multiple subgroups (’folds’), which are all used as separate test sets. In this paper, we use 10 folds to tune the model complexity parameters.
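The steps above can be sketched in a few lines. The data, the choice of a Random Forest, and the use of maximum tree depth as the complexity parameter $\alpha$ are all illustrative (the paper's replication code is in R; scikit-learn is used here):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

# Steps (a)-(c): for each complexity level alpha (here, maximum tree
# depth), fit on each training fold, forecast on the held-out fold,
# and keep the alpha with the lowest average squared error.
alphas = [1, 2, 4, 8]
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = {}
for a in alphas:
    errs = []
    for train_idx, test_idx in cv.split(X):
        model = RandomForestRegressor(max_depth=a, n_estimators=30,
                                      random_state=0)
        model.fit(X[train_idx], y[train_idx])
        errs.append(np.mean((model.predict(X[test_idx]) - y[test_idx]) ** 2))
    scores[a] = float(np.mean(errs))
best_alpha = min(scores, key=scores.get)
```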

Figure A3.1. Decision Tree Example

Citation: IMF Working Papers 2020, 045; 10.5089/9781513531724.001.A001

Notes: Figure plots a hypothetical decision tree nowcasting real GDP growth at time t using lags of real GDP growth, stock market growth, and the US term premium. Each leaf contains two training observations, and the trained decision tree predicts the average observed GDP growth of these two observations.
Figure A3.2. Random Forest Example


Notes: Figure plots a hypothetical Random Forest nowcasting real GDP growth at time t using lags of real GDP growth, stock market growth, and the US term premium. Each tree uses different observations and considers different variables at each split. In this example, each leaf contains only one training observation. The trained RF predicts the average of the GDP growth rates of the leaves that the new observation belongs to.
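The leaf-averaging mechanics described in the notes to Figure A3.1 can be sketched with made-up data; `min_samples_leaf=2` mirrors the figure, so each leaf holds two training observations and a prediction is the average of their GDP growth rates:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training data: columns are lagged real GDP growth, stock
# market growth, and the US term premium; target is real GDP growth.
X_train = np.array([[0.5, 2.0, 1.1], [0.6, 1.5, 1.0],
                    [-1.0, -3.0, 0.2], [-0.8, -2.5, 0.3]])
y_train = np.array([0.7, 0.9, -1.2, -1.0])

# Each leaf must contain at least two training observations, so the
# trained tree predicts the average GDP growth of those observations.
tree = DecisionTreeRegressor(min_samples_leaf=2, random_state=0)
tree.fit(X_train, y_train)
print(tree.predict([[0.4, 1.8, 1.05]]))  # → [0.8], the mean of 0.7 and 0.9
```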

Annex IV. Interpreting Forecasts: Shapley Values

Shapley Values can help with the interpretation of the results of ML forecasts. Shapley Values are a concept from coalitional game theory that measures the contribution of each player in a game when the game’s payoff depends on interactions (‘coalitions’) between the players (Shapley, 1953). They are constructed as the mean of each player’s marginal contributions over every possible combination of the other players’ actions. In the context of ML methods, Shapley Values measure each variable’s contribution to an individual prediction’s deviation from the historical mean. For an OLS-based model, these contributions are the same as the predictor’s coefficient multiplied by its specific value. Shapley Values are thus particularly useful for decomposing predictions from methods with interactions among predictors (e.g., a Random Forest).

The forecast decomposition using Shapley Values can be demonstrated with an example. Suppose we have trained an ML model to predict real GDP growth. The model predicts 5 percent for a certain period in which: (i) nominal credit growth is above 10 percent; and (ii) the country’s major trading partner is expanding. We want to decompose this prediction into contributions of the two predictors (credit growth and trading partner growth). The matrix below summarizes the model’s predictions contingent on the values of the two predictors. In this case, the two variables act as complements. The average marginal contribution of credit growth being above 10 percent is 4.5 percent, and the average marginal contribution of the trading partner expanding is 1.5 percent. If we assume the historical mean of the model forecast is 1 percent, the Shapley Values for credit growth and trading partner expansion would be 3 percent and 1 percent, respectively.1

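This decomposition can be reproduced in a few lines. The numbers below are the hypothetical ones from the example, and, following the computation in the footnote, the Shapley Values are obtained by scaling the prediction's deviation from the historical mean by each variable's share of the average marginal contributions:

```python
prediction, historical_mean = 5.0, 1.0                     # model output, its mean
marginal = {"credit_growth": 4.5, "trading_partner": 1.5}  # avg. marginal contributions

# Split (prediction - mean) in proportion to the marginal contributions
total = sum(marginal.values())
shapley = {name: (prediction - historical_mean) * m / total
           for name, m in marginal.items()}
print(shapley)  # {'credit_growth': 3.0, 'trading_partner': 1.0}
```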

Annex V. Additional Figures and Tables

Figure A5.2. Rolling Out-of-Sample Forecasts vs. Actual Real GDP Growth
(percent, quarter on quarter seasonally adjusted)


Table A5.1. Stationary Variables—Level and First Difference
Table A5.2. Non-Stationary Variables—First and Second Log Difference
(Y: in real terms)
1. The authors would like to thank Donal McGettigan, Alex Culiuc, Vincenzo Guzzo, Romain Lafarguette, and participants at the IMF EUR seminar for helpful comments and suggestions, and Morgan Maneely for outstanding research assistance. All remaining errors are our own. R codes are available at marijnbolhuis.org.

2. For example, Smeekes & Wijler (2016) and Carrasco & Rossi (2016) find that penalized ML methods tend to outperform traditional factor models in terms of forecast accuracy. The former also show that ML methods are more robust to model misspecification. Tu & Lee (2018) show that traditional factor models tend to be inferior to supervised factor models that perform variable selection. Kim & Swanson (2014) assess the predictive accuracy of traditional, ML, and ‘hybrid’ forecasting methods and find that the latter two dominate in most settings. Tiffin (2016), Jung et al. (2018), and Richardson et al. (2018) use different ML methods to forecast GDP growth for several countries. Smalter Hall (2018) employs ML methods to forecast unemployment in the United States and Medeiros et al. (2018) forecast inflation in Brazil.

3. There is no universal definition of complexity in the ML literature, as the degree of complexity often depends on the nature of the underlying learning model. Common sources of complexity are the number of included variables (e.g., penalized linear models), the number of parameters a model ‘learns’ (e.g., random forest), the number of relationships specified (e.g., neural networks), and the number of observations used per individual prediction (e.g., nearest neighbors).

4. For a detailed review, see Stock & Watson (2006, 2011, 2012, 2017) and Bai & Ng (2008).

5. Formally, regression trees pick regions $R_m$ and region predictions $c_m$ (for $M$ different regions) to solve: $\min_{\{R_m, c_m\}_{m=1}^{M}} \sum_t \left[y_t - \sum_{m=1}^{M} c_m I(X_t \in R_m)\right]^2$.

6. Let $F_1(X_t)$ denote the in-sample prediction from the first decision tree. The second tree thus solves $\min_{F_2(X_t)} \sum [y_t - F_1(X_t) - F_2(X_t)]^2$, the third tree solves $\min_{F_3(X_t)} \sum [y_t - F_1(X_t) - F_2(X_t) - F_3(X_t)]^2$, etc. With three trees, the final forecast equals $F_1(X_t) + F_2(X_t) + F_3(X_t)$.
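The residual-fitting recursion in this footnote can be sketched with simulated data; plain scikit-learn regression trees stand in for the boosted learner, and the depth and tree count are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Each tree F_k is fit to the residual left by the previous trees.
trees, resid = [], y.copy()
for _ in range(3):
    t = DecisionTreeRegressor(max_depth=2).fit(X, resid)
    resid = resid - t.predict(X)
    trees.append(t)

# Final forecast: F_1(X_t) + F_2(X_t) + F_3(X_t)
boosted = sum(t.predict(X) for t in trees)
```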

7. A common drawback of assembling a large dataset is that many series may have missing observations for a significant period of time. ML techniques offer a way to impute missing values in order to take advantage of all available indicators and observations. Specifically, the algorithm we use initially imputes the missing values with each indicator’s median, then runs a Random Forest. It then replaces the missing value of an indicator with the weighted average of the non-missing observations, where the weights are the proximities (i.e., the fraction of final nodes shared by two observations) of the random forest.
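A sketch of one pass of this imputation scheme, using scikit-learn and simulated data (the paper's implementation is in R; here proximities are computed as the fraction of trees in which two observations share a terminal node):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=100)
X_obs = X.copy()
X_obs[::7, 2] = np.nan                      # indicator 2 has gaps

# Step 1: initial fill with the indicator's median
col = 2
miss = np.isnan(X_obs[:, col])
X_imp = X_obs.copy()
X_imp[miss, col] = np.nanmedian(X_obs[:, col])

# Step 2: run a Random Forest and compute proximities
# (fraction of terminal nodes shared by two observations)
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=5,
                           random_state=0).fit(X_imp, y)
leaves = rf.apply(X_imp)                    # (n_obs, n_trees) leaf ids
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Step 3: replace each missing value with the proximity-weighted
# average of the non-missing observations of the same indicator
for i in np.where(miss)[0]:
    X_imp[i, col] = np.average(X_obs[~miss, col], weights=prox[i, ~miss])
```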

8. Avoiding overfitting to a complex model is our main reason for not deploying neural networks, which tend to require large datasets for good performance. Indeed, in Jung et al. (2018) elastic net tends to outperform recurrent neural networks when forecasting GDP growth. We avoid linear penalized regression methods such as ridge, LASSO and elastic net because they are sensitive to large, unexpected changes in predictor values that are not in the training dataset. As a result, forecasts using these methods tend to be unstable for datasets like ours.

9. We compare the performance of the ML models and ensembles against a more traditional forecasting model. As a benchmark, we use a static dynamic factor model (DFM). We employ three factors, as is standard in the DFM literature (Barhoumi et al., 2013).

10. We use the R package caret.

1. $3.0 = (5 - 1) \cdot \frac{4.5}{4.5 + 1.5}; \quad 1.0 = (5 - 1) \cdot \frac{1.5}{4.5 + 1.5}$

Deus ex Machina? A Framework for Macro Forecasting with Machine Learning
Author: Marijn A. Bolhuis and Brett Rayner