Lasso Regressions and Forecasting Models in Applied Stress Testing

Contributor Notes

Author’s E-Mail Address: jchanlau@imf.org

Abstract

Model selection and forecasting in stress tests can be facilitated using machine learning techniques. These techniques have proved robust in other fields for dealing with the curse of dimensionality, a situation often encountered in applied stress testing. Lasso regressions, in particular, are well suited for building forecasting models when the number of potential covariates is large, and the number of observations is small or roughly equal to the number of covariates. This paper presents a conceptual overview of lasso regressions, explains how they fit in applied stress tests, describes its advantages over other model selection methods, and illustrates their application by constructing forecasting models of sectoral probabilities of default in an advanced emerging market economy.

I. Introduction1

Owing to the large costs associated with bank failures and financial crises, as experienced in the global recession of 2008–9, it has become important to assess the vulnerability of financial institutions to severe economic and financial shocks. Among quantitative tools, stress tests are widely used to examine risks to financial institutions and financial stability more broadly (Bookstaber et al, 2013). Stress tests are also important policy tools when supported by credible official backstops, raising confidence in the ability of policy makers to manage well the risks envisaged in stress scenarios (Orphanides, 2015).2

Recent work on stress testing has been mainly oriented towards scenario design. One strand of the literature has focused on the generation of severe, plausible, and coherent stress scenarios consistent with historical crisis episodes and potential regime changes.3 On the policy front, there have been efforts to integrate stress tests into financial sector surveillance and oversight with a view to enhancing regulatory and supervisory guidance (IMF, 2012; Bookstaber et al, 2013).

Improving forecasting models, however, has been somewhat neglected in recent applied stress test work. This is not a minor omission, since the numerical outcomes of the forecasting models influence policy recommendations as well as business strategies. As stress scenarios become more comprehensive, encompassing an increasing number of primary variables, model selection and forecasting become challenging even within the family of linear models.

This paper argues that model selection and forecasting in stress tests can be facilitated using techniques borrowed from the field of machine learning. These techniques have not been widely used in econometric and financial applications despite their robustness and good performance in other fields where large datasets are typical. Notable exceptions in applying them in stress tests and default prediction are Kapinos and Mitnik (2015), and Perdeiy (2009). The former suggest using the least absolute shrinkage and selection operator (Lasso) to link bank performance indicators to macroeconomic variables. The latter uses Lasso regressions to predict bankruptcy using non-traditional financial indicators as covariates. This paper extends their work by delving more deeply into the conceptual issues that justify the use of machine learning techniques and discussing them more extensively.4

The organization of the rest of the paper is as follows. Section II provides an overview of the multi-step process in a standard stress test. Within this context, Section III explains how the large number of primary variables included in a stress scenario leads to a curse of dimensionality problem and complicates the model selection process. Section IV argues that machine learning techniques are preferable to other dimension reduction techniques in addressing the curse of dimensionality. Section V discusses subset selection and shrinkage methods, including the Lasso. After the presentation of the conceptual underpinnings of Lasso estimation, Section VI describes Lasso applications in the areas of finance, economics, and financial networks. Section VII illustrates the use of Lasso estimation in forecasting probabilities of default in an advanced emerging market economy. Section VIII concludes.

II. Stress Tests: A Multi-Step Process

A standard stress test comprises several steps. The first step is stress scenario design. This involves choosing the length (or horizon) of the scenarios, selecting the primary variables to stress, and specifying their paths under each scenario. The number of primary variables can be quite large, as is the case in the scenarios specified by supervisory authorities.

For instance, the U.S. Federal Reserve included sixteen domestic economic and financial variables, and twelve international variables in the scenarios specified in the 2015 Comprehensive Capital Analysis and Review (CCAR) of U.S. banks. The Bank of England specified more than sixty primary variables for its 2015 annual stress test exercise (Bank of England, 2015). Finally, the European Banking Authority (EBA), in its stress tests of banks in the European Union (EU), included government bond yields, equity prices, house price shocks, and the impact of funding shocks on real GDP growth for twenty-seven EU countries. It also included changes in real GDP growth for twenty countries and regions outside the EU (European Systemic Risk Board, 2014).

In the second step, performance indicators for the firms included in the stress test are forecasted using econometric and statistical models, a.k.a. satellite models. The primary variables enter the models as covariates or explanatory variables. As an illustration, in the case of a bank, forecasting models capture the dynamics of non-performing loans (NPLs) and provisions related to its loan portfolio under the stress scenarios. The choice of metrics to evaluate the performance of a firm depends on the goal of the stress test. In solvency stress tests, for example, the relevant metrics are those associated with the default risk of the firm, such as capital adequacy ratios in the case of banks, and statutory capital and technical provisions ratios in the case of insurers.5

The final step comprises an evaluation of the weaknesses and strengths of individual firms, and an overall assessment of system-wide vulnerabilities based on the performance indicators. The evaluation then serves to guide business strategy if conducted by the firm itself as part of its risk management process, or policy measures if conducted by supervisory authorities. For example, the U.S. Federal Reserve uses the results of the stress tests conducted under the CCAR exercise to grant or deny approval of the capital distribution plans of the banks. Banks that fail the stress tests are required to resubmit revised plans ensuring their capital buffers would be sufficient to withstand the shocks envisaged in the stress scenario.

III. Model Selection Challenges in Stress Tests

The large number of primary variables specified in stress scenarios poses a challenge for the design, selection, and estimation of forecasting models. In the case of the CCAR, restricting models to use only contemporaneous values of the variables and ruling out interactions leads to 2^28 (roughly 268 million) possible linear models for explaining a single dependent variable, say, the default rate of a specific asset class. Model selection intractability worsens as the number of potential models grows linearly with the number of variables to forecast.

Moreover, in the context of macro stress tests, the number of covariates may exceed the sample size, raising concerns about model overfitting. In these tests, data is typically available at an annual frequency and may be available for only a few years. In this case, least squares cannot yield unique coefficient estimates and some method is necessary to reduce the number of covariates included in the model.

Given the large set of potential covariates, multicollinearity is likely to pose problems. This situation justifies selecting a reduced subset of variables for forecasting purposes. Arguably, expert judgment could help identify the covariates relevant for the forecasting exercise, facilitating the model selection process. However, in the context of a relatively complex firm, e.g. a commercial bank, expert judgment may not compensate for the lack of specialized and detailed knowledge of the firm’s operations and exposures, including on its main counterparties. Expert judgment, especially if exercised by outsiders to the firm, may create blind spots and foster an unjustified sense of security.

Dimension reduction techniques offer a formal approach for dealing with the curse of dimensionality. Commonly used techniques are factor analysis and principal component analysis, which construct factors and components as linear combinations of the covariates. Typically, only a reduced number of factors and components are sufficient to explain the variability of the data. For instance, three principal components are enough to explain the term structure of government yield curves. While useful, there is an important caveat with dimension reduction techniques: it is somewhat difficult to associate an economic meaning to a factor or a principal component. Hence, it may be difficult to specify the path of a factor under a given scenario or understand what a negative shock to the factor is.
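A minimal sketch of the dimension reduction step is shown below, using principal component analysis on an illustrative block of standardized series; the data, variable count, and the 90 percent variance threshold are assumptions made here for illustration, not taken from the paper.

```python
# Minimal sketch: principal component analysis as a dimension reduction step.
# The data below are randomly generated placeholders standing in for a panel of
# macro-financial series; nothing here comes from the paper's dataset.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 18))          # e.g., 100 quarters, 18 primary variables

X = StandardScaler().fit_transform(X_raw)   # PCA is scale-sensitive, so standardize first
pca = PCA().fit(X)

# Cumulative share of variance explained; in applications a handful of components often
# captures most of the variability (e.g., level/slope/curvature for yield curves).
cum_share = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_share >= 0.90) + 1)
print(f"{n_components} components explain at least 90 percent of the variance")

# Reduced-dimension representation that could be used as covariates in a forecasting model.
scores = pca.transform(X)[:, :n_components]
```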

Expert judgment and dimension reduction techniques lead to a smaller number of potential covariates ahead of the model selection stage. This may not be desirable from an economic perspective. Financial and nonfinancial corporations are becoming increasingly global, and domestic and international factors affect their operations, profitability, and solvency. Reducing the number of covariates prior to selecting the forecasting model may not reveal what the drivers of a firm’s performance are. Arguably, it may be preferable to address the curse of dimensionality at the model selection level rather than at the covariate level. Machine learning methods excel in this task.6

IV. Machine Learning: The Interpretability-Flexibility Tradeoff

Machine learning encompasses a number of techniques for identifying patterns and relationships in the data, making it suitable for forecasting and for simplifying the model selection process.7 In the realm of machine learning, forecasting models fall under the category of supervised learning, in which we are interested in finding a rule, e.g. an econometric model, that maps the set of covariates, or inputs, into an output, e.g. the ratio of non-performing loans in an asset category. Linear regression models are just one subset of supervised learning models, and are not discussed here since they are already familiar to economists and analysts involved in stress tests and are thoroughly covered in introductory and advanced econometric textbooks such as Greene (2011) and Wooldridge (2012).

Different machine learning methods are available for forecasting, with increased flexibility offset by increased difficulty in interpreting the results. More flexible methods are better at capturing patterns in the data, but simpler, easier-to-interpret methods are better suited for understanding and communicating results. Among the methods, in decreasing order of interpretability, we have subset selection, Lasso regressions, least squares, generalized additive models, trees, support vector machines, and methods combining different base learning methods such as bagging and boosting.

The bias-variance tradeoff formalizes the tension between interpretability and flexibility. Given a forecasting method or learning algorithm, its expected mean squared error (MSE) can be decomposed into its squared bias, i.e. errors due to erroneous assumptions underlying the method; its variance, i.e. errors due to the sensitivity of the method to noise in the calibrating data set; and the variance of the residual term. The more flexible the method, the lower its bias, since it can better approximate the true relationship in the data. But increased flexibility increases the variance of the method since it attempts to fit not only the true data points but also the unavoidable noise present in the data set.
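In compact form, for a method $\hat{f}$ fitted to data generated by $y = f(x) + \varepsilon$, with $E[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, the expected MSE of the forecast at a point $x_0$ decomposes as

$E\left[ \left( y_0 - \hat{f}(x_0) \right)^2 \right] = \left( E[\hat{f}(x_0)] - f(x_0) \right)^2 + \operatorname{Var}\left( \hat{f}(x_0) \right) + \sigma^2,$

where the first term is the squared bias, the second the variance of the method, and the third the irreducible error.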

Since stress tests are an input for formulating business strategy or guiding policy, the tests’ results and conclusions need to be communicated to different constituencies, including senior decision makers, each with a different grasp of the technical details underlying the assumptions and analytical methods. This situation places a premium on interpretability rather than flexibility, since it is somewhat difficult to build persuasive arguments based on black-box analysis. Hence, whenever possible, linear models are preferred. By using non-linear transformations of the covariates, these models can also capture some non-linearity in the data.

V. Linear Models: Subset Selection and Shrinkage Methods8

Among linear models, linear regression is the most widely used method among economists and financial analysts. In a linear regression, the mean of the dependent variable conditional on the covariates is an affine function of the covariates. Note, however, that linear models could exhibit low bias and high variance. In linear regressions, the fit can be improved by including a large number of covariates. Including all potential covariates minimizes the bias of the model at the expense of higher variance. As a result, the predictive power of the model and its interpretability are negatively affected.

In response, a number of methods can help reduce the number of covariates in linear regression models. The methods fall into two different categories: subset selection methods and shrinkage methods. The next sections describe the different model categories, and put forward arguments for singling out Lasso regression, a shrinkage method, as the preferred choice for stress test purposes.

A. Subset selection methods

Subset selection methods aim at finding the optimal subset of covariates in a linear regression, and include two categories, best subset selection and stepwise selection. The latter is further classified into forward stepwise and backward stepwise selection methods.9 In the case of p possible covariates, including perhaps interaction terms and lagged values of the variables, best subset selection searches for the best combination of covariates among the 2^p possible subsets. An optimality criterion applied to the fit of the model vis-à-vis observed data determines which model is best. Typical criteria include the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the adjusted R². Since the algorithm must search over all of the 2^p potential models, best subset selection performs well only if the number of variables, p, is small. In practice, efficient algorithms work well with as many as 30 to 40 variables.10
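A minimal sketch of the enumeration is given below, using a small illustrative dataset and BIC as the optimality criterion; the data and variable names are hypothetical, and the brute-force loop is only meant to make the 2^p search explicit.

```python
# Minimal sketch of best subset selection with BIC as the optimality criterion.
# `X` and `y` are randomly generated placeholders, not the paper's data.
from itertools import combinations
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(80, 6)), columns=[f"x{j}" for j in range(6)])
y = 2.0 * X["x0"] - 1.5 * X["x3"] + rng.normal(size=80)

best_bic, best_subset = np.inf, None
for k in range(1, X.shape[1] + 1):
    for subset in combinations(X.columns, k):      # all 2^p - 1 non-empty subsets
        fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
        if fit.bic < best_bic:
            best_bic, best_subset = fit.bic, subset

print("Best subset by BIC:", best_subset)
```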

Forward stepwise selection starts with a least squares model with no covariates, adding one variable at a time based on its contribution to improving the model fit, typically measured by the lowest residual sum of squares (RSS) or the highest R². Starting with a model including only the constant term, the first covariate selected for inclusion is the one yielding the lowest RSS among all models in the one-covariate family. After the selection of the first covariate, the method generates the family of two-covariate models that include the first covariate. Within this family, the method selects the model with the lowest RSS, which yields the second covariate. The process continues until obtaining a model with p covariates. Of the p models generated, each with k covariates, k = 1, …, p, the final model selected is the one optimizing a chosen criterion, e.g. AIC, BIC, or adjusted R².11
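A minimal sketch of the forward pass follows, again with hypothetical data: the RSS chooses the next covariate at each step, and BIC picks the final model along the path. The variable names and data-generating process are assumptions for illustration only.

```python
# Minimal sketch of forward stepwise selection: add, at each step, the covariate that most
# reduces the residual sum of squares, then pick the model on the path with the lowest BIC.
# `X` and `y` are illustrative placeholders, not the paper's data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(80, 10)), columns=[f"x{j}" for j in range(10)])
y = 1.0 * X["x1"] + 0.5 * X["x7"] + rng.normal(size=80)

selected, remaining, path = [], list(X.columns), []
while remaining:
    rss = {v: sm.OLS(y, sm.add_constant(X[selected + [v]])).fit().ssr for v in remaining}
    best = min(rss, key=rss.get)                   # candidate yielding the lowest RSS
    selected.append(best)
    remaining.remove(best)
    path.append((list(selected), sm.OLS(y, sm.add_constant(X[selected])).fit().bic))

best_model = min(path, key=lambda item: item[1])   # model on the path with the lowest BIC
print("Forward stepwise choice:", best_model[0])
```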

Backward stepwise selection starts with a least squares model including all covariates, with each subsequent model containing one fewer variable until only one variable is left. The variable selected for deletion is the one that contributes the least to the explanatory power of the model. At the end of the backward elimination process, there is a set of p models, each corresponding to the best performing model in the family of models with one covariate, two covariates, and so on.

As in the case of forward stepwise selection, the final model is the one that performs best among the set of best performing models according to the chosen optimality criterion. Contrary to forward stepwise selection, it is not possible to apply backward stepwise selection when the number of observations is less than the number of covariates: if the matrix of covariates has rank equal to the number of observations, the regression including all covariates fits the data exactly, yielding an RSS of zero, and the starting model cannot be estimated meaningfully.

Computational costs in stepwise selection are lower than in best subset selection since the number of models evaluated is only 1 + p(p+1)/2 instead of 2^p. While the computational savings are substantial, there is no guarantee that stepwise selection will yield the same solution as the best subset selection method. The set of k variables that yields the smallest RSS in a linear regression with k covariates does not necessarily contain the subset of k-1 variables yielding the smallest RSS in a linear regression with k-1 covariates. Thus, forward and backward stepwise selection need not converge to the same model, nor to the model chosen by the best subset selection method. All of these methods, however, could generate reasonable forecasting models that do not need to incorporate all p covariates.

B. Shrinkage methods and Lasso Regression

Shrinkage methods aim to reduce (or shrink) the values of the coefficients toward zero compared with ordinary least squares. The advantage of shrinkage methods is that the estimated models exhibit less variance than least squares estimates. In addition, some shrinkage methods also reduce the number of covariates included in the regression model by yielding coefficient estimates that are exactly zero, facilitating the model selection process.

Two widely used shrinkage methods are ridge regression (Hoerl, 1962) and the Lasso (Least Absolute Shrinkage and Selection Operator) regression (Tibshirani, 1996). Their similarities and differences, as well as those relative to least squares estimation, are apparent from the optimization problems solved by each method:

(1) Least Squares: $\min_{\beta_0, \beta_j} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$,
(2) Ridge Regression: $\min_{\beta_0, \beta_j} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$ subject to $\lVert \beta \rVert_2 \le t$,
(3) Lasso Regression: $\min_{\beta_0, \beta_j} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$ subject to $\lVert \beta \rVert_1 \le t$,

where y is the vector of observations of the dependent variable, x denotes the covariates, β the corresponding coefficients, $\lVert \beta \rVert_1$ and $\lVert \beta \rVert_2$ the L1 and L2 norms of the coefficient vector, respectively, and t a user-specified parameter. The Lagrangian formulations of the ridge and Lasso regressions are, respectively:

(4) Ridge Regression: $\min_{\beta_0, \beta_j} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \lVert \beta \rVert_2$,
(5) Lasso Regression: $\min_{\beta_0, \beta_j} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \lVert \beta \rVert_1$.

The optimization problems presented in equations (2) to (5) are standard quadratic programs with convex constraints, and there are a variety of numerical methods to solve them.

The least squares estimation corresponds to an unconstrained minimization problem; the ridge regression adds a smooth, convex L2 constraint and the Lasso regression a convex but non-smooth L1 constraint. Least squares favors including as many covariates as possible since doing so helps reduce the sum of squares. In ridge regression and Lasso regression, non-zero coefficients carry a penalty, which helps reduce the value of the coefficients (ridge, Lasso) or the number of covariates included in the model (Lasso). Notice also that the use of the L2 penalty in the ridge regression implies that coefficient estimates are not scale-invariant.

Figure 1 illustrates the geometric intuition underlying the differences between least squares and the two shrinkage regressions, with the ellipses representing the contours of the residual sum of squares. The least squares coefficients, which correspond to the solution of the unconstrained optimization problem, are large relative to those produced by the shrinkage methods. The smooth, convex nature of the L2 constraint implies that the ridge regression coefficients, albeit small, are not equal to zero. Hence, similarly to least squares, ridge regression includes all the available covariates and does not yield parsimonious models.

Figure 1. Geometry of least squares, ridge regression, and Lasso regression

In contrast, the presence of corners in the L1 constraint region increases the chances that the ellipses first touch the constraint region at a corner, yielding some coefficients that are exactly zero and reducing the number of covariates included in the forecasting model.12 The Lasso regression, hence, appears well suited for addressing the model selection challenge posed by the forecasting requirements of stress tests. Note that the Lasso regression performs variable selection and parameter estimation simultaneously. Finally, compared with subset selection methods, the Lasso exhibits lower variability and cheaper computational costs, especially for high dimensional problems (Hastie, Tibshirani, and Friedman, 2009).
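A minimal sketch of this contrast is shown below using scikit-learn; the data and the penalty values are illustrative assumptions, chosen only to make the sparsity difference visible, and do not correspond to any calibration in the paper.

```python
# Minimal sketch contrasting ridge and Lasso shrinkage on the same data: with comparable
# penalties, the Lasso sets some coefficients exactly to zero while ridge only shrinks them.
# Data and penalty values are illustrative placeholders.
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n, p = 60, 20
X = StandardScaler().fit_transform(rng.normal(size=(n, p)))
beta_true = np.zeros(p)
beta_true[:3] = [1.5, -1.0, 0.8]                   # only three covariates actually matter
y = X @ beta_true + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=5.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("non-zero ridge coefficients:", np.sum(ridge.coef_ != 0))   # typically all 20
print("non-zero Lasso coefficients:", np.sum(lasso.coef_ != 0))   # typically a handful
```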

C. Lasso Regression, Cross-Validation and Consistency

In Figure 1, the value of the user-specified parameter t is small enough that the unconstrained optimization solution is not contained within the regions bounded by the L1 and L2 constraints. Had this parameter been sufficiently large, the ridge regression and Lasso regression solutions would have coincided with the least squares solution. In a Lasso regression, the value of the parameter controls both the size and the number of coefficients, with higher values of t leading to a greater number of covariates included in the linear model. This increases the flexibility of the model and reduces its bias, but at the cost of higher variance.

Cross-validation, a resampling technique, helps find a parameter value that ensures a proper balance between bias and variance (or flexibility and interpretability). Cross-validation selects the best parameter value as the one that minimizes the estimated test error rate of the estimator, in this case the Lasso regression. In the absence of a test set, it is not possible to calculate the test error – the average error from using a method to predict the response of a new observation. In cross-validation, a subset of the data observations, the training set, is used to estimate (or train) the model, and the remaining observations are held out to serve as a test set or validation set. The held-out test sets provide an estimate of the test error rate. Typically, the measure of the test error is the mean squared error (MSE).

The K-fold cross-validation method divides the data set randomly into K different subsets, typically five or ten in practical applications. Keeping one of the subsets as the validation set, the model is trained (estimated) over the remaining K-1 sets for a range of values of the parameter t (or λ, if the Lagrangian formulation is used). We repeat this process using each of the K subsets as a validation set, yielding K estimates of the MSE for each parameter value, and its K-fold estimate is simply the average value of the K estimates.

The best parameter value is the one yielding the lowest K-fold estimate, which we denote as λ-min in the Lagrangian formulation. It is also typical to report results for the most parsimonious specification whose K-fold estimate does not exceed the minimum K-fold estimate by more than one standard error; this specification uses a lower number of covariates and its parameter is the one-standard-error rule parameter, λ-1se. The Leave-One-Out cross-validation method is a special case of the K-fold cross-validation method: a sample containing N observations is partitioned into exactly N subsets, which is equivalent to N-fold cross-validation.
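A minimal sketch of the λ-min and λ-1se choices under 10-fold cross-validation follows; it mirrors, in Python, what glmnet's cv.glmnet reports, but the data, the penalty grid, and the fold count are illustrative assumptions.

```python
# Minimal sketch of K-fold cross-validation over a Lasso penalty grid, recovering the
# lambda-min and lambda-1se choices described above. Data and grid are placeholders.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(4)
n, p = 100, 30
X = rng.normal(size=(n, p))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

lambdas = np.logspace(-3, 0, 50)
kf = KFold(n_splits=10, shuffle=True, random_state=0)
cv_mse = np.empty((len(lambdas), kf.get_n_splits()))

for i, lam in enumerate(lambdas):
    for j, (train, test) in enumerate(kf.split(X)):
        fit = Lasso(alpha=lam, max_iter=10000).fit(X[train], y[train])
        cv_mse[i, j] = np.mean((y[test] - fit.predict(X[test])) ** 2)

mean_mse = cv_mse.mean(axis=1)
se_mse = cv_mse.std(axis=1, ddof=1) / np.sqrt(kf.get_n_splits())

i_min = int(np.argmin(mean_mse))                        # lambda-min
within_1se = mean_mse <= mean_mse[i_min] + se_mse[i_min]
i_1se = int(np.max(np.where(within_1se)))               # largest lambda within one standard error
print("lambda-min:", lambdas[i_min], "lambda-1se:", lambdas[i_1se])
```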

Since the Lasso biases the coefficients towards zero, the estimates are not consistent. There are ways to address this issue using two-step estimation procedures. In the first step, the Lasso regression selects the covariates in the model. In the second step, only the selected covariates are included in a linear model estimated either by ordinary least squares or by applying the Lasso again. The latter method is known as the relaxed Lasso (Meinshausen, 2007), which will be used in the stress test application below. The next section describes some recent applications in finance, economics, and financial networks.
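A minimal sketch of the two-step idea is shown below, with hypothetical data; the second step here re-estimates by ordinary least squares on the selected covariates, and re-applying the Lasso in that step instead would give the relaxed Lasso of Meinshausen (2007).

```python
# Minimal sketch of the two-step procedure: the Lasso selects the covariates, then the
# coefficients are re-estimated on the selected covariates only (here by OLS for brevity).
# Data are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(5)
n, p = 100, 40
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - X[:, 5] + rng.normal(size=n)

step1 = LassoCV(cv=10, max_iter=10000).fit(X, y)       # selection step
selected = np.flatnonzero(step1.coef_)                 # indices of non-zero coefficients

step2 = LinearRegression().fit(X[:, selected], y)      # re-estimation on the selected set
print("selected covariates:", selected)
print("second-step coefficients:", np.round(step2.coef_, 2))
```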

VI. Lasso Applications in Finance, Economics, and Financial Networks

Lasso methods, by regularizing least squares estimates to satisfy an L1 constraint, are very useful for constructing sparse models in high-dimensional data environments. Therefore, there is increasing use of these methods in financial and economic forecasting.13

In finance, one active area of research concerns the estimation of stable variance covariance matrices for asset returns, which are necessary inputs for portfolio optimization and asset allocation models. In portfolios including a large number of assets or securities, the returns exhibit strong collinearity, making the sample estimate of the variance covariance matrix and its inverse, the information matrix, highly sensitive to noise and data outliers. Portfolio weights, therefore, may exhibit large changes even if returns are slightly perturbed (Kan and Zhou, 2007), a problem identified long ago (Jorion, 1992; Broadie, 1993).

Lasso constraints can serve to estimate sparse covariance matrices, eliminating the contribution of some of the highly collinear variables, or can be imposed directly on the portfolio weights. For instance, Brodie, Daubechies, De Mol, Giannone, and Loris (2008), Fan, Zhang, and Yu (2009), and De Miguel et al (201) require that the sum of the portfolio weights satisfy an L1 constraint. Moreover, De Miguel et al find that the L1-constrained problem is equivalent to the constrained minimum variance portfolio solution of Jagannathan and Ma (2003). More generally, Scherer (2007) shows that the optimal portfolio in a Markowitz framework is the solution to a linear regression, which is amenable to Lasso estimation. Finally, Bruder, Richard, and Roncalli (2013) explain how the asset selection process benefits from regularization.

Economic applications have centered mainly on the modeling of multivariate time series. Vector autoregressions (VARs) are not well suited for high dimensional problems. To lessen the dimensionality problem, economists have used dynamic factor models (Geweke, 1977; Stock and Watson, 2002) and factor-augmented VARs (Bernanke et al, 2005). Lasso methods, however, seem to perform as well as these other models while generating sparser models and bypassing the difficulties of factor interpretation. Results by De Mol, Giannone, and Reichlin (2008) suggest that Lasso regressions tend to perform as well as principal components regression when the variables are highly collinear, a typical situation in empirical macroeconomics. Li and Chen (2014) provide evidence that Lasso models tend to outperform factor models.

Straightforward applications of Lasso regressions neglect time dependence in endogenous time series models, an omission that affects the mean squared error bounds of the estimator. By using the Lasso and the group Lasso, Song and Bickel (2010) are able to accommodate time dependence in large VARs. Gefang (2014) exploits time dependence to develop a Bayesian doubly adaptive elastic-net Lasso which allows for grouping effects determined by the data. Kock (2012), Callot and Kock (2014), and Kock and Callot (2015) show that the adaptive Lasso estimates parameters consistently, selects the correct sparsity pattern, and is asymptotically efficient in vector autoregressions.

More recently, Lasso methods have been used to construct and estimate financial networks.14 In a number of financial networks, the interconnectedness between two financial institutions, or two nodes in the network, is measured either by the correlation between two risk measures, e.g. equity return correlation, or by the risk contribution of one institution to the risk of another. In these networks, the systemic risk of a financial institution is set proportional to a measure of centrality, e.g. the number of connections to other institutions in the system or the number of possible paths from one institution to another that include the institution, among others.

Again, the curse of dimensionality poses problems. When the analysis includes a large number of institutions, estimating correlations requires as an input a stable variance-covariance matrix. Furthermore, correlation networks may show too much interconnectedness, as the correlation matrix is quite dense. As in the case of portfolio optimization, Lasso methods are able to generate sparse correlation matrices. For example, Chan-Lau, Chuang, Duan, and Sun (2015) use economic priors and Lasso restrictions to produce relatively sparse but fully connected financial networks, where interconnectedness in the network arises from forward-looking default correlations. Upon constructing the network, assessing the systemic risk of individual firms is a straightforward exercise.

In other instances, the risk contribution of one institution is an output from an econometric model that incorporates a large number of covariates, which again raises the curse of dimensionality problem. For example, in the global banking network described in Demirer, Diebold, Liu, and Yilmaz (2015), the value of the edge connecting one bank to another is proportional to its contribution to the variance decomposition of the volatility of the equity returns of the latter. With hundreds of banks in the network, a Lasso reduced-dimensional VAR helps reduce the dimensionality of the problem, making the variance decomposition feasible.

Before moving to the next section, which applies Lasso regressions to forecast probabilities of default, it is worth noting one important caveat raised by Hansen (2013). The prediction advantages of Lasso-based models seem to vanish when the number of covariates is small relative to the sample size. As the earlier discussion has emphasized, however, this is not the situation typically encountered in a stress test.

VII. A Stress Test Application: Forecasting Probabilities of Default

This section puts into practice the discussion above, using Lasso regressions to construct forecasting models for the median one-year probabilities of default (PD) in ten different industrial sectors in an advanced emerging market economy.

The sectors included are basic materials, communications, consumer cyclicals, consumer non-cyclicals, diversified industries, energy, financials, industrials, technology, and utilities. The monthly median PD series come from the CRI database maintained by the Risk Management Institute, National University of Singapore (Duan and Van Laere, 2012), accessed on April 30, 2014. The data covers the period December 1990 – February 2014.15

There are thirteen domestic primary variables including the exchange rate vis-à-vis the U.S. dollar, the nominal effective exchange rate, the domestic policy rate, the consumer price index, the real GDP growth rate, the unemployment rate, the total amount of credit in the economy, the money market rate, the 3-month Treasury bill rate, the bank deposit rate, the bank lending rate, and the 10-year Treasury bond rate. The domestic variables are complemented by five international variables: the U.S. real GDP, the China real GDP, the U.S. policy rate, the U.S. consumer price index, and a commodity price index. The data, collected at a quarterly frequency, covers the period 1990 Q1 – 2013 Q4.

We fitted the equation below for each sectoral median PD using both the Lasso and the relaxed Lasso with 10-fold cross-validation:

(6) $\log\left( \frac{PD_{i,t}}{1 - PD_{i,t}} \right) = \alpha_i + \sum_{k=1}^{p} \sum_{\ell=0}^{4} \beta_{k,\ell} X_{k,t-\ell} + \varepsilon_{i,t}, \quad i = \text{sector 1 to 10},$

where α_i is the constant term, X_{k,t-ℓ} is the value of covariate k lagged ℓ periods, β_{k,ℓ} is the corresponding coefficient, and ε_{i,t} is the error term. Because the Lasso constraint applies to the norm of the coefficients, it is necessary to standardize the covariates; otherwise, covariates would be penalized differently simply because of their measurement units. The standardization consists of calculating each covariate’s Z-score, i.e. centering the covariate on its mean and dividing by its standard deviation.

In this application the number of potential covariates, p, is equal to ninety excluding the intercept, and the number of observations, n, is at most one hundred, a situation well suited for Lasso estimation. With p~n, ordinary least squares estimates would exhibit high sensitivity to outliers (Hastie, Tibshirani, and Friedman, 2008). Pre-selecting the variables and choosing the number of lags would require a detailed knowledge of the drivers of default in different sectors. Since the median PD is the variable of interest, expert judgment may require understanding the differences in PD dynamics of individual firms within a sector.

The Lasso estimation includes all ninety covariates while the relaxed Lasso only includes the covariates with non-zero coefficients in the Lasso λ-min specification. This two-step procedure ensures the relaxed Lasso yields at most the same number of non-zero coefficients as the Lasso, and likely fewer for the λ-1se specification. The estimation used the R implementation of the coordinate descent algorithm of Friedman, Hastie, and Tibshirani (2010).
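A minimal sketch of the estimation behind equation (6) for a single sector is shown below, written in Python rather than the R glmnet implementation used for the paper; all series, variable names, and the data-generating assumptions are hypothetical placeholders, and the second step re-estimates by OLS on the λ-min support (re-applying the Lasso in that step would give the relaxed Lasso proper).

```python
# Minimal sketch: build lags 0-4 of each primary variable, take the logit of the median PD,
# standardize the covariates, select the lambda-min Lasso model by 10-fold cross-validation,
# and re-estimate on the selected covariates. Everything below is a hypothetical placeholder.
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.preprocessing import StandardScaler

def build_design(macro: pd.DataFrame, max_lag: int = 4) -> pd.DataFrame:
    """Stack contemporaneous values and lags 1..max_lag of every primary variable."""
    lagged = {f"{col}_L{lag}": macro[col].shift(lag)
              for col in macro.columns for lag in range(max_lag + 1)}
    return pd.DataFrame(lagged, index=macro.index)

# Hypothetical quarterly inputs: 18 primary variables and one sectoral median PD series.
rng = np.random.default_rng(6)
idx = pd.period_range("1990Q1", "2013Q4", freq="Q")
macro = pd.DataFrame(rng.normal(size=(len(idx), 18)), index=idx,
                     columns=[f"var{k}" for k in range(18)])
logit_pd = -4.0 + 0.8 * macro["var0"] - 0.5 * macro["var3"] + rng.normal(0.0, 0.3, len(idx))
pd_series = 1.0 / (1.0 + np.exp(-logit_pd))          # median PD bounded between 0 and 1

X = build_design(macro).dropna()                     # 18 variables x 5 lags = 90 covariates
y = np.log(pd_series / (1.0 - pd_series)).loc[X.index]   # logit transform, as in equation (6)

Xz = StandardScaler().fit_transform(X)               # Z-scores, so the penalty is unit-free
lasso = LassoCV(cv=10, max_iter=20000).fit(Xz, y)    # lambda-min by 10-fold cross-validation

support = np.flatnonzero(lasso.coef_)                # covariates retained at lambda-min
second_step = LinearRegression().fit(Xz[:, support], y)   # re-estimation on the support
print("covariates retained at lambda-min:", list(X.columns[support]))
```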

Figure 2, and Tables 1 and 2 illustrate the differences between the Lasso and relaxed Lasso estimations. Figure 2 shows the MSEs, a measure of the models’ out-of-sample forecasting performance, for different values of the λ parameter, expressed in natural logarithms on the lower horizontal axis, while the upper horizontal axis reports the corresponding number of non-zero coefficients identified by the Lasso and relaxed Lasso methods. The leftmost vertical dashed line in each panel corresponds to the λ-min parameter, i.e. the parameter that yields the minimum MSE, while the rightmost vertical dashed line corresponds to the λ-1se parameter, i.e. the parameter that yields an MSE no more than one standard error above the minimum MSE while using a lower number of covariates.

Figure 2. Lasso and relaxed Lasso, mean squared errors (MSEs)

Lasso regressions, left side panels; relaxed Lasso regressions, right side panels. In each panel, the mean squared error (red line), bounded by +/- 1 standard deviation lines, is plotted on the vertical axis; the upper horizontal axis indicates the number of non-zero coefficients associated with a given value of log(λ). The leftmost dashed vertical line corresponds to the MSE of λ-min; the rightmost dashed vertical line to the MSE of λ-1se.

Source: Author’s calculations.
Table 1. Lasso and relaxed Lasso, coefficient estimates, λ-min specification

Source: Author’s calculations.
Table 2. Lasso and relaxed Lasso, coefficient estimates, λ-1se specification

Source: Author’s calculations.