Early-Warning Systems for Crisis Risk

by

Weining Xin

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Economics)

August 2020

Copyright 2020 Weining Xin

Epigraph

[Epigraph in Chinese; the quotation and its source are illegible in this transcription.]

Acknowledgements

I am extremely indebted to Romain Ranciere for his guidance and support. This dissertation would not have been possible without his patient guidance and unconditional support at every step of the way. I am very grateful to my dissertation committee members, Joshua Aizenman, Caroline Betts, and Xin Tong, for the time and dedication they invested in my dissertation. I am truly thankful to Suman Basu and Roberto Perrelli for being coauthors, mentors, and friends. I could not have made it so far without my friends, especially Jisu Cao, Zhibo Li, and Zhouqian Zhong. This dissertation benefited greatly from discussions with colleagues and seminar participants at USC. Finally, I thank my family for their unwavering support of my pursuit of the doctoral degree.

Table of Contents

Epigraph
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: International Reserves, Risk Tolerance, and Crisis Risk
  1.1 Introduction
  1.2 A Two-Stage Framework
    1.2.1 Second-Stage Problem for Reserves Accumulation
    1.2.2 First-Stage Problem for Risk Estimation
  1.3 Implementation Methodology
    1.3.1 Neyman-Pearson Paradigm
    1.3.2 A Good Fit for Crisis Risk Estimation
  1.4 Model Estimation
    1.4.1 Crisis Definition
    1.4.2 Explanatory Indicators
    1.4.3 Model Choice
    1.4.4 Crisis Probability Calibration
  1.5 Empirical Results
    1.5.1 Backtesting Performance
    1.5.2 Benchmark Calibration
    1.5.3 Time-Varying Risk Tolerance of Policymakers
    1.5.4 Counterfactual Analysis
  1.6 Conclusion

Chapter 2: External Crisis Prediction Using Machine Learning: Evidence from Three Decades of Crises Around the World
  2.1 Introduction
  2.2 Applying Machine Learning to Crisis Prediction
    2.2.1 Signal Extraction and Regression-based Models
    2.2.2 Machine Learning Techniques
    2.2.3 Potential Limitations in Applying Machine Learning to Macro Data
    2.2.4 Machine Learning Applications in the Early-Warning Literature
  2.3 Crisis Definitions
    2.3.1 Sudden Stops
    2.3.2 Exchange Market Pressure Events
  2.4 Explanatory Indicators of External Crises
  2.5 Model Design
    2.5.1 Hyperparameter Tuning
    2.5.2 Testing and Tuning Sets
    2.5.3 Crisis Probabilities
  2.6 Empirical Results
    2.6.1 The Horse Race
    2.6.2 Signal Extraction versus RUSBoost for SSGIs in EMs
    2.6.3 Signal Extraction versus RUSBoost for EMPEs in EMs
    2.6.4 EMPEs in Advanced Economies and Low-Income Countries
    2.6.5 Event Probabilities
  2.7 Conclusion

Chapter 3: Performance Uncertainty and Ranking Significance of Early-Warning Models
  3.1 Introduction
  3.2 Small Macro Data
  3.3 Data and Models
    3.3.1 Crisis Events
    3.3.2 Explanatory Indicators
    3.3.3 Model Choice
  3.4 Model Performance Uncertainty
    3.4.1 Jackknife Resampling
    3.4.2 Model Estimation and Testing
    3.4.3 Confidence Intervals
      3.4.3.1 Confidence Intervals of Individual Model Performance
      3.4.3.2 Confidence Intervals of Conditional Performance Difference
  3.5 Empirical Results
    3.5.1 Fixed Cutoff Testing
    3.5.2 Rolling Cutoff Testing
  3.6 Conclusion

Reference List
Appendix A: Appendix to Chapter 1
Appendix B: Appendix to Chapter 2

List of Tables

1.1 Different Specifications of Loss Functions
1.2 Combinations of Binary Predicted Flags and True Realizations
1.3 Years and Countries of Sudden Stops with Growth Impacts
1.4 List of Explanatory Indicators
1.5 Comparison Between the Literature Paradigm and the New Paradigm
1.6 Rolling Backtesting Performance of Models
1.7 Benchmark Calibration Parameters
1.8 Counterfactual Results for Asian Financial Crisis and Global Financial Crisis
2.1 Sudden Stops with Growth Impacts (SSGIs) in Emerging Markets
2.2 List of Explanatory Indicators
3.1 Episodes of Sudden Stops with Growth Impacts (SSGIs)
3.2 List of Explanatory Indicators
3.3 Confidence Interval Results on Unconditional Performance in Fixed Cutoff Testing
3.4 Confidence Interval Results on Conditional Performance Difference in Fixed Cutoff Testing
3.5 Confidence Interval Results of Rolling Cutoff Testing
A.1 Welfare Cost Matrix
B.1 Average Sum of Errors in Rolling Cutoff Testing
B.2 Country Sample

List of Figures

1.1 Frequency of Sudden Stops with Growth Impacts
1.2 Average Scores and Probabilities in Emerging Market Countries During 2008-2017
1.3 Revealed Risk Tolerance of Policymakers
1.4 Observed and Optimal Level of Reserves as a Share of GDP, 1993-2017
2.1 Frequency of Sudden Stops and SSGIs in Emerging Markets, 1990-2017
2.2 Frequency of Exchange Market Pressure Events (EMPEs) by Country Group, 1990-2017
2.3 Threshold Calculation in Signal Extraction Approach
2.4 Mapping from Composite Score to Crisis Probability
2.5 Out-of-Sample Performance
2.6 Top-15 Important Variables for SSGIs in EMs
2.7 Examples of Non-Monotonic Effect on SSGI Probability Implied by RUSBoost
2.8 Top-15 Important Variables for EMPEs in EMs
2.9 Examples of Non-Monotonic Effect on EMPE Probability Implied by RUSBoost
2.10 Examples of Interaction Effect on EMPE Probability Implied by RUSBoost
2.11 Top-15 Important Variables for EMPEs in AEs and LICs
2.12 Average Probabilities for SSGIs and EMPEs in Different Country Groups
3.1 Frequency of Sudden Stops with Growth Impacts (SSGIs)
3.2 Confidence Intervals of Performance in Fixed Cutoff Testing
3.3 Confidence Intervals of Performance Difference in Fixed Cutoff Testing
3.4 Confidence Intervals of Conditional Performance Difference in Rolling Cutoff Testing
Abstract

This dissertation focuses on crisis forecasting and risk assessment, studying how to model a policy-consistent risk assessment framework and how to incorporate advanced techniques into crisis forecasting.

The first chapter investigates the mutual interaction between crisis risk estimation and crisis prevention policy in the context of sudden stops. Proposing a two-stage framework embedding an early-warning problem into a policy-making problem, this chapter conducts a welfare analysis of crisis risk estimation based on subsequent policy responses. Building upon this two-stage framework, this chapter shows that there is welfare cost asymmetry between two types of errors: the welfare-maximizing weight on the percentage of missed crises must be greater than that on the percentage of false alarms. Introducing a constrained optimization under the Neyman-Pearson paradigm to solve this error-asymmetry problem reduces overall welfare loss by more than 20 percent. Bringing this welfare-based crisis risk estimation model to emerging market countries' data, this chapter uncovers time-varying risk tolerance of policymakers and explores its policy implications for identifying leading indicators and determining levels of reserves through counterfactual analysis.

The second chapter evaluates the performance of the signal extraction approach and machine learning techniques for the prediction of external crises, generating crisis lists for two types of external crises, sudden stops with growth impacts and exchange market pressure events, for 159 countries over 27 years. Bearing in mind the potentially sharp divergence between in-sample and out-of-sample performance, this chapter designs a rigorous testing procedure attuned to the temporal dependence of macro data and the manner in which the models would be used in practice. The horse race results show that sudden stops with growth impacts are well predicted by the signal extraction approach, while exchange market pressure events in the same set of emerging markets, which are more heterogeneous, are well predicted by machine learning techniques. This chapter also sheds light on variable importance rankings and on some of the important non-monotonicities and interactions that machine learning uncovers from the historical data.

The third chapter aims to assess model performance uncertainty and test model ranking significance when conducting a horse race and selecting the best model to use, bearing in mind the small-data nature of macroeconomic data in the early-warning framework. To assess model performance uncertainty, this chapter explores three sources of data variation in the early-warning framework and proposes three types of jackknifing methods to construct confidence intervals of model performance. Additionally, this chapter proposes to construct confidence intervals of the conditional performance difference and performs hypothesis testing on the conditional performance difference to test model ranking significance. The approaches are illustrated in an example of predicting sudden stops in capital flows for emerging market countries. Results show that the degree of model performance uncertainty depends on the structure of the model and the source of data variation.
Also, our approach to constructing confidence intervals of the conditional performance difference presents evidence of model ranking significance that is not revealed by simply comparing confidence intervals of individual model performance.

Chapter 1: International Reserves, Risk Tolerance, and Crisis Risk

1.1 Introduction

Crisis risk is a critical element of the economic outlook for countries, regions, and the entire world, as crises have been so disruptive that most policymakers, if not all, want to monitor the buildup of crisis risk and the need for policy actions. Take sudden stops as an example: the large economic and social costs for emerging market countries were learned the hard way in historical crisis waves, from the international financial crises of the 1990s to the recent global financial crisis. As an effort to systematically assess risks and predict crises, a set of early-warning models has been developed since the seminal work of Kaminsky et al. (1998) and has gained increasing policy interest in recent years, after the global financial crisis.[1]

[1] Relevant papers include Catão and Milesi-Ferretti (2014), Sevim et al. (2014), Xu et al. (2018), and Basu et al. (2019) for external crises; Borio and Drehmann (2009), Alessi and Detken (2011), Lo Duca and Peltonen (2013), Behn et al. (2013), Betz et al. (2014), Holopainen and Sarlin (2017), Basu et al. (2017), Alessi and Detken (2018), and Lang et al. (2018) for banking crises; and Manasse and Roubini (2009), Savona and Vezzoli (2015), and Basu et al. (2017) for sovereign crises.

However, a probability or a signal of crisis is not the end of policy efforts: the ultimate objective of policymakers is to take policy actions in order to prevent crises from materializing or to mitigate the welfare costs of crises. The crisis risk estimate plays an important role in informing crisis prevention policy, in the sense that it not only guides the activation but also determines the magnitude of such policies. Hence, as an input into crisis prevention policy-making, the crisis risk estimate has direct policy implications and indirect welfare effects. However, despite many efforts in the literature on early-warning models, there has been little work trying to estimate crisis risk while accounting for its subsequent policy implications and welfare effects.

This paper aims to bridge the gap between the literature on crisis risk estimation and the literature on crisis prevention policy. Essentially, these are two connected problems that have been studied separately in the literature. Firstly, in the literature on crisis risk estimation, early-warning models are estimated by considering the trade-offs between two types of errors: the percentage of false alarms and the percentage of missed crises.[2] Following the pioneering work of Kaminsky et al. (1998), the loss function is defined as the sum of these two errors with the same weights. However, the model-free setting in most of this work does not guarantee that the equal weights are specified correctly, because the subsequent policy actions that are enacted to prevent or mitigate crises are not considered. Specifically, subsequent policy actions may incur extra welfare costs in the case of issuing a false alarm or mitigate existing welfare costs in the case of missing a crisis. Hence, in order to study the trade-offs between these two errors and correctly define the loss function, subsequent policy decisions and welfare effects should be taken into consideration.
[2] The percentage of false alarms is equal to the percentage of non-crisis observations that are incorrectly flagged as crises, while the percentage of missed crises is equal to the percentage of crisis observations that are not flagged as crises.

Secondly, in the literature on crisis prevention policy, most of the theoretical work assumes full knowledge of the crisis risk and studies the optimal policy response based on the true crisis risk.[3] However, in a more realistic setting, crisis risk should be estimated first, and policy decisions should be made based on the estimated crisis risk instead of the true crisis risk. This implies that crisis risk estimation should be modeled as part of the policy-making process. Also, the risk preference of policymakers should be considered in understanding historical policy responses to crisis risks.

[3] For example, Aizenman and Lee (2007) and Jeanne and Ranciere (2011) in the literature on sudden stops and the optimal level of reserves.

To bridge this gap, this paper investigates the mutual interaction between solving an early-warning problem for a crisis risk estimate and solving a policy-making problem for the optimal prevention policy. I ask and wish to answer three questions. First, what are the direct policy implications and indirect welfare effects of crisis risk estimation? Second, what are the welfare-maximizing weights on the percentage of false alarms and missed crises? And third, bringing the model to the data, what can be inferred about the risk preference of policymakers?

To answer these questions, I focus on sudden stop risk estimation and international reserves accumulation as the prevention policy in emerging market countries.[4] Following Jeanne and Ranciere (2011), I model international reserves as an instrument chosen by the government to insure against sudden stops. As a social planner, the government accumulates international reserves in order to maximize domestic welfare, which is the expected consumers' utility across the non-crisis state and the sudden stop state. In this insurance model setting, the optimal level of international reserves to accumulate depends on the sudden stop risk, which is however not known to the government at the time of policy-making.[5] Hence, the government needs to estimate the sudden stop risk first and then determine the level of international reserves based on the sudden stop risk estimate, which implies a sequential nature of the policy-making process for the government.

[4] There is extensive theoretical and empirical work that argues for the self-insurance role of international reserves against capital flow volatility in emerging market countries (Aizenman and Marion, 2003; Stiglitz, 2007; Jeanne and Ranciere, 2011; Hur and Kondo, 2016; Bianchi et al., 2018).

[5] As well as other parameters, which can be calibrated with reference to historical data.

Motivated by the sequential nature of sudden stop risk estimation and the reserves accumulation decision, I propose a two-stage framework embedding an early-warning problem into the policy-making problem. In the first stage, an early-warning problem is solved for the sudden stop risk estimate, and in the second stage, the policy-making problem mentioned above is solved for the optimal level of international reserves to accumulate. The early-warning problem and the policy-making problem are connected by the sudden stop risk estimate, which is the output of the first-stage problem and the input into the second-stage problem.
With the help of the second-stage insurance model, the determinacy of the trade-offs between the percentage of false alarms and missed crises can be derived from subsequent reserves accumulation decisions and the resulting welfare effects.

In a welfare analysis of sudden stop risk estimation based on the subsequent reserves accumulation decision, I show in Proposition 1 that imperfect sudden stop risk estimation will lead to a suboptimal level of reserves being accumulated and cause welfare loss. Specifically, if sudden stop risk is underestimated, then a lower (than optimal) level of reserves will be accumulated, which will cause welfare loss in sudden stops. However, if sudden stop risk is overestimated, then a higher (than optimal) level of reserves will be accumulated, which will cause welfare loss in normal periods. Therefore, in order to warrant the consistency of sudden stop risk estimation with the welfare-maximizing criterion, the weights on the two types of errors should be defined as their respective welfare costs.

By modeling the early-warning problem and the policy-making problem in sequence, I am able to use backward induction to pin down the welfare-based weights on the percentage of false alarms and missed crises. I show structurally in Proposition 2 that there is welfare cost asymmetry between the two types of errors: the welfare-maximizing weight on the percentage of missed crises must be greater than that on the percentage of false alarms.[6] This asymmetry comes from the curvature of consumers' utility function: as long as consumers are risk averse, the welfare cost asymmetry holds. My finding is consistent with the asymmetric welfare cost of consumption uncertainty implied by Barro (2009) in the context of macroeconomic disasters. More importantly, my result shows that the equal weights used in the literature are not consistent with the welfare-maximizing criterion.

[6] Following the common empirical practice in the literature on early-warning models of transferring crisis probabilities into binary signals.

To solve this error-asymmetry problem implied by the welfare-maximizing criterion, I introduce a new implementation method using constrained optimization under the Neyman-Pearson classification paradigm.[7] Developed to address the issue of error asymmetry, the Neyman-Pearson paradigm forms a constrained optimization to minimize one type of error while maintaining the prioritized type of error below a certain level. In this paper, given the welfare cost asymmetry, I implement the constrained optimization to estimate sudden stop risk by minimizing the percentage of false alarms while maintaining the percentage of missed crises below a certain level, which I hereafter refer to as the new paradigm. I show that this new paradigm is a statistically and practically plausible way to solve the welfare cost asymmetry problem, given its quantifiability, interpretability, and robustness.

[7] See Cannon et al. (2002), Scott and Nowak (2005), and Rigollet and Tong (2011).

I then bring this welfare-based crisis risk estimation model to emerging market countries' data and show its performance in predicting sudden stops under the new paradigm. Backtesting results from a signal-extraction model show that the new paradigm[8] delivers better performance with respect to the welfare-maximizing criterion.
In particular, compared with the equally-weighted average approach used in the literature, the new paradigm reduces the out-of-sample sum of errors by more than 20 percent, which is equivalent to reducing overall welfare loss by more than 20 percent.[9] More importantly, as crises are rare at the individual country level, a tighter control on the percentage of missed crises can significantly reduce welfare loss for individual countries by signaling the few or even single crises correctly.

[8] With a preferred choice of the upper bound below which the percentage of missed crises is maintained.

[9] The sum of errors is the unweighted sum of the percentage of false alarms and missed crises.

In addition, as an extension of the benchmark analysis, I explore to what extent embedding this welfare-based crisis risk estimation model into the insurance model can explain the buildup of international reserves in emerging market countries. The insurance model by itself finds it difficult to account for the trend and magnitude, especially in emerging market Asia, without assuming a very high level of risk aversion of households (Jeanne and Ranciere, 2011). To reconcile the divergence between the model-implied and actual levels of reserves, I model the upper bound on the percentage of missed crises in the crisis risk estimation as a measure of policymakers' risk tolerance,[10] and allow it to be time-varying.[11] The lower the upper bound on the percentage of missed crises, the lower the risk tolerance of policymakers, and the less tolerant of risk the policymakers. I then calibrate the upper bound by minimizing the difference between the model-implied and actual levels of reserves. The calibration operates through the two-stage channel: different upper bounds on the percentage of missed crises yield different sudden stop risk estimates, which in turn lead to different levels of reserves being accumulated.

Calibration results show that (i) plausible values of the upper bound, combined with the benchmark calibration of the insurance model, match the trend and magnitude of international reserves holdings in emerging market countries, and (ii) the risk tolerance of policymakers is time-varying: it peaks before the two historical waves of sudden stops, the Asian financial crisis and the global financial crisis, and then decreases sharply after the crises. The pre-crisis high risk tolerance is in line with the failure to prevent historical sudden stops: when policymakers were very tolerant of sudden stop risk, they would not accumulate sufficient reserves and thereby failed to prevent the materialization of sudden stops. The post-crisis decreasing risk tolerance provides a novel explanation of the rapid reserves accumulation in emerging market countries following the Asian financial crisis, from the perspective of risk perception and uncertainty estimation. After the danger of sudden stops was learned the hard way from the Asian financial crisis, policymakers became less risk tolerant and thereby imposed tighter control on the percentage of missed crises when estimating future sudden stop risk. As a result, higher crisis risk estimates were generated and more reserves were accumulated.
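For concreteness, the two error rates and the sum-of-errors metric (footnote 9) have simple empirical counterparts. The following is a minimal sketch with toy data; the function name and sample values are illustrative, not part of the chapter's estimation code:

    import numpy as np

    def error_rates(y_true, y_flag):
        """Percentage of false alarms and missed crises from binary flags.

        y_true: 1 for a realized sudden stop, 0 for a normal period.
        y_flag: 1 if the model flags a crisis, 0 otherwise.
        """
        y_true = np.asarray(y_true)
        y_flag = np.asarray(y_flag)
        # Share of non-crisis observations incorrectly flagged as crises.
        false_alarms = y_flag[y_true == 0].mean()
        # Share of crisis observations the model fails to flag.
        missed_crises = 1.0 - y_flag[y_true == 1].mean()
        return false_alarms, missed_crises

    y_true = [0, 0, 0, 0, 1, 1, 0, 1]
    y_flag = [0, 1, 0, 0, 1, 0, 0, 1]
    fa, mc = error_rates(y_true, y_flag)
    print(fa, mc, fa + mc)  # the last value is the (unweighted) sum of errors

The sum of errors is the quantity that the backtesting exercise reports; the new paradigm lowers its out-of-sample value by constraining the missed-crisis component.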
[10] Instead of following studies in the literature (Kimball et al., 2008) that assume a utility function with constant relative risk aversion and model risk tolerance as the inverse of relative risk aversion, I choose to use the upper bound on the percentage of missed crises as a cardinal proxy for risk tolerance. The reason is that, without further assumptions on the distribution of sudden stop risk, the relative magnitude of the welfare loss associated with the percentage of missed crises and false alarms cannot be calculated. Hence, it is impossible to associate the upper bound choice with underlying risk tolerance through a parametric utility function.

[11] The time-varying nature of policymakers' risk preference can be explained by elections, political cycles, or simply the fact that the individual decision-making process is more volatile.

These results imply that the buildup of international reserves in emerging market countries is optimal in the sense of providing self-insurance, given the decreasing risk tolerance of the policymakers.

Additionally, I conduct counterfactual analysis to explore the policy implications of a lower risk tolerance of policymakers before the Asian financial crisis and the global financial crisis. The results show that if policymakers had preferred a lower degree of risk tolerance, they would have aimed at accumulating twice as many reserves as they did before the two waves of crises. More importantly, the results establish that when policymakers have high risk tolerance, they may overlook leading indicators in which high crisis risk is rooted, such as excessive external borrowing before the Asian financial crisis and the change in global financing conditions before the global financial crisis. Hence, they may not be able to identify the most vulnerable sectors and, as a consequence, fail to enact appropriate policies to prevent the crises.

To my knowledge, this paper is the first to emphasize the policy implications and welfare effects of crisis risk estimation and to structurally show the large degree of welfare cost asymmetry between missing a crisis and issuing a false alarm. This paper is also the first to develop a policy-consistent risk assessment framework and to bring in the Neyman-Pearson paradigm in order to warrant the consistency of crisis risk estimation with crisis prevention policy decisions. Additionally, this paper sheds light on the importance of modeling the risk tolerance of policymakers (i.e., the public sector) on top of the risk aversion of households (i.e., the private sector) in understanding their policy responses to crisis risks.

This paper contributes to three strands of the literature. First, it complements the literature on the welfare analysis of international reserves holdings. The literature generally focuses on the welfare-based trade-offs involved in the choice of the optimal level of reserves, assuming policymakers have full knowledge of the sudden stop risk.[12] This paper studies the welfare-based trade-offs involved in over-estimating and under-estimating the sudden stop risk, through the channel that the optimal level of reserves is determined by the estimated sudden stop risk. This paper highlights the fact that an ex-ante optimal level of reserves determined
by the estimated sudden stop risk could be ex-post suboptimal under the true sudden stop risk and, as a consequence, incur welfare loss for domestic households. Hence, this paper calls for attention to structurally implementing crisis risk estimation by taking into consideration subsequent policy responses and welfare effects.

[12] Relevant papers include Aizenman and Marion (2003), Miller and Zhang (2006), Aizenman and Lee (2007), Durdu et al. (2009), Alfaro and Kanczuk (2009), and Jeanne and Ranciere (2011).

Second, this paper extends the literature on early-warning models by bridging the gap between the econometric specification and the policy objective of early-warning models. As the policy objective of building early-warning models is to guide the implementation of policy tools through identifying the buildup of crisis risk, the econometric specification of early-warning problems should be derived backward by considering what crisis prevention policy decisions might optimally be made given any estimated crisis risk. However, in the literature, the econometric specification of early-warning problems has been disconnected from subsequent policy decisions. None of the existing objective functions of early-warning models is derived structurally with policy-making problems involved, from the noise-to-signal ratio initiated in Kaminsky et al. (1998), to the sum of errors introduced in Berg et al. (2005), to the usefulness measure developed in Alessi and Detken (2011). Proposing a two-stage framework, this paper uses backward induction to formulate the early-warning problem based on subsequent policy decisions. Specifically, this paper derives the loss function as the welfare loss incurred by a suboptimal level of reserves. Although it is a consensus that the cost incurred by missing a crisis is higher than that incurred by issuing a false alarm, it is not a trivial task to quantify the degree of cost asymmetry in order to determine the weights on the corresponding classification errors, i.e., the percentage of missed crises and false alarms. In the literature on early-warning models following the pioneering work by Kaminsky et al. (1998), most studies treat the percentage of missed crises and false alarms equally when specifying the loss function and evaluating model performance (Berg et al., 2005; Basu et al., 2017; Lang et al., 2018). This paper enriches the literature by showing that the degree of welfare cost asymmetry is so large that it results in a greater weight on the percentage of missed crises. Additionally, this paper brings
Counterfactual results show that policymakers with high risk tolerance would overlook leading indicators that could have signaled high crisis risk and thereby accumulate too few reserves to prevent real consequences of sudden stops. The rest of the paper is structured as follows: Section 1.2 presents a two-stage framework for sudden stop risk estimation and prevention, and formulates the early-warning problem backward based on subsequent reserves accumulation decisions. Section 1.3 introduces the implementation method, a constrained optimization under the Neyman-Pearson classification paradigm, and justifies the use of it to handle the cost asymmetry in the early-warning problem. Section 1.4 discusses the empirical model estimation approach, including crisis definition, explanatory indicators, as well as model choice. Section 1.5 presents the prediction performance and explores the quantitative implications of the two-stage framework. Finally, Section 1.6 concludes. 13 T he rationale behind is that the crisis risk estimation and crisis prevention policy are both carried out by policymakers. 9 1.2 A Two-Stage Framework Formulating crisis prediction and prevention as a sequential two-stage process which involves the crisis risk estimation in the first stage and the prevention policy decision in the second stage, I first present the crisis prevention policy decision problem in the second stage as a welfare-maximizing problem building upon an insurance framework developed by Jeanne and Ranciere (2011), which is solved by policymakers for an optimal level of reserves (Sec tion 1.2.1). Showing that optimality of reserves accumulation decision only holds under perfect crisis risk estimation and thereby welfare cost will be incurred by imperfect crisis risk estimation, I formulate the crisis risk estimation problem in the first stage as a welfare cost-minimizing problem in which the objective function to minimize is the welfare loss re sulted from imperfectly estimated crisis risk (Section 1.2.2). I then show that welfare cost of missing a crisis is so much larger than that of issuing a false alarm that the welfare-based weight on the percentage of missed crises is greater than that on the percentage of false alarms. 1.2.1 Second-Stage Problem for Reserves Accumulation I follow the insurance framework developed in Jeanne and Ranciere (2011) to formulate the crisis prevention policy decision problem in the second stage in which policymakers choose the level of reserves to maximize domestic welfare. The model choice is motivated by several reasons. First, it is well acknowledged in the literature that international reserves are accumulated to insure against capital flow volatility in emerging market countries, which is also supported by extensive empirical evidence. Second and more importantly, the insurance model yields a closed-form formula for the optimal level of reserves as a function of a couple of parameters including the probability of a sudden stop. Hence, it allows me to write down the closed-form formula of domestic welfare as a function of the probability of a sudden stop and thereby to conduct welfare analysis of sudden stop risk estimation. 
In the model, there are two states: the normal state (denoted by n), where output grows at a constant rate and domestic consumers can borrow internationally (subject to a borrowing constraint), and a sudden stop (denoted by s), where output falls by a certain fraction and domestic consumers are prevented from borrowing internationally. Faced with sudden stop risk, the government smooths domestic consumption and maximizes domestic welfare by entering a "reserves insurance contract" with foreign investors. That is, the government pays a premium Z_t, which is transferred from domestic consumers; in exchange, the government receives a payment R_t and transfers it back to domestic consumers if a sudden stop occurs. Let \pi_t denote the sudden stop risk, i.e., the probability of a sudden stop, conditional on a set of explanatory variables.[14] The welfare-maximizing problem solved by the government is

\max_{\{Z_t, R_t\}} \; \pi_t\,u(C_t^s) + (1-\pi_t)\,u(C_t^n)

\text{s.t.}\quad C_t^n = Y_t^n + L_t - (1+r)L_{t-1} - Z_t
\quad\;\; C_t^s = (1-\gamma_t)Y_t^n - (1+r)L_{t-1} + R_t - Z_t
\quad\;\; L_t = \lambda_t Y_t^n
\quad\;\; Z_t = \frac{\tilde{\pi}_t}{\tilde{\pi}_t + p_t(1-\tilde{\pi}_t)}\,R_t
\quad\;\; Y_{t+1}^n = (1+g_t)\,Y_t^n     (1.1)

[14] It is assumed that the probability of a sudden stop \pi_t is derived from the true crisis model, where the subscript of \pi allows for different model specifications across time and indicates that the sudden stop risk in period t depends on all information up to period t.

where C_t^n and C_t^s are the state-contingent levels of consumption in a normal state and in a sudden stop in period t, respectively, Y_t^n is the trend output, and (Z_t, R_t) characterizes the "reserves insurance contract" between the government and foreign investors in period t. The first two conditions characterize consumers' budget constraints in the two states. In normal states, domestic consumption equals trend output plus external debt, minus the debt repayment (with constant interest rate r) and the contract payment Z_t transferred by the government. In sudden stops, output falls by a fraction \gamma_t below the trend and consumers are prevented from borrowing internationally, but they receive an additional transfer from the government of value R_t. The third condition characterizes consumers' borrowing constraint, where the size of a sudden stop \lambda_t is the maximum level of external debt that is otherwise available to domestic consumers in a normal state in period t. The fourth equation characterizes foreign insurers' binding participation constraint, where p_t is the price of a non-crisis dollar in terms of crisis dollars for foreign insurers in period t, and \tilde{\pi}_t is the aggregate probability of a sudden stop perceived by foreign insurers in period t. Here, it is assumed that foreign insurers do not differentiate among countries and thereby see the same unconditional aggregate sudden stop risk for all countries; therefore, \tilde{\pi}_t = \mathbb{E}[\pi_t] when foreign insurers decide whether to enter the insurance contract. Last, g_t is the rate of output growth along the trend in period t. The utility function is assumed to have constant relative risk aversion \sigma \ge 0: u(C) = \frac{C^{1-\sigma}}{1-\sigma} for \sigma \neq 1 and u(C) = \log(C) for \sigma = 1.

We first write the optimal insurance decision chosen by the government as functions of the probability of a sudden stop, i.e.
(Z_t(\pi_t), R_t(\pi_t)),[15] which are given by

Z_t(\pi_t) = \delta_t\,\rho_t^*(\pi_t)\,Y_t^n \quad\text{and}\quad R_t(\pi_t) = \rho_t^*(\pi_t)\,Y_t^n     (1.2)

where the optimal reserves-to-GDP ratio \rho_t^*(\pi_t) = R_t(\pi_t)/Y_t^n is characterized as

\rho_t^*(\pi_t) = \lambda_t + \gamma_t - \left(1 - \frac{r-g_t}{1+g_t}\lambda_t - \delta_t(\lambda_t+\gamma_t)\right)\frac{1-\eta_t(\pi_t)}{1-\left(1-\eta_t(\pi_t)\right)\delta_t}     (1.3)

where

\eta_t(\pi_t) = \left(\frac{p_t\,\pi_t\,(1-\tilde{\pi}_t)}{\tilde{\pi}_t\,(1-\pi_t)}\right)^{\frac{1}{\sigma}} \quad\text{and}\quad \delta_t = \frac{\tilde{\pi}_t}{\tilde{\pi}_t + p_t(1-\tilde{\pi}_t)}     (1.4)

and \delta_t represents the opportunity cost of holding the reserves in period t.

[15] The insurance model yields a closed-form formula for the insurance contract payments (Z_t, R_t) as a function of a handful of parameters including r, g_t, \lambda_t, \gamma_t, \tilde{\pi}_t, and p_t, in addition to the probability of a sudden stop \pi_t. However, in order to analyze the welfare-based trade-offs involved in crisis risk estimation, all parameters other than the probability of a sudden stop are not specified as arguments of the formula here.

Plugging in the optimal insurance contract (Z_t(\pi_t), R_t(\pi_t)), we obtain the state-contingent levels of domestic welfare as functions of the probability of a sudden stop, which are characterized by

U_t^n(\pi_t) = u\big(f(\pi_t)\,Y_t^n\big)     (1.5)

and

U_t^s(\pi_t) = u\big(\eta_t(\pi_t)\,f(\pi_t)\,Y_t^n\big)     (1.6)

where

f(\pi_t) = \frac{e_t}{1-\left(1-\eta_t(\pi_t)\right)\delta_t}     (1.7)

and

e_t = \frac{p_t(1-\tilde{\pi}_t)\left(1-\frac{r-g_t}{1+g_t}\lambda_t\right) + \tilde{\pi}_t\left(1-\gamma_t-\frac{1+r}{1+g_t}\lambda_t\right)}{\tilde{\pi}_t + p_t(1-\tilde{\pi}_t)}.     (1.8)

It follows from equations 1.5 and 1.6 that the state-contingent levels of domestic welfare are functions of the probability of a sudden stop in period t, which is not known to the government when it solves the welfare-maximizing problem for the optimal level of reserves at the beginning of period t, before the state of period t is realized. Therefore, the optimality of the level of reserves characterized by formula 1.3 only holds under perfect risk estimation, that is, when the estimated probability of a sudden stop is exactly the same as the true probability. Any imperfect estimation of the probability of a sudden stop, i.e., any deviation of the estimated probability from the true probability, will result in suboptimal insurance contract decisions and hence incur welfare costs for the domestic economy, as shown formally in what follows.

Let \hat{\pi}_t denote the estimated probability of a sudden stop in period t, obtained before the government solves problem 1.1 and determines the level of reserves to accumulate. Because the true probability \pi_t is unknown at the beginning of period t, the government will choose the level of reserves based on the estimated probability according to formula 1.3.
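The mapping from an estimated probability to the reserves decision can be traced numerically. Below is a minimal sketch implementing the closed forms of equations 1.3 and 1.4; all parameter values are purely illustrative and are not the chapter's benchmark calibration (Table 1.7):

    def optimal_reserves_ratio(pi, pi_agg=0.10, p=1.0, sigma=2.0,
                               lam=0.10, gamma=0.10, r=0.05, g=0.03):
        """Optimal reserves-to-GDP ratio rho*_t(pi_t) of equation 1.3.

        pi is the (estimated) sudden stop probability; pi_agg is the insurers'
        aggregate probability; p prices non-crisis dollars in crisis dollars.
        """
        eta = (p * pi * (1 - pi_agg) / (pi_agg * (1 - pi))) ** (1 / sigma)  # eq. 1.4
        delta = pi_agg / (pi_agg + p * (1 - pi_agg))   # opportunity cost, eq. 1.4
        cover = 1 - (r - g) / (1 + g) * lam - delta * (lam + gamma)
        return lam + gamma - cover * (1 - eta) / (1 - (1 - eta) * delta)

    # Reserves increase with the estimated risk; when the estimate coincides
    # with the insurers' probability and p = 1, coverage is full: lam + gamma.
    for pi in (0.10, 0.15, 0.20):
        print(pi, round(optimal_reserves_ratio(pi), 3))

Note how sensitive the prescribed level of reserves is to the probability estimate; this sensitivity is what gives the first-stage estimation problem its welfare stakes.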
16 As dis cussed previously, because the insurance contract payment ( Z (ft) , R( ft)) is chosen so as to maximize the expected domestic welfare given the estimated probability of a sudden stop ft, the expected domestic welfare derived from this contract payment decision does not achieve its maximum under the true probability unless perfect crisis risk estimation is achieved, which is stated in the following lemma. Lemma 1. The insurance contract payment ( Z (ft), R( ft)) as characterized in equation 1. 2 based on any estimated probability of a sudden stop ft is not optimal under the true probability 1r, unless ft = 1r. 16 But it should be noted that it does not mean all parameters are constant over time. Most of the parameters could be time-variant, including 7rt, itt , 9t, At , "ft, 7rt, and Pt· 14 Hence, it implies that welfare cost will be incurred if the government chooses the insurance contract payment ( Z (fr), R( fr)) as depicted in equation 1. 2 based on any estimated probability fr unless fr = 1r, and the welfare cost is [rreal(1r , 1r) _ [rreal(1r , fr) = l ~ a [1r ((ry(1r)f(1r))l-a) + (1- 1r) (!(1r)l-a)] - 1 ~ a [1r ((ry(fr)f(fr))l-a) + (1- 1r) (f(fr)l-a)] 2:: 0. (1.11) The equality holds if and only if fr = 1r, i.e. perfect risk estimation is achieved. Proof. Under the true probability of a sudden stop 1r, the optimal level of reserves-to-GDP p* ( 1r) is the one maximizing the expected domestic welfare taken over the true probability 1r. Therefore, the level of reserves-to-GDP p*(fr) based on the estimated probability fr is no longer maximizing the expected domestic welfare taken over the true probability if fr =/- 1r, which implies that the expected domestic welfare derived from p* (fr) is less than that derived from the optimal level p* ( 1r) if fr =/- 1r. � Although such welfare cost characterized in equation 1.11 is caused directly by the sub optimal level of reserves, it is hereafter referred to as being incurred by the imperfectly estimated probability of a sudden stop, because the sub-optimality of the reserves accumu lation decision is essentially resulted from the imperfect risk estimation. Given that welfare cost will be incurred whenever the estimated probability of a sudden stop deviates from the true probability, I argue in the following proposition that the welfare cost of an imperfect estimation of the sudden stop risk is not symmetric, as long as consumers are risk averse. Proposition 1. The welfare cost of any imperfectly estimated probability of a sudden stop is not symmetric. That is, given the size of the deviation of the estimated probability from the true probability ~ 1r 11 fr - 1r 11 > 0, the welfare cost of an upward deviation (i.e. fr = 1r + ~ 1r) and a downward deviation (i.e. fr = 1r - ~1r) are not the same for any true probability 1r, as long as consumers are risk averse, i.e. a > 0. 15 Proof. See Appendix. � The asymmetry in welfare costs implies that the crisis risk estimation in the first stage needs to be carefully formulated and rigorously solved because of its policy implications and welfare effects. Specifically, to warrant the consistency of crisis risk estimation and reserve accumulation decision, the objective function of the first-stage problem should be defined in terms of welfare, and the asymmetric welfare costs incurred by imperfectly estimated probabilities should be taken into consideration. 
1.2.2 First-Stage Problem for Risk Estimation In formulating the crisis risk estimation problem in the first stage, the objective function plays a critical role and should be carefully constructed, as it is not necessarily the same in every situation. In a regression problem, the loss function to be minimized is often chosen to be the £ 2 loss llv - Pll2 = (v - p) 2 (1.12) and the objective function of a regression model is the population-level expected value of the £ 2 loss, which is estimated within sample by mean square error. In a classical binary classification problem, the loss function to be minimized is often chosen to be the £ 1 loss (1.13) and the objective function of a classification model is the population-level expected values of the £ 1 loss, which is estimated within sample by accuracy. Though these loss functions take different forms, essentially they are defined as functions of the estimated values and the true values. Table 1.1 summarizes different specifications of the loss functions in regression and classification problems. 16 Table 1.1: Different Specifications of Loss Functions Loss function Population-level objective function Sample-level objective function Regression problem IIP - Pll2 = (p - P) 2 IE [(p- p)2] fr ((Pl -p1) 2 + ... + (PN -PN) 2 ) Classification problem 11:V - Yll1 = IY - YI IE [IY - YI] fr (1111 - Y1I + ... + IYN - YNI) As indicated in lemma 1 and proposition 1, the crisis risk estimation problem in the first stage has policy implications and welfare effects to be generated in the second stage: the level of reserves and domestic welfare will be determined by the probability of a sudden stop estimated in the first stage. More importantly, welfare cost will be incurred if the estimated probability deviates from the true probability and thereby the level of reserves determined by the estimated probability is no longer optimal. Also, there is asymmetry between the welfare costs incurred by upward- and downward-deviated probability estimates. Therefore, it implies that the crisis risk estimation problem in the first stage should be formulated backward from solving the reserves accumulation decision problem in the second stage and taking into account the level of reserves and thereby domestic welfare resulted from any risk estimate. As the criterion in the second stage when solving for the optimal the level of reserves is to maximize the expected domestic welfare, the objective of the crisis risk estimation problem in the first stage should be specified to minimize welfare costs incurred by imperfect estimates of the probability of a sudden stop, so that the consistency of crisis risk estimation with prevention policy decision is warranted. In the same spirit of regression and classification problems, I define a Welfare Loss denoted by Lw = II -fr - 1rllw as the welfare cost of an estimated probability of a sudden stop -fr under the true probability 1r that is characterized in equation 1.11 (1.14) 17 and the objective function is defined as the population-level expected values of the Lw loss (1.15) Following the common practice in the literature on early-warning models to formulate a crisis risk estimation problem as a binary classification problem, I rewrite the objective function as a function of true crisis realizations and predicted crisis flags which are both binary indicators. Let y and f) denote the true crisis realization and predicted binary crisis flag respectively, both taking 1 for a sudden stop and O for a normal state. 
Then the objective function 1.15 which is the expected welfare cost of estimated probability ft under true probability 1r is re-written as an weighted sum of expected conditional welfare cost which is conditional on different combinations of the predicted crisis flag and true crisis realization IE [ureal(1r,1r) _ o-real(1r,ft)] = IE [ureal(1r,1r) _ o-real(1r,ft)I y = 0, y = 0]lP'(y = 0,y = 0) +IE [ureal(1r,1r) -O-real(1r,ft)I y = l, y = 0]lP'(y = l ,y = 0) +IE [ureal(1r, 1r)-O-rea1(1r, ft)I y = l, y = l ]lP'(y = l , y = 1) +IE [ureal(1r,1r)-O-real(1r,ft)I y = 0, y = l ]lP'(y = 0,y = 1). (1.16) where IE [· If) = i, y = j ] is the expectation taken over the joint distribution of the estimated probability given predicted crisis flag being i and the true probability given true crisis real ization is j , where i,j E {O, 1}. It then follows that the objective function can be expressed as an weighted sum of four categories of welfare cost where in equation 1.16 the first part represents expected welfare cost associated with true negatives (i.e. both true crisis realiza tion and predicted crisis flag are O and indicate a normal period) ; the second part represents expected welfare cost associated with false negatives (i. e. true crisis realization is O while the predicted crisis flag is 1, indicating a crisis is not flagged in the prediction); the third part represents expected welfare cost associated with true positives (i. e. both true crisis re alization and predicted crisis flag are 1 and indicate a crisis) ; and the fourth part represents 18 expected welfare cost associated with false positives (i.e. true crisis realization is 1 while the predicted crisis flag is 0, indicating a normal period is incorrectly flagged as a crisis in the prediction). Table 1.2 summarizes the four combinations of predicted crisis flag and true crisis realization. It is noted that there could still be welfare cost associated with true negatives and true positives, because the continuous probabilities may not be perfectly esti mated which would result in sub-optimal decisions on the level of reserves and therefore incur welfare cost, even though the predicted crisis flags coincide with the true crisis realizations. Table 1.2: Combinations of Binary Predicted Flags and True Realizations True realizations non-cns1s crisis Predicted non-cns1s True negative (f) = 0 & y = 0) False alarm (iJ = 0 & y = l) flags crisis Missed crisis (f) = l & y = 0) True positive (f) = l & y = l) Notes. I adapt the notion that O indicates a normal period and 1 indicates a sudden stop. In order to re-write the expected welfare cost as a function of predicted crisis flag and true crisis realization, several assumptions on distributions of true and estimated probability, as well as their respective mapping rules from continuous crisis probabilities to binary crisis indicators are made. As for distributions, let Fn ( 1r) and f 1r ( 1r) denote the cumulative and probability density function of the true probability --rr. I then assume ft is an unbiased estimation of the true probability 1r Jr= ft+ En, (1.17) where E bounded by [--rr - 1, --rr] represents the unobserved error with IE[ En l--rr] = 0, and FE (El--rr) and JE( E l--rr) as its cumulative and probability density function respectively, conditional on 19 true probability 1r. 
As for mapping rules from continuous crisis probabilities to binary crisis indicators, first , by definition of the true probability 1r, for a given value 1r0 , I have IP'(y = lln =no)= no , (1.18) representing that given true probability 1r equal to 1r0 , the probability of true crisis realiza tion being 1 (i.e. a sudden stop) is 1r0 . Second, a standard cutoff decision rule in binary classification problems is assumed to map estimated probabilities to predicted crisis flags. That is, a cutoff value c is chosen such that i) = l if fr > c and i) = 0 if fr :S c. Therefore, the mapping rule from estimated probabilities to predicted crisis flags is that, for any given true probability 1r equal to 1r 0 indicating that the mapping from estimated probability to predicted crisis flag depends on the conditional distribution of the estimation error given the true probability. Given assumptions on distribution of true probability 1r and estimation error E1r, and mappings from continuous probabilities to binary crisis indicators, the objective function characterized by equation 1.16 is re-written as a weighted average of the percentage of false alarms and missed crises where the weight of the percentage of false alarms denoted by Wy=lly=O and the weight of the percentage of missed crises denoted by wg=Oly=l are Wy=lly=O = IE [(1r - l)H(n)] Wy=Oly=l = IE [ 1rH(1r)]' (1.21) 20 and H(1r) is a function of the true probability 1r and characterized as (1.22) I then argue in the following proposition that the welfare-based weight on the percentage of false alarms and the percentage of missed crises in the objective function are not symmetric and the latter is greater than the former. Proposition 2. The welfare-base weight on the percentage of false alarms and missed crises in the objective function characterized in equation 1. 20 are not symmetric: the weight on the percentage of missed crises is greater than that on the percentage of fals e alarms as long as O' > 0. That is, (1.23) The equality holds if and only if consumers are risk neutral, i.e. O' = 0. Proof. See Appendix. � Proposition 2 shows that the welfare cost incurred by missing a crisis is so much larger than that incurred by issuing a false alarm that the overall welfare-based weight on the percentage of missed crises is greater than that on the percentage of false alarms. Hence, it emphasizes the need for prioritizing towards the percentage of missed crises when formulating and solving the crisis risk estimation problem if the criterion is to maximize domestic welfare or to minimize welfare loss equivalently. Although it is a consensus that the cost incurred by missing a crisis is larger than that incurred by issuing a false alarm, it is not a trivial task to quantify the aggregate cost of the proportion of missed crises and false alarms, as there are much fewer crises than non-crises. Therefore, in the literature on early-warning models following Kaminsky et al. (1998) , most studies adopt an objective function in which the welfare-based weights characterized by equation 1.21 are ignored, and as a consequence same weights are assigned to the percentage of missed crises and false alarms (hereafter referred 21 as the literature paradigm). Proposition 2 shows that such assignment of same weights is not consistent with the welfare-maximizing criterion, subject to which policy decisions on crisis prevention are made. 
Also, it is worth noting that the derivation of the welfare cost asymmetry is based on the population-level distribution, implying that such prioritization should be achieved not only within model estimation or in-sample, but more importantly in forecasting practice or out-of-sample. 1.3 Implementation Methodology In this section, I first introduce the Neyman-Pearson classification paradigm (Subsection 1.3.1 ), in contrast to the literature paradigm. I then discuss reasons for which the Neyman Pearson classification paradigm is a good fit for modeling the crisis risk estimation problem in the first stage derived in previous section (Subsection 1.3.2). 1.3.1 Neyman-Pearson Paradigm In binary classification, a majority of models ( or classifiers) are constructed under the clas sical classification paradigm, which aims to minimize the population-level expected classifi cation error, IF(y = lly = 0)IF(y = 0) + IF(y = 0ly = l)IF(y = 1), (1.24) a weighted sum of the population-level type I error, i.e. IF(y = lly = 0) and type II error, .e. IF(y = 0ly = 1) where weights are the unconditional probability of the true classes 0 and 1, respectively. However, in real-world applications, different priorities for type I and type II errors may be needed because of the different costs incurred by type I and type II errors. Such examples are common in real world including disease diagnose, fraud detection and etc. Take cancer diagnosis as an example: a type II error ( that is, misdiagnosing a cancer patient as healthy) will incur significantly larger cost in the case of tragic loss of life which may not even be quantifiable than extra the medical costs incurred by a type I error ( that is, misdiagnosing 22 a healthy patient with cancer). The Neyman-Pearson classification paradigm was developed to serve this situation. 17 A model constructed under the Neyman-Pearson classification paradigm aims to minimize one type of error while maintaining the other prioritized type of error below a certain level a (on the population level). For example, if the goal is to prioritize type II error, then the objective under the Neyman-Pearson classification paradigm is to minimize type I error with respect to an upper bound a on type II error, that is to solve min IP(y = lly = 0) (1.25) s.t. IP(y = 0ly = 1) :Sa, where the upper bound on type II error a reflects the degree of prioritization towards type II error. It is worth noting that the Neyman-Pearson classification paradigm is different from the cost-sensitive classification paradigm which also accounts for asymmetric misclassification costs by assigning different weights to misclassified observations and thereby different weights to the type I and type II error, other than the unconditional probabilities of the two classes 0 and 1. 18 A model estimated under the cost-sensitive classification paradigm solves min c0IP(y = lly = 0)IP(y = 0) + c1IP(y = 0ly = l)IP(y = 1). (1.26) where c0 and c 1 can be specified based on priors. Following the literature on early-warning models pioneered by Kaminsky et al. (1998) , most of the studies adopt the cost-sensitive clas sification paradigm and assigns the same weights to the type I and type II errors, equivalent to assigning the inverse of the unconditional probablity of non-crisis and crisis to non-crisis and crisis observations respectively. In fact, the Neyman-Pearson classification paradigm can better serve the purpose in many real-world applications for several reasons. 
In fact, the Neyman-Pearson classification paradigm can better serve the purpose in many real-world applications, for several reasons. First, it is often not trivial to quantify the costs to be assigned to the type I and type II errors under the cost-sensitive classification paradigm, while an upper bound on the prioritized type of error is easier to specify given priorities. Second, the Neyman-Pearson classification paradigm achieves a prioritized control of the asymmetric misclassification errors by maintaining the prioritized type of error under a certain level on the population level, which is critical in forecasting practice, as it has been shown that there can be a sharp divergence between in-sample and out-of-sample performance. The cost-sensitive classification paradigm, in contrast, provides no such probabilistic control of the errors on the population level. Third, the prioritization achieved by imposing an upper bound on the prioritized type of error provides an interpretable mechanism for any resulting revision of the probability estimates: the upper bound can be modeled as the degree of risk tolerance of policymakers when they estimate crisis risk. Therefore, the Neyman-Pearson classification paradigm provides a robust, quantifiable, and interpretable way to account for asymmetric misclassification costs, which makes it a good fit for modeling the first-stage crisis risk estimation problem discussed in Section 1.2.

1.3.2 A Good Fit for Crisis Risk Estimation

Proposition 2 argues that the welfare costs associated with the percentage of false alarms (which is referred to as the type I error) and the percentage of missed crises (which is referred to as the type II error) are not symmetric, and that the welfare cost associated with the latter is larger than that associated with the former. Therefore, the formulation of the first-stage crisis risk estimation problem is in need of a prioritization towards the percentage of missed crises, which the Neyman-Pearson classification paradigm was developed to address. I show in the following proposition that there is a statistical equivalence between estimating a model to minimize the welfare loss characterized in formula 1.20 with w_{ŷ=0|y=1} > w_{ŷ=1|y=0} and solving the problem characterized by 1.25 with α less than 0.5.

Proposition 3. Solving the objective function under the Neyman-Pearson classification paradigm with α < 0.5 is equivalent to minimizing an objective function characterized as w_FA P(ŷ = 1 | y = 0) + w_MC P(ŷ = 0 | y = 1) with some w_MC > w_FA.

Proof. See Appendix. ∎

Therefore, the Neyman-Pearson classification paradigm is a statistically plausible modeling of the crisis risk estimation problem with welfare cost-induced asymmetric misclassification errors. In addition, I discuss in detail in this subsection why the Neyman-Pearson classification paradigm is a good fit for the formulation of the crisis risk estimation problem from the perspective of policymakers and researchers, given the welfare cost asymmetry characterized in Section 1.2. First of all, according to Proposition 2, the objective function commonly minimized in the literature paradigm,

P(ŷ = 1 | y = 0) + P(ŷ = 0 | y = 1),    (1.27)

is misspecified in that, under the cost-sensitive classification paradigm, it assigns the same weight to the type I error (that is, the percentage of false alarms) and the type II error (that is, the percentage of missed crises), and therefore does not account for the welfare cost asymmetry.
Hence, it implies that, when formulating the objective function to minimize under the cost-sensitive classification paradigm, a larger weight should be assigned to the type II error, that is, the percentage of missed crises. However, without further assumptions on the specification of the true model characterizing the relationship between the binary crisis incidence and the explanatory indicators, as well as on the distribution of the unobserved error term, it is impossible to quantify the degree of welfare cost asymmetry. The complexity of the crisis prediction problem, which comes from several aspects including non-linearities in the effects of explanatory indicators, interactions among explanatory indicators, and infrequent but large global regime shifts, makes it an impossible task to explicitly model the relationship between the binary crisis incidence and the explanatory indicators and, therefore, to obtain a perfect estimate of the welfare costs incurred by false alarms and missed crises. This could be one reason why studies in the literature on early-warning models simply ignore the additional welfare-based adjustment shown in equation 1.21 and formulate the objective function as an equally-weighted sum of the percentages of false alarms and missed crises, which, however, is not consistent with the welfare-maximizing criterion, as discussed in Section 1.2. As a result, the cost-sensitive classification paradigm does not fit the formulation of the crisis risk estimation problem well, due to its heavy reliance on cost specification. In contrast, the Neyman-Pearson classification paradigm provides a less model-dependent and more interpretable way to account for asymmetric welfare costs and to prioritize the more costly type of error (that is, the percentage of missed crises in the case of the crisis risk estimation problem characterized in Section 1.2), by introducing an upper bound below which the prioritized type of error is maintained (on the population level). It is easier for policymakers to specify an upper bound on the percentage of missed crises given their priorities than to specify the relative weight to be assigned to the percentage of missed crises. Moreover, because policymakers make policy decisions on the level of reserve insurance with reference to the estimated probability of a sudden stop, the upper bound on the percentage of missed crises can also be specified and interpreted as a policy goal to be achieved by subsequent policy actions, i.e., the level of reserve insurance chosen based on the estimated probability. Another characteristic of the Neyman-Pearson classification paradigm which makes it a good fit for modeling the crisis risk estimation problem is that it aims to maintain the prioritized type of error under a certain level on the population level. It is important to bear in mind that Proposition 2 holds on the population level, as all expectations are taken over the population-level distributions, which implies that any prioritization goal should be achieved not only in-sample, that is, when a model is estimated on the observed data, but also out-of-sample, that is, when a model is actually used in practice to assess vulnerabilities in the future. The importance of out-of-sample performance has been emphasized in Berg et al. (2005) from several angles: it is easier to manipulate in-sample results so as to obtain good in-sample performance; crises can be suddenly and fundamentally different from previous waves of crises; and so on.
Therefore, if there is any prioritization towards one type of error (that is, the percentage of missed crises in the case of the crisis risk estimation problem characterized in Section 1.2), then the objective should be to estimate a model that is able to maintain such prioritization out-of-sample, and the evaluation criteria should also include the ability to maintain such prioritization out-of-sample. The Neyman-Pearson classification paradigm, together with the algorithm developed by Tong et al. (2018), is well positioned to serve this situation, as it aims to minimize one type of error given that an upper bound on the prioritized type of error is maintained on the population level. Hence, in the case of formulating the crisis risk estimation problem, a model estimated under the Neyman-Pearson classification paradigm with a prioritization towards the percentage of missed crises is able to maintain such prioritization when it is used for vulnerability assessment and crisis prediction in practice. Therefore, the goals of welfare cost minimization in the first stage and welfare maximization in the second stage are indeed achieved in "real time". With a prioritization towards the percentage of missed crises in mind, one common and intuitive practice is to tune the costs c₀ and c₁ in equation 1.26 under the cost-sensitive classification paradigm such that the empirical percentage of missed crises is bounded from above by a certain level α. However, it has been shown in Tong et al. (2018), through extensive simulation studies, that this practice cannot control the prioritized type of error under the specified level on the population level (or out-of-sample) with high probability. Hence, it is argued that the Neyman-Pearson classification paradigm fits the formulation of the crisis risk estimation problem well, given its emphasis on population-level prioritization. The third advantage of the Neyman-Pearson classification paradigm, and the most empirically relevant one, is that the upper bound on the percentage of missed crises does not only indicate the degree of cost asymmetry, but also provides an interpretable mechanism to explain subsequent policy decisions, namely the buildup of reserves in emerging market countries as self-insurance against sudden stops. The buildup of reserves in emerging market countries since the late 1990s, and the substantial level maintained since the global financial crisis, have attracted considerable interest from academia and motivated a large body of research. Despite the ability of the benchmark model in Jeanne and Ranciere (2011) to explain the level of reserves in many emerging market countries, the buildup of reserves in emerging market Asia, and therefore in the entire sample of emerging markets, cannot be explained unless a large anticipated output cost of sudden stops and a high level of risk aversion are assumed. In fact, they find that the buildup of reserves in emerging market Asia does not move correspondingly with the evolution of the estimated probability of a sudden stop: the large buildup of reserves in emerging market Asia since 1998 is inconsistent with the decline in the estimated probability of a sudden stop in Asia after the peak reached at the time of the Asian crisis.¹⁹
¹⁹ A standard probit model is used to estimate the probability of a sudden stop in their paper.

Given the previous discussion of the asymmetric welfare costs associated with the percentages of false alarms and missed crises, it could be the case that, because the crisis risk estimation problem in the first stage is not formulated and solved in a welfare-based approach, the estimated probability of a sudden stop cannot account for the large welfare-maximizing buildup of reserves in these countries. With the introduction of one free parameter α specifying the upper bound on the percentage of missed crises, the Neyman-Pearson classification paradigm can not only contribute to the welfare-based formulation of the crisis risk estimation problem in the first stage but also allow for flexibility in the degree of risk tolerance of policymakers when estimating crisis risk. With low risk tolerance, and therefore sufficient prioritization imposed towards the percentage of missed crises, an early-warning model estimated under the Neyman-Pearson classification paradigm can generate upward-revised probabilities of a sudden stop so as to explain the buildup of reserves in emerging market countries under the benchmark calibration of the insurance model. Hence, it implies an interpretable mechanism for the buildup of reserves in emerging market countries from the perspective of policymakers having different degrees of risk tolerance when estimating sudden stop risk. Faced with substantially larger welfare costs incurred by missed crises, policymakers have lower risk tolerance and therefore impose stronger prioritization towards not missing many crises. As a result, the estimated model tends to yield an upward revision of the estimated probabilities of a sudden stop, based on which higher levels of reserves are accumulated.

1.4 Model Estimation

In this section, I discuss the empirical approach to estimating an early-warning model for sudden stop risk estimation. Subsection 1.4.1 first presents the definition of a sudden stop, and Subsection 1.4.2 shows the set of explanatory indicators. I then discuss the choice of model with the objective function (i) under the literature paradigm and (ii) under the Neyman-Pearson (NP) classification paradigm (Subsection 1.4.3). I conclude this section by discussing the approach to crisis probability calibration (Subsection 1.4.4).

1.4.1 Crisis Definition

I follow the definition of a sudden stop in Basu et al. (2019), which aims to capture a sudden switch in investor appetite from domestic to foreign assets, reflected in a sudden decline in private capital flows in emerging market countries. Sudden stops in emerging market countries are defined as occurring when net private capital inflows as a percentage of GDP are at least 2 percentage points lower than in the previous year and two years before, as well as when the country gets approved to tap large IMF financial support,²⁰ to capture counterfactual situations in which sudden declines in private capital flows were prevented by large IMF financial support. As observed in the historical data, severe real economic consequences often result from binding financial constraints caused by sudden stops, such as large decreases in output and consumption, which are also characterized as output loss in the second-stage welfare-maximizing model depicted in 1.1. Therefore, the definition of growth impact in Basu et al. (2019) is also adapted, where large growth declines are defined as occurring when the change in real GDP growth relative to the previous five-year average lies in the lower 10th percentile of the entire sample, as well as when the country gets approved to tap large IMF financial support, to capture counterfactual situations in which large growth declines were prevented by large IMF financial support. Combining the definitions of sudden stop and growth impact, episodes of sudden stops with growth impacts (SSGI) are the binary events to predict in the crisis risk estimation problem in the first stage. A minimal sketch of this flagging rule is given below.

²⁰ Hereafter defined as IMF arrangements with an agreed amount at least five times as large as the respective country's quota at the IMF.
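The sketch below operationalizes the sudden stop definition on a hypothetical country-year panel; the column names (inflows_gdp, in percent of GDP, and large_imf_support) are illustrative assumptions rather than the names in the actual dataset.

```python
# Sketch of the SSGI-style sudden stop flag described above, on a hypothetical
# country-year panel. Column names are illustrative assumptions.
import pandas as pd

def flag_sudden_stops(df: pd.DataFrame) -> pd.Series:
    """Flag year t when net private capital inflows/GDP (in percent) are at
    least 2 percentage points below both year t-1 and year t-2, or when the
    country is approved to tap large IMF financial support."""
    df = df.sort_values(["country", "year"])
    lag1 = df.groupby("country")["inflows_gdp"].shift(1)
    lag2 = df.groupby("country")["inflows_gdp"].shift(2)
    sharp_decline = (df["inflows_gdp"] <= lag1 - 2.0) & (df["inflows_gdp"] <= lag2 - 2.0)
    return (sharp_decline | (df["large_imf_support"] == 1)).astype(int)

# Example usage on a toy panel: 1994 is flagged because inflows fall more than
# 2 percentage points below both 1993 and 1992.
panel = pd.DataFrame({
    "country": ["MEX"] * 4,
    "year": [1991, 1992, 1993, 1994],
    "inflows_gdp": [5.0, 6.0, 5.5, 2.0],
    "large_imf_support": [0, 0, 0, 0],
})
print(flag_sudden_stops(panel))
```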
Table 1.3: Years and Countries of Sudden Stops with Growth Impacts

1990: Angola, Poland
1991: Bulgaria
1994: Mexico, Turkey
1995: Argentina, Morocco
1997: Indonesia, Malaysia, Thailand
1998: Chile, Colombia, Croatia, Peru, Philippines, Turkey
1999: Ecuador, Venezuela
2000: Angola, Argentina, Panama
2001: Turkey
2002: Brazil, Dominican Republic, Uruguay, Venezuela
2004: Ukraine
2007: Azerbaijan
2008: Argentina, Hungary, Kazakhstan, Latvia, Lithuania, Malaysia, Pakistan, Romania, Russia, South Africa, Turkey, Ukraine
2009: Belarus, Bosnia and Herzegovina, Bulgaria, Costa Rica, Croatia, Georgia, Macedonia, Mexico, Panama, Peru, Serbia
2010: Angola, Jordan, Lebanon
2012: Azerbaijan, Jordan
2014: Russia, Ukraine
2015: Belarus, Ecuador
2016: Venezuela

Our sample covers 53 emerging market countries from 1980 to 2017. There are in total 82 sudden stops with growth impacts, which account for 4.2 percent of the sample. Table 1.3 lists the sudden stops with growth impacts (SSGI), and Figure 1.1 shows the historical frequency of sudden stops with growth impacts (SSGI) in the sample since 1990.

[Figure 1.1: Frequency of Sudden Stops with Growth Impacts. Number of SSGI events per year, 1990-2017.]

It can be seen that the definition captures prominent historical waves of sudden stops, including the Mexico crisis in 1994, the Asian crises in 1997, the South American crises in the early 2000s, and the global financial crisis.

1.4.2 Explanatory Indicators

I choose the set of explanatory indicators of sudden stops based on the same principle as in Basu et al. (2019), that is, whether the variable is associated with justifiable economic channels and/or interpretable economic mechanisms according to the different generations of theoretical models of sudden stops in the literature. Table 1.4 lists the selected explanatory indicators, which are categorized as chosen from different generations of sudden stop models, as well as those capturing current account shocks, contagion effects, and so on. One thing to note is that all measures of reserves coverage are excluded from the set of explanatory indicators, for two reasons. Firstly, it is assumed that sudden stop risk is exogenous to reserves accumulation, so that the insurance model and the two-stage framework are theoretically plausible. Secondly, it is shown in Jeanne and Ranciere (2011) that no reserve adequacy ratio has a significant impact on the probability of a sudden stop, which is in accordance with empirical results in the literature.
Table 1.4: List of Explanatory Indicators

First generation model (Krugman, 1979): Fiscal balance/GDP; 5-year change in M2/GDP; Reserves/M2 and Reserves/GDP; Dummies for exchange rate regime; Dummy for parallel market

Second generation model (Obstfeld, 1992): REER acceleration; Change in unemployment rate; Real GDP growth

Third generation model: Debt stocks: External debt/GDP and External debt/Exports; Reserves/Short-term debt; Public debt/GDP; Public external debt/GDP; Private external debt/GDP; Bank external debt/GDP; Non-bank private external debt/GDP; Inter-bank external liabilities/GDP; Domestic private credit/GDP

Third generation model: Flows and Mismatch: Current account balance/GDP; Total debt service/Exports; Amortization/Exports; FX share of public debt

Third generation model: Buffers: EMBI sovereign spread; Change in EMBI spread; Primary gap/GDP; Inflation

Third generation model: Medium-term Building Bubbles: 5-year growth in private credit/GDP; 5-year growth in REER appreciation; 5-year growth in external debt/GDP; 5-year growth in inter-bank external liabilities/GDP

Third generation model: Bursting Bubbles: Public debt/GDP growth; Public external debt/GDP growth; Private credit/GDP growth; External debt/GDP growth; Private external debt/GDP growth; Bank external debt/GDP growth; Inter-bank external liabilities/GDP growth

Third generation model: Capital Account Openness: Inflow and outflow restrictions; Overall capital account restrictions

Third generation model: Global Shocks: Federal Funds Rate (level and growth); VIX; US NEER change; US yield spread; TED spread

Law of One Price: 5-year cumulative inflation

Current Account Shocks: Real growth in exports; % change in ToT; % change in non-fuel commodity ToT; Absolute oil balance/GDP; % change in oil price

Contagion: Change in export partners' growth; Inter-bank external liabilities to AEs in financial crises/GDP; Similarity index

1.4.3 Model Choice

The set of early-warning models for sudden stops has been developed along the historical crisis waves in which the danger of sudden stops was learned, from the signal-extraction model pioneered by Kaminsky et al. (1998) to more recent, more advanced machine learning models (Chamon et al., 2007; Basu et al., 2019). In the effort to justify that the prioritization towards the percentage of missed crises, achieved by modeling the crisis risk estimation problem for sudden stops under the Neyman-Pearson classification paradigm, is both theoretically plausible and empirically relevant, I choose the signal-extraction model as the benchmark model and compare its performance under the literature paradigm and the Neyman-Pearson (NP) paradigm, for three reasons. Firstly, the signal-extraction model has been proven to be one of the best early-warning models for predicting sudden stops or currency crises when compared with traditional regression models (Berg and Pattillo, 1999; Berg et al., 2005) and advanced machine learning models (Basu et al., 2019), especially in terms of out-of-sample performance. Secondly, the univariate and non-parametric setting for identifying variable-specific thresholds in the signal-extraction model accommodates well both classification paradigms under consideration, as discussed in detail below. Thirdly, the signal-extraction model has been one of the most commonly used early-warning models in both the public and private sectors since the late 1990s, because of its superior out-of-sample performance and practical algorithm.
Hence it provides the opportunity to evaluate, ceteris paribus, the empirical relevance of the Neyman-Pearson classification paradigm, that is, whether the prioritization towards the percentage of missed crises modeled in the Neyman-Pearson classification paradigm can explain the buildup of reserves in emerging market countries which standard estimation paradigms, including the literature paradigm, fail to explain. Otherwise, any extra performance gains might simply come from introducing more advanced techniques which were not used to estimate the probability of a sudden stop in practice.

In the signal-extraction model, a threshold is chosen for each variable so as to minimize the specified loss function, and observations whose indicator values fall on one side of the threshold are given a 1 and flagged as risky, while the others are given a 0 and flagged as safe. The flags of all indicators of an observation are then aggregated to generate a composite score (sometimes called a vulnerability index in the literature and in practice), with weights given by each indicator's signal-to-noise ratio,

w = (1 − z) / z,    (1.28)

where z is defined as the value of the loss function achieved. It follows that minimizing the loss function for each indicator is equivalent to maximizing the indicator's signal-to-noise ratio, so that indicators with larger signaling power are given larger weights in the composite score. The commonly-used loss function under the literature paradigm is the unweighted sum of the percentages of false alarms and missed crises. This loss function can be seen as a version of 1.26 under the cost-sensitive classification paradigm, with the weights assigned to the type I error (i.e., the percentage of false alarms) and the type II error (i.e., the percentage of missed crises) being the inverses of the unconditional probabilities of the true classes:

c₀ = 1 / P(y = 0),    c₁ = 1 / P(y = 1).    (1.29)

Therefore, the threshold of each indicator is chosen to solve

min P(ŷ = 1 | y = 0) + P(ŷ = 0 | y = 1),    (1.30)

which is equivalent to maximizing the difference between the cumulative distribution functions of the crisis and non-crisis samples. According to this loss function, z in the weight of each indicator is defined as the value of 1.30 achieved by the optimal threshold solving 1.30. Although this loss function captures the notion that missing a crisis is substantially more costly than issuing a false alarm, as crises are relatively rare in the historical data, Proposition 2 argues that this cost of missing a crisis (relative to that of issuing a false alarm) is not sufficiently high from the perspective of maximizing welfare. In fact, the costs characterized in equation 1.29 ignore the welfare-based adjustment characterized in formulas 1.20 and 1.21 and thereby simply take the weights as the inverses of the unconditional probabilities of the true classes. Under the Neyman-Pearson classification paradigm, the loss function capturing the prioritization towards the percentage of missed crises is formulated as in 1.25. Therefore, the threshold of each indicator is chosen to minimize the percentage of false alarms given an upper bound α on the percentage of missed crises, as in 1.25; a minimal sketch of this univariate threshold search is given below.
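The sketch below illustrates the per-indicator threshold search under each loss function, together with the signal-to-noise weight in equation 1.28 and the aggregation into a composite score; the data interface and the convention that higher indicator values signal higher risk are illustrative assumptions.

```python
# Sketch of the per-indicator threshold search under the literature paradigm
# (1.30) and the Neyman-Pearson paradigm (1.25), with the weight in (1.28).
# Assumes higher indicator values are riskier (an illustrative convention).
import numpy as np

def fit_indicator(x, y, paradigm="literature", alpha=0.4):
    best_t, best_z = None, np.inf
    for t in np.unique(x):
        fa = np.mean(x[y == 0] > t)    # percentage of false alarms
        mc = np.mean(x[y == 1] <= t)   # percentage of missed crises
        if paradigm == "np" and mc > alpha:
            continue                   # threshold violates the upper bound
        z = fa + mc if paradigm == "literature" else fa
        if z < best_z:
            best_t, best_z = t, z
    # Signal-to-noise weight (1.28); assumes 0 < z < 1 for an informative indicator.
    return best_t, (1.0 - best_z) / best_z

def composite_score(obs, fitted):
    # fitted maps an indicator name to its (threshold, weight) pair; the
    # composite score is the weighted sum of the indicator flags.
    return sum(w * float(obs[name] > t) for name, (t, w) in fitted.items())
```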
Because the threshold is chosen so as to minimize only the percentage of false alarms once the percentage of missed crises is controlled under a certain level, the weight of each indicator is defined via the percentage of false alarms achieved by the optimal threshold solving 1.25. Table 1.5 summarizes the difference between the literature paradigm and the new paradigm.

Table 1.5: Comparison Between the Literature Paradigm and the New Paradigm

                      Literature paradigm                         New paradigm
                      (equally-weighted sum)                      (constrained optimization)
Objective function    P(ŷ=1|y=0) + P(ŷ=0|y=1)                     min P(ŷ=1|y=0) s.t. P(ŷ=0|y=1) ≤ α
Threshold             argmin P(ŷ=1|y=0) + P(ŷ=0|y=1)              argmin P(ŷ=1|y=0) s.t. P(ŷ=0|y=1) ≤ α
Weight (w)            (1−z)/z with z = P(ŷ=1|y=0) + P(ŷ=0|y=1)    (1−z)/z with z = P(ŷ=1|y=0)

To implement the NP classification paradigm, I adapt the umbrella classification algorithm developed by Tong et al. (2018), which makes it possible to implement a broad class of scoring-type classification methods under the Neyman-Pearson classification paradigm. Because the score is simply the value of each indicator in the signal-extraction model, the signal-extraction model is easily implementable using the umbrella algorithm. The essential idea of the umbrella algorithm is, given an upper bound α on the percentage of missed crises and a tolerance parameter δ on the violation rate (that is, the population probability that the percentage of missed crises exceeds the upper bound), to choose the smallest threshold on the classification scores such that the violation rate is maintained under the tolerance level δ, that is,

P( P(ŷ = 0 | y = 1) > α ) ≤ δ.    (1.31)

The threshold is chosen from the order statistics of the classification scores of a left-out crisis sample, i.e., a sample of crisis observations which is not used to estimate the model and generate the classification scores. The detailed umbrella algorithm is described in the Appendix. Because the signal-extraction model finds the optimal threshold according to the specified loss function in a univariate way, different loss functions lead to different thresholds (and weights) for each indicator and therefore different flags for each observation. As a result, the composite scores, and thereby the calibrated probability estimates, generated under the literature paradigm and the Neyman-Pearson classification paradigm will differ. It implies that different classification paradigms will give rise to different policy implications (i.e., different levels of reserve insurance) and therefore different welfare implications.

1.4.4 Crisis Probability Calibration

Although the composite scores generated by the signal-extraction model convey some degree of confidence in the prediction, they are not estimates of the crisis probabilities. Hence, in order to calibrate the crisis probabilities, I use the isotonic regression approach to produce calibrated probabilities with reference to the observed binary crisis realizations. Isotonic regression is a technique to fit a free-form, monotonic line to a sequence of observations, and it has been applied widely to calibrating probabilities from model-based outputs. A minimal sketch of this calibration step is given below.
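As an illustration, the calibration step can be implemented with scikit-learn's IsotonicRegression; the scores and outcomes below are toy values, and the use of this particular library is an assumption of the sketch rather than the dissertation's actual implementation.

```python
# Sketch of isotonic probability calibration: fit a monotone map from
# composite scores to crisis probabilities on observed realizations, then
# apply it to new out-of-sample scores. Toy values throughout.
import numpy as np
from sklearn.isotonic import IsotonicRegression

scores = np.array([0.1, 0.3, 0.4, 0.5, 0.8, 0.9])  # in-sample composite scores
crises = np.array([0,   0,   1,   0,   1,   1])    # observed crisis realizations

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(scores, crises)

new_scores = np.array([0.2, 0.6, 1.1])
probs = iso.predict(new_scores)  # calibrated probabilities, monotone in the score
print(probs)
```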
1.5 Empirical Results

In this section, I present the empirical results of the early-warning model estimated under the Neyman-Pearson classification paradigm, benchmarked against the model estimated under the literature paradigm. Firstly, I discuss the out-of-sample performance of the models estimated under the different paradigms (Subsection 1.5.1). I then explore the quantitative implications of the model estimated under the Neyman-Pearson classification paradigm: I investigate the extent to which modeling crisis risk estimation under the Neyman-Pearson classification paradigm can synergize with the benchmark calibration of the insurance model to account for the buildup of reserves in emerging market countries. Subsection 1.5.2 shows the time-varying benchmark calibration with reference to the full sample of 53 emerging market countries during 1980-2017. Subsection 1.5.3 then presents the time-varying risk tolerance of policymakers, which is modeled by the upper bound on the percentage of missed crises via a two-stage calibration, and which explains the reserves accumulation puzzle. Subsection 1.5.4 conducts counterfactual analysis to investigate scenarios in which policymakers have different risk tolerance.

1.5.1 Backtesting Performance

In recognition of the importance of out-of-sample performance in evaluating an early-warning model, I follow the same backtesting procedure as in Basu et al. (2019). For one particular year in the testing (called a cutoff year), a model is estimated using all data available up to that year and then applied to an out-of-sample test set consisting of the two years right after the cutoff year. For example, if the cutoff year is 2007, then a model is estimated using all data up to 2007 (that is, explanatory indicators available up to 2006, with crisis realizations available up to 2007, corresponding to a one-year forward-looking window) and applied to a two-year test set consisting of data in 2008 and 2009. Different cutoff years from 2007 to 2015 with two-year gaps are chosen (so-called "rolling" backtesting) so as to avoid evaluating performance based solely on a one-year test set, which may contain only one or even no crisis; a minimal sketch of this backtesting loop is given below.
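The following sketch spells out the rolling backtesting loop; estimate_model and evaluate_errors are hypothetical placeholders for the signal-extraction estimation and for computing the percentages of false alarms and missed crises on the test set, and the panel interface is an illustrative assumption.

```python
# Sketch of the rolling backtesting loop: estimate on all data up to the
# cutoff year, test on the two years right after, repeat over cutoffs.
# `estimate_model` and `evaluate_errors` are hypothetical placeholders.
def rolling_backtest(panel, estimate_model, evaluate_errors,
                     cutoffs=(2007, 2009, 2011, 2013, 2015)):
    results = {}
    for cutoff in cutoffs:
        train = panel[panel["year"] <= cutoff]                      # data up to the cutoff
        test = panel[panel["year"].isin((cutoff + 1, cutoff + 2))]  # two-year test set
        model = estimate_model(train)
        results[cutoff] = evaluate_errors(model, test)  # (% false alarms, % missed crises)
    return results
```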
Table 1.6: Rolling Backtesting Performance of Models

(a) Model estimated under the literature paradigm

Cutoff year    Sum of errors (%)    False alarms (%)    Missed crises (%)
2007           50                   20                  30
2009           125                  25                  100
2011           117                  17                  100
2013           100                  0                   100
2015           5                    5                   0
Average        79                   13                  66

(b) Model estimated under the new paradigm with α = 0.4

Cutoff year    Sum of errors (%)    False alarms (%)    Missed crises (%)
2007           80                   67                  13
2009           39                   39                  0
2011           76                   76                  0
2013           48                   23                  25
2015           67                   67                  0
Average        62                   54                  8

Notes. False alarms indicate the percentage of false alarms, and missed crises indicate the percentage of missed crises. Sum of errors is the unweighted sum of the percentages of false alarms and missed crises. The tolerance level for model estimation under the Neyman-Pearson classification paradigm with α = 0.4 is set to δ = 0.05, which implies that the percentage of missed crises is maintained under 0.4 with probability 95% on the population level.

Table 1.6 reports the "rolling" backtesting performance of the model estimated under the literature paradigm as in 1.30 (Panel a) and of the model estimated under the Neyman-Pearson classification paradigm as in 1.25 with the upper bound on the percentage of missed crises α set to 0.4 (Panel b), a preferred value that will be discussed later. Firstly, under the literature classification paradigm, all crises in the test sets for cutoffs 2009, 2011 and 2013 are missed, resulting in a percentage of missed crises equal to 100%. Bearing in mind the substantially large welfare cost associated with the percentage of missed crises, as argued in Proposition 2, an early-warning model estimated under the literature classification paradigm could potentially generate a large welfare cost, given that most of the crises are not correctly flagged. Under the Neyman-Pearson classification paradigm, however, the percentages of missed crises are all maintained under the specified level of 40% out-of-sample in the test sets,²¹ which indicates its capability of avoiding the large welfare cost incurred by missing crises. Although it is inevitable that the percentage of false alarms increases, due to an upward revision of the thresholds on the individual explanatory indicators, it is worth noting that the sum of errors, that is, the unweighted sum of the percentages of false alarms and missed crises, is also lower for cutoff years 2009, 2011 and 2013 under the Neyman-Pearson classification paradigm. The reason is that an extra correctly flagged crisis, in situations where crises are rare, significantly reduces the percentage of missed crises, while an extra false alarm barely increases the percentage of false alarms, because there are many non-crises. More importantly, the average performance, measured by the average sum of errors across the different cutoff years, shows that the Neyman-Pearson classification paradigm delivers better overall performance, considering both false alarms and missed crises over a ten-year period of testing. The average sum of errors under the Neyman-Pearson classification paradigm is 62%, which is more than 20 percent lower than that under the literature classification paradigm. It shows that, given the rarity of crises, the prioritization towards not missing crises in the Neyman-Pearson classification paradigm also yields better overall performance. What's more, because the sum of errors does not account for the welfare cost asymmetry between the percentage of false alarms and the percentage of missed crises, as it assigns them the same weights, the overall performance results further imply that the Neyman-Pearson classification paradigm would perform even better if performance were evaluated in terms of the actual expected welfare loss, in which a larger weight is assigned to the percentage of missed crises.

²¹ The reason that the percentage of missed crises is reduced sharply to zero under the Neyman-Pearson classification paradigm with α = 0.4 is that there are few crisis observations in recent years: for example, one crisis during 2016-2017, four crises during 2014-2015, two crises during 2012-2013, and three crises during 2010-2011. Hence, the value of the percentage of missed crises does not change smoothly.

[Figure 1.2: Average Scores and Probabilities in Emerging Market Countries During 2008-2017. Left panel: average composite scores; right panel: average calibrated probabilities; each shown under the literature paradigm and under the new paradigm.]

Notes. Scores are model-based composite values generated out-of-sample in backtesting. For example, scores in 2008-2009 are generated by estimating a signal-extraction model using data up to 2007 and then applying the model to an out-of-sample test set consisting of data in 2008-2009. Calibrated probabilities are generated by applying the isotonic regression discussed in Subsection 1.4.4 to out-of-sample scores with reference to true crisis realizations.
Hence, it follows that the Neyman-Pearson classification paradigm can reduce overall welfare loss by much more than 20 percent. Figure 1.2 shows the evolution of the average out-of-sample model-based composite scores (on the left) and calibrated crisis probabilities (on the right) across all countries during 2008-2017, under the literature and the Neyman-Pearson classification paradigms. It can be seen that the average model-based composite scores and calibrated crisis probabilities generated under the Neyman-Pearson classification paradigm with α chosen to be 0.4 are higher than those generated under the literature classification paradigm in all years during 2008-2017. The reason is intuitive: the threshold of each indicator is revised downward under the Neyman-Pearson classification paradigm so as to flag more crises and therefore not miss many crises. As a result, the model-based composite score, which is a weighted sum of the flags of the indicators, is revised upward under the Neyman-Pearson classification paradigm, and so are the calibrated crisis probabilities. Hence, it gives rise to the possibility that the buildup of reserves in emerging market countries can be explained by policymakers' lower tolerance of crisis risk, which can be achieved by tighter control of the percentage of missed crises. Tighter control of the percentage of missed crises, i.e., a lower upper bound on the percentage of missed crises, generates higher model-based composite scores and hence higher calibrated crisis probabilities, which lead to a higher level of reserves being accumulated. Motivated by this, I explore in the subsequent subsections the extent to which policymakers' risk tolerance, modeled by the upper bound on the percentage of missed crises in the Neyman-Pearson classification paradigm, can account for the buildup of reserves in emerging market countries, even with the benchmark calibration of the other parameters in the insurance model.

1.5.2 Benchmark Calibration

To explore the joint quantitative implications of the insurance model presented in Section 1.2 and the crisis risk estimation model in Section 1.4, I first construct a benchmark calibration with reference to the time-varying statistics of sudden stops with growth impacts in the sample of 53 emerging market countries during 1990-2017, following a similar procedure as in Jeanne and Ranciere (2011), but in a modified time-variant version where, for each year, all time-variant parameters are calibrated with reference to only the data up to that year. Because my framework embeds crisis risk estimation into the insurance model, it is critical that the information sets for crisis risk estimation and for the prevention policy decision are the same. Hence, the benchmark calibration for making the reserves accumulation decision should be constructed with reference to only the data available to policymakers when they estimate sudden stop risk. Otherwise, the policy decision making in the second stage is inconsistent with the crisis risk estimation in the first stage, as future information would be used to construct the benchmark calibration instead of being used to estimate crisis risk. Table 1.7 reports the benchmark calibration. The parameters π̄, λ, and γ in each year are calibrated to match the average behaviors associated with the prediction targets, that is, sudden stops with growth impacts, over all preceding years. The (unconditional) probability of a sudden stop ranges from 3.1 percent to 4.1 percent per country-year observation during 1990-2017.
The parameter λ, which captures the fraction of output that could have been available to the domestic consumer in the non-crisis state, is calibrated as the average historical level of the decline in capital flows over the set of sudden stops with growth impacts, which ranges from 5.6 percent to 8.5 percent during 1990-2017. In terms of the output loss of a sudden stop, it is calibrated as the average historical drop in the real GDP growth rate in the year of a sudden stop relative to the previous year. It is found that the drop in the real GDP growth rate ranges from 4.8 percent to 5.9 percent on average in the year of a sudden stop, and from 6.6 percent to 8.4 percent on average over those episodes which saw a drop in the real GDP growth rate. Therefore, γ is calculated to range from 5.7 percent to 7.1 percent, the average between the estimates over all sudden stops with growth impacts and those in which the real GDP growth rate falls.

Table 1.7: Benchmark Calibration Parameters

Parameter                      Benchmark range
Probability of sudden stops    π̄ ∈ [0.031, 0.046]
Size of sudden stop            λ ∈ [0.056, 0.081]
Output loss                    γ ∈ [0.057, 0.072]
Potential output growth        g ∈ [0.027, 0.040]
Term premium                   δ ∈ [0.006, 0.017]
Risk-free rate                 r = 0.05
Risk aversion                  σ = 2

Notes. The ranges of the benchmark calibration summarize the calibrated values of the parameters in each year during 1990-2017, with reference to data up to that year. For example, for year 2000, all parameters (except the risk-free rate and risk aversion) are calibrated with reference to the crisis history up to 2000 in the sample of emerging market countries.

The potential growth rate g ranges from 2.7 percent to 4.0 percent, calibrated as the average real GDP growth rate in the sample excluding sudden stop years. The US term premium, measured as the difference between the yield on 10-year US Treasury bonds and the Federal Funds rate, is calibrated to range from 0.6 percent to 1.7 percent during 1990-2017. The risk-free short-term dollar interest rate is set at 5 percent, which is time-invariant over the sample. Finally, the risk aversion parameter is set to the standard time-invariant value in the literature.

1.5.3 Time-Varying Risk Tolerance of Policymakers

Based on the previous discussion of the Neyman-Pearson classification paradigm, it follows that the upper bound on the percentage of missed crises can be modeled as the risk tolerance of policymakers, for several reasons. Firstly, the lower the upper bound, the tighter the control of the percentage of missed crises, and the less tolerant of risk the policymakers; hence there is a monotone relationship between the upper bound on the percentage of missed crises and the risk tolerance of policymakers. Secondly, it can be specified by policymakers as a policy goal to be achieved by subsequent policy responses, especially in emerging market countries where policymakers are faced with a potentially large welfare cost of missing crises. Thirdly, the bounding condition can be achieved on the population level, which makes the upper bound a good proxy for the risk tolerance of policymakers when they want to estimate crisis risk in the future. Hence, I model the upper bound on the percentage of missed crises as a measure of the risk tolerance of policymakers. In order to estimate the value of the upper bound, I leverage the two-stage framework, in which the sudden stop risk estimation and the reserves accumulation decision are connected by the probability of a sudden stop.
It implies that the upper bound on the percentage of missed crises can be calibrated with reference to the observed level of reserves through the estimated probability of a sudden stop, which is the output of the crisis risk estimation and the input of the reserves accumulation decision. The calibration procedure is specified as follows. For a given value of the upper bound α, the probability of a sudden stop is estimated under the Neyman-Pearson classification paradigm by replicating "real-time" forecasting. That is, in each year t, a signal-extraction model is estimated using data available up to the year before, i.e., year t-1, and applied to new data in year t to generate out-of-sample composite scores and calibrated sudden stop probabilities for the year after, i.e., year t+1. Hence, a two-year forward-looking window is adopted, because it allows sufficient time to accumulate reserves. Then the calibrated probabilities are plugged into formula 1.3 to solve for the optimal level of reserves to be accumulated in year t, together with the benchmark-calibrated values of the other parameters with reference to data up to year t-1. The calibrated value of the upper bound is then chosen as the one minimizing the mean squared error between the optimal and the observed levels of reserves in the same year over a three-year moving window; a minimal sketch of this calibration loop is given below.
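The sketch below makes the calibration loop explicit; optimal_reserves_path stands in for the combination of the NP-paradigm "real-time" forecast and formula 1.3, and all interfaces are hypothetical placeholders rather than the dissertation's code.

```python
# Sketch of calibrating the upper bound alpha: pick the alpha whose implied
# optimal reserves best match observed reserves over a three-year moving
# window (mean squared error). All interfaces are hypothetical.
import numpy as np

def calibrate_alpha(candidates, optimal_reserves_path, observed_reserves, window=3):
    """candidates: iterable of alpha values.
    optimal_reserves_path(alpha): array of optimal reserves/GDP implied by the
    NP-paradigm probabilities and formula 1.3, aligned with observed_reserves."""
    best_alpha, best_mse = None, np.inf
    for alpha in candidates:
        implied = optimal_reserves_path(alpha)
        sq_err = (implied - observed_reserves) ** 2
        mse = sq_err[-window:].mean()  # MSE over the most recent three-year window
        if mse < best_mse:
            best_alpha, best_mse = alpha, mse
    return best_alpha
```

Repeating this selection year by year, with the window rolling forward, traces out the time-varying path of calibrated α values shown in Figure 1.3.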
[Figure 1.3: Revealed Risk Tolerance of Policymakers. Revealed risk tolerance (percent), 1993-2017, shown against the incidence of sudden stops.]

Notes. The revealed risk tolerance is the calibrated value of the upper bound on the percentage of missed crises in the Neyman-Pearson classification paradigm. The calibration is constructed within a two-year forward-looking "real-time" forecasting framework, in which α, i.e., the upper bound on the percentage of missed crises, in year t is calibrated with reference to the observed level of reserves in year t+1. This is because the α in year t is used to estimate sudden stop risk in year t+2 with a two-year forward-looking window, and the end-of-year level of reserves observed in year t is to insure against sudden stop risk in year t+2 as well.

Figure 1.3 shows the revealed risk tolerance, measured by the calibrated value of the upper bound on the percentage of missed crises, over the 25 years during 1993-2017, as well as the two major waves of sudden stops: the Asian financial crisis in the late 1990s and the global financial crisis in the late 2000s. Firstly, it shows that the revealed risk tolerance of policymakers is time-varying. It increases during the 1990s, reaching a record high during the Asian financial crisis, and then decreases sharply afterwards. After dropping to its lowest level, as low as 0.25, the risk tolerance recovers and increases to a second record high during the global financial crisis, after which it falls again. The time-varying risk tolerance indicates that it is not the insurance mechanism characterized in the theoretical model, but rather the lack of modeling preferences in crisis risk estimation, that fails to account for the buildup of reserves in emerging market countries.

[Figure 1.4: Observed and Optimal Level of Reserves as a Share of GDP, during 1993-2017. Observed reserves/GDP (%) and optimal reserves/GDP (%).]

Notes. Optimal levels of reserves are calculated according to formula 1.3, using the parameter values in the benchmark calibration and the calibrated probability estimates generated out-of-sample by signal-extraction models estimated under the Neyman-Pearson classification paradigm with the calibrated upper bound on the percentage of missed crises.

Figure 1.4 presents the optimal level of reserves calculated according to formula 1.3, using the parameter values in the benchmark calibration and the probabilities estimated under the Neyman-Pearson classification paradigm with the calibrated upper bound on the percentage of missed crises. It can be seen that the time-varying risk tolerance helps to reconcile the insurance model with the trend and magnitude of reserves accumulation in emerging market countries. The decreasing pattern in the risk tolerance of policymakers can be interpreted as an outcome of learning: after they learn the large welfare cost of sudden stops from the Asian financial crisis the hard way, they become much less tolerant of sudden stop risk and thereby impose tighter control of the percentage of missed crises when estimating the future risk of sudden stops. Therefore, they generate upward-revised estimates of sudden stop risk and, as a consequence, accumulate a higher level of reserves as self-insurance. Secondly, the risk tolerance of policymakers increases and reaches an exceptionally high level before each of the two major waves of sudden stops in emerging market countries. The calibration results show that the upper bounds on the percentage of missed crises for estimating sudden stop risk in 1997 and 2008 are both above 50%, indicating a prioritization towards not issuing false alarms instead of not missing crises according to Proposition 3, which, however, is not consistent with Proposition 2. Hence, the pre-crisis high level of risk tolerance is surprisingly in line with the historical failure to prevent these two major waves of sudden stops, although it is not calibrated to match historical outcomes of crisis prevention: because policymakers are so tolerant of sudden stop risk before the crises that they even prioritize not issuing false alarms, they (or the early-warning models they build) tend to revise downward the estimates of sudden stop risk for the coming crisis years. As a consequence, they do not accumulate sufficient reserves to prevent the real consequences of the coming crises. In sum, a time-varying risk tolerance is found in the two-stage framework embedding sudden stop risk estimation into the reserves accumulation decision. It does not only offer a behavioral explanation of the buildup of reserves in emerging market countries, by featuring the post-crisis downward trend of risk tolerance, but also provides an alternative interpretation of the historical failure to prevent crises, by documenting the pre-crisis high level of risk tolerance. It is worth noting that the two-stage framework plays an important role here. Crisis risk estimation is a critical element in policy decision-making and macro-modeling, because policymakers do not have perfect knowledge of the risk, and their risk tolerance can differ from households' risk aversion. Therefore, policy decisions such as the level of reserve holdings are made based on estimates of crisis risk which are generated according to policymakers' risk tolerance instead of the true values of crisis risk.
Hence, it emphasizes the importance of modeling crisis risk estimation and of disentangling the risk preferences of policymakers from the risk aversion of households.

1.5.4 Counterfactual Analysis

Given the pre-crisis high risk tolerance of policymakers documented previously, I ask whether there would have been different policy implications if policymakers had had a lower level of risk tolerance before the two major waves of sudden stops in emerging market countries. Specifically, how large a level of reserves would have been accumulated, and which leading indicators could have signaled high risk of a sudden stop, before the Asian financial crisis and the global financial crisis, if policymakers had been less tolerant of sudden stop risk? To answer these questions, I conduct a counterfactual analysis in which the upper bound on the percentage of missed crises for estimating sudden stop risk for 1997 and 2008, i.e., the starting years of the two major waves of sudden stops, is set to the average level across all values lower than 50%, which is equal to 40%. This choice of the counterfactual value of risk tolerance is motivated by two considerations. Firstly, Proposition 2 and Proposition 3 highlight that the upper bound on the percentage of missed crises should be less than 50% in order to be consistent with the welfare cost asymmetry. Hence, as long as the upper bound is chosen to be less than 50%, policymakers are making rational decisions from the perspective of the insurance model. Secondly, the time-varying risk tolerance presented previously shows that the pre-crisis level of risk tolerance is above 50% in both the Asian financial crisis and the global financial crisis, providing empirical support for an upper bound below 50% as a counterfactual choice. Table 1.8 presents the counterfactual results under the counterfactual choice of risk tolerance, in comparison with those under the revealed risk tolerance presented previously. Panel (a) reports, in the case of estimating sudden stop risk for 1997, the five explanatory indicators whose predictive rank²² changes the most among the top ten leading indicators, as well as the level of reserves to GDP. It is found that the growth of export partners and the medium-term growth of broad money and external debt are more important in identifying the buildup of sudden stop risk in 1997 under the counterfactual level of risk tolerance, compared with the focus on the current account balance and the TED spread under the revealed level of risk tolerance. It then follows that the counterfactual choice of risk tolerance helps identify leading indicators which are argued to be causes of the crisis in post-crisis studies, including the significant slowdown of export growth due to the recession in Japan, housing bubbles accompanied by rapid growth in the money supply, and excessive exposure to foreign exchange risk caused by large external borrowing.

²² As described previously in the model choice section for the signal-extraction model, the weight of an explanatory indicator is calculated as the inverse of the attained value of the signal-to-noise ratio and is used to rank the explanatory indicators: the larger the weight, the higher the rank.
Table 1.8: Counterfactual Results for the Asian Financial Crisis and the Global Financial Crisis

(a) Asian Financial Crisis: Risk Estimation for 1997

Variable                     Under revealed α    Change    Under counterfactual α
Export Partner Growth        5th                 ↑         2nd
5yr Broad Money Growth       7th                 ↑         4th
5yr External Debt Growth     9th                 ↑         5th
Current Account Balance      1st                 ↓         8th
TED Spread                   4th                 ↓         10th
Reserves-to-GDP              11.5%               ↑         19.5%

(b) Global Financial Crisis: Risk Estimation for 2008

Variable                     Under revealed α    Change    Under counterfactual α
US Term Premium              7th                 ↑         1st
Current Account Balance      5th                 ↑         2nd
Fed Rate Change              10th                ↑         3rd
Private Credit Growth        1st                 ↓         5th
5yr Private Credit Growth    2nd                 ↓         9th
Reserves-to-GDP              21.3%               ↑         38.5%

Notes. Only the top five explanatory indicators which exhibit the largest change in rank across the different levels of risk tolerance are reported. In the risk estimation for 1997, the revealed risk tolerance is measured as an upper bound on the percentage of missed crises equal to 52%. In the risk estimation for 2008, the revealed risk tolerance is measured as an upper bound on the percentage of missed crises equal to 52% as well. In both cases, the counterfactual risk tolerance is set as an upper bound on the percentage of missed crises equal to 40%.

The counterfactual results for the global financial crisis are presented in panel (b) of Table 1.8. It is shown that global conditions, including the US term premium and the change in the federal funds rate, would have been more predictive under the counterfactual level of risk tolerance, compared with the focus on the short- and medium-term growth in private credit under the revealed level of risk tolerance. The importance of push factors, i.e., common shocks, indicated by a counterfactually lower level of risk tolerance, is consistent with the US being the origin of the global financial crisis and has been documented by many studies after the crisis.²³ More importantly, if policymakers had had lower risk tolerance at the time of assessing sudden stop risk in 1997 and 2008, they would have accumulated almost twice as many reserves, which could have helped prevent the crises or at least mitigate their real costs. In sum, the counterfactual results establish that different levels of risk tolerance can imply different policy implications. Policymakers overlooked the leading indicators in which high sudden stop risk was rooted because they had high risk tolerance and, as a consequence, did not accumulate sufficient reserves to prevent the real consequences of the crises. However, if policymakers had been less tolerant of sudden stop risk, they would have identified the factors which signaled the buildup of risk and thereby could have accumulated more reserves to prevent or mitigate the real consequences of the crises.

²³ See Bacchetta and van Wincoop (2010), Gourio et al. (2010), and Fratzscher (2012).

1.6 Conclusion

Crisis risk is not only a critical element of the economic outlook for countries, regions, and the entire world, but also an essential input into the policy decisions of central banks and international institutions. This paper proposes a two-stage framework to investigate the mutual interaction between crisis risk estimation and crisis prevention policy decisions in the context of sudden stops. In the framework, an early-warning problem to estimate sudden stop risk is solved in the first stage, and a policy-making problem to determine the optimal level of reserves is solved in the second stage.
Building upon this two-stage framework, this paper shows structurally that there is a welfare cost asymmetry between the two types of errors: the welfare-maximizing weight on the percentage of missed crises must be greater than that on the percentage of false alarms. This paper introduces a constrained optimization under the Neyman-Pearson classification paradigm as an implementation method to solve this error-asymmetry problem, which reduces overall welfare loss by more than 20 percent. Bringing this two-stage model to emerging market countries' data, this paper uncovers a time-varying risk tolerance of policymakers that features a pre-crisis high level and a post-crisis downward trend. The pre-crisis high risk tolerance is in line with the failure to prevent historical waves of sudden stops, and the post-crisis decreasing risk tolerance explains the buildup of international reserves. Counterfactual results show that policymakers may overlook leading indicators in which high crisis risk is rooted if they have high risk tolerance, and thereby may not accumulate sufficient reserves to prevent the real consequences of sudden stops.

The policy implications of this paper are two-fold. First, this paper highlights the importance of developing a policy-consistent risk assessment framework. Shedding light on the suboptimality in policy-making and the welfare loss resulting from imperfect crisis risk estimation, this paper calls for policymakers' awareness of the policy implications and welfare effects of their practice of crisis risk estimation. The policy-consistent framework proposed in this paper can be applied to other types of risk and policy tools around the world, among which financial crisis risk and macro-prudential policies, as the most pressing concerns, should be on the research agenda. Second, the time-varying risk tolerance of policymakers presented in this paper emphasizes the need for a commitment mechanism to ensure the time consistency of policy decisions. Higher risk tolerance in normal periods can result in an insufficient level of reserves to mitigate the welfare costs of crises, while lower risk tolerance after crises can lead to an over-accumulation of reserves and incur higher opportunity costs. Hence, a commitment mechanism can help prevent the potential welfare costs incurred either by being too optimistic after long normal periods or by being too prudent after crises.

Chapter 2

External Crisis Prediction Using Machine Learning: Evidence from Three Decades of Crises Around the World¹

2.1 Introduction

Predicting external crises is an important task, yet a somewhat inglorious one. From the Mexican peso crisis in 1994, to the Asian financial crises of 1997-98, to the deep Argentinian distress of the 2000s and the short sharp panic of Brazil in 2002, all the way to the more recent upheavals of the global financial crisis in 2008-09 and the European sovereign debt crises of 2010-12, external crises have generated substantial macroeconomic distortions. The theoretical and empirical literatures have both identified several important risk factors for such crises, but early warning models for external crises remain haunted by Berg et al.'s (2005) finding that in-sample performance and out-of-sample performance may diverge sharply, so careful model evaluation and selection procedures are necessary for any new models.

¹ The work in this chapter is joint with Suman S. Basu (International Monetary Fund) and Roberto A. Perrelli (International Monetary Fund).
As crisis prediction is essentially a classification problem (into "crisis" and "non-crisis" categories), new advances in machine learning appear to hold some promise for this field. Machine learning is designed for classification problems, and some of the models that have been developed over the past couple of decades can be interpreted as extensions of the signal extraction approach historically used in the crisis prediction literature (following Kaminsky et al., 1998, among others). In this paper, we examine the ability of these advanced machine learning techniques to predict external crises in a manner which is attuned to the macro nature of the crisis prediction problem and which echoes how the models would be used in the IMF's Vulnerability Exercise. We design the tests rigorously by establishing safeguards against manipulation of the test sets, so that there is a high bar to clear before we can conclude that machine learning is superior to the traditional signal extraction approach.

Our target of prediction is external crises: episodes of a sudden decline in the private sector's willingness to hold domestic assets relative to foreign assets. In the empirical literature on external crisis prediction, studies have mostly focused on sudden stops (e.g., Chamon et al., 2007) and exchange market pressure events (e.g., Kaminsky et al., 1998; Berg and Pattillo, 1999) in emerging markets. We provide new definitions for both crisis types, sudden stops with growth impacts (SSGIs) and exchange market pressure events (EMPEs), and then we expand the exchange market pressure definition to include advanced economies and low-income countries. In total, we cover 159 countries and 27 years using our models.

External crises are difficult to predict because changes in preferences for assets are continuously priced in financial markets, so any sudden movements must be surprises not just to subgroups of investors, but to a large portion of the private sector. And it is likely that they surprise us as early warning modelers as well. We therefore need to conduct the tests with the appropriate humility, taking into account that we should only predict future crises with the information that is historically available.

When selecting models from the field of machine learning, it is important to work with models that deliver good prediction performance without loss of interpretability. In this paper we explore two groups of these models: regularized linear regressions (i.e., Elastic Net, LASSO, and Ridge) and tree-based ensemble learning methods (i.e., Random Forests and RUSBoost). Few papers have used these machine learning techniques to predict crises, and even fewer are related to external crises. Among them, regularized logit models were used to assess risks in the banking sector in Lang et al. (2018), and Random Forests were used to predict banking crises in Alessi and Detken (2018).
The crisis prediction problem can be formulated as

y_{it} = f(X_{i,t-1};\, \eta_{t-1}) + \varepsilon_{it},     (2.1)

where y_{it} is the crisis event for country i in year t, taking the value of 0 when there is no crisis, or 1 if there is a crisis, X_{i,t-1} is a vector of country-specific explanatory indicators for country i in year t-1, f specifies the relation between y_{it} and X_{i,t-1}, ε_{it} is an idiosyncratic shock, and η_{t-1} is a vector of global factors capturing the global regime which affects the relation f between y_{it} and X_{i,t-1}. Signal extraction uses each of the X_{i,t-1} variables in a univariate manner and estimates a threshold of riskiness for each. The great promise of machine learning is that it allows much more flexibility in the f relation, including nonlinearities, non-monotonicities, and interactions among explanatory variables.

However, whether machine learning improves prediction performance depends on whether the promise of machine learning exceeds the limits of applying machine learning to macro data. The limitation of machine learning is that while it has been developed for large datasets, macro data may be small in some important senses which machine learning cannot overcome. Firstly, there are few crisis events y_{it}, and they are heterogeneous, which means that historical data may hold only a few lessons for the future. Secondly, the X_{i,t-1} series exhibit temporal dependence and cross-sectional correlation, which means that adding new variables (which machine learning is designed to facilitate) may add only a small amount of independent information. Thirdly, infrequent but large shifts in η_{t-1} limit the applicability of past lessons for future crisis prediction: the rarity of the shifts means that 27 years of historical data cannot capture all the possible global regimes, and the large size of the shifts means that past information may suddenly become rather outdated.

Another factor to be borne in mind relative to other papers is that the heterogeneity of countries which must be included in a global prediction exercise may place an upper bound on prediction performance. Performance can be improved by focusing only on a smaller set of countries which are more similar to each other, but one of the objectives of our prediction exercise is precisely to understand how to extrapolate lessons from crises in some kinds of countries to other kinds of countries in a similar income category.

In this paper, we design a rigorous testing procedure taking into account the autocorrelation of X_{i,t-1}, the presence of common factors in X_{i,t-1}, and the possibility of future shifts in η_{t-1}. We cannot apply the random k-fold cross-validation techniques typically used in the machine learning literature; instead, we opt for cutoff testing, using only historical data to predict the future. Given that the model is likely to be re-estimated periodically, we choose to conduct a test which incorporates this phenomenon, with recursive re-estimation of the model over time and evaluation taken over the re-estimations. Such a testing procedure should be rigorous enough to prevent manipulation of the test sets. It should give some sense of which models are best able to capture the shape of the f relation, to be robust to the uncertainty in η_{t-1}, and to learn quickly from new crises which enter the historical training set over time.

Our results indicate that there is no clear superiority of machine learning relative to the traditional approach for all kinds of external crises and for all country groupings.
Specifically, machine learning does not outperform the signal extraction approach for predicting sudden stops with growth impacts in emerging markets, but a tree-based ensemble machine learning technique, especially the one addressing imbalanced data, does appear to outperform the signal extraction approach for predicting exchange market pressure events in the same set of emerging markets. Moreover, the signal extraction approach also performs best for predicting exchange market pressure events in advanced economies and low-income countries, although tree-based ensemble machine learning techniques are able to deliver comparable performance for the latter.

By exploring the variable importance implied by the signal extraction approach and machine learning techniques, we document differences in the sectoral contributions to crisis risk implied by different models. Moreover, we find interesting effects of explanatory variables on the crisis probabilities, including non-linearities, non-monotonicities, and interactions among explanatory variables, implied by tree-based ensemble machine learning techniques for predicting EMPEs in EMs. We may conclude that for crises that are more heterogeneous and less well studied by the literature to date, for example, EMPEs as compared to SSGIs for emerging markets, machine learning has the greatest potential to help us search for important predictors and to allow for complex economic mechanisms relevant to crisis prediction, and thereby to improve prediction performance.

The remainder of this paper is organized as follows. Section 2.2 describes the model candidates in our horse race, including the signal extraction approach and machine learning techniques, and reviews the early warning literature on external crises. Sections 2.3 and 2.4 present our definitions for two types of external crises, sudden stops with growth impacts and exchange market pressure events, and our set of explanatory variables for predicting external crises. Section 2.5 discusses our model design, including model estimation and testing. Section 2.6 reports the horse race results and delves into model properties that may contribute to their prediction performance. Finally, Section 2.7 concludes.

2.2 Applying Machine Learning To Crisis Prediction

2.2.1 Signal Extraction and Regression-based Models

The empirical literature on external crisis prediction has evolved around two widely used methods: the signal extraction approach and regression-based models. The signal extraction approach pioneered by Kaminsky et al. (1998) is a univariate method identifying variable-specific thresholds, such that observations with variable values on one side of the threshold are flagged as risky and given a 1 (indicating that a crisis follows), while those with variable values on the other side are flagged as safe and given a 0 (indicating that no crisis follows). The threshold is calculated to minimize a loss function which is defined as a combination of false alarms (the percentage of the risky flags that are not followed by crises) and missed crises (the percentage of the safe flags that are followed by crises). Once variable-specific thresholds are calculated and observations are flagged by variable, an aggregate vulnerability index is constructed for each observation as a weighted average of the variable-specific flags. Thanks to its simplicity and interpretability, the signal extraction approach is one of the most commonly used techniques among early warning systems for various types of crises.[2]
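To make the threshold search concrete, the following is a minimal sketch in Python of how a variable-specific threshold could be computed under this loss function. It is our own illustration with made-up numbers, not the code used in this paper, and it assumes that higher values of the indicator are riskier.

import numpy as np

def optimal_threshold(x, y):
    # Return the threshold on one indicator minimizing the sum of
    # false alarms and missed crises (higher x assumed riskier).
    # x: indicator values; y: 1 if a crisis follows, 0 otherwise.
    best_t, best_loss = None, np.inf
    for t in np.unique(x):
        risky = x > t
        # false alarms: share of risky flags not followed by a crisis
        fa = (risky & (y == 0)).sum() / max(risky.sum(), 1)
        # missed crises: share of safe flags followed by a crisis
        mc = (~risky & (y == 1)).sum() / max((~risky).sum(), 1)
        if fa + mc < best_loss:
            best_t, best_loss = t, fa + mc
    return best_t, best_loss

# toy data: a hypothetical credit-growth indicator and crisis outcomes
x = np.array([1.0, 2.5, 3.0, 4.2, 5.1, 6.8, 7.3, 9.0])
y = np.array([0, 0, 0, 0, 1, 0, 1, 1])
t, z = optimal_threshold(x, y)
print(t, z)  # the variable's weight in the index would then be (1 - z) / z

Repeating this search for each indicator, and averaging the resulting flags with the weights noted in the last comment, yields the aggregate vulnerability index described above.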
Also, it has been shown to have reasonable out-of-sample properties, which are critical in crisis prediction (see Berg and Pattillo, 1999, and Berg et al., 2005). However, a major drawback of this approach is that it does not accommodate complex relationships between the crisis outcome and explanatory variables, such as nonlinearities, non-monotonicities, or interactions among variables, which can be of crucial importance for crisis prediction.

[2] For example, Borio and Drehmann (2009) and Laina et al. (2015) on banking crises, Kaminsky et al. (1998) and Berg et al. (2005) on external crises, Basu et al. (2017) on growth crises, Knedlik and Schweinitz (2012) and Dawood et al. (2017) on sovereign crises.

Regression-based models examine parametric relationships between a binary crisis outcome variable (taking the value of 1 if there is a crisis following, and 0 otherwise) and a set of explanatory variables. Early applications of regression-based models to external crisis prediction include Frankel and Rose (1996), Berg and Pattillo (1999), and Rocha et al. (2002). These authors used multivariate probit regression models to predict currency crises and examine leading indicators. In contrast to the signal extraction approach, regression-based models allow more complex specifications such as interactions among variables and produce estimates of crisis probabilities instead of vulnerability indices. However, by including a large set of variables and their interactions, regression-based models are prone to overfitting (Hawkins, 2004). That is, the model is so complex that it fits the specific dataset too well and learns too much of its noise, which may undermine its performance on new datasets.[3] Overfitting is one of the important problems we need to tackle in crisis prediction because crisis mechanisms may change so drastically that a significant amount of information in historical data may suddenly become noise irrelevant for the future. Also, the panel data structure of regression-based models may result in serious depletion of the dataset when data availability varies significantly across variables.

[3] For further discussion, see Berg and Pattillo (1999) and Berg et al. (2005). Chamon et al. (2007) and Basu et al. (2017) conducted out-of-sample testing to evaluate their models using, respectively, one-year-ahead predictions in the early 2000s and predictions over the global financial crisis.

Many variables have been studied for their importance in predicting external crises, and the set of variables is expanding as new crises occur and new transmission mechanisms are discovered. Variables commonly used span multiple sectors of the economy, including external sector variables such as the current account balance as a percentage of GDP and external debt as a percentage of GDP; financial sector variables such as the capital adequacy ratio and the return on bank assets; fiscal sector variables such as public debt as a percentage of GDP and the EMBI sovereign spread; and real sector variables such as the price-to-earnings ratio and the interest coverage ratio. In addition, global factors have appeared to be of great importance for external crisis prediction, for example, the TED spread and the U.S. federal funds rate.

2.2.2 Machine Learning Techniques

Recent developments in the field of machine learning seem to offer promising solutions to the problems identified above. Among them, it is important to work with models that deliver good prediction performance without loss of interpretability. In the present paper we explore two sets of these models: regularized linear regression models and tree-based ensemble methods.
Both have been widely used in classification and prediction problems in many fields, such as image classification and disease diagnosis, and have been introduced to solve economic problems.[4]

[4] See, for example, Caner (2009) and Wager and Athey (2017).

In the category of regularized linear regression models, one of the most versatile is the Elastic Net model introduced by Zou and Hastie (2005), encompassing both the LASSO (Tibshirani, 1996) and Ridge regression (Hoerl and Kennard, 1988) by linearly combining the L1 and L2 regularizations. This model maximizes a log-likelihood function subject to a bound on the sum of the absolute values and squared values of the coefficients, by solving the following problem:

\min_{\beta_0,\beta} \; \frac{1}{N} \sum_{i=1}^{N} l\big(y_i, \beta_0 + \beta^{T} x_i\big) + \lambda \Big( (1-\alpha)\, \lVert \beta \rVert_2^2 / 2 + \alpha\, \lVert \beta \rVert_1 \Big),     (2.2)

where y_i is the dependent variable and x_i is the set of explanatory variables for observation i, l(y_i, β_0 + β'x_i) is the negative log-likelihood of observation i, λ is the shrinkage parameter, and α controls the relative weight of the L1 and L2 regularizations. By implementing regularizations, regularized linear regression models prevent overfitting when there is a large set of explanatory variables which may also exhibit a significant degree of correlation. Due to their simple linear specification and capability of performing variable selection and regularization, regularized linear regression models have been widely applied to forecasting problems.

In the category of tree-based ensemble methods, this paper considers two different ensemble learning methods with the binary classification tree (BCT) as the base learner: Random Forests (Breiman, 2001) and the random under-sampling boosting tree-ensemble model (RUSBoost) (Seiffert et al., 2010). The binary classification tree method (Breiman et al., 1984) uses a decision tree to flag an observation by going from the original complex sample to smaller and purer subsamples. Each decision tree consists of a root node, branches departing from parent nodes and entering child nodes, and multiple terminal nodes, which are also called leaves. In the structure of a classification tree, leaves represent the flagged classes (determined by the class with the most votes within one leaf) and branches represent the conjunctions of indicators that lead to the classes. Observations in the root node are sent to the left or right child node according to splitting rules that identify indicators and corresponding thresholds. Once the whole sample is split into two subsamples, this process is repeated on each child node recursively until each leaf consists of observations in one class, or some stopping criteria are met. The indicator and threshold used to split the sample at each node are chosen based on some measure of impurity, such as the Gini impurity index. Because of the recursive algorithm, the binary classification tree structure partitions the classification (or prediction) space into multiple smaller spaces, which allows for complex relationships between the classification (or prediction) outcome and explanatory indicators, such as non-linearities, non-monotonicities, and interactions among indicators.
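To illustrate the impurity-based splitting rule just described, here is a minimal sketch in Python, written by us for exposition (it is not the estimation code used in this chapter), of a single Gini-based split search of the kind a tree performs at each node.

import numpy as np

def gini(y):
    # Gini impurity of a set of binary labels (0 for a pure node)
    if len(y) == 0:
        return 0.0
    p = y.mean()
    return 2.0 * p * (1.0 - p)

def best_split(x, y):
    # Choose the threshold on one indicator that minimizes the
    # size-weighted Gini impurity of the two child nodes.
    best_t, best_imp = None, np.inf
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp

Growing a tree amounts to applying such a search recursively, over all candidate indicators, to each child node until a stopping criterion is met.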
Binary classification trees are prone to overfitting when a tree grows fully to fit all observations in the training sample, which results in a deep tree with small leaves containing only a few observations governed by strict rules. Such a deep tree will fail to make accurate predictions for new observations because it includes too much noise from the training sample which is irrelevant to new predictions. To reduce the overfitting of a single binary classification tree, ensemble models consisting of many binary classification trees were proposed.

Among them, the simplest is Random Forests, introduced by Breiman (2001), which applies the general technique of bootstrap aggregating (bagging). A Random Forest consists of multiple binary classification trees, each of which is grown on a random sample selected with replacement from the training sample, which decreases the variance of the model without substantially increasing the bias. Additionally, it also performs random feature sampling, whereby only a random subset of explanatory variables selected from the entire set is considered at each split, effectively preventing strong correlations among trees. In the end, class predictions for new observations are made by taking the majority vote of the classes determined by individual trees, and scores for new observations are calculated by taking the average of the scores generated by individual trees. Bootstrap aggregating and feature sampling together help Random Forests prevent overfitting and thus achieve better prediction performance.

In addition to Random Forests, we also consider RUSBoost (Seiffert et al., 2010), another ensemble learning algorithm with the binary classification tree as the building block. Implementing a hybrid sampling/boosting algorithm designed to deal with the class imbalance problem, RUSBoost is well-suited to our crisis prediction problem, in which crises are very rare. RUSBoost combines two techniques that have been used to alleviate the class imbalance problem: random under-sampling and the AdaBoost (Freund and Schapire, 1996) algorithm. Random sampling is the simplest method for resampling data: a dataset is resampled by either duplicating observations of the minority class (i.e., over-sampling) or depleting observations of the majority class (i.e., under-sampling) until a desired class ratio is achieved. In our case, where crisis mechanisms constantly change, over-sampling may lead to overfitting when crisis observations are duplicated and thus overrepresented. Hence, random under-sampling is the more appropriate strategy for our crisis prediction problem. In addition, boosting algorithms have been developed to improve classification performance regardless of whether the data are balanced. AdaBoost (Freund and Schapire, 1996), one of the most commonly used boosting algorithms, adds single learners (i.e., binary classification trees in the case of RUSBoost) sequentially to build an ensemble of learners. In each iteration, observation weights are adjusted such that more weight is given to observations which were misclassified in previous iterations. Once all iterations are finished, class predictions for new observations are made by taking a weighted vote of the classes determined by individual trees. Although AdaBoost was not originally designed to deal with the class imbalance problem, it performs particularly well on imbalanced classification because, in most cases, observations of the minority class are those most likely to be misclassified.
Therefore, giving more weight to observations of the minority class is equivalent to performing resampling, which helps alleviate the class imbalance problem. Given its use of random under-sampling and AdaBoost, both of which help address the class imbalance problem, RUSBoost has been shown to perform very well on imbalanced data and is therefore included in our horse race.

Besides delivering better prediction performance, tree-based methods are capable of handling missing data with a built-in algorithm called surrogate splits. Whenever the value of the best feature at a split is missing, the algorithm determines whether to send the observation to the left or right node by looking for other features whose splits resemble the split of the best feature in terms of the number of correctly classified observations. The observation with a missing value of the best feature is then classified based on the most resembling surrogate, or the second most resembling surrogate if the first is also missing, and so on. By taking advantage of surrogate splits in tree-based methods, we avoid depletion of data or naive imputation.

2.2.3 Potential Limitations in Applying Machine Learning to Macro Data

However, despite the promise of machine learning techniques, there are several potential limitations to their use in macroeconomic contexts. While machine learning offers greater flexibility in the mapping from explanatory variables to crisis incidence, it cannot overcome the "small" nature of macro data: crises are rare and heterogeneous, making them harder to predict; explanatory variables are often correlated cross-sectionally and temporally, so that the independent information content of each additional variable is limited; and there may be infrequent but large changes in the global regime, which affect the data generating process from explanatory variables to crisis incidence.

Changes in the data generating process resulting from changes in global regimes and crisis mechanisms make it very important to have a testing process which reflects how any crisis prediction model would be used in practice. Crucially, one should set clear safeguards against the temptation to manipulate the test set when applying machine learning techniques. Firstly, the testing set should not be randomly selected, as is often the case in machine learning applications, but should be comprised of years succeeding the training set; that way, we capture the practical problem that we do not know future global regimes when predicting future crises. Secondly, if we expect the crisis prediction model to be periodically re-estimated in practice, accounting for potential changes in the data generating process, then our testing set should also be constructed in the same manner, so that we gauge the average performance over a range of recursively-estimated models. Thirdly, the variable selection and tuning procedure should be determined without regard to the test set, so as to ensure that model estimation and testing are not manipulated.

2.2.4 Machine Learning Applications in the Early-Warning Literature

Although machine learning has been introduced to the early-warning literature since early this century, few studies are related to external crises and use the methods we consider.[5] Surprisingly, despite their simple specification and capability to perform variable selection and prevent overfitting, regularized linear regression models have not been commonly applied to crisis prediction.[6]
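For concreteness, a regularized logit of the form in equation (2.2) can be estimated in a few lines with off-the-shelf tools. The snippet below is a generic scikit-learn sketch of ours, with illustrative penalty settings, and is not the implementation used in this paper; in scikit-learn's parameterization, l1_ratio plays the role of α in equation (2.2) and C is the inverse of the shrinkage parameter λ.

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Elastic Net logit: standardize indicators, then penalize coefficients
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, C=1.0, max_iter=10000),
)
# model.fit(X_train, y_train)                # y: 0/1 crisis labels
# probs = model.predict_proba(X_test)[:, 1]  # estimated crisis probabilities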
Built from binary classification trees, whose first use in crisis prediction was almost two decades ago, Random Forests have only been applied to crisis prediction in recent years.[7]

Horse race studies of different methods are even scarcer, despite their significant importance in light of the increasingly rich set of early-warning models. To the best of our knowledge, Holopainen and Sarlin (2017) is the only study conducting a horse race between traditional statistical methods and machine learning techniques for predicting banking crises. Their results showed that traditional statistical methods were outperformed by machine learning techniques. However, their results depended heavily on their data and model choices. Firstly, they focused only on banking crises in European Union countries, so their set of countries, and thus crises, could be fairly homogeneous. Secondly, the signal extraction model compared in their paper was very degenerate, choosing only one explanatory variable instead of aggregating all variables into a composite vulnerability index. Thirdly, although they conducted recursive testing from the viewpoint of real-time analysis, the parameters they used were still chosen with cross-validation on the full sample, instead of being determined using only historical training data and with respect to recursive testing.

[5] Early applications of machine learning to crisis prediction include Ghosh and Ghosh (2003), Frankel and Wei (2004), Kaminsky (2006), Chamon et al. (2007), and Manasse and Roubini (2009), in which binary classification trees were used.

[6] Lang et al. (2018) applied regularized logit models to a large dataset of European Union banks and assessed risks in the banking sector at the aggregate level; Holopainen and Sarlin (2017) also included LASSO in their horse race exercise.

[7] Alessi and Detken (2018) used Random Forests to predict banking crises; Xu et al. (2018) used Random Forests to predict currency crises for a much smaller subset of countries than ours with a modified training set. Chamon et al. (2007) also applied Random Forests to their prediction of capital account crises, but only as a robustness check.

This paper offers a rigorous horse race between the signal extraction approach and machine learning techniques for external crises,[8] with several features designed to make the testing procedure as close as possible to the practical use of the models: large cross-country samples; test sets designed in line with the periodic re-estimation of the model in practice; and tuning approaches aligned with the testing procedure. All of these features provide strong safeguards against manipulation of model estimation and testing.

[8] We only include the signal extraction approach out of the traditional statistical methods because among the different traditional statistical methods (including probit models), only the signal extraction approach delivers reasonable out-of-sample properties, as documented in Berg and Pattillo (1999) and Berg et al. (2005).

2.3 Crisis Definitions

External crises occur when there is a sudden switch in investors' preferences from domestic to foreign assets. How such a switch in preferences translates into domestic macroeconomic outcomes depends on the structure of the economy. In this paper, we focus on two different definitions of external crises.

2.3.1 Sudden Stops

Sudden stops in capital flows capture the most brutal external crisis events for emerging markets, a group of economies whose capital accounts are open enough for private capital inflows to accumulate, but not sufficiently liberalized for sudden outflows to be easily insured against.
In this paper, we define sudden stops in capital flows as occurring when net private capital inflows as a percentage of GDP are at least 2 percentage points lower than in the previous year and two years before, as well as when the country is approved to tap large IMF financial support.[9] Sudden stops are often followed by severe real economic consequences, such as large decreases in output, consumption, and private credit, as well as real exchange rate depreciation. In line with the theoretical literature on external crises (for example, Mendoza, 2002), we are especially interested in predicting sudden stops with sizeable growth impacts, i.e., large growth declines resulting from binding financial constraints throughout the economy caused by sudden stops in private capital flows. Adapting the definition in Basu et al. (2017), we define large growth declines as occurring when the change in GDP growth relative to the previous five-year average growth rate lies in the 10th-percentile tail. Hereafter we call these events sudden stops with growth impacts (SSGIs). Our sample covers 53 emerging markets and spans the period between 1990 and 2017. There are 183 sudden stops, accounting for 12.3% of the sample, and 61 SSGIs, accounting for 4.1% of the sample. Table 2.1 lists the SSGIs in our sample, and Figure 2.1 shows the historical frequency of sudden stops and SSGIs.

[9] Hereafter, large IMF financial support is defined as IMF arrangements with an agreed amount at least five times as large as the respective country's quota at the IMF. This criterion attempts to capture counterfactual situations in which sudden stops in capital flows were prevented by large IMF financial support.

Table 2.1: Sudden Stops with Growth Impacts (SSGIs) in Emerging Markets

Year | Countries
1990 | Angola, Poland
1991 | Bulgaria
1994 | Mexico, Turkey
1995 | Argentina, Morocco
1997 | Indonesia, Malaysia, Thailand
1998 | Chile, Colombia, Croatia, Peru, Philippines, Turkey
1999 | Ecuador, Venezuela
2000 | Angola, Argentina, Panama
2001 | Turkey
2002 | Brazil, Dominican Republic, Uruguay, Venezuela
2004 | Ukraine
2007 | Azerbaijan
2008 | Argentina, Hungary, Kazakhstan, Latvia, Lithuania, Malaysia, Pakistan, Romania, Russia, South Africa, Turkey, Ukraine
2009 | Belarus, Bosnia and Herzegovina, Bulgaria, Costa Rica, Croatia, Georgia, Macedonia, Mexico, Panama, Peru, Serbia
2010 | Angola, Jordan, Lebanon
2012 | Azerbaijan, Jordan
2014 | Russia, Ukraine
2015 | Belarus, Ecuador
2016 | Venezuela

Our definition of SSGIs captures prominent historical examples of sudden stops with severe economic consequences, including the Mexican peso crisis in 1994, the Malaysia-Indonesia-Thailand crises in 1997, the Argentina crisis in 2000, the Brazil crisis in 2002, the Hungary-Ukraine crises in 2008, the Russia crisis in 2014, and the Venezuela crisis in 2016.

Figure 2.1: Frequency of Sudden Stops and SSGIs in Emerging Markets, 1990-2017
Notes. The figure plots the number of sudden stops and SSGIs per year.

Furthermore, we see from Figure 2.1 that our definition of SSGIs gives rise to several clusters of crises in the mid-1990s, late 1990s, early 2000s, and late 2000s.
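To fix ideas, the sudden stop screen translates directly into a few lines of pandas. The sketch below is our own illustration with hypothetical column names and toy numbers, not the exact code behind Table 2.1, and it omits the large-IMF-support criterion described in footnote [9].

import pandas as pd

# toy panel: one row per country-year (numbers are made up)
df = pd.DataFrame({
    "country": ["A"] * 8,
    "year": range(1990, 1998),
    "inflows_gdp": [3.0, 4.0, 5.0, 1.5, 2.0, 6.0, 5.5, 2.5],
    "growth": [4.0, 4.5, 5.0, -1.0, 2.0, 4.0, 4.2, -2.0],
}).sort_values(["country", "year"])

# sudden stop: inflows/GDP at least 2pp below both the t-1 and t-2 values
g = df.groupby("country")["inflows_gdp"]
df["sudden_stop"] = ((g.shift(1) - df["inflows_gdp"] >= 2)
                     & (g.shift(2) - df["inflows_gdp"] >= 2))

# growth impact: change in growth relative to the previous 5-year
# average lies in the bottom 10 percent of the pooled sample
avg5 = df.groupby("country")["growth"].transform(
    lambda s: s.shift(1).rolling(5).mean())
delta = df["growth"] - avg5
df["ssgi"] = df["sudden_stop"] & (delta <= delta.quantile(0.10))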
Such clustering of crises will have implications for the importance of global versus country-specific variables in the empirical crisis prediction exercise later in this paper. From an economic perspective, common global factors, such as monetary policy actions in advanced economies, can substantially drive capital inflows to and outflows from emerging markets (Fratzscher, 2012). Therefore, they may play an important role in explaining the occurrence of these waves of crises.

2.3.2 Exchange Market Pressure Events

Exchange market pressure events (EMPEs) capture episodes of sudden exchange rate depreciation or reserves depletion which may occur owing to a sudden switch in investors' preferences from domestic to foreign assets, even if the realized capital outflows are not large. Such events are especially relevant for economies which are financially closed (so that exchange rate misalignment may occur, but the potential outflows are limited by the small capital inflows in prior periods), which have active crisis management (so that reserves are used to absorb sudden stops in capital flows), and which have few binding financial constraints (so that exchange rate misalignment may be quickly corrected without generating severe negative consequences for domestic output and credit).

In the spirit of early papers in the empirical literature on currency crises, we construct an exchange market pressure index (EMPI) combining the degrees of exchange rate depreciation and international reserves loss.[10] The index is defined as a weighted average of the annual percentage depreciation in the nominal effective exchange rate and the annual decline in reserves as a percentage of the previous year's GDP. The weights are chosen so that the variances of the two components are the same.[11] We then define exchange market pressure events (EMPEs) as occurring when the EMPI lies in the 85th-percentile tail of the entire sample including all countries from different income groups, as well as when the country is approved for large IMF support (as previously described), to capture counterfactual situations in which sharp exchange rate depreciations or large declines in reserves were prevented by large IMF support.

Although EMPEs are defined on the entire sample of all countries with a uniform percentile cutoff,[12] we will conduct model estimations and analysis on different country groups separately because these EMPEs may be heterogeneous across country groups and explained by different economic mechanisms. These aspects may ultimately lead to different prediction performances, model rankings, and implications for the importance of the explanatory variables for each group of countries.

[10] For example, Eichengreen et al. (1995) and Kaminsky and Reinhart (1999).

[11] With the weight on the annual depreciation in the nominal effective exchange rate (NEER) normalized to 1, the weight on the annual decline in reserves as a percentage of the previous year's GDP is the ratio of the standard deviation of the annual depreciation in the NEER to the standard deviation of the annual decline in reserves as a percentage of the previous year's GDP. The standard deviations are calculated on the sample from 1990 to 2015 with observations exhibiting exceptionally large depreciations of the NEER (defined as absolute annual percentage changes in the NEER higher than 80%) removed. By doing this, we calculate the standard deviations on the sample prior to the global financial crisis and exclude hyperinflation episodes, preventing distortions in the standard deviations and thus in the EMPI.

[12] This design delivers different frequencies of EMPEs in different income groups, with EMPEs more frequent in EMs and LICs than in AEs, capturing the fact that EMs and LICs are more likely to see turbulence in the exchange market.
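The EMPI construction likewise maps directly into code. The sketch below is our own, with assumed column names, showing the variance-equalizing weights and the 85th-percentile event cutoff; it leaves aside the trimming of hyperinflation episodes and the IMF-support criterion described above.

import pandas as pd

def empi_events(df):
    # df has one row per country-year with assumed columns:
    # "neer_dep": annual % depreciation of the NEER
    # "res_loss": annual decline in reserves, % of previous year's GDP
    # weight on reserves losses so both components have equal variance
    w = df["neer_dep"].std() / df["res_loss"].std()
    df["empi"] = df["neer_dep"] + w * df["res_loss"]
    # EMPE: EMPI in the 85th-percentile tail of the pooled sample
    df["empe"] = df["empi"] >= df["empi"].quantile(0.85)
    return df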
Figure 2.2: Frequency of Exchange Market Pressure Events (EMPEs) by Country Group, 1990-2017
Notes. The figure plots the number of EMPEs per year in AEs, EMs, and LICs.

For EMPEs, our sample covers 159 countries in total and spans the period from 1990 to 2017. Among the 159 countries, there are 33 advanced economies (AEs), 53 emerging markets (EMs), and 73 low-income countries (LICs). Our definition of EMPEs identifies 32 events in advanced economies (3.5% of the AEs sample), 110 in emerging markets (7.4% of the EMs sample), and 140 in low-income countries (6.8% of the LICs sample). Figure 2.2 shows the frequency of EMPEs by country group. First, we note that there are no obvious clusters or global waves of EMPEs, especially in EMs and LICs, while there are two clusters of EMPEs in AEs, one in the early 1990s and another capturing the global financial crisis followed by the European debt crisis. Second, EMPEs in EMs and LICs are also heterogeneous, spanning events in financially-closed economies (e.g., India, 1991) and events in which large capital outflows were absorbed by the depletion of reserves (e.g., China, 2016). This implies that EMPEs in EMs and LICs may be harder to predict. Third, for EMs, EMPEs are more frequent than SSGIs, making the prediction of EMPEs more salient from year to year.

2.4 Explanatory Indicators of External Crises

Table 2.2 lists the set of explanatory indicators for predicting external crises (both SSGIs and EMPEs) under our consideration. Our selection is based on whether the variable is associated with clear economic channels and interpretable mechanisms according to the different generations of the academic literature on external crises. In first-generation models developed by Krugman (1979) and Flood and Garber (1984), a government which runs a fiscal deficit in a fixed exchange rate regime carries out policies such as depleting reserves and increasing the money supply to finance the deficit, which eventually causes the collapse of the pegged regime. Thus, in this paper we consider indicators related to the fiscal balance, changes in the money supply, reserves coverage, and dummy variables for the exchange rate regime. In second-generation models pioneered by Obstfeld (1996), the collapse of a fixed exchange rate regime depends on the government's willingness to maintain it given the government's objectives on employment and/or output. With that in mind, we consider indicators related to unemployment rate changes and real GDP growth. Third-generation models (Dornbusch et al., 1995; Mendoza, 2002) explore the role of the external and financial sectors in causing currency crises and sudden stops. To capture the important role of an overheating external sector and large exposure to international capital markets, we include liability stock variables measuring external leverage in different sectors, along with capital account openness measures and global factors indicating shocks to global risk and liquidity.
Following Basu et al. (2017), we include medium-term growth variables so as to capture sustained credit booms, building bubbles in asset prices and foreign investor sentiment, as well as growth in the construction and financial sectors' contributions to GDP. To capture bubbles which may already have begun to burst, we include sudden slowdowns in the growth rate of asset prices and the real effective exchange rate (i.e., negative acceleration) and one-year changes in all debt stock variables. In addition, to capture large volumes of capital inflows and currency mismatches which increase the vulnerability to currency devaluations, we include the current account balance and variables measuring the degree of foreign exchange exposure. Moreover, we include variables summarizing the degree of private-sector cushioning, such as corporate liquidity buffers.

In addition to the variables chosen according to the three generations of models, we also include 5-year cumulative inflation to capture deteriorations in the currency's real value, variables capturing current account shocks, contagion variables, and political shocks. For crisis prediction purposes, we use one-year lagged values for all explanatory variables and drop all observations with explanatory variables from in-crisis years.[13] We note here that including too many variables, as well as including only a select few, opens the door to conceptual overfitting. In our effort to conduct a valid horse race, we do not revise the list of variables after estimating models on the training set and testing models on the test set, as that would amount to selecting variables according to performance on the test set.

[13] With the exception of scheduled amortization, for which we use the year t value to predict crises in year t, because year t's scheduled amortization is calculated using only information up to year t-1.

Table 2.2: List of Explanatory Indicators

Variable | Source

First generation (Krugman, 1979; Flood and Garber, 1984)
Fiscal balance/GDP | WEO
5-year change in M2/GDP | WEO
Reserves/M2 and Reserves/GDP | WEO
Dummies for hard peg and float | IRR
Dummy for parallel market | IRR

Second generation (Obstfeld, 1996)
Change in unemployment rate | WEO
Real GDP growth | WEO

Third generation (Dornbusch et al., 1995; Mendoza, 2002)
Liability stocks
External debt/GDP | EWN, WEO
External debt/exports | EWN, WEO
Private external debt/GDP | WEO
Bank external debt/GDP | WEO
Non-bank private external debt/GDP | WEO
Total and external public debt/GDP | WEO
Cross-border interbank liabilities/GDP | BIS
External equity liabilities/GDP | BLS, BGJS
Private credit/GDP | WBWDI
Household liabilities/GDP | OECD
Foreign liabilities/Domestic credit | Fitch, CSD
Reserves/short-term debt | WEO

Private-sector buffers
EMBI spread (level and change) | Bloomberg
Corporate sector returns | CVU, CSD
Corporate default probability | CVU, CSD
Interest coverage ratio | CVU, CSD
Price-to-earnings ratio | CVU
Bank returns on assets | Fitch, CSD
Non-performing loans/total loans | Fitch, CSD
Banks' capital-to-asset ratio | Fitch, CSD
Loan-to-deposit ratio | CSD
Primary gap/GDP | WEO
Inflation | WEO

Flows and mismatch
Current account balance/GDP | WEO
Debt service/exports | WEO
Amortization/exports | WEO
Share of non-investment grade debt | CSD
FX share of public debt | WEO
FX share of external debt | BLS, BGJS
FX share of household and non-financial corporate credit | CSD
Net open FX position/GDP | BLS, BGJS
Net open FX debt position/GDP | BLS, BGJS
Inflow and outflow restrictions | CI, FKRSU

Medium-term building bubbles
5-year growth in private sector credit/GDP | WBWDI
5-year growth in housing prices | HPD, GPG
5-year growth in stock prices | Bloomberg
5-year growth in REER | INS
5-year growth in cross-border interbank liabilities/GDP | BIS
5-year growth in external debt/GDP | EWN, WEO
5-year growth in external equity liabilities/GDP | BLS, BGJS
5-year growth in contribution of construction and finance to GDP | OECD

Near-term bursting bubbles
Change in reserves/GDP | FFA
REER acceleration* | INS
Real house prices acceleration | HPD, GPG
Real stock prices acceleration | Bloomberg
Change in all liability stocks | -

Global shocks
Federal funds rate (level and change)+ | Haver, WX
VIX | Haver
US NEER change | Haver
US 10-year/3-month yield spread | Haver
TED spread | Haver

Law of one price
5-year cumulative inflation | WEO

Current account shocks
Real growth in exports | WEO
Change in terms of trade (ToT) | WEO
Reserves/imports | WEO
Absolute oil balance/GDP | WEO

Contagion
Deviation of export partner growth from 5-year trend | WEO
Interbank liabilities/GDP to AEs in financial crisis | BIS, LV
Frequency of banking crises in AEs | BIS, LV
Similarity to last year's crises | -

Political shocks
Political violence | INSCR
Successful coup | INSCR

Notes. WEO = IMF World Economic Outlook; BIS = Bank for International Settlements International Banking Locational Statistics; CSD = IMF Common Surveillance Database; CVU = IMF Corporate Vulnerability Utility; INS = IMF Information Notice System; FFA = IMF Financial Flows Analytics; WBWDI = World Bank World Development Indicators; HPD = OECD Housing Prices Database; GPG = Global Property Guide; EWN = External Wealth of Nations Dataset, Lane and Milesi-Ferretti (2007); INSCR = Integrated Network for Societal Conflict Research; IRR = Ilzetzki et al. (2019); LV = Laeven and Valencia (2012); BLS = Benetrix et al. (2015); BGJS = Benetrix et al. (2020); CI = Chinn and Ito (2006); FKRSU = Fernandez et al. (2016); WX = Wu and Xia (2016). * Acceleration is defined as the percentage change of the growth rate of the variable. + Values are replaced by Wu-Xia shadow rates whenever they are available (Wu and Xia, 2016).

2.5 Model Design

We discuss in detail the design of our model training and testing, including hyperparameter choice (Subsection 2.5.1), testing and tuning design (Subsection 2.5.2), and crisis probability mapping (Subsection 2.5.3).
2.5.1 Hyperparameter Tuning

The horse race consists of three stages in which different datasets are used: (a) tuning, (b) training, and (c) testing. The datasets used are the "validation set", "training set", and "test set", respectively. Simply put, tuning is a process conducted before training a model to find the optimal hyperparameters of the model; training is the process in which a model is estimated after all hyperparameters have been chosen optimally in the tuning stage. In the effort to conduct a horse race, testing is an evaluation process in which all completely estimated models are evaluated based on their out-of-sample performance on the test set(s).

The most common tuning method in machine learning is k-fold cross-validation, introduced by Stone (1974). It randomly partitions the training set into k complementary subsets of approximately equal size, which are called validation sets (or validation folds in the case of k-fold cross-validation). For each validation set, a model is trained on its training counterpart, which consists of the remaining k-1 validation sets, and then evaluated on the holdout validation set. Therefore, k models are estimated and evaluated, with each of the k validation sets used only once as the validation data. The out-of-sample performances of the k models on the k complementary validation sets are then averaged for the evaluation of the set of hyperparameters. In addition to being used for hyperparameter optimization, the average out-of-sample performance over validation sets can be seen as a proxy for a model's out-of-sample performance on the test set. Hence, we refer to the average out-of-sample performance over validation sets whenever we want to make decisions based on how the model will perform out of sample on the test set before conducting the final test.[14]

[14] In this paper, test sets are quarantined and never used before the final testing is conducted. This step is critical for a fair evaluation of competing models.

Different methods have different sets of hyperparameters to be tuned. In Elastic Net, the parameter α controlling the relative contributions of the L1 and L2 penalties in the overall penalty is the only hyperparameter to be tuned.[15] In Random Forests, there are three categories of hyperparameters to be tuned: hyperparameters governing tree size, the maximum number of surrogate splits the tree finds at each split, and the number of features to consider at each split. In the standard Random Forests developed by Breiman (2001), bootstrap aggregating and random feature sampling help prevent overfitting, so trees are allowed to grow fully. However, this approach may not be the best in the context of crisis forecasting, where crisis mechanisms constantly change over time. We find that Random Forests consisting of fully-grown trees are prone to severe overfitting when applied to our external crisis prediction problem, while controlling tree size helps reduce overfitting significantly.[16] Therefore, we include parameters which control tree size in the set of hyperparameters to be tuned for all the tree-based ensemble methods we are using. Among all the parameters controlling tree size, including the minimum number of observations per leaf, the minimum number of observations per parent node, and the maximum number of splits, we choose to tune the minimum number of observations per leaf. Instead of tuning the number of trees, we fix it at 1000 so that there is a sufficiently large number of trees to reduce variance and thus stabilize prediction performance and variable importance. For RUSBoost, the hyperparameters to be tuned are the minimum number of observations per leaf, the maximum number of surrogate splits the tree finds at each split, and the learn rate for shrinkage.

[15] For a given α, the shrinkage coefficient λ is chosen to minimize the deviance in the training set.

[16] As mentioned previously, we compare the average out-of-sample performance across validation sets as a proxy for out-of-sample performance, instead of assessing the out-of-sample performance on test sets.
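As an illustration of how such RUSBoost hyperparameters might look in code, the sketch below uses the RUSBoostClassifier from the imbalanced-learn package. This is our assumption of a convenient implementation, not necessarily the software used in this chapter; the parameter values are illustrative rather than the tuned ones, and scikit-learn trees do not implement surrogate splits, so that hyperparameter has no direct counterpart here.

from imblearn.ensemble import RUSBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# base learner: a binary classification tree whose size is controlled
# by the minimum number of observations per leaf
base = DecisionTreeClassifier(min_samples_leaf=10)

# boosted ensemble with random under-sampling of the majority class;
# the base learner is passed via `estimator` in recent imblearn releases
model = RUSBoostClassifier(estimator=base, n_estimators=1000,
                           learning_rate=0.1, random_state=0)
# model.fit(X_train, y_train)
# scores = model.predict_proba(X_test)[:, 1]  # composite scores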
For RUSBoost , the hyperparameters to be tuned are minimum observations per leaf, maximum number of surrogate splits the tree finds at each split, and the learn rate for shrinkage. 14 In this paper, test sets are quarantined and never used before the final testing is conducted. This step is critical for a fair evaluation of competing models. 15 For a given a , the shrinkage coefficient >. is chosen to minimize the deviance in the training set. 16 As mentioned previously, we compare the average out-of-sample performance across validation sets as a proxy for out-of-sample performance, instead of assessing the out-of-sample performance on test sets. 76 We follow the signal evaluation framework in the empirical literature on early warning systems to evaluate model performance by maximizing the signal relative to the noise. 17 Hence, our evaluation metric to be minimized is defined as the unweighted sum of false alarms ( the percentage of risk flags that is not followed by a crisis) and missed crises ( the percentage of safe flags that is followed by a crisis), which we refer to as sum of errors (SOE) hereafter. Due to the rarity of crisis in our sample (for both SSGI and EMPE), this evaluation metric attaches significantly higher costs to missed crises than to false alarms. It is also used as the loss function in both tuning and training stages to find optimal hyperparameters and estimate models. In addition, SOE is used to calculate optimal thresholds for indicators in signal extraction approach, which implies that the optimal threshold is the one where the vertical gap between the cumulative distribution functions (CDFs) of crisis and non-crisis observations is maximized (as illustrated in Figure 2.3). The weight of each variable in signal extraction approach is calculated as ( 1 - z) / z where z is the sum of errors calculated from applying the respective optimal threshold and flagging criteria to the series of variable values. In the testing stage, thresholds over composite scores (the aggregate vulnerability index in signal extraction, estimated crisis probability in regularized linear regression models, and the output scores in tree-based ensemble models) are needed to generate binary flags for observations in the test set( s) . We calculate the optimal threshold as the one which minimizes sum of errors (SOE) in the training set, and then assign binary crisis labels according to that threshold (i.e., one if a composite score is above the threshold, zero if a composite score is below or equal to the threshold). 2.5.2 Testing and Tuning Sets We adopt a recursive testing procedure which we call rolling cutoff testing. It consists of multiple training and testing sets for the following range of cutoff years: 2007, 2009, 2011, 17 See Kaminsky et al. (1998), Chamon et al. (2007) , and Basu et al. (2017) for examples. 77 Figure 2.3: Threshold Calculation in Signal Extraction Approach 1 0 Threshold Threshold maximizes the distance between the two COFs Variable value (higher = more risky) Notes. From Basu et al. (2017). The cumulative distribution functions (GDFs) of crisis and non-crisis observations are plotted against the value of the variable in the previous year. Assuming that a higher value of the variable is associated with a higher risk of crisis , the crisis GDF lies to the right of the non-crisis GDF. 
2013, and 2015. For each cutoff year t, a model is estimated using all data up to year t-1, and then the model is tested on a two-year test set consisting of years t and t+1. For example, when the cutoff year is 2007, a model is estimated using all data up to 2006 and tested on a two-year set consisting of data from 2007 and 2008 to predict crises in 2008 and 2009, respectively. In the end, models are evaluated based on their average performance over all two-year test sets.[18] By limiting the training sets to historical data for each cutoff year, cutoff tests replicate how the model may be used in real-time analysis and, crucially, ensure that the testing is conducted fully out-of-sample. By choosing multiple cutoff years and aggregating out-of-sample performance, this recursive design prevents model evaluation from being determined by a single nonrepresentative test set and allows us to assess how a model updates over time, including its prediction performance and variable importance.

[18] The rolling cutoff testing approach can be designed in different ways. We use a test size of two years because it offers stability and interpretability to the model. For example, shorter (one-year) test sets may contain no crises, and longer (multi-year) test sets may deviate from the practical use of the models. We choose the first cutoff year as 2007 to test how a model performs on out-of-sample prediction of the global financial crisis while ensuring that we have sufficient training data.

As discussed before, we need to tune the hyperparameters empirically before the training process begins, for which k-fold cross-validation is the most common technique used in machine learning. However, it is not appropriate for our macro panel data, which exhibit temporal dependencies, because it randomly partitions the training set into non-clustered complementary subsets in which observations do not necessarily come from the same years. Also, such a random partition may leave common factors present in both the validation set and its training counterpart, which may not provide a good proxy for a model's true out-of-sample performance. Therefore, we need to design a tuning strategy as close as possible to the testing, taking the panel data structure into consideration as well. By modifying k-fold cross-validation, we design random year-block tuning to mimic the cutoff testing exercises. In random year-block tuning, the data in the training sets are partitioned into ten year-blocks of roughly equal size serving as validation sets, in each of which all observations come from the same set of consecutive years. For example, one possible partition along years for cutoff year 2007 could be 1990-1991, 1992, 1993-1994, 1995-1996, 1997-1998, 1999, 2000-2001, 2002-2003, 2004-2005, 2006. By partitioning along the year dimension only, it is ensured that no observation from the same year is present in both a validation set and its training counterpart. In contrast to using only historical data for each prediction in the testing, for each validation year-block we use all remaining years as the training counterpart, to avoid depletion of data within the training set. The set of hyperparameters is then chosen such that the average performance over all validation year-blocks is maximized.
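The year-block partition is straightforward to emulate. The following sketch of ours splits the unique training years into ten blocks of consecutive years at random cut points and yields training/validation index pairs, in the spirit of the procedure described above; it is an illustration, not the exact partitioning code used here.

import numpy as np

def year_block_folds(years, n_blocks=10, seed=0):
    # Yield (train_idx, val_idx) pairs in which each validation fold
    # is a block of consecutive years and the training fold is all
    # remaining years. years: array of the year of each observation.
    rng = np.random.default_rng(seed)
    uniq = np.sort(np.unique(years))
    # random cut points give blocks of roughly equal but varying size
    cuts = np.sort(rng.choice(np.arange(1, uniq.size),
                              size=n_blocks - 1, replace=False))
    for block in np.split(uniq, cuts):
        val = np.isin(years, block)
        yield np.where(~val)[0], np.where(val)[0]

# usage: for each hyperparameter candidate, average the sum of errors
# over the folds produced by year_block_folds(df["year"].to_numpy())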
To perform hyperparameter optimization, we make use of Bayesian optimization, which is an efficient way to optimize hyperparameters for machine learning.[19]

[19] The number of optimization iterations in Bayesian optimization is set to 500 to guarantee the convergence of our loss function.

2.5.3 Crisis Probabilities

We convert the composite scores produced by machine learning models into crisis probabilities using a non-parametric approach proposed in Basu et al. (2017). The first step is to sort country-year observations according to the composite scores assigned by the model. Then we plot the history of crisis realizations (i.e., one if followed by a crisis, zero otherwise) against the composite scores, which are in ascending order. Next, we use a modified version of the Hodrick-Prescott filter that replaces the time variable by the score value to generate a continuous line representing the average of the binary crisis realizations around each value of the composite score. Finally, the continuous line is adjusted according to a monotonicity restriction so that the mapping is monotonically increasing. Flat portions may be observed in the mapping, indicating that it is not a one-to-one mapping, i.e., the crisis probability may not change as the composite score changes (Figure 2.4).

Figure 2.4: Mapping from Composite Score to Crisis Probability
Notes. From Basu et al. (2017). Binary crisis realizations are plotted against the composite value, together with their average over crises and the average after the monotonicity restriction is imposed.

2.6 Empirical Results

In this section, we present the horse race results (Subsection 2.6.1), delve into the model properties implied by the winning models for SSGIs and EMPEs (Subsections 2.6.2, 2.6.3, and 2.6.4), and show the evolution of event probabilities (Subsection 2.6.5).

2.6.1 The Horse Race

In this section we summarize the empirical results from the application of the signal extraction approach and machine learning techniques to predict external crises. Figure 2.5 shows the average out-of-sample performance of different models in predicting SSGIs and EMPEs (in different country groups), in terms of the average sum of errors over five test sets with cutoff years 2007, 2009, 2011, 2013, and 2015, respectively.

Figure 2.5: Out-of-Sample Performance
Notes. The figure plots the average sum of errors (from 50% to 100%) by event type (SSGI, EMPE-AEs, EMPE-EMs, EMPE-LICs) for the signal extraction approach, LASSO, Ridge, Elastic Net, Random Forests, and RUSBoost.

Our findings on model performance are as follows:

• For SSGIs, the signal extraction approach performs the best among all models, achieving an average sum of errors as low as 70 percent. Regularized linear regression models in general perform poorly, while tree-based ensemble methods deliver mixed results: Random Forests perform significantly better than RUSBoost and achieve the best performance among the machine learning techniques.

• For EMPEs in advanced economies, the signal extraction approach performs the best as well, achieving an average sum of errors slightly below 65 percent. Similar to the results for SSGIs, regularized linear regression models perform poorly.
2.6 Empirical Results

In this section, we present the horse race results (Subsection 2.6.1), delve into the model properties implied by the winning models for SSGIs and EMPEs (Subsections 2.6.2, 2.6.3, and 2.6.4), and show the evolution of event probabilities (Subsection 2.6.5).

2.6.1 The Horse Race

In this section we summarize the empirical results from applying the signal extraction approach and machine learning techniques to predict external crises. Figure 2.5 shows the average out-of-sample performance of different models in predicting SSGIs and EMPEs (in different country groups), in terms of the average sum of errors over five test sets with cutoff years 2007, 2009, 2011, 2013, and 2015, respectively (a minimal sketch of this evaluation metric follows the list of findings below).

Figure 2.5: Out-of-Sample Performance. [Figure: average sum of errors (50-100 percent) by event type (SSGI, EMPE-AEs, EMPE-EMs, EMPE-LICs) for Signal Extraction, LASSO, RIDGE, Elastic Net, Random Forest, and RUSBoost.]

Our findings on model performance are as follows:

• For SSGIs, the signal extraction approach performs the best among all models, achieving an average sum of errors as low as 70 percent. Regularized linear regression models in general perform poorly, while tree-based ensemble methods deliver mixed results: Random Forests perform significantly better than RUSBoost and achieve the best performance among the machine learning techniques.

• For EMPEs in advanced economies, the signal extraction approach performs the best as well, achieving an average sum of errors slightly below 65 percent. Similar to the results for SSGIs, regularized linear regression models perform poorly. Tree-based ensemble methods in general perform relatively well, although not as well as the signal extraction approach.

• For EMPEs in emerging markets, RUSBoost delivers the best performance, although the differences among models do not appear to be large. In contrast to predicting SSGIs well, the signal extraction approach performs the worst in predicting EMPEs for the same set of emerging markets.

• For EMPEs in low-income countries, the signal extraction approach and RUSBoost deliver almost the same performance, both achieving the lowest average sum of errors among all models. The performances of regularized linear regression models and Random Forests do not seem to be distinguishable.
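The evaluation metric used throughout the horse race can be made concrete with the following minimal sketch (an assumed implementation consistent with the definition used in the paper: the unweighted sum of the missed-crisis rate and the false-alarm rate).

import numpy as np

def sum_of_errors(actual, flagged):
    """actual, flagged: binary arrays (1 = crisis / flagged risky)."""
    actual = np.asarray(actual, dtype=bool)
    flagged = np.asarray(flagged, dtype=bool)
    missed = np.mean(~flagged[actual])        # share of crises not flagged
    false_alarms = np.mean(flagged[~actual])  # share of non-crises flagged
    return missed + false_alarms

# Example: two missed crises out of four and one false alarm out of six
# give 0.5 + 0.167, i.e., roughly 67 percent.
print(sum_of_errors([1,1,1,1,0,0,0,0,0,0], [1,1,0,0,1,0,0,0,0,0]))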
2.6.2 Signal Extraction versus RUSBoost for SSGIs in EMs

In this subsection and the next, we focus on the set of emerging markets and compare the model properties of the signal extraction approach and RUSBoost for predicting SSGIs and EMPEs. The reason for this comparison is that, on the same set of emerging markets, the signal extraction approach performs the best while RUSBoost performs the worst for the prediction of SSGIs; by contrast, RUSBoost performs the best while the signal extraction approach performs the worst for the prediction of EMPEs. We therefore wish to explore the reasons behind this reversal.

The differences in model performance are very likely to come from the different treatment of variables by different models. To dig deeper into this question, we first examine the variable importance rankings from the different models (signal extraction approach versus RUSBoost) for the different types of events (SSGIs and EMPEs), and then explore the non-linearities, non-monotonicities, and interactions among variables implied by the different approaches.

Figure 2.6 shows the top-15 variables in terms of their contributions to predicting SSGIs in emerging markets: the left panel charts the top-15 important variables implied by the signal extraction approach (the winning model that delivers the best out-of-sample performance) and the right panel charts the top-15 variables implied by RUSBoost.

Figure 2.6: Top-15 Important Variables for SSGIs in EMs. [Figure: two panels of variable importance (left: signal extraction approach; right: RUSBoost), with variables color-coded by sector (External, Financial, Financial-Real, Fiscal, Real, Global).] Notes: The horizontal axes plot the variable importance metric from the authors' calculations (for the signal extraction approach) and from algorithm outputs (for tree-based ensemble methods). The metric in the signal extraction approach is the weight of the variable. The metric in tree-based ensemble methods is the estimate of predictor importance output by the algorithm, calculated by summing the changes in the risk due to splits on every predictor and dividing the sum by the number of branch nodes. If the tree is grown without surrogate splits, this sum is taken over the best splits found at each branch node; if the tree is grown with surrogate splits, this sum is taken over all splits at each branch node, including surrogate splits. Estimates of predictor importance are normalized to sum to one.

First, according to the winning signal extraction approach (i.e., the model delivering the lowest average out-of-sample sum of errors), the most important predictors are debt liabilities and the asset price/credit bubbles that they finance. Important predictors include global factors (e.g., the TED spread, the incidence of financial crises in AEs, and inter-bank liabilities to banks in these AEs), medium-term building bubbles (e.g., stock prices, house prices, and the real effective exchange rate), and external debt measures (e.g., scheduled amortization and cross-border inter-bank debt). Second, although many of the important variables implied by the signal extraction approach and RUSBoost overlap, RUSBoost does not put as much weight on the financial-real sector as the signal extraction approach. Instead, global factors such as global risk appetite, U.S. monetary policy, Dollar strength, and the change in export partners' growth are found to be predictive by RUSBoost.

Some difference in variable importance is to be expected, because machine learning methods in general, and tree-based ensemble methods in particular, allow for the detection of more complex relationships between explanatory variables and crisis incidences, which the signal extraction approach does not. However, despite these capabilities, RUSBoost does not appear to improve predictive performance. There are several possible reasons for this observation.

Firstly, it is possible that tree-based algorithms with many splits and leaves pick up noise in the training set that is irrelevant to the future, and thus hurt predictive power. Figure 2.7 shows examples of explanatory variables that exhibit non-monotonic effects on SSGI probability implied by RUSBoost: dollar appreciation, the deviation of export partner growth from its 5-year trend, the Federal Funds Rate, and the percentage change in the Terms of Trade. The non-monotonic nature of the derived mappings does not make much sense: according to the sudden stops literature, we expect, on average, a non-decreasing, non-increasing, non-decreasing, and non-increasing effect on crisis probability of these four variables, respectively. These non-monotonicities are likely to be caused by a cluster of crises that may not be very representative. In general, such "incorrect non-monotonicities", which can hurt out-of-sample properties, are especially likely to be derived from samples with few crisis observations, which our macro panel data indeed feature, and from models that tend to learn "aggressively" from individual crisis observations, such as tree-based ensemble methods. By contrast, the signal extraction approach generates a single split and is less affected by non-monotonicities far from the average.
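Curves like those in Figure 2.7 can be traced with a partial-dependence-style calculation; the text does not specify the exact procedure, so the following is an assumed sketch. `model.predict_proba` follows the scikit-learn convention, `X` is a two-dimensional feature array, and `j` indexes the variable of interest.

import numpy as np

def average_probability_curve(model, X, j, n_grid=50):
    """Average predicted crisis probability as variable j moves over its range."""
    grid = np.linspace(X[:, j].min(), X[:, j].max(), n_grid)
    curve = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, j] = v  # set variable j to v for all observations
        curve.append(model.predict_proba(X_mod)[:, 1].mean())
    return grid, np.array(curve)

A non-monotonic curve from such a calculation, with the probability falling and then rising as, say, dollar appreciation increases, is exactly the pattern criticized above.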
Secondly, it is possible that the signal extraction approach performs relatively well for SSGIs in particular because of the long history of the empirical crisis prediction literature with such models and events. In other words, in cases where interactions between two variables turned out to be very important, it is possible that the literature has already identified alternative single variables which can be used instead, and which capture the key drivers behind the interactions, for example, variables capturing the currency composition of debt liabilities. In this case, the ability to handle novel interaction terms, which machine learning provides, may not be very valuable.

Figure 2.7: Examples of Non-Monotonic Effect on SSGI Probability Implied by RUSBoost. [Figure: average crisis probability plotted against dollar appreciation, the deviation of export partner growth from its 5-year trend, the Federal Funds (shadow) rate, and the percentage change in the Terms of Trade.]

2.6.3 Signal Extraction versus RUSBoost for EMPEs in EMs

Let us next turn to EMPEs in emerging markets, for which RUSBoost outperforms the signal extraction approach. As described above, EMPEs in EMs and LICs appear to exhibit a high degree of heterogeneity, and it is possible that the signal extraction approach cannot accommodate and capture the diverse features of these events, which requires allowing for non-linearities, non-monotonicities, and interactions.

Figure 2.8 shows the top-15 variables in terms of their contributions to predicting EMPEs in emerging markets: the left panel charts the top-15 important variables implied by RUSBoost (the winning model that delivers the best out-of-sample performance) and the right panel charts the top-15 variables implied by the signal extraction approach.

Figure 2.8: Top-15 Important Variables for EMPEs in EMs. [Figure: two panels of variable importance (left: RUSBoost; right: signal extraction approach), with variables color-coded by sector. Notes: See the notes to Figure 2.6.]

First, the best predictors implied by RUSBoost come from several different crisis generations. External variables such as reserve adequacy metrics are complemented by measures of equity outflows, which generate depreciations even if they do not tighten debt constraints. In addition, fiscal vulnerabilities (e.g., the EMBI sovereign spread and the change in public debt) and competitiveness indicators (e.g., cumulative inflation) are highly important, while no global factors or contagion variables appear to be important for predicting EMPEs in EMs. The diversity in the top-15 variables is consistent with the heterogeneity among EMPEs, implying that there are no dominant mechanisms or common factors that explain the occurrence of EMPEs in EMs. Because RUSBoost builds many trees by resampling and reweighting, it may attenuate the heterogeneity problem in EMPEs and thus deliver better prediction performance.

Second, despite many overlapping important variables implied by the signal extraction approach and RUSBoost, the signal extraction approach sees greater predictive power from the
financial-real sector, including house and stock price accelerations and medium-term building bubbles in house prices, somewhat consistent with the results for SSGIs.

Figure 2.9 shows that RUSBoost is able to identify important non-monotonicities in the effects on EMPE probability. Interesting examples are the 5-year changes in broad money to GDP and in private credit to GDP. For both 5-year growth measures, the average EMPE probability first decreases as they rise from negative values toward zero, and then increases sharply as they cross zero and become positive. While the second pattern is consistent with the first- and third-generation models in the literature, in which a sustained increase in the broad money supply and in private credit may give rise to higher vulnerability to currency market turbulence, the first pattern captures situations in which domestic financial markets become less developed or less liquid, which may lead to self-fulfilling panics and increase the probability of an EMPE. In addition, the negative growth rates may indicate situations in which broad money and private credit (as a percentage of GDP) start increasing after a period of downward adjustment, which typically occurs at the beginning of economic recoveries, during which an economy exhibits high growth in GDP, employment, corporate profits, and so on; EMPE probability is therefore reduced during such periods of strong economic performance. These non-monotonicities cannot be captured by the signal extraction approach, given its single-threshold design, while they can be well captured by tree-based algorithms that segment variable effects by looking for different thresholds at different splits. Therefore, in the presence of this kind of non-monotonicity mechanism, tree-based methods may be able to deliver better prediction performance.

Figure 2.9: Examples of Non-Monotonic Effect on EMPE Probability Implied by RUSBoost. [Figure: average crisis probability plotted against the 5-year growth in broad money to GDP and the 5-year growth in private credit to GDP.]
Moreover, RUSBoost captures interesting non-linearity and interaction effects between explanatory variables, which may also contribute to its better performance in predicting EMPEs in EMs. Figure 2.10 shows examples of non-linearity and interaction effects on EMPE probability implied by RUSBoost. The upper-left panel shows the effect of the change in reserves to GDP on EMPE probability, conditional on reserves to GDP being in the bottom or top 25th percentile. While reserves depletion increases EMPE probability in both scenarios, it has a larger effect when the existing level of reserves to GDP is lower. The upper-right panel shows the effect of the change in the EMBI sovereign spread on EMPE probability, conditional on the EMBI being in the bottom or top 25th percentile. Similarly, an increase in the EMBI sovereign spread makes emerging markets more vulnerable to currency market turbulence when the existing level of the EMBI sovereign spread is higher. These results demonstrate non-linearities in the effects of reserves to GDP and the EMBI sovereign spread on EMPE probability: changes in the variable values have disproportionate effects on the event probabilities. The lower-left panel shows the effect of the change in public debt to GDP on EMPE probability, conditional on inflation being in the bottom or top 25th percentile. An increasing public debt burden has a larger effect on EMPE probability when inflation is higher, which may suggest a link with the channel of fiscal dominance (Blanchard, 2004; Ahmed et al., 2019): emerging markets with higher existing inflation are constrained in their ability to reduce their public debt burden through a central-bank-engineered decrease in the real interest rate, and thus see a larger increase in the probability of EMPEs for a given increase in the public debt burden. The lower-right panel shows the effect of the change in foreign liabilities to domestic credit on EMPE probability, conditional on the 5-year growth in private credit to GDP being in the bottom or top 25th percentile. Although an increase in the ratio of foreign liabilities to domestic credit increases EMPE probability in both scenarios, it has a larger effect when the 5-year growth in the ratio of private credit to GDP is in the top 25th percentile. In practical terms, the mismatch captured by the foreign liability share matters more in emerging markets where there is a large medium-term vulnerability as well, i.e., a large ongoing private sector credit boom.

Figure 2.10: Examples of Interaction Effect on EMPE Probability Implied by RUSBoost. [Figure: four panels of average crisis probability against the change in reserves/GDP, the change in the EMBI sovereign spread, the change in public debt to GDP, and the change in foreign liabilities to domestic credit, each conditional on the bottom and top 25th percentile of the interacting variable.]
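Interaction effects like those in Figure 2.10 can be traced by computing the average-probability curve separately for observations in the bottom and top 25th percentile of a conditioning variable. This assumed sketch reuses the hypothetical average_probability_curve helper from the earlier sketch.

import numpy as np

def conditional_curves(model, X, j, cond, n_grid=50):
    """Curves for variable j, conditional on percentile bins of variable cond."""
    lo, hi = np.percentile(X[:, cond], [25, 75])
    curves = {}
    for label, mask in [("bottom 25th", X[:, cond] <= lo),
                        ("top 25th", X[:, cond] >= hi)]:
        curves[label] = average_probability_curve(model, X[mask], j, n_grid)
    return curves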
2.6.4 EMPEs in Advanced Economies and Low-Income Countries

While EMPEs in AEs and LICs are predicted well by the signal extraction approach, it is worth noting that the difference in out-of-sample performance (in terms of the sum of errors) between the signal extraction approach and RUSBoost in predicting EMPEs in LICs appears to be very small.

Footnote 20. In fact, we find that EMPEs in LICs are sometimes predicted better by the signal extraction approach and sometimes by RUSBoost, depending on whether FX share data are included.

Figure 2.11 shows the top-15 variables in terms of their contributions to predicting EMPEs in AEs and LICs: the left panel charts the top-15 important variables in AEs and the right panel charts the top-15 important variables in LICs, both implied by the signal extraction approach (the winning model that delivers the best out-of-sample performance in terms of the sum of errors). For predicting EMPEs in AEs, the most important predictors are indicators of external debt (e.g., private external debt, amortization, and the FX and external shares of public debt); variables that capture external sector vulnerabilities contribute the most to the overall predictive power for EMPEs in AEs. For predicting EMPEs in LICs, stock market overvaluation (e.g., the price-to-earnings ratio) is found to be the most predictive by the signal extraction approach, and other important predictors include net open FX position measures (e.g., the net open overall and debt FX positions to GDP), indicators associated with first-generation currency crises (e.g., cumulative inflation, fiscal vulnerabilities, and the exchange rate regime), and measures of banking system health (e.g., the share of non-investment grade debt and the capital-to-assets ratio).

Figure 2.11: Top-15 Important Variables for EMPEs in AEs and LICs. [Figure: two panels of variable importance (left: AEs; right: LICs), both implied by the signal extraction approach, with variables color-coded by sector. Notes: See the notes to Figure 2.6.]

2.6.5 Event Probabilities

We now examine the evolution of the average probabilities of SSGIs and EMPEs implied by the winning models, as shown in Figure 2.12 (left panel for SSGIs in EMs, right panel for EMPEs in AEs, EMs, and LICs). First, SSGI probabilities peaked in the late 1990s and the late 2000s, capturing well the economic upheavals of the Asian financial crisis and the global financial crisis. After the global financial crisis, EMs' vulnerabilities to SSGIs remain low. Second, for EMPEs, low-income countries saw the highest vulnerabilities and advanced economies the lowest before the global financial crisis, except for a spike in AEs in the early 1990s. After a global spike across all country groups around the global financial crisis, EMs' and LICs' vulnerabilities were sustained while AEs had a lull.
Figure 2.12: Average Probabilities for SSGIs and EMPEs in Different Country Groups. [Figure: average predicted probabilities over 1990-2015; left panel: SSGIs in EMs; right panel: EMPEs in AEs, EMs, and LICs.]

2.7 Conclusion

Relative to traditional approaches such as the signal extraction approach, machine learning allows for greater flexibility in the functional relation between explanatory variables and crisis outcomes, which can improve the performance of crisis prediction models and deliver a better sense of how variables are related to crises, in non-linear fashions and in interaction with other variables. Such improvements are potentially valuable in addressing many criticisms by practitioners of the existing signal extraction approach.

However, machine learning is not a panacea. Crucially, the historical macroeconomic data that can be brought to bear on the prediction of external crises are small in some important senses, and there are no free lunches, so the same limitations that apply to the existing signal extraction approach may also limit the performance of machine learning techniques.

Bearing in mind the promises and limits of the application of machine learning to external crisis prediction, in this paper we have designed a rigorous testing procedure that takes into account the structure of macro data and the periodic re-estimation of the model over time in practice. We have attempted to make sure that the testing procedure is rigorous enough to prevent manipulation of the testing, while also giving some sense of which models are best able to capture the shape of the relation between explanatory variables and crisis outcomes, to be robust to uncertainty in the global regimes, and to learn quickly from new crises that enter the historical training set over time.

Our results indicate that there is no clear superiority of machine learning relative to the traditional approach for all kinds of external crises and for all country groupings.
Specifically, machine learning does not outperform the signal extraction approach for predicting sudden stops with growth impacts in emerging markets, but a tree-based ensemble machine learning technique, especially one that addresses imbalanced data, does appear to outperform the signal extraction approach for predicting exchange market pressure events in the same set of emerging markets. Moreover, the signal extraction approach also performs the best for predicting exchange market pressure events in advanced economies and low-income countries, although tree-based ensemble machine learning techniques are able to deliver comparable performance for the latter.

We are still exploring the reasons behind this result. We may conclude that perhaps the signal extraction approach performs relatively well for SSGIs precisely because such events have been extensively studied by both the theoretical and empirical literatures, and wherever there have been significant non-monotonicities and interactions that the signal extraction approach has had trouble with, substitute variables have been found. On the other hand, for EMPEs, which are more heterogeneous, encompass a wide range of mechanisms, and have been less well studied, it is likely that the identification of new non-monotonicities and interactions is highly valuable, and machine learning is able to reap large benefits by helping the early-warning modeler with this task.

Chapter 3

Performance Uncertainty and Ranking Significance of Early-Warning Models

3.1 Introduction

Anticipating and preparing for crises lie at the heart of the mandate of central banks, yet they are intrinsically difficult tasks. Early-warning models were developed to tackle this challenge. The history of early-warning models goes back more than two decades, to when Kaminsky et al. (1998) introduced the signal extraction model and Frankel and Rose (1996) applied the logit regression model to predicting currency crises. As machine learning achieved success in many areas over the past decade, such techniques were introduced to the early-warning literature and enriched the set of models by allowing more flexible relationships between crisis events and early-warning indicators. Despite the success of machine learning in many prediction areas, macroeconomic data and the early-warning exercise have unique features that may not guarantee better performance of machine learning than traditional statistical methods. Hence, in order to select one model out of many for predicting crises and informing policy making, it is important for researchers and policymakers to conduct a horse race and rank models.

Footnote 1. The work in this chapter is joint with Suman S. Basu (International Monetary Fund) and Roberto A. Perrelli (International Monetary Fund).

Footnote 2. Relevant papers include Chamon et al. (2007), Sevim et al. (2014), Xu et al. (2018), and Basu et al. (2019) for external crises; Holopainen and Sarlin (2017), Alessi and Detken (2018), and Lang et al. (2018) for banking crises; and Manasse and Roubini (2009), Savona and Vezzoli (2015), and Badia et al. (2020) for sovereign crises.

At the heart of ranking models lie model performance uncertainty arising from sampling and model ranking significance accounting for sampling errors.
Model performance uncertainty arising from sampling refers to the extent to which a model would have performed differently had it been estimated on a different dataset, and model ranking significance accounting for sampling errors refers to whether one model performs significantly better than another once model performance uncertainty arising from sampling is accounted for. Estimating model performance uncertainty and testing model ranking significance are especially important for early-warning model selection, as early-warning models are used not only for predicting crisis outcomes but also for understanding risk factors and informing policy decisions. In case of no significant difference in performance between a traditional statistical model and a machine learning model, the former should be recommended to policymakers for use in practice, given its higher degree of interpretability and stability. Hence, this paper addresses this problem by proposing approaches to estimate performance uncertainty and test ranking significance, and by illustrating the approaches in an early-warning framework for sudden stops.

Footnote 3. Although the randomness coded into many machine learning algorithms helps to prevent overfitting, it reduces the degree of interpretability and stability of such techniques.

Macroeconomic panel data in the early-warning framework are small in three important aspects. First, there are not many countries in the world and even fewer crisis events in the past, which means that historical data may hold only a few lessons for the future. Additionally, a high degree of heterogeneity is present among this small set of countries and their crises, which means that each country and its crises may play an important role in model estimation and testing. Hence, it follows that the set of countries and their crisis events matters for model performance and rankings. Second, infrequent but large global regime shifts limit the applicability of past lessons for future crisis prediction. The rarity of the shifts means that a few decades of available historical data cannot capture all the possible global regimes, and the large size of the shifts means that past information may suddenly become rather outdated. Hence, it follows that the set of global regimes contained in the set of years matters for model performance and rankings. Third, in addition to global regime shifts that affect countries simultaneously, countries may have been hit by country-specific idiosyncratic shocks, so that parts of their histories might have been different from what they would otherwise have been. Hence, it follows that the set of countries' histories matters for model performance and rankings.

The small-data nature of macroeconomic panel data strengthens the need for assessing model performance uncertainty arising from sampling, and for testing model ranking significance accounting for sampling. There is a one-to-one mapping between the three aspects in which macroeconomic panel data are small in the early-warning framework and three sources of data variation worth examining. First, data variation in the set of countries should be examined because countries and their crisis events are few and heterogeneous. Second, data variation in the set of years should be examined because global regime shifts are rare but significant.
Third, data variation in the set of countries' histories should be examined because countries may have been hit by country-specific idiosyncratic shocks and therefore have developed in a way different from what they otherwise would have.

To assess model performance uncertainty arising from the above three sources of data variation, we make use of jackknife resampling to obtain new samples and construct confidence intervals to represent the uncertainty. The jackknife method resamples the original dataset by dropping data, which means that new samples are subsamples of the original one. By specifying how the data to drop are selected, jackknife resampling allows us to impose priors on the dimension along which the dataset is resampled, so that we are able to isolate the source of data variation and estimate the model performance uncertainty arising from specific sources. In line with the three sources of data variation discussed above, the jackknife resampling is performed along three dimensions: (1) dropping countries, that is, dropping all years' data for some randomly chosen countries; (2) dropping years, that is, dropping all countries' data in some randomly chosen years; and (3) dropping country-year blocks, that is, dropping some randomly chosen blocks of a given number of years for some randomly chosen countries.

Also, we apply the jackknife resampling to the entire dataset before splitting it into training set and test set, for two reasons. For one, sampling errors can arise in any part of the data, regardless of how one splits the dataset into training and test sets. Specifically, when global regime shifts happened, or when countries were hit by country-specific idiosyncratic shocks, did not depend on how researchers or policymakers designed their training and testing scheme. For another, one should never manipulate the splitting of a given dataset into training and test sets by restricting all potential data variation to only the training set or only the test set.

Footnote 4. However, we acknowledge that this way of jackknife resampling on the entire dataset limits our ability to decompose model performance uncertainty into the parts arising from data variation in the training set and in the test set, which would require a more sophisticated design of resampling methods to investigate.

As for the percentage of data to drop, we choose to drop a fair fraction of the original data (10 percent and 5 percent), instead of dropping one single observation (i.e., a country-year pair in our data) as in standard jackknife resampling, for a few reasons. First, due to the cross-sectional and time-series dependence of macroeconomic panel data, dropping a single observation cannot generate enough data variation, which means that we need to drop a larger fraction of the data to generate enough variation for uncertainty assessment. Second, empirically, our first and second jackknifing methods place a lower bound on the percentage of data to drop, because they are designed to drop all years' data for some countries and all countries' data in some years. We also consider a fourth jackknifing method, which is to treat the data as i.i.d. and drop randomly chosen single country-year pairs, to examine how estimates of model performance uncertainty differ when the panel structure of our data is or is not accounted for.

Footnote 5. In the data in our illustrative example, there are in total 10 countries and 28 years.
Hence, the smallest proportion of data that can be dropped by dropping countries is 1/10, i.e., 10 percent, and the smallest proportion that can be dropped by dropping years is 1/28, i.e., around 3.6 percent. In order to have a fair comparison among the model performance uncertainty arising from all three sources of data variation, we need to drop the same proportion of data in all three jackknifing methods. It then follows that 10 percent is the lower bound on the fraction of data to drop.

In addition to estimating model performance uncertainty arising from sampling, we also make use of jackknife resampling to test model ranking significance. We argue that simply examining whether the confidence intervals of individual model performances overlap does not provide much evidence on model ranking significance, because those confidence intervals are generated by pooling performances calculated from different resamples. When testing the null hypothesis that one model performs significantly better than another, the comparison should instead be conducted on the same subsample generated by the jackknife resampling. Hence, we propose to construct confidence intervals of the conditional model performance difference, that is, confidence intervals of the difference in model performance calculated from the same jackknife-resampled dataset. A null hypothesis that the conditional model performance difference is equal to zero is then tested based on the confidence interval results.

Our approaches are illustrated in an early-warning framework for predicting sudden stops in capital flows for emerging market countries. Sudden stops have been disruptive crisis events for emerging market countries over the past three decades. The capital accounts of these countries are open enough for private capital inflows to accumulate, but not sufficiently liberalized for sudden outflows to be easily insured against. The danger of such brutal crisis events for emerging market countries has been learned the hard way, so it is important to monitor sudden stop risks, issue early warnings, and inform policy decisions. We consider two models: the signal extraction model, a simple statistical method that has been extensively used and tested in the early-warning literature (Berg et al., 2005), and random forests, a machine learning method that has proven successful in many prediction areas and been applied in the early-warning literature over the past decade.

Footnote 6. The most recent papers using random forests for crisis prediction are Basu et al. (2019) for external crises, Lang et al. (2018) for banking crises, and Badia et al. (2020) for sovereign crises.

There are four main findings: (1) confidence intervals for the signal extraction model are wider than those for random forests, for all types of jackknifing methods; (2) confidence intervals generated by dropping years are the widest for the signal extraction model, while confidence intervals generated by dropping country-year blocks are the widest for random forests; (3) there is not much difference among the model performance uncertainties arising from the different sources of data variation; and (4) the signal extraction model performs significantly better than random forests (at the 0.01 significance level) in fixed cutoff testing, while the signal extraction model and random forests do not perform significantly differently in rolling cutoff testing.

Footnote 7. In the fixed cutoff testing, where year 2007 is the cutoff year, a model is estimated on the training set, consisting of data from 1990 to 2007, and then tested on the test set, consisting of data from 2008 to 2017. In the rolling cutoff testing, where years 2007, 2009, 2011, 2013, and 2015 are cutoff years, a model is recursively estimated on the training set consisting of data before the cutoff year and then tested on the test set consisting of data in the next two years. In the end, model performances on the test sets from the different cutoff years are averaged.
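The ranking test just described can be sketched as follows. This is an assumed implementation: `jackknife_sample`, `split`, and `evaluate` are hypothetical helpers, and a percentile interval stands in for whatever confidence interval construction is used.

import numpy as np

def ranking_test(data, model_a, model_b, jackknife_sample, split, evaluate,
                 n_samples=200, alpha=0.01, seed=0):
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_samples):
        sub = jackknife_sample(data, rng)   # one jackknife subsample
        train, test = split(sub)            # the same split for both models
        diffs.append(evaluate(model_a, train, test)
                     - evaluate(model_b, train, test))
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    # Reject the null of equal performance if zero lies outside the interval.
    return (lo, hi), not (lo <= 0.0 <= hi)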
The rest of the paper is structured as follows: Section 3.2 discusses in detail the importance of assessing model performance uncertainty arising from sampling and testing model ranking significance accounting for sampling errors in an early-warning framework. Section 3.3 describes our data, including the crisis definition and explanatory indicators, and our choice of early-warning models. Section 3.4 explains our model estimation and testing, and our approaches to assessing model performance uncertainty and testing model ranking significance, including the resampling methodology and confidence interval construction. Section 3.5 presents and discusses our empirical findings on model performance uncertainty and model ranking significance. Finally, Section 3.6 concludes.

3.2 Small Macro Data

Suppose the true model of crisis forecasting is

y_{it} = f(x_{i,t-1}; \eta_{t-1}) + \epsilon_{it}, \qquad (3.1)

where y_{it} is the crisis event for country i in year t, taking the value of 0 when there is no crisis, or 1 if there is a crisis; x_{i,t-1} is a vector of country-specific explanatory indicators for country i in year t-1; f denotes a non-linear relation between x_{i,t-1} and y_{it}; \epsilon_{it} is an idiosyncratic shock; and \eta_{t-1} is a vector of global factors capturing the global regime, which affects the relation f between x_{i,t-1} and y_{it}.

Macroeconomic data in the early-warning framework are small in three important aspects. First, there are not many countries in the world and even fewer crisis events y_{it} in the past, which means that historical data may hold only a few lessons for the future. Additionally, a high degree of heterogeneity is present among this small set of countries and their crises, which means that each country and its crises may play an important role in model estimation and testing. Second, infrequent but large global regime shifts, captured by variation in \eta_{t-1}, limit the applicability of past lessons for future crisis prediction. The rarity of the shifts means that a few decades of available historical data cannot capture all the possible global regimes, and the large size of the shifts means that past information may suddenly become rather outdated. Third, in addition to global regime shifts that affect countries simultaneously, countries may have been hit by country-specific idiosyncratic shocks, so that parts of their histories, reflected in x_{i,t-1}, might be different from what they would otherwise have been.

The small-data nature of macroeconomic panel data strengthens the need for assessing model performance uncertainty arising from sampling, i.e., the extent to which a model would have performed differently had it been estimated on a different dataset. In line with the three aspects in which macroeconomic data in the early-warning framework are small, there are three sources of data variation. First, countries and their crisis events are few and heterogeneous, which means that whether some of the countries (and thus their crisis events) are in the sample could affect model performance.
Hence, data variation in the set of countries is worth examining. Second, global regime shifts are rare but significant, which means that whether some of the years (and thus global regimes) are in the sample could affect model performance. Hence, data variation in the set of years is worth examining. Third, some of the countries may have been hit by country-specific idiosyncratic shocks and therefore developed differently in parts of their histories, which means that whether parts of the histories of some countries are in the sample could affect model performance. Hence, data variation in the set of countries' histories is worth examining.

Hence, we conclude that there are three sources of data variation that may affect model performance in the early-warning framework:

1. Variation in the set of countries, i.e., what if some of the countries were not in the sample?

2. Variation in the set of years, and thus the set of global regimes, i.e., what if some of the global regimes were not seen?

3. Variation in the set of countries' histories, i.e., what if some of the countries followed different trajectories in parts of their histories?

3.3 Data and Models

3.3.1 Crisis Events

We focus on sudden stops of capital flows in emerging market countries, which have been seen as the most brutal crises for such economies. Our sample covers ten countries that have been widely accepted as emerging market countries over the past three decades: Argentina, Brazil, Chile, Indonesia, Malaysia, Mexico, Philippines, Russia, Thailand, and Turkey.

The crisis events are chosen based on the sudden stop definition in Basu et al. (2019). In that definition, a sudden stop occurs when net private capital inflows as a percentage of GDP are at least 2 percentage points lower than in the previous year and two years before, as well as when the country is approved to tap large IMF financial support, to capture counterfactual situations in which sudden declines in private capital inflows were prevented by large IMF financial support. Also, such brutal events often cause severe real economic consequences, such as large growth declines, which are defined as occurring when the change in real GDP growth relative to the previous five-year average lies in the lower 10th percentile of the entire sample, as well as when the country is approved to tap large IMF financial support, to capture counterfactual situations in which large growth declines were prevented by large IMF financial support. Therefore, combining the two definitions, episodes of sudden stops with growth impacts (SSGIs) are the main crisis events we focus on in this paper (a stylized sketch of the sudden-stop condition follows Figure 3.1).

Footnote 8. Large IMF financial support hereafter is defined as IMF arrangements with an agreed amount at least five times as large as the respective country's quota at the IMF.

Footnote 9. The entire sample in Basu et al. (2019) covers 53 countries and spans the period between 1990 and 2017. Hence, the ten-country sample used in this paper is a subset of the sample in Basu et al. (2019). Instead of using the lower 10th percentile of our ten-country sample, we take the value corresponding to the lower 10th percentile of the sample in Basu et al. (2019) to define large growth declines in our sample, for robustness.

Our sample spans 1990 to 2017, during which there are eighteen sudden stops with growth impacts, accounting for 6.4 percent of the sample. Table 3.1 lists the eighteen episodes of sudden stops with growth impacts (SSGIs): twelve before the global financial crisis, five during the global financial crisis, and one after the global financial crisis. Figure 3.1 shows the crisis frequency distribution across years. It can be seen that our ten-country sample over the period 1990-2017 captures prominent historical waves of sudden stops, including the Mexican peso crisis in the mid-1990s, the Asian financial crises in the late 1990s, the South American crises in the early 2000s, and the global financial crisis in the late 2000s.

Table 3.1: Episodes of Sudden Stops with Growth Impacts (SSGIs)

Countries      Years
Argentina      1995, 2000, 2008
Brazil         2002
Chile          1998
Indonesia      1997
Malaysia       1997, 2008
Mexico         1994, 2009
Philippines    1998
Russia         2008, 2014
Thailand       1997
Turkey         1994, 1998, 2001, 2008

Figure 3.1: Frequency of Sudden Stops with Growth Impacts (SSGIs). [Figure: number of SSGIs per year, 1990-2015.]
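A stylized sketch of the sudden-stop component of this definition follows; it is not the authors' code, and the IMF-support and growth-decline conditions described above are omitted. `inflows` is assumed to map (country, year) pairs to net private capital inflows in percent of GDP.

def sudden_stop_flag(inflows, country, year, threshold=2.0):
    """Flag a sudden stop: inflows/GDP at least `threshold` percentage points
    below both the previous year and two years before."""
    current = inflows[(country, year)]
    return (inflows[(country, year - 1)] - current >= threshold and
            inflows[(country, year - 2)] - current >= threshold)

# Example with made-up numbers: a drop from 5 and 4 percent of GDP to 1 percent
# satisfies both conditions and is flagged.
demo = {("X", 2000): 5.0, ("X", 2001): 4.0, ("X", 2002): 1.0}
print(sudden_stop_flag(demo, "X", 2002))  # True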
3.3.2 Explanatory Indicators

Our set of explanatory indicators consists of twenty-five variables that have been selected using a general-to-specific approach. Specifically, we start with the set of more than seventy explanatory variables used in Basu et al. (2019), and then select a subset of twenty-five variables based on their horse race results. These twenty-five variables span multiple sectors, including the external, fiscal, financial, and real sectors, and can be categorized into four groups capturing different economic mechanisms: medium-term bubble building, short-term bubble bursting, buffers and mismatch, and global factors. Table 3.2 lists the selected explanatory indicators.

Footnote 10. It is worth noting that their selection is supported by different generations of theoretical models on sudden stops in the literature. Explanatory variables are categorized into several groups based on different economic channels and/or mechanisms.

Table 3.2: List of Explanatory Indicators

Medium-term bubble building:
5-year inflation
5-year money growth
5-year stock price growth
5-year housing price growth
5-year inter-bank liabilities growth
5-year REER growth
5-year private credit growth

Short-term bubble bursting:
Change in public debt
Change in reserves
Change in stock price growth
Change in housing price growth
Change in external equity liabilities
Change in REER appreciation
Change in private credit

Buffers and mismatch:
Current account balance
Amortization-to-exports ratio
EMBI spread
Foreign liabilities-to-domestic credit ratio
External debt
Capital adequacy ratio
Interest coverage ratio

Global factors:
TED spread
Percentage of AEs in banking crises
Inter-bank liabilities to AEs in banking crises
Export growth

3.3.3 Model Choice

The set of early-warning models for sudden stops has been developed along the historical crisis waves in which the danger of sudden stops was learned, from simple statistical methods such as the signal extraction model (Kaminsky et al., 1998) and regression-based models (Frankel and Rose, 1996; Berg and Patillo, 1999) to more recent, more advanced machine learning models (Chamon et al., 2007; Basu et al., 2019). In this paper, we choose to focus on the signal extraction model and random forests (Breiman, 2001), assessing their performance uncertainty and testing their ranking significance. Our choices are motivated by two reasons.
First, the signal extraction model has been shown to perform the best at predicting sudden stops when compared with traditional regression models and machine learning techniques, including regularized regression models and tree-based models (Berg and Patillo, 1999; Berg et al., 2005; Basu et al., 2019), especially in terms of out-of-sample performance. Second, as a well-known and widely used machine learning method, random forests has been applied to early-warning exercises and shown to perform well on many types of crises, including banking crises, currency crises, and sovereign debt crises (Alessi and Detken, 2018; Basu et al., 2019; Badia et al., 2020).

Footnote 11. Also, the univariate and non-parametric setting of the signal extraction model for identifying variable-specific thresholds makes it more practical for macroeconomic data, for which data availability varies from variable to variable.

The signal extraction model identifies, for each explanatory indicator, one threshold that minimizes the specified loss function. Observations (i.e., country-year pairs in our data) whose indicator values fall on one side of the threshold are given a 1 and flagged as risky; otherwise they are given a 0 and flagged as safe. The flags of all indicators for an observation are then aggregated to generate a composite score (sometimes called a vulnerability index in the literature and in practice), with weights given by the signal-to-noise ratio

\frac{1-z}{z}, \qquad (3.2)

where z is defined as the value of the loss function achieved. Therefore, this algorithm implies that minimizing the loss function for each indicator is equivalent to maximizing the signal-to-noise ratio for each indicator, and indicators with larger signaling power are given larger weights in the composite score. We follow the literature (Berg et al., 2005) in using the loss function given by the unweighted sum of the percentages of false alarms and missed crises. The percentage of "false alarms" is defined as the percentage of non-crisis observations (i.e., country-year pairs in our data) that the model incorrectly flags as crises, while the percentage of "missed crises" is defined as the percentage of crisis observations that the model incorrectly flags as non-crises. The threshold chosen to minimize this loss function is therefore the one for which the vertical gap between the conditional cumulative distribution function of crisis observations and the conditional cumulative distribution function of non-crisis observations is maximized.

Footnote 12. The cumulative distribution functions of crisis and non-crisis observations are plotted against the value of the explanatory indicator.
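An illustrative sketch of the per-indicator threshold search and the weight in equation (3.2) follows; it is not the authors' code, and the direction of the risky flag is left as a parameter since it varies across indicators.

import numpy as np

def signal_extraction_indicator(x, y, flag_above=True):
    """x: indicator values; y: binary crisis outcomes (1 = crisis)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=bool)
    best_t, best_z = None, np.inf
    for t in np.unique(x):
        risky = x > t if flag_above else x < t
        z = np.mean(~risky[y]) + np.mean(risky[~y])  # missed crises + false alarms
        if z < best_z:
            best_t, best_z = t, z
    # Weight from equation (3.2); guard against a degenerate zero loss.
    weight = (1.0 - best_z) / best_z if best_z > 0 else float("inf")
    return best_t, best_z, weight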
Random forests, introduced by Breiman (2001), is an ensemble method consisting of a number of classification trees as building blocks. A classification tree (Breiman et al., 1984) uses a decision tree to flag an observation by going from the original complex sample to smaller and purer subsamples. Each decision tree consists of a root node, branches departing from parent nodes and entering child nodes, and multiple terminal nodes, which are also called leaves. In the structure of a classification tree, leaves represent the flagged classes and branches represent the conjunctions of indicators that lead to the classes. Observations in the root node are sent to the left or right child node according to splitting rules that identify indicators and corresponding thresholds. This process is repeated sequentially and recursively on each child node until each leaf consists of only one class or some stopping criteria are met. The indicator and threshold used to split the sample at each node are chosen based on some measure of impurity, such as the Gini impurity index. Because of this recursive algorithm, the tree structure partitions the prediction space into multiple smaller spaces, which allows for complex relationships between the target and the explanatory indicators. Hence, the classification tree has proven useful in many prediction areas and has been introduced to the early-warning literature (Chamon et al., 2007; Manasse and Roubini, 2009).

Footnote 13. The class for a leaf is determined by the class with the most votes in the leaf.

The classification tree method suffers from overfitting when a single decision tree is grown very deep and therefore absorbs too much noise from the sample on which it is estimated. To reduce the overfitting of a single classification tree, random forests grows a number of classification trees based on bootstrapped samples, i.e., random samples selected with replacement from the original sample. Additionally, instead of considering all explanatory indicators, only a random subset of indicators is considered as candidates for each split. This algorithm, sometimes called feature bagging, effectively prevents strong correlation among the trees in the forest. The final predicted class for a new observation is obtained by taking the majority vote of the predictions of all trees in the forest. Bootstrap aggregating and feature bagging reduce the prediction variance on average, without increasing the prediction bias, and therefore help random forests achieve better classification performance.

3.4 Model Performance Uncertainty

The estimation of model performance uncertainty consists of three steps: (1) generating new samples from the original sample, (2) estimating and testing models to collect model performances, and (3) constructing confidence intervals to represent model performance uncertainty. In this section, we first describe our methods for generating new samples (Subsection 3.4.1). We then summarize how we estimate and test models (Subsection 3.4.2). The last subsection (Subsection 3.4.3) discusses how we construct confidence intervals to represent model performance uncertainty.

Our estimation of model performance uncertainty proceeds in the following way:

1. Perform jackknife resampling on the original entire sample S to obtain a jackknifing sample S_j, j = 1, 2, ..., N.

2. Split the jackknifing sample S_j into a training set and a test set based on cutoff rules, either a fixed cutoff or rolling cutoffs.

3. Estimate the different models (signal extraction model and random forests) on the same training set and test them on the same test set generated from the jackknifing sample S_j. Model performances on the test set are then calculated and collected.

4. Repeat steps 1-3 for N = 200 times, and construct confidence intervals using the model performances collected.

3.4.1 Jackknife Resampling

As discussed in Section 3.2, the "small data" nature of macroeconomic panel data in the early-warning framework strengthens the need for assessing model performance uncertainty arising from sampling, and therefore motivates three sources of data variation from which model performance uncertainty arises. We opt to make use of jackknife resampling (jackknifing) to assess model performance uncertainty, which allows us to examine and compare the model performance uncertainty arising from the different sources of data variation.

Jackknifing was introduced by Efron and Stein (1981) and Efron (1982), before other common resampling methods such as bootstrap resampling (bootstrapping). Standard jackknifing simply omits one single observation of the original sample to generate a subsample. Specifically, given a sample consisting of data x_1, x_2, ..., x_N, the i-th jackknifing subsample consists of the data x_1, ..., x_{i-1}, x_{i+1}, ..., x_N, for i = 1, 2, ..., N. However, standard jackknifing assumes the data to be i.i.d., and therefore does not work well on our macroeconomic panel data, which exhibit both cross-sectional and time-series dependence. Specifically, the cross-sectional dependence derives from the presence of global factors that affect countries simultaneously, and the time-series dependence derives from the autocorrelation of the explanatory indicators. Because of such dependence in our macroeconomic panel data, dropping one observation in the form of a country-year pair cannot generate sufficient data variation in the collection of jackknifing subsamples.

Hence, in line with the three sources of data variation discussed in Section 3.2, we propose three different jackknifing methods to assess the model performance uncertainty arising from the three different sources of data variation. First, we drop countries, that is, we drop all years' data for some randomly chosen countries, to assess the extent to which a model would have performed differently if some of the countries (and thus their crisis events) had not been in the sample. Second, we drop years, that is, we drop all countries' data in some randomly chosen years, to assess the extent to which a model would have performed differently if some of the global regimes had never been seen or some of the global regime shifts had never happened. Third, we drop country-year blocks, that is, we drop data in the form of randomly chosen blocks of three years for some randomly chosen countries, to assess the extent to which a model would have performed differently if some of the countries had been hit by idiosyncratic shocks and parts of their histories had therefore followed a different trajectory.

We apply the different jackknifing methods on the original sample before it is split into training and test sets, instead of resampling only the training set while keeping the test set untouched (Holopainen and Sarlin, 2017). There are two reasons. For one, sampling errors can arise in any part of the data, regardless of how one splits the dataset into training and test sets. Specifically, when global regime shifts happened, or when countries were hit by country-specific idiosyncratic shocks, did not depend on how researchers or policymakers designed their training and testing scheme. For another, one should never manipulate the splitting of a given dataset into training and test sets by restricting all potential data variation to only the training set or only the test set. However, we acknowledge that this way of jackknife resampling on the entire sample limits our ability to decompose model performance uncertainty into the parts arising from data variation in the training set and in the test set, which would require a more sophisticated design of resampling methods to investigate.
As for the percentage of data to drop, we choose to drop 10 percent of the dataset in our benchmark exercise, instead of dropping one single observation (i.e., a country-year pair in our data) as in standard jackknifing. There are a few reasons behind this choice. First, as discussed before, due to the cross-sectional and time-series dependence of our macroeconomic panel data, dropping a single observation cannot generate enough data variation, so we need to drop a larger fraction of the data to generate enough data variation from which we are able to draw inferences about model performance uncertainty. Second, our first and second jackknifing methods place a lower bound on the percentage of data to drop, because they are designed to drop all years' data for some countries and all countries' data in some years. In our sample of ten emerging market countries spanning almost three decades from 1990 to 2017, dropping all years' data for a single country amounts to dropping 10 percent of the data; below this threshold, it becomes impossible to drop the entire history of one country. As a result, we choose to drop 10 percent of the data for all three jackknifing methods as our benchmark, to ensure a fair comparison between the degrees of model performance uncertainty arising from the three sources of data variation. (This means dropping one country (all years), three years (all countries), or nine blocks of three years from our data.)

Additionally, we explore how the degree of model performance uncertainty is affected by the percentage of data dropped. Thus, in a different exercise, we drop 5 percent of the data for all three jackknifing methods. As mentioned before, dropping 5 percent of the data makes it impossible to drop the entire history of one country, so in this case we choose to drop half of the history of one country, randomly chosen to be either the first half (i.e., from 1990 to 2007) or the second half (i.e., from 2008 to 2017). (This means dropping half of a country's history (the first thirteen or the last fourteen years), one year (all countries), or five blocks of three years from our data.)

To summarize, we conduct the three types of jackknifing in the following way:

1. Drop countries: (a) Choose a country at random. (b) Drop all years' data for that country. (c) Repeat (a)-(b) until the percentage of data dropped is larger than the specified value (10 percent or 5 percent).

2. Drop years: (a) Choose a year at random. (b) Drop all countries' data in that year. (c) Repeat (a)-(b) until the percentage of data dropped is larger than the specified value (10 percent or 5 percent).

3. Drop country-year blocks: (a) Choose a country at random. (b) Choose a block of three years at random for that country. (c) Drop all data in the block. (d) Repeat (a)-(c) until the percentage of data dropped is larger than the specified value (10 percent or 5 percent).

Also, we conduct a fourth type of jackknifing which mimics the standard i.i.d. jackknifing: we repeatedly choose and drop a single country-year pair until the percentage of data dropped is larger than the specified value (10 percent or 5 percent). A sketch of all four resampling schemes follows.
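The following is a minimal sketch of the four resampling schemes, assuming a pandas panel with 'country' and 'year' columns and a unique row index; the function name and its arguments are illustrative rather than the code used in the dissertation.

```python
import numpy as np
import pandas as pd

def jackknife_sample(df, method, drop_frac=0.10, block_len=3, rng=None):
    """Drop roughly `drop_frac` of a country-year panel `df` using one of
    the three proposed schemes, or single country-year pairs (i.i.d.)."""
    rng = rng if rng is not None else np.random.default_rng()
    keep = df.copy()
    target = len(df) - int(len(df) * drop_frac)
    while len(keep) > target:
        if method == "countries":        # drop a whole country's history
            c = rng.choice(keep["country"].unique())
            keep = keep[keep["country"] != c]
        elif method == "years":          # drop one year for all countries
            y = rng.choice(keep["year"].unique())
            keep = keep[keep["year"] != y]
        elif method == "blocks":         # drop a 3-year block for one country
            c = rng.choice(keep["country"].unique())
            years = np.sort(keep.loc[keep["country"] == c, "year"].unique())
            start = rng.choice(years[: max(1, len(years) - block_len + 1)])
            in_block = (keep["country"] == c) & keep["year"].between(
                start, start + block_len - 1)
            keep = keep[~in_block]
        else:                            # "iid": drop single country-year pairs
            keep = keep.drop(rng.choice(keep.index))
    return keep
```

Usage would be, for example, `jackknife_sample(panel, "years", drop_frac=0.05)` to generate one resampled history for the 5 percent drop-years exercise.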
By comparing the model performance uncertainty estimated from our proposed three methods of jackknifing with that estimated from the i.i.d. jackknifing, we aim to (1) examine the difference in model performance uncertainty derived from different sources of data variation; and (2) contrast model performance uncertainty estimated from correct ways of resampling with that estimated from an incorrect way of resampling, i.e., the i.i.d. jackknifing. (By correct ways of resampling, we mean ways of resampling that account for the cross-sectional and time-series dependence in our macroeconomic panel data.)

3.4.2 Model Estimation and Testing

The model estimation and testing consist of three stages in which different data sets are used: (a) tuning, (b) training, and (c) testing. The data sets used are the "validation set" and its "training counterpart", the "training set", and the "test set", respectively. Simply put, "tuning" is a process conducted before training a model to find the optimal hyperparameters of the model. "Training" is the process in which a model is estimated after all hyperparameters have been chosen optimally in the tuning stage. In the effort to rank model performance, "testing" is an evaluation process in which all fully estimated models are evaluated based on their out-of-sample performance on the test set(s).

Loss Function

We follow the signal evaluation framework in the empirical literature on early-warning systems and evaluate model performance by maximizing the signal relative to the noise. Thus, our evaluation metric is defined as the unweighted sum of "false alarms" (i.e., the percentage of non-crisis observations that the model incorrectly flags as risky) and "missed crises" (the percentage of crisis observations that the model incorrectly flags as safe). Given that crises are rare in all our definitions and datasets, this evaluation metric attaches significantly higher costs to missed crises than to false alarms. It is also used as the loss function in both the tuning and training stages to find optimal hyperparameters and estimate models on the training set, as well as to calculate the optimal thresholds for indicators in the signal extraction model. In the testing stage, to generate binary flags on the test set(s), we first find the optimal threshold over the composite scores that each model produces on the training set by maximizing the signal relative to the noise, and then assign binary labels (i.e., one if a crisis is predicted to follow, zero otherwise) according to that threshold. A sketch of this metric and threshold search follows.
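The following is a minimal sketch of the evaluation metric and the in-sample threshold search, assuming each model outputs a continuous composite score per observation; the function names are illustrative.

```python
import numpy as np

def sum_of_errors(y_true, flags):
    """Unweighted sum of the false-alarm rate and the missed-crisis rate."""
    y_true, flags = np.asarray(y_true), np.asarray(flags)
    false_alarms = np.mean(flags[y_true == 0])       # share of non-crises flagged risky
    missed_crises = np.mean(1 - flags[y_true == 1])  # share of crises flagged safe
    return false_alarms + missed_crises

def optimal_threshold(y_train, scores_train):
    """Grid-search the composite-score threshold minimizing the in-sample
    sum of errors; the same threshold is then applied to the test set."""
    grid = np.unique(scores_train)
    losses = [sum_of_errors(y_train, (scores_train >= t).astype(int)) for t in grid]
    return grid[int(np.argmin(losses))]
```

Because both error rates are percentages of their own class, a single missed crisis moves the metric far more than a single false alarm when crises are rare, which is the sense in which the metric attaches higher costs to missed crises.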
Hyperparameter Tuning

The most common tuning method in machine learning is k-fold cross-validation, introduced by Stone (1974). However, it is not appropriate for our macroeconomic panel data, which exhibit cross-sectional and time-series dependencies, because it randomly partitions the training set into non-clustered complementary subsets in which observations do not necessarily come from the same years. Hence, common global factors may be present in both the "validation set" and its training counterpart. Therefore, we make use of random year-block tuning (Basu et al., 2019). In random year-block tuning, the data in the training set are partitioned into ten roughly equal-sized year-blocks that serve as validation sets, in each of which all observations come from the same years. In contrast to using only historical data for each prediction in the testing stage, for each validation year-block we use all the remaining years as its training counterpart, owing to the relatively small sample size, so as to avoid depleting the data within the training set. The set of hyperparameters is then chosen such that the average performance over all validation year-blocks is maximized. To perform the hyperparameter optimization, we use Bayesian optimization, which works well for optimizing the hyperparameters of machine learning algorithms.

Different methods have different sets of hyperparameters to be tuned. In random forests, there are three categories of hyperparameters to be tuned: hyperparameters governing tree size, the maximum number of surrogate splits the tree finds at each split, and the number of features to consider at each split. In the standard random forests developed by Breiman (2001), bootstrapping and feature bagging help prevent overfitting, so trees can grow fully, reducing in-sample bias in order to reduce out-of-sample bias. However, this approach may not be correct in the context of crisis forecasting, where crisis mechanisms constantly change over time. As found in Basu et al. (2019), fully grown random forests are prone to severe overfitting when applied to crisis forecasting problems, while controlling tree size helps reduce overfitting significantly. Therefore, we include parameters that control tree size in the set of hyperparameters to be tuned for all tree-ensemble methods we are using. Among all parameters controlling tree size, including the minimum number of observations per leaf, the minimum number of observations per parent node, and the maximum number of splits, we choose to tune the minimum number of observations per leaf. Instead of tuning the number of trees, we fix it at 1000 so that there is a sufficiently large number of trees to reduce variance and thus stabilize prediction performance and variable importance.

Training and Testing

We follow the testing procedure in Basu et al. (2019) and make use of two kinds of cutoff tests: fixed cutoff testing and rolling cutoff testing. The algorithm is quite simple: for a given cutoff year, a model is estimated using all information up to that year, and then applied to out-of-sample test sets consisting of all or some of the years after the cutoff year. By limiting the training sets to historical data for each prediction, cutoff tests replicate how the model would be used in real-time analysis, and crucially, they ensure that the testing is conducted fully out-of-sample.

Given the timing of the global financial crisis, we set 2007 as the cutoff year in the fixed cutoff testing. A model is estimated using all data up to 2006 to predict crises up to 2007, and is applied to an out-of-sample test set consisting of all data after 2007 to calculate its out-of-sample performance. The fixed-year cutoff testing is simple and stable because there are many crises in the test set. However, it does not provide an assessment of how the model would update and perform after the global financial crisis.

The rolling cutoff testing consists of multiple training and test sets generated by multiple cutoff years: 2007, 2009, 2011, 2013, and 2015. For each cutoff year, a model is estimated using all data up to the year before the cutoff year, and then tested using a two-year test set immediately after the cutoff year. For example, when the cutoff year is 2007, a model is estimated using all data up to 2006 and is tested on a two-year set consisting of data from 2007 and 2008 to predict crises in 2008 and 2009. A sketch of this rolling evaluation follows.
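The following is a minimal sketch of the rolling cutoff loop under the timing convention just described; `fit_and_score` is a hypothetical helper that trains a model on the training years and returns its out-of-sample sum of errors on the test years.

```python
def rolling_cutoff_score(df, fit_and_score, cutoffs=(2007, 2009, 2011, 2013, 2015)):
    """Average out-of-sample sum of errors over rolling cutoffs: for cutoff
    year c, train on all data up to c-1 and test on years c and c+1."""
    losses = []
    for c in cutoffs:
        train = df[df["year"] <= c - 1]               # only historical data
        test = df[df["year"].between(c, c + 1)]       # the two-year test window
        losses.append(fit_and_score(train, test))
    return sum(losses) / len(losses)
```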
At the end, models are evaluated based on their average performance over all the two-year test sets.

3.4.3 Confidence Intervals

We choose to construct confidence intervals of our preferred evaluation metric, the sum of errors, to assess model performance uncertainty and test model ranking significance.

3.4.3.1 Confidence intervals of individual model performance

We use $\hat{\theta}$ to denote the estimator of model performance obtained from the original sample, which in our design is the out-of-sample sum of errors, either calculated in the fixed cutoff exercise or the average value calculated in the rolling cutoff exercise. We then use $\hat{\theta}^*_j$ to denote the estimator of model performance obtained from the jackknifing samples $j = 1, 2, \ldots, J$, where $J = 200$. We make use of the estimator obtained from the original sample and the distribution of the estimators obtained from the jackknifing samples to construct the confidence interval developed by Davison and Hinkley (1997). To construct the confidence interval for an individual model, we proceed as follows (a sketch of this construction is given at the end of this subsection):

1. Order the estimators obtained from the jackknifing samples, $\hat{\theta}^*$, such that $\hat{\theta}^*_{(1)} \leq \cdots \leq \hat{\theta}^*_{(J)}$, with the subscript denoting the $j$th element in the ordered list.

2. For a significance level $\alpha$, select the $\lfloor J \cdot \alpha/2 \rfloor$th and $\lceil J \cdot (1 - \alpha/2) \rceil$th elements from the above ordered list of estimators, i.e., $\hat{\theta}^*_{(\lfloor J \cdot \alpha/2 \rfloor)}$ and $\hat{\theta}^*_{(\lceil J \cdot (1 - \alpha/2) \rceil)}$.

3. Construct the two-tailed confidence interval of $\theta$ with the significance level $\alpha$ as
$$\left[\, 2\hat{\theta} - \hat{\theta}^*_{(\lceil J \cdot (1 - \alpha/2) \rceil)},\; 2\hat{\theta} - \hat{\theta}^*_{(\lfloor J \cdot \alpha/2 \rfloor)} \,\right].$$

By constructing the confidence intervals for individual models, we can assess to what extent the models would have performed differently had they been estimated on different datasets. However, when we test ranking significance, simply examining whether the individual confidence intervals of individual models overlap does not provide much evidence on model ranking significance, because each jackknifing sample represents a possible history of the set of emerging market countries, and it is not fair to compare the performance of one model estimated and tested on one history with that of another model estimated and tested on a different history. For example, the signal extraction model may perform worse in a world without Russia's 2014 crisis than random forests do in a world with Russia's 2014 crisis, but this does not mean that the signal extraction model cannot perform better than random forests if the two are estimated on the same history. Results in Holopainen and Sarlin (2017) also show that there is almost no significant difference among models when comparing individual confidence intervals across models; that is, individual confidence intervals constructed by pooling performance obtained from jackknifing samples are so large that they overlap each other. Hence, we emphasize that model performance should be compared on the same sample, that is, models should be estimated on the same training set and their performance compared on the same test set, no matter whether data variation is introduced or not.
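A minimal sketch of this interval construction (steps 1-3 above), assuming `theta_hat` is the estimate from the original sample and `theta_star` collects the J = 200 jackknifed estimates:

```python
import numpy as np

def basic_ci(theta_hat, theta_star, alpha=0.10):
    """Basic (reverse percentile) interval of Davison and Hinkley (1997):
    [2*theta_hat - q_{1-alpha/2}, 2*theta_hat - q_{alpha/2}], where the
    q's are order statistics of the resampled estimators."""
    theta_star = np.sort(np.asarray(theta_star))
    J = len(theta_star)
    q_lo = theta_star[int(np.floor(J * alpha / 2)) - 1]       # floor(J*alpha/2)-th element
    q_hi = theta_star[int(np.ceil(J * (1 - alpha / 2))) - 1]  # ceil(J*(1-alpha/2))-th element
    return 2 * theta_hat - q_hi, 2 * theta_hat - q_lo
```

With J = 200 and alpha = 0.10, this selects the 10th and 190th ordered estimates and reflects them around the original estimate.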
3.4.3.2 Confidence intervals of conditional performance difference

Therefore, the right way to test model ranking significance while accounting for data variation is to construct the confidence interval of the conditional difference in the model performance estimators, that is, the difference in model performance obtained from the same sample. We use $\hat{\theta}_i$ to denote the estimator of model $i$'s performance obtained from the original sample, with $i = 1$ for the signal extraction model and $i = 2$ for random forests. We then use $\hat{\theta}^*_{i,j}$ to denote the estimator of model $i$'s performance obtained from the jackknifing samples $j = 1, 2, \ldots, J$, where $J = 200$. We proceed as follows (a sketch of this test is given at the end of this subsection):

1. Calculate the difference in the performance estimators obtained from the original sample, $\Delta\hat{\theta} = \hat{\theta}_1 - \hat{\theta}_2$.

2. Calculate the differences in the performance estimators obtained from the jackknifing samples, $\Delta\hat{\theta}^*_j = \hat{\theta}^*_{1,j} - \hat{\theta}^*_{2,j}$ for $j = 1, 2, \ldots, J$.

3. Order the estimator differences obtained from the jackknifing, $\Delta\hat{\theta}^*$, such that $\Delta\hat{\theta}^*_{(1)} \leq \cdots \leq \Delta\hat{\theta}^*_{(J)}$, with the subscript denoting the $j$th element in the ordered list.

4. For a significance level $\alpha$, select the $\lfloor J \cdot \alpha/2 \rfloor$th and $\lceil J \cdot (1 - \alpha/2) \rceil$th elements from the above ordered list of estimator differences, i.e., $\Delta\hat{\theta}^*_{(\lfloor J \cdot \alpha/2 \rfloor)}$ and $\Delta\hat{\theta}^*_{(\lceil J \cdot (1 - \alpha/2) \rceil)}$.

5. Construct the two-tailed confidence interval of $\Delta\theta$ with the significance level $\alpha$ as
$$\left[\, 2\Delta\hat{\theta} - \Delta\hat{\theta}^*_{(\lceil J \cdot (1 - \alpha/2) \rceil)},\; 2\Delta\hat{\theta} - \Delta\hat{\theta}^*_{(\lfloor J \cdot \alpha/2 \rfloor)} \,\right].$$

Using these confidence intervals of the conditional performance difference, we test model ranking significance with a two-sided hypothesis test of the null $H_0: \Delta\theta = 0$ and examine whether zero lies inside the two-tailed confidence interval with the significance level $\alpha$. Given that the estimator difference is calculated as the difference between the performance estimators of the signal extraction model and random forests, if the confidence interval lies on the left side of zero (not including zero), then we reject the null hypothesis and conclude that the signal extraction model performs significantly better than random forests at the $\alpha$ significance level. If the confidence interval lies on the right side of zero (not including zero), then we reject the null hypothesis and conclude that random forests perform significantly better than the signal extraction model at the $\alpha$ significance level. If zero is inside the confidence interval, then we fail to reject the null hypothesis and conclude that the signal extraction model and random forests do not perform significantly differently from each other.
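A minimal sketch of this ranking test, reusing the hypothetical `basic_ci` helper from the previous sketch; the performance arrays are paired by jackknifing sample so that each difference is computed on the same history:

```python
import numpy as np

def ranking_test(theta1_hat, theta2_hat, theta1_star, theta2_star, alpha=0.10):
    """Two-sided test of H0: delta_theta = 0 using the basic CI of the
    paired difference in sums of errors (model 1 = signal extraction,
    model 2 = random forests)."""
    d_hat = theta1_hat - theta2_hat
    d_star = np.asarray(theta1_star) - np.asarray(theta2_star)  # same sample j
    lo, hi = basic_ci(d_hat, d_star, alpha)
    if hi < 0:
        return "signal extraction significantly better", (lo, hi)
    if lo > 0:
        return "random forests significantly better", (lo, hi)
    return "no significant difference", (lo, hi)
```

The pairing is the substantive choice here: differencing before ordering removes the common history-specific component of performance that inflates the individual intervals.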
3.5 Empirical Results

In this section, we present our results on model performance uncertainty and model ranking significance in terms of confidence intervals. Subsection 3.5.1 first presents the confidence interval results in the fixed cutoff testing exercise, and Subsection 3.5.2 shows the confidence interval results in the rolling cutoff exercise.

3.5.1 Fixed Cutoff Testing

In the fixed cutoff testing exercise, we choose to drop 10 or 5 percent of the data and perform the four jackknifing methods for each. Table 3.3 and Figure 3.2 report the confidence intervals at the significance level of 0.10.

To begin with, we discuss the results of dropping 10 percent of the data. First, we note that most of the confidence intervals are wide, especially for the signal extraction model. When dropping 10 percent of the data, the confidence intervals of the signal extraction model have a width greater than 0.35, implying that the sum of errors at the 95th percentile is larger than that at the 5th percentile by at least one third of the possible range from 0 to 1. As for random forests, the confidence intervals are narrower, but still wider than 0.2 for three jackknifing methods (dropping years, dropping country-year blocks, and dropping country-year pairs as i.i.d. observations). Second, for the signal extraction model, among all four jackknifing methods, dropping years produces the widest confidence interval, which ranges from a sum of errors of 0.332 to 0.888, covering more than half of the possible range. For random forests, dropping country-year blocks yields the widest confidence interval, with a width of almost 0.3. In contrast, dropping countries generates the narrowest confidence interval for random forests, which is about half as wide as the other three. Third, the confidence intervals of the signal extraction model and random forests overlap for all four jackknifing methods, although the sum of errors of the signal extraction model calculated from the original dataset (0.758) is much lower than that of random forests (0.970). However, as discussed in the previous section, overlapping confidence intervals of individual models do not necessarily mean that there is no significant difference in performance among models. The correct way to proceed is to examine the confidence interval of the conditional performance difference, that is, the difference in the sums of errors of different models calculated from the same jackknifing sample. We discuss these results below. Fourth, there seems to be little difference among the confidence intervals generated by the different jackknifing methods, including dropping country-year pairs (as i.i.d. observations), except for the wider one generated by dropping years for the signal extraction model and the narrower one generated by dropping countries for random forests.

Table 3.3: Confidence Interval Results on Unconditional Performance in Fixed Cutoff Testing

(a) Signal extraction model, dropping 10 percent of the data
Jackknifing method          Median   Lower bound of 90% CI   Upper bound of 90% CI
Drop countries              0.695    0.515                   0.874
Drop years                  0.610    0.332                   0.888
Drop country-year blocks    0.701    0.515                   0.888
Drop i.i.d.                 0.703    0.515                   0.890

(b) Random forests, dropping 10 percent of the data
Jackknifing method          Median   Lower bound of 90% CI   Upper bound of 90% CI
Drop countries              0.894    0.811                   0.978
Drop years                  0.905    0.762                   1.048
Drop country-year blocks    0.904    0.738                   1.069
Drop i.i.d.                 0.939    0.809                   1.069

(c) Signal extraction model, dropping 5 percent of the data
Jackknifing method          Median   Lower bound of 90% CI   Upper bound of 90% CI
Drop countries              0.690    0.515                   0.866
Drop years                  0.628    0.430                   0.826
Drop country-year blocks    0.706    0.532                   0.880
Drop i.i.d.                 0.692    0.529                   0.855

(d) Random forests, dropping 5 percent of the data
Jackknifing method          Median   Lower bound of 90% CI   Upper bound of 90% CI
Drop countries              0.975    0.867                   1.083
Drop years                  0.916    0.793                   1.039
Drop country-year blocks    0.931    0.801                   1.061
Drop i.i.d.                 0.961    0.853                   1.070

When dropping 5 percent of the data, the results do not change much, except that the overlaps between confidence intervals seem to shrink. In the case of dropping countries and dropping country-year pairs (as i.i.d. observations), there is almost no overlap between the confidence intervals of the signal extraction model and random forests. It is also worth noting that for the signal extraction model, reducing the percentage of data dropped does not shrink the confidence intervals much, except for dropping years.
The confidence interval generated by dropping years shrinks by 0.1 when switching from dropping 10 percent to 5 percent of the data, while for the other three jackknifing methods the reductions in width are small. Combined with the finding that dropping years produces the widest confidence interval for the signal extraction model, this may imply that the presence of global regimes and their shifts plays an important role in determining model performance uncertainty for the signal extraction model. Because the model extracts information by distinguishing between the set of crises and the set of non-crises, whether certain years capturing global regimes that contain certain generations of crises are present in the data may be a determining factor for its performance variation. In contrast, for random forests, the confidence interval generated by dropping country-year blocks shrinks the most when switching from dropping 10 percent to 5 percent of the data. Combined with the finding that dropping country-year blocks yields the widest confidence interval for random forests, this may indicate that the presence of certain individual crisis events plays an important role in determining model performance uncertainty for random forests. Because the recursive partitioning and bootstrap aggregating algorithms make random forests better at accounting for heterogeneity and learning from individual observations, whether certain slices of countries' histories that contain certain crisis observations are present in the data may affect its model performance the most.

In order to test the ranking significance of the signal extraction model and random forests in fixed cutoff testing while accounting for model performance uncertainty, we construct confidence intervals of the conditional performance difference between them, i.e., the difference between the sums of errors of the signal extraction model and random forests calculated from the same sample.

Figure 3.2: Confidence Intervals of Performance in Fixed Cutoff Testing. (a) Drop 10 percent of the data; (b) Drop 5 percent of the data.

Table 3.4 and Figure 3.3 show the confidence interval results on the conditional performance difference generated by the four jackknifing methods, dropping 10 percent and 5 percent of the data, respectively. First, the confidence intervals of the conditional performance difference are also wide. All confidence intervals are wider than 0.4 when dropping 10 percent of the data. Although they shrink when switching to dropping 5 percent of the data, the reductions in width are not large. Second, and more importantly, these confidence intervals present evidence of a significant difference in model performance between the signal extraction model and random forests. When dropping 10 percent of the data, all confidence intervals of the conditional performance difference lie on the left side of zero, except that the upper bounds of those generated by dropping countries and country-year blocks are slightly larger than zero.

Table 3.4: Confidence Interval Results on Conditional Performance Difference in Fixed Cutoff Testing

(a) Conditional difference in sum of errors, dropping 10 percent of the data
Jackknifing method          Median   Lower bound of 90% CI   Upper bound of 90% CI
Drop countries              -0.206   -0.450                  0.037
Drop years                  -0.281   -0.486                  -0.075
Drop country-year blocks    -0.213   -0.437                  0.010
Drop i.i.d.                 -0.268   -0.476                  -0.060
(b) Conditional difference in sum of errors, dropping 5 percent of the data
Jackknifing method          Median   Lower bound of 90% CI   Upper bound of 90% CI
Drop countries              -0.270   -0.437                  -0.104
Drop years                  -0.224   -0.360                  -0.087
Drop country-year blocks    -0.204   -0.379                  -0.029
Drop i.i.d.                 -0.228   -0.411                  -0.046

These results indicate that in fixed cutoff testing, the signal extraction model performs significantly better than random forests at the significance level of 0.10 when accounting for the three sources of data variation in (1) the set of countries, (2) the set of years, and (3) the set of countries' histories, in contrast to the earlier insignificant difference based on the comparison of individual confidence intervals. Such differing results in testing model ranking significance illustrate the fact we mentioned before: one model may perform better in one version of emerging market history than another model does in another version of emerging market history, which makes their confidence intervals overlap. Nevertheless, simply comparing their individual confidence intervals is not the correct way to test their performance difference and may lead to biased conclusions of insignificance. We should always look at the performance difference between models on the same history, and then draw inferences about the difference.

Figure 3.3: Confidence Intervals of Performance Difference in Fixed Cutoff Testing. (a) Drop 10 percent of the data; (b) Drop 5 percent of the data.

When dropping 5 percent of the data, the signal extraction model still performs robustly better than random forests when accounting for the three sources of data variation, as all confidence intervals of the conditional performance difference lie on the left side of zero. It is also noted that all conditional confidence intervals are narrower when dropping 5 percent of the data. Additionally, differing results in testing model ranking significance are observed between comparing confidence intervals of individual model performance and examining conditional confidence intervals: comparing confidence intervals of individual model performance does not reject the null hypothesis when accounting for data variation in the set of years and in countries' histories, while examining the confidence intervals of the conditional performance difference provides evidence of a significant difference.

It is interesting that there is not much difference among the confidence intervals generated by the different types of jackknifing methods, which account for different types of data variation. There is one potential reason behind such wide and similar confidence intervals. Our approach of jackknifing the entire dataset encompasses two dimensions of data variation, that in the training set and that in the test set. It could be the case that one dimension of data variation leads to different degrees of model performance uncertainty across the methods, but another dimension of data variation generates a larger degree of performance variation that is similar across the methods and dominates the figures; i.e., the performance variations derived from training set variation differ across sources, but those derived from test set variation are huge and similar across the different sources.
Since there are only six crisis events in the test set of the fixed cutoff testing exercise, the performance variations derived from test set variation are very likely to be huge, because whether some of the crisis events are included in the test set may alter the percentage of missed crises in a non-smooth way.

3.5.2 Rolling Cutoff Testing

We now assess model performance uncertainty and test model ranking significance in the rolling cutoff testing exercise. Table 3.5 and Figure 3.4 show the confidence interval results generated by three types of jackknifing methods (dropping countries, years, and country-year blocks) when dropping 10 percent of the data.

Table 3.5: Confidence Interval Results of Rolling Cutoff Testing

(a) Signal extraction model
Jackknifing method          Median   Lower bound of 90% CI   Upper bound of 90% CI
Drop countries              0.744    0.506                   0.981
Drop years                  0.750    0.474                   1.027
Drop country-year blocks    0.742    0.520                   0.965

(b) Random forests
Jackknifing method          Median   Lower bound of 90% CI   Upper bound of 90% CI
Drop countries              0.934    0.793                   1.074
Drop years                  0.913    0.791                   1.035
Drop country-year blocks    0.894    0.745                   1.043

(c) Conditional performance difference, sum of errors
Jackknifing method          Median   Lower bound of 90% CI   Upper bound of 90% CI
Drop countries              -0.174   -0.432                  0.084
Drop years                  -0.197   -0.456                  0.061
Drop country-year blocks    -0.145   -0.352                  0.062

Similar results as in the fixed cutoff testing exercise are seen in the rolling cutoff testing exercise. First, the confidence intervals of the signal extraction model are very wide for all types of jackknifing methods, while those of random forests are much narrower. Second, the confidence interval generated by dropping years is the widest for the signal extraction model, while the confidence interval generated by dropping country-year blocks is the widest for random forests. Consistent with the previous results, these findings provide evidence of the importance of global regimes for the signal extraction model and of the importance of individual countries' histories for random forests, potentially due to their model algorithms and structures.

Figure 3.4: Confidence Intervals of Conditional Performance Difference in Rolling Cutoff Testing. (a) Drop 10 percent of the data; (b) Drop 5 percent of the data.

Third, and most importantly, both comparing the confidence intervals of individual model performance and examining the confidence intervals of the conditional performance difference fail to reject the null hypothesis, and therefore imply an insignificant performance difference between the signal extraction model and random forests. The confidence intervals of the signal extraction model and random forests overlap for all types of jackknifing methods, and zero is inside all confidence intervals of the conditional performance difference. This implies that when evaluating model performance in such a recursive way, the signal extraction model no longer performs significantly better than random forests when accounting for any type of data variation. However, although the signal extraction model and random forests do not exhibit a significant difference in performance, we would prefer the signal extraction model in practical use given its stability and interpretability.
3.6 Conclusion

Recent technological advances have introduced more complex models (machine learning and deep learning) to the literature on early-warning models and expanded the set of early-warning models available to policymakers. Hence, it becomes increasingly important for policymakers to conduct a horse race and rank models in order to select the best one to use in practice. Given the small data nature of macroeconomic data in the early-warning framework, a model that performs best based on past histories may perform worse when new data become available and models are re-estimated. Hence, it is critical to assess the uncertainty in model performance and to test the significance of performance differences while accounting for data variation.

We emphasize the importance of three sources of data variation in the macroeconomic data used for early-warning models, and propose three types of jackknifing methods to account for these data variations respectively. We then construct confidence intervals based on model performance calculated from the different jackknifing samples to assess model performance uncertainty, and perform hypothesis testing to examine whether there is a significant difference in performance between models.

Our results show that model performance uncertainty, i.e., the extent to which a model would have performed differently had it been estimated on a different dataset, depends on the model structure and the source of data variation. For the signal extraction model, which looks at the difference between the set of crises and the set of non-crises and does not learn aggressively from individual crisis events, performance varies the most if there is a change in the history of global regimes. For random forests, which are designed to tackle heterogeneity and learn aggressively from individual crisis events, performance varies the most if there is a change in individual countries' histories. Also, we show that simply comparing confidence intervals of individual model performance does not provide much evidence on the significance of model performance differences when accounting for data variation, and may sometimes lead to incorrect inferences. The correct way to proceed is to construct confidence intervals of the conditional performance difference, that is, to focus on the difference in model performance on the same sample and assess the variation in this difference arising from sampling.

As we observed, most of the confidence intervals are wide. Our conjecture is that resampling on the entire dataset does not distinguish between two dimensions of data variation, training set variation and test set variation, and that test set variation leads to such a large degree of model performance uncertainty that it dominates that derived from training set variation. Hence, in future work, it is worth decomposing overall model performance uncertainty into the parts derived from training set variation and test set variation and examining them separately.

References

Ahmed, R., Aizenman, J., & Jinjarak, Y. (2019). Inflation and exchange rate targeting challenges under fiscal dominance (Tech. Rep.). National Bureau of Economic Research.
Aizenman, J., & Lee, J. (2007). International reserves: Precautionary versus mercantilist views, theory and evidence. Open Economies Review, 18(2), 191-214.
Aizenman, J., & Marion, N. (2003). The high demand for international reserves in the Far East: What is going on? Journal of the Japanese and International Economies, 17(3), 370-400.
Alessi, L., & Detken, C. (2018). Identifying excessive credit growth and leverage. Journal of Financial Stability, 35, 215-225.
Alfaro, L., & Kanczuk, F. (2009). Optimal reserve management and sovereign debt. Journal of International Economics, 77(1), 23-36.
Athey, S., & Wager, S. (2017). Efficient policy learning. arXiv preprint arXiv:1702.02896.
Bacchetta, P., & Van Wincoop, E. (2010). Infrequent portfolio decisions: A solution to the forward discount puzzle. American Economic Review, 100(3), 870-904.
Barro, R. J. (2009). Rare disasters, asset prices, and welfare costs. American Economic Review, 99(1), 243-264.
Basu, S. S., Chamon, M., & Crowe, C. W. (2017). A model to assess the probabilities of growth, fiscal, and financial crises. IMF Working Paper.
Basu, S. S., Perrelli, R. A., & Xin, W. (2019). External crisis prediction using machine learning: Evidence from three decades of crises around the world. Computing in Economics and Finance, Ottawa, Canada.
Behn, M., Detken, C., Peltonen, T. A., & Schudel, W. (2013). Setting countercyclical capital buffers based on early warning models: Would it work?
Benetrix, A. S., Gautam, D., Juvenal, L., & Schmitz, M. (2020). Cross-border currency exposures: New evidence based on an enhanced and updated dataset.
Benetrix, A. S., Lane, P. R., & Shambaugh, J. C. (2015). International currency exposures, valuation effects and the global financial crisis. Journal of International Economics, 96, S98-S109.
Berg, A., Borensztein, E., & Pattillo, C. (2005). Assessing early warning systems: How have they worked in practice? IMF Staff Papers, 52(3), 462-502.
Berg, A., & Pattillo, C. (1999). Are currency crises predictable? A test. IMF Staff Papers, 46(2), 107-138.
Betz, F., Oprica, S., Peltonen, T. A., & Sarlin, P. (2014). Predicting distress in European banks. Journal of Banking & Finance, 45, 225-241.
Bianchi, J., Hatchondo, J. C., & Martinez, L. (2018). International reserves and rollover risk. American Economic Review, 108(9), 2629-2670.
Blanchard, O. (2004). Fiscal dominance and inflation targeting: Lessons from Brazil (Tech. Rep.). National Bureau of Economic Research.
Borio, C. E., & Drehmann, M. (2009). Assessing the risk of banking crises-revisited. BIS Quarterly Review, March.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression trees. CRC Press.
Caner, M. (2009). Lasso-type GMM estimator. Econometric Theory, 270-290.
Cannon, A., Howse, J., Hush, D., & Scovel, C. (2002). Learning with the Neyman-Pearson and min-max criteria. Los Alamos National Laboratory, Tech. Rep. LA-UR, 02-2951.
Catao, L. A., & Milesi-Ferretti, G. M. (2014). External liabilities and crises. Journal of International Economics, 94(1), 18-32.
Chamon, M., Manasse, P., & Prati, A. (2007). Can we predict the next capital account crisis? IMF Staff Papers, 54(2), 270-305.
Chinn, M. D., & Ito, H. (2006). What matters for financial development? Capital controls, institutions, and interactions. Journal of Development Economics, 81(1), 163-192.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application (No. 1). Cambridge University Press.
Dawood, M., Horsewood, N., & Strobel, F. (2017). Predicting sovereign debt crises: An early warning system approach. Journal of Financial Stability, 28, 16-28.
De la Rocha, M., Perrelli, R., & Mulder, C. B. (2002). The role of corporate, legal and macroeconomic balance sheet indicators in crisis detection and prevention (No. 2-59). IMF Working Paper.
Detken, C., Weeken, O., Alessi, L., Bonfim, D., Boucinha, M. M., Castro, C., ... others (2014). Operationalising the countercyclical capital buffer: Indicator selection, threshold identification and calibration options (Tech. Rep.). ESRB Occasional Paper Series.
Dornbusch, R., Goldfajn, I., Valdes, R. O., Edwards, S., & Bruno, M. (1995). Currency crises and collapses. Brookings Papers on Economic Activity, 1995(2), 219-293.
Duca, M. L., & Peltonen, T. A. (2013). Assessing systemic risks and predicting systemic events. Journal of Banking & Finance, 37(7), 2183-2195.
Durdu, C. B., Mendoza, E. G., & Terrones, M. E. (2009). Precautionary demand for foreign assets in sudden stop economies: An assessment of the new mercantilism. Journal of Development Economics, 89(2), 194-209.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. SIAM.
Efron, B., & Stein, C. (1981). The jackknife estimate of variance. The Annals of Statistics, 586-596.
Eichengreen, B., Rose, A. K., & Wyplosz, C. (1995). Exchange market mayhem: The antecedents and aftermath of speculative attacks. Economic Policy, 10(21), 249-312.
Elkan, C. (2001). The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence (Vol. 17, pp. 973-978).
Fernandez, A., Klein, M. W., Rebucci, A., Schindler, M., & Uribe, M. (2016). Capital control measures: A new dataset. IMF Economic Review, 64(3), 548-574.
Flood, R. P., & Garber, P. M. (1984). Collapsing exchange-rate regimes: Some linear examples. Journal of International Economics, 17(1-2), 1-13.
Frankel, J. A., Rose, A. K., et al. (1996). Currency crashes in emerging markets: An empirical treatment.
Frankel, J. A., & Wei, S.-J. (2004). Managing macroeconomic crises (Tech. Rep.). National Bureau of Economic Research.
Fratzscher, M. (2012). Capital flows, push versus pull factors and the global financial crisis. Journal of International Economics, 88(2), 341-356.
Freund, Y., Schapire, R. E., et al. (1996). Experiments with a new boosting algorithm. In ICML (Vol. 96, pp. 148-156).
Ghosh, S. R., & Ghosh, A. R. (2003). Structural vulnerabilities and currency crises. IMF Staff Papers, 50(3), 481-506.
Gourio, F., Siemer, M., & Verdelhan, A. (2013). International risk cycles. Journal of International Economics, 89(2), 471-484.
Hawkins, D. M. (2004). The problem of overfitting. Journal of Chemical Information and Computer Sciences, 44(1), 1-12.
Hoerl, A., & Kennard, R. (1988). Ridge regression. In Encyclopedia of Statistical Sciences (Vol. 8). Wiley, New York.
Holopainen, M., & Sarlin, P. (2017). Toward robust early-warning models: A horse race, ensembles and model uncertainty. Quantitative Finance, 17(12), 1933-1963.
Hur, S., & Kondo, I. O. (2016). A theory of rollover risk, sudden stops, and foreign reserves. Journal of International Economics, 103, 44-63.
Ilzetzki, E., Reinhart, C. M., & Rogoff, K. S. (2019). Exchange arrangements entering the twenty-first century: Which anchor will hold? The Quarterly Journal of Economics, 134(2), 599-646.
Jeanne, O., & Ranciere, R. (2011). The optimal level of international reserves for emerging market countries: A new formula and some applications. The Economic Journal, 121(555), 905-930.
Kaminsky, G., Lizondo, S., & Reinhart, C. M. (1998). Leading indicators of currency crises. IMF Staff Papers, 45(1), 1-48.
Kaminsky, G. L. (2006). Currency crises: Are they all the same? Journal of International Money and Finance, 25(3), 503-527.
Kaminsky, G. L., & Reinhart, C. M. (1999). The twin crises: The causes of banking and balance-of-payments problems. American Economic Review, 89(3), 473-500.
Kimball, M. S., Sahm, C. R., & Shapiro, M. D. (2008). Imputing risk tolerance from survey responses. Journal of the American Statistical Association, 103(483), 1028-1038.
Knedlik, T., & Von Schweinitz, G. (2012). Macroeconomic imbalances as indicators for debt crises in Europe. JCMS: Journal of Common Market Studies, 50(5), 726-745.
Krugman, P. (1979). A model of balance-of-payments crises. Journal of Money, Credit and Banking, 11(3), 311-325.
Laeven, L., & Valencia, F. (2012). Systemic banking crises database: An update.
Laina, P., Nyholm, J., & Sarlin, P. (2015). Leading indicators of systemic banking crises: Finland in a panel of EU countries. Review of Financial Economics, 24, 18-35.
Lane, P. R., & Milesi-Ferretti, G. M. (2007). The external wealth of nations mark II: Revised and extended estimates of foreign assets and liabilities, 1970-2004. Journal of International Economics, 73(2), 223-250.
Lang, J. H., Peltonen, T. A., & Sarlin, P. (2018). A framework for early-warning modeling with an application to banks.
Manasse, P., & Roubini, N. (2009). "Rules of thumb" for sovereign debt crises. Journal of International Economics, 78(2), 192-205.
Mendoza, E. G. (2002). Credit, prices, and crashes: Business cycles with a sudden stop. In Preventing Currency Crises in Emerging Markets (pp. 335-392). University of Chicago Press.
Miller, M. H., & Zhang, L. (2006). Fear and market failure: Global imbalances and 'self-insurance'.
Moreno Badia, M., Ohnsorge, F., Gupta, P., & Xiang, Y. (2020). Debt is not free.
Obstfeld, M. (1996). Models of currency crises with self-fulfilling features. European Economic Review, 40(3-5), 1037-1047.
Rigollet, P., & Tong, X. (2011). Neyman-Pearson classification, convexity and stochastic constraints. The Journal of Machine Learning Research, 12, 2831-2855.
Savona, R., & Vezzoli, M. (2015). Fitting and forecasting sovereign defaults using multiple risk signals. Oxford Bulletin of Economics and Statistics, 77(1), 66-92.
Scott, C., & Nowak, R. (2005). A Neyman-Pearson approach to statistical learning. IEEE Transactions on Information Theory, 51(11), 3806-3819.
Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., & Napolitano, A. (2009). RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 40(1), 185-197.
Sevim, C., Oztekin, A., Bali, O., Gumus, S., & Guresen, E. (2014). Developing an early warning system to predict currency crises. European Journal of Operational Research, 237(3), 1095-1104.
Stiglitz, J. E. (2007). Making globalization work. W. W. Norton & Company.
Stone, M. (1974). Cross-validation and multinomial prediction. Biometrika, 61(3), 509-515.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.
Tong, X., Feng, Y., & Li, J. J. (2018). Neyman-Pearson classification algorithms and NP receiver operating characteristics. Science Advances, 4(2), eaao1659.
Wu, J. C., & Xia, F. D. (2016). Measuring the macroeconomic impact of monetary policy at the zero lower bound. Journal of Money, Credit and Banking, 48(2-3), 253-291.
Xu, L., Kinkyo, T., & Hamori, S. (2018). Predicting currency crises: A novel approach combining random forests and wavelet transform. Journal of Risk and Financial Management, 11(4), 86.
Zadrozny, B., & Elkan, C. (2001). Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 204-213).
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.

Appendix A: Appendix to Chapter 1

Proof of Proposition 1

It is equivalent to prove that $\exists!\,\pi$ s.t. $\forall \Delta\pi > 0$,
$$U^{\mathrm{real}}(\pi, \pi + \Delta\pi) - U^{\mathrm{real}}(\pi, \pi - \Delta\pi) = \frac{1}{1-\sigma}\Big[\pi\big((\eta(\pi+\Delta\pi)f(\pi+\Delta\pi))^{1-\sigma}\big) + (1-\pi)\big(f(\pi+\Delta\pi)^{1-\sigma}\big)\Big] - \frac{1}{1-\sigma}\Big[\pi\big((\eta(\pi-\Delta\pi)f(\pi-\Delta\pi))^{1-\sigma}\big) + (1-\pi)\big(f(\pi-\Delta\pi)^{1-\sigma}\big)\Big] = 0. \quad (A.1)$$

Defining a function $\Gamma(\Delta\pi \mid \pi)$ as
$$\Gamma(\Delta\pi \mid \pi) = \frac{1}{1-\sigma}\Big[\pi\big((\eta(\pi+\Delta\pi)f(\pi+\Delta\pi))^{1-\sigma}\big) + (1-\pi)\big(f(\pi+\Delta\pi)^{1-\sigma}\big)\Big] - \frac{1}{1-\sigma}\Big[\pi\big((\eta(\pi-\Delta\pi)f(\pi-\Delta\pi))^{1-\sigma}\big) + (1-\pi)\big(f(\pi-\Delta\pi)^{1-\sigma}\big)\Big], \quad (A.2)$$
it then follows that, given $\pi$, $\Gamma(\Delta\pi \mid \pi) \neq 0$ for all $\Delta\pi$, from the assumption that consumers are risk averse, i.e., $\sigma > 0$.

Proof of Proposition 2

Given the assumptions on the distributions of the true and estimated probabilities, as well as the mapping rules from continuous probabilities to binary indicators, I derive the conditional distributions of the true and estimated probabilities given the true crisis realization and the predicted crisis flag, so as to calculate the welfare cost associated with different combinations of the true crisis realization and the predicted crisis flag.

As for the true probability conditional on the true crisis realization, equation (A.3) characterizes the distribution of the true probability conditional on the true crisis realization being 1 and 0, respectively, using Bayes' rule and the mapping rule in equation 1.18, where $f_\pi$ is the probability density function of the true probability $\pi$ and $\mathbb{E}_\pi$ is the expectation taken over the distribution of the true probability $\pi$. As for the estimated probability conditional on the predicted crisis flag and the true probability, equation (A.4) characterizes the distribution of the estimated probability conditional on the true probability and the predicted crisis flag being 1 and 0, respectively, using Bayes' rule and the mapping rule in equation 1.19.

Given the assumption on the estimation error, which models the difference between the estimated and true probabilities in equation 1.17, the objective function is re-written as
$$\begin{aligned} \mathbb{E}\big[U^{\mathrm{real}}(\pi,\pi) - U^{\mathrm{real}}(\pi,\hat{\pi})\big] &= \mathbb{E}_\pi\big[U^{\mathrm{real}}(\pi,\pi) - \mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=0] \mid y=0\big]\,\big(1-\mathbb{P}(\hat{y}=1 \mid y=0)\big)\,\mathbb{P}(y=0) \\ &\quad + \mathbb{E}_\pi\big[U^{\mathrm{real}}(\pi,\pi) - \mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=1] \mid y=0\big]\,\mathbb{P}(\hat{y}=1 \mid y=0)\,\mathbb{P}(y=0) \\ &\quad + \mathbb{E}_\pi\big[U^{\mathrm{real}}(\pi,\pi) - \mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=1] \mid y=1\big]\,\big(1-\mathbb{P}(\hat{y}=0 \mid y=1)\big)\,\mathbb{P}(y=1) \\ &\quad + \mathbb{E}_\pi\big[U^{\mathrm{real}}(\pi,\pi) - \mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=0] \mid y=1\big]\,\mathbb{P}(\hat{y}=0 \mid y=1)\,\mathbb{P}(y=1). \end{aligned} \quad (A.5)$$

Hence, it can be seen from equation A.5 that the four constant expectation terms are the welfare costs associated respectively with the four different combinations of predicted crisis flag and true crisis realization characterized in Table 1.2. Table A.1 summarizes these welfare costs in a welfare cost matrix.

Table A.1: Welfare Cost Matrix
Predicted non-crisis, true non-crisis: $\mathbb{E}_\pi\big[U^{\mathrm{real}}(\pi,\pi) - \mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=0] \mid y=0\big]$
Predicted non-crisis, true crisis: $\mathbb{E}_\pi\big[U^{\mathrm{real}}(\pi,\pi) - \mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=0] \mid y=1\big]$
Predicted crisis, true non-crisis: $\mathbb{E}_\pi\big[U^{\mathrm{real}}(\pi,\pi) - \mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=1] \mid y=0\big]$
Predicted crisis, true crisis: $\mathbb{E}_\pi\big[U^{\mathrm{real}}(\pi,\pi) - \mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=1] \mid y=1\big]$
Notes. I adopt the notation that 0 indicates a normal period and 1 indicates a sudden stop, and define the null hypothesis as "it is a normal period."

It can be seen from equation A.5 and Table A.1 that, unlike in the classical binary classification problem, the proportions of true negatives and true positives also incur welfare costs and therefore contribute to the overall objective. I then re-arrange equation A.5 so that prediction outcomes only affect the risk through the type I error (i.e., false alarms) and the type II error (i.e., missed crises). (To do so, I first re-write the probability of the intersection of the true crisis realization being 1/0 and the predicted crisis flag being 1/0 as the product of the conditional probability of the predicted crisis flag being 1/0 given the true crisis realization being 1/0 and the unconditional probability of the true crisis realization being 1/0, and then substitute the conditional probability of the predicted crisis flag coinciding with the true crisis realization with one minus the conditional probability of the predicted crisis flag differing from the true binary outcome, i.e., $\mathbb{P}(\hat{y}=0 \mid y=0) = 1 - \mathbb{P}(\hat{y}=1 \mid y=0)$ and $\mathbb{P}(\hat{y}=1 \mid y=1) = 1 - \mathbb{P}(\hat{y}=0 \mid y=1)$.)

$$\begin{aligned} \mathbb{E}\big[U^{\mathrm{real}}(\pi,\pi) - U^{\mathrm{real}}(\pi,\hat{\pi})\big] &= \mathbb{E}_\pi\big[\mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=0] - \mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=1] \mid y=0\big]\,\mathbb{P}(\hat{y}=1 \mid y=0)\,\mathbb{P}(y=0) \\ &\quad + \mathbb{E}_\pi\big[\mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=1] - \mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=0] \mid y=1\big]\,\mathbb{P}(\hat{y}=0 \mid y=1)\,\mathbb{P}(y=1) \\ &\quad + \Big(\mathbb{E}_\pi\big[U^{\mathrm{real}}(\pi,\pi)\big] - \mathbb{E}_\pi\big[\mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=0] \mid y=0\big]\,\mathbb{P}(y=0) - \mathbb{E}_\pi\big[\mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=1] \mid y=1\big]\,\mathbb{P}(y=1)\Big). \end{aligned} \quad (A.6)$$

It can be seen from equation A.6 that the objective function, which is the population-level expected welfare loss, can be decomposed into three parts: the first part is attributed to the welfare cost incurred by false alarms; the second part is attributed to the welfare cost incurred by missed crises; and the third part is attributed to the welfare cost caused by the discrepancy between the estimated and true probabilities even when the predicted and true binary outcomes coincide, and hence is independent of binary classification outcomes. Therefore, I ignore the third part in formulating the first-stage crisis risk estimation problem, as the objective function is not affected by binary classification outcomes through it. As a result, the objective function now becomes
$$\begin{aligned} \mathbb{E}\big[U^{\mathrm{real}}(\pi,\pi) - U^{\mathrm{real}}(\pi,\hat{\pi})\big] &= \mathbb{E}_\pi\big[\mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=0] - \mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=1] \mid y=0\big]\,\mathbb{P}(\hat{y}=1 \mid y=0)\,\mathbb{P}(y=0) \\ &\quad + \mathbb{E}_\pi\big[\mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=1] - \mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=0] \mid y=1\big]\,\mathbb{P}(\hat{y}=0 \mid y=1)\,\mathbb{P}(y=1), \end{aligned} \quad (A.7)$$
where the first conditional expectation is the welfare cost incurred by false alarms, and the second is the welfare cost incurred by missed crises.

I then define $H(\pi)$ as in equation (A.8), which is characterized in equation (A.9). It can be further written as
$$H(\pi) = \frac{1}{1-\sigma}\left[\int_{c-\pi}^{\infty} f(\pi-\varepsilon)^{1-\sigma}\big(\eta(\pi-\varepsilon)^{1-\sigma} - 1\big)\,\frac{dF_\varepsilon(\varepsilon)}{\mathrm{Prob}(\hat{y}=1)} - \int_{-\infty}^{c-\pi} f(\pi-\varepsilon)^{1-\sigma}\big(\eta(\pi-\varepsilon)^{1-\sigma} - 1\big)\,\frac{dF_\varepsilon(\varepsilon)}{\mathrm{Prob}(\hat{y}=0)}\right] + \frac{1}{1-\sigma}\left\{\int_{c-\pi}^{\infty} f(\pi-\varepsilon)^{1-\sigma}\,\frac{dF_\varepsilon(\varepsilon)}{\mathrm{Prob}(\hat{y}=1)} - \int_{-\infty}^{c-\pi} f(\pi-\varepsilon)^{1-\sigma}\,\frac{dF_\varepsilon(\varepsilon)}{\mathrm{Prob}(\hat{y}=0)}\right\}. \quad (A.10)$$

Then the welfare costs incurred by false alarms and missed crises can be re-written as functions of $H(\pi)$:
$$\mathbb{E}_\pi\big[\mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=0] - \mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=1] \mid y=0\big] = \frac{1}{\mathbb{P}(y=0)}\,\mathbb{E}_\pi\big[(\pi-1)H(\pi)\big],$$
$$\mathbb{E}_\pi\big[\mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=1] - \mathbb{E}_\varepsilon[U^{\mathrm{real}}(\pi,\pi-\varepsilon) \mid \hat{y}=0] \mid y=1\big] = \frac{1}{\mathbb{P}(y=1)}\,\mathbb{E}_\pi\big[\pi H(\pi)\big]. \quad (A.11)$$

From equation A.11 it can be seen that the welfare cost incurred by false alarms and missed crises consists of two parts. The first part is the inverse of the (unconditional) probability of a normal period and of a sudden stop, respectively. The second part is an expectation taken over the population-level distribution of the true probability of a sudden stop, $\pi$. Therefore, it implies that the welfare cost of an incorrect binary prediction should be adjusted additionally on top of the inverse of the (unconditional) probability of the binary crisis realizations, and differently between false alarms and missed crises, as characterized by $\mathbb{E}_\pi[(\pi-1)H(\pi)]$ and $\mathbb{E}_\pi[\pi H(\pi)]$ for false alarms and missed crises, respectively.

Plugging the welfare costs characterized in equation A.11 back into equation A.7, the objective function can be re-written as a weighted average of the percentages of false alarms and missed crises,
$$\mathbb{E}\big[U^{\mathrm{real}}(\pi,\pi) - U^{\mathrm{real}}(\pi,\hat{\pi})\big] = w_{\hat{y}=1 \mid y=0} \cdot \mathbb{P}(\hat{y}=1 \mid y=0) + w_{\hat{y}=0 \mid y=1} \cdot \mathbb{P}(\hat{y}=0 \mid y=1), \quad (A.12)$$
where the weight on the percentage of false alarms, denoted by $w_{\hat{y}=1 \mid y=0}$, and the weight on the percentage of missed crises, denoted by $w_{\hat{y}=0 \mid y=1}$, are
$$w_{\hat{y}=1 \mid y=0} = \mathbb{E}_\pi\big[(\pi-1)H(\pi)\big], \qquad w_{\hat{y}=0 \mid y=1} = \mathbb{E}_\pi\big[\pi H(\pi)\big]. \quad (A.13)$$

Hence, we have
$$w_{\hat{y}=0 \mid y=1} - w_{\hat{y}=1 \mid y=0} = \mathbb{E}_\pi\big[H(\pi)\big], \quad (A.14)$$
which, by Jensen's inequality, is larger than or equal to zero, as the concavity is inherited from the utility function. Therefore, the welfare-based weight on the percentage of missed crises is greater than that on the percentage of false alarms, as long as consumers are risk averse.

Proof of Proposition 3

Let $(X, Y)$ be a random pair where $X \in \mathbb{R}^d$ is a $d$-dimensional vector of features and $Y \in \{0, 1\}$ indicates $X$'s class label. A classifier $\phi: \mathcal{X} \to \{0, 1\}$ is a data-dependent mapping from $\mathcal{X}$ to $\{0, 1\}$ that assigns $X$ to one of the classes. The classification error of $\phi$ is $R(\phi) = \mathbb{E}\big(\mathbb{1}\{\phi(X) \neq Y\}\big) = \mathbb{P}\{\phi(X) \neq Y\}$, where $\mathbb{1}(\cdot)$ denotes the indicator function. By the law of total probability, $R(\phi)$ can be decomposed into a weighted average of the type I error $R_0(\phi) = \mathbb{P}\{\phi(X) \neq Y \mid Y = 0\}$ and the type II error $R_1(\phi) = \mathbb{P}\{\phi(X) \neq Y \mid Y = 1\}$ as
$$R(\phi) = \pi_0 R_0(\phi) + \pi_1 R_1(\phi), \quad (A.15)$$
where $\pi_0 = \mathbb{P}(Y = 0)$ and $\pi_1 = \mathbb{P}(Y = 1)$. The classical paradigm is to find a classifier that minimizes $R(\cdot)$. Denote by $f_0$ and $f_1$ the conditional probability density functions of Class 0 and Class 1, respectively. It is well known that the Bayes classifier (i.e., the oracle classifier) of the classical paradigm is
$$\phi^*(x) = \mathbb{1}\big(\eta(x) > 1/2\big) = \mathbb{1}\left(\frac{f_1(x)}{f_0(x)} > \frac{\pi_0}{\pi_1}\right). \quad (A.16)$$
Similarly, when the objective is to minimize the expected welfare cost, i.e.,
fo(x) W1 (A.18) 141 Appendix B Appendix to Chapter 2 Table B.l: Average Sum of Errors in Rolling Cutoff Testing Sum of false alarms and missed crises Signal Extraction Elastic Net LASSO Ridge Random Forests RUSBoost SSGis 0.695 0.93 0.98 0.93 0.828 0.989 EMPEs-AEs 0.638 0.839 0.902 0.936 0.745 0.727 EMPEs-EMs 0.823 0.823 0.781 0.793 0.813 0.743 EMPEs-LICs 0.854 0.891 0.917 0.993 0.918 0.86 Table B.2: Country Sample Advanced Economies Emerging Markets Low-Income Countries Australia Albania Afghanistan Austria Algeria Bangladesh Belgium Angola Barbados Canada Argentina Benin Cyprus Armenia Bhutan Czech Republic Azerbaijan Bolivia Denmark Bahamas, The Burkina Faso Estonia Belarus Burundi Finland Bosnia and Herzegovina Cabo Verde France Brazil Cambodia 142 Germany Bulgaria Cameroon Greece Chile Central African Republic Hong Kong SAR China Chad Iceland Colombia Comoros Ireland Costa Rica Congo, Democratic Republic of the Israel Croatia Congo, Republic of Italy Dominican Republic Cote d'Ivoire Japan Ecuador Djibouti Korea Egypt Dominica Luxembourg El Salvador Eritrea Malta Georgia Ethiopia Netherlands Guatemala Gambia, The New Zealand Hungary Ghana Norway India Grenada Portugal Indonesia Guinea Singapore Jamaica Guinea-Bissau Slovak Republic Jordan Guyana Slovenia Kazakhstan Haiti Spain Latvia Honduras Sweden Lebanon Kenya Switzerland Lithuania Kiribati United Kingdom Macedonia, FYR Kyrgyz Republic United States Malaysia Lao P.D.R. Mauritius Lesotho Mexico Liberia Morocco Madagascar Pakistan Malawi Panama Maldives Peru Mali Philippines Marshall Islands Poland Mauritania Romania Micronesia Russia Moldova 143 Serbia South Africa Sri Lanka Thailand Tunisia Turkey Ukraine Uruguay Venezuela Vietnam Mongolia Mozambique Myanmar Nepal Nicaragua Niger Nigeria Papua New Guinea Rwanda Samoa Sao Tome and Principe Senegal Sierra Leone Solomon Islands South Sudan St. Lucia St. Vincent and the Grenadines Sudan Tajikistan Tanzania Timor-Leste Togo Tonga Tuvalu Uganda Uzbekistan Vanuatu Yemen Zambia Zimbabwe 144
Abstract
This dissertation focuses on crisis forecasting and risk assessment, studying how to model a policy-consistent risk assessment framework, and how to incorporate advanced techniques into crisis forecasting. The first chapter investigates the mutual interaction between crisis risk estimation and crisis prevention policy in the context of sudden stops. Proposing a two-stage framework embedding an early-warning problem into a policy-making problem, this chapter conducts a welfare analysis of crisis risk estimation based on subsequent policy responses. Building upon this two-stage framework, this chapter shows that there is welfare cost asymmetry between two types of errors: the welfare-maximizing weight on the percentage of missed crises must be greater than that on the percentage of false alarms. Introducing a constrained optimization under the Neyman-Pearson paradigm to solve this error-asymmetry problem reduces overall welfare loss by more than 20 percent. Bringing this welfare-based crisis risk estimation model to emerging market countries’ data, this chapter uncovers time-varying risk tolerance of policymakers and explores its policy implications on identifying leading indicators and determining levels of reserves through counterfactual analysis. ❧ The second chapter evaluates the performance of the signal extraction approach and machine learning techniques for the prediction of external crises, by generating crisis lists for two types of external crises―sudden stops with growth impacts, and exchange market pressure events―for 159 countries over 27 years. Bearing in mind the potentially sharp divergence between in-sample and out-of-sample performance, this chapter designs a rigorous testing procedure attuned to the temporal dependence of macro data and the manner in which the models would be used in practice. The horse race results show that sudden stops with growth impacts are well predicted by the signal extraction approach, while exchange market pressure events in the same set of emerging markets, which are more heterogeneous, are well predicted by machine learning techniques. This chapter also sheds light on variable importance rankings and on some of the important non-monotonicities and interactions that machine learning uncovers from the historical data. ❧ The third chapter aims to assess model performance uncertainty and test model ranking significance when conducting a horse race and selecting the best model to use, bearing in mind the small data nature of macroeconomic data in the early-warning framework. To assess model performance uncertainty, this chapter explores three sources of data variation in the early-warning framework and proposes three types of jackknifing methods to construct confidence intervals of model performance, respectively. Additionally, this chapter proposes to construct confidence intervals of the conditional performance difference and to perform hypothesis testing on the conditional performance difference to test model ranking significance. The approaches are illustrated in an example of predicting sudden stops in capital flows for emerging market countries. Results show that the degree of model performance uncertainty depends on the structure of the model and the source of data variation. Also, our approach to constructing confidence intervals of the conditional performance difference presents evidence of model ranking significance which is otherwise not revealed by simply comparing confidence intervals of individual model performance.