Essays on High-Dimensional Econometric Models

by

Lidan Tan

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Economics)

August 2021

Copyright 2021 Lidan Tan

Dedication

To my parents, Jun Tan (谭军) and Zhongmei Liu (刘钟梅), and my grandparents, Liang Liu (刘良) and Rongchun Wang (王荣春).

Acknowledgements

Throughout my days preparing and writing this dissertation, I received a great deal of support and assistance. First, I want to express my gratitude to my committee members, Dr. Hyungsik Roger Moon, Dr. Cheng Hsiao, and Dr. Wenguang Sun, for their insightful comments and advice. Special thanks go to my advisor, Dr. Hyungsik Roger Moon, who has constantly supported, encouraged, and helped me throughout the whole journey.

My deep gratitude extends to other faculty members at USC, including Shinichi Sakata, Juan Carrillo, Joel David, Michael Leung, Hashem Pesaran, Geert Ridder, and Guofu Tan. I appreciate their help with the class material and their introducing me to rigorous economic research.

I want to thank my colleagues and friends at USC. I would never have made it here without their help, support, and companionship. Among many others, I would like to thank Yu Ai, Changlong Wang, Yujia Yang, Xuyang (Gordon) Wang, Zhenhuan Yang, Jiusi (Josie) Xiao, Hong Liu, Guofei Liu, Wei Peng, Shenhe Zhang, Hanying Ding, Zihao Zheng, Kanika Aggarwal, Andreas Aristidou, Jason Choi, Weiran Deng, Bada Han, Qin Jiang, Youngmin Ju, Jeehyun Ko, Eunjee Kwon, Yinan Liu, Yiwei Qian, Yinqi Zhang, Yimeng Xie, Jisu Cao, Yu Cao, Shichen Wang, Weining Xin, Zhan Gao, Jeong (Chris) Yoo, and Grigory Franguridi. My special thanks go to Yuqi Song, for her company and unconditional support through these years.

My fourth thanks go to the USC economics department and its excellent staff for financial support, research grants, and administrative help. Thank you, Alexander Karnazes, Young Miller, Irma Alfaro, and Morgan Ponder.

Finally, I want to thank my parents and family. My dad has always been the one I look up to. Thanks to my mom's thousands of WeChat messages and phone calls, I know you always back me up. I love you. I was raised by my grandparents, Liang Liu and Rongchun Wang; I could not be here without your sacrifice, love, and care. Lastly, I want to thank myself for not giving up at the hardest times and for working and grinding through hundreds of nights at the TA office. There is a poem by Lu You of the Southern Song Dynasty: "After endless mountains and rivers that leave doubt whether there is a path out, suddenly one encounters the shade of a willow, bright flowers and a lovely village." Fight On!

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Estimation of High-Dimensional Seemingly Unrelated Regression Models
  1.1 Introduction
  1.2 Setup
    1.2.1 SUR Model
    1.2.2 Estimators
    1.2.3 Precision Matrix and Undirected Graph
  1.3 Asymptotic Properties of the FGLasso Estimator
    1.3.1 Properties of $\widehat{\Omega}_{gl}$
    1.3.2 Asymptotic Properties of $\widehat{\beta}_{FGLasso}$
  1.4 Discussion
    1.4.1 Comparison to a Related Estimator
    1.4.2 Sparsity Under Strong Cross-Sectional Dependence
  1.5 Monte Carlo Simulations
    1.5.1 MC Design
    1.5.2 MC Results
  1.6 Conclusion

Chapter 2: Estimation of High-Dimensional VAR and Interactive Fixed Effects
  2.1 Introduction
  2.2 Setup
    2.2.1 Model
    2.2.2 Alternative Model Presentation
    2.2.3 Relation with FAVAR and FM
      2.2.3.1 FAVAR
      2.2.3.2 FM
    2.2.4 Unique Matrix Separation
  2.3 Joint Estimation
    2.3.1 Nuclear Norm Regularization
    2.3.2 Iterative Algorithm
    2.3.3 General Upper Bound
    2.3.4 Convergence Rates
    2.3.5 Debiasing Procedure
  2.4 Multi-Stage Estimation
  2.5 Monte Carlo Experiment
    2.5.1 Experiment Design
    2.5.2 Tuning Parameter Choice
    2.5.3 Monte Carlo Results
  2.6 Application: Forecasting US Macro Indexes
    2.6.1 Data
    2.6.2 Forecasting Exercise
    2.6.3 Forecasting Results
  2.7 Conclusion

Chapter 3: Global Bank Network Analysis
  3.1 Introduction
  3.2 Banks' Volatility Data
  3.3 Connectedness Measures
  3.4 High-Dimensional Covariance Matrix Estimation
  3.5 Connectedness Analysis
    3.5.1 Evidence of Factors
    3.5.2 Banks' Connectedness
    3.5.3 Integrated Country Connectedness
  3.6 Conclusion

Appendices
  G Proofs for Chapter 1
    G.1 Proof of Lemma 1
    G.2 Proof of Proposition 1
    G.3 Proof of Proposition 2
  H Proofs for Chapter 2
    H.1 Proof of Lemma 2
    H.2 Proof of Lemma 3
    H.3 Proof of Proposition 3
    H.4 Proof of Theorem 3
    H.5 Proof of Proposition 4
    H.6 Proof of Lemma 4
    H.7 Proof of Proposition 5
    H.8 Technical Lemmas
      H.8.1 Complementary Lemma
      H.8.2 Sufficient Condition for the RSC Condition
      H.8.3 Sufficient Condition for Assumption 7
      H.8.4 Simulation of the Operator Norm Bound in Proposition 5
  I Global Bank Details for Chapter 3

List of Tables
  1.1 MC Results for $\widehat{\beta}$
  1.2 MC Results for $\widehat{\beta}_1$
  2.1 Monte Carlo Results
  2.2 Forecasting Results
  3.1 Operator Norm Bound
  3.2 Global Bank Details

List of Figures
  1.1 Sparse network graph and precision matrix
  1.2 Four-Nearest-Neighbor Lattice (N = 9)
  2.1 Forecasting Results from 2005-2010
  3.1 Evidence of Factors
  3.2 Total Directional Connectedness
  3.3 Country Integrated Network at Multiple Times
  3.4 Country Integrated Connectedness
  3.5 Total Country Directional Connectedness
  3.6 Country Net Connectedness
  A.1 Operator Norm Bound

Abstract

This dissertation contributes to the estimation of high-dimensional econometric models, as well as their applications.

In the first chapter (co-authored with Khai X.
Chiong and Hyungsik Roger Moon), we investigate seemingly unrelated regression (SUR) models that allow the number of equations ($N$) to be large and comparable to the number of observations in each equation ($T$). It is well known that conventional SUR estimators, for example the feasible generalized least squares (FGLS) estimator of Zellner (1962), do not perform well in a high-dimensional setting. We propose a new feasible GLS estimator called the feasible graphical lasso (FGLasso) estimator. For a feasible implementation of the GLS estimator, we use the graphical lasso estimate of the precision matrix (the inverse of the covariance matrix of the equation-system errors), assuming that the underlying unknown precision matrix is sparse. We show that under certain conditions FGLasso converges uniformly to GLS even when $T < N$, and that it shares the same asymptotic distribution as the efficient GLS estimator when $T > N \log N$. We confirm these results through finite-sample Monte Carlo simulations.

The second chapter studies a vector autoregressive (VAR) model with interactive fixed effects in a high-dimensional setting, which allows both the number of cross-sectional units $N$ and the number of time periods $T$ to go to infinity. Assuming that the VAR transition matrix is low rank, the chapter first proposes a nuclear-norm-regularization-based method that estimates the transition matrix and the interactive fixed effects simultaneously. Under certain conditions, the chapter shows that, on average, the deviation of each element of the estimated matrix shrinks to 0 as $N, T \to \infty$. Since the nuclear norm penalty induces biases, a debiasing procedure is then introduced to improve the estimators' finite-sample performance. Independently, leveraging principal component analysis (PCA), the chapter proposes a multi-stage estimation method that estimates the parameters in multiple stages. I show that this method improves the convergence rates of the VAR transition matrix and reduces biases. In Monte Carlo simulations, I examine the estimators' finite-sample performance, and the results agree with the theory. Empirically, the chapter revisits the US macro data from McCracken and Ng (2016) and shows that the model has a clear advantage in forecasting macro indexes (IP, CPI, and the federal funds rate) compared with the reduced-rank VAR model (RRVAR) and the pure factor model (FM), especially at long horizons.

The third chapter applies the model from the previous chapter to analyzing connectedness among 29 countries using 96 banks' volatility data from 2003-2021. I construct the dynamic network of global banks as well as the integrated country network using rolling-window estimation. I find that system-wide shocks such as economic crises and the pandemic dramatically raise the connectedness level. The heat quickly calms down after the shocks, and connectedness returns to a very low level. The integrated country network is more persistent: connectedness across countries climbs as system-wide shocks arrive and stays elevated for a much longer period of time before trending down.

Chapter 1
Estimation of High-Dimensional Seemingly Unrelated Regression Models

1.1 Introduction

An SUR model comprises multiple individual regression equations that are correlated with each other. In our setup, we assume that there are $N$ regression equations observed over periods $t = 1, 2, \ldots, T$. These regression equations are related in the sense that the regression errors of the equation system are correlated.
When $N$ is fixed, Zellner (1962) proposed the feasible generalized least squares (FGLS) estimator, which is based on an estimator of the inverse of the covariance matrix, i.e., the precision matrix ($\Omega := \Sigma^{-1}$), of the SUR equation system. This estimator is often computed in two steps. In the first step, one estimates each equation by ordinary least squares (OLS) and computes the residuals. In the second step, one computes the FGLS estimator based on the inverse of the sample covariance matrix of the residuals.

In recent years, high-dimensional panel data sets with large $N$ relative to $T$ have become available to researchers, which motivates the use of SUR in a high-dimensional panel data setting. For example, in financial economics, Pástor and Stambaugh (2002) extend the CAPM model and use SUR to study funds' excess returns; in their model, $N$ is up to 2609 with $T = 240$. Baltagi and Bresson (2011) use a high-dimensional SUR ($N = 80$, $T = 14$) to study hedonic housing prices in the Paris real estate market, controlling for both micro markets and market segmentation between several kinds of flats. Elberg (2016) studies deviations of sticky prices from the law of one price using an SUR model with $N = 55$ cross-sectional units and $T = 362$ time periods.

The traditional FGLS estimator requires the sample covariance matrix $\widehat{\Sigma}$. When $N$ is fixed and $T$ goes to infinity, FGLS performs well, since $\widehat{\Sigma}$ converges to the population $\Sigma$ at the rate $\sqrt{T}$. In the high-dimensional setting, however, $\widehat{\Sigma}$ behaves poorly and can lead to invalid conclusions. For example, when $N$ and $T$ increase with $N/T \to c \in (0, \infty]$, the largest eigenvalue of $\widehat{\Sigma}$ is not a consistent estimator of the largest eigenvalue of the population covariance matrix $\Sigma$, and the eigenvectors of the sample covariance matrix can be nearly orthogonal to the truth (see Johnstone and Lu (2009)). Moreover, when $N > T$, $\widehat{\Sigma}$ is rank deficient and thus its inverse is not well defined.

In this chapter, we revisit the problem of estimating the classical SUR model in a high-dimensional data setting. To feasibly estimate a high-dimensional SUR, we need to estimate a high-dimensional precision matrix. In general, there are two ways to do this: the first approach is to estimate the covariance matrix and then take its inverse; the second approach is to estimate the precision matrix directly via proper regularization. A recent paper by Fan et al. (2019) (FHPJ hereafter) takes the first approach, while in our paper we take the second approach, imposing a sparsity restriction on $\Omega$. In Section 4, we further discuss the differences between the two papers.

Sparsity of the precision matrix is a reasonable and commonly used restriction. For instance, Cai et al. (2019) analyze the minimum-variance portfolio allocation of stocks in the S&P 100 using high-frequency data. Assuming the precision matrix of the stocks is sparse, they derive the optimal allocation using the estimated precision matrix and find that their approach leads to lower volatility than many other existing methods.

Sparsity of the precision matrix also has an intuitive interpretation. While the covariance matrix encodes marginal correlations between variables, the precision matrix encodes conditional correlations between pairs of variables given the remaining variables.
Therefore, the non-zeros of the precision matrix correspond to links in a Gaussian graphical model, which is a graph summarizing the conditional dependence relationships among a large set of random variables. As such, sparsity of the precision matrix corresponds to sparsity, in a graph-theoretic sense, of the underlying graphical model. Note that even as we impose sparsity on the precision matrix, the errors are allowed to be correlated across equations: when two random variables are conditionally uncorrelated, they can still be marginally correlated. Even while the precision matrix is sparse, the covariance matrix is typically dense. As an example of a sparse graphical model, Chiong and Moon (2017) estimate the network dependence structure of firms' investment decisions. The recovered graphical model exhibits a sparse core-periphery structure.¹

Estimating a high-dimensional precision matrix is an active topic in machine learning and statistics; see, for example, Friedman et al. (2008), Lam and Fan (2009), and Cai et al. (2011). In this chapter, we build our analysis upon Ravikumar et al. (2011) (RWRY hereafter) and propose a two-stage, (nearly) efficient estimator called the feasible graphical lasso (FGLasso) estimator. In the first stage, we compute the graphical lasso estimator $\widehat{\Omega}_{gl}$ using the OLS residuals; we then replace $\Omega$ in the GLS estimator with $\widehat{\Omega}_{gl}$. The contributions are as follows. First, since the graphical lasso estimator of RWRY is built on observed random variables, their theory is not directly applicable here. We show that the graphical lasso estimator $\widehat{\Omega}_{gl}$, derived from the OLS residuals, converges to the true $\Omega$ at the rate $O_p(\sqrt{\log N / T})$ in the element-wise maximum norm, while preserving the sparsity pattern. As $N, T \to \infty$, we also provide conditions under which our estimator converges uniformly to GLS and is asymptotically equivalent to the (infeasible) GLS estimator. Specifically, we show that if the maximum number of nonzero entries per row of $\Omega$ is bounded or grows much more slowly than $N$, then FGLasso converges uniformly to the GLS estimator if $T$ grows faster than $\sqrt{N \log N}$. Further, if $T > N \log N$, FGLasso has the same asymptotic distribution as GLS, which is (nearly) efficient.

In the Monte Carlo study, we compare the performance of the FGLasso estimator with OLS, GLS, FGLS, and the PQMLE proposed by FHPJ. We find that in the high-dimensional setting, our estimator is comparable to GLS and better than OLS and FGLS in all experiments. When the underlying $\Omega$ is sparse, our estimator also outperforms PQMLE. However, if $\Omega$ is dense while $\Sigma$ is sparse, our estimator is less satisfactory but still maintains good inference properties.

¹ The core firms are linked to each other and to the periphery firms, while the periphery firms are not linked to each other directly but only through the core firms.

This chapter is organized as follows. Section 2 discusses the SUR model in detail and summarizes the OLS, GLS, FGLS, and FGLasso estimators, as well as the relationship between $\Omega$ and Markov random graph theory. All theoretical results are established in Section 3. In Section 4, we compare our results with FHPJ. In Section 5, we report the Monte Carlo simulation results. Section 6 concludes. All technical proofs are provided in the corresponding appendix.

Notation: We briefly summarize the notation used throughout the chapter. For a real-valued matrix $A \in \mathbb{R}^{m \times n}$, denote by $s_{max}(A)$ and $s_{min}(A)$ its maximum and minimum singular values, respectively.
Let $A'$ denote the transpose of $A$ and $\otimes$ the Kronecker product. Define the operator norm, Frobenius norm, element-wise maximum norm, and maximum row-sum norm as $\|A\|_{op} = s_{max}(A)$, $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$, $\|A\|_\infty = \max_{i,j} |A_{ij}|$, and $|||A|||_\infty = \max_{i=1,2,\ldots,m} \sum_{j=1}^n |A_{ij}|$. Further, let $tr(A) = \sum_i^{\min\{m,n\}} A_{ii}$ denote $A$'s trace and $tr\langle A \rangle := \sum_i s_i(A)$, where $s_i(A)$ is the $i$-th largest singular value of $A$. For a real sequence $\{a_n\}_{n=1}^\infty$ and a positive sequence $\{b_n\}_{n=1}^\infty$, we write $a_n = O(b_n)$ if there exists a finite constant $C > 0$ such that $|a_n| \le C b_n$ as $n \to \infty$, and $a_n = O_p(b_n)$ if $P(|a_n| \le C b_n) \to 1$ as $n \to \infty$. We use $\Rightarrow$ and $\to_p$ to denote convergence in distribution and convergence in probability, respectively. $1\{\cdot\}$ stands for the indicator function.

1.2 Setup

1.2.1 SUR Model

Suppose we estimate a system of linear equations:

    $Y_{it} = \beta_i' X_{it} + U_{it}$                                   (1.1)

for $i = 1, \ldots, N$ and $t = 1, \ldots, T$. Here $X_{it} = (X_{it,1}, X_{it,2}, \ldots, X_{it,K_i})'$ is a $K_i$-column vector of regressors corresponding to unit $i$, and $U_{it}$ is the unobserved error term. The heterogeneous regression coefficients $\beta_i \in \mathbb{R}^{K_i \times 1}$ are the parameters of interest.

Stacking the observations over the $N$ units, let $Y_t = (Y_{1t}, \ldots, Y_{Nt})' \in \mathbb{R}^N$, $U_t = (U_{1t}, \ldots, U_{Nt})' \in \mathbb{R}^N$, $X_t = diag(X_{1t}, \ldots, X_{Nt}) \in \mathbb{R}^{\sum_{i=1}^N K_i \times N}$, and $\beta = (\beta_1', \ldots, \beta_N')' \in \mathbb{R}^{\sum_{i=1}^N K_i}$. The system of equations in (1.1) can be expressed as

    $Y_t = X_t' \beta + U_t.$                                             (1.2)

Alternatively, stacking the observations in (1.1) over $t$, we can also express the system as

    $Y_i = X_i \beta_i + U_i,$                                            (1.3)

where $Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{iT})' \in \mathbb{R}^T$, $X_i = (X_{i1}, X_{i2}, \ldots, X_{iT})' \in \mathbb{R}^{T \times K_i}$, and $U_i = (U_{i1}, U_{i2}, \ldots, U_{iT})' \in \mathbb{R}^T$. In matrix form, we can write the model as

    $Y = X \beta + U,$                                                    (1.4)

where $Y = (Y_1', Y_2', \ldots, Y_N')' \in \mathbb{R}^{NT}$, $U = (U_1', U_2', \ldots, U_N')' \in \mathbb{R}^{NT}$, and $X = diag(X_1, X_2, \ldots, X_N) \in \mathbb{R}^{NT \times \sum_{i=1}^N K_i}$.

In this chapter, we make the classical linear system assumptions:

Assumption 1 (Model).
(i) $X$ is a full-rank matrix, and there exists at least one pair $(i, j)$ such that $X_i \ne X_j$;
(ii) There exists a constant $K > 0$ such that $K_i \le K$ for all $i = 1, 2, \ldots, N$;
(iii) Let $\mathcal{F}_t := \{X_1, \ldots, X_t, Y_1, \ldots, Y_{t-1}\}$ be the information up to $t$. Assume that $E(U_t \mid \mathcal{F}_t) = 0$ and $E(U_t U_t') =: \Sigma > 0$;
(iv) Conditional on $X$, $U_{it} / \sqrt{\Sigma_{ii}}$ is i.i.d. sub-Gaussian over $t$ with parameter $s$, i.e.,

    $E[\exp(\lambda (U_{it} / \sqrt{\Sigma_{ii}})) \mid X] \le e^{\lambda^2 s^2 / 2}$, for all $\lambda \in \mathbb{R}$.   (1.5)

The first three conditions in Assumption 1 are classical in the SUR literature. The first condition excludes the case where all the regressors are identical, in which the OLS estimator becomes efficient and there is no gain from using the information in the system of equations. The second condition requires the number of independent variables for each individual to be bounded; we leave the case where the number of regressors also grows with the sample size for future research. The third condition assumes the regressors are weakly exogenous, which allows the regressors to be pre-determined. We also assume homoskedasticity of the residuals; this condition could be relaxed to conditional heteroskedasticity, but at the cost of technical complexity in the asymptotic results, so we assume it for simplicity. The last condition regulates the tail of the distribution of $U_{it} / \sqrt{\Sigma_{ii}}$.
There are several conditions equivalent to (1.5); for example, it can be replaced by the moment condition

    $(E[|U_{it} / \sqrt{\Sigma_{ii}}|^p \mid X])^{1/p} \le C s^2 \sqrt{p}$,

for some constant $C > 0$ and all $p \ge 1$; see Proposition 2.5.2 in Vershynin (2018) for more details.

Assumption 2 (Regularity Condition). There exist universal constants $c_0, c_1 > 0$ such that the following are satisfied:
(i) The singular values of the precision matrix $\Omega = \Sigma^{-1}$ satisfy $c_0 \le s_{min}(\Omega) \le s_{max}(\Omega) \le c_1$;
(ii) Define the $\sum_{i=1}^N K_i \times \sum_{i=1}^N K_i$ matrices $\widehat{Q}$ and $Q$ with sub-blocks $\widehat{Q}_{ij} := \frac{1}{T} X_i' X_j \in \mathbb{R}^{K_i \times K_j}$ and $Q_{ij} := E(\frac{1}{T} X_i' X_j)$. Assume $\sup_{1 \le i \le N} \| \frac{1}{T} X_i' X_i - Q_{ii} \|_\infty = o_p(1)$ as $N, T \to \infty$, and $c_0 \le \min_{1 \le i \le N} s_{min}(Q_{ii}) \le \max_{1 \le i \le N} s_{max}(Q_{ii}) \le c_1$.

The first condition in Assumption 2 regulates the true precision matrix $\Omega$: we assume its smallest eigenvalue does not shrink to zero and its largest element does not explode as $N, T \to \infty$. It also implicitly implies that we allow only weak cross-sectional dependence among $U_t$; we leave the strong cross-sectional dependence case for future research. The second condition is a regularity condition on the regressors $X_i$. It ensures that all the individual OLS estimators $\widehat{\beta}_{i,OLS}$, $i = 1, \ldots, N$, are well defined. In particular, since we allow the number of cross-sectional units $N$ to grow, we require slightly stronger conditions: uniform convergence over $i$ as well as uniform lower and upper bounds on $Q_{ii}$. Notice that we do not require the convergence $\widehat{Q}_{ij} \to_p Q_{ij}$ until we derive the asymptotic properties of GLS in Proposition 2.

1.2.2 Estimators

In this section, we first briefly summarize the OLS, GLS, and FGLS estimators of $\beta$ in the SUR model; then we introduce the FGLasso estimator. The OLS estimator is defined as

    $\widehat{\beta}_{OLS} = ( \sum_{t=1}^T X_t X_t' )^{-1} \sum_{t=1}^T X_t Y_t.$                     (1.6)

It is equivalent to the collection of equation-by-equation OLS estimators, $\widehat{\beta}_{OLS} = (\widehat{\beta}_{1,OLS}', \widehat{\beta}_{2,OLS}', \ldots, \widehat{\beta}_{N,OLS}')'$, where $\widehat{\beta}_{i,OLS} = (X_i' X_i)^{-1} X_i' Y_i$ for $i = 1, 2, \ldots, N$.

Zellner (1962) proposed the SUR estimator to improve on the OLS estimator by exploiting the correlation in the equation system. Suppose that $\Sigma$ is known, and define the precision matrix $\Omega := \Sigma^{-1}$. The GLS estimator is defined as

    $\widehat{\beta}_{GLS} = ( \sum_{t=1}^T X_t \Omega X_t' )^{-1} \sum_{t=1}^T X_t \Omega Y_t.$       (1.7)

In most applications, however, $\Sigma$ and $\Omega$ are not known. An FGLS estimator (see Greene (2003) for details) is defined by replacing the unknown $\Sigma$ with a consistent estimator. A widely used estimator of $\Sigma$ is $\widehat{\Sigma} = \frac{1}{T} \sum_{t=1}^T \widehat{U}_t \widehat{U}_t'$, where $\widehat{U}_t = Y_t - X_t' \widehat{\beta}_{OLS}$ are the OLS residuals. Then

    $\widehat{\beta}_{FGLS} = ( \sum_{t=1}^T X_t \widehat{\Sigma}^{-1} X_t' )^{-1} \sum_{t=1}^T X_t \widehat{\Sigma}^{-1} Y_t.$   (1.8)

The FGLS estimator in (1.8) suffers from two major problems when $N$ is large. First, even when $T > N$, if $N$ and $T$ increase at the same rate, the largest eigenvalue of $\widehat{\Sigma}$ is not a consistent estimate of the largest eigenvalue of the population covariance matrix $\Sigma$, and the eigenvectors of the sample covariance matrix can be nearly orthogonal to the truth (Johnstone and Lu (2009)). Second, $\widehat{\Sigma}^{-1}$ is well defined only when $T \ge N$; when $T < N$, $\widehat{\Sigma}$ is rank deficient and therefore not invertible. Our estimator is motivated by these two issues. Suppose that $\Omega$ is sparse.
In this case, we propose the FGLasso estimator by replacing $\widehat{\Omega} = \widehat{\Sigma}^{-1}$ in (1.8) with a graphical lasso estimator $\widehat{\Omega}_{gl}$, where

    $\widehat{\Omega}_{gl} = \arg\min_{\Omega > 0, \, \Omega' = \Omega} \{ tr(\Omega \widehat{\Sigma}) - \log\det \Omega + \lambda \|\Omega\|_{1,off} \},$   (1.9)

$\|\Omega\|_{1,off} = \sum_{i \ne j}^N |\Omega_{ij}|$, and $\lambda > 0$ is a penalization parameter, often chosen by cross-validation (e.g., see Friedman et al. (2008)). More specifically,

    $\widehat{\beta}_{FGLasso} = ( \sum_{t=1}^T X_t \widehat{\Omega}_{gl} X_t' )^{-1} ( \sum_{t=1}^T X_t \widehat{\Omega}_{gl} Y_t ).$   (1.10)

1.2.3 Precision Matrix and Undirected Graph

Given observations $\{U_t\}_{t=1}^T$, the precision matrix $\Omega$ captures the conditional correlations among these variables and is closely related to undirected graphs under a Gaussian model. More specifically, if $U_t \sim N(\mu, \Omega^{-1})$, then the distribution of $U_t$ can be represented in factorized form with parameters $\gamma = \Omega \mu \in \mathbb{R}^N$ and $\Omega$:

    $P(U_t) = \exp\{ \sum_{i=1}^N \gamma_i U_{it} - \frac{1}{2} \sum_{i,j=1}^N \Omega_{ij} U_{it} U_{jt} + \frac{1}{2} \log\det(\Omega / 2\pi) \}.$   (1.11)

The representation (1.11) is convenient, since it allows us to discuss the factorization properties directly in terms of the sparsity pattern of the precision matrix $\Omega$. Consider an undirected graph $G = (V, E)$, where $V$ contains nodes corresponding to the $N$ variables in $U_t$; then edge $(i, j) \in E$ if and only if $\Omega_{ij} \ne 0$. Moreover, the graph $G$ describes the conditional independence relationships among $U_t$: $\Omega_{jk} = 0$ if and only if $U_{jt}$ is independent of $U_{kt}$ conditional on all other variables $U_{t, \setminus\{j,k\}} := \{U_{it}, i \ne j, k\}$. Figure 1.1 illustrates this correspondence between the graph structure (panel (a)) and the sparsity pattern of the precision matrix $\Omega$ (panel (b)). Notice that although nodes 1 and 4 are conditionally independent and $\Omega_{14} = 0$, they are still marginally correlated, since they are indirectly linked through nodes 2, 3, and 5, which means $\Sigma_{14} \ne 0$. Thus, sparsity in $\Omega$ does not necessarily lead to sparsity in $\Sigma = \Omega^{-1}$. It is also worth mentioning that the connection between $\Omega$ and graphical models was originally built on the Gaussian distribution; later, Liu et al. (2009) and Loh and Wainwright (2013) relaxed the Gaussian limitation and allowed more general conditions. See Chapter 9 in Hastie et al. (2015) for more details.

Figure 1.1: Sparse network graph and precision matrix. Note: (a) A simple undirected graph $G$ with $N = 5$ vertices and $s = 6$ edges. (b) The precision matrix $\Omega$ that corresponds to $G$ when the variables in $G$ are Gaussian. $\omega_{jk} \ne 0$ ($1 \le j, k \le 5$, $j \ne k$) corresponds to the edge set $E$, and the zeros correspond to the non-edge set. The diagonal entries are usually nonzero and are not related to the edge set.

Graph sparsity has been widely documented and used in the networks literature. For example, Banerjee et al. (2013) study the network diffusion of a new microfinance program in Indian villages. The network structure they find is sparse and centralized: the villagers are connected mainly through a small set of 'leaders' (teachers, shopkeepers, etc.) instead of being directly linked with each other. Nodes having a relatively small number of links in the network correspond to the precision matrix having a small number of nonzero entries per row, if we were to model the behavior of the villagers using SUR. Relatedly, Chiong and Moon (2017) estimate the network dependence structure of firms' investment decisions; the recovered graphical model exhibits a sparse core-periphery structure. As such, motivated by the networks literature, we impose sparsity as a restriction on $\Omega$. We also relax the Gaussian distribution to sub-Gaussian in Assumption 2.
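To see the two-step construction in (1.9)-(1.10) end to end, the following is a minimal sketch in Python; it is not the code used for the simulations in this chapter. It assumes one regressor per equation ($K_i = 1$, as in the Monte Carlo design of Section 1.5) and uses scikit-learn's GraphicalLasso, whose objective matches (1.9) with an off-diagonal $\ell_1$ penalty. The function name fglasso and the fixed penalty lam are illustrative; in practice $\lambda$ would be chosen by cross-validation (e.g., scikit-learn's GraphicalLassoCV).

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def fglasso(Y, X, lam):
    """Two-step FGLasso sketch for an SUR system with one regressor
    per equation. Y, X : (T, N) arrays, column i holding unit i's
    outcome and regressor; lam : graphical lasso penalty."""
    # Step 1a: equation-by-equation OLS and its residuals.
    beta_ols = (X * Y).sum(axis=0) / (X * X).sum(axis=0)   # each beta_hat_i,OLS
    U_hat = Y - X * beta_ols                               # (T, N) residuals
    # Step 1b: graphical lasso on the residuals yields a sparse
    # precision matrix, as in (1.9).
    Omega = GraphicalLasso(alpha=lam).fit(U_hat).precision_
    # Step 2: FGLS with the estimated precision matrix, eq. (1.10).
    # With K_i = 1, X_t = diag(X_1t, ..., X_Nt), so
    # sum_t X_t Omega X_t' is the elementwise product (X'X) * Omega,
    # and sum_t X_t Omega Y_t stacks sum_t X_it * (Omega Y_t)_i.
    A = (X.T @ X) * Omega
    c = (X * (Y @ Omega)).sum(axis=0)
    return np.linalg.solve(A, c), Omega
```

With $K_i = 1$ the GLS system in (1.10) collapses to the $N \times N$ linear system solved above; for general $K_i$ one would assemble the block-diagonal $X_t$ explicitly.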
1.3 Asymptotic Properties of the FGLasso Estimator

The sample properties of the FGLasso estimator depend mainly on the sample properties of $\widehat{\Omega}_{gl}$ in (1.9). Intuitively, if $\widehat{\Omega}_{gl}$ is close to the true $\Omega$ in some metric, then $\widehat{\beta}_{FGLasso}$ will also be close to $\widehat{\beta}_{GLS}$. In Section 3.1 we discuss the properties of $\widehat{\Omega}_{gl}$, and in Section 3.2 we show the uniform convergence and asymptotic properties of $\widehat{\beta}_{FGLasso}$.

1.3.1 Properties of $\widehat{\Omega}_{gl}$

Following the notation of RWRY, let $E(\Omega)$ be the edge set, i.e., $E(\Omega) := \{(i, j) \mid \Omega_{ij} \ne 0, i \ne j\}$, and $S(\Omega) = E(\Omega) \cup \{(1,1), \ldots, (N,N)\}$. Denote by $S^c$ the complement of the set $S$. For any two sets $P, P'$, we use $\Gamma_{PP'}$ to denote the $|P| \times |P'|$ matrix whose rows and columns of $\Gamma$ are indexed by $P$ and $P'$, respectively. Denote $\kappa_\Sigma := |||\Sigma|||_\infty$, $\Gamma = \Omega^{-1} \otimes \Omega^{-1}$, and $\kappa_\Gamma := |||(\Gamma_{SS})^{-1}|||_\infty$. Denote by $D_N$ the maximum number of nonzeros per row of $\Omega$, and by $S_N = card(S(\Omega))$ the number of nonzero entries (including diagonals) of $\Omega$. Notice that in the graph theory literature, $D_N$ represents the maximum degree of the nodes and $S_N$ represents the total number of linkages (including self-links) if we treat them as directed. In this chapter, we do not restrict $(D_N, S_N)$ to be bounded and allow them to increase with $N$. It is worth mentioning that imposing sparsity on $\Omega$ is slightly different from restricting $D_N$, since $\Omega$ can be sparse while $D_N = O(N)$. As we show later, $D_N$ plays an important role in the convergence rate and the minimum requirement on the sample size $T$.

Assumption 3 (Sparsity of $\Omega$).
(i) There exists some $\alpha \in (0, 1]$ such that

    $\max_{e \in S^c} \| \Gamma_{eS} (\Gamma_{SS})^{-1} \|_1 \le 1 - \alpha;$   (1.12)

(ii) $(\kappa_\Gamma, \kappa_\Sigma, \alpha, s)$ remain constant as functions of $(N, T)$.

These assumptions are adopted from RWRY. The first is also known as the irrepresentability condition, which limits the influence that the non-edge terms can have on the edge-based terms. As claimed in the original paper, this condition has an exact parallel in the incoherence condition from the Lasso literature, and it enables $\widehat{\Omega}_{gl}$ to converge to the true $\Omega$ uniformly even when $N > T$. Although it is very difficult to test and verify, RWRY provide several examples that satisfy this condition and argue that it is fairly general; see the detailed discussion on pages 13 and 14 of RWRY. The condition that $\kappa_\Sigma$ remains constant in the second assumption implies that the singular values of the true precision matrix $\Omega$ are bounded below.² The assumption that the remaining parameters $(\kappa_\Gamma, \alpha, s)$ stay constant is only for simplicity.

² Note $s_{min}(\Omega) = 1 / s_{max}(\Sigma) \ge 1 / |||\Sigma|||_\infty$.

Lemma 1. Assume Assumption 3 holds. If $T \ge c D_N^2 \log N$ for a constant $c > 0$, then with a proper choice of $\lambda$, the optimal solution $\widehat{\Omega}_{gl}$ of (1.9) satisfies:
(i) There exists a large enough constant $C > 0$ such that $P(|\widehat{\Sigma}_{ij} - \Sigma_{ij}| > \delta) \le C \exp(-C T \delta^2)$ for $\delta \in (0, \min\{172/K^2, 16\}(1 + 4s^2)\max_i(\Sigma_{ii}))$;
(ii)

    $\| \widehat{\Omega}_{gl} - \Omega \|_\infty = O_p( \sqrt{\log N / T} );$   (1.13)

(iii) The edge set $E(\widehat{\Omega}_{gl})$ is a subset of the true edge set $E(\Omega)$, and it includes all edges $(i, j)$ with $|\Omega_{ij}| > c' \sqrt{\log N / T}$, where $c' > 0$ is a small constant that depends on $s$, $\alpha$, $\kappa_\Gamma$, and $\max_i(\Sigma_{ii})$.

The first result in Lemma 1 is similar to, but different from, Lemma 1 in RWRY. In their paper, the residuals $\{U_t\}_{t=1}^T$ are directly observed and independent; in our paper, the regression residuals are neither directly observable nor independent, so their results cannot be applied directly. Result (i) establishes the tail behavior of $|\widehat{\Sigma}_{ij} - \Sigma_{ij}|$ when $\widehat{\Sigma}$ is recovered from the OLS residuals. The remaining two results in Lemma 1 are analogous to Corollary 1 in RWRY.
The second result, (ii), guarantees that the element-wise error between $\widehat{\Omega}_{gl}$ and $\Omega$ shrinks uniformly at the rate $\sqrt{\log N / T}$, meaning that as long as $T$ increases faster than $\log N$, the distance goes to zero as $N, T \to \infty$. The maximum degree $D_N$ plays an important role, as it determines the lower bound on the sample size $T$. If $D_N$ is bounded or increases much more slowly than $N$ (for example, $D_N = O(\log N)$), then the properties in Lemma 1 can hold even for $N > T$. The result in (iii) is so-called sparsistency, meaning that $\widehat{\Omega}_{gl}$ from (1.9) retains a sparsity structure similar to that of $\Omega$: the non-edge sets satisfy $E^c(\Omega) \subseteq E^c(\widehat{\Omega}_{gl})$, that is, if $\Omega_{ij} = 0$, then $\widehat{\Omega}_{gl,ij} = 0$ with probability approaching one. Therefore, for the matrix $\Delta_\Omega := \Omega - \widehat{\Omega}_{gl}$, the maximum number of nonzero entries per row of $\Delta_\Omega$ is at most $D_N$.

1.3.2 Asymptotic Properties of $\widehat{\beta}_{FGLasso}$

In this section, we discuss the consistency and asymptotic properties of $\widehat{\beta}_{FGLasso}$ defined in (1.10). As the main result, we show that $\widehat{\beta}_{FGLasso}$ and $\widehat{\beta}_{GLS}$ are asymptotically equivalent.

Proposition 1 (Uniform Convergence Rate). If Assumptions 1, 2, and 3 hold, then for $T \ge c D_N^2 \log N$ with some constant $c > 0$, $\widehat{\beta}_{FGLasso}$ satisfies:

    $\| \widehat{\beta}_{FGLasso} - \widehat{\beta}_{GLS} \|_\infty = O_p( \min\{\sqrt{S_N}, D_N\} \frac{\sqrt{N \log N}}{T} ).$   (1.14)

The proposition states that the largest deviation of the FGLasso estimator from the GLS estimator depends on the sparsity parameters $(S_N, D_N)$ as well as on $N$ and $T$. If the maximum degree of the nodes in $\Omega$ does not grow as quickly as $N$, the FGLasso estimator converges uniformly to the GLS estimator even when $T < N$. For example, in the optimal case $D_N = O(1)$, as long as $T > c \log N$ we have

    $\| \widehat{\beta}_{FGLasso} - \widehat{\beta}_{GLS} \|_\infty = O_p( \frac{\sqrt{N \log N}}{T} ).$

However, if $D_N$ is proportional to $N$ but $\Omega$ is sparse enough that $S_N = O(N)$, then the convergence rate is $O_p( N \sqrt{\log N} / T )$. And if $\Omega$ is dense, the convergence rate becomes $O_p( N \sqrt{N \log N} / T )$, that is, $T$ has to grow much faster than $N$.

Next we discuss the asymptotic relationship between $\widehat{\beta}_{FGLasso}$ and the true $\beta$. Before presenting the main theorem, we first establish the asymptotic properties of the high-dimensional $\widehat{\beta}_{GLS}$. Unlike the traditional setting in which the dimension of the GLS estimator is fixed, here the dimension rises with $N$. To achieve efficiency similar to the fixed-$N$ case, we need much stronger assumptions on the convergence of the regressors. Recall $\widehat{Q}$ and $Q$ defined in Assumption 2, with $\widehat{Q}_{ij} = \frac{1}{T} X_i' X_j$ and $Q_{ij} = E(\frac{1}{T} X_i' X_j)$. Define $Q_S := \{ M \in \mathbb{R}^{(\sum_i K_i) \times (\sum_i K_i)} \mid M_{ij} = Q_{ij} 1\{\Omega_{ij} \ne 0\} \}$, which replaces the sub-block $Q_{ij}$ with zeros when $\Omega_{ij} = 0$; $\widehat{Q}_S$ is defined similarly. We then propose the following assumption:

Assumption 4. Assume the regressors $X$ satisfy one of the following conditions:
(i) $\| \widehat{Q}_S - Q_S \|_\infty = o_p( 1 / \sqrt{N} )$;
(ii) $E(U_t \mid X) = 0$ and $E(U_t U_t' \mid X) = \Sigma$.

Compared with Assumption 2(ii), the first condition requires not only uniform convergence of $\frac{1}{T} X_i' X_j$, but uniform convergence at a rate faster than $o_p(1)$. If $\Omega$ is sparse, the assumption is weaker than $\| \widehat{Q} - Q \|_\infty = o_p(1/\sqrt{N})$, in that we only require convergence on the sub-matrix where $\Omega_{ij} \ne 0$. Compared with Assumption 1(iii), the second condition requires strict exogeneity of the residuals. As noted above, we need this condition to cancel the effect of increasing $N$ and to ensure the GLS estimator has an asymptotic variance similar to the fixed-$N$ case. See the proof in the Appendix for more details.

Proposition 2 (Asymptotic Properties of the GLS Estimator). Assume Assumptions 1, 2, and 4 hold.
For any $b \in \mathbb{R}^{NK}$ with $b'b = 1$, assume further that $E(V_{NT}^4) < \infty$, where $V_{NT} = b' Q^{-1} X_t \Omega U_t$. Then $\widehat{\beta}_{GLS}$ satisfies:

    $b' \sqrt{T} ( \widehat{\beta}_{GLS} - \beta ) \Rightarrow N( 0, b' A^{-1} b ),$   (1.15)

where $A := E(X_t \Omega X_t')$. In particular, let $b = e_i \in \mathbb{R}^{KN}$, the column vector whose $i$-th element is 1 and 0 otherwise. Then each element $\widehat{\beta}_{GLS,i}$ ($i = 1, 2, \ldots, KN$) satisfies:

    $\sqrt{T} ( \widehat{\beta}_{GLS,i} - \beta_i ) \Rightarrow N( 0, (A^{-1})_{ii} ).$   (1.16)

Combining the results of Propositions 1 and 2, we deduce that if $T$ grows fast enough relative to $(D_N, N)$, then $\widehat{\beta}_{FGLasso}$ is asymptotically equivalent to $\widehat{\beta}_{GLS}$, and therefore the distribution of $\widehat{\beta}_{FGLasso}$ is asymptotically normal. Summarizing, we provide the following theorem as the main theoretical result of the chapter.

Theorem 1 (Asymptotic Properties of $\widehat{\beta}_{FGLasso}$). Assume Assumptions 1, 2, 3, and 4 hold. Moreover, if $(T, N)$ grow at rates such that $\min\{\sqrt{S_N}, D_N\} \sqrt{N \log N / T} \to 0$, then for any vector $b \in \mathbb{R}^{KN \times 1}$ such that $b'b = 1$, the feasible graphical lasso estimator $\widehat{\beta}_{FGLasso}$ satisfies:

    $b' \sqrt{T} ( \widehat{\beta}_{FGLasso} - \beta ) \Rightarrow N( 0, b' A^{-1} b ),$   (1.17)

where $A := E(X_t \Omega X_t')$.

As in Proposition 2, taking $b = e_i \in \mathbb{R}^{KN}$, Theorem 1 implies that each element $\widehat{\beta}_{FGLasso,i}$ ($i = 1, 2, \ldots, KN$) satisfies:

    $\sqrt{T} ( \widehat{\beta}_{FGLasso,i} - \beta_i ) \Rightarrow N( 0, (A^{-1})_{ii} ).$   (1.18)

Theorem 1 states that under certain conditions the FGLasso estimator achieves efficiency, as it has the same asymptotic variance as the GLS estimator. In fact, the theorem holds as long as $b'b < \infty$, and thus it provides theoretical justification for other tests involving, for example, $\widehat{\beta}_1 + \widehat{\beta}_2$. Notice that if $D_N$ is bounded, the theorem holds as long as $T > N \log N$, which delivers efficiency in cases where $N$ is below but comparable to $T$. However, as $\Omega$ becomes denser and $S_N$ and $D_N$ increase, $T$ needs to be much larger than $N$ in order to retain asymptotic efficiency. When $N$ is larger than $T$, the asymptotic distributional theory and efficiency are not guaranteed.

1.4 Discussion

1.4.1 Comparison to a Related Estimator

Recent work by FHPJ also studies the high-dimensional SUR model. Similar to what we do here, they propose another estimator, the Penalized Quasi-Maximum Likelihood Estimator (PQMLE). Both papers show consistency and asymptotic normality. However, there are several differences.

First, the two papers stem from different assumptions on the model. Assuming the covariance matrix $\Sigma$ of $U_t$ is sparse, they first estimate $\widehat{\Sigma}$ under a sparsity restriction and then replace $\Omega$ with $\widehat{\Sigma}^{-1}$. Instead, we assume $\Omega$ is sparse and directly replace $\Omega$ with the regularized graphical lasso estimator $\widehat{\Omega}$. Sparsity of the covariance matrix implies that the residual $U_{it}$ is marginally correlated with a small portion of the other units and uncorrelated with the rest, while the precision matrix represents conditional independence relationships: $\Omega_{ij} = 0$ if and only if $U_{it}$ is independent of $U_{jt}$ conditional on the other residuals. The two estimators do not nest each other and, from a practitioner's perspective, are applicable in different contexts. If the cross-sectional dependence is well controlled, so that the covariance matrix of the remaining term is sparse, the PQMLE estimator might fit the data better. However, when the covariance matrix is dense but the precision matrix is sparse, our estimator might perform better than PQMLE.
Empirically, in contexts involving networks of individuals and agents, our estimator might also perform better than PQMLE, especially if the network is believed to be sparse (see, e.g., Barabási et al. (2016)). This is confirmed by our Monte Carlo simulations in the next section: when $N$ and $T$ are large, both PQMLE and FGLasso outperform OLS; moreover, the FGLasso estimator has smaller error than PQMLE when $\Omega$ is sparse while $\Sigma$ is not, and their estimator is better in the opposite case.

Second, although both papers provide asymptotic normality and efficiency, the requirements on $N$ and $T$ differ. In FHPJ, $T$ has to grow at least as fast as $N^3 \log N$. Under certain conditions, the FGLasso estimator maintains the same asymptotic distribution as GLS as long as $T > N \log N$. As such, we maintain efficiency under a wider set of circumstances, such as when $N$ is slightly smaller than $T$. Unfortunately, when $N$ is larger than $T$, the asymptotic distributional theory has not been studied for either estimator.

Apart from that, our estimator has some practical advantages. The first-stage objective function (1.9) for the penalized $\Omega$ is strictly convex, hence a unique solution is guaranteed. In comparison, FHPJ's objective function is non-convex (a mixture of convex and concave terms), so approximating its global solution can be computationally expensive in practice. In Monte Carlo simulations, we show that, on average, our estimator is approximately 60% to 95% faster than PQMLE.

1.4.2 Sparsity Under Strong Cross-Sectional Dependence

Our estimator relies on the precision matrix being sparse. When there is strong cross-sectional dependence among $U_t$, for example when there are unobserved common factors, sparsity of $\Sigma$ fails, and $\Omega$ may or may not be sparse. This is a non-trivial case that requires deeper analysis; for example, one could first perform a principal component analysis and treat the estimated factors as regressors before applying our analysis. However, as Figure 1.1 shows, a sparse $\Omega$ corresponds to a sparse graphical model (a graph representing the conditional independence relationships). Therefore, under certain conditions, conditional independence can still hold even though there is strong cross-sectional dependence among the units. We illustrate this with the following simple example. Suppose that among the $N$ entries of $U_t$, the first $D$ terms are common factors that affect the remaining terms:

    $U_{1:D, t} = g_t,$
    $U_{D+1:N, t} = \gamma g_t + \varepsilon_t.$

Further assume $\varepsilon_t$ and $g_t$ are independent, and the covariance matrix $\Sigma_\varepsilon$ of $\varepsilon_t$ is sparse enough that $\Sigma_\varepsilon^{-1}$ is also sparse (for example, diagonal). It is straightforward to show that the covariance and precision matrices of $U_t$ are:

    $\Sigma_U = \begin{pmatrix} \Sigma_g & \Sigma_g \gamma' \\ \gamma \Sigma_g & \gamma \Sigma_g \gamma' + \Sigma_\varepsilon \end{pmatrix}, \qquad \Omega_U = \begin{pmatrix} \Sigma_g^{-1} + \gamma' \Sigma_\varepsilon^{-1} \gamma & -\gamma' \Sigma_\varepsilon^{-1} \\ -\Sigma_\varepsilon^{-1} \gamma & \Sigma_\varepsilon^{-1} \end{pmatrix}.$

In this example, if $D$ is bounded by a constant, as is commonly assumed in factor models, then even under strong cross-sectional dependence among $U_t$ the precision matrix remains sparse.

Figure 1.2: Four-Nearest-Neighbor Lattice (N = 9). Notes: Panel (a) shows the case of 9 nodes arranged in 3 rows of 3; each node is connected to its closest neighbors, and the degrees range from 2 to 4. Panel (b) shows the corresponding $\Omega$ matrix. For example, node 1 is linked with nodes 2 and 4, so node 1's degree is 2, and in the first row of $\Omega$ only $\Omega_{11}, \Omega_{12}, \Omega_{14} \ne 0$. Node 5 is linked with nodes 2, 4, 6, and 8; therefore $\Omega_{25}, \Omega_{52}, \Omega_{45}, \Omega_{54}, \Omega_{56}, \Omega_{65}, \Omega_{58}, \Omega_{85}$, and $\Omega_{55} \ne 0$.
To generate $\Omega$: if node $i$ is not linked with node $j$ ($i \ne j$), set $\Omega_{ij} = 0$; otherwise, let $\Omega_{ij} = 0.25$ and $\Omega_{ii} = 1$.

1.5 Monte Carlo Simulations

In this section, we examine the finite-sample properties of the FGLasso estimator using Monte Carlo simulations. For different pairs of $(N, T)$, we generate data and compute the FGLasso, OLS, GLS, FGLS, and PQMLE estimators.

1.5.1 MC Design

For simplicity, we let the number of regressors for each unit be $K_i = 1$ for all $i$. The data are generated from the following process:

    $Y_t = X_t' \beta + U_t, \quad t = 1, 2, \ldots, T,$

where $X_{it} \sim N(0, 1)$ for all $1 \le t \le T$ and $1 \le i \le N$. We fix $\beta_1 = 2$ and draw $\beta_j \sim U[1, 3]$ for $2 \le j \le N$. Further, let $U_t \sim N(0, \Omega^{-1})$, where $\Omega \in \mathbb{R}^{N \times N}$ is generated from one of the designs listed below:

(i) Band: Let $\Omega_{i,i} = 1$, $\Omega_{i,i+1} = \Omega_{i+1,i} = 0.6$, $\Omega_{i,i+2} = \Omega_{i+2,i} = 0.3$, and $\Omega_{i,j} = 0$ for $|i - j| \ge 3$;

(ii) Four-Nearest-Neighbor Lattice (Grid): Let $n := \sqrt{N}$; the nodes are arranged in $n$ rows with $n$ nodes in each row, and each node is linked with its closest neighbors. Given the edge set $E(\Omega)$, let $\Omega_{ii} = 1$ and $\Omega_{ij} = 0.25$ if $(i, j) \in E(\Omega)$ and 0 otherwise (see the example in Figure 1.2). The same design is used in RWRY;

(iii) Block-diagonal: Generate a block-diagonal covariance matrix $\Sigma$ with block size 5. Within each block, the off-diagonal elements are $\pm 0.3$ with signs assigned randomly. The diagonal elements are equal and are chosen to ensure $\Sigma$ is positive definite with condition number $N$;

(iv) Dense: Let the covariance matrix $\Sigma = \Omega^{-1}$ be the band matrix with $\Sigma_{ii} = 1$, $\Sigma_{i,i+1} = \Sigma_{i+1,i} = 0.2$, and $\Sigma_{ij} = 0$ for all $|i - j| \ge 2$.

The first two designs generate a sparse precision matrix $\Omega$ with a specific pattern: the number of nonzero entries per row is at most 5 in both the band structure and the four-nearest-neighbor lattice. The block-diagonal design generates both a sparse $\Sigma$ and a sparse $\Omega$, with 5 nonzero entries per row. In the 'Dense' case, the covariance matrix $\Sigma$ is sparse while the precision matrix $\Omega$ is dense.

We estimate $\widehat{\Omega}_{gl}$ by (1.9) using the algorithm proposed by Friedman et al. (2008). The penalty parameter $\lambda$ is chosen by 5-fold cross-validation.³ As suggested by the referee, we also compare our estimator with the PQMLE of FHPJ. We follow the original paper's rule for choosing the penalty parameter and use the algorithm proposed in Bien and Tibshirani (2011).

³ More precisely, in each replication we divide the $T$ samples into 5 folds, using four of them as the training set and one as the validation set. For each choice of $\lambda$, we estimate $\widehat{\beta}_{FGLasso}$ on the training data, then plug it into the validation set and calculate the mean squared error. We choose the $\lambda$ that minimizes the averaged MSE.

1.5.2 MC Results

We fix $T = 200$ and let $N$ vary from 50 to 400.⁴ For each pair $(N, T)$, we compare our estimator with OLS, GLS, FGLS, and PQMLE. All reported numbers are averages over 100 replications.

Table 1.1 reports $\| \widehat{\beta} - \beta \|_\infty$, the element-wise maximum ($l_\infty$) norm loss, and the root mean square error (RMSE),⁵ as well as the number of times FGLasso outperforms FGLS and PQMLE, respectively. The $l_\infty$ loss measures the largest deviation of $\widehat{\beta}$ from the true $\beta$ among all elements, while the RMSE gives the average deviation. For each simulation, we compare the norms of FGLasso with those of FGLS and PQMLE, record the number of times FGLasso yields a smaller deviation, and report these counts as #FGLS and #PQMLE, respectively. We also compare the computation times of FGLasso and PQMLE.
Specifically, given the penalty parameter, 'Time' in Table 1.1 shows the average CPU time for computing FGLasso when the CPU time for computing PQMLE is normalized to 1.⁶

We observe that with $T$ fixed, as $N$ increases, every estimator's $l_\infty$ loss and RMSE weakly increase. Among them, the infeasible GLS estimator performs best. The results also confirm that as $N$ rises, FGLS becomes less accurate and behaves much like OLS. For the first three designs, where $\Omega$ is exactly sparse, the FGLasso estimator outperforms the FGLS estimator even when $N$ is relatively small. Moreover, it maintains its good performance as $N$ increases beyond $T$. For example, in the 'Band' structure with $N = 400$ and $T = 200$, the largest deviation among the 400 elements of $\widehat{\beta}_{FGLasso}$ is 0.278 and the average deviation is 0.085; these are very close to the infeasible GLS estimator's values of 0.225 and 0.071, respectively. In the fourth design, where the precision matrix is dense, our estimator is still better than the FGLS estimator.

The PQMLE estimator, proposed in FHPJ, is the regularized estimator that assumes $\Sigma$ is sparse. From Table 1.1, we observe that when $\Omega$ is sparse ('Band', 'Grid', 'Blc-diag'), our estimator is better, even when $\Sigma$ shares a similar sparse structure, as in 'Blc-diag'. However, when $\Omega$ is dense while $\Sigma$ is the band matrix, FGLasso performs less satisfactorily than PQMLE, although the difference shrinks as $N$ increases and FGLasso maintains its advantage over the OLS and FGLS estimators. On the other hand, since the objective function defining PQMLE is not globally concave, extra steps are required to approximate the global solution, which can be time consuming. We show that under the same computational resources, our estimator is approximately 90% faster than PQMLE.

⁴ For the four-nearest-neighbor lattice design, $N$ must be a square number, so we choose $N = \{49, 100, 196, 289, 400\}$.
⁵ Here, the $l_\infty$ norm is the element-wise maximum norm $\| \cdot \|_\infty$, and the RMSE is defined as $\| \cdot \|_F / \sqrt{N}$.
⁶ In practice, the time for computing an estimator also depends on how the penalty parameter is chosen. For example, five-fold cross-validation requires 26 rounds of computation before the final estimator is obtained.

Table 1.2 gives the bias and standard deviation of the first element $\widehat{\beta}_1$, as well as the interquartile range (IQR) and the size at the 5% nominal level. We observe that as $N$ increases, the bias remains small while the standard deviation weakly increases. Among all estimators, the GLS estimator has the smallest standard deviation, and the standard deviation and IQR of FGLasso lie between those of GLS and FGLS (or OLS when $N > T$). The sizes reported for the FGLasso estimator in Table 1.2 provide evidence of its inferential properties: even though the theory provides the asymptotic distribution only when $N < T$, the size remains around the 5% level even when $N$ is twice as large as $T$. Further, the table suggests that the differences between PQMLE and FGLasso for a single element of $\beta$ are very limited and can be neglected.
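Before turning to the tables, here is a concrete companion to design (i) ('Band') of Section 1.5.1: a minimal sketch, again in illustrative Python rather than the chapter's own simulation code, that constructs the band precision matrix and draws one SUR sample from it. The helper name simulate_band_sur is hypothetical.

```python
import numpy as np

def simulate_band_sur(N, T, rng):
    """One draw from the 'Band' design of Section 1.5.1:
    Omega_ii = 1, Omega_{i,i+1} = 0.6, Omega_{i,i+2} = 0.3, 0 otherwise."""
    Omega = np.eye(N)
    i = np.arange(N - 1)
    Omega[i, i + 1] = Omega[i + 1, i] = 0.6
    i = np.arange(N - 2)
    Omega[i, i + 2] = Omega[i + 2, i] = 0.3
    Sigma = np.linalg.inv(Omega)                  # U_t ~ N(0, Omega^{-1})
    beta = np.concatenate(([2.0], rng.uniform(1.0, 3.0, N - 1)))
    X = rng.standard_normal((T, N))               # X_it ~ N(0, 1)
    U = rng.multivariate_normal(np.zeros(N), Sigma, size=T)
    Y = X * beta + U                              # Y_it = beta_i * X_it + U_it
    return Y, X, beta, Omega

# Example: one replication of the N = 50, T = 200 experiment.
# Y, X, beta, Omega = simulate_band_sur(50, 200, np.random.default_rng(0))
```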
Table 1.1: MC Results for $\widehat{\beta}$ (T = 200)

                          $l_\infty$ loss                  |                 RMSE
 Design   N =      50     100     200     300     400      |    50     100     200     300     400
 Band
  OLS           0.300   0.349   0.372   0.387   0.394      | 0.120   0.123   0.123   0.123   0.123
  GLS           0.178   0.201   0.212   0.224   0.225      | 0.071   0.062   0.071   0.072   0.071
  FGLS          0.201   0.258   0.371     n/a     n/a      | 0.081   0.094   0.123     n/a     n/a
  PQMLE         0.201   0.235   0.371   0.352   0.360      | 0.082   0.083   0.123   0.123   0.101
  FGLasso       0.196   0.225   0.250   0.266   0.278      | 0.078   0.081   0.083   0.084   0.085
  Time          0.125   0.246   0.011   0.398   0.202      | 0.125   0.246   0.011   0.398   0.202
  #FGLS            72      87     100     n/a     n/a      |    78      99     100     n/a     n/a
  #PQMLE           61      62     100     100      97      |    79      64     100     100     100
 Grid
  OLS           0.224   0.262   0.300   0.319   0.341      | 0.089   0.094   0.098   0.100   0.102
  GLS           0.175   0.196   0.209   0.219   0.226      | 0.070   0.071   0.071   0.071   0.072
  FGLS          0.193   0.244   0.297     n/a     n/a      | 0.077   0.085   0.096     n/a     n/a
  PQMLE         0.218   0.226   0.235   0.252   0.273      | 0.085   0.077   0.078   0.080   0.084
  FGLasso       0.181   0.203   0.219   0.228   0.234      | 0.073   0.074   0.074   0.074   0.074
  Time          0.211   0.071   0.027   0.305   0.027      | 0.211   0.071   0.027   0.305   0.027
  #FGLS            61      85      97     n/a     n/a      |    82     100     100     n/a     n/a
  #PQMLE           88      77      78      98      91      |    96      96      94      98     100
 Blc-diag
  OLS           0.183   0.201   0.217   0.233   0.240      | 0.072   0.074   0.074   0.075   0.076
  GLS           0.137   0.159   0.181   0.191   0.203      | 0.052   0.055   0.057   0.058   0.060
  FGLS          0.153   0.193   0.217     n/a     n/a      | 0.058   0.066   0.074     n/a     n/a
  PQMLE         0.171   0.194   0.212   0.223   0.225      | 0.065   0.068   0.071   0.071   0.070
  FGLasso       0.153   0.177   0.197   0.213   0.223      | 0.059   0.062   0.065   0.067   0.070
  Time          0.060   0.050   0.050   0.061   0.126      | 0.060   0.050   0.050   0.061   0.126
  #FGLS            55      71      83     n/a     n/a      |    63      85     100     n/a     n/a
  #PQMLE           72      75      78      68      49      |    84      88      93      83      44
 Dense
  OLS           0.182   0.196   0.210   0.224   0.227      | 0.071   0.071   0.071   0.071   0.071
  GLS           0.117   0.128   0.136   0.141   0.143      | 0.045   0.045   0.045   0.045   0.045
  FGLS          0.133   0.161   0.209     n/a     n/a      | 0.051   0.058   0.071     n/a     n/a
  PQMLE         0.124   0.136   0.169   0.178   0.170      | 0.048   0.048   0.053   0.052   0.051
  FGLasso       0.134   0.151   0.165   0.174   0.177      | 0.053   0.054   0.055   0.056   0.055
  Time          0.155   0.437   0.050   0.050   0.220      | 0.155   0.437   0.050   0.050   0.220
  #FGLS            50      70      99     n/a     n/a      |    44     100     100     n/a     n/a
  #PQMLE           24      13      48      48      41      |     5       0      19       1       2

For each experimental design, the table reports $\|\widehat{\beta}_{OLS} - \beta\|$, $\|\widehat{\beta}_{GLS} - \beta\|$, $\|\widehat{\beta}_{FGLS} - \beta\|$, $\|\widehat{\beta}_{FGLasso} - \beta\|$, and $\|\widehat{\beta}_{PQMLE} - \beta\|$ under the $l_\infty$ norm and the RMSE ($= \|\cdot\|_F / \sqrt{N}$). 'Time' measures, for a given tuning parameter, the average CPU time for computing the FGLasso estimator when the CPU time for PQMLE is normalized to 1. #FGLS and #PQMLE report the number of times, out of 100 replications, that FGLasso outperforms FGLS and PQMLE in terms of $l_\infty$ and RMSE, respectively. When $N > T$, the FGLS estimator is not well defined, so those entries are marked 'n/a'. All results are based on 100 simulation replications.
Table 1.2: MC Results for $\widehat{\beta}_1$ (T = 200)

                       N = 50                 |              N = 200                 |              N = 400
 Design      Bias    Std.    IQR   $\hat\alpha$ |   Bias    Std.    IQR   $\hat\alpha$ |   Bias    Std.    IQR   $\hat\alpha$
 Band
  OLS       0.004   0.089   0.115   0.06      |  0.013   0.094   0.133   0.09        | -0.004   0.095   0.146   0.07
  GLS       0.002   0.067   0.081   0.07      |  0.006   0.070   0.100   0.08        | -0.008   0.069   0.092   0.03
  FGLS      0.007   0.075   0.104   0.11      |  0.012   0.093   0.122   0.90        |    n/a     n/a     n/a    n/a
  PQMLE     0.002   0.070   0.098   0.06      |  0.007   0.076   0.110   0.12        |  0.004   0.070   0.103   0.03
  FGLasso   0.008   0.072   0.104   0.06      |  0.009   0.080   0.112   0.06        | -0.007   0.077   0.102   0.03
 Grid
  OLS       0.003   0.078   0.081   0.05      | -0.007   0.083   0.117   0.05        | -0.011   0.084   0.100   0.09
  GLS       0.003   0.066   0.082   0.02      | -0.009   0.075   0.092   0.07        | -0.009   0.077   0.087   0.07
  FGLS      0.003   0.070   0.096   0.10      |  0.007   0.082   0.121   0.80        |    n/a     n/a     n/a    n/a
  PQMLE     0.003   0.078   0.081   0.05      | -0.006   0.078   0.118   0.07        | -0.010   0.080   0.103   0.08
  FGLasso   0.003   0.069   0.084   0.05      | -0.009   0.079   0.113   0.02        | -0.010   0.083   0.097   0.05
 Dense
  OLS       0.000   0.064   0.090   0.02      | -0.016   0.062   0.076   0.09        | -0.004   0.069   0.094   0.04
  GLS      -0.005   0.057   0.064   0.04      | -0.009   0.057   0.074   0.07        | -0.004   0.057   0.084   0.03
  FGLS     -0.008   0.061   0.086   0.09      | -0.016   0.062   0.076   0.90        |    n/a     n/a     n/a    n/a
  PQMLE    -0.005   0.060   0.067   0.01      | -0.011   0.056   0.062   0.09        | -0.006   0.058   0.081   0.05
  FGLasso  -0.004   0.058   0.065   0.05      | -0.013   0.058   0.072   0.04        | -0.006   0.060   0.076   0.06
 Blc-diag
  OLS       0.001   0.072   0.093   0.03      |  0.007   0.074   0.089   0.06        | -0.005   0.074   0.097   0.04
  GLS      -0.002   0.058   0.060   0.04      |  0.006   0.052   0.061   0.05        | -0.010   0.060   0.056   0.05
  FGLS      0.001   0.062   0.080   0.12      |  0.008   0.075   0.089   0.98        |    n/a     n/a     n/a    n/a
  PQMLE    -0.003   0.067   0.084   0.05      |  0.005   0.070   0.081   0.05        | -0.007   0.067   0.085   0.06
  FGLasso   0.002   0.060   0.077   0.02      |  0.007   0.060   0.070   0.09        | -0.011   0.067   0.083   0.05

The table shows several statistics of $\widehat{\beta}_1$, the first element of $\widehat{\beta}$, for different $N$ and fixed $T = 200$. Bias and Std. are the average and standard deviation of $\widehat{\beta}_1 - \beta_1$ over 100 replications. IQR is the interquartile range, i.e., the difference between the 75% and 25% quantiles of the $\widehat{\beta}_1$ distribution. $\hat{\alpha}$ is the type I error, calculated as $\hat{\alpha} = \sum_{i=1}^{100} 1\{ |\widehat{\beta}_1 - \beta_1| / s(\widehat{\beta}_1) > 1.96 \} / 100$.

1.6 Conclusion

This chapter proposes an estimator for high-dimensional Seemingly Unrelated Regression models: the Feasible Graphical Lasso (FGLasso) estimator. We show that as $N, T$ tend to infinity at certain rates, the largest deviation of the FGLasso estimator from the infeasible GLS estimator vanishes, which holds even when $N > T$ under certain conditions. We further show that if $T$ grows slightly faster than $N$, FGLasso achieves efficiency and shares the same asymptotic distribution as the GLS estimator.

There are various interesting questions and possible extensions to this chapter. First, although we show that FGLasso converges uniformly to GLS when $N > T$, we have only derived the asymptotic equivalence with GLS for $N < T$. Even though the simulation results remain satisfactory when $N > T$, the theoretical evidence remains unclear; a natural extension is to allow $N$ to be larger than $T$ and derive the asymptotic distribution of our estimator. Second, motivated by the network literature and graph theory, the key assumption here is sparsity of the true precision matrix $\Omega$, with the maximum number of nonzero entries per row growing much more slowly than $N$. It would be of great practical use if these assumptions could be tested; to our knowledge, no such statistical tests exist. We leave these questions and extensions to future research.
Chapter 2

Estimation of High-Dimensional VAR and Interactive Fixed Effects

2.1 Introduction

The model studied in this chapter can be viewed as a combination of two parts: a vector autoregression (VAR) and a factor model (FM). Each part has been very popular and heavily used, especially in the applied finance and macroeconomics literature.

The unrestricted VAR model is very successful in finite-dimensional time series analysis, since it has the advantage of capturing rich dynamics and interconnectedness and can be efficiently estimated by least squares regression. However, low-dimensional VARs often suffer from omitted variable bias, which leads to misleading inference and casts doubt on their applications (see, e.g., Sims (1992), Sims (1993), Leeper et al. (1996)). In order to incorporate more data into the system, certain dimension-reduction restrictions need to be imposed, because the number of parameters grows quadratically with the number of cross-sectional units. For example, Doan et al. (1984) impose prior probability distributions on the parameters; this class of models is known as Bayesian VARs. Chudik and Pesaran (2011) use economic theory and impose an (approximate) block-diagonal structure on the transition matrix by arranging the units properly. Using high-dimensional statistical tools, Negahban and Wainwright (2011) and Wong et al. (2016), among others, study high-dimensional regularized VARs and derive oracle properties by imposing low-rank and sparsity conditions on the transition matrix, respectively. However, this line of literature usually assumes that the residuals are exogenous and have only weak cross-sectional dependence, which excludes cases where the error terms are strongly correlated and may even be correlated with the lagged series. These restrictions may lead to large bias and inconsistency of the estimators when, for example, there are latent factors in the disturbances.

Meanwhile, the factor model (FM) is another useful tool, especially when dealing with high-dimensional data. It assumes that each variable under consideration can be expressed as a linear combination of a small number of latent factors plus an idiosyncratic component. The model was first proposed by Geweke (1977) and Sargent et al. (1977), who assume the residuals are orthogonal. It has been generalized to allow for weak cross-sectional dependence by Forni et al. (2000), and its inference properties are studied by Bai and Ng (2002) and Bai (2003). However, the literature usually assumes the residuals are not serially correlated, or allows only weak correlation. The presence of strongly correlated idiosyncratic components in the model can distort estimation, resulting in overestimation of the number of factors (Greenaway-McGrevy et al., 2012), and is detrimental for forecasting purposes (Anderson and Vahid, 2007). Including lagged terms of the time series in the FM is a common remedy, but the existing literature either imposes a strong structure on the coefficient matrix (Stock and Watson, 2005) or is restricted to the low-dimensional data setting (Anderson and Vahid, 2007).

In this chapter, I assume that the comovement of a high-dimensional time series can be explained by a combination of its lagged values and the movement of latent factors. The model has the advantage of being more general than the VAR and FM individually, and it can potentially relieve the limitations of these two models pointed out above. A few papers have similar settings.
The factor-augmented VAR (FAVAR) of Bernanke et al. (2005) considers a high-dimensional series driven by a set of observed and latent factors, with a VAR equation that captures the dynamic correlations among the factors. In Section 2.2.3.1, I compare the differences between FAVAR and our model in detail. The Global VAR of Pesaran (2006) considers a more general model that can include not only time fixed effects but also other weakly exogenous regressors; however, the sample size (T) has to be much larger than the cross-sectional size (N), and asymptotically N is fixed. Two working papers, Lin and Michailidis (2019) and Miao et al. (2020), adopt a similar setting and impose sparsity restrictions on the transition matrix. As discussed below, this chapter stems from a different economic perspective and imposes low-rank restrictions.

The goal of this chapter is to recover the VAR transition matrix as well as the common factor space. Since the number of parameters exceeds the sample size, I impose a low-rank condition on the transition matrix. The idea of low rank goes back to the reduced rank regression (RRR) proposed by Anderson et al. (1951) and has been widely applied in econometric modeling, such as demand systems (Lewbel (1991), etc.), cointegration analysis (Engle and Granger (1987), Phillips and Ouliaris (1990), Johansen (1991), etc.), and reduced-rank VARs in finite dimensions (Velu et al. (1986), etc.). Specifically, Camba-Mendez et al. (2003) showed that a reduced-rank VAR performs better than an unrestricted VAR in forecasting UK macro indexes. In recent years, researchers have extended the reduced-rank VAR model to the high-dimensional setting and shown that the estimated transition matrix converges to the true one under certain conditions; see, for example, Negahban and Wainwright (2011). Imposing a low-rank condition serves not only the purpose of dimension reduction but also has its own economic interpretation: the model implicitly assumes the movements of the dependent variables are driven by two groups of latent factors, the first constructed from their past values and the other not.

Since both the transition matrix and the common factors are assumed to be low rank, it is natural to consider imposing rank constraints in the objective function. However, since rank constraints are non-convex, solving the problem causes a huge computational burden. Therefore, we replace the rank constraints with their convex hull: the nuclear norm. Nuclear norm penalization has great practical advantages and has become very popular in the econometrics literature recently; for example, Bai and Ng (2019) use it in the factor model to improve the estimation of the number of factors;
Since nuclear norm regularization inevitably induces biases, a debias procedure is proposed in order to improve the finite sample performances. Independently, since the model can be written as pure factor model with residuals from V AR process, leveraging on PCA, the chapter introduces multi- stage estimation method that estimate parameters in multiple stages. The method helps to improve the convergence rates of estimated transition matrix and reduce biases. In Monte-Carlo simulation, I show the finite sample performances of estimators and results agree with the theory. Practically, the chapter revisits the US macro data set from McCracken and Ng (2016) and shows the model has great advantage in forecasting macro indexes (IP, CPI and federal funds rate) compared with reduced rank V AR model and pure factor model (FM), especially at long horizon. This chapter is organized as follows. In section 2, I introduce the set-up, compared with related model and discuss the unique matrix separation. Section 3 introduces joint estimation procedure and shows the properties of estimators. Section 4 introduces multi-stage estimation method and the properties of its estimators. Section 5 presents finite sample Monte Carlo simulation results. Section 6 shows the forecasting application on US macro data. And section 7 concludes and discusses potential extensions. All proofs are presented in the corresponding appendix. Notation For convenience, I summarize the notation that will be used in the following sections. 28 For matrix A2R mn , let s 1 (A) s 2 (A) > ::: s minfm;ng (A) be its ordered singular value, A 0 be the transpose of A. Denote the operator norm asjjAjj op = s 1 (A), nuclear normjjAjj = å minfm;ng i=1 s i (A) as sum of A’s singular values, and Frobenius norm asjjAjj F = q å m i=1 å n j=1 A 2 i j . For square matrix C2R nn , denote tr(C) as the trace of matrix C and trhAi=jjAjj =å minfm;ng i=1 s i (A). LethA;Bi= tr(B T A)=å i; j A i j B i j be trace inner product of A and B2R mn . Denote I K as K by K identify matrix, l i as a column vector of 0 0 s except 1 for its i 0 th element. For random variable x, letE(x) be its expectation. For a real sequencefa n g ¥ n=1 and a positive sequencefb n g ¥ n=1 , we denote a n =O(b n ) if there exists a finite constant C> 0 such thatja n j Cb n as n!¥, and a n =O p (b n ) ifP(ja n j Cb n )! 1 as n!¥. We use notation) and p ! to denote the convergence in distribution and the convergence in probability, respectively. 1 : stands for the indicator function. 2.2 Set Up In this section, I first formally introduce the model and necessary conditions that ensure covariance stationarity, then I compare the model with related factor model (FM) and factor-augmented V AR (FA V AR) model. At last, I discuss the unique matrix separation. 2.2.1 Model Given observed high-dimensional time series datafz t g T t=0 where z t is N-dimensional vector in- cluding all observed cross-sectional units at time t, assume z t follows V AR with order 1 1 : z t =Qz t1 +Lf t + u t ; t= 1;2;:::;T; (2.1) where f t 2R R are latent factors,L2R NR are factor loadings, andQ2R NN is transition matrix. 1 For simplicity, I assume the order of V AR to be 1. Finite lag case can be easily extended. 29 Stack z t over t, (2.1) becomes: Z=QZ 1 +LF 0 +U; (2.2) where Z=(z 1 ;z 2 ;:::;z T ), Z 1 =(z 0 ;z 1 ;:::;z T1 ),L=(l 1 ;l 2 ;:::;l N ) 0 , F =(f 1 ;f 2 ;:::;f T ) 0 and U =(u 1 ;u 2 ;:::;u T ). 
Since both $\Lambda$ and $F$ are unknown, as explained in Section 2.2.4, certain identification conditions are required to separate them uniquely. However, in applications such as forecasting and dynamic network analysis, the unique separation is not required. Therefore, it is useful to introduce the product $\Lambda F'$ as a new parameter, $\Gamma:=\Lambda F'$, so that (2.2) becomes

$$Z=\Theta Z_{-1}+\Gamma+U.\qquad (2.3)$$

In this chapter, as $N,T\to\infty$, I aim to recover $\Theta$ and $\Gamma$. Since the number of unknown parameters increases with N and T, and in particular the number of parameters in $\Theta$ increases quadratically with N, certain restrictions have to be imposed on $\Theta$ and $\Gamma$.

Assumption 5 (Model). Assume there exist universal constants $0<\underline{c}<\bar c<\infty$ and $0<\rho<1$ such that:

(i) $\Theta$ has unknown low rank $r\le\bar c$; the largest modulus of the eigenvalues of $\Theta$ is bounded by $\rho$; $\sigma_1(\Theta)>\sigma_2(\Theta)>\cdots>\sigma_r(\Theta)>\underline c>0$; and $\|\Theta^j\|_{op}\le\bar c\,\rho^j$ for $j=0,1,2,\dots$;

(ii) Let $u_t=\sqrt{\Sigma_u}\,\varepsilon^u_t$, where $\varepsilon^u_t\sim \mathrm{IID}(0,I_N)$ and $\Sigma_u\in\mathbb{R}^{N\times N}>0$. Assume $\mathbb{E}|\varepsilon^u_t|^8\le\bar c$, $\underline c\le\sigma_{\min}(\Sigma_u)\le\sigma_{\max}(\Sigma_u)\le\bar c$, and $\||\Sigma_u\||_\infty\le\bar c$;

(iii) $f_t$ is an $R\times 1$ column vector with unknown $R\le\bar c$, and $\{f_t\}$ has the moving average (MA) representation $f_t=\sum_{j=0}^\infty b^e_j\,\varepsilon^f_{t-j}$, $t=1,2,\dots,T$, where $b^e_0=I_R$. Assume $\varepsilon^f_t\sim\mathrm{IID}(0,I_R)$, $\mathbb{E}|\varepsilon^f_t|^8\le\bar c$, and the series $\{\varepsilon^f_t\}$ and $\{u_t\}$ are mutually independent at all leads and lags. Moreover, assume $\sum_{j=0}^\infty\|b^e_j\|<\bar c$;

(iv) $\|\Lambda'\Lambda/N\|_F=O(1)$ and $\|\lambda_i\|_F\le\bar c$ for any $1\le i\le N$.

Assumption 5 imposes conditions on the transition matrix, the residuals, and the common factor terms that ensure $\{z_{it}\}$ is covariance stationary. Specifically, the first condition imposes the low-rank condition on the transition matrix $\Theta$. It is worth mentioning that imposing the low-rank condition implicitly excludes certain structures of $\Theta$ that might be interesting to researchers: for example, diagonal or banded $\Theta$ are invalid here, because in both cases $\Theta$ is full rank. Such specific prior structures can be restrictive, especially when the data size becomes large. In the high-dimensional setting, Lin and Michailidis (2019) and Miao et al. (2020) consider a similar model but impose sparsity on $\Theta$. Sparsity and the low-rank condition stem from different economic meanings, and neither nests the other. For example, a matrix of 0's except for a first element equal to 1 is both sparse and low rank; a diagonal matrix, however, is sparse but not low rank. Assuming sparsity on $\Theta$ imposes an assumption on causality, namely that $z_{it}$ can be explained by only a small group of the lagged values in $z_{t-1}$, while assuming low rank on $\Theta$ indicates that $z_{it}$ can be explained by a few factors constructed from $z_{t-1}$. In practice, there is no unanimous rule about which is better; whether to impose the low-rank condition depends on the data application.

Besides the low-rank condition, (i) restricts the moduli of the eigenvalues so that $(I_N-\Theta L)$ is invertible (where L denotes the lag operator). It also imposes a condition on the largest singular value of $\Theta$. Compared to the conditions in Negahban and Wainwright (2011), which restrict $\|\Theta\|_{op}$ to be strictly less than 1, the condition here is more general and allows $\|\Theta\|_{op}>1$. The second condition assumes the series $\{u_t\}$ has no serial correlation and allows only weak cross-sectional dependence. The third regulates the dynamic process of the latent factors $f_t$ and assumes the factors are covariance stationary with absolutely summable MA coefficients. The last condition assumes the factors in our model are 'pervasive'. At the current stage, weak factors are not allowed; I leave this to future exercises.
Lemma 2. If Assumption 5 holds, $\{z_{it}\}$ is covariance stationary for all $1\le i\le N$. Let $\Sigma_z:=\mathbb{E}(z_tz_t')$; there exists a small enough constant $\underline c>0$ such that $\sigma_{\min}(\Sigma_z)\ge\underline c$.

Notice that Lemma 2 focuses on the stationarity of $z_{it}$ rather than of $z_t$, because of the existence of the common factors $\Gamma=\Lambda F'$. In fact, letting $\Sigma_F:=\mathbb{E}(f_tf_t')$, the vector $z_t$ is not stationary because $\Sigma_z=\Lambda\Sigma_F\Lambda'+\Sigma_u$ increases with N.

2.2.2 Alternative model presentation

Under Assumption 5, $(I_N-\Theta L)$ is invertible, and the VAR model in (2.1) can be written in the MA form

$$z_t=(I_N-\Theta L)^{-1}\Lambda f_t+(I_N-\Theta L)^{-1}u_t=x_t+e_t,\qquad (2.4)$$

where

$$x_t=(I_N-\Theta L)^{-1}\Lambda f_t=\Theta x_{t-1}+\Lambda f_t,\qquad (2.5)$$
$$e_t=(I_N-\Theta L)^{-1}u_t=\Theta e_{t-1}+u_t.\qquad (2.6)$$

Let $X=(x_1,x_2,\dots,x_T)$ and $X_{-1}=(x_0,x_1,\dots,x_{T-1})$; then from (2.5) we have $X=\Theta X_{-1}+\Lambda F'$. Since X is the sum of two low-rank matrices, $\Theta X_{-1}$ and $\Lambda F'$, X is a low-rank matrix. Denote the rank of X by $R_X$ (we do not need to know the exact value of $R_X$, as long as it is bounded). Write the singular value decomposition (SVD) of X as $X=U_XD_XV_X'$, where the columns of $U_X$ ($V_X$) are the left (right) singular vectors of X and $D_X\in\mathbb{R}^{R_X\times R_X}$ is a diagonal matrix with the singular values of X on its diagonal. Denote $a=U_X\sqrt N$ and $G=V_XD_X/\sqrt N$; then $X=aG'$. (Equivalently, $a\in\mathbb{R}^{N\times R_X}$ is $\sqrt N$ times the first $R_X$ eigenvectors of $XX'$, and $G=X'a/N$. The separation of X is not unique; the goal here is to remove the low-rank term from Z and use the residuals to run the VAR analysis. The separation is helpful because it enables us to borrow results from principal component analysis and prove the theorems.)

Let $G=(g_1,g_2,\dots,g_T)'$; then

$$g_t=\frac{a'x_t}{N}=\frac{a'(I-\Theta L)^{-1}\Lambda}{N}f_t.$$

Then model (2.1) can be written as

$$z_t=ag_t+e_t=\Theta a g_{t-1}+\Lambda f_t+e_t,\qquad (2.7)$$
$$e_t=\Theta e_{t-1}+u_t.\qquad (2.8)$$

(2.7) and (2.8) indicate that $z_t$ can be written in the form of a pure factor model, with residuals following a VAR(1) with transition matrix $\Theta$.

Lemma 3. If Assumption 5 holds, there exists a constant $\bar c>0$ such that: (i) $a'a/N=I_{R_X}$; (ii) $\{g_t\}$ is covariance stationary with mean 0 and has the MA representation $g_t=b_g(L)\varepsilon^f_t$, where $\{\varepsilon^f_t\}$ is defined in Assumption 5(iii) and $\|b_g(1)\|\le\bar c$; (iii) for $1\le i,j\le N$ and $1\le s,t\le T$, $\sum_{j=1}^N|\mathbb{E}(e_{it}e_{jt})|\le\bar c$ and $\sum_{t=1}^T|\mathbb{E}(e_{is}e_{jt})|\le\bar c$.

It can easily be seen that Lemma 3 and Assumption 5 together imply Assumptions A-D in Bai and Ng (2002) and Ahn and Horenstein (2013). Leveraging their results, a and G can be consistently estimated up to a constant rotation matrix by PCA, and the rank of G can be consistently estimated by the 'growth ratio' (GR). Section 2.4 discusses in detail how the existing factor literature helps to estimate the parameters of interest here.
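As a concrete illustration of this step, the following sketch computes the PC estimators of $(a,G)$ and a growth-ratio-style rank estimate from the data matrix Z; the cap `kmax = 8` is an arbitrary illustrative choice, the eigenvalue rescaling follows the conventions of Ahn and Horenstein (2013), and the normalization $a'a/N=I$ mirrors Lemma 3(i).

```python
import numpy as np

def pc_factors(Z, kmax=8):
    """PC estimator of (a, G) and a growth-ratio (GR) rank estimate."""
    N, T = Z.shape
    eigval, eigvec = np.linalg.eigh(Z @ Z.T / (N * T))    # ascending order
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]        # sort descending
    # V(k) = sum of eigenvalues beyond the k-th; GR(k) as in Ahn-Horenstein (2013)
    V = np.array([eigval[k:].sum() for k in range(kmax + 2)])
    GR = np.log(V[:-2] / V[1:-1]) / np.log(V[1:-1] / V[2:])
    R_hat = int(np.argmax(GR) + 1)                        # argmax over k = 1..kmax
    a = np.sqrt(N) * eigvec[:, :R_hat]                    # loadings with a'a/N = I
    G = Z.T @ a / N                                       # estimated factors G
    return a, G, R_hat
```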
2.2.3 Relation with FAVAR and FM

2.2.3.1 FAVAR

Bernanke et al. (2005) propose the factor-augmented VAR, which aims to summarize the information contained in a large set of time series by a small number of factors and include those in a standard VAR model. Specifically, let $f_t\in\mathbb{R}^R$ be the latent factors and $w_t\in\mathbb{R}^p$ the observed set of variables; they jointly form a VAR model given by

$$\begin{pmatrix} f_t \\ w_t \end{pmatrix}=A(L)\begin{pmatrix} f_{t-1} \\ w_{t-1}\end{pmatrix}+\begin{pmatrix} e^f_t \\ e^w_t\end{pmatrix},\qquad (2.9)$$

where $A(L)=\sum_{j=0}^d A_jL^j$. In addition, there is a large panel of observed time series $z_t\in\mathbb{R}^N$ whose current values are influenced by both $w_t$ and $f_t$, i.e., the calibration equation

$$z_t=\Lambda^f f_t+\Lambda^w w_t+u_t.\qquad (2.10)$$

The model in this chapter shares similarities with FAVAR, since both try to model a high-dimensional series with a VAR and an FM, but they differ in several respects. FAVAR assumes researchers directly observe another set of variables $w_t$ whose dimension is finite. This chapter instead considers researchers who only have access to a high-dimensional time series $z_t$. Therefore, $w_t$ in FAVAR is either not accessible or becomes the lagged value $z_{t-1}$ in this model. Including the high-dimensional series $z_{t-1}$ in the VAR equation brings additional estimation challenges. Further, even though (2.7) shows that $z_t$ can be written as a factor model in $g_{t-1}$ and $f_t$, $g_{t-1}$ is constructed from $z_{t-1}$ and is not directly observed.

In practice, the FAVAR model is mostly used for structural VAR analysis, explaining how orthogonal shocks affect the high-dimensional series through the augmented factors. For example, in Bernanke et al. (2005), $w_t$ are observed policy indicators such as the federal funds rate, and $z_t$ consists of a large panel of US macroeconomic time series. They investigate how monetary shocks affect $w_t$ and the latent factors $f_t$ through the VAR equation, and propagate to $z_t$ through the calibration equation. However, since the observed series in this model are the lagged values $z_{t-1}$, it would be less reasonable to study how shocks affect the lagged value $z_{t-1}$ through a high-dimensional VAR system. The potential applications of the two models are different: in this chapter, I apply the model to forecasting and dynamic network analysis, as shown in Sections 2.6 and Chapter 3.

2.2.3.2 FM

Equations (2.4), (2.5) and (2.6) from the previous section indicate that our model can be written as an approximate dynamic factor model (DFM):

$$z_t=\lambda(L)f_t+e_t,\qquad e_t=\Theta e_{t-1}+u_t,$$

where $\lambda(L)=(I_N-\Theta L)^{-1}\Lambda=\sum_{p=0}^\infty\Theta^p\Lambda L^p$. However, estimating an infinite-lag DFM in the high-dimensional setting with traditional estimation methods, e.g., maximum likelihood, has both theoretical and practical limitations: it implicitly restricts the residuals to be independent and Gaussian, and when the data size is large the estimation causes a huge computational burden and is time consuming. Moreover, notice that the model in general cannot be written in static factor form, since the lag order in $\lambda(L)$ is not finite; thus the principal components (PC) method would not be suitable here without further restrictions. Fortunately, as shown in (2.7), under the low-rank assumption on $\Theta$ the model can be written in static factor form. As $N,T\to\infty$, it is well known that the factor space spanned by $g_t$ can be consistently recovered; see, for example, Bai and Ng (2002) and Bai (2003). However, when the residuals are serially correlated, results from PC can lead to distorted estimation and inference. Meanwhile, the parameters of interest and the potential applications of this chapter differ from the factor model literature: here, I intend to recover the transition matrix $\Theta$ as well as the common factor space $\Gamma=\Lambda F'$, whereas in the factor model literature the goal is to recover the factor space $aG'$, which is not exactly the same as $\Lambda F'$. In the empirical study, as shown later, the model here is more general and provides better forecasting results than a pure factor model.

2.2.4 Unique Matrix Separation

When estimating our model in practice, both $\Theta$ and $\Gamma=\Lambda F'$ are unknown. Given the observed time series $\{z_t\}_{t=0}^T$, there might exist other triples $(\tilde\Lambda,\tilde F,\tilde\Theta)$ that are observationally equivalent to $(\Lambda,F,\Theta)$. Specifically, for any invertible matrix $Q_1\in\mathbb{R}^{R\times R}$ and any $Q_2\in\mathbb{R}^{R\times N}$, we have

$$z_t=\Theta z_{t-1}+\Lambda f_t+u_t=\tilde\Theta z_{t-1}+\tilde\Lambda\tilde f_t+u_t,$$

where $\tilde\Lambda=\Lambda Q_1$, $\tilde f_t=Q_1^{-1}f_t-Q_1^{-1}Q_2z_{t-1}$, $\tilde\Theta=\Theta+\Lambda Q_2$.
That is, in order to uniquely identify $\Theta$, $\Lambda$ and $F$, we need at least $R^2+RN$ extra conditions. As pointed out earlier, since the common factor space $\Gamma=\Lambda F'$ is our parameter of interest, unique separation of $\Lambda$ and $F$ is not required (various identification conditions can be imposed in order to uniquely separate $\Lambda F'$; see the survey paper by Stock and Watson (2005) for more details). For simplicity, let $Q_1=I_R$; then we have

$$z_t=\Theta z_{t-1}+\Lambda f_t+u_t=\tilde\Theta z_{t-1}+\Lambda\tilde f_t+u_t,$$

where $\tilde f_t=f_t-Q_2z_{t-1}$ and $\tilde\Theta=\Theta+\Lambda Q_2$. However, if $Q_2\ne 0$, the series $\{\tilde f_t\}$ and $\{u_t\}$ are not strictly independent, which violates Assumption 5(iii). That is, given the series $\{z_t\}_{t=0}^T$, the separation of $\Theta$ and $\Gamma$ is unique.

2.3 Joint Estimation

In this section, we introduce the joint estimation method and discuss the convergence rates of the estimators. We first briefly introduce nuclear norm regularization and the computational algorithm, then discuss the general properties of the joint estimator and its convergence rates under certain conditions. Finally, we propose a debiasing procedure that compensates for the bias from the nuclear norm penalty.

2.3.1 Nuclear Norm Regularization

Since both our parameters of interest $\Theta$ and $\Gamma$ are low rank, it is natural to consider least squares estimation with rank restrictions:

$$(\tilde\Theta,\tilde\Gamma)=\arg\min\ \frac{1}{2NT}\|Z-\Theta Z_{-1}-\Gamma\|_2^2\quad \text{s.t. }\ \operatorname{rank}(\Theta)\le r,\ \operatorname{rank}(\Gamma)\le R.\qquad (2.11)$$

Although $\|Z-\Theta Z_{-1}-\Gamma\|_2^2$ is a convex function of $\Theta$ and $\Gamma$, the rank constraints are not, so a global minimum of (2.11) is not guaranteed. Moreover, rank minimization is a so-called 'NP-hard problem', because the cardinality function that defines rank is non-convex and non-differentiable. In the high-dimensional setting, computing the optimal solution of (2.11) would cause a large computational burden. One solution is to find a convex hull of the rank constraints; in this chapter, I use nuclear norm penalization.

The nuclear norm, denoted $\|\cdot\|_*$, is defined as the sum of the singular values. Notice that for matrices $A,C\in\mathbb{R}^{m\times n}$ we can write

$$\|A\|_*=\max_{\|C\|_{op}\le 1}\mathrm{tr}(C'A).$$

Therefore, the nuclear norm is dual to the spectral norm $\|\cdot\|_{op}$ and thus convex in A. Moreover, various optimization algorithms are available for nuclear norm penalties; see, for example, Ji and Ye (2009) and Ma et al. (2011).

However, similar to the lasso, the nuclear norm inevitably induces shrinkage biases. Take the pure factor model $A=L+U$ as an example, where L is the common factor term and U the idiosyncratic error term. Let the SVD of A be $A=UDV'$ and define the singular value thresholding operator $S_\phi(A)=UD_\phi V'$, where $D_\phi$ is obtained by replacing the diagonal entries of D with $\max(D_{ii}-\phi,0)$. Theorem 2.1 of Cai et al. (2010) shows that

$$\hat L=UD_\phi V'=\arg\min_L\ \frac12\|A-L\|_2^2+\phi\|L\|_*.$$

That is, the nuclear norm penalty has two effects. Singular values of Z that are less than $\phi_\Theta$ are shrunk to 0, which is similar to what the principal components method does. Those greater than $\phi_\Theta$, however, are shrunk by the amount $\phi_\Theta$, which causes bias. In finite samples this bias may be problematic; as the sample size becomes large, as shown in the Monte Carlo simulations later, the bias becomes negligible.

In joint estimation, we consider estimating $\Theta$ and $\Gamma$ jointly by solving the following optimization problem:

$$(\tilde\Theta,\tilde\Gamma)=\arg\min_{(\Theta,\Gamma)}Q(\Theta,\Gamma)=\arg\min_{(\Theta,\Gamma)}\Big\{\frac{1}{2NT}\|Z-\Theta Z_{-1}-\Gamma\|_2^2+\frac{\phi_\Theta}{\sqrt N}\|\Theta\|_*+\frac{\phi_\Gamma}{\sqrt{NT}}\|\Gamma\|_*\Big\},\qquad (2.12)$$

where $\phi_\Theta,\phi_\Gamma>0$ are pre-specified regularization parameters. Section 2.5.2 introduces how to choose these parameters.
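A minimal sketch of the singular value thresholding operator $S_\phi$ discussed above: it soft-thresholds the singular values by $\phi$, which by the Cai et al. (2010) result is the proximal operator of the nuclear norm. It is reused by the algorithm sketches below.

```python
import numpy as np

def svt(A, phi):
    """Singular value thresholding S_phi(A): shrink each singular value by phi."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - phi, 0.0)) @ Vt   # rebuild with thresholded values
```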
2.3.2 Iterative Algorithm

We propose the following algorithm, which solves for $\Theta$ and $\Gamma$ iteratively as the global solution to (2.12).

Algorithm 2 (Joint estimation algorithm).

S0: Fix $\tau\in(0,1/\|Z_{-1}Z_{-1}'\|_{op})$, $\phi_\Theta$ and $\phi_\Gamma$. Initialize $\tilde\Theta^0$, $\tilde\Gamma^0$ and set $k=0$;

S1: Solve for $\tilde\Theta^{k+1}$ and $\tilde\Gamma^{k+1}$ repeatedly:

$$\tilde\Theta^{k+1}=S_{\sqrt{NT}\tau\phi_\Theta}\big(\tilde\Theta^k-\tau(\tilde\Theta^kZ_{-1}+\tilde\Gamma^k-Z)Z_{-1}'\big),$$
$$\tilde\Gamma^{k+1}=S_{\sqrt{NT}\phi_\Gamma}\big(Z-\tilde\Theta^{k+1}Z_{-1}\big);$$

S2: Repeat step 1 until convergence;

S3: The estimated rank of $\tilde\Gamma$ is

$$\tilde R=\sum_{i=1}^{\min(N,T)}1\big\{\sigma_i(\tilde\Gamma)>(\phi_\Theta+\phi_\Gamma)\,\sigma_1(\tilde\Gamma\tilde\Gamma')^{1/2}\big\};$$

S4: Apply the method of Ahn and Horenstein (2013): select the rank $\tilde r$ of $\tilde\Theta$ by

$$\tilde r=\arg\max_k\ GR(k)=\frac{\log(V(k-1)/V(k))}{\log(V(k)/V(k+1))},$$

where $V(k)=\sum_{j=k+1}^N\sigma_j\big((Z-\tilde\Gamma)(Z-\tilde\Gamma)'\big)$.

Proposition 3. Let $(\bar\Theta,\bar\Gamma)$ be the global minimum of $Q(\Theta,\Gamma)$. Then for any $\tau\in(0,2/\|Z_{-1}Z_{-1}'\|_{op})$ and any initial $\Theta^0$, we have

$$Q(\tilde\Theta^{k+1},\tilde\Gamma^{k+1})\le Q(\tilde\Theta^{k+1},\tilde\Gamma^k)\le Q(\tilde\Theta^k,\tilde\Gamma^k),$$

and for all $k\ge 1$,

$$Q(\tilde\Theta^{k+1},\tilde\Gamma^{k+1})-Q(\bar\Theta,\bar\Gamma)\le\frac{1}{k\tau}\|\tilde\Theta^1-\bar\Theta\|_F^2.$$

Proposition 3 shows that for any initial value, Algorithm 2 converges to the global minimum at rate $O(1/k)$.
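The following is a sketch of the iterations of Algorithm 2 in Python, reusing the `svt()` sketch above; the step size factor 0.9, the zero initializations, and the fixed iteration count are illustrative choices, not part of the algorithm as stated.

```python
import numpy as np

def joint_estimate(Z, Z_lag, phi_T, phi_G, n_iter=500):
    """Iterate the two updates of Algorithm 2 for (Theta, Gamma)."""
    N, T = Z.shape
    tau = 0.9 / np.linalg.norm(Z_lag @ Z_lag.T, 2)   # tau in (0, 1/||Z_{-1}Z_{-1}'||_op)
    Theta, Gamma = np.zeros((N, N)), np.zeros((N, T))
    for _ in range(n_iter):
        grad = (Theta @ Z_lag + Gamma - Z) @ Z_lag.T            # gradient in Theta
        Theta = svt(Theta - tau * grad, np.sqrt(N * T) * tau * phi_T)
        Gamma = svt(Z - Theta @ Z_lag, np.sqrt(N * T) * phi_G)  # exact Gamma-update
    return Theta, Gamma
```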
2.3.3 General Upper Bound

Let $\Delta_\Theta=\tilde\Theta-\Theta$ and $\Delta_\Gamma=\tilde\Gamma-\Gamma$. Before presenting the upper bound on $\Delta_\Theta$ and $\Delta_\Gamma$, we need to introduce the restricted strong convexity (RSC) condition.

Let the SVD of $\Theta$ be $\Theta=U_\Theta DV_\Theta'$, where $U_\Theta^r$, $V_\Theta^r$ denote the first r columns of $U_\Theta$ and $V_\Theta$, respectively. Let $M_{U_\Theta^r}=I_N-U_\Theta^r(U_\Theta^{r\prime}U_\Theta^r)^{-1}U_\Theta^{r\prime}$ be the orthogonal projection matrix onto the complement of the column span of $U_\Theta^r$, and define $M_{V_\Theta^r}$ similarly. Let

$$\Delta_{\Theta 2}=M_{U_\Theta^r}\Delta_\Theta M_{V_\Theta^r}',\qquad \Delta_{\Theta 1}=\Delta_\Theta-\Delta_{\Theta 2}.\qquad (2.13)$$

That is, we define $\Delta_{\Theta 2}$ as the projection of $\Delta_\Theta$ onto the orthogonal complement of the column and row spaces of $\Theta$. Since $\Theta$ has exact rank r, we have $\|\Theta+\Delta_{\Theta 2}\|_*=\|\Theta\|_*+\|\Delta_{\Theta 2}\|_*$. Similarly, we define $\Delta_{\Gamma 2}=M_{U_\Gamma^R}\Delta_\Gamma M_{V_\Gamma^R}'$ and $\Delta_{\Gamma 1}=\Delta_\Gamma-\Delta_{\Gamma 2}$.

Assumption 6 (RSC). For positive integers $r\le N$, $R\le\min\{N,T\}$, and $\lambda_{NT}>0$, define the set $\mathcal C(r,R,\lambda_{NT})$ as

$$\mathcal C(r,R,\lambda_{NT})=\Big\{\Delta_\Theta\in\mathbb{R}^{N\times N},\ \Delta_\Gamma\in\mathbb{R}^{N\times T}:\ \frac{\|\Delta_{\Theta 2}\|_*}{\sqrt N}+\lambda_{NT}\frac{\|\Delta_{\Gamma 2}\|_*}{\sqrt{NT}}\le 3\Big(\frac{\|\Delta_{\Theta 1}\|_*}{\sqrt N}+\lambda_{NT}\frac{\|\Delta_{\Gamma 1}\|_*}{\sqrt{NT}}\Big)\Big\}.$$

If $(\Delta_\Theta,\Delta_\Gamma)\in\mathcal C(r,R,\lambda_{NT})$, assume there exists a constant $\kappa>0$ such that with high probability

$$\|\Delta_\Theta Z_{-1}+\Delta_\Gamma\|_2^2\ \ge\ T\kappa\|\Delta_\Theta\|_2^2+\kappa\|\Delta_\Gamma\|_2^2-\kappa NT\phi_\Gamma\|\Delta_\Gamma\|_2,\qquad (2.14)$$

where $\phi_\Gamma>0$ is the penalty parameter in (2.12).

Intuitively, the cone $\mathcal C$ contains matrices $(\Delta_\Theta,\Delta_\Gamma)$ that are close to $(\Theta,\Gamma)$, in the sense that the part which cannot be explained by $(\Theta,\Gamma)$, namely $(\Delta_{\Theta 2},\Delta_{\Gamma 2})$, is relatively small compared to the remaining part. The assumption imposes that for all matrices inside the cone $\mathcal C$, the quadratic term $\|\Delta_\Theta Z_{-1}+\Delta_\Gamma\|_2^2$ is bounded below by a relaxed convex function. Similar conditions are imposed in, for example, Negahban and Wainwright (2011), Moon and Weidner (2018), and Chernozhukov et al. (2019), and the condition plays the same role as the restricted eigenvalue condition in the LASSO literature, e.g., Candes et al. (2007). In Appendix H.8.2, we show that under certain distributional assumptions condition (2.14) holds with high probability.

With the above assumptions in hand, we present a general upper bound for $\|\Delta_\Theta\|_F^2+\|\Delta_\Gamma\|_F^2$.

Theorem 3. Suppose Assumptions 5 and 6 hold. If the regularization parameters $(\phi_\Theta,\phi_\Gamma)$ in (2.12) satisfy

$$\phi_\Theta\ge\frac{2}{\sqrt N\,T}\|Z_{-1}U'\|_{op},\qquad \phi_\Gamma\ge\frac{2}{\sqrt{NT}}\|U\|_{op}+\frac{2}{NT}\|Z_{-1}\|_{op},\qquad (2.15)$$

then there exists a large enough constant $\kappa>0$ such that the optimal solution $(\hat\Theta,\hat\Gamma)$ satisfies

$$\frac{1}{\sqrt N}\|\Delta_\Theta\|_F+\frac{1}{\sqrt{NT}}\|\Delta_\Gamma\|_F\le\kappa\sqrt r\,\phi_\Theta+\kappa\sqrt R\,\phi_\Gamma.\qquad (2.16)$$

The upper bound in Theorem 3 consists of two terms. The first corresponds to the estimation error from estimating $\Theta$ with a rank-r matrix $\tilde\Theta$; the second is the estimation error of $\Gamma$ with a rank-R matrix $\tilde\Gamma$. Specifically, since $\Delta_\Theta$ is $N\times N$ and $\Delta_\Gamma$ is $N\times T$, the average deviation per element of $\Delta_\Theta$ and $\Delta_\Gamma$ can be written as

$$\frac1N\|\Delta_\Theta\|_F\le\kappa\sqrt r\,\frac{\phi_\Theta}{\sqrt N}+\kappa\sqrt R\,\frac{\phi_\Gamma}{\sqrt N},\qquad \frac{1}{\sqrt{NT}}\|\Delta_\Gamma\|_F\le\kappa\sqrt r\,\phi_\Theta+\kappa\sqrt R\,\phi_\Gamma.$$

When there are no interactive fixed effects in (2.1) ($\Gamma=0$), the bound in (2.16) becomes $\frac{1}{\sqrt N}\|\Delta_\Theta\|_F\le\kappa\sqrt r\,\phi_\Theta$, the same as Theorem 1 of Negahban and Wainwright (2011).

2.3.4 Convergence Rates

Theorem 3 provides the general upper bound on $\Delta_\Theta$ and $\Delta_\Gamma$. In this section, I discuss the convergence properties by imposing high-level assumptions on the matrix norms.

Assumption 7. Let Assumption 5 hold. As $N,T\to\infty$, assume that (i) $\|U\|_{op}=O_p(\sqrt N+\sqrt T)$; (ii) $\|Z_{-1}U'\|_{op}=O_p(N\sqrt T)$.

This high-level assumption regulates the largest singular values of U and $Z_{-1}U'$. The first condition has been widely used in the statistics and econometrics literatures, for example, Negahban and Wainwright (2011) and Moon and Weidner (2018); various examples of DGPs that satisfy it can be found in Moon and Weidner (2017). A sufficient condition for the second assumption is that both $z_t$ and $u_t$ are Gaussian. Notice that, compared to similar conditions elsewhere (for example, Lemma 5 in Negahban and Wainwright (2011)), where the upper bound is of rate $\sqrt{NT}$, our bound is larger because of the factor loadings, $\|\Lambda\|=O_p(\sqrt N)$. Appendix H.8.3 provides a proof of the second assumption given Gaussian $z_t$ and $u_t$.

Under the specifications above, we can derive the convergence rates of $\Delta_\Theta$ and $\Delta_\Gamma$ immediately from Theorem 3.

Proposition 4. Assume Assumptions 5-7 and the conditions in Theorem 3 hold, and further assume $N/T\to 0$. Then:

(i) $\frac1N\|\Delta_\Theta\|_F=O_p\big(\frac1N+\frac1{\sqrt T}\big)$;
(ii) $\frac{1}{\sqrt{NT}}\|\Delta_\Gamma\|_F=O_p\big(\sqrt{\frac NT}+\frac{1}{\sqrt N}\big)$;
(iii) $\tilde R\to_p R$;
(iv) if $R_X\ge r$, $\tilde r\to_p r$;
(v) let $\tilde\Lambda$ be $\sqrt N$ times the first $\tilde R$ eigenvectors of $\tilde\Gamma\tilde\Gamma'$ and $\tilde F=\tilde\Gamma'\tilde\Lambda/N$; there exists an $R\times R$ rotation matrix H such that

$$\frac1N\|\tilde\Lambda-\Lambda H\|_F^2=O_p\Big(\frac NT+\frac1N\Big)=\frac1T\|\tilde F-FH^{-1}\|_F^2.\qquad (2.17)$$

The first two results follow directly from Theorem 3 and Assumption 7. The third and fourth establish the rank consistency of $\tilde R$ and $\tilde r$. Notice that $R_X$ is the rank of X defined in Section 2.2. Since $X=\Theta X_{-1}+\Lambda F'$, the rank-$R_X$ matrix is the sum of two low-rank matrices of ranks r and R. If the column and row spaces of the two terms were orthogonal, we would have $R_X=r+R$, implying $R_X>r$; however, since $\Theta X_{-1}$ and $\Lambda F'$ are not exactly orthogonal, a theoretical justification of the relationship between $R_X$ and r is missing, and we leave it to future study. The last result establishes the convergence rate of the estimated factor loadings to the true loadings, and of the estimated factors to the true factors, up to a fixed rotation matrix; see the proof in Appendix B 9.7 for details.

Compared with the high-dimensional VAR model and the pure factor model, the convergence rates of our estimators are slower. When there are no common factors, the convergence rate of $\Delta_\Theta$ in the pure VAR model is faster than (i); for example, Corollary 4 in Negahban and Wainwright (2011) gives $\frac1N\|\Delta_\Theta\|_F=O_p(1/\sqrt{NT})$. When $\Theta=0$, the model becomes a pure factor model, and the convergence rates of the estimated factors and loadings are faster; for example, Theorems 1 and 2 in Bai (2003) give $\frac1N\|\tilde\Lambda-\Lambda H\|_F^2=O_p\big(\frac{1}{\min(N,T)}\big)=\frac1T\|\tilde F-FH^{-1}\|_F^2$.

2.3.5 Debias Procedure

In joint estimation, both $\tilde\Theta$ and $\tilde\Gamma$ are estimated by nuclear norm regularization.
As explained in Section 2.3.1, these estimators are biased due to the penalty terms $\phi_\Theta$ and $\phi_\Gamma$, and in finite samples this bias can be a problem. Following a similar idea to Ma et al. (2011), we introduce the debiasing procedure as follows.

Algorithm 4 (Debias procedure).

S0: Let the singular value decompositions of $\tilde\Theta$ and $\tilde\Gamma$ be $\tilde\Theta=S_1V_1D_1'$ and $\tilde\Gamma=S_2V_2D_2'$;

S1: Fixing the singular vectors, regress $z_{it}$ on $(S_1D_1'Z_{-1})_{it}$ and $(S_2D_2')_{it}$ for $i=1,2,\dots,N$, $t=1,2,\dots,T$, and denote the estimated coefficient vectors by $\hat v_1$ and $\hat v_2$;

S2: Let $\bar\Theta=S_1\operatorname{diag}(\hat v_1)D_1'$, $\bar\Gamma=S_2\operatorname{diag}(\hat v_2)D_2'$.

The debiasing method is intuitive. Since the nuclear penalty shrinks the singular values of the estimators and causes bias, in this procedure we fix the singular vectors and re-estimate the singular values by least squares. In the Monte Carlo simulations, we show that the procedure shrinks the bias significantly.

2.4 Multi-stage Estimation

In this section, we propose an independent estimation method that estimates $\Theta$ and $\Gamma$ in three stages. Recall from Section 2.2 that if $\Theta$ is low rank, $z_t$ can be written as a pure factor model:

$$z_t=ag_t+e_t,\qquad e_t=\Theta e_{t-1}+u_t,$$

where $g_t\in\mathbb{R}^{R_X}$ is a covariance-stationary latent factor and $a'a/N=I_{R_X}$ by Lemma 3. Leveraging principal component analysis (PCA) in the pure factor model, we propose the multi-stage estimation below; a code sketch follows the algorithm.

Algorithm 5 (Multi-stage estimation procedure).

S1: Recover $e_t$ by principal component analysis: $\hat e_t=z_t-\hat a\hat g_t$, where $\hat a$, $\hat g_t$ are the PC estimators. Specifically, we first choose the number of factors $\hat R_X$ according to the eigenvalue ratio of $ZZ'$ (Ahn and Horenstein, 2013); then we recover $\hat a$ as $\sqrt N$ times the first $\hat R_X$ eigenvectors of $ZZ'$, and $\hat G=Z'\hat a/N$;

S2: Fix a tuning parameter $\lambda_\Theta>0$ and estimate $\Theta$ by nuclear norm regularization:

$$\hat\Theta=\arg\min_\Theta\ \frac{1}{2NT}\sum_t\|\hat e_t-\Theta\hat e_{t-1}\|_2^2+\lambda_\Theta\|\Theta\|_*;$$

S3: Estimate $\hat\Gamma$ by running principal components on $(z_t-\hat\Theta z_{t-1})$, and denote the rank of $\hat\Gamma$ by $\hat R$;

S4: Apply the method of Ahn and Horenstein (2013): select the rank $\hat r$ of $\hat\Theta$ by

$$\hat r=\arg\max_k\ GR(k)=\frac{\log(V(k-1)/V(k))}{\log(V(k)/V(k+1))},$$

where $V(k)=\sum_{j=k+1}^N\sigma_j\big((Z-\hat\Gamma)(Z-\hat\Gamma)'\big)$.

We use a similar idea to Cochrane-Orcutt estimation. That is, we first run PCA on Z and estimate the transition matrix $\Theta$ from the recovered residuals $\hat e_t$; we then estimate the common factor space by running PCA again on $Z-\hat\Theta Z_{-1}$. Notice that the rank of $\Theta$ is not estimated directly from $\hat\Theta$: as we show later, the convergence rate of $\hat\Theta-\Theta$ is slow and consistency of its rank is not guaranteed. Therefore, similarly to the joint estimation, we estimate $\hat r$ by running PCA on $Z-\hat\Gamma$.
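A sketch of Algorithm 5 in Python, reusing the `pc_factors()` and `svt()` sketches from earlier. For brevity, the second-stage problem is written with the $1/(2NT)$ scaling absorbed into the penalty `lam_T`, and the step size and iteration count are illustrative.

```python
import numpy as np

def multi_stage(Z, lam_T, n_iter=500):
    """Three stages: PCA on Z, penalized VAR on residuals, PCA on Z - Theta Z_{-1}."""
    N, T = Z.shape
    a, G, _ = pc_factors(Z)                       # stage 1: PCA on Z
    E = Z - a @ G.T                               # recovered residuals e_t-hat
    E0, E1 = E[:, 1:], E[:, :-1]
    tau = 0.9 / np.linalg.norm(E1 @ E1.T, 2)      # step size for proximal gradient
    Theta = np.zeros((N, N))
    for _ in range(n_iter):                       # stage 2: min 0.5||E0 - Th E1||^2 + lam||Th||_*
        Theta = svt(Theta - tau * (Theta @ E1 - E0) @ E1.T, tau * lam_T)
    a2, G2, R_hat = pc_factors(Z[:, 1:] - Theta @ Z[:, :-1])   # stage 3
    return Theta, a2 @ G2.T, R_hat                # Theta-hat, Gamma-hat, R-hat
```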
If there are no estimation errors in first stage, b E = E and b E 1 = E 1 . Under certain conditions, it can be shown that b E 1 ( b EQ b E 1 ) 0 op =jjE 1 U 0 jj op =O p ( p N logNT). The following proposi- tion shows that the recovered ˆ e t from first stage is good enough such that it won’t distort the results significantly. Proposition 5. LetD Q = b QQ,D G = b GG. If assumption 5, 7 and 8 hold, as N;T!¥, we have 1 N jjD Q jj F =O p 1 N + 1 T : The proposition discusses the rates of convergence of b Q. Compared to joint estimation esti- mator e Q, the convergence rate here is weakly better. It is worth mentioning that this rates is the upper bound ofjjD Q jj F =N and may not be sharp. Notice that without estimation error from first 46 stage, convergence rates in proposition 5 would beO p ( p logN= p NT), which is faster compared toO p (1=N+ 1=T) when T > N. In finite sample simulation next section, I exam the performances of( b Q; b G) and the results are satisfying and even better than joint estimation estimators (before debiased). That is because in joint estimation, e G are estimated from nuclear norm regularization and therefore biased. However, in three stage estimation, the bias mainly comes from 2nd stage estimation of b Q, and due to the natural bound onQ (imposed in assumption 5(i)), the bias is very small and negligible. 2.5 Monte Carlo Experiment 2.5.1 Experiment Design This section provides Monte Carlo evidence on the small sample properties of Q, G in different pairs of(N;T). Specifically, I let N=f50;100;200g and control the T=N ratio to bef1;2;4g. For each pair of(N;T) I generate data from model (2.1) z t =Qz t1 +Lf t + u t ; t= 1;2;:::;T: In all cases, I fix the number of unobserved factors R= 2, and generatel i j Uni f orm(1;1), i = 1;2;:::;N; j = 1;2, and u t IIDN(0;I T ); t = 1;2;:::;T . The series offz t g is generated with z 0 = 0 and 1000 burn-in data points; Q and f t are generated from the following three data- generating processes (DGPs): (i) Exact Rank: Let the largest three eigenvalues ofQ bef0:9;0:8;0:7g, and 0 0 s for the rest 6 ; generate factors following f t =(e t + 0:8e t1 )= p 1+ 0:8 2 ,e t N(0;I 2 ). 6 To do this, I first generate random N by N matrixQ 1 from N(0;1), then we letQ= U f1:3g diag(0:9;0:8;0:7) V 0 f1:3g , where U;V are the left and right eigenvector ofQ 1 from singular value decomposition, U f1:3g denotes the first three columns of U, and V f1:3g likewise, diag(0:9;0:8;0:7) denotes the diagonal matrix whose diagonal elements are 0:9;0:8; and 0:7. 47 (ii) Approximate Rank: Let the largest three eigenvalues ofQ bef0:9;0:8;0:7g, and 1=N for the rest; generate factors following f t = 0:8f t1 + p 1 0:8 2 e t ,e t N(0;I 2 ); (iii) No factor: Let the largest three eigenvalues ofQ bef0:9;0:8;0:7g, and 0 0 s for the rest; let f t = 08t. In the first DGP, I constructQ as exact rank 3 matrix, and the module of the largest eigenvalue is bounded below 1, therefore guaranteeing the stationary stated in assumption 5. In the second DGP, I relax the exact rank cases to be approximate rank, that is,Q has full rank N, but the module of the first three eigenvalues are significantly larger than the rest. The last DGP is treated as the robustness test: that is, I assume there are no factors in the true DGP, but I don’t use this information during the estimation. Note that factors are serial correlated as they are generated from the AR(1) and MA(1) process in the first two DGPs, respectively. 
2.5.2 Tuning Parameter Choice Both estimation methods requires pre-specified tuning parameters, which relate to the unobserved term U. The most commonly used method: cross validation would not fit time series setting as sample splitting and shuffling would break the causal relationship and affect estimation results. Here, similar as Chernozhukov et al. (2019), I propose the following method to choose tuning parameters. Algorithm 6 (Tuning parameter choice). S 1 : Run principle analysis on Z and denote residual as b E; S 2 : Run AR(p) regression of ˆ e it on its lagged value 7 , denote the residual as ˆ u it ; S 3 : Simulatee u i;t N(0;s 2 i ), wheres 2 i = 1 T å t ˆ u 2 it , denote[ e U] it = ˜ u it ; S 4 : In joint estimation, let f Q be the 95 quantile of 2 p NT Z 1 ˜ U 0 op , and p NTf G be the 95 quantile of s 1 ( ˜ U). In multi-stage estimation, letl Q be the 95 quantile of 2 NT b E 1 e U op . 7 p can be selected by BIC. 48 The intuition is simple. According to equation (2.8), U is the residual of V AR process of E, which is the residual from first stage pure factor model. However, since Q is unknown, I approximate the V AR process in (2.8) with the AR(p) model. It is worthwhile to mention that the procedure also works for the case when there are no latent factors. That is, if there are no factors selected in the first step (DGP 3), b E = Z and ˆ u it is derived from the AR regression of Z in the 2nd step. Monte Carlo results in the next section show that the procedure above is fast and produces satisfying results. 2.5.3 Monte Carlo Results Table 2.1 presents the Monte Carlo simulation results based on 100 replications. In the top panel, it shows root mean square error (RMSE) of estimatedQ, including estimators from joint estima- tion (’Joint’), debias procedure (’Debias’), multi-stage estimation (’M-s’) as well as the rank of joint estimator ˜ r and multi-stage estimator ˜ R. Bottom panel reports the RMSE of estimatedG at corresponding columns. In general, RMSE of estimated Q declines as N;T rises. Among three estimators, due to the biases from nuclear norm penalty, the average deviation of joint estimators from true Q is significantly larger than debiased and multi-stage estimator, but the differences among the latter two are negligible. Meanwhile, the gaps between biased joint estimator and debiased one (or multi-stage) shrinks as sample size increases, indicate that the biases could be problematic in small sample, but become less relevant in large sample. The estimated ranks are around true value 3, even under the approximate rank case (DGP 2) and no factor case (DGP 3), which justifies the procedure of choosing penalty parameters stated in the previous section. Estimated G has similar pattern as Q: RMSE declines as N;T rises. For example, when N is fixed to be 100, the RMSE of debiased estimator decreases from 0:2576 at T = 100 to 0:2126 at T = 400. Similar, fix T = 200, the RMSE decreases from 0:3493 at N = 50 to 0:2337 at N = 100. Note that in all pairs of(N;T), debias procedure improves the results significantly and performs better than multi-stage estimator. The estimated number of latent factors in both methods 49 are consistent as sample size becomes large, and successfully detect whether the model has latent factor structure or not (DGP3). 
2.5.3 Monte Carlo Results

Table 2.1 presents the Monte Carlo simulation results based on 100 replications. The top panel shows the root mean square error (RMSE) of the estimated $\Theta$, including the estimators from joint estimation ('Joint'), the debiasing procedure ('Debias'), and multi-stage estimation ('M-s'), as well as the estimated ranks $\tilde r$ (joint) and $\hat r$ (multi-stage). The bottom panel reports the RMSE of the estimated $\Gamma$ in the corresponding columns.

In general, the RMSE of the estimated $\Theta$ declines as N and T rise. Among the three estimators, due to the bias from the nuclear norm penalty, the average deviation of the joint estimator from the true $\Theta$ is significantly larger than that of the debiased and multi-stage estimators, while the differences between the latter two are negligible. Meanwhile, the gap between the biased joint estimator and the debiased (or multi-stage) one shrinks as the sample size increases, indicating that the bias can be problematic in small samples but becomes less relevant in large samples. The estimated ranks are around the true value 3, even in the approximate-rank case (DGP 2) and the no-factor case (DGP 3), which justifies the procedure for choosing penalty parameters stated in the previous section.

The estimated $\Gamma$ shows a similar pattern: RMSE declines as N and T rise. For example, with N fixed at 100, the RMSE of the debiased estimator decreases from 0.2756 at T = 100 to 0.2126 at T = 400; similarly, fixing T = 200, the RMSE decreases from 0.3493 at N = 50 to 0.2337 at N = 100. Note that for all pairs (N,T), the debiasing procedure improves the results significantly and performs better than the multi-stage estimator. The estimated numbers of latent factors from both methods are consistent as the sample size becomes large, and they successfully detect whether the model has a latent factor structure or not (DGP 3).

Table 2.1: Monte Carlo Results

Panel A: $\|\Delta_\Theta\|_F/N$
                          DGP 1                                 DGP 2                                 DGP 3
   N    T    Joint    r~  Debias    M-s    r^    Joint    r~  Debias    M-s    r^    Joint    r~  Debias    M-s    r^
  50   50   0.0274  2.76  0.0281  0.0258  1.23   0.0277  2.83  0.0275  0.0265  1.12   0.0266  1.70  0.0245  0.0245  1.48
  50  100   0.0267  3.04  0.0251  0.0250  1.67   0.0270  3.10  0.0250  0.0260  1.33   0.0255  2.07  0.0214  0.0226  2.04
  50  200   0.0263  3.24  0.0229  0.0228  2.22   0.0260  3.17  0.0229  0.0253  1.85   0.0241  2.66  0.0164  0.0194  2.63
 100  100   0.0138  2.71  0.0131  0.0129  1.62   0.0138  2.77  0.0131  0.0138  1.23   0.0133  1.73  0.0122  0.0122  1.75
 100  200   0.0136  2.85  0.0125  0.0122  2.04   0.0137  3.00  0.0122  0.0136  1.56   0.0126  2.50  0.0105  0.0108  2.32
 100  400   0.0133  3.07  0.0116  0.0111  2.81   0.0138  3.10  0.0116  0.0127  2.21   0.0119  2.85  0.0079  0.0093  2.84
 200  200   0.0070  2.70  0.0065  0.0065  1.88   0.0072  2.76  0.0065  0.0070  1.46   0.0067  1.96  0.0062  0.0060  2.11
 200  400   0.0069  2.79  0.0063  0.0061  2.55   0.0073  2.93  0.0061  0.0068  1.93   0.0063  2.61  0.0052  0.0053  2.78
 200  800   0.0068  2.95  0.0060  0.0054  3.00   0.0074  3.00  0.0060  0.0063  2.57   0.0058  2.98  0.0039  0.0043  3.01

Panel B: $\|\Delta_\Gamma\|_F/\sqrt{NT}$
                          DGP 1                                 DGP 2                                 DGP 3
   N    T    Joint    R~  Debias    M-s    R^    Joint    R~  Debias    M-s    R^    Joint    R~  Debias    M-s    R^
  50   50   0.8209  0.98  0.5620  0.4425  2.00   0.8060  1.08  0.5576  0.5901  1.22   0.00   0.00   0.00   0.00   0.00
  50  100   0.8153  1.44  0.4412  0.4272  2.02   0.8035  1.50  0.4439  0.5478  1.94   0.00   0.00   0.00   0.00   0.00
  50  200   0.8003  1.72  0.3493  0.4211  2.00   0.8300  1.66  0.3770  0.5539  1.98   0.00   0.00   0.00   0.00   0.00
 100  100   0.8039  1.96  0.2756  0.3589  2.02   0.8176  1.88  0.3064  0.5088  2.00   0.00   0.00   0.00   0.00   0.00
 100  200   0.7858  1.98  0.2337  0.3555  2.00   0.8076  1.99  0.2537  0.4804  2.00   0.00   0.00   0.00   0.00   0.00
 100  400   0.6856  2.00  0.2126  0.3676  2.00   0.8121  2.00  0.2294  0.4820  2.00   0.00   0.00   0.00   0.00   0.00
 200  200   0.7210  1.80  0.1879  0.3300  2.00   0.8103  2.00  0.2068  0.4406  2.00   0.00   0.00   0.00   0.00   0.00
 200  400   0.6262  2.00  0.1748  0.3358  2.00   0.8069  2.00  0.1925  0.4563  2.16   0.00   0.00   0.00   0.00   0.00
 200  800   0.5053  2.00  0.1712  0.3288  2.03   0.7880  2.00  0.1907  0.4520  2.28   0.00   0.00   0.00   0.00   0.00

Note: The table reports the root mean square error (RMSE) of the estimated $\Theta$ and $\Gamma$ under different pairs of (N,T) and DGPs. The 'Joint' columns show the estimators from joint estimation; $\tilde r$ and $\tilde R$ are the estimated ranks of the joint estimators $\tilde\Theta$ and $\tilde\Gamma$, respectively. The 'Debias' columns report the RMSE of the debiased estimators. 'M-s' stands for the multi-stage estimators; $\hat r$ and $\hat R$ are the estimated ranks of the multi-stage estimators $\hat\Theta$ and $\hat\Gamma$, respectively.

2.6 Application: Forecasting US Macro Indexes

In this section, I apply the model to forecasting several key US macroeconomic series. To predict the h-step-ahead series, I use the debiased estimators $\bar\Theta$ and $\bar\Gamma$ from the joint estimation method. The latent factors $\bar F$ are constructed as the first $\tilde R$ eigenvectors of $\bar\Gamma'\bar\Gamma$, and the factor loadings as $\bar\Lambda=\bar\Gamma\bar F$. I derive $\bar f_{T+h}$ iteratively using an AR model, and $z_{T+h}$ is obtained iteratively by $z_{T+h}=\bar\Theta z_{T+h-1}+\bar\Lambda\bar f_{T+h}$.

Intuitively, since the model is a combination of a reduced-rank VAR (RRVAR) and a factor model (FM), the forecasting results are expected to be better than those from the two models separately. Compared to the VAR, our model is more general, allowing strong cross-sectional dependence among the unobserved terms. Compared to the FM, the model takes serially correlated residuals into consideration and provides more stable forecasts, especially at long horizons.
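A sketch of the forecast recursion just described; fitting an AR(1) law of motion to the estimated factors is an assumption made for brevity (any low-dimensional time series model for $\bar f_t$ would do).

```python
import numpy as np

def iterate_forecast(Theta_bar, Lam_bar, F_bar, z_T, H=12):
    """Roll z_{T+h} = Theta_bar z_{T+h-1} + Lam_bar f_{T+h} forward H steps."""
    # AR(1) coefficient matrix for the factors, fitted by least squares
    A = np.linalg.lstsq(F_bar[:-1], F_bar[1:], rcond=None)[0].T
    z, f, path = z_T.copy(), F_bar[-1].copy(), []
    for _ in range(H):
        f = A @ f                        # extrapolate f_{T+h} iteratively
        z = Theta_bar @ z + Lam_bar @ f  # forecast recursion for z_{T+h}
        path.append(z.copy())
    return np.array(path)                # H x N array of forecasts
```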
When researchers use an FM to forecast, the forecasting equation may end up with too many regressors, considering the lagged AR part and the factors, while the model here has fewer regressors, since it explicitly models the process of $z_t$. I compare the prediction results of my model with RRVAR and FM separately, and the results justify this intuition.

2.6.1 Data

The data are from McCracken and Ng (2016) (MN2016 hereafter) and consist of 128 monthly US macro indexes observed from 1959.01 to 2014.12. Throughout this section, I focus on predicting three key macroeconomic variables: industrial production (IP), CPI inflation (CPI) and the federal funds rate (FYFF).

The series in this dataset cover various aspects of the US macro-economy. For example, it includes industrial production (IP) and real personal income in the group 'Output and income'; the civilian unemployment rate and the help-wanted index in 'Labor market'; new private housing permits and total new privately owned housing starts in 'Housing'; PMI, NAPM, real personal consumption expenditures and total business inventories in 'Consumption, orders, and inventories'; M1 and M2 money stock and total reserves of depository institutions in 'Money and credit'; the effective federal funds rate, the 3-month treasury bill and exchange rates in 'Interest and exchange rates'; PPI and CPI in 'Prices'; and the S&P stock price in 'Stock markets'. See MN2016 for more details.

2.6.2 Forecasting Exercise

I adopt a rolling scheme for forecasting and set the window size to 15 years. Rolling forecasts help deal with possible sample instability (Pesaran and Timmermann, 2005), and the fixed window size allows me to compare predictive accuracy across models using the test proposed by Giacomini and White (2006). Since some of the series are missing until 1978, I choose the first sample to run from 1978.01 to 1992.12. Within each rolling sample, I first transform the series to stationarity as suggested in MN2016 and then standardize the variables using only the data of the rolling sample, so that no information unavailable at the time of the forecast is used. I then use the model to forecast up to 12 periods ahead (i.e., 1993.01-1993.12), along with the competing models. The final prediction values are recovered by multiplying by the standard deviation and adding the mean. After the forecasting results are produced and stored, I move forward one month (the sample becomes 1978.02-1993.01), and new forecasts are produced for the period 1993.02-1994.01. Following the evidence in Carriero et al. (2011), the lag length in the model is fixed at 1.

Within each rolling period, I forecast h = 1,2,...,12 periods ahead iteratively; that is, I develop a vector time series model for the factors $\hat F_t$ first and then use it to forecast $z_{t+h}$. To assess predictive accuracy, I compute the squared error $(\hat z_{t+h}-z_{t+h})^2$ of the model and compare it with two benchmarks: RRVAR and FM. For simplicity, the lag length of the RRVAR is also fixed at 1. In the FM, I choose the number of factors by the eigenvalue ratio of Ahn and Horenstein (2013); the lag lengths of the lagged values and factors are determined by BIC. To predict h = 1,2,...,H periods ahead with the RRVAR model, I derive $z_{t+1}$ first, then use $z_{t+1}$ to predict $z_{t+2}$, and repeat the process until the last prediction $z_{t+H}$ is produced.
In the FM model, I first estimate the factors $\{\bar f_t\}_{t=1}^T$ by principal components analysis (with the maximum number of factors restricted to 8), and the forecasts are then obtained by projecting $z_{i,t+h}$ onto lagged values of $\bar f_t$ and $z_{it}$ directly. Specifically, I report two cases: one where all (up to 8) factors are used and one where only the first factor is used in the prediction equation.

I use the test proposed by Giacomini and White (2006) to assess the statistical significance of the differences in the forecasts of the various models. The test is able to handle forecasts based on both nested and non-nested models, regardless of the estimation procedures. As suggested by the original paper, I select the test function $h=[1,\Delta L]'$ and mainly focus on the conditional test results. The unconditional test is the same as Diebold and Mariano (2002) and yields similar results, which are not reported here.

2.6.3 Forecasting Results

This section reports the results of the forecasting exercise. Table 2.2 compares the forecasting results of my model with the two benchmarks in terms of MSE ratios; a ratio less than 1 signals that the model in this chapter outperforms the benchmark. The symbols *, **, *** denote rejection at the 10%, 5% and 1% levels, respectively, of the null of equal predictive accuracy according to Giacomini and White (2006). The MSE values produced by our model are also reported. Figure 2.1 compares the predictions of our model with the two benchmark models as well as the realized data from 2005 to 2010 at horizons h = 1, 6, 12. (The values produced by the three models are based on the stationarity-transformed data; to derive the numbers shown in Figure 2.1, the prediction results are transformed back to the original series. In the panels relating to the FM prediction, I use the results from the 'first factor' case; the 'all factors' case is similar.)

In Table 2.2, MSE ratios against RRVAR are reported in the left panel. The ratios are in general less than 1, indicating that model (2.1) outperforms RRVAR at almost every horizon. Notice that my model is significantly better than RRVAR in predicting CPI at every horizon, and that the advantage increases with the horizon: for example, the MSE ratio falls from 0.6155 at h = 1 to 0.1878 at h = 12. The series IP and FYFF, however, are more self-explained, in the sense that although my model has a smaller MSE on average, the advantage is negligible at many horizons.

MSE ratios against FM are reported in the middle two panels of Table 2.2. The column 'FM (All factors)' reports the MSE ratio of our model against the FM using all factors in the prediction equation, and 'FM (First factor)' the ratio against the FM using only the first factor. Between the two panels, the results using only the first factor are weakly better, except for the CPI prediction. Comparing the FM with the model in this chapter, almost all MSE ratios are less than 1, which shows that my model outperforms the FM. As with RRVAR, the advantage over FM becomes more significant at longer horizons.

Figure 2.1 compares the original data with the predicted values of the three models at different horizons. As the horizon rises (from left to right), the predictions of all models become less accurate. Among the three models, the deviation of RRVAR is significantly larger than that of the other two, while the differences between the FM and the model here are relatively small.
However, for the CPI prediction at horizon h = 12, it is easy to see that, compared with the FM (green line), my model (blue line) is less volatile and closer to the true values (stars), which coincides with the MSE ratios reported in Table 2.2.

Table 2.2: Forecasting Results

                   RRVAR                  FM (All factors)         FM (First factor)          MSE (10^-4)
 Horizon     IP     CPI    FYFF       IP     CPI    FYFF       IP     CPI    FYFF       IP     CPI   FYFF
    1     0.7673  0.6155  0.7371   0.7446  0.8105  0.7053   0.8969  0.6575  1.0620   0.3714  0.1041   243
    2     0.8093  0.6286  0.7895   0.8247  0.9917  0.5061   0.8946  0.7575  0.8652   0.3773  0.0999   270
    3     0.9327  0.6486  0.8342   0.8082  0.9877  0.7526   0.9483  0.6440  0.8792   0.4018  0.0878   292
    4     0.9042  0.5723  0.8295   0.8011  0.9707  0.9070   0.9521  0.9667  0.9345   0.4301  0.0839   339
    5     0.8767  0.5911  1.0888   0.6453  0.7445  0.8410   0.8558  0.7499  1.0416   0.4408  0.0860   366
    6     0.8266  0.4786  1.0774   0.7745  0.8066  0.7044   0.8271  0.6775  0.8267   0.4491  0.0848   361
    7     0.8311  0.4412  1.0410   0.8010  0.8094  0.5915   0.8664  0.6975  0.8589   0.4859  0.0862   357
    8     0.8902  0.3559  0.9835   0.8056  0.9385  0.6550   0.8165  0.8198  0.9712   0.4566  0.0850   355
    9     0.8891  0.3060  0.8749   0.8561  0.7778  0.5758   0.7665  0.6947  0.7602   0.4561  0.0832   351
   10     0.8800  0.2334  0.9479   0.7864  0.7202  0.6025   0.8563  0.6788  0.7791   0.4502  0.0884   361
   11     0.8867  0.2158  0.9569   0.6259  0.7833  0.6463   0.7474  0.6868  0.7573   0.4502  0.0863   358
   12     0.8756  0.1878  0.9323   0.6854  0.9732  0.7212   0.6885  0.7696  0.6939   0.4432  0.0837   361

Note: The table reports the MSE ratio of the model in this chapter over the benchmark models. A ratio less than 1 indicates that the model outperforms the benchmark. The symbols *, **, *** denote, respectively, rejection at the 10%, 5% and 1% levels of the null of equal predictive accuracy according to Giacomini and White (2006).

2.7 Conclusion

This chapter develops methods for estimating a high-dimensional VAR with unobserved factors that allow for strong cross-sectional and serial dependence among the time series. Incorporating such dependence can be important in high-dimensional disaggregated data, where connectedness between variables may arise through different channels.

Imposing low-rank restrictions on the transition matrix, I introduce two estimation methods: joint estimation and multi-stage estimation. I show that under certain conditions, joint estimation consistently estimates the transition matrix, the common factor space, and their ranks. However, nuclear norm regularization induces bias in the estimators, which can be problematic in finite samples; I therefore propose a debiasing procedure that helps reduce the bias. Leveraging principal component analysis, I also discuss estimators from an independent method, multi-stage estimation, and show that the estimated transition matrix can have a faster convergence rate. In Monte Carlo simulations, I examine the RMSEs of the estimators, and they agree with the theory. Empirically, the model shows good forecasting performance compared with the individual RRVAR and FM, and it has advantages in discovering dynamic connectedness patterns compared with a pure VAR model.

The methods and results open up multiple avenues for further research. Firstly, the estimators in this chapter only have oracle properties in terms of the Frobenius norm; inference on individual elements remains unclear. Secondly, the model so far does not allow for structural change in the transition matrix or the factor loadings; a time-varying-parameter setting would help capture the empirical evolution of institutional and regulatory frameworks.
Figure 2.1: Forecasting Results from 2005-2010

Chapter 3

Global Bank Network Analysis

3.1 Introduction

In this chapter I construct the dynamic network of major global banks from 2003 to 2021. Connectedness measurement is central to financial risk measurement and management: for example, it reveals how the bank network structure changes in response to global economic crises, as well as each bank's risk exposure. It is also important for understanding underlying macroeconomic risks, especially monetary risks, in which financial banks play crucial roles.

Many existing related works restrict the sample size and focus on a small number of banks, for example, Diebold and Yılmaz (2014). This is mostly due to the challenge of high dimensionality: the number of parameters required to construct the network rises quadratically with the number of banks in the sample. A recent paper, Demirer et al. (2018), investigates global banks' connectedness with a VAR model, imposing a sparsity condition on the VAR transition matrix. However, due to the limitations of that model, consistency of the high-dimensional covariance matrix of the disturbances is not guaranteed, and neither is the constructed network; details are explained in Section 3.4.

Leveraging the theoretical results from Chapter 2, I apply the VAR with common factors model to analyze the connectedness of global banks from 2003 to 2021. I estimate the parameters of the model using nuclear norm regularization and then construct the dynamic network following the connectedness measurement methods of Diebold and Yilmaz (2009), Diebold and Yılmaz (2014) and Demirer et al. (2018). The main findings are: most of the time, banks are loosely linked and the connectedness level is very low; latent factors tend to appear during global shocks (for example, economic crises and the covid pandemic) and dramatically raise the connectedness level; and most shocks stay within a country, with only a small portion crossing borders and affecting other countries, even during periods of global shocks.

The chapter is organized as follows. Section 3.2 briefly introduces the data set and Section 3.3 the methodology. Section 3.4 discusses the advantages of using the model from Chapter 2 over the method in Demirer et al. (2018). Section 3.5 presents the estimation results and Section 3.6 concludes.

3.2 Banks' Volatilities Data

The data set is based on Demirer et al. (2018), whose original data run from 2003 to 2014; I extend them from February 7, 2014 to April 28, 2021 using Yahoo Finance (some of the banks in the original dataset no longer existed in 2021; I replace them with new major banks, with further details in the appendix). The data consist of over 90 major banks' daily stock return volatilities from September 12, 2003 to April 28, 2021. Specifically, the sample includes all the 'globally systemically important banks', except for four banks that were not publicly traded as of September 2003. Eighty-two of the banks are from 23 developed economies, and the remaining 14 are from 6 emerging economies.

Following Garman and Klass (1980), the volatilities are calculated from daily stock price data:

$$\sigma^2=0.511(H_{it}-L_{it})^2-0.019\big[(C_{it}-O_{it})(H_{it}+L_{it}-2O_{it})-2(H_{it}-O_{it})(L_{it}-O_{it})\big]-0.383(C_{it}-O_{it})^2,$$

where $H_{it}$, $L_{it}$, $O_{it}$ and $C_{it}$ are, respectively, the logs of the daily high, low, opening and closing prices for bank stock i on day t. Notice that changing individual series will not change the results dramatically, since the focus is mainly on network statistics such as total directional connectedness and country-level connectedness.
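A sketch of the Garman-Klass (1980) computation above for a single bank-day; the example prices are made up for illustration.

```python
import numpy as np

def garman_klass(h, l, o, c):
    """Garman-Klass (1980) daily variance from log high/low/open/close prices."""
    return (0.511 * (h - l) ** 2
            - 0.019 * ((c - o) * (h + l - 2 * o) - 2 * (h - o) * (l - o))
            - 0.383 * (c - o) ** 2)

# e.g. a hypothetical day with high 103, low 98, open 100, close 101:
H, L, O, C = np.log([103.0, 98.0, 100.0, 101.0])
sigma2 = garman_klass(H, L, O, C)
```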
Notice that changing series will not change the results dramatically since the focus are mainly on network statistics such as total directional connectedness, country-level connectedness, etc. 60 3.3 Connectedness Measures In this experiment, I use rolling sample to investigate the dynamics of banks’ volatility connected- ness. The sample size is fixed at 150 days. In each sample, I fit 96 demeaned banks’ volatility data into our model and compute variance decompositions and corresponding connectedness measures at horizon H = 10 2 . Specifically, consider N-variable V AR(p) with interactive fixed effects model: z t = p å i=1 F i z ti + e t ; e t =Lf t + u t (3.1) Rewrite it as MA form, we have: z t = ¥ å i=0 Q i e ti ; (3.2) whereQ i =F 1 Q i1 +F 2 Q i2 +:::+F p Q ip ,Q= I N andQ= 0 for i< 0. Follow Pesaran and Shin (1998), unit j 0 s contribution to unit i 0 s H-step-ahead generalized forecast error variance A i j (H) (H = 1;2;:::) is: A i j (H)= å H1 h=0 (e 0 i Q h Se j ) 2 s j jå H1 h=0 (e 0 i Q h SQ 0 h e i ) ; (3.3) whereS is the covariance matrix of the disturbance e t ,s j j is the standard deviation, and e i is vector with 1 in the i 0 th position and 0 0 s elsewhere. Follow the series paper Diebold and Yilmaz (2009), Diebold and Yılmaz (2014) and Demirer et al. (2018), I define the pairwise directional connectedness from unit j to unit i as row normalized A i j : C i j (H)= A i j (H) å H j=1 A i j (H) : (3.4) 2 Same H = 10 is also used in Demirer et al. (2018). Theoretically the network varies with H, in practice, I examined different but similar H’s and the results are similar. I will leave this issue to future research 61 Based on the N by N matrix C, define the total direnctional connectedness of unit i from all other units (From) as: C i = 1 N N å j6=i C i j (H): (3.5) Similarly, total directional connectedness from unit i to all other units (To) is C i! = 1 N N å j6=i C ji (H): (3.6) Moreover, net connectedness of i is defined as: Net i (H)= C i! C i : (3.7) Total directional connectedness is defined as sum of all off-diagonal elements in C: TC(H)= 1 N å i6= j C i j : (3.8) 3.4 High-Dimensional Covariance Matrix Estimation Theoretically, (3.3) shows that deriving matrix C requires the knowledge of Q and S. Without prior information on the population covariance matrix, the sample covariance matrix estimator seems to be a natural candidate. However, it suffers several issues under the context here. The first is misspecification. If there are factors existing in the true data-generating process, ˆ e t with traditional V AR could be biased and distort the sample covariance matrix. Second, the sample size of the data is relatively large: 96 cross-sectional banks’ volatility data and 150 trade days. The sample covariance matrix would perform badly under this high-dimensional scaling; in fact, the eigenvector could even be orthogonal to the true value (Johnstone, 2001). It has been very popular to use V AR(p) without factors to approximate and derive the network structure. However, under certain contexts, factors may very likely exist at certain periods, if not the whole sample. Taking bank volatility as the example, the factors may not necessarily exist 62 in most of the time periods, and the series could be well approximated by simple V AR(p) model. However, in some special moments, especially when crisis happens, factors may exist and drive the series move forward. 
Intuitively, ignoring such structure would lose some important features; the potential existing factors could make the connectedness denser. The model in (3.1) allows the strong cross-sectional correlation existing among disturbances by latent factors. Assuming u t independent with factors at all leads and lags, the covariance matrix of residual can be written as: S e =LS F L 0 +S u : AssumingS u is diagonal, we replaceL,S F ,S u by its empirical estimator ˆ L, b S F =å t f t f 0 t =T and b S u = diag(å t u t u 0 t =T) in practice. 3.5 Connectedness Analysis 3.5.1 Evidences of Factors Out of 3643 rolling samples, I fit data with my model and find evidence of factors 3 in 1080 samples, which is approximately one third of the sample size. Figure 3.1 marks the samples that contain latent factors. It shows that factors don’t show up at random; in fact, the samples containing factors are gathered between 2007 and 2011, as well as period from 2020 to 2021, during which several major economic crisis took place. For example, in late 2008 when the American economic crises swept the world, I find the existence of latent factors that affect most of the banks’ stock volatilities. A similar situation also happened in the multiphases of the European debt crises from late 2011 to 2012, and the pandemic starting from 2020. 3 Most of the samples have 1 factor; only a few samples have 2 factors. 63 Figure 3.1: Evidence of Factors 3.5.2 Banks’ Connectedness It would be difficult and messy to present the whole network connectedness graph of 96 banks. Instead, figure 3.2 presents the total directional connectedness defined in (3.8), which measures the total strength of connectedness of each bank’s volatilities. It is well believed that volatilities tend to lurch and move together only in crises, and results here confirm this finding again. Figure 3.2 shows that the system-wide connectedness is highly fluctuated. That is, during the time when there were no system-wide shocks that affected the worldwide economy, the connectedness remains as low as 5%, but it could reach as high as 85% when crises happened. For example, the system-wide connectedness reached its peak during the 2008 American economic crisis 4 . Later as US markets calmed, the connectedness level started to trend down gradually. Then due to the European debt crises, we can see there are two big jumps during 2009-2010 (rescue package for Greece) and 2011-2012 (sovereign debt crises in Spain and 4 The peak time might be a little ahead of the time because we choose the starting point of the rolling sample as the date to plot. 64 Figure 3.2: Total Directional Connectedness Italy), respectively. Starting the beginning of 2020, the covid-19 outbursted and swept the world quickly. During this time, the connectedness level rises dramatically and even higher than the peak during American economic crisis. I also compare the results with the connectedness measured by sample covariance matrix (red line in figure 3.2). We show that under high-dimensional setting, naively using sample covariance matrix would overestimate the connectedness level in general. It fails to reflect the fact that the volatility network remains sparse during ’peace’ time and become highly connected during crises. 65 3.5.3 Integrated Country Connectedness In this section, I integrate the connectedness among banks into country level by summing up all pairwise connectedness according to banks’ location. It first shows the network graph of 29 coun- tries in certain sample periods. 
Then it turns to discuss total country-wise directional connected- ness and compare with banks’ connectedness in previous section. Lastly, it shows dynamic pattern of net connectedness for specific countries. The network graphs in figure 3.3 and 3.4 are drawn by Gephi. Specifically, I make the node size a linear function of its proportion within country connectedness: that is, if effects from other countries or to other countries are relatively small, the node size is larger. The darkness of color indicates the level of connectedness, the stronger the connectedness is, the darker the linkages are. The arrow from country A to B indicates the shock is from A to B. I use ’ForceAtlas2’ algorithm to find the steady state and show it in the graph. The algorithm assumes that each nodes repels each other while linkages attract the pair of nodes. Therefore, the steady state is the position that repelling and attracting forces exactly balance. In other way, if the nodes are displayed in the periphery and far from the center, it means such countries are not linked to many other countries, and the strength of the connectedness is weak. In figure 3.3, I show 29 countries’ network graph in three time periods with the same scale: February 2007, October 2007, and October 2008. It can be seen that countries in early 2007 are much farther away, and the distances shrink as time moves forward, which indicates that the strength of linkages among countries is getting stronger. It confirms that the the linkages among countries are very are when there are no system-wide shocks, however, crises bring the network denser and countries are closely connected. There are also some interesting findings if we focus on specific countries. For example, the United States is always centered with visible arrows pointing to other countries, which shows its dominant position in world economy in the way that it affects every other countries constantly. China is positioned near the center in early 2007, but gets farther away at times close to crises. This might be an evidence that the 2008 American crisis didn’t affect China that much compared to other countries. 66 Figure 3.4 compares the network during American crisis (October 2008) and the pandemic (Feb 2020). From the 3.4a, we see the United States is centered in the network, and there are dark green arrows pointing to most of the other countries from the United States, while other links among non-US countries are weak and hard to see. We can also observe that countries that close to the United States (the UK, Italy, Spain, France, and Canada) were affected more severely compared with periphery countries (China, Korea, Australia, and Malaysia). During the covid-19 shocks in Feb 2020, compare the scale of 3.4b with 3.4a, we see that the connectedness among countries are in general stronger than the 2008 American economic crisis. Instead of the star structure with US being in the center in 3.4a, in 3.4b, the linkages and arrows are messier don’t show a clear pattern. During this time, countries in the center area such as Japan, US, France, Australia produce the net effect to the rest of the countries, while they are strongly affected by each other at the same time . The periphery countries such as China, Russia, Belgium are less affected compared to the countries in the center of the graph. Next, similar to the previous section, figure 3.5 shows the country level total directional con- nectedness. 
In general, the within country connectedness dominates and much higher than the across country connectedness. Even though it trends down after 2007, the proportion of within country connectedness level remains high at approximately 80%. The figure also shows that start- ing from 2007, international connectedness strengthens over time and reaches a high point in late 2008 in response to the American crisis. The trend goes down for a very short time before it climbs and has two waves during 2010 and 2011 because of the European debt crises. Starting in 2012, the connectedness goes down slowly until 2020. Compared to bank-level connectedness in figure 3.2, I find that integrated country-level connectedness does not respond to crises as dramatically as bank-level connectedness. That is, when crisis comes, cross-country connectedness won’t surge from a low to a high level in a very short time. And after a crisis, the international linkages won’t vanish immediately, which indicates the effects from the crisis to the worldwide economy would last for a longer time than expected. 67 Figure 3.3: Country Integrated Network in Multiple Time Last, figure 3.6 shows the net connectedness of seven countries (US, UK, Japan, China, Korea, and Germany) from 2003 to 2021. Net connectedness is the difference of ’To’ effect and ’From’ effect (3.7) defined in (3.7). Roughly speaking, it measures how much the country affects the rest of the world. If the net connetedness is positive, then the effect produced by the country is larger than the effect received, and vice versa. The figure shows that the US dominates other countries in net connectedness of bank volatilities network. It peaked in late 2008, which was the time of the Lehman collapse and the starting point of the US financial crises. Later as the US economy recovers, the impact from the United States trends down slowly but still way above other countries. Other developed countries’, for example, Japan and Germany, net connectedness level remains low and around zero, which means the shocks from banks inside country and shocks from the rest of the world are relatively at the same level. Meanwhile, during 2020 pandemic, figure shows US, Japan produce much higher positive net effects to the rest of the world, compared with other countries. 68 (a) October, 2008 (b) Feb, 2020 Figure 3.4: Country Integrated Connectedness 69 Figure 3.5: Total Country Directional Connectedness Figure 3.6: Country Net Connectedness 70 3.6 Conclusion Leveraging the model discussed in precious chapter, this chapter applies the model into analyzing global banks’ connectedness level from 2003 to 2021. I find that the banks’ network is fragile to system-wide shocks, such as economic crisis, and pandemic. When the shocks arrive, the network responds fast and the connectedness level surges to a very high level, and the heat quickly calms down after the shocks. The integrated country level network is more persistent. When shocks come, the connectedness across country climbs and stays for a longer period time before goes down. Meanwhile, compared with the 2008 US economic crisis and 2020 pandemic, the latter shocks seem to be more severe than the previous one, and the origins of the shocks are from many countries, instead of solely US in 2008 economic crisis. The study opens up several venues for future study. For example, the constructed network is sensitive to the forward period H and the lags of V AR part. 
In this chapter, I fix H = 10 and V AR lag to be 1 for simplicity, it would be of interest to study other cases and discover the similari- ties and differences. Secondly, apart from the latent factor structure, I restrict the residual to be cross-sectionally independent, which might underestimate the connectedness level. More general conditions can be made here to relax the restriction, such as country cluster error, etc. Lastly, the theoretical justification of the consistency of the stats, such as total directional connectedness is missed here, I shall leave these issues in future studies. 71 Bibliography AHN, S. C. and HORENSTEIN, A. R. (2013). Eigenvalue ratio test for the number of factors. Econometrica, 81 (3), 1203–1227. ANDERSON, H. M. and VAHID, F. (2007). Forecasting the volatility of australian stock returns: Do common factors help? Journal of Business & Economic Statistics, 25 (1), 76–90. ANDERSON, T. W. et al. (1951). Estimating linear restrictions on regression coefficients for mul- tivariate normal distributions. The Annals of Mathematical Statistics, 22 (3), 327–351. ATHEY, S., BAYATI, M., DOUDCHENKO, N., IMBENS, G. and KHOSRAVI, K. (2018). Matrix completion methods for causal panel data models. Tech. rep., National Bureau of Economic Research. BAI, J. (2003). Inferential theory for factor models of large dimensions. Econometrica, 71 (1), 135–171. — and NG, S. (2002). Determining the number of factors in approximate factor models. Econo- metrica, 70 (1), 191–221. — and — (2019). Rank regularized estimation of approximate factor models. Journal of econo- metrics, 212 (1), 78–96. BALTAGI, B. H. and BRESSON, G. (2011). Maximum likelihood estimation and lagrange mul- tiplier tests for panel seemingly unrelated regressions with spatial lag and spatial errors: An application to hedonic housing prices in paris. Journal of Urban Economics, 69 (1), 24–42. BANERJEE, A., CHANDRASEKHAR, A. G., DUFLO, E. and JACKSON, M. O. (2013). The diffu- sion of microfinance. Science, 341 (6144), 1236498. BARAB ´ ASI, A.-L. et al. (2016). Network science. Cambridge university press. BERNANKE, B. S., BOIVIN, J. and ELIASZ, P. (2005). Measuring the effects of monetary policy: a factor-augmented vector autoregressive (favar) approach. The Quarterly journal of economics, 120 (1), 387–422. BERNSTEIN, D. S. (2005). Matrix mathematics: Theory, facts, and formulas with application to linear systems theory, vol. 41. Princeton university press Princeton. BIEN, J. and TIBSHIRANI, R. J. (2011). Sparse estimation of a covariance matrix. Biometrika, 98 (4), 807–820. 72 CAI, J.-F., CAND ` ES, E. J. and SHEN, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on optimization, 20 (4), 1956–1982. CAI, T., LIU, W. and LUO, X. (2011). A constrained l 1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106 (494), 594–607. CAI, T. T., HU, J., LI, Y. and ZHENG, X. (2019). High-dimensional minimum variance portfolio estimation based on high-frequency data. Journal of Econometrics. CAMBA-MENDEZ, G., KAPETANIOS, G., SMITH, R. J. and WEALE, M. R. (2003). Tests of rank in reduced rank regression models. Journal of Business & Economic Statistics, 21 (1), 145–155. CANDES, E., TAO, T. et al. (2007). The dantzig selector: Statistical estimation when p is much larger than n. The annals of Statistics, 35 (6), 2313–2351. CARRIERO, A., KAPETANIOS, G. and MARCELLINO, M. (2011). 
Forecasting large datasets with bayesian reduced rank multivariate models. Journal of Applied Econometrics, 26 (5), 735–761. CHERNOZHUKOV, V., HANSEN, C. B., LIAO, Y. and ZHU, Y. (2019). Inference for heterogeneous effects using low-rank estimations. Tech. rep., cemmap working paper. CHIONG, K. X. and MOON, H. R. (2017). Estimation of graphical lasso using the l1, 2 norm. The Econometrics Journal. CHUDIK, A., KAPETANIOS, G. and PESARAN, M. H. (2018). A one covariate at a time, multiple testing approach to variable selection in high-dimensional linear regression models. Economet- rica, 86 (4), 1479–1512. — and PESARAN, M. H. (2011). Infinite-dimensional vars and factor models. Journal of Econo- metrics, 163 (1), 4–22. DEMIRER, M., DIEBOLD, F. X., LIU, L. and YILMAZ, K. (2018). Estimating global bank net- work connectedness. Journal of Applied Econometrics, 33 (1), 1–15. DIEBOLD, F. X. and MARIANO, R. S. (2002). Comparing predictive accuracy. Journal of Busi- ness & economic statistics, 20 (1), 134–144. — and YILMAZ, K. (2009). Measuring financial asset return and volatility spillovers, with appli- cation to global equity markets. The Economic Journal, 119 (534), 158–171. — and YILMAZ, K. (2014). On the network topology of variance decompositions: Measuring the connectedness of financial firms. Journal of Econometrics, 182 (1), 119–134. DOAN, T., LITTERMAN, R. and SIMS, C. (1984). Forecasting and conditional projection using realistic prior distributions. Econometric reviews, 3 (1), 1–100. ELBERG, A. (2016). Sticky prices and deviations from the Law of One Price: Evidence from Mexican micro-price data. Journal of International Economics, 98 (C), 191–203. 73 FAN, Q., HAN, X., PAN, G. and JIANG, B. (2019). Large system of seemingly unrelated regres- sions: A penalized quasi- maximum likelihood estimation perspective. Econometric Theory, 27, 1–33. FORNI, M., HALLIN, M., LIPPI, M. and REICHLIN, L. (2000). The generalized dynamic-factor model: Identification and estimation. Review of Economics and statistics, 82 (4), 540–554. FRIEDMAN, J., HASTIE, T. and TIBSHIRANI, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9 (3), 432–441. GARMAN, M. B. and KLASS, M. J. (1980). On the estimation of security price volatilities from historical data. Journal of business, pp. 67–78. GEWEKE, J. (1977). The dynamic factor analysis of economic time series. Latent variables in socio-economic models. GIACOMINI, R. and WHITE, H. (2006). Tests of conditional predictive ability. Econometrica, 74 (6), 1545–1578. GREENAWAY-MCGREVY, R., HAN, C. and SUL, D. (2012). Estimating the number of common factors in serially dependent approximate factor models. Economics Letters, 116 (3), 531–534. GREENE, W. H. (2003). Econometric analysis. Pearson Education India. HASTIE, T., TIBSHIRANI, R. and WAINWRIGHT, M. (2015). Statistical learning with sparsity: the lasso and generalizations. CRC press. HORN, R. A. and JOHNSON, C. R. (1990). Matrix analysis. Cambridge university press. JI, S. and YE, J. (2009). An accelerated gradient method for trace norm minimization. In Pro- ceedings of the 26th annual international conference on machine learning, ACM, pp. 457–464. JOHANSEN, S. (1991). Estimation and hypothesis testing of cointegration vectors in gaussian vec- tor autoregressive models. Econometrica: journal of the Econometric Society, pp. 1551–1580. JOHNSTONE, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Annals of statistics, pp. 295–327. 
— and LU, A. Y. (2009). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104 (486), 682–693. LAM, C. and FAN, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. Annals of statistics, 37 (6B), 4254. LEEPER, E. M., SIMS, C. A., ZHA, T., HALL, R. E. and BERNANKE, B. S. (1996). What does monetary policy do? Brookings papers on economic activity, 1996 (2), 1–78. LEWBEL, A. (1991). The rank of demand systems: theory and nonparametric estimation. Econo- metrica: Journal of the Econometric Society, pp. 711–730. 74 LIN, J. and MICHAILIDIS, G. (2019). Approximate factor models with strongly correlated id- iosyncratic errors. arXiv preprint arXiv:1912.04123. LIU, H., LAFFERTY, J. and WASSERMAN, L. (2009). The nonparanormal: Semiparametric esti- mation of high dimensional undirected graphs. Journal of Machine Learning Research, 10 (Oct), 2295–2328. LOH, P. and WAINWRIGHT, M. J. (2013). Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. Annals of Statistics, 41 (6), 3022–3049. MA, S., GOLDFARB, D. and CHEN, L. (2011). Fixed point and bregman iterative methods for matrix rank minimization. Mathematical Programming, 128 (1-2), 321–353. MCCRACKEN, M. W. and NG, S. (2016). Fred-md: A monthly database for macroeconomic research. Journal of Business & Economic Statistics, 34 (4), 574–589. MIAO, K., PHILLIPS, P. C. and SU, L. (2020). High-Dimensional VARs with Common Factors. Tech. rep., Cowles Foundation Discussion Papers. MOON, H. R. and WEIDNER, M. (2017). Dynamic linear panel regression models with interactive fixed effects. Econometric Theory, 33 (1), 158–195. — and — (2018). Nuclear norm regularized estimation of panel regression models. arXiv preprint arXiv:1810.10987. NEGAHBAN, S. and WAINWRIGHT, M. J. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, pp. 1069–1097. P ´ ASTOR, L. and STAMBAUGH, R. F. (2002). Mutual fund performance and seemingly unrelated assets. Journal of Financial Economics, 63 (3), 315–349. PESARAN, H. H. and SHIN, Y. (1998). Generalized impulse response analysis in linear multivari- ate models. Economics letters, 58 (1), 17–29. PESARAN, M. H. (2006). Estimation and inference in large heterogeneous panels with a multifac- tor error structure. Econometrica, 74 (4), 967–1012. — and TIMMERMANN, A. (2005). Small sample properties of forecasts from autoregressive mod- els under structural breaks. Journal of Econometrics, 129 (1-2), 183–217. PHILLIPS, P. C. and OULIARIS, S. (1990). Asymptotic properties of residual based tests for coin- tegration. Econometrica: Journal of the Econometric Society, pp. 165–193. RAVIKUMAR, P., WAINWRIGHT, M. J., RASKUTTI, G., YU, B. et al. (2011). High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electronic Jour- nal of Statistics, 5, 935–980. RECHT, B., FAZEL, M. and PARRILO, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review, 52 (3), 471–501. 75 SARGENT, T. J., SIMS, C. A. et al. (1977). Business cycle modeling without pretending to have too much a priori economic theory. New methods in business cycle research, 1, 145–168. SIMS, C. A. (1992). Interpreting the macroeconomic time series facts: The effects of monetary policy. European economic review, 36 (5), 975–1000. — (1993). 
A nine-variable probabilistic macroeconomic forecasting model. In Business cycles, indicators, and forecasting, University of Chicago press, pp. 179–212. STOCK, J. H. and WATSON, M. W. (2005). Implications of dynamic factor models for VAR anal- ysis. Tech. rep., National Bureau of Economic Research. VELU, R. P., REINSEL, G. C. and WICHERN, D. W. (1986). Reduced rank models for multiple time series. Biometrika, 73 (1), 105–118. VERSHYNIN, R. (2017). High-Dimensional Probability. Cambridge University Press (to appear). — (2018). High-dimensional probability: An introduction with applications in data science, vol. 47. Cambridge University Press. WONG, K. C., TEWARI, A. and LI, Z. (2016). Regularized estimation in high dimensional time series under mixing conditions. stat, 1050, 12. ZELLNER, A. (1962). An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. Journal of the American statistical Association, 57 (298), 348–368. 76 Appendices G Proof for Chapter 1 We first summarize some useful norm inequalities from Chapter 9 in Bernstein (2005) and Chapter 5 in Horn and Johnson (1990). For matrix A, Denote s r (A) the r th largest singular value of A, and let s max (A);s min (A) be the largest and smallest singular value of matrix A, respectively. Denote and denote Kronecker product and Hadamard product, respectively. Lemma A.1. For any matrix A;B2R mn , F2R nl and column vector b2R n , we have : (i)jjAjj ¥ s max (A) p jjjAjjj ¥ jjjA 0 jjj ¥ ; (ii)jjjAjjj ¥ p ns max (A) p njjAjj F ; (iii)jjA+ Bjj ¥ jjAjj ¥ +jjBjj ¥ ; (iv)kAB 0 k op kAk op kBk op ; (v)jjAbjj ¥ jjjAjjj ¥ jjbjj ¥ ; (vi)js i (A) s i (B)jjjA Bjj op for each i= 1;2;;min(m;n); (vii) s min (AF) s min (A)s min (F); (viii) If m= n and A;B> 0, then s min (A B)= s min (A)s min (B), s max (A B)= s max (A)s max (B); (ix) If m= n, A and B are both non-negative and symmetric, s max (A B)jjAjj ¥ s max (B). 77 G.1 Proof of Lemma 1 (i) Since b S i j is the empirical covariance estimator from OLS residuals, we have: b S i j = 1 T T å t=1 b U it b U jt = 1 T T å t=1 U it X 0 it ( ˆ b i b i ) U jt X 0 jt ( ˆ b j b j ) = 1 T T å t=1 U it U jt 1 T T å t=1 U it X 0 jt ( ˆ b j b j ) 1 T T å t=1 U jt X 0 it ( ˆ b i b i )+ 1 T T å t=1 X 0 it ( ˆ b i b i )( ˆ b j b j ) 0 X jt : Denote e S := 1 T å T t=1 U it U jt ,S i j =E(U it U jt ), then: j b S i j S i j jj b S i j e S i j j+j e S i j S i j jjI 1 j+jI 2 j+jI 3 j+j e S i j S i j j; where I 1 := 1 T å T t=1 U it X 0 jt ( ˆ b j b j ), I 2 := 1 T å T t=1 U jt X 0 it ( ˆ b i b i ) and I 3 := 1 T å T t=1 X 0 it ( ˆ b i b i )( ˆ b j b j ) 0 X jt . For simplicity, let K i = 1 for all i, we will extend to the case of K i > 1 later. For anyd > 0 and conditional onfX t g T t=1 5 , P(jI 1 j>d)=P 1 T T å t=1 U it X jt 1 T T å t=1 X 2 jt 1 1 T T å t=1 X jt U jt >d ! P 0 @ 1 T T å t=1 U it X jt ! 1 T T å t=1 X 2 jt ! 1=2 > p d 1 A +P 0 @ 1 T T å t=1 U jt X jt ! 1 T T å t=1 X 2 jt ! 1=2 > p d 1 A 4exp Td 2s 2 max i (S ii ) ; (A.1) 5 For simplicity, we omit the conditional sign inP(jX) through out the proof of Lemma 1. 78 where the first inequality if from the fact thatP(jxyj> a)P(jxj> p a)+P(jyj> p a); the second inequality is derived from the Assumption 1(iv) that U it = p S ii i:i:d subG(0;s 2 ) and Hoeffding’s inequality, see for example, Theorem 2.6.3 in Vershynin (2018). 
Similarly, we obtain: P(jI 2 j>d) 4exp Td 2s 2 max i (S ii ) : (A.2) For I 3 , apply Cauthy inequality, we obtain: jI 3 j= 1 T T å t=1 X it X jt 1 T T å t=1 X 2 it 1 1 T T å t=1 X 2 jt 1 1 T T å t=1 X it U it 1 T T å t=1 X jt U jt 1 T T å t=1 X 2 it 1=2 1 T T å t=1 X it U it 1 T T å t=1 X 2 jt 1=2 1 T T å t=1 X jt U jt : (A.3) Therefore, from (A.1), P(jI 3 j>d) 4exp Td 2s 2 max i (S ii ) : From Lemma 1 in RWRY , we know P(j e S i j S i j j>d) 4exp Td 2 128(1+ 4s 2 ) 2 max i (S ii ) 2 ; (A.4) for alld2(0;8max i (S ii )(1+ 4s 2 )). Combine (A.1) - (A.4), we obtain ford2 0;16(1+ 4s 2 )max i (S ii ) , P(j b S i j S i j j>d)P(j b S i j e S i j j> d 2 )+P(j e S i j S i j j> d 2 ) P(jI 1 j> d 6 )+P(jI 2 j> d 6 )+P(jI 3 j> d 6 )+P(j b S i j S i j j> d 2 ) 12exp( Td 12s 2 max i (S ii ) )+ 4exp( Td 2 512(1+ 4s 2 ) 2 max i (S ii ) 2 ) 16exp( Td 2 512(1+ 4s 2 ) 2 max i (S ii ) 2 ): 79 Now let’s consider the general case when 1 K i K (i= 1;2;:::;N), where K is a bounded constant number. We will investigate the boundP(jI 1 j>d), the bound ofP(jI 2 j>d) andP(jI 3 j> d) can be derived in similar way. Let Q k (k= 1;2;:::;K j ) denotes the k 0 th column of 1 T å T t=1 X jt X 0 jt 1=2 , then ford > 0, P(jI 1 j>d)=P 1 T T å t=1 U it X 0 jt 1 T T å t=1 X jt X 0 jt 1 1 T T å t=1 X jt U jt >d ! =P K j å k=1 1 T T å t=1 U it X 0 jt Q k 1 T T å t=1 Q 0 k X jt U jt >d ! K j å k=1 P 1 T T å t=1 U it X 0 jt Q k 1 T T å t=1 Q 0 k X jt U jt > d K j ! K j å k=1 P 1 T T å t=1 U it X 0 jt Q k > s d K j ! +P 1 T T å t=1 U jt X 0 jt Q k > s d K j !! : (A.5) Apply Hoeffding’s inequality, we obtain: P 1 T T å t=1 U it X 0 jt Q k > s d K j ! 2exp Td 2s 2 K j max i (S ii ) 1 T å t (X 0 jt Q k ) 2 ! 2exp Td 2K 2 s 2 max i (S ii ) ; (A.6) the second inequality is derived from the fact that 1 T T å t=1 (X 0 jt Q k ) 2 = tr 1 T T å t=1 X jt X 0 jt Q k Q 0 k ! = tr 1 T T å t=1 X jt X 0 jt 1 T T å t=1 X jt X 0 jt 1 å k 0 6=k Q k 0Q 0 k 0 !! tr 1 T T å t=1 X jt X 0 jt 1 T T å t=1 X jt X 0 jt 1 ! K: 80 Continue (A.5), we have: P(jI 1 j>d) 4K exp Td 2K 2 s 2 max i (S ii ) : (A.7) Apply the similar method, we have ford2(0;minf172=K 2 ;16g(1+ 4s 2 )max i (S ii )), P b S i j S i j >d (12K+ 4)exp( Td 2 512(1+ 4s 2 ) 2 max i (S ii ) 2 ): The proof for(ii) and(iii) in Lemma 1 are the same as RWRY . G.2 Proof of Proposition 1 For convenience, we define the following notations: A NT := 1 T T å t=1 X t WX 0 t = 1 T X 0 (W I T )X; B NT := 1 p T T å t=1 X t WU t ; b A NT := 1 T T å t=1 X t b W gl X 0 t = 1 T X 0 ( b W gl I T )X; b B NT := 1 p T T å t=1 X t b W gl U t ; where X t and U t are defined in model (1.2), b W gl is the graphical LASSO estimator in (1.9), X is defined in model (1.4) and I T is the T T identity matrix. Before the proof of Proposition 1, we first show results of several lemmas, then we prove the proposition. Lemma A.2. Suppose Assumption 2 holds, then there exist constants c;c 0 > 0 such that c 0 s min (A NT ) s max (A NT ) c w.p.1 as N;T!¥. Proof. We show the proof of s min (A NT ) c 0 , the proof for s max (A NT ) c is similar. Since A NT = 1 T X 0 (W I T )X, where X = diag(X 1 ;;X N ), from Lemma A.8(vii) and (viii), we have: s min (A NT ) 1 T s 2 min (X)s min (W)= min i=1;2;;N s min X 0 i X i T s min (W): (A.8) 81 According to Assumption 2 that 1 T X 0 i X i p ! 
Q ii as T!¥ uniformly in i and s min (Q ii ) c 0 for certain constant c 0 > 0, we have: min i=1;2;;N s min 1 T (X 0 i X i ) min i=1;:::;N s min (Q ii ) max i=1;:::;N s min X 0 i X i T s min (Q ii ) c 0 + o p (1): Combine Assumption 2 that s min (W) c 0 > 0, choose c 0 c 2 0 , we have: w.p.1, s min (A NT ) c 0 : Recall that D N is the maximum number of nonzero entries per row in true precision matrixW, and S N is number of nonzero entries inW, including the diagonal elements. DefineD W := b W gl W. Lemma A.3. If Assumption 3 holds,jjD W jj op =O p minf p S N ;D N g q logN T , andjjD W jj F =O p q S N logN T . Proof. The result follows directly from Lemma 1. Specifically, the exact recovery result of (ii) shows that b W gl at least remains the sparse structure ofW. That is, for the pair(i; j) such thatW= 0, then b W gl = 0. Therefore, we can conclude that, if there are at most D N nonzero entries per row in W, then the maximum nonzero entries per row inD W = b W gl W is at most D N . Then from Lemma A.8(i), jjD W jj op =O p D N r logN T ! : From Lemma A.8(ii), jjD W jj op =O p r S N logN T ! : 82 And, from the definition of Frobenius norm, we have: jjD W jj F =O p r S N logN T ! : Lemma A.4. Suppose Assumption 1, 2 and 3 hold, then we have, (a) b A NT A NT op =O p minf p S N ;D N g q logN T ; (b) there exist c 0 > 0 such that s max ( b A 1 NT ) 1 c 0 +O p minf p S N ;D N g p logN=T . Proof. (a) b A NT A NT op = 1 T X 0 (D W I T )X op 1 T s max (X) 2 s max (D W ) max 1iN 1 T X 0 i X i op s max (D W ) =O p minf p S N ;D N g r logN T ! : In the last inequality, we apply results from Lemma A.3. (b) According to Lemma A.8(vi) and Part(a), we have js min ( b A NT ) s min (A NT )j b A NT A NT op O p minf p S N ;D N g r logN T ! : 83 Combining with result in Lemma A.2, we have that there exist c 0 > 0 such that: s max ( b A 1 NT )= 1 s min ( b A NT ) 1 s min (A NT )js min ( b A NT ) s min (A NT )j 1 c 0 +O p minf p S N ;D N g q logN T : Lemma A.5. If Assumptions 1 and 2 hold, thenjjB NT jj F =O p ( p N). Proof. Define new variable Z NT := 1 p N jjB NT jj F . Consider the expectation of Z NT , we obtain: E(Z NT )= 1 p NT E " å t X t WU t 0 å t X t WU t 1=2 # 1 p NT E å t X t WU t 0 å t X t WU t 1=2 = 1 p NT T å t=1 tr E(X 0 t X t )W ! 1=2 =O(1); In the first inequality, we use Jensen’s inequality; the last equality is from Assumption 2(ii) that max 1iN s max (Q ii ) c 1 . Further, we check the variance of Z NT , Var(Z NT )E((jjZ NT jj 2 F ) = 1 NT E å t X t WU t 0 å t X t WU t = 1 NT T å t=1 tr E(X 0 t X t )W =O(1): 84 Combine two equations above, apply Markov’s inequality, we have: jjB NT jj F =O p ( p N): Lemma A.6. Suppose Assumptions 1, 2 and 3 hold. If T > cD 2 N logN for c> 0, then b B NT B NT F = O P (minf p S N ;D N g q N logN T ). Proof. Similar as the proof of Lemma A.6, define the new variable W NT := 1 p N b B NT B NT F . Consider the expectation of W NT , we obtain, for some large enough constant c> 0, E(W NT )= 1 p NT E " å t X t D W U t 0 å t X t D W U t 1=2 # 1 p NT E å t X t D W U t 0 å t X t D W U t 1=2 = 1 p NT T å t=1 tr SD W E(X 0 t X t )D W ! 1=2 1 p NT cT N D W E(X 0 t X t )D W op 1=2 =O p (minf p s;D N g r logN T ): In the second inequality, we use the matrix inequality that tr SD W E(X 0 t X t )D W trhSi D W E(X 0 t X t )D W op ; where trhSi :=å N i=1 s i (S). The third inequality is derived from Assumption 2 thatjjE(X 0 t X t )jj op < ¥. 
85 Next we check the variance of W NT , Var(W NT )E(W 2 NT ) = 1 NT E å t X t D W U t 0 å t X t D W U t =O p D 2 N logN T =O p (1); the last equality is from the minimum requirement of T > cD 2 N logN. Then we apply Markov inequality to have the final result: b B NT B NT F O P (minf p S N ;D N g r N logN T ): Proof of Proposition 1: There exists at least one KN vector b and b 0 b= 1 such that: p T b b Fglasso b b GLS ¥ = b A 1 NT b B NT A 1 NT B NT ¥ b 0 b A 1 NT b B NT A 1 NT B NT = b 0 ( b A 1 NT A 1 NT )B NT + b 0 b A 1 NT ( b B NT B NT ) = b 0 b A 1 NT (A NT b A NT )A 1 NT B NT + b 0 b A 1 NT ( b B NT B NT ) b 0 b A 1 NT (A NT b A NT )A 2 NT (A NT b A NT ) b A 1 NT b 1=2 jjB NT jj F + b 0 b A 2 NT b 1=2 b B NT B NT F O p (minf p s;D N g q N logN T ) c+O p (minf p S N ;D N g q logN T ) + minf p S N ;D N gO p ( q N logN T ) c+O p (minf p S N ;D N g q logN T ) =O p (minf p S N ;D N g r N logN T ): (A.9) 86 In the second inequality, we use Cauthy inequality; in the third inequality, we use the results from Lemma A.2, A.4, A.5 and A.5; And the last inequality if from the minimum requirement for T > cD 2 N logN. G.3 Proof of Proposition 2 Recall A NT := 1 T å t X t WX 0 t and A :=E(X t WX 0 t ). Before presenting the proof for Proposition 2, we first show the result of the following lemma. Lemma A.7. If Assumption 4(i) holds,jjA NT Ajj op = o p ( 1 p N ). Proof. To avoid complexity in notation, here we assume the number of regressors for each unit is the same, i.e. K i = K for all i. It could be easily extended to the heterogeneous case with more complicated notations. Since X t is block diagonal, we can write A NT as: 87 A NT = 1 T 2 6 6 6 6 6 6 6 4 s 11 X 0 1 X 1 s 12 X 0 1 X 2 ::: s 1N X 0 1 X N s 21 X 0 2 X 1 s 22 X 0 2 X 2 ::: s 2N X 0 2 X N . . . . . . . . . . . . s N1 X 0 N X 1 s N2 X 0 N X 2 ::: s NN X 0 N X N 3 7 7 7 7 7 7 7 5 =(W 1 K ) 1 T 2 6 6 6 6 6 6 6 4 X 0 1 X 1 X 0 1 X 2 ::: X 0 1 X N X 0 2 X 1 X 0 2 X 2 ::: X 0 2 X N . . . . . . . . . . . . X 0 N X 1 X 0 N X 2 ::: X 0 N X N 3 7 7 7 7 7 7 7 5 =(W 1 K ) b Q =(W 1 K ) b Q S ; where 1 K stands for the K by K matrix with 1 0 s, is the Hadamard product. In the last equality, sinceW is sparse with many 0 0 s, the product remains the same value if we replace b Q with b Q S . Similarly, A= tr((W 1 K )Q S ). Follow the Assumption 4(i) and Lemma A.8(viii), (viiii), we obtain: jjA NT Ajj op = (W 1 K ) b Q S Q S op s max (W)s max (1 K ) b Q S Q S ¥ = o p ( 1 p N ): Next we prove the Proposition 2. 88 Proof. Under Assumption 4(i): b 0 p T( b b GLS b)= b 0 A 1 NT 1 p T T å t=1 X t WU t = 1 p T T å t=1 b 0 A 1 X t WU t + b 0 (A 1 NT A 1 ) 1 p T T å t=1 X t WU t : (A.10) For the first term in (A.10), denote scalar Z t := b 0 A 1 X t WU t , then from Assumption 1, we know fZ t g is martingale difference process. Further, sinceE(Z 2 t )= b 0 A 1 b andE(Z 4 t )<¥, apply CLT in martingale difference process, we have: 1 p T T å t=1 b 0 A 1 X t WU t ) N(0;b 0 A 1 b): For the second term in (A.10), apply Cauthy inequality, b 0 (A 1 NT A 1 ) 1 p T T å t=1 X t WU t b 0 (A 1 NT A 1 ) F 1 p T T å t=1 X t WU t F = b 0 (A 1 NT A 1 )((A 1 NT A 1 )b 1=2 1 p T å t X t WU t 0 å t X t WU t 1=2 = b 0 (A 1 NT A 1 )((A 1 NT A 1 )b 1=2 jjB NT jj F : (A.11) Also, we have b 0 (A 1 NT A 1 )((A 1 NT A 1 )b 1=2 jtr(bb 0 )j 1=2 A 1 NT A 1 op A 1 NT op jjA NT Ajj op A 1 op = o p ( 1 p N ); in the last inequality, we apply the results from Lemma A.7. 
89 Continue (A.11), we obtain: b 0 (A 1 NT A 1 ) 1 p T T å t=1 X t WU t = o p (1): Thus, b 0 p T( b b GLS b)= 1 p T T å t=1 b 0 A 1 X t WU t + o p (1) ) N(0;b 0 A 1 b); where A=E(X t WX 0 t ). Under Assumption 4(ii): For any bounded vector b2R NK1 such that b 0 b= 1, denote scalar e x t as: e x t := b 0 1 T T å t=1 X t WX 0 t 1 X t WU t ; whereE(e x t )= 0 and V(e x t )= b 0 1 T å T t=1 X t WX 0 t 1 b 1 s min (A NT ) <¥. Then, b 0 p T( b b GLS b)= b 0 1 T T å t=1 X t WX 0 t 1 ( 1 p T T å t=1 X t WU t )= 1 p T T å t=1 e x t (A.12) Apply the CLT, we have: b 0 p T( b b GLS b)) N 0;b 0 E(X 0 t WX t ) 1 b : (A.13) H Proof for Chapter 2 Before we show the proof for the theorems and propositions, we list several useful facts about matrix norm. The details can be seen in Chapter 9 at Bernstein (2005). Lemma A.8. For matrix A;B2R mn , C2R nn , we know: (i)jjA+ Bjj op jjAjj op +jjBjj op ; 90 (ii)jjA+ Bjj jjAjj +jjBjj ; If AB 0 = 0 and A 0 B= 0, thenjjA+ Bjj =jjAjj +jjBjj ; (iii)jjAjj F <jjAjj p Rank(A)jjAjj F ; (iv) s min (C)jjAjj F jjCAjj F s max (C)jjAjj F ; (v) s min (C)jjAjj jjCAjj s max (C)jjAjj : H.1 Proof of Lemma 2 (i) From assumption 5(i), we know(I N QL) is invertable. Write z t =(I N QL) 1 Lf t +(I N QL) 1 u t = x t + e t ; where x t :=(I N QL) 1 Lf t , and e t =(I N QL) 1 u t . From assumption 5(iii), we knowfx t g andfe t g is independent. Therefore, to prove station- arity of z it = x it + e it , it sufficies to study x it and e it separately. We first showfe it g is covariance stationary. E(e it )= l 0 i ¥ å q=0 Q q E(u tq )= 0: (A.1) And, for 1 t 1 ;t 2 T , var(e t )= ¥ å q=0 Q q S u (Q q ) 0 ; (A.2) cov(e t 1 e 0 t 2 )=Q jt 1 t 2 j var(e t ); (A.3) var(e it )= l 0 i var(e t )l i ; (A.4) cov(e it 1 e it 2 )= l 0 i Q jt 1 t 2 j var(e t )l i : (A.5) 91 By assumption 5(i) that Q j op ¯ cr j for constant ¯ c> 0 andr < 1, we knowjjvar(e t )jj op is bounded, thus var(e it ) and cov(e it 1 e it 2 ) are bounded. Therefore, e it is covariance stationary. Next, we showfx it g is covariance stationary. Note that z f i;t = l 0 i ¥ å i=0 Q i Lf ti = ¥ å i=0 g i f ti =g(L)f t =g(L)b e (L)e f t ; (A.6) whereg i := l 0 i Q i L,g(L)=å ¥ i=0 g i L i ,b e (L)=å ¥ i b e i L i . For simplicity, we assume R = 1, R > 1 can be easily extended with more complicated notations. jg i j=jtr(l 0 i Q i L)j=jtr(Ll 0 i Q i )j maxjl i j Q i ¯ cr i : Thusjg(1)j ¯ c. Since e t IID(0;I R ),jb e (1)j< ¯ c by assumption 5 (iii), x it is covariance stationary. (ii) DenoteS z 0 ,S x 0 andS e 0 as the covariance matrix of z t , x t and e t , respectively. Then we have: S z 0 =S x 0 +S e 0 : SinceS x 0 0, we have s min (S z 0 ) s min (S e 0 ). H.2 Proof of Lemma 3 Proof. (i) The first conditiona 0 a=N= I N is derived from the definition. (ii) Denotex(L)= a 0 N (IQL) 1 L, then g t =x(L)f t . Sincejjajj op O( p N),jjLjj op O( p N), we know there exists constant ¯ c> 0 such that jjx(1)jj jja 0 jj op jjLjj op N ¥ å p=0 jjQ p jj F ¯ c: Combine with assumption 5(iii), we havefg t g is stationary process. 92 (iii) The third condition can be derived straightly from (A.2) and (A.3). H.3 Proof of Proposition 3 The proof of Proposition 3 is similar as the proof in chernozukov paper. As stated in section 3.2, Q( e Q k+1 ; e G k+1 ) Q( e Q k+1 ; e G k ) is derived directly from the definition of e G k+1 . The proof of the second inequality depends on the following lemma. Lemma A.9. 
Let M(Q; e Q k ; e G k ) := 1 2NT Q e Q k 2 2 t NT tr ( e Q k e Q k+1 ) T A k , then e Q k+1 in algo- rithm 2 satisfies: e Q k+1 = argmin Q M(Q; e Q k ; e G k )+tf Q e Q ; (A.7) where A k :=( e Q k Z 1 + e G k Z)Z T 1 . Proof. Recall in (??) that: e Q k+1 = S NTtf Q e Q k t( e Q k Z 1 + e G k Z)Z T 1 = S NTtf Q e Q k tA k ; then, we can write e Q k+1 as: e Q k+1 = argmin Q 1 2NT e Q k tA k Q 2 2 +tf Q jjQjj = argmin Q 1 2NT e Q k Q 2 2 t NT tr ( e Q k Q) T A k +tf Q jjQjj = M(Q; e Q k ; e G k )+tf Q jjQjj : Apply the results above, we show the proof of Proposition 3 below. 93 Proof. Q( e Q k+1 ; e G k )= 1 2NT Z e Q k+1 Z 1 e G k 2 2 +f Q e Q k+1 +f G e G k = 1 2NT Z e Q k Z 1 e G k 2 2 +f Q e Q k+1 +f G e G k + 1 2NT ( e Q k+1 e Q k )Z 1 2 2 1 NT tr ( e Q k e Q k+1 ) T A k 1 2NT Z e Q k Z 1 e G k 2 2 +f Q e Q k+1 +f G e G k + 1 2NTt e Q k+1 e Q k 2 2 1 NT tr ( e Q k e Q k+1 ) T A k = 1 2NT Z e Q k Z 1 e G k 2 2 +f G e G k + 1 t M( e Q k+1 ; e Q k ; e G k )+tf Q e Q k+1 1 2NT Z e Q k Z 1 e G k 2 2 +f G e G k + 1 t M( e Q k ; e Q k ; e G k )+tf Q e Q k = 1 2NT Z e Q k Z 1 e G k 2 2 +f Q e Q k +f G e G k = Q( e Q k ; e G k ): (A.8) The first inequality is derived from the choice of t 1= Z 1 Z T 1 op . Apply the results of Lemma A.9, we obtain the second inequality. H.4 Proof of Theorem 3 We first show the results of following lemmas. Recall we let D Q = b QQ, D G = b GG, where (Q;G) denote the true value from (2.2), and ( b Q; b G) is the estimated parameter from (2.12). Write the SVD ofQ beQ = U Q DV T Q , and let U r Q , V r Q denote the first r column of U and V , respectively. Define D Q 1 , D Q 2 , D G 1 and D G 2 in (2.13) as: D Q 2 = M U r Q D Q M T V r Q ,D Q 1 =D Q D Q 2 , and similarly,D G 2 = M U R G D G M T V R G ,D G 1 =D G D G 2 . Further, to simplify the notation, let P(Q) = P U r Q QP T V r Q , M(Q) = M U r Q QM T V r Q , and similarly, P(G)= P U R G GP T V R G , M(G)= M U R G GM T V R G . Lemma A.10. For integer 1 r N and 1 R minfN;Tg, we have 94 1.k b Qk jjQjj D Q 2 D Q 1 2å N i=r+1 s i (Q) 2.k b Gk jjGjj D G 2 D G 1 2å minfN;Tg i=R+1 s i (G) Proof. It can be easily seen that Q= P(Q)+ M(Q). Moreover, by Lemma A.8 (ii), we know P(Q)+D Q 2 =jjP(Q)jj + D Q 2 . Thus, we obtain: k b Qk jjQjj = P(Q)+D Q 2 + M(Q)+D Q 1 jjQjj P(Q)+D Q 2 M(Q)+D Q 1 jjQjj jjP(Q)jj + D Q 2 jjM(Q)jj D Q 1 (jjP(Q)jj +jjM(Q)jj ) D Q 2 D Q 1 2jjM(Q)jj = D Q 2 D Q 1 2 N å i=r+1 s i (Q): Similarly, we have: k b Gk N jjGjj D G 2 D G 1 minfN;Tg å i=R+1 s i (G): Next we proceed to the proof for Theorem 3. 95 Proof. Since( e Q; e G) is the optimal solution of Q(Q;G) in (2.12), we have: 0 Q( e Q; e G) Q(Q ;G ) 1 2NT Z e QZ 1 e G 2 2 jjZQ Z 1 G jj 2 2 + f Q p N k e Q kQ k N + f G p NT k e Gk N jjG jj 1 2NT D Q Z 1 +D G 2 2 1 NT hU;D Q Z 1 +D G i+ f Q p N k e Q kQ k N + f G p NT k e Gk N jjG jj 1 2NT D Q Z 1 +D G 2 2 + f Q p N D Q 2 D Q 1 D Q 2 ! + f G p NT D G 2 D G 1 D G 2 ! 1 2NT D Q Z 1 +D G 2 2 3 2 f Q p N D Q 1 + f G p NT D G 1 + 1 2 f Q p N D Q 2 + f G p NT D G 2 (A.9) The third inequality is derived from the choice of penalty parameter in Theorem 3, and we also apply the results from Lemma A.10. In the last inequality, we apply the norm inequality from Lemma A.8 (i) that D Q D Q 1 + D Q 2 . From (A.9), we obtain: f Q p N D Q 2 + f G p NT D G 2 3 f Q p N D Q 1 + f G p NT D G 1 : Therefore,(D Q ;D G )2C(r;R; f G f Q ), whereC is defined in assumption 6. 
Continue (A.9), we obtain: 1 2NT D Q Z 1 +D G 2 2 3 2 f Q p N D Q 1 + f G p NT D G 1 3 2 f Q p N p 2r D Q 1 2 + f G p NT p 2R D G 1 2 ; (A.10) where the last inequality is derived from the fact that D Q 1 p 2r D Q 1 2 , D G 1 p 2R D G 1 2 , see proof in Recht et al. (2010). 96 Combine RSC condition in (2.14), rearrange (A.10), we obtain: k D Q p N 2 2 + D G p NT 2 2 ! f Q p r D Q p N 2 +f G p R D G p NT 2 ; (A.11) where k = 1 3 p 2 minfk Q ;k G g> 0 is the small universal constant. The result in Theorem 3 follows directly from solving equation above. H.5 Proof of Proposition 4 The first two conditions can be derived directly from Theorem 3 and Lemma ??. Here we show the proof for the last two conditions. Proof. Step 1: Bound on eigenvalues Before the proof for rank consistency and factor loading consistency, notice that from the sec- ond condition, we know e GG F =O p (N+ p T): (A.12) Let c 2 i (1 i R) denote the first R nonzero eigenvalues ofLS f 0 L T =N; s 2 i (1 i N) denote the eigenvalues ofGG T =N; and ˆ s 2 i (1 i N) denote the eigenvalues of e G e G 0 =N. From assumption, c 2 i (1 i R) are positive constants and s 2 i = 0 for R< i N. Further, since p T FF 0 =TS f 0 =O p (1), from Weyl’s theorem, we obtain: max i js 2 i c 2 i j=O p ( 1 p T ): (A.13) 97 Still by Weyl’s theorem, for any 1 i N, j ˆ s 2 i s 2 i j 1 NT e G e G 0 GG 0 op 1 NT e GG 2 op + 2 NT jjGjj op e GG op 1 NT O p (N 2 + T)+ c 1 +O p ( 1 p T O p (N+ p T) =O p ( r N T + 1 p N ); (A.14) where the third inequality is derived from (A.12) and (A.13). Then for all i R, with probability approaching one, we have js 2 i1 ˆ s 2 i jjs 2 i1 s 2 i jjs 2 i ˆ s 2 i j c 2 i1 c 2 i 2 ; (A.15) j ˆ s 2 i s 2 i+1 jjs 2 i s 2 i+1 jjs 2 i ˆ s 2 i j c 2 i c 2 i+1 2 ; (A.16) with c 2 i+1 = 0 for i R since rank ofG is R. Step 2: Rank consistency With the preparations above, we proceed to prove the rank consistency ˆ R! p R first. Notice that s i ( e G)= p NT ˆ s i . Moreover, ˆ s 2 i c 2 i o p (1) c 2 i 2 ; i R; ˆ s i 2 =O p ( r N T + 1 p N ); R< i N: This implies that ˆ s i ( e G e G T ) NT c 2 R 2 ; i R; ˆ s i ( e G e G T )= NTO p ( r N T + 1 p N ); R< i N: 98 Choose threshold(f Q +f G )s 1 ( e G e G T )= NTO p q N T + 1 p N + 1 p T 1=2 ! , with probability ap- proaching to 1, we have: max i>R s i ( e G e G 0 )(f Q +f G )s 1 ( e G e G 0 ) min iR s i ( e G e G 0 ): (A.17) This proves the consistency of ˆ R. Step 3: Convergence of Factor Loading We first show that there exist R R rotation matrix H, which independent of N or T , such that 1 p N LH are the first R eigenvectors ofLS f 0 L T . Let L=S 1=2 L S f 0 S 1=2 L , and let R be the eigenvectors of L. Then D= R T LR is the diagonal matrix with eigenvalues of L on its diagonal. Notice that D is also the eigenvalue of 1 N LS f 0 L T . Let H =S 1=2 L R, then 1 N LS f 0 L 0 ( 1 p N LH)= 1 p N LS f 0 S 1=2 L S 1=2 L H = 1 p N LS 1=2 L S 1=2 L S f 0 S 1=2 L R = 1 p N LS 1=2 L LR = 1 p N LS 1=2 L RD =( 1 p N LH)D: (A.18) The fourth equality is derived from the definition of eigenvalues and eigenvector. Moreover, since 1 N (LH)(H 0 L)= I N , we prove that there exist rotation matrix H=S 1=2 L R such that 1 p N LH are the eigenvectors ofLS f 0 L 0 . 99 Since we have shown ˆ R converge to R with probability approaching to 1, here for simplicity, we assume R is known. 
Then by Davis-Kahan sin-theta inequality, we obtain: 1 p N ˆ L 1 p N LH 2 F 1 N LS f 0 L 0 1 T e G e G 0 op min iR fjs 2 i1 ˆ s 2 i j;j ˆ s 2 i s 2 i+1 jg O p (1) 1 N LS f 0 L 0 1 T e G e G 0 op O p (1) 1 N L(FF 0 =TS f 0 )L 0 op + 1 NT GG 0 e G e G 0 op =O p ( N T + 1 N ): (A.19) H.6 Proof of Lemma 4 This proof is very similar as proof of theorem 3. For completeness, we briefly show the proof below. Proof. DenoteDE = b E E,DE 1 = b E 1 E 1 ,D Q = b QQ. Since E =QE 1 +U; we have: b E =Q b E 1 +U+DEQDE 1 : Then, 0 1 2NT b E b Q b E 1 2 2 b EQ b E 1 2 2 +l Q b Q jjQjj 1 2NT D Q b E 1 2 2 2trhD Q b E 1 ; b EQ b E 1 i +l Q D Q 2 D Q 1 1 2NT D Q b E 1 2 2 l Q jjD Q jj +l Q D Q 2 D Q 1 1 2NT D Q b E 1 2 2 + 1 2 l Q D Q 2 3 D Q 1 ; (A.20) 100 where the third inequality is derived from the choice ofl Q 2 NT b E 1 ( b EQ b E 1 ) T op . From (A.20), we knowD Q 2C(r) defined in 8. Therefore, apply RSC condition (2.18), con- tinue (A.20), we obtain 1 N jjD Q jj 2 k p rl Q ; (A.21) wherek > 0 is constant. H.7 Proof of Proposition 5 Lemma A.11. If assumption 5, as N;T!¥, we have (i) b G G 2 F =O p T minfN;Tg ; (ii) E( b G G) F =O p T p N minfN;Tg ; (iii)jjDEjj F =O p p N+ p T ; (iv)jjEDE 0 jj F =O p (N+ T). Proof. (i)(ii) The results for i and ii are directly from Bai and Ng (2002) Theorem 1 and Bai (2003) Lemma B.1. (iii) Denote V NT as the R X by R X diagonal matrix with singular values of XX 0 =NT on the diago- nal, andjjV NT jj=O p (1). It is useful to introduce the decomposition ofDE below: DE = b E E =(Xb a b G 0 )(XaG 0 )=aG 0 b a b G 0 =a(G 0 b G 0 )+(a ˆ a) b G T =a(G 0 b G 0 )+a( b G 0 G 0 ) b GV 1 NT b G 0 T : 101 Therefore, jjDEjj F jjajj F G 0 b G 0 F +jjajj F G 0 b G 0 F b GV 1 NT b G 0 T F =O p ( p N)O p p T minf p N; p Tg ! =O p ( p N+ p T); where the first equality is derived from the fact that b GV 1 NT b G T =T = e G e G T =T =O(1). (iv) EDE 0 F E( b G G)a 0 F + E b GV 1 NT b G 0 T ( b G G)a 0 F O(1) E( b G G) F jjajj F =O p (N+ T); In the last equality, we apply results from(ii). Lemma A.12. If assumption 5, 7 hold, as N;T!¥, we have: b E 1 ( b EQ b E 1 ) T op =O p (N+ T): Proof. b E 1 ( b EQ b E 1 ) T op = (E 1 +DE 1 )(U+DEQDE 1 ) T op E 1 U T op + E 1 DE T op + E 1 DE T 1 Q T op + DE 1 U T op + DE 1 DE T op + DE 1 DE T 1 Q T op O p (N+ T): 102 In the last inequality, we apply results from lemma A.11. Finally, using results above, we are able to prove the proposition. Proof. (i) From lemma 4 and lemma A.12, we know there exists constant ¯ c> 0 such that: 1 N b QQ F ¯ c NT b E 1 ( b EQ b E 1 ) T op ¯ c NT b E 1 ( b EQ b E 1 ) T F =O p 1 N + 1 T H.8 Technical Lemmas H.8.1 Complementary Lemma This part lists several lemmas that used in proofs in Appendix A. Lemma A.13. Let x t be a martingale difference process with respect toF x t1 =s(x t1 ;x t2 ;:::); and suppose that there exists some finite positive constants c 1 ;c 2 , and s> 0 such that sup t P(jx t j>a) c 1 exp(c 2 a s ); f oralla > 0: Let alsos 2 xt =E(x 2 t jF x t1 ),s 2 x = T 1 å T t=1 s 2 xt , andd T = (T l ). If 0<l s+1 s+2 , then we have, P(j T å t=1 x t j>d T ) exp ¯ cd 2 T =(Ts 2 x ) : Ifl > s+1 s+2 , then we have, P(j T å t=1 x t j>d T ) exp ¯ cd s=(s+1) T : 103 This lemma is from lemma A3 of Chudik et al. (2018) online theory supplement. Therefore, the proof is omitted here. Next we introduce the definition of sub-Gaussian and sub-exponential distribution. For constantd > 0, a random variable X is sub-Gaussian if the tails of X satisfy: P(jXjd) 2exp(d 2 =k); 8d 0; wherek is some constant. 
A random variable X is sub-exponential if the tails of X satisfy: P(jXjd) 2exp(d=k); 8d 0: The definitions of sub-Gaussian and sub-exponential have been well developed, see for exam- ple, Vershynin (2017) Chapter 2. Lemma A.14. (i) A random variable X is sub-Gaussian if and only if X 2 is sub exponential. (ii) Let X 1 ;X 2 ;:::;X T be independent, mean zero, sub-exponential random variables. Then for everyd 0, we have: P 1 T T å t=1 X t d ! 2exp k min d 2 ;d T : These results are directly from Vershynin (2017) Lemma 2.7.6 and Corollary 2.8.3. Therefore, the proof are omitted. Lemma A.15. Suppose assumption 5 holds, ande u t ;e f t from assumption 5 are Gaussian, as N;T! ¥, we have: (i) The following terms areO p (T+ N):jjUU 0 jj op ,jjUE 0 jj op ,jjEE 0 jj op ; 104 (ii) The following terms areO p ( p NT):jjU 1 U 0 jj op ,jjU 1 E 0 jj op ,jjE 1 E 0 jj op ; There exists large enough constant ¯ c> 0, such that (iv) P å T t=1 jjf t jj 2 F > ¯ cT+d T exp( ¯ cd 2 T =T); (iv) P å T t=1 jjg t jj 2 F > ¯ cT+d T exp( ¯ cd 2 T =T). Proof. (i) In assumption 5(ii), we impose the condition that U= p S u E u , whereE u = e u 1 ;e u 2 ;:::;e u T 0 , e u t IID(0;I N ). To prove the bound ofjjUU 0 jj op , we first prove the bound of E u E u 0 op . Let v2R N from 1=4-net of the sphere S N1 :N . From Corollary 4.2.13 in Vershynin (2017), we have: jNj< 9 N ; E u E u 0 op 2max v2N jv 0 E u E u 0 vj= 2max v2N v 0 E u 2 2 : Since v T E u 2 2 = T å t=1 hv;e u t i 2 =: T å t=1 w t 2 ; wherew t =hv;e u t i=å N i=1 v i e u it . Sincee u t IIDN(0;I N ), we knowfw t 2 g are independent over t, andE(w 2 t )=å N i v 2 i = 1. Apply lemma A.14(ii), P 1 T T å t=1 w 2 t 1 d ! 2exp k min d 2 ;d T : (A.22) Thus, P 1 T T å t=1 w 2 t Td+ T ! 2exp k min d 2 ;d T : Therefore, P E u E u 0 op 2Td+ 2T P max v2N v 0 E u 2 2 Td+ T 9 2N 2exp k min d 2 ;d T ; 105 in second inequality, we apply the union bound. letd = p N p T + N T , continue, P E u E u 0 op 2N+ 2 p NT+ 2T 2exp(kN+ 4N): Sincek > 0 is large enough, we have: E u E u 0 op =O p (N+ T+ p NT): Since U = p S u E u , andjjS u jj op ¯ c, therefore, UU 0 op jjS u jj op E u E u 0 op =O p (N+ T+ p NT): Moreover, since E =(1QL) 1 U, and Q j op ¯ cr j for r < 1, and j = 0;1;2;:::, we know EE 0 op ¥ å j=0 Q j op ! 2 UU 0 op ¯ c r 1r 2 UU 0 op =O p (N+ T+ p NT): (ii) Similar as proof of (i), we focus onE u 1 E u 0 first. Let v;w2R N from 1=4 net of sphere S N1 :N . Denote x t = v 0 e u t , y t = w 0 e u t , we have: E u 1 E u 0 op 2 max v;w2N T å t=1 x t1 y t : Apply lemma A.13, letd T = (T l ), with 0<l 2 3 , we have: P T å t=1 x t1 y t >d T ! exp( ¯ cd 2 T =T); where ¯ c> 0 is large enough constant. 106 Thus, there existd T = p NT , we have: P E u 1 E u 0 op 2 p NT P max v;w2N T å t=1 x t1 y t p NT ! exp( ¯ cN+ 4N): Therefore, follow the similar argument in(i), we have the desired results. (iii) Since f t =b(L)e f t ,jjb(1)jj ¯ c, follow directly from assumption ??(ii), we have: sup t P jf 2 i;t E(f 2 i;t )j>d ¯ cexp( ¯ cd); 1 i R: Apply lemma A.13, there exist large enough constant ¯ c> 0 such that 6 . P T å t=1 jjf t jj 2 F > T ¯ c+d T ! exp ¯ cd 2 T =T : (iv Recall g t =x(L)f t , wherex(L)= a 0 N (I N QL) 1 L. Sincejjx(1)jjjjajjjjLjj (I N QL) 1 ¯ c, the results follow directly from(iii). Lemma A.16. Assume assumption 5 holds, ande u t ;e f t are subGaussian distributed. DenoteS g = E(g t g 0 t ),S e =E(e t e 0 t ) andS z =E(z t z 0 t ). 
As N;T!¥, we have (i) 1 T å T t=1 g t g 0 t S g op =O p ( p logN p T ); (ii) 1 T å T t=1 e t e 0 t S e op =O p ( p N p T ); (iii) 1 T å T t=1 z t z 0 t S z op =O p ( N p logN p T ); 6 Notice that in lemma A.13, the tail when 0<l s+1 s+2 dominates the tail whenl > s+1 s+2 107 Proof. (i) Similar as proof of lemma A.15, let v2R R X from 1=4 net of sphere S R X 1 :N . Then we havejNj< 9 R X , and 1 T T å t=1 g t g 0 t S g op 2 T max v2N T å t=1 (v 0 g t ) 2 v 0 S g v Apply lemma A.13, we have: P 1 T T å t=1 (v 0 g t ) 2 v 0 S g v d T ! exp( ¯ cd 2 T T); where ¯ c> 0 is large enough constant. Take the union bound of v2N , choosingd T = q logN T , we conclude following: P 0 @ 1 T T å t=1 g t g 0 t S g op 2d T 1 A P 1 T max v2N T å t=1 (v 0 g t ) 2 v 0 S g v d T ! 9 2R X exp( ¯ cd 2 T T) exp( ¯ clogN+ 8R X ): (ii) SinceE=(I N QL) 1 p S u E u . Thus, there exist 1 T EE 0 S e op ¥ å j=0 Q j op ! 2 jjS u jj op 1 T E u E u 0 I N op ¯ c 1 T E u E u 0 I N op : From(A.22), letw t := v T e u t , we know P 1 T E u E u 0 I N op 2d ! P max v2N 1 T T å t=1 w 2 t 1 >d ! 9 2N 2exp k min(d 2 ;d)T 2exp k min(d 2 ;d)T+ 4N : 108 Choosed = ¯ c p N= p T with large enough ¯ c> 0, then we have: 1 T EE 0 S e op ¯ c 1 T E u E u 0 I N op =O p ( p N p T ): (iii) Recall z t has the decomposition that z t =ag t + e t ; where e t and g t are independent in all lead and lags. Therefore, 1 T å t z t z 0 t S z =a 1 T å t g t g 0 t S g a 0 + 1 T å t e t e 0 t S e + 1 T a å t g t e 0 t : From lemma A.13, it is easy to see that: P 1 T å t g t e 0 t > p logN p T exp( ¯ clogN): Therefore, 1 T å t z t z 0 t S z op jjajj 2 op 1 T å t g t g 0 t S g op +jjajj op 1 T å t e t e 0 t S e op +jjajj op 1 T å t g t e 0 t op =O p ( N p logN p T ): 109 H.8.2 Sufficient condition for RSC condition In this subsection, we show that under Gaussian assumption ofe u t ;e f t in assumption 5, the relaxed RSC condition (2.14) holds with probability goes to 1 as N;T!¥. D Q Z 1 +D G 2 F = D Q Z 1 2 F + D G 2 F + 2hD Q Z 1 ;D G i s min (Z 1 ) 2 D Q 2 F + D G 2 F 2jhD Q Z 1 ;D G ij s min (Z 1 ) 2 D Q 2 F + D G 2 F 2s max (Z 1 ) D Q op D G s min (Z 1 ) 2 D Q 2 F + D G 2 F 2 ¯ cs max (Z 1 ) D G s min (Z 1 ) 2 D Q 2 F + D G 2 F ¯ cNTf G D G : In the second last inequality, to ensure stationary,jjQjj op is upper bounded by some constant ¯ c> 0, therefore, D Q op 2 ¯ c. The last inequality is derived from the choice of f G 2 NT jjZ 1 jj op . In lemma 2, we show that s min (S Z ) c> 0. From lemma A.16(iii), we show that if T grows slightly faster than N 2 , s min (Z 1 ) 2 = T s min (S Z )+ o p (1). Therefore, there exists large enough constantk > 0 such that: D Q Z 1 +D G 2 F Tk D Q 2 F + D G 2 F kNTf G D G F holds with high probability. H.8.3 Sufficient condition for assumption 7 For the proof, we first introduce lemma A.17 from Negahban and Wainwright (2011). Lemma A.17. Given a Gaussian random vector Y2R T N(0;Q), for alld > 2= p T , we have P 1 T jjYjj 2 2 tr(Q) > 4djjQjj op 2exp T(d 2 p T ) 2 2 ! + 2exp(T=2): (A.23) 110 Lemma A.18. Assume conditions in lemma ?? hold, we havejjZ 1 Ujj op O p (N p T). Proof. In this proof, we first approximate Z 1 U T byhy T Z 1 ;x T Ui for vectors y;x on the unit sphere in the 1=4-net. Then we decompose the product by three separate parts, and establish a tight bound on them, respectively. Finally we combine them and take a union bound over all x and y in the net. Step 1: Approximation. Let x;y2 R N from 1=4-net of the sphere S N1 N . 
Lemma A.18. Assume the conditions of Lemma A.15 hold. Then $\|Z_{-1}U'\|_{op} = O_p(N\sqrt{T})$.

Proof. In this proof, we first approximate $\|Z_{-1}U'\|_{op}/T$ by $\langle y'\tilde{Z}_{-1}, x'\tilde{U}\rangle$ for vectors $x, y$ on the unit sphere taken from a $1/4$-net. We then decompose the product into three separate parts and establish a tight bound on each. Finally, we combine the bounds and take a union bound over all $x$ and $y$ in the net.

Step 1: Approximation. Let $x, y \in \mathbb{R}^N$ range over a $1/4$-net $\mathcal{N}$ of the sphere $S^{N-1}$. From Corollary 4.2.13 in Vershynin (2017), the cardinality of $\mathcal{N}$ satisfies
$$|\mathcal{N}| \leq 9^N. \tag{A.24}$$
Denote $\tilde{Z} = N^{-1/4}Z$ and $\tilde{U} = N^{1/4}U$, and let $\Sigma_{\tilde{z},0}$ and $\Sigma_{\tilde{u},0}$ denote the covariance matrices of $\tilde{z}_t$ and $\tilde{u}_t$, respectively. The operator norm of $Z_{-1}U'/T = \tilde{Z}_{-1}\tilde{U}'/T$ can then be bounded using these nets as follows:
$$\left\|\frac{\tilde{Z}_{-1}\tilde{U}'}{T}\right\|_{op} \leq \frac{2}{T}\max_{x,y\in\mathcal{N}}\left|\langle y'\tilde{Z}_{-1}, x'\tilde{U}\rangle\right|. \tag{A.25}$$

Step 2: Concentration. Fix $x, y \in \mathcal{N}$. We decompose the quadratic form into three parts:
$$2\langle y'\tilde{Z}_{-1}, x'\tilde{U}\rangle = T_1 - T_2 - T_3, \tag{A.26}$$
where
$$T_1 = \sum_{t=1}^{T}\left[(y'\tilde{z}_{t-1} + x'\tilde{u}_t)^2 - y'\Sigma_{\tilde{z},0}y - x'\Sigma_{\tilde{u},0}x\right], \tag{A.27}$$
$$T_2 = \sum_{t=1}^{T}\left[(y'\tilde{z}_{t-1})^2 - y'\Sigma_{\tilde{z},0}y\right], \tag{A.28}$$
$$T_3 = \sum_{t=1}^{T}\left[(x'\tilde{u}_t)^2 - x'\Sigma_{\tilde{u},0}x\right]. \tag{A.29}$$
We now bound each of $T_1, T_2, T_3$ in turn; in doing so, we repeatedly use Lemma A.17.

Bound on $T_3$: Let $a_t := x'\tilde{u}_t$ and denote its covariance matrix by $R_a$. Since $u_t \sim N(0, \Sigma_u)$ and the series is serially uncorrelated, $R_a$ is diagonal with identical diagonal entries $x'\Sigma_{\tilde{u},0}x$. Thus
$$\operatorname{tr}(R_a)/T = \|R_a\|_{op} = x'\Sigma_{\tilde{u},0}x = \sqrt{N}\,x'\Sigma_u x \leq \sqrt{N}\,\|\Sigma_u\|_{op}.$$
Applying Lemma A.17, we obtain
$$P\left(\frac{1}{T}|T_3| > \bar{c}\,\delta\sqrt{N}\|\Sigma_u\|_{op}\right) \leq 2\exp\left(-\frac{T(\delta - 2/\sqrt{T})^2}{2}\right) + 2\exp(-T/2). \tag{A.30}$$

Bound on $T_2$: Let $b_t := y'\tilde{z}_{t-1}$ and denote its covariance matrix by $R_b$. Before bounding $T_2$, we need bounds on $\|R_b\|_{op}$ and $\operatorname{tr}(R_b)$. Since
$$\tilde{z}_t = N^{-1/4}\left(\sum_{j=0}^{\infty} Q^j\Lambda L^j\right)f_t = N^{-1/4}\left(\sum_{j=0}^{\infty} Q^j\Lambda L^j\right)\left(\sum_{i=0}^{\infty} b^e_i\epsilon_{t-i}\right), \tag{A.31}$$
denote $\alpha_j = N^{-1/4}Q^j\Lambda$, $\alpha(L) = \sum_{j=0}^{\infty}\alpha_j L^j$, and $b(L) = \sum_{i=0}^{\infty} b^e_i L^i$; then we can write $\tilde{z}_t$ as:
$$\tilde{z}_t = \alpha(L)b(L)\epsilon_t. \tag{A.32}$$
From assumption 5, we know $E(\epsilon_t\epsilon'_t) = I_R$ and $\sum_{i=0}^{\infty}\|b^e_i\|_{op} \leq \bar{c}$. Moreover,
$$\sum_{j=0}^{\infty}\|\alpha_j\|_{op} \leq \frac{\bar{c}\,N^{-1/4}\|\Lambda\|_{op}}{1-\rho} = \frac{\bar{c}\,N^{1/4}\sqrt{\|\Lambda\Lambda'/N\|_{op}}}{1-\rho}, \tag{A.33}$$
where the inequality uses assumption 5, namely $\|Q^p\|_{op} \leq \bar{c}\rho^p$, and the equality uses $\|\Lambda\|_{op} = \sqrt{N}\sqrt{\|\Lambda\Lambda'/N\|_{op}}$. Let $\Sigma_{\tilde{z},p} = E(\tilde{z}_t\tilde{z}'_{t-p})$; we have:
$$\|\Sigma_{\tilde{z},0}\|_{op} \leq \sum_{p=0}^{T-1}\|\Sigma_{\tilde{z},p}\|_{op} \leq \left(\sum_{j=0}^{\infty}\|\alpha_j\|_{op}\right)^2\left(\sum_{i=0}^{\infty}\|b^e_i\|_{op}\right)^2 \leq \bar{c}\left\|\frac{\Lambda\Lambda'}{\sqrt{N}}\right\|_{op}. \tag{A.34}$$
Therefore,
$$\|R_b\|_{op} \leq \sum_{p=0}^{T-1}\|\Sigma_{\tilde{z},p}\|_{op} \leq \bar{c}\left\|\frac{\Lambda\Lambda'}{\sqrt{N}}\right\|_{op}, \tag{A.35}$$
$$\frac{\operatorname{tr}(R_b)}{T} = y'\Sigma_{\tilde{z},0}y \leq \|\Sigma_{\tilde{z},0}\|_{op} \leq \bar{c}\left\|\frac{\Lambda\Lambda'}{\sqrt{N}}\right\|_{op}. \tag{A.36}$$
Applying Lemma A.17, we obtain
$$P\left(\frac{1}{T}|T_2| > \bar{c}\,\delta\left\|\frac{\Lambda\Lambda'}{\sqrt{N}}\right\|_{op}\right) \leq 2\exp\left(-\frac{T(\delta - 2/\sqrt{T})^2}{2}\right) + 2\exp(-T/2). \tag{A.37}$$

Bound on $T_1$: Define $c_t := y'\tilde{z}_{t-1} + x'\tilde{u}_t$. The covariance matrix $R_c$ of $(c_1, \ldots, c_T)$ has elements:
$$R_{c,ij} = \begin{cases} y'\Sigma_{\tilde{z},0}y + x'\Sigma_{\tilde{u},0}x, & i = j, \\ y'\Sigma_{\tilde{z},|i-j|}y + y'Q^{|i-j|-1}\Sigma_{\tilde{u},0}x, & i \neq j. \end{cases} \tag{A.38}$$
Therefore,
$$\|R_c\|_{op} \leq \max_i\sum_{j=1}^{T}|R_{c,ij}| \leq \bar{c}\left(\left\|\frac{\Lambda\Lambda'}{\sqrt{N}}\right\|_{op} + \sqrt{N}\|\Sigma_u\|_{op}\right). \tag{A.39}$$
Applying Lemma A.17, we obtain
$$P\left(\frac{1}{T}|T_1| > \bar{c}\,\delta\left(\left\|\frac{\Lambda\Lambda'}{\sqrt{N}}\right\|_{op} + \sqrt{N}\|\Sigma_u\|_{op}\right)\right) \leq 2\exp\left(-\frac{T(\delta - 2/\sqrt{T})^2}{2}\right) + 2\exp(-T/2). \tag{A.40}$$
Combining (A.40), (A.37), and (A.30) with the conditions $\|\Sigma_u\|_{op} \leq \bar{c}$ and $\|\Lambda\Lambda'/N\|_{op} \leq \bar{c}$, we have:
$$P\left(\frac{2}{T}\left|\langle y'\tilde{Z}_{-1}, x'\tilde{U}\rangle\right| \geq \bar{c}\sqrt{N}\delta\right) \leq 6\exp\left(-\frac{T(\delta - 2/\sqrt{T})^2}{2}\right) + 6\exp(-T/2). \tag{A.41}$$

Step 3: Union bound. In the final step, we take the union bound over all $x, y \in \mathcal{N}$:
$$P\left(\frac{2}{T}\max_{x,y\in\mathcal{N}}\left|\langle y'\tilde{Z}_{-1}, x'\tilde{U}\rangle\right| \geq \bar{c}\sqrt{N}\delta\right) \leq 9^{2N}\left(6\exp\left(-\frac{T(\delta - 2/\sqrt{T})^2}{2}\right) + 6\exp(-T/2)\right). \tag{A.42}$$
Choosing $\delta = \bar{c}\sqrt{N/T}$ for some large enough $\bar{c} > 0$, we have
$$P\left(\left\|\frac{Z_{-1}U'}{T}\right\|_{op} \geq \frac{\bar{c}N}{\sqrt{T}}\right) \leq \bar{c}\exp(-N) \tag{A.43}$$
as long as $T \geq \bar{c}N$.
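To see the $N\sqrt{T}$ scaling of Lemma A.18 at work in finite samples, the following Monte Carlo sketch simulates a deliberately simplified process: a diagonal stable transition matrix with a single strong Gaussian factor. This design, along with the burn-in, sample sizes, and replication count, is an illustrative assumption rather than the exact process of assumption 5. The normalized norm should stay bounded across the $(N, T)$ pairs.

```python
# Monte Carlo sketch of the rate in Lemma A.18 under a simplified stable
# process z_t = Theta z_{t-1} + Lambda f_t + u_t (illustrative design).
import numpy as np

rng = np.random.default_rng(2)

def normalized_cross_norm(N, T, burn=200):
    theta = 0.5                                   # diagonal stable transition
    lam = rng.standard_normal(N)                  # one strong factor loading
    z = np.zeros(N)
    Z_lag, U = [], []
    for t in range(T + burn):
        u = rng.standard_normal(N)
        z_next = theta * z + lam * rng.standard_normal() + u
        if t >= burn:                             # record the pair (z_{t-1}, u_t)
            Z_lag.append(z)
            U.append(u)
        z = z_next
    Z_lag, U = np.array(Z_lag).T, np.array(U).T   # both N x T
    return np.linalg.norm(Z_lag @ U.T, 2) / (N * np.sqrt(T))

for N, T in [(25, 100), (50, 200), (100, 400)]:
    vals = [normalized_cross_norm(N, T) for _ in range(10)]
    print(f"N={N:3d}, T={T:3d}: ||Z_(-1)U'||_op / (N sqrt(T)) = {np.mean(vals):.3f}")
```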
H.8.4 Simulation of the operator norm bound in Proposition 5

This section performs a simple simulation to provide finite-sample evidence that the operator norm of $\hat{E}_{-1}(\hat{E} - Q\hat{E}_{-1})'$ is dominated by $\|E_{-1}U'\|_{op}$, so the bound in Proposition 5 might be improved further. Write $\hat{U} = \hat{E} - Q\hat{E}_{-1}$; then
$$\|\hat{E}_{-1}\hat{U}'\|_{op} = \|(E_{-1} + \Delta E_{-1})(U + \Delta E - Q\Delta E_{-1})'\|_{op} \leq \|E_{-1}U'\|_{op} + \|a_{NT}\|_{op},$$
where $a_{NT} = \hat{E}_{-1}(\Delta E - Q\Delta E_{-1})' + \Delta E_{-1}U'$.

For different pairs of $(N, T)$, I generate data from the 'Exact Rank' design in Section 5.1, and $\hat{E}$ is calculated following Algorithm 5. I then calculate the operator norms of $\hat{E}_{-1}\hat{U}'/\sqrt{(N\log N)T}$, $E_{-1}U'/\sqrt{(N\log N)T}$, and $a_{NT}/\sqrt{(N\log N)T}$ based on 100 replications. Results are reported in Table 3.1 and Figure A.1.

The simulation results show that as $N$ and $T$ increase, $\|\hat{E}_{-1}\hat{U}'\|_{op}$ is dominated by $\|E_{-1}U'\|_{op}$ and is upper bounded by $\sqrt{(N\log N)T}$. From the table, the normalized operator norm of $\hat{E}_{-1}\hat{U}'$ is very close to, and strictly below, that of $E_{-1}U'$. Moreover, $\|a_{NT}\|_{op}$ is much smaller than both $\|\hat{E}_{-1}\hat{U}'\|_{op}$ and $\|E_{-1}U'\|_{op}$. As $T$ increases (top panel) or $N$ increases (bottom panel), the normalized operator norms weakly decrease, indicating that all three terms are upper bounded by $\sqrt{(N\log N)T}$. Similarly, in Figure A.1, the line for the normalized operator norm of $E_{-1}U'$ (red) is very close to, and strictly above, the line for $\hat{E}_{-1}\hat{U}'$ (blue), while the line for $\|a_{NT}\|_{op}$ (yellow) lies well below the other two. Meanwhile, through the normalization, all six lines in Figure A.1 remain bounded as $T$ (respectively $N$) increases.

  N    T    $\hat{E}_{-1}\hat{U}'$   $E_{-1}U'$   $a_{NT}$
 100   50        0.3573               0.3642       0.2433
 100  100        0.2723               0.2744       0.1535
 100  150        0.2361               0.2386       0.1273
 100  200        0.2159               0.2212       0.1177
 100  250        0.2047               0.2052       0.1093
 100  300        0.1959               0.1954       0.1097
 100  350        0.1850               0.1881       0.1043
 100  400        0.1804               0.1823       0.1043
  50  200        0.5431               0.5480       0.3524
 100  200        0.4995               0.5034       0.2711
 150  200        0.4822               0.4874       0.2518
 200  200        0.4777               0.4819       0.2459
 250  200        0.4798               0.4827       0.2564
 300  200        0.4814               0.4850       0.2657
 350  200        0.4829               0.4860       0.2734
 400  200        0.4847               0.4878       0.2895

Notes: The table reports the normalized operator norms $\|\hat{E}_{-1}\hat{U}'\|_{op}/\sqrt{(N\log N)T}$, $\|E_{-1}U'\|_{op}/\sqrt{(N\log N)T}$, and $\|a_{NT}\|_{op}/\sqrt{(N\log N)T}$.

Table 3.1: Operator Norm Bound

Figure A.1: Operator Norm Bound
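For reference, here is a stripped-down sketch of the Monte Carlo exercise behind Table 3.1. Algorithm 5 and the 'Exact Rank' design are not reproduced here, so the scalar AR(1) design for $E$ and the noisy stand-in for $\hat{E}$ below are my illustrative assumptions; only the normalization and the exact identity $a_{NT} = \hat{E}_{-1}\hat{U}' - E_{-1}U'$ mirror the text.

```python
# Sketch of the Table 3.1 computations. The AR(1) design and the noisy
# stand-in for Ehat (Algorithm 5 is not reproduced here) are assumptions.
import numpy as np

rng = np.random.default_rng(3)
opnorm = lambda A: np.linalg.norm(A, 2)

def replication(N, T, theta=0.5, noise=0.1):
    U = rng.standard_normal((N, T + 1))
    E = np.zeros((N, T + 1))
    for t in range(1, T + 1):                  # E_t = theta * E_{t-1} + U_t
        E[:, t] = theta * E[:, t - 1] + U[:, t]
    E_lag, U_cur = E[:, :-1], U[:, 1:]
    E_hat = E + noise * rng.standard_normal((N, T + 1))  # stand-in estimate
    Ehat_lag, Ehat_cur = E_hat[:, :-1], E_hat[:, 1:]
    U_hat = Ehat_cur - theta * Ehat_lag        # Uhat = Ehat - Q * Ehat_{-1}
    a_NT = Ehat_lag @ U_hat.T - E_lag @ U_cur.T
    scale = np.sqrt(N * np.log(N) * T)
    return (opnorm(Ehat_lag @ U_hat.T) / scale,
            opnorm(E_lag @ U_cur.T) / scale,
            opnorm(a_NT) / scale)

for N, T in [(100, 100), (100, 200), (200, 200)]:
    m = np.mean([replication(N, T) for _ in range(20)], axis=0)
    print(f"N={N:3d}, T={T:3d}: Ehat_Uhat={m[0]:.3f}, E_U={m[1]:.3f}, a_NT={m[2]:.3f}")
```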
I Global Bank Details for Chapter 3

In Table 3.2 below, I show the banks by country that are included in the analysis of Chapter 3. The 2003-2014 data are directly from Demirer et al. (2018), and I extended the sample to 2021. Some of the banks were no longer publicly traded as of 2021; they are marked 'N' and excluded from the 2014-2021 sample. I also added a few major banks that were not included in the 2003-2014 sample; they are marked 'Y' for 2014-2021 and 'N' for 2003-2014. See details below.

Bank Name | Country | 2003-2014 | 2014-2021
JPMorgan Chase & Co | US | Y | Y
Bank of America | US | Y | Y
Citigroup | US | Y | Y
Wells Fargo | US | Y | Y
Goldman Sachs Group | US | Y | Y
Morgan Stanley | US | Y | Y
Bank of New York Mellon | US | Y | Y
U.S. Bancorp | US | Y | Y
PNC Financial Services Group | US | Y | Y
Capital One Financial | US | Y | Y
State Street Corporation | US | Y | Y
BB&T Corp | US | Y | Y
SunTrust Banks | US | Y | N
American Express | US | Y | Y
Fifth Third Bancorp | US | Y | Y
Regions Financial | US | Y | Y
Mitsubishi UFJ Financial Group | Japan | Y | Y
Mizuho Financial Group | Japan | Y | Y
Sumitomo Mitsui Financial Group | Japan | Y | Y
Resona Holdings | Japan | Y | Y
Nomura Holdings | Japan | Y | Y
Sumitomo Mitsui Trust Holdings | Japan | Y | Y
Fukuoka Financial Group | Japan | Y | Y
Bank of Yokohama | Japan | Y | N
Chiba Bank | Japan | Y | Y
Hokuhoku Financial Group | Japan | Y | Y
Shizuoka Bank | Japan | Y | Y
Yamaguchi Financial Group | Japan | Y | Y
Toronto-Dominion Bank | Canada | Y | Y
Royal Bank of Canada | Canada | Y | Y
Bank of Nova Scotia | Canada | Y | Y
Bank of Montreal | Canada | Y | Y
Canadian Bank of Commerce | Canada | Y | Y
National Bank of Canada | Canada | Y | Y
Unicredit | Italy | Y | Y
Intesa Sanpaolo | Italy | Y | Y
Banca Monte dei Paschi di Siena | Italy | Y | Y
Banco Popolare | Italy | Y | Y
Unipol Gruppo Finanziario | Italy | Y | Y
Mediobanca Banca di Credito Finanziario | Italy | Y | Y
National Australia Bank | Australia | Y | Y
Commonwealth Bank of Australia | Australia | Y | Y
Australia and New Zealand Banking Group | Australia | Y | Y
Westpac Banking | Australia | Y | Y
Macquarie Group | Australia | Y | Y
China Merchants Bank | China | Y | Y
Shanghai Pudong Development Bank | China | Y | Y
China Minsheng Banking Corp | China | Y | Y
Ping An Bank | China | Y | Y
Bank of China | China | N | Y
Industrial and Commercial Bank of China | China | N | Y
Agricultural Bank of China | China | N | Y
Hua Xia Bank | China | Y | Y
HSBC Holdings | UK | Y | Y
Barclays | UK | Y | Y
Royal Bank of Scotland Group | UK | Y | Y
Lloyds Banking Group | UK | Y | Y
Standard Chartered | UK | Y | Y
Banco Santander | Spain | Y | Y
Banco Bilbao Vizcaya Argentaria | Spain | Y | Y
Banco de Sabadell | Spain | Y | Y
Banco Popular Espanol | Spain | Y | N
Nordea Bank | Sweden | Y | Y
Svenska Handelsbanken | Sweden | Y | Y
Skandinaviska Enskilda Banken | Sweden | Y | Y
Swedbank | Sweden | Y | Y
BNP Paribas | France | Y | Y
Credit Agricole | France | Y | Y
Natixis | France | N | Y
Societe Generale | France | Y | Y
Woori Finance Holdings | Korea | Y | Y
Shinhan Financial Group | Korea | Y | Y
Industrial Bank of Korea | Korea | Y | Y
UBS | Switzerland | Y | Y
Credit Suisse Group | Switzerland | Y | Y
KBC Groupe | Belgium | Y | Y
Dexia | Belgium | Y | N
Itau Unibanco Holding | Brazil | Y | Y
Banco Bradesco | Brazil | Y | Y
Deutsche Bank | Germany | Y | Y
Commerzbank | Germany | Y | Y
Bank of Ireland | Ireland | Y | Y
Allied Irish Banks | Ireland | Y | Y
State Bank of India | India | Y | Y
Bank of Baroda | India | Y | Y
Malayan Banking Berhad | Malaysia | Y | Y
CIMB Group Holdings | Malaysia | Y | Y
Banco Comercial Portugues | Portugal | Y | Y
Banco Espirito Santo | Portugal | Y | N
DBS Group Holdings | Singapore | Y | Y
United Overseas Bank | Singapore | Y | Y
Erste Group Bank | Austria | Y | Y
Danske Bank | Denmark | Y | Y
Pohjola Bank | Finland | Y | N
National Bank of Greece | Greece | Y | N
ING Groep | Netherlands | Y | Y
DNB ASA | Norway | Y | Y
Sberbank Rossii | Russia | Y | Y
Turkiye Is Bankasi | Turkey | Y | Y
Standard Bank Group | South Africa | Y | Y

Table 3.2: Global Bank Details
Abstract
This dissertation contributes to the estimation of high-dimensional econometric models and to their applications.
In the first chapter (co-authored with Khai X. Chiong and Hyungsik Roger Moon), we investigate seemingly unrelated regression (SUR) models that allow the number of equations (N) to be large and comparable to the number of observations in each equation (T). It is well known that conventional SUR estimators, for example the feasible generalized least squares (FGLS) estimator of Zellner (1962), do not perform well in a high-dimensional setting. We propose a new feasible GLS estimator called the feasible graphical lasso (FGLasso) estimator. For a feasible implementation of the GLS estimator, we use the graphical lasso estimate of the precision matrix (the inverse of the covariance matrix of the equation-system errors), assuming that the underlying unknown precision matrix is sparse. We show that under certain conditions, FGLasso converges uniformly to GLS even when T < N, and that it shares the same asymptotic distribution as the efficient GLS estimator when T > N log N. We confirm these results through finite-sample Monte Carlo simulations.
The second chapter studies the vector autoregressive (VAR) model with interactive fixed effects in a high-dimensional setting, allowing both the number of cross-sectional units N and the number of time periods T to go to infinity. Assuming that the VAR transition matrix is low-rank, the chapter first proposes a nuclear-norm-regularization-based method that estimates the transition matrix and the interactive fixed effects simultaneously. Under certain conditions, the chapter shows that, on average, the deviation of each element of the estimated matrix shrinks to 0 as N, T → ∞. Since the nuclear-norm penalty induces bias, a debiasing procedure is then introduced to improve the estimators' finite-sample performance. Independently, leveraging principal component analysis (PCA), the chapter proposes a multi-stage estimation method that estimates the parameters in multiple stages. I show that the method improves the convergence rate of the VAR transition matrix and reduces bias. In Monte Carlo simulations, I examine the estimators' finite-sample performance, and the results agree with the theory. Empirically, the chapter revisits the US macro data from McCracken and Ng (2016) and shows that the model has a clear advantage in forecasting macro indexes (IP, CPI, and the federal funds rate) over the reduced-rank VAR model (RRVAR) and the pure factor model (FM), especially at long horizons.
The third chapter applies the model from the previous chapter to analyze connectedness among 29 countries, using volatility data on 96 banks from 2003 to 2021. I construct the dynamic network of global banks, as well as the integrated country network, using rolling-window estimation. I find that system-wide shocks such as economic crises and the pandemic dramatically raise the level of connectedness; the heat quickly calms down after the shocks, and connectedness returns to a low level. The integrated country network is more persistent: connectedness across countries climbs as system-wide shocks arrive and stays elevated for a much longer period before trending down.