SPARSENESS IN FUNCTIONAL DATA ANALYSIS

by Xinghao Qiao

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (BUSINESS ADMINISTRATION)

August 2015. Copyright 2015 Xinghao Qiao

To my parents.

Acknowledgments

First and foremost, I would like to express my sincere gratitude to my advisor, Professor Gareth James, for his continuous support of my PhD study and research. His encouragement, enthusiasm, motivation, patience and broad knowledge helped me at every stage of my research and the writing of my thesis. Without his guidance and persistent support, this dissertation would not have been possible.

Besides my advisor, I would like to thank the rest of my thesis committee, Professor Larry Goldstein, Professor Peter Radchenko and Professor Xin Tong, for reading my thesis, pointing out errors, providing insightful comments and raising hard questions. I am also grateful to Professor Gareth James, Professor Peter Radchenko, Professor Jinchi Lv and Professor Wenguang Sun for the several collaborative research projects we have been working on, which led to the different chapters of my thesis. I look forward to collaborating with them on more interesting topics in the future.

I thank all the faculty members in the Marshall Statistics group for attending my presentations of these research projects. I would also like to thank the PhD students in our group, Pallavi Basu, Courtney Paulson, Luella Fu and Weinan Wang, for the stimulating discussions, the days we worked hard together and all the fun we have had over the past several years at USC.

Last but not least, I have many reasons to thank my parents, Shuang Tian and Fenghua Qiao, for giving me life in the first place and for their unwavering support throughout it. I am very fortunate to be their son.

Contents

Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
1.1 Sparseness and FDA
1.1.1 Functional Data Analysis
1.1.2 Sparseness
1.2 Three Sparseness Settings for Functional Data
1.2.1 Sparseness in Functions for Multi-Index Models
1.2.2 Sparseness in Edges for Functional Graphical Models
1.2.3 Sparseness in Predictors for Ultra-High Dimensional Functional Data
1.3 Summary of Chapters
2 Index Models for Sparsely Sampled Functional Data
2.1 Introduction
2.2 SIMFE
2.2.1 Scalar Response
2.2.2 The SIMFE Fitting Method
2.2.3 Functional Response
2.2.4 Bias Corrected SIMFE
2.2.5 Selection of tuning parameters
2.3 Theory
2.4 Simulations
2.4.1 Scalar Gaussian Responses
2.4.2 Functional Gaussian Responses
2.5 Auction Data
2.6 Conclusions
3 Functional Graphical Models
3.1 Introduction
3.2 Methodology
3.2.1 Gaussian Graphical Models
3.2.2 Functional Graphical Models
3.2.3 Functional Graphical Lasso
3.3 Computation
3.3.1 Fglasso Algorithm
3.3.2 Block Partitioning to Accelerate the Algorithm
3.4 Theory
3.5 Simulations
3.6 EEG Data
3.7 Conclusions
4 Feature Screening for Ultra-High Dimensional Functional Regression Models
4.1 Introduction
4.2 Independence Screening Procedures
4.3 DC-SIS
4.3.1 Distance Correlation
4.3.2 DC-SIS
4.4 MDC-SIS
4.4.1 Martingale Difference Correlation
4.4.2 MDC-SIS
4.5 SIRS
4.6 Functional Data with Measurement Error
4.7 Simulations
4.8 Real Data
4.9 Conclusions
References
A Algorithms and Proofs in Chapter 2
A.1 EM Algorithm for Estimating $\tilde{X}_{ij}(t)$
A.2 Proof of Main Results
A.2.1 Proof of Theorem 1
A.2.2 Proof of Theorem 2
A.2.3 Theoretical Assumptions for the Results in Section 2.3
A.2.4 Proof of Theorems 3 and 4
A.2.5 Proof of Corollary 1
A.2.6 Proof of Theorem 5
B Algorithms and Proofs in Chapter 3
B.1 Derivations for Fglasso Algorithm
B.1.1 Step 2(b) of Algorithm 3
B.1.2 Steps 2(a) and 2(c) of Algorithm 3
B.2 Proofs of main results
B.2.1 Proof of Lemma 1
B.2.2 Convergence Rate for Submatrices
B.2.3 Equivalence between $E$ and $\tilde{E}$
B.2.4 Proof of Theorem 1
B.2.5 General Error Bound
B.2.6 General Model Selection Consistency
B.2.7 Proof of Theorem 6
B.3 Additional Technical Proofs
B.3.1 Estimation of FPC Scores
B.3.2 Proof of Lemma 2
B.3.3 Proof of Proposition 1
B.3.4 Proof of Proposition 2
B.3.5 Proof of Lemma 4
B.3.6 Lemma 5 and its proof
B.3.7 Lemma 6 and its proof
B.3.8 Lemma 7 and its proof
B.3.9 Lemma 8 and its proof
B.3.10 Lemma 9 and its proof
B.3.11 Lemma 10 and its proof
B.3.12 Lemma 11 and its proof
C Proofs in Chapter 4
C.1 Proof of Theorems 7 and 8
C.2 Proof of Proposition 3
C.3 Proof of Proposition 4
C.4 Proof of Proposition 5
C.5 Proof of Proposition 6
List of Tables

2.1 Scalar Gaussian Response for $d_1 = d_2 = d_3 = 1$: the mean squared prediction errors and correlation coefficients. Standard errors are shown in parentheses.
2.2 Scalar Gaussian Response for $d_1 = 2$, $d_2 = 1$: the mean squared prediction errors and correlation coefficients. Standard errors are shown in parentheses.
2.3 Functional Gaussian Response: the mean squared prediction errors and correlation coefficients. Standard errors are shown in parentheses.
2.4 Cross-validated mean squared prediction errors ($\times 10^3$) for the three versions of SIMFE and four competing methods. The lowest MSPE for each value of $T$ is bolded.
3.1 The mean area ($\times 10^1$) under the ROC curves with false positive rates up to 20%. Standard errors are shown in parentheses.
3.2 The mean area under the ROC curves. Standard errors are shown in parentheses.
4.1 The proportions of $P_M$ and $P_a$ with $d = 40$.
4.2 Results of channel ranking using 7 feature screening methods.

List of Figures

2.1 The left panel displays the boxplot of daily aggregate bids from 156 seven-day eBay online auctions; the right panel provides the boxplot of cumulative bids placed in every four-hour window during the last day of the auction. The small circles outside the whiskers are identified outliers and the solid lines in the boxes are the medians.
2.2 Scalar Gaussian simulation with $T_{ij} = 8$ and $n = 200$: comparison of the true $\beta(t)$ functions (black solid) with median estimates over 100 simulation runs; SIMFE (blue solid), Base SIMFE (red solid), gOPG (green dotted), gSIR (cyan dashed) and gPCA (grey dash dotted).
2.3 Functional Gaussian Response for $T_{ij} = n_i = 5$, $n = 100$: comparison of the true $\beta_j(t)$ functions (black solid) with SIMFE (blue solid), Base SIMFE (red solid), gOPG (green dotted), gSIR (cyan dashed) and gPCA estimates (grey dash dotted).
2.4 Estimated $\beta_{jk}(t)$ curves for the eBay online auction data. The red dashed, black solid, green dotted, blue dash-dot and cyan long dashed lines correspond to SIMFE, Base SIMFE, gOPG, gSIR and gPCA, respectively, up to different current times, $T = 138, 144, 150, 156, 162$ and $168$ hours.
3.1 The illustrative example. Left: the data, $n = 100$ observations of $g_{ij}(t)$ for $j = 1, \ldots, 9$ nodes. Right: the true underlying network structure.
3.2 Model 1 for $p = 50$, 100, 150 and 200: comparison of median estimated ROC curves over 100 simulation runs with false positive rates up to 20%; fglasso (black solid), ALL (red dashed), NEVER (green dotted), HALF (blue dash dotted) and PCA (cyan long dashed).
3.3 Model 2 for $p = 50$, 100, 150 and 200: comparison of median estimated ROC curves over 100 simulation runs with false positive rates up to 20%; fglasso (black solid), ALL (red dashed), NEVER (green dotted), HALF (blue dash dotted) and PCA (cyan long dashed).
3.4 Model 1 (top row) and Model 2 (bottom row) for $p = 50$ and 100: comparison of median estimated ROC curves over 100 simulation runs for fglasso (black solid), ALL (red dashed), NEVER (green dotted), HALF (blue dash dotted) and PCA (cyan long dashed).
3.5 The left graph plots the estimated network for the alcoholic group and the right graph plots the estimated network for the control group. Black lines denote edges identified by both the fglasso and the bootstrapped fglasso, blue lines denote edges identified by the fglasso but not selected by the bootstrapped fglasso, and red lines denote edges identified by the bootstrapped fglasso but missed by the fglasso.
3.6 Black lines denote edges identified only by the alcoholic group and red lines denote edges identified only by the control group.

Abstract

Functional Data Analysis (FDA), a branch of statistics that analyzes data providing information about functions measured over some domain, has recently attracted increasing attention. In my thesis, I present a brief overview of FDA and consider three different settings in FDA where sparseness can play an important role.

The first setting considers sparseness in the functions. Classical FDA requires a large number of regularly spaced measurements per subject. I consider the situation involving sparse, irregularly observed and noisy data. I propose a new functional errors-in-variables approach, Sparse Index Model Functional Estimation (SIMFE), which uses a functional index model formulation to handle sparsely sampled functional predictors. SIMFE enjoys several advantages over traditional methods. First, it implements a non-linear regression and uses a supervised approach to estimate the lower dimensional space onto which the predictors should be projected. Second, SIMFE can be applied to both scalar and functional responses with multiple predictors. Finally, SIMFE uses a mixed effects model to deal with sparsely sampled functional data.

The other two settings focus on sparseness arising from high dimensional functional data, where the number of functional variables, $p$, exceeds the number of observations, $n$. One example considers extending the Gaussian graphical model, which depicts the conditional dependence structure among $p$ multivariate Gaussian variables, to the functional domain. Fitting high dimensional graphical models poses several challenges, so I need to assume sparseness in the edges, where the edges connect only a subset of nodes, i.e. only some pairs of random variables are conditionally dependent. I propose a functional graphical model (FGM) describing the conditional dependence structure among $p$ random functions. I then develop an fglasso criterion, which estimates the FGM by imposing block sparsity in the precision matrix via a group lasso penalty, and I show that the graph is successfully recovered with high probability.

The other example concerns screening features for ultra-high dimensional functional data. To make the problem feasible I assume extreme sparseness in the predictor space, i.e. most of the functional predictors do not contribute to the response. I propose several model-free independence screening procedures to rank the importance of functional predictors.
I compare the different active sets of predictors that these approaches aim to select, and I establish the corresponding sure screening properties when the number of predictors diverges at an exponential rate with the number of observations.

In each setting, I illustrate the finite sample performance of my proposed method through a series of simulations and a real world data example.

Chapter 1

Introduction

1.1 Sparseness and FDA

Due to recent advances in technology, scientific researchers are able to collect and store large and complex data sets. Functional Data Analysis (FDA), a branch of statistics, has attracted increasing attention in recent years. My thesis concentrates on the intersection between the field of functional data and the concept of sparsity, in its various forms. In Section 1.1.1, I provide an overview of FDA, and in Section 1.1.2 I introduce three different types of sparseness arising in FDA.

1.1.1 Functional Data Analysis

FDA refers to the statistical analysis of data consisting of samples of random curves or surfaces, where each measured function is treated as the unit of observation. It is often assumed in FDA that each function is sampled from some smooth realization of an underlying stochastic process. FDA then provides a statistical approach to analyze the observed functional data generated from such a process. Functional data are intrinsically infinite dimensional and measurements within each curve exhibit strong correlation, which distinguishes FDA from multivariate data analysis, an area that deals with finite dimensional random vectors. Smoothness and infinite dimensionality are structural assumptions in FDA, and hence fitting models in FDA usually requires some form of dimension reduction. Additionally, various smoothing techniques from nonparametric statistics, e.g. basis representation, penalized splines and kernel smoothing, are an essential part of the tool kit for analyzing functional data. Both unsupervised dimension reduction techniques, such as functional principal component analysis (FPCA), and supervised dimension reduction methods, such as index models, are commonly used since they allow a finite dimensional analysis of intrinsically infinite dimensional objects.

Several key FDA methodologies include FPCA (James et al. 2000, Yao et al. 2005a, Hall and Hosseini-Nasab 2006, Li et al. 2010, Shang 2014), warping and curve registration (Gervini and Gasser 2004) and different forms of functional regression models (Cardot et al. 2003, James 2002, James and Silverman 2005, James et al. 2009, Yao et al. 2005b, Mueller and Stadtmueller 2005, Mueller and Yao 2008, Jiang and Wang 2011, Chen et al. 2011, Radchenko et al. 2015, Fan et al. 2014, 2015, Yao et al. 2015). Theoretical results in FDA rely on either perturbation theory of linear operators in Hilbert space (Dauxois et al. 1982, Bosq 2000) or a reproducing kernel Hilbert space approach (Eubank and Hsing 2008). Applications of FDA lie in many areas, such as e-commerce and marketing (Jank and Shmueli 2006, Sood et al. 2009), genetics (Opgen-Rhein and Strimmer 2006, Reimherr and Nicolae 2014), growth curves (Gasser et al. 1984), yield curves (Hays et al. 2012), and mortality and fertility rates (Hyndman and Ullah 2007), to list only a few. A broad overview of methodological, theoretical and applied aspects of FDA can be found in the textbooks of Ferraty and Vieu (2006), Ramsay and Silverman (2005), Ramsay et al. (2009) and Horvath and Kokoszka (2012), and in some additional reviews in Rice (2004), Zhao et al.
(2004) and Ferraty and Romain (2011).

1.1.2 Sparseness

A common assumption in FDA is that one observes a relatively small number of functions, densely sampled over time or some other domain. In this setting it is not clear how the concept of sparsity, so prevalent in the high dimensional statistics literature, would be relevant. However, in practice various forms of sparsity patterns turn out to be important in FDA applications. In my thesis, I focus on three different instances of sparseness in FDA.

The first setting involves sparseness in the functions, where in practice curves are commonly measured on an irregular, sparse set of time points and are contaminated by measurement error. This form of sparsity impacts settings such as principal components, clustering, classification and regression. The literature concentrates on two main approaches for dealing with sparsely sampled functional data. One method adopts local linear smoothing techniques to estimate the mean and covariance functions of the curves without specifying any basis function (Yao et al. 2005a,b). The second method uses a mixed effects model to borrow strength across functions by incorporating information from all the curves (James et al. 2000, James and Sugar 2003, James and Hastie 2001, Radchenko et al. 2015). In Chapter 2, I propose a multi-index regression model for sparsely sampled functional data using a mixed effects model approach. This work corresponds to an accepted paper in the Journal of the American Statistical Association (Radchenko et al. 2015).

The other two settings consider different sparsity patterns in high dimensional functional data, where the number of functional variables, $p$, exceeds the number of observations, $n$. One example involves functional extensions to Gaussian graphical models, which describe the conditional dependence structure among $p$ multivariate Gaussian random variables. Fitting the classical graphical model in high dimensional settings poses several challenges, and I need to assume sparseness in the edges, where the edges connect only a subset of nodes, i.e. only some pairs of random variables are conditionally dependent. In Chapter 3, I propose a graphical model for functional data depicting the conditional dependence structure among $p$ random functions.

In Chapter 4, I explore another interesting type of sparseness in the context of functional regression models for ultra-high dimensional data. To be specific, ultra-high dimensionality refers to the assumption that $\log p = O(n^{\xi})$ for some $\xi > 0$, as defined in Fan and Lv (2008a). To make the problem feasible, I must assume sparseness in the predictor space, i.e. most of the functional predictors are unrelated to the response, which may be either scalar or functional. I present several model-free (i.e. no model assumption is made about the relationship between the response and predictors) independence screening approaches to rank the importance of all functional predictors. Such rankings can be used to select a small subset of predictors that contains the true signal variables with high probability. I show that this "sure screening" property holds for these methods in the functional domain.

1.2 Three Sparseness Settings for Functional Data

In this section I briefly present the three problems that make up the core of my thesis.
1.2.1 Sparseness in Functions for Multi-Index Models

Given a scalar response $Y_i$ and a functional predictor $X_i(t)$, the standard functional linear regression model is given by

$Y_i = \beta_0 + \int \beta(t) X_i(t)\,dt + \varepsilon_i, \qquad i = 1, \ldots, n,$    (1.1)

which implies a scalar response, a single densely observed functional predictor and a linear relationship. Since functional predictors are infinite dimensional, fitting (1.1) also requires some dimension reduction technique. Most approaches use an unsupervised method, such as FPCA, to approximate the predictors and then regress $Y$ on a lower dimensional representation of $X(t)$.

In the standard setting involving a univariate response $Y$ with multivariate predictor $\mathbf{X} \in \mathbb{R}^p$, the single-index model (Hardle and Stoker 1989),

$Y_i = m(\beta^T \mathbf{X}_i) + \varepsilon_i,$    (1.2)

is a common supervised dimension reduction approach, which generalizes standard linear regression by modeling the effect of a linear projection of the multivariate predictors nonparametrically. In this model $m(\cdot)$ is an unknown smooth link function and $\beta$ is the primary parameter of interest in the index $\beta^T \mathbf{X}$. There exist various approaches to estimate $\beta$ in (1.2), such as projection pursuit regression (Friedman and Stuetzle 1981), average derivatives (Hardle and Stoker 1989, Ichimura 1993) and partial least squares (Naik and Tsai 2000). Xia et al. (2002) proposed a (conditional) minimum average variance estimation (MAVE) approach to search for an efficient dimension reduction space in (1.2). Unlike most previous methods, MAVE can achieve root-$n$ consistency for the parametric index estimate without undersmoothing the nonparametric link function. Xia (2008) extended (1.2) to a multi-index model, which projects $\mathbf{X}_i$ into a dimension greater than one, and investigated its theoretical properties.

There has been some recent work on extending (1.1) using supervised dimension reduction methods to represent the predictors, together with a non-linear model for the response surface. One of the most natural ways to address the problem is to extend the single-index model (1.2) to the functional domain:

$Y_i = m\left(\int \beta(t) X_i(t)\,dt\right) + \varepsilon_i,$    (1.3)

where the index function, $\beta(t)$, projects the predictor into a lower dimensional space and $m(\cdot)$ is a low dimensional non-linear function. Both $\beta(t)$ and $m(\cdot)$ are estimated as part of the fitting process. James and Silverman (2005) proposed a functional index model similar to (1.3), which extended generalized linear models (McCullagh and Nelder 1989), generalized additive models (Hastie and Tibshirani 1990) and projection pursuit regression (Friedman and Stuetzle 1981) to model functional predictors. Chen et al. (2011) extended this work to a fully nonparametric setting and provided further theoretical justification.

In Chapter 2, I consider the practical situation where one only observes a noisy version of $X_i(t)$ over a handful of time points. One example of this setting, which I analyze in Chapter 2, is the auction data (Liu and Muller 2008). This data set consists of 156 eBay online seven-day second price auctions of Palm M515 Personal Digital Assistants (PDAs) that took place between March and May 2003. Bid arrivals in online auctions show an interesting phenomenon, called "bid sniping", where bids are sparse in the middle and denser at the beginning and at the end of the auctions. My ultimate goal is to predict the closing price while an auction is still ongoing, given the current sparsely observed curve of bid prices. In this setting, computing the integral in (1.3), and hence fitting the index model, becomes considerably more challenging.
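For fully observed curves the index in (1.3) is straightforward to compute. The sketch below (illustrative only, not code from the thesis) approximates $P_i = \int \beta(t) X_i(t)\,dt$ by numerical integration on a dense grid and estimates the link $m(\cdot)$ with a basic kernel smoother; the simulated curves, the choice $\beta(t) = \sqrt{2}\sin(\pi t)$ and the sine link are all hypothetical. It is precisely this direct computation that breaks down when the curves are only sparsely and noisily observed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 200, 101                      # curves and grid points on [0, 1]
t = np.linspace(0.0, 1.0, T)

# Densely observed predictor curves X_i(t): simple two-term random curves
X = (rng.normal(size=(n, 1)) * np.sin(2 * np.pi * t)
     + rng.normal(size=(n, 1)) * np.cos(2 * np.pi * t))

beta = np.sqrt(2) * np.sin(np.pi * t)          # illustrative index function beta(t)
index = np.trapz(X * beta, t, axis=1)          # P_i = \int beta(t) X_i(t) dt
y = np.sin(index) + 0.1 * rng.normal(size=n)   # Y_i = m(P_i) + eps, with m = sin

def kernel_smooth(p_grid, p, y, h=0.3):
    """Nadaraya-Watson estimate of the link m(.) at the points in p_grid."""
    w = np.exp(-0.5 * ((p_grid[:, None] - p[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

p_grid = np.linspace(index.min(), index.max(), 50)
m_hat = kernel_smooth(p_grid, index, y)
print(m_hat[:5])
```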
I propose a new errors-in-variables approach (Carroll et al. 2006), named Sparse Index Model Functional Estimation (SIMFE), which also implements an index model but provides several advantages over the previously discussed approaches. In particular, SIMFE uses a mixed effects model to borrow information across all predictors and hence provide an accurate reconstruction of $X_i(t)$. Moreover, I show that the SIMFE estimate of the index function achieves good convergence properties, even for functional predictors observed at irregular and sparse time points. Finally, SIMFE is applicable in the setting where multiple functional predictors and either scalar or functional responses are considered.

1.2.2 Sparseness in Edges for Functional Graphical Models

A graphical model is used to depict the conditional dependence structure among $p$ random variables, $\mathbf{X} = (X_1, \ldots, X_p)^T$. Such a network consists of $p$ nodes, one for each variable, and a number of edges connecting a subset of the nodes. The edges describe the conditional dependence relationships of the $p$ variables, i.e. nodes $j$ and $l$ are connected by an edge if and only if $X_j$ and $X_l$ are correlated conditional on the remaining $p - 2$ variables. Recently there has been a lot of interest in fitting high dimensional graphical models, where $p$ is very large.

For Gaussian data, where $\mathbf{X}$ follows a multivariate Gaussian distribution, one can show that estimating the edge set is equivalent to identifying the locations of the non-zero elements in the precision matrix, i.e. the inverse covariance matrix of $\mathbf{X}$. Hence, the literature has mainly focused on two approaches for estimating high dimensional Gaussian graphical models. One method involves a penalized regression approach whereby each variable is regressed on all the remaining variables, thus identifying the locations of non-zero entries in the precision matrix column by column (Meinshausen and Buhlmann 2006, Cai et al. 2011). The other method, first proposed by Yuan and Lin (2007), optimizes the graphical lasso (glasso) criterion, essentially a Gaussian log likelihood with the addition of a lasso (Tibshirani 1996) type penalty on the entries of the precision matrix. This criterion generates a sparse precision matrix with many zero elements, corresponding to conditionally independent variables. The glasso has arguably proved the more popular of these two approaches, in part because a number of efficient algorithms have been developed to minimize the convex glasso criterion (Boyd et al. 2010, Friedman et al. 2010, Witten et al. 2011, Mazumder and Hastie 2012a,b). Its theoretical properties have also been well studied (Lam and Fan 2009, Ravikumar et al. 2011), and several variants and extensions of the glasso have been proposed; see Zhou et al. (2010), Kolar and Xing (2011), Danaher et al. (2014), Zhu et al. (2014b) and the references therein.

More recently, high dimensional functional data have become more common, which has motivated interest in functional graphical models (FGM). One such example, which I discuss in Chapter 3, is an electroencephalography (EEG) data set taken from an alcoholism study (Zhang et al. 1995). The study consists of $n = 122$ subjects split between an alcoholic group and a control group. Each subject was exposed to either a single stimulus or two stimuli. The resulting EEG activity was recorded at 256 time points over a one second interval using electrodes placed at 64 locations on the subject's scalp.
Hence, each observation, or subject, involves $p = 64$ different functions observed at 256 time points. It is of scientific interest to identify differences in brain activity between the two groups, so I construct FGMs for each group and explore the differences.

Functional data of this sort can arise in a number of other contexts. For example, rather than recording only a static set of $p$ gene expression levels at a single point in time, it is now becoming common to observe multiple expression levels over time, so $g_{ij}(t)$ would represent the expression level of gene $j$ for subject $i$ at time $t$. Alternatively, in a marketing context it is now possible to observe online purchase patterns among a basket of $p$ different products for each of $n$ individual customers over a period of time, so $g_{ij}(t)$ might represent the cumulative purchase history of product $j$ by customer $i$ at time $t$. Hence, in Chapter 3 I consider extending the Gaussian graphical model to the functional domain and propose a functional graphical model (FGM) to depict the conditional dependence structure among $p$ random functions.

1.2.3 Sparseness in Predictors for Ultra-High Dimensional Functional Data

Ultra-high dimensional data sets are ubiquitous in modern statistics, arising from diverse scientific fields. For example, some genome array data sets consist of millions of genes (features) compared to a few hundred samples (observations). Feature screening provides a powerful tool in dealing with these large $p$, small $n$ problems. Various regularization approaches have been proposed to identify significant features in high dimensional regression models. These methods include, but are not limited to, the lasso (Tibshirani 1996), the Dantzig selector (Candes and Tao 2007), the SCAD (Fan and Li 2001), the elastic net (Zou and Hastie 2005) and the adaptive lasso (Zou 2006). Theoretical properties, efficient algorithms and various extensions have been well explored for high dimensional data, where $p$ grows at a polynomial rate as $n$ increases. See the review article of Fan and Lv (2008b) for a thorough discussion of the issues involved in modeling such data.

In the high dimensional functional setting there have been some related extensions of the univariate functional regression model (1.1). Fan et al. (2015) proposed the so-called FAR method, using a penalized least squares optimization approach to handle high dimensional problems involving a large number of functional predictors in a non-linear additive model setting. Fan et al. (2014) developed a new method, FRAME, which extends (1.1) to the situation involving both functional and scalar predictors and a functional response, and applied it to a Hollywood movie data set.

However, both in the standard and functional settings, these regularization approaches may not perform well for ultra-high dimensional data due to several statistical and computational deficiencies (Fan et al. 2009). To overcome these challenges, an alternative line of research based on marginal regression was proposed. These approaches all assume extreme sparsity in the relationship between the predictors and the response, i.e. it is assumed that most predictors have no relationship to the response. Some form of independence screening is then performed to assess the marginal relationship between each predictor and the response, and a small subset of variables that are strongly related to the response is selected.
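As a minimal illustration of this marginal screening idea (and not of any of the specific procedures discussed next), the following sketch ranks predictors by their absolute marginal Pearson correlation with a scalar response and keeps the top few. The simulated design, in which only two predictors carry signal, is hypothetical.

```python
import numpy as np

def sis_rank(X, y, top_k):
    """Rank predictors by absolute marginal Pearson correlation with the response
    and return the indices of the top_k columns (a sketch of marginal screening)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12)
    order = np.argsort(-np.abs(corr))
    return order[:top_k]

rng = np.random.default_rng(1)
n, p = 100, 2000                                  # ultra-high dimensional: p >> n
X = rng.normal(size=(n, p))
y = 2 * X[:, 3] - X[:, 7] + rng.normal(size=n)    # only predictors 3 and 7 matter
print(sis_rank(X, y, top_k=10))                   # the true signals should rank near the top
```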
Fan and Lv (2008a) introduced the sure independence screening (SIS) procedure for linear models with Gaussian predictors and response via marginal Pearson correlation learning and showed that SIS possesses a sure screening property. Fan et al. (2009) relaxed the linearity assumption using a generalized linear models framework by proposing an independence procedure that ranks the maximum marginal likelihood estimators. Fan et al. (2011) proposed a nonparametric independence screening (NIS) approach that ranks the marginal nonparametric components in sparse additive models. A two step approach, named UPS (Ji and Jin 2012), which first screens by univariate thresholding and then cleans via penalized least squares, can be shown to achieve the optimal rate of convergence. Recently, several model-free screening approaches have attracted increasing attention. Zhu et al. (2011) introduced a sure independence ranking and screening (SIRS) procedure to perform screening for multi-index models. Li et al. (2012) developed an independence screening approach based on the distance correlation metric (Szekely et al. 2007, Szekely and Rizzo 2009), which equals zero if and only if two random vectors are independent. The method, named DC-SIS, can handle non-Gaussianity and multivariate responses. Balasubramanian et al. (2013) generalized DC-SIS by suggesting a Hilbert-Schmidt independence criterion based SIS approach. Shao and Zhang (2014) defined a new martingale difference correlation (MDC), which quantifies the departure from conditional mean independence, and used it as a marginal correlation measure to screen out variables not contributing to the conditional mean of the response.

Some recent developments in genetics and other areas call for new approaches to deal with functional data in the ultra-high dimensional setting. For example, Storey et al. (2005) studied gene expression data sets measured over time involving a small number of patients but approximately 40,000 functional predictors (genes). In this setting it is natural to believe that there is a sparse relationship between the genes and the disease of interest, i.e. most genes do not influence the disease. Hence, a screening approach for functional data may be appropriate. Cheng et al. (2014) developed a model-based screening procedure using a sparse semivarying coefficient model for ultra-high dimensional longitudinal data. However, their approach, based on a varying coefficient model that is widely used in modeling longitudinal data, cannot handle functional data whose measurements on the same curve display high correlation. In Chapter 4 I consider extending several model-free sure independence screening (SIS) approaches to the functional domain to deal with ultra-high dimensional functional data.

1.3 Summary of Chapters

The rest of the thesis is organized as follows. In Chapter 2 I propose a multi-index model for sparsely sampled functional data. In Section 2.2 I present the SIMFE model and develop fitting procedures for both scalar and functional responses. I also present an extension of SIMFE which adjusts for possible bias in the estimate of the nonlinear function in situations where the predictors are observed at different time points for each individual. Theoretical results are presented in Section 2.3, which demonstrate that even with sparsely observed predictors SIMFE is consistent in estimating the space spanned by the set of index functions and has a faster rate of convergence than other potential approaches.
Section 2.4 illustrates the performance of SIMFE on an extensive set of simulations covering both scalar and functional responses. Section 2.5 applies SIMFE to an online auction data set containing sparsely observed predictors.

In Chapter 3 I develop a functional graphical model to depict the conditional dependence structure of random functions. In Section 3.2 I propose a convex penalized criterion which has connections to both the graphical lasso (Yuan and Lin 2007) and the group lasso (Yuan and Lin 2006). Minimizing my functional graphical lasso criterion provides a sparse estimate, $\hat{E}$, of the edge set $E$. In Section 3.3, I develop an efficient block coordinate descent algorithm for minimizing the fglasso criterion. The fglasso algorithm is extended to handle even larger values of $p$ by applying the partition approach of Witten et al. (2011). Section 3.4 provides my theoretical results. Here I show that the estimated edge set $\hat{E}$ is the same as the true edge set $E$ with probability converging to one, even for $p > n$. The finite sample performance of the fglasso is examined in Section 3.5 through a series of simulation studies. Section 3.6 provides a demonstration of the fglasso on the EEG data set.

In Chapter 4 I propose several model-free independence screening approaches for ultra-high dimensional functional data. In Section 4.2 I propose a general framework for performing different independence feature screenings. I extend DC (Szekely et al. 2007) and MDC (Shao and Zhang 2014) to the functional domain, use them as the corresponding marginal utilities to rank the importance of functional predictors, and show that the underlying sure screening properties hold in Sections 4.3 and 4.4, respectively. I also develop a functional extension of SIRS (Zhu et al. 2011) in Section 4.5. For the case of functional data with measurement error, I take the standard approach of modeling functional data using orthogonal basis functions and develop the corresponding SIS procedures in Section 4.6. I examine the finite sample performance of the functional versions of the DC-SIS, MDC-SIS and SIRS procedures via a series of simulations in Section 4.7 and illustrate the proposed methods through a MEG data set in Section 4.8. All technical proofs and some additional algorithms are given in the Appendix.

Chapter 2

Index Models for Sparsely Sampled Functional Data

The regression problem involving functional predictors has many important applications, and a number of functional regression methods have been developed. However, a common complication in functional data analysis is that of sparsely observed curves, that is, predictors that are observed, with error, on only a small subset of the possible time points. Such sparsely observed data induce an errors-in-variables model, where one must account for measurement error in the functional predictors. Faced with sparsely observed data, most current functional regression methods simply estimate the unobserved predictors and treat them as fully observed, thus failing to account for the extra uncertainty arising from the measurement error. In this chapter I propose a new functional errors-in-variables approach, Sparse Index Model Functional Estimation (SIMFE), which uses a functional index model formulation to deal with sparsely observed predictors. SIMFE has several advantages over more traditional methods. First, the index model implements a non-linear regression and uses an accurate supervised method to estimate the lower dimensional space into which the predictors should be projected.
Second, SIMFE can be applied to both scalar and functional responses with multiple predictors. Finally, SIMFE uses a mixed effects model to effectively deal with very sparsely observed functional predictors and to correctly model the measurement error.

2.1 Introduction

In Section 1.2.1 I discussed some recent work on extending (1.1) using an index model:

$Y_i = m\left(\int \beta(t) X_i(t)\,dt\right) + \varepsilon_i,$    (2.1)

where the index function, $\beta(t)$, projects the predictor into a lower dimensional space and $m(\cdot)$ is a low dimensional non-linear function.

[Figure 2.1 appears here. The left panel displays the boxplot of daily aggregate bids from 156 seven-day eBay online auctions; the right panel provides the boxplot of cumulative bids placed in every four-hour window during the last day of the auction. The small circles outside the whiskers are identified outliers and the solid lines in the boxes are the medians.]

In this chapter I consider the practical situation where one only observes a noisy version of $X_i(t)$ over a handful of time points. As discussed in Section 1.2.1, my motivating example is an eBay online auction data set consisting of 156 online seven-day second price auctions of Palm M515 PDAs that took place between March and May 2003. Since bidders tend to participate actively at the beginning of an auction, and submit their target bids more often at the last moment to increase their chance of winning, the phenomenon of "bid sniping" appears: bids are sparse in the middle of the auction, more prevalent at the beginning and even denser at the end. This phenomenon can be seen in Figure 2.1. The left panel shows that the median number of bids placed on the first day was three, then dropped to zero until the fourth day, started to rise on day five, and finally exhibited a sharp increase on day seven. The right panel demonstrates that the number of bids on the last day began to increase at noon and jumped to four during the last several hours. This pattern of irregular and sparsely distributed bids motivates the development of models for sparsely sampled functional data, in order to predict the closing price of an auction given the currently observed, sparse bid trajectory while the auction is ongoing.

In this setting, computing the integral in (2.1), and hence fitting the index model, becomes considerably more challenging. To address this issue I propose a new errors-in-variables approach (Carroll et al. 2006), named Sparse Index Model Functional Estimation (SIMFE), which also implements an index model but provides several advantages over the previously discussed approaches. This chapter is structured as follows. In Section 2.2 I present the SIMFE model and develop fitting procedures for both scalar and functional responses. I also present an extension of SIMFE which adjusts for possible bias in the estimate of the nonlinear function $m(\cdot)$ in situations where the predictors are observed at different time points for each individual.
Theoretical results are presented in Section 2.3, which demonstrate that even for sparsely observed predictors SIMFE is consistent in estimating the space spanned by the set of index functions and has a faster rate of convergence than other potential approaches. Section 2.4 illustrates the performance of SIMFE on an extensive set of simulations covering both scalar and functional responses. Section 2.5 applies SIMFE to the online auction data set containing sparsely observed predictors. I conclude Chapter 2 with some possible future directions for this work in Section 2.6.

2.2 SIMFE

In Section 2.2.1 I present the SIMFE model for scalar responses, and I develop a fitting procedure in Section 2.2.2. SIMFE is extended to functional responses in Section 2.2.3. In Section 2.2.4 I present a bias corrected version of SIMFE for situations where predictors are observed at differing sets of time points. Finally, in Section 2.2.5 I develop an approach to select the tuning parameters.

2.2.1 Scalar Response

In the scalar response setting I observe $p$ functional predictors, $X_{i1}(t), \ldots, X_{ip}(t)$, and a scalar response, $Y_i$, where $i = 1, \ldots, n$. For concreteness, I will assume that the domain of each predictor is $[0, 1]$; however, all of the presented methods and conclusions remain valid when the predictors have different domains. Without loss of generality I can model

$Y_i = m_0(\mathbf{X}_i) + \varepsilon_i,$    (2.2)

where $\mathbf{X}_i = (X_{i1}, \ldots, X_{ip})$ and $E(\varepsilon_i \mid \mathbf{X}_i) = 0$. However, (2.2) is too general to fit in practice because $m_0$ is a function of $p$ infinite dimensional objects. A particularly natural approach to deal with the infinite dimension of the predictors involves restricting $m_0$ to be a function of linear projections of $\mathbf{X}_i$:

$m_0(\mathbf{X}_i) = m(\mathbf{P}_i),$    (2.3)

where $m(\mathbf{P}_i) \equiv m(\mathbf{P}_{i1}, \ldots, \mathbf{P}_{ip})$,

$\mathbf{P}_{ij} = \left( \int \beta_{j1}(t) X_{ij}(t)\,dt, \ldots, \int \beta_{j d_j}(t) X_{ij}(t)\,dt \right)$    (2.4)

and $d_j$ represents the dimension of the space into which $X_{ij}(t)$ is projected. Hence, $\mathbf{P}_{ij}$ is the $d_j$-dimensional linear projection of $X_{ij}(t)$ formed from the index functions, $\beta_{j1}(t), \ldots, \beta_{j d_j}(t)$. I can now rewrite equation (2.2) as follows:

$Y_i = m(\mathbf{P}_i) + \varepsilon_i.$    (2.5)

Equation (2.5) is a functional version of a multi-index model, which has been considered previously in James and Silverman (2005) and Chen et al. (2011). However, these previous methods mainly dealt with the situation involving a single densely sampled predictor. In this chapter I consider the common situation involving multiple very sparsely observed predictors.

To deal with the sparsely observed predictors I borrow strength across all observations by modeling $\mathbf{X}_i(t)$ as coming from a Gaussian process with mean $\mu(t)$ and covariance $\Gamma(s,t)$:

$\mathbf{X}_i(t) \sim G(\mu, \Gamma).$    (2.6)

Further, I assume that $X_{ij}(t)$ is observed, with measurement error, at $T_{ij}$ time points, $\mathbf{t}_{ij} = (t_{ij1}, \ldots, t_{ijT_{ij}})$. In particular, let $W_{ijk}$ represent the observed value of $X_{ijk} = X_{ij}(t_{ijk})$. Then

$W_{ijk} = X_{ijk} + e_{ijk},$    (2.7)

where $i = 1, \ldots, n$, $j = 1, \ldots, p$, $k = 1, \ldots, T_{ij}$ and the $e_{ijk}$'s are iid $N(0, \sigma^2)$. I will see shortly that (2.6) and (2.7) allow us to fit a mixed effects model to infer the entire predictor trajectory based on only a small number of observations. The SIMFE model is specified by equations (2.5), (2.6) and (2.7).

2.2.2 The SIMFE Fitting Method

For scalar responses the SIMFE model states that $Y_i = m(\mathbf{P}_i) + \varepsilon_i$, where $E(\varepsilon_i \mid \mathbf{X}_i) = 0$. Hence, the fitting methods of James and Silverman (2005) or Chen et al. (2011) could be used to implement SIMFE, provided that the predictors were fully observed.
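To fix ideas about what the observed data look like under (2.6) and (2.7), the following is a minimal simulation sketch (not code from the thesis): each curve is represented in a small Fourier basis and observed with noise at a handful of random time points. The basis choice, the score covariance and the noise level are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, q, sigma = 150, 5, 0.3             # curves, basis dimension, noise sd

def fourier_basis(t, q=5):
    """Orthonormal Fourier basis s(t) on [0, 1] (illustrative choice of basis)."""
    cols = [np.ones_like(t)]
    for k in range(1, (q + 1) // 2 + 1):
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * k * t))
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * k * t))
    return np.column_stack(cols)[:, :q]

mu = np.zeros(q)
Sigma = np.diag(1.0 / (1 + np.arange(q)))   # decaying scores, mimicking a smooth process

curves = []
for i in range(n):
    theta_i = rng.multivariate_normal(mu, Sigma)        # X_i(t) = s(t)^T theta_i, as in (2.6)
    T_i = rng.integers(3, 8)                            # sparse: 3-7 time points per curve
    t_i = np.sort(rng.uniform(0.0, 1.0, size=T_i))
    W_i = fourier_basis(t_i, q) @ theta_i + sigma * rng.normal(size=T_i)   # as in (2.7)
    curves.append((t_i, W_i))
print(curves[0])
```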
Unfortunately, for sparse functional predictors, rather than $X_{ij}(t)$ I instead observe only $\mathbf{W}_i = (\mathbf{W}_{i1}, \ldots, \mathbf{W}_{ip})$, where $\mathbf{W}_{ij} = (W_{ij1}, \ldots, W_{ijT_{ij}})$. However, Theorem 1 shows that, by replacing $X_{ij}(t)$ with its conditional expectation, the response can still be represented using an index model with exactly the same index functions as if $X_{ij}(t)$ had been fully observed.

Theorem 1. Let $\tilde{X}_{ij}(t) = E[X_{ij}(t) \mid \mathbf{W}_i]$ and

$\tilde{\mathbf{P}}_{ij} = \left( \int \beta_{j1}(t) \tilde{X}_{ij}(t)\,dt, \ldots, \int \beta_{j d_j}(t) \tilde{X}_{ij}(t)\,dt \right).$    (2.8)

Then, under the SIMFE model,

$Y_i = \tilde{m}_{\mathbf{t}_i}(\tilde{\mathbf{P}}_i) + \varepsilon_i,$    (2.9)

where $\mathbf{t}_i = (\mathbf{t}_{i1}, \ldots, \mathbf{t}_{ip})$ and $E(\varepsilon_i \mid \mathbf{W}_i) = 0$.

The proof of Theorem 1 is provided in Appendix A.2.1. Notice that the definitions of $\mathbf{P}_{ij}$ and $\tilde{\mathbf{P}}_{ij}$ are identical except that $X_{ij}(t)$ in (2.4) is replaced by $\tilde{X}_{ij}(t)$ in (2.8). In particular, the same index functions, $\beta_{jk}(t)$, are used in both definitions, so fitting an index model using $\tilde{X}_{ij}(t)$ will produce estimates for the same index functions as one would obtain when using the original $X_{ij}(t)$ predictors. Theorem 1 is an important result because it suggests a two step approach for fitting SIMFE: first estimate $\tilde{X}_{ij}(t)$, and second fit the resulting multi-index model given by (2.9).

Theorem 1 does imply one significant complication in fitting (2.9): the function $\tilde{m}$ depends on both $\tilde{\mathbf{P}}_i$ and the time points $\mathbf{t}_i$. If the time points are common to all individuals, i.e. $\mathbf{t}_i = \mathbf{t}$ for all $i$, then (2.9) reduces to

$Y_i = \tilde{m}(\tilde{\mathbf{P}}_i) + \varepsilon_i,$    (2.10)

and fitting $\tilde{m}(\cdot)$ is feasible. However, if the time points differ among individuals, then I am potentially faced with the highly challenging problem of estimating $n$ separate $\tilde{m}_{\mathbf{t}_i}(\cdot)$ functions from only $n$ observations. In the following two sections I first develop an approach for estimating $\tilde{X}_{ij}(t)$ and then propose a method for fitting (2.9) in the easier setting where (2.10) holds. Then in Section 2.2.4 I consider the more challenging setting where the $\mathbf{t}_i$ differ across individuals.

Estimating $\tilde{X}_{ij}(t)$

In order to fit the SIMFE model some form of smoothness constraint is required for $X_{ij}(t)$ and $\beta_{jk}(t)$. I take the standard approach of modeling these functions using $q_j$-dimensional orthogonal basis functions, $\mathbf{s}_j(t)$, such that $\int \mathbf{s}_j(t) \mathbf{s}_j(t)^T dt = I$. Hence,

$X_{ij}(t) = \mathbf{s}_j(t)^T \theta_{ij} \quad \text{and} \quad \beta_{jk}(t) = \mathbf{s}_j(t)^T \eta_{jk},$    (2.11)

where $\theta_{ij}$ and $\eta_{jk}$ are respectively the basis coefficients for the predictors and the index functions. Using (2.11) the projection, $\mathbf{P}_{ij}$, becomes $\mathbf{P}_{ij} = (\eta_{j1}^T \theta_{ij}, \ldots, \eta_{j d_j}^T \theta_{ij})^T$, and the Gaussian process model for $\mathbf{X}_i(t)$ implies that

$\theta_i = (\theta_{i1}^T, \ldots, \theta_{ip}^T)^T \sim N(\mu, \Sigma).$    (2.12)

Let $\mathbf{s}(\cdot)$ be a matrix valued function such that, for each $t$, the matrix $\mathbf{s}(t)$ is block diagonal with the $j$-th block given by the column vector $\mathbf{s}_j(t)$. Then $E(\mathbf{X}_i(t)) = \mu(t) = \mu^T \mathbf{s}(t)$ and $\mathrm{Cov}(\mathbf{X}_i(t), \mathbf{X}_i(u)) = \Gamma(t,u) = \mathbf{s}(t)^T \Sigma\, \mathbf{s}(u)$. Let $\mathbf{S}_j$ be the $T_j$ by $q_j$ dimensional basis matrix with $k$th row $\mathbf{s}_j(t_{jk})$, and let $\mathbf{S}$ be the block diagonal matrix with $j$-th block $\mathbf{S}_j$. Standard calculations show that (2.7), (2.11) and (2.12) imply

$\mathbf{X}_i(t) \mid \mathbf{W}_i \sim N\left( \mathbf{s}(t)^T \tilde{\theta}_i,\; \mathbf{s}(t)^T \tilde{\Sigma}\, \mathbf{s}(t) \right),$

where

$\tilde{\theta}_i = (\tilde{\theta}_{i1}^T, \ldots, \tilde{\theta}_{ip}^T)^T = \tilde{\Sigma} \left( \Sigma^{-1} \mu + \frac{1}{\sigma^2} \mathbf{S}^T \mathbf{W}_i \right)$    (2.13)

and

$\tilde{\Sigma} = \left( \Sigma^{-1} + \frac{1}{\sigma^2} \mathbf{S}^T \mathbf{S} \right)^{-1}.$    (2.14)

Hence,

$\tilde{X}_{ij}(t) = E(X_{ij}(t) \mid \mathbf{W}_{ij}) = \mathbf{s}_j(t)^T \tilde{\theta}_{ij}.$    (2.15)

In practice $\mu$, $\Sigma$ and $\sigma^2$ are unknown, so I need to estimate $\tilde{X}_{ij}(t)$ using $\hat{X}_{ij}(t) = \mathbf{s}_j(t)^T \hat{\theta}_{ij}$, where $\hat{\theta}_i = (\hat{\theta}_{i1}^T, \ldots, \hat{\theta}_{ip}^T)^T$ is computed by inserting the appropriate parameter estimates into (2.13). I can estimate $\mu$, $\Sigma$ and $\sigma^2$ using an EM algorithm, the details of which are provided in Appendix A.1.
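Since the conditional expectation in (2.13)-(2.15) is a ridge-type Gaussian posterior mean, it is easy to sketch for a single predictor. The snippet below is a minimal illustration, not the thesis implementation: it assumes the mean, covariance and noise variance are known (in practice they come from the EM algorithm of Appendix A.1), and the small polynomial basis and the observation values are hypothetical.

```python
import numpy as np

def conditional_curve(t_obs, W_obs, basis, mu, Sigma, sigma2, t_grid):
    """Sketch of (2.13)-(2.15) for one predictor: conditional expectation of the basis
    coefficients given sparse noisy observations, evaluated on a dense grid."""
    S = basis(t_obs)                                    # T_i x q design matrix
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma_tilde = np.linalg.inv(Sigma_inv + S.T @ S / sigma2)              # eq. (2.14)
    theta_tilde = Sigma_tilde @ (Sigma_inv @ mu + S.T @ W_obs / sigma2)    # eq. (2.13)
    return basis(t_grid) @ theta_tilde                  # X_tilde(t) = s(t)^T theta_tilde, (2.15)

# Hypothetical inputs with a small polynomial basis
basis = lambda t: np.column_stack([np.ones_like(t), t, t ** 2])
q = 3
mu, Sigma, sigma2 = np.zeros(q), np.eye(q), 0.05
t_obs = np.array([0.1, 0.5, 0.9])                       # sparse observation times
W_obs = np.array([0.2, 1.1, 0.4])                       # noisy measurements
t_grid = np.linspace(0.0, 1.0, 11)
print(conditional_curve(t_obs, W_obs, basis, mu, Sigma, sigma2, t_grid))
```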
Note that (2.13) and (2.14) can be significantly simplified under the assumption that the predictors are independent. In my experience this simplified version of SIMFE often performs competitively relative to the more general version, even in settings with correlated predictors.

Fitting the Multi-Index Model

To fit the multi-index model a natural population criterion to minimize is

$E_{Y,\mathbf{X}} \left( Y - \tilde{m}(\tilde{\mathbf{P}}) \right)^2 = E_{\mathbf{X}} \left[ E_Y \left\{ \left( Y - \tilde{m}(\tilde{\mathbf{P}}) \right)^2 \mid \mathbf{X} \right\} \right],$

which can be approximated for a finite sample by

$E_{\mathbf{X}} \left[ E_Y \left\{ \left( Y - \tilde{m}(\tilde{\mathbf{P}}) \right)^2 \mid \mathbf{X} \right\} \right] \approx \frac{1}{n} \sum_{i=1}^n E_Y \left\{ \left( Y - \tilde{m}(\tilde{\mathbf{P}}) \right)^2 \mid \mathbf{X} = \mathbf{X}_i \right\}.$    (2.16)

I approximate $E_Y \{ ( Y - \tilde{m}(\tilde{\mathbf{P}}) )^2 \mid \mathbf{X} = \mathbf{X}_i \}$ using the local linear regression

$\sum_{l=1}^n \left( Y_l - a_i - \sum_{j=1}^p \hat{\mathbf{P}}_{lj}^T \mathbf{c}_{ij} \right)^2 K_{il},$    (2.17)

where

$K_{il} = K_h(\hat{\mathbf{P}}_l - \hat{\mathbf{P}}_i)$    (2.18)

is an appropriate kernel function with bandwidth $h$ and $\hat{\mathbf{P}}_k = (\hat{\mathbf{P}}_{k1}, \ldots, \hat{\mathbf{P}}_{kp})$. Note that $\hat{\mathbf{P}}_{kj}$ is identical to $\tilde{\mathbf{P}}_{kj}$ except that $\tilde{X}_{ij}(t)$ is replaced by $\hat{X}_{ij}(t)$. I use a Gaussian kernel with the optimal bandwidth

$h_{opt} = \left( 4/(d+2) \right)^{1/(d+4)} n^{-1/(d+4)},$    (2.19)

where $d = \sum_{j=1}^p d_j$ is the dimension of the kernel (Silverman 1999). Combining (2.16) and (2.17) gives the finite sample criterion that SIMFE minimizes:

$\frac{1}{n} \sum_{i=1}^n \sum_{l=1}^n \left( Y_l - a_i - \sum_{j=1}^p \hat{\mathbf{P}}_{lj}^T \mathbf{c}_{ij} \right)^2 K_{il}.$    (2.20)

Note that the resulting estimator does not change if $\hat{\mathbf{P}}_{lj}$ is replaced with $\hat{\mathbf{P}}_{lj} - \hat{\mathbf{P}}_{ij}$ in the above display, as is often done in the literature. To minimize (2.20) I use a two step iteration in which I first estimate $\tilde{m}(\cdot)$, and second compute the $\beta_{jk}(t)$ functions given $\tilde{m}(\cdot)$. I describe these two steps below.

Step One. Let $\gamma_i = (a_i, \mathbf{c}_{i1}^T, \ldots, \mathbf{c}_{ip}^T)^T$ and $\mathbf{R}_l = (1, \hat{\mathbf{P}}_{l1}^T, \ldots, \hat{\mathbf{P}}_{lp}^T)^T$. Then (2.20) can be written as

$\frac{1}{n} \sum_{i=1}^n \sum_{l=1}^n K_{il} \left( Y_l - \mathbf{R}_l^T \gamma_i \right)^2,$

whose minimum is obtained by setting

$\hat{\gamma}_i = \left( \sum_{l=1}^n K_{il} \mathbf{R}_l \mathbf{R}_l^T \right)^{-1} \sum_{l=1}^n K_{il} Y_l \mathbf{R}_l, \qquad i = 1, \ldots, n.$    (2.21)

Step Two. Next, I estimate $\beta_{jk}(t)$ for $j = 1, \ldots, p$ and $k = 1, \ldots, d_j$, given the current estimate of $\tilde{m}(\cdot)$. Let

$\mathbf{Q}_{il} = \begin{pmatrix} \mathbf{c}_{i1} \otimes \int \hat{X}_{l1}(t) \mathbf{s}_1(t)\,dt \\ \vdots \\ \mathbf{c}_{ip} \otimes \int \hat{X}_{lp}(t) \mathbf{s}_p(t)\,dt \end{pmatrix} = \begin{pmatrix} \mathbf{c}_{i1} \otimes \hat{\theta}_{l1} \\ \vdots \\ \mathbf{c}_{ip} \otimes \hat{\theta}_{lp} \end{pmatrix}$

and $\eta = (\eta_{11}^T, \ldots, \eta_{1d_1}^T, \ldots, \eta_{p1}^T, \ldots, \eta_{p d_p}^T)^T$, where $\otimes$ is the Kronecker product. It is not hard to show that the estimate of $\eta$ that minimizes (2.20) is given by

$\hat{\eta} = \left( \sum_{i=1}^n \sum_{l=1}^n K_{il} \mathbf{Q}_{il} \mathbf{Q}_{il}^T \right)^{-1} \sum_{i=1}^n \sum_{l=1}^n K_{il} \mathbf{Q}_{il} (Y_l - a_i).$    (2.22)

Hence,

$\hat{\beta}_{jk}(t) = \mathbf{s}_j(t)^T \hat{\eta}_{jk}, \qquad j = 1, \ldots, p, \quad k = 1, \ldots, d_j,$    (2.23)

where $\hat{\eta}_{jk}$ represents the corresponding elements of $\hat{\eta}$.

Scalar Response SIMFE Algorithm

My theoretical results in Section 2.3 show that initially fitting SIMFE using a bandwidth larger than $h_{opt}$ in (2.19), and then iteratively reducing $h$, provides a good rate of convergence. Hence, the scalar response version of the SIMFE algorithm is summarized in Algorithm 1.

Algorithm 1: Scalar Response SIMFE Algorithm
1. Compute the $\hat{X}_{ij}(t)$'s by plugging the parameter estimates from the EM algorithm into (2.15).
2. Initialize the $\hat{\beta}_{jk}(t)$'s and set $h \propto n^{-1/(\tilde{p}+4)}$, where $\tilde{p} = \sum_{j=1}^p q_j$.
3. (a) Estimate $\tilde{m}(\hat{\mathbf{P}})$ using the linear approximation given by (2.21).
   (b) Compute the $\hat{\beta}_{jk}(t)$'s via (2.22) and (2.23).
   (c) Repeat Steps (a) and (b) until convergence.
4. Set $h \leftarrow ch$ for some $c < 1$.
5. Iterate Steps 3 and 4 until $h \leq h_{opt}$.

I initialize the $\hat{\beta}_{jk}(t)$'s in Step 2 by fitting the groupwise Outer Product of Gradients (gOPG) estimator (Li et al. 2010) to the $\hat{\theta}_{ij}$'s from Step 1, using a bandwidth of $h \propto n^{-1/(\tilde{p}+4)}$. Empirically, I have found that the SIMFE algorithm is stable and typically converges quickly.
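As a concrete illustration of Step One, the following sketch performs one pass of the locally weighted least squares fit in (2.21) for a single predictor projected to $d = 1$, with Gaussian kernel weights and the bandwidth (2.19). The simulated projections and response are illustrative; the full algorithm would alternate this step with (2.22)-(2.23) and progressively shrink the bandwidth.

```python
import numpy as np

def step_one(P_hat, y, h):
    """One pass of Step One, eq. (2.21): for each i, a locally weighted least squares
    fit of y on the current projections (sketch for one predictor with d = 1)."""
    n = len(y)
    coefs = np.empty((n, 2))                                 # rows are (a_i, c_i)
    for i in range(n):
        K = np.exp(-0.5 * ((P_hat - P_hat[i]) / h) ** 2)     # K_il as in (2.18)
        R = np.column_stack([np.ones(n), P_hat])             # R_l = (1, P_hat_l)^T
        A = R.T @ (K[:, None] * R)
        b = R.T @ (K * y)
        coefs[i] = np.linalg.solve(A, b)
    return coefs

rng = np.random.default_rng(3)
P_hat = rng.normal(size=200)                                 # current estimated projections
y = np.sin(P_hat) + 0.1 * rng.normal(size=200)
d = 1
h_opt = (4.0 / (d + 2)) ** (1.0 / (d + 4)) * 200 ** (-1.0 / (d + 4))   # eq. (2.19)
print(step_one(P_hat, y, h_opt)[:3])
```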
Predicting the Response

Given a new observation $\mathbf{W}^*$ I will often wish to form a prediction for the corresponding response $Y^*$. Once SIMFE has been fitted I can compute $\hat{\mathbf{X}}^*(t)$, and hence the $\hat{\mathbf{P}}^*$ corresponding to the new observation. Because $\tilde{m}(\cdot)$ is computed using a linear approximation, I have $n$ different predictions for $\hat{Y}^*$, i.e. $\tilde{m}_i(\hat{\mathbf{P}}^*) = a_i + \sum_j \hat{\mathbf{P}}_j^{*T} \mathbf{c}_{ij}$. An obvious approach to produce a single $\hat{Y}^*$ is to weight the individual predictions according to the difference between $\hat{\mathbf{P}}_i$ and $\hat{\mathbf{P}}^*$. Hence, I use the following weighted average:

$\hat{Y}^* = \tilde{m}(\hat{\mathbf{P}}^*) = \sum_{i=1}^n \tilde{m}_i(\hat{\mathbf{P}}^*) w_i = \sum_{i=1}^n \left( a_i + \sum_j \hat{\mathbf{P}}_j^{*T} \mathbf{c}_{ij} \right) w_i,$    (2.24)

where $w_i = K_{i*}/\sum_l K_{l*}$ and $K_{i*}$ is computed using (2.18).

2.2.3 Functional Response

In the functional response setting I observe $p$ functional predictors, $X_{i1}(t), \ldots, X_{ip}(t)$, and a functional response, $Y_i(s)$, where $i = 1, \ldots, n$ and $Y_i$ is observed at the points $s_{i1}, \ldots, s_{in_i}$. Let $Y_{ik} = Y_i(s_{ik})$. A natural approach to extend (2.5) to functional responses is to model $E(Y(s) \mid \mathbf{X}) = \mu_Y(s)$ as a function of both $\mathbf{P}_1, \ldots, \mathbf{P}_p$ and $s$:

$\mu_Y(s) = m(s, \mathbf{P}).$

To fit the multi-index model I attempt to minimize the population criterion

$\int E_{Y,\mathbf{X}} \left\{ \left( Y(s) - m(s, \mathbf{P}) \right)^2 \right\} ds = \int E_{\mathbf{X}} \left[ E_Y \left\{ \left( Y(s) - m(s, \mathbf{P}) \right)^2 \mid \mathbf{X} \right\} \right] ds,$

which can be approximated for a finite sample by

$\frac{1}{\sum_{i=1}^n n_i} \sum_{i,k} E_Y \left[ \left( Y(s) - m(s, \mathbf{P}) \right)^2 \mid \mathbf{X} = \mathbf{X}_i, s = s_{ik} \right].$    (2.25)

Again I approximate $E_Y [ ( Y(s) - m(s, \mathbf{P}) )^2 \mid \mathbf{X} = \mathbf{X}_i, s = s_{ik} ]$ using a local linear regression

$\sum_{l,k'} \left( Y_{lk'} - a_{ik} - c^s_{ik} s_{lk'} - \sum_{j=1}^p \hat{\mathbf{P}}_{lj}^T \mathbf{c}_{ikj} \right)^2 K_{ik,lk'},$    (2.26)

where $K_{ik,lk'} = K_h(s_{lk'} - s_{ik}, \hat{\mathbf{P}}_l - \hat{\mathbf{P}}_i)$ is an appropriate kernel function with bandwidth $h$. Note that the $d$ in (2.19) is replaced by $\tilde{d} = 1 + \sum_{j=1}^p d_j$ because the kernel is a function of one additional dimension, $s$. Combining (2.25) and (2.26) gives the functional response finite sample criterion that SIMFE minimizes:

$\frac{1}{\sum_{i=1}^n n_i} \sum_{i,k} \sum_{l,k'} \left( Y_{lk'} - a_{ik} - c^s_{ik} s_{lk'} - \sum_{j=1}^p \hat{\mathbf{P}}_{lj}^T \mathbf{c}_{ikj} \right)^2 K_{ik,lk'}.$    (2.27)

As in the scalar response setting, I can minimize (2.27) using a two step iteration: first estimate $m(s, \hat{\mathbf{P}})$, and second compute the $\beta_{jk}(t)$ functions given $m$. The details of each step are as follows.

Step One. Let $\gamma_{ik} = (a_{ik}, c^s_{ik}, \mathbf{c}_{ik1}^T, \ldots, \mathbf{c}_{ikp}^T)^T$ and $\mathbf{R}_{lk'} = (1, s_{lk'}, \hat{\mathbf{P}}_{l1}^T, \ldots, \hat{\mathbf{P}}_{lp}^T)^T$. Then (2.27) can be written as

$\frac{1}{\sum_{i=1}^n n_i} \sum_{i,k} \sum_{l,k'} K_{ik,lk'} \left( Y_{lk'} - \mathbf{R}_{lk'}^T \gamma_{ik} \right)^2,$

whose minimum is obtained by setting

$\hat{\gamma}_{ik} = \left( \sum_{l,k'} K_{ik,lk'} \mathbf{R}_{lk'} \mathbf{R}_{lk'}^T \right)^{-1} \sum_{l,k'} K_{ik,lk'} Y_{lk'} \mathbf{R}_{lk'},$    (2.28)

for $i = 1, \ldots, n$ and $k = 1, \ldots, n_i$.

Step Two. Next, I estimate $\beta_{jk}(t)$ for $j = 1, \ldots, p$ and $k = 1, \ldots, d_j$, given the current estimate of $m(s, \hat{\mathbf{P}}_1, \ldots, \hat{\mathbf{P}}_p)$. Let

$\mathbf{Q}_{ikl} = \begin{pmatrix} \mathbf{c}_{ik1} \otimes \int \hat{X}_{l1}(t) \mathbf{s}_1(t)\,dt \\ \vdots \\ \mathbf{c}_{ikp} \otimes \int \hat{X}_{lp}(t) \mathbf{s}_p(t)\,dt \end{pmatrix} = \begin{pmatrix} \mathbf{c}_{ik1} \otimes \hat{\theta}_{l1} \\ \vdots \\ \mathbf{c}_{ikp} \otimes \hat{\theta}_{lp} \end{pmatrix}$

and $\eta = (\eta_{11}^T, \ldots, \eta_{1d_1}^T, \ldots, \eta_{p1}^T, \ldots, \eta_{p d_p}^T)^T$, where $\otimes$ is the Kronecker product. It is not hard to show that the estimate of $\eta$ that minimizes (2.27) is given by

$\hat{\eta} = \left( \sum_{i,k} \sum_{l,k'} K_{ik,lk'} \mathbf{Q}_{ikl} \mathbf{Q}_{ikl}^T \right)^{-1} \sum_{i,k} \sum_{l,k'} K_{ik,lk'} \mathbf{Q}_{ikl} \left( Y_{lk'} - a_{ik} - c^s_{ik} s_{lk'} \right).$    (2.29)

Hence,

$\hat{\beta}_{jk}(t) = \mathbf{s}_j(t)^T \hat{\eta}_{jk}, \qquad j = 1, \ldots, p, \quad k = 1, \ldots, d_j,$    (2.30)

where $\hat{\eta}_{jk}$ represents the corresponding elements of $\hat{\eta}$. Combining the above two steps gives the functional response version of the SIMFE algorithm, which is summarized in Algorithm 2.

Algorithm 2: Functional Response SIMFE Algorithm
1. Compute the $\hat{X}_{ij}(t)$'s by plugging the parameter estimates from the EM algorithm into (2.15).
2. Initialize the $\hat{\beta}_{jk}(t)$'s and set $h \propto n^{-1/(\tilde{p}+4)}$, where $\tilde{p} = 1 + \sum_{j=1}^p q_j$.
3.
(a) Estimatem(s; b P) using the linear approximation given by (2.28). (b) Compute the b jk (t)’s via (2.29) and (2.30). (c) Repeat Steps (a) and (b) until convergence. 4. Seth ch for somec< 1. 5. Iterate Steps 3. and 4. untilhh opt . 2.2.4 Bias Corrected SIMFE I now return to the situation where t i may differ among individuals. There is in principle nothing preventing us from still fitting the previously described approach. They only change that is required is to replace S j introduced in Section 2.2.2 with a T ij by q j matrix, S ij , whose kth row is given by s j (t ijk ). The corresponding block-diagonal matrix S is then replaced in formulas (2.13), (2.14) and (A.3) by S i , which is constructed using matrices S ij . Indeed, I successfully implement this version of SIMFE in Sections 2.4 and 2.5. I refer to this implementation of my method as Base SIMFE for the remainder of Chapter 2. However, in the setting with differing time points this version of SIMFE suffers from a potential biased estimation problem. 21 To better understand the difficulty with differing time points, consider the expected response,Y i , condi- tional on the noisy observation of the predictors, W i . If the predictors are observed at the set of time points t i , then I have: E(Y i jW i ) = E X (E Y (Y i jX i ; W i )jW i ) = E X (m (P i )jW i ) = E U m e P i + U i jW i = E U m e P i + U i = e m t i e P i (2.31) where U i = P i e P i and e m t i () = E U (m ( + U i )). Equation (2.31) follows because U i is statistically independent of W i ; a fact which I demonstrate in the proof of Theorem 1. However, while U i is statistically independent of W i , the distribution of U i is a function of the time points over which W i is observed. If W i is observed at the same set of points for alli then the distribution of U i is the same for alli and hence, e m t i () = e m(). In other words, I only need estimate a single function. Unfortunately, when the time points differ among observations, functione m t i () depends on the locations of the points. Hence, Base SIMFE, which estimates a singlee m(), essentially takes a weighted average of all the functionse m t i (), which has the potential to produce biased parameter estimates. In this section I explore the bias corrected SIMFE, simply referred to as SIMFE from here on, which estimatese m t i () separately for each observation. Recall from (2.17) that Base SIMFE approximates E Y Ye m( e P) 2 jX = X i (2.32) using a weighted sum of squared errors betweenY l ande m( b P l ), where the weight is a decreasing function of the distance between b P i and b P l . This weighting scheme makes sense ife m() is the same for all observations but may be inappropriate if X i (t) and X l (t) are measured at very different time points, because in that case e m t i ()6= e m t l (). The bias corrected implementation of SIMFE overcomes this deficiency by observing 22 that, for a smoothm(), if the distributions of U i and U l are similar then e m t i () e m t l (). Hence, a more appropriate approximation to (2.32) is given by n X l=1 0 @ Y l a i p X j=1 b P lj c ij 1 A 2 e K il ; (2.33) where e K il is a decreasing function of both the distance between b P i and b P l , and the difference in the distri- butions between U i and U l . This is achieved by setting e K il =K (1) il K (2) il (2.34) whereK (1) il =K h 1 ( b P i b P l ),K (2) il =K h 2 ( b D il ), where b D il is an estimate ofD KL (U i ; U l ), the Kullback- Leibler divergence between the probability distributions of U i and U l . 
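As a concrete illustration of the bias-corrected kernel (2.34), the following sketch (not the dissertation's code) multiplies a Gaussian kernel on the distance between projections by a Gaussian kernel applied to a supplied matrix of estimated divergences D_hat; the closed form used to obtain D_hat is given in the next paragraph. The Gaussian form of the second kernel and the array names are assumptions made purely for illustration.

import numpy as np

def kl_divergence_gaussian(Sigma_i, Sigma_l):
    # Standard closed-form KL divergence between two mean-zero multivariate
    # normals, D_KL(N(0, Sigma_i) || N(0, Sigma_l)); used here to fill D_hat[i, l].
    d = Sigma_i.shape[0]
    _, logdet_i = np.linalg.slogdet(Sigma_i)
    _, logdet_l = np.linalg.slogdet(Sigma_l)
    return 0.5 * (np.trace(np.linalg.solve(Sigma_l, Sigma_i)) - (logdet_i - logdet_l) - d)

def corrected_kernel(P_hat, D_hat, h1, h2):
    # P_hat: (n, d) current projections; D_hat: (n, n) estimated divergences.
    # Returns K_tilde[i, l] = K_h1(P_hat_i - P_hat_l) * K_h2(D_hat[i, l]), as in (2.34).
    diffs = P_hat[:, None, :] - P_hat[None, :, :]
    K1 = np.exp(-0.5 * np.sum((diffs / h1) ** 2, axis=2))   # kernel on projection distance
    K2 = np.exp(-0.5 * (D_hat / h2) ** 2)                   # kernel on divergence of U_i, U_l
    return K1 * K2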
Let j be aq j byd j matrix, whosek-th column is jk and define [] as a block diagonal matrix, with thej-th block given by j . One can show that U i = (U i1 ; ; U ip )N (0; i ) where i = [] T (I i S i )(I i S i ) T [] + 2 [] T i T i [] (2.35) and i = ( 1 + 1 2 S T i S i ) 1 S T i = 2 . Letd = P p j=1 d j : The Kullback-Leibler divergence between two multivariate normal distributions of U i and U l is given by D KL (U i ; U l ) = 1 2 p X j=1 tr( 1 l i ) log det i det l d ; and b D il is obtained by replaced the parameter and 2 and [] in the above formula with their estimates. Thus, given a set of time points, one can compute the Kullback-Leibler divergence between U i and U l for anyi andl, and hence obtainK (2) il and e K il . The new implementation of SIMFE still uses the same algorithm to minimize (2.20) in the scalar response setting, or to minimize (2.27) for functional responses, with the only change being that the kernel functionK is replaced by the kernel e K, defined in (2.34). Notice that if all predictors are observed at the same set of time points then S i = S l for alli andl. In this case b D il = 0 so that b K il /K h 1 ( b P i b P l ), and the new implementation reduces to that of Base SIMFE. Alternatively, in situations where time points differ among observations, the new approach will place most weight on observations wherei andl are similar both 23 in terms of b P i versus b P l and in terms of the distributions of U i and U l . The functional response version of SIMFE can be extended to implement the bias corrected approach using an analogous expansion of the kernel. While the two SIMFE approaches are similar, there is a major distinction between the methods in terms of their estimates for e m(). Consider for example an extreme situation where the predictors are all observed over one of two possible time point configurations, t 1 and t 2 . Then Base SIMFE would ignore the difference in the time points and produce a single estimate for e m(). However, the new implementation of SIMFE would essentially take the observations from one configuration to estimate e m t 1 () and use the remaining observations to estimate e m t 2 (). If the two e m() functions were sufficiently different from each other, I would expect the new SIMFE estimate to be superior to that of the Base SIMFE. I will predict a responseY based on a new observation W using the same approach as the Base SIMFE (2.24). The only difference is that the weightw i is computed using e K rather thanK. 2.2.5 Selection of tuning parameters Implementing SIMFE requires selectingq j (the dimension of the basis function s j (t)), andd j (the dimension into whichX ij (t) will be projected). I useK-fold cross-validation to obtainq j . In particular, for a given candidate value ofq j I remove 1=Kth of the observed time points for eachX ij (t) as a validation set, fit a random effects model to the remaining observations, calculate the squared error between W ij and b X ij (t) on the validation set, and repeat the procedureK times until each time point has been held out once. This procedure is repeated over a grid ofq j values, and the dimension corresponding to the lowest cross-validated squared error is selected. I repeat this approach for each of thep predictors. In the implementation of the Base SIMFE approach the final bandwidth, h opt , is set proportional to n 1=(d+4) , where d is the total reduced dimension for the predictors. 
In the bias corrected implemen- tation the reduced dimensiond is increased to account for the fact that the link e m in equation (2.31) is a function not only of the projection e P i , but also of the matrix i . Thus, the new reduced dimension, e d, should be set equal to the sum ofd and the number of the unique nonzero elements in i , i.e. e d =d +d(d + 1)=2. The final bandwidth is then set proportional ton 1=( e d+4) . Because I use the Kullback-Leibler divergence in 24 the definition of the kernel, the choice of the proportionality constant in equation (2.19) is no longer justi- fied. For the empirical work in this paper, I set this constant to 2, which gave good results in all the settings I considered. Cross-validation could also be used to selectd j , but this approach suffers from a heavy computational burden. Instead, I extend the criterion that Li et al. (2010) developed for g-MA VE. Let L(d c1 ; ;d cp ) be the minimum value of the SIMFE criterion function for a candidate set of dimensionsfd c1 ; ;d cp g. Defined c = d c1 + +d cp , set e d c = d c +d c (d c + 1)=2 and leth n = h n ( e d c ) be the bandwidth chosen as describe in the paragraph above. I estimate dimensionsd 1 ; ;d p as e d 1 ; ; e d p , which are selected to minimize the following finite sample criterion: log (L(d c1 ; ;d cp )) + d c logn nh n dc : (2.36) The next result, which is proved in Appendix A.2.2, demonstrates that this criterion leads to consistent estimation of the dimensionsd j . Theorem 2. Under assumptions A1,A2, C3-C5, given in the Appendix A.2.3 P n b d 1 ; ; b d p = (d 1 ; ;d p ) o ! 1 as n!1: 2.3 Theory In the standard non-functional setting there exist index model fitting methods for which one can prove a fast rate of convergence to the true subspace as n ! 1 and h ! 0. However, these results are not appropriate for my situation because they assume a standard fully observed non-functional predictor and a scalar response. The SIMFE setting is considerably more complicated because it involves functional data which, for sparsely sampled predictors, are observed subject to measurement error. In addition, the response may also be functional. My main theoretical contribution is to establish that despite these added complications SIMFE still attains a fast convergence rate. In Theorem 3 I consider scalar responses and then, in Theorem 4, extend the result to functional responses. As with existing results in the non-functional setting (Xia 2008, Li et al. 2010), throughout this section I assume that the true values of the reduced dimensions are known for each predictor. My first result 25 corresponds to the estimator obtained by the scalar response SIMFE algorithm. Note that in the standard non-functional setting the rate of convergence for the gOPG estimator, which I use to initialize SIMFE, is n 2=(e p+4) (Xia 2008). Theorem 3 below shows that the rate for the SIMFE estimator is significantly better. In order to state my results I need to define a distance measure between a true index function, j (t) = ( j1 (t);:::; jd j (t)) T , and the corresponding SIMFE estimate, b j (t) = ( b j1 (t);:::; b jd j (t)) T . Since b j (t) defines a linear projection of the predictor,X ij (t), it is non-unique up to a linear transformation. In other words, b j (t) and e j (t) =A j b j (t) both projectX ij (t) into the same lower dimensional space for any invertibled j byd j matrix,A j . 
Hence, I define the distance between b j (t) and j (t) as r b j (t); j (t) = min A j 2R d j R d j kA j b j (t) j (t)k; wherek ~ j (t) j (t)k 2 = P d j k=1 R ~ jk (t) jk (t) 2 dt. Theorem 3 establishes the rate at which r b j (t); j (t) converges to zero asn tends to infinity. To prove Theorem 3 I impose smoothness and regularity conditions A1-A5, listed in Appendix A.2.3. I also suppose that I am given a finite, but possibly large, collection of potential time points,T =fT 1 ;:::;T L g. I assume that time pointst ij1 ;:::;t ijT ij are randomly generated from the setT , where the number of time points,T ij , is also randomly generated. Theorem 3. Suppose that assumptions A1-A5, stated in the Appendix A.2.3, are satisfied. Then, the SIMFE estimator satisfies r b j (t); j (t) =O p h 4 opt + logn nh d opt +n 1=2 ! =O p n 4 d+4 logn +n 1=2 ; forj = 1;:::;p. To understand the resulting rate of convergence, let us consider some special cases. For d 3 the SIMFE estimator achieves the parametric rate of convergence,n 1=2 , while ford = 4 the parametric rate is attained up to a logn factor. For d 5 the rate of convergence is n 4=(d+4) logn. The next result shows that when the parametric rate of convergence is achieved, the estimator of, which contains the basis coefficients for the functions j (t), is asymptotically normal. 26 Corollary 1. Suppose thatd 3. Under the assumptions of Theorem 3, the SIMFE estimator of the basis coefficients is asymptotically normal: n 1=2 (b ) D !N (0;W ); where the variance matrix,W , is given by formula (A.13) in the Appendix A.2.5. The following result provides the rate of convergence in the functional response case. I assume that the time points at which the response is observed are generated from a continuous distribution; the details are given in the Appendix. Theorem 4. Suppose that assumptions A1-A3, B4 and B5, stated in Appendix A.2.3, are satisfied. Then, the estimator obtained from the functional response SIMFE algorithm satisfies r b j (t); j (t) =O p n 4 d+5 logn +n 1=2 ; forj = 1;:::;p. Note that ford 2 the functional SIMFE estimator achieves the parametric rate of convergence,n 1=2 . Ford = 3 the rate of convergence isn 1=2 logn, and ford 4 it isn 4=(d+5) logn. The rate of convergence in the functional response setting is generally slower than that in the scalar response case. The reason for this is that the dependence of the link function on the time parameter needs to be estimated. Theorems 3 and 4 assume that the time points are generated from a finite collection. The next result corresponds to the case where time point locations come from a continuous distribution. Let b i be the estimate of i in (2.35) and define b i as a vectorized version of the unique elements of b i , i.e. those on and above the diagonal. Note that because i uniquely defines the distribution of U i , the linkm() is a function of both P i and i , i.e. m(P i ; i ): Hence, in order to achieve a faster rate of convergence I slightly modify approximation (2.33) by including an additional term corresponding to b i . n X l=1 0 @ Y l a i p X j=1 b P lj c ij b i e c i 1 A 2 e K il : (2.37) This approach uses local linear smoothing not only with respect to b P i , but also with respect to b i . I also replaceh opt with b h opt =n 1=( e d+4) , where e d =d +d(d + 1)=2, as the final bandwidth. 27 Theorem 5. Suppose that the number of time points and their locations are randomly generated for each observation. 
Under assumptions A1, A2, C3-C5 given in Appendix A.2.3, the SIMFE estimator satisfies r b j (t); j (t) =O p n 4 e d+4 logn +n 1=2 ; forj = 1;:::;p. In the setting of Theorem 5, the SIMFE estimator generally converges at a slower rate than the one in Theorem 3, where the time points are generated from a finite collection. This is due to the increased dimensionality of the problem. Note that when d = 1 the estimator still achieves the parametric rate of convergence,n 1=2 . However, I believe that in a number of settings it would be appropriate to assume the time points are sampled from a finite set of possibilities, in which case the faster rate of convergence in Theorem 3 would apply. For example in an experimental design setting, one would often sample patients at a finite (and common) set of time points. My results suggest that SIMFE would perform better in this setting relative to generating the time points from a continuous distribution. The proof of all the results in this section are provided in Appendix A.2. 2.4 Simulations To evaluate the finite sample performance of my methods I tested SIMFE out over two general scenarios; scalar responses and functional responses. In each setting I performed 100 simulation runs and compared the two versions of SIMFE with three competing methods: Outer Product of Gradients (gOPG) (Li et al. 2010), groupwise Sliced Inverse Regression (gSIR) (Li and Hsing 2010) and groupwise Principal Components Analysis (gPCA). The three competing approaches are implemented as follows. I start by computing b X ij (t), as described in Step 1 of both the scalar and functional response SIMFE algorithms. This step is exactly the same for all the methods, including SIMFE. I then use the estimated basis coefficients,b ij , to produce estimates for index functions. All the methods proceed as they would in the non-functional setting by treatingb ij as the observed predictors. For all the methods I use local linear smoothing to estimate the link function, which I use to calculate the prediction error. Note that the gOPG implementation is equivalent to 28 applying the first step of SIMFE without iterating. As a consequence, the SIMFE results provide a measure of the relative improvement from applying my iterative fitting procedure. I use the cross-validation approach described in Section 2.2.5 to select the basis dimension q j for all the methods. I also use the criterion given by (2.36) to selectd 1 ;:::;d p for gOPG and gSIR. However, I found that gPCA performed poorly using (2.36). Hence, I generated a separate validation data set for each simulation run and selected the number of principal components that provides the best fit to the predictors on the validation data. While this provided a slight advantage for gPCA relative to the other methods, the performance of gPCA improved significantly 2.4.1 Scalar Gaussian Responses In the first scenario I generate scalar Gaussian responses from a model involving three predictors, X 1 (t);X 2 (t);X 3 (t). The observed values of the predictors, W ij , are generated using equation (2.7) with j = 0:1 forj = 1; 2; 3 and X ij (t) =a ij0 + 2 X k=1 b ijk sin(kt) + 2 X k=1 c ijk cos(kt): (2.38) The corresponding index functions are similarly generated by j (t) = ~ a j0 + 2 X k=1 ~ b jk sin(kt) + 2 X k=1 ~ c jk cos(kt); (2.39) wherea;b;c; ~ a; ~ b and ~ c are sampled from a standard normal. 
Finally, the responses come from the non-linear model

Y_i = m_i + ε_i = P_i1 + exp(0.8 P_i2) + sin(0.5 P_i3) + ε_i,   (2.40)

where ε_i ~ N(0, σ_y^2), P_ij = ∫ X_ij(t) β_j(t) dt and σ_y = 0.1. This model corresponds to projecting each predictor down into a one-dimensional space, i.e. d_1 = d_2 = d_3 = 1. To simulate a real world setting where the functional form of the β_jk(t)'s would be unknown, I fit SIMFE using a spline basis rather than the true Fourier basis from which the data were generated.

Seven separate simulation settings are considered. In the first five, a, b and c are generated independently in (2.38), so the predictors are uncorrelated. These settings correspond to two different sample sizes (n = 100 and n = 200) and three different levels of sparsity for the sampled predictors (T_ij = 5, T_ij = 8 and T_ij randomly sampled from the set {5, 6, 7, 8}). Different time points are randomly selected for each observation from a very dense grid of points. Thus, for example, the observed points for i = 1 differ from those for i = 2. In the last two simulation settings I take n = 100 and T_ij = 5, but generate correlated functional predictors by sampling the coefficients in (2.38) from a zero mean multivariate normal distribution with the diagonal elements of the covariance matrix set to one and off-diagonal elements all equal to either ρ = 0.25 or ρ = 0.5.

Figure 2.2: Scalar Gaussian simulation with T_ij = 8 and n = 200: Comparison of true β(t) functions (black solid) with median estimates over 100 simulation runs; SIMFE (blue solid), Base SIMFE (red solid), gOPG (green dotted), gSIR (cyan dashed) and gPCA (grey dash dotted).

Figure 2.2 provides a graphical illustration of the results for the n = 200 and T_ij = 8 simulation setting. The black solid lines correspond to the three true β_j(t)'s from which the data were generated. The median most accurate estimate, over the 100 simulation runs, is also plotted for each of the five competing methods. Both SIMFE (blue) and Base SIMFE (red) provide the highest levels of accuracy. Table 2.1 compares SIMFE with gOPG, gSIR and gPCA over all seven simulations, with mean prediction errors computed on a separate test set of size n = 500. I also provide the mean correlations between the estimated and true β_j(t) curves, with numbers close to one demonstrating a good estimate. The correlations are calculated from the vector correlation (Hotelling 1936) between discretized versions of the true and estimated curves. In this calculation, the curves are evaluated on the same dense grid that is used to generate the time points. All results are averaged over 100 simulation runs.

As one would expect, the best results are obtained for the more densely sampled case with the larger sample size. In all seven simulations, the SIMFE approaches outperform the competing methods. In most cases the difference is large, and generally statistically significant. The SIMFE methods perform particularly well in terms of prediction error. As I would expect, SIMFE generally outperforms Base SIMFE in the sparsest settings, but the two methods give similar results in the denser scenarios. It is also worth noting that gOPG, which is used as the initialization for SIMFE, provides significantly worse results, highlighting the improvement that is possible from the iterative SIMFE algorithm.
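For readers who wish to reproduce a version of this setting, the sketch below generates data from (2.38)-(2.40) for the uncorrelated, scalar-response case. It is an illustration only: the [0, 1] time domain, the dense-grid numerical integration for P_ij, and all variable names are assumptions made here rather than details taken from the dissertation.

import numpy as np

rng = np.random.default_rng(1)
n, p, T_ij, sigma_w, sigma_y = 100, 3, 5, 0.1, 0.1
grid = np.linspace(0.0, 1.0, 501)        # dense grid; the [0, 1] time domain is an assumption

def curve(coefs, t):
    # a0 + sum_k b_k sin(k t) + sum_k c_k cos(k t) with k = 1, 2, as in (2.38)-(2.39)
    a0, b1, b2, c1, c2 = coefs
    return a0 + b1*np.sin(t) + b2*np.sin(2*t) + c1*np.cos(t) + c2*np.cos(2*t)

beta_coef = rng.normal(size=(p, 5))      # coefficients of the index functions beta_j(t)
X_coef = rng.normal(size=(n, p, 5))      # uncorrelated predictor coefficients (the rho = 0 case)

# Projections P_ij = int X_ij(t) beta_j(t) dt, approximated numerically on the dense grid
P = np.empty((n, p))
for j in range(p):
    bj = curve(beta_coef[j], grid)
    for i in range(n):
        P[i, j] = np.trapz(curve(X_coef[i, j], grid) * bj, grid)

# Scalar responses from the non-linear model (2.40)
Y = P[:, 0] + np.exp(0.8 * P[:, 1]) + np.sin(0.5 * P[:, 2]) + sigma_y * rng.normal(size=n)

# Sparse, noisy observations: each curve is seen at T_ij randomly chosen grid points
t_obs = np.array([[rng.choice(grid, size=T_ij, replace=False) for _ in range(p)]
                  for _ in range(n)])
W = np.array([[curve(X_coef[i, j], t_obs[i, j]) for j in range(p)] for i in range(n)])
W = W + sigma_w * rng.normal(size=W.shape)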
The correct reduced dimensions are selected in all of the simulation runs using criterion (2.36). In the next simulation, I consider an example with two predictors, where the responses come from a model with the following nonlinear function: m i = P i11 0:5 + (1:5 +P i12 ) 2 +P i21 ; (2.41) where P ijk = R X ij (t) jk (t)dt. This model projects the two predictors, respectively, into a two- dimensional and a one-dimensional space, i.e. d 1 = 2 and d 2 = 1. In all other respects the simulation setup is identical to my first simulation. Table 2.2 reports numerical summaries for all four simulations. SIMFE does better than both gOPG and gSIR in estimating 1 (t) = ( 11 (t); 12 (t)). gPCA gives a slightly better estimate for 1 (t), but SIMFE is substantially superior to all three competing methods in estimating 2 (t). Moreover, SIMFE significantly outperforms all of the competitors, including Base SIMFE, in terms of the prediction error. SIMFE choose the correct dimension, d = (2; 1) T , on more than 90% of simulations, and selected d = (1; 1) T in the remainder. 2.4.2 Functional Gaussian Responses In my third scenario the responses are generated as functions rather than scalars. In particular, each response function is randomly sampled atn i = 5 orn i = 8 different time points,s 1 ;:::;s n i , with Y i (s ik ) =m(s ik ; P i ) +" ik ; " ik N(0; 2 y ); 31 Setting Method MSPE 1(t) 2(t) 3(t) SIMFE 0.388(0.010) 0.818(0.025) 0.776(0.030) 0.812(0.024) Tij = 5 Base SIMFE 0.429(0.011) 0.794(0.027) 0.754(0.031) 0.769(0.027) n = 100 gOPG 1.914(0.099) 0.446(0.036) 0.498(0.033) 0.408(0.029) = 0 gSIR 0.786(0.024) 0.719(0.029) 0.664(0.033) 0.603(0.028) gPCA 0.908(0.019) 0.480(0.028) 0.575(0.027) 0.396(0.024) SIMFE 0.284(0.010) 0.721(0.032) 0.757(0.034) 0.762(0.029) Tij = 5 Base SIMFE 0.314(0.011) 0.702(0.031) 0.727(0.033) 0.732(0.030) n = 200 gOPG 1.745(0.100) 0.325(0.039) 0.522(0.038) 0.323(0.030) = 0 gSIR 0.461(0.014) 0.598(0.037) 0.656(0.038) 0.594(0.032) gPCA 0.712(0.015) 0.455(0.030) 0.623(0.026) 0.329(0.020) SIMFE 0.174(0.006) 0.729(0.023) 0.694(0.024) 0.728(0.023) Tij = 8 Base SIMFE 0.174(0.006) 0.711(0.025) 0.695(0.025) 0.697(0.025) n = 100 gOPG 1.873(0.085) 0.059(0.006) 0.062(0.007) 0.056(0.005) = 0 gSIR 0.654(0.025) 0.366(0.027) 0.265(0.024) 0.246(0.025) gPCA 0.493(0.011) 0.508(0.030) 0.639(0.025) 0.320(0.019) SIMFE 0.124(0.006) 0.896(0.015) 0.883(0.016) 0.864(0.016) Tij = 8 Base SIMFE 0.121(0.006) 0.856(0.018) 0.862(0.017) 0.844(0.017) n = 200 gOPG 2.017(0.077) 0.046(0.003) 0.048(0.003) 0.040(0.003) = 0 gSIR 0.304(0.011) 0.443(0.026) 0.444(0.028) 0.370(0.027) gPCA 0.384(0.009) 0.468(0.029) 0.653(0.025) 0.283(0.018) SIMFE 0.280(0.010) 0.759(0.028) 0.775(0.029) 0.772(0.026) Tij = 5;6;7;8 Base SIMFE 0.298(0.010) 0.708(0.031) 0.742(0.031) 0.745(0.028) n = 100 gOPG 1.701(0.108) 0.364(0.041) 0.447(0.042) 0.388(0.038) = 0 gSIR 0.721(0.028) 0.536(0.033) 0.531(0.036) 0.460(0.032) gPCA 0.673(0.016) 0.483(0.030) 0.598(0.027) 0.357(0.021) SIMFE 0.662(0.024) 0.787(0.026) 0.787(0.027) 0.766(0.027) Tij = 5 Base SIMFE 0.701(0.026) 0.763(0.027) 0.783(0.026) 0.756(0.028) n = 100 gOPG 4.059(0.286) 0.234(0.025) 0.479(0.029) 0.232(0.020) = 0:25 gSIR 1.368(0.055) 0.689(0.030) 0.610(0.029) 0.607(0.029) gPCA 1.459(0.034) 0.205(0.018) 0.873(0.009) 0.148(0.011) SIMFE 0.742(0.038) 0.773(0.027) 0.761(0.029) 0.756(0.026) Tij = 5 Base SIMFE 0.780(0.036) 0.736(0.029) 0.752(0.029) 0.725(0.027) n = 100 gOPG 4.336(0.493) 0.287(0.025) 0.504(0.027) 0.260(0.023) = 0:50 gSIR 1.689(0.048) 0.783(0.023) 0.614(0.023) 0.706(0.024) gPCA 
1.419(0.037) 0.192(0.016) 0.862(0.008) 0.145(0.012) Table 2.1: Scalar Gaussian Response for d 1 = d 2 = d 3 = 1: The mean squared prediction error, and correlation coefficients. Standard errors are shown in parentheses. wherem(s ik ; P i ) = s ik m i andm i is equal to the response function from (2.40) withp = 3 predictors. I only consider then = 100 setting. The predictors and index functions are generated in the same fashion as for the scalar response simulations. To implement gOPG I again initialize j (t) by applying gOPG to the estimated basis coefficients,b ij , of the predictor functions and then use (2.28) to estimate e m(), i.e. a single iteration of the functional response version of SIMFE. gSIR and gPCA are computed in an analogous fashion. Figure 2.3 illustrates the graphical results for the more sparsely sampled n i = T ij = 5 case, while Table 2.3 gives numerical summaries for both sparsity levels. The results in Figure 2.3 and Table 2.3 are 32 Setting Method MSPE 1(t) 2(t) SIMFE 0.280(0.009) 0.840(0.016) 0.832(0.027) Tij = 5;n = 100 Base SIMFE 0.520(0.020) 0.858(0.016) 0.809(0.028) = 0 gOPG 1.548(0.073) 0.647(0.021) 0.417(0.030) gSIR 0.736(0.027) 0.773(0.015) 0.771(0.026) gPCA 0.684(0.016) 0.868(0.007) 0.410(0.024) SIMFE 0.197(0.005) 0.853(0.014) 0.783(0.030) Tij = 5;n = 200 Base SIMFE 0.367(0.010) 0.867(0.013) 0.745(0.033) = 0 gOPG 1.414(0.062) 0.631(0.019) 0.349(0.032) gSIR 0.464(0.011) 0.789(0.016) 0.711(0.034) gPCA 0.559(0.011) 0.885(0.005) 0.323(0.020) SIMFE 0.137(0.005) 0.757(0.008) 0.716(0.025) Tij = 8;n = 100 Base SIMFE 0.253(0.009) 0.772(0.009) 0.654(0.026) = 0 gOPG 1.907(0.116) 0.567(0.014) 0.054(0.012) gSIR 0.646(0.035) 0.580(0.013) 0.372(0.025) gPCA 0.395(0.009) 0.915(0.003) 0.329(0.020) SIMFE 0.107(0.003) 0.787(0.012) 0.874(0.014) Tij = 8;n = 200 Base SIMFE 0.203(0.012) 0.811(0.012) 0.822(0.018) = 0 gOPG 1.611(0.035) 0.599(0.018) 0.045(0.010) gSIR 0.358(0.012) 0.629(0.012) 0.572(0.026) gPCA 0.306(0.006) 0.916(0.002) 0.275(0.018) Table 2.2: Scalar Gaussian Response ford 1 = 2,d 2 = 1: The mean squared prediction error, and correlation coefficients. Standard errors are shown in parentheses. Setting Method MSPE 1(t) 2(t) 3(t) SIMFE 0.137 (0.007) 0.843(0.023) 0.793(0.030) 0.735(0.026) Base SIMFE 0.148(0.008) 0.838(0.025) 0.781(0.031) 0.726(0.033) Tij = 5;n = 100 gOPG 0.907(0.042) 0.409(0.031) 0.429(0.037) 0.285(0.028) gSIR 0.435(0.012) 0.737(0.032) 0.643(0.034) 0.545(0.031) gPCA 0.414(0.014) 0.483(0.031) 0.593(0.028) 0.384(0.027) SIMFE 0.073(0.006) 0.793(0.027) 0.754(0.024) 0.763(0.026) Base SIMFE 0.074(0.006) 0.786(0.028) 0.744(0.024) 0.752(0.026) Tij = 8;n = 100 gOPG 0.797(0.039) 0.180(0.016) 0.079(0.015) 0.123(0.020) gSIR 0.277(0.010) 0.681(0.031) 0.521(0.028) 0.248(0.025) gPCA 0.197(0.009) 0.346(0.028) 0.589(0.027) 0.401(0.022) Table 2.3: Functional Gaussian Response: The mean squared prediction error, and correlation coefficients. Standard errors are shown in parentheses. generally consistent with those in the scalar case. In every simulation setting SIMFE provides improved estimates of the index functions, and lower prediction errors, relative to gOPG, gSIR and gPCA. In most cases the differences are highly statistically significant. The correct reduced dimensions are selected in all of the simulation runs using criterion (2.36). When compared to its Base version across all the simulation settings, SIMFE performs better with respect to the prediction error. 
The advantage of SIMFE is especially prominent in the sparsest simulation scenarios, corresponding toT ij = 5. 33 0.0 0.2 0.4 0.6 0.8 1.0 −2 −1 0 1 2 t β 1 (t) 0.0 0.2 0.4 0.6 0.8 1.0 −2 −1 0 1 2 3 t β 2 (t) 0.0 0.2 0.4 0.6 0.8 1.0 −3 −2 −1 0 1 2 3 t β 3 (t) Figure 2.3: Functional Gaussian Response forT ij =n i = 5;n = 100: Comparison of true j (t) functions (black solid) with SIMFE (blue solid), Base SIMFE (red solid), gOPG(green dotted), gSIR(cyan dashed) and gPCA estimates(grey dash dotted). 2.5 Auction Data Online auctions have attracted considerable attention in recent years. EBay, the world’s largest consumer- to-consumer electronic commerce company, provides a convenient auction site for global sellers and buyers to trade with each other through the internet. The majority of eBay’s revenue is generated through fees for listing and selling auctions. EBay makes the historical auction data for millions of different types of products publicly accessible. Here I consider the most common single-item online auctions, where bidders place their orders during a fixed auction duration. Those who submit the highest bid before the closing time win the auction but only need to pay the second highest bid. EBay does not display the highest bid, but rather the second highest bid (plus a small bid increment). This is called the live bid. The price histories of these live bids from auctions of similar products can be viewed as i.i.d realizations of bid trajectories. Functional data analysis (FDA) provides a powerful tool to deal with such online auction data, see Jank and Shmueli (2005), Shmueli and Jank (2005), Reddy and Dass (2006), Reithinger et al. (2008), Wang and Shmueli (2008) and Liu and Muller (2008). The sequence of observed bids differs from the values of the smooth underlying bid trajectory. These differences can be viewed as random aberrations of bids, but I will treat them as “measurement error”. 34 0 50 100 150 −0.1 0.0 0.1 0.2 0.3 t ≤ 138 t (hours) β 1 (t) 0 50 100 150 −0.2 −0.1 0.0 0.1 0.2 0.3 t ≤ 144 t (hours) β 1 (t) 0 50 100 150 −0.1 0.0 0.1 0.2 0.3 t ≤ 150 t (hours) β 1 (t) 0 50 100 150 −0.1 0.0 0.1 0.2 0.3 t ≤ 156 t (hours) β 1 (t) 0 50 100 150 −0.1 0.0 0.1 0.2 0.3 t ≤ 162 t (hours) β 1 (t) 0 50 100 150 −0.1 0.0 0.1 0.2 0.3 t < 168 t (hours) β 1 (t) Figure 2.4: Estimated jk (t) curves for eBay online Auction Data. The red dashed, black solid, green dotted, blue dash-dot and cyan long dashed lines correspond to SIMFE, Base SIMFE, gOPG, gSIR and gPCA, respectively, up to different current times,T = 138; 144; 150; 156; 162 and 168 hours. Here I examine 156 eBay online 7-day second price auctions of Palm M515 Personal Digital Assistants (PDA) that took place between March and May, 2003 (Liu and Muller 2008). My interest is in predicting Y = “closing price” based on observing the auction trajectory,X(t), up to time 0 t T . Traditional functional data analysis methods require regular and dense data, but auction data are generally sparsely and irregularly observed with measurement errors. Examination of the PDA data confirms the presence of “bid sniping” i.e. where bids are very sparse in the middle of the auction, a little denser in the beginning and much denser at the end of the auction. Each auction contains 9 to 52 sparse and irregular observations of live bids. I converted the bid time to hours,t2 [0; 168), and applied a log transformation to the bid values. Figure 2.4 plots the estimated time varying 1 (t)’s using SIMFE, Base SIMFE, gOPG, gSIR and gPCA. 
Each plot represents a different subset of the data for T = 138; 144;:::; 168. For example, the top left figure corresponds toT = 138 where I only considered auction trajectories up tot 138 i.e. 30 hours prior to the end of the auction. A few trends are apparent. First, SIMFE (red dashed) and Base SIMFE (black solid) give very similar estimates and both methods place most of the weight on the bid trajectories after t = 100 hours. These results are consist across different values ofT and seem reasonable given that the 35 Method t 138 t 144 t 150 t 156 t 162 t< 168 SIMFE 6.502 6.536 6.452 6.281 6.202 4.435 Base SIMFE 6.481 6.533 6.471 6.286 6.194 4.552 gSIMFE 6.390 6.381 6.263 6.146 6.191 4.283 gOPG 7.139 6.601 7.056 6.926 6.444 5.541 gSIR 6.413 6.543 6.380 6.387 6.253 5.023 gPCA 6.996 6.486 6.721 6.584 6.240 6.109 Mean 7.211 7.211 7.211 7.211 7.211 7.211 Table 2.4: Cross-validated mean squared prediction errors (10 3 ) for the three versions of SIMFE and four competing methods. The lowest MSPE for each value ofT is bolded. most recent bids contain the most information about the final auction price. Second, for the larger values ofT , gPCA (cyan long dash) places roughly equal weight on all sections of the bid trajectory, while gOPG (green dotted) and gSIR (blue dash-dot) produce wildly varying results that are hard to justify based on the context of the problem. To judge which projection produced superior predictions, I computed the 5-fold cross-validated predic- tion errors for SIMFE and Base SIMFE. I also implement a hybrid approach, gSIMFE, which uses SIMFE to computeP i , but then predicts the response by applying generalized additive models (GAM) withP i as the predictor. I compared the three SIMFE methods to similar implementations of gOPG, gSIR and gPCA, where each method was used to estimateP i , and then GAM was fitted to give a predicted response. Note that for gPCA I fitted GAM to the first 3 principal components, which explained more than 95% of the variation in the bid trajectory. The resulting cross-validated errors, for various values ofT , are provided in Table 2.4. I have also included the error rates from the null model, using the mean of the training response to predict the test response. gSIMFE gave the best results, though SIMFE and Base SIMFE were only slightly inferior. For larger values ofT all three SIMFE methods outperformed the competitors. 2.6 Conclusions The SIMFE method is proposed for functional multi-index models to deal with sparsely sampled predic- tors. It enjoys several advantages over competing approaches. First, SIMFE allows a non-linear relationship between the response and predictors and uses a supervised method to estimate the lower dimensional repre- sentation of the functional predictors. Second, SIMFE can model both scalar and functional responses with multiple predictors. Finally, SIMFE uses a mixed effects model approach to handle sparse, irregular and noisy functional predictors and consistently estimates the space spanned by the set of index functions. 36 There exist several possible extensions for future research. The first involves the situation when the dimension d is large, a setting where Theorem 5 implies slow convergence. This is the case when, for example, the number of available predictor functions is large. 
A more reasonable approach would be to add a group lasso penalty (Yuan and Lin 2006) on the estimated index coefficients to the SIMFE estimation criterion, which would force some of the predictors to be excluded from the estimated model. Fan et al. (2015) recently proposed a penalized regression approach for high dimensional functional single index models. One possible extension is to develop a multi-index model, and a penalized version of the SIMFE estimation criterion, for sparsely sampled functional data in the high dimensional setting. The second extension involves developing efficient supervised methods to estimate the index model for sparse functional predictors through other effective dimension reduction techniques. SIMFE uses a MAVE approach (Xia 2008) to perform the supervised dimension reduction. A recent paper by Yao et al. (2015) presents a method for the sparse design that borrows strength across the entire sample and provides a way to characterize the effective dimension reduction space via functional cumulative slicing. It would be interesting to compare these approaches with other supervised dimension reduction methods for sparsely sampled functional data.

Chapter 3

Functional Graphical Models

Graphical models have attracted increasing attention in recent years, especially in settings involving high dimensional data. In particular, Gaussian graphical models are used to model the conditional dependence structure among p Gaussian random variables. As a result of its computational efficiency, the graphical lasso (glasso) has become one of the most popular approaches for fitting high dimensional graphical models. In this chapter I extend the graphical model concept to model the conditional dependence structure among p random functions. In this setting, not only is p large, but each function is itself a high dimensional object, posing an additional level of statistical and computational complexity. I develop an extension of the glasso criterion (fglasso), which estimates the functional graphical model by imposing a block sparsity constraint on the precision matrix via a group lasso penalty. The fglasso criterion can be optimized using an efficient block coordinate descent algorithm, and my theoretical results demonstrate that, with high probability, the fglasso will correctly identify the true conditional dependence structure. Finally, I show that the fglasso significantly outperforms possible competing methods through both simulations and an analysis of a real world EEG data set comparing alcoholic and non-alcoholic patients.

3.1 Introduction

Let g_1(t), ..., g_p(t) represent p Gaussian random functions, where t ∈ T and T is a closed subset of the real line. My goal is to construct a functional graphical model (FGM) depicting the conditional dependence structure among these p random functions. Here I assume the same time domain, T, for all random functions to simplify the notation, but the methodological and theoretical results extend naturally to the more general case where each function corresponds to a different time domain. The left panel of Figure 3.1 provides an illustrative example with p = 9 functions, or nodes. I have 100 observations of each function, corresponding to 100 individuals. In other words, my data consist of functions g_ij(t), where i = 1, ..., 100 and j = 1, ..., 9. The right panel of Figure 3.1 illustrates the conditional dependence structure of these functions, i.e. the FGM.
For example, I observe that the last 3 functions are disconnected from, and hence conditionally independent of, the first 6 functions. I wish to take the observed functions in the left panel and estimate the FGM in the right panel.

Figure 3.1: The illustrative example. Left: the data, n = 100 observations of g_ij(t) for j = 1, ..., 9 nodes. Right: the true underlying network structure.

My motivating example is an electroencephalography (EEG) data set taken from an alcoholism study (Zhang et al. 1995) and discussed in Section 1.2.2. The study consists of n = 122 subjects split between an alcoholic group and a control group. Each subject was exposed to either a single stimulus or two stimuli. The resulting EEG activity was recorded at 256 time points over a one second interval using electrodes placed at 64 locations on the subject's scalp. Hence, each observation, or subject, involves p = 64 different functions observed at 256 time points. It is of scientific interest to identify differences in brain activity between the two groups, so I construct FGMs for each group and explore the differences.

One possible approach to handle this sort of functional data would be to sample the functions over a grid of time points, t_1, ..., t_T, estimate T separate networks, and then either report all T networks or construct a single graphical model by somehow merging the T networks. However, while conceptually simple, this strategy has several drawbacks. First, in many instances the functions are not observed at a uniform set of time points and are often not measured at the same set of points. The grid approach would fail in this setting. Second, one of the main advantages of a graphical network is its ability to provide a relatively simple representation of the conditional dependence structure; simultaneously interpreting T different networks would significantly increase the complexity of that structure. Finally, each of the T networks would only correspond to dependencies among the functions at a common time point. It seems likely that some dependencies may only exist at different time points, i.e. it may be the case that g_j(t) and g_l(t) are conditionally uncorrelated for every individual value of t but g_j(s) and g_l(t) are correlated for some s ≠ t. In such a scenario each of the T networks would miss the correlation structure. Some recent research in graphical models has considered estimating a time varying graphical model through a nonparametric approach which constructs similar graphs from adjacent time points (Zhou et al. 2010, Kolar and Xing 2011). While this approach does consider dependencies across time, it will miss correlations among points that are far apart and also requires a common grid of points.

In this chapter, I propose a FGM which is able to estimate a single network and overcome these disadvantages. The functional network still contains p nodes, one for each function, but in order to construct the edges I extend the conditional covariance definition to the functional domain, and then use this extended covariance definition to estimate the edge set, E. There exist several challenges involved in estimating the FGM.
Since each function is an infinite dimensional object, I need to adopt a dimension reduction approach to approximate each function by a finite representation, which results in estimating a block precision matrix. Standard glasso algorithms for scalar data involve estimating the non-zero elements in the precision matrix. By comparison I have developed an efficient algorithm to estimate the non-zero blocks of a higher dimen- sional precision matrix. In my theoretical results I investigate the structure of the underlying functional network to obtain the rate of convergence for blocks in the precision matrix and expand previous model selection consistency results from the standard setting to the more complicated functional domain. The remainder of Chapter 3 is set out as follows. In Section 3.2 I propose a convex penalized criterion which has connections to both the graphical lasso (Yuan and Lin 2007) and group lasso (Yuan and Lin 2006). Minimizing my functional graphical lasso criterion provides a sparse estimate, b E, for the edge setE. In Section 3.3, I develop an efficient block coordinate descent algorithm for minimizing the fglasso criterion. The fglasso algorithm is extended to handle even larger values ofp by applying the partition approach of Witten et al. (2011). Section 3.4 provides my theoretical results. Here I show that the estimated edge set b E is the same as the true edge setE with probability converging to one, even forp > n. The finite sample performance of the fglasso is examined in Section 3.5 through a series of simulation studies. Section 3.6 provides a demonstration of the fglasso on the EEG data set. I summarize Chapter 3 and propose several possible extensions for future work in Section 3.7. 40 3.2 Methodology 3.2.1 Gaussian Graphical Models As discussed in the previous Section 1.2.2, the edges depict the conditional dependence structure of thep variables. Specifically, let c jl = Cov(X j ;X l jX k ;k6=j;l) (3.1) represent the covariance of X j and X l conditional on the remaining variables. Then nodes j and l are connected by an edge if and only ifc jl 6= 0. Under the assumption that X = (X 1 ;:::;X p ) T is multivariate Gaussian with covariance matrix , one can show thatc jl = 0 if and only if jl = 0, where jl is the (j;l)th component of the precision matrix, = 1 . Let G = (V;E) denote an undirected graph with vertex set V =f1; ;pg and edge set E =f(j;l) : c jl 6= 0; (j;l)2 V 2 ;j6= lg =f(j;l) : jl 6= 0; (j;l)2 V 2 ;j6= lg. In practice jl , and hence the network structure, must be estimated based on a set of n observed p-dimensional realizations, x 1 ;:::; x n , of the random vector X. Hence, much of the research in this area involves various approaches for estimatingE, which for Gaussian data is equivalent to identifying the locations of the non-zero elements in the precision matrix. One natural approach for estimating is to use the maximum likelihood estimator (MLE). Standard calculations show that the Gaussian log likelihood (up to constants) is given by log det trace(S); which, forp<n, is maximized by setting b = S 1 , when S, the sample covariance matrix, is nonsingular. However, this estimator has two deficiencies. First, fornp the MLE will provide a poor estimate for and forp > n the MLE is not even unique. Second, even in lower dimensional settings, where the MLE may be more accurate, none of the elements of b will be exactly zero—a serious limitation given that my ultimate goal is to identify the pairs (j;l) such that jl = 0. 
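To make this limitation concrete, the short sketch below (an illustration only, not from the dissertation) evaluates the Gaussian log likelihood log det(Theta) - trace(S Theta) for a candidate precision matrix and shows that, when p > n, the sample covariance S is rank deficient, so the unpenalized MLE S^{-1} does not exist. The variable names and dimensions are assumptions made for this example.

import numpy as np

def gaussian_loglik(Theta, S):
    # log det(Theta) - trace(S Theta): the Gaussian log likelihood (up to constants)
    # of a candidate precision matrix Theta given the sample covariance S.
    sign, logdet = np.linalg.slogdet(Theta)
    if sign <= 0:
        return -np.inf                          # Theta must be positive definite
    return logdet - np.trace(S @ Theta)

rng = np.random.default_rng(2)
n, p = 50, 80                                   # p > n, so S is rank deficient
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)
print(np.linalg.matrix_rank(S))                 # at most n - 1 < p: the MLE S^{-1} does not exist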
41 The graphical lasso (Yuan and Lin 2006) addresses both of these limitations by adding a penalty term to the log likelihood function: b = arg max flog det trace(S)kk 1 g; (3.2) where 2R pp is symmetric positive definite, 0 is a tuning parameter andkk 1 = P jl j jl j. In a similar fashion to the standard lasso, the` 1 penalty in (3.2) both regularizes the estimate and ensures that b is sparse i.e. has many zero elements. 3.2.2 Functional Graphical Models The functional setting considered in this paper is more complicated than that for the standard graphical model. Suppose the functional variables, g 1 ; ;g p , following from a multivariate Gaussian process (MGP), belong to an undirected graph G = (V;E) with vertex set V = f1; ;pg and edge set E. Then I must first provide a definition for the conditional covariance between two functions. For each pair fj;lg2V 2 ;j6=l and any (s;t)2T 2 ; the cross-sectional conditional covariance function is defined by C jl (s;t) = Cov (g j (s);g l (t)jg k (u);k6=j;l;8u2T ); (3.3) which represents the covariance betweeng j (s) andg l (t) conditional on the remainingp 2 functions. I considerg j andg l to be conditionally independent ifC jl (s;t) = 0 for all (s;t)2T 2 . Hence my ultimate goal is to estimate the edge set E = fj;lg :C jl (s;t)6= 0 for somes andt; (j;l)2V 2 ;j6=l : (3.4) 3.2.3 Functional Graphical Lasso Suppose I observe g i = (g i1 ;:::;g ip ) T ,i = 1;:::;n, and eachg ij (t),t2T is a realization from a mean zero Gaussian process. The Karhunen-Lo` eve expansion allows us to represent each function in the form g ij (t) = 1 X k=1 a ijk jk (t); 42 wherea ijk N(0; jk ),a ijk is independent froma i 0 jk 0 fori6= i 0 ork6= k 0 , and j1 j2 0 for j = 1;:::;p. In this formulation j1 (t); j2 (t);::: represent functional principal component (FPC) functions and form an infinite dimensional basis representation forg ij (t). If I denote the covariance function byK jj (s;t) =Cov(g j (s);g j (t)), then the corresponding eigen pairs satisfy Z T K jj (s;t) jk (t)dt = jk jk (s); (3.5) where R T jk (t) 2 dt = 1 and R T jk (t) jm (t)dt = 0 form<k. Let e g M ij (t) = M X k=1 a ijk jk (t) = (a M ij ) T M j (t) (3.6) represent theM-truncated version ofg ij (t), where a M ij = (a ij1 ; ;a ijM ) T and M j = ( j1 ; ; jM ) T . Thene g M ij (t) provides the bestM-dimensional approximation tog ij (t) in terms of mean squared error. Anal- ogously to the conditional covariance function (3.3), I define theM-truncated conditional covariance func- tion by e C M jl (s;t) = Cov e g M j (s);e g M l (t)je g M k (u);k6=j;l;8u2T ; (3.7) and the corresponding truncated edge set as e E M = n fj;lg : e C M jl (s;t)6= 0 for somes andt; (j;l)2V 2 ;j6=l o : (3.8) The basic intuition to my approach is to develop a method for estimating e E M which, for large enoughM, will provide an accurate estimate forE. My theoretical results in Section 3.4 formalize this idea. To form my estimate for e E M I first let a M i = (a M i1 ) T ;:::; (a M ip ) T T 2R Mp represent the firstM FPC scores for theith set of functions,i = 1;:::;n. Then, providedg ij (t) is a realization from a Gaussian process, a M i will have a multivariate Gaussian distribution with covariance matrix e M = e M 1 . Lemma 1. For (j;l)2V 2 , let e M jl be theMM matrix corresponding to the (j;l)-th submatrix of e M . Then e E M = n fj;lg :k e M jl k F 6= 0; (j;l)2V 2 ;j6=l o ; (3.9) wherejj:jj F denotes the Frobenius norm. 
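Before moving on, note that the scores a_ijk in (3.6) are not observed directly and must be estimated. One simple possibility, sketched below under the assumption that each node's curves are densely observed on a common grid (only one of several standard options, and not necessarily the one used later in the chapter), is to eigendecompose the discretized covariance in (3.5) and project the curves onto the leading M eigenfunctions. The function and variable names are assumptions for illustration.

import numpy as np

def fpc_scores(G, t_grid, M):
    # G: (n, T) matrix holding one node's curves evaluated on the common grid t_grid.
    # Returns the (n, M) matrix of estimated FPC scores a_ijk, k = 1, ..., M.
    n, T = G.shape
    dt = t_grid[1] - t_grid[0]
    Gc = G - G.mean(axis=0)                       # centre the curves
    K_hat = (Gc.T @ Gc) / n                       # empirical covariance K_jj(s, t) on the grid
    evals, evecs = np.linalg.eigh(K_hat * dt)     # discretised eigenproblem from (3.5)
    order = np.argsort(evals)[::-1][:M]
    phi = evecs[:, order] / np.sqrt(dt)           # eigenfunctions, normalised so int phi^2 = 1
    return Gc @ phi * dt                          # scores a_ik = int g_i(t) phi_k(t) dt

# Stacking the M scores of all p nodes column-wise gives the Mp-vector a_i whose sample
# covariance S_tilde enters the fglasso criterion.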
43 Lemma 1 suggests that the problem of estimating e E M , and henceE, can be reduced to one of accurately estimating the block sparsity structure in e M . Thus, I estimate the network structure using the functional graphical lasso (fglasso), which modifies the graphical lasso by incorporating a group lasso penalty to produce a block sparsity structure. The fglasso is defined as the solution to b M = arg max e M 8 < : log det e M trace( e S M e M ) X j6=l k e M jl k F 9 = ; ; (3.10) where e M 2R MpMp is symmetric positive definite, 0 is a tuning parameter and e S M is the sample covariance matrix of a M . The penalty in (3.10) is equivalent to a group lasso penalty (Yuan and Lin 2006, Friedman et al. 2010) and forces the elements of M jl to either all be zero (a sparse solution) or not all zero (a connected edge betweeng j andg l ). Hence, as increases b M grows sparser in a blockwise fashion. My final estimated edge set is then defined as b E M = n fj;lg :k b M jl k F 6= 0; (j;l)2V 2 ;j6=l o : I calculate e S M directly from a M . Although a M is not directly observed there are a number of standard meth- ods for computing the functional principal component scores from a set of functions. See the Appendix B.3.1 for further details. Note b , e , e S, a j and j ,j = 1 ;p, all depend onM, but for simplicity of notation I will omit the corresponding superscripts where the context is clear. 3.3 Computation 3.3.1 Fglasso Algorithm A number of efficient algorithms (Friedman et al. 2007, Boyd et al. 2010) have been developed to solve the glasso problem, but to date none of these approaches have considered the functional domain. Here I propose an algorithm which mirrors recent techniques for optimizing the glasso criterion (Zhu et al. 2014b). Let e j ; e j and e S j respectively beM(p 1)M(p 1) sub matrices excluding thejth row and column block of e ; e and e S, and let w j ; j and s j beM(p 1)M matrices representing thejth column 44 block after excluding the jth row block. Finally, let jj ; jj and S jj be the (j;j)th MM blocks in e ; e and e S respectively. So, for instance forj = 1, e = 0 @ 11 w T 1 w 1 e 1 1 A : Then, for a fixed value of e j , standard calculations show that (3.10) is minimized by setting b jj = S 1 jj +b w T j e 1 j b w j ; (3.11) where b w j = arg min w j ( trace(S jj w T j e 1 j w j ) + 2trace(s T j w j ) + 2 p1 X l=1 kw jl k F ) ; (3.12) and w jl represents the lth MM block of w j . Computing (3.12) can be achieved using some matrix calculus and standard non-linear equation software. Details are provided in Appendix B.1.1. This suggests a block coordinate descent algorithm where one iterates throughj repeatedly computing (3.12) until convergence. In fact by checking the conditions of Theorem 4.1 in Tseng (2001) it is easy to verify that iteratively minimizing (3.12) over w j and updating jj by (3.11) for j = 1; ;p provides a convergent solution for globally maximizing the fglasso criterion. The main potential difficulty with this approach is that e 1 j must be updated at each step which would be computationally expensive if I performed the matrix inversion directly. However, Algorithm 3 demonstrates that the calculation can be performed efficiently. Steps 2(a) and 2(c) are derived using standard matrix results, the details of which are provided in the Appendix B.1.2. Algorithm 3 Functional Graphical Lasso Algorithm 1. Initialize b = I and b = I. 2. Repeat until convergence forj = 1; ;p. (a) Compute b 1 j b j b j b 1 jj b T j . 
(b) Solve for b w j in (3.12) using Algorithm 6 in Appendix B.1.1. (c) Reconstruct b using b jj = S jj ,b j =U j S jj and b j = b 1 j + U j S jj U T j , where U j = b 1 j b w j . . 3. Set b E = n (j;l) :k b jl k F 6= 0; (j;l)2V 2 ;j6=l o : 45 3.3.2 Block Partitioning to Accelerate the Algorithm A common approach to significantly speed up the glasso algorithm involves first performing a screening step on the sample covariance matrix to partition the nodes intoK distinct sets and then solvingK separate glasso problems (Witten et al. 2011, Mazumder and Hastie 2012b, Danaher et al. 2014, Zhu et al. 2014b). Since the resulting glasso problems involve many fewer nodes each network can be computed at a much lower computational cost. Here I show that a similar approach can be used to significantly accelerate my proposed fglasso algo- rithm. Proposition 1. If the solution to (3.10) takes a block diagonal form, i.e. e = diag e (1) ; ; e (K) , then (3.10) can be solved by separately maximizingK smaller fglasso problems b (k) = arg max e (k) 8 < : log det e (k) trace( e S (k) e (k) ) X j6=l k e (k) jl k F 9 = ; ; (3.13) fork = 1; ;K, where e S (k) is the submatrix of e S corresponding to e (k) . Proposition 2. Without loss of generality, let G 1 ; ;G K be a partition of p ordered features, hence if i2 G k , i 0 2 G k 0, k < k 0 , theni < i 0 . Then a necessary and sufficient condition for the solution to the fglasso problem to be block diagonal with blocks indexed byG 1 ; ;G K is thatk e S ii 0k F for alli2G k , i 0 2G k 0,k6=k 0 . The proofs of these two propositions are provided in Appendix B.3.3 and B.3.4. Propositions 1 and 2 suggest first performing a screening procedure on e S to identify K distinct graphs and then solving the resultingK fglasso problems. These steps are summarized in Algorithm 4. For a fixedM, implementing Algorithm 3 requiresO(p 3 ) operations. Steps 1 and 2 in Algorithm 4 need O(p 2 ) operations and thekth fglasso problem requiresO(jG k j 3 ) operations fork = 1; ;K; hence the total computational cost for Algorithm 4 isO p 2 + P K k=1 jG k j 3 : Algorithm 4 significantly reduces the computational cost, ifjG 1 j; ;jG K j are much smaller thanp, which is the case when the tuning parameter, , is large. This is the case I am generally interested in for real data problems since, for the sake of network interpretation and visualization, most practical applications estimate sparse networks. 46 Algorithm 4 Fglasso Algorithm with Partitioning Rule 1. Let A be ap byp adjacency matrix, whose diagonal elements are one and off-diagonal elements take the formA ii 0 = 1 k e S ii 0k F > . 2. IdentifyK connected components of the graph based on the adjacency matrix A. LetG k be the index set of the features in thekth connected component,k = 1; ;K. 3. For k = 1; ;K, solve b (k) via Algorithm 3 using the nodes in G k . The final solution to the fglasso problem b is obtained by rearranging the rows/columns of the permuted version, diag b (1) ; ; b (K) : 3.4 Theory I now investigate the sampling properties of the fglasso. To this end, several regularity and smoothness conditions are needed to facilitate my technical analysis. Condition 1. For someM = c 1 n with constantsc 1 > 0 and2 [0; 1), P 1 k=M+1 jk = O(n ) for some constant > 0 and sup s2T max kM+1 j jk (s)j =O(1) holds uniformly inj2V . Condition 1 is a basic assumption in the functional setting. 
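Propositions 1 and 2 translate into a very simple screening rule; the sketch below (an illustration only) builds the block adjacency matrix from the M x M blocks of S_tilde and extracts connected components with scipy, writing the tuning parameter generically as gamma since the partitioning threshold must match the penalty used in (3.10). Each resulting group would then be passed to a separate fglasso fit, as in Algorithm 4 below.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def partition_nodes(S_tilde, p, M, gamma):
    # S_tilde: (Mp, Mp) sample covariance of the stacked FPC scores.
    # A node pair (i, i') is screened out when the Frobenius norm of the
    # corresponding M x M block of S_tilde is at most gamma (Proposition 2).
    A = np.eye(p, dtype=int)
    for i in range(p):
        for ip in range(i + 1, p):
            block = S_tilde[i * M:(i + 1) * M, ip * M:(ip + 1) * M]
            if np.linalg.norm(block, 'fro') > gamma:
                A[i, ip] = A[ip, i] = 1
    K, labels = connected_components(csr_matrix(A), directed=False)
    return [np.where(labels == k)[0] for k in range(K)]   # one index set per fglasso subproblem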
The parameter controls the number of selected FPC functions and the parameter determines the decay rate of any decreasing sequence j1 j2 0 for j = 1; ;p. Larger values of and yield a larger M and a faster decay rate, respectively. Before stating the next few conditions, I introduce some necessary notation. Let g = (g 1 ; ;g p ) T be a collection of random processes whose jth component g j is a continuous function in L 2 (T j ), where T j is a closed subset of the real line. Then the domain for g isT = S p j=1 T j : Denote byK =fK jl : T j T l !R; (j;l)2V 2 g a collection of covariance kernel functions withK jl (s;t) = Cov (g j (s);g l (t)), (s;t)2T j T l , and g MGP(0;K) a MGP, indexed by the vertex setV , with mean 0 and covarianceK: In general, letA =fA jl :T j T l !R; (j;l)2V 2 g be a collection of functionsA jl (s;t), (s;t)2T j T l , andX =fx j :T j !R;j2Vg a collection of functionsx j (s);s2T j . The setsA andX can be regarded as the functional counterparts of matrices and vectors, respectively; see Appendix B.2.2 for more details about the corresponding operations and notation. For any subsetsW ,W 0 ofV , denote by g W an MGP indexed byW , andK WW 0 a subset of covariance with columns and rows inW andW 0 , respectively. The inverse of covarianceK WW is denoted asK 1 WW . 47 I define the correspondingM-truncated versions of g,K,K WW 0, andK 1 WW , bye g, e K, e K WW 0, and e K 1 WW , respectively. Here, for notational simplicity I drop the corresponding superscripts in theM-truncated terms. Condition 2. For each (j;l) 2 V 2 with j 6= l, there exist unique inverse covariancesK 1 UU and e K 1 UU for covariancesK UU and e K UU of g U ande g U withU = Vnfj;lg such that all kernel functions inK, e K, K 1 UU , and e K 1 UU are square integrable, min (K UU ) and min ( e K UU ) c 2 for some constantc 2 > 0, and sup s2T j jjK jU (s)jj max = O(1) and sup t2T l jjK Ul (t)jj max = O(1) uniformly in (j;l), where() andjjjj max are defined in Appendix B.2.2. Condition 3. For each (j;l)2 V 2 withj6= l, supposek jl k F = O(1) and min (D jj ), min (D ll ) c 3 for some positive constantc 3 , hold uniformly in (j;l), where jl = R T j R T l j (s) l (t) T dsdt and D jj = Var (a j ) Cov (a j ; a U ) Var(a U ) 1 Cov (a U ; a j ). Condition 4. It holds that max (D jj ) and max (D ll )c 4 , and min (j;l)2E inf (s;t)2T j T l jC jl (s;t)j c 5 p 2 n =2 + c 6 p 2 n (2)=2 ; (3.14) uniformly in (j;l)2E for some constantsc 4 ,c 5 , andc 6 > 0. Condition 2 is on the eigen-structure and smoothness of the functional network, while Condition 3 is about the FPC functions and eigen-structure of FPC scores. Condition 4 is an assumption on the minimum signal strength for successful graph recovery. These conditions on the underlying functional network are crucial for obtaining the rate of convergence ofjj e jl jj F for (j;l) 2 S c ; the complement of S = E[ f(1; 1); ; (p;p)g inV 2 ; and the equivalence between the truncated and true edge sets. See Lemmas 3 and 4 in Appendix B.2.2 and B.2.3. More specifically, Condition 2 restricts the eigen-structure and puts some smoothness requirement on the functional network such that the truncated conditional covariance function (3.7) converges uniformly to the true one (3.3) at a certain rate asM goes to infinity. In Condition 3, the bound condition on the cross- integrated FPC functions ensures the conditional cross uncertainties ofe g can be captured by the FPC scores a. 
The restriction on the eigen-structure of FPC scores provides a sufficient condition to obtain the uniform convergence rate ofjj e jl jj F from that ofjjCov(a j ; a l ja U )jj F in (j;l)2S c : Moreover, Condition 4 requires that the minimum magnitude of the conditional covariance function on the true edge set be bounded from below to ensure that the truncated edge set correctly retains all true edges. 48 I next introduce an additional irrepresentable-type assumption for deriving the model selection con- sistency of fglasso, that is, the exact truncated functional graph recovery with asymptotic probability one. Denote by e = e 1 e 1 2 R (Mp) 2 (Mp) 2 with the Kronecker product, and e WW 0 the M 2 jWj M 2 jW 0 j submatrix of e with row and column blocks in W and W 0 , respectively, for any subsets W , W 0 of V 2 . For any block matrix B = (B ij ) with B ij 2R M 2 M 2 ; 1 i;j p, define jjBjj (M 2 ) 1 = max i P p j=1 jjB ij jj F as theM 2 -block version of matrix1-norm. Condition 5. For j 2 V and m 2 f1; ;Mg, there exist some constants c 8 > 0, 2 (0; 1 c 8 p 2 n ()=2 ],> 0, and > 2 such that jj e S c S ( e SS ) 1 jj (M 2 ) 1 1 c 8 p 2 n ()=2 ; (3.15) jj e S c S ( e SS ) 1 e SS c e S c S cjj (M 2 ) 1 " n ()=2 2c 8 p 2 + 1 # ; (3.16) jj e SS cjj (M 2 ) 1 40 p 2c 1 n max j;m jm r log(c 1 n p) + log 4 n : (3.17) It is worth noting that e is the Hessian of log det() evaluated at = e . Hence the entry e (k;k 0 )(m;m 0 ) (j;j 0 )(l;l 0 ) is equal to the partial derivative @( log det()) @ kk 0 jj 0 @ mm 0 ll 0 evaluated at = e , where kk 0 jj 0 is the (k;k 0 )th entry of theMM submatrix jj 0, 1 j;j 0 ;l;l 0 p, 1 k;k 0 ;m;m 0 M: Since a is multivariate Gaussian, some standard calculations show that e (k;k 0 )(m;m 0 ) (j;j 0 )(l;l 0 ) = Cov a jk a j 0 k 0;a lm a l 0 m 0 : The Hessian of the negative log-determinant in the scalar case ofM = 1 was studied in Ravikumar et al. (2011). I extend their work by viewing e , the Fisher information of the model, as an edge-basedM 2 -block covariance matrix instead of the node-based covariance matrix e . For each (j;j 0 )2 V 2 withj6= j 0 , denote by b jj 0 = a j a j 02R M 2 the edge-based vector, where a j , a l are the node-based vectors. Then I have e (j;j 0 )(l;l 0 ) = E(b jj 0b T ll 0 ), which indicates that Condition 5 is the population version of the irrepresentable-type condition. Define the edge-based vector withinS by b S =fb jj 0; (j;j 0 )2Sg. Then inequality (3.15) is equivalent tojjE(b S cb T S )E(b S b T S ) 1 jj (M 2 ) 1 1 c 8 p 2 n ()=2 ; which bounds the effects of non-edges inS c on the edges inS, and restricts b jj 0’s outside the true edge setS to be weakly correlated with those withinS. Moreover, since b S and b S c are multivari- ate Gaussian with mean zero, then inequality (3.16) is equivalent tojjE(b S cb T S )E(b S b T S ) 1 E(b S b T S c) E(b S cb T S c)jj (M 2 ) 1 =jjVar(b S cjb S )jj (M 2 ) 1 [ n ()=2 2c 8 p 2 + 1]; which bounds the variance of b jj 0’s outside S conditional on those withinS. Finally, (3.17) imposes an upper bound onjj e SS cjj (M 2 ) 1 , which constrains 49 the covariance between b S and b S c to ensure a bounded error for the estimate b : In particular, for the scalar case ofM = 1; = 0; =1, (3.16), (3.17) are not required anymore and (3.15) reduces to the irrepresentable condition in Ravikumar et al. (2011), since can be relaxed to be1 in such a case. 
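The block version of the matrix infinity-norm used throughout Condition 5 is straightforward to compute; a small numpy sketch is given below for a matrix stored as a dense array of square blocks (the function name block_inf_norm is illustrative).

```python
import numpy as np

def block_inf_norm(B, block):
    """Block version of the matrix infinity-norm: the maximum over block-rows i of the
    sum over block-columns j of ||B_ij||_F, where each B_ij is a `block`-by-`block`
    submatrix of B."""
    n = B.shape[0] // block
    return max(
        sum(np.linalg.norm(B[i * block:(i + 1) * block, j * block:(j + 1) * block], "fro")
            for j in range(n))
        for i in range(n)
    )

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    M = 3                                     # the blocks in Condition 5 are M^2-by-M^2
    B = rng.normal(size=(4 * M ** 2, 4 * M ** 2))
    print(block_inf_norm(B, block=M ** 2))
```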
I am now ready to present the main theorem on the model selection consistency of my proposed approach fglasso for estimating a FGM with different time domains of functions. Denote by e =jj( e SS ) 1 jj (M 2 ) 1 ; e =jj e SS cjj (M 2 ) 1 , e =jj e jj (M) 1 , and d = max j2V jfl2V :C jl (s;t)6= 0 for somes andtgj; (3.18) the maximum degree of the graph in the underlying FGM. Theorem 6. Assume that Conditions 1–5 hold. For j 2 V and m2f1; ;Mg, let b be the unique solution to the fglasso problem (3.10) with regularization parameter = (320 p 2c 1 n = )max j;m jm r log(c 1 n p) + log 4 n : If the sample sizen satisfies the lower boundn>c 7 p 4= and n 12 > maxfC 1 d 2 ;C 2 2 min g 1 + 8 + c 8 p 2 n ()=2 2 [ logn + logp + log(4c 1 )]; (3.19) with some constantc 7 > 0, min = min (j;l)2E jj e jl jj F ;C 2 =f80 p 2c 1 max j;m jm e g 2 , and C 1 = ( 240 p 2c 1 max j;m jm max n e e 1 3(pd)c 8 p 2 n ()=2 e ; 3 e 2 e [1 + 8 1 +c 8 p 2 n ()=2 ] 1 3(pd)c 8 p 2 n ()=2 3 e e [1 + 8 1 +c 8 p 2 n ()=2 ] o ) 2 ; then with probability at least 1 (c 1 n p) 2 , the estimated edge set b E is the same as the true edge setE: For any (j;l) 2 S c , Lemma 3 impliesjj e jl jj F = c 8 p 2 =n ()=2 : For a larger truncation error c 9 = c 8 p 2 =n ()=2 , (3.15) and (3.16) become more stringent. Since in view of (3.15) c 9 is bounded from above by 1 , I see that a sample sizen & p 4 is sufficient to satisfy all the technical assump- tions, where & means asymptotic lower bound. In such a case, the sample size condition (3.19) can be simplified as n 12 > maxfC 1 d 2 ;C 2 2 min g(1 + 8 1 + c 9 ) 2 [logn + logp + log(4c 1 )] with 50 C 1 = 240 p 2c 1 max j;m jm maxf e e 13(pd)c 9 e ; 3 e 2 e (1+8 1 +c 8 ) 13(pd)c 9 3 e e (1+8 1 +c 9 ) g 2 : Let us further assume that e , e , e , remain constant with respect ton,p,d, andn =o(p). Then a sample size n& (d 2 + 2 min )logp 1 12 (3.20) guarantees the model selection consistency of fglasso with probability at least 1 (c 1 n p) 2 : Note that a larger value of parameter enables a higher functional graph recovery probability, at the expense of a larger sample size. In particular, for the scalar case ofM = 1; = 0; =1, the sample size condition (3.19) reduces ton& (d 2 + 2 min ) logp, which is consistent with the results in Ravikumar et al. (2011). It is easy to see that a sample sizen & d 2 logp 1=(12) is sufficient for ensuring model selection consistency as long as the minimum Frobenius norm within the true edge set min & q logp n 12 . In functional data analysis, one usually considers using the first several FPC scores in the case when jk ’s decay fast to zero so that a small and a large can be appropriate. When > 4 and the maximum node degree d =o q p 12 logp , the model selection consistency can hold even in thepn regime. 3.5 Simulations I performed a number of simulation studies to compare the fglasso to potential competing methods. In each setting I generatednp functional variables viag ij (t) = s(t) T ij , where s(t) was a 5-dimensional orthogonal Fourier basis function, and ij 2 R 5 was a random Gaussian vector with mean zero. Hence, i = ( T i1 ; ; T ip ) T 2R 5p followed from a multivariate Gaussian distribution with covariance = 1 . Different block sparse patterns in the precision matrix, , correspond to different conditional dependence structures. I considered two general structures. Model 1: An AR(2) model with jj = I, j;j1 = j1;j = 0:4I, j;j2 = j2;j = 0:2I for j = 1; ;p, and equal to zero at all other locations. Hence, only the adjacent two nodes are connected. 
Model 2: For j = 1, ..., 10, 21, ..., 30, ..., the corresponding submatrices in the precision matrix came from Model 1 with p = 10, so that every alternating block of 10 nodes is connected by an AR(2) model. For j = 11, ..., 20, 31, ..., 40, ..., I only set the jth diagonal block equal to I, so the remaining nodes are fully isolated.

In both settings, I generated n = 100 samples of the basis coefficient vectors from the associated multivariate Gaussian distribution, and the observed values, $h_{ijk}$, were sampled using
$$h_{ijk} = g_{ij}(t_k) + e_{ijk}, \qquad e_{ijk} \sim N(0, 0.5^2),$$
where each function was observed at 100 equally spaced time points, $0 = t_1, \ldots, t_{100} = 1$.

To implement the fglasso I must compute $a_{ij}$, the first M functional principal component scores of $g_{ij}$. As mentioned previously, this is a standard problem and there are a number of possible approaches one could use for the calculation. In order to mimic a real data setting I chose to fit each function using a K-dimensional B-spline basis (rather than the Fourier basis which was used to generate the data) and then computed $a_{ij}$ from the basis coefficients. I used 5-fold cross-validation to choose both K and M. In particular, for a given candidate pair {K, M}, I removed 1/5th of the observed time points for each $g_{ij}(t)$ as a validation data set, applied FPCA to the remaining data, calculated the squared error between $h_{ij}(t_k)$ and the fitted values, $\hat{g}_{ij}(t)$, on the validation set, and repeated this 5 times to compute the cross-validated squared error. I computed the cross-validated error over a grid of M and K values and chose the pair with the lowest error. Typically, K = 5, 6 or 7 basis functions and M = 4, 5 or 6 principal components were selected in my simulations.

I compared the fglasso to four competing methods. In the first three methods I fit the standard glasso T times, once at each time point, producing T different network structures. I then used one of three possible rules to combine the T networks into a single network: ALL identified an edge only if it was selected in all T networks, NEVER identified an edge unless it appeared in none of the T networks (that is, an edge was identified if it was selected in at least one network), and HALF identified an edge if it was selected in more than half of the T networks. The final approach, PCA, transformed the functional data into a standard format by computing the first principal component score for each $g_{ij}(t)$ and then running the standard glasso on this data. The dimension of the B-spline basis, K, was still selected by 5-fold cross-validation after setting M = 1.

For each method and each value of the tuning parameter, I calculated the true positive rate (TPR) and false positive rate (FPR) in terms of network edges correctly identified. These quantities are defined by
$$\mathrm{TPR} = \frac{TP}{TP + FN} \qquad \text{and} \qquad \mathrm{FPR} = \frac{FP}{FP + TN},$$
where TP and TN respectively stand for true positives/negatives, and FP and FN represent false positives/negatives.

Figure 3.2: Model 1 for p = 50, 100, 150 and 200: Comparison of median estimated ROC curves over 100 simulation runs with false positive rates up to 20%. fglasso (black solid), ALL (red dashed), NEVER (green dotted), HALF (blue dash dotted) and PCA (cyan long dashed).
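To make the edge-wise error rates concrete, the following minimal Python sketch (the function name edge_rates and the toy edge sets are illustrative, not part of the simulation code) computes TPR and FPR from an estimated and a true edge set, treating edges as unordered node pairs.

```python
import numpy as np

def edge_rates(estimated_edges, true_edges, p):
    """True and false positive rates of an estimated edge set, with all
    p*(p-1)/2 unordered node pairs treated as the candidate edges."""
    all_pairs = {(j, l) for j in range(p) for l in range(j + 1, p)}
    est, truth = set(estimated_edges), set(true_edges)
    tp = len(est & truth)
    fp = len(est - truth)
    fn = len(truth - est)
    tn = len(all_pairs) - tp - fp - fn
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

if __name__ == "__main__":
    # An AR(2)-style truth on p = 6 nodes versus a partially correct estimate.
    p = 6
    truth = {(j, j + 1) for j in range(p - 1)} | {(j, j + 2) for j in range(p - 2)}
    estimate = {(0, 1), (1, 2), (2, 3), (0, 3), (4, 5)}
    print(edge_rates(estimate, truth, p))
```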
Plotting TPR versus FPR over a fine grid of tuning parameter values produces a ROC curve, with curves close to the top left corner indicating a method that is performing well. I consider different settings with p = 50, 100, 150 and 200, and ran each simulation 100 times. Figures 3.2 and 3.3 respectively plot the median estimated ROC curves for each of the five comparison methods for Models 1 and 2. I restrict attention to networks with FPR at most 20%, since this already requires estimating thousands of edges; beyond this point the networks become so dense that the computational cost becomes prohibitive. The fglasso (black curve) clearly provides the best overall performance in estimating the functional network.

Figure 3.3: Model 2 for p = 50, 100, 150 and 200: Comparison of median estimated ROC curves over 100 simulation runs with false positive rates up to 20%. fglasso (black solid), ALL (red dashed), NEVER (green dotted), HALF (blue dash dotted) and PCA (cyan long dashed).

Table 3.1 provides the area under the ROC curves (averaged over the 100 simulation runs) along with standard errors. Larger numbers indicate superior estimates of the true network structure. Again I see that the fglasso provides highly significant improvements in accuracy over the competing methods, and HALF performs the best among the other approaches. I also examined the full ROC curves in the lower dimensional settings with p = 50 and p = 100. The relative performance of the various methods is unchanged; the details are provided in Table 3.2 and Figure 3.4.

Table 3.1: The mean area (x 10^{-1}) under the ROC curves with false positive rates up to 20%. Standard errors are shown in parentheses.

              p     fglasso         ALL             NEVER           HALF            PCA
  Model 1    50    1.224 (0.010)   0.228 (0.003)   0.546 (0.013)   0.781 (0.015)   0.651 (0.006)
             100   1.208 (0.009)   0.221 (0.003)   0.543 (0.013)   0.747 (0.015)   0.641 (0.007)
             150   1.181 (0.009)   0.218 (0.003)   0.540 (0.014)   0.721 (0.014)   0.625 (0.007)
             200   1.168 (0.008)   0.214 (0.002)   0.535 (0.012)   0.703 (0.014)   0.602 (0.007)
  Model 2    50    1.485 (0.011)   0.266 (0.003)   0.769 (0.019)   1.031 (0.020)   0.858 (0.007)
             100   1.458 (0.010)   0.234 (0.003)   0.753 (0.018)   0.995 (0.016)   0.831 (0.007)
             150   1.422 (0.010)   0.222 (0.003)   0.736 (0.019)   0.992 (0.016)   0.826 (0.007)
             200   1.398 (0.008)   0.216 (0.003)   0.702 (0.017)   0.970 (0.018)   0.824 (0.008)

Table 3.2: The mean area under the ROC curves. Standard errors are shown in parentheses.

              p     fglasso         ALL             NEVER           HALF            PCA
  Model 1    50    0.815 (0.003)   0.549 (0.002)   0.656 (0.003)   0.703 (0.004)   0.661 (0.003)
             100   0.809 (0.002)   0.542 (0.002)   0.649 (0.002)   0.689 (0.002)   0.654 (0.003)
  Model 2    50    0.892 (0.002)   0.529 (0.003)   0.746 (0.003)   0.782 (0.004)   0.718 (0.004)
             100   0.881 (0.002)   0.512 (0.002)   0.741 (0.002)   0.762 (0.003)   0.711 (0.003)

3.6 EEG Data

I test the performance of the fglasso on the EEG data set from an alcoholism study. The study consists of 122 subjects, 77 in the alcoholic group and 45 in the control group. For each subject, voltage values were measured from 64 electrodes placed on the scalp, which were sampled at 256 Hz (3.9-ms epoch) for one second. Each subject completed 120 trials under either a single stimulus or two stimuli.
The electrodes were located at standard positions (Standard Electrode Position Nomenclature, American Electroencephalographic Association (1990)). Zhang et al. (1995) discussed the data collection process in detail. Li et al. (2010), Zhou and Li (2014) and Hung and Wang (2013) analyzed the data treating each covariate as a 256 x 64 matrix. I focus on the EEG signals filtered at frequency bands between 8 and 12.5 Hz, the case considered in Knyazev (2007), Hayden et al. (2006) and Zhu et al. (2014a). Using 4 representative electrodes from the frontal and parietal regions of the scalp, Hayden et al. (2006) found evidence of regional asymmetric patterns between the two groups. Zhu et al. (2014a) discussed connectivity and asymmetry of 13 electrodes selected from 5 different regions. However, both sets of authors used multiple samples per subject in order to obtain a sufficiently large sample, violating the independence assumption inherent in most methods. Following the analysis in Li et al. (2010) I only consider the average of all trials for each subject under the single stimulus condition. Thus I have at most n = 77 observations and aim to estimate a network involving p = 64 nodes/electrodes.

I first performed a preprocessing step using the eegfilt function (part of the eeglab toolbox) to perform band filtering on the signals. The fglasso was then fitted to the filtered data. The dimension of the B-spline basis function, K, was selected using the same cross-validation approach as for the simulation study. I set M = 6 for this data since 6 principal components already explained more than 90% of the variation in the signal trajectories. Note that since my goal was to provide interpretable visualizations and investigate differences in brain connectivity between the alcoholic and control groups, I computed sparse networks with approximately 5% connected edges.

Figure 3.4: Model 1 (top row) and Model 2 (bottom row) for p = 50 and 100: Comparison of median estimated ROC curves over 100 simulation runs for fglasso (black solid), ALL (red dashed), NEVER (green dotted), HALF (blue dash dotted), and PCA (cyan long dashed).

To assess the variability in the fglasso fit I performed a bootstrap procedure by randomly selecting n observations with replacement from the functional data, finding a tuning parameter to yield a 5% sparsity level, applying the fglasso approach to the bootstrapped data, and repeating the above process 50 times. The "bootstrapped fglasso" was then constructed from the edges that occurred in at least 50% of the bootstrap replications.

Figure 3.5: Left graph plots the estimated network for the alcoholic group and right graph plots the estimated network for the control group.
Black lines denote edges identified by both fglasso and bootstrapped fglasso, blue lines denote edges identified by fglasso but not selected by the bootstrapped fglasso and red lines denote edges identified by the bootstrapped fglasso but missed by the fglasso. Figure 3.5 plots the estimated network using the fglasso and the bootstrapped fglasso for both the alco- holic and the control groups. The bootstrapped fglasso estimated a sparser network with sparsity level 4:1% for the alcoholic group and 2:5% for the control group. As one would expect, estimates for the control group, with a smaller sample size, entail a larger degree of variability. I observe a few apparent patterns. First, electrodes from the frontal region are densely connected in both groups but the control group has increased connectivity relative to the alcoholic group. Second, the left temporal region of the alcoholic group includes more connected edges. Third, electrodes from other regions of the scalp tend to be only sparsely connected. To identify edges that were clearly different between the two groups I selected edges that occurred at least 50% more often in the bootstrap replications for one group relative to the other group. Figure 3.6 plots the edges only identified by either the alcoholic group or the control group. I observe that edges in the left temporal region were identified by the alcoholic group but missed by the control group, while some edges in the frontal region were identified by the control group but missed by the alcoholic group. Both findings provide confirmation for my informal observations from Figure 3.5. 57 Figure 3.6: Black lines denote edges identified only by the alcoholic group and red lines denote edges identified only by the control group. 3.7 Conclusions The FGM extends Gaussian graphical models to describe the conditional dependence structure of functional data, while the fglasso provides an approach for estimating such networks. The fglasso enjoys several advantages. First, I am able to estimate a single network for functional data, providing a more interpretable solution. Second, I develop an efficient algorithm to maximize the fglasso criterion and a partition rule that can be applied to accelerate the algorithm even in high dimensional settings. Third, my theoretical results demonstrate the equivalence between the estimated and true edge sets with a high probability, and the empirical analysis demonstrates strong finite sample performance. There are several possible variants and extensions for future research. The first involves constructing a graphical model for sparsely and irregularly observed functional data with measurement error. This exten- sion could be implemented by performing FPCA on sparsely sampled data using either a mixed effects model (James et al. 2000) or a local smoother method (Yao et al. 2005a), and then applying the fglasso on the estimated principal component scores. The second possible extension concerns jointly estimating multiple functional graphical models. For scalar data, Danaher et al. (2014) propose a joint graphical lasso to estimateQ different graphical models where there is an assumption that they have some structure in com- mon. The joint graphical lasso attempts to borrow strength across all groups to estimate the connections 58 that are common while still allowing for differences among the groups. 
In my setting the joint functional graphical lasso would correspond to finding e (1) ; ; e (Q) to maximize Q X q=1 n q log det e (q) trace( e S (q) e (q) ) 0 @ X q X j6=l k e (q) jl k F +P ( e ) 1 A (3.21) wheren q is the number of observations for theqth class andP is a penalty to encourage similar structure among the e (q) ’s. The third extension considers applying the node-wise lasso approach of Meinshausen and Buhlmann (2006) to model the conditional dependence structure of multivariate functional data. To be specific, for eachj2V , I build the corresponding functional linear additive model by regressingg j on all the remaining functionsg k ’s, 1k6=jp, g ij (s) = X k6=j Z t2T jk (s;t)g ik (t)dt +" ij (s);i = 1; ;n;s2T (3.22) I then develop a penalized regression approach for (3.22) by adding a penalty on the functions to enforce sparsity, i.e. jk (s;t) = 0 for every (s;t)2T 2 and somek6=p. My purpose is to show the model selection consistency that the estimated edge set b E j = n k2V;k6=j; b jk (s;t)6= 0; for somes andt o is equivalent to the true edge set E j = fk2V;k6=j;C jk (s;t)6= 0; for somes andtg holds with a high probability. This node-wise functional lasso approach adopts a supervised dimension reduction technique which could possibly outperform the functional graphical lasso which uses FPCA to perform an unsupervised dimension reduction. 59 Chapter 4 Feature Screening for Ultra-High Dimensional Functional Regression Models Feature screening for ultra-high dimensional data has become increasingly important in various scientific fields. Several model-free sure independence screening (SIS) approaches were recently proposed based on different forms of dependence measures, such as distance correlation (DC-SIS), martingale difference correlation (MDC-SIS) and sure independence and ranking screening (SIRS). In this chapter, I extend DC and MDC to the functional domain and develop the corresponding SIS procedures to perform feature screening and predictors ranking in the situation involving ultra-high dimensional functional predictors with either a scalar or a functional response. I establish the sure screening properties for the functional versions of DC-SIS and MDC-SIS and conduct several simulations to evaluate the finite sample performance. Both simulation results and real data illustrations indicate the superior performance of DC-SIS/MDC-SIS to SIRS and several other model-based SIS procedures in the functional domain. 4.1 Introduction In Section 1.2.3 I discussed some recent work of model-based and model-free sure independence screening (SIS) procedures for non-functional ultra high dimensional data. In this chapter, I propose a general model- free SIS procedure based on various forms of marginal utilities, which measure the departure from different types of independence between two random functions, for ultra-high dimensional functional data involving one functional response Y i (t);t 2 T , and a large number of functional predictors X ij (s);s 2 S;j = 1 ;p; wherep grows at a non-polynomial rate asn increases. The first setting considers the functional extension of DC, which can be used to measure the dependence between two random functions, and the corresponding DC-based SIS procedure is developed to rank the importance of the functional predictors. Second, I extend MDC to the functional domain by quantifying 60 the departure from conditional mean independence of a functional response given the functional predictor. 
I compare different active sets of functional predictors that the MDC-based and DC-based independence screenings aim to select. Several important properties of the functional MDC as well as its sample coun- terpart are presented to facilitate the development of a MDC-based SIS approach. I provide the theoretical justification of the sure screening properties for DC-SIS and MDC-SIS under mild moment assumptions on the functional response and predictors. Finally, I consider extending SIRS to the functional setting. Each of the three SIS approaches can be viewed as one kind of independence screening approach, although their formulation and targets are different. A series of simulations are conducted to test the sample performance of my proposed SIS procedures. My numerical results indicate that the MDC-SIS and DC-SIS significantly outperform all of their competitors, including SIRS, SIS and NIS. In addition MDC-SIS performs slightly better than DC-SIS in many model settings. The remainder of Chapter 4 is organized as follows. In Section 4.2 I propose a general framework to perform different independence feature screenings. I extend DC (Szekely et al. 2007) and MDC (Shao and Zhang 2014) to the functional domain, use them as the corresponding marginal utilities to rank the importance of functional predictors and show the underlying sure screening properties hold in Sections 4.3 and 4.4 respectively. I also develop a functional extension of SIRS (Li et al. 2012) in Section 4.5. For the case of functional data with measurement error, I take the standard approach of modeling functional data using orthogonal basis functions and develop the corresponding SIS procedures in Section 4.6. I examine the finite sample performance of functional versions of DC-SIS, MDC-SIS and SIRS procedures via a series of simulations in Section 4.7 and illustrate the proposed methods through a MEG data set in Section 4.8. I conclude Chapter 4 with some possible future extensions for this work in Section 4.9. 4.2 Independence Screening Procedures LetY (t);t2T be a functional response with support Y (T ) and X(s) = (X 1 (s); ;X p (s)) T ;s2S be a functional predictor vector. Note I assume the same time domain,S, for all functional predictors, but my results can extend to the more general case where each function corresponds to a different time domain. Here I extend the notation of active predictors and inactive predictors proposed by Li et al. (2012) to the functional domain without specifying any regression model between Y and X. In the remainder of the chapter, I useY to represent the functional response variable and X = (X 1 ; ;X p ) T as the functional 61 predictor vector involvingp functionsX j ; 1 j p. LetF Y (yjX) denote the distribution function ofY conditional onp random functions, X, i.e.F Y (yjX) =P (Y <yjX), whereY <y representsY (t)<y(t) for everyt2T . Then I define the index set of the active and inactive functional predictors by A F = fj :F Y (yjX) functionally depends onX j for somey2 Y (T )g; I F = fj :F Y (yjX) does not functionally depend onX j for anyy2 Y (T )g: (4.1) Let X A F =fX j ;j2A F g and X I F =fX j ;j2I F g denote an active functional predictor vector and its corresponding compliment in X. Then (4.1) indicates that Y is independent of X I F conditional on the remaining functional variables, X A F : Different forms ofF Y (yjX) lead to my development of various independence screening procedures, which are presented in Sections 4.3, 4.4 and 4.5 respectively. 
I first propose a general population form of dependence measure betweenX j (s);s2S andY (t);t2T as w j =f(X j ;Y ); (4.2) to rank the importance of all the candidate functional predictors in X. Suppose thatf(X ij ;Y i );i = 1 ;ng is a random sample from the population (X j ;Y ), a natural empirical estimator is given by b w jn = b f n (x j ; y) = 1 n n X i=1 f(X ij ;Y i ); (4.3) where x j = (X 1j ; ;X nj ) T and y = (Y 1 ; ;Y n ) T . I use (4.3) to rank all the functional predictors from the largest to the smallest. For any givend n , I select the topd n candidates that make a submodel as b A F;d =f1jp :jb w jn j is among the firstd n largest of allg: (4.4) This shrinks the full model of sizep to a submodel (4.4) with sized n , which can be less thann. It is apparent that larger values ofd n leads larger probability of including the true modelA F in the selected model b A F;d . Theoretical justification of the proposed approach can be achieved by showing the underlying sure screening consistency P A F b A F;d ! 1 asn!1: (4.5) 62 Examples ofw j and b w jn involve functional versions of DC-SIS (Li et al. 2012), MDC-SIS (Shao and Zhang 2014) and SIRS (Zhu et al. 2011), which are illustrated in Sections 4.3, 4.4 and 4.5 respectively. Throughout the chapter, I concentrate on the densely observed functional data, e.g. a functiong(t);t2 T is observed on a dense set of time points t 1 ; ;t N , then R t2T g(t) can be approximated by various numerical integral methods (Leader 2004). For ease of the presentation, I need the following definitions. LetU(t),V (t);t2T be two square-integrable functions in the Hilbert spaceL 2 (T ), I define theL 2 inner product betweenU andV by<U;V >= R t2T U(t)V (t)dt: Then theL 2 norm forU is denoted byjjUjj = p <U;U > = q R t2T U 2 (t)dt and theL 2 distance betweenU andV is defined asd(U;V ) =jjUVjj = q R t2T (U(t)V (t)) 2 dt: 4.3 DC-SIS Szekely et al. (2007) proposed a distance correlation (DC) for measuring dependence between two random vectors. To be precise, for any random vector u and v, they defined distance correlation by dcorr(u; v); which is equal to zero if and only if u and v are independent. Li et al. (2012) proposed DC-SIS for non- functional data involvingp predictors as well as either an univariate or a multivariate response, which uses DC and its sample estimator as the marginal utility in the SIS procedure. In Section 4.3.1 I propose the functional version of DC and develop a DC-based independence screening approach to rank the importance of functional predictors in Section 4.3.2 4.3.1 Distance Correlation To propose the functional dependence measure, I need to quote the definition for distance covariance in metric spaces given in Lyons (2013) as follows Definition 1. Let (X;d X ) and (Y;d Y ) be semi-metric spaces of negative type (cf. Lyons (2013)), with random variables X and Y taking values inX andY. Let (X 0 ;Y 0 ) be an i.i.d. copy of (X;Y ). The distance covariance betweenX andY is defined as dcov(X;Y ) =S 1 +S 2 2S 3 ; (4.6) 63 where S 1 = E d(X;X 0 )d(Y;Y 0 ) ; S 2 = E d(X;X 0 ) E d(Y;Y 0 ) ; S 3 = 2E E[d(X;X 0 )jX]E[d(Y;Y 0 )jY ] : WhenX =R u ,Y =R v and d refers to the L 2 distance in the Euclidean space, then dcov in (4.6) reduces to the distance covariance proposed in Szekely et al. (2007). 
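For densely observed curves on a common grid, the sample quantities in (4.8) and (4.9) can be computed directly. The sketch below is a minimal Python illustration (function names such as dcov_hat are illustrative, not a reference implementation), with the L2 distances between curves approximated by the trapezoidal rule.

```python
import numpy as np

def l2_distance_matrix(curves, grid):
    """Pairwise L2 distances d(X_i, X_k) between curves observed on a common grid,
    approximated by the trapezoidal rule."""
    n = curves.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        diff = curves - curves[i]                       # (n, len(grid))
        D[i] = np.sqrt(np.trapz(diff ** 2, grid, axis=1))
    return D

def dcov_hat(A, B):
    """Empirical distance covariance (4.8) built from two pairwise distance matrices."""
    S1 = (A * B).mean()
    S2 = A.mean() * B.mean()
    S3 = (A.mean(axis=0) * B.mean(axis=0)).mean()
    return S1 + S2 - 2.0 * S3

def dcorr_hat(x_curves, y_curves, grid_x, grid_y):
    """Empirical distance correlation (4.9) between two samples of random functions."""
    A = l2_distance_matrix(x_curves, grid_x)
    B = l2_distance_matrix(y_curves, grid_y)
    denom = np.sqrt(dcov_hat(A, A) * dcov_hat(B, B))
    return dcov_hat(A, B) / denom if denom > 0 else 0.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, grid = 50, np.linspace(0.0, 1.0, 100)
    scores = rng.normal(size=n)
    # Two samples of curves driven by a common random score, so they are dependent.
    X = np.outer(scores, np.sin(2 * np.pi * grid)) + 0.1 * rng.normal(size=(n, grid.size))
    Y = np.outer(scores, grid) + 0.1 * rng.normal(size=(n, grid.size))
    print(dcorr_hat(X, Y, grid, grid))
```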
The remainder of this section focuses on the functional setting when X(s);s2S, Y (t);t2T are two random functions, andX = L 2 (S), Y = L 2 (T ) are separable Hilbert spaces with d(X;X 0 ) =jjXX 0 jj = q R s2S (X(s)X 0 (s)) 2 ds; I use the definition (4.6) to measure the dependence in the functional domain. The corresponding distance correlation (Lyons 2013) is defined as dcorr(X;Y ) = dcov(X;Y ) p dcov(X;X)dcov(Y;Y ) : (4.7) Given the random samples x = (X 1 ; ;X n ) T and y = (Y 1 ; ;Y n ) T , the empirical version of (4.6) is denoted as d dcov n (x; y) = b S 1n + b S 2n 2 b S 3n ; (4.8) where b S 1n = 1 n 2 n X i=1 n X k=1 d(X i ;X k )d(Y i ;Y k ); b S 2n = 1 n 2 n X i=1 n X k=1 d(X i ;X k ) 1 n 2 n X i=1 n X k=1 d(Y i ;Y k ); b S 3n = 1 n 3 n X i=1 n X k=1 n X l=1 d(X i ;X l )d(Y k ;Y l ) Thus, a natural estimator for dcorr(X;Y ) is given by [ dcorr n (x; y) = d dcov n (x; y) q d dcov n (x; x) d dcov n (y; y) : (4.9) 64 4.3.2 DC-SIS The model-free setting allows for both linear and nonlinear relationships betweenY and X. The responses can be either continuous or discrete with additive or interactive components in X. I thus use w DC j = dcorr 2 (X j ;Y ); (4.10) and b w DC jn = [ dcorr 2 n (x j ; y); (4.11) as the marginal utility to rank the importance of X j , 1 j p, at the population and empirical levels respectively. I select the set of active predictors as b A F = j : b w DC jn cn ; 1jp ; where c and are parameters defined in Condition 7 to tune the threshold level. Larger values of and smaller values ofc lead us to select more active predictors. I now show that the sure screening property holds for DC-SIS in the functional setting. To this end, two conditions are needed to facilitate the technical analysis. Condition 6. Both X(s);s2S andY (t);t2T satisfy the sub-exponential tail probability uniformly inp, i.e. there exists a constantv 0 > 0 such that for all 0<v 2v 0 , sup p max 1jp E exp v Z S X 2 j (s)ds <1; and E exp v Z T Y (t) 2 dt <1: Condition 7. The minimum distance correlation of active functional predictors in setA F satisfies min j2A F w DC j 2cn ; for some constantsc> 0 and 0 1=2: 65 Condition 6 is satisfied when X(s);s2S,Y (t);t2T follows from a Multivariate Gaussian Process, or when they are uniformly bounded. A similar condition was proposed by Zhu et al. (2011) for the case of scalar data, which is weaker than the Gaussian assumption that has been widely used in ultra-high dimen- sional data (Fan and Lv 2008a, Fan et al. 2011). Condition 7 restricts the minimum DC of active functional predictors, which cannot be too small, to separate the active variables from their inactive counterparts. This is analogous to Condition 2 of Zhu et al. (2011) using DC of active predictors as well as Condition 3 of Fan and Lv (2008a) considering Pearson’s correlation. Theorem 7. Assume that Condition 6 holds, for any 0< < 1=2, there exist constantsc 1 ;c 2 > 0 such that P max 1jp b w DC jn w DC j cn O p h exp(c 1 n 12(+ ) ) +n exp(c 2 n ) i : (4.12) Further assume that Condition 7 holds, then P A F b A F 1O s n h exp(c 1 n 12(+ ) ) +n exp(c 2 n ) i ; (4.13) wheres n is the cardinality ofA F : Hence the sure screening property holds for DS-SIS under some suitable moment assumptions. I can balance the two terms on the right hand side of (4.12) by letting = (1 2)=3, then (4.12) becomes P max 1jp b w DC jn w DC j cn O p exp(c 1 n (12)=3 ) , which indicates that I can deal with non-polynomial dimensionality of order logp = o(n (12)=3 ) for functional data. 
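The screening step itself is simple once a marginal utility is available: compute the empirical utility for every functional predictor and retain the top d_n, as in (4.4). A minimal sketch follows; the stand-in utility in the example is a simple correlation between time-integrated curves, used purely for illustration, and would be replaced by the squared sample distance correlation (4.11) in practice. The function names are illustrative.

```python
import numpy as np

def screen_top_d(x_curves_list, y_curves, utility, d_n):
    """Compute a marginal utility for each of the p functional predictors and keep
    the d_n predictors with the largest absolute utilities, as in (4.4)."""
    utilities = np.array([utility(xj, y_curves) for xj in x_curves_list])
    keep = np.argsort(-np.abs(utilities))[:d_n]
    return sorted(keep.tolist()), utilities

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, p, grid = 60, 20, np.linspace(0.0, 1.0, 50)
    scores = rng.normal(size=(n, p))
    # p functional predictors; only predictors 0 and 1 drive the functional response.
    X = [np.outer(scores[:, j], 1.0 + 0.5 * np.sin((j + 1) * np.pi * grid)) for j in range(p)]
    Y = np.outer(scores[:, 0] + scores[:, 1], grid) + 0.1 * rng.normal(size=(n, grid.size))

    def toy_utility(xj, y):
        # Stand-in utility: absolute correlation between the time-integrated predictor
        # and the time-integrated response.
        u = np.trapz(xj, grid, axis=1)
        v = np.trapz(y, grid, axis=1)
        return abs(np.corrcoef(u, v)[0, 1])

    selected, w = screen_top_d(X, Y, toy_utility, d_n=5)
    print(selected)   # predictors 0 and 1 should be among those retained
```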
Note I obtain the same probability bound here in (4.12) and (4.13) as in Li et al. (2012), this is due to the fact I treat observations ofX ij (s);s2S andY i (t);t2T ,i = 1 ;n, as i.i.d. samples from the populationX j (s);s2S and Y (t);t2T respectively. Then technical analysis using the theory of U-statistics can be accordingly applied to facilitate the proof. 4.4 MDC-SIS Shao and Zhang (2014) proposed a new metric, martingale difference correlation (MDC), denoted by MDC(VjU), to measure the departure from conditional mean independence of an univariate response, V given a vector predictor,U2R q and used MDC as a marginal utility to rank the predictors for ultra-high 66 dimensional data. In Section 4.4.1, I extend the definition of MDC to the functional domain. I then pro- pose a MDC-based variable screening approach to perform independence ranking and feature screening for ultra-high dimensional functional data in Section 4.4.2. 4.4.1 Martingale Difference Correlation As Shao and Zhang (2014) considered the following three relationships forV 2R andU2R q , Cov(U;V ) = 0; (4.14) E[VjU] =E[V ]; i.eE[(VE(V ))jU] = 0 almost surely; (4.15) U andV are independent (4.16) Then (4.16)) (4.15)) (4.14), so (4.15) is between independence (4.16) and uncorrelatedness (4.14). Moti- vated by the definition of martingale difference sequence in probability theory, Shao and Zhang (2014) defined the martingale difference measure based on the fact thatVE(V ) is a martingale difference with respect toU in (4.15). The difference between (4.15) and (4.16) is that (4.15) corresponds to the fact that the conditional mean ofV givenU is independent ofU, but (4.16) is equivalent to the conditional distribution ofV givenU being independent ofU. To measure the deviation from conditional mean independence based on (4.15), Shao and Zhang (2014) proposed a new framework, involving martingale difference divergence (MDD) and martingale difference correlation (MDC). Notice the distance covariance (Szekely et al. 2007) between two random vectorsU2R q andV 2R r is defined as dcov(U;V ) = Z R q+r jf U;V (u;v)f U (u)f V (u)j 2 w 0 (u;v)dudv 1=2 ; (4.17) where f U;V (u;v) = E[e i(<u;U>+<v;V>) ], f U (u) = E[e i<u;U> ] and f V (v) = E[e i<v;V> ] represent the joint and marginal characteristic functions andw 0 (u;v) =w 00 (u)w 00 (v) is a weight function withw 00 (u) = c 1 q jjujj (q+1) q andc q = 2 q=2 (1=2) 2((q+1)=2) defined in Lemma 1 of Szekely et al. (2007). This motivates Shao and Zhang (2014) to define MDD ofV 2R givenU2R q as MDD(VjU) = Z R q jg V;U (u)g V (u)g U (u)j 2 w 00 (u)du 1=2 ; (4.18) 67 whereg V;U (u) =E[Ve i<u;U> ],g V =E[V ] andg U (u) =E[e i<u;U> ]: The remainder of this section considers the functional setting whenX(s);s2S andY (t);t2T are two random functions. The functional versions of (4.16), (4.15) and (4.14) are presented in (4.21), (4.20) and (4.19) respectively. Cov(X(s);Y (t)) = 0 for everys2S andt2T: (4.19) E[YjX] =E[Y ]; i.eE[(YE(Y ))jX] = 0 almost surely; (4.20) X andY are independent (4.21) Note E(YjX) in (4.20) refers to the expectation of random function Y conditional on the random func- tion X. By a similar argument to the non-functional setting, the functional version of DC proposed in Section 4.3.1 can measure the general dependence between two random functions and the functional coun- terparts of MDD and MDC are used to depict the deviation from conditional mean independence based on (4:20). 
Specifically I define the functional versions of MDD and MDC in Definitions 2 and 3 respectively and present some important properties in Propositions 3 and 4, which are analogous to Theorem 1 of Shao and Zhang (2014). Definition 2. The martingale difference divergence (MDD) ofY givenX is denoted byMDD(YjX) with definition MDD(YjX) 2 = Z U jjg Y;X (u)g Y g X (u)jj 2 w(u)du (4.22) wherejjjj denotes a L 2 norm of a function, g Y;X (u) = E(Ye i<u;X> ) with u(s);s2S and u2U, g Y = E(Y ) andg X (u) = E(e i<u;X> ): Notew(u) is a weight function such that R U (1 cos < u;X > )w(u)du =jjXjj: Definition 3. The martingale difference correlation (MDC) ofY givenX is defined by MDC(YjX) 2 = MDD(YjX) 2 p fvar(Y ) 2 dvar(X) 2 ; (4.23) where fvar(Y ) =EjjYE(Y )jj 2 and dvar(X) = dcov(X;X) is the distance variance ofX. Proposition 3. IfE(jjXjj 2 +jjYjj 2 )<1, then MDD(YjX) 2 =E[<YE(Y );Y 0 E(Y 0 )>d(X;X 0 )]; (4.24) 68 Proposition 4. IfE(jjXjj 2 +jjYjj 2 )<1, then 0MDC(YjX) 1 andMDC(YjX) = 0 if and only ofE(YjX) =E(Y ) almost surely. Given the random samples x = (X 1 ; ;X n ) T and y = (Y 1 ; ;Y n ) T , let b g n Y;X (u) = 1 n P n j=1 Y j e i<u;X j > ,b g n Y = 1 n P n j=1 Y j andb g n X (u) = 1 n P n j=1 e i<u;X j > denote the empirical versions of g Y;X (u), g Y and g X (u) respectively. Similar to the definition of MDD and MDC in the non-functional setting, I define the sample counterparts as follows. Definition 4. The sample martingale difference divergence of MDD(YjX) is defined by \ MDD n (yjx) 2 = Z U jjb g n Y;X (u)b g n Y b g n X (u)jj 2 w(u)du; (4.25) where w(u) is defined in Definition 2 and the corresponding sample martingale difference correlation MDC(YjX) is defined by \ MDC n (yjx) 2 = \ MDD n (yjx) 2 q d fvar n (y) 2d dvar n (x) 2 ; (4.26) where d fvar n (y) = 1 n P n j=1 jjY j Yjj 2 with Y = 1 n P n j=1 Y j and d dvar n (x) = d dcov n (x; x) is the sample distance variance ofX. Proposition 5. The sample martingale difference divergence \ MDD(YjX) is given by \ MDD n (yjx) 2 = b T 1n b T 2n + 2 b T 3n ; (4.27) where b T 1n = 1 n 2 n X i=1 n X k=1 <Y i ;Y k >d(X i ;X k ); b T 2n = 1 n 2 n X i=1 n X k=1 <Y i ;Y k > 1 n 2 n X i=1 n X k=1 d(X i ;X k ); b T 3n = 1 n 3 n X i=1 n X k=1 n X l=1 <Y i ;Y l >d(X k ;X l ): Proposition 6. If E(jjXjj + jjYjj 2 ) < 1, then lim n!1 \ MDD n (yjx) = MDD(YjX) and lim n!1 \ MDC n (yjx) =MDC(YjX) almost surely. 69 (4.27) in Proposition 5 provides a sample estimator for MDD(YjX), where b T 1n , b T 2n , b T 3n are analogous to b S 1n , b S 2n , b S 3n in the sample distance covariance given by (4.8). Proposition 4 indicates the equivalence between conditional mean independence, i.e. E(YjX) = E(Y ) almost surely, and MDD(YjX) = 0. Proposition 6 shows the consistency of \ MDD n (yjx)/ \ MDC n (yjx) as an estimator for MDD(YjX)/ MDC(YjX). These important properties provide support for developing a MDC-based independence fea- ture screening approach in Section 4.4.2. 4.4.2 MDC-SIS As Shao and Zhang (2014) commented, regression problems concentrate more on inferences about the con- ditional mean of the response given the predictors, rather than other aspects of the conditional distribution. They defined the corresponding sets of active and inactive predictors based on the conditional mean of the response. 
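As a numerical illustration of the sample MDD, the sketch below computes it for densely observed curves through the centered double sum $-\frac{1}{n^2}\sum_{i,k}\langle Y_i - \bar{Y}, Y_k - \bar{Y}\rangle\, d(X_i, X_k)$, which is my reading of the expansion underlying the $\hat{T}_{1n}$, $\hat{T}_{2n}$, $\hat{T}_{3n}$ decomposition in Proposition 5. The function name mdd_hat_sq and the toy data are illustrative.

```python
import numpy as np

def mdd_hat_sq(x_curves, y_curves, grid_x, grid_y):
    """Sample martingale difference divergence (squared) of Y given X for functional
    data on common grids, written through the centered double sum that the
    T1n/T2n/T3n decomposition expands into."""
    n = x_curves.shape[0]
    # Pairwise L2 distances between the X curves.
    A = np.zeros((n, n))
    for i in range(n):
        A[i] = np.sqrt(np.trapz((x_curves - x_curves[i]) ** 2, grid_x, axis=1))
    # Centered Y curves and their pairwise L2 inner products <Y_i - Ybar, Y_k - Ybar>.
    Yc = y_curves - y_curves.mean(axis=0)
    G = np.trapz(Yc[:, None, :] * Yc[None, :, :], grid_y, axis=2)
    return -(G * A).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n, grid = 80, np.linspace(0.0, 1.0, 60)
    scores = rng.normal(size=n)
    X = np.outer(scores, np.cos(np.pi * grid)) + 0.1 * rng.normal(size=(n, grid.size))
    Y_dep = np.outer(scores ** 2, grid) + 0.1 * rng.normal(size=(n, grid.size))
    Y_indep = rng.normal(size=(n, grid.size))
    # The first value should be noticeably larger than the second.
    print(mdd_hat_sq(X, Y_dep, grid, grid), mdd_hat_sq(X, Y_indep, grid, grid))
```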
I extend their definition to the functional domain as follows A E = fj :E(YjX) functionally depends onX j for somey2 Y (T )g; I E = fj :E(YjX) does not functionally depend onX j for anyy2 Y (T )g: (4.28) Since (4.21) implies (4.20), it then follows thatA E A F , thus there is no meaningful reason to compare them. Note that MDC-SIS and DC-SIS aim for different subsets of active functional predictors, although each method can be regarded as one type of sure independence screening procedure. I consider using w MDC j = MDC(X j ;Y ); (4.29) and b w MDC jn = [ MDC n (x j ; y); (4.30) as the population and empirical versions of the marginal utility of the MDC-SIS approach. I select the subset of active predictors as b A E = n j : b w MDC jn c 0 n 0 ; 1jp o ; where prespecified parametersc 0 and 0 defined in Condition 8 control threshold values. 70 Condition 8. The minimum MDC of active functional predictors in setA E satisfies min j2A E w MDC j 2c 0 n 0 ; for some constantsc 0 > 0 and 0 0 1=2: Theorem 8. Assume that Condition 6 holds, for any 0< 0 < 1=2 0 . Then there exist constantsc 0 1 ;c 0 2 > 0 such that P max 1jp b w MDC jn w MDC j c 0 n 0 O p h exp(c 0 1 n 12( 0 + 0 ) ) +n exp(c 0 2 n 0 =2 ) i (4.31) Further assume that Condition 8 holds. Then P A E b A E 1O s 0 n h exp(c 0 1 n 12( 0 + 0 ) ) +n exp(c 0 2 n 0 =2 ) i (4.32) wheres 0 n is the cardinality ofA E : Theorem 8 indicates that the sure screening property holds for the MDC-based screening approach under the same sub-exponential tail bound assumption as stated in Theorem 7. Condition 8 for the MDC is analogous to Condition 7 for DC, to separate active predictors from inactive ones, which leads to (4.32). To balance the two terms on the right hand side of (4.31), I let = (2 4)=5, then (4.31) reduces toP max 1jp b w MDC jn w MDC j c 0 n 0 O p h exp(c 0 1 n (14 0 )=5 ) i , showing MD-SIS can handle ultra-high dimensional functional data of order logp = o(n (14)=5 ). Note I obtain the same probability bounds here in (4.31) and (4.32) with those in Shao and Zhang (2014) by using the theory of U-statistics treating each observation of the random function as an i.i.d. sample. 4.5 SIRS SIRS (Zhu et al. 2011) considers the conditional distribution function of Y 2 R given X = (X 1 ; ;X p ) T 2 R p by letting F Y (yjX) = P (Y < yjX); and defines (y) = E[XF Y (yjX)] = 71 Cov(X; 1(Y < y)) to measure the association between X and Y . Let j (y) denote the jth component of (y). Zhu et al. (2011) proposed w SIRS j =E[ 2 j (Y )]; j = 1 ;p; (4.33) as the population version of the marginal utility for predictors ranking. As Shao and Zhang (2014) com- mented, under suitable assumptions, it can be shown thatw SIRS j = 0 if and only ifE(X j jY ) = E(X j ) almost surely. This observation suggests that SIRS can screen out irrelevant predictors whose conditional mean given the response does not depend on the response. This is different from MDC-SIS, which consid- ers whetherE(YjX j ) = E(Y ) almost surely. For the case thatX j andY are independent, MDC(YjX j ), E[ 2 j (Y )] and dcorr(X j ;Y ) all equal zero. MDC-SIS, DC-SIS and SIRS can thus all be viewed as var- ious implementations of the independence screening approach. I can accordingly choose the underlying approach, depending on the purpose for selecting subsets of active predictors. I next extend (4.33) to the case involving functional predictors,X j (s);s2S and a functional response, Y (t);t2T . 
LetF Y (yjX) = P (Y < yjX) and (y) = E[XF Y (yjX)] = Cov(X; 1(Y < y)), where the indicator function 1(Y < y) equals one if Y (t) < y(t) for every t 2 T and zero otherwise. Let j (y;s) = Cov(X j (s); 1(Y <y)) denote thekth component of (y) ats2S. The functional version of w SIR j can be naturally defined by w SIRS j = E Z s2S j (Y 0 ;s) 2 ds = E Z s2S Cov X j (s); 1(Y <Y 0 ) 2 ds ; j = 1 ;p: (4.34) The corresponding empirical estimator forw SIRS j is b w SIRS jn = 1 n n X j=1 Z s2S 1 n n X i=1 X ij (s)1(Y i <Y j ) ! 2 ds: (4.35) Sample performance of SIRS is tested in Sections 4.7 and 4.8. I leave theoretical justification of the underlying sure screening property for future research. 72 4.6 Functional Data with Measurement Error In practice, functional data with measurement error is commonly observed. Assume X ij (t) is observed, with measurement error, atK time points, s = (s 1 ; ;s K ). Similarly, I assume a noisy version ofY i (t) observed atL time points, t = (t 1 ; ;t L ). In particular, letW ijk andV il respectively denote the observed values ofX ijk =X ij (s k ) andY il =Y i (t l ). Then W ijk =X ijk +e xijk (4.36) V il =Y il +e yil (4.37) wherei = 1; ;n;j = 1; ;p;k = 1 ;K;l = 1 ;L and error termse x ijk ande y il are modeled as i.i.dN(0; 2 xj ) andN(0; 2 y ) respectively. To perform different types of independence screening procedures, some form of smoothness constraint is needed forX ij (s) andY i (t). Analogously to the techniques I used for fitting SIMFE in Section 2.2.2, I model the functional variables usingM j andN-dimensional orthogonal basis functions, b xj (s) and b y (t), such that R b xj (s)b xj (s) T = I M j and R b y (t)b y (t) T = I N . Thus X ij (s) = b xj (s) T ij andY i (t) = b y (t) T i ; (4.38) where ij and i are respectively the basis coefficients for the functional predictors and the response. Let B xj be the K by M j -dimensional basis matrix with kth row b xj (s k ) and B y be the L by N-dimensional basis matrix with lth row b y (t l ). Denote W ij = (W ij1 ; ;W ijK ) T and V i = (V i1 ; ;V iL ) T . Some standard calculations indicate that ij and i can be estimated by e ij = (B T xj B xj ) 1 B T xj W ij ; (4.39) e i = (B T y B y ) 1 B T y V i : (4.40) Plugging (4.39) and (4.40) into (4.38) leads to e X ij (s) = b xj (s) T e ij and e Y i (t) = b y (t) T e i : (4.41) 73 The corresponding b w DC jn , b w MDC jn and b w SIRS jn using e X ij (s) and e Y i (t) are denoted by e w DC jn , e w MDC jn and e w SIRS jn in (4.42), (4.43) and (4.44) as follows, e w DC jn = d dcov n (e x j ;e y); (4.42) e w MDC jn = [ MDC n (e yje x j ); (4.43) e w SIRS jn = 1 n n X j=1 Z s2S 1 n n X i=1 e X ij (s)1( e Y i < e Y j ) ! 2 ds: (4.44) 4.7 Simulations In this section, I compare the finite sample performance of functional versions of DC-SIS, MDC-SIS, and SIRS with several model based screening procedures such as SIS (Fan and Lv 2008a) and NIS (Fan et al. 2011) via Monte Carlo simulation. I usedP M , the proportion that an individual functional predictor is selected for a given model sized over 100 replications, as well asP a , the proportion that all active predictors are selected for a given model size d over 100 replications, as two criterion for evaluating the performance following Li et al. (2012). Values ofP M andP a closer to one imply stronger ability of a screening procedure to detect an individual (active) predictor and all active predictors respectively. Li et al. 
(2012) also consideredM, the minimum model size to include all active predictors, to measure the performance of a screening approach. Smaller values ofM associate with larger values ofP M andP a , thus I only present the results on these two proportions. I generatednp functional variables viaX ij (s) = b(s) T ij , where b(s) was a 5-dimensional orthog- onal Fourier basis function, and ij 2R 5 ; 1jp = 500 was sampled from a Gaussian vector with zero mean and covariance matrix, 2R 5p5p , where ii = I 5 and ij = 0:5 jjij I 5 , 1 i6= j p. I gen- eratedn = 100 samples of ij and the observed values,W ijk , were sampled using (4.36) with xj = 0:2, where each function was observed atK = 50 equally spaced time points, 0 = s 1 ; ;s 50 = 1. I applied 74 a functional single index model to depict the relationship between the response and multiple predictors as follows Y i =m(P i ) +" i ; where" i was i.i.d. sampled fromN(0; 0:2 2 ), P i = (P i1 ; ;P ip ) T andP ij = R s2S X ij (s) j (s)ds; 1 j p: Note the corresponding index functions are similarly generated by j (t) = s(t) T j , where each component in j 2R 5 was sampled from a standard normal. I considered the following 3 examples for scalar response. Case(a):m i =P 1i + 0:8P 2i + 0:6P 3i , Case(b):m i =P 1i + (P 2i 1) 2 + sin(0:5P 3i ), Case(c):Y i =P 1i +P 2i +P 2 3i + exp((P 4i +P 5i )=2)" i . Cases (a) and (b) were respectively generated from linear and nonlinear additive functional single index models with a scalar response. In both cases, DC-based and MDC-based screening procedures aim to identify the same subsets of active predictors, whereA F =A E =f1; 2; 3g. Case (c) was conducted to compare the different active predictors that DC-SIS and MDC-SIS are capable of detecting, in particular A F = f1; 2; 3; 4; 5g andA E = f1; 2; 3g. For the setting involving functional response, I considered the following two examples, where each response was observed at L = 50 equally spaced time points, 0 =t 1 ; ;t 50 = 1. Cases (d) and (e): Y i (t l ) = m(t l ; P i ) = t l m i , wherem i takes the form of cases (a) and (b) respec- tively. To stimulate a real data setting, I used Fourier bases to generate the data and chose to smooth the noisy observations W ij using a M j dimensional B-spline basis. M j was obtained by a 5-fold cross-validation approach. In particular, for a given candidate set ofM j I removed 1/5th of the observed time points for each X ij (s) as a validation set, fitted a linear model to the remaining observations based on (4.36) and (4.38), and calculated the squared error between W ij and fitted values, b X ij (t), on the validation set. This procedure was repeated 5 times over a grid ofM j values, and I selected the dimension that provides the lowest cross- validated squared error. I implemented this approach for each of thep predictors. Typically,M j = 4; 5 or 6 B-spline basis functions were selected in my simulations. 75 I compared DC-SIS, MDC-SIS and SIRS with four model based screening methods, SIS-L2, NIS-L2, SIS-PCA and NIS-PCA. In the first two methods (SIS-L2, NIS-L2), I applied SIS and NISK times, once on each time point, producingS-dimensional marginal utility vector. I then computed theL 2 norm to provide the final marginal utility. The last two PCA approaches (SIS-PCA, NIS-PCA) considered transforming the functional data into a standard format by computing the first principal component score on eachX ij (t), then implementing SIS and NIS on this data. 
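A small sketch of the curve-recovery and dimension-selection steps is given below. It follows the least squares fit (4.39)-(4.41) and the held-out-time-point cross-validation described above, except that, to keep the example self-contained, a Fourier basis stands in for the B-spline basis actually used in the simulations; all function names are illustrative.

```python
import numpy as np

def fourier_basis(s, m):
    """m-dimensional basis evaluated at the time points s (a stand-in for the
    B-spline basis used in the simulations)."""
    cols, k = [np.ones_like(s)], 1
    while len(cols) < m:
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * k * s))
        if len(cols) < m:
            cols.append(np.sqrt(2) * np.cos(2 * np.pi * k * s))
        k += 1
    return np.column_stack(cols)

def fit_curve(w, s, m):
    """Least squares basis coefficients, as in (4.39), and the fitted curve, as in (4.41)."""
    B = fourier_basis(s, m)
    coef = np.linalg.solve(B.T @ B, B.T @ w)
    return coef, B @ coef

def choose_dimension(w, s, candidates, n_folds=5, seed=0):
    """Pick the basis dimension by holding out 1/n_folds of the time points,
    refitting, and scoring squared error on the held-out points."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(s)) % n_folds
    errors = []
    for m in candidates:
        err = 0.0
        for f in range(n_folds):
            train, test = folds != f, folds == f
            coef, _ = fit_curve(w[train], s[train], m)
            err += np.sum((w[test] - fourier_basis(s[test], m) @ coef) ** 2)
        errors.append(err)
    return candidates[int(np.argmin(errors))]

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    s = np.linspace(0.0, 1.0, 50)
    truth = np.sin(2 * np.pi * s) + 0.5 * np.cos(4 * np.pi * s)
    w = truth + 0.2 * rng.normal(size=s.size)        # noisy observations, as in (4.36)
    m = choose_dimension(w, s, candidates=[3, 5, 7, 9])
    coef, fitted = fit_curve(w, s, m)
    print(m, np.round(np.mean((fitted - truth) ** 2), 4))
```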
Table 4.1 reports the values of P_M and P_a, with d = 40, calculated from all examples. In most of the model settings, both DC-SIS and MDC-SIS exhibit significantly stronger capability of identifying active functional predictors than the other methods, and MDC-SIS performs slightly better than DC-SIS. Under the linear model in case (a), SIS outperforms all the other methods, although the model-free screening methods are still capable of identifying the linear relationship. Case (c) was conducted to demonstrate that the DC-based and MDC-based screening methods aim at different subsets of active predictors, where A_F = {1, 2, 3, 4, 5} and A_E = {1, 2, 3}. I did not present values of P_a because A_F and A_E differ. As expected, MDC-SIS selects X_4 and X_5 only a small percentage of the time, whereas DC-SIS is able to identify these two predictors with much higher percentages. Therefore, MDC-SIS is more efficient than DC-SIS at detecting variables that contribute to the conditional mean of the response variable. Results for cases (d) and (e) with functional responses are generally consistent with those in the scalar case. Here I did not provide results for the model-based screening methods for two reasons. First, the results from the scalar case indicate the inefficiency of approaches that only use the first PC scores. Second, the grid method, which performs SIS and NIS on each pair of time points, significantly increases the computational burden when KL is large.

Table 4.1: The proportions P_M and P_a with d = 40.

  Case  Method     X_1    X_2    X_3    X_4    X_5    All
  (a)   SIS-PCA    0.72   0.71   0.72   0.23   0.21   0.36
        SIS-L2     0.94   0.89   0.88   0.11   0.12   0.71
        NIS-PCA    0.69   0.66   0.61   0.21   0.25   0.27
        NIS-L2     0.84   0.79   0.78   0.11   0.12   0.53
        DC-SIS     0.88   0.84   0.83   0.11   0.12   0.64
        MDC-SIS    0.91   0.86   0.84   0.13   0.15   0.66
        SIRS       0.81   0.79   0.75   0.17   0.18   0.50
  (b)   SIS-PCA    0.33   0.25   0.33   0.25   0.28   0.05
        SIS-L2     0.60   0.22   0.56   0.15   0.19   0.09
        NIS-PCA    0.31   0.27   0.39   0.28   0.29   0.05
        NIS-L2     0.53   0.33   0.57   0.17   0.19   0.10
        DC-SIS     0.81   0.65   0.83   0.13   0.12   0.43
        MDC-SIS    0.85   0.69   0.87   0.11   0.16   0.50
        SIRS       0.51   0.45   0.58   0.18   0.17   0.13
  (c)   SIS-PCA    0.78   0.75   0.51   0.21   0.16
        SIS-L2     0.92   0.90   0.62   0.28   0.16
        NIS-PCA    0.73   0.71   0.58   0.31   0.33
        NIS-L2     0.83   0.85   0.77   0.48   0.36
        DC-SIS     0.95   0.94   0.90   0.71   0.48
        MDC-SIS    0.94   0.94   0.89   0.36   0.13
        SIRS       0.78   0.82   0.51   0.49   0.21
  (d)   DC-SIS     0.83   0.81   0.79   0.10   0.13   0.51
        MDC-SIS    0.85   0.79   0.80   0.11   0.11   0.53
        SIRS       0.29   0.21   0.28   0.22   0.25   0.02
  (e)   DC-SIS     0.75   0.60   0.76   0.13   0.15   0.34
        MDC-SIS    0.79   0.65   0.77   0.14   0.11   0.34
        SIRS       0.22   0.18   0.23   0.19   0.20   0.02

4.8 Real Data

I tested the performance of various independence screening procedures on the Magnetoencephalography (MEG) data set collected by the Center for Clinical Neurosciences, University of Texas Health Science Center at Houston. The MEG activities were recorded at 356 equally spaced time points over 248 "channels". Each channel corresponded to the measurement of the intensity level of the magnetic field at a particular location on the brain. Multiple trials, each consisting of reading the patient a word and measuring the MEG over time, were completed by each patient. Multiple samples per subject led to a relatively larger sample size, but violated the independence assumption made by most methods. I considered the average of all trials for each subject, thus generating n = 20 observations with 248 functional predictors. The scalar response Y was whether the patient was left or right brain dominated, denoted by Y = 0 or Y = 1 respectively. Notice some channels had missing MEG recordings for some patients and were removed from the study.
Thus I only focused on the data consisting of p = 199 functional predictors. It is of scientific interest to identify the most influential channels for determining whether the subject is left or right brained. The MEG data have been analyzed by Fan et al. (2015) using a penalized least squares optimization approach to perform variable selection. This is an ultra-high dimensional data set because the ratio of dimensionality to sample size was about 10. Hence, it is more reasonable to first apply the screening procedures to reduce the dimensionality of the predictors to a moderate size, followed by the variable selection approach proposed by Fan et al. (2015). I used M_j-dimensional B-spline bases to fit each function, and M_j, 1 <= j <= 199, was chosen by a 10-fold cross-validation approach analogous to that described in Section 4.7. Typically, M_j = 5, 6, 7, 8 or 9 basis functions were selected for each predictor. I applied the seven feature screening methods in the same manner as described in Section 4.7.

Table 4.2: Results of channel ranking using 7 feature screening methods.

                                    Rank
  Method      1     2     3     4     5     6     7     8     9     10
  SIS-PCA     21    40    41    63    64    91    20    7     3     38
  SIS-L2      40    41    21    63    91    8     20    39    7     64
  NIS-PCA     245   41    125   3     29    21    40    154   64    90
  NIS-L2      121   122   89    90    41    61    21    13    42    123
  SIS-DC      41    40    21    91    63    39    64    65    123   90
  SIS-MDC     41    40    21    91    63    39    64    65    123   90
  SIRS        40    41    21    63    91    8     20    39    7     64

Table 4.2 provides the results of feature ranking using the different methods. The three model-free SIS procedures, SIS-DC, SIS-MDC and SIRS, as well as SIS-L2, all rank Channels 21, 40, 41, 63 and 91 as the top five candidates. Notice that the top 5 channels overlap among these four methods. The ranking orders were the same only between SIS-DC and SIS-MDC, and differed slightly from those of SIS-L2 and SIRS. The NIS methods provided completely different rankings of the channels. To judge which feature screening method delivered a more sensible feature ranking, I computed 5-fold cross-validated misclassification rates. Due to the small sample size, n = 20, I only considered the model with the most influential predictor suggested by the ranking results in Table 4.2. I applied a GLM with logistic link (McCullagh and Nelder 1989) to fit the model. The resulting cross-validated misclassification rate of 35% using only Channel 41 was the lowest error rate in comparison to the other candidate predictors, such as Channels 40, 21, 245 and 121. In my analysis I dealt with a very challenging data set due to the small sample size; a larger number of observations would be needed to compare the different screening methods more effectively.
4.9 Conclusions

In this chapter, I present several metrics in the functional domain which quantify different types of dependence between two random functions. I extend several model-free independence screening approaches, such as DC-SIS, MDC-SIS and SIRS, to the functional domain, and establish the corresponding sure screening properties when the number of predictors diverges at an exponential rate with the sample size. I examine the finite sample performance of my proposed procedures via Monte Carlo studies and a real-world MEG data set.

Several important topics in ultra-high dimensional functional data feature screening remain for further study. First, similar to SIS (Fan and Lv 2008a), DC-SIS and MDC-SIS may miss important functional predictors that are marginally independent of the response but jointly dependent on it. Thus it is meaningful to develop iterative versions of my proposed approaches to fix this issue, in a similar spirit to iterative SIS (Fan and Lv 2008a) and iterative SIRS (Zhu et al. 2011). New effective methods, with supporting theoretical analysis, that can identify influential predictors marginally independent of the response are needed for further research. The second extension concerns the development of thresholding selection in empirical analysis. Zhu et al. (2011) proposed a thresholding rule for SIRS combining a soft cutoff value, obtained by adding auxiliary variables to the data, with a hard cutoff value, which retains a fixed number of predictors after feature ranking. It is of interest to develop analogous thresholding selection procedures for my proposed screening approaches in the functional domain. The third setting involves performing independence screening for ultra-high dimensional sparse, irregular and noisy functional data. This extension could be implemented through a mixed effects model under a Gaussian assumption, which borrows information across all predictors and thereby provides an accurate reconstruction of each functional predictor. I aim to establish the corresponding sure screening properties for sparsely sampled functional data in the ultra-high dimensional setting.

References

K. Balasubramanian, B. Sriperumbudur, and G. Lebanon. Ultrahigh dimensional feature screening via RKHS embeddings. JMLR W&CP, 31:126-134, 2013.
D. Bosq. Linear Processes in Function Spaces: Theory and Applications. Springer, New York, 2000.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2010.
T. Cai, W. Liu, and X. Luo. A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106, 2011.
E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n (with discussion). The Annals of Statistics, 35:2313-2404, 2007.
H. Cardot, F. Ferraty, and P. Sarda. Spline estimators for the functional linear model. Statistica Sinica, 13:571-591, 2003.
R. J. Carroll, D. Ruppert, L. A. Stefanski, and C. Crainiceanu. Measurement Error in Nonlinear Models: A Modern Perspective. Chapman and Hall/CRC, 2nd edition, 2006.
D. Chen, P. Hall, and H. G. Muller. Single and multiple index functional regression models with nonparametric link. The Annals of Statistics, 39:1720-1747, 2011.
M. Cheng, T. Honda, J. Li, and H. Peng. Nonparametric independence screening and structure identification for ultra-high dimensional longitudinal data. The Annals of Statistics, 42:1819-1849, 2014.
P. Danaher, P. Wang, and D. Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B, 76:373-397, 2014.
J. Dauxois, A. Pousse, and Y. Romain. Asymptotic theory for the principal components analysis of a vector random function: some applications to statistical inference. Journal of Multivariate Analysis, 12:136-154, 1982.
R. Eubank and T. Hsing. Canonical correlation for stochastic processes. Stochastic Processes and their Applications, 118:1634-1661, 2008.
J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348-1360, 2001.
J. Fan and J. Lv.
Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B, 70:849-911, 2008a.
J. Fan and J. Lv. A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20:101-148, 2008b.
J. Fan, R. Samworth, and Y. Wu. Ultrahigh dimensional feature selection: beyond the linear model. Journal of Machine Learning Research, 10:1829-1853, 2009.
J. Fan, Y. Feng, and R. Song. Nonparametric independence screening in sparse ultra-high dimensional additive models. Journal of the American Statistical Association, 106:544-557, 2011.
Y. Fan, N. Foutz, G. M. James, and W. Jank. Functional forecasting of demand decay rates using virtual stock markets. The Annals of Applied Statistics, 9:2435-2460, 2014.
Y. Fan, G. M. James, and P. Radchenko. Functional additive regression. The Annals of Statistics, 2015.
F. Ferraty and Y. Romain. The Oxford Handbook of Functional Data Analysis. Oxford University Press, 2011.
F. Ferraty and P. Vieu. Nonparametric Functional Data Analysis: Theory and Practice, volume 30. Springer, 2006.
J. Friedman and W. Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76:817-823, 1981.
J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2:432-441, 2007.
J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and sparse group lasso. Working Paper, 2010.
T. Gasser, H. G. Mueller, W. Kohler, L. Molinari, and A. Prader. Nonparametric regression analysis of growth curves. The Annals of Statistics, 12:210-229, 1984.
D. Gervini and T. Gasser. Self-modeling warping functions. Journal of the Royal Statistical Society, Series B, 66:959-971, 2004.
P. Hall and M. Hosseini-Nasab. On properties of functional principal components analysis. Journal of the Royal Statistical Society, Series B, 68:109-126, 2006.
W. Hardle and T. Stoker. Investigating smooth multiple regression by the method of average derivatives. Journal of the American Statistical Association, 84:986-995, 1989.
T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman and Hall, 1990.
E. P. Hayden, R. E. Wiegand, E. T. Meyer, L. O. Bauer, S. J. O'Connor, J. I. Nurnberger, D. B. Chorlian, B. Porjesz, and H. Begleiter. Patterns of regional brain activity in alcohol-dependent subjects. Alcoholism: Clinical and Experimental Research, 30:1986-1991, 2006.
S. Hays, H. Shen, and J. Huang. Functional dynamic factor models with application to yield curve forecasting. The Annals of Applied Statistics, 6:870-894, 2012.
L. Horvath and P. Kokoszka. Inference for Functional Data with Applications. Springer, 2012.
H. Hotelling. Relations between two sets of variables. Biometrika, 28:321-377, 1936.
H. Hung and C. Wang. Matrix variate logistic regression model with application to EEG data. Biostatistics, 14:189-202, 2013.
R. J. Hyndman and M. S. Ullah. Robust forecasting of mortality and fertility rates: a functional data approach. Computational Statistics and Data Analysis, 51:4942-4956, 2007.
H. Ichimura. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics, 58:71-120, 1993.
G. M. James. Generalized linear models with functional predictors. Journal of the Royal Statistical Society, Series B, 64:411-432, 2002.
G. M. James and T. Hastie. Functional linear discriminant analysis for irregularly sampled curves. Journal of the Royal Statistical Society, Series B, 63:533-550, 2001.
G.
M. James and B. W. Silverman. Functional adaptive model estimation. Journal of the American Statistical Association, 100:565-576, 2005.
G. M. James and C. Sugar. Clustering for sparsely sampled functional data. Journal of the American Statistical Association, 98:397-408, 2003.
G. M. James, T. Hastie, and C. Sugar. Principal component models for sparse functional data. Biometrika, 87:587-602, 2000.
G. M. James, J. Wang, and J. Zhu. Functional linear regression that's interpretable. The Annals of Statistics, 37:2083-2108, 2009.
W. Jank and G. Shmueli. Profiling price dynamics in online auctions using curve clustering. Working Paper, 2005.
W. Jank and G. Shmueli. Functional data analysis in electronic commerce research. Statistical Science, 21:155-166, 2006.
P. Ji and J. Jin. UPS delivers optimal phase diagram in high dimensional variable selection. The Annals of Statistics, 40:73-103, 2012.
C. R. Jiang and J. L. Wang. Functional single index models for longitudinal data. The Annals of Statistics, 39:362-388, 2011.
G. Knyazev. Motivation, emotion, and their inhibitory control mirrored in brain oscillations. Neuroscience and Biobehavioral Reviews, 131:377-395, 2007.
M. Kolar and E. Xing. On time varying undirected graphs. Journal of Machine Learning Research, 15:407-415, 2011.
C. Lam and J. Fan. Sparsistency and rates of convergence in large covariance matrix estimation. The Annals of Statistics, 37:4254-4278, 2009.
J. Leader. Numerical Analysis and Scientific Computation. Pearson, 2004.
B. Li, M. Kim, and N. Altman. On dimension folding of matrix- or array-valued statistical objects. The Annals of Statistics, 38:1094-1121, 2010.
R. Li, W. Zhong, and L. Zhu. Feature screening via distance correlation learning. Journal of the American Statistical Association, 107:1129-1139, 2012.
Y. Li and T. Hsing. Uniform convergence rates for nonparametric regression and principal component analysis in functional/longitudinal data. The Annals of Statistics, 38:3321-3351, 2010.
B. Liu and H.-G. Muller. Functional data analysis for sparse auction data. In W. Jank and G. Shmueli, editors, Statistical Methods in e-Commerce Research. John Wiley & Sons, Inc., Hoboken, NJ, USA, 2008.
R. Lyons. Distance covariance in metric spaces. The Annals of Probability, 41:3284-3305, 2013.
R. Mazumder and T. Hastie. The graphical lasso: New insights and alternatives. Electronic Journal of Statistics, 6:2125-2149, 2012a.
R. Mazumder and T. Hastie. Exact covariance thresholding into connected components for large-scale graphical lasso. Journal of Machine Learning Research, 13:781-794, 2012b.
P. McCullagh and J. Nelder. Generalized Linear Models. Chapman and Hall, London, 2nd edition, 1989.
N. Meinshausen and P. Buhlmann. High dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34:1436-1462, 2006.
H. G. Mueller and U. Stadtmueller. Generalized functional linear models. The Annals of Statistics, 33:774-805, 2005.
H. G. Mueller and F. Yao. Functional additive models. Journal of the American Statistical Association, 103:426-437, 2008.
P. Naik and C.-L. Tsai. Partial least squares estimator for single-index models. Journal of the Royal Statistical Society, Series B, 62:763-771, 2000.
R. Opgen-Rhein and K. Strimmer. Inferring gene dependency networks from genomic longitudinal data: a functional data approach. REVSTAT - Statistical Journal, 4:53-65, 2006.
P. Radchenko, X. Qiao, and G. M. James. Index models for sparsely sampled functional data. Journal of the American Statistical Association, 2015.
J. O.
Ramsay and B. W. Silverman. Functional Data Analysis. Springer, 2nd edition, 2005.
J. O. Ramsay, G. Hooker, and S. Graves. Functional Data Analysis with R and MATLAB. Springer, 2009.
P. Ravikumar, M. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935-980, 2011.
S. K. Reddy and M. Dass. Modeling on-line art auction dynamics using functional data analysis. Statistical Science, 21:179-193, 2006.
M. Reimherr and D. Nicolae. Functional data analysis approach for genetic association studies. The Annals of Applied Statistics, 8:406-429, 2014.
F. Reithinger, W. Jank, G. Tutz, and G. Shmueli. Smoothing sparse and unevenly sampled curves using semiparametric mixed models: An application to online auctions. Journal of the Royal Statistical Society, Series C, 57:127-148, 2008.
J. A. Rice. Functional and longitudinal data analysis: Perspectives on smoothing. Statistica Sinica, pages 631-647, 2004.
H. L. Shang. A survey of functional principal component analysis. AStA Advances in Statistical Analysis, 98:121-142, 2014.
X. Shao and J. Zhang. Martingale difference correlation and its use in high dimensional variable selection. Journal of the American Statistical Association, 109:1302-1318, 2014.
G. Shmueli and W. Jank. Visualizing online auctions. Journal of Computational and Graphical Statistics, 14:299-319, 2005.
B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London, 2nd edition, 1999.
A. Sood, G. James, and G. Tellis. Functional regression: a new model for predicting market penetration of new products. Marketing Science, 28:36-51, 2009.
J. D. Storey, W. Xiao, J. T. Leek, R. G. Tompkins, and R. W. Davis. Significance analysis of time course microarray experiments. PNAS, 102:12837-12842, 2005.
G. J. Szekely and M. L. Rizzo. Brownian distance covariance. The Annals of Applied Statistics, 3:1236-1265, 2009.
G. J. Szekely, M. L. Rizzo, and N. K. Bakirov. Measuring and testing independence by correlation of distances. The Annals of Statistics, 35:2769-2794, 2007.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1996.
P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109:475-494, 2001.
S. Wang, W. Jank, and G. Shmueli. Explaining and forecasting online auction prices and their dynamics using functional data analysis. Journal of Business and Economic Statistics, 26:144-160, 2008.
D. Witten, J. Friedman, and N. Simon. New insights and faster computations for the graphical lasso. Journal of Computational and Graphical Statistics, 20:892-900, 2011.
Y. Xia. A multiple-index model and dimension reduction. Journal of the American Statistical Association, 103:1631-1640, 2008.
Y. Xia, H. Tong, W. K. Li, and L. Zhu. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society, Series B, 64:363-410, 2002.
F. Yao, H. Mueller, and J. Wang. Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association, 100:577-590, 2005a.
F. Yao, H. Mueller, and J. Wang. Functional linear regression analysis for longitudinal data. The Annals of Statistics, 33:2873-2903, 2005b.
F. Yao, E. Lei, and Y. Wu. Effective dimension reduction for sparse functional data. Biometrika, 2015.
M. Yuan and Y. Lin.
Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 68:49–67, 2006. M. Yuan and Y . Lin. Model selection and estimation in the gaussian graphical model. Biometrika, 94:19–35, 2007. X. Zhang, B. Begleiter, W. Porjesz, W. Wang, and A. Litke. Event related potentials during object recogni- tion tasks. Brain Research Bulletin, 38:531–538, 1995. X. Zhao, J. S. Marron, and M. T. Wells. The functional data analysis view of longitudinal data. Statistical Sinica, 14:789–808, 2004. H. Zhou and L. Li. Regularized matrix regression. Journal of the Royal Statistical Society: Series B, 76: 463–483, 2014. S. Zhou, J. Lafferty, and L. Wasserman. Time varying undirected graphs. Machine Learning Journal, 80: 295–319, 2010. H. Zhu, N. Strawn, and D. Dunson. Bayesian graphical models for multivariate functional data. Manucript, 2014a. L. Zhu, L. Li, R. Li, and L. Zhu. Model-free feature screening for ultrahigh dimensional data. Journal of American Statistical Association, 106:1464–1475, 2011. Y . Zhu, X. Shen, and W. Pan. Structural pursuit over multiple undirected graphs. Journal of American Statistical Association, 2014b. H. Zou. The adaptive lasso and its oracle properties. Journal of American Statistical Association, 101: 1418–1429, 2006. H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005. 86 Appendix A Algorithms and Proofs in Chapter 2 A.1 EM Algorithm for Estimating e X ij (t) Standard calculations show that the expected value of the joint likelihood for X and W is maximized by setting b = 1 n n X i=1 E [ i jW i ;b i ] = 1 n n X i=1 b i ; (A.1) b = 1 n n X i=1 E h ( i )( i ) T jW i ;b i ; e ;b i = 1 n n X i=1 (b i b )(b i b ) T + e (A.2) and b 2 = 1 P n i=1 P p j=1 T ij n X i=1 E h (W T i S i ) T (W T i S i )jW i ;b i ; e i = 1 P n i=1 P p j=1 T ij n X i=1 (W T i Sb i ) T (W T i Sb i ) + trace S e S T : (A.3) Hence, the following EM algorithm can be applied to efficiently estimate the required parameters. Note that when t i differ among individuals, I replace matrixS in the formulas above withS i . Algorithm 5 EM Algorithm 1. Compute ^ ; ^ and ^ 2 via Equations (A.1), (A.2) and (A.3). 2. Use the current parameter estimates to updateb i and ~ via Equations (2.13) and (2.14). 3. Repeat Steps 1. and 2. until convergence. 87 A.2 Proof of Main Results A.2.1 Proof of Theorem 1 As a consequence of the SIMFE model, I have the following simple fact, E(" i jW i ) =E(E(" i jX i ; W i )jW i ) = 0: I also have, E(Y i jW i ) = E (m (P i1 ;:::; P ip )jW i ) +E (" i jW i ) = E m( e P i1 + U i1 ;:::; e P ip + U ip )jW i ; where U ij = P ij e P ij . Under some mild integrability assumptions, U ij are mean zero Gaussian random vectors, and e P ij =E(P ij jW i ). Note that E(W ijk U ilm ) =E(E(W ijk U ilm jW i )) =E(W ijk E(P ilm e P ilm jW i )) = 0; for all i, j, k, l and m. Consequently, vectors W i and (U i1 ;:::; U ip ) are independent. Let f U i (u) be the density of the second vector. Note that, according to the derivations in Section 2.2.4, the dependence of f U i on the index i is completely determined by t i , the time point configuration for the i-th observa- tion. Write row vector u as (u 1 ;:::;u p ), where u j 2R d j , and define e m t i (s 1 ;:::;s p ), with s j 2R d j , as R m (s 1 +u 1 ;:::;s p +u p )f U i (u)du. 
Then, E m( e P i1 + U i1 ;:::; e P ip + U ip )jW i = e m t i ( e P i1 ;:::; e P ip ): Hence, Y i =E(Y i jW i ) +" i = e m t i e P i1 ;:::; e P ip +" i ; where, by construction,E(" i jW i ) = 0. 88 A.2.2 Proof of Theorem 2 I will follow the general argument in the proof of Theorem 3 in Li et al. (2010). Write d = (d 1 ;:::;d p ) and d c = (d c1 ;:::;d cp ). LetG n (d c ) denote the difference between the criterion values for the candidate and the true dimension vectors: G n (d c ) = logL(d c ) logL(d) +d c logn=[nh ~ dc n (d c )]d logn=[nh ~ d n (d)]: Note that the set of candidate dimension vectors is finite. Thus, it is sufficient to show that for each vector d c not equal to d functionG n (d c ) is positive with probability tending to one. Note that d c logn=[nh ~ dc n (d c )]d logn=[nh ~ d n (d)] =O logn[n 4=( ~ dc+4) +n 4=( ~ d+4) ] =o(1): Ifd cj < d j for at least onej, I can choose a positive constantc, such that logL(d c ) logL(d) > c with probability tending to one, due to the lack of fit. It follows thatG n (d c )> 0 with probability tending to one. Now consider the remaining case ofd cj d j for allj andd c > d. In this cased c logn=[nh ~ dc n (d c )] d logn=[nh ~ d n (d)]> logn=[nh ~ dc n (d c )] for all sufficiently largen. On the other hand, logL(d c ) logL(d) = log(1 + L(d c )L(d) L(d) ) = log(1 +O p (1=[nh ~ dc n (d c )])) =O p (1=[nh ~ dc n (d c )]); by the classical results on local linear smoothing. Consequently, with probability tending to one, G n (d c )> (1=2) logn=[nh ~ dc n ]> 0: A.2.3 Theoretical Assumptions for the Results in Section 2.3 Recall that consists of components jk , which are the vectors of basis coefficients for the true projection functions in the SIMFE model. I will use e i to denote the vector T 11 e i1 ;:::; T 1d 1 e i1 ;:::; T p1 e ip ;:::; T pdp e ip T and defineb e i ande e i analogously. For tech- nical reasons I will introduce trimming functions into the SIMFE optimization problem (2.20). This is a standard technical device, which serves the purpose of handling the notorious boundary points. Writee i for (e T i1 ;:::;e T ip ) T and defineI ni = 1fjje i jjng. Let be some bounded function with a bounded second 89 derivative, such that (v) > 0 if v > w 0 and (v) = 0 if v w 0 , for some small positive w 0 . Define ^ i =( ^ f ^ (e i ))= ^ f ^ (~ i ), with b f b (e i ) =n 1 P n k=1 K h (b (e k e i )). The only difference between the new objective function, 1 n n X i=1 n X l=1 I ni b i 0 @ Y l a i p X j=1 c T ij b P lj 1 A 2 K il ; (A.4) and the one given by display (2.20) is the presence of the trimming termI ni b i . The corresponding change in the SIMFE algorithm is that the trimming term also appears in formula (2.22). The following assumptions are used in the proof of Theorem 3. A1. The response has a finite fifth absolute moment,EjYj 5 <1. A2. Kernel function e K is a multivariate density function with bounded second-order derivatives and a compact support. A3. For each k, i and j, time point T k has a positive probability of being included in the time point configuration generated for the predictor curve X ij (t). For each j, the L by q j dimensional basis matrix withk-th rows j (T k ) has rankq j . A4. For eachi, functionE(Y i je ~ i =u) has bounded fourth-order derivatives with respect tou ande , fore in a small neighborhood of. A5. For each i, matrix E[(f i (e i ))fre m i ( e i ) rgfre m i ( e i ) rg T ] has full rank, where r =E[(f i (e i ))fre m i (e i )]=E(f i (e i )), andf i is the density ofe i . 
For the case of functional response, considered in Theorem 4, I need to modify the last two assumptions. B4. Time points at which the response is observed, s ik , are independent realizations of a random vari- ableS, and sequencefn i g is bounded. For eachi, functionE(Y i jS = s; ~ ~ i = u) has bounded fourth-order derivatives with respect tos,u and ~ , for ~ in a small neighborhood of. B5. For eachi, matrixE[(f(S; ~ i ))fr ~ m(S; ~ i ) rgfr ~ m(S; ~ i ) rg T ] has full rank, where r =E[(f(S; ~ i ))fr ~ m(S; ~ i )]=E(f(S; ~ i )), andf is the density of (S; ~ i ). I will need the following assumptions for Theorem 5. The definitions of v i and m are provided in Section 2.2.4. 90 C3. For eachj, time pointst ijk are generated from the same continuous distribution with a density that is bounded away from zero on the domain of thej-th predictor. C4. For eachi, functionE(Y i je e i =u;e v i =w) has bounded fourth-order derivatives with respect tou,w,e ande , for (e ;e ) in a small neighborhood of (; ). C5. For eachi, matrixE[(f(e i ; v i ))fr m(e i ; v i ) rgfr m(e i ; v i ) rg T ] has full rank, where r =E[(f(e i ; v i ))fr m(e i ; v i )]=E(f(e i ; v i )), andf is the density of (e i ; v i ). A.2.4 Proof of Theorems 3 and 4 Theorem 3. For simplicity of the exposition I will focus on the case of a single predictor. Due to the additive structure of the SIMFE estimation procedure with respect to the predictors, extension to the general case presents only notational challenges, while the argument itself remains essentially intact. I will also simplify the notation by omitting the superscript containing the predictor index. For example, I will writee i and b P i instead of e ij and b P ij . Define kh = [logn=(nh k )] 1=2 . Observe the relationshipjjA b (t)(t)jj = kA[b ] []k F , where [b ] and [] denote matrices [b 1 :::b d ] T and [ 1 ::: d ] T , respectively, andkk F stands for the Frobenius matrix norm. Hence, I need to show that there exists an invertible matrixA, for which kA[b ] []k F =O p (h 4 opt + 2 dhopt +n 1=2 ). In the new notation ^ i = (b i1 ;:::;b iq ) T and b b i = (b T 1 b i ;:::;b T d b i ) T = b P i . To be able to conveniently apply existing results, I will slightly modify the trimmed objective function, replacing b P l with b P l b P i and writing this expression in terms ofb andb . I will also useb l:i to denoteb l b i . The modified objective function, 1 n n X i=1 n X l=1 I ni b i Y l a i c T i b b l:i 2 e K h (b b l:i ); however, corresponds to exactly the same SIMFE estimator as the original trimmed function. Define c K t = I ni b i e K ht (b b l:i ), b D l:i = c i b l:i and D l:i = c i e l:i . I will use the superscript (t;) to indicate that the estimator corresponds to the iteration of the algorithm corresponding to the bandwidthh t . A simple manipulation of formula (2.22), taking into account the modifications to the objective function, yields b (t;+1) = r 1 +f n X i;l=1 K t b D (t;) l:i ( b D (t;) l:i ) T g 1 n X i;l=1 c K t b D (t;) l:i fY l a (t;) i r 1 b D (t;) l:i g (A.5) 91 Here r 1 corresponds to an arbitrary rotation of [], and the above formula holds for each such rotation. LetM denote the number of time point configurations, at which the predictor functions are observed. I will only consider configurations that are generated with positive probabilities. Denote byA k ,k = 1;:::;M, the index set of the observations corresponding to thek-th time point configuration. 
Note that, using the basis representation for the predictors and the projection functions, equation (2.9) simplifies to Y i = e m k T 1 e i ;:::; T d e i +" i ; i2A k ; (A.6) where E(" i jW i ) = 0. The last equality also implies E(" i je i ) = 0. This formulation of the SIMFE model allows us to apply some of the theory developed for the MA VE approach. I will first focus on the right-hand side of display (A.5), for sufficiently large n, with b D replaced by D, and rewrite it as r 1 + f P M k=1 e k g 1 f P M k=1 k g, where e k = P i;l2A k b K t D (t;) l:i (D (t;) l:i ) T and k = P i;l2A k b K t D (t;) l:i (Y i a (t;) j r 1 D (t;) l:i ). I will apply Lemma A.5 in the supplemental material of Xia (2008) to each k , and use a natural generalization of Lemma A.4 to handlef P M k=1 e k g 1 . For eachk M, write k for the probability of thek-th time point configuration, and lete (k) denotee i for somei inA k . Define n =n 1 M X k=1 jA k j n j X i2A k ( e f i (e i )) re m k (e i ) (e i ) " i ; D 1 = M X k=1 2 k E( e f(e (k) )) re m k (e (k) ) (e (k) ) r ~ m k (e (k) ) (e (k) ) T ; (A.7) and D 2 = 2 M X k=1 2 k E( ~ f(e (k) ))re m k (e (k) )r T e m k (e (k) ) !(e (k) ) T : (A.8) where(e ) = e E(e je ) and!(e ) = E(e e T je )E(e je )E(e je ) T . I will use superscript + to denote the MoorePenrose inverse. By Lemmas A.4 and A.5 in the supplemental material of Xia (2008), there exists a rotation of [], call it [ r 2 ], such that r 1 +f n X i;l=1 c K t D (t;) l:i (D (t;) l:i ) T g 1 n X i;l=1 c K t D (t;) l:i fY l a (t;) i D (t;) l:i g = r 2 + (ID + 2 D 1 ) 1 D + 2 n +O p (h 4 t + 2 dht ); (A.9) 92 providedb (t;) r 1 =o p (h t ). Note that n =O p (n 1=2 ). Due to the assumption A3, the unknown param- eters ; and 2 are estimated at the usual parametric rate,n 1=2 , and hence ^ = e (1 +O p (n 1=2 )). Consequently, equations (A.5) and (A.9) imply b (t;+1) r 2 =O p (h 4 t + 2 dht +n 1=2 ); (A.10) as long asb (t;) r 1 =o p (h t ). Note that 2 dht =o(h t ), and hence the stochastic bound in display (A.10) can be written aso p (h t ). It follows that the final estimator for the bandwidthh t , which is also the initial estimator for the bandwidthh t+1 , satisfiesb (t+1;0) r = o p (h t+1 ), for some rotation of []. Hence, as long asb (0;0) =o p (h 0 ) holds for the initialization estimator, I can establish, by induction, that the final SIMFE estimator satisfies jj[b ] [ r ]jj F =O p (h 4 opt + 2 dhopt +n 1=2 ); (A.11) where [ r ] is an appropriate rotation of []. The required bound for the initialization estimator follows from Lemma A.6 of the supplemental material of Xia (2008), which establishes that the OPG estimator converges at the rateO p (h 2 0 + ~ ph 0 + 2 ~ ph 0 h 2 0 ). The last stochastic bound iso p (h 0 ) by the definitions ofh 0 and ~ ph 0 . Theorem 4. In the case wheren i = 1 for alli, I can repeat the proof of Theorem 3, treatingS as an additional univariate predictor. The reduced dimension increases fromd tod + 1. As a result, the convergence rate changes fromO p n 4=(d+4) logn +n 1=2 toO p n 4=(d+5) logn +n 1=2 . In the general case, some of the expressions in the proof of Theorem 3 need to be modified. However, because the sequencefn i g is bounded, the stochastic bounds (A.10) and (A.11) remain the same as in the casen i = 1. A.2.5 Proof of Corollary 1 Using approximation (A.9), I can refine stochastic bound (A.11) as follows: b = (ID + 2 D 1 ) 1 D + 2 n +o p (n 1=2 ); 93 where corresponds to an appropriate rotation of []. For eachkM, let" (k) denote" i for somei inA k . 
Note thatn 1=2 n converges to a Gaussian vector with mean zero and variance matrixW 2 , which is given by M X k=1 3 k E 2 ( e f k (e (k) )) re m k (e (k) ) (e (k) ) re m k (e (k) ) (e (k) ) T (" (k) ) 2 : (A.12) Consequently, n 1=2 (b ) D !N 0; (ID + 2 D 1 ) 1 D + 2 W 2 (ID + 2 D 1 ) 1 D + 2 T ; (A.13) where the components of the variance matrix are given in displays (A.12), (A.7) and (A.8). These formulas were introduced in the proof of Theorem 3 under the simplification of having a single predictor, however, they continue to hold in the multi-predictor case as well. Estimation of the variance matrix in Corollary 1 Givenj2A k , setf j =jA k j 1 P i2A k I ni b i e K ht (b b i:j ). Define j as the sample counterpart of, i.e. j =b j (jA k jf j ) 1 P i2A k I ni b i ~ K ht (b b i:j )b i . Define! j by analogy, as the sample counterpart of !. Denote by b Y i the predicted value for thei-th response. I can estimate matricesW 2 ,D 1 , andD 2 by c W 2 = M X k=1 3 k jA k j 1 X i2A k I ni (f i ) 2 (c i i )(c i i ) T (Y i b Y i ) 2 ; b D 1 = M X k=1 2 k jA k j 1 X i2A k I ni (f i ) 2 (c i i )(c i i ) T ; and b D 2 = 2 M X k=1 2 k jA k j 1 X i2A k I ni (f i ) 2 c i c T i ! i : This allows us to produce an estimate, c W , for the variance matrix in Corollary 1. The standard errors for the elements ofb are given by the diagonal entries of c W , divided by p n. 94 A.2.6 Proof of Theorem 5 Partition the rows of the matrix (I i S i )(I i S i ) T + 2 i T i intop groups of adjacent rows, so that the size of thej-th group isq j . Partition the columns of the same matrix analogously. This corresponds to a partition of the matrix intop 2 blocks,V ijk , wherej is the group index in the row partition, andk is the group index in the column partition. Write v ijk for the vectorized form of the blockV ijk . Recall the definitions of matrices i and vectors i in sections 2.2.4 and 2.4. Each element of i has the formcov(U ijl 1 ;U ikl 2 ) for some indexes j;k;l 1 ;l 2 that satisfy: p k j 1, d j l 1 1, d k l 2 1, and l 2 l 1 whenever k = j. Consequently, each element of i can be written as T jl 1 V ijk kl 2 . Thus, there exists a collection of vectors jl 1 kl 2 , with the indexes satisfying the inequalities given above, such that for eachi the elements of i have the form T jl 1 kl 2 v ijk . I will denote this representation of the vector i as v i , where is the vector constructed by stacking the vectors jl 1 kl 2 , and v i is similarly constructed from v ijk . Using the derivations in Section2.2.4 and the notation in Appendix A.2.4, I see that e m t i ( e P i ) can be written as b m(e i ; v i ), for some function b m, which does not depend oni. The rest of the proof is the same as the proof of Theorem 3, except for some minor modifications. I no longer partition the observations by time point configuration, but instead treat v i as an additional predictor. Equation (A.6) is replaced with Y i = b m ( ~ i ; v i ) +" i ; and the dimensionality of the corresponding model increases from d to e d = d +d(d + 1)=2. Conse- quently, the stochastic bound in display (A.11) changes toO p ( e h 4 opt + 2 e d ~ hopt +n 1=2 ), which simplifies to O p (n 4=( e d+4) logn +n 1=2 ). 95 Appendix B Algorithms and Proofs in Chapter 3 B.1 Derivations for Fglasso Algorithm B.1.1 Step 2(b) of Algorithm 3 To derive the estimation for w j , I need to use Lemma 2, whose proof is provided in the Supplementary Material. Lemma 2. 
For any A2R pq , B2R rr , and X2R qr , I have @trace(AX T BX) @X = BXA + B T XA T : (B.1) Note (3.12) is equivalent to finding w j1 ; ; w j(p1) to minimize trace p1 X l=1 p1 X k=1 S jj w T jl ( e 1 j ) lk w jk + 2 p1 X k=1 s T jk w jk ! + 2 p1 X k=1 kw jk k F : (B.2) Setting the derivative of (B.2) with respect to w jk to be zero and applying Lemma 2 yield @(B:2) @w jk = ( e 1 j ) kk w jk S jj + ( e 1 j ) T kk w jk S T jj + X l6=k ( e 1 j ) T lk w jl S T jj + ( e 1 j ) kl w jl S jj + 2s jk + 2 jk = 2 0 @ ( e 1 j ) kk w jk S jj + X l6=k ( e 1 j ) T lk w jl S jj + s jk + jk 1 A = 0; 96 where jk = w jk kw jk k F if w jk 6= 0, and jk 2R MM withk jk k F 1 otherwise,k = 1 ;p1. I define the block “residual” by r jk = X l6=k ( e 1 j ) T lk w jl S jj + s jk : (B.3) If w jk = 0, thenkr jk k F =k jk k F . Otherwise I need to solve for w jk in the following equation ( e 1 j ) kk w jk S jj + r jk + w jk kw jk k F = 0: (B.4) I replace (B.4) by (B.5), and standard package in R/MatLab can be used to solve the followingM 2 byM 2 nonlinear equation (( e 1 j ) kk S jj )vec(w jk ) + vec(r jk ) + vec(w jk ) kw jk k F = 0: (B.5) Hence, the block coordinate descent algorithm for solving w j in (3.12) is summarized in Algorithm 6. Algorithm 6 Block Coordinate Descent Algorithm for Solving w j 1. Initialize b w j . 2. Repeat until convergence fork = 1; ;p 1. (a) Computeb r jk via (B.3). (b) Set b w jk = 0 ifkr jk k F ; otherwise solve for b w jk via (B.5). B.1.2 Steps 2(a) and 2(c) of Algorithm 3 At the jth step, I need to compute e 1 j in (3.12) given current e = e 1 . Then step 2(a) follows by the blockwise inversion formula. Next I solve for w j via Algorithm 6, and then update e 1 given current w j , jj , and e 1 j , by applying blockwise inversion formula again. Rearranging the row and column blocks such that the (j;j)-th block is the last one, I obtain the permuted version of e 1 by 0 @ e 1 j + U j V j U T j U j V j V j U T j V j 1 A ; where U j = e 1 j w j and V j = ( jj w T j U j ) 1 = S jj . Step 2(c) follows as a consequence. 97 B.2 Proofs of main results B.2.1 Proof of Lemma 1 Since both a = (a T 1 ; ; a T p ) T and = ( T 1 ; ; T p ) T depend onM, I omit the corresponding super- scripts to simplify the notation for readability. For any pair (j;l)2V 2 ;j6=l such that e C M jl (s;t) = 0 for all (s;t)2T 2 , letU =Vnfj;lg and a U , U denote (p 2)M-dimensional vectors excluding thejth andlth subvectors from a and, respectively. By definition (3.3), I have e C M jl (s;t) = Cov e g M j (s);e g M l (t)je g M k (u);k6=j;l;8u2T = Cov a T j j (s); a T l l (t)ja T k k (u);k6=j;l;8u2T = Cov a T j j (s); a T l l (t)ja k ;k6=j;l = j (s) T Cov(a j ; a l ja U ) l (t) = 0: (B.6) The third equality comes from the following argument. For any k 2 U and u 2 T , e g M k (u) = P M m=1 a km km (u) = a T k k (u). By the orthogonality of km and the fact P M m=1 E(a 2 km ) = P M m=1 km < 1; I obtain thata km = P M m 0 =1 a km 0 R T km (u) km 0(u)du = R T e g M k (u) km (u)du;m = 1; ;M: It then follows that there exists a one to one correspondence betweenfa k g andfe g M k (u);8u2Tg, which holds uniformly ink. Since (B.6) holds for all (s;t) 2 T 2 , I have Cov(a j ; a l ja U ) = 0. Given that a j ; a l 2 R M and a U 2R (p2)M are multivariate Gaussian, let the covariance matrix of their joint vector a T j ; a T l ; a T U T be 0 @ A B B T C 1 A , where A = Var (a T j ; a T l ) T , B = Cov (a T j ; a T l ) T ; a U , and C = Var (a U ). 
Then Var (a T j ; a T l ) T ja U = A BC 1 B T = D = 0 @ D jj D jl D T jl D ll 1 A ; 98 where each term in D is anMM submatrix. It follows from the blockwise inversion formula that 0 @ A B B T C 1 A 1 = 0 @ (A BC 1 B T ) 1 1 A = 0 @ D 1 1 A ; where D 1 = 0 @ (D 1 ) jj (D 1 ) jl (D 1 ) T jl (D 1 ) ll 1 A and each term in D 1 is an M M submatrix. Since Cov(a j ; a l ja U ) = D jl = 0, D is block diagonal and thus D 1 is also block diagonal, which implies (D 1 ) jl = 0, i.e., e jl = 0. This completes the proof for one direction, and the other direction can be proved using similar arguments. B.2.2 Convergence Rate for Submatrices in e Lemma 3. Assume thatg 1 ; ;g p are from MGP , Conditions 1–3 hold, andn>c 7 p 4= for some positive constantc 7 . Then jj e jl jj F =O p 2 n ()=2 (B.7) uniformly in (j;l)2S c . The proof of Lemma 3 can be summarized in the following three main steps. I derive the corresponding rates of convergence for e K j;l (s;t); e C j;l (s;t); andjj e jl jj F under Conditions 1–3, respectively. Step 1: For each pair (j;l)2V 2 and any (s;t)2T j T l , since K jl (s;t) = e K jl (s;t) +Cov (e g j (s);g l (t)e g l (t)) +Cov (g j (s)e g j (s);e g l (t)) +Cov (g j (s)e g j (s);g l (t)e g l (t)); (B.8) I can bound the last term by Cauchy-Schwarz inequality sup (s;t)2T j T l jCov (g j (s)e g j (s);g l (t)e g l (t))j 2 sup s2T j Var (g j (s)e g j (s)) sup t2T l Var (g l (t)e g l (t)): 99 Moreover, under Condition 1 sup s2T j Var (g j (s)e g j (s)) = sup s2T j Var 1 X k=M+1 a ijk jk (s) ! = sup s2T j 1 X k=M+1 jk 2 jk (s) 1 X k=M+1 jk sup s2T j max kM+1 2 jk (s) =O(n ): Applying this technique on the last three terms of (B.8), I obtain sup (s;t)2T j T l K jl (s;t) e K jl (s;t) =O(n =2 ): (B.9) Step 2: Recall I have definedA andX , which can be respectively viewed as the functional counterparts of matrices and vectors. I need to define the corresponding operations and notation that will be used in step 2. First, define theC =A?B such thatC jl (s;t) = P k2V R T k A jk (s;u k )B kl (u k ;t)du k ; whereA2A, B2B, andC2C. Note the covarianceK =fK jl :T j T l !R; (j;l)2V 2 g is a collection of covariance kernel functions such thatK jl (s;t) = Cov (g j (s);g l (t)), (s;t)2T j T l : The transpose ofK,K T =fK lj : T l T j !R; (l;j)2 V 2 g is a collection of functionsK lj (t;s) = Cov (g l (t);g j (s)) = K jl (s;t). Hence K =K T andK is symmetric. The inverse covariance ofK,K 1 =f(K 1 ) jl :T j T l !R; (j;l)2V 2 g is defined as a collection of functions (K 1 ) jl (s;t); such thatK?K 1 =I, whereI2I andI jl (s;t) = jl equals one ifj = l and zero otherwise. Moreover, defineY =A?X andZ =X ?A such thaty j (s) = P l R T l A jl (s;t)x l (t)dt andz l (t) = P j R T j x j (s)A jl (s;t)ds; wherex2X , y2Y, z2Z, andA2A. Finally, defineX ?Y = P j R T j x j (s)y j (s)ds; wherex2X andy2Y: Given that g = (g 1 ; ;g p ) T are from MGP, some calculations show that the conditional covariance function (3.3) is given by C jl (s;t) =K jl (s;t)K jU (s)?K 1 UU ?K Ul (t); (B.10) and the correspondingM-truncated conditional covariance function (3.7) is e C jl (s;t) = e K jl (s;t) e K jU (s)? e K 1 UU ? e K Ul (t): (B.11) 100 It follows from some standard algebra that C jl (s;t) e C jl (s;t) = K jl (s;t) e K jl (s;t) | {z } S 1 +K jU (s)? (K 1 UU e K 1 UU )?K Ul (t) | {z } S 2 +(K jU (s) e K jU (s))? e K 1 UU ?K Ul (t) | {z } S 3 + e K jU (s)? e K 1 UU ? 
(K Ul (t) e K Ul (t)) | {z } S 4 : Note that since all kernel functions in A and X are assumed to be square integrable, i.e., R T j R T l A 2 jl (s;t)dsdt<1 and R T j x 2 j (s)ds<1, I can check that the properties for matrixA and vectorx also hold forA andX in term of my defined operations in the functional domain. Denote(A) to be the eigenvalues ofA such thatA?X =X givenjjXjj 2 = 1; wherejjXjj 2 denotes thel 2 norm ofX such that jjXjj 2 = q P j2V R T j x 2 j (s)ds: Moreover, letjjAjj max be the entrywisel 1 norm ofA such thatjjAjj max = max jl R T j R T l jA jl (s;t)jdsdt, andjjXjj max the l 1 norm ofX such thatjjXjj max = max j R T j jx j (s)jds: DenotejjAjj 2 to be the operator norm ofA such thatjjAjj 2 = p max (A T ?A): For any symmetric positive definiteA, i.e., min (A)> 0, it holds thatjjAjj 2 = max (A) andjjA 1 jj 2 = 1= min (A). It follows fromK 1 ~ K 1 =K 1 ? ( ~ KK)? ~ K 1 thatS 2 =K jU (s)?K 1 ? ( ~ KK)? ~ K 1 ?K Ul (t). Hence, by (B.9) and Condition 2 I have sup (s;t)2T j T l jS 2 j sup s2T j jjK jU (s)jj 2 jjK 1 jj 2 jj ~ KKjj 2 jj ~ K 1 jj 2 sup t2T l jjK Ul (t)jj 2 p p sup s2T j jjK jU (s)jj max pjjK ~ Kjj max p psup t2T l jjK Ul (t)jj max 1= min (K) 1= min ( ~ K) =O p 2 =n =2 : (B.12) It follows from (B.9) and Condition 2 that sup (s;t)2T j T l jS 3 j sup s2T j jjK jU (s) e K jU (s)jj 2 jj e K 1 jj 2 sup t2T l jjK Ul (t)jj 2 p p sup s2T j jjK jU (s) e K jU (s)jj max p psup t2T l jjK Ul (t)jj max 1= min ( ~ K) = O p=n =2 : (B.13) 101 By similar arguments, I can boundS 4 by sup (s;t)2T j T l jS 4 j = sup (s;t)2T j T l e K jU (s)? e K 1 ? (K Ul (t) e K Ul (t)) =O p=n =2 : (B.14) Combining (B.9) and (B.12)–(B.14), I obtain sup (s;t)2T j T l C jl (s;t) e C jl (s;t) sup (s;t)2T j T l (jS 1 j +jS 2 j +jS 3 j +jS 4 j) = O p 2 =n =2 : (B.15) Step 3: It follows from the derivations in (B.6) and (B.15) that sup (s;t)2T j T l C jl (s;t) j (s) T Cov(a j ; a l ja U ) l (t) =O p 2 =n =2 : By the orthogonality property that R T j j (s) j (s) T = R T l l (t) l (t) T = I, I obtain sup (s;t)2T j T l jC jl (s;t) jl Cov(a j ; a l ja U )j =O p 2 =n =2 jl ; where jl = R T j R T l j (s) l (t) T dsdt: Then under Condition 3, I obtain the rate of convergence under the Frobenius norm sup (s;t)2T j T l jjC jl (s;t) jl Cov(a j ; a l ja U )jj F =O p 2 =n =2 : (B.16) For any (j;l)2 S c ,C jl (s;t) = 0 for all (s;t)2T j T l . It follows from (B.16) and the derivations in Lemma 1 thatjjCov(a j ; a l ja U )jj F =jjD jl jj F =O p 2 =n =2 : Hence, I need to obtain the convergence rate forjj e jl jj F =jj(D 1 ) jl jj F . Using the blockwise inversion formula again, I have (D 1 ) jl = D 1 jj D jl (D ll D T jl D 1 jj D jl ) 1 = D 1 jj D jl (I D 1 ll D T jl D 1 jj D jl ) 1 D 1 ll = D 1 jj D jl 1 X k=0 (D 1 ll D T jl D 1 jj D jl ) k D 1 ll = D 1 jj D jl D 1 ll D 1 jj D jl 1 X k=1 (D 1 ll D T jl D 1 jj D jl ) k D 1 ll : 102 The convergent matrix expansion via Neumann series in the fourth equality comes from the following fact. NotejjD jl jj 2 2 jjD jl jj 2 F = O(p 4 =n ) c 4 p 4 =n ; where c 4 > 0 is some large enough constant. Let c 7 > (c 4 =c 2 3 ) 1= . 
It follows from Condition 3 and lower bound for n thatjjD 1 ll D T jl D 1 jj D jl jj 2 jjD 1 jj jj 2 jjD 1 ll jj 2 jjD jl jj 2 2 jjD jl jj 2 F =( min (D jj ) min (D ll ))c 4 p 4 =(c 2 3 n )<c 4 =(c 2 3 c 7 ) =c 8 < 1: Moreover, I have jjD 1 jj D jl D 1 ll jj 2 jjD 1 jj jj 2 jjD jl jj 2 jjD 1 ll jj 2 p c 4 p 2 n =2 1 c 2 3 : The higher order terms in the above expansion under matrix operator norm, jjD 1 jj D jl P 1 k=1 (D 1 ll D T jl D 1 jj D jl ) k D 1 ll jj 2 , can be bounded from above by jjD 1 jj jj 2 jjD jl jj 2 jjD 1 ll jj 2 1 X k=1 jjD 1 ll D T jl D 1 jj D jl jj k 2 < p c 4 p 2 n =2 1 c 2 3 1 X k=1 c k 8 = p c 4 c 8 c 2 3 (1c 8 ) p 2 n =2 : Hence,jj(D 1 ) jl jj 2 = O p 2 =n =2 andjj(D 1 ) jl jj F p Mjj(D 1 ) jl jj 2 = O p 2 =n ()=2 ; which completes the proof. B.2.3 Equivalence betweenE and e E " In general, for any" 0; define the corresponding truncated edge set e E " =f(j;l)2V 2 :j6=l;jj e jl jj F > "g: LetS " = e E " [f(1; 1); ; (p;p)g. DenoteS c " to be the complement ofS " inV 2 withjj e jl jj F " for (j;l) 2 S c " . Lemma 4 below, whose proof is provided in the Supplementary Material, ensures the equivalence between the true and truncated edge sets. Lemma 4. Under all the conditions of Lemma 3 and Condition 4, let" =c 8 p 2 =n ()=2 for some constant c 8 > 0, I haveE = e E " . B.2.4 Proof of Theorem 1 I first obtain the general error bound for b in Section B.2.5. Then I present the general model selection consistency of fglasso, i.e., with high probability b E = e E " ; in Section B.2.6. Moreover in the Supplementary 103 Material, I list Lemmas 5-10 used to prove Theorems 9 and 10, and provide their proofs therein. Finally in Section B.2.7 I prove Theorem 6 based on the results of Lemmas 3-4 and Theorem 10. To prove the theorems, I need to use the definition of tail condition for the random variable given in Ravikumar et al. (2011). Definition 5 (Tail condition). The random vector a 2 R Mp satisfies the tail condition if there exists a constantv 2 (0;1] and a functionf :N (0;1)! (0;1), such that for any (i;j)2f1; ;Mpg 2 , let e S ij , e ij be the (i;j)-th entry of e S; e respectively, then P j e S ij e ij j 1=f(n;) for all2 (0; 1=v ]: (B.17) The tail functionf is required to be monotonically increasing in andn. The inverse functions ofn and are respectively defined as f (w;n) = argmaxfjf(n;)wg and n f (;w) = argmaxfnjf(n;)wg; wherew2 [1;1): Then I assume that the Hessian of the negative log determinant satisfies the following general irrepresentable-type assumption. Condition 9. There exist some constants 2 (0; 1"],> 0, and > 2 such that jj e S c " S" ( e S"S" ) 1 jj (M 2 ) 1 1 "; (B.18) jj e S c " S" ( e S"S" ) 1 e S"S c " e S c " S c " jj (M 2 ) 1 h 2" + 1 i ; (B.19) jj e S"S c " jj (M 2 ) 1 M f (n; (Mp) ): (B.20) B.2.5 General Error Bound For any block matrix B = (B ij ) with B ij 2R MM ; 1 i;j p, definejjBjj (M) max = max i;j jjB ij jj F as the M-block version of entrywisel 1 -norm. 104 Theorem 9. Assume Condition 9 holds. Let b be the unique solution to the fglasso (3.10) with regulariza- tion parameter = (8M= ) f (n; (Mp) ). Denote byc " = (1 + 8 1 +"), e " =jj( e S"S" ) 1 jj (M 2 ) 1 ; e " =jj e S"S c " jj (M 2 ) 1 , andd " =max j2V l2V :jj e jl jj F >" . If the sample sizen satisfies the lower bound n > n f 1= max v ; 6c " Md " maxf e e " 1 3(pd " )" e ; 3 e 2 e " c " 1 3(pd " )" 3 e e " c " g ) ; (Mp) ! 
; (B.21) then with probability at least 1 (Mp) 2 , I have (i) The estimate b satisfies the error bound jj b e jj (M) max 2(1 + 8 1 +") e " M f (n; (Mp) ); (B.22) (ii) The estimated edge set b E is a subset of e E " . By the Karush-Kuhn-Tucker (KKT) condition, a necessary and sufficient condition for e to maximize (3.10) is e 1 e S e Z = 0; (B.23) where the sub-differential of P j6=l k jl k F involves all symmetric matrices e Z withMM blocks defined by e Z jl = 8 > > > > > < > > > > > : 0 ifj =l e jl jj e jl jj F ifj6=l and e jl 6= 0 f e Z jl 2R MM :jj e Z jl jj F 1g ifj6=l and e jl = 0: (B.24) The main idea of the proof is based on constructing the primal-dual witness solution b and b Z in the follow- ing four steps. First, b is obtained by the following restricted fglasso problem min e S c " =0 8 < : trace( e S e ) log det e + X j6=l jj e jl jj F 9 = ; ; (B.25) 105 where e 2R MpMp is symmetric and positive definite. Second, for each (j;l)2S " , I choose b Z jl from the family of sub-differential of P j6=l jj e jl jj F evaluated at b jl defined in (B.24). Third, for each (j;l)2 S c " , i.e.,jj e jl jj F ", b Z jl is replaced by 1 ( e S jl + b 1 jl ) ; (B.26) which satisfies the KKT condition (B.23). Finally, I need to verify strict dual feasibility condition, that is, jj b Z jl jj F < 1 uniformly in (j;l)2S c " . The following terms are needed in the proof of Theorem 9. Let W be the noise matrix, and the difference between the primal witness matrix b and the truth e , W = e S e 1 , = b e : (B.27) The second order remainder for b 1 near e is given by R() = b 1 e 1 + e 1 e 1 : (B.28) I organize the proof of Theorem 9 in the following six steps. Step 1. It follows from the tail condition (B.17) and Lemma 10 that the event n jjWjj (M) max M f (n; (Mp) ) o holds with probability at least 11=(Mp) 2 . I need to verify that the con- ditions in Lemma 6 hold. Choosing the regularization parameter = 8M f (n;(Mp) ) , I havejjWjj (M) max 8 . It remains to provejjR(jj (M) max is also bounded by 8 =M f (n; (Mp) ). Step 2. Letr = 2 e " (jjWjj (M) max + + e " ") 2 e " (1 + 8= +")M f (n; (Mp) ), where I have used (B.20) in Condition 9. By f (n; (Mp) ) 1=v and monotonicity of the inverse tail function, for anyn satisfying the lower bound condition, I have 2 e " (1 + 8= +")M f (n; (Mp) ) min ( 1 3(pd " )" e 3 e d " ; 1 3(pd " )" 3 e e " c " 3 3 e e " d " c " ) min ( 1 3(pd " )" e 3 e d " ; 1 3(pd " )" 3 e e " 3 3 e e " d " ) = min ( 1 3 e d " ; 1 3 3 e e " d " ) pd " d " ": 106 Then the conditions in Lemma 8 are satisfied, and hence the error bound satisfiesjjjj (M) max = jj b e jj (M) max r. Step 3. The conditionjjjj (M) max 1 3 e d" pd" d" " is satisfied by step 2. Thus by Lemma 7 and results in step 2, I havejjR()jj (M) max 3 2 jjjj (M) max d " jjjj (M) max + (pd " )" 3 e = n 3 3 e e " (1 + 8= +") d " 2 e " (1 + 8= +")M f (n; (Mp) ) +(pd " )")g =8 =8; where the last inequality comes from the monotonicity of the tail function, the bound condition for the sample sizen, and the fact that 2d " e " (1 + 8 +")M f (n; (Mp) ) 1 3(pd " )" 3 e e " c " 3 3 e e " c " = 1 c " 1 3 3 e e " (pd " )": Step 4. Steps 1 and 3 imply the strict dual feasibility in Lemma 6 and hence b = b by Lemma 5. Step 5. It follows from the results in steps 2 and 4 that the error bound (B.22) holds with probability at least 1 1=(Mp) 2 : Step 6. For (j;l)2 S c " ,jj e jl jj F ". Step 4 implies b S c " = b S c " : In the restricted fglasso problem (B.25), I have b S c " = b S c " = 0. 
Therefore, ( e E " ) c ( b E) c and part (ii) follows by taking the complement. 107 B.2.6 General Model Selection Consistency Theorem 10. Let min = min (j;l)2 e E" jj e jl jj F . Under the same conditions as in Theorem 9, if the sample sizen satisfies the lower bound n > n f 1= max 2 e " c " 1 min M;v ; 6c " Md " maxf e e " 1 3(pd " )" e ; 3 e 2 e " c " 1 3(pd " )" 3 e e " c " g ) ; (Mp) ! ; (B.29) then b E = e E " with probability at least 1 (Mp) 2 : Proof. It follows from the proof and results of Theorem 9(i) thatjj b e jj (M) max r 2(1 + 8 1 + ") e " M f (n; (Mp) ) and b = b hold with probability at least 1 1=(Mp) 2 . The lower bound for the sample sizen in (B.29) implies min > 2(1 + 8 1 +") e " M f (n; (Mp) )r. By Lemma 9 I have b jl 6= 0 for all (j;l)2 S " , which entails that e E " b E. Combining this result with Theorem 9(ii) yields e E " = b E: B.2.7 Proof of Theorem 6 Recall a2R Mp . From Lemma 1 in Ravikumar et al. (2011), for any k 2f1;:::;Mpg, the rescaled random variables a k = q e kk are sub-Gaussian with parameter = 1, and the sample covariance matrix satisfies the tail condition (B.17) withv = (max k e kk 40) 1 andf(n;) = (1=4) exp(c n 2 ); wherec = [128 5 2 max k ( e kk ) 2 ] 1 : Therefore the corresponding inverse functions take the following forms f (n; (Mp) ) = s log[4(Mp) ] c n = r 128 5 2 max k ( e kk ) 2 r logMp + log4 n (B.30) and n f (; (Mp) ) = log[4(Mp) ] c 2 = 128 5 2 max k ( e kk ) 2 logMp + log4 2 (B.31) It follows from Lemma 4 with " = c 8 p 2 n ()=2 that E = e E " : Thus I have S = S " , d = d " , e = e " , e = e " , and c " = 1 + 8 1 + c 8 p 2 n ()=2 : Moreover, since Var(a jm ) = jm , then 108 r max k ( e kk ) 2 = q max k (Var(a k )) 2 = max j;m jm . By substituting these terms into Condition 9 and Theo- rem 10, some calculations using (B.30) and (B.31) lead to Condition 5, the lower bound for the sample size, i.e.,n>C 1 M 2 d 2 (1 + 8 1 +c 7 p 2 n ()=2 ) 2 (log(Mp) + log4)=c 2 1 andn>C 2 M 2 2 min (1 + 8 1 + c 7 p 2 n ()=2 ) 2 (log(Mp) +log4)=c 2 1 and the desired regularization parameter withM =c 1 n under Condition 1. By satisfying n > c 7 p 4= and Conditions 1-4, Lemma 4 implies E = e E " . By satisfying (B.29) and Condition 9, Theorem 10 shows e E " = b E holds with probability at least 1 1=(c 1 n p) 2 : Combining these two results completes the proof. B.3 Additional Technical Proofs This section contains the details of estimating FPC scores, technical proofs of Propositions 1–2 and Lemmas 2 and 4–11, and additional simulation results from Section 3.5. B.3.1 Estimation of FPC Scores I model g ij (t) and jk (t) using a collection of K-dimensional orthogonal basis functions B j (t), i.e., R B j (t)B j (t) T dt =I. Hence g ij (t) = B j (t) T c ij and jk (t) = B j (t) T jk ; (B.32) where c ij and jk are the basis coefficients for the functional variables and eigen functions, respectively. Let c j be annK matrix withith row c T ij . Moreover sinceK jj (s;t) = 1 n B(s) T c T j c j B(t), (3.5) becomes n 1 c T j c j jk = jk jk ; jk 2R K ; 1kK; (B.33) which gives the standard principal component analysis (PCA) of the coefficient array c j . Then the kth principal component score is given by a ijk = Z g ij (t) jk (t)dt = c T ij jk ; (B.34) 109 and hence a M ij = (a ij1 ; ;a ijM ) T forMK. B.3.2 Proof of Lemma 2 Since d(trace(AX T BX)) = trace(d(AX T BX)) = trace(d(AX T )BX) + trace(AX T d(BX)) = trace((dX) T BXA) + trace(AX T BdX) = trace(A T X T B T + AX T B)dX; I have @trace(AX T BX) @X = (A T X T B T + AX T B) T , which concludes the proof. 
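As a quick numerical sanity check of the matrix-derivative identity just proved, the following short Python snippet (not part of the original appendix) compares the closed-form gradient B X A + B^T X A^T with a central finite-difference approximation; the matrix dimensions used here are illustrative choices made only so that trace(A X^T B X) is well defined.

```python
import numpy as np

rng = np.random.default_rng(0)
q, r = 4, 3
A = rng.normal(size=(r, r))   # conformable choices: A is r x r, B is q x q, X is q x r
B = rng.normal(size=(q, q))
X = rng.normal(size=(q, r))

f = lambda X: np.trace(A @ X.T @ B @ X)
closed_form = B @ X @ A + B.T @ X @ A.T     # gradient from the lemma

# Central finite differences, entry by entry.
eps, numeric = 1e-6, np.zeros_like(X)
for i in range(q):
    for j in range(r):
        E = np.zeros_like(X); E[i, j] = eps
        numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.max(np.abs(closed_form - numeric)))   # tiny (roughly 1e-8 or smaller): the two gradients agree
```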
B.3.3 Proof of Proposition 1 Substituting e = diag( e (1) ; ; e (K) ) into (3.10) yields max e (1) ;; e (K) 8 < : K X k=1 log det e (k) K X k=1 trace( e S (k) e (k) ) X j6=l K X k=1 k e (k) jl k F 9 = ; ; (B.35) which is equivalent toK separate fglasso problems in (3.13). B.3.4 Proof of Proposition 2 If e is block diagonal, and i and i 0 belong to separate index sets G k and G k 0, then e ii 0 = 0 and hence ( e 1 ) ii 0 = 0. By (B.23), I havek e S ii 0k F k e Z ii 0k F . This completes the proof for the sufficient condition. I need to prove the condition is also necessary. I construct e k by solving the fglasso problem (3.10) applied to the symmetric submatrix of e S given by index set G k for k = 1 ;K, and let = diag e (1) ; ; e (K) : Sincek e S ii 0k F for all i2 G k , i 0 2 G k 0, k6= k 0 , and ii 0 = 0 by construction, I have ( 1 ) ii 0 = 0 and hence the (i;i 0 )-th equation of (B.23) is satisfied. Moreover, the 110 (k;k)-th equation of (B.23) is satisfied by construction. Therefore, satisfies the KKT condition (B.23) and is the solution to the fglasso problem (3.10). B.3.5 Proof of Lemma 4 By Lemma 3, for each (j;l)2 S c ,jj e jl jj F = O(p 2 =n ()=2 ); i.e.,jj e jl jj F " = c 8 p 2 =n ()=2 for some constantc 8 > 0: Then (j;l)2S c " and henceS c S c " ; which implies e E " E: For (j;l)2E, by (B.16) there exists some positive constantc 9 such that jjC jl (s;t) M jl Cov(a j ; a l ja U )jj F c 9 p 2 =n =2 ; which implies jjD jl jj F =jjCov(a j ; a l ja U )jj F jjC jl (s;t) jl jj F c 9 p 2 =n =2 : (B.36) It follows from the blockwise inversion formula that e jl = (D 1 ) jl =D 1 jj D jl (D ll D T jl D 1 jj D jl ) 1 . Under Conditions 3 and 4, I havejjD jj jj 2 = max (D jj ) c 4 andkD ll D T jl D 1 jj D jl k 2 jjD ll jj 2 + jjD T jl D 1 jj D jl jj 2 max (D ll ) + 1= min (D jj )jjD jl jj 2 2 c 4 +c 10 p 4 =c 3 n , wherejjD jl jj 2 jjD jl jj F = p c 10 p 2 =n =2 for some constantc 10 > 0: Then I have jj e jl jj F jj e jl jj 2 jjD jl jj 2 jjD jj jj 2 jjD ll D T jl D 1 jj D jl jj 2 jjD jl jj 2 c 4 (c 4 +c 10 p 4 =c 3 n ) : Moreover, given thatp 4 =n <c 7 , letc 11 = 1 c 4 (c 4 +c 10 =(c 3 c 7 )) > 0. Then it follows from (B.36) that jj e jl jj F >c 11 jjD jl jj 2 c 11 jjD jl jj F = p M c 11 p M jjC jl (s;t) jl jj F c 9 p 2 n =2 c 8 p 2 n ()=2 =": Thus under Conditions 1 and 4jC jl (s;t)j c 9 p 2 n =2 jj jl jj 1 F c 8 p c 1 c 11 p 2 n (2)=2 jj jl jj 1 F holds for all (s;t)2 T j T l on the true edge set, which is equivalent to the minimum signal strength in Condition 4 by letting c 5 =c 9 jj jl jj 1 F andc 6 = c 8 p c 1 c 11 jj jl jj 1 F : Then I obtain (j;l)2 e E " , and henceE e E " . Finally,E = e E " follows. 111 B.3.6 Lemma 5 and its proof Lemma 5. For any 0, the fglasso problem (3.10) has a unique solution that satisfies the optimal condition (B.23) with e Z defined in (B.24). Proof. The fglasso problem can be written in the constrained form min P j6=l jj e jl jj F C() n trace( e S e ) log det e o ; (B.37) where e 2R MpMp is symmetric and positive definite. The objective function is strictly convex in view of its Hessian and the constraint on the parameter space, if the minimum is attained then the solution is uniquely determined. I need to show that the minimum is achieved. Note the off block diagonal elements are bounded by satisfying P j6=l jj e jl jj F C() <1. By the fact that max i;j A ij max i A ii for a positive definite matrix A, I only need to consider the possibly unbounded diagonal elements. 
By Hadamard’s inequality for positive definite matrices, we have trace( e S e ) log det e Mp X i=1 e S ii e ii log det e ii : The diagonal elements of e S are positive. The objective function goes to infinity as any sequence ( e (k) 11 ; ; e (k) MpMp );k 1, goes to infinity. Thus the minimum is uniquely achieved. B.3.7 Lemma 6 and its proof Lemma 6. Assume that max n jjWjj (M) max ;jjR()jj (M) max o 8 . Then b Z S c " constructed in (B.26) satisfies jj b Z S c " jj (M) max < 1. Proof. The optimal condition (B.23) can be replaced by e 1 e 1 + W R() + b Z = 0: (B.38) 112 Since vec( e 1 e 1 ) = ( e 1 e 1 )vec(), equation (B.38) can be written as 0 @ e S"S" e S"S c " e S c " S" e S c " S c " 1 A 0 @ vec( S" ) vec( S c " ) 1 A + 0 @ vec(W S" ) vec(W S c " ) 1 A 0 @ vec(R S" ) vec(R S c " ) 1 A + 0 @ vec( e Z S" ) vec( e Z S c " ) 1 A = 0: (B.39) I solve for vec( S" ) from the first line and substitute it into the second line. Then vec( b Z S c " ) can be represented as vec( b Z S c " ) = 1 e S c " S" ( e S"S" ) 1 (vec(W S" ) vec(R S" )) + e S c " S" ( e S"S" ) 1 vec( b Z S" ) + 1 e S c " S" ( e S"S" ) 1 e S"S c " e S c " S c " vec( S c " ) 1 vec(W S c " ) vec(R S c " ) : For any vector v = (v j ) with v j 2R M 2 , 1 j p, definejjvjj (M 2 ) max = max j jjv j jj 2 as theM 2 -group version of l 1 norm. Taking the M 2 -group l 1 norm on both sides, it follows from (B.45) and (B.46) in Lemma 11 that jjvec( b Z S c " )jj (M 2 ) max 1 jj e S c " S" ( e S"S" ) 1 jj (M 2 ) 1 jjvec(W S" )jj (M 2 ) max +jjvec(R S" )jj (M 2 ) max +jj e S c " S" ( e S"S" ) 1 jj (M 2 ) 1 jjvec( b Z S" )jj (M 2 ) max + 1 jj e S c " S" ( e S"S" ) 1 e S"S c " e S c " S c " jj (M 2 ) 1 jjvec( S c " )jj (M 2 ) max + 1 jjvec(W) S c " jj (M 2 ) max +jjvec(R) S c " jj (M 2 ) max : By construction, I havejjvec( b Z S" )jj (M 2 ) max 1 andjjvec( S c " )jj (M 2 ) max ": These results together with (B.42) in Lemma 11 and (B.18)–(B.19) in Condition 9 yield jj b Z S c " jj (M) max 2 " jjWjj (M) max +jjRjj (M) max + (1 ") + ( 2" + 1) " 2 " 4 + (1 ") + ( 2" + 1)" < 2 + 1 + " 2" = 1: 113 B.3.8 Lemma 7 and its proof Lemma 7. If jjjj (M) max 1 3 e d" pd" d" ", then jjJ T jj 1 3 2 and jjR()jj (M) max 3 2 jjjj (M) max d " jjjj (M) max + (pd " )" 3 e , where J = P 1 k=0 (1) k ( e 1 ) k and R() = e 1 e 1 J e 1 . Proof. By the fact that has at mostd " MM blocks whose Frobenius norm is at least" for each column block, thenjjjj (M) 1 d " jjjj (M) max + (pd " )": It follows from (B.44) in Lemma 11 that jj e 1 jj (M) 1 jj e 1 jj (M) 1 jjjj (M) 1 e d " jjjj (M) max + (pd " )" 1=3; where I have used the assumption forjjjj (M) max . Hence I have the convergent matrix expansion via Neumann series ( e + ) 1 = e 1 e 1 e 1 + e 1 e 1 J e 1 : By the definitions of R() and , I obtain R() = e 1 e 1 J e 1 . Let e j 2 R Mpp with identity matrix in thejth block and zero matrix elsewhere, and x2R Mpp withjth block x j 2R MM : Definejjxjj (M) max = max j jjx j jj F andjjxjj (M) 1 = P p j=1 jjx j jj F : Recall that given anM-block matrix A, I have definedM-block version of matrix1-norm asjjAjj (M) 1 = max i P p j=1 jjA ij jj F . Define the corresponding M-block version of matrix 1-norm byjjAjj (M) 1 = max j P p i=1 jjA ij jj F . 
B.3.9 Lemma 8 and its proof

Lemma 8. Assume that
$$r := 2 \kappa^{\varepsilon}_{\widetilde{\Gamma}} \big( \|W\|^{(M)}_{\max} + \gamma_n + \kappa^{\varepsilon}_{\widetilde{\Gamma}} \varepsilon \big) \le \min\Big\{ \frac{1}{3 \kappa_{\widetilde{\Sigma}} d_\varepsilon},\ \frac{1}{3 \kappa_{\widetilde{\Sigma}}^3 \kappa^{\varepsilon}_{\widetilde{\Gamma}} d_\varepsilon} \Big\} - \frac{(p - d_\varepsilon) \varepsilon}{d_\varepsilon}.$$
Then $\|\Delta\|^{(M)}_{\max} = \|\widehat{\Theta} - \widetilde{\Theta}\|^{(M)}_{\max} \le r$.

Proof. Let $G(\Theta_{S_\varepsilon}) = -(\Theta^{-1})_{S_\varepsilon} + \widetilde{S}_{S_\varepsilon} + \gamma_n \widehat{Z}_{S_\varepsilon}$. I define a continuous map $F: \mathbb{R}^{M^2 |S_\varepsilon|} \to \mathbb{R}^{M^2 |S_\varepsilon|}$ by
$$F\big( \mathrm{vec}(\Delta_{S_\varepsilon}) \big) = -\big( \widetilde{\Gamma}_{S_\varepsilon S_\varepsilon} \big)^{-1} \mathrm{vec}\big( G(\widetilde{\Theta}_{S_\varepsilon} + \Delta_{S_\varepsilon}) \big) + \mathrm{vec}(\Delta_{S_\varepsilon}). \qquad (B.40)$$
Note that $F(\mathrm{vec}(\Delta_{S_\varepsilon})) = \mathrm{vec}(\Delta_{S_\varepsilon})$ holds if and only if $G(\widetilde{\Theta}_{S_\varepsilon} + \Delta_{S_\varepsilon}) = G(\widehat{\Theta}_{S_\varepsilon}) = 0$ by construction. I need to show that the function $F$ maps the ball
$$B(r) = \big\{ \Delta_{S_\varepsilon} : \|\Delta_{S_\varepsilon}\|^{(M)}_{\max} \le r \big\} \qquad (B.41)$$
into itself, where $r = 2 \kappa^{\varepsilon}_{\widetilde{\Gamma}} ( \|W\|^{(M)}_{\max} + \gamma_n + \kappa^{\varepsilon}_{\widetilde{\Gamma}} \varepsilon )$. Then, by Brouwer's fixed point theorem, there exists some fixed point $\Delta_{S_\varepsilon} \in B(r)$, which will complete the proof.

It remains to prove the above claim. Note that
$$G(\widetilde{\Theta}_{S_\varepsilon} + \Delta_{S_\varepsilon}) = -\big[ (\widetilde{\Theta} + \Delta)^{-1} \big]_{S_\varepsilon} + \widetilde{S}_{S_\varepsilon} + \gamma_n \widehat{Z}_{S_\varepsilon} = -\big[ (\widetilde{\Theta} + \Delta)^{-1} - \widetilde{\Theta}^{-1} \big]_{S_\varepsilon} + \big[ \widetilde{S} - \widetilde{\Theta}^{-1} \big]_{S_\varepsilon} + \gamma_n \widehat{Z}_{S_\varepsilon} = \big[ \widetilde{\Theta}^{-1} \Delta \widetilde{\Theta}^{-1} - R(\Delta) \big]_{S_\varepsilon} + W_{S_\varepsilon} + \gamma_n \widehat{Z}_{S_\varepsilon}.$$
Then (B.40) can be written as
$$F\big( \mathrm{vec}(\Delta_{S_\varepsilon}) \big) = \underbrace{\big( \widetilde{\Gamma}_{S_\varepsilon S_\varepsilon} \big)^{-1} \mathrm{vec}(R_{S_\varepsilon})}_{T_1}\ \underbrace{-\ \big( \widetilde{\Gamma}_{S_\varepsilon S_\varepsilon} \big)^{-1} \big[ \mathrm{vec}(W_{S_\varepsilon}) + \gamma_n \mathrm{vec}(\widehat{Z}_{S_\varepsilon}) + \widetilde{\Gamma}_{S_\varepsilon S_\varepsilon^c} \mathrm{vec}(\Delta_{S_\varepsilon^c}) \big]}_{T_2}.$$
By the definition of $\kappa^{\varepsilon}_{\widetilde{\Gamma}}$ and (B.45) in Lemma 11, $T_2$ can be bounded by
$$\|T_2\|^{(M^2)}_{\max} \le \kappa^{\varepsilon}_{\widetilde{\Gamma}} \big( \|W_{S_\varepsilon}\|^{(M)}_{\max} + \gamma_n + \kappa^{\varepsilon}_{\widetilde{\Gamma}} \varepsilon \big) = r/2.$$
With the assumed bound for $r$, I have $\|\Delta\|^{(M)}_{\max} \le r \le \frac{1}{3 \kappa_{\widetilde{\Sigma}} d_\varepsilon} - \frac{(p - d_\varepsilon) \varepsilon}{d_\varepsilon}$. Then an application of the bound for $R(\Delta)$ in Lemma 7 yields
$$\|T_1\|^{(M^2)}_{\max} \le \frac{3}{2} \kappa_{\widetilde{\Sigma}}^3 \kappa^{\varepsilon}_{\widetilde{\Gamma}}\, \|\Delta\|^{(M)}_{\max} \big( d_\varepsilon \|\Delta\|^{(M)}_{\max} + (p - d_\varepsilon) \varepsilon \big) \le \frac{\|\Delta\|^{(M)}_{\max}}{2} \le \frac{r}{2},$$
where I have used the assumption $\|\Delta\|^{(M)}_{\max} \le r \le \frac{1}{3 \kappa_{\widetilde{\Sigma}}^3 \kappa^{\varepsilon}_{\widetilde{\Gamma}} d_\varepsilon} - \frac{(p - d_\varepsilon) \varepsilon}{d_\varepsilon}$. Therefore, I obtain
$$\|F(\mathrm{vec}(\Delta_{S_\varepsilon}))\|^{(M^2)}_{\max} \le \|T_1\|^{(M^2)}_{\max} + \|T_2\|^{(M^2)}_{\max} \le r.$$

B.3.10 Lemma 9 and its proof

Lemma 9. Assume that all the conditions of Lemma 8 hold and that $\theta_{\min} = \min_{(j,l) \in E_\varepsilon} \|\widetilde{\Theta}_{jl}\|_F$ satisfies $\theta_{\min} > 2 \kappa^{\varepsilon}_{\widetilde{\Gamma}} ( \|W\|^{(M)}_{\max} + \gamma_n + \kappa^{\varepsilon}_{\widetilde{\Gamma}} \varepsilon )$. Then $\widehat{\Theta}_{jl} \neq 0$ for all $(j,l) \in S_\varepsilon$.

Proof. From Lemma 8, I have $\|\widehat{\Theta}_{jl} - \widetilde{\Theta}_{jl}\|_F \le r$ for any $(j,l) \in S_\varepsilon$. Thus $\widehat{\Theta}_{jl} \neq 0$ for all $(j,l) \in S_\varepsilon$ follows immediately from the lower bound on $\theta_{\min}$.

B.3.11 Lemma 10 and its proof

Lemma 10. For any $\tau > 2$ and sample size $n$ such that $\bar{\delta}_f(n, (Mp)^{\tau}) \le 1/v_*$, I have
$$P\Big( \|W\|^{(M)}_{\max} \ge M \bar{\delta}_f\big(n, (Mp)^{\tau}\big) \Big) \le (Mp)^{2 - \tau}.$$

Proof. Using the definition of the tail function in (B.17), I have $P(|W_{kl}| \ge \delta) \le 1/f(n, \delta)$, where $W \in \mathbb{R}^{Mp \times Mp}$ and $(k,l) \in \{1, \ldots, Mp\}^2$. Since $\|W_{ij}\|_F \ge M \delta$ for an $M \times M$ block requires at least one of its entries to exceed $\delta$ in absolute value, a union bound over the $M^2$ entries gives $P(\|W_{ij}\|_F \ge M \delta) \le M^2 / f(n, \delta)$ for $(i,j) \in V^2$. Taking a further union bound over all $p^2$ $M \times M$ blocks of $W$, I obtain
$$P\Big( \max_{i,j} \|W_{ij}\|_F < M \delta \Big) \ge 1 - \frac{M^2 p^2}{f(n, \delta)}.$$
Let $\delta = \bar{\delta}_f(n, (Mp)^{\tau})$. Then it follows that
$$P\Big( \|W\|^{(M)}_{\max} \ge M \bar{\delta}_f\big(n, (Mp)^{\tau}\big) \Big) \le \frac{M^2 p^2}{f(n, \delta)} = (Mp)^{2 - \tau}.$$

B.3.12 Lemma 11 and its proof

Lemma 11. Let $A = (A_{ij})$ and $B = (B_{ij})$ with $A_{ij}, B_{ij} \in \mathbb{R}^{M \times M}$, $1 \le i, j \le p$; let $u = (u_j)$ and $v = (v_j)$ with $u_j, v_j \in \mathbb{R}^{M}$, $1 \le j \le p$; and let $x, y \in \mathbb{R}^{Mp \times M}$ with $j$-th blocks $x_j, y_j \in \mathbb{R}^{M \times M}$, respectively. Then the following norm properties hold:
$$\|A\|^{(M)}_{\max} = \|\mathrm{vec}(A)\|^{(M^2)}_{\max}, \qquad (B.42)$$
$$\|A + B\|^{(M)}_{\infty} \le \|A\|^{(M)}_{\infty} + \|B\|^{(M)}_{\infty}, \qquad (B.43)$$
$$\|AB\|^{(M)}_{\infty} \le \|A\|^{(M)}_{\infty} \|B\|^{(M)}_{\infty}, \qquad (B.44)$$
$$\|Au\|^{(M)}_{\max} \le \|A\|^{(M)}_{\infty} \|u\|^{(M)}_{\max}, \qquad (B.45)$$
$$\|u + v\|^{(M)}_{\max} \le \|u\|^{(M)}_{\max} + \|v\|^{(M)}_{\max}, \qquad (B.46)$$
$$\|x^T y\|_F \le \|x\|^{(M)}_{\max} \|y\|^{(M)}_{1}, \qquad (B.47)$$
$$\|Ax\|^{(M)}_{\max} \le \|A\|^{(M)}_{\max} \|x\|^{(M)}_{1}, \qquad (B.48)$$
$$\|A\|^{(M)}_{1} = \|A^T\|^{(M)}_{\infty}. \qquad (B.49)$$

Proof. Since
$$\|AB\|^{(M)}_{\infty} = \max_i \sum_{j=1}^{p} \Big\| \sum_{k=1}^{p} A_{ik} B_{kj} \Big\|_F \le \max_i \sum_{j=1}^{p} \sum_{k=1}^{p} \|A_{ik}\|_F \|B_{kj}\|_F = \max_i \sum_{k=1}^{p} \|A_{ik}\|_F \sum_{j=1}^{p} \|B_{kj}\|_F \le \Big( \max_i \sum_{k=1}^{p} \|A_{ik}\|_F \Big) \Big( \max_k \sum_{j=1}^{p} \|B_{kj}\|_F \Big) = \|A\|^{(M)}_{\infty} \|B\|^{(M)}_{\infty},$$
(B.44) follows. The other properties can be shown similarly.
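Properties such as (B.44) are easy to sanity-check numerically. The snippet below is only an illustrative check, not part of the proof: it evaluates the $M$-block $\infty$-norm directly from its definition and verifies the submultiplicativity on a random pair of block matrices. The helper name block_norm_inf and the chosen dimensions are my own.

import numpy as np

def block_norm_inf(A, M):
    # M-block version of the matrix infinity-norm: the maximum over row
    # blocks i of the sum over column blocks j of ||A_ij||_F.
    p = A.shape[0] // M
    return max(
        sum(np.linalg.norm(A[i * M:(i + 1) * M, j * M:(j + 1) * M], "fro")
            for j in range(p))
        for i in range(p)
    )

rng = np.random.default_rng(0)
M, p = 3, 5
A = rng.standard_normal((M * p, M * p))
B = rng.standard_normal((M * p, M * p))
assert block_norm_inf(A @ B, M) <= block_norm_inf(A, M) * block_norm_inf(B, M)  # (B.44)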
Appendix C

Proofs in Chapter 4

C.1 Proof of Theorems 7 and 8

Following the same derivations as in the proof of Theorem 1 in Li et al. (2012), which considers the setting $X \in \mathbb{R}$ and $Y \in \mathbb{R}^q$, I derive the corresponding results in the functional domain, where $X(s)$, $s \in \mathcal{S}$, and $Y(t)$, $t \in \mathcal{T}$, are two random functions. I treat $Y_1, \ldots, Y_n$ and $X_{1j}, \ldots, X_{nj}$, $1 \le j \le p$, as i.i.d. random functions, respectively. Notice that Conditions 6 and 7 correspond to the functional counterparts of conditions C1 and C2 in Li et al. (2012). Substituting $\|X\|_1$ and $\|Y\|_q$ by $\|X\|$ and $\|Y\|$, where $\|\cdot\|$ denotes the $L^2$ norm of a function, and applying the theory of U-statistics, I can obtain the same probability bounds in Theorem 7 as those in Theorem 1 of Li et al. (2012). I omit the details of the derivations, since the techniques used to facilitate the proof are the same as those in Li et al. (2012).

Shao and Zhang (2014) consider the situation where $X \in \mathbb{R}^q$ and $Y \in \mathbb{R}$ and present the corresponding sure screening property in their Theorem 5. I obtain the analogous results in the functional setting. Note that Conditions 6 and 8 are functional versions of conditions A1 and A2 in Shao and Zhang (2014). Replacing $Y_i Y_k$, $1 \le i, k \le n$, and $\|X\|_q$ by $\langle Y_i, Y_k \rangle$ and $\|X\|$, where $\langle \cdot, \cdot \rangle$ denotes the $L^2$ inner product between two functions defined on the same domain, and then following the same derivations as in the proof of Theorem 5 of Shao and Zhang (2014), I obtain the sure screening property for MDC-SIS in Theorem 8, which is the same as that in Theorem 5 of Shao and Zhang (2014). The details of the proof are omitted.

C.2 Proof of Proposition 3

Given Definition 2, I write
$$
\|g_{Y,X}(u) - g_Y g_X(u)\|^2 = \langle g_{Y,X}(u) - g_Y g_X(u),\, g_{Y,X}(u) - g_Y g_X(u) \rangle = \|g_{Y,X}(u)\|^2 - \langle g_{Y,X}(u), g_Y \rangle\, \overline{g_X(u)} - g_X(u)\, \langle g_Y, g_{Y,X}(u) \rangle + \|g_Y\|^2\, g_X(u)\, \overline{g_X(u)},
$$
where
$$\|g_{Y,X}(u)\|^2 = \big\langle E[Y e^{i \langle u, X \rangle}],\, E'[Y' e^{i \langle u, X' \rangle}] \big\rangle = E\big[ \langle Y, Y' \rangle e^{i \langle u, X - X' \rangle} \big] = -E\big[ \langle Y, Y' \rangle (1 - \cos \langle u, X - X' \rangle) \big] + E(\langle Y, Y' \rangle) + R_1,$$
$$\langle g_{Y,X}(u), g_Y \rangle\, \overline{g_X(u)} = \big\langle E[Y e^{i \langle u, X \rangle}],\, E'(Y') \big\rangle\, E''\big[ e^{-i \langle u, X'' \rangle} \big] = -E\big[ \langle Y, Y' \rangle (1 - \cos \langle u, X - X'' \rangle) \big] + E(\langle Y, Y' \rangle) + R_2,$$
$$g_X(u)\, \langle g_Y, g_{Y,X}(u) \rangle = \big\langle E'(Y'),\, E[Y e^{i \langle u, X \rangle}] \big\rangle\, E''\big[ e^{i \langle u, X'' \rangle} \big] = -E\big[ \langle Y, Y' \rangle (1 - \cos \langle u, X'' - X \rangle) \big] + E(\langle Y, Y' \rangle) + R_3,$$
$$\|g_Y\|^2\, g_X(u)\, \overline{g_X(u)} = E[\langle Y, Y' \rangle]\, E\big[ e^{i \langle u, X - X' \rangle} \big] = -E[\langle Y, Y' \rangle]\, E\big[ 1 - \cos \langle u, X - X' \rangle \big] + E(\langle Y, Y' \rangle) + R_4,$$
with $R_1, R_2, R_3$ and $R_4$ representing remainders that vanish when the integral over $u \in \mathcal{U}$ is evaluated, and $E'$, $E''$ corresponding to the expectations with respect to $(X', Y')$ and $(X'', Y'')$, respectively. Notice that I choose a weight function $w(u)$ in Definition 2 such that $\int_{\mathcal{U}} (1 - \cos \langle u, X \rangle)\, w(u)\, du = \|X\|$. The analytical form of $w(u)$ is left for future research, but in the case of $x \in \mathbb{R}^d$, Lemma 1 of Szekely et al. (2007) shows that $w(u) = \{ C(d)\, \|u\|_d^{d+1} \}^{-1}$ with $C(d) = \pi^{(d+1)/2} / \Gamma\big( (d+1)/2 \big)$. Integrating all the terms above over $u$ with the chosen weight function $w(u)$, I have
$$
\mathrm{MDD}^2(Y \mid X) = -E\big[ \langle Y, Y' \rangle \|X - X'\| \big] + E\big[ \langle Y, Y' \rangle \|X - X''\| \big] + E\big[ \langle Y, Y' \rangle \|X'' - X\| \big] - E[\langle Y, Y' \rangle]\, E(\|X - X'\|) = E\big[ \langle Y - E(Y),\, Y' - E(Y') \rangle\, d(X, X') \big],
$$
which completes the proof.

C.3 Proof of Proposition 4

From Proposition 3, I have
$$\mathrm{MDD}(Y \mid X)^2 = E\Big[ \langle Y - E(Y),\, Y' - E(Y') \rangle \big( E_X \|X - X'\| + E_{X'} \|X - X'\| - \|X - X'\| - E\|X - X'\| \big) \Big].$$
It follows from the Cauchy--Schwarz inequality that
$$
\mathrm{MDD}(Y \mid X)^2 \le \sqrt{ E\|Y - E(Y)\|^2\, E\|Y' - E(Y')\|^2\, E\big[ \big( E_X \|X - X'\| + E_{X'} \|X - X'\| - \|X - X'\| - E\|X - X'\| \big)^2 \big] } = \sqrt{ \big( E\|Y - E(Y)\|^2 \big)^2\, E\big[ \big( E_X \|X - X'\| + E_{X'} \|X - X'\| - \|X - X'\| - E\|X - X'\| \big)^2 \big] } = \sqrt{ \mathrm{fvar}(Y)^2\, \mathrm{dvar}(X)^2 },
$$
where $\mathrm{dvar}(X) = \mathrm{dcov}(X, X)$ is defined on page 5 of Lyons (2013). Then I have $\mathrm{MDC}(Y \mid X) \le 1$, and it is trivial to show $\mathrm{MDC}(Y \mid X) \ge 0$. Moreover, it follows from Definition 3 that $\mathrm{MDC}(Y \mid X) = 0$ if and only if $\mathrm{MDD}(Y \mid X) = 0$, i.e. $\|g_{Y,X}(u) - g_Y g_X(u)\|^2 = 0$ and $g_{Y,X}(u) = g_Y g_X(u)$ for almost all $u$ by Definition 2, which holds if and only if $E(Y \mid X) = E(Y)$.

C.4 Proof of Proposition 5

The proof is similar to those of Theorem 1 in Szekely et al. (2007) and Theorem 2 in Shao and Zhang (2014). Note that
$$\|\widehat{g}^n_{Y,X}(u) - \widehat{g}^n_Y \widehat{g}^n_X(u)\|^2 = \|\widehat{g}^n_{Y,X}(u)\|^2 - \langle \widehat{g}^n_{Y,X}(u), \widehat{g}^n_Y \rangle\, \overline{\widehat{g}^n_X(u)} - \widehat{g}^n_X(u)\, \langle \widehat{g}^n_Y, \widehat{g}^n_{Y,X}(u) \rangle + \|\widehat{g}^n_Y\|^2\, \widehat{g}^n_X(u)\, \overline{\widehat{g}^n_X(u)}.$$
The first term is
$$\|\widehat{g}^n_{Y,X}(u)\|^2 = -n^{-2} \sum_{i,k=1}^{n} \langle Y_i, Y_k \rangle \big( 1 - \cos \langle u, X_i - X_k \rangle \big) + n^{-2} \sum_{i,k=1}^{n} \langle Y_i, Y_k \rangle + R_{1n}.$$
The second term is
$$\langle \widehat{g}^n_{Y,X}(u), \widehat{g}^n_Y \rangle\, \overline{\widehat{g}^n_X(u)} = -n^{-3} \sum_{i,k,l=1}^{n} \langle Y_i, Y_k \rangle \big( 1 - \cos \langle u, X_i - X_l \rangle \big) + n^{-2} \sum_{i,k=1}^{n} \langle Y_i, Y_k \rangle + R_{2n}.$$
The third term is
$$\widehat{g}^n_X(u)\, \langle \widehat{g}^n_Y, \widehat{g}^n_{Y,X}(u) \rangle = -n^{-3} \sum_{i,k,l=1}^{n} \langle Y_i, Y_k \rangle \big( 1 - \cos \langle u, X_l - X_i \rangle \big) + n^{-2} \sum_{i,k=1}^{n} \langle Y_i, Y_k \rangle + R_{3n}.$$
The last term is
$$\|\widehat{g}^n_Y\|^2\, \widehat{g}^n_X(u)\, \overline{\widehat{g}^n_X(u)} = -n^{-2} \sum_{i,k=1}^{n} \langle Y_i, Y_k \rangle\; n^{-2} \sum_{i,k=1}^{n} \big( 1 - \cos \langle u, X_i - X_k \rangle \big) + n^{-2} \sum_{i,k=1}^{n} \langle Y_i, Y_k \rangle + R_{4n}.$$
Here $R_{1n}, R_{2n}, R_{3n}$ and $R_{4n}$ represent terms that vanish when the integral over $u \in \mathcal{U}$ is computed. Integrating the above four terms over $u$ with the same weight function defined in Definition 2, for which $\int_{\mathcal{U}} (1 - \cos \langle u, X \rangle)\, w(u)\, du = \|X\|$, it follows that
$$\widehat{\mathrm{MDD}}_n(\mathbf{y} \mid \mathbf{x})^2 = \int_{\mathcal{U}} \|\widehat{g}^n_{Y,X}(u) - \widehat{g}^n_Y \widehat{g}^n_X(u)\|^2\, w(u)\, du = -\widehat{T}_{1n} - \widehat{T}_{2n} + 2 \widehat{T}_{3n},$$
which completes the proof.
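To make the quantities in the proof of Proposition 5 concrete, the following sketch computes the sample value $-\widehat{T}_{1n} - \widehat{T}_{2n} + 2\widehat{T}_{3n}$ when the response and predictor curves are observed on common grids. It is only an illustration, not the implementation used in my simulations: the function name, the trapezoidal approximation of the $L^2$ inner products and norms, and the array layout are my own choices, and the sign convention for the $\widehat{T}$ terms follows the display above.

import numpy as np

def sample_mdd_sq(Y, X, t_grid, s_grid):
    # Y: (n, len(t_grid)) response curves; X: (n, len(s_grid)) predictor curves.
    # Returns -T1 - T2 + 2*T3 with
    #   T1 = n^-2 sum_{i,k} <Y_i, Y_k> ||X_i - X_k||,
    #   T2 = (n^-2 sum_{i,k} <Y_i, Y_k>) (n^-2 sum_{i,k} ||X_i - X_k||),
    #   T3 = n^-3 sum_{i,k,l} <Y_i, Y_k> ||X_i - X_l||,
    # where <.,.> and ||.|| are L2 quantities approximated by the trapezoidal rule.
    n = Y.shape[0]
    G = np.array([[np.trapz(Y[i] * Y[k], t_grid) for k in range(n)] for i in range(n)])
    D = np.array([[np.sqrt(np.trapz((X[i] - X[l]) ** 2, s_grid)) for l in range(n)]
                  for i in range(n)])
    T1 = (G * D).sum() / n ** 2
    T2 = (G.sum() / n ** 2) * (D.sum() / n ** 2)
    T3 = G.sum(axis=1) @ D.sum(axis=1) / n ** 3
    return -T1 - T2 + 2 * T3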
C.5 Proof of Proposition 6

The proof is similar to that of Theorem 3 in Shao and Zhang (2014), which considers the situation of $X \in \mathbb{R}^q$ and $Y \in \mathbb{R}$. I concentrate on the functional setting where $X(s)$, $s \in \mathcal{S}$, and $Y(t)$, $t \in \mathcal{T}$, are two random functions. Write $\mathrm{MDD}^2(Y \mid X) = \int_{\mathcal{U}} \|g_{Y,X}(u) - g_Y g_X(u)\|^2\, w(u)\, du$ and $\widehat{\mathrm{MDD}}_n(\mathbf{y} \mid \mathbf{x})^2 = \int_{\mathcal{U}} \|\xi_n(u)\|^2\, w(u)\, du$, where
$$\xi_n(u) = n^{-1} \sum_{j=1}^{n} Y_j e^{i \langle u, X_j \rangle} - n^{-1} \sum_{j=1}^{n} Y_j\; n^{-1} \sum_{j=1}^{n} e^{i \langle u, X_j \rangle}.$$
Define the region $D(\delta) = \{ u :\, \delta < \|u\| < 1/\delta \}$ for each $\delta > 0$. Following the derivations in the proof of Theorem 3 in Shao and Zhang (2014), I can define the functional counterparts of the terms and notation used in that proof, and then apply the same techniques in the functional domain to establish the consistency of the sample estimators stated in Proposition 6.
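The screening procedure whose sure screening property is established in C.1 ranks the functional predictors by a sample dependence measure and retains the top-ranked ones. The sketch below illustrates that ranking step using the unnormalized sample MDD from the previous snippet; in the thesis the ranking is based on the normalized coefficient MDC, so this is a simplified stand-in, and the function name and the cutoff argument d are my own choices.

import numpy as np

def mdd_screen(Y, X_list, t_grid, s_grid, d):
    # Rank the p functional predictors by the sample MDD of Y given X_j and
    # keep the indices of the d largest values (uses sample_mdd_sq above).
    scores = np.array([sample_mdd_sq(Y, Xj, t_grid, s_grid) for Xj in X_list])
    return np.argsort(scores)[::-1][:d]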
Abstract
Functional Data Analysis (FDA), a branch of statistics that analyzes data providing information about functions measured over some domain, has recently attracted increasing attention. In my thesis, I present a brief overview of FDA and consider three different settings in FDA where sparseness can play an important role.

The first setting considers sparseness in the functions. Classical FDA requires a large number of regularly spaced measurements per subject, whereas I consider the situation involving sparse, irregularly observed and noisy data. I propose a new functional errors-in-variables approach, Sparse Index Model Functional Estimation (SIMFE), which uses a functional index model formulation to handle sparsely sampled functional predictors. SIMFE enjoys several advantages over traditional methods. First, it implements a non-linear regression and uses a supervised approach to estimate the lower dimensional space onto which the predictors should be projected. Second, SIMFE can be applied to both scalar and functional responses with multiple predictors. Finally, SIMFE uses a mixed effects model to deal with sparsely sampled functional data.

The other two settings focus on sparseness arising from high dimensional functional data, where the number of functional variables, p, exceeds the number of observations, n. The first example extends the Gaussian graphical model, which depicts the conditional dependence structure among p multivariate Gaussian variables, to the functional domain. Fitting a high dimensional graphical model poses several challenges, so I assume sparseness in the edges: the edges connect only a subset of the nodes, i.e. only some pairs of random variables are conditionally dependent. I propose a functional graphical model (FGM) describing the conditional dependence structure among p random functions, and develop the fglasso criterion, which estimates the FGM by imposing block sparsity in the precision matrix via a group lasso penalty. I show that the true graph is recovered with high probability.

The second example concerns screening features for ultra-high dimensional functional data. To make the problem feasible, I assume extreme sparseness in the predictor space, i.e. most of the functional predictors do not contribute to the response. I propose several model-free independence screening procedures to rank the importance of the functional predictors. I compare the different active sets of predictors that these approaches aim to select and establish the corresponding sure screening properties when the number of predictors diverges at an exponential rate in the number of observations.

In each setting, I illustrate the finite sample performance of my proposed method through a series of simulations and a real world data example.