HIGH-DIMENSIONAL REGRESSION FOR GENE-ENVIRONMENT INTERACTIONS

by

Natalia Zemlianskaia

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (BIOSTATISTICS)

December 2021

Copyright 2021 Natalia Zemlianskaia

Acknowledgements

I want to sincerely thank my wonderful advisor Dr. Juan Pablo Lewinger. I felt truly lucky throughout my years in the Ph.D. program to work with such an incredibly smart and kind mentor, who came up with and gave me very challenging and interesting research problems to solve. I'm endlessly grateful to Juan Pablo for always believing in me, motivating and supporting my ideas, and giving me the courage to pursue difficult research questions and defeat various obstacles. It was a truly fulfilling experience to learn from my advisor; his insightful feedback brought my work to the highest level.

I want to express my deepest gratitude to Dr. Jim Gauderman. His passion for research and impactful work inspired me to pursue a career in health research. Jim introduced me to G×E research and greatly motivated my work on hierarchical G×E analysis. I would also like to pay my special regards to Dr. Paul Marjoram and Dr. Kim Siegmund for helping me throughout my early years in the program, which were the most challenging. Paul and Kim gave me the strength to work hard and also enjoy my time in the program.

I'm very grateful to my committee member Dr. Jacob Bien. I learned so much from his work on hierarchical pairwise interactions and shaped my own methodology based on his incredible research. I would also like to thank Dr. David Conti for his valuable advice and contribution to the development of this dissertation.

Many thanks to my friends and family for their help and support, especially to my dearest brilliant friend Lera. Her friendship and advice got me through a lot. I would like to thank my beloved husband Yury, who is the smartest person I know, my constant inspiration, my best friend, and my favorite scientist. Thank you for sharing every minute of this challenging and gratifying experience with me; I could not have done it without you.

Contents

Acknowledgements
List of Tables
List of Figures
Abstract

1 Introduction to Gene-Environment Interactions
  1.1 Introduction
  1.2 Standard Methods

2 Joint selection of interactions with hierarchical structure
  2.1 Introduction
  2.2 Hierarchical structure
    2.2.1 The gesso model
    2.2.2 Connection to the hierarchical Lasso
  2.3 Related Work. Group lasso based models
    2.3.1 FAMILY
    2.3.2 glinternet

3 Block coordinate descent algorithm for gesso
  3.1 Introduction
  3.2 Block coordinate descent algorithm
    3.2.1 Block coordinate descent updates
    3.2.2 Primal and Dual formulation for gesso
    3.2.3 Convergence criterion
  3.3 Screening rules
    3.3.1 SAFE rules for gesso
    3.3.2 Optimal naive projection
    3.3.3 Path-wise SAFE rules for gesso
    3.3.4 Dynamic SAFE
    3.3.5 Gap SAFE
    3.3.6 Working set strategy
    3.3.7 Active set and adaptive max difference strategies
    3.3.8 Experiments
  3.4 Simulations
  3.5 Real data example: Epigenetic clock

4 gesso for binary outcomes
  4.1 Logistic gesso formulation
  4.2 Newton-Raphson algorithm for logistic regression
  4.3 Gap SAFE rules for logistic regression
  4.4 Algorithms
  4.5 Simulations
  4.6 Real data application
    4.6.1 Prostate cancer gene expression data
    4.6.2 FIGI data with external information (imputed gene expression)

5 gesso package
  5.1 Overview
  5.2 Installation
  5.3 Data generation
  5.4 Selection example
  5.5 Prediction example
  5.6 Adding unpenalized covariates to the model
  5.7 Sparse matrix (dgCMatrix) option example
  5.8 Bigmatrix (bigmemory package) option example
  5.9 Working sets

6 Incorporation of External information
  6.1 Introduction
  6.2 Statistical and Biological Motivation
  6.3 The model
  6.4 Simulation design
  6.5 Results
  6.6 Model fitting
    6.6.1 Block coordinate descent updates
7 Conclusions and Discussion

A Introduction to Convex Optimization
B Proof of equivalence of relaxed and unconstrained models
C Coordinate-wise solutions for the gesso
D Block coordinate descent updates for the gesso
E Dual formulation for the gesso
F Auxiliary problem solution
G SAFE rules for gesso
H Solution to the optimization problem (3.18)
I Coordinate-wise solutions for the externally informed gesso
J Block coordinate descent updates for the externally informed gesso
K Simulations with FAMILY
L Weighted least squares gesso
  L.1 Dual formulation
  L.2 Naive projection
M Dual formulation for logistic gesso
  M.1 Dual formulation
  M.2 Quadratic approximation for the dual objective

References

List of Tables

3.1 Ranks across analyses. Probes identified by other methods highlighted in bold.
3.2 Biological information about the top selected CpG probes (GWIS and gesso).
3.3 Biological information about the top selected CpG probes (glmnet).
4.1 Selection Models for G×E.
4.2 Selected genes by gesso (in light gray) and the Lasso (white) with the stability procedure (selection probability ≥ 0.7). Genes selected by both methods are highlighted in dark gray.

List of Figures

3.1 Warm start strategy for the two-dimensional grid.
3.2 Dynamic SAFE regions.
3.3 Dynamic GAP SAFE regions.
3.4 Time comparison with added strategies. We denote Algorithm 1 as cd; maxdiff: added max difference strategy, as: added active set strategy, ws: added working set strategy.
3.5 Number of variables in the WS versus the number of actual non-zero variables identified by the BCD fitting.
3.6 Working set size (%) for all lambda pairs.
3.7 Strong hierarchical mode. p = 2500, n = 100, n_G non-zero / n_G×E non-zero = 10/5. Model performance as a function of the number of interactions discovered.
3.8 Hierarchical mode. p = 2500, n = 100, n_G non-zero / n_G×E non-zero = 10/5. Model performance as a function of the number of interactions discovered.
3.9 Anti-hierarchical mode. p = 2500, n = 100, n_G non-zero / n_G×E non-zero = 10/5. Model performance as a function of the number of interactions discovered.
3.10 Manhattan plots.
3.11 Interaction plots for the single-marker analysis stratified by sex. Predicted age on the y-axis, methylation frequency on the x-axis.
4.1 Newton-Raphson algorithm. [figure credit: Jeremy Watt]
4.2 Dual function for the logistic gesso formulation g(x) = x log(x) + (1 − x) log(1 − x).
4.3 Strong hierarchical mode.
4.4 Hierarchical mode.
4.5 Anti-hierarchical mode.
4.6 Test loss across modes.
4.7 Main effects variation in oracle model.
4.8 Main effects variation in selection models.
4.9 Mean cross-validated (cv) AUC (SD) and mean number of selected main effects and interactions (SD) across 100 cv splits. Interaction with (a) age, (b) preoperative PSA, and (c) pooled Gleason score is considered.
4.10 Steps involved in the analysis.
5.1 Distribution of working set sizes.
5.2 Working set sizes by 2D grid of tuning parameters (λ1, λ2).
6.1 Integration of gene expression data.
6.2 Simulation design.
6.3 G×E detection AUC across five compared models.
6.4 G detection AUC across five compared models.
6.5 MSE for G×E across five compared models.
6.6 Test loss across five compared models.
A.1 Definition of a convex function.
A.2 Linear approximation (via gradient) at any point underestimates f.
A.3 Subgradient for f(β) = |β|.
F.1 Auxiliary problem geometric solution.
H.1 Geometric solution for the problem (H.1). The optimal ν_j maximizes the range of possible x values.
K.1 Strong hierarchical mode. p = 1000, n = 120, n_G non-zero / n_G×E non-zero = 10/5. Model performance as a function of the number of interactions discovered.
K.2 Hierarchical mode. p = 1000, n = 120, n_G non-zero / n_G×E non-zero = 10/5. Model performance as a function of the number of interactions discovered.
K.3 Anti-hierarchical mode. p = 1000, n = 120, n_G non-zero / n_G×E non-zero = 10/5. Model performance as a function of the number of interactions discovered.

Abstract

We describe a novel high-dimensional regression method for modeling gene-environment interactions that accounts for all of the investigated genomic features simultaneously and imposes a main-effect-before-interaction hierarchical structure. We propose an efficient fitting algorithm and screening rules that can discard large numbers of irrelevant predictors with high accuracy. We present simulation results showing that the model outperforms existing joint selection methods for gene-environment (G×E) interactions in terms of selection performance, scalability, and speed. This algorithm removes the computational bottleneck for performing joint hierarchical selection of interactions, providing an effective alternative to the currently dominant paradigm of one-predictor-at-a-time analyses. The algorithm is based on screening rules that can discard large numbers of null predictors, making high-dimensional but sparse problems computationally tractable.
We provide theoretical justification and computational experiments for the screening rules we develop. Our implementation of the algorithm is available in the gesso R package on CRAN.

We showcase our approach with an application to a large-scale methylation dataset, including 200,000 CpG markers per subject, identifying four probes that show sex specificity in their effect on human aging. In GWAS and TWAS applications we identify SNPs showing sex specificity in their effect on childhood asthma and genes associated with colon cancer in interaction with aspirin intake.

Chapter 1

Introduction to Gene-Environment Interactions

1.1 Introduction

The etiology of complex diseases has been linked to both genetic and environmental factors, as genetic factors alone do not completely explain the mechanisms of many diseases. Joint (interaction) effects that go beyond additive genetic and environmental effects are of great interest and importance.

A gene-environment (G×E) interaction is defined as a different effect of an environmental factor on disease risk in subjects with different genotypes. In other words, we say that an interaction is present when the effect of an environmental factor depends on an individual's genetic makeup. For example, studies have shown that people who smoke and have a particular variant in the metabolic gene NAT2 are at elevated risk of bladder cancer compared to non-smokers [4, 6, 7, 8].

From a health care perspective, it is important to investigate whether a disease can be prevented by reducing exposure to the relevant environmental agents. For example, phenylketonuria, a rare genetic disease caused by a mutation in the G6PD gene, can be entirely prevented by following a specific phenylalanine-restrictive diet [5].

One of the aims of pharmacogenomics is to identify genes that modify drug response. This can lead to more accurate prediction of a patient's reaction to a drug, allow fine-grained safety and efficacy investigation, and refine drug indication to the sub-populations that would benefit from the medicine the most [9, 10]. Ritz et al. (2017) provide a comprehensive overview of the discovered gene-environment interactions and their public health importance [3]. In the past years, the amount of genetic data has vastly increased alongside the number of statistical methods to identify G×E interactions.

1.2 Standard Methods

In a regression equation, an interaction effect is represented as the product of two independent variables. A standard interaction model (1.1) estimates the simultaneous effects of genetic and environmental factors in addition to the marginal effects:

$$ g(\mathbb{E}[Y]) = \beta_0 + \beta_G G + \beta_E E + \beta_{G\times E}\, GE, \tag{1.1} $$

where β_G and β_E are the marginal effects and β_{G×E} is an interaction effect, G = (G_1, ..., G_p) are p genomic features, and E is an environmental risk factor. Y is an outcome of interest, for example a binary disease status or a continuous trait. g(𝔼[Y]) is a link function: for linear regression, g(x) = x (identity link), and for logistic regression, g(x) = log(x / (1 − x)) (logit link).

A Genome-Wide Interaction Scan (GWIS) performs a one-SNP-at-a-time analysis (sometimes referred to as a single-marker analysis) to test for interactions and applies a multiple testing correction, resulting in a genome-wide significance level equal to 5e-8. Likelihood ratio, Wald, or score tests can be used to test for a significant interaction effect (H_0: β_{G×E} = 0). However, the resulting procedure suffers from low power and requires a substantial sample size to detect significant interactions.
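To make the single-marker procedure concrete, below is a minimal R sketch of a GWIS loop, not code from this dissertation: `G` is assumed to be an n × p genotype matrix, `E` an exposure vector, `y` a binary outcome, and the function name `gwis_scan` is hypothetical.

```r
# Hypothetical one-SNP-at-a-time GxE scan: fit model (1.1) for each SNP and
# record the Wald p-value for the interaction coefficient (H0: beta_GxE = 0).
gwis_scan <- function(G, E, y) {
  pvals <- numeric(ncol(G))
  for (j in seq_len(ncol(G))) {
    fit <- glm(y ~ G[, j] * E, family = binomial())  # G_j, E, and G_j x E terms
    pvals[j] <- summary(fit)$coefficients["G[, j]:E", "Pr(>|z|)"]
  }
  pvals  # compare against the genome-wide significance level, e.g. 5e-8
}
```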
Several methods have been proposed to improve the power of G×E identification analysis. Among them are various two-step procedures that implement a screening step to prioritize variants and reduce the multiple testing burden for the interaction tests performed in the second step [11, 12, 13]; case-only (CO) analysis, which has increased power but requires the assumption of gene-environment independence [16]; and empirical Bayes (EB) type approaches that combine case-only and case-control estimates to reduce estimation bias under violations of independence [14, 15].

One of the limitations of one-SNP-at-a-time methods is that they do not account for the correlation between genomic features and joint effects on complex polygenic traits. For complex diseases with many causal genetic factors, an analysis that jointly takes into account the effects of all genetic variants is preferable to a single-marker analysis where the variants are considered separately. Markers with weak effects might be more identifiable when the model has been adjusted for other causal markers, and false-positive effects might be weakened by the inclusion of stronger true causal variants in the model. In addition, genetic features that are associated with the outcome jointly can be found (for example, for highly polygenic traits).

As most genomic data is high-dimensional (in a typical GWAS the number of SNPs is on the order of millions while the sample size is substantially smaller, n ≪ p), standard (unpenalized) multiple regression analysis is infeasible. As a remedy, regularization methods such as the Lasso [1] and Elastic Net [2], which are suitable for high-dimensional data and also perform variable selection, are frequently used.

A lot of relevant work has been done to evaluate regularization methods for the selection of main effects. Ayers and Cordell (2010) provide a comprehensive comparison of various penalization methods, including Lasso, Ridge, and Elastic Net, with single-marker regression and with forward stepwise regression. Model performance was evaluated with respect to the selection of important main effects and the localization of true causal SNPs. The reported results indicate that, in terms of detection area under the curve (AUC), regularization methods outperform both single-marker regression and stepwise methods [24]. Waldmann et al. (2013) conducted a comparison of Lasso and Elastic Net models for highly correlated features in the GWAS setting and explored various values of the Elastic Net tuning parameter, which controls the ratio between the L1 and L2 penalties in the model [25]. A couple of methods suggested incorporating regularization into two-step selection approaches, either in the screening step [26] or in the testing step [27, 28]. Wu et al. (2009) performed a Lasso logistic regression analysis that replicated some of the previous results for GWAS coeliac disease markers and suggested new findings [29]. The authors also proposed a two-step framework for finding gene-by-gene (G×G) interactions among the main effects selected in the first step. This approach ensures the main-effect-before-interaction hierarchical structure among the selected SNPs and interactions, but screens for main effects and interactions sequentially to reduce the computational burden, not in a joint manner.

General regularized regression methods such as the Lasso [1] and Elastic Net [2] can be used for G×E identification but, because they do not exploit the hierarchical structure of the G×E problem, can perform suboptimally.
The main-effect-before-interaction hierarchical structure ensures that the final selected model only includes interactions when the corresponding main effects have been selected. This enhances the interpretability of the final model [21, 20] and increases the ability to detect interactions by reducing the search space [19].

Methods for the G×G selection problem that exploit an interaction hierarchical structure, like FAMILY [35], glinternet [70], and hierNet [34], and their corresponding R implementations can in principle be applied to the G×E case, but they are optimized for the symmetric G×G case, which results in vastly suboptimal performance in terms of run-time and scalability (they can handle at most a few hundred predictors) for the G×E case. Indeed, the structure of the G×E selection task is simpler because 1) the dimensionality of the problem grows linearly with additional environmental variables, as opposed to quadratically for the G×G selection task, and 2) interactions with a single variable lead to a block-separable optimization problem, which, unlike the G×G case, can be efficiently solved using a block coordinate descent algorithm. For these reasons, the G×E selection problem is amenable to efficient implementations for large-scale analysis (e.g., genome-wide). However, efficient joint selection methods specific to G×E in large-scale applications have not been developed. [17] and [18] adopted sparse group penalization approaches to accelerated failure time models for hierarchical selection of G×E interactions, but their approaches do not scale to datasets with a very large number of predictors without additional pre-screening procedures.

In this work, we present the gesso (from G(by)E(la)sso) model for the hierarchical modeling of interaction terms. We propose an efficient fitting algorithm for the gesso model and powerful new screening rules that eliminate a large number of variables beforehand, making joint G×E analyses feasible at the genome-wide scale.

Chapter 2

Joint selection of interactions with hierarchical structure

2.1 Introduction

In this chapter we explore joint selection methods for gene-environment (G×E) interaction analysis. In particular, we assess penalization approaches that respect the main-effect-before-interaction hierarchical structure. The structure implies that an interaction term only gets selected if its corresponding main effect has already been selected by the model. We review various types of convex objectives that impose a hierarchical structure and propose the gesso model.

G×E interaction model

Consider a linear regression model with n subjects, p SNPs, q environmental variables, and pq two-way interactions:

$$ \underset{\beta_0,\,\beta_G,\,\beta_E,\,\beta_{G\times E}}{\text{minimize}}\ \Big\| Y - \Big( \beta_0 + \sum_j \beta_{G_j} G_j + \sum_j \beta_{E_j} E_j + \sum_{i,j}^{p,q} \beta_{G_i E_j} G_i E_j \Big) \Big\|_2^2, $$

where Y ∈ ℝ^n is an outcome of interest, G is an n × p matrix of genotypes, E is an n × q matrix of exposure measurements, and β_0 ∈ ℝ, β_G ∈ ℝ^p, β_E ∈ ℝ^q, β_{G×E} ∈ ℝ^{pq} are regression coefficients. We are interested in selecting the G×E interaction terms that are associated with the outcome variable Y.

One of the classical joint selection approaches is the Lasso regression model proposed by Tibshirani et al. (1996) [1].

Lasso model

$$ \underset{\beta_0,\,\beta_G,\,\beta_E,\,\beta_{G\times E}}{\text{minimize}}\ \Big\| Y - \Big( \beta_0 + \sum_j \beta_{G_j} G_j + \sum_j \beta_{E_j} E_j + \sum_{i,j}^{p,q} \beta_{G_i E_j} G_i E_j \Big) \Big\|_2^2 + \lambda \big( \|\beta_G\|_1 + \|\beta_E\|_1 + \|\beta_{G\times E}\|_1 \big) \tag{2.1} $$

The standard Lasso model corresponds to applying the L1 penalty to all main effect and interaction coefficients with the same penalization parameter λ.
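For reference, the baseline model (2.1) can be fitted with the glmnet package by constructing the interaction columns explicitly. A sketch, assuming a genotype matrix `G`, a single exposure vector `E`, and a continuous outcome `y` (glmnet penalizes every column with the same λ, so no hierarchy is imposed):

```r
library(glmnet)

GxE <- G * E                         # element-wise: column j holds G_j * E
X   <- cbind(G, E = E, GxE)          # main effects plus interaction terms
cvfit <- cv.glmnet(X, y, alpha = 1)  # alpha = 1 selects the pure L1 penalty
coef(cvfit, s = "lambda.min")        # jointly selected coefficients
```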
The model does not distinguish between main effects and interaction effects and could select interaction terms for which the main effects have not been selected. In other words, the standard Lasso model above, which is going to be our baseline model, does not enforce a hierarchical structure. In the next subsection we provide a formal definition of a hierarchical structure.

2.2 Hierarchical structure

Strong and Weak Hierarchy (Heredity)

A strong hierarchical structure implies that if either of the main effects is equal to zero, then the interaction term has to be zero as well. A weak hierarchy permits one of the main effects to be zero and only implies a zero interaction if both of the main effects are zero simultaneously.

Strong Hierarchy: β_{G_i E_j} ≠ 0 ⟹ β_{G_i} ≠ 0 and β_{E_j} ≠ 0; equivalently,

β_{G_i} = 0 or β_{E_j} = 0 ⟹ β_{G_i E_j} = 0.

Weak Hierarchy: β_{G_i E_j} ≠ 0 ⟹ β_{G_i} ≠ 0 or β_{E_j} ≠ 0; equivalently,

β_{G_i} = 0 and β_{E_j} = 0 ⟹ β_{G_i E_j} = 0.

In G×E interaction analysis, the E factors are usually selected based on prior knowledge and have low dimensionality (q ≪ p). In this situation, there is no need to maintain hierarchical constraints with respect to the environmental effects. The strong hierarchical structure is reduced to

β_{G_i E_j} ≠ 0 ⟹ β_{G_i} ≠ 0; equivalently, β_{G_i} = 0 ⟹ β_{G_i E_j} = 0.

How to impose hierarchical structure

There are several ways to impose a hierarchical structure on regression coefficients. One approach is forward selection [30, 31], a procedure that iteratively considers adding the "best" variable to the model, ensuring that an interaction term is only added if its main effects have already been added to the model in a previous iteration. Since stepwise selection is a greedy algorithm, it tends to underperform compared to global optimization approaches [24]. Another way is to reparameterize the interaction coefficients as β_{G_i E_j} = γ_{ij} β_{G_i} β_{E_j} [32], but this results in a non-convex objective function. Regularization or constrained methods can impose many desired structures on the model, for example group structures or nested hierarchical structures [33]. In the following sections we focus on regularization methods that perform selection jointly in a convex optimization framework.

2.2.1 The gesso model

We assume a single environmental exposure variable (q = 1) and leave the extension to multiple E variables (q > 1) to future directions. Denote the mean squared error loss function for a G×E problem by

$$ g(\beta_0, \beta_G, \beta_E, \beta_{G\times E}) = \frac{1}{2n} \Big\| Y - \Big( \beta_0 + \sum_i \beta_{G_i} G_i + \beta_E E + \sum_i \beta_{G_i E}\, G_i E \Big) \Big\|_2^2. $$

We are interested in the hierarchical selection of the G_i×E interaction terms that are associated with the outcome Y. We propose the following model, which we call gesso (from G(by)E(la)sso):

$$ \underset{\beta_0,\,\beta_G,\,\beta_E,\,\beta_{G\times E}}{\text{minimize}}\ g(\beta_0, \beta_G, \beta_E, \beta_{G\times E}) + \sum_{i=1}^{p} \Big( \lambda_1 \big\| (\beta_{G_i}, \beta_{G_i E}) \big\|_\infty + \lambda_2 |\beta_{G_i E}| \Big). \tag{2.2} $$

The model has several important properties. First, it guarantees the desired hierarchical relationships between the genetic main effects and the interaction effects, since the penalty satisfies the overlapping group hierarchical principle [33]. Having β_{G_j E} in every group where β_{G_j} is present ensures that once β_{G_j E} deviates from zero, the penalty for β_{G_j} becomes close to zero by the properties of group lasso models. In addition, having β_{G_j E} in a group of its own makes it possible for β_{G_j} to deviate from zero when β_{G_j E} is zero. Second, the regularized objective function is convex, so we can take advantage of convex optimization theory and algorithms that ensure convergence to the global optimal solution.
Moreover, the block-separable structure of the problem, with small block sizes, allows for highly efficient solvers. Third, we include two tuning parameters in the model, λ1 and λ2, enabling flexible and decoupled data-dependent control over the group and interaction penalties. Lastly, the group L∞ penalty has a connection to a Lasso model with hierarchical constraints, which we discuss next.

2.2.2 Connection to the hierarchical Lasso

Bien et al. [34] proposed a Lasso model with hierarchical constraints for all pairwise interactions (the G×G selection problem) and demonstrated its advantages over the standard Lasso. The authors showed that the hierarchical Lasso model is equivalent to an unconstrained overlapping group lasso model with the L∞ group norm. Following Bien et al., we show that our proposed model (2.2) is equivalent to the constrained model (2.4). We describe the intuition behind the constrained version of the model (2.4) below and provide a proof of the equivalence of the two models in Appendix B.

Lasso with G×E Hierarchical Constraints

To impose the desired hierarchical structure, the constraints |β_{G_i E}| ≤ |β_{G_i}| can be added to the standard Lasso model. These ensure the main-effect-before-interaction property: β_{G_i} = 0 ⟹ β_{G_i E} = 0. Furthermore, the model makes the implicit (and arguably reasonable) assumption that important interactions have large main effects. When this assumption is met, the model will be more powerful in detecting interactions. Unfortunately, the constraint set above is non-convex and yields the non-convex problem:

$$ \begin{aligned} \underset{\beta_0,\,\beta_G,\,\beta_E,\,\beta_{G\times E}}{\text{minimize}}\quad & g(\beta_0, \beta_G, \beta_E, \beta_{G\times E}) + \lambda_1 \|\beta_G\|_1 + \lambda_2 \|\beta_{G\times E}\|_1 \\ \text{subject to}\quad & |\beta_{G_i E}| \le |\beta_{G_i}|, \quad i = 1, \dots, p. \end{aligned} \tag{2.3} $$

In order to transform the non-convex optimization problem (2.3) into a convex one, we decompose β_G as β_G = β_G⁺ − β_G⁻ and |β_G| as β_G⁺ + β_G⁻, where β_G⁺ ≥ 0 and β_G⁻ ≥ 0. The non-convex constraints |β_{G_i E}| ≤ |β_{G_i}| are then replaced by the convex constraints |β_{G_i E}| ≤ β_{G_i}⁺ + β_{G_i}⁻. Note that |β_G| = β_G⁺ + β_G⁻ only if β_G⁺ β_G⁻ = 0, so removing the latter condition results in a relaxed formulation that is not equivalent to the original non-convex problem. Substituting β_G = β_G⁺ − β_G⁻ and |β_G| by β_G⁺ + β_G⁻ in (2.3), we obtain a convex relaxation of the model:

$$ \begin{aligned} \underset{\beta_0,\,\beta_G^{\pm},\,\beta_E,\,\beta_{G\times E}}{\text{minimize}}\quad & g(\beta_0, \beta_G^+ - \beta_G^-, \beta_E, \beta_{G\times E}) + \lambda_1 \mathbf{1}^T (\beta_G^+ + \beta_G^-) + \lambda_2 \|\beta_{G\times E}\|_1 \\ \text{subject to}\quad & |\beta_{G_i E}| \le \beta_{G_i}^+ + \beta_{G_i}^-, \quad \beta_{G_i}^+ \ge 0, \quad \beta_{G_i}^- \ge 0, \quad i = 1, \dots, p. \end{aligned} \tag{2.4} $$

Now the constraint set is convex and the optimization problem is convex as well. The relaxed constraints (β_{G_i}⁺ + β_{G_i}⁻ ≥ |β_{G_i E}|) are less restrictive than the original |β_{G_i}| ≥ |β_{G_i E}|. In particular, the model can yield a large |β_{G_i E}| estimate and a moderately sized |β_{G_i}| by making both β_{G_i}⁺ and β_{G_i}⁻ large.

Examining the equivalent formulations for gesso (the unconstrained group L∞ model (2.2) and the constrained model (2.4)), we can see that the group L∞ norm corresponds to the constraints on the effect sizes of the interactions and the main effects. Intuitively, because of the connection between the relaxed model (2.4) and model (2.3), (2.4) will be more powerful for detecting interactions when important interactions have large main effects. This is confirmed by our simulations further below. Additionally, the constrained formulation of gesso (2.4) allows for a simpler, interpretable form of the coordinate-wise solutions and the dual problem, and for the development of the screening rules.
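A tiny numeric illustration of why (2.4) is a relaxation of (2.3): by inflating both halves of the split β_G = β_G⁺ − β_G⁻, a coefficient pair can satisfy the relaxed constraint even when the original one fails (the values below are arbitrary):

```r
beta_plus <- 2.0; beta_minus <- 1.5
beta_G    <- beta_plus - beta_minus   # moderate main effect: 0.5
beta_GxE  <- 3.0                      # large interaction effect
c(original = abs(beta_GxE) <= abs(beta_G),              # FALSE under (2.3)
  relaxed  = abs(beta_GxE) <= beta_plus + beta_minus)   # TRUE under (2.4)
```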
2.3 Related Work. Group lasso based models

Zhao et al. (2009) proposed a flexible framework, CAP (the composite absolute penalties family), that allows combining various norms to achieve a target nested or grouped structure on the model coefficients [33]. By letting the groups overlap, CAP penalties can be constructed to maintain hierarchical relationships among selected predictors.

Consider a simple case involving two predictors X_1 and X_2, and suppose we want X_1 to be included in the model before X_2. This hierarchy can be induced by defining the overlapping groups G_1 = {1, 2} and G_2 = {2}, which results in the penalty

$$ \big\| (\beta_1, \beta_2) \big\|_{\gamma_1} + \|\beta_2\|_{\gamma_2}. \tag{2.5} $$

As described in [33], two things are important for the hierarchy to arise from penalty (2.5). First, β_2 has to be in every group β_1 is in. Second, there should be one group in which β_2 is penalized without β_1 being penalized. In the G×E case, having β_{G_j E} in every group where β_{G_j} is present ensures that once β_{G_j E} deviates from zero, the penalty on β_{G_j} becomes close to zero by the properties of group lasso models [36]. In addition, letting β_{G_j E} be in a group of its own makes it possible for β_{G_j} to deviate from zero while β_{G_j E} is kept at zero.

2.3.1 FAMILY

Haris et al. (2015) proposed models (2.6) and (2.7) for the analysis of interactions [35]. Models (2.6) and (2.7) are special cases of the CAP family: for model (2.6), γ_1 = 2 and γ_2 = 1, and for model (2.7), γ_1 = ∞ and γ_2 = 1.

$$ \underset{\beta_0,\,\beta_G,\,\beta_E,\,\beta_{G\times E}}{\text{minimize}}\ g(\beta_0, \beta_G, \beta_E, \beta_{G\times E}) + \lambda_1 \sum_{i=1}^{p} \big\| (\beta_{G_i}, \beta_{G_i E_1}, \dots, \beta_{G_i E_q}) \big\|_2 + \lambda_2 \|\beta_{G\times E}\|_1 \tag{2.6} $$

$$ \underset{\beta_0,\,\beta_G,\,\beta_E,\,\beta_{G\times E}}{\text{minimize}}\ g(\beta_0, \beta_G, \beta_E, \beta_{G\times E}) + \lambda_1 \sum_{i=1}^{p} \big\| (\beta_{G_i}, \beta_{G_i E_1}, \dots, \beta_{G_i E_q}) \big\|_\infty + \lambda_2 \|\beta_{G\times E}\|_1 \tag{2.7} $$

The authors published an R package called FAMILY (available on CRAN) to fit both models via the ADMM algorithm for q > 1 (more than one E variable). We downloaded the source code from GitHub and adapted it to fit the case of one environmental variable.

2.3.2 glinternet

Lim and Hastie (2015) proposed the glinternet model, a model that maintains the strong hierarchy and is designed for a G×G setting [70]. Let β_k be the coefficient of the k-th main effect. The authors decompose β_k into p parameters:

$$ \beta_k = \beta_k^{(0)} + \beta_k^{(1)} + \dots + \beta_k^{(k-1)} + \beta_k^{(k+1)} + \dots + \beta_k^{(p)} $$

(glinternet does not model X_k² terms). Let β_{jk} denote the coefficient for the interaction between variables X_j and X_k. The glinternet model minimizes the following objective function:

$$ \underset{\beta_0,\,\beta_k^{(j)},\,\beta_{kj}}{\text{minimize}}\ \Big\| Y - \beta_0 - \sum_{k=1}^{p} \sum_{j \ne k} \beta_k^{(j)} X_k - \sum_{j<k} \beta_{jk} X_j X_k \Big\|_2^2 + \lambda \Big( \sum_k |\beta_k^{(0)}| + \sum_{j<k} \big\| (\beta_k^{(j)}, \beta_j^{(k)}, \beta_{jk}) \big\|_2 \Big) \tag{2.8} $$

The strong hierarchy is enforced, again, via the group lasso penalties: if β_{jk} is estimated as nonzero, then β_k^{(j)} and β_j^{(k)} will be estimated as nonzero, and hence so will β_k and β_j. The decomposition of the main effects allows the problem to be formulated as a non-overlapping group lasso, which is easier to solve. However, a downside is that the number of parameters to estimate becomes larger. Similarly to the hierNet G×G model (hierarchical Lasso), glinternet has just a single hyperparameter λ. The method can be adapted to the G×E case by decomposing the main effects into β_{G_j} = β_{G_j}^{(0)} + ... + β_{G_j}^{(q)} and β_{E_k} = β_{E_k}^{(0)} + ... + β_{E_k}^{(p)}:

$$ \underset{\beta_0,\,\beta_G,\,\beta_E,\,\beta_{G\times E}}{\text{minimize}}\ \Big\| Y - \beta_0 - \sum_{j=1}^{p} \sum_{k=1}^{q} \beta_{G_j}^{(k)} G_j - \sum_{k=1}^{q} \sum_{j=1}^{p} \beta_{E_k}^{(j)} E_k - \sum_{j,k} \beta_{G_j E_k} G_j E_k \Big\|_2^2 + \lambda \Big( \sum_j |\beta_{G_j}^{(0)}| + \sum_k |\beta_{E_k}^{(0)}| + \sum_{j,k} \big\| (\beta_{G_j}^{(k)}, \beta_{E_k}^{(j)}, \beta_{G_j E_k}) \big\|_2 \Big) \tag{2.9} $$

The glinternet package is available on CRAN and can be applied to the G×E case by specifying the parameter interactionCandidates.
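A sketch of the corresponding call, assuming a continuous outcome `y`, a genotype matrix `G`, and a single exposure `E` placed in the first column of the design; `numLevels` marks each column as continuous (1) or categorical (number of levels):

```r
library(glinternet)

X <- cbind(E, G)               # exposure first, then the p genotype columns
numLevels <- rep(1, ncol(X))   # all columns treated as continuous here
fit <- glinternet.cv(X, y, numLevels,
                     interactionCandidates = 1)  # only pairs involving column 1 (E)
coef(fit)
```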
In the next chapter (Chapter 3) we propose an efficient fitting algorithm for the gesso problem and perform simulations comparing existing implementations of hierarchical G×E selection (glinternet, FAMILY) with ours. In Chapters 3 and 4 we assume a single environmental exposure variable (q = 1) and leave the extension to multiple E variables (q > 1) to future directions.

Chapter 3

Block coordinate descent algorithm for gesso

3.1 Introduction

Friedman et al. (2007), among others, proposed to use a cyclic coordinate descent method for solving convex regularized regression problems with L1 and L2 penalties or their combinations [48, 49, 46, 29]. The main advantage of the algorithm is that each coordinate-descent step has an analytic update formula for the coordinate-wise minimization, making each iteration fast. In addition, screening rules can exploit the sparse structure of the model and eliminate a large number of features beforehand. As a result, the algorithm spends much of its time evaluating inner products only for variables with non-zero coefficients, making coordinate descent much faster than alternative convex optimization algorithms. For a convex block-separable function, convergence of the coordinate descent algorithm to a global minimum is guaranteed [45].

The gesso model has a convex objective function whose non-smooth part is block-separable. Each block contains (β_{G_j}, β_{G_j E}), which we have to solve for and update jointly. Thus, the model can be fitted using the block coordinate descent (BCD) algorithm, and convergence to a global optimal solution is guaranteed. In the first sections of this chapter we describe how to fit the gesso model with the BCD algorithm. In the following sections we develop efficient screening rules.

3.2 Block coordinate descent algorithm

Consider the unconstrained gesso formulation (2.2) with a single exposure E:

$$ \underset{\beta_0,\,\beta_G,\,\beta_E,\,\beta_{G\times E}}{\text{minimize}}\ \frac{1}{2n} \big\| Y - (\mathbf{1}\beta_0 + G\beta_G + E\beta_E + (G \circ E)\beta_{G\times E}) \big\|^2 + \lambda_1 \sum_j \max\big( |\beta_{G_j}|, |\beta_{G_j E}| \big) + \lambda_2 \|\beta_{G\times E}\|_1, $$

where Y ∈ ℝ^n is the outcome of interest, G is an n × p matrix of genotypes, E is a vector of n exposure measurements, and β_0 ∈ ℝ, β_G ∈ ℝ^p, β_E ∈ ℝ, β_{G×E} ∈ ℝ^p; G∘E denotes the matrix of interaction terms whose j-th column is G_j E (the element-wise product). In the unconstrained formulation it is easy to observe that the variables β_{G_j} and β_{G_j E} form an inseparable block and have to be estimated jointly.

Consider the equivalent constrained model (2.4):

$$ \begin{aligned} \text{minimize}\quad & \frac{1}{2n} \big\| Y - (\mathbf{1}\beta_0 + G(\beta_G^+ - \beta_G^-) + E\beta_E + (G \circ E)\beta_{G\times E}) \big\|_2^2 + \lambda_1 \mathbf{1}^T (\beta_G^+ + \beta_G^-) + \lambda_2 \|\beta_{G\times E}\|_1 \\ \text{subject to}\quad & |\beta_{G_j E}| \le \beta_{G_j}^+ + \beta_{G_j}^-, \quad \beta_{G_j}^{\pm} \ge 0, \quad j = 1, \dots, p. \end{aligned} $$

To derive coordinate-wise updates for problem (2.4) we form the Lagrangian and evaluate the KKT conditions. We introduce the dual variables paired with the constraints: ν_j ≥ 0 is paired with |β_{G_j E}| ≤ β_{G_j}⁺ + β_{G_j}⁻, and μ_j^± ≥ 0 are paired with β_{G_j}^± ≥ 0, respectively. A quick introduction to convex optimization methods, including the Lagrangian and the KKT conditions, is presented in Appendix A. We denote the residuals by

$$ r(\beta) = Y - \big( \mathbf{1}\beta_0 + G(\beta_G^+ - \beta_G^-) + E\beta_E + (G \circ E)\beta_{G\times E} \big), $$

where β = (β_0, β_E, β_G, β_{G×E}).
From the KKT conditions (we leave the derivations to Appendix C) we have:

Stationarity:

$$ \beta_{G_j} = \frac{1}{\frac{1}{n}\|G_j\|^2}\, S\Big( \frac{1}{n} G_j^T r^{(-j)},\ \lambda_1 - \nu_j \Big), \quad \text{where } r^{(-j)} = r(\beta) + G_j (\beta_{G_j}^+ - \beta_{G_j}^-), \tag{3.1} $$

$$ \beta_{G_j E} = \frac{1}{\frac{1}{n}\|G_j E\|^2}\, S\Big( \frac{1}{n} (G_j E)^T \tilde{r}^{(-j)},\ \lambda_2 + \nu_j \Big), \quad \text{where } \tilde{r}^{(-j)} = r(\beta) + G_j E\, \beta_{G_j E}, \tag{3.2} $$

$$ \beta_E = \frac{E^T r^{(-E)}}{\|E\|^2}, \quad \text{where } r^{(-E)} = r(\beta) + E \beta_E, \tag{3.3} $$

$$ \beta_0 = \frac{1}{n}\, \mathbf{1}^T r^{(-0)}, \quad \text{where } r^{(-0)} = r(\beta) + \mathbf{1}\beta_0. \tag{3.4} $$

Here S(z, t) = sign(z) max(|z| − t, 0) denotes the soft-thresholding operator.

Feasibility:

$$ |\beta_{G_j E}| \le \beta_{G_j}^+ + \beta_{G_j}^-, \quad \beta_{G_j}^{\pm} \ge 0, \quad \nu_j \ge 0, \quad \mu_j^{\pm} \ge 0. $$

Complementary slackness:

$$ \nu_j \big( |\beta_{G_j E}| - (\beta_{G_j}^+ + \beta_{G_j}^-) \big) = 0, \quad \mu_j^{\pm} \beta_{G_j}^{\pm} = 0, \quad \text{for } j = 1, \dots, p. \tag{3.5} $$

Looking closely at the solutions for β_{G_j} (3.1) and β_{G_j E} (3.2), we can see that if the hierarchical constraint is inactive, meaning |β_{G_j E}| < β_{G_j}⁺ + β_{G_j}⁻, then ν_j = 0 by complementary slackness, and the solutions for β_{G_j} and β_{G_j E} reduce to the solutions under standard Lasso penalization. When the constraint is active, i.e., |β_{G_j E}| = β_{G_j}⁺ + β_{G_j}⁻, then ν_j > 0 acts as a balancing variable that adds more penalization to the interaction coefficient and relaxes the penalization on the main effect.

3.2.1 Block coordinate descent updates

The BCD algorithm estimates the parameters by cycling through the coordinate blocks 1, ..., p to find the minimum of the objective function when all other blocks are fixed at their current estimates. We update the β_E and β_0 coefficients according to the solutions obtained from the stationarity conditions (3.4, 3.3). For the block of β_{G_j}, β_{G_j E}, and ν_j we use solutions (3.1, 3.2, 3.5) to form a system of equations and consider cases to simplify some of the calculations. Denote the residuals without the j-th main effect and interaction terms by

$$ \bar{r}^{(-j)} = r + G_j \beta_{G_j} + G_j E\, \beta_{G_j E}. $$

The system of equations that has to be solved for the variables in the block is given by

$$ \begin{cases} \beta_{G_j} = \dfrac{1}{\frac{1}{n}\|G_j\|^2}\, S\Big( \frac{1}{n} G_j^T \big( \bar{r}^{(-j)} - G_j E\, \beta_{G_j E} \big),\ \lambda_1 - \nu_j \Big) \\[6pt] \beta_{G_j E} = \dfrac{1}{\frac{1}{n}\|G_j E\|^2}\, S\Big( \frac{1}{n} (G_j E)^T \big( \bar{r}^{(-j)} - G_j \beta_{G_j} \big),\ \lambda_2 + \nu_j \Big) \\[6pt] \nu_j \big( |\beta_{G_j E}| - (\beta_{G_j}^+ + \beta_{G_j}^-) \big) = 0 \end{cases} \tag{3.6} $$

We show how to efficiently solve the system of equations (3.6) in Appendix D by breaking it down into a set of simpler cases.
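The updates (3.1)–(3.2) are soft-thresholding steps. Below is a small R sketch of the operator and of the main-effect update in the inactive-constraint case (ν_j = 0); the names are illustrative and are not the package internals:

```r
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

# Update for beta_Gj as in (3.1), given the partial residuals r^{(-j)}
# (residuals with the j-th main-effect contribution added back).
update_beta_Gj <- function(G_j, r_minus_j, n, lambda1, nu_j = 0) {
  z <- sum(G_j * r_minus_j) / n                  # (1/n) * G_j^T r^{(-j)}
  soft_threshold(z, lambda1 - nu_j) / (sum(G_j^2) / n)
}
```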
3.2.2 Primal and Dual formulation for gesso

In this section we derive a dual formulation of the gesso problem; in the next section we explore the duality gap as a convergence criterion for the block coordinate descent algorithm. Consider the following primal formulation of the gesso problem, where we substitute a new variable z for the residuals (a common device for deriving dual formulations):

Primal gesso formulation:

$$ \begin{aligned} \underset{\beta,\,\beta_{G\times E},\,z}{\text{minimize}}\quad & \frac{1}{2n} z^T z + \lambda_1 \mathbf{1}^T (\beta_G^+ + \beta_G^-) + \lambda_2 \|\beta_{G\times E}\|_1 \\ \text{subject to}\quad & |\beta_{G\times E}| \preccurlyeq \beta_G^+ + \beta_G^-, \quad \beta_G^{\pm} \succcurlyeq 0, \\ & z = Y - \big( \mathbf{1}\beta_0 + G(\beta_G^+ - \beta_G^-) + E\beta_E + (G \circ E)\beta_{G\times E} \big), \end{aligned} \tag{3.7} $$

where ≽ denotes element-wise comparison (x ≽ 0 ⟺ x_j ≥ 0 for j = 1, ..., p). To formulate the dual problem we introduce the dual variables ν ∈ ℝ^p, ν ≽ 0, paired with the condition |β_{G×E}| ≼ β_G⁺ + β_G⁻, and θ ∈ ℝ^n, paired with the condition defining z. For notational simplicity we denote the linear predictor 1β_0 + G(β_G⁺ − β_G⁻) + Eβ_E + (G∘E)β_{G×E} by Xβ. We leave the full derivation of the dual formulation to Appendix E and present the dual problem below.

Dual gesso formulation:

$$ \begin{aligned} \underset{\theta,\,\nu}{\text{maximize}}\quad & \frac{n}{2} \left( \Big\| \frac{Y}{n} \Big\|^2 - \Big\| \frac{Y}{n} - \theta \Big\|^2 \right) \\ \text{subject to}\quad & |\theta^T (G \circ E)| \preccurlyeq \lambda_2 + \nu, \quad |\theta^T G| \preccurlyeq \lambda_1 - \nu, \quad \nu \in [0, \lambda_1]. \end{aligned} $$

We denote the dual feasible region by

$$ \mathcal{D}_F := \big\{ (\theta, \nu) : |\theta^T (G \circ E)| \preccurlyeq \lambda_2 + \nu,\ |\theta^T G| \preccurlyeq \lambda_1 - \nu,\ \nu \in [0, \lambda_1] \big\}. $$

Note that the optimal solution of the dual problem, θ̂, can be viewed as a projection of Y/n onto the dual feasible set D_F:

$$ \hat{\theta} = \underset{\theta \in \mathcal{D}_F}{\arg\max}\ \frac{n}{2} \left( \Big\|\frac{Y}{n}\Big\|^2 - \Big\|\frac{Y}{n} - \theta\Big\|^2 \right) = \underset{\theta \in \mathcal{D}_F}{\arg\min}\ \Big\| \frac{Y}{n} - \theta \Big\|^2 \overset{\text{def}}{=} \mathrm{Proj}_{\mathcal{D}_F}\Big( \frac{Y}{n} \Big). $$

From the stationarity conditions (Appendix E: E.1) we establish the following important relationship:

$$ \hat{\theta} = \frac{\hat{z}}{n} = \frac{Y - X\hat{\beta}}{n}. \tag{3.8} $$

The optimal dual variable θ̂ is equal to the residuals scaled by the number of observations. Equation (3.8) defines the link between the primal (β̂) and the dual (θ̂) optimal solutions. The stationarity and complementary slackness conditions for β_{G_i} and β_{G_i E} (Appendix E: E.2, E.3) lead to the following important consequences, which hold for the primal and dual optimal variables:

$$ |\hat{\theta}^T (G_i E)| < \lambda_2 + \hat{\nu}_i \implies \hat{\beta}_{G_i E} = 0, \quad \text{for all } i, \tag{3.9} $$

$$ |\hat{\theta}^T G_i| < \lambda_1 - \hat{\nu}_i \implies \hat{\beta}_{G_i} = 0, \quad \text{for all } i. \tag{3.10} $$

Conditions (3.9) and (3.10) form the basis for the construction of the screening rules. We exploit these conditions to identify null features and avoid spending time on fitting them. This leads to a substantial computational speed-up, especially for high-dimensional sparse problems.

3.2.3 Convergence criterion

For any iterative algorithm, including coordinate descent, the convergence (stopping) criterion plays an important role. In this section we propose to use the duality gap as a stopping criterion for our algorithm.

Denote the primal objective function for the gesso model by P(β):

$$ P(\beta) = \frac{1}{2n} \|Y - X\beta\|^2 + \lambda_1 \sum_{i=1}^{p} \max\big( |\beta_{G_i}|, |\beta_{G_i E}| \big) + \lambda_2 \|\beta_{G\times E}\|_1 $$

and the dual objective by D(θ):

$$ D(\theta) = \frac{n}{2} \left( \Big\|\frac{Y}{n}\Big\|_2^2 - \Big\|\frac{Y}{n} - \theta\Big\|_2^2 \right). $$

The duality gap, denoted Gap(β, θ), is the difference between the primal and dual objectives:

$$ \mathrm{Gap}(\beta, \theta) = P(\beta) - D(\theta). \tag{3.11} $$

For the optimal solutions we have P(β) ≥ P(β̂) and D(θ) ≤ D(θ̂). By strong duality [41],

$$ D(\theta) \le D(\hat{\theta}) = P(\hat{\beta}) \le P(\beta) \implies \mathrm{Gap}(\beta, \theta) \ge 0. $$

Also, Gap(β, θ) ≤ ε implies P(β) − P(β̂) ≤ ε, since

$$ P(\beta) - P(\hat{\beta}) \le P(\beta) - D(\theta) = \mathrm{Gap}(\beta, \theta). \tag{3.12} $$

The duality gap therefore provides an upper bound for the suboptimality gap P(β) − P(β̂). Given a tolerance ε > 0, if at iteration t of the algorithm we can construct β^(t) and a dual feasible θ^(t) ∈ D_F such that Gap(β^(t), θ^(t)) ≤ ε, then β^(t) is guaranteed to be an ε-optimal solution of the primal problem.

Note that in order to calculate the duality gap at iteration t we need a dual feasible point θ^(t) ∈ D_F. In the next sections we show how to obtain θ^(t) from the current estimate of the primal variables β^(t) in the context of the SAFE screening rules. Algorithm 1 below presents the block coordinate descent algorithm, and Algorithm 2 provides the stopping criterion we propose to use.

Algorithm 1: cyclic coordinate descent(set of indices I)
  for i_inner = 1...max_iter do
    check_the_duality_gap()  // convergence criterion based on the duality gap
    for j ∈ I do
      (β_Gj, β_GjE) = coordinate_descent_update()  // according to section 3.2.1

Algorithm 2: check_the_duality_gap()
  θ_0 = compute_dual_feasible_point()
  Gap(β, θ_0) = P(β) − D(θ_0)  // equation (3.11)
  if Gap(β, θ_0) < tol then
    return β, exit  // convergence

The other common stopping criterion, based on a heuristic, is defined via the maximal absolute change in the estimates between two consecutive iterations t − 1 and t (3.13); it is implemented in the glmnet package, for example. The idea is that if the estimated coefficients do not change much from iteration to iteration, we have probably already reached the optimal solution:

$$ \max_{i \in \{1, \dots, p\}} \big| \beta_i^{(t)} - \beta_i^{(t-1)} \big|_2^2\, \|X_i\|_2^2 < \varepsilon. \tag{3.13} $$

However, contrary to the duality gap stopping criterion, such a heuristic rule does not offer control over suboptimality, and it is generally not clear what value of ε is small enough [44].
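A direct R transcription of the stopping rule (3.11)–(3.12), assuming a dual feasible `theta` is available (e.g., from the naive projection of the residuals discussed below); all names are illustrative:

```r
duality_gap <- function(y, fitted, beta_G, beta_GxE, lambda1, lambda2, theta) {
  n <- length(y)
  primal <- sum((y - fitted)^2) / (2 * n) +
    lambda1 * sum(pmax(abs(beta_G), abs(beta_GxE))) +  # group max penalty
    lambda2 * sum(abs(beta_GxE))
  dual <- (n / 2) * (sum((y / n)^2) - sum((y / n - theta)^2))
  primal - dual   # beta is epsilon-optimal once this drops below epsilon
}
```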
3.3 Screening rules

3.3.1 SAFE rules for gesso

El Ghaoui et al. (2010) proposed the SAFE screening rules for the Lasso model, rules that guarantee a coefficient will be zero in the solution vector [39]. In this section we derive SAFE rules to screen predictors for the gesso model. By the KKT conditions (3.10) and (3.9),

$$ \begin{cases} |\hat{\theta}^T G_i| < \lambda_1 - \nu_i \\ |\hat{\theta}^T (G_i E)| < \lambda_2 + \nu_i \\ \nu_i \in [0, \lambda_1] \end{cases} \implies \hat{\beta}_{G_i} = \hat{\beta}_{G_i E} = 0 \tag{3.14} $$

for any primal optimal variables β̂_{G_i} and β̂_{G_i E}, dual optimal variable θ̂, and dual feasible variable ν_i (for a fixed optimal θ̂, any feasible ν_i would also be optimal, since the dual objective function does not depend on ν). The problem is that we do not know the dual optimal variable θ̂. The idea is to create upper bounds for |θ̂^T G_i| and |θ̂^T (G_i E)| that are easy to compute.

In particular, consider a dual feasible variable θ_0 and denote D(θ_0) by γ. Recall that D(θ) is the dual objective function

$$ D(\theta) = \frac{n}{2} \left( \Big\|\frac{Y}{n}\Big\|_2^2 - \Big\|\frac{Y}{n} - \theta\Big\|_2^2 \right). $$

Let Θ_γ = {θ : D(θ) ≥ γ}. Since θ̂ = argmax_{D_F} D(θ), we have D(θ̂) ≥ D(θ_0) = γ and, hence, θ̂ ∈ Θ_γ. Then

$$ \max_{\theta \in \Theta_\gamma} |\theta^T G_i| < \lambda_1 - \nu_i \implies |\hat{\theta}^T G_i| < \lambda_1 - \nu_i. $$

The same is true for the interaction terms:

$$ \max_{\theta \in \Theta_\gamma} |\theta^T (G_i E)| < \lambda_2 + \nu_i \implies |\hat{\theta}^T (G_i E)| < \lambda_2 + \nu_i, $$

where max_{θ ∈ Θ_γ} |θ^T G_i| and max_{θ ∈ Θ_γ} |θ^T (G_i E)| are the desired upper bounds for |θ̂^T G_i| and |θ̂^T (G_i E)| respectively. The above arguments lead to the following rules:

$$ \max_{\theta \in \Theta_\gamma} |\theta^T G_i| < \lambda_1 - \nu_i \implies \hat{\beta}_{G_i} = 0, \tag{3.15} $$

$$ \max_{\theta \in \Theta_\gamma} |\theta^T (G_i E)| < \lambda_2 + \nu_i \implies \hat{\beta}_{G_i E} = 0. \tag{3.16} $$

Note that the region Θ_γ = {θ : D(θ) ≥ γ} is equivalent to

$$ \frac{n}{2} \left( \Big\|\frac{Y}{n}\Big\|_2^2 - \Big\|\frac{Y}{n} - \theta\Big\|_2^2 \right) \ge \gamma \iff \Big\|\frac{Y}{n} - \theta\Big\|_2^2 \le \Big\|\frac{Y}{n}\Big\|_2^2 - \frac{2\gamma}{n}, $$

which is the equation of a filled sphere with radius r, r² = ‖Y/n‖²₂ − 2γ/n, and center Y/n:

$$ r = \sqrt{ \Big\|\frac{Y}{n}\Big\|_2^2 - \frac{2\gamma}{n} } = \sqrt{ \Big\|\frac{Y}{n}\Big\|_2^2 - \frac{2}{n} \cdot \frac{n}{2} \left( \Big\|\frac{Y}{n}\Big\|_2^2 - \Big\|\frac{Y}{n} - \theta_0\Big\|_2^2 \right) } = \Big\| \frac{Y}{n} - \theta_0 \Big\|. $$

We denote this sphere by B(c, r), where c = Y/n and r = ‖Y/n − θ_0‖. In consequence, the upper bounds we constructed are equivalent to the following optimization problems:

$$ \max_{\theta \in B(c,r)} |\theta^T G_i| \quad \text{and} \quad \max_{\theta \in B(c,r)} |\theta^T (G_i E)|, $$

and the region B(c, r) has two important properties: 1) it contains the optimal dual solution θ̂, and 2) it yields a closed-form solution for the desired upper bounds (Appendix F). By Appendix F (F.3), for a general X ∈ ℝ^n the solution is

$$ \max_{\theta \in B(c,r)} |\theta^T X| = r\|X\| + |X^T c|. $$

Leaving the derivations to Appendix G, the SAFE rules to discard (β_{G_i}, β_{G_i E}) are:

$$ \max\big( 0,\ r\|G_i E\| + |(G_i E)^T c| - \lambda_2 \big) < \lambda_1 - r\|G_i\| - |G_i^T c| \implies \hat{\beta}_{G_i} = \hat{\beta}_{G_i E} = 0. \tag{3.17} $$

To complete the construction of the SAFE rules, we need to find a dual feasible point θ_0 ∈ D_F to calculate r = r(θ_0). We can naturally obtain a dual point from the current estimate β via the primal-dual link in the stationarity conditions (3.8). Denote θ_res(β) = (Y − Xβ)/n, where "res" stands for residuals. In general, θ_res(β) will not necessarily be feasible; thus, we have to re-scale it to bring it into the feasible region D_F. For example, for ν = 0 we can consider the re-scaling factor

$$ x = \min_i \left( \frac{\lambda_1}{|\theta_{res}(\beta)^T G_i|},\ \frac{\lambda_2}{|\theta_{res}(\beta)^T (G_i E)|} \right) $$

so that θ_0 = x θ_res(β) is feasible. We call the value x θ_res(β) the naive projection θ_0. In the next section we show how to re-scale the dual point to ensure feasibility more optimally.
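The rule (3.17) is cheap to evaluate once a sphere B(c, r) is available. A sketch for a single pair (β_Gi, β_GiE), where `c` is the sphere center (here Y/n) and `r` its radius; the function name is hypothetical:

```r
safe_discard_pair <- function(G_i, GxE_i, c, r, lambda1, lambda2) {
  ub_int  <- r * sqrt(sum(GxE_i^2)) + abs(sum(GxE_i * c))  # bound on |theta^T (G_i E)|
  ub_main <- r * sqrt(sum(G_i^2))   + abs(sum(G_i * c))    # bound on |theta^T G_i|
  max(0, ub_int - lambda2) < lambda1 - ub_main   # TRUE => discard both coefficients
}
```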
3.3.2 Optimal naive projection

In the previous re-scaling example we set the dual variable ν equal to 0 for simplicity when constructing the dual feasible point θ_0. However, we can find a point θ_0 closer to the optimum by choosing ν to our advantage. The idea is that the better we choose our feasible θ_0 and ν (meaning closer to the optimal solution), the tighter the upper bounds for |θ̂^T G_i| and |θ̂^T (G_i E)| on the set Θ_γ = {θ : D(θ) ≥ D(θ_0)} are going to be.

More formally, we want to find a scalar x such that θ_0 = x θ_res(β) is feasible and maximizes D(θ_0). We can explicitly formulate and solve the arising optimization problem with respect to x and ν:

$$ \begin{aligned} \underset{x,\,\nu}{\text{maximize}}\quad & D(\theta_0) \\ \text{subject to}\quad & |\theta_0^T (G \circ E)| \preccurlyeq \lambda_2 + \nu, \quad |\theta_0^T G| \preccurlyeq \lambda_1 - \nu, \quad \nu \succcurlyeq 0, \\ \text{where}\quad & \theta_0 = x\, \theta_{res}(\beta), \quad D(\theta) = \frac{n}{2} \left( \Big\|\frac{Y}{n}\Big\|^2 - \Big\|\frac{Y}{n} - \theta\Big\|^2 \right). \end{aligned} \tag{3.18} $$

We present the closed-form solution to the optimization problem (3.18) in Appendix H. We use the solution to this problem to naively project the dual point θ_res(β) and obtain the dual feasible variable θ_0 (θ_0 = x θ_res(β)) that we require for the SAFE rules (3.17).

3.3.3 Path-wise SAFE rules for gesso

In practice, the optimal values of the tuning parameters λ1 and λ2 in the gesso problem (2.4) are not known, and the optimal solution is computed over a grid of lambda values before the best tuning parameters are selected, for example by cross-validation. The lambda values are commonly selected from a logarithmic grid of 30 to 100 consecutive values. In such a sequential context, warm starts are a standard approach, and path-wise SAFE rules are designed to take advantage of the warm start strategy.

Warm start refers to using the previously computed solution β̂(Λ^(k−1)) with respect to a grid of tuning parameters to initialize the parameter values for the coordinate descent algorithm at Λ^(k) on the next step. The idea is that for close consecutive penalty values the solutions should be close as well, and initializing at the previous solution should lower the number of iterations the coordinate descent algorithm needs to converge. For the two-dimensional grid Λ^(k) = (λ1, λ2)^(k) we start evaluating from the largest grid values and finish at the smallest (Figure 3.1; a sketch of this traversal follows below).

Figure 3.1: Warm start strategy for the two-dimensional grid.

Equivalently, the warm start strategy is beneficial for the SAFE rules. In particular, since θ_res(β) = θ̂(Λ^(k−1)) (initialization at the previous solution), the point θ_0 that we get after re-scaling should be close to the dual optimum at the current step, θ̂(Λ^(k)).
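A sketch of the grid traversal in Figure 3.1, with λ1 decreasing monotonically while λ2 sweeps back and forth, so that consecutive grid points stay close and warm starts remain effective (the grids below are illustrative):

```r
grid1 <- 10^seq(0, -2, length.out = 30)  # lambda_1 values, largest first
grid2 <- 10^seq(0, -2, length.out = 30)  # lambda_2 values

path <- list(); k <- 1
for (i in seq_along(grid1)) {
  js <- if (i %% 2 == 1) seq_along(grid2) else rev(seq_along(grid2))  # snake order
  for (j in js) {
    path[[k]] <- c(lambda1 = grid1[i], lambda2 = grid2[j])
    k <- k + 1
  }
}
```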
3.3.4 Dynamic SAFE

The idea behind dynamic screening [40] is to iteratively improve the dual feasible point θ_0 during the coordinate descent updates. The residuals Y − Xβ are updated at each iteration i of the coordinate descent algorithm, and since coordinate descent is a monotone algorithm we have

$$ \beta^{(i)} \xrightarrow{i} \hat{\beta} \quad \text{and} \quad \theta_{res}^{(i)} = \frac{Y - X\beta^{(i)}}{n} \xrightarrow{i} \frac{Y - X\hat{\beta}}{n} = \hat{\theta} \implies \theta_{res}^{(i)} \xrightarrow{i} \hat{\theta}, \quad \theta_0^{(i)} \xrightarrow{i} \hat{\theta}. \tag{3.19} $$

Figure 3.2: Dynamic SAFE regions.

Recall that our SAFE rules depend on θ_0 through the radius of the sphere region on which we base the upper bounds: B(c, r(θ_0^(i))) = B(Y/n, ‖Y/n − θ_0^(i)‖). Also recall that θ̂ = Proj_{D_F}(Y/n) = argmin_{θ ∈ D_F} ‖Y/n − θ‖²; that is, θ̂ is the closest feasible point to the center Y/n. Hence, having θ_0 closer to the optimal θ̂ reduces the radius of the sphere and ensures tighter upper bounds max_{θ ∈ B(c,r)} |θ^T G_i| and max_{θ ∈ B(c,r)} |θ^T (G_i E)| for |θ̂^T G_i| and |θ̂^T (G_i E)|. This, in turn, yields better SAFE rules that are able to discard more variables.

To reiterate, by letting the screening be done not only at the beginning of the algorithm at each new grid point Λ^(k) = (λ1, λ2)^(k), but along the coordinate descent iterations i as well, we iteratively improve the estimate of θ̂ with θ_0^(i) and consequently keep improving our SAFE rules. The proposed procedure is the following: at each iteration i of the algorithm, use the current residuals Y − Xβ^(i) to obtain the current estimate of the dual variable θ_res^(i) = (Y − Xβ^(i))/n, naively project it onto the feasible set D_F, and apply the rules (3.17).

Figure 3.2 illustrates the iterative process of constructing the SAFE sphere regions. There are evident disadvantages to our current choice of the center c = Y/n and the radius r = ‖Y/n − θ_0^(i)‖ of the sphere region. Even when the dual feasible variable θ_0^(i) converges to the optimal value θ̂, the region B(Y/n, ‖Y/n − θ̂‖) does not close in around the dual optimal variable, since the radius does not converge to zero and the center is static, indicating that the upper bounds max_{θ ∈ B(Y/n, ‖Y/n − θ̂‖)} |θ^T G_i| and max_{θ ∈ B(Y/n, ‖Y/n − θ̂‖)} |θ^T (G_i E)| remain loose. The Gap SAFE rules proposed by Fercoq et al. (2015) [41] elegantly resolve these disadvantages. We present the ideas behind the Gap SAFE rules and their application to the gesso model in the next section.

3.3.5 Gap SAFE

Recall that we denoted the primal objective function for the gesso model by P(β), the dual objective by D(θ), and the duality gap by Gap(β, θ), where

$$ D(\theta) = \frac{n}{2} \left( \Big\|\frac{Y}{n}\Big\|_2^2 - \Big\|\frac{Y}{n} - \theta\Big\|_2^2 \right). $$

D(θ) is a quadratic, strongly concave function, so

$$ D(\theta) \le D(\hat{\theta}) + \langle \nabla D(\hat{\theta}), \theta - \hat{\theta} \rangle - \frac{n}{2} \|\theta - \hat{\theta}\|^2, $$

and hence

$$ \frac{n}{2} \|\theta - \hat{\theta}\|^2 \le D(\hat{\theta}) - D(\theta) + \langle \nabla D(\hat{\theta}), \theta - \hat{\theta} \rangle \le P(\beta) - D(\theta) = \mathrm{Gap}(\beta, \theta), $$

where the inner-product term is non-positive by the first-order optimality of θ̂ over the convex set D_F. This gives

$$ \|\theta - \hat{\theta}\| \le \sqrt{ \frac{2}{n} \mathrm{Gap}(\beta, \theta) }. \tag{3.20} $$

Figure 3.3: Dynamic GAP SAFE regions.

The Gap SAFE rules work with the sphere region with center c = θ_0^(i) and radius r = √((2/n) Gap(β^(i), θ_0^(i))), which we denote by B_Gap(c, r). This is a valid region, since θ̂ ∈ B_Gap(c, r) by (3.20). An important consequence of this construction of the sphere region is that as β^(i) → β̂, we get θ_0^(i) → θ̂ via the primal-dual link (3.8, 3.19), and Gap(β^(i), θ_0^(i)) → 0. Figure 3.3 illustrates that the dynamic approach discussed in the previous section, in combination with the Gap SAFE rules, very naturally keeps improving the upper bounds max_{θ ∈ B_Gap(c,r)} |θ^T G_i| and max_{θ ∈ B_Gap(c,r)} |θ^T (G_i E)|, since the radius of the new Gap sphere region converges to zero while its center converges to the dual optimal point.

The Gap SAFE rules follow from substituting r = √((2/n) Gap(β^(i), θ_0^(i))) and c = θ_0^(i) in (3.17):

$$ \max\Big( 0,\ \sqrt{\tfrac{2}{n} \mathrm{Gap}(\beta^{(i)}, \theta_0^{(i)})}\, \|G_i E\| + |(G_i E)^T \theta_0^{(i)}| - \lambda_2 \Big) < \lambda_1 - \sqrt{\tfrac{2}{n} \mathrm{Gap}(\beta^{(i)}, \theta_0^{(i)})}\, \|G_i\| - |G_i^T \theta_0^{(i)}| \implies \hat{\beta}_{G_i} = \hat{\beta}_{G_i E} = 0. \tag{3.21} $$
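In code, switching from the static sphere to the Gap SAFE sphere only changes the center and radius fed to the same discard test; a sketch reusing the hypothetical `safe_discard_pair` and `duality_gap` functions from the earlier snippets:

```r
gap_safe_radius <- function(gap, n) sqrt(2 * gap / n)

# Illustrative use: theta0 is the current dual feasible point and `gap` the
# current duality gap; the Gap SAFE test (3.21) is then
#   safe_discard_pair(G_i, GxE_i, c = theta0,
#                     r = gap_safe_radius(gap, n), lambda1, lambda2)
```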
3.3.6 Working set strategy

Massias et al. (2017) proposed to use a working set approach with the Gap SAFE rules to achieve a substantial speed-up over state-of-the-art Lasso solvers, including glmnet [43, 44]. The working set strategy involves two nested iteration loops: in the outer loop, a set of features W_t ⊆ {1, ..., p} is defined, called the working set (WS); in the inner loop, the coordinate descent algorithm is launched to solve the problem restricted to X_{W_t} (considering only the features in W_t).

We adapt the Gap SAFE rules we developed for the gesso model to incorporate the proposed working set strategy. Recall the SAFE rules we constructed in (3.17):

$$ \max\big( 0,\ r\|G_i E\| + |(G_i E)^T c| - \lambda_2 \big) < \lambda_1 - r\|G_i\| - |G_i^T c| \implies \hat{\beta}_{G_i} = \hat{\beta}_{G_i E} = 0. $$

By simply rearranging the terms in the inequality we get:

$$ \lambda_1 - r\|G_i\| - |G_i^T c| - \max\big( 0,\ r\|G_i E\| + |(G_i E)^T c| - \lambda_2 \big) > 0 $$

$$ \lambda_1 - |G_i^T c| - \max\big( 0,\ r\|G_i E\| + |(G_i E)^T c| - \lambda_2 \big) > r\|G_i\| $$

$$ \lambda_1 - |G_i^T c| + \min\big( r\|G_i E\|,\ \lambda_2 - |(G_i E)^T c| \big) > r\|G_i\| + r\|G_i E\| $$

$$ \frac{ \lambda_1 - |G_i^T c| + \min\big( r\|G_i E\|,\ \lambda_2 - |(G_i E)^T c| \big) }{ \|G_i\| + \|G_i E\| } > r $$

Define the left-hand side of the above inequality as d_i:

$$ d_i := \frac{ \lambda_1 - |G_i^T c| + \min\big( r\|G_i E\|,\ \lambda_2 - |(G_i E)^T c| \big) }{ \|G_i\| + \|G_i E\| }. \tag{3.22} $$

Note that our SAFE rules (3.17) are equivalent to d_i > r. The idea is that d_i now represents a score of how likely the predictor is to be zero or non-zero based on the Gap SAFE rules. Predictors for which d_i is small are more likely to be non-zero; conversely, predictors with larger values of d_i are more likely to be zero, up to the point where d_i > r and the corresponding predictor is exactly zero by the SAFE rules. Here by predictor we mean the pair (β_{G_i}, β_{G_i E}).

The proposed procedure is the following: calculate (next section) the initial number of variables to be assigned to the working set (working_set_size). Calculate d_i and define the working set as the indices of the smallest d_i values, up to working_set_size. Fit the coordinate descent algorithm on the variables from the working set only, and check whether we have achieved an optimal solution via the duality gap over all of the variables. If not, increase the size of the working set, recalculate d_i according to the new estimates obtained by fitting on the previous working set, and repeat the procedure. We double the size of the working set each time.

Warm up operation: the warm up operation is the first operation of the working set algorithm. It initializes the working set as the set of non-zero main effects and interactions estimated at the previous step of the Λ^(k) = (λ1, λ2)^(k) grid and assigns working_set_size the length of the non-zero vector:

working_set = {j : (β_Gj ≠ 0) or (β_GjE ≠ 0)}, working_set_size = length(working_set).

Algorithm 3 below describes the main steps of the working set strategy. To recapitulate, the main advantage of the algorithm is that we select variables that are likely to be non-zero and leave likely zeros out, so that we do not have to spend unnecessary fitting time on them. This approach can be thought of as an acceleration of the SAFE rules.

Algorithm 3: coordinate descent with working sets()
  for i_outer = 1...max_iter do
    check_the_duality_gap()
    d = compute_d_i()  // equation (3.22)
    if i_outer == 1 then
      working_set = {j : (β_Gj ≠ 0) or (β_GjE ≠ 0)}  // warm up operation
      if length(working_set) == 0 then
        working_set = {j : smallest d[j] up to working_set_init_size}
      working_set_size = length(working_set)
    else
      d[working_set] = -Inf  // to make sure the WS increases monotonically
      working_set_size = min(2 * working_set_size, p)  // doubling the size of the WS at each iteration
      working_set = {j : smallest d[j] up to working_set_size}  // scoring the coefficients by the likelihood of being non-zero, section 3.3.6
    cyclic_coordinate_descent(working_set)  // Algorithm 1 or Algorithm 4 with I = working_set
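A vectorized sketch of the score (3.22) and of the working set selection; `G` and `GxE` are the n × p design blocks, `c` and `r` describe the current Gap SAFE sphere, and all names (including `working_set_size`) are illustrative:

```r
ws_scores <- function(G, GxE, c, r, lambda1, lambda2) {
  num <- lambda1 - abs(drop(crossprod(G, c))) +
    pmin(r * sqrt(colSums(GxE^2)), lambda2 - abs(drop(crossprod(GxE, c))))
  num / (sqrt(colSums(G^2)) + sqrt(colSums(GxE^2)))  # d_i > r reproduces (3.17)
}

d <- ws_scores(G, GxE, c, r, lambda1, lambda2)
working_set <- order(d)[seq_len(working_set_size)]   # smallest scores first
```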
3.3.7 Active set and adaptive max difference strategies

Algorithm 1 presented the basic steps of the coordinate descent algorithm. We now revisit Algorithm 2 and specify that the dual feasible point required to determine the duality gap is calculated according to the naive projection method we proposed in section 3.3.2. Computing the naive projection is expensive, since it requires a matrix-by-vector product with a vector of length $n$ (the sample size) and a matrix of size $n \times$ length(working_set) to calculate $A_j$ and $B_j$ for each $j$ (Appendix H). The idea of the active set and adaptive max difference strategies is to reduce the number of times we have to call the check_the_duality_gap() function to ensure convergence.

Algorithm 1: cyclic_coordinate_descent(set of indices I)
  for i_inner = 1...max_iter do
    check_the_duality_gap( )  // convergence criterion based on the duality gap, expensive
    for j in I do
      (beta_Gj, beta_GjE) = coordinate_descent_update( )  // according to section 3.2.1

Algorithm 2: check_the_duality_gap( )
  theta_0 = compute_naive_projection( )  // (Appendix H), expensive
  Gap(beta, theta_0) = P(beta) - D(theta_0)  // equation (3.11)
  if Gap(beta, theta_0) < tol then
    return beta, exit  // convergence

Adaptive max difference strategy: recall that we discussed an alternative convergence criterion (we call it the max difference criterion) based on the maximum scaled difference between consecutive estimates (3.13):

$\max_{i \in \{1,\dots,p\}} |\beta_i^{(t)} - \beta_i^{(t-1)}|^2 \, \|X_i\|_2^2 < \epsilon.$

The drawback of this heuristic criterion is that it does not offer control over the suboptimality $P(\beta^{(t)}) - P(\hat\beta)$; however, it is fast to compute, since $\|X_i\|_2^2$ is pre-computed or normalized to one. To reduce the number of times we check convergence via the duality gap criterion in Algorithm 1, we propose to use criterion (3.13) as a proxy convergence criterion (Algorithm 4). The procedure is the following: we initialize the tolerance for the max difference criterion to the tolerance we set for the duality gap convergence. We proceed by running the coordinate descent algorithm until we meet the max difference convergence criterion and then check the duality gap for the final convergence. If the duality gap criterion is not met, we decrease the max difference tolerance by a factor of 10 and proceed again. As a result, instead of checking the duality gap after each cycle of the coordinate descent algorithm, we wait until the proxy convergence criterion (which is very cheap to check) is met and only then check the proper criterion. By adaptively reducing the tolerance value of the proxy criterion, we allow more coordinate descent cycles when they are needed, while carefully controlling the adaptive convergence to make as few duality gap checks as possible.

Active set strategy: the active set is a heuristic proposed by the authors of the glmnet package. The active set procedure collects the variables that were updated during the first coordinate descent cycle and proceeds by running the coordinate descent algorithm only on those variables. The idea is similar to the working set strategy: reduce the set of variables to fit to the ones that are most likely to be non-zero. The active set strategy combines naturally with our proposed adaptive max difference procedure, since we already calculate the differences between consecutive estimates for the proxy convergence criterion. We believe that the max difference strategy we propose, in combination with active sets, could accelerate state-of-the-art methods for fitting the Lasso as well. We summarize the main steps of both strategies in Algorithm 4 below, which is an optimized version of Algorithm 1.
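A minimal R sketch of the proxy criterion (3.13), assuming the coefficient vectors from two consecutive cycles and the precomputed squared column norms are available; the names are illustrative, not the package's internals:

max_diff_converged = function(beta_new, beta_old, col_norms2, tol) {
  max((beta_new - beta_old)^2 * col_norms2) < tol
}

# Example: an update cycle that has essentially stabilized
beta_old = c(0.50, -1.20, 0.00)
beta_new = c(0.50, -1.2000001, 0.00)
max_diff_converged(beta_new, beta_old, col_norms2 = c(1, 1, 1), tol = 1e-4)
## [1] TRUE   -- only now would the expensive duality gap check be run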
Algorithm 4: cyclic_coordinate_descent_optimized(set of indices I)
  for i_inner = 1...max_iter do
    check_the_duality_gap( )
    if i_inner == 1 then
      max_diff_tol = tol  // initialize the heuristic convergence criterion
    else
      max_diff_tol = max_diff_tol / 10  // adaptively refine the criterion
    while t < max_iter do
      max_diff = 0
      for j in I do
        (beta_Gj, beta_GjE) = coordinate_descent_update( )
        current_diff = max( |beta_Gj^(t) - beta_Gj^(t-1)|^2 ||G_j||^2, |beta_GjE^(t) - beta_GjE^(t-1)|^2 ||G_j E||^2 )  // convergence heuristic based on the max difference between the coefficient updates
        max_diff = max(current_diff, max_diff)
        if current_diff > 0 then
          active_set[j] = TRUE  // active set = coefficients that got updated
      if max_diff < max_diff_tol then break
    while t < max_iter do
      max_diff = 0
      for j in active_set do  // active set iteration
        (beta_Gj, beta_GjE) = coordinate_descent_update( )
        current_diff = max( |beta_Gj^(t) - beta_Gj^(t-1)|^2 ||G_j||^2, |beta_GjE^(t) - beta_GjE^(t-1)|^2 ||G_j E||^2 )
        max_diff = max(current_diff, max_diff)
      if max_diff < max_diff_tol then break

3.3.8 Experiments

Runtime experiment: we conducted an experiment to compare the runtime of the basic coordinate descent algorithm (Algorithm 1) and the various accelerating strategies described in the sections above. To recap, we adapted the working set strategy to the gesso problem (Algorithm 3), we proposed the max difference strategy, and we added the active set strategy in combination with the max difference approach (Algorithm 4). The dataset used for the experiment had n = 200 and p = 10,000 (20,000 predictors in total: p main effects plus p interaction terms). We ran each algorithm 100 times and report the mean runtime in seconds in Figure 3.4.

Figure 3.4: Runtime comparison with added strategies. We denote Algorithm 1 as cd; maxdiff: added max difference strategy; as: added active set strategy; ws: added working set strategy.

Working set experiment: in this section we evaluate the screening ability of our SAFE rules in combination with the working set (WS) acceleration. We simulated a dataset with n = 1,000 and p = 100,000, among them 15 non-zero main effects and 10 non-zero interaction terms. Figure 3.5 plots the percentage of non-zero variables (relative to the total number of predictors p) as estimated by the BCD (in blue) and the size of the WS (in red) across the path of the two tuning parameters. The parameters are ordered as shown in Figure 3.1: $\lambda_1$ monotonically decreases and $\lambda_2$ oscillates from large values to small and back to large. We can see that when both lambdas are large, our working sets discard irrelevant predictors very accurately and we only spend time fitting the actual non-zero variables. Starting from the middle of the plot, the tuning parameter $\lambda_1$ becomes smaller and the working sets start to overestimate the number of non-zero predictors. However, the model never considers more than 40% of all variables.

Figure 3.5: Number of variables in the WS versus the number of actual non-zero variables identified by the BCD fitting.

Figure 3.6: Working set size (%) for all lambda pairs.

Figure 3.6 illustrates the size of the working set as a percentage of the total number of predictors p (the same value that is depicted in red in Figure 3.5) for all pairs of the tuning parameters. For most of the $(\lambda_1, \lambda_2)_k$ pairs we were able to find the solution by considering only a small number of variables.

3.4 Simulations

The goal of our simulations is to compare the available models for hierarchical interaction selection, and their implementations, with our proposed algorithm. We chose the glinternet (2.8) and FAMILY (2.6 and 2.7) models presented in Chapter 2 for comparison, because the R packages implementing them are available on CRAN and their source code can be modified to handle the GE case with only one environmental variable, which is the case we are interested in. In glinternet we use the parameter interactionCandidates to specify the GE interaction case rather than the default GG case. For the FAMILY package, which can only handle the GE case with strictly more than one environmental variable, we modified the downloaded source code in order to fit the GE model with a single E.
Also, we compare our algorithm with the standard Lasso model implemented in the glmnet package.

We simulated a dataset with p = 2,500 SNPs (5,000 predictors in total) and sample size n = 100. We set the number of non-zero main effects (n_g_non_zero) among the p SNPs to 10 and the number of non-zero interaction terms (n_gxe_non_zero) among the p interactions to 5. We simulated all main effects $\beta_G$ to have the same absolute value with randomly chosen signs, and similarly for the interaction terms $\beta_{GE}$. We explored three simulation modes for the true coefficients that we call strong hierarchical, hierarchical, and anti-hierarchical. In the strong hierarchical mode the hierarchical structure is maintained ($\beta_{G_i} = 0 \implies \beta_{G_i E} = 0$) and also $|\beta_{G_i}| \ge |\beta_{G_i E}|$. In the hierarchical mode the hierarchical structure is maintained, but we set $|\beta_{G_i}| \le |\beta_{G_i E}|$, in an attempt to violate the gesso assumption on the effect sizes. In the anti-hierarchical mode the hierarchical structure is violated ($\beta_{G_i E} \ne 0 \implies \beta_{G_i} = 0$). We simulated a single binary environmental factor with prevalence equal to 0.3. We set $\beta_G = 3$ and $\beta_{GE} = 1.5$ for all of the modes except the hierarchical mode, where we set $\beta_G = 1.5$ and $\beta_{GE} = 3$. We set the noise variance such that the interaction SNR (signal-to-noise ratio) is around 2 for the strong hierarchical and anti-hierarchical modes and around 8 for the hierarchical mode.
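As a sketch of how the three coefficient modes can be generated, the R function below mirrors the modes implemented in the package's data.gen() function (Chapter 5), but it is an illustration under stated assumptions, not the package code itself:

gen_betas = function(p, n_g = 10, n_gxe = 5, b_g = 3, b_gxe = 1.5,
                     mode = c("strong_hierarchical", "hierarchical", "anti_hierarchical")) {
  mode = match.arg(mode)
  beta_g = beta_gxe = numeric(p)
  beta_g[1:n_g] = b_g * sample(c(-1, 1), n_g, replace = TRUE)
  if (mode == "anti_hierarchical") {
    gxe_idx = (n_g + 1):(n_g + n_gxe)  # interactions outside the support of beta_g
  } else {
    gxe_idx = 1:n_gxe                  # interactions nested within non-zero main effects
  }
  beta_gxe[gxe_idx] = b_gxe * sample(c(-1, 1), n_gxe, replace = TRUE)
  list(beta_g = beta_g, beta_gxe = beta_gxe)
}
# hierarchical mode swaps the magnitudes:
# gen_betas(p = 2500, b_g = 1.5, b_gxe = 3, mode = "hierarchical")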
We generated independent training, validation, and test sets under the same settings and report the model performance metrics on the test set. We ran 100 simulation replicates.

Unfortunately, the FAMILY package showed consistently poor results. We tried reducing the tolerance parameter and provided a larger (and finer) grid for the two tuning parameters, but the resulting performance did not improve. Fitting time was also much longer than for the other implementations we considered. It is likely that the algorithm the FAMILY package is based on is much better suited/optimized for the case $p \approx q$ (number of genomic features approximately equal to the number of environmental variables), which is a different problem than the one we are solving. In the paper [35] the authors only reported results for the GG case, which is probably a more advantageous scenario for their implementation. We report the FAMILY fitting results in Appendix K and exclude this package from the simulation results we present below.

Figure 3.7: Strong hierarchical mode. p = 2,500, n = 100, n_g_non_zero/n_gxe_non_zero = 10/5. Model performance as a function of the number of interactions discovered.

Results

For the figures below, we obtain a path of solutions and compute the test loss and interaction selection metrics (interaction detection precision, sensitivity, specificity, and AUC) as a function of the number of interactions discovered.

In the strong hierarchical mode (Figure 3.7) the models ensuring hierarchical structure (gesso and glinternet) clearly outperform the Lasso model. The gesso model also shows better results compared to the glinternet model, especially in terms of sensitivity and precision (for detection of the true interaction terms). The glinternet model did not detect more than 22 interaction terms in any of the simulation replicates, which is why it stops earlier on the plots.

Figure 3.8: Hierarchical mode. p = 2,500, n = 100, n_g_non_zero/n_gxe_non_zero = 10/5. Model performance as a function of the number of interactions discovered.

We considered the hierarchical mode to make sure the gesso model does not lose power when the assumption on the effect sizes ($|\beta_{G_i E}| \le |\beta_{G_i}|$) is violated, although the relaxed convex assumption ($|\beta_{G_i E}| \le \beta_{G_i}^+ + \beta_{G_i}^-$) is less stringent. Note that the larger interaction SNR (8 for the hierarchical mode versus 2 for the strong hierarchical mode) results in much better detection overall. In the hierarchical mode (Figure 3.8) glinternet and gesso still outperform the Lasso. The gesso model remains the best-performing model in terms of selection; however, the performance difference between glinternet and gesso is smaller than in the strong hierarchical mode.

Figure 3.9: Anti-hierarchical mode. p = 2,500, n = 100, n_g_non_zero/n_gxe_non_zero = 10/5. Model performance as a function of the number of interactions discovered.

In the anti-hierarchical mode (Figure 3.9) all three models perform similarly in terms of the selection metrics. Even though the hierarchical assumptions are violated, the hierarchical models still do not lose to the Lasso model in terms of selection performance. Moreover, the test loss remains better for the hierarchical models.

In the glinternet paper [70] the authors reported that the glinternet model outperformed the hierNet GG model in the anti-hierarchical case and performed on par with hierNet GG in the strong hierarchical case. In our results the gesso model outperforms glinternet in the strong hierarchical case and performs on par with glinternet in the anti-hierarchical case. The discrepancy might be due to the fact that we considered two tuning parameters for the gesso model (versus a single tuning parameter in the original paper), resulting in more flexibility.

3.5 Real data example: Epigenetic clock

To demonstrate the feasibility of our fitting algorithm and to compare the gesso method to the other available hierarchical methods and implementations (glinternet, FAMILY) and the standard Lasso regression (glmnet), we analyzed the GSE40279 methylation dataset [22]. The dataset consists of 450,000 CpG markers on autosomal chromosomes from whole blood for 656 subjects (Illumina Infinium 450k Human Methylation BeadChip). Age and gender of the subjects are also included in the dataset. The age of the individuals ranges from 19 to 101 years, and 52% of the study group are female.

Hannum et al. (2013) built a predictive model of aging based on the same methylation dataset. When exploring the ratio of the predicted age (based on the methylation probes) to the chronological age, the authors found that gender contributed significantly to the aging rate: the methylome of men appeared to age approximately 4 percent faster than that of women [22]. Epidemiological data also indicate that females live longer than males, but the reasons for the gender discrepancies are still unknown. Few studies have explicitly looked into whether DNA methylation is differently associated with age in males and females. Thus, our goal is to identify important methylation-by-sex interactions that are associated with age.

We subselected 100,000 of the most variable probes for our analysis, leaving us with 100,000 CpG probe main effects and 100,000 interaction terms of methylation probes by sex, 200,000 predictors in total. Age is our dependent/outcome variable.

The setup of our data analysis is the following. We ran our implementation of the gesso model over one hundred random splits of the cross-validation assignments and calculated the selection rate for each of the predictors.
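Schematically, the selection-rate procedure can be expressed with the package functions documented in Chapter 5 (gesso.cv() and gesso.coef()); here G, E, and Y are assumed to hold the methylation data, and the loop illustrates the procedure rather than reproducing the exact analysis script:

n_runs = 100
selected = matrix(0, nrow = n_runs, ncol = ncol(G))
for (run in 1:n_runs) {
  cv_fit = gesso.cv(G = G, E = E, Y = Y, seed = run)     # new random cv split per run
  coefs = gesso.coef(fit = cv_fit$fit, lambda = cv_fit$lambda_min)
  selected[run, ] = as.numeric(coefs$beta_gxe != 0)
}
selection_rate = colMeans(selected)   # fraction of runs in which each probe was selected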
The selection rate is the number of times a predictor was chosen by the model across the number of runs divided by the number of runs (one hundred in our case). We choose the hyperparameters based on the minimum cross-validation loss. We perform the same procedure for the Lasso model with the glmnet package. We also compare the gesso results to the standard GWIS analysis (Genome Wide Interaction Scan, one CpG probe at a time interaction test). For the GWIS analysis we set the Bonferroni adjusted signicance level to 0:05=100; 000 = 5e 7. We ran the standard interaction model (1.1) for each CpG probe, performing a Wald test to identify signicant interaction terms (H 0 : probesex = 0). For each method we report the three to four top CpG probes, Manhattan plot of the log p- values for the interaction testH 0 : probesex = 0, with marked locations of the selected probes, interaction plots from the single marker analysis stratied by sex, and a table of biological information, relevant to the selected CpG probes that can help to assess the validity of the identied probes. The latter was obtained with the IPA (Ingenuity Pathway Analysis) software by QIGEN, which provides one of the largest manually curated genomic databases, and the UCSC genome browser. We tried to use the glinternet package to t the age dataset but it exited with an out of memory error (Error in cbind(rep(1:(pInt - 1), (pInt - 1):1), unlist(sapply(1:(pCont - : long vectors not supported yet: ../../src/include/Rinlinedfuns.h:522 ). We downloaded the source code and xed an error due to the glinternet implementation being unable to allocate the needed memory for the size of theGE set as it was not storing objects eciently. But even with the corrected function glinternet would not complete the analysis after running for more than 24 hours, so we stopped it. We hypothesize that, due to specics of the current implementation, the glinternet package is ecient for the GG format, but not for the reduced GE format with large G. We also tried to run the FAMILY package on the our methylation dataset after modifying the source code to generalize the function to the GE case, as we mentioned in section 3.4. However, the program exited with an out of memory error (Error: vector memory exhausted (limit reached?)). To conclude, the existing packages are not able to handle large datasets (here 200,000 features) for the GE analysis. 42 Table 3.1: Ranks across analyses. Probes identied by other methods highlighted in bold. Results Table 3.1 presents ranks obtained from the GWIS, gesso, and glmnet analysis for all of the discovered probes. Ranks were calculated by ordering selection rates and assigning rank 1 to the highest rate value (most frequently selected feature), and so on. To highlight the degree of overlap across the three methods, a probe identied by more than one method is shown in bold in Table 3.1, where we considered a probe 'identied' if it ranked among the top 30 probes. The most similar results overall are observed betweengesso and glmnet, which are both joint selection models. The gesso model identied two probes discovered by glmnet: cg19009405 and cg14345497, while glmnet identied two probe discovered by gesso: cg00091483 and cg12015310. Glmnet also identied one probe discovered by the GWIS analysis (cg19273316), while GWIS identied one probe also discovered by glmnet (cg09365557). Overall, there was a relatively low agreement between the joint and single marker analyses. 
The gesso model did not identify any of the GWIS hits (and vice versa), indicating that the GWIS-identified probes have weak main effects and probably weak interaction effects when adjusted for the other markers.

The average run-time of our implementation of gesso across the 100 runs of cross-validation was 4 minutes for a grid of 30 values for each of our two tuning parameters. The average run-time of the glmnet package for the same grid over its single tuning parameter was 1.5 minutes.

Figure 3.10: Manhattan plots.

The results of the GWIS analysis are summarized in the Manhattan plot (Figure 3.10). At the significance level of $5\times10^{-7}$ only one probe was identified (cg19273316). At the level of $5\times10^{-6}$ we identified two more CpG probes, highlighted in Figure 3.10.

Figure 3.11: Interaction plots for the single-marker analysis stratified by sex. Predicted age on the y-axis, methylation frequency on the x-axis.

Figure 3.11 illustrates interaction plots from the single-marker analysis for the three most significant probes. The plots indicate that for females the methylation rate increases with increasing age for the cg07516946 and cg10142275 probes, while for males the methylation rate decreases with increasing age. For the cg19273316 probe we observe the opposite situation.

Table 3.2 presents relevant biological information for the identified CpG probes, including chromosome location, linked genes, molecular function, and the biological processes associated with the linked genes. Two out of the three genes linked with the identified probes are functional in the mitochondrion. As aging is accompanied by a general decline in mitochondrial function in all tissues [55], this provides a possible validation of the GWIS findings. Also, the IBA57 gene, which is associated with the top significant probe, is highly expressed in the testis, indicating sex-specificity of the gene. However, we did not find any literature linking the genes from Table 3.1 to the process of aging or sex-specificity.

For the gesso analysis, we selected the four CpG probes with the highest selection rates based on the described stability selection procedure. Table 3.2 summarizes the biological information about the identified probes. The cg00091483 probe had the highest selection rate in our analysis, indicating that this probe was the most consistently selected feature across the multiple random cross-validation splits. The probe is associated with the promoter of the PEBP1 gene. The Gene Ontology and IPA databases indicate that the PEBP1 gene is involved in the aging process and in negative regulation of the MAPK pathway [62]. The MAPK and SAPK/JNK signaling networks promote senescence (in vitro) and aging (in vivo, in animal models and human cohorts) in response to oxidative stress and inflammation [58, 63]. The Rat Genome Database (RGD) indicates that the PEBP1 gene is implicated in prostate and ovarian cancers [61], indicating some sex-specificity. Also, it is overexpressed in the adrenal gland tissue that produces sex hormones. The RGD database reports the PEBP1 gene as a biomarker of Alzheimer's disease [59, 60, 64].

The second identified probe, cg12015310, is linked to the MEIS2 gene, an important transcription factor that regulates the CDKN2B (p15) tumor suppressor gene involved in biological processes such as aging and cellular senescence (IPA database) [50]. A meta-analysis of GWAS and age-associated diseases (2012) mapped highly significant peaks of disease association to the CDKN2A/B tumor suppressor locus on 9p21.3 [50]. Kim et al. (2006) discussed the regulation and role of INK4a, ARF, and CDKN2B in cancer and aging, relating them to suppression of cancer and promotion of aging [51].
The third identified probe, cg08327269, is located in the enhancer region of the NEUROD1 gene, which acts as a transcription activator for the tumor suppressor gene CDKN1A (p21). The CDKN1A gene has been linked to aging in multiple publications, with CDKN1A expression found to upregulate multiple genes that have not only been associated with senescence, but have also been implicated in age-related diseases, including Alzheimer's disease, atherosclerosis, amyloidosis, arthritis, and cancer. Overexpression of CDKN1A from an inducible promoter in a human cell line induces growth arrest and phenotypic features of senescence [67, 68]. In addition, NEUROD1 has a connection with type 2 diabetes, a chronic disease commonly associated with aging. Diabetes develops when pancreatic beta cells age and senesce, and NEUROD1 participates in the regulation of beta-cell development and gene expression in beta cells.

Table 3.2: Biological information about the top selected CpG probes (GWIS and gesso).

The fourth identified probe, cg103755456, is associated with the ICAM1 gene, which is expressed on the cells of the immune system. ICAM1 is one of the molecules involved in the inflammatory response that is overexpressed in senescent cells and aged tissues. Furthermore, ICAM1 is overexpressed in atherosclerosis, an age-related, chronic inflammatory disease [52, 53].

Figure 3.10 highlights the identified probes on the Manhattan plot. We can see that the probes discovered in the joint hierarchical analysis are not among the most significant probes discovered in the single-marker analysis (GWIS). This is not entirely surprising, given the very different nature of the analyses (unadjusted effects versus effects adjusted for all the CpG probes), but additional exploration and comparison is required. Figure 3.11 presents the interaction plots. The plots for the probes cg08327296 (NEUROD1) and cg00091483 (PEBP1) show the same signs for the sex-stratified methylation effects. For cg12015310 (MEIS2) and cg103755456 (ICAM1) the red lines corresponding to the female stratum are relatively flat. When estimating the relationship between methylation and age separately for females and males, significant associations are only observed in the male group: for both probes, increased age is associated with an increased methylation rate in males only.

Table 3.3: Biological information about the top selected CpG probes (glmnet).

For the glmnet analysis, among the top four probes, three were linked to regulatory genes. For example, HOXB4 (linked to cg14345497) and TBX2 (linked to cg09365557) encode important transcription factors. When up-regulated, TBX2 inhibits CDKN1A (p21), the gene we described earlier for the interaction discovered by gesso for probe cg08327269. CDKN1A is necessary for tissue senescence and, when compromised, leaves the tissue vulnerable to tumor-promoting signals. Probes cg09365557 (identified by glmnet) and cg08327269 (identified by gesso) could both be uncovering the same biological process involving CDKN1A. HNRNPUL2 (linked to cg19009405) is a gene splicing regulator identified as associated with the aging process in several publications [71, 72]. The last probe, cg27652200, is linked to two closely located genes, TAP1 and PSMB9, both involved in immune response processes.
Studies suggest that an epigenetically mediated immune response involving those genes is a predictor of recurrence and, possibly, treatment response for high-grade serous epithelial ovarian carcinomas [73]. Figure 3.11 shows that for probes cg14345497 and cg09365557, a significant relationship between age and methylation rate is observed only in males. Probe cg19009405 has the same signs for the sex-stratified methylation effects, and probe cg27652200 has different signs.

Chapter 4

gesso for binary outcomes

For many clinical applications, the outcome of interest is a binary variable. For example, in a case-control study, patients with a particular disease are compared to healthy controls to identify disease risk factors. Logistic regression can be used to associate a binary dependent variable with a set of predictors. In this chapter, we describe a generalization of the gesso model to logistic regression and develop an efficient fitting algorithm.

4.1 Logistic gesso formulation

Consider a binary outcome $Y \in \{0,1\}^n$, $p$ genomic features $G = (G_1, \dots, G_p)$, and an environmental risk factor $E$, where $G_i, E \in \mathbb{R}^n$ and $n$ is the sample size. For simplicity, denote the design matrix that contains the environmental, genomic, and interaction predictors by $X = [G,\ E,\ GE]$, and the linear predictor by $X\beta = \mathbf{1}\beta_0 + G\beta_G + E\beta_E + GE\,\beta_{GE}$, where $\beta_0$ is an intercept, $\beta_G$ and $\beta_E$ are the genetic and environmental main effects respectively, and $\beta_{GE}$ is an interaction effect.

The logistic regression model assumes the conditional distribution $Y_i \mid X_i \sim Bernoulli(p_i)$, where $p_i = P(Y_i = 1 \mid X_i) = \sigma(X_i\beta)$ and $\sigma(\cdot)$ is the sigmoid function, $\sigma(t) = \frac{1}{1 + \exp(-t)}$. The logistic loss function (negative log-likelihood) is given by

$l(\beta) = -\sum_{i=1}^n w_i \log P(Y_i \mid X_i) = -\sum_{i=1}^n w_i \log\left( \sigma(X_i\beta)^{Y_i}\,(1 - \sigma(X_i\beta))^{1 - Y_i} \right) = \sum_{i=1}^n w_i \left( \log(1 + \exp(X_i\beta)) - Y_i X_i\beta \right),$

where $w_i$ are custom weights, usually $w_i = 1/n$. The logistic regression formulation for gesso follows from adding the overlapped group lasso penalty described in Chapter 2 to the logistic loss function $l(\beta)$:

$\text{minimize } l(\beta) + \lambda_1 \sum_{i=1}^p \left\|(\beta_{G_i}, \beta_{G_i E})\right\|_\infty + \lambda_2 \|\beta_{GE}\|_1, \quad (4.1)$

which is equivalent to the constrained formulation

$\text{minimize } l(\beta) + \lambda_1 \mathbf{1}^T(\beta_G^+ + \beta_G^-) + \lambda_2 \|\beta_{GE}\|_1, \quad \text{subject to } |\beta_{GE}| \preceq \beta_G^+ + \beta_G^-, \ \beta_G^\pm \succeq 0. \quad (4.2)$

Note that since the constraints and regularization terms are the same as in the linear case, the model ensures the same hierarchical properties we defined and discussed in Chapter 2. In particular, the final selected model will only include interaction terms adjusted for their genomic and environmental main effects. In addition, since $l(\beta)$ is a convex function, the logistic regression formulation for gesso is also convex.

To fit the logistic gesso, the Newton-Raphson algorithm can be used. The structure of the algorithm allows us to take advantage of the efficient linear solver we developed for gesso in Chapter 3. In the next section, we propose and explain the algorithm.

4.2 Newton-Raphson algorithm for logistic regression

The Newton-Raphson algorithm is a second-order optimization method that guarantees convergence to a global minimum for a convex objective function. It consists of iterations over two major steps (Figure 4.1):

(1) local approximation of the loss function by a quadratic Taylor expansion;
(2) minimization of the quadratic approximation.

Figure 4.1: Newton-Raphson algorithm. [figure credit: Jeremy Watt]

For a function on $\mathbb{R}$ we get:

$f_{approx}(x) \approx f(x_t) + f'(x_t)(x - x_t) + \tfrac{1}{2} f''(x_t)(x - x_t)^2$

$f_{approx}(x) \to \min_x: \quad \frac{d f_{approx}(x)}{dx} = f'(x_t) + f''(x_t)\,x - f''(x_t)\,x_t = 0$

NR update: $x_{t+1} = x_t - \frac{1}{f''(x_t)} f'(x_t).$

For a function on $\mathbb{R}^n$ we get:

$f_{approx}(x) \approx f(x_t) + \nabla f(x_t)^T (x - x_t) + \tfrac{1}{2}(x - x_t)^T H(x_t)(x - x_t)$

NR update: $x_{t+1} = x_t - (H(x_t))^{-1} \nabla f(x_t), \quad (4.3)$

where $\nabla f(x_t)$ is the gradient vector at $x_t$ and $H(x_t)$ is the Hessian matrix at $x_t$.
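As a small runnable illustration of the update (4.3) in one dimension, consider minimizing the convex function $f(x) = e^x - 2x$, whose minimizer is $x = \log 2$; the function names below are ad hoc:

f1 = function(x) exp(x) - 2   # f'(x)
f2 = function(x) exp(x)       # f''(x)
x = 0
for (t in 1:6) x = x - f1(x) / f2(x)   # the NR update from (4.3)
c(x, log(2))                           # converges to log(2) in a few steps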
Before deriving the Newton-Raphson algorithm for the logistic gesso model, we briefly describe an auxiliary problem that we rely on for developing the algorithm.

Weighted least squares regression (WLS) is a generalization of ordinary least squares regression where the residuals are weighted by custom weights; for example, the weights could be equal to the inverse variances of the observations:

$\min_\beta \tfrac{1}{2}\left\| W^{1/2}(Y - X\beta) \right\|_2^2, \quad \text{where } W = diag(w_1, \dots, w_n) \in \mathbb{R}^{n \times n}. \quad (4.4)$

We form the normal equations and find the solution to (4.4):

$(W^{1/2}X)^T \left(W^{1/2}(Y - X\beta)\right) = X^T W^{1/2} W^{1/2}(Y - X\beta) = X^T W Y - X^T W X \beta = 0$

WLS solution: $\beta = (X^T W X)^{-1} X^T W Y. \quad (4.5)$

Newton-Raphson for logistic regression

To derive the Newton-Raphson algorithm for logistic regression, we form the logistic loss function and calculate its gradient and Hessian. The logistic regression model is given by

$P(Y_i = 1 \mid X_i) = \frac{1}{1 + \exp(-X_i\beta)} =: p_i.$

The likelihood function is then

$L(\beta) = \prod_{i=1}^n P(Y_i \mid X_i) = \prod_{i=1}^n p_i^{Y_i}(1 - p_i)^{1 - Y_i}.$

We form the logistic loss function and obtain the gradient and the Hessian:

$l(\beta) = -\log L(\beta) = -\sum_{i=1}^n \left( Y_i \log p_i + (1 - Y_i)\log(1 - p_i) \right) \to \min_\beta$

$\frac{\partial l(\beta)}{\partial \beta_j} = \sum_{i=1}^n (p_i - Y_i)x_{ij}, \qquad \nabla l(\beta) = X^T(p - Y)$

$\frac{\partial^2 l(\beta)}{\partial \beta_k \partial \beta_j} = \sum_{i=1}^n p_i(1 - p_i)X_{ik}X_{ij}, \qquad H = X^T S X, \quad \text{where } S = diag\big(p_1(1 - p_1), \dots, p_n(1 - p_n)\big).$

The Newton-Raphson update is given by (4.3): $\beta_{t+1} = \beta_t - (H(\beta_t))^{-1}\nabla l(\beta_t)$. After substituting the calculated gradient and Hessian we get:

$\beta_{t+1} = \beta_t - (X^T S_t X)^{-1} X^T(p_t - Y) = (X^T S_t X)^{-1}\left((X^T S_t X)\beta_t - X^T(p_t - Y)\right)$
$= (X^T S_t X)^{-1} X^T S_t \left(X\beta_t + S_t^{-1}(Y - p_t)\right) = (X^T S_t X)^{-1} X^T S_t Z_t \overset{(4.5)}{=} \arg\min_\beta \tfrac{1}{2}\left\| S_t^{1/2}(Z_t - X\beta) \right\|_2^2,$

where $Z_t = X\beta_t + S_t^{-1}(Y - p_t)$ is the working response.

Adding a penalty term $\Omega(\beta)$ to the logistic objective function in the equations above is trivial. Newton-Raphson for penalized logistic regression is summarized in Algorithm 5. Algorithm 5 is equivalent to an Iteratively Reweighted Least Squares (IRLS) algorithm, where in the first step the weights are computed/updated, and in the second step the (penalized) WLS regression problem is solved.

Algorithm 5: Newton-Raphson for Logistic Regression
  for t = 1...max_iter do, while not converged:
    p_t = sigma(X beta_t);  S_t = diag(p_t^1 (1 - p_t^1), ..., p_t^n (1 - p_t^n))  // update weights
    Z_t = X beta_t + S_t^{-1} (Y - p_t)  // update working response
    beta_{t+1} = argmin_beta 1/2 || S_t^{1/2} (Z_t - X beta) ||_2^2 + Omega(beta)  // solve penalized WLS

Algorithm 5 provides a general framework to fit the logistic gesso model. One of the advantages of the algorithm is that we can reuse the block coordinate descent algorithm we developed for the linear gesso in Chapter 3.
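A minimal, self-contained R sketch of the unpenalized version of Algorithm 5 (plain IRLS, with the penalty term omitted) shows the weight and working-response updates in action; all names are local to the example:

set.seed(1)
n = 200; X = cbind(1, rnorm(n)); beta_true = c(-0.5, 1)
Y = rbinom(n, 1, 1 / (1 + exp(-X %*% beta_true)))

beta = rep(0, 2)
for (t in 1:25) {
  p_hat = as.vector(1 / (1 + exp(-X %*% beta)))
  S = p_hat * (1 - p_hat)                            # IRLS weights
  Z = X %*% beta + (Y - p_hat) / S                   # working response
  beta = solve(t(X) %*% (S * X), t(X) %*% (S * Z))   # WLS solve via (4.5)
}
cbind(beta, coef(glm(Y ~ X[, 2], family = binomial)))  # matches glm's estimates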
The primal formulation of the logistic gesso can be rewritten in the following form minimize n X i=1 w i log(1 + exp(Z i ))Y i Z i + 1 ( + G + G ) + 2 k GE k 1 ; subject to Z =X; j GE j4 ( + G + G ); G < 0: (4.6) We form the Lagrangian and derive the dual problem in Appendix M.1. The dual objective for the logistic gesso is given by: D() = n X i=1 w i log 1 (Y i i w i ) 1 (Y i i w i + Y i i w i log Y i i w i ; (4.7) where 2R n is a dual variable. D() is a strongly concave function (Figure 4.2) with domain Y i i w i > 0 and Y i i w i < 1 =) i 2 (Y i 1)w i );Y i w i : 55 The primal dual-link equation (Appendix M.1) is given by ^ =w T (Y(X ^ )); (4.8) which is similar to the linear case, where now, instead of the weighted residuals, we naturally have a weighted dierence between the binary response and the estimated probabilities. The dual feasible region is the same as in the linear gesso dual problem because the constraints we impose on the model are the same. 8 > > > > > > > > > > < > > > > > > > > > > : j T G j j 1 j ; j T G j Ej 2 + j ; T 1 = 0; T E = 0: D() is a strongly concave function, thus D()D(^ ) +hrD(^ ); ^ i 2 k ^ k 2 ; (4.9) where is a strongly concave modulus. 2 k ^ k 2 D(^ )D() +hrD(^ ); ^ i P ()D() =Gap(;); k ^ k r 2 Gap(;): We consider a quadratic approximation to the dual function (4.9) closely in Appendix M.2 and calculate that = 4 using the denition of concavity modulus. We get k ^ k r 1 2 Gap(;): (4.10) 56 Logistic Gap rules Since the dual feasible set is the same as for the linear case the Gap SAFE rules have the same general form we derived in Chapter 2 (3.17) max 0; rkG i Ek +jG i E T cj 2 < 1 rkG i kjG T i cj =) ^ G i = ^ G i E = 0: The radius is given by r = q 1 2 Gap(;) (4.10), the center c = is given by (4.8), and the dual gap is calculated by subtracting the primal logistic objective (4.1) from the dual logistic objective (4.7) we derived in this section. 4.4 Algorithms In this section, we present an algorithm to solve the logistic gesso model which uses Gap SAFE rules with the working sets approach we discussed in Chapter 3. It is similar conceptually to the algorithms we developed for the linear model, two major dierences include a dierent form of the rules and an additional loop where we approximate the objective function with quadratic Taylor expansion and then solve the weighted least squares. Note that in Chapter 3 we described a particular case where weights were equal to 1 n . A more general linear gesso approach for any weights is brie y illustrated in Appendix L. Algorithm 6 below describes working sets acceleration with application to the logistic Gap SAFE rules. Algorithm 7 describes the coordinate descent algorithm paired with the Newton- Raphson algorithm to t the logistic gesso model. We highlight in grey the major steps related to the Newton-Raphson algorithm in Algorithm 7. There is a trade-o in the Newton-Raphson algorithm for logistic regression concerning the accuracy of the quadratic approximation versus the convergence tolerance to solve the weighted least squares (WLS) problem on the particular approximation. It is not ecient to require a high convergence tolerance for the WLS estimates while the quadratic approximation is still very poor. It is more optimal to reach a certain level of accuracy for the approximation rst 57 and then require a high subsequent tolerance for WLS solution on a particular approximation. To partially account for that we propose a simple heuristic. 
Adaptive convergence for Newton-Raphson The simple idea consists in avoiding unnecessary quadratic approximation updates. After an iteration where weights (used for quadratic approximation of the logistic objective function) are updated we check if the model consequently converges in a single iteration with a max di convergence criterion (scaled dierence between current and updated coecients values, section 3.3.7). If yes, this means that the WLS solutions (regression coecients) didn't change after the updated quadratic approximation that much. Since weights depend on WLS solutions the new approximation will be very similar to the previous one. This "cyclic" situation could lead to a very slow convergence since the approximation will struggle to escape the localized values. To avoid this situation we propose to reduce the max di tolerance in the case of convergence in a single iteration to achieve a better WLS solution which in turn will lead to better updated weights for the quadratic approximation. The steps involved in the proposed convergence heuristic are highlighted in blue in Algorithm 7. Algorithm 6: coordinate descent with working sets( ) for i outer = 1...max iter do check duality gap() d = compute d i( ) //equation (3.22) based on logistic r and if i outer == 1 then working set =fj : ( Gj 6= 0) or ( GjE 6= 0)g //warm up operation if length(working set) == 0 then working set =fj: smallest d[j] up to working set init sizeg working set size = length(working set) else d[working set] = - Inf //to make sure WS increases monotonically working set size = min(2working set size, p) //doubling the size of the WS at each iteration working set =fj: smallest d[j] up to working set sizeg //scoring the coef. by the likelihood of being non-zero, section 3.3.6 newton raphson coordinate descent(working set) //Algorithm 7 58 Algorithm 7: newton raphson coordinate descent(set of indices I) has converged fast = FALSE for i inner = 1...max iter do check the duality gap() weights = update quadratic approximation() if has converged fast then max di toll = max di toll/4 has converged fast = TRUE while t < max iter do max di = 0 for j2I do ( Gj ; GjE ) = weighted coordinate descent update(weights) current di = max (t) Gj (t1) Gj 2 w 1=2 G j 2 ; (t) GjE (t1) GjE 2 w 1=2 G j E 2 //convergence heuristics based on the max dierence bw the coef. updates max di = max(current di, max di) if current di > 0 then active set[j] = TRUE //active set = coecients that got updated if max di < max di tol then break has converged fast = FALSE while t < max iter do max di = 0 for j2 active set do //active set iteration ( Gj ; GjE ) = coordinate descent update( ) current di = max (t) Gj (t1) Gj 2 G j 2 ; (t) GjE (t1) GjE 2 G j E 2 max di = max(current di, max di) if max di < max di tol then break 4.5 Simulations We performed a series of simulations to evaluate the selection performance of the logistic gesso and compare it to that of alternative models. As a baseline model for comparison, we used the standard logistic Lasso model as implemented in the glmnet package [42]. 
Among the models that impose hierarchical interactions, we chose the glinternet [70] model for comparison (Table 4.1), as it is implemented in an R package and can handle the GE case with a 59 Table 4.1: Selection Models for GE Name Model penalty Comments (convexity, assumptions, hyperparameters) lasso j E j +k G k 1 +k GE k 1 convex, two tuning parameters, E is penalized gesso 1 P p i=1 ( Gi ; GiE ) 1 + 2 k GE k 1 convex, single tuning parameter, E is not penalized glinternet j (0) E j + p X i=1 j (0) Gi j + p X i=1 ( (j) E ; (E) Gj ; GjE ) 2 ; where Gj = (0) Gj + (E) Gj ; E = (0) E + (0) E + + (p) E convex, single tuning parameter, E is penalized single environmental variable (by specifying the parameter interactionCandidates). We also considered single-marker analysis for comparison. Simulation settings We simulated a dataset withn = 500 subjects,p = 500 SNPs, and a single binary environmental variable E, for a total of 1,001 predictors. We set the number of non-zero main eects, p G out ofp SNPs to 10 and the number of non-zero interaction,p GE , out of thep interaction terms to 10. We simulated all non-zero main eects G to have the same absolute value with randomly chosen signs and similarly for the interaction terms GE . We simulated a balanced outcome variable. As in the linear case, we explored three simulation modes for the true coecients that we call strong hierarchical, hierarchical, and anti hierarchical. In the strong hierarchical mode the hierarchical structure is maintained ( G i = 0 =) G i E = 0) and alsoj G i jj G i E j. In the hierarchical mode the hierarchical structure is maintained, but we setj G i jj G i E j, with an attempt to violate the gesso assumption on the eect sizes. In the anti-hierarchical mode the hierarchical structure is violated ( G i E 6= 0 =) G i = 0). We simulated a single binary environmental factor with prevalence equal to 0.5. We set G = 2 and GE = 1:2 for all of the modes except for the hierarchical mode where we set G = 1 and GE = 1:2. We generated independent training, validation, and test sets under the same settings and 60 Figure 4.3: Strong hierarchical mode. report the model performance metrics on the test set. We run 100 replicates of the simulation for each parameter settings. Results We obtained solution paths across either one (for the lasso, glinternet) or two-dimensional grid (for gesso) of tuning parameter values and computed principal selection metrics for the detection of interactions as a function of the number of interactions discovered. For gesso we t the model for a grid of tuning parameters and aggregate/plot the selection metrics from the smallest number of selected interactions up to 30. If multiple pairs of tuning parameters correspond to the same number of non-zero interactions, we select the pair with either (1) the largest validation AUC, we call this modelgesso auc or (2) the pair corresponding to the most parsimonious model, i.e. with the smallest number of the non-zero main eects, given the number of non-zero interactions, we call this model gesso min g. Note that options (1) and (2) only dier by the way we select the best tuning parameters for each number of 61 Figure 4.4: Hierarchical mode non-zero interactions selected. For the single-marker analysis (named "gwas" on the plots), we obtain p-values for the signicance of the interaction coecient and sort them in increasing order. 
Then for each number k from 1 to 30, we consider the k-smallest p-values as positive model-identied classes (whereas for the joint models we consider non-zero interactions) and calculate selection metrics. In the strong hierarchical mode (Figure 4.3) the models imposing a hierarchical structure (gesso and glinternet) outperform the Lasso model. Both tuning options for the gesso per- form better selection compared to glinternet, especially in terms of sensitivity. gesso min g substantially outperforms other models because, given the underlying strong hierarchical gen- erative model, gesso is able to capture the true interactions guided by the strong main eects. gesso auc performs worse than gesso min g in terms of selection but is best in terms of predic- tion. This can be seen in Figure 4.6 where we compare logistic loss on the test set to evaluate prediction performance. Sparse gesso min g model tends to underestimate (bias towards null) the coecients of the main eects leading to the reduced prediction performance, however, the 62 Figure 4.5: Anti hierarchical mode. Figure 4.6: Test loss across modes. MSE (mean squared error) for the interaction coecients is the smallest for this model, mean- ing that estimation bias is the smallest. In fact, both gesso models demonstrate the smallest MSE for the interaction coecients (Figure 4.3) outperforming the unpenalized single-marker model (gwas model on Figure 4.3). We considered the hierarchical mode to evaluate the selection performance of gesso when 63 Figure 4.7: Main eects variation in oracle model. j G i E jj G i j does not hold. Because gesso does not make this stringent assumption, (only the laxerj G i E j + G i + G i ) we expect it to be robust to its violation. In the hierarchical mode (Figure 4.4) glinternet and gesso still outperform the Lasso. gesso auc model performs on par with glinternet, gesso min g shows the best selection performance. In terms of prediction performance gesso auc again shows the minimal test loss (Figure 4.6). In the anti-hierarchical mode lasso and glinternet perform similarly, whereas gesso auc per- forms better in terms of selection and prediction (Figure 4.5, 4.6). In contrast to gesso min g, gesso auc is vastly robust to the assumptions violation. Eect size analysis In the oracle hierarchical multivariate model, where we include true 10 non-zero main eects and 10 respective interaction terms, we evaluate the interaction detection sensitivity for varying main eect sizes. We consider values beta g = 1 (hierarchical mode), 1.2, 1.5, 2 (strong hier- 64 Figure 4.8: Main eects variation in selection models. archical mode), beta gxe stays at the value 1.2. Interaction detection sensitivity is calculated as the number of signicant interaction coecients divided by the total number of interaction terms included (ten). Figure 4.7 demonstrates that the larger the main eect, the harder it is to identify the interaction terms in multivariate logistic regression, even though the interaction coecient stays at the same value. The highest mean sensitivity (65 %) is achieved when we generate the smallest main eects (beta g = 1) in the model. When beta g equals 2 the mean sensitivity to identify interaction terms in the oracle model drops down to 42%. We did not observe this phenomenon in the case when the outcome is linear. The dierence could be due to the additional sampling step involved in the logistic regression model formulation. 
In a logistic model a linear predictor is used to estimate the probability of the positive case in the outcome. It could be that predictors with large eects dominate the estimated probability and make smaller eects less identiable. We evaluated how the dierence in main eects in uences the performance of the considered 65 selection models (Figure 4.8). For all of the competitor models (lasso, glinternet, single-marker) we observed the same phenomenon: simulations with the smaller main eects results in a better interaction selection performance. At the same time, the gesso auc model is not aected by the dierences in the main eects coecients. Moreover, gesso min g model performs better for larger main eects. 4.6 Real data application 4.6.1 Prostate cancer gene expression data To evaluate the advantages of our approach in terms of prediction performance over models that include only main eects or, alternatively, include main eects and interactions but in an unstructured manner, we analyzed a dataset from a case-control study of 187 prostate cancer patients. The dataset consists of 154 patients with no recurrence and 33 patients with clinical recurrence of prostate cancer. Genomic features include whole genome gene expression proles (29,000 transcripts) that were obtained using the Whole Genome DASL HT platform (Illumina, Inc) [74]. Methods The goal of the analysis is to build a prediction model of prostate cancer recurrence with the logistic gesso model and evaluate whether adding interaction terms to the model in a hierarchical manner can improve prediction performance. We considered three environmental variables: age, pooled Gleason score, and pre-operative PSA. For interaction models with each of the environmental variables, we included other variables as unpenalized confounders. Operational year (categorical) was included as a confounding variable in every model. We compared our model with the main eects only Lasso and interactions Lasso against the cross- validated AUC averaged over 100 replicates with random cross-validation assignments. 66 Figure 4.9: Mean cross-validated (cv) AUC (SD) and mean number of selected main eects and interactions (SD) across 100 cv splits. Interaction with (a) age, (b) preoperative PSA, and (c) pooled Gleason score is considered. Results Figure 4.9 presents mean cross-validated AUC and a mean number of selected features across methods for each of the environmental variables. Adding main-eect-before-interaction hi- erarchical constraints benets the prediction model and results in an increased AUC and a reduced number of selected features for all of the considered exposures. Adding interaction with age (Figure 4.9 a) to the model without requiring hierarchical structure doesn't seem to substantially improve the prediction performance. For the other two environmental factors, no interactions are selected with glmnet across the cross-validation splits. 67 4.6.2 FIGI data with external information (imputed gene expres- sion) The FIGI (Functionally Informed Gene-environment Interactions) consortium is dedicated to discovering gene-environment interactions associated with the risk of colon cancer across mul- tiple relevant data sources. Considered environmental exposures include principal colon cancer risk factors (e.g. processed/cooked red meat consumption, smoking) and protective factors (fruit and vegetable consumption, aspirin intake). FIGI data consists of a genome-wide panel of measured genotypes for a large set of individuals (130,000 subjects). 
Case-control colon cancer status is available for each individual. Here we focus on the aspirin environmental exposure. Aspirin is a known protective factor that reduces the risk of colon cancer because of its anti-inflammatory activity. The goal of our analysis is to find important gene-by-aspirin interactions associated with colon cancer. As external/functional information we use gene expression and matched genotypes from normal colons measured at Barcelona University (BarcUVa-Seq) [75].

For the analysis, we propose to use a two-step mediation approach. In the first step, we build models to predict the expression of each gene based on its cis SNPs in the BarcUVa-Seq dataset and utilize the estimated weights to impute gene expression from the FIGI genotypes [76]. In the second step, we associate the imputed gene expression with the case-control colon cancer status of the FIGI population using gesso, the hierarchical joint selection approach we developed, to find gene-by-aspirin interactions (Figure 4.10).

Figure 4.10: Steps involved in the analysis.

Methods

An Elastic Net (EN) model (4.11) with $\alpha = 0.5$ was applied to estimate the imputation weights $\hat w$ on the training dataset $(X, G)$, where $X$ is the matrix of gene expression measurements, $X_j$ is the gene expression for gene $j$, and $G_1, \dots, G_{p_j}$ are the cis SNPs for gene $j$. Cis SNPs are defined as those within $\pm$1Mb of the gene start/end.

$\hat w = \arg\min_w \left\| X_j - \sum_{i=1}^{p_j} w_i G_i \right\|_2^2 + \lambda\left( \alpha\|w\|_1 + (1 - \alpha)\|w\|_2^2 \right) \quad (4.11)$

For the training dataset $(X, G)$ we use gene expression and genotype data from Barcelona University (BarcUVa-Seq). Gene expression for 20,695 genes was measured from 445 normal (healthy) colon biopsies. In addition, subjects were genotyped using the Illumina OncoArray 500K BeadChip, and genome-wide SNPs were imputed [75]. We built non-degenerate models for 11,728 genes, with a mean cross-validated R-squared equal to 0.05 (range 0 - 0.67), and imputed gene expression for the large FIGI dataset, which includes genotypes measured for 72,698 subjects, using the estimated weights $\hat w$. The imputed gene expression is defined as

$\hat X_j^{(FIGI)} = \sum_{i=1}^{p_j} \hat w_i\, G_i^{(FIGI)}.$

We utilize the hierarchical gesso logistic model to associate imputed gene expression by aspirin with the case-control colon cancer status and select important genes. To control the false discovery rate we use the complementary pairs stability selection procedure paired with our selection method. We run gesso on the entire dataset and estimate the parameter $\hat\alpha = \frac{\lambda_2^{opt}}{\lambda_1^{opt}}$, the ratio of the two hyperparameters associated with the minimal cross-validated logistic loss of the model. To construct stable features, we fit gesso 100 times (50 complementary pairs) on half of the data with a single hyperparameter $\lambda_1$, setting $\lambda_2 = \hat\alpha\lambda_1$. We apply a probability threshold of 0.7 for the stability selection. We include unpenalized confounders in the model: age, sex, study indicators, and three principal components.
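Step one of the two-step approach can be sketched in R for a single gene j with the glmnet package; G_cis (cis genotypes), X_j (measured expression), and G_figi (FIGI genotypes for the same SNPs) are assumed to be loaded:

library(glmnet)
cv_fit = cv.glmnet(x = G_cis, y = X_j, alpha = 0.5)       # elastic net, eq. (4.11)
w_hat = as.vector(coef(cv_fit, s = "lambda.min"))[-1]     # drop the intercept
X_j_figi = G_figi %*% w_hat                               # imputed expression in FIGI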
Genes selected by both gesso and the Lasso are highlighted in dark gray (7/15, 47%). Among the top 7 selected genes by the gesso model, only one gene was not supported by the lasso model at 0.7 threshold, the COLEC12 gene. COLEC12 gene encodes a scavenger receptor that displays several functions associated with host defense, for example, it binds antigens, facilitating their recognition and removal. Chang et al. [77] evaluated the role of COLEC12 in gastric stromal cells in interacting with Prostaglandin E2 (PGE2). Prostaglandins are a group of lipids made at sites of tissue damage or infection that control processes such as in ammation. Prostaglandins are downregulated by aspirin [78]. An et al. [80] built an immune 12-gene signature with high prognostic value in colorectal cancer that included the COLEC12 gene. In addition, Durand et al. [79] showed that COLEC12 was dysregulated in colorectal cancer samples compared to normal samples and suggested this gene as a novel target for potential treatment and research in colorectal cancer. The gesso model evaluated key colorectal cancer-related genes MYC, MLH1, and HLA- DRB1 at a higher selection probability than the Lasso model. MYC (c-Myc) belongs to a family of regulator genes and proto-oncogenes that code for transcription factors. The MYC gene can promote tumorigenesis in various malignant tumors and mediate the critical role in colorectal cancer progression. Furthermore, dysregulation of c-MYC is a consequence of mutations in the APC gene, a central hub in early colorectal carcinogenesis [81, 82]. Ai et al. [83] showed that inhibition of c-Myc levels may represent a pathway by which aspirin exerts its anti-cancer eect and decrease the occurrence of cancer in epithelial tissues. Aspirin therapy reduces the ability of platelets to promote colon and pancreatic cancer cell proliferation in a manner dependent on the upregulation and activation of the oncoprotein c-MYC [84]. MLH1 is one of the mismatch repair (MMR) genes and is commonly associated with hered- itary nonpolyposis colorectal cancer. [85] showed that microsatellite instability in colorectal cancer cells decient for a subset of the human mismatch repair genes (hMLH1, hMSH2, and hMSH6) is markedly reduced during exposure to aspirin. [86] suggested the involvement of 71 the DNA repair proteins in the mechanism of action of aspirin and investigated the expres- sion of DNA damage signaling pathways genes upon aspirin exposure. Long-term daily use of acetylsalicylic acid (aspirin) has been shown to reduce cancer risk in individuals with Lynch syndrome [87] (mutations in the MLH1 gene cause Lynch syndrome). HLA-DR is a major histocompatibility complex (MHC) class II molecule that mediates anti- gen presentation, including tumor antigens, and is a favorable prognostic indicator in colorectal cancer [88, 89]. [90] showed that cyclooxygenase-2 (COX-2) were involved in the reduction of surface HLA-DR expression in peripheral blood mononuclear cells (PBMCs) isolated from col- orectal cancer patients. The COX group of genes is responsible for the formation of prostanoids and is downregulated by aspirin. The gesso model also assigned higher selection probability than the Lasso model to the NKIRAS1 (NFKB Inhibitor Interacting Ras Like 1) gene. This gene acts as a potent regulator of NF-kappa-B (Nuclear factor kappa B) activity which is a key regulator of the COX-2 gene. Shi et al. 
[95] showed that upregulation of COX-2 is associated with activation of the alternative nuclear factor kappa B signaling pathway in colonic adenocarcinoma. Chen et al. [96] establish that aspirin and other NSAIDs activate the NF-kappa-B pathway in neoplastic epithelial cells in the context of a whole tumor setting, and support the proposition that this eect is important for the anti-tumor activity of the agent. Many publications link the DIP2B gene with colorectal cancer via eQTL and meta-analysis [91, 92, 93, 94]. This gene might have been selected by the gesso model because of it's strong main (eQTL-mediated) eect. On the other hand, the Lasso model assigned a higher selection probability than gesso to the PDGFC gene. Platelet-derived growth factors (PDGFs) are known to be associated with tumor growth and angiogenesis, several studies revealed the participation of the PDGF family in colorectal cancer (CRC) [97, 98]. PDGFs aect platelet count and high platelet count is associated with systemic in ammation in CRC [99]. 72 Chapter 5 gesso package We developed an R package for researchers to use the gesso model which is available on our github page https://github.com/NataliaZemlianskaia/gesso. We present the detailed package vignette in this chapter. Purpose of this vignette This vignette is a tutorial about how to use the gesso package for the selection of the gene-environment interaction terms in a joint main-eect-before- interaction hierarchical manner for building a well-formulated prediction model containing interaction terms with a spe- cic exposure of interest, where the nal prediction model only includes interaction terms for which their respective main eects are also included in the model 5.1 Overview The package is developed to t a regularized regression model that we call gesso for the joint selection of gene-environment (GE) interactions. The model focuses on a single environmental exposure and induces a main-eect-before-interaction hierarchical structure. We developed and 73 implemented an ecient tting algorithm and screening rules that can discard large numbers of irrelevant predictors with high accuracy. The gesso model induces hierarchical selection of the (GE) interaction terms via over- lapped group lasso structure. The model has two tuning parameters 1 and 2 allowing exible, data-dependent control over group sparsity and additional interactions sparsity. The response (outcome) variable can be either continuous or binary 0/1 variable. For a binary response, the IRLS procedure is used with the custom screening rules we developed. The package supports sparse matrices dgCMatrix and (lebacked) bigmatrix format from the bigmemory package for large or out of RAM datasets. Package highlights The package allows to model either continuous or binary (0/1) outcomes; to tune either both hyperparameters ( 1 , 2 ), or only one hyperparameter ( 1 ) and set : 2 = 1 ; to tune hyperparameters for the binary outcomes either by maximizing AUC or minimiz- ing logistic loss; to add unpenalized covariates to the model; to utilize dgCMatrix (sparse) and bigmatrix (lebacked) data formats for more ecient and exible application of the package System requirements NOTE: RcppThread cannot be installed with gcc 4.8.5 versionhttps://github.com/tnagler/ RcppThread/issues/13. In this case, we recommend updating gcc to a more recent version. 
The response (outcome) variable can be either a continuous or a binary 0/1 variable. For a binary response, an IRLS procedure is used together with the custom screening rules we developed. The package supports sparse matrices (dgCMatrix) and the (filebacked) bigmatrix format from the bigmemory package for large or out-of-RAM datasets.

Package highlights

The package allows the user
- to model either continuous or binary (0/1) outcomes;
- to tune either both hyperparameters (lambda_1, lambda_2), or only one hyperparameter (lambda_1), setting lambda_2 = lambda_1 * alpha;
- to tune hyperparameters for binary outcomes either by maximizing AUC or by minimizing the logistic loss;
- to add unpenalized covariates to the model;
- to utilize dgCMatrix (sparse) and bigmatrix (filebacked) data formats for more efficient and flexible application of the package.

System requirements

NOTE: RcppThread cannot be installed with gcc version 4.8.5 (https://github.com/tnagler/RcppThread/issues/13). In this case, we recommend updating gcc to a more recent version.

5.2 Installation

Installation can take a couple of minutes because of the requirement to install dependent packages (dplyr, Rcpp, RcppEigen, RcppThread, bigmemory).

## install.packages("devtools")
library(devtools)
devtools::install_github("NataliaZemlianskaia/gesso")

Attaching libraries for the examples below:

library(gesso)
library(glmnet)
library(ggplot2)
library(bigmemory)

5.3 Data generation

First, we generate a dataset using the data.gen() function from the package. The function generates a binary (0/1) matrix of genotypes G of a given sample size with p features, a binary vector E of environmental measurements, and a response variable Y such that Y ~ g(beta_0 + E*beta_E + G*beta_G + GxE*beta_GxE), either quantitative or binary depending on the family parameter. We can specify the number of non-zero main effects (n_g_non_zero) and interactions (n_gxe_non_zero) we want to generate. We also specified the strong hierarchical mode for our dataset, which means that (1) if an interaction effect is generated as non-zero, its respective genetic main effect is also generated as non-zero (beta_GxE != 0 implies beta_G != 0), and (2) the effect sizes of the main effects are larger than those of the interaction effects (|beta_G| >= |beta_GxE|).

family = "gaussian"
sample_size = 150; p = 400; n_g_non_zero = 10; n_gxe_non_zero = 5

data = data.gen(seed=1, sample_size=sample_size, p=p,
                n_g_non_zero=n_g_non_zero, n_gxe_non_zero=n_gxe_non_zero,
                mode="strong_hierarchical", family=family)

Let's look at the dataset we generated: data$G_train, data$E_train, and data$Y_train.

dim(data$G_train)
## [1] 150 400

head(data$G_train[,1:5])
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    1    0    1    1
## [3,]    0    0    1    0    0
## [4,]    1    0    0    0    0
## [5,]    0    0    0    0    1
## [6,]    1    0    0    0    0

data$E_train[1:10]
## [1] 0 0 0 0 0 0 0 0 0 0

head(data$Y_train)
## [1] 5.6849366 0.6601472 8.6602417 2.9610805 8.2804916 -2.3606432

We generated model coefficients data$Beta_G and data$Beta_GxE such that we obtained sum(data$Beta_G != 0) = 10 non-zero main effects and sum(data$Beta_GxE != 0) = 5 non-zero interaction effects, as specified.

c(sum(data$Beta_G != 0), sum(data$Beta_GxE != 0))
## [1] 10 5

We also imposed strong hierarchical relationships between the main effects and the interaction effects.

cbind(data$Beta_G[data$Beta_G != 0], data$Beta_GxE[data$Beta_G != 0])
##       [,1] [,2]
##  [1,]    3 -1.5
##  [2,]   -3 -1.5
##  [3,]    3 -1.5
##  [4,]    3  0.0
##  [5,]    3 -1.5
##  [6,]    3  0.0
##  [7,]    3  0.0
##  [8,]    3  0.0
##  [9,]    3  1.5
## [10,]    3  0.0

5.4 Selection example

To tune the model over a 2D grid of hyperparameters (lambda_1 and lambda_2) we use the function gesso.cv(), where we specify
- our data G, E, Y and the family parameter for the response variable; matrix G should be a numeric matrix and vector E a numeric vector; vector Y can be either a continuous or a binary 0/1 numeric vector;
- tolerance for the convergence of the fitting algorithm;
- grid_size to automatically generate grid values for cross-validation;
- nfolds to specify how many folds to use in cross-validation;
- parallel=TRUE to enable parallel cross-validation (TRUE by default);
- seed to set the random seed that controls random fold assignments;
- normalize=TRUE to normalize matrix G (such that the column-wise sd is equal to 1) and vector E (TRUE by default);
- normalize_response=TRUE to normalize vector Y (FALSE by default).

There is an option to use a custom grid in the gesso.cv() function by specifying the grid parameter instead of the grid_size parameter. The user can also specify the alpha parameter to tune only over lambda_1 while assuming lambda_2 = lambda_1 * alpha, as sketched below.
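For example, a minimal sketch of a custom-grid call (the specific grid values and the choice alpha=0.5 are illustrative, not defaults):

grid = 10^seq(-4, 0, length.out=20)
tune_model_custom = gesso.cv(G=data$G_train, E=data$E_train, Y=data$Y_train,
                             family=family, grid=grid, alpha=0.5,
                             tolerance=1e-4, nfolds=3, seed=1)

With alpha fixed, cross-validation effectively searches a 1D path in lambda_1, which reduces tuning time relative to the full 2D grid.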
The verbose parameter is set to TRUE by default in gesso.cv(), but we can set it to FALSE to avoid printing messages about partial execution times.

start = Sys.time()
tune_model = gesso.cv(G=data$G_train, E=data$E_train, Y=data$Y_train,
                      family=family, grid_size=20, tolerance=1e-4,
                      parallel=TRUE, nfolds=3,
                      normalize=TRUE, normalize_response=TRUE,
                      seed=1, max_iterations=10000)
## Compute grid:
## Time difference of 0.0009689331 secs
## Parallel cv:
## Time difference of 5.771211 secs
## Fit on the full dataset:
## Time difference of 7.373647 secs

Sys.time() - start
## Time difference of 13.15446 secs

Obtain the best model with the gesso.coef() function

To obtain interaction and main effect coefficients corresponding to the model with a particular pair of tuning parameters we use the gesso.coef() function. We need to specify a model fit object and a pair of lambda = (lambda_1, lambda_2) values organized in a tibble, as in the sketch below.
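For instance, to extract the coefficients at the first grid value for both tuning parameters (an illustrative choice of grid point):

library(tibble)
lambda = tibble(lambda_1=tune_model$grid[1], lambda_2=tune_model$grid[1])
coefficients_custom = gesso.coef(fit=tune_model$fit, lambda=lambda)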
Below we set fit = tune_model$fit, the model fit on the full dataset, and lambda = tune_model$lambda_min, the pair of tuning parameters corresponding to the minimum cross-validated error.

coefficients = gesso.coef(fit=tune_model$fit, lambda=tune_model$lambda_min)
gxe_coefficients = coefficients$beta_gxe
g_coefficients = coefficients$beta_g

Check if all non-zero interaction features were recovered by the model:

cbind(data$Beta_GxE[data$Beta_GxE != 0], gxe_coefficients[data$Beta_GxE != 0])
##      [,1]       [,2]
## [1,] -1.5 -1.0173458
## [2,] -1.5 -2.3494359
## [3,] -1.5 -0.7284187
## [4,] -1.5 -1.1949190
## [5,]  1.5  2.1356086

Check that the largest estimated interaction effects correspond to the true non-zero coefficients:

(data$Beta_GxE[order(abs(gxe_coefficients), decreasing=TRUE)])[1:10]
## [1] -1.5 1.5 -1.5 -1.5 -1.5 0.0 0.0 0.0 0.0 0.0

Calculate the principal selection metrics with the selection.metrics() function available in the package:

selection_gesso = selection.metrics(true_b_g=data$Beta_G, true_b_gxe=data$Beta_GxE,
                                    estimated_b_g=g_coefficients,
                                    estimated_b_gxe=gxe_coefficients)
cbind(selection_gesso)
##                 selection_gesso
## b_g_non_zero    46
## b_gxe_non_zero  45
## mse_b_g         0.3726178
## mse_b_gxe       0.6398291
## sensitivity_g   1
## specificity_g   0.9076923
## precision_g     0.2173913
## sensitivity_gxe 1
## specificity_gxe 0.8987342
## precision_gxe   0.1111111
## auc_g           0.999999
## auc_gxe         0.999998

Compare with the standard Lasso model (we use the glmnet package):

set.seed(1)
tune_model_glmnet = cv.glmnet(x=cbind(data$E_train, data$G_train,
                                      data$G_train * data$E_train),
                              y=data$Y_train, nfolds=3, family=family)
coef_glmnet = coef(tune_model_glmnet, s=tune_model_glmnet$lambda.min)
g_glmnet = coef_glmnet[3:(p + 2)]
gxe_glmnet = coef_glmnet[(p + 3):(2 * p + 2)]

cbind(data$Beta_GxE[data$Beta_GxE != 0], gxe_glmnet[data$Beta_GxE != 0])
##      [,1]       [,2]
## [1,] -1.5  0.0000000
## [2,] -1.5 -0.0224555
## [3,] -1.5  0.0000000
## [4,] -1.5  0.0000000
## [5,]  1.5  1.2517060

(data$Beta_GxE[order(abs(gxe_glmnet), decreasing=TRUE)])[1:10]
## [1] 1.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

selection_glmnet = selection.metrics(data$Beta_G, data$Beta_GxE, g_glmnet, gxe_glmnet)
cbind(selection_gesso, selection_glmnet)
##                 selection_gesso selection_glmnet
## b_g_non_zero    46              62
## b_gxe_non_zero  45              26
## mse_b_g         0.3726178       0.2966215
## mse_b_gxe       0.6398291       1.341252
## sensitivity_g   1               1
## specificity_g   0.9076923       0.8666667
## precision_g     0.2173913       0.1612903
## sensitivity_gxe 1               0.4
## specificity_gxe 0.8987342       0.9392405
## precision_gxe   0.1111111       0.07692308
## auc_g           0.999999        0.999999
## auc_gxe         0.999998        0.8405046

We see that the gesso model outperforms the Lasso model in the detection of interaction terms on all of the principal selection metrics, including GxE detection AUC (auc_gxe), sensitivity (sensitivity_gxe), and precision (precision_gxe). The estimation bias of the interaction coefficients is also substantially lower for the gesso model (mse_b_gxe).

Obtain a target GxE model with the gesso.coefnum() function

To control how parsimonious our model is with respect to the selected non-zero interaction terms we can use the gesso.coefnum() function. We obtain interaction and main effect coefficients corresponding to a target GxE model, where instead of specifying the tuning parameters lambda we set the target_b_gxe_non_zero parameter: the number of non-zero interactions we want to include in the model at most (less_than=TRUE, the default) or at least (less_than=FALSE). The target model is selected based on the averaged cross-validation (cv) results (tune_model$cv_result): for each pair of parameters lambda=(lambda_1, lambda_2) in the grid and for each cv fold we obtain the number of non-zero estimated interaction terms, then average the cv results by lambda and choose the tuning parameters corresponding to the minimum average cv loss among those with at most (or at least) target_b_gxe_non_zero non-zero interaction terms. The returned coefficients are obtained by fitting the model on the full dataset with the selected tuning parameters. Note that the number of estimated non-zero interactions will only approximately reflect the numbers obtained on the cv datasets.

coefficients = gesso.coefnum(cv_model=tune_model, target_b_gxe_non_zero=5)
gxe_coefficients = coefficients$beta_gxe
g_coefficients = coefficients$beta_g

Calculate the principal selection metrics:

selection_gesso = selection.metrics(data$Beta_G, data$Beta_GxE,
                                    g_coefficients, gxe_coefficients)
cbind(selection_gesso, selection_glmnet)
##                 selection_gesso selection_glmnet
## b_g_non_zero    41              62
## b_gxe_non_zero  4               26
## mse_b_g         0.3887929       0.2966215
## mse_b_gxe       1.187474        1.341252
## sensitivity_g   1               1
## specificity_g   0.9205128       0.8666667
## precision_g     0.2439024       0.1612903
## sensitivity_gxe 0.6             0.4
## specificity_gxe 0.9974684       0.9392405
## precision_gxe   0.75            0.07692308
## auc_g           0.999999        0.999999
## auc_gxe         0.9610107       0.8405046

Here we see that the number of non-zero interactions in the model (b_gxe_non_zero) is 4 (vs 45 non-zero terms in the model corresponding to the minimal cross-validated error), leading to substantially increased precision and decreased sensitivity.
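To instead require at least a given number of non-zero interactions, one would flip the less_than flag; a short sketch based on the parameter description above (the target of 10 is arbitrary):

coefficients_atleast = gesso.coefnum(cv_model=tune_model,
                                     target_b_gxe_non_zero=10, less_than=FALSE)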
5.5 Prediction example

With gesso.coef() we obtain coefficients corresponding to the best model in terms of minimal cross-validated error and use gesso.predict() to apply the estimated model to a new test dataset, which can also be generated by the function data.gen(). We calculate the test R-squared to assess model performance.

coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)
beta_0 = coefficients$beta_0; beta_e = coefficients$beta_e
beta_g = coefficients$beta_g; beta_gxe = coefficients$beta_gxe

new_G = data$G_test; new_E = data$E_test
new_Y = gesso.predict(beta_0, beta_e, beta_g, beta_gxe, new_G, new_E)
test_R2_gesso = cor(new_Y, data$Y_test)^2

Compare with the standard Lasso (we use the glmnet package):

new_Y_glmnet = predict(tune_model_glmnet, newx=cbind(new_E, new_G, new_G * new_E),
                       s=tune_model_glmnet$lambda.min)
test_R2_glmnet = cor(new_Y_glmnet[,1], data$Y_test)^2

cbind(test_R2_gesso, test_R2_glmnet)
##      test_R2_gesso test_R2_glmnet
## [1,] 0.9738906     0.9437631

5.6 Adding unpenalized covariates to the model

The package allows adding unpenalized covariates to the model (for example, main adjustment demographic variables like age and gender). In this example, we first generate data with additional covariates (by specifying the n_confounders parameter in data.gen()) and then show how to fit the model: the user just needs to pass the numeric matrix C of covariates organized by columns.

family = "gaussian"
sample_size = 150; p = 400; n_g_non_zero = 10; n_gxe_non_zero = 5
n_confounders = 2
grid = 10^seq(-3, log10(1), length.out = 20)

data = data.gen(seed=1, sample_size=sample_size, p=p,
                n_g_non_zero=n_g_non_zero, n_gxe_non_zero=n_gxe_non_zero,
                mode="strong_hierarchical", family=family,
                n_confounders=n_confounders)

tune_model = gesso.cv(G=data$G_train, E=data$E_train, Y=data$Y_train,
                      C=data$C_train, family=family, grid=grid,
                      tolerance=1e-4, parallel=TRUE, nfolds=3,
                      normalize=TRUE, normalize_response=TRUE,
                      verbose=FALSE, seed=1)

5.7 Sparse matrix (dgCMatrix) option example

We generate a large, highly sparse genotype matrix G by setting the genotype prevalence parameter pG to 0.03 and measure how long it takes to fit the model.

sample_size = 600; p = 30000
pG = 0.03
data = data.gen(sample_size=sample_size, p=p,
                n_g_non_zero=n_g_non_zero, n_gxe_non_zero=n_gxe_non_zero,
                mode="strong_hierarchical", pG=pG, family=family)

sum(data$G_train != 0) / (sample_size * p)
## [1] 0.03003111

start = Sys.time()
fit = gesso.fit(G=data$G_train, E=data$E_train, Y=data$Y_train,
                grid_size=20, grid_min_ratio=1e-1,
                tolerance=1e-4, normalize=TRUE, normalize_response=TRUE)
time_non_sparse = difftime(Sys.time(), start, units="secs"); time_non_sparse
## Time difference of 16.76301 secs

We then convert our generated matrix to a sparse format and measure the fitting time again.

G_train_sparse = as(data$G_train, "dgCMatrix")
start = Sys.time()
fit = gesso.fit(G=G_train_sparse, E=data$E_train, Y=data$Y_train,
                tolerance=1e-4, grid_size=20, grid_min_ratio=1e-1,
                normalize=TRUE, normalize_response=TRUE)
time_sparse = difftime(Sys.time(), start, units="secs"); time_sparse
## Time difference of 2.732556 secs

as.numeric(time_non_sparse) / as.numeric(time_sparse)
## [1] 6.134554

We see that the sparse option enables more efficient fitting (6.13 times faster here); however, the speed-up is only realized for substantially sparse matrices.
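Since the conversion itself has a cost and dense data will not benefit, a simple guard can decide whether the dgCMatrix conversion is worthwhile; the 5% density threshold below is an illustrative assumption, not a package recommendation:

density = sum(data$G_train != 0) / prod(dim(data$G_train))
G_input = if (density < 0.05) as(data$G_train, "dgCMatrix") else data$G_train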
5.8 Bigmatrix (bigmemory package) option example

For out-of-RAM objects the function attach.big.matrix("g_train.desc") can be used to load the data, given that the files have already been created in the correct format (g_train.desc, g_train.bin).

## how to create a filebacked matrix
G_train = filebacked.big.matrix(nrow=dim(data$G_train)[1],
                                ncol=dim(data$G_train)[2],
                                backingpath="./",
                                backingfile="g_train.bin",
                                descriptorfile="g_train.desc")
for (i in 1:dim(data$G_train)[2]){
  G_train[, i] = data$G_train[, i]
}

## how to attach a filebacked matrix
G_train = attach.big.matrix("g_train.desc")
## is.filebacked(G_train) should return TRUE

fit_bm = gesso.fit(G_train, data$E_train, data$Y_train, family=family)
tune_model_bm = gesso.cv(G_train, data$E_train, data$Y_train, family=family)

5.9 Working sets

We use the same large dataset (p = 30,000) that we generated for the sparse matrices example to demonstrate how our screening rules work. Working sets (WS) are the sets of variables left after we apply our screening rules to the full set of predictors. A histogram of the working set sizes (Figure 5.1) shows that most of the time we have to fit only a handful of variables.

hist(fit$working_set_size, breaks=100, col="blue")

Figure 5.1: Distribution of working set sizes.

The 2-dimensional plot below shows log(WS size) for each (lambda_1, lambda_2) fit (Figure 5.2).

df = data.frame(lambda_1_factor = factor(fit$lambda_1),
                lambda_2_factor = factor(fit$lambda_2),
                ws = fit$working_set_size)
log_0 = function(x){
  return(ifelse(x == 0, 0, log10(x)))
}
ggplot(df, aes(lambda_1_factor, lambda_2_factor, fill=log_0(ws))) +
  scale_fill_distiller(palette = "RdBu") +
  scale_x_discrete("lambda_1", breaks=c(1)) +
  scale_y_discrete("lambda_2", breaks=c(1)) +
  labs(fill='log WS') +
  geom_tile()

Figure 5.2: Working set sizes over the 2D grid of tuning parameters (lambda_1, lambda_2).

Chapter 6

Incorporation of External Information

6.1 Introduction

Reliable identification of gene-environment (GxE) interactions remains challenging because it is intrinsically a low-power problem compared to the detection of main effects. Incorporating external gene expression data (e.g., GTEx, BarcUVa-Seq) potentially relevant to GxE interactions may increase the power to detect them. In this chapter we explore the integration of gene expression data (or other external information) to improve the selection of important GxE interactions associated with the outcome of interest. As in the previous chapters, we are interested in a joint analysis and want to impose a hierarchical structure. Our goal is to develop a regularized high-dimensional model for identifying important GxE interactions that:

1. Jointly models SNPs and interactions across a large number of variants.
2. Leverages the "main effect before interaction" hierarchical structure: $\beta_{GE} \neq 0 \implies \beta_G \neq 0$, or equivalently $\beta_G = 0 \implies \beta_{GE} = 0$.
3. Incorporates gene expression data to guide the selection of SNPs involved in GxE interactions in a robust way.

We start with a discussion of the potential benefits of the hierarchical structure and of incorporating gene expression data that could be informative for GxE selection. We then propose a model that satisfies our three objectives and give details about the model construction. We continue with a description of a simulation design to evaluate our model and present simulation results. We finish with a proposed algorithm to fit the model.

6.2 Statistical and Biological Motivation

Statistical motivation for the hierarchical structure

As described in Chapter 3, the main-effect-before-interaction hierarchical structure is important for high-dimensional problems with a large number of variables, where only a few potential interactions are expected to be associated with the outcome.
Imposing the hierarchical constraints helps reduce the search space for potential interactions by prioritizing interaction terms with strong main effects, and improves detection sensitivity and specificity. Even for small-scale experiments (Section 3.4), hierarchical models have a selection advantage over non-hierarchical ones. We refer to Chapter 3 for details.

Biological motivation for integrating gene expression

It is believed that a large proportion of genetic effects on a disease is mediated through gene expression and regulation. Identification of eQTLs (expression quantitative trait loci) has proven to be a powerful tool in the study and understanding of diseases in humans and other species. Unfortunately, gene expression datasets do not typically include information about the environmental exposures of the subjects. The Functionally Informed Gene-Environment Interactions for Colorectal Cancer (FIGI) project, however, has exposure-level information paired with the expression measurements. In addition to a large consortium-assembled GWAS dataset with about 100,000 cases and controls, the project has measured gene expression in healthy colon tissue from 500 individuals. This gives us a unique opportunity to estimate SNPxE interaction effects on gene expression, relate them to the SNPxE interaction effects on the disease, and assess whether gene expression data is able to guide the selection of SNPs involved in GxE interactions in a robust way.

6.3 The model

We consider an outcome vector $Y$ of size $n$, for now assumed continuous, $G$ an $n \times p$ matrix of genotypes, where $p$ is the number of SNPs, and $E$ a vector of individual environmental measurements of size $n$. $Z$ is a $p \times q$ matrix whose entries are estimates derived from gene expression, where $q$ is the number of genes; we discuss the construction of the matrix $Z$ in the following sections. The model we propose is an extension of the gesso model of Chapter 3, with additional terms to integrate the external information in $Z$. The model consists of three important parts, denoted (a), (b), and (c), each corresponding to one of the objectives stated in the Introduction:

$$\begin{aligned} \underset{\beta,\,\eta}{\text{minimize}} \quad & \|Y - (G\beta_G + E\beta_E + (G\circ E)\beta_{GE})\|_2^2 \; + \quad &\text{(a)} \\ & + \lambda_1 \sum_j \max\big(|\beta_{G_j}|, |\beta_{G_jE}|\big) + \lambda_2\|\beta_{GE}\|_1 \; + \quad &\text{(b)} \\ & + \lambda_3\|\beta_{GE} - Z\eta\|_2^2 + \lambda_4\|\eta\|_1, \quad &\text{(c)} \end{aligned} \tag{6.1}$$

where $\beta_G \in \mathbb{R}^p$, $\beta_E \in \mathbb{R}$, $\beta_{GE} \in \mathbb{R}^p$, $\eta \in \mathbb{R}^q$.

The first term (part a) of the model is a joint $L_2$ linear loss corresponding to a continuous outcome. For a binary outcome (for example, case-control disease status) the logistic negative log-likelihood can be used instead. We evaluate all the SNPs and interaction terms together in one model.

The regularization terms on the second line (part b) are adopted from the gesso model. These terms are added to ensure the main-effect-before-interaction hierarchical structure, which prioritizes interactions that have stronger main effects on the outcome of interest.

The last two terms on the third line (part c) incorporate gene expression estimates into the model and provide a framework to robustly select SNPxE interactions with stronger effects on gene expression.

Integration of gene expression data

In this section we look closely at the external information part of our model (part c) and explain how the matrix $Z$ is constructed. We denote the gene expression data by $X$, an $m \times q$ matrix, where $m$ is the sample size of $X$ and $q$ is the number of genes in the dataset. Consider model (6.2), where $G_i$ is the $i$-th genotype, $X_j$ is the gene expression vector for gene $j$, and $E$ is the vector of environmental measurements for each subject.
Here $\gamma_{G_{ij}E}$ is the interaction effect of SNP $i$ and the environmental variable $E$ on the expression of gene $j$. Our hypothesis is that the $\gamma_{GE}$ estimates can inform us about potential GxE interactions associated with the outcome.

$$X_j \sim \gamma_{ij} G_i + \gamma_E E + \gamma_{G_{ij}E}\, G_i \cdot E \tag{6.2}$$

We construct the external information matrix $Z$ as the matrix of gamma estimates $\hat{\gamma}_{G_{ij}E}$ estimated from model (6.2). By construction, $Z$ is a block-diagonal matrix, where each gene (column) has non-zero rows corresponding to the in-cis SNPs that have a significant (interaction) effect on expression.

The ridge penalty $\lambda_3\|\beta_{GE} - Z\eta\|_2^2$ in part (c) of the model captures the relationship between interaction effects on gene expression and interaction effects on the outcome (disease) through the parameter vector $\eta$ (Figure 6.1). The lasso term $\lambda_4\|\eta\|_1$ allows the model to perform column selection on the matrix $Z$, that is, selection with respect to the genes. In other words, the term is added to screen out genes unimportant for the disease of interest (Weaver et al., 2020). Note that we could also use the main effects of the variants on gene expression $\hat{\gamma}_{ij}$ as external information to prioritize SNPs with eQTL effects, or a linear combination of $\hat{\gamma}_{ij}$ and $\hat{\gamma}_{G_{ij}E}$. Another possibility is to consider a joint gene expression model, where instead of evaluating one SNP at a time as in model (6.2), we include all SNPs in cis in a single model. We defer the discussion of the optimal form of external information from model (6.2) to future research and the data application.

Figure 6.1: Integration of gene expression data.

Tuning parameters

The proposed model (6.1) has four tuning parameters, which in practice makes it computationally infeasible to tune by cross-validation. In our simulations we considered two possible reductions: pairing the gesso tuning parameters as in the hierNet GxG model, where $\lambda_1 = \lambda_2$, and setting $\lambda_2$ to zero. We noticed that model (6.1) performs much better in the second case, when we remove the sparsity-inducing term of the gesso model, $\lambda_2\|\beta_{GE}\|_1$, from the objective function (results not shown). This might be because restricting the tuning parameters $\lambda_1$ and $\lambda_2$ to be equal is sub-optimal for the GxE case, and assuming $\lambda_1 \geq \lambda_2$, $\lambda_2 \geq 0$ is more advantageous for model performance. From this point onward, we only consider the reduced model (6.3).

$$\begin{aligned} \underset{\beta,\,\eta}{\text{minimize}} \quad & \|Y - (G\beta_G + E\beta_E + (G\circ E)\beta_{GE})\|_2^2 \; + \quad &\text{(a)} \\ & + \lambda_1 \sum_j \max\big(|\beta_{G_j}|, |\beta_{G_jE}|\big) \; + \quad &\text{(b)} \\ & + \lambda_3\|\beta_{GE} - Z\eta\|_2^2 + \lambda_4\|\eta\|_1 \quad &\text{(c)} \end{aligned} \tag{6.3}$$

Note that part (b) of model (6.3), when the number of environmental variables is equal to one, is equivalent to a group $L_\infty$ penalty. This means that the hierarchical structure is maintained in a grouped, more restrictive way: $\beta_{GE} \neq 0 \iff \beta_G \neq 0$, or equivalently $\beta_{GE} = 0 \iff \beta_G = 0$.

6.4 Simulation design

The main goal of our simulation is to evaluate how beneficial it is to integrate the gamma estimates into the model for varying degrees of informativeness of the gene expression data. We also want to impose a hierarchical structure on the estimates in our simulation design and assume a highly sparse model with respect to the interaction effects. Based on experience with real genome-wide interaction scans (GWIS), we expect only a few SNPs to be involved in interactions and the vast majority to be null. To summarize, in our simulation we want to:

1. Capture the dependence on external information.
2. Assume the hierarchical structure holds.
3. Assume the ground truth is sparse (GWIS).
To achieve these simulation goals (Figure 6.2), we set 90% of the GxE coefficients to zero to obtain a sparse true model (the null terms in Figure 6.2), and the main effects for the 10% of non-zero interaction terms are set to non-zero values to maintain the hierarchical structure. The set of non-zero interaction coefficients is split into two groups according to whether they are associated with the gene expression information. Associated with gene expression are those for which the interaction effects on gene expression ($\gamma_{GE}$) are informative for the interaction effects on the outcome ($\beta_{GE}$); we simulate the associated terms as normal random variables with mean $\gamma_{GE}$ and standard deviation $\sigma_1$. Non-associated with gene expression are those for which the interaction effects on gene expression are not informative for the interaction effects on the outcome; we simulate the non-associated terms as normal random variables with mean $\mu$ and standard deviation $\sigma_2$ (Figure 6.2). The specific parameter values are presented in the section below.

Figure 6.2: Simulation design.

We consider three core scenarios to evaluate our model. In the first scenario we assign all of the non-zero interaction coefficients to be non-associated with gene expression. This is an important scenario for evaluating the robustness of our model with respect to misleading external information. Scenario three captures the opposite situation, where all of the non-zero interaction coefficients are associated with gene expression, meaning that the external dataset is informative for all of the true interaction effects; this is the most favorable scenario for our method. Scenario two describes an intermediate situation where half of the terms are associated and the other half are non-associated with gene expression. The second scenario is important for evaluating how the model performs in the presence of both associated and non-associated SNPs, which is the most likely real-world situation.

Comparison models

We evaluate the performance of our model (6.3), denoted external gesso in the results figures, against four other models. The first one is the standard Lasso model, which does not maintain a hierarchical structure and does not incorporate external information, denoted lasso. The second natural comparison is the model that only includes parts (a) and (b) of model (6.3), i.e., the structural model without external information ($\lambda_3 = \lambda_4 = 0$), denoted gesso max. We also include model (6.4), which is similar to (6.3) but maintains the hierarchical structure in a different way: model (6.3) uses the max-based gesso penalty, while model (6.4) ensures the hierarchical structure by adding the group lasso penalty from model (2.6). We denote this model external group lasso. Analogously to the gesso max model, we include model (2.6) with $\lambda_2 = 0$, denoted group lasso, to compare external group lasso and group lasso as similar models with and without external information.
$$\begin{aligned} \underset{\beta,\,\eta}{\text{minimize}} \quad & \|Y - (G\beta_G + E\beta_E + (G\circ E)\beta_{GE})\|_2^2 \; + \quad &\text{(a)} \\ & + \lambda_1 \sum_{i=1}^{p} \|(\beta_{G_i}, \beta_{G_iE})\|_2 \; + \quad &\text{(b)} \\ & + \lambda_3\|\beta_{GE} - Z\eta\|_2^2 + \lambda_4\|\eta\|_1 \quad &\text{(c)} \end{aligned} \tag{6.4}$$

In summary, the models we consider are:

- lasso: standard $L_1$-regularized regression (2.1) (no hierarchical structure and no external information);
- gesso max: model (6.3) with $\lambda_3 = \lambda_4 = 0$ (structural model without external information);
- external gesso: model (6.3) (with hierarchical structure and external information);
- group lasso: model (2.6) with $\lambda_2 = 0$, which is the same as model (6.4) with $\lambda_3 = \lambda_4 = 0$ (structural model without external information);
- external group lasso: model (6.4) (with hierarchical structure and external information).

Simulation parameters

Main dataset (trait, genotype, and E):
- sample size: $n = 240$;
- number of SNPs: $p = 600$ (1,200 predictors in total), among them 60 non-zero interaction terms and 80 non-zero main effects (all non-zero interactions have non-zero main effects).

External dataset (gene expression, genotype, and E):
- sample size: $m = 50$;
- number of genes: $q = 20$.

Parameter values:
- $\beta_E = 1.5$; $\beta_0 = 0$; $\beta_G = 1.5 \cdot \text{random sign}$;
- $pG = 0.2$, $pE = 0.3$; $G \sim \text{Bernoulli}(pG)$, $E \sim \text{Bernoulli}(pE)$;
- $\gamma_{GE} \sim \text{Normal}(1, (0.05)^2)$; $\hat{\gamma}_{GE} \sim \text{Normal}(\gamma_{GE}, \sigma_1^2/m)$, $\sigma_1 = 0.7$;
- $\beta_{GE} \sim \text{Normal}(\gamma_{GE} \cdot \text{random sign},\; \sigma_3^2)$ for terms associated with gene expression, $\sigma_3 = 0.1$;
- $\beta_{GE} \sim \text{Normal}(\mu, \sigma_2^2)$ for terms non-associated with gene expression, $\mu = 1$, $\sigma_2 = 0.1$;
- $Y \sim \text{Normal}(X\beta, \sigma^2 I)$, $\sigma = 1$.
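The following R fragment sketches how the interaction effects could be drawn under this design; it is a schematic restatement of the parameter values above, not the exact simulation code used in this chapter (the 30/30 split corresponds to scenario two):

m = 50; sigma1 = 0.7; sigma2 = 0.1; sigma3 = 0.1; mu = 1
n_assoc = 30; n_nonassoc = 30  # split of the 60 non-zero interactions

gamma_gxe = rnorm(n_assoc, mean=1, sd=0.05)  # interaction effects on expression
gamma_hat = rnorm(n_assoc, mean=gamma_gxe, sd=sigma1 / sqrt(m))  # noisy estimates forming Z
beta_gxe_assoc = rnorm(n_assoc, sd=sigma3,
                       mean=gamma_gxe * sample(c(-1, 1), n_assoc, replace=TRUE))
beta_gxe_nonassoc = rnorm(n_nonassoc, mean=mu, sd=sigma2)  # unrelated to expression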
6.5 Results

All the tested models were fitted using the CVXR R package, which is a very general and flexible convex optimization solver, but one that does not scale to large problems. We simulated 50 replicates for each of the models. The parameters were tuned using independent training, validation, and test sets generated under the same settings. The simulation results are summarized in Figures 6.3-6.6. We compared the performance of the different models with respect to the selection of the true interaction effects and, separately, the main effects. We also compare the mean squared errors (MSE) of the estimated interaction coefficients and the loss function values on the test set.

Figure 6.3: GxE detection AUC across the five compared models.

Figure 6.3 depicts the area under the curve (AUC) for detecting interaction terms on the y-axis for the three scenarios we described. On the x-axis we have the three scenarios, starting from zero percent of interaction terms associated with gene expression (scenario 1), then five percent (scenario 2), and, lastly, ten percent (scenario 3). When there are no interactions associated with gene expression (scenario 1), the models without external data perform slightly better than the models that incorporate the non-informative external data. However, the differences in interaction selection performance are remarkably small, indicating that both models with external information, (6.3) and (6.4), are able to robustly disregard uninformative external data and avoid a loss in selection performance. It is also of note that, in alignment with the results in Chapter 3, the models that incorporate the structural hierarchical constraints significantly outperform the standard lasso model. For the second scenario we can see that both models using external information start to improve along the y-axis, indicating that the models are able to pick up the true signal from the gene expression data. Model (6.3) demonstrates better performance than model (6.4).

For the third scenario, when all of the non-zero interaction terms are modeled to be dependent on the gene expression estimates, the models that use the external data significantly outperform the other models, with model (6.3) again demonstrating superior results over model (6.4).

Figure 6.4 illustrates selection performance with respect to the main effects, with main effect detection AUC on the y-axis and the three scenarios on the x-axis. We see a pattern similar to that for the interaction detection AUC, with models (6.3) and (6.4) outperforming the models without external information and model (6.3) showing the best performance.

Figure 6.4: G detection AUC across the five compared models.

Figure 6.5 shows the mean squared error (MSE) of the estimated interaction coefficients. Even though the goal of our model is selection, not estimation, we still see that the incorporation of relevant external information reduces the bias and improves the estimates of the interaction terms, especially for model (6.3).

Figure 6.5: MSE for $\beta_{GE}$ across the five compared models.

We present the test loss in Figure 6.6 to demonstrate that model (6.3) also outperforms all the other models in terms of prediction.

Figure 6.6: Test loss across the five compared models.

Conclusions

The main conclusion is that the proposed model (6.3) is able to select non-zero interactions with high accuracy when the expression data is informative, and performs on par with structural feature selection methods when the expression data is non-informative, showing that the model incorporates the external information robustly. The model also outperforms the other models in terms of prediction accuracy (on the test set) and estimation accuracy (interaction coefficient MSE).

We presented two models with external information: the first using the penalty from the gesso approach (6.3) and the second using a group lasso structure (6.4). Model (6.3) demonstrated significantly superior performance over model (6.4), indicating that the gesso penalty is better matched to our proposed way of incorporating external information than the group lasso approach. We hypothesize that this is because the gesso model imposes stricter assumptions than the group lasso methods (the assumption on effect sizes needed to achieve hierarchy), which leads to higher performance when they are met, but more exploration is required. It is important to note that the usefulness of the external data in the proposed framework relies on the assumption that interaction effects on gene expression are related to interaction effects on the disease (or, generally, any trait of interest).

6.6 Model fitting

We presented an efficient block coordinate descent algorithm to fit gesso in the previous chapter; it naturally exploits the sparse and block structure of the problem. In this section we propose a similar approach to fit model (6.3). We consider the full model (6.1) and call it the "externally informed gesso" model; model (6.3) can then be fitted with the proposed algorithm by setting $\lambda_2 = 0$.

Externally informed gesso

$$\begin{aligned} \underset{\beta,\,\eta}{\text{minimize}} \quad & \frac{1}{2n}\big\|Y - (\mathbf{1}\beta_0 + G\beta_G + E\beta_E + (G\circ E)\beta_{GE})\big\|_2^2 \; + \\ & + \lambda_1 \sum_j \max\big(|\beta_{G_j}|, |\beta_{G_jE}|\big) + \lambda_2\|\beta_{GE}\|_1 + \frac{\lambda_3}{2}\|\beta_{GE} - Z\eta\|_2^2 + \lambda_4\|\eta\|_1, \end{aligned}$$

where $Y \in \mathbb{R}^n$ is the outcome of interest, $G$ is an $n \times p$ matrix of genotypes, $E$ is a vector of exposure measurements of size $n$, $Z$ is a $p \times q$ matrix of external information, and $\beta_0 \in \mathbb{R}$, $\beta_G \in \mathbb{R}^p$, $\beta_E \in \mathbb{R}$, $\beta_{GE} \in \mathbb{R}^p$, $\eta \in \mathbb{R}^q$.
Equivalent constrained model

$$\begin{aligned} \underset{\beta,\,\eta}{\text{minimize}} \quad & \frac{1}{2n}\|Y - (\mathbf{1}\beta_0 + G\beta_G + E\beta_E + (G\circ E)\beta_{GE})\|_2^2 \; + \\ & + \lambda_1 \mathbf{1}^T(\beta_G^+ + \beta_G^-) + \lambda_2\|\beta_{GE}\|_1 + \frac{\lambda_3}{2}\|\beta_{GE} - Z\eta\|_2^2 + \lambda_4\|\eta\|_1 \\ \text{subject to} \quad & |\beta_{G_jE}| \leq \beta_{G_j}^+ + \beta_{G_j}^-, \;\; \beta_{G_j}^\pm \geq 0, \quad \text{for } j = 1,\dots,p. \end{aligned}$$

From the KKT conditions we have (Appendix I):

Stationarity:

$$\beta_{G_j} = \frac{1}{\frac{1}{n}\|G_j\|^2}\, S\Big(\frac{1}{n} G_j^T \bar{r}^{(j)},\; \lambda_1 - \delta_j\Big), \tag{6.5}$$

where $r = r(\beta) = Y - \mathbf{1}\beta_0 - E\beta_E - G\beta_G - (G\circ E)\beta_{GE}$ and $\bar{r}^{(j)} = r(\beta) + G_j(\beta^+_{G_j} - \beta^-_{G_j})$;

$$\beta_{G_jE} = \frac{1}{\frac{1}{n}\|G_j \circ E\|^2 + \lambda_3}\, S\Big(\frac{1}{n}(G_j\circ E)^T \tilde{r}^{(j)} + \lambda_3 Z_{[j,:]}\eta,\; \lambda_2 + \delta_j\Big), \tag{6.6}$$

where $\tilde{r}^{(j)} = r(\beta) + (G_j\circ E)\beta_{G_jE}$ and $Z_{[j,:]}$ is the $j$-th row of $Z$;

$$\eta_j = \frac{1}{\|Z_{[:,j]}\|^2}\, S\Big(Z_{[:,j]}^T \beta_{GE},\; \frac{\lambda_4}{\lambda_3}\Big), \tag{6.7}$$

where $Z_{[:,j]}$ is the $j$-th column of $Z$; and

$$\beta_E = \frac{E^T r^{(E)}}{\|E\|^2}, \;\text{ where } r^{(E)} = r(\beta) + E\beta_E; \qquad \beta_0 = \frac{1}{n}\mathbf{1}^T r^{(0)}, \;\text{ where } r^{(0)} = r(\beta) + \mathbf{1}\beta_0.$$

Feasibility: $|\beta_{G_jE}| \leq \beta_{G_j}^+ + \beta_{G_j}^-$, $\beta_{G_j}^\pm \geq 0$, $\delta_j \geq 0$, $\mu_j^\pm \geq 0$.

Complementary slackness: $\delta_j(u_j\beta_{G_jE} - \beta^+_{G_j} - \beta^-_{G_j}) = 0$, $\mu_j^\pm \beta_{G_j}^\pm = 0$, for $j = 1,\dots,p$.

Note 1: Similarly to the BCD algorithm for the gesso, if the hierarchical constraints are inactive, meaning $|\beta_{G_jE}| < \beta^+_{G_j} + \beta^-_{G_j}$, then $\delta_j = 0$ and the solutions for $\beta_{G_j}$ and $\beta_{G_jE}$ are those corresponding to regular lasso penalization. When the constraints are active, i.e. $|\beta_{G_jE}| = \beta^+_{G_j} + \beta^-_{G_j}$, then $\delta_j$ adds more penalization to the interaction coefficient and relaxes the penalization of the main effect.

Note 2: Roughly speaking, the parameter $\eta_j$ is proportional to the correlation between the external data and the interaction term estimates. For $\eta_j$ in (6.7), if the external data is informative, then $\lambda_3$ should generally be large and $\lambda_4$ small, so that $\lambda_4/\lambda_3$ is small and more $\eta_j$'s are estimated as non-zero. Conversely, if the external data is non-informative, the parameter $\lambda_4$, which is responsible for the sparsity of the vector $\eta$, should be large and $\lambda_3$ should be small, so that $\lambda_4/\lambda_3$ is large and more $\eta_j$'s are estimated as zero.

6.6.1 Block coordinate descent updates

Similarly to Section 3.2.1 (updates for the gesso model without external data), we form a system of equations for the inseparable blocks $\beta_{G_j}$, $\beta_{G_jE}$, and $\delta_j$; we update $\beta_E$, $\beta_0$, and the $\eta$ variables according to the solutions above. Denote $\bar{\bar{r}}^{(j)} = r + G_j\beta_{G_j} + (G_j\circ E)\beta_{G_jE}$. The system of equations that has to be solved is

$$\begin{cases} \beta_{G_j} = \beta^+_{G_j} - \beta^-_{G_j} = \dfrac{1}{\frac{1}{n}\|G_j\|^2}\, S\Big(\frac{1}{n}G_j^T(\bar{\bar{r}}^{(j)} - (G_j\circ E)\beta_{G_jE}),\; \lambda_1 - \delta_j\Big) \\[4pt] \beta_{G_jE} = \dfrac{1}{\frac{1}{n}\|G_j\circ E\|^2 + \lambda_3}\, S\Big(\frac{1}{n}(G_j\circ E)^T(\bar{\bar{r}}^{(j)} - G_j\beta_{G_j}) + \lambda_3 Z_{[j,:]}\eta,\; \lambda_2 + \delta_j\Big) \\[4pt] \delta_j\big(|\beta_{G_jE}| - (\beta^+_{G_j} + \beta^-_{G_j})\big) = 0. \end{cases} \tag{6.8}$$

We show how to efficiently solve the system of equations (6.8) in Appendix J by breaking it down into a set of simpler cases.

Chapter 7

Conclusions and Discussion

We introduced a new approach to model GxE interactions with the hierarchical main-effect-before-interaction property. We showed in simulations that our approach outperforms other hierarchical models. Our proposed block coordinate descent algorithm is scalable to large numbers of predictors because of the custom screening rules we developed. We showed that existing packages for the hierarchical selection of interactions cannot handle the large numbers of predictors typical of studies with high-dimensional omic data. When considered separately, and not as a sub-case of a GxG analysis, the GxE case can be solved much more efficiently.

The implementation of our method for linear and logistic regression is available in our R package gesso, which can be downloaded from our GitHub page (https://github.com/NataliaZemlianskaia/gesso) or CRAN (https://cran.r-project.org/web/packages/gesso/index.html). The package can include unpenalized confounders in the model, allows the user to tune either both hyperparameters or only one hyperparameter in a custom way, and supports tuning for multiple objectives.
In addition, the package has an option to work with large out-of-RAM datasets and with sparse matrices.

The model can be generalized to include more than one environmental variable E. However, for the BCD algorithm to be efficient, the blocks have to be relatively small. When including multiple E variables, the size of each coordinate descent block grows linearly, containing the main effect and all of its corresponding interactions with the environmental variables ($\beta_{G_j}, \beta_{G_jE_1}, \dots, \beta_{G_jE_q}$). The bottleneck is then efficiently solving the system of equations resulting from the stationarity conditions, the size of which grows exponentially with the number of E variables. For more than a handful of E variables, other algorithms, such as ADMM, could become more efficient.

Apart from the computational efficiency of the algorithm, memory considerations are critical for large-scale analyses. The gesso package allows users to analyze large genome-wide datasets that do not fit in RAM using the filebacked bigmemory [101] format. However, this involves more time-consuming data transfers between RAM and the external memory source. To handle datasets that exceed available RAM more efficiently, the screening rules could be exploited for efficient batch processing [100].

We explored two different ways to leverage external information for the selection of GxE interaction terms to enhance our gesso model. In Chapter 6 we showed that an external gene expression dataset can be incorporated into the gesso objective through a convex penalty to improve selection, provided the external information is informative. In Chapter 4 (Section 4.6.1) we explored a two-step mediation approach where external information was used to predict gene expression measurements in the first step and the gesso model was applied in the second step.

Appendix A

Introduction to Convex Optimization

Convex set: A set $C \subseteq \mathbb{R}^n$ is called convex if for any $x, y \in C$ and $\theta \in [0, 1]$,

$$\theta x + (1 - \theta) y \in C.$$

Convex function: A function $f: \mathbb{R}^n \to \mathbb{R}$ with a convex domain $\text{dom} f$ is convex if for $x, y \in \text{dom} f$ and any $\theta \in [0, 1]$,

$$f(\theta x + (1 - \theta)y) \leq \theta f(x) + (1 - \theta) f(y). \tag{A.1}$$

If the inequality in (A.1) is strict for $\theta \in (0, 1)$, then the function is called strictly convex.

Figure A.1: Definition of a convex function.

A function $f: \mathbb{R}^n \to \mathbb{R}$ is strongly convex if and only if there exists $m > 0$ such that for any $x, y \in \text{dom} f$,

$$f(y) \geq f(x) + \nabla f(x)^T(y - x) + m\|y - x\|_2^2.$$

A function $f$ is strongly concave if and only if $-f$ is strongly convex.

Properties of a convex function: If $f$ is differentiable, then $f$ is convex if and only if for all $x, y \in \text{dom} f$,

$$f(y) \geq f(x) + \nabla f(x)^T(y - x).$$

This property states that if $f$ is a convex and differentiable function, then the linear approximation (via the gradient) at any point always underestimates $f$.

Figure A.2: The linear approximation (via the gradient) at any point underestimates $f$.

Subgradient of a convex function: The subgradient $\partial f(x)$ of a convex function $f$ at $x$ is the set of slopes of all lines tangent to the function from below:

$$\partial f(x) = \{g : f(y) \geq f(x) + g^T(y - x) \text{ for all } y\}.$$

Example (Figure A.3): for $f(x) = |x|$,

$$\partial f(x) = \begin{cases} 1, & x > 0, \\ -1, & x < 0, \\ [-1, 1], & x = 0. \end{cases}$$
Figure A.3: Subgradient of $f(\cdot) = |\cdot|$.

Convex optimization problem: A convex optimization problem (in canonical form) is a problem of the form

$$\begin{aligned} \underset{x}{\text{minimize}} \quad & f_0(x) \\ \text{subject to} \quad & f_i(x) \leq 0, \;\; i = 1,\dots,p, \\ & h_j(x) = 0, \;\; j = 1,\dots,q, \end{aligned} \tag{A.2}$$

where $f_0, f_i$ are convex functions and the $h_j$ are affine functions. The set $F = \{x : f_i(x) \leq 0,\; h_j(x) = 0,\; i = 1,\dots,p,\; j = 1,\dots,q\}$ is called the feasibility set.

Properties of a convex optimization problem:
- every local minimum is a global minimum;
- the optimal set is convex;
- if the objective function is strictly convex, then the problem has at most one optimal point.

Lagrangian duality: The Lagrangian is defined as

$$L(x, \alpha, \nu) = f_0(x) + \sum_{i=1}^{p} \alpha_i f_i(x) + \sum_{j=1}^{q} \nu_j h_j(x),$$

where $\alpha \succcurlyeq 0$ and $\nu$ are the dual variables paired with the constraints $f_i(x) \leq 0$ and $h_j(x) = 0$, respectively. The symbol $\succcurlyeq$ denotes an element-wise comparison: $\alpha \succcurlyeq 0 \iff \alpha_i \geq 0$ for all $i = 1,\dots,p$. Note that

$$\max_{\alpha \succcurlyeq 0,\, \nu} L(x, \alpha, \nu) = \begin{cases} f_0(x), & \text{if } f_i(x) \leq 0,\; h_j(x) = 0, \\ +\infty, & \text{else}, \end{cases}$$

thus the expression $\min_x \max_{\alpha \succcurlyeq 0, \nu} L(x, \alpha, \nu)$ is equivalent to the convex optimization problem (A.2). We call problem (A.2) and this equivalent expression the primal problem.

The dual function is defined as

$$g(\alpha, \nu) = \min_x L(x, \alpha, \nu) = \min_x \Big\{ f_0(x) + \sum_{i=1}^{p} \alpha_i f_i(x) + \sum_{j=1}^{q} \nu_j h_j(x) \Big\}.$$

The dual function $g(\alpha, \nu)$ is always concave in $\alpha$ and $\nu$ (as a minimum of affine functions). The dual problem is defined as

$$\max_{\alpha \succcurlyeq 0,\, \nu} g(\alpha, \nu) = \max_{\alpha \succcurlyeq 0,\, \nu} \min_x L(x, \alpha, \nu).$$

Observe the symmetry between the primal and the dual problems, where the only difference is the order of the max and min operations; both optimize the Lagrangian but in different orders:

Primal problem: $\min_x \max_{\alpha \succcurlyeq 0,\, \nu} L(x, \alpha, \nu)$

Dual problem: $\max_{\alpha \succcurlyeq 0,\, \nu} \min_x L(x, \alpha, \nu)$

Lower bound property: For any $\alpha \succcurlyeq 0$, $\nu$, and $x \in F$,

$$g(\alpha, \nu) \leq f_0(x).$$

The lower bound property states that the dual function $g(\alpha, \nu)$ is always a concave underestimator of the primal objective function $f_0(x)$.

Proof: If $f_i(x) \leq 0$ and $\alpha_i \geq 0$, then

$$f_0(x) \geq f_0(x) + \sum_i \alpha_i f_i(x) + \sum_j \nu_j h_j(x) \geq \min_x \Big\{ f_0(x) + \sum_i \alpha_i f_i(x) + \sum_j \nu_j h_j(x) \Big\} = g(\alpha, \nu).$$

Weak duality: Let $\hat{x}$ and $\hat{\alpha}, \hat{\nu}$ be the primal and dual optimal solutions; then $g(\hat{\alpha}, \hat{\nu}) \leq f_0(\hat{x})$, or

$$\max_{\alpha \succcurlyeq 0,\, \nu} \min_x L(x, \alpha, \nu) \leq \min_x \max_{\alpha \succcurlyeq 0,\, \nu} L(x, \alpha, \nu).$$

Proof:

$$\max_{\alpha \succcurlyeq 0,\, \nu} \min_x L(x, \alpha, \nu) = \min_x L(x, \hat{\alpha}, \hat{\nu}) \leq L(\hat{x}, \hat{\alpha}, \hat{\nu}) \leq \max_{\alpha \succcurlyeq 0,\, \nu} L(\hat{x}, \alpha, \nu) = \min_x \max_{\alpha \succcurlyeq 0,\, \nu} L(x, \alpha, \nu).$$

Strong duality: Strong duality holds if the primal and dual objectives are equal at the optimal solutions: $g(\hat{\alpha}, \hat{\nu}) = f_0(\hat{x})$, or

$$\min_x \max_{\alpha \succcurlyeq 0,\, \nu} L(x, \alpha, \nu) = \max_{\alpha \succcurlyeq 0,\, \nu} \min_x L(x, \alpha, \nu).$$

In the case of affine constraints (i.e., if the $f_i(x)$ are affine functions), strong duality always holds for a convex $f_0$.

The Karush-Kuhn-Tucker (KKT) conditions: Given that strong duality holds, it is possible to formulate necessary conditions for the optimal solutions, primal $\hat{x}$ and dual $\hat{\alpha}, \hat{\nu}$. Assume strong duality holds; then

$$f_0(\hat{x}) = \max_{\alpha \succcurlyeq 0,\, \nu} \min_x L(x, \alpha, \nu) = \min_x L(x, \hat{\alpha}, \hat{\nu}) \leq L(\hat{x}, \hat{\alpha}, \hat{\nu}) = f_0(\hat{x}) + \sum_{i=1}^{p} \hat{\alpha}_i f_i(\hat{x}) + \sum_{j=1}^{q} \hat{\nu}_j h_j(\hat{x}) \leq f_0(\hat{x}).$$

Therefore, all the inequalities have to be equalities. First, for all $i = 1,\dots,p$,

$$\hat{\alpha}_i f_i(\hat{x}) = 0.$$

This condition is called complementary slackness. Second, if $f_0$ and $f_i$ are differentiable functions, then $\min_x L(x, \hat{\alpha}, \hat{\nu}) = L(\hat{x}, \hat{\alpha}, \hat{\nu})$ and, hence, $\nabla_x L(\hat{x}, \hat{\alpha}, \hat{\nu}) = 0$. This condition is called the stationarity condition. Thus, if $\hat{x}$ and $\hat{\alpha}, \hat{\nu}$ are the primal and dual optimal solutions respectively, then the following conditions (called the Karush-Kuhn-Tucker (KKT) conditions) must hold:

Stationarity:
$$\nabla_x L(\hat{x}, \hat{\alpha}, \hat{\nu}) = \nabla_x f_0(\hat{x}) + \sum_{i=1}^{p} \hat{\alpha}_i \nabla_x f_i(\hat{x}) + \sum_{j=1}^{q} \hat{\nu}_j \nabla_x h_j(\hat{x}) = 0.$$

Complementary slackness:
$$\hat{\alpha}_i f_i(\hat{x}) = 0.$$

Feasibility:
$$f_i(\hat{x}) \leq 0,\; i = 1,\dots,p; \qquad h_j(\hat{x}) = 0,\; j = 1,\dots,q; \qquad \hat{\alpha}_i \geq 0,\; i = 1,\dots,p.$$

The KKT conditions are necessary conditions for the optimal solutions. They are also sufficient for the variables $x, \alpha$, and $\nu$ to be optimal if the functions $f_0$ and $f_i$ are convex and continuously differentiable.
The stationarity condition can also be formulated for non-differentiable convex functions by using the subgradient:

$$0 \in \partial_x L(\hat{x}, \hat{\alpha}, \hat{\nu}) = \partial_x f_0(\hat{x}) + \sum_{i=1}^{p} \hat{\alpha}_i \partial_x f_i(\hat{x}) + \sum_{j=1}^{q} \hat{\nu}_j \partial_x h_j(\hat{x}).$$

Duality gap: The duality gap is defined as the difference between the primal and the dual objective functions, $f_0(x) - g(\alpha, \nu)$. The main use of the duality gap is as a certificate of optimality:

$$g(\alpha, \nu) \leq f_0(\hat{x}) \leq f_0(x) \implies f_0(\hat{x}) \geq g(\alpha, \nu) \implies f_0(x) - f_0(\hat{x}) \leq f_0(x) - g(\alpha, \nu).$$

The duality gap forms an upper bound on the suboptimality gap of the current solution, where the suboptimality gap is defined as the difference between the objective function evaluated at the current solution and the optimal objective function value.

Appendix B

Proof of equivalence of the relaxed and unconstrained models

We want to prove that models (2.4) and (2.2) are equivalent:

$$\begin{aligned} \underset{\beta_0, \beta_G, \beta_E, \beta_{GE}}{\min} \quad & g(\beta_0, \beta_G, \beta_E, \beta_{GE}) + \lambda \sum_i (\beta^+_{G_i} + \beta^-_{G_i}) + \lambda\|\beta_{GE}\|_1, \\ \text{subject to} \quad & \|\beta_{G_iE}\|_1 \leq \beta^+_{G_i} + \beta^-_{G_i}, \;\; \beta^\pm_{G_i} \geq 0 \;\text{ for } i = 1,\dots,p, \end{aligned}$$

and

$$\underset{\beta_0, \beta_G, \beta_E, \beta_{GE}}{\min} \;\; g(\beta_0, \beta_G, \beta_E, \beta_{GE}) + \lambda \sum_{i=1}^{p} \max\big(|\beta_{G_i}|, \|\beta_{G_iE}\|_1\big) + \lambda\|\beta_{GE}\|_1.$$

Proof: Recall that $\beta_{G_i} = \beta^+_{G_i} - \beta^-_{G_i}$ with $\beta^\pm_{G_i} \geq 0$; then $\beta^-_{G_i} = \beta^+_{G_i} - \beta_{G_i}$ and $\beta^+_{G_i} \geq \beta_{G_i}$. Also $\beta^+_{G_i} + \beta^-_{G_i} = 2\beta^+_{G_i} - \beta_{G_i}$, and since $\|\beta_{G_iE}\|_1 \leq \beta^+_{G_i} + \beta^-_{G_i}$, we have $\beta^+_{G_i} \geq \frac{\|\beta_{G_iE}\|_1 + \beta_{G_i}}{2}$. Since $\beta^+_{G_i} \geq 0$, $\beta^+_{G_i} \geq \beta_{G_i}$, and $\beta^+_{G_i} \geq \frac{\|\beta_{G_iE}\|_1 + \beta_{G_i}}{2}$, we can write $\beta^+_{G_i} \geq \max\big\{[\beta_{G_i}]_+,\; \frac{\|\beta_{G_iE}\|_1 + \beta_{G_i}}{2}\big\}$, where $[\beta_{G_i}]_+ = \max\{\beta_{G_i}, 0\}$. Then

$$\beta^+_{G_i} + \beta^-_{G_i} = 2\beta^+_{G_i} - \beta_{G_i} \geq \max\big\{2[\beta_{G_i}]_+ - \beta_{G_i},\; \|\beta_{G_iE}\|_1\big\}.$$

Finally, we note that $2[\beta_{G_i}]_+ - \beta_{G_i} = |\beta_{G_i}|$. We therefore have $\beta^+_{G_i} + \beta^-_{G_i} \geq \max\big\{|\beta_{G_i}|, \|\beta_{G_iE}\|_1\big\}$, and since we are solving a minimization problem we can substitute $(\beta^+_{G_i} + \beta^-_{G_i})$ in model (2.4) by its minimum value $\max\big\{|\beta_{G_i}|, \|\beta_{G_iE}\|_1\big\}$, which finalizes our transformation from model (2.4) to (2.2) and concludes the proof of equivalence.

Appendix C

Coordinate-wise solutions for the gesso

gesso model:

$$\begin{aligned} \text{minimize} \quad & \frac{1}{2n}\|Y - (\mathbf{1}\beta_0 + G(\beta^+_G - \beta^-_G) + E\beta_E + (G\circ E)\beta_{GE})\|_2^2 \; + \\ & + \lambda_2\|\beta_{GE}\|_1 + \lambda_1\mathbf{1}^T(\beta^+_G + \beta^-_G), \\ \text{s.t.} \quad & |\beta_{G_jE}| \leq \beta^+_{G_j} + \beta^-_{G_j}, \;\; \beta^\pm_{G_j} \geq 0, \;\text{ for } j = 1,\dots,p. \end{aligned}$$

Lagrangian form:

$$\begin{aligned} L(\beta^+, \beta^-, \beta_{GE}, \beta_E, \beta_0, \delta, u, \mu^+, \mu^-) = \; & \frac{1}{2n}\|Y - (\mathbf{1}\beta_0 + G(\beta^+_G - \beta^-_G) + E\beta_E + (G\circ E)\beta_{GE})\|_2^2 \; + \\ & + \lambda_2 u^T\beta_{GE} + \lambda_1\mathbf{1}^T(\beta^+_G + \beta^-_G) \; + \\ & + \sum_j\big(\delta_j(u_j\beta_{G_jE} - \beta^+_{G_j} - \beta^-_{G_j}) - \mu^+_j\beta^+_{G_j} - \mu^-_j\beta^-_{G_j}\big), \end{aligned}$$

where $\delta, \mu^\pm$ are dual variables and $u$ is a subgradient of the vector $|\beta_{GE}|$. Denote the residuals

$$r(\beta) = Y - \mathbf{1}\beta_0 - G(\beta^+_G - \beta^-_G) - E\beta_E - (G\circ E)\beta_{GE}.$$

KKT conditions:

Stationarity: $\dfrac{\partial L}{\partial \beta^+_G} = \dfrac{\partial L}{\partial \beta^-_G} = \dfrac{\partial L}{\partial \beta_{GE}} = 0$.

Feasibility: $|\beta_{G_jE}| \leq \beta^+_{G_j} + \beta^-_{G_j}$, $\beta^\pm_{G_j} \geq 0$, $\delta_j \geq 0$, $\mu^\pm_j \geq 0$.

Complementary slackness: $\delta_j(u_j\beta_{G_jE} - \beta^+_{G_j} - \beta^-_{G_j}) = 0$, $\mu^\pm_j\beta^\pm_{G_j} = 0$, for $j = 1,\dots,p$.
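For completeness, in the closed-form solutions below $S(\cdot,\cdot)$ denotes the standard soft-thresholding operator, used exactly as in Chapter 3:

$$S(x, \lambda) = \text{sign}(x)\,\max(|x| - \lambda,\, 0).$$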
Solution for $\beta_{G_j} = \beta^+_{G_j} - \beta^-_{G_j}$: The stationarity conditions are

$$\frac{\partial L}{\partial \beta^+_{G_j}} = -\frac{1}{n}G_j^T r(\beta) + \lambda_1 - \delta_j - \mu^+_j = 0, \qquad \frac{\partial L}{\partial \beta^-_{G_j}} = \frac{1}{n}G_j^T r(\beta) + \lambda_1 - \delta_j - \mu^-_j = 0.$$

Denote $\bar{r}^{(j)} = r(\beta) + G_j(\beta^+_{G_j} - \beta^-_{G_j})$, so that $r(\beta) = \bar{r}^{(j)} - G_j\beta^+_{G_j} + G_j\beta^-_{G_j}$. Considering the four possible sign patterns:

1. $\beta^+_{G_j} > 0$, $\beta^-_{G_j} = 0$, $\mu^+_j = 0$: then $\beta_{G_j} = \dfrac{\frac{1}{n}G_j^T\bar{r}^{(j)} - (\lambda_1 - \delta_j)}{\frac{1}{n}\|G_j\|^2} > 0$, valid when $\frac{1}{n}G_j^T\bar{r}^{(j)} > \lambda_1 - \delta_j$.

2. $\beta^-_{G_j} > 0$, $\beta^+_{G_j} = 0$, $\mu^-_j = 0$: then $\beta_{G_j} = \dfrac{\frac{1}{n}G_j^T\bar{r}^{(j)} + (\lambda_1 - \delta_j)}{\frac{1}{n}\|G_j\|^2} < 0$, valid when $\frac{1}{n}G_j^T\bar{r}^{(j)} < -(\lambda_1 - \delta_j)$.

3. $\beta^+_{G_j} = \beta^-_{G_j} = 0$, $\mu^\pm_j \geq 0$: then $\beta_{G_j} = 0$, valid when $\big|\frac{1}{n}G_j^T\bar{r}^{(j)}\big| \leq \lambda_1 - \delta_j$.

4. $\beta^+_{G_j} > 0$, $\beta^-_{G_j} > 0$, $\mu^\pm_j = 0$: adding the two stationarity equations gives $2\lambda_1 - 2\delta_j = 0$, i.e. $\lambda_1 = \delta_j$, and subtracting them gives $\beta_{G_j} = \beta^+_{G_j} - \beta^-_{G_j} = \dfrac{G_j^T\bar{r}^{(j)}}{\|G_j\|^2}$.

Combining cases 1-3:

$$\beta_{G_j} = \frac{1}{\frac{1}{n}\|G_j\|^2}\, S\Big(\frac{1}{n}G_j^T\bar{r}^{(j)},\; \lambda_1 - \delta_j\Big). \tag{C.1}$$

Solution for $\beta_{G_jE}$: with $\tilde{r}^{(j)} = r(\beta) + (G_j\circ E)\beta_{G_jE}$ and $u_j = \text{sign}(\beta_{G_jE})$ if $\beta_{G_jE} \neq 0$, $u_j \in [-1, 1]$ if $\beta_{G_jE} = 0$,

$$\frac{\partial L}{\partial \beta_{G_jE}} = -\frac{1}{n}(G_j\circ E)^T r(\beta) + u_j(\lambda_2 + \delta_j) = 0,$$

which gives

$$\beta_{G_jE} = \frac{\frac{1}{n}(G_j\circ E)^T\tilde{r}^{(j)} - u_j(\lambda_2 + \delta_j)}{\frac{1}{n}\|G_j\circ E\|^2} = \frac{1}{\frac{1}{n}\|G_j\circ E\|^2}\, S\Big(\frac{1}{n}(G_j\circ E)^T\tilde{r}^{(j)},\; \lambda_2 + \delta_j\Big). \tag{C.2}$$

Solution for $\beta_E$:

$$\frac{\partial L}{\partial \beta_E} = -\frac{1}{n}E^T r(\beta) = 0 \implies \beta_E = \frac{E^T r^{(E)}}{\|E\|^2}, \;\text{ where } r^{(E)} = r(\beta) + E\beta_E. \tag{C.3}$$

Solution for $\beta_0$:

$$\frac{\partial L}{\partial \beta_0} = -\frac{1}{n}\mathbf{1}^T r(\beta) = 0 \implies \beta_0 = \frac{\mathbf{1}^T r^{(0)}}{n}, \;\text{ where } r^{(0)} = r(\beta) + \mathbf{1}\beta_0. \tag{C.4}$$

Appendix D

Block coordinate descent updates for the gesso

We show how to efficiently solve the system of equations (3.6),

$$\begin{cases} \beta_{G_j} = \dfrac{1}{\frac{1}{n}\|G_j\|^2}\, S\Big(\frac{1}{n}G_j^T(\bar{\bar{r}}^{(j)} - (G_j\circ E)\beta_{G_jE}),\; \lambda_1 - \delta_j\Big) \\[4pt] \beta_{G_jE} = \dfrac{1}{\frac{1}{n}\|G_j\circ E\|^2}\, S\Big(\frac{1}{n}(G_j\circ E)^T(\bar{\bar{r}}^{(j)} - G_j\beta_{G_j}),\; \lambda_2 + \delta_j\Big) \\[4pt] \delta_j\big(|\beta_{G_jE}| - (\beta^+_{G_j} + \beta^-_{G_j})\big) = 0, \end{cases}$$

where $\bar{\bar{r}}^{(j)} = r + G_j\beta_{G_j} + (G_j\circ E)\beta_{G_jE}$ denotes the residuals with the $j$-th block added back, by breaking it down into a set of simpler cases.

Note 1: Consider the equations $\frac{\partial L}{\partial \beta^+_{G_j}} = \frac{\partial L}{\partial \beta^-_{G_j}} = 0$ (Appendix C); adding them gives $2\lambda_1 - 2\delta_j - \mu^+_j - \mu^-_j = 0$, so $\lambda_1 - \delta_j = \frac{1}{2}(\mu^+_j + \mu^-_j) \geq 0$ and thus $\lambda_1$ is an upper bound for $\delta_j$ for any $j$. Since we also know $\delta_j \geq 0$ from the feasibility conditions, we have $\delta_j \in [0, \lambda_1]$.

Case 1: $\beta^-_{G_j} > 0$, $\delta_j = \lambda_1$. From the complementary slackness condition $\delta_j(|\beta_{G_jE}| - (\beta^+_{G_j} + \beta^-_{G_j})) = 0$ and $\delta_j = \lambda_1 \neq 0$, we get $|\beta_{G_jE}| = \beta^+_{G_j} + \beta^-_{G_j}$, so $\beta^-_{G_j} = |\beta_{G_jE}| - \beta^+_{G_j}$. From case (4) of Appendix C we have $\beta_{G_j} = \beta^+_{G_j} - \beta^-_{G_j} = \frac{G_j^T\bar{r}^{(j)}}{\|G_j\|^2}$, hence $\beta^+_{G_j} = \frac{G_j^T\bar{r}^{(j)}}{\|G_j\|^2} + \beta^-_{G_j} = \frac{G_j^T\bar{r}^{(j)}}{\|G_j\|^2} + |\beta_{G_jE}| - \beta^+_{G_j}$. Then

$$\beta^+_{G_j} = \frac{G_j^T\bar{r}^{(j)}}{2\|G_j\|^2} + \frac{1}{2}|\beta_{G_jE}| > 0, \qquad \beta^-_{G_j} = -\frac{G_j^T\bar{r}^{(j)}}{2\|G_j\|^2} + \frac{1}{2}|\beta_{G_jE}| > 0,$$

and $|\beta_{G_jE}| > \big|\frac{G_j^T\bar{r}^{(j)}}{\|G_j\|^2}\big| = |\beta_{G_j}|$. To obtain $\beta_{G_jE}$ and $\beta_{G_j}$ we have to solve the system

$$\begin{cases} \beta_{G_j} = \dfrac{G_j^T(\bar{\bar{r}}^{(j)} - (G_j\circ E)\beta_{G_jE})}{\|G_j\|^2} \\[4pt] \beta_{G_jE} = \dfrac{1}{\frac{1}{n}\|G_j\circ E\|^2}\, S\Big(\frac{1}{n}(G_j\circ E)^T(\bar{\bar{r}}^{(j)} - G_j\beta_{G_j}),\; \lambda_2 + \lambda_1\Big) \\[4pt] |\beta_{G_jE}| > |\beta_{G_j}|. \end{cases}$$

Case 2: $\delta_j \in [0, \lambda_1)$, $\beta_{G_j} = 0$, $\beta_{G_jE} = 0$. From the stationarity conditions, $\beta_{G_j} = 0$ when $|\frac{1}{n}G_j^T\bar{r}^{(j)}| \leq \lambda_1 - \delta_j$, and $\beta_{G_jE} = 0$ when $|\frac{1}{n}(G_j\circ E)^T\tilde{r}^{(j)}| \leq \lambda_2 + \delta_j$. Then

$$\max\Big(\Big|\frac{1}{n}(G_j\circ E)^T\tilde{r}^{(j)}\Big| - \lambda_2,\; 0\Big) \leq \delta_j \leq \lambda_1 - \Big|\frac{1}{n}G_j^T\bar{r}^{(j)}\Big|,$$

and we can set $\delta_j$ to any value from the feasible range; we set it to the upper bound $\delta_j = \lambda_1 - |\frac{1}{n}G_j^T\bar{r}^{(j)}|$.

Case 3: $\delta_j \in [0, \lambda_1)$, $\beta_{G_j}, \beta_{G_jE} \neq 0$, $|\beta_{G_j}| = |\beta_{G_jE}|$ ($\delta_j \neq 0$). Solve the system of linear equations with respect to $\beta_{G_j}$, $\beta_{G_jE}$, and $\delta_j$:

$$\begin{cases} \beta_{G_j} = \dfrac{1}{\frac{1}{n}\|G_j\|^2}\, S\Big(\frac{1}{n}G_j^T(\bar{\bar{r}}^{(j)} - (G_j\circ E)\beta_{G_jE}),\; \lambda_1 - \delta_j\Big) \\[4pt] \beta_{G_jE} = \dfrac{1}{\frac{1}{n}\|G_j\circ E\|^2}\, S\Big(\frac{1}{n}(G_j\circ E)^T(\bar{\bar{r}}^{(j)} - G_j\beta_{G_j}),\; \lambda_2 + \delta_j\Big) \\[4pt] \delta_j\big(|\beta_{G_jE}| - |\beta_{G_j}|\big) = 0. \end{cases}$$

Case 4: $\delta_j = 0$, $\beta_{G_jE} = 0$. Then

$$\beta_{G_j} = \frac{1}{\frac{1}{n}\|G_j\|^2}\, S\Big(\frac{1}{n}G_j^T\bar{r}^{(j)},\; \lambda_1\Big),$$

and we check that $\big|\frac{1}{n}(G_j\circ E)^T(\bar{\bar{r}}^{(j)} - G_j\beta_{G_j})\big| \leq \lambda_2$.

Case 5: $\delta_j = 0$. Solve the system of linear equations with respect to $\beta_{G_j}$, $\beta_{G_jE}$:

$$\begin{cases} \beta_{G_j} = \dfrac{1}{\frac{1}{n}\|G_j\|^2}\, S\Big(\frac{1}{n}G_j^T(\bar{\bar{r}}^{(j)} - (G_j\circ E)\beta_{G_jE}),\; \lambda_1\Big) \\[4pt] \beta_{G_jE} = \dfrac{1}{\frac{1}{n}\|G_j\circ E\|^2}\, S\Big(\frac{1}{n}(G_j\circ E)^T(\bar{\bar{r}}^{(j)} - G_j\beta_{G_j}),\; \lambda_2\Big). \end{cases}$$
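As a concrete illustration of the Case 5 update, the two soft-thresholding equations can be solved numerically by iterating them to a fixed point; the R sketch below (with illustrative names, not the package internals, which solve the system exactly by a sign case analysis) shows this for a single block:

soft_threshold = function(x, lambda) sign(x) * max(abs(x) - lambda, 0)

## r_both = residuals with both beta_Gj and beta_GjE added back,
## G_j = j-th genotype column, GE_j = G_j * E
update_case5 = function(G_j, GE_j, r_both, lambda_1, lambda_2,
                        beta_g=0, beta_gxe=0, n=length(r_both)) {
  for (iter in 1:100) {
    ## update beta_Gj holding beta_GjE fixed
    beta_g_new = soft_threshold(sum(G_j * (r_both - GE_j * beta_gxe)) / n,
                                lambda_1) / (sum(G_j^2) / n)
    ## update beta_GjE holding beta_Gj fixed
    beta_gxe_new = soft_threshold(sum(GE_j * (r_both - G_j * beta_g_new)) / n,
                                  lambda_2) / (sum(GE_j^2) / n)
    if (abs(beta_g_new - beta_g) + abs(beta_gxe_new - beta_gxe) < 1e-12) break
    beta_g = beta_g_new; beta_gxe = beta_gxe_new
  }
  c(beta_g=beta_g_new, beta_gxe=beta_gxe_new)
}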
Appendix E

Dual formulation for the gesso

In this section we derive a dual formulation of the gesso problem.

Primal gesso formulation:

$$\begin{aligned} \underset{\beta, \beta_{GE}, z}{\text{minimize}} \quad & \frac{1}{2n}z^Tz + \lambda_1\mathbf{1}^T(\beta^+_G + \beta^-_G) + \lambda_2\|\beta_{GE}\|_1 \\ \text{subject to} \quad & |\beta_{GE}| \preccurlyeq \beta^+_G + \beta^-_G, \;\; \beta^\pm_G \succcurlyeq 0, \\ & z = Y - \big(\mathbf{1}\beta_0 + G(\beta^+_G - \beta^-_G) + E\beta_E + (G\circ E)\beta_{GE}\big), \end{aligned}$$

where the $\preccurlyeq$ symbol denotes an element-wise comparison. In order to formulate the dual problem we form the Lagrangian, where $\beta, \beta_{GE}, z$ are the primal variables, $\delta \succcurlyeq 0$, $\mu^\pm \succcurlyeq 0$, $\nu$ are the dual variables, and $u$ is a subgradient of the vector $|\beta_{GE}|$. We denote the linear predictor $\mathbf{1}\beta_0 + G(\beta^+ - \beta^-) + E\beta_E + (G\circ E)\beta_{GE}$ by $X\beta$ for notational simplicity:

$$\begin{aligned} L(\beta_G, \beta_{GE}, z, \delta, \mu, \nu) = \; & \frac{1}{2n}z^Tz + \lambda_1\mathbf{1}^T(\beta^+_G + \beta^-_G) + \lambda_2 u^T\beta_{GE} \; + \\ & + \delta^T(u\circ\beta_{GE} - \beta^+_G - \beta^-_G) - (\mu^+)^T\beta^+_G - (\mu^-)^T\beta^-_G + \nu^T(Y - X\beta - z). \end{aligned}$$

Dual function:

$$g(\delta, \mu, \nu) = \inf_{\beta_G, \beta_{GE}, z} L(\beta_G, \beta_{GE}, z, \delta, \mu, \nu).$$

Consider the partial derivatives (stationarity conditions):

$$\frac{\partial L}{\partial z} = \frac{1}{n}z - \nu = 0, \quad z = n\nu \;\text{ and }\; \nu = \frac{z}{n} = \frac{Y - X\beta}{n}. \tag{E.1}$$

The dual variable $\nu$ is equal to the residuals scaled by the number of observations. The equation $\nu = \frac{Y - X\beta}{n}$ defines the link between the primal ($\beta$) and the dual ($\nu$) optimal solutions.

$$\frac{\partial L}{\partial \beta_{G_iE}} = \lambda_2 u_i + \delta_i u_i - \nu^T(G_i\circ E) = 0,$$

where $u_i = 1$ if $\beta_{G_iE} > 0$, $u_i = -1$ if $\beta_{G_iE} < 0$, and $u_i \in [-1, 1]$ if $\beta_{G_iE} = 0$. The feasibility conditions for the dual problem follow:

- if $\beta_{G_iE} > 0$, then $\lambda_2 + \delta_i = \nu^T(G_i\circ E)$;
- if $\beta_{G_iE} < 0$, then $\lambda_2 + \delta_i = -\nu^T(G_i\circ E)$;
- if $\beta_{G_iE} = 0$, then $u_i(\lambda_2 + \delta_i) = \nu^T(G_i\circ E)$ for some $u_i \in [-1, 1]$, which is equivalent to $|\nu^T(G_i\circ E)| \leq \lambda_2 + \delta_i$.

The solution for $\beta_{G_iE}$ leads to the following important consequence, which holds for the primal and dual optimal variables:

$$|\nu^T(G_i\circ E)| < \lambda_2 + \delta_i \implies \beta_{G_iE} = 0, \;\text{ for all } i. \tag{E.2}$$

Similarly,

$$\frac{\partial L}{\partial \beta^+_i} = \lambda_1 - \delta_i - \mu^+_i - \nu^TG_i = 0, \quad \lambda_1 - \delta_i - \nu^TG_i = \mu^+_i \geq 0,$$

$$\frac{\partial L}{\partial \beta^-_i} = \lambda_1 - \delta_i - \mu^-_i + \nu^TG_i = 0, \quad \lambda_1 - \delta_i + \nu^TG_i = \mu^-_i \geq 0,$$

leading to $\nu^TG_i \leq \lambda_1 - \delta_i$ and $-\nu^TG_i \leq \lambda_1 - \delta_i$, hence $|\nu^TG_i| \leq \lambda_1 - \delta_i$, the feasibility conditions for the dual problem. If we add the equations for $\mu^+_i$ and $\mu^-_i$ we get $2\lambda_1 - 2\delta_i = \mu^+_i + \mu^-_i \geq 0 \implies \delta_i \leq \lambda_1$, and $\delta_i \geq 0$, hence $\delta_i \in [0, \lambda_1]$. Note that when $\lambda_1 - \delta_i > \nu^TG_i$, then $\mu^+_i > 0$, and by the complementary slackness condition $\mu^+_i\beta^+_i = 0$ we have $\beta^+_i = 0$. Analogously, when $\lambda_1 - \delta_i > -\nu^TG_i$, then $\mu^-_i > 0$, and by the slackness condition $\mu^-_i\beta^-_i = 0$ we have $\beta^-_i = 0$. This leads to the following important consequence, which holds for the primal and dual optimal variables:

$$|\nu^TG_i| < \lambda_1 - \delta_i \implies \beta_{G_i} = 0, \;\text{ for all } i. \tag{E.3}$$

Finally,

$$\frac{\partial L}{\partial \beta_0} = -\nu^T\mathbf{1} = 0 \implies \sum_i \nu_i = 0; \qquad \frac{\partial L}{\partial \beta_E} = -\nu^TE = 0 \implies \sum_i \nu_iE_i = 0.$$

Let us review these two conditions more closely. Denote $\tilde{r} = Y - X\beta + \mathbf{1}\beta_0 + E\beta_E$, so that the residuals are $r = Y - X\beta = \tilde{r} - \mathbf{1}\beta_0 - E\beta_E$. As derived before, $\nu = \frac{z}{n} = \frac{r}{n}$; then

$$\begin{cases} \sum_i \nu_i = 0 \\ \sum_i \nu_iE_i = 0 \end{cases} \implies \begin{cases} \mathbf{1}^T(\tilde{r} - \mathbf{1}\beta_0 - E\beta_E) = 0 \\ E^T(\tilde{r} - \mathbf{1}\beta_0 - E\beta_E) = 0 \end{cases} \implies \begin{cases} \beta_0 = \bar{\tilde{r}} - \bar{E}\beta_E \\ \beta_E = \dfrac{n\bar{E}\,\bar{\tilde{r}} - E^T\tilde{r}}{n\bar{E}^2 - \|E\|^2}, \end{cases} \tag{E.4}$$

where $\bar{\tilde{r}} = \frac{\mathbf{1}^T\tilde{r}}{n}$ and $\bar{E} = \frac{E^T\mathbf{1}}{n}$.
Taking into account the stationarity conditions, we have

$$g(\delta, \mu, \nu) = [z = n\nu] = \frac{1}{2n}n^2\nu^T\nu - \nu^T(n\nu) + \nu^TY = \frac{n}{2}\nu^T\nu - n\nu^T\nu + \nu^TY = -\frac{n}{2}\nu^T\nu + \nu^TY.$$

The dual function:

$$g(\delta, \mu, \nu) = \begin{cases} -\frac{n}{2}\nu^T\nu + \nu^TY, & |(\nu^T(G\circ E))_i| \leq \lambda_2 + \delta_i, \;\; |(\nu^TG)_i| \leq \lambda_1 - \delta_i, \\[2pt] & \beta_0 = \bar{\tilde{r}} - \bar{E}\beta_E, \;\; \beta_E = \frac{n\bar{E}\bar{\tilde{r}} - E^T\tilde{r}}{n\bar{E}^2 - \|E\|^2}, \\[2pt] -\infty, & \text{otherwise}. \end{cases}$$

We can rewrite the dual function in a complete-square form: since

$$-\frac{n}{2}\Big\|\nu - \frac{Y}{n}\Big\|^2 = -\frac{n}{2}\nu^T\nu + \nu^TY - \frac{Y^TY}{2n},$$

we have

$$-\frac{n}{2}\nu^T\nu + \nu^TY = \frac{n}{2}\Big\|\frac{Y}{n}\Big\|^2 - \frac{n}{2}\Big\|\nu - \frac{Y}{n}\Big\|^2.$$

Dual gesso formulation:

$$\begin{aligned} \underset{\nu,\, \delta}{\text{maximize}} \quad & \frac{n}{2}\Big\|\frac{Y}{n}\Big\|^2 - \frac{n}{2}\Big\|\nu - \frac{Y}{n}\Big\|^2 \\ \text{subject to} \quad & |\nu^T(G\circ E)| \preccurlyeq \lambda_2 + \delta, \;\; |\nu^TG| \preccurlyeq \lambda_1 - \delta, \;\; \delta \in [0, \lambda_1]. \end{aligned}$$

Note that we omitted the conditions on $\beta_0$ and $\beta_E$ (E.4) from the dual feasible set, since the dual variable is equal to the scaled residuals, and satisfying the conditions for $\beta_0$ and $\beta_E$ is equivalent to first calculating those variables according to the conditions (E.4) and then updating the residuals.

Appendix F

Auxiliary problem solution

Auxiliary problem (with $y = Y/n$, matching the dual formulation above):

$$\text{maximize} \;\; |X^T\nu| \quad \text{subject to} \;\; D(\nu) \geq \gamma, \quad \text{where } D(\nu) = \frac{n}{2}\big(\|y\|^2 - \|y - \nu\|^2\big). \tag{F.1}$$

The region $D(\nu) \geq \gamma$ is equivalent to $\|y - \nu\|^2 \leq \|y\|^2 - \frac{2\gamma}{n}$, which is the equation of a filled circle (sphere) with $r^2 = \|y\|^2 - \frac{2\gamma}{n}$ and center at $y$. We denote this circle by $B(y, r)$. Taking $\gamma = D(\nu_0)$ for a dual-feasible point $\nu_0$,

$$r = \sqrt{\|y\|^2 - \frac{2}{n}D(\nu_0)} = \sqrt{\|y\|^2 - \|y\|^2 + \|y - \nu_0\|^2} = \|y - \nu_0\|.$$

Optimization problem (F.1) is then equivalent to

$$\max_{\nu \in B(y,\, \|y - \nu_0\|)} |X^T\nu|, \tag{F.2}$$

and the region is a sphere $B(c, r)$ that contains the optimal dual solution $\hat{\nu}$. We proceed by solving (F.2). $X^T\nu$ is proportional to the projection of the vector $\nu$ onto the vector $X$ (denoted $\text{Proj}_X\nu$), which we need to maximize under the constraint that the vector $\nu$ is contained in the circle $B(y, r)$. Recall that the projection of a vector $x$ onto $y$ is equal to $\frac{x^Ty}{\|y\|}$.

Figure F.1: Auxiliary problem geometric solution.

Figure F.1 helps to see that

$$\max_{\nu} \text{Proj}_X\nu = r + \text{Proj}_X y = r + \frac{y^TX}{\|X\|};$$

analogously,

$$\max_{\nu} \big(-\text{Proj}_X\nu\big) = r - \text{Proj}_X y = r - \frac{y^TX}{\|X\|}.$$

Using the fact that $\max|X^T\nu| = \max(X^T\nu, -X^T\nu)$, we get

$$\max\Big(\frac{X^T\nu}{\|X\|},\; -\frac{X^T\nu}{\|X\|}\Big) = r + \frac{|X^Ty|}{\|X\|},$$

hence

$$\max_{\nu \in B(y, r)} |X^T\nu| = r\|X\| + |X^Ty|,$$

or, more generally (for any center $c$),

$$\max_{\nu \in B(c, r)} |X^T\nu| = r\|X\| + |X^Tc|. \tag{F.3}$$

Appendix G

SAFE rules for the gesso

By the KKT conditions (3.10) and (3.9),

$$\begin{cases} |\hat{\nu}^TG_i| < \lambda_1 - \delta_i \\ |\hat{\nu}^T(G_i\circ E)| < \lambda_2 + \delta_i \\ \delta_i \in [0, \lambda_1] \end{cases} \implies \hat{\beta}_{G_i} = \hat{\beta}_{G_iE} = 0. \tag{G.1}$$

Then, since $\hat{\nu} \in B(c, r)$, by (F.3),

$$\begin{cases} \max_{\nu \in B(c,r)} |\nu^TG_i| < \lambda_1 - \delta_i \\ \max_{\nu \in B(c,r)} |\nu^T(G_i\circ E)| < \lambda_2 + \delta_i \\ \delta_i \in [0, \lambda_1] \end{cases} \iff \begin{cases} r\|G_i\| + |G_i^Tc| < \lambda_1 - \delta_i \\ r\|G_i\circ E\| + |(G_i\circ E)^Tc| < \lambda_2 + \delta_i \\ \delta_i \in [0, \lambda_1] \end{cases} \implies \hat{\beta}_{G_i} = \hat{\beta}_{G_iE} = 0.$$

Let us consider the last system of inequalities more closely:

$$\begin{cases} \delta_i < \lambda_1 - r\|G_i\| - |G_i^Tc| \\ \delta_i > r\|G_i\circ E\| + |(G_i\circ E)^Tc| - \lambda_2 \\ \delta_i \in [0, \lambda_1] \end{cases} \iff \begin{cases} \delta_i < \lambda_1 - r\|G_i\| - |G_i^Tc| \\ \delta_i > \max\big(0,\; r\|G_i\circ E\| + |(G_i\circ E)^Tc| - \lambda_2\big). \end{cases}$$

Then (G.1) applies if and only if a feasible $\delta_i$ exists, i.e.,

$$\max\big(0,\; r\|G_i\circ E\| + |(G_i\circ E)^Tc| - \lambda_2\big) < \lambda_1 - r\|G_i\| - |G_i^Tc|,$$

and, hence, the SAFE rule to discard $(\beta_{G_i}, \beta_{G_iE})$ is

$$\max\big(0,\; r\|G_i\circ E\| + |(G_i\circ E)^Tc| - \lambda_2\big) < \lambda_1 - r\|G_i\| - |G_i^Tc| \implies \hat{\beta}_{G_i} = \hat{\beta}_{G_iE} = 0. \tag{G.2}$$
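In code, the discard test (G.2) is just a couple of norms and dot products per feature; a small R sketch with illustrative names (the sphere center c and radius r are assumed to come from the current dual bound):

safe_discard = function(G_i, E, c, r, lambda_1, lambda_2) {
  GiE = G_i * E
  lhs = max(0, r * sqrt(sum(GiE^2)) + abs(sum(GiE * c)) - lambda_2)
  rhs = lambda_1 - r * sqrt(sum(G_i^2)) - abs(sum(G_i * c))
  lhs < rhs  # TRUE => both beta_Gi and beta_GiE can be safely set to zero
}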
Appendix H
Solution to the optimization problem (3.18)

Consider an optimization problem:

$$
\begin{aligned}
\underset{x,\ \mu}{\text{maximize}}\quad & D(\theta_0) \\
\text{subject to}\quad & |\theta_0^T (G{\circ}E)| \preceq \lambda_2 + \mu,\quad |\theta_0^T G| \preceq \lambda_1 - \mu,\quad \mu \succeq 0,
\end{aligned}
$$

where $\theta_0 = x\,\nu_{\mathrm{res}}(\beta)$ and $D(\theta) = \frac{n}{2}\left(\left\|\frac{Y}{n}\right\|^2 - \left\|\theta - \frac{Y}{n}\right\|^2\right)$. Let us denote $Y/n$ as $y$ and $\nu_{\mathrm{res}}(\beta)$ as $\nu$. We have

$$
\begin{aligned}
\underset{x,\ \mu}{\text{minimize}}\quad & \|y - x\nu\|^2 \\
\text{subject to}\quad & |x\,\nu^T (G_j{\circ}E)| \le \lambda_2 + \mu_j,\quad |x\,\nu^T G_j| \le \lambda_1 - \mu_j,\quad \mu_j \ge 0, \quad \text{for } j = 1, \dots, p.
\end{aligned}
\tag{H.1}
$$

Solution: by stationarity,

$$ \frac{\partial}{\partial x}\|y - x\nu\|^2 = -2(y - x\nu)^T\nu = 0, \qquad y^T\nu - x\|\nu\|_2^2 = 0, \qquad \tilde x = \frac{y^T\nu}{\|\nu\|_2^2}. $$

Let us denote $|\nu^T (G_j{\circ}E)|$ as $B_j$ and $|\nu^T G_j|$ as $A_j$. The feasible set for our optimization problem (H.1) is

$$ \begin{cases} |x|\, B_j \le \lambda_2 + \mu_j \\ |x|\, A_j \le \lambda_1 - \mu_j \end{cases} \implies \begin{cases} |x| \le \dfrac{\lambda_2 + \mu_j}{B_j} \\[4pt] |x| \le \dfrac{\lambda_1 - \mu_j}{A_j}. \end{cases} $$

Then the optimal $\mu_j$ is the solution of $\frac{\lambda_2 + \mu_j}{B_j} = \frac{\lambda_1 - \mu_j}{A_j}$, because a solution to this equation maximizes the feasible set and hence provides the broadest set in which to find an optimal $x$ (Figure H.1):

$$ \frac{\lambda_2 + \mu_j}{B_j} = \frac{\lambda_1 - \mu_j}{A_j} \implies A_j(\lambda_2 + \mu_j) = B_j(\lambda_1 - \mu_j) \implies \mu_j^* = \frac{B_j\lambda_1 - A_j\lambda_2}{B_j + A_j}. $$

[Figure H.1: Geometric solution for the problem (H.1). The optimal $\mu_j$ maximizes the range of possible $x$ values.]

We know that $\mu_j \in [0, \lambda_1]$ for all $j$ (Appendix C). If $\mu_j^* > \lambda_1$, then $\hat\mu_j = \lambda_1$, and if $\mu_j^* < 0$, then $\hat\mu_j = 0$. Therefore,

$$ \hat\mu_j = \max\left(0,\ \min(\mu_j^*,\ \lambda_1)\right). $$

We found $\tilde x = y^T\nu/\|\nu\|_2^2$, which minimizes our objective function by the stationarity conditions. To be optimal it has to satisfy our feasibility conditions for every $j$:

$$ \begin{cases} |x| \le \min_j \dfrac{\lambda_1 - \hat\mu_j}{A_j} \\[4pt] |x| \le \min_j \dfrac{\lambda_2 + \hat\mu_j}{B_j}. \end{cases} $$

Then a feasible $x$ satisfies the inequality

$$ |x| \le \min\left(\min_j \frac{\lambda_1 - \hat\mu_j}{A_j},\ \min_j \frac{\lambda_2 + \hat\mu_j}{B_j}\right). $$

Let us denote the RHS of the inequality as $M$. If $|\tilde x| \le M$, then the optimal $\hat x = \tilde x$, and if $|\tilde x| > M$, then the optimal $\hat x = \mathrm{sign}(\tilde x)\, M$.

Appendix I
Coordinate-wise solutions for the externally informed gesso

Externally informed gesso model:

$$
\begin{aligned}
\underset{\beta,\ \alpha}{\text{minimize}}\quad & \frac{1}{2n}\left\|Y - (\mathbf{1}\beta_0 + G\beta_G + E\beta_E + (G{\circ}E)\beta_{GE})\right\|_2^2 \\
&+ \lambda_1 \mathbf{1}^T(\beta^+ + \beta^-) + \lambda_2\|\beta_{GE}\|_1 + \frac{\lambda_3}{2}\|\beta_{GE} - Z\alpha\|_2^2 + \lambda_4\|\alpha\|_1 \\
\text{subject to}\quad & |\beta_{G_jE}| \le \beta_{G_j}^+ + \beta_{G_j}^-,\quad \beta_{G_j}^{\pm} \ge 0, \quad \text{for } j = 1, \dots, p.
\end{aligned}
$$

Lagrangian form:

$$
\begin{aligned}
L(\beta^+, \beta^-, \beta_{GE}, \beta_E, \beta_0, \alpha, u, w, \mu, \delta^+, \delta^-) ={}& \frac{1}{2n}\left\|Y - (\mathbf{1}\beta_0 + G\beta_G + E\beta_E + (G{\circ}E)\beta_{GE})\right\|_2^2 \\
&+ \frac{\lambda_3}{2}\|\beta_{GE} - Z\alpha\|_2^2 + \lambda_4 w^T\alpha + \lambda_2 u^T\beta_{GE} + \lambda_1 \mathbf{1}^T(\beta^+ + \beta^-) \\
&+ \sum_j \left(\mu_j (u_j\beta_{G_jE} - \beta_j^+ - \beta_j^-) - \delta_j^+\beta_{G_j}^+ - \delta_j^-\beta_{G_j}^-\right),
\end{aligned}
$$

where $\mu \succeq 0$, $\delta^{\pm} \succeq 0$ are dual variables. Denote

$$ r(\beta) = Y - \mathbf{1}\beta_0 - G(\beta^+ - \beta^-) - E\beta_E - (G{\circ}E)\beta_{GE}. $$

KKT conditions:
- Stationarity: $\dfrac{\partial L}{\partial \beta^+} = \dfrac{\partial L}{\partial \beta^-} = \dfrac{\partial L}{\partial \beta_{GE}} = \dfrac{\partial L}{\partial \alpha} = 0$;
- Feasibility: $|\beta_{G_jE}| \le \beta_{G_j}^+ + \beta_{G_j}^-$, $\beta_{G_j}^{\pm} \ge 0$, $\mu_j \ge 0$, $\delta_j^{\pm} \ge 0$;
- Complementary slackness: $\mu_j(u_j\beta_{G_jE} - \beta_j^+ - \beta_j^-) = 0$, $\delta_j^{\pm}\beta_{G_j}^{\pm} = 0$, for $j = 1, \dots, p$.

Solution for $\beta_{G_j}$:

$$ \frac{\partial L}{\partial \beta_{G_j}^+} = -\frac{1}{n}G_j^T r(\beta) + \lambda_1 - \mu_j - \delta_j^+ = 0, \qquad \frac{\partial L}{\partial \beta_{G_j}^-} = \frac{1}{n}G_j^T r(\beta) + \lambda_1 - \mu_j - \delta_j^- = 0. $$

Denote $r^{(j)} = r(\beta) + G_j(\beta_{G_j}^+ - \beta_{G_j}^-)$, so that $r(\beta) = r^{(j)} - G_j\beta_{G_j}^+ + G_j\beta_{G_j}^-$. Substituting,

$$ -\frac{1}{n}G_j^T\left(r^{(j)} - G_j(\beta_{G_j}^+ - \beta_{G_j}^-)\right) + \lambda_1 - \mu_j - \delta_j^+ = 0. $$

Case 1: $\beta_{G_j}^+ > 0$, $\beta_{G_j}^- = 0$, $\delta_j^+ = 0$:

$$ -\frac{1}{n}G_j^T(r^{(j)} - G_j\beta_{G_j}^+) + \lambda_1 - \mu_j = 0, \qquad \beta_{G_j}^+ = \frac{\frac{1}{n}G_j^T r^{(j)} - (\lambda_1 - \mu_j)}{\frac{1}{n}\|G_j\|^2} > 0 \iff \frac{1}{n}G_j^T r^{(j)} > \lambda_1 - \mu_j. $$

Then $\beta_{G_j} = \beta_{G_j}^+ - \beta_{G_j}^- = \dfrac{\frac{1}{n}G_j^T r^{(j)} - (\lambda_1 - \mu_j)}{\frac{1}{n}\|G_j\|^2}$ when $\frac{1}{n}G_j^T r^{(j)} > \lambda_1 - \mu_j$.

Case 2: $\beta_{G_j}^- > 0$, $\beta_{G_j}^+ = 0$, $\delta_j^- = 0$:

$$ \frac{1}{n}G_j^T(r^{(j)} + G_j\beta_{G_j}^-) + \lambda_1 - \mu_j = 0, \qquad \beta_{G_j}^- = \frac{-\frac{1}{n}G_j^T r^{(j)} - (\lambda_1 - \mu_j)}{\frac{1}{n}\|G_j\|^2} > 0 \iff \frac{1}{n}G_j^T r^{(j)} < -(\lambda_1 - \mu_j). $$

Then $\beta_{G_j} = \beta_{G_j}^+ - \beta_{G_j}^- = \dfrac{\frac{1}{n}G_j^T r^{(j)} + (\lambda_1 - \mu_j)}{\frac{1}{n}\|G_j\|^2}$ when $\frac{1}{n}G_j^T r^{(j)} < -(\lambda_1 - \mu_j)$.

Case 3: $\beta_{G_j}^+ = \beta_{G_j}^- = 0$, $\delta_j^{\pm} \ge 0$:

$$ -\frac{1}{n}G_j^T r^{(j)} + \lambda_1 - \mu_j = \delta_j^+ \ge 0, \qquad \frac{1}{n}G_j^T r^{(j)} + \lambda_1 - \mu_j = \delta_j^- \ge 0, $$

so $\lambda_1 - \mu_j \ge \frac{1}{n}G_j^T r^{(j)}$ and $\lambda_1 - \mu_j \ge -\frac{1}{n}G_j^T r^{(j)}$. Then $\beta_{G_j} = 0$ when $\left|\frac{1}{n}G_j^T r^{(j)}\right| \le \lambda_1 - \mu_j$.

Case 4: $\beta_{G_j}^+ > 0$, $\beta_{G_j}^- > 0$, $\delta_j^{\pm} = 0$:

$$ -\frac{1}{n}G_j^T\left(r^{(j)} - G_j(\beta_{G_j}^+ - \beta_{G_j}^-)\right) + \lambda_1 - \mu_j = 0, \qquad \frac{1}{n}G_j^T\left(r^{(j)} - G_j(\beta_{G_j}^+ - \beta_{G_j}^-)\right) + \lambda_1 - \mu_j = 0. $$

Subtracting the two equations gives $\frac{2}{n}G_j^T r^{(j)} - \frac{2}{n}G_j^T G_j(\beta_{G_j}^+ - \beta_{G_j}^-) = 0$, so $\beta_{G_j} = \beta_{G_j}^+ - \beta_{G_j}^- = \frac{G_j^T r^{(j)}}{\|G_j\|^2}$; adding them gives $2\lambda_1 - 2\mu_j = 0$, i.e. $\lambda_1 = \mu_j$. Then $\beta_{G_j} = \frac{G_j^T r^{(j)}}{\|G_j\|^2}$ when $\lambda_1 = \mu_j$.

Combining cases 1–3, we have

- $\beta_{G_j} = \dfrac{\frac{1}{n}G_j^T r^{(j)} - (\lambda_1 - \mu_j)}{\frac{1}{n}\|G_j\|^2}$ when $\frac{1}{n}G_j^T r^{(j)} > \lambda_1 - \mu_j$,
- $\beta_{G_j} = \dfrac{\frac{1}{n}G_j^T r^{(j)} + (\lambda_1 - \mu_j)}{\frac{1}{n}\|G_j\|^2}$ when $\frac{1}{n}G_j^T r^{(j)} < -(\lambda_1 - \mu_j)$,
- $\beta_{G_j} = 0$ when $\left|\frac{1}{n}G_j^T r^{(j)}\right| \le \lambda_1 - \mu_j$,

thus

$$ \beta_{G_j} = \frac{1}{\frac{1}{n}\|G_j\|^2}\; S\!\left(\frac{1}{n}G_j^T r^{(j)},\ \lambda_1 - \mu_j\right). \tag{I.1} $$
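In code, (I.1) is a single soft-thresholding step. Below is a minimal R sketch with simulated inputs; it illustrates the update only and is not the package implementation:

```r
soft <- function(x, t) sign(x) * pmax(abs(x) - t, 0)   # S(x, t)

# Coordinate update (I.1) for beta_Gj given the partial residual r^(j)
update_beta_Gj <- function(Gj, r_j, lambda1, mu_j, n) {
  drop(soft(crossprod(Gj, r_j) / n, lambda1 - mu_j)) / (sum(Gj^2) / n)
}

set.seed(3)
n   <- 40
Gj  <- rbinom(n, 2, 0.3)        # a genotype column, coded 0/1/2
r_j <- rnorm(n)                 # stand-in for the partial residual r^(j)
update_beta_Gj(Gj, r_j, lambda1 = 0.1, mu_j = 0.02, n = n)
```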
Solution for $\beta_{G_jE}$:

$$ \frac{\partial L}{\partial \beta_{G_jE}} = -\frac{1}{n}(G_j{\circ}E)^T r(\beta) + u_j(\lambda_2 + \mu_j) + \lambda_3(\beta_{G_jE} - Z_j\alpha) = 0, $$

which, in terms of the partial residual, reads

$$ -\frac{1}{n}(G_j{\circ}E)^T \tilde r^{(j)} + \beta_{G_jE}\left(\frac{1}{n}\|G_j{\circ}E\|^2 + \lambda_3\right) + u_j(\lambda_2 + \mu_j) - \lambda_3 Z_j\alpha = 0, $$

where $\tilde r^{(j)} = r(\beta) + (G_j{\circ}E)\beta_{G_jE}$, $Z_j = Z_{[j,:]}$, and

$$ u_j = \begin{cases} \mathrm{sign}(\beta_{G_jE}), & \text{if } \beta_{G_jE} \ne 0 \\ \in [-1, 1], & \text{if } \beta_{G_jE} = 0. \end{cases} $$

Hence

$$ \beta_{G_jE} = \frac{\frac{1}{n}(G_j{\circ}E)^T\tilde r^{(j)} + \lambda_3 Z_j\alpha - u_j(\lambda_2 + \mu_j)}{\frac{1}{n}\|G_j{\circ}E\|^2 + \lambda_3} = \frac{1}{\frac{1}{n}\|G_j{\circ}E\|^2 + \lambda_3}\; S\!\left(\frac{1}{n}(G_j{\circ}E)^T\tilde r^{(j)} + \lambda_3 Z_j\alpha,\ \lambda_2 + \mu_j\right). \tag{I.2} $$

Solution for $\alpha$:

$$ \frac{\partial L}{\partial \alpha_j} = -\lambda_3 Z_j^T(\beta_{GE} - Z\alpha) + \lambda_4 w_j = 0, $$

where $Z_j = Z_{[:,j]}$ and

$$ w_j = \begin{cases} \mathrm{sign}(\alpha_j), & \text{if } \alpha_j \ne 0 \\ \in [-1, 1], & \text{if } \alpha_j = 0. \end{cases} $$

In coordinate form, with the partial residual $\rho^{(j)} = \beta_{GE} - Z\alpha + Z_j\alpha_j$,

$$ -\lambda_3 Z_j^T \rho^{(j)} + \alpha_j\lambda_3\|Z_j\|^2 + \lambda_4 w_j = 0, $$

$$ \alpha_j = \frac{\lambda_3 Z_j^T \rho^{(j)} - \lambda_4 w_j}{\lambda_3\|Z_j\|^2} = \frac{1}{\|Z_j\|^2}\; S\!\left(Z_j^T \rho^{(j)},\ \frac{\lambda_4}{\lambda_3}\right). \tag{I.3} $$

Solution for $\beta_E$:

$$ \frac{\partial L}{\partial \beta_E} = -\frac{1}{n}E^T r(\beta) = 0, \qquad E^T\left(Y - \mathbf{1}\beta_0 - G(\beta^+ - \beta^-) - E\beta_E - (G{\circ}E)\beta_{GE}\right) = 0, $$

$$ \beta_E = \frac{E^T r^{(E)}}{\|E\|^2}, \quad \text{where}\quad r^{(E)} = r(\beta) + E\beta_E. \tag{I.4} $$

Solution for $\beta_0$:

$$ \frac{\partial L}{\partial \beta_0} = -\frac{1}{n}\mathbf{1}^T r(\beta) = 0, \qquad \mathbf{1}^T\left(Y - \mathbf{1}\beta_0 - G(\beta^+ - \beta^-) - E\beta_E - (G{\circ}E)\beta_{GE}\right) = 0, $$

$$ \beta_0 = \frac{1}{n}\mathbf{1}^T r^{(0)}, \quad \text{where}\quad r^{(0)} = r(\beta) + \mathbf{1}\beta_0. \tag{I.5} $$
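The updates (I.2) and (I.3) differ from the plain gesso updates only through the $\lambda_3$ term, which pulls $\beta_{G_jE}$ toward the external score $Z_j\alpha$. A minimal R sketch, with all inputs simulated for illustration (not the package implementation):

```r
soft <- function(x, t) sign(x) * pmax(abs(x) - t, 0)

# (I.2): interaction update, shrunk toward the external score Zj_alpha
update_beta_GjE <- function(GjE, r_tilde_j, Zj_alpha, lambda2, mu_j, lambda3, n) {
  num <- drop(crossprod(GjE, r_tilde_j)) / n + lambda3 * Zj_alpha
  soft(num, lambda2 + mu_j) / (sum(GjE^2) / n + lambda3)
}

# (I.3): coordinate update for alpha_j given the partial residual rho^(j)
update_alpha_j <- function(Zj, rho_j, lambda3, lambda4) {
  soft(drop(crossprod(Zj, rho_j)), lambda4 / lambda3) / sum(Zj^2)
}

set.seed(4)
n <- 40
GjE <- rbinom(n, 2, 0.3) * rbinom(n, 1, 0.5)   # G_j o E for a binary exposure
r_tilde_j <- rnorm(n)
update_beta_GjE(GjE, r_tilde_j, Zj_alpha = 0.2, lambda2 = 0.05,
                mu_j = 0, lambda3 = 1, n = n)
```

Note how $\lambda_3 \to 0$ recovers the uninformed update, while a large $\lambda_3$ makes the denominator dominated by $\lambda_3$ and centers the update at $Z_j\alpha$.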
Appendix J
Block coordinate descent updates for the externally informed gesso

Denote $r = r(\beta) = Y - \mathbf{1}\beta_0 - E\beta_E - G\beta_G - (G{\circ}E)\beta_{GE}$ and

$$ \bar r^{(j)} = r + G_j\beta_{G_j} + (G_j{\circ}E)\beta_{G_jE}, \qquad \tilde r^{(j)} = r + (G_j{\circ}E)\beta_{G_jE} = \bar r^{(j)} - G_j\beta_{G_j}, \qquad r^{(j)} = r + G_j\beta_{G_j} = \bar r^{(j)} - (G_j{\circ}E)\beta_{G_jE}. $$

The coordinate-wise system to solve is

$$
\begin{cases}
\beta_{G_j} = \dfrac{1}{\frac{1}{n}\|G_j\|^2}\; S\!\left(\frac{1}{n}G_j^T r^{(j)},\ \lambda_1 - \mu_j\right) \\[6pt]
\beta_{G_jE} = \dfrac{1}{\frac{1}{n}\|G_j{\circ}E\|^2 + \lambda_3}\; S\!\left(\frac{1}{n}(G_j{\circ}E)^T \tilde r^{(j)} + \lambda_3 Z_j\alpha,\ \lambda_2 + \mu_j\right) \\[6pt]
\mu_j\left(|\beta_{G_jE}| - (\beta_{G_j}^+ + \beta_{G_j}^-)\right) = 0
\end{cases}
=
\begin{cases}
\beta_{G_j} = \dfrac{1}{\frac{1}{n}\|G_j\|^2}\; S\!\left(\frac{1}{n}G_j^T(\bar r^{(j)} - (G_j{\circ}E)\beta_{G_jE}),\ \lambda_1 - \mu_j\right) \\[6pt]
\beta_{G_jE} = \dfrac{1}{\frac{1}{n}\|G_j{\circ}E\|^2 + \lambda_3}\; S\!\left(\frac{1}{n}(G_j{\circ}E)^T(\bar r^{(j)} - G_j\beta_{G_j}) + \lambda_3 Z_j\alpha,\ \lambda_2 + \mu_j\right) \\[6pt]
\mu_j\left(|\beta_{G_jE}| - (\beta_{G_j}^+ + \beta_{G_j}^-)\right) = 0.
\end{cases}
$$

Note 1: consider the equations $\frac{\partial L}{\partial \beta_{G_j}^+} = \frac{\partial L}{\partial \beta_{G_j}^-} = 0$:

$$ \frac{\partial L}{\partial \beta_{G_j}^+} = -\frac{1}{n}G_j^T r(\beta) + \lambda_1 - \mu_j - \delta_j^+ = 0, \qquad \frac{\partial L}{\partial \beta_{G_j}^-} = \frac{1}{n}G_j^T r(\beta) + \lambda_1 - \mu_j - \delta_j^- = 0. $$

If we add these two equations we get $2\lambda_1 - 2\mu_j - \delta_j^+ - \delta_j^- = 0$, so $\lambda_1 - \mu_j = \frac{1}{2}(\delta_j^+ + \delta_j^-) \ge 0$, hence $\mu_j \in [0, \lambda_1]$.

Note 2: if $\beta_{G_j}^+ = 0$ or $\beta_{G_j}^- = 0$, then $|\beta_{G_j}| = \beta_{G_j}^+ + \beta_{G_j}^-$. If both $\beta_{G_j}^+$ and $\beta_{G_j}^-$ are non-zero, then the equality does not hold (see the convex relaxation details).

Case 1: $\mu_j > 0$, $\lambda_1 = \mu_j$. From the complementary slackness condition we have $\mu_j\left(|\beta_{G_jE}| - (\beta_{G_j}^+ + \beta_{G_j}^-)\right) = 0$, and since $\mu_j = \lambda_1 \ne 0$, it follows that $|\beta_{G_jE}| - (\beta_{G_j}^+ + \beta_{G_j}^-) = 0$, so $\beta_{G_j}^- = |\beta_{G_jE}| - \beta_{G_j}^+$. From case (4) of Appendix I we have $\beta_{G_j} = \beta_{G_j}^+ - \beta_{G_j}^- = \frac{G_j^T r^{(j)}}{\|G_j\|^2}$, so

$$ \beta_{G_j}^+ = \frac{G_j^T r^{(j)}}{\|G_j\|^2} + \beta_{G_j}^- = \frac{G_j^T r^{(j)}}{\|G_j\|^2} + |\beta_{G_jE}| - \beta_{G_j}^+. $$

Then

$$ \begin{cases} \beta_{G_j}^+ = \dfrac{G_j^T r^{(j)}}{2\|G_j\|^2} + \dfrac{1}{2}|\beta_{G_jE}| > 0 \\[6pt] \beta_{G_j}^- = -\dfrac{G_j^T r^{(j)}}{2\|G_j\|^2} + \dfrac{1}{2}|\beta_{G_jE}| > 0, \end{cases} $$

and $|\beta_{G_jE}| > \left|\frac{G_j^T r^{(j)}}{\|G_j\|^2}\right| = |\beta_{G_j}|$. To obtain $\beta_{G_jE}$ and $\beta_{G_j}$ we have to solve a system of linear equations with respect to $\beta_{G_j}$, $\beta_{G_jE}$:

$$ \begin{cases} \beta_{G_j} = \dfrac{G_j^T\left(\bar r^{(j)} - (G_j{\circ}E)\beta_{G_jE}\right)}{\|G_j\|^2} \\[6pt] \beta_{G_jE} = \dfrac{1}{\frac{1}{n}\|G_j{\circ}E\|^2 + \lambda_3}\; S\!\left(\frac{1}{n}(G_j{\circ}E)^T(\bar r^{(j)} - G_j\beta_{G_j}) + \lambda_3 Z_j\alpha,\ \lambda_2 + \lambda_1\right) \\[6pt] |\beta_{G_jE}| > |\beta_{G_j}|. \end{cases} $$

Case 2: $\mu_j \in [0, \lambda_1)$, $\beta_{G_j} = 0$, $\beta_{G_jE} = 0$. From case (3) of Appendix I, $\beta_{G_j} = 0$ when $\left|\frac{1}{n}G_j^T r^{(j)}\right| \le \lambda_1 - \mu_j$, and $\beta_{G_jE} = 0$ when $\left|\frac{1}{n}(G_j{\circ}E)^T\tilde r^{(j)} + \lambda_3 Z_j\alpha\right| \le \lambda_2 + \mu_j$. Then

$$ \max\left(\left|\frac{1}{n}(G_j{\circ}E)^T\tilde r^{(j)} + \lambda_3 Z_j\alpha\right| - \lambda_2,\ 0\right) \le \mu_j \le \lambda_1 - \left|\frac{1}{n}G_j^T r^{(j)}\right|, $$

and we can set $\mu_j$ to any value from the feasible range; we set $\mu_j$ to the upper bound $\mu_j = \lambda_1 - \left|\frac{1}{n}G_j^T r^{(j)}\right|$.

Case 3: $\mu_j \in [0, \lambda_1)$, $\beta_{G_j}, \beta_{G_jE} \ne 0$, $|\beta_{G_j}| = |\beta_{G_jE}|$ ($\mu_j \ne 0$). Solve a system of linear equations with respect to $\beta_{G_j}$, $\beta_{G_jE}$, and $\mu_j$:

$$ \begin{cases} \beta_{G_j} = \dfrac{1}{\frac{1}{n}\|G_j\|^2}\; S\!\left(\frac{1}{n}G_j^T(\bar r^{(j)} - (G_j{\circ}E)\beta_{G_jE}),\ \lambda_1 - \mu_j\right) \\[6pt] \beta_{G_jE} = \dfrac{1}{\frac{1}{n}\|G_j{\circ}E\|^2 + \lambda_3}\; S\!\left(\frac{1}{n}(G_j{\circ}E)^T(\bar r^{(j)} - G_j\beta_{G_j}) + \lambda_3 Z_j\alpha,\ \lambda_2 + \mu_j\right) \\[6pt] \mu_j\left(|\beta_{G_jE}| - |\beta_{G_j}|\right) = 0. \end{cases} $$

Case 4: $\mu_j = 0$, $\beta_{G_jE} = 0$.

$$ \beta_{G_j} = \frac{1}{\frac{1}{n}\|G_j\|^2}\; S\!\left(\frac{1}{n}G_j^T r^{(j)},\ \lambda_1\right), $$

and check that $\left|\frac{1}{n}(G_j{\circ}E)^T(\bar r^{(j)} - G_j\beta_{G_j}) + \lambda_3 Z_j\alpha\right| \le \lambda_2$.

Case 5: $\mu_j = 0$. Solve a system of linear equations with respect to $\beta_{G_j}$, $\beta_{G_jE}$:

$$ \begin{cases} \beta_{G_j} = \dfrac{1}{\frac{1}{n}\|G_j\|^2}\; S\!\left(\frac{1}{n}G_j^T(\bar r^{(j)} - (G_j{\circ}E)\beta_{G_jE}),\ \lambda_1\right) \\[6pt] \beta_{G_jE} = \dfrac{1}{\frac{1}{n}\|G_j{\circ}E\|^2 + \lambda_3}\; S\!\left(\frac{1}{n}(G_j{\circ}E)^T(\bar r^{(j)} - G_j\beta_{G_j}) + \lambda_3 Z_j\alpha,\ \lambda_2\right). \end{cases} $$

Appendix K
Simulations with FAMILY

[Figure K.1: Strong hierarchical mode. $p = 1000$, $n = 120$, $n_{G\,\text{non-zero}}/n_{G\times E\,\text{non-zero}} = 10/5$. Model performance as a function of the number of interactions discovered.]

[Figure K.2: Hierarchical mode. $p = 1000$, $n = 120$, $n_{G\,\text{non-zero}}/n_{G\times E\,\text{non-zero}} = 10/5$. Model performance as a function of the number of interactions discovered.]

[Figure K.3: Anti-hierarchical mode. $p = 1000$, $n = 120$, $n_{G\,\text{non-zero}}/n_{G\times E\,\text{non-zero}} = 10/5$. Model performance as a function of the number of interactions discovered.]

Appendix L
Weighted least squares gesso

$$
\begin{aligned}
\underset{\beta}{\text{minimize}}\quad & \frac{1}{2}\left\|w^{1/2} \circ \left(Y - (\mathbf{1}\beta_0 + G(\beta_G^+ - \beta_G^-) + E\beta_E + (G{\circ}E)\beta_{GE})\right)\right\|_2^2 \\
&+ \lambda_2\|\beta_{GE}\|_1 + \lambda_1\mathbf{1}^T(\beta_G^+ + \beta_G^-) \\
\text{subject to}\quad & |\beta_{G_jE}| \le \beta_{G_j}^+ + \beta_{G_j}^-,\quad \beta_{G_j}^{\pm} \ge 0, \quad \text{for } j = 1, \dots, p.
\end{aligned}
$$

Denote $X\beta = \mathbf{1}\beta_0 + G(\beta_G^+ - \beta_G^-) + E\beta_E + (G{\circ}E)\beta_{GE}$ and $r(\beta) = Y - X\beta$, where $\beta = (\beta_0, \beta_E, \beta_G^+, \beta_G^-, \beta_{GE})$.

Solutions:

$$ \beta_{G_j} = \frac{1}{G_j^T(w \circ G_j)}\; S\!\left(G_j^T(w \circ \bar r^{(j)}),\ \lambda_1 - \mu_j\right), \quad \text{where}\quad \bar r^{(j)} = r(\beta) + G_j(\beta_{G_j}^+ - \beta_{G_j}^-), \tag{L.1} $$

$$ \beta_{G_jE} = \frac{1}{(G_j{\circ}E)^T\left(w \circ (G_j{\circ}E)\right)}\; S\!\left((G_j{\circ}E)^T(w \circ \tilde r^{(j)}),\ \lambda_2 + \mu_j\right), \quad \text{where}\quad \tilde r^{(j)} = r(\beta) + (G_j{\circ}E)\beta_{G_jE}. \tag{L.2} $$

L.1 Dual formulation

Primal gesso formulation:

$$
\begin{aligned}
\underset{\beta,\ \beta_{GE},\ z}{\text{minimize}}\quad & \frac{1}{2}z^T(w \circ z) + \lambda_1\mathbf{1}^T(\beta_G^+ + \beta_G^-) + \lambda_2\|\beta_{GE}\|_1 \\
\text{subject to}\quad & |\beta_{GE}| \preceq \beta_G^+ + \beta_G^-,\quad \beta_G \succeq 0,\quad z = Y - X\beta.
\end{aligned}
\tag{L.3}
$$

Lagrangian:

$$ L = \frac{1}{2}z^T(w \circ z) + \lambda_1\mathbf{1}^T(\beta_G^+ + \beta_G^-) + \lambda_2\|\beta_{GE}\|_1 + \mu^T(|\beta_{GE}| - \beta_G^+ - \beta_G^-) - \delta^{+T}\beta_G^+ - \delta^{-T}\beta_G^- + \nu^T(Y - X\beta - z). $$

$$ \frac{\partial L}{\partial z} = w \circ z - \nu = 0, \qquad \nu = w \circ z \quad \text{and} \quad z = w^{-1} \circ \nu. $$

Dual gesso formulation. Dual objective:

$$ D(\nu) = -\frac{1}{2}(w^{-1}\circ\nu)^T\left(w \circ (w^{-1}\circ\nu)\right) + \nu^T Y = -\frac{1}{2}\nu^T(w^{-1}\circ\nu) + \nu^T Y = \frac{1}{2}\left(\|Y\|_w^2 - \|\nu - w \circ Y\|_{w^{-1}}^2\right). $$

Dual feasible region:

$$ \begin{cases} |\nu^T G_j| \le \lambda_1 - \mu_j, \\ |\nu^T (G_j{\circ}E)| \le \lambda_2 + \mu_j, \\ \nu^T\mathbf{1} = 0, \\ \nu^T E = 0. \end{cases} $$

L.2 Naive projection

$$ \nu^{(t+1)} = x\,\nu^{(t)} = [\nu = w \circ z] = x\,(w \circ z^{(t)}), $$

$$ \|\nu - w \circ Y\|_{w^{-1}}^2 = \left\|x(w \circ z^{(t)}) - w \circ Y\right\|_{w^{-1}}^2 = (x z^{(t)} - Y)^T\, w\, (x z^{(t)} - Y) = \|x z^{(t)} - Y\|_w^2, $$

$$ D(\nu) = \frac{1}{2}\left(\|Y\|_w^2 - \|\nu - w \circ Y\|_{w^{-1}}^2\right) = \frac{1}{2}\left(\|Y\|_w^2 - \|x z^{(t)} - Y\|_w^2\right). $$

The projection problem becomes

$$
\begin{aligned}
\underset{x,\ \mu}{\text{maximize}}\quad & D(\nu_0) \\
\text{subject to}\quad & |\nu_0^T (G{\circ}E)| \preceq \lambda_2 + \mu,\quad |\nu_0^T G| \preceq \lambda_1 - \mu,\quad \mu \succeq 0,
\end{aligned}
\tag{L.4}
$$

where $\nu_0 = x\,(w \circ z^{(t)})$ and $D(\nu) = \frac{1}{2}\left(\|Y\|_w^2 - \|x z^{(t)} - Y\|_w^2\right)$. Maximizing $D$ amounts to minimizing $\|x z^{(t)} - Y\|_w^2 = \sum_{i=1}^n (x z_i - Y_i)^2 w_i$ over $x$:

$$ \frac{\partial}{\partial x}\sum_{i=1}^n (x z_i - Y_i)^2 w_i = \sum_{i=1}^n 2(x z_i - Y_i)\, z_i w_i = 0, \qquad x^* = \frac{\sum_i w_i Y_i z_i}{\sum_i w_i z_i^2}. $$
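A minimal R sketch of the weighted update (L.1) and of the projection coefficient $x^*$ from Section L.2, with simulated stand-ins for $w$, $G_j$, $z$, and $Y$ (illustration only, not the package code):

```r
soft <- function(x, t) sign(x) * pmax(abs(x) - t, 0)

set.seed(5)
n   <- 40
w   <- runif(n, 0.1, 1)                    # observation weights
Gj  <- rbinom(n, 2, 0.3)
r_j <- rnorm(n)                            # stand-in for the partial residual rbar^(j)

# (L.1), with the threshold lambda1 - mu_j set to 0.1 for illustration
beta_Gj <- soft(sum(Gj * w * r_j), 0.1) / sum(w * Gj^2)

Y <- rnorm(n); z <- rnorm(n)
x_star <- sum(w * Y * z) / sum(w * z^2)    # minimizer of ||x z - Y||_w^2 (Section L.2)
c(beta_Gj, x_star)
```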
Appendix M
Dual formulation for logistic gesso

M.1 Dual formulation

$$
\begin{aligned}
\underset{\beta,\ z}{\text{minimize}}\quad & \sum_{i=1}^n w_i\left(\log(1 + \exp(z_i)) - Y_i z_i\right) + \lambda_1\mathbf{1}^T(\beta_G^+ + \beta_G^-) + \lambda_2\|\beta_{GE}\|_1 \\
\text{subject to}\quad & z = X\beta,\quad |\beta_{GE}| \preceq \beta_G^+ + \beta_G^-,\quad \beta_G \succeq 0.
\end{aligned}
\tag{M.1}
$$

$$ L = l(z) + \lambda_1\mathbf{1}^T(\beta_G^+ + \beta_G^-) + \lambda_2\|\beta_{GE}\|_1 + \mu^T(|\beta_{GE}| - \beta_G^+ - \beta_G^-) - \delta^{+T}\beta_G^+ - \delta^{-T}\beta_G^- + \nu^T(z - X\beta) \to \min_{z,\ \beta}. $$

The stationarity conditions are given by:

$$ \frac{\partial l}{\partial z_i} = w_i\left(\frac{\exp(z_i)}{1 + \exp(z_i)} - Y_i\right) + \nu_i = 0, \qquad \nu_i = w_i\left(Y_i - \sigma(z_i)\right), $$

so that, element-wise,

$$ \nu = w \circ \left(y - \sigma(z)\right), \qquad z = \log\frac{y - \nu/w}{1 - (y - \nu/w)}, $$

where $\sigma(z) = 1/(1 + \exp(-z))$. The primal–dual link equation follows:

$$ \nu = w \circ \left(Y - \sigma(X\beta)\right). $$

We calculate the dual function. Denote $s = y - \nu/w$ (element-wise), so that $z = \log\frac{s}{1 - s}$ and $\log(1 + \exp(z)) = -\log(1 - s)$:

$$
\begin{aligned}
D(\nu) = \min_{z,\ \beta} L &= w^T\left(\log(1 + \exp(z)) - y \circ z\right) + \nu^T z \;\Big|_{z = \log\frac{s}{1-s}} \\
&= w^T\left(-\log(1 - s) - y \circ \log\frac{s}{1-s}\right) + \left(w \circ (y - s)\right)^T\log\frac{s}{1-s} \\
&= -w^T\left(s \circ \log s + (1 - s)\circ\log(1 - s)\right) \\
&= -w^T\left(\left(y - \frac{\nu}{w}\right)\circ\log\left(y - \frac{\nu}{w}\right) + \left(1 - y + \frac{\nu}{w}\right)\circ\log\left(1 - y + \frac{\nu}{w}\right)\right).
\end{aligned}
$$

M.2 Quadratic approximation for the dual objective

$$ g(x) = x\log(x) + (1 - x)\log(1 - x) \approx g(x_{\min}) + g'(x_{\min})(x - x_{\min}) + \frac{1}{2}g''(x_{\min})(x - x_{\min})^2 = \left[x_{\min} = \tfrac{1}{2}\right] = -\log 2 + 2\left(x - \frac{1}{2}\right)^2, $$

since

$$ g'(x) = \log(x) - \log(1 - x) = 0 \implies x_{\min} = \frac{1}{2}, \qquad g''(x) = \frac{1}{x} + \frac{1}{1 - x}, \quad \text{and}\quad g''(x_{\min}) = 4. $$
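The accuracy of this expansion near $x = 1/2$ is easy to check numerically; the snippet below is a verification sketch, not part of the package:

```r
# Compare the exact entropy term with its quadratic expansion around x = 1/2
g      <- function(x) x * log(x) + (1 - x) * log(1 - x)
g_quad <- function(x) -log(2) + 2 * (x - 0.5)^2
x <- seq(0.3, 0.7, by = 0.1)
round(cbind(x, exact = g(x), quadratic = g_quad(x)), 4)
```

The two columns agree to about two decimal places on this range, which is what makes the quadratic surrogate useful for bounding the logistic dual.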
Abstract
We describe a novel high-dimensional regression method for modeling gene-environment interactions that accounts for all of the investigated genomic features simultaneously and imposes a main-effect-before-interaction hierarchical structure.

We propose an efficient fitting algorithm and screening rules that can discard large numbers of irrelevant predictors with high accuracy. We present simulation results showing that the model outperforms existing joint selection methods for G×E interactions in terms of selection performance, scalability, and speed. This algorithm removes the computational bottleneck for performing joint hierarchical selection of interactions, providing an effective alternative to the currently dominant paradigm of one-predictor-at-a-time analyses. The algorithm is based on screening rules that can discard large numbers of null predictors, making high-dimensional but sparse problems computationally tractable. We provide theoretical justification and computational experiments for the screening rules we develop. Our implementation of the algorithm is available in the gesso R package on CRAN.

We showcase our approach with an application to a large-scale methylation dataset, including 200,000 CpG markers per subject, identifying four probes that show sex specificity in their effect on human aging. In GWAS and TWAS applications we identify SNPs showing sex specificity in their effect on childhood asthma and genes associated with colon cancer in interaction with aspirin intake.