Statistical Analysis of High-throughput Genomic Data

by

Jie Liu

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOSTATISTICS)

May 2017

Copyright 2016 Jie Liu

Contents

List of Figures
List of Tables
Abstract
Chapter 1 - An evaluation of processing methods for HumanMethylation450 BeadChip Data
  1.1 Introduction
    1.1.1 DNA methylation
    1.1.2 Infinium HumanMethylation array
    1.1.3 Problems associated with Infinium HumanMethylation data analysis
  1.2 Methods
    1.2.1 Preprocessing Methods
    1.2.2 Illumina HumanMethylation450 data sets
    1.2.3 Evaluation metrics
    1.2.4 Feature Filtering
  1.3 Results
    1.3.1 Reduction in technical variance
    1.3.2 Cross-platform comparison with whole-genome bisulfite sequencing
    1.3.3 Reproducibility of differential DNA methylation analysis
  1.4 Discussion
  1.5 Conclusion
Chapter 2 - Multi-tuning parameter penalized regression for sample classification with data integration
  2.1 Introduction
    2.1.1 Omics data integration
    2.1.2 Types of integrative genomic analysis
    2.1.3 Regularized regression
    2.1.4 Problems of standard Lasso and Elastic net in data integration
  2.2 Methods
    2.2.1 Multi-tuning parameter Lasso regression
    2.2.2 Multi-tuning parameter Elastic net regression
    2.2.3 Tuning of parameters in multi-tuning parameter Lasso and Elastic net
    2.2.4 Simulation study
    2.2.5 Maximum possible AUC
    2.2.6 Prediction model evaluation metrics
    2.2.7 Real data analysis
  2.3 Results
    2.3.1 Independent Settings
      2.3.1.1 Comparison of maximum possible AUCs
      2.3.1.2 Performance of multi-tuning parameter Lasso
      2.3.1.3 Comparison between late and early data integration
      2.3.1.4 Performance of multi-tuning parameter Lasso in different simulation settings
    2.3.2 Correlated Settings
      2.3.2.1 Comparison of maximum possible AUCs
      2.3.2.2 Performance of multi-tuning parameter Elastic net
      2.3.2.3 Comparison between late and early integration
      2.3.2.4 Performance of multi-tuning Elastic net in different simulation settings
    2.3.3 Real data analysis
  2.4 Discussion
  2.5 Future Work
References
Additional Files

List of Figures

Figure 1.1 Evaluation of preprocessing methods in reducing technical variance using the PBLs dataset.
Figure 1.2 Cross-platform comparison of Beta values from HumanMethylation450 vs whole-genome bisulfite sequencing.
Figure 1.3 Multidimensional scaling plot of distances between samples in three datasets.
Figure 1.4 Overlap plot and ROC curves for KIRC (A,B) and COAD (C,D) datasets.
Figure 1.5 Overlap plot and ROC curves for the brain dataset without batch correction (A,B) and with batch correction (C,D).
Figure 2.1.1 Late and early data integration.
Figure 2.1.2 Illustration of the problems in standard Lasso or standard Elastic net.
Figure 2.2.1 Illustration of the simulation studies for the independent setting.
Figure 2.2.2 Illustration of the simulation for correlated settings.
Figure 2.3.1 Maximum possible AUCs for four simulation settings.
Figure 2.3.2 Empirical AUCs as a function of the weight parameter (N=200 simulated data sets).
Figure 2.3.3 Comparison between early and late data integration.
Figure 2.3.4 The optimal weight increases with increased Coef1 when r1 and r2 are fixed (N=200 simulated data sets).
Figure 2.3.5 The optimal weight decreases with increased r2 when Coef1 and Coef2 are fixed (N=200 simulated data sets).
Figure 2.3.6 The weight parameter is influenced by the proportion, not the number, of non-zero coefficients in individual datasets (N=200 simulated data sets).
Figure 2.3.7 The optimal weight increases with the total number of features (N=200 simulated data sets).
Figure 2.3.8 Maximum possible AUCs for the two correlated simulation settings (N=200 simulated data sets).
Figure 2.3.9 AUCs as a function of the weight parameter in two correlated data settings (N=200 simulated data sets).
Figure 2.3.10 AUCs for early and late data integration in correlated settings (N=200 simulated data sets).
Figure 2.3.11 Change of the optimal weight parameter under different correlated data simulation settings (N=200 simulated data sets).
Figure 2.3.12 Change of AUCs for different weight parameters in Elastic net (red) and Lasso (black) when there are correlations between informative genes and noise gene features (Cor_Noise>0).
Figure 2.3.13 Performance of multi-tuning parameter penalized regression in (A) AML data and (B) PRAD data.
Additional Figure 2.1 Maximum possible AUCs with varied non-zero coefficients (N=200).
Additional Figure 2.2 Empirical AUCs as a function of the weight parameter.
Additional Figure 2.3 Spaghetti plot of AUCs for standard Lasso and multi-tuning parameter Lasso (N=100 simulated data sets).
Additional Figure 2.4 Results of evaluation metrics for the standard Lasso (yellow) vs. multi-tuning parameter Lasso (red).
Additional Figure 2.5 Bar plot of the percentage of times that each informative variable is selected across 100 replicated simulation experiments when the inter-array correlation is zero (Cor_inter=0).
Additional Figure 2.6 Bar plot of the percentage of times that each informative variable is selected across 100 replicated simulation experiments when the inter-array correlation is larger than zero (Cor_inter>0).
Additional Figure 2.7 Difference in testing AUCs between Elastic net and Lasso under different simulation settings (N=200 simulated data sets).
Additional Figure 2.8 Difference in testing AUCs between multi-tuning parameter Elastic net and standard Elastic net under different simulation settings (N=200 simulated data sets).
Additional Figure 2.9 Bar plot of the percentage of times that each informative variable is selected across 100 replicated simulation experiments with varied correlation coefficients between informative methylation features.
Additional Figure 2.10 Bar plot of the percentage of times that each informative variable is selected across 100 replicated simulation experiments with a varied number of correlated informative methylation features.
Additional Figure 2.11 Maximum possible AUCs for individual data types (Type 1 and Type 2) and combined data under different simulation settings (N=200 simulated data sets for each simulation setting).
Additional Figure 2.12 Estimated coefficients for the selected features from the multi-tuning parameter Elastic net model in PRAD data.

List of Tables

Additional Table 1.1 Summary of preprocessing methods used in this study.
Additional Table 1.2 Number of samples in TCGA-KIRC, TCGA-COAD, and Alzheimer's disease brain data assigned to Discovery and Validation data sets by disease status and plate.
Additional Table 2.1 Statistical summaries of paired t-tests between standard Lasso and multi-tuning parameter Lasso.

Abstract

The development of high-throughput technologies has made it possible to use high-dimensional molecular data to unravel relationships between genomic alterations and disease phenotypes, to describe characteristic molecular profiles of diseased populations, and to predict diagnosis or prognosis for patients. The major goal of this dissertation is to develop novel statistical methods for analyzing high-throughput molecular data. Data quality is fundamental to any research project. The first part of this dissertation focuses on low-level signal processing and batch effect correction of HumanMethylation450 (HM450) BeadChip arrays, a commonly used platform for high-throughput DNA methylation analysis. As with other microarray platforms, technical artifacts are a concern for HM450 data, including background fluorescence, dye bias from the use of two color channels, bias caused by the type I/II probe design, and batch effects. Several approaches and pipelines have been developed, either targeting a single issue or designed to address multiple biases through a combination of methods. We evaluate the effect of combining separate approaches to improve signal processing. In our study, nine processing methods, including both within- and between-array methods, were applied and compared in four datasets. The comparisons in our study provided valuable insights into preprocessing HM450 BeadChip data.
They showed an advantage to using BMIQ to adjust type I/II bias in combination with Noob, a within-array procedure that corrects background fluorescence and dye bias, or with Funnorm, a between-array normalization procedure utilizing control probes. We also saw a benefit from using RUVm, a new variant of Remove Unwanted Variation designed for DNA methylation data, when batch effects were apparent and differential DNA methylation was the goal. Performing RUVm on the processed data led to considerable improvement in the reproducibility of the list of top differentially methylated markers. The results of this project have been published in BMC Genomics.

The second part of this dissertation develops methods for integrating different omics data to predict disease subtypes. By combining measurements from multiple platforms, such as gene expression, methylation, and copy number variation, investigators hope to exploit complementary information and obtain a more accurate prediction model than is possible with a single data type alone. Penalized regression, such as the Lasso and Elastic net, has been used to build prediction models for data integration. However, previous studies used a single penalized regression model that penalizes features from all platforms equally. Using a uniform penalty parameter for all features is problematic, since we risk missing subtle but important features. We propose an approach termed "multi-tuning parameter penalized regression", in which we use separate tuning parameters to allow differential shrinkage across features from different platforms. Through intensive simulation studies, we identify scenarios in which multi-tuning parameter penalized regression outperforms the standard approach in terms of both prediction accuracy and variable selection, considering situations in which variables are independent as well as correlated. In addition, we show that combining features before model building (i.e., "early integration") generally works better than selecting variables from individual platforms and then building a model on the combined selected features (i.e., "late integration"). Lastly, we investigate the performance of multi-tuning parameter penalized regression in real cancer datasets obtained from TCGA and GEO.

This thesis focuses on the development of statistical methods for high-dimensional genomic data analysis. Specifically, we set up an analytical pipeline, from the processing of high-throughput HM450 data to the use of the cleaned data in predicting disease subtypes. We identified the appropriate combination of preprocessing methods for the HM450 array in terms of reducing technical variance and improving reproducibility in differential methylation analysis, and we proposed novel Lasso and Elastic net variants specifically designed for classification when integrating data from different genomic platforms. The methods developed and evaluated in this thesis provide guidance on best practices for the statistical analysis of high-throughput 'omic' data.

Chapter 1 - An evaluation of processing methods for HumanMethylation450 BeadChip Data

1.1 Introduction

1.1.1 DNA methylation

Epigenetics provides stability and diversity to the cellular phenotype through chromatin marks that affect local transcriptional potential and that are preserved or regenerated during cell division.
DNA methylation, a covalent chemical modification of DNA occurring at cytosine residues in CpG dinucleotides, is the most studied form of epigenetic modification. The mammalian DNA methylation machinery is composed of two components: the DNA methyltransferases (DNMTs), which establish and maintain DNA methylation patterns, and the methyl-CpG binding proteins (MBDs), which are involved in 'reading' methylation marks (Robertson, 2005). When located at gene promoters, DNA methylation is usually a repressive mark. However, CpG DNA methylation is increased in the gene bodies of actively transcribed genes in plants and mammals (Laird, 2010). As a crucial epigenetic modification of the genome, DNA methylation regulates chromatin structure and gene expression in processes such as X chromosome inactivation, imprinting, embryogenesis, gametogenesis, and silencing of repetitive DNA elements (E. Li, 2002). Consistent with these important roles, a growing number of human diseases have been found to be associated with aberrant DNA methylation (Robertson, 2005). In normal cells, DNA methylation occurs predominantly in repetitive genomic regions. DNA methylation represses transcription directly, by inhibiting the binding of specific transcription factors, and indirectly, by recruiting methyl-CpG-binding proteins and their associated repressive chromatin remodeling activities (Robertson, 2005). Changes in the patterns and levels of global and regional DNA methylation regulate development and contribute directly to disease states. Dynamic regulation of de novo DNA methyltransferase expression occurs during development, with higher levels in undifferentiated cells and reduced expression upon differentiation. DNMTs are required for cellular differentiation during early embryonic development to regulate the systematic transcriptional inactivation of particular genes by promoter methylation (Gopalakrishnan, Van Emburgh, & Robertson, 2008). Epigenetic defects associated with cancer are characterized by global genomic hypomethylation and locus-specific hypermethylation of CpG islands. Hypomethylation mostly involves repetitive regions and transposable elements, although it also occurs within promoters of normally silenced genes. Gene-specific hypermethylation frequently occurs in the promoter regions of tumor suppressor genes. These alterations, in turn, lead to genomic instability. DNA methylation alterations have also been described for other types of diseases, such as neurological and autoimmune diseases, and other disorders such as cardiovascular diseases, metabolic diseases and myopathies (Portela & Esteller, 2010). Furthermore, epidemiological studies have revealed associations between DNA methylation and exposures such as prenatal maternal smoking (Lee & Pausova, 2013) and environmental chemicals (Hou, Zhang, Wang, & Baccarelli, 2012).

1.1.2 Infinium HumanMethylation array

Given the important implications of DNA methylation for normal biology and disease, profiling DNA methylation across the genome is vital to understanding its influence. The bisulfite-conversion technique, which converts unmethylated cytosines to uracils, followed by amplification, effectively turns an epigenetic difference into a genetic difference, and allows existing genotyping and sequencing technologies to be repurposed for studies of DNA methylation.
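As a toy illustration of this conversion (hypothetical sequences, not data from this study), consider a short template with one CpG cytosine and one non-CpG cytosine:

    # Toy illustration of bisulfite conversion (hypothetical sequences).
    # Unmethylated cytosines deaminate to uracil and are read as 'T' after
    # amplification; methylated cytosines are protected and still read as 'C'.
    ref <- "TACGTCAT"               # 'C' at position 3 is in a CpG; 'C' at position 6 is not

    unmeth <- gsub("C", "T", ref)   # fully unmethylated template: "TATGTTAT"

    meth <- ref                     # template methylated at the CpG
    substr(meth, 6, 6) <- "T"       # only the unprotected cytosine converts: "TACGTTAT"

After conversion, the methylation state of the CpG is encoded as a C/T difference at a fixed position, which is exactly the kind of variation that genotyping chemistries are designed to read.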
The Illumina Infinium platform incorporates a whole-genome amplification step after bisulfite conversion, which is followed by fragmentation and hybridization of the sample to methylation-specific DNA oligomers that are linked to individual bead types. The Illumina Infinium HumanMethylation450 (HM450) BeadChip and the MethylationEPIC (EPIC) BeadChip for DNA methylation studies have reached a predominant place in the market and the scientific arena, due to their advantages in reagent cost and time, comprehensive coverage, and high throughput. Notably, the HM450 array was selected for The Cancer Genome Atlas (TCGA) studies (Weisenberger, 2014), and was used in numerous studies investigating aberrant DNA methylation profiles in cancers (Naumov et al., 2013; Williams, Anderton, Lee, Pentecost, & Arcaro, 2014) and the aging process (Heyn et al., 2012; Horvath, 2013). Recently, the HM450 platform was replaced by the MethylationEPIC BeadChip, which contains >90% of the sites covered in HM450 and adds 333,265 CpGs located in enhancer regions identified by the ENCODE and FANTOM5 projects (Moran, Arribas, & Esteller, 2015). In this study we evaluated preprocessing methods using HM450 array data, since it was the technology in use until 2015, and thousands of HM450 samples are available for study from public data repositories. However, we note that the methods we evaluated are not specific to this platform and can be applied equally to the new EPIC array.

The Infinium assay detects methylation status with single-base resolution. The HM450 array interrogates 485,577 targets, of which 482,421 are CpGs (>99.3%), 3,091 are CpHs (0.6%) and 65 are SNPs (<0.1%). A total of 848 control bead types are assayed, 613 of which are 'negative' controls. At each targeted cytosine position, the fluorescence intensities of the methylated and unmethylated signals are measured. DNA methylation can then be estimated as the proportion of the total signal coming from the methylated channel. A benefit of Illumina's HM450 BeadArray design is that for each probe sequence, a median of 14 beads are randomly distributed on the array, with each bead containing hundreds of thousands of oligonucleotides. Further, Illumina organizes arrays into 12-sample BeadChips, offering great potential for automation (Triche, Weisenberger, Van Den Berg, Laird, & Siegmund, 2013). The HM450, as well as the MethylationEPIC array, is unique in that it uses a combination of two chemical assays, corresponding to two probe designs, Infinium I and II. In the Infinium I assay, the methylation level of each CpG locus is interrogated by two 50-base probes in the same color channel. The 3' terminus of the probe is designed to match either the protected cytosine (methylated design) or the thymine base resulting from bisulfite conversion and amplification (unmethylated design) (Bibikova et al., 2011). The color channel is determined by the base downstream of the 'C' of the target CpG (Cy3-Green for 'C' or 'G' and Cy5-Red for 'A' or 'T'), so that both methylated and unmethylated probes fluoresce at the same wavelength for a single CpG target. The methylation level can be calculated by comparing the intensities from the two different probes in the same color: β = M/(U + M). It is assumed that the methylation status of CpGs underlying the 50 bp probe body is correlated with that of the target CpG, such that CpGs in the probe body of an unmethylated (converted) probe are also converted, while CpGs in the body of a methylated (unconverted) probe will remain unconverted.
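To make the β computation concrete, here is a minimal sketch with toy intensity values (the intensities and the offset variant are illustrative; Illumina's own software computes β = M/(U + M + 100), adding a small constant to stabilize estimates when both intensities are near background):

    # Beta values from methylated (M) and unmethylated (U) intensities for a
    # few hypothetical probes (toy values, not study data)
    M <- c(12000, 300, 6500)
    U <- c(400, 11000, 6000)

    beta_raw    <- M / (M + U)          # the definition used in the text
    beta_offset <- M / (M + U + 100)    # offset variant used by Illumina's software
    round(beta_raw, 3)                  # 0.968 0.027 0.520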
By contrast, in the Infinium II design, each target CpG is interrogated using a single 49-base oligonucleotide read in two color channels. The probe hybridizes to the bisulfite-converted target sequence in a non-DNA-methylation-dependent fashion upstream of the target CpG. Methylation state is detected by single-base extension at the position of the 'C' of the target CpG, which always results in the addition of a Cy3-linked 'G' or Cy5-linked 'A' nucleotide, complementary to either the 'methylated' C or 'unmethylated' T, respectively. These probes are designed to fluoresce at either wavelength (Cy3 for methylated, Cy5 for unmethylated), and methylation status is determined by comparing the two colors from the one position: β = Green (M)/(Red (U) + Green (M)). On average, the Infinium I probes contain a higher proportion of CpGs along the body of the probe compared with the Infinium II probes. Based on the Illumina annotation, 57% of Infinium I probes are found in CpG islands, whilst only 21% of Infinium II probes are designated as islands.

1.1.3 Problems associated with Infinium HumanMethylation data analysis

As with other types of microarray platforms, technical artifacts are a concern for the Infinium HumanMethylation array. One source of technical noise for this platform is variation in the background fluorescence signal, which contributes an additive bias and reduces the dynamic range of the beta value. Another concern is the difference in average intensities between the red and green channels; such dye-bias variation skews the values of Infinium II probes differentially across samples. Furthermore, type I and type II probes differ not only in chemical design, but also in biological characteristics, with type I probes more likely to map to CpG islands than type II probes. Hence there is an inherent difference in the relative proportion of methylated and unmethylated probes between the two designs. At the same time, it has been shown that type II probe values have a lower dynamic range, and are biased and generally less reproducible (Dedeurwaerder et al., 2011; Teschendorff et al., 2013). The type II probe bias should therefore be addressed when analyzing the two designs together, while the biological difference between the two design types should be maintained. Last but not least, between-sample variation is a concern, which may arise from position on the chip, processing time, or processing location.

Given the increasing popularity of the HM450 array and the biases observed due to platform design, several approaches and pipelines have been developed, either targeting a single technical issue or designed to address multiple biases within a complete framework. Recently, some systematic comparisons of preprocessing procedures have appeared (Dedeurwaerder et al., 2013; Morris & Beck, 2014; Wilhelm-Benartzi et al., 2013; Wu et al., 2014); however, evaluations of combination approaches are seldom reported (Marabita et al., 2013; Touleimat & Tost, 2012). Pidsley et al. (2013) evaluated combining methods with different mechanisms. However, improved approaches have since appeared and are worthy of further consideration. Of interest is (1) the overall best approach for signal processing; (2) the performance of recent between-array methods when analyzing data with substantial biological heterogeneity; and (3) the efficacy of processing approaches when DNA methylation differences are subtler than the stark changes observed in tumorigenesis or aging.
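To fix ideas, the first two artifacts can be written schematically as follows; this is a stylized model for intuition only, not the exact formulation used by any of the methods evaluated below. Let $c$ denote the color channel in which a signal is read, with channel-specific additive background $b_c$ and multiplicative dye scale $s_c$:

$$ M_{\mathrm{obs}} = s_{c}\,\bigl(M_{\mathrm{true}} + b_{c}\bigr), \qquad U_{\mathrm{obs}} = s_{c'}\,\bigl(U_{\mathrm{true}} + b_{c'}\bigr). $$

For type I probes both intensities are read in the same channel ($c = c'$), so the dye scale cancels in $\beta = M/(U+M)$; for type II probes $c \neq c'$, so unequal scales $s_{\mathrm{Cy3}} \neq s_{\mathrm{Cy5}}$ skew β, and in both designs the background terms compress β toward the middle of its range.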
In this study, nine processing methods are applied and compared in four datasets. The methods include both between-array and within-array normalization methods, as well as combination approaches that sequentially apply procedures addressing different platform biases. The datasets range from cancer data to Alzheimer's brain data, showing distinctly different variation in DNA methylation. The methods are evaluated on their ability to reduce technical variance, as well as to identify reproducible differential methylation positions (DMPs) in case-control studies. We show that the performance of processing methods depends on the nature of the dataset, and that, in general, within-array processing for background and probe-design bias performs well in all datasets. Furthermore, we find that the within-array methods are robust in obtaining reproducible results across different types of data sets. This should be especially meaningful when dealing with clinical research data where we have only one sample per group. Finally, between-array normalization helps when the variation due to noise is greater than the variation due to biological signal; however, when the methylation alterations are substantial in size and number, some between-array methods might remove biological signal.

1.2 Methods

1.2.1 Preprocessing Methods

In total we evaluate and compare nine preprocessing approaches based on the following within-array and between-array methods: (1) background correction and dye-bias equalization (Noob) (Triche et al., 2013); (2) beta-mixture quantile normalization (BMIQ) (Teschendorff et al., 2013); (3) subset-quantile within-array normalization (SWAN) (Maksimovic, Gordon, & Oshlack, 2012); (4) background adjustment followed by between-array normalization performed separately by probe design (dasen) (Pidsley et al., 2013); (5) subset-quantile normalization (SQN) (Maksimovic et al., 2012; Touleimat & Tost, 2012); and (6) functional normalization (Funnorm) (Fortin, 2014). All analyses are implemented in R version 3.0.3, Bioconductor version 2.12 (Huber et al., 2015), with Noob, SWAN, SQN and Funnorm implemented in the minfi package (version 1.6.0) (Aryee et al., 2014), and dasen and BMIQ in wateRmelon (version 1.0.3) (Pidsley et al., 2013). In addition to these published methods, we also combine methods that correct for different biases: (1) Noob+BMIQ; (2) Noob+SWAN; (3) Funnorm+BMIQ. We note that the function Funnorm already includes Noob as a first step.

Briefly, Noob performs within-array normalization correcting for background fluorescence and dye bias. It fits a normal-exponential convolution model to estimate the true signal conditional on the observed intensities, capitalizing on the unique design of the Infinium I probe pairs to estimate non-specific signal from the 'out-of-band' intensities, i.e., the intensities measured in the color channel opposite to the probe's design (n=92,596 for Cy3 features and n=178,406 for Cy5 features). These background-corrected intensities are then normalized for variation in average intensity in the red and green channels via a multiplicative scale factor computed using the average intensities of the positive control probes. BMIQ is a mixture-model-based normalization method designed to correct the type II probe bias and make the methylation distribution of type II features comparable to the distribution of type I features.
BMIQ fits a three-state (unmethylated, 50% methylated and fully methylated) beta mixture model to the type I and type II probes separately, with probes assigned to the state of maximum probability. Beta values for the type II features are then normalized, state by state, to the corresponding distributions estimated from the type I features. Like Noob, it is a within-array method. SWAN is also a within-array normalization method for probe-type design, but unlike BMIQ, it starts from the methylated and unmethylated intensities. SWAN is based on the assumption that probes with the same number of CpGs in the probe body should have similar intensity distributions (since the regions they interrogate should have similar biological features). For the methylated and unmethylated intensities separately, a random subset of type I and II probes matched on underlying CpG number is selected and quantile normalized. The intensities of the remaining probes are adjusted by linear interpolation. Consequently, the intensity distributions of the two probe types are identical in each subset, while the difference in overall distributions between the two probe types still remains.

In contrast to the first three methods, dasen is a between-sample normalization method, and like Noob and SWAN, it adjusts the raw intensities instead of beta values. The background of type I probes is adjusted to match that of the type II probes, and standard quantile normalization is applied to the methylated and unmethylated intensities separately by probe type (I or II). SQN is another between-sample normalization method for methylated and unmethylated intensities that also involves within-sample normalization of type I and II probes. Type II intensities are normalized across arrays; within arrays, type I intensities are normalized to the type II distributions within four feature subsets (CpG island, CpG island shore, CpG island shelf, and open sea) (Aryee et al., 2014). The stratified within-array normalization allows for the biased distribution of type I and II probes across genomic regions, with type I probes appearing more frequently in CpG-dense regions that are typically unmethylated.

Funnorm is a between-sample (functional) normalization method that attempts to remove unwanted variation by adjusting for covariates estimated from a control probe matrix. Briefly, 42 summary measures are estimated from the combined 848 control probes and type I 'out-of-band' intensities. The first two principal components of these summary measures are chosen as covariates for intensity adjustment. The functional normalization model is applied to the methylated and unmethylated channels separately. Since the relationship between the methylation values and the control probes is expected to differ between type I and type II probes, functional normalization is also applied separately by probe type. For probes mapped to the X and Y chromosomes, males and females are processed separately, with ordinary quantile normalization used for probes on the Y chromosome because of the small number of probes (N=416). By default, functional normalization is applied after Noob in the current version of the minfi package (version 1.6.0). A summary of the data preprocessing methods used in this study is shown in Additional Table 1.1, and a minimal sketch of the most-used combination, Noob followed by BMIQ, is given below. We note that both SQN and Funnorm normalize features on the X and Y chromosomes in males and females separately, while dasen does not.
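The following sketch strings the pieces together, assuming idat files in a directory `idat_dir` (a hypothetical path) and current minfi/wateRmelon interfaces; exact function signatures vary across Bioconductor versions:

    # Minimal sketch of the Noob + BMIQ workflow (illustrative; interfaces
    # may differ across minfi / wateRmelon versions)
    library(minfi)
    library(wateRmelon)

    rgSet <- read.metharray.exp(base = idat_dir)  # raw red/green intensities
    mset  <- preprocessNoob(rgSet)                # background and dye-bias correction
    beta  <- getBeta(mset)                        # Noob-corrected Beta values

    # BMIQ adjusts type II Beta values toward the type I distribution, one
    # sample (column) at a time; design.v codes probe type (1 = I, 2 = II)
    design.v  <- ifelse(getProbeType(mset) == "I", 1, 2)
    beta_bmiq <- apply(beta, 2, function(b) BMIQ(b, design.v, plots = FALSE)$nbeta)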
Unlike SQN and dasen, which normalize signals using the biological features directly, Funnorm uses only control probes for between-array normalization.

1.2.2 Illumina HumanMethylation450 data sets

Five data sets are used to evaluate the different preprocessing methods. One contains six technical replicates from human peripheral blood lymphocytes (PBLs), and is used to investigate how well the processing methods reduce technical variance. These samples were commercially purchased, pooled human PBLs, and are the same as those used in (Triche et al., 2013). The second data set comprises five lung adenocarcinomas from The Cancer Genome Atlas (TCGA), measured on both the HM450 and whole-genome bisulfite sequencing (WGBS) platforms. These data, shared by Titus et al. (Titus, Houseman, Johnson, & Christensen, 2016), allow us to benchmark our HM450 DNA methylation estimates against a whole-genome sequencing-based assay. The remaining three data sets are used to evaluate the reproducibility of differential methylation analysis under each form of data processing. The motivation is that better signal processing should lead to better reproducibility in identifying differential methylation. Two of these data sets are from TCGA, with large-scale methylation differences between cancer and normal tissue, and one is an unpublished brain data set, showing subtle methylation differences between cases with Alzheimer's disease and controls.

All samples are analyzed using the Infinium HumanMethylation450 (HM450) BeadChip, and DNA methylation levels are reported as Beta (β) values, the proportion of methylation intensity to the total intensity, ranging from 0 to 1. The samples from the last three studies are divided into discovery and validation datasets for evaluating the reproducibility of differential methylation testing. Each processing method is performed in the discovery and validation set separately. This separate processing will not affect the Beta values of individual samples when using within-array methods (Noob, SWAN, BMIQ), but can affect the Beta values when using between-array processing methods (dasen, SQN, Funnorm). The details for defining the discovery and validation data sets are described below, and summarized in Additional Table 1.2. All samples are anonymized, and this study did not require institutional review board approval.

The TCGA-KIRC dataset
We use the identical samples and study design implemented by Fortin et al. (Fortin, 2014). In total, 222 kidney clear cell tumor samples and 160 non-tumor kidney samples are assayed and included in the study. These samples are split into discovery (training) and validation (testing) datasets. The discovery set contains 65 tumor samples and 65 non-tumor kidney samples, analyzed on two plates with little variation between plates. The validation set contains 157 tumor samples and 95 non-tumor kidney samples, spread over 4 plates and designed to have larger variation among samples.

The TCGA-COAD dataset
A total of 321 COAD (colon adenocarcinoma) samples (284 tumor/37 non-tumor colon) are included in this study. These samples are selected by plate to create discovery and validation datasets. No substantial plate-to-plate variation was observed in the COAD dataset (Additional Figure 1.1). The discovery set was assigned 143 tumor samples and 17 controls spread across 7 plates, and the validation set 141 tumors and 20 controls run over 4 plates.
All samples had fewer than 5% missing beta values in each color channel and were analyzed on plates with more than 2 COAD samples.

The Brain data
A total of 376 bulk brain samples obtained from the middle temporal gyrus were analyzed over 8 plates; 215 are from patients diagnosed with Alzheimer's disease (AD) and 161 are from age- and sex-matched controls. Once more, discovery and validation data sets are created by selecting samples by plate. A total of 180 samples across the first four plates are assigned to the discovery set (102 diseased and 78 controls) and 196 samples from the last four plates to the validation set (113 diseased and 83 controls). These data were generated at the University of Southern California.

1.2.3 Evaluation metrics

Improved measures of DNA methylation should increase our power to detect biological associations. However, evaluating our ability to detect true signals in real data is complicated by the fact that we do not know which signals are genuine. Instead, we can study the reproducibility of differential methylation results between different data sets. Higher reproducibility reflects a method's greater potential to reveal true signals, while poorer agreement indicates that results are more likely due to chance. We assess differential methylation by disease status for the case-control data by applying two-sample t-tests separately in the discovery and validation sets. The data are analyzed on the Beta-value scale and tests are two-sided. Several tests of reproducibility are performed. First, result reproducibility is evaluated from the lists of (ranked) p-values using concordance plots and ROC curves. Second, we assess the overlap of differentially methylated positions (DMPs), applying the Benjamini-Yekutieli method (Benjamini & Yekutieli, 2001) to control the false-discovery rate (FDR-adjusted p < 0.05). The Benjamini-Yekutieli method allows for correlation between test statistics, and yields many fewer significant results than methods that ignore such correlation.

Concordance Plot
In a concordance plot, the overlap percentage is plotted against feature-list size for lists of size one to k, where the overlap percentage is defined as the proportion of features ranked in the top k in both the discovery and validation sets. A processing approach with higher overlap percentages across different list sizes indicates higher reproducibility. This approach has connections with the change-of-correspondence plot proposed for the irreproducible discovery rate framework (Q. Li, Brown, Huang, & Bickel, 2011).

ROC curves
ROC curves are used to quantify, against a known gold standard, the true-positive and false-positive results at all possible threshold values of a quantitative marker. Although the genuine gold standard is unknown, we use the results from the discovery subset, whose samples were selected to have lower technical variation between plates, to define the 'gold standard', and pick the number of true signals from a list size with high overlap percentage in the concordance plots. Specifically, for the TCGA KIRC and COAD data sets, the features with the top 100,000 ranked p-values in the discovery subset are defined as "true signals". For the brain data, we anticipate fewer differences and only select the top 100 in the discovery set as "true signals". The same list sizes are used in (Fortin, 2014) for the evaluation of strong and weak signals. The area under the curve (AUC) is computed to assess the performance of a method. A method with good reproducibility will have an ROC curve above the diagonal line, with an AUC near 1. The ROC analysis is performed using the ROCR package in R (version 1.0-7).
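A minimal sketch of these metrics, assuming hypothetical beta matrices `beta_disc` and `beta_val` (features in rows, matched by rownames) with case/control factors `grp_disc` and `grp_val`:

    # Per-feature two-sample t-tests in each data split
    p_disc <- apply(beta_disc, 1, function(b) t.test(b ~ grp_disc)$p.value)
    p_val  <- apply(beta_val,  1, function(b) t.test(b ~ grp_val)$p.value)

    # DMPs at FDR < 0.05 using the Benjamini-Yekutieli adjustment
    dmp_disc <- names(which(p.adjust(p_disc, method = "BY") < 0.05))

    # Concordance: overlap proportion among the top-k ranked features
    overlap_at <- function(k) {
      length(intersect(names(sort(p_disc))[1:k], names(sort(p_val))[1:k])) / k
    }
    concordance <- sapply(seq(1000, 100000, by = 1000), overlap_at)

    # ROC/AUC via ROCR, taking the discovery top 100,000 as the gold standard
    library(ROCR)
    truth <- names(p_val) %in% names(sort(p_disc))[1:100000]
    pred  <- prediction(1 - p_val, truth)   # larger score = stronger evidence
    auc   <- performance(pred, "auc")@y.values[[1]]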
1.2.4 Feature Filtering

A total of 485,512 cytosines are queried by the HM450 BeadChip array. For method evaluation, we filter probes with missing Beta values across samples. For the TCGA and brain data sets we also exclude probes that map to the X (n=11,240) or Y (n=416) chromosomes, contain a common SNP (minor allele frequency >0.1) within 10 bp of the target cytosine, or map to multiple regions of the genome. This filtering results in 485,110 features in the PBL dataset, 371,370 in KIRC, 384,470 in COAD, and 360,894 in the brain data set. It is worth noting that the minfi and methylumi packages in R have different criteria for assigning missing beta values. In minfi, missing values ("NAs") are assigned when both the raw methylated and unmethylated intensities are zero. Although this catches all features with failed probes that do not fluoresce, it allows features fluorescing at background levels for both the M and U probes to have beta values near 0.5. We favor the additional assignment of 'NA' to features that do not have at least one of the M and U intensities fluoresce above background. Such masking is applied in the methylumi package, where detection p-values are used to determine missing beta values; the detection p-value reports the quantile of the larger of the methylated and unmethylated intensities in the distribution of the 600 negative control probes (from the same color channel). The beta value is then assigned missing if the corresponding detection p-value is greater than 0.05 (not above background). This can result in a much larger number of missing beta values than assigned in the minfi package. We use the missing values assigned by the methylumi package.
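In code, this style of masking reduces to indexing the beta matrix by a matrix of detection p-values; the sketch below uses minfi's detectionP for illustration, noting that its definition of the detection p-value differs in detail from the methylumi masking described above:

    # Mask features that do not fluoresce above background (illustrative)
    detP <- detectionP(rgSet)                  # detection p-value per feature/sample
    beta <- getBeta(preprocessNoob(rgSet))
    beta[detP[rownames(beta), colnames(beta)] > 0.05] <- NA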
1.3 Results

1.3.1 Reduction in technical variance

We assess the ability of the preprocessing methods to reduce technical variance and adjust type I/II bias using six technical replicates from commercially available pooled human male peripheral blood lymphocytes (PBLs), assayed as part of a larger experiment and previously evaluated by Triche et al. (Triche et al., 2013). Figure 1.1A shows variability in the density distributions of the beta values for the six replicates that is no longer evident after processing the data using Noob+BMIQ (Figure 1.1B). The density distributions, stratified by probe type, appear in Additional Figure 1.2. For the raw data the density distributions vary considerably across the six replicates, especially among type II probes; the reduced dynamic range of raw beta values for type II probes relative to type I probes is also apparent (Dedeurwaerder et al., 2011). All of the normalization methods increase the dynamic range of the type II probes, with perhaps the greatest reduction in type I/type II bias seen when combining Noob with BMIQ. It is worth noting that SQN changes the distribution of type I probes most noticeably, presumably because it uses type II probes as the anchors when normalizing between the two probe designs in each sub-category (defined by genomic context relative to CpG islands).

Figure 1.1 Evaluation of preprocessing methods in reducing technical variance using the PBLs dataset. A) Density distribution of beta values for the six replicates in the raw data. B) Density distribution of beta values for the six replicates in data processed by Noob+BMIQ. C) Loess curve fitted to the scatterplot of standard deviation versus average of methylation beta values for different processing methods. D) Boxplot of the absolute difference in methylation beta values between adjacent type I-type II probe pairs in the PBLs dataset, averaged across the six replicates.

Figure 1.1C shows a smoothed curve summarizing a scatterplot of the standard deviation versus the mean Beta value across the six replicates for all probes. All processing methods reduce the average standard deviation compared to the raw data. The correlation between standard deviations and means is least obvious after application of dasen and SQN, possibly because both dasen and SQN are between-array methods. Interestingly, the within-array methods are competitive with the between-array methods despite the application to technical replicates, for which all between-array assumptions are met. Also, Funnorm shows good stability despite using a principal-component adjustment estimated from only six arrays. Figure 1.1D shows a boxplot of the average absolute difference in beta value for the 32,713 pairs of adjacent type I/II probes that lie within 200 bp of each other. Based on the well-identified spatial correlation of DNA methylation at scales <500 bp (Eckhardt et al., 2006), we expect the ideal normalization algorithm to minimize the averaged absolute difference, as previously observed by Teschendorff et al. (Teschendorff et al., 2013). Noob+BMIQ shows the greatest reduction in deviation between measures from adjacent type I/type II probes, while SQN does not perform well with respect to reducing type I/type II bias.

1.3.2 Cross-platform comparison with whole-genome bisulfite sequencing

We used five lung adenocarcinoma (LUAD) samples from TCGA to benchmark the processed HM450 Beta value estimates against estimates from WGBS. The number of features available for cross-platform comparison ranged from 110,962 to 199,441 after restricting to cytosines with a minimum sequencing depth of 10. Loess curves summarizing the relationship between the Beta values from the two platforms showed that Noob+BMIQ estimates were most similar to WGBS (Fig. 1.2A). The Pearson correlation was highest for Noob+BMIQ (0.953), lowest for SQN (0.930) and dasen (0.937), and intermediate for the raw data (0.942). The other four LUAD samples also showed the highest Pearson correlations for Noob+BMIQ processed data (Fig. 1.2B).

Figure 1.2 Cross-platform comparison of Beta values from HumanMethylation450 vs whole-genome bisulfite sequencing. A) For one LUAD sample, loess curves fitted to 199,441 paired Beta values for WGBS and HM450 data after different processing methods. The dashed line indicates equal values. B) Scatter plot of correlations between WGBS Beta values and HM450 Beta values after different processing methods in 5 LUAD samples. Different colors represent different samples; aqua shows the results from the sample in A).

1.3.3 Reproducibility of differential DNA methylation analysis

The ability to reliably identify features differentially methylated by disease subtype will depend on the strength of the biological signal and the noise due to technical issues. A multidimensional scaling plot of the top 5000 variable features allows us to visualize sample distances by disease state for our three studies (Figure 1.3).
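For reference, the MDS coordinates behind such a plot can be sketched in a few lines (classical scaling on Euclidean distances for the 5,000 most variable features; `beta` is a hypothetical features-by-samples matrix and `grp` a disease-status factor):

    # Classical multidimensional scaling on the most variable features
    sds <- apply(beta, 1, sd)
    top <- order(sds, decreasing = TRUE)[1:5000]
    mds <- cmdscale(dist(t(beta[top, ])), k = 10, eig = TRUE)

    # One common convention for the variance explained per scaling dimension
    var_explained <- pmax(mds$eig, 0) / sum(pmax(mds$eig, 0))

    plot(mds$points[, 1:2], col = as.integer(grp), xlab = "Dim 1", ylab = "Dim 2")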
For the KIRC and COAD data sets, the first dimension explains 49.8% and 43.7%, respectively, of the total variance in Beta values from the raw signals, with the second dimension explaining only 5.9% and 6.0% (Figures 1.3A, 1.3B). These two scaling dimensions show a clear separation between cancer and control samples for both the discovery and validation data sets, indicating a large biological signal even in the raw data. On the other hand, the first and second scaling dimensions for the brain data explain only 29.9% and 3.8% of the total variance, and it is not until the 18th dimension that the cumulative proportion of variance exceeds 50%. Case-control differences are not apparent until the 10th dimension (t-test p = 0.002) (Figure 1.3C). Figure 1.3D shows scaling dimensions 1 and 2 for the brain data with samples colored by processing plate. The first scaling dimension is associated with sample plate, with plate 5 and plate 8 samples appearing more towards the right side of the figure.

Figure 1.3 Multidimensional scaling plot of distances between samples in three datasets. A) TCGA-KIRC. B) TCGA-COAD. C) Alzheimer's brain data, scaling dimensions 9 vs 10. D) Alzheimer's brain data, scaling dimensions 1 vs 2, samples colored by processing plate. Euclidean distances between sample pairs are computed for a common set of 5000 features having the largest standard deviations across all samples. Features containing SNPs or mapping to the sex chromosomes were excluded. Samples are colored by disease status (A,B,C) or plate (D).

We use concordance plots to present the overlap in the top-ranking DMPs as a function of list size when applying the different processing methods. Combining Noob with BMIQ or SWAN always showed higher reproducibility than any of the three methods alone, so the results for the individual methods are not shown. For the two cancer data sets, all within-array combination methods show similar performance to one another, and all show higher concordance than the raw data (Fig. 1.4A and C). The performance of the between-array methods is more variable. Funnorm, a method that utilizes control probes for normalization, always performs better than the raw data, but not better than the within-array methods. SQN and dasen, two between-array methods that use only the signal from the biological features for normalization, perform better than the raw data only for the KIRC study; for COAD, they are worse (grey lines in Fig. 1.4A, C). The poor performance might be due to the substantial heterogeneity among the COAD tumor samples relative to the control samples (as shown in Fig. 1.3B): normalizing intensities across samples may hide true biological signals. In contrast to the concordance plot, which reflects sensitivity across different list sizes, the ROC curve shows sensitivity across different false-positive rates for a specific list size. In our study the ROC curve results are consistent with the concordance plot analysis, showing high sensitivity for Noob+BMIQ, Noob+SWAN, and Funnorm+BMIQ (compare Fig. 1.4B, D with A, C, respectively). Whereas there is little variation in sensitivity among the evaluated methods for the KIRC study, the COAD study shows a clear loss in sensitivity when performing between-sample normalization using SQN (Fig. 1.4D). We note that for each of these two cancer datasets, the concordance after processing nears 85%.
However, the concordance of discoveries is above 70% even for the raw data at a list size of 15,000 features, and the overlap percentage for the raw COAD data is over 80% for the top 50,000 features. These results are not surprising given that the separate clustering of cancer and non-cancer samples in the MDS plots reflects substantial variation in DNA methylation between groups, with genuine signals very likely to be detected and reproduced. The greater homogeneity of the colon control samples relative to the cancer samples might explain the higher reproducibility of DMPs in the COAD study (see Fig. 1.3B).

Figure 1.4 Overlap plot and ROC curves for the KIRC (A,B) and COAD (C,D) datasets. The top-ranked 100,000 features in the discovery set were defined as genuine signals for the ROC curve analysis (B,D).

Reproducibility for the brain data is much lower than for the cancer data, which is not surprising given the earlier MDS figures. Here, the fraction of overlapping discoveries from the analysis of the raw data appears essentially random (Figure 1.5A); however, the processed data sets show an enrichment in the overlap of top hits. The greatest overlap occurred for the between-array method Funnorm combined with (within-array) BMIQ, although its advantage over Noob+BMIQ, the second-best method, was modest. We note that the strictly within-array methods Noob+BMIQ and Noob+SWAN appeared as good as or better than the between-array (Noob+)Funnorm alone. In these data the between-array methods SQN and dasen were always much better than the raw data, but never achieved the reproducibility of the top within-array method, Noob+BMIQ. The same results are seen from an ROC curve analysis using the top 100 features as the gold standard (Fig. 1.5B). Altogether, these results suggest that correction for technical effects in these data is needed to find the more subtle biological signal. Furthermore, the modest overlap percentage after normalization led us to investigate RUVm, a new variant of Remove Unwanted Variation designed for removing batch effects in DNA methylation data (Maksimovic, Gagnon-Bartsch, Speed, & Oshlack, 2015). We evaluated RUVm for batch effect correction, comparing it to the competing methods surrogate variable analysis (SVA) (Leek, Johnson, Parker, Jaffe, & Storey, 2012) and ComBat (Johnson, Li, & Rabinovic, 2007).

Figure 1.5 Overlap plot and ROC curves for the brain dataset without batch correction (A,B) and with batch correction (C,D). The top-ranked 100 features in the discovery set were defined as genuine signals for the ROC curve analysis (B,D).

RUVm improved result reproducibility, outperforming both SVA and ComBat. In fact, ComBat showed a minor loss in reproducibility relative to no batch correction, suggesting confounding between case-control status and BeadChip. Applying RUVm to the Noob+BMIQ processed brain data resulted in a nearly 30% improvement in overlap percentage, achieving a maximum 63% overlap in the brain data set, up from the 36% observed using normalization methods alone (Fig. 1.5C). The results using RUVm depended heavily on the original data processing, with Noob+BMIQ+RUVm showing a 65% increase in sensitivity compared to SWAN+RUVm at a 5% false-positive rate (Fig. 1.5D). Lastly, we investigated whether our results are sensitive to the single split of the data into discovery and validation sets.
We resampled the brain data ten times into new discovery and validation sets, each time computing the AUC for the raw data, the data processed with Noob + BMIQ, and the data processed with Noob + BMIQ + RUVm. There was very little overlap in the distributions, with AUC interquartile ranges (IQR) of 0.65-0.68 for the raw data, 0.70-0.82 for Noob + BMIQ, and 0.95-0.97 for Noob + BMIQ + RUVm. The methods held their rank across resamples, suggesting that a single data split is representative and that performance does not change with repeated sampling.

1.4 Discussion

In this study nine preprocessing methods for Illumina HM450 array data are applied and compared, using combinations of three within-array and three between-array methods. Since different normalization methods address different technical effects in the data, we take advantage of this diversity of approaches and combine methods addressing different mechanisms. Analysis of technical replicates showed that different methods optimized different assessment criteria. The within-array methods Noob + BMIQ and Noob + SWAN were favored for removing type I/type II bias, while the between-array methods Funnorm, SQN and dasen reduced between-sample variability the best. However, a recent paper by Lemire et al. found Noob + BMIQ performed best in reducing differences between Beta values in intra-plate duplicates (Lemire et al., 2015). SQN, a between-array method that normalizes intensities between type I/II probes, did the least to remove probe-type bias; this might be because SQN utilizes type II probes, those known to show greater bias, as "anchors" when normalizing type I intensities. The results consistently showed that a priori background correction and dye-bias normalization using Noob improved both bias and variance over the type I/type II correction methods alone (BMIQ and SWAN). Analysis of paired HM450 and WGBS data showed the highest Pearson correlations after using Noob + BMIQ to process the HM450 data.

When evaluation focused on detecting reproducible DMPs across different disease data sets, within-array normalization and between-array methods that relied only on control features consistently displayed the highest reproducibility. The performance of between-array methods that utilized biological features for normalization depended on the characteristics of the data set. When disease state was not associated with the principal scaling dimension, between-array methods tended to improve the sensitivity of reproduced signals. Still, they never outperformed the best within-array methods. Furthermore, when the biological signal was strong, they had the potential to behave worse than no processing at all. This was the case for the COAD data set, for which the first two principal scaling dimensions were both associated with disease status. It also agreed with the separate finding of a lower validation rate after SQN by Wu et al. (Wu et al., 2014). Overall, we found the within-array method Noob + BMIQ to consistently provide the most reproducible signal detection across the three data sets.

We evaluated three batch-correction methods capable of removing additional variation not accounted for by data processing (RUVm, SVA, and ComBat), to see whether they could improve the low reproducibility of DMPs in the brain data. Both RUVm and SVA improved reproducibility; however, using ComBat to adjust for BeadChip effects reduced it. This suggested that there was confounding of case-control comparisons by BeadChip.
Application of RUVm increased DMP reproducibility in the brain data the most, and interestingly, performance varied considerably with the preprocessing method applied: Noob + BMIQ delivered substantially better reproducibility than competing approaches.

Finally, Beta regression has been proposed as a more powerful test of differential DNA methylation than t-tests (Triche, Laird, & Siegmund, 2016). However, since current software for Beta regression is slow, and our focus was on finding the most reproducible processing method rather than the most sensitive test, we used t-tests on the Beta values. Data that are subjected to variance-stabilizing transformations prior to t-tests might show higher sensitivity for differential testing (Triche et al., 2016), with the difference in sensitivity a function of both the effect size and the sample size. The already high overlap percentages observed in our cancer studies suggest that data transformation is unlikely to change those results. For the brain data, it is possible that data transformation could improve the overlap percentages for the different normalization methods. The analyses with batch correction were performed on the logit scale, so these potentially represent the most sensitive t-test results.

1.5 Conclusion

This study provides a comprehensive comparison of the currently popular normalization methods for processing HM450 array data. Combinations of methods are applied and compared in five data sets, ranging from cancer data to Alzheimer's brain data, and showing distinctly different variation in DNA methylation. We find that the combination Noob + BMIQ, a within-array method, performs well in reducing technical variance, adjusting type I/II bias, and improving reproducibility in differential methylation analysis. At the same time, between-array normalization methods may harm the data when there are global methylation alterations. For differential DNA methylation analysis, RUVm was the most powerful batch correction method, and it performed best on data first processed with Noob + BMIQ. The combinations of methods and comprehensive comparisons in our study provide valuable insights for processing HM450 BeadChip data. The methods evaluated in this study are also applicable to the MethylationEPIC array, since it uses the same chemical assays and thus suffers from the same technical issues as the HM450 array.

Chapter 2- Multi-tuning parameter penalized regression for sample classification with data integration

2.1 Introduction

2.1.1 Omics data integration

Technological advances in the generation of high-throughput omic measurements, including gene expression (using microarrays and RNA sequencing), epigenetic variation (by DNA methylation arrays, sequencing, or chromatin immunoprecipitation followed by DNA sequencing), protein variation (assayed in either metabolomic or proteomic studies in various ways) and copy number alteration (assayed by SNP arrays), have driven the field of translational bioinformatics for the past decade, producing ever-increasing amounts of data as researchers strive to develop appropriate analysis tools (Ritchie, Holzinger, Li, Pendergrass, & Kim, 2015). The majority of current studies focus on using a single data type to explore associations between phenotypes and genomic measurements.
Due to the limited information supplied by any individual data type, and the availability of multiple genomic measurements for the same exposure or disease, especially for the same group of samples, integration of omics data in interrogating genotype-phenotype associations has been gaining more and more attention. It has been suggested that a systems genomics approach can achieve a more thorough and reliable understanding of complex biological processes. Firstly, missing values or noise in any single data type due to technical issues can be compensated for by data integration. Secondly, genomic loci or molecular pathways implicated in a phenotype by multiple genomic platforms have more evidence supporting their importance. Thirdly, data integration may provide improved power to identify the important genomic factors and their interactions. Last but not least, capturing and utilizing the interactions between diverse genomic measurements, and incorporating prior knowledge or relevant clinical features, are essential for correctly unraveling the mechanisms or causal relationships underlying complex-trait architecture. In this proposal we focus on methods to predict disease outcome using data from multiple genomic platforms. One goal is to evaluate the benefit of combining omics data over using a single data type alone. Other goals are described below.

2.1.2 Types of integrative genomic analysis

Frameworks for data integration can be divided into two categories: multi-staged and meta-dimensional (Ritchie et al., 2015). The first is characterized by a stepwise or hierarchical analysis that reduces the search space through different stages of analysis, while the latter combines all scales of data simultaneously to produce complex models with multiple variables from multiple scales of data. The two frameworks are based on, and reflect, different assumptions regarding molecular variability. The first framework rests on a fundamentally linear model of disease aetiology: variation at the DNA level leads to changes in gene expression, leading to changes in protein expression and finally to a change in phenotype. Under this linear assumption, it is reasonable and efficient to adopt the multi-staged approach, stratifying the data by type and performing analyses between different platforms step by step according to the causal chain in the linear aetiology. An alternative hypothesis is that the multiple levels of molecular variation contribute to disease risk in a nonlinear, interactive and complex way. Under such an assumption, multi-staged analysis would hide the information offered by interactions between different data types, while a meta-dimensional approach, which combines the multi-omics data sets prior to data reduction, would be more appropriate. Next, we describe examples of analyses that follow either a multi-staged or a meta-dimensional approach to data integration.

Since multi-staged analysis relies heavily on linear causal relationships between different layers of genomic variation, it is especially useful in association studies, where the objective of integrative analysis is to understand molecular behaviors, mechanisms, and relationships between and within the different types of molecular structures. One commonly used method in multi-staged analysis is the three-stage method, also called the triangle method (Holzinger & Ritchie, 2012). One application of this method is to infer putative functional SNPs.
For this, one seeks SNPs associated with both a gene expression trait (expression quantitative trait loci, eQTLs) or a methylation level (methylation QTLs, mQTLs) and the phenotype of interest. In some instances, the phenotype of interest might be chemotherapeutic drug response (Hartford et al., 2009; R. S. Huang et al., 2007). The triangle method can overlap associations found in separate data sets and does not require all measurements to be made on a common set of samples. A related approach is sequential analysis (Kristensen et al., 2014), in which findings based on one data type are confirmed or refined through analyses of further omics data obtained on the same set of samples. Such a method can be useful when regarding the expression level of a transcript as a function of copy number and DNA methylation (Chari, Coe, Vucic, Lockwood, & Lam, 2010).

Meta-dimensional analysis, on the other hand, tries to consider and model complex interactive relationships between different genomic platforms. Such information is critical when the objective of integrative analysis is (1) to understand the taxonomy of diseases, thereby classifying individuals (or samples) into latent classes of disease subtype; or (2) to predict an outcome or phenotype (such as survival or efficacy of therapy) for prospective patients. Ritchie et al. (Ritchie et al., 2015) described the approaches to meta-dimensional analysis in detail. In this study, the goal is to predict disease diagnosis or cancer outcomes for prospective patients using multiple genomic data types, allowing for correlations between platforms. Thus, we focus on meta-dimensional rather than multi-staged analysis. In particular, we are interested in "early" and "late" data integration.

"Early" integration is essentially a concatenation-based method: the different data matrices for each sample are combined into one large input matrix before constructing a model. Such a method has been applied in a study that aimed to predict recurrence and survival of ovarian cancer using copy number alteration, methylation, miRNA and gene expression data with a multivariate Cox Lasso model (Mankoo, Shen, Schultz, Levine, & Sander, 2011). The main advantage of this approach is that concatenation-based integration naturally accounts for correlations between different types of genomic data. Alternatively, we can consider "late" integration, where we build separate models from the individual data types, then combine the variables selected from each data type and fit the final prediction model with the combined features. Early and late data integration are illustrated in Figure 2.1.1, and a schematic code sketch is given below. Both approaches have the opportunity to utilize complementary information from separate data types, information that is ignored when a model is built from a single data type. Compared to late integration, early integration is likely to suffer even more from the high-dimensionality issue, but it considers potential correlations between different data types when selecting variables, and it may achieve better prediction performance by utilizing all the information in the combined data at the feature selection step. One goal of this study is to compare the performance of early and late data integration. As for the methods we consider for the prediction model, we focus on penalized regressions, which are introduced below.
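To make the two strategies concrete, the following R sketch outlines early (concatenation-based) and late integration with sparse logistic regression via the glmnet package. This is a minimal illustration only; the object names expr, meth, and y are hypothetical placeholders for a gene expression matrix, a DNA methylation matrix, and a binary outcome, not objects from any specific data set.

```r
library(glmnet)

## Hypothetical inputs, not from any specific study: expr (n x p1 expression
## matrix), meth (n x p2 methylation matrix), y (binary outcome of length n).

## Early integration: concatenate the feature matrices, then fit one sparse
## logistic model on the combined input.
x_early   <- cbind(expr, meth)
fit_early <- cv.glmnet(x_early, y, family = "binomial", alpha = 1)

## Late integration: fit a separate sparse model per data type, pool the
## selected features, then fit a final model on the pooled set.
fit_expr  <- cv.glmnet(expr, y, family = "binomial", alpha = 1)
fit_meth  <- cv.glmnet(meth, y, family = "binomial", alpha = 1)
keep_expr <- which(coef(fit_expr, s = "lambda.1se")[-1] != 0)  # drop intercept
keep_meth <- which(coef(fit_meth, s = "lambda.1se")[-1] != 0)
x_late    <- cbind(expr[, keep_expr, drop = FALSE],
                   meth[, keep_meth, drop = FALSE])
fit_late  <- cv.glmnet(x_late, y, family = "binomial", alpha = 1)
```

Refitting a penalized model on the pooled features corresponds to the "Late Lasso" variant examined later; substituting ordinary logistic regression (e.g. glm with family = binomial) on the pooled features gives the "Logistic" variant.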
Figure 2.1.1 Late and early data integration. A: Late integration: the features are selected from individual data types and then combined to form the final model. B: Early integration: the features are combined before variable selection and model setup. This figure is adapted from Ritchie et al., Figure 4 (Holzinger & Ritchie, 2012).

2.1.3 Regularized regression

2.1.3.1 Regularization and Lasso

Starting in the regression setting, suppose that we observe an outcome vector $y \in \mathbb{R}^n$ and a predictor matrix $X \in \mathbb{R}^{n \times p}$, whose columns $X_1, \ldots, X_p \in \mathbb{R}^n$ correspond to predictor variables. Linear regression solves the least squares problem

$$\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 = \min_{\beta \in \mathbb{R}^p} \lVert y - X\beta \rVert_2^2 .$$

Note that we write $\lVert x \rVert_2 = (\sum_{i=1}^{n} x_i^2)^{1/2}$. When rank(X) = p, i.e. the predictor variables $X_1, \ldots, X_p$ are linearly independent, the above least squares criterion is strictly convex, and so there exists a unique solution $\hat{\beta}$. However, when rank(X) < p, as happens for example when p > n, which is very common in high-dimensional genomic data analysis, there are infinitely many solutions; given any one solution $\hat{\beta}$, the quantity $\hat{\beta} + \eta$ is also a solution for any $\eta \in \mathrm{null}(X)$. Furthermore, this type of non-uniqueness makes interpretation of solutions meaningless: for at least one variable $j \in \{1, \ldots, p\}$, we will have $\hat{\beta}_j > 0$ at one solution $\hat{\beta}^{(1)}$ but $\hat{\beta}_j < 0$ at another solution $\hat{\beta}^{(2)}$. Moreover, even when rank(X) = p, so that a unique linear regression estimate exists, if p is moderately close to n then the linear regression estimate can have quite poor predictive accuracy, because its variance scales linearly with p.

One way to deal with this problem is to constrain the least squares estimator, i.e., choose $\hat{\beta}$ to solve

$$\min_{\beta \in \mathbb{R}^p} \lVert y - X\beta \rVert_2^2 \quad \text{subject to } \beta \in C,$$

for some set $C \subseteq \mathbb{R}^p$; the constraint $\beta \in C$ is a form of regularization. The choice of $C$ has a large effect on both the accuracy and the interpretability of the estimate. Two well-known examples of regularization are found in (Hoerl & Kennard, 1970) and (R. Tibshirani, 1996):

Ridge regression: $\min_{\beta \in \mathbb{R}^p} \lVert y - X\beta \rVert_2^2$ subject to $\lVert \beta \rVert_2^2 \le t$, (1)

Lasso regression: $\min_{\beta \in \mathbb{R}^p} \lVert y - X\beta \rVert_2^2$ subject to $\lVert \beta \rVert_1 \le t$, (2)

where t serves as a parameter that controls the size of the constraint set. Both constraints restrict the estimate to lie in a ball around 0 with respect to some norm (the L1 norm for the Lasso, the L2 norm for ridge); i.e., they shrink the estimated coefficients toward 0 in a particular way. The smaller the value of the tuning parameter t, the greater the shrinkage. From duality and the Karush-Kuhn-Tucker conditions, we can rewrite problems (1) and (2) as

Ridge regression: $\min_{\beta \in \mathbb{R}^p} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2$, (3)

Lasso regression: $\min_{\beta \in \mathbb{R}^p} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1$, (4)

with $\lambda > 0$ the tuning parameter. Equations (3) and (4) are called the penalized (or Lagrange) forms of the problems, whereas (1) and (2) are called the constrained forms. For each of the constrained problems above, and every value of t > 0, there is a corresponding value of $\lambda > 0$ such that the analogous penalized problem has the same solutions. Both the L1 and L2 penalty functions continuously shrink the estimated coefficients toward 0 as $\lambda$ increases. Continuity of a penalty function is necessary to avoid instability in model prediction (Fan & Li, 2001). Instability can occur when using best subset selection, which incorporates an L0 penalty and suffers extreme variability due to its inherent discreteness (Pan & Shen, 2007).
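As a minimal illustration of the penalized forms (3) and (4), the following R sketch (simulated toy data only; all object names are ours) fits both problems with the glmnet package, where the alpha argument interpolates between ridge (alpha = 0) and the Lasso (alpha = 1):

```r
library(glmnet)
set.seed(1)

## Toy data, purely illustrative: n = 50, p = 100, only the first 5 predictors
## carry nonzero true coefficients.
n <- 50; p <- 100
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1.5, 5), rep(0, p - 5))
y <- drop(X %*% beta + rnorm(n))

## Penalized (Lagrange) forms: alpha = 1 solves the Lasso problem (4),
## alpha = 0 the ridge problem (3), each over a decreasing grid of lambda.
fit_lasso <- glmnet(X, y, alpha = 1)
fit_ridge <- glmnet(X, y, alpha = 0)

## The L1 penalty sets coefficients exactly to zero as lambda grows,
## whereas the L2 penalty only shrinks them toward zero.
sum(coef(fit_lasso, s = 0.5)[-1] != 0)  # a handful of nonzero coefficients
sum(coef(fit_ridge, s = 0.5)[-1] != 0)  # generally all p remain nonzero
```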
Shrinkage reduces the variance of the estimate at the expense of (likely) introducing bias. It is often the case that the reduction in variance (tuned at some level $\lambda$) is well worth the price of bias, so that the tradeoff in overall accuracy is favorable. Moreover, the addition of either an L1 or an L2 penalty results in a convex optimization problem, whose solution does not suffer from multiple local minima and whose global minimizer can be computed efficiently. Compared with the L2 penalty in ridge regression, the L1 penalty induces sparsity in the estimate (i.e., increasing $\lambda$ can shrink coefficients to exactly zero). This is a desirable feature, since it corresponds to performing variable selection in the constructed model and provides a level of interpretability. In this study we are interested in the classification form of regularized regression to predict disease status or subtype using integrated genomic data. Further, we are interested in a parsimonious classifier for clinical testing and therefore prefer methods such as the Lasso that perform variable selection.

2.1.3.2 Elastic net

Although the Lasso can induce a sparse model with continuous shrinkage of the coefficients, it has some limitations, especially when applied in high-dimensional genomic data analysis. First, in the $p > n$ case, the Lasso selects at most $n$ variables before it saturates, due to the nature of the convex optimization problem, so there is an upper bound on the number of variables that can be selected in the model. Second, if there is a group of variables with high pairwise correlations, the Lasso tends to select only one variable from the group, essentially at random, and ignore the others. Even in the usual $n > p$ situation, if there are high correlations between predictors, it has been empirically observed that the prediction performance of the Lasso is dominated by ridge regression (R. Tibshirani, 1996).

To overcome these limitations of the Lasso while maintaining its desirable features, such as automatic variable selection and continuous shrinkage, Zou and Hastie (Zou & Hastie, 2005) developed a regularization technique named the Elastic net. The Elastic net is a regularized regression that uses both the L1 and L2 penalties of the Lasso and ridge methods. The naïve Elastic net, described first, directly applies the two shrinkage procedures:

$$\min_{\beta \in \mathbb{R}^p} \lVert y - X\beta \rVert_2^2 + \lambda_2 \lVert \beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 . \quad (5)$$

Let $\alpha = \lambda_2 / (\lambda_1 + \lambda_2)$; then solving (5) is equivalent to the optimization problem

$$\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 \quad \text{subject to } (1-\alpha)\lVert \beta \rVert_1 + \alpha \lVert \beta \rVert_2^2 \le t \text{ for some } t. \quad (6)$$

The function $(1-\alpha)\lVert \beta \rVert_1 + \alpha \lVert \beta \rVert_2^2$ is the Elastic net penalty, a convex combination of the Lasso and ridge penalties. When $\alpha = 1$, the naïve Elastic net becomes simple ridge regression. For all $\alpha \in [0,1)$, the Elastic net penalty function is singular (without first derivative) at 0, and it is strictly convex for all $\alpha > 0$, thus having the characteristics of both the Lasso and ridge regression. The solution of the naïve Elastic net is essentially a scaled Lasso solution for an augmented predictor matrix of rank $p$. So the naïve Elastic net can potentially select all $p$ predictors in all situations, and it can perform automatic variable selection in a fashion similar to the Lasso. Zou and Hastie also showed that the naïve Elastic net has the ability to select "grouped" variables, owing to the inclusion of the ridge penalty function, which is strictly convex and thus guarantees the grouping effect in the extreme situation of identical predictors.
This is a property not shared by the Lasso, which is not strictly convex. The naïve Elastic net suffers from a problem of double shrinkage, which does not help to reduce the variance much and introduces unnecessary extra bias compared with pure Lasso or ridge shrinkage. The Elastic net coefficients are rescaled naïve Elastic net coefficients that correct for the double shrinkage:

$$\hat{\beta}(\text{Elastic net}) = (1 + \lambda_2)\,\hat{\beta}(\text{naïve Elastic net}).$$

In simulation studies, the Elastic net is significantly more accurate than the Lasso, even when the Lasso does much better than ridge regression. The simulation results indicate that the Elastic net dominates the Lasso under collinearity. The Elastic net tends to select more variables than the Lasso does, owing to the grouping effect. Real-world data have also been used to compare the support vector machine (Guyon, Weston, Barnhill, & Vapnik, 2002), penalized logistic regression (Zhu & Hastie, 2004), nearest shrunken centroids (Tibshirani, Hastie, Narasimhan, & Chu, 2002) and the Elastic net in terms of microarray classification and gene selection. It is claimed that the Elastic net gives the best classification, and that it has an internal gene selection facility (Zou & Hastie, 2005).

Correlations among genomic features are common in data generated by high-throughput omic technologies. Correlations in expression among genes in the same biological pathway, or correlations in DNA methylation among neighboring CpG sites in CpG islands, can be high, and thus sets of predictors can form groups. For the analysis of integrated data, the correlations between gene expression values and DNA methylation are also important. In this kind of grouped-variable situation the Lasso is not the ideal method, because it lacks the ability to reveal the grouping information. Instead, we will consider the Elastic net and compare its performance to the Lasso in both simulation studies and real data analysis.

2.1.3.3 Variants of Lasso

Adaptive Lasso

It has been argued (Fan & Li, 2001; Fan & Peng, 2004) that a good fitting procedure should have the following oracle properties (asymptotically): (1) the method correctly selects the nonzero coefficients with probability converging to one, and (2) the estimators of the nonzero coefficients are asymptotically normal with the same means and covariances that they would have if the zero coefficients were known in advance. There are solid arguments that the Lasso does not satisfy these oracle properties. Fan and Li (Fan & Li, 2001) stated that the Lasso shrinkage produces biased estimates of the large coefficients, and thus can be suboptimal in terms of estimation risk. Meinshausen and Bühlmann (Meinshausen & Bühlmann, 2006) also showed the conflict between optimal prediction and consistent variable selection in the Lasso: they proved that the optimal λ for prediction gives inconsistent variable selection results, with many noise features included in the predictive model. Zou (Zou, 2006) derived the necessary condition for consistency of Lasso variable selection and exhibited several scenarios in which it is violated, showing that the condition is nontrivial. Zou proposed a new version of the Lasso, the adaptive Lasso, in which adaptive weights are used to penalize different coefficients in the L1 penalty. The adaptive Lasso can be defined as follows. Suppose that $\hat{\beta}$ is a root-n-consistent estimator of $\beta^*$ (the true coefficients); for example, we can use $\hat{\beta}_{\mathrm{OLS}}$. Pick a $\gamma > 0$, and define the weight vector
$\hat{w} = 1 / |\hat{\beta}|^{\gamma}$. The adaptive Lasso estimates $\hat{\beta}^{*(n)}$ are given by

$$\hat{\beta}^{*(n)} = \arg\min_{\beta} \Big\lVert y - \sum_{j=1}^{p} X_j \beta_j \Big\rVert^2 + \lambda_n \sum_{j=1}^{p} \hat{w}_j |\beta_j| .$$

The adaptive Lasso is essentially an L1 penalization method and is computationally efficient. Zou proved that when $\lambda_n / \sqrt{n} \to 0$ and $\lambda_n n^{(\gamma-1)/2} \to \infty$, the adaptive Lasso estimates satisfy the oracle property. Note that the data-dependent $\hat{w}$ is the key to the oracle property of the adaptive Lasso. As the sample size grows, the weights for zero-coefficient predictors become inflated (to infinity), whereas the weights for nonzero-coefficient predictors converge to a finite constant. Thus the adaptive Lasso can simultaneously estimate the large coefficients (asymptotically) unbiasedly and threshold the small ones toward zero. J. Huang et al. (J. Huang & Zhang, 2008) studied the asymptotic properties of the adaptive Lasso estimators in sparse, high-dimensional, linear regression models when the number of covariates may increase with the sample size, i.e. the case where $p_n \to \infty$ as $n \to \infty$. They showed that, if a reasonable initial estimator is available, then under appropriate conditions the adaptive Lasso has an oracle property. In addition, under a partial orthogonality condition, in which the covariates with zero coefficients are weakly correlated with the covariates with nonzero coefficients, marginal regression can be used to obtain the initial estimator. With this initial estimator, the adaptive Lasso has the oracle property even when the number of covariates is much larger than the sample size. The partial orthogonality condition is reasonable in microarray data analysis, where the genes that are correlated with the phenotype of interest may be in different functional pathways from the genes that are not related to the phenotype (Bair, Hastie, Paul, & Tibshirani, 2006).

Group Lasso

In some regression problems we have categorical predictors (factors), where each factor may have several levels and can be expressed through a group of dummy variables. Another situation is the additive model with polynomial or nonparametric components, where each component may be expressed as a linear combination of a number of basis functions of the original measured variables. In both scenarios (with categorical predictors present or an additive model considered), variable selection typically amounts to the selection of important factors (groups of variables) rather than individual derived variables. However, the Lasso solution is not satisfactory, as it selects only individual dummy variables instead of whole factors, or individual derived input variables rather than whole additive components. Moreover, the Lasso solution depends on how the factors are orthonormalized: choosing different contrasts for a categorical predictor will, in general, produce different solutions. The group Lasso (Yuan & Lin, 2006) overcomes these problems by introducing a suitable extension of the Lasso penalty. Suppose that the $p$ predictors are divided into $L$ groups, with $p_l$ the number in group $l$. We use a matrix $X_l$ to represent the predictors corresponding to the $l$th group, with corresponding coefficient vector $\beta_l$. Then the group Lasso solves the convex optimization problem

$$\min_{\beta \in \mathbb{R}^p} \Big( \Big\lVert y - \sum_{l=1}^{L} X_l \beta_l \Big\rVert_2^2 + \lambda \sum_{l=1}^{L} \sqrt{p_l}\, \lVert \beta_l \rVert_2 \Big), \quad (7)$$

where the $\sqrt{p_l}$ term accounts for the varying group sizes and $\lVert \cdot \rVert_2$ is the Euclidean norm. This penalty can be viewed as intermediate between the L1- and L2-type penalties.
It has the attractive property that it performs variable selection at the group level and, like ridge regression, is invariant under (groupwise) orthogonal transformations. Meier et al. (Meier, Van De Geer, & Bühlmann, 2008) extended the group Lasso to logistic regression. The group Lasso expressed in (7) does not yield sparsity within a group, and the algorithm proposed by Yuan & Lin (Yuan & Lin, 2006) assumes the model matrices in each group are orthonormal. To overcome these limitations, Friedman et al. (Friedman, Hastie, & Tibshirani, 2010a) proposed the sparse group Lasso:

$$\min_{\beta \in \mathbb{R}^p} \Big( \Big\lVert y - \sum_{l=1}^{L} X_l \beta_l \Big\rVert_2^2 + \lambda_1 \sum_{l=1}^{L} \lVert \beta_l \rVert_2 + \lambda_2 \lVert \beta \rVert_1 \Big),$$

where $\beta = (\beta_1, \beta_2, \ldots, \beta_L)$ is the entire parameter vector. The inclusion of both group Lasso and Lasso terms yields sparsity at both the group and individual feature levels, and the algorithm Friedman et al. (Friedman et al., 2010a) provided also works for the standard group Lasso with non-orthonormal model matrices.

In this study we are interested in using concatenation-based data integration to predict disease outcomes, and we propose a novel penalized regression variant to improve prediction performance. Our approach uses an unweighted penalized regression that results in a sparse model. Specifically, we consider only the Lasso and the Elastic net. Unweighted approaches can be more accurate for outcome prediction, whereas weighted approaches are preferred for variable selection. In the future we may consider other penalized regressions with more sophisticated weight functions, such as the adaptive Lasso or group Lasso.

2.1.4 Problems of standard Lasso and Elastic net in data integration

Taskesen et al. (2015) used the Lasso for classification of samples measured on multiple genomic platforms (Taskesen, Babaei, Reinders, & De Ridder, 2015). They used a uniform penalty parameter for the features across the different data types. We refer to this traditional form of penalized regression as the "standard Lasso". It can be undesirable when the individual datasets contribute differently to the final prediction, and it puts us at risk of missing some subtle but important features. For instance, assume that we are integrating data from two genomic arrays, such as gene expression and HM450 methylation arrays, and assume both data types contribute independent information to the prediction of disease outcome. By combining the data types, we can obtain additional information compared with using either single data type alone. Furthermore, suppose that the gene expression data have larger effect sizes in predicting the outcome variable but fewer predictive variables, while the DNA methylation data contain more predictive features but with smaller effect sizes. If we use a single penalty parameter, we shrink variables across the different arrays equally, and in doing so we might keep only features from the data type yielding larger coefficients (e.g. the gene expression data) while missing important signals in the data where the coefficients are smaller (e.g. the methylation data). In the end, only gene features might be selected in the final model, and the methylation features, which are also informative for outcome prediction, might be missed because of the overly stringent shrinkage. The problem with the standard Lasso is illustrated in Figure 2.1.2. If we substitute the Elastic net for the Lasso, the standard Elastic net, where a single pair of penalty parameters $(\lambda_1, \lambda_2)$ is used for the features across the different arrays, suffers from the same problem.
Figure 2.1.2 Illustration of the problems with the standard Lasso or standard Elastic net. Using a uniform penalty parameter, we might lose some subtle but important features in an array with smaller effect sizes.

To solve the problem caused by having a single penalty parameter, we propose a new method that allows different constraints on separate data types. Specifically, we use different penalty parameters for the features in different arrays, in both Lasso and Elastic net regressions. We compare our proposed method (multi-tuning parameter Lasso and Elastic net) with the standard Lasso or Elastic net, as well as with late data integration approaches, where we select variables from individual data types using separate Lasso or Elastic net models before combining the selected variables into a single model. In such late data integration models, separate penalty parameters are used for the different data types by design; however, the variables are selected by platform, without accounting for independent variation from the different data types. Within the late integration approach, we consider two methods for combining the selected variables from the individual data types: we either apply traditional logistic regression (denoted "Logistic") or use penalized regression again (denoted "Late Lasso/Elastic net") to obtain the final model.

Models are tuned and tested using training and testing data sets, respectively. The tuning parameters are determined based on the AUCs from cross-validation in the training data. Model performance is evaluated based on the AUCs in independent testing data. In addition, the sensitivity and specificity of variable selection are also investigated. In simulation studies, we consider situations where the variables are independent or correlated, and we identify the situations where the multi-tuning parameter Lasso or Elastic net is particularly useful. We also evaluate the different models in real genomic data obtained from both GEO and TCGA.

2.2 Methods

2.2.1 Multi-tuning parameter Lasso regression

Suppose that we observe an outcome vector $y \in \mathbb{R}^n$ and a predictor matrix $X \in \mathbb{R}^{n \times p}$, whose columns $X_1, \ldots, X_p \in \mathbb{R}^n$ correspond to empirical realizations of p predictor variables. As discussed in Section 2.1.3.1, the standard Lasso solves the problem

$$\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| ,$$

where $\lambda > 0$ is the penalty parameter. Here we consider using different penalty parameters for the different data types in concatenation-based integration. We first consider the simplest setting, where we have two individual data types, with $p_1$ and $p_2$ variables respectively. The multi-tuning parameter Lasso can be formulated as

$$\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda_1 \sum_{j=1}^{p_1} |\beta_j| + \lambda_2 \sum_{j=p_1+1}^{p} |\beta_j| , \quad (8)$$

where $\lambda_1 > 0$, $\lambda_2 > 0$ are the penalty parameters for the first and second data type, respectively, and $p = p_1 + p_2$. Let $w = \lambda_2 / \lambda_1$; then formula (8) can be rewritten as

$$\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p_1} |\beta_j| + \lambda w \sum_{j=p_1+1}^{p} |\beta_j| .$$

In our particular study we are interested in classification, with Lasso penalization for the logistic model. The general form of penalized regression maximizes a penalized log-likelihood. Suppose now that we have a binary response variable $y_i \in \{0, 1\}$. The multi-tuning parameter Lasso in the logistic regression setting is defined as (again writing $w = \lambda_2/\lambda_1$):
$$\hat{\beta} = \arg\max_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \Big[ y_i \log \Lambda(x_i^T\beta) + (1 - y_i) \log\big(1 - \Lambda(x_i^T\beta)\big) \Big] - \lambda \Big( \sum_{j=1}^{p_1} |\beta_j| + w \sum_{j=p_1+1}^{p} |\beta_j| \Big), \quad (9)$$

where $\Lambda(z) = \exp(z) / (1 + \exp(z))$.

2.2.2 Multi-tuning parameter Elastic net regression

Using the same setting defined at the beginning of the methods section for the multi-tuning parameter Lasso (Section 2.2.1), the penalty term for the standard Elastic net is

$$\text{Penalty} = \lambda \Big[ (1-\alpha) \sum_{j=1}^{p} \beta_j^2 / 2 + \alpha \sum_{j=1}^{p} |\beta_j| \Big].$$

In our study we propose the multi-tuning parameter Elastic net. Assuming we have two data types, we use different penalty parameters $\lambda_1 > 0$, $\lambda_2 > 0$ for the first and second data type, respectively. Here we fix $\alpha = 0.5$, and the penalty term in the multi-tuning parameter Elastic net is

$$\text{Penalty} = \frac{1}{2} \Big\{ \lambda_1 \Big( \sum_{j=1}^{p_1} \beta_j^2 / 2 + \sum_{j=1}^{p_1} |\beta_j| \Big) + \lambda_2 \Big( \sum_{j=p_1+1}^{p} \beta_j^2 / 2 + \sum_{j=p_1+1}^{p} |\beta_j| \Big) \Big\}.$$

Specifically, in the current study we have a binary outcome $y_i \in \{0, 1\}$, and letting $w = \lambda_2 / \lambda_1$, the multi-tuning parameter Elastic net in the logistic regression setting can be defined as

$$\hat{\beta} = \arg\max_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \Big[ y_i \log \Lambda(x_i^T\beta) + (1 - y_i) \log\big(1 - \Lambda(x_i^T\beta)\big) \Big] - \frac{\lambda}{2} \Big\{ \Big( \sum_{j=1}^{p_1} \beta_j^2 / 2 + \sum_{j=1}^{p_1} |\beta_j| \Big) + w \Big( \sum_{j=p_1+1}^{p} \beta_j^2 / 2 + \sum_{j=p_1+1}^{p} |\beta_j| \Big) \Big\}, \quad (10)$$

where $\Lambda(z) = \exp(z) / (1 + \exp(z))$. To solve (9) and (10) we take advantage of the glmnet package in R (version 2.0-5) (Friedman, Hastie, & Tibshirani, 2010b), where we can tune the parameter w through the "penalty.factor" argument (a minimal code sketch is given below).

2.2.3 Tuning of parameters in multi-tuning parameter Lasso and Elastic net

We perform a nested tuning routine for our two parameters, the weight w and the shrinkage parameter lambda; the weight is evaluated over a grid from 0.2 to 1.8 in intervals of 0.05. These boundaries were selected because any weight outside this range would shrink the effects of one data type so much as to make essentially sole use of the other, and the effect of data integration would be lost. For each weight, λ is tuned by 10-fold cross-validation on the training data, and the maximum of the mean AUC across the 10 cross-validation folds is recorded (termed the "training AUC" in the following text). The model generated from the training data at λ_1se, the largest value of lambda such that the AUC is within one standard error of the maximum, is applied to an independent testing dataset and the resulting AUC is recorded (termed the "testing AUC"). The weight parameter that attains the maximum training AUC is termed the "optimal weight".

2.2.4 Simulation study

In the simulation studies we considered situations where the variables were independent and situations where they were correlated. Under both settings, we were interested in addressing the following questions: (1) Does data integration provide better prediction than use of a single data type? (2) Does early data integration, which combines data types before model setup, outperform late integration, where we select the variables from the individual data types before integration? (3) Within late data integration, is it necessary to apply another penalized regression model to the selected features, or is it sufficient to use logistic regression with the combined feature set? (4) What factors affect the selection of the optimal weight parameter in the multi-tuning parameter penalized regression? We were particularly interested in the performance of the multi-tuning parameter Lasso in comparison with the standard Lasso, and of the multi-tuning parameter Elastic net in comparison with the standard Elastic net. We also compared the performance of the Elastic net with the Lasso.
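To make the estimation procedure concrete, here is a minimal R sketch of how models (9) and (10) can be fit with glmnet, using the penalty.factor argument to impose the weight w on the second data type, together with the nested weight/lambda tuning of Section 2.2.3. The objects x_train, y_train, and p1 are hypothetical placeholders (a combined feature matrix, a binary outcome, and the number of type 1 features), and the wrapper name fit_multi is ours, not part of glmnet.

```r
library(glmnet)

## Hypothetical wrapper (fit_multi is our name, not a glmnet function): fits the
## multi-tuning parameter model (9) (alpha = 1) or (10) (alpha = 0.5) for a single
## weight w = lambda2/lambda1, using penalty.factor to rescale the penalty per block.
fit_multi <- function(x, y, p1, w, alpha = 1) {
  p2 <- ncol(x) - p1
  ## penalty.factor multiplies lambda feature by feature: 1 for the p1 features of
  ## data type 1, w for the p2 features of data type 2. Only the ratio of the two
  ## blocks' factors matters, since glmnet rescales the factors internally.
  pf <- c(rep(1, p1), rep(w, p2))
  cv.glmnet(x, y, family = "binomial", alpha = alpha,
            penalty.factor = pf, type.measure = "auc", nfolds = 10)
}

## Nested tuning of Section 2.2.3 (x_train, y_train, p1 are placeholders):
## for each w on the grid, lambda is tuned by 10-fold CV; the training AUC is
## the maximum of the mean CV AUCs, and the optimal weight maximizes it.
weights <- seq(0.2, 1.8, by = 0.05)
fits <- lapply(weights, function(w) fit_multi(x_train, y_train, p1 = p1, w = w))
train_auc <- sapply(fits, function(f) max(f$cvm))
w_best <- weights[which.max(train_auc)]

## The testing AUC uses the model at lambda.1se of the chosen fit, e.g.
## predict(fits[[which.max(train_auc)]], newx = x_test, s = "lambda.1se").
```

Because only the ratio of the per-feature penalty factors is identified, fixing the type 1 factor at 1 and setting the type 2 factor to w reproduces exactly the reparameterization w = λ2/λ1 used in (9) and (10).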
The models were evaluated based on both outcome prediction and variable selection (evaluation metrics are discussed in Section 2.2.6).

Simulations for independent settings

In this study we focus on the situation where the outcome is binary, the independent variables are continuous, and the number of features is larger than the sample size (p > n). Specifically, the outcome $Y_{n \times 1}$ given the input matrix $X_{n \times p}$ is generated from the logistic regression model:

$$\eta_{n \times 1} = \log \frac{\Pr(Y=1)}{\Pr(Y=0)} = \beta_0 + X_{n \times p} \cdot \beta_{p \times 1} = \beta_0 + \big( X^{(1)}_{n \times p_1}, X^{(2)}_{n \times p_2} \big) \cdot \begin{pmatrix} \beta^{(1)}_{p_1 \times 1} \\ \beta^{(2)}_{p_2 \times 1} \end{pmatrix},$$

$$\Pr = \frac{\exp(\eta)}{1 + \exp(\eta)}, \qquad Y \sim \text{Bernoulli}(\Pr),$$

where $X_1, X_2, \ldots, X_p$ are i.i.d. and follow the standard normal distribution N(0,1). In our simulations we fixed n = 200, p = 500, and $\beta_0 = 0$. It is worth noting that we are not simulating an extremely high-dimensional scenario here, as we used a relatively small number of samples and features out of consideration for computational speed. Later we apply the proposed methods to real cancer data sets, each featuring thousands of predictors and hundreds of samples. To mimic the procedure of data integration, we assume the top p1 features are from data type 1 (e.g., gene expression), and the remaining p2 = p - p1 features are from the second data type (e.g., DNA methylation). Among the p1 features, we further assume r1 features have non-zero coefficient β1 (also denoted "Coef1" in the following text), and the remaining p1 - r1 features have zero coefficients. Similarly, for the p2 features from the second data type, we assume r2 features have non-zero coefficient β2 (also denoted "Coef2"), and the remaining p2 - r2 features have zero coefficients. The simulation for the independent setting is illustrated in Figure 2.2.1. Coef1, Coef2, r1 and r2 take on different values to investigate the performance of the multi-tuning parameter Lasso under different simulation settings.

Figure 2.2.1 Illustration of the simulation studies for the independent setting.

Simulation for correlated settings

In the correlated-variable settings, we draw $X_1, X_2, \ldots, X_p$ from a multivariate normal distribution with mean vector 0 and variance-covariance matrix $\Sigma$: $X \sim N(0, \Sigma)$. Specifically, we first consider the correlations among informative variables, both within and between platforms. We simulate situations where n1 of the r1 informative genes (whose simulated coefficients are Coef1 ≠ 0) have pairwise correlation Cor_gene, n2 of the r2 informative methylation features (whose simulated coefficients are Coef2 ≠ 0) have pairwise correlation Cor_meth, and those n1 genes and n2 methylation features can also be pairwise correlated with correlation coefficient Cor_inter. Figure 2.2.2A shows the structure of a sample variance-covariance matrix as a heatmap, for the r1 genes and r2 methylation features. When we investigate the performance of the multi-tuning parameter Lasso or Elastic net in correlated settings, we use correlation coefficients that are much smaller, and more similar to those observed in real data. Apart from the correlations among informative variables, we are also interested in the correlations between informative variables and noise. This additional variance-covariance matrix structure is illustrated in Figure 2.2.2B. A code sketch of this simulation design follows below.

Figure 2.2.2 Illustration of the simulation for correlated settings. The figure shows the heatmap of the variance-covariance matrix for the informative features (A) and for the informative and noise features (B).
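As an illustration, the following R sketch generates one replicate under the independent setting above and computes the maximum possible AUCs defined in the next section (Section 2.2.5) from the true linear predictor. The parameter values follow the text; the function name max_auc is a hypothetical helper, and the commented lines indicate one way the correlated setting could be realized (an assumption about the design, not the exact covariance structure used in the study).

```r
library(MASS)  # mvrnorm, for the correlated setting sketched in comments below
set.seed(2017)

## Independent setting (Section 2.2.4): n = 200, p = p1 + p2 = 500;
## r1 type 1 features have coefficient Coef1, r2 type 2 features have Coef2.
n <- 200; p1 <- 250; p2 <- 250
r1 <- 5; r2 <- 20; Coef1 <- 1.5; Coef2 <- 1
X <- matrix(rnorm(n * (p1 + p2)), nrow = n)
beta <- c(rep(Coef1, r1), rep(0, p1 - r1),
          rep(Coef2, r2), rep(0, p2 - r2))
eta <- drop(X %*% beta)            # linear predictor, beta0 = 0
y <- rbinom(n, 1, plogis(eta))     # Y ~ Bernoulli(exp(eta) / (1 + exp(eta)))

## Correlated setting (sketch only): replace the rnorm draw above with, e.g.,
##   Sigma <- diag(p1 + p2)
##   Sigma[1:n1, 1:n1] <- Cor_gene; diag(Sigma) <- 1  # block of informative genes
##   X <- mvrnorm(n, mu = rep(0, p1 + p2), Sigma = Sigma)

## Maximum possible AUC (Section 2.2.5): the proportion of case-control pairs
## whose true scores are correctly ordered, Pr(x_i'b > x_j'b | y_i = 1, y_j = 0).
max_auc <- function(score, y) mean(outer(score[y == 1], score[y == 0], ">"))
max_auc(eta, y)                               # whole data
max_auc(drop(X[, 1:p1] %*% beta[1:p1]), y)    # data type 1 alone
max_auc(drop(X[, (p1 + 1):(p1 + p2)] %*% beta[(p1 + 1):(p1 + p2)]), y)  # type 2
```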
2.2.5 Maximum possible AUC

The maximum possible AUC is used to measure the prediction contribution of subsets of variables in the simulation studies. This allows us to compare, under different simulation settings, the prediction performance of the individual data types with that of the combined data set. Specifically, we take all pairs of samples from the whole sample set (in total $\binom{n}{2}$ pairs) and calculate the frequency of pairs whose ordering of scores ($X\beta$) agrees with the ordering of the outcome variable; i.e., we compute $\Pr(x_i^T\beta > x_j^T\beta \mid y_i > y_j)$, which in our binary setting equals $\Pr(x_i^T\beta > x_j^T\beta \mid y_i = 1, y_j = 0)$, where i, j index the paired samples. To calculate the maximum possible AUCs for the individual data types, we separate $X_{n \times p}$ into $X^{(1)}_{n \times p_1}$ and $X^{(2)}_{n \times p_2}$ and compute the maximum possible AUC for each data type. The maximum possible AUCs guide our simulations by providing information about the relative prediction power of the individual data types and the whole data, and by tuning the parameters in our simulations we can control the informativeness of the simulated data.

2.2.6 Prediction model evaluation metrics

We used seven criteria to evaluate and compare prediction performance between the standard Lasso (or Elastic net) and the multi-tuning parameter Lasso (or Elastic net) on an independent test data set: (1) AUC; (2) 1 - misclassification error, when assigning observations to the category with the highest posterior probability; (3) total sensitivity: the proportion of variables with non-zero coefficients that are correctly selected in the final model; (4) total specificity: the proportion of variables with zero coefficients that are correctly excluded from the final model; (5) sensitivity in r1: the proportion of the r1 variables from data type 1 with non-zero coefficients that are correctly selected in the final model; (6) sensitivity in r2: the proportion of the r2 variables from data type 2 with non-zero coefficients that are correctly selected in the final model; and (7) positive predictive value (PPV): the proportion of selected variables that truly have non-zero coefficients in the simulation.

2.2.7 Real data analysis

Acute myeloid leukemia data

The acute myeloid leukemia (AML) data for 344 samples were obtained from the NCBI Gene Expression Omnibus, with accession numbers GSE14468 (HOVON-SAKK cohort) for the gene expression data and GSE18700 for the methylation data. Both datasets are annotated using the UCSC hg19 genome build. For each patient sample, genome-wide mRNA expression was measured using the Affymetrix HG-U133 Plus 2.0 array (Santa Clara, CA, USA). We filtered out gene features with fewer than 10 unique expression values, resulting in 46,083 gene features. For the same set of samples, genome-wide DNA methylation was measured with the HELP assay (Shaknovich, Figueroa, & Melnick, 2010). All 25,626 DNA methylation features were considered in this study. For these 344 adults, clinical, cytogenetic and molecular characteristics were analysed using bone marrow or peripheral blood, as described previously (Figueroa et al., 2010; Valk et al., 2004). Groups of AML patients characterized by a common cytogenetic or molecular abnormality are denoted as subtypes. From the 15 subtypes studied by Taskesen et al. (Taskesen et al., 2015), we focus on the 7q AML subtype, characterized by partial deletion of genome fragments on the long arm of chromosome 7.
Of the 344 patients, 35 have 7q AML, and the rest are characterized as non-7q. The multi-tuning parameter Lasso and Elastic net are used to build classifiers combining gene expression and DNA methylation data to differentiate 7q AML from the remaining subtypes.

Prostate cancer data

The prostate adenocarcinoma data were obtained from the GDC Data Portal (Project ID TCGA-PRAD) and assembled by our collaborators at USC. In total, 408 samples with both RNA-Seq gene expression and HM450 array data available were considered, with tumor aggressiveness as the outcome variable. We defined tumor samples with Gleason score no more than 7 and T category T2 (T2a, T2b and T2c) as non-aggressive, and those with Gleason score no less than 7 and T category T3 (T3a, T3b) or T4 as aggressive. This resulted in 143 samples defined as non-aggressive and 265 defined as aggressive. We filtered out gene features with zero variance across all samples, leaving 20,216 gene features in this study. For the HM450 DNA methylation array, we excluded probes targeting SNPs, mapping to the X and Y chromosomes, lying in cross-reactive regions, or having missing values in at least one sample. This left 371,513 (out of 485,577) features. The gene expression data are log2 transformed to avoid skewness in the data distribution.

Statistical Analysis

The real data were split into training and testing data sets, with the penalty parameters tuned by repeated cross-validation. Stratified by outcome, 80% of samples were selected for the training data and the remainder for the test set. In the training data, for weights ranging from 0.1 to 1.5, we performed 5-fold cross-validation 10 times over a grid of lambda values. The optimal lambda parameter for each weight was selected as the one that yielded the maximum of the mean AUCs over the 10 cross-validations. We picked the optimal weight by applying the model for each paired tuning-parameter setting, the weight and its optimal lambda, to the test data across the grid of weights; the weight yielding the highest AUC was selected. Note that within each cross-validation, we used the same sample splits across the different weight parameters.

2.3 Results

2.3.1 Independent Settings

2.3.1.1 Comparison of maximum possible AUCs

We first confirm that maximum possible AUCs can quantitatively measure the predictive power of individual (omic) data sets and thus serve as a gauge in designing the different simulation settings. We considered four simulation scenarios in which the relative effect sizes and the numbers of variables with non-zero coefficients in the two combined data sets varied dramatically. Across all four settings we considered only the situation where the two data sets had equal numbers of variables (i.e., p1 = p2 = 250). For each scenario we simulated 200 replicate data sets. Boxplots of the maximum possible AUCs for the individual data sets, as well as for the overall data, are shown in Figure 2.3.1. In Fig 2.3.1A we kept the effect size and number of non-zero coefficients identical in the two data sets (Coef1 = Coef2, r1 = r2). As expected, the prediction power of the two separate data subsets was roughly the same (the medians of the maximum possible AUCs for the type 1 and type 2 data sets were both around 0.8), and by combining the two parts we could indeed increase prediction power (the maximum possible AUC for the overall dataset was more than 0.9).
When the number of non-zero coefficients differed between the two data types, the prediction performance of the data set containing more variables with non-zero coefficients was substantially better (Fig 2.3.1B, Coef1 = Coef2, r1 > r2). Similarly, when the individual data sets had equal numbers of non-zero coefficients, prediction performance favored the data set with the larger effect size (Fig 2.3.1C, r1 = r2, Coef1 > Coef2). Fig 2.3.1D represents a situation with a trade-off between effect size and number of non-zero coefficients in the two parts of the data. In this specific setting, the second part (data type 2) contained more informative variables (r1 < r2) and still dominated the first part, despite the larger effect size in data type 1 (Coef1 > Coef2). Later we pay more attention to this trade-off situation (see Additional Figure 2.1). Note that in all four of the simulation settings above, the combined data achieved the best performance compared with either individual data type subset. This indicates that by adding information through combining independent data sets, we can indeed improve prediction performance. Furthermore, Fig 2.3.1 shows that maximum possible AUCs are informative in measuring the prediction ability of individual datasets. In the following sections we utilize the maximum possible AUCs to guide and realize various simulation settings.

Figure 2.3.1 Maximum possible AUCs for four simulation settings. r1, r2: number of variables with non-zero coefficients in data type 1 and data type 2, respectively. Coef1, Coef2: non-zero coefficients used in data type 1 and data type 2, respectively. The width of the blue squares represents effect size and the height represents the number of informative variables. (N = 200 simulated data sets.)

2.3.1.2 Performance of multi-tuning parameter Lasso

In this part we used the above four simulation settings with known maximum possible AUCs to investigate (1) whether the prediction performance varied with different weight parameters; (2) whether the combined data outperformed the individual data types in terms of prediction; (3) whether the optimal weight parameters differed across the four simulation settings; and (4) whether the multi-tuning parameter Lasso outperformed the standard Lasso. To address these questions, for each simulation setting we plotted the averaged empirical AUCs from 200 test data sets against the different weight parameters (Figure 2.3.2).
For all of the four simulation settings we observed that the empirical AUCs were maximized away from the boundary conditions, which suggested that use of the combined data outperformed any of individual data set. Additional Figure 2.2 shows the change of both ‘internal’ AUCs, AUCs from cross-validation in the training data, and ‘external’ AUCs, AUCs from the test data, across the weight parameter. Comparing the internal and testing AUCs, the training AUCs from the training data were always larger than testing AUCs from the test data, as we would expect. Notably, we found similar patterns of change of internal and testing AUCs across the different weights. Such consistency supports using the AUCs from the training data to select the optimal weight parameter, since in real situations the outcome of our test data would be unknown. The optimal weight differed across four simulation settings (Figure 2.3.2). In Fig 2.3.2A, where the maximum possible AUCs for two parts of the data were almost the same, suggesting equal informativeness of each data type, the best weight equaled 1, i.e. no bias should be imposed on either data set. When the second data set contributed more in total prediction performance, the best weight was less than 1, which corresponds to more shrinkage imposed to the less informative data set (Figs 2.3.2B,D). In Fig 2.3.2 C, where the first data set contributed more to prediction accuracy due to larger effect size for the same number of informative features, the best weight was slightly larger than 1. 44 Figure 2.3 2 The empirical AUCs as a function of the weight parameter (N=200 simulated data sets). x axis: weight parameter; y axis: empirical AUCs; black line: mean AUCs from testing data sets; blue line: best weight parameter. The width of the blue squares represents effect size and the height represents number of informative variables. The benefit of multi-tuning parameter Lasso compared with standard Lasso was shown in Additional Figure 2.3. We notice that except for the first setting where the two data types were equally informative and there was no additional benefit from adding a weight parameter, the prediction performance for the multi-tuning parameter Lasso outperformed that of the standard Lasso. Specifically, the median differences in AUCs between standard Lasso and multi-tuning parameter Lasso in the four settings are 0.0084, 0.030, 0.010, and 0.021, respectively, with only the latter three differences statistically significant (paired t test p<0.05). Small differences in AUC were observed for a small number of data sets under the first setting. This variation was due to random chance, with the more flexible multi-tuning parameter always having higher AUC than the standard Lasso. 45 2.3.1.3 Comparison between Late and Early data integration One goal of this study is to investigate and compare different approaches of data integration. In our “late integration” approach, we first selected the variables from the individual data, and then we either used logistic regression or Lasso (for a second time) with the combined selected feature sets to get the final predictive model. We named the corresponding testing AUCs as “Logistic” and “Late Lasso”, respectively. For our “early integration” approach, we were interested in the difference in prediction performance between standard Lasso and multi-tuning parameter Lasso with the best weight parameter (Section 2.3.2). 
Herein we were particularly interested in the last situation depicted in Figure 2.3.1, where the second data type contains more informative features but with smaller effect size. From Figure 2.3.3A we clearly see that early data integration has better prediction performance than late integration. Within late integration, using the penalized regression for a second time after feature selection from individual arrays can further improve prediction, compared with using logistic regression on the combined features, possibly due to the fact that the Lasso can further remove noise features that have been included in the first step of feature selection. Within early integration, as we have showed in Figure 2.3.2 and Additional Figure 2.3, using the additional weight parameter can achieve better prediction performance. Figure 2.3.3A also shows that using data integration by any method is better than using either of the individual data types alone. Comparing the informativeness of the individual data types, the simulated DNA methylation data has better prediction performance than that of the expression data, presumably due to the larger fraction of methylation features informative of outcome despite their smaller effect sizes. We also evaluated different approaches of data integration from the perspective of variable selection. Specifically, we compared the sensitivity, i.e. the proportion of true signals that have non-zero coefficients, for the four integration methods under consideration (Figure 2.3.3B). We found that early integration achieved higher sensitivity than late integration. The sensitivity for Logistic and Late Lasso was almost the same. And the higher sensitivity in multi-tuning parameter Lasso is not out of expectation, since by imposing less penalty on the second data set we should rescue some true signals that could be missed otherwise in standard Lasso. 46 Figure 2.3 3 Comparison between early and late data integration. A: AUCs on an independent testing data from the model generated by individual data sets, and combined data with different integration approaches. B: Sensitivity in variable selection for different approaches of data integration. (N=200 simulated data sets). 2.3.1.4 Performance of Multi-tuning parameter Lasso in different simulation settings 2.3.1.4.1 Varying the coefficients Now we are particularly interested in the trade-off situation described in Fig2.3.1D and Fig2.3.2D, when the number of informative features and their effect sizes vary between the two data sets. Here, we explored the effect of varying the coefficients for the first data type from 1.0 to 2.2, keeping all other variables the same (r1=5, r2=20, p1=p2=250, Coef2 = 1). With the increase of Coef1, the predictive power of the first part was getting larger, as was expected. However, we also observed that the prediction capability of the second part was getting smaller. That is because with the outcome dependent on both data types, the more data type 1 explains the data, the less overall contribution is given by data type 2, despite the parameters remaining the same. Consequently, the contribution of individual data sets was approaching the same level, as evidenced by maximum possible AUCs in Additional Figure 2.1 (boxplots). As a result, in Figure 2.3.4A we can observe that the right tail of figures, which represented the prediction power of using the first data set alone in terms of 47 empirical AUCs was increasing; and the left tail of figures was decreasing. 
The weight that maximized the mean AUC (the "best weight") approached 1 as Coef1 increased (Figure 2.3.4A). Since in all of the settings of Figure 2.3.4A the second part contributed more to total prediction performance than the first part, according to the maximum possible AUCs, the optimal weights were less than 1 throughout. Figure 2.3.4B shows the mean (red dots) and standard error of the mean (caps) of the optimal weights across 200 simulations in each setting, where the weight parameter was chosen from 0.2 to 1.8 in steps of 0.05. Consistent with the gradual rightward shift of the dashed blue line in Figure 2.3.4A, the mean of the optimal weights obtained from the 200 single simulations increased with Coef1, and the standard error of the mean was small relative to the mean, suggesting that the selection of weights was quite stable. It is worth noting, however, that further increases of Coef1, especially beyond 2.0, did not lead to a corresponding increase in the mean optimal weight: with a sufficiently large Coef1, the standard Lasso could already pick up the right variables from the first part, and including those important variables achieved a prediction very close to the AUC from the whole data set.

Figure 2.3.4 The optimal weight increases with increased Coef1 when r1 and r2 are fixed (N=200 simulated data sets). (A): The change of the mean empirical AUCs across weights in simulation settings with varied Coef1. In all settings r1=5, r2=20, Coef2=1, p1=p2=250. x axis: weight parameter; y axis: empirical AUCs; blue line: best weight parameter. The maximum possible AUCs for the individual data sets and the whole data are indicated in the right corner. (B): The mean (red dots) and standard error of the mean (caps) of the optimal weights across 200 simulations in the settings with varied Coef1.

Having solved for the optimal weight in each simulation setting based on the empirical AUC, we needed to confirm whether the multi-tuning parameter Lasso, with the proper combination of weight and lambda parameters, outperforms the standard Lasso in generating a better prediction model, as determined by both prediction accuracy and variable selection ability. We note that the standard Lasso is equivalent to using a weight of one in the multi-tuning parameter Lasso (simple concatenation). For each simulation setting, we obtained prediction models for both the standard and multi-tuning parameter Lasso from the same training data, applied the models to an independent testing data set, and computed the seven evaluation metrics. The results for 200 replicate data sets are shown in Additional Figure 2.4 for both the standard Lasso (yellow) and the multi-tuning parameter Lasso (red). We found that (1) across all simulation scenarios examined, the AUC, 1 − misclassification error, sensitivity, and sensitivity in the r2 features obtained by the multi-tuning parameter Lasso were larger than those from the standard Lasso, and these differences were statistically significant (p<0.001); (2) the differences in AUC and 1 − misclassification error between the two models shrank as Coef1 increased, suggesting that the advantage of the multi-tuning parameter Lasso in prediction accuracy diminishes as the contributions of the individual data sets approach the same level;
(3) although the multi-tuning parameter Lasso selected more informative variables from the second part by imposing relatively more penalty on the first part (as evidenced by the increased sensitivity in r2), this did not hurt the sensitivity in r1 compared with the standard Lasso; and (4) regarding specificity and positive predictive value (PPV), the standard Lasso outperformed the multi-tuning parameter Lasso in most scenarios. The last finding is not surprising and reflects the larger number of variables, including noise variables, selected by the more flexible multi-tuning parameter Lasso; as shown above, these differences nonetheless translated into higher prediction accuracy for the multi-tuning parameter Lasso.

2.3.1.4.2 Varying the number of informative variables

We next examined how the proportion of informative variables in the individual data sets affects the selection of the optimal weight parameter when the coefficients are fixed, i.e., the results of Figure 2.3.2 panel B compared with panel A. Specifically, we fixed the coefficients in data types 1 and 2 at Coef1=Coef2=1 and the number of variables with non-zero coefficients in the first part at r1=5, and we varied the number of non-zero coefficients in the second part, r2, from 10 to 25 in increments of 5. As expected, the contribution of the second part increased with r2, as evidenced by the maximum possible AUCs (Figure 2.3.5A). Consequently, the weight that produced the largest mean AUC (indicated by the dashed blue line in Figure 2.3.5A) was less than 1 and kept decreasing as r2 increased. Similarly, the mean of the optimal weights across 200 replicated simulations also decreased with increasing r2 (Figure 2.3.5B). This is reasonable, since a lighter penalty on the second data set allows more of its informative variables to be selected. It is worth noting that the prediction performance of data type 2 approached that of the whole data set as r2 increased, and thus the AUC for the second data set became more and more similar to that for the integrated data.

Figure 2.3.5 The optimal weight decreases with increased r2 when Coef1 and Coef2 are fixed (N=200 simulated data sets). (A): The change of the mean empirical AUCs across weights in simulation settings with varied r2. In all settings r1=5, Coef1=Coef2=1, p1=p2=250. x axis: weight parameter; y axis: empirical AUCs; black line: test data AUCs; blue line: best weight parameter. The maximum possible AUCs for the individual data sets and the whole data are indicated in the right corner. (B): The mean (red dots) and standard error of the mean (caps) of the optimal weights across 200 simulations in the settings with varied r2.

2.3.1.4.3 Varying the number of features in individual data types

Previously we considered only situations with an equal number of features in the two individual data sets, i.e., p1=p2=250, and we found that overall prediction performance improved when less penalty was imposed on the part with more non-zero coefficients. It remained unclear, however, which factor drives the selection of the best weight parameter when the coefficients are fixed: the absolute numbers of informative variables (r1, r2) or the proportions of non-zero coefficients (r1/p1, r2/p2) in the individual data sets. We therefore changed the simulation settings to p1=400, p2=100 and tuned the values of r1 and r2 in two different ways.
The first was to keep the number of non-zero coefficients the same in the two data sets (r1=r2=10), while the proportion of non-zero coefficients was larger in the second part (r1/p1=0.025, r2/p2=0.1). The second setting used different numbers of non-zero coefficients in the two data sets (r1=20, r2=5), but the same proportion of non-zero coefficients (r1/p1=r2/p2=0.05). A plot of AUC versus weight shows that the best weights for the two scenarios differ (0.75 and 1, respectively; Figure 2.3.6A, B), suggesting that it is the proportion, rather than the number, of non-zero coefficients in the two individual data sets that drives the selection of the best weight parameter.

Figure 2.3.6 The weight parameter is influenced by the proportion, not the number, of non-zero coefficients in the individual data sets (N=200 simulated data sets). In both simulation settings Coef1=Coef2=1, p1=400, p2=100. (A) The proportion of features with non-zero coefficients differed between the two data sets (r1=r2=10; r1/p1=0.025, r2/p2=0.1). (B) The proportion of non-zero coefficients was the same in both data sets (r1=20, r2=5; r1/p1=r2/p2=0.05).

2.3.1.4.4 Varying the total number of features

In the previous sections we fixed the total number of features (p) at 500; we now examine the effect of dimensionality on the performance of the multi-tuning parameter Lasso. We gradually increased the total number of features from 500 to 4000, fixing the coefficients and the number of informative variables at the values used in Figure 2.3.1D (i.e., Coef1=1.5, Coef2=1, r1=5, r2=20). Figure 2.3.7 shows that the optimal weight approaches (and can even exceed) 1 as dimensionality increases. In addition, the prediction performance of the multi-tuning parameter Lasso model, in terms of empirical AUCs, decreases as the total number of features grows. This is expected: given fixed coefficients and a fixed number of informative variables, the more noise features are present, the poorer the model performance. Indeed, the empirical AUCs on independent testing data sets when p equals 4000 are barely above 0.6 (red line in Figure 2.3.7). With a large number of features, the addition of a weight parameter ends up including more noise features, so the advantage of the multi-tuning parameter Lasso becomes less and less apparent as dimensionality increases.

Figure 2.3.7 The optimal weight increases with the total number of features (N=200 simulated data sets). The black, blue, purple and red curves represent 250, 500, 1000 and 2000 features in the individual data types, respectively. In all simulation settings Coef1=1.5, Coef2=1, r1=5, r2=20.

2.3.2 Correlated settings

2.3.2.1 Comparison of maximum possible AUCs

In the correlated settings, we considered two scenarios: (1) no correlations between the informative gene features and the informative methylation features (Cor_inter=0), and (2) correlations between the informative gene and methylation features (Cor_inter>0). Within each scenario, we considered intra-array correlations, i.e., correlations among the informative genes (Cor_gene>0) and among the informative methylation features (Cor_meth>0). Figure 2.3.8 shows boxplots of the maximum possible AUCs for these two scenarios, with Cor_inter equal to 0 in Figure 2.3.8A and 0.4 in Figure 2.3.8B. In both situations, 3 of the 5 informative genes had pairwise correlations of 0.4 (Cor_gene=0.4, r1=5, n1=3), 3 of the 20 informative methylation features had pairwise correlations of 0.2 (Cor_meth=0.2, r2=20, n2=3), and the correlations among the remaining features were 0. We fixed Coef1=0.8, Coef2=0.6, and p1=p2=250.
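Predictors with this block-correlation structure can be generated as sketched below, assuming multivariate normal features drawn with MASS::mvrnorm; the outcome is then generated from the informative columns as in the independent settings. The helper name is illustrative.

    library(MASS)

    ## Block covariance (sketch): unit variances, Cor_gene among the first n1
    ## informative genes, Cor_meth among the first n2 informative methylation
    ## features, and Cor_inter linking the two blocks across platforms.
    make_sigma <- function(p1, p2, n1, n2, cor_gene, cor_meth, cor_inter) {
      p <- p1 + p2
      sigma <- diag(p)
      g <- seq_len(n1)          # correlated informative genes
      m <- p1 + seq_len(n2)     # correlated informative methylation features
      sigma[g, g] <- cor_gene
      sigma[m, m] <- cor_meth
      sigma[g, m] <- cor_inter
      sigma[m, g] <- cor_inter
      diag(sigma) <- 1
      sigma
    }

    sigma <- make_sigma(p1 = 250, p2 = 250, n1 = 3, n2 = 3,
                        cor_gene = 0.4, cor_meth = 0.2, cor_inter = 0.4)
    x <- mvrnorm(n = 100, mu = rep(0, 500), Sigma = sigma)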
In Figure 2.3.8A, the combined data has better prediction power, in terms of the maximum possible AUCs, than either individual data type in this correlated setting. Moreover, the simulated methylation array (data type 2), despite its smaller coefficients, actually has better prediction performance than the gene array (data type 1). Compared with the situation without inter-array correlations, when Cor_inter is larger than 0 (Figure 2.3.8B) the prediction power increases for both the individual data types and the combined data. More importantly, the difference in maximum possible AUCs between data types 1 and 2 decreases when the inter-array correlations are introduced, indicating that the two individual data types tend to capture similar effects.

Figure 2.3.8 Maximum possible AUCs for the two correlated simulation settings (N=200 simulated data sets). (A): The correlations between informative variables from the two data types are zero (Cor_inter=0). (B): Cor_inter=0.4. In both (A) and (B) we considered correlations among informative variables within the individual data types (Cor_gene=0.4, Cor_meth=0.2).

2.3.2.2 Performance of the multi-tuning parameter Elastic net

Next, we compared the prediction accuracy of the Elastic net and the Lasso in the correlated data settings. Figure 2.3.9 shows the change in AUCs for the multi-tuning parameter Elastic net (red line) and Lasso (black line) across weight parameters, where the inter-array correlations for the informative variables are either zero (Figure 2.3.9A) or non-zero (Figure 2.3.9B). First, the Elastic net has a higher AUC than the Lasso in both settings, but the benefit of the Elastic net, which encourages a grouping effect under collinearity, is more pronounced in the simulation with more correlation in the data (panel B). Second, the advantage of the multi-tuning parameter penalized regression is greater (weight farther from 1) when there is no correlation between the informative variables from the different arrays. The performance of the multi-tuning parameter penalized regression within each scenario (Cor_inter=0 or >0) is studied further in Section 2.3.2.4.

Figure 2.3.9 The AUCs as a function of the weight parameter in two correlated data settings (N=200 simulated data sets). x axis: weight parameter; y axis: testing AUCs; black line: Lasso; red line: Elastic net; blue line: best weight parameter. Cor_gene: correlation coefficients among the informative gene features; Cor_meth: correlation coefficients among the informative DNA methylation features; Cor_inter: correlation coefficients between the informative genes and DNA methylation features. Here Cor_gene=0.4 and Cor_meth=0.2 in both (A) and (B); Cor_inter=0 in (A) and 0.4 in (B).

2.3.2.3 Comparison between late and early integration

We compared the predictive performance of the Elastic net with that of the Lasso under both early and late data integration. For "late integration", we applied the Elastic net to the combined set of features selected by separate Elastic net fits on each data type ("Late EN").
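In R, the late-integration variants can be sketched as below; the helper names are illustrative, not the chapter's actual code, and alpha = 0.5 stands in for whichever Elastic net mixing value is used.

    library(glmnet)

    ## Per-platform feature selection by cross-validated Elastic net.
    select_features <- function(x, y, alpha = 0.5) {
      cv <- cv.glmnet(x, y, family = "binomial", alpha = alpha)
      b  <- as.matrix(coef(cv, s = "lambda.min"))[-1, 1]  # drop intercept
      which(b != 0)
    }

    ## Refit on the union of selected features, either by plain logistic
    ## regression ("Logistic") or by a second penalized fit ("Late EN").
    late_integration <- function(x1, x2, y, alpha = 0.5) {
      sel <- cbind(x1[, select_features(x1, y, alpha), drop = FALSE],
                   x2[, select_features(x2, y, alpha), drop = FALSE])
      list(logistic = glm(y ~ ., data = data.frame(y = y, sel),
                          family = binomial),
           late_en  = cv.glmnet(sel, y, family = "binomial", alpha = alpha))
    }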
Figures 2.3.10A and B depict the situations where the between-platform correlations (Cor_inter) are zero and above zero, respectively. In both figures, data integration, regardless of the approach taken (Lasso or Elastic net), achieved better prediction performance in terms of AUC than either individual data type. For late integration, applying penalized regression again to the combined selected features yielded better prediction than simply using logistic regression. For early integration, the multi-tuning parameter penalized regression was better than the standard penalized regression, although the difference was much less pronounced in the presence of inter-array correlations between the informative variables. Further, the Elastic net generally dominated the Lasso, especially in late integration (Late Lasso/EN), where the combined selected features tended to be enriched for correlated variables. The Elastic net also tended to dominate in early integration when Cor_inter>0, i.e., when there were strong correlations among informative variables both within and between platforms. Comparing Figures 2.3.10A and B, early integration outperforms late integration when there are no inter-array correlations, but barely any difference between early and late integration is observed when the inter-array correlation is high.

Apart from prediction performance, we also examined feature selection under the different integration approaches. Additional Figures 2.5 and 2.6 show the percentage of times that each informative variable was selected in the final model across 100 replicated simulations when Cor_inter=0 and Cor_inter>0, respectively. Without inter-array correlations (Additional Figure 2.5), early integration has a greater chance of selecting the informative gene features than late integration, and the inclusion of genuine gene signals in the final model tends to yield better prediction performance, since the gene features have larger effect sizes than the methylation features in these simulation settings. This could explain why early integration outperforms late integration when Cor_inter=0. Within early integration, the multi-tuning parameter Elastic net has a greater chance of selecting informative methylation features than the standard Elastic net, while the selection of true gene signals is largely unaffected; this may explain why the two-weighted Elastic net outperforms even the standard form. When there are inter-array correlations (Additional Figure 2.6), both early and late data integration have a large chance of selecting the informative, pairwise-correlated features. The inclusion of those six informative variables, which are pairwise correlated both within and between platforms, guarantees good prediction performance. As a result, the difference between late and early integration is much less pronounced when the inter-platform correlation is present, owing to the successful selection of the correlated informative variables by all integration approaches.

Figure 2.3.10 AUCs for early and late data integration in correlated settings (N=200 simulated data sets). Red and blue boxes represent the results for the Lasso and Elastic net regressions, respectively. Here Cor_gene=0.4 and Cor_meth=0.2 in both (A) and (B); Cor_inter=0 in (A) and 0.4 in (B).
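The selection-frequency summaries behind Additional Figures 2.5 and 2.6 can be sketched as follows, reusing the simulate_two_types generator sketched earlier; fit_fun stands for any of the integration fits above, and the helper name is illustrative.

    ## Percentage of replicates in which each informative feature receives a
    ## non-zero coefficient (informative columns: genes 1-5, methylation
    ## features 251-270 under the default generator above).
    selection_frequency <- function(fit_fun, n_rep = 100,
                                    informative = c(1:5, 250 + 1:20)) {
      hits <- replicate(n_rep, {
        dat <- simulate_two_types()
        cv  <- fit_fun(cbind(dat$x1, dat$x2), dat$y)
        b   <- as.matrix(coef(cv, s = "lambda.min"))[-1, 1]
        as.numeric(b[informative] != 0)
      })
      100 * rowMeans(hits)   # percent of replicates selecting each feature
    }

    ## e.g. for the standard Elastic net:
    ## selection_frequency(function(x, y)
    ##   cv.glmnet(x, y, family = "binomial", alpha = 0.5))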
2.3.2.4 Performance of the multi-tuning parameter Elastic net in different simulation settings

2.3.2.4.1 Varying the correlation coefficients within platforms

We first consider situations with fixed effect sizes and numbers of informative variables (Coef1=0.8, Coef2=0.6, r1=5, r2=20), fixed correlation coefficients among the informative gene features (Cor_gene=0.4, n1=3), and no correlation between platforms (Cor_inter=0), while varying the correlation coefficients among the informative methylation features (Cor_meth from 0 to 0.8 in steps of 0.2, with n2=3). As expected, the advantage of the Elastic net over the Lasso, each at its optimal weight parameter, becomes more pronounced as Cor_meth increases (Additional Figure 2.7A). As the correlation among the informative methylation features grows, the mean optimal weight parameter of the multi-tuning parameter Elastic net moves closer to 1, from about 0.82 at Cor_meth=0 to about 0.89 at Cor_meth=0.8 (Figure 2.3.11A). As the optimal weight parameter approaches 1, the difference in prediction performance between the multi-tuning and standard Elastic net decreases accordingly (Additional Figure 2.8A).

In terms of feature selection, Additional Figure 2.9 shows the percentage of times each informative variable was selected across 100 replicated simulations for the standard Elastic net with varied correlation coefficients among the informative methylation features. Clearly, as Cor_meth increases, the pairwise-correlated methylation signals are more likely to be selected in the final model, while the selection chances of the remaining informative variables are barely affected. The inclusion of those correlated informative features allows the standard Elastic net to achieve decent prediction performance, so adding a weight parameter either smaller or larger than 1 risks including more noise or missing informative signals, and is therefore unnecessary. The maximum possible AUCs for the settings with different Cor_meth offer another explanation for why the optimal weight approaches 1 as Cor_meth increases: Additional Figure 2.11A indicates that as Cor_meth grows, the prediction power of the methylation array improves and the gap in prediction ability between the methylation array and the combined data narrows. This suggests that with higher Cor_meth the standard Elastic net can already obtain a fair prediction model by including the important features in the methylation array, and capturing additional signals through the weight parameter may be unnecessary.

Figure 2.3.11 The change in the optimal weight parameter under different correlated simulation settings (N=200 simulated data sets). Optimal weight as a function of (A) the correlation coefficients among the informative methylation features (Cor_meth); (B) the number of correlated methylation features (n2); (C) the inter-platform correlation coefficients between informative genes and methylation features (Cor_inter).

2.3.2.4.2 Varying the number of correlated features within platforms
Similar to the settings described in Section 2.3.2.4.1 (varying the correlation coefficients within platforms), here we fix the correlation coefficients among the informative methylation features (Cor_meth=0.2) but keep increasing the number of pairwise-correlated methylation features (n2 ranges from 2 to 6). Since the Elastic net is designed to encourage a grouping effect, its superiority over the Lasso becomes more and more apparent as n2 increases (Additional Figure 2.7B). The mean optimal weight parameter for the Elastic net increases from 0.83 at n2=2 to 0.96 at n2=6 (Figure 2.3.11B), and correspondingly the difference between the multi-tuning parameter Elastic net and the standard Elastic net decreases with increasing n2 (Additional Figure 2.8B). The observed phenomenon, i.e., the optimal weight of the multi-tuning Elastic net approaching 1 as n2 increases, can again be explained by feature selection in the standard Elastic net and by the maximum possible AUCs of our simulation settings. Additional Figure 2.10 shows that the more correlated informative features a single array contains, even at the same correlation coefficient, the more likely those correlated features are to be selected in the final simple-concatenation model (weight=1). The inclusion of those multiple correlated features enables the model to achieve satisfactory prediction performance in terms of AUC, so a weight parameter other than 1 offers no further benefit. Additional Figure 2.11B likewise suggests that the more correlated features the simulated methylation array contains, the smaller the gap in prediction power between the methylation array and the combined data; the successful selection of the important methylation features allows the standard Elastic net to attain high prediction AUCs, which makes the multi-tuning parameter Elastic net less beneficial.

2.3.2.4.3 Varying the correlation coefficients between platforms

Here we fix the correlation coefficients among the informative variables within the individual arrays, as well as the number of correlated features, and vary the correlation between the informative gene and methylation features (Cor_inter) from 0 to 0.5 in increments of 0.1. With increasing inter-platform correlations, the benefit of the Elastic net over the Lasso increases (Additional Figure 2.7C), but the optimal weight of the multi-tuning Elastic net gradually approaches 1 (Figure 2.3.11C), and the advantage of the two-weighted Elastic net at its optimal weight over the standard Elastic net becomes less apparent (Additional Figure 2.8C). As in the preceding sections (2.3.2.4.1 and 2.3.2.4.2), the diminishing effect of the multi-tuning parameter Elastic net with increasing Cor_inter can be attributed to the greater chance of the standard Elastic net selecting the correlated informative variables, and to the relatively small difference in prediction power between the individual array and the combined data (Additional Figure 2.11C).

2.3.2.4.4 Correlations between informative and noise features

So far we have discussed situations with correlations among the informative variables; we now consider simulation settings with non-zero correlation coefficients between informative variables and noise features.
To emphasize the effect of this correlation structure, we chose a panel of simulation parameters similar to those of Figure 2.3.1A, with identical simulated effect sizes and numbers of informative variables in the two arrays (Coef1=Coef2=1; r1=r2=5) and identical correlation coefficients among the informative variables within the two platforms (Cor_gene=Cor_meth=0.4, n1=n2=3). Theoretically, the optimal weight parameter that maximizes the mean AUC should therefore be 1, since the two individual arrays have identical prediction power. We then introduced correlations between 3 of those correlated informative genes and 3 randomly selected noise gene features, setting the correlation coefficient Cor_Noise to 0.8. Figure 2.3.12 shows that even with such a strong correlation imposed between informative and noise genes, the optimal weight selected on the basis of the testing AUCs is still 1, and the Elastic net performs slightly better than the Lasso. This result suggests that the performance of the multi-tuning parameter penalized regression is not affected by correlations between informative variables and noise features.

Figure 2.3.12 The change in AUCs across weight parameters for the Elastic net (red) and Lasso (black) when there are correlations between informative genes and noise gene features (Cor_Noise>0). The heatmap on the left represents the variance-covariance matrix between the 5 informative genes and 5 noise gene features.

2.3.3 Real data analysis

We also applied the multi-tuning parameter penalized regression to real cancer data sets, combining gene expression and DNA methylation data for outcome prediction. In both the AML and the prostate cancer data, the multi-tuning parameter Elastic net performed better than the Lasso (Figure 2.3.13A, B). This is expected: real data typically feature widespread correlations among predictors, and the Elastic net is specifically designed to cope with such situations. The multi-tuning parameter penalized regression achieved its best performance at a weight of 0.7 for the AML data (Figure 2.3.13A) and 1.1 for the prostate cancer data (Figure 2.3.13B). For the AML data, this indicates that by deliberately placing relatively less penalty on the methylation features, we can obtain a better classifier than with the traditional method (i.e., simple concatenation). For the prostate cancer data, there was very little difference in prediction performance between using data from a single platform and combining them.

Figure 2.3.13 Performance of the multi-tuning parameter penalized regression in (A) the AML data and (B) the PRAD data set.

For the prostate cancer data, out of the original 20,216 gene features and 371,513 methylation features, we selected 8 gene features and 4 methylation features in the final two-tuning parameter Elastic net model (weight=1.1). Additional Figure 2.12 shows the estimated coefficients for the selected features. Compared with the methylation loci, we selected more gene features with relatively smaller coefficients. This suggests that for predicting tumor aggressiveness in the prostate cancer data set, the gene array contains more informative variables with smaller effect sizes, in which case it is reasonable to use a smaller penalty parameter on the gene array to obtain a better classifier.
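The final feature panel can be read directly off the fitted model; in this sketch, prad_fit denotes an assumed cv.glmnet-style fit for the PRAD data rather than an object defined above.

    ## Extract the features with non-zero coefficients (sketch; `prad_fit`
    ## is a hypothetical fitted cv.glmnet object for the PRAD data).
    b <- as.matrix(coef(prad_fit, s = "lambda.min"))[, 1]
    panel <- b[b != 0 & names(b) != "(Intercept)"]
    sort(panel, decreasing = TRUE)   # here: 8 gene and 4 methylation features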
Interestingly, the majority of the selected features have been reported to be associated with the growth, progression and metastasis of prostate cancer. In particular, for the 4 gene features with positive estimated coefficients, i.e., CENP-A, CIT, FAM72a, and ASPN, researchers have shown that over-expression of these genes is associated with the aggressiveness of prostate cancer (Bieniek, Childress, Swatski, & Yang, 2014; Kutzner, Pramanik, Kim, & Heese, 2015; Özdemir et al., 2014; Whitworth et al., 2012).

2.4 Discussion

With the development of high-throughput biological technologies, multiple -omics data types can be available for a single subject, including genomic, epigenomic, proteomic and metabolomic measurements. For predictive modeling, such as predicting disease subtypes, prognosis or drug sensitivity, combining the information from multiple -omics platforms (i.e., data integration) can yield a more comprehensive and accurate prediction model than using a single data type alone. However, analyzing the integrated data is challenging, given the increased dimensionality and the differences in prediction power across platforms. Several methods have been proposed to deal with the high-dimensionality issue (Chung & Keles, 2010; Fan & Li, 2001; Zou, Hastie, & Tibshirani, 2006). However, to the best of our knowledge, no previous work has targeted the issue of differing predictive capability of features across platforms when combining data from different platforms into a single classification model.

In this study, we were interested in predicting disease subtypes using data concatenation ("early integration", i.e., combining the -omics data before model setup). To address the issues of high dimensionality and the differing behavior of individual data types, we proposed a novel approach termed multi-tuning parameter (MTP) penalized regression, which imposes different penalty parameters on different data types in the Lasso or Elastic net (EN) regression. We compared this to a "late integration" approach, in which variables are first selected from the individual platforms and the selected features are then combined to build the final model, using either logistic regression ("Logistic") or penalized regression a second time ("Late Lasso/EN"). We investigated model performance via intensive simulations in which we varied the relative contributions of the individual data sets and the correlation structures.

Our simulations showed that both early and late integration outperform any individual data type in predicting disease outcomes, especially when there is no correlation between the features from different platforms. Comparing the two, early integration performed better than late integration when the features were independent, possibly because early integration does a better job in variable selection. When the correlations between features from different platforms are not negligible, the advantage of early integration is much less apparent, since similar features are then selected under either approach. For late integration, we found that Late Lasso/EN outperforms Logistic by further excluding noise features that had been selected from the individual platforms.
For early integration, we found that the optimal weight varies as a function of (1) the fraction of informative variables in each data set and their relative effect sizes, (2) the correlation coefficients among the informative variables, and (3) the number of correlated informative features. Interestingly, correlations between informative variables and noise features did not affect the performance of the MTP penalized regression. When correlations between the important features were present, the Elastic net always dominated the Lasso in terms of prediction AUCs. Specifically, we found that the MTP Lasso may be particularly appropriate when (1) the individual data sets have similar effect sizes but different proportions of non-zero coefficients; or (2) the proportion of non-zero coefficients is larger in one data set, and the effect size in the other data set is not large enough to equalize the prediction performance of the two data sets; and (3) there is no notable correlation between the informative variables, especially between the important features on the different platforms. Conversely, when the prediction powers of the individual data sets are similar to one another, or there are strong correlations between the predictive features, the best weight parameter is likely to be around one and simple concatenation can be used. Furthermore, we applied the proposed method to real cancer data and observed results consistent with our simulations: by imposing a relatively smaller penalty parameter on the features from the array with more informative variables of smaller effect sizes, we obtained a better classifier than with a single penalty parameter on all features.

In the independent simulation settings we also explored the effect of dimensionality on the performance of the MTP Lasso (Figure 2.3.7). The prediction performance of penalized regression, as well as the benefit of the MTP Lasso over the standard Lasso, decreased as the number of noise features increased. These results suggest that feature pre-selection should be considered before applying penalized regression. By excluding noise features and appropriately reducing dimensionality, we can obtain a prediction model that is more accurate and stable, and also maximize the performance of the MTP penalized regression.

The acute myeloid leukemia (AML) data set used in this study is identical to the data used by Taskesen et al. (2015), who also considered a logistic regression model with Lasso regularization to predict AML subtypes in 344 samples. They evaluated three classification strategies: early, late, and no integration. In their early integration, they combined gene expression profile (GEP) and DNA methylation profile (DMP) features first and then applied Lasso regression to all features to predict AML subtypes (simple concatenation-based integration). In their late integration, they first applied Lasso regression to GEP and DMP individually, and then trained a nearest-mean classifier with the posterior probabilities of the GEP and DMP logistic regressions as predictors (a two-layer classifier). They showed that early integration improved predictive power compared with classifiers trained on GEP or DMP alone, and that late integration outperformed early integration. We compared our results to theirs, focusing on predicting the 7q AML subtype, i.e.,
we aimed to differentiate the 7q subtype from the non-7q subtypes through penalized regression with concatenation-based data integration. Our results were consistent with those of Taskesen et al. regarding the performance of the combined data versus the individual data: the maximum AUC was obtained at a weight of 0.7 (combined data) rather than at an extremely small weight (left tail of Figure 2.3.13A, DNA methylation data only) or an extremely large weight (right tail of Figure 2.3.13A, gene expression data only). More importantly, the mean AUC for the MTP Lasso at the optimal weight parameter was 0.815, higher than the AUC of 0.7998 obtained by the best method (late integration) in Taskesen et al., and the mean AUC for the MTP Elastic net was higher still (0.820). These results further support the view that, by accounting for the inter-correlations between platforms when setting up the prediction model, we can better utilize the complementary information across platforms and obtain a model with better prediction performance.

We also applied our method to the prostate adenocarcinoma (PRAD) TCGA data, which used the HM450 DNA methylation array and contains many more features than the AML data set. The classifier generated by the MTP Elastic net from the PRAD data attained a much higher AUC than for the AML data, possibly because (1) the outcome variables differ, i.e., the aggressiveness of prostate cancer for the PRAD data versus the molecular subtype for the AML data, and (2) the platforms being combined differ, e.g., the HM450 for the PRAD data versus the HELP assay for the AML data. The classifier for the PRAD data contains 12 features with non-zero coefficients, and, interestingly, the majority of the selected features have been reported to be associated with the growth, progression and metastasis of prostate cancer. In particular, for the 4 gene features with positive estimated coefficients, i.e., CENP-A, CIT, FAM72a, and ASPN, over-expression has been shown to be associated with the aggressiveness of prostate cancer (Bieniek et al., 2014; Kutzner et al., 2015; Özdemir et al., 2014; Whitworth et al., 2012). Among the four selected DNA methylation features, cg01135464 has recently been identified as part of an epigenetic signature of Gleason score and prostate cancer recurrence after radical prostatectomy (Geybels et al., 2016).

In summary, in this study we proposed a novel penalized logistic regression variant specifically designed for data integration in outcome prediction. The method highlights the importance of data integration and paves the way for future research in leveraging multiple high-throughput genomic data types to better predict disease diagnosis or prognosis, which is one of the ultimate goals of personalized medicine.

2.5 Future Work

In the current study, for simplicity we considered only two data types in both the simulations and the real data analysis. However, our proposed method should also work when more than two platforms are available. Suppose we have q platforms (q ≥ 2), with lambda selected over N_lambda candidate values and each weight selected over N_w candidate values; a q-dimensional grid search then requires fitting the penalized regression N_lambda × N_w^(q−1) times to determine the optimal combination of lambda and the q−1 weight parameters.
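For illustration, with assumed grid sizes the cost of such a search grows quickly:

    ## q platforms, N_lambda candidate lambdas, N_w candidate weights
    ## (the grid sizes here are illustrative assumptions)
    q <- 3; n_lambda <- 100; n_w <- 33
    n_lambda * n_w^(q - 1)   # 100 * 33^2 = 108,900 penalized-regression fits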
To speed up the computation, the ranges of N_lambda and N_w can be shrunk based on prior knowledge or on information supplied by preliminary studies. The performance of multi-tuning parameter penalized regression on multiple platforms will be explored in future work. In addition, this study focused on the Lasso and Elastic net; in the future we may consider other penalized regressions with more sophisticated weight functions, such as the adaptive Lasso or the group Lasso.

References

Aryee, M. J., Jaffe, A. E., Corrada-Bravo, H., Ladd-Acosta, C., Feinberg, A. P., Hansen, K. D., & Irizarry, R. A. (2014). Minfi: A flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics, 30(10), 1363–1369. http://doi.org/10.1093/bioinformatics/btu049

Bair, E., Hastie, T., Paul, D., & Tibshirani, R. (2006). Prediction by supervised principal components. Journal of the American Statistical Association, 101(473), 119–137. http://doi.org/10.1198/016214505000000628

Benjamini, Y., & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4), 1165–1188. http://doi.org/10.1214/aos/1013699998

Bibikova, M., Barnes, B., Tsan, C., Ho, V., Klotzle, B., Le, J. M., … Shen, R. (2011). High density DNA methylation array with single CpG site resolution. Genomics, 98(4), 288–295. http://doi.org/10.1016/j.ygeno.2011.07.007

Bieniek, J., Childress, C., Swatski, M. D., & Yang, W. (2014). COX-2 inhibitors arrest prostate cancer cell cycle progression by down-regulation of kinetochore/centromere proteins. Prostate, 74(10), 999–1011. http://doi.org/10.1002/pros.22815

Chari, R., Coe, B. P., Vucic, E. A., Lockwood, W. W., & Lam, W. L. (2010). An integrative multi-dimensional genetic and epigenetic strategy to identify aberrant genes and pathways in cancer. BMC Systems Biology, 4, 67. http://doi.org/10.1186/1752-0509-4-67

Chung, D., & Keles, S. (2010). Sparse partial least squares classification for high dimensional data. Statistical Applications in Genetics and Molecular Biology, 9(1), Article 17. http://doi.org/10.2202/1544-6115.1492

Dedeurwaerder, S., Defrance, M., Bizet, M., Calonne, E., Bontempi, G., & Fuks, F. (2013). A comprehensive overview of Infinium HumanMethylation450 data processing. Briefings in Bioinformatics, 15(6), 7–13. http://doi.org/10.1093/bib/bbt054

Dedeurwaerder, S., Defrance, M., Calonne, E., Denis, H., Sotiriou, C., & Fuks, F. (2011). Evaluation of the Infinium Methylation 450K technology. Epigenomics, 3(6), 771–784. http://doi.org/10.2217/epi.11.105

Eckhardt, F., Lewin, J., Cortese, R., Rakyan, V. K., Attwood, J., Burger, M., … Beck, S. (2006). DNA methylation profiling of human chromosomes 6, 20 and 22. Nature Genetics, 38(12), 1378–1385. http://doi.org/10.1038/ng1909

Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360. http://doi.org/10.1198/016214501753382273

Fan, J., & Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics, 32(3), 928–961. http://doi.org/10.1214/009053604000000256

Figueroa, M. E., Lugthart, S., Li, Y., Erpelinck-Verschueren, C., Deng, X., Christos, P. J., … Melnick, A. (2010). DNA methylation signatures identify biologically distinct subtypes in acute myeloid leukemia. Cancer Cell, 17(1), 13–27. http://doi.org/10.1016/j.ccr.2009.11.020
Fortin, J. (2014). Functional normalization of 450k methylation array data improves replication in large cancer studies. bioRxiv. http://doi.org/10.1101/002956

Friedman, J., Hastie, T., & Tibshirani, R. (2010a). A note on the group Lasso and a sparse group Lasso. arXiv:1001.0736 [math.ST].

Friedman, J., Hastie, T., & Tibshirani, R. (2010b). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.

Geybels, M. S., Wright, J. L., Bibikova, M., Klotzle, B., Fan, J.-B., Zhao, S., … Stanford, J. L. (2016). Epigenetic signature of Gleason score and prostate cancer recurrence after radical prostatectomy. Clinical Epigenetics, 1–11. http://doi.org/10.1186/s13148-016-0260-z

Gopalakrishnan, S., Van Emburgh, B. O., & Robertson, K. D. (2008). DNA methylation in development and human disease. Mutation Research, 647(1–2), 30–38. http://doi.org/10.1016/j.mrfmmm.2008.08.006

Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1–3), 389–422. http://doi.org/10.1023/A:1012487302797

Hartford, C. M., Duan, S., Delaney, S. M., Mi, S., Kistner, E. O., Lamba, J. K., … Dolan, M. E. (2009). Population-specific genetic variants important in susceptibility to cytarabine arabinoside cytotoxicity. Blood, 113(10), 2145–2153. http://doi.org/10.1182/blood-2008-05-154302

Heyn, H., Li, N., Ferreira, H. H. J., Moran, S., Pisano, D. G., Gomez, A., & Diez, J. (2012). Distinct DNA methylomes of newborns and centenarians. Proceedings of the National Academy of Sciences of the United States of America, 109(26), 10522–10527. http://doi.org/10.1073/pnas.1120658109

Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67. http://doi.org/10.1080/00401706.1970.10488634

Holzinger, E. R., & Ritchie, M. D. (2012). Integrating heterogeneous high-throughput data for meta-dimensional pharmacogenomics and disease-related studies. Pharmacogenomics, 13(2), 213–222. http://doi.org/10.2217/pgs.11.145

Horvath, S. (2013). DNA methylation age of human tissues and cell types. Genome Biology, 14(10), R115. http://doi.org/10.1186/gb-2013-14-10-r115

Hou, L., Zhang, X., Wang, D., & Baccarelli, A. (2012). Environmental chemical exposures and human epigenetics. International Journal of Epidemiology, 41(1), 79–105. http://doi.org/10.1093/ije/dyr154

Huang, J., Ma, S., & Zhang, C.-H. (2008). Adaptive Lasso for sparse high-dimensional regression models. Statistica Sinica, 18, 1603–1618.

Huang, R. S., Duan, S., Bleibel, W. K., Kistner, E. O., Zhang, W., Clark, T. A., … Dolan, M. E. (2007). A genome-wide approach to identify genetic variants that contribute to etoposide-induced cytotoxicity. Proceedings of the National Academy of Sciences of the United States of America, 104(23), 9758–9763. http://doi.org/10.1073/pnas.0703736104

Huber, W., Carey, V. J., Gentleman, R., Anders, S., Carlson, M., Carvalho, B. S., … Morgan, M. (2015). Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods, 12(2), 115–121. http://doi.org/10.1038/nmeth.3252

Johnson, W. E., Li, C., & Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1), 118–127. http://doi.org/10.1093/biostatistics/kxj037
Kristensen, V. N., Lingjærde, O. C., Russnes, H. G., Vollan, H. K. M., Frigessi, A., & Børresen-Dale, A.-L. (2014). Principles and methods of integrative genomic analyses in cancer. Nature Reviews Cancer, 14(5), 299–313. http://doi.org/10.1038/nrc3721

Kutzner, A., Pramanik, S., Kim, P. S., & Heese, K. (2015). All-or-(N)One: an epistemological characterization of the human tumorigenic neuronal paralogous FAM72 gene loci. Genomics, 106(5), 278–285. http://doi.org/10.1016/j.ygeno.2015.07.003

Laird, P. W. (2010). Principles and challenges of genomewide DNA methylation analysis. Nature Reviews Genetics, 11(3), 191–203. http://doi.org/10.1038/nrg2732

Lee, K. W., & Pausova, Z. (2013). Cigarette smoking and DNA methylation. Frontiers in Genetics, 4, 132. http://doi.org/10.3389/fgene.2013.00132

Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E., & Storey, J. D. (2012). The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics, 28(6), 882–883. http://doi.org/10.1093/bioinformatics/bts034

Lemire, M., Zaidi, S. H. E., Ban, M., Ge, B., Aïssi, D., Germain, M., … Hudson, T. J. (2015). Long-range epigenetic regulation is conferred by genetic variation located at thousands of independent loci. Nature Communications, 6, 6326. http://doi.org/10.1038/ncomms7326

Li, E. (2002). Chromatin modification and epigenetic reprogramming in mammalian development. Nature Reviews Genetics, 3(9), 662–673. http://doi.org/10.1038/nrg887

Li, Q., Brown, J. B., Huang, H., & Bickel, P. J. (2011). Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, 5(3), 1752–1779. http://doi.org/10.1214/11-AOAS466

Maksimovic, J., Gagnon-Bartsch, J. A., Speed, T. P., & Oshlack, A. (2015). Removing unwanted variation in a differential methylation analysis of Illumina HumanMethylation450 array data. Nucleic Acids Research, 1–14. http://doi.org/10.1093/nar/gkv526

Maksimovic, J., Gordon, L., & Oshlack, A. (2012). SWAN: Subset-quantile within array normalization for Illumina Infinium HumanMethylation450 BeadChips. Genome Biology, 13(6), R44. http://doi.org/10.1186/gb-2012-13-6-r44

Mankoo, P. K., Shen, R., Schultz, N., Levine, D. A., & Sander, C. (2011). Time to recurrence and survival in serous ovarian tumors predicted from integrated genomic profiles. PLoS ONE, 6(11), e24709. http://doi.org/10.1371/journal.pone.0024709

Marabita, F., Almgren, M., Lindholm, M. E., Ruhrmann, S., Fagerström-Billai, F., Jagodic, M., … Gomez-Cabrero, D. (2013). An evaluation of analysis pipelines for DNA methylation profiling using the Illumina HumanMethylation450 BeadChip platform. Epigenetics, 8(3), 333–346. http://doi.org/10.4161/epi.24008

Meier, L., Van De Geer, S., & Bühlmann, P. (2008). The group Lasso for logistic regression. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 70(1), 53–71. http://doi.org/10.1111/j.1467-9868.2007.00627.x

Meinshausen, N., & Bühlmann, P. (2006). High dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34(3), 1436–1462. http://doi.org/10.1214/009053606000000281

Moran, S., Arribas, C., & Esteller, M. (2015). Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences. Epigenomics. http://doi.org/10.2217/epi.15.114

Morris, T. J., & Beck, S. (2014). Analysis pipelines and packages for Infinium HumanMethylation450 BeadChip (450k) data. Methods, 72, 3–8. http://doi.org/10.1016/j.ymeth.2014.08.011
Naumov, V. A., Generozov, E. V., Zaharjevskaya, N. B., Matushkina, D. S., Larin, A. K., Chernyshov, S. V., … Govorun, V. M. (2013). Genome-scale analysis of DNA methylation in colorectal cancer using Infinium HumanMethylation450 BeadChips. Epigenetics, 8(9), 921–934. http://doi.org/10.4161/epi.25577

Özdemir, B. C., Hensel, J., Secondini, C., Wetterwald, A., Schwaninger, R., Fleischmann, A., … Cecchini, M. G. (2014). The molecular signature of the stroma response in prostate cancer-induced osteoblastic bone metastasis highlights expansion of hematopoietic and prostate epithelial stem cell niches. PLoS ONE, 9(12). http://doi.org/10.1371/journal.pone.0114530

Pan, W., & Shen, X. (2007). Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8, 1145–1164. Retrieved from http://www.jmlr.org/papers/volume8/pan07a/pan07a.pdf

Pidsley, R., Y Wong, C. C., Volta, M., Lunnon, K., Mill, J., & Schalkwyk, L. C. (2013). A data-driven approach to preprocessing Illumina 450K methylation array data. BMC Genomics, 14, 293. http://doi.org/10.1186/1471-2164-14-293

Portela, A., & Esteller, M. (2010). Epigenetic modifications and human disease. Nature Biotechnology, 28(10), 1057–1068. http://doi.org/10.1038/nbt.1685

Ritchie, M. D., Holzinger, E. R., Li, R., Pendergrass, S. A., & Kim, D. (2015). Methods of integrating data to uncover genotype–phenotype interactions. Nature Reviews Genetics, 16(2), 85–97. http://doi.org/10.1038/nrg3868

Robertson, K. D. (2005). DNA methylation and human disease. Nature Reviews Genetics, 6(8), 597–610. http://doi.org/10.1038/nrg1655

Shaknovich, R., Figueroa, M. E., & Melnick, A. (2010). HELP (HpaII tiny fragment enrichment by ligation-mediated PCR) assay for DNA methylation profiling of primary normal and malignant B lymphocytes. Methods in Molecular Biology, 632, 191–201. http://doi.org/10.1007/978-1-60761-663-4_12

Taskesen, E., Babaei, S., Reinders, M. M., & De Ridder, J. (2015). Integration of gene expression and DNA-methylation profiles improves molecular subtype classification in acute myeloid leukemia. BMC Bioinformatics, 16(Suppl 4), S5. http://doi.org/10.1186/1471-2105-16-S4-S5

Teschendorff, A. E., Marabita, F., Lechner, M., Bartlett, T., Tegner, J., Gomez-Cabrero, D., & Beck, S. (2013). A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data. Bioinformatics, 29(2), 189–196. http://doi.org/10.1093/bioinformatics/bts680

Tibshirani, R., Hastie, T., Narasimhan, B., & Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences of the United States of America, 99(10), 6567–6572. http://doi.org/10.1073/pnas.082099299

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.

Titus, A. J., Houseman, E. A., Johnson, K. C., & Christensen, B. C. (2016). MethyLiftover: Cross-platform DNA methylation data integration. Bioinformatics, 32(16), 2517–2519. http://doi.org/10.1093/bioinformatics/btw180

Touleimat, N., & Tost, J. (2012). Human Methylation 450K BeadChip data processing using subset quantile normalization for accurate DNA methylation estimation. Epigenomics, 4(3), 325–341. http://doi.org/10.2217/epi.12.21
Triche, T. J., Laird, P. W., & Siegmund, K. D. (2016). Beta regression improves the detection of differential DNA methylation for epigenetic epidemiology. bioRxiv. Retrieved from http://biorxiv.org/lookup/doi/10.1101/054643

Triche, T. J., Weisenberger, D. J., Van Den Berg, D., Laird, P. W., & Siegmund, K. D. (2013). Low-level processing of Illumina Infinium DNA Methylation BeadArrays. Nucleic Acids Research, 41(7), 1–11. http://doi.org/10.1093/nar/gkt090

Valk, P. J. M., Verhaak, R. G. W., Beijen, M. A., Erpelinck, C. A. J., Barjesteh van Waalwijk van Doorn-Khosrovani, S., Boer, J. M., … Delwel, R. (2004). Prognostically useful gene-expression profiles in acute myeloid leukemia. The New England Journal of Medicine, 350(16), 1617–1628. http://doi.org/10.1056/NEJMoa040465

Weisenberger, D. J. (2014). Characterizing DNA methylation alterations from the cancer genome atlas. Journal of Clinical Investigation. http://doi.org/10.1172/JCI69740

Whitworth, H., Bhadel, S., Ivey, M., Conaway, M., Spencer, A., Hernan, R., … Gioeli, D. (2012). Identification of kinases regulating prostate cancer cell growth using an RNAi phenotypic screen. PLoS ONE, 7(6). http://doi.org/10.1371/journal.pone.0038950

Wilhelm-Benartzi, C. S., Koestler, D. C., Karagas, M. R., Flanagan, J. M., Christensen, B. C., Kelsey, K. T., … Brown, R. (2013). Review of processing and analysis methods for DNA methylation array data. British Journal of Cancer, 109(6), 1394–1402. http://doi.org/10.1038/bjc.2013.496

Williams, K. E., Anderton, D. L., Lee, M. P., Pentecost, B. T., & Arcaro, K. F. (2014). High-density array analysis of DNA methylation in tamoxifen-resistant breast cancer cell lines. Epigenetics, 9(2), 297–307. http://doi.org/10.4161/epi.27111

Wu, M. C., Joubert, B. R., Kuan, P., Håberg, S. E., Nystad, W., Peddada, S. D., & London, S. J. (2014). A systematic assessment of normalization approaches for the Infinium 450K methylation platform. Epigenetics, 9(2), 318–329. http://doi.org/10.4161/epi.27119

Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 68(1), 49–67. http://doi.org/10.1111/j.1467-9868.2005.00532.x

Zhu, J., & Hastie, T. (2004). Classification of gene microarrays by penalized logistic regression. Biostatistics, 5(3), 427–443. http://doi.org/10.1093/biostatistics/kxg046

Zou, H. (2006). The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429. http://doi.org/10.1198/016214506000000735

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 67(2), 301–320. http://doi.org/10.1111/j.1467-9868.2005.00503.x

Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2), 265–286. http://doi.org/10.1198/106186006X113430

Additional Files

Additional Table 1.1 Summary of preprocessing methods used in this study. The methods in bold are the ones used in this study.

Additional Table 1.2 Number of samples in the TCGA-KIRC, TCGA-COAD, and Alzheimer's disease brain data assigned to Discovery and Validation data sets by disease status and plate.
[Additional Table 1.2 content: for each data set (TCGA-KIRC, TCGA-COAD, Brain, and 'Balanced' Brain), sample counts per plate (with plate index) in the Discovery and Validation sets, given as case N/control N (% case) alongside disease-status totals.]

Additional Figure 1.1 Multidimensional scaling plot of distances between samples in COAD, scaling dimensions 1 vs 2, samples colored by plate. Euclidean distances between sample pairs were computed for a common set of 5000 features having the largest standard deviations across all samples. Features containing SNPs or mapping to the sex chromosomes were excluded.

Additional Figure 1.2 Density distributions of beta-values for the Type I (solid lines) and Type II (dashed lines) probes for six PBL replicates under different processing methods.

Additional Figure 2.1 Maximum possible AUCs with varied non-zero coefficients (N=200 simulated data sets). In all settings r1=5, r2=20, Coef2=1, p1=p2=250; Coef1 changes from 1 to 2.1 in increments of 0.1. Coef1, Coef2: non-zero coefficients for the features in the first and second data type, respectively; r1, r2: numbers of informative variables (those with non-zero coefficients) in the first and second data type, respectively; p1, p2: numbers of features in the first and second data type, respectively.

Additional Figure 2.2 The empirical AUCs as a function of the weight parameter. x axis: weight parameter; y axis: empirical AUCs; the two curves show the mean AUCs from cross-validation in the training data sets and from the independent testing data sets; blue line: best weight parameter. The width of the blue squares represents effect size and the height represents the number of informative variables.

Additional Figure 2.3 Spaghetti plot of AUCs for the standard Lasso and the multi-tuning parameter Lasso (N=100 simulated data sets).

Additional Figure 2.4 Results of evaluation metrics for the standard Lasso (yellow) vs. the multi-tuning parameter Lasso (red).
The columns are the seven evaluation metrics that were considered: AUC, 1 − misclassification error, sensitivity, specificity, sensitivity in the first and second data sets, and positive predictive value. The rows are simulation settings with varied Coef1. In each simulation, prediction models obtained from the training data were applied to the independent testing data, and the boxplots show the summary statistics for 200 simulations. Coef1: non-zero coefficient for the features in the first data type; r1, r2: number of informative variables (those with non-zero coefficients) in the first and second data type, respectively.

Additional Figure 2.5 Bar plot of the percentage of times that each informative variable is selected across 100 replicated simulation experiments when the inter-array correlation is zero (Cor_inter=0). Blue: informative methylation features; red: informative gene features.

Additional Figure 2.6 Bar plot of the percentage of times that each informative variable is selected across 100 replicated simulation experiments when the inter-array correlation is larger than zero (Cor_inter>0). Blue: informative methylation features; red: informative gene features.

Additional Figure 2.7 Difference in testing AUCs between Elastic net and Lasso under different simulation settings (N=200 simulated data sets). (A) Correlation coefficients among informative methylation features (Cor_meth) varied. (B) Number of correlated features in the simulated methylation array (n2) varied. (C) Correlation coefficients between informative genes and informative methylation features (Cor_inter) varied.

Additional Figure 2.8 Difference in testing AUCs between multi-tuning parameter Elastic net and standard Elastic net under different simulation settings (N=200 simulated data sets). (A) Correlation coefficients among informative methylation features (Cor_meth) varied. (B) Number of correlated features in the simulated methylation array (n2) varied. (C) Correlation coefficients between informative genes and informative methylation features (Cor_inter) varied.

Additional Figure 2.9 Bar plot of the percentage of times that each informative variable is selected across 100 replicated simulation experiments with varied correlation coefficients between informative methylation features. Blue: informative methylation features; red: informative gene features.

Additional Figure 2.10 Bar plot of the percentage of times that each informative variable is selected across 100 replicated simulation experiments with a varied number of correlated informative methylation features. Blue: informative methylation features; red: informative gene features.

Additional Figure 2.11 Maximum possible AUCs for individual data types (Type 1 and Type 2) and combined data under different simulation settings (N=200 simulated data sets for each setting). (A) Correlation coefficients among informative methylation features (Cor_meth) varied. (B) Number of correlated features in the simulated methylation array (n2) varied. (C) Correlation coefficients between informative genes and informative methylation features (Cor_inter) varied.

Additional Figure 2.12 Estimated coefficients for the selected features from the multi-tuning parameter Elastic net model in PRAD data.

Additional table 2.1 Statistical summaries of paired t-tests between standard Lasso and multi-tuning parameter Lasso. mean(d): mean difference ×10⁻²; sem(d): standard error of the mean difference ×10⁻².
Abstract
With the development of high-throughput technologies, it is promising to utilize high-dimensional molecular data to unravel relationships between genomic alterations and disease phenotypes, to describe characteristic molecular profiles of the diseased population, and to predict diagnosis or prognosis for patients. The major goal of this dissertation is to develop novel statistical methods to analyze high-throughput molecular data.

Data quality is fundamental to any research project. The first part of this dissertation focuses on low-level signal processing and batch effect correction of HumanMethylation450 (HM450) BeadChip arrays, a commonly used platform for high-throughput DNA methylation analysis. As with other microarray platforms, technical artifacts are a concern for HM450 data, including background fluorescence, dye bias from the use of two color channels, bias caused by the type I/II probe design, and batch effects. Several approaches and pipelines have been developed, either targeting a single issue or designed to address multiple biases through a combination of methods. We evaluated the effect of combining separate approaches to improve signal processing: nine processing methods, including both within- and between-array methods, were applied and compared in four datasets. The comparisons provided valuable insights into preprocessing HM450 BeadChip data. They showed an advantage to using BMIQ to adjust type I/II bias in combination with Noob, a within-array procedure that corrects background fluorescence and dye bias, or with Funnorm, a between-array normalization procedure utilizing control probes. We also saw a benefit from using RUVm, a new variant of Remove Unwanted Variation designed for DNA methylation data, when batch effects were apparent and differential DNA methylation was the goal. Performing RUVm on the processed data led to considerable improvement in the reproducibility of the list of top differentially methylated markers. The results of this project have been published in BMC Genomics.
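To make the favored pipeline concrete, the sketch below assembles the pieces in R/Bioconductor. It is a minimal illustration, not the code used in the dissertation: it assumes an RGChannelSet named rgSet already loaded with minfi, a case/control factor group, and a user-chosen vector neg_ctl of negative-control feature names for RUVm, and it calls preprocessNoob (minfi), BMIQ (wateRmelon), and RUVfit/RUVadj/topRUV (missMethyl) with the interfaces documented in recent versions of those packages.

```r
# Minimal sketch of a Noob + BMIQ + RUVm pipeline.
# Assumed inputs (not defined here): rgSet, group, neg_ctl.
library(minfi)       # preprocessNoob: background and dye-bias correction
library(wateRmelon)  # BMIQ: type I/II probe-design adjustment
library(missMethyl)  # RUVfit/RUVadj: RUVm for differential methylation

mset <- preprocessNoob(rgSet)   # within-array correction
beta <- getBeta(mset)

# BMIQ rescales the type II probe distribution to match type I,
# one sample (column) at a time; design codes type I as 1, type II as 2.
design    <- ifelse(getAnnotation(mset)$Type == "I", 1, 2)
beta_bmiq <- apply(beta, 2, function(b) BMIQ(b, design, plots = FALSE)$nbeta)

# RUVm on M-values: fit with negative-control features, adjust, rank markers
M    <- log2(beta_bmiq / (1 - beta_bmiq))
ctl  <- rownames(beta) %in% neg_ctl
rfit <- RUVfit(Y = M, X = group, ctl = ctl)
top  <- topRUV(RUVadj(Y = M, fit = rfit))
```

For the between-array alternative discussed above, minfi::preprocessFunnorm(rgSet) could be substituted for the Noob step; it returns a GenomicRatioSet from which beta-values are extracted the same way.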
The second part of this dissertation develops methods for integrating different omics data to predict disease subtypes. By combining measurements from multiple platforms, such as gene expression, DNA methylation, and copy number variation, investigators hope to obtain a more accurate prediction model than any single data type alone provides, by exploiting the complementary information across platforms. Penalized regression, such as the Lasso and the Elastic net, has been used to build prediction models in data integration. However, previous studies used a single penalized regression model that penalizes features from all platforms equally. Using a uniform penalty parameter for all features is problematic, since we risk missing subtle but important features. We propose an approach termed "multi-tuning parameter penalized regression", in which we consider separate tuning parameters and allow for differential shrinkage across features from different platforms (a code sketch of this weighting scheme follows the abstract). Through intensive simulation studies we identify scenarios in which multi-tuning parameter penalized regression outperforms the standard approach in terms of both prediction accuracy and variable selection, considering situations in which variables are independent as well as correlated. In addition, we show that combining features before model setup ("early integration") generally works better than selecting variables from individual platforms followed by model setup using the combined selected features ("late integration"). Lastly, we investigate the performance of multi-tuning parameter penalized regression in real cancer datasets obtained from TCGA and GEO.

This thesis focuses on the development of statistical methods for high-dimensional genomic data analysis. Specifically, we set up an analytical pipeline, from the processing of high-throughput HM450 data to utilizing the cleaned data to predict disease subtypes. We identified an appropriate combination of preprocessing methods for the HM450 array in terms of reducing technical variance and improving reproducibility in differential methylation analysis, and we proposed novel Lasso and Elastic net variants specifically designed for classification when integrating data from different genomic platforms. The methods developed and evaluated in this thesis provide guidance on best practices for the statistical analysis of high-throughput 'omic' data.
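The weighting scheme referenced above can be realized in practice with glmnet's penalty.factor argument, which rescales the penalty feature by feature; the relative weight between platforms is then tuned by cross-validation. This is a hedged sketch, not the dissertation's implementation: x1 and x2 (feature matrices for two platforms), y (a binary subtype label), and the weight grid are illustrative assumptions.

```r
# Differential shrinkage across two platforms via penalty.factor.
# Assumed inputs (not defined here): x1, x2 (n x p matrices), y (0/1 labels).
library(glmnet)

x       <- cbind(x1, x2)
weights <- seq(0.2, 2, by = 0.2)   # candidate penalty weights for platform 2

# For each weight w, platform-1 features keep penalty weight 1 and
# platform-2 features get weight w; lambda is tuned inside cv.glmnet.
cv_auc <- sapply(weights, function(w) {
  pf  <- c(rep(1, ncol(x1)), rep(w, ncol(x2)))
  fit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                   penalty.factor = pf, type.measure = "auc")
  max(fit$cvm)                     # best cross-validated AUC for this weight
})
best_w <- weights[which.max(cv_auc)]
```

Setting alpha between 0 and 1 in the same loop would give a multi-tuning parameter Elastic net; in either case the selected weight should be confirmed on independent test data, as in the simulation studies described above.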
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Incorporating prior knowledge into regularized regression
Generalized linear discriminant analysis for high-dimensional genomic data with external information
Prediction and feature selection with regularized regression in integrative genomics
Hierarchical regularized regression for incorporation of external data in high-dimensional models
Robust feature selection with penalized regression in imbalanced high dimensional data
Finding signals in Infinium DNA methylation data
Prediction modeling with meta data and comparison with lasso regression
Latent unknown clustering with integrated data (LUCID)
Detecting joint interactions between sets of variables in the context of studies with a dichotomous phenotype, with applications to asthma susceptibility involving epigenetics and epistasis
Prenatal air pollution exposure, newborn DNA methylation, and childhood respiratory health
Enhancing model performance of regularization methods by incorporating prior information
Understanding DNA methylation and nucleosome organization in cancer cells using single molecule sequencing
Large-scale inference in multiple Gaussian graphical models
Genome-wide studies reveal the function and evolution of DNA shape
Hierarchical approaches for joint analysis of marginal summary statistics
Genome-wide studies of protein–DNA binding: beyond sequence towards biophysical and physicochemical models
Shrinkage methods for big and complex data analysis
Application of data-driven modeling in basin-wide analysis of unconventional resources, including domain expertise
Statistical learning in High Dimensions: Interpretability, inference and applications
Isoform quantification and splicing regulation analysis in RNA-seq studies
Asset Metadata
Creator
Liu, Jie (author)
Core Title
Statistical analysis of high-throughput genomic data
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Publication Date
02/28/2017
Defense Date
12/13/2016
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
batch correction,BeadChip,data integration,HumanMethylation450,normalization,OAI-PMH Harvest,penalized regression,preprocessing,sample classification
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Siegmund, Kimberly (committee chair), Breton, Carrie (committee member), Laird-Offringa, Ite A. (committee member), Lewinger, Juan Pablo (committee member), Millstein, Joshua (committee member)
Creator Email
liu485@usc.edu,liujieyixue@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-342801
Unique identifier
UC11258221
Identifier
etd-LiuJie-5101.pdf (filename), usctheses-c40-342801 (legacy record id)
Legacy Identifier
etd-LiuJie-5101.pdf
Dmrecord
342801
Document Type
Dissertation
Rights
Liu, Jie
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA