INTEGRATIVE ANALYSIS OF MULTI-VIEW DATA WITH APPLICATIONS IN EPIDEMIOLOGY

by

Yinqi Zhao

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOSTATISTICS)

May 2023

Copyright 2023 Yinqi Zhao

Acknowledgements

My deepest gratitude goes to my dissertation advisor, Dr. David V. Conti, for his invaluable patience and feedback. David has witnessed my growth as I toddled through the fantastic world of biostatistics. I could not have undertaken this journey without his tremendous encouragement and mentorship. David is one of the smartest people I know. He has inspired me in research and in life. I am greatly influenced by his words, "always try to be a thinker instead of a follower". With his insightful advice, I will keep choosing my own adventure.

Many thanks to my dissertation committee members, Dr. Leda Chatzi, Dr. Nicholas Mancuso, Dr. Juan Pablo Lewinger and Dr. Andrew D. Smith, for their brilliant comments on my dissertation and valuable suggestions in general. Leda also offered me a research opportunity, through which I gained knowledge of statistical consulting and environmental epidemiology. The contents of Chapter 2 through Chapter 4 have resulted in an R package, LUCIDus, hosted on CRAN. Throughout the development of LUCIDus, I would like to thank Dr. David V. Conti, Dr. Jesse A. Goodrich and Dr. Nikos Stratakis for their generous support and valuable feedback. I am also grateful to the Division of Biostatistics for the fellowship support. Special thanks to Dr. Kimberly Siegmund and Dr. Sandrah P. Eckel for their advice on and support of my PhD study. Thanks should also go to my friends in and outside the biostatistics program for all the enjoyable moments we have spent together at Soto and "the Broadway".

I would like to sincerely thank my parents, Wei Liu and Haining Zhao. They have sacrificed so much for me, personally and financially, and provided me with unconditional support and love. There are no words to convey how much I love them. The best outcome of the past four and a half years is finding my best friend and partner, Yutong Liu. She always has faith in me, even when I do not trust myself. She is a great supporter and the only person who appreciates my cooking. I truly thank Yutong for sticking by my side. With her love, I have learned a lot about life.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Overview of multi-view data
  1.2 Challenges in analyzing multi-view data
  1.3 Integration methods for analyzing multi-view data
    1.3.1 General strategies
    1.3.2 Variable selection and regularization
    1.3.3 Clustering analysis
    1.3.4 High dimensional mediation analysis
  1.4 Missing data and analysis methods
    1.4.1 Missing mechanisms
    1.4.2 Imputation
    1.4.3 Expectation-maximization algorithm
Chapter 2: Enhancements on the LUCID model
  2.1 Introduction
  2.2 Methods
    2.2.1 Statistical modeling of LUCID
    2.2.2 EM algorithm to estimate MLE of LUCID
    2.2.3 Integrated variable selection
    2.2.4 Model selection
    2.2.5 Label switching
    2.2.6 Supervised and unsupervised LUCID
    2.2.7 Variable selection based on correlation structure
  2.3 Simulation study
    2.3.1 Speed comparison of two initialization strategies of the EM algorithm
    2.3.2 Robustness of integrated variable selection
  2.4 Data example
    2.4.1 Data description and methods
    2.4.2 Results
  2.5 Discussion
  2.6 Appendix 2.A
  2.7 Appendix 2.B
  2.8 Appendix 2.C
  2.9 Appendix 2.D
  2.10 Appendix 2.E
  2.11 Appendix 2.F
Chapter 3: LUCID incorporating missing omics data
  3.1 Introduction
  3.2 Methods
    3.2.1 Listwise missing pattern
    3.2.2 Sporadic missing pattern
  3.3 Simulation study
    3.3.1 Methods
    3.3.2 Results
  3.4 Data example
    3.4.1 Methods
    3.4.2 Results
  3.5 Discussion
  3.6 Appendix 3.A
Chapter 4: LUCIDus: An R Package to implement the LUCID model
  4.1 Introduction
  4.2 General workflow of LUCIDus
  4.3 Illustration
    4.3.1 Fitting a LUCID model by lucid()
    4.3.2 Interpreting LUCID
    4.3.3 Model selection and variable selection
    4.3.4 Deriving confidence intervals
    4.3.5 Prediction
    4.3.6 Incorporating missingness in omics data
  4.4 Other applied examples
    4.4.1 Integrated analysis of prenatal PFAS, the metabolites and liver injury
    4.4.2 Integrated analysis of in utero exposure to mercury, inflammation biomarkers and liver injury
  4.5 Conclusion
Chapter 5: LUCID incorporating multiple latent variables
  5.1 Introduction
  5.2 Methods
    5.2.1 EM algorithm for LUCID with multiple latent variables
    5.2.2 Label switching
    5.2.3 Feature updates for the LUCIDus R package
  5.3 Simulation study
    5.3.1 Model selection
    5.3.2 Parameter estimation and prediction
    5.3.3 Comparison to other integration methods
  5.4 Data example
    5.4.1 Methods
    5.4.2 Results
  5.5 Discussion
  5.6 Appendix 5.A
  5.7 Appendix 5.B
  5.8 Appendix 5.C
Bibliography

List of Tables

2.1 Simulation results for the correlation-based variable selection criteria in the low dimensional scenario.
2.2 Bootstrap standard errors for the LUCID model, calculated based on 500 replicates.
2.3 Additional simulation scenarios for variable selection based on correlation.
3.1 Simulation setting for listwise missingness in omics data.
3.2 Simulation setting for sporadic missingness in omics data.
4.1 Functions in the LUCIDus package. lucid() calls est_lucid() and tune_lucid() in the backend. The two workhorse functions are not normally called directly, but they can be useful when a user wants to examine the model fitting process in more detail.
5.1 Simulation setting 1: two omics layers with K1 = 2 and K2 = 3.
5.2 Simulation setting 2: two omics layers with K1 = 2 and K2 = 3. Different variance-covariance matrices are used to generate the omics data: Var(Z1) = 0.2I, Var(Z2) = 0.4I, where I is a 6 x 6 identity matrix.
5.3 Comparison of prediction of clustering and the outcome, for LM, LS, CR and LC.
5.4 Comparison of LM, LS and CR: exposure effect.
5.5 Comparison of LM, LS and CR: omics effect.
5.6 Comparison of LM, LS and CR: outcome effect.

List of Figures

1.1 Components of multi-view data in the HELIX study, inspired by the ISGlobal 2021 data challenge.
1.2 Illustrations of different integration strategies for analyzing multi-view data. D1 and D2 represent two omics datasets; Y represents the outcome of interest.
1.3 Coefficient paths of LASSO regression using the simulated HELIX data (details on the dataset are discussed in Chapter 2). The multi-omics data are concatenated (early integration) to investigate their joint effect on children's body mass index (BMI). The numbers at the bottom represent the penalty on the log scale; the numbers at the top represent the number of selected variables at the corresponding penalty.
1.4 Three causal graphs for the mediation models. Black uppercase letters represent model components; blue lowercase letters are model coefficients: (a) simple mediation model; (b) latent mediation model; (c) high dimensional mediation model.
2.1 Directed acyclic graph (DAG) for LUCID. Squares represent observed data; circles represent latent clusters or unobserved model parameters; diamonds represent the penalty terms for regularization, e.g., lasso.
2.2 Ellipses with different geometric features in two-dimensional cases with 3 clusters. Adapted from [85].
2.3 Cluster structure defined only by the correlation between 2 variables.
2.4 Flowchart for deciding whether a variable is informative to clustering or not.
2.5 Comparison of the speed of convergence of the EM algorithm using random initialization and heuristic initialization. (a) listwise missingness in omics data; (b) sporadic missingness in the omics data.
2.6 Simulation scenarios for (a) independent mediators; (b) correlated mediators.
2.7 Simulation results comparing variable selection based on LUCID and other mediation approaches (HIMA and BAMA).
2.8 LUCID analysis on the simulated HELIX data.
2.9 Supplementary analysis of the HELIX data for the ISGlobal data challenge 2021.
2.10 Pairwise scatter plot for simulated omics variables. Z1 and Z2 are informative variables with differences in mean across the 2 clusters. Z3 to Z8 are informative variables with differences in correlation across the 2 clusters.
3.1 Illustrations of the two missing patterns in the omics data, as well as the corresponding methods to deal with missingness. The white blocks are missing values in the omics data Z. The dark blue blocks represent the imputed values in Z.
3.2 Illustration of a more general case with a combination of listwise and sporadic missing patterns.
3.3 Simulation results for listwise missingness in omics data.
3.4 Simulation results for sporadic missingness in omics data.
3.5 "Risk" profiles for individuals with and without measured metabolomics data.
4.1 The workflow of the LUCIDus package. Dark blue nodes represent input data; light blue nodes represent output results. Green nodes and dashed arrows are optional steps for model estimation. Red text corresponds to 5 key functions in LUCIDus.
4.2 An example of using the Sankey diagram to visualize a LUCID model. The dark grey nodes represent exposures, the brown node represents the outcome, the blue nodes are omics data and the orange nodes are latent clusters. The width of links and nodes corresponds to effect size. Light colored links represent negative associations while dark colored links indicate positive associations.
4.3 Choosing the optimal number of latent clusters K based on BIC.
4.4 Two simulated missing patterns in proteomics data.
4.5 Integrated clustering analysis of prenatal PFAS, serum metabolites and liver injury.
4.6 Integrated clustering analysis of maternal exposure to Hg, inflammation biomarkers and elevated ALT levels.
5.1 Directed acyclic graph (DAG) for LUCID with multiple latent mediators. Squares represent observed data; circles correspond to latent mediators interpreted as latent clusters.
5.2 Sankey diagram for LUCID with multiple latent variables. The nodes on the left represent the exposures; the nodes in the middle represent the latent clusters; and the nodes on the right represent omics layers and the outcome. Each link represents a statistical association. Dark gray links represent positive associations and light gray links represent negative ones.
5.3 BIC of the LUCID model using different combinations of K1 and K2.
5.4 Simulation results for supervised LUCID with two latent variables, part one: parameter estimation for exposure and omics effects.
5.5 Simulation results for supervised LUCID with two latent variables, part two: parameter estimation for the outcome effect, prediction on clusters and the outcome.
5.6 Simulation results for unsupervised LUCID with two latent variables, part one: parameter estimation for exposure and omics effects.
5.7 Simulation results for unsupervised LUCID with two latent variables, part two: parameter estimation for the outcome effect, prediction on clusters and the outcome.
5.8 Results of the LUCID analysis investigating the joint effect of prenatal HCB, proteomics, serum and urine metabolites on children's BMI z-score.
5.9 DAG of LUCID with multiple latent variables in serial.
5.10 Pearson correlations of the 32 selected omics variables. Only statistically significant correlations (p-value < 0.05) are shown in the heatmap.
5.11 Comparison of the exposure effects calculated from LM and LS.
5.12 Comparison of PIPs calculated by LM and LS.

Abstract

Omics data refers to data produced by high-throughput biochemical assays that simultaneously measure a large number of molecules of the same type from a biological sample. These data sets include DNA sequences (genomics), RNA expression data (transcriptomics), metabolites in biofluids (metabolomics), proteins in cells or tissues (proteomics), as well as chemical exposures in the environment (exposomics). The rapid advances in molecular biotechnology have made it possible to measure multiple omics and disease phenotypes (referred to as "multi-view data") on the same person in many cohort studies, such as the Human Early-Life Exposome project (HELIX), The Cancer Genome Atlas (TCGA) and the Genotype-Tissue Expression (GTEx) project. Integrated analyses of multi-view data facilitate insights into the underlying molecular mechanisms of complex biological processes, and many studies have demonstrated the importance of integrating multi-view data over single-view data analysis. However, the analysis of multi-view data is still analytically and computationally challenging, due to the high dimensionality of multi-view information, missing values and the complex structure of the data collected from various sources. In this dissertation, we propose novel methods for integrated analysis of multi-view data. Our methods aim at uncovering subgroups across multiple high-dimensional data types and relating the subgroups to the outcome of interest. The proposed methods also address common issues in analyzing multi-view data, including variable selection, missing values and various data integration strategies.

Peng et al. proposed an integrative model to estimate latent unknown clusters relevant to the outcome of interest (LUCID) using multi-view data. Our methods improve the computational performance of LUCID, especially under high-dimensional scenarios. We use eigenvalue decomposition to allow for more flexible geometric features of the omics data within the LUCID framework. LUCID was originally a supervised learning model.
We extend LUCID to an unsupervised learning setting and establish its connection to the Gaussian mixture model (GMM). Additionally, our methods can handle various missing-data scenarios in omics data, including listwise missingness (omics data are available only for a subset of observations), sporadic missingness (omics data are missing completely at random), or a combination of the two. The original LUCID method is limited to being informed by a single type of omics data. We implement an approach to LUCID for multiple types of measured omics data by using multiple latent variables.

This dissertation is structured as follows. Chapter 1 comprehensively reviews different strategies for integrative analysis of multi-view data and relevant statistical models, including mediation analysis, clustering models and penalized regression techniques for high-dimensional data problems. Chapter 2 proposes several enhancements to the LUCID model. Chapter 3 discusses our method to incorporate missing omics data in the LUCID analysis. Chapter 4 contains a comprehensive introduction to the R package, LUCIDus, which implements the statistical models discussed in Chapter 2 and Chapter 3. Chapter 5 extends the LUCID model to multiple types of omics data. Chapter 6 summarizes our findings and contributions.

Chapter 1
Introduction

1.1 Overview of multi-view data

Rapid technological advancements have brought the scientific community into the era of data. Researchers have been processing information that grows exponentially in volume, variety and complexity [121]. Multi-view data refers to multiple types of data generated on the same sample. In bioinformatics, multi-view data are often referred to as multi-omics data [54]. Each type of omics data, such as genomics, transcriptomics, proteomics or metabolomics, only characterizes a biological process at a certain molecular level. Multi-omics data together provide complementary information to explain the complexity of a biological process [74]. Examples of analyses of multi-omics data include constructing the biological architecture of complex phenotypes, establishing causal mechanisms in common diseases, identifying driver mutations and discovering subtypes of cancer [46]. Generation of multi-omics data requires enormous effort. Many large research consortia have been established to build rich and fruitful multi-omics databases for the broad research community, such as The Cancer Genome Atlas (TCGA) [109], the Roadmap Epigenomics Project [52] and the Common Fund's Genotype-Tissue Expression (GTEx) Program [57].

In the context of environmental epidemiology, we use the term multi-view data to denote the combination of exposures, outcome, multi-omics data and covariates. The concept of exposures can be expanded to an omics-scale terminology, exposomics [65]: a collection of exposures from the environment, diet, behavior and endogenous processes throughout the lifespan. The Human Early Life Exposome (HELIX) study has been investigating the effect of multiple environmental exposures (during pregnancy and childhood) on child health outcomes, and associating those exposures with multi-omics signatures [104, 103]. It is a typical example of an environmental epidemiological study that collects and analyzes multi-view data. Figure 1.1 illustrates the various components of multi-view data measured in the HELIX study.

Figure 1.1: Components of multi-view data in the HELIX study, inspired by the ISGlobal 2021 data challenge.
In genetic epidemiology, investigators study the effect of genetic variants on human diseases. The exposures in multi-view data can be defined as genomics data, or a polygenic risk score (PRS), which is a summary of genotyping information [18], together with other molecular data including gene expression data, DNA methylation data, etc.

1.2 Challenges in analyzing multi-view data

Multi-view data offer researchers unprecedented opportunities to draw a more complete picture of a complex biological system. [54] outlined five types of problems in multi-view data analyses: (1) feature selection; (2) prediction (classification or regression); (3) clustering; (4) interactions among inter-view features (for example, interaction of gene and environmental risk factors); and (5) association studies in a complex system.

Various challenges arise when researchers try to answer the aforementioned questions. Multi-view data are measured by high-throughput sequencing or bioassay technologies. Each sample can generate tens of thousands of variables. As a result, the number of variables is substantially greater than the sample size in each dataset, which is referred to as the high-dimensionality issue [74]. Traditional statistical methods such as regression analysis are not applicable to high-dimensional data, while many popular machine learning methods tend to fall into the trap of overfitting. In addition, the number of variables differs greatly across multi-view data, ranging from dozens (proteomics) to tens of thousands (DNA methylation data). The variation in the size of multi-view data can result in imbalanced integration. Furthermore, multi-view data consist of various data types, including discrete (genotype coded as 0, 1 and 2; copy number variation), continuous (gene expression, DNA methylation and protein concentration) and categorical (questionnaire data, common covariates such as sex, cohort, etc.). A successful integrated analysis should correctly handle the heterogeneity in the probabilistic distributions of multi-view data. Additionally, covariates, such as demographic characteristics and batch effects, should be properly adjusted for; otherwise, the association analysis across multi-view data can be confounded. We discuss strategies and analysis methods for multi-view data in more detail in the next section.

Missing data is another inevitable issue in the integrated analysis of multi-view data. In particular, it occurs in multi-omics datasets for various reasons, including the accuracy of measurement devices, experimental constraints, budget limitations, lack of consent for data usage, etc. [89]. Observations can have missing values either in a subset of variables within a particular omics dataset or in the whole omics dataset. Without proper handling of missing values, an integrative analysis may suffer from a decreased sample size. [89] comprehensively reviewed state-of-the-art imputation methods for multiple types of omics data.

1.3 Integration methods for analyzing multi-view data

1.3.1 General strategies

Due to the heterogeneous nature of multi-view data, integrated analysis is not straightforward. It is not uncommon that multiple statistical/machine learning methods are implemented in the integration of multi-view data. Before introducing specific statistical methods, we first review some high-level ideas for integrating multi-view data. More comprehensive discussions of these integration strategies are covered by [74, 77].
Multi-view data are stored in multiple datasets of rectangular shape, where each dataset ("data view") has the same rows (samples) but a different number of columns (variables). Early integration is a strategy that first concatenates all datasets into a single large data matrix, and then applies available single-view methods to the concatenated dataset (a short code sketch is given at the end of this subsection). Spicker et al., for instance, improved the prediction of toxicological classes by principal component analysis (PCA), using concatenated information from clinical chemistry, gene expression and metabolomics [91]. Early integration is commonly used for its simplicity. It can also uncover interactions between different data views [74]. However, it has several obvious drawbacks. First, concatenation dramatically increases the data dimension, which aggravates "the curse of dimensionality" [48]. Second, it does not account for heterogeneity of probabilistic distributions between data views. Third, concatenation gives more weight to the dataset with more variables, which causes imbalance in the detected signals.

Late integration is another straightforward strategy that utilizes existing statistical methods for single-view data. It first analyzes each data view independently, then integrates the results. The advantage of late integration is that researchers can apply different methods designed for particular types of data. The drawback lies in the fact that interactions across data views are not captured. Additionally, results from separate single-view analyses can contradict each other.

Picard et al. described the intermediate integration strategy as "any methods capable of jointly integrating the multi-omics datasets without needing prior transformation and without relying on a simple concatenation" [74]. Intermediate integration relies on the assumption that different data views share a common latent space, which captures the underlying biology. Shen et al. proposed iCluster, a joint latent variable model-based clustering method [86]. In their application to breast cancer, the common latent space is cancer subtypes characterized by molecular signatures. Intermediate integration covers a wide range of methods with various purposes, including supervised (e.g., high-dimensional mediation analysis) and unsupervised (e.g., multi-view clustering) approaches. Its strength lies in the joint modeling of multiple data views, which can reveal the interplay between different data views. On the one hand, intermediate methods can model different data views based on their specific data types; on the other hand, it is challenging to create new tools that analyze multi-view data in a unified framework. Particularly in environmental epidemiology, there is growing interest and need in developing novel intermediate integration methods, as few comprehensive tools are available in the community. Peng et al. created a model called Latent Unknown Clustering Integrating Multi-Omics data (LUCID). Besides integrating multi-omics data, LUCID considers genetic/environmental risk factors and jointly estimates latent clusters relevant to the outcome of interest [72]. The usage of LUCID is demonstrated in several environmental epidemiological studies; see [93, 92, 61]. Figure 1.2 illustrates the different strategies to integrate multi-view data.

Figure 1.2: Illustrations of different integration strategies for analyzing multi-view data. D1 and D2 represent two omics datasets; Y represents the outcome of interest.
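To make the early integration strategy concrete, the following minimal R sketch concatenates two simulated stand-ins for omics matrices and applies a single-view method (PCA, as in the example of Spicker et al.). The matrix dimensions are illustrative assumptions.

```r
set.seed(1)
omics1 <- matrix(rnorm(100 * 20), nrow = 100)  # e.g., a proteomics view (20 features)
omics2 <- matrix(rnorm(100 * 50), nrow = 100)  # e.g., a metabolomics view (50 features)

# Early integration: column-wise concatenation, then a single-view method
concatenated <- cbind(scale(omics1), scale(omics2))
pca <- prcomp(concatenated)
summary(pca)$importance[, 1:5]  # variance explained by the first 5 components
```

Note that even after column-wise standardization, the view with more columns (here, omics2) contributes more terms to the concatenated matrix, illustrating the imbalance issue noted above.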
1.3.2 Variable selection and regularization

Variable selection is commonly performed in multi-view data analysis to decrease the dimensionality (number of variables) of the datasets, for the following considerations: (1) to reduce noise and capture weak signals; (2) to lighten the computational burden; (3) sometimes to make the sample size greater than the number of features so that more statistical tools are applicable; and (4) to improve model interpretability. Variable selection can either be applied to each data view separately, or be implemented on the concatenated data matrix in early integration. Some intermediate integration methods also come with integrated variable selection.

Regularization can induce sparsity in the model. Tibshirani proposed the least absolute shrinkage and selection operator (LASSO), which is also referred to as L1 regularization [97]. Consider a linear model Y = Xβ + ϵ, where Y is an n-dimensional vector of outcomes, X is an n × p matrix of predictors and ϵ is the error term. The LASSO problem is expressed as

\min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|^2 + \lambda \sum_{j=1}^{p} |\beta_j|    (1.1)

where λ is the Lagrange multiplier, also called the penalty term. When the penalty term is small, the LASSO estimate is close to the maximum likelihood estimate (MLE). When λ gets larger, more coefficients shrink to 0. The remaining variables with coefficients not equal to 0 are considered informative variables for integrated analysis. L1 regularization is often embedded in a particular statistical model, such as regression [97] or clustering [71]. Figure 1.3 shows an example of using L1 regularization in an association study on multi-view data.

Figure 1.3: Coefficient paths of LASSO regression using the simulated HELIX data (details on the dataset are discussed in Chapter 2). The multi-omics data are concatenated (early integration) to investigate their joint effect on children's body mass index (BMI). The numbers at the bottom represent the penalty on the log scale; the numbers at the top represent the number of selected variables at the corresponding penalty.
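As a minimal illustration of Equation 1.1 in practice, the sketch below fits a LASSO with the glmnet package on simulated data; the data dimensions and the number of truly informative variables are arbitrary assumptions, not the HELIX data used in Figure 1.3.

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 200
X <- matrix(rnorm(n * p), nrow = n)               # stand-in for concatenated omics
y <- drop(X[, 1:5] %*% rep(0.8, 5)) + rnorm(n)    # only 5 truly informative variables

cvfit <- cv.glmnet(X, y, alpha = 1)               # alpha = 1 gives the L1 (LASSO) penalty
selected <- which(coef(cvfit, s = "lambda.min")[-1] != 0)  # nonzero coefficients
selected                                          # indices of the selected variables
```

Here cross-validation chooses the penalty λ; variables whose coefficients are shrunk exactly to 0 are treated as non-informative, mirroring the selection logic described above.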
Heuristic methods, such as filtering based on variation, correlation or distance, are also commonly used. Those methods are often implemented independently of any integrated analysis method and use rules of thumb as selection criteria. For example, xMWAS, an automated network analysis framework for integrated omics data, filters variables by using the relative standard deviation (RSD) and a user-defined maximum number of selected variables [99]. Another example is the "meet in the middle" screening [11]. This agnostic approach reduces the data dimensions by utilizing pairwise associations among the exposome, the intermediate variables (omics data) and the outcome. Another useful tool for variable selection is biological annotation, including gene annotation, pathway information, etc. Screening based on biology is sometimes necessary for an association study, as the aim is to reveal the mechanism of how the exposure influences the outcome.

1.3.3 Clustering analysis

Clustering analysis is an unsupervised learning task that classifies observations into multiple groups based on a particular definition of similarity. Model-based clustering is a well-studied statistical approach for this task. It assumes the data are generated from a mixture of component models, each following a multivariate distribution. The Gaussian mixture model (GMM) is a famous model-based clustering approach [32]. GMM assumes every component model is a multivariate Gaussian distribution, and the joint distribution is

f(Z) = \sum_{j=1}^{k} \pi_j \, \phi(Z \mid \mu_j, \Sigma_j)    (1.2)

where π_j is the component weight with the constraint \sum_j \pi_j = 1, and µ_j and Σ_j are the component-specific mean and variance-covariance matrix. The advantages of GMM lie in its probabilistic modeling. The multivariate Gaussian distribution is well studied, and much previous statistical work is applicable to GMM [32]. Additionally, GMM, like other model-based clustering methods, can make "soft" predictions of cluster assignment: each observation is associated with a vector of probabilities for the cluster memberships [77]. Many statistical software packages implement GMM. mclust is the most popular package in R, featuring model-based clustering, classification, dimension reduction and inference based on resampling [85]. Over-parameterization is a major criticism of GMM. As a result, fitting GMM directly on high-dimensional data is computationally expensive, and GMM is often implemented together with variable selection. In multi-view data analysis, GMM is often implemented with the early integration strategy by concatenating multiple omics datasets. Since the cluster assignment is treated as an unobserved, or missing, variable, approaches to fitting GMM are also discussed in the context of missing value imputation, which will be covered later.
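The following minimal sketch fits the GMM in Equation 1.2 with the mclust package on simulated two-cluster data; the sample sizes and means are illustrative assumptions.

```r
library(mclust)

set.seed(1)
Z <- rbind(matrix(rnorm(100 * 2, mean = 0), ncol = 2),
           matrix(rnorm(80 * 2, mean = 3), ncol = 2))   # two simulated clusters

fit <- Mclust(Z)   # searches over numbers of components and covariance models by BIC
summary(fit)       # chosen model, mixing weights pi_j, means mu_j
head(fit$z)        # "soft" cluster-membership probabilities for each observation
```

The matrix fit$z holds the per-observation membership probabilities, illustrating the "soft" prediction property discussed above.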
iCluster is a multi-view clustering model using the intermediate integration strategy [86]. It assumes all data views are generated from a low-dimensional space representing cluster membership. The mathematical model of iCluster is

Z_j = W_j X + \epsilon_j    (1.3)

where X is a k × n relaxed cluster indicator matrix, W_j is a view-specific p_j × k matrix and ϵ_j is a normally distributed error term. Compared to GMM, iCluster distinguishes the common variance captured by X from the view-specific variances ϵ_j. The assumption of normally distributed error terms limits iCluster to continuous data. An extension, iCluster+, was proposed by Mo et al. to incorporate discrete, categorical and count data [68]. Both methods use L1 regularization to select informative variables.

1.3.4 High dimensional mediation analysis

In an epidemiological study, a mediation model aims to explain the specific mechanism that underlies the association between the exposure (E) and the outcome (Y) through a third variable called the mediator (M) (Figure 1.4a). The statistical framework of mediation analysis is based on the following three regression models (a numerical sketch is given at the end of this subsection):

Y = \beta_{01} + cE + \epsilon_{Y1}    (1.4a)
Y = \beta_{02} + c'E + bM + \epsilon_{Y2}    (1.4b)
M = \beta_{03} + aE + \epsilon_{M}    (1.4c)

Figure 1.4: Three causal graphs for the mediation models. Black uppercase letters represent model components; blue lowercase letters are model coefficients: (a) simple mediation model; (b) latent mediation model; (c) high dimensional mediation model.

The product of a and b is referred to as the indirect effect, measuring the amount of mediation through the mediator variable. The coefficient c' is the direct effect. If c' = 0, the mediator M completely explains the mechanism of association between the exposure and the outcome, resulting in a complete mediation model; otherwise, the model is referred to as a partial mediation model.

Albert et al. extended the mediation model to multiple mediators by jointly estimating the indirect effect through a latent variable [1]. The conceptual model is depicted in Figure 1.4b. They assumed that the outcome is influenced by an unmeasured latent mediator X which summarizes all the effects of the observed intermediate variables. They incorporated mixed (continuous and dichotomous) variable types and constructed a generalized structural equation model (SEM) to estimate the coefficients of the latent mediation model.

The multiple-mediator model cannot fulfill the needs of association studies on high-dimensional multi-view data. A substantial number of mediation methods have been developed to deal with high dimensionality. The general idea is to treat the high-dimensional mediation problem as a variable selection problem. Song et al. developed a Bayesian mediation method using continuous shrinkage priors [90]. The authors assumed that all mediators potentially contribute small effects in mediating the association between exposure and outcome, while only a small portion of mediators exhibit large effects. The mediation framework is an extension of Equations 1.4b and 1.4c:

Y = Mb + c'E + \epsilon_Y    (1.5a)
M = Ea + \epsilon_M    (1.5b)

where M is an n × p matrix of mediators, and b and a are length-p vectors. Under a Bayesian setting, the priors of the parameters follow a mixture of two normal distributions,

b_j \sim \pi_M N(0, \sigma^2_{M1}) + (1 - \pi_M) N(0, \sigma^2_{M2})    (1.6a)
a_j \sim \pi_E N(0, \sigma^2_{E1}) + (1 - \pi_E) N(0, \sigma^2_{E2})    (1.6b)

where σ²_{M1} > σ²_{M2} and σ²_{E1} > σ²_{E2}. π_M and π_E represent the proportions of active mediators that truly exhibit mediation effects. The method produces a posterior inclusion probability (PIP) as the evidence for an active mediator.

Zhang et al. proposed a regularization-based method called HIMA, named after its R package [117]. HIMA is a two-step approach. First, it reduces the candidate mediators to a moderate number by using sure independence screening (SIS) [27]. Second, it optimizes a penalized loss function based on Equation 1.5a,

\hat{b} = \arg\min_{b} \sum_{i=1}^{n} \left( Y_i - M_i b - c'E_i \right)^2 + \sum_{j=1}^{p} p(b_j)    (1.7)

where p(·) is a penalty function. Available options for p(·) include LASSO, ridge, elastic net and the minimax concave penalty (MCP) proposed by Zhang et al. [116]. MCP produces nearly unbiased estimates for variable selection compared to LASSO and is less computationally expensive than best subset selection.

The high-dimensional mediation model provides a solution for a special case of multi-view data integration with one exposure. When multiple exposures (exposomics) are included in the analysis, we can implement the late integration strategy to fit high-dimensional mediation models for each exposure, then combine the analysis results. However, this does not account for the correlations between exposures or their joint effect on the outcome of interest.
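To connect back to Equations 1.4a-1.4c, the sketch below fits the simple (single-mediator) mediation model on simulated data using two ordinary linear regressions; the true coefficient values are arbitrary choices for illustration.

```r
set.seed(1)
n <- 500
E <- rnorm(n)
M <- 0.5 * E + rnorm(n)             # Equation 1.4c with a = 0.5
Y <- 0.3 * E + 0.7 * M + rnorm(n)   # Equation 1.4b with c' = 0.3, b = 0.7

a <- coef(lm(M ~ E))["E"]           # estimate of a
fitY <- lm(Y ~ E + M)
b <- coef(fitY)["M"]                # estimate of b
c_prime <- coef(fitY)["E"]          # estimate of the direct effect c'
c(indirect = unname(a * b), direct = unname(c_prime))
```

The product a*b estimates the indirect effect; the high-dimensional methods above generalize exactly this decomposition to p mediators with shrinkage or penalization.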
1.4 Missing data and analysis methods

1.4.1 Missing mechanisms

In statistical analyses, the dataset is organized into a rectangular shape, with rows representing observations and columns representing variables. Each entry of the dataset is an observed value. When analyzing real datasets, some entries are labeled as "NA" or " ", which means these entries are missing values. According to the definition by Little et al., "Missing data are unobserved values that would be meaningful for analysis if observed" [56]. Missingness is a major challenge in the integrative analysis of omics data, and it is important to understand the mechanisms of missing data in order to correctly handle the remaining data. Little and Rubin [56] categorized missing mechanisms into three categories: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). The differences among these three categories are characterized by statistical examinations of the relationship among the complete data, the missing pattern and the underlying parameters. Let D be the complete data matrix. It can be partitioned as D = {D_obs, D_miss}, where D_obs and D_miss refer to the observable and missing parts of the complete data, respectively. Let M denote the missing pattern, and Θ denote the parameters of interest. If the observed and missing data combined have no influence on the missing pattern, that is,

f(M \mid D, \Theta) = f(M \mid \Theta)    (1.8)

then the data are MCAR; if the missingness only depends on the observed part of the data, that is,

f(M \mid D, \Theta) = f(M \mid D_{obs}, \Theta)    (1.9)

then the data are MAR; finally, if the missing values in the data matrix D impact the missing pattern in any way, then the missing mechanism of the data is MNAR. When the data are MCAR or at least MAR, the missing mechanism is ignorable, because statistical inference about the parameters of interest can be conducted through the likelihood of the observed data alone. However, this is not the situation when the data are MNAR; in that case, the missing mechanism should be included in the analysis.

1.4.2 Imputation

Complete case analysis is often used to deal with missing data when the missing ratio is relatively low. According to Strike et al., when the missing data constitute less than 10% of the sample, it is usually safe to delete the missing cases without having a drastic influence on the analysis result [94], although this rule of thumb is not applicable to all datasets collected from different domain areas. Nowadays, imputation is perhaps the most commonly used method to analyze datasets with missing values. Imputation refers to the process of replacing missing data with substituted values. A variety of methods have been developed to impute missing values. The majority of those methods are evaluated on either a simulated or a real dataset with a missing ratio of less than 50% [55]. The imputation approaches can be broadly classified into two categories: statistical and machine learning approaches [35].

Mean/mode imputation uses the column mean/mode to replace the missing values. It is easy to implement; however, it completely ignores the correlations among variables [81]. Regression-based imputation fills in the missing data with the predicted values from a regression model. Considering a dataset with m variables indexed by 1, ..., j, ..., m, the missing value in the jth variable is predicted from the remaining m - 1 variables. If multiple variables have missingness in an observation, a multivariate regression model is used to impute the missing values. Regression-based imputation preserves the covariance of the variables; its disadvantage is that it might suffer from multicollinearity when multiple variables are missing for an observation [81, 56]. A similar approach to regression imputation is hot deck imputation, which replaces the missing values with data from a similar complete observation (a famous example is K-nearest neighbors) [9]. Cold deck imputation, instead, uses the values from a dissimilar complete observation [56]. The three methods discussed above only impute missing values once, which cannot reflect the uncertainty of the missing data. To evaluate the imputation uncertainty, multiple imputation replaces each missing item with a set of possible values and pools the estimates across the multiple imputations. A comprehensive review can be found in [82].
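As a concrete illustration of multiple imputation, the sketch below uses the mice package and its built-in nhanes example data; the analysis model (regressing cholesterol on BMI and age) is an arbitrary choice for demonstration.

```r
library(mice)

# Impute each missing entry m = 5 times under mice's default models
imp <- mice(nhanes, m = 5, printFlag = FALSE, seed = 1)

# Fit the analysis model on each completed dataset, then pool the estimates
fits <- with(imp, lm(chl ~ bmi + age))
summary(pool(fits))   # pooled coefficients and standard errors across imputations
```

The pooled standard errors incorporate both within- and between-imputation variability, which is exactly the uncertainty that single imputation cannot reflect.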
As machine learning becomes more and more popular, many machine learning based imputation methods have been proposed. Those methods are usually constructed as sophisticated predictive models, such as SOM imputation [80], recurrent neural networks [16, 8], support vector machines [10], etc.

1.4.3 Expectation-maximization algorithm

The problem of latent variables is a special case of the missing data problem. In statistics, latent variables are "variables that can only be inferred indirectly through a mathematical model from other observable variables that can be directly observed or measured" [24]. In clustering analysis, the latent variables are the unobserved, or missing, cluster labels. The probabilistic model of GMM is described in Equation 1.2. Estimating the maximum likelihood estimators (MLEs) of the model parameters in Equation 1.2 is challenging because the objective likelihood function contains latent variables. The expectation-maximization (EM) algorithm is one of the most used approaches to obtain MLEs for GMM. The principle of the EM algorithm is to find a relatively simpler approximation to the marginal likelihood in the presence of missing data (latent variables), then optimize the alternative objective function. We use the GMM for illustration. The log likelihood of GMM is written as

L(\Theta \mid Z) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j \, \phi(Z_i \mid \Theta)    (1.10)

which is hard to optimize. The intuition of the EM algorithm is that if we had access to the missing "cluster assignment" X, then we could approximate the likelihood in Equation 1.10 by the "complete-data" likelihood, which is

L_c(\Theta \mid Z, X) = \sum_{i=1}^{n} \sum_{j=1}^{k} I(X_i = j) \left( \log \pi_j + \log \phi(Z_i \mid \Theta) \right)    (1.11)

Then the MLE of Θ in L(Θ | Z) can be obtained by maximizing L_c(Θ | Z, X). We iterate the following two steps (a worked sketch is given at the end of this section):

E-step:  Q(\Theta \mid \Theta^{(t)}) = E\left( L_c(\Theta \mid Z, X) \mid Z, \Theta^{(t)} \right)    (1.12)

M-step:  \Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta \mid \Theta^{(t)})    (1.13)

where t is the index of iterations in the EM algorithm. The EM algorithm monotonically increases the likelihood and will converge to a fixed point in the parameter space, which is usually a local maximum. A mathematical proof of the monotonic property of the EM algorithm can be found in [63]. The iteration of the EM algorithm starts with an initial guess of Θ. Meticulous initialization strategies can potentially speed up the convergence of the EM algorithm or prevent it from being trapped around a poor local maximum.

When using the EM algorithm, the standard errors (SEs) are not readily estimated. Efron introduced the application of the bootstrap to missing data problems [26], and the bootstrap has become one of the most commonly used approaches to estimate SEs for the EM algorithm [63]. Another option is the supplemented EM (SEM) algorithm proposed by Meng and Rubin [64]. SEM evaluates the variance-covariance matrices of the complete-data likelihood and obtains numerically stable asymptotic variance-covariance matrices for the parameter estimates, which can be used to construct SEs for the MLEs.
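To make the E-step and M-step in Equations 1.12 and 1.13 concrete, the following self-contained sketch implements the EM algorithm for a two-component univariate GMM; the data and starting values are simulated placeholders.

```r
set.seed(1)
z <- c(rnorm(150, mean = 0), rnorm(100, mean = 4))      # simulated mixture data
pi_j <- c(0.5, 0.5); mu <- c(-1, 1); sigma2 <- c(1, 1)  # initial guess of Theta

for (t in 1:100) {
  # E-step: posterior responsibilities r_ij, which define Q(Theta | Theta^(t))
  dens <- sapply(1:2, function(j) pi_j[j] * dnorm(z, mu[j], sqrt(sigma2[j])))
  r <- dens / rowSums(dens)
  # M-step: closed-form weighted updates maximizing Q
  n_j <- colSums(r)
  pi_j <- n_j / length(z)
  mu <- colSums(r * z) / n_j
  sigma2 <- colSums(r * outer(z, mu, "-")^2) / n_j
}
round(rbind(pi_j, mu, sigma2), 2)  # approximately recovers the simulation truth
```

Each pass through the loop is one EM iteration; by the monotonicity property cited above, the log likelihood never decreases across iterations.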
Chapter 2
Enhancements on the LUCID model

2.1 Introduction

Advances in molecular biotechnology such as deep sequencing and high-throughput experimental methods have made a large number of molecular data sets accessible to researchers. These data sets include DNA genome sequences [37], RNA expression data [70], metabolites in biofluids [6], proteins in cells or tissues [3], as well as chemical exposures in the environment [110]. Such data are termed "omic" data (genomics, transcriptomics, metabolomics, proteomics and exposomics, respectively). The rapid advancement in high-throughput technology has made it possible to measure multiple omics (referred to as "multi-omics") on the same person in many cohort studies. For example, the Human Early-Life Exposome project (HELIX) measured molecular omics signatures, including RNA expression data, metabolites and plasma proteins, from 1300 children aged 6-11 in six European countries [104]. Integrated analyses of multi-omics data have facilitated insights into the underlying molecular mechanisms of complex biological processes, and many studies have demonstrated the importance of integrating multi-omics data over single-omics analysis [95]. However, the analysis of multi-omics data is still analytically and computationally challenging, due to the complex structure of the data collected from various sources, the limited sample size, and the high dimensionality of multi-omics information [98]. Conventional analysis tools range from clustering approaches to variable selection approaches [36].

Clustering is a fundamental method for analyzing both single-omics and multi-omics data [77]. The aim of clustering is to divide observations into several groups (called clusters) such that observations within a group are similar while samples in different groups are dissimilar. Clustering analysis has many important applications in the fields of biological science and medicine, such as finding subtypes of cancer and performing diagnostic prediction [49]. Conventional clustering methods applied to multi-omics data either concatenate the multiple datasets or analyze each dataset independently, but both approaches have limitations in capturing variation across omic levels and in reflecting the heterogeneity of the integrated datasets. To overcome these limitations, several specific integrated clustering methods for analyzing multi-omics data are available in the community, such as SGCCA [96], SNF [107] and iCluster plus [68]. [75] conducted a thorough review comparing 13 unsupervised methods for multi-omics data integration. These methods are mostly data-driven and do not consider study design or biological processes.

In addition to clustering, omics data analysis usually requires dimension reduction or variable selection to reduce noise, improve model interpretability, and avoid problems with overfitting. For example, some variables may not possess any information related to the cluster structure. Including these non-informative variables only adds "noise" to the cluster assignment and unnecessarily increases the model complexity [29]. Even when all variables are informative for clustering, as the number of dimensions increases, model-based clustering may suffer from a low convergence rate and other computational issues, referred to by Bellman as the curse of dimensionality [7]. For situations of moderate or low dimensionality, variable selection is able to facilitate model interpretation and identify the correct cluster structure [30].

Fop et al. [29] conducted a survey of variable selection methods for Gaussian mixture models and classified the existing methods into three categories: Bayesian approaches, model selection approaches and regularization approaches. In particular, regularization approaches introduce penalization terms into the log likelihood of the Gaussian mixture. Variable selection is achieved by shrinking the cluster-specific means of non-informative variables to a global mean. Pan and Shen [71] introduced a seminal work for this class of approaches.
The authors utilize 18 aL 1 penalty under a diagonal covariance structure for GMM. Zhou et al. relax the diagonal assumption and extend the previous method to unconstrained covariance matrices [120]. As they pointed out, regu- larization in GMM has two major considerations: (1) penalty on the mean parameters and (2) penalty on covariance structure. The penalty on the mean is used for variable selection. Specifically, if the cluster- specific means of a certain variable shrink towards the global mean (or 0, if the variable is standardized) across all clusters, the variable is considered as non-informative. For (2), when the number of variables m in GMM is larger than sample sizen, regularization is needed since the covariance matrix is singular. In addition, as Friedman et al. discussed, estimation benefits from regularization for a large covariance matrix, evenm<n [33]. Peng et al. proposed a novel clustering model called Latent Unknown Clustering Integrating omics Data (LUCID) to conduct integrated clustering of multi-view data [72]. The LUCID model extends the idea of GMM and uses external information to estimate prior probability of Gaussian component for each observation. In addition, LUCID can directly link the latent cluster to the outcome, which makes latent clusters more interpretable. In the original LUCID paper, an L 1 penalty is used to obtain a sparse solu- tion for model parameters. In this chapter, we propose a new integrated variable selection approach to select informative variables in omics data. Additionally, we come up with several other enhancements towards the existing LUCID model, including more flexible geometric features of the omics data, different initialization strategies of the EM algorithm as well as a method to deal with label switching problem in LUCID. The chapter is organized as follows: first, we introduce the statistical model of LUCID; second, we present the details of the EM algorithm to estimate MLE of LUCID; third, we discuss the model selection and label switching problems; forth, several simulation studies are conducted to illustrate the advantages of the proposed enhancements towards LUCID; last, we present a data application in ISG Exposome Data Challenge 2021. 19 Figure 2.1: Directed Acyclic Graph (DAG) for LUCID. Square represents observed data; circle represents latent clusters or unobserved model parameters; diamond represents the penalty terms for regularization, e.g. lasso. 2.2 Methods 2.2.1 StatisticalmodelingofLUCID We first review the framework of LUCID with complete measurements on omics data, illustrated by the Directed Acyclic Graph (DAG) [21] in Figure 2.1. ExposureG, omics dataZ and the outcome of interest Y are linked through a latent cluster variableX. By the causal directions of the DAG, the latent cluster X depends only onG, andZ andY are independent with each other conditioning on the latent cluster X. We explicitly assume a prospective sampling scheme based on the DAG, which indicates the omics data and outcome are measured after the exposure variables. This assumption fits particularly well in the cohort study. 20 Suppose we have a sample of n observations indexed by i = 1,2,...,n, let us denoteG as a n× p matrix with columns representing exposure variables and rows being observations;Z is an× m matrix representing omics data andY is a vector of lengthn representing outcome of interest, either a continuous or a categorical variable. 
Since X has k categories (indexed by j = 1, 2, ..., k), it is natural to assume that X follows a multinomial distribution conditional on G, whose probability mass function is well depicted by the softmax function S(·) with corresponding parameters β:

P(X_i = j \mid G_i, \beta) = \frac{\exp(G_i \beta_j)}{\sum_{b=1}^{k} \exp(G_i \beta_b)}    (2.1)

We further assume the omics data Z follows a multivariate Gaussian distribution ϕ(µ_j, Σ_j) conditional on X = j, where µ_j and Σ_j are the cluster-specific mean and variance-covariance matrix. This assumption fits nicely into the model-based clustering framework discussed by Fraley and Raftery [32]. In the original work by Peng et al. [72], no geometric constraints are placed on the covariance matrix Σ_j. To allow more flexible geometric features of the latent clusters, such as volume, shape and orientation determined by Σ_j, we propose to use the parameterization of the covariance matrices by means of eigenvalue decomposition in the form of

\Sigma_j = \lambda_j D_j A_j D_j^T    (2.2)

where λ_j is a scalar, D_j is the orthogonal matrix of eigenvectors and A_j is a diagonal matrix whose values are proportional to the eigenvalues [4]. Figure 2.2 visualizes clustering models with different geometric features based on Equation 2.2 (a numerical sketch of this parameterization is given at the end of this subsection).

Figure 2.2: Ellipses with different geometric features in two-dimensional cases with 3 clusters. Adapted from [85].

The distribution of Y | X is quite flexible, depending on the type of outcome. We use the general notation f(Y | X, γ) to represent its distribution. If Y is a continuous variable, we assume Y follows a Gaussian distribution, denoted by ϕ(Y_i | γ_j, σ²_j, X_i = j), where γ_j and σ²_j are the cluster-specific mean and variance. If Y is a binary outcome, we assume Y follows a Bernoulli distribution, depicted by a softmax function S(Y_i | γ_j, X_i = j), where γ_j is the log odds ratio for latent cluster j. Because of the assumption of conditional independence of Y and Z given X, LUCID has the potential to incorporate more types of outcome, which will be discussed in Section 2.5.

Let Θ be the overall set of parameters associated with the LUCID model. The log likelihood of the LUCID model can be written as

l(\Theta \mid Z, Y, G) = \sum_{i=1}^{n} \log f(Z_i, Y_i \mid G_i; \Theta)
= \sum_{i=1}^{n} \log \prod_{j=1}^{k} f(Z_i, Y_i, X_i = j \mid G_i; \Theta)^{I(X_i = j)}
= \sum_{i=1}^{n} \sum_{j=1}^{k} I(X_i = j) \log f(Z_i, Y_i, X_i = j \mid G_i; \Theta)
= \sum_{i=1}^{n} \sum_{j=1}^{k} I(X_i = j) \left[ \log S(X_i = j \mid G_i, \beta_j) + \log \phi(Z_i \mid X_i = j, \mu_j, \Sigma_j) + \log f(Y_i \mid X_i = j, \gamma) \right]    (2.3)

To facilitate interpretation of the LUCID model, we define the terms below:

1. Exposure effect β_j: the log odds ratio for a certain observation being assigned to latent cluster j, compared to the reference cluster.
2. Omics effect µ_j: the estimated means of the omics variables for latent cluster j.
3. Outcome effect γ_j: for a continuous outcome, γ_j is the estimated mean of the outcome for latent cluster j; for a binary outcome, γ_j is the estimated log odds ratio of being a case (coded as 1) compared to the controls (coded as 0).
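For intuition about the parameterization in Equation 2.2, the sketch below constructs a single 2 × 2 cluster covariance from separate volume, shape and orientation components; the numeric values are arbitrary illustrative choices.

```r
# Build Sigma_j = lambda_j * D_j %*% A_j %*% t(D_j) for one hypothetical cluster
lambda <- 2                                        # volume (overall scale)
A <- diag(c(3, 1) / sqrt(3 * 1))                   # shape, scaled so det(A) = 1
theta <- pi / 4                                    # orientation angle
D <- matrix(c(cos(theta), sin(theta),
              -sin(theta), cos(theta)), nrow = 2)  # rotation matrix of eigenvectors
Sigma_j <- lambda * D %*% A %*% t(D)
Sigma_j                                            # a valid covariance matrix
```

Constraining λ_j, D_j or A_j to be equal across clusters yields the family of geometric models (equal volume, shape or orientation) visualized in Figure 2.2.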
The distribution of $Y \mid X$ is quite flexible, depending on the type of outcome. We use a general notation $f(Y \mid X, \gamma)$ to represent its distribution. If $Y$ is a continuous variable, we assume $Y$ follows a Gaussian distribution, denoted by $\phi(Y_i \mid \gamma_j, \sigma_j^2, X_i = j)$, where $\gamma_j$ and $\sigma_j^2$ are the cluster-specific mean and variance. If $Y$ is a binary outcome, we assume $Y$ follows a Bernoulli distribution, depicted by a softmax function $S(Y_i \mid \gamma_j, X_i = j)$, where $\gamma_j$ is the log odds ratio for latent cluster $j$. Because of the assumption of conditional independence of $Y$ and $Z$ given $X$, LUCID has the potential to incorporate more types of outcome, which will be discussed in Section 2.5.

Let $\Theta$ be the overall set of parameters associated with the LUCID model. The log likelihood of the LUCID model can be written as

$$
\begin{aligned}
l(\Theta \mid (Z, Y \mid G)) &= \sum_{i=1}^{n} \log f((Z_i, Y_i \mid G) \mid \Theta) \\
&= \sum_{i=1}^{n} \log \prod_{j=1}^{k} f((Z_i, Y_i, X_i \mid G) \mid \Theta)^{I(X_i = j)} \\
&= \sum_{i=1}^{n} \sum_{j=1}^{k} I(X_i = j) \log f((Z_i, Y_i, X_i \mid G) \mid \Theta) \\
&= \sum_{i=1}^{n} \sum_{j=1}^{k} I(X_i = j) \left[ \log S(X_i = j \mid G_i, \beta_j) + \log \phi(Z_i \mid X_i = j, \mu_j, \Sigma_j) + \log f(Y_i \mid X_i = j, \gamma) \right]
\end{aligned} \quad (2.3)
$$

To facilitate interpretation of the LUCID model, we define the terms below:

1. Exposure effect $\beta_j$: the log odds ratio for a certain observation being assigned to latent cluster $j$, compared to the reference cluster.
2. Omics effect $\mu_j$: the estimated means of the omics variables for latent cluster $j$.
3. Outcome effect $\gamma_j$: for a continuous outcome, $\gamma_j$ is the estimated mean of the outcome for latent cluster $j$; for a binary outcome, $\gamma_j$ is the estimated log odds ratio of being a case (coded as 1) compared to the controls (coded as 0).

2.2.2 EM algorithm to estimate the MLE of LUCID

Since the latent cluster $X$ is not observed, we implement the EM algorithm to obtain the maximum likelihood estimator (MLE) for the LUCID model. For illustration purposes, we assume $Y$ is a continuous outcome; the derivation for a binary outcome can be found in Appendix 2.6. Specifying the distribution of $Y$ in Equation 2.3 as $\phi(Y_i \mid \gamma_j, \sigma_j^2, X_i = j)$, the log likelihood of LUCID based on Equation 2.3 becomes

$$
\begin{aligned}
l(\Theta \mid (Z, X, Y \mid G)) &= \sum_{i=1}^{n} \sum_{j=1}^{k} I(X_i = j) \log S(X_i = j \mid G_i; \beta_j) \\
&\quad - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} I(X_i = j)(Z_i - \mu_j)^T \Sigma_j^{-1} (Z_i - \mu_j) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} I(X_i = j) \log \det(\Sigma_j) \\
&\quad - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} I(X_i = j) \frac{(Y_i - \gamma_j)^2}{\sigma_j^2} - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} I(X_i = j) \log \sigma_j^2
\end{aligned} \quad (2.4)
$$

E-step

We denote the observed data as $D = \{G, Z, Y\}$. At iteration $t$, given the current estimate of the model parameters $\Theta^{(t)}$, the E-step calculates the posterior probability of observation $i$ being assigned to latent cluster $j$:

$$r_{ij}^{(t)} = E(I(X_i = j) \mid D; \Theta^{(t)}) = P(X_i = j \mid D; \Theta^{(t)}) = \frac{S(X_i = j \mid G_i; \beta_j^{(t)}) \, \phi(Z_i \mid X_i = j; \mu_j^{(t)}, \Sigma_j^{(t)}) \, \phi(Y_i \mid X_i = j; \gamma_j^{(t)}, \sigma_j^{2(t)})}{\sum_{b=1}^{k} S(X_i = b \mid G_i; \beta_b^{(t)}) \, \phi(Z_i \mid X_i = b; \mu_b^{(t)}, \Sigma_b^{(t)}) \, \phi(Y_i \mid X_i = b; \gamma_b^{(t)}, \sigma_b^{2(t)})} \quad (2.5)$$

The expectation of the log likelihood (denoted $Q(\Theta \mid D, \Theta^{(t)})$) is obtained by replacing $I(X_i = j)$ with $r_{ij}^{(t)}$ in Equation 2.4:

$$Q(\Theta \mid D, \Theta^{(t)}) = \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log S(X_i = j \mid G_i; \beta) + \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log \phi(Z_i \mid X_i = j; \mu_j, \Sigma_j) + \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log \phi(Y_i \mid X_i = j; \gamma_j, \sigma_j^2) \quad (2.6)$$

M-step

At iteration $t$, the M-step maximizes the expected log likelihood with respect to $\beta$, $\mu$, $\gamma$ and $\sigma^2$. The optimized parameters, referred to as $\Theta^{(t+1)}$, are

$$\beta^{(t+1)} = \arg\max_{\beta} \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij}^{(t)} \log S(X_i = j \mid G_i; \beta_j) \quad (2.7)$$

$$\mu_j^{(t+1)} = \frac{\sum_{i=1}^{n} r_{ij}^{(t)} Z_i}{\sum_{i=1}^{n} r_{ij}^{(t)}} \quad (2.8)$$

$$\Sigma_j^{(t+1)} = \frac{\sum_{i=1}^{n} r_{ij}^{(t)} \left(Z_i - \mu_j^{(t+1)}\right)\left(Z_i - \mu_j^{(t+1)}\right)^T}{\sum_{i=1}^{n} r_{ij}^{(t)}} \quad (2.9)$$

$$\gamma_j^{(t+1)} = \frac{\sum_{i=1}^{n} r_{ij}^{(t)} Y_i}{\sum_{i=1}^{n} r_{ij}^{(t)}} \quad (2.10)$$

$$\sigma_j^{2(t+1)} = \frac{\sum_{i=1}^{n} r_{ij}^{(t)} (Y_i - \gamma_j^{(t+1)})^2}{\sum_{i=1}^{n} r_{ij}^{(t)}} \quad (2.11)$$

Equation 2.9 is the general case for maximizing the cluster-specific variance-covariance structure without adding any of the geometric constraints described in Figure 2.2. A detailed discussion of maximizing $\Sigma_j$ parameterized by the eigenvalue decomposition in Equation 2.2 is given by Celeux and Govaert [14]; an implementation of their algorithm in R is provided by Scrucca et al. [85].
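Before turning to initialization, a minimal numerical sketch of the E-step (Equation 2.5) and the closed-form M-step updates for $\mu_j$ (Equation 2.8) and $\gamma_j$ (Equation 2.10) may help fix ideas. The parameter objects and function names below are hypothetical, and the $\beta$ update (Equation 2.7) is omitted because it requires a weighted multinomial fit.

```r
library(mvtnorm)

# Minimal sketch: one E-step (Eq. 2.5) and the closed-form M-step updates
# (Eqs. 2.8 and 2.10). beta is p x (k-1); mu is k x m; Sigma is a list of k
# m x m matrices; gamma and sigma2 are length-k vectors. All hypothetical.
e_step <- function(G, Z, Y, beta, mu, Sigma, gamma, sigma2) {
  k <- length(gamma)
  prior <- exp(cbind(0, G %*% beta))      # softmax numerators, reference = 0
  prior <- prior / rowSums(prior)
  num <- sapply(1:k, function(j) {
    prior[, j] *
      dmvnorm(Z, mean = mu[j, ], sigma = Sigma[[j]]) *
      dnorm(Y, mean = gamma[j], sd = sqrt(sigma2[j]))
  })
  num / rowSums(num)                      # n x k matrix of responsibilities
}

m_step_means <- function(r, Z, Y) {
  list(mu    = t(r) %*% Z / colSums(r),               # Eq. 2.8
       gamma = as.vector(t(r) %*% Y / colSums(r)))    # Eq. 2.10
}
```

In practice the densities are computed on the log scale to avoid numerical underflow; this sketch uses raw densities only for readability.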
Initialization of the EM algorithm

Before discussing the initialization of the EM algorithm for LUCID, it is worth pointing out that the EM algorithm is iterative and can start from either the E-step or the M-step. For example, for model-based clustering such as GMM, we can specify initial values of $\{\mu_j, \Sigma_j, p_j\}$ and start from the E-step; likewise, we can start from the M-step by feeding initial weights $r_{ij}$ to the EM algorithm. There are many ways to configure the initialization. The easiest is to initialize parameters or weights randomly (e.g., generating random values from a uniform distribution for the means, or using the product of a random matrix and its transpose to create symmetric matrices for the covariance structure). We can also implement heuristic methods, such as K-means or hierarchical clustering, to assign cluster memberships first and then calculate the weights from the clustering results [39].

LUCID provides two initialization options: random initialization and heuristic initialization. Both approaches start from the E-step. The random initialization generates starting values for the parameters $\beta$ and $\gamma_j$ from a uniform distribution $U(-1, 1)$, and for $\sigma_j^2$ from $U(0, 1)$. For the parameters associated with the omics data, we fit a K-means clustering to obtain initial values for $\mu_j$; for $\Sigma_j$, we use a diagonal matrix with values generated from $U(0, 2)$. The heuristic initialization instead first fits a mclust model [85] on the omics data to initialize $\mu_j$ and $\Sigma_j$. Specifically, mclust fits a series of clustering models with different geometric features and picks the optimal model based on BIC [31]. Through the iterations of the EM algorithm, $\Sigma_j$ in Equation 2.4 is then optimized using the optimal covariance model determined by mclust. We then fit a multinomial logistic model by regressing the posterior inclusion probabilities (PIPs, obtained from mclust) on the exposures; the resulting estimates are the starting values for $\beta$. We also regress the outcome on the PIPs by fitting a generalized linear model to initialize $\gamma$. Algorithm 1 presents the EM algorithm using the heuristic initialization.

Algorithm 1: The EM algorithm for the LUCID model with one latent variable
Input: Multi-view data $D = \{G, Z, Y\}$, total number of iterations $t_{max}$, convergence tolerance $\epsilon_{max}$
1: Initialization:
2: $\epsilon \leftarrow \infty$
3: Initialize $\mu^{(0)}, \Sigma^{(0)}$ by mclust
4: Initialize $\beta^{(0)}$ by nnet
5: Initialize $\gamma^{(0)}$ by GLM
6: Compute the log-likelihood $l_1$ based on $\mu^{(0)}, \Sigma^{(0)}, \beta^{(0)}, \gamma^{(0)}$, by Equation 2.4
7: $t \leftarrow 0$
8: EM algorithm:
9: while $t < t_{max}$ and $\epsilon > \epsilon_{max}$ do
10:   E-step:
11:   Compute $r_{ij}^{(t)}$ using $\mu^{(t)}, \Sigma^{(t)}, \beta^{(t)}, \gamma^{(t)}$ based on Equation 2.5
12:   M-step:
13:   Update $\beta^{(t+1)}$ by Equation 2.7
14:   Update $\mu^{(t+1)}, \Sigma^{(t+1)}$ by mclust
15:   Update $\gamma^{(t+1)}$ by Equation 2.10
16:   Compute the updated log-likelihood $l_2$ by $\mu^{(t+1)}, \Sigma^{(t+1)}, \beta^{(t+1)}, \gamma^{(t+1)}$ and Equation 2.4
17:   $\epsilon \leftarrow l_2 - l_1$
18:   $l_1 \leftarrow l_2$
19:   $t \leftarrow t + 1$
20: end while
21: Compute $r_{ij}^{(t)}$ using $\mu^{(t)}, \Sigma^{(t)}, \beta^{(t)}, \gamma^{(t)}$ based on Equation 2.5
Output: $\Theta^{(t)} = \{\mu^{(t)}, \Sigma^{(t)}, \beta^{(t)}, \gamma^{(t)}\}$ and $r^{(t)}$

2.2.3 Integrated variable selection

Previous work by Peng et al. uses an $L_1$ penalty to achieve variable selection for LUCID. Specifically, a sparse solution for $\beta$ is obtained by

$$\beta_{lasso}^{(t+1)} = \arg\max_{\beta} \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij}^{(t)} \log S(X_i = j \mid G_i; \beta_j) - \lambda_G \sum_{j=1}^{k} \sum_{l=1}^{p+1} |\beta_{jl}| \quad (2.12)$$

Sparse estimates of $\mu_j$ and $\Sigma_j$ are obtained by implementing the lasso-type Scout method [111]. The authors use an $L_1$ norm of the form $|\Sigma_j^{-1} \mu_j|$ to shrink $\mu_j$. Although this penalty term effectively selects informative variables in a simulation study and a real data example on colorectal cancer (CRC), its performance is limited when variables, especially variables in omics data, are correlated [72]. Zhou et al. [120] develop a general penalized likelihood method with unconstrained covariance structure. They point out that this general penalized approach does not require assuming that the non-informative variables are uncorrelated with the informative ones. In particular, when some informative variables are correlated, they have empirically shown that this general approach performs better than the one using diagonal covariance matrices.

The newly proposed integrated variable selection for LUCID is built upon the penalized EM algorithm; the correctness of the penalized EM algorithm is shown in Appendix 2.7. For variable selection in the exposures $G$, we use the same strategy as in Equation 2.12. For variable selection in the omics data $Z$, we adapt the penalty function proposed by Zhou et al.,

$$p_{\lambda}(\Theta) = \lambda_{\mu} \sum_{j=1}^{k} \sum_{l=1}^{m} |\mu_{jl}| + \lambda_{W} \sum_{j=1}^{k} \sum_{l,s} |w_{j;l,s}| \quad (2.13)$$

where $W = \Sigma^{-1}$ and $w_{j;l,s}$ is the element in the $l$th row and $s$th column of the precision matrix for cluster $j$. To derive the estimates of the mean $\mu_j$ and covariance matrix $\Sigma_j$, we focus on the M-step of Equation 2.4.
In particular, we are interested in the maximization of the expected log likelihood related to the omics data $Z$ with the penalty function in Equation 2.13:

$$
\begin{aligned}
El_{Z,p}(\Theta, \Theta^{(t)}) &= \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij}^{(t)} \log \phi(Z_i; \mu_j, \Sigma_j) - p_{\lambda}(\Theta) \\
&= -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij}^{(t)} \log \det(\Sigma_j) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij}^{(t)} (Z_i - \mu_j)^T \Sigma_j^{-1} (Z_i - \mu_j) - p_{\lambda}(\Theta) \\
&= \frac{n}{2} \log \det(W_j) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij}^{(t)} (Z_i - \mu_j)^T W_j (Z_i - \mu_j) - p_{\lambda}(\Theta)
\end{aligned} \quad (2.14)
$$

Maximization of the mean

The cluster-specific mean $\mu_j$ is a vector of length $m$; denote $W_{j;\cdot l} = (w_{j;1l}, w_{j;2l}, \ldots, w_{j;ll})^T$. For fixed $j$ and $l$, Zhou et al. [120] proved the sufficient and necessary conditions for $\mu_{jl}$ to be the MLE of the penalized likelihood for GMM. Because the likelihood of the omics data in the LUCID model can be viewed as a GMM, these sufficient and necessary conditions also hold for LUCID in Equation 2.14:

$$\sum_{i=1}^{n} r_{ij}^{(t)} (Z_i^T W_{j;\cdot l}) - \left( \sum_{i=1}^{n} r_{ij}^{(t)} \right) \mu_j^T W_{j;\cdot l} = \lambda_{\mu} \, \mathrm{sign}(\mu_{jl}) \quad \text{if } \mu_{jl} \neq 0 \quad (2.15)$$

and

$$\left| \sum_{i=1}^{n} r_{ij}^{(t)} \left[ \sum_{s=1, s \neq l}^{m} (Z_{is} - \mu_{js}) w_{sl} + Z_{il} w_{ll} \right] \right| \leq \lambda_{\mu} \quad \text{if } \mu_{jl} = 0 \quad (2.16)$$

and the corresponding update formula is

$$\left( \sum_{i=1}^{n} r_{ij}^{(t)} \right) \mu_{jl}^{(t+1)} w_{ll} + \lambda_{\mu} \, \mathrm{sign}(\mu_{jl}^{(t+1)}) = \sum_{i=1}^{n} r_{ij}^{(t)} (Z_i^T W_{j;\cdot l}) - \left( \sum_{i=1}^{n} r_{ij}^{(t)} \right) \left( \mu_j^{(t)T} W_{j;\cdot l} - \mu_{jl}^{(t)} w_{ll} \right) \quad (2.17)$$

The derivation of Equation 2.15 can be found in Appendix 2.10.

Maximization of the covariance

To obtain the sparse estimate of the cluster-specific covariance structure, as discussed in the literature review, there is a connection between GMM and the Gaussian graphical model. For the M-step corresponding to $\Sigma_j$, it suffices to maximize

$$
\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log \det(W_j) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} (Z_i - \mu_j)^T W_j (Z_i - \mu_j) - \lambda_W \sum_{j=1}^{k} \sum_{l,s} |w_{j;l,s}|
= \sum_{j=1}^{k} \left[ \frac{1}{2} \sum_{i=1}^{n} r_{ij} \log \det(W_j) - \frac{1}{2} \sum_{i=1}^{n} r_{ij} \, \mathrm{trace}(S_j W_j) - \lambda_W \sum_{l,s} |w_{j;l,s}| \right] \quad (2.18)
$$

where $S_j$ is the empirical covariance matrix at each iteration,

$$S_j = \frac{\sum_{i=1}^{n} r_{ij} (Z_i - \mu_j)(Z_i - \mu_j)^T}{\sum_{i=1}^{n} r_{ij}} \quad (2.19)$$

Therefore, at iteration $t$, the maximization of Equation 2.18 can be treated as $k$ separate problems. For each fixed $j$, the maximization problem is formulated as

$$W_j^{(t+1)} = \arg\max_{W_j} \; \log \det(W_j) - \mathrm{trace}(S_j^{(t)} W_j) - \lambda_W \sum_{l,s} |w_{j;l,s}| \quad (2.20)$$

which is exactly a graphical lasso problem and has been well studied by Friedman et al. [33]. They developed an R package called glasso to implement their algorithm, and we use the glasso package in the coding framework of LUCID.
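A minimal sketch of this per-cluster update with the glasso package is shown below. The data and penalty values are made up, and the plain sample covariance stands in for the weighted $S_j$ of Equation 2.19.

```r
library(glasso)

# Minimal sketch: sparse precision update (Equation 2.20) via graphical lasso
# for one cluster j, over a small grid of candidate penalties.
set.seed(1)
Z <- matrix(rnorm(200 * 5), 200, 5)
S_j <- cov(Z)                         # stands in for the weighted S_j (Eq. 2.19)

for (lambda_W in c(0.05, 0.1, 0.2)) {
  fit <- glasso(S_j, rho = lambda_W)  # solves the graphical lasso problem
  W_j <- fit$wi                       # penalized precision matrix estimate
  cat("lambda_W =", lambda_W,
      ": zero entries =", sum(W_j == 0), "\n")
}
```

In the actual penalized EM algorithm, the count of entries shrunk to zero feeds into the modified model selection criterion introduced in the next subsection.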
2.2.4 Model selection

As mentioned before, in the practice of model-based clustering, as well as for the LUCID model, the first task is to determine the number of latent clusters $K$. A common procedure is to fit a series of models with different $K$ and then use a model selection criterion to choose the best model. The Bayesian Information Criterion (BIC) [84] is defined as

$$BIC = -2 \log L(\Theta) + D \log n \quad (2.21)$$

where $\Theta$ is the MLE of the model, $n$ is the number of observations and $D$ is the number of parameters to be estimated. In the LUCID model, the softmax function $S(X \mid G)$ contains $p \times (K - 1)$ parameters, the multivariate normal density $\phi(Z \mid X)$ has a total of $K(m + m(m+1)/2)$ parameters, and the number of parameters related to the outcome depends on the distribution of the outcome. For a normally distributed outcome, $n_Y = K + n_{covariate} + 1$, where the 1 refers to the variance of $Y$; for a binary outcome, $n_Y = K + n_{covariate}$. Hence $D = p(K-1) + Km + Km(m+1)/2 + n_Y$.

For the LUCID model with integrated variable selection, in addition to choosing $K$, we also need to tune the penalty terms $\lambda_G, \lambda_{\mu}, \lambda_W$ to obtain a sparse solution for $\Theta$. We adapt a modified BIC proposed by Pan and Shen [71] for variable selection in GMM. Following results from Efron et al. [25] and Zou et al. [122] on lasso regression, they subtracted the number of parameters shrunk to 0 from the total number of parameters. For the LUCID model, we consider shrinkage of both the exposure effects for $G$ and the omics effects for $Z$. Let $d_G$ and $d_Z$ be the numbers of parameters shrunk to 0 in $G$ and $Z$, respectively. We tune the combination of $K, \lambda_G, \lambda_{\mu}, \lambda_W$ based on the modified BIC

$$BIC_p = -2 \log L(\Theta) + (D - d_G - d_Z) \log n \quad (2.22)$$

2.2.5 Label switching

Label switching is a challenging problem in mixture models, and much effort has been devoted to solving it, for example [15] and [113]. Suppose we have a vector of observations $x = \{x_1, \ldots, x_m\}$ drawn from a Gaussian mixture model with $K$ components. The likelihood for an individual observation is

$$f(x_i; \Theta) = \sum_{j=1}^{K} p_j f(x_i; \mu_j, \sigma_j^2) \quad (2.23)$$

where $\sum_{j=1}^{K} p_j = 1$. The joint log likelihood of the observed data is written as

$$l(\Theta \mid x) = \sum_{i=1}^{N} \log \sum_{j=1}^{K} p_j f(x_i; \mu_j, \sigma_j^2) \quad (2.24)$$

Given a permutation $P = \{P(1), P(2), \ldots, P(K)\}$ of $\{1, 2, \ldots, K\}$, the corresponding permutation of the model parameters is $\Theta(P) = \{\mu_{P(1)}, \ldots, \mu_{P(K)}; \sigma^2_{P(1)}, \ldots, \sigma^2_{P(K)}\}$. The joint likelihood in Equation 2.24 is clearly invariant to any permutation $P$, which means $l(\Theta \mid x) = l(\Theta(P) \mid x)$. This property creates difficulties when we use a simulation study to evaluate the bias and variance of the MLE of the model parameters, or when we use the bootstrap to derive confidence intervals (CIs). For two replicates $a$ and $b$ of simulations or bootstrap samples, a direct comparison of the estimated MLEs $\Theta(P_a)$ and $\Theta(P_b)$ might not be meaningful, as they may represent different permutations of the true parameters.

Since LUCID can be treated as an extension of GMM, we have the same label switching problem when estimating MLEs. Following the suggestion of [15], we use a heuristic method and add a parameter constraint on $\Theta$ such that only one permutation satisfies the constraint. We consider the cluster-specific means in Equation 2.10 (for a binary outcome, we use the cluster-specific log odds ratios in Appendix 2.6). Given MLEs $\hat{\gamma}_1, \hat{\gamma}_2, \ldots, \hat{\gamma}_K$, we create a permutation $P$ under the constraint

$$\hat{\gamma}_{P(1)} < \hat{\gamma}_{P(2)} < \ldots < \hat{\gamma}_{P(K)} \quad (2.25)$$

We then use $P$ as the cluster labels and re-order the other parameters from Equation 2.7 to Equation 2.11 such that $\hat{\Theta} = \hat{\Theta}_P$.

2.2.6 Supervised and unsupervised LUCID

When building the joint likelihood in Equation 2.3, the likelihood of $Y$ is included. Because the information in the outcome is used to define the latent clusters, the clustering model based on Equation 2.3 is similar to a traditional supervised learning process (especially for the binary outcome discussed in Appendix 2.6, where explicit label information is included). We use the term supervised LUCID to refer to the LUCID model using all three components $G$, $Z$ and $Y$ to conduct integrated clustering. We also propose an unsupervised LUCID that does not incorporate $Y$ to define the latent clusters. The unsupervised LUCID is a two-step analysis for integrative clustering: in the first step, it utilizes the information from $G$ and $Z$ to estimate the latent clusters.
Specifically, the joint likelihood of unsupervised LUCID is written as

$$\log L(\Theta) = \sum_{i=1}^{N} \sum_{j=1}^{K} I(X_i = j) \left( \log S(X_i = j \mid G_i; \Theta) + \log \phi(Z_i \mid X_i = j; \Theta) \right) \quad (2.26)$$

and the corresponding responsibility based on Equation 2.26 is

$$r_{ij}^{(t)} = \frac{S(X_i = j \mid G_i; \beta^{(t)}) \, \phi(Z_i \mid X_i = j; \mu_j^{(t)}, \Sigma_j^{(t)})}{\sum_{j=1}^{K} S(X_i = j \mid G_i; \beta^{(t)}) \, \phi(Z_i \mid X_i = j; \mu_j^{(t)}, \Sigma_j^{(t)})} \quad (2.27)$$

We still use the same EM algorithm discussed previously to obtain the MLEs of LUCID ($\beta$, $\mu$ and $\Sigma$) as well as the PIP for each observation. Next, we regress $Y$ on the PIPs to obtain the MLE for $\gamma$.

The choice between supervised and unsupervised LUCID depends on the data generating process and the research questions. For example, if the data are collected through a prospective sampling process (the exposures and the omics data are measured before the outcome), an unsupervised LUCID model is appropriate. Additionally, incorporating $Y$ in the clustering will create clusters that favor separating $Y$, which can bias the association between the latent cluster $X$ and $Y$. If the aim of the study is to obtain an unbiased estimate of the outcome effect, we should use an unsupervised LUCID model. On the other hand, the supervised LUCID may be more appropriate if, for instance, (1) the data are generated from a cross-sectional design (the cluster structure might be defined jointly by $G$, $Z$ and $Y$); or (2) the goal is to build a predictive model on one cohort and then apply it to a different cohort.

2.2.7 Variable selection based on correlation structure

Many works on variable selection, such as [71, 120, 86, 117], are implemented by penalizing the likelihood. In particular, regarding variable selection in GMM, the penalty terms are added to the mean parameter $\mu_j$ in Equation 1.2. Given a penalty term $\lambda$, for a certain variable $l$, if the cluster-specific mean $\mu_{jl}$ does not equal 0 across all clusters, then variable $l$ is considered an "informative variable" for clustering and is included in the final clustering model [71]. Although [120] also proposed adding penalty terms to the variance-covariance structure in GMM, their main motivation was to connect GMM to Gaussian graphical modeling. We recognize another possibility: selecting "informative variables" by penalizing the variance-covariance matrices in GMM. For example, consider the case of two bivariate normal distributions

$$p_1 : \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right), \qquad p_2 : \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix} \right) \quad (2.28)$$

Then $Z_1$ and $Z_2$ define two clusters through the correlation structure $\rho$, while their cluster-specific means are both 0 across the two clusters (Figure 2.3). The previous definition of "informative variables" does not apply to this situation, as the informative variables actually have 0 means across the two latent clusters. To solve this problem, we propose that a variable is "informative" if (1) the penalized cluster-specific means are not 0 across all latent clusters, or (2) the signs of the penalized covariances are not the same across all latent clusters. We follow the flowchart in Figure 2.4 to check whether a variable is informative or not.

Figure 2.3: Cluster structure defined only by the correlation between 2 variables.

Figure 2.4: Flowchart for deciding whether a variable is informative to clustering.

2.3 Simulation study

2.3.1 Speed comparison of two initialization strategies of the EM algorithm

Methods

We conduct a simulation study to compare the convergence speed of the two initialization strategies discussed previously. We simulate data with 2 latent clusters.
First, we generate 10 exposures $G$, 5 of which are causal (log OR = 2) while the remaining 5 are noise (log OR = 1). Then we generate the cluster label $X$ conditional on $G$. Conditional on $X$, we create 10 omics variables $Z$, 5 of which are causal ($\Delta_Z = 4$). We finally generate a continuous outcome with effect size $\Delta_Y = 4$. The omics data have missing values (missing ratio = 0.2, with either listwise or sporadic missing patterns, which we discuss in Chapter 3). We fit LUCID models using random initialization and heuristic initialization across a range of sample sizes for each missing pattern. In each scenario, we record the time to convergence in seconds.

Results

Figure 2.5 shows the simulation results for the two initialization strategies. The heuristic strategy using mclust is faster than random initialization under both missing patterns in the omics data. The advantage of the heuristic initialization is much more pronounced when the sporadic missing pattern is present in the omics data, as shown in Figure 2.5b, especially for large sample sizes. As the sample size increases from 200 to 20,000, the convergence time for the heuristic method increases from 1.76 seconds to 27.3 seconds (about 16-fold), while that for the random method increases from 2.73 seconds to 446 seconds (about 163-fold).

The heuristic initialization method is therefore recommended for obtaining the MLE of the LUCID model. It is computationally more efficient, particularly for sporadic missing data. Additionally, the heuristic method allows different geometric features of the omics data (Figure 2.2). However, we notice that when fitting the LUCID model on real data, the mclust model sometimes takes a very long time to run or even fails to converge. In this situation, we instead use a random initialization to make sure the EM algorithm is able to start.

Figure 2.5: Comparing the speed of convergence of the EM algorithm using random initialization and heuristic initialization. (a) Listwise missingness in the omics data. (b) Sporadic missingness in the omics data.

2.3.2 Robustness of integrated variable selection

Methods

Peng et al. illustrated the performance of variable selection using data simulated under the LUCID model [72]. In this section, we focus on the robustness of the proposed variable selection method. First, we evaluate the performance of the proposed variable selection using data generated from a mediation model. We compare the performance of the integrated variable selection based on LUCID to other mediation-based variable selection methods, HIMA [117] and BAMA [90]. We simulate data under two scenarios: independent mediators and correlated mediators. In each scenario, we simulate 10,000 observations as the population, randomly draw 1,000 observations from the population, and replicate the analysis 400 times. The two simulation scenarios are illustrated in Figure 2.6. The exposure variables are simulated from $N(0, 1)$. For the independent mediators in Figure 2.6a, $M_j$ is generated by $M_j = 0.5 \times E_j + \epsilon$, where $\epsilon \sim N(0, 1)$. For the correlated mediators described in Figure 2.6b, $a_2$ and $a_3$ are set to 0.34 such that the correlation between the two exposures is 0.25.
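The independent-mediator scenario can be simulated directly from these equations. The sketch below is illustrative only: the outcome model and the choice of which mediators are causal are assumptions not fully specified above.

```r
# Minimal sketch: simulate the independent-mediator scenario of Figure 2.6a.
# The outcome model and its coefficients are assumptions for illustration.
set.seed(1)
n <- 10000
E <- matrix(rnorm(n * 10), n, 10)             # 10 exposures ~ N(0, 1)
M <- 0.5 * E + matrix(rnorm(n * 10), n, 10)   # M_j = 0.5 * E_j + eps
b <- c(0.5, 0.5, rep(0, 8))                   # assume only M_1, M_2 affect Y
Y <- M %*% b + rnorm(n)                       # continuous outcome
pop  <- data.frame(E = E, M = M, Y = Y)       # simulated population
samp <- pop[sample(n, 1000), ]                # one replicate's analysis sample
```

The correlated-mediator scenario of Figure 2.6b would additionally add the cross-paths $a_2$ and $a_3$ between exposures and mediators, which induce the exposure correlation of 0.25 stated above.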
Because HIMA and BAMA can only handle one exposure at a time, for each replicate we fit HIMA and BAMA on the 10 exposures separately, and a mediator is counted as "selected" if any of the 10 models selects it. We use the True Positive Rate (TPR) and True Negative Rate (TNR) to evaluate model performance, averaged over the two causal mediators.

We also use a low-dimensional simulation study to investigate whether LUCID can pick up "informative variables" based on correlations. We generate data under a LUCID model with 10 omics variables. Four of them are informative, having correlation $\rho$ in cluster 1 and $-\rho$ in the other cluster (e.g. $cor(Z_1, Z_2) = \rho$ for cluster 1 and $cor(Z_1, Z_2) = -\rho$ for cluster 2; the same holds for $Z_3, Z_4$). We simulate data across a grid of $\rho = 0.5, 0.7, 0.9$. The noise variables are generated either from independent $N(0, 1)$ or from a multivariate normal distribution with correlation 0.1. For each simulation, 2,000 data points are randomly sampled from the simulated population. We repeat the simulation 500 times.

Figure 2.6: Simulation scenarios for (a) independent mediators and (b) correlated mediators.

Table 2.1: Simulation results for the correlation-based variable selection criteria in the low-dimensional scenario.

  correlation   noise         informative features selected, mean (sd)   noise variables selected, mean (sd)
  0.9           independent   4.00 (0)                                   0.002 (0.05)
  0.9           correlated    4.00 (0)                                   0.004 (0.09)
  0.7           independent   3.38 (1.20)                                1.34 (1.95)
  0.5           independent   2.87 (1.64)                                2.90 (2.36)

Results

Figure 2.7a presents the results for independent mediators. LUCID has a larger TPR (0.61) for selecting the true mediators compared to BAMA and HIMA (0.28 and 0.26, respectively). Regarding the exclusion of noise from the final model, the TNR for LUCID is smaller than for BAMA and HIMA, which both have TNR = 1. Additionally, LUCID is able to perform variable selection on the exposures, while the other two methods cannot. Although the sensitivity of LUCID for exposures is not high under the independent-mediator scenario (0.30), LUCID performs well in terms of excluding noise exposures. We also investigated which exposure is selected across the 400 replicates. We found that in the replicates where a true exposure is selected, only one of $E_1$ or $E_2$ is picked by the LUCID model; if $E_1$ is selected, then the selected true mediator is only $M_1$, and vice versa. Because there is no correlation between $E_1$ and $M_2$ or between $E_2$ and $M_1$, these results suggest that under the simulation model with independent mediators, LUCID selects the entire mediation pathway.

We observe the same pattern for correlated mediators, as shown in Figure 2.7b. The performance of all three methods improves, in terms of both TPR and TNR. Still, LUCID (TPR = 0.96) outperforms HIMA and BAMA in terms of sensitivity (0.76 and 0.76, respectively). The TNR of LUCID remains the same for correlated mediators. Regarding the selection of exposures, the TPR for LUCID increases from 0.30 to 0.75, which shows that LUCID has a strong ability to discover informative exposures when the mediation pathways are correlated with each other. The LUCID model uses latent mediators to summarize the mediation effect through the observed mediators and associates the exposures with the latent mediator. As a result, it implicitly assumes the observed mediators are correlated with each other, which emulates the data generating process in Figure 2.6b. The integrated variable selection based on LUCID is thus robust to the data generating process.
LUCID is able to select informative exposures and omics variables (the mediators) not only under the LUCID model but also under the mediation model, especially when the mediators are correlated with each other. Compared to other variable selection methods available in the community, LUCID can handle multiple exposures and multiple mediators.

Figure 2.7: Simulation results comparing variable selection based on LUCID and other mediation approaches (HIMA and BAMA). (a) Independent mediators. (b) Correlated mediators.

Table 2.1 presents the results of variable selection based on correlations under the low-dimensional scenario. When the correlation between the informative variables is high, the proposed variable selection procedure effectively distinguishes the informative variables from the noise (number of selected informative variables = 4.00 when $\rho = 0.9$, despite the weak correlation among the noise variables). The weak correlation among the noise variables does not influence the selection result. As the correlation of the informative variables decreases, the mean number of informative variables selected decreases with an increasing standard deviation, and we observe the opposite pattern for the number of noise variables selected. Since we penalize the covariance structure to obtain a sparse covariance matrix, the weak correlations of the noise variables are more likely to be shrunk to 0.

2.4 Data example

(This analysis was a selected presentation, "Zhao, Y., Stratakis, N., Conti, D.V., Latent unknown clustering with integrated multi-omics data (LUCID)", at the ISGlobal Exposome Data Challenge 2021, section of analyses on omics data. It is cited in [60].)

2.4.1 Data description and methods

We apply the integrated variable selection method to a partially simulated exposome dataset based on real cases from the Human Early Life Exposome Study (HELIX) [104]. The data are released by the Barcelona Institute for Global Health (ISGlobal). Baillie-Hamilton et al. conducted a study suggesting that the obesity epidemic correlates with the increased production of synthetic chemicals. Among them, organochlorines (OCs) constitute a major class of synthetic chemicals with suspected obesogenic properties [53]. We investigate whether early-life exposure to organochlorines is associated with childhood obesity, and we select the metabolites and proteins characterizing the underlying biological pathway.

The study sample consists of 1152 children. The outcome of interest is the body mass index (BMI) z-score for children at 6-11 years old, based on the World Health Organization (WHO) reference and standardized on sex and age [104]. We include 14 exposures related to the organochlorines, and 177 serum metabolites, 44 urine metabolites and 36 proteins as the omics layer for the LUCID model (257 variables in total). We consider covariates including maternal pre-pregnancy BMI, child sex, maternal age, cohort of inclusion and maternal education level.

We conduct a "meet in the middle" approach [11] to screen the 257 variables first. Next, we perform a grid search over the tuning parameters for LUCID, including the number of latent clusters $K$ and the penalties $\lambda_{\mu}$ and $\lambda_W$ in Equation 2.13. The optimal combination of tuning parameters is determined by BIC. We conduct the integrated variable selection based on the optimal $K$, $\lambda_{\mu}$ and $\lambda_W$ and refit the LUCID model with the selected omics variables. In addition, we construct 95% confidence intervals based on bootstrap resampling [26].
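The bootstrap procedure can be sketched as follows. The refitting function is a placeholder for the actual LUCID fit, and the percentile interval is one common choice; in practice the label-switching correction of Section 2.2.5 must be applied to each refit before estimates are compared across replicates.

```r
# Minimal sketch: percentile bootstrap CI for one scalar LUCID estimate.
# `fit_lucid` is a hypothetical function that refits the model on resampled
# rows and returns a single parameter estimate.
boot_ci <- function(data, fit_lucid, B = 500, alpha = 0.05) {
  est <- replicate(B, {
    idx <- sample(nrow(data), replace = TRUE)  # resample observations
    fit_lucid(data[idx, , drop = FALSE])       # one bootstrap estimate
  })
  quantile(est, c(alpha / 2, 1 - alpha / 2))   # percentile 95% CI
}
```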
The final LUCID model is visualized via a Sankey diagram, and we characterize the different components of the LUCID model through boxplots and bar plots.

2.4.2 Results

45 variables in the omics layer pass the preliminary screening based on the "meet in the middle" approach (FDR = 0.1). We input these 45 variables into the LUCID model to select the optimal model. The optimal combination of tuning parameters with the lowest BIC is $K = 4$, $\lambda_{\mu} = 50$ and $\lambda_W = 0.1$. Since $K = 4$, we identify 4 latent clusters in the sample. 14 omics variables are selected, including 7 proteins, 1 urinary metabolite and 6 serum metabolites.

LUCID relates the 4 latent clusters to BMI z-score, as shown in Figure 2.9a. There is an increasing trend in BMI z-score from latent cluster 1 to 4. Latent cluster 4 is identified as the obesity group (BMI z-score in latent cluster 4: mean = 1.98, median = 2.14) based on the WHO standard. Except for serum metabolite 63, the remaining 13 omics variables each have at least 1 cluster-specific point estimate that is statistically significant according to the 95% bootstrap CIs shown in Table 2.2. This result shows the linkage between being an informative variable for clustering and statistical significance in the LUCID model. We build omics profiles based on the estimates from the LUCID model, shown in Figure 2.8a. Each latent cluster is characterized by a unique pattern of means across the omics variables.

Figure 2.8: LUCID analysis on the simulated HELIX data. (a) Omics profiles for the 14 selected omics variables; a star indicates a statistically significant estimate based on bootstrap inference, and the y-axis is the cluster-specific mean estimated by the LUCID model. (b) Visualization of the LUCID model for the partially simulated HELIX data; each node represents a variable/component of the LUCID model and each link represents an association. Light grey corresponds to a positive association and dark grey to a negative one; the width of a link represents the magnitude of the association (a thicker link means a larger effect).

We specifically highlight the proteins in the blue box in Figure 2.9b in Appendix 2.8. The obesity group is particularly characterized by elevated levels of IL-1 beta (mean = 1.85, median = 2.02) and IL-6 (mean = 1.78, median = 1.88), two key markers of systemic inflammation in the human body [58], and higher levels of insulin (mean = 0.75, median = 0.77), which is known to be closely associated with obesity and underlying metabolic dysfunction [22]. Figure 2.9b shows the exposure profiles for each latent cluster. Moving from cluster 2 to 4, the concentration of childhood HCB decreases (OR(cluster 2) = 0.44, OR(cluster 3) = 0.24, OR(cluster 4) = 0.12), which confirms previous HELIX publication results that childhood HCB exposure is cross-sectionally associated with reduced childhood BMI z-score [103]. In addition, we identify a new association: prenatal HCB exposure increases children's BMI (an obesogenic effect). The interplay of the 4 components of the LUCID model (exposure to OCs, latent cluster, omics data and outcome) is presented in Figure 2.8b.

2.5 Discussion

Model-based clustering is a popular clustering approach. One advantage of model-based clustering is that, with an underlying probabilistic model, we are able to determine the number of latent clusters via statistical model selection [31]. In contrast, for heuristic algorithms such as K-means, there are no established statistical methods for model selection.
Researchers instead have to rely on subjective approaches, such as rules of thumb or data visualization, to determine the optimal number of clusters $K$ in advance, and statistical measures such as information criteria are not available [73]. LUCID is an integrative model-based clustering method. The probabilistic model of LUCID is explicitly specified via the DAG in Figure 2.1. As a result, LUCID can use the statistical criterion BIC to conduct model selection; BIC provides an objective tool to balance model complexity against goodness of fit.

It is worth pointing out the linkage between the unsupervised LUCID model and GMM. The unsupervised LUCID can be treated as an extension of GMM incorporating individual-level mixture weights. For GMM, based on Equation 1.2, the responsibility is defined as

$$r_{ij}^{(t)} = \frac{\pi_j^{(t)} \, \phi(Z_i \mid X_i = j; \mu_j^{(t)}, \Sigma_j^{(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} \, \phi(Z_i \mid X_i = j; \mu_j^{(t)}, \Sigma_j^{(t)})} \quad (2.29)$$

The only difference between the two responsibilities is that the mixture weight $\pi_j$ in Equation 2.29 is replaced by $S(X_i = j \mid G_i; \beta)$. GMM assumes population-level mixture weights, so $\pi_j$ is the same for all individuals; the mixture weights are interpreted as prior knowledge for clustering (the proportions of the components in the population). Unsupervised LUCID relaxes this assumption and estimates the mixture weights as a function of the exposures, resulting in individual-level mixture weights. In LUCID, the clusters can be interpreted as subgroups of a population with different risks of developing the disease (for a binary outcome). We assume different exposure profiles influence the relevant etiologic pathways, which ultimately influence the risk of developing the disease. We may benefit from including individual-level exposures as prior knowledge and achieve better clustering accuracy.

Current penalization approaches for variable selection in clustering mainly focus on regularizing the cluster-specific means; see [71, 29, 120, 112] for more details. To the best of our knowledge, no existing penalization-based method uses correlation to select informative variables. The proposed definition of "informative variable" covers more of the possible ways a variable can contribute to the clustering in the data. We conducted several additional simulations to further explore the effectiveness of the proposed variable selection (Appendix 2.11). Several questions remain to be addressed: (1) when the correlation between two variables defines the clusters, how strong should the correlation be? As we show in Figure 2.10, when the correlation decreases to 0.5, it is hard to visualize the difference between the two clusters, and including these variables in the clustering model does not improve model performance; (2) how are clusters defined in a high-dimensional space (large number of variables and $K > 2$)? In the simulations, we consider the simplest scenario of 2 clusters. We also tried scenarios in which $K \geq 3$ (results not shown), but the criteria defined in Figure 2.4 do not work well there. When the number of clusters and the number of features increase, it is common to have variables whose regularized correlations are not 0 and whose correlation signs differ across certain clusters, especially in real data applications. As a result, it is hard to select informative variables, and the false negative rate may increase drastically. It is of interest to develop a more rigorous and practical criterion to extend the definition of "informative variable" to high-dimensional spaces.
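To make the correlation-defined clusters of Equation 2.28 concrete, the sketch below draws the two components with $\rho = 0.9$. This is a toy illustration of the geometry, not part of the reported simulations.

```r
library(MASS)

# Minimal sketch: two clusters defined purely by the sign of the correlation,
# as in Equation 2.28 (both clusters have mean zero).
set.seed(1)
rho <- 0.9
S1 <- matrix(c(1,  rho,  rho, 1), 2, 2)
S2 <- matrix(c(1, -rho, -rho, 1), 2, 2)
Z  <- rbind(mvrnorm(1000, mu = c(0, 0), Sigma = S1),
            mvrnorm(1000, mu = c(0, 0), Sigma = S2))
labels <- rep(1:2, each = 1000)
plot(Z, col = labels, pch = 16, cex = 0.4)  # an "X"-shaped pair of clusters
```

Rerunning the sketch with $\rho = 0.5$ shows the two ellipses blurring into one cloud, which is the visual counterpart of the degraded selection performance discussed above.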
To summarize, the correlation of variables can improve the accuracy of clustering, and we believe this idea could lead to novel variable selection techniques for clustering in the future.

The LUCID model discussed in this chapter is limited to being informed by one type of omics data. If multiple omics datasets are included in the analysis, we concatenate these datasets and input them as a single data matrix. It might be of interest to model each omics dataset separately and take the distinct dependencies within each omics dataset into account. For example, the omics data collected in the HELIX study, such as proteins and serum and urine metabolites, exhibit quite different correlation patterns. Since the variance-covariance matrix can be eigen-decomposed, the distinct correlation matrices represent different geometric structures for each omics dataset. Therefore, concatenating the omics data into a single matrix discards the information contained in the unique correlation structure of each omics dataset and might not reveal the true biology behind the scenes. Secondly, different omics data might define different clusters within the sample; we could estimate the clusters separately and test their joint effect on the health outcome. These extensions are related to the extension of LUCID with multiple latent variables, which will be discussed in Chapter 5.

Additionally, we might take biological knowledge and the study design into consideration. According to the central dogma of biology, there is an information flow from DNA to RNA to protein, so a natural causal relationship is established among the omics data. Incorporating multiple latent mediators "in serial" allows us to model such causal relationships. Another consideration is the measurement time of each omics dataset. Large cohorts like HELIX usually run for a long time, and omics data are collected at different times. A LUCID model incorporating multiple omics data "in serial" would include this time factor in the model design.

2.6 Appendix 2.A

When the outcome is binary, we only need to modify the likelihood related to $Y$. We assume $Y_i$ follows a Bernoulli distribution with parameter $p_j$, interpreted as the probability of being a case in latent cluster $j$:

$$f(Y_i \mid X_i = j; p_j) = p_j^{Y_i} (1 - p_j)^{(1 - Y_i)} \quad (2.30)$$

The corresponding model log likelihood becomes

$$
\begin{aligned}
l(\Theta \mid (Z, X, Y \mid G)) &= \sum_{i=1}^{n} \sum_{j=1}^{k} I(X_i = j) \log S(X_i = j \mid G_i; \beta_j) \\
&\quad - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} I(X_i = j)(Z_i - \mu_j)^T \Sigma_j^{-1} (Z_i - \mu_j) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} I(X_i = j) \log \det(\Sigma_j) \\
&\quad + \sum_{i=1}^{n} \sum_{j=1}^{k} I(X_i = j) Y_i \log p_j + \sum_{i=1}^{n} \sum_{j=1}^{k} I(X_i = j)(1 - Y_i) \log(1 - p_j)
\end{aligned} \quad (2.31)
$$

Then at iteration $t + 1$, taking the partial derivative with respect to $p_j$, we obtain

$$\frac{\partial l(\Theta \mid (Z, X, Y \mid G))}{\partial p_j} = \frac{\sum_{i=1}^{n} r_{ij} Y_i}{p_j} - \frac{\sum_{i=1}^{n} r_{ij} (1 - Y_i)}{1 - p_j} \quad (2.32)$$

Setting Equation 2.32 to 0, we update the parameter $p_j$ by

$$p_j^{(t+1)} = \frac{\sum_{i=1}^{n} r_{ij} Y_i}{\sum_{i=1}^{n} r_{ij}} \quad (2.33)$$

2.7 Appendix 2.B

Let $f(\tilde{x} \mid \Theta)$ be any probability density function, where $\tilde{x}$ is a $p$-dimensional vector and $\Theta$ is an $m$-dimensional parameter space.
Our aim is to maximize the log likelihood function with an $L_1$ penalty (in fact, the penalty could take the general form of an $L_p$ norm):

$$l(x \mid \Theta) = \sum_{i=1}^{n} \log f(\tilde{x}_i \mid \Theta) - \sum_{j=1}^{m} |\theta_j| \quad (2.34)$$

Assume some latent variable makes the calculation easier. Then by Jensen's inequality,

$$
\begin{aligned}
l(x \mid \Theta) &= \sum_{i=1}^{n} \log \sum_{j=1}^{k} f(\tilde{x}_i, z = j \mid \Theta) - \sum_{s=1}^{m} |\theta_s| \\
&= \sum_{i=1}^{n} \log \sum_{j=1}^{k} p(z = j \mid \tilde{x}_i, \Theta) \frac{f(\tilde{x}_i, z = j \mid \Theta)}{p(z = j \mid \tilde{x}_i, \Theta)} - \sum_{s=1}^{m} |\theta_s| \\
&\geq \sum_{i=1}^{n} \sum_{j=1}^{k} p(z = j \mid \tilde{x}_i, \Theta) \log \frac{f(\tilde{x}_i, z = j \mid \Theta)}{p(z = j \mid \tilde{x}_i, \Theta)} - \sum_{s=1}^{m} |\theta_s|
\end{aligned} \quad (2.35)
$$

Denote the responsibility $p(z = j \mid \tilde{x}_i, \Theta)$ as $r_k$. To show that the penalized EM algorithm works, all we need to prove is that at each iteration, our target function in Equation 2.34 is non-decreasing. Suppose at iteration $t$ we have obtained estimates of the parameters, denoted $\Theta^{(t)}$. Then at iteration $t + 1$, the E-step calculates

$$r_k^{(t+1)} = p(z = j \mid \tilde{x}_i, \Theta^{(t)}) \quad (2.36)$$

and the parameter space is updated by

$$\Theta^{(t+1)} = \arg\max \sum_{i=1}^{n} \sum_{j=1}^{k} r_k^{(t+1)} \log \frac{f(\tilde{x}_i, z = j \mid \Theta)}{r_k^{(t+1)}} - \sum_{s=1}^{m} |\theta_s| \quad (2.37)$$

By the choice of $p(z = j \mid \tilde{x}_i, \Theta)$, equality holds in Jensen's inequality, thus

$$l(\Theta^{(t)}) = \sum_{i=1}^{n} \sum_{j=1}^{k} p(z = j \mid \tilde{x}_i, \Theta^{(t)}) \log \frac{f(\tilde{x}_i, z = j \mid \Theta^{(t)})}{p(z = j \mid \tilde{x}_i, \Theta^{(t)})} - \sum_{s=1}^{m} |\theta_s^{(t)}| \quad (2.38)$$

Because Equation 2.35 holds for any $p(z = j \mid \tilde{x}_i, \Theta)$, combining Equations 2.35, 2.37 and 2.38, we have

$$
\begin{aligned}
l(\Theta^{(t+1)}) &\geq \sum_{i=1}^{n} \sum_{j=1}^{k} r_k^{(t+1)} \log \frac{f(\tilde{x}_i, z = j \mid \Theta^{(t+1)})}{r_k^{(t+1)}} - \sum_{s=1}^{m} |\theta_s^{(t+1)}| \\
&\geq \sum_{i=1}^{n} \sum_{j=1}^{k} r_k^{(t+1)} \log \frac{f(\tilde{x}_i, z = j \mid \Theta^{(t)})}{r_k^{(t+1)}} - \sum_{s=1}^{m} |\theta_s^{(t)}| \\
&= \sum_{i=1}^{n} \sum_{j=1}^{k} p(z = j \mid \tilde{x}_i, \Theta^{(t)}) \log \frac{f(\tilde{x}_i, z = j \mid \Theta^{(t)})}{p(z = j \mid \tilde{x}_i, \Theta^{(t)})} - \sum_{s=1}^{m} |\theta_s^{(t)}| \\
&= l(\Theta^{(t)})
\end{aligned} \quad (2.39)
$$

Therefore, the target penalized likelihood function is non-decreasing at each iteration. As a result, as long as the likelihood function is bounded, the algorithm converges to a maximum of the target function in Equation 2.34.

2.8 Appendix 2.C

Figure 2.9: Supplementary analysis of the HELIX data for the ISGlobal data challenge 2021. (a) Boxplot of BMI z-score for each latent cluster; the number in parentheses is the number of observations assigned to each latent cluster based on the posterior inclusion probability (PIP) of LUCID. The average BMI z-score increases from latent cluster 1 to latent cluster 4. (b) The exposure profile for each latent cluster; the y-axis represents the log odds ratio for each exposure, and a star indicates statistical significance based on the 95% CI built from bootstrap resampling.
2.9 Appendix 2.D

Table 2.2: Bootstrap standard errors for the LUCID model, calculated from 500 replicates.

Type       Variable                Cluster    Estimate   Bootstrap SE   95% CI lower   95% CI upper
exposure   hs_dde_cadj_Log2        cluster2   -0.18      0.28           -0.73          0.37
exposure   hs_dde_cadj_Log2        cluster3   -0.15      0.42           -0.97          0.67
exposure   hs_dde_cadj_Log2        cluster4   -0.20      0.22           -0.63          0.23
exposure   hs_dde_madj_Log2        cluster2   0.18       0.23           -0.27          0.63
exposure   hs_dde_madj_Log2        cluster3   0.23       0.26           -0.28          0.74
exposure   hs_dde_madj_Log2        cluster4   0.29       0.22           -0.14          0.72
exposure   hs_ddt_madj_Log2        cluster2   0.08       0.13           -0.17          0.33
exposure   hs_ddt_madj_Log2        cluster3   0.08       0.16           -0.23          0.39
exposure   hs_ddt_madj_Log2        cluster4   0.13       0.11           -0.09          0.35
exposure   hs_hcb_cadj_Log2        cluster2   -0.82      0.63           -2.05          0.41
exposure   hs_hcb_cadj_Log2        cluster3   -1.43      0.70           -2.80          -0.06
exposure   hs_hcb_cadj_Log2        cluster4   -2.10      0.61           -3.30          -0.90
exposure   hs_hcb_madj_Log2        cluster2   0.60       0.26           0.09           1.11
exposure   hs_hcb_madj_Log2        cluster3   0.62       0.40           -0.16          1.40
exposure   hs_hcb_madj_Log2        cluster4   0.89       0.23           0.44           1.34
exposure   hs_pcb118_cadj_Log2     cluster2   0.24       0.57           -0.88          1.36
exposure   hs_pcb118_cadj_Log2     cluster3   0.22       0.67           -1.09          1.53
exposure   hs_pcb118_cadj_Log2     cluster4   0.38       0.54           -0.68          1.44
exposure   hs_pcb118_madj_Log2     cluster2   -0.18      0.56           -1.28          0.92
exposure   hs_pcb118_madj_Log2     cluster3   -0.07      0.66           -1.36          1.22
exposure   hs_pcb118_madj_Log2     cluster4   0.59       0.47           -0.33          1.51
exposure   hs_pcb138_madj_Log2     cluster2   -0.15      0.37           -0.88          0.58
exposure   hs_pcb138_madj_Log2     cluster3   0.11       0.44           -0.75          0.97
exposure   hs_pcb138_madj_Log2     cluster4   -0.11      0.31           -0.72          0.50
exposure   hs_pcb170_cadj_Log2     cluster2   -0.13      0.19           -0.50          0.24
exposure   hs_pcb170_cadj_Log2     cluster3   -0.20      0.23           -0.65          0.25
exposure   hs_pcb170_cadj_Log2     cluster4   -0.19      0.15           -0.48          0.10
exposure   hs_pcb170_madj_Log2     cluster2   0.00       0.38           -0.74          0.74
exposure   hs_pcb170_madj_Log2     cluster3   -0.38      0.56           -1.48          0.72
exposure   hs_pcb170_madj_Log2     cluster4   -0.65      0.34           -1.32          0.02
exposure   hs_pcb180_cadj_Log2     cluster2   -0.22      0.37           -0.95          0.51
exposure   hs_pcb180_cadj_Log2     cluster3   -0.25      0.44           -1.11          0.61
exposure   hs_pcb180_cadj_Log2     cluster4   -0.27      0.29           -0.84          0.30
exposure   hs_pcb180_madj_Log2     cluster2   0.24       0.35           -0.45          0.93
exposure   hs_pcb180_madj_Log2     cluster3   0.48       0.50           -0.50          1.46
exposure   hs_pcb180_madj_Log2     cluster4   0.49       0.30           -0.10          1.08
exposure   hs_sumPCBs5_cadj_Log2   cluster2   0.61       0.90           -1.15          2.37
exposure   hs_sumPCBs5_cadj_Log2   cluster3   0.32       0.99           -1.62          2.26
exposure   hs_sumPCBs5_cadj_Log2   cluster4   -0.05      0.64           -1.30          1.20
exposure   hs_sumPCBs5_madj_Log2   cluster2   -0.23      0.48           -1.17          0.71
exposure   hs_sumPCBs5_madj_Log2   cluster3   -0.27      0.62           -1.49          0.95
exposure   hs_sumPCBs5_madj_Log2   cluster4   -0.05      0.44           -0.91          0.81
serum      serum_metab_63          cluster1   0.04       0.24           -0.43          0.51
serum      serum_metab_63          cluster2   -0.40      0.34           -1.07          0.27
serum      serum_metab_63          cluster3   0.05       0.36           -0.66          0.76
serum      serum_metab_63          cluster4   -0.17      0.10           -0.37          0.03
serum      serum_metab_79          cluster1   0.06       0.13           -0.19          0.31
serum      serum_metab_79          cluster2   -0.15      0.17           -0.48          0.18
serum      serum_metab_79          cluster3   -0.20      0.25           -0.69          0.29
serum      serum_metab_79          cluster4   0.39       0.12           0.15           0.63
serum      serum_metab_81          cluster1   0.10       0.19           -0.27          0.47
serum      serum_metab_81          cluster2   -0.22      0.22           -0.65          0.21
serum      serum_metab_81          cluster3   -0.22      0.21           -0.63          0.19
serum      serum_metab_81          cluster4   0.31       0.11           0.09           0.53
serum      serum_metab_95          cluster1   -0.29      0.23           -0.74          0.16
serum      serum_metab_95          cluster2   -0.19      0.21           -0.60          0.22
serum      serum_metab_95          cluster3   0.25       0.26           -0.26          0.76
serum      serum_metab_95          cluster4   0.74       0.13           0.49           0.99
serum      serum_metab_122         cluster1   0.25       0.21           -0.16          0.66
serum      serum_metab_122         cluster2   -0.33      0.23           -0.78          0.12
serum      serum_metab_122         cluster3   -0.25      0.23           -0.70          0.20
serum      serum_metab_122         cluster4   -0.25      0.10           -0.45          -0.05
serum      serum_metab_153         cluster1   0.04       0.23           -0.41          0.49
serum      serum_metab_153         cluster2   -0.39      0.30           -0.98          0.20
serum      serum_metab_153         cluster3   0.08       0.29           -0.49          0.65
serum      serum_metab_153         cluster4   -0.27      0.09           -0.45          -0.09
urine      urine_metab_5           cluster1   -0.09      0.17           -0.42          0.24
urine      urine_metab_5           cluster2   0.61       0.44           -0.25          1.47
urine      urine_metab_5           cluster3   -0.08      0.72           -1.49          1.33
urine      urine_metab_5           cluster4   0.42       0.14           0.15           0.69
protein    IL1beta                 cluster1   -0.57      0.21           -0.98          -0.16
protein    IL1beta                 cluster2   0.02       0.22           -0.41          0.45
protein    IL1beta                 cluster3   0.22       0.28           -0.33          0.77
protein    IL1beta                 cluster4   2.12       0.24           1.65           2.59
protein    IL6                     cluster1   -0.59      0.38           -1.33          0.15
protein    IL6                     cluster2   0.98       0.55           -0.10          2.06
protein    IL6                     cluster3   0.16       0.50           -0.82          1.14
protein    IL6                     cluster4   1.92       0.19           1.55           2.29
protein    Leptin                  cluster1   -0.49      0.42           -1.31          0.33
protein    Leptin                  cluster2   0.21       0.37           -0.52          0.94
protein    Leptin                  cluster3   0.69       0.30           0.10           1.28
protein    Leptin                  cluster4   0.08       0.19           -0.29          0.45
protein    HGF                     cluster1   -0.16      0.18           -0.51          0.19
protein    HGF                     cluster2   0.99       0.40           0.21           1.77
protein    HGF                     cluster3   -0.12      0.73           -1.55          1.31
protein    HGF                     cluster4   0.72       0.19           0.35           1.09
protein    INSULIN                 cluster1   -0.24      0.27           -0.77          0.29
protein    INSULIN                 cluster2   0.31       0.36           -0.40          1.02
protein    INSULIN                 cluster3   0.03       0.40           -0.75          0.81
protein    INSULIN                 cluster4   0.94       0.14           0.67           1.21
protein    TNFalfa                 cluster1   -0.13      0.11           -0.35          0.09
protein    TNFalfa                 cluster2   0.92       0.41           0.12           1.72
protein    TNFalfa                 cluster3   -0.09      0.92           -1.89          1.71
protein    TNFalfa                 cluster4   0.50       0.26           -0.01          1.01
protein    IL8                     cluster1   -0.15      0.13           -0.40          0.10
protein    IL8                     cluster2   1.29       0.54           0.23           2.35
protein    IL8                     cluster3   -0.13      1.29           -2.66          2.40
protein    IL8                     cluster4   0.53       0.27           0.00           1.06
outcome    BMI z-score             cluster1   -1.73      0.39           -2.49          -0.97
outcome    BMI z-score             cluster2   -0.86      0.40           -1.64          -0.08
outcome    BMI z-score             cluster3   -0.43      0.43           -1.27          0.41
outcome    BMI z-score             cluster4   0.85       0.44           -0.01          1.71
covariate  h_mbmi_None             NA         0.04       0.01           0.02           0.06
covariate  e3_sex_Nonemale         NA         0.32       0.08           0.16           0.48
covariate  h_age_None              NA         0.00       0.01           -0.02          0.02
covariate  h_cohort2               NA         -0.26      0.17           -0.59          0.07
covariate  h_cohort3               NA         0.04       0.18           -0.31          0.39
covariate  h_cohort4               NA         0.53       0.13           0.28           0.78
covariate  h_cohort5               NA         0.31       0.13           0.06           0.56
covariate  h_cohort6               NA         0.39       0.13           0.14           0.64
covariate  h_edumc_None2           NA         -0.01      0.10           -0.21          0.19
covariate  h_edumc_None3           NA         0.04       0.10           -0.16          0.24

2.10 Appendix 2.E

In order to achieve variable selection for the omics variables, we modify the EM algorithm by adding penalty terms on the parameters related to the omics variables, namely the cluster-specific means $\mu_j$ and the variance-covariance structure $\Sigma_j$. For convenience, denote $W_j = \Sigma_j^{-1}$; the expected likelihood related to the omics variables is then

$$R(\mu, \Sigma; \mu^{(t)}, \Sigma^{(t)}) = \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log \phi(Z_i; \mu_j, \Sigma_j) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij}^{(t)} \log \det(W_j) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij}^{(t)} (Z_i - \mu_j)^T W_j (Z_i - \mu_j) \quad (2.40)$$

Without loss of generality, suppose the omics variables are scaled to have mean 0 and standard deviation 1. If a variable does not provide any information about the clustering, it is natural to assume it has 0 means across all latent clusters.
However, there is another scenario in which the correlation structure contributes to the clustering even if the cluster-specific means are all 0. As a result, regularization of the covariance structure is desired. We therefore modify the expected likelihood $R$ into a penalized likelihood function with shrinkage parameters on $\mu$ and $W$:

$$R_p(\mu, \Sigma; \mu^{(t)}, \Sigma^{(t)}) = R(\mu, \Sigma; \mu^{(t)}, \Sigma^{(t)}) - \lambda_1 \sum_{j=1}^{k} \sum_{l=1}^{p} |\mu_{jl}| - \lambda_2 \sum_{j=1}^{k} \sum_{l=1}^{p} \sum_{m=1}^{p} |w_{jlm}| \quad (2.41)$$

For convenience, fix $i$ and $j$ and write

$$S = (Z_i - \mu_j)^T W_j (Z_i - \mu_j) = \sum_{l=1}^{p} \sum_{m=1}^{p} w_{jlm} (Z_{il} Z_{im} - \mu_{jm} Z_{il} - \mu_{jl} Z_{im} + \mu_{jl} \mu_{jm}) \quad (2.42)$$

For $j = j'$ and $l = l'$: if $l \neq l'$ and $m \neq l'$,

$$\frac{\partial S}{\partial \mu_{j'l'}} = 0 \quad (2.43)$$

if $l = l'$ and $m \neq l'$,

$$\frac{\partial S}{\partial \mu_{j'l'}} = \sum_{m \neq l'}^{p} w_{j'l'm} (\mu_{j'm} - Z_{im}) \quad (2.44)$$

if $l \neq l'$ and $m = l'$, by symmetry,

$$\frac{\partial S}{\partial \mu_{j'l'}} = \sum_{l \neq l'}^{p} w_{j'll'} (\mu_{j'l} - Z_{il}) = \sum_{m \neq l'}^{p} w_{j'l'm} (\mu_{j'm} - Z_{im}) \quad (2.45)$$

and lastly, if $l = l'$ and $m = l'$,

$$\frac{\partial S}{\partial \mu_{j'l'}} = 2 w_{j'l'l'} (\mu_{j'l'} - Z_{il'}) \quad (2.46)$$

Adding Equations 2.43 to 2.46, we have

$$\frac{\partial S}{\partial \mu_{j'l'}} = -2 \sum_{l=1}^{p} w_{j'll'} (Z_{il} - \mu_{j'l}) \quad (2.47)$$

Thus,

$$\frac{\partial R_p(\mu; \mu^{(t)})}{\partial \mu_{j'l'}} = -\frac{1}{2} \sum_{i=1}^{n} r_{ij'}^{(t)} \frac{\partial S}{\partial \mu_{j'l'}} - \lambda_1 \mathrm{sign}(\mu_{j'l'}) = \sum_{i=1}^{n} r_{ij'}^{(t)} \left( \sum_{l=1}^{p} w_{j'll'} (Z_{il} - \mu_{j'l}) \right) - \lambda_1 \mathrm{sign}(\mu_{j'l'}) \quad (2.48)$$

2.11 Appendix 2.F

Description of the simulation scenario

We simulate 10,000 observations as the population and replicate the analysis 200 times for each analysis scenario. In each replicate, we draw 2,000 observations from the population and fit the LUCID model. For computational reasons, for each scenario we use one sample to tune the regularization penalties. We assume the data contain two clusters ($K = 2$) and 5 exposures with log(OR) = log(2). Regarding the omics variables, $Z_1$ and $Z_2$ follow $N(0.2, 1)$ for cluster 1 and $N(-0.2, 1)$ for cluster 2, respectively. $Z_3$ and $Z_4$ are informative variables such that $(Z_3, Z_4) \sim N\left((0, 0), \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}\right)$ for cluster 1 and $(Z_3, Z_4) \sim N\left((0, 0), \begin{pmatrix} 1 & -0.8 \\ -0.8 & 1 \end{pmatrix}\right)$ for cluster 2. $Z_7$ and $Z_8$ are correlated with $Z_3$, with correlations 0.7 and 0.4, respectively. $Z_9$ to $Z_{12}$ are noise variables generated from independent $N(0, 1)$.

Results

Figure 2.10: Pairwise scatter plots of the simulated omics variables. $Z_1$ and $Z_2$ are informative variables with a difference in means across the 2 clusters; $Z_3$ to $Z_8$ are informative variables with a difference in correlation across the 2 clusters.

Table 2.3: Additional simulation scenarios for variable selection based on correlation.
Each Z cell gives the probability of selecting that variable, mean (sd); "-" indicates the variable is not part of the scenario's selection set; NA marks a value not recoverable from the source.

Scenario  AUC (X)      AUC (Y)      Z1           Z2           Z3           Z4           Z5           Z6           Z7           Z8           Z9           Z10          Z11          Z12
1         0.52 (0.01)  NA           0.95 (0.21)  0.95 (0.21)  -            -            -            -            -            -            0.84 (0.36)  0.82 (0.39)  0.83 (0.38)  0.80 (0.40)
2         0.87 (0.08)  0.55 (0.02)  -            -            1 (0)        1 (0)        -            -            -            -            0.68 (0.47)  0.58 (0.49)  0.66 (0.47)  0.62 (0.49)
3         0.55 (0.04)  0.51 (0.01)  0.91 (0.28)  0.91 (0.28)  -            -            -            -            -            -            0.78 (0.41)  0.75 (0.43)  0.77 (0.42)  0.80 (0.40)
4         0.52 (0.04)  0.51 (0.01)  0.84 (0.36)  0.83 (0.38)  -            -            -            -            0.97 (0.17)  0.97 (0.17)  0.72 (0.45)  0.76 (0.45)  0.74 (0.44)  0.70 (0.46)
5         0.76 (0.17)  0.53 (0.02)  -            -            1 (0)        0.98 (0.13)  -            -            1 (0)        1 (0)        0.45 (0.50)  0.40 (0.50)  0.50 (0.50)  0.42 (0.50)
6         0.51 (0.01)  0.51 (0.01)  -            -            -            -            0.66 (0.47)  0.70 (0.46)  0.98 (0.14)  0.98 (0.15)  0.64 (0.48)  0.64 (0.48)  0.62 (0.49)  0.56 (0.50)
7         0.74 (0.18)  0.53 (0.03)  0.93 (0.25)  0.92 (0.28)  0.98 (0.10)  0.98 (0.12)  -            -            0.99 (0.10)  0.98 (0.10)  0.14 (0.35)  0.15 (0.35)  0.19 (0.39)  0.11 (0.31)
8         0.85 (0.14)  0.55 (0.02)  -            -            1 (0)        0.98 (0.13)  0.94 (0.24)  0.94 (0.23)  1 (0)        0.99 (0.05)  0.15 (0.36)  0.15 (0.36)  0.18 (0.38)  0.12 (0.33)
9         0.84 (0.14)  0.55 (0.02)  0.98 (0.13)  0.98 (0.14)  1.00 (0.05)  1.00 (0.05)  0.95 (0.22)  0.94 (0.23)  1.00 (0.05)  1.00 (0.05)  0.13 (0.34)  0.14 (0.35)  0.20 (0.40)  0.16 (0.36)

Chapter 3: LUCID incorporating missing omics data

3.1 Introduction

Fast developments in biotechnologies in recent decades have made omics data available for numerous cohort studies. For example, the Human Early-Life Exposome project (HELIX) measured molecular omics signatures including DNA methylation, whole blood transcription, metabolites and plasma proteins from 1300 children at the age of 6-11 in six European countries [59]. As a second example, the Study of Latino Adolescents at Risk of Type 2 Diabetes (SOLAR) project measured phenotypes for clinical risk factors of type 2 diabetes as well as plasma metabolomics data in a longitudinal cohort in Los Angeles, California [38]. Such cohort studies with omics data provide unprecedented opportunities to investigate the direct and indirect effects of exposures on complex disease phenotypes, and to comprehensively understand the underlying biological processes.

Despite this great potential, multi-omics data bring up the need for new integrated statistical tools. Shen et al. proposed a method called iCluster to conduct integrative clustering of multi-omics data [86, 68]. Peng et al. developed another model, the Latent Unknown Clustering model Integrating multi-omics Data (LUCID), to conduct integrative analysis using omics data together with exposures and outcomes, leveraging the underlying causal relationships in a cohort study design [72].

A major challenge in the integrative analysis of multi-omics data is missingness. In large cohort studies, it is common that some omics data are not available for all participants due to budget limits, failure to extract samples, or lack of data consent. In the HELIX study, for example, 1198 children have urinary/serum metabolomics data available, while only 941 have miRNA data. Another source of missingness is that omics measurements, such as protein expression levels, are subject to detection limits: values below the detection or quantification threshold are not detectable by current biotechnology instruments. Such data are also referred to as censored data [106]. Much attention has been paid to statistical methods for handling missing data.
Complete-case analysis is easy to implement but might not be viable if a large number of samples contain missing values [76]. For missingness due to detection limits, single-value imputation, which replaces missing values by a selected constant, is commonly used in the proteomics community; common options are dividing the global or individual minimum by 2 or by the square root of 2. However, this approach is statistically inefficient and possibly biased [19, 108]. Likelihood-based methods such as the Expectation-Maximization (EM) algorithm are another popular approach because of their ease of implementation and flexible statistical framework [56]. Zhang et al. extended the EM algorithm under the framework of the Gaussian Mixture Model (GMM) to an algorithm that simultaneously conducts clustering and imputation of missing data [118]. Scrucca et al. proposed a two-step approach to deal with missing values in GMM: their method, mclust, imputes missing values under the general location model and then conducts a clustering algorithm on the imputed dataset [85, 81].

In this chapter, under the scenario of integrative analysis of omics data, we extend the previously proposed LUCID model to handle missing values for (1) the listwise missing case (or missing rows), in which omics measurements are only available in a subset of observations, and (2) the sporadic missing case, in which omics measurements are missing completely at random. We derive the joint likelihood for LUCID allowing the omics data to be potentially missing. We propose a likelihood partition method for case (1) and an integrated imputation framework similar to that of Zhang et al. [118] for case (2), and use an EM algorithm for maximum likelihood estimation. We evaluate the performance of our approach through extensive simulation studies and demonstrate the advantages of the proposed methods over complete-case analysis and imputation methods. Finally, we apply our method to a Polygenic Risk Score (PRS), metabolomics data and Prostate Cancer (PCa) from the Multiethnic Cohort Study (MEC) [51].

3.2 Methods

3.2.1 Listwise missing pattern

To incorporate the listwise missing pattern, we propose a revised LUCID model using likelihood partition, illustrated in Figure 3.1a. The observations are divided into 2 disjoint subsets: $\{i_o\}$, such that $Z_{i_o}$ is observed, and $\{i_m\}$, such that $Z_{i_m}$ is completely missing (possibly due to budget limits, contamination of samples, etc.). The joint log likelihood for the set $\{i_o\}$ (denoted $l_o(\Theta \mid D)$) remains the same as Equation 2.4, while that of the set $\{i_m\}$ becomes

$$l_m(\Theta \mid D) = \sum_{i_m=1}^{n_m} \sum_{j=1}^{K} I(X_{i_m} = j) \left[ \log S(X_{i_m} = j \mid G_{i_m}; \beta_j) + \log \phi(Y_{i_m} \mid \delta_j, \sigma_j^2) \right] \quad (3.1)$$

Figure 3.1: Illustration of the two missing patterns in the omics data, as well as the corresponding methods to deal with the missingness. (a) Listwise missing pattern. (b) Sporadic missing pattern. White blocks are missing values in the omics data $Z$; dark blue blocks represent imputed values in $Z$.

We can easily obtain the MLE of $\Theta$ under the listwise missing pattern via a modification of the E-step of the EM algorithm discussed in Chapter 2. Equation 3.1 explicitly shows that $l_m(\Theta \mid D)$ only consists of the likelihood components related to $G$ and $Y$. This results in the corresponding responsibility for the subset $\{i_m\}$:

$$r_{i_m j}^{(t)} = E(I(X_{i_m} = j) \mid D; \Theta^{(t)}) = P(X_{i_m} = j \mid D; \Theta^{(t)}) = \frac{S(X_{i_m} = j \mid G_{i_m}; \beta^{(t)}) \, \phi(Y_{i_m} \mid X_{i_m} = j; \delta_j^{(t)}, \sigma_j^{2(t)})}{\sum_{j=1}^{K} S(X_{i_m} = j \mid G_{i_m}; \beta^{(t)}) \, \phi(Y_{i_m} \mid X_{i_m} = j; \delta_j^{(t)}, \sigma_j^{2(t)})} \quad (3.2)$$

For the subset $\{i_o\}$, $r_{i_o j}^{(t)}$ is the same as in Equation 2.5.
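A minimal sketch of this partitioned responsibility calculation is given below. The inputs are hypothetical matrices of pre-computed component densities; the real implementation works on log densities for numerical stability.

```r
# Minimal sketch: E-step responsibilities under listwise missingness.
# prior:  n x K matrix of S(X = j | G_i)  (softmax prior probabilities)
# dens_z: n x K matrix of phi(Z_i | X = j), with NA rows where Z_i is missing
# dens_y: n x K matrix of phi(Y_i | X = j)
responsibilities <- function(prior, dens_z, dens_y) {
  has_z <- stats::complete.cases(dens_z)          # subset {i_o} vs {i_m}
  num <- prior * dens_y
  num[has_z, ] <- num[has_z, ] * dens_z[has_z, ]  # Z contributes only if observed
  num / rowSums(num)                              # normalize over the K clusters
}
```

Dropping the $Z$ factor for the subset $\{i_m\}$ is exactly the simplification in Equation 3.2.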
Given the current estimate $\Theta^{(t)}$ at iteration $t$, the E-step then calculates the expectation of the log likelihood of LUCID (denoted $Q(\Theta \mid D, \Theta^{(t)})$), which is the sum of the expected likelihood for observations with omics measurements and the one for observations without omics measurements:

$$
\begin{aligned}
Q(\Theta \mid D, \Theta^{(t)}) &= El_o(\Theta \mid D, \Theta^{(t)}) + El_m(\Theta \mid D, \Theta^{(t)}) \\
&= \sum_{i_o=1}^{n_o} \sum_{j=1}^{K} r_{i_o j}^{(t)} \left[ \log S(X_{i_o} = j \mid G_{i_o}; \beta_j) + \log \phi(Z_{i_o} \mid X_{i_o} = j; \mu_j, \Sigma_j) + \log \phi(Y_{i_o} \mid X_{i_o} = j; \delta_j, \sigma_j^2) \right] \\
&\quad + \sum_{i_m=1}^{n_m} \sum_{j=1}^{K} r_{i_m j}^{(t)} \left[ \log S(X_{i_m} = j \mid G_{i_m}; \beta_j) + \log \phi(Y_{i_m} \mid X_{i_m} = j; \delta_j, \sigma_j^2) \right] \\
&= \sum_{i=1}^{n} \sum_{j=1}^{K} r_{ij}^{(t)} \left[ \log S(X_i = j \mid G_i; \beta_j) + \log \phi(Y_i \mid X_i = j; \delta_j, \sigma_j^2) \right] + \sum_{i_o=1}^{n_o} \sum_{j=1}^{K} r_{i_o j}^{(t)} \log \phi(Z_{i_o} \mid X_{i_o} = j; \mu_j, \Sigma_j)
\end{aligned} \quad (3.3)
$$

where

$$r_{ij}^{(t)} = \begin{cases} r_{i_o j}^{(t)}, & \text{if } i \in \{i_o\} \\ r_{i_m j}^{(t)}, & \text{if } i \in \{i_m\} \end{cases} \quad (3.4)$$

In the M-step, the maximization of $\beta_j^{(t+1)}$, $\delta_j^{(t+1)}$ and $\sigma_j^{(t+1)}$ remains the same as in Equation 2.7, Equation 2.10 and Equation 2.11 respectively, since the likelihood components related to those parameters involve all observations; we only need to use $r_{i_m j}^{(t)}$ in place of $r_{ij}^{(t)}$ when $i \in \{i_m\}$. By contrast, the parameters associated with the omics data, $\mu_j$ and $\Sigma_j$, are updated based only on the observations in the subset $\{i_o\}$:

$$\mu_j^{(t+1)} = \frac{\sum_{i_o}^{n_o} r_{i_o j}^{(t)} Z_{i_o}}{\sum_{i_o}^{n_o} r_{i_o j}^{(t)}} \quad (3.5)$$

$$\Sigma_j^{(t+1)} = \frac{\sum_{i_o}^{n_o} r_{i_o j}^{(t)} \left(Z_{i_o} - \mu_j^{(t+1)}\right)\left(Z_{i_o} - \mu_j^{(t+1)}\right)^T}{\sum_{i_o}^{n_o} r_{i_o j}^{(t)}} \quad (3.6)$$

As before, $\Sigma_j^{(t+1)}$ in Equation 3.6 is the general case without any geometric constraints. We use mclust to allow more flexible shapes, orientations and volumes for $\Sigma$. Depending on whether the omics data are missing or observed for observation $i$, we calculate $r_{ij}^{(t+1)}$ according to Equation 3.4 and input it as prior information for mclust at each iteration of the EM algorithm.

3.2.2 Sporadic missing pattern

We now consider the situation in which missing values occur randomly in the omics data. This situation fits the missing mechanisms of missing completely at random (MCAR) or missing at random (MAR) [56]. When the data are MCAR or at least MAR (referred to as the sporadic missing pattern hereafter), the missing mechanism is ignorable and the EM algorithm is still applicable. To deal with the sporadic missing pattern in $Z$, we modify the two-step optimization algorithm for GMM with missing data proposed by Zhang et al. [118] and integrate it into the EM algorithm for LUCID (Figure 3.1b). Suppose we have omics data $Z = \{Z_1, Z_2, \ldots, Z_i, \ldots, Z_n\}$ with a sporadic missing pattern, where $Z_{ia}$ represents the observable variables of $Z_i$ and $Z_{ib}$ represents the missing values of $Z_i$. Under the LUCID model, we define the optimization problem as

$$\arg\max_{\Theta; Z} \; l(\Theta \mid D = \{G, Z, Y\}) \quad \text{s.t. } Z_{ia} \text{ is fixed for all } i \quad (3.7)$$

We still use the EM algorithm (Algorithm 1) discussed in Chapter 2 to solve the problem in Equation 3.7 iteratively. The crux of the problem is how to maximize over $Z$ with the observable part of $Z$ held fixed. We propose an additional "Impute" step in the EM algorithm to maximize over $Z$. Specifically, after initializing the missing values in $Z$ through some imputation method, the E-step and M-step remain the same at each iteration. After updating $\Theta$ in the M-step, according to Equation 2.4, optimizing $l(\Theta \mid D = \{G, Z, Y\})$ with respect to $Z$ only involves the likelihood component $\log \phi(Z \mid \mu, \Sigma)$. Therefore, the optimization problem in Equation 3.7 is equivalent to

$$\arg\max_{Z} \sum_{i=1}^{n} \sum_{j=1}^{K} r_{ij}^{(t)} \log \phi(Z_i \mid \mu_j, \Sigma_j) \quad \text{s.t. } Z_{ia} \text{ is fixed for all } i \quad (3.8)$$

Equation 3.8 can be divided into $n$ sub-problems, each optimizing $Z_i$ with $r_{ij}^{(t)}$ fixed. For each observation $Z_i$, we re-index it into observable and missing parts $\{Z_{ia}, Z_{ib}\}$.
3.2.2 Sporadic missing pattern

We now consider the situation in which missing values occur randomly in the omics data. This situation fits the missing mechanisms of missing completely at random (MCAR) or missing at random (MAR) [56]. When the data are MCAR or at least MAR (referred to as the sporadic missing pattern hereafter), the missing mechanism is ignorable and the EM algorithm is still applicable. To deal with the sporadic missing pattern in $Z$, we modify the two-step optimization algorithm for GMMs with missing data proposed by Zhang et al. [118] and integrate it into the EM algorithm for LUCID (Figure 3.1b). Suppose we have omics data $Z=\{Z_1,Z_2,\ldots,Z_i,\ldots,Z_n\}$ with a sporadic missing pattern, where $Z_{ia}$ denotes the observed part of $Z_i$ and $Z_{ib}$ the missing part. Under the LUCID model, we define the optimization problem as

$$ \arg\max_{\Theta,Z}\; l(\Theta \mid D=\{G,Z,Y\}) \quad \text{s.t. } Z_{ia} \text{ is fixed for all } i \quad (3.7) $$

We still use the EM algorithm (Algorithm 1) discussed in Chapter 2 to solve Equation 3.7 iteratively. The crux of the problem is how to maximize over $Z$ while keeping the observed part of $Z$ fixed. We propose an additional "Impute" step in the EM algorithm to maximize over $Z$. Specifically, after initializing the missing values in $Z$ through some imputation method, the E-step and M-step remain the same at each iteration. After updating $\Theta$ in the M-step, by Equation 2.4, optimizing $l(\Theta \mid D)$ with respect to $Z$ involves only the likelihood component $\log\phi(Z \mid \mu,\Sigma)$. Therefore, the optimization problem in Equation 3.7 is equivalent to

$$ \arg\max_{Z}\; \sum_{i=1}^{n}\sum_{j=1}^{K} r_{ij}^{(t)}\log\phi(Z_i \mid \mu_j,\Sigma_j) \quad \text{s.t. } Z_{ia} \text{ is fixed for all } i \quad (3.8) $$

Equation 3.8 can be divided into $n$ sub-problems, each optimizing $Z_i$ with $r_{ij}^{(t)}$ fixed. For each observation $Z_i$, we re-index it into observed and missing parts $\{Z_{ia}, Z_{ib}\}$. We also partition the cluster-specific mean $\mu_j$ and variance-covariance matrix $\Sigma_j$ in the same way as $Z_i$:

$$ \mu_j = (\mu_{ja}, \mu_{jb}) \quad (3.9) $$

$$ \Sigma_j^{-1} = \begin{pmatrix} \Sigma_{jaa}^{-1} & \Sigma_{jab}^{-1} \\ \Sigma_{jba}^{-1} & \Sigma_{jbb}^{-1} \end{pmatrix} \quad (3.10) $$

Taking the partial derivative of Equation 3.8 with respect to $Z_{ib}$ and setting it to 0 yields the closed-form solution

$$ Z_{ib}^{(t+1)} = \left(\sum_{j=1}^{K} r_{ij}^{(t)} p_{ij}^{(t)} \Sigma_{jbb}^{-1(t)}\right)^{-1} \sum_{j=1}^{K} r_{ij}^{(t)} p_{ij}^{(t)} \left(\Sigma_{jba}^{-1(t)}\mu_{ja} + \Sigma_{jbb}^{-1(t)}\mu_{jb} - \Sigma_{jba}^{-1(t)} Z_{ia}\right) \quad (3.11) $$

where

$$ p_{ij}^{(t)} = \phi(Z_i^{(t)} \mid \mu_j^{(t)}, \Sigma_j^{(t)}) \quad (3.12) $$

Details of the derivation of Equation 3.11 can be found in [118]. Equation 3.11 implies that the missing values in $Z$ are updated at each iteration $t$ based on the observed values and the current estimates of the GMM parameters, $\mu_j^{(t)}$ and $\Sigma_j^{(t)}$. Optimizing Equation 3.8 is thus equivalent to a dynamic imputation process (referred to as the "I-step" hereafter) inside the EM algorithm for LUCID. This EM algorithm obtains the MLE of $\Theta$ and imputes the missing values in $Z$ simultaneously. As Zhang et al. point out, the model parameters $\Theta$ and the missing values in $Z$ are both updated by maximizing the expected likelihood function, which guarantees convergence of the EM algorithm [118].

[Figure 3.2: Illustration of a more general case with a combination of listwise and sporadic missing patterns.]

It is possible for omics data to exhibit listwise and sporadic missing patterns simultaneously. We combine the ideas in Sections 3.2.1 and 3.2.2 and extend LUCID to this more realistic scenario (Figure 3.2). If an observation $i$ has sporadic missing values in $Z$, we first impute the missing values $Z_{ib}$ and then treat $Z_i$ as "completely observed". Next, we apply the likelihood partition to the remaining observations with the listwise missing pattern. After calculating $\Theta^{(t)}$, we update the missing values $Z_{ib}$ using $\Theta^{(t)}$. Algorithm 2 provides the details of the EM algorithm for the combination of listwise and sporadic missingness. To initialize this modified EM algorithm with the I-step, we use the R package mix to impute the missing values in $Z$ under a general location model [45].

Algorithm 2: The EM algorithm for the LUCID model with one latent variable and missing values in the omics data $Z$
Input: multi-view data $D=\{G,Z,Y\}$, maximum number of iterations $t_{max}$, convergence tolerance $\epsilon_{max}$
1. Initialization:
2. Divide the index $i$ into 3 subsets: (1) $i_a$: individuals with completely observed $Z$; (2) $i_b$: individuals with partially observed $Z$; (3) $i_c$: individuals with completely missing $Z$.
3. Initialize $Z_{i_b}^{(0)}$ by mix
4. Initialize $\mu^{(0)}, \Sigma^{(0)}$ by mclust
5. Initialize $\beta^{(0)}$ by nnet
6. Initialize $\gamma^{(0)}$ by GLM
7. Compute the log-likelihood $l_1$ by Equation 3.3
8. $\epsilon \leftarrow \infty$; $t \leftarrow 0$
9. EM algorithm:
10. while $t < t_{max}$ and $\epsilon > \epsilon_{max}$ do
11. E-step: compute $r_{i_a j}^{(t)}$ and $r_{i_b j}^{(t)}$ based on Equation 2.5 (using the imputed values for $i_b$) and $r_{i_c j}^{(t)}$ based on Equation 3.2
12. M-step: update $\beta^{(t+1)}$ by Equation 2.7; update $\mu^{(t+1)}, \Sigma^{(t+1)}$ by mclust; update $\gamma^{(t+1)}$ by Equation 2.10
13. I-step: update $Z_{i_b}^{(t+1)}$ by Equation 3.11
14. Compute the updated log-likelihood $l_2$ using $\mu^{(t+1)}, \Sigma^{(t+1)}, \beta^{(t+1)}, \gamma^{(t+1)}$ according to Equation 2.4
15. $\epsilon \leftarrow l_2 - l_1$; $l_1 \leftarrow l_2$; $t \leftarrow t+1$
16. end while
17. Compute $r_{ij}^{(t)}$ using $\mu^{(t)}, \Sigma^{(t)}, \beta^{(t)}, \gamma^{(t)}$ based on Equation 3.3
Output: $\Theta^{(t)}=\{\mu^{(t)}, \Sigma^{(t)}, \beta^{(t)}, \gamma^{(t)}\}$, $r^{(t)}$ and $Z^{(t)}$
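For concreteness, below is a minimal sketch of the I-step update in Equation 3.11 for a single observation. The object names are hypothetical: z is the omics vector with NAs at the missing positions, r_i the length-K responsibilities, p_i the length-K densities from Equation 3.12, and mu and Sigma lists of the K cluster means and covariances.

# Minimal sketch of the I-step (Equation 3.11) for one observation
impute_istep <- function(z, r_i, p_i, mu, Sigma) {
  b <- which(is.na(z)); a <- which(!is.na(z))
  if (length(b) == 0) return(z)
  A <- matrix(0, length(b), length(b))  # accumulates sum_j w_j * Omega_bb
  rhs <- numeric(length(b))
  for (j in seq_along(r_i)) {
    Omega <- solve(Sigma[[j]])          # precision matrix, partitioned below
    w <- r_i[j] * p_i[j]
    A <- A + w * Omega[b, b, drop = FALSE]
    rhs <- rhs + w * (Omega[b, a, drop = FALSE] %*% (mu[[j]][a] - z[a]) +
                      Omega[b, b, drop = FALSE] %*% mu[[j]][b])
  }
  z[b] <- solve(A, rhs)                 # closed-form update of the missing part
  z
}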
3.3 Simulation study

3.3.1 Methods

We performed a simulation study across a range of missing ratios in the omics data ($Z$), considering (1) the listwise missing pattern and (2) the sporadic missing pattern. We compared the following analysis methods: (1) the proposed LUCID model (L); (2) LUCID analysis based on complete cases (CL); and (3) LUCID analysis with imputed data (IL), where the missing data are imputed by mean imputation [56] or by mclust [85, 45]. We generated a population of 20,000 subjects with missing data and replicated the analysis 500 times to obtain parameter estimates and cluster assignments under each simulation scenario; in each replicate, a random sample of 2,000 subjects was selected. Due to the conditional independence implied by Figure 2.1, we first generated exposures $G$; then, conditional on $G$, cluster labels $X$ were assigned to each subject; finally, omics data $Z$ and outcome $Y$ were generated given $X$. We examined several metrics to compare the performance of the different analysis methods, including mean parameter estimates and the corresponding standard deviations over the 500 replicates. We also evaluated the accuracy of latent cluster assignment using the area under the curve (AUC), comparing the estimated posterior inclusion probabilities (PIPs) to the true simulated cluster labels. The analyses were carried out using the R packages LUCIDus [119], mclust [85] and missMethods [78]. All simulation scenarios are described in Appendix 3.6.

3.3.2 Results

Figure 3.3 shows the simulation results for listwise missingness across increasing missing ratios in the omics data. For the exposure effect (the association between exposure and latent cluster), as the missing ratio in the omics data increases, the average parameter estimates of L and CL deviate slightly from the true effect (indicated by the red dashed line), while the average parameter estimates of IL are drastically biased towards 0, especially when the missing ratio is larger than 0.5. Regarding the variance of estimation, L produces smaller standard deviations (SDs) across replicates compared to CL (Figure 3.3a). Regarding the omics effect (the association between latent cluster and the omics data), L produces unbiased estimates even when the missing ratio is extremely high (>0.7), and it also produces smaller SDs compared to CL and IL (Figure 3.3b). Similar trends are observed when estimating the outcome effect (the association between latent cluster and the outcome): both L and IL produce unbiased estimates, although the SDs of L are smaller than those of IL and CL (Figure 3.3c). L also shows improved performance in discriminating clusters compared to IL and CL. As the missing ratio increases, the median AUC over all observations for L and CL decreases slightly, while the median AUC of IL drops drastically to 0.7 at a missing ratio of 0.8 (Figure 3.3d). To make comparisons on the same subjects, we also investigate AUCs separately for the subsets with and without omics data.

[Figure 3.3: Simulation results for listwise missingness in the omics data. Panels: (a) exposure effect $\beta$; (b) omics effect $\Delta\mu$; (c) outcome effect $\Delta\gamma$; (d) AUC for all observations; (e) AUC for observations with omics data; (f) AUC for observations without omics data. Each panel plots the estimates for methods CL, IL and L against the missing ratio (0 to 0.8).]
Figure 3.3e shows that the L approach always yields a higher median AUC for subjects with completely measured omics data compared to CL and IL. The variances of the estimated AUC for L are also smaller than those of CL and IL. Notably, for subjects without measured omics data, the proposed L approach still provides acceptable estimates of the latent cluster: although the median AUC of L for subjects with missing omics data is lower than that for subjects with measured omics data, L outperforms IL in terms of both the median and the variance of the AUCs (Figure 3.3f).

The simulation results for sporadic missingness are presented in Figure 3.4. We compare the proposed LUCID model with integrated imputation (L) to two imputation methods: imputation by the mean (IL-mean) and imputation based on a general location model implemented by mclust (IL-mclust). For the exposure effect (Figure 3.4a), as the missing ratio increases, the average estimate across replicates for L and IL-mclust decreases slightly while the variance of the estimates increases slightly. The mean estimate of IL-mean remains close to the simulation truth and then drops suddenly when the missing ratio reaches 0.35; IL-mean also produces estimates with larger variances compared to L and IL-mclust. A similar pattern is observed for the outcome effect (Figure 3.4d). The advantage of using L to estimate the omics effect is shown in Figure 3.4c: despite the increasing missing ratio in the omics data, the mean estimate of the omics effect remains close to the simulation truth, although the variance of the estimates increases moderately. By comparison, the mean estimate of IL-mclust deviates from the true effect when the missing ratio reaches 0.25, and the mean estimate of IL-mean is highly biased and drops to 0 when the missing ratio reaches 0.4. Since the missingness occurs in the omics data, estimation of the omics effect is sensitive to the missing ratio. L performs better here because it incorporates additional information from the exposure and the outcome to impute the missing omics data; by contrast, IL-mean and IL-mclust only use the information in the omics data, and missingness in the omics data biases their estimates toward the null.

[Figure 3.4: Simulation results for sporadic missingness in the omics data. Panels: (a, b) exposure effect $\beta$; (c) omics effect $\Delta\mu$; (d) outcome effect $\Delta\gamma$; (e) AUC for all observations. Each panel plots the estimates for methods IL-mclust (impute-mclust), IL-mean (impute-mean) and L against the missing ratio (0 to 0.4).]

In terms of the AUC for the estimated clusters, we observe a decreasing trend in the median AUC similar to that of the parameter estimates, as a result of biased estimation (Figure 3.4e). L has the best performance in discriminating clusters among the three methods, since its estimate of the omics effect remains close to the true effect despite the increasing missing ratio. When the missing ratio reaches 0.35, IL-mean fails to distinguish the clustering structure (its AUC falls almost to 0.5), since its estimates of the exposure, omics and outcome effects are all highly biased towards the null.
3.4 Data example

3.4.1 Methods

We applied the LUCID model to data from the Multi-Ethnic Cohort Study (MEC). MEC is a multi-center consortium consisting of participants primarily from five ethnic groups (Caucasians, Japanese Americans, Native Hawaiians, African Americans (AA) and Latinos). In our analysis, we included 4,346 AA males with polygenic risk scores (PRS); 1,567 of them were diagnosed with prostate cancer (PCa), and 672 of them had untargeted pre-diagnostic serum metabolomics data available. The serum metabolomic profiling was performed at Metabolon, Inc. (Durham, NC) using Ultrahigh Performance Liquid Chromatography-Tandem Mass Spectroscopy (UPLC-MS/MS) and resulted in the identification of 1,433 metabolites. Metabolomics data were also generated for an additional 23 duplicate quality control samples. The median metabolite correlation in the 23 duplicate samples was r = 93.2% (interquartile range: 80.4% to 98.2%). Metabolites with a correlation r < 40%, calculated between duplicate samples processed either within or between batches, were excluded from the analyses (n = 176 metabolites). Metabolites with an interquartile range of zero (i.e., those with low or no variability; n = 38 metabolites) and those missing in > 80% of samples (n = 161 metabolites) were also excluded, bringing the total number of excluded metabolites to 335. No samples were excluded due to high missing rates. Samples from men who were related to each other were excluded. To correct for systematic error in the metabolites, each metabolite level was standardized and adjusted for the first 3 principal components (PCs), age and batch run. A "meet in the middle" approach was applied to screen metabolites relevant to PRS and PCa [11]. A grid search was used to determine the number of clusters in the sample. The supervised LUCID model (discussed in Chapter 2) was applied, and the risk profiles were constructed using the cluster-specific means of the selected metabolites estimated by LUCID.

3.4.2 Results

The supervised LUCID model discovers two latent clusters in the data. Latent cluster 2 is associated with a high risk of PCa (log OR = 8.27). The histogram at the bottom of Figure 3.5 presents the PIPs for cluster 2. A small group (approximately 1%) of subjects have PIPs equal to 1 and are characterized by high expression levels of metabolites from biological pathways including amino acids, peptides and xenobiotics. The bar on the left consists of subjects without PCa; the majority of subjects with measured metabolomics data fall into this bar, and no obvious metabolomic pattern is observed. The bars in the middle show that increasing PRS quartile is associated with latent cluster 2 (OR(PRS second quartile) = 1.44, OR(PRS third quartile) = 3.72 and OR(PRS fourth quartile) = 7.52), which is ultimately associated with a high risk of PCa. In addition, "risk profiles" (signatures of the metabolomic data) are constructed for observations without measured metabolomics data, as a weighted average of the risk profiles for latent clusters 1 and 2, where the weights are the PIPs estimated by LUCID. The estimated risk profiles of observations with PCa and PRS in the third quartile are particularly similar to that of cluster 2.

[Figure 3.5: "Risk" profiles for individuals with and without measured metabolomics data.]

3.5 Discussion

In this chapter, the proposed LUCID model elegantly incorporates listwise and sporadic missing patterns in the omics data.
As demonstrated in the simulation study, when listwise missingness occurs in the omics data, LUCID produces less biased estimates compared to traditional imputation methods and smaller standard deviations compared to complete-case analysis. For sporadic missing data, LUCID produces less biased estimates of the omics effect compared to imputation by the mean or by mclust, and it also produces satisfactory clustering assignments. However, it is worth pointing out that the integrated imputation within LUCID does not perform well when the missing ratio is greater than 0.5.

MCAR is the fundamental assumption when formulating the imputation problem based on LUCID. There are other missing mechanisms commonly used to construct imputation models for missing values in omics data. Detection limit theory assumes that the missing values are below or above the detection limit of the assay and is widely adopted in the bioinformatics community [114]. One prevailing approach to deal with missing values in quantitative omics data is to replace the missing values with the detection limit; however, this approach is statistically inefficient and possibly biased [69]. Lin et al. proposed a general framework for integrative analysis of incomplete multi-omics data through generalized linear regression models by assuming missing values are due to the detection limit [55]. A future direction for LUCID is to incorporate the detection limit mechanism for missing values. A possible solution is to use truncated normal distributions to model the omics data: a left-truncated model for values below the detection limit and a right-truncated model for values above it.

3.6 Appendix 3.A

Table 3.1: Simulation settings for listwise missingness in the omics data. All parameters are held constant across missing ratios 0, 0.1, 0.2, ..., 0.8.

Component   | Setting
Exposure G  | 5 causal variables, 5 noise variables, OR_G = 2.0
Omics Z     | 2 causal variables, 2 noise variables, Delta mu_Z = 0.6
Outcome Y   | Delta mu_Y = 0.4

Table 3.2: Simulation settings for sporadic missingness in the omics data. All parameters are held constant across missing ratios 0, 0.15, 0.2, 0.25, 0.3, 0.35 and 0.4.

Component   | Setting
Exposure G  | 1 causal variable, 9 noise variables, OR_G = 2.0
Omics Z     | 4 causal variables, 0 noise variables, Delta mu_Z = 1.6
Outcome Y   | Delta mu_Y = 0.4

Chapter 4
LUCIDus: An R Package to Implement the LUCID Model

(This chapter is a modified version of the paper: Zhao, Y., Goodrich, J.A., Conti, D.V., LUCIDus: An R Package For Estimating Latent Unknown Clusters By Integrating Multi-omics Data (LUCID) With Phenotypic Traits, submitted to the R Journal.)

4.1 Introduction

In the previous chapters, we introduced the LUCID model for integrated clustering analysis of multi-view data. LUCID jointly estimates latent unknown clusters in multi-omics data and relates the latent clusters to an outcome of interest. LUCIDus, the R package implementing the LUCID model, provides an integrated clustering framework to researchers and has been downloaded around 19,000 times since it was first introduced (according to dlstats [115]). It has also been applied in several environmental epidemiological studies [44, 93, 61]. In this chapter, we introduce LUCIDus version 2, a major update and enhancement of the original release. LUCIDus version 2 is more robust and features more powerful model selection, model visualization, and inference based on bootstrap resampling.
It also incorporates methods to deal with missingness in omics data. The chapter is organized as follows: we first go through the functions in the LUCIDus package in detail; we then illustrate the workflow of fitting LUCID models using a dataset simulated based on real cases from the HELIX study [104, 60].

Table 4.1: Functions in the LUCIDus package. lucid() calls est_lucid() and tune_lucid() in the backend. The two workhorse functions are not normally called directly, but they can be useful when a user wants to examine the model fitting process in more detail.

Main function   | Description
lucid()         | Main function to fit LUCID models, specified by supplying the integrated data and a distribution for the outcome. It also conducts model selection and variable selection.
summary_lucid() | Creates tables summarizing a LUCID model.
plot_lucid()    | Visualizes LUCID models through a Sankey diagram.
boot_lucid()    | Derives confidence intervals based on bootstrap resampling.
predict_lucid() | Predicts latent cluster assignment and the outcome using integrated data.

Workhorse function | Description
est_lucid()        | Fits a LUCID model to estimate latent clusters from integrated data.
tune_lucid()       | Conducts model selection and variable selection. It fits a series of LUCID models over different combinations of tuning parameters and determines the optimal model with the minimum BIC.

4.2 General workflow of LUCIDus

The LUCIDus package includes five main functions and two auxiliary functions to implement the analysis framework based on LUCID. Brief descriptions of each function are listed in Table 4.1, and the workflow of the package is shown in Figure 4.1. Below, we describe the typical workflow of analyzing integrated data using the LUCID model.

The function lucid() is the primary function in the package. It fits a LUCID model based on an exposure matrix (argument G), omics data (argument Z), outcome data (argument Y), the number of latent clusters (K; default is 2), and the family of the outcome (argument family; default is normal). If a vector of K and/or L1 penalties is supplied, lucid() automatically conducts model selection on the number of clusters K, selects informative variables in G, Z, or both, and returns a LUCID model (an R object of class lucid) with the optimal K and the selected variables in G and/or Z. Several additional functions can then be applied to a fitted LUCID object. summary_lucid() summarizes the fitted LUCID model by producing summary tables of the parameter estimates and their interpretation. Visualization is performed via the plot_lucid() function, which creates a Sankey diagram showing the interplay among the three components (G, Z and Y).

[Figure 4.1: The workflow of the LUCIDus package. Dark blue nodes represent input data, light blue nodes represent output results. Green nodes and dashed arrows are optional steps for model estimation. Red text corresponds to the 5 key functions in LUCIDus.]

In addition, statistical inference for LUCID can be accomplished by constructing confidence intervals (CIs) based on bootstrap resampling; this is achieved by the function boot_lucid(). Finally, predictions of the cluster assignment and the outcome can be obtained by calling the function predict_lucid(). In practice, it might not be necessary to implement the entire workflow above. For instance, if we have prior knowledge of the number of latent clusters K, model selection for the number of clusters can be skipped.
If a given dataset has a limited number of variables (for example, variables selected based on biological annotations), then variable selection may not be necessary.

4.3 Illustration

4.3.1 Fitting a LUCID model with lucid()

To illustrate integrative clustering with LUCID, we use simulated data based on real cases from the HELIX study released by ISGlobal (available at https://github.com/isglobal-exposomeHub/ExposomeDataChallenge2021) [104, 60]. A subset of the data from the original release is included in the LUCIDus package to replicate the workflow presented here. The dataset is a list of data frames containing data on 100 children, with 8 variables measuring child and maternal exposure to chemical pollutants (referred to as the exposome), 10 proteins measured in children (referred to as the proteomics), child body mass index (BMI) z-score (a continuous outcome), child BMI category (a binary outcome) and 3 covariates: mother's BMI, child's sex, and maternal age at birth. To organize the data into separate data frames for each data type and to better illustrate functionality within LUCIDus, we use the following code:

library(LUCIDus)
# load data
data("helix_data")
exposome <- helix_data$exposure
proteomics <- helix_data$omics
zBMI <- helix_data$outcome["zBMI"]
cat_BMI <- helix_data$outcome["cat_BMI"]
covariate <- model.matrix(~., helix_data$covariate)[, -1]

Our goal is to conduct an integrated clustering of the exposome and proteome and to relate the estimated latent clusters to childhood BMI (measured by both the BMI z-score and the BMI category defined by the WHO). The LUCID model is fitted using the function lucid(). The input data are specified by the arguments G, Z and Y, corresponding to the exposome, the proteome, and BMI, respectively. In practice, scaling of the omics data is highly recommended to obtain more stable estimates; the proteomics data used in this example are already scaled. If multiple omics datasets are included in the LUCID analysis as the omics intermediate Z, the user can concatenate them into a single data matrix and input the concatenated matrix as Z. For illustrative purposes, we assume that the optimal number of latent clusters in the example data is 2. (Determining the optimal number of clusters is a major question prior to performing any clustering analysis; more details are discussed later in the section on model selection and variable selection.)

For illustration, we fit two LUCID models, with continuous BMI and binary BMI, respectively. The argument family specifies the outcome type. The default setting is normal, corresponding to a continuous outcome; for a binary outcome, family should be specified as binary. lucid() returns an object of class lucid containing the MLE of the LUCID model as well as the PIPs.

# fit LUCID model with a continuous outcome
fit1 <- lucid(G = exposome, Z = proteomics, Y = zBMI, family = "normal", K = 2)
# MLE of the LUCID model
# fit1$pars
# PIPs of the sample
# fit1$post.p
# fit LUCID model with a binary outcome
fit2 <- lucid(G = exposome, Z = proteomics, Y = cat_BMI, family = "binary", K = 2)

By default, lucid() calls Mclust() to choose the optimal mixture model for the omics data Z. Alternatively, the user can specify the argument modelName to choose a particular mixture model. The available mixture models are listed in mclust::mclustModelNames [85].
# fit LUCID model with spherical shape, equal volume
fit3 <- lucid(G = exposome, Z = proteomics, Y = zBMI, family = "normal", K = 2, modelName = "EII")
# fit LUCID model with ellipsoidal shape, varying volume, shape, and orientation
fit4 <- lucid(G = exposome, Z = proteomics, Y = zBMI, family = "normal", K = 2, modelName = "VVV")

Supervised LUCID versus unsupervised LUCID

In Equation 2.5, the latent clusters X are jointly defined by the genomic/environmental exposures G, the other omics data Z and the outcome Y. Because the outcome is explicitly incorporated in the likelihood function, the estimation process based on Equation 2.5 is similar to a supervised learning process. LUCIDus also allows for an unsupervised version of LUCID, in which the latent clusters are estimated only by G and Z, as discussed in Chapter 1. Estimates for unsupervised LUCID are also obtained by the EM algorithm discussed previously. The argument useY in lucid() is a flag indicating a supervised or unsupervised LUCID model. By default, useY = TRUE and lucid() fits the supervised LUCID model; to fit an unsupervised LUCID model, the user should set useY = FALSE.

# fit an unsupervised LUCID model
fit5 <- lucid(G = exposome, Z = proteomics, Y = zBMI, K = 2, useY = FALSE)

The choice between a supervised and an unsupervised LUCID model depends on both the study design and the research question. A supervised LUCID analysis may be appropriate, for example, if (1) the cluster structure is believed to be defined jointly by G, Z and Y (for example, data collected under a cross-sectional design), or (2) the goal is to build a predictive model with data from one cohort and then apply it directly to another cohort. If the aim is to obtain an unbiased estimate of the association between the latent cluster X and the outcome Y, an unsupervised LUCID model is more appropriate.

Adjusting for covariates

LUCIDus also allows for the adjustment of covariates. According to the DAG in Figure 2.1, covariates may be included for the association between the exposure and the latent cluster (referred to as G-to-X covariates) or for the association between the latent cluster and the outcome (referred to as X-to-Y covariates). Covariates in the G-to-X relationship act more like predictors of the latent cluster, while covariates in the X-to-Y relationship may be interpreted more in the context of adjustment for potential confounding. A variable can serve as both a G-to-X covariate and an X-to-Y covariate. In the lucid() function, G-to-X covariates are specified by the argument CoG and X-to-Y covariates by the argument CoY.

# include covariates as G to X covariates
fit6 <- lucid(G = exposome, Z = proteomics, Y = zBMI, K = 2, family = "normal", CoG = covariate)
# include covariates as X to Y covariates
fit7 <- lucid(G = exposome, Z = proteomics, Y = cat_BMI, K = 2, family = "binary", CoY = covariate)

4.3.2 Interpreting LUCID

Since LUCID is an integrated analysis framework consisting of clustering, regression, and multiple data components, we provide two utility functions, summary_lucid() and plot_lucid(), to summarize LUCID results and facilitate interpretation.

Summarizing LUCID in tables with summary_lucid()

A direct call of summary_lucid() on a lucid object returned by lucid() prints three tables corresponding to the associations among G, Z and Y. We take the LUCID model fitted with continuous BMI as an example.
> summary_lucid(fit1)
----------Summary of the LUCID model----------

K = 2 , log likelihood = -1071.893 , BIC = 3216.791

(1) Y (continuous outcome): mean of Y for each latent cluster (and effect of covariates if included)
               beta
cluster1 0.08771338
cluster2 0.86691317

(2) Z: mean of omics data for each latent cluster
        mu_cluster1 mu_cluster2
IL1beta -0.21282280  0.39620999
IL6     -0.25313089  0.47125114
INSULIN -0.31598417  0.58826444
IFNalfa  0.15370117 -0.28614387
IL1RA   -0.05274367  0.09819234
IL2      0.17884736 -0.33295828
IP10    -0.14934740  0.27803850
IL2R     0.13363312 -0.24878339
MIG      0.11645372 -0.21680069
IL4      0.15979421 -0.29748722

(3) E: odds ratio of being assigned to each latent cluster for each exposure
                       beta        OR
DDE_c.cluster2 -1.787901798 0.1673109
DDE_m.cluster2  0.367653805 1.4443419
DDT_c.cluster2  0.211258288 1.2352314
DDT_m.cluster2  0.004310885 1.0043202
HCB_c.cluster2  1.847153233 6.3417403
HCB_m.cluster2  0.174522198 1.1906772
PCB_c.cluster2 -0.751327088 0.4717401
PCB_m.cluster2 -0.027843969 0.9725401

The first table summarizes the association between the latent cluster and the outcome (BMI): the estimated mean BMI z-score is 0.09 for cluster 1 and 0.87 for cluster 2. In the case of a binary outcome, the coefficient in the first table is interpreted as the log odds for that specific cluster versus the reference cluster. The second table characterizes each cluster by its omics signature, the cluster-specific mean of each variable in the omics data. For instance, latent cluster 1 is characterized by low levels of IL1-beta ($\mu_1(\text{IL1beta})=-0.21$) and insulin ($\mu_1(\text{insulin})=-0.32$), while cluster 2 features high levels of IL1-beta and insulin ($\mu_2(\text{IL1beta})=0.40$ and $\mu_2(\text{insulin})=0.59$). The third table relates the exposures to the latent clusters. For instance, HCB_c represents hexachlorobenzene (HCB) concentration in the child on the log2 scale. The estimated OR for HCB_c is 6.34, meaning that for each doubling of HCB concentration in child plasma, the odds of being assigned to latent cluster 2 are multiplied by 6.34. Since latent cluster 2 is associated with a higher BMI z-score, these results indicate that exposure to HCB is associated with overweight/obesity.

[Figure 4.2: An example of using a Sankey diagram to visualize a LUCID model. The dark grey nodes represent exposures, the brown node represents the outcome, the blue nodes are omics data and the orange nodes are latent clusters. The width of links and nodes corresponds to effect size. Light colored links represent negative associations while dark colored links indicate positive associations.]

Visualizing LUCID with plot_lucid()

Visualization is another imperative way to interpret statistical models. Here we use a Sankey diagram [83] to visualize the relationships among the different components of LUCID. plot_lucid() takes a fitted LUCID model as input and creates a Sankey diagram in html format. It also accepts a user-defined color palette.

# visualize LUCID model via a Sankey diagram
plot_lucid(fit1)
# change node color
plot_lucid(fit1, G_color = "yellow")
# change link color
plot_lucid(fit1, pos_link_color = "red", neg_link_color = "green")

Figure 4.2 shows an example of a Sankey diagram for the example data. Each node represents a component of the LUCID model, and different components are labeled by user-defined colors. Each link represents the statistical association between two nodes.
The width of a link reflects the effect size of the association, and the color of a link indicates its direction: by default, dark blue represents positive associations while light blue indicates negative associations.

4.3.3 Model selection and variable selection

To determine the number of latent clusters K, LUCID implements model selection by fitting a series of models, each with a different value of K, and then using the Bayesian Information Criterion (BIC) to choose the optimal model [28], namely the one with the lowest BIC. The BIC for LUCID is defined as

$$ \mathrm{BIC} = -2\log L(\hat{\Theta}) + D\log N \quad (4.1) $$

where $\hat{\Theta}$ is the maximum likelihood estimate and $D$ is the number of parameters in LUCID. Specifically, $D = P(K-1) + KM + KM(M+1)/2 + n_Y$, where $n_Y$ is the number of parameters that depends on the type of outcome $Y$: if $Y$ is continuous, $n_Y = 2K$; if $Y$ is binary, $n_Y = K$. If a vector of K is passed to the function lucid(), model selection is performed automatically and the model with the optimal K is returned.

> fit8 <- lucid(G = exposome, Z = proteomics, Y = zBMI, K = 2:6)
Fitting LUCID model
> # check the optimal K
> fit8$K
[1] 2

Separately, users can use the auxiliary function tune_lucid() to examine model selection in more detail. This function returns the optimal model as well as a table recording the tuning process. Below is an example of tuning K and visualizing the tuning process with tune_lucid().

> # look into the tuning process in more detail
> tune_K <- tune_lucid(G = exposome, Z = proteomics, Y = zBMI, K = 2:6)
> fit9 <- tune_K$best_model
> ggplot(data = tune_K$tune_list, aes(x = K, y = BIC, label = round(BIC, 0))) +
+   geom_point() +
+   geom_line() +
+   geom_text(hjust = -0.2)

[Figure 4.3: Choosing the optimal number of latent clusters K based on BIC.]

Figure 4.3 shows the tuning process, with K = 2 resulting in the lowest BIC.

Including additional or redundant variables in clustering models may increase model complexity, impair prediction accuracy, and increase computational time [29]. LUCIDus performs variable selection for both the genetic/environmental exposures $G$ and the other omics data $Z$ via $L_1$-norm penalty terms. For variable selection in $G$, we apply LASSO regression to obtain sparse solutions for $\beta$:

$$ \beta^{(t+1)}_{\mathrm{LASSO}} = \arg\max_{\beta} \sum_{i=1}^{N}\sum_{j=1}^{K} r_{ij}^{(t)}\log S(X_i=j \mid G_i;\beta_j) - \lambda_\beta \sum_{j=1}^{K}\sum_{l=1}^{P} |\beta_{jl}| \quad (4.2) $$

To obtain sparse estimates of $\mu$ and $\Sigma$, we implement the penalized model-based method [120]. $\mu$ is first updated by maximizing

$$ \mu_j^{(t+1)} = \arg\max_{\mu} \sum_{i=1}^{N}\sum_{j=1}^{K} r_{ij}^{(t)}\log\phi(Z_i \mid \mu_j,\Sigma_j) - \lambda_\mu \sum_{j=1}^{K}\sum_{l=1}^{M} |\mu_{jl}| \quad (4.3) $$

Denote $W = \Sigma^{-1}$. $\Sigma$ is then updated by maximizing its inverse penalized by the $L_1$ norm:

$$ W_j^{(t+1)} = \arg\max_{W} \sum_{i=1}^{N}\sum_{j=1}^{K} r_{ij}^{(t)}\left[\log\det W_j - \mathrm{trace}(S_j^{(t)} W_j)\right] - \lambda_W \sum_{ls} |w_{jls}| \quad (4.4) $$

where $S_j$ is the empirical covariance matrix at each iteration, defined as

$$ S_j^{(t)} = \frac{\sum_{i=1}^{N} r_{ij}^{(t)} \left(Z_i-\mu_j^{(t)}\right)\left(Z_i-\mu_j^{(t)}\right)^T}{\sum_{i=1}^{N} r_{ij}^{(t)}} \quad (4.5) $$

To choose the optimal combination of the three $L_1$-norm penalties, we implement a grid search using a modified BIC [71], defined as

$$ \mathrm{BIC}_p = -2\log L(\hat{\Theta}) + (D - D_G - D_Z)\log N \quad (4.6) $$

where $D_G$ is the number of exposure variables whose effect estimates are 0 across all latent clusters and $D_Z$ is the number of omics variables whose effect estimates are 0 across all latent clusters. We can tune $\lambda_\beta$, $\lambda_\mu$ and $\lambda_W$ either separately or jointly. For instance, if only the exposome data $G$ are high-dimensional, we only need to tune $\lambda_\beta$; if variable selection is desired for both $G$ and $Z$, the three $L_1$-norm penalties should be tuned simultaneously. $\lambda_\beta$, $\lambda_\mu$ and $\lambda_W$ correspond to the arguments Rho_G, Rho_Z_Mu and Rho_Z_Cov in the lucid() function, each of which accepts a numeric vector or a scalar as input. For guidance, empirical experiments using normalized omics data suggest integer values in the range 0-100 for Rho_Z_Mu and values in the range 0-1 for Rho_G and Rho_Z_Cov.
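To make the parameter bookkeeping in Equation 4.1 concrete, below is a minimal sketch that computes the LUCID BIC by hand; the log-likelihood value, sample size and data dimensions are assumed to be supplied by the user (this is an illustration of the formula, not a function exported by LUCIDus).

# Minimal sketch of Equation 4.1. Assumed inputs: loglik is the maximized
# log-likelihood, N the sample size, P the number of exposures, M the number
# of omics variables, K the number of latent clusters.
lucid_bic <- function(loglik, N, P, M, K, family = c("normal", "binary")) {
  family <- match.arg(family)
  n_Y <- if (family == "normal") 2 * K else K        # outcome parameters
  D <- P * (K - 1) + K * M + K * M * (M + 1) / 2 + n_Y
  -2 * loglik + D * log(N)
}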
Below are examples of conducting variable selection in G and Z, separately and jointly.

> # variable selection for G
> set.seed(1008)
> # add more noise to the exposome
> noise <- matrix(rnorm(100 * 50), nrow = 100)
> exposome_noise <- cbind(exposome, noise)
> fit10 <- lucid(G = exposome_noise, Z = proteomics, Y = zBMI, K = 2,
+               Rho_G = seq(0, 0.05, by = 1e-3))
Fitting LUCID model

11/58 exposures are selected

> # summary of optimal lucid model
> # summary_lucid(fit10)
>
> # variable selection for Z
> # add more noise to the proteomics
> proteomics_noise <- cbind(proteomics, noise)
> fit11 <- lucid(G = exposome, Z = proteomics_noise, Y = zBMI, K = 2,
+               Rho_Z_Mu = 1:10, Rho_Z_Cov = 0.1)
Fitting LUCID model

13/60 omics variables are selected

> # summary of optimal lucid model
> # summary_lucid(fit11)
>
> # variable selection for G and Z jointly
> fit12 <- lucid(G = exposome_noise, Z = proteomics_noise, Y = zBMI, K = 2,
+               Rho_G = 0.01, Rho_Z_Mu = 5, Rho_Z_Cov = 0.5, seed = 123)
Fitting LUCID model

33/58 exposures are selected

27/60 omics variables are selected

4.3.4 Deriving confidence intervals

We use the nonparametric bootstrap to derive confidence intervals (CIs) for the MLE of a LUCID model [23]. The nonparametric bootstrap draws observations from the sample with replacement. We index the bootstrap samples by $b = 1, 2, \ldots, B$. For each bootstrap sample $D_b$, we fit the LUCID model and calculate the MLE $\Theta_b$. Two types of bootstrap CIs are constructed based on the distribution of $\Theta_b$: (1) normal CIs and (2) percentile CIs. Given the confidence level $1-\alpha$, the normal CIs are constructed as

$$ (1-\alpha)\ \mathrm{CI}_{\mathrm{normal}} = \Theta - \mathrm{bias} \pm z_{1-\alpha/2}\,\sigma(\Theta) \quad (4.7) $$

where $\mathrm{bias} = \bar{\Theta} - \Theta$, $\bar{\Theta}$ is the mean of the bootstrap estimates, and $\sigma(\Theta)$ is the bootstrap standard error,

$$ \sigma(\Theta) = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\Theta_b - \bar{\Theta}\right)^2} \quad (4.8) $$

All calculations in Equations 4.7 and 4.8 are elementwise. In addition, the $1-\alpha$ percentile CIs are calculated by ordering the bootstrap estimates from smallest to largest and selecting the estimates at the $\alpha/2$ and $1-\alpha/2$ percentiles as the CI limits.

boot_lucid() constructs the two bootstrap CIs described above for inference. Users can specify the confidence level (conf) and the number of bootstrap replicates (R). While boot_lucid() is running, a progress bar is displayed in the R console. We recommend R >= 200 to estimate the bootstrap standard error for normal CIs and R >= 800 to estimate the quantiles for percentile CIs. Below are examples of deriving bootstrap CIs for LUCID; by default, boot_lucid() calculates 95% CIs.

> # bootstrap to obtain 95% CI (by default) for LUCID
> set.seed(123)
> boot1 <- boot_lucid(G = exposome, Z = proteomics, Y = zBMI, model = fit1,
+                     R = 200)
> # 90% CIs
> boot2 <- boot_lucid(G = exposome, Z = proteomics, Y = zBMI, model = fit1,
+                     R = 200, conf = 0.9)
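For readers who want to see the mechanics of Equations 4.7 and 4.8, below is a minimal sketch computing both CI types from a matrix of bootstrap estimates; the object names (theta_hat, theta_b) are hypothetical and this is not the internal implementation of boot_lucid().

# Minimal sketch of Equations 4.7-4.8. Assumed inputs: theta_hat is the
# length-P vector of estimates from the observed data; theta_b is a B x P
# matrix of bootstrap estimates, one row per bootstrap replicate.
boot_ci <- function(theta_hat, theta_b, conf = 0.95) {
  alpha <- 1 - conf
  bias <- colMeans(theta_b) - theta_hat       # elementwise bootstrap bias
  se <- apply(theta_b, 2, sd)                 # bootstrap standard error
  z <- qnorm(1 - alpha / 2)
  list(
    normal = cbind(lower = theta_hat - bias - z * se,
                   upper = theta_hat - bias + z * se),
    percentile = t(apply(theta_b, 2, quantile,
                         probs = c(alpha / 2, 1 - alpha / 2)))
  )
}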
We can obtain a more comprehensive summary table with bootstrap CIs by passing the output of boot_lucid() to summary_lucid(). The resulting summary table has 5 columns: t0 contains the parameters estimated from the observed data; norm_lower and norm_upper are the lower and upper limits of the normal CIs; and perc_lower and perc_upper are the corresponding limits of the percentile CIs.

> # summary table with 95% bootstrap CIs
> summary_lucid(fit1, boot.se = boot1)

boot_lucid() is built upon the R package boot [13]. The original output from boot is also returned by boot_lucid() and can be used to evaluate the normality of the distribution of the bootstrap estimates.

4.3.5 Prediction

Prediction with LUCID includes two parts: prediction of the latent cluster X and prediction of the outcome Y; predict_lucid() performs both tasks. The predicted latent cluster is determined from the PIPs using the optimal rule

$$ \hat{X}_i = \arg\max_j P(X_i=j \mid D,\Theta) \quad (4.9) $$

Prediction of the outcome follows the conventional regression framework. When making predictions, predict_lucid() requires new G and Z as input, with Y optional. If Y is provided, Equation 2.5 is used to calculate the PIPs; otherwise, Equation 2.27 is applied. As a result, we can fit a supervised LUCID model and then make predictions in an unsupervised fashion, or vice versa. This design allows for flexibility: for instance, we can build a LUCID model in one cohort in a supervised fashion and then make predictions in a new independent cohort, regardless of whether the outcome is measured. predict_lucid() returns a list containing the PIPs for each observation, the predicted cluster assignments, and the predicted outcome. Below is example code for prediction based on the supervised LUCID model fit1.

> # predict cluster with information on Y
> pred1 <- predict_lucid(G = exposome, Z = proteomics, Y = zBMI, model = fit1)
> # predict cluster without information on Y
> pred2 <- predict_lucid(G = exposome, Z = proteomics, model = fit1)
> # predicted cluster label
> table(pred1$pred.x)

 1  2
69 31
> # predicted outcome
> pred1$pred.y[1:5]
[1] 0.4740648 0.3078931 0.3078923 0.3080683 0.3099139

4.3.6 Incorporating missingness in omics data

A major update in LUCIDus version 2 is the incorporation of missing data. Missingness is a major challenge in the integrative analysis of omics. In large cohort studies, it is common that some omics data are not available for all participants, for various reasons such as budgetary constraints, low sample availability, or lack of consent for future use of biospecimens [102]; we refer to this as the list-wise missing pattern. In addition, even when omics data are measured for a given participant, some omics features are often missing due to the measurement process; we assume those values are missing at random and refer to this as the sporadic missing pattern. We have implemented several statistical solutions to deal with these two missing patterns, or a combination of the two; details are discussed in Chapter 3.

To illustrate this new feature, we first simulate missing values in the HELIX proteomics data by randomly setting values to NA. Both the list-wise and the sporadic missing patterns are considered, as shown in Figure 4.4.

[Figure 4.4: Two simulated missing patterns in the proteomics data.]
> library(naniar)
> set.seed(1)
> proteomics_miss_sporadic <- proteomics_miss_listwise <- as.matrix(proteomics)
> index <- arrayInd(sample(1000, 0.1 * 1000), dim(proteomics))
> proteomics_miss_sporadic[index] <- NA # sporadic missing pattern
> proteomics_miss_listwise[sample(1:100, 30), ] <- NA # listwise missing pattern
> vis_miss(as.data.frame(proteomics_miss_sporadic))
> vis_miss(as.data.frame(proteomics_miss_listwise))

Below are examples of fitting the LUCID model with missingness in the omics data. lucid() automatically examines the missing patterns present in the omics data and selects a corresponding method to account for the missing data. For sporadic missing patterns, we use the R package mclust to initialize the missing values in the omics data.

> fit13 <- lucid(G = exposome, Z = proteomics_miss_listwise, Y = zBMI, K = 2)
> # summary_lucid(fit13)
> fit14 <- lucid(G = exposome, Z = proteomics_miss_sporadic, Y = zBMI, K = 2)
> # summary_lucid(fit14)

4.4 Other applied examples

4.4.1 Integrated analysis of prenatal PFAS, the metabolome and liver injury

(This section is a part of the paper: Stratakis, N., V. Conti, D., Jin, R., Margetaki, K., Valvi, D., Siskos, A.P., Maitre, L., Garcia, E., Varo, N., Zhao, Y. and Roumeliotaki, T., 2020. Prenatal exposure to perfluoroalkyl substances associated with increased susceptibility to liver injury in children. Hepatology, 72(5), pp.1758-1770.)

Per- and polyfluoroalkyl substances (PFAS) are a group of chemicals widely used in industrial production. PFAS are chemically stable and can accumulate in human tissues over time. Animal studies have shown hepatotoxic effects, but human evidence is scarce. We used data from 1,105 mothers and their children from the HELIX study [59] and conducted a LUCID analysis to investigate the joint contribution of prenatal exposure to PFAS and serum metabolites to liver injury. Liver injury is a binary outcome, defined as having any liver enzyme concentration above the 90th percentile for the study population. The metabolite panel was created using xMWAS [99]. The results of the LUCID analysis are shown in Figure 4.5. We identified two clusters in the children, one of which is associated with a high risk of liver injury (OR = 1.56, 95% CI = (1.21, 1.92), using the other cluster as the reference). The high-risk cluster is characterized by increased PFAS levels and increased serum levels of the branched-chain amino acids (BCAAs; valine, leucine, and isoleucine), the aromatic amino acids (AAAs) tryptophan and phenylalanine, the biogenic amine acetylornithine, and the glycerophospholipids phosphatidylcholine (PC) aa C36:1 and Lyso-PC a C18:1.

[Figure 4.5: Integrated clustering analysis of prenatal PFAS, serum metabolites and liver injury.]

4.4.2 Integrated analysis of in utero exposure to mercury, inflammation biomarkers and liver injury

(This section is a part of the paper: Stratakis, N., Golden-Mason, L., Margetaki, K., Zhao, Y., Valvi, D., Garcia, E., Maitre, L., Andrusaityte, S., Basagana, X., Borràs, E. and Bustamante, M., 2021. In utero exposure to mercury is associated with increased susceptibility to liver injury and inflammation in childhood. Hepatology, 74(3), pp.1546-1559.)

Mercury (Hg) is a ubiquitous toxic pollutant associated with liver injury and fatty liver [105]. Non-alcoholic fatty liver disease (NAFLD) is the most prevalent cause of liver disease in children [2].
To identify the subgroup of children at high risk of NAFLD, we conducted an integrated clustering analysis based on maternal exposure to Hg, inflammatory cytokines and elevated levels of alanine aminotransferase (ALT). The 872 mother-child pairs were collected from the European Human Early-Life Exposome study [104]. Figure 4.6 shows the results of the integrated analysis. Two subgroups were identified in the sample. We discovered that maternal blood Hg level is associated with a high-risk group of children (OR = 1.4, 95% CI = (1.1, 2.0)). The high-risk group is characterized by elevated levels of ALT (OR = 4.9, 95% CI = (2.4, 9.9), compared to the low-risk group) and higher expression levels of inflammatory cytokines, including IL-1β, IL-6, IL-8 and TNF-α.

[Figure 4.6: Integrated clustering analysis of maternal exposure to Hg, inflammation biomarkers and elevated ALT levels.]

4.5 Conclusion

In this chapter, we have introduced version 2 of the LUCIDus package for estimating the LUCID model in R. LUCIDus focuses on integrative clustering analysis using multi-omics data, both with and without phenotypic traits, and consists of a set of tools for modeling, interpretation, visualization, inference, and prediction. Compared to version 1, LUCIDus version 2 is faster (120 times faster than the previous version for a dataset with 20,000 observations) and more stable, and it includes several new features for added functionality. The LUCIDus package is a useful tool for performing integrated clustering analysis and can potentially provide greater insight in environmental epidemiology studies with measured omics data.

Chapter 5
LUCID incorporating multiple latent variables

5.1 Introduction

The dawn of the omic era is upon us. Technological advances in high-throughput biochemical data generation have made it possible to obtain multiple omics datasets on the same individuals for various epidemiological studies, including genomics, exposomics, transcriptomics, proteomics, metabolomics, etc. Multi-omics studies allow researchers to combine information measured at multiple molecular levels and have the potential to explain how biological systems work as a whole [34]. However, analyses often model each omics dataset separately, which might not reveal the biological complexity of the phenotypes and may ignore the motivating biology [50, 17, 95]. This approach also fails to detect weak but consistent signals from multiple data sources [43]. Given the promise of multi-omics data, integrative analysis of multi-omics data has received considerable attention and has emerged as a field that facilitates a fuller understanding of the biological interplay between genetic or environmental risk factors and complex traits.

Many approaches are available in the community. mixOmics (http://mixomics.org/) is a collaborative project which provides a wide range of data-driven, multivariate approaches for integrating multi-omics data; in particular, it focuses on variable selection to identify key variables that are highly correlated within or across omics datasets and that potentially explain the underlying biology of the outcome [79]. Mediation analysis relates genetic or environmental exposures to omics data at multiple molecular levels, capturing transitional processes that ultimately influence the outcome of interest. Examples include the integration of genomics data generated from different platforms [41], iGWAS [42], HIMA [117], and the Bayesian mediation approach BAMA [90].
In practice, multi-omics clustering is often implemented to reduce the dimensionality of the data and detect disease subtypes. Comprehensive reviews of multi-omics clustering approaches are available in [66, 98, 95, 77]. As an alternative to high-dimensional pairwise mediation or clustering approaches, Latent Unknown Clustering Integrating multi omics Data (LUCID) is an integrative statistical model that jointly estimates latent clusters characterizing each omics profile and the genetic or environmental factors, while simultaneously associating these clusters with the phenotype [72]. LUCID incorporates the biological process into its conceptual model and is tailored for prospective cohort studies. In Chapter 2 and Chapter 3, we introduced the LUCID model with one latent variable. If multiple omics datasets are included, we use the early integration strategy: each omics layer is concatenated one by one, and a single data matrix is input into the LUCID model. However, researchers may be interested in modeling the correlation structure of each omics layer independently; alternatively, guided by the scientific hypothesis, multiple omics datasets may act in parallel on an outcome. In this chapter, we extend the previous LUCID model to incorporate multiple omics datasets by introducing multiple latent variables into the LUCID framework. The chapter is organized as follows: we first develop the statistical model of LUCID with multiple latent variables; we then evaluate the performance of the proposed method through comprehensive simulation studies; finally, we apply the proposed model to the HELIX dataset simulated based on real cases [104, 60].

5.2 Methods

5.2.1 EM algorithm for LUCID with multiple latent variables

Suppose we have a sample of size $n$, indexed by $i=1,2,\ldots,n$, with a collection of $m$ omics datasets, denoted by $Z_1,\ldots,Z_a,\ldots,Z_m$ with corresponding dimensions $p_1,\ldots,p_a,\ldots,p_m$. Each omics dataset $Z_a$ is summarized by a latent categorical variable $X_a$ with $k_a$ categories, each category being interpreted as a latent cluster (or subgroup) for that particular omics layer. All latent variables are linked to the exposure matrix $G$ and the outcome $Y$, as shown in Figure 5.1. We follow the same specification as in Chapter 2: $G$ is an $n \times p$ matrix with columns representing exposure variables and rows representing observations; $Y$ is an $n$-length vector representing the outcome; and $\Theta$ is the generic notation for all parameters of the LUCID model.

[Figure 5.1: Directed acyclic graph (DAG) for LUCID with multiple latent mediators. Squares represent observed data; circles correspond to the latent mediators, interpreted as latent clusters.]

Suppose the latent variables $X_{ai}$ were observed for $a=1,\ldots,m$. According to the DAG in Figure 5.1, the distributions $f(Z_{ai} \mid G)$ are conditionally independent of each other for $a=1,\ldots,m$, and the distributions of the latent variables $f(X_{ai} \mid G_i)$ are also conditionally independent of each other. Given $X_{ai}$, the conditional distribution of the outcome $f(Y_i \mid X_{1i},\ldots,X_{mi})$ is independent of the rest of the model components. As a result, for observation $i$, the joint distribution of $G_i, Z_{1i},\ldots,Z_{mi}, X_{1i},\ldots,X_{mi}, Y_i$ is written as

$$ f(Z_{1i},\ldots,Z_{mi},X_{1i},\ldots,X_{mi},Y_i \mid G_i;\Theta) = \phi(Y_i \mid X_{1i},\ldots,X_{mi}) \prod_{a=1}^{m}\left[\phi(Z_{ai} \mid X_{ai})\,S(X_{ai} \mid G_i)\right] \quad (5.1) $$

Since $X_{ai}$ is a discrete variable with $k_a$ categories, we assume $X_{ai}$ follows a multinomial distribution conditional on $G$, denoted by the softmax function $S(\cdot)$.
We assume the omics data $Z_{ai}$ follow a multivariate Gaussian distribution conditional on $X_{ai}$, denoted by $\phi(Z_{ai} \mid X_{ai}=j_a;\mu_{j_a},\Sigma_{j_a})$, where $\mu_{j_a}$ and $\Sigma_{j_a}$ are the cluster-specific mean and variance-covariance matrix for omics layer $a$. The outcome $Y$ can be either continuous or binary. For illustration, we discuss here the situation in which $Y$ is a continuous variable following a Gaussian distribution (the derivation for a binary outcome can be found in Appendix 5.6). We postulate equal variance ($\sigma^2$) of the outcome across the different combinations of the latent variables (equivalent to the homoscedasticity assumption of linear regression). The relationship between the latent variables $X_a$ and the outcome $Y$ is formulated as

$$ E[Y_i] = \delta_0 + \sum_{j_1=2}^{k_1}\delta_{j_1} I(X_{1i}=j_1) + \cdots + \sum_{j_a=2}^{k_a}\delta_{j_a} I(X_{ai}=j_a) + \cdots + \sum_{j_m=2}^{k_m}\delta_{j_m} I(X_{mi}=j_m) \quad (5.2) $$

where the intercept $\delta_0$ is the expected value of $Y_i$ given $X_{ai}=1$ for $a=1,2,\ldots,m$ (the reference cluster for all latent variables), and $\delta_{j_a}$ is the change in $Y$ if the latent variable $X_a$ takes the value $j_a$ instead of 1. This setup is exactly the same as an ordinary linear regression with categorical predictors; as a result, the MLE of $\delta_{j_a}$ can be estimated by linear regression, as we show later.

Let $D$ be the generic notation for all observed data. The log-likelihood of LUCID with multiple latent variables is constructed as

$$ \begin{aligned} l(\Theta \mid D) &= \sum_{i=1}^{n}\log f(Z_{1i},\ldots,Z_{mi},Y_i \mid G_i;\Theta) \\ &= \sum_{i=1}^{n}\log \prod_{j_1=1}^{k_1}\cdots\prod_{j_m=1}^{k_m} f(Z_{1i},\ldots,Z_{mi},X_{1i},\ldots,X_{mi},Y_i \mid G_i;\Theta)^{I(X_{1i}=j_1,\ldots,X_{mi}=j_m)} \\ &= \sum_{i=1}^{n}\sum_{j_1=1}^{k_1}\cdots\sum_{j_m=1}^{k_m} I(X_{1i}=j_1,\ldots,X_{mi}=j_m)\log f(Z_{1i},\ldots,Z_{mi},X_{1i},\ldots,X_{mi},Y_i \mid G_i;\Theta) \\ &= \sum_{i=1}^{n}\sum_{j_1=1}^{k_1}\cdots\sum_{j_m=1}^{k_m} I(X_{1i}=j_1,\ldots,X_{mi}=j_m)\log\phi(Y_i \mid X_{1i},\ldots,X_{mi};\delta,\sigma^2) \\ &\quad + \sum_{i=1}^{n}\sum_{a=1}^{m}\sum_{j_1=1}^{k_1}\cdots\sum_{j_m=1}^{k_m} I(X_{1i}=j_1,\ldots,X_{mi}=j_m)\log\phi(Z_{ai} \mid X_{ai}=j_a;\mu_{a,j_a},\Sigma_{a,j_a}) \\ &\quad + \sum_{i=1}^{n}\sum_{a=1}^{m}\sum_{j_1=1}^{k_1}\cdots\sum_{j_m=1}^{k_m} I(X_{1i}=j_1,\ldots,X_{mi}=j_m)\log S(X_{ai}=j_a \mid G_i;\beta_a) \end{aligned} \quad (5.3) $$

The log-likelihood of LUCID with multiple latent variables is similar to that with a single latent variable in Equation 2.2. It is therefore natural to follow the same principles as the EM algorithm for LUCID with a single latent variable, discussed in Chapter 2. Next, we present the details of the EM algorithm for LUCID with multiple latent variables.

E-step

We define the responsibility $r_i$ as a $k_1 \times k_2 \times \ldots \times k_m$ array. A cell of $r_i$, denoted $r_i(j_1,j_2,\ldots,j_m)$, is interpreted as the probability of observation $i$ being assigned to cluster $j_1$ for omics layer 1, cluster $j_2$ for omics layer 2, ..., and cluster $j_m$ for omics layer $m$. In the E-step of the EM algorithm, we calculate each element of the array $r_i$ as

$$ \begin{aligned} r_i(j_1,j_2,\ldots,j_m) &= E\left[I(X_{1i}=j_1,\ldots,X_{mi}=j_m)\right] = P(X_{1i}=j_1,\ldots,X_{mi}=j_m \mid D;\Theta) \\ &= \frac{\phi(Y_i \mid X_{1i}=j_1,\ldots,X_{mi}=j_m)\prod_{a=1}^{m}\left[\phi(Z_{ai} \mid X_{ai}=j_a)\,S(X_{ai}=j_a \mid G_i)\right]}{\sum_{j_1=1}^{k_1}\cdots\sum_{j_m=1}^{k_m}\phi(Y_i \mid X_{1i}=j_1,\ldots,X_{mi}=j_m)\prod_{a=1}^{m}\left[\phi(Z_{ai} \mid X_{ai}=j_a)\,S(X_{ai}=j_a \mid G_i)\right]} \end{aligned} \quad (5.4) $$

When computing the normalized quantity in Equation 5.4, statistical software such as R may encounter numerical overflow or underflow. We use the log-sum-exp trick to avoid such issues.
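Below is a minimal sketch of the log-sum-exp trick applied to Equation 5.4. It assumes the unnormalized log joint probabilities have been arranged into a hypothetical $n \times (k_1 k_2 \cdots k_m)$ matrix, one column per combination of latent clusters; each row of the result then sums to 1.

# Minimal sketch of the log-sum-exp normalization for Equation 5.4.
# logjoint: n x (k1*k2*...*km) matrix of unnormalized log joint probabilities
log_sum_exp <- function(x) {
  m <- max(x)                       # subtract the max to avoid overflow
  m + log(sum(exp(x - m)))
}
normalize_responsibility <- function(logjoint) {
  exp(logjoint - apply(logjoint, 1, log_sum_exp))
}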
To facilitate optimization during the EM algorithm, we further define the marginal responsibility

$$ r_i(j_a\,\cdot) = \sum_{j_b=1,\,b\neq a}^{k_b} r_i(j_1,\ldots,j_a,\ldots,j_m) \quad (5.5) $$

This quantity is interpreted as the marginal posterior probability of observation $i$ belonging to latent cluster $j_a$ for omics layer $a$, regardless of the rest of the omics layers. Using the marginal responsibilities, at iteration $t$ the expectation of Equation 5.3 simplifies to

$$ \begin{aligned} Q(\Theta \mid D,\Theta^{(t)}) &= E\,l(\Theta \mid D,\Theta^{(t)}) \\ &= \sum_{i=1}^{n}\sum_{j_1=1}^{k_1}\cdots\sum_{j_m=1}^{k_m} r_i(j_1,\ldots,j_m)^{(t)}\log\phi(Y_i \mid X_{1i}=j_1,\ldots,X_{mi}=j_m;\delta,\sigma^2) \\ &\quad + \sum_{a=1}^{m}\sum_{i=1}^{n}\sum_{j_a=1}^{k_a} r_i(j_a\,\cdot)^{(t)}\log\phi(Z_{ai} \mid X_{ai}=j_a;\mu_{a,j_a},\Sigma_{a,j_a}) \\ &\quad + \sum_{a=1}^{m}\sum_{i=1}^{n}\sum_{j_a=1}^{k_a} r_i(j_a\,\cdot)^{(t)}\log S(X_{ai}=j_a \mid G_i;\beta_a) \end{aligned} \quad (5.6) $$

M-step

According to Equation 5.6, the optimization of $\beta$ can be divided into $m$ separate sub-problems, indexed by $a=1,2,\ldots,m$. Each sub-problem is formulated as

$$ \beta_a^{(t+1)} = \arg\max_{\beta_a} \sum_{i=1}^{n}\sum_{j_a=1}^{k_a} r_i(j_a\,\cdot)^{(t)}\log S(X_{ai}=j_a \mid G_i;\beta_a) \quad (5.7) $$

Although this maximization does not have a closed-form solution, it can easily be solved iteratively since it is a convex optimization problem [101]. Following the definition of $r_i(j_a\,\cdot)^{(t)}$ as a marginal probability, we can treat the optimization of $\beta_a^{(t+1)}$ as a multinomial logistic regression problem: we treat $r_i(j_a\,\cdot)^{(t)}$ (for $j_a=1,2,\ldots,k_a$) as the outcome and $G_i$ as the independent variables. From a programming perspective, this can be done with the R package nnet.

Likewise, the optimization of $\mu$ and $\Sigma$ can be divided into $m$ sub-problems indexed by $a=1,2,\ldots,m$, each formulated as

$$ \arg\max_{\mu_a,\Sigma_a} \sum_{i=1}^{n}\sum_{j_a=1}^{k_a} r_i(j_a\,\cdot)^{(t)}\log\phi(Z_{ai} \mid X_{ai}=j_a;\mu_{a,j_a},\Sigma_{a,j_a}) \quad (5.8) $$

For each sub-problem, for $j_a=1,2,\ldots,k_a$,

$$ \mu_{a,j_a}^{(t+1)} = \frac{\sum_{i=1}^{n} r_i(j_a\,\cdot)^{(t)} Z_{ai}}{\sum_{i=1}^{n} r_i(j_a\,\cdot)^{(t)}} \quad (5.9) $$

$$ \Sigma_{a,j_a}^{(t+1)} = \frac{\sum_{i=1}^{n} r_i(j_a\,\cdot)^{(t)} \left(Z_{ai}-\mu_{a,j_a}^{(t+1)}\right)\left(Z_{ai}-\mu_{a,j_a}^{(t+1)}\right)^T}{\sum_{i=1}^{n} r_i(j_a\,\cdot)^{(t)}} \quad (5.10) $$

To allow for more flexible geometric features of the clusters (described in Figure 2.2), we again use Equation 2.2 to model the variance-covariance structure $\Sigma$; the optimization of $\Sigma$ under geometric constraints can be found in [31]. According to the parameterization in Equation 5.2, the parameters $\delta$ and $\sigma^2$ can be estimated by maximizing the log-likelihood of a linear regression that uses the posterior inclusion probabilities of the latent clusters to predict $Y$. Specifically, we fit the linear regression model

$$ Y_i = \delta_0 + \sum_{j_1=2}^{k_1}\delta_{j_1} r_i(j_1\,\cdot) + \cdots + \sum_{j_m=2}^{k_m}\delta_{j_m} r_i(j_m\,\cdot) + \epsilon = \delta^T r_i + \epsilon \quad (5.11) $$

where

$$ \delta^{(t+1)} = \hat{\delta}_{\mathrm{OLS}} \quad (5.12) $$

$$ \sigma^{2(t+1)} = \mathrm{Var}(\epsilon) = \mathrm{Var}(Y - \delta^T r) \quad (5.13) $$

It is worth noting that the sub-problems in Equation 5.7 and Equation 5.8 are exactly the same as the optimization problems in LUCID with a single latent variable; the only difference is that $r_{ij}$ is replaced by the marginal posterior inclusion probability. This can be explained by the DAG in Figure 5.1, in which the different omics layers enter the LUCID model in a parallel fashion and are conditionally independent of each other given the exposures.
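To illustrate the nnet-based update for Equation 5.7, below is a hedged sketch; r_a is a hypothetical $n \times k_a$ matrix of the marginal responsibilities $r_i(j_a\,\cdot)$, and the sketch relies on nnet::multinom accepting a matrix response, whose rows it treats as (possibly fractional) class counts, which yields the weighted multinomial logistic fit required here.

library(nnet)
# Hedged sketch of the M-step for beta_a (Equation 5.7).
# G: n x p exposure matrix; r_a: n x k_a marginal responsibility matrix
fit_a <- multinom(r_a ~ G, trace = FALSE)
beta_a <- coef(fit_a)  # (k_a - 1) x (p + 1) matrix; the reference cluster is omitted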
For the LUCID model with multiple latent variables, we need to pre-specify the number of latent clusters for each omic layer ($k_1,k_2,\ldots,k_m$, respectively). We recommend selecting the optimal combination of $k$ using the Bayesian Information Criterion (BIC) by fitting the model over a range of pre-specified combinations of $k$,

\[ \mathrm{BIC} = -2\,l(\Theta|D) + (n_G+n_Z+n_Y)\log n \tag{5.14} \]

where $n_G$, $n_Z$ and $n_Y$ are the numbers of parameters in $\beta$, in $\mu$ and $\Sigma$, and in $\delta$, respectively. The optimal $k$ is subsequently identified as the candidate model with the minimum BIC.

5.2.2 Label switching

To deal with the label switching problem, we extend the heuristic idea in Chapter 2. To start with, we transform the outcome effect $\delta$ into a $k_1\times k_2\times\cdots\times k_m$ array. Each cell represents the mean of $Y$ for a specific combination of latent clusters, say $\delta(j_1,\ldots,j_m)$. We set the reference $\delta_0$ as the mean of the combination of clusters such that

\[ (j_{(1)},\ldots,j_{(m)}) = \arg\min_{j_1,\ldots,j_m}\,\delta(j_1,\ldots,j_m) \tag{5.15} \]

Next, for omic layer $a$, we reorder the clusters of that layer by the order of

\[ \delta_{a(1)} < \delta_{a(2)} < \cdots < \delta_{a(k_a)} \tag{5.16} \]

The process in Equation 5.16 is replicated for $a=1,2,\ldots,m$. Additionally, we reorder $\beta$, $\mu$ and $\Sigma$ based on $a(1),a(2),\ldots,a(m)$ for $a=1,2,\ldots,m$.

5.2.3 A feature update for the LUCIDus R package

We offer a feature update for the R package LUCIDus (Chapter 4) to allow LUCID to incorporate multi-omics data in a parallel style. The code is hosted at https://github.com/Yinqi93/LUCIDusM and includes several new features to implement the proposed methods.

1. est_lucid: The function accepts a list of omics data as input. Each element of the list is an $n\times p_a$ matrix. The user can also specify a vector of integers (with the same length as the list) as the pre-specified number of clusters for each omic layer. The function automatically searches the model space to choose the optimal variance model for each omic layer. After conducting the EM algorithm, est_lucid calls another workhorse function, reorder_lucid, to switch cluster labels and re-arrange the model parameters. This function has options to fit the supervised or the unsupervised LUCID model.

2. tune_lucid: This function searches for the optimal model given different combinations of the number of clusters for each omic layer. It returns the optimal model as well as a record of the model search.

3. predict_lucid: This function predicts the cluster labels for each omic layer and the outcome using new data.

This feature update extends the previous analysis toolkit based on the LUCID model, offering options for early integration and intermediate integration. The function documentation provides several simple examples of fitting the LUCID model with multiple omics data in parallel, and an illustrative call pattern is sketched below. This feature will be integrated into the next official release of the LUCIDus package on CRAN.
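The sketch below shows the intended call pattern; the argument names are assumptions based on the descriptions above and should be checked against the package documentation, so the calls themselves are left commented out.

```r
# Hypothetical usage of the multi-omics feature update (development version at
# github.com/Yinqi93/LUCIDusM); argument names are assumptions, not the final API.
set.seed(123)
n <- 200
G <- matrix(rnorm(n * 5), n, 5)                 # exposures
Z <- list(matrix(rnorm(n * 3), n, 3),           # omic layer 1
          matrix(rnorm(n * 6), n, 6))           # omic layer 2
Y <- rnorm(n)                                   # continuous outcome

# fit  <- est_lucid(G = G, Z = Z, Y = Y, K = c(2, 3))          # supervised LUCID
# tune <- tune_lucid(G = G, Z = Z, Y = Y, K = list(2:4, 2:5))  # BIC model search
# pred <- predict_lucid(fit, newG = G, newZ = Z)               # clusters and outcome
```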
5.3 Simulation study

We first use a single reference simulation scenario to illustrate the interpretation of LUCID with multiple latent variables. We simulated a dataset of 2 omics layers with $K_1=2$ and $K_2=3$. The details of the simulation setting are presented in Table 5.1, and the LUCID analysis for this reference scenario is presented in Figure 5.2. We incorporate the two omics layers in the LUCID analysis and estimate 2 clusters for omic layer 1 and 3 clusters for omic layer 2. For example, for omic layer 1, exposure G1 is positively associated with latent cluster 2 (OR = 2.25); for omic layer 2, G1 is positively associated with latent cluster 2 (OR = 2.53) and latent cluster 3 (OR = 1.45). The links between the latent clusters and the omics features (variable names starting with "Z") are cluster-specific means. For instance, the estimated mean for Z1 in omic layer 1 is $-0.6$ for cluster 1 and $0.83$ for cluster 2. The links between the clusters and the outcome represent the increase (or decrease) in the mean of the outcome compared to the reference (cluster 1 for omic layer 1 and cluster 1 for omic layer 2). For example, the estimated mean for the reference is $-0.08$. If an individual belongs to cluster 1 for omic layer 1 and cluster 2 for omic layer 2, the estimated mean of the outcome for this individual is $-0.08-1.87=-1.95$; if the individual belongs to cluster 2 for omic layer 1 and cluster 3 for omic layer 2, the mean of the outcome becomes $-0.08+1.87+1.98=3.77$.

Figure 5.2: Sankey diagram for LUCID with multiple latent variables. The nodes on the left represent the exposures; the nodes in the middle represent the latent clusters; and the nodes on the right represent the omics layers and the outcome. Each link represents a statistical association. Dark gray links represent positive associations and light gray links represent negative ones.

Table 5.1: Simulation setting 1 - two omics layers with $K_1=2$ and $K_2=3$

Exposure effect
  Omics     Cluster         G1    G2    G3  G4  G5
  Layer 1   1 (reference)   -     -     -   -   -
            2               log3  log2  0   0   0
  Layer 2   1 (reference)   -     -     -   -   -
            2               log3  log2  0   0   0
            3               log2  log3  0   0   0

Omics effect
  Omics     Cluster   Z1   Z2   Z3  Z4  Z5  Z6
  Layer 1   1         -1    0    0   -   -   -
            2          1    0    0   -   -   -
  Layer 2   1         -1   -1    0   0   0   0
            2         -1    1    0   0   0   0
            3          1    1    0   0   0   0

Outcome effect
  Cluster   reference (1, 1)   Layer 1, cluster 2   Layer 2, cluster 2   Layer 2, cluster 3
  δ         0                  2                    -2                   2

5.3.1 Model selection

We use the data simulated from Table 5.1 to illustrate how to perform model selection for LUCID with multiple latent variables. We set the grid for omic layer 1 as $K_1=2:4$ (i.e., the number of clusters for omic layer 1 ranges from 2 to 4) and the grid for omic layer 2 as $K_2=2:5$. We fit the LUCID model for each combination of the numbers of clusters and calculate the corresponding BICs. Figure 5.3 shows that the model with the lowest BIC is the one with $K_1=2$ and $K_2=3$, which matches the "truth" (the simulation setting). This simulation study illustrates the effectiveness of using BIC as the criterion for model selection; a minimal version of this grid search is sketched below.

Figure 5.3: BIC of the LUCID model using different combinations of $K_1$ and $K_2$
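The grid search can be sketched as follows, assuming a hypothetical fitting function fit_lucid() that returns the log likelihood and the number of free parameters at a fixed $(K_1, K_2)$:

```r
# Sketch of the BIC search over K1 = 2:4 and K2 = 2:5; fit_lucid() is a
# hypothetical stand-in for fitting LUCID at one combination of cluster numbers.
grid <- expand.grid(K1 = 2:4, K2 = 2:5)
grid$BIC <- apply(grid, 1, function(k) {
  fit <- fit_lucid(G, Z, Y, K = k)            # returns fit$loglik and fit$npar
  -2 * fit$loglik + fit$npar * log(nrow(G))   # Equation 5.14
})
grid[which.min(grid$BIC), ]                   # optimal combination of clusters
```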
Table 5.2: Simulation setting 2 - two omics layers with $K_1=2$ and $K_2=3$. Different variance-covariance matrices are used to generate the omics data: $\mathrm{Var}(Z_1)=0.2I$, $\mathrm{Var}(Z_2)=0.4I$, where $I$ is the $6\times 6$ identity matrix.

Exposure effect
  Omics     Cluster         G1    G2    G3  G4  G5
  Layer 1   1 (reference)   -     -     -   -   -
            2               log2  log2  0   0   0
  Layer 2   1 (reference)   -     -     -   -   -
            2               log3  log2  0   0   0
            3               log2  log3  0   0   0

Omics effect
  Omics     Cluster   Z1    Z2    Z3   Z4  Z5  Z6
  Layer 1   1         -µ    -µ    -µ   0   0   0
            2          µ     µ     µ   0   0   0
  Layer 2   1          0.5  -0.5   0   0   0   0
            2         -0.5  -0.8   0   0   0   0
            3          0.7   0.4   0   0   0   0
  (µ = 0, 0.05, 0.1, 0.2, 0.3, 0.4)

Outcome effect
  Cluster   reference (1, 1)   Layer 1, cluster 2   Layer 2, cluster 2   Layer 2, cluster 3
  δ         -0.4               0.2                  0.3                  0.5

5.3.2 Parameter estimation and prediction

Methods

We conduct extensive simulation studies to investigate the influence of varying omics effects on parameter estimation and on prediction of the latent clusters and the outcome. We simulate data under the LUCID model across a range of omics effects. A simulated population of 10,000 observations is generated for each scenario. Details of the simulation setting are provided in Table 5.2. Specifically, the datasets consist of two omics layers with $K_1=2$ and $K_2=3$. The exposure effect, the outcome effect and the omics effect for layer 2 are fixed, while the omics effect for layer 1, $\Delta\mu=2\mu$ (the difference between the two cluster-specific means), ranges from 0.1 to 0.8. We also consider the special case of no omics effect for layer 1 ($\Delta\mu=0$). Exposure variables are simulated first; then, conditional on the exposures, a cluster label is generated for each omic layer; the omics data and the outcome are finally generated given the cluster assignment (a minimal sketch of this generative scheme is given below). For each scenario, we sample 1,000 observations from the simulated population and fit both the supervised and the unsupervised LUCID model with multiple latent variables. The analyses are replicated 500 times under each scenario. We examine several metrics to evaluate the performance of the LUCID model, including the mean parameter estimates and corresponding standard deviations across the 500 replicates, the area under the curve (AUC) for the estimated cluster assignment compared to the true labels, and the residual sum of squares (RSS) for the predicted outcome.
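The generative scheme can be sketched as follows (layer 1 only, with µ = 0.3 from Table 5.2; the variable names, exposure distribution and outcome noise level are illustrative assumptions):

```r
# Illustrative data generation for one omic layer under the LUCID model:
# exposures -> cluster labels -> omics features and outcome.
set.seed(2023)
n   <- 10000
G   <- matrix(rbinom(n * 5, 1, 0.5), n, 5)                 # five binary exposures
eta <- cbind(0, G %*% c(log(2), log(2), 0, 0, 0))          # linear predictors, K1 = 2
pr  <- exp(eta) / rowSums(exp(eta))                        # softmax cluster probabilities
X1  <- apply(pr, 1, function(p) sample(1:2, 1, prob = p))  # layer-1 cluster labels
mu1 <- rbind(c(-0.3, -0.3, -0.3, 0, 0, 0),                 # cluster means (mu = 0.3)
             c( 0.3,  0.3,  0.3, 0, 0, 0))
Z1  <- mu1[X1, ] + matrix(rnorm(n * 6, sd = sqrt(0.2)), n, 6)  # Var(Z1) = 0.2 I
Y   <- -0.4 + 0.2 * (X1 == 2) + rnorm(n)                   # layer-1 part of Equation 5.2
```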
Results

We first investigate the results of the supervised LUCID analyses. When the omics effect of omic layer 1 increases, the exposure effect estimates for omic layer 1 first decrease, then increase and become closer to the true effects (the true values are indicated by the red dashed lines), while their standard deviations across replicates become smaller (Figure 5.4a). The exposure effect estimates for omics layer 2 directly become closer to the true effects, while their standard deviations also become smaller (Figure 5.4b). The same patterns are observed for the omics effects (for example, Figure 5.4d, Figure 5.4e and Figure 5.4f) and for the outcome effects (Figure 5.5a).

Figure 5.4: Simulation results for supervised LUCID with two latent variables, part one: parameter estimation for the exposure and omics effects. Panels (a)-(c) show the exposure effects for omics layers 1 and 2; panels (d)-(h) show the omics effects, each plotted against $\Delta\mu$ (omics layer 1).

Figure 5.5: Simulation results for supervised LUCID with two latent variables, part two: parameter estimation for the outcome effect (a), AUC for prediction of the latent clusters (b), and RSS for prediction of the outcome (c).

Interestingly, not all of the outcome effect estimates are unbiased. For instance, the estimate of $\delta I(X_2=3)$ is biased when $\Delta\mu(\text{Omics 1})=0.8$, while the remaining outcome effect estimates become unbiased. The increased accuracy of parameter estimation results in a substantial improvement in the AUC of the cluster assignment for omics layer 1. The AUC for omics layer 2 does not substantially improve, albeit there is a decrease in the standard deviation of the AUC (Figure 5.5b). However, the increase in accuracy of parameter estimation also results in an increase in the RSS for prediction of the continuous outcome. This reflects a bias-variance trade-off: the predicted outcome $\hat{Y}$ is calculated by $\sum_{j=1}^{K}\gamma_j\,\mathrm{PIP}_j$, and the increased accuracy in parameter estimation results in more accurate prediction of the cluster assignment. As the PIPs for each observation get closer to 0 or 1, the predicted outcome becomes less variable, resulting in a higher RSS for $\hat{Y}$.
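As a small worked example of this prediction rule (names illustrative; `pip` holds the responsibilities for the K cluster combinations and `delta` the corresponding combination-specific outcome means):

```r
# Outcome prediction from posterior inclusion probabilities (illustrative).
pip   <- matrix(c(0.9, 0.1,
                  0.2, 0.8), nrow = 2, byrow = TRUE)  # n = 2, K = 2 combinations
delta <- c(-0.4, 0.6)                                 # combination-specific means
y_hat <- as.vector(pip %*% delta)                     # y_hat_i = sum_j delta_j * PIP_ij
Y     <- c(-0.1, 0.5)
rss   <- sum((Y - y_hat)^2)                           # RSS metric used above
```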
Figure 5.6 and Figure 5.7 show the simulation results for the unsupervised analyses. When $\Delta\mu$ increases for omic layer 1, the estimate of the exposure effect in omic layer 1 gets closer to the true value (all true values are indicated by the red dashed lines in the figures), while the standard deviation of the estimates across replicates becomes smaller; the same pattern is observed for the non-causal exposures (Figure 5.6a). Regarding the estimation of the exposure effect in omic layer 2, the estimates remain unbiased for both causal and non-causal variables, and the standard deviation of the estimates remains the same across the range of $\Delta\mu$ (omics layer 1) (Figure 5.6b and Figure 5.6c). We find that the increased omics effect in layer 1 decreases the standard deviation across replicates, and the estimated causal and non-causal effects remain unbiased (Figure 5.6d and Figure 5.6e). Additionally, the increased omics effect in layer 1 (1) does not influence the estimates in layer 2 (Figure 5.6f to Figure 5.6h), but it decreases the bias of the outcome effect estimates as well as their standard deviations (Figure 5.7a); and (2) improves prediction accuracy in omic layer 1 (the mean AUC goes from 0.54 to 0.99) but does not influence prediction accuracy in omic layer 2. Interestingly, we observe a decreasing trend in the RSS for prediction of the outcome: an increased omics effect improves prediction of the outcome, which is the opposite trend compared to supervised LUCID. However, the RSS of unsupervised LUCID is drastically higher than that of supervised LUCID, which is as expected.

Figure 5.6: Simulation results for unsupervised LUCID with two latent variables, part one: parameter estimation for the exposure and omics effects.

Figure 5.7: Simulation results for unsupervised LUCID with two latent variables, part two: parameter estimation for the outcome effect (a), AUC for prediction of the latent clusters (b), and RSS for prediction of the outcome (c).

By comparing the results of the supervised and unsupervised LUCID models, we conclude that: (1) for supervised LUCID with multiple latent variables, increased information in one omic layer results in less biased estimation of the model parameters; it not only improves the accuracy of clustering for that layer, but also improves the accuracy of clustering for the other layer; and (2) for unsupervised LUCID with multiple latent variables, increased information in one omic layer only results in less biased estimation of the model parameters and improved clustering accuracy for that particular layer.

5.3.3 Comparison to other integration methods

Methods

We further compare the performance of LUCID with multiple latent variables (LM) to other analysis methods, including the LUCID model with one latent variable (fitting a LUCID model on each omic layer separately and then combining the results, referred to as LS), a two-step analysis (fitting a clustering model on the omics data and then building regression models between cluster assignment and exposure, and between outcome and cluster assignment, referred to as CR), and an early integration method (fitting LUCID with one latent variable after concatenating the omic layers together, referred to as LC). Our aim is to evaluate the prediction performance for the clusters and the outcome. We use the reference scenario described in Table 5.2 with $\mu=0.3$ to simulate a population of 10,000 observations. For each replicate, we sample 1,000 observations from the population, then fit the LM, LS, CR and LC models on the sample. We record the parameter estimates from LM, LS and CR (the parameters of LC are not comparable to those of the other three methods) and the predictions of the latent clusters and the outcome. We use the AUC to evaluate the prediction accuracy for the latent clusters and the RSS to evaluate the prediction accuracy for the outcome.

Table 5.3: Comparison of prediction of clustering and the outcome for LM, LS, CR and LC.

  Metric                           LM           LS            CR            LC
  AUC of latent cluster, layer 1   0.94 (0.02)  0.90 (0.03)   0.95 (0.01)   -
  AUC of latent cluster, layer 2   0.89 (0.01)  0.88 (0.03)   0.81 (0.07)   -
  RSS of outcome                   6.03 (0.58)  28.67 (3.22)  61.59 (3.00)  72.47 (7.68)

Results

LM and CR perform well in discriminating the two clusters in omic layer 1 (mean AUCs of 0.94 and 0.95, respectively), but LM performs better in discriminating the clusters in omic layer 2 (mean AUC = 0.89) compared to CR (mean AUC = 0.81).
Although LS is inferior to LM when clustering subjects in omic layer 1, LS performs as well as LM in omic layer 2. Overall, LM represents intermediate integration, which utilizes information from the exposures and the two omics layers together, whereas LS and CR use the idea of late integration: each model only uses information from one omic layer. When the two omics layers are correlated with each other, LM works the best. Appendix 5.7 lists the comparison of parameters from LM, LS and CR.

5.4 Data example

(This analysis was a selected platform presentation, "Zhao, Y., Conti DV., LUCID, an integrative clustering model for multi omics data", at IGES 2022.)

5.4.1 Methods

We apply the LUCID model with multiple latent variables to the HELIX data simulated based on the real cases, described in Section 2.4.1. We focus on hexachlorobenzene (HCB), a "model persistent organic pollutant" that is globally spread [5]. The use of HCB is restricted or banned by the Stockholm Convention on Persistent Organic Pollutants (2001). Smink et al. recruited 482 children at 6.5 years old in Menorca and discovered an association between prenatal exposure to HCB and increases in BMI and weight [88]. In this analysis, we investigate the joint contribution of prenatal exposure to HCB and omics profiles towards BMI. The exposure is HCB measured in the mother during pregnancy; the measurements are adjusted for lipids and log2 transformed to reduce skewness. The outcome is the body mass index z-score (zBMI) at 6-11 years old, standardized by sex and age. The omics profile consists of three omics layers measured in the children: proteomics, serum metabolites and urine metabolites. We implement the "meet in the middle" approach to select omics variables that are significantly associated with both maternal HCB and zBMI [11]; a minimal sketch of this screen is given below. We calculate the Pearson correlations of the selected metabolites, then fit the LUCID model with three latent variables. We construct the omics profiles by estimating the cluster-specific mean for each omics variable and use a Sankey diagram to visualize the interplay among the prenatal exposure, the three omics layers and the BMI z-score.
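A minimal sketch of the meet-in-the-middle screen, assuming `omics` is an $n\times p$ matrix and `hcb` and `zbmi` are the exposure and outcome vectors; the 0.05 cutoff is illustrative:

```r
# Meet-in-the-middle screening (illustrative): keep omics features associated
# with both the exposure (HCB) and the outcome (zBMI).
p_exposure <- apply(omics, 2, function(z) summary(lm(z ~ hcb))$coefficients["hcb", 4])
p_outcome  <- apply(omics, 2, function(z) summary(lm(zbmi ~ z))$coefficients[2, 4])
selected   <- colnames(omics)[p_exposure < 0.05 & p_outcome < 0.05]
```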
5.4.2 Results

We select 32 omics variables: 9 from the proteome, 2 from the urine metabolome and 21 from the serum metabolome. The correlations of the 32 omics variables are presented in Figure 5.10. We observe two highly correlated islands in the heatmap: (1) a small island consisting of IL1beta, IL6 and Leptin (largest correlation), and (2) a large highly correlated island consisting of most of the serum metabolites, starting from serum metabolite 55. The rest of the selected metabolites are either mildly correlated or not significantly correlated. With a two-fold increase in maternal HCB, the OR of a subject being assigned to latent cluster 2 in the proteomics layer is 1.54, which ultimately leads to a 1.52 increase in the zBMI score. The same pattern is also observed in the urine metabolomics layer (OR = 1.30, increase in zBMI = 0.56) and in the serum metabolomics layer (OR = 5.75, increase in zBMI = 0.12). The interplay of prenatal HCB, the three omics layers and the zBMI is shown in Figure 5.8a. To better characterize the latent clusters, we create "multi-omics profiles" (Figure 5.8b). Latent cluster 1 is mainly characterized by the urine metabolites, while latent cluster 2 is mainly characterized by the proteomics data. Some selected omics variables (for example, FGBasic and serum metabolites 59 and 79) do not show an obvious difference in cluster-specific means, albeit they are significantly associated with HCB and zBMI in the regression models.

Figure 5.8: Results of the LUCID analysis investigating the joint effect of prenatal HCB, proteomics, serum and urine metabolites on children's BMI z-score. (a) Sankey diagram of the LUCID analysis with three omics layers, each with two clusters. The orange nodes represent proteins and the latent clusters related to proteomics; the green nodes represent urine metabolites and the corresponding latent clusters; the purple nodes represent serum metabolites and the corresponding latent clusters. (b) Omics profiles for the two identified clusters.

To summarize, increased maternal HCB is associated with increased BMI in children. This pattern is consistent with previous findings in [100] and [88]. The mechanisms by which prenatal exposure to HCB increases body fatness are still unknown. Huang et al. concluded that the prenatal period is a key time window for the development of adult obesity [40]. The mechanisms related to obesity may involve glucocorticoid resistance in utero, because HCB is reported to disrupt the gluconeogenic pathway in animal studies [62].

We also fit the LUCID model with one latent variable on each omic layer (LS) and compare the results with those from directly fitting a LUCID model with multiple latent variables (LM). The results for the exposure effects and the multi-omics profiles are listed in Appendix 5.8. We find that the clusters defined by the urine metabolites are exactly the same between the two models. The clusters defined by the proteomics data are largely the same, while substantial differences are present among the clusters defined by the serum metabolites. The overall association between maternal HCB and children's zBMI holds for LS.

5.5 Discussion

In summary, the proposed method effectively incorporates multiple omics layers under the analysis framework of LUCID. It estimates latent clusters in each omic layer separately by assuming no correlations across different omics layers. Extensive simulation studies have shown that: (1) it selects the optimal model guided by BIC; and (2) it efficiently estimates the model parameters across changes in the omics effects. Following the methods discussed in Chapter 1, we also proposed supervised and unsupervised versions of the LUCID model with multiple latent variables. In particular, unsupervised LUCID produces unbiased estimates when incorporating multiple omics layers in parallel.

It might be of scientific interest to estimate separate sets of clusters for each type of omics data. Accordingly, our model relaxes the assumption that the number of latent clusters remains unchanged across different omics layers. The advantage of the extended LUCID model lies in estimating the individual contribution of each omics dataset towards the outcome. Other available approaches in the community, such as the iCluster family (including iCluster [86], iClusterPlus [68] and iClusterBayes [67]) and DIABLO [87], assume the latent variables are jointly defined by the multi-omics data. If the numbers of features across omics datasets are not balanced, the results might be biased towards certain omics data types [12]. Since LUCID estimates each latent cluster separately, it is of interest to study the robustness of LUCID under unbalanced omics data.

For unsupervised approaches, researchers are interested in interpreting the estimated clusters in the data. A common solution is to conduct further downstream analysis and relate the clusters to the phenotype of interest.
For instance, Shen et al. identified three subtypes of glioblastoma and then estimated survival curves [86]. In the unsupervised LUCID model, we use the same strategy by regressing the outcome on the PIPs. Additionally, we propose the multi-omics profiles to characterize the estimated clusters by the differential expression levels of the omics data. To provide further insight into the underlying biological mechanism between exposure and outcome, it is of interest to conduct pathway analysis or to annotate the variables in the multi-omics profiles for their biological function.

Variable selection is an issue when applying LUCID to multiple omics datasets. Omics data are generated by high-throughput biotechnologies, and the number of features ranges from hundreds (e.g., exposomics data measuring environmental pollutants, and proteomics data) to hundreds of thousands (e.g., gene expression and genomics data). Although screening approaches, such as screening based on variation and meet in the middle, are usually applied to reduce the dimensionality of the omics data prior to formal analysis, it is still desirable to implement variable selection to pick variables informative for clustering. Pan et al. proposed a regularization-based variable selection method for clustering [71], and we introduced a similar idea into the framework of LUCID in Chapter 2. The same idea is applicable to LUCID with multiple omics data: we can add an $L_1$ penalty on $\mu_j$ and $\Sigma_j$ to the likelihood in Equation 5.3 and then implement the EM algorithm to obtain the MLE (a sketch of the corresponding soft-thresholding step is given below). The regularization can also cope with the high-dimensional scenario, where the dimension of the omics data is greater than the sample size.
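A minimal sketch of the soft-thresholding step in the spirit of Pan et al. [71], assuming a common diagonal covariance; `r_j`, `Z` and the value of `lambda` are illustrative, and the exact threshold depends on the covariance parameterization:

```r
# L1-penalized update of a cluster mean via soft-thresholding (illustrative).
soft_threshold <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

mu_hat <- colSums(r_j * Z) / sum(r_j)           # unpenalized weighted mean (Equation 5.9)
mu_pen <- soft_threshold(mu_hat, lambda = 0.1)  # shrunken cluster-specific mean
```

Features whose cluster means are shrunk exactly to zero no longer contribute to the clustering, which is how the penalty performs variable selection.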
As we have discussed in Chapter 3, missing data are a problem when analyzing omics data. Missing values in the omics measurements might result from the nature of the high-throughput technology, the study design, budget limits, etc. Inappropriate handling of missing data may lead to biased estimates and invalid inference [56]. To deal with missingness in omics data, we proposed the likelihood partition approach for listwise missing values and an EM algorithm with an additional imputation step for sporadic missing data. These two approaches can also be extended to LUCID with multiple latent variables by modifying the likelihood function in Equation 5.3.

Figure 5.9: DAG of LUCID with multiple latent variables in serial

LUCID can also be extended to incorporate multiple latent variables in serial if researchers believe multiple omics data act serially through a multistep process towards the outcome, as illustrated in Figure 5.9. This framework can accommodate the following situations: (1) longitudinal measurements of the same omics data type and (2) biological relationships among multiple omics data types. For situation (1), we assume $Z$ represents a certain omics data type, for example metabolomics, measured at times $t_1,t_2,\ldots,t_m$. The serial arrangement of latent variables can provide insights into how the exposure $G$ influences $Z$ across times $t_1,t_2,\ldots,t_m$ and ultimately influences the disease phenotype $Y$; longitudinal changes in $Z$ are captured by the dependencies among the latent variables $X$. For situation (2), we are guided by the hypothesis of the central dogma of molecular biology [20], which states that biological information flows from DNA to RNA and then to protein. For example, $G$ can represent genotypes, and $Z_1$ and $Z_2$ can represent mRNAs and proteins, respectively. Under the LUCID framework, we can then study how changes in genotypes influence gene expression and protein expression sequentially, and the joint impact of $G$, $Z_1$ and $Z_2$ on the outcome. The estimation of the latent clusters for each omics dataset can be formulated as an unsupervised LUCID model with one latent variable: for a certain omics type $m$, cluster $X_{m-1}$ serves as the "$G$" in the original LUCID model and $X_m$ is estimated jointly from $X_{m-1}$ and $Z_m$. Alternatively, we can add another component to the joint likelihood relating $X_{m-1}$ to $X_m$. This component can be a special case of a Markov chain whose state only moves forward [47]. Mathematically, we use a transition probability matrix to describe the state change, with $P(X_{m-1}=j\,|\,X_m=k)=0$ and $P(X_m=k\,|\,X_{m-1}=j)=p_{jk}$, where $p_{jk}>0$.

In conclusion, the LUCID model with multiple latent variables extends the original LUCID method to incorporate multiple omics datasets in a parallel fashion. Compared to other data integration methods that estimate shared latent clusters, it can estimate different clusters for each omic layer separately while preserving the capability of the original LUCID model with one latent variable to predict the outcome and detect underlying subgroups.

5.6 Appendix 5.A

In this section, we discuss how to derive the MLE for a binary outcome under the framework of LUCID with multiple latent variables. Assume $Y_i$ is a binary outcome coded as 0 or 1. The likelihood for observation $i$ is

\[ f(Z_{1i},\ldots,Z_{mi},X_{1i},\ldots,X_{mi},Y_i\,|\,G_i;\Theta) = p_i^{Y_i}(1-p_i)^{1-Y_i}\prod_{a=1}^{m}\left[\phi(Z_{ai}\,|\,X_{ai})S(X_{ai}\,|\,G_i)\right] \tag{5.17} \]

where $p_i$ is the probability of observation $i$ being a case. Under the framework of LUCID with multiple latent mediators, $p_i$ is a conditional probability given the cluster assignments $X_{1i},X_{2i},\ldots,X_{mi}$, that is,

\[ p_i = \Pr(Y_i=1\,|\,X_{1i}=j_1,\ldots,X_{mi}=j_m) \tag{5.18} \]

In other words, the parameters for the association between the latent clusters and the outcome form a probability array $p$ of dimension $k_1\times k_2\times\cdots\times k_m$, and $\Pr(Y_i=1\,|\,X_{1i}=j_1,\ldots,X_{mi}=j_m)$ is the $(j_1,\ldots,j_m)$th element of $p$.

For a binary outcome, the joint log likelihood for the LUCID model is

\[
\begin{aligned}
l(\Theta|D) &= \sum_{i=1}^{n}\log f(Z_{1i},\ldots,Z_{mi},Y_i\,|\,G_i;\Theta) \\
&= \sum_{i=1}^{n}\log\prod_{j_1=1}^{k_1}\cdots\prod_{j_m=1}^{k_m} f(Z_{1i},\ldots,Z_{mi},X_{1i},\ldots,X_{mi},Y_i\,|\,G_i;\Theta)^{I(X_{1i}=j_1,\ldots,X_{mi}=j_m)} \\
&= \sum_{i=1}^{n}\sum_{j_1=1}^{k_1}\cdots\sum_{j_m=1}^{k_m} I(X_{1i}=j_1,\ldots,X_{mi}=j_m)\left(Y_i\log p_i+(1-Y_i)\log(1-p_i)\right) \\
&\quad+ \sum_{i=1}^{n}\sum_{a=1}^{m}\sum_{j_1=1}^{k_1}\cdots\sum_{j_m=1}^{k_m} I(X_{1i}=j_1,\ldots,X_{mi}=j_m)\log\phi(Z_{ai}\,|\,X_{ai}=j_a,\mu_{a,j_a},\Sigma_{a,j_a}) \\
&\quad+ \sum_{i=1}^{n}\sum_{a=1}^{m}\sum_{j_1=1}^{k_1}\cdots\sum_{j_m=1}^{k_m} I(X_{1i}=j_1,\ldots,X_{mi}=j_m)\log S(X_{ai}=j_a\,|\,G_i,\beta_a)
\end{aligned} \tag{5.19}
\]

and the corresponding expectation of the joint log likelihood is

\[
\begin{aligned}
Q(\Theta|D,\Theta^{(t)}) &= E[l(\Theta|D,\Theta^{(t)})] \\
&= \sum_{i=1}^{n}\sum_{j_1=1}^{k_1}\cdots\sum_{j_m=1}^{k_m} r_i(j_1,\ldots,j_m)^{(t)}\left(Y_i\log p_i+(1-Y_i)\log(1-p_i)\right) \\
&\quad+ \sum_{a=1}^{m}\sum_{i=1}^{n}\sum_{j_a=1}^{k_a} r_i(j_a\cdot)^{(t)}\log\phi(Z_{ai}\,|\,X_{ai}=j_a,\mu_{a,j_a},\Sigma_{a,j_a}) \\
&\quad+ \sum_{a=1}^{m}\sum_{i=1}^{n}\sum_{j_a=1}^{k_a} r_i(j_a\cdot)^{(t)}\log S(X_{ai}=j_a\,|\,G_i,\beta_a)
\end{aligned} \tag{5.20}
\]

The derivations for $\mu$, $\Sigma$ and $\beta$ remain exactly the same as discussed in Section 5.2. We only need to derive the MLE for $\Pr(Y_i=1\,|\,X_{1i}=j_1,\ldots,X_{mi}=j_m)$. The partial derivative is

\[ \frac{\partial Q(\Theta|D,\Theta^{(t)})}{\partial \Pr(Y_i=1\,|\,X_{1i}=j_1,\ldots,X_{mi}=j_m)} = \sum_{i=1}^{n} r_i(j_1,\ldots,j_m)^{(t)}\left(\frac{Y_i}{p_i}-\frac{1-Y_i}{1-p_i}\right) \tag{5.21} \]

By setting Equation 5.21 to 0, we obtain

\[ \Pr(Y_i=1\,|\,X_{1i}=j_1,\ldots,X_{mi}=j_m) = \frac{\sum_{i=1}^{n} r_i(j_1,\ldots,j_m)^{(t)} Y_i}{\sum_{i=1}^{n} r_i(j_1,\ldots,j_m)^{(t)}} \tag{5.22} \]

Equivalently, we can use a parameterization similar to Equation 5.2, namely

\[ \mathrm{logit}(p_i) = \delta_0 + \sum_{j_1=2}^{k_1}\delta_{j_1} I(X_{1i}=j_1) + \cdots + \sum_{j_a=2}^{k_a}\delta_{j_a} I(X_{ai}=j_a) + \cdots + \sum_{j_m=2}^{k_m}\delta_{j_m} I(X_{mi}=j_m) \tag{5.23} \]

Instead of directly optimizing $p_i$, we first estimate the MLE of $\delta$, which is equivalent to fitting a logistic regression predicting the binary outcome using the posterior inclusion probabilities $r_i(j_1,\ldots,j_m)$ as predictors. The cluster-specific probability of being a case is then calculated as

\[ \Pr(Y_i=1\,|\,X_{1i}=j_1,\ldots,X_{mi}=j_m) = \frac{\exp(r^T\delta)}{1+\exp(r^T\delta)} \tag{5.24} \]
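In R, this regression-based update can be sketched as follows (illustrative names; `r_pip` is assumed to hold the responsibilities for the non-reference cluster combinations):

```r
# Delta update for a binary outcome (illustrative sketch of Equations 5.23-5.24).
fit_delta <- glm(Y ~ r_pip, family = binomial())      # logistic regression on PIPs
p_hat <- plogis(cbind(1, r_pip) %*% coef(fit_delta))  # exp(r'delta) / (1 + exp(r'delta))
```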
5.7 Appendix 5.B

Table 5.4: Comparison of LM, LS and CR: exposure effect

  Omic     Cluster    Variable    True value   LM            LS            CR
  Layer 1  Cluster 2  Causal-1    0.69         0.59 (0.11)   1.10 (0.17)   0.39 (0.05)
                      Causal-2    0.69         0.54 (0.10)   1.29 (0.17)   0.41 (0.05)
                      Non-causal  0.00         0.00 (0.08)   -0.02 (0.10)  0.01 (0.05)
  Layer 2  Cluster 2  Causal-1    1.09         0.78 (0.13)   1.27 (0.27)   0.28 (0.11)
                      Causal-2    0.69         0.50 (0.12)   0.85 (0.22)   0.06 (0.07)
                      Non-causal  0.00         -0.05 (0.09)  -0.05 (0.14)  0.00 (0.04)
           Cluster 3  Causal-1    0.69         0.53 (0.12)   0.88 (0.16)   0.04 (0.07)
                      Causal-2    1.09         0.99 (0.12)   1.31 (0.16)   0.27 (0.07)
                      Non-causal  0.00         -0.04 (0.10)  -0.03 (0.12)  0.03 (0.04)

Table 5.5: Comparison of LM, LS and CR: omic effect

  Omic     Cluster    Variable    True value   LM            LS            CR
  Layer 1  Cluster 1  Causal-1    -0.30        -0.29 (0.03)  -0.22 (0.05)  -0.31 (0.02)
                      Causal-2    -0.30        -0.25 (0.03)  -0.21 (0.05)  -0.29 (0.02)
                      Non-causal  0.00         0.00 (0.02)   0.00 (0.02)   0.00 (0.02)
           Cluster 2  Causal-1    0.30         0.28 (0.03)   0.20 (0.06)   0.31 (0.03)
                      Causal-2    0.30         0.26 (0.03)   0.21 (0.06)   0.32 (0.03)
                      Non-causal  0.00         0.01 (0.02)   0.01 (0.06)   0.01 (0.02)
  Layer 2  Cluster 1  Causal-1    0.50         0.55 (0.06)   0.46 (0.06)   0.50 (0.17)
                      Causal-2    -0.50        -0.43 (0.04)  -0.49 (0.05)  -0.51 (0.12)
                      Non-causal  0.00         -0.01 (0.04)  -0.01 (0.05)  0.00 (0.07)
           Cluster 2  Causal-1    -0.50        -0.23 (0.09)  -0.38 (0.23)  -0.42 (0.29)
                      Causal-2    -0.80        -0.63 (0.08)  -0.74 (0.15)  -0.73 (0.24)
                      Non-causal  0.00         0.02 (0.03)   0.02 (0.10)   0.01 (0.05)
           Cluster 3  Causal-1    0.70         0.65 (0.07)   0.67 (0.14)   0.63 (0.29)
                      Causal-2    0.40         0.37 (0.07)   0.40 (0.15)   0.35 (0.28)
                      Non-causal  0.00         0.01 (0.04)   0.00 (0.05)   0.01 (0.06)

Table 5.6: Comparison of LM, LS and CR: outcome effect

  Layer 1    Layer 2    True value   LM            LS            CR
  Cluster 1  Cluster 1  -0.40        -0.51 (0.03)  -0.58 (0.06)  -0.40 (0.07)
  Cluster 1  Cluster 2  -0.10        -0.15 (0.04)  -0.19 (0.04)  -0.10 (0.04)
  Cluster 1  Cluster 3  0.10         0.12 (0.04)   0.15 (0.07)   0.02 (0.05)
  Cluster 2  Cluster 1  -0.20        -0.24 (0.08)  -0.24 (0.08)  -0.16 (0.06)
  Cluster 2  Cluster 2  0.20         0.12 (0.03)   -0.04 (0.06)  0.14 (0.04)
  Cluster 2  Cluster 3  0.30         0.39 (0.03)   0.34 (0.04)   0.25 (0.05)

5.8 Appendix 5.C

Figure 5.10: Pearson correlations of the 32 selected omic variables. Only statistically significant correlations (p-value < 0.05) are shown in the heatmap.

Figure 5.11: Comparison of the exposure effects calculated from LM and LS

Figure 5.12: Comparison of the PIPs calculated by LM and LS (panels (a)-(c))

Bibliography

[1] Jeffrey M Albert, Cuiyu Geng, and Suchitra Nelson. “Causal mediation analysis with a latent mediator”. In: Biometrical Journal 58.3 (2016), pp. 535–548.
[2] Emma L Anderson, Laura D Howe, Hayley E Jones, Julian PT Higgins, Debbie A Lawlor, and Abigail Fraser. “The prevalence of non-alcoholic fatty liver disease in children and adolescents: a systematic review and meta-analysis”. In: PloS one 10.10 (2015), e0140908.
[3] Bilal Aslam, Madiha Basit, Muhammad Atif Nisar, Mohsin Khurshid, and Muhammad Hidayat Rasool. “Proteomics: technologies and their applications”. In: Journal of chromatographic science 55.2 (2017), pp. 182–196.
[4] Jeffrey D Banfield and Adrian E Raftery. “Model-based Gaussian and non-Gaussian clustering”. In: Biometrics (1993), pp. 803–821.
[5] Jonathan L Barber, Andrew J Sweetman, Dolf Van Wijk, and Kevin C Jones. “Hexachlorobenzene in the global environment: emissions, levels, distribution, trends and processes”. In: Science of the total environment 349.1-3 (2005), pp. 1–44.
[6] Richard D Beger. “A review of applications of metabolomics in cancer”. In: Metabolites 3.3 (2013), pp. 552–574.
[7] Richard Bellman. “Dynamic programming”. In: Science 153.3731 (1966), pp. 34–37.
[8] Yoshua Bengio and Francois Gingras. “Recurrent neural networks for missing or asynchronous data”. In: Advances in neural information processing systems 8 (1995).
[9] Lorenzo Beretta and Alessandro Santaniello. “Nearest neighbor imputation algorithms: a critical evaluation”. In: BMC medical informatics and decision making 16.3 (2016), pp. 197–208.
[10] Chiranjib Bhattacharyya, Pannagadatta Shivaswamy, and Alex Smola. “A second order cone programming formulation for classifying missing data”. In: Advances in neural information processing systems 17 (2004).
[11] Solène Cadiou, Xavier Basagaña, Juan R Gonzalez, Johanna Lepeule, Martine Vrijheid, Valérie Siroux, and Rémy Slama. “Performance of approaches relying on multidimensional intermediary data to decipher causal relationships between the exposome and health: A simulation study under various causal structures”. In: Environment International 153 (2021), p. 106509.
[12] Laura Cantini, Pooya Zakeri, Celine Hernandez, Aurelien Naldi, Denis Thieffry, Elisabeth Remy, and Anaïs Baudot. “Benchmarking joint multi-omics dimensionality reduction approaches for the study of cancer”. In: Nature communications 12.1 (2021), pp. 1–12.
[13] Angelo Canty and Brian Ripley. boot: Bootstrap Functions (Originally by Angelo Canty for S). R package version 1.3-28. 2021. url: https://CRAN.R-project.org/package=boot.
[14] Gilles Celeux and Gérard Govaert. “Gaussian parsimonious clustering models”. In: Pattern recognition 28.5 (1995), pp. 781–793.
[15] Gilles Celeux, Merrilee Hurn, and Christian P Robert. “Computational and inferential difficulties with mixture posterior distributions”. In: Journal of the American Statistical Association 95.451 (2000), pp. 957–970.
[16] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. “Recurrent neural networks for multivariate time series with missing values”. In: Scientific reports 8.1 (2018), pp. 1–12.
[17] Hyungwon Choi and Norman Pavelka. “When one and one gives more than two: challenges and opportunities of integrative omics”. In: Frontiers in genetics 2 (2012), p. 105.
[18] Shing Wan Choi, Timothy Shin-Heng Mak, and Paul F O’Reilly. “Tutorial: a guide to performing polygenic risk score analyses”. In: Nature protocols 15.9 (2020), pp. 2759–2772.
[19] Stephen R Cole, Haitao Chu, Lei Nie, and Enrique F Schisterman. “Estimating the odds ratio when exposure has a limit of detection”. In: International journal of epidemiology 38.6 (2009), pp. 1674–1680.
[20] Francis H Crick. “On protein synthesis”. In: Symp Soc Exp Biol. Vol. 12. 138-63. 1958, p. 8.
[21] Scott Cunningham. Causal inference. Yale University Press, 2021.
[22] Paresh Dandona, Ahmad Aljada, and Arindam Bandyopadhyay. “Inflammation: the link between insulin resistance, obesity and diabetes”. In: Trends in immunology 25.1 (2004), pp. 4–7.
[23] Anthony Christopher Davison and David Victor Hinkley. Bootstrap methods and their application. 1. Cambridge university press, 1997.
[24] Yadolah Dodge, David Cox, and Daniel Commenges. The Oxford dictionary of statistical terms. Oxford University Press on Demand, 2006.
[25] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. “Least angle regression”. In: The Annals of statistics 32.2 (2004), pp. 407–499.
[26] Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. CRC press, 1994.
[27] Jianqing Fan and Jinchi Lv. “Sure independence screening for ultrahigh dimensional feature space”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70.5 (2008), pp. 849–911.
[28] Yingying Fan and Cheng Yong Tang. “Tuning parameter selection in high dimensional penalized likelihood”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75.3 (2013), pp. 531–552.
[29] Michael Fop and Thomas Brendan Murphy. “Variable selection methods for model-based clustering”. In: Statistics Surveys 12 (2018), pp. 18–65.
[30] Edward B Fowlkes, Ram Gnanadesikan, and John R Kettenring. “Variable selection in clustering”. In: Journal of classification 5.2 (1988), pp. 205–228.
[31] Chris Fraley and Adrian E Raftery. “How many clusters? Which clustering method? Answers via model-based cluster analysis”. In: The computer journal 41.8 (1998), pp. 578–588.
[32] Chris Fraley and Adrian E Raftery. “Model-based clustering, discriminant analysis, and density estimation”. In: Journal of the American statistical Association 97.458 (2002), pp. 611–631.
[33] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. “Sparse inverse covariance estimation with the graphical lasso”. In: Biostatistics 9.3 (2008), pp. 432–441.
[34] Atsushi Fukushima, Miyako Kusano, Henning Redestig, Masanori Arita, and Kazuki Saito. “Integrated omics approaches in plant systems biology”. In: Current opinion in chemical biology 13.5-6 (2009), pp. 532–538.
[35] Pedro J García-Laencina, José-Luis Sancho-Gómez, and Aníbal R Figueiras-Vidal. “Pattern classification with missing data: a review”. In: Neural Computing and Applications 19.2 (2010), pp. 263–282.
[36] Juan R González and Alejandro Cáceres. Omic association studies with r and bioconductor. CRC Press, 2019.
[37] Sara Goodwin, John D McPherson, and W Richard McCombie. “Coming of age: ten years of next-generation sequencing technologies”. In: Nature Reviews Genetics 17.6 (2016), pp. 333–351.
[38] Michael I Goran, Richard N Bergman, Quintilia Avila, Michael Watkins, Geoff DC Ball, Gabriel Q Shaibi, Marc J Weigensberg, and Martha L Cruz. “Impaired glucose tolerance and reduced β-cell function in overweight Latino children with a positive family history for type 2 diabetes”. In: The Journal of Clinical Endocrinology & Metabolism 89.1 (2004), pp. 207–212.
[39] Zhengyu Hu. Initializing the EM algorithm for data clustering and sub-population detection. The Ohio State University, 2015.
[40] Jennifer S Huang, Tiffany A Lee, and Michael C Lu. “Prenatal programming of childhood overweight and obesity”. In: Maternal and child health journal 11.5 (2007), pp. 461–473.
[41] Yen-Tsung Huang. “Integrative modeling of multi-platform genomic data under the framework of mediation analysis”. In: Statistics in medicine 34.1 (2015), pp. 162–178.
[42] Yen-Tsung Huang, Liming Liang, Miriam F Moffatt, William OCM Cookson, and Xihong Lin. “iGWAS: Integrative genome-wide association studies of genetic and genomic data for disease susceptibility using mediation analysis”. In: Genetic epidemiology 39.5 (2015), pp. 347–356.
[43] Trey Ideker, Janusz Dutkowski, and Leroy Hood. “Boosting signal-to-noise in complex biology: prior knowledge is power”. In: Cell 144.6 (2011), pp. 860–863.
[44] Ran Jin, Rob McConnell, Cioffi Catherine, Shujing Xu, Douglas I Walker, Nikos Stratakis, Dean P Jones, Gary W Miller, Cheng Peng, David V Conti, et al. “Perfluoroalkyl substances and severity of nonalcoholic fatty liver in children: an untargeted metabolomics approach”. In: Environment international 134 (2020), p. 105220.
[45] Original by Joseph L. Schafer. mix: Estimation/Multiple Imputation for Mixed Categorical and Continuous Data. R package version 1.0-11. 2022. url: https://CRAN.R-project.org/package=mix.
[46] Konrad J Karczewski and Michael P Snyder. “Integrative omics for health and disease”. In: Nature Reviews Genetics 19.5 (2018), pp. 299–310.
[47] Samuel Karlin. A first course in stochastic processes. Academic press, 2014.
[48] Eamonn Keogh and Abdullah Mueen. “Curse of Dimensionality”. In: Encyclopedia of Machine Learning and Data Mining. Ed. by Claude Sammut and Geoffrey I. Webb. Boston, MA: Springer US, 2017, pp. 314–315. isbn: 978-1-4899-7687-1. doi: 10.1007/978-1-4899-7687-1_192.
[49] Javed Khan, Jun S Wei, Markus Ringner, Lao H Saal, Marc Ladanyi, Frank Westermann, Frank Berthold, Manfred Schwab, Cristina R Antonescu, Carsten Peterson, et al. “Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks”. In: Nature medicine 7.6 (2001), pp. 673–679.
[50] Stephanie Kim, Hillary Hollinger, and Elizabeth G Radke. “‘Omics in environmental epidemiological studies of chemical exposures: a systematic evidence map”. In: Environment International (2022), p. 107243.
[51] Laurence N Kolonel, Brian E Henderson, Jean H Hankin, Abraham MY Nomura, Lynne R Wilkens, Malcolm C Pike, Daniel O Stram, Kristine R Monroe, Maj E Earle, and Faye S Nagamine. “A multiethnic cohort in Hawaii and Los Angeles: baseline characteristics”. In: American journal of epidemiology 151.4 (2000), pp. 346–357.
[52] Anshul Kundaje, Wouter Meuleman, Jason Ernst, Misha Bilenky, Angela Yen, Alireza Heravi-Moussavi, Pouya Kheradpour, Zhizhuo Zhang, Jianrong Wang, Michael J Ziller, et al. “Integrative analysis of 111 reference human epigenomes”. In: Nature 518.7539 (2015), pp. 317–330.
[53] Pavel Langer. “The impacts of organochlorines and other persistent pollutants on thyroid and metabolic health”. In: Frontiers in Neuroendocrinology 31.4 (2010), pp. 497–518.
[54] Yifeng Li, Fang-Xiang Wu, and Alioune Ngom. “A review on machine learning principles for multi-view biological data integration”. In: Briefings in bioinformatics 19.2 (2018), pp. 325–340.
[55] Wei-Chao Lin and Chih-Fong Tsai. “Missing value imputation: a review and analysis of the literature (2006–2017)”. In: Artificial Intelligence Review 53.2 (2020), pp. 1487–1509.
[56] Roderick JA Little and Donald B Rubin. Statistical analysis with missing data. Vol. 793. John Wiley & Sons, 2019.
[57] John Lonsdale, Jeffrey Thomas, Mike Salvatore, Rebecca Phillips, Edmund Lo, Saboor Shad, Richard Hasz, Gary Walters, Fernando Garcia, Nancy Young, et al. “The genotype-tissue expression (GTEx) project”. In: Nature genetics 45.6 (2013), pp. 580–585.
[58] KE Lyke, R Burges, Y Cissoko, L Sangare, M Dao, I Diarra, A Kone, R Harley, CV Plowe, OK Doumbo, et al. “Serum levels of the proinflammatory cytokines interleukin-1 beta (IL-1β), IL-6, IL-8, IL-10, tumor necrosis factor alpha, and IL-12 (p70) in Malian children with severe Plasmodium falciparum malaria and matched uncomplicated malaria or healthy controls”. In: Infection and immunity 72.10 (2004), pp. 5630–5637.
[59] Léa Maitre, Jeroen De Bont, Maribel Casas, Oliver Robinson, Gunn Marit Aasvang, Lydiane Agier, Sandra Andrušaitytė, Ferran Ballester, Xavier Basagaña, Eva Borràs, et al. “Human Early Life Exposome (HELIX) study: a European population-based exposome cohort”. In: BMJ open 8.9 (2018), e021311.
[60] Léa Maitre, Jean-Baptiste Guimbaud, Charline Warembourg, Nuria Güil-Oumrait, The Exposome Data Challenge Participant Consortium, Paula Marcela Petrone, Marc Chadeau-Hyam, Martine Vrijheid, Juan R Gonzalez, and Xavier Basagaña. “State-of-the-Art Methods for Exposure-Health Studies: results from the Exposome Data Challenge Event”. In: arXiv preprint arXiv:2202.01680 (2022).
[61] Komodo Matta, Tiphaine Lefebvre, Evelyne Vigneau, Véronique Cariou, Philippe Marchand, Yann Guitton, Anne-Lise Royer, Stéphane Ploteau, Bruno Le Bizec, Jean-Philippe Antignac, et al. “Associations between persistent organic pollutants and endometriosis: A multiblock approach integrating metabolic and cytokine profiling”. In: Environment International 158 (2022), p. 106926.
[62] Marta Blanca Mazzetti, María Cristina Taira, Sandra Marcela Lelli, Eduardo Dascal, Juan Carlos Basabe, and Leonor Carmen San Martín de Viale. “Hexachlorobenzene impairs glucose metabolism in a rat model of porphyria cutanea tarda: a mechanistic approach”. In: Archives of toxicology 78.1 (2004), pp. 25–33.
[63] Geoffrey J McLachlan and Thriyambakam Krishnan. The EM algorithm and extensions. John Wiley & Sons, 2007.
[64] Xiao-Li Meng and Donald B Rubin. “Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm”. In: Journal of the American Statistical Association 86.416 (1991), pp. 899–909.
[65] Gary W Miller. The exposome: A primer. Elsevier, 2013.
[66] Sayantan Mitra, Sriparna Saha, and Mohammed Hasanuzzaman. “Multi-view clustering for multi-omics data using unified embedding”. In: Scientific reports 10.1 (2020), pp. 1–16.
[67] Qianxing Mo, Ronglai Shen, Cui Guo, Marina Vannucci, Keith S Chan, and Susan G Hilsenbeck. “A fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data”. In: Biostatistics 19.1 (2018), pp. 71–86.
[68] Qianxing Mo, Sijian Wang, Venkatraman E Seshan, Adam B Olshen, Nikolaus Schultz, Chris Sander, R Scott Powers, Marc Ladanyi, and Ronglai Shen. “Pattern discovery and cancer gene identification in integrated cancer genomic data”. In: Proceedings of the National Academy of Sciences 110.11 (2013), pp. 4245–4250.
[69] Lei Nie, Haitao Chu, Chenglong Liu, Stephen R Cole, Albert Vexler, and Enrique F Schisterman. “Linear regression with an independent variable subject to a detection limit”. In: Epidemiology (Cambridge, Mass.) 21.Suppl 4 (2010), S17.
[70] Fatih Ozsolak and Patrice M Milos. “RNA sequencing: advances, challenges and opportunities”. In: Nature reviews genetics 12.2 (2011), pp. 87–98.
[71] Wei Pan and Xiaotong Shen. “Penalized model-based clustering with application to variable selection.” In: Journal of machine learning research 8.5 (2007).
[72] Cheng Peng, Jun Wang, Isaac Asante, Stan Louie, Ran Jin, Lida Chatzi, Graham Casey, Duncan C Thomas, and David V Conti. “A latent unknown clustering integrating multi-omics data (LUCID) with phenotypic traits”. In: Bioinformatics 36.3 (2020), pp. 842–850.
[73] Duc Truong Pham, Stefan S Dimov, and Chi D Nguyen. “Selection of K in K-means clustering”. In: Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science 219.1 (2005), pp. 103–119.
[74] Milan Picard, Marie-Pier Scott-Boyer, Antoine Bodein, Olivier Périn, and Arnaud Droit. “Integration strategies of multi-omics data for machine learning analysis”. In: Computational and Structural Biotechnology Journal 19 (2021), pp. 3735–3746.
[75] Morgane Pierre-Jean, Jean-François Deleuze, Edith Le Floch, and Florence Mauger. “Clustering and variable selection evaluation of 13 unsupervised methods for multi-omics data integration”. In: Briefings in bioinformatics 21.6 (2020), pp. 2011–2030.
[76] Therese D Pigott. “A review of methods for missing data”. In: Educational research and evaluation 7.4 (2001), pp. 353–383.
[77] Nimrod Rappoport and Ron Shamir. “Multi-omic and multi-view clustering algorithms: review and cancer benchmark”. In: Nucleic acids research 46.20 (2018), pp. 10546–10562.
[78] Tobias Rockel. missMethods: Methods for Missing Data. R package version 0.4.0. 2022. url: https://CRAN.R-project.org/package=missMethods.
[79] Florian Rohart, Benoît Gautier, Amrit Singh, and Kim-Anh Lê Cao. “mixOmics: An R package for ‘omics feature selection and multiple data integration”. In: PLoS computational biology 13.11 (2017), e1005752.
[80] Tariq Samad and Steven A Harp. “Self–organization with partial data”. In: Network: Computation in Neural Systems 3.2 (1992), pp. 205–212.
[81] Joseph L Schafer. Analysis of incomplete multivariate data. CRC press, 1997.
[82] Joseph L Schafer. “Multiple imputation: a primer”. In: Statistical methods in medical research 8.1 (1999), pp. 3–15.
[83] Mario Schmidt. “The Sankey diagram in energy and material flow management: part II: methodology and current applications”. In: Journal of industrial ecology 12.2 (2008), pp. 173–185.
[84] Gideon Schwarz. “Estimating the dimension of a model”. In: The annals of statistics (1978), pp. 461–464.
[85] Luca Scrucca, Michael Fop, T Brendan Murphy, and Adrian E Raftery. “mclust 5: clustering, classification and density estimation using Gaussian finite mixture models”. In: The R journal 8.1 (2016), p. 289.
[86] Ronglai Shen, Adam B Olshen, and Marc Ladanyi. “Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis”. In: Bioinformatics 25.22 (2009), pp. 2906–2912.
[87] Amrit Singh, Casey P Shannon, Benoît Gautier, Florian Rohart, Michaël Vacher, Scott J Tebbutt, and Kim-Anh Lê Cao. “DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays”. In: Bioinformatics 35.17 (2019), pp. 3055–3062.
[88] Agnes Smink, Nuria Ribas-Fito, Raquel Garcia, Maties Torrent, Michelle A Mendez, Joan O Grimalt, and Jordi Sunyer. “Exposure to hexachlorobenzene during pregnancy increases the risk of overweight in children aged 6 years”. In: Acta paediatrica 97.10 (2008), pp. 1465–1469.
[89] Meng Song, Jonathan Greenbaum, Joseph Luttrell IV, Weihua Zhou, Chong Wu, Hui Shen, Ping Gong, Chaoyang Zhang, and Hong-Wen Deng. “A review of integrative imputation for multi-omics datasets”. In: Frontiers in genetics 11 (2020), p. 570255.
[90] Yanyi Song, Xiang Zhou, Min Zhang, Wei Zhao, Yongmei Liu, Sharon LR Kardia, Ana V Diez Roux, Belinda L Needham, Jennifer A Smith, and Bhramar Mukherjee. “Bayesian shrinkage estimation of high dimensional causal mediation effects in omics studies”. In: Biometrics 76.3 (2020), pp. 700–710.
[91] Jeppe S Spicker, Søren Brunak, Klaus S Frederiksen, and Henrik Toft. “Integration of clinical chemistry, expression, and metabolite data leads to better toxicological class separation”. In: Toxicological Sciences 102.2 (2008), pp. 444–454.
[92] Nikos Stratakis, Lucy Golden-Mason, Katerina Margetaki, Yinqi Zhao, Damaskini Valvi, Erika Garcia, Léa Maitre, Sandra Andrusaityte, Xavier Basagana, Eva Borràs, et al. “In utero exposure to mercury is associated with increased susceptibility to liver injury and inflammation in childhood”. In: Hepatology 74.3 (2021), pp. 1546–1559.
[93] Nikos Stratakis, David V. Conti, Ran Jin, Katerina Margetaki, Damaskini Valvi, Alexandros P Siskos, Léa Maitre, Erika Garcia, Nerea Varo, Yinqi Zhao, et al. “Prenatal exposure to perfluoroalkyl substances associated with increased susceptibility to liver injury in children”. In: Hepatology 72.5 (2020), pp. 1758–1770.
[94] Kevin Strike, Khaled El Emam, and Nazim Madhavji. “Software cost estimation with incomplete data”. In: IEEE Transactions on Software Engineering 27.10 (2001), pp. 890–908.
[95] Indhupriya Subramanian, Srikant Verma, Shiva Kumar, Abhay Jere, and Krishanpal Anamika. “Multi-omics data integration, interpretation, and its application”. In: Bioinformatics and biology insights 14 (2020), p. 1177932219899051.
[96] Arthur Tenenhaus, Cathy Philippe, Vincent Guillemot, Kim-Anh Le Cao, Jacques Grill, and Vincent Frouin. “Variable selection for generalized canonical correlation analysis”. In: Biostatistics 15.3 (2014), pp. 569–583.
[97] Robert Tibshirani. “Regression shrinkage and selection via the lasso”. In: Journal of the Royal Statistical Society: Series B (Methodological) 58.1 (1996), pp. 267–288.
[98] Giulia Tini, Luca Marchetti, Corrado Priami, and Marie-Pier Scott-Boyer. “Multi-omics integration—a comparison of unsupervised clustering methodologies”. In: Briefings in bioinformatics 20.4 (2019), pp. 1269–1279.
[99] Karan Uppal, Chunyu Ma, Young-Mi Go, and Dean P Jones. “xMWAS: a data-driven integration and differential network analysis tool”. In: Bioinformatics 34.4 (2018), pp. 701–702.
[100] Marina Vafeiadi, Vaggelis Georgiou, Georgia Chalkiadaki, Panu Rantakokko, Hannu Kiviranta, Marianna Karachaliou, Eleni Fthenou, Maria Venihaki, Katerina Sarri, Maria Vassilaki, et al. “Association of prenatal exposure to persistent organic pollutants with obesity and cardiometabolic traits in early childhood: the Rhea mother–child cohort (Crete, Greece)”. In: Environmental health perspectives 123.10 (2015), pp. 1015–1021.
[101] William N Venables and Brian D Ripley. Modern applied statistics with S-PLUS. Springer Science & Business Media, 2013.
[102] Valentin Voillet, Philippe Besse, Laurence Liaubet, Magali San Cristobal, and Ignacio González. “Handling missing rows in multi-omics data integration: multiple imputation in multiple factor analysis framework”. In: BMC bioinformatics 17.1 (2016), pp. 1–16.
[103] Martine Vrijheid, Serena Fossati, Léa Maitre, Sandra Márquez, Theano Roumeliotaki, Lydiane Agier, Sandra Andrusaityte, Solène Cadiou, Maribel Casas, Montserrat de Castro, et al. “Early-life environmental exposures and childhood obesity: an exposome-wide approach”. In: Environmental health perspectives 128.6 (2020), p. 067009.
[104] Martine Vrijheid, Rémy Slama, Oliver Robinson, Leda Chatzi, Muireann Coen, Peter Van den Hazel, Cathrine Thomsen, John Wright, Toby J Athersuch, Narcis Avellana, et al. “The human early-life exposome (HELIX): project rationale and design”. In: Environmental health perspectives 122.6 (2014), pp. 535–544.
[105] Banrida Wahlang, Juliane I Beier, Heather B Clair, Heather J Bellis-Jones, K Cameron Falkner, Craig J McClain, and Matt C Cave. “Toxicant-associated steatohepatitis”. In: Toxicologic pathology 41.2 (2013), pp. 343–360.
[106] Beata Walczak and Désiré L Massart. “Dealing with missing data: Part II”. In: Chemometrics and Intelligent Laboratory Systems 58.1 (2001), pp. 29–42.
[107] Bo Wang, Aziz M Mezlini, Feyyaz Demir, Marc Fiume, Zhuowen Tu, Michael Brudno, Benjamin Haibe-Kains, and Anna Goldenberg. “Similarity network fusion for aggregating data types on a genomic scale”. In: Nature methods 11.3 (2014), pp. 333–337.
[108] Bobbie-Jo M Webb-Robertson, Holli K Wiberg, Melissa M Matzke, Joseph N Brown, Jing Wang, Jason E McDermott, Richard D Smith, Karin D Rodland, Thomas O Metz, Joel G Pounds, et al. “Review, evaluation, and discussion of the challenges of missing value imputation for mass spectrometry-based label-free global proteomics”. In: Journal of proteome research 14.5 (2015), pp. 1993–2001.
[109] John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, and Joshua M Stuart. “The cancer genome atlas pan-cancer analysis project”. In: Nature genetics 45.10 (2013), pp. 1113–1120.
[110] Christopher Paul Wild. “Complementing the genome with an “exposome”: the outstanding challenge of environmental exposure measurement in molecular epidemiology”. In: Cancer Epidemiology and Prevention Biomarkers 14.8 (2005), pp. 1847–1850.
[111] Daniela M Witten and Robert Tibshirani. “Covariance-regularized regression and classification for high dimensional problems”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71.3 (2009), pp. 615–636.
[112] Benhuai Xie, Wei Pan, and Xiaotong Shen. “Variable selection in penalized model-based clustering via regularization on grouped parameters”. In: Biometrics 64.3 (2008), pp. 921–930.
[113] Weixin Yao. “Label switching and its solutions for frequentist mixture models”. In: Journal of Statistical Computation and Simulation 85.5 (2015), pp. 1000–1012.
[114] Bing Yu, Yan Zheng, Danny Alexander, Alanna C Morrison, Josef Coresh, and Eric Boerwinkle. “Genetic determinants influencing human serum metabolome among African Americans”. In: PLoS genetics 10.3 (2014), e1004212.
[115] Guangchuang Yu. dlstats: Download Stats of R Packages. R package version 0.1.5. 2022. url: https://CRAN.R-project.org/package=dlstats.
[116] Cun-Hui Zhang. “Nearly unbiased variable selection under minimax concave penalty”. In: The Annals of statistics 38.2 (2010), pp. 894–942.
[117] Haixiang Zhang, Yinan Zheng, Zhou Zhang, Tao Gao, Brian Joyce, Grace Yoon, Wei Zhang, Joel Schwartz, Allan Just, Elena Colicino, et al. “Estimating and testing high-dimensional mediation effects in epigenetic studies”. In: Bioinformatics 32.20 (2016), pp. 3150–3154.
[118] Yi Zhang, Miaomiao Li, Siwei Wang, Sisi Dai, Lei Luo, En Zhu, Huiying Xu, Xinzhong Zhu, Chaoyun Yao, and Haoran Zhou. “Gaussian mixture model clustering with incomplete data”. In: ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 17.1s (2021), pp. 1–14.
[119] Yinqi Zhao. LUCIDus: an R package to implement the LUCID model. 2.2.0. CRAN, 2022. url: https://yinqi93.github.io/LUCIDus/index.html.
[120] Hui Zhou, Wei Pan, and Xiaotong Shen. “Penalized model-based clustering with unconstrained covariance matrices”. In: Electronic journal of statistics 3 (2009), p. 1473.
[121] Zhi-Hua Zhou, Nitesh V Chawla, Yaochu Jin, and Graham J Williams. “Big data opportunities and challenges: Discussions from data analytics perspectives [discussion forum]”. In: IEEE Computational intelligence magazine 9.4 (2014), pp. 62–74.
[122] H Zou, T Hastie, and R Tibshirani. “On the “degrees of freedom” of the lasso. Technical report”. In: Ann. Stat. 35 (2005), pp. 2173–2192.
Abstract
Technological advances in high-throughput biochemical data generation have made multiple types of omics data simultaneously available to genetic epidemiology studies. Analyzing each omics dataset individually cannot reveal the full biological complexity of disease phenotypes. Integrative analysis of multi-omics data has therefore emerged as an approach to understanding the biological interplay between genetic or environmental risk factors and complex traits.
As an alternative to high-dimensional pairwise mediation and other clustering approaches, Latent Unknown Clustering Integrating multi-omics Data (LUCID) is an integrative statistical model that jointly estimates latent clusters characterized by omics profiles and genetic or environmental factors, while simultaneously associating these clusters with the phenotype. LUCID incorporates biological processes into its conceptual model and is tailored to the prospective sampling scheme of cohort studies. This dissertation focuses on further development of LUCID, including (1) improving its computational performance, especially in high-dimensional settings; (2) allowing more flexible geometric features in omics data; (3) incorporating missing omics data; and (4) integrating multiple types of omics data. An R package, LUCIDus, is introduced to implement the workflow based on the LUCID model.
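For orientation, the sketch below illustrates how such an analysis might look in R. This is a minimal example assuming the LUCIDus 2.x interface on CRAN (est_lucid(), summary_lucid(), plot_lucid()); the simulated exposure matrix, omics layer, and phenotype are hypothetical placeholders rather than data analyzed in the dissertation.

# Minimal LUCID sketch; function and argument names assume the
# LUCIDus 2.x interface and should be checked against the package docs.
library(LUCIDus)

set.seed(1)
n <- 200
G <- matrix(rnorm(n * 5),  nrow = n)   # genetic/environmental exposures
Z <- matrix(rnorm(n * 10), nrow = n)   # a single omics layer
Y <- rbinom(n, 1, 0.5)                 # binary phenotype

# Jointly estimate K = 2 latent clusters linking G, Z, and Y
fit <- est_lucid(G = G, Z = Z, Y = Y, K = 2, family = "binary")

summary_lucid(fit)   # cluster-specific parameter estimates
plot_lucid(fit)      # Sankey diagram of the fitted LUCID model

Fixing K in advance is a modeling choice; in practice one would compare several values of K, for example by an information criterion such as BIC.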
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Latent unknown clustering with integrated data (LUCID)
Two-step study designs in genetic epidemiology
Hierarchical approaches for joint analysis of marginal summary statistics
Prediction and feature selection with regularized regression in integrative genomics
Generalized linear discriminant analysis for high-dimensional genomic data with external information
Hierarchical regularized regression for incorporation of external data in high-dimensional models
Statistical analysis of high-throughput genomic data
Multilevel Bayesian latent class growth mixture model for longitudinal zero-inflated Poisson data
Scalable latent factor models for inferring genetic regulatory networks
Molecular mechanisms of young-onset type 2 diabetes: integration of diet and multi-omics biomarkers
Combination of quantile integral linear model with two-step method to improve the power of genome-wide interaction scans
Incorporating prior knowledge into regularized regression
Enhancing model performance of regularization methods by incorporating prior information
Identifying and quantifying transcriptional module heterogeneity and genetic co-regulation, with applications in asthma
Understanding ancestry-specific disease allelic effect sizes by leveraging multi-ancestry single-cell RNA-seq data
twas_sim, a Python-based tool for simulation and power analysis of transcriptome-wide association analysis
High-dimensional regression for gene-environment interactions
Prediction modeling with meta data and comparison with lasso regression
Best practice development for RNA-Seq analysis of complex disorders, with applications in schizophrenia
Statistical downscaling with artificial neural network
Asset Metadata
Creator
Zhao, Yinqi (author)
Core Title
Integrative analysis of multi-view data with applications in epidemiology
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Degree Conferral Date
2023-05
Publication Date
01/24/2023
Defense Date
12/13/2022
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
clustering, missing data, multi-omics integration, multi-view data, OAI-PMH Harvest
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Conti, David (committee chair), Chatzi, Vaia (committee member), Lewinger, Juan (committee member), Mancuso, Nicholas (committee member), Smith, Andrew (committee member)
Creator Email
yinqiz@usc.edu, zhaoyq93@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC112718704
Unique identifier
UC112718704
Identifier
etd-ZhaoYinqi-11436.pdf (filename)
Legacy Identifier
etd-ZhaoYinqi-11436
Document Type
Dissertation
Rights
Zhao, Yinqi
Internet Media Type
application/pdf
Type
texts
Source
20230126-usctheses-batch-1003 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu