LATENT UNKNOWN CLUSTERING WITH INTEGRATED DATA
(LUCID)
by
Cheng Peng
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree of
DOCTOR OF PHILOSOPHY
(Biostatistics)
May 2019
Copyright 2019 Cheng Peng
Acknowledgements
I would like to first express my sincere gratitude to Dr. David Conti for his generous support and
endless encouragement. His brilliant mind and invaluable advice are the most reliable compass
that has guided me in the right direction. I am unbelievably fortunate to have him as my Ph.D.
advisor in this extraordinary journey of doctoral study and research. I also would like to thank
my committee members, Drs. Kimberly Siegmund, Juan Pablo Lewinger, and B. Keith Jenkins,
for their precious time and insightful suggestions in the development of this dissertation.
I would also like to thank all the members of the USC Department of Preventive
Medicine for allowing me the joy and privilege of studying and working beside them in the past
productive years. Particularly, thank you to Dr. Duncan Thomas for those helpful suggestions
and feedback.
Lastly, I would like to thank my family, especially my parents, for their unconditional
love. They are the light in my life, illuminating true north for me always.
Table of Contents
CHAPTER 1 INTRODUCTION
1.1 INTEGRATED DATA
1.1.1 Overview
1.1.2 Components of Integrated Data
1.1.3 Challenges in Analyzing Integrated Data
1.1.4 Emerging Data Integration Methods
1.2 MISSING DATA AND ANALYSIS METHODS
1.2.1 Incomplete or Missing Observations
1.2.2 Latent or Unobserved Variables
1.2.3 Expectation-maximization (EM) Algorithm
1.3 REGULARIZATION PATHS FOR HIGH-DIMENSIONAL DATA
1.3.1 Sparse Solutions through Regularized Regression
1.3.2 Regularized EM Algorithm
CHAPTER 2 INTEGRATIVE LATENT CLUSTER ASSIGNMENT USING MULTI-OMICS DATA WITH PHENOTYPIC TRAITS
2.1 INTRODUCTION
2.2 METHODS
2.2.1 LUCID Latent Clustering Model
2.2.2 Joint Estimating through Latent Cluster via EM Algorithm
2.2.3 Sparse Solutions for High-dimensional Multi-omics Features
2.3 SIMULATION STUDY
2.3.1 Data Description and Scenario Settings
2.3.2 Simulation Results
2.4 APPLIED EXAMPLES
2.4.1 Data Description
2.4.2 Applied Examples Results
2.5 IMPLEMENTATION – LUCIDUS R PACKAGE
2.6 DISCUSSION
2.7 ACKNOWLEDGEMENTS
2.8 FUNDING SUPPORT
CHAPTER 3 INTEGRATIVE CLUSTERING WITH INCOMPLETE BIO-FEATURE MEASUREMENTS
3.1 INTRODUCTION
3.2 METHODS
3.2.1 Revised DAG Latent Cluster Model
3.2.2 Missingness Pattern and Likelihood Partition
3.2.3 Joint Estimating through Latent Cluster via EM Algorithm
3.2.4 Sparse Solutions for High-dimensional Multi-omics Features
3.3 SIMULATION STUDY
3.3.1 Data Description and Scenario Settings
3.3.2 Simulation Results
3.4 AN APPLICATION ON COLON CFR FOLATE BIOMARKERS STUDY
3.4.1 Data Description
3.4.2 Applied Example Results
3.5 IMPLEMENTATION - A FEATURE UPDATE FOR R PACKAGE LUCIDUS
3.6 DISCUSSION
3.7 FUNDING SUPPORT
CHAPTER 4 POTENTIAL OPTIONS FOR THE STATISTICAL INFERENCE OF LUCID INTEGRATIVE CLUSTERING
4.1 INTRODUCTION
4.2 METHODS
4.2.1 Standard Error (SE) Estimation through SEM Algorithm
4.2.2 SE Estimation through Bootstrapping
4.2.3 Hypothesis Testing for High-dimensional Data
4.3 SIMULATION STUDY
4.3.1 Data Description and Scenario Settings
4.3.2 Simulation Results
4.4 APPLIED EXAMPLES FOR STATISTICAL INFERENCE
4.4.1 Data Description
4.4.2 Application Results
4.5 IMPLEMENTATION - LUCIDUS R PACKAGE UPDATES
4.6 DISCUSSION
4.7 ACKNOWLEDGEMENTS
4.8 FUNDING SUPPORT
CHAPTER 5 LUCIDUS: AN R PACKAGE FOR THE IMPLEMENTATION OF LUCID
5.1 INTRODUCTION
5.2 LUCIDUS PACKAGE MODELS – A REVIEW
5.2.1 Base Model
5.2.2 Adjusted Model
5.2.3 Regularization
5.3 IMPLEMENTATION AND APPLICATION
5.3.1 Functions in the LUCIDus Package
5.3.2 General Workflow
5.3.3 Input and Data Requirements
5.3.4 Model Specification
5.3.5 LUCID Model Usage and Output
5.3.6 Applications and Examples
5.4 CONCLUSIONS
CHAPTER 6 CONCLUSIONS AND FUTURE DIRECTIONS
6.1 SUMMARY OF FINDINGS
6.2 FUTURE DIRECTIONS OF LUCID
6.3 FINAL REMARKS
REFERENCES
APPENDICES
APPENDIX I
APPENDIX II
APPENDIX III
APPENDIX IV
APPENDIX V
APPENDIX VI
APPENDIX VII
APPENDIX VIII
APPENDIX IX
APPENDIX X
APPENDIX XI
APPENDIX XII
APPENDIX XIII
Chapter 1 Introduction
1.1 Integrated Data
1.1.1 Overview
Rapid advances in technologies in medicine, biology, and bioinformatics have made various
components in complex metabolic pathways of human subjects measurable, offering researchers
unprecedented details in an expanded scope of multiple types of omics data. These high-
throughput data could make a substantial difference in understanding the genetic and metabolic
basis of phenotypic traits and outcomes, although constructing powerful and efficient analytical
approaches to take full advantage of integrated data remains challenging. Emerging data
integration methods have been developed to handle this task from distinct perspectives; however,
the quest for novel ways to adapt to the growing complexity of integrated data is ongoing.
1.1.2 Components of Integrated Data
Integrated data, as the name suggests, is the combined use of information from various sources. Generally,
integrated data in the context of genetic epidemiology and statistics consists of two main parts, as
illustrated in Figure 1.1: multi-omics features and phenotypic traits.
Multi-omics features include multiple and various omics measures, which assemble
complex and big data sets to comprehensively explore functional relationships of biological
processes. Specifically, the information could be collected from different disciplines of biology,
such as genomics (e.g. genome-wide sequencing data, single-nucleotide polymorphism (SNP), or
copy number variant), epigenomics (e.g. DNA methylation patterns), transcriptomics (e.g.
differential gene expression), proteomics (e.g. protein mass spectrometry data), and
metabolomics (e.g. metabolite profiling information). Compared with analysis methods using a
single type of data, multi-omics measures can provide additional informative pieces of the puzzle
of biological relevance, helping to reveal the complete mechanism of the metabolic pathway of
concern.
Phenotypic traits are a more straightforward part of integrated data, which are usually the
outcome of interest for a research topic. For instance, they can simply be certain disease
outcomes, say colorectal cancer risk, or a metabolic syndrome involving abnormal levels of
many phenotypic traits, such as hypertension, hyperglycemia, low high-density lipoprotein
(HDL), etc. A particular phenotypic trait or outcome is usually the ultimate destination of a
metabolic pathway. This special role as the “final product” of the biological process makes
phenotypic traits important in analyzing integrated data, and indispensable for an informative
interrogation of associations between multi-omics measures and the outcome of interest.
Figure 1.1 Components of Integrated Data
Adapted from Ritchie et al. (2009)
1.1.3 Challenges in Analyzing Integrated Data
Combining multi-omics measures with phenotypic traits gives us integrated data with a thorough
reflection of metabolic pathways, but with a large and complicated architecture. Admittedly,
bridging the gap between complex integrated data and a complete biological story is extremely
challenging.
First of all, the large-scale nature of high-throughput data may jeopardize the balance
between data quality and completeness. To achieve satisfactory results, the high quality of data
needs to be ensured. However, an overly rigorous protocol of quality control may damage data
completeness by excluding too many subjects from the analytical sample. Thus, in addition to
choosing good quality control pipelines, imputations or other statistically involved methods of
missing data analysis should also be considered.
Furthermore, each omics data set could contain genome-scale information, indicating that
the whole integrated data could be immensely high-dimensional, which means the
dimensionality of features p is much greater than the size of the analytic sample n (p ≫ n).
Trying to analyze the data without dimension reduction and feature selection is neither
statistically reasonable nor computationally feasible. External information, such as prior
knowledge regarding informative omics features for phenotypic traits of interest, would be
enlightening if available. Otherwise, many analytical strategies, such as factor analysis, principal
component analysis (PCA), and regularization methods, could be applied to enable dimension
reduction and/or feature selection.
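As a concrete illustration of one such strategy, PCA can be sketched in a few lines. This is a generic Python example on simulated data (the dimensions and variable names are invented for illustration, not drawn from the dissertation); it shows why, when p ≫ n, at most n principal components carry information:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 1000                      # p >> n, as with genome-scale omics features
X = rng.normal(size=(n, p))

# PCA via the thin SVD of the centered matrix; with p >> n
# there are at most n informative principal components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 5
scores = U[:, :k] * S[:k]            # n-by-k reduced representation
explained = S[:k] ** 2 / np.sum(S ** 2)

print(scores.shape)                  # (50, 5)
```

Downstream models can then be fit on the k score columns instead of the original p features.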
Finally, other covariates besides integrated data, such as demographic characteristics and
environmental exposures, could be influential. For instance, confounding due to ethnicity,
especially genetic ancestry, has been commonly recognized in genome-wide association studies
(GWAS). Moreover, gene-environment interactions, characterized by differential gene
expression by exposure status, are also widely reported. Therefore, the effects of these covariates
on integrated data and the whole metabolic pathway should not be ignored.
1.1.4 Emerging Data Integration Methods
Owing to growing interest in understanding biological processes through integrated data, a
number of data integration methods have been proposed over the past decade. Depending on
whether the primary goal of analyses is to estimate the effect of labeled phenotypic traits or not,
these methods could be divided into two main types in the context of machine learning:
supervised learning and unsupervised learning methods.
Supervised learning methods aim to predict phenotypic traits of interest by fitting
statistically powerful and effective models with a refined subset of predictors including the most
informative features from multi-omics data. Specifically, these supervised learning approaches
can be roughly summarized into two categories: multi-staged analysis and meta-dimensional
analysis. Multi-staged analysis refers to a hierarchical analysis that elucidates important features
from multi-omics data through a series of stages. In contrast, meta-dimensional analysis
combines different omics data simultaneously to build a complex model that jointly predicts the
phenotypic traits of interest.
Unsupervised learning methods are used when the phenotypic traits are unlabeled, or the
real outcomes of interest are latent, unobserved variables. For instance, some computational
approaches using metabolomics data, such as mummichog (Li et al., 2013), focus on estimating
the unobservable network that maps the relationships between functional activity and metabolic
pathways. Meanwhile, latent cluster analysis could facilitate the discovery of hidden structures in
metabolic pathways through information embedded in the integrated data. These clustering
methods could be non-probabilistic, like separate hierarchical clustering, or probabilistic, such as
iCluster (Shen et al., 2009). Details regarding latent cluster analysis approaches are discussed in
the next section.
1.2 Missing Data and Analysis Methods
1.2.1 Incomplete or Missing Observations
When we conduct statistical analyses, a standard data set is typically expected to have a rectangular
shape: a matrix with rows representing observations or subjects and columns referring
to variables or features measured for each subject. Missing data problems emerge in practice
when some entries of the data matrix are not observed, which could be due to various factors,
such as non-response in surveys, recording errors, poor data quality, etc. For instance, each source of
multi-omics data mentioned above could have a considerable amount of missing data due to
the quality control pipeline or the expense of measurement. Missing data problems pose
challenges for data analysis because they can affect statisticians’ decisions in many respects.
Typically, most available statistical packages exclude subjects with missing values in any of the
analyzed variables. However, this listwise
deletion strategy of keeping only complete cases for analysis is not ideal, given that it damages the
structure of the data matrix, which is intended to retain the shape of the original study design.
Drawbacks of listwise deletion include introducing serious bias into the analytic sample and
severely reducing statistical power to detect hypothesized effects, which urges statisticians to
harness the utility of missing data with more thoughtful methods.
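The cost of listwise deletion is easy to see on a toy example. The Python sketch below (with made-up values, purely illustrative) shows how a few scattered missing entries can remove half the analytic sample:

```python
# A tiny 6-subject-by-3-feature data matrix; None marks a missing entry.
X = [
    [1.0,  2.0, 3.0],
    [None, 2.5, 1.0],
    [0.5,  1.5, None],
    [2.0,  0.5, 1.0],
    [1.5, None, 2.0],
    [0.8,  1.2, 0.4],
]

# Listwise deletion: keep only subjects with no missing value in any variable.
complete_cases = [row for row in X if all(v is not None for v in row)]
print(len(X), "subjects before deletion,", len(complete_cases), "after")
# → 6 subjects before deletion, 3 after
```

Only three isolated missing values cost half the subjects, even though most of the recorded measurements are discarded along with them.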
Before proposing more appropriate methods, the patterns and mechanisms of missingness
must first be understood. As shown in Figure 1.2, Little and Rubin (2002) classify missing data
into univariate, multivariate, monotone, disjoint, general, and latent. Among all these missing
types, the assumption of latent variables reflects a belief that there exist some unobservable features which
are important in explaining the underlying patterns or structure of the observed data. This
particular and attractive role of latent variables has drawn statisticians’ attention over the past
decades. In the context of data integration, exploratory pattern discovery sheds new light on
undefined biological processes, which could be new indicators of unknown disease subtypes. A
detailed discussion of the latent cluster and relevant analysis methods will be given later in this
section. In Chapter 2, we will propose a novel likelihood-based latent variable model to take full
advantage of integrated data to understand unobserved patterns in complex metabolic pathways.
Figure 1.2 Missing Patterns
Adapted from Little & Rubin (2002)
In terms of missing mechanisms, Little and Rubin (2002) categorize them into three groups:
missing completely at random (MCAR), missing at random (MAR), and not missing at
random (NMAR). The differences among these three categories are characterized by
statistical examination of the relationship between the complete data, the missing pattern, and some
unknown parameters. Let 𝒟 = (𝒟_obs, 𝒟_mis) denote the complete data matrix, where 𝒟_obs and
𝒟_mis refer to the observed and missing parts of the data respectively, let M denote the missing pattern, and
let Θ denote the unknown parameters. Then, if the observed and missing data combined have no influence
on the missing pattern, i.e.

f(M | 𝒟, Θ) = f(M | Θ)    (1.1)

then the data are MCAR; if the missingness depends only on the observed part of the data, i.e.

f(M | 𝒟, Θ) = f(M | 𝒟_obs, Θ)    (1.2)

then the data are MAR; finally, if the missing values in the data matrix 𝒟 impact the missing
pattern in any way, then the missing mechanism of the data is NMAR. When the data are MCAR
or at least MAR, the missing mechanism is ignorable because the missingness parameter is
separable from the other parameters of interest when conducting statistical inference through the data likelihood.
However, this is not the case when the data are NMAR, and the missing mechanism therefore
should be included in the analysis.
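These mechanisms can be made concrete with a small simulation. The Python sketch below (an illustrative example, not from the dissertation; the variable names and the logistic missingness model are assumptions) shows that a complete-case estimate of the mean stays close to the full-data mean under MCAR but is biased under MAR:

```python
import math
import random

random.seed(1)
n = 100_000
mean_full = 0.0
sum_mcar, n_mcar = 0.0, 0
sum_mar, n_mar = 0.0, 0

for _ in range(n):
    x = random.gauss(0, 1)          # always observed
    y = x + random.gauss(0, 1)      # subject to missingness
    mean_full += y / n
    # MCAR: drop y with probability 0.5, independent of (x, y)
    if random.random() >= 0.5:
        sum_mcar += y
        n_mcar += 1
    # MAR: drop y with probability sigmoid(2x), depending only on the observed x
    if random.random() >= 1 / (1 + math.exp(-2 * x)):
        sum_mar += y
        n_mar += 1

mean_mcar = sum_mcar / n_mcar   # close to the full-data mean
mean_mar = sum_mar / n_mar      # noticeably biased downward
print(round(mean_full, 2), round(mean_mcar, 2), round(mean_mar, 2))
```

Under MAR the complete cases over-represent subjects with low x, so the naive mean of y is pulled down; likelihood-based methods that condition on x recover an unbiased answer, which is why MAR is called ignorable for likelihood inference.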
Missing patterns and mechanisms have crucial influences on the choice of analysis
methods. Certain patterns of missing data can only be handled by standard complete-case
analysis after listwise deletion or by other specialized methods, such as weighting methods. In contrast,
other analysis schemes, such as imputation techniques and model-based likelihood methods, are
suitable for more general missing-data patterns. Indeed, model-based likelihood methods are
the most statistically sensible but also the most computationally intensive of these options, since they take the missing-data
patterns and mechanisms into account when building the model likelihood upon which
statistical inference is subsequently based. Fortunately, the rapid development of modern
algorithms and computing hardware has reduced the obstacles to applying those methods efficiently. In
Chapter 3, we will extend the model-based likelihood approach proposed in Chapter 2 to handle
integrated data in which some subjects’ bio-feature measurements are missing at random (MAR).
1.2.2 Latent or Unobserved Variables
1.2.2.1 Background
As previously mentioned, the problem of latent variables is a special case of missing data issues.
Although latent variables are unobservable and their forms are unknown, they usually serve as
informative labels grouping observed data into classes sharing similar properties. Analysis
approaches aiming to learn from observed data and elucidate patterns of similar subjects via
latent variables are broadly called clustering, which is a type of unsupervised learning method.
Prevalent methods of clustering include non-probabilistic techniques, such as non-
hierarchical and hierarchical clustering, and probabilistic approaches, such as latent cluster
analysis. Although easy and fast to implement, non-probabilistic methods achieve segmentation
using arbitrary criteria, which are loosely based on minimizing the within-cluster variation
and/or maximizing the between-cluster variation. Moreover, the scaling of observable variables
strongly impacts the clustering results for non-probabilistic methods. The drawbacks of non-
probabilistic techniques and rapid developments of computer hardware make latent cluster
analysis statistically favorable and practically applicable.
1.2.2.2 Latent Cluster Analysis
Latent cluster analysis is a probabilistic clustering approach since it is constructed upon a model-based
likelihood method. The model likelihood assumes that the observed data are generated
from a mixture of underlying probability distributions flagged by latent variables. Because
latency is a type of missing-data pattern, strategies from missing-data theory need to be applied
to perform parameter estimation and statistical inference. In fact, latent variables are MCAR,
which means their missing mechanisms are ignorable in the model likelihood, so parameter
estimation is not overly complicated.
The basic latent variable model is given by
P(𝒟 | Θ) = ∏_{i=1}^{n} P(𝒟_i | Θ) = ∏_{i=1}^{n} ∑_{j=1}^{k} π_j P_j(𝒟_i | Θ_j)    (1.3)
where 𝒟_i denotes individual i’s observed data, Θ is the vector of model parameters, and π_j
denotes the probability of belonging to latent cluster j, with n and k referring to the sample
size and the number of latent clusters respectively. Once cluster-specific densities of
the observed data are postulated, the likelihood of the model parameters involving latent variables can
be explicitly formed. In terms of optimization, when a complex model likelihood with missing
variables is the objective function and the data are at least MAR, the expectation-maximization (EM)
algorithm is often used as an alternative to obtain maximum likelihood estimates (MLEs) of the
model parameters. As an important tool which will be used repeatedly when we introduce our
novel statistical models in the next three chapters, the EM algorithm is discussed later
in this section.
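Once the cluster-specific densities are specified, the mixture likelihood in Equation (1.3) and the posterior cluster probabilities can be evaluated directly. A minimal Python sketch, assuming univariate Gaussian components purely for illustration (the data values and parameters are invented):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_loglik(data, pi, mu, sigma):
    """Log of Eq. (1.3) for a univariate k-component Gaussian mixture."""
    ll = 0.0
    for d in data:
        ll += math.log(sum(pi[j] * normal_pdf(d, mu[j], sigma[j])
                           for j in range(len(pi))))
    return ll

def responsibilities(d, pi, mu, sigma):
    """Posterior probability of each latent cluster j given one observation d."""
    num = [pi[j] * normal_pdf(d, mu[j], sigma[j]) for j in range(len(pi))]
    tot = sum(num)
    return [v / tot for v in num]

data = [-2.1, -1.9, -2.0, 1.8, 2.2, 2.0]
pi, mu, sigma = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
print(mixture_loglik(data, pi, mu, sigma))
print(responsibilities(-2.1, pi, mu, sigma))   # heavily favors the first cluster
```

The `responsibilities` function is exactly the individual-level posterior cluster membership probability that latent cluster analysis reports, and it is the quantity computed in the E-step of the EM algorithm discussed below.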
Overall, latent cluster analysis enjoys many advantages compared with non-probabilistic
methods. The first benefit is extensive flexibility in fitting various distributional
forms for the observed data within clusters when constructing the model likelihood. Moreover,
statistical modeling makes it easier to introduce penalty terms into the model likelihood so that
sparse solutions for parameter estimates can be obtained by following regularization paths.
Additionally, model-based methods facilitate constructing hypothesis tests for parameters of
interest, which also allows statistical power and type I error to be studied through simulations.
Finally, latent cluster analysis preserves the uncertainty of clustering by generating individual-level
posterior probabilities of cluster membership calculated from the observed data and model
parameter estimates. These advantages will be echoed in the next three chapters
when we elaborate on our novel latent cluster analysis model and its extensions, which
disentangle complex metabolic pathways.
1.2.2.3 iCluster and Extensions
iCluster, proposed by Shen and colleagues (2009), is a popular tool among the several available
for estimating latent clusters using multi-omics data. It generates integrative latent cluster
assignments by jointly fitting a latent variable model for each distinct source of multi-omics data.
Many researchers have applied iCluster in identifying novel subtypes of various cancer types,
such as lung cancer (Shen et al., 2009), breast cancer (Curtis et al., 2012), and glioblastoma
(Shen et al., 2012). Initially, iCluster was designed to harness only continuous multi-omics
measures. Nevertheless, Mo and others (2013) expanded its flexibility by presenting iCluster+,
an enhanced version of iCluster, with the capability of integrating some data types other than
continuous features, like categorical and count variables. This useful extension enables the tool
to further address many other sources of integrated data, such as somatic mutations and levels of
targeted or untargeted biomarkers in plasma. iCluster and its upgrade achieve sparsity by
following regularization paths to introduce lasso penalty terms into model likelihood, which
leads to a refined set of candidate multi-omics features and robust latent cluster assignments.
Although the competence of iCluster and its extensions is beyond doubt, several limitations may prevent them from delineating meaningful patterns hidden beneath integrated data. First, they ignore the roles that the various sources of multi-omics data play in the relevant metabolic pathway, leaving the causal direction of each omics layer with respect to the latent clusters undefined. Second, information from phenotypic traits of interest is dropped in iCluster modeling, leaving the underlying causal picture connecting the components of the integrated data in the biological process incomplete. Finally, neither iCluster nor iCluster+ provides a convenient option for statistical inference, so hypothesis testing and power evaluation for multi-omics features are not available.
1.2.3 Expectation-maximization (EM) Algorithm
When we build probabilistic models with latent variables or missing observations, obtaining MLEs for the model parameters is challenging because the objective function is the marginal likelihood based on the observed data only. Maximizing the marginal likelihood is problematic because this optimization problem is often non-convex or has many local maxima, making it analytically intractable. The EM algorithm is an iterative procedure that is broadly used when missingness (X) is involved in the collected data (𝒟). Its core idea is to find a simpler approximation to the marginal likelihood to serve as an alternative objective function.
1.2.3.1 Parameter Estimates
Specifically, the EM algorithm starts with initial guesses of the model parameters. In the E-step, the expectation of the complete-data log-likelihood $Q(\boldsymbol{\Theta} \mid \boldsymbol{\Theta}^{(t)})$, a lower bound of the marginal log-likelihood, is calculated using the observed data 𝒟 and the current posterior probability of X, $P(X \mid \mathcal{D}; \boldsymbol{\Theta}^{(t)})$, at iteration t. Since maximizing the complete-data likelihood is considerably easier than dealing with the marginal likelihood, the expected complete-data log-likelihood serves as the approximate objective function in each iteration. The M-step then maximizes the expected complete-data log-likelihood to obtain updated model parameters for iteration t+1. The algorithm iterates between these two steps until the convergence criteria are met.
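The iteration above can be sketched on a toy problem. The following is a minimal illustration of the E- and M-steps for a two-component univariate Gaussian mixture; the data, starting values, and model are synthetic stand-ins, not the models developed later in this dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: a mixture of N(0, 1) and N(5, 1)
x = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(5.0, 1.0, 500)])

# Initial guesses Theta^(0): mixing weights, means, variances
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 6.0])
var = np.array([1.0, 1.0])

def normal_pdf(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(50):
    # E-step: posterior probabilities P(X = j | data; Theta^(t))
    dens = np.stack([pi[j] * normal_pdf(x, mu[j], var[j]) for j in range(2)], axis=1)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted updates maximize the expected complete-data log-likelihood
    nj = resp.sum(axis=0)
    pi = nj / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nj
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nj

print(np.sort(mu))  # component means recovered near 0 and 5
```

With well-separated components, even crude starting values converge quickly; the sensitivity to initial values discussed below matters most when components overlap.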
1.2.3.2 Standard Errors (SE) of Parameter Estimates
Standard errors (SE) of MLEs are usually inferred from the inverse of an information matrix. When variables or observations are missing, the observed information matrix can in theory be obtained from the marginal log-likelihood, but additional steps are needed to recover the missing information before standard errors of the estimates can be found. The EM algorithm does not produce SEs automatically, but many procedures have been proposed to address this drawback. For example, the supplemented EM (SEM) algorithm proposed by Meng and Rubin (1991) obtains an asymptotic variance-covariance matrix for the parameter estimates. Briefly, SEM first evaluates the variance-covariance matrix for the complete data; next, a matrix recording the rate of convergence of EM for each parameter estimate, DM, is calculated; finally, the observed-data variance-covariance matrix is evaluated using the estimated complete-data variance-covariance matrix and DM. A more general option for obtaining the SE of parameter estimates is the bootstrap method based on resampling (Efron and Tibshirani, 1986), which embeds the EM algorithm in a large number of bootstrap loops to estimate the SE.
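As a sketch of the resampling idea, the snippet below embeds a simple estimator (the sample mean, standing in for a full EM fit) in bootstrap loops and compares the bootstrap SE with the analytic SE of the mean; all data are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(10.0, 2.0, 200)   # observed sample

def estimator(sample):
    # Stand-in for a full model fit (e.g., an EM run); here just the sample mean
    return sample.mean()

# Embed the estimator in B bootstrap resamples to estimate its SE
B = 2000
boot = np.array([estimator(rng.choice(data, size=data.size, replace=True))
                 for _ in range(B)])
se_boot = boot.std(ddof=1)

# Analytic SE of the mean for comparison: s / sqrt(n)
se_analytic = data.std(ddof=1) / np.sqrt(data.size)
```

For an actual EM fit, `estimator` would rerun the full EM algorithm on each resampled data set, which is why the bootstrap option is general but computationally expensive.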
In short, the EM algorithm is valid as long as the data are missing at random (MAR), so that the missingness mechanism can be ignored. It approximates the intractable MLE optimization task with easier, tractable iterative steps. As shown in Figure 1.3, maximizing the lower bound $Q(\boldsymbol{\Theta} \mid \boldsymbol{\Theta}^{(t)})$ keeps pushing up on the objective function itself, so the EM algorithm guarantees that the marginal likelihood improves after each iteration. The initial guesses of the model parameters are essential to the performance of the EM algorithm: reasonable, carefully chosen initial values can keep the algorithm from getting trapped at local optima and usually ensure convergence. These desirable properties make the EM algorithm well suited to optimization in missing-data analyses.
Figure 1.3 Illustration of EM algorithm
Adapted from Murphy (2012)
1.3 Regularization Paths for High-dimensional Data
1.3.1 Sparse Solutions through Regularized Regression
One common way to handle the high-dimensional challenge in the analysis of integrated data is model fitting controlled by regularization. The idea of regularization originates in the bias-variance decomposition of the objective function used to optimize model parameters (Geman et al., 1992). Obtaining unbiased estimators seems attractive; however, reducing bias escalates model complexity and leads to overfitting, which can considerably increase the variance of the parameter estimates (Figure 1.4). Conversely, underfitting decreases variance but introduces large bias into the parameter estimates, which also results in poor generalization. The bias-variance tradeoff therefore needs to be considered to reach a balance between goodness of fit and model complexity, which suggests that using biased estimators could be
reasonable to reduce the variance of parameter estimates during optimization. An intuitive way is to modify the objective function as the weighted sum of these two sources. In the setup of linear regression, let Y denote a response variable, X a matrix of predictors, and β a vector of model parameters; the optimization then becomes the following problem (Friedman et al., 2008; Friedman et al., 2010; Hastie et al., 2009)

$$ \min_{\boldsymbol{\beta} \in \mathbb{R}^p} \left[ \left\| \mathbf{Y} - \mathbf{X}\boldsymbol{\beta} \right\|_2^2 + \lambda P(\boldsymbol{\beta}) \right] \quad (1.4) $$

where $P(\boldsymbol{\beta})$ refers to the penalty term and $\lambda$ is the tuning parameter of the regularization. Alternatively, the above optimization can be interpreted as the bound-constrained form

$$ \min_{\boldsymbol{\beta} \in \mathbb{R}^p} \left\| \mathbf{Y} - \mathbf{X}\boldsymbol{\beta} \right\|_2^2, \quad \text{subject to } P(\boldsymbol{\beta}) \le B \quad (1.5) $$

under the Karush-Kuhn-Tucker conditions.
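A minimal sketch of solving (1.4) with an $L_1$ penalty by cyclic coordinate descent with soft-thresholding is shown below; the simulated design, the choice of λ, and the helper names are ours for illustration, not part of any particular package.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal step for the L1 penalty: values within [-t, t] map to exactly 0
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for min ||y - X b||_2^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]   # partial residual excluding feature j
            b[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return b

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.1, 100)  # only 2 active features
b_hat = lasso_cd(X, y, lam=20.0)
```

At this λ, the eight noise coefficients are set exactly to zero while the two active coefficients are retained (slightly shrunk), previewing the sparsity property discussed next.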
Figure 1.4 Bias-variance tradeoff
Adapted from Hastie et al. (2009)
1.3.1.1 Penalty Terms
Different choices of the penalty term $P(\boldsymbol{\beta})$ lead to different forms of regularization with different statistical properties. In integrated analyses, the corresponding design matrix is short and fat, reflecting its high dimensionality. Our real interest, however, is to discover the most parsimonious set of features for accurate modeling, both to reveal scientific insight into biological processes and to prevent overfitting. Therefore sparsity, which encourages components of $\boldsymbol{\beta}$ to be estimated as exactly zero, is desirable when selecting appropriate penalties.
Practically, several special cases of the $L_q$ penalty, $P(\boldsymbol{\beta}) = \|\boldsymbol{\beta}\|_q^q = \sum_{j=1}^{p} |\beta_j|^q$, are common choices. When q = 0, we get $L_0$ regularization, where $\|\boldsymbol{\beta}\|_0$ is the number of non-zero regression coefficients in the parameter vector. This is equivalent to best subset selection; however, the objective function is not convex when q < 1, which makes the optimization a hard task. In contrast, $L_2$ regularization, with $P(\boldsymbol{\beta}) = \sum_{j=1}^{p} \beta_j^2$, corresponds to ridge regression (Hoerl & Kennard, 1970) in the context of linear regression. Although ridge regression efficiently addresses overfitting and shrinks the variance of the parameter estimates, it does not promote sparsity because q > 1. It is therefore natural to choose q = 1, i.e. $L_1$ regularization with $P(\boldsymbol{\beta}) = \sum_{j=1}^{p} |\beta_j|$, the tightest convex approximation to $L_0$ regularization, so that sparse solutions can be obtained. The contours of the two-dimensional penalty constraint for a range of q (Figure 1.5), under the bound-constrained form of regularization, also support these findings.
Figure 1.5 Contours of constant value of $\sum_{j=1}^{p} |\beta_j|^q$ for given values of q, when p = 2
Essentially, $L_1$ regularization is the least absolute shrinkage and selection operator (lasso) proposed by Tibshirani (1996). The reason $L_1$ regularization yields sparse solutions is illustrated in Figure 1.6, using the bound-constrained form of regularization mentioned above. For a least-squares optimization problem in a simple linear regression with two predictors, the contours of the objective function and the penalty constraint are shown as red ellipses and blue surfaces, respectively. As the constraint B relaxes, the corners of the $L_1$ surface are more likely to intersect the objective function, which corresponds to sparse solutions lying on the coordinate axes. In contrast, the $L_2$ surface has no corners at which to intersect the axes, and therefore cannot select features. Thus, the lasso is better suited to our goal of revealing the most informative features in multi-omics data when we try to discover the underlying patterns of integrated data.
Figure 1.6 Comparison between $L_1$ and $L_2$ regularization when p = 2
Adapted from Hastie et al. (2009)
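The same geometric contrast can be seen algebraically in one dimension, where each penalty's shrinkage (proximal) operator has a closed form; the function names below are illustrative:

```python
def lasso_shrink(z, lam):
    # L1 proximal operator: inputs with |z| <= lam are set exactly to zero
    sign = 1.0 if z > 0 else -1.0
    return sign * max(abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    # L2 shrinkage: scales toward zero but never reaches it for z != 0
    return z / (1.0 + lam)

print(lasso_shrink(0.5, 1.0))  # 0.0  -> exact sparsity
print(ridge_shrink(0.5, 1.0))  # 0.25 -> shrunk, but non-zero
```

The soft-threshold's flat region around zero is the algebraic counterpart of the corners of the $L_1$ constraint surface.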
1.3.1.2 Tuning Parameter
The tuning parameter $\lambda$ is another very important component of regularization; however, its statistical meaning is more complicated than that of the penalty term. It is closely related to the constraint B in the bound-constrained form of the objective: the penalized objective function is the Lagrangian of that quadratic function with linear constraints, and $\lambda$ corresponds to the Lagrange multiplier (Murphy, 2012). A large B always implies a small $\lambda$, and vice versa. Another interpretation takes a Bayesian perspective on regularization. If we place a prior distribution on the model parameters with variance $\tau^2$ and let the error variance be $\sigma^2$, then $\lambda = \sigma^2 / \tau^2$ is the ratio of these two variances, manifesting the tradeoff between goodness of fit and model complexity. Both interpretations suggest that regularization paths show an increasing number of zero coefficients as $\lambda$ increases. Also, when $\lambda = \infty$, all the coefficients are zero; when $\lambda = 0$, the coefficients reduce to the ordinary least squares (OLS) solution in the context of linear regression.
Selecting the tuning parameter can be challenging. This model-selection problem is usually addressed by tuning $\lambda$ with K-fold cross-validation (CV), using the observed negative log-likelihood as the loss function; the best $\lambda$ is then the value minimizing CV($\lambda$). Another approach to finding the optimal $\lambda$ is guided by generalized information criteria (GIC) (Nishii, 1984). A GIC is a combined measure of model fit and complexity of the form

$$ \mathrm{GIC} = -2\log\hat{L} + a_n \, df \quad (1.6) $$

where $\hat{L}$ is the observed-data likelihood, $df$ is the degrees of freedom, and $a_n$ is some positive function of the sample size n. The commonly used Akaike information criterion (AIC) (Akaike, 1973) and Bayesian information criterion (BIC) (Schwarz, 1978) are both special cases of GIC. AIC ($a_n = 2$) has been reported to be asymptotically equivalent to cross-validation (Yang, 2005), but it becomes unstable in some overfitting situations (Shao, 1997). BIC ($a_n = \log n$), in contrast, seems more reliable for selecting the tuning parameter.
Many studies have suggested that the optimal tuning parameter minimizing BIC consistently identifies the parsimonious model with a refined subset of predictors (Wang et al., 2007; Chen et al., 2008). Additionally, given that most of the aforementioned findings are based on linear regression settings with limited dimensionality, several other GICs have been proposed to improve tuning-parameter selection for high-dimensional models, such as $a_n = \log\{\log n\}\log n$ and $a_n = \log\{\log n\}\log df$ (Fan and Tang, 2013; Chen et al., 2016).
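To make (1.6) concrete, the sketch below computes the Gaussian BIC ($a_n = \log n$) for polynomial regressions of increasing degree on simulated data whose true mean is linear; all data-generating values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(-2.0, 2.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)    # true model: degree 1

def gaussian_bic(x, y, degree):
    """BIC = -2 log L-hat + log(n) * df for a polynomial least-squares fit."""
    X = np.vander(x, degree + 1)               # columns: x^degree, ..., x, 1
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(((y - X @ beta) ** 2).sum())
    loglik = -n / 2.0 * (np.log(2.0 * np.pi * rss / n) + 1.0)  # profiled variance
    df = degree + 2                            # coefficients plus the error variance
    return -2.0 * loglik + np.log(n) * df

bics = {d: gaussian_bic(x, y, d) for d in range(6)}
best_degree = min(bics, key=bics.get)          # BIC favors the parsimonious fit
```

The underfit degree-0 model pays a large likelihood penalty, while the overfit degree-5 model pays the $\log(n)\,df$ complexity penalty, so the BIC minimum sits near the true, parsimonious model.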
Indeed, regularization or penalization is frequently adopted to avoid overfitting in high-dimensional settings, or when collecting data with a larger sample size is not feasible. Although some discrete model-selection procedures, such as stepwise selection and best subset selection ($L_0$ regularization), can refine a subset of informative predictors and discard the remaining noise, shrinkage methods based on regularized models are more continuous and substantially reduce variability. Among the various regularization paths that could be followed, the lasso's particular strength in promoting sparsity through $L_1$ regularization makes it preferred in high-dimensional domains, such as identifying informative features in high-throughput data. Owing to its popularity, many extensions of $L_1$ regularization have been proposed to adapt the lasso to various research settings, including the group lasso (Yuan and Lin, 2007; Meier et al., 2008), the fused lasso (Tibshirani et al., 2005), and the elastic net (Zou and Hastie, 2005). The elastic net, a combination of $L_1$ and $L_2$ regularization, can be very useful in practice: this hybrid of lasso and ridge regression amends several flaws of the lasso by handling correlated predictors more gracefully and by removing the sample-size limit on the number of selected features. Moreover, regularization paths through penalized likelihood generalize naturally to the EM framework, which makes the algorithm extremely powerful for elucidating latent patterns from the most informative variables in large data sets. We discuss the regularized EM algorithm in more detail in the next section.
1.3.2 Regularized EM Algorithm
As we briefly discussed above, missing data analysis, especially latent variable models, relies
heavily on the EM algorithm when estimating parameters. However, the conventional EM
algorithm does not have the capability of harnessing the utility of high-dimensional multi-omics
data or inducing sparsity in terms of feature selection. Since EM finds parameter estimates by
maximizing model likelihood, a regularized EM becomes a natural extension to solve the above
challenges with the penalized likelihood as the objective function. Practically, although the EM
is easy to implement and its convergence is statistically guaranteed, expanding this iterative
algorithm to a high-dimensional regime could be another story.
The first decision to make is whether to follow a regularization path at all. Previous work has highlighted the instability of the M-step in the high-dimensional setting, which makes it difficult to pick tuning parameters that balance optimization and statistical errors. Furthermore, obtaining the MLE from a penalized likelihood may be infeasible because the objective function with latent variables is often non-concave, causing EM to converge to a local optimum instead. The lack of a straightforward way to conduct statistical inference under regularization can also be a concern. A regularization-free EM-Truncation algorithm has been proposed as a computationally tractable approach to optimal estimation and asymptotic inference in high dimensions (Wang et al., 2014). However, owing to the extra T-step, this approach requires prior knowledge and assumptions about the sparsity structure, which restricts it to specific settings. In contrast, despite the inherent difficulties of regularization methods, the regularized EM generalizes to a variety of application settings much more conveniently, given careful screening for good tuning parameters.
If we decide to apply a penalized scheme, the next task is to find an appropriate place to introduce the regularization. Some statisticians impose the penalty terms, combined with the tuning parameter, as early as the E-step, in computing the expected complete-data log-likelihood $Q(\boldsymbol{\Theta} \mid \boldsymbol{\Theta}^{(t)})$, so that the regularization is then carried into the M-step (Li et al., 2005). However, it is counterintuitive to complicate the E-step with extra computational burden, given that the challenges of high-dimensional data arise mainly in the M-step. To handle overfitting and induce sparsity, many statisticians have instead suggested altering only the M-step, optimizing a modified objective function that combines the conventional $Q(\boldsymbol{\Theta} \mid \boldsymbol{\Theta}^{(t)})$ with an appropriate penalty (Chen et al., 2016; Jia and Wang, 2014), such as an $L_1$ penalty. Additionally, Yi and Caramanis (2015) proposed a more sophisticated regularized EM algorithm with not only a modified M-step but also a tuning parameter updated dynamically after each EM iteration. Although this method seems promising, it introduces two additional quantities, the statistical error and a contraction factor, that must be estimated for each tuning-parameter update, which increases the complexity of the EM algorithm, especially when multiple regularization parameters need to be tuned. With regard to exact inference, post-selection procedures or methods based on a Bayesian framework are usually adopted to construct confidence intervals and conduct hypothesis tests after informative features have been identified from high-dimensional data.
Further details on applying a regularized EM algorithm for optimization in our joint latent cluster analysis model and its extensions are discussed in the Methods sections of the next three chapters.
Chapter 2 Integrative Latent Cluster Assignment Using
Multi-Omics Data with Phenotypic Traits
2.1 Introduction
Advances in technology have made multi-omics data on human subjects accessible to researchers and practitioners. The availability of such data, including genomic data, somatic profiles, metabolite measurements, and proteomics information, has facilitated new insights into the underlying etiologic mechanisms of disease. For instance, in a study of 2000 primary breast tumors by Curtis et al., integrated analysis of copy number and gene expression data revealed novel subgroups with distinct clinical outcomes and provided a novel molecular stratification of the target population, suggesting an underlying mechanism in breast cancer biology (Curtis et al., 2012). Despite this great potential, multi-omics data present many analytic challenges. First, different sources of multi-omics data are often commingled in various underlying biological mechanisms and thus can have different types of effects on the outcome of interest, causing effect heterogeneity in the integrated data. Second, high-dimensional multi-omics sources, such as genomic and metabolomic data, consist of many thousands of variables, making the analysis of integrated data both analytically and computationally challenging.
One of several existing approaches to these difficulties is to estimate latent cluster assignments with regularization, achieving data integration and dimension reduction simultaneously. For instance, Shen et al. proposed a method called iCluster for integrative clustering of multiple omics data types via joint Gaussian latent variable models (Shen et al., 2009; Mo et al., 2013). iCluster captures the associations between data types and the covariance structure within data types in a single probabilistic framework, and it provides lasso-type sparse estimates of cluster-specific parameters by maximizing a penalized log-likelihood upon request (Tibshirani, 1996). Other integrative methods, built within the framework of causal mediation modeling, are also available (Huang et al., 2014; Huang et al., 2015). These approaches explore the direct and indirect effects of multiple data types on disease traits but often do not incorporate a data-reduction step that can aid overall interpretation. Additionally, metabolomic analyses such as mummichog (Li et al., 2013) focus on estimating functional activity and pathway relationships without formal identification of the individual untargeted metabolites.
In this chapter, we present a novel approach, Latent Unknown Clustering with Integrated
Data (LUCID), to perform integrative analyses using multi-omics data with phenotypic traits,
leveraging their underlying causal relationships in latent cluster estimations. A directed acyclic
graph (DAG), anchored on germline genetics, is used to illustrate the underlying biological
relevance of different components from integrated data in our method. A joint probabilistic
model with a latent variable for clustering integrates data in the DAG. Unlike other data-
reduction methods, the latent variable links various sources of multi-omics data explicitly to the
outcome of interest. An expectation–maximization (EM) algorithm is used to obtain likelihood-
based estimates of the latent cluster assignments and model parameters simultaneously (Little &
Rubin, 2002). Regularization paths are followed to accommodate settings with high-dimensional
data.
The rest of this chapter is organized as follows. In Section 2, we introduce the DAG and formalize the joint probabilistic model with a latent variable for clustering. Next, we develop an EM algorithm to optimize the complete-data likelihood, allowing for regularized-regression approaches that generate sparse solutions for high-dimensional data. In Section 3, we evaluate the performance of our approach through extensive simulations. We discuss the impact of the different sources of multi-omics data and of the outcome variable on clustering performance and parameter estimation, and we demonstrate the ability of our approach to choose the optimal number of clusters and to handle high-dimensional data. Additionally, we compare the performance of our method with the cluster assignments estimated using iCluster. Finally, we perform a sensitivity analysis for scenarios in which additional genomic features or expression profiles affect the outcome of interest directly rather than following the assumed framework. In Section 4, we apply our method to genetic, exposure, metabolomic, and outcome data from two applied studies, followed in Section 5 by a description of the R package "LUCIDus", which we developed to implement the LUCID model. Section 6 concludes the chapter with a discussion.
2.2 Methods
2.2.1 LUCID Latent Clustering Model
Information from multi-omics data and the phenotypic trait is integrated by jointly modeling their relationships through a latent variable for clustering, as illustrated by the directed acyclic graph (DAG) in Figure 2.1. In this DAG, $\mathbf{G}$ is an $n \times p$ genomic data matrix whose columns are genetic features and whose rows are samples; X is a categorical latent clustering variable with K levels for each subject; $\mathbf{Z}$ is an $n \times m$ matrix of standardized continuous bio-features for the same samples; and Y is an n-dimensional vector of the phenotypic trait. Circled variables in the DAG are model parameters. Following the causal directions delineated in the DAG, the distribution of X depends on the values of $\mathbf{G}$ through the parameter $\boldsymbol{\beta}$, the distribution of $\mathbf{Z}$ depends on the value of X through the parameters $\boldsymbol{\delta}$ and $\boldsymbol{\Sigma}$, and the distribution of Y depends on the value of X through the parameter $\boldsymbol{\gamma}$. In this framework, we assume that, given X, the components $\mathbf{Z}$ and Y are conditionally independent of each other and of $\mathbf{G}$.
Figure 2.1 Joint model integrating genetic features, metabolites, and phenotypic traits
Square vs. circle: observed data vs. latent clusters or unobserved model parameters; diamond: the penalty term for regularization, e.g. lasso.
The joint likelihood of the model parameters follows directly from the individual models of X given $\mathbf{G}$, $\mathbf{Z}$ given X, and Y given X. Let $f(X \mid \mathbf{G}; \boldsymbol{\beta})$, $f(\mathbf{Z} \mid X; \boldsymbol{\delta}, \boldsymbol{\Sigma})$, and $f(Y \mid X; \boldsymbol{\gamma})$ denote the probability mass functions (PMF) for discrete random variables or the probability density functions (PDF) for continuous random variables. Then the observed-data log-likelihood, based on the joint distribution of $\mathbf{Z}$ and Y conditional on $\mathbf{G}$, is

$$ l(\boldsymbol{\Theta}) = \sum_{i=1}^{n} \log f(\mathbf{Z}_i, Y_i \mid \mathbf{G}_i; \boldsymbol{\beta}, \boldsymbol{\delta}, \boldsymbol{\Sigma}, \boldsymbol{\gamma}) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} f(\mathbf{Z}_i, Y_i, X_i = j \mid \mathbf{G}_i; \boldsymbol{\beta}, \boldsymbol{\delta}, \boldsymbol{\Sigma}, \boldsymbol{\gamma}) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} f(X_i = j \mid \mathbf{G}_i; \boldsymbol{\beta}) \, f(\mathbf{Z}_i \mid X_i = j; \boldsymbol{\delta}, \boldsymbol{\Sigma}) \, f(Y_i \mid X_i = j; \boldsymbol{\gamma}) \quad (2.1) $$

where $\boldsymbol{\Theta}$ is a generic notation standing for all model parameters, including $\boldsymbol{\beta}$, $\boldsymbol{\delta}$, $\boldsymbol{\Sigma}$, and $\boldsymbol{\gamma}$.
The flexibility of modeling is enhanced by allowing different conditional distributions to be specified between the random variables, so extensions can be implemented easily and readily. As an example, assume X has k categories and follows a multinomial distribution conditional on $\mathbf{G}$, $\mathbf{Z}$ follows a multivariate normal distribution conditional on X, and Y follows a normal distribution conditional on X. Additionally, let $\phi(\cdot)$ be the PDF of a normal random variable, and let $S(\cdot)$ denote the sigmoid function when K = 2,

$$ S(\boldsymbol{\beta}, j, \mathbf{G}_i) = \frac{1}{1 + e^{-\mathbf{G}_i \boldsymbol{\beta}_j}} \quad (2.2) $$

and, more generally, the softmax function when K > 2,

$$ S(\boldsymbol{\beta}, j, \mathbf{G}_i) = \frac{e^{\mathbf{G}_i \boldsymbol{\beta}_j}}{\sum_{j'=1}^{k} e^{\mathbf{G}_i \boldsymbol{\beta}_{j'}}} \quad (2.3) $$

Then the log-likelihood of the model parameters becomes

$$ l(\boldsymbol{\Theta}) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} f(X_i = j \mid \mathbf{G}_i; \boldsymbol{\beta}) \, f(\mathbf{Z}_i \mid X_i = j; \boldsymbol{\delta}, \boldsymbol{\Sigma}) \, f(Y_i \mid X_i = j; \boldsymbol{\gamma}) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} S(\boldsymbol{\beta}, j, \mathbf{G}_i) \, \phi(\mathbf{Z}_i; \boldsymbol{\delta}_j, \boldsymbol{\Sigma}_j) \, \phi(Y_i; \gamma_j, \sigma_j) \quad (2.4) $$
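The observed-data log-likelihood (2.4) can be evaluated directly by summing the per-cluster joint densities. The sketch below does so for a toy K = 2 configuration with identity $\Sigma_j$; all parameter values and dimensions are hypothetical, and this is not the LUCIDus implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, m, k = 50, 3, 2, 2
G = rng.normal(size=(n, p))     # genetic features
Z = rng.normal(size=(n, m))     # standardized bio-features
Y = rng.normal(size=n)          # continuous outcome

beta = rng.normal(size=(k, p))                 # effects of G on cluster membership
delta = rng.normal(size=(k, m))                # cluster-specific bio-feature means
gamma = np.array([-1.0, 1.0])                  # cluster-specific outcome means
sigma = np.array([1.0, 1.0])                   # cluster-specific outcome SDs

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def normal_pdf(x, mu, sd):
    return np.exp(-((x - mu) / sd) ** 2 / 2.0) / (np.sqrt(2.0 * np.pi) * sd)

pX = softmax(G @ beta.T)                                        # f(X = j | G_i)
pZ = np.stack([normal_pdf(Z, delta[j], 1.0).prod(axis=1)        # f(Z_i | X = j), Sigma_j = I
               for j in range(k)], axis=1)
pY = np.stack([normal_pdf(Y, gamma[j], sigma[j])                # f(Y_i | X = j)
               for j in range(k)], axis=1)
loglik = np.log((pX * pZ * pY).sum(axis=1)).sum()               # Equation (2.4)
```

The same per-cluster density products reappear in the E-step below, where they are normalized row-wise to give the responsibilities.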
For illustration, the remainder of this chapter focuses on this setting. A model with a binary Y is discussed in Appendix I.
2.2.2 Joint Estimation of Latent Clusters via the EM Algorithm
We further define the posterior probability of belonging to latent cluster j, given the observed data and parameter values, as $P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})$. The parameters of interest include this posterior probability of latent cluster membership, the genetic effects $\boldsymbol{\beta}$, the cluster-specific bio-feature means $\boldsymbol{\delta}$ and covariance matrices $\boldsymbol{\Sigma}$, and the mean outcome effects $\boldsymbol{\gamma}$. To obtain maximum likelihood estimates (MLE) of the model parameters, we maximize the expectation of the complete-data log-likelihood; detailed derivations are presented in Appendix II. Specifically, the expectation of the complete-data log-likelihood for the DAG model is

$$ Q(\boldsymbol{\Theta}) = \sum_{i=1}^{n} \sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta}) \log \frac{S(\boldsymbol{\beta}, j, \mathbf{G}_i)}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})} + \sum_{i=1}^{n} \sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta}) \log \frac{\phi(\mathbf{Z}_i; \boldsymbol{\delta}_j, \boldsymbol{\Sigma}_j)}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})} + \sum_{i=1}^{n} \sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta}) \log \frac{\phi(Y_i; \gamma_j, \sigma_j)}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})} \quad (2.5) $$
The EM algorithm is applied to perform the optimization and estimate the model parameters. We first define the responsibility $r_{ij}$ as the conditional probability of subject i belonging to latent cluster j, given the observed data and the current parameter estimates at iteration t:

$$ r_{ij} = P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta}^{(t)}) = P(X_i = j \mid \mathbf{G}_i, \mathbf{Z}_i, Y_i; \boldsymbol{\Theta}^{(t)}) = \frac{f(X_i = j \mid \mathbf{G}_i; \boldsymbol{\Theta}^{(t)}) \, f(\mathbf{Z}_i \mid X_i = j; \boldsymbol{\Theta}^{(t)}) \, f(Y_i \mid X_i = j; \boldsymbol{\Theta}^{(t)})}{\sum_{j'=1}^{k} f(X_i = j' \mid \mathbf{G}_i; \boldsymbol{\Theta}^{(t)}) \, f(\mathbf{Z}_i \mid X_i = j'; \boldsymbol{\Theta}^{(t)}) \, f(Y_i \mid X_i = j'; \boldsymbol{\Theta}^{(t)})} \quad (2.6) $$
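Numerically, (2.6) is just a row-wise normalization of the per-cluster joint densities. The toy values below are illustrative, not from a fitted model:

```python
import numpy as np

# Hypothetical per-cluster joint densities
# f(X=j | G_i) f(Z_i | X=j) f(Y_i | X=j), for 4 subjects and K = 2 clusters
joint = np.array([[0.12, 0.04],
                  [0.01, 0.09],
                  [0.05, 0.05],
                  [0.20, 0.02]])

# Equation (2.6): normalize each row so responsibilities sum to one
r = joint / joint.sum(axis=1, keepdims=True)
print(r[0])  # [0.75 0.25]
```

Rows with similar densities across clusters (the third subject here) yield responsibilities near 0.5, which is exactly how the model preserves clustering uncertainty.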
In the E-step, we compute the expectation of the complete-data log-likelihood $Q(\boldsymbol{\Theta})$ using the responsibilities as imputed values for the posterior probabilities of the latent clusters. Substituting $r_{ij}$ for $P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})$ in (2.5), we now have:
$$ Q(\boldsymbol{\Theta}) = \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log \frac{S(\boldsymbol{\beta}, j, \mathbf{G}_i)}{r_{ij}} + \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log \frac{\phi(\mathbf{Z}_i; \boldsymbol{\delta}_j, \boldsymbol{\Sigma}_j)}{r_{ij}} + \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log \frac{\phi(Y_i; \gamma_j, \sigma_j)}{r_{ij}} = \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log S(\boldsymbol{\beta}, j, \mathbf{G}_i) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} (\mathbf{Z}_i - \boldsymbol{\delta}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{Z}_i - \boldsymbol{\delta}_j) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log |\boldsymbol{\Sigma}_j| - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} \frac{r_{ij} (Y_i - \mu_j)^2}{\sigma_j^2} - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log \sigma_j^2 - 3 \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log r_{ij} \quad (2.7) $$
The M-step then updates the parameter estimates by maximizing (2.7) computed in the current E-step. This yields the following estimates of $\boldsymbol{\beta}$, $\boldsymbol{\delta}$, $\boldsymbol{\Sigma}$, $\boldsymbol{\gamma}$, and $\boldsymbol{\sigma}$ at iteration t+1:

$$ \boldsymbol{\beta}^{(t+1)} = \arg\max_{\boldsymbol{\beta}} \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log S(\boldsymbol{\beta}, j, \mathbf{G}_i) \quad (2.8) $$

$$ \boldsymbol{\delta}_j^{(t+1)} = \frac{\sum_{i=1}^{n} r_{ij} \mathbf{Z}_i}{\sum_{i=1}^{n} r_{ij}} \quad (2.9) $$

$$ \boldsymbol{\Sigma}_j^{(t+1)} = \frac{\sum_{i=1}^{n} r_{ij} (\mathbf{Z}_i - \boldsymbol{\delta}_j^{(t+1)})(\mathbf{Z}_i - \boldsymbol{\delta}_j^{(t+1)})^T}{\sum_{i=1}^{n} r_{ij}} \quad (2.10) $$

$$ \gamma_j^{(t+1)} = \frac{\sum_{i=1}^{n} r_{ij} Y_i}{\sum_{i=1}^{n} r_{ij}} \quad (2.11) $$

$$ \sigma_j^{(t+1)} = \sqrt{ \frac{\sum_{i=1}^{n} r_{ij} (Y_i - \mu_j^{(t+1)})^2}{\sum_{i=1}^{n} r_{ij}} } \quad (2.12) $$
Although the maximization in (2.8) has no closed-form solution, it can be solved iteratively since it is a convex optimization problem (Venables & Ripley, 2002). Likewise, the parameters in (2.11) and (2.12) can be estimated by maximizing the log-likelihood of a linear regression predicting the continuous outcome from the posterior probabilities of the latent clusters. In doing so, we postulate an equal variance ($\sigma$) of the outcome of interest across the latent clusters, per the homoscedasticity assumption of linear regression. Specifically, when fitting $Y = \mu_1 r_1 + \mu_2 r_2 + \dots + \mu_k r_k + \varepsilon = \boldsymbol{\mu}^T \mathbf{r} + \varepsilon$,

$$ \hat{\boldsymbol{\gamma}}^{(t+1)} = \hat{\boldsymbol{\mu}}_{OLS}^{(t+1)} \quad (2.13) $$

$$ \hat{\sigma}^{(t+1)} = SD(\hat{\varepsilon}) = SD(\mathbf{Y} - \hat{\boldsymbol{\mu}}^T \mathbf{r}) \quad (2.14) $$
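The no-intercept regression in (2.13)-(2.14) can be sketched as follows; the responsibilities and cluster means here are simulated, so the fit only illustrates the mechanics of the update:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 300, 2
# Hypothetical responsibilities (each row sums to one) and true cluster means
r = rng.dirichlet([1.0, 1.0], size=n)
mu_true = np.array([2.0, -1.0])
Y = r @ mu_true + rng.normal(0.0, 0.5, n)

# (2.13): OLS of Y on the responsibilities, with no intercept column
mu_hat, *_ = np.linalg.lstsq(r, Y, rcond=None)
gamma_hat = mu_hat                      # cluster-specific outcome means

# (2.14): common residual SD across clusters (homoscedasticity)
sigma_hat = (Y - r @ mu_hat).std()
```

No intercept is needed because the responsibility columns sum to one in each row, so a constant term is already in the column span of the design.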
With the E-step and M-step formalized above, starting with a set of initial values of
parameter estimates, the algorithm iterates between these two steps until convergence. As the
number of latent clusters (K) needs to be pre-specified for LUCID, we recommend selecting an
optimal K using the Bayesian Information Criterion (BIC) by fitting the model over a range of
pre-specified values for K. The optimal K can be subsequently identified by the candidate model
with the minimum BIC. Details for the selection of the optimal K are shown in the simulation
study.
2.2.3 Sparse Solutions for High-dimensional Multi-omics Features
In the case of high-dimensional $\mathbf{G}$ and $\mathbf{Z}$, sparse solutions for $\boldsymbol{\beta}$, $\boldsymbol{\delta}$, and $\boldsymbol{\Sigma}$ are obtained via lasso-type penalties (Tibshirani, 1996; Hastie et al., 2009; Murphy, 2012). For $\boldsymbol{\beta}_{LASSO}$:

$$ \boldsymbol{\beta}_{LASSO}^{(t+1)} = \arg\max_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log S(\boldsymbol{\beta}, j, \mathbf{G}_i) - \rho_G \sum_{l=1}^{p+1} \sum_{j=1}^{k} |\beta_{lj}| \right\} \quad (2.15) $$

To obtain sparse estimates of $\boldsymbol{\delta}$ and $\boldsymbol{\Sigma}$, we implement the lasso-type scout approach (Witten and Tibshirani, 2008). $\boldsymbol{\Sigma}$ is first updated by the following maximization, which penalizes the $L_1$ norm of its inverse:

$$ (\boldsymbol{\Sigma}_j^{(t+1)})^{-1} = \arg\max_{\boldsymbol{\Sigma}_j^{-1}} \left\{ \log |\boldsymbol{\Sigma}_j^{-1}| - tr(\mathbf{S}_j \boldsymbol{\Sigma}_j^{-1}) - \rho_{\Sigma^{-1}} \| \boldsymbol{\Sigma}_j^{-1} \|_1 \right\} \quad (2.16) $$

where

$$ \mathbf{S}_j = \frac{\sum_{i=1}^{n} r_{ij} (\mathbf{Z}_i - \boldsymbol{\delta}_j^{(t)})(\mathbf{Z}_i - \boldsymbol{\delta}_j^{(t)})^T}{\sum_{i=1}^{n} r_{ij}} \quad (2.17) $$

is the empirical covariance matrix of $\mathbf{Z}$. Then $\boldsymbol{\delta}$ is updated by minimizing the following penalized distance:

$$ \boldsymbol{\delta}_j^{(t+1)} = \arg\min_{\boldsymbol{\delta}_j} \left\{ \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} (\mathbf{Z}_i - \boldsymbol{\delta}_j)^T [\boldsymbol{\Sigma}_j^{(t+1)}]^{-1} (\mathbf{Z}_i - \boldsymbol{\delta}_j) + \rho_{\Sigma^{-1}\delta_j} \| [\boldsymbol{\Sigma}_j^{(t+1)}]^{-1} \boldsymbol{\delta}_j \|_1 \right\} \quad (2.18) $$
Choosing an optimal combination of tuning parameters for our model when both genetic
features and biomarkers are high-dimensional could be immensely challenging. Also,
considering our primary goal is to estimate the latent cluster which is often unobservable, cross-
validation based on the prediction of cluster assignments, which would be the most intuitive
model selection scheme, is not feasible to choose tuning parameters for our LUCID model in
practice. BIC, a special case of the generalized information criterion (GIC) (Nishii, 1984), is a
widely used alternative (Fan and Tang, 2012). Thus, a parallel grid-search guided by BIC is
implemented to determine the optimal combination of three tuning parameters. The following
two other variants of GICs which are more stable and reliable for high-dimensional model
selection are also been considered for the high-dimensional settings (Chen et al., 2016).
$$\mathrm{GIC} = -2\log\hat{L} + a_n\, df \quad (2.19)$$

$$\mathrm{BIC} = -2\log\hat{L} + \log(n)\, df, \quad \text{when } a_n = \log(n) \quad (2.20)$$

$$\mathrm{GIC}_1 = -2\log\hat{L} + \log\{\log(n)\}\log(n)\, df, \quad \text{when } a_n = \log\{\log(n)\}\log(n) \quad (2.21)$$

$$\mathrm{GIC}_2 = -2\log\hat{L} + \log\{\log(n)\}\log(df)\, df, \quad \text{when } a_n = \log\{\log(n)\}\log(df) \quad (2.22)$$

where 𝐿̂ is the maximized value of the likelihood function of the LUCID model and df is the number of LUCID model parameters.
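Since each criterion is a simple function of the maximized log-likelihood, the sample size n, and the model degrees of freedom df, they can be sketched directly; the helper names below are hypothetical.

```python
# Hypothetical helpers computing the model-selection criteria in
# Eqs. 2.19-2.22 from a maximized log-likelihood, sample size n,
# and model degrees of freedom df.
import math

def bic(loglik, n, df):
    return -2 * loglik + math.log(n) * df                      # a_n = log(n)

def gic1(loglik, n, df):
    return -2 * loglik + math.log(math.log(n)) * math.log(n) * df

def gic2(loglik, n, df):
    return -2 * loglik + math.log(math.log(n)) * math.log(df) * df

print(round(bic(-1234.5, 2000, 30), 1))
```

Because log{log(n)} > 1 for the sample sizes used here, GIC₁ penalizes model complexity more heavily than BIC, which is why these variants are more conservative in high-dimensional settings.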
2.3 Simulation Study
2.3.1 Data Description and Scenario Settings
Extensive simulations using various choices of parameters for genetic variants, bio-features, and
outcome effects were performed. We generated data for 200,000 subjects for each simulation
scenario for 500 replicates following the defined model in Figure 2.1 and Equation 1, conditional
on pre-specified parameter values and number of clusters. Due to the conditional independence
of the model, we first simulated genetic features, then conditional on the genetics, a cluster value
is generated, and finally biomarker and outcome data are generated conditional on the cluster
assignment. From each simulated population, a random sample of 2,000 individuals is selected
and analyzed with LUCID to obtain parameter estimates and cluster assignments. We examine
several metrics to evaluate the performance of our model, including mean parameter estimates
and corresponding standard deviations of estimates over the 500 replicates. We also investigate
estimated latent cluster assignments using the area under the curve (AUC) by comparing estimated cluster probabilities to the true simulated cluster assignments. Using the model estimates,
we can predict the probability of each latent cluster and the corresponding outcome for given
individuals. For simulations, we use these predictions to calculate the mean square errors (MSE)
or AUC to evaluate performance. We pre-specify a range of candidate values for K in the
clustering for a simulation scenario with true K=4, and subsequently compare the model BICs to
determine the optimal K. Finally, we compare cluster assignments to iCluster and feature
selection to lasso and its variant for high-dimensional simulation scenarios. Scenarios with a
binary outcome are also simulated and studied. All the simulation scenarios are listed in
Appendix III Table A.1. Additionally, results based on simulations with similar settings but a
binary outcome are presented in Appendix IV.
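The conditional generation scheme just described (genetics first, then cluster given genetics, then biomarkers and outcome given cluster) can be sketched as follows; the parameter values are illustrative stand-ins, not the exact scenario settings of Appendix III.

```python
# Toy data generator following the conditional structure of the DAG:
# G -> latent cluster -> (Z, Y). Parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(42)
n, p, m, K = 2000, 10, 4, 2
beta = np.zeros((p, K)); beta[0, 1] = np.log(1.9)   # one causal SNP
delta = np.array([[0.0] * m, [0.8] * m])            # biomarker mean shift
gamma = np.array([0.0, 0.2])                        # outcome mean shift

G = rng.binomial(2, 0.3, size=(n, p))               # SNP dosages
logits = G @ beta                                   # cluster propensities
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
X = np.array([rng.choice(K, p=pr) for pr in probs]) # latent cluster
Z = delta[X] + rng.normal(size=(n, m))              # biomarkers | cluster
Y = gamma[X] + rng.normal(size=n)                   # outcome | cluster
```

Drawing a population and then sampling 2,000 individuals per replicate, as in the simulations above, simply repeats this scheme at a larger n before subsampling.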
2.3.2 Simulation Results
To demonstrate the performance and application of LUCID we first focus on low dimensional
scenarios. Here, we simulated 10 SNPs, 4 continuous biomarkers, and a continuous outcome for
each individual in the simulated population with the true number of latent clusters K=2. Each
replication of a particular scenario randomly draws 2000 samples from the generated population.
To facilitate the presentation of findings, we compare results to a single reference simulation scenario with K=2, 10 SNPs, 4 continuous biomarkers, and a continuous outcome, using OR_G = exp(𝛽) = 1.9 for the genetic effect, Δ_Z = 0.8 for the mean difference between biomarkers, and Δ_Y = 0.2 for the mean difference in the outcome across clusters. LUCID results
for this scenario are shown as the Sankey diagram (Schmidt, 2008) in Figure 2.2. This figure
visualizes clustering by delineating the integrative clusters in the middle with connections to
multi-omics features, genetic variants, and the outcome with color-coded positive and negative
links and the widths of the links proportional to the estimated effect size. Without feature
selection (Figure 2.2), all causal features contribute to the estimation of latent clusters while the
null features have small parameter estimates. If feature selection is applied via the regularization,
only the causal features will be identified.
Figure 2.2 Cluster-specific parameter estimates for the reference scenario
CG: causal SNP; NG: null SNP; CZ: causal biomarker; NZ: null biomarker; Y: outcome of interest; IntClust: integrative
cluster/subgroup. Links: light blue, negative effect; dark blue, positive effect; width, effect size. Nodes: yellow, integrative
cluster; purple, SNPs; cyan, biomarkers; brown, outcome of interest. Clustering results without feature selection when OR_G = 1.9, Δ_Z = 0.8, and Δ_Y = 0.2. Visualization indicates the simulated effects are recapitulated by performing LUCID.
2.3.2.1 Varying Model Parameter Effects on LUCID Performance
Figure 2.3 shows simulation results with fixed genetic effects (OR_G) across increasing mean differences in biomarkers between two clusters. Here, the average parameter estimates for both causal and null genetic effects become closer to the true effects while their standard deviations across replicates become smaller (Figure 2.3a). Similarly, the estimated Δ_Z and Δ_Y demonstrate increased accuracy and precision (Figures 2.3b and 2.3c). However, this increased accuracy in parameter estimation does not yield a substantial improvement in the average MSE for prediction of the continuous outcome, although the standard deviation of the prediction MSE across replicates does decrease (Figure 2.3d). In contrast, the AUC for cluster assignment improves substantially (Figure 2.3e).
Figure 2.3 Simulation results with varying biomarker effects
(a-c) Mean and 1 SD of parameter estimates based on simulations with increasing biomarker effects: solid lines, parameter estimates; dashed lines, true values. (d) Mean square error (MSE) in the prediction of the outcome. (e) Area under the curve (AUC) for integrative latent clustering.
Figure 2.4 shows simulation results for a fixed biomarker effect (Δ_Z) as the genetic effects (OR_G) are varied from 1.0 to 1.9. Overall, variation in OR_G is less influential on parameter estimates and latent cluster assignments, reflecting the ability of mean biomarker differences to drive cluster estimation. While increased OR_G does lead to more accurate parameter estimates and a larger AUC for cluster estimation, in general, parameter estimates for biomarkers (Figure 2.4b) and for the outcome (Figure 2.4c) do not change considerably once a sufficiently informative genetic effect is present.
Figure 2.4 Simulation results with varying genetic effects
(a-c) Mean and 1 SD of parameter estimates based on simulations with increasing genetic effects: solid lines, parameter estimates; dashed lines, true values. (d) Mean square error (MSE) in the prediction of the outcome. (e) Area under the curve (AUC) for integrative latent clustering.
Finally, increased simulated effects between the latent clusters and the outcome (shown in Figure 2.5) result in improved model performance, with more accurate parameter estimates (relative to true values indicated by the red dashed lines), a lower MSE for outcome prediction,
and a larger model AUC for cluster estimation. LUCID has the option of incorporating the
outcome into the cluster estimation. This results in a slightly improved AUC for classification,
compared to the estimation without leveraging the outcome of interest (0.799 vs. 0.778).
Figure 2.5 Clustering with varying outcome effects
(a-b) Diamonds, mean of statistics; error bars, 1 SD of statistics; broken lines, oracle AUCs.
As LUCID leverages the covariance between biomarkers as well as mean differences,
estimation can be improved when there exists a covariance structure for the biomarkers. With
fixed genetic and outcome simulated effects, the corresponding parameter estimates become
closer to the true effects for a scenario with the same cluster-specific means but different cluster-
specific covariance structures. This results in a substantially larger AUC for the estimated latent
cluster assignments (Figure 2.6).
Figure 2.6 Impact of a structured cluster-specific covariance matrix of biomarkers
2.3.2.2 Comparisons with Model Fits Using iCluster
To compare the performance of our method with iCluster, we first estimated the latent cluster
assignments using iCluster with simulated genetic features and biomarkers only for the setting of
OR_G = 2.0, Δ_Z = 3.6, and Δ_Y = 0.4. We then regressed the outcome on the estimated latent clusters to
estimate the outcome effect and perform hypothesis testing. Next, we reran the analysis by
adopting the proposed model using genetic and bio-features with and without the outcome of
interest. As shown in Figure 2.7, when applying iCluster for simulated data under the DAG
pathway, the outcome estimate is considerably different from the true effect, and the generated
cluster assignments provide only a moderate classification by producing an AUC of 0.71. In
contrast, when using our method, the estimated latent clusters give a much better classification
with a larger AUC of 0.99. Also, the estimate of the outcome effect is closer to the simulated
truth. Additionally, when we estimate the latent cluster assignments using our proposed model
including the outcome of interest, we obtain an even more accurate parameter estimate and a
nearly perfect AUC.
Figure 2.7 Comparing LUCID method with iCluster
Scenario setting: OR_G = 2.0, Δ_Z = 3.6, and Δ_Y = 0.4 under the DAG pathway. LUCID-woY: LUCID not incorporating the outcome;
LUCID-wY: LUCID incorporating the outcome. (a) iCluster and LUCID-woY: Clustering with simulated omics features only,
then regressing the outcome on the estimated latent subgroup for the outcome estimate. LUCID-wY: Estimating subgroups using
omics features with the outcome. (b) Subgroup classification AUC for iCluster, LUCID-woY, and LUCID-wY.
2.3.2.3 Choosing the Optimal Number of Latent Clusters (K)
For the simulation scenario with a continuous outcome in which there are 4 simulated latent clusters in the data, the model that pre-specifies K=4 indeed has the lowest BIC among models with pre-specified values of K ranging from 2 to 6 (Appendix Figure A.3), which indicates that model BIC is an efficient criterion for selecting the optimal K.
Figure 2.8 Model BIC using a varying pre-specified number of latent clusters (K)
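This BIC-guided choice of K can be illustrated with a small sketch using scikit-learn's GaussianMixture as a stand-in for the full LUCID model; the data and settings here are illustrative, not the dissertation's.

```python
# Illustration of BIC-guided selection of K, using a Gaussian mixture
# as a simplified stand-in for the full LUCID model. Four well-separated
# clusters are simulated, and K = 2..6 are compared by BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
Z = np.vstack([rng.normal(loc=c, size=(250, 3)) for c in (0, 4, 8, 12)])

bics = {k: GaussianMixture(n_components=k, random_state=1).fit(Z).bic(Z)
        for k in range(2, 7)}
best_k = min(bics, key=bics.get)       # the true K=4 should attain the minimum
print(best_k)
```

As in Figure 2.8, the BIC curve typically decreases sharply up to the true K and then flattens or rises as the complexity penalty dominates.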
2.3.2.4 Sensitivity Analysis to the Underlying Causal Mechanism
We performed additional scenarios to evaluate the performance of LUCID when
additional features have a direct effect on the outcome without going through the latent variable
in our DAG. This includes additional genetic effects and additional biomarkers. In each scenario,
we simulated the continuous outcome by conditioning on both the latent cluster variable and the
additional variables. We then applied LUCID directly to all variables and performed the analysis
across a range of pre-specified latent clusters. When the pre-specified number of clusters
corresponds to the simulated number, the estimated effects of additional biomarkers (AZ1 and
AZ2) or genes (AG1 and AG2) that are directly related to the outcome are very small compared
with the other causal features that truly contribute to the latent clusters (Figure 2.9a and c,
when K=2). However, when additional clusters are estimated, these clusters tend to reflect the
effects of the additional variables and improve model fit (i.e. BIC indicates that K=4 is optimal
for both scenarios, results not shown).
Figure 2.9 Sensitivity analysis for additional features
CG: causal SNP; NG: null SNP; CZ: causal biomarker; NZ: null biomarker; Y: outcome of interest; IntClust: integrative
cluster/subgroup. Links: light blue, negative effect; dark blue, positive effect; width, effect size. Nodes: yellow, integrative
cluster; purple, SNPs; cyan, biomarkers; brown, outcome of interest. (a-b) Adding two causal biomarkers (AZ1 & AZ2); (c-d)
adding two causal SNPs (AG1 & AG2) directly affecting the outcome.
2.3.2.5 High-dimensional simulation data
We simulated high-dimensional multi-omics data with 5000 genetic features (10 causal) and 500
biomarkers (10 causal). To facilitate simulations over many replicates, a single replicate was
selected for each scenario setting and used for a grid search to identify tuning parameters with a
minimum BIC. Then, for the remaining 500 replicates, the same tuning parameters were applied.
Figure 2.10 shows the inclusion probabilities for causal and non-causal features over 500 replicates for scenarios in which the genetic (OR_G), biomarker (Δ_Z), and outcome (Δ_Y) effects are varied. In general, we observe an ascending trend in inclusion probability for both causal genetic and biomarker features with increased biomarker effects (Δ_Z). We also see that increasing genetic effects improves the estimation of the latent clusters and decreases the inclusion probabilities for non-causal biomarkers. Similarly, increased outcome effects (Δ_Y) lead to improvements in the inclusion probabilities for both genetic features and biomarkers in almost all simulation scenarios.
Figure 2.10 Inclusion probabilities of multi-omics features
The referent scenario for a high-dimensional setting is OR_G = 2.0, Δ_Z = 1.0, and Δ_Y = 2.0. CG: causal G; NG: null G; CZ: causal Z; NZ: null Z. Arrows: varying component effects (yellow: genetic; red: biomarker; blue: outcome).
We also compare overall feature selection to lasso and group lasso (Yuan and Lin, 2006), with our integrated model identifying more causal genetic and biomarker features. Group lasso is specifically developed for parameter vectors structured into groups by data characteristics. Compared with these techniques, our integrative clustering method captures informative genetic and bio-features more accurately: although faster to compute, lasso identifies only the informative biomarkers, while group lasso always selects all the biomarkers and drops all the genetic features, both indicating poor feature-selection performance (Figure 2.11).
Figure 2.11 Comparison of three feature selection techniques
2.4 Applied Examples
2.4.1 Data Description
We apply LUCID to data from the Colon Cancer Family Registry (CFR). The Colon CFR is a
consortium of six centers across North America and Australia, organized to create a
comprehensive resource for clinical and epidemiological studies of colorectal cancer (CRC).
Details of the study design have been reported elsewhere (Haile et al., 1999; Newcomb et al.,
2007). For the subset sample used for the current analysis, the data were collected under a case-
control study design from the following centers: Sinai Health System (Ontario), the University of
Southern California/Cedars-Sinai Medical Center (USC/CSMC) consortium, the Fred
Hutchinson Cancer Research Center (FHCRC), the Mayo Clinic, University of Hawaii (UHI, not
included in this analysis) and the University of Melbourne. The recruitment included incident
cases of CRC via population-based cancer registries or clinical centers, as well as general
population or population-based sibship controls. All the cases and controls had genotype data
generated from germline DNA on 4 arrays, including Affymetrix Axiom, Illumina OncoArrary,
Illumina 1M/1M Duo, and Illumina Omni1. A total of 26,797 SNPs overlapped across the four arrays. All 100 incident cases and 1,798 controls had 12 targeted folate one-carbon
metabolism (FOCM) biomarkers measured (Reed et al., 2006), including 5-
methyltetrahydrofolate (5-MTHF), vitamin B2 (flavin mononucleotide [FMN] and riboflavin),
vitamin B6 (4-pyridoxic acid, pyridoxal, pyridoxamine), methionine (METH), S-Adenosyl
methionine (SAM), S-Adenosyl-L-homocysteine (SAH), total homocysteine (tHCY),
cystathionine (CYSTH), and creatinine. To meet the normality assumption of our model, all the
FOCM biomarkers were log-transformed. We also standardized all the targeted biomarkers as a
preparation for fitting the integrative clustering model. All subjects also reported other covariates, including demographic characteristics such as age and gender. To correct for
systematic effects on the biomarkers, each biomarker was adjusted for the glomerular filtration
rate (GFR). The residuals of biomarkers from regression models were then used as the biomarker
inputs in integrative clustering. We also adjusted for study center and the top five principal
components of the SNPs in the integrative clustering for confounding effects and population
stratification.
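The pre-processing steps above (log-transform, standardize, regress out GFR, keep the residuals as clustering inputs) can be sketched as follows, with simulated stand-in data rather than the CFR measurements.

```python
# Sketch of the biomarker pre-processing described above: log-transform,
# standardize, then replace each biomarker with its residual from a
# regression on a covariate (here a simulated stand-in for GFR).
import numpy as np

rng = np.random.default_rng(7)
n = 500
gfr = rng.normal(90, 15, size=n)                             # stand-in GFR
raw = np.exp(rng.normal(size=(n, 3)) + 0.01 * gfr[:, None])  # raw biomarkers

Z = np.log(raw)                                   # log-transform
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)          # standardize

X = np.column_stack([np.ones(n), gfr])            # design: intercept + GFR
coef, *_ = np.linalg.lstsq(X, Z, rcond=None)      # per-biomarker OLS fits
Z_resid = Z - X @ coef                            # clustering inputs
```

By construction, the residuals are uncorrelated with GFR, so systematic renal-function effects do not drive the subsequent clustering.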
We also apply LUCID to participants from the Study of Latino Adolescents at Risk of
Type 2 Diabetes (SOLAR) project, which is a longitudinal cohort that recruited participants in
two waves from 2001 to 2012 (Goran et al., 2004; Weigensberg et al., 2003). Participants lived
in Los Angeles, California, were Hispanic/Latino ethnicity (defined as self-reported
race/ethnicity for all participants, parents, and grandparents) and underwent repeated, detailed
phenotyping for clinical risk factors of type 2 diabetes as well as body composition testing. The
analysis used high-resolution metabolomics (HRM) profiling completed using
standardized methods (Soltow et al., 2013). Concentrations of PFOA, PFOS and PFHxS were
quantified by reference standardization (Go et al., 2015). We apply LUCID on participants with
complete data (n=40) integrating metabolome data with serum levels of perfluorohexane sulfonic
acid (PFHxS) as the exposure to estimate subgroups of children related to the risk for developing
type 2 diabetes.
2.4.2 Applied Examples Results
For the application to genetic variants, folate biomarkers, and colorectal cancer risk, we performed a grid search for the tuning parameters across a range of candidate numbers of latent clusters (K). BIC was used to select the tuning parameters; the optimal number of clusters K was found to be 2 (Figure 2.12).
Figure 2.12 Choosing the Optimal Number of Latent Cluster (K) for the CCFR Subset
When K=2, 2 SNPs (rs7070570 and rs1241696) and all 12 targeted biomarkers were
selected distinguishing two integrative subtypes of CRC (Figure 2.13). The first latent cluster is
characterized by the reference alleles for rs7070570 or rs1241696, substantially high levels of
METH, creatinine, pyridoxamine, CYSTH, SAH (Fig. 2.13 dark blue links), and relatively low
levels of riboflavin, tHCY, pyridoxal, FMN (Fig. 2.13 light blue links). The second cluster
highlights a positive effect of rs7070570 and a negative effect for rs1241696, considerably low
levels of METH, creatinine, pyridoxamine, CYSTH, SAH, and high levels of riboflavin, tHCY,
pyridoxal, FMN. The second latent cluster has an estimated OR=1.91 for risk for CRC. Pairwise
association estimates are shown in Appendix V Table A.2 and reflect the relationship among
three model components from LUCID. Post hoc analyses investigating the association of each
factor to the estimated cluster probabilities shows that both informative SNPs and many FOCM
biomarkers are significantly associated with posterior probabilities of integrative clusters
(Appendix V Table A.3).
Figure 2.13 Unique feature levels in each estimated cluster of CRC risks
Links: light blue, negative effect; dark blue, positive effect; grey, outcome reference subgroup; red, high-risk subgroup; green, low-risk subgroup; width, effect size. Nodes:
yellow, integrative cluster; purple, SNPs; cyan & blue, biomarkers; brown, CRC risks. IntClust: integrative cluster/subgroup.
To demonstrate the use of LUCID for prediction, we used a stratified 10-fold cross-
validation scheme to examine the prediction of CRC risks. Specifically, for each fold, 10% of
cases and 10% of controls were reserved as the test set for the trained model achieved from the
rest of cases and controls so that the original ratio of cases to controls could be preserved. The
LUCID model with K=2 and estimated parameters from fitting LUCID on each training data set
was used to obtain the posterior probabilities for the 2 latent clusters and to subsequently
estimate CRC outcomes for each subject in the test set. The mean AUC for classification was
0.82 with a SD of 0.05. In comparison, a similar procedure for a logistic regression with lasso on
all features directly results in a mean validation AUC=0.69 with a SD of 0.14 (Figure 2.14).
Figure 2.14 Outcome AUC for the FOCM biomarker study based on 10-fold cross-validation
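A sketch of this stratified 10-fold scheme, with a penalized logistic model standing in for the fitted LUCID predictor and simulated stand-in data:

```python
# Sketch of stratified 10-fold cross-validation: each fold holds out
# ~10% of cases and ~10% of controls, preserving the case:control
# ratio, and test-set AUCs are averaged across folds.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 20))
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0] - X[:, 1])))   # two causal features

aucs = []
for train, test in StratifiedKFold(n_splits=10, shuffle=True,
                                   random_state=3).split(X, y):
    model = LogisticRegression(penalty="l1", solver="liblinear")
    model.fit(X[train], y[train])
    aucs.append(roc_auc_score(y[test], model.predict_proba(X[test])[:, 1]))

print(round(float(np.mean(aucs)), 2), round(float(np.std(aucs)), 2))
```

The mean and SD of the fold-wise AUCs summarize out-of-sample classification, mirroring the 0.82 (SD 0.05) reported for LUCID above.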
In the group of 40 Latino adolescents recruited in SOLAR, we applied LUCID to
estimate latent clusters of participants at increased risk of developing type 2 diabetes based on
their PFHxS exposure and plasma metabolites level. Figure 2.15 shows that a latent cluster of
adolescents (“Cluster 2”) is associated with increased risk of developing type 2 diabetes
(OR=1.86), indicated by an increased change in 2h-postprandial glucose levels from baseline to
the follow-up visit. This high-risk cluster is positively associated with the PFHxS exposure (blue
line connecting PFHxS to “Cluster 2”) and well characterized by an altered plasma metabolite
pattern, including increased plasma levels of palmitoylcarnitine, tyrosine, valine, isoleucine,
alanine, and cysteine (blue lines connecting “Cluster 2” to metabolites). Previous studies support robust positive associations between branched-chain amino acids (e.g., valine and isoleucine), aromatic amino acids (e.g., tyrosine), and acylcarnitines and the risk of type 2 diabetes (Schumacher et al., 2016; Wang et al., 2011). This example demonstrates the ability of LUCID to distinguish underlying heterogeneity of integrated data with a broader scope of exposures.
Figure 2.15 Integrated analysis of PFHxS and serum metabolites for the identification of a subgroup with increased risks of
developing type 2 diabetes
Links: light blue, negative effect; dark blue, positive effect; grey, outcome reference subgroup; red, high-risk subgroup; green, low-risk subgroup; width, effect size. Nodes: yellow, integrative cluster;
purple, PFHxS; cyan & blue, metabolites; brown, Increased change in 2-hr glucose levels. IntClust: integrative cluster/subgroup. The red line connecting “Cluster 2” and “Increased change in 2-hr
glucose levels” indicates that children in the latent “Cluster 2” were at higher risk for developing type 2 diabetes (OR=1.86), compared to those in “Cluster 1” (reference).
2.5 Implementation – LUCIDus R Package
The LUCID method is implemented in the R package LUCIDus which is available on CRAN
(https://CRAN.R-project.org/package=LUCIDus). Specifically, LUCID model fitting, with an option for feature selection, and a parallel grid search to find the best combination of tuning parameters for high-dimensional data can be performed with the est_lucid() and tune_lucid() functions, respectively. More experimental features can be explored in the developer version of LUCIDus, available on GitHub (https://github.com/USCbiostats/LUCIDus). The usage of the nine functions in the R
package can be found in the documentation and vignettes attached to LUCIDus. Details about the
usage and workflow of the LUCIDus R package are discussed in Chapter 5.
2.6 Discussion
For the analysis and integration of multi-omics data, many options have been proposed,
including solutions based on a high-dimensional mediation framework (Huang et al., 2015;
Zhang et al., 2016), methods focusing on data and dimension reduction such as principal
component analysis (PCA), factor analysis, and regularized regression (Gareth et al., 2013;
Hastie et al., 2009). Clustering and pathway approaches are also available in the community,
including the aforementioned iCluster (Shen et al., 2009) and mummichog (Li et al., 2013), as
well as other multi-staged analysis (Ritchie et al., 2009). In this paper, LUCID performs
clustering or subgroup estimation while simultaneously integrating multi-omics data with a
direct relationship to a phenotypic trait of interest through a latent variable. LUCID allows for
parameter estimation and feature selection in both low and high-dimensional data. Extensive
simulations demonstrate the potential benefits of incorporating multi-omics information from
various sources for estimating the relevant latent subgroups, the features influencing those
subgroups, and the impact of the subgroup on the trait outcome. The simulations also reveal the sensitivity of performance as the components (i.e., genetics, biomarkers, and outcome) vary in informativeness. Even for a more complicated pathway of a biological process, LUCID is still
capable of detecting the causal features leading to the underlying heterogeneity of latent
subgroups. We also demonstrated the use of LUCID for prediction of latent clusters and
outcomes. When calculating classification AUC to evaluate prediction performance with
simulated data, we used an internal sample. Although these AUCs will be slightly inflated,
simulations for a subset of scenarios with prediction in an independent data set demonstrate that
the inflation is minor (Figure 2.16). Moreover, the simulations are designed to demonstrate the
trends and sensitivity of prediction as feature effects are varied.
Figure 2.16 AUC for classification with internal and external simulated data
The assumption of conditional independence for the underlying DAG in LUCID
facilitates estimation. While this assumption may not be a direct representation of the underlying
biological reality, the ability to estimate multiple latent groups allows for identification of
genetic and biomarker features that act on the outcome jointly or directly. Implementation of
each component within a regression framework has many advantages, including straightforward
inclusion of covariates for the genetic linear predictors and for the outcome with respect to
integrative latent clusters. For practical applications, when covariates are included in the
conditional probability of the latent clusters given genetic features, their corresponding roles are
similar to predictors for the latent clusters. In contrast, the inclusion of covariates in the
conditional probability for the outcome given the latent clusters is akin to more conventional
adjustment for confounding. If systematic effects on biomarkers are present, we suggest
adjustment for relevant factors in a pre-analysis with an incorporation of the corresponding
residuals in the analysis as the biomarker inputs for LUCID.
Our modeling framework allows for flexibility by permitting every conditional
distribution to vary according to specific characteristics. Potential future extension can more
easily and readily be implemented, such as inclusion of survival outcomes via the corresponding
conditional likelihood or more specific modeling of categorical biomarker measurements. This
flexibility also extends to the potential specification of alternative penalty terms for the
regularized regression, such as elastic net (Zou and Hastie, 2006), a hybrid between Lasso and
ridge, or other penalties such as adaptive Lasso and minimax concave penalty (Zou, 2006;
Breheny and Huang, 2011). The current implementation of LUCID has three tuning parameters
for the corresponding regularized regressions. Conventional approaches for estimating these
tuning parameters, such as cross-validation, would be very computationally intensive. Thus, we
opted for a BIC-guided grid search to determine the optimal combination of the three tuning
parameters. Within the current implementation for LUCID, we have both a regularization and an
EM algorithm focused on estimation of latent subgroups, feature selection, and corresponding
parameter estimates. LUCID does not currently estimate standard errors for these parameter
estimates. To obtain standard errors we suggest performing a post-hoc analysis in which the
posterior estimates for the latent subgroups obtained from LUCID are then used to perform each
component regression. For example, the outcome can be regressed on the posterior latent
subgroups using either linear or logistic regression. These estimated standard errors will be only
approximate as they will not include the uncertainty in both the feature selection and in the
estimation of the latent subgroups. Future extensions to LUCID may include either a 2-step post-
selection procedure by first applying the regularized EM algorithm to perform dimension
reduction, followed by estimating model parameters and their standard errors via a supplemented
EM (SEM) framework (Meng and Rubin, 1991) or a bootstrapping procedure (Efron and
Tibshirani, 1986). Details about LUCID statistical inference options are discussed in Chapter 4.
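The suggested post-hoc regression can be sketched as follows, using simulated stand-ins for the posterior subgroup probabilities rather than actual LUCID output.

```python
# Sketch of the suggested post-hoc analysis: regress the outcome on
# the posterior probability of subgroup membership via OLS to obtain
# an approximate standard error for the subgroup effect. As noted
# above, this SE ignores selection and latent-cluster uncertainty.
import numpy as np

rng = np.random.default_rng(5)
n = 1000
post = rng.uniform(size=n)            # stand-in posterior P(cluster 2 | data)
y = 0.4 * post + rng.normal(size=n)   # continuous outcome with a 0.4 effect

X = np.column_stack([np.ones(n), post])            # intercept + posterior
beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # OLS estimates
resid = y - X @ beta
sigma2 = resid @ resid / (n - 2)                   # residual variance
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])  # SE of the slope
print(round(float(beta[1]), 3), round(float(se), 3))
```

For a binary outcome, the same idea applies with logistic regression of case status on the posterior probabilities.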
The way to implement the regularized EM algorithm has been controversial, and recently proposed algorithms have recommended inconsistent schemes. Previous research suggested that penalty terms combined with the tuning parameter should be imposed early, at the E-step, in computing the expected complete-data log-likelihood Q(𝚯|𝚯^(t)), with regularization then sequentially involved in the M-step (Li et al., 2005). However, unnecessarily complicating the E-step with extra computational burden is counterintuitive, especially considering that the challenges of high-dimensional data arise mainly in the M-step. Thus, to efficiently handle overfitting and to induce sparsity, many statisticians have proposed altering only the M-step, optimizing a modified objective function that combines the conventional Q(𝚯|𝚯^(t)) with an appropriate regularization, such as 𝐿₁ penalties (Chen et al., 2016; Jia and Wang, 2014). However, others have criticized that the regularized M-step fails to keep a prudent balance between making progress towards the solution and identifying the true structure, which makes the M-step of an EM algorithm unstable in a high-dimensional regime (Balakrishnan et al., 2014; Wainwright, 2014). Furthermore, Yi and Caramanis (2015) proposed a more sophisticated version of the regularized EM algorithm with not only a modified M-step but also a dynamically updated tuning parameter after each EM iteration. Although this method seems promising, it introduces two additional elements to be estimated, the statistical error and the contraction factor, which are required for the tuning-parameter update. This increases the complexity of the EM algorithm, particularly when multiple regularization parameters need to be tuned.
For applied implementation and to facilitate interpretation of the estimation subgroups
and corresponding features, researchers may opt to first limit the selection of genetic factors
and/or biomarkers to more specific measurements, such as those biomarkers only representing
specific metabolic pathways of interest. This involves understanding the study design for
sample/measurement collection, as well. Our DAG implicitly delineates a temporal order of the
components, expecting that the genetic measures (𝐆 ) occur before the biomarker measures (𝐙 )
and the outcome. This assumption certainly holds for germline genetic factors, but it also implies
that LUCID is not necessarily limited to incorporation of genetic factors only in the prediction
model for subgroups. Other factors may be included, such as long-term environmental exposures
known to occur temporally before the measured biomarkers and outcome. Similarly, the DAG
structure also implies that the estimated latent variable occurs temporally before the outcome,
and since the biomarker measured influence subgroup estimation, these are also measured before
the outcome. Thus, estimation with cross-sectional or case-control data needs to be performed
with care. For cross-sectional data in which the biomarkers and outcome traits are obtained at the
same timepoint, LUCID may be used to find subgroups and relevant factors as well as for
determining association, but a causal interpretation is limited (Rothman et al., 2008). Similarly, a
case-control study with biological samples (e.g. plasma or urine) obtained after diagnosis may be
susceptible to reverse-causation with the possibility of the outcome/disease modifying the
subsequent biomarker measures. Compared to clustering with features only, jointly estimating
latent subgroups by incorporating the outcome can more efficiently distinguish the underlying
disparities in disease outcome across subgroups. In our applied example, rs7070570, an intronic
variant in the CTNNA3 gene relevant to cardiac symptoms (Janssens et al., 2003) and
Alzheimer's disease (Miyashita et al., 2007), and rs1241696 (with no previously found
indication) have differential influences on latent subgroups that further impact the blood levels of
many targeted FOCM biomarkers (2nd subgroup: low in METH and SAH; high in SAM and
tHCY), which may cause a dysfunction of the methionine cycle and altogether result in a high
CRC risk for a certain subgroup. Notably, no significant effects of SNPs on targeted FOCM
biomarkers were found in the pairwise associations, which may imply that the underlying
heterogeneity is being driven primarily by the marginal associations between CRC risks and
SNPs or FOCM biomarkers in the analytic sample.
Another issue in the application of LUCID to real data analysis is missingness,
particularly for the biomarker data. This may be a result of the nature of the high-throughput data
(e.g. microbiome, metabolomics) or as part of the study design. For example, to avoid reverse
causation in case-control samples one option is to only measure biomarkers in controls and/or on
prospectively measured cases (as in our applied example). To cope with the missingness, listwise
deletion is the most widely considered strategy. Alternatively, if the data are missing at random
and sporadically across individuals, imputation may be used, although inadequate handling of
missing patterns and mechanisms could lead to biased estimates and invalid inferences (Little
and Rubin, 2002). A future planned extension to LUCID includes formally estimating missing
biomarkers within our EM estimation framework. In Chapter 3, the incomplete LUCID methods
are discussed in detail.
In summary, our integrative clustering can be applied to distinguish unique genotype
and biomarker effects in each estimated latent cluster in relation to outcome risk. Specifically,
each estimated molecular subtype will be distinct in terms of genetic alterations, gene
expression, biomarker levels, and outcome profiles. The results of our applied analysis have
revealed many interesting genes and informative FOCM biomarkers that can be exploited in the
future as potential indicators of colorectal cancer. These statistically identified clusters could have
biological implications. When targeted biological processes meet the DAG model assumption of
causation, our method could offer a better prediction through estimated latent clusters, compared
to predicting outcomes with genotypes and bio-features directly, considering that integrative
clustering reflects not only informative features but also meaningful interactions between them
across distinct subgroups. Also, by applying regularization methods, the finally selected genetic
alterations and checkpoint biomarkers give researchers and clinicians a molecular probe to
detect the vulnerability of disease subtypes. Patient-specific therapy could be further enhanced
with improved subtype classification, which in turn leads to better diagnostics as well as
prognostics. Thus, discovering the molecular subtypes of disease and subgroups of patients with
the LUCID method could be an ultimate solution for personalized medicine and efficient
targeted treatment since they heavily depend on a specific molecular subtype and the immuno-
stage of a disease.
2.7 Acknowledgements
Thank you to Drs. Vaia Lida Chatzi and Ran Jin for providing SOLAR data in the applied
analyses.
2.8 Funding Support
This work was supported by the National Cancer Institute at the National Institutes of Health
Grant P01 CA196569 and R01 CA140561.
Chapter 3 Integrative Clustering with Incomplete Bio-
feature Measurements
3.1 Introduction
Missing data have historically been a persistent and notorious problem in various research
settings, such as longitudinal studies and clinical trials. Fast developments in biology and
genetics in recent decades have made integrated data available for researchers to
comprehensively understand the biological process involved in the metabolic pathway. The
LUCID model we proposed in Chapter 2 offers an efficient and applicable approach to jointly
estimate underlying patterns of multi-omics measures with phenotypic traits; however, rich
multi-omics information comes as high-throughput datasets obtained by simultaneously
measuring genetic features and other bio-features (e.g. SNPs, gene expression, and metabolites),
which increases the risk of incomplete data from the different omics sources. Too
many missing values in one or more types of omics data may severely restrict the effective
number of observations in the final analytic sample, detracting from the performance
of our model and similar methods and leading to inaccurate integrative latent cluster estimation
and invalid inference. Thus, a thorough assessment of missing patterns and mechanisms is
critical for integrated data so that necessary actions can be taken before latent cluster discovery.
An organized data set with a rectangular shape, which is a matrix with rows representing
observations or subjects and columns referring to features of interest measured for each subject,
is often expected when statistical analyses are performed. Missing data issues emerge in practice
when some entries of the data matrix are not collected, which may result from various reasons
such as instrument failures, experimental limitations, etc. For instance, each source of multi-
omics data we mentioned above could have a considerable amount of missing data due to quality
control pipeline or the expense of measurements. Missing data problems cast challenges on data
analysis because they can impact statisticians’ decision from many directions. Typically, most of
the available statistical packages exclude subjects with missing values from any variables in a set
when analyses are conducted with missing data. However, this listwise deletion strategy of
getting complete cases for analysis is not ideal because it damages the structure of the data
matrix, which is intended to retain the shape of the original study design. Drawbacks of listwise
deletion include introducing serious bias into the analytic sample and devastating statistical
power in detecting hypothesized effects, which urges statisticians to harness the utility of missing
data with more thoughtful methods.
In order to propose more appropriate approaches, the patterns and mechanisms of
missingness must first be understood. As discussed in Chapter 1, Little and Rubin (2002) classify
missing-data patterns into univariate, multivariate, monotone, disjoint, general, and latent. Although
missing patterns give researchers a direct impression of the missingness, the reasons underlying
these observed patterns can be more helpful in building statistical methods
to handle missing-data challenges. In terms of missing mechanisms, Little and Rubin (2002)
categorized them into three groups: missing completely at random (MCAR), missing at
random (MAR), and not missing at random (NMAR). The difference among these
three categories is characterized by statistically scrutinizing the relationship between the complete
data, the missing pattern, and some unknown parameters. Let 𝒟 = (𝒟_obs, 𝒟_mis) denote the
complete data matrix, where 𝒟_obs and 𝒟_mis refer to the observed and missing parts of the data
respectively, let M denote the missing pattern, and let Θ denote the unknown parameters. The MCAR and
MAR data are defined as
$$ f(M \mid \mathcal{D}, \Theta) = f(M \mid \Theta) \tag{3.1} $$
and
$$ f(M \mid \mathcal{D}, \Theta) = f(M \mid \mathcal{D}_{obs}, \Theta) \tag{3.2} $$
respectively; if the missing values in the data matrix 𝒟 impact the missing pattern in any way,
then the missing mechanism of the data is NMAR. When the data are MCAR or at least MAR,
the missing mechanism is ignorable because the missingness parameter is separable from other
parameters of interest during statistical inference through data likelihood. In contrast, when the
data are NMAR, the missing mechanism should not be ignored in the analysis to obtain unbiased
parameter estimates and valid inference.
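To make these definitions concrete, the short numpy sketch below simulates both mechanisms for a toy bio-feature matrix. The logistic form of the MAR model, the variable names, and all numeric settings are illustrative choices, not taken from this dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 4
Z = rng.normal(size=(n, m))           # complete bio-feature matrix
Y = Z[:, 0] + rng.normal(size=n)      # a fully observed outcome

# MCAR: the missingness probability is a constant, unrelated to any data
M_mcar = rng.random((n, m)) < 0.2

# MAR: the missingness probability depends only on observed data (here, Y)
p_mar = 1.0 / (1.0 + np.exp(-(Y - 1.0)))    # higher Y, more missingness
M_mar = rng.random((n, m)) < p_mar[:, None]

# under MAR the per-subject missing fraction tracks Y; under MCAR it does not
corr_mar = np.corrcoef(M_mar.mean(axis=1), Y)[0, 1]
corr_mcar = np.corrcoef(M_mcar.mean(axis=1), Y)[0, 1]
```

Diagnostics of this kind (relating missingness indicators to observed variables) can suggest MAR over MCAR, although NMAR can never be ruled out from the observed data alone.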
In the context of analyzing integrated data, a number of statistical methods based on
multiple imputation (MI) have been proposed when an issue of missing data is encountered
during genetic features collection, especially for whole-genome sequencing data intended for use
in genome-wide association studies (GWAS). The basic idea supporting MI is, under the
assumption of MAR, to substitute each missing value with a set of plausible candidates
containing reasonable uncertainty of imputation, which is generated from its predictive
distribution given observed data (Little & Rubin, 2002). Moreover, the accuracy of most current
imputation methods applying statistical methods, such as the Hidden Markov Model (HMM), the
Expectation-Maximization (EM) algorithm, and regularized regression for high-dimensional
data, heavily depends on using haplotype patterns in reference panels to predict missing
genotypes in study panels (Deng et al., 2016; Jiang et al., 2016). Many global initiatives have
provided researchers with reliable reference panels for imputation, including the 1000 Genomes
Project and the historical International HapMap Project, to guarantee the quality of genotype
imputations. Unfortunately, no dependable reference panels could be easily adopted for features
from other omics data sources, including information from epigenomics, transcriptomics,
proteomics, and metabolomics. Besides, the omic-wise missingness of certain subjects may
further constrain the applicability and accuracy of MI, which makes MI inappropriate as a
solution for missingness among these omics data.
Because of the challenges in harnessing the utility of high-dimensional multi-omics features
with a considerably large missing fraction, a model-based likelihood approach that accounts
for missing values in bio-features is desirable. For latent clustering
with both incomplete multi-omics data and phenotypic traits, the lack of statistical models offering
integrative clusters in this particular setting necessitates a modification of the LUCID model
proposed in Chapter 2. Indeed, the DAG model can be expanded to handle integrated
data in which a substantial number of subjects have MAR omics features.
Specifically, the DAG model can be conveniently modified because of its conditional
independence assumption, which leads to a clear likelihood decomposition including observed
and missing parts separately.
The rest of this chapter is organized as follows. In Section 2, we first introduce the
modified DAG and formalize the revised joint probabilistic model through the latent variable for
clustering. We then apply the revised LUCID EM algorithm to optimize the complete-data
likelihood under the missing patterns, adding lasso-type penalty terms for sparse solutions in
high-dimensional settings. In Section 3, we evaluate the performance of our LUCID method with missing
metabolomics data through extensive simulations. The impact of missing fractions of bio-feature
data on the performance of clustering and parameter estimation is also discussed. The ability of
our approach to handle high-dimensional data with missing values is demonstrated. Also, we
describe a feature update for our LUCIDus R package for implementation, highlighting this
option to handle clustering using integrated data with missingness. In Section 4, we apply our
incomplete LUCID method to the data from the Colon Cancer Family Registry (CFR), including extra
subjects with complete genetic and outcome data but missing values in metabolomics (Haile et al.,
1999). We conclude this chapter with a discussion in Section 5.
3.2 Methods
3.2.1 Revised DAG Latent Cluster Model
Integrating information from multi-omics data with missing observations and phenotypic traits
proceeds by jointly modeling their relationships through a latent variable for clustering, which is
illustrated by the directed acyclic graph (DAG) in Figure 3.1. In this revised DAG, 𝐆 is a
genomic data matrix of dimension 𝑛 ×𝑝 with columns being genetic features and rows being
samples, 𝑋 is a categorical latent clustering variable for each subject, 𝐙 is a 𝑛 ×𝑚 dimensional
matrix of standardized continuous bio-features for the same samples but with some MCAR or
MAR observations, and Y is an n-dimensional vector of phenotypic traits, while all Greek letters
in the DAG denote model parameters. Following the causal directions delineated in the DAG, the
distribution of X depends on the genetic features 𝐆 with parameter β, the distribution of 𝐙 depends
on the latent variable X with parameters δ and Σ, and the distribution of the outcome Y depends on X
with parameter γ. As before, we assume the conditional independence structure in which the joint
distribution factorizes into f(X | 𝐆), f(𝐙 | X), and f(Y | X).
Figure 3.1 Joint model integrating genetic features, incomplete metabolites,
and phenotypic traits
3.2.2 Missingness Pattern and Likelihood Partition
To adapt the missing pattern of 𝐙 to the DAG latent variable model, we further denote an n × m
missing-data indicator matrix 𝐌 = {M_ic}, where

$$ M_{ic} = \begin{cases} 1, & \text{if } Z_{ic} \text{ is missing} \\ 0, & \text{if } Z_{ic} \text{ is observed.} \end{cases} \tag{3.3} $$
As we proposed in Chapter 2, to establish the joint likelihood of model parameters, let
𝑓 (𝑋 |𝑮 ;𝜷 ), 𝑓 (𝒁 |𝑋 ;𝜹 ,𝚺 ) and 𝑓 (𝑌 |𝑋 ;𝜸 ) denote the probability density functions (PDF) for
continuous random variables or probability mass functions (PMF) for discrete random variables.
Compared to the general pattern of missingness in metabolomics, a special situation in which the
missingness is omic-wise, meaning that none of the metabolites is observed for certain subjects,
requires more caution and will be discussed first. For this particular case, the log-likelihood based
on the joint distribution of X, 𝐙 and Y conditional on 𝐆 is:

$$ l(\Theta) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} f(X_i = j \mid \mathbf{G}_i; \boldsymbol{\beta})\, \big[ f(\mathbf{Z}_i \mid X_i = j; \boldsymbol{\delta}, \boldsymbol{\Sigma}) \big]^{I(\sum_{c=1}^{m} M_{ic} = 0)}\, f(Y_i \mid X_i = j; \boldsymbol{\gamma}) \tag{3.4} $$
where Θ is a generic vector consisting of all model parameters, including β, δ, Σ and γ,
and I(·) is the indicator function defined as:

$$ I\Big(\sum_{c=1}^{m} M_{ic} = 0\Big) = \begin{cases} 1, & \text{if } \sum_{c=1}^{m} M_{ic} = 0 \\ 0, & \text{if } \sum_{c=1}^{m} M_{ic} \ge 1, \end{cases} \tag{3.5} $$

which indicates whether subject i has complete measurements of the bio-features or not.
Basically, we perform a listwise deletion only for the component of bio-features while keeping
all the observations with complete measurements in genomics data and phenotypic traits.
Alternatively, we could rewrite equation (3.4) through a log-likelihood partition separated
by subjects with observed or missing bio-features, i.e.

$$ l(\Theta) = l_{obs}(\Theta) + l_{mis}(\Theta) = \sum_{i=1}^{n_{obs}} \log \sum_{j=1}^{k} f(X_i = j \mid \mathbf{G}_i; \boldsymbol{\beta})\, f(\mathbf{Z}_i \mid X_i = j; \boldsymbol{\delta}, \boldsymbol{\Sigma})\, f(Y_i \mid X_i = j; \boldsymbol{\gamma}) + \sum_{i=1}^{n_{mis}} \log \sum_{j=1}^{k} f(X_i = j \mid \mathbf{G}_i; \boldsymbol{\beta})\, f(Y_i \mid X_i = j; \boldsymbol{\gamma}) \tag{3.6} $$
where 𝑛 𝑜𝑏𝑠 and 𝑛 𝑚𝑖𝑠 stand for sample sizes of observed and missing subjects respectively,
indicating 𝑛 =𝑛 𝑜𝑏𝑠 +𝑛 𝑚𝑖𝑠 . This partition explicitly demonstrates how we incorporate omic-
wise missing patterns into the log-likelihood of the model parameters. When this log-likelihood is used as the
objective function for parameter estimation, the parameter estimates for 𝐙 are clearly
informed only by subjects with completely observed bio-features; meanwhile, all the subjects in the
study sample contribute to the estimation of the parameters of the other components,
including the assignment of latent clusters.
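To illustrate the partition in (3.6), the numpy sketch below evaluates the marginal log-likelihood with the 𝐙 factor dropped for subjects whose biomarkers are missing omic-wise. It assumes a softmax model for f(X | 𝐆), a normal outcome model, and, purely for brevity, a diagonal Σ_j; the function name and array conventions are our own, not the LUCIDus implementation.

```python
import numpy as np

def log_lik_partition(G, Z, Y, M, beta, delta, Sigma, gamma, sigma):
    """Marginal log-likelihood l(Theta) = l_obs + l_mis (eq. 3.6) with the
    Z factor dropped for omic-wise missing subjects; k latent clusters."""
    # f(X=j | G; beta): multinomial logistic (softmax) over clusters
    eta = G @ beta                                  # (n, k)
    pX = np.exp(eta - eta.max(axis=1, keepdims=True))
    pX /= pX.sum(axis=1, keepdims=True)
    # f(Y | X=j; gamma, sigma): normal outcome density, (n, k)
    fY = np.exp(-0.5 * ((Y[:, None] - gamma) / sigma) ** 2) \
         / (sigma * np.sqrt(2 * np.pi))
    # f(Z | X=j; delta, Sigma): multivariate normal with diagonal Sigma
    var = np.diagonal(Sigma, axis1=1, axis2=2)      # (k, m)
    fZ = np.exp(-0.5 * np.sum((Z[:, None, :] - delta) ** 2 / var, axis=-1)) \
         / np.sqrt((2 * np.pi) ** Z.shape[1] * var.prod(axis=-1))
    omic_missing = M.all(axis=1)                    # no biomarkers observed
    fZ = np.where(omic_missing[:, None], 1.0, fZ)   # drop the Z factor
    return np.log((pX * fZ * fY).sum(axis=1)).sum()
```

Because the 𝐙 factor is simply replaced by 1 for omic-wise missing subjects, the value is unaffected by whatever placeholder their 𝐙 rows happen to contain.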
As mentioned in the previous chapter, allowing different conditional distributions to be
specified among the random variables facilitates the flexibility of joint modeling, which enables
extensions to be implemented conveniently. For illustration purposes, we assume 𝑋 has 𝑘
categories and follows a multinomial distribution conditioning on 𝐆 , 𝐙 follows a multivariate
normal distribution conditioning on 𝑋 , and Y follows a normal distribution conditioning on 𝑋 .
Additionally, if we denote 𝑆 (∙) to be a sigmoid function, 𝜙 (∙) to be the PDF for a normal random
variable, and further define the posterior probability of belonging to latent cluster j given
observed data and parameter values as 𝑃 (𝑋 𝑖 =𝑗 |𝓓 ;𝚯 ), then the marginal log-likelihood of
model parameters, which is the joint distribution of 𝑋 , 𝐙 and Y conditioning on 𝐆 , becomes:
$$ l(\Theta) = \sum_{i=1}^{n_{obs}} \log \sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}_{obs}; \Theta)\, \frac{S(\boldsymbol{\beta}, j, \mathbf{G}_i)\, \phi(\mathbf{Z}_i; \delta_j, \Sigma_j)\, \phi(Y_i; \gamma_j, \sigma_j)}{P(X_i = j \mid \mathcal{D}_{obs}; \Theta)} + \sum_{i=1}^{n_{mis}} \log \sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}_{mis}; \Theta)\, \frac{S(\boldsymbol{\beta}, j, \mathbf{G}_i)\, \phi(Y_i; \gamma_j, \sigma_j)}{P(X_i = j \mid \mathcal{D}_{mis}; \Theta)} \tag{3.7} $$
The remainder of this chapter focuses on this setting. A model with a binary outcome 𝑌 is
discussed in the Appendix VI.
For a more general missing pattern with sporadic missing values in the metabolomics
data matrix, the incomplete LUCID method for omic-wise missingness will not be a good fit
since it drops many observed metabolites for subjects without complete measurements. To avoid
losing so much information, another scheme for the LUCID method dealing with the general
missing pattern needs to be proposed. Specifically, we can first impute each missing value
in the bio-features using the observed mean of the corresponding feature, i.e.

$$ Z_{ic}^{(1)} = I(M_{ic} = 1) \cdot \mu_c^{obs} + I(M_{ic} = 0) \cdot Z_{ic} \tag{3.8} $$
The log-likelihood based on the joint distribution of 𝑋 , 𝐙 with imputation and 𝑌 conditioning on
𝐆 can then be formulated as the complete LUCID method described in Chapter 2.
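A minimal sketch of this initial imputation, assuming 𝐙 is stored as an n × m numpy array and M is the boolean indicator matrix of (3.3); the function name is ours:

```python
import numpy as np

def mean_impute(Z, M):
    """Initial imputation Z^(1) (eq. 3.8): each missing Z_ic is replaced by
    the mean of feature c over its observed entries; observed values are
    left unchanged."""
    mu_obs = np.nanmean(np.where(M, np.nan, Z), axis=0)   # observed means
    return np.where(M, mu_obs, Z)
```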
3.2.3 Joint Estimating through Latent Cluster via EM Algorithm
The parameters of interest include the posterior probability of latent cluster membership, genetic
effect 𝜷 , cluster-specific bio-feature means 𝜹 and covariance matrices 𝚺 , and mean effect of the
outcome γ. Since the above marginal log-likelihood (3.7) is often non-concave, direct
maximization to obtain maximum likelihood estimations (MLE) of model parameters can be
extremely challenging in terms of optimization. We use an approximate procedure by
maximizing the expectation of complete data log-likelihood instead, which is a lower bound of
the marginal log-likelihood, to estimate model parameters. When the missing values are sporadic
and imputed first, the expectation of complete data log-likelihood does not change but uses
imputed integrated data. The situation for omic-wise missingness is more complicated. Detailed
derivations are presented in Appendix VII. Specifically, the expectation of the complete-data
log-likelihood for the omic-wise incomplete LUCID model is:

$$ \begin{aligned} Q(\Theta) ={} & \sum_{i=1}^{n}\sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}; \Theta) \log \frac{S(\boldsymbol{\beta}, j, \mathbf{G}_i)}{P(X_i = j \mid \mathcal{D}; \Theta)} \\ & + \sum_{i=1}^{n}\sum_{j=1}^{k} I\Big(\sum_{c=1}^{m} M_{ic} = 0\Big)\, P(X_i = j \mid \mathcal{D}; \Theta) \log \frac{\phi(\mathbf{Z}_i; \delta_j, \Sigma_j)}{P(X_i = j \mid \mathcal{D}; \Theta)} + \sum_{i=1}^{n}\sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}; \Theta) \log \frac{\phi(Y_i; \gamma_j, \sigma_j)}{P(X_i = j \mid \mathcal{D}; \Theta)} \\ ={} & \sum_{i=1}^{n_{obs}}\sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}_{obs}; \Theta) \log \frac{S(\boldsymbol{\beta}, j, \mathbf{G}_i)}{P(X_i = j \mid \mathcal{D}_{obs}; \Theta)} + \sum_{i=1}^{n_{obs}}\sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}_{obs}; \Theta) \log \frac{\phi(\mathbf{Z}_i; \delta_j, \Sigma_j)}{P(X_i = j \mid \mathcal{D}_{obs}; \Theta)} \\ & + \sum_{i=1}^{n_{obs}}\sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}_{obs}; \Theta) \log \frac{\phi(Y_i; \gamma_j, \sigma_j)}{P(X_i = j \mid \mathcal{D}_{obs}; \Theta)} + \sum_{i=1}^{n_{mis}}\sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}_{mis}; \Theta) \log \frac{S(\boldsymbol{\beta}, j, \mathbf{G}_i)}{P(X_i = j \mid \mathcal{D}_{mis}; \Theta)} \\ & + \sum_{i=1}^{n_{mis}}\sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}_{mis}; \Theta) \log \frac{\phi(Y_i; \gamma_j, \sigma_j)}{P(X_i = j \mid \mathcal{D}_{mis}; \Theta)}. \end{aligned} \tag{3.9} $$
Next, the EM algorithm is applied to handle optimization and estimate model parameters.
Under the assumption that the missing data in 𝐙 are at least MAR with an ignorable missing
mechanism, the EM algorithm is still valid to obtain MLE (Little & Rubin, 2002). When the
missing pattern is sporadic, the responsibility (𝑟 ), which is the conditional probability of being in
latent cluster j given observed data and current values of parameter estimates, is defined as
before. However, we will update the imputed values with the weighted mean of the biomarker
across latent clusters for each iteration of the LUCID EM algorithm instead of keeping the
imputed values constant. Specifically, we can modify the original LUCID EM algorithm with an
extra step that makes the overall imputation dynamic:

$$ Z_{ic}^{(t)} = I(M_{ic} = 1) \cdot \sum_{j=1}^{k} \lambda_j^{(t)} \delta_{jc}^{(t)} + I(M_{ic} = 0) \cdot Z_{ic}^{(1)} \tag{3.10} $$
where λ_j^(t) is the mixture fraction of latent cluster j at iteration t, defined as

$$ \lambda_j^{(t)} = \frac{\sum_{i=1}^{n} r_{ij}^{(t)}}{n}. \tag{3.11} $$
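The dynamic update can be sketched as follows; per the verbal description above, the imputed value is the mixture-weighted cluster-specific biomarker mean, and the function name and array layout are our own conventions.

```python
import numpy as np

def dynamic_impute(Z1, M, r, delta):
    """Iteration-t update (eqs. 3.10-3.11): each missing Z_ic becomes
    sum_j lambda_j * delta_jc with lambda_j = mean_i r_ij, while observed
    entries (carried over from Z^(1)) are kept as-is."""
    lam = r.mean(axis=0)            # mixture fractions lambda_j, shape (k,)
    weighted = lam @ delta          # weighted biomarker means, shape (m,)
    return np.where(M, weighted, Z1)
```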
For the special situation that integrated data have omic-wise missingness in biomarkers, a
partition for the responsibilities should first be conducted. We calculate the modified
responsibilities for subjects with complete and missing bio-feature measures based on missing
patterns accordingly, which are defined as the conditional probability of being in latent cluster j
given observed data and current values of parameter estimates at iteration t, i.e.:
$$ \begin{aligned} r_{ij} &= P(X_i = j \mid \mathcal{D}; \Theta^{(t)}) = P(X_i = j \mid \mathcal{D}_{obs}, \mathcal{D}_{mis}; \Theta^{(t)}) = P(X_i = j \mid \mathbf{G}_i, \mathbf{Z}_i, Y_i; \Theta^{(t)}) \\ &= \frac{f(X_i = j \mid \mathbf{G}_i; \Theta^{(t)})\, \big[f(\mathbf{Z}_i \mid X_i = j; \Theta^{(t)})\big]^{I(\sum_{c=1}^{m} M_{ic} = 0)}\, f(Y_i \mid X_i = j; \Theta^{(t)})}{\sum_{j'} f(X_i = j' \mid \mathbf{G}_i; \Theta^{(t)})\, \big[f(\mathbf{Z}_i \mid X_i = j'; \Theta^{(t)})\big]^{I(\sum_{c=1}^{m} M_{ic} = 0)}\, f(Y_i \mid X_i = j'; \Theta^{(t)})}, \end{aligned} \tag{3.12} $$
which indicates that

$$ r_{ij} = \begin{cases} \dfrac{f(X_i = j \mid \mathbf{G}_i; \Theta^{(t)})\, f(\mathbf{Z}_i \mid X_i = j; \Theta^{(t)})\, f(Y_i \mid X_i = j; \Theta^{(t)})}{\sum_{j'} f(X_i = j' \mid \mathbf{G}_i; \Theta^{(t)})\, f(\mathbf{Z}_i \mid X_i = j'; \Theta^{(t)})\, f(Y_i \mid X_i = j'; \Theta^{(t)})}, & \text{for subjects with complete } \mathbf{Z}, \\[2ex] \dfrac{f(X_i = j \mid \mathbf{G}_i; \Theta^{(t)})\, f(Y_i \mid X_i = j; \Theta^{(t)})}{\sum_{j'} f(X_i = j' \mid \mathbf{G}_i; \Theta^{(t)})\, f(Y_i \mid X_i = j'; \Theta^{(t)})}, & \text{for subjects with missing } \mathbf{Z}. \end{cases} \tag{3.13} $$
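The two cases in (3.13) amount to dropping the biomarker factor before normalizing. A compact sketch, assuming the three component densities have already been evaluated as (n, k) arrays; the function name and array shapes are our own conventions:

```python
import numpy as np

def responsibilities(pX, fZ, fY, omic_missing):
    """E-step responsibilities r_ij: pX, fZ, fY hold f(X=j|G), f(Z|X=j)
    and f(Y|X=j) per subject and cluster; the Z factor is replaced by 1
    for subjects with omic-wise missing biomarkers."""
    fZ_eff = np.where(omic_missing[:, None], 1.0, fZ)
    num = pX * fZ_eff * fY
    return num / num.sum(axis=1, keepdims=True)
```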
In the E-step, we compute the expectation of complete data log-likelihood 𝑄 (𝚯 )
empirically by using responsibilities as imputed values for posterior probabilities of latent
clusters. Thus, substituting P(X_i = j | 𝒟; Θ) with r_ij in (3.9), we now have:

$$ \begin{aligned} Q(\Theta) ={} & \sum_{i=1}^{n}\sum_{j=1}^{k} r_{ij} \log \frac{S(\boldsymbol{\beta}, j, \mathbf{G}_i)}{r_{ij}} + \sum_{i=1}^{n}\sum_{j=1}^{k} I\Big(\sum_{c=1}^{m} M_{ic} = 0\Big)\, r_{ij} \log \frac{\phi(\mathbf{Z}_i; \delta_j, \Sigma_j)}{r_{ij}} + \sum_{i=1}^{n}\sum_{j=1}^{k} r_{ij} \log \frac{\phi(Y_i; \gamma_j, \sigma_j)}{r_{ij}} \\ ={} & \sum_{i=1}^{n}\sum_{j=1}^{k} r_{ij} \log S(\boldsymbol{\beta}, j, \mathbf{G}_i) - \frac{1}{2} \sum_{i=1}^{n_{obs}}\sum_{j=1}^{k} r_{ij} (\mathbf{Z}_i - \boldsymbol{\delta}_j)^T \Sigma_j^{-1} (\mathbf{Z}_i - \boldsymbol{\delta}_j) - \frac{1}{2} \sum_{i=1}^{n_{obs}}\sum_{j=1}^{k} r_{ij} \log |\Sigma_j| \\ & - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{k} r_{ij} \frac{(Y_i - \gamma_j)^2}{\sigma_j^2} - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{k} r_{ij} \log \sigma_j^2 - 2 \sum_{i=1}^{n}\sum_{j=1}^{k} r_{ij} \log r_{ij} - \sum_{i=1}^{n_{obs}}\sum_{j=1}^{k} r_{ij} \log r_{ij}. \end{aligned} \tag{3.14} $$
The M-step is then to update parameter estimates by maximizing (3.14) computed in the
current E-step. This results in the following estimates of 𝜷 , 𝜹 , Σ, 𝜸 and 𝝈 , which all take the
missing pattern into consideration:
$$ \boldsymbol{\beta}^{(t+1)} = \arg\max_{\boldsymbol{\beta}} \sum_{i=1}^{n}\sum_{j=1}^{k} r_{ij} \log S(\boldsymbol{\beta}, j, \mathbf{G}_i) \tag{3.15} $$

$$ \boldsymbol{\delta}_j^{(t+1)} = \frac{\sum_{i=1}^{n_{obs}} r_{ij} \mathbf{Z}_i}{\sum_{i=1}^{n_{obs}} r_{ij}} \tag{3.16} $$

$$ \Sigma_j^{(t+1)} = \frac{\sum_{i=1}^{n_{obs}} r_{ij} (\mathbf{Z}_i - \boldsymbol{\delta}_j^{(t+1)})(\mathbf{Z}_i - \boldsymbol{\delta}_j^{(t+1)})^T}{\sum_{i=1}^{n_{obs}} r_{ij}} \tag{3.17} $$

$$ \gamma_j^{(t+1)} = \frac{\sum_{i=1}^{n} r_{ij} Y_i}{\sum_{i=1}^{n} r_{ij}} \tag{3.18} $$

$$ \sigma_j^{(t+1)} = \sqrt{\frac{\sum_{i=1}^{n} r_{ij} (Y_i - \gamma_j^{(t+1)})^2}{\sum_{i=1}^{n} r_{ij}}} \tag{3.19} $$
Maximizing the objective in (3.15) is a convex optimization problem, which is
equivalent to applying multinomial logistic regression to predict latent cluster probabilities using
genetic features. Likewise, the parameter estimates in (3.18) and (3.19) can be obtained through a linear
regression predicting a continuous outcome with posterior probabilities of latent clusters.
Estimating model parameters through a regression framework has many advantages. First,
covariates that need to be adjusted during the estimations of genetic and outcome effects with
respect to integrative latent clusters could be easily included in our joint modeling. Moreover, the
generalized linear model can be easily applied so that other types of outcome variables, such as
binary, count, and survival responses, could also be handled accordingly.
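The closed-form updates (3.16)-(3.19) are responsibility-weighted means and (co)variances, as in a standard Gaussian mixture M-step. A numpy sketch (the β update of (3.15), an iterative weighted multinomial logistic fit, is omitted; the function name and shapes are our own):

```python
import numpy as np

def m_step(Z_obs, Y, r, r_obs):
    """M-step updates (eqs. 3.16-3.19): Z_obs and r_obs cover only the
    n_obs subjects with complete biomarkers; Y and r cover all n subjects."""
    w_obs = r_obs.sum(axis=0)                         # (k,)
    delta = (r_obs.T @ Z_obs) / w_obs[:, None]        # weighted means of Z
    k, m = delta.shape
    Sigma = np.empty((k, m, m))
    for j in range(k):                                # weighted covariances
        D = Z_obs - delta[j]
        Sigma[j] = (r_obs[:, j, None] * D).T @ D / w_obs[j]
    w = r.sum(axis=0)
    gamma = (r.T @ Y) / w                             # weighted means of Y
    sigma = np.sqrt((r * (Y[:, None] - gamma) ** 2).sum(axis=0) / w)
    return delta, Sigma, gamma, sigma
```

With one-hot responsibilities these reduce to per-cluster sample statistics, which is a convenient sanity check.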
3.2.4 Sparse Solutions for High-dimensional Multi-omics Features
When the multi-omics features 𝐆 and 𝐙 are high-dimensional, sparse solutions for β, δ and Σ are
often favorable in terms of elucidating the associations with the most informative
features. Practically, the sparsity of parameter estimates can be promoted by incorporating
regularization methods into the joint estimation via the EM algorithm above. Since we
assume the genetic data are completely measured, the sparse solutions for both the omic-wise and
general incomplete LUCID methods are identical to the ones for the original LUCID method.
Specifically, to achieve sparse solutions for β, a lasso-type penalty (Tibshirani, 1996), i.e. L_1
regularization, is added to the objective function of (3.9), resulting in the following estimate
of β_LASSO:

$$ \boldsymbol{\beta}_{LASSO}^{(t+1)} = \arg\max_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n}\sum_{j=1}^{k} r_{ij} \log S(\boldsymbol{\beta}, j, \mathbf{G}_i) - \lambda_G \sum_{l=1}^{p+1}\sum_{j=1}^{k} |\beta_{lj}| \right\} \tag{3.20} $$
For biomarkers with missing observations, imputed data under a general pattern of
missingness yield the same sparse solutions as derived in Chapter 2. For omic-wise
missingness among biomarkers, achieving sparse solutions is restricted to using subjects with
overall complete biomarker measures. To induce the sparsity of 𝜹 and 𝚺 , the lasso-type scout
proposed by Witten and Tibshirani (2008) is applied. We first update 𝚺 by maximizing the
following objective while additionally penalizing the L_1 norm of its inverse:

$$ \big(\Sigma_j^{(t+1)}\big)^{-1} = \arg\max_{\Sigma_j^{-1}} \left\{ \log |\Sigma_j^{-1}| - \mathrm{tr}\big(\mathbf{S}_j \Sigma_j^{-1}\big) - \lambda_{\Sigma^{-1}} \|\Sigma_j^{-1}\|_1 \right\} \tag{3.21} $$
where

$$ \mathbf{S}_j = \frac{\sum_{i=1}^{n_{obs}} r_{ij} (\mathbf{Z}_i - \boldsymbol{\delta}_j^{(t)})(\mathbf{Z}_i - \boldsymbol{\delta}_j^{(t)})^T}{\sum_{i=1}^{n_{obs}} r_{ij}} \tag{3.22} $$

is the empirical covariance matrix of the completely measured 𝐙. Then, δ is updated by minimizing
the following penalized distance:
$$ \boldsymbol{\delta}_j^{(t+1)} = \arg\min_{\boldsymbol{\delta}_j} \left\{ \sum_{i=1}^{n_{obs}}\sum_{j=1}^{k} r_{ij} (\mathbf{Z}_i - \boldsymbol{\delta}_j)^T \big[\Sigma_j^{(t+1)}\big]^{-1} (\mathbf{Z}_i - \boldsymbol{\delta}_j) + \rho_{\Sigma^{-1}\boldsymbol{\delta}} \big\| \big[\Sigma_j^{(t+1)}\big]^{-1} \boldsymbol{\delta}_j \big\|_1 \right\} \tag{3.23} $$
Although cross-validation is the most popular model selection scheme for regularized
methods, it is not feasible for choosing tuning parameters for our LUCID model, since our primary
goal is to estimate the latent cluster, which is essentially unknown. Therefore, picking an optimal
combination of tuning parameters for our model when both genetic features and biomarkers are
high-dimensional could be immensely challenging. As discussed in Chapter 2, to choose an
optimal combination of the three tuning parameters, we perform a parallel grid-search guided by
the Bayesian information criterion (BIC).
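The grid search can be sketched as below; the grid values and the mock fit results are purely illustrative, and in practice each (log-likelihood, nonzero-parameter count) pair would come from fitting the penalized LUCID model at that tuning-parameter combination.

```python
import numpy as np

def bic(log_lik, n_params, n):
    """Bayesian information criterion: -2 log L + (#parameters) log n;
    smaller is better."""
    return -2.0 * log_lik + n_params * np.log(n)

def select_tuning(fits, n):
    """Pick the tuning-parameter combination with the smallest BIC.
    `fits` maps (lambda_G, lambda_Sigma_inv, rho) -> (log-likelihood,
    number of nonzero parameters)."""
    return min(fits, key=lambda t: bic(*fits[t], n))

# toy illustration: denser models fit better but pay a complexity penalty
fits = {(0.01, 0.05, 0.05): (-500.0, 40),
        (0.10, 0.05, 0.05): (-510.0, 22),
        (0.10, 0.50, 0.50): (-540.0, 10)}
best = select_tuning(fits, n=2000)
```

BIC trades fit against complexity, so in this toy grid the sparsest model is selected despite its lower log-likelihood.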
3.3 Simulation Study
3.3.1 Data Description and Scenario Settings
Extensive simulations using various biomarker effect sizes and a wide range of missing
fractions are performed to evaluate the parameter estimation and cluster assignment
accuracy of our model. A simulated population of 200,000 subjects for each scenario is
generated following the incomplete LUCID model in Figure 3.1. Specifically, we examine the
impact of varying missing fractions (ρ_Z) for biomarkers on the performance of the incomplete
LUCID model, with ρ_Z ranging from 0 to 0.9 in increments of 0.1. Given the nature of the missing fraction, a
special case is considered for ρ_Z = 0, which is the scenario for
integrative clustering with all three components fully observed. Genetic features are first simulated, then
conditional on the genes a cluster value is generated, and biomarker and outcome data are finally
generated conditional on the cluster assignment. We examine several metrics to evaluate the
performance of the incomplete LUCID model, including mean parameter estimates and
corresponding standard deviations of estimates over the 500 replicates. We also investigate
estimated latent cluster assignments using the area under the curves (AUC) by comparing
estimated cluster probabilities to truly simulated cluster assignments. To make fair comparisons,
a partition of subjects with completely and incompletely measured biomarkers is made when the
AUC for latent clustering was calculated. Additionally, we perform listwise deletions for the
simulated data to conduct complete-case analysis using the original LUCID method so that a
comparison with the novel approach proposed above can be made.
3.3.2 Simulation Results
To evaluate the performance and application of the incomplete LUCID approach, we started with
low-dimensional scenarios. We simulated 10 SNPs (5 causal), 4 continuous biomarkers (2
causal), and a continuous outcome for each individual in the population. Next, each replicate of
a certain scenario randomly draws 2000 samples from the simulated population. Finally, for
scenarios with omic-wise missingness, a certain number of subjects based on the missing fraction
were randomly selected to be assigned with missing values for all biomarkers; for a sporadic
pattern of missingness, randomly scattered missing values in biomarkers were assigned to some
subjects from the drawn sample based on the missing fraction.
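The generative scheme above can be sketched as follows for the omic-wise case; the allele frequency, intercept, and random seed are illustrative assumptions rather than the dissertation's exact simulation settings.

```python
import numpy as np

rng = np.random.default_rng(2019)
n, p, m = 2000, 10, 4

# 1) genetic features: additive SNP coding 0/1/2 (allele frequency 0.3 assumed)
G = rng.binomial(2, 0.3, size=(n, p))

# 2) latent cluster given G: logistic model with 5 causal SNPs, OR_G = 1.9;
#    the intercept -3.0 is an illustrative choice
beta = np.concatenate([np.full(5, np.log(1.9)), np.zeros(p - 5)])
pX = 1.0 / (1.0 + np.exp(-(G @ beta - 3.0)))
X = rng.random(n) < pX

# 3) biomarkers (2 of 4 causal, Delta_Z = 0.6) and outcome (Delta_Y = 0.4)
shift = 0.6 * np.array([1.0, 1.0, 0.0, 0.0])
Z = X[:, None] * shift + rng.normal(size=(n, m))
Y = 0.4 * X + rng.normal(size=n)

# 4) omic-wise missingness: all biomarkers missing for a rho_Z fraction
rho_Z = 0.3
omic_missing = rng.random(n) < rho_Z
Z[omic_missing] = np.nan
```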
3.3.2.1 Incomplete LUCID method for omic-wise missingness
For these particular missing scenarios, model parameters are set as follows: the SNP effects
OR_G = 1.9; differences in outcome means Δ_Y = 0.4; differences in biomarker means Δ_Z = 0.6.
Results are presented as a comparison with complete-case analysis after listwise deletion. As
shown in Figure 3.2, ρ_Z has a considerable impact on parameter estimation in the
situations with omic-wise missingness. Although both methods in general provide accurate
average parameter estimates, increased bias is observed when ρ_Z is
extremely large. The standard deviations of estimates across replicates become larger with an
ascending ρ_Z, and this increasing trend is more obvious when listwise deletion is performed.
Specifically, when the 𝜌 𝑍 is over 60%, the incomplete LUCID method effectively alleviated the
inflation in standard deviations of genetic and outcome effects compared to the original LUCID
with listwise deletion at a fixed 𝜌 𝑍 (Fig. 3.2a & c). For the estimates of biomarker effects with a
high 𝜌 𝑍 , no noticeable difference in their standard deviations is detected but incomplete LUCID
offers a better average of biomarker parameter estimates (Fig. 3.2b).
Figure 3.2 Comparisons between omic-wise incomplete approach and listwise deletion in
parameter estimates and their SDs for genetic (a), biomarker (b) and outcome (c) effects with
varying 𝜌 𝑧
In terms of the AUC for the estimated cluster assignments, to make fair
comparisons based on the same effective subjects involved after listwise deletion for complete-
case analysis at each ρ_Z, the overall AUC achieved by the incomplete
LUCID approach is partitioned into the AUC for subjects with complete biomarker measurements
(observed) and the AUC for the rest with missingness in biomarkers (missing). As shown in Figure
3.3, the mean AUC for classification across replicates drops with increased 𝜌 𝑍 . However, using
the proposed incomplete approach always leads to a higher mean AUC for subjects with
complete biomarker measures compared with the mean AUC given by listwise deletion,
indicating that the incomplete approach is more capable of distinguishing the heterogeneity
between the two simulated latent clusters. In addition, for those subjects excluded from complete-case
analysis due to missing measurements in biomarkers, our incomplete approach still offers
acceptable estimates for the latent cluster, although the mean missing AUC is much lower than
the mean AUC for subjects with complete biomarker measures.
Figure 3.3 Comparisons of AUC for classification: omic-wise incomplete approach vs. listwise deletion
The above results support previous findings in missing data analysis suggesting that
listwise deletion is not valid when the proportion of missingness is substantial even if the data
are MAR (Little & Rubin, 2002; Chen et al., 2014). For the scenarios with Δ_Z = 0.4 and 0.8, the
results of the estimated effects in each component of the DAG model and the classifications for final
latent cluster assignments show a similar trend to the situation discussed above with Δ_Z =
0.6. Those results are reported in Appendix VIII.
3.3.2.2 Incomplete LUCID method for sporadic missingness
To test the performance of the incomplete LUCID method for sporadic missingness through
imputations, model parameters for the scenarios are set as follows: the SNP effects
OR_G = 1.9; differences in outcome means Δ_Y = 0.2; differences in biomarker means
Δ_Z = 0.6. When the dynamic imputations are performed for biomarkers with sporadic
missingness in the data matrix, the mean of the parameter estimates across replicates
decreases with an increased ρ_Z. Specifically, when ρ_Z is below 25%, the average
parameter estimates for genetic and outcome effects
are still close to the simulation truth, although their variances become larger (Fig. 3.4a,
b). However, the incomplete LUCID method fails to recapitulate the true effects once ρ_Z
reaches 25%: only underestimated genetic and outcome effects with large standard deviations
are produced. Since the missingness occurs in the biomarker data, estimation of the biomarker
effects is more sensitive to ρ_Z; their accuracy and precision decrease immediately with
increasing ρ_Z (Fig. 3.4c). Overall, the general incomplete LUCID method starts substantially
losing its capability to capture the true effects of all data sources when ρ_Z reaches or
exceeds 25%.
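A minimal sketch of the dynamic imputation step, assuming responsibility-weighted cluster means as the update rule (the responsibilities, cluster means, and data below are toy stand-ins for the actual E-step quantities, not the package implementation):

```python
import numpy as np

def impute_step(Z, mask, resp, mu):
    """Refresh missing biomarker entries (mask==True) with each subject's
    responsibility-weighted average of the cluster-specific biomarker means."""
    expected = resp @ mu              # n x p: E[Z_i] under current posteriors
    Z_new = Z.copy()
    Z_new[mask] = expected[mask]
    return Z_new

rng = np.random.default_rng(1)
n, p, K = 100, 4, 2
mu = np.array([[0.0] * 4, [0.6] * 4])         # cluster-specific biomarker means
clus = rng.integers(0, K, n)
Z = mu[clus] + rng.normal(0, 1, (n, p))
mask = rng.random((n, p)) < 0.15               # sporadic (entrywise) missingness

# Initialization: fill missing cells with the observed column means.
col_means = np.nanmean(np.where(mask, np.nan, Z), axis=0)
Z = np.where(mask, col_means, Z)

# One dynamic update, using oracle responsibilities for illustration.
resp = np.zeros((n, K)); resp[np.arange(n), clus] = 1.0
Z_imp = impute_step(Z, mask, resp, mu)
```

In the actual algorithm this update would be interleaved with the E- and M-steps until convergence.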
Figure 3.4 Parameter estimates and their SDs for genetic (a), biomarker (b) and outcome (c)
effects with varying ρ_Z applying the incomplete LUCID method for general missingness
In terms of the AUC for latent clustering, we again split the overall AUC into the AUC for
subjects with complete biomarker measures (observed) and the AUC for the remaining subjects with general
missingness in biomarkers (missing). As shown in Figure 3.5, similar to the trend in the
parameter estimates, the average overall, observed, and missing AUCs for latent clustering
across replicates drop only slightly before ρ_Z reaches 25%, especially the observed AUC.
Even for subjects with incomplete biomarker measures, the sporadic incomplete LUCID still
provides a latent cluster assignment much better than random guessing. However, once ρ_Z
reaches 25%, the mean AUCs start dropping drastically with large variances. Beyond that
point, the general incomplete LUCID approach breaks down as ρ_Z increases, producing most
AUCs around 0.5.
Figure 3.5 AUC for classification using incomplete LUCID method for general missingness.
3.3.2.3 Differential missingness depending on the outcome variable
Omic-wise missing measurements for biomarkers occasionally depend on the outcome of interest
because of certain study designs. Thus, we simulated scenarios with varying missing fractions
constrained to the cases of a binary outcome. Specifically, model parameters for the
scenarios are set as follows: the SNP effects OR_G = 2.0; the binary outcome effect
OR_Y = 2.0;
differences in biomarker means Δ_Z = 0.6. The missing fractions (ρ_case) range from 0.2 to
1.0 among the cases. When ρ_case = 1.0, this special scenario indicates that no case has any
biomarker measured. We compare the differential scenarios with the regular omic-wise
missingness situation at the same ρ_Z. In general, incomplete LUCID for omic-wise missingness
still offers good parameter estimates, although their precision decreases and their
variability increases with ascending missing fractions (Figure 3.6). Compared to the
scenarios with regular omic-wise missingness, the parameter estimates for the genetic and
biomarker effects in the differential situation show similar trends. For the estimates of the
outcome effect, a larger bias appears under the differential circumstances, since the
missingness is exclusive to cases (Figure 3.6c). For latent cluster AUCs, after partitioning
the overall AUC, the observed AUCs outperform the AUCs based on listwise deletion as ρ_case
increases (Figure 3.7). Additionally, when ρ_case = 1.0, listwise deletion makes the
estimation of latent clusters impossible, since it removes all the cases from the sample. In
contrast, the incomplete LUCID method still provides a decent result for the latent cluster
assignment. Moreover, for the cases with omic-wise missingness, the incomplete LUCID method
always offers better clustering results than random assignment does.
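Generating the differential missingness indicator for these scenarios is straightforward; the sketch below assumes a Bernoulli(ρ_case) drop of the whole biomarker block for cases only (variable names are illustrative):

```python
import numpy as np

def case_only_mask(y, rho_case, rng):
    """Omic-wise missingness confined to cases: each case's whole biomarker
    block is dropped with probability rho_case; controls stay complete."""
    return (y == 1) & (rng.random(len(y)) < rho_case)

rng = np.random.default_rng(2)
y = rng.integers(0, 2, 5000)          # binary outcome: 1 = case, 0 = control
mask = case_only_mask(y, 0.6, rng)    # ~60% of cases lose all biomarkers
```

Setting ρ_case = 1.0 reproduces the extreme scenario in which no case has any biomarker measured, the setting under which listwise deletion fails entirely.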
Figure 3.6 Comparisons in parameter estimates for data with regular and differential
missingness depending on a binary outcome applying omic-wise incomplete LUCID approach
Figure 3.7 Comparisons between the incomplete LUCID approach and listwise deletion in
classification AUC for scenarios with differential missingness by the outcome variable
3.3.2.4 Regularization method for high-dimensional settings with missingness
To further explore the performance of the incomplete method, we push its limits by increasing
the number of multi-omic features to be much greater than the sample size. High-dimensional
scenarios are a comprehensive extension of our simulation, testing how the number of features
and the effect sizes interact with the missing fraction of biomarkers. We use simulations to
illustrate how to choose the optimal combination of the three tuning parameters through a
grid search. Subsequently, we evaluate how our incomplete LUCID approach with regularization
works with high-throughput multi-omics data to identify informative features during
integrative clustering. Specifically, high-dimensional multi-omics data are simulated with
5000 genetic features and 500 biomarkers, of which 10 and 10 features are causal,
respectively. The model parameters are simulated as follows: the SNP effects OR_G =
2.0; differences in biomarker means Δ_Z = 1.0 or 0.6; differences in outcome means
Δ_Y = 2.0; missing fractions (ρ_Z) for biomarkers range from 0.1 to 0.9 by 0.2.
As shown in Figure 3.8a, an increased ρ_Z leads to a decreasing trend in the inclusion
probabilities of causal genetic variants when Δ_Z = 1.0. A slight decreasing trend for causal
biomarkers and a small increasing trend for null biomarkers are also observed with increased
ρ_Z. These trends become more pronounced as ρ_Z grows beyond 0.5. If we decrease the
biomarker effect to Δ_Z = 0.6, the general trends in inclusion probabilities remain the same,
as shown in Figure 3.8b, but the trends with increased ρ_Z become more obvious than in the
scenario with the larger Δ_Z, indicating that a larger biomarker effect improves the
performance of the incomplete LUCID method in high-dimensional settings.
Figure 3.8 Comparison of inclusion probabilities with increased missing fractions (ρ_Z)
(a) OR_G = 2.0, Δ_Z = 1.0, Δ_Y = 2.0; (b) OR_G = 2.0, Δ_Z = 0.6, Δ_Y = 2.0; CG: causal G; NG: null G; CZ: causal Z; NZ: null Z.
3.4 An Application on Colon CFR Folate Biomarkers Study
3.4.1 Data Description
Again, we apply the proposed incomplete LUCID approach on the data from the Colon Cancer
Family Registry (CFR). The Colon CFR is a consortium of six centers across North America and
Australia, organized to create a comprehensive resource for clinical and epidemiological studies
of colorectal cancer (CRC). Details of the study design have been reported elsewhere (Haile et
al., 1999; Newcomb et al., 2007). For the subset sample used for the current analysis, the data
were collected under a case-control study design from the following centers: Sinai Health
System (Ontario), the University of Southern California/Cedars-Sinai Medical Center
(USC/CSMC) consortium, the Fred Hutchinson Cancer Research Center (FHCRC), the Mayo
Clinic, University of Hawaii (UHI, not included in this analysis) and the University of
Melbourne. The recruitment included incident cases of CRC via population-based cancer
registries or clinical centers, as well as general population or population-based sibship controls.
The genotype data with 342,575 SNPs were generated from germline DNA on the Illumina
OncoArray for all 1000 cases and 1000 controls. Targeted biomarkers related to folate
one-carbon metabolism (FOCM), including 5-methyltetrahydrofolate (5-MTHF), vitamin B2 (flavin
mononucleotide [FMN] and riboflavin), vitamin B6 (4-pyridoxic acid, pyridoxal, pyridoxamine),
methionine (METH), S-Adenosyl methionine (SAM), S-Adenosyl-L-homocysteine (SAH), total
homocysteine (tHCY), cystathionine (CYSTH), and creatinine, were only measured among 220
controls and 93 cases. This data structure indicates a missing fraction in biomarkers over 84%.
Observed data for biomarkers were log-transformed to ensure normality and standardized to
apply our incomplete LUCID method. We also adjusted for the top five principal components of
the SNPs in the integrative clustering for confounding effects and population stratification.
3.4.2 Applied Example Results
We first performed lasso logistic regression of CRC risk on the genetic data to prescreen
SNPs, retaining the 1534 SNPs selected at the beginning of the regularization path. Then, the
incomplete LUCID approach was applied to the combined data consisting of 313 subjects with
SNPs, complete target biomarkers, and CRC information, plus the remaining 1687 subjects with
SNPs and CRC status only. We conducted a grid search for the tuning parameters across a range
of candidate numbers of latent clusters (K). The optimal number of clusters was found to be
K = 2 based on the model BIC (Figure 3.9).
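The BIC comparison underlying this choice of K can be sketched as follows; the log-likelihoods and parameter counts below are hypothetical placeholders, not the CCFR fits:

```python
import math

def bic(loglik, n_params, n):
    """BIC = -2 log L + p log n; the smaller, the better."""
    return -2.0 * loglik + n_params * math.log(n)

# Hypothetical log-likelihoods and parameter counts for K = 2, 3, 4.
# A larger K fits slightly better but pays a growing parameter penalty.
n = 2000
fits = {2: (-7000.0, 40), 3: (-6980.0, 60), 4: (-6975.0, 80)}
bics = {K: bic(ll, p, n) for K, (ll, p) in fits.items()}
best_K = min(bics, key=bics.get)
```

With these illustrative numbers the penalty dominates the likelihood gain, so the minimum-BIC model is K = 2, mirroring the shape of Figure 3.9.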
Figure 3.9 Optimal number of latent cluster (K) applying incomplete LUCID method on CCFR data
K ranges from 2 to 4; when K=4, LUCID model failed to converge
Using the optimal combination of tuning parameters, we performed feature selection on the
overall data with the incomplete LUCID model. As illustrated in Figure 3.10, 41 SNPs and 8
targeted biomarkers related to FOCM, including CYSTH, creatinine, 4-pyridoxic acid,
pyridoxamine, FMN, tHCY, SAM, and riboflavin, contribute to the underlying heterogeneity in
CRC risk in the final model. Specifically, a combination of 15 positive SNP effects (Fig.
3.10 dark blue links, top 10: chr16_73068060_C_G, rs12074997, rs2119609,
chr20_52373092_C_G, chr16_6965400_C_G, chr3_60141414_C_G, rs651249, rs12133662,
rs12550019 and chr4_27034672_A_G) and 26 negative SNP effects (Fig. 3.10 light blue links,
top 10: chr21_47983680_A_G, chr4_22523414_C_G, rs12416353, rs11240536, kgp9168996,
chr2_217566056_A_T, rs17034196, rs2744734, chr2_28832256_C_T and rs17171172) leads to
a subgroup with an increased risk of CRC (OR = 1.11, Fig. 3.10 red link) showing differential
serum levels of the target biomarkers, highlighted by low levels (light blue links) of
creatinine and CYSTH and high levels of riboflavin, SAM, tHCY, and FMN. Comparing these
results to the findings using the complete subset in Chapter 2, the higher-risk CRC subgroups
are both characterized by high levels of riboflavin, SAM, tHCY, and FMN, although using the
extra subjects with additional SNP and outcome information generates a smaller OR for CRC and
a different set of determinant genetic features.
Figure 3.10 Feature characteristics in each estimated subgroup of CRC risks with incomplete data
Links: light blue, negative effect; dark blue, positive effect; grey, outcome reference subgroup; red, high-risk subgroup; green, low-risk subgroup; width, effect size. Nodes:
yellow, integrative cluster; purple, SNPs; cyan & blue, biomarkers; brown, CRC risks. IntClust: integrative cluster/subgroup.
3.5 Implementation - A Feature Update for R Package LUCIDus
We offer a feature update to the R package LUCIDus, introduced in the previous chapter,
making latent cluster analysis with integrated data that incorporates the missing pattern of
biomarkers publicly accessible. Specifically, LUCIDus version 1.0 is introduced, adding
several functions/options to implement the proposed incomplete approach, such as:
• tidy_Z(): It provides users with a tool to identify the missing pattern in the bio-feature data,
which recognizes the type of missingness and further categorizes our subjects into two
disjoint parts if the missingness is omic-wise;
• missing option in est_lucid() and tune_lucid(): After the missing pattern is recognized, this
option offers a convenient implementation to estimate integrative cluster assignments and
model parameters using multi-omics data consisting of random missingness in bio-features
with or without the outcome variable. Given various sources of high-dimensional data,
options to induce sparsity for cluster-specific parameter estimates via L1 regularization are
provided as well. A similar option is also included in the function for parallel grid search of
tuning parameters in the context of high-dimensional omics features with missing values.
We believe this LUCIDus update offers researchers in the relevant areas a valuable and
convenient tool for analyzing underlying patterns in integrated data with MAR bio-features.
We provide toy examples for the new functions/options in the update; more details on their
usage will be included in the R documentation for LUCIDus. We discuss the implementation of
the incomplete LUCID method further in Chapter 5.
3.6 Discussion
In summary, the proposed incomplete approach elegantly incorporates the missing pattern of
bio-features into integrative latent cluster analysis using multi-omics data and phenotypic
traits, although it may require additional calibration and become more computationally
intensive as the missing fraction in biomarkers increases. As demonstrated in the simulation
study, the original LUCID with listwise deletion fails to efficiently estimate the latent
clusters and model parameters when the missing fraction in the biomarker data exceeds 50%. In
contrast, the proposed incomplete LUCID method offers decent parameter estimates and good
latent cluster assignments, especially for subjects with complete biomarker measures, when
the missingness is omic-wise and the proportion of missing biomarker data is high. For
subjects without biomarker data, we can still obtain acceptable latent clustering results,
achieving a mean AUC greater than 0.5. Although the method's ability to detect the underlying
heterogeneity decreases with an increased missing fraction, this approach provides
researchers with a feasible way to perform integrative analysis to identify unknown subgroups
while retaining the subset of subjects without biomarker measurements.
The applied example was performed with data from the same folate biomarker study
previously used. Even with about 84% missingness in all biomarkers, the incomplete LUCID
method handled the integrative clustering task using an extended sample with extra subjects. The
identified subgroup that is more prone to developing CRC than others is defined by the high
level of SAM and tHCY, which implies dysfunction of the methionine cycle and echoes earlier
results from Chapter 2. Among the novel SNPs contributing to this underlying heterogeneity in
CRC risks, rs651249 is an intronic variant in the C5orf66 gene relevant to the colorectal cancer
risks (Jia et al., 2012; Zeng et al., 2016), ulcerative colitis and inflammatory bowel disease (Liu
et al., 2015), and rs1939438 is an intronic variant in the MAML2 gene associated with
inflammatory bowel disease (Aouizerat et al., 2011) and ulcerative colitis (Liu et al., 2015).
Indications for the other 39 SNPs are either not relevant to CRC symptoms or not previously
reported. Compared to the LUCID model fit with only the 313 complete subjects, the estimates
for the genetic and biomarker effects do not change substantially, but the smaller sample
size of the complete subset leads to a much larger OR = 11.51 (Appendix IX Figure A5).
identified SNPs are reported in Appendix X Table A.5.
When applying the sporadic incomplete LUCID method to biomarker data with general missing
values, we observed that model performance deteriorated quickly with an ascending overall
missing fraction and typically broke down when the missing fraction exceeded 20%. The impact
of the missing fraction on the accuracy and precision of the parameter estimates is most
severe for the biomarker effects. A possible explanation is that the missingness occurs in
the biomarker measurements: even when the overall missing fraction is low, the number of
subjects without complete biomarkers can be extremely high. Based on the simulations, the
proportion of individuals with incomplete measures was over 35% even when the overall missing
fraction was only 10%; when our algorithm started failing at an overall missing fraction of
25%, the analytical set had roughly 70% of subjects with missing biomarker measures;
eventually, when the overall missing fraction reached 60%, almost none of the subjects had
complete biomarker measures, which makes identifying the hidden heterogeneity impossible
(Figure 3.11).
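Under an illustrative assumption of independent entrywise missingness across m biomarkers, the fraction of incomplete cases is 1 − (1 − ρ)^m, which roughly reproduces the pattern above for m = 4 (m is an assumption here, matching the 4-biomarker simulation design used elsewhere in this work):

```python
def frac_incomplete(rho, m):
    """P(subject has at least one missing biomarker) = 1 - (1 - rho)^m,
    assuming independent entrywise missingness across m biomarkers."""
    return 1.0 - (1.0 - rho) ** m

# With m = 4: rho = 0.10 -> ~34% incomplete; rho = 0.25 -> ~68%; rho = 0.60 -> ~97%.
rates = {rho: frac_incomplete(rho, 4) for rho in (0.10, 0.25, 0.60)}
```

The steep growth of this curve explains why a modest entrywise missing fraction can still leave most subjects without a complete biomarker vector.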
Figure 3.11 Percent of incomplete cases with varying overall missing fraction
Besides the proposed method to handle the general missingness, we also tested an
alternative algorithm which first substitutes the missing values in the bio-feature matrix with the
estimates obtained from the omic-wise incomplete LUCID method and then iteratively updates
the values with weighted means across latent clusters. However, this modification did not show
any improvements in terms of model performance and it also started to fail when the overall
missing fraction was over 20% (Appendix XI). Since the imputation procedure for this
alternative is not intuitive and its performance was not superior to the one proposed, we chose
the algorithm using the corresponding biomarker means as the initial imputed values instead.
Applying a more complicated algorithm, such as the chained equations approach based on
sequential imputation (White et al., 2011), to improve the model performance for the sporadic
missing values in metabolomics could be a future subject.
We showed scenarios in the simulation study in which the missingness in bio-features depended
on the outcome status. Since this data setting violates the MAR assumption, we treated these
scenarios as sensitivity analyses. Although the results for parameter estimation and subgroup
identification are mostly appropriate, we did see a trend of inflated estimates for the
biomarker effects with an increased missing fraction. Therefore, caution is needed regarding
differential missingness when fitting an incomplete LUCID model for integrative clustering.
In related work, methods more suitable for NMAR data, including selection models and
pattern-mixture frameworks, have been developed to address this concern (Michiels et al.,
1999). Future efforts toward a potential extension could focus on this direction.
In conclusion, the incomplete LUCID method reduces bias and/or variance in parameter
estimates compared to complete-case analysis based on listwise deletion. It effectively
preserves the capability of the original LUCID model to detect informative features and to
predict outcome risks even when the biomarker data are high-dimensional and/or incompletely
measured. Finally, it sustains the quality of classification for underlying subgroups by
offering decent AUCs.
3.7 Funding Support
This work was supported by the National Cancer Institute at the National Institutes of Health
Grant P01 CA196569 and R01 CA140561.
CHAPTER 4 93
Chapter 4 Potential Options for the Statistical Inference of
LUCID Integrative Clustering
4.1 Introduction
Statistical methods for latent unknown clustering using integrated data (LUCID) were
discussed in Chapter 2, where we demonstrated the model's performance in terms of parameter
estimation and latent subgroup identification. Although clustering methods in related work
focus primarily on effect estimation, performing hypothesis tests on model parameters can be
pragmatic and meaningful when certain research assumptions are made. Thus, exploring an
effective way to estimate the standard errors of LUCID model parameters is essential.
However, the uncertainty of parameter estimates from the EM algorithm is not easy to capture
because directly calculating the inverse of the observed information matrix is analytically
difficult. Previous methods have been proposed to address this challenging task, including
the profile likelihood method, Louis's method (Louis, 1982), the supplemented EM (SEM)
algorithm (Meng and Rubin, 1991), and the bootstrapping method (Efron and Tibshirani, 1986),
but all of these options have limitations. None offers a direct way to obtain the expected
Fisher information matrix evaluated at the MLE of the model parameters. Also, some of these
methods are more appropriate for specific study designs, while others work well only in
simple situations with a small number of parameters to estimate (Jamshidian and Jennrich,
2000). The SEM algorithm requires an additional EM run after the original one to keep track
of the EM rate of convergence. The variance of the parameter estimates can be estimated from
this EM mapping (Meng and Rubin, 1991), but its convergence is relatively slow when the original EM
algorithm has many iterations to run. In contrast, standard error estimation based on the
bootstrap is a more general statistical technique that avoids sophisticated analytical
derivations, but it may be computationally burdensome when the EM algorithm needs
considerable time to converge (Efron and Tibshirani, 1986).
The LUCID method we previously proposed integrates various sources of omics data and an
outcome variable to estimate latent clusters. The number of model parameters to estimate is
at least double the number of multi-omics features and the phenotypic trait, and it grows
substantially with a larger number of latent clusters (K). Additionally, when the omics data
are high-dimensional, the situation becomes even more involved, since the uncertainty in the
parameter estimates arises not only from the EM algorithm but also from the regularization.
Thus, to appropriately obtain standard errors for the parameter estimates, we need a feasible
way to perform variance estimation for the LUCID EM algorithm from a practical perspective.
In this chapter, we propose exploratory procedures to obtain the standard errors of LUCID
model parameter estimates by applying the SEM algorithm and the bootstrapping approach.
First, a statistical outline for evaluating the variance of LUCID parameter estimates with
SEM and bootstrapping is presented. We next examine the type I error rate and power of
statistical inference for effects of the various sources of integrated data through a
simulation study. We then present an applied example of obtaining confidence intervals when
performing LUCID on a real data set with exposure, metabolomic, and outcome data from the
Study of Latino Adolescents at Risk (SOLAR). Finally, we conclude the chapter with a
discussion of findings and potential issues.
4.2 Methods
4.2.1 Standard Error (SE) Estimation through SEM Algorithm
Using the SEM framework, we first run the EM algorithm and store the parameter estimates at
all iterations. After the EM algorithm has converged, we compute the inverse of the
expectation of the complete-data observed information matrix, I_com^(-1) = V_com. Then,
starting with the parameter estimate at the first iteration of the EM algorithm, we compute
the rate of convergence numerically in an iterative fashion to obtain DM, the gradient of the
EM mapping. Finally, the asymptotic variance-covariance matrix for the observed data is
computed as V_obs = V_com + ΔV = V_com + V_com DM (I − DM)^(-1), where ΔV is the increase in
variance due to missing data. More details regarding SEM, including proofs of the above
results, can be found in the relevant papers and textbooks (Meng and Rubin, 1991; Little &
Rubin, 2002). Standard errors (SEs) are then obtained by taking the square roots of the
asymptotic variances of the parameters of interest.
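Given V_com and the EM-mapping gradient DM, the SEM variance formula is a one-line matrix computation. The matrices below are hypothetical toy values chosen for illustration:

```python
import numpy as np

def sem_variance(V_com, DM):
    """V_obs = V_com + V_com @ DM @ inv(I - DM): the complete-data variance
    plus the inflation due to missing information (Meng & Rubin, 1991)."""
    I = np.eye(DM.shape[0])
    delta_V = V_com @ DM @ np.linalg.inv(I - DM)
    return V_com + delta_V

# Toy 2-parameter example with a hypothetical EM rate-of-convergence matrix.
V_com = np.diag([0.04, 0.09])
DM = np.diag([0.5, 0.2])      # fraction of missing information per parameter
V_obs = sem_variance(V_com, DM)
se = np.sqrt(np.diag(V_obs))  # asymptotic SEs of the two parameters
```

Note how a larger diagonal entry of DM (more missing information) inflates the corresponding variance more strongly.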
4.2.2 SE Estimation through Bootstrapping
The bootstrap SE estimation method is based on resampling the observed data by individual
with replacement to form pseudo-data sets. Specifically, let D = (D_1, D_2, ..., D_n) and B
denote the observed data and the number of bootstrap samples, respectively. We first evaluate
the model parameter estimate Θ̂ from the original data using the LUCID EM approach. Then, for
each j in {1, ..., B}, a set of pseudo-data is randomly sampled with replacement, denoted
D^(j) = (D*_j1, D*_j2, ..., D*_jn). With D^(j), we calculate Θ̂^(j)* using the identical LUCID
model performed on the original data. Finally, the bootstrap loop stops when j = B, and the
collection of LUCID parameter estimates Θ̂^(1)*, ..., Θ̂^(B)* provides empirical insight into
the sampling distribution of Θ̂. Using this information, we can estimate the standard error
for a specific LUCID model parameter θ̂ as

SE(θ̂) = sqrt[ (1/B) Σ_{j=1}^B (θ̂^(j)* − θ̄*)(θ̂^(j)* − θ̄*)^T ]    (4.1)

where θ̄* is the sample mean of the bootstrap statistics θ̂^(1)*, ..., θ̂^(B)*.
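For a scalar parameter, Equation 4.1 reduces to the standard deviation of the bootstrap statistics. A minimal sketch, with the sample mean standing in for the full LUCID refit (which would be applied to each pseudo-set in practice):

```python
import numpy as np

def bootstrap_se(data, estimator, B, rng):
    """Resample individuals with replacement B times, refit the estimator,
    and return the SD of the bootstrap statistics (Eq. 4.1 for a scalar)."""
    n = len(data)
    stats = np.array([estimator(data[rng.integers(0, n, n)]) for _ in range(B)])
    return stats.std(ddof=0), stats

rng = np.random.default_rng(3)
data = rng.normal(1.0, 2.0, size=400)
se, stats = bootstrap_se(data, np.mean, B=200, rng=rng)
# The theoretical SE of the mean here is 2/sqrt(400) = 0.10;
# the bootstrap SE should land close to it.
```

With a costly refit such as the LUCID EM, the B refits dominate the runtime, which is the computational burden noted above.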
After the SEs are estimated using the SEM algorithm or the bootstrapping method, Wald tests
can be constructed by dividing the parameter estimates by their corresponding standard errors
to test the null hypotheses that the true values of the model parameters of interest are
zero. Confidence intervals can then be constructed easily based on the normal approximation.
Other types of bootstrap confidence intervals, including basic and percentile, can also be
obtained in R (DiCiccio & Efron, 1996; Davison & Hinkley, 1997; Canty & Ripley, 2017).
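A sketch of the resulting Wald test and normal-approximation interval, illustrated with the CZ1 estimate and standard error from Table 4.1 (this is the plain normal approximation centered at the estimate; the "normal" interval reported by R's boot package additionally subtracts the bootstrap bias):

```python
import math

def wald(theta_hat, se):
    """Wald z-statistic, two-sided p-value, and 95% normal-approximation CI."""
    z = theta_hat / se
    # Standard normal CDF via the error function.
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    zc = 1.959963984540054  # 97.5% standard normal quantile
    return z, p, (theta_hat - zc * se, theta_hat + zc * se)

z, p, ci = wald(0.83, 0.07)  # CZ1 estimate and SE from Table 4.1
```

Here the null of a zero effect is rejected decisively, consistent with CZ1 being a causal biomarker in the simulation.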
4.2.3 Hypothesis Testing for High-dimensional Data
Since we introduce penalty terms into the LUCID model likelihood to handle the high
dimensionality of the multi-omics features, the induced sparsity among parameter estimates
makes feature selection possible but also generates another source of uncertainty. For now,
to incorporate information from high-dimensional multi-omics data while retaining the ability
to perform statistical testing, we recommend the following two-step procedure for analyzing
high-dimensional integrated data:
Step 1: Apply the regularized EM algorithm to perform dimension reduction on the multi-omics
data and select the most informative genetic features and biomarkers;
Step 2: Given the refined subset containing only the informative features from Step 1,
estimate the model parameters and their standard errors following the proposed SEM algorithm
or bootstrapping procedure, then calculate confidence intervals or perform hypothesis tests.
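The two-step procedure can be sketched under simplifying assumptions: marginal-correlation screening stands in for the regularized-EM selection of Step 1, and an OLS refit with classical standard errors stands in for the Step 2 inference (all names and data below are illustrative):

```python
import numpy as np

def screen_then_fit(X, y, keep):
    """Step 1: keep the `keep` features most correlated with y (a stand-in
    for regularized-EM selection). Step 2: OLS refit with classical SEs."""
    corr = np.abs(np.corrcoef(X.T, y)[-1, :-1])   # |corr| of y with each feature
    sel = np.argsort(corr)[-keep:]
    Xs = np.column_stack([np.ones(len(y)), X[:, sel]])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    sigma2 = resid @ resid / (len(y) - Xs.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xs.T @ Xs)))
    return sel, beta, se

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 50))
y = 0.8 * X[:, 3] + rng.normal(size=300)  # only feature 3 is causal
sel, beta, se = screen_then_fit(X, y, keep=5)
```

Note that selecting and testing on the same data can distort type I error; the sketch mirrors the pragmatic procedure above rather than a formal post-selection inference method.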
4.3 Simulation Study
4.3.1 Data Description and Scenario Settings
A simulated population of 200,000 subjects is generated for each scenario following the
original LUCID DAG model in Figure 2.1. Genetic features are generated first, then a cluster
value is simulated conditional on the genes, and finally the biomarker and outcome data are
generated given the cluster assignment. Specifically, we again simulate 10 SNPs, 4 continuous
biomarkers, and a continuous outcome for each individual in the simulated population, with
the true number of latent clusters K = 2. Each replicate of a particular scenario randomly
draws 2000 samples from the generated population. We examine several metrics to evaluate the
quality of the SEM and bootstrapping standard errors for the LUCID model parameters with
varying genetic, biomarker, and outcome effects, including the type I error for identifying
null features and the power to detect causal features over 500 replicates.
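The replicate-level metrics reduce to rejection rates at α = 0.05. The p-values below are simulated placeholders (uniform under the null, concentrated near zero for a causal feature), not output from the actual LUCID runs:

```python
import numpy as np

def reject_rate(pvals, alpha=0.05):
    """Fraction of replicates with p < alpha: empirical power for a causal
    feature, empirical type I error for a null one."""
    return np.mean(np.asarray(pvals) < alpha)

rng = np.random.default_rng(5)
null_p = rng.random(500)            # null feature: p ~ Uniform(0, 1)
causal_p = rng.beta(0.1, 1.0, 500)  # causal feature: p concentrated near 0
t1 = reject_rate(null_p)    # should sit near the nominal 0.05
pw = reject_rate(causal_p)  # should be high
```

These are exactly the quantities plotted against the varying effect sizes in Figures 4.1 and 4.3.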
4.3.2 Simulation Results
4.3.2.1 Analyzing power and type I error with SEM algorithm
We first apply the SEM algorithm to the simulated scenarios, using the one with OR_G = 1.9
for the genetic effect, Δ_Z = 0.8 for the mean difference in biomarkers, and Δ_Y = 0.2 or 0.4
for the mean difference in the continuous outcome. When Δ_Z increases from 0 to 0.8 and the
other effects remain fixed, the average power for detecting causal genes has an ascending
trend while the mean type I error for null genes keeps descending (Figure 4.1a). For the
outcome effect, the mean power of detecting the simulated difference across clusters
generally increases with ascending Δ_Z (Figure 4.1b). If we further compare a larger outcome
effect with a smaller one (red vs. blue), an increased outcome effect always provides an
increase in power and a decrease in type I error (Figure 4.1a & b). Meanwhile, with a fixed
Δ_Z, changes in the genetic effect (OR_G) are less
influential on the power for causal features and the type I error for null effects.
Specifically, the power to detect the biomarker effects increases considerably with an
ascending OR_G but reaches a plateau when OR_G is sufficiently large (Figure 4.1c). In
contrast, the power for capturing the outcome effect remains stable at a moderate level
across varying OR_G (Figure 4.1d).
Figure 4.1 Power and type I error with varying genetic, biomarker, and outcome effects
applying the SEM algorithm
Reference: red dashed line 0.05, green dashed line 0.80; (a & b) solid bar, power; shaded bar, type I error; blue, OR_G = 1.9,
Δ_Z = 0.8, Δ_Y = 0.2; red, OR_G = 1.9, Δ_Z = 0.8, Δ_Y = 0.4; (c & d) OR_G = 1.9, Δ_Z = 0.8, Δ_Y = 0.2
4.3.2.2 Investigating power and type I error through bootstrapping
To demonstrate the ability of the bootstrap approach to estimate standard errors for the
LUCID model parameters, we first simulated a population with parameters set to OR_G = 2.0,
Δ_Z = 0.8, Δ_Y = 1.0. One random sample of N = 2000 was drawn to perform the standard error estimation
with B = 100. As shown in Table 4.1, the biases between the original LUCID estimates and the
bootstrap statistics based on resampling are all less than 0.06. The estimated standard
errors are small for the biomarker and outcome effects but relatively large for the genetic
effects. All three types of bootstrap confidence intervals indicate significant contributions
of the causal features to the latent subgroup classification.
Table 4.1 Bootstrap confidence interval for LUCID parameter estimates based on simulated data
FEATURES  LUCID ESTIMATE  BIAS  STD. ERROR  BOOTSTRAP CONFIDENCE INTERVAL: Normal | Basic | Percentile
CG1 0.50 0.02 0.21 (0.08, 0.89) (0.07, 0.84) (0.17, 0.94)
CG2 0.63 0.05 0.21 (0.16, 1) (0.1, 1) (0.25, 1.16)
CG3 1.06 0.05 0.23 (0.56, 1.46) (0.5, 1.46) (0.66, 1.62)
CG4 0.79 0.00 0.23 (0.34, 1.25) (0.32, 1.33) (0.25, 1.26)
CG5 0.60 0.06 0.23 (0.1, 0.99) (0.08, 0.97) (0.24, 1.12)
NG1 0.23 0.03 0.22 (-0.24, 0.63) (-0.32, 0.65) (-0.2, 0.77)
NG2 -0.01 0.02 0.20 (-0.41, 0.37) (-0.45, 0.39) (-0.41, 0.43)
NG3 0.23 -0.01 0.21 (-0.16, 0.64) (-0.15, 0.64) (-0.18, 0.61)
NG4 -0.16 0.01 0.21 (-0.58, 0.24) (-0.63, 0.33) (-0.65, 0.31)
NG5 -0.20 -0.03 0.19 (-0.53, 0.19) (-0.44, 0.32) (-0.73, 0.03)
CZ1 0.83 -0.01 0.07 (0.7, 0.98) (0.7, 1.05) (0.61, 0.95)
CZ2 0.76 0.01 0.09 (0.56, 0.94) (0.54, 0.96) (0.56, 0.98)
NZ1 0.04 -0.02 0.08 (-0.11, 0.22) (-0.14, 0.21) (-0.14, 0.21)
NZ2 0.02 -0.04 0.09 (-0.12, 0.24) (-0.13, 0.23) (-0.18, 0.18)
Y 1.02 0.01 0.08 (0.85, 1.17) (0.85, 1.16) (0.88, 1.18)
Next, to further test the coverage of the bootstrap confidence intervals, we performed 99
more replicates of the same scenario, drawing a different sample from the simulated
population each time. Using B = 100 bootstraps, 95% confidence intervals for both causal and
null features were calculated in each replicate. As shown in Figure 4.2, the coverage across
the 100 replicates for the genetic (a & b), biomarker (c & d), and outcome (e) effects is
always greater than or equal to 0.95, which suggests that the bootstrap procedure provides
valid statistical inference for LUCID modeling.
Figure 4.2 Coverage of bootstrap 95% confidence intervals (normal) in simulation
Reference lines: red, true effect; green, null hypothesis; (a) causal SNPs; (b) null SNPs; (c) causal biomarkers; (d) null
biomarkers; (e) outcome effect.
To examine the impact of varying genetic, biomarker, and outcome effects on the standard error estimation, the bootstrap procedure is applied with B=100 across 100 replicates. The average power to detect a causal gene across replicates increases, while the mean Type I error for a null gene decreases, as OR_G increases from 1.2 to 2.0 (Figure 4.3a). The increasing OR_G also leads to slightly higher power in detecting the
causal biomarker and outcome effects. When the biomarker effect Z increases from 0.2 to 1.0,
the power in detecting all causal features and the outcome effect increases considerably (Figure
4.3b). These increases in power are more obvious when the simulated outcome effect is larger
(Figure 4.3b & c). With the bootstrapping method, Type I errors are retained at or below the 0.05 level.
Figure 4.3 Power and Type I Error with varying genetic, biomarker, and outcome effects
applying bootstrapping approach
Reference: red dashed line, 0.05; green dashed line, 0.80; G_pwr, genetic power; G_err, genetic type I error; Z_pwr, biomarker power;
Z_err, biomarker type I error; Y_pwr, outcome power.
4.3.2.3 Comparison of SEM, bootstrapping, and post hoc analysis for statistical inference
Another naive but intuitive way to obtain the standard errors for LUCID parameter estimates is to perform post hoc tests using the posterior probabilities of the latent clusters, which are part of the LUCID model output. As described in Chapter 2, the standard errors can then be obtained from separate regression models that check the association of each factor with the estimated cluster probabilities. Here, we compared the power based on simulated data across the SEM algorithm, the bootstrap procedure, and the simple post hoc analysis.
As shown in Figure 4.4, for a simulation scenario with OR_G=2.0 for the genetic effect, Δ_Z=0.4 for the mean difference in biomarkers, and Δ_Y=0.2 for the mean difference in the outcome across 2 true latent clusters, the power for detecting the causal SNPs and biomarkers as well as the outcome effect is similar between SEM and bootstrapping, although the bootstrap procedure
generally has slightly higher power. The post hoc analysis provides considerably greater power than the other two options for all causal effects; however, it rejects the null hypotheses for null genetic features and biomarkers far too frequently, leading to highly inflated type I errors. In
contrast, bootstrapping successfully retains the type I error for null features at a level around
0.05. SEM also generates inflated type I errors for null SNPs and biomarkers, but the situation is
less severe than in post hoc analysis. Since bootstrapping shows advantages in offering
appropriate power and stable type I error in the estimation of LUCID model parameters, we
focus on the statistical inference using the bootstrap method in the applied example.
Figure 4.4 Power and Type I Error for estimating genetic, biomarker, and outcome effects
through different methods
Reference: red dashed line, 0.05; green dashed line, 0.80; G_pwr, genetic power; G_err, genetic type I error; Z_pwr, biomarker power;
Z_err, biomarker type I error; Y_pwr, outcome power.
4.4 Applied Examples for Statistical Inference
4.4.1 Data Description
Again, we apply LUCID to participants from a longitudinal cohort in the Study of Latino
Adolescents at Risk (SOLAR) of Type 2 Diabetes project (Goran et al., 2004; Weigensberg et
al., 2003). Participants lived in Los Angeles, California, were of Hispanic/Latino ethnicity (defined
as self-reported race/ethnicity for all participants, parents, and grandparents) and underwent
repeated, detailed phenotyping for clinical risk factors of type 2 diabetes as well as body
composition testing. The analysis uses high-resolution metabolomics (HRM) profiling performed with standardized methods (Soltow et al., 2013). Concentrations of PFOA, PFOS
and PFHxS were quantified by reference standardization (Go et al., 2015). In Chapter 2, we
applied LUCID on participants with complete data (n=40), integrating metabolome data with
serum levels of perfluorohexane sulfonic acid (PFHxS) as the exposure to estimate subgroups of
children related to the risk for developing type 2 diabetes.
4.4.2 Application Results
The results of integrative clustering through LUCID were presented in Chapter 2, Figure 2.15. As
demonstrated in Table 4.2, the LUCID parameter estimates from the original analytical sample
indicate that the exposure effect of PFHxS is positively associated with the heterogeneity in
terms of mean differences in serum levels of various metabolites and with an increased risk of
developing type II diabetes. The bootstrap procedure with B=1000 shows
some bias between the original estimates and bootstrap statistics. Both the normal and the basic
bootstrap 95% CIs suggest significant effects of Palmitoylcarnitine, Valine, Isoleucine, and
Alanine in the integrative clustering, while the percentile method shows significance in the
effects of all metabolites and the outcome of interest (type II diabetes).
Table 4.2 Bootstrap Confidence intervals for LUCID parameter estimates in the SOLAR project
FEATURES              LUCID ESTIMATE    NORMAL 95% CI    BASIC 95% CI    PERCENTILE 95% CI
EXPOSURE
  PFHxS                 0.13    (-3.31, 2.50)    (-4.67, 1.06)    (-0.79, 4.93)
METABOLITES
  Palmitoylcarnitine    0.99    (0.46, 2.20)     (0.35, 1.96)     (0.02, 1.63)
  Tyrosine              0.76    (-0.13, 1.55)    (-0.10, 1.48)    (0.05, 1.62)
  Valine                1.21    (0.83, 2.63)     (0.75, 2.40)     (0.03, 1.68)
  Isoleucine            1.09    (0.24, 2.18)     (0.31, 2.10)     (0.07, 1.86)
  Alanine               0.99    (0.24, 2.05)     (0.26, 1.93)     (0.04, 1.71)
  Cysteine              0.44    (-0.49, 1.09)    (-0.60, 0.85)    (0.03, 1.49)
OUTCOME
  Type II diabetes      1.86    (0.08, 21.40)    (0.23, 3.32)     (1.04, 14.85)
4.5 Implementation - LUCIDus R Package Updates
We believe these initial thoughts on variance estimation for LUCID parameter estimates show good potential for conducting hypothesis testing after integrative clustering. Both the SEM algorithm and the bootstrap method are worthwhile features to include in the LUCIDus R package to help users explore and better understand the uncertainty of the estimation. Specifically, the following two functions can be used to obtain the SEs of the model parameter estimates:
• sem_lucid() provides SEs of LUCID parameter estimates for exact inference when the
latent cluster analysis is performed with multi-omics data with or without phenotypic
traits. SEs are obtained through a supplemented EM-algorithm (SEM). This function has
already been included in the current version of LUCIDus.
• boot_lucid() offers SEs of LUCID parameter estimates based on bootstrapping. It also includes the option bootci = c("norm", "basic", "perc") to provide users with different types of confidence intervals and par=TRUE to run the bootstrap loop in parallel.
4.6 Discussion
Understanding the uncertainty of parameter estimation is critical for statistical inference. For a complex joint model like LUCID, with multiple parameters and latent variables involved, the model setting further complicates the problem. When an EM algorithm is utilized, we avoid the issue of directly maximizing the analytically intractable model likelihood, but the algorithm does not automatically produce the SEs of the parameter estimates, since the Fisher information matrix for the observed data cannot be obtained directly. Existing methods have been proposed to solve the problem by calculating the
observed information with complete and missing information or computationally based on
resampling. We adopted the idea of the SEM algorithm and bootstrap procedure to explore the
options to obtain SE for LUCID parameter estimates.
Based on the simulation, we found that with an increased effect size of each component
in the LUCID DAG model, both SEM and bootstrapping showed an ascending trend in power for
causal effects and a descending trend in type I errors among null features in general. The varying
biomarker effects substantially affected the power and type I errors in detecting all component
effects (genetic, biomarker, and outcome). Meanwhile, the influence of changing genetic and outcome effects was limited to the power and type I error for effects in the corresponding data source only. Additionally, the SEM algorithm produced inflated type I errors when the effect to be
detected was relatively small and usually offered a slightly lower power in detecting causal
features compared to the bootstrap method. Post hoc analysis based on regression models
generated high values for both power and type I error, indicating it failed to distinguish
informative features from noise; it is therefore not an appropriate approach for LUCID model inference.
In terms of the SEM algorithm, we found its convergence rate was only around 50%
across replicates in our simulation. For the SOLAR applied example shown above, we also tried
to estimate the SEs with SEM, but the algorithm did not converge. Statisticians have realized that
the convergence of SEM highly depends on the iteration of the EM algorithm (Meng and Rubin,
1996). When the convergence of the EM algorithm is slow, the matrix I − DM is nearly singular and thus not invertible, making it impossible to calculate the observed information needed to estimate standard errors. Also, SEM needs very accurate MLEs of the model parameters, making it
numerically unstable for joint modeling methods with many parameters whose E-step does not have a closed-form solution (Xu et al., 2014). Given that our LUCID model meets the
aforementioned conditions of concerns, users may want to avoid applying it to measure the
uncertainty if there are many assumed latent clusters, or if the features are high-dimensional.
One caveat inherent in utilizing the bootstrap procedure is its computational burden.
When applying the bootstrap strategy with the EM algorithm, the embedded EM for B iterations
extensively increases the computing time. If the observed data are high-dimensional and the
proportion of missing data is considerable, using the bootstrap method to obtain SEs of parameter estimates is not practically feasible. Moreover, we tried to apply the proposed 2-step
procedure on the high-dimensional FOCM data we used in Chapter 2. However, bootstrapping
after feature selection with regularization indicated only a few significant effects among the selected features contributing to the subgroup classification, which seems biologically unreasonable (Appendix XII, Table A.5). A lack of good estimates of the uncertainty in
regularization can be a reason for this bad inference since lasso estimates do not guarantee the
desirable oracle properties of selection consistency and asymptotic normality (Fan and Li, 2001).
Including the regularization inside the bootstrap loop could theoretically yield better SE estimates, but this would make the overall procedure even more computationally demanding and less practical.
In conclusion, estimating the standard errors for LUCID parameter estimates could
provide an insightful perspective on the significant effects contributing to underlying
heterogeneity. A simple but misleading post hoc analysis for SE should only be exploratory. The
SEM algorithm and bootstrap method proved to produce SEs efficiently for data settings with a small number of features and fewer parameters to evaluate. Nevertheless, further investigation is necessary into more stable methods capable of quantifying the uncertainty of both the regularization and the EM algorithm when the data set to be analyzed is more complicated.
4.7 Acknowledgements
Thank you to Drs. Vaia Lida Chatzi and Ran Jin for providing SOLAR data in the applied
analysis.
4.8 Funding Support
This work was supported by the National Cancer Institute at the National Institutes of Health
Grant P01 CA196569 and R01 CA140561.
Chapter 5 LUCIDus: An R Package for the Implementation
of LUCID
5.1 Introduction
Multi-omics data, including metabolomes, transcriptomes, genomes, epigenomes, and protein
profiles, have been rapidly generated from epidemiologic, clinical, and translational studies.
These data collected from multiple platforms have facilitated new insights into the underlying
etiologic mechanism of disease. Although the potential of multi-omics data is huge, the different characteristics of data from various sources and the high-dimensional settings of multi-omics information make the analysis of integrated data both analytically and computationally challenging. Moreover, few methods are available for integrating various omics
data across multiple platforms with phenotypic traits to reveal underlying molecular subtypes.
Therefore, an integrative approach to jointly estimate latent clusters using multi-omics data and
phenotypic traits with a consideration of the underlying biological process is desirable. This
absence of an appropriate approach was filled by the LUCID method (short for latent unknown clustering with integrated data) recently proposed in Chapter 2.
In this chapter, we introduce the R package LUCIDus as an implementation of the
LUCID method to jointly estimate the latent clusters relevant to the outcome of interest using
integrated data. By treating subtype estimation as a missing data problem, an expectation-
maximization (EM) algorithm is applied to achieve maximum likelihood estimates of the model
parameters. To accommodate high dimensional omics data, we use regularized regression
techniques to obtain sparse estimates for the corresponding coefficients. The complete R package
LUCIDus version 0.9.0, with the entire source code, detailed function documentation, and
guided package vignettes (Appendix XIII), is available on CRAN and the GitHub repository,
which can be installed with R 3.4.0 or higher.
The rest of this chapter is organized as follows. We first briefly review the DAG
assumption and describe the LUCID EM algorithm to optimize the complete-data likelihood.
Next, we give a brief overview of the LUCIDus package and its functions to show how to add
LASSO-type penalty terms into objective functions to generate a sparse solution in case of high
dimensional data and present the two-step framework for integrated analysis. The workflow of
the package application is then delineated in detail. Finally, we demonstrate the application of
the LUCIDus package on simulated data followed by final conclusions and briefly discuss future
potential developments of the package.
5.2 LUCIDus Package Models – A Review
5.2.1 Base Model
As illustrated in Figure 2.1, multi-omics data combined with the phenotypic trait are integrated
by jointly modeling their relationships through a latent variable for clustering in the DAG. With
the conditional independence implied by the DAG, the joint likelihood of model parameters can
be easily formalized given probabilistic models of X given G, Z given X and Y given X as
described in Chapter 2.
To obtain maximum likelihood estimates (MLEs) of the LUCID model parameters, directly maximizing the marginal log-likelihood with the latent variable X involved is often difficult. Instead, we use an approximate procedure that maximizes the expectation of the complete-data log-likelihood, a lower bound of the marginal log-likelihood, to
achieve parameter estimates. Then, an EM algorithm is applied to handle optimization and
estimate model parameters as shown in Algorithm 5.1. Detailed derivations of the EM algorithm
for the integrative clustering have been presented in Appendix II.
Algorithm 5.1 Latent Clustering EM Algorithm for LUCID
5.2.2 Adjusted Model
Confounding effects caused by covariates could lead to biased results of integrated clustering.
When we estimate the model parameters through a regression framework as previously discussed, covariates that need to be adjusted during the estimation of genetic and outcome effects with respect to the integrative latent clusters can be easily included in our joint modeling. A challenge
for this issue is to locate the correct component (CoG or CoY in Figure 3.1) in the DAG to adjust
corresponding covariates. When covariates are included in the conditional probability of the
latent clusters given genetic features, their roles in the model are more similar to predictors for
the latent clusters. In contrast, if we include covariates in the conditional probability of the
outcome given the latent clusters, a more conventional adjustment for confounding is indicated.
Moreover, if systematic effects on biomarkers are assumed, we could adjust each feature for
these covariates and then include the residuals in the analysis as the biomarker inputs for the
integrative clustering.
5.2.3 Regularization
The proposed integrative clustering model easily accommodates scenarios with high-dimensional genetic features (G) and expression profiles (Z) by adopting regularization paths, such as the LASSO penalty (Tibshirani, 1996) and its variants. Penalized likelihood can be applied
to the corresponding models separately in view of the fact that the model likelihood consists of
three conditionally independent components in the DAG sharing no model parameters. Choosing
an optimal combination of tuning parameters for the DAG model when both genetic and
biological features are high-dimensional could be immensely challenging. Also, considering our
primary goal is to estimate the latent cluster, which is unobservable, cross-validation, the
most traditional model selection scheme, is not feasible for choosing tuning parameters for our
integrative clustering model. Thus, a grid-search guided by Bayesian information criterion (BIC),
a widely used alternative, is applied to determine the optimal combination of three tuning
parameters.
5.3 Implementation and Application
5.3.1 Functions in the LUCIDus Package
The LUCIDus package includes nine functions with separate purposes to implement the above
theoretical model. Brief descriptions of each function are listed in Table 5.1. Among those
functions in the package, the first three functions below are the main ones playing critical roles
in estimating integrative latent clusters:
• est_lucid() offers a convenient implementation to estimate an integrative cluster
assignment using observed multi-omics data (genetic and biological features) with or
without outcome traits. Options to induce sparsity for cluster-specific parameter estimates
via L1 regularization under conditions with various sources of high-dimensional data are
also provided.
• sem_lucid() provides standard errors (SE) of parameter estimates for exact inference
when performing latent cluster analysis with multi-omics data with or without phenotypic
traits. SEs are obtained through a supplemented EM-algorithm (SEM).
• tune_lucid() fits regularized latent cluster models with various combinations of three
tuning parameters based on joint inference across data types to perform a grid-search
through parallel computing. It aims to help users determine an optimal choice of tuning
parameters guided by model BIC. Two alternative types of GICs, which are more suitable
for model selection when data are high-dimensional, are also provided as references
(Chen et al., 2016).
The next three functions in the package are supplementary. Specifically, summary_lucid() presents a list of important model outputs from a LUCID model fit object in R; plot_lucid() illustrates the results of latent clustering with a Sankey diagram for visualization; pred_lucid() offers predicted values for the latent cluster and the outcome of interest with new data. The last three functions in LUCIDus are ancillary, allowing the user to define customized initial values for model parameters, tuning parameters for regularization, and stopping criteria based on prior knowledge.
Table 5.1 Functions in the LUCIDus R package
5.3.2 General Workflow
Although nine separate functions are developed for LUCID, we would suggest using this R
package as a workflow which is shown in Figure 5.1. When dealing with high-dimensional data
using regularization techniques, the first important thing is to determine the best combination of
tuning parameters. This can be done through a grid-search guided by comparing model BICs,
which is implemented with tune_lucid() from the LUCIDus package. Next, informative features of the multi-omics data, i.e., those with non-zero parameter estimates, are selected through est_lucid() with the chosen tuning parameters. After the selection, the clustering analysis can focus on parameter estimation with the selected features only through est_lucid(), which returns an
"IntClust" object in the R environment. To summarize the results, summary_lucid() can return a
list in R, consisting of model parameter estimates, indicator vectors and total counts of non-zero
genetic and biological features, and the model GICs, based on an "IntClust" object. To visualize
the integrative clustering, a Sankey diagram can be produced using the plot_lucid() function in
the package. When predicting new data based on an existing LUCID model is of interest,
pred_lucid() could be applied to generate both predicted posterior probabilities and values of the
outcome. Last but not least, three ancillary functions, including def_initial(), def_tune(), and
def_tol(), are useful in setting the initial values for model fitting, the tuning parameters for regularization,
and convergence criteria for overall clustering.
Next, we show example codes and application results to demonstrate the usage of
LUCIDus package in practice.
Figure 5.1 The workflow of the LUCIDus package
5.3.3 Input and Data Requirements
All main functions of the LUCIDus package, including est_lucid(), tune_lucid(), and
sem_lucid(), require the components of the DAG model. These functions expect observed
integrated data consisting of genetic (G) and biological (Z) features with or without the outcome
of interest (Y) as the input for the model. Each component needs to be expressed as a matrix in the R environment, even though the outcome would typically be stored as a vector. Initial values of the model parameters β, δ, Σ, and γ for the LUCID EM algorithm are defined through def_initial().
Specifically, let K be the number of latent clusters. Then, init_b is a K by (ncol(G)+1)
dimensional matrix; init_m is a K by ncol(Z) dimensional matrix; init_s is a list of K ncol(Z) by
ncol(Z) matrices; init_g is a vector with a length of K for a binary Y (log odds in K clusters) or
2K for a continuous Y (K cluster-specific means followed by standard deviations in K clusters).
If not pre-specified, random values will be generated for these initials. If covariates need to be
adjusted in the model, they could be specified using either the CoG or the CoY option based on
particular research assumptions.
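Under the dimension rules above, a sketch of user-specified initial values might look like the following; the exact call signature of def_initial() and the zero/identity starting values are assumptions for illustration:

```r
# Hypothetical initial values for K = 2 latent clusters; dimensions follow
# the rules described in the text. G and Z are the observed feature matrices.
K  <- 2
nG <- ncol(G)   # number of genetic features
nZ <- ncol(Z)   # number of biological features
inits <- def_initial(
  init_b = matrix(0, nrow = K, ncol = nG + 1),  # K x (ncol(G)+1)
  init_m = matrix(0, nrow = K, ncol = nZ),      # K x ncol(Z)
  init_s = rep(list(diag(nZ)), K),              # list of K ncol(Z) x ncol(Z) matrices
  init_g = rep(0, K)                            # length K for a binary Y
)
```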
5.3.4 Model Specification
When fitting a LUCID model for integrative clustering, many options could be specified for
different research purposes. The default for the number of latent clusters, K, is 2, but can be
changed to other values if any prior knowledge is available. If including the phenotypic traits in
the clustering is desirable, the useY option could be turned on to take the outcome effect into
consideration. Furthermore, based on the type of the outcome variable, the family option
should be specified as either "binary" or "normal." Finally, the posterior probability of each
latent cluster could be calculated with parameter estimates of the fitted model by flagging the
Pred option.
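Putting these options together, a basic fit might be specified as follows; K, useY, family, and Pred are the options described above, while the names of the data arguments (G, Z, Y) are assumptions:

```r
# Sketch of a basic LUCID fit with the options discussed in this section.
fit <- est_lucid(G = G, Z = Z, Y = Y,
                 K = 2,              # number of latent clusters (the default)
                 useY = TRUE,        # include the phenotypic trait in clustering
                 family = "binary",  # "normal" for a continuous outcome
                 Pred = TRUE)        # also return posterior cluster probabilities
```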
When performing a grid-search to explore the best combination of tuning parameters and
optimal K, additional arguments must be defined besides those aforementioned options. Lower
and upper limits of the three tuning parameters ρ_G, ρ_ZΣ⁻¹, and ρ_Zμ, as well as the desired
length of the sequence for the ranges should be set up as inputs for grid-searching. Finally, since
tune_lucid() fits regularized latent cluster models with various combinations of three tuning
parameters through parallel computing, the number of CPU cores to be utilized is determined by
the NoCores option with a default value as the total number of cores minus 1.
5.3.5 LUCID Model Usage and Output
For illustration purposes, we show toy examples of LUCID model fitting for integrative
clustering with simulated data consisting of 10 SNPs (5 causal), 4 biomarkers (2 causal), and a
binary outcome of interest. True simulated effects for heterogeneity across two clusters are
OR
G
=2.0,Δ
Z
=10, and OR
Y
=2.0 respectively.
5.3.5.1 Grid Search
As suggested in the LUCID workflow, appropriate values of K and tuning parameters must first
be determined. This task can be done using tune_lucid() with the following codes:
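The original code listing is not reproduced in this transcript; a minimal sketch of the call, assuming the data arguments are named G, Z, and Y and the tuning ranges are left at sensible defaults, might be:

```r
# Hypothetical grid search over the three tuning parameters; the lower/upper
# limits and sequence lengths would be supplied as described in Section 5.3.4.
GridSearchK <- tune_lucid(G = G, Z = Z, Y = Y,
                          K = 2, family = "binary", useY = TRUE)
GridSearchK$Results  # BIC/GIC for every tuning-parameter combination
GridSearchK$Optimal  # the combination achieving the minimum BIC
```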
Results from a grid search can be checked with GridSearchK$Results, and the optimal
model with the minimum BIC can be further identified with GridSearchK$Optimal. By cross-comparing the optima among all the values of K we fit, the best model with the proper K and tuning parameters can be found. In the above example, the integrative clustering model did not converge when K=4. Drawing a BIC plot for K=2 and K=3, the best model is the optimal one with K=2, as shown in Figure 5.2.
Figure 5.2 BIC for a grid search of K
5.3.5.2 Raw clustering with regularization for feature selection
After we have decided on the optimal values for K and tuning parameters, our research goal
shifts to feature selection to identify the most informative features from the data.
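The selection code itself is elided in this transcript. A sketch of what it might look like, passing the chosen tuning parameters through def_tune() (the penalty argument names shown here are assumptions, and the Rho_* values are placeholders for the optimal grid-search combination):

```r
# Hypothetical regularized fit for feature selection; non-zero estimates
# identify the informative features.
fit_select <- est_lucid(G = G, Z = Z, Y = Y, K = 2,
                        family = "binary", useY = TRUE,
                        tunepar = def_tune(Select_G = TRUE, Select_Z = TRUE,
                                           Rho_G = 0.01,        # penalty on genetic effects
                                           Rho_Z_InvCov = 0.1,  # penalty on the precision matrix
                                           Rho_Z_CovMu = 0.5))  # penalty on cluster means
```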
5.3.5.3 Model output summary
The above codes return an "IntClust" object in the R environment. It could be further
summarized with summary_lucid(), which will return a list of useful statistics for integrative
clustering. If a change of the reference cluster is desirable, the switch=TRUE option in summary_lucid() can be specified to sort the model outputs in ascending order using the most protective outcome cluster as the reference. Furthermore, a more advanced way to change the reference is available by specifying a customized order vector.
With the help of the integrative clustering summary, we can easily identify the
informative features with non-zero parameter estimates with $No0G and $No0Z. As shown
below, those causal features based on the simulation could be successfully selected.
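A sketch of inspecting the selection through the summary; the element names $No0G and $No0Z come from the text, while fit_select denotes a hypothetical regularized "IntClust" fit object:

```r
# Illustration only: summarize a regularized LUCID fit and read off the
# features retained with non-zero parameter estimates.
s <- summary_lucid(fit_select)
s$No0G  # indicators/counts of genetic features with non-zero estimates
s$No0Z  # indicators/counts of biological features with non-zero estimates
```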
5.3.5.4 Effect estimation after selection
Regularized clustering provides us with insights into the important features. Considering only the selected features in the clustering, our analytic goal then moves to parameter estimation and statistical inference.
Furthermore, we could now use sem_lucid() to implement the supplemented EM (SEM)
algorithm (Meng and Rubin, 1991) to obtain variance estimates as shown below, so that the
standard errors of parameter estimates could be calculated for statistical inference.
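A sketch of that step, assuming sem_lucid() mirrors the est_lucid() interface and is run on the selected features only; G_sel and Z_sel are hypothetical names for the matrices restricted to the selected features:

```r
# Hypothetical SEM run for standard errors after feature selection.
sem_fit <- sem_lucid(G = G_sel, Z = Z_sel, Y = Y,
                     K = 2, family = "binary", useY = TRUE)
# The resulting SEs can then be combined with the point estimates for
# Wald-type inference on each selected feature.
```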
5.3.5.5 Visualization of integrative clustering
Visualizing the integrative clustering can make the results more efficient to interpret.
In the LUCIDus package, plot_lucid() generates a Sankey diagram based on an "IntClust" object.
In a Sankey diagram generated for LUCID, the width of links reflects the effect size of parameter
estimates; the color of the links indicates the sign or direction of parameter estimates (dark:
positive; light: negative). When dealing with a binary outcome of interest, integrative clusters
with low and high risks are color-coded in green and red respectively with the reference group
shown in gray. As illustrated in Figure 5.3, LUCID efficiently distinguishes the simulated
heterogeneity implied by the latent clusters.
Figure 5.3 Sankey diagram for the base model with informative features
5.3.5.6 Adjusted clustering with covariates
Adjusting covariates that are potential confounders could be a key to obtaining unbiased
parameter estimates. To show how LUCID handles covariates during clustering, two covariate
sets are randomly generated. Each set has 5 variables including both continuous and binary ones
which are not associated with any components in the DAG. The following codes offer an
example to fit LUCID for both feature selection and effect estimation using the CoG and CoY
options in the est_lucid() function. When we include covariates in the regularized clustering,
they are forced to stay in the model and are therefore always selected.
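A sketch of such a fit, with hypothetical covariate matrices Cov_G and Cov_Y standing in for the two randomly generated sets:

```r
# Hypothetical adjusted clustering: CoG places covariates as predictors of
# the latent clusters, while CoY adjusts the outcome model for confounding.
fit_adj <- est_lucid(G = G, Z = Z, Y = Y, K = 2,
                     family = "binary", useY = TRUE,
                     CoG = Cov_G,
                     CoY = Cov_Y)
plot_lucid(fit_adj)  # Sankey diagram of the adjusted model
```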
As shown in Figure 5.4, if we plot the Sankey diagram for the final adjusted model, all
the causal features from the simulation are correctly selected, demonstrating the heterogeneity between clusters, while the covariates are included with slight effects, as we
expect.
Figure 5.4 Sankey diagram for the adjusted model with informative features
5.3.5.7 Prediction of clusters and outcome risks
With a fitted LUCID model, we can make predictions on new multi-omics data. The following codes show how pred_lucid() takes advantage of the information from an
"IntClust" object in R to produce posterior probabilities of latent clusters ($pred_cluster) and
predicted values of the outcome ($pred_outcome). The predicted probability of the risk is
returned when the outcome is binary; a predicted mean for each latent cluster is listed when the
outcome is continuous. Additionally, if the LUCID fit is an adjusted model, new covariates also
need to be specified in pred_lucid() when making predictions.
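A sketch of the prediction step on hypothetical new data; fit_adj denotes an adjusted model fit, and the argument names for the new data are assumptions:

```r
# Illustration only: predict from an adjusted "IntClust" fit; new covariates
# are required because the fitted model was adjusted.
preds <- pred_lucid(Fit = fit_adj, G = G_new, Z = Z_new,
                    CoG = CovG_new, CoY = CovY_new)
preds$pred_cluster  # posterior probabilities of the latent clusters
preds$pred_outcome  # predicted risk (binary Y) or cluster-specific means (continuous Y)
```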
5.3.6 Applications and Examples
To determine if the proposed method could efficiently identify the heterogeneity of disease risks
based on complex biological pathways, we fit the integrative clustering model on simulated data
reflecting the folate-associated one-carbon metabolism (FOCM), which is a complex pathway
involving 19 enzymes or carrier proteins with various feedback loops and two main cycles, the
folate and the methionine cycles. Briefly, risks of a certain outcome could be generated through
various FOCM biomarker effects of interest, which is based on a system of differential
equations. Details of the simulation procedure have been described elsewhere (Thomas et al.,
2009).
Two simulation scenarios were considered in the integrative latent clustering. A simple
scenario focuses on the risk of outcome attributed to a positive effect of SAM only. Another
relatively complicated scenario highlights increased outcome risks related to positive effects of
SAM and 5,10-CH2-THF with negative effects of SAH and DHF.
The first step is to perform a grid-search with a wide range of tuning parameters. This
step can be repeated with varying values of the cluster number (K), so that the optimal K can also
be selected from the grid-search. For example, the following codes provide a run for the first
scenario with 1000 combinations of tuning parameters.
After the final values of K and tuning parameters for regularization are determined, the
primary interest of the analysis now shifts to feature selection for informative SNPs and
biomarkers. As shown below, integrative clustering can be performed with regularization using
determined tuning parameters so that informative features with non-zero parameter estimates can
be identified. In this example for FOCM simulated data, the final selected model has 2 clusters
with 𝜌 G
=0.002, 𝜌 zΣ
−1 =3e−05, and 𝜌 𝑧𝜇
=0.54.
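A sketch of this final regularized fit at the selected values (again, the function and argument names are illustrative assumptions rather than the exact released interface) might be:

```r
# Hypothetical final fit at the tuning parameters chosen by the grid search;
# features retaining non-zero estimates are taken as the informative ones.
fit <- est_lucid(G = G, Z = Z, Y = Y, K = 2, family = "binary",
                 Rho_G = 0.002, Rho_Z_InvCov = 3e-05, Rho_Z_CovMu = 0.54)
summary_lucid(fit)  # inspect non-zero coefficients for SNPs and biomarkers
```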
A similar procedure can be performed for the second scenario. The clustering results for
both scenarios (Figure 5.5a & b) indicate a similar trend in distinguishing outcome
heterogeneity regardless of the simulated causal mechanism, although the more complicated
causal mechanism results in a less risky integrative cluster. The high-risk cluster of the
two identified is characterized by high levels of 5,10-CH2-THF and 10f-THF and low levels
of 5mTHF and THF. Among the informative SNPs, positive effects of SHMT, MS, and BHMT and
negative effects of TS, MTHFR, and CBS lead to higher risk of the outcome.
Figure 5.5 Cluster-specific parameter estimates for the FOCM simulated data:
(a) SAM(+) only; (b) SAM(+), SAH(-), DHF(-), 5,10-CH2-THF(+)
5.4 Conclusions
In this chapter, we have introduced the LUCIDus package, a set of R functions implementing
the previously proposed LUCID method, which jointly estimates unknown molecular subtypes
defined by relevant multi-omics features together with a disease outcome or phenotype. We
believe this R package will serve as a useful tool for obtaining improved subtype
classifications, leading to better diagnostics and prognostics and ultimately supporting
efficient targeted treatments and successful precision medicine.
For the future directions of the LUCIDus package, we have considered many possible
extensions that could be meaningful and practical additions to integrative analysis in
genetic epidemiology. Some of the potential directions include:
1. Alternative penalty terms. Although the LASSO has many favorable properties, several
alternative nonconvex penalties with desirable oracle properties, such as the smoothly
clipped absolute deviation (SCAD) and the minimax concave penalty (MCP), have been proposed
(Fan and Li, 2001; Breheny and Huang, 2011). If one of these penalty terms were adopted in
the LUCID method, the informative features could be selected more reliably, making
post-selection statistical inference more trustworthy.
2. Regularized EM algorithm with dynamic tuning parameters. Although the convergence of the
EM algorithm has been established in previous studies (Balakrishnan et al., 2014), the
regularized M-step in the high-dimensional setting may not be well defined to balance
making progress towards the solution against identifying the right structure (Wainwright,
2014). Yi and Caramanis (2015) proposed a more sophisticated version of the regularized EM
algorithm with both a modified M-step and a tuning parameter updated dynamically after each
EM iteration. Integrating this more advanced algorithm into the LUCID method may yield a
more stable regularized EM and more promising feature selection.
3. Alternative options to perform statistical inference. We proposed the SEM algorithm for
variance estimation in the LUCIDus package; however, when convergence of the EM is
extremely slow, the SEM algorithm often fails to converge because of the near-singularity
of (I − DM)⁻¹ when calculating ΔV, the increase in variance due to missing data (Meng and
Rubin, 1991). It is therefore natural to search for alternatives for estimating variance
when applying the EM algorithm.
• Bootstrapping confidence intervals. As discussed in Chapter 4, we can design a
bootstrapping procedure that resamples subjects with replacement from the population or the
current sample a large number of times. In this way, the sampling distributions of the
parameter estimates in our model can be approximated, and their means and standard errors
used to construct confidence intervals;
• Bayesian framework. If we revisit our DAG model from a Bayesian perspective, estimated
posterior distributions of the LUCID model parameters could be used directly to construct
credible intervals and conduct hypothesis testing after identifying informative features
from high-dimensional data.
4. Missingness in multi-omics data. When handling high-throughput data in biomedical
research, missing data are often encountered for various reasons. Statisticians have long
recognized that analyzing such data with conventional statistical methods, without properly
accounting for the missingness pattern and mechanism, can lead to biased estimation and
invalid inference. Instead of relying on complete-case analysis via listwise deletion, we
aim to develop more careful modifications of the LUCID EM algorithm. As proposed in Chapter
3, the improved package should handle missing biomarkers by incorporating the missingness
pattern into the integrative latent clustering model likelihood, or by efficiently and
dynamically imputing the missing values in the sample.
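The bootstrap procedure sketched in item 3 above can be written generically; refit_model() below is a hypothetical stand-in for re-running the LUCID EM on resampled data and returning one parameter estimate of interest.

```r
# Generic percentile bootstrap for an EM-based estimator (sketch only).
set.seed(2019)
B <- 500                                   # number of bootstrap resamples
n <- nrow(dat)                             # `dat` holds the integrated data
est <- replicate(B, {
  idx <- sample.int(n, n, replace = TRUE)  # resample subjects with replacement
  refit_model(dat[idx, , drop = FALSE])    # hypothetical: EM fit -> estimate
})
se <- sd(est)                              # bootstrap standard error
ci <- quantile(est, c(0.025, 0.975))       # 95% percentile confidence interval
```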
Given these potential extensions, the LUCIDus R package will be actively and continually
developed. Planned extensions implementing the incomplete LUCID methods and producing
bootstrap SEs and confidence intervals for model parameter estimates will be released in
LUCIDus version 1.0.0. We hope that improved subtype classification empowered by LUCID will
offer researchers in the community more options for understanding latent biological
processes.
Chapter 6 Conclusions and Future Directions
6.1 Summary of Findings
In this dissertation, I started with a comprehensive introduction and a thorough review of
the relevant topics in Chapter 1, including the challenges of integrated data and existing
analysis options, the missing data issue, and classical regularization techniques. After
discussing currently available solutions for integrated data analysis and their flaws, I
emphasized the motivations for latent cluster analysis using integrated data that combine
multi-omics features and phenotypic traits. I then developed these ideas into novel
statistical methods, including theoretical development, applied examples, and software
implementation.
In Chapter 2, I proposed the LUCID base model, built upon a DAG latent variable framework,
to estimate integrative latent cluster assignments using multi-omics features and
phenotypic traits and to illuminate underlying heterogeneity in the biological process. In
high-dimensional settings, sparsity is induced through the regularized EM algorithm with
lasso-type penalties to achieve feature selection. Extensive simulations suggested that the
LUCID model can efficiently distinguish heterogeneity across latent clusters while
identifying the informative multi-omics features and the outcome effect driving the classification
of subgroups. We presented applications of the LUCID method to folate biomarker data from
the Colon CFR. Integrative clustering results revealed heterogeneity of CRC risk driven by
two genetic variants, which further results in differential expression of serum FOCM
biomarker levels. Another applied example, with SOLAR data for type II diabetes, identified
subgroups with varying risks of developing type II diabetes, characterized by differential
exposures to PFHxS and expression of relevant metabolites, indicating that LUCID can be
used for a broader scope of determinant factors.
In Chapter 3, I extended the LUCID model by incorporating two common missing-data patterns
in biomedical research to cope with incomplete bio-feature measurements. By including the
omic-wise or sporadic missingness pattern of bio-features in the model likelihood, we can
elegantly assign integrative latent clusters through joint likelihood partitioning or
dynamic imputation, respectively, using integrated data with MAR biomarkers. Sparsity can
still be promoted through regularization paths in high-dimensional regimes. Simulations of
simple and high-dimensional scenarios suggested that, by taking full advantage of all
observed data, our incomplete-data approach reduces bias and/or variance in parameter
estimates compared to complete-case analysis via listwise deletion. The incomplete
extension also sustains classification performance, offering good AUC. We then applied the
incomplete LUCID methods to an extended set from the folate biomarker study.
I next explored two options for estimating the standard errors of LUCID parameter
estimates. Both the SEM algorithm and the bootstrap approach showed potential for gauging
the uncertainty of the EM algorithm for statistical inference, but also limitations in
complex data settings with high-dimensional parameter spaces and a noticeable proportion of
missing data. The applied example further assessed the standard errors of parameter
estimates obtained from the integrative clustering of the SOLAR data in Chapter 2. The
results confirmed the effects of several significant metabolites underlying the
heterogeneity in type II diabetes risk.
Finally, to implement the LUCID method, I developed an R package, LUCIDus, as a
publicly accessible resource for researchers in the community. We have released LUCIDus
v0.9.0 to CRAN with complete source code, detailed function documentation, and guided
package vignettes. Some experimental features are available on the package's GitHub site,
and a feature update for LUCIDus is planned to include additional functions and options
implementing the proposed incomplete LUCID approach and the bootstrap method for obtaining
SEs of parameter estimates.
6.2 Future Directions of LUCID
The inherent flexibility of the LUCID joint model, based on the DAG conditional
independence assumption, facilitates potential modifications of the model likelihood to
satisfy specific study goals in the future. We discuss some possible changes to the LUCID
DAG model shown in Figure 6.1.
Figure 6.1 Potential extensions for LUCID
So far, the LUCID method can use continuous or binary outcome variables to identify
subgroups. Integrating other outcome types could be desirable for particular study
settings. For instance, survival data recording possibly censored time-to-event information
are widely used in biomedical studies and cancer research. Joint modeling of survival times
under the LUCID integrative clustering framework can be conducted by incorporating the
semiparametric likelihood of the Cox proportional hazards model, a widely used method for
hypotheses involving a survival process (Cox, 1972). If the partial likelihood of the Cox
model is not feasible, nonparametric maximum likelihood approaches could be considered,
although additional steps in the EM algorithm for parameter estimation may be needed (Xu et
al., 2014).
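One hedged way to write such an extension (an illustrative form, not a result established in this dissertation) replaces the outcome term f(Y_i | X_i = j) in the LUCID joint likelihood with a Cox-type contribution for observed time T_i and event indicator d_i:

```latex
f(T_i, d_i \mid X_i = j;\, \gamma_j)
  \;\propto\; \left[ h_0(T_i)\, e^{\gamma_j} \right]^{d_i}
              \exp\!\left( -H_0(T_i)\, e^{\gamma_j} \right)
```

where h_0 is a shared baseline hazard, H_0 its cumulative version, and γ_j a cluster-specific log-hazard ratio; the E-step responsibilities would then weight these contributions in the same way the binary-outcome terms are weighted in Appendix I.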
Another intriguing outcome type to integrate is the longitudinal process, which usually
involves repeated measurements. Modeling how the outcome changes over time may considerably
affect the estimation of underlying heterogeneity. To capture the longitudinal trajectory,
the most commonly applied method is the mixed-effects model with time as a random effect
(Pusponegoro et al., 2017). An intuitive option for including longitudinal information in
the LUCID integrative framework is thus to incorporate the mixed-effects model likelihood
into the LUCID joint likelihood. One advantage of incorporating survival and longitudinal
processes into LUCID through a regression framework is the ease of including covariates.
More generally, a weighted likelihood can substitute for the regular likelihood of the
outcome given the latent clusters (Y|X). Revisiting the challenge of including longitudinal
measurements, we can start by obtaining the mean across outcome measurements or an estimate
of the outcome's change over time. Then, inverse weighting by the standard error of the
mean, or of the slope over time, can incorporate the longitudinal information into the
LUCID joint likelihood. In addition, bias corrections may be necessary in observational
studies, since no randomization is conducted. In practice, it is not rare to draw a sample
with systematic errors that make certain subjects from the target population more or less
likely to be selected than others. When the outcome of interest is binary, especially under
a case-control design, this erroneous ascertainment of cases and/or controls results in
ascertainment bias (Siegmund & Langholz, 2002). An incorrectly ascertained sample fails to
reflect the true association between exposure and outcome in the target population; without
prudently addressing the ascertainment mechanism, conventional statistical methods applied
to a biased sample can only yield a distorted association. To eliminate ascertainment bias,
we can first estimate the probability of ascertainment given the outcome, based on the
specific study design, using observed data or prior knowledge from the relevant literature,
and then apply inverse probability weighting to the likelihood of Y|X so that the impact of
ascertainment bias is appropriately corrected.
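As a sketch under these assumptions, with A_i = 1 indicating that subject i was ascertained, the weighted outcome contribution could take the form:

```latex
l_w(\boldsymbol{\gamma})
  \;=\; \sum_{i=1}^{n} w_i \log f(Y_i \mid X_i;\, \boldsymbol{\gamma}),
\qquad
w_i \;=\; \frac{1}{\widehat{P}(A_i = 1 \mid Y_i)}
```

so that over-ascertained outcome values are down-weighted and under-ascertained ones up-weighted in the joint likelihood; this is an illustrative form rather than a derived result.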
We presented two solutions, the SEM algorithm and the bootstrap approach, for obtaining
standard errors of LUCID model parameters for statistical inference. However, their
stability and feasibility in high-dimensional omics settings remain a concern. In the
future, the LUCID framework could be remodeled from a Bayesian perspective. A fully
Bayesian approach to integrative clustering may not only provide a straightforward means of
statistical inference for the selected features through their posterior distributions, but
also avoid the computation required to find optimal lasso tuning parameters for feature
selection (O'Hara & Sillanpää, 2009; Mo et al., 2017). Alternative penalty terms with
oracle properties, such as SCAD (Fan and Li, 2001) and MCP (Breheny and Huang, 2011), could
also help select the truly informative subset of features and thus improve the performance
of post-selection inference.
Finally, certain interaction effects could be explored during the integrative clustering
with LUCID. For example, it is rare that genetic variants act exclusively in the formation
of latent clusters; instead, genetic features interact with environmental exposures in many
ways (Gauderman et al., 2017). Taking gene-environment interactions into consideration when
estimating the integrative clusters could improve the identification of underlying
subgroups. Another important interaction to investigate is that between the latent clusters
and the treatment option: understanding this interaction elucidates the sensitivity of each
estimated molecular subtype to the treatment of interest, which is particularly meaningful
for developing subtype-specific therapy.
6.3 Final Remarks
The LUCID integrative latent clustering method and its extensions comprehensively address
the issues and challenges involved in elucidating underlying patterns in integrated data.
Although all the topics focus on latent cluster analysis, they can also be viewed, from
another intriguing perspective, as dealing with missingness in integrated data: the latent
subgroups and the incomplete bio-feature measurements correspond to MCAR and MAR elements
of the integrated data, respectively. We hope the LUCID method, implemented through the
LUCIDus R package, will prove a helpful and practical tool for researchers in relevant
disciplines, facilitating the understanding of aetiological heterogeneity in targeted
biological processes and further advancing personalized medicine in discovering potential
diagnostic probes and therapeutic targets for subgroups of patients.
References
Aouizerat, B. E., Vittinghoff, E., Musone, S. L., Pawlikowska, L., Kwok, P.-Y., Olgin, J. E., &
Tseng, Z. H. (2011). GWAS for discovery and replication of genetic loci associated with
sudden cardiac arrest in patients with coronary artery disease. BMC Cardiovascular
Disorders, 11(1), 29. https://doi.org/10.1186/1471-2261-11-29
Balakrishnan, S., Wainwright, M. J., & Yu, B. (2014). Statistical guarantees for the EM
algorithm: From population to sample-based analysis. Annals of Statistics, 45(1), 77–120.
https://doi.org/10.1214/16-AOS1435
Breheny, P., & Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized
regression, with applications to biological feature selection. Annals of Applied Statistics,
5(1), 232–253. https://doi.org/10.1214/10-AOAS388
Canty A. and Ripley B. (2017). boot: Bootstrap R (S-Plus) Functions. R package version 1.3-20.
Chen, Y., Li, X., Liu, J., & Ying, Z. (2016). Regularized Latent Class Analysis with Application
in Cognitive Diagnosis. Psychometrika. Springer US. https://doi.org/10.1007/s11336-016-
9545-6
Chen, L. S., Prentice, R. L., & Wang, P. (2014). A penalized EM algorithm incorporating
missing data mechanism for Gaussian parameter estimation. Biometrics, 70(2), 312–322.
http://doi.org/10.1111/biom.12149
Cheng, C.L. (2014). Confounding and Bias in Case-Control Studies. 30th Annual Meeting of the
International Society for Pharmacoepidemiology. Retrieved from
https://www.pharmacoepi.org/pub/2c5b0123-aa97-2de2-3ed5-1e7a80533fe4
Correa-Villaseñor A, Satten GA, Rolka H, et al. Random error and undercounting in birth defects
surveillance data: implications for inference. Birth Defects Res A Clin Mol Teratol.
2003;67(9):610–616
Cox, D. (1972). Regression Models and Life-Tables. Journal of the Royal Statistical Society.
Series B (Methodological), 34(2), 187-220. Retrieved from
http://www.jstor.org/stable/2985181
Cragan JD, Khoury MJ. Effect of prenatal diagnosis on epidemiologic studies of birth defects.
Epidemiology. 2000; 11(6):695–699
Curtis, C., Shah, S. P., Chin, S.-F., Turashvili, G., Rueda, O. M., Dunning, M. J., … Aparicio, S.
(2012). The genomic and transcriptomic architecture of 2,000 breast tumors reveals novel
subgroups. Nature, 486(7403), 346–52. http://doi.org/10.1038/nature10983
Davison, A. C. & Hinkley, D. V. (1997) Bootstrap Methods and Their Applications. Cambridge
University Press, Cambridge. ISBN 0-521-57391-2
Deng, Y., Chang, C., Ido, M. S., & Long, Q. (2016). Multiple Imputation for General Missing
Data Patterns in the Presence of High-dimensional Data. Scientific Reports, 6(1), 21689.
http://doi.org/10.1038/srep21689
DiCiccio, T., & Efron, B. (1996). Bootstrap Confidence Intervals. Statistical Science, 11(3), 189-
212. Retrieved from http://www.jstor.org/stable/2246110
Efron, B., & Tibshirani, R. (1986). Bootstrap Methods for Standard Errors, Confidence Intervals,
and Other Measures of Statistical Accuracy. Statistical Science, 1(1), 77–77.
https://doi.org/10.1214/ss/1177013817
Fan, J., & Li, R. (2001). Variable Selection via Nonconcave Penalized Likelihood and its Oracle
Properties. Journal of the American Statistical Association, 96(456), 1348–1360.
https://doi.org/10.1198/016214501753382273
Fan, Y., & Tang, C. Y. (2013). Tuning parameter selection in high dimensional penalized
likelihood. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 75(3),
531–552. http://doi.org/10.1111/rssb.12001
Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the
graphical lasso. Biostatistics, 9(3), 432–441. http://doi.org/10.1093/biostatistics/kxm045
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized Linear
Models via Coordinate Descent. Journal Of Statistical Software, 35(12).
Gauderman, W. J., Mukherjee, B., Aschard, H., Hsu, L., Lewinger, J. P., Patel, C. J., …
Chatterjee, N. (2017). Update on the State of the Science for Analytical Methods for Gene-
Environment Interactions. American Journal of Epidemiology, 186(7), 762–770.
https://doi.org/10.1093/aje/kwx228
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural Networks and the Bias/Variance
Dilemma. Neural Computation, 4(1), 1-58.
Greenland, S., & Finkle, W. (1995). A critical look at methods for handling missing covariates in
epidemiologic regression analyses. American Journal of Epidemiology., 142(12), 1255-
1264.
Go YM, Walker DI, Liang Y, Uppal K, Soltow QA, Tran V, et al. Reference Standardization for
Mass Spectrometry and High-resolution Metabolomics Applications to Exposome
Research. Toxicol Sci. 2015;148(2):531-43.
Goran MI, Bergman RN, Avila Q, Watkins M, Ball GD, Shaibi GQ, et al. Impaired glucose
tolerance and reduced beta-cell function in overweight Latino children with a positive
family history for type 2 diabetes. J Clin Endocrinol Metab. 2004;89(1):207-12.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (Springer
Series in Statistics). New York, NY: Springer New York.
Haile, R. W., Siegmund, K. D., Gauderman, W. J., & Thomas, D. C. (1999). Study-design issues
in the development of the University of Southern California Consortium’s Colorectal
Cancer Family Registry. J Natl Cancer Inst Monogr, 90033(26), 89–93.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal
Problems. Technometrics, 12(1), 55–67. http://doi.org/10.1080/00401706.1970.10488634
Huang, Y. T., Vanderweele, T. J., & Lin, X. (2014). Joint analysis of SNP and gene expression
data in genetic association studies of complex diseases. Ann Appl Stat, 8(1), 352–376.
http://doi.org/10.1214/13-AOAS690
Huang, Y. T. (2015). Integrative modeling of multi-platform genomic data under the framework
of mediation analysis. Statistics in Medicine, 34(1), 162–178.
http://doi.org/10.1002/sim.6326
Huang, Y.-T., Liang, L., Moffatt, M. F., Cookson, W. O. C. M., & Lin, X. (2015). iGWAS:
Integrative Genome-Wide Association Studies of Genetic and Genomic Data for Disease
Susceptibility Using Mediation Analysis. Genetic Epidemiology, 39(5), 347–356.
http://doi.org/10.1002/gepi.21905
Howards, P. P., Johnson, C. Y., Honein, M. A., & Flanders, W. D. (2015). Adjusting for bias due
to incomplete case ascertainment in case-control studies of birth defects. American Journal
of Epidemiology, 181(8), 595–607. http://doi.org/10.1093/aje/kwu323
James, G., Tibshirani, R., Witten, D., Hastie, T., & SpringerLink. (2013). An Introduction to
Statistical Learning with Applications in R (Springer texts in statistics).
Jamshidian, M., and Jennrich, R. I. (2000). Standard errors for EM estimation. Journal of the
Royal Statistical Society, Series B 62, 257–270.
Janssens, B., Mohapatra, B., Vatta, M., Goossens, S., Vanpoucke, G., Kools, P., … Towbin, J. A.
(2003). Assessment of the CTNNA3 gene encoding human alpha T-catenin regarding its
involvement in dilated cardiomyopathy. Hum Genet, 112(3), 227–236.
https://doi.org/10.1007/s00439-002-0857-5
Jia, B., & Wang, X. (2014). Regularized EM algorithm for sparse parameter estimation in
nonlinear dynamic systems with application to gene regulatory network inference.
EURASIP Journal on Bioinformatics & Systems Biology, 2014(1), 5.
https://doi.org/10.1186/1687-4153-2014-5
Jiang, B., Ma, S., Causey, J., Qiao, L., Hardin, M. P., Bitts, I., … Huang, X. (2016). SparRec: An
effective matrix completion framework of missing data imputation for GWAS. Scientific
Reports, 6(1), 35534. http://doi.org/10.1038/srep35534
Li, H., Zhang, K., & Jiang, T. (2005). The regularized EM algorithm. Proceedings of the
National Conference on Artificial Intelligence (AAAI), 20(2), 807.
Li, S., Park, Y., Duraisingham, S., Strobel, F. H., Khan, N., Soltow, Q. A., … Pulendran, B.
(2013). Predicting Network Activity from High Throughput Metabolomics. PLoS
Computational Biology, 9(7). https://doi.org/10.1371/journal.pcbi.1003123
Little, R., & Rubin, D. (2002). Statistical analysis with missing data (2nd ed., Wiley series in
probability and statistics). Hoboken, N.J.: Wiley-Interscience.
Liu, J. Z., van Sommeren, S., Huang, H., Ng, S. C., Alberts, R., Takahashi, A., … Weersma, R.
K. (2015). Association analyses identify 38 susceptibility loci for inflammatory bowel
disease and highlight shared genetic risk across populations. Nature Genetics, 47, 979.
Retrieved from https://doi.org/10.1038/ng.3359
Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm.
Journal of the Royal Statistical Society, Series B 44, 226–233
Low, S.-K., Chung, S., Takahashi, A., Zembutsu, H., Mushiroda, T., Kubo, M., & Nakamura, Y.
(2013). Genome-wide association study of chemotherapeutic agent-induced severe
neutropenia/leucopenia for patients in Biobank Japan. Cancer Science, 104(8), 1074–1082.
https://doi.org/10.1111/cas.12186
Meng, X., & Rubin, D. B. (1991). Using EM to Obtain Asymptotic Matrices: The SEM
Algorithm. Journal of the American Statistical Association, 86(416), 899–909.
http://doi.org/10.2307/2290503
Michiels, B., Molenberghs, G., & Lipsitz, S. (1999). Selection Models and Pattern-Mixture
Models for Incomplete Data with Covariates. Biometrics, 55(3), 978-983. Retrieved from
http://www.jstor.org/stable/2533635
Miyashita, A., Arai, H., Asada, T., Imagawa, M., Matsubara, E., Shoji, M., … Kuwano, R.
(2007). Genetic association of CTNNA3 with late-onset Alzheimer’s disease in females.
Human Molecular Genetics, 16(23), 2854–2869. https://doi.org/10.1093/hmg/ddm244
Mo, Q., Shen, R., Guo, C., Vannucci, M., Chan, K. S., & Hilsenbeck, S. G. (2017). A fully
Bayesian latent variable model for integrative clustering analysis of multi-type omics data.
Biostatistics, 19(1), 71–86. https://doi.org/10.1093/biostatistics/kxx017
Mo, Q., Wang, S., Seshan, V. E., Olshen, A. B., Schultz, N., Sander, C., … Shen, R. (2013).
Pattern discovery and cancer gene identification in integrated cancer genomic data.
Proceedings of the National Academy of Sciences of the United States of America, 110(11),
4245–50. http://doi.org/10.1073/pnas.1208949110
Murphy, K. (2012). Machine Learning A Probabilistic Perspective (Adaptive Computation and
Machine Learning). Cambridge: The MIT Press.
Newcomb, P. A., Baron, J., Cotterchio, M., Gallinger, S., Grove, J., Haile, R., … Seminara, D.
(2007). Colon cancer family registry: An international resource for studies of the genetic
epidemiology of colon cancer. Cancer Epidemiology Biomarkers and Prevention, 16(11),
2331–2343. https://doi.org/10.1158/1055-9965.EPI-07-0648
Nishii, R. (1984) Asymptotic properties of criteria for selection of variables in multiple
regression. Ann. Statist., 12, 758–765.
O’Hara, R. B., & Sillanpää, M. J. (2009). A review of bayesian variable selection methods:
What, how and which. Bayesian Analysis, 4(1), 85–118. https://doi.org/10.1214/09-BA403
Pusponegoro, N. H., Rachmawati, R. N., Notodiputro, K. A., & Sartono, B. (2017). Linear
Mixed Model for Analyzing Longitudinal Data: A Simulation Study of Children Growth
Differences. Procedia Computer Science, 116, 284–291.
https://doi.org/10.1016/j.procs.2017.10.071
Reed, M. C., Nijhout, H. F., Neuhouser, M. L., Gregory, J. F., Shane, B., James, S. J., … Ulrich,
C. M. (2006). A mathematical model gives insights into nutritional and genetic aspects of
folate-mediated one-carbon metabolism. The Journal of Nutrition, 136(10), 2653–61.
http://doi.org/136/10/2653
Ritchie, M. D., Holzinger, E. R., Li, R., Pendergrass, S. A., & Kim, D. (2015). Methods of
integrating data to uncover genotype-phenotype interactions. Nature Reviews Genetics,
16(2), 85–97. http://doi.org/10.1038/nrg3868
Rothman, K., Greenland, S., & Lash, T. (2008). Modern epidemiology (Third edition).
Philadelphia: Wolters Kluwer Health/Lippincott Williams & Wilkins.
Schmidt, M. (2008). The Sankey diagram in energy and material flow management: Part I:
History. Journal of Industrial Ecology, 12(1), 82–94. https://doi.org/10.1111/j.1530-
9290.2008.00004.x
Schumacher, F., Schmit, S., Jiao, S., Edlund, C., Wang, H., Zhang, B., … Henderson, B. (2016).
Genome-wide association study of colorectal cancer identifies six new susceptibility loci.
Nature Communications, 6, 7138. https://doi.org/10.1038/ncomms8138
Shen, R., Olshen, A. B., & Ladanyi, M. (2009). Integrative clustering of multiple genomic data
types using a joint latent variable model with application to breast and lung cancer subtype
analysis. Bioinformatics, 25(22), 2906–2912. http://doi.org/10.1093/bioinformatics/btp543
Shen, R., Mo, Q., Schultz, N., Seshan, V. E., Olshen, A. B., Huse, J., … Sander, C. (2012).
Integrative subtype discovery in glioblastoma using iCluster. PLoS ONE, 7(4).
http://doi.org/10.1371/journal.pone.0035236
Siegmund, K. D., & Langholz, B. (2002). Ascertainment bias in family-based case-control
studies. American Journal of Epidemiology, 155(9), 875–880.
http://doi.org/10.1093/aje/155.9.875
Soltow QA, Strobel FH, Mansfield KG, Wachtman L, Park Y, Jones DP. High-performance
metabolic profiling with dual chromatography-Fourier-transform mass spectrometry (DC-
FTMS) for study of the exposome. Metabolomics. 2013;9(1 Suppl):S132-s43.
Thomas, D. C., Conti, D. V, Baurley, J., Nijhout, F., Reed, M., & Ulrich, C. M. (2009). Use of
pathway information in molecular epidemiology. Human Genomics, 4(1), 21–42.
https://doi.org/10.1186/1479-7364-4-1-21
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the
Royal Statistical Society: Series B, 58(1), 267–288. http://doi.org/10.2307/2346178
Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition.
Springer, New York. ISBN 0-387-95457-0
Wang, T. J., Larson, M. G., Vasan, R. S., Cheng, S., Rhee, E. P., McCabe, E., … Gerszten, R. E.
(2011). Metabolite profiles and the risk of developing diabetes. Nature Medicine, 17, 448.
Retrieved from https://doi.org/10.1038/nm.2307
Wang, Z., Gu, Q., Ning, Y., & Liu, H. (2014). High Dimensional Expectation-Maximization
Algorithm: Statistical Optimization and Asymptotic Normality. arXiv, (2008), 1–84.
Retrieved from http://arxiv.org/abs/1412.8729
White, I. R., Royston, P. and Wood, A. M. (2011), Multiple imputation using chained equations:
Issues and guidance for practice. Statist. Med., 30: 377-399. doi:10.1002/sim.4067
Witten, D., & Tibshirani, R. (2009). Covariance-regularized regression and classification
for high dimensional problems. Journal of the Royal Statistical Society: Series B, 71(3),
615–636. http://doi.org/10.1111/j.1467-9868.2009.00699.x
Wainwright, M. J. (2014). Structured Regularizers for High-Dimensional Problems: Statistical
and Computational Issues. Annual Review of Statistics and Its Application, 1(1), 233–253.
https://doi.org/10.1146/annurev-statistics-022513-115643
Weigensberg MJ, Cruz ML, Goran MI. Association between insulin sensitivity and post-glucose
challenge plasma insulin values in overweight Latino youth. Diabetes Care.
2003;26(7):2094-9.
Xu, C., Baines, P., Wang J. (2014). Standard error estimation using the EM algorithm for the
joint modeling of survival and longitudinal data, Biostatistics, Volume 15, Issue 4, 1
October 2014, Pages 731–744, https://doi.org/10.1093/biostatistics/kxu015
Yi, X., & Caramanis, C. (2015). Regularized EM Algorithms: A Unified Framework and
Provable Statistical Guarantees. Advances in Neural Information Processing Systems,
(2014), 1–9.
Zhang, H., Zheng, Y., Zhang, Z., Gao, T., Joyce, B., Yoon, G., … Liu, L. (2016). Estimating and
testing high-dimensional mediation effects in epigenetic studies. Bioinformatics, 32(20),
3150–3154. https://doi.org/10.1093/bioinformatics/btw351
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical
Association, 101(476), 1418–1429. https://doi.org/10.1198/016214506000000735
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic-net. Journal of
the Royal Statistical Society, 67(2), 301–320. https://doi.org/10.1111/j.1467-
9868.2005.00503.x
Appendices
Appendix I
Base DAG model with a binary outcome.
$$
\begin{aligned}
l(\boldsymbol{\Theta})
&= \sum_{i=1}^{n} \log \sum_{j=1}^{k} f(X_i = j \mid \mathbf{G}_i; \boldsymbol{\beta})\, f(\mathbf{Z}_i \mid X_i = j; \boldsymbol{\delta}, \boldsymbol{\Sigma})\, f(Y_i \mid X_i = j; \boldsymbol{\gamma}) \\
&= \sum_{i=1}^{n} \log \sum_{j=1}^{k} S(\boldsymbol{\beta}, j, \mathbf{G}_i)\, \phi(\mathbf{Z}_i; \delta_j, \Sigma_j)\, S(\gamma_j, Y_i) \\
&= \sum_{i=1}^{n} \log \sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})\, \frac{S(\boldsymbol{\beta}, j, \mathbf{G}_i)\, \phi(\mathbf{Z}_i; \delta_j, \Sigma_j)\, S(\gamma_j, Y_i)}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})} \\
&= \sum_{i=1}^{n} \log \mathbb{E}\!\left[ \frac{S(\boldsymbol{\beta}, j, \mathbf{G}_i)\, \phi(\mathbf{Z}_i; \delta_j, \Sigma_j)\, S(\gamma_j, Y_i)}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})} \right] \\
&\ge \sum_{i=1}^{n} \mathbb{E}\!\left[ \log \frac{S(\boldsymbol{\beta}, j, \mathbf{G}_i)\, \phi(\mathbf{Z}_i; \delta_j, \Sigma_j)\, S(\gamma_j, Y_i)}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})} \right] \qquad \text{(Jensen's inequality)} \\
&= \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log \frac{S(\boldsymbol{\beta}, j, \mathbf{G}_i)}{r_{ij}}
 + \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log \frac{\phi(\mathbf{Z}_i; \delta_j, \Sigma_j)}{r_{ij}}
 + \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log \frac{S(\gamma_j, Y_i)}{r_{ij}} \\
&= \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log S(\boldsymbol{\beta}, j, \mathbf{G}_i)
 - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} (\mathbf{Z}_i - \delta_j)^{T} \Sigma_j^{-1} (\mathbf{Z}_i - \delta_j)
 - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log |\Sigma_j| \\
&\quad + \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} Y_i \log \frac{e^{\gamma_j}}{1 + e^{\gamma_j}}
 + \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} (1 - Y_i) \log \frac{1}{1 + e^{\gamma_j}}
 - 3 \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log r_{ij}
\end{aligned}
\tag{A.1}
$$

where $r_{ij} = P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})$ denotes the posterior responsibility of cluster $j$ for subject $i$.
Then the updated parameter estimate of $\gamma_j$ at iteration $t+1$ is:

$$
\gamma_j^{(t+1)} = \log \frac{p}{1 - p}, \qquad p = \frac{\sum_{i=1}^{n} r_{ij} Y_i}{\sum_{i=1}^{n} r_{ij}}. \tag{A.2}
$$
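The update (A.2) has a simple interpretation: within each cluster $j$, the new $\gamma_j$ is the logit of the responsibility-weighted outcome prevalence. A minimal numerical sketch of this M-step (written in Python purely for illustration; the function and array names are ours, not part of the LUCIDus package):

```python
import numpy as np

def update_gamma(r, y):
    """M-step for the cluster-specific logit parameters gamma_j (eq. A.2).

    r : (n, k) responsibilities r_ij = P(X_i = j | D; Theta)
    y : (n,)   binary outcomes Y_i
    """
    # responsibility-weighted outcome prevalence in each cluster
    p = (r * y[:, None]).sum(axis=0) / r.sum(axis=0)
    return np.log(p / (1.0 - p))

# toy responsibilities for n = 4 subjects and k = 2 clusters
r = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.2, 0.8],
              [0.1, 0.9]])
y = np.array([0, 0, 1, 1])

gamma = update_gamma(r, y)
# cluster 1: p = 0.3 / 2.0 = 0.15; cluster 2: p = 1.7 / 2.0 = 0.85
assert np.isclose(gamma[0], np.log(0.15 / 0.85))
assert np.isclose(gamma[1], np.log(0.85 / 0.15))
```

Because the outcome is concentrated in subjects with high responsibility for cluster 2, the update assigns that cluster the larger logit, as expected.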
Appendix II
Proof of Chapter 2 Equation (2.5)
$$
\begin{aligned}
l(\boldsymbol{\Theta}) &= \log L(\boldsymbol{\Theta})
= \log \prod_{i=1}^{n} f(\mathbf{Z}_i, Y_i \mid \mathbf{G}_i; \boldsymbol{\beta}, \boldsymbol{\delta}, \boldsymbol{\Sigma}, \boldsymbol{\gamma})
= \sum_{i=1}^{n} \log f(\mathbf{Z}_i, Y_i \mid \mathbf{G}_i; \boldsymbol{\beta}, \boldsymbol{\delta}, \boldsymbol{\Sigma}, \boldsymbol{\gamma}) \\
&= \sum_{i=1}^{n} \log \sum_{j=1}^{k} f(\mathbf{Z}_i, Y_i, X_i = j \mid \mathbf{G}_i; \boldsymbol{\beta}, \boldsymbol{\delta}, \boldsymbol{\Sigma}, \boldsymbol{\gamma}) \\
&= \sum_{i=1}^{n} \log \sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})\, \frac{f(\mathbf{Z}_i, Y_i, X_i = j \mid \mathbf{G}_i; \boldsymbol{\beta}, \boldsymbol{\delta}, \boldsymbol{\Sigma}, \boldsymbol{\gamma})}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})} \\
&= \sum_{i=1}^{n} \log \sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})\, \frac{f(X_i = j \mid \mathbf{G}_i; \boldsymbol{\beta})\, f(\mathbf{Z}_i \mid X_i = j; \boldsymbol{\delta}, \boldsymbol{\Sigma})\, f(Y_i \mid X_i = j; \boldsymbol{\gamma})}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})} \\
&= \sum_{i=1}^{n} \log \sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})\, \frac{S(\boldsymbol{\beta}, j, \mathbf{G}_i)\, \phi(\mathbf{Z}_i; \delta_j, \Sigma_j)\, \phi(Y_i; \gamma_j, \sigma_j)}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})} \\
&= \sum_{i=1}^{n} \log \mathbb{E}\!\left[ \frac{S(\boldsymbol{\beta}, j, \mathbf{G}_i)\, \phi(\mathbf{Z}_i; \delta_j, \Sigma_j)\, \phi(Y_i; \gamma_j, \sigma_j)}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})} \right] \\
&\ge \sum_{i=1}^{n} \mathbb{E}\!\left[ \log \frac{S(\boldsymbol{\beta}, j, \mathbf{G}_i)\, \phi(\mathbf{Z}_i; \delta_j, \Sigma_j)\, \phi(Y_i; \gamma_j, \sigma_j)}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})} \right] \qquad \text{(Jensen's inequality)} \\
&= \sum_{i=1}^{n} \sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta}) \log \frac{S(\boldsymbol{\beta}, j, \mathbf{G}_i)}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})}
 + \sum_{i=1}^{n} \sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta}) \log \frac{\phi(\mathbf{Z}_i; \delta_j, \Sigma_j)}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})} \\
&\quad + \sum_{i=1}^{n} \sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta}) \log \frac{\phi(Y_i; \gamma_j, \sigma_j)}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})}
\end{aligned}
\tag{A.3}
$$
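The "$\ge$" step in (A.3) is Jensen's inequality, and the bound is tight exactly when the weights $P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})$ are the current posteriors, which is what makes each E-M iteration monotone in the likelihood. A quick numerical check of both facts on arbitrary positive "component densities" (Python; the numbers are made up purely for illustration):

```python
import numpy as np

# toy joint component values f_ij = f(Z_i, Y_i, X_i = j | ...) for
# n = 3 observations and k = 2 latent clusters (any positive numbers work)
f = np.array([[0.30, 0.05],
              [0.10, 0.20],
              [0.02, 0.40]])

loglik = np.log(f.sum(axis=1)).sum()           # sum_i log sum_j f_ij

post = f / f.sum(axis=1, keepdims=True)        # exact posteriors r_ij

def lower_bound(r):
    """sum_i sum_j r_ij log(f_ij / r_ij) -- the bound in (A.3)."""
    return (r * np.log(f / r)).sum()

# the bound equals the log-likelihood at the exact posteriors...
assert np.isclose(lower_bound(post), loglik)
# ...and is strictly below it for any other choice of weights
other = np.full_like(f, 0.5)
assert lower_bound(other) < loglik
```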
Appendix III
Table A.1 Scenario Settings for Simulation
                 SNPs                Biomarkers                          Outcome
#Causal   OR_G     #Causal   Δ_Z    Σ            Δ_Y (continuous)   OR_Y (binary)
5         1.9      2         0.8    Independent  0.2                2.0
5         1.9      0         0.0    Independent  0.2                2.0
5         1.9      2         0.2    Independent  0.2                2.0
5         1.9      2         0.4    Independent  0.2                2.0
5         1.9      2         0.5    Independent  0.2                2.0
5         1.9      2         0.6    Independent  0.2                2.0
5         1.9      0         0.0    Independent  0.4                -
5         1.9      2         0.2    Independent  0.4                -
5         1.9      2         0.4    Independent  0.4                -
5         1.9      2         0.5    Independent  0.4                -
5         1.9      2         0.6    Independent  0.4                -
5         1.9      2         0.8    Independent  0.4                -
0         1.0      2         0.8    Independent  0.2                2.0
5         1.1      2         0.8    Independent  0.2                2.0
5         1.3      2         0.8    Independent  0.2                2.0
5         1.5      2         0.8    Independent  0.2                2.0
5         1.75     2         0.8    Independent  0.2                2.0
5         1.8      2         0.8    Independent  0.2                2.0
5         1.85     2         0.8    Independent  0.2                2.0
5         1.9      2         0.8    Independent  0.2                2.0
5         1.95     2         0.8    Independent  0.2                2.0
5         2.0      2         0.8    Independent  0.2                2.0
5         1.9      0         0.0    Independent  0.2                -
5         1.9      0         0.0    Structured   0.2                -
5         1.9      2         0.8    Independent  0                  0
Appendix IV
Figure A.1 Simulation results for base DAG model with a binary outcome
Appendix V
Table A.2 Associations among three components in the colon CFR sample
Regress Biomarkers on SNPs

SNP rs7070570
Biomarker                    Estimate   SE       P-value
Creatinine                   -0.0182    0.0344   0.5965
SAH                          -0.0391    0.0317   0.2172
tHCY                         0.0042     0.0330   0.8998
METH                         0.0106     0.0329   0.7485
CYSTH                        -0.0679    0.0338   0.0449
SAM                          0.0318     0.0303   0.2932
Riboflavin                   0.0062     0.0309   0.8408
Flavin Mononucleotide        0.0110     0.0297   0.7113
4-Pyridoxic Acid             -0.0113    0.0319   0.7239
Pyridoxal                    -0.0391    0.0270   0.1485
Pyridoxamine                 0.0218     0.0254   0.3907
5-Methyltetrahydrofolate     -0.0181    0.0236   0.4440

SNP rs1241696
Biomarker                    Estimate   SE       P-value
Creatinine                   -0.0148    0.0311   0.6332
SAH                          -0.0181    0.0286   0.5269
tHCY                         -0.0351    0.0298   0.2393
METH                         0.0017     0.0298   0.9537
CYSTH                        -0.0582    0.0306   0.0573
SAM                          0.0142     0.0274   0.6041
Riboflavin                   0.0377     0.0279   0.1775
Flavin Mononucleotide        0.0161     0.0269   0.5496
4-Pyridoxic Acid             0.0136     0.0288   0.6364
Pyridoxal                    -0.0274    0.0245   0.2626
Pyridoxamine                 0.0196     0.0230   0.3946
5-Methyltetrahydrofolate     -0.0056    0.0213   0.7924
Regress Outcome on SNPs
SNPs Estimate SE P-value
rs7070570 -0.3476 0.1702 0.0411
rs1241696 0.4246 0.1653 0.0102
Regress Outcome on Biomarkers
Biomarkers Estimate SE P-value
Creatinine -0.0781 0.1019 0.4434
SAH 0.1575 0.1372 0.2508
tHCY 0.0132 0.1498 0.9295
METH -0.0797 0.1310 0.5428
CYSTH 0.1047 0.1337 0.4337
SAM -0.0652 0.1484 0.6603
Riboflavin 0.2722 0.1267 0.0316
Flavin Mononucleotide 0.3325 0.1713 0.0523
4-Pyridoxic Acid -0.0244 0.1448 0.8659
Pyridoxal 0.1202 0.1562 0.4415
Pyridoxamine -0.8617 0.2536 0.0007
5-Methyltetrahydrofolate -0.0452 0.1874 0.8092
Table A.3 Summary of the integrated post-hoc analysis

Association between SNPs and 2nd Subgroup
rs7070570              OR: 1.29   95% CI: (1.12, 1.50)
rs1241696              OR: 0.81   95% CI: (0.71, 0.92)

Association between 2nd Subgroup and CRC (Adjusted)
2nd vs. 1st Subgroup   OR: 1.91   95% CI: (0.73, 4.54)

Mean (SE) of metabolites* between different subgroups
Metabolite                  1st Subgroup, low risk for CRC   2nd Subgroup, high risk for CRC
creatinine                  0.08 (0.0304)                    -0.12 (0.0370)
SAH                         0.13 (0.0302)                    -0.19 (0.0367)
tHCY                        0.13 (0.0302)                    -0.19 (0.0367)
METH                        0.17 (0.0300)                    -0.24 (0.0364)
CYSTH                       -0.17 (0.0300)                   0.25 (0.0364)
SAM                         -0.09 (0.0304)                   0.13 (0.0369)
Riboflavin                  -0.37 (0.0277)                   0.53 (0.0337)
Flavin-mononucleotide       -0.10 (0.0304)                   0.15 (0.0369)
4-Pyridoxic acid            -0.24 (0.0294)                   0.34 (0.0357)
Pyridoxal                   0.02 (0.0306)                    -0.02 (0.0371)
Pyridoxamine                -0.01 (0.0306)                   0.01 (0.0371)
5-methyltetrahydrofolate    -0.35 (0.0280)                   0.51 (0.0341)
Appendix VI
Revised DAG for incomplete approach with a binary outcome.
$$
\begin{aligned}
l(\boldsymbol{\Theta})
&= \sum_{i=1}^{n} \log \sum_{j=1}^{k} f(X_i = j \mid \mathbf{G}_i; \boldsymbol{\beta})\, \left[ f(\mathbf{Z}_i \mid X_i = j; \boldsymbol{\delta}, \boldsymbol{\Sigma}) \right]^{I\left(\sum_{c=1}^{m} M_{ic} = 0\right)} f(Y_i \mid X_i = j; \boldsymbol{\gamma}) \\
&= \sum_{i=1}^{n} \log \sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})\, \frac{S(\boldsymbol{\beta}, j, \mathbf{G}_i)\, \left[ \phi(\mathbf{Z}_i; \delta_j, \Sigma_j) \right]^{I\left(\sum_{c=1}^{m} M_{ic} = 0\right)} S(\gamma_j, Y_i)}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})} \\
&\ge \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log \frac{S(\boldsymbol{\beta}, j, \mathbf{G}_i)}{r_{ij}}
 + \sum_{i=1}^{n} \sum_{j=1}^{k} I\!\left(\sum_{c=1}^{m} M_{ic} = 0\right) r_{ij} \log \frac{\phi(\mathbf{Z}_i; \delta_j, \Sigma_j)}{r_{ij}}
 + \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log \frac{S(\gamma_j, Y_i)}{r_{ij}} \\
&= \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log S(\boldsymbol{\beta}, j, \mathbf{G}_i)
 + \sum_{i=1}^{n} \sum_{j=1}^{k} I\!\left(\sum_{c=1}^{m} M_{ic} = 0\right) r_{ij} \log \phi(\mathbf{Z}_i; \delta_j, \Sigma_j) \\
&\quad + \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} Y_i \log \frac{e^{\gamma_j}}{1 + e^{\gamma_j}}
 + \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} (1 - Y_i) \log \frac{1}{1 + e^{\gamma_j}}
 - 2 \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \log r_{ij}
 - \sum_{i=1}^{n_{obs}} \sum_{j=1}^{k} r_{ij} \log r_{ij}
\end{aligned}
\tag{A.4}
$$

where $r_{ij} = P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})$, $M_{ic}$ indicates a missing measurement for biomarker $c$ of subject $i$, and $n_{obs}$ is the number of subjects with completely observed biomarkers.
Then the updated parameter estimate of $\gamma_j$ at iteration $t+1$ is:

$$
\gamma_j^{(t+1)} = \log \frac{p}{1 - p}, \qquad p = \frac{\sum_{i=1}^{n} r_{ij} Y_i}{\sum_{i=1}^{n} r_{ij}}. \tag{A.5}
$$
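In (A.4) the biomarker likelihood contributes only for subjects with fully observed $\mathbf{Z}$ (the indicator $I(\sum_c M_{ic} = 0)$ is zero otherwise), so the M-step for the cluster means $\delta_j$ reduces to a responsibility-weighted average over complete cases, while the genotype and outcome parts still use all $n$ subjects. A hedged sketch of that $\delta$-update (Python for illustration; `update_delta` is our hypothetical helper, not a LUCIDus function):

```python
import numpy as np

def update_delta(r, Z, complete):
    """M-step for cluster means delta_j using complete cases only.

    r        : (n, k) responsibilities r_ij
    Z        : (n, m) biomarker matrix (rows with missing values are skipped)
    complete : (n,) boolean, True iff I(sum_c M_ic = 0) = 1
    Returns a (k, m) matrix of cluster means.
    """
    rc, Zc = r[complete], Z[complete]
    # weighted column means per cluster: only observed rows enter the sums
    return (rc.T @ Zc) / rc.sum(axis=0)[:, None]

r = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 0.0]])
Z = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [np.nan, np.nan]])
complete = ~np.isnan(Z).any(axis=1)            # subject 4 is dropped

delta = update_delta(r, Z, complete)
# cluster 1 mean of Z[:, 0]: (1*1 + 0*3 + 0.5*5) / (1 + 0 + 0.5) = 3.5/1.5
assert np.isclose(delta[0, 0], 3.5 / 1.5)
```

Note that subject 4 still contributes its responsibilities to the $\beta$ and $\gamma$ updates; only the biomarker sums above exclude it.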
Appendix VII
Proof of Chapter 3 Equation (3.9)
$$
\begin{aligned}
l(\boldsymbol{\Theta})
&= \sum_{i=1}^{n} \log \sum_{j=1}^{k} f(X_i = j \mid \mathbf{G}_i; \boldsymbol{\beta})\, \left[ f(\mathbf{Z}_i \mid X_i = j; \boldsymbol{\delta}, \boldsymbol{\Sigma}) \right]^{I\left(\sum_{c=1}^{m} M_{ic} = 0\right)} f(Y_i \mid X_i = j; \boldsymbol{\gamma}) \\
&= \sum_{i=1}^{n} \log \sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})\, \frac{f(X_i = j \mid \mathbf{G}_i; \boldsymbol{\beta})\, \left[ f(\mathbf{Z}_i \mid X_i = j; \boldsymbol{\delta}, \boldsymbol{\Sigma}) \right]^{I\left(\sum_{c=1}^{m} M_{ic} = 0\right)} f(Y_i \mid X_i = j; \boldsymbol{\gamma})}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})} \\
&= \sum_{i=1}^{n} \log \mathbb{E}\!\left[ \frac{f(X_i = j \mid \mathbf{G}_i; \boldsymbol{\beta})\, \left[ f(\mathbf{Z}_i \mid X_i = j; \boldsymbol{\delta}, \boldsymbol{\Sigma}) \right]^{I\left(\sum_{c=1}^{m} M_{ic} = 0\right)} f(Y_i \mid X_i = j; \boldsymbol{\gamma})}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})} \right] \\
&\ge \sum_{i=1}^{n} \mathbb{E}\!\left[ \log \frac{f(X_i = j \mid \mathbf{G}_i; \boldsymbol{\beta})\, \left[ f(\mathbf{Z}_i \mid X_i = j; \boldsymbol{\delta}, \boldsymbol{\Sigma}) \right]^{I\left(\sum_{c=1}^{m} M_{ic} = 0\right)} f(Y_i \mid X_i = j; \boldsymbol{\gamma})}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})} \right] \qquad \text{(Jensen's inequality)} \\
&= \sum_{i=1}^{n} \sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta}) \log \frac{f(X_i = j \mid \mathbf{G}_i; \boldsymbol{\beta})}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})}
 + \sum_{i=1}^{n} \sum_{j=1}^{k} I\!\left(\sum_{c=1}^{m} M_{ic} = 0\right) P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta}) \log \frac{f(\mathbf{Z}_i \mid X_i = j; \boldsymbol{\delta}, \boldsymbol{\Sigma})}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})} \\
&\quad + \sum_{i=1}^{n} \sum_{j=1}^{k} P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta}) \log \frac{f(Y_i \mid X_i = j; \boldsymbol{\gamma})}{P(X_i = j \mid \mathcal{D}; \boldsymbol{\Theta})}
\end{aligned}
\tag{A.6}
$$
Appendix VIII
Figure A.2 Simulation results for a setting of OR_G = 2.0; Δ_Z = 0.4; Δ_Y = 0.4
[Figure: two power curves. Panel 1, "Power - Gene Effect" (series: Causal, Null); Panel 2, "Power - Biomarker Effect" (series: Cluster 1 - Causal, Cluster 2 - Causal, Cluster 1 - Null, Cluster 2 - Null). Y-axis: power; x-axis: 0% to 100%.]
Figure A.3 Simulation results for a setting of OR_G = 2.0; Δ_Z = 0.8; Δ_Y = 0.4
Appendix IX
Figure A.4 Feature characteristics in each estimated subgroup of CRC risk, using only subjects with complete biomarker measurements
Links: light blue, negative effect; dark blue, positive effect; grey, outcome reference subgroup; red, high-risk subgroup; green,
low-risk subgroup; width, effect size. Nodes: yellow, integrative cluster; purple, SNPs; cyan & blue, biomarkers; brown, CRC
risks. IntClust: integrative cluster/subgroup
Appendix X
Table A.4 Novel SNPs identified through the incomplete LUCID method in the CCFR sample
SNP                     β (2nd Subgroup)   OR = exp(β)   P-value* (SNP→Y association)   Gene
CHR21_47983680_A_G -1.73 0.18 0.08 N/A
CHR4_22523414_C_G -1.36 0.26 0.57 N/A
RS12416353 -1.32 0.27 0.99 N/A
RS11240536 -1.21 0.30 0.72 N/A
KGP9168996 -1.13 0.32 0.47 N/A
CHR2_217566056_A_T -1.01 0.37 0.80 N/A
RS17034196 -0.64 0.53 0.16 N/A
RS2744734 -0.56 0.57 0.20 N/A
CHR2_28832256_C_T -0.52 0.59 0.56 N/A
RS17171172 -0.51 0.60 0.69 EPDR1
RS10518742 -0.50 0.61 0.39 EHD4
CHR11_15606332_A_G -0.43 0.65 0.63 N/A
RS17133 -0.43 0.65 0.58 AC073941.1
RS12900613 -0.39 0.68 0.09 RASGRF1
CHR4_174573033_A_G -0.30 0.74 0.01 N/A
RS208245 -0.27 0.77 0.06 PTPRT
RS6064127 -0.26 0.77 0.72 N/A
RS1219496 -0.16 0.85 0.31 AC025947.1
RS10486614 -0.10 0.90 0.13 CHN2
RS6715538 -0.08 0.92 0.77 SPTBN1
RS689466 -0.07 0.93 0.89 N/A
RS10873661 -0.07 0.93 0.97 LPAR3
RS3817196 -0.05 0.95 0.15 LSP1
RS4724620 -0.04 0.96 0.74 N/A
RS3176779 -0.03 0.97 0.82 CD69
RS7551719 0.00 1.00 0.47 N/A
CHR2_43976541_C_G 0.02 1.02 0.52 N/A
RS1939483 0.02 1.02 0.00 MAML2
RS4768689 0.10 1.11 0.97 N/A
KGP1068234 0.13 1.14 0.82 N/A
RS10023142 0.18 1.20 0.11 N/A
CHR4_27034672_A_G 0.21 1.23 0.45 N/A
RS12550019 0.27 1.31 0.06 FAM135B
RS12133662 0.29 1.34 0.09 KCNN3
RS651249 0.32 1.37 0.08 C5orf66
CHR3_60141414_C_G 0.35 1.42 0.51 N/A
CHR16_6965400_C_G 0.45 1.57 0.55 N/A
CHR20_52373092_C_G 0.47 1.61 0.19 N/A
RS2119609 0.62 1.86 0.86 LINC02005
RS12074997 0.82 2.26 0.66 N/A
CHR16_73068060_C_G 1.63 5.11 0.31 N/A
*Adjusted for the top PCs
Appendix XI
Figure A.5 Sporadic incomplete LUCID method using estimates from the omic-wise incomplete approach as initial imputed values
Appendix XII
Table A.5 Bootstrap Confidence intervals for LUCID parameter estimates in the CCFR study
FEATURES                     ORIGINAL    BOOTSTRAP     BOOTSTRAP CONFIDENCE INTERVAL
                             ESTIMATE    STATISTICS    Normal          Basic           Percentile
EXPOSURE
RS7070570 0.26 -0.08 (0.03, 0.64) (0.05, 0.73) (-0.22, 0.46)
RS1241696 -0.21 0.05 (-0.54, 0.02) (-0.64, -0.04) (-0.37, 0.22)
METABOLITES
CLUSTER 1
CREATININE 0.07 -0.02 (-0.09, 0.27) (-0.04, 0.39) (-0.25, 0.18)
SAH 0.02 0.00 (-0.12, 0.17) (-0.11, 0.3) (-0.25, 0.16)
HCY -0.03 0.03 (-0.18, 0.06) (-0.2, 0.01) (-0.07, 0.13)
METH 0.20 -0.03 (-0.09, 0.57) (0.13, 0.84) (-0.44, 0.28)
CYSTH 0.04 0.01 (-0.28, 0.33) (-0.21, 0.55) (-0.48, 0.29)
SAM -0.02 0.03 (-0.17, 0.07) (-0.21, 0.04) (-0.08, 0.18)
RIBOFLAVIN -0.03 0.03 (-0.17, 0.06) (-0.24, 0.01) (-0.07, 0.18)
FLAVIN MONONUCLEOTIDE -0.03 0.02 (-0.15, 0.06) (-0.24, 0.03) (-0.09, 0.17)
4_PYRIDOXIC_ACID -0.03 0.02 (-0.12, 0.03) (-0.13, 0.01) (-0.06, 0.07)
PYRIDOXAL -0.03 0.03 (-0.15, 0.04) (-0.16, 0.03) (-0.1, 0.09)
PYRIDOXAMINE 0.04 0.00 (-0.08, 0.16) (-0.06, 0.27) (-0.18, 0.14)
5_METHYLTETRAHYDROFOLATE 0.01 0.01 (-0.09, 0.09) (-0.11, 0.09) (-0.07, 0.13)
CLUSTER 2
CREATININE -0.10 -0.01 (-0.34, 0.15) (-0.38, 0.2) (-0.41, 0.17)
SAH -0.03 -0.02 (-0.27, 0.25) (-0.21, 0.44) (-0.51, 0.14)
HCY 0.04 -0.07 (-0.16, 0.38) (-0.04, 0.53) (-0.44, 0.13)
METH -0.30 -0.05 (-0.75, 0.27) (-0.83, 0.4) (-0.99, 0.24)
CYSTH -0.05 -0.09 (-0.43, 0.49) (-0.39, 0.73) (-0.84, 0.29)
SAM 0.03 -0.07 (-0.17, 0.37) (-0.03, 0.52) (-0.46, 0.09)
RIBOFLAVIN 0.04 -0.04 (-0.1, 0.28) (-0.02, 0.41) (-0.33, 0.11)
FLAVIN MONONUCLEOTIDE 0.05 -0.02 (-0.08, 0.22) (-0.05, 0.32) (-0.23, 0.14)
4_PYRIDOXIC_ACID 0.04 -0.03 (-0.07, 0.22) (-0.04, 0.27) (-0.19, 0.12)
PYRIDOXAL 0.04 -0.06 (-0.07, 0.28) (0, 0.3) (-0.21, 0.09)
PYRIDOXAMINE -0.06 -0.02 (-0.23, 0.15) (-0.24, 0.2) (-0.32, 0.13)
5_METHYLTETRAHYDROFOLATE -0.02 -0.03 (-0.22, 0.26) (-0.12, 0.42) (-0.45, 0.09)
OUTCOME
CRC RISKS 1.91 3.01 (1.36e-10, 2.79e+8) (2.26e-17, 3.58) (1.02, 1.61e+17)
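Table A.5 reports three standard nonparametric bootstrap interval types (normal, basic, and percentile, as produced by, e.g., `boot::boot.ci` in R). For readers unfamiliar with how the three differ, here is a compact sketch for a generic statistic (Python for illustration; the sample, number of replicates B, and seed are arbitrary):

```python
import numpy as np

def boot_cis(x, stat, B=2000, alpha=0.05, seed=1):
    """Normal, basic, and percentile bootstrap CIs for stat(x)."""
    rng = np.random.default_rng(seed)
    t0 = stat(x)
    t = np.array([stat(rng.choice(x, size=len(x), replace=True))
                  for _ in range(B)])
    lo, hi = np.quantile(t, [alpha / 2, 1 - alpha / 2])
    z = 1.959963984540054                      # Phi^{-1}(0.975)
    bias, se = t.mean() - t0, t.std(ddof=1)
    return {
        "normal":     (t0 - bias - z * se, t0 - bias + z * se),
        "basic":      (2 * t0 - hi, 2 * t0 - lo),
        "percentile": (lo, hi),
    }

x = np.array([0.2, 0.5, 0.9, 1.3, 1.8, 2.4, 3.1, 4.0])
cis = boot_cis(x, np.mean)
for lo_, hi_ in cis.values():
    assert lo_ < np.mean(x) < hi_              # sample mean inside each CI
```

The percentile interval reads off bootstrap quantiles directly, the basic interval reflects them around the original estimate, and the normal interval combines a bias correction with a bootstrap standard error; for a skewed statistic such as the CRC odds ratio in Table A.5, the three can disagree substantially.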
Appendix XIII
LUCIDus R package Manual:
LUCIDus R package Vignettes:
Abstract
Epidemiologic, clinical, and translational studies are increasingly generating multi-platform omics data. Methods that can integrate across multiple high-dimensional data types while accounting for differential patterns are critical for uncovering novel associations and underlying relevant subgroups. We propose an integrative model to estimate latent unknown clusters with integrated data (LUCID) that aims both to distinguish unique genotype and informative biomarker/omics effects and to jointly estimate subgroups relevant to the outcome of interest; these subgroups are both influenced by long-term omics factors and related to the outcome. To estimate the overall integrated model, we treat subgroup membership as a missing-data problem and use the Expectation-Maximization (EM) algorithm to obtain maximum likelihood estimates of the model parameters. To accommodate high-dimensional omics data, we use regularized regression techniques to obtain sparse estimates of the corresponding coefficients. We further demonstrate the use of the LUCID model for integrated data with incomplete biomarker measurements, and we explore options for statistical inference in the context of LUCID integrative clustering. Simulation results indicate that we can obtain consistent estimates reflective of the true simulated values, accurately estimate subgroups, and recapitulate subgroup-specific effects. We apply this approach to several real data applications to highlight the integration of genomic, exposure, and metabolomic data. The proposed method offers a promising solution for identifying informative factors and estimating molecular subtypes of disease for personalized medicine and potentially more efficient targeted treatments.