Prediction modeling with meta data
and comparison with lasso regression
by
Jiqing Wu
A Thesis Presented to the
FACULTY OF THE USC KECK SCHOOL OF MEDICINE
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(Biostatistics)
May 2021
Copyright 2021 Jiqing Wu
Table of Contents
List of Figures
Abstract
1. Introduction
2. Method
2.1 Data simulation
2.2 Statistical methods
3. Results
3.1 LASSO regression
3.2 Xrnet regression with informed meta data
3.3 Xrnet regression with non-informative Z matrix
3.4 Model comparison
4. Conclusion
5. Discussion
References
List of Figures
Figure 1 Bimodal distribution of simulated data
Figure 2 The distribution of the simulated age variable
Figure 3 Lasso regression on simulation data
Figure 4 Beta coefficients of the model from xrnet
Figure 5 Comparison of coefficients between the model obtained by ordinary lasso regression and xrnet
Figure 6 Coefficients obtained when using the non-informative z matrix
Figure 7 The comparison between coefficients in the model generated by non-informative and real z matrix
Figure 8 MSE obtained from original lasso regression, xrnet with correct Z matrix and xrnet with non-informative Z matrix
Abstract
Recently, new methods to incorporate external information into penalized regression models for
high-dimensional data have shown improved prediction compared to traditional methods. We
were interested in whether external meta data could improve the prediction of age from DNA
methylation data. We used simulated data, designed to loosely resemble the DNA methylation –
age prediction model, to study the impact of using external information in penalized regression.
All analyses of penalized regression models using meta data were performed using the glmnet
and xrnet packages in R (version 3.6.0). According to our results, the model obtained from xrnet
(mean of mean squared error = 1.2197) predicted better than that from ordinary lasso regression (mean
of mean squared error = 1.9518). However, this is based on only one simulation setting with a
very large signal and independent predictor variables. Additional studies with more subtle effects
and more complex models are needed before drawing broader conclusions.
1. Introduction
We set out to evaluate the prediction performance of a new model that incorporated meta data,
data about the predictor variables. This method could be applied to predict epigenetic age, a
person’s age predicted from their DNA methylation profile in blood. Studies have found that
epigenetic age is correlated with chronological age, and may reflect a health or longevity index
(ARMSTRONG; MATHER; THALAMUTHU; WRIGHT et al., 2017). For example, the
epigenetic age of older individuals (aged 95+) is below their chronological age.
Aging is a time-dependent functional decline that affects most living organisms.
Researchers have found that human aging is accompanied by telomere shortening, changes in gene
expression, changes in DNA methylation and many other changes at the cellular and molecular levels
(HANNUM; GUINNEY; ZHAO; ZHANG et al., 2013). DNA methylation, a kind of epigenetic
biomarker, has been studied intensively in recent years because its level changes over time and can
easily be measured quantitatively. These dynamic changes happen over an individual’s lifespan,
beginning at birth. Early studies have revealed that the global DNA methylation level decreases
with aging. Recently, the overall DNA methylation level has been found to increase in the first
year of life, stay stable in adulthood and then decrease in old age, although the rate of change
may differ by gene, lifestyle and environment (JUNG; SHIN; LEE, 2017).
Most commonly, DNA methylation arises when a methyl group is added to the 5’ cytosine of a
C-G dinucleotide, also called a CpG, without changing the DNA sequence. This process
is catalyzed by DNA methyltransferases (JONES; GOODMAN; KOBOR, 2015). Under normal
circumstances, promoter CpG islands are unmethylated regardless of the level of gene expression
(FIELD; ROBERTSON; WANG; HAVAS et al., 2018). However, in the human genome, 70%-
80% of all CpGs are methylated, and about 60%-70% of genes have CpG islands in their
promoters. Approximately 2% of CpG islands have either hypomethylation or hypermethylation
related to aging (UNNIKRISHNAN; FREEMAN; JACKSON; WREN et al., 2019). Although
hypo- and hypermethylation occur in different chromatin contexts, they have a similar genomic
distribution across tissues and play roles similar to genetic alterations by affecting transcription
factor binding sites. Abnormal DNA methylation alters normal gene regulation and
cell differentiation (PÉREZ; TEJEDOR; BAYÓN; FERNÁNDEZ et al., 2018). Aberrant
hypermethylation is a key feature in several cancers (RAKYAN; DOWN; MASLAU; ANDREW
et al., 2010). One of the underlying reasons is that various tumor suppressor genes become
hypermethylated with aging (MOORE; LE; FAN, 2013). The large number of CpGs showing
age-related DNA methylation supports building a model to predict age from DNA methylation in
tissue.
A recent study built a model to predict age from DNA methylation data from individuals aged 19
to 101 in two different cohorts. DNA methylation was measured in blood cells using the Infinium
HumanMethylation450 BeadChip assay, which covers 99% of RefSeq genes and >450,000 CpG
markers (BIBIKOVA; BARNES; TSAN; HO et al., 2011). DNA methylation levels were
recorded as a fraction from zero to one. Fifteen percent of markers showed an association
(false-discovery rate adjusted p<0.05) between DNA methylation and age. The authors built a
high-accuracy predictive model of age with 71 methylation markers using elastic net regression. Our
primary interest was to investigate the use of external meta data, data about the DNA
methylation features on the array, to improve the model prediction. We encountered
computational challenges due to the size of the data set (with more than 200,000 DNA
methylation variables remaining after filtering) and subsequently reset our goal to study the new
regression model using a simulated data set of 5,000 DNA methylation variables.
We simulated age as a linear function of DNA methylation, with the regression model
coefficients modeled as a function of an independent set of meta data. This is an example of a
two-level regression model. In the level one model, age is the outcome and the DNA methylation
measurements are the predictors; in the level two model, the level one regression coefficients are
the outcome and the meta data are the predictors. We then evaluated the impact of including the
external data on (1) our prediction accuracy for age and (2) the estimates of the model coefficients,
compared to a model that predicted age without such additional information. Examples of such analyses are
provided in the xrnet package for external data that are continuous. In this study, we considered
external data that were binary (1/0). This mimics the types of variables we might use to
annotate DNA methylation sites, such as whether a site lies in a CpG island (yes/no).
2. Method
2.1 Data Simulation
We simulated data using a two-level hierarchical model. The first level is a linear regression
model:
$$y = X\beta + \varepsilon \qquad (1)$$
In our study, y is the continuous age variable for n observations (n×1 matrix), X is an n×p matrix
of DNA methylation measurements at p CpG sites, and ε is distributed as a standard normal. In
our simulations, n equals 1000, p equals 5000, and β is a vector with β1=…= β20 = 10, β21=…=
β40 = 20 and β41 =…= β5000 = 0. The second level model connects the parameters, β, with external
information, also called meta features, in the Z matrix.
$$\beta = Z\alpha \qquad (2)$$
We create a Z matrix using binary meta-features that identify sets of CpG sites that have a
similar effect on outcome. Our Z matrix has two features, the first taking on the value 1 for
model coefficients with value 10 and 0 otherwise, and the second feature taking on the value 1
for model coefficients with value 20 and 0 otherwise. The vector ɑ = (10,20), so that Zɑ reproduces β exactly.
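The second-level model can be sketched in a few lines of code. This is a minimal illustration in Python rather than the R code used in the thesis; the function names `build_z` and `level_two` are hypothetical, and the placement of the 40 non-null features in the first 40 positions follows the indexing β1 to β40 given above.

```python
# Sketch (not the thesis code): binary meta-feature matrix Z and beta = Z * alpha.
P = 5000                # number of simulated CpG sites
ALPHA = [10.0, 20.0]    # second-level coefficients, alpha = (10, 20)

def build_z(p=P):
    """Return a p x 2 matrix as a list of rows.

    Column 0 flags features 1-20 (true coefficient 10),
    column 1 flags features 21-40 (true coefficient 20).
    """
    z = [[0, 0] for _ in range(p)]
    for j in range(20):
        z[j][0] = 1
    for j in range(20, 40):
        z[j][1] = 1
    return z

def level_two(z, alpha):
    """Compute beta_j = sum_k Z[j][k] * alpha[k] for every feature j."""
    return [sum(z_jk * a_k for z_jk, a_k in zip(row, alpha)) for row in z]

Z = build_z()
beta = level_two(Z, ALPHA)
```

Each row of Z has at most one non-zero entry, so every feature inherits the coefficient of the single group it belongs to, or zero if it belongs to neither.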
The level of DNA methylation at each CpG site is measured as a proportion between 0 and 1
and is simulated using a Beta distribution. To obtain two distributions of DNA methylation, one
concentrated near 0 and one near 1, we divide our 5,000 predictor variables (p = 5,000) into two equal-sized groups
(2,500 each). Half of our x variables are simulated using Beta(2.5, 30) and the other half using
Beta(30, 2.5). Fig. 1 shows the resulting bimodal distribution. We simulate 1,000 measurements for each
x variable. Combining both sets of 2,500 variables gives us our X matrix, an n×p matrix. We
assume that the error vector, which measures the difference between the observed value of age and
the predicted value of age, is normally distributed, ε ~ N(0,1). The constant 10 is added to the
linear combination in (1) to give a mean age of about 55 (Fig.2), which resembles the middle age of
many epidemiological studies.
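The simulation described above can be sketched as follows. This is a minimal Python illustration, not the R code used for the thesis. It assumes that the 40 non-null features are the first 40 columns and therefore fall in the Beta(2.5, 30) half, which the text does not state explicitly; the function name `simulate`, the seed, and the reduced sizes in the usage line are likewise choices made here for illustration.

```python
import random

def simulate(n=1000, p=5000, intercept=10.0, seed=1):
    """Simulate X (n x p methylation proportions), y (age) and the true beta.

    Assumption: columns are ordered so that the first half is Beta(2.5, 30)
    and the second half is Beta(30, 2.5); p // 2 must be at least 40.
    """
    rng = random.Random(seed)
    half = p // 2
    beta = [10.0] * 20 + [20.0] * 20 + [0.0] * (p - 40)
    X, y = [], []
    for _ in range(n):
        row = [rng.betavariate(2.5, 30) for _ in range(half)]
        row += [rng.betavariate(30, 2.5) for _ in range(p - half)]
        # Only the first 40 coefficients are non-zero, so the sum stops there.
        signal = sum(b * x for b, x in zip(beta[:40], row[:40]))
        X.append(row)
        y.append(intercept + signal + rng.gauss(0.0, 1.0))  # epsilon ~ N(0, 1)
    return X, y, beta

# Reduced dimensions keep this pure-Python sketch fast;
# the thesis setting corresponds to n=1000, p=5000.
X, y, beta = simulate(n=200, p=100)
```

A vectorized implementation would be preferable at full size; plain lists are used here only to keep the sketch dependency-free.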
Fig.1 Bimodal distribution of simulated data
(A) The graph shows the distribution of the 5,000 DNA methylation measurements for one sample (n=1). The distribution of DNA
methylation measurements in a single sample has two peaks, at about 0.07 and 0.95.
(B) The beta value distributions for 10 individual predictor variables with peak near 0.07.
(C) The beta value distributions for 10 individual predictor variables with peak near 0.95.
Fig.2 The distribution of the simulated age variable
Our simulated age variable follows a normal distribution with a sample mean value of 56.13 and a sample variance of 22.26.
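The reported sample moments can be checked against the moments implied by the simulation design. The arithmetic below is a sketch under two assumptions that the text suggests but does not state outright: all 40 non-null features are drawn from Beta(2.5, 30), and the constant 10 is included in the model for y. The helper names are hypothetical.

```python
def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

def beta_var(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

a, b = 2.5, 30.0
coefs = [10.0] * 20 + [20.0] * 20   # the 40 non-zero coefficients

# E[y] = 10 + sum_j beta_j * E[x_j]
expected_mean = 10.0 + sum(coefs) * beta_mean(a, b)
# Var[y] = sum_j beta_j^2 * Var[x_j] + Var[epsilon], using independence
expected_var = sum(c ** 2 for c in coefs) * beta_var(a, b) + 1.0

print(round(expected_mean, 2), round(expected_var, 2))
```

Under these assumptions the implied mean is about 56.15 and the implied variance about 22.20, close to the sample values of 56.13 and 22.26 reported in the caption.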
2.2 Statistical methods
We fit a two-level penalized regression model to predict age. The data were split in the ratio
67:33 into a training and test set, with the tuning parameters estimated in the training data and
model prediction evaluated using the mean squared error computed in the test set. We applied the
LASSO penalty instead of ridge or elastic net because of the level one model design. In
particular, we expected the model fit to benefit from feature selection, that is, from dropping
the large number of DNA methylation features with null effects on age. Furthermore, our X
variables were simulated without correlation, and it is correlation among predictors that would favor the use of elastic net. The
two-level model was fit using the xrnet (version 0.1.7) package in R (version 3.6.0). The xrnet
package extends the penalties available in glmnet (ridge, lasso, elastic-net) to include a second
level model on an external data matrix (Z) which informs the coefficient estimates from the level
one model. We fitted xrnet with an informative Z matrix and with a non-informative one. The
non-informative Z matrix did not identify the features with true coefficients 10 or 20. We then compared the results
from the two xrnet models to those estimated under ordinary LASSO as implemented by
glmnet (version 4.0-2).
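The evaluation protocol described above, a 67:33 split followed by mean squared error on the held-out part, can be sketched independently of any model-fitting package. The following Python snippet is an illustration only; the function names, the random seed, and the use of a shuffled index split are assumptions made here, since the thesis does not say how the split was drawn.

```python
import random

def train_test_split(n, train_frac=0.67, seed=1):
    """Return disjoint lists of training and test row indices."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(train_frac * n)
    return idx[:n_train], idx[n_train:]

def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# With n = 1000 this gives 670 training and 330 test observations.
train_idx, test_idx = train_test_split(1000)
```

Tuning parameters would be chosen using only the rows in `train_idx`, and `mean_squared_error` would be applied to predictions for the rows in `test_idx`.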
3. Results
3.1 LASSO regression
Figure 3 shows results from the LASSO model as a function of the tuning parameter. The mean
squared error of LASSO regression, estimated by cross-validation, decreased as the number of
variables in the model increased (Fig.3A). This indicates that lasso regression was effective in
building a predictive model for our simulated age variable. The final model selected 139
non-zero beta coefficients. Meanwhile, the coefficient estimates formed groups near our
simulation parameter values of 0, 10 and 20 (Fig.3B).
Fig.3 Lasso regression on simulation data
(A) The red dotted line represents the cross-validation curve, with error bars showing the upper and lower standard deviation curves along the λ
sequence. The vertical dotted lines indicate the value of λ that gives the minimum mean cross-validated error (lambda.min) and the value of λ that gives
the most regularized model whose cross-validated error is within one standard error of the minimum (lambda.1se).
(B) The graph shows the path of the variable coefficients against the ℓ1-norm; each curve corresponds to a single variable. The plot also
indicates the number of non-zero coefficients at the current λ.
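The two tuning-parameter choices named in the caption can be illustrated with a short sketch. The function below applies the definitions given in panel (A) to a cross-validation curve; it is not glmnet code, the function name is hypothetical, and the numbers in the usage example are made up for illustration rather than taken from the thesis.

```python
def select_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from a cross-validation curve.

    lambda_min minimizes the mean cross-validated error; lambda_1se is the
    largest lambda whose mean error is within one standard error of that minimum.
    """
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[i_min] + cv_se[i_min]
    lambda_1se = max(lam for lam, m in zip(lambdas, cv_mean) if m <= threshold)
    return lambdas[i_min], lambda_1se

# Toy curve (illustrative values only).
lambdas = [1.0, 0.5, 0.25, 0.125]
cv_mean = [5.0, 2.3, 2.0, 2.1]
cv_se = [0.5, 0.4, 0.4, 0.4]
lam_min, lam_1se = select_lambdas(lambdas, cv_mean, cv_se)
```

In this toy curve the minimum error is at λ = 0.25, and λ = 0.5 is the largest value whose error (2.3) stays below the threshold 2.0 + 0.4, so it is the more heavily regularized choice.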
3.2 Xrnet regression with informed meta data
Compared to the ordinary lasso model, the fitted model obtained by xrnet using the (true) z
matrix of meta data dropped more predictors. Only 40 features were retained, corresponding to
the 20 features with parameter value 10 and 20 features with parameter value 20 (Fig. 4). None
of the features with true null effects on age were retained by the model. Compared to the results
from ordinary lasso, the coefficients estimated with the external data are closer to their true
values (Fig.5) and the mean squared error (mse) is lower (xrnet mse = 1.3463 vs LASSO mse =
2.0907). The distribution of beta coefficients in the xrnet model was more concentrated at the
true values, while the lasso estimates were spread over a wider range around 10, 20 and 0. Some
coefficients with a true value of 20 were estimated by lasso to be lower than 15, which never happened
in the xrnet model.
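One way to see why an informative Z matrix pulls the estimates toward their true values is to write out a hierarchical penalized objective of the kind xrnet is designed to fit. The form below is a sketch under our own assumptions, not a statement of the package's exact implementation: intercepts are omitted, a lasso penalty is assumed at level one, the level two penalty P is left unspecified, and λ1, λ2 and γ are notation introduced here.

```latex
% Hierarchical objective: beta is shrunk toward Z*alpha instead of toward 0
\min_{\beta,\,\alpha}\;
  \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2
  \;+\; \lambda_1 \lVert \beta - Z\alpha \rVert_1
  \;+\; \lambda_2\, P(\alpha)

% Substituting \gamma = \beta - Z\alpha gives an equivalent problem in (\gamma, \alpha):
\min_{\gamma,\,\alpha}\;
  \tfrac{1}{2}\,\lVert y - X\gamma - (XZ)\alpha \rVert_2^2
  \;+\; \lambda_1 \lVert \gamma \rVert_1
  \;+\; \lambda_2\, P(\alpha)
```

Under this form, the level one penalty is paid only on the part of β that Zɑ does not explain. With the true Z matrix, β = Zɑ holds exactly in our simulation, so all 40 non-null coefficients can sit at their group values with γ = 0, which is consistent with the tight clustering seen in Fig.4.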
Fig.4 Beta coefficients of the model from xrnet
Forty coefficients were nonzero, 20 of them near 10 and 20 of them near 20
Fig.5 Comparison of coefficients between the model obtained by ordinary lasso regression and xrnet.
The coefficient difference between the model fit obtained by lasso and xrnet
3.3 Xrnet regression with non-informative Z matrix
We evaluated the xrnet model using meta data that were non-informative about the level one
coefficients to see whether our prediction model could do worse than if we did not model the meta data
at all (i.e. LASSO regression). We created a non-informative Z matrix with q=3 columns, none of which
identifies the features with true coefficients of 10 or 20. Each Z variable was
dichotomous, taking on a value of 1 for 20 DNA methylation features with true beta (level one)
parameter zero and 0 otherwise. A total of 295 features were retained in this model, with most of them
showing small effects in the final model (Fig.6). Compared to xrnet with the Z matrix
identifying the predictive features, the distribution of coefficients in the model generated using
the non-informative Z matrix is less concentrated, and a large proportion of
coefficients are near 0 (Fig.7). The non-informative Z matrix thus resulted in
a model in which most of the variables contribute small effects.
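A non-informative Z matrix of the kind described in this section can be sketched as follows. This Python illustration assumes that the 20 flagged features in each column are chosen at random from the null features and that columns may share features; the thesis does not say how they were chosen, and the function name and seed are hypothetical.

```python
import random

def build_noninformative_z(p=5000, q=3, n_flagged=20, n_signal=40, seed=1):
    """Return a p x q binary matrix whose columns flag only null features.

    Assumption: the first n_signal features carry the true signal, and each
    column flags n_flagged features sampled at random from the remaining ones.
    """
    rng = random.Random(seed)
    null_features = list(range(n_signal, p))
    z = [[0] * q for _ in range(p)]
    for k in range(q):
        for j in rng.sample(null_features, n_flagged):
            z[j][k] = 1
    return z

Z_noninf = build_noninformative_z()
```

By construction, the rows for the 40 non-null features are all zero, so this matrix carries no information about which coefficients equal 10 or 20.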
Fig.6 Coefficients obtained when using the non-informative z matrix.
A total of 255, 20 and 20 non-zero coefficients are near 0, 10 and 20 respectively.
Fig.7 The comparison between coefficients in the model generated by non-informative and real z matrix.
The graph shows the difference between the coefficient distributions. Most coefficients in the non-informative Z matrix model
were near 0.
3.4 Model Comparison
We used the same method to generate 10 different data sets and refit the ordinary lasso
regression, xrnet with the correct Z matrix and xrnet with the non-informative Z matrix on each. The
mean squared error (mse) was largest for lasso regression
(mean mse = 1.9518, SD=0.4873) and smallest for xrnet with the correct external information
(mean mse = 1.2197, SD=0.3469) (Fig.8AB). The mean squared error of the model
generated with the non-informative Z matrix was still lower than that of the lasso regression model. It is
clear that in our analysis, xrnet with an informative Z matrix performed better than original lasso
regression.
Fig.8 MSE obtained from original lasso regression, xrnet with correct Z matrix and xrnet with non-informative Z matrix
(A) The mean squared error from the three models in 10 different data sets. In every data set, the model developed by xrnet with
the correct Z matrix has the lowest mean squared error, and the model developed by xrnet with the non-informative Z matrix is the
next best. The lasso model did not predict as well as either xrnet model.
(B) The boxplot for mean squared error from the three models in 10 different data sets.
4. Conclusion
We found that the model fit by xrnet using informative meta data had a smaller mean squared
error for predicting a quantitative outcome than ordinary lasso regression, and that its
coefficient estimates were closer to the true values and more tightly clustered. We also
fitted a model using a non-informative Z matrix in order to show the benefit of real external
information in xrnet. The results demonstrated that an informative Z matrix was important in
developing a more predictive model; however, the predictions under both Z matrices were never
worse than those of ordinary lasso regression. The 0 and 1 entries in our non-informative Z matrices
overlapped with those of the correct Z matrices, which means the non-informative Z matrices were not
completely non-informative. This might explain why the mean squared error of the model fitted
with the non-informative Z matrix was always lower than that of the lasso regression model. In conclusion,
providing external information that helps to identify variables that are informative of the outcome in
the level one model contributed to a lower mean squared error when the level one predictor
variables (X) were independent and the level two data (Z matrix) were binary.
5. Discussion
We compared penalized regression results obtained from a new penalized hierarchical regression
model fit using the xrnet package in R to results from fitting ordinary lasso regression in glmnet.
Although it takes longer to process the external information for a larger data set, the xrnet
package can develop a model with a lower mean squared error and more accurate coefficients
than ordinary lasso regression. Real DNA methylation data would be more complex than the
data we simulated. In particular, there may
be correlation between the predictor variables. Our simulated predictors are all independent,
which means our simulation treats DNA methylation, aging and external data in a simpler
way than is probably the case in reality. We did not analyze the ability of xrnet to predict with
correlated feature data. Further investigations are needed to address this issue. Furthermore, we
did not consider the effects of hypomethylation. Both hyper- and hypomethylation of CpGs
are observed with aging. Such an analysis will require each meta feature to combine only
features whose trends with age run in the same direction, so as not to attenuate their individual
effects on predicting age, as would happen if we combined features acting in opposite directions.
References
ARMSTRONG, N. J.; MATHER, K. A.; THALAMUTHU, A.; WRIGHT, M. J. et al. Aging, exceptional longevity and
comparisons of the Hannum and Horvath epigenetic clocks. Epigenomics, 9, n. 5, p. 689-700, 05 2017.
BIBIKOVA, M.; BARNES, B.; TSAN, C.; HO, V. et al. High density DNA methylation array with single CpG site resolution.
Genomics, 98, n. 4, p. 288-295, Oct 2011.
FIELD, A. E.; ROBERTSON, N. A.; WANG, T.; HAVAS, A. et al. DNA Methylation Clocks in Aging: Categories, Causes, and
Consequences. Mol Cell, 71, n. 6, p. 882-895, 09 2018.
HANNUM, G.; GUINNEY, J.; ZHAO, L.; ZHANG, L. et al. Genome-wide methylation profiles reveal quantitative views of
human aging rates. Mol Cell, 49, n. 2, p. 359-367, Jan 2013.
JONES, M. J.; GOODMAN, S. J.; KOBOR, M. S. DNA methylation and healthy human aging. Aging Cell, 14, n. 6, p. 924-932,
Dec 2015.
JUNG, S. E.; SHIN, K. J.; LEE, H. Y. DNA methylation-based age prediction from various tissues and body fluids. BMB
Rep, 50, n. 11, p. 546-553, Nov 2017.
MOORE, L. D.; LE, T.; FAN, G. DNA methylation and its basic function. Neuropsychopharmacology, 38, n. 1, p. 23-38,
Jan 2013.
PÉREZ, R. F.; TEJEDOR, J. R.; BAYÓN, G. F.; FERNÁNDEZ, A. F. et al. Distinct chromatin signatures of DNA
hypomethylation in aging and cancer. Aging Cell, 17, n. 3, p. e12744, 06 2018.
RAKYAN, V. K.; DOWN, T. A.; MASLAU, S.; ANDREW, T. et al. Human aging-associated DNA hypermethylation occurs
preferentially at bivalent chromatin domains. Genome Res, 20, n. 4, p. 434-439, Apr 2010.
UNNIKRISHNAN, A.; FREEMAN, W. M.; JACKSON, J.; WREN, J. D. et al. The role of DNA methylation in epigenetics of
aging. Pharmacol Ther, 195, p. 172-185, Mar 2019.
Asset Metadata
Creator
Wu, Jiqing
(author)
Core Title
Prediction modeling with meta data and comparison with lasso regression
School
Keck School of Medicine
Degree
Master of Science
Degree Program
Biostatistics
Degree Conferral Date
2021-05
Publication Date
05/09/2021
Defense Date
05/07/2021
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
lasso regression,meta data,OAI-PMH Harvest,xrnet package
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Siegmund, Kimberly (
committee chair
), Lewinger, Juan Pablo (
committee member
), Marjoram, Paul (
committee member
)
Creator Email
jiqingwu@usc.edu,jiqingwu1997@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC112720118
Unique identifier
UC112720118
Identifier
etd-WuJiqing-9613.pdf (filename)
Legacy Identifier
etd-WuJiqing-9613
Document Type
Thesis
Format
application/pdf (imt)
Rights
Wu, Jiqing
Type
texts
Source
20210510-wayne-usctheses-batch-836-shoaf
(batch),
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu