Nonlinear Modeling of the Relationship Between Smoking and DNA Methylation in the Multi-Ethnic Cohort
by
Muhammad Rayeed Islam
A Thesis Presented to the
FACULTY OF THE USC
KECK SCHOOL OF MEDICINE
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(BIOSTATISTICS)
August 2024
Copyright 2024 Muhammad Rayeed Islam
Dedication
This document was the work of a village. To Dr. Brian Huang, my thesis advisor and
committee chair, and the members of my committee, Dr. Trevor Pickering and Dr. Juan Pablo
Lewinger, I extend my deepest gratitude for your unwavering support, endless patience, and
constant guidance throughout the time I was writing this and my entire degree. Your teachings
and insights are things I will carry with me for the rest of my life and without your expertise this
paper would not be what it is. Every page of this document was informed by things I have
learned from working with and being taught by you and this work is as much yours as it is mine.
To Dr. Wendy Mack, my graduate degree advisor, I thank you for the years of constant
help and guidance even when I seemed to be lost. To Dr. Zhanghua Chen, your mentorship and
willingness to take on an unproven student amidst a pandemic helped me take my first steps on
this path that I am following to this day. To Dr. Gary Rosen, your love for all things mathematics
and statistics helped me understand that a field I thought impossible for me at the time was very
much within my reach with some effort and dedication.
To my friends and family, Maa, Abbu, Rumaisa, Olly, there is no way to put my gratitude
into words. For 23 years you have supported and believed in me through every up and every
down, through every mistake and bad decision, and through every triumph and great victory. I
would not be the person I am today without any of you and this document would not exist. You
all have my everlasting love and thanks for everything. To my grandparents, those still here and
those not, your legacy and lives have informed every decision I have made and I hope I have
made you proud.
To Joyce, your love and dedication through stormy seas brought this final work to
fruition, you are my rock, always and ever.
Table of Contents
Dedication
List of Tables
Abstract
Chapter 1: Introduction
Chapter 2: Methods
2.1: Data
2.2: Modeling Approach
Chapter 3: Results
3.1: Results for cg05575921
3.2: Results for cg23576855
3.3: Results for cg21161138
3.4: Results for cg26703534
3.5: Comparison Models with Untransformed TNEs
Chapter 4: Discussion of Results
References
Appendix 1: Plots for log-transformed TNE Models
A1.1: cg05575921
A1.2: cg23576855
A1.3: cg21161138
A1.4: cg26703534
Appendix 2: Plots for Untransformed TNE Models
A2.1: cg05575921
A2.2: cg23576855
A2.3: cg21161138
A2.4: cg26703534
List of Tables
Table 1: Regression results for cg05575921 with transformed TNEs
Table 2: Regression results for cg23576855 with transformed TNEs
Table 3: Regression results for cg21161138 with transformed TNEs
Table 4: Regression results for cg26703534 with transformed TNEs
Table 5: Regression results for cg05575921 with Untransformed TNEs
Table 6: Regression results for cg23576855 with Untransformed TNEs
Table 7: Regression results for cg21161138 with Untransformed TNEs
Table 8: Regression results for cg26703534 with Untransformed TNEs
Abstract
Background: Lung cancer continues to present a global public health crisis, with cigarette
smoking being one of the main contributors to lung cancer development. Current literature has
shown that cigarette smoking is associated with differential methylation of the AHRR gene.
However, smoking research has been hampered by the unreliability of self-reported dose and the
development of other nicotine substitutes, such as vapes and nicotine patches, which have made
it difficult to use previous gold standard methods for determining internal smoking dose in
patients. As such, studies have begun analyzing cigarette smoking using total nicotine
equivalents (TNEs), which provide a more complete measure for gauging internal smoking dose.
Recently, studies have explored the link between TNEs and DNA methylation under a linear
regression framework. This study aims to expand on that knowledge by examining possible
nonlinear relationships between smoking dose, as quantified by TNEs, and DNA methylation.
Methods: The data for this analysis were collected as part of the Multiethnic Cohort Study
(MEC) and consisted of a sample of 1,994 individuals who were smokers at the time of sample
collection, reported smoking >10 cigarettes per day, and had DNA methylation data available.
Data from four CpG sites on the AHRR gene were used: cg05575921 (n=1993), cg23576855
(n=1994), cg21161138 (n=1994), and cg26703534 (n=1993). We evaluated the association
between TNE and DNA methylation at each of the four CpG sites, using a series of models
including standard linear regression, 2nd and 3rd degree polynomial regressions, KNN
regressions, random forest regressions, and eXtreme Gradient Boosting (XGB) regressions. All
models were adjusted for sex, age, methylation-based estimates of cell type composition, and
principal components of genetic ancestry. Standard linear regression served as the baseline for
model performance, which was assessed by RMSE and MAPE and verified using 10-repeat, 5-fold
cross validation. Models were evaluated using both baseline measures prior to cross
validation and averages obtained from cross validation.
Results: None of the nonlinear methods produced a meaningful improvement in model fit over
standard linear regression at any of the sites. The 2nd and 3rd degree polynomial models
generally performed best after basic linear regression for all four sites. For cg21161138,
polynomial regression was only 0.02% worse than basic linear regression as measured by RMSE
after cross validation, indicating nearly equal model fit. The site cg26703534 yielded
particularly interesting results, as it was the only site at which any modeling type improved
fit, with polynomial regression providing a marginal improvement (<1%) as measured by MAPE.
The non-parametric methods (XGB, KNN, and random forest regression) all generally performed
worse than both basic linear regression and polynomial regression, with random forests coming
closest to the performance of polynomial regression across all sites, and especially close for
cg23576855 (0.6% worse) and cg21161138 (0.6% worse). XGB and KNN regression performed poorly
across the board.
Conclusion: Nonlinear modeling methods did not improve model fit across the four CpG sites on
AHRR when using log-transformed TNE
values, indicating that the relationship between TNEs and DNA methylation at these CpG sites is
best approximated by a linear model. This can inform future analyses as it indicates that the
current trajectory of research, which largely focuses on basic linear modeling methods, is likely
the best for analyzing this relationship. If nonlinear relationships are explored again with new
data, this research has also indicated that low-degree polynomial models and random forests will
likely be the most useful avenues for exploration alongside using raw, untransformed TNE data.
Chapter 1: Introduction
Cigarette smoking is one of the most pressing and preventable public health issues facing
the United States today. Despite often being considered an issue of the past, smoking remains
even in 2023 one of the leading causes of both morbidity and mortality in the country, causing an
additional $240 billion in preventable healthcare costs every year[1]. Smoking accounts
for 1 in 5 deaths, killing approximately 480,000 people per year[1].
Over many years, research has found smoking to be particularly strongly associated with a
number of serious diseases and conditions, among the biggest of which is lung cancer.
Lung cancer represents a significant modern public health crisis in and of itself. In the
United States, it is the third most common form of cancer among both men and women[2] and is
considered the leading cause of cancer-related deaths, accounting for about 18% of cancer-related deaths per year[2]. Decades of research have shown lung cancer development to be
strongly linked with smoking status, with approximately 90% of lung cancer patients having
been ever-smokers[2]. As such, an increasing amount of research has been conducted to delve
into the underlying mechanisms of the smoking-lung cancer association. One of the major
motivations for elucidating this relationship is to help procure new lung cancer risk assessment
tools or screening methods based on smoking-related measures (e.g., smoking status, intensity or
dose) and/or biomarkers.
However, research has also shown a number of peculiarities and contradictions present in
the relationship between smoking and lung cancer risk, which create significant challenges for
those seeking to develop diagnostic tools based on smoking-related measures. It has been
observed that lung cancer risk differs across race and ethnicity for the same number of cigarettes
smoked, indicating that the dose-response relationship of smoking and lung cancer risk is
inconsistent across populations. This difference may be explained by racial and ethnic
differences in the metabolism of smoking-related carcinogens; prior research in the Multiethnic
Cohort observed that populations with slower nicotine metabolism (e.g., Japanese Americans)
have reduced lung cancer risk, while other populations with faster nicotine metabolism (e.g.,
African Americans) have increased lung cancer risk. However, this pattern of nicotine
metabolism and lung cancer risk was not consistent for Native Hawaiians and Latinos[3, 4, 5].
Thus, the actual mechanisms via which smoking impacts lung cancer risk, particularly in relation
to smoking dose, are still not fully understood.
Recent works have postulated that the racial and ethnic inconsistencies in the smoking
dose-lung cancer relationship can be partly attributed to the different measures by which smoking
dose has been quantified in previous research[6]. Generally, there have been two main measures
that have been used to estimate smoking dose when conducting analyses. The first measure is
self-reported smoking dose, measured by the number of cigarettes smoked per day (CPD). The
second measure, which served as the “gold standard” for quantifying smoking dose, is
urinary/serum concentrations of cotinine, a metabolite of smoking which provides a more
accurate biological measure of smoking dose[6].
Each of these measures presents problems that make it less reliable for modern
analyses. CPD is an inconsistent measure due to the extreme variability in individuals’
assessments of their own smoking. It has been shown that 4-10% of research participants in
epidemiologic studies who self-identify as never smokers actually have measurable levels of
cotinine in the serum[6], indicating that they may either be current or former smokers. Cotinine
and cotinine-derived measures have also become unreliable due to the advent of vaping and
nicotine-replacement therapies for smoking, as cotinine measures cannot differentiate between
serum cotinine attributable to smoking and cotinine attributable to vaping or nicotine-replacement therapies. These two measures have also been found to correlate poorly with the
direction of disease risk and fail to account for individual differences in nicotine uptake resulting
from behavioral and metabolic differences[7]. Smoking dose is thus better quantified using
urinary total nicotine equivalents (TNEs), the sum of the major smoking metabolites nicotine,
cotinine, trans-3'-hydroxycotinine (3-HCOT), and their glucuronides, and nicotine N-oxide.
TNEs account for 80-90% of nicotine uptake from cigarettes, thereby providing a more accurate
and reliable measure of internal smoking dose[2, 8].
The effect of smoking on lung cancer risk across populations may further be explained by
epigenetic modification, particularly changes in DNA methylation. Smoking has been found to
cause widespread changes in DNA methylation levels across >2,600 cytosine-phosphate-guanine
(CpG) sites across the epigenome, many of which have been found to revert or partially revert
upon cessation of smoking[6, 7]. As such, it is hypothesized that analyses of DNA methylation
can provide a more effective way to characterize the effects of smoking on the body and account
for the observed racial and ethnic disparities in the dose-response relationship of smoking and
lung cancer. Prior research on the impact of smoking on DNA methylation has found that CpG
sites on the aryl-hydrocarbon receptor repressor (AHRR) gene are consistently hypomethylated
among individuals with higher smoking exposure [9, 10]. In particular, the CpG site
cg05575921 on AHRR has been repeatedly observed to be a strong marker of smoking exposure
across studies, as well as across different racial and ethnic populations [6, 9, 10, 11]. Therefore,
CpG sites on AHRR could possibly serve as biomarkers for smoking and lung cancer risk for
multiple races and ethnicities.
However, prior analyses of the association between smoking and DNA methylation of
CpG sites have had two major limitations, both of which this study aims to address.
First, the majority of studies have examined smoking exposure using self-reported
measures (e.g., smoking status) or less reliable markers of smoking dose (e.g., cotinine).
Second, studies of smoking dose and DNA methylation have only used a linear framework[2, 10],
possibly overlooking nonlinear trends that may account for many of the current
incongruencies present in the literature. This analysis aims to bridge this gap in knowledge by
examining the association of smoking dose, as quantified by urinary TNEs, with DNA
methylation of CpG sites on the AHRR gene using a series of flexible and powerful statistical and
machine learning methods that can capture possible nonlinear trends. These methods may
provide further insight into the true functional form of the smoking-DNA methylation
relationship, and in turn allow for an improved understanding of the biological mechanisms
linking smoking with lung cancer risk.
Chapter 2: Methods
2.1: Data
The data for this analysis were collected as part of the Multiethnic Cohort Study (MEC),
a prospective cohort study established in the early 1990s to study cancer and other chronic
diseases. It consists of over 215,000 individuals from California and Hawaii aged 45-75 years at
enrollment during 1993-1996, of which 70,000 provided blood and urine samples approximately
10 years after cohort entry[10]. For our analysis, we focused on a subset of 1,998 participants
who were self-reported current smokers at the time of biospecimen collection, reported smoking
greater than 10 cigarettes per day, had no prior history of lung cancer development, and had
DNA methylation data available.
DNA methylation was measured using the Illumina MethylationEPIC (EPIC) assay,
which measures methylation at >850,000 CpG sites. The pre-processing of the raw methylation
data using R Bioconductor pipelines has been previously described[10]. Briefly, the methylation
signal intensities from IDAT files were pre-processed using normal-exponential out-of-band
background ("noob") correction, dye-bias correction, type I and type II probe bias correction,
and batch effects correction. During quality control, two individuals were excluded due to poor methylation
signals and two individuals were excluded due to discordant self-reported and methylation-estimated sex values, leaving a total of 1,994 individuals with valid DNA methylation data. After
data cleaning using the Illumina Bead Chip, DNA methylation at each CpG site was represented
with a value ranging from 0 to 1, which represents the ratio of the intensity of the
methylated bead type to the combined locus intensity. In application, this generally means that
increasing values indicate increasing methylation.
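The ratio described above can be sketched in a few lines. This is only an illustration, not the pre-processing pipeline used in the study: the function name, the example intensities, and the stabilizing offset of 100 (a commonly used default for array-based methylation values) are all assumptions that the thesis does not state.

```python
def beta_value(methylated: float, unmethylated: float, offset: float = 100.0) -> float:
    """Ratio of methylated intensity to combined locus intensity.

    The offset is an assumed stabilizing constant (not stated in the thesis)
    that keeps the ratio defined when both intensities are near zero.
    """
    if methylated < 0 or unmethylated < 0:
        raise ValueError("intensities must be non-negative")
    return methylated / (methylated + unmethylated + offset)


# Hypothetical intensities for one CpG site in two samples.
print(round(beta_value(9000, 900), 3))  # mostly methylated, value near 1
print(round(beta_value(900, 9000), 3))  # mostly unmethylated, value near 0
```

Because the value is bounded between 0 and 1, hypomethylation among heavier smokers appears as lower values at the AHRR sites.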
The CpG sites used for this analysis were the top four differentially methylated AHRR-specific sites (cg05575921, cg21161138, cg23576855, and cg26703534) from a recent
epigenome-wide association study (EWAS) in the MEC that examined the linear association of
TNEs and DNA methylation[10]. Methylation data were available for all 1,994 individuals for
cg21161138 and cg23576855, but were missing for one individual for cg05575921 and a different
individual for cg26703534. Thus, analyses of cg05575921 and cg26703534 were each limited to the
1,993 individuals with complete data.
In addition, we used methylation-based estimates of cell type to account for internal
differences in the proportion of B-cells, CD4 T-cells, CD8 T-cells, natural killer cells,
neutrophils, and monocytes, which were computed from methylation-based constrained
projection procedures using the Identifying Optimal Libraries (IDOL) probes for the EPIC array.
As it has been shown that utilizing principal components of genetic ancestry is able to account
for population stratification[12], principal components calculated from a prior analysis of this
data were used for the present study.
Urinary concentrations of TNEs were assessed using gas chromatography/mass
spectrometry. In line with the prior EWAS, TNEs were log-transformed to account for skewed
distributions and maintain consistency[10]. An additional set of comparison analyses were also
run on untransformed TNEs to examine the effects of the transformation on our results and
inform future studies. However, as log-transformed TNE has been the standard in most previous
studies, analysis of our results focused mainly on these models. The same adjustment variables
were also used in the present analyses, including age at specimen collection, sex, estimates of
cell type proportions, and the first eight principal components of genetic ancestry to account for
population stratification[3]. Furthermore, as the association of TNEs with DNA methylation at
these four AHRR-specific sites did not differ significantly across racial and ethnic groups in the prior
EWAS, analyses for the present study were run for the entire cohort collectively rather than
individually within each racial and ethnic group.
2.2: Modeling Approach
The goal of this analysis was to examine potential nonlinearity in the previously established linear association of TNEs with DNA methylation at AHRR-specific CpG sites. Thus,
we compared a series of nonlinear modeling techniques with basic linear regression, using
several metrics to determine whether they improved model fit. Methods were chosen based on
their ability to identify trends that depart from typical linearity and may not be identifiable by
simple linear regression. The dataset was randomly divided into 67-33% splits for all model
training and testing to reduce the chances of model overfitting. This process was repeated for
each site, thereby reducing possible bias occurring from the random splitting of the data, and
these splits were then used to train initial models without cross validation that would serve as a
baseline for assessing model fit.
Several performance metrics were chosen to evaluate whether these nonlinear models
improved model fit over basic linear regression, including the coefficient of determination (R-squared), root mean squared error (RMSE) and mean absolute percent error (MAPE). Initially,
we evaluated R-squared as our primary metric. R-squared constitutes the proportion of variance
in the dependent variable that is explained by its association with the independent variable and is
heavily used in regression analysis as a result, whereby a model with a high R-squared is
generally one that fits better. However, literature has shown that R-squared may perform poorly
when evaluating nonlinear models, as the model and error variance components used to compute R-squared do not necessarily add to the total variance. Thus, it was determined that R-squared
would not be a suitable metric for this analysis[14]. As such, our analysis was shifted to better
metrics for evaluating model performance under nonlinear conditions.
Root mean squared error (RMSE) was the second metric we chose to examine, which
utilizes the standard deviation of the model’s residuals to evaluate model fit. RMSE has two
particular benefits: 1) it is measured on the same scale as the data and thus allows for better
interpretability, and 2) it provides an overall good metric for evaluating model fit between
different models. As it is calculated on the residuals, RMSE also provides a very good measure
of how accurately a model predicts the dependent variable, so a model that minimizes RMSE is
generally the best-fit model. However, RMSE has a disadvantage in that it gives a high weight to
large errors, such as those caused by outliers[13, 15]. While this sensitivity to outliers can be
desirable for prediction tasks, it can also result in overly conservative estimates of model
performance that reject otherwise good candidate models.
As a result, models were evaluated with an additional metric, mean absolute percent error
(MAPE), which is a percent measure indicating the average deviation of predicted values from
actual values. Due to the method of its calculation, MAPE is less sensitive to outliers, but is also
not as reliable at evaluating the predictive capability of a model because it gives the same weight
to all the error terms present in the model[13, 15]. Thus, in the event of a model that generally
fits the data well, but with the exception of the occasional large error, MAPE would provide a
more holistic measure for model fit and would not discount an otherwise effective model, so
models were also examined for whether or not they minimized MAPE.
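The contrast between the two metrics can be made concrete with a minimal sketch. The function names and the toy values below are illustrative assumptions, not the study's code or data. MAPE is returned as a fraction, which appears to match the scale of the values reported in the results tables.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percent error, returned as a fraction of the actual value."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Toy methylation values (hypothetical). Both prediction sets have the same
# total absolute error (0.20), spread evenly or concentrated in one point.
actual = [0.80, 0.75, 0.70, 0.65, 0.60]
spread = [0.84, 0.71, 0.74, 0.61, 0.64]    # five errors of 0.04
one_miss = [0.80, 0.75, 0.70, 0.65, 0.40]  # a single error of 0.20

for name, pred in [("spread", spread), ("one_miss", one_miss)]:
    print(name, round(rmse(actual, pred), 4), round(mape(actual, pred), 4))
```

In this toy case the single large miss more than doubles RMSE relative to the evenly spread errors, while MAPE changes far less, which is the behavior that motivated reporting both metrics.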
An initial examination of candidate models was done by examining the theoretical
underpinnings of a variety of regression methods, as well as their use in literature for modeling
nonlinear relationships. Ultimately, four models were selected due to their flexibility in capturing
nonlinear relationships whilst also optimizing their computational and modeling complexities
within the limitations of the dataset:
● Polynomial regression: Polynomial regression is a natural extension of linear regression
which provides a simple yet powerful framework for modeling nonlinear relationships by
raising the main effect (in this case TNEs) to successive powers up to some degree a. For this analysis, a
series of polynomial models were run on the training set, with degrees varying from 1 to
3, which were examined for which degree minimized RMSE and MAPE. The maximum
degree was limited to 3 as most phenomena do not exhibit relationships that exceed a
3rd-degree polynomial and most models fitted beyond the 3rd degree have significant
issues of overfitting[16, 17, 18]. To verify this, we also ran exploratory models with
polynomial degrees ranging up to 50 and, whilst higher degree polynomials were often
better at minimizing RMSE and MAPE in the training set, they overfit substantially
when applied to the test data. With the
issue of overfitting taken into consideration, polynomial models of the 2nd and 3rd degree
were both fitted, and their performance in the training and test data was compared with
that of the basic linear model.
● K-Nearest Neighbors (KNN) Regression: KNN regression is a non-parametric regression
method which utilizes a set of K closest observations to estimate a numerical value for
the target variable by averaging the value being predicted over its neighbors[16]. A series
of models were run with values of K ranging from 1 to the square root of the number of
observations in the dataset and the most optimal K was selected based on the model that
minimized RMSE. K values were found to vary very little across training and testing
splits with these variations resulting in only marginal changes to the RMSE and MAPE of
the model in the testing set. As such, results were reported on one particular instance of
the KNN regression model as the findings were found to be consistent across different
values of K. KNN regression is well-suited to examining nonlinear relationships, as each
individual point is estimated based on the local structure of the data, thereby meaning that
the estimates are not limited to a particular functional form. For three of the sites, the
chosen K during analysis was 44, whilst one site used a slightly lower K value of 38.
● Random Forests Regression: Random forests regression is a method that utilizes many
independent decision trees to train learners in parallel on many samples of the dataset,
from which the votes are combined and averaged to predict values[16]. Random forests
are extremely useful for analyzing nonlinear relationships as, by running in parallel and
working as an ensemble, they can model minute nonlinearities present in the data and
capture otherwise subtle aspects of nonlinear relationships[19]. Random forest regression
is known to perform well with little prior parameter tuning and is robust to overfitting,
thereby allowing us to capture trends very accurately from the outset[19].
● eXtreme Gradient Boosting (XGB) Regression: The extreme gradient boosting modeling
method is a boosted tree-based algorithm which consists of many simple decision trees
which are taken as an ensemble for predicting the overall data[20]. As XGB regression
uses many simple trees which are not formed independently of the previous trees, it is
very capable of effectively capturing any nonlinearities present in the dataset by learning
from the already-formed trees present in the model[16]. The number of boosting rounds,
which corresponds to the number of trees in the initial boosting model, was tuned by iterating
over 1 to 50 rounds, from which the value that resulted in the lowest RMSE was chosen.
As a whole, this method was expected to perform the best as it utilizes the power of the
random forest model with a system that compensates for shortcomings in previous trees
by learning from new trees[20].
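As one illustration of the tuning described in the list above, the K search for KNN regression can be sketched as follows. This is a simplified sketch under stated assumptions: it uses a single synthetic predictor with no adjustment covariates, it scores each K on a held-out 33% split (the thesis does not say which split was used for selection), and all names and data are hypothetical.

```python
import math
import random


def knn_predict(train_x, train_y, x, k):
    """Average the outcomes of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def choose_k(train_x, train_y, test_x, test_y):
    """Try K = 1..floor(sqrt(n_train)); return the K with the lowest held-out RMSE."""
    best_k, best_rmse = None, float("inf")
    for k in range(1, math.isqrt(len(train_x)) + 1):
        preds = [knn_predict(train_x, train_y, x, k) for x in test_x]
        score = rmse(test_y, preds)
        if score < best_rmse:
            best_k, best_rmse = k, score
    return best_k, best_rmse


# Synthetic stand-in data: methylation falling linearly with log-TNE plus noise.
rng = random.Random(0)
x = [rng.uniform(0, 5) for _ in range(300)]
y = [0.9 - 0.08 * xi + rng.gauss(0, 0.05) for xi in x]

split = int(0.67 * len(x))  # 67-33% train/test split
best_k, best_rmse = choose_k(x[:split], y[:split], x[split:], y[split:])
print(best_k, round(best_rmse, 4))
```

With roughly 1,993 observations, the square root upper bound on K is about 44, which is in line with the K values reported for the sites.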
These methods were applied to the data for each of the CpG sites alongside basic
linear regression, giving six models per site (basic linear, 2nd and 3rd degree polynomial,
KNN, random forest, and XGB) and 24 total
models across the 4 sites. Resampling techniques were employed in order to ensure that the
results were consistent across various testing and training splits and not due to random chance.
Repeated k-fold cross validation was the method chosen for resampling the data and models. K-fold cross validation partitions the original data into k folds, from which one fold is used as the
test set, whilst the other k-1 folds serve as the training set. This process is then repeated k times,
with each fold being used once as the testing set, after which the results are averaged to provide a
clearer picture of model performance. Repeated k-fold cross validation expands upon this
process by repeating the splitting into folds n times, wherein the data are reshuffled into
new folds on each run, thereby providing an even more robust and accurate method for evaluating a
model’s overall performance[16, 21]. For this particular data, owing to considerations of sample
size and computational complexity, a 10-repeats, 5-fold cross validation was performed for each
modeling method, thereby resulting in 50 iterations of resampling, from which aggregate
measures of RMSE and MAPE were estimated by averaging the results across all iterations. All
analyses of results included both the baseline measures found prior to cross validation and the
aggregate measures provided by cross validation. Finally, plots of methylation vs. log-transformed TNEs, methylation vs. untransformed TNEs, and partial dependence plots of
methylation relative to TNEs under our various modeling frameworks were created in order to
visually examine the relationship between TNEs and DNA methylation. These plots are
contained in Appendices 1 and 2. Final interpretation of results focused primarily on our
numerical metrics, alongside visual plots.
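The resampling scheme described above can be illustrated with a minimal sketch. It is not the code used for this analysis: the data are simulated, the model is a single-predictor least-squares fit without the covariate adjustments used in the thesis, and all function and variable names (`repeated_kfold_cv`, `fit_simple_linear`, and so on) are hypothetical. It shows only how 10 repeats of 5-fold cross validation yield 50 held-out evaluations whose RMSE and MAPE are then averaged.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, expressed as a proportion."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_simple_linear(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def repeated_kfold_cv(x, y, k=5, repeats=10, seed=0):
    """Average held-out RMSE and MAPE over `repeats` random k-fold partitions."""
    rng = random.Random(seed)
    rmses, mapes = [], []
    for _ in range(repeats):
        idx = list(range(len(x)))
        rng.shuffle(idx)  # a new random partition on every repeat
        folds = [idx[i::k] for i in range(k)]
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in idx if i not in held_out]
            b0, b1 = fit_simple_linear([x[i] for i in train_idx],
                                       [y[i] for i in train_idx])
            preds = [b0 + b1 * x[i] for i in test_idx]
            truth = [y[i] for i in test_idx]
            rmses.append(rmse(truth, preds))
            mapes.append(mape(truth, preds))
    return sum(rmses) / len(rmses), sum(mapes) / len(mapes), len(rmses)


# Simulated stand-in data (assumption): a linear trend plus Gaussian noise.
sim = random.Random(1)
x = [sim.uniform(2.0, 5.0) for _ in range(200)]
y = [0.9 - 0.05 * xi + sim.gauss(0.0, 0.05) for xi in x]

mean_rmse, mean_mape, n_fits = repeated_kfold_cv(x, y)
print(n_fits, round(mean_rmse, 4), round(mean_mape, 4))
```

Each repeat reshuffles the indices before splitting, so every observation is held out exactly once per repeat and ten times overall.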
Chapter 3: Results
As the final analysis resulted in 48 models with 192 total measures when including both
baseline and cross validation results and analyses of transformed and untransformed TNEs, the
results are organized by site, metric used, and baseline vs. aggregate measures
from cross validation.
3.1: Results for cg05575921
The results for cg05575921 for n=1993 samples were as follows:
Table 1: Regression results for cg05575921 with transformed TNEs.
Model             Baseline RMSE   Baseline MAPE   Cross Validation RMSE   Cross Validation MAPE
Basic Linear      0.07582943      0.10011645      0.0771146               0.1014455
KNN (k=38)        0.08012352      0.10603016      0.08057205              0.10563791
XGB               0.0814473       0.1059061       0.08125742              0.10605193
Random Forest     0.07815935      0.10434228      0.07859684              0.1047701
2nd Degree Poly   0.07522398      0.09873147      0.07751832              0.1017548
3rd Degree Poly   0.07525691      0.09866234      0.07743593              0.1016931
We observed that the KNN, XGB, and random forest models resulted in increases in
RMSE of 5.67%, 7.41%, and 3.07%, respectively, over the basic linear model, and increases in
MAPE of 5.90%, 5.78%, and 4.22%, respectively, over the basic linear model. These results
proved to be consistent in their direction and were relatively consistent in their magnitude once
cross-validation was performed. In particular, after 5-fold cross-validation repeated 10 times, it
was found that the KNN, XGB, and random forest models presented RMSEs that were 4.48%,
5.37%, and 1.92% higher, respectively, than the basic linear model, and MAPE values that were
4.13%, 4.54%, and 3.27% higher, respectively, than the basic linear model.
In contrast, the 2nd and 3rd degree polynomial models were found to reduce RMSE and
MAPE across the initial training and testing split. The 2nd degree polynomial model resulted in a
reduction of RMSE of 0.79% and a reduction in MAPE of 1.38% over the base model. The 3rd
degree polynomial saw slightly worse performance as evaluated by RMSE, with only a 0.75%
decrease over the base model, whilst it saw better performance as evaluated by MAPE, with a
1.45% decrease in MAPE over the base model. These results did not prove to be consistent
across many different training and testing splits as provided by cross-validation. After 5-fold, 10
repeats cross validation, both the 2nd and 3rd degree polynomial models were found to result in
higher aggregate RMSE (0.52% and 0.41% higher, respectively) and aggregate MAPE values
(0.30% and 0.24% higher, respectively).
These results indicated that only the 2nd and 3rd degree polynomials provided any
improvement in fit over the basic linear model in the initial training and test split, whilst the
KNN, XGB, and random forest models failed to provide any improvement in model fit at any
point. Following cross validation, none of the models provided an improvement in fit as
measured by either RMSE or MAPE over the basic linear model.
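The percentage differences quoted in this chapter are relative changes with respect to the basic linear model. As a worked illustration, the short sketch below recomputes a few of the baseline RMSE comparisons from Table 1; the helper name `pct_change` is hypothetical and the snippet is not taken from the original analysis code.

```python
def pct_change(model_value, baseline_value):
    """Relative change of a metric versus the baseline model, in percent."""
    return 100.0 * (model_value - baseline_value) / baseline_value


# Baseline RMSE values copied from Table 1 (cg05575921, transformed TNEs).
basic_linear_rmse = 0.07582943
other_models_rmse = {
    "KNN (k=38)": 0.08012352,
    "Random Forest": 0.07815935,
    "2nd Degree Poly": 0.07522398,
}

changes = {name: pct_change(value, basic_linear_rmse)
           for name, value in other_models_rmse.items()}

for name, change in changes.items():
    direction = "higher" if change > 0 else "lower"
    print(f"{name}: RMSE {abs(change):.2f}% {direction} than basic linear")
```

A positive value indicates a worse fit than the basic linear model and a negative value indicates an improvement; small discrepancies in the last digit relative to the text can arise from rounding.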
3.2: Results for cg23576855
The results for cg23576855 for n=1994 samples were as follows:
Table 2: Regression results for cg23576855 with transformed TNEs.
Model             Baseline RMSE   Baseline MAPE   Cross Validation RMSE   Cross Validation MAPE
Basic Linear      0.1092412       0.2141753       0.1114687               0.2400335
KNN (k=44)        0.1138673       0.220187        0.1141805               0.2414478
XGB               0.12038545      0.22579267      0.12078269              0.24602415
Random Forest     0.1109316       0.2166757       0.1125646               0.2411423
2nd Degree Poly   0.1085142       0.2127665       0.1117444               0.2496579
3rd Degree Poly   0.108492        0.2126135       0.1118347               0.2497132
As a whole, models for cg23576855 demonstrated similar trends to those observed for
cg05575921, with only the polynomial models providing any improvement in fit over the basic
linear model at the baseline. Compared to the basic linear regression model, the KNN, XGB, and
random forest models resulted in RMSE values that were 4.23%, 10.21%, and 1.54% higher at
baseline whilst the 2nd and 3rd degree polynomial models resulted in a reduction of RMSE by
0.66% and 0.68%, respectively. Similar results were found when examining the models via
MAPE as the KNN, XGB, and random forest models resulted in MAPE increasing by 2.81%,
5.42%, and 1.16%, respectively, whilst the 2nd and 3rd degree polynomials lowered MAPE by
0.65% and 0.73%, respectively.
Across the cross-validation splits, KNN, XGB, and random forests resulted in RMSE
increasing by 2.43%, 8.35%, and 0.98%, respectively. In contrast to the baseline results and in
line with the trends we found with cg05575921, 2nd and 3rd degree polynomials increased
RMSE by 0.25% and 0.32%, respectively. KNN, XGB, and random forests were found to
increase MAPE by 0.58%, 2.49%, and 0.46%, respectively, once cross validation was completed.
Interestingly, and not quite in line with the results observed for cg05575921, the 2nd degree and
3rd degree polynomial models demonstrated the highest MAPE values, after cross validation,
with the 2nd degree model increasing MAPE by 4.01% and the 3rd degree model increasing
MAPE by 4.03%.
3.3: Results for cg21161138
The results for cg21161138 for n=1994 samples were as follows:
Table 3: Regression results for cg21161138 with transformed TNEs.
Model             Baseline RMSE   Baseline MAPE   Cross Validation RMSE   Cross Validation MAPE
Basic Linear      0.04547585      0.04967752      0.04577653              0.05118018
KNN (k=44)        0.04690211      0.05118855      0.04797996              0.05415994
XGB               0.04876587      0.05415703      0.0487147               0.05490045
Random Forest     0.04650127      0.05117995      0.04694777              0.0530853
2nd Degree Poly   0.04565754      0.04997356      0.04578964              0.05167571
3rd Degree Poly   0.0456727       0.04998854      0.04588808              0.05174266
We found that cg21161138 did not follow the same patterns observed in cg05575921 and
cg23576855 as none of the models improved fit, as measured by either RMSE or MAPE, at
baseline or after cross-validation. Specifically, the KNN, XGB, and random forest models
increased RMSE by 3.13%, 7.23%, and 2.25% at the baseline compared to the basic linear
regression model, whilst the 2nd and 3rd degree polynomials had smaller increases of 0.39% and
0.43%, respectively. MAPE showed similar trends at the baseline, albeit with larger magnitudes, as the
KNN, XGB, and random forest models were found to increase MAPE by 3.04%, 9.01%, and
3.02%, respectively, whilst it only increased by 0.59% and 0.62%, respectively, for the 2nd and
3rd degree polynomial models.
Following cross-validation, it was found that the KNN, XGB, and random forest models
caused aggregate RMSE to increase by 4.81%, 6.41%, and 2.55%, respectively. Similar to the
trends found at the baseline, the 2nd and 3rd degree polynomial models were found to cause
small increases in RMSE, at 0.02% and 0.24%, respectively. Examining MAPE, the KNN, XGB,
and random forest models were found to increase MAPE by 5.82%, 7.26%, and 3.72%, whilst
the 2nd and 3rd degree models increased MAPE by 0.96% and 1.10%, respectively.
3.4: Results for cg26703534
The results for cg26703534 for n=1993 samples were as follows:
Table 4: Regression results for cg26703534 with transformed TNEs.
Model             Baseline RMSE   Baseline MAPE   Cross Validation RMSE   Cross Validation MAPE
Basic Linear      0.03425795      0.03903948      0.03471447              0.03866549
KNN (k=44)        0.03620903      0.04121205      0.03665724              0.04095765
XGB               0.03696238      0.04155649      0.03690192              0.0409986
Random Forest     0.03545912      0.0403994       0.03566257              0.03981754
2nd Degree Poly   0.03432738      0.03910612      0.03492691              0.03843259
3rd Degree Poly   0.03432987      0.03911006      0.03497538              0.03846852
Compared to the analyses for the previous three CpG sites, the modeling for cg26703534
replicated some of the most consistent changes in model fit, while also exhibiting some
inconsistent trends. At the baseline, none of the modeling methods resulted in an improvement in
fit as measured by RMSE, a significant departure from the trend of 2nd and 3rd degree
polynomials generally improving fit at the baseline. In particular, we observed that the KNN,
XGB, and random forest models resulted in a worsening of model fit by 5.70%, 7.89%, and
3.50%, respectively, compared to the basic linear model. Similarly, the 2nd and 3rd degree
polynomial models resulted in RMSE increasing by 0.202% and 0.209%, respectively. This trend
persisted when looking at MAPE, as the KNN, XGB, and random forest models resulted in
MAPE increasing by 5.56%, 6.45%, and 3.48%, respectively, compared to the basic linear
model, whilst the 2nd and 3rd degree polynomial models resulted in marginal increases in
MAPE of 0.17% and 0.18%, respectively.
When examining model fit after cross-validation, trends for changes in RMSE appeared
to be relatively consistent. The KNN, XGB, and random forest models increased aggregate
RMSE by 5.59%, 6.30%, and 2.73%, respectively. The 2nd and 3rd degree models also resulted
in slightly larger, but still marginal, increases in aggregate RMSE, with the 2nd degree model
corresponding to a 0.61% increase and the 3rd degree model corresponding to a 0.75% increase.
Examinations of MAPE for the KNN, XGB, and random forest models were also relatively
consistent in their findings, with increased aggregate MAPE of 5.92%, 6.03%, and 2.97%,
respectively, over the basic linear model. However, examinations of aggregate MAPE for the
2nd and 3rd degree models did not align with previous results, as the 2nd degree model
corresponded with a 0.60% decrease in aggregate MAPE whilst the 3rd degree model
corresponded with a 0.51% decrease in aggregate MAPE.
3.5: Comparison Models with Untransformed TNEs
This analysis also considered how transforming TNEs, the main
independent variable, affected the overall regression results. To examine these effects, a set of
comparison models was run in which TNEs were not transformed.
Table 5: Regression results for cg05575921 with Untransformed TNEs.
Model             Baseline RMSE   Baseline MAPE   Cross Validation RMSE   Cross Validation MAPE
Basic Linear      0.07978977      0.10662971      0.08072896              0.10679806
KNN (k=44)        0.08293014      0.11068476      0.08331685              0.10975971
XGB               0.07929205      0.10367215      0.08119749              0.10587242
Random Forest     0.07815118      0.10425354      0.07848106              0.10447085
2nd Degree Poly   0.07780155      0.1032343       0.0795439               0.1048218
3rd Degree Poly   0.07646536      0.09866234      0.07896054              0.1042006
Table 6: Regression results for cg23576855 with Untransformed TNEs.
Model             Baseline RMSE   Baseline MAPE   Cross Validation RMSE   Cross Validation MAPE
Basic Linear      0.1178854       0.2396654       0.1143392               0.2442633
KNN (k=44)        0.12177414      0.24665212      0.11680567              0.24649842
XGB               0.12082654      0.24917775      0.11678685              0.24807604
Random Forest     0.1158068       0.2390627       0.1125433               0.2413217
2nd Degree Poly   0.1123136       0.2392459       0.114122                0.2454493
3rd Degree Poly   0.1105093       0.2347376       0.1139397               0.2453659
Table 7: Regression results for cg21161138 with Untransformed TNEs.
Model             Baseline RMSE   Baseline MAPE   Cross Validation RMSE   Cross Validation MAPE
Basic Linear      0.04693558      0.05274960      0.04661308              0.05215934
KNN (k=44)        0.04949149      0.05644157      0.04872100              0.05485562
XGB               0.04942909      0.05616616      0.04875832              0.05495492
Random Forest     0.04779977      0.05445153      0.04693654              0.05302347
2nd Degree Poly   0.04678789      0.05253597      0.04614414              0.0515911
3rd Degree Poly   0.04658494      0.05225441      0.04602409              0.0513512
Table 8: Regression results for cg26703534 with Untransformed TNEs.
Model             Baseline RMSE   Baseline MAPE   Cross Validation RMSE   Cross Validation MAPE
Basic Linear      0.03552930      0.03948884      0.03525877              0.03941887
KNN (k=44)        0.03754723      0.04165783      0.03697028              0.04132893
XGB               0.03767289      0.04153863      0.03655728              0.04114398
Random Forest     0.03593658      0.03992386      0.03564486              0.03976246
2nd Degree Poly   0.03513862      0.03926986      0.03514863              0.03921508
3rd Degree Poly   0.03607320      0.03956713      0.0348081               0.03873815
These models indicated several interesting results. Across the board, we found that
polynomial regression resulted in improvements to model fit over basic linear regression. In
particular, it was found that, for cg05575921, random forests, 2nd degree polynomial regression,
and 3rd degree polynomial regression resulted in improvements to model fit before cross-validation as measured by RMSE. After cross-validation, this pattern remained, with random
forests performing the best in aggregate. Based on MAPE, the results were similar, though in this
case the 3rd degree polynomial regression model resulted in the biggest improvement to model fit
both before and after cross validation.
When examining cg23576855, we found that the random forest, 2nd degree polynomial,
and 3rd degree polynomial models resulted in improvements to model fit as measured by RMSE
both before and after cross validation. In particular, it was found that random forests performed the
best after cross validation. When examined via MAPE, we found that only the 2nd and 3rd degree
polynomial models resulted in improvements to model fit both before and after cross validation,
with the 3rd degree model providing the greatest magnitude of improvement.
cg21161138 demonstrated further interesting trends, with the 2nd and 3rd degree models
providing improvements to fit as measured by RMSE both before and after cross validation, with
the 3rd degree model resulting in the largest improvement after cross validation. As measured by
MAPE, the results were similar as the 2nd and 3rd degree models again were the only ones that
resulted in improvements to fit over basic linear regression, with the 3rd degree model again
providing the greatest magnitude of improvement.
Finally, cg26703534 had its own set of interesting results. Prior to cross validation, only
the 2nd degree polynomial model appeared to result in an improvement to model fit as measured
by RMSE. However, after cross validation, both the 2nd and 3rd degree models resulted in
improvements to model fit as measured by RMSE, with the 3rd degree model providing the
largest improvement. When examined via MAPE, the trends were similar, with the 2nd degree
polynomial model being the only one resulting in better model fit prior to cross-validation, and
the 3rd degree polynomial model being superior to the 2nd degree model after cross validation.
As a whole, all of these models resulted in higher absolute RMSE values across the board
as compared to using log-transformed TNEs, whilst the MAPE values were often lower,
indicating possible issues of outliers affecting the results in the untransformed models, as MAPE
is generally less sensitive to outliers than RMSE. The higher prediction error values indicate as a whole that the
log-transformation may help with the modeling overall.
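The differing sensitivity of the two metrics to outliers can be seen in a small numerical illustration. The values below are invented for demonstration and are not drawn from the study data; the sketch compares how much each metric inflates when a single large prediction error is introduced into an otherwise uniform set of errors.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: squares each residual before averaging."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error: averages absolute relative residuals."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical methylation values and predictions (illustrative only).
truth = [0.8] * 10
clean_pred = [0.85] * 10                  # every prediction is off by 0.05
outlier_pred = clean_pred[:-1] + [1.3]    # one prediction is off by 0.5

rmse_ratio = rmse(truth, outlier_pred) / rmse(truth, clean_pred)
mape_ratio = mape(truth, outlier_pred) / mape(truth, clean_pred)

print(f"RMSE inflation factor: {rmse_ratio:.2f}")
print(f"MAPE inflation factor: {mape_ratio:.2f}")
```

Because RMSE squares each residual, the single extreme error contributes disproportionately to it, whereas MAPE weights that error only linearly.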
Chapter 4: Discussion of Results
This study aimed to examine possible nonlinearities present in the already-established
association of TNEs with DNA methylation at four CpG sites in AHRR. Due to the large number
of models, it was prudent to analyze the results in parts, focusing firstly on interpretation of the
models containing log-transformed TNEs as these models were best in line with previous
literature. In examining the non-parametric modeling methods used for this analysis – namely
KNN, XGB, and random forests regression – we found that there were no general improvements
to model fit across any of our sites as measured by RMSE or MAPE. This trend was consistent
across both the baseline and cross-validation sets, indicating that it was likely not attributable to
errors or simple random chance. As a whole, we found that the set of non-parametric methods
generally performed worse than basic linear regressions and polynomial regressions in terms of
model fit, resulting in higher RMSE and MAPE values across the board. Interestingly, the non-parametric methods also were found to perform worse across the board when compared to
polynomial regression, a method that is generally much simpler in its estimation and design. This
is crucial as these three non-parametric methods were specifically chosen due to their flexibility
and complexity, both of which are factors that can be useful for capturing nonlinear relationships
present in data, especially slight nonlinearities that simpler methods may tend to miss. That these
complex models consistently failed to provide improvements in model fit indicates that the most
suitable method for modeling the relationship between log-transformed TNEs and DNA
methylation is basic linear regression.
The utility of the non-parametric methods extends beyond solidifying the linear approach,
as comparisons within the models provided interesting insights into model fit. It was observed
that random forest regression was the best-performing of the non-parametric methods for all four
CpG sites, both at a baseline and after cross-validation. In particular, random forest models
performed relatively well in two of the sites, cg23576855 and cg21161138, with the random
forest model being only 0.6% worse as measured by RMSE than 2nd and 3rd degree polynomial
models for cg23576855 after cross validation and only 0.98% worse than basic linear regression
as measured by RMSE. KNN regression fell between the other two non-parametric methods in
performance; it was expected to perform worse than random forests
due to the simpler principles it uses for modeling data. XGB regression proved to be the worst-performing of the non-parametric methods across the board, consistently resulting in increases in
RMSE and MAPE that were 2-3 times that of the other methods. This was particularly peculiar
as XGB should generally outperform random forests regression due to its modeling methodology
that uses dependent trees that learn from each other. That it did not outperform random forests
indicates a number of possibilities, including poor parameter tuning, issues within the data, or
simply a paucity of data for the model to learn from. Another possible cause identified for the
disparity in actual model performance relative to expected model performance was overfitting of
the data by the XGB model. Random forest models are naturally robust to overfitting, which is
the primary reason they prove competitive with gradient-boosting algorithms despite being
simpler in their methodology. Many of these observations may also be tied to sample size, as the
learners may simply not have had enough information to learn from, so future analyses with
much larger data volumes may be able to glean new information and find trends these models
could not. As a whole, this analysis found that complex non-parametric modeling methods did
not provide improvements in model fit over basic linear regression because there did not appear
to be a nonlinear relationship in the data when examining the relationship between log-transformed TNEs and methylation.
Polynomial regression provided several interesting results and proved to be the modeling
method most competitive with basic linear regression. Out of all the modeling methods, the 2nd
and 3rd degree polynomial models came the closest to replicating the performance of the basic
linear model as measured by RMSE, with their values often being less than 1% higher than the
basic linear model. Such differences were minimized after cross-validation for three of the sites
(cg21161138, cg05575921, and cg23576855). In particular, the 2nd degree polynomial model for
cg21161138 was extremely close to the basic linear model in terms of performance as measured
by RMSE (0.02%) after cross validation, indicating it could essentially serve the same predictive
function. When examining these models via MAPE, we saw that they generally fit worse than
the basic linear model at baseline. However, the 2nd and 3rd degree polynomial models for
cg26703534 actually improved model fit as measured by MAPE across cross-validation splits.
Although this improvement was extremely small (less than 1%), it provides evidence of a
possible nonlinear trend. MAPE is a metric that is naturally less sensitive to outliers than RMSE,
which may also be indicative of some outlier issues being present in the models for cg26703534.
As such, while the polynomial models did not generally result in widespread improvements to
model fit, there were still interesting results to be gleaned from them for future analyses.
The set of comparison models run with untransformed TNEs provided a number of
interesting results that, even though they were not the primary focus of the study, could provide
useful insights for future research. The most central of these is that, when using untransformed
TNEs, many of the nonlinear modeling methods outperformed basic linear regression, both as
measured by RMSE and MAPE. In particular, random forests outperformed basic linear
regression and both polynomial regression models on aggregate as measured by RMSE for
cg05575921 and cg23576855, indicating random forests as a potentially useful avenue for
modeling the relationship between TNEs and methylation for these two sites in future studies.
This is also a very significant departure from the models using log-transformed TNEs as none of
the non-parametric modeling methods were able to match basic linear regression or polynomial
regression for any of the sites in those models. Furthermore, for all four sites, 2nd and
3rd degree polynomial regression resulted in improvements to model fit as measured by RMSE,
being the best options for cg21161138 and cg26703534.
The story changes slightly when examining the models through the lens of MAPE,
however, as only the polynomial regression models now resulted in improvements to fit, though
they still consistently resulted in improvements to model fit over basic linear regression for all
the sites. Taking this fact in tandem with the fact that RMSE values were generally
higher for the untransformed models as compared to the transformed models, there is some
indication of possible issues of outliers and skewness in the distributions that could be addressed
in future studies. This is further confirmed by examining graphical evidence, as the transformed
TNEs generally have a less skewed graph when plotted against DNA methylation whilst their
partial dependence plots exhibit less extreme nonlinearity. As a whole, these comparison models
seem to indicate that transformation of TNEs aids with examining the relationship of TNEs with
DNA methylation when utilizing basic linear regression, but future research may benefit from
using untransformed TNEs alongside flexible machine learning models that are able to deal with
non-linearities directly.
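The effect of a log transformation on a right-skewed exposure can be sketched numerically. The example below is illustrative only: it assumes a log-normally distributed exposure, which is an assumption made for demonstration and not a claim about the actual distribution of TNEs in this sample, and the parameter values and the `skewness` helper are hypothetical.

```python
import math
import random


def skewness(values):
    """Sample skewness, computed as the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n


# Simulated right-skewed exposure (assumption: log-normal, arbitrary parameters).
rng = random.Random(42)
raw = [rng.lognormvariate(3.5, 0.8) for _ in range(2000)]
logged = [math.log(v) for v in raw]

raw_skew = skewness(raw)
log_skew = skewness(logged)

print(f"Skewness before log transformation: {raw_skew:.2f}")
print(f"Skewness after log transformation:  {log_skew:.2f}")
```

Under this assumption, the transformed values are approximately symmetric, which is consistent with a linear model fitting the transformed exposure more readily than the raw one.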
In conclusion, it was found that nonlinear modeling methods did not result in
improvements to model fit across the four CpG sites on AHRR when using log-transformed TNE
values, indicating that the relationship between TNEs and DNA methylation at these CpG sites is
best approximated by a linear model. This can inform future analyses as it indicates that the
current trajectory of research, which largely focuses on basic linear modeling methods, is likely
the best for analyzing this relationship. If nonlinear relationships are explored again with new
data, this research has also indicated that low-degree polynomial models and random forests will
likely be the most useful avenues for exploration alongside using raw, untransformed TNE data.
References
1. Warren, G. W., Alberg, A. J., Kraft, A. S., & Cummings, K. M. (2014). The 2014 Surgeon
General’s report: “The Health Consequences of Smoking–50 Years of Progress”: A
paradigm shift in cancer care. Cancer, 120(13), 1914–1916.
https://doi.org/10.1002/cncr.28695
2. Park, S. L., Patel, Y. M., Loo, L. W. M., Mullen, D. J., Offringa, I. A., Maunakea, A.,
Stram, D. O., Siegmund, K., Murphy, S. E., Tiirikainen, M., & Le Marchand, L. (2018).
Association of internal smoking dose with blood DNA methylation in three racial/ethnic
populations. Clinical Epigenetics, 10(1), 110. https://doi.org/10.1186/s13148-018-0543-7
3. Haiman, C. A., Stram, D. O., Wilkens, L. R., Pike, M. C., Kolonel, L. N., Henderson, B. E.,
& Le Marchand, L. (2006). Ethnic and Racial Differences in the Smoking-Related Risk
of Lung Cancer. New England Journal of Medicine, 354(4), 333–342.
https://doi.org/10.1056/NEJMoa033250
4. Murphy, S. E., Park, S. L., Balbo, S., Haiman, C. A., Hatsukami, D. K., Patel, Y., Peterson,
L. A., Stepanov, I., Stram, D. O., Tretyakova, N., Hecht, S. S., & Le Marchand, L.
(2018). Tobacco biomarkers and genetic/epigenetic analysis to investigate
5. Stram, D. O., Park, S. L., Haiman, C. A., Murphy, S. E., Patel, Y., Hecht, S. S., & Le
Marchand, L. (2019). Racial/Ethnic Differences in Lung Cancer Incidence in the
Multiethnic Cohort Study: An Update. JNCI: Journal of the National Cancer Institute,
111(8), 811–819. https://doi.org/10.1093/jnci/djy206
6. Dawes, K., Andersen, A., Reimer, R., Mills, J. A., Hoffman, E., Long, J. D., Miller, S., &
Philibert, R. (2021). The relationship of smoking to cg05575921 methylation in blood
and saliva DNA samples from several studies. Scientific Reports, 11(1), 21627.
https://doi.org/10.1038/s41598-021-01088-7
7. Philibert, R., Dogan, M., Beach, S. R. H., Mills, J. A., & Long, J. D. (2020). AHRR
methylation predicts smoking status and smoking intensity in both saliva and blood
DNA. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics,
183(1), 51–60. https://doi.org/10.1002/ajmg.b.32760
8. Benowitz, N. L., Dains, K. M., Dempsey, D., Yu, L., & Jacob, P. (2010). Estimation of
nicotine dose after low-level exposure using plasma and urine nicotine metabolites.
Cancer Epidemiology, Biomarkers & Prevention, 19, 1160–1166.
https://doi.org/10.1158/1055-9965.EPI-09-1303
9. Gao, X., Jia, M., Zhang, Y., Breitling, L. P., & Brenner, H. (2015). DNA methylation
changes of whole blood cells in response to active smoking exposure in adults: A
systematic review of DNA methylation studies. Clinical Epigenetics, 7(1), 113.
https://doi.org/10.1186/s13148-015-0148-3
10. Huang, B. Z., Binder, A. M., Quon, B., Patel, Y. M., Lum-Jones, A., Tiirikainen, M.,
Murphy, S. E., Loo, L., Maunakea, A. K., Haiman, C. A., Wilkens, L. R., Koh, W.-P.,
Cai, Q., Aldrich, M. C., Siegmund, K. D., Hecht, S. S., Yuan, J.-M., Blot, W. J., Stram,
D. O., … Park, S. L. (2024). Epigenome-wide association study of total nicotine
equivalents in multiethnic current smokers from three prospective cohorts. The
American Journal of Human Genetics. https://doi.org/10.1016/j.ajhg.2024.01.012
11. Murphy, S. E., Park, S. L., Balbo, S., Haiman, C. A., Hatsukami, D. K., Patel, Y., Peterson,
L. A., Stepanov, I., Stram, D. O., Tretyakova, N., Hecht, S. S., & Le Marchand, L.
(2018). Tobacco biomarkers and genetic/epigenetic analysis to investigate ethnic/racial
differences in lung cancer risk among smokers. Npj Precision Oncology, 2(1), 17.
https://doi.org/10.1038/s41698-018-0057-y
12. Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., & Reich, D.
(2006). Principal components analysis corrects for stratification in genome-wide
association studies. Nature Genetics, 38(8), 904–909. https://doi.org/10.1038/ng1847
13. Diebold, F. X., & Mariano, R. S. (1995). Comparing Predictive Accuracy. Journal of
Business & Economic Statistics, 13(3), 253–263.
https://doi.org/10.1080/07350015.1995.10524599
14. Spiess, A.-N., & Neumeyer, N. (2010). An evaluation of R2 as an inadequate measure for
nonlinear models in pharmacological and biochemical research: A Monte Carlo
approach. BMC Pharmacology, 10(1), 6. https://doi.org/10.1186/1471-2210-10-6
15. Harrell, F. E. (2001). Regression Modeling Strategies: With Applications to Linear
Models, Logistic Regression, and Survival Analysis. Springer.
https://books.google.com/books?id=kfHrF-bVcvQC
16. Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. Springer.
https://books.google.com/books?id=eBSgoAEACAAJ
17. Anderson, T. W. (1962). The Choice of the Degree of a Polynomial Regression as a
Multiple Decision Problem. Annals of Mathematical Statistics, 33, 255–265.
18. Gelman, A., & Imbens, G. (2019). Why High-Order Polynomials Should Not Be Used in
Regression Discontinuity Designs. Journal of Business & Economic Statistics, 37(3),
447–456. https://doi.org/10.1080/07350015.2017.1366909
19. Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
https://doi.org/10.1023/A:1010933404324
20. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of
the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, 785–794.
21. Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning
practice and the classical bias–variance trade-off. Proceedings of the National Academy
of Sciences, 116(32), 15849–15854. https://doi.org/10.1073/pnas.1903070116
Appendix 1: Plots for log-transformed TNE Models
Plot legend: All plots were created using ggplot2. Plots consisted of methylation vs. TNE plots
indicating the overall trend in the data as well as partial dependence plots for visualizing the
effect of changes in TNE on methylation when accounting for the modeling method used.
[Figures not available in this transcript. Each subsection contained the following panels:
Methylation vs. TNE; Random Forests; XGB; KNN; 2nd Degree Polynomial Regression;
3rd Degree Polynomial Regression; Linear Regression.]
A1.1: cg05575921
A1.2: cg23576855
A1.3: cg21161138
A1.4: cg26703534
Appendix 2: Plots for Untransformed TNE Models
Plot legend: All plots were created using ggplot2. Plots consisted of methylation vs. TNE plots
indicating the overall trend in the data as well as partial dependence plots for visualizing the
effect of changes in TNE on methylation when accounting for the modeling method used.
[Figures not available in this transcript. Each subsection contained the following panels:
Methylation vs. TNEs; Random Forests; XGB; KNN; 2nd Degree Polynomial Regression;
3rd Degree Polynomial Regression; Basic Linear Regression.]
A2.1: cg05575921
A2.2: cg23576855
A2.3: cg21161138
A2.4: cg26703534
Abstract
Background: Lung cancer continues to present a global public health crisis, with cigarette smoking being one of the main contributors to lung cancer development. Current literature has shown that cigarette smoking is associated with differential methylation of the AHRR gene. However, smoking research has been hampered by the unreliability of self-reported dose and the development of other nicotine substitutes, such as vapes and nicotine patches, which have made it difficult to use previous gold standard methods for determining internal smoking dose in patients. As such, studies have begun analyzing cigarette smoking using total nicotine equivalents (TNEs), which provide a more complete measure for gauging internal smoking dose. Recently, studies have explored the link between TNEs and DNA methylation under a linear regression framework. This study aims to expand on that knowledge by examining possible nonlinear relationships between smoking dose, as quantified by TNEs, and DNA methylation.
Methods: The data for this analysis were collected as part of the Multiethnic Cohort Study (MEC) and consisted of a sample of 1,994 individuals who were smokers at the time of sample collection, reported smoking >10 cigarettes per day, and had DNA methylation data available. Data from four CpG sites on the AHRR gene were used: cg05575921 (n=1993), cg23576855 (n=1994), cg21161138 (n=1994), and cg26703534 (n=1993). We evaluated the association between TNE and DNA methylation at each of the four CpG sites, using a series of models including standard linear regression, 2nd and 3rd degree polynomial regressions, KNN regressions, random forest regressions, and eXtreme Gradient Boosting (XGB) regressions. All models were adjusted for sex, age, methylation-based estimates of cell type composition, and principal components of genetic ancestry. Standard linear regression served as the baseline for model performance, which was assessed by RMSE and MAPE and verified using 10-repeats, 5-fold cross validation. Models were evaluated using both baseline measures prior to cross validation and averages gained from cross validation.
Results: No method provided more than a marginal improvement in model fit over standard linear regression at any of the four sites. The 2nd- and 3rd-degree polynomial models generally performed best after basic linear regression at all four sites. For cg21161138, polynomial regression was only 0.02% worse than basic linear regression as measured by RMSE after cross-validation, indicating nearly equal model fit. The site cg26703534 was the only one at which any modeling type improved fit, with polynomial regression providing a marginal improvement (<1%) as measured by MAPE. The non-parametric methods (XGB, KNN, and random forest regression) generally performed worse than both basic linear regression and polynomial regression. Random forests came closest to the performance of polynomial regression across all sites and were especially close for cg23576855 (0.6% worse) and cg21161138 (0.6% worse). XGB and KNN regression performed poorly across the board.
Conclusion: Nonlinear modeling methods did not meaningfully improve model fit across the four CpG sites on AHRR when using log-transformed TNE values, indicating that the relationship between TNEs and DNA methylation at these CpG sites is best approximated by a linear model. This finding suggests that the current trajectory of research, which largely focuses on basic linear modeling methods, is likely the most suitable for analyzing this relationship. If nonlinear relationships are explored again with new data, the results indicate that low-degree polynomial models and random forests, along with raw, untransformed TNE data, will likely be the most useful avenues for exploration.
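The evaluation scheme described in the Methods (RMSE and MAPE averaged over 5-fold cross-validation repeated 10 times, with linear regression as the baseline) can be illustrated with a minimal sketch. This is not the author's code: the data are synthetic, the three covariates are placeholders for the adjustment variables, all function and variable names are hypothetical, and only linear and low-degree polynomial fits are shown, using ordinary least squares in NumPy.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be nonzero)."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)


def design_matrix(x, covariates, degree):
    """Intercept, powers of the exposure up to `degree`, then covariates."""
    powers = [x ** d for d in range(1, degree + 1)]
    return np.column_stack([np.ones_like(x), *powers, covariates])


def repeated_kfold_scores(x, covariates, y, degree,
                          n_splits=5, n_repeats=10, seed=0):
    """Average held-out RMSE and MAPE over repeated k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    X = design_matrix(x, covariates, degree)
    rmses, mapes = [], []
    for _ in range(n_repeats):
        # Reshuffle the sample before each repeat, then cut it into folds.
        folds = np.array_split(rng.permutation(len(y)), n_splits)
        for i, test_idx in enumerate(folds):
            train_idx = np.concatenate(
                [f for j, f in enumerate(folds) if j != i]
            )
            # Ordinary least squares fit on the training folds only.
            coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
            pred = X[test_idx] @ coef
            rmses.append(rmse(y[test_idx], pred))
            mapes.append(mape(y[test_idx], pred))
    return float(np.mean(rmses)), float(np.mean(mapes))


# Synthetic stand-in data (assumption): a truly linear effect of log(TNE)
# on a methylation proportion, plus three placeholder covariates.
data_rng = np.random.default_rng(42)
n = 500
log_tne = data_rng.normal(3.5, 0.7, n)
covariates = data_rng.normal(size=(n, 3))
methylation = (0.9 - 0.08 * log_tne + 0.01 * covariates[:, 0]
               + data_rng.normal(0.0, 0.03, n))

# Degree 1 is the linear baseline; degrees 2 and 3 are the polynomial models.
results = {
    degree: repeated_kfold_scores(log_tne, covariates, methylation, degree)
    for degree in (1, 2, 3)
}
for degree, (avg_rmse, avg_mape) in results.items():
    print(f"degree {degree}: RMSE={avg_rmse:.4f}  MAPE={avg_mape:.2f}%")
```

Because the synthetic relationship is linear by construction, the polynomial fits in this sketch should score about the same as the baseline. The abstract reports the same qualitative pattern for the real data.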
Asset Metadata
Creator
Islam, Muhammad Rayeed (author)
Core Title
Nonlinear modeling of the relationship between smoking and DNA methylation in the multi-ethnic cohort
School
Keck School of Medicine
Degree
Master of Science
Degree Program
Biostatistics
Degree Conferral Date
2024-08
Publication Date
07/11/2024
Defense Date
07/10/2024
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
Biostatistics,OAI-PMH Harvest
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Huang, Brian (committee chair), Lewinger, Juan Pablo (committee member), Pickering, Trevor (committee member)
Creator Email
islammuhammadr@gmail.com,muhammai@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113997L9O
Unique identifier
UC113997L9O
Identifier
etd-IslamMuham-13200.pdf (filename)
Legacy Identifier
etd-IslamMuham-13200
Document Type
Thesis
Rights
Islam, Muhammad Rayeed
Internet Media Type
application/pdf
Type
texts
Source
20240712-usctheses-batch-1179 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu