Identifying and Measuring Cell-Intrinsic and Cell-Extrinsic Factors Influencing Aging
by
Alan Albert Tomusiak
A Dissertation Presented To The
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOLOGY OF AGING)
May 2024
Copyright 2024 Alan Albert Tomusiak
ACKNOWLEDGEMENTS
Though the first page of this document solely lists my name, the work embedded herein would
not be possible without a vast network of current and past support. I would like to thank my
committee, consisting of Dr. Eric Verdin, Dr. Lisa Ellerby, Dr. Berenice Benayoun, Dr. David
Furman, and Dr. Dan Winer, for their willingness to share their thoughts and feedback on a
variety of projects that I have both conducted and considered attempting during my PhD. This
work is vastly better as a result of many conversations with them. I would also like to thank the
USC Biology of Aging PhD program administration, the Buck Institute, and my sources of
funding (including an NIA T32) for being deep pillars of support.
The Verdin lab has been simultaneously a warm and sharp intellectual home for me in the
past three years, as a comfortable environment where both positive and negative feedback is
encouraged. I would like to particularly thank Rebeccah Riley as the unbelievably competent and
deeply caring force that enables science to take place seamlessly. Herb, Ritesh, and Ryan –
sitting by the flow sorter at two in the morning listening to the hum of millions of cells flying
through laser beams and magnets is still one of the most magical memories from my past two
years, and it would not have been possible without your training. Sierra Lore – thank you for
being a phenomenal teammate and scientific sparring partner. You showed me how enjoyable
science could be when conducted as a team endeavor.
I am deeply grateful to the Buck community for creating an atmosphere of warmth and
openness to deep scientific dialogue. Conversations with Edward Anderton, Carlos Galicia,
Daria Timonina, Brendan Hughes, Josef Byrne, Cynthia Siebrand, and many more helped me
understand aging better on a fundamental level. Participating in the Graduate Student Society and
the Buck Student Aging Symposium were meaningful experiences for me that made me
appreciate the deeper impact that a close-knit community can have.
I would especially like to thank the mentors who have guided me, both past and present.
Each and every one of you took a chance on me, and for this I am eternally grateful. Henley
Sawicki held my life trajectory towards science steady even while life threw at me a cacophony
of difficulty. Dr. Brad Nelms encouraged patience and critical thinking at a time when I knew
neither. Dr. Robert Bao taught me how to step back and think outside of the box, and Dr. Marcin
Paduch reminded me of the importance of the human element in science. Most recently, I’d like
to extend my deepest gratitude to Dr. Eric Verdin – you played the critical role in shaping my
raw enthusiasm and excitement into a contribution to aging science. It’s been a joy being in your
lab.
Lastly, and perhaps most importantly, I would like to extend my appreciation to Ariel
Floro and Aeowynn Coakley. They have been steadfast companions through all manner of
adventures and tribulations. I am truly lucky that, in the midst of a PhD program undertaken to
understand aging, I have found the closest friends I could imagine.
In the following chapters are embedded sections of the bioRxiv preprint titled
“Development of a novel epigenetic clock resistant to changes in immune cell composition” by
Alan Tomusiak, Ariel Floro, Ritesh Tiwari, Rebeccah Riley, Hiroyuki Matsui, Nicolas Andrews,
Herbert G. Kasler, and Eric Verdin.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1: INTRODUCTION
INTRINSIC VS. EXTRINSIC AGING
DECLINE OF THE IMMUNE SYSTEM WITH AGE
DNA METHYLATION AND ITS ROLE IN DISEASE AND AGING
EPIGENETIC CLOCKS
EPIGENETIC CLOCKS AND BLOOD CELL COMPOSITION
SINGLE-CELL TRANSCRIPTOMIC CLOCKS
CHAPTER 2: MATERIALS AND METHODS
CHAPTER 3: UNDERSTANDING THE RELATIONSHIP BETWEEN EPIGENETIC CLOCKS AND BLOOD CELL TYPE COMPOSITION
ABSTRACT
RESULTS
Existing epigenetic clock age predictions depend on CD8+ T-cell differentiation state
Development of a novel epigenetic clock (IntrinClock) resistant to changes in immune cell composition
IntrinClock is accurate across tissues, and its age predictions are not affected by adaptive immune cell compositional changes
CHAPTER 4: DEVELOPMENT OF A NOVEL EPIGENETIC CLOCK THAT MEASURES AGING INDEPENDENTLY OF BLOOD CELL COMPOSITION
ABSTRACT
RESULTS
IntrinClock is enriched for sites near functional areas of the genome
IntrinClock sites are located near regulatory domains of cancer-related transcription factors
IntrinClock epigenetic age is accelerated in models of intrinsic hallmarks of aging and in HIV+ individuals
IntrinClock epigenetic age is affected by cellular reprogramming
CHAPTER 5: DEVELOPMENT OF A SINGLE CELL TRANSCRIPTOMIC BIOMARKER FOR T CELL AGING
ABSTRACT
RESULTS
Automated prediction of cell type recapitulates known changes in T cell composition with age
Cell type-dependent models are capable of predicting age across a variety of cell types
Cell type-dependent single cell age prediction models identify accelerated aging in autoimmune disorders such as multiple sclerosis
CHAPTER 6: CONCLUSIONS
REFERENCES
APPENDICES
SubsetClockFilter.R: Code used to subset CpGs for developing the IntrinClock
ClockDatabaseConstruction.R: Code used to generate the database that formed the basis for the IntrinClock
ClockDevelopment.R: Code used to generate the IntrinClock
CellTypePrediction.R: Code used to generate the cell type predictor for the single-cell transcriptomic dataset
AgePrediction.R: Code used to generate the age predictor for the single-cell transcriptomic dataset
MultipleModelConstructors.R: Code used to validate clock construction pipeline and to determine the optimal number of clusters to use for single cell age prediction
LIST OF TABLES
TABLE 1. LIST OF DATASETS USED FOR BUILDING AND/OR VALIDATING THE INTRINCLOCK.
TABLE 2. ANTIBODY MIX USED FOR SORTING T CELLS.
TABLE 3. ANTIBODY MIX USED FOR SORTING B CELLS, CD56DIM CD16+ NK CELLS, AND CLASSICAL MONOCYTES.
TABLE 4. ANTIBODY MIX USED FOR HIGH-DIMENSIONAL PHENOTYPING OF PERIPHERAL BLOOD MONONUCLEAR CELLS.
LIST OF FIGURES
FIGURE 1.1 CPG SITE CHANGES DURING T CELL DIFFERENTIATION
FIGURE 1.2 INTRINCLOCK DESIGN STRATEGY AND PERFORMANCE.
FIGURE 1.3 EPIGENETIC AGE ACCELERATIONS MEASURED BY DIFFERENT CLOCKS
SUPPLEMENTAL FIGURE 1.4 GATING STRATEGY FOR SORTING OF CD4+ SUBSETS AND CD8+ SUBSETS
SUPPLEMENTAL FIGURE 1.5 GATING STRATEGY FOR SORTING OF B-CELL SUBSETS, CD56DIM CD16+ NK CELLS, AND CLASSICAL MONOCYTES.
SUPPLEMENTAL FIGURE 1.6 GATING STRATEGY FOR HIGH-DIMENSIONAL FLOW CYTOMETRY
SUPPLEMENTAL FIGURE 1.7 INTRINCLOCK TRAINING AND VALIDATION.
SUPPLEMENTAL FIGURE 1.8 COMPARISON OF PREDICTION ACCURACY WITH ELASTIC NET PERFORMED ONCE OR REPEATED
SUPPLEMENTAL FIGURE 1.9 PREDICTED EPIGENETIC AGE ACCELERATION OF MATCHED-DONOR CELLS OF DIFFERENT CELL SUBSETS
FIGURE 2.1 DISTRIBUTIONS OF CPG POSITIONS
FIGURE 2.2 EFFECT OF DISEASES AND INTERVENTIONS ON INTRINCLOCK RESIDUALS
FIGURE 2.3 INTRINCLOCK AND CPG METHYLATION PATTERNS AND CLUSTERS
SUPPLEMENTAL FIGURE 2.4 ANALYSIS OF INTRINCLOCK CPG METHYLATION SITES BELONGING TO CLUSTERS 4, 5, AND 6
SUPPLEMENTAL FIGURE 2.5 EFFECT OF REPLICATIVE SENESCENCE IN FIBROBLASTS ON FIVE EPIGENETIC CLOCKS
SUPPLEMENTAL FIGURE 2.6 ANALYSIS OF CORRELATIONS OF INTRINCLOCK CPG SITE METHYLATION WITH TISSUES.
FIGURE 3.1 CELL TYPE PREDICTIONS RECAPITULATE KNOWN EFFECTS OF AGING ON THE IMMUNE SYSTEM
FIGURE 3.2 CELL TYPE INFORMATION IMPROVES AGE PREDICTION OF T CELL SUBSETS
SUPPLEMENTAL FIGURE 3.3 CORRELATION OF CHRONOLOGICAL AGE PREDICTION COMPARED TO NUMBER OF CELL CLUSTERS USED
SUPPLEMENTAL FIGURE 3.4 PREDICTED AGE RESIDUALS OF CELL-TYPE DEPENDENT SINGLE-CELL TRANSCRIPTOMIC CLOCK ACROSS AUTOIMMUNE DISEASES
ABSTRACT
Validated, reliable, and interpretable biomarkers are critical for understanding the aging process.
They are utilized both in the context of screening possible aging interventions in vitro and in
understanding clinically how the aging process can be affected in humans. One of the main
changes that occur during aging takes place in the immune system, where cell composition
shifts significantly towards more differentiated cells. This shift has an important impact on
biomarker measurements of aging, as it makes it difficult to determine whether longevity
interventions are affecting aging on a cellular level or changing cellular composition.
Here, I characterize how epigenetic biomarkers of aging are impacted by changes in cell
type composition. To address the potential issues with this association, I developed a novel
biomarker based on DNA methylation (an epigenetic clock) that is robust to changes in immune
cell composition. This new biomarker captures several aspects of intrinsic epigenetic changes,
including the increase of epigenetic age prediction due to senescence. As an alternative method
to study cell-intrinsic effects of aging, I also describe the creation of a single-cell transcriptomic
cell type and age predictor focusing on T cells. In total, this work assists in the effort to identify
and understand the fundamental process of aging via the utilization of biomarkers, on both a
cellular and organismal scale.
CHAPTER 1: INTRODUCTION
Intrinsic vs. Extrinsic Aging
Aging has been proposed to consist of several hallmarks, including genome instability,
epigenetic alterations, loss of proteostasis, altered intercellular communication, chronic
inflammation, and dysbiosis (López-Otín et al., 2023). However, among these hallmarks there is
a key distinction in that many of them change largely on a cellular level (the first three), while
others primarily impact the coordination of many cells, tissues, or organs (the last three). Many
of these hallmarks cause both intrinsic and extrinsic changes – cellular senescence, as an
example, can be triggered by cell intrinsic factors while leading to cell extrinsic effects on
systemic inflammation (Campisi, 2013). Understanding how intrinsic and extrinsic aspects of
dysfunction contribute to health and disease is critical for understanding which interventions
could be effective in treating or slowing the effects of aging.
A key example of the complex interplay between cellular and systemic factors is in the
immune system (Davis et al., 2017). Likely at least partially due to intrinsic factors, the naïve
CD8+ and CD4+ T cell compartments functionally decline over time (H. Zhang et al., 2023).
This dysfunction leads to increased pro-inflammatory signaling (X. Li et al., 2023), particularly
upon activation, which hampers the regeneration of higher-order systems such as tissues and
organs (de la Fuente et al., 2024). This complex interplay of cellular and systemic aging is
important, but our ability to assess how each of these independently contributes to
aging-associated pathology is presently limited.
Decline of the Immune System with Age
The aging of the immune system leads to significant negative effects on human well-being and
healthspan (Castelo-Branco & Soveral, 2014; Haynes, 2020; Müller et al., 2019; Nikolich-Žugich,
2018; Sadighi Akha, 2018; Weyand & Goronzy, 2016). Changes in immune system
function begin very early in life, with thymic involution beginning prior to the age of twenty-five
(Lynch et al., 2009). The effectiveness of vaccines against diseases decreases with age, as
effector responses weaken and the duration of immunological memory becomes shorter
(Gustafson et al., 2020). Age-related changes occur in cell subsets responsible for immune
surveillance, likely leading to a weakened response against cancer (Foster et al., 2011).
Pro-inflammatory markers become chronically elevated and contribute to multimorbidity, a condition
referred to as inflammaging (Ferrucci & Fabbri, 2018). Relatedly, the prevalence of
autoimmunity is significantly associated with age, leading to a wide variety of diseases and
maladies (Goronzy & Weyand, 2012). Most directly, an older immune system is less capable of
defending the host against infection, leading to higher mortality and worse symptoms (Bartleson
et al., 2021). This became particularly relevant during the COVID-19 pandemic,
as deaths caused by COVID-19 were much higher in older individuals (Bajaj et al., 2021; Zheng et
al., 2020). Understanding and preventing the decline of immunity with age would allow for the
elderly to live much happier, healthier, and longer lives.
In addition to the effects that an increasingly dysfunctional immune system has on an
organismal level, there is a complex direct and indirect molecular relationship between the
canonical “hallmarks of aging” (López-Otín et al., 2013, 2023) and immunity. Telomere
attrition, a phenomenon generally linked to aging, appears to specifically impact T cells by
inducing dysfunctional senescent and/or exhausted states (Bellon & Nicot, 2017). Aberrant
cellular communication is a particularly relevant aging hallmark to the immune system, as
precisely tuned activation is a key component of a functional immune response. Several
signaling pathways, including the programmed death-1 (PD1) pathway, are dysfunctional in aged
immune cells and lead to a prematurely exhausted and ineffective immune system (Lages et al.,
2010; Reitsema et al., 2020). Stem cell exhaustion plays a role via age-associated dysfunction of
the naïve cell reservoir, demonstrated via naïve T cell failure to complete differentiation
programs towards memory cells. Recent studies have demonstrated that at least some of this effect
is caused by diminished nutrient sensing pathways and dysregulated metabolism (Quinn et al.,
2019). Amelioration of canonical hallmarks of aging in immune cells would lead to a more
robust and improved systemic response against infection and disease.
Intriguingly, recent studies have suggested that the immune system may play a functional
or causal role in inducing systemic aging, as opposed to only being affected by it. T cells made
mitochondrially dysfunctional by a deficiency in a mitochondrial transcription factor were shown
to induce several aging-related features, including metabolic and cognitive alterations resembling
aging, and to cause premature death (Desdín-Micó et al., 2020). A year later, a group
demonstrated that a selective deletion of Ercc1, a DNA repair protein, in mouse hematopoietic
cells leads to early onset of senescence and damage in non-lymphoid tissues (Yousefzadeh et al.,
2021). In humans, these findings are reinforced by observations that supercentenarians have a
dramatic expansion of cytotoxic CD4+ cells (Hashimoto et al., 2019), which is ordinarily a rare
cell type. Combined with observations that CD8+ T cells appear to undergo a more significant
aging phenotype compared to CD4+ T cells (Weinberger et al., 2007), this suggests depletion of
immune cell populations that undergo more aging-related dysfunction could lead to improved
overall health and lifespan. Altogether, there is a distinct possibility that amelioration of aging
hallmarks in immune cells could lead not only to improved immunity, but also delayed aging
across the entire organism.
DNA Methylation and its Role in Disease and Aging
DNA methylation is an important epigenetic regulator of gene expression. Methylation of DNA
occurs most frequently at the 5th carbon position of cytosine, although in rarer instances
methylation of a nitrogen atom on adenine has been reported as well (K.-J. Wu, 2020; C.-L.
Xiao et al., 2018). The majority of DNA methylation occurs at a CpG site, defined as a cytosine
that immediately precedes a guanine (Moore et al., 2013). The enzymes responsible for
methylating DNA are DNMT1, DNMT3A, DNMT3B, and DNMT3L (though DNMT3L does
not have inherent enzymatic activity). DNMT1 is generally considered as a DNA methylation
“maintainer,” in that it sustains DNA methylation with replication, whereas DNMT3A and
DNMT3B apply de novo DNA methylation marks (B. Jin & Robertson, 2013). DNA methylation
marks can be removed by the ten eleven translocation (TET) enzymes (TET1, TET2, and TET3),
which remove DNA methylation via a series of enzymatic steps that converts methylated
cytosines to 5-carboxycytosine, which is then removed via base excision repair (Rasmussen &
Helin, 2016). The specific role of DNA methylation in terms of gene regulation is highly
context-specific, though it is frequently associated with gene repression when located within a
gene promoter. There are important exceptions, however, as exemplified by the OCT4
transcription factor preferentially binding to methylated sites (Yin et al., 2017). More globally,
DNA methylation plays important roles in imprinting (Elhamamsy, 2017) and 3D genome
organization (Buitrago et al., 2021).
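The definition of a CpG site given above (a cytosine immediately followed by a guanine on the same strand) can be made concrete with a short sketch. The function name and the example sequence below are illustrative assumptions and are not taken from the code in the appendices; the sketch only scans one strand of a plain DNA string.

```python
from typing import List


def find_cpg_sites(sequence: str) -> List[int]:
    """Return 0-based positions of every cytosine immediately followed by a guanine."""
    seq = sequence.upper()
    return [
        i
        for i in range(len(seq) - 1)
        if seq[i] == "C" and seq[i + 1] == "G"
    ]


# Hypothetical toy sequence, used only to illustrate the definition.
example_sequence = "ACGTTCGGCATCG"
print(find_cpg_sites(example_sequence))  # [1, 5, 11]
```

Note that "GC" is not reported, since the cytosine must precede the guanine when the strand is read in the 5' to 3' direction.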
As DNA methylation is linked closely to appropriate gene expression, defects in or dysregulation of
DNA methylation have been tied to a variety of diseases (Z. Jin & Liu, 2018; Robertson, 2005;
Salameh et al., 2020). The association between aberrant DNA methylation and cancer has been
extensively studied (Ehrlich, 2002), which has led to the development of drugs designed to target
epigenetic regulators as potential therapies (Cheng et al., 2019; Gnyszka et al., 2013). Though
frequently DNA methylation alterations are viewed as reinforcement mechanisms for cancerous
cells that have already developed, mutations in DNA methyltransferases that lead to
tumorigenesis have also been observed (W. Zhang & Xu, 2017). Many diseases related to a
dysfunctional immune system have been shown to be either caused by or exacerbated by misregulated DNA methylation, including lupus (Hedrich et al., 2017), rheumatoid arthritis (Cribbs
et al., 2015), and psoriasis (Chandra et al., 2018; Charras et al., 2021). Discovering ways in
which to harness the DNA methylation machinery of the cell in order to slow the progression or
onset of disease is an important ongoing area of research.
Changes in DNA methylation leading to negative effects on cellular function have
particular relevance in the context of aging. As early as 1987, it was shown that a global
hypomethylation of the genome occurs with age in many different species (Wilson et al., 1987).
Since many cancers also show an overall loss of methylation, there has been speculation that this
could be a functional driver of the increase in cancer incidence with age (Johnstone et al., 2022).
This is supported by observations showing clonal expansion in the hematopoietic system, in
which a small number of progenitor and stem cells outcompete others – an effect which has been
linked to increased risk of blood cancers (Genovese et al., 2014). A recent study showed this
clonal expansion led to dramatic effects, in which 30-60% of hematopoiesis was accounted for
by only 12-18 clones in subjects over the age of 75 (Mitchell et al., 2022). The two driver
mutations most heavily implicated in clonal hematopoiesis are TET2 and DNMT3A, both
enzymes directly responsible for modifying DNA methylation (Buscarlet et al., 2017). Thus,
finding ways of slowing age-associated changes in global DNA methylation is a promising
avenue for improving human healthspan and lifespan.
Epigenetic Clocks
As technologies to assess DNA methylation levels developed, the close connection between
DNA methylation changes and aging led to speculation as to whether methylation levels at prespecified genomic loci could be used to predict chronological age. This culminated in the
development of “epigenetic clocks,” highly accurate (R > .95; mean absolute error generally less
than six years) machine learning-based predictors of chronological age based on methylation
levels of a relatively small number of CpG sites (generally 3-500 out of >300,000 assayed). The
first generation of epigenetic clocks was purely trained to predict chronological age based on
samples from blood (Hannum et al., 2013) or a variety of tissues (Horvath, 2013). Interestingly,
despite not being designed to predict health status, these epigenetic clocks were found to predict
accelerated ages (relative to the individuals’ chronological age) in conditions and exposures such as
HIV infection (Horvath & Levine, 2015), obesity (Horvath et al., 2014), and smoking (X. Wu et al., 2019).
This led to the development of a second generation of epigenetic clocks, which were
specifically built to predict health metrics as a marker of “biological age” (Levine et al., 2018;
Lu et al., 2019). These clocks have been applied to a wide variety of contexts (Horvath et al.,
2016, 2018; Kabacik et al., 2022; Lowe et al., 2016; McCrory et al., 2021; Protsenko et al.,
2021), and are promising biomarkers for evaluating interventions for healthier aging. Most
recently, new models have been generated that seek to predict the pace of aging rather than the
accumulation of aging (Belsky et al., 2022), and the resulting metric is generally more sensitive
to interventions compared to previously-developed clocks (Belsky et al., n.d.).
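As a rough illustration of how a first-generation clock turns methylation levels into an age estimate, the sketch below assumes the clock is a plain linear model: an intercept plus a weighted sum of methylation beta values at the selected CpG sites. The site names, weights, and intercept are hypothetical placeholders, and any transformation of age that a particular published clock applies before fitting is omitted.

```python
from typing import Dict

# Hypothetical coefficients for illustration only; published clocks supply
# their own CpG identifiers, weights, and intercept.
CLOCK_INTERCEPT = 40.0
CLOCK_WEIGHTS: Dict[str, float] = {
    "cpg_site_a": 55.0,
    "cpg_site_b": -30.0,
    "cpg_site_c": 12.5,
}


def predict_epigenetic_age(betas: Dict[str, float]) -> float:
    """Intercept plus the weighted sum of methylation beta values (each in 0 to 1)."""
    missing = [site for site in CLOCK_WEIGHTS if site not in betas]
    if missing:
        raise KeyError(f"missing methylation values for: {missing}")
    return CLOCK_INTERCEPT + sum(
        weight * betas[site] for site, weight in CLOCK_WEIGHTS.items()
    )


def age_difference(betas: Dict[str, float], chronological_age: float) -> float:
    """Predicted epigenetic age minus chronological age (positive means 'older')."""
    return predict_epigenetic_age(betas) - chronological_age


# Toy sample with made-up beta values.
sample = {"cpg_site_a": 0.6, "cpg_site_b": 0.4, "cpg_site_c": 0.2}
print(predict_epigenetic_age(sample))
print(age_difference(sample, chronological_age=60.0))
```

The simple difference used here is only one way to express accelerated age; studies often report the residual from regressing predicted age on chronological age across a cohort instead.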
Due to the remarkably high accuracy of epigenetic clocks and their associations with
health metrics, there is significant interest in understanding how the CpG sites tracked by
epigenetic clocks predict age. However, this has proven to be a challenging area of research.
Recent preprints suggest that epigenetic clocks are a composite of different “modules,” each
tracking a different facet of aging and modifiable by separate sets of interventions, and that
epigenetic clocks differ in terms of the relative importance of each module (Levine et al., 2022).
Other authors have implicated progressive stochasticity of global DNA methylation with age as a
major driver of epigenetic clock age predictions (Schumacher & Meyer, 2023). Some novel
clocks have been developed specifically to track a particular type of aging biology that will lead
to more readily understandable findings, such as a recent clock tracking CpG sites predicted to
be causal in the aging process (Ying et al., 2022). One particular challenge with interpreting
epigenetic clocks stems from a lack of clarity as to whether they are tracking a cell-intrinsic
aspect of aging or a shift in cellular composition that occurs with age. A recent study identified T
and NK cell activation as a significant driver of epigenetic age predictions (Jonkman et al.,
2022), suggesting that they are at least partially tracking a cell-extrinsic signal. Shifts in
cellular composition can be tracked via nonfunctional cell-type markers, which do not necessarily
have an inherent functional role relevant to the aging process; this makes it more difficult to
understand the effect of intrinsic changes in DNA methylation on aging.
Epigenetic Clocks and Blood Cell Composition
One major challenge in understanding the mechanism(s) underlying epigenetic clocks is the
confounding effect of age-related changes in cell-type composition of many tissues. While
changes in cell-type composition are an important part of aging, they can make interpreting
epigenetic clocks more difficult as the relevant CpG sites may be cell-type-specific markers
rather than those affecting cell-intrinsic aging. Most epigenetic clocks are trained largely on
blood, which sees a drop in naïve CD8+ T cells with age and a corresponding increase in more
terminally differentiated memory T-cell types (Goronzy et al., 2015). Some clocks may be more
impacted by changes in cell-type composition than others, depending on how they were
constructed (Horvath & Raj, 2018). These challenges are not limited to epigenetic clocks;
measures of telomere length in whole blood as predictors of age have also been shown to be
linked to the proportion of naïve T-cells (Lin et al., 2015). Quite recently, T-cell and NK (natural
killer) cell activation have been implicated as major drivers in epigenetic clock progression
(Jonkman et al., 2022).
Other approaches have been explored to create epigenetic age predictions that are less
sensitive to changes in cell type composition. Most notably, residuals from regression models
that include epigenetic age and proportions of several blood cell types have been used to
generate an “intrinsic epigenetic age acceleration” measure (B. H. Chen et al., 2016). While the
resulting measure is cell-type independent, it becomes challenging to biologically interpret as the
underlying signal is derived from a mixture of CpG sites that can be either cell type-independent
or cell type-dependent. Other modern approaches include the development of single-cell
epigenetic clocks (Bonder et al., 2023; Trapp et al., 2021), though the underlying technology will
require further maturing before it can match the sensitivity and accuracy of bulk
measurement-based clocks.
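The residual-based approach described above can be sketched as follows, assuming the regression takes the form "epigenetic age ~ chronological age + cell-type proportions" and that the residual for each sample is read as its composition-adjusted age acceleration. The function names and toy numbers are hypothetical, and the ordinary least squares fit is written out with the normal equations purely to keep the example free of external libraries.

```python
from typing import List, Sequence


def solve_linear_system(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design; covariates may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def regression_residuals(
    y: Sequence[float], covariates: Sequence[Sequence[float]]
) -> List[float]:
    """Residuals of y after an OLS fit on an intercept plus the given covariates."""
    design = [[1.0] + list(row) for row in covariates]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    coef = solve_linear_system(xtx, xty)
    return [yi - sum(c * v for c, v in zip(coef, r)) for r, yi in zip(design, y)]


# Made-up data: each row is (chronological age, naive CD8+ T cell proportion).
covariates = [(30, 0.30), (45, 0.22), (60, 0.15), (75, 0.08), (50, 0.25)]
epigenetic_age = [28.0, 47.5, 63.0, 74.0, 49.0]
print(regression_residuals(epigenetic_age, covariates))
```

If a full set of cell-type proportions that sum to one is used, one proportion has to be left out, since the complete set is collinear with the intercept.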
Single-Cell Transcriptomic Clocks
One major advancement in the realm of biomarkers of aging has been the development of single-cell
transcriptomic clocks. A transcriptomic clock yields more interpretable biomarkers, as gene function
is more readily understood than the complex role of individual CpG sites that underlie epigenetic
clocks. Reading age at the level of individual cells allows for improved understanding of how aging
impacts individual cell types differently. The
ability to pool many cells together also improves the power of in vitro screens considerably, as
current biomarkers of aging require hundreds of thousands of cells. Lastly, it allows for the
simultaneous and partially de-coupled understanding of how intrinsic and extrinsic factors
influence the aging process.
The first single-cell transcriptomic clock was developed in the context of the mouse
brain, where pseudo-bulk aging measures were used to better understand the rejuvenative effects
of exercise (Buckley et al., 2023). More recently, a technical improvement (binarization) was
shown to improve the classification of cells as old or young, again in
the context of the mouse brain (Yu et al., 2023). To date, no single-cell transcriptomic biomarker
has been reported in the context of human cells. The development of such a biomarker would be
helpful in understanding the aging of the immune system on a cellular level.
CHAPTER 2: MATERIALS AND METHODS
Ethics approval. NIH provided approval for use of phs000424/GRU (GTEx) age data via the
dbGaP database approval system. Ethics approval was not required for other datasets generated.
Immune cell isolation, sorting, and DNA extraction. PBMCs were extracted from
leukapheresis chambers from CMV+ donors. Donors were volunteers who donated plasma at a
blood donation center in San Francisco after passing a health screening. Blood was first diluted
1:1 with PBS with 2% FBS. Diluted blood was slowly layered on top of 12 mL of Ficoll in a 50-mL
Falcon conical tube. The tube was then centrifuged for 30 minutes at 2000 rpm at 21°C
without applying a brake. The layer containing white blood cells was removed, diluted with
FBS-supplemented PBS, and centrifuged for 3 minutes at 2500 rpm. The cell pellet was resuspended in 15 mL of ACK lysis buffer and incubated for 3 minutes. The cells were topped up
with PBS with 2% FBS, centrifuged, and resuspended.
For the initial CD8+ epigenetic clock characterization experiment, an EasySep™ Human T Cell
Isolation kit was used to extract T cells from the PBMC fraction. T cells were then washed,
stained with 1:500 LIVE/DEAD™ Fixable Near-IR Dead Cell staining kit, washed, stained with
an antibody cocktail (Table 2), and washed again. FACS was performed on a BD FACSAria™ II
instrument. DNA was isolated using a Zymo Quick-DNA™ Microprep Plus kit.
For the second comprehensive immune cell-sorting experiment, 2 million PBMCs were frozen
immediately after extraction. The remaining cells were then positively selected for a CD4
fraction using the EasySep™ Human CD4 Positive Selection Kit II. The CD4 cells were stained
with 1:500 LIVE/DEAD™ Fixable Near-IR Dead Cell staining kit, washed, and stained with a
CD4/CD8 antibody cocktail (Table 2), and the remaining cells were positively selected for a
CD8 fraction using the EasySep™ Human CD8 Positive Selection Kit II. Both CD8+ cells and
remaining PBMCs were washed, stained with 1:500 LIVE/DEAD™ Fixable Near-IR Dead Cell
staining kit and washed again. CD8+ cells were stained with a CD4/CD8 antibody cocktail
(Table 2), and the remaining PBMCs were stained with a B Cell/NK Cell/Monocyte antibody
cocktail (Table 3), after blocking with human IgG. The gating strategy for T cell sorting is shown
in Supplementary Figure 1.4, and the gating strategy for remaining PBMCs is described in
Supplementary Figure 1.5. All three fractions were then subjected to FACS analysis using a BD
FACSAria™ II instrument. DNA was isolated using a Zymo Quick-DNA/RNA™ Microprep
Plus kit.
For both experiments, DNA was quantified using Qubit™ HS dsDNA quantification reagents.
Bisulfite conversion and DNA methylation assessment were performed by Diagenode. For all
experiments involving FACS, post-sort validations were performed to verify cell sort purity by
analyzing sorted populations via flow cytometry. The Clock Foundation assisted with facilitating
DNA methylation assessment and data transfer for the initial CD8+ experiment.
High-dimensional flow cytometry. PBMCs were transferred to a 96-well V-bottom plate. Cells
were re-suspended in a 1:500 dilution of LIVE/DEAD™ Fixable Blue Dead Cell Stain kit in
cold PBS and incubated for 30 minutes in the dark. Cells were then washed and blocked with
human IgG for 30 minutes. They were then washed twice and stained with a PBMC phenotyping
antibody cocktail (Table 4). Cell phenotyping was performed on a Cytek Aurora™ instrument
and analyzed using FlowJo™. The flow cytometry analysis strategy is described in
Supplementary Figure 1.6.
DNA methylation analysis and pre-processing. .idat files were converted into beta values by
using the minfi R package (Aryee et al., 2014), with a functional normalization pre-processing
step (Fortin et al., 2014). For differential methylation analyses, beta values were converted
to M-values through the formula M = log2(B / (1-B)). The R package umap was used for UMAP
dimensionality reduction (Konopka, 2023).
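The beta-to-M-value conversion stated above can be illustrated with a minimal Python sketch. This is not the code used in this work (the analysis was performed in R); the function names and the clamping constant eps are assumptions introduced here for illustration only.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a methylation beta value in [0, 1] to an M-value: M = log2(B / (1 - B))."""
    # Clamp away from exactly 0 and 1 so the logarithm stays finite.
    # The value of eps is an illustrative assumption, not taken from the text.
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Inverse transform: B = 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)
```

A beta value of 0.5 maps to an M-value of 0, and beta values near 0 or 1 are stretched toward large negative or positive M-values, which is why M-values are often preferred for differential analyses.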
Dataset collection and pre-processing. All datasets used to build the novel epigenetic clock
were either generated in this study or downloaded from GEO. Exact ages were obtained for
GTEx data through dbGaP (The GTEx Consortium, 2020), as exact chronological ages of
tissues were required. For constructing the clock, the assembled database of DNA methylation
data was first culled of any samples that had more than 10% of CpGs missing and of any CpGs
that had more than 10% samples missing. All samples derived from cancer tissues were
removed. To ensure forward compatibility, we filtered out CpGs that were not on the Infinium
MethylationEPIC v2.0 array. Based on our CD8+ DNA methylation data, we tested the
correlation of each CpG methylation with age and with naive CD8+ T cells. To assess whether
CpGs were correlated with naive CD8+ cells, we binarized each naive sample as “1” and each
non-naive (CM, EM, TEMRA) sample as “0” and then used the R cor function to compute a
Spearman’s correlation between methylation and naive T-cell state. All CpGs with an absolute
value correlation of .3 or greater with naive T-cell state were removed, and all CpGs with an
absolute value correlation of .3 or less with age were removed.
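The CpG filtering rule described above can be sketched as follows. This is a hedged Python illustration, not the original R code: the helper names are hypothetical, the use of Spearman's correlation for the age comparison is an assumption (the text specifies Spearman only for naive T-cell state), and treating a constant input as uncorrelated is likewise an assumption.

```python
def ranks(values):
    """Return 1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant (assumption)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def keep_cpg(betas, is_naive, ages, threshold=0.3):
    """Keep a CpG only if it tracks age but not naive CD8+ T-cell state.

    betas    -- methylation values of one CpG across the sorted CD8+ samples
    is_naive -- 1 for naive samples, 0 for CM/EM/TEMRA samples
    ages     -- donor ages for the same samples
    """
    naive_r = spearman(betas, is_naive)
    age_r = spearman(betas, ages)
    # Removed if |r_naive| >= threshold or |r_age| <= threshold.
    return abs(naive_r) < threshold and abs(age_r) > threshold
```

In this sketch, a CpG whose methylation rises steadily with donor age and is balanced across naive and non-naive samples is retained, whereas a CpG that separates naive from memory cells is discarded regardless of its age association.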
Once CpGs and samples were filtered, the samples were split 75% for the training set and 25%
for the test set. Imputation of missing values was performed separately for the training and test
sets, and separately for each tissue within each set (imputation
performed using the impute R package (Hastie et al., 2023)). Outliers were detected and removed
using the outlyx function in the wateRmelon R package (Schalkwyk et al., 2023). Untransformed
beta values were used for model creation and age prediction. Prior to training the model, ages
were transformed using the formula from Horvath’s original epigenetic clock (Horvath, 2013).
An elastic net model using glmnet (Friedman et al., 2022) was used to develop the IntrinClock,
with the alpha value set at .5. Once the first model was generated, the training data were subset to
only those CpGs with non-zero coefficients, which were then used for training the final model. The
regularization parameter for both elastic net models was selected using cross-validation
(cv.glmnet() function) with ten folds.
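For readers unfamiliar with the age transformation, the following Python sketch shows the log-linear form published with the original Horvath clock, assuming the adult-age constant of 20 years used there. It is an illustration rather than the code used to train the IntrinClock, and the function names are hypothetical.

```python
import math

ADULT_AGE = 20  # adult-age constant from Horvath (2013); assumed to apply here


def transform_age(age, adult_age=ADULT_AGE):
    """Logarithmic below adult_age, linear above it; continuous at adult_age."""
    if age <= adult_age:
        return math.log(age + 1) - math.log(adult_age + 1)
    return (age - adult_age) / (adult_age + 1)


def inverse_transform_age(y, adult_age=ADULT_AGE):
    """Map a model output on the transformed scale back to age in years."""
    if y <= 0:
        return (adult_age + 1) * math.exp(y) - 1
    return y * (adult_age + 1) + adult_age
```

The model is trained against transform_age(age), and predictions are mapped back to years with inverse_transform_age, so that rapid methylation change during development does not dominate the fit for adults.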
Statistical methods. For comparisons between two measurements from one individual, as in
Figures 1B and 1C, paired t-tests were used for assessment of significant changes. For multiple
comparisons between a group and a background reference, as in Figures 4A, 4B, and 4C, one-sample proportion tests using the prop.test function from the R stats package were utilized with
Bonferroni multiple-comparisons correction. For samples of multiple measurements, t-tests with
Bonferroni multiple-comparisons corrections were used to test significance. Most graphs and
figures were created with the aid of the ggplot2 R package (Wickham et al., 2022).
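As a brief illustration of the Bonferroni correction referenced above, each p-value is multiplied by the number of tests performed and capped at 1. The Python sketch below uses hypothetical function names and a default significance level of .05, which is an assumption for illustration.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def significant(p_values, alpha=0.05):
    """Flag tests whose Bonferroni-adjusted p-value falls below alpha."""
    return [p_adj < alpha for p_adj in bonferroni_adjust(p_values)]
```

With three tests, a raw p-value of .04 becomes .12 after adjustment and is no longer considered significant at the .05 level.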
Motif enrichment and pattern analyses. For motif enrichment analysis, the HOMER software
tool was utilized (Heinz et al., 2010). To define sequences of interest, we investigated 40-bp
windows surrounding the 381 CpG sites that compose the IntrinClock. As a background, we
investigated 40-bp windows around CpG sites in our dataset immediately before removal of CpG
sites associated with naive CD8+ cells and those not associated with aging. For investigating
patterns of IntrinClock CpG shifts with age, beta values from blood samples were converted to
M values, after which the degPatterns function from the DEGreport R package (Pantano et al.,
2023) was utilized. Patterns with fewer than 10 CpG sites were discarded from analysis. Ages
were binned into groups of 10 (0-10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81+). Each
bin was confirmed to have at least 100 samples.
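The age binning and the minimum-sample check can be sketched in Python as follows. The function names are hypothetical, and the handling of fractional ages (assigned to the bin whose upper bound they do not exceed) is an assumption, since the text lists only integer bin edges.

```python
import math
from collections import Counter


def age_bin(age):
    """Assign an age in years to one of the bins 0-10, 11-20, ..., 71-80, 81+."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age > 80:
        return "81+"
    if age <= 10:
        return "0-10"
    # Ages in (10, 20] map to 11-20, ages in (20, 30] map to 21-30, and so on.
    lower = ((math.ceil(age) - 1) // 10) * 10 + 1
    return f"{lower}-{lower + 9}"


def underpopulated_bins(ages, min_samples=100):
    """Return the labels of observed bins with fewer than min_samples samples.

    Bins with no samples at all do not appear in the result.
    """
    counts = Counter(age_bin(a) for a in ages)
    return sorted(label for label, n in counts.items() if n < min_samples)
```

An empty list from underpopulated_bins corresponds to the check that every observed bin holds at least 100 samples.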
Epigenetic age acceleration analysis. To compute DNA methylation age for each epigenetic
clock, the R methylclock package was utilized (Pelegri-Siso & Gonzalez, 2023). For experiments
containing a limited number of donors or cell types, epigenetic age acceleration was defined as
the difference between epigenetic age prediction and chronological age. For larger studies,
epigenetic age acceleration was defined as the residual after regressing predicted epigenetic age
on chronological age. To analyze associations between epigenetic clock residuals and predicted
immune cell proportions, the estimateCellProp function from the ENmix (Z. Xu et al., 2023)
package was utilized along with the corrplot (Wei, n.d.) package for plotting.
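The two definitions of epigenetic age acceleration given above differ in how they treat systematic bias in a clock. The Python sketch below illustrates both; it is not the original R code, the function names are hypothetical, and the residual version assumes an ordinary least-squares fit of predicted age on chronological age.

```python
def difference_acceleration(predicted, chronological):
    """Acceleration as predicted epigenetic age minus chronological age."""
    return [p - c for p, c in zip(predicted, chronological)]


def residual_acceleration(predicted, chronological):
    """Acceleration as the residual of regressing predicted age on chronological age."""
    n = len(predicted)
    mean_c = sum(chronological) / n
    mean_p = sum(predicted) / n
    sxx = sum((c - mean_c) ** 2 for c in chronological)
    sxy = sum((c - mean_c) * (p - mean_p) for c, p in zip(chronological, predicted))
    slope = sxy / sxx
    intercept = mean_p - slope * mean_c
    return [p - (intercept + slope * c) for p, c in zip(predicted, chronological)]
```

If a clock overestimates every sample by five years, the difference definition reports an acceleration of five for each sample, whereas the residual definition reports zero, because the constant offset is absorbed by the regression intercept. The residual definition requires enough samples to fit a stable regression line, which is consistent with reserving it for larger studies.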
CHAPTER 3: UNDERSTANDING THE RELATIONSHIP BETWEEN
EPIGENETIC CLOCKS AND BLOOD CELL TYPE COMPOSITION
ABSTRACT
In this work, we report our analysis of the differences in epigenetic age predictions derived from
four epigenetic clocks (Hannum (Hannum et al., 2013), Horvath (Horvath, 2013), Horvath Skin
and Blood (Horvath et al., 2018), and PhenoAge (Levine et al., 2018)) for cytotoxic CD8+ T cells
at different stages of differentiation. We found that human naïve CD8+ T cells, which decrease in
humans during aging, exhibit an epigenetic age 15–20 years younger than effector memory CD8+
T cells isolated from the same individual. Interestingly, naïve T cells isolated from individuals of
different ages still show a progressive increase in epigenetic age. Based on these observations,
which indicate, as predicted, that current epigenetic clocks measure two independent variables,
aging and immune cell composition, we created a new clock, the IntrinClock, that does not
change among 10 immune cell types tested.
RESULTS
Existing epigenetic clock age predictions depend on CD8+ T-cell differentiation state.
In humans, CD8+ T cells decrease in frequency, with a particularly pronounced loss of naive T
cells during aging (Lazuardi et al., 2005). We used a negative bead-based selection method to
isolate total T cells from seven donors (six men and one woman) of varying ages, all of whom
were positive for cytomegalovirus (CMV+). We then used FACS to isolate CD8+ naive (CD8+
CD28+ CD45RO-), CD8+ central memory (CD8+ CD28+ CD45RO+), CD8+ effector memory
(CD8+ CD28- CD45RO+), and CD8+ terminal effector memory RA+ (CD8+ CD28- CD45RO-)
cells (Figure 1.1A). After DNA isolation and profiling using the Illumina Infinium
MethylationEPIC™ platform, we noted a distinct clustering of CD8+ naive cells away from
CD8+ central memory (CM), effector memory (EM), and terminal effector memory RA+ cells
(TEMRA) (Figure 1.1B) in UMAP analysis. Horvath clock epigenetic ages were measured in
each of the CD8 T-cell subsets and found to correlate with age across every subset. However,
strikingly, naive T cells consistently showed a significantly younger epigenetic age than other
CD8+ subsets (Figure 1.1C). This result suggests that epigenetic clock measurements are
affected by CD8+ T-cell differentiation. Equally interestingly, naive CD8+ T cells from
individuals of different chronological age showed an increase in epigenetic age that was parallel
to chronological age but consistently lower than the chronological age (Figure 1.1C). The same
observation was made for CMs, EMs and TEMRAs except that these cells’ epigenetic age
appeared closer to the chronological age of the donors.
Next, using differential methylation analysis on methylation M-values, we identified
22,963 CpGs that changed with age and 370,383 CpGs that changed between naive CD8+ T cells
and CD8+ CM, CD8+ EM, or CD8+ TEMRA cells. Of the 22,963 aging-related CpGs, 9,992
were also affected by differentiation (Figure 1.1D). To understand how this could affect
epigenetic clock predictions, we investigated the proportion of CpG sites used for epigenetic age
prediction in the Hannum, Horvath, Horvath Skin and Blood, and PhenoAge clocks that we
identified were affected by CD8+ T-cell differentiation. In all four clocks, more than a third of
the predictive sites were changed with differentiation (Figure 1.1E), and all four had a difference
in age acceleration for CD8+ T-cell subsets. In all clocks, CD8+ TEMRA and CD8+ EM cells
were predicted to be older than CD8+ CM cells, which were predicted to be older than CD8+
naive cells (Figures 1F-1I). The differences in epigenetic ages among the CD8+ T-cell subsets
varied among clocks. For example, PhenoAge predicts CD8+ naive cells to be over 60 years
younger than the donor chronological age, but the difference was much smaller for both Horvath
clocks with an epigenetic age prediction of only approximately 12 years lower than
chronological age (Figures 1F - 1I).
Figure 1.1 CpG site changes during T cell differentiation
a, Experimental design for determining impact of CD8+ differentiation on epigenetic clock age
prediction. b, UMAP dimensionality reduction of CD8+ DNA methylation profiles. c,
Differences between predicted epigenetic age as a function of donor age and CD8+ T-cell subset.
d, Comparison of shared CpG site changes between age in CD8+ T cells and CD8+ cell subset. e,
Percent of sites in four epigenetic clocks that are altered by CD8+ T-cell differentiation. f-i,
Comparison of the (f) Hannum, (g) Horvath, (h) Horvath skin and blood, and (i) PhenoAge
epigenetic age acceleration predictions for four CD8+ T-cell subsets. *** ANOVA p-value less
than .001.
Development of a novel epigenetic clock (IntrinClock) resistant to changes in immune cell
composition.
Given the overlap of DNA methylation signatures of cellular aging and CD8+ differentiation, we
sought to create a new epigenetic clock that is unaffected by changes in immune cell
composition. We began by generating a database of 14,601 DNA methylation samples from 71
different datasets, generated on either the Illumina Infinium™ HumanMethylation450
(450K) or the Illumina Infinium™ MethylationEPIC (EPIC) array, all sourced from the Gene
Expression Omnibus (GEO) database or the Genotype-Tissue Expression project (GTEx) (Table
1). The number of samples per dataset ranged from six to 1,218, with a mean number of samples
per dataset of 213 (Figure 1.4A). The distribution of sexes was approximately equal (Figure
S1.7B). Samples were derived from a variety of tissues with the majority from blood (Figure
S1.7C), and the DNA methylation assay platform was split roughly evenly between the 450K
and the EPIC array (Figure S1.7D).
Once the database of samples was assembled, we performed a series of filtering and
quality control steps. We filtered out all samples that were missing more than 10% of CpG sites
measured by the 450K array, those that were derived from cancerous tissue, and those that were
derived from germline tissues. We then removed outliers, defining outliers as those with
principal components more than two interquartile ranges away from the mean (Figure 1.2B).
After performing a random 75-25 training/test split, 9104 samples were used to train the model
and 2994 were used to validate it.
Given the unique methylation pattern (Figure 1.1B) and quiescent biology (Figure S1.7A)
(Bennett et al., 2020) of naive CD8+ T cells, we aimed to use them as a basis on which to
eliminate CpGs linked to CD8+ T-cell differentiation and performed additional filtering steps.
When constructing our database of DNA methylation data, we initially collected all CpG sites
measured by the 450K array for all samples. To increase reliability, we first filtered out CpG
sites that were present in fewer than 90 percent of samples. To ensure forward compatibility, we
also included only CpG sites that were present on the Illumina Infinium™ MethylationEPIC v2.0
array. Next, we opted to remove any CpG sites that were correlated with a sample being a naive
CD8+ sample (|R| ≥ .3) within our CD8+ subset data (i.e., CpG sites whose methylation patterns
were distinct in CD8+ naive cells as compared to CD8+ CM/EM/TEMRA cells). We also opted to
include only those CpG sites correlated with age (|R| > .3) (Figure 1.2C), to decrease the search
space for the elastic net algorithm to identify age-predictive sites. Interestingly, we observed a
negative correlation-of-correlations between the age correlation and naive CD8+ correlation of
CpG sites (R = -.45) (Figure 1.2C), indicating that CpG sites that are hypermethylated with age
tend to be hypomethylated in naive CD8+ cells, and vice-versa.
We utilized the elastic net algorithm on the remaining 55,896 CpGs to generate a new
epigenetic clock based on 410 CpG sites. To increase accuracy and reduce the number of
necessary prediction sites, we used a novel approach whereby we employed the elastic net
algorithm a second time on the training data filtered only on the 410 CpG sites used for the
clock. This reduced the number of predictive CpG sites in the final model (IntrinClock) to 381,
and reduced error by approximately 3 months (Figure S1.8). We validated that a similar degree
of improvement in predictive accuracy could not be obtained by tuning the alpha parameter
(Figure S1.8C).
Figure 1.2 IntrinClock design strategy and performance.
a, Filtering strategy for CpG sites. b, Filtering strategy for samples. c, Visualization of the
filtering process for differentiation-independent age-related CpGs. Blue CpGs (those correlated
with age but not with being a naive cell) were included in the feature set, whereas gray CpGs
were not. Green dashed line indicates linear least-squares regression line of relationship between
CpG age correlation and CpG CD8+ naive cell correlation. d, Correlation between age and
IntrinClock predicted age in a variety of tissues from the test set. e-h, Individual correlation plots
for specific tissues in the test set. i, Epigenetic age vs. chronological age correlation plot for
semen samples.
IntrinClock is accurate across tissues, and its age predictions are not affected by adaptive
immune cell compositional changes.
Next, we tested the IntrinClock on a variety of tissues in the test set and observed high overall
prediction accuracy (R ~ .972, mean absolute error (MAE) ~ 3.83) (Figure 1.2D). Age prediction
errors on blood and saliva were particularly low (MAE ~ 3.25, MAE ~ 3.21, respectively)
(Figure 1.2E, 2G). Tissues with less immune infiltration also had high epigenetic age correlations
with chronological age (R ~ .944 for brain, R ~ .841 for skin). We were interested in discovering
whether the IntrinClock would predict chronological age in semen samples, as previous
epigenetic clocks have shown significant age deceleration in sperm (Horvath, 2013). We found
that epigenetic age predictions of semen had only a weak correlation with chronological age (R ~
.32), and the predicted age of sperm samples, using a previously generated dataset (Jenkins et al.,
2022), appears to consistently be ~ 12 (Figure 1.2I).
Importantly and as expected, IntrinClock applied to our generated CD8+ DNA
methylation data showed no epigenetic age prediction differences among CD8+ T-cell subsets
(Figure 1.3A). As these samples were included in the training set for clock construction, we
validated our approach on two external datasets (Rodriguez et al., 2017; Schlums et al., 2015)
with CD8+ naive and CD8+ EM DNA methylation data and found no differences in epigenetic
age (paired t-test p-value > .05) (Figure 1.3B). We also tested whether our clock could find a
shift in epigenetic age between CD4+ naive and CD4+ CM cells, as the proportion of CD4+ naive
cells also decreases with age (M. Li et al., 2019). Using two external data sets (Garaud et al.,
2017; Pitaksalee et al., 2020), we discovered no evidence for a shift in epigenetic age between
CD4+ naive and CM cells (Figure 1.3C) (paired t-test p-value > .05), despite our filtering strategy
being based only on CD8+ cells.
We also tested whether the IntrinClock would be similarly unperturbed in other immune
cell types, particularly naive and memory B cells, which change in frequency with age (Chong et
al., 2005). We sorted CD8+ naive (CD8+CD28+CD45RO-), CD8+ CM (CD8+CD28+CD45RO+),
CD8+ combined EM/TEMRA (CD8+CD28-), CD4+ naive (CD4+CD28+CD45RO-), CD4+ CM
(CD4+CD28+CD45RO+), B-cell naive (CD3-CD19+CD27-IgD+), class-switched B cells
(CD3-CD19+CD27+IgD-), CD16+CD56dim NK cells (CD3-CD19-CD56dimCD16+), classical monocytes
(CD3-CD19-HLADR+CD14+CD16dim), and whole peripheral blood mononuclear cell (PBMC)
samples from a separate set of nine donors (five women, four men) aged 30–68 and collected
DNA for methylation analysis. To increase cell recovery, we performed two sequential rounds of
positive selection for CD8+ and then CD4+ cells using magnetic enrichment kits prior to flow
sorting, similar to a published strategy (Roy et al., 2021). Concurrently, we analyzed the PBMC
samples using high-parameter spectral flow cytometry to empirically determine whether changes
in immune cell composition of the PBMC samples would impact predicted epigenetic age of the
whole PBMC fraction.
As predicted, we found no evidence for an association between cell subset and epigenetic
age prediction (Figure 1.3D) or between cell subset and epigenetic age acceleration (ANOVA p-value > .05) (Figure S1.9A). This remained consistent whether epigenetic age acceleration was
defined as the difference between predicted age and chronological age or as the residual after
regressing predicted epigenetic age on chronological age. In contrast, cell subset and epigenetic
age acceleration were significantly correlated, according to the Hannum (Figure S1.9B), Horvath
(Figure S1.9C), Horvath Skin and Blood (Figure S1.9D), and PhenoAge (Figure S1.9E) clocks.
To further investigate how resistant IntrinClock is to the change in immune cell composition, we
analyzed the correlation between the PBMC epigenetic age and percentage of several PBMC
subsets. As expected, we identified no significant relationship between the PBMC epigenetic age
acceleration and percentage of CD8+ EM cells (Figure 1.3E), CD4+ CM cells (Figure 1.3F),
class-switched B cells (Figure 1.3G), CD16+ CD56dim NK cells (Figure 1.3H), or classical
monocytes (Figure 1.3I), relative to their parent populations (Pearson’s correlation p-value >
.05). Combined with our observations of the IntrinClock’s high accuracy across many tissues,
including different cell types in the brain (Figure S1.9F-G), these observations indicate that shifts
in immune cell composition do not impact IntrinClock age predictions.
Figure 1.3 Epigenetic age accelerations measured by different clocks.
a, Differences in epigenetic age accelerations in different CD8+ subsets generated in this study.
Horvath clock predictions overlaid in light gray. b, Epigenetic ages of CD8+ naive cells and
effector memory cells, based on data from GSE66564 and GSE83156. c, Epigenetic ages of
CD4+ naive cells and central memory cells, based on data from GSE121192 and GSE71825. d,
Epigenetic ages of PBMCs, CD8+ naive, CD8+ central memory, CD8+ combined effector and
TEMRA, CD4+ naive, CD4+ central memory, B-cell naive, B-cell switched memory,
CD16+CD56dim NK, and classical monocyte cells. e-i, Association of percentage of e, effector
memory CD8+ cells, f, central memory CD4+ cells, g, class-switched B cells, h, CD16+ CD56dim
NK cells, and i, classical monocytes with epigenetic age acceleration.
Supplemental Figure 1.4 Gating strategy for sorting of CD4+ subsets and CD8+ subsets
Supplemental Figure 1.5 Gating strategy for sorting of B-cell subsets, CD56dim CD16+ NK
cells, and classical monocytes.
Supplemental Figure 1.6 Gating strategy for high-dimensional flow cytometry
Supplemental Figure 1.7 IntrinClock training and validation.
a, Number of samples used for IntrinClock training and validation per dataset. Dotted line
represents mean number of samples per dataset. b, Sex distribution, c, tissue distribution, and d,
Illumina chip distribution for samples used in IntrinClock training and validation.
Supplemental Figure 1.8 Comparison of prediction accuracy with elastic net performed once or
repeated
a, Prediction accuracy and precision for the test set after training using one cycle of elastic net. b,
Prediction accuracy and precision for the test set after training using once-repeated elastic net. c,
Mean absolute error of age prediction on training set after one elastic net cycle given several
tested alpha parameters. Repeated elastic net MAE shown with dotted line. d, Correlation of
age prediction with chronological age in the training set after one elastic net cycle given several
tested alpha parameters. Repeated elastic net correlation shown with dotted line.
Supplemental Figure 1.9 Predicted epigenetic age acceleration of matched-donor cells of
different cell subsets
Epigenetic age acceleration of PBMC, CD8+ naive, CD8+ CM, CD8+ EM/TEMRA, CD4+ naive,
CD4+ CM, B-cell naive, B-cell switched memory, NK, and classical monocyte samples by the a,
IntrinClock, b, Hannum, c, Horvath, d, Horvath Skin and Blood, and e, PhenoAge epigenetic
clocks. Samples derived from N = 9 donors. *** ANOVA p-value less than .001. Error bars
shown as mean ± standard error. f, Prediction accuracy and precision for isolated neurons. DNA
methylation data obtained from study GSE112525. g, Prediction accuracy and precision for
isolated microglia. DNA methylation data obtained from study GSE191200.
CHAPTER 4: DEVELOPMENT OF A NOVEL EPIGENETIC CLOCK THAT
MEASURES AGING INDEPENDENTLY OF BLOOD CELL COMPOSITION
ABSTRACT
With the development of a cell-type independent epigenetic clock, we sought to understand what
it informs us regarding the natural aging process. We observe there are two CpG methylation
patterns embedded within it – one driven primarily by development and the other by aging. We
identify that this clock shows an increase in a model of replicative senescence in vitro and shows
decreased aging during OSKM reprogramming. We show that chronic disease, such as HIV,
increases cell-intrinsic epigenetic age while acute disease does not. We also find that the
IntrinClock CpG sites may have a functional role, as they are enriched in regulatory regions of
genes.
RESULTS
IntrinClock is enriched for sites near functional areas of the genome
IntrinClock is highly enriched for CpG sites upstream of transcription start sites, and its sites are
enriched for motifs whose TFs are implicated in cancer. One central challenge in understanding
epigenetic clocks comes from a lack of knowledge regarding to what extent epigenetic clocks are
tracking a cell-autonomous or, conversely, a cell-ensemble phenomenon (Bell et al., 2019). Our
data provide evidence that current epigenetic clocks represent a composite of at least two
variables, change in DNA methylation associated with aging in a cell intrinsic manner
(IntrinClock), and a change in cell composition associated with aging. Due to the IntrinClock’s
resistance to changes in immune cell composition, the CpG sites that constitute the clock may
have more readily interpretable cell-autonomous biology as they are less likely to track markers
of changing immune cell composition. This prediction could be particularly helpful in the
context of identifying a functional or causal relationship between epigenetic clock sites and
aging. We found that the sites in the IntrinClock that are hypermethylated with age are enriched
within the region 200–1500 bp upstream of gene transcription start sites, and correspondingly
strongly depleted in sites distant from genes (25% vs. 15%) (Figure 1.4A). In sites that are
hypomethylated with age, there was a significant enrichment within the first exon of genes (8%
vs. 5%) (Figure 1.4B). DNA methylation changes within 1500 bp of the transcription start site
are most closely linked to alterations in gene expression (Schlosberg et al., 2017). Similarly,
IntrinClock CpGs are enriched for being located near CpG islands (45% vs. 31%) and are
depleted from open sea regions (20% vs. 36%) (Figure 1.4C).
IntrinClock sites are located near regulatory domains of cancer-related transcription
factors
Transcription factor activity and DNA methylation are biologically connected both directly, as in
the case of the OCT4 transcription factor preferring to bind to methylated DNA (Yin et al.,
2017), and indirectly, as in the case of passive methylation from lack of TF binding (Medvedeva
et al., 2014; Thurman et al., 2012). We investigated regions within 40 bp of IntrinClock CpG
sites and used HOMER (Heinz et al., 2010) to identify enriched motifs associated with
transcription factor-binding sites (Figure 1.4D). Motifs associated with TFAP2C, ZNF341,
ZFP57, RUNX1, E2F3, HOXA1, SP4, MYB, GRHL2, MGA, IRF3, and INSM1 binding were
significantly enriched, compared to a 40-bp background of base pairs surrounding CpG sites that
are assayed by both Illumina Infinium HumanMethylation450K and MethylationEPIC chips.
Aberrant activity of each corresponding transcription factor has been associated with cancer
development or worsened prognosis (L. Chen et al., 2019; Cicirò & Sala, 2021; Feng et al., 2018;
Hedrick et al., 2016; Liu et al., 2013; Mathsyaraja et al., 2021; Rocha & Henrique, 2022; Tian et
al., 2020; Tuo et al., 2022; Wang et al., 2018; Xiang et al., 2012; Yan et al., 2021). Some of
these, such as E2F3 (Ki et al., 2014) and IRF3 (X. Zhang et al., 2019), have been associated with
aging-related diseases, whereas a connection for others has yet to be discovered.
We were interested in exploring general patterns of shifts in IntrinClock CpGs with age.
To avoid uneven distribution of tissue samples across age groups, we focused our analysis on
blood samples. Given that a linear regression model was used to build the IntrinClock, we were
not surprised that the two most prevalent patterns were a linear decrease and increase,
respectively, of DNA methylation with age (Figure S7). However, we also found several CpGs
(Clusters 4, 5, and 6) where the CpGs reverse their age-related direction of DNA methylation
around the age of 21-30. This indicates that, for a subset of CpGs in the IntrinClock, there is a
distinction between the processes of maturation and aging after sexual maturity. Interestingly,
these CpGs were 2.3-fold (34% vs. 14.9%) enriched for being located 200–1500 bp upstream of a
TSS, and 2-fold (19.4% vs. 10%) enriched for being located on a genomic south shore region
(Figure S8), which are stronger enrichments than identified for IntrinClock sites generally
(Figures 4A – 4C).
Figure 2.1 Distributions of CpG positions
a, Distributions of CpG positions relative to genes in IntrinClock sites that are hyper-methylated
with age relative to background. b, Distributions of CpG positions relative to genes in
IntrinClock sites that are hypo-methylated with age relative to background. c, Genomic
distribution of IntrinClock CpG positions. d, HOMER analysis of the top 12 motifs enriched
within 19 bp on either side (5′ or 3′) of IntrinClock sites (40 bp total). *** one-sample proportion
t-test p-value < .001; * < .05
IntrinClock epigenetic age is accelerated in models of intrinsic hallmarks of aging and in
HIV+ individuals
HIV has been associated with changes in DNA methylation state (Arumugam et al., 2021;
Mantovani et al., 2021), including changes in epigenetic age (Horvath & Levine, 2015). HIV
infection is associated with a plethora of clinical manifestations and morbidities consistent with
accelerated aging. However, HIV also causes major changes in immune cell composition (Douek
et al., 1998), which could skew previous versions of epigenetic clocks. As a result, it is unclear
whether early results showcasing epigenetic age acceleration during HIV infection are due to
changes in blood cell composition or an accelerated intrinsic rate of aging. Using the IntrinClock
on previously generated data from HIV+ individuals and controls, we identified an
HIV-associated increase in epigenetic age of two years, supporting the model that HIV leads to
accelerated aging independently of shifts in immune cell composition (Figure 1.5A).
Furthermore, using a previously described cell composition prediction algorithm (Houseman et
al., 2012) combined with a validated library (Reinius et al., 2012) generated on the
HumanMethylation450k platform, we were able to predict changes in ten different immune cell
types and their correlations with clock residuals and HIV status. We observed no association
between IntrinClock residuals and immune cell proportions, including in the cases of eosinophils
and neutrophils (Figure 1.5B). In contrast, we observed significant associations using other
epigenetic clocks, in a manner paralleling the changes seen in HIV (an increase in NK cells and a
decrease in neutrophils).
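The residual analysis described above can be illustrated with a short sketch. It assumes that a clock residual is the deviation of predicted age from a least-squares fit of predicted age on chronological age; the exact definition is not given in this passage. All data values below are invented for illustration only.

```python
from math import sqrt


def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def clock_residuals(chron_age, pred_age):
    """Predicted age minus the age expected from the fitted line."""
    slope, intercept = linear_fit(chron_age, pred_age)
    return [p - (slope * a + intercept) for a, p in zip(chron_age, pred_age)]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical toy data: five donors, one predicted cell-type fraction each.
chron_age = [30, 40, 50, 60, 70]
pred_age = [33, 41, 49, 63, 69]
neutrophil_fraction = [0.55, 0.60, 0.58, 0.52, 0.61]

residuals = clock_residuals(chron_age, pred_age)
r_neutrophils = pearson_r(residuals, neutrophil_fraction)
print(f"residual vs. neutrophil fraction: r = {r_neutrophils:.2f}")
```

A clock that is insensitive to cell composition would be expected to show correlations near zero across all predicted cell-type fractions.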
We also sought to investigate whether the IntrinClock would be accelerated by other
acute immune-related diseases. Using a dataset primarily generated in 2020, we found that the
IntrinClock age prediction was not affected by COVID-19 (Figure 1.5C), contrary to findings in
other epigenetic clocks where COVID-19 infection was associated with an increase in epigenetic
age (Cao et al., 2022). We utilized a library generated on the MethylationEPIC platform (Salas et
al., 2022) to predict changes in immune cell proportions in individuals with COVID. We did not
observe an association of IntrinClock residuals with the relative proportions of any cell type
(Figure 1.5D). Unlike in the case of HIV infection, the associations between the residuals of
other epigenetic clocks and immune cell type proportions were more variable. The
relationship between the residuals of several epigenetic clocks and immune cell type composition
matches the data we obtained on sorted populations, with naïve B and T cells generally negatively
correlated with epigenetic age and memory T cells associated with older epigenetic age
prediction. As the data analyzed in this study were generated early in the COVID-19 pandemic,
most individuals would have been acutely, rather than chronically, ill with COVID-19. It remains
to be seen whether the IntrinClock will predict a higher epigenetic age in those who are infected
with COVID-19 for a prolonged period (i.e., long COVID).
IntrinClock epigenetic age is affected by cellular reprogramming
One application of epigenetic clocks is in tracking the effect of rejuvenating or aging
interventions on cells. As the IntrinClock was developed on sites that are not shifting due to
immune cell compositional changes, we reasoned it may be more sensitive to such interventions.
Consistent with this idea, we used an external dataset (Ohnuki et al., 2014) to find that the
IntrinClock is sensitive to Yamanaka factor–mediated reprogramming in fibroblasts. The study
authors sorted cells positive for TRA-1-60, a marker for de-differentiation, at six time points
after initiation of reprogramming. We investigated IntrinClock epigenetic age predictions at each
time point and found that, from an initial mean predicted epigenetic age of 31, the age prediction
decreased to 20 after 11 days of OSKM-mediated reprogramming. A mean age of 0 was reached
after 20 total days of reprogramming (Figure 1.5E). Conversely, using publicly available data
from an in vitro fibroblast model of replicative cellular senescence (Xie et al., 2018), we found
that the IntrinClock was progressively accelerated with cell divisions as cells became
more senescent. IntrinClock predicted values were greater after 14 population
doublings (predicted age of 15 vs. 10), and then greater still with a predicted age of 20 after
another 14 population doublings (Figure 1.5F). This effect was comparable to that seen using the
PhenoAge clock, and stronger relative to the Hannum, Horvath, and Horvath Skin & Blood
clocks (Figure S9).
Figure 2.2 Effect of diseases and interventions on IntrinClock residuals
a, IntrinClock epigenetic age in HIV+ and HIV- individuals, DNA methylation data from
GSE67751. b, Correlation plot of HIV status, clock residuals, and predicted immune cell type
proportions. c, IntrinClock epigenetic age in COVID positive and COVID negative individuals,
DNA methylation data from GSE167202. d, Correlation plot of COVID status, clock residuals,
and predicted immune cell type proportions. e, Epigenetic reprogramming affects fibroblast
predicted IntrinClock age. DNA methylation data from GSE54848. f, Induced replicative
senescence in fibroblasts leads to an increase in IntrinClock predicted age. DNA methylation
data from GSE91069. T-test p-values # < .10; * < .05; *** < .001.
Figure 2.3 IntrinClock and CpG methylation patterns and clusters
a, Analysis of IntrinClock CpG methylation patterns over five age groups in blood samples. b,
Clusters by number of CpGs in the IntrinClock. c, Clusters weighted by the sum of absolute
values of IntrinClock CpG coefficients. Boxplots are centered at median and bound one quartile
on each side.
Supplemental Figure 2.4 Analysis of IntrinClock CpG methylation sites belonging to clusters 4,
5, and 6.
a, Distribution of clusters 4, 5, and 6 CpG sites relative to genes. b, Distribution of clusters 4, 5,
and 6 CpG sites across the genome. *** One-sample proportion test p-value < .001; * one-sample
proportion test p-value < .05.
Supplemental Figure 2.5 Effect of replicative senescence in fibroblasts on five epigenetic clocks.
DNA methylation data from GSE91069. Samples from N = 3 independent biological replicates.
*** t-test p-value < .001. Boxplots are centered at median and bound one quartile on each side.
Supplemental Figure 2.6 Analysis of correlations of IntrinClock CpG site methylation with
tissues.
Color represents degree of correlation of IntrinClock CpG site methylation with the respective
tissue.
Table 1.
GSE # Citation Tissue(s) N Platform
GSE41826 Guintivano 2013 Brain 145 450K
GSE42861 Liu 2013 Blood 689 450K
GSE42700 Martino 2013 Brain 53 450K
GSE115278 Arpon 2019 Blood 474 450K, EPIC
GSE52588 Bacalini 2015 Blood 87 450K
GSE201724 Bartlett 2022 Breast 18 EPIC
GSE178887 Bauer 2021 Blood 37 450K
GSE191276 Brennan 2022 Blood 6 EPIC
GSE136583 Cerapio 2021 Liver 62 EPIC
GSE59157 Charlton 2014 Kidney 92 450K
GSE159898 Li 2021 Colorectal 44 EPIC
GSE210245 Clement 2022 Blood 36 EPIC
GSE112987 Cobben 2019 Blood 103 450K
GSE203399 Cullell 2022 Blood 121 450K, EPIC
GSE193879 Davalos 2022 Blood 127 EPIC
GSE201752 Estupiñán-Moreno 2022 Blood 113 EPIC
GSE129428 Fries 2020 Brain 64 EPIC
GSE63347 Horvath 2015 Brain 71 450K
GSE179414 Garcia-Prieto 2022 Blood 157 EPIC
GSE66351 Gasparoni 2018 Brain 190 450K
GSE99029 Gopalan 2017 Saliva 57 450K
GSE191200 De Witte 2022 Brain 56 EPIC
GSE152026 Hannon 2021 Blood 929 EPIC
GSE40279 Hannum 2013 Blood 656 450K
GSE119078 Hearn 2020 Saliva 59 450K
GSE92767 Hong 2017 Saliva 54 450K
GSE61256 Horvath 2014 Liver 79 450K
GSE78874 Horvath 2016 Saliva 259 450K
GSE141682 Xiao 2021 Blood 42 EPIC
GSE120610 McEwen 2018 Blood 156 EPIC
GSE40360 Huynh 2013 Brain 46 450K
GSE149282 Ishak 2020 Colon 24 EPIC
GSE124366 Islam 2019 Buccal, Blood 215 450K
GSE52068 Jiang 2015 Nasopharynx 48 450K
GSE88883 Johnson 2017 Breast 100 450K
GSE69270 Kananen 2016 Blood 184 450K
GSE154566 Kandaswamy 2021 Buccal, Blood 963 450K, EPIC
GSE122288 Kasuga 2022 Blood 61 EPIC
GSE157131 Kho 2020 Blood 1218 450K, EPIC
GSE167202 Konigsberg 2021 Blood 525 EPIC
GSE70977 Langevin 2015 Buccal 223 450K
GSE141256 Lewis 2020 Small Intestine, Colon 84 450K, EPIC
GSE106648 Kular 2018 Blood 279 450K
GSE59685 Lunnon 2014 Brain 526 450K
GSE201872 Magnaye 2022 Bronchi 142 450K, EPIC
GSE114134 Martino 2018 Blood 205 EPIC
GSE188593 Muse 2022 Skin 64 EPIC
GSE85566 Nicodemus-Johnson 2016 Lung 115 450K
GSE166611 Unpublished, Nonino 2016 Blood 32 450K
GSE190540 Vyas 2021 Blood 90 EPIC
GSE213478 Oliva 2023 Breast, Colon, Kidney, Lung, Skeletal Muscle, Ovary, Prostate, Testis, Blood 987 EPIC
GSE112179 Pai 2019 Brain 100 450K
GSE203332 Pihlstrøm 2022 Brain 336 EPIC
GSE137223 Policicchio 2020 Brain 33 450K
GSE61195 Renauer 2015 Blood 21 450K
GSE133062 Ringh 2019 Bronchi 70 EPIC
GSE151017 Ringh 2021 Bronchi 78 EPIC
GSE90124 Roos 2017 Skin 322 450K
GSE184269 Roy 2021 Blood 167 EPIC
GSE112611 Somineni 2019 Blood 402 EPIC
GSE178925 Takeuchi 2022 Blood 24 EPIC
GSE146376 Thompson 2020 Lung 280 EPIC
None yet This study Blood 30 EPIC
GSE50660 Tsaprouni 2014 Blood 464 450K
GSE72556 Oelsner 2017 Saliva 93 450K
GSE89707 Viana 2016 Brain 49 450K
GSE151407 Voisin 2020 Skeletal Muscle 78 EPIC
GSE61107 Wockner 2014 Brain 47 450K
GSE49393 Xu 2014 Brain 48 450K
GSE174442 Xu 2019 Blood 256 450K
GSE128235 Zannas 2019 Blood 537 450K
Table 1. List of datasets used for building and/or validating the IntrinClock.
Table 2.
Target Fluorophore Clone Company Titer
CD3 PE-Cy7 UCHT1 BioLegend 1:66
CD8 APC SK1 BioLegend 1:100
CD4 Pacific Blue OKT4 BioLegend 1:100
CD28 PE B353546 BioLegend 1:50
CD45RO BV785 UCHL1 BioLegend 1:50
Table 2. Antibody mix used for sorting T cells.
Table 3.
Target Fluorophore Clone Company Titer
CD14 BV650 M5E2 BioLegend 1:12.5
IgD PerCP Cy5.5 IA6-1 BioLegend 1:12.5
CD3 Pacific Blue OKT4 BioLegend 1:12.5
CD19 APC HIB19 BioLegend 1:12.5
CD16 PE-Cy7 3G8 BioLegend 1:12.5
CD27 FITC O323 BioLegend 1:12.5
Table 3. Antibody mix used for sorting B cells, CD56dim CD16+ NK cells, and classical
monocytes.
Table 4.
Specificity Fluorochrome Vendor Catalog # Clone # Lot # Titer
CD3 BV510 BioLegend 344828 SK7 B364399 1:62.5
CD4 cFluor YG584 Cytek R7-20042 SK3 F-012022-01 1:125
CD8 BUV805 BD Biosciences 612889 SK1 2006191 1:125
CD14 Spark NIR 685 BioLegend 367150 63D3 B343008 1:250
CD16 BUV496 BD Biosciences 612944 3G8 1348178 1:62.5
CD19 BV570 BioLegend 302236 HIB19 B350345 1:30
CD25 PE-Fire 700 BioLegend 356146 M-A251 B357965 1:62.5
CD27 APC-H7 BD Biosciences 560222 M-T271 1260248 1:62.5
CD28 PE-Cy5 BioLegend 302910 CD28.2 B336927 1:30
CD38 APC-Fire 810 BioLegend 356644 HIT2 B365111 1:62.5
CD39 BUV661 BD Biosciences 749967 TU66 2075880 1:15
CD45 PerCP BioLegend 386506 2D1 B326914 1:30
CD45RA BUV395 BD Biosciences 740315 5H9 1356975 1:250
CD49d APC BioLegend 304308 9F10 B270195 1:62.5
CD56 (NCAM1) BUV737 BD Biosciences 612766 NCAM16.2 1210146 1:62.5
CD57 (HNK-1) eFluor 450 Invitrogen 48-0577-42 TB01 2437632 1:30
CD95 (Fas) BV650 BioLegend 305642 DX2 B324443 1:30
CD127 APC-R700 BD Biosciences 565185 HIL-7RM21 1341730 1:30
CD159a (NKG2a) Alexa Fluor 647 BioLegend 375105 S19004C B325094 1:62.5
CD161 (NK1.1) PE-eFluor 610 Invitrogen 61-1619-42 HP-3G10 2446967 1:30
CD183 (CXCR3) PE-Cy7 BioLegend 353720 G025H7 B319137 1:62.5
CD184 (CXCR4) BV605 BioLegend 306522 12G5 B301424 1:30
CD195 (CCR5) BUV563 BD Biosciences 741401 2D7/CCR5 1348905 1:30
CD196 (CCR6) BV711 BioLegend 353436 G034E3 B323435 1:30
CD197 (CCR7) BV421 BioLegend 353208 G043H7 B337639 1:15
CD279 (PD1) BV785 BioLegend 329930 EH12.2H7 B351259 1:15
HLA-DR BV750 BioLegend 307672 L243 B368880 1:30
IgD PerCP-Cy5.5 BioLegend 348208 IA6-2 B366177 1:30
Klrg1 PE-Fire 810 BioLegend 367733 SA231A2 B371259 1:125
TCRγδ BUV615 BD Biosciences 751308 11F2 2271540 1:15
Tigit BV480 BD Biosciences 747843 741182 2272802 1:15
Table 4. Antibody mix used for high-dimensional phenotyping of peripheral blood mononuclear
cells.
CHAPTER 5: DEVELOPMENT OF A SINGLE CELL TRANSCRIPTOMIC
BIOMARKER FOR T CELL AGING
ABSTRACT
Predicting biological or chronological age on a single-cell level holds promise for understanding
how different cell types react to the aging process. Here, we generated a combined automated
cell type and age predictor for six different types of human T cell subsets. We demonstrate that
our cell type predictor is capable of correctly identifying canonical cell subsets based on cell
markers. We show that cell type prediction improved chronological age prediction. Lastly, we
tentatively identify an increase in predicted chronological age in autoimmune diseases. This
study shows the potential of utilizing single cell transcriptomic biomarkers for understanding
differences in how human immune cell types are impacted by aging.
RESULTS
Automated prediction of cell type recapitulates known changes in T cell composition with
age
With age, the proportions of T cell subsets change, as the CD4/CD8 ratio increases and the
number of naïve CD8 cells decreases (M. Li et al., 2019). Each T cell subset also ages
differently, demonstrated by CD8 cells having less homeostatic control and more gene
expression instability compared to CD4s (Czesnikiewicz-Guzik et al., 2008). To profile the
changes in T cell composition in aging and disease, we trained a cell type prediction model to
predict one of six canonical T cell subsets – naïve helper cells, memory helper cells, regulatory T
cells, naïve cytotoxic T cells, memory cytotoxic T cells, and effector cytotoxic T cells.
To serve as the basis for our model, we used a previously published scRNA-Seq dataset
of two million peripheral blood mononuclear cells (PBMCs) from 166 individuals (Terekhova et
al., 2023) (Figure 3.1A). As noise can significantly detract from accurate cell type or age
prediction, we retained only genes that have some correlation (R > .01 or R < -.01) with age. This
removed approximately 90% of genes in the original dataset. We then split the dataset into a
training subset (80%) used for building the model and a test subset (20%) for determining its
accuracy and precision. We used an external dataset (Yasumizu et al., 2024) to measure the
ability of our models to make accurate predictions in other cohorts. To determine cell prediction
accuracy and generate initial labels, we used K-means clustering to split the cells into
biologically relevant groups and then labeled these groups based on six markers (CD4, CD8A,
CCR7, GZMB, GNLY, and FOXP3).
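The gene filter and the 80/20 split described above can be sketched as follows. This is an illustration under stated assumptions, not the study's pipeline: the text does not say whether the split was made by cell or by donor, so the sketch splits by donor (one common way to limit leakage between subsets), and the gene names and expression values are invented.

```python
import random
from math import sqrt

AGE_CORRELATION_THRESHOLD = 0.01  # |R| cutoff quoted in the text


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def filter_age_correlated_genes(expression, ages, threshold=AGE_CORRELATION_THRESHOLD):
    """Keep genes whose expression satisfies R > threshold or R < -threshold."""
    return [gene for gene, values in expression.items()
            if abs(pearson_r(values, ages)) > threshold]


def split_donors(donor_ids, train_fraction=0.8, seed=0):
    """Shuffle unique donors reproducibly and split them into train and test sets."""
    donors = sorted(set(donor_ids))
    random.Random(seed).shuffle(donors)
    n_train = round(train_fraction * len(donors))
    return set(donors[:n_train]), set(donors[n_train:])


# Hypothetical toy data: six cells with the age of the donor each came from.
ages = [25, 25, 40, 40, 70, 70]
expression = {
    "GENE_UP": [1, 2, 3, 3, 5, 6],
    "GENE_FLAT": [2, 2, 2, 2, 2, 2],
    "GENE_DOWN": [6, 5, 4, 4, 2, 1],
}
kept_genes = filter_age_correlated_genes(expression, ages)
train_donors, test_donors = split_donors([f"donor_{i}" for i in range(10)])
print(kept_genes, len(train_donors), len(test_donors))
```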
We validated our model by first identifying whether known aging-associated trends could
be recapitulated. We identified an increase in the CD4/CD8 ratio with age, aligning with changes
that had been previously identified clinically (Figure 3.1B). Within the six cell types we
measured, we also observed an increase in effector cytotoxic cells and a simultaneous decrease
in the proportion of naïve cytotoxic cells (Figure 3.1C). Furthermore, we found the cell type
prediction matches closely with manual annotation of the training (Figure 3.1D), test (Figure
3.1E), and validation (Figure 3.1F) datasets. It is likely our estimate for accuracy in the
validation dataset is conservative, as the prediction model is able to identify a regulatory subset
expressing FOXP3 that was not subdivided with K-means clustering.
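To make the CD4/CD8 comparison concrete, the sketch below computes the ratio from predicted cell type labels and normalizes it to a reference age group, as described in the caption of Figure 3.1b. It assumes that regulatory T cells are counted with the CD4 lineage; the label strings, the "65+" group name, and the cell counts are hypothetical.

```python
from collections import Counter

# Assumed grouping of the six predicted subsets into the two lineages.
CD4_SUBSETS = {"naive helper", "memory helper", "regulatory"}
CD8_SUBSETS = {"naive cytotoxic", "memory cytotoxic", "effector cytotoxic"}


def cd4_cd8_ratio(labels):
    """Ratio of CD4-lineage to CD8-lineage cells among predicted labels."""
    counts = Counter(labels)
    cd4 = sum(counts[subset] for subset in CD4_SUBSETS)
    cd8 = sum(counts[subset] for subset in CD8_SUBSETS)
    if cd8 == 0:
        raise ValueError("no CD8-lineage cells in this group")
    return cd4 / cd8


def normalized_ratios(labels_by_group, reference_group="18-35"):
    """Express each group's CD4/CD8 ratio relative to the reference group."""
    reference = cd4_cd8_ratio(labels_by_group[reference_group])
    return {group: cd4_cd8_ratio(labels) / reference
            for group, labels in labels_by_group.items()}


# Hypothetical predicted labels for two age groups.
labels_by_group = {
    "18-35": ["naive helper"] * 4 + ["naive cytotoxic"] * 2,
    "65+": ["memory helper"] * 6 + ["effector cytotoxic"] * 2,
}
normalized = normalized_ratios(labels_by_group)
print(normalized)
```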
Cell type-dependent models are capable of predicting age across a variety of cell types
We first sought to identify whether cell clustering would improve age prediction. We trained
cell-type dependent models on a range of cell type numbers, and identified cell type clustering as
providing a benefit to age prediction (Supplemental Figure 3.3). We then trained six independent
age prediction models, one for each of the six predicted cell types. Using these models, we were
able to show age prediction in the training (Figure 3.2A), test (Figure 3.2B), and validation
(Figure 3.2C) datasets. In general, we observed a moderate-to-high (R = .5 to R = .8) age
prediction accuracy on a per-donor basis and a low-to-moderate (R = .38 to R = .56) age
prediction accuracy on a per-cell basis. We then sought to identify whether each of these age
predictors was identifying unique cell type-specific aging patterns. In general, the individual age
predictors had moderate (R ~ .5) age prediction correlations with one another, suggesting they
measure both a shared and cell-type dependent aging signature.
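The gap between per-cell and per-donor accuracy can be illustrated with a small sketch. It assumes that the per-donor prediction is the mean of the per-cell predictions for that donor; the aggregation actually used is not stated in this passage, and the numbers are invented so that the effect of averaging is easy to see.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def per_donor_means(donor_ids, predicted_ages):
    """Average the per-cell predictions within each donor."""
    grouped = defaultdict(list)
    for donor, prediction in zip(donor_ids, predicted_ages):
        grouped[donor].append(prediction)
    return {donor: sum(values) / len(values) for donor, values in grouped.items()}


# Hypothetical toy data: three donors with three cells each.
donor_age = {"d1": 25, "d2": 45, "d3": 65}
cell_donor = ["d1", "d1", "d1", "d2", "d2", "d2", "d3", "d3", "d3"]
cell_pred = [10, 30, 35, 30, 50, 55, 45, 70, 80]

per_cell_r = pearson_r([donor_age[d] for d in cell_donor], cell_pred)
means = per_donor_means(cell_donor, cell_pred)
donors = sorted(means)
per_donor_r = pearson_r([donor_age[d] for d in donors], [means[d] for d in donors])
print(f"per-cell r = {per_cell_r:.2f}, per-donor r = {per_donor_r:.2f}")
```

Averaging across cells cancels much of the cell-level noise, which is one reason a per-donor correlation can exceed the per-cell correlation from the same model.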
Cell type-dependent single cell age prediction models identify accelerated aging in
autoimmune disorders such as multiple sclerosis
The incidence of autoimmunity increases with age due to changes in cell type composition and
defects in T cell function (Goronzy & Weyand, 2012). Many autoimmune diseases have been
associated with molecular changes indicative of accelerated aging (Shen et al., 2021). As a result,
we were interested in discovering whether or not our single-cell aging clock would identify
accelerated aging trends in autoimmune disease models. Using the validation cohort, we found
that on a cell-by-cell basis there was an increase in transcriptomic age relative to healthy controls
in multiple sclerosis specifically in naïve cytotoxic cells (Supplemental Figure 3.4), although this
effect did not reach significance when evaluated on a donor basis.
Figure 3.1 Cell type predictions recapitulate known effects of aging on the immune system
a, General study and model design. b, CD4/CD8 ratio identified in different age groups,
normalized to the ratio identified in individuals aged 18-35. c, Cell type proportion changes in
individuals from different age groups. d, Comparison between predicted and manually annotated
cell clusters for the training set. e, Comparison between predicted and manually annotated cell
clusters for the test set. f, Comparison between predicted and manually annotated cell clusters for
an external validation set (Yasumizu et al., 2024). *** Bonferroni-corrected p-value less than or
equal to .001, * Bonferroni-corrected P-value less than or equal to .05.
Figure 3.2 Cell type information improves age prediction of T cell subsets
a, Predicted age and chronological age of donors and cells in the training set. b, Predicted age
and chronological age of donors and cells in the test set. c, Predicted age and chronological age
of donors in the validation set (Yasumizu et al., 2024). d, Correlations of age prediction between
each T cell subset clock, and relative errors of each T cell subset clock relative to donor
chronological age.
Supplemental Figure 3.3 Correlation of chronological age prediction compared to number of
cell clusters used
Supplemental Figure 3.4 Predicted age residuals of cell-type dependent single-cell
transcriptomic clock across autoimmune diseases
CHAPTER 6: CONCLUSIONS
Aging biomarkers hold great promise for the study of longevity due to their high correlation with
age and (particularly for second-generation clocks) association with aging-related disease state.
As diagnostic tools, they have the potential to serve as important predictive biomarkers for
assessing biological age, determining risk for age-associated diseases, and assessing the efficacy
of interventions that target the aging process (Duan et al., 2022; Fransquet et al., 2019; Noroozi
et al., 2021; Oblak et al., 2021; Simpson & Chandra, 2021). Recent technical advances, such as
the development of principal component clocks (Higgins-Chen et al., 2021) and novel techniques
for cost reduction, promise to increase reliability and usability further. However, their current
status as a composite of multiple aging signals makes them difficult to interpret and to link to
specific biological processes. As an example, a recent study in patients post-COVID-19 infection
demonstrated a significant PhenoAge epigenetic age acceleration in individuals over the age of
50, but an epigenetic age reversal for those under the age of 50 (Cao et al., 2022). Further, the
manner in which clocks track healthspan is not fully overlapping, as clocks can be independently
predictive of mortality even when analyzed jointly (X. Li et al., 2020). This challenge in
interpretation is equally important for cellular models of the hallmarks of aging. In models of
senescence or reprogramming, the sensitivity or even direction of the perturbation on predicted
epigenetic age can dramatically differ, depending on the epigenetic clock used. For example, in
this study, we identified the Hannum clock as predicting an age reversal in a fibroblast model of
cellular replicative senescence (Supplemental Figure 2.5).
The immune system changes dramatically with aging, and its decline can exacerbate or
lead to many aging-related pathologies (Weyand & Goronzy, 2016). Clocks built solely on
inflammatory markers can be used to predict age and risk of multimorbidity (Sayed et al., 2021).
However, the presence of CpG sites that track primarily with immune cell markers makes
epigenetic clocks applied to cell-intrinsic effects (e.g., cellular reprogramming in fibroblast cell
culture) difficult to understand. Such sites can introduce background noise to the resulting
measurement.
Here, using sorted CD8+ T-cell subsets, we observed that naive T cells consistently
showed a younger epigenetic age than other CD8+ subsets (Figure 1.1C), ranging from a 10-year
average age under-prediction in some clocks to as high as a 60-year under-prediction in others.
This suggests that epigenetic clock measurements are significantly affected by CD8+ T-cell
differentiation subsets other than T cells. Altogether, the cell subsets assessed in this study
comprise 40-60% of the PBMC fraction, and many of these subsets
change in relative proportion with age (Terekhova et al., 2023). These observations reinforce the
finding that current epigenetic clocks represent the integration of at least two variables: cell
intrinsic aging and changes in immune composition during aging.
To isolate these variables, we developed a novel epigenetic clock that is based on CpG
sites that do not change with CD8+ T-cell differentiation (IntrinClock). We further observed that
this clock predicts the same age in each individual across a wide variety of immune cell types.
Interestingly, a filtering step based on naive CD8+ T cells can generate a clock that is not
affected by differentiation in cells from different lineages, such as CD4+ cells or even B cells.
This indicates part of a unique “CD8+ naive” signal may, in fact, be a conserved quiescence
program shared by a variety of immune cells. This observation is supported by our finding that
methylation patterns associated with naive CD8+ T cells have a negative correlation with those
changing with aging (Figure 1.2C). A connection between quiescence and aging is found in a
wide variety of cell types, including neural stem cells (Audesse & Webb, 2020).
The IntrinClock’s higher proportion of sites near transcription start sites and CpG islands
and its expected relationship with reprogramming and senescence suggest that it is tracking an
intrinsic cellular aging program. Enrichment of IntrinClock CpG sites within motifs bound by
transcription factors linked to cancer progression is consistent with a recent review investigating
the connection between epigenetic clocks, global hypomethylation, cancer, and aging (Johnstone
et al., 2022). It will be important in the future to test whether acceleration of the IntrinClock is
linked to particular disease states. This application could be a novel tool used to distinguish
age-related diseases caused by aberrant cell-to-cell interactions from those caused by intrinsic
cellular dysfunction. Further improvement could also be made on the IntrinClock design, such as
by removing tissue-specific DNA methylation signatures from training. Although most
IntrinClock CpG sites are not linked to tissue, a minority appear to be tissue-stabilizing for blood
and brain tissue (Figure S2.6).
The approach described here reduces the potential of cellular composition changes to be a
confounder, particularly in blood or saliva samples, and will likely increase our understanding of
biological aging and age-associated diseases. The IntrinClock holds the promise of being more
sensitive to cell-intrinsic rejuvenation approaches, as its constituent CpG sites are not affected by
immune cell composition. It may also be more closely linked to CpG sites with a functional or
even causal relationship with the aging process. Overall, IntrinClock represents a new instrument
to add to the aging biomarker toolkit, with a potential wide variety of applications and uses.
Lastly, we have developed a single-cell transcriptomic biomarker that combines cell type
prediction with age prediction. Our results show that this cell type predictor can effectively
identify and replicate cell type changes associated with aging and disease. By coupling cell type
prediction with age prediction, we have improved the accuracy of our model. Furthermore, our
findings indicate that autoimmune diseases impact the age prediction of specific cell subsets.
Moving forward, we aim to refine this biomarker to gain a deeper understanding of both
cell-intrinsic and cell-extrinsic aging and their combined effects on T cell dysfunction. Expanding the
application of this biomarker to other diseases will further enhance its translational impact.
REFERENCES
Arpón, A., Milagro, F. I., Ramos-Lopez, O., Mansego, M. L., Santos, J. L., Riezu-Boj, J.-I., &
Martínez, J. A. (2019). Epigenome-wide association study in peripheral white blood cells
involving insulin resistance. Scientific Reports, 9(1), Article 1.
https://doi.org/10.1038/s41598-019-38980-2
Arumugam, T., Ramphal, U., Adimulam, T., Chinniah, R., & Ramsuran, V. (2021). Deciphering
DNA Methylation in HIV Infection. Frontiers in Immunology, 12, 795121.
https://doi.org/10.3389/fimmu.2021.795121
Aryee, M. J., Jaffe, A. E., Corrada-Bravo, H., Ladd-Acosta, C., Feinberg, A. P., Hansen, K. D.,
& Irizarry, R. A. (2014). Minfi: A flexible and comprehensive Bioconductor package for
the analysis of Infinium DNA methylation microarrays. Bioinformatics, 30(10), 1363–1369.
https://doi.org/10.1093/bioinformatics/btu049
Audesse, A. J., & Webb, A. E. (2020). Mechanisms of enhanced quiescence in neural stem cell
aging. Mechanisms of Ageing and Development, 191, 111323.
https://doi.org/10.1016/j.mad.2020.111323
Bacalini, M. G., Gentilini, D., Boattini, A., Giampieri, E., Pirazzini, C., Giuliani, C., Fontanesi,
E., Scurti, M., Remondini, D., Capri, M., Cocchi, G., Ghezzo, A., Rio, A. D., Luiselli, D.,
Vitale, G., Mari, D., Castellani, G., Fraga, M., Blasio, A. M. D., … Garagnani, P. (2014).
Identification of a DNA methylation signature in blood cells from persons with Down
Syndrome. Aging, 7(2), 82–96. https://doi.org/10.18632/aging.100715
Bajaj, V., Gadi, N., Spihlman, A. P., Wu, S. C., Choi, C. H., & Moulton, V. R. (2021). Aging,
Immunity, and COVID-19: How Age Influences the Host Immune Response to
Coronavirus Infections? Frontiers in Physiology, 11.
https://www.frontiersin.org/article/10.3389/fphys.2020.571416
Bartleson, J. M., Radenkovic, D., Covarrubias, A. J., Furman, D., Winer, D. A., & Verdin, E.
(2021). SARS-CoV-2, COVID-19 and the Ageing Immune System. Nature Aging, 1(9),
769–782. https://doi.org/10.1038/s43587-021-00114-7
Bartlett, T. E., Evans, I., Jones, A., Barrett, J. E., Haran, S., Reisel, D., Papaikonomou, K., Jones,
L., Herzog, C., Pashayan, N., Simões, B. M., Clarke, R. B., Evans, D. G., Ghezelayagh,
T. S., Ponandai-Srinivasan, S., Boggavarapu, N. R., Lalitkumar, P. G., Howell, S. J.,
Risques, R. A., … Widschwendter, M. (2022). Antiprogestins reduce epigenetic field
cancerization in breast tissue of young healthy women. Genome Medicine, 14(1), 64.
https://doi.org/10.1186/s13073-022-01063-5
Bauer, M. A., Todorova, V. K., Stone, A., Carter, W., Plotkin, M. D., Hsu, P.-C., Wei, J. Y., Su,
J. L., & Makhoul, I. (2021). Genome-Wide DNA Methylation Signatures Predict the
Early Asymptomatic Doxorubicin-Induced Cardiotoxicity in Breast Cancer. Cancers,
13(24), Article 24. https://doi.org/10.3390/cancers13246291
Bell, C. G., Lowe, R., Adams, P. D., Baccarelli, A. A., Beck, S., Bell, J. T., Christensen, B. C.,
Gladyshev, V. N., Heijmans, B. T., Horvath, S., Ideker, T., Issa, J.-P. J., Kelsey, K. T.,
Marioni, R. E., Reik, W., Relton, C. L., Schalkwyk, L. C., Teschendorff, A. E., Wagner,
W., … Rakyan, V. K. (2019). DNA methylation aging clocks: Challenges and
recommendations. Genome Biology, 20(1), 249. https://doi.org/10.1186/s13059-019-1824-y
Bellon, M., & Nicot, C. (2017). Telomere Dynamics in Immune Senescence and Exhaustion
Triggered by Chronic Viral Infection. Viruses, 9(10), 289.
https://doi.org/10.3390/v9100289
Belsky, D. W., Caspi, A., Arseneault, L., Baccarelli, A., Corcoran, D. L., Gao, X., Hannon, E.,
Harrington, H. L., Rasmussen, L. J., Houts, R., Huffman, K., Kraus, W. E., Kwon, D.,
Mill, J., Pieper, C. F., Prinz, J. A., Poulton, R., Schwartz, J., Sugden, K., … Moffitt, T. E.
(2020). Quantification of the pace of biological aging in humans through a blood test, the
DunedinPoAm DNA methylation algorithm. eLife, 9, e54870.
https://doi.org/10.7554/eLife.54870
Belsky, D. W., Caspi, A., Corcoran, D. L., Sugden, K., Poulton, R., Arseneault, L., Baccarelli,
A., Chamarti, K., Gao, X., Hannon, E., Harrington, H. L., Houts, R., Kothari, M., Kwon,
D., Mill, J., Schwartz, J., Vokonas, P., Wang, C., Williams, B. S., & Moffitt, T. E.
(2022). DunedinPACE, a DNA methylation biomarker of the pace of aging. eLife, 11,
e73420. https://doi.org/10.7554/eLife.73420
Bennett, T. J., Udupa, V. A. V., & Turner, S. J. (2020). Running to Stand Still: Naive CD8+ T
Cells Actively Maintain a Program of Quiescence. International Journal of Molecular
Sciences, 21(24), Article 24. https://doi.org/10.3390/ijms21249773
Bonder, M. J., Clark, S. J., Krueger, F., Luo, S., Sousa, J. A. de, Hashtroud, A. M., Stubbs, T.
M., Stark, A.-K., Rulands, S., Stegle, O., Reik, W., & Meyenn, F. von. (2023). Single cell
DNA methylation ageing in mouse blood (p. 2023.01.30.526343). bioRxiv.
https://doi.org/10.1101/2023.01.30.526343
Brennan, K., Zheng, H., Fahrner, J. A., Shin, J. H., Gentles, A. J., Schaefer, B., Sunwoo, J. B.,
Bernstein, J. A., & Gevaert, O. (2022). NSD1 mutations deregulate transcription and
DNA methylation of bivalent developmental genes in Sotos syndrome. Human Molecular
Genetics, 31(13), 2164–2184. https://doi.org/10.1093/hmg/ddac026
Buckley, M. T., Sun, E. D., George, B. M., Liu, L., Schaum, N., Xu, L., Reyes, J. M., Goodell,
M. A., Weissman, I. L., Wyss-Coray, T., Rando, T. A., & Brunet, A. (2023).
Cell-type-specific aging clocks to quantify aging and rejuvenation in neurogenic regions of the
brain. Nature Aging, 3(1), 121–137. https://doi.org/10.1038/s43587-022-00335-4
Buitrago, D., Labrador, M., Arcon, J. P., Lema, R., Flores, O., Esteve-Codina, A., Blanc, J.,
Villegas, N., Bellido, D., Gut, M., Dans, P. D., Heath, S. C., Gut, I. G., Brun Heath, I., &
Orozco, M. (2021). Impact of DNA methylation on 3D genome structure. Nature
Communications, 12(1), Article 1. https://doi.org/10.1038/s41467-021-23142-8
Buscarlet, M., Provost, S., Zada, Y. F., Barhdadi, A., Bourgoin, V., Lépine, G., Mollica, L.,
Szuber, N., Dubé, M.-P., & Busque, L. (2017). DNMT3A and TET2 dominate clonal
hematopoiesis and demonstrate benign phenotypes and different genetic predispositions.
Blood, 130(6), 753–762. https://doi.org/10.1182/blood-2017-04-777029
Campisi, J. (2013). Aging, cellular senescence, and cancer. Annual Review of Physiology, 75,
685–705. https://doi.org/10.1146/annurev-physiol-030212-183653
Cao, X., Li, W., Wang, T., Ran, D., Davalos, V., Planas-Serra, L., Pujol, A., Esteller, M., Wang,
X., & Yu, H. (2022). Accelerated biological aging in COVID-19 patients. Nature
Communications, 13(1), Article 1. https://doi.org/10.1038/s41467-022-29801-8
Castelo-Branco, C., & Soveral, I. (2014). The immune system and aging: A review.
Gynecological Endocrinology: The Official Journal of the International Society of
Gynecological Endocrinology, 30(1), 16–22.
https://doi.org/10.3109/09513590.2013.852531
Cerapio, J. P., Marchio, A., Cano, L., López, I., Fournié, J.-J., Régnault, B.,
Casavilca-Zambrano, S., Ruiz, E., Dejean, A., Bertani, S., & Pineau, P. (2021). Global DNA
hypermethylation pattern and unique gene expression signature in liver cancer from
patients with Indigenous American ancestry. Oncotarget, 12(5), 475–492.
https://doi.org/10.18632/oncotarget.27890
Chandra, A., Senapati, S., Roy, S., Chatterjee, G., & Chatterjee, R. (2018). Epigenome-wide
DNA methylation regulates cardinal pathological features of psoriasis. Clinical
Epigenetics, 10(1), 108. https://doi.org/10.1186/s13148-018-0541-9
Charlton, J., Williams, R. D., Weeks, M., Sebire, N. J., Popov, S., Vujanic, G., Mifsud, W.,
Alcaide-German, M., Butcher, L. M., Beck, S., & Pritchard-Jones, K. (2014). Methylome
analysis identifies a Wilms tumor epigenetic biomarker detectable in blood. Genome
Biology, 15(8), 434. https://doi.org/10.1186/s13059-014-0434-y
Charras, A., Garau, J., Hofmann, S. R., Carlsson, E., Cereda, C., Russ, S., Abraham, S., &
Hedrich, C. M. (2021). DNA Methylation Patterns in CD8+ T Cells Discern Psoriasis
From Psoriatic Arthritis and Correlate With Cutaneous Disease Activity. Frontiers in Cell
and Developmental Biology, 9.
https://www.frontiersin.org/articles/10.3389/fcell.2021.746145
Chen, B. H., Marioni, R. E., Colicino, E., Peters, M. J., Ward-Caviness, C. K., Tsai, P.-C.,
Roetker, N. S., Just, A. C., Demerath, E. W., Guan, W., Bressler, J., Fornage, M.,
Studenski, S., Vandiver, A. R., Moore, A. Z., Tanaka, T., Kiel, D. P., Liang, L., Vokonas,
P., … Horvath, S. (2016). DNA methylation-based measures of biological age: Meta-analysis predicting time to death. Aging (Albany NY), 8(9), 1844–1859.
https://doi.org/10.18632/aging.101020
Chen, L., Wu, X., Xie, H., Yao, N., Xia, Y., Ma, G., Qian, M., Ge, H., Cui, Y., Huang, Y.,
Wang, S., & Zheng, M. (2019). ZFP57 suppress proliferation of breast cancer cells
through down-regulation of MEST-mediated Wnt/β-catenin signalling pathway. Cell
Death & Disease, 10(3), Article 3. https://doi.org/10.1038/s41419-019-1335-5
Cheng, Y., He, C., Wang, M., Ma, X., Mo, F., Yang, S., Han, J., & Wei, X. (2019). Targeting
epigenetic regulators for cancer therapy: Mechanisms and advances in clinical trials.
Signal Transduction and Targeted Therapy, 4(1), Article 1.
https://doi.org/10.1038/s41392-019-0095-0
Chong, Y., Ikematsu, H., Yamaji, K., Nishimura, M., Nabeshima, S., Kashiwagi, S., & Hayashi,
J. (2005). CD27+ (memory) B cell decrease and apoptosis-resistant CD27− (naive) B cell
increase in aged humans: Implications for age-related peripheral B cell developmental
disturbances. International Immunology, 17(4), 383–390.
https://doi.org/10.1093/intimm/dxh218
Cicirò, Y., & Sala, A. (2021). MYB oncoproteins: Emerging players and potential therapeutic
targets in human cancer. Oncogenesis, 10(2), Article 2. https://doi.org/10.1038/s41389-021-00309-y
Clement, J., Yan, Q., Agrawal, M., Coronado, R. E., Sturges, J. A., Horvath, M., Lu, A. T.,
Brooke, R. T., & Horvath, S. (2022). Umbilical cord plasma concentrate has beneficial
effects on DNA methylation GrimAge and human clinical biomarkers. Aging Cell,
21(10), e13696. https://doi.org/10.1111/acel.13696
Cobben, J. M., Krzyzewska, I. M., Venema, A., Mul, A. N., Polstra, A., Postma, A. V., Smigiel,
R., Pesz, K., Niklinski, J., Chomczyk, M. A., Henneman, P., & Mannens, M. M. (2019).
DNA methylation abundantly associates with fetal alcohol spectrum disorder and its
subphenotypes. Epigenomics, 11(7), 767–785. https://doi.org/10.2217/epi-2018-0221
Cribbs, A., Feldmann, M., & Oppermann, U. (2015). Towards an understanding of the role of
DNA methylation in rheumatoid arthritis: Therapeutic and diagnostic implications.
Therapeutic Advances in Musculoskeletal Disease, 7(5), 206–219.
https://doi.org/10.1177/1759720X15598307
Cullell, N., Soriano-Tárraga, C., Gallego-Fábrega, C., Cárcel-Márquez, J., Torres-Águila, N. P.,
Muiño, E., Lledós, M., Llucià-Carol, L., Esteller, M., Moura, M. C. de, Montaner, J.,
Fernández-Sanlés, A., Elosua, R., Delgado, P., Martí-Fábregas, J., Krupinski, J., Roquer,
J., Jiménez-Conde, J., & Fernández-Cadenas, I. (2022). DNA Methylation and Ischemic
Stroke Risk: An Epigenome-Wide Association Study. Thrombosis and Haemostasis,
1767–1778. https://doi.org/10.1055/s-0042-1749328
Czesnikiewicz-Guzik, M., Lee, W.-W., Cui, D., Hiruma, Y., Lamar, D. L., Yang, Z.-Z.,
Ouslander, J. G., Weyand, C. M., & Goronzy, J. J. (2008). T cell subset-specific
susceptibility to aging. Clinical Immunology (Orlando, Fla.), 127(1), 107–118.
https://doi.org/10.1016/j.clim.2007.12.002
Davalos, V., García-Prieto, C. A., Ferrer, G., Aguilera-Albesa, S., Valencia-Ramos, J.,
Rodríguez-Palmero, A., Ruiz, M., Planas-Serra, L., Jordan, I., Alegría, I., Flores-Pérez,
P., Cantarín, V., Fumadó, V., Viadero, M. T., Rodrigo, C., Méndez-Hernández, M.,
López-Granados, E., Colobran, R., Rivière, J. G., … Esteller, M. (2022). Epigenetic
profiling linked to multisystem inflammatory syndrome in children (MIS-C): A
multicenter, retrospective study. eClinicalMedicine, 50.
https://doi.org/10.1016/j.eclinm.2022.101515
Davis, M. M., Tato, C. M., & Furman, D. (2017). Systems immunology: Just getting started.
Nature Immunology, 18(7), 725–732. https://doi.org/10.1038/ni.3768
de la Fuente, A. G., Dittmer, M., Heesbeen, E. J., de la Vega Gallardo, N., White, J. A., Young,
A., McColgan, T., Dashwood, A., Mayne, K., Cabeza-Fernández, S., Falconer, J.,
Rodriguez-Baena, F. J., McMurran, C. E., Inayatullah, M., Rawji, K. S., Franklin, R. J.
M., Dooley, J., Liston, A., Ingram, R. J., … Fitzgerald, D. C. (2024). Ageing impairs the
regenerative capacity of regulatory T cells in mouse central nervous system
remyelination. Nature Communications, 15(1), 1870. https://doi.org/10.1038/s41467-024-45742-w
Desdín-Micó, G., Soto-Heredero, G., Aranda, J. F., Oller, J., Carrasco, E., Gabandé-Rodríguez,
E., Blanco, E. M., Alfranca, A., Cussó, L., Desco, M., Ibañez, B., Gortazar, A. R.,
Fernández-Marcos, P., Navarro, M. N., Hernaez, B., Alcamí, A., Baixauli, F., &
Mittelbrunn, M. (2020). T cells with dysfunctional mitochondria induce multimorbidity
and premature senescence. Science, 368(6497), 1371–1376.
https://doi.org/10.1126/science.aax0860
Douek, D. C., McFarland, R. D., Keiser, P. H., Gage, E. A., Massey, J. M., Haynes, B. F., Polis,
M. A., Haase, A. T., Feinberg, M. B., Sullivan, J. L., Jamieson, B. D., Zack, J. A.,
Picker, L. J., & Koup, R. A. (1998). Changes in thymic function with age and during the
treatment of HIV infection. Nature, 396(6712), Article 6712.
https://doi.org/10.1038/25374
Duan, R., Fu, Q., Sun, Y., & Li, Q. (2022). Epigenetic clock: A promising biomarker and
practical tool in aging. Ageing Research Reviews, 81, 101743.
https://doi.org/10.1016/j.arr.2022.101743
Ehrlich, M. (2002). DNA methylation in cancer: Too much, but also too little. Oncogene, 21(35),
Article 35. https://doi.org/10.1038/sj.onc.1205651
Elhamamsy, A. R. (2017). Role of DNA methylation in imprinting disorders: An updated review.
Journal of Assisted Reproduction and Genetics, 34(5), 549–562.
https://doi.org/10.1007/s10815-017-0895-5
Estupiñán-Moreno, E., Ortiz-Fernández, L., Li, T., Hernández-Rodríguez, J., Ciudad, L., Andrés-León, E., Terron-Camero, L. C., Prieto-González, S., Espígol-Frigolé, G., Cid, M. C.,
Márquez, A., Ballestar, E., & Martín, J. (2022). Methylome and transcriptome profiling
of giant cell arteritis monocytes reveals novel pathways involved in disease pathogenesis
and molecular response to glucocorticoids. Annals of the Rheumatic Diseases, 81(9),
1290–1300. https://doi.org/10.1136/annrheumdis-2022-222156
Feng, Z., Peng, C., Li, D., Zhang, D., Li, X., Cui, F., Chen, Y., & He, Q. (2018). E2F3 promotes
cancer growth and is overexpressed through copy number variation in human melanoma.
OncoTargets and Therapy, 11, 5303–5313. https://doi.org/10.2147/OTT.S174103
Ferrucci, L., & Fabbri, E. (2018). Inflammageing: Chronic inflammation in ageing,
cardiovascular disease, and frailty. Nature Reviews. Cardiology, 15(9), 505–522.
https://doi.org/10.1038/s41569-018-0064-2
Fortin, J.-P., Labbe, A., Lemire, M., Zanke, B. W., Hudson, T. J., Fertig, E. J., Greenwood, C.
M., & Hansen, K. D. (2014). Functional normalization of 450k methylation array data
improves replication in large cancer studies. Genome Biology, 15(11), 503.
https://doi.org/10.1186/s13059-014-0503-2
Foster, A. D., Sivarapatna, A., & Gress, R. E. (2011). The aging immune system and its
relationship with cancer. Aging Health, 7(5), 707–718. https://doi.org/10.2217/ahe.11.56
Fransquet, P. D., Wrigglesworth, J., Woods, R. L., Ernst, M. E., & Ryan, J. (2019). The
epigenetic clock as a predictor of disease and mortality risk: A systematic review and
meta-analysis. Clinical Epigenetics, 11(1), 62. https://doi.org/10.1186/s13148-019-0656-7
Friedman, J., Hastie, T., Tibshirani, R., Narasimhan, B., Tay, K., Simon, N., Qian, J., & Yang, J.
(2022). glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models (4.1-6)
[Computer software]. https://CRAN.R-project.org/package=glmnet
Fries, G. R., Bauer, I. E., Scaini, G., Valvassori, S. S., Walss-Bass, C., Soares, J. C., & Quevedo,
J. (2020). Accelerated hippocampal biological aging in bipolar disorder. Bipolar
Disorders, 22(5), 498–507. https://doi.org/10.1111/bdi.12876
Garaud, S., Roufosse, F., De Silva, P., Gu-Trantien, C., Lodewyckx, J.-N., Duvillier, H.,
Dedeurwaerder, S., Bizet, M., Defrance, M., Fuks, F., Bex, F., & Willard-Gallo, K.
(2017). FOXP1 is a regulator of quiescence in healthy human CD4+ T cells and is
constitutively repressed in T cells from patients with lymphoproliferative disorders.
European Journal of Immunology, 47(1), 168–179. https://doi.org/10.1002/eji.201646373
Garcia-Prieto, C. A., Villanueva, L., Bueno-Costa, A., Davalos, V., González-Navarro, E. A.,
Juan, M., Urbano-Ispizua, Á., Delgado, J., Ortiz-Maldonado, V., del Bufalo, F., Locatelli,
F., Quintarelli, C., Sinibaldi, M., Soler, M., Castro de Moura, M., Ferrer, G., Urdinguio,
R. G., Fernandez, A. F., Fraga, M. F., … Esteller, M. (2022). Epigenetic Profiling and
Response to CD19 Chimeric Antigen Receptor T-Cell Therapy in B-Cell Malignancies.
JNCI: Journal of the National Cancer Institute, 114(3), 436–445.
https://doi.org/10.1093/jnci/djab194
Gasparoni, G., Bultmann, S., Lutsik, P., Kraus, T. F. J., Sordon, S., Vlcek, J., Dietinger, V.,
Steinmaurer, M., Haider, M., Mulholland, C. B., Arzberger, T., Roeber, S.,
Riemenschneider, M., Kretzschmar, H. A., Giese, A., Leonhardt, H., & Walter, J. (2018).
DNA methylation analysis on purified neurons and glia dissects age and Alzheimer’s
disease-specific changes in the human cortex. Epigenetics & Chromatin, 11(1), 41.
https://doi.org/10.1186/s13072-018-0211-3
Genovese, G., Kähler, A. K., Handsaker, R. E., Lindberg, J., Rose, S. A., Bakhoum, S. F.,
Chambert, K., Mick, E., Neale, B. M., Fromer, M., Purcell, S. M., Svantesson, O.,
Landén, M., Höglund, M., Lehmann, S., Gabriel, S. B., Moran, J. L., Lander, E. S.,
Sullivan, P. F., … McCarroll, S. A. (2014). Clonal Hematopoiesis and Blood-Cancer
Risk Inferred from Blood DNA Sequence. New England Journal of Medicine, 371(26),
2477–2487. https://doi.org/10.1056/NEJMoa1409405
Gnyszka, A., Jastrzębski, Z., & Flis, S. (2013). DNA Methyltransferase Inhibitors and Their
Emerging Role in Epigenetic Therapy of Cancer. Anticancer Research, 33(8), 2989–2996.
Gopalan, S., Carja, O., Fagny, M., Patin, E., Myrick, J. W., McEwen, L. M., Mah, S. M., Kobor,
M. S., Froment, A., Feldman, M. W., Quintana-Murci, L., & Henn, B. M. (2017). Trends
in DNA Methylation with Age Replicate Across Diverse Human Populations. Genetics,
206(3), 1659–1674. https://doi.org/10.1534/genetics.116.195594
Goronzy, J. J., Fang, F., Cavanagh, M. M., Qi, Q., & Weyand, C. M. (2015). Naïve T cell
maintenance and function in human aging. Journal of Immunology (Baltimore, Md.:
1950), 194(9), 4073–4080. https://doi.org/10.4049/jimmunol.1500046
Goronzy, J. J., & Weyand, C. M. (2012). Immune aging and autoimmunity. Cellular and
Molecular Life Sciences: CMLS, 69(10), 1615–1623. https://doi.org/10.1007/s00018-012-0970-0
Guintivano, J., Aryee, M. J., & Kaminsky, Z. A. (2013). A cell epigenotype specific model for
the correction of brain cellular heterogeneity bias and its application to age, brain region
and major depression. Epigenetics, 8(3), 290–302. https://doi.org/10.4161/epi.23924
Gustafson, C. E., Kim, C., Weyand, C. M., & Goronzy, J. J. (2020). Influence of immune aging
on vaccine responses. The Journal of Allergy and Clinical Immunology, 145(5), 1309–1321. https://doi.org/10.1016/j.jaci.2020.03.017
Hannon, E., Dempster, E. L., Mansell, G., Burrage, J., Bass, N., Bohlken, M. M., Corvin, A.,
Curtis, C. J., Dempster, D., Di Forti, M., Dinan, T. G., Donohoe, G., Gaughran, F., Gill,
M., Gillespie, A., Gunasinghe, C., Hulshoff, H. E., Hultman, C. M., Johansson, V., …
Mill, J. (2021). DNA methylation meta-analysis reveals cellular alterations in psychosis
and markers of treatment-resistant schizophrenia. eLife, 10, e58430.
https://doi.org/10.7554/eLife.58430
Hannum, G., Guinney, J., Zhao, L., Zhang, L., Hughes, G., Sadda, S., Klotzle, B., Bibikova, M.,
Fan, J.-B., Gao, Y., Deconde, R., Chen, M., Rajapakse, I., Friend, S., Ideker, T., &
Zhang, K. (2013). Genome-wide Methylation Profiles Reveal Quantitative Views of
Human Aging Rates. Molecular Cell, 49(2), 359–367.
https://doi.org/10.1016/j.molcel.2012.10.016
Hashimoto, K., Kouno, T., Ikawa, T., Hayatsu, N., Miyajima, Y., Yabukami, H., Terooatea, T.,
Sasaki, T., Suzuki, T., Valentine, M., Pascarella, G., Okazaki, Y., Suzuki, H., Shin, J. W.,
Minoda, A., Taniuchi, I., Okano, H., Arai, Y., Hirose, N., & Carninci, P. (2019). Single-cell transcriptomics reveals expansion of cytotoxic CD4 T cells in supercentenarians.
Proceedings of the National Academy of Sciences of the United States of America,
116(48), 24242–24251. https://doi.org/10.1073/pnas.1907883116
Hastie, T., Tibshirani, R., Narasimhan, B., & Chu, G. (2023). impute: Imputation for microarray
data (1.72.3) [Computer software]. Bioconductor version: Release (3.16).
https://doi.org/10.18129/B9.bioc.impute
Haynes, L. (2020). Aging of the Immune System: Research Challenges to Enhance the Health
Span of Older Adults. Frontiers in Aging, 1.
https://www.frontiersin.org/articles/10.3389/fragi.2020.602108
Hearn, N. L., Chiu, C. L., & Lind, J. M. (2020). Comparison of DNA methylation profiles from
saliva in Coeliac disease and non-coeliac disease individuals. BMC Medical Genomics,
13(1), 16. https://doi.org/10.1186/s12920-020-0670-9
Hedrich, C. M., Mäbert, K., Rauen, T., & Tsokos, G. C. (2017). DNA methylation in systemic
lupus erythematosus. Epigenomics, 9(4), 505–525. https://doi.org/10.2217/epi-2016-0096
Hedrick, E., Cheng, Y., Jin, U.-H., Kim, K., & Safe, S. (2016). Specificity protein (Sp)
transcription factors Sp1, Sp3 and Sp4 are non-oncogene addiction genes in cancer cells.
Oncotarget, 7(16), 22245–22256. https://doi.org/10.18632/oncotarget.7925
Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y. C., Laslo, P., Cheng, J. X., Murre, C.,
Singh, H., & Glass, C. K. (2010). Simple Combinations of Lineage-Determining
Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B
Cell Identities. Molecular Cell, 38(4), 576–589.
https://doi.org/10.1016/j.molcel.2010.05.004
Higgins-Chen, A., Thrush, K., Hu-Seliger, T., Wang, Y., Hagg, S., & Levine, M. (2021). A
Computational Solution to Bolster Epigenetic Clock Reliability for Clinical Trials and
Longitudinal Tracking. Innovation in Aging, 5(Suppl 1), 5.
https://doi.org/10.1093/geroni/igab046.015
Hong, S. R., Jung, S.-E., Lee, E. H., Shin, K.-J., Yang, W. I., & Lee, H. Y. (2017). DNA
methylation-based age prediction from saliva: High age predictability by combination of
7 CpG markers. Forensic Science International: Genetics, 29, 118–125.
https://doi.org/10.1016/j.fsigen.2017.04.006
Horvath, S. (2013). DNA methylation age of human tissues and cell types. Genome Biology,
14(10), R115. https://doi.org/10.1186/gb-2013-14-10-r115
Horvath, S., Erhart, W., Brosch, M., Ammerpohl, O., von Schönfels, W., Ahrens, M., Heits, N.,
Bell, J. T., Tsai, P.-C., Spector, T. D., Deloukas, P., Siebert, R., Sipos, B., Becker, T.,
Röcken, C., Schafmayer, C., & Hampe, J. (2014). Obesity accelerates epigenetic aging of
human liver. Proceedings of the National Academy of Sciences, 111(43), 15538–15543.
https://doi.org/10.1073/pnas.1412759111
Horvath, S., Gurven, M., Levine, M. E., Trumble, B. C., Kaplan, H., Allayee, H., Ritz, B. R.,
Chen, B., Lu, A. T., Rickabaugh, T. M., Jamieson, B. D., Sun, D., Li, S., Chen, W.,
Quintana-Murci, L., Fagny, M., Kobor, M. S., Tsao, P. S., Reiner, A. P., … Assimes, T.
L. (2016). An epigenetic clock analysis of race/ethnicity, sex, and coronary heart disease.
Genome Biology, 17(1), 171. https://doi.org/10.1186/s13059-016-1030-0
Horvath, S., & Levine, A. J. (2015). HIV-1 Infection Accelerates Age According to the
Epigenetic Clock. The Journal of Infectious Diseases, 212(10), 1563–1573.
https://doi.org/10.1093/infdis/jiv277
Horvath, S., Oshima, J., Martin, G. M., Lu, A. T., Quach, A., Cohen, H., Felton, S., Matsuyama,
M., Lowe, D., Kabacik, S., Wilson, J. G., Reiner, A. P., Maierhofer, A., Flunkert, J.,
Aviv, A., Hou, L., Baccarelli, A. A., Li, Y., Stewart, J. D., … Raj, K. (2018). Epigenetic
clock for skin and blood cells applied to Hutchinson Gilford Progeria Syndrome and ex
vivo studies. Aging, 10(7), 1758–1775. https://doi.org/10.18632/aging.101508
Horvath, S., & Raj, K. (2018). DNA methylation-based biomarkers and the epigenetic clock
theory of ageing. Nature Reviews Genetics, 19(6), Article 6.
https://doi.org/10.1038/s41576-018-0004-3
Houseman, E. A., Accomando, W. P., Koestler, D. C., Christensen, B. C., Marsit, C. J., Nelson,
H. H., Wiencke, J. K., & Kelsey, K. T. (2012). DNA methylation arrays as surrogate
measures of cell mixture distribution. BMC Bioinformatics, 13, 86.
https://doi.org/10.1186/1471-2105-13-86
Huynh, J. L., Garg, P., Thin, T. H., Yoo, S., Dutta, R., Trapp, B. D., Haroutunian, V., Zhu, J.,
Donovan, M. J., Sharp, A. J., & Casaccia, P. (2014). Epigenome-wide differences in
pathology-free regions of multiple sclerosis–affected brains. Nature Neuroscience, 17(1),
Article 1. https://doi.org/10.1038/nn.3588
Ishak, M., Baharudin, R., Mohamed Rose, I., Sagap, I., Mazlan, L., Mohd Azman, Z. A., Abu,
N., Jamal, R., Lee, L.-H., & Ab Mutalib, N. S. (2020). Genome-Wide Open Chromatin
Methylome Profiles in Colorectal Cancer. Biomolecules, 10(5), Article 5.
https://doi.org/10.3390/biom10050719
Islam, S. A., Goodman, S. J., MacIsaac, J. L., Obradović, J., Barr, R. G., Boyce, W. T., & Kobor,
M. S. (2019). Integration of DNA methylation patterns and genetic variation in human
pediatric tissues help inform EWAS design and interpretation. Epigenetics & Chromatin,
12(1), 1. https://doi.org/10.1186/s13072-018-0245-6
Jenkins, T., Aston, K., Carrell, D., DeVilbiss, E., Sjaarda, L., Perkins, N., Mills, J. L., Chen, Z.,
Sparks, A., Clemons, T., Chaney, K., Peterson, C. M., Emery, B., Hotaling, J., Johnstone,
E., Schisterman, E., & Mumford, S. L. (2022). The impact of zinc and folic acid
supplementation on sperm DNA methylation: Results from the folic acid and zinc
supplementation randomized clinical trial (FAZST). Fertility and Sterility, 117(1), 75–85.
https://doi.org/10.1016/j.fertnstert.2021.09.009
Jiang, W., Liu, N., Chen, X.-Z., Sun, Y., Li, B., Ren, X.-Y., Qin, W.-F., Jiang, N., Xu, Y.-F., Li,
Y.-Q., Ren, J., Cho, W. C., Yun, J.-P., Zeng, J., Liu, L.-Z., Li, L., Guo, Y., Mai, H.-Q.,
Zeng, M.-S., … Ma, J. (2015). Genome-Wide Identification of a Methylation Gene Panel
as a Prognostic Biomarker in Nasopharyngeal Carcinoma. Molecular Cancer
Therapeutics, 14(12), 2864–2873. https://doi.org/10.1158/1535-7163.MCT-15-0260
Jin, B., & Robertson, K. D. (2013). DNA Methyltransferases (DNMTs), DNA Damage Repair,
and Cancer. Advances in Experimental Medicine and Biology, 754, 3–29.
https://doi.org/10.1007/978-1-4419-9967-2_1
Jin, Z., & Liu, Y. (2018). DNA methylation in human diseases. Genes & Diseases, 5(1), 1–8.
https://doi.org/10.1016/j.gendis.2018.01.002
Johnson, K. C., Houseman, E. A., King, J. E., & Christensen, B. C. (2017). Normal breast tissue
DNA methylation differences at regulatory elements are associated with the cancer risk
factor age. Breast Cancer Research, 19(1), 81. https://doi.org/10.1186/s13058-017-0873-y
Johnstone, S. E., Gladyshev, V. N., Aryee, M. J., & Bernstein, B. E. (2022). Epigenetic clocks,
aging, and cancer. Science, 378(6626), 1276–1277.
https://doi.org/10.1126/science.abn4009
Jonkman, T. H., Dekkers, K. F., Slieker, R. C., Grant, C. D., Ikram, M. A., van Greevenbroek,
M. M. J., Franke, L., Veldink, J. H., Boomsma, D. I., Slagboom, P. E., BIOS Consortium,
& Heijmans, B. T. (2022). Functional genomics analysis identifies T and NK cell
activation as a driver of epigenetic clock progression. Genome Biology, 23, 24.
https://doi.org/10.1186/s13059-021-02585-8
Kabacik, S., Lowe, D., Fransen, L., Leonard, M., Ang, S.-L., Whiteman, C., Corsi, S., Cohen, H.,
Felton, S., Bali, R., Horvath, S., & Raj, K. (2022). The relationship between epigenetic
age and the hallmarks of aging in human cells. Nature Aging, 2(6), Article 6.
https://doi.org/10.1038/s43587-022-00220-0
Kananen, L., Marttila, S., Nevalainen, T., Jylhävä, J., Mononen, N., Kähönen, M., Raitakari, O.
T., Lehtimäki, T., & Hurme, M. (2016). Aging-associated DNA methylation changes in
middle-aged individuals: The Young Finns study. BMC Genomics, 17(1), 103.
https://doi.org/10.1186/s12864-016-2421-z
Kandaswamy, R., Hannon, E., Arseneault, L., Mansell, G., Sugden, K., Williams, B., Burrage, J.,
Staley, J. R., Pishva, E., Dahir, A., Roberts, S., Danese, A., Mill, J., Fisher, H. L., &
Wong, C. C. Y. (2021). DNA methylation signatures of adolescent victimization:
Analysis of a longitudinal monozygotic twin sample. Epigenetics, 16(11), 1169–1186.
https://doi.org/10.1080/15592294.2020.1853317
Kasuga, Y., Kawai, T., Miyakoshi, K., Hori, A., Tamagawa, M., Hasegawa, K., Ikenoue, S.,
Ochiai, D., Saisho, Y., Hida, M., Tanaka, M., & Hata, K. (2022). DNA methylation
analysis of cord blood samples in neonates born to gestational diabetes mothers
diagnosed before 24 gestational weeks. BMJ Open Diabetes Research & Care, 10(1),
e002539. https://doi.org/10.1136/bmjdrc-2021-002539
Kho, M., Zhao, W., Ratliff, S. M., Ammous, F., Mosley, T. H., Shang, L., Kardia, S. L. R., Zhou,
X., & Smith, J. A. (2020). Epigenetic loci for blood pressure are associated with
hypertensive target organ damage in older African Americans from the genetic
epidemiology network of Arteriopathy (GENOA) study. BMC Medical Genomics, 13(1),
131. https://doi.org/10.1186/s12920-020-00791-0
Ki, S., Park, D., Selden, H. J., Seita, J., Chung, H., Kim, J., Iyer, V. R., & Ehrlich, L. I. R.
(2014). Global Transcriptional Profiling Reveals Distinct Functions of Thymic Stromal
Subsets and Age-Related Changes during Thymic Involution. Cell Reports, 9(1), 402–415. https://doi.org/10.1016/j.celrep.2014.08.070
Konigsberg, I. R., Barnes, B., Campbell, M., Davidson, E., Zhen, Y., Pallisard, O., Boorgula, M.
P., Cox, C., Nandy, D., Seal, S., Crooks, K., Sticca, E., Harrison, G. F., Hopkinson, A.,
Vest, A., Arnold, C. G., Kahn, M. G., Kao, D. P., Peterson, B. R., … Barnes, K. C.
(2021). Host methylation predicts SARS-CoV-2 infection and clinical outcome.
Communications Medicine, 1(1), Article 1. https://doi.org/10.1038/s43856-021-00042-y
Konopka, T. (2023). umap: Uniform Manifold Approximation and Projection (0.2.10.0)
[Computer software]. https://CRAN.R-project.org/package=umap
Kular, L., Liu, Y., Ruhrmann, S., Zheleznyakova, G., Marabita, F., Gomez-Cabrero, D., James,
T., Ewing, E., Lindén, M., Górnikiewicz, B., Aeinehband, S., Stridh, P., Link, J.,
Andlauer, T. F. M., Gasperi, C., Wiendl, H., Zipp, F., Gold, R., Tackenberg, B., …
Jagodic, M. (2018). DNA methylation as a mediator of HLA-DRB1*15:01 and a
protective variant in multiple sclerosis. Nature Communications, 9(1), Article 1.
https://doi.org/10.1038/s41467-018-04732-5
Lages, C. S., Lewkowich, I., Sproles, A., Wills-Karp, M., & Chougnet, C. (2010). Partial
restoration of T cell function in aged mice by in vitro blockade of the PD-1/PD-L1
pathway. Aging Cell, 9(5), 785–798. https://doi.org/10.1111/j.1474-9726.2010.00611.x
Langevin, S. M., Eliot, M., Butler, R. A., Cheong, A., Zhang, X., McClean, M. D., Koestler, D.
C., & Kelsey, K. T. (2015). CpG island methylation profile in non-invasive oral rinse
samples is predictive of oral and pharyngeal carcinoma. Clinical Epigenetics, 7(1), 125.
https://doi.org/10.1186/s13148-015-0160-7
Lazuardi, L., Jenewein, B., Wolf, A. M., Pfister, G., Tzankov, A., & Grubeck-Loebenstein, B.
(2005). Age-related loss of naïve T cells and dysregulation of T-cell/B-cell interactions in
human lymph nodes. Immunology, 114(1), 37–43. https://doi.org/10.1111/j.1365-2567.2004.02006.x
Levine, M. E., Higgins-Chen, A., Thrush, K., Minteer, C., & Niimi, P. (2022). Clock Work:
Deconstructing the Epigenetic Clock Signals in Aging, Disease, and Reprogramming (p.
2022.02.13.480245). bioRxiv. https://doi.org/10.1101/2022.02.13.480245
Levine, M. E., Lu, A. T., Quach, A., Chen, B. H., Assimes, T. L., Bandinelli, S., Hou, L.,
Baccarelli, A. A., Stewart, J. D., Li, Y., Whitsel, E. A., Wilson, J. G., Reiner, A. P., Aviv,
A., Lohman, K., Liu, Y., Ferrucci, L., & Horvath, S. (2018). An epigenetic biomarker of
aging for lifespan and healthspan. Aging (Albany NY), 10(4), 573–591.
https://doi.org/10.18632/aging.101414
Lewis, S. K., Nachun, D., Martin, M. G., Horvath, S., Coppola, G., & Jones, D. L. (2020). DNA
Methylation Analysis Validates Organoids as a Viable Model for Studying Human
Intestinal Aging. Cellular and Molecular Gastroenterology and Hepatology, 9(3), 527–541. https://doi.org/10.1016/j.jcmgh.2019.11.013
Li, M., Sun, X., Yao, H., Chen, W., Zhang, F., Gao, S., Zou, X., Chen, J., Qiu, S., Wei, H., Hu,
Z., & Chen, W. (2021). Genomic methylation variations predict the susceptibility of six
chemotherapy related adverse effects and cancer development for Chinese colorectal
cancer patients. Toxicology and Applied Pharmacology, 427, 115657.
https://doi.org/10.1016/j.taap.2021.115657
Li, M., Yao, D., Zeng, X., Kasakovski, D., Zhang, Y., Chen, S., Zha, X., Li, Y., & Xu, L. (2019).
Age related human T cell subset evolution and senescence. Immunity & Ageing, 16(1),
24. https://doi.org/10.1186/s12979-019-0165-8
Li, X., Li, C., Zhang, W., Wang, Y., Qian, P., & Huang, H. (2023). Inflammation and aging:
Signaling pathways and intervention therapies. Signal Transduction and Targeted
Therapy, 8(1), 1–29. https://doi.org/10.1038/s41392-023-01502-8
Li, X., Ploner, A., Wang, Y., Magnusson, P. K., Reynolds, C., Finkel, D., Pedersen, N. L.,
Jylhävä, J., & Hägg, S. (2020). Longitudinal trajectories, correlations and mortality
associations of nine biological ages across 20-years follow-up. eLife, 9, e51507.
https://doi.org/10.7554/eLife.51507
Lin, Y., Damjanovic, A., Metter, E. J., Nguyen, H., Truong, T., Najarro, K., Morris, C., Longo,
D. L., Zhan, M., Ferrucci, L., Hodes, R. J., & Weng, N. (2015). Age-associated telomere
attrition of lymphocytes in vivo is co-ordinated with changes in telomerase activity,
composition of lymphocyte subsets and health conditions. Clinical Science (London,
England: 1979), 128(6), 367–377. https://doi.org/10.1042/CS20140481
Liu, Y., Aryee, M. J., Padyukov, L., Fallin, M. D., Hesselberg, E., Runarsson, A., Reinius, L.,
Acevedo, N., Taub, M., Ronninger, M., Shchetynsky, K., Scheynius, A., Kere, J.,
Alfredsson, L., Klareskog, L., Ekström, T. J., & Feinberg, A. P. (2013). Epigenome-wide
association data implicate DNA methylation as an intermediary of genetic risk in
rheumatoid arthritis. Nature Biotechnology, 31(2), Article 2.
https://doi.org/10.1038/nbt.2487
López-Otín, C., Blasco, M. A., Partridge, L., Serrano, M., & Kroemer, G. (2013). The Hallmarks
of Aging. Cell, 153(6), 1194–1217. https://doi.org/10.1016/j.cell.2013.05.039
López-Otín, C., Blasco, M. A., Partridge, L., Serrano, M., & Kroemer, G. (2023). Hallmarks of
aging: An expanding universe. Cell, 186(2), 243–278.
https://doi.org/10.1016/j.cell.2022.11.001
Lowe, D., Horvath, S., & Raj, K. (2016). Epigenetic clock analyses of cellular senescence and
ageing. Oncotarget, 7(8), 8524–8531. https://doi.org/10.18632/oncotarget.7383
Lu, A. T., Quach, A., Wilson, J. G., Reiner, A. P., Aviv, A., Raj, K., Hou, L., Baccarelli, A. A.,
Li, Y., Stewart, J. D., Whitsel, E. A., Assimes, T. L., Ferrucci, L., & Horvath, S. (2019).
DNA methylation GrimAge strongly predicts lifespan and healthspan. Aging, 11(2), 303–327. https://doi.org/10.18632/aging.101684
Lunnon, K., Smith, R., Hannon, E., De Jager, P. L., Srivastava, G., Volta, M., Troakes, C., Al-Sarraj, S., Burrage, J., Macdonald, R., Condliffe, D., Harries, L. W., Katsel, P.,
Haroutunian, V., Kaminsky, Z., Joachim, C., Powell, J., Lovestone, S., Bennett, D. A., …
Mill, J. (2014). Methylomic profiling implicates cortical deregulation of ANK1 in
Alzheimer’s disease. Nature Neuroscience, 17(9), Article 9.
https://doi.org/10.1038/nn.3782
Lynch, H. E., Goldberg, G. L., Chidgey, A., van den Brink, M. R. M., Boyd, R., & Sempowski, G.
D. (2009). Thymic involution and immune reconstitution. Trends in Immunology, 30(7),
366–373. https://doi.org/10.1016/j.it.2009.04.003
Magnaye, K. M., Clay, S. M., Nicodemus-Johnson, J., Naughton, K. A., Huffman, J., Altman, M.
C., Jackson, D. J., Gern, J. E., Hogarth, D. K., Naureckas, E. T., White, S. R., & Ober, C.
(2022). DNA methylation signatures in airway cells from adult children of asthmatic
mothers reflect subtypes of severe asthma. Proceedings of the National Academy of
Sciences, 119(24), e2116467119. https://doi.org/10.1073/pnas.2116467119
Mantovani, N., Defelicibus, A., da Silva, I. T., Cicero, M. F., Santana, L. C., Arnold, R., de
Castro, D. F., Duro, R. L. S., Nishiyama-Jr, M. Y., Junqueira-de-Azevedo, I. L. M., da
Silva, B. C. M., da Silva Duarte, A. J., Casseb, J., de Barros Tenore, S., Hunter, J., Diaz,
R. S., & Komninakis, S. C. V. (2021). Latency-associated DNA methylation patterns
among HIV-1 infected individuals with distinct disease progression courses or
antiretroviral virologic response. Scientific Reports, 11(1), Article 1.
https://doi.org/10.1038/s41598-021-02463-0
Martino, D., Loke, Y. J., Gordon, L., Ollikainen, M., Cruickshank, M. N., Saffery, R., & Craig,
J. M. (2013). Longitudinal, genome-scale analysis of DNA methylation in twins from
birth to 18 months of age reveals rapid epigenetic change in early life and pair-specific
effects of discordance. Genome Biology, 14(5), R42. https://doi.org/10.1186/gb-2013-14-5-r42
Martino, D., Neeland, M., Dang, T., Cobb, J., Ellis, J., Barnett, A., Tang, M., Vuillermin, P.,
Allen, K., & Saffery, R. (2018). Epigenetic dysregulation of naive CD4+ T-cell activation
genes in childhood food allergy. Nature Communications, 9(1), Article 1.
https://doi.org/10.1038/s41467-018-05608-4
Mathsyaraja, H., Catchpole, J., Freie, B., Eastwood, E., Babaeva, E., Geuenich, M., Cheng, P. F.,
Ayers, J., Yu, M., Wu, N., Moorthi, S., Poudel, K. R., Koehne, A., Grady, W., Houghton,
A. M., Berger, A. H., Shiio, Y., MacPherson, D., & Eisenman, R. N. (2021). Loss of
MGA repression mediated by an atypical polycomb complex promotes tumor progression
and invasiveness. eLife, 10, e64212. https://doi.org/10.7554/eLife.64212
McCrory, C., Fiorito, G., Hernandez, B., Polidoro, S., O’Halloran, A. M., Hever, A., Ni
Cheallaigh, C., Lu, A. T., Horvath, S., Vineis, P., & Kenny, R. A. (2021). GrimAge
Outperforms Other Epigenetic Clocks in the Prediction of Age-Related Clinical
Phenotypes and All-Cause Mortality. The Journals of Gerontology. Series A, Biological
Sciences and Medical Sciences, 76(5), 741–749. https://doi.org/10.1093/gerona/glaa286
McEwen, L. M., Jones, M. J., Lin, D. T. S., Edgar, R. D., Husquin, L. T., MacIsaac, J. L.,
Ramadori, K. E., Morin, A. M., Rider, C. F., Carlsten, C., Quintana-Murci, L., Horvath,
S., & Kobor, M. S. (2018). Systematic evaluation of DNA methylation age estimation
with common preprocessing methods and the Infinium MethylationEPIC BeadChip array.
Clinical Epigenetics, 10(1), 123. https://doi.org/10.1186/s13148-018-0556-2
Medvedeva, Y. A., Khamis, A. M., Kulakovskiy, I. V., Ba-Alawi, W., Bhuyan, M. S. I., Kawaji,
H., Lassmann, T., Harbers, M., Forrest, A. R., Bajic, V. B., & The FANTOM consortium.
(2014). Effects of cytosine methylation on transcription factor binding sites. BMC
Genomics, 15(1), 119. https://doi.org/10.1186/1471-2164-15-119
Mitchell, E., Spencer Chapman, M., Williams, N., Dawson, K. J., Mende, N., Calderbank, E. F.,
Jung, H., Mitchell, T., Coorens, T. H. H., Spencer, D. H., Machado, H., Lee-Six, H.,
Davies, M., Hayler, D., Fabre, M. A., Mahbubani, K., Abascal, F., Cagan, A., Vassiliou,
G. S., … Campbell, P. J. (2022). Clonal dynamics of haematopoiesis across the human
lifespan. Nature, 606(7913), Article 7913. https://doi.org/10.1038/s41586-022-04786-y
Moore, L. D., Le, T., & Fan, G. (2013). DNA Methylation and Its Basic Function.
Neuropsychopharmacology, 38(1), Article 1. https://doi.org/10.1038/npp.2012.112
Müller, L., Di Benedetto, S., & Pawelec, G. (2019). The Immune System and Its Dysregulation
with Aging. Sub-Cellular Biochemistry, 91, 21–43.
https://doi.org/10.1007/978-981-13-3681-2_2
Muse, M. E., Bergman, D. T., Salas, L. A., Tom, L. N., Tan, J.-M., Laino, A., Lambie, D.,
Sturm, R. A., Schaider, H., Soyer, H. P., Christensen, B. C., & Stark, M. S. (2022).
Genome-Scale DNA Methylation Analysis Identifies Repeat Element Alterations that
Modulate the Genomic Stability of Melanocytic Nevi. Journal of Investigative
Dermatology, 142(7), 1893-1902.e7. https://doi.org/10.1016/j.jid.2021.11.025
Nicodemus-Johnson, J., Myers, R. A., Sakabe, N. J., Sobreira, D. R., Hogarth, D. K., Naureckas,
E. T., Sperling, A. I., Solway, J., White, S. R., Nobrega, M. A., Nicolae, D. L., Gilad, Y.,
& Ober, C. (2016). DNA methylation in lung cells is associated with asthma endotypes
and genetic risk. JCI Insight, 1(20). https://doi.org/10.1172/jci.insight.90151
Nikolich-Žugich, J. (2018). The twilight of immunity: Emerging concepts in aging of the
immune system. Nature Immunology, 19(1), 10–19.
https://doi.org/10.1038/s41590-017-0006-x
Nonino, C., Noronha, N. Y., Nicoletti, C. F., & Pinhel, M. A. (2021). Trait related and differential
DNA Methylation in obese and normal weight Brazilian women. Gene Expression
Omnibus. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE166611
Noroozi, R., Ghafouri-Fard, S., Pisarek, A., Rudnicka, J., Spólnicka, M., Branicki, W., Taheri,
M., & Pośpiech, E. (2021). DNA methylation-based age clocks: From age prediction to
age reversion. Ageing Research Reviews, 68, 101314.
https://doi.org/10.1016/j.arr.2021.101314
Oblak, L., van der Zaag, J., Higgins-Chen, A. T., Levine, M. E., & Boks, M. P. (2021). A
systematic review of biological, social and environmental factors associated with
epigenetic clock acceleration. Ageing Research Reviews, 69, 101348.
https://doi.org/10.1016/j.arr.2021.101348
Oelsner, K. T., Guo, Y., To, S. B.-C., Non, A. L., & Barkin, S. L. (2017). Maternal BMI as a
predictor of methylation of obesity-related genes in saliva samples from preschool-age
Hispanic children at-risk for obesity. BMC Genomics, 18, 57.
https://doi.org/10.1186/s12864-016-3473-9
Ohnuki, M., Tanabe, K., Sutou, K., Teramoto, I., Sawamura, Y., Narita, M., Nakamura, M.,
Tokunaga, Y., Nakamura, M., Watanabe, A., Yamanaka, S., & Takahashi, K. (2014).
Dynamic regulation of human endogenous retroviruses mediates factor-induced
reprogramming and differentiation potential. Proceedings of the National Academy of
Sciences of the United States of America, 111(34), 12426–12431.
https://doi.org/10.1073/pnas.1413299111
Oliva, M., Demanelis, K., Lu, Y., Chernoff, M., Jasmine, F., Ahsan, H., Kibriya, M. G., Chen, L.
S., & Pierce, B. L. (2023). DNA methylation QTL mapping across diverse human tissues
provides molecular links between genetic variation and complex traits. Nature Genetics,
55(1), Article 1. https://doi.org/10.1038/s41588-022-01248-z
Pai, S., Li, P., Killinger, B., Marshall, L., Jia, P., Liao, J., Petronis, A., Szabó, P. E., & Labrie, V.
(2019). Differential methylation of enhancer at IGF2 is associated with abnormal
dopamine synthesis in major psychosis. Nature Communications, 10(1), Article 1.
https://doi.org/10.1038/s41467-019-09786-7
Pantano, L., Hutchinson, J., Barrera, V., Piper, M., Khetani, R., Daily, K., Perumal, T. M.,
Kirchner, R., & Steinbaugh, M. (2023). DEGreport: Report of DEG analysis (1.34.0)
[Computer software]. Bioconductor version: Release (3.16).
https://doi.org/10.18129/B9.bioc.DEGreport
Pelegri-Siso, D., & Gonzalez, J. R. (2023). Methylclock—DNA methylation-based clocks (1.4.0)
[Computer software]. Bioconductor version: Release (3.16).
https://doi.org/10.18129/B9.bioc.methylclock
Pihlstrøm, L., Shireby, G., Geut, H., Henriksen, S. P., Rozemuller, A. J. M., Tunold, J.-A.,
Hannon, E., Francis, P., Thomas, A. J., Love, S., Mill, J., van de Berg, W. D. J., & Toft,
M. (2022). Epigenome-wide association study of human frontal cortex identifies
differential methylation in Lewy body pathology. Nature Communications, 13(1), Article
1. https://doi.org/10.1038/s41467-022-32619-z
Pitaksalee, R., Burska, A. N., Ajaib, S., Rogers, J., Parmar, R., Mydlova, K., Xie, X., Droop, A.,
Nijjar, J. S., Chambers, P., Emery, P., Hodgett, R., McInnes, I. B., & Ponchel, F. (2020).
Differential CpG DNA methylation in peripheral naïve CD4+ T-cells in early rheumatoid
arthritis patients. Clinical Epigenetics, 12(1), 54.
https://doi.org/10.1186/s13148-020-00837-1
Policicchio, S., Washer, S., Viana, J., Iatrou, A., Burrage, J., Hannon, E., Turecki, G., Kaminsky,
Z., Mill, J., Dempster, E. L., & Murphy, T. M. (2020). Genome-wide DNA methylation
meta-analysis in the brains of suicide completers. Translational Psychiatry, 10(1), Article
1. https://doi.org/10.1038/s41398-020-0752-7
Protsenko, E., Yang, R., Nier, B., Reus, V., Hammamieh, R., Rampersaud, R., Wu, G. W. Y.,
Hough, C. M., Epel, E., Prather, A. A., Jett, M., Gautam, A., Mellon, S. H., &
Wolkowitz, O. M. (2021). “GrimAge,” an epigenetic predictor of mortality, is accelerated
in major depressive disorder. Translational Psychiatry, 11(1), 1–9.
https://doi.org/10.1038/s41398-021-01302-0
Quinn, K. M., Palchaudhuri, R., Palmer, C. S., & La Gruta, N. L. (2019). The clock is ticking:
The impact of ageing on T cell metabolism. Clinical & Translational Immunology, 8(11),
e01091. https://doi.org/10.1002/cti2.1091
Rasmussen, K. D., & Helin, K. (2016). Role of TET enzymes in DNA methylation,
development, and cancer. Genes & Development, 30(7), 733–750.
https://doi.org/10.1101/gad.276568.115
Reinius, L. E., Acevedo, N., Joerink, M., Pershagen, G., Dahlén, S.-E., Greco, D., Söderhäll, C.,
Scheynius, A., & Kere, J. (2012). Differential DNA Methylation in Purified Human
Blood Cells: Implications for Cell Lineage and Studies on Disease Susceptibility. PLOS
ONE, 7(7), e41361. https://doi.org/10.1371/journal.pone.0041361
Reitsema, R. D., Hid Cadena, R., Nijhof, S. H., Abdulahad, W. H., Huitema, M. G., Paap, D.,
Brouwer, E., Boots, A. M. H., & Heeringa, P. (2020). Effect of age and sex on immune
checkpoint expression and kinetics in human T cells. Immunity & Ageing, 17(1), 32.
https://doi.org/10.1186/s12979-020-00203-y
Renauer, P. A., Coit, P., & Sawalha, A. H. (2015). The DNA methylation signature of human
TCRαβ+CD4−CD8− double negative T cells reveals CG demethylation and a unique
epigenetic architecture permissive to a broad stimulatory immune response. Clinical
Immunology, 156(1), 19–27. https://doi.org/10.1016/j.clim.2014.10.007
Ringh, M. V., Hagemann-Jensen, M., Needhamsen, M., Kular, L., Breeze, C. E., Sjöholm, L. K.,
Slavec, L., Kullberg, S., Wahlström, J., Grunewald, J., Brynedal, B., Liu, Y., Almgren,
M., Jagodic, M., Öckinger, J., & Ekström, T. J. (2019). Tobacco smoking induces
changes in true DNA methylation, hydroxymethylation and gene expression in
bronchoalveolar lavage cells. eBioMedicine, 46, 290–304.
https://doi.org/10.1016/j.ebiom.2019.07.006
Ringh, M. V., Hagemann-Jensen, M., Needhamsen, M., Kullberg, S., Wahlström, J., Grunewald,
J., Brynedal, B., Jagodic, M., Ekström, T. J., Öckinger, J., & Kular, L. (2021).
Methylome and transcriptome signature of bronchoalveolar cells from multiple sclerosis
patients in relation to smoking. Multiple Sclerosis Journal, 27(7), 1014–1026.
https://doi.org/10.1177/1352458520943768
Robertson, K. D. (2005). DNA methylation and human disease. Nature Reviews Genetics, 6(8),
Article 8. https://doi.org/10.1038/nrg1655
Rocha, R., & Henrique, R. (2022). Insulinoma-Associated Protein 1 (INSM1): Diagnostic,
Prognostic, and Therapeutic Use in Small Cell Lung Cancer. Journal of Molecular
Pathology, 3(3), Article 3. https://doi.org/10.3390/jmp3030013
Rodriguez, R. M., Suarez-Alvarez, B., Lavín, J. L., Mosén-Ansorena, D., Baragaño Raneros, A.,
Márquez-Kisinousky, L., Aransay, A. M., & Lopez-Larrea, C. (2017). Epigenetic
Networks Regulate the Transcriptional Program in Memory and Terminally
Differentiated CD8+ T Cells. The Journal of Immunology, 198(2), 937–949.
https://doi.org/10.4049/jimmunol.1601102
Roos, L., Sandling, J. K., Bell, C. G., Glass, D., Mangino, M., Spector, T. D., Deloukas, P.,
Bataille, V., & Bell, J. T. (2017). Higher Nevus Count Exhibits a Distinct DNA
Methylation Signature in Healthy Human Skin: Implications for Melanoma. Journal of
Investigative Dermatology, 137(4), 910–920. https://doi.org/10.1016/j.jid.2016.11.029
Roy, R., Ramamoorthy, S., Shapiro, B. D., Kaileh, M., Hernandez, D., Sarantopoulou, D.,
Arepalli, S., Boller, S., Singh, A., Bektas, A., Kim, J., Moore, A. Z., Tanaka, T.,
McKelvey, J., Zukley, L., Nguyen, C., Wallace, T., Dunn, C., Wersto, R., … Sen, R.
(2021). DNA methylation signatures reveal that distinct combinations of transcription
factors specify human immune cell epigenetic identity. Immunity, 54(11), 2465-2480.e5.
https://doi.org/10.1016/j.immuni.2021.10.001
Sadighi Akha, A. A. (2018). Aging and the immune system: An overview. Journal of
Immunological Methods, 463, 21–26. https://doi.org/10.1016/j.jim.2018.08.005
Salameh, Y., Bejaoui, Y., & El Hajj, N. (2020). DNA Methylation Biomarkers in Aging and
Age-Related Diseases. Frontiers in Genetics, 11.
https://www.frontiersin.org/articles/10.3389/fgene.2020.00171
Salas, L. A., Zhang, Z., Koestler, D. C., Butler, R. A., Hansen, H. M., Molinaro, A. M.,
Wiencke, J. K., Kelsey, K. T., & Christensen, B. C. (2022). Enhanced cell deconvolution
of peripheral blood using DNA methylation for high-resolution immune profiling. Nature
Communications, 13(1), Article 1. https://doi.org/10.1038/s41467-021-27864-7
Sayed, N., Huang, Y., Nguyen, K., Krejciova-Rajaniemi, Z., Grawe, A. P., Gao, T., Tibshirani,
R., Hastie, T., Alpert, A., Cui, L., Kuznetsova, T., Rosenberg-Hasson, Y., Ostan, R.,
Monti, D., Lehallier, B., Shen-Orr, S. S., Maecker, H. T., Dekker, C. L., Wyss-Coray, T.,
… Furman, D. (2021). An inflammatory aging clock (iAge) based on deep learning tracks
multimorbidity, immunosenescence, frailty and cardiovascular aging. Nature Aging, 1(7),
Article 7. https://doi.org/10.1038/s43587-021-00082-y
Schalkwyk, L. C., Gorrie-Stone, T. J., Pidsley, R., Wong, C. C., Touleimat, N., Defrance, M.,
Teschendorff, A., Maksimovic, J., Khoury, L. Y. E., & Wang, Y. (2023). wateRmelon:
Illumina 450 and EPIC methylation array normalization and metrics (2.4.0) [Computer
software]. Bioconductor version: Release (3.16).
https://doi.org/10.18129/B9.bioc.wateRmelon
Schlosberg, C. E., VanderKraats, N. D., & Edwards, J. R. (2017). Modeling complex patterns of
differential DNA methylation that associate with gene expression changes. Nucleic Acids
Research, 45(9), 5100–5111. https://doi.org/10.1093/nar/gkx078
Schlums, H., Cichocki, F., Tesi, B., Theorell, J., Beziat, V., Holmes, T. D., Han, H., Chiang, S.
C. C., Foley, B., Mattsson, K., Larsson, S., Schaffer, M., Malmberg, K.-J., Ljunggren,
H.-G., Miller, J. S., & Bryceson, Y. T. (2015). Cytomegalovirus Infection Drives Adaptive
Epigenetic Diversification of NK Cells with Altered Signaling and Effector Function.
Immunity, 42(3), 443–456. https://doi.org/10.1016/j.immuni.2015.02.008
Schumacher, B., & Meyer, D. (2023). Accurate aging clocks based on accumulating stochastic
variation. https://www.researchsquare.com/. https://doi.org/10.21203/rs.3.rs-2351315/v1
Shen, C.-Y., Lu, C.-H., Wu, C.-H., Li, K.-J., Kuo, Y.-M., Hsieh, S.-C., & Yu, C.-L. (2021).
Molecular Basis of Accelerated Aging with Immune Dysfunction-Mediated Inflammation
(Inflamm-Aging) in Patients with Systemic Sclerosis. Cells, 10(12), 3402.
https://doi.org/10.3390/cells10123402
Simpson, D. J., & Chandra, T. (2021). Epigenetic age prediction. Aging Cell, 20(9), e13452.
https://doi.org/10.1111/acel.13452
Somineni, H. K., Venkateswaran, S., Kilaru, V., Marigorta, U. M., Mo, A., Okou, D. T.,
Kellermayer, R., Mondal, K., Cobb, D., Walters, T. D., Griffiths, A., Noe, J. D., Crandall,
W. V., Rosh, J. R., Mack, D. R., Heyman, M. B., Baker, S. S., Stephens, M. C.,
Baldassano, R. N., … Kugathasan, S. (2019). Blood-Derived DNA Methylation
Signatures of Crohn’s Disease and Severity of Intestinal Inflammation. Gastroenterology,
156(8), 2254-2265.e3. https://doi.org/10.1053/j.gastro.2019.01.270
Takeuchi, C., Sato, J., Yamashita, S., Sasaki, A., Akahane, T., Aoki, R., Yamamichi, M., Liu,
Y.-Y., Ito, M., Furuta, T., Nakajima, S., Sakaguchi, Y., Takahashi, Y., Tsuji, Y., Niimi,
K., Tomida, S., Fujishiro, M., Yamamichi, N., & Ushijima, T. (2022). Autoimmune
gastritis induces aberrant DNA methylation reflecting its carcinogenic potential. Journal
of Gastroenterology, 57(3), 144–155. https://doi.org/10.1007/s00535-021-01848-2
Terekhova, M., Swain, A., Bohacova, P., Aladyeva, E., Arthur, L., Laha, A., Mogilenko, D. A.,
Burdess, S., Sukhov, V., Kleverov, D., Echalar, B., Tsurinov, P., Chernyatchik, R.,
Husarcikova, K., & Artyomov, M. N. (2023). Single-cell atlas of healthy human blood
unveils age-related loss of NKG2C+GZMB-CD8+ memory T cells and accumulation of
type 2 memory T cells. Immunity, 56(12), 2836-2854.e9.
https://doi.org/10.1016/j.immuni.2023.10.013
The GTEx Consortium. (2020). The GTEx Consortium atlas of genetic regulatory effects
across human tissues. Science, 369(6509), 1318–1330.
https://doi.org/10.1126/science.aaz1776
Thompson, E. E., Dang, Q., Mitchell-Handley, B., Rajendran, K., Ram-Mohan, S., Solway, J.,
Ober, C., & Krishnan, R. (2020). Cytokine-induced molecular responses in airway
smooth muscle cells inform genome-wide association studies of asthma. Genome
Medicine, 12(1), 64. https://doi.org/10.1186/s13073-020-00759-w
Thurman, R. E., Rynes, E., Humbert, R., Vierstra, J., Maurano, M. T., Haugen, E., Sheffield, N.
C., Stergachis, A. B., Wang, H., Vernot, B., Garg, K., John, S., Sandstrom, R., Bates, D.,
Boatman, L., Canfield, T. K., Diegel, M., Dunn, D., Ebersol, A. K., …
Stamatoyannopoulos, J. A. (2012). The accessible chromatin landscape of the human
genome. Nature, 489(7414), Article 7414. https://doi.org/10.1038/nature11232
Tian, M., Wang, X., Sun, J., Lin, W., Chen, L., Liu, S., Wu, X., Shi, L., Xu, P., Cai, X., & Wang,
X. (2020). IRF3 prevents colorectal tumorigenesis via inhibiting the nuclear translocation
of β-catenin. Nature Communications, 11(1), Article 1.
https://doi.org/10.1038/s41467-020-19627-7
Trapp, A., Kerepesi, C., & Gladyshev, V. N. (2021). Profiling epigenetic age in single cells.
Nature Aging, 1(12), Article 12. https://doi.org/10.1038/s43587-021-00134-3
Tsaprouni, L. G., Yang, T.-P., Bell, J., Dick, K. J., Kanoni, S., Nisbet, J., Viñuela, A.,
Grundberg, E., Nelson, C. P., Meduri, E., Buil, A., Cambien, F., Hengstenberg, C.,
Erdmann, J., Schunkert, H., Goodall, A. H., Ouwehand, W. H., Dermitzakis, E., Spector,
T. D., … Deloukas, P. (2014). Cigarette smoking reduces DNA methylation levels at
multiple genomic loci but the effect is partially reversible upon cessation. Epigenetics,
9(10), 1382–1396. https://doi.org/10.4161/15592294.2014.969637
Tuo, Z., Zhang, Y., Wang, X., Dai, S., Liu, K., Xia, D., Wang, J., & Bi, L. (2022). RUNX1 is a
promising prognostic biomarker and related to immune infiltrates of cancer-associated
fibroblasts in human cancers. BMC Cancer, 22(1), 523.
https://doi.org/10.1186/s12885-022-09632-y
Viana, J., Hannon, E., Dempster, E., Pidsley, R., Macdonald, R., Knox, O., Spiers, H., Troakes,
C., Al-Saraj, S., Turecki, G., Schalkwyk, L. C., & Mill, J. (2017). Schizophrenia-associated
methylomic variation: Molecular signatures of disease and polygenic risk
burden across multiple brain regions. Human Molecular Genetics, 26(1), 210–225.
https://doi.org/10.1093/hmg/ddw373
Voisin, S., Harvey, N. R., Haupt, L. M., Griffiths, L. R., Ashton, K. J., Coffey, V. G., Doering,
T. M., Thompson, J.-L. M., Benedict, C., Cedernaes, J., Lindholm, M. E., Craig, J. M.,
Rowlands, D. S., Sharples, A. P., Horvath, S., & Eynon, N. (2020). An epigenetic clock
for human skeletal muscle. Journal of Cachexia, Sarcopenia and Muscle, 11(4), 887–898.
https://doi.org/10.1002/jcsm.12556
Vyas, C. M., Gatchel, J. R., Sadreyev, R. I., Kang, J. H., Mischoulon, D., Reynolds III, C. F.,
Chang, G., Manson, J. E., DeVivo, I., Blacker, D., & Okereke, O. I. (2021). Pilot study of
genome-wide differences in DNA methylation among older adults with normal cognition
and mild cognitive impairment, with and without neuropsychiatric symptoms.
Alzheimer’s & Dementia, 17(S5), e055497. https://doi.org/10.1002/alz.055497
Wang, X., Sun, D., Tai, J., Chen, S., Yu, M., Ren, D., & Wang, L. (2018). TFAP2C promotes
stemness and chemotherapeutic resistance in colorectal cancer via inactivating hippo
signaling pathway. Journal of Experimental & Clinical Cancer Research : CR, 37, 27.
https://doi.org/10.1186/s13046-018-0683-9
Wei, T. (n.d.). Corrplot. Retrieved July 24, 2023, from
https://www.rdocumentation.org/packages/corrplot/versions/0.92
Weinberger, B., Lazuardi, L., Weiskirchner, I., Keller, M., Neuner, C., Fischer, K.-H., Neuman,
B., Würzner, R., & Grubeck-Loebenstein, B. (2007). Healthy aging and latent infection
with CMV lead to distinct changes in CD8+ and CD4+ T-cell subsets in the elderly.
Human Immunology, 68(2), 86–90. https://doi.org/10.1016/j.humimm.2006.10.019
Weyand, C. M., & Goronzy, J. J. (2016). Aging of the Immune System. Mechanisms and
Therapeutic Targets. Annals of the American Thoracic Society, 13(Suppl 5), S422–S428.
https://doi.org/10.1513/AnnalsATS.201602-095AW
Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., Woo, K., Yutani,
H., Dunnington, D., & RStudio. (2022). ggplot2: Create Elegant Data Visualisations
Using the Grammar of Graphics (3.4.0) [Computer software].
https://CRAN.R-project.org/package=ggplot2
Wilson, V. L., Smith, R. A., Ma, S., & Cutler, R. G. (1987). Genomic 5-methyldeoxycytidine
decreases with age. The Journal of Biological Chemistry, 262(21), 9948–9951.
Witte, L. D. de, Wang, Z., Snijders, G. L. J. L., Mendelev, N., Liu, Q., Sneeboer, M. A. M.,
Boks, M. P. M., Ge, Y., & Haghighi, F. (2022). Contribution of Age, Brain Region,
Mood Disorder Pathology, and Interindividual Factors on the Methylome of Human
Microglia. Biological Psychiatry, 91(6), 572–581.
https://doi.org/10.1016/j.biopsych.2021.10.020
Wockner, L. F., Noble, E. P., Lawford, B. R., Young, R. M., Morris, C. P., Whitehall, V. L. J., &
Voisey, J. (2014). Genome-wide DNA methylation analysis of human brain tissue from
schizophrenia patients. Translational Psychiatry, 4(1), Article 1.
https://doi.org/10.1038/tp.2013.111
Wu, K.-J. (2020). The epigenetic roles of DNA N6-Methyladenine (6mA) modification in
eukaryotes. Cancer Letters, 494, 40–46. https://doi.org/10.1016/j.canlet.2020.08.025
Wu, X., Huang, Q., Javed, R., Zhong, J., Gao, H., & Liang, H. (2019). Effect of tobacco smoking
on the epigenetic age of human respiratory organs. Clinical Epigenetics, 11(1), 183.
https://doi.org/10.1186/s13148-019-0777-z
Xiang, X., Deng, Z., Zhuang, X., Ju, S., Mu, J., Jiang, H., Zhang, L., Yan, J., Miller, D., &
Zhang, H.-G. (2012). Grhl2 Determines the Epithelial Phenotype of Breast Cancers and
Promotes Tumor Progression. PLOS ONE, 7(12), e50781.
https://doi.org/10.1371/journal.pone.0050781
Xiao, C., Yi, S., & Huang, D. (2021). Genome-wide identification of age-related CpG sites for
age estimation from blood DNA of Han Chinese individuals. Electrophoresis, 42(14–15),
1488–1496. https://doi.org/10.1002/elps.202000367
Xiao, C.-L., Zhu, S., He, M., Chen, D., Zhang, Q., Chen, Y., Yu, G., Liu, J., Xie, S.-Q., Luo, F.,
Liang, Z., Wang, D.-P., Bo, X.-C., Gu, X.-F., Wang, K., & Yan, G.-R. (2018).
N6-Methyladenine DNA Modification in the Human Genome. Molecular Cell, 71(2),
306-318.e7. https://doi.org/10.1016/j.molcel.2018.06.015
Xie, W., Kagiampakis, I., Pan, L., Zhang, Y. W., Murphy, L., Tao, Y., Kong, X., Kang, B., Xia,
L., Carvalho, F. L. F., Sen, S., Chiu Yen, R.-W., Zahnow, C. A., Ahuja, N., Baylin, S. B.,
& Easwaran, H. (2018). DNA Methylation Patterns Separate Senescence from
Transformation Potential and Indicate Cancer Risk. Cancer Cell, 33(2), 309-321.e5.
https://doi.org/10.1016/j.ccell.2018.01.008
Xu, H., Wang, F., Liu, Y., Yu, Y., Gelernter, J., & Zhang, H. (2014). Sex-biased methylome and
transcriptome in human prefrontal cortex. Human Molecular Genetics, 23(5), 1260–1270.
https://doi.org/10.1093/hmg/ddt516
Xu, Z., Niu, L., & Taylor, J. (2023). ENmix: Quality control and analysis tools for Illumina
DNA methylation BeadChip (1.36.01) [Computer software]. Bioconductor version:
Release (3.17). https://doi.org/10.18129/B9.bioc.ENmix
Xu, Z., Sandler, D. P., & Taylor, J. A. (2020). Blood DNA Methylation and Breast Cancer: A
Prospective Case-Cohort Analysis in the Sister Study. JNCI: Journal of the National
Cancer Institute, 112(1), 87–94. https://doi.org/10.1093/jnci/djz065
Yan, D., Shen, M., Du, Z., Cao, J., Tian, Y., Zeng, P., & Tang, Z. (2021). Developing ZNF Gene
Signatures Predicting Radiosensitivity of Patients with Breast Cancer. Journal of
Oncology, 2021, e9255494. https://doi.org/10.1155/2021/9255494
Yasumizu, Y., Takeuchi, D., Morimoto, R., Takeshima, Y., Okuno, T., Kinoshita, M., Morita, T.,
Kato, Y., Wang, M., Motooka, D., Okuzaki, D., Nakamura, Y., Mikami, N., Arai, M.,
Zhang, X., Kumanogoh, A., Mochizuki, H., Ohkura, N., & Sakaguchi, S. (2024). Single-cell
transcriptome landscape of circulating CD4+ T cell populations in autoimmune
diseases. Cell Genomics, 4(2), 100473. https://doi.org/10.1016/j.xgen.2023.100473
Yin, Y., Morgunova, E., Jolma, A., Kaasinen, E., Sahu, B., Khund-Sayeed, S., Das, P. K.,
Kivioja, T., Dave, K., Zhong, F., Nitta, K. R., Taipale, M., Popov, A., Ginno, P. A.,
Domcke, S., Yan, J., Schübeler, D., Vinson, C., & Taipale, J. (2017). Impact of cytosine
methylation on DNA binding specificities of human transcription factors. Science (New
York, N.Y.), 356(6337), eaaj2239. https://doi.org/10.1126/science.aaj2239
Ying, K., Liu, H., Tarkhov, A. E., Lu, A. T., Horvath, S., Kutalik, Z., Shen, X., & Gladyshev, V.
N. (2022). Causal Epigenetic Age Uncouples Damage and Adaptation (p.
2022.10.07.511382). bioRxiv. https://doi.org/10.1101/2022.10.07.511382
Yousefzadeh, M. J., Flores, R. R., Zhu, Y., Schmiechen, Z. C., Brooks, R. W., Trussoni, C. E.,
Cui, Y., Angelini, L., Lee, K.-A., McGowan, S. J., Burrack, A. L., Wang, D., Dong, Q.,
Lu, A., Sano, T., O’Kelly, R. D., McGuckian, C. A., Kato, J. I., Bank, M. P., …
Niedernhofer, L. J. (2021). An aged immune system drives senescence and ageing of
solid organs. Nature, 594(7861), 100–105. https://doi.org/10.1038/s41586-021-03547-7
Yu, D., Li, M., Linghu, G., Hu, Y., Hajdarovic, K. H., Wang, A., Singh, R., & Webb, A. E.
(2023). CellBiAge: Improved single-cell age classification using data binarization. Cell
Reports, 42(12), 113500. https://doi.org/10.1016/j.celrep.2023.113500
Zannas, A. S., Jia, M., Hafner, K., Baumert, J., Wiechmann, T., Pape, J. C., Arloth, J., Ködel,
M., Martinelli, S., Roitman, M., Röh, S., Haehle, A., Emeny, R. T., Iurato, S., Carrillo-Roa,
T., Lahti, J., Räikkönen, K., Eriksson, J. G., Drake, A. J., … Binder, E. B. (2019).
Epigenetic upregulation of FKBP5 by aging and stress contributes to NF-κB–driven
inflammation and cardiovascular risk. Proceedings of the National Academy of Sciences,
116(23), 11370–11379. https://doi.org/10.1073/pnas.1816847116
Zhang, H., Jadhav, R. R., Cao, W., Goronzy, I. N., Zhao, T. V., Jin, J., Ohtsuki, S., Hu, Z.,
Morales, J., Greenleaf, W. J., Weyand, C. M., & Goronzy, J. J. (2023). Aging-associated
HELIOS deficiency in naive CD4+ T cells alters chromatin remodeling and promotes
effector cell responses. Nature Immunology, 24(1), Article 1.
https://doi.org/10.1038/s41590-022-01369-x
Zhang, W., & Xu, J. (2017). DNA methyltransferases and their roles in tumorigenesis.
Biomarker Research, 5(1), 1. https://doi.org/10.1186/s40364-017-0081-z
Zhang, X., Zhu, J., Chen, X., Jie-Qiong, Z., Li, X., Luo, L., Huang, H., Liu, W., Zhou, X., Yan,
J., Lin, S., & Ye, J. (2019). Interferon Regulatory Factor 3 Deficiency Induces Age-Related
Alterations of the Retina in Young and Old Mice. Frontiers in Cellular
Neuroscience, 13, 272. https://doi.org/10.3389/fncel.2019.00272
Zheng, Y., Liu, X., Le, W., Xie, L., Li, H., Wen, W., Wang, S., Ma, S., Huang, Z., Ye, J., Shi,
W., Ye, Y., Liu, Z., Song, M., Zhang, W., Han, J.-D. J., Belmonte, J. C. I., Xiao, C., Qu,
J., … Su, W. (2020). A human circulating immune cell landscape in aging and COVID-19.
Protein & Cell, 11(10), 740–770. https://doi.org/10.1007/s13238-020-00762-2
APPENDICES
SubsetClockFilter.R: Code used to subset CpGs for developing the IntrinClock
# This code analyzes data obtained from the Illumina EPIC array chip on four different CD8+ T cell
# subsets: naive, CM, EM, and TEMRA cells. It identifies CpGs that are associated with T cell
# differentiation and removes them. In so doing, the program aims to create a list of CpGs, not
# associated with T cell differentiation, that can be used for an epigenetic aging clock.
# Additionally, the program performs imputation, outlier detection, filtering of high-NA
# CpGs, and other related data pre-processing steps.
#Grabs some useful scripts.
source("AgingProjects/Useful Scripts/generally_useful.R")
#Sets location of data
setwd("Data/") #Sets directory.
#Libraries to import.
library("scales")
library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
library(plyr)
library(reshape2)
library(WGCNA)
library(limma)
library(dplyr)
library("RColorBrewer")
library(readxl)
library(ggplot2)
library(readr)
library(umap)
require(clusterExperiment)
library(stringr)
library(tidyr)
library(Rfast)
library(data.table)
library(minfi)
library(wateRmelon)
library(sva)
library(mice)
library(HiClimR)
library(impute)
`%!in%` <- Negate(`%in%`)
#Reading in the complete CpG table.
# complete_cpg_table <- data.table::fread("ClockConstruction/entire_cpg_table.csv",header=TRUE)
#
# #Filtering out CpGs with high NA values, and filtering out samples with many NA CpGs.
# filtered_cpg_table <- complete_cpg_table
# filtered_cpg_table <- filtered_cpg_table[,-c(1)]
# filtered_cpg_table[filtered_cpg_table == "NULL"] <- NA
# filtered_cpg_table[filtered_cpg_table == "null"] <- NA
# high_na_rows <- Rfast::rowsums(is.na(filtered_cpg_table)) > (ncol(filtered_cpg_table)/20)
# low_na_rows <- !high_na_rows
# filtered_cpg_table <- filtered_cpg_table[low_na_rows,]
# high_na_columns <- Rfast::colsums(is.na(filtered_cpg_table)) > (nrow(filtered_cpg_table)/20)
# low_na_columns <- !high_na_columns
# filtered_cpg_table <- filtered_cpg_table[,..low_na_columns]
# data.table::fwrite(filtered_cpg_table,"ClockConstruction/filtered_cpg_table_final.csv",
# row.names = TRUE)
# filtered_cpg_table <- data.table::fread("ClockConstruction/filtered_cpg_table_final.csv",header=TRUE)
# filtered_cpg_table <- filtered_cpg_table[,-1]
# complete_sample_table <- read.csv("ClockConstruction/entire_sample_table.csv",row.names=1)
# colnames(filtered_cpg_table)[11496:12482] <- complete_sample_table[complete_sample_table$Key==
#                                                                      "Oliva2022","ID"]
# cpgs <- filtered_cpg_table$cpg
# # # ####################################################################
# # #
# # # #Writes list of candidate CpGs to be used for new clock construction.
# # # # Removes all samples that are cancer samples.
# # # # Removes a few outliers as well, determined via UMAP.
# noncancer_sample_table <- complete_sample_table[(complete_sample_table$Condition!="Wilms Tumour") &
#                           (complete_sample_table$Condition!="Nasopharyngeal Carcinoma") &
#                           (complete_sample_table$Condition!="OPSCC Case") &
#                           (complete_sample_table$Condition!="Hepatocellular Carcinoma") &
#                           (complete_sample_table$Condition!="Colon Cancer") &
#                           (complete_sample_table$Condition!="Colorectal Cancer") &
#                           (complete_sample_table$Condition!="Dysplastic") &
#                           (complete_sample_table$Condition!="Sotos") &
#                           (complete_sample_table$Condition!="Benign Cancer") &
#                           (complete_sample_table$Condition!="Asymmetrical Cortica") &
#                           (complete_sample_table$Tissue != "Testis") &
#                           (complete_sample_table$Tissue != "Semen"),]
# # &
# # (complete_sample_table$Key != "Lunnon2014") &
# # (complete_sample_table$Key != "Gasparoni2018") &
# # (complete_sample_table$ID) %!in%
# # c("GSM4073545", "GSM4073546", "GSM4073547", "GSM4073548", "GSM4073549", "GSM4073550", "GSM4073551",
# #   "GSM4073552", "GSM4073553", "GSM4073554", "GSM4073555", "GSM4073556", "GSM4073557", "GSM4073558",
# #   "GSM4073559", "GSM4073560", "GSM4073561", "GSM4073562", "GSM4073563", "GSM4073564", "GSM4073565",
# #   "GSM4073566", "GSM4073567", "GSM4073568", "GSM4073569", "GSM4073570", "GSM4073571", "GSM4073572",
# #   "GSM4073573", "GSM4073574", "GSM4073575", "GSM4073576", "GSM4073577", "GSM1546376", "GSM1546378",
# #   "GSM1546379", "GSM1546390", "GSM1546398", "GSM1546407", "GSM1546408", "GSM1546411", "GSM1546417",
# #   "GSM1546418", "GSM1546424", "GSM1546426", "GSM1546435"),]
# noncancer_samples <- (colnames(filtered_cpg_table) %in% noncancer_sample_table$ID |
# colnames(filtered_cpg_table) == "cpg")
# noncancer_cpg_table <- filtered_cpg_table[,..noncancer_samples]
# noncancer_cpg_table$cpg <- cpgs
# noncancer_sample_table <- noncancer_sample_table[noncancer_sample_table$ID %in%
#                                                    colnames(noncancer_cpg_table),]
# #Writing the data.
# data.table::fwrite(noncancer_cpg_table,"ClockConstruction/noncancer_cpg_table.csv",
# row.names = TRUE)
#
# data.table::fwrite(noncancer_sample_table,"ClockConstruction/noncancer_sample_table.csv",
# row.names = TRUE)
# #########################################################################
#####################################################################
#Reading in data with cancer samples removed.
noncancer_cpg_table <-
data.table::fread("ClockConstruction/noncancer_cpg_table.csv",header=TRUE)
noncancer_sample_table <-
data.table::fread("ClockConstruction/noncancer_sample_table.csv",header=TRUE)
noncancer_cpg_table <- noncancer_cpg_table[,-c(1)]
noncancer_sample_table <- noncancer_sample_table[noncancer_sample_table$ID %in%
colnames(noncancer_cpg_table),]
epicv2_data <- data.table::fread("ClockConstruction/epicv2.csv")
colnames(epicv2_data) <- as.character(epicv2_data[8,])
epicv2_data <- epicv2_data[-c(1:7),]
############################################################################
#This section of code is responsible for determining which CpGs are tied to
# differentiation, and removing them from the training and test sets.
# #Read in clock data and metadata. Merge them and process them.
noncancer_cpg_table <- data.frame(noncancer_cpg_table)
v2_cpgs <- epicv2_data$Name
noncancer_cpg_table <- noncancer_cpg_table[noncancer_cpg_table$cpg %in%
v2_cpgs,]
clock_data <- read.csv("Tomusiak2021/clock_data.csv")
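#Drops the "Age" column. The comparison is applied to the column names, not to the data frame itself.
clock_data <- clock_data[,colnames(clock_data) != "Age"]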
subset_metadata <- read.csv("Tomusiak2021/subset_metadata.csv")
all_data <- merge(clock_data,subset_metadata)
all_data$type <- as.character(all_data$type)
all_data$type <- factor(all_data$type, levels=c("naive", "central_memory",
"effector_memory","temra"))
beta_values <- data.table::fread("Tomusiak2021/beta_values.csv",header=TRUE)
beta_values_samples <- colnames(beta_values)
beta_values <- data.frame(beta_values)
beta_values_cpgs <- beta_values$V1
beta_values <- beta_values[,-1]
#Filter out unwanted data from metadata, mapping, and beta values.
#D3 and E4 are technical outliers.
keep <- all_data$SampleID[all_data$SampleID != "D3" &
all_data$SampleID != "E4" &
all_data$sabgal_sample==FALSE]
#Note - D3 was a mistakenly pipetted sample. SAbGal-high samples were originally included to
# determine if changes were occurring on a DNA methylation level, but ultimately only one
# SA-bGal sample was included and thus rigorous statistics cannot be performed.
all_data <- all_data[all_data$SampleID %in% keep,]
beta_values <- beta_values[,colnames(beta_values) %in% keep]
beta_values <- beta_values[,order(all_data$type)]
all_data <- all_data[order(all_data$type),]
rownames(beta_values) <- beta_values_cpgs
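# Sketch (not part of the original analysis): the column filter and reorder above rely on
# the columns of beta_values already being in the same order as the rows of all_data.
# Assuming sample IDs are the column names, match() makes that alignment explicit.
# align_to_samples is a hypothetical helper name.
align_to_samples <- function(value_table, sample_ids) {
  idx <- match(sample_ids, colnames(value_table))
  if (anyNA(idx)) stop("Some sample IDs are missing from the value table.")
  value_table[, idx, drop = FALSE]
}
# Possible usage: beta_values <- align_to_samples(beta_values, all_data$SampleID)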
#Going to perform UMAP analysis on overall data set to determine where most variation is
# found. Intuitively, we will identify it in the T cell subsets that we assessed.
beta_rotated <- data.frame(t(beta_values))
colnames(beta_rotated) <- beta_values_cpgs
umap <- umap(beta_rotated,random_state=6)
umap_plot_df <- data.frame(umap$layout) %>%
tibble::rownames_to_column("SampleID") %>%
dplyr::inner_join(all_data, by = "SampleID")
umap_plot_df$type <- as.character(umap_plot_df$type)
umap_plot_df$type <- factor(umap_plot_df$type, levels=c("naive", "central_memory",
"effector_memory","temra"))
umap_plot_df$type <- revalue(umap_plot_df$type,
c("naive"="CD8+ Naive",
"central_memory" = "CD8+ Central Memory",
"effector_memory" = "CD8+ Effector Memory",
"temra" = "CD8+ TEMRA"))
ggplot(
umap_plot_df,
aes(x = X1, y = X2, color=type)) +
labs(x="UMAP Dimension 1", y="UMAP Dimension 2", title = "") +
theme_moderate() +
geom_point(size=5) +
scale_color_manual(values=c("#0066A9", "#16B935", "#D8852A","#DDC63F")) +
guides(color=guide_legend(title="Cell Type")) +
theme(legend.margin=margin(c(0,0,0,0))) +
theme(axis.text.x = element_text(angle =45, vjust = 0.95, hjust=1)) +
theme(axis.line.x.bottom=element_line(size=2),axis.line.y.left=element_line(size=2))
ggplot(umap_plot_df,aes(x=type,y=X2, color=type)) +
theme_classic() +
labs(x="CD8 Cell Subtype", y="UMAP Component 2",title="UMAP Component 2 Tracks Cell Lineage") +
geom_point(size=2)
#In the figures above, we found that UMAP component 2 seems to correlate with cell
# differentiation. We can use this to create a "pseudotime analysis" trajectory of bulk
# DNA methylation data where we look at how CpGs are differentially methylated with
# differentiation. We can then perform differential methylation analysis on the resulting
# sites to assess changes.
all_data$differentiation <- umap_plot_df$X2
age_group <- all_data$age
celltype_group <-
factor(all_data$type,levels=c("naive","central_memory","effector_memory","temra"))
donor_group <- factor(all_data$donor,levels=c("R45690","R45740","R45804","R45504",
"R45553","R45741","R45805"))
diff_group <- all_data$differentiation
beta_rotated <- data.matrix(beta_rotated)
#M-value: log2(beta/(1-beta)).
m_values <- log(beta_rotated/(1-beta_rotated),2)
m_values <- t(m_values)
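# Sketch (an assumption, not the original code): M-values are conventionally defined as
# log2(beta / (1 - beta)). Beta values of exactly 0 or 1 give infinite M-values, so this
# hypothetical helper clamps betas to [offset, 1 - offset] first. The offset is illustrative.
beta_to_m <- function(beta, offset = 1e-6) {
  beta <- pmin(pmax(beta, offset), 1 - offset)
  log2(beta / (1 - beta))
}
# Possible usage: m_values <- t(beta_to_m(beta_rotated))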
diff_design <- model.matrix(~age_group+celltype_group)
diff_fit_reduced <- lmFit(m_values,diff_design)
diff_fit_reduced <- eBayes(diff_fit_reduced, robust=TRUE)
summary(decideTests(diff_fit_reduced))
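#topTable sorts by p-value by default; sort.by="none" keeps the CpG order identical across
# the three tables so that they can be compared row by row.
diff_exp_cm <-topTable(diff_fit_reduced,coef=3,number=1000000,sort.by="none")
diff_exp_em <-topTable(diff_fit_reduced,coef=4,number=1000000,sort.by="none")
diff_exp_temra <-topTable(diff_fit_reduced,coef=5,number=1000000,sort.by="none")
diff_exp_sig <- diff_exp_cm[diff_exp_cm$adj.P.Val<.016 |
diff_exp_em$adj.P.Val<.016 |
diff_exp_temra$adj.P.Val<.016 ,]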
age_design <- model.matrix(~celltype_group+age_group)
age_fit_reduced <- lmFit(m_values,age_design)
age_fit_reduced <- eBayes(age_fit_reduced, robust=TRUE)
summary(decideTests(age_fit_reduced))
age_exp <-topTable(age_fit_reduced,coef=5,number=1000000)
age_exp_sig <- age_exp[age_exp$adj.P.Val<.05,]
dim(age_exp_sig)
dim(diff_exp_sig)
sum(rownames(age_exp_sig) %in% rownames(diff_exp_sig))
load("ClockConstruction/coefHannum.rda")
load("ClockConstruction/coefLevine.rda")
load("ClockConstruction/coefHorvath.rda")
load("ClockConstruction/coefSkin.rda")
sum((coefHannum$CpGmarker) %in%
rownames(diff_exp_sig))/length(coefHannum$CpGmarker)
sum((coefHorvath$CpGmarker) %in%
rownames(diff_exp_sig))/length(coefHorvath$CpGmarker)
sum((coefSkin$CpGmarker) %in% rownames(diff_exp_sig))/length(coefSkin$CpGmarker)
sum((coefLevine$CpGmarker) %in% rownames(diff_exp_sig))/length(coefLevine$CpGmarker)
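# Sketch: the four proportions above share one expression, so a small helper (hypothetical
# name pct_overlap) could report each as a percentage. It only assumes two character
# vectors of CpG IDs.
pct_overlap <- function(clock_cpgs, changed_cpgs) {
  100 * mean(clock_cpgs %in% changed_cpgs)
}
# Possible usage: pct_overlap(coefHannum$CpGmarker, rownames(diff_exp_sig))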
proportion_chart <-
data.frame(Clock=c("Hannum","Horvath","HorvathSkinBlood","PhenoAge"),
Proportion=c(50.7,39.2,54.1,35.0))
proportion_chart$Clock <- factor(proportion_chart$Clock,levels=c("PhenoAge",
"HorvathSkinBlood","Horvath","Hannum"))
ggplot(proportion_chart,aes(x=Clock,y=Proportion,fill=Clock)) +
geom_bar(stat="identity",color="black",size=2) + theme_moderate() +
geom_text(aes(label=Proportion), hjust=-.1, color="black", size=6) +
coord_flip() +
scale_fill_manual(values=c("#00B5E6", "#2C99D7", "#48AA48","#9FB0D5")) +
theme(legend.position = "none")+
scale_y_continuous(limit=c(0,105), expand=c(0,0),
name ="% Clock Sites Changing with CD8+ Differentiation") +
theme(axis.text.x = element_text(angle =45, vjust = 0.95, hjust=1)) +
theme(axis.line.x.bottom=element_line(size=2),axis.line.y.left=element_line(size=2))
# Note - an early version of the code used this UMAP axis as a proxy for differentiation,
# and then removed every site that was linked to this UMAP axis. I shifted away from
# this approach in favor of one that focused on naive CD8+ T cell aging.
#Labeling sample by cell type in a way that allows for correlation analysis.
all_data$is_naive <- as.numeric(all_data$type=="naive")
all_data$is_cm <- as.numeric(all_data$type=="central_memory")
all_data$is_em <- as.numeric(all_data$type=="effector_memory")
all_data$is_temra <- as.numeric(all_data$type=="temra")
beta_rotated <- beta_rotated[,colnames(beta_rotated) %in% noncancer_cpg_table$cpg]
rownames(beta_rotated) == all_data$SampleID
##############################
#Creating correlation matrices for age and cell type for each cell type per
# CpG site.
beta_rotated_naive <- beta_rotated[all_data$type=="naive",]
beta_rotated_cm <- beta_rotated[all_data$type=="central_memory",]
beta_rotated_em <- beta_rotated[all_data$type=="effector_memory",]
beta_rotated_temra <- beta_rotated[all_data$type=="temra",]
all_age_naive <- all_data$age[all_data$type=="naive"]
all_age_cm <- all_data$age[all_data$type=="central_memory"]
all_age_em <- all_data$age[all_data$type=="effector_memory"]
all_age_temra <- all_data$age[all_data$type=="temra"]
naive_correlation <- data.frame(cor(beta_rotated,all_data$is_naive))
cm_correlation <- data.frame(cor(beta_rotated,all_data$is_cm))
em_correlation <- data.frame(cor(beta_rotated,all_data$is_em))
temra_correlation <- data.frame(cor(beta_rotated,all_data$is_temra))
age_correlated_in_naive <- data.frame(cor(beta_rotated_naive,
all_age_naive))
age_correlated_in_cm <- data.frame(cor(beta_rotated_cm,
all_age_cm))
age_correlated_in_em <- data.frame(cor(beta_rotated_em,
all_age_em))
age_correlated_in_temra <- data.frame(cor(beta_rotated_temra,
all_age_temra))
colnames(naive_correlation) <- "correlation"
colnames(cm_correlation) <- "correlation"
colnames(em_correlation) <- "correlation"
colnames(temra_correlation) <- "correlation"
colnames(age_correlated_in_naive) <- "correlation"
colnames(age_correlated_in_cm) <- "correlation"
colnames(age_correlated_in_em) <- "correlation"
colnames(age_correlated_in_temra) <- "correlation"
#Creating cutoffs for what we consider to be methylation associated with cell type,
# and methylation associated with aging for a given cell type.
# For the purposes of this analysis it's prudent to be conservative with
# banning CpGs associated with cell type while permitting CpGs associated
# with aging.
naive_cpgs <- rownames(naive_correlation)[abs(naive_correlation$correlation)>.3]
cm_cpgs <- rownames(cm_correlation)[abs(cm_correlation$correlation)>.15]
em_cpgs <- rownames(em_correlation)[abs(em_correlation$correlation)>.15]
temra_cpgs <- rownames(temra_correlation)[abs(temra_correlation$correlation)>.15]
age_naive_correlated_cpgs <-
rownames(age_correlated_in_naive)[abs(age_correlated_in_naive$correlation)>.3]
age_cm_correlated_cpgs <-
rownames(age_correlated_in_cm)[abs(age_correlated_in_cm$correlation)>.4]
age_em_correlated_cpgs <-
rownames(age_correlated_in_em)[abs(age_correlated_in_em$correlation)>.4]
age_temra_correlated_cpgs <-
rownames(age_correlated_in_temra)[abs(age_correlated_in_temra$correlation)>.4]
correlations_table <- naive_correlation
colnames(correlations_table) <- "naive_correlation"
correlations_table$ID <- rownames(correlations_table)
age_correlations_table <- age_correlated_in_naive
colnames(age_correlations_table) <- "age_correlation"
age_correlations_table$ID <- rownames(age_correlations_table)
age_naive_correlations_table <- merge(correlations_table,age_correlations_table)
age_naive_correlations_table$color <- ((abs(age_naive_correlations_table$naive_correlation) <
.3)) &
(abs(age_naive_correlations_table$age_correlation)>.3)
age_naive_correlations_table$color[age_naive_correlations_table$color==TRUE] <- "#0066A9"
age_naive_correlations_table$color[age_naive_correlations_table$color==FALSE] <-
"#8B8D90"
ggplot(age_naive_correlations_table,aes(x=naive_correlation,y=age_correlation)) +
geom_point(aes(color=color),size=.0001) + theme_classic() +
geom_hline(yintercept = 0,size=1) +
geom_vline(xintercept = 0,size=1) +
scale_color_manual(values=c("#0066A9", "#16B935","#8B8D90")) +
geom_smooth(aes(x=age_correlation,y= naive_correlation,color="#16B935"),
method = "lm", se = FALSE, linetype="dashed",size=2)
cor.test(age_naive_correlations_table$naive_correlation,
age_naive_correlations_table$age_correlation)
age_correlated_cpgs_all <- unique(c(age_naive_correlated_cpgs,
age_cm_correlated_cpgs,
age_em_correlated_cpgs,
age_temra_correlated_cpgs))
state_correlated_cpgs_all <- unique(c(naive_cpgs,
cm_cpgs,
em_cpgs,
temra_cpgs))
desired_cpgs <- age_naive_correlated_cpgs[!(age_naive_correlated_cpgs %in% naive_cpgs)]
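# Sketch of the selection rule above: keep CpGs whose methylation correlates with age
# within naive cells but not with naive cell identity. It assumes named numeric vectors of
# correlations (names are CpG IDs). select_age_cpgs is a hypothetical helper, and the
# default cutoffs mirror the .3 thresholds used for naive cells.
select_age_cpgs <- function(age_cor, state_cor, age_cutoff = 0.3, state_cutoff = 0.3) {
  shared <- intersect(names(age_cor), names(state_cor))
  keep <- abs(age_cor[shared]) > age_cutoff & abs(state_cor[shared]) <= state_cutoff
  shared[which(keep)] # which() drops CpGs whose correlation is NA
}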
v2_cpgs <- epicv2_data$Name
#Gotta maintain compatibility with epic V2 now...
desired_cpgs <- desired_cpgs[desired_cpgs %in% v2_cpgs]
#############################
rownames(noncancer_cpg_table) <- noncancer_cpg_table$cpg
filtered_cpgs <- noncancer_cpg_table[(noncancer_cpg_table$cpg %in%
desired_cpgs),]
filtered_beta_values <- beta_rotated[,(colnames(beta_rotated) %in% filtered_cpgs$cpg) ]
#Splitting training and test samples. Deliberately throwing the Roy2021 and Tomusiak2021
# samples into the training set, as I would like the training model to have maximum exposure
# to different immune cell types at a variety of ages.
set.seed(123)
training_samples <- noncancer_sample_table$ID[(sample(length(noncancer_sample_table$ID),
size =length(noncancer_sample_table$ID)*(7.5/10),
replace = F))]
training_samples <- c(training_samples,
noncancer_sample_table[noncancer_sample_table$Key=="Roy2021",
"ID"]$ID,
noncancer_sample_table[noncancer_sample_table$Key=="Tomusiak2021",
"ID"]$ID)
training_samples <- unique(training_samples)
test_samples <- noncancer_sample_table$ID[(!(noncancer_sample_table$ID %in%
training_samples))]
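# Sketch of the split above as a reusable function: draw a random training fraction, then
# force selected samples (for example, whole datasets) into the training set. The name
# split_samples and its arguments are hypothetical. Note that set.seed() inside the
# function resets the global random number state.
split_samples <- function(ids, forced_training = character(), train_fraction = 0.75,
                          seed = 123) {
  set.seed(seed)
  n_train <- floor(length(ids) * train_fraction)
  training <- unique(c(sample(ids, size = n_train), forced_training))
  list(training = training, test = setdiff(ids, training))
}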
training_set_cpgs <- filtered_cpgs[,colnames(filtered_cpgs) %in% training_samples |
colnames(filtered_cpgs) == "cpg"]
test_set_cpgs <- filtered_cpgs[,colnames(filtered_cpgs) %in% test_samples |
colnames(filtered_cpgs) == "cpg"]
training_set_sample_table <- noncancer_sample_table[noncancer_sample_table$ID %in%
training_samples,]
test_set_sample_table <- noncancer_sample_table[noncancer_sample_table$ID %in%
test_samples,]
cpgs <- rownames(training_set_cpgs)
training_set_cpgs <- training_set_cpgs[,colnames(training_set_cpgs) != "cpg"]
test_set_cpgs <- test_set_cpgs[,colnames(test_set_cpgs) != "cpg"]
#Need to impute different tissue types as separately as possible.
# Due to sparsity of datasets from certain tissues, I needed to combine
# tissue types twice for the following imputation.
unique(training_set_sample_table$Tissue)
training_set_sample_table[training_set_sample_table$Tissue=="Nasopharynx","Tissue"] <-
"Throat"
training_set_sample_table[training_set_sample_table$Tissue=="Bronchi","Tissue"] <- "Throat"
training_set_sample_table[training_set_sample_table$Tissue=="Buccal","Tissue"] <- "Saliva"
training_set_sample_table[training_set_sample_table$Tissue=="Colon","Tissue"] <-
"Colorectal"
training_set_brain <- training_set_cpgs[,colnames(training_set_cpgs) %in%
training_set_sample_table[training_set_sample_table$Tissue=="Brain","ID"]$ID]
training_set_blood <- training_set_cpgs[,colnames(training_set_cpgs) %in%
training_set_sample_table[training_set_sample_table$Tissue=="Blood","ID"]$ID]
training_set_throat <- training_set_cpgs[,colnames(training_set_cpgs) %in%
training_set_sample_table[training_set_sample_table$Tissue=="Throat","ID"]$ID]
training_set_kidney <- training_set_cpgs[,colnames(training_set_cpgs) %in%
training_set_sample_table[training_set_sample_table$Tissue=="Kidney","ID"]$ID]
training_set_pbmc <- training_set_cpgs[,colnames(training_set_cpgs) %in%
training_set_sample_table[training_set_sample_table$Tissue=="PBMC","ID"]$ID]
training_set_liver <- training_set_cpgs[,colnames(training_set_cpgs) %in%
training_set_sample_table[training_set_sample_table$Tissue=="Liver","ID"]$ID]
training_set_saliva <- training_set_cpgs[,colnames(training_set_cpgs) %in%
training_set_sample_table[training_set_sample_table$Tissue=="Saliva","ID"]$ID]
training_set_skin <- training_set_cpgs[,(colnames(training_set_cpgs) %in%
training_set_sample_table[training_set_sample_table$Tissue=="Skin","ID"]$ID) |
(colnames(training_set_cpgs) %in%
training_set_sample_table[training_set_sample_table$Tissue=="Breast","ID"]$ID)]
training_set_colorectal <- training_set_cpgs[,colnames(training_set_cpgs) %in%
training_set_sample_table[training_set_sample_table$Tissue=="Colorectal","ID"]$ID]
training_set_skeletal <- training_set_cpgs[,(colnames(training_set_cpgs) %in%
training_set_sample_table[training_set_sample_table$Tissue=="Skeletal Muscle","ID"]$ID) |
(colnames(training_set_cpgs) %in%
training_set_sample_table[training_set_sample_table$Tissue=="Lung","ID"]$ID) |
(colnames(training_set_cpgs) %in%
training_set_sample_table[training_set_sample_table$Tissue=="Ovary","ID"]$ID) |
(colnames(training_set_cpgs) %in%
training_set_sample_table[training_set_sample_table$Tissue=="Prostate","ID"]$ID) ]
training_set_brain <- data.frame(impute.knn(as.matrix(training_set_brain))$data)
training_set_blood <- data.frame(impute.knn(as.matrix(training_set_blood))$data)
training_set_throat <- data.frame(impute.knn(as.matrix(training_set_throat))$data)
training_set_kidney <- data.frame(impute.knn(as.matrix(training_set_kidney))$data)
training_set_pbmc <- data.frame(impute.knn(as.matrix(training_set_pbmc))$data)
training_set_liver <- data.frame(impute.knn(as.matrix(training_set_liver))$data)
training_set_saliva <- data.frame(impute.knn(as.matrix(training_set_saliva))$data)
training_set_skin <- data.frame(impute.knn(as.matrix(training_set_skin))$data)
training_set_colorectal <- data.frame(impute.knn(as.matrix(training_set_colorectal))$data)
training_set_skeletal <- data.frame(impute.knn(as.matrix(training_set_skeletal))$data)
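# Sketch: the per-tissue subsetting and imputation above could be expressed as one loop
# over groups. impute_by_group is a hypothetical helper. It assumes unique column names,
# one group label per column, and an imputation function that takes and returns a matrix.
impute_by_group <- function(cpg_table, sample_groups, impute_fun) {
  stopifnot(length(sample_groups) == ncol(cpg_table))
  pieces <- lapply(split(seq_len(ncol(cpg_table)), sample_groups), function(cols) {
    impute_fun(as.matrix(cpg_table[, cols, drop = FALSE]))
  })
  combined <- do.call(cbind, unname(pieces))
  combined[, colnames(cpg_table), drop = FALSE] # restore the original sample order
}
# Possible usage, with tissue labels merged beforehand as done above:
# impute_by_group(training_set_cpgs, tissue_labels, function(m) impute.knn(m)$data)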
# Now for test cpgs
test_set_sample_table[test_set_sample_table$Tissue=="Nasopharynx","Tissue"] <- "Throat"
test_set_sample_table[test_set_sample_table$Tissue=="Bronchi","Tissue"] <- "Throat"
test_set_sample_table[test_set_sample_table$Tissue=="Buccal","Tissue"] <- "Saliva"
test_set_sample_table[test_set_sample_table$Tissue=="Colon","Tissue"] <- "Colorectal"
test_set_brain <- test_set_cpgs[,colnames(test_set_cpgs) %in%
test_set_sample_table[test_set_sample_table$Tissue=="Brain","ID"]$ID]
test_set_blood <- test_set_cpgs[,colnames(test_set_cpgs) %in%
test_set_sample_table[test_set_sample_table$Tissue=="Blood","ID"]$ID]
test_set_throat <- test_set_cpgs[,colnames(test_set_cpgs) %in%
test_set_sample_table[test_set_sample_table$Tissue=="Throat","ID"]$ID]
test_set_kidney <- test_set_cpgs[,colnames(test_set_cpgs) %in%
test_set_sample_table[test_set_sample_table$Tissue=="Kidney","ID"]$ID]
test_set_pbmc <- test_set_cpgs[,colnames(test_set_cpgs) %in%
test_set_sample_table[test_set_sample_table$Tissue=="PBMC","ID"]$ID]
test_set_liver <- test_set_cpgs[,colnames(test_set_cpgs) %in%
test_set_sample_table[test_set_sample_table$Tissue=="Liver","ID"]$ID]
test_set_saliva <- test_set_cpgs[,colnames(test_set_cpgs) %in%
test_set_sample_table[test_set_sample_table$Tissue=="Saliva","ID"]$ID]
test_set_skin <- test_set_cpgs[,(colnames(test_set_cpgs) %in%
test_set_sample_table[test_set_sample_table$Tissue=="Skin","ID"]$ID) |
(colnames(test_set_cpgs) %in%
test_set_sample_table[test_set_sample_table$Tissue=="Breast","ID"]$ID)]
test_set_colorectal <- test_set_cpgs[,colnames(test_set_cpgs) %in%
test_set_sample_table[test_set_sample_table$Tissue=="Colorectal","ID"]$ID]
test_set_skeletal <- test_set_cpgs[,(colnames(test_set_cpgs) %in%
test_set_sample_table[test_set_sample_table$Tissue=="Skeletal Muscle","ID"]$ID) |
(colnames(test_set_cpgs) %in%
test_set_sample_table[test_set_sample_table$Tissue=="Lung","ID"]$ID) |
(colnames(test_set_cpgs) %in%
test_set_sample_table[test_set_sample_table$Tissue=="Ovary","ID"]$ID) |
(colnames(test_set_cpgs) %in%
test_set_sample_table[test_set_sample_table$Tissue=="Prostate","ID"]$ID) ]
test_set_brain <- data.frame(impute.knn(as.matrix(test_set_brain))$data)
test_set_blood <- data.frame(impute.knn(as.matrix(test_set_blood))$data)
test_set_throat <- data.frame(impute.knn(as.matrix(test_set_throat))$data)
test_set_kidney <- data.frame(impute.knn(as.matrix(test_set_kidney))$data)
test_set_pbmc <- data.frame(impute.knn(as.matrix(test_set_pbmc))$data)
test_set_liver <- data.frame(impute.knn(as.matrix(test_set_liver))$data)
test_set_saliva <- data.frame(impute.knn(as.matrix(test_set_saliva))$data)
test_set_skin <- data.frame(impute.knn(as.matrix(test_set_skin))$data)
test_set_colorectal <- data.frame(impute.knn(as.matrix(test_set_colorectal))$data)
test_set_skeletal <- data.frame(impute.knn(as.matrix(test_set_skeletal))$data)
training_set_cpgs <- bind_cols(training_set_brain, training_set_blood, training_set_throat,
training_set_kidney,
training_set_pbmc, training_set_liver, training_set_saliva, training_set_skin,
training_set_colorectal, training_set_skeletal)
training_set_cpgs <- training_set_cpgs[,training_set_sample_table$ID]
test_set_cpgs <- bind_cols(test_set_brain, test_set_blood, test_set_throat, test_set_kidney,
test_set_pbmc, test_set_liver, test_set_saliva, test_set_skin,
test_set_colorectal, test_set_skeletal)
test_set_cpgs <- test_set_cpgs[,test_set_sample_table$ID]
# The following IDs looked like clear outliers on PCA plots, and thus were removed.
banned_ids <- c("GSM1865172","GSM3712776","GSM4052066","GSM4052073")
training_set_sample_table <- training_set_sample_table[!(training_set_sample_table$ID %in%
banned_ids),]
test_set_sample_table <- test_set_sample_table[!(test_set_sample_table$ID %in% banned_ids),]
training_set_cpgs <- training_set_cpgs[,colnames(training_set_cpgs) %in%
training_set_sample_table$ID]
test_set_cpgs <- test_set_cpgs[,colnames(test_set_cpgs) %in% test_set_sample_table$ID]
#Performing some PCA analysis to identify outliers.
# pca_training <- prcomp(training_set_cpgs)
# pca_test <- prcomp(test_set_cpgs)
#
# pca_training <- data.frame(pca_training$rotation)[,1:5]
# pca_test <- data.frame(pca_test$rotation)[,1:5]
#
# pca_training$ID <- rownames(pca_training)
# pca_test$ID <- rownames(pca_test)
#
# pca_temp_sample_table_training <- merge(pca_training,noncancer_sample_table,on="ID")
# pca_temp_sample_table_test <- merge(pca_test,noncancer_sample_table,on="ID")
#
# ggplot(pca_temp_sample_table_training,aes(x=PC1,y=PC2,color=Key)) +
# geom_point() + theme_classic()
# ggplot(pca_temp_sample_table_test,aes(x=PC1,y=PC2,color=Key)) +
# geom_point() + theme_classic()
#
#Need to create a CpG table for the clock to use for my previously generated data, as a
# validation set.
# Need to also ensure that the clock is not built on any CpGs that are not in the validation
# set, although previous filtering should have taken care of that.
# # Filtering list of CpGs down to those that are also found in the Jonkman validation dataset.
validation_sample_table <-
read.csv("ClockConstruction/validation_sample_table.csv",row.names=1)
validation_cpg_table <- read.csv("ClockConstruction/validation_cpg_table.csv",row.names=1)
validation_cpgs <- rownames(validation_cpg_table)
beta_values <- beta_values[rownames(beta_values) %in% validation_cpgs,]
training_set_cpgs <- training_set_cpgs[rownames(training_set_cpgs) %in% validation_cpgs,]
test_set_cpgs <- test_set_cpgs[rownames(test_set_cpgs) %in% validation_cpgs,]
validation_cpg_table <- validation_cpg_table[rownames(validation_cpg_table) %in%
rownames(test_set_cpgs),]
############################################################
training_set_cpgs$cpg <- rownames(training_set_cpgs)
test_set_cpgs$cpg <- rownames(test_set_cpgs)
data.table::fwrite(training_set_cpgs,"ClockConstruction/training_set.csv",
row.names = TRUE)
data.table::fwrite(test_set_cpgs,"ClockConstruction/test_set.csv",
row.names = TRUE)
#Last step of QC will be outlier detection and removal.
training_set_cpgs <- data.table::fread("ClockConstruction/training_set.csv",header=TRUE)
training_set_cpgs$cpg <- training_set_cpgs$V1
training_set_cpgs <- training_set_cpgs[,-1]
test_set_cpgs <- data.table::fread("ClockConstruction/test_set.csv",header=TRUE)
test_set_cpgs$cpg <- test_set_cpgs$V1
test_set_cpgs <- test_set_cpgs[,-1]
# Removing outliers from training and test set.
#################################################
#First steps are to transpose the table so that CpGs are columns and samples are rows. After that,
# imputation will only be performed on samples and CpGs that have any NAs. This saves time
# later.
nodiff_present <- training_set_cpgs$cpg %in% rownames(filtered_cpgs)
cpg_list <- training_set_cpgs$cpg[training_set_cpgs$cpg %in% rownames(filtered_cpgs)]
training_set_cpgs <- training_set_cpgs[nodiff_present,]
test_set_cpgs <- test_set_cpgs[nodiff_present,]
#########################
training_set_cpgs[,("cpg"):=NULL]
test_set_cpgs[,("cpg"):=NULL]
training_samples <- colnames(training_set_cpgs)
test_samples <- colnames(test_set_cpgs)
#Performing outlier detection.
training_set_outliers <- outlyx(training_set_cpgs)
test_set_outliers <- outlyx(test_set_cpgs)
training_set_outliers <- training_set_outliers$outliers
test_set_outliers <- test_set_outliers$outliers
training_set_outliers <- !training_set_outliers
test_set_outliers <- !test_set_outliers
training_set_cpgs <- data.frame(training_set_cpgs)
test_set_cpgs <- data.frame(test_set_cpgs)
training_set_cpgs <- training_set_cpgs[,training_set_outliers]
test_set_cpgs <- test_set_cpgs[,test_set_outliers]
training_set_sample_table <- training_set_sample_table[training_set_outliers,]
test_set_sample_table <- test_set_sample_table[test_set_outliers,]
#Setting cpg name as a column to get passed into downstream functions.
training_set_cpgs$cpg <- cpg_list
test_set_cpgs$cpg <- cpg_list
#Saving files.
data.table::fwrite(training_set_cpgs,"ClockConstruction/training_set.csv",
row.names = TRUE)
data.table::fwrite(test_set_cpgs,"ClockConstruction/test_set.csv",
row.names = TRUE)
data.table::fwrite(validation_cpg_table,"ClockConstruction/filtered_validation_cpg_table.csv",
row.names = TRUE)
ClockDatabaseConstruction.R: Code used to generate the database that formed the basis
for the IntrinClock
#In this program I will construct a database of DNA methylation data, consisting of 450K and
# EPIC chip results. I will construct this database as having two segments - "cpg_table" consisting
# of cpgs in rows and samples in columns, and "sample_table" consisting of samples in rows and
# metadata in columns. The candidate cpgs will be filtered down to those that do not change with
# T cell differentiation.
source("AgingProjects/Useful Scripts/generally_useful.R") #Helper functions
#Packages
setwd("Data/") #Sets directory.
library(readr)
library(tidyr)
library(dplyr)
library(methylumi)
library(minfi)
library(stringr)
library(data.table)
library(ggplot2)
sample_table <- read.csv("ClockConstruction/entire_sample_table.csv",row.names=1)
# cpg_table <- data.table::fread("ClockConstruction/entire_cpg_table.csv",header=TRUE)
# row.names(cpg_table) <- cpg_table$V1
# cpg_table <- cpg_table[,-1]
#
# #Reading in the EPIC dataset (GPL21145) from Magnaye et al. 2022.
# magnaye2022EPIC_unformatted_table <- read.table("Magnaye2022/GSE201872-GPL21145_series_matrix.txt",
# comment = "!",
# skip=5,fill=TRUE,nrows = 5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# magnaye2022EPIC_unformatted_samples <- magnaye2022EPIC_unformatted_table[2,-1]
# magnaye2022EPIC_formatted_samples <- t(magnaye2022EPIC_unformatted_samples)[,1]
#
# #Formatting ages.
# magnaye2022EPIC_unformatted_ages <- magnaye2022EPIC_unformatted_table[3,-1]
# magnaye2022EPIC_formatted_ages <-
# as.numeric(str_sub(magnaye2022EPIC_unformatted_ages, 6, 7))
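# Sketch (assumption): the fixed character positions 6-7 used above suggest fields of the
# form "age: 45", which would truncate ages of 100 or more. A pattern-based parser avoids
# depending on field width. parse_age_field is a hypothetical helper that takes a character
# vector and returns everything after the first colon as a number.
parse_age_field <- function(x) {
  as.numeric(sub("^[^:]*:[[:space:]]*", "", as.character(x)))
}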
#
# #Formatting sex.
# magnaye2022EPIC_unformatted_sex <- magnaye2022EPIC_unformatted_table[4,-1]
# magnaye2022EPIC_formatted_sex <- str_sub(magnaye2022EPIC_unformatted_sex, 6, 15)
#
# #Formatting condition.
# magnaye2022EPIC_unformatted_condition <- magnaye2022EPIC_unformatted_table[5,-1]
# magnaye2022EPIC_formatted_condition <-
# str_sub(magnaye2022EPIC_unformatted_condition, 9, 18)
#
# #Looks like the values provided are heavily pre-processed m-values. To ensure consistency,
# # I will process raw data to generate normalized beta values.
# magnaye2022EPIC_list_of_files<-list.files(file.path("Magnaye2022/raw_data"))
# magnaye2022EPIC_list_of_files <-
# magnaye2022EPIC_list_of_files[231:length(magnaye2022EPIC_list_of_files)]
# magnaye2022EPIC_parsed_list_of_files <- substr(magnaye2022EPIC_list_of_files,1,30)
# magnaye2022EPIC_parsed_list_of_files <- unique(magnaye2022EPIC_parsed_list_of_files)
# magnaye2022EPIC_samplesheet <-
# data.frame(Sample=magnaye2022EPIC_formatted_samples,each=2,
# Ages = magnaye2022EPIC_formatted_ages,each=2,
# Sex= magnaye2022EPIC_formatted_sex,each=2,
# Condition = magnaye2022EPIC_formatted_condition,each=2,
# Basename = magnaye2022EPIC_parsed_list_of_files)
# setwd("Magnaye2022/raw_data")
# magnaye2022EPIC_RGSet <- read.metharray.exp(targets = magnaye2022EPIC_samplesheet)
# magnaye2022EPIC_MSet <- preprocessSWAN(magnaye2022EPIC_RGSet)
# magnaye2022EPIC_cpgs <- getBeta(magnaye2022EPIC_MSet)
# setwd("..")
# setwd("..")
# colnames(magnaye2022EPIC_cpgs) <- magnaye2022EPIC_formatted_samples
# magnaye2022EPIC_cpgs <- data.frame(magnaye2022EPIC_cpgs)
#
# #Reading in the 450K dataset (GPL13534) from Magnaye et al. 2022.
# magnaye2022450k_unformatted_table <- read.table("Magnaye2022/GSE201872-GPL13534_series_matrix.txt",
# comment = "!",
# skip=5,fill=TRUE,nrows = 5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# magnaye2022450k_unformatted_samples <- magnaye2022450k_unformatted_table[1,-1]
# magnaye2022450k_formatted_samples <- t(magnaye2022450k_unformatted_samples)[,1]
#
# #Formatting ages.
# magnaye2022450k_unformatted_ages <- magnaye2022450k_unformatted_table[2,-1]
# magnaye2022450k_formatted_ages <-
# as.numeric(str_sub(magnaye2022450k_unformatted_ages, 6, 7))
#
# #Formatting sex.
# magnaye2022450k_unformatted_sex <- magnaye2022450k_unformatted_table[3,-1]
# magnaye2022450k_formatted_sex <- str_sub(magnaye2022450k_unformatted_sex, 6, 15)
#
# #Formatting condition.
# magnaye2022450k_unformatted_condition <- magnaye2022450k_unformatted_table[4,-1]
# magnaye2022450k_formatted_condition <-
# str_sub(magnaye2022450k_unformatted_condition, 9, 18)
#
# #Looks like the values provided are heavily pre-processed m-values. To ensure consistency,
# # I will process raw data to generate normalized beta values.
# magnaye2022450k_list_of_files<-list.files(file.path("Magnaye2022/raw_data"))
# magnaye2022450k_list_of_files <- magnaye2022450k_list_of_files[7:230]
# magnaye2022450k_parsed_list_of_files <- substr(magnaye2022450k_list_of_files,1,28)
# magnaye2022450k_parsed_list_of_files <- unique(magnaye2022450k_parsed_list_of_files)
# magnaye2022450k_samplesheet <-
# data.frame(Sample=magnaye2022450k_formatted_samples,each=2,
# Ages = magnaye2022450k_formatted_ages,each=2,
# Sex= magnaye2022450k_formatted_sex,each=2,
# Condition = magnaye2022450k_formatted_condition,each=2,
# Basename = magnaye2022450k_parsed_list_of_files)
# setwd("Magnaye2022/raw_data")
# magnaye2022450k_RGSet <- read.metharray.exp(targets = magnaye2022450k_samplesheet)
# magnaye2022450k_MSet <- preprocessSWAN(magnaye2022450k_RGSet)
# magnaye2022450k_cpgs <- getBeta(magnaye2022450k_MSet)
# setwd("..")
# setwd("..")
# colnames(magnaye2022450k_cpgs) <- magnaye2022450k_formatted_samples
# magnaye2022450k_cpgs <- data.frame(magnaye2022450k_cpgs)
#
# magnaye2022EPIC_cpgs <- magnaye2022EPIC_cpgs[rownames(magnaye2022EPIC_cpgs) %in%
# rownames(magnaye2022450k_cpgs),]
# magnaye2022450k_cpgs <- magnaye2022450k_cpgs[rownames(magnaye2022450k_cpgs) %in%
# rownames(magnaye2022EPIC_cpgs),]
# shared_cpgs <- rownames(magnaye2022450k_cpgs)
# #Initialize sample table.
# sample_table <- data.frame(ID=character(),
# Author=character(),
# Year=integer(),
# Tissue=character(),
# CellType=character(),
# Age=integer(),
# Condition=character(),
# Sex=character(),
# DonorID=character(),
# Misc=character())
# magnayefirst_IDs <- paste(rep("F",30),1:30,sep="")
# magnayefirst_samples <- data.frame(ID=magnaye2022EPIC_formatted_samples,
# Author=rep("Magnaye",30),
# Year=rep(2022,30),
# Tissue=rep("Bronchi",30),
# CellType=rep("Epithelial",30),
# Age=magnaye2022EPIC_formatted_ages,
# Condition=magnaye2022EPIC_formatted_condition,
# Sex=magnaye2022EPIC_formatted_sex,
# DonorID=magnayefirst_IDs,
# Misc=rep("",30))
# magnayesecond_IDs <- paste(rep("G",112),1:112,sep="")
# magnayesecond_samples <- data.frame(ID=magnaye2022450k_formatted_samples,
# Author=rep("Magnaye",112),
# Year=rep(2022,112),
# Tissue=rep("Bronchi",112),
# CellType=rep("Epithelial",112),
# Age=magnaye2022450k_formatted_ages,
# Condition=magnaye2022450k_formatted_condition,
# Sex=magnaye2022450k_formatted_sex,
# DonorID=magnayesecond_IDs,
# Misc=rep("",112))
# sample_table <- rbind(magnayefirst_samples,magnayesecond_samples)
# magnaye2022450k_cpgs$cpg <- rownames(magnaye2022450k_cpgs)
# magnaye2022EPIC_cpgs$cpg <- rownames(magnaye2022EPIC_cpgs)
# magnaye2022EPIC_cpgs <- data.table(magnaye2022EPIC_cpgs)
# magnaye2022450k_cpgs <- data.table(magnaye2022450k_cpgs)
# cpg_table <- merge(magnaye2022450k_cpgs,magnaye2022EPIC_cpgs)
#
# # Time for a monocyte data set.
# estupinan2022_unformatted_table <-
# read.table("Estupinan2022/GSE201752_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# estupinan2022_formatted_samples <- strsplit(estupinan2022_unformatted_table[1,2]," ")[[1]]
#
# #Formatting ages.
# estupinan2022_unformatted_ages <- estupinan2022_unformatted_table[6,-1]
# estupinan2022_formatted_ages <- as.numeric(str_sub(estupinan2022_unformatted_ages, 6, 7))
#
# #Formatting sex.
# estupinan2022_unformatted_sex <- estupinan2022_unformatted_table[5,-1]
# estupinan2022_formatted_sex <- str_sub(estupinan2022_unformatted_sex, 6, 15)
#
# #Formatting condition and miscellaneous information.
# estupinan2022_unformatted_condition <- estupinan2022_unformatted_table[4,-1]
# estupinan2022_formatted_condition <- str_sub(estupinan2022_unformatted_condition, 17, 20)
# estupinan2022_formatted_condition[estupinan2022_formatted_condition == "HD"] <-
# "Control"
# estupinan2022_formatted_misc <- estupinan2022_formatted_condition
# estupinan2022_formatted_condition[estupinan2022_formatted_condition != "Control"] <-
# "Giant Cell Arteritis"
# estupinan2022_IDs <- paste(rep("H",113),1:113,sep="")
#
# estupinan2022_samples <- data.frame(ID=estupinan2022_formatted_samples,
# Author=rep("Estupinan",113),
# Year=rep(2022,113),
# Tissue=rep("Blood",113),
# CellType=rep("Monocytes",113),
# Age=estupinan2022_formatted_ages,
# Condition=estupinan2022_formatted_condition,
# Sex=estupinan2022_formatted_sex,
# DonorID=estupinan2022_IDs,
# Misc=estupinan2022_formatted_misc)
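#
# # The fixed str_sub() offsets above only work while every field carries a label of
# # the same width. As a hedged alternative, the sketch below strips the label with a
# # regular expression instead. It assumes the metadata strings look like "age: 63" or
# # "Sex: Male" (the listing does not show them); strip_field_label() and the example
# # vectors are hypothetical names introduced only for illustration.

```r
# Hypothetical helper (not part of the original pipeline): drop everything up to
# and including the first colon plus any following whitespace.
strip_field_label <- function(x) {
  sub("^[^:]*:[[:space:]]*", "", as.character(x))
}

# Made-up example strings in the assumed "label: value" style.
example_ages <- c("age: 63", "age: 7", "age: 101")
example_sex <- c("Sex: Female", "Sex: Male", "Sex: Male")

# Values are recovered regardless of how many digits or letters they contain.
parsed_ages <- as.numeric(strip_field_label(example_ages))
parsed_sex <- strip_field_label(example_sex)
```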
#
# estupinan2022_cpgs <-
# data.frame(read_table2("Estupinan2022/GSE201752_processed_data.txt",
# skip=4),row.names=1)
# colnames(estupinan2022_cpgs) <- estupinan2022_formatted_samples
# estupinan2022_cpgs <- estupinan2022_cpgs[rownames(estupinan2022_cpgs) %in%
# shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(estupinan2022_cpgs),]
#
# estupinan2022_cpgs$cpg <- rownames(estupinan2022_cpgs)
# estupinan2022_cpgs <- data.table(estupinan2022_cpgs)
# sample_table <- rbind(sample_table,estupinan2022_samples)
# cpg_table <- merge(cpg_table,estupinan2022_cpgs, all=TRUE)
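#
# # Merging with all=TRUE keeps every CpG present in either table and fills the
# # missing entries with NA. The sketch below illustrates this with base R data frames,
# # made-up CpG names, and made-up beta values; the pipeline itself merges data.table
# # objects, and the sketch names the key column explicitly via by = "cpg".

```r
# Two toy tables that share only some CpGs (all names and values are invented).
table_a <- data.frame(cpg = c("cg01", "cg02", "cg03"), S1 = c(0.10, 0.50, 0.90))
table_b <- data.frame(cpg = c("cg02", "cg03", "cg04"), S2 = c(0.40, 0.80, 0.20))

# Full outer join on the CpG identifier: the union of rows is kept.
merged <- merge(table_a, table_b, by = "cpg", all = TRUE)

# CpGs absent from one table show up as NA in that table's sample columns.
missing_in_b <- merged$cpg[is.na(merged$S2)]
missing_in_a <- merged$cpg[is.na(merged$S1)]
```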
# #
# #
# # #Time for a blood data set.
# okereke2021_unformatted_table <- read.table("Okereke2021/GSE190540_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# okereke2021_unformatted_samples <- strsplit(okereke2021_unformatted_table[1,2]," ")[[1]]
#
# #Formatting ages.
# okereke2021_unformatted_ages <- okereke2021_unformatted_table[4,-1]
# okereke2021_formatted_ages <- as.numeric(str_sub(okereke2021_unformatted_ages, 6, 7))
#
# #Formatting sex.
# okereke2021_unformatted_sex <- okereke2021_unformatted_table[2,-1]
# okereke2021_formatted_sex <- str_sub(okereke2021_unformatted_sex, 9, 15)
#
# #Formatting condition and miscellaneous information.
# okereke2021_unformatted_condition <- okereke2021_unformatted_table[3,-1]
# okereke2021_formatted_condition <- str_sub(okereke2021_unformatted_condition, 17, 25)
# okereke2021_formatted_condition[okereke2021_formatted_condition == "Case"] <-
# "Cognitive Impairment"
#
# okereke2021_IDs <- paste(rep("I",90),1:90,sep="")
#
# okereke2021_samples <- data.frame(ID=okereke2021_unformatted_samples,
# Author=rep("Okereke",90),
# Year=rep(2021,90),
# Tissue=rep("Blood",90),
# CellType=rep("PBMCs",90),
# Age=okereke2021_formatted_ages,
# Condition=okereke2021_formatted_condition,
# Sex=okereke2021_formatted_sex,
# DonorID=okereke2021_IDs,
# Misc=rep(NA,90))
#
# okereke2021_cpgs <- read.table("Okereke2021/GSE190540_series_matrix.txt",
# comment = "!",
# skip=5,
# fill=TRUE)
# okereke2021_cpgs <- data.frame(okereke2021_cpgs)
# okereke2021_cpgs <- okereke2021_cpgs[-c(1:5),]
# rownames(okereke2021_cpgs) <- okereke2021_cpgs$V1
# okereke2021_cpgs <- data.frame(okereke2021_cpgs[,-1])
# colnames(okereke2021_cpgs) <- okereke2021_unformatted_samples
# okereke2021_cpgs <- okereke2021_cpgs[rownames(okereke2021_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(okereke2021_cpgs),]
#
# okereke2021_cpgs$cpg <- rownames(okereke2021_cpgs)
# okereke2021_cpgs <- data.table(okereke2021_cpgs)
# sample_table <- rbind(sample_table,okereke2021_samples)
# cpg_table <- merge(cpg_table,okereke2021_cpgs, all=TRUE)
# #
# # # Skin data set.
# muse2021_unformatted_table <- read.table("Muse2021/GSE188593_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# muse2021_unformatted_samples <- muse2021_unformatted_table[1,-1]
# muse2021_formatted_samples <- t(muse2021_unformatted_samples)[,1]
#
# #Formatting ages.
# muse2021_unformatted_ages <- muse2021_unformatted_table[3,-1]
# muse2021_formatted_ages <- as.numeric(str_sub(muse2021_unformatted_ages, 13, 19))
#
# #Formatting sex.
# muse2021_unformatted_sex <- muse2021_unformatted_table[4,-1]
# muse2021_formatted_sex <- str_sub(muse2021_unformatted_sex, 6, 7)
#
# muse2021_unformatted_donor <- muse2021_unformatted_table[2,-1]
# muse2021_unformatted_donor <- str_sub(muse2021_unformatted_donor, 10,15)
# muse2021_unformatted_donor <- as.numeric(factor(muse2021_unformatted_donor))
# muse2021_formatted_donor <- paste(rep("J",64),muse2021_unformatted_donor,sep="")
#
# #Formatting condition and miscellaneous information.
# muse2021_unformatted_condition <- muse2021_unformatted_table[6,-1]
# muse2021_formatted_condition <- str_sub(muse2021_unformatted_condition, 17, 30)
# muse2021_formatted_condition[muse2021_formatted_condition == ""] <- "Control"
#
# muse2021_IDs <- paste(rep("J",64),1:64,sep="")
#
# muse2021_samples <- data.frame(ID=muse2021_formatted_samples,
# Author=rep("Muse",64),
# Year=rep(2021,64),
# Tissue=rep("Skin",64),
# CellType=rep("Epithelial",64),
# Age=muse2021_formatted_ages,
# Condition=muse2021_formatted_condition,
# Sex=muse2021_formatted_sex,
# DonorID=muse2021_formatted_donor,
# Misc=rep(NA,64))
#
# muse2021_cpgs <- read.table("Muse2021/GSE188593_series_matrix.txt",
# comment = "!",
# skip=5,
# fill=TRUE)
# muse2021_cpgs <- data.frame(muse2021_cpgs)
# muse2021_cpgs <- muse2021_cpgs[-c(1:6),]
# rownames(muse2021_cpgs) <- muse2021_cpgs$V1
# muse2021_cpgs <- data.frame(muse2021_cpgs[,-1])
# colnames(muse2021_cpgs) <- muse2021_unformatted_samples
# muse2021_cpgs <- muse2021_cpgs[rownames(muse2021_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(muse2021_cpgs),]
#
# muse2021_cpgs$cpg <- rownames(muse2021_cpgs)
# muse2021_cpgs <- data.table(muse2021_cpgs)
# sample_table <- rbind(sample_table,muse2021_samples)
# cpg_table <- merge(cpg_table,muse2021_cpgs, all=TRUE)
#
# # #Blood data set.
# konigsberg2021_unformatted_table <-
# read.table("Konigsberg2021/GSE167202_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# konigsberg2021_formatted_samples <- strsplit(konigsberg2021_unformatted_table[1,2]," ")[[1]]
#
# #Formatting ages.
# konigsberg2021_unformatted_ages <- konigsberg2021_unformatted_table[4,-1]
# konigsberg2021_formatted_ages <- as.numeric(str_sub(konigsberg2021_unformatted_ages, 5, 8))
#
# #Formatting sex.
# konigsberg2021_unformatted_sex <- konigsberg2021_unformatted_table[3,-1]
# konigsberg2021_formatted_sex <- str_sub(konigsberg2021_unformatted_sex, 6, 15)
#
# #Formatting condition and miscellaneous information.
# konigsberg2021_unformatted_condition <- konigsberg2021_unformatted_table[2,-1]
# konigsberg2021_formatted_condition <- str_sub(konigsberg2021_unformatted_condition, 15, 30)
# konigsberg2021_formatted_condition[konigsberg2021_formatted_condition == "negative"] <-
# "Control"
# konigsberg2021_formatted_condition[konigsberg2021_formatted_condition == "positive"] <-
# "COVID"
# konigsberg2021_formatted_condition[konigsberg2021_formatted_condition == "other infection"] <-
# "Respiratory Illness"
#
# konigsberg2021_IDs <- paste(rep("K",525),1:525,sep="")
#
# konigsberg2021_samples <- data.frame(ID=konigsberg2021_formatted_samples,
# Author=rep("Konigsberg",525),
# Year=rep(2021,525),
# Tissue=rep("Blood",525),
# CellType=rep("PBMCs",525),
# Age=konigsberg2021_formatted_ages,
# Condition=konigsberg2021_formatted_condition,
# Sex=konigsberg2021_formatted_sex,
# DonorID=konigsberg2021_IDs,
# Misc=rep(NA,525))
#
# konigsberg2021_cpgs <-
# data.table::fread("Konigsberg2021/GSE167202_ProcessedBetaValues.txt",
# header=TRUE) %>%
# as.data.frame()
# konigsberg2021_cpgs <- konigsberg2021_cpgs[,-c(2:22)]
# konigsberg2021_sequence_ids <- konigsberg2021_unformatted_table[5,-1]
# konigsberg2021_sequence_ids <- unlist(konigsberg2021_sequence_ids)
# row.names(konigsberg2021_cpgs) <- konigsberg2021_cpgs$ID_REF
# konigsberg2021_cpgs <- konigsberg2021_cpgs[,-1]
# konigsberg2021_cpgs <- konigsberg2021_cpgs[,konigsberg2021_sequence_ids]
# colnames(konigsberg2021_cpgs) <- konigsberg2021_formatted_samples
#
# konigsberg2021_cpgs <- konigsberg2021_cpgs[rownames(konigsberg2021_cpgs) %in%
# shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(konigsberg2021_cpgs),]
#
# konigsberg2021_cpgs$cpg <- rownames(konigsberg2021_cpgs)
# konigsberg2021_cpgs <- data.table(konigsberg2021_cpgs)
# sample_table <- rbind(sample_table,konigsberg2021_samples)
# cpg_table <- merge(cpg_table,konigsberg2021_cpgs, all=TRUE)
#
# #Brain samples.
# haghighi2022_unformatted_table <- read.table("Haghighi2022/GSE191200_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# haghighi2022_formatted_samples <- strsplit(haghighi2022_unformatted_table[1,2]," ")[[1]]
#
# #Formatting ages.
# haghighi2022_unformatted_ages <- haghighi2022_unformatted_table[4,-1]
# haghighi2022_formatted_ages <- as.numeric(str_sub(haghighi2022_unformatted_ages, 5, 8))
#
# #Formatting sex.
# haghighi2022_unformatted_sex <- haghighi2022_unformatted_table[5,-1]
# haghighi2022_formatted_sex <- str_sub(haghighi2022_unformatted_sex, 6, 15)
#
# #Formatting condition and miscellaneous information.
# haghighi2022_unformatted_condition <- haghighi2022_unformatted_table[6,-1]
# haghighi2022_formatted_condition <- str_sub(haghighi2022_unformatted_condition, 12, 30)
#
# haghighi2022_unformatted_donor <- haghighi2022_unformatted_table[2,-1]
# haghighi2022_unformatted_donor <- str_sub(haghighi2022_unformatted_donor,11,15)
# haghighi2022_unformatted_donor <- as.numeric(factor(haghighi2022_unformatted_donor,
# labels=c(1:22)))
# haghighi2022_formatted_donor <-
# paste(rep("L",56),haghighi2022_unformatted_donor,sep="")
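#
# # The donor step above turns arbitrary donor labels into compact integer codes, so
# # that samples from the same donor share one DonorID. A minimal sketch of that
# # idea follows, using invented donor labels. Passing an explicit labels= vector to
# # factor(), as the listing does, additionally requires the number of labels to match
# # the number of distinct donors; the sketch leaves labels= out to avoid that coupling.

```r
# Invented donor labels; the first and third samples come from the same donor.
example_donor_labels <- c("brain_07", "brain_02", "brain_07", "brain_11")

# factor() sorts the distinct labels; as.integer() returns each sample's level code.
donor_index <- as.integer(factor(example_donor_labels))

# Prefix the code with the dataset letter, as done for the other datasets.
donor_ids <- paste0("L", donor_index)
```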
#
# haghighi2022_samples <- data.frame(ID=haghighi2022_formatted_samples,
# Author=rep("Haghighi",56),
# Year=rep(2022,56),
# Tissue=rep("Brain",56),
# CellType=rep("Microglia",56),
# Age=haghighi2022_formatted_ages,
# Condition=haghighi2022_formatted_condition,
# Sex=haghighi2022_formatted_sex,
# DonorID=haghighi2022_formatted_donor,
# Misc=rep(NA,56))
#
# haghighi2022_cpgs <- read.table("Haghighi2022/GSE191200_series_matrix.txt",
# comment = "!",
# skip=7,
# fill=TRUE)
# haghighi2022_cpgs <- haghighi2022_cpgs[-c(1:6),]
# row.names(haghighi2022_cpgs) <- haghighi2022_cpgs$V1
# haghighi2022_cpgs <- haghighi2022_cpgs[,-1]
# colnames(haghighi2022_cpgs) <- haghighi2022_cpgs[1,]
# haghighi2022_cpgs <- haghighi2022_cpgs[-1,]
#
# haghighi2022_cpgs <- haghighi2022_cpgs[rownames(haghighi2022_cpgs) %in%
# shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(haghighi2022_cpgs),]
#
# haghighi2022_cpgs$cpg <- rownames(haghighi2022_cpgs)
# haghighi2022_cpgs <- data.table(haghighi2022_cpgs)
# sample_table <- rbind(sample_table,haghighi2022_samples)
# cpg_table <- merge(cpg_table,haghighi2022_cpgs, all=TRUE)
#
# # #Colorectal samples.
# chen2021_unformatted_table <- read.table("Chen2021/GSE159898_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# chen2021_formatted_samples <- strsplit(chen2021_unformatted_table[1,2]," ")[[1]]
#
# #Formatting ages.
# chen2021_unformatted_ages <- chen2021_unformatted_table[3,-1]
# chen2021_formatted_ages <- as.numeric(str_sub(chen2021_unformatted_ages, 5, 8))
#
# #Formatting sex.
# chen2021_unformatted_sex <- chen2021_unformatted_table[2,-1]
# chen2021_formatted_sex <- str_sub(chen2021_unformatted_sex, 9, 15)
#
# #Formatting condition and miscellaneous information.
# chen2021_unformatted_condition <- chen2021_unformatted_table[4,-1]
# chen2021_formatted_condition <- str_sub(chen2021_unformatted_condition, 14, 40)
# chen2021_formatted_condition[chen2021_formatted_condition == "normal colorectal tissue"] <-
# "Control"
# chen2021_formatted_condition[chen2021_formatted_condition == "colorectal cancer tissue"] <-
# "Colorectal Cancer"
#
# chen2021_unformatted_donor <- chen2021_unformatted_table[5,-1]
# chen2021_unformatted_donor <- str_sub(chen2021_unformatted_donor,11,15)
# chen2021_unformatted_donor <- as.numeric(factor(chen2021_unformatted_donor,
# labels=c(1:21)))
# chen2021_formatted_donor <- paste(rep("M",44),chen2021_unformatted_donor,sep="")
#
# chen2021_samples <- data.frame(ID=chen2021_formatted_samples,
# Author=rep("Chen",44),
# Year=rep(2021,44),
# Tissue=rep("Colorectal",44),
# CellType=rep("Epithelial (Colorectal)",44),
# Age=chen2021_formatted_ages,
# Condition=chen2021_formatted_condition,
# Sex=chen2021_formatted_sex,
# DonorID=chen2021_formatted_donor,
# Misc=rep(NA,44))
#
# chen2021_cpgs <- read.table("Chen2021/GSE159898_series_matrix.txt",
# comment = "!",
# skip=7,
# fill=TRUE)
# chen2021_cpgs <- chen2021_cpgs[-c(1:5),]
# row.names(chen2021_cpgs) <- chen2021_cpgs$V1
# chen2021_cpgs <- chen2021_cpgs[,-1]
# colnames(chen2021_cpgs) <- chen2021_cpgs[1,]
# chen2021_cpgs <- chen2021_cpgs[-1,]
#
# chen2021_cpgs <- chen2021_cpgs[rownames(chen2021_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(chen2021_cpgs),]
#
# chen2021_cpgs$cpg <- rownames(chen2021_cpgs)
# chen2021_cpgs <- data.table(chen2021_cpgs)
# sample_table <- rbind(sample_table,chen2021_samples)
# cpg_table <- merge(cpg_table,chen2021_cpgs, all=TRUE)
#
# #Skeletal Muscle samples
# voisin2021_unformatted_table <- read.table("Voisin2021/GSE151407_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# voisin2021_formatted_samples <- strsplit(voisin2021_unformatted_table[1,2]," ")[[1]]
#
# #Formatting ages.
# voisin2021_unformatted_ages <- voisin2021_unformatted_table[3,-1]
# voisin2021_formatted_ages <- as.numeric(str_sub(voisin2021_unformatted_ages, 5, 8))
#
# #Formatting sex.
# voisin2021_unformatted_sex <- voisin2021_unformatted_table[4,-1]
# voisin2021_formatted_sex <- str_sub(voisin2021_unformatted_sex, 6, 15)
#
# #Formatting condition and miscellaneous information.
# voisin2021_unformatted_condition <- voisin2021_unformatted_table[2,-1]
# voisin2021_formatted_condition <- str_sub(voisin2021_unformatted_condition, 8, 10)
# voisin2021_formatted_condition[voisin2021_formatted_condition == "PRE"] <-
# "Control"
# voisin2021_formatted_condition[voisin2021_formatted_condition != "Control"] <-
# "HIIT"
#
# voisin2021_unformatted_donor <- voisin2021_unformatted_table[2,-1]
# voisin2021_unformatted_donor <- str_sub(voisin2021_unformatted_donor,-9,-6)
# voisin2021_unformatted_donor <- sub("_", "", voisin2021_unformatted_donor)
# voisin2021_unformatted_donor <- as.numeric(factor(voisin2021_unformatted_donor,
# labels=c(1:25)))
# voisin2021_formatted_donor <- paste(rep("N",78),voisin2021_unformatted_donor,sep="")
#
# voisin2021_samples <- data.frame(ID=voisin2021_formatted_samples,
# Author=rep("Voisin",78),
# Year=rep(2021,78),
# Tissue=rep("Skeletal Muscle",78),
# CellType=rep("Muscle Cells",78),
# Age=voisin2021_formatted_ages,
# Condition=voisin2021_formatted_condition,
# Sex=voisin2021_formatted_sex,
# DonorID=voisin2021_formatted_donor,
# Misc=rep(NA,78))
#
# voisin2021_cpgs <- read.table("Voisin2021/GSE151407_series_matrix.txt",
# comment = "!",
# skip=7,
# fill=TRUE)
# voisin2021_cpgs <- voisin2021_cpgs[-c(1:4),]
# row.names(voisin2021_cpgs) <- voisin2021_cpgs$V1
# voisin2021_cpgs <- voisin2021_cpgs[,-1]
# colnames(voisin2021_cpgs) <- voisin2021_cpgs[1,]
# voisin2021_cpgs <- voisin2021_cpgs[-1,]
#
# voisin2021_cpgs <- voisin2021_cpgs[rownames(voisin2021_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(voisin2021_cpgs),]
#
# voisin2021_cpgs$cpg <- rownames(voisin2021_cpgs)
# voisin2021_cpgs <- data.table(voisin2021_cpgs)
# sample_table <- rbind(sample_table,voisin2021_samples)
# cpg_table <- merge(cpg_table,voisin2021_cpgs, all=TRUE)
#
# # More brain samples. Note: I am somewhat apprehensive about the following dataset.
#
# fries2019_unformatted_table <- read.table("Fries2019/GSE129428_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# fries2019_formatted_samples <- strsplit(fries2019_unformatted_table[1,2]," ")[[1]]
#
# #Formatting ages.
# fries2019_unformatted_ages <- fries2019_unformatted_table[4,-1]
# fries2019_formatted_ages <- as.numeric(str_sub(fries2019_unformatted_ages, 14, 19))
#
# #Formatting sex.
# fries2019_unformatted_sex <- fries2019_unformatted_table[2,-1]
# fries2019_formatted_sex <- str_sub(fries2019_unformatted_sex, 6, 15)
#
# #Formatting condition and miscellaneous information.
# fries2019_unformatted_condition <- fries2019_unformatted_table[3,-1]
# fries2019_formatted_condition <- str_sub(fries2019_unformatted_condition, 8, 100)
#
# fries2019_formatted_donor <- paste(rep("O",64),c(1:64),sep="")
#
# fries2019_samples <- data.frame(ID=fries2019_formatted_samples,
# Author=rep("Fries",64),
# Year=rep(2019,64),
# Tissue=rep("Brain",64),
# CellType=rep("Brain Cells",64),
# Age=fries2019_formatted_ages,
# Condition=fries2019_formatted_condition,
# Sex=fries2019_formatted_sex,
# DonorID=fries2019_formatted_donor,
# Misc=rep(NA,64))
#
# fries2019_cpgs <- read.table("Fries2019/GSE129428_series_matrix.txt",
# comment = "!",
# skip=7,
# fill=TRUE)
# fries2019_cpgs <- fries2019_cpgs[-c(1:4),]
# row.names(fries2019_cpgs) <- fries2019_cpgs$V1
# fries2019_cpgs <- fries2019_cpgs[,-1]
# colnames(fries2019_cpgs) <- fries2019_cpgs[1,]
# fries2019_cpgs <- fries2019_cpgs[-1,]
#
# fries2019_cpgs <- fries2019_cpgs[rownames(fries2019_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(fries2019_cpgs),]
#
# fries2019_cpgs$cpg <- rownames(fries2019_cpgs)
# fries2019_cpgs <- data.table(fries2019_cpgs)
# sample_table <- rbind(sample_table,fries2019_samples)
# cpg_table <- merge(cpg_table,fries2019_cpgs, all=TRUE)
#
# #Samples from children.
#
# islam2018_unformatted_table <- read.table("Islam2018/GSE124366_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# islam2018_unformatted_samples <- islam2018_unformatted_table[6,]
# islam2018_formatted_samples <- str_sub(islam2018_unformatted_table[6,],1,100)[2:216]
#
# #Formatting ages.
# islam2018_unformatted_ages <- islam2018_unformatted_table[4,-1]
# islam2018_formatted_ages <- as.numeric(str_sub(islam2018_unformatted_ages, 35,100))
#
# #Formatting sex.
# islam2018_unformatted_sex <- islam2018_unformatted_table[3,-1]
# islam2018_formatted_sex <- str_sub(islam2018_unformatted_sex, 6, 15)
#
# #Formatting condition and miscellaneous information.
# islam2018_formatted_condition <- rep("Control",215)
#
# islam2018_unformatted_donor <- islam2018_unformatted_table[1,-1]
# islam2018_unformatted_donor <- str_sub(islam2018_unformatted_donor,1,8)
# islam2018_unformatted_donor <- as.numeric(factor(islam2018_unformatted_donor,
# labels=c(1:201)))
# islam2018_formatted_donor <- paste(rep("P",215),islam2018_unformatted_donor,sep="")
#
# islam2018_unformatted_celltype <- islam2018_unformatted_table[2,-1]
# islam2018_formatted_celltype <- str_sub(islam2018_unformatted_celltype, 9, 20)
#
# islam2018_unformatted_tissue <- islam2018_formatted_celltype
# islam2018_unformatted_tissue[islam2018_unformatted_tissue == "PBMC"] <- "Blood"
# islam2018_formatted_tissue <- islam2018_unformatted_tissue
#
# islam2018_samples <- data.frame(ID=islam2018_formatted_samples,
# Author=rep("Islam",215),
# Year=rep(2018,215),
# Tissue=islam2018_formatted_tissue,
# CellType=islam2018_formatted_celltype,
# Age=islam2018_formatted_ages,
# Condition=islam2018_formatted_condition,
# Sex=islam2018_formatted_sex,
# DonorID=islam2018_formatted_donor,
# Misc=rep(NA,215))
#
# islam2018_cpgs <- read.table("Islam2018/GSE124366_series_matrix.txt",
# comment = "!",
# skip=6,
# fill=TRUE)
# islam2018_cpgs <- islam2018_cpgs[-c(1:6),]
# row.names(islam2018_cpgs) <- islam2018_cpgs$V1
# islam2018_cpgs <- islam2018_cpgs[,-1]
# colnames(islam2018_cpgs) <- islam2018_formatted_samples
#
# islam2018_cpgs <- islam2018_cpgs[rownames(islam2018_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(islam2018_cpgs),]
#
# islam2018_cpgs$cpg <- rownames(islam2018_cpgs)
# islam2018_cpgs <- data.table(islam2018_cpgs)
# sample_table <- rbind(sample_table,islam2018_samples)
# cpg_table <- merge(cpg_table,islam2018_cpgs, all=TRUE)
# #
#
# #Samples from breast tissue.
#
# johnson2016_unformatted_table <- read.table("Johnson2016/GSE88883_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 4)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# johnson2016_unformatted_samples <- johnson2016_unformatted_table[4,]
# johnson2016_formatted_samples <- str_sub(johnson2016_unformatted_samples,1,100)[2:101]
#
# #Formatting ages.
# johnson2016_unformatted_ages <- johnson2016_unformatted_table[2,-1]
# johnson2016_formatted_ages <- as.numeric(str_sub(johnson2016_unformatted_ages, 14,100))
#
# #Formatting sex.
# johnson2016_unformatted_sex <- johnson2016_unformatted_table[1,-1]
# johnson2016_formatted_sex <- str_sub(johnson2016_unformatted_sex, 6, 15)
#
# #Formatting condition and miscellaneous information.
# johnson2016_formatted_condition <- rep("Control",100)
#
# johnson2016_formatted_donor <- paste(rep("Q",100),c(1:100),sep="")
#
# johnson2016_samples <- data.frame(ID=johnson2016_formatted_samples,
# Author=rep("Johnson",100),
# Year=rep(2016,100),
# Tissue=rep("Breast",100),
# CellType=rep("Breast",100),
# Age=johnson2016_formatted_ages,
# Condition=johnson2016_formatted_condition,
# Sex=johnson2016_formatted_sex,
# DonorID=johnson2016_formatted_donor,
# Misc=rep(NA,100))
#
# johnson2016_cpgs <- read.table("Johnson2016/GSE88883_series_matrix.txt",
# comment = "!",
# skip=4,
# fill=TRUE)
# johnson2016_cpgs <- johnson2016_cpgs[-c(1:4),]
# row.names(johnson2016_cpgs) <- johnson2016_cpgs$V1
# johnson2016_cpgs <- johnson2016_cpgs[,-1]
# colnames(johnson2016_cpgs) <- johnson2016_formatted_samples
#
# johnson2016_cpgs <- johnson2016_cpgs[rownames(johnson2016_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(johnson2016_cpgs),]
#
# johnson2016_cpgs$cpg <- rownames(johnson2016_cpgs)
# johnson2016_cpgs <- data.table(johnson2016_cpgs)
# sample_table <- rbind(sample_table,johnson2016_samples)
# cpg_table <- merge(cpg_table,johnson2016_cpgs, all=TRUE)
#
# #Samples from liver.
#
# horvath2014_unformatted_table <- read.table("Horvath2014/GSE61258_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 4)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# horvath2014_unformatted_samples <- horvath2014_unformatted_table[4,]
# horvath2014_formatted_samples <- str_sub(horvath2014_unformatted_samples,1,100)[2:80]
#
# #Formatting ages.
# horvath2014_unformatted_ages <- horvath2014_unformatted_table[2,-1]
# horvath2014_formatted_ages <- as.numeric(str_sub(horvath2014_unformatted_ages, 5,100))
#
# #Formatting sex.
# horvath2014_unformatted_sex <- horvath2014_unformatted_table[1,-1]
# horvath2014_formatted_sex <- str_sub(horvath2014_unformatted_sex, 6, 15)
#
# #Formatting condition and miscellaneous information.
# horvath2014_formatted_condition <- rep("Control",79)
#
# horvath2014_formatted_donor <- paste(rep("R",79),c(1:79),sep="")
#
# horvath2014_samples <- data.frame(ID=horvath2014_formatted_samples,
# Author=rep("Horvath",79),
# Year=rep(2014,79),
# Tissue=rep("Liver",79),
# CellType=rep("Liver",79),
# Age=horvath2014_formatted_ages,
# Condition=horvath2014_formatted_condition,
# Sex=horvath2014_formatted_sex,
# DonorID=horvath2014_formatted_donor,
# Misc=rep(NA,79))
#
# horvath2014_cpgs <- read.table("Horvath2014/GSE61258_series_matrix.txt",
# comment = "!",
# skip=4,
# fill=TRUE)
# horvath2014_cpgs <- horvath2014_cpgs[-c(1:4),]
# row.names(horvath2014_cpgs) <- horvath2014_cpgs$V1
# horvath2014_cpgs <- horvath2014_cpgs[,-1]
# colnames(horvath2014_cpgs) <- horvath2014_formatted_samples
#
# horvath2014_cpgs <- horvath2014_cpgs[rownames(horvath2014_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(horvath2014_cpgs),]
#
# horvath2014_cpgs$cpg <- rownames(horvath2014_cpgs)
# horvath2014_cpgs <- data.table(horvath2014_cpgs)
# sample_table <- rbind(sample_table,horvath2014_samples)
# cpg_table <- merge(cpg_table,horvath2014_cpgs, all=TRUE)
#
# #Samples from sperm tissue.
#
# pilsner2022_unformatted_table <- read.table("Pilsner2022/GSE185445_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# pilsner2022_unformatted_samples <- pilsner2022_unformatted_table[1,-1]
# pilsner2022_formatted_samples <- str_sub(pilsner2022_unformatted_samples,1,100)[1:379]
#
# #Formatting ages.
# pilsner2022_unformatted_ages <- pilsner2022_unformatted_table[3,-1]
# pilsner2022_formatted_ages <- as.numeric(str_sub(pilsner2022_unformatted_ages, 8,100))
#
# #Formatting sex.
# pilsner2022_formatted_sex <- rep("Male",379)
#
# #Formatting condition and miscellaneous information.
# pilsner2022_formatted_condition <- rep("Control",379)
#
# pilsner2022_formatted_donor <- paste(rep("T",379),c(1:379),sep="")
#
# pilsner2022_samples <- data.frame(ID=pilsner2022_formatted_samples,
# Author=rep("Pilsner",379),
# Year=rep(2022,379),
# Tissue=rep("Semen",379),
# CellType=rep("Sperm",379),
# Age=pilsner2022_formatted_ages,
# Condition=pilsner2022_formatted_condition,
# Sex=pilsner2022_formatted_sex,
# DonorID=pilsner2022_formatted_donor,
# Misc=rep(NA,379))
#
# pilsner2022_unformatted_basenames <- pilsner2022_unformatted_table[2,-1]
# pilsner2022_formatted_basenames <- str_sub(pilsner2022_unformatted_basenames,11,100)
#
# pilsner2022_cpgs <-
# data.table::fread("Pilsner2022/GSE185445_processed_methylation_data_matrix.txt",
# header=FALSE) %>%
# as.data.frame()
#
# pilsner2022_colnames <- data.frame(read_table2("Pilsner2022/colnames.csv",
# col_names = FALSE))
# pilsner2022_colnames <- substr(pilsner2022_colnames,2,20)
#
# row.names(pilsner2022_cpgs) <- pilsner2022_cpgs[,1]
# pilsner2022_cpgs <- pilsner2022_cpgs[,-1]
# colnames(pilsner2022_cpgs) <- pilsner2022_colnames
# pilsner2022_cpgs <- pilsner2022_cpgs[,pilsner2022_formatted_basenames]
#
# pilsner2022_cpgs <- pilsner2022_cpgs[rownames(pilsner2022_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(pilsner2022_cpgs),]
# colnames(pilsner2022_cpgs) <- pilsner2022_formatted_samples
#
# pilsner2022_cpgs$cpg <- rownames(pilsner2022_cpgs)
# pilsner2022_cpgs <- data.table(pilsner2022_cpgs)
# sample_table <- rbind(sample_table,pilsner2022_samples)
# cpg_table <- merge(cpg_table,pilsner2022_cpgs, all=TRUE)
#
# #Samples from nasal epithelia
#
# davalos2022_unformatted_table <- read.table("Davalos2022/GSE193879_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 4)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# davalos2022_unformatted_samples <- davalos2022_unformatted_table[1,]
# davalos2022_formatted_samples <- str_sub(davalos2022_unformatted_samples,1,62)[2:128]
#
# #Formatting ages.
# davalos2022_unformatted_ages <- davalos2022_unformatted_table[4,-1]
# davalos2022_formatted_ages <- as.numeric(str_sub(davalos2022_unformatted_ages, 13,15))
#
# #Formatting sex.
# davalos2022_unformatted_sex <- davalos2022_unformatted_table[3,-1]
# davalos2022_formatted_sex <- str_sub(davalos2022_unformatted_sex, 17, 30)
#
# #Formatting condition and miscellaneous information.
# davalos2022_unformatted_condition <- davalos2022_unformatted_table[2,-1]
# davalos2022_formatted_condition <- str_sub(davalos2022_unformatted_condition, 8,39)
#
# davalos2022_formatted_donor <- paste(rep("U",127),c(1:127),sep="")
#
# davalos2022_samples <- data.frame(ID=davalos2022_formatted_samples,
# Author=rep("Davalos",127),
# Year=rep(2022,127),
# Tissue=rep("Blood",127),
# CellType=rep("Blood",127),
# Age=davalos2022_formatted_ages,
# Condition=davalos2022_formatted_condition,
# Sex=davalos2022_formatted_sex,
# DonorID=davalos2022_formatted_donor,
# Misc=rep(NA,127))
#
# davalos2022_samples <- davalos2022_samples[!is.na(davalos2022_samples$Age),]
#
# davalos2022_cpgs <- data.table::fread("Davalos2022/GSE193879_Matrix_processed.csv",
# header=FALSE) %>%
# as.data.frame()
# row.names(davalos2022_cpgs) <- davalos2022_cpgs$V1
# davalos2022_cpgs <- davalos2022_cpgs[-1,-1]
# davalos2022_cpgs <- davalos2022_cpgs[, rep(c(TRUE, FALSE),127)]
# colnames(davalos2022_cpgs) <- davalos2022_formatted_samples
# davalos2022_cpgs <- davalos2022_cpgs[,colnames(davalos2022_cpgs) %in%
# davalos2022_samples$ID]
#
# davalos2022_cpgs <- davalos2022_cpgs[rownames(davalos2022_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(davalos2022_cpgs),]
#
# davalos2022_cpgs$cpg <- rownames(davalos2022_cpgs)
# davalos2022_cpgs <- data.table(davalos2022_cpgs)
# sample_table <- rbind(sample_table,davalos2022_samples)
# cpg_table <- merge(cpg_table,davalos2022_cpgs, all=TRUE)
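#
# # The logical mask used above keeps every other column of the processed matrix. The
# # sketch below shows the same selection on a toy table. It assumes each sample
# # contributes two adjacent columns, a methylation value followed by a second
# # per-sample column (for example a detection p-value); that layout is an assumption,
# # and all column names and values here are invented.

```r
# Toy matrix with two columns per sample (names and values are invented).
example_matrix <- data.frame(
  S1_beta = c(0.1, 0.2), S1_pval = c(0.01, 0.02),
  S2_beta = c(0.3, 0.4), S2_pval = c(0.03, 0.04)
)
n_samples <- 2

# TRUE, FALSE repeated once per sample keeps columns 1, 3, 5, ...
keep <- rep(c(TRUE, FALSE), n_samples)

# Guard against a mask that does not line up with the table width.
stopifnot(length(keep) == ncol(example_matrix))

beta_only <- example_matrix[, keep]
```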
#
# #Blood samples.
# hannon2021_unformatted_table <- read.table("Hannon2021/GSE152026_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# hannon2021_unformatted_samples <- hannon2021_unformatted_table[2,]
# hannon2021_formatted_samples <- str_sub(hannon2021_unformatted_samples,1,62)[2:935]
#
# #Formatting ages.
# hannon2021_unformatted_ages <- hannon2021_unformatted_table[5,-1]
# hannon2021_formatted_ages <- as.numeric(str_sub(hannon2021_unformatted_ages, 5,100))
#
# #Formatting sex.
# hannon2021_unformatted_sex <- hannon2021_unformatted_table[4,-1]
# hannon2021_formatted_sex <- str_sub(hannon2021_unformatted_sex, 6, 30)
#
# #Formatting condition and miscellaneous information.
# hannon2021_unformatted_condition <- hannon2021_unformatted_table[3,-1]
# hannon2021_formatted_condition <- str_sub(hannon2021_unformatted_condition, 12,39)
# hannon2021_formatted_condition[hannon2021_formatted_condition=="Case"] <-
# "Schizophrenia"
#
# hannon2021_formatted_donor <- paste(rep("V",934),c(1:934),sep="")
#
# hannon2021_samples <- data.frame(ID=hannon2021_formatted_samples,
# Author=rep("Hannon",934),
# Year=rep(2021,934),
# Tissue=rep("Blood",934),
# CellType=rep("Blood",934),
# Age=hannon2021_formatted_ages,
# Condition=hannon2021_formatted_condition,
# Sex=hannon2021_formatted_sex,
# DonorID=hannon2021_formatted_donor,
# Misc=rep(NA,934))
#
# hannon2021_samples <- hannon2021_samples[!is.na(hannon2021_samples$Age),]
# hannon2021_sample_names <- str_sub(hannon2021_unformatted_table[1,-1],1,19)
#
# hannon2021_cpgs <-
# data.table::fread("Hannon2021/GSE152026_EUGEI_processed_signals.csv",
# header=FALSE) %>%
# as.data.frame()
# row.names(hannon2021_cpgs) <- hannon2021_cpgs$V1
# hannon2021_cpgs <- hannon2021_cpgs[-1,-1]
# hannon2021_cpgs <- hannon2021_cpgs[, rep(c(TRUE, FALSE),934)]
# hannon2021_names <- read_csv("Hannon2021/target.txt",
# col_names = FALSE)
# hannon2021_names <- str_sub(unlist(hannon2021_names),1,500)
# hannon2021_names <- hannon2021_names[2:1869]
# hannon2021_names <- hannon2021_names[rep(c(TRUE, FALSE),934)]
# colnames(hannon2021_cpgs) <- hannon2021_names
#
# hannon2021_cpgs <- hannon2021_cpgs[,hannon2021_sample_names]
# colnames(hannon2021_cpgs) <- hannon2021_formatted_samples
# hannon2021_cpgs <- hannon2021_cpgs[,colnames(hannon2021_cpgs) %in%
# hannon2021_samples$ID]
# hannon2021_cpgs <- hannon2021_cpgs[rownames(hannon2021_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(hannon2021_changed_cpgs),]
# hannon2021_cpgs$cpg <- rownames(hannon2021_cpgs)
# hannon2021_cpgs <- data.table(hannon2021_cpgs)
# sample_table <- rbind(sample_table,hannon2021_samples)
# cpg_table <- merge(cpg_table,hannon2021_cpgs, all=TRUE)
#
# martino2018_unformatted_table <- read.table("Martino2018/GSE114135-GPL23976_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# martino2018_unformatted_samples <- martino2018_unformatted_table[5,-1]
# martino2018_formatted_samples <- str_sub(martino2018_unformatted_samples,1,62)[1:205]
#
# #Formatting ages.
# martino2018_unformatted_ages <- martino2018_unformatted_table[2,-1]
# martino2018_formatted_ages <- as.numeric(str_sub(martino2018_unformatted_ages, 5,100))
#
# #Formatting sex.
# martino2018_unformatted_sex <- martino2018_unformatted_table[3,-1]
# martino2018_formatted_sex <- str_sub(martino2018_unformatted_sex, 6, 30)
#
# #Formatting condition and miscellaneous information.
# martino2018_unformatted_condition <- martino2018_unformatted_table[4,-1]
# martino2018_formatted_condition <- str_sub(martino2018_unformatted_condition, -1,-1)
# martino2018_formatted_condition[martino2018_formatted_condition=="1"] <- "Stimulated"
# martino2018_formatted_condition[martino2018_formatted_condition=="0"] <- "Control"
#
# martino2018_unformatted_donor <- martino2018_unformatted_table[1,-1]
# martino2018_unformatted_donor <- str_sub(martino2018_unformatted_donor,1,6)
# martino2018_unformatted_donor <- as.numeric(factor(martino2018_unformatted_donor))
# martino2018_formatted_donor <- paste(rep("W",205),martino2018_unformatted_donor,sep="")
#
# martino2018_samples <- data.frame(ID=martino2018_formatted_samples,
# Author=rep("Martino",205),
# Year=rep(2018,205),
# Tissue=rep("Blood",205),
# CellType=rep("CD4+",205),
# Age=martino2018_formatted_ages,
# Condition=martino2018_formatted_condition,
# Sex=martino2018_formatted_sex,
# DonorID=martino2018_formatted_donor,
# Misc=rep(NA,205))
#
# martino2018_cpgs <- read.table("Martino2018/GSE114135-GPL23976_series_matrix.txt",
# comment = "!",
# skip=5,
# fill=TRUE)
#
# martino2018_cpgs <- martino2018_cpgs[-c(1:5),]
# rownames(martino2018_cpgs) <- str_sub(martino2018_cpgs[,1],1,100)
# martino2018_cpgs <- martino2018_cpgs[,-1]
# colnames(martino2018_cpgs) <- martino2018_formatted_samples
#
# martino2018_cpgs <- martino2018_cpgs[rownames(martino2018_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(martino2018_cpgs),]
#
# martino2018_cpgs$cpg <- rownames(martino2018_cpgs)
# martino2018_cpgs <- data.table(martino2018_cpgs)
# sample_table <- rbind(sample_table,martino2018_samples)
# cpg_table <- merge(cpg_table,martino2018_cpgs, all=TRUE)
#
# thompson2020_unformatted_table <- read.table("Thompson2020/GSE146376_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 7)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# thompson2020_unformatted_samples <- thompson2020_unformatted_table[7,-1]
# thompson2020_formatted_samples <- str_sub(thompson2020_unformatted_samples,1,62)[1:280]
#
# #Formatting ages.
# thompson2020_unformatted_ages <- thompson2020_unformatted_table[3,-1]
# thompson2020_formatted_ages <- as.numeric(str_sub(thompson2020_unformatted_ages, 5,100))
#
# #Formatting sex.
# thompson2020_unformatted_sex <- thompson2020_unformatted_table[4,-1]
# thompson2020_formatted_sex <- str_sub(thompson2020_unformatted_sex, 6, 30)
#
# #Formatting condition and miscellaneous information.
# thompson2020_unformatted_condition <- thompson2020_unformatted_table[2,-1]
# thompson2020_formatted_condition <- str_sub(thompson2020_unformatted_condition, 12, 30)
# thompson2020_formatted_condition[thompson2020_formatted_condition=="Vehicle"] <- "Control"
#
# thompson2020_unformatted_donor <- thompson2020_unformatted_table[1,-1]
# thompson2020_unformatted_donor <- str_sub(thompson2020_unformatted_donor,8,10)
# thompson2020_formatted_donor <- paste(rep("X",280),thompson2020_unformatted_donor,sep="")
#
# thompson2020_unformatted_arrangement <- thompson2020_unformatted_table[5,-1]
# thompson2020_formatted_arrangement <- str_sub(thompson2020_unformatted_arrangement,1,100)
#
# thompson2020_samples <- data.frame(ID=thompson2020_formatted_samples,
# Author=rep("Thompson",280),
# Year=rep(2020,280),
# Tissue=rep("Lung",280),
# CellType=rep("Airway Smooth Muscle",280),
# Age=thompson2020_formatted_ages,
# Condition=thompson2020_formatted_condition,
# Sex=thompson2020_formatted_sex,
# DonorID=thompson2020_formatted_donor,
# Misc=rep(NA,280))
#
# thompson2020_cpgs <- data.table::fread("Thompson2020/GSE146376_ProcessedMethData_70_ForGEO.csv",
# header=FALSE) %>%
# as.data.frame()
# row.names(thompson2020_cpgs) <- thompson2020_cpgs$V1
# thompson2020_cpgs <- thompson2020_cpgs[,-1]
# thompson2020_cpgs <- thompson2020_cpgs[, rep(c(rep(TRUE, 2- 1), FALSE),280)]
#
# thompson2020_colnames <- read_csv("Thompson2020/colnames.csv",
# col_names = FALSE)
# thompson2020_colnames <- str_sub(unlist(thompson2020_colnames[1,]),1,100)
# thompson2020_colnames <- thompson2020_colnames[-1]
# thompson2020_colnames <- thompson2020_colnames[rep(c(rep(TRUE, 2- 1), FALSE),280)]
# colnames(thompson2020_cpgs) <- thompson2020_colnames
# thompson2020_formatted_arrangement == colnames(thompson2020_cpgs)
# colnames(thompson2020_cpgs) <- thompson2020_formatted_samples
#
# thompson2020_cpgs <- thompson2020_cpgs[rownames(thompson2020_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(thompson2020_cpgs),]
#
# thompson2020_cpgs$cpg <- rownames(thompson2020_cpgs)
# thompson2020_cpgs <- data.table(thompson2020_cpgs)
# sample_table <- rbind(sample_table,thompson2020_samples)
# cpg_table <- merge(cpg_table,thompson2020_cpgs, all=TRUE)
#
#
# zannas2019_unformatted_table <- read.table("Zannas2019/GSE128235_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# zannas2019_unformatted_samples <- zannas2019_unformatted_table[6,-1]
# zannas2019_formatted_samples <- str_sub(zannas2019_unformatted_samples,1,62)[1:537]
#
# #Formatting ages.
# zannas2019_unformatted_ages <- zannas2019_unformatted_table[3,-1]
# zannas2019_formatted_ages <- as.numeric(str_sub(zannas2019_unformatted_ages, 5,100))
#
# #Formatting sex.
# zannas2019_unformatted_sex <- zannas2019_unformatted_table[4,-1]
# zannas2019_formatted_sex <- str_sub(zannas2019_unformatted_sex, 6, 30)
#
# #Formatting condition and miscellaneous information.
# zannas2019_unformatted_condition <- zannas2019_unformatted_table[2,-1]
# zannas2019_formatted_condition <- str_sub(zannas2019_unformatted_condition, 12, 30)
# zannas2019_formatted_condition[zannas2019_formatted_condition=="control"] <- "Control"
# zannas2019_formatted_condition[zannas2019_formatted_condition=="case"] <- "Depression"
#
# zannas2019_formatted_donor <- paste(rep("Y",537),c(1:537),sep="")
#
# zannas2019_unformatted_arrangement <- zannas2019_unformatted_table[1,-1]
# zannas2019_formatted_arrangement <- str_sub(zannas2019_unformatted_arrangement,25,100)
#
# zannas2019_samples <- data.frame(ID=zannas2019_formatted_samples,
# Author=rep("Zannas",537),
# Year=rep(2019,537),
# Tissue=rep("Blood",537),
# CellType=rep("Blood",537),
# Age=zannas2019_formatted_ages,
# Condition=zannas2019_formatted_condition,
# Sex=zannas2019_formatted_sex,
# DonorID=zannas2019_formatted_donor,
# Misc=rep(NA,537))
#
# zannas2019_cpgs <- data.table::fread("Zannas2019/GSE128235_matrix_normalized.txt",
# header=FALSE) %>%
# as.data.frame()
# rownames(zannas2019_cpgs) <- zannas2019_cpgs[,2]
# zannas2019_cpgs <- zannas2019_cpgs[,-c(1,2)]
# zannas2019_cpgs <- zannas2019_cpgs[, rep(c(rep(TRUE, 2- 1), FALSE),537)]
#
# zannas2019_colnames <- read_tsv("Zannas2019/first_line.txt",
# col_names = FALSE)
# zannas2019_colnames <- str_sub(unlist(zannas2019_colnames[1,]),1,100)
# zannas2019_colnames <- zannas2019_colnames[-c(1,2)]
# zannas2019_colnames <- zannas2019_colnames[rep(c(rep(TRUE, 2- 1), FALSE),537)]
# zannas2019_colnames <- str_sub(zannas2019_colnames,7,15)
# colnames(zannas2019_cpgs) <- zannas2019_colnames
# zannas2019_cpgs <- zannas2019_cpgs[,zannas2019_formatted_arrangement]
# zannas2019_formatted_arrangement == colnames(zannas2019_cpgs)
# colnames(zannas2019_cpgs) <- zannas2019_formatted_samples
#
# zannas2019_cpgs <- zannas2019_cpgs[rownames(zannas2019_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(zannas2019_cpgs),]
#
# zannas2019_cpgs$cpg <- rownames(zannas2019_cpgs)
# zannas2019_cpgs <- data.table(zannas2019_cpgs)
# sample_table <- rbind(sample_table,zannas2019_samples)
# cpg_table <- merge(cpg_table,zannas2019_cpgs, all=TRUE)
#
# nicodemus2017_unformatted_table <- read.table("Nicodemus2017/GSE85566_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 7)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# nicodemus2017_unformatted_samples <- nicodemus2017_unformatted_table[4,-1]
# nicodemus2017_formatted_samples <- str_sub(nicodemus2017_unformatted_samples,1,62)[1:115]
#
# #Formatting ages.
# nicodemus2017_unformatted_ages <- nicodemus2017_unformatted_table[1,-1]
# nicodemus2017_formatted_ages <- as.numeric(str_sub(nicodemus2017_unformatted_ages, 5,100))
#
# #Formatting sex.
# nicodemus2017_unformatted_sex <- nicodemus2017_unformatted_table[3,-1]
# nicodemus2017_formatted_sex <- str_sub(nicodemus2017_unformatted_sex, 9, 30)
#
# #Formatting condition and miscellaneous information.
# nicodemus2017_unformatted_condition <- nicodemus2017_unformatted_table[2,-1]
# nicodemus2017_formatted_condition <- str_sub(nicodemus2017_unformatted_condition, 17, 30)
#
# nicodemus2017_formatted_donor <- paste(rep("Z",115),c(1:115),sep="")
#
# nicodemus2017_samples <- data.frame(ID=nicodemus2017_formatted_samples,
# Author=rep("Nicodemus",115),
# Year=rep(2017,115),
# Tissue=rep("Lung",115),
# CellType=rep("Epithelial Cells",115),
# Age=nicodemus2017_formatted_ages,
# Condition=nicodemus2017_formatted_condition,
# Sex=nicodemus2017_formatted_sex,
# DonorID=nicodemus2017_formatted_donor,
# Misc=rep(NA,115))
#
# nicodemus2017_cpgs <- read.table("Nicodemus2017/GSE85566_series_matrix.txt",
# comment = "!",
# skip=5,
# fill=TRUE)
# nicodemus2017_cpgs <- nicodemus2017_cpgs[-c(1:4),]
# rownames(nicodemus2017_cpgs) <- nicodemus2017_cpgs[,1]
# nicodemus2017_cpgs <- nicodemus2017_cpgs[,-1]
# colnames(nicodemus2017_cpgs) <- nicodemus2017_formatted_samples
#
# nicodemus2017_cpgs <- nicodemus2017_cpgs[rownames(nicodemus2017_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(nicodemus2017_cpgs),]
#
# nicodemus2017_cpgs$cpg <- rownames(nicodemus2017_cpgs)
# nicodemus2017_cpgs <- data.table(nicodemus2017_cpgs)
# sample_table <- rbind(sample_table,nicodemus2017_samples)
# cpg_table <- merge(cpg_table,nicodemus2017_cpgs, all=TRUE)
#
# langevin2016_unformatted_table <- read.table("Langevin2016/GSE70977_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 7)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# langevin2016_unformatted_samples <- langevin2016_unformatted_table[5,-1]
# langevin2016_formatted_samples <- str_sub(langevin2016_unformatted_samples,1,62)[1:223]
#
# #Formatting ages.
# langevin2016_unformatted_ages <- langevin2016_unformatted_table[2,-1]
# langevin2016_formatted_ages <- as.numeric(str_sub(langevin2016_unformatted_ages, 26,100))
#
# #Formatting sex.
# langevin2016_unformatted_sex <- langevin2016_unformatted_table[3,-1]
# langevin2016_formatted_sex <- str_sub(langevin2016_unformatted_sex, 6, 30)
#
# #Formatting condition and miscellaneous information.
# langevin2016_unformatted_condition <- langevin2016_unformatted_table[1,-1]
# langevin2016_formatted_condition <- str_sub(langevin2016_unformatted_condition, 8, 30)
#
# langevin2016_formatted_donor <- paste(rep("AA",223),c(1:223),sep="")
#
# langevin2016_samples <- data.frame(ID=langevin2016_formatted_samples,
# Author=rep("Langevin",223),
# Year=rep(2016,223),
# Tissue=rep("Mouth",223),
# CellType=rep("Epithelial",223),
# Age=langevin2016_formatted_ages,
# Condition=langevin2016_formatted_condition,
# Sex=langevin2016_formatted_sex,
# DonorID=langevin2016_formatted_donor,
# Misc=rep(NA,223))
#
# langevin2016_cpgs <- read.table("Langevin2016/GSE70977_series_matrix.txt",
# comment = "!",
# skip=7,
# fill=TRUE)
# langevin2016_cpgs <- langevin2016_cpgs[-c(1:5),]
# rownames(langevin2016_cpgs) <- langevin2016_cpgs[,1]
# langevin2016_cpgs <- langevin2016_cpgs[,-1]
# colnames(langevin2016_cpgs) <- langevin2016_formatted_samples
#
# langevin2016_cpgs <- langevin2016_cpgs[rownames(langevin2016_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(langevin2016_cpgs),]
#
# langevin2016_cpgs$cpg <- rownames(langevin2016_cpgs)
# langevin2016_cpgs <- data.table(langevin2016_cpgs)
# sample_table <- rbind(sample_table,langevin2016_samples)
# cpg_table <- merge(cpg_table,langevin2016_cpgs, all=TRUE)
#
#
# pai2019_unformatted_table <- read.table("Pai2019/GSE112179_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# pai2019_unformatted_samples <- pai2019_unformatted_table[5,-1]
# pai2019_formatted_samples <- str_sub(pai2019_unformatted_samples,1,62)[1:100]
#
# #Formatting ages.
# pai2019_unformatted_ages <- pai2019_unformatted_table[1,-1]
# pai2019_formatted_ages <- as.numeric(str_sub(pai2019_unformatted_ages, 5,100))
#
# #Formatting sex.
# pai2019_unformatted_sex <- pai2019_unformatted_table[2,-1]
# pai2019_formatted_sex <- str_sub(pai2019_unformatted_sex, 6, 30)
#
# #Formatting condition and miscellaneous information.
# pai2019_unformatted_condition <- pai2019_unformatted_table[3,-1]
# pai2019_formatted_condition <- str_sub(pai2019_unformatted_condition, 10, 30)
#
# pai2019_formatted_donor <- paste(rep("AB",100),c(1:100),sep="")
#
# pai2019_samples <- data.frame(ID=pai2019_formatted_samples,
# Author=rep("Pai",100),
# Year=rep(2019,100),
# Tissue=rep("Brain",100),
# CellType=rep("Neurons",100),
# Age=pai2019_formatted_ages,
# Condition=pai2019_formatted_condition,
# Sex=pai2019_formatted_sex,
# DonorID=pai2019_formatted_donor,
# Misc=rep(NA,100))
#
# pai2019_cpgs <- read.table("Pai2019/GSE112179_series_matrix.txt",
# comment = "!",
# skip=5,
# fill=TRUE)
# pai2019_cpgs <- pai2019_cpgs[-c(1:5),]
# rownames(pai2019_cpgs) <- pai2019_cpgs[,1]
# pai2019_cpgs <- pai2019_cpgs[,-1]
# colnames(pai2019_cpgs) <- pai2019_formatted_samples
#
# pai2019_cpgs <- pai2019_cpgs[rownames(pai2019_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(pai2019_cpgs),]
#
# pai2019_cpgs$cpg <- rownames(pai2019_cpgs)
# pai2019_cpgs <- data.table(pai2019_cpgs)
# sample_table <- rbind(sample_table,pai2019_samples)
# cpg_table <- merge(cpg_table,pai2019_cpgs, all=TRUE)
#
# cobben2019_unformatted_table <- read.table("Cobben2019/GSE112987_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# cobben2019_unformatted_samples <- cobben2019_unformatted_table[5,-1]
# cobben2019_formatted_samples <- str_sub(cobben2019_unformatted_samples,1,62)[1:103]
#
# #Formatting ages.
# cobben2019_unformatted_ages <- cobben2019_unformatted_table[3,-1]
# cobben2019_formatted_ages <- as.numeric(str_sub(cobben2019_unformatted_ages, 5,100))
#
# #Formatting sex.
# cobben2019_unformatted_sex <- cobben2019_unformatted_table[1,-1]
# cobben2019_formatted_sex <- str_sub(cobben2019_unformatted_sex, 9, 30)
#
# #Formatting condition and miscellaneous information.
# cobben2019_unformatted_condition <- cobben2019_unformatted_table[2,-1]
# cobben2019_formatted_condition <- str_sub(cobben2019_unformatted_condition, 16, 30)
# cobben2019_formatted_condition[cobben2019_formatted_condition=="control"] <- "Control"
#
# cobben2019_formatted_donor <- paste(rep("AC",103),c(1:103),sep="")
#
# cobben2019_samples <- data.frame(ID=cobben2019_formatted_samples,
# Author=rep("Cobben",103),
# Year=rep(2019,103),
# Tissue=rep("Blood",103),
# CellType=rep("Blood",103),
# Age=cobben2019_formatted_ages,
# Condition=cobben2019_formatted_condition,
# Sex=cobben2019_formatted_sex,
# DonorID=cobben2019_formatted_donor,
# Misc=rep(NA,103))
#
# cobben2019_cpgs <- read.table("Cobben2019/GSE112987_series_matrix.txt",
# comment = "!",
# skip=5,
# fill=TRUE)
# cobben2019_cpgs <- cobben2019_cpgs[-c(1:5),]
# rownames(cobben2019_cpgs) <- cobben2019_cpgs[,1]
# cobben2019_cpgs <- cobben2019_cpgs[,-1]
# colnames(cobben2019_cpgs) <- cobben2019_formatted_samples
#
# cobben2019_cpgs <- cobben2019_cpgs[rownames(cobben2019_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(cobben2019_cpgs),]
#
# cobben2019_cpgs$cpg <- rownames(cobben2019_cpgs)
# cobben2019_cpgs <- data.table(cobben2019_cpgs)
# sample_table <- rbind(sample_table,cobben2019_samples)
# cpg_table <- merge(cpg_table,cobben2019_cpgs, all=TRUE)
#
#
# husquin2019_unformatted_table <- read.table("Husquin2019/GSE120610_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# husquin2019_unformatted_samples <- husquin2019_unformatted_table[5,-1]
# husquin2019_formatted_samples <- str_sub(husquin2019_unformatted_samples,1,62)[1:156]
#
# #Formatting ages.
# husquin2019_unformatted_ages <- husquin2019_unformatted_table[1,-1]
# husquin2019_formatted_ages <- as.numeric(str_sub(husquin2019_unformatted_ages, 5,100))
#
# #Formatting sex.
# husquin2019_unformatted_sex <- husquin2019_unformatted_table[2,-1]
# husquin2019_formatted_sex <- str_sub(husquin2019_unformatted_sex, 9, 15)
#
# #Formatting condition and miscellaneous information.
# husquin2019_formatted_condition <- rep("Control",156)
#
# husquin2019_formatted_donor <- paste(rep("AD",156),c(1:156),sep="")
#
# husquin2019_unformatted_arrangement <- husquin2019_unformatted_table[3,-1]
# husquin2019_formatted_arrangement <- str_sub(husquin2019_unformatted_arrangement,8,100)
#
# husquin2019_samples <- data.frame(ID=husquin2019_formatted_samples,
# Author=rep("Husquin",156),
# Year=rep(2019,156),
# Tissue=rep("Blood",156),
# CellType=rep("Monocytes",156),
# Age=husquin2019_formatted_ages,
# Condition=husquin2019_formatted_condition,
# Sex=husquin2019_formatted_sex,
# DonorID=husquin2019_formatted_donor,
# Misc=rep(NA,156))
#
# husquin2019_cpgs <- data.table::fread("Husquin2019/GSE120610_Matrix_processed.csv",
# header=FALSE) %>%
# as.data.frame()
# row.names(husquin2019_cpgs) <- husquin2019_cpgs$V1
# husquin2019_cpgs <- husquin2019_cpgs[,-1]
# husquin2019_cpgs <- husquin2019_cpgs[, rep(c(rep(TRUE, 2- 1), FALSE),156)]
#
# colnames(husquin2019_cpgs) <- str_sub(husquin2019_cpgs[1,],7,15)
# colnames(husquin2019_cpgs) <- husquin2019_formatted_samples
# husquin2019_cpgs <- husquin2019_cpgs[-1,]
#
# husquin2019_cpgs <- husquin2019_cpgs[rownames(husquin2019_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(husquin2019_cpgs),]
#
# husquin2019_cpgs$cpg <- rownames(husquin2019_cpgs)
# husquin2019_cpgs <- data.table(husquin2019_cpgs)
# sample_table <- rbind(sample_table,husquin2019_samples)
# cpg_table <- merge(cpg_table,husquin2019_cpgs, all=TRUE)
#
# gasparoni2018_unformatted_table <- read.table("Gasparoni2018/GSE66351_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 8)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# gasparoni2018_unformatted_samples <- gasparoni2018_unformatted_table[8,-1]
# gasparoni2018_formatted_samples <- str_sub(gasparoni2018_unformatted_samples,1,62)[1:190]
#
# #Formatting ages.
# gasparoni2018_unformatted_ages <- gasparoni2018_unformatted_table[4,-1]
# gasparoni2018_formatted_ages <- as.numeric(str_sub(gasparoni2018_unformatted_ages, 5,100))
#
# #Formatting sex.
# gasparoni2018_unformatted_sex <- gasparoni2018_unformatted_table[5,-1]
# gasparoni2018_formatted_sex <- str_sub(gasparoni2018_unformatted_sex, 6, 15)
#
# #Formatting condition and miscellaneous information.
# gasparoni2018_unformatted_condition <- gasparoni2018_unformatted_table[2,-1]
# gasparoni2018_unformatted_condition <- str_sub(gasparoni2018_unformatted_condition,12,20)
# gasparoni2018_unformatted_condition[gasparoni2018_unformatted_condition=="CTRL"] <- "Control"
# gasparoni2018_unformatted_condition[gasparoni2018_unformatted_condition=="AD"] <- "Alzheimers"
# gasparoni2018_formatted_condition <- gasparoni2018_unformatted_condition
#
# gasparoni2018_unformatted_donor <- gasparoni2018_unformatted_table[6,-1]
# gasparoni2018_unformatted_donor <- str_sub(gasparoni2018_unformatted_donor,11,30)
# gasparoni2018_unformatted_donor <- as.numeric(factor(gasparoni2018_unformatted_donor))
# gasparoni2018_formatted_donor <- paste(rep("AE",190),gasparoni2018_unformatted_donor,sep="")
#
# gasparoni2018_unformatted_misc <- gasparoni2018_unformatted_table[3,-1]
# gasparoni2018_formatted_misc <- str_sub(gasparoni2018_unformatted_misc,15,200)
#
# gasparoni2018_samples <- data.frame(ID=gasparoni2018_formatted_samples,
# Author=rep("Gasparoni",190),
# Year=rep(2018,190),
# Tissue=rep("Brain",190),
# CellType=rep("Brain",190),
# Age=gasparoni2018_formatted_ages,
# Condition=gasparoni2018_formatted_condition,
# Sex=gasparoni2018_formatted_sex,
# DonorID=gasparoni2018_formatted_donor,
# Misc=gasparoni2018_formatted_misc)
#
# gasparoni2018_cpgs <- read.table("Gasparoni2018/GSE66351_series_matrix.txt",
# comment = "!",
# skip=8,
# fill=TRUE)
# gasparoni2018_cpgs <- gasparoni2018_cpgs[-c(1:8),]
# rownames(gasparoni2018_cpgs) <- gasparoni2018_cpgs[,1]
# gasparoni2018_cpgs <- gasparoni2018_cpgs[,-1]
# colnames(gasparoni2018_cpgs) <- gasparoni2018_formatted_samples
#
# gasparoni2018_cpgs <- gasparoni2018_cpgs[rownames(gasparoni2018_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(gasparoni2018_cpgs),]
#
# gasparoni2018_cpgs$cpg <- rownames(gasparoni2018_cpgs)
# gasparoni2018_cpgs <- data.table(gasparoni2018_cpgs)
# sample_table <- rbind(sample_table,gasparoni2018_samples)
# cpg_table <- merge(cpg_table,gasparoni2018_cpgs, all=TRUE)
#
#
# liu2018_unformatted_table <- read.table("Liu2018/GSE106648_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# liu2018_unformatted_samples <- liu2018_unformatted_table[5,-1]
# liu2018_formatted_samples <- str_sub(liu2018_unformatted_samples,1,62)[1:279]
#
# #Formatting ages.
# liu2018_unformatted_ages <- liu2018_unformatted_table[3,-1]
# liu2018_formatted_ages <- as.numeric(str_sub(liu2018_unformatted_ages, 5,100))
#
# #Formatting sex.
# liu2018_unformatted_sex <- liu2018_unformatted_table[2,-1]
# liu2018_formatted_sex <- str_sub(liu2018_unformatted_sex, 9, 15)
#
# #Formatting condition and miscellaneous information.
# liu2018_unformatted_condition <- liu2018_unformatted_table[1,-1]
# liu2018_unformatted_condition <- str_sub(liu2018_unformatted_condition,17,40)
# liu2018_unformatted_condition[liu2018_unformatted_condition=="Healthy control"] = "Control"
# liu2018_unformatted_condition[liu2018_unformatted_condition=="MS case"] = "MS"
# liu2018_formatted_condition <- liu2018_unformatted_condition
#
# liu2018_formatted_donor <- paste(rep("AF",279),c(1:279),sep="")
#
# liu2018_samples <- data.frame(ID=liu2018_formatted_samples,
# Author=rep("Liu",279),
# Year=rep(2018,279),
# Tissue=rep("Blood",279),
# CellType=rep("Blood",279),
# Age=liu2018_formatted_ages,
# Condition=liu2018_formatted_condition,
# Sex=liu2018_formatted_sex,
# DonorID=liu2018_formatted_donor,
# Misc=rep(NA,279))
#
# liu2018_cpgs <- read.table("Liu2018/GSE106648_series_matrix.txt",
# comment = "!",
# skip=5,
# fill=TRUE)
# liu2018_cpgs <- liu2018_cpgs[-c(1:5),]
# rownames(liu2018_cpgs) <- liu2018_cpgs[,1]
# liu2018_cpgs <- liu2018_cpgs[,-1]
# colnames(liu2018_cpgs) <- liu2018_formatted_samples
#
# liu2018_cpgs <- liu2018_cpgs[rownames(liu2018_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(liu2018_cpgs),]
#
# liu2018_cpgs$cpg <- rownames(liu2018_cpgs)
# liu2018_cpgs <- data.table(liu2018_cpgs)
# sample_table <- rbind(sample_table,liu2018_samples)
# cpg_table <- merge(cpg_table,liu2018_cpgs, all=TRUE)
#
#
# somineni2018_unformatted_table <- read.table("Somineni2018/GSE112611_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# somineni2018_unformatted_samples <- somineni2018_unformatted_table[6,-1]
# somineni2018_formatted_samples <- str_sub(somineni2018_unformatted_samples,1,62)[1:402]
#
# #Formatting ages.
# somineni2018_unformatted_ages <- somineni2018_unformatted_table[2,-1]
# somineni2018_formatted_ages <- as.numeric(str_sub(somineni2018_unformatted_ages, 5,100))
#
# #Formatting sex.
# somineni2018_unformatted_sex <- somineni2018_unformatted_table[1,-1]
# somineni2018_formatted_sex <- str_sub(somineni2018_unformatted_sex, 9, 15)
#
# #Formatting condition and miscellaneous information.
# somineni2018_unformatted_condition <- somineni2018_unformatted_table[3,-1]
# somineni2018_unformatted_condition <- str_sub(somineni2018_unformatted_condition,12,40)
# somineni2018_unformatted_condition[somineni2018_unformatted_condition=="non-IBD control"] = "Control"
# somineni2018_formatted_condition <- somineni2018_unformatted_condition
#
# somineni2018_formatted_donor <- paste(rep("AG",402),c(1:402),sep="")
#
# somineni2018_formatted_location <- str_sub(somineni2018_unformatted_table[4,-1],1,100)
#
# somineni2018_samples <- data.frame(ID=somineni2018_formatted_samples,
# Author=rep("Somineni",402),
# Year=rep(2018,402),
# Tissue=rep("Blood",402),
# CellType=rep("Blood",402),
# Age=somineni2018_formatted_ages,
# Condition=somineni2018_formatted_condition,
# Sex=somineni2018_formatted_sex,
# DonorID=somineni2018_formatted_donor,
# Misc=rep(NA,402))
#
# somineni2018_cpgs <- data.table::fread("Somineni2018/GSE112611_beta_values.txt",
# header=FALSE) %>%
# as.data.frame()
# row.names(somineni2018_cpgs) <- somineni2018_cpgs$V1
# somineni2018_cpgs <- somineni2018_cpgs[,-1]
# somineni2018_cpgs <- somineni2018_cpgs[, rep(c(rep(TRUE, 2- 1), FALSE),402)]
# colnames(somineni2018_cpgs) <- somineni2018_cpgs[1,]
# somineni2018_cpgs <- somineni2018_cpgs[-1,]
#
# colnames(somineni2018_cpgs) <- somineni2018_formatted_samples
#
# somineni2018_cpgs <- somineni2018_cpgs[rownames(somineni2018_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(somineni2018_cpgs),]
#
# somineni2018_cpgs$cpg <- rownames(somineni2018_cpgs)
# somineni2018_cpgs <- data.table(somineni2018_cpgs)
# sample_table <- rbind(sample_table,somineni2018_samples)
# cpg_table <- merge(cpg_table,somineni2018_cpgs, all=TRUE)
#
# roos2017_unformatted_table <- read.table("Roos2017/GSE90124_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# roos2017_unformatted_samples <- roos2017_unformatted_table[3,-1]
# roos2017_formatted_samples <- str_sub(roos2017_unformatted_samples,1,62)[1:322]
#
# #Formatting ages.
# roos2017_unformatted_ages <- roos2017_unformatted_table[1,-1]
# roos2017_formatted_ages <- as.numeric(str_sub(roos2017_unformatted_ages, 15,100))
#
# #Formatting sex.
# roos2017_formatted_sex <- rep("Female",322)
#
# #Formatting condition and miscellaneous information.
# roos2017_formatted_condition <- rep("Control",322)
#
# roos2017_formatted_donor <- paste(rep("AH",322),c(1:322),sep="")
#
# roos2017_samples <- data.frame(ID=roos2017_formatted_samples,
# Author=rep("Roos",322),
# Year=rep(2017,322),
# Tissue=rep("Skin",322),
# CellType=rep("Epithelial",322),
# Age=roos2017_formatted_ages,
# Condition=roos2017_formatted_condition,
# Sex=roos2017_formatted_sex,
# DonorID=roos2017_formatted_donor,
# Misc=rep(NA,322))
#
# roos2017_cpgs <- read.table("Roos2017/GSE90124_series_matrix.txt",
# comment = "!",
# skip=3,
# fill=TRUE)
# roos2017_cpgs <- roos2017_cpgs[-c(1:3),]
# rownames(roos2017_cpgs) <- roos2017_cpgs[,1]
# roos2017_cpgs <- roos2017_cpgs[,-1]
# colnames(roos2017_cpgs) <- roos2017_formatted_samples
#
# roos2017_cpgs <- roos2017_cpgs[rownames(roos2017_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(roos2017_cpgs),]
#
# roos2017_cpgs$cpg <- rownames(roos2017_cpgs)
# roos2017_cpgs <- data.table(roos2017_cpgs)
# sample_table <- rbind(sample_table,roos2017_samples)
# cpg_table <- merge(cpg_table,roos2017_cpgs, all=TRUE)
#
#
# kananen2016_unformatted_table <- read.table("Kananen2016/GSE69270_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 4)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# kananen2016_unformatted_samples <- kananen2016_unformatted_table[4,-1]
# kananen2016_formatted_samples <- str_sub(kananen2016_unformatted_samples,1,62)[1:184]
#
# #Formatting ages.
# kananen2016_unformatted_ages <- kananen2016_unformatted_table[1,-1]
# kananen2016_formatted_ages <- as.numeric(str_sub(kananen2016_unformatted_ages, 14,100))
#
# #Formatting sex.
# kananen2016_unformatted_sex <- kananen2016_unformatted_table[2,-1]
# kananen2016_unformatted_sex <- str_sub(kananen2016_unformatted_sex,-1,-1)
# kananen2016_unformatted_sex[kananen2016_unformatted_sex=="1"] <- "Female"
# kananen2016_unformatted_sex[kananen2016_unformatted_sex=="0"] <- "Male"
# kananen2016_formatted_sex <- kananen2016_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# kananen2016_formatted_condition <- rep("Control",184)
#
# kananen2016_formatted_donor <- paste(rep("AI",184),c(1:184),sep="")
#
# kananen2016_samples <- data.frame(ID=kananen2016_formatted_samples,
# Author=rep("Kananen",184),
# Year=rep(2016,184),
# Tissue=rep("Blood",184),
# CellType=rep("Leukocytes",184),
# Age=kananen2016_formatted_ages,
# Condition=kananen2016_formatted_condition,
# Sex=kananen2016_formatted_sex,
# DonorID=kananen2016_formatted_donor,
# Misc=rep(NA,184))
#
# kananen2016_cpgs <- read.table("Kananen2016/GSE69270_series_matrix.txt",
# comment = "!",
# skip=4,
# fill=TRUE)
# kananen2016_cpgs <- kananen2016_cpgs[-c(1:4),]
# rownames(kananen2016_cpgs) <- kananen2016_cpgs[,1]
# kananen2016_cpgs <- kananen2016_cpgs[,-1]
# colnames(kananen2016_cpgs) <- kananen2016_formatted_samples
#
# kananen2016_cpgs <- kananen2016_cpgs[rownames(kananen2016_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(kananen2016_cpgs),]
#
# kananen2016_cpgs$cpg <- rownames(kananen2016_cpgs)
# kananen2016_cpgs <- data.table(kananen2016_cpgs)
# sample_table <- rbind(sample_table,kananen2016_samples)
# cpg_table <- merge(cpg_table,kananen2016_cpgs, all=TRUE)
#
# cerapio2021_unformatted_table <- read.table("Cerapio2021/GSE136583_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# cerapio2021_unformatted_samples <- cerapio2021_unformatted_table[5,-1]
# cerapio2021_formatted_samples <- str_sub(cerapio2021_unformatted_samples,1,62)[1:62]
#
# #Formatting ages.
# cerapio2021_unformatted_ages <- cerapio2021_unformatted_table[2,-1]
# cerapio2021_formatted_ages <- as.numeric(str_sub(cerapio2021_unformatted_ages, 5,100))
#
# #Formatting sex.
# cerapio2021_unformatted_sex <- cerapio2021_unformatted_table[3,-1]
# cerapio2021_unformatted_sex <- str_sub(cerapio2021_unformatted_sex,9,30)
# cerapio2021_formatted_sex <- cerapio2021_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# cerapio2021_unformatted_condition <- cerapio2021_unformatted_table[1,-1]
# cerapio2021_unformatted_condition <- str_sub(cerapio2021_unformatted_condition,9,50)
# cerapio2021_unformatted_condition[cerapio2021_unformatted_condition=="non-tumor liver"] <- "Control"
# cerapio2021_formatted_condition <- cerapio2021_unformatted_condition
#
# cerapio2021_formatted_donor <- paste(rep("AJ",62),c(1:62),sep="")
#
# cerapio2021_samples <- data.frame(ID=cerapio2021_formatted_samples,
# Author=rep("Cerapio",62),
# Year=rep(2021,62),
# Tissue=rep("Liver",62),
# CellType=rep("Hepatocytes",62),
# Age=cerapio2021_formatted_ages,
# Condition=cerapio2021_formatted_condition,
# Sex=cerapio2021_formatted_sex,
# DonorID=cerapio2021_formatted_donor,
# Misc=rep(NA,62))
#
# cerapio2021_cpgs <- read.table("Cerapio2021/GSE136583_series_matrix.txt",
# comment = "!",
# skip=5,
# fill=TRUE)
# cerapio2021_cpgs <- cerapio2021_cpgs[-c(1:5),]
# rownames(cerapio2021_cpgs) <- cerapio2021_cpgs[,1]
# cerapio2021_cpgs <- cerapio2021_cpgs[,-1]
# colnames(cerapio2021_cpgs) <- cerapio2021_formatted_samples
#
# cerapio2021_cpgs <- cerapio2021_cpgs[rownames(cerapio2021_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(cerapio2021_cpgs),]
#
# cerapio2021_cpgs$cpg <- rownames(cerapio2021_cpgs)
# cerapio2021_cpgs <- data.table(cerapio2021_cpgs)
# sample_table <- rbind(sample_table,cerapio2021_samples)
# cpg_table <- merge(cpg_table,cerapio2021_cpgs, all=TRUE)
#
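Each dataset in this listing is appended to cpg_table with merge(..., all=TRUE), which requests a full outer join: every CpG present in either table is kept, and samples lacking a value for a CpG receive NA. The toy example below illustrates that behavior with base R data frames (the listing itself uses data.table objects); the probe IDs and beta values are invented, and the join key is given explicitly as an assumption about the intended key.

```r
# Toy stand-ins for cpg_table and a newly loaded dataset; values are invented.
cpg_table_toy <- data.frame(cpg = c("cg01", "cg02", "cg03"),
                            S1 = c(0.10, 0.50, 0.90),
                            stringsAsFactors = FALSE)
new_cpgs_toy <- data.frame(cpg = c("cg02", "cg03", "cg04"),
                           S2 = c(0.55, 0.85, 0.20),
                           stringsAsFactors = FALSE)

# all = TRUE requests a full outer join; by = "cpg" makes the join key explicit.
merged_toy <- merge(cpg_table_toy, new_cpgs_toy, by = "cpg", all = TRUE)
print(merged_toy)
```

Naming the key with by = "cpg" guards against an accidental join on a sample column that happens to share a name across tables.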
# hearn2020_unformatted_table <- read.table("Hearn2020/GSE119078_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# hearn2020_unformatted_samples <- hearn2020_unformatted_table[5,-1]
# hearn2020_formatted_samples <- str_sub(hearn2020_unformatted_samples,1,62)[1:59]
#
# #Formatting ages.
# hearn2020_unformatted_ages <- hearn2020_unformatted_table[2,-1]
# hearn2020_formatted_ages <- as.numeric(str_sub(hearn2020_unformatted_ages, 5,100))
#
# #Formatting sex.
# hearn2020_unformatted_sex <- hearn2020_unformatted_table[1,-1]
# hearn2020_unformatted_sex <- str_sub(hearn2020_unformatted_sex,9,100)
# hearn2020_formatted_sex <- hearn2020_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# hearn2020_unformatted_condition <- hearn2020_unformatted_table[3,-1]
# hearn2020_unformatted_condition <- str_sub(hearn2020_unformatted_condition,17,40)
# hearn2020_formatted_condition <- hearn2020_unformatted_condition
#
# hearn2020_formatted_donor <- paste(rep("AK",59),c(1:59),sep="")
#
# hearn2020_samples <- data.frame(ID=hearn2020_formatted_samples,
# Author=rep("Hearn",59),
# Year=rep(2020,59),
# Tissue=rep("Saliva",59),
# CellType=rep("Epithelial",59),
# Age=hearn2020_formatted_ages,
# Condition=hearn2020_formatted_condition,
# Sex=hearn2020_formatted_sex,
# DonorID=hearn2020_formatted_donor,
# Misc=rep(NA,59))
#
# hearn2020_cpgs <- read.table("Hearn2020/GSE119078_series_matrix.txt",
# comment = "!",
# skip=5,
# fill=TRUE)
# hearn2020_cpgs <- hearn2020_cpgs[-c(1:4),]
# rownames(hearn2020_cpgs) <- hearn2020_cpgs[,1]
# hearn2020_cpgs <- hearn2020_cpgs[,-1]
# colnames(hearn2020_cpgs) <- hearn2020_formatted_samples
#
# hearn2020_cpgs <- hearn2020_cpgs[rownames(hearn2020_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(hearn2020_cpgs),]
#
# hearn2020_cpgs$cpg <- rownames(hearn2020_cpgs)
# hearn2020_cpgs <- data.table(hearn2020_cpgs)
# sample_table <- rbind(sample_table,hearn2020_samples)
# cpg_table <- merge(cpg_table,hearn2020_cpgs, all=TRUE)
#
# hong2017_unformatted_table <- read.table("Hong2017/GSE92767_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 4)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# hong2017_unformatted_samples <- hong2017_unformatted_table[4,-1]
# hong2017_formatted_samples <- str_sub(hong2017_unformatted_samples,1,62)[1:54]
#
# #Formatting ages.
# hong2017_unformatted_ages <- hong2017_unformatted_table[1,-1]
# hong2017_formatted_ages <- as.numeric(str_sub(hong2017_unformatted_ages, 6,100))
#
# #Formatting sex.
# hong2017_unformatted_sex <- hong2017_unformatted_table[2,-1]
# hong2017_unformatted_sex <- str_sub(hong2017_unformatted_sex,9,15)
# hong2017_formatted_sex <- hong2017_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# hong2017_formatted_condition <- rep("Control",54)
#
# hong2017_formatted_donor <- paste(rep("AL",54),c(1:54),sep="")
#
# hong2017_samples <- data.frame(ID=hong2017_formatted_samples,
# Author=rep("Hong",54),
# Year=rep(2017,54),
# Tissue=rep("Saliva",54),
# CellType=rep("Epithelial",54),
# Age=hong2017_formatted_ages,
# Condition=hong2017_formatted_condition,
# Sex=hong2017_formatted_sex,
# DonorID=hong2017_formatted_donor,
# Misc=rep(NA,54))
#
# hong2017_cpgs <- read.table("Hong2017/GSE92767_series_matrix.txt",
# comment = "!",
# skip=4,
# fill=TRUE)
# hong2017_cpgs <- hong2017_cpgs[-c(1:4),]
# rownames(hong2017_cpgs) <- hong2017_cpgs[,1]
# hong2017_cpgs <- hong2017_cpgs[,-1]
# colnames(hong2017_cpgs) <- hong2017_formatted_samples
#
# hong2017_cpgs <- hong2017_cpgs[rownames(hong2017_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(hong2017_cpgs),]
#
# hong2017_cpgs$cpg <- rownames(hong2017_cpgs)
# hong2017_cpgs <- data.table(hong2017_cpgs)
# sample_table <- rbind(sample_table,hong2017_samples)
# cpg_table <- merge(cpg_table,hong2017_cpgs, all=TRUE)
#
#
# gopclassic2017_unformatted_table <- read.table("Gopclassic2017/GSE99091-GPL13534_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 4)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# gopclassic2017_unformatted_samples <- gopclassic2017_unformatted_table[4,-1]
# gopclassic2017_formatted_samples <- str_sub(gopclassic2017_unformatted_samples,1,62)[1:57]
#
# #Formatting ages.
# gopclassic2017_unformatted_ages <- gopclassic2017_unformatted_table[2,-1]
# gopclassic2017_formatted_ages <- as.numeric(str_sub(gopclassic2017_unformatted_ages, 6,100))
#
# #Formatting sex.
# gopclassic2017_unformatted_sex <- gopclassic2017_unformatted_table[1,-1]
# gopclassic2017_unformatted_sex <- str_sub(gopclassic2017_unformatted_sex,9,60)
# gopclassic2017_unformatted_sex[gopclassic2017_unformatted_sex=="female"] <- "Female"
# gopclassic2017_unformatted_sex[gopclassic2017_unformatted_sex=="male"] <- "Male"
# gopclassic2017_formatted_sex <- gopclassic2017_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# gopclassic2017_formatted_condition <- rep("Control",57)
#
# gopclassic2017_formatted_donor <- paste(rep("AM",57),c(1:57),sep="")
#
# gopclassic2017_samples <- data.frame(ID=gopclassic2017_formatted_samples,
# Author=rep("Gopclassic",57),
# Year=rep(2017,57),
# Tissue=rep("Saliva",57),
# CellType=rep("Epithelial",57),
# Age=gopclassic2017_formatted_ages,
# Condition=gopclassic2017_formatted_condition,
# Sex=gopclassic2017_formatted_sex,
# DonorID=gopclassic2017_formatted_donor,
# Misc=rep(NA,57))
#
# gopclassic2017_cpgs <- read.table("Gopclassic2017/GSE99091-GPL13534_series_matrix.txt",
# comment = "!",
# skip=4,
# fill=TRUE)
# gopclassic2017_cpgs <- gopclassic2017_cpgs[-c(1:4),]
# rownames(gopclassic2017_cpgs) <- gopclassic2017_cpgs[,1]
# gopclassic2017_cpgs <- gopclassic2017_cpgs[,-1]
# colnames(gopclassic2017_cpgs) <- gopclassic2017_formatted_samples
#
# gopclassic2017_cpgs <- gopclassic2017_cpgs[rownames(gopclassic2017_cpgs) %in% shared_cpgs,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(gopclassic2017_cpgs),]
#
# gopclassic2017_cpgs$cpg <- rownames(gopclassic2017_cpgs)
# gopclassic2017_cpgs <- data.table(gopclassic2017_cpgs)
# sample_table <- rbind(sample_table,gopclassic2017_samples)
# cpg_table <- merge(cpg_table,gopclassic2017_cpgs, all=TRUE)
#
#
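The datasets in this listing encode sex in different ways ("female"/"male", "F"/"M", "f"/"m"), and each block recodes its own labels by hand. As a sketch only, the same recoding could be collected in a single lookup-based helper. The name normalize_sex is hypothetical; labels with no mapping are left unchanged, which mirrors the listing's behavior of touching only matched values.

```r
# Hypothetical helper: map common sex labels to "Female"/"Male".
normalize_sex <- function(x) {
  x <- as.character(x)
  key <- tolower(trimws(x))
  lookup <- c(f = "Female", female = "Female", m = "Male", male = "Male")
  out <- unname(lookup[key])
  # Keep the original label wherever no mapping exists.
  out[is.na(out)] <- x[is.na(out)]
  out
}

print(normalize_sex(c("female", "M", "f", "Male", "unknown")))
```

A shared helper would also make it easier to spot datasets whose labels fall outside the expected set.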
# horvath2016_unformatted_table <- read.table("Horvath2016/GSE78874_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 4)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# horvath2016_unformatted_samples <- horvath2016_unformatted_table[4,-1]
# horvath2016_formatted_samples <- str_sub(horvath2016_unformatted_samples,1,62)[1:259]
#
# #Formatting ages.
# horvath2016_unformatted_ages <- horvath2016_unformatted_table[1,-1]
# horvath2016_formatted_ages <- as.numeric(str_sub(horvath2016_unformatted_ages, 6,100))
#
# #Formatting sex.
# horvath2016_unformatted_sex <- horvath2016_unformatted_table[2,-1]
# horvath2016_unformatted_sex <- str_sub(horvath2016_unformatted_sex,9,60)
# horvath2016_unformatted_sex[horvath2016_unformatted_sex=="female"] <- "Female"
# horvath2016_unformatted_sex[horvath2016_unformatted_sex=="male"] <- "Male"
# horvath2016_formatted_sex <- horvath2016_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# horvath2016_formatted_condition <- rep("Control",259)
#
# horvath2016_formatted_donor <- paste(rep("AN",259),c(1:259),sep="")
#
# horvath2016_samples <- data.frame(ID=horvath2016_formatted_samples,
# Author=rep("Horvath",259),
# Year=rep(2016,259),
# Tissue=rep("Saliva",259),
# CellType=rep("Epithelial",259),
# Age=horvath2016_formatted_ages,
# Condition=horvath2016_formatted_condition,
# Sex=horvath2016_formatted_sex,
# DonorID=horvath2016_formatted_donor,
# Misc=rep(NA,259))
#
# horvath2016_cpgs <- read.table("Horvath2016/GSE78874_series_matrix.txt",
# comment = "!",
# skip=4,
# fill=TRUE)
# horvath2016_cpgs <- horvath2016_cpgs[-c(1:4),]
# rownames(horvath2016_cpgs) <- horvath2016_cpgs[,1]
# horvath2016_cpgs <- horvath2016_cpgs[,-1]
# colnames(horvath2016_cpgs) <- horvath2016_formatted_samples
#
# horvath2016_cpgs <- horvath2016_cpgs[rownames(horvath2016_cpgs) %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(horvath2016_cpgs),]
#
# horvath2016_cpgs$cpg <- rownames(horvath2016_cpgs)
# horvath2016_cpgs <- data.table(horvath2016_cpgs)
# sample_table <- rbind(sample_table,horvath2016_samples)
# cpg_table <- merge(cpg_table,horvath2016_cpgs, all=TRUE)
#
#
# guintivano2013_unformatted_table <- read.table("Guintivano2013/GSE41826_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 4)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# guintivano2013_unformatted_samples <- guintivano2013_unformatted_table[4,-1]
# guintivano2013_formatted_samples <- str_sub(guintivano2013_unformatted_samples,1,62)[1:145]
#
# #Formatting ages.
# guintivano2013_unformatted_ages <- guintivano2013_unformatted_table[3,-1]
# guintivano2013_formatted_ages <- as.numeric(str_sub(guintivano2013_unformatted_ages, 6,100))
#
# #Formatting sex.
# guintivano2013_unformatted_sex <- guintivano2013_unformatted_table[2,-1]
# guintivano2013_unformatted_sex <- str_sub(guintivano2013_unformatted_sex,6,60)
# guintivano2013_formatted_sex <- guintivano2013_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# guintivano2013_unformatted_condition <- guintivano2013_unformatted_table[1,-1]
# guintivano2013_formatted_condition <- str_sub(guintivano2013_unformatted_condition,12,60)
#
# guintivano2013_formatted_donor <- paste(rep("AO",145),c(1:145),sep="")
#
# guintivano2013_samples <- data.frame(ID=guintivano2013_formatted_samples,
# Author=rep("Guintivano",145),
# Year=rep(2013,145),
# Tissue=rep("Brain",145),
# CellType=rep("Brain",145),
# Age=guintivano2013_formatted_ages,
# Condition=guintivano2013_formatted_condition,
# Sex=guintivano2013_formatted_sex,
# DonorID=guintivano2013_formatted_donor,
# Misc=rep(NA,145))
#
# guintivano2013_cpgs <- read.table("Guintivano2013/GSE41826_series_matrix.txt",
# comment = "!",
# skip=4,
# fill=TRUE)
# guintivano2013_cpgs <- guintivano2013_cpgs[-c(1:4),]
# rownames(guintivano2013_cpgs) <- guintivano2013_cpgs[,1]
# guintivano2013_cpgs <- guintivano2013_cpgs[,-1]
# colnames(guintivano2013_cpgs) <- guintivano2013_formatted_samples
#
# guintivano2013_cpgs <- guintivano2013_cpgs[rownames(guintivano2013_cpgs) %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(guintivano2013_cpgs),]
#
# guintivano2013_cpgs$cpg <- rownames(guintivano2013_cpgs)
# guintivano2013_cpgs <- data.table(guintivano2013_cpgs)
# sample_table <- rbind(sample_table,guintivano2013_samples)
# cpg_table <- merge(cpg_table,guintivano2013_cpgs, all=TRUE)
#
# martino2013_unformatted_table <- read.table("Martino2013/GSE42700_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 4)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# martino2013_unformatted_samples <- martino2013_unformatted_table[4,-1]
# martino2013_formatted_samples <- str_sub(martino2013_unformatted_samples,1,62)[1:53]
#
# #Formatting ages.
# martino2013_unformatted_ages <- martino2013_unformatted_table[2,-1]
# martino2013_unformatted_ages <- str_sub(martino2013_unformatted_ages, 6,100)
# martino2013_unformatted_ages[martino2013_unformatted_ages=="18 months"] <- 1.5
# martino2013_unformatted_ages[martino2013_unformatted_ages=="birth"] <- 0
# martino2013_formatted_ages <- as.numeric(martino2013_unformatted_ages)
#
# #Formatting sex.
# martino2013_unformatted_sex <- martino2013_unformatted_table[1,-1]
# martino2013_unformatted_sex <- str_sub(martino2013_unformatted_sex,9,60)
# martino2013_unformatted_sex[martino2013_unformatted_sex=="female"] <- "Female"
# martino2013_unformatted_sex[martino2013_unformatted_sex=="male"] <- "Male"
# martino2013_formatted_sex <- martino2013_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# martino2013_formatted_condition <- rep("Control",53)
#
# martino2013_formatted_donor <- paste(rep("AP",53),c(1:53),sep="")
#
# martino2013_samples <- data.frame(ID=martino2013_formatted_samples,
# Author=rep("Martino",53),
# Year=rep(2013,53),
# Tissue=rep("Saliva",53),
# CellType=rep("Epithelial",53),
# Age=martino2013_formatted_ages,
# Condition=martino2013_formatted_condition,
# Sex=martino2013_formatted_sex,
# DonorID=martino2013_formatted_donor,
# Misc=rep(NA,53))
#
# martino2013_cpgs <- read.table("Martino2013/GSE42700_series_matrix.txt",
# comment = "!",
# skip=4,
# fill=TRUE)
# martino2013_cpgs <- martino2013_cpgs[-c(1:4),]
# rownames(martino2013_cpgs) <- martino2013_cpgs[,1]
# martino2013_cpgs <- martino2013_cpgs[,-1]
# colnames(martino2013_cpgs) <- martino2013_formatted_samples
#
# martino2013_cpgs <- martino2013_cpgs[rownames(martino2013_cpgs) %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(martino2013_cpgs),]
#
# martino2013_cpgs$cpg <- rownames(martino2013_cpgs)
# martino2013_cpgs <- data.table(martino2013_cpgs)
# sample_table <- rbind(sample_table,martino2013_samples)
# cpg_table <- merge(cpg_table,martino2013_cpgs, all=TRUE)
#
#
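The Martino block above converts the non-numeric ages "18 months" and "birth" to 1.5 and 0 before calling as.numeric. A minimal sketch of a more general conversion is shown below. The helper name parse_age_years is hypothetical, and the only patterns it handles ("N months", "birth", plain numbers) are those seen in that block; anything else becomes NA.

```r
# Hypothetical helper: convert age labels to years.
parse_age_years <- function(x) {
  x <- trimws(as.character(x))
  # Plain numbers pass through; everything else starts as NA.
  out <- suppressWarnings(as.numeric(x))
  # "N months" or "N month" is converted to years.
  is_months <- grepl("^[0-9.]+ *months?$", x)
  out[is_months] <- as.numeric(sub(" *months?$", "", x[is_months])) / 12
  # "birth" is treated as age 0.
  out[which(tolower(x) == "birth")] <- 0
  out
}

print(parse_age_years(c("birth", "18 months", "5", "n/a")))
```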
# pihlstrom2022_unformatted_table <- read.table("Pihlstrom2022/GSE203332_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 7)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# pihlstrom2022_unformatted_samples <- pihlstrom2022_unformatted_table[7,-1]
# pihlstrom2022_formatted_samples <- str_sub(pihlstrom2022_unformatted_samples,1,62)[1:492]
#
# #Formatting ages.
# pihlstrom2022_unformatted_ages <- pihlstrom2022_unformatted_table[4,-1]
# pihlstrom2022_formatted_ages <- as.numeric(str_sub(pihlstrom2022_unformatted_ages, 20,100))
#
# #Formatting sex.
# pihlstrom2022_unformatted_sex <- pihlstrom2022_unformatted_table[3,-1]
# pihlstrom2022_unformatted_sex <- str_sub(pihlstrom2022_unformatted_sex, 6, 15)
# pihlstrom2022_unformatted_sex[pihlstrom2022_unformatted_sex=="F"] <- "Female"
# pihlstrom2022_unformatted_sex[pihlstrom2022_unformatted_sex=="M"] <- "Male"
# pihlstrom2022_formatted_sex <- pihlstrom2022_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# pihlstrom2022_unformatted_condition <- pihlstrom2022_unformatted_table[5,-1]
# pihlstrom2022_unformatted_condition <- str_sub(pihlstrom2022_unformatted_condition,21,40)
# pihlstrom2022_unformatted_condition[pihlstrom2022_unformatted_condition=="control"] <- "Control"
# pihlstrom2022_formatted_condition <- pihlstrom2022_unformatted_condition
#
# pihlstrom2022_formatted_donor <- paste(rep("AQ",492),c(1:492),sep="")
#
# pihlstrom2022_formatted_duplicate <- (str_sub(pihlstrom2022_unformatted_table[2,-1],17,40)=="NA")
#
# pihlstrom2022_formatted_location <- str_sub(pihlstrom2022_unformatted_table[1,-1],16,100)
#
# pihlstrom2022_samples <- data.frame(ID=pihlstrom2022_formatted_samples,
# Author=rep("Pihlstrom",492),
# Year=rep(2022,492),
# Tissue=rep("Brain",492),
# CellType=rep("Brain",492),
# Age=pihlstrom2022_formatted_ages,
# Condition=pihlstrom2022_formatted_condition,
# Sex=pihlstrom2022_formatted_sex,
# DonorID=pihlstrom2022_formatted_donor,
# Misc=rep(NA,492))
#
# pihlstrom2022_cpgs <- data.table::fread("Pihlstrom2022/GSE203332_Matrix_processed_dasen_betas.txt",
# header=FALSE) %>%
# as.data.frame()
# row.names(pihlstrom2022_cpgs) <- pihlstrom2022_cpgs$V1
# pihlstrom2022_cpgs <- pihlstrom2022_cpgs[,-1]
# colnames(pihlstrom2022_cpgs) <- pihlstrom2022_cpgs[1,]
# pihlstrom2022_cpgs <- pihlstrom2022_cpgs[-1,]
#
# pihlstrom2022_cpgs <- pihlstrom2022_cpgs[,pihlstrom2022_formatted_location]
# colnames(pihlstrom2022_cpgs) <- pihlstrom2022_formatted_samples
#
# pihlstrom2022_samples <- pihlstrom2022_samples[pihlstrom2022_formatted_duplicate,]
# pihlstrom2022_samples <- pihlstrom2022_samples[!is.na(pihlstrom2022_samples$Age),]
# pihlstrom2022_cpgs <- pihlstrom2022_cpgs[,colnames(pihlstrom2022_cpgs) %in% pihlstrom2022_samples$ID]
#
# pihlstrom2022_cpgs <- pihlstrom2022_cpgs[rownames(pihlstrom2022_cpgs) %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(pihlstrom2022_cpgs),]
#
# pihlstrom2022_cpgs$cpg <- rownames(pihlstrom2022_cpgs)
# pihlstrom2022_cpgs <- data.table(pihlstrom2022_cpgs)
# sample_table <- rbind(sample_table,pihlstrom2022_samples)
# cpg_table <- merge(cpg_table,pihlstrom2022_cpgs, all=TRUE)
#
#
#
# tsaprouni2014_unformatted_table <- read.table("Tsaprouni2014/GSE50660_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 4)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# tsaprouni2014_unformatted_samples <- tsaprouni2014_unformatted_table[4,-1]
# tsaprouni2014_formatted_samples <- str_sub(tsaprouni2014_unformatted_samples,1,62)[1:464]
#
# #Formatting ages.
# tsaprouni2014_unformatted_ages <- tsaprouni2014_unformatted_table[1,-1]
# tsaprouni2014_unformatted_ages <- str_sub(tsaprouni2014_unformatted_ages, 6,100)
# tsaprouni2014_formatted_ages <- as.numeric(tsaprouni2014_unformatted_ages)
#
# #Formatting sex.
# tsaprouni2014_unformatted_sex <- tsaprouni2014_unformatted_table[2,-1]
# tsaprouni2014_unformatted_sex <- str_sub(tsaprouni2014_unformatted_sex,9,60)
# tsaprouni2014_formatted_sex <- tsaprouni2014_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# tsaprouni2014_formatted_condition <- rep("Control",464)
#
# tsaprouni2014_formatted_donor <- paste(rep("AR",464),c(1:464),sep="")
#
# tsaprouni2014_samples <- data.frame(ID=tsaprouni2014_formatted_samples,
# Author=rep("Tsaprouni",464),
# Year=rep(2014,464),
# Tissue=rep("Blood",464),
# CellType=rep("Blood",464),
# Age=tsaprouni2014_formatted_ages,
# Condition=tsaprouni2014_formatted_condition,
# Sex=tsaprouni2014_formatted_sex,
# DonorID=tsaprouni2014_formatted_donor,
# Misc=rep(NA,464))
#
# tsaprouni2014_cpgs <- read.table("Tsaprouni2014/GSE50660_series_matrix.txt",
# comment = "!",
# skip=4,
# fill=TRUE)
# tsaprouni2014_cpgs <- tsaprouni2014_cpgs[-c(1:4),]
# rownames(tsaprouni2014_cpgs) <- tsaprouni2014_cpgs[,1]
# tsaprouni2014_cpgs <- tsaprouni2014_cpgs[,-1]
# colnames(tsaprouni2014_cpgs) <- tsaprouni2014_formatted_samples
#
# tsaprouni2014_cpgs <- tsaprouni2014_cpgs[rownames(tsaprouni2014_cpgs) %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(tsaprouni2014_cpgs),]
#
# tsaprouni2014_cpgs$cpg <- rownames(tsaprouni2014_cpgs)
# tsaprouni2014_cpgs <- data.table(tsaprouni2014_cpgs)
# sample_table <- rbind(sample_table,tsaprouni2014_samples)
# cpg_table <- merge(cpg_table,tsaprouni2014_cpgs, all=TRUE)
#
# liu2013_unformatted_table <- read.table("Liu2013/GSE42861_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# liu2013_unformatted_samples <- liu2013_unformatted_table[5,-1]
# liu2013_formatted_samples <- str_sub(liu2013_unformatted_samples,1,62)[1:689]
#
# #Formatting ages.
# liu2013_unformatted_ages <- liu2013_unformatted_table[2,-1]
# liu2013_unformatted_ages <- str_sub(liu2013_unformatted_ages, 6,100)
# liu2013_formatted_ages <- as.numeric(liu2013_unformatted_ages)
#
# #Formatting sex.
# liu2013_unformatted_sex <- liu2013_unformatted_table[3,-1]
# liu2013_unformatted_sex <- str_sub(liu2013_unformatted_sex,9,60)
# liu2013_unformatted_sex[liu2013_unformatted_sex=="f"] <- "Female"
# liu2013_unformatted_sex[liu2013_unformatted_sex=="m"] <- "Male"
# liu2013_formatted_sex <- liu2013_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# liu2013_unformatted_condition <- liu2013_unformatted_table[1,-1]
# liu2013_unformatted_condition <- str_sub(liu2013_unformatted_condition,16,60)
# liu2013_unformatted_condition[liu2013_unformatted_condition=="rheumatoid arthritis"] <- "Rheumatoid Arthritis"
# liu2013_formatted_condition <- liu2013_unformatted_condition
#
# liu2013_formatted_donor <- paste(rep("AS",689),c(1:689),sep="")
#
# liu2013_samples <- data.frame(ID=liu2013_formatted_samples,
# Author=rep("Liu",689),
# Year=rep(2013,689),
# Tissue=rep("Blood",689),
# CellType=rep("Leukocyte",689),
# Age=liu2013_formatted_ages,
# Condition=liu2013_formatted_condition,
# Sex=liu2013_formatted_sex,
# DonorID=liu2013_formatted_donor,
# Misc=rep(NA,689))
#
# liu2013_cpgs <- read.table("Liu2013/GSE42861_series_matrix.txt",
# comment = "!",
# skip=5,
# fill=TRUE)
# liu2013_cpgs <- liu2013_cpgs[-c(1:5),]
# rownames(liu2013_cpgs) <- liu2013_cpgs[,1]
# liu2013_cpgs <- liu2013_cpgs[,-1]
# colnames(liu2013_cpgs) <- liu2013_formatted_samples
#
# liu2013_cpgs <- liu2013_cpgs[rownames(liu2013_cpgs) %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(liu2013_cpgs),]
#
# liu2013_cpgs$cpg <- rownames(liu2013_cpgs)
# liu2013_cpgs <- data.table(liu2013_cpgs)
# sample_table <- rbind(sample_table,liu2013_samples)
# cpg_table <- merge(cpg_table,liu2013_cpgs, all=TRUE)
#
#
# arpon2019_unformatted_table <- read.table("Arpon2019/GSE115278-GPL16304_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 5)
# arpon2019_unformatted_table_2 <- read.table("Arpon2019/GSE115278-GPL21145_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# arpon2019_unformatted_samples <- arpon2019_unformatted_table[5,-1]
# arpon2019_formatted_samples <- str_sub(arpon2019_unformatted_samples,1,62)[1:366]
#
# arpon2019_unformatted_samples_2 <- arpon2019_unformatted_table_2[5,-1]
# arpon2019_formatted_samples_2 <- str_sub(arpon2019_unformatted_samples_2,1,62)[1:108]
#
# arpon2019_formatted_samples <- c(arpon2019_formatted_samples,arpon2019_formatted_samples_2)
#
# #Formatting ages.
# arpon2019_unformatted_ages <- arpon2019_unformatted_table[2,-1]
# arpon2019_formatted_ages <- as.numeric(str_sub(arpon2019_unformatted_ages, 5,100))
#
# arpon2019_unformatted_ages_2 <- arpon2019_unformatted_table_2[2,-1]
# arpon2019_formatted_ages_2 <- as.numeric(str_sub(arpon2019_unformatted_ages_2, 5,100))
#
# arpon2019_formatted_ages <- c(arpon2019_formatted_ages,arpon2019_formatted_ages_2)
#
# #Formatting sex.
# arpon2019_unformatted_sex <- arpon2019_unformatted_table[1,-1]
# arpon2019_unformatted_sex <- str_sub(arpon2019_unformatted_sex, 6, 15)
# arpon2019_formatted_sex <- arpon2019_unformatted_sex
#
# arpon2019_unformatted_sex_2 <- arpon2019_unformatted_table_2[1,-1]
# arpon2019_unformatted_sex_2 <- str_sub(arpon2019_unformatted_sex_2, 6, 15)
# arpon2019_formatted_sex_2 <- arpon2019_unformatted_sex_2
#
# arpon2019_formatted_sex <- c(arpon2019_formatted_sex,arpon2019_formatted_sex_2)
#
# #Formatting condition and miscellaneous information.
# arpon2019_formatted_condition <- rep("Control",474)
#
# arpon2019_formatted_donor <- paste(rep("AT",474),c(1:474),sep="")
#
# arpon2019_formatted_location <- str_sub(arpon2019_unformatted_table[3,-1],1,100)
# arpon2019_formatted_location_2 <- str_sub(arpon2019_unformatted_table_2[3,-1],1,100)
# arpon2019_formatted_location <- c(arpon2019_formatted_location,arpon2019_formatted_location_2)
#
# arpon2019_samples <- data.frame(ID=arpon2019_formatted_samples,
# Author=rep("Arpon",474),
# Year=rep(2019,474),
# Tissue=rep("Blood",474),
# CellType=rep("Leukocytes",474),
# Age=arpon2019_formatted_ages,
# Condition=arpon2019_formatted_condition,
# Sex=arpon2019_formatted_sex,
# DonorID=arpon2019_formatted_donor,
# Misc=rep(NA,474))
#
# arpon2019_cpgs <- data.table::fread("Arpon2019/GSE115278_Matrix_processed.txt",
# header=FALSE) %>%
# as.data.frame()
# row.names(arpon2019_cpgs) <- arpon2019_cpgs[,2]
# arpon2019_cpgs <- arpon2019_cpgs[,-c(1,2)]
# names <- read.table("Arpon2019/target.txt", quote="\"", comment.char="")
# names <- names[,-1]
# colnames(arpon2019_cpgs) <- names
# arpon2019_cpgs <- arpon2019_cpgs[, rep(c(TRUE, FALSE),474)]
#
# arpon2019_cpgs <- arpon2019_cpgs[,arpon2019_formatted_location]
# colnames(arpon2019_cpgs) <- arpon2019_formatted_samples
#
# arpon2019_cpgs <- arpon2019_cpgs[rownames(arpon2019_cpgs) %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(arpon2019_cpgs),]
#
# arpon2019_cpgs$cpg <- rownames(arpon2019_cpgs)
# arpon2019_cpgs <- data.table(arpon2019_cpgs)
# sample_table <- rbind(sample_table,arpon2019_samples)
# cpg_table <- merge(cpg_table,arpon2019_cpgs, all=TRUE)
#
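The column mask used in the Arpon block above repeats the pattern TRUE, FALSE once per sample, so it keeps columns 1, 3, 5, and so on of the processed matrix. A plausible reason, though the listing does not say so, is that the processed file interleaves each sample's beta values with a second per-sample column such as a detection p-value. The toy example below uses invented column names and values to show the mask and an equivalent index form.

```r
# Toy matrix with interleaved columns: value, p-value, value, p-value, ...
# (the interleaved layout is an assumption; all numbers are invented)
toy <- data.frame(S1 = c(0.1, 0.2), S1_p = c(0.01, 0.02),
                  S2 = c(0.3, 0.4), S2_p = c(0.03, 0.04),
                  S3 = c(0.5, 0.6), S3_p = c(0.05, 0.06))
n_samples <- 3

keep_mask <- rep(c(TRUE, FALSE), n_samples)   # same pattern as the listing
keep_idx  <- seq(1, 2 * n_samples, by = 2)    # equivalent index form

# Both forms select the same columns.
stopifnot(identical(which(keep_mask), as.integer(keep_idx)))
kept_names <- colnames(toy[, keep_mask])
print(kept_names)
```

Writing the mask with an explicit sample count makes the expected width of the matrix (two columns per sample) easy to check before subsetting.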
# ringh2019_unformatted_table <- read.table("Ringh2019/GSE133062_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# ringh2019_unformatted_samples <- ringh2019_unformatted_table[5,-1]
# ringh2019_formatted_samples <- str_sub(ringh2019_unformatted_samples,1,62)[1:70]
#
# #Formatting ages.
# ringh2019_unformatted_ages <- ringh2019_unformatted_table[2,-1]
# ringh2019_formatted_ages <- as.numeric(str_sub(ringh2019_unformatted_ages, 5,100))
#
# #Formatting sex.
# ringh2019_unformatted_sex <- ringh2019_unformatted_table[1,-1]
# ringh2019_unformatted_sex <- str_sub(ringh2019_unformatted_sex, 6, 15)
# ringh2019_unformatted_sex[ringh2019_unformatted_sex=="F"] <- "Female"
# ringh2019_unformatted_sex[ringh2019_unformatted_sex=="M"] <- "Male"
# ringh2019_formatted_sex <- ringh2019_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# ringh2019_formatted_condition <- rep("Control",70)
#
# ringh2019_formatted_donor <- paste(rep("AU",70),c(1:70),sep="")
#
# ringh2019_formatted_location <- str_sub(ringh2019_unformatted_table[3,-1],1,100)
#
# ringh2019_samples <- data.frame(ID=ringh2019_formatted_samples,
# Author=rep("Ringh",70),
# Year=rep(2019,70),
# Tissue=rep("Bronchi",70),
# CellType=rep("Bronchoalveolar Cells",70),
# Age=ringh2019_formatted_ages,
# Condition=ringh2019_formatted_condition,
# Sex=ringh2019_formatted_sex,
# DonorID=ringh2019_formatted_donor,
# Misc=rep(NA,70))
#
# ringh2019_cpgs <- data.table::fread("Ringh2019/GSE133062_Matrix_processed.txt",
# header=FALSE) %>%
# as.data.frame()
# row.names(ringh2019_cpgs) <- ringh2019_cpgs$V1
# ringh2019_cpgs <- ringh2019_cpgs[,-1]
# colnames(ringh2019_cpgs) <- ringh2019_cpgs[1,]
# ringh2019_cpgs <- ringh2019_cpgs[-1,]
#
# ringh2019_cpgs <- ringh2019_cpgs[,colnames(ringh2019_cpgs) %in% ringh2019_formatted_location]
# ringh2019_cpgs <- ringh2019_cpgs[,ringh2019_formatted_location]
# colnames(ringh2019_cpgs) <- ringh2019_formatted_samples
#
# ringh2019_cpgs <- ringh2019_cpgs[rownames(ringh2019_cpgs) %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(ringh2019_cpgs),]
#
# ringh2019_cpgs$cpg <- rownames(ringh2019_cpgs)
# ringh2019_cpgs <- data.table(ringh2019_cpgs)
# sample_table <- rbind(sample_table,ringh2019_samples)
# cpg_table <- merge(cpg_table,ringh2019_cpgs, all=TRUE)
#
#
# kho2020_unformatted_table <- read.table("Kho2020/GSE157131-GPL13534_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 5)
# kho2020_unformatted_table_2 <- read.table("Kho2020/GSE157131-GPL21145_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# kho2020_unformatted_samples <- kho2020_unformatted_table[5,-1]
# kho2020_formatted_samples <- str_sub(kho2020_unformatted_samples,1,62)[1:272]
#
# kho2020_unformatted_samples_2 <- kho2020_unformatted_table_2[5,-1]
# kho2020_formatted_samples_2 <- str_sub(kho2020_unformatted_samples_2,1,62)[1:946]
#
# kho2020_formatted_samples <- c(kho2020_formatted_samples,kho2020_formatted_samples_2)
#
# #Formatting ages.
# kho2020_unformatted_ages <- kho2020_unformatted_table[2,-1]
# kho2020_formatted_ages <- as.numeric(str_sub(kho2020_unformatted_ages, 10,100))
#
# kho2020_unformatted_ages_2 <- kho2020_unformatted_table_2[2,-1]
# kho2020_formatted_ages_2 <- as.numeric(str_sub(kho2020_unformatted_ages_2, 10,100))
#
# kho2020_formatted_ages <- c(kho2020_formatted_ages,kho2020_formatted_ages_2)
#
# #Formatting sex.
# kho2020_unformatted_sex <- kho2020_unformatted_table[1,-1]
# kho2020_unformatted_sex <- str_sub(kho2020_unformatted_sex, 9, 15)
# kho2020_unformatted_sex[kho2020_unformatted_sex=="F"] <- "Female"
# kho2020_unformatted_sex[kho2020_unformatted_sex=="M"] <- "Male"
# kho2020_formatted_sex <- kho2020_unformatted_sex
#
# kho2020_unformatted_sex_2 <- kho2020_unformatted_table_2[1,-1]
# kho2020_unformatted_sex_2 <- str_sub(kho2020_unformatted_sex_2, 9, 15)
# kho2020_unformatted_sex_2[kho2020_unformatted_sex_2=="F"] <- "Female"
# kho2020_unformatted_sex_2[kho2020_unformatted_sex_2=="M"] <- "Male"
# kho2020_formatted_sex_2 <- kho2020_unformatted_sex_2
#
# kho2020_formatted_sex <- c(kho2020_formatted_sex,kho2020_formatted_sex_2)
#
# #Formatting condition and miscellaneous information.
# kho2020_formatted_condition <- rep("Control",1218)
#
# kho2020_formatted_donor <- paste(rep("AV",1218),c(1:1218),sep="")
#
# kho2020_formatted_location <- str_sub(kho2020_unformatted_table[3,-1],1,100)
# kho2020_formatted_location_2 <- str_sub(kho2020_unformatted_table_2[3,-1],1,100)
# kho2020_formatted_location <- c(kho2020_formatted_location,kho2020_formatted_location_2)
#
# kho2020_samples <- data.frame(ID=kho2020_formatted_samples,
# Author=rep("Kho",1218),
# Year=rep(2020,1218),
# Tissue=rep("Blood",1218),
# CellType=rep("Leukocytes",1218),
# Age=kho2020_formatted_ages,
# Condition=kho2020_formatted_condition,
# Sex=kho2020_formatted_sex,
# DonorID=kho2020_formatted_donor,
# Misc=rep(NA,1218))
#
# kho2020_cpgs <- data.table::fread("Kho2020/GSE157131_Matrix_processed_beta_geo_08252020.txt",
# header=FALSE) %>%
# as.data.frame()
# rownames(kho2020_cpgs) <- kho2020_cpgs[,1]
# kho2020_cpgs <- kho2020_cpgs[,-1]
# names <- read.table("Kho2020/target.txt", quote="\"", comment.char="")
# names <- names[,-1]
# colnames(kho2020_cpgs) <- names
# kho2020_cpgs <- kho2020_cpgs[, rep(c(TRUE, FALSE),1218)]
#
# kho2020_cpgs <- kho2020_cpgs[,kho2020_formatted_location]
# colnames(kho2020_cpgs) <- kho2020_formatted_samples
#
# kho2020_cpgs <- kho2020_cpgs[rownames(kho2020_cpgs) %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(kho2020_cpgs),]
#
# kho2020_cpgs$cpg <- rownames(kho2020_cpgs)
# kho2020_cpgs <- data.table(kho2020_cpgs)
# sample_table <- rbind(sample_table,kho2020_samples)
# cpg_table <- merge(cpg_table,kho2020_cpgs, all=TRUE)
#
#
# hannum2012_unformatted_table <- read.table("Hannum2012/GSE40279_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# hannum2012_unformatted_samples <- hannum2012_unformatted_table[5,-1]
# hannum2012_formatted_samples <- str_sub(hannum2012_unformatted_samples,1,62)[1:656]
#
# #Formatting ages.
# hannum2012_unformatted_ages <- hannum2012_unformatted_table[2,-1]
# hannum2012_formatted_ages <- as.numeric(str_sub(hannum2012_unformatted_ages, 9,100))
#
# #Formatting sex.
# hannum2012_unformatted_sex <- hannum2012_unformatted_table[3,-1]
# hannum2012_unformatted_sex <- str_sub(hannum2012_unformatted_sex,9, 15)
# hannum2012_unformatted_sex[hannum2012_unformatted_sex=="F"] <- "Female"
# hannum2012_unformatted_sex[hannum2012_unformatted_sex=="M"] <- "Male"
# hannum2012_formatted_sex <- hannum2012_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# hannum2012_formatted_condition <- rep("Control",656)
#
# hannum2012_formatted_donor <- paste(rep("AW",656),c(1:656),sep="")
#
# hannum2012_samples <- data.frame(ID=hannum2012_formatted_samples,
# Author=rep("Hannum",656),
# Year=rep(2012,656),
# Tissue=rep("Blood",656),
# CellType=rep("Blood",656),
# Age=hannum2012_formatted_ages,
# Condition=hannum2012_formatted_condition,
# Sex=hannum2012_formatted_sex,
# DonorID=hannum2012_formatted_donor,
# Misc=rep(NA,656))
#
# hannum2012_formatted_location <- hannum2012_unformatted_table[1,-1]
#
# hannum2012_cpgs <- data.table::fread("Hannum2012/GSE40279_average_beta.txt",
# header=FALSE, fill=FALSE)
# hannum2012_cpglist <- hannum2012_cpgs$V1
# hannum2012_cpglist <- hannum2012_cpglist[-1]
# colnames(hannum2012_cpgs) <- as.character(hannum2012_cpgs[1,])
# hannum2012_cpgs <- hannum2012_cpgs[-1,]
#
# hannum2012_present_samples <- colnames(hannum2012_cpgs) %in% hannum2012_formatted_location
# hannum2012_cpgs <- hannum2012_cpgs[,..hannum2012_present_samples]
# hannum2012_cpgs <- setcolorder(hannum2012_cpgs,str_sub(hannum2012_formatted_location,1,100))
# colnames(hannum2012_cpgs) <- hannum2012_formatted_samples
#
# hannum2012_cpgs$cpg <- hannum2012_cpglist
# hannum2012_cpgs <- hannum2012_cpgs[hannum2012_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(hannum2012_cpgs),]
#
# sample_table <- rbind(sample_table,hannum2012_samples)
# cpg_table <- merge(cpg_table,hannum2012_cpgs, all=TRUE,by.y="cpg")
#
# garcia2021_unformatted_table <- read.table("Garcia2021/GSE179414_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows = 6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# garcia2021_unformatted_samples <- garcia2021_unformatted_table[6,-1]
# garcia2021_formatted_samples <- str_sub(garcia2021_unformatted_samples,1,62)[1:157]
#
# #Formatting ages.
# garcia2021_unformatted_ages <- garcia2021_unformatted_table[4,-1]
# garcia2021_formatted_ages <- as.numeric(str_sub(garcia2021_unformatted_ages, 6,100))
#
# #Formatting sex.
# garcia2021_unformatted_sex <- garcia2021_unformatted_table[3,-1]
# garcia2021_unformatted_sex <- str_sub(garcia2021_unformatted_sex,9, 15)
# garcia2021_unformatted_sex[garcia2021_unformatted_sex=="F"] <- "Female"
# garcia2021_unformatted_sex[garcia2021_unformatted_sex=="M"] <- "Male"
# garcia2021_formatted_sex <- garcia2021_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# garcia2021_unformatted_condition <- garcia2021_unformatted_table[2,-1]
# garcia2021_unformatted_condition[garcia2021_unformatted_condition=="transduction status: Untransduced"] <-
# "Control"
# garcia2021_unformatted_condition[garcia2021_unformatted_condition == "transduction status: Transduced"] <-
# "CAR-T"
# garcia2021_formatted_condition <- str_sub(garcia2021_unformatted_condition,1,100)
#
# garcia2021_unformatted_donor <- garcia2021_unformatted_table[1,-1]
# garcia2021_unformatted_donor <- as.numeric(factor(garcia2021_unformatted_donor))
# garcia2021_formatted_donor <- paste(rep("AX",157),garcia2021_unformatted_donor,sep="")
#
# garcia2021_samples <- data.frame(ID=garcia2021_formatted_samples,
# Author=rep("Garcia",157),
# Year=rep(2021,157),
# Tissue=rep("Blood",157),
# CellType=rep("T Cell",157),
# Age=garcia2021_formatted_ages,
# Condition=garcia2021_formatted_condition,
# Sex=garcia2021_formatted_sex,
# DonorID=garcia2021_formatted_donor,
# Misc=rep(NA,157))
#
# garcia2021_cpgs <- data.table::fread("Garcia2021/no_comments.txt",
# header=FALSE, fill=FALSE,skip=5)
# garcia2021_cpglist <- garcia2021_cpgs$V1
# garcia2021_cpglist <- garcia2021_cpglist[-1]
# colnames(garcia2021_cpgs) <- as.character(garcia2021_cpgs[1,])
# garcia2021_cpgs <- garcia2021_cpgs[-1,-1]
#
# # garcia2021_present_samples <- colnames(garcia2021_cpgs) %in% garcia2021_formatted_location
# # garcia2021_cpgs <- garcia2021_cpgs[,..garcia2021_present_samples]
# # garcia2021_cpgs <- setcolorder(garcia2021_cpgs,str_sub(garcia2021_formatted_location,1,100))
# colnames(garcia2021_cpgs) <- garcia2021_formatted_samples
#
# garcia2021_cpgs$cpg <- garcia2021_cpglist
# garcia2021_cpgs <- garcia2021_cpgs[garcia2021_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(garcia2021_cpgs),]
#
# sample_table <- rbind(sample_table,garcia2021_samples)
# cpg_table <- merge(cpg_table,garcia2021_cpgs, all=TRUE,by.y="cpg")
#
# kawai2021_unformatted_table <- read.table("Kawai2021/GSE122288_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# kawai2021_unformatted_samples <- kawai2021_unformatted_table[3,-1]
# kawai2021_formatted_samples <- str_sub(kawai2021_unformatted_samples,1,62)[1:61]
#
# #Formatting ages.
# # kawai2021_unformatted_ages <- kawai2021_unformatted_table[4,-1]
# # kawai2021_formatted_ages <- as.numeric(str_sub(kawai2021_unformatted_ages, 6,100))
# kawai2021_formatted_ages <- rep(0,61)
#
# #Formatting sex.
# kawai2021_unformatted_sex <- kawai2021_unformatted_table[1,-1]
# kawai2021_unformatted_sex <- str_sub(kawai2021_unformatted_sex,17, 25)
# kawai2021_unformatted_sex[kawai2021_unformatted_sex=="female"] <- "Female"
# kawai2021_unformatted_sex[kawai2021_unformatted_sex=="male"] <- "Male"
# kawai2021_formatted_sex <- kawai2021_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# # kawai2021_unformatted_condition <- kawai2021_unformatted_table[2,-1]
# # kawai2021_unformatted_condition[kawai2021_unformatted_condition=="transduction status: Untransduced"] <-
# # "Control"
# # kawai2021_unformatted_condition[kawai2021_unformatted_condition == "transduction status: Transduced"] <-
# # "CAR-T"
# # kawai2021_formatted_condition <- str_sub(kawai2021_unformatted_condition,1,100)
# kawai2021_formatted_condition <- rep("Control",61)
#
# kawai2021_formatted_donor <- paste(rep("AY",61),c(1:61),sep="")
#
# kawai2021_samples <- data.frame(ID=kawai2021_formatted_samples,
# Author=rep("Kawai",61),
# Year=rep(2021,61),
# Tissue=rep("Blood",61),
# CellType=rep("Cord Blood",61),
# Age=kawai2021_formatted_ages,
# Condition=kawai2021_formatted_condition,
# Sex=kawai2021_formatted_sex,
# DonorID=kawai2021_formatted_donor,
# Misc=rep(NA,61))
#
# kawai2021_cpgs <- data.table::fread("Kawai2021/no_comments.txt",
# header=FALSE, fill=FALSE,skip=2)
# kawai2021_cpglist <- kawai2021_cpgs$V1
# kawai2021_cpglist <- kawai2021_cpglist[-1]
# colnames(kawai2021_cpgs) <- as.character(kawai2021_cpgs[1,])
# kawai2021_cpgs <- kawai2021_cpgs[-1,-1]
#
# # kawai2021_present_samples <- colnames(kawai2021_cpgs) %in% kawai2021_formatted_location
# # kawai2021_cpgs <- kawai2021_cpgs[,..kawai2021_present_samples]
# # kawai2021_cpgs <- setcolorder(kawai2021_cpgs,str_sub(kawai2021_formatted_location,1,100))
# colnames(kawai2021_cpgs) <- kawai2021_formatted_samples
#
# kawai2021_cpgs$cpg <- kawai2021_cpglist
# kawai2021_cpgs <- kawai2021_cpgs[kawai2021_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(kawai2021_cpgs),]
#
# sample_table <- rbind(sample_table,kawai2021_samples)
# cpg_table <- merge(cpg_table,kawai2021_cpgs, all=TRUE,by.y="cpg")
#
# xu2021_unformatted_table <- read.table("Xu2021/GSE174422_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# xu2021_unformatted_samples <- xu2021_unformatted_table[6,-1]
# xu2021_formatted_samples <- str_sub(xu2021_unformatted_samples,1,62)[1:256]
#
# #Formatting ages.
# xu2021_unformatted_ages <- xu2021_unformatted_table[4,-1]
# xu2021_formatted_ages <- as.numeric(str_sub(xu2021_unformatted_ages, 6,100))
#
# #Formatting sex.
# xu2021_unformatted_sex <- xu2021_unformatted_table[3,-1]
# xu2021_unformatted_sex <- str_sub(xu2021_unformatted_sex,6, 25)
# xu2021_unformatted_sex[xu2021_unformatted_sex=="female"] <- "Female"
# xu2021_unformatted_sex[xu2021_unformatted_sex=="male"] <- "Male"
# xu2021_formatted_sex <- xu2021_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# # xu2021_unformatted_condition <- xu2021_unformatted_table[2,-1]
# # xu2021_unformatted_condition[xu2021_unformatted_condition=="transduction status: Untransduced"] <-
# # "Control"
# # xu2021_unformatted_condition[xu2021_unformatted_condition == "transduction status: Transduced"] <-
# # "CAR-T"
# # xu2021_formatted_condition <- str_sub(xu2021_unformatted_condition,1,100)
# xu2021_formatted_condition <- rep("Control",256)
#
# xu2021_formatted_location <- str_sub(xu2021_unformatted_table[1,-1],1,100)
# xu2021_formatted_duplicate <- str_sub(xu2021_unformatted_table[2,-1],15,100)
# indexing <- match(xu2021_formatted_location,xu2021_formatted_duplicate)
# for (i in indexing) {
# indexing[i] <- increment
# indexing[indexing[i]] <- increment
# increment <- increment + 1
# }
#
# xu2021_formatted_donor <- paste(rep("BA",256),as.numeric(as.factor(indexing)),sep="")
#
# xu2021_samples <- data.frame(ID=xu2021_formatted_samples,
# Author=rep("Xu",256),
# Year=rep(2021,256),
# Tissue=rep("Blood",256),
# CellType=rep("Blood",256),
# Age=xu2021_formatted_ages,
# Condition=xu2021_formatted_condition,
# Sex=xu2021_formatted_sex,
# DonorID=xu2021_formatted_donor,
# Misc=rep(NA,256))
#
# xu2021_cpgs <- data.table::fread("Xu2021/GSE174422_Matrix_processed_dup.csv",
# header=FALSE, fill=FALSE)
# xu2021_cpglist <- xu2021_cpgs$V1
# xu2021_cpglist <- xu2021_cpglist[-1]
# colnames(xu2021_cpgs) <- as.character(xu2021_cpgs[1,])
# xu2021_cpgs <- xu2021_cpgs[-1,-1]
#
# xu2021_present <- rep(c(rep(TRUE, 2- 1), FALSE),256)
# xu2021_cpgs <- xu2021_cpgs[,..xu2021_present]
#
# xu2021_present_samples <- colnames(xu2021_cpgs) %in% xu2021_formatted_location
# xu2021_cpgs <- xu2021_cpgs[,..xu2021_present_samples]
# xu2021_cpgs <- setcolorder(xu2021_cpgs,str_sub(xu2021_formatted_location,1,100))
# colnames(xu2021_cpgs) <- xu2021_formatted_samples
#
# xu2021_cpgs$cpg <- xu2021_cpglist
# xu2021_cpgs <- xu2021_cpgs[xu2021_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(xu2021_cpgs),]
#
# sample_table <- rbind(sample_table,xu2021_samples)
# cpg_table <- merge(cpg_table,xu2021_cpgs, all=TRUE,by.y="cpg")
#
# kandaswamy2020_unformatted_table <- read.table("Kandaswamy2020/GSE154566-GPL13534_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =8)
# kandaswamy2020_unformatted_table_2 <- read.table("Kandaswamy2020/GSE154566-GPL23976_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =8)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# kandaswamy2020_unformatted_samples <- kandaswamy2020_unformatted_table[8,-1]
# kandaswamy2020_formatted_samples <- str_sub(kandaswamy2020_unformatted_samples,1,234)[1:233]
#
# kandaswamy2020_unformatted_samples_2 <- kandaswamy2020_unformatted_table_2[8,-1]
# kandaswamy2020_formatted_samples_2 <- str_sub(kandaswamy2020_unformatted_samples_2,1,234)[1:944]
#
# kandaswamy2020_formatted_samples <- c(kandaswamy2020_formatted_samples,kandaswamy2020_formatted_samples_2)
# #Formatting ages.
# kandaswamy2020_unformatted_ages <- kandaswamy2020_unformatted_table[4,-1]
# kandaswamy2020_formatted_ages <- as.numeric(str_sub(kandaswamy2020_unformatted_ages, 10,100))
#
# kandaswamy2020_unformatted_ages_2 <- kandaswamy2020_unformatted_table_2[4,-1]
# kandaswamy2020_formatted_ages_2 <- as.numeric(str_sub(kandaswamy2020_unformatted_ages_2, 10,100))
#
# kandaswamy2020_formatted_ages <- c(kandaswamy2020_formatted_ages,kandaswamy2020_formatted_ages_2)
# #Formatting sex.
# kandaswamy2020_unformatted_sex <- kandaswamy2020_unformatted_table[2,-1]
# kandaswamy2020_unformatted_sex <- str_sub(kandaswamy2020_unformatted_sex,9, 25)
# kandaswamy2020_formatted_sex <- kandaswamy2020_unformatted_sex
#
# kandaswamy2020_unformatted_sex_2 <- kandaswamy2020_unformatted_table_2[2,-1]
# kandaswamy2020_unformatted_sex_2 <- str_sub(kandaswamy2020_unformatted_sex_2,9, 25)
# kandaswamy2020_formatted_sex_2 <- kandaswamy2020_unformatted_sex_2
#
# kandaswamy2020_formatted_sex <- c(kandaswamy2020_formatted_sex,kandaswamy2020_formatted_sex_2)
# #Formatting condition and miscellaneous information.
# kandaswamy2020_unformatted_condition <- kandaswamy2020_unformatted_table[5,-1]
# kandaswamy2020_unformatted_condition[kandaswamy2020_unformatted_condition=="disease state: NA"] <-
# "Control"
# kandaswamy2020_unformatted_condition[kandaswamy2020_unformatted_condition == "disease state: Exposed"] <-
# "Victimized"
# kandaswamy2020_formatted_condition <- str_sub(kandaswamy2020_unformatted_condition,1,100)
#
# kandaswamy2020_unformatted_condition_2 <- kandaswamy2020_unformatted_table_2[5,-1]
# kandaswamy2020_unformatted_condition_2[kandaswamy2020_unformatted_condition_2=="disease state: Notexposed"] <-
# "Control"
# kandaswamy2020_unformatted_condition_2[kandaswamy2020_unformatted_condition_2 == "disease state: Exposed"] <-
# "Victimized"
# kandaswamy2020_formatted_condition_2 <- str_sub(kandaswamy2020_unformatted_condition_2,1,100)
#
# kandaswamy2020_formatted_condition <- c(kandaswamy2020_formatted_condition,kandaswamy2020_formatted_condition_2)
#
# kandaswamy2020_unformatted_location <- str_sub(kandaswamy2020_unformatted_table[6,-1],79,97)
# kandaswamy2020_unformatted_location_new <- c()
# for (item in kandaswamy2020_unformatted_location) {
# last_character <- substr(item, nchar(item)-1+1, nchar(item))
# if (last_character == "G") {
# item <- str_sub(item,0,17)
# }
# kandaswamy2020_unformatted_location_new <- c(kandaswamy2020_unformatted_location_new,item)
# }
# kandaswamy2020_formatted_location_2 <- str_sub(kandaswamy2020_unformatted_table_2[6,-1],79,97)
# kandaswamy2020_formatted_location <- c(kandaswamy2020_unformatted_location_new,kandaswamy2020_formatted_location_2)
# kandaswamy2020_mapping_df <- data.frame("Location" =kandaswamy2020_formatted_location,
# "Sample" =kandaswamy2020_formatted_samples )
#
# kandaswamy2020_unformatted_donor <- str_sub(kandaswamy2020_unformatted_table[1,-1],1,100)
# kandaswamy2020_unformatted_donor_2 <- str_sub(kandaswamy2020_unformatted_table_2[1,-1],1,100)
# kandaswamy2020_unformatted_donor <- c(kandaswamy2020_unformatted_donor, kandaswamy2020_unformatted_donor_2)
# kandaswamy2020_unformatted_donor <- as.numeric(as.factor(kandaswamy2020_unformatted_donor))
# kandaswamy2020_formatted_donor <- paste("BB",kandaswamy2020_unformatted_donor,sep="")
#
# kandaswamy2020_formatted_tissue <- rep("Blood",233)
# kandaswamy2020_unformatted_tissue_2 <- str_sub(kandaswamy2020_unformatted_table_2[3,-1],9,100)
# kandaswamy2020_formatted_tissue <- c(kandaswamy2020_formatted_tissue,kandaswamy2020_unformatted_tissue_2)
#
# kandaswamy2020_formatted_celltype <- kandaswamy2020_formatted_tissue
# kandaswamy2020_formatted_celltype[kandaswamy2020_formatted_celltype=="Buccal"] <-
# "Epithelial"
#
# kandaswamy2020_samples <- data.frame(ID=kandaswamy2020_formatted_samples,
# Author=rep("Kandaswamy",1177),
# Year=rep(2020,1177),
# Tissue=kandaswamy2020_formatted_tissue,
# CellType=kandaswamy2020_formatted_celltype,
# Age=kandaswamy2020_formatted_ages,
# Condition=kandaswamy2020_formatted_condition,
# Sex=kandaswamy2020_formatted_sex,
# DonorID=kandaswamy2020_formatted_donor,
# Misc=rep(NA,1177))
#
# kandaswamy2020_cpgs_1 <- data.table::fread("Kandaswamy2020/GSE154566_Overlaping_probes_normalised_together_450K_EPIC_betas.csv",
# header=FALSE, fill=FALSE)
# kandaswamy2020_cpgs_2 <- data.table::fread("Kandaswamy2020/GSE154566_betas_normalised.csv",
# header=FALSE, fill=FALSE)
# kandaswamy2020_cpgs <- merge(kandaswamy2020_cpgs_1,kandaswamy2020_cpgs_2,by="V1")
# kandaswamy2020_cpglist <- kandaswamy2020_cpgs$V1
# kandaswamy2020_cpglist <- kandaswamy2020_cpglist[-1]
# colnames(kandaswamy2020_cpgs) <- as.character(kandaswamy2020_cpgs[1,])
# kandaswamy2020_cpgs <- kandaswamy2020_cpgs[-1,-1]
#
# #kandaswamy2020_present <- rep(c(rep(TRUE, 2- 1), FALSE),256)
# #kandaswamy2020_cpgs <- kandaswamy2020_cpgs[,..kandaswamy2020_present]
#
# kandaswamy2020_present_samples <- colnames(kandaswamy2020_cpgs) %in% kandaswamy2020_formatted_location
# kandaswamy2020_cpgs <- kandaswamy2020_cpgs[,..kandaswamy2020_present_samples]
# kandaswamy2020_samples_unique <- !duplicated(colnames(kandaswamy2020_cpgs))
# kandaswamy2020_cpgs <- kandaswamy2020_cpgs[,..kandaswamy2020_samples_unique]
# kandaswamy2020_formatted_location <- kandaswamy2020_formatted_location[kandaswamy2020_formatted_location %in%
# colnames(kandaswamy2020_cpgs)]
# kandaswamy2020_cpgs <- setcolorder(kandaswamy2020_cpgs,str_sub(kandaswamy2020_formatted_location,1,100))
#
# colnames(kandaswamy2020_cpgs) <- kandaswamy2020_formatted_location
# kandaswamy2020_matched_cpgs <- match(colnames(kandaswamy2020_cpgs),kandaswamy2020_mapping_df$Location)
# kandaswamy2020_sample_names <- kandaswamy2020_mapping_df$Sample[kandaswamy2020_matched_cpgs]
# colnames(kandaswamy2020_cpgs) <- kandaswamy2020_sample_names
#
# kandaswamy2020_cpgs$cpg <- kandaswamy2020_cpglist
# kandaswamy2020_cpgs <- kandaswamy2020_cpgs[kandaswamy2020_cpgs$cpg %in% cpg_table$cpg,]
# kandaswamy2020_samples <- kandaswamy2020_samples[kandaswamy2020_samples$ID %in%
# kandaswamy2020_sample_names,]
#
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(kandaswamy2020_cpgs),]
#
# sample_table <- rbind(sample_table,kandaswamy2020_samples)
# cpg_table <- merge(cpg_table,kandaswamy2020_cpgs, all=TRUE,by.y="cpg")
#
#
# tully2016_unformatted_table <- read.table("Tully2016/GSE72556_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =4)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# tully2016_unformatted_samples <- tully2016_unformatted_table[4,-1]
# tully2016_formatted_samples <- str_sub(tully2016_unformatted_samples,1,62)[1:96]
#
# #Formatting ages.
# tully2016_unformatted_ages <- tully2016_unformatted_table[2,-1]
# tully2016_formatted_ages <- as.numeric(str_sub(tully2016_unformatted_ages, 12,100))
#
# #Formatting sex.
# tully2016_unformatted_sex <- tully2016_unformatted_table[1,-1]
# tully2016_unformatted_sex <- str_sub(tully2016_unformatted_sex,15, 25)
# tully2016_unformatted_sex[tully2016_unformatted_sex=="F"] <- "Female"
# tully2016_unformatted_sex[tully2016_unformatted_sex=="M"] <- "Male"
# tully2016_formatted_sex <- tully2016_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# # tully2016_unformatted_condition <- tully2016_unformatted_table[2,-1]
# # tully2016_unformatted_condition[tully2016_unformatted_condition=="transduction status: Untransduced"] <-
# # "Control"
# # tully2016_unformatted_condition[tully2016_unformatted_condition == "transduction status: Transduced"] <-
# # "CAR-T"
# # tully2016_formatted_condition <- str_sub(tully2016_unformatted_condition,1,100)
# tully2016_formatted_condition <- rep("Control",96)
#
# tully2016_formatted_donor <- paste(rep("BC",96),c(1:96),sep="")
#
# tully2016_samples <- data.frame(ID=tully2016_formatted_samples,
# Author=rep("Tully",96),
# Year=rep(2016,96),
# Tissue=rep("Saliva",96),
# CellType=rep("Epithelial",96),
# Age=tully2016_formatted_ages,
# Condition=tully2016_formatted_condition,
# Sex=tully2016_formatted_sex,
# DonorID=tully2016_formatted_donor,
# Misc=rep(NA,96))
#
# tully2016_cpgs <- read.table("Tully2016/GSE72556_series_matrix.txt",
# comment = "!",
# skip=4,fill=TRUE)
# colnames(tully2016_cpgs) <- tully2016_cpgs[4,]
# tully2016_cpgs <- tully2016_cpgs[-c(1:4),]
# rownames(tully2016_cpgs) <- tully2016_cpgs[,1]
# tully2016_cpgs <- tully2016_cpgs[,-1]
#
# tully2016_cpgs$cpg <- rownames(tully2016_cpgs)
# tully2016_cpgs <- tully2016_cpgs[tully2016_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(tully2016_cpgs),]
#
# tully2016_samples <- tully2016_samples[!is.na(tully2016_samples$Age),]
# tully2016_cpgs <- tully2016_cpgs[,(colnames(tully2016_cpgs) %in% tully2016_samples$ID) |
# colnames(tully2016_cpgs) == "cpg"]
#
# sample_table <- rbind(sample_table,tully2016_samples)
# cpg_table <- merge(cpg_table,tully2016_cpgs, all=TRUE,by.y="cpg")
#
#
# bacalini2015_unformatted_table <- read.table("Bacalini2015/GSE52588_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# bacalini2015_unformatted_samples <- bacalini2015_unformatted_table[5,-1]
# bacalini2015_formatted_samples <- str_sub(bacalini2015_unformatted_samples,1,62)[1:87]
#
# #Formatting ages.
# bacalini2015_unformatted_ages <- bacalini2015_unformatted_table[3,-1]
# bacalini2015_formatted_ages <- as.numeric(str_sub(bacalini2015_unformatted_ages, 6,100))
#
# #Formatting sex.
# bacalini2015_unformatted_sex <- bacalini2015_unformatted_table[2,-1]
# bacalini2015_unformatted_sex <- str_sub(bacalini2015_unformatted_sex,9, 25)
# bacalini2015_unformatted_sex[bacalini2015_unformatted_sex=="F"] <- "Female"
# bacalini2015_unformatted_sex[bacalini2015_unformatted_sex=="M"] <- "Male"
# bacalini2015_formatted_sex <- bacalini2015_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# bacalini2015_unformatted_condition <- bacalini2015_unformatted_table[1,-1]
# bacalini2015_unformatted_condition[bacalini2015_unformatted_condition=="disease state: healthy"] <-
# "Control"
# bacalini2015_unformatted_condition[bacalini2015_unformatted_condition == "disease state: Down syndrome"] <-
# "Down Syndrome"
# bacalini2015_formatted_condition <- str_sub(bacalini2015_unformatted_condition,1,100)
#
# bacalini2015_formatted_donor <- paste(rep("BD",87),c(1:87),sep="")
#
# bacalini2015_samples <- data.frame(ID=bacalini2015_formatted_samples,
# Author=rep("Bacalini",87),
# Year=rep(2015,87),
# Tissue=rep("Blood",87),
# CellType=rep("Blood",87),
# Age=bacalini2015_formatted_ages,
# Condition=bacalini2015_formatted_condition,
# Sex=bacalini2015_formatted_sex,
# DonorID=bacalini2015_formatted_donor,
# Misc=rep(NA,87))
#
# bacalini2015_cpgs <- read.table("Bacalini2015/GSE52588_series_matrix.txt",
# comment = "!",
# skip=5,fill=TRUE)
# colnames(bacalini2015_cpgs) <- bacalini2015_cpgs[5,]
# bacalini2015_cpgs <- bacalini2015_cpgs[-c(1:5),]
# rownames(bacalini2015_cpgs) <- bacalini2015_cpgs[,1]
# bacalini2015_cpgs <- bacalini2015_cpgs[,-1]
#
# bacalini2015_cpgs$cpg <- rownames(bacalini2015_cpgs)
# bacalini2015_cpgs <- bacalini2015_cpgs[bacalini2015_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(bacalini2015_cpgs),]
#
# sample_table <- rbind(sample_table,bacalini2015_samples)
# cpg_table <- merge(cpg_table,bacalini2015_cpgs, all=TRUE,by.y="cpg")
#
#
# charlton2014_unformatted_table <- read.table("Charlton2014/GSE59157_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# charlton2014_unformatted_samples <- charlton2014_unformatted_table[5,-1]
# charlton2014_formatted_samples <- str_sub(charlton2014_unformatted_samples,1,62)[1:95]
#
# #Formatting ages.
# charlton2014_unformatted_ages <- charlton2014_unformatted_table[3,-1]
# charlton2014_formatted_ages <- as.numeric(str_sub(charlton2014_unformatted_ages, 15,100))/12
#
# #Formatting sex.
# charlton2014_unformatted_sex <- charlton2014_unformatted_table[2,-1]
# charlton2014_unformatted_sex <- str_sub(charlton2014_unformatted_sex,6, 25)
# charlton2014_unformatted_sex[charlton2014_unformatted_sex=="F"] <- "Female"
# charlton2014_unformatted_sex[charlton2014_unformatted_sex=="M"] <- "Male"
# charlton2014_formatted_sex <- charlton2014_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# charlton2014_unformatted_condition <- charlton2014_unformatted_table[1,-1]
# charlton2014_unformatted_condition[charlton2014_unformatted_condition=="tissue: nephrogenic rest"] <-
# "Control"
# charlton2014_unformatted_condition[charlton2014_unformatted_condition=="tissue: normal kidney"] <-
# "Control"
# charlton2014_unformatted_condition[charlton2014_unformatted_condition == "tissue: Wilms tumour"] <-
# "Wilms Tumour"
# charlton2014_formatted_condition <- str_sub(charlton2014_unformatted_condition,1,100)
#
# charlton2014_formatted_donor <- paste(rep("BE",95),c(1:95),sep="")
#
# charlton2014_samples <- data.frame(ID=charlton2014_formatted_samples,
# Author=rep("Charlton",95),
# Year=rep(2014,95),
# Tissue=rep("Kidney",95),
# CellType=rep("Kidney",95),
# Age=charlton2014_formatted_ages,
# Condition=charlton2014_formatted_condition,
# Sex=charlton2014_formatted_sex,
# DonorID=charlton2014_formatted_donor,
# Misc=rep(NA,95))
#
# charlton2014_cpgs <- read.table("Charlton2014/GSE59157_series_matrix.txt",
# comment = "!",
# skip=5,fill=TRUE)
# colnames(charlton2014_cpgs) <- charlton2014_cpgs[5,]
# charlton2014_cpgs <- charlton2014_cpgs[-c(1:5),]
# rownames(charlton2014_cpgs) <- charlton2014_cpgs[,1]
# charlton2014_cpgs <- charlton2014_cpgs[,-1]
#
# charlton2014_samples <- charlton2014_samples[!is.na(charlton2014_samples$Age),]
# charlton2014_cpgs <- charlton2014_cpgs[,colnames(charlton2014_cpgs) %in% charlton2014_samples$ID]
#
# charlton2014_cpgs$cpg <- rownames(charlton2014_cpgs)
# charlton2014_cpgs <- charlton2014_cpgs[charlton2014_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(charlton2014_cpgs),]
#
# sample_table <- rbind(sample_table,charlton2014_samples)
# cpg_table <- merge(cpg_table,charlton2014_cpgs, all=TRUE,by.y="cpg")
#
#
# jenkins2022_unformatted_table <- read.table("Jenkins2022/GSE185920_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# jenkins2022_unformatted_samples <- jenkins2022_unformatted_table[5,-1]
# jenkins2022_formatted_samples <- str_sub(jenkins2022_unformatted_samples,1,62)[1:1471]
#
# #Formatting ages.
# jenkins2022_unformatted_ages <- jenkins2022_unformatted_table[3,-1]
# jenkins2022_formatted_ages <- as.numeric(str_sub(jenkins2022_unformatted_ages, 6,100))
#
# #Formatting sex.
# jenkins2022_formatted_sex <- rep("Male",1471)
#
# #Formatting condition and miscellaneous information.
# jenkins2022_formatted_condition <- str_sub(jenkins2022_unformatted_table[2,-1],21,40)
# jenkins2022_formatted_condition[jenkins2022_formatted_condition=="Placebo Arm"] <- "Control"
#
# jenkins2022_formatted_location <- str_sub(jenkins2022_unformatted_table[1,-1],16,100)
#
# jenkins2022_formatted_donor <- paste(rep("BF",1471),as.numeric(factor(jenkins2022_formatted_location)),sep="")
#
# jenkins2022_mapping_table <- data.table(Sample=jenkins2022_formatted_samples,
# Location=jenkins2022_formatted_location)
#
# jenkins2022_samples <- data.frame(ID=jenkins2022_formatted_samples,
# Author=rep("Jenkins",1471),
# Year=rep(2022,1471),
# Tissue=rep("Semen",1471),
# CellType=rep("Sperm",1471),
# Age=jenkins2022_formatted_ages,
# Condition=jenkins2022_formatted_condition,
# Sex=jenkins2022_formatted_sex,
# DonorID=jenkins2022_formatted_donor,
# Misc=rep(NA,1471))
#
# jenkins2022_cpgs <- data.table::fread("Jenkins2022/GSE185920_processed.csv",
# header=FALSE, fill=FALSE)
# jenkins2022_cpglist <- jenkins2022_cpgs$V1
# jenkins2022_cpglist <- jenkins2022_cpglist[-1]
# colnames(jenkins2022_cpgs) <- as.character(jenkins2022_cpgs[1,])
# jenkins2022_cpgs <- jenkins2022_cpgs[-1,-1]
#
# jenkins2022_present <- rep(c(rep(TRUE, 2- 1), FALSE),1471)
# jenkins2022_cpgs <- jenkins2022_cpgs[,..jenkins2022_present]
#
# jenkins2022_nonduplicated_samples <- !duplicated(colnames(jenkins2022_cpgs))
# jenkins2022_cpgs <- jenkins2022_cpgs[,..jenkins2022_nonduplicated_samples]
# jenkins2022_formatted_location <- jenkins2022_formatted_location[!duplicated(jenkins2022_formatted_location)]
# jenkins2022_present_samples <- colnames(jenkins2022_cpgs) %in% jenkins2022_formatted_location
# jenkins2022_cpgs <- jenkins2022_cpgs[,..jenkins2022_present_samples]
# jenkins2022_formatted_location <- jenkins2022_formatted_location[jenkins2022_formatted_location%in%colnames(jenkins2022_cpgs)]
# jenkins2022_cpgs <- setcolorder(jenkins2022_cpgs,jenkins2022_formatted_location)
#
# jenkins2022_matching <- match(colnames(jenkins2022_cpgs),jenkins2022_mapping_table$Location)
# colnames(jenkins2022_cpgs) <- jenkins2022_mapping_table$Sample[jenkins2022_matching]
#
# jenkins2022_cpgs$cpg <- jenkins2022_cpglist
# jenkins2022_cpgs <- jenkins2022_cpgs[jenkins2022_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(jenkins2022_cpgs),]
# jenkins2022_samples <- jenkins2022_samples[jenkins2022_samples$ID %in% colnames(jenkins2022_cpgs),]
# jenkins2022_samples$Key <- "Jenkins2022"
#
# sample_table <- rbind(sample_table,jenkins2022_samples)
# cpg_table <- merge(cpg_table,jenkins2022_cpgs, all=TRUE,by="cpg")
#
# data.table::fwrite(jenkins2022_cpgs,"ClockConstruction/jenkins_cpg_table.csv",
# row.names = TRUE)
# data.table::fwrite(jenkins2022_samples,"ClockConstruction/jenkins_sample_table.csv",
# row.names = TRUE)
#
#
# ishak2020_unformatted_table <- read.table("Ishak2020/GSE149282_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# ishak2020_unformatted_samples <- ishak2020_unformatted_table[5,-1]
# ishak2020_formatted_samples <- str_sub(ishak2020_unformatted_samples,1,62)[1:24]
#
# #Formatting ages.
# ishak2020_unformatted_ages <- ishak2020_unformatted_table[3,-1]
# ishak2020_formatted_ages <- as.numeric(str_sub(ishak2020_unformatted_ages, 6,100))
#
# #Formatting sex.
# ishak2020_unformatted_sex <- ishak2020_unformatted_table[2,-1]
# ishak2020_unformatted_sex <- str_sub(ishak2020_unformatted_sex,9, 25)
# ishak2020_formatted_sex <- ishak2020_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# ishak2020_unformatted_condition <- str_sub(ishak2020_unformatted_table[1,-1],-30,-6)
# ishak2020_unformatted_condition[ishak2020_unformatted_condition=="Colon cancer tissue"] <-
# "Colon Cancer"
# ishak2020_unformatted_condition[ishak2020_unformatted_condition=="Colon adjacent normal"] <-
# "Control"
# ishak2020_formatted_condition <- ishak2020_unformatted_condition
#
# ishak2020_unformatted_donor <- as.numeric(factor(str_sub(ishak2020_unformatted_table[1,-1],-4,-1)))
# ishak2020_formatted_donor <- paste(rep("BG",24),ishak2020_unformatted_donor,sep="")
#
# ishak2020_samples <- data.frame(ID=ishak2020_formatted_samples,
# Author=rep("Ishak",24),
# Year=rep(2020,24),
# Tissue=rep("Colon",24),
# CellType=rep("Colon",24),
# Age=ishak2020_formatted_ages,
# Condition=ishak2020_formatted_condition,
# Sex=ishak2020_formatted_sex,
# DonorID=ishak2020_formatted_donor,
# Misc=rep(NA,24))
#
# ishak2020_cpgs <- read.table("Ishak2020/GSE149282_series_matrix.txt",
# comment = "!",
# skip=5,fill=TRUE)
# colnames(ishak2020_cpgs) <- ishak2020_cpgs[5,]
# ishak2020_cpgs <- ishak2020_cpgs[-c(1:5),]
# rownames(ishak2020_cpgs) <- ishak2020_cpgs[,1]
# ishak2020_cpgs <- ishak2020_cpgs[,-1]
#
# ishak2020_cpgs$cpg <- rownames(ishak2020_cpgs)
# ishak2020_cpgs <- ishak2020_cpgs[ishak2020_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(ishak2020_cpgs),]
#
# ishak2020_samples$Key <- "Ishak2020"
# sample_table <- rbind(sample_table,ishak2020_samples)
# cpg_table <- merge(cpg_table,ishak2020_cpgs, all=TRUE,by="cpg")
#
#
#
# lewis2020_unformatted_table <- read.table("Lewis2020/GSE141254-GPL13534_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =7)
# lewis2020_unformatted_table_2 <- read.table("Lewis2020/GSE141254-GPL21145_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =7)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# lewis2020_unformatted_samples <- lewis2020_unformatted_table[7,-1]
# lewis2020_formatted_samples <- str_sub(lewis2020_unformatted_samples,1,62)[1:52]
#
# lewis2020_unformatted_samples_2 <- lewis2020_unformatted_table_2[7,-1]
# lewis2020_formatted_samples_2 <- str_sub(lewis2020_unformatted_samples_2,1,62)[1:32]
#
# #Formatting ages.
# lewis2020_unformatted_ages <- lewis2020_unformatted_table[4,-1]
# lewis2020_formatted_ages <- as.numeric(str_sub(lewis2020_unformatted_ages, 6,100))
# lewis2020_unformatted_ages_2 <- lewis2020_unformatted_table_2[4,-1]
# lewis2020_formatted_ages_2 <- as.numeric(str_sub(lewis2020_unformatted_ages_2, 6,100))
#
# #Formatting sex.
# lewis2020_formatted_sex <- str_sub(lewis2020_unformatted_table[5,-1],6,62)
# lewis2020_formatted_sex[lewis2020_formatted_sex=="male"] <- "Male"
# lewis2020_formatted_sex[lewis2020_formatted_sex=="female"] <- "Female"
#
# lewis2020_formatted_sex_2 <- str_sub(lewis2020_unformatted_table_2[5,-1],6,62)
# lewis2020_formatted_sex_2[lewis2020_formatted_sex_2=="male"] <- "Male"
# lewis2020_formatted_sex_2[lewis2020_formatted_sex_2=="female"] <- "Female"
#
# #Formatting tissue.
# lewis2020_formatted_tissue <- str_sub(lewis2020_unformatted_table[2,-1],1,62)
# lewis2020_formatted_tissue[lewis2020_formatted_tissue=="duodenum"] <- "Duodenum"
# lewis2020_formatted_tissue[lewis2020_formatted_tissue=="jejunum"] <- "Jejunum"
# lewis2020_formatted_tissue[lewis2020_formatted_tissue=="colon"] <- "Colon"
# lewis2020_formatted_tissue[lewis2020_formatted_tissue=="small intestine"] <- "Small Intestine"
# lewis2020_formatted_tissue[lewis2020_formatted_tissue=="ileum"] <- "Ileum"
#
# lewis2020_formatted_tissue_2 <- str_sub(lewis2020_unformatted_table_2[2,-1],1,62)
# lewis2020_formatted_tissue_2[lewis2020_formatted_tissue_2=="duodenum"] <- "Duodenum"
# lewis2020_formatted_tissue_2[lewis2020_formatted_tissue_2=="jejunum"] <- "Jejunum"
# lewis2020_formatted_tissue_2[lewis2020_formatted_tissue_2=="colon"] <- "Colon"
# lewis2020_formatted_tissue_2[lewis2020_formatted_tissue_2=="small intestine"] <- "Small Intestine"
# lewis2020_formatted_tissue_2[lewis2020_formatted_tissue_2=="ileum"] <- "Ileum"
#
# #Formatting cell type.
# lewis2020_formatted_celltype <- str_sub(lewis2020_unformatted_table[3,-1],12,62)
# lewis2020_formatted_celltype[lewis2020_formatted_celltype=="crypt"] <- "Crypt"
# lewis2020_formatted_celltype[lewis2020_formatted_celltype=="spheroid"] <- "Spheroid"
# lewis2020_formatted_celltype[lewis2020_formatted_celltype=="mucosa"] <- "Mucosa"
#
# lewis2020_formatted_celltype_2 <- str_sub(lewis2020_unformatted_table_2[3,-1],12,62)
# lewis2020_formatted_celltype_2[lewis2020_formatted_celltype_2=="crypt"] <- "Crypt"
# lewis2020_formatted_celltype_2[lewis2020_formatted_celltype_2=="spheroid"] <- "Spheroid"
# lewis2020_formatted_celltype_2[lewis2020_formatted_celltype_2=="mucosa"] <- "Mucosa"
# lewis2020_formatted_celltype_2[lewis2020_formatted_celltype_2=="diff. spheroid"] <- "Spheroid"
#
# #Formatting condition
# lewis2020_formatted_condition <- lewis2020_formatted_celltype
# lewis2020_formatted_condition[lewis2020_formatted_condition!="Spheroid"] <- "Control"
#
# lewis2020_formatted_condition_2 <- lewis2020_formatted_celltype_2
# lewis2020_formatted_condition_2[lewis2020_formatted_condition_2!="Spheroid"] <- "Control"
#
# #Formatting condition and miscellaneous information.
#
# lewis2020_formatted_location <- str_sub(lewis2020_unformatted_table[1,-1],1,100)
# lewis2020_formatted_location_2 <- str_sub(lewis2020_unformatted_table_2[1,-1],1,100)
#
# lewis2020_formatted_donor <- paste(rep("BH",84),c(1:84),sep="")
#
# #Combining the two studies.
# lewis2020_formatted_samples <- c(lewis2020_formatted_samples,lewis2020_formatted_samples_2)
# lewis2020_formatted_ages <- c(lewis2020_formatted_ages,lewis2020_formatted_ages_2)
# lewis2020_formatted_sex <- c(lewis2020_formatted_sex,lewis2020_formatted_sex_2)
# lewis2020_formatted_tissue <- c(lewis2020_formatted_tissue,lewis2020_formatted_tissue_2)
# lewis2020_formatted_celltype <- c(lewis2020_formatted_celltype,lewis2020_formatted_celltype_2)
# lewis2020_formatted_condition <- c(lewis2020_formatted_condition,lewis2020_formatted_condition_2)
# lewis2020_formatted_location <- c(lewis2020_formatted_location,lewis2020_formatted_location_2)
#
# lewis2020_mapping_table <- data.table(Sample=lewis2020_formatted_samples,
# Location=lewis2020_formatted_location)
#
# lewis2020_samples <- data.frame(ID=lewis2020_formatted_samples,
# Author=rep("Lewis",84),
# Year=rep(2020,84),
# Tissue=lewis2020_formatted_tissue,
# CellType=lewis2020_formatted_celltype,
# Age=lewis2020_formatted_ages,
# Condition=lewis2020_formatted_condition,
# Sex=lewis2020_formatted_sex,
# DonorID=lewis2020_formatted_donor,
# Misc=rep(NA,84))
#
# lewis2020_cpgs <- data.table::fread("Lewis2020/GSE141254_normalized_data.txt",
# header=FALSE, fill=FALSE)
# lewis2020_cpglist <- lewis2020_cpgs$V1
# lewis2020_cpglist <- lewis2020_cpglist[-1]
# colnames(lewis2020_cpgs) <- as.character(lewis2020_cpgs[1,])
# lewis2020_cpgs <- lewis2020_cpgs[-1,-1]
#
# colnames(lewis2020_cpgs) <- lewis2020_formatted_samples
#
# lewis2020_cpgs$cpg <- lewis2020_cpglist
# lewis2020_cpgs <- lewis2020_cpgs[lewis2020_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(lewis2020_cpgs),]
# lewis2020_samples <- lewis2020_samples[lewis2020_samples$ID %in% colnames(lewis2020_cpgs),]
# lewis2020_samples$Key <- "Lewis2020"
#
# sample_table <- rbind(sample_table,lewis2020_samples)
# cpg_table <- merge(cpg_table,lewis2020_cpgs, all=TRUE,by="cpg")
#
#
#
# huynh2014_unformatted_table <- read.table("Huynh2014/GSE40360_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# huynh2014_unformatted_samples <- huynh2014_unformatted_table[6,-1]
# huynh2014_formatted_samples <- str_sub(huynh2014_unformatted_samples,1,62)[1:47]
#
# #Formatting ages.
# huynh2014_unformatted_ages <- huynh2014_unformatted_table[1,-1]
# huynh2014_formatted_ages <- as.numeric(str_sub(huynh2014_unformatted_ages, 6,7))
#
# #Formatting sex.
# huynh2014_unformatted_sex <- huynh2014_unformatted_table[4,-1]
# huynh2014_unformatted_sex <- str_sub(huynh2014_unformatted_sex,9, 25)
# huynh2014_formatted_sex <- huynh2014_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# huynh2014_unformatted_condition <- str_sub(huynh2014_unformatted_table[2,-1],17,38)
# huynh2014_unformatted_condition[huynh2014_unformatted_condition=="Multiple sclerosis"] <-
# "Multiple Sclerosis"
# huynh2014_formatted_condition <- huynh2014_unformatted_condition
#
# huynh2014_formatted_donor <- paste(rep("BI",47),c(1:47),sep="")
#
# huynh2014_samples <- data.frame(ID=huynh2014_formatted_samples,
# Author=rep("Huynh",47),
# Year=rep(2014,47),
# Tissue=rep("Brain",47),
# CellType=rep("Brain",47),
# Age=huynh2014_formatted_ages,
# Condition=huynh2014_formatted_condition,
# Sex=huynh2014_formatted_sex,
# DonorID=huynh2014_formatted_donor,
# Misc=rep(NA,47))
#
# huynh2014_samples <- huynh2014_samples[!is.na(huynh2014_samples$Age),]
#
# huynh2014_cpgs <- read.table("Huynh2014/GSE40360_series_matrix.txt",
# comment = "!",
# skip=6,fill=TRUE)
# colnames(huynh2014_cpgs) <- huynh2014_cpgs[6,]
# huynh2014_cpgs <- huynh2014_cpgs[-c(1:6),]
# rownames(huynh2014_cpgs) <- huynh2014_cpgs[,1]
# huynh2014_cpgs <- huynh2014_cpgs[,-1]
#
# huynh2014_cpgs$cpg <- rownames(huynh2014_cpgs)
# huynh2014_cpgs <- huynh2014_cpgs[huynh2014_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(huynh2014_cpgs),]
#
# huynh2014_cpgs <- huynh2014_cpgs[,(colnames(huynh2014_cpgs) %in% huynh2014_samples$ID |
# colnames(huynh2014_cpgs) == "cpg")]
#
# huynh2014_samples$Key <- "Huynh2014"
# sample_table <- rbind(sample_table,huynh2014_samples)
# cpg_table <- merge(cpg_table,huynh2014_cpgs, all=TRUE,by="cpg")
#
#
#
# policicchio2020_unformatted_table <- read.table("Policicchio2020/GSE137223_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# policicchio2020_unformatted_samples <- policicchio2020_unformatted_table[5,-1]
# policicchio2020_formatted_samples <- str_sub(policicchio2020_unformatted_samples,1,62)[1:33]
#
# #Formatting ages.
# policicchio2020_unformatted_ages <- policicchio2020_unformatted_table[1,-1]
# policicchio2020_formatted_ages <- as.numeric(str_sub(policicchio2020_unformatted_ages, 6,7))
#
# #Formatting sex.
# policicchio2020_unformatted_sex <- policicchio2020_unformatted_table[3,-1]
# policicchio2020_unformatted_sex <- str_sub(policicchio2020_unformatted_sex,9, 25)
# policicchio2020_unformatted_sex[policicchio2020_unformatted_sex=="M"] <- "Male"
# policicchio2020_unformatted_sex[policicchio2020_unformatted_sex=="F"] <- "Female"
# policicchio2020_formatted_sex <- policicchio2020_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# policicchio2020_unformatted_condition <- str_sub(policicchio2020_unformatted_table[2,-1],12,38)
# policicchio2020_unformatted_condition[policicchio2020_unformatted_condition=="1"] <-
# "Suicide"
# policicchio2020_unformatted_condition[policicchio2020_unformatted_condition=="0"] <-
# "Control"
# policicchio2020_formatted_condition <- policicchio2020_unformatted_condition
#
# policicchio2020_formatted_donor <- paste(rep("BJ",33),c(1:33),sep="")
#
# policicchio2020_samples <- data.frame(ID=policicchio2020_formatted_samples,
# Author=rep("Policicchio",33),
# Year=rep(2020,33),
# Tissue=rep("Brain",33),
# CellType=rep("Brain",33),
# Age=policicchio2020_formatted_ages,
# Condition=policicchio2020_formatted_condition,
# Sex=policicchio2020_formatted_sex,
# DonorID=policicchio2020_formatted_donor,
# Misc=rep(NA,33))
#
# policicchio2020_cpgs <- read.table("Policicchio2020/GSE137223_series_matrix.txt",
# comment = "!",
# skip=5,fill=TRUE)
# colnames(policicchio2020_cpgs) <- policicchio2020_cpgs[5,]
# policicchio2020_cpgs <- policicchio2020_cpgs[-c(1:5),]
# rownames(policicchio2020_cpgs) <- policicchio2020_cpgs[,1]
# policicchio2020_cpgs <- policicchio2020_cpgs[,-1]
#
# policicchio2020_cpgs$cpg <- rownames(policicchio2020_cpgs)
# policicchio2020_cpgs <- policicchio2020_cpgs[policicchio2020_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(policicchio2020_cpgs),]
#
# policicchio2020_samples$Key <- "Policicchio2020"
# sample_table <- rbind(sample_table,policicchio2020_samples)
# cpg_table <- merge(cpg_table,policicchio2020_cpgs, all=TRUE,by="cpg")
#
#
# viana2017_unformatted_table <- read.table("Viana2017/GSE89706_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# viana2017_unformatted_samples <- viana2017_unformatted_table[5,-1]
# viana2017_formatted_samples <- str_sub(viana2017_unformatted_samples,1,62)[1:49]
#
# #Formatting ages.
# viana2017_unformatted_ages <- viana2017_unformatted_table[2,-1]
# viana2017_formatted_ages <- as.numeric(str_sub(viana2017_unformatted_ages, 6,7))
#
# #Formatting sex.
# viana2017_unformatted_sex <- viana2017_unformatted_table[3,-1]
# viana2017_unformatted_sex <- str_sub(viana2017_unformatted_sex,9, 25)
# viana2017_unformatted_sex[viana2017_unformatted_sex=="M"] <- "Male"
# viana2017_unformatted_sex[viana2017_unformatted_sex=="F"] <- "Female"
# viana2017_formatted_sex <- viana2017_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# viana2017_unformatted_condition <- str_sub(viana2017_unformatted_table[1,-1],12,38)
# viana2017_unformatted_condition[viana2017_unformatted_condition=="control"] <-
# "Control"
# viana2017_unformatted_condition[viana2017_unformatted_condition=="schizophrenia"] <-
# "Schizophrenia"
# viana2017_formatted_condition <- viana2017_unformatted_condition
#
# viana2017_formatted_donor <- paste(rep("BK",49),c(1:49),sep="")
#
# viana2017_samples <- data.frame(ID=viana2017_formatted_samples,
# Author=rep("Viana",49),
# Year=rep(2017,49),
# Tissue=rep("Brain",49),
# CellType=rep("Brain",49),
# Age=viana2017_formatted_ages,
# Condition=viana2017_formatted_condition,
# Sex=viana2017_formatted_sex,
# DonorID=viana2017_formatted_donor,
# Misc=rep(NA,49))
#
# viana2017_cpgs <- read.table("Viana2017/GSE89706_series_matrix.txt",
# comment = "!",
# skip=5,fill=TRUE)
# colnames(viana2017_cpgs) <- viana2017_cpgs[5,]
# viana2017_cpgs <- viana2017_cpgs[-c(1:5),]
# rownames(viana2017_cpgs) <- viana2017_cpgs[,1]
# viana2017_cpgs <- viana2017_cpgs[,-1]
#
# viana2017_cpgs$cpg <- rownames(viana2017_cpgs)
# viana2017_cpgs <- viana2017_cpgs[viana2017_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(viana2017_cpgs),]
#
# viana2017_samples$Key <- "Viana2017"
# sample_table <- rbind(sample_table,viana2017_samples)
# cpg_table <- merge(cpg_table,viana2017_cpgs, all=TRUE,by="cpg")
#
# garagnani2015_unformatted_table <- read.table("Garagnani2015/GSE63347_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# garagnani2015_unformatted_samples <- garagnani2015_unformatted_table[5,-1]
# garagnani2015_formatted_samples <- str_sub(garagnani2015_unformatted_samples,1,62)[1:71]
#
# #Formatting ages.
# garagnani2015_unformatted_ages <- garagnani2015_unformatted_table[1,-1]
# garagnani2015_formatted_ages <- as.numeric(str_sub(garagnani2015_unformatted_ages, 6,7))
#
# #Formatting sex.
# garagnani2015_unformatted_sex <- garagnani2015_unformatted_table[2,-1]
# garagnani2015_unformatted_sex <- str_sub(garagnani2015_unformatted_sex,6, 25)
# garagnani2015_unformatted_sex[garagnani2015_unformatted_sex=="M"] <- "Male"
# garagnani2015_unformatted_sex[garagnani2015_unformatted_sex=="F"] <- "Female"
# garagnani2015_formatted_sex <- garagnani2015_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# garagnani2015_unformatted_condition <- str_sub(garagnani2015_unformatted_table[3,-1],22,38)
# garagnani2015_unformatted_condition[garagnani2015_unformatted_condition=="N"] <-
# "Control"
# garagnani2015_unformatted_condition[garagnani2015_unformatted_condition=="DS"] <-
# "Down Syndrome"
# garagnani2015_formatted_condition <- garagnani2015_unformatted_condition
#
# garagnani2015_formatted_donor <- paste(rep("BL",71),c(1:71),sep="")
#
# garagnani2015_samples <- data.frame(ID=garagnani2015_formatted_samples,
# Author=rep("Garagnani",71),
# Year=rep(2015,71),
# Tissue=rep("Brain",71),
# CellType=rep("Brain",71),
# Age=garagnani2015_formatted_ages,
# Condition=garagnani2015_formatted_condition,
# Sex=garagnani2015_formatted_sex,
# DonorID=garagnani2015_formatted_donor,
# Misc=rep(NA,71))
#
# garagnani2015_cpgs <- read.table("Garagnani2015/GSE63347_series_matrix.txt",
# comment = "!",
# skip=5,fill=TRUE)
# colnames(garagnani2015_cpgs) <- garagnani2015_cpgs[5,]
# garagnani2015_cpgs <- garagnani2015_cpgs[-c(1:5),]
# rownames(garagnani2015_cpgs) <- garagnani2015_cpgs[,1]
# garagnani2015_cpgs <- garagnani2015_cpgs[,-1]
#
# garagnani2015_cpgs$cpg <- rownames(garagnani2015_cpgs)
# garagnani2015_cpgs <- garagnani2015_cpgs[garagnani2015_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(garagnani2015_cpgs),]
#
# garagnani2015_samples$Key <- "Garagnani2015"
# sample_table <- rbind(sample_table,garagnani2015_samples)
# cpg_table <- merge(cpg_table,garagnani2015_cpgs, all=TRUE,by="cpg")
#
#
# wockner2014_unformatted_table <- read.table("Wockner2014/GSE61107_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# wockner2014_unformatted_samples <- wockner2014_unformatted_table[5,-1]
# wockner2014_formatted_samples <- str_sub(wockner2014_unformatted_samples,1,62)[1:48]
#
# #Formatting ages.
# wockner2014_unformatted_ages <- wockner2014_unformatted_table[2,-1]
# wockner2014_formatted_ages <- as.numeric(str_sub(wockner2014_unformatted_ages, 6,7))
#
# #Formatting sex.
# wockner2014_unformatted_sex <- wockner2014_unformatted_table[3,-1]
# wockner2014_unformatted_sex <- str_sub(wockner2014_unformatted_sex,9, 25)
# wockner2014_unformatted_sex[wockner2014_unformatted_sex=="M"] <- "Male"
# wockner2014_unformatted_sex[wockner2014_unformatted_sex=="F"] <- "Female"
# wockner2014_formatted_sex <- wockner2014_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# wockner2014_unformatted_condition <- str_sub(wockner2014_unformatted_table[1,-1],44,50)
# wockner2014_unformatted_condition[wockner2014_unformatted_condition=="1"] <-
# "Control"
# wockner2014_unformatted_condition[wockner2014_unformatted_condition=="2"] <-
# "Schizophrenia"
# wockner2014_formatted_condition <- wockner2014_unformatted_condition
#
# wockner2014_formatted_donor <- paste(rep("BM",48),c(1:48),sep="")
#
# wockner2014_samples <- data.frame(ID=wockner2014_formatted_samples,
# Author=rep("Wockner",48),
# Year=rep(2014,48),
# Tissue=rep("Brain",48),
# CellType=rep("Brain",48),
# Age=wockner2014_formatted_ages,
# Condition=wockner2014_formatted_condition,
# Sex=wockner2014_formatted_sex,
# DonorID=wockner2014_formatted_donor,
# Misc=rep(NA,48))
#
# wockner2014_cpgs <- read.table("Wockner2014/GSE61107_series_matrix.txt",
# comment = "!",
# skip=5,fill=TRUE)
# colnames(wockner2014_cpgs) <- wockner2014_cpgs[5,]
# wockner2014_cpgs <- wockner2014_cpgs[-c(1:5),]
# rownames(wockner2014_cpgs) <- wockner2014_cpgs[,1]
# wockner2014_cpgs <- wockner2014_cpgs[,-1]
#
# wockner2014_cpgs$cpg <- rownames(wockner2014_cpgs)
# wockner2014_cpgs <- wockner2014_cpgs[wockner2014_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(wockner2014_cpgs),]
#
# wockner2014_samples$Key <- "Wockner2014"
#
# wockner2014_samples <- wockner2014_samples[!is.na(wockner2014_samples$Age),]
# wockner2014_cpgs <- wockner2014_cpgs[,(colnames(wockner2014_cpgs) %in% wockner2014_samples$ID |
# colnames(wockner2014_cpgs) == "cpg")]
#
# sample_table <- rbind(sample_table,wockner2014_samples)
# cpg_table <- merge(cpg_table,wockner2014_cpgs, all=TRUE,by="cpg")
#
#
#
# xu2014_unformatted_table <- read.table("Xu2014/GSE49393_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# xu2014_unformatted_samples <- xu2014_unformatted_table[5,-1]
# xu2014_formatted_samples <- str_sub(xu2014_unformatted_samples,1,62)[1:48]
#
# #Formatting ages.
# xu2014_unformatted_ages <- xu2014_unformatted_table[3,-1]
# xu2014_formatted_ages <- as.numeric(str_sub(xu2014_unformatted_ages, 6,7))
#
# #Formatting sex.
# xu2014_unformatted_sex <- xu2014_unformatted_table[2,-1]
# xu2014_unformatted_sex <- str_sub(xu2014_unformatted_sex,6, 25)
# xu2014_unformatted_sex[xu2014_unformatted_sex=="male"] <- "Male"
# xu2014_unformatted_sex[xu2014_unformatted_sex=="female"] <- "Female"
# xu2014_formatted_sex <- xu2014_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# xu2014_unformatted_condition <- str_sub(xu2014_unformatted_table[1,-1],13,50)
# xu2014_unformatted_condition[xu2014_unformatted_condition=="Alcohol dependence"] <-
# "Alcoholism"
# xu2014_unformatted_condition[xu2014_unformatted_condition=="Alcohol Abuse"] <-
# "Alcoholism"
# xu2014_formatted_condition <- xu2014_unformatted_condition
#
# xu2014_formatted_donor <- paste(rep("BN",48),c(1:48),sep="")
#
# xu2014_samples <- data.frame(ID=xu2014_formatted_samples,
# Author=rep("Xu",48),
# Year=rep(2014,48),
# Tissue=rep("Brain",48),
# CellType=rep("Brain",48),
# Age=xu2014_formatted_ages,
# Condition=xu2014_formatted_condition,
# Sex=xu2014_formatted_sex,
# DonorID=xu2014_formatted_donor,
# Misc=rep(NA,48))
#
# xu2014_cpgs <- read.table("Xu2014/GSE49393_series_matrix.txt",
# comment = "!",
# skip=5,fill=TRUE)
# colnames(xu2014_cpgs) <- xu2014_cpgs[5,]
# xu2014_cpgs <- xu2014_cpgs[-c(1:5),]
# rownames(xu2014_cpgs) <- xu2014_cpgs[,1]
# xu2014_cpgs <- xu2014_cpgs[,-1]
#
# xu2014_cpgs$cpg <- rownames(xu2014_cpgs)
# xu2014_cpgs <- xu2014_cpgs[xu2014_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(xu2014_cpgs),]
#
# xu2014_samples$Key <- "Xu2014"
#
# sample_table <- rbind(sample_table,xu2014_samples)
# cpg_table <- merge(cpg_table,xu2014_cpgs, all=TRUE,by="cpg")
#
# lunnon2014_unformatted_table <- read.table("Lunnon2014/GSE59685_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =7)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# lunnon2014_unformatted_samples <- lunnon2014_unformatted_table[7,-1]
# lunnon2014_formatted_samples <- str_sub(lunnon2014_unformatted_samples,1,62)[1:531]
#
# #Formatting ages.
# lunnon2014_unformatted_ages <- lunnon2014_unformatted_table[5,-1]
# lunnon2014_formatted_ages <- as.numeric(str_sub(lunnon2014_unformatted_ages, 12,100))
#
# #Formatting sex.
# lunnon2014_unformatted_sex <- lunnon2014_unformatted_table[4,-1]
# lunnon2014_unformatted_sex <- str_sub(lunnon2014_unformatted_sex,6, 25)
# lunnon2014_unformatted_sex[lunnon2014_unformatted_sex=="MALE"] <- "Male"
# lunnon2014_unformatted_sex[lunnon2014_unformatted_sex=="FEMALE"] <- "Female"
# lunnon2014_formatted_sex <- lunnon2014_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# lunnon2014_formatted_condition <- str_sub(lunnon2014_unformatted_table[3,-1],20,40)
# lunnon2014_formatted_condition[lunnon2014_formatted_condition=="C"] <- "Control"
# lunnon2014_formatted_condition[lunnon2014_formatted_condition=="AD"] <- "Alzheimer's"
# lunnon2014_formatted_condition[lunnon2014_formatted_condition=="Exclude"] <- "Exclude"
#
# lunnon2014_formatted_location <- str_sub(lunnon2014_unformatted_table[1,-1],16,100)
#
# lunnon2014_formatted_donor <- paste(rep("BO",531),
# as.numeric(factor(str_sub(lunnon2014_unformatted_table[2,-1],1,100))),sep="")
#
# lunnon2014_samples <- data.frame(ID=lunnon2014_formatted_samples,
# Author=rep("Lunnon",531),
# Year=rep(2014,531),
# Tissue=rep("Brain",531),
# CellType=rep("Brain",531),
# Age=lunnon2014_formatted_ages,
# Condition=lunnon2014_formatted_condition,
# Sex=lunnon2014_formatted_sex,
# DonorID=lunnon2014_formatted_donor,
# Misc=rep(NA,531))
#
# lunnon2014_cpgs <- data.table::fread("Lunnon2014/GSE59685_betas.csv",
# header=FALSE, fill=FALSE)
# lunnon2014_cpglist <- lunnon2014_cpgs$V1
# lunnon2014_cpglist <- lunnon2014_cpglist[-c(1:3)]
# colnames(lunnon2014_cpgs) <- as.character(lunnon2014_cpgs[2,])
# lunnon2014_cpgs <- lunnon2014_cpgs[-c(1:3),-1]
#
# lunnon2014_cpgs <- setcolorder(lunnon2014_cpgs,lunnon2014_formatted_samples)
#
# lunnon2014_cpgs$cpg <- lunnon2014_cpglist
#
# lunnon2014_samples <- lunnon2014_samples[!is.na(lunnon2014_samples$Age),]
# lunnon2014_present <- lunnon2014_cpgs[,(colnames(lunnon2014_cpgs) %in% lunnon2014_samples$ID |
# colnames(lunnon2014_cpgs) == "cpg")]
# lunnon2014_cpgs <- lunnon2014_cpgs[,..lunnon2014_present]
#
# lunnon2014_cpgs <- lunnon2014_cpgs[lunnon2014_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(lunnon2014_cpgs),]
# lunnon2014_samples <- lunnon2014_samples[lunnon2014_samples$ID %in% colnames(lunnon2014_cpgs),]
# lunnon2014_samples$Key <- "Lunnon2014"
#
# sample_table <- rbind(sample_table,lunnon2014_samples)
# cpg_table <- merge(cpg_table,lunnon2014_cpgs, all=TRUE,by="cpg")
#
#
# jiang2015_unformatted_table <- read.table("Jiang2015/GSE52068_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# jiang2015_unformatted_samples <- jiang2015_unformatted_table[5,-1]
# jiang2015_formatted_samples <- str_sub(jiang2015_unformatted_samples,1,62)[1:48]
#
# #Formatting ages.
# jiang2015_unformatted_ages <- jiang2015_unformatted_table[2,-1]
# jiang2015_formatted_ages <- as.numeric(str_sub(jiang2015_unformatted_ages, 6,7))
#
# #Formatting sex.
# jiang2015_unformatted_sex <- jiang2015_unformatted_table[1,-1]
# jiang2015_unformatted_sex <- str_sub(jiang2015_unformatted_sex,9, 25)
# jiang2015_unformatted_sex[jiang2015_unformatted_sex=="male"] <- "Male"
# jiang2015_unformatted_sex[jiang2015_unformatted_sex=="female"] <- "Female"
# jiang2015_formatted_sex <- jiang2015_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# jiang2015_unformatted_condition <- str_sub(jiang2015_unformatted_table[3,-1],14,50)
# jiang2015_unformatted_condition[jiang2015_unformatted_condition=="nasopharyngeal carcinoma"] <-
# "Nasopharyngeal Carcinoma"
# jiang2015_unformatted_condition[jiang2015_unformatted_condition=="normal nasopharyngeal epithelial"] <-
# "Control"
# jiang2015_formatted_condition <- jiang2015_unformatted_condition
#
# jiang2015_formatted_donor <- paste(rep("BP",48),c(1:48),sep="")
#
# jiang2015_samples <- data.frame(ID=jiang2015_formatted_samples,
# Author=rep("Jiang",48),
# Year=rep(2015,48),
# Tissue=rep("Nasopharynx",48),
# CellType=rep("Epithelial",48),
# Age=jiang2015_formatted_ages,
# Condition=jiang2015_formatted_condition,
# Sex=jiang2015_formatted_sex,
# DonorID=jiang2015_formatted_donor,
# Misc=rep(NA,48))
#
# jiang2015_cpgs <- read.table("Jiang2015/GSE52068_series_matrix.txt",
# comment = "!",
# skip=5,fill=TRUE)
# colnames(jiang2015_cpgs) <- jiang2015_cpgs[5,]
# jiang2015_cpgs <- jiang2015_cpgs[-c(1:5),]
# rownames(jiang2015_cpgs) <- jiang2015_cpgs[,1]
# jiang2015_cpgs <- jiang2015_cpgs[,-1]
#
# jiang2015_cpgs$cpg <- rownames(jiang2015_cpgs)
# jiang2015_cpgs <- jiang2015_cpgs[jiang2015_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(jiang2015_cpgs),]
#
# jiang2015_samples$Key <- "Jiang2015"
#
# sample_table <- rbind(sample_table,jiang2015_samples)
# cpg_table <- merge(cpg_table,jiang2015_cpgs, all=TRUE,by="cpg")
#
#
#
# renauer2015_unformatted_table <- read.table("Renauer2015/GSE61195_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# renauer2015_unformatted_samples <- renauer2015_unformatted_table[5,-1]
# renauer2015_formatted_samples <- str_sub(renauer2015_unformatted_samples,1,62)[1:21]
#
# #Formatting ages.
# renauer2015_unformatted_ages <- renauer2015_unformatted_table[3,-1]
# renauer2015_formatted_ages <- as.numeric(str_sub(renauer2015_unformatted_ages, 6,7))
#
# #Formatting sex.
# renauer2015_unformatted_sex <- renauer2015_unformatted_table[2,-1]
# renauer2015_unformatted_sex <- str_sub(renauer2015_unformatted_sex,9, 25)
# renauer2015_unformatted_sex[renauer2015_unformatted_sex=="male"] <- "Male"
# renauer2015_unformatted_sex[renauer2015_unformatted_sex=="female"] <- "Female"
# renauer2015_formatted_sex <- renauer2015_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# renauer2015_formatted_celltype <- str_sub(renauer2015_unformatted_table[1,-1],1,50)
# renauer2015_formatted_condition <- rep("Control",21)
#
# renauer2015_formatted_donor <- paste(rep("BQ",21),c(1:21),sep="")
#
# renauer2015_samples <- data.frame(ID=renauer2015_formatted_samples,
# Author=rep("Renauer",21),
# Year=rep(2015,21),
# Tissue=rep("PBMC",21),
# CellType=renauer2015_formatted_celltype,
# Age=renauer2015_formatted_ages,
# Condition=renauer2015_formatted_condition,
# Sex=renauer2015_formatted_sex,
# DonorID=renauer2015_formatted_donor,
# Misc=rep(NA,21))
#
# renauer2015_cpgs <- read.table("Renauer2015/GSE61195_series_matrix.txt",
# comment = "!",
# skip=5,fill=TRUE)
# colnames(renauer2015_cpgs) <- renauer2015_cpgs[5,]
# renauer2015_cpgs <- renauer2015_cpgs[-c(1:5),]
# rownames(renauer2015_cpgs) <- renauer2015_cpgs[,1]
# renauer2015_cpgs <- renauer2015_cpgs[,-1]
#
# renauer2015_cpgs$cpg <- rownames(renauer2015_cpgs)
# renauer2015_cpgs <- renauer2015_cpgs[renauer2015_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(renauer2015_cpgs),]
#
# renauer2015_samples$Key <- "Renauer2015"
#
# sample_table <- rbind(sample_table,renauer2015_samples)
# cpg_table <- merge(cpg_table,renauer2015_cpgs, all=TRUE,by="cpg")
#
#
#
# clement2022_unformatted_table <- read.table("Clement2022/GSE210245_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# clement2022_unformatted_samples <- clement2022_unformatted_table[5,-1]
# clement2022_formatted_samples <- str_sub(clement2022_unformatted_samples,1,62)[1:36]
#
# #Formatting ages.
# clement2022_unformatted_ages <- clement2022_unformatted_table[1,-1]
# clement2022_formatted_ages <- as.numeric(str_sub(clement2022_unformatted_ages, 5,9))
#
# #Formatting sex.
# clement2022_unformatted_sex <- clement2022_unformatted_table[2,-1]
# clement2022_unformatted_sex <- str_sub(clement2022_unformatted_sex,6, 25)
# clement2022_unformatted_sex[clement2022_unformatted_sex=="male"] <- "Male"
# clement2022_unformatted_sex[clement2022_unformatted_sex=="female"] <- "Female"
# clement2022_formatted_sex <- clement2022_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# clement2022_unformatted_condition <- clement2022_unformatted_table[3,-1]
# clement2022_formatted_condition <- str_sub(clement2022_unformatted_condition,7, 25)
# clement2022_formatted_condition[clement2022_formatted_condition=="BaselineDraw"] <- "Control"
# clement2022_formatted_condition[clement2022_formatted_condition=="SecondDraw"] <- "Plasma Transfer"
#
# clement2022_formatted_donor <- paste(rep("BR",36),rep(1:18,each=2),sep="")
#
# clement2022_samples <- data.frame(ID=clement2022_formatted_samples,
# Author=rep("Clement",36),
# Year=rep(2022,36),
# Tissue=rep("Blood",36),
# CellType=rep("PBMC",36),
# Age=clement2022_formatted_ages,
# Condition=clement2022_formatted_condition,
# Sex=clement2022_formatted_sex,
# DonorID=clement2022_formatted_donor,
# Key=rep("Clement2022",36),
# Chip=rep("EPIC",36),
# Misc=rep(NA,36))
#
# clement2022_cpgs <- read.table("Clement2022/GSE210245_series_matrix.txt",
# comment = "!",
# skip=5,fill=TRUE)
# colnames(clement2022_cpgs) <- clement2022_cpgs[5,]
# clement2022_cpgs <- clement2022_cpgs[-c(1:5),]
# rownames(clement2022_cpgs) <- clement2022_cpgs[,1]
# clement2022_cpgs <- clement2022_cpgs[,-1]
#
# clement2022_cpgs$cpg <- rownames(clement2022_cpgs)
# clement2022_cpgs <- clement2022_cpgs[clement2022_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(clement2022_cpgs),]
#
# sample_table <- rbind(sample_table,clement2022_samples)
# cpg_table <- merge(cpg_table,clement2022_cpgs, all=TRUE,by="cpg")
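#
# Illustrative sketch (toy data, hypothetical names): each study block first filters
# its beta values to the CpGs already present in cpg_table and then performs a full
# outer merge on the "cpg" column. Under that pattern, CpGs absent from the new study
# are kept with NA values, and CpGs found only in the new study are dropped.
toy_cpg_table <- data.frame(cpg = c("cg01", "cg02", "cg03"), A1 = c(0.1, 0.2, 0.3))
toy_study <- data.frame(cpg = c("cg02", "cg03", "cg99"), B1 = c(0.5, 0.6, 0.7))
# Keep only CpGs that the running table already contains.
toy_study <- toy_study[toy_study$cpg %in% toy_cpg_table$cpg, ]
# all=TRUE keeps cg01 even though the toy study never measured it.
toy_merged <- merge(toy_cpg_table, toy_study, all = TRUE, by = "cpg")
toy_merged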
#
# cullell2022_unformatted_table <- read.table("Cullell2022/GSE203399-GPL13534_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =7)
# cullell2022_unformatted_table_2 <- read.table("Cullell2022/GSE203399-GPL29753_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =7)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# cullell2022_unformatted_samples <- cullell2022_unformatted_table[5,-1]
# cullell2022_formatted_samples <- str_sub(cullell2022_unformatted_samples,1,62)[1:59]
#
# cullell2022_unformatted_samples_2 <- cullell2022_unformatted_table_2[5,-1]
# cullell2022_formatted_samples_2 <- str_sub(cullell2022_unformatted_samples_2,1,62)[1:62]
#
# #Formatting ages.
# cullell2022_unformatted_ages <- cullell2022_unformatted_table[2,-1]
# cullell2022_formatted_ages <- as.numeric(str_sub(cullell2022_unformatted_ages, 6,100))
# cullell2022_unformatted_ages_2 <- cullell2022_unformatted_table_2[2,-1]
# cullell2022_formatted_ages_2 <- as.numeric(str_sub(cullell2022_unformatted_ages_2, 6,100))
#
# #Formatting sex.
# cullell2022_formatted_sex <- str_sub(cullell2022_unformatted_table[3,-1],9,62)
# cullell2022_formatted_sex[cullell2022_formatted_sex=="M"] <- "Male"
# cullell2022_formatted_sex[cullell2022_formatted_sex=="F"] <- "Female"
#
# cullell2022_formatted_sex_2 <- str_sub(cullell2022_unformatted_table_2[3,-1],9,62)
# cullell2022_formatted_sex_2[cullell2022_formatted_sex_2=="M"] <- "Male"
# cullell2022_formatted_sex_2[cullell2022_formatted_sex_2=="F"] <- "Female"
#
# #Formatting tissue and cell type.
# cullell2022_formatted_tissue <- rep("Blood",59)
#
# cullell2022_formatted_tissue_2 <- rep("Blood",62)
#
# #Formatting cell type.
# cullell2022_formatted_celltype <- rep("Blood",59)
#
# cullell2022_formatted_celltype_2 <- rep("Blood",62)
#
#
# #Formatting condition
# cullell2022_formatted_condition <- rep("Stroke",59)
#
# cullell2022_formatted_condition_2 <- rep("Stroke",62)
# #Formatting condition and miscellaneous information.
#
# cullell2022_formatted_donor <- paste(rep("BS",121),c(1:121),sep="")
#
# #Combining the two studies.
# cullell2022_formatted_samples <- c(cullell2022_formatted_samples,cullell2022_formatted_samples_2)
# cullell2022_formatted_ages <- c(cullell2022_formatted_ages,cullell2022_formatted_ages_2)
# cullell2022_formatted_sex <- c(cullell2022_formatted_sex,cullell2022_formatted_sex_2)
# cullell2022_formatted_tissue <- c(cullell2022_formatted_tissue,cullell2022_formatted_tissue_2)
# cullell2022_formatted_celltype <- c(cullell2022_formatted_celltype,cullell2022_formatted_celltype_2)
# cullell2022_formatted_condition <- c(cullell2022_formatted_condition,cullell2022_formatted_condition_2)
#
# cullell2022_samples <- data.frame(ID=cullell2022_formatted_samples,
# Author=rep("Cullell",121),
# Year=rep(2022,121),
# Tissue=cullell2022_formatted_tissue,
# CellType=cullell2022_formatted_celltype,
# Age=cullell2022_formatted_ages,
# Condition=cullell2022_formatted_condition,
# Sex=cullell2022_formatted_sex,
# DonorID=cullell2022_formatted_donor,
# Key=rep("Cullell2022",121),
# Chip=c(rep("450K",59),rep("EPIC",62)),
# Misc=rep(NA,121))
#
# cullell2022_cpgs_1 <- data.frame(data.table::fread("Cullell2022/GSE203399_betas_normalized_pval_detection_discovery.txt",
# header=FALSE, fill=FALSE))
# cullell2022_cpgs_2 <- data.frame(data.table::fread("Cullell2022/GSE203399_betas_normalized_pval_detection_replication.txt",
# header=FALSE, fill=FALSE))
# cullell2022_cpglist_1 <- cullell2022_cpgs_1$V1
# cullell2022_cpglist_2 <- cullell2022_cpgs_2$V1
# cullell2022_cpgs_1 <- cullell2022_cpgs_1[,c(FALSE,rep(c(TRUE,FALSE),59))]
# cullell2022_cpgs_2 <- cullell2022_cpgs_2[,c(FALSE,rep(c(TRUE,FALSE),62))]
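#
# Illustrative sketch (toy data, hypothetical column names): the logical masks above
# appear to assume that, after the leading probe-ID column, the processed file
# alternates one beta-value column and one detection p-value column per sample.
# Under that assumption, the mask keeps only the beta-value columns.
toy <- data.frame(V1 = c("cg01", "cg02"),
                  s1_beta = c(0.10, 0.90), s1_pval = c(0.001, 0.002),
                  s2_beta = c(0.20, 0.80), s2_pval = c(0.003, 0.004))
n_samples <- 2
# FALSE drops the ID column; each (TRUE, FALSE) pair keeps a beta and drops its p-value.
keep_cols <- c(FALSE, rep(c(TRUE, FALSE), n_samples))
toy_betas <- toy[, keep_cols]
colnames(toy_betas)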
#
# cullell2022_cpglist <- cullell2022_cpglist_1
#
# cullell2022_cpgs <- cbind(cullell2022_cpgs_1,cullell2022_cpgs_2)
# cullell2022_cpgs$cpg <- cullell2022_cpglist
# cullell2022_cpgs <- cullell2022_cpgs[-1,]
# colnames(cullell2022_cpgs) <- c(cullell2022_samples$ID,"cpg")
#
# cullell2022_cpgs <- cullell2022_cpgs[cullell2022_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(cullell2022_cpgs),]
#
# sample_table <- rbind(sample_table,cullell2022_samples)
# cpg_table <- merge(cpg_table,cullell2022_cpgs, all=TRUE,by="cpg")
#
#
#
# bartlett2022_unformatted_table <- read.table("Bartlett2022/GSE201724_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# bartlett2022_unformatted_samples <- bartlett2022_unformatted_table[5,-1]
# bartlett2022_formatted_samples <- str_sub(bartlett2022_unformatted_samples,1,62)[1:18]
#
# #Formatting ages.
# bartlett2022_unformatted_ages <- bartlett2022_unformatted_table[3,-1]
# bartlett2022_formatted_ages <- as.numeric(str_sub(bartlett2022_unformatted_ages, 27,50))
#
# #Formatting sex.
# bartlett2022_formatted_sex <- rep("Female",18)
#
# #Formatting condition and miscellaneous information.
# bartlett2022_unformatted_condition <- bartlett2022_unformatted_table[1,-1]
# bartlett2022_formatted_condition <- str_sub(bartlett2022_unformatted_condition,16, 25)
# bartlett2022_formatted_condition[bartlett2022_formatted_condition=="0"] <- "Control"
# bartlett2022_formatted_condition[bartlett2022_formatted_condition=="3"] <- "UA"
#
# bartlett2022_formatted_donor <- paste(rep("BT",18),rep(1:9,each=2),sep="")
#
# bartlett2022_samples <- data.frame(ID=bartlett2022_formatted_samples,
# Author=rep("Bartlett",18),
# Year=rep(2022,18),
# Tissue=rep("Breast",18),
# CellType=rep("Breast",18),
# Age=bartlett2022_formatted_ages,
# Condition=bartlett2022_formatted_condition,
# Sex=bartlett2022_formatted_sex,
# DonorID=bartlett2022_formatted_donor,
# Key=rep("Bartlett2022",18),
# Chip=rep("EPIC",18),
# Misc=rep(NA,18))
#
# bartlett2022_cpgs <- read.table("Bartlett2022/GSE201724_series_matrix.txt",
# comment = "!",
# skip=5,fill=TRUE)
# colnames(bartlett2022_cpgs) <- bartlett2022_cpgs[5,]
# bartlett2022_cpgs <- bartlett2022_cpgs[-c(1:5),]
# rownames(bartlett2022_cpgs) <- bartlett2022_cpgs[,1]
# bartlett2022_cpgs <- bartlett2022_cpgs[,-1]
#
# bartlett2022_cpgs$cpg <- rownames(bartlett2022_cpgs)
# bartlett2022_cpgs <- bartlett2022_cpgs[bartlett2022_cpgs$cpg %in% cpg_table$cpg,]
# # cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(bartlett2022_cpgs),]
#
# sample_table <- rbind(sample_table,bartlett2022_samples)
# cpg_table <- merge(cpg_table,bartlett2022_cpgs, all=TRUE,by="cpg")
#
#
# brennan2022_unformatted_table <- read.table("Brennan2022/GSE191277-GPL23976_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# brennan2022_unformatted_samples <- brennan2022_unformatted_table[6,-1]
# brennan2022_formatted_samples <- str_sub(brennan2022_unformatted_samples,1,62)[1:7]
#
# #Formatting ages.
# brennan2022_unformatted_ages <- brennan2022_unformatted_table[3,-1]
# brennan2022_formatted_ages <- as.numeric(str_sub(brennan2022_unformatted_ages, 13,50))
#
# #Formatting sex.
# brennan2022_unformatted_sex <- brennan2022_unformatted_table[4,-1]
# brennan2022_formatted_sex <- str_sub(brennan2022_unformatted_sex,6,10)
# brennan2022_formatted_sex[brennan2022_formatted_sex=="M"] <- "Male"
# brennan2022_formatted_sex[brennan2022_formatted_sex=="F"] <- "Female"
#
#
# #Formatting condition and miscellaneous information.
# brennan2022_unformatted_condition <- brennan2022_unformatted_table[2,-1]
# brennan2022_formatted_condition <- str_sub(brennan2022_unformatted_condition,19, 21)
# brennan2022_formatted_condition[brennan2022_formatted_condition=="Age"] <- "Control"
# brennan2022_formatted_condition[brennan2022_formatted_condition=="Pro"] <- "Sotos"
# brennan2022_formatted_condition[brennan2022_formatted_condition=="Poo"] <- NA
#
# brennan2022_formatted_donor <- paste(rep("BU",7),rep(1:7,each=1),sep="")
#
# brennan2022_samples <- data.frame(ID=brennan2022_formatted_samples,
# Author=rep("Brennan",7),
# Year=rep(2022,7),
# Tissue=rep("Blood",7),
# CellType=rep("Blood",7),
# Age=brennan2022_formatted_ages,
# Condition=brennan2022_formatted_condition,
# Sex=brennan2022_formatted_sex,
# DonorID=brennan2022_formatted_donor,
# Key=rep("Brennan2022",7),
# Chip=rep("EPIC",7),
# Misc=rep(NA,7))
#
# brennan2022_cpgs <- read.table("Brennan2022/GSE191277-GPL23976_series_matrix.txt",
# comment = "!",
# skip=6,fill=TRUE)
# colnames(brennan2022_cpgs) <- brennan2022_cpgs[6,]
# brennan2022_cpgs <- brennan2022_cpgs[-c(1:6),]
# rownames(brennan2022_cpgs) <- brennan2022_cpgs[,1]
# brennan2022_cpgs <- brennan2022_cpgs[,-1]
#
# brennan2022_cpgs$cpg <- rownames(brennan2022_cpgs)
# brennan2022_samples <- brennan2022_samples[-1,]
# brennan2022_cpgs <- brennan2022_cpgs[,-1]
#
#
# brennan2022_cpgs <- brennan2022_cpgs[brennan2022_cpgs$cpg %in% cpg_table$cpg,]
# cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(brennan2022_cpgs),]
#
# takeuchi2022_unformatted_table <- read.table("Takeuchi2022/GSE178925_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# takeuchi2022_unformatted_samples <- takeuchi2022_unformatted_table[5,-1]
# takeuchi2022_formatted_samples <- str_sub(takeuchi2022_unformatted_samples,1,62)[1:24]
#
# #Formatting ages.
# takeuchi2022_unformatted_ages <- takeuchi2022_unformatted_table[2,-1]
# takeuchi2022_formatted_ages <- as.numeric(str_sub(takeuchi2022_unformatted_ages, 6,50))
#
# #Formatting sex.
# takeuchi2022_unformatted_sex <- takeuchi2022_unformatted_table[1,-1]
# takeuchi2022_formatted_sex <- str_sub(takeuchi2022_unformatted_sex,9,15)
# takeuchi2022_formatted_sex[takeuchi2022_formatted_sex=="M"] <- "Male"
# takeuchi2022_formatted_sex[takeuchi2022_formatted_sex=="F"] <- "Female"
#
#
# #Formatting condition and miscellaneous information.
# takeuchi2022_unformatted_condition <- takeuchi2022_unformatted_table[3,-1]
# takeuchi2022_formatted_condition <- str_sub(takeuchi2022_unformatted_condition,16, 50)
# takeuchi2022_formatted_condition[takeuchi2022_formatted_condition=="normal"] <- "Control"
# takeuchi2022_formatted_condition[takeuchi2022_formatted_condition=="autoiimmune gastritis"] <- "Gastritis"
# takeuchi2022_formatted_condition[takeuchi2022_formatted_condition=="H. pyloriassocatited gastritis"] <- "Gastritis"
#
# takeuchi2022_formatted_donor <- paste(rep("BV",24),rep(1:24,each=1),sep="")
#
# takeuchi2022_samples <- data.frame(ID=takeuchi2022_formatted_samples,
# Author=rep("Takeuchi",24),
# Year=rep(2022,24),
# Tissue=rep("Blood",24),
# CellType=rep("Blood",24),
# Age=takeuchi2022_formatted_ages,
# Condition=takeuchi2022_formatted_condition,
# Sex=takeuchi2022_formatted_sex,
# DonorID=takeuchi2022_formatted_donor,
# Key=rep("Takeuchi2022",24),
# Chip=rep("EPIC",24),
# Misc=rep(NA,24))
#
# takeuchi2022_cpgs <- read.table("Takeuchi2022/GSE178925_series_matrix.txt",
# comment = "!",
# skip=5,fill=TRUE)
# colnames(takeuchi2022_cpgs) <- takeuchi2022_cpgs[5,]
# takeuchi2022_cpgs <- takeuchi2022_cpgs[-c(1:5),]
# rownames(takeuchi2022_cpgs) <- takeuchi2022_cpgs[,1]
# takeuchi2022_cpgs <- takeuchi2022_cpgs[,-1]
#
# takeuchi2022_cpgs$cpg <- rownames(takeuchi2022_cpgs)
#
# takeuchi2022_cpgs <- takeuchi2022_cpgs[takeuchi2022_cpgs$cpg %in% cpg_table$cpg,]
#
# sample_table <- rbind(sample_table,takeuchi2022_samples)
# cpg_table <- merge(cpg_table,takeuchi2022_cpgs, all=TRUE,by="cpg")
#
#
# bauer2021_unformatted_table <- read.table("Bauer2021/GSE178887_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =4)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# bauer2021_unformatted_samples <- bauer2021_unformatted_table[4,-1]
# bauer2021_formatted_samples <- str_sub(bauer2021_unformatted_samples,1,62)[1:37]
#
# #Formatting ages.
# bauer2021_unformatted_ages <- bauer2021_unformatted_table[2,-1]
# bauer2021_formatted_ages <- as.numeric(str_sub(bauer2021_unformatted_ages, 6,50))
#
# #Formatting sex.
# bauer2021_unformatted_sex <- bauer2021_unformatted_table[1,-1]
# bauer2021_formatted_sex <- str_sub(bauer2021_unformatted_sex,9,15)
# bauer2021_formatted_sex[bauer2021_formatted_sex=="M"] <- "Male"
# bauer2021_formatted_sex[bauer2021_formatted_sex=="F"] <- "Female"
#
#
# #Formatting condition and miscellaneous information.
# bauer2021_unformatted_condition <- bauer2021_unformatted_table[3,-1]
# bauer2021_formatted_condition <- rep("Control",37)
#
# bauer2021_formatted_donor <- paste(rep("BW",37),rep(1:37,each=1),sep="")
#
# bauer2021_samples <- data.frame(ID=bauer2021_formatted_samples,
# Author=rep("Bauer",37),
# Year=rep(2021,37),
# Tissue=rep("Blood",37),
# CellType=rep("PBMC",37),
# Age=bauer2021_formatted_ages,
# Condition=bauer2021_formatted_condition,
# Sex=bauer2021_formatted_sex,
# DonorID=bauer2021_formatted_donor,
# Key=rep("Bauer2021",37),
# Chip=rep("450K",37),
# Misc=rep(NA,37))
#
# bauer2021_cpgs <- read.table("Bauer2021/GSE178887_series_matrix.txt",
# comment = "!",
# skip=4,fill=TRUE)
# colnames(bauer2021_cpgs) <- bauer2021_cpgs[4,]
# bauer2021_cpgs <- bauer2021_cpgs[-c(1:4),]
# rownames(bauer2021_cpgs) <- bauer2021_cpgs[,1]
# bauer2021_cpgs <- bauer2021_cpgs[,-1]
#
# bauer2021_cpgs$cpg <- rownames(bauer2021_cpgs)
#
# bauer2021_cpgs <- bauer2021_cpgs[bauer2021_cpgs$cpg %in% cpg_table$cpg,]
#
# sample_table <- rbind(sample_table,bauer2021_samples)
# cpg_table <- merge(cpg_table,bauer2021_cpgs, all=TRUE,by="cpg")
#
#
# roy2021_unformatted_table <- read.table("Roy2021/GSE184269_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =7)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# roy2021_unformatted_samples <- roy2021_unformatted_table[7,-1]
# roy2021_formatted_samples <- str_sub(roy2021_unformatted_samples,1,62)[1:167]
#
# #Formatting ages.
# roy2021_unformatted_ages <- roy2021_unformatted_table[3,-1]
# roy2021_formatted_ages <- as.numeric(str_sub(roy2021_unformatted_ages, 6,100))
#
# roy2021_unformatted_celltype <- roy2021_unformatted_table[1,-1]
# roy2021_formatted_celltype <- str_sub(roy2021_unformatted_celltype, 1,100)
#
# #Formatting sex.
# roy2021_unformatted_sex <- roy2021_unformatted_table[4,-1]
# roy2021_unformatted_sex <- str_sub(roy2021_unformatted_sex,6, 25)
# roy2021_unformatted_sex[roy2021_unformatted_sex=="M"] <- "Male"
# roy2021_unformatted_sex[roy2021_unformatted_sex=="F"] <- "Female"
# roy2021_formatted_sex <- roy2021_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# roy2021_formatted_condition <- rep("Control",167)
#
# roy2021_formatted_location <- str_sub(roy2021_unformatted_table[5,-1],79,97)
#
# roy2021_formatted_donor <- paste(rep("BX",167),
# as.numeric(factor(str_sub(roy2021_unformatted_table[2,-1],1,100))),sep="")
#
# roy2021_samples <- data.frame(ID=roy2021_formatted_samples,
# Author=rep("Roy",167),
# Year=rep(2021,167),
# Tissue=rep("Blood",167),
# CellType=roy2021_formatted_celltype,
# Age=roy2021_formatted_ages,
# Condition=roy2021_formatted_condition,
# Sex=roy2021_formatted_sex,
# DonorID=roy2021_formatted_donor,
# Key = rep("Roy2021",167),
# Chip=rep("EPIC",167),
# Misc=rep(NA,167))
#
# roy2021_cpgs <- data.table::fread("Roy2021/GSE184269_Matrix_processed_FINAL.txt",
# header=FALSE, fill=FALSE)
# roy2021_cpglist <- roy2021_cpgs$V1
# roy2021_cpglist <- roy2021_cpglist[-1]
# colnames(roy2021_cpgs) <- as.character(roy2021_cpgs[1,])
# roy2021_cpgs <- roy2021_cpgs[-1,]
# roy2021_cpgs <- roy2021_cpgs[,-1]
# roy_keep <- rep(c(rep(TRUE, 2-1), FALSE),167)
# roy2021_cpgs <- roy2021_cpgs[,..roy_keep]
#
# roy2021_cpgs <- setcolorder(roy2021_cpgs,roy2021_formatted_location)
# colnames(roy2021_cpgs) <- roy2021_samples$ID
#
# roy2021_cpgs$cpg <- roy2021_cpglist
#
# roy2021_cpgs <- roy2021_cpgs[roy2021_cpgs$cpg %in% cpg_table$cpg,]
#
# sample_table <- rbind(sample_table,roy2021_samples)
# cpg_table <- merge(cpg_table,roy2021_cpgs, all=TRUE,by="cpg")
#
#
#
# nonino2021_unformatted_table <- read.table("Nonino2021/GSE166611_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =5)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# nonino2021_unformatted_samples <- nonino2021_unformatted_table[5,-1]
# nonino2021_formatted_samples <- str_sub(nonino2021_unformatted_samples,1,62)[1:32]
#
# #Formatting ages.
# nonino2021_unformatted_ages <- nonino2021_unformatted_table[2,-1]
# nonino2021_formatted_ages <- as.numeric(str_sub(nonino2021_unformatted_ages, 6,50))
#
# #Formatting sex.
# nonino2021_unformatted_sex <- nonino2021_unformatted_table[1,-1]
# nonino2021_formatted_sex <- str_sub(nonino2021_unformatted_sex,9,15)
# nonino2021_formatted_sex[nonino2021_formatted_sex=="M"] <- "Male"
# nonino2021_formatted_sex[nonino2021_formatted_sex=="F"] <- "Female"
#
#
# #Formatting condition and miscellaneous information.
# nonino2021_formatted_condition <- rep("Control",32)
#
# nonino2021_formatted_donor <- paste(rep("BY",32),rep(1:32,each=1),sep="")
#
# nonino2021_samples <- data.frame(ID=nonino2021_formatted_samples,
# Author=rep("Nonino",32),
# Year=rep(2021,32),
# Tissue=rep("Blood",32),
# CellType=rep("Blood",32),
# Age=nonino2021_formatted_ages,
# Condition=nonino2021_formatted_condition,
# Sex=nonino2021_formatted_sex,
# DonorID=nonino2021_formatted_donor,
# Key=rep("Nonino2021",32),
# Chip=rep("450K",32),
# Misc=rep(NA,32))
#
# nonino2021_cpgs <- read.table("Nonino2021/GSE166611_series_matrix.txt",
# comment = "!",
# skip=5,fill=TRUE)
# colnames(nonino2021_cpgs) <- nonino2021_cpgs[5,]
# nonino2021_cpgs <- nonino2021_cpgs[-c(1:5),]
# rownames(nonino2021_cpgs) <- nonino2021_cpgs[,1]
# nonino2021_cpgs <- nonino2021_cpgs[,-1]
#
# nonino2021_cpgs$cpg <- rownames(nonino2021_cpgs)
#
# nonino2021_cpgs <- nonino2021_cpgs[nonino2021_cpgs$cpg %in% cpg_table$cpg,]
#
# sample_table <- rbind(sample_table,nonino2021_samples)
# cpg_table <- merge(cpg_table,nonino2021_cpgs, all=TRUE,by="cpg")
#
#
# huang2021_unformatted_table <- read.table("Huang2021/GSE141682_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =4)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# huang2021_unformatted_samples <- huang2021_unformatted_table[4,-1]
# huang2021_formatted_samples <- str_sub(huang2021_unformatted_samples,1,62)[1:42]
#
# #Formatting ages.
# huang2021_unformatted_ages <- huang2021_unformatted_table[2,-1]
# huang2021_formatted_ages <- as.numeric(str_sub(huang2021_unformatted_ages, 6,50))
#
# #Formatting sex.
# huang2021_unformatted_sex <- huang2021_unformatted_table[1,-1]
# huang2021_formatted_sex <- str_sub(huang2021_unformatted_sex,9,15)
# huang2021_formatted_sex[huang2021_formatted_sex=="M"] <- "Male"
# huang2021_formatted_sex[huang2021_formatted_sex=="F"] <- "Female"
#
#
# #Formatting condition and miscellaneous information.
# huang2021_formatted_condition <- rep("Control",42)
#
# huang2021_formatted_donor <- paste(rep("BZ",42),rep(1:42,each=1),sep="")
#
# huang2021_samples <- data.frame(ID=huang2021_formatted_samples,
# Author=rep("Huang",42),
# Year=rep(2021,42),
# Tissue=rep("Blood",42),
# CellType=rep("Blood",42),
# Age=huang2021_formatted_ages,
# Condition=huang2021_formatted_condition,
# Sex=huang2021_formatted_sex,
# DonorID=huang2021_formatted_donor,
# Key=rep("Huang2021",42),
# Chip=rep("EPIC",42),
# Misc=rep(NA,42))
#
# huang2021_cpgs <- read.table("Huang2021/GSE141682_series_matrix.txt",
# comment = "!",
# skip=4,fill=TRUE)
# colnames(huang2021_cpgs) <- huang2021_cpgs[4,]
# huang2021_cpgs <- huang2021_cpgs[-c(1:4),]
# rownames(huang2021_cpgs) <- huang2021_cpgs[,1]
# huang2021_cpgs <- huang2021_cpgs[,-1]
#
# huang2021_cpgs$cpg <- rownames(huang2021_cpgs)
#
# huang2021_cpgs <- huang2021_cpgs[huang2021_cpgs$cpg %in% cpg_table$cpg,]
#
# sample_table <- rbind(sample_table,huang2021_samples)
# cpg_table <- merge(cpg_table,huang2021_cpgs, all=TRUE,by="cpg")
#
#
# Adding in T cells to the training set.
# clock_data <- read.csv("Tomusiak2021/clock_data.csv")
# subset_metadata <- read.csv("Tomusiak2021/subset_metadata.csv")
# all_data <- merge(clock_data,subset_metadata)
# all_data$type <- as.character(all_data$type)
# all_data$type <- factor(all_data$type, levels=c("naive", "central_memory", "effector_memory","temra"))
# beta_values <- data.table::fread("Tomusiak2021/beta_values.csv",header=TRUE)
# beta_values_samples <- colnames(beta_values)
# beta_values <- data.frame(beta_values)
# beta_values_cpgs <- beta_values$V1
# beta_values <- beta_values[,-1]
#
# #Filter out unwanted data from metadata, mapping, and beta values.
# keep <- all_data$SampleID[all_data$SampleID != "D3" &
# all_data$SampleID != "E4"]
#
#
# #Note - D3 was a mistakenly pipetted sample. SA-bGal-high samples were originally included to
# # determine whether changes were occurring at the DNA methylation level, but ultimately only one
# # SA-bGal sample was included, so rigorous statistics cannot be performed.
# all_data <- all_data[all_data$SampleID %in% keep,]
# beta_values <- beta_values[,colnames(beta_values) %in% keep]
# beta_values <- beta_values[,order(all_data$SampleID)]
# rownames(beta_values) <- beta_values_cpgs
#
# tomusiak2021_formatted_samples <- all_data$SampleID
# tomusiak2021_formatted_ages <- all_data$age
# tomusiak2021_formatted_sex <- all_data$Sex
# tomusiak2021_formatted_sex[tomusiak2021_formatted_sex=="male"] <- "Male"
# tomusiak2021_formatted_sex[tomusiak2021_formatted_sex=="female"] <- "Female"
# tomusiak2021_formatted_condition <- rep("Control",30)
# tomusiak2021_formatted_donor <- paste(rep("BZ",30),as.factor(all_data$DonorID),sep="")
# tomusiak2021_formatted_tissue <- rep("PBMC",30)
# tomusiak2021_formatted_celltype <- as.character(all_data$type)
# tomusiak2021_formatted_celltype[tomusiak2021_formatted_celltype=="naive"] <- "CD8+ Naive"
# tomusiak2021_formatted_celltype[tomusiak2021_formatted_celltype=="central_memory"] <- "CD8+ Central Memory"
# tomusiak2021_formatted_celltype[tomusiak2021_formatted_celltype=="effector_memory"] <- "CD8+ Effector Memory"
# tomusiak2021_formatted_celltype[tomusiak2021_formatted_celltype=="temra"] <- "CD8+ TEMRA"
#
# tomusiak2021_samples <- data.frame(ID=tomusiak2021_formatted_samples,
# Author=rep("Tomusiak",30),
# Year=rep(2021,30),
# Tissue=rep("PBMC",30),
# CellType=tomusiak2021_formatted_celltype,
# Age=tomusiak2021_formatted_ages,
# Condition=tomusiak2021_formatted_condition,
# Sex=tomusiak2021_formatted_sex,
# DonorID=tomusiak2021_formatted_donor,
# Key=rep("Tomusiak2021",30),
# Chip=rep("EPIC",30),
# Misc=rep(NA,30))
# tomusiak2021_cpgs <- beta_values
# colnames(beta_values) == tomusiak2021_samples$ID
# tomusiak2021_cpgs$cpg <- rownames(tomusiak2021_cpgs)
#
# tomusiak2021_cpgs <- tomusiak2021_cpgs[tomusiak2021_cpgs$cpg %in% cpg_table$cpg,]
#
# sample_table <- rbind(sample_table,tomusiak2021_samples)
# cpg_table <- merge(cpg_table,tomusiak2021_cpgs, all=TRUE,by="cpg")
#
#
# Last data set - GTEx data. It cannot be included in the public dataset because it contains age data.
#
# gtex_unformatted_table <- read.table("Oliva2022/GSE213478_series_matrix.txt",
# comment = "!",
# skip=0,fill=TRUE,nrows =6)
#
# #Formatting sample names, as the header text file is not arranged in a particularly easy way.
# gtex_unformatted_id <- gtex_unformatted_table[1,-1]
# gtex_formatted_id <- str_sub(gtex_unformatted_id,1,10)[1:987]
# gtex_formatted_id <- sub("-$", "", gtex_formatted_id)
#
# #Formatting ages.
# gtex_unformatted_phenotypic <- read.csv("encrypted/phs000424.v9.pht002742.v9.p2.c1.GTEx_Subject_Phenotypes.GRU.txt",
# comment = "#", sep = "\t",
# skip=0,fill=TRUE)
# gtex_formatted_phenotypic <- gtex_unformatted_phenotypic[gtex_unformatted_phenotypic$SUBJID %in% gtex_formatted_id,]
#
# gtex_formatted_id <- data.table(gtex_formatted_id)
# gtex_formatted_ages <- gtex_formatted_phenotypic$AGE[match(gtex_formatted_id$gtex_formatted_id,gtex_formatted_phenotypic$SUBJID)]
#
# #Need to pull IDs to match real ages.
# gtex_unformatted_samples <- gtex_unformatted_table[6,-1]
# gtex_formatted_samples <- str_sub(gtex_unformatted_samples,1,62)[1:987]
#
# #Formatting sex.
# gtex_unformatted_sex <- gtex_unformatted_table[4,-1]
# gtex_unformatted_sex <- str_sub(gtex_unformatted_sex,6, 25)
# gtex_unformatted_sex[gtex_unformatted_sex=="1"] <- "Male"
# gtex_unformatted_sex[gtex_unformatted_sex=="2"] <- "Female"
# gtex_formatted_sex <- gtex_unformatted_sex
#
# #Formatting condition and miscellaneous information.
# gtex_formatted_condition <- rep("Control",987)
#
# gtex_formatted_donor <- paste(rep("CA",987),
# as.numeric(factor(str_sub(gtex_unformatted_table[3,-1],1,100))),sep="")
#
# gtex_unformatted_tissue <- gtex_unformatted_table[2,-1]
# gtex_formatted_tissue<- str_sub(gtex_unformatted_tissue,1, 205)
# gtex_formatted_tissue[gtex_formatted_tissue=="Muscle - Skeletal"] <- "Skeletal Muscle"
# gtex_formatted_tissue[gtex_formatted_tissue=="Colon - Transverse"] <- "Colon"
# gtex_formatted_tissue[gtex_formatted_tissue=="Breast - Mammary Tissue"] <- "Breast"
# gtex_formatted_tissue[gtex_formatted_tissue=="Kidney - Cortex"] <- "Kidney"
# gtex_formatted_tissue[gtex_formatted_tissue=="Whole Blood"] <- "Blood"
#
# gtex_formatted_location <- gtex_unformatted_table[1,-1]
#
# gtex_samples <- data.frame(ID=gtex_formatted_samples,
# Author=rep("Oliva",987),
# Year=rep(2022,987),
# Tissue=gtex_formatted_tissue,
# CellType=gtex_formatted_tissue,
# Age=gtex_formatted_ages,
# Condition=gtex_formatted_condition,
# Sex=gtex_formatted_sex,
# DonorID=gtex_formatted_donor,
# Misc=rep(NA,987))
#
# gtex_cpgs <- data.table::fread("Oliva2022/GSE213478_methylation_DNAm_noob_final_BMIQ_all_tissues_987.txt",
# header=FALSE, fill=FALSE)
# gtex_cpglist <- gtex_cpgs$V1
# gtex_cpglist <- gtex_cpglist[-c(1)]
# colnames(gtex_cpgs) <- as.character(gtex_cpgs[1,])
# gtex_cpgs <- gtex_cpgs[-1,-1]
# any(!(colnames(gtex_cpgs) == gtex_formatted_location))
#
# #They're in the correct order, so we can just proceed
#
# gtex_cpgs$cpg <- gtex_cpglist
#
# gtex_cpgs <- gtex_cpgs[gtex_cpgs$cpg %in% cpg_table$cpg,]
#
# gtex_samples$Key <- "Oliva2022"
# gtex_samples$Chip <- "EPIC"
#
# sample_table <- rbind(sample_table,gtex_samples)
# cpg_table <- merge(cpg_table,gtex_cpgs, all=TRUE,by="cpg")
###########################################################################
# Checking details about the sample table.
# sample_table$Key <- paste(sample_table$Author, sample_table$Year, sep="")
studies <- sample_table %>% group_by(Key) %>% dplyr::summarize(Number=n())
studies <- studies[order(-studies$Number),]
studies$Key <- factor(studies$Key,levels=studies$Key)
buck_colors <- c("#0066A9", "#2C99D7", "#00B5E6","#48AA48", "#DDC63F",
"#000000","#8B8D90","#D8852A","#16B935")
buck_colors_large <- colorRampPalette(buck_colors)(70)
ggplot(studies, aes(x=Key, y=Number,fill=Key)) +
geom_bar(stat="identity") +
theme_moderate() +
theme(legend.position = "none") +
geom_hline(yintercept=192,linetype="dotted") +
scale_fill_manual(values=buck_colors_large) +
ylab("Number of Samples") + xlab("Dataset") +
theme(axis.line.x.bottom=element_line(size=2),axis.line.y.left=element_line(size=2)) +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank()
) + scale_y_continuous(expand=c(0,0), limits=c(0,1300)) + theme(axis.title.x =
element_text(vjust=-1.5))
sample_table$Tissue[sample_table$Tissue=="Mouth"] <- "Buccal"
sample_table$Tissue[sample_table$Tissue=="Colon"] <- "Colorectal"
sample_table$Tissue[sample_table$Tissue=="PBMC"] <- "Blood"
sample_table$Tissue[sample_table$Tissue=="Jejunum"] <- "Small Intestine"
sample_table$Tissue[sample_table$Tissue=="Duodenum"] <- "Small Intestine"
sample_table$Tissue[sample_table$Tissue=="Ileum"] <- "Small Intestine"
tissues <- sample_table %>% group_by(Tissue) %>% dplyr::summarize(Number=n())
ggplot(tissues, aes(x="", y=Number, fill=Tissue)) +
geom_bar(width = 1, color="black", stat = "identity",size=1) +
coord_polar("y", start=0) + theme_classic() + theme_moderate() +
theme(legend.title = element_text(size = 25),
legend.text = element_text(size = 18)) +
theme(legend.key.size = unit(0.2, "cm")) + guides(fill = guide_legend(ncol = 3)) +
ylab("") + xlab("") +
theme(axis.line.x.bottom=element_line(size=0),axis.line.y.left=element_line(size=0)) +
theme(axis.text.x=element_blank(), #remove x axis labels
axis.ticks.x=element_blank(), #remove x axis ticks
axis.text.y=element_blank(), #remove y axis labels
axis.ticks.y=element_blank() #remove y axis ticks
) +
scale_fill_manual(values=c("#0066A9", "#2C99D7", "#00B5E6","#48AA48", "#DDC63F",
"#000000","#8B8D90","#D8852A","#16B935","#71cae8",
"#650d0b","#4d0cea","#aa8537","#1e4439","#588e19")) +
theme(legend.spacing.y = unit(.15, 'cm')) +
## important additional element
guides(fill = guide_legend(byrow = TRUE))
sample_table$Condition[sample_table$Condition=="COVID"] <- "COVID-19"
sample_table$Condition[sample_table$Condition=="Schizophrenis"] <- "Schizophrenia"
sample_table$Condition[sample_table$Condition=="Combined"] <- "IL13+IL17"
sample_table$Condition[sample_table$Condition=="hepatocellular carcinoma"] <-
"Hepatocellular Carcinoma"
sample_table$Condition[sample_table$Condition=="depression"] <- "Depression"
sample_table$Condition[sample_table$Condition=="NA"] <- "Dysplastic"
sample_table$Condition[sample_table$Condition=="stroke"] <- "Stroke"
sample_table$Condition[sample_table$Condition=="other"] <- "Control"
sample_table$Condition[sample_table$Condition=="normal_pressure_hydr"] <- "Control"
sample_table$Condition[sample_table$Condition=="parkinsonism_NOS"] <- "Parkinson's"
sample_table$Condition[sample_table$Condition=="dementia_NOS"] <- "Dementia"
sample_table$Condition[sample_table$Condition=="MSA"] <- "MS"
sample_table$Condition[sample_table$Condition==""] <- "Control"
sample_table$Condition[sample_table$Condition=="chronic_subdural_hem"] <- "Control"
sample_table$Condition[sample_table$Condition=="alcoholic_encephalop"] <- "Alcoholic Encephalopathy"
sample_table$Condition[sample_table$Condition=="asymmetrical_cortica"] <- "Asymmetrical Cortica"
sample_table$Condition[sample_table$Condition=="Benign"] <- "Benign Cancer"
sample_table$Condition[sample_table$Condition=="PD"] <- "Parkinson's"
sample_table$Condition[sample_table$Condition=="PDD"] <- "Parkinson's"
sample_table$Condition[sample_table$Condition=="Normal"] <- "Control"
sample_table$Condition[sample_table$Condition=="PSP"] <- "Palsy"
sample_table$Condition[sample_table$Condition=="AD"] <- "Alzheimers"
sample_table$Condition[sample_table$Condition=="DLB"] <- "Dementia"
sample_table$Condition[sample_table$Condition=="FTD"] <- "Dementia"
sample_table$Condition[sample_table$Condition=="MDD"] <- "Depression"
sample_table$Condition[sample_table$Condition=="Alzheimers"] <- "Alzheimer's"
sample_table$Condition[sample_table$Condition=="CAR-T"] <- "Control"
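# Illustrative sketch, not part of the original script: the one-line-per-label
# recoding above could also be expressed as a single named lookup vector.
# The names condition_map and recode_labels are hypothetical, and only a few
# of the mappings are reproduced here. Unlike the sequential assignments, a
# lookup is applied in one pass, so chained recodings (for example "AD" to
# "Alzheimers" to "Alzheimer's") would have to map straight to the final label.
```r
# Hypothetical lookup table: names are raw labels, values are harmonized labels.
condition_map <- c("COVID"      = "COVID-19",
                   "depression" = "Depression",
                   "MDD"        = "Depression",
                   "PD"         = "Parkinson's",
                   "Normal"     = "Control")

# Recode only the labels present in the map; leave all others (and NA) untouched.
# Empty-string labels cannot be used as names and need a separate assignment.
recode_labels <- function(x, map) {
  hit <- !is.na(x) & x %in% names(map)
  x[hit] <- unname(map[x[hit]])
  x
}

toy <- c("COVID", "MDD", "Stroke", NA, "Normal")
recoded <- recode_labels(toy, condition_map)
print(recoded)
```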
condition <- sample_table %>% group_by(Condition) %>% dplyr::summarize(Number=n())
ggplot(condition, aes(x="", y=Number, fill=Condition)) +
geom_bar(width = 1, color="black", stat = "identity",size=1) +
coord_polar("y", start=0) + theme_classic() + theme_moderate() +
theme(legend.title = element_text(size = 15),
legend.text = element_text(size = 10))
sample_table$Sex[sample_table$Sex=="F"] <- "Female"
sample_table$Sex[sample_table$Sex=="female"] <- "Female"
sample_table$Sex[sample_table$Sex=="H"] <- "Male"
sample_table$Sex[sample_table$Sex=="M"] <- "Male"
sample_table$Sex[sample_table$Sex=="male"] <- "Male"
sample_table$Sex[sample_table$Sex=="Unsure"] <- "Female"
sample_table$Sex[sample_table$Sex==""] <- "Female"
sample_table$Sex[sample_table$Sex=="MM"] <- "Male"
sex <- sample_table %>% group_by(Sex) %>% dplyr::summarize(Number=n())
ggplot(sex, aes(x="", y=Number, fill=Sex)) +
geom_bar(width = 1, color="black", stat = "identity",size=1) +
coord_polar("y", start=0) + theme_classic() + theme_moderate() +
scale_fill_manual(values=c("#0066A9","#16B935")) +
theme(legend.title = element_text(size = 20),
legend.text = element_text(size = 18)) +
theme(legend.key.size = unit(0.2, "cm")) + guides(fill = guide_legend(ncol = 1)) +
ylab("") + xlab("") +
theme(axis.line.x.bottom=element_line(size=0),axis.line.y.left=element_line(size=0)) +
theme(axis.text.x=element_blank(), #remove x axis labels
axis.ticks.x=element_blank(), #remove x axis ticks
axis.text.y=element_blank(), #remove y axis labels
axis.ticks.y=element_blank() #remove y axis ticks
) + theme(legend.spacing.y = unit(.15, 'cm')) +
## important additional element
guides(fill = guide_legend(byrow = TRUE))
sample_table$CellType[sample_table$CellType=="Brain Cells"] <- "Brain"
sample_table$CellType[sample_table$CellType=="Epithelial (Colorectal)"] <- "Epithelial"
sample_table$CellType[sample_table$CellType=="Epithelial Cells"] <- "Epithelial"
sample_table$CellType[sample_table$CellType=="Airway Smooth Muscle"] <- "Muscle"
sample_table$CellType[sample_table$CellType=="Muscle Cells"] <- "Muscle"
sample_table$CellType[sample_table$CellType=="PBMCs"] <- "PBMC"
sample_table$CellType[sample_table$CellType=="Leukocytes"] <- "Leukocyte"
sample_table$CellType[sample_table$CellType=="Leukocyte"] <- "PBMC"
sample_table$CellType[sample_table$CellType=="Buccal"] <- "Epithelial"
cell_type <- sample_table %>% group_by(CellType) %>% dplyr::summarize(Number=n())
ggplot(cell_type, aes(x="", y=Number, fill=CellType)) +
geom_bar(width = 1, color="black", stat = "identity") +
coord_polar("y", start=0) + theme_classic() + theme_moderate() +
theme(legend.title = element_text(size = 20),
legend.text = element_text(size = 15)) +
theme(legend.key.size = unit(0.2, "cm")) + guides(fill = guide_legend(ncol =1)) +
ylab("") + xlab("") +
theme(axis.line.x.bottom=element_line(size=0),axis.line.y.left=element_line(size=0)) +
theme(axis.text.x=element_blank(), #remove x axis labels
axis.ticks.x=element_blank(), #remove x axis ticks
axis.text.y=element_blank(), #remove y axis labels
axis.ticks.y=element_blank() #remove y axis ticks
)
chip_type <- sample_table %>% group_by(Chip) %>% dplyr::summarize(Number=n())
ggplot(chip_type, aes(x="", y=Number, fill=Chip)) +
geom_bar(width = 1, color="black", stat = "identity",size=1) +
coord_polar("y", start=0) + theme_classic() + theme_moderate() +
theme(legend.title = element_text(size = 20),
legend.text = element_text(size = 18)) +
scale_fill_manual(values=c("#2C99D7","#48AA48")) +
theme(legend.key.size = unit(0.2, "cm")) + guides(fill = guide_legend(ncol = 1)) +
ylab("") + xlab("") +
theme(axis.line.x.bottom=element_line(size=0),axis.line.y.left=element_line(size=0)) +
theme(axis.text.x=element_blank(), #remove x axis labels
axis.ticks.x=element_blank(), #remove x axis ticks
axis.text.y=element_blank(), #remove y axis labels
axis.ticks.y=element_blank() #remove y axis ticks
) +
theme(legend.spacing.y = unit(.15, 'cm')) +
## important additional element
guides(fill = guide_legend(byrow = TRUE))
# #########################################
#
# # I would like to adjust the CpGs depending on whether they were assessed on the
# # 450K or the EPIC chip, so I will include a chip variable post-hoc into the
# # sample sheet.
#
# chip_dataframe <- data.frame(Key="Magnaye2022",
# ID=sample_table$ID[sample_table$Key=="Magnaye2022"],
# Chip=c(rep("450K",30), rep("EPIC",112)))
#
# estupinan2022_chip <- data.frame(Key="Estupinan2022",
# ID=sample_table$ID[sample_table$Key=="Estupinan2022"],
# Chip=c(rep("EPIC",113)))
#
# okereke2021_chip <- data.frame(Key="Okereke2021",
# ID=sample_table$ID[sample_table$Key=="Okereke2021"],
# Chip=c(rep("EPIC",90)))
#
# muse2021_chip <- data.frame(Key="Muse2021",
# ID=sample_table$ID[sample_table$Key=="Muse2021"],
# Chip=c(rep("EPIC",64)))
#
# konigsberg2021_chip <- data.frame(Key="Konigsberg2021",
# ID=sample_table$ID[sample_table$Key=="Konigsberg2021"],
# Chip=c(rep("EPIC",525)))
#
# haghighi2022_chip <- data.frame(Key="Haghighi2022",
# ID=sample_table$ID[sample_table$Key=="Haghighi2022"],
# Chip=c(rep("EPIC",56)))
#
# chen2021_chip <- data.frame(Key="Chen2021",
# ID=sample_table$ID[sample_table$Key=="Chen2021"],
# Chip=c(rep("EPIC",44)))
#
# voisin2021_chip <- data.frame(Key="Voisin2021",
# ID=sample_table$ID[sample_table$Key=="Voisin2021"],
# Chip=c(rep("EPIC",length(sample_table$ID[sample_table$Key=="Voisin2021"]))))
#
# fries2019_chip <- data.frame(Key="Fries2019",
# ID=sample_table$ID[sample_table$Key=="Fries2019"],
# Chip=c(rep("EPIC",length(sample_table$ID[sample_table$Key=="Fries2019"]))))
#
# islam2018_chip <- data.frame(Key="Islam2018",
# ID=sample_table$ID[sample_table$Key=="Islam2018"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Islam2018"]))))
#
# johnson2016_chip <- data.frame(Key="Johnson2016",
# ID=sample_table$ID[sample_table$Key=="Johnson2016"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Johnson2016"]))))
#
# horvath2014_chip <- data.frame(Key="Horvath2014",
# ID=sample_table$ID[sample_table$Key=="Horvath2014"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Horvath2014"]))))
#
# pilsner2022_chip <- data.frame(Key="Pilsner2022",
# ID=sample_table$ID[sample_table$Key=="Pilsner2022"],
# Chip=c(rep("EPIC",length(sample_table$ID[sample_table$Key=="Pilsner2022"]))))
#
# davalos2022_chip <- data.frame(Key="Davalos2022",
# ID=sample_table$ID[sample_table$Key=="Davalos2022"],
# Chip=c(rep("EPIC",length(sample_table$ID[sample_table$Key=="Davalos2022"]))))
#
# hannon2021_chip <- data.frame(Key="Hannon2021",
# ID=sample_table$ID[sample_table$Key=="Hannon2021"],
# Chip=c(rep("EPIC",length(sample_table$ID[sample_table$Key=="Hannon2021"]))))
#
# martino2018_chip <- data.frame(Key="Martino2018",
# ID=sample_table$ID[sample_table$Key=="Martino2018"],
# Chip=c(rep("EPIC",length(sample_table$ID[sample_table$Key=="Martino2018"]))))
#
# thompson2020_chip <- data.frame(Key="Thompson2020",
# ID=sample_table$ID[sample_table$Key=="Thompson2020"],
# Chip=c(rep("EPIC",length(sample_table$ID[sample_table$Key=="Thompson2020"]))))
#
# zannas2019_chip <- data.frame(Key="Zannas2019",
# ID=sample_table$ID[sample_table$Key=="Zannas2019"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Zannas2019"]))))
#
# nicodemus2017_chip <- data.frame(Key="Nicodemus2017",
# ID=sample_table$ID[sample_table$Key=="Nicodemus2017"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Nicodemus2017"]))))
#
# langevin2016_chip <- data.frame(Key="Langevin2016",
# ID=sample_table$ID[sample_table$Key=="Langevin2016"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Langevin2016"]))))
#
# pai2019_chip <- data.frame(Key="Pai2019",
# ID=sample_table$ID[sample_table$Key=="Pai2019"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Pai2019"]))))
#
# cobben2019_chip <- data.frame(Key="Cobben2019",
# ID=sample_table$ID[sample_table$Key=="Cobben2019"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Cobben2019"]))))
#
# husquin2019_chip <- data.frame(Key="Husquin2019",
# ID=sample_table$ID[sample_table$Key=="Husquin2019"],
# Chip=c(rep("EPIC",length(sample_table$ID[sample_table$Key=="Husquin2019"]))))
#
# gasparoni2018_chip <- data.frame(Key="Gasparoni2018",
# ID=sample_table$ID[sample_table$Key=="Gasparoni2018"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Gasparoni2018"]))))
#
# liu2018_chip <- data.frame(Key="Liu2018",
# ID=sample_table$ID[sample_table$Key=="Liu2018"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Liu2018"]))))
#
# somineni2018_chip <- data.frame(Key="Somineni2018",
# ID=sample_table$ID[sample_table$Key=="Somineni2018"],
# Chip=c(rep("EPIC",length(sample_table$ID[sample_table$Key=="Somineni2018"]))))
#
# roos2017_chip <- data.frame(Key="Roos2017",
# ID=sample_table$ID[sample_table$Key=="Roos2017"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Roos2017"]))))
#
# kananen2016_chip <- data.frame(Key="Kananen2016",
# ID=sample_table$ID[sample_table$Key=="Kananen2016"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Kananen2016"]))))
#
# cerapio2021_chip <- data.frame(Key="Cerapio2021",
# ID=sample_table$ID[sample_table$Key=="Cerapio2021"],
# Chip=c(rep("EPIC",length(sample_table$ID[sample_table$Key=="Cerapio2021"]))))
#
# hearn2020_chip <- data.frame(Key="Hearn2020",
# ID=sample_table$ID[sample_table$Key=="Hearn2020"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Hearn2020"]))))
#
# hong2017_chip <- data.frame(Key="Hong2017",
# ID=sample_table$ID[sample_table$Key=="Hong2017"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Hong2017"]))))
#
# gopalan2017_chip <- data.frame(Key="Gopalan2017",
# ID=sample_table$ID[sample_table$Key=="Gopalan2017"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Gopalan2017"]))))
#
# horvath2016_chip <- data.frame(Key="Horvath2016",
# ID=sample_table$ID[sample_table$Key=="Horvath2016"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Horvath2016"]))))
#
# guintivano2013_chip <- data.frame(Key="Guintivano2013",
# ID=sample_table$ID[sample_table$Key=="Guintivano2013"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Guintivano2013"]))))
#
# martino2013_chip <- data.frame(Key="Martino2013",
# ID=sample_table$ID[sample_table$Key=="Martino2013"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Martino2013"]))))
#
# pihlstrom2022_chip <- data.frame(Key="Pihlstrom2022",
# ID=sample_table$ID[sample_table$Key=="Pihlstrom2022"],
# Chip=c(rep("EPIC",length(sample_table$ID[sample_table$Key=="Pihlstrom2022"]))))
#
# tsaprouni2014_chip <- data.frame(Key="Tsaprouni2014",
# ID=sample_table$ID[sample_table$Key=="Tsaprouni2014"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Tsaprouni2014"]))))
#
# liu2013_chip <- data.frame(Key="Liu2013",
# ID=sample_table$ID[sample_table$Key=="Liu2013"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Liu2013"]))))
#
# arpon2019_chip <- data.frame(Key="Arpon2019",
# ID=sample_table$ID[sample_table$Key=="Arpon2019"],
# Chip=c(rep("450K",366), rep("EPIC",108)))
#
# ringh2019_chip <- data.frame(Key="Ringh2019",
# ID=sample_table$ID[sample_table$Key=="Ringh2019"],
# Chip=c(rep("EPIC",length(sample_table$ID[sample_table$Key=="Ringh2019"]))))
#
# kho2020_chip <- data.frame(Key="Kho2020",
# ID=sample_table$ID[sample_table$Key=="Kho2020"],
# Chip=c(rep("450K",272), rep("EPIC",946)))
#
# hannum2012_chip <- data.frame(Key="Hannum2012",
# ID=sample_table$ID[sample_table$Key=="Hannum2012"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Hannum2012"]))))
#
# garcia2021_chip <- data.frame(Key="Garcia2021",
# ID=sample_table$ID[sample_table$Key=="Garcia2021"],
# Chip=c(rep("EPIC",length(sample_table$ID[sample_table$Key=="Garcia2021"]))))
#
# kawai2021_chip <- data.frame(Key="Kawai2021",
# ID=sample_table$ID[sample_table$Key=="Kawai2021"],
# Chip=c(rep("EPIC",length(sample_table$ID[sample_table$Key=="Kawai2021"]))))
#
# xu2021_chip <- data.frame(Key="Xu2021",
# ID=sample_table$ID[sample_table$Key=="Xu2021"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Xu2021"]))))
#
# kandaswamy2020_chip <- data.frame(Key="Kandaswamy2020",
# ID=sample_table$ID[sample_table$Key=="Kandaswamy2020"],
# Chip=c(rep("450K",225), rep("EPIC",738)))
#
# tully2016_chip <- data.frame(Key="Tully2016",
# ID=sample_table$ID[sample_table$Key=="Tully2016"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Tully2016"]))))
#
# bacalini2015_chip <- data.frame(Key="Bacalini2015",
# ID=sample_table$ID[sample_table$Key=="Bacalini2015"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Bacalini2015"]))))
#
# charlton2014_chip <- data.frame(Key="Charlton2014",
# ID=sample_table$ID[sample_table$Key=="Charlton2014"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Charlton2014"]))))
#
# jenkins2022_chip <- data.frame(Key="Jenkins2022",
# ID=sample_table$ID[sample_table$Key=="Jenkins2022"],
# Chip=c(rep("EPIC",length(sample_table$ID[sample_table$Key=="Jenkins2022"]))))
#
# ishak2020_chip <- data.frame(Key="Ishak2020",
# ID=sample_table$ID[sample_table$Key=="Ishak2020"],
# Chip=c(rep("EPIC",length(sample_table$ID[sample_table$Key=="Ishak2020"]))))
#
# ringh2021_chip <- data.frame(Key="Ringh2021",
# ID=sample_table$ID[sample_table$Key=="Ringh2021"],
# Chip=c(rep("EPIC",length(sample_table$ID[sample_table$Key=="Ringh2021"]))))
#
# lewis2020_chip <- data.frame(Key="Lewis2020",
# ID=sample_table$ID[sample_table$Key=="Lewis2020"],
# Chip=c(rep("450K",52), rep("EPIC",32)))
#
# huynh2014_chip <- data.frame(Key="Huynh2014",
# ID=sample_table$ID[sample_table$Key=="Huynh2014"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Huynh2014"]))))
#
# policicchio2020_chip <- data.frame(Key="Policicchio2020",
# ID=sample_table$ID[sample_table$Key=="Policicchio2020"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Policicchio2020"]))))
#
# viana2017_chip <- data.frame(Key="Viana2017",
# ID=sample_table$ID[sample_table$Key=="Viana2017"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Viana2017"]))))
#
# garagnani2015_chip <- data.frame(Key="Garagnani2015",
# ID=sample_table$ID[sample_table$Key=="Garagnani2015"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Garagnani2015"]))))
#
# wockner2014_chip <- data.frame(Key="Wockner2014",
# ID=sample_table$ID[sample_table$Key=="Wockner2014"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Wockner2014"]))))
#
# xu2014_chip <- data.frame(Key="Xu2014",
# ID=sample_table$ID[sample_table$Key=="Xu2014"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Xu2014"]))))
#
# lunnon2014_chip <- data.frame(Key="Lunnon2014",
# ID=sample_table$ID[sample_table$Key=="Lunnon2014"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Lunnon2014"]))))
#
# jiang2015_chip <- data.frame(Key="Jiang2015",
# ID=sample_table$ID[sample_table$Key=="Jiang2015"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Jiang2015"]))))
#
# renauer2015_chip <- data.frame(Key="Renauer2015",
# ID=sample_table$ID[sample_table$Key=="Renauer2015"],
# Chip=c(rep("450K",length(sample_table$ID[sample_table$Key=="Renauer2015"]))))
#
# chip_dataframe <- rbind(chip_dataframe, estupinan2022_chip, okereke2021_chip,
# muse2021_chip,
# konigsberg2021_chip, haghighi2022_chip, chen2021_chip, voisin2021_chip,
# fries2019_chip, islam2018_chip, johnson2016_chip, horvath2014_chip,
# pilsner2022_chip, davalos2022_chip, hannon2021_chip, martino2018_chip,
# thompson2020_chip, zannas2019_chip, nicodemus2017_chip,
# langevin2016_chip,
# pai2019_chip, cobben2019_chip, husquin2019_chip, gasparoni2018_chip,
# liu2018_chip, somineni2018_chip, roos2017_chip, kananen2016_chip,
# cerapio2021_chip,
# hearn2020_chip, hong2017_chip, gopalan2017_chip, horvath2016_chip,
# guintivano2013_chip,
# martino2013_chip, pihlstrom2022_chip, tsaprouni2014_chip, liu2013_chip,
# arpon2019_chip,
# ringh2019_chip, kho2020_chip, hannum2012_chip, garcia2021_chip,
# kawai2021_chip,
# xu2021_chip, kandaswamy2020_chip, tully2016_chip, bacalini2015_chip,
# charlton2014_chip,
# jenkins2022_chip, ishak2020_chip, ringh2021_chip, lewis2020_chip,
# huynh2014_chip,
# policicchio2020_chip, viana2017_chip, garagnani2015_chip,
# wockner2014_chip,
# xu2014_chip, lunnon2014_chip, jiang2015_chip, renauer2015_chip)
#
# # merge() has no "on" argument; by default it joins on all shared columns.
# sample_table <- merge(sample_table,chip_dataframe)
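# Illustrative sketch, not part of the original script: for studies run on a
# single array type, the per-study data frames above could be replaced by one
# study-level lookup table joined onto the sample table by Key. The objects
# toy_samples and chip_by_study are hypothetical; the two chip assignments are
# copied from the block above. Studies that mix 450K and EPIC samples (for
# example Magnaye2022 or Kho2020) would still need per-sample assignment.
```r
# Hypothetical toy sample table; the real one has many more columns.
toy_samples <- data.frame(Key = c("Islam2018", "Islam2018", "Voisin2021"),
                          ID  = c("s1", "s2", "s3"),
                          stringsAsFactors = FALSE)

# One row per single-chip study.
chip_by_study <- data.frame(Key  = c("Islam2018", "Voisin2021"),
                            Chip = c("450K", "EPIC"),
                            stringsAsFactors = FALSE)

# Left join on Key so every sample inherits its study's chip.
toy_annotated <- merge(toy_samples, chip_by_study, by = "Key", all.x = TRUE)
print(toy_annotated)
```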
#############################################################################
# cpg_table <- na.omit(cpg_table)
# sample_table <- sample_table[sample_table$ID %in% colnames(cpg_table),]
# cpg_table <- cpg_table[,colnames(cpg_table) %in% sample_table$ID]
# cpg_table <- cpg_table[,unique(colnames(cpg_table))]
data.table::fwrite(cpg_table,"ClockConstruction/entire_cpg_table.csv",
row.names = TRUE)
data.table::fwrite(sample_table,"ClockConstruction/entire_sample_table.csv",
row.names = TRUE)
ClockDevelopment.R: Code used to generate the IntrinClock
#This program creates a new epigenetic clock by training on CpGs that a
# previous analysis identified as not associated with T cell differentiation.
# The clock is based on two elastic nets.
source("~/AgingProjects/Useful Scripts/generally_useful.R") #Helper functions
library(impute)
library(rstatix)
#This reads in the created models. To build the models, skip the next two lines and
# instead uncomment the "cv.glmnet" lines.
model <- readRDS("~/Data/ClockConstruction/final_model_large.RData")
small_model <- readRDS("~/Data/ClockConstruction/final_model_small.RData")
#Performing Horvath age transformation.
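transformAge <- function(ages) {
  adult_age <- 20
  #seq_along() avoids iterating over c(1, 0) when ages is empty.
  for (age in seq_along(ages)) {
    if (ages[age] >= adult_age) {
      #Linear above adult_age.
      ages[age] <- ((ages[age]-adult_age)/(adult_age+1))
    } else {
      #Logarithmic below adult_age.
      ages[age] <- log(ages[age]+1)-log(adult_age+1)
    }
  }
  return (ages)
}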
#Reversing age transformation.
returnAge <- function(ages) {
  adult_age <- 20
  limit <- 0 #Transformed value corresponding to adult_age.
  for (age in seq_along(ages)) {
    if (ages[age] >= limit) {
      ages[age] <- (((adult_age+1) * (ages[age]) + adult_age))
    } else {
      ages[age] <- (adult_age+1)*exp(ages[age])-1
    }
  }
  return (ages)
}
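# Illustrative sketch, not part of the original script: the same forward and
# inverse transformation written without a loop, followed by a round-trip
# check. The names transform_age_vec and return_age_vec are hypothetical.
# Unlike the loop versions above, ifelse() passes NA ages through as NA
# instead of stopping with an error.
```r
# Forward transformation: logarithmic below adult_age, linear above it.
transform_age_vec <- function(ages, adult_age = 20) {
  ifelse(ages >= adult_age,
         (ages - adult_age) / (adult_age + 1),
         log(ages + 1) - log(adult_age + 1))
}

# Inverse transformation: transformed values >= 0 correspond to adult ages.
return_age_vec <- function(values, adult_age = 20) {
  ifelse(values >= 0,
         (adult_age + 1) * values + adult_age,
         (adult_age + 1) * exp(values) - 1)
}

# Round trip on a few ages, including the boundary (20) and a missing value.
ages <- c(0, 5, 20, 41, 83, NA)
round_trip <- return_age_vec(transform_age_vec(ages))
print(all.equal(round_trip, ages))
```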
#Packages
library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)
library("methylclock")
library(stringr)
library(glmnet)
library(data.table)
library(coefplot)
library(wateRmelon)
library(mice)
library(randomForest)
library(Metrics)
library(DEGreport)
library(impute)
library(glmnetUtils)
library("FactoMineR")
data("IlluminaHumanMethylationEPICanno.ilm10b4.hg19")
# #Quick helper function for pre-processing datasets.
# This is only performed once for the validation CD4 & CD8 datasets.
# preprocessDataset <- function(file_name) {
# cpgs <- data.frame(read_table2(file_name,
# col_names = FALSE, comment = "!"))
# colnames(cpgs) <- cpgs[1,]
# rownames(cpgs) <- cpgs[,1]
# cpgs <- data.frame(cpgs[-1,-1])
# cpgs <- mutate_all(cpgs, function(x) as.numeric(as.character(x)))
# cpgs <- drop_na(cpgs)
# rownames(cpgs) <- str_replace_all(rownames(cpgs), "[[:punct:]]", " ")
# rownames(cpgs) <- gsub(" ","",rownames(cpgs))
# colnames(cpgs) <- substr(colnames(cpgs),3,12)
# return(cpgs)
# }
#
# #Initialize sample table.
# sample_table <- data.frame(ID=character(),
# Author=character(),
# Year=integer(),
# Tissue=character(),
# CellType=character(),
# Age=integer(),
# Condition=character(),
# Sex=character(),
# DonorID=character(),
# Misc=character())
#
# #Pre-processing the Garaud 2017 by structuring it appropriately and annotating
# # each sample.
# garaud_cpgs <- preprocessDataset("Garaud2017/GSE71825_series_matrix.txt")
# garaud_samples <- data.frame(ID=colnames(garaud_cpgs),
# Author=rep("Garaud",12),
# Year=rep(2017,12),
# Tissue=rep("Blood",12),
# CellType=c(rep("Naive CD4+",3),rep("Memory CD4+",3),
# rep("Naive CD4+",3),rep("Memory CD4+",3)),
# Age=rep(NA,12),
# Condition=rep("Healthy",12),
# Sex=rep(NA,12),
# DonorID=c("A1","A2","A3","A1","A2","A3",
# "A1","A2","A3","A1","A2","A3"),
# Misc=c(rep("Resting",6),rep("Activated",6)))
#
# #Initializing the cpg_table with the first dataset.
# cpg_table <- garaud_cpgs
#
# #Adding Garaud 2017 to the data table.
# sample_table <- bind_rows(sample_table,garaud_samples)
#
# #Processing and adding Schlums 2015
# schlums_cpgs <- preprocessDataset("Schlums2015/GSE66564-GPL13534_series_matrix.txt")
# schlums_cpgs <- schlums_cpgs[,c(2,5,7,9,13,16,19,23)] #Sorting out NK cells.
# schlums_samples <- data.frame(ID=colnames(schlums_cpgs),
# Author=rep("Schlums",8),
# Year=rep(2015,8),
# Tissue=rep("Blood",8),
# CellType=c(rep(c("Effector CD8+","Naive CD8+"),4)),
# Age=rep(NA,8),
# Condition=rep("Healthy",8),
# Sex=rep(NA,8),
# DonorID=c("B1","B1","B2","B2",
# "B3","B3","B4","B4"),
# Misc=c(rep("Resting",8)))
# sample_table <- rbind(sample_table,schlums_samples)
# cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(schlums_cpgs),]
# schlums_cpgs <- schlums_cpgs[rownames(schlums_cpgs) %in% rownames(cpg_table),]
# cpg_table <- cbind(cpg_table,schlums_cpgs)
#
# #Adding Rodriguez 2017 to the sample table.
# rodriguez_cpgs <- preprocessDataset("Rodriguez2017/GSE83159-GPL13534_series_matrix.txt")
# rodriguez_samples <- data.frame(ID=colnames(rodriguez_cpgs),
# Author=rep("Rodriguez",6),
# Year=rep(2017,6),
# Tissue=rep("Blood",6),
# CellType=c("Naive CD8+","Naive CD8+",
# "TEMRA CD8+","TEMRA CD8+",
# "Effector CD8+","Effector CD8+"),
# Age=rep(NA,6),
# Condition=rep("Healthy",6),
# Sex=rep(NA,6),
# DonorID=c("C1","C2","C1","C2","C1","C2"),
# Misc=c(rep("Resting",6)))
# sample_table <- rbind(sample_table,rodriguez_samples)
# cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(rodriguez_cpgs),]
# rodriguez_cpgs <- rodriguez_cpgs[rownames(rodriguez_cpgs) %in% rownames(cpg_table),]
# cpg_table <- cbind(cpg_table,rodriguez_cpgs)
#
# #Adding Pitaksalee 2020 to the sample table.
# pitaksalee_cpgs <- preprocessDataset("Pitaksalee2020/GSE121192_series_matrix.txt")
# pitaksalee_cpgs <- pitaksalee_cpgs[,c(c(1:4),c(15:20),c(31:36))]
# pitaksalee_samples <- data.frame(ID=colnames(pitaksalee_cpgs),
# Author=rep("Pitaksalee",16),
# Year=rep(2020,16),
# Tissue=rep("Blood",16),
# CellType=c(rep("Naive CD4+",4),
# rep("Memory CD4+",6),
# rep("Monocyte",6)),
# Age=rep(NA,16),
# Condition=rep("Healthy",16),
# Sex=c("F","M","F","M",
# "M","F","F","M","F","M",
# "M","F","F","M","F","M"),
# DonorID=c("E1","E2","E3","E4",
# "E5","E6","E1","E2","E3","E4",
# "E5","E6", "E1","E2","E3","E4"),
# Misc=c(rep("Resting",16)))
# validation_sample_table <- rbind(sample_table,pitaksalee_samples)
# cpg_table <- cpg_table[rownames(cpg_table) %in% rownames(pitaksalee_cpgs),]
# pitaksalee_cpgs <- pitaksalee_cpgs[rownames(pitaksalee_cpgs) %in% rownames(cpg_table),]
# validation_cpg_table <- cbind(cpg_table,pitaksalee_cpgs)
#
# #Writing tables and saving them.
# write.csv(validation_sample_table,"ClockConstruction/validation_sample_table.csv")
# write.csv(validation_cpg_table,"ClockConstruction/validation_cpg_table.csv")
# # #Here - testing whether or not I can recapitulate Jonkman 2022 results.
# # # Looks like results are similar.
# cpg_names <- data.frame(rownames(validation_cpg_table))
# validation_cpg_table <- as.data.frame(lapply(validation_cpg_table, as.numeric))
# cpg_names <- cpg_names[!rowSums(validation_cpg_table > 1),]
# validation_cpg_table <- validation_cpg_table[!rowSums(validation_cpg_table > 1),]
# validation_cpg_table <- cbind(cpg_names,validation_cpg_table)
# clocks <- checkClocks(validation_cpg_table)
# predicted_ages <- DNAmAge(validation_cpg_table, toBetas=FALSE)
# colnames(predicted_ages) <- c("ID","Horvath","Hannum","Levine","BNN",
# "skinHorvath","PedBE","Wu","TL","BLUP","EN")
# write.csv(predicted_ages,"ClockConstruction/predicted_ages_oldclock.csv")
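# Illustrative sketch, not part of the original script: the commented
# validation-set code above filters each table to shared CpGs with %in% and
# then calls cbind(). cbind() on data frames binds rows by position, not by
# row name, so that pattern assumes both tables list the probes in the same
# order. Indexing both tables by one shared vector of probe names removes
# that assumption. All object names below are hypothetical toy data.
```r
# Hypothetical toy beta matrices: rows are CpG probes, columns are samples.
study_a <- data.frame(a1 = c(0.1, 0.5, 0.9), a2 = c(0.2, 0.4, 0.8),
                      row.names = c("cg01", "cg02", "cg03"))
study_b <- data.frame(b1 = c(0.7, 0.3, 0.6),
                      row.names = c("cg03", "cg02", "cg04"))

# Keep probes present in every study, then bind in one shared row order.
shared <- Reduce(intersect, lapply(list(study_a, study_b), rownames))
combined <- cbind(study_a[shared, , drop = FALSE],
                  study_b[shared, , drop = FALSE])
print(combined)
```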
#Reading in tables and performing some data wrangling.
training_set_multitissue <- data.table::fread("~/Data/ClockConstruction/training_set.csv")
training_set_multitissue <- training_set_multitissue[,-c(1,2)]
cpgs <- training_set_multitissue$cpg
training_set_multitissue <- training_set_multitissue[,1:(ncol(training_set_multitissue)-1)]
test_set_multitissue <- data.table::fread("~/Data/ClockConstruction/test_set.csv")
test_set_multitissue <- test_set_multitissue[,-c(1,2)]
test_set_multitissue <- test_set_multitissue[,1:(ncol(test_set_multitissue)-1)]
healthy_sample_table <-
data.table::fread("~/Data/ClockConstruction/noncancer_sample_table.csv")
training_set_samples <- colnames(training_set_multitissue)
test_set_samples <- colnames(test_set_multitissue)
sample_table <- read.csv("~/Data/ClockConstruction/entire_sample_table.csv",row.names=1)
#Transposing data tables for ML, converting them to dataframes.
training_set_multitissue <- data.frame(training_set_multitissue)
test_set_multitissue <- data.frame(test_set_multitissue)
training_set_multitissue_transposed <- data.frame(t(training_set_multitissue))
test_set_multitissue_transposed <- data.frame(t(test_set_multitissue))
colnames(training_set_multitissue_transposed) <- cpgs
colnames(test_set_multitissue_transposed) <- cpgs
#Adding ages to the data sets, and using the Horvath age transformation.
sample_table_training <- (sample_table[sample_table$ID %in%
rownames(training_set_multitissue_transposed),])
sample_table_test <- (sample_table[sample_table$ID %in%
rownames(test_set_multitissue_transposed),])
training_set_multitissue_transposed <-
training_set_multitissue_transposed[sample_table_training$ID,]
test_set_multitissue_transposed <- test_set_multitissue_transposed[sample_table_test$ID,]
training_set_multitissue_transposed$Age <- transformAge(sample_table_training$Age)
test_set_multitissue_transposed$Age <- transformAge(sample_table_test$Age)
training_set_multitissue_transposed <-
training_set_multitissue_transposed[rownames(training_set_multitissue_transposed) %in%
sample_table_training$ID,]
training_set_multitissue_transposed <-
training_set_multitissue_transposed[sample_table_training$ID,]
training_set_multitissue <- training_set_multitissue_transposed
test_set_multitissue <- test_set_multitissue_transposed
# data.table::fwrite(training_set_multitissue,"ClockConstruction/training_set.csv",
# row.names = TRUE)
# data.table::fwrite(test_set_multitissue,"ClockConstruction/test_set.csv",
# row.names = TRUE)
# data.table::fwrite(nontraining_set,"ClockConstruction/nontraining_set.csv",
# row.names = TRUE)
training_set <- training_set_multitissue_transposed
test_set <- test_set_multitissue_transposed
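# Illustrative sketch, not part of the original script: the steps above
# depend on the methylation table and the sample sheet listing the same
# samples in the same order before ages are attached. A small explicit check
# of that assumption is shown here on hypothetical toy objects (toy_betas,
# toy_sheet).
```r
# Hypothetical toy objects: a sample-by-CpG table and its sample sheet.
toy_betas <- data.frame(cg01 = c(0.2, 0.8, 0.5),
                        row.names = c("s3", "s1", "s2"))
toy_sheet <- data.frame(ID = c("s1", "s2", "s4"), Age = c(30, 55, 70),
                        stringsAsFactors = FALSE)

# Restrict both objects to shared IDs and put them in the same order.
shared_ids <- intersect(toy_sheet$ID, rownames(toy_betas))
toy_sheet <- toy_sheet[match(shared_ids, toy_sheet$ID), ]
toy_betas <- toy_betas[shared_ids, , drop = FALSE]

# Fail loudly if the two objects disagree before ages are attached.
stopifnot(identical(rownames(toy_betas), toy_sheet$ID))
toy_betas$Age <- toy_sheet$Age
print(toy_betas)
```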
#Loading in the validation CpG table, created previously. These are external data
# sets used to verify that the clock is not skewed by CD8+ or CD4+
# differentiation.
validation_cpg_table <- data.table::fread("~/Data/ClockConstruction/validation_cpg_table.csv",
header=TRUE) %>%
as.data.frame()
row.names(validation_cpg_table) <- validation_cpg_table$V1
validation_cpg_table <- validation_cpg_table[,-1]
validation_cpgs <- rownames(validation_cpg_table)
validation_samples <- colnames(validation_cpg_table)
validation_cpg_table_rotated <- data.table::transpose(validation_cpg_table)
colnames(validation_cpg_table_rotated) <- validation_cpgs
######
#Training the model.
number_of_columns <-ncol(training_set)
#Making the model.
# model <-cv.glmnet(data.matrix(training_set[,1:(number_of_columns-1)]),
# data.matrix(training_set[,number_of_columns]),alpha=.5, nfold=10)
training_set <- data.frame(training_set)
#Prediction function.
predicted_age <- returnAge(predict(model,data.matrix(training_set[,1:(number_of_columns-1)])))
predicted_age <- data.frame(predicted_age)
colnames(predicted_age) <- "predicted_age"
predicted_age$ID <- rownames(training_set)
sample_table_training_appended <- merge(sample_table_training,predicted_age,by="ID")
#Calculating acceleration and age prediction.
sample_table_training_appended$acceleration <- sample_table_training_appended$Age -
sample_table_training_appended$predicted_age
sex_differences <- sample_table_training_appended %>% group_by(Sex) %>%
summarize(average_acc =
mean(acceleration))
#Plotting accuracy of the model - interested in both RMSE and MAE here.
fit = lm(predicted_age ~ Age, sample_table_training_appended)
ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) +
geom_point(aes(color=sample_table_training_appended$Key)) +
geom_abline(slope=1,size=1,linetype="dashed") +
theme_classic() +
labs(title = paste("R^2 = ",cor(sample_table_training_appended$Age,
sample_table_training_appended$predicted_age)^2,
"MAE = ",mae(sample_table_training_appended$Age,
sample_table_training_appended$predicted_age)))+
xlim(0,100) + ylim(0,100)
ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) +
geom_point(aes(color=sample_table_training_appended$Tissue)) +
geom_abline(slope=1,size=1,linetype="dashed") +
theme_classic() +
labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5),
"Intercept =",signif(fit$coef[[1]],5 ),
" Slope =",signif(fit$coef[[2]], 5),
" RMSE =",sqrt(mean(fit$residuals^2))))+
xlim(0,100) + ylim(0,100)
print(mae(sample_table_training_appended$predicted_age,sample_table_training_appended$Age))
print(cor(sample_table_training_appended$predicted_age,sample_table_training_appended$Age) ^ 2)
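# Illustrative sketch, not part of the original script: the accuracy
# summaries used in this section, computed in base R on toy numbers. The
# helper name age_metrics is hypothetical. Note that the RMSE here is taken
# on predicted minus chronological age, whereas the plot titles above report
# the RMSE of the residuals of lm(predicted_age ~ Age); the two generally
# differ.
```r
# Hypothetical helper: accuracy summaries for predicted vs. chronological age.
age_metrics <- function(actual, predicted) {
  err <- predicted - actual
  c(MAE  = mean(abs(err)),            # mean absolute error
    RMSE = sqrt(mean(err^2)),         # root mean squared error
    R2   = cor(actual, predicted)^2)  # squared Pearson correlation
}

toy_actual    <- c(20, 40, 60, 80)
toy_predicted <- c(22, 38, 63, 79)
metrics <- age_metrics(toy_actual, toy_predicted)
print(round(metrics, 3))
```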
# Now performing the same but on the test dataset.
predicted_age_test <- returnAge(predict(model,data.matrix(test_set[,1:(number_of_columns-1)])))
predicted_age_test <- data.frame(predicted_age_test)
colnames(predicted_age_test) <- "predicted_age"
predicted_age_test$ID <-rownames(test_set)
sample_table_test_appended <- merge(healthy_sample_table,predicted_age_test,by="ID")
sample_table_test_appended$predicted_age <- sample_table_test_appended$predicted_age
sample_table_test_appended <-
sample_table_test_appended[!is.na(sample_table_test_appended$predicted_age),]
sample_table_test_appended$acceleration <- sample_table_test_appended$Age -
sample_table_test_appended$predicted_age
mae <- mae(sample_table_test_appended$Age,sample_table_test_appended$predicted_age)
print(mae)
cor(sample_table_test_appended$Age,sample_table_test_appended$predicted_age) ^ 2
#Plotting accuracy of the model.
#sample_table_nontraining <- sample_table_nontraining[sample_table_nontraining$Tissue=="Blood",]
fit_test = lm(predicted_age ~ Age, sample_table_test_appended)
ggplot(fit_test$model, aes_string(x = names(fit_test$model)[2], y = names(fit_test$model)[1])) +
geom_point(aes(color=sample_table_test_appended$Tissue)) +
geom_abline(slope=1,size=1,linetype="dashed") +
theme_moderate() +
scale_color_manual(values=c("#0066A9","#D8852A", "#aa8537", "#00B5E6","#48AA48",
"#DDC63F",
"#000000","#8B8D90","#16B935","#71cae8",
"#650d0b","#4d0cea","#2C99D7","#1e4439","#588e19","#FFFAAB",
"#FAFBBB")) +
guides(color = guide_legend(title = "Tissue",byrow=TRUE)) +
scale_y_continuous(name ="Predicted Age",limits=c(0,100)) +
xlim(0,100) +
theme(legend.spacing.y = unit(.15, 'cm')) +
theme(axis.line.x.bottom=element_line(size=2),axis.line.y.left=element_line(size=2))
ggplot(fit_test$model, aes_string(x = names(fit_test$model)[2], y = names(fit_test$model)[1])) +
geom_point(aes(color=sample_table_test_appended$Tissue)) +
geom_abline(slope=1,size=1,linetype="dashed") +
theme_classic() +
labs(title = paste("Adj R2 = ",signif(summary(fit_test)$adj.r.squared, 5),
"Intercept =",signif(fit_test$coef[[1]],5 ),
" Slope =",signif(fit_test$coef[[2]], 5),
" RMSE =",sqrt(mean(fit_test$residuals^2))))+
xlim(0,100) + ylim(0,100)
r <- sqrt(summary(fit_test)$adj.r.squared)
print(r)
#########################################
#Let's look at blood only.
test_blood_sample_table <-
sample_table_test_appended[sample_table_test_appended$Tissue=="Blood",]
test_blood_sample_table <- test_blood_sample_table[,c(1:13)]
test_blood_samples <- test_blood_sample_table$ID
blood_cpgs <- test_set[test_blood_samples,]
blood_set <- blood_cpgs
predicted_age_blood <-
returnAge(predict(model,data.matrix(blood_set[,1:(number_of_columns-1)])))
predicted_age_blood <- data.frame(predicted_age_blood)
colnames(predicted_age_blood) <- "predicted_age"
predicted_age_blood$ID <-rownames(blood_cpgs)
sample_table_blood <- merge(test_blood_sample_table,predicted_age_blood,by="ID")
sample_table_blood$predicted_age <- sample_table_blood$predicted_age
fit_blood <- lm(predicted_age ~ Age, sample_table_blood)
# Plotting blood data.
ggplot(fit_blood$model, aes_string(x = names(fit_blood$model)[2], y =
names(fit_blood$model)[1])) +
geom_point(aes(color=sample_table_blood$Key)) +
geom_abline(slope=1,size=1,linetype="dashed") +
theme_classic() +
labs(title = paste("R^2 = ",cor(sample_table_blood$Age,sample_table_blood$predicted_age)^2,
"MAE = ",mae(sample_table_blood$Age,sample_table_blood$predicted_age))) +
xlim(0,100) + ylim(0,100)
ggplot(fit_blood$model, aes_string(x = names(fit_blood$model)[2], y =
names(fit_blood$model)[1])) +
geom_point(color="darkblue") +
geom_abline(slope=1,size=1,linetype="dashed") +
theme_classic() +
labs(title = paste("Adj R2 = ",signif(summary(fit_blood)$adj.r.squared, 5),
"Intercept =",signif(fit_blood$coef[[1]],5 ),
" Slope =",signif(fit_blood$coef[[2]], 5),
" RMSE =",sqrt(mean(fit_blood$residuals^2))))+
xlim(0,100) + ylim(0,100)
print(cor(sample_table_blood$predicted_age, sample_table_blood$Age) ^ 2)
print(mae(sample_table_blood$Age,sample_table_blood$predicted_age))
blood_mae <- (mae(sample_table_blood$Age,sample_table_blood$predicted_age))
sample_table_blood$acceleration <- sample_table_blood$Age -
sample_table_blood$predicted_age
sex_differences <- sample_table_blood %>% group_by(Sex) %>% summarize(average_acc =
mean(acceleration))
##########################################
#Below we validate differentiation independence by assessing whether or not various CD4+
# and CD8+ subsets are predicted to be different ages from the same donor in external
# datasets. If p value of them being different is < .1, we'll consider it successful.
# Data wrangling below.
validation_present <- validation_cpgs %in% colnames(test_set)
validation_cpg_table_assess <- validation_cpg_table_rotated[,validation_present]
validation_cpg_table_assess <- data.table::setcolorder(validation_cpg_table_assess,
colnames(training_set[,1:(number_of_columns-1)]))
predicted_age_validation <- returnAge(predict(model,data.matrix(validation_cpg_table_assess)))
################################################################
#Goal is to plot difference between CD4 CM and naive cells; as well as CD8 EM and naive cells.
validation_sample_table <-
read.csv("~/Data/ClockConstruction/validation_sample_table.csv",row.names=1)
predicted_ages_oldclock <-
read.csv("~/Data/ClockConstruction/predicted_ages_oldclock.csv",row.names=1)
predicted_ages_newclock <-predicted_age_validation[,1]
validation_sample_table$new_predictions <- predicted_ages_newclock
merged_data <- merge(validation_sample_table,predicted_ages_oldclock,by="ID")
filtered_data <- merged_data[merged_data$Misc != "Activated",]
#Going to investigate CD4s and CD8s and compare the "differences" between CD8+ effector -> naive
# and CD4 memory -> naive. If the new clock has differences substantially closer to 0, that is
# an indicator of success.
cd8_summary_type <- filtered_data[filtered_data$CellType=="Effector CD8+" |
filtered_data$CellType=="Naive CD8+",] %>%
group_by(DonorID,CellType) %>% dplyr::summarize(average_age_hannum =
mean(Hannum),
average_age_levine=mean(Levine),
average_age_horvath = mean(Horvath))
cd8_differences <- cd8_summary_type %>% group_by(DonorID)
ggplot(cd8_differences,aes(x=CellType,y=average_age_horvath)) +
geom_line(aes(group=DonorID)) + theme_classic()
cd8_new_summary_type <- filtered_data[filtered_data$CellType=="Effector CD8+" |
filtered_data$CellType=="Naive CD8+",] %>%
group_by(DonorID,CellType) %>% dplyr::summarise(average_age = mean(new_predictions))
cd8_new_differences <- cd8_new_summary_type %>% group_by(DonorID)
ggplot(cd8_new_summary_type,aes(x=CellType,y=average_age)) +
geom_line(aes(group=DonorID)) + theme_classic()
cd8_old_changes<- na.omit(ave(cd8_differences$average_age_horvath,
factor(cd8_differences$DonorID), FUN=function(x) c(NA,diff(x))))
cd8_new_changes<- na.omit(ave(cd8_new_differences$average_age,
factor(cd8_new_differences$DonorID), FUN=function(x) c(NA,diff(x))))
cd8_old_mean <- mean(cd8_old_changes)
cd8_old_se <- sd(cd8_old_changes)/sqrt(6)
cd8_new_mean <- mean(cd8_new_changes)
cd8_new_se <- sd(cd8_new_changes)/sqrt(6)
cd8_old_changes_hannum<- na.omit(ave(cd8_differences$average_age_hannum,
factor(cd8_differences$DonorID), FUN=function(x) c(NA,diff(x))))
cd8_old_mean_hannum <- mean(cd8_old_changes_hannum)
cd8_old_se_hannum <- sd(cd8_old_changes_hannum)/sqrt(6)
cd8_old_changes_levine <- na.omit(ave(cd8_differences$average_age_levine,
factor(cd8_differences$DonorID), FUN=function(x) c(NA,diff(x))))
cd8_old_mean_levine <- mean(cd8_old_changes_levine)
cd8_old_se_levine <- sd(cd8_old_changes_levine)/sqrt(6)
cd8_quantifying_differences <- data.frame("Mean_Difference" = 1,
"Mean_SE" = 1)
cd8_quantifying_differences <- rbind(cd8_quantifying_differences,
data.frame(Mean_Difference=cd8_old_mean_hannum,
Mean_SE=cd8_old_se_hannum))
cd8_quantifying_differences <- rbind(cd8_quantifying_differences,
data.frame(Mean_Difference=cd8_old_mean_levine,
Mean_SE=cd8_old_se_levine))
cd8_quantifying_differences <- rbind(cd8_quantifying_differences,
data.frame(Mean_Difference=cd8_old_mean,
Mean_SE=cd8_old_se))
cd8_quantifying_differences <- rbind(cd8_quantifying_differences,
data.frame(Mean_Difference=cd8_new_mean,
Mean_SE=cd8_new_se))
cd8_quantifying_differences <- cd8_quantifying_differences[-1,]
cd8_quantifying_differences$Clock <- c("Hannum Clock","Levine Clock","Horvath Clock",
"New Clock")
#Let's do a T test with the old and new clock on the validation samples - CD8s.
old_naive_cd8 <- cd8_summary_type[cd8_summary_type$CellType=="Naive CD8+",]
old_effector_cd8 <- cd8_summary_type[cd8_summary_type$CellType=="Effector CD8+",]
new_naive_cd8 <- cd8_new_summary_type[cd8_new_summary_type$CellType=="Naive CD8+",]
new_effector_cd8 <- cd8_new_summary_type[cd8_new_summary_type$CellType=="Effector CD8+",]
t.test(old_naive_cd8$average_age_hannum,old_effector_cd8$average_age_hannum,paired=TRUE)
cd8_test <- t.test(new_naive_cd8$average_age,new_effector_cd8$average_age,paired=TRUE)
cd8_pvalue <- cd8_test$p.value
#CD8+ differentiation processed above. Will process CD4+ differentiation below.
cd4_summary_type <- filtered_data[filtered_data$CellType=="Naive CD4+" |
filtered_data$CellType=="Memory CD4+",] %>%
group_by(DonorID,CellType) %>% dplyr::summarize(average_age_hannum =
mean(Hannum),
average_age_levine=mean(Levine),
average_age_horvath = mean(Horvath))
cd4_summary_type <- cd4_summary_type[1:14,]
cd4_differences <- cd4_summary_type %>% group_by(DonorID)
ggplot(cd4_differences,aes(x=CellType,y=average_age_horvath)) +
geom_line(aes(group=DonorID)) + theme_classic()
cd4_new_summary_type <- filtered_data[filtered_data$CellType=="Naive CD4+" |
filtered_data$CellType=="Memory CD4+",] %>%
group_by(DonorID,CellType) %>% dplyr::summarise(average_age = mean(new_predictions))
cd4_new_summary_type <- cd4_new_summary_type[1:14,]
cd4_new_differences <- cd4_new_summary_type %>% group_by(DonorID)
ggplot(cd4_new_differences,aes(x=CellType,y=average_age)) +
geom_line(aes(group=DonorID)) + theme_classic()
cd4_new_changes<- na.omit(ave(cd4_new_differences$average_age,
factor(cd4_new_differences$DonorID), FUN=function(x) c(NA,diff(x))))
cd4_new_mean <- mean(cd4_new_changes)
cd4_new_se <- sd(cd4_new_changes)/sqrt(7)
cd4_old_changes<- na.omit(ave(cd4_differences$average_age_horvath,
factor(cd4_differences$DonorID), FUN=function(x) c(NA,diff(x))))
cd4_old_se <- sd(cd4_old_changes)/sqrt(7)
cd4_old_mean <- mean(cd4_old_changes)
cd4_old_changes_hannum<- na.omit(ave(cd4_differences$average_age_hannum,
factor(cd4_differences$DonorID), FUN=function(x) c(NA,diff(x))))
cd4_old_se_hannum <- sd(cd4_old_changes_hannum)/sqrt(7)
cd4_old_mean_hannum <- mean(cd4_old_changes_hannum)
cd4_old_changes_levine <- na.omit(ave(cd4_differences$average_age_levine,
factor(cd4_differences$DonorID), FUN=function(x) c(NA,diff(x))))
cd4_old_se_levine <- sd(cd4_old_changes_levine)/sqrt(7)
cd4_old_mean_levine <- mean(cd4_old_changes_levine)
cd4_quantifying_differences <- data.frame("Mean_Difference" = 1,
"Mean_SE" = 1)
cd4_quantifying_differences <- rbind(cd4_quantifying_differences,
data.frame(Mean_Difference=cd4_old_mean_hannum,
Mean_SE=cd4_old_se_hannum))
cd4_quantifying_differences <- rbind(cd4_quantifying_differences,
data.frame(Mean_Difference=cd4_old_mean_levine,
Mean_SE=cd4_old_se_levine))
cd4_quantifying_differences <- rbind(cd4_quantifying_differences,
data.frame(Mean_Difference=cd4_old_mean,
Mean_SE=cd4_old_se))
cd4_quantifying_differences <- rbind(cd4_quantifying_differences,
data.frame(Mean_Difference=cd4_new_mean,
Mean_SE=cd4_new_se))
cd4_quantifying_differences <- cd4_quantifying_differences[-1,]
cd4_quantifying_differences$Clock <- c("Hannum Clock","Levine Clock","Horvath Clock",
"New Clock")
cd8_quantifying_differences$Cell <- "CD8 EM - CD8 Naive"
cd4_quantifying_differences$Cell <- "CD4 Memory - CD4 Naive"
#Let's do a T test with the old and new clock on the validation samples - CD4s.
old_naive_cd4 <- cd4_summary_type[cd4_summary_type$CellType=="Naive CD4+",]
old_memory_cd4 <- cd4_summary_type[cd4_summary_type$CellType=="Memory CD4+",]
new_naive_cd4 <- cd4_new_summary_type[cd4_new_summary_type$CellType=="Naive CD4+",]
new_memory_cd4 <- cd4_new_summary_type[cd4_new_summary_type$CellType=="Memory CD4+",]
cd4_test <- t.test(new_naive_cd4$average_age,new_memory_cd4$average_age,paired=TRUE)
cd8_test <- t.test(new_naive_cd8$average_age,new_effector_cd8$average_age,paired=TRUE)
cd4_pvalue <- cd4_test$p.value
cd8_pvalue <- cd8_test$p.value
print(cd4_pvalue)
print(cd8_pvalue)
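# Illustrative sketch (editorial addition, not part of the original analysis).
# The per-donor comparison above (difference in predicted age between two cell
# states, its mean and standard error, and a paired t-test) can be written as one
# helper. Assumptions: both vectors are ordered by the same donors, the sign
# convention (differentiated minus naive) is a choice made here for illustration,
# and the standard error uses the observed number of donors rather than a fixed
# 6 or 7. The name sketch_paired_change is hypothetical and is not called in this script.
sketch_paired_change <- function(naive_age, differentiated_age) {
  stopifnot(length(naive_age) == length(differentiated_age))
  change <- differentiated_age - naive_age      # one value per donor
  n <- length(change)
  list(mean_change = mean(change),
       se = sd(change) / sqrt(n),               # standard error of the mean change
       p_value = t.test(naive_age, differentiated_age, paired = TRUE)$p.value)
}
# Possible use: sketch_paired_change(new_naive_cd8$average_age, new_effector_cd8$average_age)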
# Now plotting the differences in predicted age per differentiation state per clock.
ggplot(cd8_quantifying_differences, aes(x=Clock,y=Mean_Difference,
ymin=Mean_Difference-Mean_SE,ymax=Mean_Difference+Mean_SE)) +
geom_point() + theme_classic() + geom_errorbar() + ylim(-60,60) + facet_wrap(~Cell) +
geom_hline(yintercept=0,linetype="dashed")
ggplot(cd4_quantifying_differences, aes(x=Clock,y=Mean_Difference,
ymin=Mean_Difference-Mean_SE,ymax=Mean_Difference+Mean_SE)) +
geom_point() + theme_classic() + geom_errorbar() + ylim(-60,60) + facet_wrap(~Cell) +
geom_hline(yintercept=0,linetype="dashed")
#Taking a look at monocytes. The N is too small to make important conclusions.
monocyte_new_summary_type <- filtered_data[filtered_data$Author=="Pitaksalee" &
(filtered_data$CellType=="Monocyte" |
filtered_data$CellType=="Naive CD4+"),] %>%
group_by(DonorID,CellType) %>% dplyr::summarise(average_age = mean(new_predictions))
monocyte_new_differences <- monocyte_new_summary_type %>% group_by(DonorID)
ggplot(monocyte_new_summary_type,aes(x=CellType,y=average_age)) +
geom_line(aes(group=DonorID)) + theme_classic()
monocytes <-
monocyte_new_summary_type[monocyte_new_summary_type$CellType=="Monocyte",]
naive_cd4s <-
monocyte_new_summary_type[monocyte_new_summary_type$CellType=="Naive CD4+",]
monocytes <- monocytes[1:4,]
monocyte_test <- t.test(naive_cd4s$average_age,monocytes$average_age,paired=TRUE)
monocyte_pvalue <- monocyte_test$p.value
print(monocyte_pvalue)
##############################################################################
# Below we'll investigate the difference between CD8+ subsets in our in-house generated
# data.
#NOTE: Not an unbiased validation, as later versions of the clock include this data
# in the training set. We will use the Tomusiak2022 data for this instead in
# a different script.
beta_values <- data.table::fread("~/Data/Tomusiak2021/beta_values.csv",header=TRUE)
beta_values_samples <- colnames(beta_values)
beta_values <- data.frame(beta_values)
beta_values_cpgs <- beta_values$V1
beta_values <- beta_values[,-1]
subset_beta <- data.frame(t(beta_values))
colnames(subset_beta) <- beta_values_cpgs
subset_beta <- subset_beta[,colnames(test_set[1:(ncol(test_set)-1)])]
subset_metadata <- read.csv("~/Data/Tomusiak2021/subset_metadata.csv")
clock_data <- read.csv("~/Data/Tomusiak2021/clock_data.csv")
# Drop the Age column from the clock output before merging with the metadata.
clock_data <- clock_data[,colnames(clock_data) != "Age"]
all_data <- merge(clock_data,subset_metadata)
all_data$type <- as.character(all_data$type)
all_data$type <- factor(all_data$type, levels=c("naive", "central_memory",
"effector_memory","temra"))
keep <- all_data$SampleID[all_data$SampleID != "D3" & all_data$sabgal_sample == FALSE
& all_data$SampleID != "E4"]
all_data <- all_data[all_data$SampleID %in% keep,]
# rownames(subset_beta) <- samples[-1]
# subset_set_cpgs_test <- t(t(as.matrix(subset_cpg_table)) - adjustment %*% as.matrix(t( rep(-1,26))))
# subset_cpg_table$Sex <- as.numeric(all_data$sex=="F")
predicted_ages_subset_set <- data.frame(returnAge(predict(model,
data.matrix(subset_beta))))
colnames(predicted_ages_subset_set) <- "predicted_newclock"
predicted_ages_subset_set$SampleID <- rownames(predicted_ages_subset_set)
all_data <- merge(all_data,predicted_ages_subset_set)
all_data$diff_newclock <- all_data$predicted_newclock-all_data$age
summary <- getSummary(all_data,"diff_newclock", "type")
summary$type <- c("Naive","Central Memory","Effector Memory","TEMRA")
summary$type <- factor(summary$type,levels=c("Naive","Central Memory","Effector Memory","TEMRA"))
#Investigating differences in methylation age per subset.
ggplot(data=summary, aes(x=type, y=diff_newclock, group=1)) +
geom_line()+
geom_point() +
theme_classic() +
ylim(-20,20) +
theme(text = element_text(size = 15)) +
geom_hline(yintercept = 0,linetype="dotted") +
geom_errorbar(aes(ymin=diff_newclock-se, ymax=diff_newclock+se), width=.1) +
labs(x="CD8+ T Cell Subset",y="Predicted Age - Age",
title="CD8+ T Cell Subset Differences Between\nClock Age and Chronological Age")
all_data$diff_horvathclock <- all_data$DNAmAge-all_data$age
summary_horvathclock <- getSummary(all_data,"diff_horvathclock", "type")
summary_horvathclock$type <- c("Naive","Central Memory","Effector Memory","TEMRA")
summary_horvathclock$type <- factor(summary_horvathclock$type,levels=c("Naive","Central Memory","Effector Memory","TEMRA"))
ggplot(data=summary_horvathclock, aes(x=type, y=diff_horvathclock, group=1)) +
geom_line()+
geom_point() +
theme_classic() +
ylim(-20,20) +
theme(text = element_text(size = 15)) +
geom_hline(yintercept = 0,linetype="dotted") +
geom_errorbar(aes(ymin=diff_horvathclock-se, ymax=diff_horvathclock+se), width=.1) +
labs(x="CD8+ T Cell Subset",y="Predicted Age - Age",
title="CD8+ T Cell Subset Differences Between\nClock Age and Chronological Age")
group_ages_newclock <- all_data %>%
group_by(DonorID,type) %>% dplyr::summarise(Predicted_Age = (predicted_newclock))
group_ages_newclock <- group_ages_newclock %>% group_by(DonorID)
ggplot(group_ages_newclock,aes(x=type,y=Predicted_Age)) +
geom_line(aes(group=DonorID)) + geom_point(size=5) + theme_classic() +
labs(x="CD8+ T Cell Subset",y="Predicted Age", title="CD8+ Subset Age Predictions (New Clock)") +
ylim(18,80)
group_ages_horvathclock <- all_data %>%
group_by(DonorID,type) %>% dplyr::summarise(Predicted_Age = (DNAmAge))
group_ages_horvathclock <- group_ages_horvathclock %>% group_by(DonorID)
ggplot(group_ages_horvathclock,aes(x=type,y=Predicted_Age)) +
geom_line(aes(group=DonorID)) + geom_point(size=5) + theme_classic() +
labs(x="CD8+ T Cell Subset",y="Predicted Age", title="CD8+ Subset Age Predictions (Horvath Clock)") +
ylim(18,80)
# Performing statistical tests on changes between cell types.
newclock_em <- all_data[all_data$type=="effector_memory",c("donor","predicted_newclock")]
newclock_cm <- all_data[all_data$type=="central_memory",c("donor","predicted_newclock")]
newclock_em$donor==newclock_cm$donor
cm_test <- t.test(newclock_em$predicted_newclock,
newclock_cm$predicted_newclock,
paired=TRUE)
cm_pvalue <- cm_test$p.value
print(cm_pvalue)
newclock_naive <- all_data[all_data$type=="naive",c("donor","predicted_newclock")]
newclock_em <- all_data[all_data$type=="effector_memory",c("donor","predicted_newclock")]
newclock_em <- newclock_em[-7,]
newclock_naive$donor==newclock_em$donor
naive_test <- t.test(newclock_naive$predicted_newclock,
newclock_em$predicted_newclock,
paired=TRUE)
em_pvalue <- naive_test$p.value
print(em_pvalue)
newclock_cm <- all_data[all_data$type=="central_memory",c("donor","predicted_newclock")]
newclock_temra <- all_data[all_data$type=="temra",c("donor","predicted_newclock")]
newclock_cm <- newclock_cm[-2,]
newclock_cm$donor==newclock_temra$donor
temra_test <- t.test(newclock_cm$predicted_newclock,
newclock_temra$predicted_newclock,
paired=TRUE)
temra_pvalue <- temra_test$p.value
print(temra_pvalue)
difference <- summary$diff_newclock[1]
print(difference)
############
#Elastic net is run twice: the first pass selects the CpG sites, and the second pass
# re-optimizes the weights of the selected sites. Both passes are built from the same
# training data, so there is no bleed-through into the test data.
#Building new training and test datasets from only the CpGs that the first
# elastic net identified.
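# Illustrative sketch (editorial addition, not part of the original analysis).
# A minimal version of the two-pass idea described above, assuming the glmnet
# package and alpha = 0.5 as in the commented-out cv.glmnet call further down.
# x is a numeric matrix of CpG beta values with column names and y is the age
# vector in whatever form the model is trained on. The name sketch_two_pass_enet
# is hypothetical and is not called anywhere in this script.
sketch_two_pass_enet <- function(x, y, alpha = 0.5, nfolds = 10) {
  # Pass 1: cross-validated elastic net on all candidate CpGs.
  first_pass <- glmnet::cv.glmnet(x, y, alpha = alpha, nfolds = nfolds)
  beta <- as.matrix(coef(first_pass, s = "lambda.min"))
  selected <- setdiff(rownames(beta)[beta[, 1] != 0], "(Intercept)")
  # glmnet needs at least two predictor columns for the refit.
  stopifnot(length(selected) >= 2)
  # Pass 2: refit on the selected CpGs only, using the same training samples.
  second_pass <- glmnet::cv.glmnet(x[, selected, drop = FALSE], y,
                                   alpha = alpha, nfolds = nfolds)
  list(selected = selected, model = second_pass)
}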
coefs <- extract.coef(model) %>% arrange(desc(abs(Value)))
tmp_coeffs <- coef(small_model)
data.frame(name = tmp_coeffs@Dimnames[[1]][tmp_coeffs@i + 1], coefficient =
tmp_coeffs@x)
coefficients <- coefs$Coefficient
coef_cpgs <- coefficients
allowed_cpgs <- rownames(coef(small_model))[2:411]
small_training_set <- training_set[,colnames(training_set) %in% allowed_cpgs]
small_training_set$Age <- training_set$Age
small_number_of_columns <- ncol(small_training_set)
# small_model <-cv.glmnet(data.matrix(small_training_set[,1:(small_number_of_columns-1)]),
# data.matrix(small_training_set[,small_number_of_columns]), alpha=.5,
# nfold=10)
small_predicted_age <-
returnAge(predict(small_model,data.matrix(small_training_set[,1:(small_number_of_columns-1)])))
small_predicted_age <- data.frame(small_predicted_age)
colnames(small_predicted_age) <- "predicted_age"
small_predicted_age$ID <- rownames(small_training_set)
small_sample_table_training_appended <-
merge(sample_table_training,small_predicted_age,by="ID")
#Calculating acceleration and age prediction.
small_sample_table_training_appended$acceleration <-
small_sample_table_training_appended$Age -
small_sample_table_training_appended$predicted_age
#Plotting accuracy of the model - interested in both RMSE and MAE here.
small_fit = lm(predicted_age ~ Age, small_sample_table_training_appended)
ggplot(small_fit$model, aes_string(x = names(small_fit$model)[2], y =
names(small_fit$model)[1])) +
geom_point(aes(color=small_sample_table_training_appended$Tissue)) +
geom_abline(slope=1,size=1,linetype="dashed") +
theme_classic() +
labs(title = paste("R^2 = ",cor(small_sample_table_training_appended$Age,
small_sample_table_training_appended$predicted_age)^2,
"MAE = ",mae(small_sample_table_training_appended$Age,
small_sample_table_training_appended$predicted_age)))+
xlim(0,110) + ylim(0,110)
small_test_set <- test_set[,colnames(test_set) %in% allowed_cpgs]
small_predicted_age_test <-
returnAge(predict(small_model,data.matrix(small_test_set[,1:(small_number_of_columns-1)])))
small_predicted_age_test <- data.frame(small_predicted_age_test)
colnames(small_predicted_age_test) <- "predicted_age"
small_predicted_age_test$ID <-rownames(small_test_set)
small_sample_table_test_appended <-
merge(healthy_sample_table,small_predicted_age_test,by="ID")
small_sample_table_test_appended$predicted_age <-
small_sample_table_test_appended$predicted_age
small_sample_table_test_appended <-
small_sample_table_test_appended[!is.na(small_sample_table_test_appended$predicted_age),]
small_sample_table_test_appended$acceleration <- small_sample_table_test_appended$Age -
small_sample_table_test_appended$predicted_age
small_mae <-
mae(small_sample_table_test_appended$Age,small_sample_table_test_appended$predicted_age
)
print(small_mae)
cor(small_sample_table_test_appended$Age,small_sample_table_test_appended$predicted_age) ^ 2
#Plotting accuracy of the model.
#sample_table_nontraining <- sample_table_nontraining[sample_table_nontraining$Tissue=="Blood",]
small_fit_test = lm(predicted_age ~ Age, small_sample_table_test_appended)
small_sample_table_test_appended[small_sample_table_test_appended$Tissue=="Nasopharynx","Tissue"] <- "Throat"
small_sample_table_test_appended[small_sample_table_test_appended$Tissue=="Bronchi","Tissue"] <- "Throat"
small_sample_table_test_appended[small_sample_table_test_appended$Tissue=="Buccal","Tissue"] <- "Saliva"
small_sample_table_test_appended[small_sample_table_test_appended$Tissue=="Colon","Tissue"] <- "Colorectal"
ggplot(small_fit_test$model, aes_string(x = names(small_fit_test$model)[2], y =
names(small_fit_test$model)[1])) +
geom_point(aes(color=small_sample_table_test_appended$Key)) +
geom_abline(slope=1,size=1,linetype="dashed") +
theme_classic() +
labs(title = paste("R^2 = ",cor(small_sample_table_test_appended$Age,
small_sample_table_test_appended$predicted_age)^2,
"MAE = ",mae(small_sample_table_test_appended$Age,
small_sample_table_test_appended$predicted_age)))+
xlim(0,110) + ylim(0,110)
ggplot(small_fit_test$model, aes_string(x = names(small_fit_test$model)[2],
y = names(small_fit_test$model)[1])) +
geom_point(aes(color=small_sample_table_test_appended$Tissue)) +
geom_abline(slope=1,size=1,linetype="dashed") +
theme_moderate() +
scale_color_manual(values=c("#0066A9","#D8852A", "#aa8537", "#00B5E6","#48AA48",
"#DDC63F",
"#000000","#8B8D90","#16B935","#71cae8",
"#650d0b","#4d0cea","#2C99D7","#1e4439","#588e19","#FFFAAB",
"#FAFBBB")) +
guides(color = guide_legend(title = "Tissue",byrow=TRUE)) +
scale_y_continuous(name ="Predicted Age",limits=c(0,100)) +
xlim(0,100) +
theme(legend.spacing.y = unit(.15, 'cm')) +
theme(axis.line.x.bottom=element_line(size=2),axis.line.y.left=element_line(size=2))
small_r <- sqrt(summary(small_fit_test)$adj.r.squared)
print(small_r)
mae(small_sample_table_test_appended$Age,small_sample_table_test_appended$predicted_age
)
sample_table_test_blood <-
small_sample_table_test_appended[small_sample_table_test_appended$Tissue=="Blood",]
fit_test_blood = lm(predicted_age ~ Age, sample_table_test_blood)
ggplot(fit_test_blood$model, aes_string(x = names(fit_test_blood$model)[2],
y = names(fit_test_blood$model)[1])) +
geom_point(color="#0066A9") +
geom_abline(slope=1,size=1,linetype="dashed") +
theme_moderate() +
xlim(0,100) + scale_y_continuous(limits=c(0,100),name="Predicted Age") +
theme(axis.line.x.bottom=element_line(size=2),axis.line.y.left=element_line(size=2))
blood_r <- sqrt(signif(summary(fit_test_blood)$adj.r.squared, 5))
blood_mae <- mae(sample_table_test_blood$Age,sample_table_test_blood$predicted_age)
sample_table_test_brain <-
small_sample_table_test_appended[small_sample_table_test_appended$Tissue=="Brain",]
fit_test_brain = lm(predicted_age ~ Age, sample_table_test_brain)
ggplot(fit_test_brain$model, aes_string(x = names(fit_test_brain$model)[2],
y = names(fit_test_brain$model)[1])) +
geom_point(color="#D8852A") +
geom_abline(slope=1,size=1,linetype="dashed") +
theme_moderate() +
xlim(18,100) + scale_y_continuous(limits=c(18,100),name="Predicted Age") +
theme(axis.line.x.bottom=element_line(size=2),axis.line.y.left=element_line(size=2))
brain_r <- sqrt(signif(summary(fit_test_brain)$adj.r.squared, 5))
brain_mae <- mae(sample_table_test_brain$Age,sample_table_test_brain$predicted_age)
sample_table_test_skin <-
small_sample_table_test_appended[small_sample_table_test_appended$Tissue=="Skin",]
fit_test_skin = lm(predicted_age ~ Age, sample_table_test_skin)
ggplot(fit_test_skin$model, aes_string(x = names(fit_test_skin$model)[2],
y = names(fit_test_skin$model)[1])) +
geom_point(color="#00B5E6") +
geom_abline(slope=1,size=1,linetype="dashed") +
theme_moderate() +
xlim(40,80) + scale_y_continuous(limits=c(40,80),name="Predicted Age") +
theme(axis.line.x.bottom=element_line(size=2),axis.line.y.left=element_line(size=2))
skin_r <- sqrt(signif(summary(fit_test_skin)$adj.r.squared, 5))
skin_mae <- mae(sample_table_test_skin$Age,sample_table_test_skin$predicted_age)
sample_table_test_saliva <-
small_sample_table_test_appended[small_sample_table_test_appended$Tissue=="Saliva",]
fit_test_saliva = lm(predicted_age ~ Age, sample_table_test_saliva)
ggplot(fit_test_saliva$model, aes_string(x = names(fit_test_saliva$model)[2],
y = names(fit_test_saliva$model)[1])) +
geom_point(color="#000000") +
geom_abline(slope=1,size=1,linetype="dashed") +
theme_moderate() +
xlim(0,90) + scale_y_continuous(limits=c(0,90),name="Predicted Age") +
theme(axis.line.x.bottom=element_line(size=2),axis.line.y.left=element_line(size=2))
saliva_r <- sqrt(signif(summary(fit_test_saliva)$adj.r.squared, 5))
saliva_mae <- mae(sample_table_test_saliva$Age,sample_table_test_saliva$predicted_age)
sample_table_test_lung <-
small_sample_table_test_appended[small_sample_table_test_appended$Tissue=="Lung",]
fit_test_lung = lm(predicted_age ~ Age, sample_table_test_lung)
ggplot(fit_test_lung$model, aes_string(x = names(fit_test_lung$model)[2],
y = names(fit_test_lung$model)[1])) +
geom_point(color="#48AA48") +
geom_abline(slope=1,size=1,linetype="dashed") +
theme_moderate() +
xlim(20,90) + scale_y_continuous(limits=c(20,90),name="Predicted Age") +
theme(axis.line.x.bottom=element_line(size=2),axis.line.y.left=element_line(size=2))
lung_r <- sqrt(signif(summary(fit_test_lung)$adj.r.squared, 5))
lung_mae <- mae(sample_table_test_lung$Age,sample_table_test_lung$predicted_age)
small_validation_present <- validation_cpgs %in% colnames(small_test_set)
small_validation_cpg_table_rotated <-
validation_cpg_table_rotated[,colnames(validation_cpg_table_rotated) %in%
colnames(small_training_set)]
small_validation_cpg_table_assess <-
data.table::setcolorder(small_validation_cpg_table_rotated,
colnames(small_training_set[,1:(small_number_of_columns-1)]))
small_predicted_age_validation <-
returnAge(predict(small_model,data.matrix(small_validation_cpg_table_assess)))
################################################################
#Goal is to plot difference between CD4 CM and naive cells; as well as CD8 EM and naive cells.
small_validation_sample_table <-
read.csv("~/Data/ClockConstruction/validation_sample_table.csv",
row.names=1)
small_predicted_ages_oldclock <-
read.csv("~/Data/ClockConstruction/predicted_ages_oldclock.csv",
row.names=1)
small_predicted_ages_newclock <-small_predicted_age_validation[,1]
small_validation_sample_table$new_predictions <- small_predicted_ages_newclock
small_merged_data <-
merge(small_validation_sample_table,small_predicted_ages_oldclock,by="ID")
small_filtered_data <- small_merged_data[small_merged_data$Misc != "Activated",]
#Going to investigate CD4s and CD8s and compare the "differences" between CD8+ effector -> naive
# and CD4 memory -> naive. If the new clock has differences substantially closer to 0, that is
# an indicator of success.
small_cd8_summary_type <- small_filtered_data[small_filtered_data$CellType=="Effector CD8+" |
small_filtered_data$CellType=="Naive CD8+",] %>%
group_by(DonorID,CellType) %>% dplyr::summarize(average_age_hannum =
mean(Hannum),
average_age_levine=mean(Levine),
average_age_horvath = mean(Horvath))
small_cd8_differences <- small_cd8_summary_type %>% group_by(DonorID)
small_cd8_new_summary_type <- small_filtered_data[small_filtered_data$CellType=="Effector CD8+" |
small_filtered_data$CellType=="Naive CD8+",] %>%
group_by(DonorID,CellType) %>% dplyr::summarise(average_age = mean(new_predictions))
small_cd8_new_differences <- small_cd8_new_summary_type %>% group_by(DonorID)
small_cd8_new_summary_type$CellType <- factor(small_cd8_new_summary_type$CellType,
levels=c("Naive CD8+","Effector CD8+"))
small_cd8_new_summary_type$DonorID <-
as.factor(as.numeric(factor(small_cd8_new_summary_type$DonorID)))
ggplot(small_cd8_new_summary_type,aes(x=CellType,y=average_age,color=DonorID)) +
geom_line(aes(group=DonorID),size=1.5) + theme_moderate() +
theme(axis.line.x.bottom=element_line(size=2),axis.line.y.left=element_line(size=2)) +
labs(x="",y="Predicted Age (IntrinClock)") +
scale_color_manual(values=c("#0066A9", "#2C99D7",
"#00B5E6","#48AA48","#16B935","#D8852A"))
small_cd8_old_changes<- na.omit(ave(small_cd8_differences$average_age_horvath,
factor(small_cd8_differences$DonorID), FUN=function(x)
c(NA,diff(x))))
small_cd8_new_changes<- na.omit(ave(small_cd8_new_differences$average_age,
factor(small_cd8_new_differences$DonorID), FUN=function(x)
c(NA,diff(x))))
small_cd8_old_mean <- mean(small_cd8_old_changes)
small_cd8_old_se <- sd(small_cd8_old_changes)/sqrt(6)
small_cd8_new_mean <- mean(small_cd8_new_changes)
small_cd8_new_se <- sd(small_cd8_new_changes)/sqrt(6)
small_cd8_old_changes_hannum<- na.omit(ave(small_cd8_differences$average_age_hannum,
factor(small_cd8_differences$DonorID), FUN=function(x)
c(NA,diff(x))))
small_cd8_old_mean_hannum <- mean(small_cd8_old_changes_hannum)
small_cd8_old_se_hannum <- sd(small_cd8_old_changes_hannum)/sqrt(6)
small_cd8_old_changes_levine <- na.omit(ave(small_cd8_differences$average_age_levine,
factor(small_cd8_differences$DonorID), FUN=function(x)
c(NA,diff(x))))
small_cd8_old_mean_levine <- mean(small_cd8_old_changes_levine)
small_cd8_old_se_levine <- sd(small_cd8_old_changes_levine)/sqrt(6)
small_cd8_quantifying_differences <- data.frame("Mean_Difference" = 1,
"Mean_SE" = 1)
small_cd8_quantifying_differences <- rbind(small_cd8_quantifying_differences,
data.frame(Mean_Difference=small_cd8_old_mean_hannum,
Mean_SE=small_cd8_old_se_hannum))
small_cd8_quantifying_differences <- rbind(small_cd8_quantifying_differences,
data.frame(Mean_Difference=small_cd8_old_mean_levine,
Mean_SE=small_cd8_old_se_levine))
small_cd8_quantifying_differences <- rbind(small_cd8_quantifying_differences,
data.frame(Mean_Difference=small_cd8_old_mean,
Mean_SE=small_cd8_old_se))
small_cd8_quantifying_differences <- rbind(small_cd8_quantifying_differences,
data.frame(Mean_Difference=small_cd8_new_mean,
Mean_SE=small_cd8_new_se))
small_cd8_quantifying_differences <- small_cd8_quantifying_differences[-1,]
small_cd8_quantifying_differences$Clock <- c("Hannum Clock","Levine Clock","Horvath Clock",
"New Clock")
#Let's do a T test with the old and new clock on the validation samples - CD8s.
small_old_naive_cd8 <-
small_cd8_summary_type[small_cd8_summary_type$CellType=="Naive CD8+",]
small_old_effector_cd8 <-
small_cd8_summary_type[small_cd8_summary_type$CellType=="Effector CD8+",]
small_new_naive_cd8 <-
small_cd8_new_summary_type[small_cd8_new_summary_type$CellType=="Naive CD8+",]
small_new_effector_cd8 <-
small_cd8_new_summary_type[small_cd8_new_summary_type$CellType=="Effector CD8+",]
t.test(small_old_naive_cd8$average_age_hannum,small_old_effector_cd8$average_age_hannum
,paired=TRUE)
small_cd8_test <-
t.test(small_new_naive_cd8$average_age,small_new_effector_cd8$average_age,paired=TRUE)
small_cd8_pvalue <- small_cd8_test$p.value
#CD8+ differentiation processed above. Will process CD4+ differentiation below.
small_cd4_summary_type <- small_filtered_data[small_filtered_data$CellType=="Naive CD4+"
|
small_filtered_data$CellType=="Memory CD4+",] %>%
group_by(DonorID,CellType) %>% dplyr::summarize(average_age_hannum =
mean(Hannum),
average_age_levine=mean(Levine),
average_age_horvath = mean(Horvath))
small_cd4_summary_type <- small_cd4_summary_type[1:14,]
small_cd4_differences <- small_cd4_summary_type %>% group_by(DonorID)
ggplot(small_cd4_differences,aes(x=CellType,y=average_age_horvath)) +
geom_line(aes(group=DonorID)) + theme_classic()
small_cd4_new_summary_type <- small_filtered_data[small_filtered_data$CellType=="Naive CD4+" |
small_filtered_data$CellType=="Memory CD4+",] %>%
group_by(DonorID,CellType) %>% dplyr::summarise(average_age = mean(new_predictions))
small_cd4_new_summary_type <-small_cd4_new_summary_type[1:14,]
small_cd4_new_differences <- small_cd4_new_summary_type %>% group_by(DonorID)
small_cd4_new_summary_type$CellType <- factor(small_cd4_new_summary_type$CellType,
levels=c("Naive CD4+","Memory CD4+"))
small_cd4_new_summary_type$DonorID <-
as.factor(as.numeric(factor(small_cd4_new_summary_type$DonorID)))
ggplot(small_cd4_new_summary_type,aes(x=CellType,y=average_age,color=DonorID)) +
geom_line(aes(group=DonorID),size=1.5) + theme_moderate() +
labs(x="",y="Predicted Age (IntrinClock)") + ylim(25,75) +
theme(axis.line.x.bottom=element_line(size=2),axis.line.y.left=element_line(size=2)) +
scale_color_manual(values=c("#0066A9", "#2C99D7",
"#00B5E6","#48AA48","#16B935","#D8852A","#DDC63F"))
small_cd4_new_changes<- na.omit(ave(small_cd4_new_differences$average_age,
factor(small_cd4_new_differences$DonorID), FUN=function(x)
c(NA,diff(x))))
small_cd4_new_mean <- mean(small_cd4_new_changes)
small_cd4_new_se <- sd(small_cd4_new_changes)/sqrt(7)
small_cd4_old_changes<- na.omit(ave(small_cd4_differences$average_age_horvath,
factor(small_cd4_differences$DonorID), FUN=function(x)
c(NA,diff(x))))
small_cd4_old_se <- sd(small_cd4_old_changes)/sqrt(7)
small_cd4_old_mean <- mean(small_cd4_old_changes)
small_cd4_old_changes_hannum<- na.omit(ave(small_cd4_differences$average_age_hannum,
factor(small_cd4_differences$DonorID), FUN=function(x)
c(NA,diff(x))))
small_cd4_old_se_hannum <- sd(small_cd4_old_changes_hannum)/sqrt(7)
small_cd4_old_mean_hannum <- mean(small_cd4_old_changes_hannum)
small_cd4_old_changes_levine <- na.omit(ave(small_cd4_differences$average_age_levine,
factor(small_cd4_differences$DonorID), FUN=function(x)
c(NA,diff(x))))
small_cd4_old_se_levine <- sd(small_cd4_old_changes_levine)/sqrt(7)
small_cd4_old_mean_levine <- mean(small_cd4_old_changes_levine)
small_cd4_quantifying_differences <- data.frame("Mean_Difference" = 1,
"Mean_SE" = 1)
small_cd4_quantifying_differences <- rbind(small_cd4_quantifying_differences,
data.frame(Mean_Difference=small_cd4_old_mean_hannum,
Mean_SE=small_cd4_old_se_hannum))
small_cd4_quantifying_differences <- rbind(small_cd4_quantifying_differences,
data.frame(Mean_Difference=small_cd4_old_mean_levine,
Mean_SE=small_cd4_old_se_levine))
small_cd4_quantifying_differences <- rbind(small_cd4_quantifying_differences,
data.frame(Mean_Difference=small_cd4_old_mean,
Mean_SE=small_cd4_old_se))
small_cd4_quantifying_differences <- rbind(small_cd4_quantifying_differences,
data.frame(Mean_Difference=small_cd4_new_mean,
Mean_SE=small_cd4_new_se))
small_cd4_quantifying_differences <- small_cd4_quantifying_differences[-1,]
small_cd4_quantifying_differences$Clock <- c("Hannum Clock","Levine Clock","Horvath Clock",
"New Clock")
small_cd8_quantifying_differences$Cell <- "CD8 EM - CD8 Naive"
small_cd4_quantifying_differences$Cell <- "CD4 Memory - CD4 Naive"
#Let's do a T test with the old and new clock on the validation samples - CD4s.
small_old_naive_cd4 <-
small_cd4_summary_type[small_cd4_summary_type$CellType=="Naive CD4+",]
small_old_memory_cd4 <-
small_cd4_summary_type[small_cd4_summary_type$CellType=="Memory CD4+",]
small_new_naive_cd4 <-
small_cd4_new_summary_type[small_cd4_new_summary_type$CellType=="Naive CD4+",]
small_new_memory_cd4 <-
small_cd4_new_summary_type[small_cd4_new_summary_type$CellType=="Memory CD4+",]
small_cd4_test <-
t.test(small_new_naive_cd4$average_age,small_new_memory_cd4$average_age,paired=TRUE)
small_cd8_test <-
t.test(small_new_naive_cd8$average_age,small_new_effector_cd8$average_age,paired=TRUE)
small_cd4_pvalue <- small_cd4_test$p.value
small_cd8_pvalue <- small_cd8_test$p.value
print(small_cd4_pvalue)
print(small_cd8_pvalue)
# Now plotting the differences in predicted age per differentiation state per clock.
ggplot(small_cd8_quantifying_differences, aes(x=Clock,y=Mean_Difference,
ymin=Mean_Difference-Mean_SE,ymax=Mean_Difference+Mean_SE)) +
geom_point() + theme_classic() + geom_errorbar() + ylim(-50,50) + facet_wrap(~Cell) +
geom_hline(yintercept=0,linetype="dashed")
ggplot(small_cd4_quantifying_differences, aes(x=Clock,y=Mean_Difference,
ymin=Mean_Difference-Mean_SE,ymax=Mean_Difference+Mean_SE)) +
geom_point() + theme_classic() + geom_errorbar() + ylim(-50,50) + facet_wrap(~Cell) +
geom_hline(yintercept=0,linetype="dashed")
small_monocyte_new_summary_type <-
small_filtered_data[small_filtered_data$Author=="Pitaksalee" &
(small_filtered_data$CellType=="Monocyte" |
small_filtered_data$CellType=="Naive CD4+"),] %>%
group_by(DonorID,CellType) %>% dplyr::summarise(average_age = mean(new_predictions))
small_monocyte_new_differences <- small_monocyte_new_summary_type %>%
group_by(DonorID)
ggplot(small_monocyte_new_summary_type,aes(x=CellType,y=average_age)) +
geom_line(aes(group=DonorID)) + theme_classic()
small_monocytes <-
small_monocyte_new_summary_type[small_monocyte_new_summary_type$CellType=="Monocyte",]
small_naive_cd4s <-
small_monocyte_new_summary_type[small_monocyte_new_summary_type$CellType=="Naive CD4+",]
small_monocytes <- small_monocytes[1:4,]
small_monocyte_test <-
t.test(small_naive_cd4s$average_age,small_monocytes$average_age,paired=TRUE)
small_monocyte_pvalue <- small_monocyte_test$p.value
print(small_monocyte_pvalue)
#################################################################################
############################################
#Below we'll investigate the difference between CD8+ subsets in our in-house generated
# data. Again, this is no longer true validation - we will use the Tomusiak2022 dataset
# for that.
small_beta_values <- data.table::fread("~/Data/Tomusiak2021/beta_values.csv",header=TRUE)
small_beta_values_samples <- colnames(small_beta_values)
small_beta_values <- data.frame(small_beta_values)
small_beta_values_cpgs <- small_beta_values$V1
small_beta_values <- small_beta_values[,-1]
small_subset_beta <- data.frame(t(small_beta_values))
colnames(small_subset_beta) <- small_beta_values_cpgs
small_subset_beta <- small_subset_beta[,colnames(small_subset_beta) %in%
colnames(small_test_set)]
small_subset_beta <- small_subset_beta[,colnames(small_test_set)]
small_subset_metadata <- read.csv("~/Data/Tomusiak2021/subset_metadata.csv")
small_clock_data <- read.csv("~/Data/Tomusiak2021/clock_data.csv")
small_clock_data <- small_clock_data[,colnames(small_clock_data) != "Age"]
small_all_data <- merge(small_clock_data,small_subset_metadata)
small_all_data$type <- as.character(small_all_data$type)
small_all_data$type <- factor(small_all_data$type, levels=c("naive", "central_memory",
"effector_memory","temra"))
small_keep <- small_all_data$SampleID[small_all_data$SampleID != "D3" &
small_all_data$sabgal_sample == FALSE &
small_all_data$SampleID != "E4"]
small_all_data <- small_all_data[small_all_data$SampleID %in% small_keep,]
small_predicted_ages_subset_set <- data.frame(returnAge(predict(small_model,
data.matrix(small_subset_beta))))
colnames(small_predicted_ages_subset_set) <- "predicted_newclock"
small_predicted_ages_subset_set$SampleID <- rownames(small_predicted_ages_subset_set)
small_all_data <- merge(small_all_data,small_predicted_ages_subset_set)
small_all_data$diff_newclock <- small_all_data$predicted_newclock-small_all_data$age
small_summary <- getSummary(small_all_data,"diff_newclock", "type")
small_summary$type <- c("Naive","Central Memory","Effector Memory","TEMRA")
small_summary$type <- factor(small_summary$type,levels=c("Naive","Central Memory",
"Effector Memory","TEMRA"))
smallest_data <- small_all_data[,colnames(small_all_data) %in%
c("age","DNAmAge","type","predicted_newclock")]
#Investigating differences in methylation age per subset.
ggplot(data=small_summary, aes(x=type, y=diff_newclock, group=1)) +
geom_line(size=1.5)+
geom_point() +
theme_moderate() +
ylim(-20,20) +
theme(text = element_text(size = 15)) +
geom_hline(yintercept = 0,linetype="dotted") +
geom_errorbar(aes(ymin=diff_newclock-se, ymax=diff_newclock+se), width=.1,size=1.5) +
labs(x="CD8+ T Cell Subset",y="Predicted Age - Age (IntrinClock)")+
theme(axis.line.x.bottom=element_line(size=2),axis.line.y.left=element_line(size=2))
anova_test(small_all_data,diff_newclock ~ type,wid=donor)
ggplot(data=summary, aes(x=type, y=diff_newclock, group=1)) +
geom_line()+
geom_point() +
theme_classic() +
ylim(-20,20) +
theme(text = element_text(size = 15)) +
geom_hline(yintercept = 0,linetype="dotted") +
geom_errorbar(aes(ymin=diff_newclock-se, ymax=diff_newclock+se), width=.1) +
labs(x="CD8+ T Cell Subset",y="Predicted Age - Age",
title="CD8+ T Cell Subset Differences Between Clock Age and Chronological Age")
small_group_ages_newclock <- small_all_data %>%
group_by(DonorID,type) %>% dplyr::summarise(Predicted_Age = (predicted_newclock))
small_group_ages_newclock <- small_group_ages_newclock %>% group_by(DonorID)
ggplot(small_group_ages_newclock,aes(x=type,y=Predicted_Age)) +
geom_line(aes(group=DonorID)) + geom_point(size=5) + theme_classic() +
labs(x="CD8+ T Cell Subset",y="Predicted Age",
title="CD8+ Subset Age Predictions (New Clock)") +
ylim(18,80)
mean_of_sd_small <- mean((small_group_ages_newclock %>% group_by(DonorID) %>%
dplyr::summarize(sd=sd(Predicted_Age)))$sd)
mean_of_sd <- mean((group_ages_newclock %>% group_by(DonorID) %>%
dplyr::summarize(sd=sd(Predicted_Age)))$sd)
print(mean_of_sd)
print(mean_of_sd_small)
ggplot(group_ages_newclock,aes(x=type,y=Predicted_Age)) +
geom_line(aes(group=DonorID)) + geom_point(size=5) + theme_classic() +
labs(x="CD8+ T Cell Subset",y="Predicted Age",
title="CD8+ Subset Age Predictions (New Clock)") +
ylim(18,80)
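The per-donor change summaries in the script above hard-code the number of donors in each standard error (for example sqrt(6) and sqrt(7)) and rely on the row order within each donor. A minimal sketch of the same paired summary, with n derived from the data, is given below. The helper name paired_change_summary and the toy data frame are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: per-donor change between two cell states, with the
# standard error computed from the observed number of paired donors.
paired_change_summary <- function(df, value_col, from_type, to_type) {
  from <- df[df$CellType == from_type, ]
  to <- df[df$CellType == to_type, ]
  # Align the two states on DonorID so each difference is within one donor.
  to <- to[match(from$DonorID, to$DonorID), ]
  changes <- to[[value_col]] - from[[value_col]]
  changes <- changes[!is.na(changes)]
  n <- length(changes)
  test <- t.test(to[[value_col]], from[[value_col]], paired = TRUE)
  list(mean = mean(changes), se = sd(changes) / sqrt(n), n = n,
       p_value = test$p.value)
}

# Toy data (made up for illustration only).
toy <- data.frame(
  DonorID = rep(c("A", "B", "C"), each = 2),
  CellType = rep(c("Naive CD8+", "Effector CD8+"), times = 3),
  average_age = c(40, 46, 55, 58, 30, 39)
)
toy_summary <- paired_change_summary(toy, "average_age",
                                     "Naive CD8+", "Effector CD8+")
print(toy_summary$mean)
```

Matching on DonorID rather than relying on diff() over sorted rows keeps the sign of the change explicit (to_type minus from_type) even if the factor levels are reordered.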
CellTypePrediction.R: Code used to generate the cell type predictor for the single-cell
transcriptomic dataset
#This code is responsible for creating a cell type prediction model from
# scRNA-Seq data with labeled cell types. It also takes in a list of features
# relevant for predicting cell type.
#
# INPUT 1: List of useful features for cell type prediction.
# INPUT 2: Training set with labeled cell type and age.
# OUTPUT: Model to predict cell type.
############
gene_list <- read.csv("~/Data/Terekhova2023/all_pbmcs/alternative_build/gene_list.csv")$x
training_cells_matrix_datatable <-
data.frame((data.table::fread("~/Data/Terekhova2023/all_pbmcs/alternative_build/training_cells_datatable_final.csv")))
training_cells_metadata <-
read.csv("~/Data/Terekhova2023/all_pbmcs/alternative_build/training_metadata_final.csv")
training_cells_metadata$cell_id <- str_replace_all( training_cells_metadata$cell_id,"-",".")
training_cells_matrix_datatable <-
training_cells_matrix_datatable[,training_cells_metadata$cell_id]
any(!colnames(training_cells_matrix_datatable) == training_cells_metadata$cell_id)
training_cells_matrix_datatable <- data.frame(t(training_cells_matrix_datatable))
colnames(training_cells_matrix_datatable) <- gene_list
rownames(training_cells_matrix_datatable) <- training_cells_metadata$cell_id
gene_list <- read.csv("~/Data/Terekhova2023/all_pbmcs/alternative_build/gene_list.csv")$x
test_cells_matrix_datatable <-
data.frame((data.table::fread("~/Data/Terekhova2023/all_pbmcs/alternative_build/test_cells_datatable_final.csv")))
test_cells_metadata <-
read.csv("~/Data/Terekhova2023/all_pbmcs/alternative_build/test_metadata_final.csv")
test_cells_metadata$cell_id <- str_replace_all( test_cells_metadata$cell_id,"-",".")
test_cells_matrix_datatable <- test_cells_matrix_datatable[,test_cells_metadata$cell_id]
any(!colnames(test_cells_matrix_datatable) == test_cells_metadata$cell_id)
test_cells_matrix_datatable <- data.frame(t(test_cells_matrix_datatable))
colnames(test_cells_matrix_datatable) <- gene_list
rownames(test_cells_matrix_datatable) <- test_cells_metadata$cell_id
training_seurat <- CreateSeuratObject(t(training_cells_matrix_datatable),
meta.data=training_cells_metadata)
training_seurat <- FindVariableFeatures(training_seurat)
training_seurat <- NormalizeData(training_seurat)
training_seurat <- ScaleData(training_seurat)
training_seurat <- RunPCA(training_seurat)
training_seurat <- FindNeighbors(training_seurat, dims = 1:20)
training_seurat <- FindClusters(training_seurat, resolution=.1)
training_seurat <- RunUMAP(training_seurat,dims=1:10)
DimPlot(training_seurat,reduction="umap")
FeaturePlot(training_seurat,features=c( "GNLY", "CD4",
"FCER1A","LYZ", "PPBP",
"CD8A","CD19","CCR7"))
new.cluster.ids <- c("Naive+CM CD4", "Effector CD4", "Effector CD8", "Naive CD8",
"Naive+CM CD4", "Effector CD4", "Naive CD8")
names(new.cluster.ids) <- levels(training_seurat)
training_seurat <- RenameIdents(training_seurat, new.cluster.ids)
test_seurat <- CreateSeuratObject(t(test_cells_matrix_datatable),
meta.data=test_cells_metadata)
test_seurat <- FindVariableFeatures(test_seurat)
test_seurat <- NormalizeData(test_seurat)
test_seurat <- ScaleData(test_seurat)
test_seurat <- RunPCA(test_seurat)
test_seurat <- FindNeighbors(test_seurat, dims = 1:20)
test_seurat <- FindClusters(test_seurat, resolution=.1)
test_seurat <- RunUMAP(test_seurat,dims=1:10)
DimPlot(test_seurat,reduction="umap")
FeaturePlot(test_seurat,features=c("CD4", "GNLY",
"FCER1A", "PPBP",
"CD8A","CD19","CCR7"))
new.cluster.ids <- c("Naive+CM CD4", "Effector CD4", "Effector CD8", "Naive+CM CD4",
"Naive CD8", "Naive CD8")
names(new.cluster.ids) <- levels(test_seurat)
test_seurat <- RenameIdents(test_seurat, new.cluster.ids)
training_cells_matrix_datatable$sex <- as.numeric(training_cells_metadata$sex=="Male")
test_cells_matrix_datatable$sex <- as.numeric(test_cells_metadata$sex=="Male")
any(!(training_cells_metadata$cell_id==training_seurat@meta.data$cell_id))
training_cells_metadata$cell_type <- Idents(training_seurat)
any(!(test_cells_metadata$cell_id==test_seurat@meta.data$cell_id))
test_cells_metadata$cell_type <- Idents(test_seurat)
write.csv(training_cells_metadata,
"~/Data/model_making/training_metadata_final.csv")
write.csv(test_cells_metadata,
"~/Data/model_making/test_metadata_final.csv")
saveRDS(training_seurat,
"~/Data/model_making/training_seurat.RDS")
saveRDS(test_seurat,
"~/Data/model_making/test_seurat.RDS")
###########
library(data.table)
library(stringr)
library(Matrix)
library(ggplot2)
library(parallel)
library(foreach)
library(doParallel)
library(Seurat)
library(glmnet)
source("~/AgingProjects/Useful Scripts/generally_useful.R") #Helper functions
gene_list <- read.csv("~/Data/Terekhova2023/all_pbmcs/alternative_build/gene_list.csv")$x
training_cells_matrix_datatable <-
data.frame((data.table::fread("~/Data/Terekhova2023/all_pbmcs/alternative_build/training_cells_datatable_final.csv")))
training_cells_metadata <- read.csv("~/Data/model_making/training_metadata_final.csv")
training_cells_metadata$cell_id <- str_replace_all( training_cells_metadata$cell_id,"-",".")
training_cells_matrix_datatable <-
training_cells_matrix_datatable[,training_cells_metadata$cell_id]
any(!colnames(training_cells_matrix_datatable) == training_cells_metadata$cell_id)
training_cells_matrix_datatable <- data.frame(t(training_cells_matrix_datatable))
colnames(training_cells_matrix_datatable) <- gene_list
rownames(training_cells_matrix_datatable) <- training_cells_metadata$cell_id
training_cells_matrix_datatable$sex <- as.numeric(training_cells_metadata$sex=="Male")
numCores <- detectCores()
registerDoParallel(cores = numCores - 1)
celltype_model <-cv.glmnet(as.matrix(training_cells_matrix_datatable),
parallel=TRUE,
as.character(training_cells_metadata$cell_type),
family="multinomial",
verbose=TRUE)
saveRDS(celltype_model,"~/Data/model_making/cell_type_prediction_model_parallel.RData")
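The scripts in this appendix reorder the expression matrix columns to match the metadata cell IDs and then evaluate any(!colnames(...) == ...), which reports a mismatch but does not stop the run. A minimal sketch of a stricter version follows. The helper name align_to_metadata is hypothetical, and the toy matrix and IDs are invented for illustration.

```r
# Hypothetical helper: reorder expression columns to the metadata order and
# stop (rather than only print) if any cell is missing.
align_to_metadata <- function(expr, cell_ids) {
  # data.frame() converts "-" in column names to ".", so apply the same
  # substitution to the metadata IDs before matching.
  cell_ids <- gsub("-", ".", cell_ids, fixed = TRUE)
  missing_ids <- setdiff(cell_ids, colnames(expr))
  if (length(missing_ids) > 0) {
    stop("Cells missing from expression matrix: ",
         paste(head(missing_ids), collapse = ", "))
  }
  expr <- expr[, cell_ids, drop = FALSE]
  stopifnot(identical(colnames(expr), cell_ids))
  expr
}

# Toy data (made up for illustration only).
toy_expr <- data.frame(cell.2 = c(0, 3), cell.1 = c(5, 1), cell.3 = c(2, 2))
toy_ids <- c("cell-1", "cell-2")
aligned <- align_to_metadata(toy_expr, toy_ids)
print(colnames(aligned))
```

Failing early in this way avoids silently training on rows whose labels belong to a different cell.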
AgePrediction.R: Code used to generate the age predictor for the single-cell transcriptomic
dataset
#This code is responsible for creating several age prediction models from
# scRNA-Seq data, one for each cell type. It also takes in the features
# relevant for predicting age within each cell type.
#
# INPUT 1: Lists of useful features for age prediction given cell types.
# INPUT 2: Training set with labeled cell type and age.
# OUTPUT: Models to predict age given each cell type.
# Let's just read in some files and try to do some very basic age prediction off one cell cluster.
# Start by reading in the data table library...
library(data.table)
library(stringr)
library(glmnet)
library(coefplot)
library(dplyr) # Provides %>%, arrange() and desc(), which are used below
library(ggplot2)
source("~/AgingProjects/Useful Scripts/generally_useful.R") #Helper functions
#########################################################
# Repeated elastic net for every single cell subset
training_cells <-
data.frame((data.table::fread("~/Data/Terekhova2023/all_pbmcs/alternative_build/training_cells_datatable_final.csv")))
gene_list <- read.csv("~/Data/Terekhova2023/all_pbmcs/alternative_build/gene_list.csv")$x
# training_cells <- data.frame(data.table::fread("~/Data/Terekhova2023/all_pbmcs/alternative_build/training_cells_datatable.csv"))
training_cells_metadata <- read.csv("~/Data/model_making/training_metadata_final.csv")
training_cells_metadata$cell_id <- str_replace_all( training_cells_metadata$cell_id,"-",".")
training_cells <- training_cells[,training_cells_metadata$cell_id]
any(!colnames(training_cells) == training_cells_metadata$cell_id)
training_cells <- data.frame(t(training_cells))
colnames(training_cells) <- gene_list
rownames(training_cells) <- training_cells_metadata$cell_id
training_cells$sex <- as.numeric(training_cells_metadata$sex=="Male")
training_cells <- data.frame(t(training_cells))
########
cluster <- "Regulatory"
gene_list <- read.csv("~/Data/Terekhova2023/all_pbmcs/alternative_build/gene_list.csv")$x
gene_list <- str_replace_all(gene_list,"-",".")
training_cells_metadata_loop <-
training_cells_metadata[training_cells_metadata$cell_type==cluster,]
training_cells_loop <- training_cells[,colnames(training_cells) %in%
training_cells_metadata_loop$cell_id]
training_cells_loop <- training_cells_loop[,training_cells_metadata_loop$cell_id]
model <-cv.glmnet((t(training_cells_loop)),
training_cells_metadata_loop$age,
verbose=TRUE)
nonzero_coefs <- rownames(extract.coef(model) %>% dplyr::arrange(desc(abs(Value))))
nonzero_coefs <- nonzero_coefs[!(nonzero_coefs %in% "(Intercept)")]
small_training_cells <- training_cells_loop[rownames(training_cells_loop) %in%
nonzero_coefs,]
small_training_cells <- small_training_cells[nonzero_coefs,]
small_model <-cv.glmnet((t(small_training_cells)),
training_cells_metadata_loop$age,
verbose=TRUE)
saveRDS(small_model,paste0("~/Data/model_making/model_",cluster,".RData"))
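AgePrediction.R fits a cross-validated glmnet model, keeps the features with non-zero coefficients, and refits on that reduced set. The sketch below illustrates that two-stage pattern on simulated data, using coef() rather than coefplot::extract.coef. The simulated matrix, the gene_ feature names, and the choice of s = "lambda.min" are assumptions made for illustration and are not the settings used to produce the thesis models.

```r
library(glmnet)

# Simulated data (made up): 200 "cells" by 20 "genes".
set.seed(1)
n_cells <- 200
n_genes <- 20
x <- matrix(rnorm(n_cells * n_genes), nrow = n_cells,
            dimnames = list(NULL, paste0("gene_", seq_len(n_genes))))
# Simulated "age" depends on only the first two features.
age <- 50 + 3 * x[, "gene_1"] - 2 * x[, "gene_2"] + rnorm(n_cells, sd = 0.5)

# Stage 1: cross-validated lasso (alpha = 1 is the glmnet default).
full_model <- cv.glmnet(x, age, alpha = 1)
coef_values <- as.matrix(coef(full_model, s = "lambda.min"))[, 1]
selected <- setdiff(names(coef_values)[coef_values != 0], "(Intercept)")

# Stage 2: refit using only the selected features. cv.glmnet needs at least
# two predictor columns, so guard against a degenerate selection.
stopifnot(length(selected) >= 2)
small_model <- cv.glmnet(x[, selected, drop = FALSE], age, alpha = 1)
predicted <- as.numeric(predict(small_model, x[, selected, drop = FALSE],
                                s = "lambda.min"))
print(selected)
```

Filtering on the coefficient values, rather than taking every row name returned by coef(), is what restricts the second fit to the selected features.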
MultipleModelConstructors.R: Code used to validate clock construction pipeline and to
determine the optimal number of clusters to use for single cell age prediction
library(Seurat)
library(glmnet)
library(dplyr)
library(Metrics)
library(stringr)
library(data.table)
#TEN SETS OF SIX MODELS
# Load and process training data
training_cells <-
data.frame(fread("~/Data/Terekhova2023/all_pbmcs/alternative_build/training_cells_datatable_final.csv"))
gene_list <- read.csv("~/Data/Terekhova2023/all_pbmcs/alternative_build/gene_list.csv")$x
training_cells_metadata <- read.csv("~/Data/model_making/training_metadata_final.csv")
training_cells_metadata$cell_id <- str_replace_all(training_cells_metadata$cell_id, "-", ".")
training_cells <- training_cells[, training_cells_metadata$cell_id]
stopifnot(all(colnames(training_cells) == training_cells_metadata$cell_id))
training_cells <- data.frame(t(training_cells))
colnames(training_cells) <- gene_list
rownames(training_cells) <- training_cells_metadata$cell_id
training_cells$sex <- as.numeric(training_cells_metadata$sex == "Male")
# Load and process test data
test_cells <-
data.frame(fread("~/Data/Terekhova2023/all_pbmcs/alternative_build/test_cells_datatable_final.csv"))
test_cells_metadata <- read.csv("~/Data/model_making/test_metadata_final.csv")
test_cells_metadata$cell_id <- str_replace_all(test_cells_metadata$cell_id, "-", ".")
test_cells <- test_cells[, test_cells_metadata$cell_id]
stopifnot(all(colnames(test_cells) == test_cells_metadata$cell_id))
test_cells <- data.frame(t(test_cells))
colnames(test_cells) <- gene_list
rownames(test_cells) <- test_cells_metadata$cell_id
test_cells$sex <- as.numeric(test_cells_metadata$sex == "Male")
# Initialize lists to store results
mae_list <- c()
r_list <- c()
# Define clusters
clusters <- c("Naive Helper", "Memory Helper", "Naive Cytotoxic",
"Effector Cytotoxic", "Memory Cytotoxic", "Regulatory")
training_cells_metadata <- training_cells_metadata[,c(8:16)]
training_cells_metadata <- training_cells_metadata[,-c(7)]
test_cells_metadata <- test_cells_metadata[,c(6:13)]
all_cells <- rbind(training_cells,test_cells)
all_metadata <- rbind(training_cells_metadata,test_cells_metadata)
for (rep in 1:10) {
set.seed(rep) # Different seed for each iteration to introduce randomness
models <- list()
# Sample training donors
training_donors_indices <- sample(seq_len(length(unique(all_metadata$donor_id))),
length(unique(all_metadata$donor_id)) * 8 / 10)
training_donors <- unique(all_metadata$donor_id)[training_donors_indices]
test_donors <- unique(all_metadata$donor_id)[!(unique(all_metadata$donor_id) %in%
training_donors)]
# Training/test split on cells by donor
training_cell_id <- all_metadata[all_metadata$donor_id %in% training_donors, "cell_id"]
test_cell_id <- all_metadata[all_metadata$donor_id %in% test_donors, "cell_id"]
training_metadata_split <- all_metadata[all_metadata$donor_id %in% training_donors, ]
test_metadata_split <- all_metadata[all_metadata$donor_id %in% test_donors, ]
# Filter the datasets based on training and test cells
training_cells_split <- all_cells[rownames(all_cells) %in% training_cell_id, ]
test_cells_split <- all_cells[rownames(all_cells) %in% test_cell_id, ]
# Train cell type prediction model
cell_type_model <- cv.glmnet(as.matrix(training_cells_split),
as.factor(training_metadata_split$predicted_cluster), family = "multinomial")
# Predict cell types for the test set
predicted_clusters <- predict(cell_type_model, as.matrix(test_cells_split), type = "class")
test_metadata_split$predicted_cluster <- as.character(predicted_clusters)
# Loop through each cluster to train models for age prediction
for (cluster in clusters) {
# Subset training data by cluster
training_cluster_metadata <-
training_metadata_split[training_metadata_split$predicted_cluster == cluster, ]
small_training_cells <- training_cells_split[rownames(training_cells_split) %in%
training_cluster_metadata$cell_id, ]
if (nrow(small_training_cells) == 0) next # Skip if no cells for this cluster
# Train the initial model using all features
model <- cv.glmnet(as.matrix(small_training_cells), training_cluster_metadata$age, alpha =
1)
# Extract non-zero coefficients (excluding intercept)
coef_values <- as.matrix(coef(model, s = "lambda.min"))[, 1]
nonzero_coefs <- names(coef_values)[coef_values != 0]
nonzero_coefs <- nonzero_coefs[nonzero_coefs != "(Intercept)"]
# Select features with non-zero coefficients
small_training_cells <- small_training_cells[, nonzero_coefs, drop = FALSE]
# Train a smaller model using selected features
small_model <- cv.glmnet(as.matrix(small_training_cells), training_cluster_metadata$age,
alpha = 1)
models[[cluster]] <- small_model
}
individual_predictions_clusters <- data.frame()
# Loop through each cluster to make predictions
for (cluster in clusters) {
if (!is.null(models[[cluster]])) {
coefs <- rownames(coef(models[[cluster]], s = "lambda.min"))[-1]
test_matrix_metadata_cluster <- test_metadata_split[test_metadata_split$predicted_cluster
== cluster, ]
test_matrix_cluster <- test_cells_split[rownames(test_cells_split) %in%
test_matrix_metadata_cluster$cell_id, ]
test_matrix_cluster <- test_matrix_cluster[, colnames(test_matrix_cluster) %in% coefs]
test_matrix_cluster <- test_matrix_cluster[, coefs, drop = FALSE]
if (nrow(test_matrix_cluster) == 0) next # Skip if no cells for this cluster
predictions <- predict(models[[cluster]], as.matrix(test_matrix_cluster), s = "lambda.min")
predictions_df <- data.frame(predicted_age = as.numeric(predictions))
predictions_df$cell_id <- rownames(test_matrix_cluster)
predictions_df$predicted_cluster <- cluster
individual_predictions_clusters <- rbind(individual_predictions_clusters, predictions_df)
}
}
# Merge predictions with the test metadata
individual_predictions_clusters <-
individual_predictions_clusters[match(test_metadata_split$cell_id,
individual_predictions_clusters$cell_id), ]
test_metadata_split$predicted_age <-
as.numeric(individual_predictions_clusters$predicted_age)
# Ensure both columns are numeric
test_metadata_split$age <- as.numeric(test_metadata_split$age)
test_metadata_split$predicted_age <- as.numeric(test_metadata_split$predicted_age)
# Remove rows with NA values in predicted_age or age
test_metadata_split <- test_metadata_split %>%
filter(!is.na(predicted_age) & !is.na(age))
# Calculate overall metrics
overall_cor <- cor(test_metadata_split$predicted_age, test_metadata_split$age)
overall_mae <- mae(test_metadata_split$predicted_age, test_metadata_split$age)
test_metadata_donor <- test_metadata_split %>%
group_by(donor_id) %>%
summarize(age = mean(age), predicted_age = mean(predicted_age))
overall_donor_cor <- cor(test_metadata_donor$age, test_metadata_donor$predicted_age)
overall_donor_mae <- mae(test_metadata_donor$age, test_metadata_donor$predicted_age)
mae_list <- append(mae_list, overall_mae)
r_list <- append(r_list, overall_cor)
# Print current results
print(paste("Iteration:", rep))
print(paste("Overall Correlation (R):", overall_cor))
print(paste("Overall MAE:", overall_mae))
print(paste("Donor-based Correlation (R):", overall_donor_cor))
print(paste("Donor-based MAE:", overall_donor_mae))
}
# Print final results
print("MAE Results:")
print(mae_list)
print("Correlation Results:")
print(r_list)
###############################
# ONE BIG MODEL
# Load necessary libraries
library(data.table)
library(stringr)
library(glmnet)
library(Metrics)
library(dplyr)
# Load and process training data
training <-
data.frame(fread("~/Data/Terekhova2023/all_pbmcs/alternative_build/training_cells_datatable_final.csv"))
gene_list <- read.csv("~/Data/Terekhova2023/all_pbmcs/alternative_build/gene_list.csv")$x
training_metadata <- read.csv("~/Data/model_making/training_metadata_final.csv")
training_metadata$cell_id <- str_replace_all(training_metadata$cell_id, "-", ".")
training <- training[, training_metadata$cell_id]
stopifnot(all(colnames(training) == training_metadata$cell_id))
training <- data.frame(t(training))
colnames(training) <- gene_list
rownames(training) <- training_metadata$cell_id
training$sex <- as.numeric(training_metadata$sex == "Male")
# Load and process test data
test <-
data.frame(fread("~/Data/Terekhova2023/all_pbmcs/alternative_build/test_cells_datatable_final.csv"))
test_metadata <- read.csv("~/Data/model_making/test_metadata_final.csv")
test_metadata$cell_id <- str_replace_all(test_metadata$cell_id, "-", ".")
test <- test[, test_metadata$cell_id]
stopifnot(all(colnames(test) == test_metadata$cell_id))
test <- data.frame(t(test))
colnames(test) <- gene_list
rownames(test) <- test_metadata$cell_id
test$sex <- as.numeric(test_metadata$sex == "Male")
# Combine all cells and metadata
all_cells <- rbind(training, test)
training_metadata <- training_metadata[,c(8:16)]
training_metadata <- training_metadata[,-c(7)]
test_metadata <- test_metadata[,c(6:13)]
all_metadata <- rbind(training_metadata, test_metadata)
# Initialize lists to store results
r_list <- c()
mae_list <- c()
dbd_r_list <- c()
dbd_mae_list <- c()
# Loop to train models and evaluate performance
for (i in 1:10) {
set.seed(i) # Different seed for each iteration to introduce randomness
# Sample training donors
training_donors_indices <- sample(seq_len(length(unique(all_metadata$donor_id))),
length(unique(all_metadata$donor_id)) * 8 / 10)
training_donors <- unique(all_metadata$donor_id)[training_donors_indices]
test_donors <- unique(all_metadata$donor_id)[!(unique(all_metadata$donor_id) %in%
training_donors)]
# Now doing a training/test split on cells by donor
training_cell_id <- all_metadata[all_metadata$donor_id %in% training_donors, "cell_id"]
test_cell_id <- all_metadata[all_metadata$donor_id %in% test_donors, "cell_id"]
training_metadata_split <- all_metadata[all_metadata$donor_id %in% training_donors, ]
test_metadata_split <- all_metadata[all_metadata$donor_id %in% test_donors, ]
# Filter the datasets based on training and test cells
training_split <- all_cells[rownames(all_cells) %in% training_cell_id, ]
test_split <- all_cells[rownames(all_cells) %in% test_cell_id, ]
# Update training ages
training_ages <- training_metadata_split$age
# Train the initial model using all features
model <- cv.glmnet(as.matrix(training_split), training_ages, alpha = 1)
# Extract non-zero coefficients (excluding intercept)
coef_values <- as.matrix(coef(model, s = "lambda.min"))[, 1]
nonzero_coefs <- names(coef_values)[coef_values != 0]
nonzero_coefs <- nonzero_coefs[nonzero_coefs != "(Intercept)"]
# Select features with non-zero coefficients
small_training <- training_split[, colnames(training_split) %in% nonzero_coefs]
small_training <- small_training[, nonzero_coefs]
# Train a smaller model using selected features
small_model <- cv.glmnet(as.matrix(small_training), training_ages, alpha = 1)
nonzero_coefs <- rownames(coef(small_model, s = "lambda.min"))[-1]
# Subset test data to use the same features
test_data_subset <- test_split[, colnames(test_split) %in% nonzero_coefs]
test_data_subset <- test_data_subset[, nonzero_coefs, drop = FALSE]
# Predict ages using the smaller model
predictions <- predict(small_model, as.matrix(test_data_subset), s = "lambda.min")
test_metadata_split$predicted_age <- as.numeric(predictions)
# Calculate metrics
correlation <- cor(test_metadata_split$predicted_age, test_metadata_split$age)
mae <- Metrics::mae(test_metadata_split$predicted_age, test_metadata_split$age)
# Store metrics
r_list <- append(r_list, correlation)
mae_list <- append(mae_list, mae)
test_metadata_grouped <- test_metadata_split %>%
dplyr::group_by(donor_id) %>%
dplyr::summarize(mean_age = mean(age), mean_predicted_age = mean(predicted_age))
dbd_r <- cor(test_metadata_grouped$mean_age, test_metadata_grouped$mean_predicted_age)
dbd_mae <- Metrics::mae(test_metadata_grouped$mean_age,
test_metadata_grouped$mean_predicted_age)
dbd_mae_list <- append(dbd_mae_list, dbd_mae)
dbd_r_list <- append(dbd_r_list, dbd_r)
# Print current results
print(paste("Iteration:", i))
print(paste("Correlation (R):", correlation))
print(paste("MAE:", mae))
print(paste("Donor-based Correlation (R):", dbd_r))
print(paste("Donor-based MAE:", dbd_mae))
}
# Print final results
print("Correlation Results:")
print(r_list)
print("MAE Results:")
print(mae_list)
print("Donor-based Correlation Results:")
print(dbd_r_list)
print("Donor-based MAE Results:")
print(dbd_mae_list)
##################################################################################
# Code to get a sense of how number of clusters affects prediction performance.
# Load necessary libraries
library(Seurat)
library(readr)
library(data.table)
library(dplyr)
library(Metrics)
library(glmnet)
library(ggplot2)
library(stringr)
library(parallel)
library(foreach)
library(doParallel)
# Reading in all PBMC metadata
all_pbmcs_metadata <-
data.frame(read.csv("~/Data/Terekhova2023/all_pbmcs/all_pbmcs_metadata.csv"))
# Filter and rename columns
all_pbmcs_metadata <- all_pbmcs_metadata[, c(1, 12, 14, 15, 19, 17)]
colnames(all_pbmcs_metadata) <- c("cell_id", "donor_id", "sex", "age", "cell_type", "batch")
# Load training data
training_seurat <-
readRDS("~/Data/Terekhova2023/all_pbmcs/alternative_build/training_cells_alternative_clustered.RDS")
# Set up metadata for the training set
training_metadata <- data.frame(all_pbmcs_metadata[all_pbmcs_metadata$cell_id %in%
colnames(training_seurat), ])
rownames(training_metadata) <- training_metadata$cell_id
training_metadata <- training_metadata[colnames(training_seurat), ]
stopifnot(all(rownames(training_metadata) == colnames(training_seurat)))
training_data <- data.frame(t(training_seurat@assays$RNA$counts))
# Load test data
test_seurat <-
readRDS("~/Data/Terekhova2023/all_pbmcs/alternative_build/test_cells_alternative_clustered.RDS")
test_metadata <- data.frame(all_pbmcs_metadata[all_pbmcs_metadata$cell_id %in%
colnames(test_seurat), ])
rownames(test_metadata) <- test_metadata$cell_id
test_metadata <- test_metadata[colnames(test_seurat), ]
stopifnot(all(rownames(test_metadata) == colnames(test_seurat)))
# Prepare test data for prediction
test_data <- data.frame(t(test_seurat@assays$RNA$counts))
# Function to perform the workflow
perform_workflow <- function(training_seurat, training_metadata, training_data, test_data,
test_metadata, resolution, output_dir) {
# Copy the training Seurat object to avoid modifying the original
training_seurat_copy <- training_seurat
# Find clusters
training_seurat_copy <- FindClusters(training_seurat_copy, resolution = resolution, verbose =
FALSE)
clusters <- training_seurat_copy@meta.data %>% dplyr::group_by(seurat_clusters) %>%
dplyr::summarize(n = dplyr::n())
sufficient_cells <- clusters[(clusters$n) > 10, 1]
# Pass cluster labels as character so factor codes are never mistaken for cluster names
training_seurat_copy <- subset(training_seurat_copy, idents =
as.character(sufficient_cells$seurat_clusters))
training_metadata <- training_metadata[training_metadata$cell_id %in%
colnames(training_seurat_copy), ]
rownames(training_metadata) <- training_metadata$cell_id
training_metadata <- training_metadata[colnames(training_seurat_copy), ]
training_metadata$cluster <- Idents(training_seurat_copy)
# Prepare training data for cell type prediction
training_labels <- training_metadata$cluster
training_data_filtered <- training_data[rownames(training_data) %in%
training_metadata$cell_id, ]
# Train a cell type prediction model using glmnet
cell_type_model <- cv.glmnet(as.matrix(training_data_filtered), as.factor(training_labels),
family = "multinomial",
parallel = TRUE)
# Train age prediction models for each cluster
age_models <- list()
for (cluster in unique(training_metadata$cluster)) {
cluster_data <- training_data_filtered[training_metadata$cluster == cluster, ]
cluster_metadata <- training_metadata[training_metadata$cluster == cluster, ]
age_model <- cv.glmnet(as.matrix(cluster_data), cluster_metadata$age,alpha=1)
age_models[[as.character(cluster)]] <- age_model
}
predicted_clusters <- predict(cell_type_model, as.matrix(test_data), type = "class")
test_metadata$predicted_cluster <- as.character(predicted_clusters)
# Predict ages for test data based on predicted clusters
test_metadata$predicted_age <- NA
for (cluster in unique(test_metadata$predicted_cluster)) {
cluster_data <- test_data[test_metadata$predicted_cluster == cluster, ]
age_model <- age_models[[cluster]]
predicted_ages <- predict(age_model, as.matrix(cluster_data))
test_metadata$predicted_age[test_metadata$predicted_cluster == cluster] <- predicted_ages
}
# Calculate overall metrics (Metrics::mae has no na.rm argument, so drop NAs first)
scored <- test_metadata[!is.na(test_metadata$predicted_age) & !is.na(test_metadata$age), ]
overall_cor <- cor(scored$predicted_age, scored$age)
overall_mae <- mae(scored$age, scored$predicted_age)
test_metadata_donor <- scored %>%
group_by(donor_id) %>%
summarize(age = mean(age), predicted_age = mean(predicted_age))
overall_donor_cor <- cor(test_metadata_donor$age, test_metadata_donor$predicted_age)
overall_donor_mae <- mae(test_metadata_donor$age, test_metadata_donor$predicted_age)
# Save the results to a CSV file
result <- data.frame(
resolution = resolution,
num_clusters = length(unique(training_metadata$cluster)),
overall_cor = overall_cor,
overall_mae = overall_mae,
overall_donor_cor = overall_donor_cor,
overall_donor_mae = overall_donor_mae
)
write.csv(result, file = paste0(output_dir, "/results_resolution_", resolution, ".csv"),
row.names = FALSE)
return(result)
}
# Set up parallel backend
num_cores <- detectCores() - 1
cl <- makeCluster(num_cores)
registerDoParallel(cl)
# Perform the workflow for 10 different resolutions from 0.05 to 0.5
output_dir <- "~/Data/model_making/"
dir.create(output_dir, showWarnings = FALSE)
resolutions <- seq(0.05, 0.5, length.out = 10)
results <- foreach(res = resolutions, .combine = rbind, .packages = c("Seurat", "dplyr", "glmnet",
"Metrics", "data.table")) %dopar% {
perform_workflow(training_seurat, training_metadata, training_data, test_data, test_metadata,
res, output_dir)
}
# Stop the cluster
stopCluster(cl)
# Save the combined results to a CSV file
write.csv(results, file = paste0(output_dir, "/combined_results.csv"), row.names = FALSE)
# Plot results
ggplot(results, aes(x = resolution, y = overall_mae)) +
geom_line() +
geom_point() +
labs(title = "Effect of Clustering Resolution on Prediction Performance",
x = "Clustering Resolution",
y = "Overall MAE")
ggplot(results, aes(x = resolution, y = overall_cor)) +
geom_line() +
geom_point() +
labs(title = "Effect of Clustering Resolution on Prediction Performance",
x = "Clustering Resolution",
y = "Overall Correlation")
####################
# optimizing elastic net alpha
library(Seurat)
library(glmnet)
library(dplyr)
library(Metrics)
library(stringr)
library(data.table)
# Load and process training data
training_cells <-
data.frame(fread("~/Data/Terekhova2023/all_pbmcs/alternative_build/training_cells_datatable_final.csv"))
gene_list <- read.csv("~/Data/Terekhova2023/all_pbmcs/alternative_build/gene_list.csv")$x
training_cells_metadata <- read.csv("~/Data/model_making/training_metadata_final.csv")
training_cells_metadata$cell_id <- str_replace_all(training_cells_metadata$cell_id, "-", ".")
training_cells <- training_cells[, training_cells_metadata$cell_id]
stopifnot(all(colnames(training_cells) == training_cells_metadata$cell_id))
training_cells <- data.frame(t(training_cells))
colnames(training_cells) <- gene_list
rownames(training_cells) <- training_cells_metadata$cell_id
training_cells$sex <- as.numeric(training_cells_metadata$sex == "Male")
# Load and process test data
test_cells <-
data.frame(fread("~/Data/Terekhova2023/all_pbmcs/alternative_build/test_cells_datatable_final.csv"))
test_cells_metadata <- read.csv("~/Data/model_making/test_metadata_final.csv")
test_cells_metadata$cell_id <- str_replace_all(test_cells_metadata$cell_id, "-", ".")
test_cells <- test_cells[, test_cells_metadata$cell_id]
stopifnot(all(colnames(test_cells) == test_cells_metadata$cell_id))
test_cells <- data.frame(t(test_cells))
colnames(test_cells) <- gene_list
rownames(test_cells) <- test_cells_metadata$cell_id
test_cells$sex <- as.numeric(test_cells_metadata$sex == "Male")
# Define clusters
clusters <- c("Naive Helper", "Memory Helper", "Naive Cytotoxic", "Effector Cytotoxic",
"Memory Cytotoxic", "Regulatory")
# Combine all cells and metadata
all_cells <- rbind(training_cells, test_cells)
training_cells_metadata <- training_cells_metadata[, c("cell_id", "donor_id", "sex", "age",
"predicted_cluster")]
test_cells_metadata <- test_cells_metadata[, c("cell_id", "donor_id", "sex", "age",
"predicted_cluster")]
all_metadata <- rbind(training_cells_metadata, test_cells_metadata)
# Initialize lists to store results
results <- list()
# Define a range of alpha values to try
alpha_values <- seq(0, 1, by = 0.1)
# Loop through each alpha value
for (alpha_value in alpha_values) {
mae_list <- c()
r_list <- c()
# Loop to fit models and evaluate performance 10 times
for (rep in 1:10) {
set.seed(rep) # Different seed for each iteration to introduce randomness
models <- list()
# Sample 80% of donors for training; the remaining donors form the test set
all_donors <- unique(all_metadata$donor_id)
training_donors <- sample(all_donors, floor(length(all_donors) * 8 / 10))
test_donors <- setdiff(all_donors, training_donors)
# Training/test split on cells by donor
training_metadata_split <- all_metadata[all_metadata$donor_id %in% training_donors, ]
test_metadata_split <- all_metadata[all_metadata$donor_id %in% test_donors, ]
training_cell_id <- training_metadata_split$cell_id
test_cell_id <- test_metadata_split$cell_id
# Filter the datasets based on training and test cells
training_cells_split <- all_cells[rownames(all_cells) %in% training_cell_id, ]
test_cells_split <- all_cells[rownames(all_cells) %in% test_cell_id, ]
# Train cell type prediction model
cell_type_model <- cv.glmnet(as.matrix(training_cells_split),
as.factor(training_metadata_split$predicted_cluster), family = "multinomial", alpha =
alpha_value)
# Predict cell types for the test set
predicted_clusters <- predict(cell_type_model, as.matrix(test_cells_split), type = "class")
test_metadata_split$predicted_cluster <- as.character(predicted_clusters)
# Loop through each cluster to train models for age prediction
for (cluster in clusters) {
# Subset training data by cluster
training_cluster_metadata <-
training_metadata_split[training_metadata_split$predicted_cluster == cluster, ]
small_training_cells <- training_cells_split[rownames(training_cells_split) %in%
training_cluster_metadata$cell_id, ]
if (nrow(small_training_cells) == 0) next # Skip if no cells for this cluster
# Train the initial model using all features
model <- cv.glmnet(as.matrix(small_training_cells), training_cluster_metadata$age, alpha =
alpha_value)
# Extract features with non-zero coefficients (excluding the intercept)
model_coefs <- coef(model, s = "lambda.min")
nonzero_coefs <- rownames(model_coefs)[model_coefs[, 1] != 0]
nonzero_coefs <- nonzero_coefs[nonzero_coefs != "(Intercept)"]
if (length(nonzero_coefs) < 2) next # cv.glmnet needs at least two predictors
# Select features with non-zero coefficients
small_training_cells <- small_training_cells[, nonzero_coefs, drop = FALSE]
# Train a smaller model using selected features
small_model <- cv.glmnet(as.matrix(small_training_cells), training_cluster_metadata$age,
alpha = alpha_value)
models[[cluster]] <- small_model
}
individual_predictions_clusters <- data.frame()
# Loop through each cluster to make predictions
for (cluster in clusters) {
if (!is.null(models[[cluster]])) {
coefs <- rownames(coef(models[[cluster]], s = "lambda.min"))[-1]
test_matrix_metadata_cluster <- test_metadata_split[test_metadata_split$predicted_cluster
== cluster, ]
test_matrix_cluster <- test_cells_split[rownames(test_cells_split) %in%
test_matrix_metadata_cluster$cell_id, ]
test_matrix_cluster <- test_matrix_cluster[, colnames(test_matrix_cluster) %in% coefs]
test_matrix_cluster <- test_matrix_cluster[, coefs, drop = FALSE]
if (nrow(test_matrix_cluster) == 0) next # Skip if no cells for this cluster
predictions <- predict(models[[cluster]], as.matrix(test_matrix_cluster), s = "lambda.min")
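# Flatten the one-column prediction matrix so the column is reliably named predicted_age
predictions_df <- data.frame(predicted_age = as.numeric(predictions))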
predictions_df$cell_id <- rownames(test_matrix_cluster)
predictions_df$predicted_cluster <- cluster
individual_predictions_clusters <- rbind(individual_predictions_clusters, predictions_df)
}
}
# Merge predictions with the test metadata
individual_predictions_clusters <-
individual_predictions_clusters[match(test_metadata_split$cell_id,
individual_predictions_clusters$cell_id), ]
test_metadata_split$predicted_age <-
as.numeric(individual_predictions_clusters$predicted_age)
# Ensure both columns are numeric
test_metadata_split$age <- as.numeric(test_metadata_split$age)
test_metadata_split$predicted_age <- as.numeric(test_metadata_split$predicted_age)
# Remove rows with NA values in predicted_age or age
test_metadata_split <- test_metadata_split %>%
filter(!is.na(predicted_age) & !is.na(age))
# Calculate overall metrics
overall_cor <- cor(test_metadata_split$predicted_age, test_metadata_split$age)
overall_mae <- mae(test_metadata_split$predicted_age, test_metadata_split$age)
test_metadata_donor <- test_metadata_split %>%
group_by(donor_id) %>%
summarize(age = mean(age), predicted_age = mean(predicted_age))
overall_donor_cor <- cor(test_metadata_donor$age, test_metadata_donor$predicted_age)
overall_donor_mae <- mae(test_metadata_donor$age, test_metadata_donor$predicted_age)
mae_list <- append(mae_list, overall_mae)
r_list <- append(r_list, overall_cor)
# Print current results
print(paste("Alpha:", alpha_value, "Iteration:", rep))
print(paste("Overall Correlation (R):", overall_cor))
print(paste("Overall MAE:", overall_mae))
print(paste("Donor-based Correlation (R):", overall_donor_cor))
print(paste("Donor-based MAE:", overall_donor_mae))
}
# Store results for this alpha value
results[[as.character(alpha_value)]] <- list(mae = mae_list, correlation = r_list)
}
# Print final results for all alpha values
for (alpha_value in names(results)) {
print(paste("Alpha:", alpha_value))
print("MAE Results:")
print(results[[alpha_value]]$mae)
print("Correlation Results:")
print(results[[alpha_value]]$correlation)
}
Abstract
Validated, reliable, and interpretable biomarkers are critical for understanding the aging process. They are used both to screen candidate aging interventions in vitro and to understand clinically how the aging process can be affected in humans. One of the main changes that occurs during aging is in the immune system, where cell composition shifts significantly toward more differentiated cells. This shift has an important impact on biomarker measurements of aging because it complicates their interpretation: it becomes unclear whether a longevity intervention is affecting aging at the cellular level or changing cellular composition.
Here, I characterize how epigenetic biomarkers of aging are impacted by changes in cell type composition. To address the potential issues with this association, I developed a novel biomarker based on DNA methylation (an epigenetic clock) that is robust to changes in immune cell composition. This new biomarker captures several aspects of intrinsic epigenetic changes, including the increase of epigenetic age prediction due to senescence. As an alternative method to study cell-intrinsic effects of aging, I also describe the creation of a single-cell transcriptomic cell type and age predictor focusing on T cells. In total, this work assists in the effort to identify and understand the fundamental process of aging via the utilization of biomarkers, on both a cellular and organismal scale.
Asset Metadata
Creator: Tomusiak, Alan (author)
Core Title: Identifying and measuring cell-intrinsic and cell-extrinsic factors influencing aging
School: Leonard Davis School of Gerontology
Degree: Doctor of Philosophy
Degree Program: Biology of Aging
Degree Conferral Date: 2024-05
Publication Date: 12/03/2024
Defense Date: 06/03/2024
Publisher: Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag: aging, biomarkers, epigenetic clocks, immunology, longevity, OAI-PMH Harvest, T cells
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Verdin, Eric (committee chair), Benayoun, Berenice (committee member), Ellerby, Lisa (committee member), Furman, David (committee member), Winer, Dan (committee member)
Creator Email: tomusiak@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC113987446
Unique identifier: UC113987446
Identifier: etd-TomusiakAl-13055.pdf (filename)
Legacy Identifier: etd-TomusiakAl-13055
Document Type: Dissertation
Rights: Tomusiak, Alan
Internet Media Type: application/pdf
Type: texts
Source: 20240605-usctheses-batch-1165 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu