GENE EXPRESSION AND ANGIOGENESIS PATHWAY
ACROSS DNA METHYLATION SUBTYPES IN COLON
ADENOCARCINOMA
by
Songren Wang
A Thesis Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
MASTER OF SCIENCE
(APPLIED BIOSTATISTICS AND EPIDEMIOLOGY)
May 2018
Copyright 2018 Songren Wang
Dedication
To my husband, Fan Fei,
my mother, Hongxia Li and my father, Qinghe Wang
for their support and understanding
Acknowledgements
This work was done under the direction and supervision of my guidance committee
chair, Dr. Kimberly Siegmund. I would like to give my gratitude and appreciation
to Dr. Meredith Franklin for her invaluable guidance and support provided
throughout my Master’s Program. I am especially grateful to my guidance
committee member, Dr. Anna Wu, for her endless guidance and support
throughout the preparation of this manuscript. I would also like to extend my
appreciation to another member of my guidance committee, Dr. Joshua Millstein,
for his advice and suggestions throughout the course of my research and in the
preparation of this manuscript. In addition, I would like to extend my special
thanks and gratitude to Dr. Huaiyu Mi for his exceptional guidance throughout the
course of my research and the writing of my thesis.
Table of Contents
Dedication
Acknowledgements
Table of Contents
List of Tables and Figures
Abstract
1. Introduction
2. Methods
2.1 Data
2.2 Statistical methods
3. Results
4. Discussion
Appendix
Bibliography
List of Tables and Figures
Figure 1. Multidimensional scaling plot (MDS) of gene expression data
Figure 2. Numbers of statistically significant genes from moderated t-tests,
adjusted for unwanted variation
Figure 3. Volcano plot for significantly different expression of angiogenesis
pathway-related genes distribution among all genes
Figure 4. Multidimensional scaling plot (MDS) using gene expression from
angiogenesis genes
Table 1. Summary of colon cancer biological classification: CIMP
Table 2. Differential expression results (FDR-adjusted p<0.05)
Table 3. Differential expression results (fold change > 2 and FDR-adjusted p<0.05)
Abstract
Background: In metastatic colon cancer, clinical observation suggests that benefit
from bevacizumab, a targeted therapy that disrupts angiogenesis, is limited to
patients with tumors classified as having the CpG island methylator phenotype
(CIMP). This led us to investigate whether angiogenesis genes are differentially
expressed in CIMP tumors compared to non-CIMP tumors.
Methods: We analyzed publicly available gene expression data on 164 colon
adenocarcinomas from The Cancer Genome Atlas (TCGA). We used moderated
t-tests to assess differential expression among 33 CIMP-high, 38 CIMP-low and 93
non-CIMP cancers (FDR<0.05). Enrichment for angiogenesis genes in the set of
differentially expressed genes was evaluated using Fisher’s exact test.
Results: We found 967 genes differentially expressed (FDR-adjusted p < 0.05 and
fold change > 2) between CIMP-high and non-CIMP tumors.
There were 11 angiogenesis genes in this set, but this did not represent an
enrichment (11/176 = 6.3% vs. 967/15061 = 6.4%).
Conclusion: We did not find strong evidence for differential expression of
angiogenesis genes distinguishing CIMP-high cancers. However, CIMP-high
tumors are heterogeneous, and a larger sample size that would allow us to restrict
the analysis to metastatic disease would be desirable.
Impact: Characterizing angiogenesis pathway enrichment among genes differentially
expressed in CIMP-H metastatic colon cancer could increase the understanding of colon cancer
pathogenesis and bring new insights for colon adenocarcinoma treatment.
1. Introduction
Colon cancer is an important contributor to worldwide cancer morbidity and
mortality, especially in regions with high fat and low fiber consumption,
including Europe, Australia and the United States [1]. Colon adenocarcinoma is the
most common subset, representing approximately 95% of all colon cancers.
Epidemiologic studies of colon cancer identify three types of risk factors for the
disease: age, genetic factors, and lifestyle factors. Current treatment
options for colon adenocarcinoma include immunotherapy, surgery, chemotherapy
and radiation therapy. These are primarily general cancer therapies; to
achieve better treatment outcomes, it would be reasonable to develop more specific
treatments targeting different disease subtypes.
Our ability to molecularly characterize different subgroups of colon cancer has
improved with the advent of high-throughput technologies. Early molecular
classifications identified microsatellite instability (MSI) and chromosomal
instability (CIN), both of which are pathways involved in colon carcinogenesis [2].
Promoter methylation and gene silencing causing the loss of MLH1
is the most common cause of MSI, especially among MSI-high (MSI-H) cancers.
CIMP was then proposed to describe a subset of colorectal tumors with an
exceptionally high frequency of methylation of ‘Type C’ loci, which were defined
as loci methylated in cancer but not in normal tissues [3]. According to
Weisenberger et al., colorectal cancers can be divided into CIMP-positive
and CIMP-negative groups, and CIMP-positive tumors are associated with BRAF
mutation [4]. Weisenberger et al. [5] also reported that clinicopathologic factors
in CIMP-positive tumors with MLH1 DNA methylation differed from those in
CIMP-positive tumors without DNA methylation of MLH1. Subjects with
CIMP-positive tumors without MLH1 methylation were significantly younger, more
likely to be male, and more likely to have distal colon or rectal primaries and the
MSI-L phenotype. CIMP-positive MLH1-unmethylated tumors were significantly
less likely than CIMP-positive MLH1-methylated tumors to harbor a BRAF
V600E mutation and significantly more likely to harbor a KRAS mutation. MLH1
methylation was associated with significantly better overall survival (HR, 0.50;
95% confidence interval, 0.31–0.82).
Later, Hinoue et al. [6] identified four DNA-methylation-based groups of colorectal
cancer from a cluster analysis of high-dimensional microarray data. Cluster
1 (CIMP-H) exhibits an exceptionally high frequency of cancer-specific
DNA hypermethylation and is strongly associated with MLH1 DNA
hypermethylation and the BRAF V600E mutation. Cluster 2 (CIMP-L)
is enriched for KRAS mutations and characterized by DNA
hypermethylation of a subset of CIMP-H-associated markers rather than a unique group
of CpG islands. The other two groups are named cluster 3 and cluster 4.
Compared to cluster 4, cluster 3 has a higher frequency of TP53 mutations and
distal colon occurrence.
Table 1. Summary of colon cancer biological classification: CIMP
First author | Toyota et al. | Weisenberger et al. | Hinoue et al.
Year | 1999 | 2006 | 2011
Journal | Proc. Natl. Acad. Sci. USA | Nature Genetics | Genome Research
Samples | 50 colorectal cancers | 187 colorectal cancers | 125 colorectal cancers
Platform | PCR-based, methylated CpG island amplification | PCR-based, MethyLight | Human Methylation 450 Array
Classification | CpG island methylator phenotype (CIMP) | CIMP+, CIMP- | CIMP-H, CIMP-L, Cluster 3, Cluster 4
Guinney et al. [7] evaluated the results of six colorectal cancer (CRC)
subtyping algorithms using 18 datasets (n = 4,151 patients), both public
(GSE42284, GSE33113, GSE39582, GSE35896, GSE13067, GSE13294,
GSE14333, GSE17536, GSE20916, GSE2109, TCGA) and proprietary.
They showed marked interconnectivity between the six independent classification
systems, which coalesced into four consensus molecular subtypes: CMS1 (microsatellite
instability immune, 14%), hypermutated, microsatellite unstable, with strong
immune activation; CMS2 (canonical, 37%), epithelial, with marked WNT and MYC
signaling activation; CMS3 (metabolic, 13%), epithelial, with evident metabolic
dysregulation; and CMS4 (mesenchymal, 23%), with prominent transforming growth
factor-β activation, stromal invasion and angiogenesis.
Previous research suggested that patients with CMS1 tumors, a subtype enriched for
CIMP-H, benefit from Avastin (bevacizumab), an angiogenesis inhibitor. This has led
to the hypothesis that the expression of genes involved in angiogenesis differs
between CIMP-H and non-CIMP tumors. To test this hypothesis, we accessed RNA
sequencing data and information on tumor CIMP status from The Cancer Genome Atlas
(TCGA). We used the PANTHER classification system [8], which is associated with
Gene Ontology, to identify genes involved in angiogenesis.
2. Methods
2.1 Data
We used publicly available TCGA data for our analysis. RNA sequencing data were
downloaded from GSE62944 [9] and clinical data, notably the CIMP calls, from
Multi-Assay Experiment [10]. The RNA count data covered 23367 genes in
468 samples. The Multi-Assay Experiment data set included patient and sample
characteristics, such as gender, race, age and DNA methylation-based subgroup (e.g.
CIMP high). The DNA methylation-based subgroup classified tumors into four
levels, CIMP-high, CIMP-low, Cluster 3 or Cluster 4, applying the definitions
from Hinoue et al. [6]. We restricted the RNA-seq data to samples with a measured
methylation subgroup, resulting in a final data set of 164 tumor samples:
33 samples defined as CIMP high (CIMP-H), 38 samples as CIMP low (CIMP-L),
and 93 non-CIMP samples (Cluster 3 or Cluster 4). For the RNA-seq data, we used
Gene Symbol annotation, which we linked to Entrez IDs using the Bioconductor package
org.Hs.eg.db. After linking, a total of 21039 genes remained.
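The sample-matching step can be illustrated with a minimal sketch in base R. It assumes, as in the appendix code, that the first 12 characters of a sample barcode identify the patient; the barcodes and the object names (expr_barcodes, clinical_ids) are hypothetical and serve only as an example.

```r
# Hypothetical sample barcodes (illustrative only, not real TCGA samples).
expr_barcodes <- c("TCGA-AA-0001-01A-01R", "TCGA-AA-0001-01B-01R",
                   "TCGA-AA-0002-01A-01R", "TCGA-AA-0003-01A-01R")
# Hypothetical patients with a methylation subgroup call.
clinical_ids <- c("TCGA-AA-0001", "TCGA-AA-0003")

# The first 12 characters of a barcode are taken as the patient identifier.
patient_id <- substr(expr_barcodes, 1, 12)

# Keep samples whose patient has a subgroup call, one sample per patient.
keep <- patient_id %in% clinical_ids & !duplicated(patient_id)
matched <- expr_barcodes[keep]
print(matched)
```

Keeping only the first sample per patient is one simple way to resolve duplicates; other rules (for example, preferring a particular vial or aliquot) would be equally defensible.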
2.2 Statistical methods
We performed differential expression analysis of RNA-seq data following the
methods described by Law et al. [11], as detailed below. All analyses were
performed in R version 3.4.3.
2.2.1 Data processing
2.2.1.1 Removing genes that are not expressed
We discarded genes that are not expressed at a biologically meaningful level, both
to restrict the analysis to genes that are expressed in our tissues and to reduce
the number of tests carried out downstream when looking at differential
expression. For sequencing data, the raw counts for each gene do not reflect
absolute expression levels: sequencing reads are 50-150 bp in
length and can map anywhere in the gene body, so of two genes with equal
expression levels but different lengths, the longer
gene will have the higher read count. Raw counts also scale with the total number
of reads sequenced in each sample. A simple approach is to standardize
counts to a common library size, the standard being (read) counts per million
mapped reads, or CPM. CPM does not adjust for gene length; this matters little
here because differential expression compares each gene with itself across samples.
In our analysis, genes were required to be expressed (CPM>1) in at least 10
samples to be retained for downstream analysis. Under this criterion, 15061 genes
were retained, approximately 64% of the 23367 genes in the original dataset.
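As an illustration of the filtering rule, the following base R sketch computes CPM on a toy count matrix and applies a "CPM > 1 in at least k samples" filter. The counts are made up, and k is set to 2 only because the toy matrix has three samples (the analysis itself used 10 of 164 samples). The appendix uses the cpm function from edgeR; the manual calculation here is a simplified stand-in that ignores normalization factors.

```r
# Toy count matrix: 4 genes x 3 samples (made-up numbers).
counts <- matrix(c(  0,   1,   0,
                    50,  80,  20,
                   500, 900, 300,
                     2,   0,   5),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# Counts per million: divide each column by its library size, times 1e6.
lib_size <- colSums(counts)
cpm_mat <- t(t(counts) / lib_size) * 1e6

# Keep genes with CPM > 1 in at least `min_samples` samples.
min_samples <- 2
keep <- rowSums(cpm_mat > 1) >= min_samples
filtered <- counts[keep, , drop = FALSE]
print(rownames(filtered))
```

In this toy example gene1 is dropped because it is detected in only one sample.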
2.2.1.2 Data transformation of gene expression levels
Another factor that affects the absolute level of gene expression counts is the depth
of sequencing. Libraries sequenced to a greater depth will result in higher counts.
To control for the variation in sequencing depth, we applied the normalization
method “trimmed mean of M-values” (TMM), and we transformed the count data to
mimic the distribution of log2 expression data from microarrays. The transformation
was applied using the voom function in the Bioconductor package limma. These
normalized and transformed data were used for all downstream analyses.
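The core of the transformation can be sketched in base R as follows. This is a simplified illustration of the log2 counts-per-million step only: it assumes the normalization factors are already available (in the appendix they come from calcNormFactors in edgeR) and it omits the mean-variance modelling and precision weights that voom also provides. The function name log_cpm and the toy counts are hypothetical.

```r
# log2-CPM with a 0.5 offset on counts and +1 on the effective library size.
# `norm_factors` stands in for TMM factors; 1 means no adjustment.
log_cpm <- function(counts, norm_factors = rep(1, ncol(counts))) {
  eff_lib <- colSums(counts) * norm_factors   # effective library sizes
  t(log2((t(counts) + 0.5) / (eff_lib + 1) * 1e6))
}

# Toy counts: 2 genes x 2 samples (made-up numbers).
counts <- matrix(c(10, 90, 30, 70), nrow = 2,
                 dimnames = list(c("geneA", "geneB"), c("s1", "s2")))

lc  <- log_cpm(counts)
lc2 <- log_cpm(counts, norm_factors = c(1, 2))
print(lc)
```

The offset keeps the logarithm finite for zero counts, and a larger normalization factor lowers the transformed values for that sample.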
2.2.2 Unsupervised and supervised statistical analysis
We used multidimensional scaling (MDS) to show similarities and differences
between samples in an unsupervised manner, to give an idea of the
extent to which gene expression separates the DNA methylation-based
colon cancer subtypes. The distance between each pair of samples is computed from
the 500 most informative genes for that pair (pairwise gene selection), the
default setting of the plotMDS command in the limma package.
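The idea behind this distance can be sketched in base R on simulated data. The sketch assumes that the distance between two samples is the root-mean-square of the largest log2 differences between them (the "leading log-fold-change"), and it uses cmdscale for the scaling step. The function leading_dist, the simulated matrix, and the choice of 50 top genes are illustrative assumptions, not the limma implementation.

```r
set.seed(1)
# Simulated log2 expression: 200 genes x 6 samples, two groups of three.
expr <- matrix(rnorm(200 * 6), nrow = 200)
expr[1:20, 4:6] <- expr[1:20, 4:6] + 5   # 20 genes shifted in the second group

# Pairwise distance: root-mean-square of the `top` largest differences.
leading_dist <- function(x, top = 50) {
  n <- ncol(x)
  d <- matrix(0, n, n)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      sq <- sort((x[, i] - x[, j])^2, decreasing = TRUE)[seq_len(top)]
      d[i, j] <- d[j, i] <- sqrt(mean(sq))
    }
  }
  d
}

d <- leading_dist(expr)
coords <- cmdscale(as.dist(d), k = 2)   # classical MDS in two dimensions
print(round(coords, 2))
```

Because the gene set is chosen separately for each pair of samples, a small group of strongly differing genes can dominate the distance even when most genes are unchanged.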
Differential expression was evaluated using moderated t-tests in the limma
package. Empirical Bayes moderation is carried out by the eBayes function, which
borrows information across all genes to obtain more precise estimates of gene-wise
variability [12]. DNA methylation clusters 3 and 4 were pooled as a single
reference group, and we compared gene expression in 33 CIMP-high tumors vs 93
reference tumors and in 38 CIMP-low tumors vs reference. We applied RUV-4 [13]
to estimate batch effects, estimating 10 covariates that we adjusted for in our
differential expression analysis. RUV-4 estimates and adjusts for unwanted
variation using negative controls, in this case genes that did not differ
between the CIMP-high or CIMP-low tumors and clusters 3 and 4 (p > 0.1 in both
comparisons). Approximately 36% of genes (5473 of 15061) were selected, and 10
possible confounding variables were generated and included in the final analysis.
All p-values were adjusted for multiple testing using the Benjamini and Hochberg
false-discovery rate approach (FDR<0.05). We used Venn diagrams to show the numbers
of genes that are significantly differentially expressed in CIMP high and in CIMP
low tumors.
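The Benjamini and Hochberg adjustment can be written out in a few lines of base R. The sketch below is an illustration on made-up p-values; bh_adjust is a hypothetical helper written for this example, and in practice the built-in p.adjust function (used in the appendix) does the same job.

```r
# Benjamini-Hochberg adjustment: scale the i-th smallest p-value by n/i,
# then enforce monotonicity from the largest p-value downwards.
bh_adjust <- function(p) {
  n <- length(p)
  o <- order(p, decreasing = TRUE)
  adj <- pmin(1, cummin(n / (n:1) * p[o]))
  adj[order(o)]
}

p <- c(0.001, 0.008, 0.039, 0.041, 0.20)   # made-up raw p-values
fdr <- bh_adjust(p)
print(fdr)
print(sum(fdr < 0.05))   # number of tests declared significant
```

Note that two raw p-values below 0.05 (0.039 and 0.041) are no longer significant after adjustment, which is the intended effect when many tests are carried out.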
2.2.3 Differential expression in angiogenesis pathway
We identified 195 genes related to the angiogenesis pathway using PANTHER
(http://www.pantherdb.org/panther/prowler.jsp), a pathway analysis tool
developed using annotation from Gene Ontology.
The angiogenesis genes were matched to the RNA-seq dataset by UniProt ID,
resulting in 192 genes. Among them, 176 genes had RNA count data in the
TCGA dataset. We performed enrichment analysis using Fisher’s exact test to
compare the fraction of angiogenesis genes in the set of significant results to the
fraction in the non-significant set.
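The enrichment test reduces to a 2 x 2 table of pathway membership by significance. The sketch below builds that table from the counts printed in the appendix output for the CIMP-H vs non-CIMP comparison (FDR-adjusted p<0.05) and passes it to fisher.test from base R; the object names are illustrative.

```r
# Rows: angiogenesis pathway membership; columns: FDR-significant or not.
# Counts are taken from the table printed in the appendix.
tab <- matrix(c(8924, 101, 5961, 75), nrow = 2,
              dimnames = list(angiogenesis = c("no", "yes"),
                              significant  = c("no", "yes")))

res <- fisher.test(tab)

# Fraction significant among angiogenesis genes vs among all genes.
frac_pathway <- tab["yes", "yes"] / sum(tab["yes", ])
frac_all     <- sum(tab[, "yes"]) / sum(tab)
print(round(c(pathway = frac_pathway, all = frac_all), 3))
print(res$p.value)
```

The two fractions are close (about 43% vs 40%), which is consistent with the non-significant test result reported in the appendix.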
2.2.4 Software availability
This RNA-seq workflow makes use of various packages available from version
3.4 of the Bioconductor project, run in R version 3.4.3 within RStudio: limma, ruv,
edgeR and eulerr.
2.2.5 Gene Annotation
The PANTHER (Protein ANalysis THrough Evolutionary Relationships) [14]
classification system (http://www.PANTHERdb.org/) is a comprehensive system
that combines gene function, ontology, pathways and statistical analysis tools that
enable biologists to analyze large-scale, genome-wide data from sequencing,
proteomics or gene expression experiments. Genes are classified according to their
function in several different ways: families and subfamilies are annotated with
ontology terms (Gene Ontology (GO) and PANTHER protein class), and sequences
are assigned to PANTHER pathways. In our analysis, we searched the PANTHER
website using ‘angiogenesis’ as the keyword with the species set to Homo sapiens.
This returned a list of 195 genes with their gene symbols, PANTHER families and
PANTHER protein classes for further analysis. After matching to the normalized
dataset by gene symbol, we found that 176 genes could be used for differential
expression analysis and enrichment analysis.
3. Results
Our analysis of differential gene expression was based on the combined dataset,
containing both RNA count data and clinical characteristics. In total,
15061 genes remained for differential expression analysis after removing
non-expressed genes. We performed multidimensional scaling of the gene expression
residuals after removing the effect of the ten covariates estimating unwanted
variation. Figure 1 shows strong clustering of samples, with the CIMP-H and CIMP-L
groups well distinguished from the non-CIMP group.
Figure 1. Multidimensional scaling plot (MDS) of gene expression data. The first
dimension represents the leading-fold-change that explains the largest proportion
of variation in the data, with subsequent dimensions having a smaller effect and
being orthogonal to the ones before it. The gene expression values have the effects
of unwanted variation removed by using the residuals from a linear regression fit.
Samples are colored by CIMP subgroup: CIMP-H: red; CIMP-L: green; non-
CIMP: black.
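Removing the effect of estimated covariates amounts to regressing each gene on those covariates and keeping the residuals. The base R sketch below illustrates this on simulated data; the matrix W is a random stand-in for the factors that RUV-4 would estimate, and all object names and dimensions are illustrative assumptions.

```r
set.seed(42)
n_samples <- 12
W <- matrix(rnorm(n_samples * 2), ncol = 2)    # stand-in for estimated factors
E <- matrix(rnorm(5 * n_samples), nrow = 5)    # 5 genes x 12 samples
E <- E + outer(1:5, W[, 1])                    # inject unwanted variation

# Regress every gene on an intercept plus the covariates in one call:
# lm.fit accepts a matrix response with one column per gene.
modW <- cbind(Intercept = 1, W)
fit <- lm.fit(x = modW, y = t(E))
E_resid <- t(fit$residuals)                    # back to genes x samples

# Residuals are orthogonal to the covariates by construction.
print(max(abs(E_resid %*% modW)))
```

Because the intercept is included in the regression, the residuals are centered at zero for each gene, which does not affect the distances used for multidimensional scaling.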
Table 2 shows the number of genes identified as differentially expressed in each
pairwise group comparison (FDR-adjusted p<0.05). A total of 6036 genes have
differential expression in CIMP-H relative to non-CIMP tumors. For the comparison
between CIMP-L and non-CIMP, a total of 1299 genes are differentially expressed.
Overlapping these two gene lists, we find 1051 genes are differentially expressed
in both CIMP-H and CIMP-L compared to the non-CIMP group (Figure 2A).
Lastly, a total of 2057 genes have differential expression in CIMP-H relative to
CIMP-L tumors.
Table 2. Differential expression results (FDR-adjusted p<0.05)
Comparison | Up-regulated genes | Down-regulated genes | Total
CIMP-H vs non-CIMP | 2827 | 3209 | 6036
CIMP-L vs non-CIMP | 558 | 741 | 1299
CIMP-H vs CIMP-L | 1051 | 1006 | 2057
Table 3 gives the numbers of differentially expressed genes if we further filter this
list to genes with a fold change > 2. Now we find 967 genes are differentially
expressed in CIMP-H compared to non-CIMP and only 138 genes in CIMP-L
(Table 3). Of these 138 genes, 125 (91%) were also differentially expressed in CIMP-H
tumors (Figure 2B).
Table 3. Differential expression results (fold change > 2 and FDR-adjusted p<0.05)
Comparison | Up-regulated genes | Down-regulated genes | Total
CIMP-H vs non-CIMP | 333 | 634 | 967
CIMP-L vs non-CIMP | 19 | 119 | 138
CIMP-H vs CIMP-L | 177 | 195 | 372
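On the log2 scale used in the analysis, a fold change greater than 2 in either direction corresponds to an absolute log2 fold change greater than 1. The base R sketch below applies the combined filter to a made-up results table; the gene names, values, and column names are hypothetical.

```r
# Made-up results for four genes: log2 fold change and FDR-adjusted p-value.
res <- data.frame(
  gene  = c("g1", "g2", "g3", "g4"),
  logFC = c(1.6, -0.4, -1.2, 2.3),
  fdr   = c(0.001, 0.010, 0.030, 0.200),
  stringsAsFactors = FALSE
)

# Fold change on the original scale, ignoring direction.
res$fold_change <- 2^abs(res$logFC)

# Keep genes that pass both the significance and the fold-change threshold.
hits <- res[res$fdr < 0.05 & abs(res$logFC) > 1, ]
hits$direction <- ifelse(hits$logFC > 0, "up", "down")
print(hits)
```

In this toy table g2 is significant but changes too little, and g4 changes strongly but is not significant, so only g1 and g3 are retained.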
Figure 2. Numbers of statistically significant genes from moderated t-tests,
adjusted for unwanted variation. A. FDR-adjusted p<0.05. B. FDR-adjusted
p<0.05 and fold change > 2 (up/down regulated).
Figure 3. Volcano plot of the significance level against the log-fold change
comparing CIMP-H to non-CIMP average gene expression. Colored dots indicate
significant differential expression of angiogenesis pathway-related genes. Blue
dots stand for up-regulated genes and pink dots for down-regulated genes.
Figure 3 shows a scatter plot of the significance level against the log-fold
change for the CIMP-H vs non-CIMP comparison, with colored dots denoting
genes in the angiogenesis pathway having FDR-adjusted p<0.05. A total of 75
angiogenesis genes are significantly differentially expressed in CIMP-H vs non-CIMP
tumors (40 genes down-regulated, 35 genes up-regulated). The fraction of angiogenesis
genes that are significantly differentially expressed (75/176 = 43%) is slightly
higher than the fraction of all genes that are significant (6036/15061 = 40%);
however, the enrichment is not statistically significant (p=0.49). Eleven of the
significantly differentially expressed genes in the angiogenesis pathway achieve a
fold change greater than two: AXIN2, FRZB, RAMP1, WNT7B, PRKCG, NR1I2,
WNT10B, PLA2G4A, F7, DLL3 and EPHB1. A plot of the top two scaling
dimensions from an MDS analysis of the 75 angiogenesis genes with FDR-adjusted
p<0.05 displays separation similar to that of the overall gene expression dataset
(Figure 4).
Figure 4. Multidimensional scaling plot (MDS) of gene expression for 75
angiogenesis genes differentially expressed between CIMP-H and non-CIMP
(FDR adjusted p<0.05). The first dimension represents the leading-fold-change that
explains the largest proportion of variation in the data, with subsequent dimensions
having a smaller effect and being orthogonal to the ones before it. The gene
expression values have the effects of unwanted variation removed by using the
residuals from a linear regression fit. Samples are colored by CIMP subgroup:
CIMP-H: red; CIMP-L: green; non-CIMP: black.
4. Discussion
The angiogenesis pathway plays an important role in tumor progression. An
‘angiogenic switch’ is almost always activated and remains on, causing normally
quiescent vasculature to continually sprout new vessels that help sustain expanding
neoplastic growths [15]. Historically, angiogenesis was envisioned to be important
only when rapidly growing macroscopic tumors had formed, but more recent data
indicate that angiogenesis also contributes to the microscopic premalignant phase
of neoplastic progression, further cementing its status as an integral hallmark of
cancer [16]. Since many genes are potential candidates for inactivation through
promoter methylation [17], [18], CIMP may have profound pathophysiological
consequences in neoplasia through inactivation of tumor-suppressor genes,
angiogenesis inhibitors, and others. According to Toyota M [19], the majority of
cases with MSI may be caused by CIMP followed by hMLH1 methylation, loss of
hMLH1 expression, and resultant mismatch repair deficiency.
Based on Weisenberger et al. [5], CIMP tumors with MSI-H have different
clinical characteristics from CIMP tumors without MSI-H. Therefore, in colorectal
cancer without MSI-H, CIMP might also act through the angiogenesis pathway.
In our analysis, 11 angiogenesis-related genes were significantly differentially
expressed, with a fold change greater than two, in the CIMP-H group. Among them,
WNT7B, WNT10B and AXIN2 are members of the WNT signaling pathway and were
previously reported in MSI-H cells [20], [21]. These genes therefore provide little
information on enrichment in CIMP-H tumors without MSI-H.
However, the other 8 genes, such as FRZB, RAMP1, PRKCG and DLL3,
might provide new insights into the role of the angiogenesis pathway in colon
adenocarcinoma carcinogenesis. For example, DLL4-Notch signaling
has been reported as a potential target in tumor angiogenesis [22].
Although we found a total of 75 (43%) differentially expressed genes in the
angiogenesis pathway, this was not an overrepresentation relative to the significant
results across the study. However, it could be that angiogenesis pathway enrichment
is driven by the non-MSI-H subset of the CIMP-H group. In our analysis, there were
33 CIMP-H samples, and only 9 (27%) of them were not MSI-H. We therefore need more
samples, especially from the CIMP-H without MSI-H group, to examine our hypothesis.
We used genes identified by PANTHER as belonging to the angiogenesis pathway.
The PANTHER website only includes genes whose protein products are directly
involved in the angiogenesis pathway, in this case 195 genes. We might therefore
also include genes indirectly involved in the angiogenesis pathway in the enrichment
analysis. AmiGO, the official web-based tool for browsing Gene Ontology, would
allow us to expand our gene search. Among the additional genes, there might be many
that are indirectly associated with the angiogenesis pathway, which the PANTHER
dataset does not include, but that could teach us more about the extent of
differential expression distinguishing this cancer subset.
Appendix
The following code was used for the statistical analysis of the RNA-seq dataset.
#loading RNA-seq data and clinical data for colon cancer
source("http://bioconductor.org/biocLite.R")
biocLite("genefilter")
biocLite("GSE62944")
biocLite("ExperimentHub")
biocLite("MultiAssayExperiment")
biocLite("RaggedExperiment")
library(MultiAssayExperiment)
library(RaggedExperiment)
coad<-readRDS(file.choose())
coad <- updateObject(coad)
experiments(coad)
## ExperimentList class object of length 12:
##  [1] RNASeqGene: ExpressionSet with 20502 rows and 10 columns
##  [2] RNASeq2GeneNorm: ExpressionSet with 20501 rows and 191 columns
##  [3] miRNASeqGene: ExpressionSet with 705 rows and 221 columns
##  [4] CNASNP: RaggedExperiment with 457535 rows and 914 columns
##  [5] CNVSNP: RaggedExperiment with 90062 rows and 914 columns
##  [6] CNAseq: RaggedExperiment with 40530 rows and 136 columns
##  [7] Methylation: SummarizedExperiment with 485577 rows and 333 columns
##  [8] mRNAArray: ExpressionSet with 17814 rows and 172 columns
##  [9] RPPAArray: ExpressionSet with 208 rows and 360 columns
##  [10] Mutations: RaggedExperiment with 62530 rows and 154 columns
##  [11] gistica: SummarizedExperiment with 24776 rows and 448 columns
##  [12] gistict: SummarizedExperiment with 24776 rows and 448 columns
cDataMAE <- colData(coad)
cDataMAE <- cDataMAE[,c(1:19,2572:2615)]
cDataMAE <- cDataMAE[!is.na(cDataMAE$methylation_subtype),]
table(cDataMAE$methylation_subtype)
##   CIMP.H   CIMP.L Cluster3 Cluster4
##       33       40       47       46
# Pool clusters 3 and 4 into a single reference group
cDataMAE$CIMP <- ifelse(
is.element(cDataMAE$methylation_subtype,c("Cluster3","Cluster4")),
"Clust34",cDataMAE$methylation_subtype)
cDataMAE$CIMP = factor(cDataMAE$CIMP,
c("Clust34","CIMP.H","CIMP.L"))
table(cDataMAE$CIMP)
## Clust34  CIMP.H  CIMP.L
##      93      33      40
library(genefilter)
library(ExperimentHub)
eh = ExperimentHub()
query(eh , "GSE62944")
tcga_data <- eh[["EH1"]]
dim(phenoData(tcga_data))
##   sampleNames sampleColumns
##          7706           421
coadGSE <- tcga_data[,
which(phenoData(tcga_data)$CancerType=="COAD")]
dim(coadGSE)
## Features  Samples
##    23368      468
bpGEx=exprs(coadGSE)
bpGEx=bpGEx[-nrow(bpGEx),]
dim(bpGEx)
## [1] 23367 468
rseq <- bpGEx[,which(is.element(substr(colnames(bpGEx),1,12),
rownames(cDataMAE)))]
dups = duplicated(substr(colnames(rseq),1,12))
rseq = rseq[,!dups]
colnames(rseq) = substr(colnames(rseq),1,12)
dim(rseq)
## [1] 23367 164
cDataMAE=cDataMAE[colnames(rseq),]
identical(rownames(cDataMAE),colnames(rseq))
## [1] TRUE
pancan12 <- read.delim(file.choose(),header=T)
pancan12 = pancan12[pancan12$disease=="COAD",]
rownames(pancan12) = substr(pancan12$tcga_id,1,12)
pancan12=pancan12[rownames(cDataMAE),]
cDataMAE$abs_purity = pancan12$abs_purity
save(rseq,file="coadGExGSE.rda")
save(cDataMAE,file="cDataMAE.rda")
load("coadGExGSE.rda")
load("cDataMAE.rda")
biocLite('edgeR')
library(edgeR)
y <- DGEList(counts=rseq,group=cDataMAE$CIMP)
dim(y)
##[1] 23367 164
#Organizing gene annotations
library(org.Hs.eg.db)
keys=rownames(y)
sel<-select(org.Hs.eg.db,keys=keys,columns="ENTREZID",
keytype="SYMBOL")
## 'select()' returned 1:many mapping between keys and columns
sel <- sel[!duplicated(sel[,1]),]
identical(sel$SYMBOL,rownames(y))
##[1] TRUE
#Data pre-processing and normalization
y$genes = sel
print(paste("Number of genes missing Entrezid: ",
sum(is.na(y$genes$ENTREZID))))
## [1] "Number of genes missing Entrezid: 2328"
y = y[!is.na(y$genes$ENTREZID),]
dim(y)
##[1] 21039 164
cpm.y <- cpm(y)
keep.exprs <- rowSums(cpm.y>1)>=10
fy <- y[keep.exprs,, keep.lib.sizes=FALSE]
dim(fy)
##[1] 15061 164
fy <- calcNormFactors(fy,method="TMM")
#Differential expression analysis
design <- model.matrix(~0+fy$samples$group)
colnames(design) <- gsub("fy\\$samples\\$group","",
colnames(design))
v <- voom(fy, design,plot=TRUE)
mds<-plotMDS(v,labels=fy$samples$group,
col=unclass(fy$samples$group),ndim=3)
biocLite('ruv')
library(ruv)
designI=model.matrix(~factor(fy$samples$group))
fit=lmFit(v,designI)
efit=eBayes(fit)
enc=efit$p.value[,2]>0.1 & efit$p.value[,3]>0.1
table(enc)
## enc
## FALSE  TRUE
##  9588  5473
myX=matrix(design,ncol=3)[,-1]
ruvfit10=RUV4(Y=t(v$E),X=as.matrix(myX),ctl=enc,10)
modW=model.matrix(~ruvfit10$W)
fit=lmFit(v$E,modW)
yhat=fit$coef %*% t(modW)
v.Wresid=v$E-yhat
mds<-plotMDS(v.Wresid,labels=fy$samples$group,
col=unclass(fy$samples$group),ndim=3)
modW=modW[,-1]
design=model.matrix(~0+fy$samples$group+modW)
head(design)
colnames(design) <- gsub("fy\\$samples\\$group","",
colnames(design))
colnames(design) <- gsub("modWruvfit10\\$","", colnames(design))
contr.matrix <- makeContrasts(
CimpHvsCl34 = CIMP.H - Clust34,
CimpLvsCl34 = CIMP.L - Clust34,
CimpHvsCimpL = CIMP.H - CIMP.L,
levels = colnames(design))
contr.matrix
vfit <- lmFit(v, design)
vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
efit <- eBayes(vfit)
plotSA(efit)
efit$FDR = as.matrix(efit$p.value)
for (i in 1:ncol(efit$FDR))
  efit$FDR[,i]=p.adjust(efit$p.value[,i],method="BH")
#Examine results
dt=decideTests(efit)
summary(dt)
##    CimpHvsCl34 CimpLvsCl34 CimpHvsCimpL
## -1        3209         741         1006
## 0         9025       13762        13004
## 1         2827         558         1051
vennDiagram(dt[,1:2], circle.col=c("turquoise", "salmon"))
dtffc <- decideTests(efit,lfc=1)
summary(dtffc)
##    CimpHvsCl34 CimpLvsCl34 CimpHvsCimpL
## -1         634         119          195
## 0        14094       14923        14689
## 1          333          19          177
vennDiagram(dtffc[,1:2], circle.col=c("turquoise", "salmon"))
#Angiogenesis pathway enrichment analysis
setwd('/Users/songren/Desktop/USC/Siegmund Lab/Bioconductor')
library(splitstackshape)
p <- read.delim("~/Desktop/USC/Siegmund Lab/Bioconductor/PANTHERGeneList_pan.txt",header=F)
p2 <- cSplit(indt = p, splitCols = "V1", sep = "=", drop = TRUE)
keys=as.character(p2$V1_3)
sel<-select(org.Hs.eg.db,keys=keys,columns=c("ENTREZID","SYMBOL"),
keytype="UNIPROT")
## 'select()' returned 1:many mapping between keys and columns
length(unique(sel$SYMBOL))
## [1] 192
# Keep one row per gene symbol
sel <- sel[!duplicated(sel$SYMBOL),]
agnames<- sel$SYMBOL
agrow = match(agnames,v$genes$SYMBOL)
agrow = agrow[!is.na(agrow)]
length(agrow)
##[1] 176
agsymbol = v$genes$SYMBOL[agrow]
plot(efit$coef[,1],-log10(efit$FDR[,1]),pch=16,
xlab="log Fold-change",ylab="-log10 FDR-adjusted pvalue")
points(efit$coef[agrow,1],-log10(efit$FDR[agrow,1]),col=5,pch=16)
cimpH.vs.clust34 <- topTreat(efit,coef=1,lfc=1,n=Inf)
cimpH.vs.clust34.topgenes <-
rownames(cimpH.vs.clust34)[1:sum(abs(dtffc[,"CimpHvsCl34"]))]
cimpH.vs.clust34.topgenes[cimpH.vs.clust34.topgenes %in%
agsymbol]
##  [1] "AXIN2"   "FRZB"    "RAMP1"   "WNT7B"   "PRKCG"   "NR1I2"   "WNT10B"
##  [8] "PLA2G4A" "F7"      "DLL3"    "EPHB1"
library(eulerr)
overlapgenes<-cbind.data.frame(
CIMP.H=ifelse(efit$FDR[agrow,c("CimpHvsCl34")]<0.05,T,F),
CIMP.L=ifelse(efit$FDR[agrow,c("CimpLvsCl34")]<0.05,T,F))
plot(euler(overlapgenes),quantities=TRUE)
ag.FDRsign = rownames(overlapgenes)[overlapgenes$CIMP.H>0]
mds<-plotMDS(v.Wresid[ag.FDRsign,],labels=fy$samples$group,
col=unclass(fy$samples$group),ndim=3)
table(efit$coef[ag.FDRsign,1]>0)
## FALSE  TRUE
##    40    35
plot(efit$coef[,1],-log10(efit$FDR[,1]),pch=16,
xlab="log Fold-change", ylab="-log10 FDR-adjusted pvalue",col=8)
x=efit$coef[ag.FDRsign,1]
lp=-log10(efit$FDR[ag.FDRsign,1])
pos = x>0
points(x[pos],lp[pos],col=4,pch=16)
points(x[!pos],lp[!pos],col=6,pch=16)
segments(1,0,1,30)
segments(-1,0,-1,30)
allTT=topTable(efit,coef=1,n=nrow(efit))
results = cbind.data.frame(
  ag.idx=ifelse(is.element(rownames(allTT),agsymbol),1,0),
  sign.H=ifelse(allTT$adj.P.Val<0.05,1,0))
table(results)
##       sign.H
## ag.idx    0    1
##      0 8924 5961
##      1  101   75
fisher.test(table(results))
##  Fisher's Exact Test for Count Data
## data:  table(results)
## p-value = 0.4874
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.8118295 1.5165727
## sample estimates:
## odds ratio
##   1.111672
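The enrichment test above can be reproduced from the four printed counts alone, without the fitted objects. This is a minimal sketch in base R; the names tab, prop_sig and or_sample are introduced here for illustration only. The sample (cross-product) odds ratio is close to, but not the same quantity as, the conditional maximum-likelihood estimate that fisher.test() reports.

```r
# 2x2 table copied from the table(results) output
# rows: angiogenesis gene (0 = no, 1 = yes); columns: FDR-significant (0 = no, 1 = yes)
tab <- matrix(c(8924, 101, 5961, 75), nrow = 2,
              dimnames = list(ag.idx = c("0", "1"), sign.H = c("0", "1")))

# Proportion of FDR-significant genes outside and inside the pathway
prop_sig <- tab[, "1"] / rowSums(tab)

# Sample (cross-product) odds ratio
or_sample <- (tab["1", "1"] * tab["0", "0"]) / (tab["1", "0"] * tab["0", "1"])

# Two-sided Fisher's exact test on the same table
ft <- fisher.test(tab)

print(round(prop_sig, 3))
print(or_sample)
print(ft$p.value)
```

Comparing the two proportions in prop_sig gives the same picture as the odds ratio: pathway genes are significant slightly more often than other genes, but the confidence interval comfortably includes 1.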
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets
## [8] methods   base
## other attached packages:
##  [1] eulerr_4.0.0          org.Hs.eg.db_3.4.1    AnnotationDbi_1.38.2
##  [4] IRanges_2.10.5        Biobase_2.36.2        splitstackshape_1.4.2
##  [7] data.table_1.10.4-3   edgeR_3.18.1          limma_3.32.10
## [10] S4Vectors_0.14.7      BiocGenerics_0.22.1
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.15    knitr_1.20      bit_1.1-12      lattice_0.20-35
##  [5] rlang_0.2.0     blob_1.1.0      tools_3.4.3     grid_3.4.3
##  [9] DBI_0.7         bit64_0.9-7     digest_0.6.15   tibble_1.4.2
## [13] memoise_1.1.0   RSQLite_2.0     polyclip_1.6-1  compiler_3.4.3
## [17] pillar_1.1.0    locfit_1.5-9.1  pkgconfig_2.0.1
Bibliography
1. Fatima A. Haggar, Robin P. Boushey, Colorectal Cancer Epidemiology:
Incidence, Mortality, Survival, and Risk Factors. Clin Colon Rectal Surg. 2009
Nov; 22(4): 191–197.
2. Viktor H Koelzer, Pia Herrmann, Inti Zlobec, Eva Karamitopoulou, Alessandro
Lugli and Ulrike Stein, Heterogeneity analysis of Metastasis Associated in Colon
Cancer 1 (MACC1) for survival prognosis of colorectal cancer patients: a
retrospective cohort study. BMC Cancer. 2015; 15: 160.
3. Toyota, M. et al. CpG island methylator phenotype in colorectal cancer. Proc.
Natl. Acad. Sci. USA 96, 8681–8686 (1999)
4. Daniel J Weisenberger, Kimberly D Siegmund, Mihaela Campan, Joanne
Young, Tiffany I Long, Mark A Faasse, Gyeong Hoon Kang, Martin
Widschwendter, Deborah Weener, Daniel Buchanan, Hoey Koh, Lisa Simms,
Melissa Barker, Barbara Leggett, Joan Levine, Myungjin Kim, Amy J French,
Stephen N Thibodeau, Jeremy Jass, Robert Haile & Peter W Laird. CpG island
methylator phenotype underlies sporadic microsatellite instability and is tightly
associated with BRAF mutation in colorectal cancer. Nat Genet. 2006
5. A. Joan Levine, Amanda I. Phipps, John A. Baron, Daniel D. Buchanan, Dennis
J. Ahnen, Stacey A. Cohen, Noralane M. Lindor, Polly A. Newcomb, Christophe
Rosty, Robert W. Haile, Peter W. Laird, and Daniel J. Weisenberger.
Clinicopathologic Risk Factor Distributions for MLH1 Promoter Region
Methylation in CIMP-Positive Tumors. Cancer Epidemiol Biomarkers Prev; 25(1)
January 2016. 10.1158/1055-9965.EPI-15-0935
6. Toshinori Hinoue, Daniel J. Weisenberger, Christopher P.E. Lange, Hui Shen,
Hyang-Min Byun, David Van Den Berg, Simeen Malik, Fei Pan, Houtan
Noushmehr, Cornelis M. van Dijk, Rob A.E.M. Tollenaar and Peter W. Laird,
Genome-scale analysis of aberrant DNA methylation in colorectal cancer. Genome
Res. 2012 Feb;22(2):271-82. doi: 10.1101/gr.117523.110. Epub 2011 Jun 9.
7. Justin Guinney, Rodrigo Dienstmann, Xin Wang, Aurélien de Reyniès, Andreas
Schlicker, Charlotte Soneson, Laetitia Marisa, Paul Roepman, Gift Nyamundanda,
Paolo Angelino, Brian M. Bot, Jeffrey S. Morris, Iris M. Simon, Sarah Gerster,
Evelyn Fessler, Felipe de Sousa e Melo, Edoardo Missiaglia, Hena Ramay, David
Barras, Krisztian Homicsko, Dipen Maru, Ganiraju C. Manyam, Bradley Broom,
Valerie Boige, Beatriz Perez-Villamil, Ted Laderas, Ramon Salazar, Joe W. Gray,
Douglas Hanahan, Josep Tabernero, Rene Bernards, Stephen H. Friend, Pierre
Laurent-Puig, Jan Paul Medema, Anguraj Sadanandam, Lodewyk
Wessels, Mauro Delorenzi, Scott Kopetz, Louis Vermeulen and Sabine Tejpar. The
Consensus Molecular Subtypes of Colorectal Cancer. Nat Med. 2015 November;
21(11): 1350–1356. doi:10.1038/nm.3967.
8. Huaiyu Mi, Xiaosong Huang, Anushya Muruganujan, Haiming Tang, Caitlin
Mills, Diane Kang and Paul D. Thomas. PANTHER version 11: expanded
annotation data from Gene Ontology and Reactome pathways, and data
analysis tool enhancements. Nucleic Acids Research, 2017, Vol. 45
9. Rahman M, Jackson LK, Johnson WE, Li DY, Bild AH, Piccolo SR. Alternative
preprocessing of RNA-Sequencing data in The Cancer Genome Atlas leads to
improved analysis results. Bioinformatics. 2015 Nov 15;31(22):3666-72
10. Ramos M, Schiffer L, Re A, Azhar R, Basunia A, Cabrera CR, Chan T,
Chapman P, Davis S, Gomez-Cabrero D, Culhane AC, Haibe-Kains B, Hansen K,
Kodali H, Louis MS, Mer AS, Reister M, Morgan M, Carey V and Waldron L
(2017). Software For The Integration Of Multi-Omics Experiments In
Bioconductor. Cancer Research, 77(21); e39-42.
11. Charity W. Law, Monther Alhamdoosh, Shian Su, Gordon K. Smyth, Matthew
E. Ritchie. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR.
F1000Research 2016, 5:1408
12. Smyth GK, Linear models and empirical bayes methods for assessing
differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;
3(1): Article3.
13. Jacob, L., Gagnon-Bartsch, J. and Speed, T. P. Correcting gene expression data
when neither the unwanted variation nor the factor of interest are observed.
Biostatistics. (2016)
14. Huaiyu Mi, Anushya Muruganujan, John T Casagrande & Paul D Thomas.
Large-scale gene function analysis with the PANTHER classification system.
Nature Protocols. 2013
15. Hanahan D, Folkman J. Patterns and emerging mechanisms of the angiogenic
switch during tumorigenesis. Cell. 1996 Aug 9;86(3):353-64.
16. Hanahan D, Weinberg RA. Hallmarks of Cancer: The Next Generation. Cell.
2011 Mar 4;144(5):646-74. doi: 10.1016/j.cell.2011.02.013.
17. Baylin, S. B., Herman, J. G., Graff, J. R., Vertino, P. M. & Issa, J.-P. (1998)
Adv. Cancer Res. 72, 141–196.
18. Jones, P. A. (1996) Cancer Res. 56, 263–267.
19. Toyota M, Ahuja N, Ohe-Toyota M, Herman JG, Baylin SB, Issa JP. CpG
island methylator phenotype in colorectal cancer. Proc Natl Acad Sci U S A. 1999
Jul 20;96(15):8681-6.
20. Shimizu Y, Ikeda S, Fujimori M, Kodama S, Nakahara M, Okajima M,
Asahara T. Frequent alterations in the Wnt signaling pathway in colorectal cancer
with microsatellite instability. Genes Chromosomes Cancer. 2002 Jan;33(1):73-81.
21. Serina M. Mazzoni, Eric R. Fearon. AXIN1 and AXIN2 Variants in
Gastrointestinal Cancers. Cancer Lett. 2014 December 1; 355(1): 1–8.
22. Frank Kuhnert, Jessica R Kirshner, and Gavin Thurston. Dll4-Notch signaling
as a therapeutic target in tumor angiogenesis. Vasc Cell. 2011; 3: 20.
Abstract
Background: In metastatic colon cancer, clinical observation suggests that benefit from Bevacizumab, a chemotherapy agent that disrupts angiogenesis, is limited to patients with tumors classified as having the CpG island methylator phenotype (CIMP). This led us to investigate whether angiogenesis genes are differentially expressed in CIMP tumors compared to non-CIMP tumors.
Methods: We analyzed publicly available gene expression data on 164 colon adenocarcinomas from The Cancer Genome Atlas (TCGA). We used moderated t-tests to assess differential expression in 33 CIMP-high, 38 CIMP-low and 93 non-CIMP cancers (FDR < 0.05). Enrichment for angiogenesis genes in the set of differentially expressed genes was evaluated using Fisher's exact test.
Results: We found 967 genes differentially expressed (FDR-adjusted p < 0.05 and fold change > 2) between CIMP-high and non-CIMP tumors. There were 11 angiogenesis genes in this set, but this did not represent an enrichment (11/176 = 6.3% vs. 967/15061 = 6.4%).
Conclusion: Our results do not provide strong evidence for differential expression of angiogenesis genes distinguishing CIMP-high cancers. However, CIMP-high tumors are heterogeneous, and a larger sample that would allow us to restrict the analysis to metastatic disease is desirable.
Impact: Angiogenesis pathway enrichment of genes differentially expressed in CIMP-H metastatic colon cancer could increase the understanding of colon cancer pathogenesis and bring new insights for colon adenocarcinoma treatment.
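The comparison reported in the Results can be written as a 2x2 table and tested with Fisher's exact test. This is a sketch in base R, assuming that the 11 angiogenesis genes are counted within the 967 differentially expressed genes and the 176 within the 15061 genes tested, as the quoted percentages imply; all variable names are illustrative.

```r
n_genes <- 15061  # genes tested
n_de    <- 967    # differentially expressed (FDR < 0.05 and fold change > 2)
n_ag    <- 176    # angiogenesis genes with expression data
n_ag_de <- 11     # angiogenesis genes among the differentially expressed

# rows: angiogenesis gene (no/yes); columns: differentially expressed (no/yes)
tab <- matrix(c(n_genes - n_de - (n_ag - n_ag_de), n_ag - n_ag_de,
                n_de - n_ag_de,                    n_ag_de),
              nrow = 2,
              dimnames = list(angiogenesis = c("no", "yes"), DE = c("no", "yes")))

# Percentage differentially expressed within the pathway and among all genes
pct_ag  <- 100 * n_ag_de / n_ag
pct_all <- 100 * n_de / n_genes

# Two-sided Fisher's exact test for enrichment
ft <- fisher.test(tab)

print(tab)
print(c(pathway = pct_ag, all_genes = pct_all))
print(ft$p.value)
```

With roughly 11 pathway genes expected by chance (176 x 967 / 15061) and 11 observed, the two percentages are nearly equal, which is the basis for the conclusion that there is no enrichment.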
Asset Metadata
Creator: Wang, Songren (author)
Core Title: Gene expression and angiogenesis pathway across DNA methylation subtypes in colon adenocarcinoma
School: Keck School of Medicine
Degree: Master of Science
Degree Program: Applied Biostatistics and Epidemiology
Publication Date: 04/25/2018
Defense Date: 05/12/2018
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: angiogenesis pathway, CIMP, modified two-sample T-test, OAI-PMH Harvest, RNA count
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Siegmund, Kimberly (committee chair), Millstein, Joshua (committee member), Wu, Anna (committee member)
Creator Email: songren0829@gmail.com, songrenw@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-493250
Unique identifier: UC11268471
Identifier: etd-WangSongre-6268.pdf (filename), usctheses-c40-493250 (legacy record id)
Legacy Identifier: etd-WangSongre-6268.pdf
Dmrecord: 493250
Document Type: Thesis
Rights: Wang, Songren
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA