Identifying important microRNAs in progression of breast cancer
Xun Zhu
June 2014
A thesis presented to the faculty of the
University of Southern California Graduate School
in partial fulfillment of the requirements for the degree
Master of Science in Applied Mathematics
Contents
1 Introduction
2 Data and tools
2.1 Collecting and processing data
2.1.1 TCGA
2.1.2 TargetScan
2.1.3 MiRanda
2.2 R packages
2.2.1 GSEA
2.2.2 randomForest R package
2.2.3 rfPermute R package
3 Workflow
3.1 Perform GSEA analysis
3.2 Post-GSEA-R processing
3.3 Generate the matrix of score
3.4 Apply the random forest algorithm
3.5 Select important miRs
4 Results and Analysis
5 Acknowledgements
6 Appendix
6.1 Parameters used in external libraries
6.1.1 GSEA-R
6.1.2 Random Forest R package
Abstract
Stefan Wuchty et al. described, in PLoS Computational Biology, a workflow to
identify the important miRs of pathways for a type of tumor, which uses GSEA to
measure the importance of each pathway. We apply this workflow to a new set of
breast cancer data (acquired from TCGA). Because the number of samples is very
limited, unlike the situation in their original paper, we selected the important
miRs based on their permuted p-values rather than on their FDR.
We created three kinds of measurements to characterize the important miRs,
corresponding to three matrices: a binary matrix indicating whether a pathway is
the target of an miR, an integer matrix indicating the total number of binding
sites between an miR-pathway pair, and a real-number matrix indicating the "bind
score" of an miR-pathway pair. We selected the important miRs through permutation
tests with p < 0.01 and compared the miRs selected under the three metrics.
Keywords: microRNA, GSEA, random-forest, p-value
1 Introduction
MicroRNAs (miRs) are small non-coding RNAs that interact with their protein-coding
gene targets. Such small RNAs putatively inhibit translation by direct and imperfect
binding to the 3’- and 5’-untranslated regions (UTRs) [1] and exert expression control
together with other regulatory elements such as transcription factors [9, 10, 12].
The importance of miRs in cancer has been addressed in a large body of literature
[2–8, 11]. An important question is how to computationally identify the miRs involved
in cancer-related pathways from high-throughput data, such as transcriptome data and
microRNA expression data. Previously, Wuchty et al. [13] provided a workflow to detect
important cancer-related miRs that combines several tools and computational methods:
gene set enrichment analysis (GSEA), which identifies gene sets/pathways that differ
significantly between two conditions; microRNA target prediction tools such as
miRanda, TargetScan and PicTar; and random forest regression. Here we apply this
workflow to data on the metastasis of breast cancer and evaluate the identified
important miRs under three different measures.
2 Data and tools
2.1 Collecting and processing data
2.1.1 TCGA
Using the Cancer Genome Atlas (TCGA¹), we utilized 12 metastasized breast cancer
samples along with 9 non-metastasized control samples that provided matching gene
and miR expression profiles.
2.1.2 TargetScan
TargetScan² is an online miR target prediction service provided by the Whitehead
Institute for Biomedical Research. We downloaded the latest (as of Apr. 1st, 2014)
version of their “Summary counts” table³, which contains 18209042 miR-mRNA pairs
and the number of different types of binding sites across multiple species.
Because we only care about human mRNAs, we excluded all the non-human pairs; the
final table contains 5806825 pairs. We counted both the conserved and the
non-conserved binding sites.
¹ http://cancergenome.nih.gov/
² http://www.targetscan.org/
³ http://www.targetscan.org/vert_61/vert_61_data_download/Summary_Counts.txt.zip
Table 2.1: A random sample of lines from the processed table of TargetScan
mRNA Gene Symbol   miRNA name         # Binding Sites
TMEM189            hsa-miR-3934       2
PWWP2A             hsa-miR-3646       1
F2RL3              hsa-miR-3909       1
CSPG4              hsa-miR-331-3p     2
MAPK14             hsa-miR-4459       1
MYO3B              hsa-miR-3675-5p    1
RREB1              hsa-miR-205        1
PGF                hsa-miR-4691-5p    1
ZC3H8              hsa-miR-512-5p     1
C15orf38-AP3S2     hsa-miR-3115       1
. . .              . . .              . . .
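As an illustration of the processing behind Table 2.1, the following minimal Python sketch keeps only the human pairs and adds up the conserved and non-conserved site counts. It is not the script used in this work: the field names (gene, mir, species_id, conserved_sites, nonconserved_sites) and the toy rows are hypothetical stand-ins for the TargetScan columns, and the only outside fact assumed is that 9606 is the NCBI taxonomy ID for human.

```python
from collections import defaultdict

HUMAN_TAXON_ID = "9606"  # NCBI taxonomy ID for Homo sapiens


def human_site_counts(rows):
    """Sum conserved and non-conserved site counts per (gene, miR) pair.

    Each row is a dict with the hypothetical keys gene, mir, species_id,
    conserved_sites and nonconserved_sites. Non-human rows are skipped.
    """
    totals = defaultdict(int)
    for row in rows:
        if row["species_id"] != HUMAN_TAXON_ID:
            continue  # drop non-human pairs
        key = (row["gene"], row["mir"])
        totals[key] += row["conserved_sites"] + row["nonconserved_sites"]
    return dict(totals)


# Toy rows for illustration only (not real TargetScan records).
rows = [
    {"gene": "GENE_A", "mir": "miR-x", "species_id": "9606",
     "conserved_sites": 1, "nonconserved_sites": 1},
    {"gene": "GENE_A", "mir": "miR-x", "species_id": "10090",
     "conserved_sites": 3, "nonconserved_sites": 0},
    {"gene": "GENE_B", "mir": "miR-y", "species_id": "9606",
     "conserved_sites": 0, "nonconserved_sites": 1},
]
counts = human_site_counts(rows)
```

Keying the totals by the (gene, miR) pair means the result can be looked up directly when the score matrices are filled in later.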
2.1.3 MiRanda
MiRanda⁴ is another online miR target prediction service. It is provided by
Memorial Sloan Kettering Cancer Center⁵. We downloaded the latest (as of Apr. 1st,
2014) version of their “Target Site Predictions”⁶ in all four categories (regardless
of good/bad mirSVR score and of conserved/non-conserved status). We combined the
four tables and summed up the counts. The total number of pairs after combining and
totaling is 7333127.
It is worth noting that miRanda uses both the “-3p/-5p” notation and the deprecated⁷
“star notation” (*) to indicate a minor product. For example, in Table 2.2 it uses
“hsa-miR-545*” to indicate a minor product of hsa-mir-545. Therefore there are some
naming inconsistencies with TargetScan.
Table 2.2: A random sample of lines from the processed table of MiRanda
mRNA Gene Symbol   miRNA name         # Binding Sites
JAM3               hsa-miR-3173       1
SNPH               hsa-miR-502-3p     2
FAM126A            hsa-miR-874        2
ARGFX              hsa-miR-4280       1
GNE                hsa-miR-3153       1
BEND2              hsa-miR-2052       2
ASCC1              hsa-miR-188-5p     3
TSC22D2            hsa-miR-375        9
UACA               hsa-miR-377        1
TMF1               hsa-miR-545*       3
. . .              . . .              . . .
2.2 R packages
2.2.1 GSEA
Gene Set Enrichment Analysis (GSEA), developed at the Broad Institute of MIT and
Harvard, identifies gene sets/pathways that are significantly enriched under one of
two conditions. We used the latest version (1.0) of their R package⁸. We selected
only the top 20 scoring gene sets for analysis. We used the “C2” version of the gene
set database and 1000 random permutations. The other parameters we used can be found
in Appendix 6.1.1.
⁴ http://www.microrna.org/
⁵ http://www.mskcc.org/
⁶ http://www.microrna.org/microrna/getDownloads.do
⁷ The deprecation is explained in http://www.mirbase.org/blog/2011/04/whats-in-a-name/
⁸ http://www.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/software/GSEA-P-R.1.0.zip (might need to log in)
2.2.2 randomForest R package
randomForest is an R package that implements the random forest algorithm. It is an
R port, by Andy Liaw and Matthew Wiener, of the original Fortran code by Leo Breiman
and Adele Cutler. We used its current version (4.6-7). The parameters we used can be
found in Appendix 6.1.2.
2.2.3 rfPermute R package
rfPermute is an R package that takes the result of the randomForest package and
estimates the significance of the importance scores by permuting the response
variable. It is developed by Eric Archer⁹. We use this package to verify the
significance of the results. Following statistical convention, we set the cut-off
to p < 0.01.
⁹ http://cran.r-project.org/web/packages/rfPermute/index.html
3 Workflow
Figure 3.1 demonstrates the slightly modified workflow from Wuchty et al. [13].
Figure 3.1: The modified workflow to select important miRs from TCGA data
3.1 Perform GSEA analysis
In order to feed the data into GSEA-R, we first processed the raw RNA-Seq data
downloaded from TCGA. We wrote a script to combine the two tables, created a
corresponding list of phenotypes, and added the necessary arguments at the beginning
of the table.
From the 522 pathways stored in the C2 database, GSEA-R selected the 40 pathways that
contribute most to the change of phenotype: the 20 most significantly enriched
pathways in the metastasized group and the 20 least enriched ones. Table 3.1 and
Table 3.2 list the names of the selected pathways and their corresponding scores.
Figure 3.2 shows the heat map, p-values and other information produced by GSEA-R.
GSEA-R also computes the leading edge genes (LEGs) for each pathway. LEGs are the
genes responsible for most of the gain in enrichment score of a pathway under GSEA
analysis; in other words, LEGs statistically represent the most important mRNAs in a
pathway. As in [13], the regression in the following steps uses only the LEGs of each
pathway.
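To make the notions of enrichment score and leading edge genes concrete, the sketch below computes the weighted running-sum statistic of GSEA for a single gene set and reads the leading edge off the position of the peak. It is a simplified illustration written for this text, not the GSEA-R code: it returns the raw enrichment score only (no normalization to NES, no permutation p-value), the gene names and scores are toy values, and weight = 1 mirrors the weighted.score.type = 1 setting listed in the Appendix.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Return (ES, leading_edge) for one gene set.

    ranked_genes: gene names sorted by decreasing ranking metric.
    scores: ranking metric values in the same order.
    The running sum rises at genes in the set (weighted by |score|**weight)
    and falls by a constant step at every other gene.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(in_set)
    hit_total = sum(abs(s) ** weight for s, hit in zip(scores, in_set) if hit)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running, best, peak = 0.0, 0.0, -1
    for i, (s, hit) in enumerate(zip(scores, in_set)):
        running += abs(s) ** weight / hit_total if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best, peak = running, i  # maximum deviation from zero so far

    if best >= 0:
        # Positive ES: set members at or before the peak.
        window = zip(ranked_genes[:peak + 1], in_set[:peak + 1])
    else:
        # Negative ES: set members after the trough.
        window = zip(ranked_genes[peak + 1:], in_set[peak + 1:])
    return best, [gene for gene, hit in window if hit]


# Toy ranked list for illustration only.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
metric = [2.0, 1.5, 1.0, -0.5, -1.0, -2.0]
es, leading_edge = enrichment_score(genes, metric, {"g1", "g2", "g5"})
```

In this toy case the running sum peaks after the second gene, so g1 and g2 form the leading edge while g5, which appears after the peak, is left out.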
3.2 Post-GSEA-R processing
Before generating the matrix of scores, we first aggregated the output of GSEA-R.
We wrote a script to generate a summary table of the pathways, their corresponding
scores, and their corresponding LEGs. A snippet of the summary can be found in
Table 3.3.
3.3 Generate the matrix of score
A matrix of miR scores is the training data for the random forest. In our case, it is
a matrix with 40 rows (each representing a pathway selected by GSEA-R) and one column
per miR in the data given by the prediction source. In the case of TargetScan there
are 1525 miRs, so we generate a 40 × 1525 matrix. We filled the matrix with three
different score measures.
The first one is a binary matrix, with each entry indicating whether there is at
least one binding site between a pair: “1” for yes and “0” for no. This measure uses
very limited information and its score levels are very rigid. As a consequence, we
expect it to generate less accurate results. This matrix is henceforth abbreviated
as matrix-1.
The second one is an integer matrix, with each entry indicating the number of binding
sites between a pair. This is a slightly more accurate portrait of the information we
have, and it is also the measure used in [13]. This matrix is abbreviated as matrix-2.
The third one is a supposedly improved version of the second one, with each entry
being the score from matrix-2 multiplied by the differential expression of the
corresponding miR. This measure uses the most information of the three, and it is
motivated by the fact that the speed of a chemical reaction is positively related to
the concentration of the catalyst. We expect this measure to give the best result
among the three. This matrix is abbreviated as matrix-3.
Table 3.1: The 20 pathways in metastasized group selected by GSEA-R
Pathway Names                                         Score
P53_UP                                                1.4027
MAP00480_Glutathione_metabolism                       1.3638
intrinsicPathway                                      1.283
MAP00360_Phenylalanine_metabolism                     1.2174
Matrix_Metalloproteinases                             1.2056
MAP00340_Histidine_metabolism                         1.195
HOX_LIST_JP                                           1.1591
ADULT_LIVER_vs_FETAL_LIVER_GNF2                       1.1514
electron_transporter_activity                         1.083
no1Pathway                                            1.0657
MAP00350_Tyrosine_metabolism                          1.0628
GO_ROS                                                1.0484
MAP00220_Urea_cycle_and_metabolism_of_amino_groups    1.0259
FRASOR_ER_UP                                          0.95187
ST_Wnt_beta_catenin_Pathway                           0.93623
p53hypoxiaPathway                                     0.92514
SIG_CD40PATHWAYMAP                                    0.90249
AR_MOUSE_PLUS_TESTO_FROM_NETAFFX                      0.89128
AR_ORTHOS_MAPPED_TO_U133_VIA_NETAFFX                  0.88948
AR_MOUSE                                              0.88948
Table 3.2: The 20 pathways in primary group selected by GSEA-R
Pathway Names Score
shh_lisa -1.7159
SA_CASPASE_CASCADE -1.7064
XINACT_MERGED -1.6725
41bbPathway -1.6581
tnfr1Pathway -1.6289
MAP00252_Alanine_and_aspartate_metabolism -1.5879
CR_CELL_CYCLE -1.5836
Cell_Cycle -1.5798
CR_REPAIR -1.5732
fasPathway -1.573
g2Pathway -1.5727
GLUCOSE_DOWN -1.5652
il7Pathway -1.5594
chrebpPathway -1.5504
il1rPathway -1.5485
MAPK_Cascade -1.5191
atrbrcaPathway -1.5186
Il12Pathway -1.5129
mitochondriaPathway -1.5096
ST_Fas_Signaling_Pathway -1.5035
Figure 3.2: The summary of GSEA-R. Panels: Gene List Correlation (S2N) Profile
(signal-to-noise ratio against gene list location; zero crossing at location 9550,
46.5%); Global Observed and Null Densities (Area Normalized) (P(NES) against NES);
Heat Map for Genes in Dataset (phenotypes M and P); p-values vs. NES (nominal
p-value, FWER p-value, FDR q-value).
Table 3.3: Pathways, scores and leading edge genes
Pathway Name Score First Few Leading Edge Genes . . .
P53_UP 1.4027 NDN IGFBP6 FHL2 APLP1 . . .
MAP00480_Glutathione_metabolism 1.3638 GGT1 GPX4 G6PD GSTM5 . . .
intrinsicPathway 1.283 F10 PROC F2 F12 . . .
MAP00360_Phenylalanine_metabolism 1.2174 TAT ALDH3A1 ABP1 MAOA . . .
Matrix_Metalloproteinases 1.2056 MMP17 MMP28 BSG TIMP3 . . .
MAP00340_Histidine_metabolism 1.195 ALDH3A1 ABP1 MAOA AOC3 . . .
HOX_LIST_JP 1.1591 HOXB7 HOXD1 HOXD9 HOXB2 . . .
ADULT_LIVER_vs_FETAL_LIVER_GNF2 1.1514 ADH1C SIGIRR CES2 RARRES2 . . .
electron_transporter_activity 1.083 SPR ETFB ADH1C BLVRA . . .
no1Pathway 1.0657 FLT4 BDKRB2 NOS3 CAV1 . . .
MAP00350_Tyrosine_metabolism 1.0628 ADH1C HGD TAT ALDH3A1 . . .
GO_ROS 1.0484 CCS PDLIM1 PRDX2 MTL5 . . .
MAP00220_Urea_cycle_and_metabolism_of . . . 1.0259 GAMT ARG1 GLUD1 OTC . . .
FRASOR_ER_UP 0.95187 IGFBP4 SLC39A6 AREG GLRB . . .
ST_Wnt_beta_catenin_Pathway 0.93623 TSHB NKD2 DKK4 PIN1 . . .
p53hypoxiaPathway 0.92514 FHL2 GADD45A CPB2 CDKN1A . . .
SIG_CD40PATHWAYMAP 0.90249 IKBKG MAPK11 NFKBIL1 MAPK3 . . .
AR_MOUSE_PLUS_TESTO_FROM_NETAFFX 0.89128 GPC1 ADRA2C CYP3A5 RAMP3 . . .
AR_ORTHOS_MAPPED_TO_U133_VIA_NETAFFX 0.88948 GPC1 ADRA2C RAMP3 CA4 . . .
AR_MOUSE 0.88948 GPC1 ADRA2C RAMP3 CA4 . . .
shh_lisa -1.7159 XPO1 DYRK1A CDK2 CDK8 . . .
SA_CASPASE_CASCADE -1.7064 APAF1 GZMB CASP3 DFFB . . .
XINACT_MERGED -1.6725 EIF1AX USP9X PRKX ATP6AP2 . . .
41bbPathway -1.6581 MAPK8 TNFRSF9 CHUK MAPK14 . . .
tnfr1Pathway -1.6289 MAP3K7 CASP3 MAPK8 LMNB1 . . .
MAP00252_Alanine_and_aspartate_metabolism -1.5879 NARS DDX3X ADSL ADSS . . .
CR_CELL_CYCLE -1.5836 FRK TTK SKP2 CDK6 . . .
Cell_Cycle -1.5798 SKP2 CDK6 HDAC2 CHEK2 . . .
CR_REPAIR -1.5732 BRCA2 CHEK2 PMS1 RAD1 . . .
fasPathway -1.573 MAP3K7 CASP3 MAPK8 LMNB1 . . .
g2Pathway -1.5727 CHEK2 EP300 CDC25A ATR . . .
GLUCOSE_DOWN -1.5652 ZFR DEK KIF11 PAPSS1 . . .
il7Pathway -1.5594 IL7 JAK1 PIK3CA EP300 . . .
chrebpPathway -1.5504 PRKAA1 GNB1 PRKAG2 PRKAR2A . . .
il1rPathway -1.5485 MAP3K7 MAPK8 IL1A TNF . . .
MAPK_Cascade -1.5191 NRAS BRAF MAPK1 RAF1 . . .
atrbrcaPathway -1.5186 FANCD2 BRCA2 CHEK2 RAD1 . . .
Il12Pathway -1.5129 JAK2 MAPK8 CCR5 MAPK14 . . .
mitochondriaPathway -1.5096 APAF1 CASP3 DFFB BIRC3 . . .
ST_Fas_Signaling_Pathway -1.5035 CASP8AP2 NFAT5 ROCK1 CASP3 . . .
Table 3.4: Snippet of the binary matrix (matrix-1)
Pathway hsa-miR-1282 hsa-miR-137 hsa-miR-3117-5p hsa-miR-32 hsa-miR-3673 hsa-miR-3976 hsa-miR-4428 hsa-miR-4522 . . .
P53_UP 0 1 1 1 1 1 1 1 . . .
MAP00480_Glutathione_metabolism 0 0 0 0 1 0 1 1 . . .
intrinsicPathway 0 1 1 0 1 1 1 1 . . .
MAP00360_Phenylalanine_metabolism 0 0 1 0 1 1 1 0 . . .
Matrix_Metalloproteinases 0 0 1 1 1 1 1 1 . . .
MAP00340_Histidine_metabolism 0 1 1 0 1 1 1 0 . . .
HOX_LIST_JP 0 1 0 1 1 1 1 0 . . .
ADULT_LIVER_vs_FETAL_LIVER_GNF2 0 1 1 1 1 1 1 1 . . .
electron_transporter_activity 0 1 1 1 1 1 1 1 . . .
no1Pathway 1 1 1 1 1 0 1 1 . . .
MAP00350_Tyrosine_metabolism 0 1 1 0 1 1 0 0 . . .
GO_ROS 0 1 1 0 1 1 1 0 . . .
MAP00220_Urea_cycle_and_metabolism_of . . . 0 0 1 0 1 0 1 1 . . .
FRASOR_ER_UP 1 1 1 1 1 1 1 1 . . .
ST_Wnt_beta_catenin_Pathway 0 1 0 1 1 0 1 1 . . .
p53hypoxiaPathway 0 1 0 1 0 0 1 0 . . .
SIG_CD40PATHWAYMAP 0 0 1 0 0 0 1 0 . . .
AR_MOUSE_PLUS_TESTO_FROM_NETAFFX 0 0 1 0 1 0 1 1 . . .
AR_ORTHOS_MAPPED_TO_U133_VIA_NETAFFX 0 0 1 0 1 0 1 1 . . .
AR_MOUSE 0 0 1 0 1 0 1 1 . . .
shh_lisa 1 1 1 1 1 1 1 1 . . .
SA_CASPASE_CASCADE 0 1 1 0 0 0 1 1 . . .
XINACT_MERGED 0 1 1 0 1 1 1 1 . . .
41bbPathway 0 1 1 1 1 1 1 1 . . .
tnfr1Pathway 0 1 1 0 1 1 1 1 . . .
MAP00252_Alanine_and_aspartate_metabolism 0 1 1 0 1 0 1 0 . . .
CR_CELL_CYCLE 1 1 1 1 1 1 1 1 . . .
Cell_Cycle 1 1 1 1 1 1 1 1 . . .
CR_REPAIR 0 1 1 0 1 1 1 1 . . .
fasPathway 0 1 1 0 1 1 1 1 . . .
g2Pathway 1 0 1 1 1 1 1 1 . . .
GLUCOSE_DOWN 1 1 1 1 1 1 2 1 . . .
il7Pathway 0 1 0 1 0 1 1 1 . . .
chrebpPathway 1 1 1 1 1 1 1 0 . . .
il1rPathway 0 1 1 0 1 1 1 1 . . .
MAPK_Cascade 1 1 1 1 1 1 1 1 . . .
atrbrcaPathway 1 1 1 1 1 0 1 1 . . .
Il12Pathway 0 1 0 0 0 1 1 1 . . .
mitochondriaPathway 0 1 1 0 1 1 1 1 . . .
ST_Fas_Signaling_Pathway 1 1 1 0 1 1 1 1 . . .
Table 3.5: Snippet of the generated binding site counting matrix (matrix-2)
Pathway hsa-miR-1282 hsa-miR-137 hsa-miR-3117-5p hsa-miR-32 hsa-miR-3673 hsa-miR-3976 hsa-miR-4428 hsa-miR-4522 . . .
P53_UP 0 1 1 2 2 3 3 1 . . .
MAP00480_Glutathione_metabolism 0 0 0 0 1 0 1 2 . . .
intrinsicPathway 0 2 1 0 6 1 2 2 . . .
MAP00360_Phenylalanine_metabolism 0 0 1 0 2 2 2 0 . . .
Matrix_Metalloproteinases 0 0 3 1 4 4 2 1 . . .
MAP00340_Histidine_metabolism 0 1 1 0 2 2 1 0 . . .
HOX_LIST_JP 0 1 0 3 3 2 2 0 . . .
ADULT_LIVER_vs_FETAL_LIVER_GNF2 0 4 3 2 5 2 3 4 . . .
electron_transporter_activity 0 4 2 2 3 7 6 4 . . .
no1Pathway 2 1 2 1 1 0 2 1 . . .
MAP00350_Tyrosine_metabolism 0 1 1 0 2 2 0 0 . . .
GO_ROS 0 2 1 0 1 2 5 0 . . .
MAP00220_Urea_cycle_and_metabolism_of . . . 0 0 1 0 1 0 2 1 . . .
FRASOR_ER_UP 1 3 4 1 2 2 7 2 . . .
ST_Wnt_beta_catenin_Pathway 0 2 0 2 1 0 1 1 . . .
p53hypoxiaPathway 0 2 0 2 0 0 1 0 . . .
SIG_CD40PATHWAYMAP 0 0 1 0 0 0 3 0 . . .
AR_MOUSE_PLUS_TESTO_FROM_NETAFFX 0 0 3 0 1 0 2 1 . . .
AR_ORTHOS_MAPPED_TO_U133_VIA_NETAFFX 0 0 3 0 1 0 2 1 . . .
AR_MOUSE 0 0 3 0 1 0 2 1 . . .
shh_lisa 1 8 2 3 3 1 2 1 . . .
SA_CASPASE_CASCADE 0 4 2 0 0 0 3 2 . . .
XINACT_MERGED 0 4 1 0 13 3 5 4 . . .
41bbPathway 0 3 2 1 2 2 1 1 . . .
tnfr1Pathway 0 7 2 0 4 3 6 4 . . .
MAP00252_Alanine_and_aspartate_metabolism 0 2 2 0 1 0 2 0 . . .
CR_CELL_CYCLE 4 6 3 3 11 6 7 5 . . .
Cell_Cycle 2 6 7 5 14 9 10 5 . . .
CR_REPAIR 0 3 1 0 5 2 4 6 . . .
fasPathway 0 7 2 0 5 5 8 4 . . .
g2Pathway 1 0 2 2 6 1 4 3 . . .
GLUCOSE_DOWN 1 13 8 2 14 4 20 13 . . .
il7Pathway 0 1 0 1 0 1 2 1 . . .
chrebpPathway 2 4 1 1 1 1 1 0 . . .
il1rPathway 0 3 3 0 4 5 4 5 . . .
MAPK_Cascade 3 7 3 1 3 6 3 3 . . .
atrbrcaPathway 1 3 2 2 5 0 5 5 . . .
Il12Pathway 0 3 0 0 0 1 2 1 . . .
mitochondriaPathway 0 4 2 0 1 2 4 3 . . .
ST_Fas_Signaling_Pathway 1 6 3 0 7 2 8 3 . . .
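The three score matrices of Section 3.3 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the script used in this work: we assume that the count for a pathway-miR pair is the sum of the binding sites over the leading edge genes of that pathway, that the differential expression of each miR is available as a single number, and all gene, miR and pathway names below are toy values.

```python
def score_matrices(pathway_legs, site_counts, diff_expr, mirs):
    """Build matrix-1 (binary), matrix-2 (counts) and matrix-3 (weighted).

    pathway_legs: pathway name -> list of its leading edge genes.
    site_counts: (gene, miR) -> number of predicted binding sites.
    diff_expr: miR -> differential expression (hypothetical single value).
    mirs: column order shared by the three matrices.
    Each matrix is returned as a dict mapping pathway name -> row (list).
    """
    binary, counts, weighted = {}, {}, {}
    for pathway, legs in pathway_legs.items():
        # Assumed aggregation: total sites over the pathway's leading edge genes.
        row = [sum(site_counts.get((gene, mir), 0) for gene in legs)
               for mir in mirs]
        counts[pathway] = row
        binary[pathway] = [1 if c > 0 else 0 for c in row]
        weighted[pathway] = [c * diff_expr.get(mir, 0.0)
                             for c, mir in zip(row, mirs)]
    return binary, counts, weighted


# Toy inputs for illustration only.
mirs = ["miR-x", "miR-y"]
pathway_legs = {"pathway_A": ["GENE_A", "GENE_B"], "pathway_B": ["GENE_C"]}
site_counts = {("GENE_A", "miR-x"): 2, ("GENE_B", "miR-x"): 1,
               ("GENE_B", "miR-y"): 1}
diff_expr = {"miR-x": 0.5, "miR-y": -2.0}
matrix_1, matrix_2, matrix_3 = score_matrices(
    pathway_legs, site_counts, diff_expr, mirs)
```

Deriving matrix-1 and matrix-3 from the same count row keeps the three measures aligned entry by entry, which is what makes the later comparison between them meaningful.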
3.4 Apply the random forest algorithm
We used the generated matrix and the normalized enrichment score (NES) as inputs
of the random forest regression R package. We computed the importance score for
each miR, indicating the extent to which that miR affects the score of a pathway. The
effect can be either positive or negative. Using matrix-2 as an example, the 10 miRs
with the highest and the 10 with the lowest importance are shown in Table 3.6.
Table 3.6: The list of miRs with the most positive and negative importance
Positive miRs Importance Negative miRs Importance
hsa.miR.3613.3p 0.37676027 hsa.miR.3188 -0.004386791
hsa.miR.664 0.25072487 hsa.miR.613 -0.003706353
hsa.miR.217 0.19105828 hsa.miR.374b -0.003467034
hsa.miR.600 0.13873220 hsa.miR.4748 -0.003281420
hsa.miR.203 0.11067679 hsa.miR.628.5p -0.003214184
hsa.miR.498 0.09804174 hsa.miR.761 -0.003153555
hsa.miR.4282 0.05898798 hsa.miR.938 -0.003113474
hsa.miR.579 0.04513470 hsa.miR.377 -0.002981766
hsa.miR.216b 0.03228265 hsa.miR.4517 -0.002953550
hsa.miR.1208 0.02719592 hsa.miR.3175 -0.002895210
3.5 Select important miRs
To estimate the statistical significance of the result from randomForest, we computed
the p-value of the importance score of each miR by randomly permuting the enrichment
scores and the matrix entries with the rfPermute R package. We repeated the process
1000 times, constructed a null distribution of randomized importance scores for each
miR/pathway pair, and computed the p-value of each observed importance score against
this background distribution with a Z-test. The miRs with the smallest p-values are
listed in Table 3.7.
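The permutation scheme can be illustrated with a small, self-contained Python sketch. It is not rfPermute and does not fit a random forest: as a loudly labeled stand-in for the importance score it uses the absolute Pearson correlation between one matrix column and the response, and it permutes only the response. The function names, the toy data and the seed are hypothetical. It returns both the Z-test p-value against the permutation null and, for comparison, the empirical permutation p-value.

```python
import math
import random
import statistics


def abs_pearson(x, y):
    """Stand-in importance statistic: absolute Pearson correlation."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return abs(cov) / math.sqrt(var_x * var_y)


def permutation_p_values(x, y, n_perm=1000, seed=0):
    """Return (z_test_p, empirical_p) for the statistic of x against y."""
    rng = random.Random(seed)
    observed = abs_pearson(x, y)
    shuffled = list(y)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between column and response
        null.append(abs_pearson(x, shuffled))
    # Z-test: how far the observed value lies above the null mean.
    z = (observed - statistics.mean(null)) / statistics.stdev(null)
    z_test_p = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided upper tail
    # Empirical p-value with the usual +1 correction.
    empirical_p = (1 + sum(v >= observed for v in null)) / (1 + n_perm)
    return z_test_p, empirical_p


# Toy example: a column that tracks the response perfectly.
column = list(range(1, 11))
response = list(range(1, 11))
p_z, p_emp = permutation_p_values(column, response)
```

With the +1 correction the empirical p-value can never be exactly zero; its smallest possible value is 1 / (n_perm + 1), so the number of permutations limits how small a reported p-value can be.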
Table 3.7: miRs with the smallest p-values
miR                p-value
hsa.miR.1297       0.00990099
hsa.miR.1323       0.00990099
hsa.miR.190        0.00990099
hsa.miR.203        0.00990099
hsa.miR.216b       0.00990099
hsa.miR.217        0.00990099
hsa.miR.3120.5p    0.00990099
hsa.miR.323.3p     0.00990099
hsa.miR.3529       0.00990099
hsa.miR.3613.3p    0.00990099
4 Results and Analysis
We select, for each measure, the miRs with p-value < 0.01 as the important miRs in
pathways. The selected lists can be found in Table 4.1, from which we draw a Venn
diagram, shown in Figure 4.1.
As the Venn diagram shows, the important miRs generated from matrix-2 and matrix-3
have 4 miRs in common, while matrix-1 shares only 1 with matrix-2 and 2 with
matrix-3. This is expected, since matrix-1 is binary and uses significantly less
information than the other two, so its results do not quite agree with theirs. In
fact, there is no three-way agreement among the three measures: no miR is selected
by all of them.
Figure 4.1: The selected important miRs of the three measures (Venn diagram of the
sets selected by the three measures, labeled "whether target", "#sites" and "#sites *
diff. expr."; the three-way intersection is empty)
Table 4.1: The selected miRs by each measure
Matrix-1        Matrix-2         Matrix-3
(Binary)        (#bdg. sites)    (#bdg. sites × diff. expr.)
hsa-miR-1208    hsa-miR-1258     hsa-miR-1246
hsa-miR-203     hsa-miR-1278     hsa-miR-1286
hsa-miR-2116    hsa-miR-1471     hsa-miR-1290
hsa-miR-217     hsa-miR-222      hsa-miR-1322
hsa-miR-3163    hsa-miR-23c      hsa-miR-1913
hsa-miR-3686    hsa-miR-3713     hsa-miR-2113
hsa-miR-3908    hsa-miR-376c     hsa-miR-3125
hsa-miR-4282    hsa-miR-4292     hsa-miR-3143
hsa-miR-548t    hsa-miR-4305     hsa-miR-3148
hsa-miR-579     hsa-miR-4325     hsa-miR-3658
hsa-miR-600     hsa-miR-549      hsa-miR-3685
hsa-miR-664     hsa-miR-562      hsa-miR-3692
hsa-miR-944     hsa-miR-586      hsa-miR-4254
hsa-miR-374c    hsa-miR-616      hsa-miR-466
hsa-miR-216b    hsa-miR-661      hsa-miR-555
hsa-miR-498     hsa-miR-920      hsa-miR-603
hsa-miR-507     hsa-miR-1323     hsa-miR-622
hsa-miR-548     hsa-miR-208a     hsa-miR-670
-               hsa-miR-374c     hsa-miR-892b
-               -                hsa-miR-938
-               -                hsa-miR-216b
-               -                hsa-miR-498
-               -                hsa-miR-507
-               -                hsa-miR-548
-               -                hsa-miR-1323
-               -                hsa-miR-208a
5 Acknowledgements
I would like to express my gratitude to my thesis advisor Professor Lana Garmire
for the continuous support of my work and for her motivation, enthusiasm, and
immense knowledge. Thank you for giving me the opportunity to be a part of the
research community and for broadening my background. Also, thanks to Sijia Huang
and Travers Ching for their research participation and collaboration.
6 Appendix
6.1 Parameters used in external libraries
6.1.1 GSEA-R
Program parameters:
doc.string = "breast_cancer",
non.interactive.run = F,
reshuffling.type = "sample.labels",
nperm = 1000,
weighted.score.type = 1,
nom.p.val.threshold = -1,
fwer.p.val.threshold = -1,
fdr.q.val.threshold = 0.25,
topgs = 20,
adjust.FDR.q.val = F,
gs.size.threshold.min = 15,
gs.size.threshold.max = 500,
reverse.sign = F,
preproc.type = 0,
random.seed = 111,
perm.type = 0,
fraction = 1.0,
replace = F,
save.intermediate.results = F,
OLD.GSEA = F,
use.fast.enrichment.routine = T
Analyzer parameters:
directory = "results/",
topgs = 20,
height = 16,
width = 16
6.1.2 Random Forest R package
ntree = 500
sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
importance = FALSE,
localImp = FALSE,
nPerm = 1,
norm.votes = TRUE,
do.trace = FALSE,
keep.forest = !is.null(y) && is.null(xtest),
corr.bias = FALSE,
keep.inbag = FALSE
References
[1] David P Bartel. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116(2):281–297, 2004.
[2] Elcie Chan, Daniel Estévez Prado, and Joanne Barnes Weidhaas. Cancer microRNAs: From subtype profiling to predictors of response to therapy. Trends in Molecular Medicine, 17(5):235–243, 2011.
[3] CZ Chen et al. MicroRNAs as oncogenes and tumor suppressors. New England Journal of Medicine, 353(17):1768, 2005.
[4] Carlo M Croce. Causes and consequences of microRNA dysregulation in cancer. Nature Reviews Genetics, 10(10):704–714, 2009.
[5] Aurora Esquela-Kerscher and Frank J Slack. Oncomirs - microRNAs with a role in cancer. Nature Reviews Cancer, 6(4):259–269, 2006.
[6] Hristo B Houbaviy, Michael F Murray, and Phillip A Sharp. Embryonic stem cell-specific microRNAs. Developmental Cell, 5(2):351–358, 2003.
[7] Douglas R Hurst, Mick D Edmonds, and Danny R Welch. Metastamir: the field of metastasis-regulatory microRNA is spreading. Cancer Research, 69(19):7495–7498, 2009.
[8] Jun Lu, Gad Getz, Eric A Miska, Ezequiel Alvarez-Saavedra, Justin Lamb, David Peck, Alejandro Sweet-Cordero, Benjamin L Ebert, Raymond H Mak, Adolfo A Ferrando, et al. MicroRNA expression profiles classify human cancers. Nature, 435(7043):834–838, 2005.
[9] Natalia J Martinez and Albertha JM Walhout. The interplay between transcription factors and microRNAs in genome-scale regulatory networks. BioEssays, 31(4):435–445, 2009.
[10] Reut Shalgi, Daniel Lieber, Moshe Oren, and Yitzhak Pilpel. Global and local architecture of the mammalian microRNA–transcription factor regulatory network. PLoS Computational Biology, 3(7):e131, 2007.
[11] Stefano Volinia, George A Calin, Chang-Gong Liu, Stefan Ambs, Amelia Cimmino, Fabio Petrocca, Rosa Visone, Marilena Iorio, Claudia Roldo, Manuela Ferracin, et al. A microRNA expression signature of human solid tumors defines cancer gene targets. Proceedings of the National Academy of Sciences of the United States of America, 103(7):2257–2261, 2006.
[12] Juan Wang, Ming Lu, Chengxiang Qiu, and Qinghua Cui. TransmiR: a transcription factor–microRNA regulation database. Nucleic Acids Research, 38(suppl 1):D119–D122, 2010.
[13] Stefan Wuchty, Dolores Arjona, and Peter O Bauer. Important miRs of pathways in different tumor types. PLoS Computational Biology, 9(1):e1002883, 2013.
Asset Metadata
Creator: Zhu, Xun (author)
Core Title: Identifying important microRNAs in progression of breast cancer
School: College of Letters, Arts and Sciences
Degree: Master of Science
Degree Program: Applied Mathematics
Publication Date: 07/11/2014
Defense Date: 06/10/2014
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: GSEA, microRNA, OAI-PMH Harvest, p-value, random-forest
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Lototsky, Sergey V. (committee chair), Garmire, Lana (committee member), Wang, Chunming (committee member)
Creator Email: zhuxun2@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-435497
Unique identifier: UC11287874
Identifier: etd-ZhuXun-2654.pdf (filename), usctheses-c3-435497 (legacy record id)
Legacy Identifier: etd-ZhuXun-2654-0.pdf
Dmrecord: 435497
Document Type: Thesis
Rights: Zhu, Xun
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA