Genomic risk factors associated with Ewing Sarcoma susceptibility
(USC Thesis Other)
GENOMIC RISK FACTORS ASSOCIATED WITH EWING SARCOMA
SUSCEPTIBILITY
by
Melissa Brooke Warden
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(MOLECULAR EPIDEMIOLOGY)
May 2012
Copyright 2012 Melissa Brooke Warden
TABLE OF CONTENTS
List of Tables
List of Figures
Abstract
Introduction
Chapter 1: Bioinformatics for Copy Number Variation Data
Chapter 1: Introduction
Chapter 1: Materials and Methods
Chapter 1: Discussion
Chapter 2: A Copy Number Variation Signature to Predict Human Ancestry
Chapter 2: Introduction
Chapter 2: Materials and Methods
Chapter 2: Results
Chapter 2: Discussion
Chapter 3: A Genome-Wide Association Study of Ewing Sarcoma
Chapter 3: Introduction
Chapter 3: Materials and Methods
Chapter 3: Results
Chapter 3: Discussion
Conclusions
Bibliography
LIST OF TABLES
Table 1: Software packages available for single sample CNV analysis
Table 2: Software packages available for common CNV analysis
Table 3: Demographic characteristics of EWS discovery set and replication set
Table 4: Associations between top 30 SNPs and EWS in discovery phase of 27 case-parent trios
Table 5: Detailed information of the 30 SNPs selected for replication
Table 6: Associations between SNPs and EWS in replication set of 139 cases and 381 controls
Table 7: Associations between SNPs and EWS in replication set of 87 case-parent trios
Table 8: Associations between SNPs and EWS in joint analysis of 162 cases and 381 controls
Table 9: Associations between SNPs and EWS in joint analysis of 110 case-parent trios
LIST OF FIGURES
Figure 1: Approach to Identify Common CNVs
Figure 2: Overview of Common CNV Method
Figure 3: GADA Identifies Common Ancestry CNVs Which Differ Between Populations of Different Ancestry
Figure 4: Principal Component Analysis (PCA) using caCNVs Clusters Samples by Ancestry
Figure 5: Identification of Unique caCNV Among European, African, and Han Chinese
Figure 6: Frequency of Copy Number Gains and Losses for the 53 unique caCNVs among the HapMap Training Sets
Figure 7: Estimated Probability of Ancestry Classification using caCNV Signature
Figure 8: Accuracy of Ancestry Prediction in Test Set using PCA of Genome-Wide SNPs
Figure 9: Flowchart of the SNP Data Analysis
Figure 10: Manhattan Plot of TDT GWAS Results for 27 EWS Case-Parent Trios
Figure 11: Regional Association Plot of Candidate SNPs in the Discovery Phase
Figure 12: Map of Genome-Wide CNVs identified in 27 EWS Cases
ABSTRACT
Ewing Sarcoma (EWS) is a malignant tumor of bone or soft tissue that primarily
affects children and adolescents. Most EWS tumors share the oncogenic fusion protein
EWS-FLI1 which is thought to play a role in the pathogenesis of these tumors. Besides
age, gender, and race, few environmental or genetic risk factors have been identified that
explain a substantial proportion of EWS cases. However, the ethnic-specific differences
in incidence of EWS suggest a genetic predisposition may be an important determinant of
developing this disease. In this dissertation, we hypothesized that genomic risk factors are
associated with risk of developing EWS. Specifically, we sought to identify single nucleotide
polymorphisms (SNPs) and copy number variants (CNVs) associated with EWS using a
family-based genome-wide association study (GWAS) design. In the initial GWAS
discovery phase, we identified several potential SNP variants associated with EWS. For
replication we selected the top 30 most significant SNPs with differences in allele
frequencies across racial ethnic groups with known differences in risk of EWS. We
identified 2 SNPs associated with EWS (p-value < 0.05) in a combined analysis of the
discovery and replication sets. First, SNP rs11217524 was associated with EWS using a
joint case-control analysis (OR = 0.48, 95% CI = 0.33 - 0.71, p-value = 1.6 × 10^-4) and
joint transmission disequilibrium test (TDT) analysis (OR = 0.63, 95% CI = 0.40 - 1.00,
p-value = 5.0 × 10^-2). SNP rs11217524 is located near poliovirus receptor-related 1
(PVRL1) on chromosome 11q23.3. This gene encodes an adhesion protein that is used as
a receptor for herpes simplex virus to mediate entry of the virus into human epithelial and
neuronal cells. Second, SNP rs7907995 was also associated with EWS using a joint
case-control analysis (OR = 1.36, 95% CI = 1.05 - 1.78, p-value = 2.2 × 10^-2) and joint TDT
analysis (OR = 1.57, 95% CI = 1.03 - 2.78, p-value = 3.1 × 10^-2). SNP rs7907995 is
located near the ankyrin repeat domain-containing protein 30A (ANKRD30A) gene on
chromosome 10p11.21. This locus is also known as NY-BR-1 or antigen B726P, which
has been used as a biomarker of disseminated breast cancer cells. The associations of
these 2 candidate SNPs and differences in allele frequencies across different ethnic
populations suggest that risk variants on chromosomes 10 and 11 may contribute to the
ethnic-specific incidence pattern of EWS. A family-based CNV association analysis of
22,543 probes in the 27 case-parent trios also provided suggestive evidence that
chromosomal region 14q11.2 may be associated with EWS risk (p-value = 4.0 × 10^-4).
Several probes in this region are located within the heterogeneous nuclear
ribonucleoprotein C (HNRNPC) gene. This gene belongs to the subfamily of ubiquitously
expressed heterogeneous nuclear ribonucleoproteins (hnRNPs), which affect pre-mRNA
processing and other features of mRNA metabolism and transport. The potential for early
detection of EWS by screening children for a marker of genetic susceptibility to this
disease could improve diagnosis and treatment of future patients. Given the small sample
size of this study, replication in a larger dataset is needed to confirm these candidate
genomic markers.
INTRODUCTION
Malignant bone cancers are sarcomas that develop in bones and the surrounding
soft tissue. Ewing Sarcoma (EWS) is a unique tumor thought to arise from neural crest
cells. First described in 1921 by James Ewing, EWS of the bone is an aggressive cancer
with an age-adjusted incidence rate of 2.93 cases per million per year. (Eseashvili et al.,
2008; Ewing, 1921) Although it can occur at any age, the most common age of diagnosis
is the second decade of life, and males are affected more often than females. (Bleyer et
al., 2006) There is considerable variation in incidence among different countries,
consistent with a strong genetic component to its etiology. (Fraumeni and Glass, 1970; Li
et al., 1980) Australians have one of the highest observed rates (2.9 per million), followed
by New Zealand Maoris (2.2 per million) and Brazilians (2.0 per million). (Parkin et al.,
1998) The lowest population incidence rates are for East Asians (Japan 1.0 per million,
China 0.1 per million) and blacks (US blacks 0.9 per million, Uganda 0.6 per million).
(Parkin et al., 1998) The most common presenting symptom of EWS is locoregional pain.
The duration of symptoms prior to diagnosis can last for weeks to months, with a median
of 3-9 months. (Widhe and Widhe, 2000) The tumor growth eventually leads to a visible
swelling of the affected area. The most common site of EWS is the central axis, which
includes the vertebral column, ribs, sternum, clavicle, pelvis, sacrum, and coccyx.
(Bleyer et al., 2006) Primary metastases frequently occur in the lungs, bone, or bone
marrow. There is no blood, serum, or urine test that can specifically identify EWS.
However, nonspecific signs of tumor or inflammation may be present, such as an increase
in erythrocyte sedimentation rate, moderate anemia, or leukocytosis. (Bernstein et al.,
2006) As for other malignant diseases, a biopsy is the definitive diagnostic test.
Morphologically, the tumor appears as sheets of monotonous small round cells with a
high nuclear to cytoplasmic ratio, and the cells are immunoreactive to CD99. (Tsokos,
1992) Over 90% of EWS tumors are characterized by the reciprocal t(11;22)(q24;q12)
translocation, which is used in the diagnosis. (Aurias et al., 1984) This translocation leads
to the fusion of the EWSR1 gene on 22q12 with the FLI1 gene on 11q24, resulting in the
formation and expression of the chimeric transcript EWS-FLI1. (Delattre et al., 1992)
Treatment for localized EWS consists of both local tumor control and systemic
chemotherapy. The 5-year survival rate of patients with localized disease is 68%.
(Eseashvili et al., 2008) A variety of different approaches are used to treat patients with
metastatic disease, such as dose-intensified chemotherapy, aggressive radiotherapy,
myeloablative chemotherapy with stem cell rescue, or allogeneic transplantation.
(Hendershot, 2005) The 5-year survival rate of patients with metastatic disease is 32%.
(Eseashvili et al., 2008) Patients with primary metastatic disease are at greater risk for
relapse compared to those with localized disease. (Leavey et al., 2008) The 5-year overall
survival of patients with recurrent disease ranges from 12% to 23%. (Barker et al., 2005; Leavey
et al., 2008)
Unfortunately, very little is known about the causes and prevention of EWS. Both
genetic and environmental exposures have been considered. The growth spurt during
adolescence is often thought to contribute to the etiology of the disease because of the
increased incidence during the second decade of life. (Ries et al., 1999) Most studies
have focused on maternal and paternal exposures during pregnancy as risk factors for
EWS. A population-based case-control study of 43 EWS cases and 193 frequency-matched
controls from the San Francisco Bay Area revealed a statistically significant
association between paternal occupation in agriculture (OR = 8.8, 95% CI = 1.8 - 42.7)
and an increased risk of developing EWS. (Holly et al., 1992) Similarly, a case-control
study conducted in Australia reported a statistically significant association between at
least one parent having a farm-related job at conception and/or during pregnancy and
increased risk of EWS (OR = 3.4, 95% CI = 1.1 – 10.5). (Valery et al., 2002) The results
of a case-control study in the US suggested that earlier reports of associations of EWS
with parental farming may have really been describing risks associated with exposure to
organic dusts when working on a farm, rather than agricultural exposures to pesticides or
farm animals. (Moore et al., 2005) The consistency of these results, together with the results
of a pooled analysis and meta-analysis, provides evidence supporting the hypothesis of an
association between EWS and parental occupation in farming. (Valery et al., 2005a)
Other studies also have reported an association between EWS and a history of hernia.
(Valery et al., 2005a) An Australian group who reported the association between parental
farming and EWS conducted a case-control study to look at non-farming related
exposures, and estimated that EWS cases were 3 times as likely to have had a hernia
compared to controls (OR = 3.1, 95% CI = 1.2 – 7.6). (Valery et al., 2003) A US case-
control study found that EWS cases were almost 6 times as likely to have had a hernia
compared to regional controls (OR = 5.7, 95% CI = 1.7 – 19.3). (Winn et al., 1992)
While these studies suggest a possible association between both a history of hernia and
parental farming and EWS, they are limited by small sample sizes and exposure history
obtained by retrospective interviews. Both of these limitations could lead to spurious
associations.
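As an illustration of how odds ratios and 95% confidence intervals of the kind quoted above are typically derived, the following minimal Python sketch computes an unadjusted odds ratio with a Wald interval from a 2x2 exposure table. The function name and the counts are hypothetical and are not taken from any of the cited studies, which may have used matched or adjusted analyses.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    a: exposed cases      b: unexposed cases
    c: exposed controls   d: unexposed controls
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts chosen only to illustrate the calculation.
or_, lo, hi = odds_ratio_ci(a=10, b=20, c=5, d=40)
print(f"OR = {or_:.2f}, 95% CI = {lo:.2f} - {hi:.2f}")
```

With small cell counts the standard error is large, which is why the intervals reported for rare exposures in small EWS studies are so wide.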
An Australian case-control study, the Inter-Regional
Epidemiological Study of Childhood Cancer (IRESCC), reported a slight excess of risk
for older maternal age (over 30 years old) at birth for children with EWS compared to
controls. (Hartley et al., 1988) This group also found that children with EWS were more
likely to have lower birth weight and developmental abnormalities compared to controls.
(Hartley et al., 1988) Indications of a genetic or prenatal origin for EWS are supported by
the observation of an apparent excess of congenital abnormalities in a series of 154
patients with this type of sarcoma. (McKeen et al., 1983) Genetic susceptibility may
result in clustering of cancers within families. Reports of familial clustering of cancers in
EWS are rare, but several pairs of siblings with this malignancy have been reported.
(Hutter et al., 1964; Joyce et al., 1984; Zamora et al., 1986) Li et al. examined the risk of
familial bone cancer in offspring by parental probands using the Swedish Family-Cancer
Database. (Batanian et al., 2002) They found a suggestion of increased EWS risk with parental melanoma
(SIR = 2.4, 95% CI = 0.9 – 4.7), which also has been suggested in a case series. (Hartley
et al., 1991; Li and Hemminki, 2002)
A follow-up study by this same group in Sweden found that parental kidney cancer was
associated with an increased risk of EWS (SIR = 5.6, 95% CI = 1.5 – 14.6). (Ji and
Hemminki, 2006) The results of these studies, as well as the variation in incidence across
populations, are consistent with a probable genetic susceptibility to EWS.
While familial clustering of cancer suggests that germline mutations contribute to
cancer formation, somatic mutations are another essential component for tumor
development. The EWS-FLI1 fusion protein that is present in 90% of EWS tumors links a
strong transcriptional activation domain from EWSR1 to the ETS DNA-binding portion
of FLI1. It is sufficient for EWS oncogenesis, as inhibition of fusion function or
expression results in the loss of transformation of EWS cells. (Kovar et al., 1996;
Ouchida et al., 1995; Smith et al., 2006; Turc-Carel et al., 1988) Genomic instability in
cancer cells also may lead to abnormal genome copy number with associated gain or loss
of important genes in tumor progression. (Knuutila et al., 1998) Most EWS cases exhibit
defects in the maintenance of genomic stability with subsequent DNA copy number
alterations (CNA) in tumors. Using conventional CGH and array CGH methods, several
studies have shown that 63 – 84% of EWS tumors contain CNA. (Amiel et al., 2003;
Armengol et al., 1997; Brisset et al., 2001; Ferreira et al., 2007; Ozaki et al., 2001;
Tarkkanen et al., 1999) Changes in DNA copy number are not limited to tumors,
however, but are also present in germline DNA. Known as copy number variation
(CNV), these gains and losses of genetic material are greater than 1 kilobase (kb) in size
and vary in frequency among healthy individuals. (Redon et al., 2006) Copy number
polymorphisms (CNP) are common CNVs present in greater than 1% of the population,
while CNVs that are found in less than 1% of the population are considered to be rare.
(Itsara et al., 2009) These large regions of the genome encompassed by the CNV are
likely to include several deleted or duplicated genes, unlike traditional SNPs that affect
only one gene at a time. (Shlien and Malkin, 2009) Understanding the function of genetic
alterations such as CNVs may provide valuable insight into the development of EWS,
and why Europeans are more susceptible to this disease than populations of different
ancestry.
There are few known risk factors for EWS, and none of them explain the ethnic-
specific incidence pattern. One explanation for the lack of clearly defined risk factors
may be methodological issues. It is difficult to assemble an adequately powered study
with a large sample size when studying a rare disease. To increase statistical power, many
studies include cases of differing stages or clinical subtypes; however, the combination of
etiologically distinct phenotypes will diminish our ability to identify exposure-disease
associations if they truly exist. The typical study design for rare diseases, the case-control
study, may be subject to exposure misclassification of some past exposures due to
retrospective reporting of the past exposures after the disease has already occurred. The
absence of clearly established environmental risk factors suggests that genetic or
gene-by-environment risk factors may be important.
Initial attempts to investigate a genetic predisposition to EWS have focused on
single nucleotide polymorphisms (SNPs). A previous report suggested that individuals of
African ancestry have fewer Alu repeat sequences in intron 6 of the EWSR1 gene
compared to individuals of European ancestry. (Zucman-Rossi et al., 1997) This led to a
candidate gene study to identify SNPs in the EWSR1 gene that differ between EWS
cases and race-matched controls. (DuBois et al., 2011) Twenty-one SNPs were genotyped
in 135 cases and 200 controls. Intron 7 of the EWSR1 gene, the most common site of the
EWSR1 translocation, was also sequenced to identify novel variants. The findings
reported no association between SNPs and EWS susceptibility in the EWSR1 gene.
Recently, a genome-wide association study (GWAS) of EWS identified common
variants on chromosomes 1p36, 10q21, and 15q15 which may predispose to this disease.
(Postel-Vinay et al., 2012) A joint analysis of the GWAS and two independent replication
sets in this study identified associations with SNP rs9430161 (OR = 2.20, 95% CI = 1.77
- 2.72, p-value = 1.4 × 10^-20); SNP rs224278 (OR = 1.66, 95% CI = 1.42 - 1.93, p-value =
4.0 × 10^-17); and SNP rs4924410 (OR = 1.46, 95% CI = 1.23 - 1.74, p-value = 6.6 × 10^-9).
The variants on 1p36.22 are located 25 kb from the Tat activating regulator DNA-binding
protein (TARDBP) gene. The variants on 10q21 are located within a 561 kilobase (kb)
linkage disequilibrium (LD) block spanning 3 genes: 2-aminoethanethiol dioxygenase
(ADO), zinc-finger protein 365 (ZNF365), and early growth response protein 2 (EGR2).
The variants on 15q15 are located within a 50 kb LD block spanning 3 genes: eukaryotic
translation initiation factor 2 kinase (EIF2AK4), signal recognition particle 14kDa
(SRP14), and BCL2 modifying factor (BMF). Analysis of the haplotypes carrying these
candidate SNPs showed differences in frequencies between European and African
populations, suggesting variants in these regions account for the differences in incidence
of EWS observed in these populations.
In this dissertation, I explore the role of genomic variation in EWS by performing
a family-based genome-wide association study. In the next three chapters I provide an
introduction to CNVs, discuss a new analysis method to detect common CNVs that can
predict human ancestry, and explore the association of SNPs and CNVs with EWS.
Chapter 1 provides an introduction to the CNV technology and describes a novel and
simple analysis method to detect common CNVs. CNVs are structural changes that occur
throughout the genome, primarily due to duplication, deletion, insertion, and unbalanced
translocation events. These gains and losses of genetic material are 1 kb or greater in size,
and are found to vary in frequency among healthy individuals. The frequency of CNVs
varies by ethnicity, and may contribute to phenotypic variations and differences in
disease susceptibility across different ethnic groups.
Chapter 2 demonstrates an application of the computational method to detect
common CNVs presented in the previous chapter. Common CNVs can be used to build a
CNV signature for a genetic predisposition to a complex disease using a case-control
dataset. As proof-of-principle, a common ancestry CNV (caCNV) signature was built
which can be used to predict human ancestry. This approach uses a simple application of
the Genome Alteration Detection Analysis (GADA) algorithm to a distribution of t-statistics
obtained by comparing probe signal intensities from microarray copy number data of two
groups. The t-statistics arranged by the genomic locations of the probes allow detection
of common genome-wide CNVs. Pair-wise analysis of the publicly available
International HapMap Project microarray data from individuals of European, Asian, and
African descent is used to identify caCNVs. Next, the copy number status of each
individual is assessed for the caCNVs and used as features in a linear discriminant
analysis model to identify a caCNV signature that can predict ancestry. Lastly, this CNV
signature is validated in an independent dataset of samples with similar ancestry. The
proposed approach is a quick and simple method that avoids both the data reduction
techniques otherwise needed to identify common CNVs and the principal component based
analyses otherwise used to assign class membership from signal intensities alone. This novel approach can be used to
identify a genetic susceptibility signature to complex diseases such as EWS.
Chapter 3 describes a genome-wide association study of EWS. Ethnicity is a
strong risk factor for EWS, yet no genetic or environmental risk factors have been
identified that explain the ethnic-specific incidence of this disease. Recently, GWAS
have been successful in identifying cancer susceptibility loci, and this is the approach
used to identify a genetic predisposition to EWS in this chapter. (Broderick P, 2009;
Eeles RA, 2009; Gudmundsson J, 2009; Kanetsky PA, 2009; Petersen GM, 2010; Postel-
Vinay et al., 2012; Song H, 2009; Tenesa A, 2008; Thomas et al., 2008; Wu X, 2009)
The genetic epidemiology of EWS remains poorly understood due to the rarity and the
poor survival rate of this disease. Therefore, we have established a registry of
biospecimens (whole blood, peripheral blood stem cells, mouthwash, and tumors) from
individuals diagnosed with EWS, as well as their parents and siblings. The DNA is
extracted and hybridized to a high-density microarray and the signal intensities of the
millions of hybridizations are used to determine SNP genotype and estimate copy
number. The significance of transmission of a SNP from a heterozygous parent to an
affected child is tested using the TDT, and CNVs are identified using a Family-Based
Association Test (FBAT). A replication set is then used to validate the findings of the
GWAS, using the most statistically significant SNPs that also show differences in minor
allele frequencies across various ethnic populations.
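The TDT described above can be sketched in a few lines. Assuming the classic biallelic form of the test, b counts heterozygous parents who transmit the allele of interest to the affected child and c counts those who transmit the other allele. The statistic (b - c)^2 / (b + c) is referred to a chi-square distribution with 1 degree of freedom, and b / c estimates the odds ratio. The function name and the counts below are hypothetical illustrations, not data from this study.

```python
from math import erfc, sqrt


def tdt(transmitted, untransmitted):
    """McNemar-type transmission disequilibrium test for one biallelic SNP.

    transmitted:   heterozygous parents transmitting the allele of interest
    untransmitted: heterozygous parents transmitting the other allele
    Returns (chi-square statistic, p-value, odds ratio estimate).
    """
    b, c = transmitted, untransmitted
    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    odds_ratio = b / c
    return chi2, p_value, odds_ratio


# Hypothetical transmission counts for a single SNP.
chi2, p, oratio = tdt(transmitted=30, untransmitted=15)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, OR = {oratio:.2f}")
```

Only heterozygous parents are informative, so the effective sample size of a trio study is the number of such transmissions rather than the number of trios.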
In conclusion, the aim of this research is to identify potential genomic risk factors
for EWS using a comprehensive SNP and CNV approach. CNVs are emerging as an
important class of genetic variant associated with disease susceptibility, and we have developed
an application to identify a CNV signature for genetic susceptibility to complex diseases.
A GWAS is an unbiased approach to scanning the entire genome of EWS cases and
parents to identify genetic variants that may predispose children and young adults to
EWS. The potential for early detection of EWS by screening children for a marker of
genetic susceptibility to this disease could improve diagnosis and treatment of future
patients.
CHAPTER 1: BIOINFORMATICS FOR COPY NUMBER VARIATION DATA
Introduction
Until recently, single nucleotide polymorphisms (SNPs) were thought to be the
most abundant source of human genetic variation. A SNP is a single site in the DNA at
which two or more different nucleotide pairs occur at a frequency of 1% or greater within
a population. Several million SNPs have been documented to date, and are responsible
for the majority of phenotypic variability observed in humans. More recently, the
importance of another submicroscopic type of structural genetic variant has been
discovered, named copy number variants (CNVs). (Beckmann et al., 2008) CNVs are
structural changes that occur throughout the genome, primarily due to duplication,
deletion, insertion, and unbalanced translocation events. (Redon et al., 2006) Known
mechanisms of CNV formation include meiotic recombination, homology-directed and
non-homologous repair of double-strand breaks, and errors in replication. (Hastings et al.,
2009) These gains and losses of genetic material are 1 kb or greater in size, and are found
to vary in frequency among healthy individuals. (Shaikh et al., 2009) Copy Number
Polymorphisms (CNP) are common CNVs present in greater than 1% of the population,
while CNVs that are found in less than 1% of the population are considered to be rare.
(Itsara et al., 2009) The frequency of CNVs varies by ethnicity, which may contribute to
phenotypic variations and differences in disease susceptibility across different ethnic
groups. (Jakobsson et al., 2008; Takahashi et al., 2008) Several public databases are
available, which provide a comprehensive summary of CNVs detected in disease-free
human populations. For example, the Database of Genomic Variants (DGV) is a
collection of the structural variation identified in the human genome. It is continuously
updated with the detailed information of the location and gene content of several types of
structural variation, including, but not limited to, CNVs.
The functional impact of CNVs has been demonstrated through both cellular
phenotypes, such as gene expression, and by the genetic basis of human disease. (Conrad
et al., 2009) These large regions of the genome encompassed by the CNV are likely to
include several deleted or duplicated genes, unlike traditional SNPs that affect only one
gene at a time. (Shlien and Malkin, 2009) CNVs are known to confer risk for inherited
diseases, such as autism spectrum disorders, X-linked mental retardation in males,
schizophrenia, and bipolar disorder; complex diseases, such as systemic lupus
erythematosus and HIV-1/AIDS susceptibility; and cancer. (Bauters et al., 2008;
Kusenda and Sebat, 2008; Lachman, 2008; Nakajima et al., 2008; Ptacek et al., 2008) It
was recently reported that a common CNV inherited at 1q21.1 is associated with an
increased risk of developing neuroblastoma, and that this CNV influences the expression
of a previously unknown neuroblastoma breakpoint family (NBPF) gene. (Diskin et al.,
2009)
Several approaches have been used to examine CNVs in the human genome.
Comparative genomic hybridization (CGH) was first developed for genome-wide
analysis of DNA sequence copy number in a single experiment. (Kallioniemi et al., 1993)
CGH is based on a competitive in situ hybridization of differentially fluorescently labeled
test and reference DNA to normal human metaphase chromosomes. The fluorescence
intensity ratio measured along the length of each chromosome is approximately
proportional to the ratio of the copy numbers of the corresponding DNA sequences in the
test and reference genomes. However, the use of CGH was limited by its low resolution
of only 5-10 megabases (Mb), and so improvements were made possible using the
resources generated for the public-domain Human Genome Project, where large-insert
clone libraries were developed and assembled into overlapping contigs for sequencing.
(Chueng, 2001) The metaphase chromosomes used for CGH could now be replaced with
arrays of clones accurately mapped to the human genome. Bacterial artificial
chromosome (BAC) and P1-derived artificial chromosome (PAC) clones are most commonly
amplified and spotted for genome wide CGH arrays, with a resolution of 1-1.5 Mb.
(Snijders et al., 2001) Array-CGH is similar to CGH in that test and reference DNA are
differentially fluorescently labeled and hybridized together to the array. The resulting
fluorescent ratio is then measured for each clone and plotted relative to the position of the
clone in the genome. (Pinkel et al., 1998) Finally, oligonucleotide probes provide the
highest resolution for array-CGH; however, use of these shorter probes results
in less stringent hybridization, leading to a poorer signal-to-noise ratio and higher signal
variability compared to the CGH platform. (Carvalho et al., 2004)
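Because the fluorescence ratio is approximately proportional to the copy number ratio, array-CGH measurements are commonly summarized as log2 ratios. The sketch below shows the idealized log2 ratio expected for a given test copy number against a diploid reference. The function name is hypothetical, and measured values are typically compressed toward zero by noise and probe-specific effects.

```python
from math import log2


def expected_log2_ratio(test_copies, reference_copies=2):
    """Idealized log2(test / reference) for a locus, assuming the signal
    ratio is exactly proportional to the copy number ratio."""
    if test_copies <= 0 or reference_copies <= 0:
        raise ValueError("copy numbers must be positive for a finite log ratio")
    return log2(test_copies / reference_copies)


# Single-copy loss, normal diploid state, and single-copy gain.
for copies in (1, 2, 3):
    print(copies, round(expected_log2_ratio(copies), 3))
```

The asymmetry between a single-copy loss and a single-copy gain on this scale is one reason gains are harder to detect than losses at a fixed noise level.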
Synthetic high-density oligonucleotide microarrays developed for
genome-wide single nucleotide polymorphism (SNP) genotyping are now also being used to
estimate copy number. (Bengtsson et al., 2009) Hybridizations are not performed using
co-hybridization of test and reference DNA as in array-CGH, but rather by hybridization
of a single DNA sample to an oligonucleotide probe. In order to improve the signal-to-
noise ratio, Affymetrix has developed a technology in which the DNA sample is first
digested using restriction enzymes. The smaller DNA fragments are ligated with adapters
and then polymerase chain reaction (PCR) is used to amplify the fragments with
universal PCR primers. The PCR product of a single sample is fluorescently labeled and
hybridized to a chip consisting of millions of 25 base pair (bp) oligonucleotide probes.
The signal intensities of these millions of hybridizations are used to determine genotype
and estimate copy number. This entire process reduces the complexity of the
hybridization; however, it also introduces possible bias. Preferential amplification of
different regions of the genome may reflect differences in restriction digestion patterns
rather than copy number variation between individuals. Illumina® has developed an
alternative platform using 50-bp oligonucleotides attached to indexed beads randomly
deposited onto glass slides. Following whole genome amplification and fragmentation, a
two-step allele detection method is used. First, unlabeled DNA fragments are hybridized
to 50-bp probes on the array, followed by an enzymatic single base extension with
labeled nucleotides. Similar to the Affymetrix arrays, the signal intensities of these
millions of hybridizations are used to determine genotype and estimate copy number.
While high-density oligonucleotide microarrays have revolutionized the detection of
CNVs in large-scale genome studies, next-generation sequencing technologies are now
available, providing improved accuracy and specificity.
The rapid development of new sequencing technologies, such as Roche’s 454
sequencing, Illumina’s Genome Analyzer, and Applied Biosystems’ SOLiD, is
continuously increasing the speed and throughput of sequencing, while also decreasing
the cost. (Bentley, 2006; Margulies et al., 2005; Valouev et al., 2008) Several
computational methods are available for CNV detection using these next-generation
sequencing platforms. (Xie and Tammi, 2009; Yoon et al., 2009)
Existing methods for CNV detection are often performed on a sample-by-sample
basis. (Warden et al., 2011) A Hidden Markov Model (HMM) and Bayesian analysis are
statistical approaches commonly used for single-sample CNV calling. (Broët and
Richardson, 2006; Cahan et al., 2008; Colella et al., 2007; Daruwala et al., 2004; Fiegler
et al., 2006; Korbel et al., 2007; Korn et al., 2008; Wang et al., 2007) There are also
methods to detect population-level CNVs by first identifying the CNVs on a sample-by-sample
basis, followed by a data reduction method to identify the common CNVs.
(Beroukhim et al., 2007; Diskin et al., 2006; Ivakhno and Tavare, 2010) However, these
methods are not ideal for large datasets where common CNVs must be determined by
comparing the CNVs of the individual samples. This chapter describes these methods,
and proposes a new method that identifies population-level CNVs using an application of
the published single-sample Genome Alteration Detection Analysis (GADA) method.
(Pique-Regi et al., 2010; Pique-Regi et al., 2008; Pique-Regi et al., 2009) GADA utilizes
a sparse Bayesian learning (SBL) technique to determine the possible CNV locations, and
then a backward elimination (BE) procedure is used to rank the CNVs for manual
adjustment of the false discovery rate (FDR). GADA’s high accuracy and computational
efficiency have proven its utility in very large data sets used to identify global variation in
copy number in the human genome (Conrad et al., 2009). Here a simple application of the
GADA algorithm is used to identify common CNVs directly using the quantitative signal
intensity data, which eliminates the need to identify the individual CNVs. A distribution
of t-statistics obtained by comparing probe signal intensities from microarray data of two
groups is arranged by the genomic locations of the probes to allow for the detection of
common CNVs using the GADA algorithm.
Materials and Methods
DNA Microarray Platforms
CGH and array-CGH were originally used for genome-wide CNV detection, but
are limited by their low resolution. Roche’s NimbleGen and Agilent technologies both
offer microarray-based CGH products with more than one million 50-bp or greater probes
useful for detecting copy number polymorphisms but not SNPs. The synthetic high-
density oligonucleotide microarrays used for genome-wide single nucleotide
polymorphisms (SNP) genotyping and copy number estimation provide the highest
resolution. Illumina has developed a platform using 50-bp oligonucleotides attached to
indexed beads randomly deposited onto glass slides. The HumanOmni1-Quad BeadChip
contains more than one million probes. The Genome-Wide Human SNP Array 6.0 from
Affymetrix (Affymetrix Inc., Santa Clara, CA) features more than 1.8 million markers of
genetic variation, including probes for the detection of both SNPs and CNVs.
Raw Data and Annotation Files
The Affymetrix CEL files store the results of the intensity calculations from the
DAT file, which is where the pixel intensity values collected from an Affymetrix scanner
are stored. The CEL file includes an intensity value, standard deviation of the intensity,
the number of pixels used to calculate the intensity value, a flag to indicate an outlier as
calculated by the algorithm and a user defined flag indicating the feature should be
excluded from future analysis. The CEL file stores these data for each feature on the
Affymetrix microarray and is used for all downstream analysis. For Illumina microarrays,
the IDAT file contains the average intensity value for each probe averaged over at least
20 beads collected from an Illumina scanner. The IDAT file can be read by the Illumina
BeadStudio analysis software to produce all other types of files for downstream analysis.
A unique Chip Definition File (CDF) accompanies each type of Affymetrix
microarray. The CDF file contains necessary information about the specific layout of the
microarray and can be downloaded from the Affymetrix website. Platform specific
annotation files map the units in the CDF file, and contain genome information specific
to the type of microarray, such as the chromosome number, transcription starting and
ending sites, and strand indication (the sense strand of a gene relative to the genome
sequence). Several command-line and Graphical User Interface (GUI)-based applications
are provided by commercial companies for preprocessing the array files. Additional
modules available in the R open-source statistical platform also allow for preprocessing
of samples. The Aroma Project (Aroma Project: www.aroma-project.org) supports
preprocessing of Affymetrix raw data and contains specific information regarding the
25-mer oligonucleotide sequences and strand indication. This software also allows for
further downstream analyses of a variety of commercial array platforms including the
Illumina platform.
Computer Hardware and Software
For large studies, a UNIX-based environment is highly recommended. For small
to moderate projects, a 32-bit or 64-bit MS Windows computer operating system is
recommended. The high density genome-wide microarray datasets require large amounts
of memory and storage. For instance, the size of an Affymetrix Genome-Wide Human
SNP Array 6.0 CEL file is approximately 70 megabytes. Therefore, the minimum
hardware requirements are a 120 GB hard drive, 4 GB of memory, and at least a 2.0 GHz
Intel Pentium Processor.
R is both a computer language for statistical computing and free software that
provides a coherent, flexible system for data analysis that can be extended as needed
(www.r-project.org). The open-source nature of R ensures its availability and it runs on a
variety of UNIX platforms, Windows, and MacOS. Aroma.Affymetrix is an open-source
R package that provides memory-efficient methods to perform basic data analysis such as
normalization and probe-level summarization on Affymetrix microarray datasets.
(Bengtsson et al., 2009) GADA is an R package that imports normalized Affymetrix or
Illumina microarray data sets and detects CNVs and also allows jointly modeling both
copy number and reference intensities. (Pique-Regi et al., 2009) GADA’s CNV module
can also be called from within the Aroma Project.
Data Analysis
Quality Control
Quality control (QC) is the first component of data analysis. A single-sample QC
analysis can be used to identify poor quality samples that should be removed from
subsequent analysis. This can be done by simply comparing the signal intensities (in log
scale) of each probe on the microarray for each sample. The single-sample QC metric
should be a good indication of the final performance of the copy number estimation. The
signal intensities from the X chromosome of male samples can also be used to measure
the distance between a copy number state of two versus a copy number state of one.
These QC methods can be performed both before and after normalization to determine
whether known sources of background have been correctly minimized.
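The X-chromosome check described above can be sketched in a few lines. This is a minimal illustration rather than the procedure used in this work: the function name, the toy values, and the choice of the median as the summary are assumptions made for the example. It reports the gap between the typical autosomal log2 signal (copy number two) and the typical X-linked log2 signal (copy number one) in a male sample.

```python
from statistics import median


def x_separation(log2_signal, chromosome):
    """Median autosomal log2 signal minus median X-chromosome log2 signal.

    In a male sample the autosomes carry two copies and the X chromosome one,
    so a larger value indicates a cleaner separation of the two copy states.
    """
    autosomal = [s for s, c in zip(log2_signal, chromosome) if c not in ("X", "Y")]
    x_linked = [s for s, c in zip(log2_signal, chromosome) if c == "X"]
    if not autosomal or not x_linked:
        raise ValueError("need at least one autosomal and one X-linked probe")
    return median(autosomal) - median(x_linked)


# Hypothetical toy data for one male sample (values are illustrative only).
signal = [10.1, 9.9, 10.0, 10.2, 9.0, 9.1, 8.9]
chrom = ["1", "1", "2", "2", "X", "X", "X"]
separation = x_separation(signal, chrom)
```

Running the same function before and after normalization gives a simple way to see whether the separation between the two copy number states has improved.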
Normalization
Measurements from microarrays can be affected by many non-biological factors, such
as sample extraction and hybridization. Therefore, it is important to correct the
measurements using normalization procedures in order to make comparisons between
different samples. Probe-level transformation methods are used to transform the
measurements into modified probe intensities by identifying and removing the systematic
effects that cannot be explained by the biological variation of interest or by random noise.
Examples of generic transformations include Robust Multi-array Average (RMA) background
correction, gcRMA background correction, and quantile normalization. (Irizarry et al.,
2003) RMA background correction estimates the background using a mixture model
which assumes that the background signals follow a normal distribution and the true
signals follow an exponential distribution. Using quantile normalization, the target
distribution is first estimated by calculating an average of all the signal intensities across
all the microarrays, and then each microarray is normalized toward this target
distribution. (Bengtsson et al., 2008) In addition, an effect known as allelic crosstalk can
occur on the microarrays because the oligonucleotide sequences for allele A and allele B
probes only differ by one nucleotide. This cross-hybridization can be corrected for using
allelic-crosstalk calibration methods. (Bengtsson et al., 2008)
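Quantile normalization as described above can be illustrated with a short sketch. It is a plain standard-library version written for clarity, not the memory-efficient implementation provided by the packages cited here; the function name and the toy intensities are assumptions. The target distribution is the mean, across arrays, of the k-th smallest intensity, and each array then receives the target value that matches the rank of each of its probes.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length intensity lists (one per array)."""
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Target distribution: average of the k-th smallest value over all arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    target = [sum(column) / len(arrays) for column in zip(*sorted_arrays)]

    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        # Ties keep their original order; no tie averaging is attempted here.
        order = sorted(range(n_probes), key=lambda idx: a[idx])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = target[rank]
        normalized.append(out)
    return normalized


# Two hypothetical arrays with three probes each.
result = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After normalization every array has exactly the same set of values, and only the assignment of those values to probes differs between arrays.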
Table 1: Software packages available for single sample CNV analysis.
Software | Platforms Supported | CNV Algorithms | Details
BeadStudio Analysis | Illumina | cnvPartition, proprietary (additional algorithms can be imported as modules) | CNV, LOH detection; GUI-based software
Affymetrix Genotyping Console | Affymetrix | Canary | Normalization; CNV, LOH detection; GUI-based software
Affymetrix Power Tools (APT) | Affymetrix | Multiple | CNV, LOH detection; Command-line
DNA-chip Analyzer (dChip) | Affymetrix, Illumina | HMM | CNV, LOH detection; GUI-based software
Nexus Copy Number | Multiple Platforms | Segmentation algorithm | CNV, LOH detection; GUI-based software
Partek Genomics Suite | Multiple Platforms | HMM | CNV, LOH detection; GUI-based software
GADA | Multiple Platforms | SBL | CNV detection; R-package
CNAG | Affymetrix | HMM | CNV, LOH detection; GUI-based software
ITALICS | Affymetrix | GLAD algorithm | CNV detection; R-package
PennCNV | Multiple Platforms | HMM | CNV detection; Command-line
Summarization
Once the probe-level signal intensities have been background-corrected and
normalized, the signals must be summarized. Summarization methods are used to
summarize multiple signals from a set of probes into a single signal. Probe-level models
(PLMs) describe the pre-processed signal intensities using statistical
models that include both the systematic effects and random noise.
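To make the idea of summarization concrete, the sketch below collapses the probes of each probe set into one value. It is a deliberately simple stand-in (a per-probe-set median), not a fitted probe-level model; the function name and the probe set identifiers are hypothetical.

```python
from statistics import median


def summarize_probe_sets(log2_signal, probe_set_ids):
    """Collapse probe-level log2 signals into one median value per probe set."""
    groups = {}
    for value, set_id in zip(log2_signal, probe_set_ids):
        groups.setdefault(set_id, []).append(value)
    return {set_id: median(values) for set_id, values in groups.items()}


# Hypothetical example: three probes for one SNP and two for one copy number locus.
signals = [8.0, 8.4, 8.2, 9.0, 9.5]
set_ids = ["SNP_A", "SNP_A", "SNP_A", "CN_1", "CN_1"]
summary = summarize_probe_sets(signals, set_ids)
```

A median is robust to a single outlying probe, which is one reason robust estimators are commonly preferred over a plain mean at this step.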
Detection of Individual CNVs
A major concern for the detection of CNVs using synthetic high-density
oligonucleotide microarray technology is how to define the breakpoints of a given CNV.
Many algorithms have been developed to detect CNVs and are based on the assumption
that the genome of a normal diploid individual consists of a constant number of DNA
segments. (Pique-Regi et al., 2009) GADA is an R package that imports the preprocessed
Affymetrix microarray data sets and detects CNVs by jointly modeling copy number and
reference intensities. (Pique-Regi et al., 2009) This segmentation procedure is done in
two steps. First, a sparse Bayesian learning (SBL) model is fit to determine the most
likely candidate breakpoints of a given CNV. Second, a backward elimination (BE)
procedure consecutively removes the least significant breakpoints and allows for
modification of the False Discovery Rate (FDR). Several other applications and
algorithms have been developed for detection of CNVs in individual samples (Table 1).
PennCNV is a free software tool for detection of CNVs from Affymetrix and Illumina
microarray data sets. This algorithm uses a hidden Markov model (HMM) based
approach that uses total signal intensity and allelic intensity ratio for each probe, the
distance between neighboring SNPs, the allele frequency of SNPs, and pedigree
information when available. (Wang et al., 2007) dChip and CNAG are also freely
available GUI-based applications that allow processing of Affymetrix CEL files,
detection, and visualization of chromosomal regions with loss of heterozygosity (LOH)
and copy number alterations using HMM algorithms.
Table 2: Software packages available for common CNV analysis.
Software | Platforms Supported | CNV Algorithms | Details
STAC | Multiple platforms | Permutation | GUI-based
GISTIC | Multiple platforms | Permutation | Command line
R-GADA | Multiple platforms | Multiple Correspondence Analysis | R-package
JISTIC | Multiple platforms | Permutation | Command line
DiNAMIC | Multiple platforms | Permutation | Command line
CNAnova | Affymetrix SNP 6.0 | Gradient kernel density estimation | Command line
CMDS | Multiple platforms | Correlation matrix diagonal segmentation | Command line
MGVD | Multiple platforms | k-means-based clustering | GUI-based
Detection of Common CNVs
There are several methods to identify shared regions of copy number gain or loss
within a group of samples. (Beroukhim et al., 2007; Diskin et al., 2006; Ivakhno and
Tavare, 2010; Pique-Regi et al., 2010; Sanchez-Garcia et al.; Walter et al., 2011; Zhang
et al., 2010) These approaches often use permutation tests to identify the recurring
changes in copy number (Table 2). The two most commonly used methods for common
CNV detection are Significance Testing for Aberrant Copy Number (STAC) and
Genomic Identification of Significant Targets in Cancer (GISTIC). The first method,
STAC, creates a binary matrix from the normalized probe signal intensities, assigning
genomic regions within individual samples having no copy number change to zero and
genomic regions with copy number gains or losses to one. Regions of copy number
variation are then determined by their length and frequency of occurrence. STAC also
uses non-overlapping windows to search for evidence of CNVs in each chromosome,
which can be an enormous computational burden when using small window sizes. The
second method, GISTIC, sets probe signal intensities below a certain threshold to zero
and also uses a permutation approach to detect the significant copy number regions.
These methods are performed on a sample-by-sample basis, which is not ideal for large
datasets where common CNVs must be estimated by comparing the CNVs of the
individual samples. An approach to avoid assigning the individual CNVs is to compare
the mean signal intensities of each probe in two different populations using the t-test, and
then use GADA to perform the segmentation directly on the t-statistics, rather than the
individual signal intensities (Figure 1). This is a simple but novel approach that provides
additional power to detect small CNVs.
Figure 1: Approach to Identify Common CNVs.
The underlying assumption for human DNA copy number is that there are two
copies of each autosomal chromosome, with an infrequent occurrence of nonrandom
copy number gain and copy number loss throughout the genome. Therefore, under the
null hypothesis that most DNA sequences consist of 2 copies, the probe signal intensities
will follow approximately a normal distribution, with increases in probe signal intensity
corresponding to copy number gains and decreases corresponding to copy number
losses:
\[ y_{ij} = x_{ij} + e_{ij} \tag{1} \]
where $y_{ij}$ is the signal intensity of sample $i$ and probe $j$. Because the normalization step
corrects for experimental bias in probe signal intensities, the number of probes spanning a
CNV will share a common mean log2ratio $x_{ij}$ corresponding to the underlying DNA copy
number value. The noise $e_{ij}$ is assumed to be zero-mean, and Gaussian.
The t-test can be used to assess whether the mean measurements of two groups
are statistically different from each other. Here the t-test is used to determine whether the
mean log2ratio in one population (A) is statistically different from the mean log2ratio in a
second population (B).
\[
\bar{y}_j^A = \frac{1}{N_A} \sum_{i \in A} y_{ij}, \qquad
\bar{y}_j^B = \frac{1}{N_B} \sum_{i \in B} y_{ij}
\]
\[
S_j^2 = \frac{\sum_{i \in A} \left( y_{ij} - \bar{y}_j^A \right)^2 + \sum_{i \in B} \left( y_{ij} - \bar{y}_j^B \right)^2}{N_A + N_B - 2}
\]
\[
t_j = \frac{\bar{y}_j^A - \bar{y}_j^B}{\sqrt{S_j^2 \left( \frac{1}{N_A} + \frac{1}{N_B} \right)}} \tag{2}
\]
This approach generates $t_j$ (t-statistics) for each of the 1.8 million probes on the Affy SNP
6.0. Under the null hypothesis that two human populations will have most DNA
sequences in common, the t-statistics will asymptotically follow a normal distribution.
The t-statistic will approximate zero for the two populations who share similar diploid
genomes. A region with positive t-statistic scores would then correlate with a region with
evidence of copy number gain for one population, with the second population having
either neutral or a loss of copy number for that region.
Conversely, regions with negative t-statistic scores will identify regions of the genome in
which copy number loss is present in one population, and is absent or contains a copy
number gain in the second population. To identify regions with positive or negative
t-statistics, $t_j$ for the 1.8 million SNP and CNV probes are arranged based on the
chromosome location and imported into GADA.
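The per-probe calculation in Equation (2) can be written out as a short sketch. It is a standard-library illustration of the pooled-variance two-sample t-statistic, not the code used for the analyses in this work; the function name, the data layout (one list of log2ratios per sample, ordered by probe), and the toy values are assumptions.

```python
import math


def pooled_t_statistics(group_a, group_b):
    """Return one pooled-variance t-statistic per probe.

    group_a, group_b: lists of samples; each sample is a list of log2ratios
    given in the same probe order. A probe with zero pooled variance would
    cause a division by zero and is not handled in this sketch.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a + n_b <= 2:
        raise ValueError("need more than two samples in total")
    n_probes = len(group_a[0])

    t_stats = []
    for j in range(n_probes):
        y_a = [sample[j] for sample in group_a]
        y_b = [sample[j] for sample in group_b]
        mean_a = sum(y_a) / n_a
        mean_b = sum(y_b) / n_b
        # Pooled variance S_j^2 with N_A + N_B - 2 degrees of freedom.
        sum_sq = sum((y - mean_a) ** 2 for y in y_a) + sum((y - mean_b) ** 2 for y in y_b)
        s2 = sum_sq / (n_a + n_b - 2)
        t_stats.append((mean_a - mean_b) / math.sqrt(s2 * (1 / n_a + 1 / n_b)))
    return t_stats


# Hypothetical data: three samples per population, two probes.
population_a = [[0.0, 0.5], [0.2, 0.7], [-0.2, 0.6]]
population_b = [[0.1, -0.4], [-0.1, -0.6], [0.0, -0.5]]
t_values = pooled_t_statistics(population_a, population_b)
```

In this toy example the first probe has equal means in both populations and a t-statistic of zero, while the second probe shows a gain in population A relative to population B and a large positive t-statistic.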
The ordered t-statistics data are used to identify significant genomic boundaries of
positive or negative $t_j$ values. These regions correspond to regions with discriminative
copy number variations. The number of probes spanning a CNV region common to a
population is assumed to share a common t-statistic value. Therefore, the objective of
GADA is to identify the genome-wide CNVs which are most likely to be shared within one
population and to differ from those of another population. This is a simple modification of the
GADA method in which t-statistics are used in place of the log2ratios. The GADA
method consists of two main steps. The first step is a Bayesian learning process which
generates a set of candidate breakpoints and segment means while trying to achieve an
optimal balance between model fit (measured as residual sum of squares) and model
sparseness (the number of breakpoints). The Bayesian learning process is driven by a
prior parameter, which is determined by the amount of segmentation expected in the
sample. Following the initial segmentation process, the significance of each segment is
estimated as a function of the segment mean and variance. The second step is then a
backward elimination procedure which removes segments with a level of significance
less than the user-predefined threshold.
Discussion
CNVs are structural changes that occur throughout the genome, primarily due to
duplication, deletion, insertion, and unbalanced translocation events. These gains and
losses of genetic material are 1 kb or greater in size, and are found to vary in frequency
among healthy individuals. The frequency of CNVs varies by ethnicity, and may
contribute to phenotypic variations and differences in disease susceptibility across
different ethnic groups. There are several practical applications for the use of CNVs in
cytogenetics, association studies, and population genetics.
Cytogenetics studies attempt to identify structural changes in DNA, such as copy
number changes. Platforms that reliably detect CNVs, such as Affymetrix and Illumina,
are particularly useful for genome-wide assessment of uniparental disomy (when 2 copies
of the chromosome are present, but both have been inherited from a single parent) in the
form of LOH, in which one of the two parental alleles at a locus is lost. Despite the
widespread copy number variation in the genomes of healthy individuals, clinical
cytogeneticists must differentiate between pathogenic CNVs and CNVs that do not
contribute to the clinical presentation of an affected individual. (Lee et al., 2007) Genetic
association studies are the predominant strategy for identifying CNVs conferring risk for
complex genetic diseases, either within candidate loci or genome-wide. Using this
approach, the frequency of a given CNV is compared among affected and unaffected
individuals. These types of studies require large sample sizes and are susceptible to
population stratification if cases and controls differ by ethnicity.
CNVs in normal individuals follow a model of Mendelian inheritance and present
a broad range in population frequencies. The distribution of copy number variation within
and among different populations is influenced by mutation, selection, and demographic
history. One study that attempted to create a CNV map in African Americans revealed
two regions of the genome with large CNV frequency differences between Caucasians
and African Americans, one on chromosome 15 and another on chromosome 17.
(McElroy et al., 2009)
Existing methods for CNV detection are often performed on a sample-by-sample
basis, which is not ideal for large datasets where common CNVs must be estimated by
comparing the CNVs of the individual samples. This chapter described a method to
identify population-level CNVs using an application of the single-sample GADA method.
CNVs were determined directly from the t-statistics estimated by comparing the DNA
microarray probe signal intensities of populations of different ancestry. Future studies are
needed to validate the success of this method on large datasets.
CHAPTER 2: A COPY NUMBER VARIATION SIGNATURE TO PREDICT HUMAN
ANCESTRY
Introduction
Copy number variations (CNVs) are gains and losses of genetic material in the
human genome that are greater than one kilobase (kb) in size. (Iafrate et al., 2004) These
structural variants are present in both healthy and diseased populations, and may confer
susceptibility to certain illnesses through a gene dosage effect. (Stranger et al., 2007) The
frequency of CNVs varies by ethnicity, which may contribute to phenotypic variations
and differences in disease susceptibility across different ethnic groups. (Jakobsson et al.,
2008; Redon et al., 2006) An array-based comparative genomic hybridization (aCGH) analysis
performed on pooled genomic DNA from the International HapMap Project populations
revealed 26 European population-specific CNVs, 53 African population-specific CNVs,
and 23 Asian population-specific CNVs. (Armengol et al., 2009) Several technological
approaches are used to examine CNVs in the human genome. Comparative genomic
hybridization techniques utilize thousands of probes to detect CNVs at a low resolution;
SNP microarray platforms employ millions of probes to detect smaller CNVs at precise
locations in the genome; and the most comprehensive assessment of CNVs can be
performed using next-generation sequencing of the human genome. (Alkan et al., 2009;
Carvalho et al., 2004; Kallioniemi et al., 1992)
Numerous algorithms have been developed for CNV detection using the
probe signal intensity from these array-based assays. (Karimpour-Fard et al., 2010) The
underlying assumption is that there are two copies of each autosomal chromosome in the
human genome, and the goal of these algorithms is to estimate the size and location of
regions which are significantly different from this assumption.
The statistical approaches implemented for the detection of CNVs are often
performed on individual samples. A Hidden Markov Model (HMM) and Bayesian
analysis are statistical approaches commonly used for single-sample CNV calling. (Broët
and Richardson, 2006; Cahan et al., 2008; Colella et al., 2007; Daruwala et al., 2004;
Fiegler et al., 2006; Korbel et al., 2007; Korn et al., 2008; Wang et al., 2007) The GADA
algorithm was developed to identify CNVs on aCGH and SNP microarray platforms.
(Pique-Regi et al., 2008; Pique-Regi et al., 2009) GADA utilizes a sparse Bayesian
learning (SBL) technique to determine the possible CNV locations, and then a backward
elimination (BE) procedure is used to rank the CNVs for manual adjustment of the false
discovery rate (FDR). GADA’s high accuracy and computational efficiency have proven
its utility in very large data sets used to identify global variation in copy number in the
human genome. (Conrad et al., 2009)
This chapter demonstrates a novel and simple method to detect common ancestry
CNVs (caCNV), which can then be used to build a CNV signature predictive of ancestry.
This method uses a simple application of the GADA algorithm on a distribution of t-statistics
obtained by comparing probe signal intensities from microarray copy number data of two
groups. The t-statistics arranged by the genomic locations of the probes allow detection
of common genome-wide CNVs. Pair-wise analyses of the publicly available
International HapMap Project microarray data from individuals of European, Asian, and
African descent are used to identify caCNVs. Next, the copy number status of each
individual is assessed for the caCNVs and used as features in a linear discriminant
analysis model to identify a caCNV signature that can predict ancestry. Lastly, this CNV
signature is validated in an independent dataset of samples with similar ancestry.
Materials and Methods
Study Populations
Individuals of European, African, and Han Chinese descent were available from
the International HapMap Project. (The International HapMap Consortium, 2003)
Genome-Wide Human SNP Array 6.0 (Affy SNP 6.0) data for the HapMap individuals
were obtained from Affymetrix (Affymetrix Inc., Santa Clara, CA). The training set
includes 30 HapMap trios of European descent from Utah (CEU), 30 HapMap trios of the
African Yoruba from Nigeria (YRI), and 45 unrelated Han Chinese HapMap individuals
from Beijing, China (CHB). The training set was used to identify caCNVs of each
ancestry relative to the other ancestries.
The test set was obtained through the Cancer Genetic Markers of Susceptibility
(CGEMS) project. (Mailman et al., 2007) The test set is a population-based Affy SNP 6.0
dataset of 300 samples (100 Caucasian, 100 African-American, and 100 Han Chinese)
collected by the National Institute of General Medical Sciences (NIGMS) to use as
normal healthy controls.
The ethnicities for the African-American and Caucasian populations were self-
identified as reported in physician records. The inclusion criteria for the Han Chinese
cohort, obtained from subjects living in the Los Angeles area, were that all four
grandparents were born in Taiwan, China, or Hong Kong.
DNA Microarray
The Affy SNP 6.0 consists of 906,600 polymorphic probes for detection of single
nucleotide polymorphisms (SNPs) and 946,000 non-polymorphic probes (used primarily
for identification of CNVs). The average minor allele frequency of SNPs on this platform
in the HapMap CEU, CHB, and YRI populations is 19.5%, 18.2%, and 20.6%,
respectively. CNV probes were originally selected for their genomic spacing (744,000,
79%) and based on known CNVs identified in the Database of Genomic Variants (DGV;
202,000, 21%). The median distance between all SNP and CNV probes combined is <
700 base pairs. (Affymetrix Inc., 2009)
Statistical Analysis
Microarray Normalization and Summarization
Affy SNP 6.0 data were normalized according to the manufacturer’s guidelines
and using Genotyping Console 3.0 (Affymetrix Inc., Santa Clara, CA). Quantile
normalization, which corrects for fragment-size amplification and GC content, was
performed on data from the training and test sets. (Bengtsson et al., 2008) The result is a
log2ratio, which is the logarithm of the signal intensity of the probe relative to the
reference value. For each polymorphic SNP probe, the log2ratios of the two alleles are
summarized to produce a single log2ratio value. A single log2ratio value is estimated for
each individual non-polymorphic copy number probe. The entire dataset was imported
into R version 2.9.1. All the analyses were carried out in R using the GADA package. (R
Development Core Team, 2011)
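The log2ratio defined above can be illustrated with a small worked example. The function name and the intensity values are assumptions chosen for illustration, and the numbers describe an idealized probe whose signal is exactly proportional to copy number; measured log2ratios on real arrays will not match these values exactly.

```python
import math


def log2ratio(signal, reference):
    """Base-2 logarithm of a probe's signal intensity relative to its reference."""
    if signal <= 0 or reference <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(signal / reference)


# Idealized probe with a two-copy reference intensity of 1000 (hypothetical units).
neutral = log2ratio(1000.0, 1000.0)      # two copies
single_loss = log2ratio(500.0, 1000.0)   # one copy
single_gain = log2ratio(1500.0, 1000.0)  # three copies
```

Under this idealization a copy-neutral probe sits at zero, a single-copy loss at -1, and a single-copy gain near 0.58, which is why gains and losses appear as positive and negative shifts in the data.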
Identification of common CNVs using genome-wide T-statistics and GADA
Under the null hypothesis that most DNA sequences consist of 2 copies, the probe
signal intensities will follow approximately a normal distribution, with increases in probe
signal intensity corresponding with copy number gains; and decreases with a
corresponding copy number loss:
\[ y_{ij} = x_{ij} + e_{ij} \tag{1} \]
where $y_{ij}$ is the signal intensity of sample $i$ and probe $j$. Because the normalization step
corrects for experimental bias in probe signal intensities, the number of probes spanning a
CNV will share a common mean log2ratio $x_{ij}$ corresponding to the underlying DNA copy
number value. The noise $e_{ij}$ is assumed to be zero-mean, and Gaussian.
The t-test can be used to assess whether the mean measurements of two groups
are statistically different from each other. Here the t-test is used to determine whether the
mean log2ratio in one population (A) is statistically different from the mean log2ratio in a
second population (B).
\[
\bar{y}_j^A = \frac{1}{N_A} \sum_{i \in A} y_{ij}, \qquad
\bar{y}_j^B = \frac{1}{N_B} \sum_{i \in B} y_{ij}
\]
\[
S_j^2 = \frac{\sum_{i \in A} \left( y_{ij} - \bar{y}_j^A \right)^2 + \sum_{i \in B} \left( y_{ij} - \bar{y}_j^B \right)^2}{N_A + N_B - 2}
\]
\[
t_j = \frac{\bar{y}_j^A - \bar{y}_j^B}{\sqrt{S_j^2 \left( \frac{1}{N_A} + \frac{1}{N_B} \right)}} \tag{2}
\]
Pair-wise comparisons of the microarray data in CEU versus YRI, CEU versus CHB, and
YRI versus CHB were performed using the t-test. This approach generated $t_j$ (t-statistics)
for each of the 1.8 million probes on the Affy SNP 6.0. Under the null hypothesis that
two human populations will have most DNA sequences in common, the t-statistics will
asymptotically follow a normal distribution. The t-statistic will approximate zero for the
two populations who share similar diploid genomes. A region with positive t-statistic
scores would then correlate with a region with evidence of copy number gain for one
population, with the second population having either neutral or a loss of copy number for
that region. Conversely, regions with negative t-statistic scores will identify regions of
the genome in which copy number loss is present in one population, and is absent or
contains a copy number gain in the second population. To identify regions with positive
or negative t-statistics, $t_j$ for the 1.8 million SNP and CNV probes are arranged based on
the chromosome location and imported into GADA.
The ordered t-statistics data were used to identify significant genomic boundaries
of positive or negative $t_j$ values. These regions correspond to regions with discriminative
copy number variations. The number of probes spanning a CNV region common to a
population is assumed to share a common t-statistic value. Therefore, the objective of
GADA is to identify the genome-wide CNVs which are most likely to be shared in one
population, that also differ in another population. This is a simple modification of the
GADA method in which t-statistics are used in place of the log2ratios. The prior
parameter (alpha) was set to a = 0.5 and the significance threshold (T) was set to T = 9
for identification of CNV breakpoints. These values of alpha and T were selected
based on a previously described copy number analysis of Affymetrix SNP 6.0 array data.
(Pique-Regi et al., 2008) To further reduce false positive results, only significant segments
with more than 10 probes were selected for further analysis. Statistical
analyses for identification of common CNVs using genome-wide t-statistics were performed
in R using the GADA package.
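The probe-count filter described above can be sketched as follows. The helper names and the toy breakpoints are hypothetical, and the sketch covers only the length filter; it assumes that breakpoints have already been selected at the chosen significance threshold.

```python
def segments_from_breakpoints(n_values, breakpoints):
    """Turn breakpoint indices into (start, end) segments, with end exclusive."""
    bounds = [0] + sorted(breakpoints) + [n_values]
    return [(bounds[k], bounds[k + 1]) for k in range(len(bounds) - 1)]


def keep_long_segments(segments, min_probes_exclusive=10):
    """Keep only segments spanning more than `min_probes_exclusive` probes."""
    return [(start, end) for start, end in segments if end - start > min_probes_exclusive]


# Hypothetical chromosome with 30 ordered probes and breakpoints at 5 and 20.
segments = segments_from_breakpoints(30, [5, 20])
kept = keep_long_segments(segments)
```

Here the segments contain 5, 15, and 10 probes; only the 15-probe segment passes, because the criterion is strictly more than 10 probes.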
Building the caCNV Signature
For each segment $\mathrm{CNV}_k$ identified by GADA using the t-score data, the sum of the
log2ratio values of the probes spanning the k-th CNV was calculated for
each individual in the training dataset. Thus each person was assigned a vector of
features, and for the k-th CNV and the i-th individual:
\[ f_{ik} = \sum_{j \in \mathrm{CNV}_k} y_{ij} \tag{3} \]
A variation of the linear discriminant analysis (LDA) approach, named nearest
shrunken centroids, was used to identify which of these CNV features are caCNVs that can
accurately be used to predict the ancestry of two defined populations. Briefly, the method
computes a standardized centroid for each class. Then a weighted discriminant is
computed to assess if each sample leans towards one population or the other. The
shrunken centroid method has been implemented as an R package (prediction analysis for
microarrays, PAMR) and was used for this analysis. (Tibshirani et al., 2002) A ten-fold cross
validation was performed on the training set to estimate the performance of the model.
The t-statistics were calculated and CNV models were identified during each iteration of
the cross-validation. The parent-offspring trios were not split between the training and
the test sets in the cross-validation analyses.
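The feature vector in Equation (3) can be computed with a few lines of code. This is an illustrative sketch only: the function name, the data layout (one list of log2ratios per individual, ordered by probe), and the toy values are assumptions, and the classification step is not shown.

```python
def cnv_features(log2ratios, cnv_probe_indices):
    """Return features[i][k], the sum of individual i's log2ratios over CNV k.

    log2ratios: one list of per-probe log2ratios for each individual.
    cnv_probe_indices: for each CNV, the indices of the probes that span it.
    """
    return [
        [sum(sample[j] for j in probes) for probes in cnv_probe_indices]
        for sample in log2ratios
    ]


# Hypothetical data: two individuals, five probes, two CNV segments.
samples = [
    [0.1, 0.2, -0.5, -0.4, 0.0],
    [0.0, 0.1, 0.1, 0.0, 0.3],
]
cnvs = [[0, 1], [2, 3]]
features = cnv_features(samples, cnvs)
```

Each row of the result is one individual's feature vector, which is the form of input a classifier such as nearest shrunken centroids expects.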
Validation
Validation of the caCNV signature was performed using the independent CGEMS
test set, with the log2ratio sum for each sample calculated using the Affy
SNP 6.0 probes spanning the caCNV derived from the HapMap training set. The Ethnic
Panels dataset was obtained from the NIGMS Human Genetic Cell Repository through
dbGaP (accession: phs000211.v1.p1). Self-reported ethnicities were also compared to a
principal component analysis (PCA) of genome-wide SNP data using a panel of 4,326
SNPs previously published as ancestry informative markers (AIMs) for African
Americans. ADMIXTURE version 1.21 software was then used to estimate ancestry
using a model-based approach from the same panel of SNPs.
Figure 2. Overview of Common CNV Method. Affymetrix SNP6.0 signal intensity data are
normalized and summarized for copy number studies. The mean log2ratio of each probe is compared
between two populations of different ancestries using the t-test. The resulting t-statistic for each
probe is formatted with chromosome position and imported into GADA to identify common ancestry
CNVs (caCNVs). The t-statistics follow a normal distribution, with the t-statistic values in the tails
representing the common ancestry probes. Finally, the sum of the log2ratios for each CNV is
calculated and used as features in linear discriminant analysis to identify a minimum set of caCNVs
required to classify the populations.
Results
Identification of caCNVs
In order to identify common CNVs that differ between two populations, a series
of t-tests were performed on the mean log2ratio for each Affy SNP 6.0 probe comparing
European, African, and Han Chinese populations (Figure 2). GADA analysis of the t-
statistic values of each pair-wise analysis, ordered based on the genomic location of its
corresponding probe, identified 26, 31, and 16 caCNVs, respectively, which differed
between training set populations of European and African ancestry, European and Han
Chinese ancestry, and Han Chinese and African ancestry (Figure 3). A PCA of the
caCNV values for each individual in the pair-wise comparisons verified the separation of
these populations (Figure 4). Of the 73 total caCNVs identified by the pair-wise
comparisons, 53 were unique (Figure 5). Ten caCNVs were shared between the two
analyses comparing the African population with the European or Han Chinese
populations. Five caCNVs were shared within each of the other two pairs of analyses,
those comparing the European or the Han Chinese population against the other two populations.
Scatter plots of the top two principal components in the PCA of the unique 53 caCNV values
generated for each individual in the training set verified the separation of these three
populations (Figure 5). The median genomic size of the caCNV panel was 29.3 kb (range
1.4 – 1544.1 kb). The caCNVs encompassed all autosomal chromosomes except for
chromosomes 21 and 22. Figure 6 shows the distribution of gains and losses of the 53
caCNVs across the three ancestral groups. Among the caCNVs, losses were more
commonly observed across the three populations.
Figure 3. GADA Identifies Common Ancestry CNVs Which Differ Between Populations of
Different Ancestry. GADA identifies common ancestry CNVs (caCNVs) in pair-wise analyses of
the three training sets (European, African, and Han Chinese). The distribution of the caCNV
regions is shown for the A) 26 caCNVs which differ between the European and African
populations (median caCNV size of 30 kb), B) 31 caCNVs which differ between the training set
European and Han Chinese populations (median caCNV size of 37 kb), and C) 16 caCNVs which
differ between the training set African and Han Chinese populations (median caCNV size of 39
kb). The frequency of caCNVs is plotted by chromosome. The color of the bar indicates the size of
the caCNV.
Figure 4. Principal Component Analysis (PCA) using caCNVs Clusters Samples by Ancestry.
For each individual, the sum of log2ratios of the caCNV regions identified using pair wise analyses
was calculated and used for PCA analyses. Scatter plots of the first two principal components of A)
the 26 caCNVs comparing European and African populations, B) the 31 caCNVs
comparing European and Han Chinese populations, and C) the 16 caCNVs comparing African and
Han Chinese populations show good separation of individuals based on ancestry (Red squares –
European ancestry, yellow triangles – African ancestry, blue circles – Han Chinese ancestry).
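The per-individual caCNV values used for the PCA and classification are sums of probe log2ratios within each caCNV region. A minimal sketch of that feature construction follows. The data layout (tuples of chromosome, position, and log2ratio), the function name, and the toy coordinates are assumptions made for illustration only.

```python
def cnv_feature_vector(sample_probes, cnv_regions):
    """Sum the log2ratios of the probes that fall inside each CNV region.

    sample_probes: list of (chromosome, position, log2ratio) for one individual.
    cnv_regions:   list of (chromosome, start, end), inclusive coordinates.
    """
    features = []
    for chrom, start, end in cnv_regions:
        total = sum(
            value
            for probe_chrom, position, value in sample_probes
            if probe_chrom == chrom and start <= position <= end
        )
        features.append(total)
    return features


# Hypothetical regions and probes, not actual caCNV coordinates.
regions = [("chr4", 1000, 2000), ("chr17", 500, 900)]
sample = [
    ("chr4", 1100, -0.5),
    ("chr4", 1500, -0.25),
    ("chr4", 2500, 0.25),  # outside the chr4 region, so ignored
    ("chr17", 600, 0.5),
]
features = cnv_feature_vector(sample, regions)
```

Each individual then contributes one vector of this kind, with one entry per caCNV, to the downstream analyses.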
caCNV Signature-Based Ancestry Classification
Nearest shrunken centroid analysis using the 53 caCNVs in the training set
separated the European, African, and Han Chinese populations with a 1.7%
misclassification error rate using a 10-fold cross-validation routine (Figure 5). The
most significant caCNV was located in chromosome 4q13.2, with 43% of Europeans
exhibiting copy number gains. This region has previously been reported to be deleted in
East Asian populations, and encompasses the UDP-glucuronosyltransferase 2B17
(UGT2B17) gene (Campbell et al., 2011; Xue et al., 2008). The second most significant
caCNV was located on chromosome 3q26.1 and contains only a microRNA (MIR720).
The third most significant caCNV is a duplicated region of chromosome 17q21.31 found
only in Europeans, which has been validated experimentally by fluorescence in situ
hybridization (FISH) and next-generation sequencing techniques. (Sudmant et al., 2010)
Figure 5. Identification of Unique caCNVs among European, African, and Han Chinese. A)
Venn diagram of the 73 caCNVs identified from the pair wise comparisons identifies 53 unique
caCNVs among the three populations. B) Scatter plot of the top two principal components using
data generated from the 53 unique caCNVs shows good separation of individuals based on ancestry
(Red squares – European ancestry, yellow triangles – African ancestry, blue circles – Han Chinese
ancestry). C) Plot of the misclassification error rate for predicting ancestry using decreasing
numbers of the caCNVs identified using 10-fold cross validation analyses of the training set.
Independent Validation of the caCNV Signature-based Classification
All 100 Han Chinese samples, 98 of the 100 African samples, and 96 of the 100
European samples in the test set were correctly classified using the 53
caCNV signature, with an overall misclassification error rate of 2% (Figure 7). PCA was
performed on a panel of 4,326 genome-wide SNPs used as AIMs to verify the separation
of these three populations by self-reported ancestry (Figure 8) (Tandon et al., 2011). To
further investigate the effects of admixture on classification, ADMIXTURE version 1.21
software was used to estimate ancestry using a model-based approach from the same
AIMs panel of 4,326 SNPs (Alexander et al., 2009). The estimates of ancestry for each
individual using the caCNV signature and SNPs were correlated in the Han Chinese
(R² = 0.974), Europeans (R² = 0.924), and Africans (R² = 0.914), confirming the accuracy of
the caCNV signature (Figure 8). These correlations revealed greater levels of admixture
in the European population using the caCNV signature compared to SNPs; and the
reverse in the African population, who showed increased admixture using the SNPs
compared to the caCNV signature. The Han Chinese population showed very little
evidence of admixture.
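The overall test-set error rate follows directly from the per-population counts reported for the independent validation. The short sketch below reproduces that arithmetic; the function and variable names are illustrative.

```python
def misclassification_rate(correct_by_group, total_by_group):
    """Overall fraction of samples assigned to the wrong group."""
    total = sum(total_by_group.values())
    errors = sum(total_by_group[g] - correct_by_group[g] for g in total_by_group)
    return errors / total


# Counts as reported for the test set.
totals = {"Han Chinese": 100, "African": 100, "European": 100}
correct = {"Han Chinese": 100, "African": 98, "European": 96}
rate = misclassification_rate(correct, totals)  # 6 errors out of 300 samples
```

Six misclassified samples out of 300 give the reported 2% error rate.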
Figure 6. Frequency of Copy Number Gains and Losses for the 53 unique caCNVs among the
HapMap Training Sets of A) European ancestry (CEU), B) African ancestry (YRI), and C) Han
Chinese ancestry (CHB). The panel is shown in ascending order (top to bottom) by statistical
significance obtained using the nearest shrunken centroid analysis. The genomic coordinates of the
caCNVs are based on NCBI Build 36, UCSC Version hg18. (CNV losses – black, CNV gains –
grey)
Discussion
This study shows for the first time a methodology for identifying a common CNV
signature that could predict ancestry with extremely high accuracy. The
misclassification error rate of our 53 caCNV signature that distinguishes Europeans,
Africans, and Han Chinese was 2%. This signature also includes the first reported
microRNA caCNV. Importantly, this approach is applicable to a wide range of
biomedical research aimed at identifying CNV signatures predictive of population
phenotypes.
Figure 7. Estimated Probability of Ancestry Classification using caCNV Signature. The 100
European, 100 African, and 100 Han Chinese test set samples are plotted against the estimated
probability of belonging to each population. Each vertical bar represents an individual. The height
of each bar is proportional to the probability that the individual belongs to a given ancestry:
European (red bars), African (yellow bars), or Han Chinese (blue bars).
Existing methods for CNV detection are often performed on a sample-by-sample
basis, which is not ideal for large datasets where common CNVs must be estimated by
comparing the CNVs of the individual samples. The proposed method identifies
population-level CNVs using an application of our published single-sample GADA
method. CNVs were determined directly from the t-statistics estimated by comparing the
DNA microarray probe signal intensities of populations of different ancestry. When used
in a linear discriminant analysis model, a subset of 53 CNVs could accurately be used to
predict population structure. Closely-related populations can currently be discriminated
using genome-wide SNP data; however there are several advantages to using CNVs
instead. First, CNVs can be more informative than SNPs because they encompass the
genes giving rise to the observed phenotype, and do not necessarily rely on linkage
disequilibrium with the underlying causal variant. Second, the gene dosage effects of
CNVs can provide valuable insight to the biological differences observed between
populations. Finally, fewer CNVs than SNPs are needed for population discrimination,
and future studies may explore the combination of CNVs and SNPs for this purpose.
The distribution of CNVs in the human genome has previously been shown to
vary by ethnic populations. (White et al., 2007) This method reproduced many of these
known CNVs, and also identified several novel ones. The most significant caCNV we
identified on chromosome 4q13.2 involving UGT2B17 shows exceptionally increased
population variation. (Campbell et al., 2011; Xue et al., 2008) It is most frequently
deleted in East Asian populations, but is rarely deleted in both European and African
populations. Experimental validation of this CNV has been performed using
whole-genome sequence data. (Sudmant et al., 2010) The second most significant caCNV in the
signature was in the region of chromosome 3q26.1. This analysis reports a novel
observation of loss of this CNV in 80% of the Han Chinese population. This region contains
only a microRNA (MIR720), which has been shown to be expressed in melanocytes and
melanoma. (Stark et al., 2010) This observation is the first report of a microRNA CNV
that varies by ancestry. Several other top ranking CNVs in the caCNV signature were
also previously reported to vary by ancestry. The caCNV on chromosome 17q21.31 was
reported to vary between Asians, Europeans, and Africans, with duplication present only
in European populations, and lacking in other ethnic groups, which is consistent with our
finding that gains of this CNV are found only in Europeans, in two-thirds of individuals.
(Campbell et al., 2011; McElroy et al., 2009; Tsalenko et al., 2010; Xue et al., 2008) Whole-genome
sequence data and FISH analysis also confirmed this duplication in the European
population. (Sudmant et al., 2010) Region 17q12 shows a greater number of copy
number deletions in Africans compared to the other populations studied. (Campbell et al.,
2011; McCarroll et al., 2006) Deletions of caCNV 17p11.2 occur more often in the Asian
and Caucasian populations compared to the African population. (Campbell et al., 2011)
Campbell et al described additional CNVs on chromosomes 16p13.11 and 20p13 which
differ between populations and were also identified using our method. (Campbell et al.,
2011) This analysis demonstrates additional novel caCNVs located throughout the
genome on all autosomal chromosomes except for 21 and 22.
Figure 8. Accuracy of Ancestry Prediction in Test Set using PCA of Genome-Wide SNPs. A)
Scatter plot of the top two principal components using data generated from 4,326 genome-wide
SNPs selected as AIMs shows separation of 100 European, 100 African, and 100 Han Chinese test
samples based on self-reported ancestry (red squares – European, yellow triangles – African, blue
circles – Han Chinese). B) Scatter plot of ancestry estimates using SNPs versus caCNV signature in
Africans (R² = 0.914), C) Europeans (R² = 0.924), and D) Han Chinese (R² = 0.974).
The approach in building a common CNV signature has several advantages. The
proposed t-test approach is a quick and simple approach to identify regions of DNA copy
number which are significantly different in two populations. The GADA prior parameters
provide users the flexibility to control sensitivity and specificity in identifying boundaries
of common CNVs on the t-score data. These adjustments can be made in real-time as
only one dataset (t-score values) is analyzed in GADA. In comparison, Significance
Testing for Aberrant Copy Number (STAC) creates a binary matrix from the normalized
probe signal intensities, assigning genomic regions with no copy number change to zero
and genomic regions with copy number gains or losses to one. Regions of copy number
variation are then determined by their length and frequency of occurrence. STAC uses
non-overlapping windows to search for evidence of CNVs in each chromosome, which
can be computationally expensive when using small window sizes. Mei et al ran the
STAC algorithm for longer than 48 hours on a 3 GHz Windows PC with 4 gigabytes of RAM
to analyze >32,780 non-overlapping windows of chromosomes 1-22 of 112 HapMap
samples. (Mei et al., 2010) In comparison, the running time was 32 seconds for each
GADA analysis of the t-statistic values of the pair-wise ancestry analyses. Another
advantage of our approach is that it removes the need for data reduction techniques to identify
common CNVs, or for principal component based analyses to make class assignments
using only signal intensities. (Beroukhim et al., 2007; Diskin et al., 2006; Ivakhno and
Tavare, 2010) Through our simple procedure, common CNV signatures can be identified
that can be readily applied to other datasets with a similar array data type, as demonstrated
with our use of a test set in this report. These advantages, along with our reported and
validated caCNV signature, give credence to our novel approach, which could also easily
be implemented to identify CNVs as susceptibility loci in case-control studies.
In summary, the proposed novel method to detect population-based CNVs
demonstrates the feasibility of using ancestry as the dependent variable. This methodology
reveals a 53 caCNV signature which can be used to infer human population structure with
extremely high accuracy. A simple modification of the GADA method allowed for direct
segmentation of t-statistics. The efficiency of this method in finding a CNV signature will
facilitate the use of a new type of structural variation important in human genomic
studies. The success of this methodology has implications for improving admixture
mapping and the minimization of population stratification in case-control and genome-
wide association studies. This methodology can be easily expanded to case-control
studies to identify a genetic susceptibility CNV signature specific to Mendelian or
complex diseases.
CHAPTER 3: A GENOME-WIDE ASSOCIATION STUDY OF EWING SARCOMA
Introduction
Malignant bone tumors account for 3% of all cancers in 15 – 29 year olds in the
United States. (Bleyer et al., 2006) Ewing Sarcoma (EWS) is the second most common
bone cancer in this age group, with an age-adjusted incidence rate of 2.93 cases per
million per year in the United States. (Eseashvili et al., 2008) Although it can occur at
any age, the most common age of diagnosis is the second decade of life, and males are
affected more often than females. (Bleyer et al., 2006) Epidemiological data from the
Surveillance, Epidemiology, and End Results (SEER) database have revealed an
increasing incidence of EWS among Non-Hispanic Whites over the past 3 decades.
(Jawad et al., 2009) However, for unknown reasons the incidence is significantly lower in
African-Americans and Asians, and the incidence rate in these two groups has remained
unchanged over the past 3 decades. (Fraumeni and Glass, 1970; Li et al., 1980) The 5-
year survival rate for 15 – 29 year olds affected with EWS in the United States is poor
and has remained relatively unchanged for several decades at approximately 60%.
(Bleyer et al., 2006) Several case-control studies have been conducted in North America
and Australia to investigate possible environmental or biologic risk factors for EWS. One
of the strongest associations has been found for parental agricultural exposures. (Holly et
al., 1992; Hum et al., 1998; Moore et al., 2005; Valery et al., 2002; Valery et al., 2005b;
Winn et al., 1992)
Another strong association has been reported for a history of hernia. (Valery et al., 2005a;
Valery et al., 2003; Winn et al., 1992) However, these results are limited to a small set of
case-control studies; even if the studies accurately assess risk, only a small fraction of
cases would be attributed to these factors due to the small proportion of cases exposed to
these risk factors.
Somatic changes found in EWS tumors have been well defined. Approximately
85% of EWS tumors share a common reciprocal translocation t(11;22)(q24;q12), which
links a strong transcriptional activation domain from the EWSR1 gene on chromosome 22
to the ETS DNA-binding portion of the FLI1 gene on chromosome 11. (Turc-Carel et al.,
1988) Other less common translocations present in these tumors include fusion of the
EWSR1 gene with ERG on chromosome 21q, ETV1 on chromosome 7p, E1AF on
chromosome 17, and FEV on chromosome 2q. (Jeon et al., 1995; Kaneko et al., 1996;
Peter et al., 1997; Sorenson et al., 1994) In addition to the EWSR1 rearrangements,
several secondary chromosomal aberrations can occur, including trisomies of
chromosomes 8 and 12, and a gain of DNA sequences in 1q. (Armengol et al., 1997;
Mugneret et al., 1988) Less frequently, mutations in the TP53 gene have been described
in EWS tumors. (Komuro et al., 1993; Kovar et al., 1993)
Much less is known about genetic risk factors associated with EWS. Intron 6 of
the EWSR1 gene has previously been shown to have fewer CNVs in African populations
compared to European populations, which may explain the differential predisposition to
the EWS-FLI1 translocation. However, intron 6 is not the typical breakpoint for this
translocation. (Zucman-Rossi et al., 1997) A candidate gene study of the same locus was
conducted to identify SNPs and/or novel variants that differ between EWS cases and
race-matched controls. (DuBois et al., 2011) Twenty-one SNPs spanning EWSR1 were
genotyped, and intron 7 was sequenced, in 135 cases and 200 controls. The cases and
controls were identified from the Childhood Cancer Survivor Study (CCSS), which is a
national cohort of children who were diagnosed with cancer between 1970 and 1986, and
who survived for more than 5 years from initial diagnosis. Variations in the genotyped
EWSR1 gene region or across intron 7 were not associated with EWS risk in this study.
Any effect may have been biased towards the null by using a control group consisting of
cancer survivors, which may not have been a suitable comparison group. Recently,
genome-wide association studies (GWAS) have been successful in identifying cancer
susceptibility loci. (Broderick P, 2009; Eeles RA, 2009; Gudmundsson J, 2009;
Kanetsky PA, 2009; Petersen GM, 2010; Postel-Vinay et al., 2012; Song H, 2009; Tenesa
A, 2008; Thomas G, 2009; Wu X, 2009) While most associations have only modest
effect sizes, the SNP-trait associations are often highly statistically significant. A
genome-wide association study (GWAS) of EWS recently identified common variants on
chromosomes 1p36, 10q21, and 15q15 which may predispose to this disease. (Postel-
Vinay et al., 2012) A joint analysis of the GWAS and two independent replication sets in
this study identified associations with SNP rs9430161 (OR = 2.20, 95% CI = 1.77 - 2.72,
p-value = 1.4 × 10⁻²⁰); SNP rs224278 (OR = 1.66, 95% CI = 1.42 - 1.93, p-value =
4.0 × 10⁻¹⁷); and SNP rs4924410 (OR = 1.46, 95% CI = 1.23 - 1.74, p-value = 6.6 × 10⁻⁹).
Analysis of the haplotypes carrying these candidate SNPs showed differences in
frequencies between European and African populations, suggesting variants in these
regions account for the differences in incidence of EWS observed in these populations. In
this current study, a family-based GWAS was conducted to identify ethnic-specific
genomic variants associated with EWS susceptibility. A dataset of EWS cases and case-
parent trios (an affected child and both parents) was assembled to determine the genome-
wide SNPs and CNVs which predispose to EWS.
Materials and Methods
Identification of Cases and Controls
Eligible cases were defined for purposes of the study as males and females of all
ages diagnosed with EWS in California and Pennsylvania from 1985 – 2008 who were
still living at the time of study recruitment. Case diagnoses included EWS tumor, Askin
tumor, or peripheral primitive neuroectodermal tumor (pPNET) of bone (site/histology
ICD-O-3 codes 9260, 9363 – 9365, based on International Classification of Childhood
Cancer). EWS cases were identified through 3 main sources: Children’s Hospital Los
Angeles (CHLA), two statewide population-based cancer registries (the California
Cancer Registry and the Pennsylvania Cancer Registry), and the internet.
(1) CHLA is the largest nonprofit, academic, pediatric medical center located in
Southern California. The Division of Hematology – Oncology at CHLA cares for
more than 1,100 new patients each year and is the leading referral center in the
western United States. Patients undergoing treatment or follow-up care for EWS
at CHLA were identified through their treating physicians for recruitment in the
study (2009-2011).
(2) As California’s statewide population-based cancer surveillance system, the
California Cancer Registry (CCR) has been collecting cancer incidence data since
1947, and includes greater than 95% of the cancer cases diagnosed per year in
California. The Pennsylvania Cancer Registry (PCR) has been collecting
information on all new cases of cancer diagnosed or treated in Pennsylvania since
1985.
(3) Finally, EWS cases were identified and recruited from regions outside California
and Pennsylvania through the internet using Facebook
(http://www.facebook.com/#!/groups/129388753746890/), by contacting the
Sarcoma Foundation of America (SFA) and sending information about the study
to EWS listserves.
Genotype data for healthy controls for the replication phase were obtained from an
existing public database available from the 1000 Genomes Project (www.1000genomes.org).
(The 1000 Genomes Project Consortium, 2010) A total of 381 European samples were
selected for inclusion in this study.
Case Recruitment
The EWS cases identified at CHLA were invited to participate in the study either
by contacting the parents by mail, or if they were an active patient, they were invited and
enrolled directly by their CHLA physician at their follow-up appointment. The eligible
EWS cases identified through the CCR and PCR were mailed a brochure and a letter
introducing the study and inviting them to participate. The letters were addressed directly
to cases who were 18 years of age or older, or to the parents of the case
when the case was under 18 years old. A brief letter was mailed to treating physicians in
California to obtain passive consent before contacting cases from the CCR. The EWS
cases who were reached through the internet contacted research study personnel directly
via email or telephone if they were interested in participating.
Regardless of recruitment methods, the identified cases, and the parents of the
cases, were invited to participate in the study. Those families willing to participate were
asked to return a short questionnaire with contact information, date of birth, gender and
ethnicity. The questionnaire was completed by the case if he or she was 18 years old or
older; or the parents completed the questionnaire if the case was less than 18 years old.
The parents of the cases served as the control subjects for the genetic analysis when
available.
Sample Collection
Cases and their parents seen at CHLA were asked to provide a blood sample for
genetic analysis. Whole blood was collected in one 8.5 mL acid citrate dextrose (ACD)
tube from the case on the day of their appointment. Families were given the option of
having one 8.5 mL ACD tube of blood drawn at the facilities of CHLA through the
General Clinical Research Center (GCRC), or having blood collection kits mailed to
them to have their blood drawn by their own primary care physician. For family members
not seen directly at CHLA, blood collection supplies, including 8.5 mL ACD tubes and
syringes, were mailed to each participant. For the latter option, blood samples were
returned via FedEx at room temperature within 72 hours of the blood draw. Peripheral
blood stem cells (PBSCs) stored at the CHLA Hematopoietic Progenitor Processing
Laboratory were obtained for patients who had expired. Archival DNA samples from
whole blood or mouthwash samples were obtained from a legacy EWS study at CHLA
conducted by Dr. Jan Van Tornout. Genomic DNA was extracted from whole blood or
PBSCs within 72 hours of collection using the desalting method on an automated system
(QIAGEN, Inc., Valencia, CA). (Scherczinger et al., 1997) Genomic DNA was isolated
from mouthwash using phenol-chloroform extraction methods. (Chomczynski and
Sacchi, 2006) DNA concentration was assessed using UV (A260 nm)
spectrophotometry and/or an ethidium bromide-stained low-percentage agarose gel
compared with a high molecular weight standard. The DNA was stored in multiple
aliquots at −80°C.
The genomic DNA samples acquired from identified EWS cases (and parents)
were used to construct two genetic analysis groups: a GWAS discovery set and a
replication set. The GWAS discovery set consisted of the EWS trios (affected case and
parents) who agreed to complete the questionnaire and provided the highest quality
genomic DNA samples extracted from whole blood. The replication set consisted of an
independent collection of EWS cases who provided genomic DNA from whole blood,
mouthwash, or PBSCs. This latter set of cases for replication did not have parental DNA
available for a complete genetic analysis of the family.
Genotype data for 381 individuals was obtained from the 1000 Genomes Project and used
as controls for the replication analysis. (The 1000 Genomes Project Consortium, 2010)
Controls from the 1000 Genomes Project were selected among those with European
ancestry: Utah residents with Northern and Western European ancestry; Toscani in Italy;
British from England and Scotland; Finnish from Finland; and Iberian populations from
Spain.
Written informed consent was obtained from each participant. The research study
was conducted according to a protocol approved by the Institutional Review Board of
Children’s Hospital Los Angeles (Committee on Clinical Investigation), the Committee
for the Protection of Human Subjects through the state of California, and the Department
of Public Health Institutional Review Board in Pennsylvania.
Mailed Questionnaire
Additional information for each case was collected by mailed questionnaire. Each
case (or a parent of cases under 18 years of age) was requested to complete a medical
history questionnaire which included information on characteristics of the case history
including tumor type, tumor site, age at diagnosis, treating hospital and physician.
SNP Genotyping
GWAS Discovery Phase
The discovery set included participants in which the case and both parents agreed
to complete the questionnaire and provide a blood sample for genetic analysis. The DNA
samples were processed by the CHLA Genome Core Laboratory using Genome-Wide
Human SNP Array 6.0 (Affymetrix, Santa Clara, CA), a commercially available
microarray which features more than 1.8 million probes for the detection of SNPs and
CNVs, according to manufacturer’s instructions. After scanning the microarrays, the
signal intensities of the millions of hybridizations were used to determine SNP genotype
and CNV status.
Replication Phase
The replication phase included an independent set of cases who agreed to
complete the questionnaire and provided blood, mouthwash, or PBSC sample for genetic
analysis. Genotyping of 30 selected SNPs was performed on 20 ng of genomic DNA
using iPLEX technology on the MassARRAY system (Sequenom, San Diego, CA). The
iPLEX assay is a primer extension reaction used to detect single nucleotide differences in
genomic DNA. Different template sequences will result in an allele-specific difference in
mass between extension products. This mass difference allows the data analysis software
to distinguish between two SNP alleles.
The control genotype data for the 30 SNPs was obtained from the 1000 Genomes
Project and derived using next generation sequencing technology. (The 1000 Genomes
Project Consortium, 2010)
Statistical Analyses
GWAS Discovery Phase
Genotype calls in the GWAS discovery set were generated from the Affymetrix
Genome-Wide Human SNP 6.0 array using the Birdseed algorithm implemented in the
Affymetrix Power Tools software package. Quality control and association analysis were
performed using PLINK v 1.07. (Purcell et al., 2007) A Perl script called
LINKDATAGEN was used to create the input files for PLINK. (Bahlo and Bromhead,
2009) SNPs with minor allele frequency (MAF) less than 5%, genotyping rate of less
than 85%, Mendelian errors greater than 90%, or Hardy-Weinberg p-value less than
0.001 were excluded from the discovery set analysis. For the association test using the
case-parent trios, the TDT was used to test for the transmission of a SNP from a
heterozygous parent to an affected child in a proportion different than expected under the
null hypothesis that no association exists between the SNP and disease. (Spielman et al.,
1993) A Bonferroni correction was used to control for multiple comparisons, which is
overly conservative but protects against Type I error. (McIntyre et al., 2000)
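The TDT statistic described above can be illustrated with a minimal sketch. It assumes the standard McNemar-type form, (b − c)² / (b + c) on 1 degree of freedom, where b and c are the numbers of times heterozygous parents transmitted or did not transmit a given allele. The function name and the counts are invented for illustration; the actual analysis in this study was run in PLINK.

```python
import math


def tdt(transmitted, untransmitted):
    """TDT chi-square (1 df) and p-value from heterozygous-parent transmission counts."""
    b, c = transmitted, untransmitted
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents for this SNP")
    chi_square = (b - c) ** 2 / (b + c)
    # Upper-tail probability of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical counts: the allele was transmitted 20 times and not transmitted 5 times.
chi_square, p_value = tdt(transmitted=20, untransmitted=5)
```

Under the null hypothesis of no association, each allele of a heterozygous parent is transmitted with probability one half, so b and c are expected to be similar and the statistic stays small.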
Signal intensity data for copy number analysis in the GWAS discovery set were
normalized according to the manufacturer’s guidelines using Genotyping Console 3.0
(Affymetrix Inc., Santa Clara, CA). The resulting log base 2 ratios of the probes
(log2ratio) for the cases were imported into R version 2.9.2. The GADA package was
used to identify the genome-wide probes spanning CNVs in the EWS cases. (Pique-Regi
et al., 2008) The prior parameter (alpha) was set to alpha = 0.1, the significance threshold
(T) was set to T=6 for identification of breakpoints, and only significant segments with
more than 3 probes were selected for further analysis. The log2ratios of altered probes
identified using GADA were imported into SNP & Variation Suite 7 Software (Golden
Helix, Inc., Bozeman, MT) for all case-parent trios to perform association analysis. A
family-based association test (FBAT) was used to test for association between CNVs and
EWS using the quantitative log2ratios of probes spanning the CNVs identified in the
cases.
Replication Phase
The top 30 most significant SNPs in the GWAS discovery phase, which also
showed a difference in minor allele frequency (MAF) greater than 10% in EWS cases
versus HapMap African (YRI) and Asian (CHB) populations, were selected for
replication in 190 independent EWS cases. Only the Non-Hispanic white cases were
included in the final analysis. The replication set was analyzed using two different control
groups: (1) 381 healthy controls of European ancestry from 1000 Genomes Project and
(2) the parents of the cases. Logistic regression was used to estimate the association
between SNPs and EWS in the cases compared to healthy controls. For the cases in
which genotype information was available for both parents, the TDT was used to estimate
the association between SNPs and EWS. The quality control and association analyses
were performed using PLINK v 1.07. Samples with more than 10% missing genotypes,
markers with less than 95% genotyping rate, and SNPs with greater than 90% Mendelian
errors in the family-based analysis were excluded.
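The sample-level and marker-level missingness filters for the replication set can be sketched as follows. This is only an illustration of the two thresholds stated above (more than 10% missing genotypes per sample, less than 95% genotyping rate per marker). The data layout, the function names, and the order in which the filters are applied are assumptions, and the Mendelian error filter is omitted.

```python
def call_rate(calls):
    """Fraction of non-missing genotype calls; None marks a missing call."""
    calls = list(calls)
    return sum(call is not None for call in calls) / len(calls)


def replication_qc(genotypes, max_sample_missing=0.10, min_marker_call_rate=0.95):
    """genotypes maps sample id -> {snp id -> genotype string or None}.

    Samples are filtered first, then markers are evaluated on the kept samples
    (an assumed ordering).
    """
    kept_samples = sorted(
        sample
        for sample, calls in genotypes.items()
        if 1.0 - call_rate(calls.values()) <= max_sample_missing
    )
    all_snps = sorted({snp for calls in genotypes.values() for snp in calls})
    kept_snps = [
        snp
        for snp in all_snps
        if kept_samples
        and call_rate(genotypes[s].get(snp) for s in kept_samples) >= min_marker_call_rate
    ]
    return kept_samples, kept_snps


# Toy data: 4 samples by 20 SNPs, all called, then some calls set to missing.
snps = [f"snp_{i:02d}" for i in range(20)]
genotypes = {f"case_{n}": {snp: "AG" for snp in snps} for n in range(1, 5)}
genotypes["case_2"]["snp_00"] = None      # 5% missing: sample kept
for snp in snps[:5]:
    genotypes["case_4"][snp] = None       # 25% missing: sample dropped
kept_samples, kept_snps = replication_qc(genotypes)
```

In this toy example case_4 is removed for missingness, and snp_00 is then removed because it is called in only two of the three remaining samples.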
Results
SNP Genotyping
EWS cases and parents who met the eligibility criteria were recruited into the
study (Table 3). The discovery set included 81 participants (27 case-parent EWS trios).
These 27 case-parent trios were identified from 3 sources: (a) 6 cases were treated at
CHLA from 1991 – 2000; (b) 18 were identified from the California Cancer Registry
from 1985-2008; and (c) 3 cases were identified through our study website designed to
recruit Ewing sarcoma cases and their families from outside the Los Angeles, CA area.
The replication set included 190 EWS cases identified from 4 sources in which the case
agreed to participate and provide a blood, mouthwash, or PBSC sample for genetic
analyses: (a) 37 cases were treated at CHLA from 1986–2011; (b) 91 were identified
from the California Cancer Registry from 1985–2008; (c) 11 cases were identified from
the Pennsylvania Cancer Registry from 1985–2008; and (d) 51 were identified through
our EWS study website. Only the Non-Hispanic white EWS cases in the replication set
were included in the final analysis, and two different control groups were used to
estimate the association between SNPs and EWS. The results of the case-control analysis
were compared to the family-based analysis.
Table 3: Demographic characteristics of EWS discovery set and replication set.
Description Discovery Set (n=27) Replication Set (n=190)
Age of Diagnosis (years), median (range) 15 (1 – 39) 17 (3 – 56)
Race, n (%)
Non-Hispanic White 23 (85.2%) 145 (81.0%)
Non-Hispanic Black 0 (0.0%) 2 (1.1%)
Hispanic 3 (11.1%) 28 (15.6%)
Asian/Pacific Islander 1 (3.7%) 1 (0.6%)
Gender, n (%)
Male 19 (70.4%) 100 (52.6%)
Female 8 (29.6%) 90 (47.4%)
Case Source, n (%)
Children’s Hospital Los Angeles 6 (22.2%) 37 (19.5%)
California Cancer Registry 18 (66.7%) 91 (47.9%)
Pennsylvania Cancer Registry 0 (0.0%) 11 (5.8%)
Internet 3 (11.1%) 51 (26.8%)
The median age at diagnosis in the discovery set was 15 years (range 1 – 39); the
majority of cases were Non-Hispanic white (85%), male (70%), and identified through the
CCR (67%). The median age at diagnosis in the replication set was 17 years (range 3
– 56); most cases were Non-Hispanic white (81%), just over half were male (53%), and
the largest share was identified through the CCR (48%).
A flow chart of the SNP data analysis pipeline is shown in Figure 9. After
applying the quality control filters to 905,018 SNPs genotyped using the Affymetrix SNP
6.0 microarray, a total of 887,995 SNPs were included in the discovery set analysis. For
each SNP, the TDT was used to test for transmission of alleles from a heterozygous
parent to an affected child. Figure 10 shows a Manhattan Plot of the results of the TDT
analysis in the 27 trios. The top 5 SNPs (rs1743040, rs1228824, rs7907995, rs1013831,
and rs2946998) associated with EWS had p-values of 5.7 × 10⁻⁵ or less, which did not
reach genome-wide significance at the Bonferroni-corrected threshold of 5.6 × 10⁻⁸.
The Bonferroni correction is known to be conservative, so this threshold likely
over-corrects the raw p-values.
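As a quick check on the threshold quoted above, the Bonferroni cutoff is the family-wise significance level divided by the number of tests. The sketch below assumes a family-wise level of 0.05, which the text does not state explicitly but which reproduces the reported value; the function name is hypothetical.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold that controls the family-wise error rate.

    alpha=0.05 is an assumption; it is not stated in the text.
    """
    return alpha / n_tests


# 887,995 SNPs passed quality control in the discovery set.
print(f"{bonferroni_threshold(887_995):.1e}")  # 5.6e-08
```

The same calculation with the 22,543 CNV probes gives a threshold of about 2.2 × 10⁻⁶.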
The top 30 most significant SNPs (p-value ≤ 3.12 × 10⁻⁴) from the discovery set
that also showed a difference in MAF greater than 10% between the EWS cases and the
YRI and CHB HapMap populations were selected for replication in an independent dataset.
The associations of these SNPs in the GWAS discovery set are listed in Table 4. A proxy
SNP, rs1228829, was selected to replace rs1228824 (r² = 1.0) due to in silico MassARRAY
multiplex assay design failure.
All 30 SNPs selected for replication are synonymous variants, and the details of these
SNPs are shown in Table 5. The genotyping rate of the 30 SNPs in the replication set was
99.8%, and the genotype correlation between 16 samples (4 EWS case-parent trios)
analyzed using both the Affymetrix SNP 6.0 array and Sequenom MassARRAY was
98.3%.
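One plausible way to compute a cross-platform agreement figure like the one above is the fraction of genotype calls that match between the two assays. The sketch below is an assumption about the metric (the text calls it a "genotype correlation" without defining it), and the function name and the two-letter genotype encoding are hypothetical.

```python
def genotype_concordance(calls_a, calls_b):
    """Fraction of genotype calls that agree between two platforms.

    calls_a, calls_b: equal-length lists of two-letter genotypes such as
    "AG", or None for a missing call. Pairs with a missing call are skipped.
    """
    pairs = [(a, b) for a, b in zip(calls_a, calls_b)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no genotypes called on both platforms")
    # Sort alleles so that "AG" and "GA" count as the same unordered genotype.
    return sum(sorted(a) == sorted(b) for a, b in pairs) / len(pairs)
```

For example, `genotype_concordance(["AA", "AG", "GG"], ["AA", "GA", "AG"])` counts two matches out of three comparable calls.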
Figure 9: Flowchart of the SNP Data Analysis.
Figure 10: Manhattan Plot of TDT GWAS Results for 27 EWS Case-Parent Trios. The –log10
TDT p-value for the association between each SNP and EWS is plotted by chromosome location.
The replication set was analyzed using two different control groups. In the first
analysis, SNP genotypes of 145 Non-Hispanic white cases were compared to 381
controls of European ancestry. Six cases had greater than 10% genotyping errors and
were excluded from the analysis. SNP rs1935778 had greater than 5% genotyping errors
and was excluded from the analysis. The associations between the remaining 29 SNPs
and EWS in this case-control analysis are listed in Table 6. The analysis identified
statistically significant associations between EWS and rs11217524 (OR = 0.55, 95% CI =
0.38 - 0.81) and rs1447149 (OR = 0.70, 95% CI = 0.50 - 0.99). A subset of 90 cases in
the replication set also had genotype data for both parents. Therefore, in addition to
the case-control analysis, these 90 Non-Hispanic white cases were analyzed together
with their 180 parents in a family-based analysis. Three
case-parent trios had >10% Mendelian errors and were excluded from the TDT analysis.
The associations between 30 SNPs and EWS in this TDT analysis are listed in Table 7.
SNP rs17072464 was the only SNP associated with EWS (OR = 1.82, 95% CI = 1.01 -
3.30).
All Non-Hispanic white cases from the discovery and replication phases were
analyzed together in a joint analysis. Again, the cases were analyzed using two control
groups. First, 162 Non-Hispanic white cases were compared to 381 European controls.
The associations between SNPs and EWS in this joint case-control analysis are listed in
Table 8. SNP rs11217524 was most strongly associated with EWS (OR = 0.48, 95% CI
= 0.33 - 0.71). The joint analysis also identified statistically significant associations
between EWS and rs1447149 (OR = 0.61, 95% CI = 0.44 - 0.86), rs7907995 (OR = 1.36,
95% CI = 1.05 - 1.78), and rs2252432 (OR = 1.34, 95% CI = 1.01 - 1.76). Next, the 110
Non-Hispanic white cases were analyzed using their 220 parents in a family-based
analysis. The associations between SNPs and EWS in this joint TDT analysis are listed in
Table 9. The joint analysis identified statistically significant associations (p < 0.05)
between 11 SNPs and EWS. SNPs rs11217524 and rs7907995 were the only SNPs
significant in both the joint TDT and joint case-control analyses. A regional
association plot of rs11217524 and rs7907995 from the GWAS is shown in Figure 11.
These 2 candidate loci were not identified in a recently reported GWAS of 401 French
EWS cases. The nearest gene to SNP rs11217524 is poliovirus receptor-related 1
(PVRL1) on chromosome 11q23.3. This gene encodes an adhesion protein that is used as
a receptor for herpes simplex virus to mediate entry of the virus into human epithelial and
neuronal cells. (Geraghty et al., 1998) SNP rs7907995 is located near the ankyrin repeat
domain-containing protein 30A (ANKRD30A) gene on chromosome 10p11.21, which is
mainly expressed in the breast and testis. (Hahn et al., 2006)
Figure 11. Regional Association Plot of Candidate SNPs in the Discovery Phase. The strength
of association for candidate SNPs (blue diamond) A) rs11217524 and B) rs7907995 with EWS in
the context of the association results for surrounding markers, genes (green arrow shows
transcription orientation), estimated recombination rates (blue line) and pair-wise correlations
(r² > 0.8 = red diamond; r² > 0.5 = orange diamond; r² > 0.2 = yellow diamond; r² < 0.2 = white diamond).
Copy Number Variation Analysis
GADA analysis of the GWAS discovery set including 27 EWS cases identified
22,543 altered probes that showed evidence of spanning a CNV. Figure 12 shows the
frequency of CNV gains and losses across the genomes of these 27 samples. There were
no CNVs located near the candidate SNPs identified in the GWAS on chromosomes
10p11.21 and 11q23.3. In addition, there were no CNVs located in the EWSR1 or FLI1
genes involved in the fusion protein found in most EWS tumors. The association analysis
of the 22,543 probes in the 27 case-parent trios provided suggestive evidence that
chromosomal region 14q11.2 may be associated with EWS (p-value = 4.0 × 10⁻⁴).
However, this region did not reach genome-wide significance when using a Bonferroni
corrected p-value of 2.2 × 10⁻⁶. Several probes in this region are located within the
heterogeneous nuclear ribonucleoprotein C (HNRNPC) gene. This gene belongs to the
subfamily of ubiquitously expressed heterogeneous nuclear ribonucleoproteins (hnRNPs),
which affect pre-mRNA processing and other features of mRNA metabolism and
transport. (Nakagawa et al., 1986)
Table 4: Associations between top 30 SNPs and EWS in the discovery phase of 27 case-parent trios.
SNP Chr Location Minor Allele T:U OR (95% CI) P-value
rs1743040 14 99420386 A 25:2 12.50 (2.96-52.77) 9.58 X10-6
rs1228824 1 30312975 T 3:25 0.12 (0.04-0.40) 3.22 X10-5
rs7907995 10 37088436 T 20:1 20.00 (2.68-149.00) 3.38 X10-5
rs1013831 13 90250273 A 17:0 0.00 3.74 X10-5
rs2946998 10 126739152 T 1:19 0.05 (0.01-0.39) 5.70 X10-5
rs6792514 3 42404821 C 21:2 10.50 (2.46-44.78) 7.44 X10-5
rs4870303 6 150491644 G 3:23 0.13 (0.04-0.43) 8.77 X10-5
rs1447149 11 119196279 T 1:18 0.06 (0.01-0.42) 9.62 X10-5
rs11217524 11 119198081 C 1:18 0.06 (0.01-0.42) 9.62 X10-5
rs1149062 1 30314790 G 4:25 0.16 (0.06-0.46) 9.64 X10-5
rs9613768 22 27732832 T 27:5 5.40 (2.08-14.02) 1.01 X10-4
rs7895833 10 69293063 G 0:15 0.00 1.08 X10-4
rs1935778 1 116196222 A 20:2 10.00 (2.34-42.78) 1.24 X10-4
rs658358 11 119292400 A 2:20 0.10 (0.02-0.43) 1.24 X10-4
rs10405576 19 50790725 G 22:3 7.14 (2.08-25.00) 1.45 X10-4
rs1025039 5 35937614 A 5:26 0.19 (0.07-0.50) 1.62 X10-4
rs233111 1 85558243 G 17:1 17.00 (2.63-127.70) 1.62 X10-4
rs1519016 2 216194918 G 17:1 16.67 (2.27-100.0) 1.62 X10-4
rs1548039 2 120835391 C 17:1 16.26 (2.27-100.0) 1.62 X10-4
rs12651205 4 181746748 G 14:0 0.00 1.83 X10-4
rs13005152 2 134914361 G 2:19 0.11 (0.02-0.45) 2.08 X10-4
rs2252432 10 37041833 G 19:2 9.50 (2.21-40.78) 2.08 X10-4
rs604768 4 16875044 C 2:19 0.11 (0.02-0.45) 2.08 X10-4
rs1999709 1 48879003 A 21:3 7.14 (2.08-25.00) 2.39 X10-4
rs12706912 7 80090231 C 3:21 0.14 (0.04-0.48) 2.39 X10-4
rs12651279 4 181027092 A 5:25 0.20 (0.08-0.52) 2.61 X10-4
rs517969 1 37074934 A 1:16 0.06 (0.01-0.47) 2.75 X10-4
rs3731748 2 179121356 T 0:13 0.00 3.12 X10-4
rs17072464 3 65126852 G 0:13 0.00 3.12 X10-4
rs9312259 4 181363535 A 0:13 0.00 3.12 X10-4
Chr=Chromosome; T:U=Number of Transmitted Alleles: Number of Untransmitted Alleles;
OR=Odds Ratio; CI=Confidence Interval
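The T:U, OR, and P-value columns are linked by simple arithmetic, which the following sketch illustrates. It assumes the standard McNemar-type TDT statistic, (T − U)² / (T + U) on one degree of freedom, an odds ratio of T/U, and a log-scale (Woolf-type) confidence interval; these formulas are assumptions in the sense that the text does not spell them out, although they reproduce rows of the table (for example, 25:2 gives OR 12.50 with CI 2.96 to 52.77). The function name `tdt` is hypothetical.

```python
import math


def tdt(transmitted, untransmitted, z=1.96):
    """TDT summary from transmitted (T) and untransmitted (U) allele counts.

    Returns (odds_ratio, ci_low, ci_high, p_value). The odds ratio and
    interval are None when either count is zero, because T/U and its
    log-scale standard error are then undefined.
    """
    t, u = transmitted, untransmitted
    chi2 = (t - u) ** 2 / (t + u)             # 1-df chi-square statistic
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail, chi-square 1 df
    if t == 0 or u == 0:
        return None, None, None, p_value
    odds_ratio = t / u
    se = math.sqrt(1 / t + 1 / u)             # standard error of ln(OR)
    return (odds_ratio,
            odds_ratio * math.exp(-z * se),
            odds_ratio * math.exp(z * se),
            p_value)
```

Calling `tdt(25, 2)` corresponds to the rs1743040 row, and `tdt(0, 13)` shows why rows with a zero count cannot carry a confidence interval.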
Table 5: Detailed information about the 30 SNPs selected for replication.
SNP Chr Location Alleles Nearest Gene MAF Cases MAF CEU MAF CHB MAF YRI
rs1743040 14 99420386 A/G EML1 (intron) 0.574 0.387 0.137 0.456
rs1228829 1 30312975 T/G MATN1 0.185 0.376 0.345 0.788
rs7907995 10 37088436 T/C ANKRD30A 0.537 0.394 0.35 0.119
rs1013831 13 90250273 A/G PPIAP23 0.019 0.037 0.162 0.112
rs2946998 10 126739152 T/C CTBP2 (intron) 0.167 0.270 0.673 0.451
rs6792514 3 42404821 C/T LYZL4 0.630 0.416 0.095 0.292
rs4870303 6 150491644 G/A PPP1R14C 0.278 0.586 0.857 0.898
rs1447149 11 119196279 T/C PVRL1 0.037 0.248 0.012 0.186
rs11217524 11 119198081 C/T PVRL1 0.037 0.230 0.018 0.180
rs1149062 1 30314790 G/T MATN1 0.204 0.381 0.357 0.783
rs9613768 22 27732832 T/G ZNRF3 (intron) 0.667 0.398 0.392 0.018
rs7895833 10 69293063 G/A RPL12P8 0.130 0.146 0.704 0.214
rs1999709 1 48879003 A/G AGBL4 (intron) 0.204 0.355 0.690 0.226
rs1935778 1 116196222 A/G NHLH2 0.615 0.549 0.327 0.235
rs658358 11 119292400 C/G PVRL1 0.130 0.320 0.283 0.518
rs12706912 7 80090231 A/C CD36 (intron) 0.222 0.588 0.381 0.504
rs10405576 19 50790725 G/A GPR4 (intron) 0.222 0.381 0.399 0.407
rs1025039 5 35937614 A/C CAPSL 0.259 0.491 0.042 0.522
rs604768 4 16875044 T/C RPS7P6 0.185 0.371 0.096 0.173
rs233111 1 85558243 A/G DDAH1(3’UTR) 0.444 0.239 0.319 0.323
rs1519016 2 216194918 G/T FN1 0.074 0.204 0.500 0.493
rs13005152 2 134914361 G/T MGAT5 (intron) 0.111 0.142 0.376 0.585
rs1548039 2 120835391 C/T INHBB 0.093 0.119 0.458 0.841
rs12651205 4 181746748 G/A ODZ3 0.000 0.097 0.296 0.051
rs2252432 10 37041833 C/A ANKRD30A 0.463 0.336 0.332 0.027
rs12651279 4 181027092 A/G RPL19P8 0.148 0.124 0.029 0.034
rs517969 1 37074934 A/G GRIK3 (intron) 0.019 0.106 0.044 0.316
rs3731748 2 179121356 T/C TTN (exon) 0.037 0.027 0.139 0.027
rs17072464 3 65126852 G/T MAGI1 0.019 0.115 0.226 0.221
rs9312259 4 181363535 A/G ODZ3 0.019 0.138 0.376 0.085
Chr=Chromosome; MAF = minor allele frequency; CEU=European HapMap population; CHB=Han
Chinese HapMap population; YRI=African HapMap population
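The allele-frequency criterion used to select SNPs for replication can be expressed as a small filter over the MAF columns of this table. The sketch below is an assumption about how the rule was applied: the text does not say whether the difference of more than 10% had to hold against both reference populations or only one, so the hypothetical `require_all` flag exposes both readings.

```python
def maf_differs(case_maf, reference_mafs, min_difference=0.10, require_all=False):
    """True if the case MAF differs from reference MAFs by more than min_difference.

    reference_mafs: dict mapping a population label (e.g. "CHB", "YRI") to its MAF.
    require_all: if True, the difference must hold for every population;
    otherwise a difference against any one population is enough.
    """
    flags = [abs(case_maf - ref) > min_difference
             for ref in reference_mafs.values()]
    return all(flags) if require_all else any(flags)


# Values taken from the rs1743040 row of Table 5.
print(maf_differs(0.574, {"CHB": 0.137, "YRI": 0.456}))  # True
```

Under the stricter reading, some listed SNPs such as rs3731748 (0.037 in cases, 0.027 in YRI) would not pass, which suggests the looser reading is closer to what was done.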
Table 6: Associations between SNPs and EWS in replication set of 139 cases and 381 controls.
SNP Chromosome Location Minor Allele OR (95% CI) P-value
rs11217524 11 119198081 C 0.55 (0.38-0.81) 0.0026
rs1447149 11 119196279 T 0.70 (0.50-0.99) 0.0423
rs6792514 3 42404821 C 0.75 (0.56-1.00) 0.0534
rs7907995 10 37088436 T 1.28 (0.97-1.70) 0.0846
rs1013831 13 90250273 A 1.39 (0.95-2.02) 0.0867
rs12651205 4 181746748 G 1.45 (0.93-2.27) 0.1000
rs9312259 4 181363535 A 1.36 (0.93-1.99) 0.1094
rs2252432 10 37041833 C 1.27 (0.95-1.71) 0.1127
rs1519016 2 216194918 G 0.78 (0.53-1.13) 0.1839
rs12651279 4 181027092 A 1.20 (0.91-1.58) 0.1965
rs658358 11 119292400 G 0.82 (0.59-1.13) 0.2166
rs3731748 2 179121356 T 1.27 (0.86-1.88) 0.2257
rs233111 1 85558243 A 0.83 (0.61-1.15) 0.2624
rs1999709 1 48879003 A 1.17 (0.88-1.56) 0.2736
rs17072464 3 65126852 G 1.21 (0.83-1.76) 0.3177
rs7895833 10 69293063 G 1.17 (0.84-1.63) 0.3554
rs604768 4 16875044 T 1.13 (0.85-1.49) 0.4052
rs1149062 1 30314790 G 1.13 (0.85-1.49) 0.4070
rs1743040 14 99420386 A 0.90 (0.68-1.19) 0.4487
rs10405576 19 50790725 G 0.90 (0.67-1.21) 0.4763
rs13005152 2 134914361 G 1.11 (0.81-1.53) 0.5069
rs1228829 1 30545629 G 1.10 (0.83-1.45) 0.5213
rs12706912 7 80090231 C 1.08 (0.82-1.42) 0.5767
rs9613768 22 27732832 T 1.08 (0.81-1.44) 0.5898
rs1025039 5 35937614 A 0.93 (0.71-1.24) 0.6388
rs1548039 2 120835391 C 0.96 (0.65-1.42) 0.8285
rs4870303 6 150491644 G 0.97 (0.73-1.29) 0.8379
rs517969 1 37074934 A 0.96 (0.63-1.46) 0.8480
rs2946998 10 126739152 T 1.02 (0.74-1.39) 0.9230
OR=Odds Ratio; CI=Confidence Interval
Table 7: Associations between SNPs and EWS in replication set of 87 case-parent trios.
SNP Chr Location Minor Allele T:U OR (95% CI) P-value
rs17072464 3 65126852 G 31:17 1.82 (1.01-3.30) 0.0433
rs2946998 10 126739152 T 27:41 0.66 (0.41-1.07) 0.0896
rs7895833 10 69293063 G 28:17 1.65 (0.90-3.01) 0.1011
rs658358 11 119292400 G 38:27 1.41 (0.86-2.31) 0.1724
rs9312259 4 181363535 A 18:27 0.67 (0.37-1.21) 0.1797
rs12706912 7 80090231 C 36:48 0.75 (0.49-1.16) 0.1904
rs1548039 2 120835391 C 15:23 0.65 (0.34-1.25) 0.1944
rs233111 1 85558243 A 31:42 0.74 (0.46-1.17) 0.1979
rs12651205 4 181746748 G 12:19 0.63 (0.31-1.30) 0.2087
rs1935778 1 116196222 G 31:41 0.76 (0.47-1.21) 0.2386
rs1519016 2 216194918 G 17:23 0.74 (0.39-1.38) 0.3428
rs1025039 5 35937614 A 41:50 0.82 (0.54-1.24) 0.3454
rs7907995 10 37088436 T 45:39 1.15 (0.75-1.77) 0.5127
rs517969 1 37074934 A 21:25 0.84 (0.47-1.50) 0.5553
rs6792514 3 42404821 C 37:42 0.88 (0.57-1.37) 0.5737
rs1743040 14 99420386 A 40:45 0.89 (0.58-1.36) 0.5876
rs1149062 1 30314790 G 48:43 1.12 (0.74-1.69) 0.6002
rs1228829 1 30545629 G 48:43 1.12 (0.74-1.69) 0.6002
rs11217524 11 119198081 C 27:30 0.90 (0.53-1.51) 0.6911
rs2252432 10 37041833 C 37:40 0.93 (0.59-1.45) 0.7324
rs604768 4 16875044 T 47:44 1.07 (0.71-1.61) 0.7532
rs13005152 2 134914361 G 31:33 0.94 (0.58-1.53) 0.8026
rs10405576 19 50790725 G 37:35 1.06 (0.67-1.68) 0.8137
rs4870303 6 150491644 G 43:45 0.96 (0.63-1.45) 0.8312
rs1013831 13 90250273 A 24:23 1.04 (0.59-1.85) 0.8840
rs1999709 1 48879003 A 35:34 1.03 (0.64-1.65) 0.9042
rs9613768 22 27732832 T 49:48 1.02 (0.69-1.52) 0.9191
rs3731748 2 179121356 T 21:21 1.00 (0.55-1.83) 1.0000
rs12651279 4 181027092 A 38:38 1.00 (0.64-1.57) 1.0000
rs1447149 11 119196279 T 30:30 1.00 (0.60-1.66) 1.0000
Chr=Chromosome; T:U=Number of Transmitted Alleles: Number of Untransmitted Alleles;
OR=Odds Ratio; CI=Confidence Interval
Table 8: Associations between SNPs and EWS in joint analysis of 162 cases and 381 controls.
SNP Chromosome Location Minor Allele OR (95% CI) P-value
rs11217524 11 119198081 C 0.48 (0.33-0.71) 0.00016
rs1447149 11 119196279 T 0.61 (0.44-0.86) 0.0042
rs7907995 10 37088436 T 1.36 (1.05-1.78) 0.0223
rs2252432 10 37041833 C 1.34 (1.01-1.76) 0.0408
rs658358 11 119292400 G 0.74 (0.55-1.01) 0.0601
rs1519016 2 216194918 G 0.73 (0.51-1.04) 0.0820
rs10405576 19 50790725 G 0.83 (0.63-1.10) 0.1931
rs4870303 6 150491644 G 0.88 (0.67-1.15) 0.3394
rs1025039 5 35937614 A 0.88 (0.67-1.15) 0.3440
rs517969 1 37074934 A 0.83 (0.55-1.25) 0.3677
rs12651205 4 181746748 G 1.22 (0.78-1.89) 0.3819
rs1548039 2 120835391 C 0.85 (0.58-1.25) 0.4112
rs6792514 3 42404821 C 0.89 (0.69-1.17) 0.4131
rs1013831 13 90250273 A 1.15 (0.80-1.67) 0.4500
rs9312259 4 181363535 A 1.15 (0.80-1.67) 0.4595
rs2946998 10 126739152 T 0.90 (0.67-1.22) 0.5092
rs12706912 7 80090231 C 0.95 (0.73-1.22) 0.6736
rs1228829 1 30545629 G 0.95 (0.73-1.22) 0.6828
rs3731748 2 179121356 T 1.08 (0.74-1.58) 0.7008
rs12651279 4 181027092 A 1.04 (0.80-1.35) 0.7867
rs233111 1 85558243 A 0.96 (0.72-1.28) 0.7917
rs604768 4 16875044 T 1.04 (0.79-1.35) 0.7990
rs9613768 22 27732832 T 1.03 (0.79-1.34) 0.8428
rs1149062 1 30314790 G 0.98 (0.75-1.28) 0.8753
rs1743040 14 99420386 A 1.02 (0.78-1.33) 0.8827
rs1999709 1 48879003 A 1.02 (0.78-1.34) 0.8867
rs7895833 10 69293063 G 1.02 (0.74-1.40) 0.9250
rs17072464 3 65126852 G 1.01 (0.70-1.45) 0.9558
rs13005152 2 134914361 G 1.00 (0.74-1.36) 0.9986
OR=Odds Ratio; CI=Confidence Interval
Table 9: Associations between SNPs and EWS in joint analysis of 110 case-parent trios.
SNP Chr Location Minor Allele T:U OR (95% CI) P-value
rs2946998 10 126739152 T 27:58 0.47 (0.29-0.73) 0.0008
rs1548039 2 120835391 C 16:38 0.42 (0.23-0.76) 0.0028
rs9312259 4 181363535 A 18:38 0.47 (0.27-0.83) 0.0075
rs12651205 4 181746748 G 12:29 0.41 (0.21-0.81) 0.0079
rs1025039 5 35937614 A 46:73 0.63 (0.44-0.91) 0.0133
rs12706912 7 80090231 C 39:64 0.61 (0.41-0.91) 0.0138
rs1519016 2 216194918 G 18:36 0.50 (0.28-0.88) 0.0143
rs7907995 10 37088436 T 63:41 1.54 (1.04-2.28) 0.0310
rs517969 1 37074934 A 22:38 0.58 (0.34-0.98) 0.0389
rs4870303 6 150491644 G 45:66 0.68 (0.47-1.00) 0.0462
rs11217524 11 119198081 C 29:46 0.63 (0.40-1.00) 0.0497
rs1743040 14 99420386 A 65:46 1.41 (0.97-2.06) 0.0713
rs1999709 1 48879003 A 37:54 0.69 (0.45-1.04) 0.0747
rs9613768 22 27732832 T 54:73 0.74 (0.52-1.05) 0.0918
rs1013831 13 90250273 A 24:37 0.65 (0.39-1.08) 0.0960
rs10405576 19 50790725 G 39:55 0.71 (0.47-1.07) 0.0989
rs12651279 4 181027092 A 43:58 0.74 (0.50-1.10) 0.1356
rs13005152 2 134914361 G 34:47 0.72 (0.47-1.13) 0.1486
rs3731748 2 179121356 T 21:31 0.68 (0.39-1.18) 0.1655
rs1447149 11 119196279 T 32:44 0.73 (0.46-1.15) 0.1687
rs604768 4 16875044 T 48:62 0.77 (0.53-1.13) 0.1819
rs1228829 1 30545629 G 51:64 0.80 (0.55-1.15) 0.2254
rs2252432 10 37041833 C 55:43 1.28 (0.86-1.91) 0.2254
rs1149062 1 30314790 G 52:64 0.81 (0.56-1.17) 0.2652
rs6792514 3 42404821 C 53:44 1.21 (0.81-1.80) 0.3608
rs1935778 1 116196222 G 47:42 1.12 (0.74-1.70) 0.5961
rs7895833 10 69293063 G 28:32 0.88 (0.53-1.45) 0.6056
rs233111 1 85558243 A 47:43 1.09 (0.72-1.65) 0.6733
rs17072464 3 65126852 G 31:28 1.11 (0.66-1.85) 0.6961
rs658358 11 119292400 G 41:43 0.95 (0.62-1.46) 0.8273
Chr=Chromosome; T:U=Number of Transmitted Alleles: Number of Untransmitted Alleles;
OR=Odds Ratio; CI=Confidence Interval
Figure 12. Map of Genome-Wide CNVs Identified in 27 EWS Cases. Frequencies of CNV
gains (red) and losses (blue) are plotted by genomic position.
Discussion
Very little is known about the causes of EWS in children and young adults. The
only established biological risk factors are age, gender, and race. Several studies have
suggested that a history of inguinal hernia and environmental risk factors such as parental
exposure to pesticides are associated with increased risk of EWS. The lack of any
established exposures associated with EWS may be due to the heterogeneity of this
disease, to methodological study design issues in previous studies, or to a genetic
component in the etiology of EWS. One study sought to determine a genetic risk factor
for EWS by interrogating the EWSR1 gene on chromosome 22; however, the findings
suggested no variants in this region were associated with EWS. Most recently, a GWAS
of EWS in France identified common SNPs associated with susceptibility to this disease
near genes TARDBP on chromosome 1; genes ADO, ZNF365, and EGR2 on chromosome
10; and genes EIF2AK4, SRP14, and BMF on chromosome 15.
The aim of this current study was to scan the entire genome of EWS cases to
identify possible susceptibility loci. The initial GWAS discovery phase identified several
potential SNP variants associated with EWS. The top 30 SNPs with the smallest p-values
and largest differences in allele frequencies of different ethnic populations were selected
for replication in an independent dataset. When the discovery and replication sets were
combined, the SNP most strongly associated with EWS was rs11217524 located on
chromosome 11. Candidate SNP rs11217524 was significantly associated with EWS
using both a case-control analysis (OR = 0.48, 95% CI = 0.33 - 0.71, p-value = 1.6 × 10⁻⁴)
and TDT analysis (OR = 0.63, 95% CI = 0.40 - 1.00, p-value = 5.0 × 10⁻²). SNP
rs11217524 is located near poliovirus receptor-related 1 (PVRL1) on chromosome
11q23.3. (Francke and Francke, 1979) This gene encodes a receptor for herpes simplex
virus to mediate entry of the virus into human epithelial and neuronal cells. (Lopez et al.,
1995) To a lesser extent, SNP rs7907995 on chromosome 10 was also associated with
EWS using a joint case-control analysis (OR = 1.36, 95% CI = 1.05 - 1.78, p-value =
2.2 × 10⁻²) and joint TDT analysis (OR = 1.54, 95% CI = 1.04 - 2.28, p-value = 3.1 × 10⁻²).
SNP rs7907995 is located near the ankyrin repeat domain-containing protein 30A
(ANKRD30A) gene on chromosome 10p11.21. This locus is also known as NY-BR-1 or
antigen B726P, and the protein encoded by this gene is used as a biomarker of
disseminated breast cancer cells. (Lacroix, 2006)
Candidate SNP rs11217524 is located within the pseudogene LOC390255,
also referred to as the DEAD (Asp-Glu-Ala-Asp) box polypeptide 39B pseudogene. The
sequence of this pseudogene was defined by a single cDNA clone obtained from pooled
human melanocyte, fetal heart, and pregnant uterus. (Thierry-Mieg and Thierry-Mieg,
2006) The nearest gene to the variant is PVRL1, which was first discovered because of its
role in herpes simplex virus susceptibility and was mapped to 11q23-q24 by in situ
hybridization. (Francke and Francke, 1979; Lopez et al., 1995) Mutations in PVRL1 can
give rise to the autosomal recessive disorder cleft lip/palate-ectodermal dysplasia
syndrome (CLPED1). (Suzuki et al., 1998) While CLPED1 is generally a rare disease,
the prevalence of this syndrome in the indigenous population of Margarita Island in the
Caribbean Sea is 1 per 2,000. (Suzuki et al., 2000) It is speculated that the increased
prevalence of CLPED1 in this population may have resulted from the beneficial
resistance to viruses conferred by heterozygosity. Worldwide, an estimated 15% of all
human cancers may be attributed to viruses, which represents a substantial share of the
global cancer burden. (Hausen, 1991) An association between the adenovirus early
region 1A (E1A) gene and the EWS-FLI1 fusion protein has previously been reported, in
which the EWS-FLI1 fusion protein was induced in normal fibroblasts and keratinocytes
after expression of the E1A gene. (Sanchez-Prieto et al., 1999) Therefore, increased
susceptibility to viral infections may be associated with the somatic chromosomal
alterations that lead to EWS tumor formation. The candidate SNP near PVRL1 in this
current study lies within close proximity (8.96 megabases) to the FLI1 gene on
chromosome 11q24. Because PVRL1 encodes a viral receptor, it is possible that variants near
PVRL1 confer susceptibility to certain viral infections and are thereby associated with the
EWS-FLI1 fusion protein.
The epidemiology of EWS reveals a striking ethnic-specific incidence pattern.
Therefore, the minor allele frequency (MAF) of the two candidate SNPs was investigated
across populations of different ancestries using data available from the International
HapMap Project. (Frazer et al., 2007) The MAF of SNP rs11217524 in the 162 Non-
Hispanic white EWS cases was 13%, compared to 24% in the 381 European controls. In
the HapMap populations, the MAF of this SNP was 23% in Utah residents with Northern
and Western European ancestry (CEU); 18% in the Yoruban population in Ibadan,
Nigeria (YRI); and 1.8% in the Han Chinese population in Beijing, China (CHB). The
MAF of SNP rs7907995 was 42% in the 162 Non-Hispanic white EWS cases, compared to 35%
in the 381 European controls, 39% in the CEU population, 12% in the YRI population, and
35% in the CHB population. Therefore, these differences in MAF suggest that risk
variants on chromosomes 10 and 11 likely contribute to the ethnic-specific incidence
pattern of EWS.
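The case and control frequencies quoted above can be turned into an approximate odds ratio, which is a useful sanity check on the reported associations. The sketch below computes an allelic odds ratio with a Woolf-type interval from the rounded MAFs; it is an approximation under stated assumptions and not the logistic regression used in the study, and the function name is hypothetical.

```python
import math


def allelic_odds_ratio(maf_cases, n_cases, maf_controls, n_controls, z=1.96):
    """Allelic OR and approximate 95% CI from MAFs and sample sizes.

    Allele counts are reconstructed from rounded frequencies, so the
    result only approximates a model fitted to individual genotypes.
    """
    # Each diploid individual contributes two alleles.
    a = maf_cases * 2 * n_cases              # minor alleles in cases
    b = (1 - maf_cases) * 2 * n_cases        # major alleles in cases
    c = maf_controls * 2 * n_controls        # minor alleles in controls
    d = (1 - maf_controls) * 2 * n_controls  # major alleles in controls
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of ln(OR)
    return odds_ratio, odds_ratio * math.exp(-z * se), odds_ratio * math.exp(z * se)


# rs11217524: MAF 13% in 162 cases versus 24% in 381 controls.
or_, low, high = allelic_odds_ratio(0.13, 162, 0.24, 381)
print(round(or_, 2))
```

With these inputs the odds ratio comes out at about 0.47, close to the reported 0.48 from the joint case-control analysis.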
A recently published GWAS of EWS identified three common variants which
may predispose to this disease: SNP rs9430161 on 1p36 (OR = 2.20,
95% CI = 1.77 - 2.72, p-value = 1.4 × 10⁻²⁰); SNP rs224278 on 10q21 (OR = 1.66, 95% CI
= 1.42 - 1.93, p-value = 4.0 × 10⁻¹⁷); and SNP rs4924410 on 15q15 (OR = 1.46, 95% CI =
1.23 - 1.74, p-value = 6.6 × 10⁻⁹). Our study replicated these
findings for only SNP rs224278 (OR = 0.35, 95% CI = 0.83 - 6.26, p-value = 0.012).
SNP rs9430161 was not included in our family-based GWAS, and was not in linkage
disequilibrium (LD) with any SNPs on chromosome 1 that were included in our analysis.
SNP rs4924410 was not included in our GWAS, but was in LD with SNP rs8026641,
which was included in our GWAS. SNP rs8026641 was not associated with EWS (OR =
1.88, 95% CI = 0.80 - 4.42, p-value = 0.144) in our GWAS, but the effect size was in the
same direction. Further replication of these 3 candidate SNPs in large, independent
datasets is necessary to validate these findings.
There are several strengths to our study. The main strength is the use of case-
parent trios to identify susceptibility loci for EWS. Using trios is an efficient strategy to
protect against the effects of population stratification, especially when ethnicity is a
strong risk factor for EWS. Collecting DNA from complete trios can be challenging since
DNA from the affected child, mother, and father is required for a complete genetic
analysis, which is especially difficult for a rare, fatal childhood disease. All trio DNA
samples were processed and analyzed on the same assay plate to minimize any batch
effects. A second strength is the innovative use of the internet to recruit cases from
regions outside of California. Having participants from across the US, and even from other
parts of the world, increases generalizability by broadening the range of genetic variants
represented in the cases. The results should be generalizable to most Non-Hispanic white
children and young adults in the US. The third strength is the use of two control groups to
analyze the replication set, including case-parent trios to address potential issues due to
population stratification. Finding significant associations using both analyses validates
the results.
There are a few limitations to this study to be considered in the interpretation of
the results. Those who participated in this study may differ from those who did not
participate in important characteristics, such as factors that influence survival. Cases that
survive long enough to be recruited into the study may not be representative of all
incident cases with EWS with respect to genetic variation. Even though we were not able
to collect a population-based sample for this study, the sample we did collect was
representative of the expected distribution of clinical and biological covariates observed
in most EWS patients. Survivor bias is also a potential limitation of the study. Our
eligibility criteria included EWS survivors, and we therefore oversampled for cases who
did not succumb to the disease, which only represents approximately 50% of those
diagnosed with EWS. However, we did have several families ask to participate even if
their child did not survive EWS, and we accepted peripheral blood stem cells or bone
marrow aspirates from cases as specimens from which to extract DNA for the study. We
therefore attempted to reduce the potential for survivor bias.
In conclusion, this study has identified a potential genetic susceptibility to EWS
near the PVRL1 gene on chromosome 11 and the ANKRD30A gene on chromosome 10. The
differences in the MAF of the candidate SNPs across distinct ethnic populations suggest
that risk variants on chromosomes 10 and 11 likely contribute to the ethnic-specific
incidence pattern of EWS. PVRL1 is a viral receptor gene, and an increased susceptibility
to viral infections may trigger somatic chromosomal alterations on chromosome 11
which lead to EWS tumor formation. Variants in the PVRL1 gene are linked to a rare
genetic syndrome, but have not been previously associated with cancer susceptibility. In
the future, large multi-center population-based studies will be needed to validate these
findings. Further studies are also required to determine the function of these loci in the
development of EWS tumors. The discovery of a genetic risk factor for EWS may
improve clinical practice by identifying the essential genes that influence the
pathogenesis of this disease. This information may help in developing early detection
methods and improve treatment strategies for this highly malignant tumor.
CONCLUSIONS
Ewing Sarcoma (EWS) is the second most common bone tumor in children and
young adults, and is characterized by an oncogenic protein consisting of the EWSR1 gene
fused with one of several genes from the ETS family of transcription factors. The
incidence of EWS varies by ethnicity, and predominantly affects Non-Hispanic white
males. Very little is known about the causes of EWS, and none of the biological or
environmental exposures reported explain the observed ethnic-specific incidence pattern.
Therefore, it is likely that a genetic risk factor contributes to the development of this
tumor.
Recently, copy number variants (CNVs) have emerged as an important component of
genetic variation. CNVs are structural changes that occur throughout the genome,
primarily due to duplication, deletion, insertion, and unbalanced translocation events.
These gains and losses of genetic material are 1 kb or greater in size, and are found to
vary in frequency among healthy individuals. The frequency of CNVs varies by ethnicity,
and may contribute to phenotypic variations and differences in disease susceptibility
across different ethnic groups.
This research described a novel methodology for identifying a common CNV
signature that could predict ancestry with an extremely high accuracy. The
misclassification error rate of the 53 caCNV signature that distinguishes Europeans,
Africans, and Han Chinese was 2%. This research also reports the identification of
the first microRNA caCNV. This is a quick and simple method that eliminates the need for
data reduction techniques to identify common CNVs, or for principal component based
analyses to assign class membership using only signal intensities. Importantly, this method
predictive of population phenotypes.
The genetic epidemiology of EWS still remains elusive. Recently, genome-wide
association studies (GWAS) have been successful in identifying cancer susceptibility
loci, and this is the approach used in this research to identify a genetic predisposition to
EWS. Progress has been limited by the rarity
and the poor survival rate of this disease. Therefore, a registry of biospecimens (whole
blood, peripheral blood stem cells, mouthwash, and tumors) was established from
individuals diagnosed with EWS, as well as their parents and siblings. The DNA was
extracted and genotyped to determine genome-wide SNPs and CNVs in the discovery set,
followed by candidate genotyping of the most significant SNPs also having differences in
allele frequencies across various ethnic populations, in an independent replication set.
The significance of transmission of a SNP from a heterozygous parent to an affected
child was tested using the Transmission Disequilibrium Test (TDT) in the case-parent
trios, and CNVs were identified using a Family-Based Association Test (FBAT).
The initial GWAS discovery phase identified several potential SNP variants
associated with EWS. The top 30 SNPs that also showed a difference in allele frequency
across different ethnic populations were selected for replication in an independent
dataset. When the discovery and replication sets were combined, SNP rs11217524 was
significantly associated with EWS using both a case-control analysis (OR = 0.48, 95% CI
= 0.33 - 0.71, p-value = 1.6 × 10^-4) and TDT analysis (OR = 0.63, 95% CI = 0.40 - 1.00,
p-value = 5.0 × 10^-2). The variant is located within the pseudogene LOC390255, also
referred to as the DEAD (Asp-Glu-Ala-Asp) box polypeptide 39B pseudogene. SNP
rs11217524 is located near poliovirus receptor-related 1 (PVRL1) on chromosome
11q23.3. This gene encodes a receptor that mediates entry of herpes simplex virus
into human epithelial and neuronal cells. Mutations in this gene can lead to a rare
genetic syndrome named CLPED1. It has been speculated that the increased prevalence of
CLPED1 on Margarita Island in the Caribbean Sea results from improved resistance to
viruses in heterozygotes. Of relevance for EWS, PVRL1 lies in close
proximity (8.96 megabases) to the FLI1 breakpoint region on chromosome 11q24. An
association between the adenovirus early region 1A (E1A) gene and the EWS-FLI1 fusion
protein has previously been described, and therefore, increased susceptibility to viral
infections may be associated with the somatic chromosomal alterations that lead to EWS
tumor formation.
SNP rs7907995 on chromosome 10 was also associated with EWS using a joint
case-control analysis (OR = 1.36, 95% CI = 1.05 - 1.78, p-value = 2.2 × 10^-2) and joint
TDT analysis (OR = 1.57, 95% CI = 1.03 - 2.78, p-value = 3.1 × 10^-2). SNP rs7907995 is
located near the ankyrin repeat domain-containing protein 30A (ANKRD30A) gene on
chromosome 10p11.21. This locus is also known as NY-BR-1 or antigen B726P, and the
protein encoded by this gene is used as a biomarker of disseminated breast cancer cells.
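The reported odds ratios, confidence intervals, and p-values can be cross-checked against one another. The sketch below assumes the intervals are Wald intervals on the log odds ratio scale (an assumption, since the interval method is not restated in this chapter) and recovers an approximate two-sided p-value from an OR and its 95% CI. The function name is hypothetical.

```python
import math


def p_value_from_or_ci(odds_ratio, lower, upper, z_crit=1.959964):
    """Approximate z-score and two-sided p-value from an OR and its 95% CI.

    Assumes a Wald interval: log(OR) +/- z_crit * SE.
    """
    # Standard error of log(OR), recovered from the interval width
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Case-control estimates quoted in the text above.
for snp, or_, lo, hi in [("rs11217524", 0.48, 0.33, 0.71),
                         ("rs7907995", 1.36, 1.05, 1.78)]:
    z, p = p_value_from_or_ci(or_, lo, hi)
    print(f"{snp}: z = {z:.2f}, approximate p = {p:.1e}")
```

Under this assumption the case-control estimates give p-values of roughly 1.7 × 10^-4 and 2.2 × 10^-2, close to the values reported above. Small differences are expected from rounding of the published OR and CI.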
The differences in allele frequencies of the 2 candidate SNPs across different ethnic
populations suggest that risk variants on chromosomes 10p11 and 11q23 likely contribute
to the ethnic-specific incidence pattern of EWS. These loci are in addition to the EWS
susceptibility loci on chromosomes 1p36.22, 10q21, and 15q15 recently reported in a
GWAS of 401 EWS cases in France.
These research findings lie in the early discovery phase in the spectrum of clinical
translational research. The next steps are to translate these results into clinical
applications, such as early detection methods and improved treatment strategies that
impact EWS patients. First, deep sequencing of the PVRL1 and ANKRD30A genes should
be done to confirm these results and identify rare variants that may exist in these regions.
Ideally this would be accomplished through large, multi-center studies. Second, the
function of these genes in EWS tumor development should be determined. In silico gene
expression and copy number analysis of EWS tumors will help clarify the
role of these genes in cancer development. Mouse models for EWS do not currently exist,
so in vivo work remains a challenge. Finally, clinical applications of PVRL1 and
ANKRD30A for EWS detection and treatment can be evaluated to assess their efficacy,
implementation, and impact in human populations. This would translate a
clinical problem into clinical practice that benefits real-world populations.
BIBLIOGRAPHY
Affymetrix Inc. (2009). Genome-Wide Human SNP Array 6.0 Data Sheet. Available at
www.affymetrix.com.
Alexander, D.H., Novembre, J., and Lange, K. (2009). Fast model-based estimation of
ancestry in unrelated individuals. Genome Res 19, 1655-1664.
Alkan, C., Kidd, J.M., Marques-Bonet, T., Aksay, G., Antonacci, F., Hormozdiari, F.,
Kitzman, J.O., Baker, C., Malig, M., Mutlu, O., et al. (2009). Personalized copy number
and segmental duplication maps using next-generation sequencing. Nat Genet 41, 1061-
1067.
Amiel, A., Ohali, A., Fejgin, M., Sardos-Albertini, F., Bouaron, N., Cohen, I., Yaniv, I.,
Zaizov, R., and Avigad, S. (2003). Molecular cytogenetic parameters in Ewing sarcoma.
Cancer Genet Cytogenet 140, 107-112.
Armengol, G., Tarkkanen, M., Virolainen, M., Forus, A., Valle, J., Bohling, T., Asko-
Seljavaara, S., Blomqvist, C., Elomaa, I., Karaharju, E., et al. (1997). Recurrent gains of
1q, 8 and 12 in the Ewing family of tumours by comparative genomic hybridization. Br J
Cancer 75, 1403-1409.
Armengol, L., Villatoro, S., Gonzalez, J.R., Pantano, L., Garcia-Aragones, M., Rabionet,
R., Caceres, M., and Estivill, X. (2009). Identification of copy number variants defining
genomic differences among major human groups. PLoS One 4, 7230-7243.
Aurias, A., Rimbaut, C., Buffe, D., Zucker, J., and Mazabraud, A. (1984). Translocation
Involving Chromosome 22 in Ewing’s Sarcoma. A Cytogenetic Study of Four Fresh
Tumors. Cancer Genet Cytogenet 12, 21-25.
Bahlo, M., and Bromhead, C.J. (2009). Generating linkage mapping files from
Affymetrix SNP chip data. Bioinformatics 25, 1961-1962.
Barker, L., Pendergrass, T., Sanders, J., and Hawkins, D. (2005). Survival after
recurrence of Ewing's sarcoma family of tumors. J Clin Oncol 23, 4354-4362.
Batanian, J.R., Bridge, J.A., Wickert, R., Vogler, C., Gadre, B., and Huang, Y. (2002).
EWS/FLI-1 fusion signal inserted into chromosome 11 in one patient with morphologic
features of Ewing sarcoma, but lacking t(11;22). Cancer Genet Cytogenet 133, 72-75.
Bauters, M., Weuts, A., Vandewalle, M., Nevelsteen, J., Marynen, P., Esch, H.V., and
Froyen, G. (2008). Detection and validation of copy number variation in X-linked mental
retardation. Cytogenet Genome Res 123, 44-53.
Bengtsson, H., Irizarry, R., Carvalho, B., and Speed, T.P. (2008). Estimation and
assessment of raw copy numbers at the single locus level. Bioinformatics 24, 759-767.
Bengtsson, H., Ray, A., Spellman, P., and Speed, T. (2009). A single-sample method for
normalizing and combining full-resolution copy numbers from multiple platforms, labs
and analysis methods. Bioinformatics 1, 861-867.
Bentley, D.R. (2006). Whole-genome re-sequencing. Curr Opin Genet Dev 16, 545-552.
Bernstein, M., Kovar, H., Paulussen, M., Randall, R.L., Schuck, A., Teot, L.A., and
Juergens, H. (2006). Ewing's Sarcoma Family of tumors: Current Management.
Oncologist 11, 503-519.
Beroukhim, R., Getz, G., Nghiemphu, L., Barretina, J., Hsueh, T., Linhart, D., Vivanco,
I., Lee, J., Huang, J., Alexander, S., et al. (2007). Assessing the significance of
chromosomal aberrations in cancer: methodology and application to glioma. Proc Natl
Acad Sci U S A 104, 20007-20012.
Bleyer, A., O'Leary, M., Barr, R., and Ries, L. (2006). Cancer Epidemiology in Older
Adolescents and Young Adults 15 to 29 Years of Age, Including SEER Incidence and
Survival: 1975 - 2000. National Cancer Institute NIH Pub. No. 06-5767.
Brisset, S., Schleiermacher, G., Peter, M., Mairal, A., Oberlin, O., Delattre, O., and
Aurias, A. (2001). CGH analysis of secondary genetic changes in Ewing tumors:
correlation with metastatic disease in a series of 43 cases. Cancer Genet Cytogenet 130,
57-61.
Broderick P, W.Y., Vijayakrishnan J, Matakidou A, Spitz MR, Eisen T, Amos CI,
Houlston RS. (2009). Deciphering the impact of common genetic variation on lung
cancer risk: a genome-wide association study. Cancer Res 69, 6633-6641.
Broët, P., and Richardson, S. (2006). Detection of gene copy number changes in CGH
microarrays using a spatially correlated mixture model. Bioinformatics 22, 911-918.
Cahan, P., Godfrey, L., Eis, P., Richmond, T., Selzer, R., Brent, M., McLeod, H., Ley, T.,
and Graubert, T. (2008). wuHMM: a robust algorithm to detect DNA copy number
variation using long oligonucleotide microarray data. Nucleic Acids Res 36, e41.
Campbell, C.D., Sampas, N., Tsalenko, A., Sudmant, P.H., Kidd, J.M., Malig, M., Vu,
T.H., Vives, L., Tsang, P., Bruhn, L., et al. (2011). Population-genetic properties of
differentiated human copy-number polymorphisms. Am J Hum Genet 88, 317-332.
Carvalho, B., Ouwerkerk, E., Meijer, G.A., and Ylstra, B. (2004). High resolution
microarray comparative genomic hybridisation analysis using spotted oligonucleotides. J
Clin Pathol 57, 644-646.
Chomczynski, P., and Sacchi, N. (2006). The single-step method of RNA isolation by
acid guanidinium thiocyanate-phenol-chloroform extraction: twenty-something years on.
Nature Protocols 1, 581-585.
Cheung, V. (2001). Integration of cytogenetic landmarks into the draft sequence of the
human genome. Nature 409, 953-958.
Colella, S., Yau, C., Taylor, J., Mirza, G., Butler, H., Clouston, P., Bassett, A., Seller, A.,
Holmes, C., and Ragoussis, J. (2007). QuantiSNP: an Objective Bayes Hidden-Markov
Model to detect and accurately map copy number variation using SNP genotyping data.
Nucleic Acids Res 35, 2013-2025.
Conrad, D.F., Pinto, D., Redon, R., Feuk, L., Gokcumen, O., Zhang, Y., Aerts, J.,
Andrews, T.D., Barnes, C., Campbell, P., et al. (2009). Origins and functional impact of
copy number variation in the human genome. Nature 464, 704-712.
Daruwala, R., Rudra, A., Ostrer, H., Lucito, R., Wigler, M., and Mishra, B. (2004). A
versatile statistical analysis algorithm to detect genome copy number variation. Proc Natl
Acad Sci U S A 101, 16292-16297.
Delattre, O., Zucman, J., Plougastel, B., Desmaze, C., Melot, T., and Peter, M. (1992).
Gene Fusion with an ETS DNA-binding Domain Caused by Chromosome Translocation
in Human Tumors. Nature 359, 162-165.
Diskin, S.J., Eck, T., Greshock, J., Mosse, Y., Naylor, T., Stoeckert, C., Weber, B.,
Maris, J., and Grant, G. (2006). STAC: A method for testing the significance of DNA
copy number aberrations across multiple array-CGH experiments. Genome Res 16, 1149-
1158.
Diskin, S.J., Hou, C., Glessner, J.T., Attiyeh, E.F., Laudenslager, M., Bosse, K., Cole, K.,
Mossé, Y.P., Wood, A., Lynch, J.E., et al. (2009). Copy number variation at 1q21.1
associated with neuroblastoma. Nature 459, 987-992.
DuBois, S.G., Goldsby, R., Segal, M., Woo, J., Copren, K., Kane, J.P., Pullinger, C.R.,
Matthay, K.K., Witte, J., Lessnick, S.L., et al. (2011). Evaluation of Polymorphisms in
EWSR1 and Risk of Ewing Sarcoma: A Report From the Childhood Cancer Survivor
Study. Pediatr Blood Cancer doi: 10.1002/pbc.23263, 1-5.
Eeles RA, K.-J.Z., Al Olama AA, Giles GG, Guy M, Severi G, Muir K, Hopper JL,
Henderson BE, Haiman CA, Schleutker J, Hamdy FC, Neal DE, Donovan JL, Stanford
JL, Ostrander EA, Ingles SA, John EM, Thibodeau SN, Schaid D, Park JY, Spurdle A,
Clements J, Dickinson JL, Maier C, Vogel W, Dörk T, Rebbeck TR, Cooney KA,
Cannon-Albright L, Chappuis PO, Hutter P, Zeegers M, Kaneva R, Zhang HW, Lu YJ,
Foulkes WD, English DR, Leongamornlert DA, Tymrakiewicz M, Morrison J, Ardern-
Jones AT, Hall AL, O'Brien LT, Wilkinson RA, Saunders EJ, Page EC, Sawyer EJ,
Edwards SM, Dearnaley DP, Horwich A, Huddart RA, Khoo VS, Parker CC, Van As N,
Woodhouse CJ, Thompson A, Christmas T, Ogden C, Cooper CS, Southey MC,
Lophatananon A, Liu JF, Kolonel LN, Le Marchand L, Wahlfors T, Tammela TL,
Auvinen A, Lewis SJ, Cox A, FitzGerald LM, Koopmeiners JS, Karyadi DM, Kwon EM,
Stern MC, Corral R, Joshi AD, Shahabi A, McDonnell SK, Sellers TA, Pow-Sang J,
Chambers S, Aitken J, Gardiner RA, Batra J, Kedda MA, Lose F, Polanowski A,
Patterson B, Serth J, Meyer A, Luedeke M, Stefflova K, Ray AM, Lange EM, Farnham J,
Khan H, Slavov C, Mitkova A, Cao G; UK Genetic Prostate Cancer Study
Collaborators/British Association of Urological Surgeons' Section of Oncology; UK
ProtecT Study Collaborators; PRACTICAL Consortium, Easton DF. (2009).
Identification of seven new prostate cancer susceptibility loci through a genome-wide
association study. Nat Genet 41, 1116-1121.
Esiashvili, N., Goodman, M., and Marcus, R. (2008). Changes in Incidence and Survival
of Ewing Sarcoma Patients Over the Past 3 Decades. Surveillance Epidemiology and End
Results Data. Journal of Pediatric Hematology Oncology 30, 425-430.
Ewing, J. (1921). Diffuse endothelioma of bone. Proc N Y Path Soc 27, 17-24.
Ferreira, B., Alonso, J., Carrillo, J., Acquadro, F., Largo, C., Suela, J., Teixeira, M.,
Cerveira, N., Molares, A., Gomez-Lopez, G., et al. (2007). Array CGH and gene-
expression profiling reveals distinct genomic instability patterns associated with DNA
repair and cell-cycle checkpoint pathways in Ewing's sarcoma. Oncogene 27, 2084-2090.
Fiegler, H., Redon, R., Andrews, D., Scott, C., Andrews, R., Carder, C., Clark, R.,
Dovey, O., Ellis, P., Feuk, L., et al. (2006). Accurate and reliable high-throughput
detection of copy number variation in the human genome. Genome Res 16, 1566-1574.
Francke, U., and Francke, B.R. (1979). Assignment of gene(s) required for herpes-
simplex virus type-1 (HV1S) replication to the long arm of human chromosome 11.
Cytogenet Cell Genet 25, 155-155.
Fraumeni, J., and Glass, A. (1970). Rarity of Ewing's Sarcoma Among U.S. Negro
Children. Lancet 1, 366-367.
Frazer, K.A., Ballinger, D.G., Cox, D.R., Hinds, D.A., Stuve, L.L., Gibbs, R.A.,
Belmont, J.W., Boudreau, A., Hardenbol, P., Leal, S.M., et al. (2007). A second
generation human haplotype map of over 3.1 million SNPs. Nature 449, 851-861.
Geraghty, R.J., Krummenacher, C., Cohen, G.H., Eisenberg, R.J., and Spear, P.G. (1998).
Entry of alphaherpesviruses mediated by poliovirus receptor-related protein 1 and
poliovirus receptor. Science 280, 1618-1620.
Gudmundsson J, S.P., Gudbjartsson DF, Jonasson JG, Sigurdsson A, Bergthorsson JT,
He H, Blondal T, Geller F, Jakobsdottir M, Magnusdottir DN, Matthiasdottir S, Stacey
SN, Skarphedinsson OB, Helgadottir H, Li W, Nagy R, Aguillo E, Faure E, Prats E, Saez
B, Martinez M, Eyjolfsson GI, Bjornsdottir US, Holm H, Kristjansson K, Frigge ML,
Kristvinsson H, Gulcher JR, Jonsson T, Rafnar T, Hjartarsson H, Mayordomo JI, de la
Chapelle A, Hrafnkelsson J, Thorsteinsdottir U, Kong A, Stefansson K. (2009). Common
variants on 9q22.33 and 14q13.3 predispose to thyroid cancer in European populations.
Nat Genet 41, 460-464.
Hahn, Y., Bera, T.K., Pastan, I.H., and Lee, B. (2006). Duplication and extensive
remodeling shaped POTE family genes encoding proteins containing ankyrin repeat and
coiled coil domains. Gene 366, 238-245.
Hartley, A., Birch, J., Blair, V., Teare, M., Marsden, H., and Harris, M. (1991). Cancer
incidence in the families of children with Ewing's tumor. Journal of the National Cancer
Institute 83, 955-956.
Hartley, A., Birch, J., McKinney, P., Teare, M., Blair, V., Carrette, J., Mann, J., Draper,
G., Stiller, C., Johnston, H., et al. (1988). The Inter-Regional Epidemiological Study of
Childhood Cancer (IRESCC): case control study of children with bone and soft tissue
sarcomas. Br J Cancer 58, 838-842.
Hastings, P.J., Lupski, J.R., Rosenberg, S.M., and Ira, G. (2009). Mechanisms of change
in gene copy number. Nature Reviews Genetics 10, 551-564.
zur Hausen, H. (1991). Viruses in human cancers. Science 254, 1167-1173.
Hendershot, E. (2005). Treatment Approaches for Metastatic Ewing's Sarcoma: A
Review of the Literature. J Pediatr Oncol Nurs 22, 339-352.
Holly, E., Aston, D., Ahn, D., and Kristiansen, J. (1992). Ewing's Bone Sarcoma,
Paternal Occupational Exposure, and Other Factors. Am J Epidemiol 135, 122-129.
Hum, L., Kreiger, N., and Finkelstein, M. (1998). The relationship between parental
occupation and bone cancer risk in offspring. Int J Epidemiol 27, 766-771.
Hutter, R., Francis, K., and Foote, F. (1964). Ewing's sarcoma in female siblings. Am J
Surg 107, 598-603.
Iafrate, A.J., Feuk, L., Rivera, M.N., Listewnik, M.L., Donahoe, P.K., Qi, Y., Scherer, S.W.,
and Lee, C. (2004). Detection of large-scale variation in the human genome. Nat Genet
36, 949-951.
Irizarry, R., Hobbs, B., Collin, F., Beazer-Barclay, Y., Antonellis, K., Scherf, U., and
Speed, T. (2003). Exploration, normalization, and summaries of high density
oligonucleotide array probe level data. Biostatistics 4, 249-264.
Itsara, A., Cooper, G.M., Baker, C., Girirajan, S., Li, J., Absher, D., Krauss, R.M., Myers,
R.M., Ridker, P.M., Chasman, D.I., et al. (2009). Population analysis of large copy
number variants and hotspots of human genetic disease. Am J Hum Genet 84, 148-161.
Ivakhno, S., and Tavare, S. (2010). CNAnova: a new approach for finding recurrent copy
number abnormalities in cancer SNP microarray data. Bioinformatics 26, 1395-1402.
Beckmann, J.S., Sharp, A.J., and Antonarakis, S.E. (2008). CNVs and genetic medicine
(excitement and consequences of a rediscovery). Cytogenet Genome Res 123, 7-16.
Jakobsson, M., Scholz, S.W., Scheet, P., Gibbs, J.R., VanLiere, J.M., Fung, H.-C.,
Szpiech, Z.A., Degnan, J.H., Wang, K., Guerreiro, R., et al. (2008). Genotype, haplotype,
and copy-number variation in worldwide human populations. Nature 451, 998-1003.
Jawad, M.U., Cheung, M.C., Min, E.S., Schneiderbauer, M.M., Koniaris, L.G., and
Scully, S.P. (2009). Ewing Sarcoma Demonstrates Racial Disparities in Incidence-related
and Sex-related Differences in Outcome. An Analysis of 1631 Cases From the SEER
Database, 1973-2005. Cancer 115, 3526-3536.
Jeon, I., Davis, J., Braun, B., Sublett, J., Roussel, M., Denny, C., and Shapiro, D. (1995).
A Variant Ewing's Sarcoma Translocation (7;22) Fuses the EWS Gene to the ETS Gene
ETV1. Oncogene 10, 1229-1234.
Ji, J., and Hemminki, K. (2006). Familial risk for histology-specific bone cancers: An
updated study in Sweden. Eur J Cancer 42, 2343-2349.
Joyce, M., Harmon, D., Mankin, H., Suit, H., Schiller, A., and Truman, J. (1984).
Ewing's sarcoma in female siblings. A clinical report and review of the literature. Cancer
53, 1959-1962.
Kallioniemi, A., Kallioniemi, O.-P., Sudar, D., Rutovitz, D., Gray, J.W., Waldman, F.M.,
and Pinkel, D. (1992). Comparative genomic hybridization for molecular cytogenetic
analysis of solid tumors. Science 258, 818-821.
Kallioniemi, O., Kallioniemi, A., Sudar, D., Rutovitz, D., Gray, J., Waldman, F., and
Pinkel, D. (1993). Comparative genomic hybridization: A rapid new method for detecting
and mapping DNA amplifications in tumors. Seminars in Cancer Biology 4, 41-46.
Kaneko, Y., Yoshida, K., Handa, M., Toyoda, Y., Nishihira, H., Tanaka, Y., Sasaki, Y.,
Ishida, S., Higashino, F., and Fujinaga, K. (1996). Fusion of an ETS-family Gene, EIAF,
to EWS by t(17;22)(q12;q12) Chromosome Translocation in an Undifferentiated
Sarcoma of Infancy. Genes Chromosomes Cancer 15, 115-121.
Kanetsky PA, M.N., Vardhanabhuti S, Li M, Vaughn DJ, Letrero R, Ciosek SL, Doody
DR, Smith LM, Weaver J, Albano A, Chen C, Starr JR, Rader DJ, Godwin AK, Reilly
MP, Hakonarson H, Schwartz SM, Nathanson KL. (2009). Common variation in KITLG
and at 5q31.3 predisposes to testicular germ cell cancer. Nat Genet 41, 811-815.
Karimpour-Fard, A., Dumas, L., Phang, T., Sikela, J., and Hunter, L. (2010). A survey of
analysis software for array-comparative genomic hybridisation studies to detect copy
number variation. Human Genomics 4, 421-427.
Knuutila, S., Bjorkqvist, A., Autio, K., Tarkkanen, M., Wolf, M., Monni, O., Szymanska,
J., Larramendy, M., Tapper, J., Pere, H., et al. (1998). DNA copy number amplifications
in human neoplasms: review of comparative genomic hybridization studies. The
American Journal of Pathology 152, 1107-1123.
Komuro, H., Hayashi, Y., Kawamura, M., Hayashi, K., Kaneko, Y., Kamoshita, S.,
Hanada, R., Yamamoto, K., Hongo, T., Yamada, M., et al. (1993). Mutations
of the p53 Gene Are Involved in Ewing's Sarcomas but not in Neuroblastomas. Cancer
Res 53, 5284-5288.
Korbel, J., Urban, A., Grubert, F., Du, J., Royce, T., Starr, P., Zhong, G., Emanuel, B.,
Weissman, S., Snyder, M., et al. (2007). Systematic prediction and validation of
breakpoints associated with copy-number variants in the human genome. Proc Natl Acad
Sci U S A 104, 10110-10115.
Korn, J., Kuruvilla, F., McCarroll, S., Wysoker, A., Nemesh, J., Cawley, S., Hubbell, E.,
Veitch, J., Collins, P., Darvishi, K., et al. (2008). Integrated genotype calling and
association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat
Genet 40, 1253-1260.
Kovar, H., Aryee, D., and Jug, G. (1996). EWS/FLI-1 antagonists induce growth
inhibition of Ewing tumor cells in vitro. Cell Growth Differentiation 7, 429-437.
Kovar, H., Auinger, A., Jug, G., Aryee, D., Zoubek, A., Salzer-Kuntschik, M., and Gadner, H. (1993).
Narrow spectrum of infrequent p53 mutations and absence of MDM2 amplification in
Ewing tumours. Oncogene 8, 2683-2690.
Kusenda, M., and Sebat, J. (2008). The role of rare structural variants in the genetics of
autism spectrum disorders. Cytogenet Genome Res 123, 36-43.
Lachman, H.M. (2008). Copy variations in schizophrenia and bipolar disorder. Cytogenet
Genome Res 123, 27-35.
Lacroix, M. (2006). Significance, detection and markers of disseminated breast cancer
cells. Endocrine-Related Cancer 13, 1033-1067.
Leavey, P., Mascarenhas, L., Marina, N., Chen, Z., Krailo, M., Miser, J., Brown, K.,
Tarbell, N., Bernstein, M., Granowetter, L., et al. (2008). Prognostic factors for patients
with Ewing sarcoma (EWS) at first recurrence following multi-modality therapy: A
report from the Children's Oncology Group. Pediatric Blood Cancer 51, 334-338.
Lee, C., Iafrate, A.J., and Brothman, A.R. (2007). Copy number variations and clinical
cytogenetic diagnosis of constitutional disorders. Nat Genet 39, S48-S54.
Li, F., Tu, J., and Liu, F. (1980). Rarity of Ewing's Sarcoma in China. Lancet 1, 1255.
Li, X., and Hemminki, K. (2002). Parental Cancer as a risk factor for bone cancer: A
nation-wide study from Sweden. J Clin Epidemiol 55, 111-114.
Lopez, M., Eberle, F., Mattei, M.G., Gabert, J., Birg, F., Bardin, F., Maroc, C., and
Dubreuil, P. (1995). Complementary-DNA characterization and chromosomal
localization of a human gene-related to the poliovirus receptor-encoding gene. Gene 155,
261-265.
Mailman, M.D., Feolo, M., Jin, Y., Kimura, M., Tryka, K., Bagoutdinov, R., Hao, L.,
Kiang, A., Paschall, J., Phan, L., et al. (2007). The NCBI dbGaP database of genotypes
and phenotypes. Nat Genet 39, 1181-1186.
Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka,
J., Braverman, M.S., Chen, Y.-J., Chen, Z., et al. (2005). Genome Sequencing in open
microfabricated high density picoliter reactors. Nature 437, 376-380.
McCarroll, S., Hadnott, T., Perry, G., Sabeti, P., Zody, M., Barrett, J., Dallaire, S.,
Gabriel, S., Lee, C., Daly, M., et al. (2006). Common deletion polymorphisms in the
human genome. Nat Genet 38, 86-92.
McElroy, J.P., Nelson, M.R., Caillier, S.J., and Oksenberg, J.R. (2009). Copy number
variation in African Americans. BMC Genetics 10, 15-22.
McIntyre, L., Martin, E., Simonsen, S., and Kaplan, N. (2000). Circumventing multiple
testing: A multilocus Monte Carlo approach to testing for association. Genet Epidemiol
19, 18-29.
McKeen, E., Hanson, M., Mulvihill, J., and Glaubiger, D. (1983). Birth defects with
Ewing's sarcoma. N Engl J Med 309, 1522.
Mei, T.S., Salim, A., Calza, S., Seng, K.C., Seng, C.K., and Pawitan, Y. (2010).
Identification of recurrent regions of copy number variants across multiple individuals.
BMC Bioinformatics 11.
Moore, L., Gold, L., Stewart, P., Gridley, G., Prince, J., and Zahm, S. (2005). Parental
occupational exposures and Ewing's sarcoma. Int J Cancer 114, 472-478.
Mugneret, F., Lizard, S., Autio, K., and Turc-Carel, C. (1988). Chromosomes in Ewing's
sarcoma. II. Nonrandom additional changes, trisomy 8 and der(16)t(1;16). Cancer Genet
Cytogenet 32, 239-245.
Nakagawa, T., Swanson, M., Wold, B., and Dreyfuss, G. (1986). Molecular cloning of
cDNA for the nuclear ribonucleoprotein particle C proteins: a conserved gene family.
Proc Natl Acad Sci U S A 83, 2007-2011.
Nakajima, T., Kaur, G., Mehra, N., and Kimura, A. (2008). HIV-1/AIDS susceptibility
and copy number variation in CCL3L1, a gene encoding a natural ligand for HIV-1 co-
receptor CCR5. Cytogenet Genome Res 123, 156-160.
Ouchida, M., Ohno, T., Fujimura, Y., Rao, V., and Reddy, E. (1995). Loss of
tumorigenicity of Ewing's sarcoma cells expressing antisense RNA to EWS-fusion
transcripts. Oncogene 11, 1049-1054.
Ozaki, T., Paulussen, M., Poremba, C., Brinkschmidt, C., Rerin, J., Ahrens, S.,
Hoffmann, C., Hillmann, A., Wai, D., Schaefer, K.-L., et al. (2001). Genetic Imbalances
Revealed by Comparative Genomic Hybridization in Ewing Tumors. Genes
Chromosomes Cancer 32, 164-171.
Parkin, D., Kramarova, E., and Draper, G. (1998). International Incidence of Childhood
Cancer, vol. II. Lyon: International Agency for Research on Cancer IARC Scientific
Publication No. 144.
Peter, M., Couturier, J., Pacquement, H., Michon, J., Thomas, G., Magdelat, H., and
Delattre, O. (1997). A new member of the ETS family fused to EWS in Ewing tumors.
Oncogene 14, 1159-1164.
Petersen GM, A.L., Fuchs CS, Kraft P, Stolzenberg-Solomon RZ, Jacobs KB, Arslan
AA, Bueno-de-Mesquita HB, Gallinger S, Gross M, Helzlsouer K, Holly EA, Jacobs EJ,
Klein AP, LaCroix A, Li D, Mandelson MT, Olson SH, Risch HA, Zheng W, Albanes D,
Bamlet WR, Berg CD, Boutron-Ruault MC, Buring JE, Bracci PM, Canzian F, Clipp S,
Cotterchio M, de Andrade M, Duell EJ, Gaziano JM, Giovannucci EL, Goggins M,
Hallmans G, Hankinson SE, Hassan M, Howard B, Hunter DJ, Hutchinson A, Jenab M,
Kaaks R, Kooperberg C, Krogh V, Kurtz RC, Lynch SM, McWilliams RR, Mendelsohn
JB, Michaud DS, Parikh H, Patel AV, Peeters PH, Rajkovic A, Riboli E, Rodriguez L,
Seminara D, Shu XO, Thomas G, Tjønneland A, Tobias GS, Trichopoulos D, Van Den
Eeden SK, Virtamo J, Wactawski-Wende J, Wang Z, Wolpin BM, Yu H, Yu K,
Zeleniuch-Jacquotte A, Fraumeni JF Jr, Hoover RN, Hartge P, Chanock SJ. (2010). A
genome-wide association study identifies pancreatic cancer susceptibility loci on
chromosomes 13q22.1, 1q32.1 and 5p15.33. Nat Genet 42, 224-228.
Pinkel, D., Segraves, R., Sudar, D., Clark, S., Poole, I., Kowbel, D., Collins, C., Kuo, W.-
L., Chen, C., Zhai, Y., et al. (1998). High resolution analysis of DNA copy number
variation using comparative genomic hybridization to microarrays. Nat Genet 20, 207-
211.
Pique-Regi, R., Cáceres, A., and González, J.R. (2010). R-Gada: a fast and flexible
pipeline for copy number analysis in association studies. BMC Bioinformatics 11, 380-
392.
Pique-Regi, R., Monso-Varona, J., Ortega, A., Seeger, R., Triche, T., and Asgharzadeh,
S. (2008). Sparse representation and Bayesian detection of genome copy number
alterations from microarray data. Bioinformatics 24, 309-318.
Pique-Regi, R., Ortega, A., and Asgharzadeh, S. (2009). Joint estimation of copy number
variation and reference intensities on multiple DNA arrays using GADA. Bioinformatics
25, 1223-1230.
Postel-Vinay, S., Veron, A.S., Tirode, F., Pierron, G., Reynaud, S., Kovar, H., Oberlin,
O., Lapouble, E., Ballet, S., Lucchesi, C., et al. (2012). Common variants near TARDBP
and EGR2 are associated with susceptibility to Ewing sarcoma. Nat Genet 44, 323-327.
Ptacek, T., Li, X., Kelley, J.M., and Edberg, J.C. (2008). Copy number variants in genetic
susceptibility and severity of systemic lupus erythematosus. Cytogenet Genome Res 123,
142-147.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M., Bender, D., Maller, J.,
Sklar, P., deBakker, P., Daly, M., et al. (2007). PLINK: a toolset for whole-genome
association and population-based linkage analysis. American Journal of Human Genetics
81.
R Development Core Team (2011). R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing. http://www.R-project.org (Vienna,
Austria).
Redon, R., Ishikawa, S., Fitch, K.R., Feuk, L., Perry, G.H., Andrews, T.D., Fiegler, H.,
Shapero, M.H., Carson, A.R., Chen, W., et al. (2006). Global variation in copy number in
the human genome. Nature 444, 444-454.
Ries, L., Smith, M., Gurney, J., Linet, M., Tamra, T., Young, J., and Bunin, G. (1999).
Cancer Incidence and Survival among Children and Adolescents: United States SEER
Program 1975-1995, National Cancer Institute, SEER Program. NIH Pub No 99-4649
Bethesda, MD.
Sanchez-Garcia, F., Akavia, U.D., Mozes, E., and Pe'er, D. (2010). JISTIC: Identification of
Significant Targets in Cancer. BMC Bioinformatics 11, 189-198.
Sanchez-Prieto, R., de Alava, E., Palomino, T., Guinea, J., Fernandez, V., Cebrian, S.,
Lleonart, M., Cabello, P., Martin, P., San Roman, C., et al. (1999). An association
between viral genes and human oncogenic alterations: The adenovirus E1A induces the
Ewing tumor fusion transcript EWS-FLI1. Nat Med 5, 1076-1079.
Scherczinger, C.A., Bourke, M.T., Ladd, C., and Lee, H.C. (1997). DNA extraction from
liquid blood using QIAamp. J Forensic Sci 42, 893-896.
Shaikh, T., Gai, X., Perin, J., Glessner, J., Xie, H., Murphy, K., O'Hara, R., Casalunovo,
T., Conlin, L., D'Arcy, M., et al. (2009). High-resolution mapping and analysis of copy
number variations in the human genome: A data resource for clinical and research
applications. Genome Res 19, 1682-1690.
Shlien, A., and Malkin, D. (2009). Copy number variations and cancer. Genome
Medicine 1, 62.
Smith, R., Owen, L., and Trem, D. (2006). Expression profiling of EWS/FLI identifies
NKX2.2 as a critical target gene in Ewing's sarcoma. Cancer Cell 9, 405-416.
Snijders, A., Nowak, N., and Segraves, R. (2001). Assembly of microarrays for genome-
wide measurement of DNA copy number. Nat Genet 29, 263-264.
Song H, R.S., Tyrer J, Bolton KL, Gentry-Maharaj A, Wozniak E, Anton-Culver H,
Chang-Claude J, Cramer DW, DiCioccio R, Dörk T, Goode EL, Goodman MT,
Schildkraut JM, Sellers T, Baglietto L, Beckmann MW, Beesley J, Blaakaer J, Carney
ME, Chanock S, Chen Z, Cunningham JM, Dicks E, Doherty JA, Dürst M, Ekici AB,
Fenstermacher D, Fridley BL, Giles G, Gore ME, De Vivo I, Hillemanns P, Hogdall C,
Hogdall E, Iversen ES, Jacobs IJ, Jakubowska A, Li D, Lissowska J, Lubiński J, Lurie G,
McGuire V, McLaughlin J, Medrek K, Moorman PG, Moysich K, Narod S, Phelan C,
Pye C, Risch H, Runnebaum IB, Severi G, Southey M, Stram DO, Thiel FC, Terry KL,
Tsai YY, Tworoger SS, Van Den Berg DJ, Vierkant RA, Wang-Gohrke S, Webb PM,
Wilkens LR, Wu AH, Yang H, Brewster W, Ziogas A; Australian Cancer (Ovarian)
Study; Australian Ovarian Cancer Study Group; Ovarian Cancer Association Consortium,
Houlston R, Tomlinson I, Whittemore AS, Rossing MA, Ponder BA, Pearce CL, Ness
RB, Menon U, Kjaer SK, Gronwald J, Garcia-Closas M, Fasching PA, Easton DF,
Chenevix-Trench G, Berchuck A, Pharoah PD, Gayther SA. (2009). A genome-wide
association study identifies a new ovarian cancer susceptibility locus on 9p22.2. Nat
Genet 41, 996-1000.
Sorenson, P., Lessnick, S., Lopez-Terrada, D., Liu, X., Triche, T., and Denny, C. (1994).
A Second Ewing's Sarcoma Translocation, t(21;22), Fuses the EWS Gene to Another
ETS-family Transcription Factor, ERG. Nat Genet 6, 146-151.
Spielman, R.S., McGinnis, R.E., and Ewens, W.J. (1993). Transmission test for linkage
disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM).
American Journal of Human Genetics 52, 506-516.
Stark, M.S., Tyagi, S., Nancarrow, D.J., Boyle, G.M., Cook, A.L., Whiteman, D.C.,
Parsons, P.G., Schmidt, C., Sturm, R.A., and Hayward, N.K. (2010). Characterization of
the Melanoma miRNAome by Deep Sequencing. PLoS One 5, e9685.
Stranger, B.E., Forrest, M.S., Dunning, M., Ingle, C.E., Beazley, C., Thorne, N., Redon,
R., Bird, C.P., Grassi, A.d., Lee, C., et al. (2007). Relative Impact of Nucleotide and
Copy Number Variation on Gene Expression Phenotypes. Science 315, 848-853.
Sudmant, P.H., Kitzman, J.O., Antonacci, F., Alkan, C., Malig, M., Tsalenko, A.,
Sampas, N., Bruhn, L., Shendure, J., 1000 Genomes Project, et al. (2010). Diversity of Human Copy
Number Variation and Multicopy Genes. Science 330, 641-646.
Suzuki, K., Bustos, T., and Spritz, R.A. (1998). Linkage disequilibrium mapping of the
gene for Margarita Island ectodermal dysplasia (ED4) to 11q23. Am J Hum Genet 63,
1102-1107.
Suzuki, K., Hu, D., Bustos, T., Zlotogora, J., Richieri-Costa, A., Helms, J.A., and Spritz,
R.A. (2000). Mutations of PVRL1, encoding a cell-cell adhesion molecule/herpesvirus
receptor, in cleft lip/palate-ectodermal dysplasia. Nat Genet 25, 427-430.
Takahashi, N., Satoh, Y., Kodaira, M., and Katayama, H. (2008). Large-scale copy
number variants (CNVs) detected in different ethnic human populations. Cytogenet
Genome Res 123, 224-233.
Tandon, A., Patterson, N., and Reich, D. (2011). Ancestry Informative Marker Panels for
African Americans Based on Subsets of Commercially Available SNP Arrays. Genet
Epidemiol 35, 80-83.
Tarkkanen, M., Kiuru-Kuhlefelt, S., Blomqvist, C., Armengol, G., Böhling, T., Ekfors, T.,
Virolainen, M., Lindholm, P., Monge, O., Picci, P., et al. (1999). Clinical Correlations of
Genetic Changes by Comparative Genomic Hybridization in Ewing Sarcoma and Related
Tumors. Cancer Genet Cytogenet 114, 35-41.
Tenesa A, Farrington SM, Prendergast JG, Porteous ME, Walker M, Haq N, Barnetson RA,
Theodoratou E, Cetnarskyj R, Cartwright N, Semple C, Clark AJ, Reid FJ, Smith LA,
Kavoussanakis K, Koessler T, Pharoah PD, Buch S, Schafmayer C, Tepel J, Schreiber S,
Völzke H, Schmidt CO, Hampe J, Chang-Claude J, Hoffmeister M, Brenner H,
Wilkening S, Canzian F, Capella G, Moreno V, Deary IJ, Starr JM, Tomlinson IP, Kemp
Z, Howarth K, Carvajal-Carmona L, Webb E, Broderick P, Vijayakrishnan J, Houlston
RS, Rennert G, Ballinger D, Rozek L, Gruber SB, Matsuda K, Kidokoro T, Nakamura Y,
Zanke BW, Greenwood CM, Rangrej J, Kustra R, Montpetit A, Hudson TJ, Gallinger S,
Campbell H, Dunlop MG. (2008). Genome-wide association scan identifies a colorectal
cancer susceptibility locus on 11q23 and replicates risk loci at 8q24 and 18q21. Nat
Genet 40, 631-637.
The 1000 Genomes Project Consortium (2010). A map of human genome variation from
population-scale sequencing. Nature 467, 1061-1073.
The International HapMap Consortium (2003). The International HapMap Project.
Nature 426, 789-796.
Thierry-Mieg, D., and Thierry-Mieg, J. (2006). AceView: a comprehensive cDNA-
supported gene and transcripts annotation. Genome Biology 7, S12.
Thomas, G., Jacobs, K., Yeager, M., Kraft, P., Wacholder, S., Orr, N., Yu, K., Chatterjee,
N., Welch, R., Hutchinson, A., et al. (2008). Multiple loci identified in a genome-wide
association study of prostate cancer. Nat Genet 40, 310-315.
Thomas G, Jacobs KB, Kraft P, Yeager M, Wacholder S, Cox DG, Hankinson SE, Hutchinson
A, Wang Z, Yu K, Chatterjee N, Garcia-Closas M, Gonzalez-Bosquet J, Prokunina-
Olsson L, Orr N, Willett WC, Colditz GA, Ziegler RG, Berg CD, Buys SS, McCarty CA,
Feigelson HS, Calle EE, Thun MJ, Diver R, Prentice R, Jackson R, Kooperberg C,
Chlebowski R, Lissowska J, Peplonska B, Brinton LA, Sigurdson A, Doody M, Bhatti P,
Alexander BH, Buring J, Lee IM, Vatten LJ, Hveem K, Kumle M, Hayes RB, Tucker M,
Gerhard DS, Fraumeni JF Jr, Hoover RN, Chanock SJ, Hunter DJ. (2009). A multistage
genome-wide association study in breast cancer identifies two new risk alleles at 1p11.2
and 14q24.1 (RAD51L1). Nat Genet 41, 579-584.
Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002). Diagnosis of multiple
cancer types by shrunken centroids of gene expression. PNAS 99, 6567-6572.
Tsalenko, A., Sampas, N., Bruhn, L., Shendure, J., and Eichler, E.E. (2010). Diversity of
human copy number variation and multicopy genes. Science 330, 641-646.
Tsokos, M. (1992). Peripheral primitive neuroectodermal tumors. Diagnosis,
classification, and prognosis. Perspect Pediatr Pathol 16, 27-98.
Turc-Carel, C., Aurias, A., Mugneret, F., Lizard, S., Sidaner, I., Volk, C., Thiery, J.,
Olschwang, S., Philip, I., and Berger, M. (1988). Chromosomes in Ewing's sarcoma. I.
An evaluation of 85 cases and remarkable consistency of t(11;22)(q24;q12). Cancer Genet
Cytogenet 32, 229-238.
Valery, P., Holly, E., Sleigh, A., Williams, G., Kreiger, N., and Bain, C. (2005a). Hernias
and Ewing's sarcoma family of tumours: a pooled analysis and meta-analysis. Lancet Oncol
6, 485-490.
Valery, P., McWhirter, W., Sleigh, A., Williams, G., and Bain, C. (2002). Farm
exposures, parental occupation, and risk of Ewing's sarcoma in Australia: a national case-
control study. Cancer Causes Control 13, 263-270.
Valery, P., McWhirter, W., Sleigh, A., Williams, G., and Bain, C. (2003). A National
Case-control Study of Ewing's Sarcoma Family of Tumours in Australia. Int J Cancer
105, 825-830.
Valery, P., Williams, G., Sleigh, A., Holly, E., Kreiger, N., and Bain, C. (2005b).
Parental occupation and Ewing's sarcoma: Pooled and meta-analysis. Int J Cancer 115,
799-806.
Valouev, A., Ichikawa, J., Tonthat, T., Stuart, J., Ranade, S., Peckham, H., Zeng, K.,
Malek, J.A., Costa, G., McKernan, K., et al. (2008). A high-resolution, nucleosome
position map of C. elegans reveals a lack of universal sequence-dictated positioning.
Genome Res 18, 1051-1063.
Walter, V., Nobel, A.B., and Wright, F.A. (2011). DiNAMIC: a method to identify
recurrent DNA copy number aberrations in tumors. Bioinformatics 27, 678-685.
Wang, K., Li, M., Hadley, D., Liu, R., Glessner, J., Grant, S.F.A., Hakonarson, H., and
Bucan, M. (2007). PennCNV: An integrated hidden Markov model designed for
high-resolution copy number variation detection in whole-genome SNP genotyping data.
Genome Res 17, 1665-1674.
Warden, M., Pique-Regi, R., Ortega, A., and Asgharzadeh, S. (2011). Bioinformatics for
copy number variation data, Vol 719.
White, S.J., Vissers, L.E.L.M., Kessel, A.G.v., Menezes, R.X.d., Kalay, E., Lehesjoki,
A.E., Giordano, P.C., Vosse, E.v.d., Breuning, M.H., Brunner, H.G., et al. (2007).
Variation of CNV distribution in five different ethnic populations. Cytogenet Genome
Res 118, 19-30.
Widhe, B., and Widhe, T. (2000). Initial symptoms and clinical features in osteosarcoma
and Ewing sarcoma. J Bone Joint Surg Am 82, 667-674.
Winn, D., Li, F., Robison, L., Mulvihill, J., and Fraumeni, J. (1992). A Case-control
Study of the Etiology of Ewing's Sarcoma. Cancer Epidemiology, Biomarkers &
Prevention 1, 525-532.
Wu X, Ye Y, Kiemeney LA, Sulem P, Rafnar T, Matullo G, Seminara D, Yoshida T,
Saeki N, Andrew AS, Dinney CP, Czerniak B, Zhang ZF, Kiltie AE, Bishop DT, Vineis
P, Porru S, Buntinx F, Kellen E, Zeegers MP, Kumar R, Rudnai P, Gurzau E, Koppova
K, Mayordomo JI, Sanchez M, Saez B, Lindblom A, de Verdier P, Steineck G, Mills GB,
Schned A, Guarrera S, Polidoro S, Chang SC, Lin J, Chang DW, Hale KS, Majewski T,
Grossman HB, Thorlacius S, Thorsteinsdottir U, Aben KK, Witjes JA, Stefansson K,
Amos CI, Karagas MR, Gu J. (2009). Genetic variation in the prostate stem cell antigen
gene PSCA confers susceptibility to urinary bladder cancer. Nat Genet 41, 991-995.
Xie, C., and Tammi, M.T. (2009). CNV-seq, a new method to detect copy number
variation using high-throughput sequencing. BMC Bioinformatics 10, 80-89.
Xue, Y., Sun, D., Daly, A., Yang, F., Zhou, X., Zhao, M., Huang, N., Zerjal, T., Lee, C.,
Carter, N.P., et al. (2008). Adaptive evolution of UGT2B17 copy-number variation. Am J
Hum Genet 83, 337-346.
Yoon, S., Xuan, Z., and Makarov, V. (2009). Sensitive and accurate detection of copy
number variants using read depth of coverage. Genome Res 19, 1586-1592.
Zamora, P., Paredes, M., Baron, M., Diaz, M., Escobar, Y., Ordonez, A., Lopez, F., and
Gonzalez, J. (1986). Ewing's tumor in brothers. An unusual observation. Am J Clin
Oncol 9, 358-360.
Zhang, Q., Ding, L., Larson, D., Koboldt, D., McLellan, M., Chen, K., Shi, X., Kraja, A.,
Mardis, E., Wilson, R., et al. (2010). CMDS: a population-based method for identifying
recurrent DNA copy number aberrations in cancer from high-resolution data.
Bioinformatics 26, 464-469.
Zucman-Rossi, J., Batzer, M.A., Stoneking, M., Delattre, O., and Thomas, G. (1997).
Interethnic polymorphism of EWS intron 6: genome plasticity mediated by Alu
retroposition and recombination. Hum Genet 99, 357-363.
Asset Metadata
Creator
Warden, Melissa Brooke (author)
Core Title
Genomic risk factors associated with Ewing Sarcoma susceptibility
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Molecular Epidemiology
Publication Date
05/04/2012
Defense Date
03/15/2012
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
copy number variation,Ewing sarcoma,genome-wide association study,OAI-PMH Harvest,single nucleotide polymorphism
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Gauderman, William James (committee chair), McKean-Cowdin, Roberta (committee chair), Asgharzadeh, Shahab (committee member), Ingles, Sue Ann (committee member), Triche, Timothy (committee member)
Creator Email
melissawarden@hotmail.com,mwarden@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-28196
Unique identifier
UC11289317
Identifier
usctheses-c3-28196 (legacy record id)
Legacy Identifier
etd-WardenMeli-750.pdf
Dmrecord
28196
Document Type
Dissertation
Rights
Warden, Melissa Brooke
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA