Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
(USC Thesis Other)
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Too many needles in this haystack: Algorithms for the Analysis of Next
Generation Sequence Data.
by
Mourad (Tade) Souaiaia
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulllment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
December 2012
Copyright 2012 Mourad (Tade) Souaiaia
Dedication
This thesis is dedicated to my grandfather Paul Cifrino, who lived a long and full life
and gave me the invaluable support such that I am able to nd myself in the position to
complete my PhD. Dissertation.
ii
Acknowledgements
There are a great many people that I would like to acknowledge and for whom I owe this
accomplishment. Above all else I have learned that graduate school is not something you
get through on your own. First, I would like to thank my graduate advisor, Professor
Ting Chen who has supported me in so many ways in my maturation as a scientist. Ting
has been the ideal advisor; giving myself and the rest of his students the space they
need to grow as individuals as well as always being there whenever necessary to oer
guidance in both science and life. I would like to thank my other committee members,
Professor Fengzhu Sun for his excellent teaching and unwavering patience with me during
our Thanksgiving trips for Dim Sum and Dr. James Knowles for taking a chance on me
and giving a computational scientist the chance to have a part in investigating some truly
spectacular biological questions.
I would like to thank the person who has put up with the most during my time at
USC, my wonderful ancee Sade. Her support has been invaluable, and her willingness
to accept the many late nights that often turned into all nighters as well as her constant
encouragement has allowed me to get past moments of diculty and truly excel to the
best of my potential. I hope I am able to reward her patience as we take the next step in
our lives together. I would like to give thanks to my rst co-author and former ocemate,
YangHo Chen. Despite the dierences in our background and personalities YangHo and I
made an unorthodox but excellent research team and he taught me what true dedication
to one's craft is. I also cannot forget to thank thank my closest friends in the program,
my co-author Zach Frazier and my partner in crime Reza Ardekani. Zach has taught me
so much about science and computer programming, but his most important contribution
iii
came during many late nights of studying together for our screening exams in the winter
of 2008. Without Zach I can say surely that I would not have passed my exams, and
for that I owe him so much gratitude. I don't believe either of us had ever studied as
hard as we did for those exams, and I even enjoy the idea that without me as a study
partner Zach would have struggled with his exams as well, but considering his wealth of
knowledge and expertise in algorithms, it's likely not true. I do run faster though.
While we don't always see each other as much as we did I would like to thank my
whole rst year class; it seems so recent that Zach, Tittu, Mehmet, Fang Fang, Dahze,
Sung-Chun and I started out in our rst courses together. After all the exams and hard
work it might seem surprising, but I will genuinely miss my coursework. I would like to
thank all my professors including but not limited to Michael Waterman, Shariar Samsian,
Peter Baxendale, Achiro Nakano, Larry Goldstein, and Jasmine Zhou. I also owe this one
to all my former classmates and the entire group of people that call RRI home! Finally,
I'd like to thank my mother and father and my siblings, even though they are 1000 miles
away they have always been there for me, and if I am correct they will soon owe me a
visit to celebrate this accomplishment!
iv
Table of Contents
Dedication ii
Acknowledgements iii
List of Tables vii
List of Figures viii
Abstract ix
Chapter 1: Introduction to Sequence Data Analysis 1
1.1 Sequencing Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Sequencing Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Analysis of Alignment Data . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Chapter 2: ComB: An iterative SNP Caller 16
2.1 Survey on SNP-Calling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Design of ComB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Analysis of ComB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Chapter 3: Fade: NR-optimized Methylation Estimation 32
3.1 Survey on DNA Methylation . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Design of FadE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Analysis of FadE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Chapter 4: Biological Applications: RNA-DNA Dierences in a single individual 55
4.1 Survey on RNA Editing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 Methods to locate and classify RNA editing . . . . . . . . . . . . . . . . . 57
4.3 RNA Editing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Chapter 5: Biological Applications: RNA-DNA Dierences in a single individual 70
5.1 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Bibliography 73
v
Appendix 79
A.1 Survey of Alignment Strategies . . . . . . . . . . . . . . . . . . . . . . . . 79
A.2 Design of PerM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
A.3 Comparison of PerM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
vi
List of Tables
1.1 Interpretation of Alignment Substitutions . . . . . . . . . . . . . . . . . . 9
2.1 ComB - SNP Calling between E. coli strains . . . . . . . . . . . . . . . . . 28
2.2 ComB - Results of SNP Calling Simulation . . . . . . . . . . . . . . . . . 31
2.3 ComB - Whole genome human SNP Calling . . . . . . . . . . . . . . . . . 31
3.1 FadE - Dierent Interpretations in Color Space . . . . . . . . . . . . . . . 36
3.2 Fade - Comparison to Color Translation Methods . . . . . . . . . . . . . . 46
3.3 Fade - Comparison in nucleotide space . . . . . . . . . . . . . . . . . . . 47
3.4 FadE - The methylation rate near CpG Promoters . . . . . . . . . . . . . 50
3.5 FadE - Color Space Comparison to Lister et al . . . . . . . . . . . . . . . 53
3.6 FadE - Sequence Space Comparison to Lister et al . . . . . . . . . . . . . 54
4.1 Results of Iterative Genome Alignment (RNA Editing) . . . . . . . . . . . 65
4.2 Classication of RNA Editing Candidates . . . . . . . . . . . . . . . . . . 66
4.3 Identied Exonic Editing Events . . . . . . . . . . . . . . . . . . . . . . . 68
A.1 Comparison of Alignment Strategies . . . . . . . . . . . . . . . . . . . . . 84
A.2 PerM - An Illustration of local tolerance . . . . . . . . . . . . . . . . . . . 86
A.3 PerM - Optimal periodic seed patterns . . . . . . . . . . . . . . . . . . . . 90
A.4 PerM - Alignment Speed Comparisons w/ Illumina Reads . . . . . . . . . 91
A.5 PerM - Alignment Speed Comparisons in Color Space . . . . . . . . . . . 93
vii
List of Figures
1.1 Schematic: Illumina's Pyrosequencing Platform . . . . . . . . . . . . . . 4
1.2 SOLiD's color signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Schematic: SOLiD Oligonucleotides . . . . . . . . . . . . . . . . . . . . . . 6
1.4 The pile-up method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1 ComB - Indeterminate Color Space Alignment . . . . . . . . . . . . . . . 24
2.2 ComB - SNPs Calling between E. Coli strains for varying coverage levels . 29
3.1 FadE - Emission rates in color space . . . . . . . . . . . . . . . . . . . . . 34
3.2 FadE - Hidden Data Model for Methylation . . . . . . . . . . . . . . . . . 41
3.3 FadE - Probability Density Function Describing Methylation . . . . . . . 42
3.4 FadE - Model adjustments for adjacent cytosines . . . . . . . . . . . . . . 42
3.5 Fade - Simulated Performance in Color Space . . . . . . . . . . . . . . . . 45
3.6 FadE- Methylation Distribution for CpG Islands . . . . . . . . . . . . . . 49
3.7 FadE - Average methylation rates near CpG island promoters . . . . . . . 51
3.8 FadE - Methylation rates adjacent to TSS for gene BIVM . . . . . . . . . 52
4.1 RNA Editing Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Positional Bias in RNA Editing . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3 Pie 4,102Chart for dierently annotated RNA editing sites . . . . . . . . . 67
4.4 RNA Editing Motif . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A.1 PerM - Illustration of traditional seed method . . . . . . . . . . . . . . . . 83
A.2 PerM - Seed Example (single end) . . . . . . . . . . . . . . . . . . . . . . 85
A.3 PerM - combination of hashing and binary search [11] . . . . . . . . . . . 88
A.4 PerM - Limiting weight to length ratios [11] . . . . . . . . . . . . . . . . . 89
viii
Abstract
The development of second-generation sequencing (SGS) technology has provided sci-
entists with a myriad of opportunities as well as new challenges. SGS machines are
capable of sequencing billions of short reads at a fraction of the cost and time in com-
parison to older technology. Often, the study of sequence data begins with the align-
ment of billions of short dna reads to the 3 billion base pair human reference genome,
a daunting computational task, especially if the error-rate between the reads and ref-
erence is high. For this reason, PerM was developed to use periodic spaced seeds to
eciently and accurately provide highly sensitive ungapped alignment for Illumina and
SOLiD reads. Inexact alignments are often the most interesting biologically, because
mismatches between the read and reference are often the result of genetic variation. To
accurately detect and discern variation from machine errors, we developed ComB, which
iteratively applies Bayesian statistics to color or base alignment to accurately determine
mutation probability. This allowed us to study a host of biological phenomena which
result in rare nucleotide dierences, including single nucleotide polymorphisms (SNPs),
RNA-editing, and allele-specic expression. DNA-methylation of cytosine residues also
produces single-base mismatches when dna is treated with sodium-bisulte which changes
all unmethylated cytosine residues to thymine. To accurately estimate methylation rates
from sodium bisulte treated dna we developed FadE, an algorithm which uses Newton-
Raphson optimization to estimate the methylation rate at every cytosine residue in the
genome. Finally, we have applied all our statistical tools to study human mRNA editing,
and have shown that RNA editing in human brain tissue occurs at a much lower rate
than previously thought.
ix
Chapter 1
Introduction to Sequence Data Analysis
The most well known analysis of sequencing data may well be the human genome
project of 1990, the daunting challenge to determine and analyze the sequence of the
3-billion base pairs and over 20,000 genes in a single human body. The 4-billion dollar
international collaborative eort culminated in the publication of a single draft genome
in 2001 [22]. Despite the fanfare associated with it's publication, the draft genome was
sequenced primarily from a handful of Caucasian male donors, providing little impact or
aid to the understanding of human biology or disease. Instead the genome provides an
extremely valuable template sequence for the alignment of short sequence reads, which
allows scientists to investigate the genetics of an individual at a fraction of the time and
costs associated with the original project. This thesis will describe the algorithms that
we have developed to eciently and accurately make use of the wealth of sequencing data
produced by new technologies. In this chapter we will rst describe advances in technology
that enable billions of short DNA fragments to be produced quickly. In Section 1.2 we
will describe the challenge of aligning the short DNA fragments to a template genome
and introduce our alignment algorithm, PerM, while in Section 1.3 we brie
y introduce
some of the biological applications which follow sequence alignment. Chapters 2 and 3
1
will describe the algorithms that we have developed for SNP-Calling and Methylation
Estimation, two downstream biological applications. Chapters 4 and 5 will describe
the results and ongoing analysis in our studies involving the sequence dierences between
human mRNA and DNA. Additionally, in the appendix we include an in-depth description
of PerM and other popular alignment algorithms.
1.1 Sequencing Technologies
Second generation sequencing (SGS) technology has changed the eld of genetics by
oering extremely high throughput at a reasonably low cost. The rst generation Sanger
sequencing technology responsible for the Human Genome Project used bacterial vectors
to generate sequence fragments of 700-900 base pairs with error rates between 0.1%
and 1% [27] at a cost of approximately 10$ per kilobase. Today, sequencing machines
routinely generate data on the terabase scale in a matter of days at costs closer to 3 cents
per kilobase [62]. This dramatic increase in throughput is not without cost however, the
technology used in todays sequencing machines produces shorter and less accurate reads.
Understanding the dierent technologies used today is vital to developing algorithms and
statistical tools which accurately align and analyze the vast amount of data they produce.
In this thesis we concentrate primarily on two particular sequencing technologies: The
Illumina HiSeq and the Lifetech SOLiD platform.
Illumina Sequencing Technology
Since the merger and acquisition of the Solexa company in 2006, Illumina has emerged as
one of the leaders in second generation sequencing. As of 2012, the Illumina HiSeq 2000 is
2
capable of sequencing more than one billion 50 to 100-bp reads in less than a week's time
at a cost less than $10,000 USD [15]. The HiSeq 2000 operates using two
ow cells, each
with eight sequencing lanes capable of sequencing close to 100 million reads. The process
begins with random fragmentation of the sample and then selection according to size (read
length). Fragments are then attached to adapter sequences which are immobilized on an
optically transparent glass slide. Each immobilized DNA fragment is duplicated into
clusters of thousands of identical copies using PCR-like bridge amplication [13]. Each
identical fragment in the cluster is synthesized base-by-base using reagents which provide
dierent optical signals depending on which base is added. The process, illustrated in
1.1 is known as sequencing-by-synthesis because the signal is obtained as the read is
synthesized. Ideally, each cycle involves the synthesis of a single base and the emission
of a single optical signal for every one of the thousand identical fragments. In reality
some fragments will incorrectly incorporate or not incorporate a base during some of the
cycles. The accumulation of these errors causes the accuracy of the sequencing to decay
as the read is sequenced, which imposes a soft limit to the length of reads which can be
accurately sequenced. Currently, the HiSeq 2000 produces reads which maintain accuracy
for lengths of up to 124bp. The read length can eectively be increased by sequencing
the reverse complement from the other end of the fragment. If the fragment length is
less than twice the read length the process will produce two overlapping reads which
can eectively be interpreted as one large read. When the fragment length is greater
than twice the read length, the two ends will be separated by an unsequenced portion.
For the Illumina sequencer, the separation between what are known as paired-end reads
varies between zero (overlapping) and ve-thousand basepairs. This is in contrast to the
3
SOLiD sequencer which uses slightly larger fragments to produce paired-end reads with
separation distances between six hundred and ten thousand basepairs.
T
G
A C
T G
A
C
T
G A
C
T
G
A
C T
G A
C
Cycle 1
Cycle 4 Cycle 5
Light Signal
Cycle 1 Cycle 5
Cycle 3
Cycle 2
A cluster of
amplified identical
DNA fragments
Figure 1.1: Let red represent nucleotide A, blue represent G, yellow represent C, and
green represent T. The read corresponding to the circled cluster is sequenced as TGCAA.
The sequencing takes place with one base per cycle; each copy in a cluster synthesizes a
nucleotide (A, C, G or T) emitting a light signal to indicate which base is synthesized.
During 50 to 100 cycles, billions of 50-bp to 150-bp same-length reads are sequenced.
The Lifetech SOLiD platform
As of 2012, the second most popular sequencing platform is Lifetech's SOLiD sequencing
platform, which is capable of sequencing single and paired-end reads between of up to
75bp at costs comparable to the Illumina technology [4]. The initial steps behind the
SOLiD sequencing platform are quite similar to those used by Illumina's HiSeq machine;
DNA is fragmented, selected, attached to probes and axed to glass slides and copied
using PCR. At this point the SOLiD system diverges considerably from the technology
employed by Illumina. Rather than using synthesis to sequence the template strand, one
of sixteen eight base pair oligonucleotides are ligated to the template sequence. The rst
and second of the eight oligonucleotides match the rst and second bases on the template
strand and last of the oligonucleotides is attached to one a
uorescent dye, as shown in
4
Di-nucleotide Colors Signals Encoded from the Base Encoding
00
A C G T
The 2
nd
Base
A
C
G
T
00
00
00
The 1st Base
01
10
10
10
10
01
01
01 11
11
11
11
0 1 0 0 1
0 1 0 0 1
11 01 00 10
0 1 1 1 0
0 1 1 1 0
1
st
Bit for each base
2
nd
Bit for each base
Shift →
Shift →
T A A
0 1 1 1 0
0 1 0 0 1
G G
Encode Colors
A A A A C A A A A A A
A A A A C G G G G G G
A A A A G A A A A A A
A A A A T A A A A A A
A A A A G G G G G G G
The first base
The second base
Figure 1.2: Four kinds of color signals for di-nucleotide transitions in SOLiD
Figure 1.3. The specic
uorescence is detected and the tail of the oligonucleotide is
cleaved and the process is repeated for each of the dinucleotides on the template strand.
Each of the
uorescent dyes detected corresponds to a dinucleotide such that with the
exception of the rst nucleotide on the template sequence every nucleotide on the template
strand is sequenced twice. As shown in Figure 1.2, only four colors of dye are used despite
the fact that there are sixteen unique dinucleotides. The dyes are chosen such that for
rst base the transition to each of the other four bases corresponds to a dierent dye, for
example (see Figure 1.2), the dincucleotides "AA","AC","AG",and "AT" are all encoded
with a dierent color. This means that if the rst base in a sequence is known the bases
at all the following positions can be inferred from the rest of the color sequence. This
two-color encoding (referred to as "color-space") has unique consequences with regard to
sequences errors. The main advantage of this system is the fact that a change in a single
nucleotide results in the change to consecutive color signals. Thus, unlike traditional base
sequencing where a single sequencing error can improperly be interpreted as a SNP, the
encoding for SNPs in color-space occupy only one third of the possible consecutive color
5
Figure 1.3: The SOLiD Sequencing Method
Step Sequence & Oligo Pairing Color Read
1 CCTTATAGAGGTA
GGXXXXXXB B
2 CCTTATAGAGGTA
-GAXXXXXXY BY
3 CCTTATAGAGGTA
--AAXXXXXXB BYB
4 CCTTAATAGAGGTA
---ATXXXXXXR BYBR
In each step an 8-bp oligonucleotide whose rst and second base are complimentary to the sample sequence
is attached. Each of sixteen oligonucleotides has one of four
orescent dyes; thus the transitions between
sequence bases are queried rather than the sequence. Notice the redundancy: Both the "GG" and "AA"
transitions correspond to the blue dye.
errors. The rare occurrence of consecutive errors which match SNP signals has provided
base calling accuracy to be quoted as high as 99.94%. Unfortunately, the chemistry used
in color-space technology also results in a single color error rate which is much higher than
the single base error rate for the Illumina technology [66]. This imposes new challenges for
sequence alignment especially when many variants are present in a read, and also imposes
challenges in the detection of variant-calls which aect multiple bases. These challenges
and solutions to provide accurate interpretation of both color-space and nucleotide-space
reads are further discussed in the following three chapters of this thesis.
1.2 Sequencing Alignment
The extremely high-throughput of short read data provided by the sequencing technolo-
gies described above provide biologists the means to to undertake a myriad of analyses,
including genome resequencing [10, 9], SNP calling [70], gene expression proling [76, 64],
and analysis of the epigenome [5, 32]. The common requirement of these and many other
6
biological applications is the necessary step to align short read data to a reference genome
or transcriptome. While accurate statistical tools are needed for many biological applica-
tions, it is imperative to rst arrive at an unbiased and accurate alignment, as any errors
made in read alignment will cloud the results of downstream analysis. In this thesis we
are primarily concerned with ungapped short read alignment to a large reference reference
sequence. Stated simply, this task involves mapping short reads to locations on a large
reference string which have high sequence similarity. In this section we will introduce the
computational constraints that must be considered when designing an alignment algo-
rithm and brie
y describe our periodic seed alignment algorithm, PerM. In the appendix
we provide a survey of popular alignment algorithms and describe our algorithm in detail.
To develop an algorithm for ungapped short read alignment one should rst formalize
the task and make clear the necessary assumptions regarding the input data.
Denition 1.2.1. Let be the alphabetf A, C, G, Tg. An alignment requires the
following:
A known reference stringG, which is a long string over with lengthjGj.
A unknown but similar target string T whose positions share a one-to-one corre-
spondence with those ofG and dier according to the mutation rate .
A read setR =fr
1
;r
2
;:::;r
n
g, where each readr
i
is an equal-length (jrj) substring
selected uniformly fromT and copied according to an error parameter .
One should notice that the same assumptions can be made for gapped sequence align-
ment, but that the alphabet is expanded tof A, C, G, T, *g, where represents a
genomic deletion. In either case the task required of a short read alignment algorithm is
7
to nd, for each r
i
in R, the location inG which most likely corresponds to the location
in T from which r
i
was derived. We notice that in the trivial case where G and T are
identical and sequencing is perfect (ie. = = 0), and the largest repeated substring in
G is less than the read length, every read will align mismatch-free to a single location in
G which constitutes it's correct corresponding location. For this reason most read align-
ment algorithms make the assumption that provided and are suciently small, the
alignment to G with the lowest number of substitutions is likely the correct alignment.
In reality, the extremely repetitive and semi-repetitive nature of many genomes results
in many reference substrings which dier by only a few base pairs, meaning some reads
will nd lower-substitution alignments to incorrect locations by chance. For example,
on chromosome 10 of the human genome their exist over 1-million 50mers which dier
from another genomic substring but two or fewer bases. This feature causes many short
reads to nd multiple-low substitution alignments to the reference string. In some cases
downstream analysis like SNP-Calling be be used to resolve alignment ambiguity. Ad-
ditionally, and are usually not uniformly distributed and sometimes sequence error
or large mutations result in reads which do not nd any low substitutions alignments.
Dierent results from alignment and their interpretations are summarized in Table 1.1.
As shown in Table 1.1 alignments to multiple locations in G with similar numbers of
substitutions are dicult to resolve without downstream analysis. Therefore, one task
required of an alignment algorithm is the ability to nd and report all alignments within
a reasonable number of substitutions. For example, if a read aligns to one location with
two substitutions and another with three substitutions accuracy will be compromised if
only the two substitution alignment is reported. However, it a 100-bp read nds a single
8
Table 1.1: Interpretation of Alignment Substitutions
Fewest
Substitutions
2
nd
Fewest 3
rd
Fewest
S. Cnt
y
S. Cnt
y
S. Cnt
y
Interpretation
2 1 > 50 > 100 - - Top alignment is likely correct
4 1 5 3 > 50 > 100
Correct alignment may be resolved with downstream analysis
0 311 1 232 2 101
Correct alignment location is unlikely to be
determined
Shown are the likely interpretations which result from the number of alignments (Cnt) with the fewest,
second fewest and third fewest substitutions (S). Notice that in the rst example, a single
two-substitution is likely correct because the second best alignment has over fty substitutions which is
likely noise. The second alignment however has a single four substitution alignment and three
ve-substitution alignments, all of which could be correct. In this case SNP-Calling may result in one of
the ve-substitution alignments actually aligning exactly to the reference; for this reason it would be
very misleading to only report the single four-substitution alignment. Finally, the third example is likely
a result of a repetitive sequence, while the correct alignment is likely to be determined, it it would be
especially misleading if only one of the 311 substitution free alignments were reported.
one substitution alignment and an alignment with fty substitutions as it's next best
alignment, it is likely unnecessary to report or consider the fty substitution alignment.
An alignment algorithm which can provide accuracy in such cases is one designed which
has high levels of mismatch tolerance and mismatch sensitivity.
Denition 1.2.2. Let r be a read aligned to G. The percentage of possible alignments
with k or fewer mismatches to G which an algorithm is able to report is the level of
sensitivity to k mismatches the algorithm provides.
Denition 1.2.3. Let r be a read aligned to G. An algorithm which is able to report
every alignment with k or fewer mismatches to G, the algorithm provides tolerance ()
to k substitutions.
9
It's important to notice that tolerance and sensitivity both represent upper-bounds
on the number of alignments an algorithm is capable of reporting. Unfortunately, some
algorithms run eciently only when reporting a small fraction of alignments within a
tolerance level. For this reason it's important to dene standards of alignment reporting
in addition to tolerance.
Denition 1.2.4. Assume a parameter
which represents the number of substitutions
for which an alignment algorithm should consider. LetS
represent the sensitivity of the
algorithm at
(note: if
then S
= 1). Then, an alignment algorithm may provide
one of the following reporting standards:
Threshold Filtration: If
the algorithm will report all alignments with
or fewer
substitutions. Otherwise the algorithm will report all alignments with or fewer
substitutions as well the fraction of alignments corresponding to it's sensitivity level
for substitutions greater than (ie. S
+1
;S
+2
:::S
).
Substitution Filtration: Substitution ltration involves rst searching for exact align-
ments and then allowing more substitutions till an alignment is located. Align-
ments found with substitution ltration will have the fewest possible substitutions
provided k, where k is the number of substitutions in the reported alignment.
If
k > , an alignment with fewer than k substitutions may exist but not be
located by the algorithm alignment located. In either case, after after substitution
ltration locates an alignment with k substitutions, one of three reporting choices
are made:
10
Complete Substitution Filtration: All alignments found with k substitutions
are reported.
Unique Substitution Filtration: If more than one alignment is found with k
substitutions none are reported. Otherwise, a single k substitution alignment
is reported.
Abridged Substitution Filtration: The rst alignment found with k substitu-
tions is reported.
Greedy Filtration In greedy threshold ltration the algorithm searches for an align-
ment with fewer than
substitutions. When one is found it is reported. However,
there is no guarantee that their does not exist an alignment with fewer substitutions.
Greedy alignment protocols also often spend a certain amount of CPU operations
searching for alignments; if this time is exhausted the read is discarded.
As shown in the examples described in Table 1.1, threshold ltration is necessary to
provide accurate results for all reads. Threshold ltration is also the most tedious report-
ing standard to adhere to; many algorithms slow down considerably when it is required
of them. Our short read alignment algorithm, PerM is able to achieve threshold ltration
for varying levels of mismatches by employing periodic spaced seeds designed for specic
tolerance levels. Described in detail in the appendix, a single periodic seed is a repeating
mask pattern which is slid along the genome to produce one subsequence for each ge-
nomic position. The same periodic pattern is then applied to slid along each read until a
subsequence generated from the read matches exactly to a genomic subsequence. If the
pattern is chosen correctly, at least one member in the set of read subsequences generated
11
will match exactly to a genomic subsequence if an alignment within substitutions exists,
enabling the algorithm to provide threshold ltration to substitutions in linear time.
It's important to notice that our algorithm and other similar methods which are
capable of providing threshold ltration at little cost to eciency, usually employ a cuto
such that reads which nd an extremely large amount of exact alignments are discarded.
This is usually done to save on the very large amount of IO required for reads which
are sequenced from especially repetitive regions (eg. more than 100 million alignments).
Such alignments are usually not useful for downstream analysis. Additionally, it should
be noted that some algorithms provide a mix of reporting standards. For example, by
default, the Bowtie algorithm (discussed in detail in the Appendix) provides abridged
substitution ltration for fewer than one mismatches and greedy ltration for more than
two mismatches. In the ideal case an alignment algorithm should provide tolerance to
jrj 1 substitutions and thus be capable of reporting threshold ltration for very large
numbers of substitution as described in Table 1.1. If the genome size is trivial, this
solution can be achieved through exhaustive search. In reality, providing tolerance to high
numbers of substitutions is computationally challenging. Additionally, the necessary level
of tolerance required to report accurate alignment results is dependent on the read length,
genome size and distribution, as well as the mutation and error rates. For example, if
50-bp reads sequenced using the SOLiD sequencer from a distant primate are aligned to
the human reference genome a very high level of tolerance is necessary to account for the
high mutation rate between dierent species as well as the high error rate associated with
color-space. Thus, for a given tolerance level (), the task of any alignment algorithm is
12
to provide a solution which minimizes two computational objectives: the time required
for read alignment and the amount of memory required of the algorithm.
Alignment speed is the amount of CPU time required per read to report one or more
alignments to the reference sequence or determine that none exist within a specied
tolerance level. Speed is usually reported as the number of millions of reads per CPU
hour that an alignment algorithm is capable of processing. Searching a 3-billion bp
reference genome for each read alignment is extremely inecient and as such all alignment
algorithms include a preprocessing step where the genome (or alternatively the read set)
is built into an index to enable ecient search. Consequently, the speed is a result of the
data structure used for indexing and the methods by which the index and genome are
queried.
The memory footprint required to store an index for the 3-billion bp genome can
quickly grow quite large. Storage of the base sequence alone requires 3 gigabytes of
memory, already a sizable portion for the average desktop computer. This restriction
eliminates lots of popular data structures. For example, one sux tree construction
described for whole genome alignment [16] quotes an upper bound of 37 bytes per base,
meaning 111 gigabytes of memory would be required to store the human genome. This
limitation has caused computational biologists to consider very compact data structures
for indexing the human genome.
Unfortunately, the use of a very compact data structure usually results in sacrices
to alignment speed at high tolerance levels. PerM is designed to provide a compromise
between the size of the reference index and the required memory to store the index.
By using only a single periodic seed, only one subsequence is created for each genomic
13
Reference 5’-A T A T T A T G C T G A G G G G G C A T
SNP
The aligned
and piled-up
reads
For RNA-Seq, the
coverage of each
gene region indicates
the expression level
of the gene.
Sequencing errors
C
C
C
C
C
A
C
G
G
G
Figure 1.4: An Illustration of the Pile-Up Method [11]
position. Thus, PerM is able to sort the subsequences lexicographically and store only the
position rather than the entire subsequence, requiring only one integer for each genomic
position, or approximately 12 gigabytes of memory. Through the use of a novel data
structure which employs both hashing and binary search PerM is able to eciently locate
alignments using maximal length substrings with only a moderate cost to memory. This
data structure, the maximal length subsequences produced by periodic seeds, and the
algorithm's implementation are discussed in far greater detail in the Appendix.
1.3 Analysis of Alignment Data
The results of short read alignment provide a myriad of biological application to be carried
out with a degree of accuracy and scope of analysis previously not possible. For example,
before the advent of second generation sequencing, microarray technology was frequently
used to search for single nucleotide dierences (SNPs) between populations which could
be associated with disease or another phenotype of interest. Microarray technology, which
relies upon sequence hybridization, allows biologists to perform nucleotide level analysis,
14
but limits the analysis to previously identied SNP positions and requires that
uores-
cence levels be normalized before they could be interpreted biologically. In contrast short
read alignment to a reference genome facilitates the analysis of known SNP positions and
also enables the genome-wide search for previously unannotated SNP positions. This
drastically increases the scope of the study, especially because most well characterized
SNPs are common mutations which are neutral to disease [2]. SNP-Calling, the central
task in genome-wide association studies and a necessary step in many other biological
applications, is often carried post-alignment by screening for positions for which a signi-
cant number of positions share an alignment substitution. This pile-up analysis (shown in
Figure 1.4, often provides inaccurate results if the SNPs are densely clustered or reads are
in color-space [74]. For this reason, we have developed ComB, an algorithm which accu-
rately performs SNP-Calling in color-space and nucleotide space. The motivation behind
the research and the iterative alignment algorithm used to increase SNP-Calling accuracy
is described in detail in Chapter 2. We have also extended the model to provide a tool
which is capable of performing whole-epigenome analysis of methylation rates using bisul-
te treated reads. Bisulte treated sequencing converts unmethylated cytosine residues
to thymine, which poses challenges for alignment and makes dicult the accurate estima-
tion of the rate at which methylation is present at dierent cytosine positions. Chapter
3 describes the development of FadE, an algorithm which uses Newton-Raphson opti-
mization to estimate genome-wide methylation rates, as well as analysis of methylation
trends which were identied using color-space human broblast data. Finally, Chapter
4, describes current and future development of pipeline tools that enable the analysis of
the dierences between the DNA and the RNA of a single individual.
15
Chapter 2
ComB: An iterative SNP Caller
2.1 Survey on SNP-Calling
As mentioned in Chapter 1, SNP-Calling is an important step in many biological analyses.
In fact, the identication of SNPs and short Block Nucleotide polymorphisms (BNPs)
may be a key step in the realization of personal medicine [60]. The challenge of SNP
calling from short read data requires that sequencing errors and mapping biases which
result in alignment substitution be discerned from actual polymorphism. For a given
candidate reference locust
j
=, the identity oft
j
is inferred from the following alignment
information:
The reference base g
j
and the global mutation distribution P (t
j
=jg
j
).
The model describing number of alleles at the position (ie. Organism is heterozy-
gous: [AA;AC;AG;AT;CC;CG;CT;GG;GT;TT ],
The set of reads which have an alignment spanning t
j
: R(t
j
).
16
There exist many algorithms which consider these parameters to perform SNP call-
ing on Illumina data [55, 7, 67]. However most algorithms designed for Illumina data
do not adequately conform to the ABI's color space where the single color error rate is
high and SNP signatures exist on only a subset of consecutive color substitutions [66].
Many alignment programs [45, 31, 50], attempt to make use of the unique two color SNP
signature and translate color reads to their most likely nucleotide string after alignment
in color-space. This facilitates the use of tools designed for Illumina data [50], but un-
fortunately conversion to nucleotide space results in a loss of information contained in
the original color read and quality sequence which may lead to inaccurate SNP calls. For
example SNPs which are obscured by color errors must be either categorized as SNPs
or disregarded completely during alignment before knowledge of the aggregate read data
is acquired. We developed ComB to provide a consistent statistical model to determine
short variants directly in color or nucleotide space. Building upon a simple Bayesian
framework, we designed ComB to include a host of features to adequately handle the
challenges and dependencies of color and nucleotide space to provide ecient and accu-
rate SNP calling. Section 2.2 introduces the general statistical model implemented in
ComB. Each of the sections that follow introduce a feature associated with alignment
data and explain how the feature is incorporated or the model updated so that SNP-
Calling accuracy and eciency can be maximized. In the Section 2.3 we compare ComB
to other SNP callers and to dierent data features.
17
2.2 Design of ComB
Bayesian Statistical Model
The general framework which ComB uses to determine genetic variants is a Bayesian
model which conditions the posterior probability that t
j
= on the set of reads which
contain an alignment spanning t
j
and the initial reference base. This probability can be
expressed as:
P (t
j
=jR(t
j
);g
j
) =
P (R(t
j
)jt
j
=)P (t
j
=jg
j
)
P
2fA;C;G;Tg
P (R(t
j
)jt
j
=)P (t
j
=jg
j
)
: (2.1)
where P (t
j
=jg
j
) is the prior SNP probability for the reference base g
j
,
P (R(t
j
)jt
j
=) is the conditional probability of the read set, and
P (R(t
j
)) is a normalizing constant which can be expressed as the sum of conditional
probability over the space of, for a haploid organism this is:
P
2fA;C;G;Tg
P (R(t
j
)jt
j
=
)P (t
j
=jg
j
)
If we assume read alignments are independent, the conditional probability of the read
set is the product of the conditional probability for each read:
P (R(t
j
)jt
j
=) =
Y
r2R(t
j
)
P (rjs;t
j
=): (2.2)
In the ideal case, where read alignment is unique, error-free and unbiased, SNPs are
isolated, and the sequencing error rate is a single uniform parameter , the conditional
probability for the read is:
18
P (R(t
j
)jt
j
=) =
8
>
>
>
>
<
>
>
>
>
:
1; if r
t
j
=
; else
(2.3)
where r
t
j
refers to the read position which aligns to the position in question, t
j
.
Unfortunately, features associated with the large and semi-repetitive human reference
genome, the highly variable sequencing error rate, and limitations associated with short
read alignment result in errors if the model is not designed to account for them. In the
next section, we will list challenging features to SNP calling as well as the solutions which
ComB uses to overcome them and provide accurate SNP calling.
Quality Calibration and Read Probability
As shown in equation 2.3, given a correct alignment, the read probability conditioned
on the identity of a possible variant (t
j
=), is dependent on the sequencing error rate.
However, the sequencing error rate is not uniform, but instead dependent largely on the
sequencing platform, sequence motifs, and read position [72]. The Illumina and SOLiD
platforms both supply quality scores, which re
ect the signal intensity but fail to capture
many sources of sequencing error [54]. An accurate estimation of the probability of
sequencing error should make use of both the reference and read base (or colors), the given
quality score, and the read position. This can be achieved by building an error model
which uses the frequency of observed nucleotide alignments at dierent quality scores
and read positions to determine the probability of observing any nucleotide alignment.
In sequence space this requires that uniquely mapping read alignments be used to build
a matrix with dimensions 4jQjjrj 4 wherejQj represents the range of quality
19
scores andjrj represents the length of the read. Each entry in the matrix represents the
probability of observing a certain nucleotide with quality score Q
i
at position r
i
aligned
to one of four reference nucleotides. This probability if estimated by the frequency of
observations of each reference nucleotide for any given combination of read nucleotide,
position and quality score and is calculated as follows:
C
i
(x;q;b) =
P
r
I(r
i
=x;q
i
=qjb)
P
r
I(r
i
;q
i
jb)
; (2.4)
wherer
i
andq
i
represent the base and quality score for a given read at itsi
th
position,
and I() is the indicator function, and the sum is over all uniquely mapping reads(r).
This recalibration of the call rates serves primarily to identify sequence specic errors
and reduce the false positive rate. In color space, the call rates must be recalibrated
with respect to an observed color which represents the transition between two reference
nucleotides; this results in an matrix with dimensions 4jQjjrj44 for each position in
the sequence. Here, each entry represents the probability of seeing a color (4), between
a pair of bases (4 4), at read position i (jrj) with quality score (jQj). Thus, in color
space the entry in the matrix describing the probability of observing color x with quality
q between reference dinucleotidesb
1
andb
2
at thei
th
position in the read is calculated in
the following way:
C
i
(x;q;b
1
;b
2
) =
P
r
I(r
i
=x;q
i
=qjb
1
;b
2
)
P
r
I(r
i
;q
i
jb
1
;b
2
)
; (2.5)
The call rates matrices described by Equations (2.4) and (2.5) can also be used to
measure the alignment between a readr and it's qualityq and a genome substrings. This
20
read alignment score, P (rjs), is calculated by multiplying the observation probabilities
from the call rates matrix for each color and quality score along the length of the read:
P (rjs) =
jrj
Y
i=1
C
i
(r
i
;q
i
;s
i
;s
i+1
): (2.6)
This is the probability that sequencing the genomejrj-mer represented by s would
produce the read r.
Prior SNP probability
The distribution of SNP rates is dependent on both the genetic distance between the
reference and target genome as well as the specic nucleotide change. Transitions ("A"
to "G" or "C" or "T") changes are much more common than transversions which lack
many of the molecular mechanisms that cause transitions [17]. This information can be
integrated into a prior SNP probability P (t
j
=jg
j
), which will result in more accurate
SNP calling. In the case of the well cataloged human genome, a global GC rate of 43% [35]
and homozygous and heterozygous SNP rates of 0.048% and 0.054% between individuals
[47] are fairly well established. If these rates are assumed, then the probability of a specic
nucleotide change can be estimated using the SNP frequencies from the 20 million SNPs
annotated in the dbSNP database (build 131 [14]). For example, the prior probability of
a homozygous "A" to "C" SNP between two individuals is:
P (t
j
=Cjg
j
=A) =
P (g
j
=Ajt
j
=C)P (t
j
=C)
P (g
j
=A)
(2.7)
Where the marginal probability for "A" is estimated from the genomic GC rate:
21
P (g
j
=A) =
1GC
RATE
2
(2.8)
and the marginal probability of a mutations resulting in "C" (P (t
j
= C)) and con-
ditional probability of reference A in the event of a mutation (P (g
j
= Ajt
j
= C)) are
estimated from the frequencies in dbSNP. For organisms which are less accurately cat-
aloged, priors can be estimated according to transition and transversion rates and GC
content can be calculated directly.
Ambiguous Alignments
As discussed in section 1.2, the most accurate alignment reporting option ("Complete
Reporting") will often have include many reads which align to multiple locations. An
error in the determination of the correct alignment from the many spurious alignments
may lead to incorrect SNP calling. However, discarding all all ambiguous reads may
result in the loss of a substantial amount of data, especially if the genome is large and
semi-repetitive, as is the case with the human genome. To preserve ambiguous read data
but avoid incorrect alignment selection we weight each ambiguous read by it's relative
alignment score to the reference location. We establish the conditional probability of
observing a read r which maps to a set of genomic locations L(r) given the identity of
locus t
j
:
22
P (rjL(r);t
j
=) =
X
s2L(r)
P (r;l!rjl;t
j
=) (2.9)
=
X
s2L(r)
P (rjt
j
=;l!r)P (l!r)
wherel!r represents the event that r is sequenced from l (correct alignment). The
expressionP (rjL(r);t
j
=) generalizes to the read observation probability when there is
only one (assumed correct) location. Since
P
l2l(r)
P (l!r) = 1, then for somel
2L(r),
P (l
!r) =
P (rjl
)
P
l2L(r)
P (rjl)
(2.10)
whereP (rjl) is the alignment score dened in Eqn (2.6). This results in each alignment
being weighted by it's relative alignment probability.
Consecutive Color Polymorphism Dependency
One major dierence between color space and nucleotide space color space reads are a
re
ection of the transition between two nucleotides. As such, consecutive polymorphisms
in color space do not have independent signatures in color space. Short block nucleotide
polymorphisms (BNPs) between 2 and 8 bp are not uncommon events. In fact 11.07%
of the 20.5 million positions annotated dbSNP [14] are adjacent to another nucleotide
polymorphism. Unlike SNPs which always change two consecutive colors, BNPs change
two or more colors which may not aways be adjacent, this is shown in Figure 2.1. For this
23
Color Read Interpretation Interpretation Read Interpretation Genome
BBRBBBBB
One sequence error BBBBBBBB AAAAAAAAA
One sequence error, one SNP BBRRBBBB AAATAAAAA
BBRBRBBB
Two sequence errors BBBBBBBB AAAAAAAAA
No sequence errors, two SNPs BBRRBBBB AAATTAAAA
Two color errors, one SNP BBRRBBBB AAAATAAAA
Two color errors, one SNP BBBRRBBB AAATAAAAA
BBRBGBBB
Two sequence errors BBBBBBBB AAAAAAAAA
One sequence errors, two SNPs BBRBRBBB AAATTAAAA
One sequence errors, two SNPs BBGBGBBB AAACCAAAA
Two sequence errors, one SNP BBRRBBBB AAATAAAAA
Two sequence errors, one SNP BBBGGBBB AAAACAAAA
One sequence error, three SNPs BYRBGBBB AGCCAAAAA
One sequence errror, three SNPs BBRBGYBB AATTGAAAA
Figure 2.1: Consider reads which map to a location with genome sequence AAAAAAAAA,
which in color space is BBBBBBBB. The reads may dier from the color space genome
sequence, because of either SNPs or sequencing errors. The goal of identifying likely
sets of sequencing errors and SNPs that would produce the given read is complicated by
the dependence between adjacent colors. Valid color substitutions are restricted to those
which change the read sequence locally, while sequencing errors are not restricted. Above
are a set of reads that map to the genome location in color space, along with a set of
possible interpretations of the read as results of sequencing errors and SNPs. Determining
the combination of SNPs and sequencing errors which aected any single read is dicult,
ComB's statistical model allows the likely interpretation to be made using the entire
collection of reads.
reason, candidate regions for block polymorphism must be evaluated for all possible poly-
morphism combinations along the length of the block. Equation 2.1 remains applicable
when the read alignment probability is conditioned on the set of all possible nucleotide
changes: P (rjt
i
=
1
;t
i+1
=
2
;::: ). In section 2.3, we show that this adjustment
produces far more accurate results than the color-translation method.
Diploid and Multiploid Organisms
If the target genome sequence comes from an organism with multiple chromosomes the
parameter should represent the space of all possible nucleotide groupings on the chro-
mosomes. For simplicity consider a diploid organism where =W , i.e. heterozygous for
nucleotides A and T. In this event, the observation score should consider the event that
24
the read is sequenced from either chromosome. Thus, the observation score for a read
aligned to such a locus is expressed as
P (rjt
j
=W ) =P (rjt
j
=A)P (t
j
=A) (2.11)
+P (rjt
j
=T )P (t
j
=T )
where
P (t
j
=A) =P (t
j
=T ) =
1
2
: (2.12)
When considering diploid space, 2fAA;AC;:::;TG;TTg, accuracy is markedly
improved through the use of iterative SNP calling. The iterative method will rst call
homozygous nucleotide polymorphisms (NPs), update the target genome, and then call
potential diploid locations to reduce bias to the reference genome. This bias otherwise
causes homozygous SNPs to be interpreted as heterozygous.
Alignment Biases and Iterative Implementation
Alignment to the reference sequence is carried out with respect to the number of sub-
stitutions a read has with the reference sequence. Areas with dense polymorphism may
produce reads with far too many substitutions relative to the reference sequence to align
to their correct location. It is important to note that even when SNPs are globally rare,
areas of dense polymorphism often still exist [74]. In color-space the problem is exacer-
bated because SNPs aect multiple colors. Increasing alignment tolerance may not be
computationally possible and may also result in the creation of more false or ambiguous
25
alignments. To avoid this bias we update the reference sequence and then re-align the
reads iteratively to solve areas with poor alignment rates due to dense polymorphism.
This iterative process assumes a set of short readsR sequenced from an unknown target
genomeT and a similar and known reference genomeG, and is carried out as follows:
1. Initially the read setR is mapped toG.
2. The machine quality scores are recalibrated and call rates are estimated. (Eqn
(2.5))
3. The genome is scanned and SNP candidates are located and tested using Bayesian
Inference (Eqn (2.1)).
4. The target genome is updated at positions with high posterior SNP probability, and
the mapping is repeated until a signicant number of SNPs are no longer located .
It should be noted that if the size of the dataset is large then iterative whole genome
SNP-Calling may require signicant memory usage and computational time. However,
ComB is implemented so that each step can be run on a dierent processor and and
intermediate information can be combined to produce call rates, candidate positions, and
SNP probability. This parallel implementation also allows color-space and nucleotide-
space alignment data to be used in combination. If there are multiple data types, the call
rates for each data type can be estimated separately and the product of the intermediate
SNP probability calculate with each data type can be multiplied by a prior probability
to produce to produce the total SNP probability for each candidate position.
26
2.3 Analysis of ComB
ComB was designed to facilitate accurate SNP calling by providing a general Bayesian
model for color and nucleotide space that made use of call-rate estimation, prior SNP
probability, weighting of ambiguous reads, iterative re-alignment, and consideration to
consecutive color polymorphism. The utility of these features in color space was demon-
strated in comparisons to Corona Lite [61] ABIs native color space alignment and SNP
calling pipeline, and to soapSNP [55] and MAQ-consensus [51], the two options contained
in the popular samtools package [50] which operate on the color-translated alignment
produced by BWA [48]. For comparisons in nucleotide space PerM was used to prepare
alignment for soapSNP and MAQ-consensus.
Table 2.1 and Figures 2.2 show the relative performance of each program in color and
sequence space using public E. coli datasets. In each experiment reads from an E. Coli
strain (SOLiD reads from E. coli DHB10 and and Illumina reads from E. coli MG166) were
downloaded from the NCBI short read archive [1] and aligned to the reference sequence
of moderately diverged strain, E. coli-REL606. All reference sequences were downloaded
from the UCSC genome browser [33] and globally aligned using MuMMER [43] to produce
a gold standard set of SNPs.
These global alignments revealed that while the global SNP rate between strains was
low (less than 1%) there were many areas of high SNP density, any many SNPs which
were members of block polymorphism [74]. In this experiment ComB called a larger
proportion of the SNPs in the global alignment between reference sequences and a smaller
number of unsupported SNPs. Two of the major advantages of ComB are demonstrated
27
Table 2.1: SNP Calling between E. coli strains at 25x coverage
Color-space: 31,902 Validated SNPs between E. coli DHB10 and Rel606
SNP-Caller Valid Unsupported
Valid BNPs (2,314 total)
Corona lite 16; 384 1; 883 27
MAQ Consensus 25; 031 6; 773 791
soapSNP 25; 305 6; 500 795
ComB (NQ)
y
28; 325 4; 230 1; 377
ComB 28; 364 3; 430 1; 382
Sequence-space: 33,944 Validated SNPs between E. coli MG1665 and Rel606
SNP-Caller Valid Unsupported
Valid BNPs (2,469 total)
MAQ Consensus 31; 823 16; 191 2; 011
soapSNP 31; 823 16; 188 2; 019
ComB (NQ)
y
32; 089 7; 773 2; 247
ComB 32; 095 7; 530 2; 248
Shown are the number of SNPs that were validated by global alignment that each program returned. Corona lite's
alignment le was prepared using ABI's mapreads pipeline, which returns all identiable alignments with fewer
than 6 mismatches, while MAQ Consensus and soapSNP's color-space alignment was generated by BWA with
parameters "-c -N -k 4 -n 6" which provide total reporting with tolerance to four substitutions and sensitivity
for up to six. PerM was used for all sequence-space alignment as well as ComB's color-space alignment with
options which also provide total reporting for tolerance to four and sensitivity to six substitutions. In all cases,
alignment les were trimmed such that each SNP caller was provided with 25x coverage.
y
ComB (NQ) used quality scores supplied by the read le instead of quality recalibrated call rates
Unsupported SNPs were not present in the global alignment.
in the experiment, that consideration to block polymorphism in color-space signicantly
increases accuracy and that quality score recalibration signicantly decreases error.
A total of 952 of 961 unsupported color-space SNPs were identied at by all four
SNP-Callers at 25x and 50x coverage. As shown in Figure 2.2, it is unlikely that further
increasing coverage would result in larger shared set of novel SNPs. It's likely that the 961
shared SNPs are specic to the population from which the reads were derived but are not
28
0 10 20 30 40 50
Coverage
0
5000
10000
15000
20000
25000
30000
Validated SNPs located
SNP Calling with 35bp SOLiD reads on Ecoli
ComB
Corona Lite
Maq
Soap
0 10 20 30 40 50
Coverage
0
5000
10000
15000
20000
25000
30000
35000
40000
Unvalidated SNPs located
Unvalidated SNPs found with 35bp SOLiD reads on Ecoli
ComB
Corona Lite
Maq
Soap
0 5 10 15 20 25 30 35 40
Coverage
27000
28000
29000
30000
31000
32000
33000
Validated SNPs located
SNP Calling with 36bp Solexa reads on Ecoli
ComB
Maq
Soap
0 5 10 15 20 25 30 35 40
Coverage
4000
5000
6000
7000
8000
9000
10000
11000
Unvalidated SNPs located
Unvalidated SNPs found with 36bp Solexa reads on Ecoli
ComB
Maq
Soap
Figure 2.2: ComB - SNPs Calling between E. Coli strains for varying coverage levels
Shown are the number of validated and unvalidated SNPs found at dierent coverage levels. Alignment parameters
used are described in Table 2.1. Each program nds exponentially more SNPs as coverage increases from 2x to
15-25x, after which improvement from increases to coverage are minor
annotated in the reference genome. The other unsupported SNPs may be false positives
or SNPs which only one method is capable of discovering. To get a more accurate esti-
mation of the number of false positives returned by each SNP-Caller, simulations were
performed using the Drosophila melanogaster X chromosome as a reference sequence.
In each simulation the X chromosome was mutated according to a particular SNP dis-
tribution and 50bp color reads were simulated according to a uniformly-distributed 1%
color-error rate, and each SNP caller was evaluated according to two parameters:
29
Recall: The percentage of SNPs identied:
TruePositives
TruePositives+FalseNegatives
Precision: The accuracy of identied SNPs:
TruePositives
TruePositives+FalsePositives
Table 2.2 displays the performance of each SNP caller for three dierent SNP distri-
butions:
Uniform Low: 22,422 total snps, 0.1% mutation rate.
Uniform High: 2,424,000 total snps, 10% mutation rate.
Block-distributed: 50,000 total mutations, 10,000 dinucleotide and trinucleotide
polymorphisms, 0.2% mutation rate.
For each SNP distribution, 50 million reads were aligned with the same parameters
described in the E. coli experiment (see Table 2.1) and trimmed to provide 25x cover-
age (12:12 million aligned reads) as input to each SNP-Caller. The results show that
while ComB's performance for the uniform low distribution is almost identical to that
of MAQ-Consensus and soapSNP, ComB perform signicantly better when SNPs are
densely populated or block distributed, both of which were observed in the global align-
ment between E. Coli strains.
Although, the results on SNP-Calling between E. Coli strains and on simulated SNP
distributed clearly illustrate the advantages of the multi-faceted approach we implemented
in ComB, both are carried out on relatively small datasets. To demonstrate that ComB
can scale adequately to large data sets, 1.48 billion 50bp color reads from an anonymous
human individual were downloaded [40] and aligned with tolerance to 3 substitutions and
sensitivity to 5 substitutions to the 3 billion bp human genome using PerM. PerM was able
30
Table 2.2: Performance for Simulated SNP distributions at 30x coverage
Uniform Low Uniform High Block Distributed
Recall Precision Recall Precision Recall Precision
Corona lite 0:826 0:752 0:095 0:866 0:002 0:019
MAQ Consensus 0:950 0:991 0:589 0:887 0:499 0:494
soapSNP 0:950 0:991 0:589 0:887 0:500 0:496
ComB 0:976 0:999 0:748 0:994 0:936 0:996
Shown are SNP-Calling performance of each program for dierent SNP distributions
Table 2.3: Human SNP Calling Results
SNP Type Called Annotated, n Annotated % Novel
Homozygous 1,154,666 1,095,973 94.91 58,693
Heterozygous 974,122 711,413 73.03 262,709
Block NP sites 56,317 29,519 52.41 26,789
Total 2,185,105 1,836,905 84.06 348,200
Shown are number of SNPs called against the human reference genome. It's likely that heterzygous SNPs which
are not xed in a population may be unannotated.
to align 832.07 million reads, providing a mean coverage of 13x. It's important to note
that greater than 15% of reads mapped to more than one and fewer than one-hundred
locations, which required ComB to make full use of it's ambiguous weighting. Making use
of 96 CPU's ComB was ran in parallel to perform two homozygous and one heterozygous
round of SNP-Calling in fewer than seven total hours. Table 2.3 displays the annotation
rates for the 2,185,105 likely SNP calls made (posterior probability> 0:95). That 84.06%
of the identied sites were previously annotated in the dbSNP 131 [14] suggests that
ComB provides accurate results for very large datasets as well.
31
Chapter 3
Fade: NR-optimized Methylation Estimation
3.1 Survey on DNA Methylation
While SNPs and other genetic dierences are often thought to underlie all of human di-
versity, there are a large number of heritable modications which aect gene expression
and other cellular phenotype without any change to the DNA sequence. Epigenetics is
the study of heritable modications which do not involve a change to the nucleotide se-
quence. The most studied epigenetic modication, methylation of cytosine nucleotides
was rst proposed to be a stable and heritable modication in 1975 by Holl et al [30].
Methylation of cytosine nucleotides is maintained from one generation to the next by
by the Dam methylase enzyme, which adds methyl groups after replication to unmethy-
lated daughter strands which are synthesized from highly methylated template strands
[3]. One major biological role that has been attributed to methylation of DNA is the
involvement of methylated cytosines in gene expression. It's been well established that
increased cytosine methylation of promoters is largely responsible for the silencing of
genes. Additionally, hypomethylation of repeat-rich heterochromatin and hypermetha-
lation of growth-regulatory genes have been shown to cause cancer [68]. The fact that
small changes in methylation levels can cause such major changes [21], have made the
32
estimation and identication of dierentially methylated sites an area of intense interest.
Fueling this interest is the goal of "epigenetic therapy", wherein cancer genes can be
turned o or silenced by drug targets which alter methylation levels [29, 18]. Such goals
require accurate, ecient, and cost eective determination of individual methylomes, the
methylation levels at all genomic cytosine sites across the cell population which exist
in a certain tissue. This can be accomplished through whole genome sodium bisulte
treated (SBT) sequencing [23]. Whole genome SBT sequencing involves treatment of the
DNA with sodium bisulte, which causes unmethylated cytosine nucleotides to convert
to uracil, which is interpreted as thymine by sequencing machines. SBT reads can then
be aligned to a reference genome, where the relative rate of aligned cytosine to thymine
nucleotides can be viewed as a proxy for the rate of methylation at a certain location.
Alignment of the treated reads presents a challenge because the number of mismatches in
cytosine heavy unmethylated regions will not allow a straightforward alignment to take
place, the use of a 3-letter alphabet (all read and reference cytosines rst converted to
"T") for nucleotide reads will reduce the sequence specicity and increase the probabil-
ity of ambiguous alignment, but will not bias the alignment regarding methylation and
has been shown to work successfully if reads are suciently long [41]. Such a method
is not feasible in color-space because unaligned reads cannot accurately be translated to
nucleotides without the risk that single color errors will alter the entire downstream se-
quence and compromise the accuracy of the reads [66]. As discussed in detail in Chapter
3, post alignment translation from colors to nucleotides also introduced a signicant level
of biases into sequence data analysis, and as such it is imperative that we develop an algo-
rithm which can be carried out directly in color or nucleotide space. Here it is important
33
x
i
;q
i
x
i+1
;q
i+1
n
j1
! c
j
! n
j+1
Figure 3.1: For any alignment spanning a non-consecutive reference cytosine (c
j
), the
transition from the previous nucleotide (n
j1
) and the transition to the following nu-
cleotide (n
j+1
) each emit a color-quality tuple, x
i
;q
i
and x
i+1
;q
i+1
, respectively. The
rate governing the emission of color-quality tuples for dierent read positions and ad-
jacent reference nucleotides can be used to estimate the probability that the cytosine
sequenced in each read existed in a methylated or unmethylated state.
to note the similarity between SNP-Calling and methylome estimation while being aware
of the biological distinctions which in
uence the goals behind each. SNP calling involves
the estimation and comparison of the posterior probability for multiple distinct events
(ie. P (t
j
=jR(t
j
))) where is a certain base. In a homozygous model exists on the
nucleotides [A;C;G;T ] while a heterozygous model requires that the posterior probability
be evaluated for all nucleotide pairs, [AA;AC;AG;AT;CC;CG;CT;GG;GT;TT ]. In
contrast methylation estimation involves the analysis of a continuous parameter , such
that the posterior probability distribution describing is estimated from the observed
data, ie. P (t
j
=jR(t
j
) where [0; 1].
As is the case for SNP-Calling, an accurate alignment to a reference position pro-
vides a single nucleotide and quality score in nucleotide space or two color-quality pairs
(describing each transition) for each read spanning a non-consecutive reference cytosine.
Figure 3.1 illustrates how color-quality tuples encode the transitions from the preceding
reference base to the cytosine and from the cytosine to the following reference base. If
we assume SNPs are not present, color pairs arise from the sequencing of a cytosine in
a methylated or unmethylated state and usually provide a high degree of certainty as to
the state of the cytosine.
34
Color alignments which are adjacent to SNPs or are corrupted by machine sequencing
errors will often provide a much smaller degree of certainty as to the state of the reference
cytosine, as is illustrated in Table 3.1. As described in Chapter 3.2, call-rates provide an
accurate estimation of the probability of a color observation with respect to the reference
bases it spans,the quality score, and read position. "Emission rates" are analogous to
call-rates in color space, but are estimated from control data sets which model complete
(M=1) or near-zero methylation (M=0) and only describe the emission of colors encoding
transitions which involve reference cytosine positions. The emission rates for the non-
methylated state are derived from SBT treated phage lambda reads. The phage lambda
is thought to have little to no methylation and has a very small non-repetitive reference
genome which facilitates accurate alignment of SBT reads. Complete methylation can be
modeled simply by shielding a portion of reads from bisulte treatment so that none of
the cytosines are converted to thymine. In either case the emission rates are calculated
very similarly to the call rates, with the exception that they are conditioned on the
methylation state and preceding or following bases instead of two reference bases. For
example, the emission rates describing the transition between the preceding base g
j1
and a reference cytosine for read position i, color x
i
, and quality q
i
are calculated from
their alignment frequencies as follows:
E(x
i
;q
i
jM =;g
j1
) =
P
r
I(x
i
;q
i
jM =;g
j1
)
P
r
I(x;qjM =;g
j1
)
; where f0; 1g: (3.1)
35
Table 3.1: Color Interpretations
A ! c
j
! G Likely Interpretation Methylation Probability
x
i
x
j
- -
Green Red A-C-G High
Red Green A-T-G Low
Red Red A-T-? or ?-C-G Dep. on Error Rate
Green Green A-C-? or ?-T-G Dep. on Error Rate
Some pairs of color alignments are dicult to resolve with in a single read. In the example below the colors
"Green-Red" and "Red-Green" lead to the interpretation that the position c
j
is in the state "C" and "T ",
respectively. The colors "Red-Red" is likely the result of sequencing error. Whether the sequencing error is
assumed to alter the rst or second color in the pair will lead to a dierent interpretation for the position c
j
.
3.2 Design of FadE
The primary design of FadE involves the incorporation of the inferred emission rates to
a hidden data model which describes the posterior distribution of the methylation level
at each cytosine position. In this model, the emission of color quality tuples at reference
cytosine positions c
j
are noisily controlled by an unknown methylation parameter
j
describing the rate at which c
j
is methylated or non-methylated. Each read alignment
spanning c
j
constitutes an independent sampling of the methylation state. However,
as shown in Figure 3.2, the methylation state is not observed, but instead an emission
whose probability in each state is calculated from the product of the two emission rates
(preceding and following) described in the previous section:
36
P (rjM = 1) =E(x
i
;q
i
jM = 1;n
j1
)E(x
i+1
;q
i+1
jM = 1;n
j+1
)
P (rjM = 0) =E(x
i
;q
i
jM = 0;n
j1
)E(x
i+1
;q
i+1
jM = 0;n
j+1
).
(3.2)
Every read alignment is thought to arrive from a discrete membership in either the
methylated or non-methylated state, so using the law of total probability, the probability
of observing a single read alignment is:
P (rj
j
) =P (rjM = 1)
j
+P (rjM = 0)(1
j
); (3.3)
and the probability of observing a set of independent read alignments R
j
which span
t
j
is:
P (R
j
j
j
) =
Y
rR
j
P (rj
j
): (3.4)
Using basic results on conditional probability allows us to express the inverse, or the
posterior probability describing the the methylation rate:
f(
j
) =P (
j
jR
j
) =
P (R
j
jj)P (j)
P (R
j
)
: (3.5)
In Equation 3.5, P (
j
) is the prior distribution and P (R
j
) is a normalizing constant,
very similar to the model used in ComB [74]. The rst major departure from model of
discrete SNP cases arises here, rather than calculate the explicit posterior probability for
a collection of discrete cases, we are instead interested in characterizing the distribution
37
f(
j
). If we assume a uniform prior probability then we can nd the maximum of the
distribution by evaluating only the conditional probability:
f(
j
) P (R
j
jj): (3.6)
Taking logarithms and and dierentiating Equation 3.6 once provides:
X
k
(
P (r
j
k
jM = 1)P (r
j
k
jM = 0)
(P (r
j
k
jM = 1)P (r
j
k
jM = 0)) +P (r
j
k
jM = 0)
); (3.7)
dierentiating once more provides:
X
k
(
(P (r
j
k
jM = 1)P (r
j
k
jM = 0))
2
((P (r
j
k
jM = 1)P (r
j
k
jM = 0)) +P (r
j
k
jM = 0))
2
) (3.8)
a strictly negative function on the region (0,1), which implies that Equation 3.6 is
strictly concave on the region (0,1). Thus, the single maximum value on the region, ^
j
can be calculated iteratively with Newton's optimization method, where is updated
according to the rule:
n+1
=
n
F
0
(R
j
)
F
00
(R
j
)
(3.9)
Convergence to the maximum value ^
j
happens quickly with Newton's method. This
value ^
j
is the peak of a concave function made from a large polynomial; this is shown
in Figure 3.3. When coverage and read quality are high the curve is tightly centered
around this value and the distributionf(
j
) is well characterized by just this single value.
However, poor read quality or coverage will result a
atter curve for which the maximum
38
value does not supply much information. In either case, 90% credible intervals around
^
j
can also be calculated, by iteratively updating the until the following equation is
satised:
R
^
j
+
^
j
Q
k
P (r
j
k
j)
R
1
0
Q
k
P (r
jk
j
j
)
> 0:9 (3.10)
Where
is the prior probability that
j
exists on the interval [ ^
j
, ^
j
+ ]. If a
uniform prior probability is assumed then this value is equal to the length of the interval.
The integrals can be calculated explicitly by expanding the polynomial functionP (R
j
j
j
)
when read depth is moderate or approximated using tools of numerical integration. In
practice, eciency is increased by abandoning the optimization after the rst iteration if
the estimate for ^ is very near the boundary, and instead regions near the boundary are
integrated until the a 90% credible interval is located. This eliminates the bottleneck that
occurs if ^ is so close to the boundary that a symmetric credible interval around ^ does
not exist. It should also be noted that if normal prior probability distribution is assumed
N(;) instead of a uniform prior probability, then F
00
(R
j
) (Equation 3.8) becomes:
P
k
(
(P (r
j
k
jM=1)P (r
j
k
jM=0))
2
((P (r
j
k
jM=1)P (r
j
k
jM=0))+P (r
j
k
jM=0))
2
)
2
2
2
(3.11)
which is also strictly negative on (0,1) enabling the use of the Newton-Raphson op-
timization described in Equation 3.9 and the numerical methods necessary to calculate
39
credible intervals. As the optimal prior probability is dependent on the data, FadE is im-
plemented such that dierent normal prior parameters for "CpG" vs "CpN" dinucleotides
can be chosen or prior parameters can be explicitly listed for dierent sites. One consid-
eration that should not be overlooked in color space is the change to the model necessary
if reference cytosines are consecutive. Shown in Figure 3.4, consecutive cytosines cause
subsequent color emissions to be dependent on the the methylation state of the previ-
ous cytosine. In the case of m-consecutive cytosines, FadE rstly calculate the rst and
last methylation parameters using the color emissions from the boundary. Then interior
probabilities are calculated by conditioning each emission probability on the estimate for
the adjacent methylation parameter, such that the observation probability for the j +1
t
h
position in a run of cytosine positions is calculated as follows:
P (rj
m+1
) =P (rj
j+1
;n
j
=C)
j
+P (rj
j+1
;n
j
=T )(1
j
) (3.12)
In nucleotide space, consideration to consecutive cytosine positions is unnecessary,
however the model must be altered slightly to accommodate single nucleotide-quality
signals. The primary dierence to the model involves the emission rates (Described by
equation 3.2 in color space) which become a function of only a single observation:
P (rjM = 1) =E(b
i
;q
i
jM = 1)
P (rjM = 0) =E(b
i
;q
i
jM = 0)
(3.13)
40
E(r
1
) E(r
2
) E(r
n
)
" " "
M
1
M
2
::: M
n
- " %
j
Figure 3.2: For each of n reads, a hidden binary methylation state (M
1
;M
2
:::M
n
) is drawn
uniformly according to an unknown methylation parameter
j
. For each methylation
state, observations E(r) in the form of color-quality tuples are produced according to
emission rates corresponding to the read position and adjacent reference bases.
whereb
i
andq
i
are the base and quality score aligned to a reference cytosine. Estima-
tion of the emission rates in sequence space have no dependence on adjacent nucleotides
and as of sequence emission rates are calculated as the straightforward frequency of nu-
cleotide and quality observations for each read position aligned to a reference cytosine:
E(b
i
;q
i
jM = 0) =
P
R
I(b
i
;q
i
jM = 0)
P
R
I(b;qjM = 0)
: (3.14)
These changes allow the rest of the algorithm to be developed as it was described in
color space. The results of this implementation and the color space implementation are
described and discussed in the following section.
3.3 Analysis of FadE
SBT Alignment
Accurate estimation of the methylome requires as input an accurate and unbiased accurate
le. While methods such as total translation and conversion are suitable for sequence
space, color-space reads cannot accurately be reduced to a three letter alphabet (ie convert
"C" to "T") without upstream sequencing errors imposing a large cost to read accuracy.
41
0.0 0.2 0.4 0.6 0.8 1.0
ρ
j
f(ρ
j
)
Figure 3.3: The probability density function f(p
j
) is concentrated near it's maximum ^
j
when coverage is sucient. Shown here is the pdf resulting from the alignment of thirty
reads to an isolated reference cytosine with a simulated methylation rate of 0:6. This
simulation imposed uniform color-error rate of 3% per position. A continuous uniform
prior distribution was assumed.
x
i
x
i+1
x
i+2
n
j1
! c
j
! c
j+1
! n
j+2
Figure 3.4: If reference cytosines are adjacent, then the interior emission probabilities
will depend on the methylation state of the adjacent cytosine. In this case, the methy-
lation rate at boundary cytosine nucleotides is estimated using the boundary emission
probabilities. Then, the rate of interior cytosine nucleotides are calculated conditioned
on the identity of boundary positions.
If the reads are aligned to the native reference sequence, two color substitutions will result
from each unmethylated cytosine on the read which will result in few accurate alignments
within a moderate tolerance threshold. If the reference sequence is translated (such that
all "C"! "T") then many more reads will nd suitable low substitution alignments.
However, this translation will result in reads which have many methylated cytosines
not nding suitable alignments; this bias will result in an articially low estimation of
methylation levels. For this reason we use a modied three-part alignment strategy. In
this strategy the reads are aligned to three reference sequences; (1) the native reference,
(2) aC!T translated reference, and (3) ACpH!TpH translated sequence (whereH is
42
any base butG). The third sequence is used because in most mammalian cells methylation
occurs primarily at CpG sites. After highly tolerant alignment to all three reference
sequences is completed the results are merged and read alignments which contain many
color substitutions spanning non-cytosine reference nucleotides are ltered if the number
of substitutions are above a pre-determined tolerance threshold. Additionally, alignment
accuracy can be improved through the use of pair-end reads, uni-directional reads, or by
rst performing SNP calling and altering homozygous SNP positions (provided non-SBT
reads exist for the same sample). If a large number of reads still do not nd alignments,
then a sliding window can be applied to the reference sequence such that cytosine rich
areas are split into multiple substrings such that no amount of C 6= T substitutions
will increase the number of substitutions between an otherwise alignable read above the
tolerance threshold. For most cell-types this is likely not necessary because most reads
are sequenced from nonCpG islands (which have very little methylation), orCpG islands
which are either heavily methylated or hypomethylated, both cases which will result in
alignment to reference sequence (3) or reference sequence (2).
FadE: Simulation Analysis
To assess the accuracy of FadE and compare it's performance to other popular methods
we performed multiple simulations in color and nucleotide space using Human Chromo-
some 21. In each simulation approximately twenty percent of the over fourteen million
cytosine positions on either strand (1.7 million of which are members of consecutive
blocks) were selected at random and assigned a rate to describe their methylation from
theU(0; 1) distribution. Between twenty and one hundred million nucleotide reads were
43
simulated assuming a 99% bisulte conversion rate and then either translated to color
sequences or kept in their native nucleotide format. In both cases dierent uniform error
rates were imposed on the reads between zero and 5%. The three reference translation
strategy was used in both cases; PerM was set to provide sensitivity to up to ve color or
nucleotide substitutions. After ambiguous alignments were ltered alignment accuracy
of approximately 98.8% (Color space) and 99.4% (nucleotide space) was achieved. Com-
putationally, the algorithm performed eciently in all cases, using less than three CPU
hours and less than 500MB of memory to estimate the methylome corresponding to over
14-million reference cytosines, even when the mean read depth was over 40x.
As expected the size of the credible interval decreased as the read depth increased.
This is displayed in Figure 3.8, which suggests that read depth of approximately 25X is
necessary for FadE to produce accurate estimations of the level of methylation for color
space.
To compare the algorithm's performance in color space to the canonical method reads
simulated according to dierent uniform error rates (0%,1.5%,3%, and 5%) were aligned
with PerM and then translated to SAM format (nucleotide space format) using a popular
dynamic programming algorithm for color translation [69]. After translation, information
regarding color quality is lost, and the the maximum likelihood estimation for methylation
is simply the percentage of "C" nucleotides aligned to each reference cytosine. The
mean dierence between the estimated and true parameter was used as a test statistic to
compare FadE and the translation method. Table 3.2 displays that the translation method
provides an accurate estimation of methylation for isolated cytosine nucleotides when the
44
Figure 3.5: At a 1.5% Error rate, 20X read depth appears to be sucient for FadE to
produce accurate results. Shown here are the average size of the calculated 90% credible
interval and the percentage of positions for which the simulated parameter
j
exists in
the credible interval. At 20x coverage the average credible interval length is ^
j
0:13 and
the interval includes
j
in 92% of cases.
reads are error-free, but that only FadE is robust to high rates of color-sequencing errors
and also capable of determining methylation for consecutive cytosine nucleotides.
The nucleotide space implementation of FadE was also compared popular methods
which use the binomial distribution to infer the methylation rate by the relative number
of cytosine or thymine alignments to each reference cytosine. The same methylation
distribution, bisulte conversion rate, and read error rates as used for the color space
simulations were used for nucleotide space. The major dierence in nucleotide space was
that FadE reported slightly more accurate point estimates for ^ but also returned larger
credible intervals ( ^
j
0:15 at 25x read depth vs. ^
j
0:13 for color alignment). The larger
credible interval size owes to the fact that a single emission does not provide as much
information as the two color-quality tuples observed in color space. Valid consecutive color
45
Table 3.2: Chr 21 Simulation Results
Isolated Adjacent
Error-Rate CT-Method FadE CT-Method FadE
0.0% 0.066* 0.064* 0.281 0.070
1.5% 0.107 0.094 0.301 0.096
5.0% 0.162 0.110 0.312 0.116
*The dierence is here is not less than the standard error in the twenty trials.
sequencing errors are much more rare (see equation 3.2 vs 3.13 ), than single nucleotide
errors, and as such misleading errors cannot as easily be discounted causing uncertainty
which leads to larger credible intervals. The results of the nucleotide space simulation
are shown in Table 3.3, which displays the comparison between FadE and the popular
binomial model for methylation estimation. In this Table it can be noticed that in the
absence of sequencing errors FadE actually returns the same value as the binomial method
because they emission rates without sequencing errors are always one, and the maximum
likelihood estimation converges to the relative rate of cytosine vs thymine nucleotides at
each position.
46
Table 3.3: Comparison to sequence space pileup estimation at 50x average read coverage.
Shown is the mean absolute dierence between the simulated parameter and the value for
^
j
returned by FadE and the percentage of cytosines relative to thymine (pileup) aligned
to the location. Additionally, the size of the 90% credible interval returned by FadE and
the percentage of locations where the simulated parameter was contained in the interval
are shown. Note that in the absence of sequencing error, the algorithm implemented in
FadE returns the parameter equal to the rate of cytosines aligned to the location.
Dierence 90% Credible Interval
Error-Rate Pileup-Method FadE Size % Contained
0.0% 0.057* 0.057* 0.178 0.93
1.5% 0.063* 0.062* 0.201 0.92
3.0% 0.074 0.069 0.218 0.90
5.0% 0.079 0.072 0.234 0.89
*The dierence is here is not less than the standard error in the twenty trials.
47
Human Methylome
Data Acquisition and Alignment
To test the scalability of the algorithm to large data sets and provide a more robust
comparison to well established nucleotide-space literature, two full slides of the SBT IMR-
90 (broblast cell-line) fragment libraries were sequenced. This cell line was previously
studied in an whole methylome analysis conducted by Lister et al [23]. In total 1.1 billion
unidirectional (forward strand) 50 and 25 color paired end reads were generated by the
SOLiD 4 system. The initial sample was spiked with phage lambda DNA that had been
digested by the restriction endonuclease ALu-I. The phage lambda DNA is thought to
have little to no methylation and as such it served as a control to validate that the
bisulte conversion was complete and also allowed the estimation of emission rates in the
unmethylated state (M = 0) to be estimated. The three-reference alignment protocol was
carried out with a substitution threshold of three and two-adjacent errors in the 50 and
25 color sequences, respectively. Alignments which did not pair ends within the 1200bp
fragment window were discarded, which resulted in mean read depth of 3.8x. In total
there were 23.8 million ( 85%)CpG sites with at least one read alignment and an average
read depth of 5.14 reads. The alignment was separated by chromosome and FadE was
run in parallel which required a total CPU time of 6 hours on twenty-four machines. The
mean methylation rate across all CpG sites returned was close to 63%, and was skewed
toward full methylation for all CpG sites but bimodal for sites inside of CpG islands.
It total FadE analyzed 13,841 CpG islands where more than ten CpG sites had greater
than 20x read depth, using the CpG island denition rst proposed by Takai et al [75].
48
0 20 40 60 80 100
Methylation Rate (%)
0.00
0.02
0.04
0.06
0.08
0.10
0.12
0.14
0.16
0.18
Fraction of Positions
All CpG Positions
CpG Island Positions
Figure 3.6: Analysis of experimental SOLiD color space data taken from stem cell line
IMR90 shows that the distribution of methylation rates diers in CpG positions within
CpG-islands from that of isolated CG positions. Across Hg19 the global distribution of
"CG" dinucleotides is 1%. We dened a CpG-island as a region of at least 500bp and 50
and greater than 5% "CG" dinucleotides.
Figure 3.6 illustrates the bimodal distribution observed in CpG islands and compares it
to the overall distribution for methylation. In total 435 islands had an average methy-
lation rate less than ten times the global average. More than half of these (218 of 435)
were within 1000bp of a gene, while less approximately 23% of all CpG islands were
within 1000bp of a gene, which suggests that these hypomethylated islands are involved
in regulating genes. When the regions around genes were examined in greater detail one
common pattern that emerged near CpG island promoter regions was highly methylated
CpG positions upstream of the promoter and downstream from the transcription start
site, but extreme hypomethylation near at the transcription start sites. We carried out
detailed analysis on all genes which had likely CpG island promoters (regions between
500 and 2000 nucleotides with greater than 5% CpG content). For each such site we
49
Table 3.4: Methylation rates at three regions near CpG promoters in ve genes
Gene Name Loc CpG Isl. Upstr CpG Isl Dwnstr
BIVM chr13:103455399 -750:1750 0.86 0.046 0.59
VAPA chr18:9918000 -750:1250 0.85 0.077 0.63
UTP14c chr13:52586534 -1250:750 0.62 0.081 0.89
EGR1 chr5:137787179 -2000:1250 0.82 0.097 0.86
TBCB chr19:36605888 -1000:1250 0.80 0.13 0.81
calculated the methylation rate upstream of the promoter region, from the promoter to
the transcriptional start site, and downstream of the transcriptional start site. Shown
in Table 3.4 are the rates across such regions for ve dierentially methylated CpG is-
land promoter regions. For each site we separated the regions by 250bp increments such
that large changes in methylation rate were apparent; this illustrated the obvious lack
of methylation which existed only near the CpG island promoter. An illustration of this
pattern is shown in Figure 3.7 in which 100 base pair windows are binned and averaged
for each of the ve genes. Despite dierences in promoter lengths, a shared pattern of
hypomethylation near the transcriptional start site is observed. Additionally, Figure 3.8
illustrates the pattern for a single gene.
Comparison to nucleotide space analysis
FadE's results concerning the low-coverage color-space epigenome analysis were com-
pared to a study carried out using high coverage nucleotide space reads which were also
sequenced from the IMR-90 broblast cell line [23]. Lister et al. aligned over 91.0 gi-
gabases of SBT nucleotide space reads using the ELAND [65] algorithm , to achieve an
50
Figure 3.7: The average methylation rate for ve genes containing CpG island promoters
average coverage level of 14.8x to the human genome. Although FadE was ran on a much
lower coverage dataset (mean coverage of 3.8x), both protocols found very similar average
methylation rates (62.7% vs 63.5%) and a very similar methylation distribution across
CpG cites. The comparison included over 23 million CpG sites to which both protocols
had a least one aligned read with which to estimate methylation levels. In over 89% of
the sites analyzed by both protocols, the parameter estimated by Lister et al was con-
tained in the credible interval returned by FadE. Additionally, in fewer than ten percent
of positions analyzed by both methods did the estimated methylation parameter vary by
more than 30%. A summary of the the comparison between these methods in shown in
Table 3.5 Additionally, FadE and Lister et al's result were concordant with respect to the
distribution of hypomethylated regions, regions of greater than 10kb with less than than
70% average methylation. In 96% of the 110,748 covered regions described by Lister,
51
Figure 3.8: The methylation rate adjacent to the transcriptional start site of gene BIVM
FadE too found an average of less than 70% methylation. Globally, fewer than 30% of
10kb regions have less than 70% methylation.
To better estimate whether the non-similar positions returned by the two protocols
was a result of algorithmic dierence or just noise between datasets, a nucleotide imple-
mentation of FadE was also used to estimate methylation on the nucleotide alignment
produced by Lister et al. Since control SBT alignment was not available, FadE was
only able to estimate a position-specic error rate from the alignment to non-cytosine
bases. Using these error rates FadE was able to produce point and credible estimations
for methylation for over 42 million of CpG site examined by Lister et al. The results
of this sequence space comparison were strikingly similar, with an average dierence of
52
Table 3.5: The methylation rate at 23,152,801 IMR90 CpG positions analyze by FadE
using color space data and Lister et al. using nucleotide space data
FadE Lister et al Validation Rate*
Global Rate 62.7% 63.5% -
Total Shared Sites 23,152,801 23,152,801 89:1%
^
j
< 0:25 5,646,187 4,106,700 84:8%
^
j
> 0:75 10,943,429 11,160,396 91:1%
cov> 10 1,030,308 - 92:9%
**The Validation Rate refers to the percentage of positions where the methylation rate calculated by Lister et al.
fell within the credible interval returned by FadE
methylation estimation of only 2.1% between platforms. This similarity suggests the the
minor dierences between FadE's color space implementation and Lister et al's high cov-
erage results in sequence space result largely from biological variation or dierences in
sequencing platforms rather than dierences between FadE's optimization routine and
Lister et al's binomial model. These results are summarized in Table 3.6.
53
Table 3.6: Comparison of FadE and the results obtained by Lister et al on the same
nucleotide space alignment le
FadE Lister et al Validation Rate*
Global Rate 62.7% 63.5% -
Total Shared Sites 41,685,156 41,685,156 99:6%
^
j
< 0:25 7,023,184 7,023,184 99:7%
^
j
> 0:75 23,158,054 23,158,054 99:3%
cov> 10 122,400,411 - 99:8%
*Validation occurs where Lister et al's estimate falls within FadE's credible interval
54
Chapter 4
Biological Applications: RNA-DNA Dierences in a single
individual
This chapter discusses how we have applied our statistical tools to a large human data
set which includes paired mRNA and DNA reads from ve dierent individuals. This
chapter discusses our investigation into the discovery of RNA editing sites and estimation
of editing rates in the human brain.
4.1 Survey on RNA Editing
The DNA sequence is the most stable vessel in which biological information is stored. The
central dogma of molecular biology describes the transcription of the genetic code from
DNA to RNA and the translation from RNA to the proteins which are the building blocks
of eukaryotic life [25]. This sequential transfer of information suggests that the study of
stable DNA sequences can be used to infer the sequence and function of the diverse
and complicated set of protein products and provide new ways to combat disease and
understand biological diversity. One of the rare exceptions to this sequential transfer of
information involves single nucleotide editing of transcribed RNA sequences, which alter
the nal protein sequence such that it diers from what what would be predicted by the
DNA [6]. One of the most common editing mechanisms mediating RNA editing in humans
55
involve the selective deanimation of adenosine nucleotide to inosine by the ADAR1 and
ADAR2 enzymes [38]. The inosine nucleotide is read as guanine by the cells translational
machinery, producing which functions as a single nucleotide change from adenosine to
guanine. This mechanism has been showing to alter gene splicing, protein function, and
coding capacity [71]. The investigation of RNA editing using next generation sequencing
has resulted in the identication of a large number of possibly RNA editing sites, [20, 39].
Additionally, a recent study has provided evidence for not only a large number of RNA
edits, but widespread dierences between RNA and DNA nucleotides between all possible
base combinations [63]. This work has spurred the interest of the scientic community
to attempt to discern between whether such sites are the result of previously unknown
biological mechanisms or systematic errors and alignment artifacts. The study of a rare
event using a large, repetitive, and very incompletely annotated human genome reference
sequence is unfortunately rife with avenues in which false positives can be discovered. Our
endeavors in this study involved two aims: rstly we attempted to provide a model and
set of methods which accurately lter spurious RNA-DNA dierences. Secondly, when
evidence exists that an observed editing site is not the result of systematic alignment
error, we attempt to provide a statistical method to accurately assess the degree of RNA
editing occurring at the site.
56
mRNA Read C G U U G G C G U
mRNA Read U A G G C T U A
mRNA Read A C G U U G G C T
mRNA Read U U G G C G U A C
DNA Read C G T T G A C G T A
DNA Read A C G T T G A C T T
DNA Read G C T G A C T T A C
DNA Read C G T T G A C G T A
Reference Sequence A C G T T G A C G T A C
Figure 4.1: This gure depicts both an RNA editing event (transition from A to G in the
RNA but not the DNA) as well what is likely a heterozygous SNP, mutation from G to
T which is shown in both DNA and mRNA.
4.2 Methods to locate and classify RNA editing
Aim 1: Filters and methods to reduce the likelihood of error
The naive method for the discovery of RNA editing involves rst performing SNP-Calling
using DNA and mRNA from a single individual, after which non-concordant SNP calls
between the DNA and mRNA are classied as the result of RNA editing. While this
method is certainly a logical rst step to generate candidate sites (Figure 4.1), it does
not consider the many alignment artifacts which cause a great number of spurious editing
sites to be identied. In the following list we attempt to summarize a large number of
the sources of noise or systematic error which will dominate the set of sites produced by
such a method:
1. Genomic Heterozygosity
Description: Genomic heterozygosity can quite easily be incorrectly be interpreted
as an RNA editing event. If only one of two nucleotides in a heterozygous in-
dividual is observed in the DNA reads but both are present in the RNA reads,
57
editing may be incorrectly concluded. If we assume an individual has ap-
proximately fty thousand heterozygous positions in the transcriptome and
an average genomic coverage of approximately ten reads, then we can approx-
imate the number of sites in which only one of the two alleles (X or Y) is
observed with the following back of the envelope calculation:
P (X
i
= 0jY
i
= 0) =
1
2
10
(2 50000) 100 (4.1)
If RNA editing is indeed an rare event, then such a value for false positives may
be intolerable. Additionally, it should be noted that this back of the envelope
calculation does not consider the systematic biases or errors associated with
the sequencing process which result in a non-uniform sequencing distribution
and may cause more heterozygous genomic sites to be improperly classied.
Solution The straightforward solution to this problem is to require a large num-
ber of high quality genomic reads before a site is classied as homozygous.
Additionally databases such as dbSNP [14] can be consulted and sites which
correspond to SNPs in other studies can be ignored.
2. Sequencing Error and PCR Duplication
Description: Even if a large number of genomic reads correctly identify a homozy-
gous site, sequencing errors in the RNA reads may result in the presence of
an another spurious base. Although sequencing errors rates are relatively low
at high quality bases, PCR duplication may result in a large number of reads
being present which contain the same sequencing error.
58
Solution: To avoid the propagation of error through PCR duplication transcrip-
tomic sites which contain a great number of aligned mRNA reads which provide
evidence of an RNA-editing event but have identical sequence and alignment
location can be ignored. Additionally, if multiple biological replicates exist,
then the requirement that the edited base be observed in multiple tissues
above some background threshold level will largely remove the problem.
3. Splicing and Unknown Exons
Description: If mRNA reads which were sequenced from sites which span exon-
junctions are aligned to the genome, the true alignment to the genome will
include a large genomic deletion. If an ungapped alignment is performed then
the majority of such reads will nd no suitable alignments to the genome.
However, reads which span an exon junction toward the beginning or end of
the read will nd suitable low mismatch alignments for the majority of the
sequence, and provide false evidence of editing in either the beginning or end
of the alignment. Alignment of reads to the transcriptome will improve the
problem but will not ameliorate it completely, because reads sequenced from
novel exon-junctions will still provide spurious evidence of editing.
Solution: The most straightforward solution to this problem is to require evidence
for the edited base to be provided from bases along the length of the entire
read, unless the edited site is near the beginning or end of the isoform. An
example of such a lter is shown in Figure 4.2. To calculate an explicit p-
value in the event that the site in question is not near an isoform boundary we
59
can consider an special case of the classic balls and urns statistical problem,
where each index into the read represents an urn and each observed alignment
represents a ball. If we assume an underlying uniform distribution then the
distribution on the random variable E, which represents the number of empty
urns (or read indexes not included in the alignment) can be described with the
following recursive representation:
P (E =kju;b) =
8
>
>
>
>
>
>
>
>
>
<
>
>
>
>
>
>
>
>
>
:
1; if b = 0 and k =u
0; if uk = 0 or b<=uk
ue
u
P (E =kjb 1;u) +
e+1
u
P (E =k + 1ju;b 1); else
(4.2)
Because the length of the reads is constant in an our experiments we can pre-
compute p-values for the null-hypothesis that each of the read indexes observed
aligned to the site is drawn from a uniform distribution. Extremal p-values
thus provide evidence that many reads may be incorrectly aligned due to a
splicing event. Additionally, this lter may serve to eliminate evidence which
results from PCR duplication.
4. Pseudo-Genes or Gene Duplication
Description: The presence of pseudo-genes or mutated gene duplications may re-
sult in false evidence of rna-editing. In the event that a similar pseudo-gene
or non-functional gene duplication is transcribed but not translated (and thus
60
not annotated in the transcriptome) reads sequenced from this gene will align
incorrectly to the region corresponding to the protein coding gene causing
sequence substitution which will not be present in DNA reads. Addition-
ally, annotated pseudo-genes whose functional copies are not annotated will
also provide an incorrect reference sequence which will also cause incorrect
sequence substitutions.
Solution Pseudogenes and gene duplications may be very problematic if they either
produce mature mRNA but don't have an annotated sequence in the reference
or do not produce mature mRNA and include an annotated reference sequence.
Many pseudo-gene alignments can be ltered out by realigning all reads seg-
ments which span an editing site to the human reference and mitochondrial
sequences whilst allowing a large number of substitutions, to evaluate whether
the site shares homology with other genomic sequences.
Aim 2: Accurately determine level of RNA-editing
If a signicant number of aligned reads include nucleotides which provide evidence of
RNA-editing and the alignment does not appear to be the result from one of the sources
of error described in Aim 1, then it is necessary to accurately estimate the level of RNA
editing. To best estimate the level of RNA editing we model editing as a black box process
which occurs sometime between the sequencing of the DNA and mRNA, which allows us
to only slightly modify the algorithm used in FadE to predict the level of editing. Rather
than methylation, we instead attempt to infer the rate of a dierence parameter which
describes the rate at which DNA nucleotides dier from those in RNA. As described in
61
Figure 4.2: Positional Bias in RNA Editing
Shown are the aggregate count of read positions in our alignment for two of the RDD's reported by Chung et al
[63]. Notice that while positions along the length of the read align to the reference base, the support for the edited
base comes almost entirely from the ends of the read, suggesting that a incorrect splice site spanning alignment is
responsible for the result.
Aim 1, we consider only sites with sucient evidence of DNA homozygosity. Assuming
that the DNA is homozygous for the nucleotide X and the RNA includes the edited base Y
we can build up a model for RNA editing by rst considering the probability of observing
any read alignment (r) to an edited positions (n
i
) when
i
is known:
P (rj
i
) =P (rjn
i
=Y )
i
+P (rjn
i
=X)(1
i
) (4.3)
62
whereP (rjn
i
=Y )
i
andP (rjn
i
=X) represent the probability of observing the read
when the nucleotide being sequenced at position i is X or Y. As described in Chapter 3, we
can infer these probabilities by rst estimating call rates. Recall, that if read alignments
are unique and independent then the probability of observing the entire set of reads R
i
which align to the site n
i
is:
P (R
i
j
i
) =
Y
rR
i
P (rj
i
) (4.4)
and that if we apply Bayes theorem to the conditional probability in equation 4.4, we
can calculate the inverse or the posterior probability for the editing rate:
f(
i
) =P (
i
jR
i
) =
P (R
i
j
i
)P (
i
)
P (R
i
)
(4.5)
WhereP (
i
) is the prior distribution for
i
andP (R
i
) is a normalizing constant which
can be calculated by integrating over support of the distribution:
P (R
i
) =
1
Z
0
P (R
i
j
i
) =
1
Z
0
Y
rR
i
P (rj
i
) (4.6)
Notice that at this point the only dierence from the method described in FadE is
that the prior probability of editing is likely weighted toward zero because editing is a
rare phenomenon. With this exception, we can calculate point estimates and credible
intervals for the dierence parameter using Newton-Raphson optimization and numerical
integration. This is further explained in Chapter 3, section 3.2.
63
4.3 RNA Editing Results
Alignment, SNP-Calling and Candidate Generation
To assess the level of RNA-editing that takes place in the human brain we sequenced
mRNA from at least fteen distinct brain regions in ve dierent individuals using the
Solexa HiSeq machine. DNA from a single tissue (cerebellum) was also sequenced
One-hundred basepair paired-end reads and seventy-six basepair mRNA reads were
aligned to the human genome (build hg19 [33]) and the human reference transcriptome
(gencode version 10 [33]) using the alignment algorithm PerM [12] and the iterative SNP-
Calling iterative described by the ComB software [74]. In each iteration only reads which
were unique and contained fewer than ve substitutions in each end and eight in total
were accepted. In total, ve genomic iterations required 713 CPU hours and resulted in
the identication of approximately ve million SNP candidates.
We dened SNP candidates as positions where less than 95% of reads matched the
reference and more than two reads covered the position, these results are summarized in
Table 4.1. The iterative alignment and SNP-Calling of each of the 79 RNA tissues to the
transcriptome required approximately 1200 CPU hours and produced between 30,000 and
100,000 transcriptomic SNP-calls per tissue. In aggregate, 169,510 sites were identied as
possible SNP-Candidates of which 106,101 were identied as candidates for RNA editing
(RNA editing candidates were dened as positions with greater then 10x read depth and
with greater than 50% of reads not matching the reference, in at least one tissue). These
sites were compared against SNP calling resulting from the DNA reads. Positions whose
corresponding genomic positions were also identied as SNP-candidates or did not include
64
Table 4.1: Results of iterative DNA Alignment and SNP-Candidate Generation
Sample ID First Iteration Last Iteration SNP Candidates Identied
Total Reads Aligned Aligned Total
HSB123 581,017,521 423,920,128 456,052,832 5,968,253
HSB126 521,862,972 450,319,516 463,756,552 5,020,659
HSB135 611,792,970 516,004,842 535,113,900 5,794,439
HSB136 629,824,522 540,487,720 558,238,922 5,805,750
HSB145 611,112,507 540,304,764 552,889,520 5,165,082
alignment in one of the ve DNA samples were eliminated, which resulted in 7,152 unique
candidates for RNA editing.
Candidate Evaluation and Filtration
To evaluate each of the 7,152 unique editing candidates, we extracted from the transcrip-
tome the seventy-ve basepairs which precede and follow each candidate position in each
isoform. These 151 basepair candidate regions were then annotated and concatenated to
form a new reference sequence for read alignment. The 76 base pair mRNA reads from
each tissue were realigned to the regions using two dierent alignment protocols, one in
which no greater than three substitutions was allowed, and another in which up to ten
substitutions were allowed. In both cases the last two aligned bases from the beginning
and end of each read were trimmed to provide a 72 basepair alignment and the results
from each tissue and sample were combined and the positions classied as shown in Table
4.2.
While the majority of positions surveyed in the realignment (4,102 of 7,152) showed
greater than 5% editing in at either the high or low substitution protocol, only a small
percentage of these displayed high-coverage (> 20 reads) unbiased editing in one or more
tissues. A total of 509 positions were classied as an unbiased edit (see Table 4.2) using
65
Table 4.2: Classication of Candidate Region from aggregate realignment. Shown be-
low are the four classications and the criteria required to make each classication.
R
M
,R
E
,and R
N
represent that number of reads which match the reference base, the
candidate edited base, or one of the two remaining bases, respectively. Additionally,
P (Urn) represents the P-value for read location bias described in the preceding section.
Finally, Q
i
represents the number of locations in the ith quartile where a read index was
observed in the alignment. Each quartile represents 18 positions.
Classication Requirement Description
MATCH R
M
> 20 & %R
E
< 5% Reads match reference
NO EXPRESSION or NOISE R< 20 or %R
N
> 5% Insucient Coverage
BIASED PSEUDO-EDIT P (Urn)< 0:0001 or Q
i
< 5 Biased Evidence for Editing
UNBIASED EDIT P (Urn)> 0:0001 & Q
i
> 58i Unbiased Evidence
both alignment protocols or were classied as an unbiased edit using the high substitu-
tion protocol and did not obtain sucient coverage using the low substitution protocol.
To further evaluate these sites we realigned each of the reads which aligned one of these
509 regions to the entire human genome and human mitochondrial sequence allowing full
sensitivity to eight substitutions and high sensitivity to up to ten substitutions. For each
read alignment the dierence between the number of substitutions in the alignment to
the editing candidate region and the location on the lowest substitution alignment to the
reference or mitochondrial genome was recorded. Sites in which more than 10% of the
reads contained alignments to the reference or mitochrondrial genome within one substi-
tution of the alignment to the candidate location were removed. This lter eliminated an
additional seventy-four editing candidates, resulting in 435 unique sites which displayed a
high probability of RNA-editing. Approximately 97% (422 of 435) represented anA!G
editing event.
As shown in Figure 4.3, the majority of the edited sites appear occur in the 3' UTR
regions of the genes. Twenty one of the 435 sites were annotated as exonic and were
located in sixteen dierent genes, as shown in Table 4.3. Ten of the sixteen genes included
66
Figure 4.3: The majority of edited sites were located in the 3' UTR Regions of genes.
Intergenic and intronic sites also compromise a large number of the editing events.
Figure 4.4: While there was little evidence of a strong upstream or downstream motif
around edited sites, guanine was extremely underrepresented preceding edited adenosines
(3.6%) and overrepresented following edited adenosines (over 50%)
a GO annotation which included a transmembrane role. Of the ten transmembrane
genes, four (GRIA2, GRIA3, GRIK1, and GRIK2) involved the glutamate system, two
(CACNA1 and CADPS) involved calcium channels, two (TMEM and MFN1) did not
have specic annotation, and one involved the seratonin (HTRC2) and GABA (GABRA)
neurotransmitter systems. Finally, it should be noticed that while there was not sucient
evidence regarding strong upstream or downstream motifs, there was a strong local motif
directly near edited adenosines (Figure 4.4).
67
Table 4.3: Exonic Editing Events - Shown are the aggregate counts in the combined tissue
which supported the edit or matched the reference read for twenty-one exonic edits
Chr pos Type Isoform Count Ref Reads Edited Reads Gene
10 102777342 AG 3 214 854 PDZD7
10 103558604 GA 6 2576 440 MGEA5
15 75646087 AG 4 10 144 NEIL1
21 30953750 AG 12 159 257 GRIK1
2 27522496 AG 3 9 82 TRIM54
3 179093028 AG 8 377 79 MFN1
3 53820892 AG 7 228 106 CACNA1D
3 62423807 AG 8 3463 1095 CADPS
4 158281294 AG 7 1150 1682 GRIA2
5 156736808 AG 14 2341 5755 CYFIP2
6 102337689 AG 7 160 264 GRIK2
6 102337702 AG 7 78 299 GRIK2
6 102372589 AG 7 64 413 GRIK2
6 44120349 AG 6 2385 844 TMEM63B
6 84326687 GT 3 15 6209 SNAP91
X 114082682 AG 4 100 208 HTR2C
X 114082684 AG 4 173 128 HTR2C
X 114082689 AG 4 148 156 HTR2C
X 114082694 AG 4 141 167 HTR2C
X 122598962 AG 5 57 1181 GRIA3
X 151358319 AG 5 160 1012 GABRA3
Conclusion
Our method locates far fewer RNA-editing events than other recently published work
[63, 20]. While widespread non-canonical RNA editing may not actually occur, it is
important to take note of the careful normalization and ltration that must be carried
out to eliminate evidence of spurious edits. We believe that our data suggests that
spurious edits are less likely a result of sequencing errors caused by the sequencing machine
but rather an artifact which arises from an incompletely annotated transcriptome, an
incomplete knowledge of the location and mechanisms behind splicing events and a large
number of transcribed pseudo-genes and gene duplications which have not been previously
68
described. That our method increases the relative proportion of canonical editing (A!
G) is evidence that our statistical methods serve to eliminate largely false positives.
Additionally only thirty of the 435 positions we located have been previously annotated
as SNPS in the dbSNP database [14], and only 2 of the 30 were discovered using genomic
DNA. The other 28 were discovered using cDNA samples, further suggesting that they
may likely be real editing sites.
69
Chapter 5
Biological Applications: RNA-DNA Dierences in a single
individual
This chapter provides a brief description of the future direction of our research into
the characterization of dierences between human DNA and RNA. In this chapter we
introduce two types of ongoing research which include the characterization of isoform
specic RNA editing and allele specic expression.
5.1 Future Research
Allele Specic Expression
Allele specic expression (ASE) is genetic phenomena in which the aggregate gene expres-
sion between two alleles diers from the 1:1 ratio observed in the DNA. In the absence of
ASE transcribed heterozygous genomic loci are expected to express roughly equal num-
bers of each allele. ASE violates classical Mendelian inheritance without any changes to
nucleotide composition of DNA or RNA and was rst observed to take place as a result
of genomic imprinting, a process in which the methylation pattern completely silences
either the maternal or paternal allele. In imprinted genes, mutations may elicit dierent
phenotypic results depending on whether the paternal or maternal allele was aected
70
[19]. Recently, advances in sequencing technology have found evidence of widespread
ASE in which the level of transcription of the each allele is altered but not completely
silenced [24]. To investigate the level of both complete and relative ASE in the human
brain we have examined the allele ratios at expressed heterozygous loci. While our initial
investigation reveals statistically signicant deviation from a 1:1 ratio at approximately
25% of heterozygous positions, we aim to develop a model which is as robust to error as
the method described for ltering rna-editing candidates. Currently, we have observed a
moderate correlation between the allele ratio in DNA and mRNA, which suggests that
there may be sequence specic biases aecting the observation of dierent alleles. There-
fore, the assumption of a 1:1 sequencing ratio at heterozygous loci may poorly re
ect
the background distribution and result in the overestimation of ASE. We are currently
developing a model which considers this bias to more accurately estimate the level of ASE
in the brain.
Isoform Specic RNA editing and Transcriptome Characterization
The Urn Model lter used to eliminate bias in the read positions which span edited sites
discussed in the previous section eliminates approximately 90% of high coverage editing
candidates. While the elimination of such positions (see Figure 4.2) in the interest of
greater specicity is necessary, that such bias occurs so often merits investigation. The
likely culprit behind the majority of biased observations are reads which span a splice
junction between an annotated exon and an unannotated exon. While the number of
unannotated exons is not known, our aggregate mRNA reads align to over 6% of the
genome while the annotated transcriptome encompasses just over 1% of the bases in
71
the genome. While alignment does not necessarily prove transcriptional activity, reads
whose sequence span both known exons and such regions likely indicate the presence of
unknown exons. A more complete brain transcriptome should both reduce the number
of edit candidates with biased alignment and allow isoform level analysis to be carried
out. While most of our relatively short 76 basepair reads will not span the necessary
number of splice junctions to suggest one specic isoform rather than another, we may
be able to infer the set of isoforms expressed in a particular tissue using the reads which
span one or more exon junctions. While a complete isoform level analysis of RNA editing
or expression map may require longer reads, we are attempting to rst provide a more
accurate estimation of the transcriptome observed in our read set. Such a transcriptome
will allow the investigation into whether or not tissue specic dierences in editing rates
are the result of modications to the editing machinery in dierent tissues or dierential
expression between edited or non-edited isoforms. Additionally, we hope that a more
accurate view of our transcriptome will allow us to investigate whether dierent alleles
aect the abundance of particular isoforms.
72
Bibliography
[1] N. S. R. Archive. Solid system ecoli dh10b data set.
http://www.ncbi.nlm.nih.gov/sra/SRX000353.
[2] V. Bansal. A statistical method for the detection of variants from next-generation
resequencing of dna pools. Bioinformatics, 26(12):318{324, 2010.
[3] F. Barras and M. Marinus. The great gatc: Dna methylation in e. coli. Trends in
Genetics, 5(0):139 { 143, 1989.
[4] A. Biosystems. Principles of di-base sequencing and the advantage of color space
analysis in the solid system, 2008.
[5] C. A. Bormann Chung, V. L. Boyd, K. J. McKernan, Y. Fu, C. Monighetti, H. E.
Peckham, and M. Barker. Whole methylome analysis by ultra-deep sequencing using
two-base encoding. PLoS One, 5(2), 2010.
[6] A. M. e. a. Brennicke, A. "rna editing". FEMS Microbiol Rev, 23:297{316, 1999.
[7] W. Brockman, P. Alvarez, S. Young, M. Garber, G. Giannoukos, W. L. Lee, C. Russ,
E. S. Lander, C. Nusbaum, and D. B. Jae. Quality scores and snp detection in
sequencing-by-synthesis systems. Genome Res, 18(5):763{770, May 2008.
[8] M. G. R. Center. Sxoligosearch, 2009.
[9] M. J. Chaisson, D. Brinza, and P. A. Pevzner. De novo fragment assembly with
short mate-paired reads: Does the read length matter? Genome Res, 19(2):336{346,
2009.
[10] M. J. Chaisson and P. A. Pevzner. Short read fragment assembly of bacterial
genomes. Genome Res, 18(2):324{330, 2008.
[11] Y. Chen. Mapping trillions of short dna reads to a reference genome and its appli-
cations. USC, 2011.
[12] Y. Chen, T. Souaiaia, and T. Chen. Perm: ecient mapping of short sequencing
reads with periodic full sensitive spaced seeds. Bioinformatics, 25(19):2514{2521,
2009.
[13] D. F. Conrad, M. Jakobsson, G. Coop, X. Wen, J. D. Wall, N. A. Rosenberg, and
J. K. Pritchard. A worldwide survey of haplotype variation and linkage disequilib-
rium in the human genome. Nat Genet, 38(11):1251{1260, 2006.
73
[14] N. L. o. M. Database of Single Nucleotide Polymorphisms (dbSNP). Bethesda (MD):
National Center for Biotechnology Information. dbsnp: the ncbi database of genetic
variation. Nucleic Acids Res, 29(1):308{311, Jan 2001.
[15] K. Davies. Illumina drops personal genome sequencing price to below 20,000, 2010.
[16] A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg.
Alignment of whole genomes. Nucleic Acids Res, 27(11):2369{2376, Jun 1999.
[17] I. Ebersberger, D. Metzler, C. Schwarz, and S. P a abo. Genomewide comparison of
dna sequences between humans and chimpanzees. Am J Hum Genet, 70(6):1490{
1497, Jun 2002.
[18] G. Egger, G. Liang, A. Aparicio, and P. A. Jones. Epigenetics in human disease and
prospects for epigenetic therapy. Nature, 429(6990):457{463, May 2004.
[19] B. et al. Inherited microdeletions in the angelman and prader willi syndromes dene
an imprinting centre on human chromosome 15. Nature Genetics, 9:395{400, 1995.
[20] C. et al. Genome-wide identication of human rna editing sites by parallel dna
capturing and sequencing. Science, 324:1210{1213, 2009.
[21] F. et al. Increased methylation variation in epigenetic domains across cancer types.
Nature Genetics, 43:768{775, 2011.
[22] L. et al. Initial sequencing and analysis of the human genome. Nature, 409, Feb
2001.
[23] L. et al. Human dna methylomes at base resolution show widespread epigenomic
dierences. Nature, 462:315{322, 2009.
[24] P. et al. Allele-specic gene expression is widespread across the genome and biological
processes. Plos ONE, 4(1):1371, 2009.
[25] C. F. Central dogma of molecular biology. Nature, 227:561{563, 1970.
[26] P. Ferragina and G. Manzini. Opportunistic data structures with applications. Pro-
ceedings of the 41st Annual Symposium on Foundations of Computer Science, p. 390,
2000.
[27] T. C. GLENN. Field guide to next-generation dna sequencers. Molecular Ecology
Resources, 11(5):759{769, 2011.
[28] F. Hach, F. Hormozdiari, C. Alkan, F. Hormozdiari, I. Birol, E. E. Eichler, and S. C.
Sahinalp. mrsfast: a cache-oblivious algorithm for short-read mapping. Nat Methods,
7(8):576{577, 2011.
[29] T. D. Heightman. Therapeutic prospects for epigenetic modulation. Expert Opin
Ther Targets, 15(6):729{740, Jun 2011.
74
[30] R. Holliday and J. Pugh. Dna modication mechanisms and gene activity during
development. Science, 187:226{232, 1975.
[31] N. Homer, B. Merriman, and S. F. Nelson. Bfast: an alignment tool for large scale
genome resequencing. PLoS One, 4(11), 2009.
[32] Y.-W. Huang, T. H. M. Huang, and L.-S. Wang. Proling dna methylomes from
microarray to genome-scale sequencing. Technol Cancer Res Treat, 9(2):139{147,
2010.
[33] D. Karolchik, R. Baertsch, M. Diekhans, T. S. Furey, A. Hinrichs, Y. T. Lu, K. M.
Roskin, M. Schwartz, C. W. Sugnet, D. J. Thomas, R. J. Weber, D. Haussler, W. J.
Kent, and University of California Santa Cruz. The ucsc genome browser database.
Nucleic acids research, 31(1):51{54, January 2003.
[34] R. Karp. Ecient randomized pattern-matching algorithms. IBM J. Res. Develop,
31(31):249, 260, 1987.
[35] J. E. Karro, M. Peifer, R. C. Hardison, M. Kollmann, and H. H. von Gr unberg.
Exponential decay of gc content detected by strand-symmetric substitution rates
in
uences the evolution of isochore structure. Mol Biol Evol, 25(2):362{374, Feb
2008.
[36] Z. Khan, J. Bloom, L. Kruglyak, and M. Singh. A practical algorithm for nding
maximal exact matches in large sequence datasets using sparse sux arrays. Bioin-
formatics, 25(13):1609{1616, 2009.
[37] Y. J. Kim, N. Teletia, V. Ruotti, C. A. Maher, A. M. Chinnaiyan, R. Stewart, J. A.
Thomson, and J. M. Patel. Probematch: rapid alignment of oligonucleotides to
genome allowing both gaps and mismatches. Bioinformatics, 25(11):1424{1425, Jun
2009.
[38] S. T. Z. Y. N. K. Kim U, Wang Y. Molecular cloning of cdna for double-stranded rna
adenosine deaminase, a candidate enzyme for nuclear rna editing. Proc Natl Acad
Sci, 24:11457{11461, 1994.
[39] B. P. Kiran A. Darned: a database of rna editing in humans. Bioinformatics,
26:1772{6, 2010.
[40] A. Kitzmiller. Solid system human (yoruban) data set.
http://solidsoftwaretools.com/gf/project/yoruban/.
[41] F. Krueger and S. R. Andrews. Bismark: a
exible aligner and methylation caller
for bisulte-seq applications. Bioinformatics, 27:1571{1572, 2011.
[42] G. Kucherov, L. No e, and M. Roytberg. Multiseed lossless ltration. IEEE/ACM
Trans Comput Biol Bioinform, 2(1):51{61, Jan-Mar 2005.
[43] S. Kurtz, A. Phillippy, A. L. Delcher, M. Smoot, M. Shumway, C. Antonescu, and
S. L. Salzberg. Versatile and open software for comparing large genomes. Genome
Biol, 5(2), 2004.
75
[44] T. W. Lam, W. K. Sung, S. L. Tam, C. K. Wong, and S. M. Yiu. Compressed
indexing and local alignment of dna. Bioinformatics, 24(6):791{797, 2008.
[45] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast and memory-
ecient alignment of short dna sequences to the human genome. Genome Biol,
10(3), 2009.
[46] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast and memory-
ecient alignment of short dna sequences to the human genome. poster, 2008.
[47] S. Levy, G. Sutton, P. C. Ng, L. Feuk, A. L. Halpern, B. P. Walenz, N. Axel-
rod, J. Huang, E. F. Kirkness, G. Denisov, Y. Lin, J. R. MacDonald, A. W. Pang,
M. Shago, T. B. Stockwell, A. Tsiamouri, V. Bafna, V. Bansal, S. A. Kravitz, D. A.
Busam, K. Y. Beeson, T. C. McIntosh, K. A. Remington, J. F. Abril, J. Gill, J. Bor-
man, Y. H. Rogers, M. E. Frazier, S. W. Scherer, R. L. Strausberg, and J. C. Venter.
The diploid genome sequence of an individual human. PLoS Biol, 5(10), Sep 2007.
[48] H. Li and R. Durbin. Fast and accurate short read alignment with burrows-wheeler
transform. Bioinformatics, 25(14):1754{1760, Jul 2009.
[49] H. Li and R. Durbin. Fast and accurate short read alignment with burrows-wheeler
transform. Bioinformatics, 25(14):1754{1760, Jul 2009.
[50] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth,
G. Abecasis, R. Durbin, and 1000 Genome Project Data Processing Subgroup. The
sequence alignment/map format and samtools. Bioinformatics, 25(16):2078{2079,
Aug 2009.
[51] H. Li, J. Ruan, and R. Durbin. Mapping short dna sequencing reads and calling
variants using mapping quality scores. Genome Res, 18(11):1851{1858, Nov 2008.
[52] H. Li and N. Homer. A survey of sequence alignment algorithms for next-generation
sequencing. Brief Bioinform, 2010.
[53] H. Li, J. Ruan, and R. Durbin. Mapping short dna sequencing reads and calling
variants using mapping quality scores. Genome Res, 18(11):1851{1858, 2008.
[54] M. Li, M. Nordborg, and L. M. Li. Adjust quality scores from alignment and improve
sequencing accuracy. Nucleic Acids Res, 32(17):5183{5191, 2004.
[55] R. Li, Y. Li, X. Fang, H. Yang, J. Wang, K. Kristiansen, and J. Wang. Snp detection
for massively parallel whole-genome resequencing. Genome Res, 19(6):1124{1132,
Jun 2009.
[56] R. Li, Y. Li, K. Kristiansen, and J. Wang. Soap: short oligonucleotide alignment
program. Bioinformatics, 24(5):713{714, Mar 2008.
[57] R. Li, C. Yu, Y. Li, T.-W. Lam, S.-M. Yiu, K. Kristiansen, and J. Wang. Soap2: an
improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966{1967,
2009.
76
[58] H. Lin, Z. Zhang, M. Q. Zhang, B. Ma, and M. Li. Zoom! zillions of oligos mapped.
Bioinformatics, 24(21):2431{2437, Nov 2008.
[59] U. Manber and G. Myers. Sux arrays: a new method for on-line string searches.
SIAM Journal on Computing, 22(5):935{948, 1991.
[60] L. Mancinelli, M. Cronin, and W. Sad ee. Pharmacogenomics: the promise of per-
sonalized medicine. AAPS PharmSci, 2(1), 2000.
[61] K. J. McKernan, H. E. Peckham, G. L. Costa, S. F. McLaughlin, Y. Fu, E. F.
Tsung, C. R. Clouser, C. Duncan, J. K. Ichikawa, C. C. Lee, Z. Zhang, S. S. Ranade,
E. T. Dimalanta, F. C. Hyland, T. D. Sokolsky, L. Zhang, A. Sheridan, H. Fu,
C. L. Hendrickson, B. Li, L. Kotler, J. R. Stuart, J. A. Malek, J. M. Manning,
A. A. Antipova, D. S. Perez, M. P. Moore, K. C. Hayashibara, M. R. Lyons, R. E.
Beaudoin, B. E. Coleman, M. W. Laptewicz, A. E. Sannicandro, M. D. Rhodes, R. K.
Gottimukkala, S. Yang, V. Bafna, A. Bashir, A. MacBride, C. Alkan, J. M. Kidd,
E. E. Eichler, M. G. Reese, F. M. De La Vega, and A. P. Blanchard. Sequence and
structural variation in a human genome uncovered by short-read, massively parallel
ligation sequencing using two-base encoding. Genome Res, 19(9):1527{1541, Sep
2009.
[62] M. L. Metzker. Sequencing technologies - the next generation. Nat Rev Genet,
11(1):31{46, 2010.
[63] Y. L. A. B. A. L. R. J. M. T. Mingyao Li, Isabel X. Wang and V. G. Cheung.
Widespread rna and dna sequence dierences in the human transcriptome. Science,
10:1126, 2011.
[64] U. Nagalakshmi, K. Waern, and M. Snyder. Rna-seq: a method for comprehensive
transcriptome analysis. Curr Protoc Mol Biol, Chapter 4:1{13, 2010.
[65] Z. Ning, A. J. Cox, and J. C. Mullikin. Ssaha: a fast search method for large dna
databases. Genome Res, 11(10):1725{1729, Oct 2001.
[66] B. D. Ondov, A. Varadarajan, K. D. Passalacqua, and N. H. Bergman. Ecient map-
ping of applied biosystems solid sequence data to a reference genome for functional
genomic applications. Bioinformatics, 24(23):2776{2777, Dec 2008.
[67] A. R. Quinlan, D. A. Stewart, M. P. Str omberg, and G. T. Marth. Pyrobayes: an
improved base caller for snp discovery in pyrosequences. Nat Methods, 5(2):179{181,
Feb 2008.
[68] K. D. Robertson. Dna methylation and human disease. Nat Rev Genet, 6(8):597{610,
Aug 2005.
[69] S. M. Rumble, P. Lacroute, A. V. Dalca, M. Fiume, A. Sidow, and M. Brudno.
Shrimp: accurate mapping of short color-space reads. Plos-Comput-Biol, 5, 2009.
77
[70] R. Sachidanandam, D. Weissman, S. C. Schmidt, J. M. Kakol, L. D. Stein, G. Marth,
S. Sherry, J. C. Mullikin, B. J. Mortimore, D. L. Willey, S. E. Hunt, C. G. Cole, P. C.
Coggill, C. M. Rice, Z. Ning, J. Rogers, D. R. Bentley, P. Y. Kwok, E. R. Mardis,
R. T. Yeh, B. Schultz, L. Cook, R. Davenport, M. Dante, L. Fulton, L. Hillier, R. H.
Waterston, J. D. McPherson, B. Gilman, S. Schaner, W. J. Van Etten, D. Reich,
J. Higgins, M. J. Daly, B. Blumenstiel, J. Baldwin, N. Stange-Thomann, M. C. Zody,
L. Linton, E. S. Lander, D. Altshuler, and International SNP Map Working Group.
A map of human genome sequence variation containing 1.42 million single nucleotide
polymorphisms. Nature, 409(6822):928{933, Feb 2001.
[71] J. J. C. R. Sandra Garrett. Rna editing underlies temperature adaptation in k+
channels from polar octopuses. Science, 335:848{851, 2012.
[72] J. Shendure and H. Ji. Next-generation dna sequencing. NatBiotechnol, 26(10):1135{
1145, Oct 2008.
[73] A. D. Smith, Z. Xuan, and M. Q. Zhang. Using quality scores and longer reads
improves accuracy of solexa read mapping. BMC Bioinformatics, 9:128{128, 2008.
[74] T. Souaiaia, Z. Frazier, and T. Chen. Comb: Snp calling and mapping analysis for
color and nucleotide space platforms. J Comput Biol, 18(6):795{807, 2011.
[75] D. Takai and P. A. Jones. Comprehensive analysis of cpg islands in human chromo-
somes 21 and 22. Proc Natl Acad Sci USA, 99:3740{3745, 2002.
[76] Z. Wang, M. Gerstein, and M. Snyder. Rna-seq: a revolutionary tool for transcrip-
tomics. Nat Rev Genet, 10(1):57{63, 2009.
[77] H. Wing-Kai, L. Tak-Wah, S. Kunihiko, S. Wing-Kin, and Y. Siu-Ming. A space
and time ecient algorithm for constructing compressed sux arrays. Algorithmica,
48(1):23{36, 2007.
[78] H. Yao and B. Ma. Seed optimization is no easier than optimal golomb ruler design.
APBC, 2008.
78
Appendix
PerM: Periodic Seed Alignment
This chapter is builds upon the description of the alignment problem in Chapter 1, Section
1.2. In section A.1 we describe two dierent strategies to provide ecient solutions to
the ungapped alignment problem described in Chapter 1. Described are the Burrows-
Wheeler Transform (BWT) based strategies which are extremely memory ecient and
provide excellent speed for short accurate reads and the consecutive seed strategies which
are less memory ecient but are not slowed by dierent data types. Section A.2 describes
the development and implementation of our novel periodic seed based algorithm. Section
A.3 compares PerM to software from both classes of algorithms.
A.1 Survey of Alignment Strategies
In Chapter 1 we described the main goals of an ungapped alignment algorithm, namely
to provide an accurate reporting standard while maximizing computational eciency at
a given tolerance level (). Currently, there exist two distinct major alignment strategies
to solve the task [52]. Traceback strategies rely on a semi-exhaustive search for alignment
through a transformed reference index. Algorithms have been developed which use a suf-
x trees [36] and sux arrays [59], but the most memory ecient and popular traceback
79
algorithms use the FM-index. The alignment programs Bowtie [46], BWA [49] and SOAP
v2 [57], all make use of the FM-index, an extremely memory ecient indexing strategy
based on the Burrows-Wheeler Transform (BWT). The BWT is a reversible string trans-
formation which results in an easily compressible string which enables the 3 GB human
genome to be stored and queried using less than 1 GB of memory [77]. The FM-index also
facilitates very ecient exact string matching, the running time required for the trace-
back necessary to nd a substitution-free read alignment is linear with respect to the read
length and number of occurrences in the genome [26]. Unfortunately, the solution used
to locate inexact read alignments using the FM-index requires choosing a branch position
(allowed substitution) and resuming the traceback for an exact alignment.
This is especially troublesome for reads whose lowest substitution alignment is near
the tolerance level because because such cases cause the branch-search process to explore
an exponential number of branches. In practice this bottleneck is often avoided by by
combining substitution ltration for low levels of substitutions (< 3) and greedy ltration
for higher number of substitutions, or by discarding reads who don't quickly nd low
substitution alignments, both shortcuts which greatly sacrice alignment accuracy.
The other major alignment strategy, Seed-substring algorithms aim to quickly locate
portions of the read which align exactly to the genome and then check the remaining
portion of the read for sequence similarity. The portion of the read which is required to
align exactly to the reference serves as a lter. For example consider a seed-substring
algorithm which provides tolerance to one substitution for a 32-bp read:
80
1. Store the sequence and starting location of every reference 18-mer in a list sorted
lexicographically by sequence.
2. Cut the read in half and use binary search or hashing to look for matches for each
half in the sorted list/index.
3. If a match is found, then count the number of substitutions between the entire read
and the reference location, and return alignments within the required similarity
threshold.
In this example, which is an implementation of the Rabin-Karp algorithm [34] used by the
short read alignment program SOCS [66], tolerance one substitution is achieved because
one read half will always match the reference index. Notice that if the read is instead cut
into three equal-length substrings, tolerance to two substitutions will be provided. SOCS
extends this method to provide tolerance to substitutions by separating the read into
+ 1 substrings and relying on the pigeonhole principle
y
to guarantee that at least one
will match exactly a reference substring. In general all seed-substring algorithms use the
three steps in the example, stated more formally they are:
Indexing: Indexing involves extracting subsequences corresponding to reference loca-
tions and storing the subsequences into a compact data structure to facilitate e-
cient subsequence queries. The rule used to extract the subsequences is referred to
as the seed.
Querying: For a readr, the query step involves applying the seed to the read to extract
subsequences for and looking them in up in the reference index. The look up step
81
can be quickly accomplished through the use a sorted reference index or a hash
table.
Checking: An exact match between a read and reference subsequence suggests that an
alignment is likely, but the remaining read positions must still be checked against
the reference location to ensure the sequence similarity is sucient.
The general framework described here is used by many popular alignment algorithms
including Eland (Cox, 2007 unpublished), ABI's Corona Lite, RMAP v1 [73], MAQ [53],
ProbMatch [37], SXOligoSearch [8], SOAP v1 [56], and SOCS [66]. For these algorithms
the querying step is usually eciently implemented through hashing and the bottleneck
usually occurs when many short substring queries match and must be ltered during the
checking step. If the length of the substring or seed weight used to query the index is very
short, many matches to reference locations will occur by chance despite a lack of actual
sequence similarity. The expected number of false matches or random hits per query
which occur on a uniformly distributed double-stranded 3-billion bp reference genome
can be calculated as follows:
E(H) =
2jGj
4
w
(A.1)
Applying equation A.1 to the method for 32bp reads described in 1-substitution tol-
erant method previously described suggests that each 16bp query will result in approx-
imately one expected random hit which will not adversely aect eciency. However,
if tolerance to two substitutions is necessary and the read is separated into three 12bp
substrings, more than 300 random hits are expected per query, which will likely cause a
82
32-bp Read: ACGTACGTCCCCTTTTACGTACGTAAAAGGGG
Index Table 1 (3 cases): ACGTACGTCCCCTTTT****************
********CCCCTTTTACGTACGT********
****************ACGTACGTAAAAGGGG
Index Table 2 (2 cases): ACGTACGT********ACGTACGT********
********CCCCTTTT********AAAAGGGG
Index Table 3 (1 cases): ACGTACGT****************AAAAGGGG
Figure A.1: An illustration of the traditional seeds used by Eland, MAQ, and SOAP v1
bottleneck. This dierence illustrates one of the challenges to designing an ecient seed
subsequence algorithm, achieving tolerance without the sacrice to seed weight. The
canonical solution to increase seed weight for tolerance to two mismatches which is im-
plemented by the MAQ, ELAND and SOAP v1 software tools is to separate the read into
four equal-length substrings and check each pair of substrings for a reference hit. This
method, shown in Figure A.1 requires that the reference be indexed with three dierent
seed patterns and that six queries be performed, but also increases seed weight from 12bp
to 16bp when compared to the Rabin-Karp method employed by SOCS. The overhead
required for to index the reference and the three additional queries is minor compared to
the increased speed resulting for the additional seed weight.
In Table A.1, the general features of seed-subsequence methods used by SOCS and
ELAND which provide tolerance are compared a traceback algorithm which uses the FM-
Index. Notice that the main advantage of the FM-Index is a much smaller amount of
required memory, while the seed-subsequence methods main advantage is the ability to
quickly lter dissimilar reads and the low cost for complete reporting.
83
Table A.1: Comparison of Alignment Strategies
FM-Index Seed-Subsequence
Algorithms BWT Traceback Subsequence Filter
Memory Footprint Very Small ( 1GB for Hg19) Large ( 15GB for Hg19)
Common Bottleneck Tracebacks for inexact alignment Inadequate seed weight
Outcome from:
Complete Reporting Extreme slowdown Moderate slowdown
Dissimilar Reads Signicant slowdown Signicant speed increase
Multiple Exact Matches Signicant speed increase Moderate slowdown
Increasing Read length Signicant slowdown Signicant speed increase
y
The FM-index based strategies are compared to the seed-subsequence strategies
A.2 Design of PerM
The main challenge of seed-subsequence search is to provide high levels of seed weight
while satisfying tolerance requirements. This problem of designing high-weight tolerant
seeds and seed groups has extensions to many elds including lossless compression and has
been previously studied by Kucvherov et al. [42] and Lin et al. [58], among others. Based
on this work we were motivated to develop PerM, which makes use of periodic seeds to
maximize seed weight for dierent tolerance levels. Periodic seeds are repetitive patterns
used to partition a read which can be represented by a list of positions for subsequence
inclusion or as a string of 1 * where "1" denotes a positions a position for subsequence
inclusion and "*" represents a mask position which is not included.
The features of a periodic seed are period length, it's weight, and it's weight to length
ratio. Consider
^
P , the periodic seed which can be represented as (111*1**) or (7:1,2,3,5)
84
Reference ACGTACGTCCCCTTTTACGTACGTAAAAGGGG
Seed (List Representation) (1,2,3,5)+7k8 k2 0,1,2,3...
Seed (String Representation) 111*1**111*1**111*1**111*1**111*
Position 00000000011111111112222222222333
12345678901234567890123456789012 Subsequence length
Shift 0: (111*1**) ACG*A**TCC*C**TTA*G**CGT*A**GGG* 19
Shift 1: (*111*1*) *CGT*C**CCC*T**TAC*T**GTA*A**GGG 19
Shift 2: (**111*1) **GTA*G**CCC*T**ACG*A**TAA*A**GG 18
Shift 3: (1**111*) A**TAC*T**CCT*T**CGT*C**AAA*G**G 18 (17)
Shift 4: (*1**111) *C**ACG*C**CTT*T**GTA*G**AAA*G** 17 (16)
Shift 5: (1*1**11) A*G**CGT*C**TTT*A**TAC*T**AAG*G* 18 (16)
Shift 6: (11*1**1) AC*T**GTC*C**TTT*C**ACG*A**AGG*G 19 (16)
Colors are used to indicate periods.
Figure A.2: PerM - Seed Example (single end)
and has length seven, weight four, and weight to length ratio of 4=7. When this seed is
applied to a read seven subsequences are generated, as shown in Figure A.2. Notice that
the seed generates seven unique subsequences which is equal to the length of the seeds
period,j
^
Pj. Each subsequence generated requires that one query to a genome or index
table be made, which means that when weight to length ratios are equal, shorter seed
periods will result in faster sequence alignment. Periodic seeds provide many benecial
properties including:
1. Periodic patterns produce locally optimal weight to length ratios for a given toler-
ance level.
2. Local tolerance/optimality to k mismatches implies global tolerance to k mis-
matches.
3. The reference can be indexed linearly and stored in a single index.
85
Table A.2: An illustration of local tolerance
Position 8 9 10 11 12 13 14 Covering 21 position pairs
Shift 0 1 1 1 * 1 * * (11, 13), (11, 14), (13, 14)
Shift 1 * 1 1 1 * 1 * (8,12), (8, 14), (12, 14)
Shift 2 * * 1 1 1 * 1 (8, 9), (8, 13), (9, 13)
Shift 3 1 * * 1 1 1 * (9, 10), (9, 14), (10, 14)
Shift 4 * 1 * * 1 1 1 (8, 10), (8, 11), (10, 11)
Shift 5 1 * 1 * * 1 1 (9, 11), (9, 12), (11, 12)
Shift 6 1 1 * 1 * * 1 (10, 12), (10, 13), (12, 13)
Shown are the positions covered pairwise for each of seven slides.
^
P covers 3
3
2
pairs of positions for each of
seven slides, for a total of 21 total pairs. Each unique-position pair is covered a single time by
^
P , meaning that
no set of seven patterns with greater seed weight can achieve equal tolerance.
The rst property was shown in general by Kucvherov et al. [42] and can be proven
directly for moderate read lengths using exhaustive search. For example we can consider
the rst set of seven non-boundary positions in
^
P , read positions eight through fourteen.
For each of the seven shifts shown in Figure A.2, we can enumerate which positions are
covered pairwise by two of the three mask "*" positions in each slide. Table A.2 shows that
seed
^
P covers each pair of positions exactly once, meaning tolerance to two-substitutions
is provided with maximum weight for positions eight through fourteen.
The second property is perhaps the most useful property of periodic seeds because it
vastly simplies the design of seeds for dierent read lengths. This property which was
rigorously proven by Kercherov et al. [42], can be explained by the repeating property of
the seed. Notice that the maximal-weight tolerance achieved between positions 8 14 is
also achieved locally for each following 7mer (ie. 15-21, 22-28, etc), and additionally that
every position within a 7mer has a corresponding position in every 7mer for which it's
state is shared between shifts. For example, consider the corresponding read positions
11 and 18. If read position 11 is masked pairwise with every position within it's 7mer
(positions 8 14), then position 18 is also masked pairwise with every position within
86
positions 8 14 because position 18 is always masked when position 11 is masked. Thus,
seed periodicity guarantees that local tolerance and optimality (within the period length)
implies global tolerance and optimality.
As shown in Figure A.2, each of the seven shifts creates a dierent seed subsequence
of between 17-19bp. A straightforward indexing of the reference would involve applying
each of the seven patterns to the reference sequence and storing each in an index ta-
ble. However, if the subsequences always begins with the same point in the pattern (ie.
(111*1**)) then the reference can be indexed and stored in a single table. As shown in
A.2, this property is arrived at by imposing the boundary condition that the pattern is
not repeated in the rst 7mer (ie. Grey nucleotides excluded from subsequences). This
results in a loss of query weight (shown in parenthesis in Figure A.2, but facilitates an
implementation which requires only a single reference index. Assuming 32bp reads, the
outline of the implementation is as follows:
Indexing:
^
P
2
is applied to each reference 32mer which produces a subsequence of 19bp.
A prex of the subsequence not longer than the shortest read subsequence produced
(ie. 16bp in Shift 1), is used to generate a hash key. Reference subsequences which
share the same hash-key are assigned to a bucket and sorted lexicographically using
their suxes. Memory is preserved by storing integers which point to the reference
location in place of actual subsequences.
Querying:
^
P
2
is used to generate read subsequences whose prexes generate hash-keys
for bucket assignment. Within a bucket exact matches for a variable-length sux
can be located using binary search. This is shown in Figure A.3.
87
Reference String AAACCCCGGGGTTAAAAGTACGT……ACGT
Seed Pattern: 111*1**111*1**111*1**
Index Key: “AAA C CGG G AAA G …”
“AAC C CGG T AAA T …”
The x-bp prefix of each index key is hashed to one of the 4
x
buckets.
buckets .(pointers),
AAC…. AAA…
AAC…. AAC…
TTT…. AAA…
TTT…. AAC…
In each bucket, the postfixes of index keys are sorted lexicographically.
AAA…. AAA…
AAA…. AAC…
TTG…. AAC…
TTG…. AAG…
Postfixes are
sorted for
binary search
…… …… ……
Figure A.3: PerM - combination of hashing and binary search [11]
Checking: For each read subsequence which produces an exact match to a reference
index, the read sequence is compared to the shift-oset reference location using
bitwise operations (ie. If the rst shift produced an index match to the 100
th
position on the reference, the 99
th
location would be checked. If the number of
substitutions is below the tolerance threshold then the alignment may be stored in
a buer or immediately reported depending on the reporting standards.
The implementation described above is used to perform short read alignment for
tolerance to two substitutions. The seed used,
^
P
2;7
, maximizes weight while providing
tolerance to two substitutions in seven subsequence queries. Periodic seeds of dierent
lengths which maximize weight for given tolerance levels can quickly be generated using
exhaustive search or integer programming [11] because of the local to global property
of periodic seeds. Figure A.4, shows that weight-to-length ratios for optimized periodic
seeds increase non-monotonically as the seed length increases. Notice that local maxima
for tolerance to two substitutions exist at for period lengths of 7 and 13. P
2;13
provides
88
Figure A.4: PerM - Limiting weight to length ratios [11]
a weight-to-length ratio of 0:69 while P
2;7
provides a weight to length ratio of 0:57.
However,P
2;7
is advantageous because it requires as six fewer subsequence queries and it's
shorter length means it will lose less query weight when the boundary condition required
to facilitate a single reference index is imposed. When possible PerM is implemented
using the shortest seed which produces adequate seed weight for a required tolerance
level. We dene the F
k
as family of seeds which we selected which provide tolerance to
k substitutions. Additionally, we have designed color-space specic seed families which
provide tolerance to combinations of consecutive (SNPs change consecutive colors) and
isolated substitutions. The F
k
seed family is shown and contrasted to the Rabin-Karp
inspired method and the canonical method in Table A.3.
89
Table A.3: Optimal periodic seed patterns for 35-bp Reads reads
Tolerance Seed Index Tables Queries Weight
2 F
2
: 111*1** 1 7 17-20
2 Canonical:
4
2
3 6 17
2 RK-Split: 1=3 1 3 11
3 F
3
: 111*1**1*** 10 11 13-17
3 Canonical:
5
2
4 10 14
3 RK-Split: 1=4 1 4 8
4 F
4
: 11***1**** 1 10 12-15
4 Canonical:
6
2
5 15 14
4 RK-Split: 1=5 1 5 7
A.3 Comparison of PerM
PerM was designed to provide alignment which is robust to the bottlenecks associated
with FM-index aligners but also more ecient than seed-subsequence methods because
of the use of single periodic seed. Additionally, PerM was designed to provide a moderate
memory cost by indexing the reference with a single seed. In reality, many implementation
issues including cache eciency [28], compiler optimization, and bitwise operations have a
large role in eciency. One of the most likely reasons why PerM is much faster in genome-
scale alignment than other seed-subsequence algorithms is PerM's ability to make use of
large read lengths using binary search for subsequence suxes. Other seed-subsequence
algorithms only use subsequences for hashing which causes memory requirements to grow
very large. For example, if an algorithm generates seed subsequences of 16 basepairs, then
approximately 18 GB of memory will be required just to store integer bucket pointers.
90
For this reason most algorithms use only the rst 13 15 subsequence bases for hashing
(requiring 256 MB and 4 GB of memory). PerM is able to use up to 13 subsequence
basepairs for hashing and then use binary search to further dierentiate between many
sequences. This large advantage is eciency is demonstrated in Table A.4 which com-
pares PerM to MAQ, an alignment algorithm which uses the canonical seed-subsequence
algorithm to produce fourteen basepair subsequences for hashing.
Table A.4: Speed comparisons among PerM, Bowtie, and MAQ using Illumina reads [12]
36 bp 40 bp 50 bp
Settings Weight Speed
y
Weight Speed
y
Weight Speed
y
PerM F
2
18-21 37.1 20-24 58 25-28 82.6
MAQ 14 2.49 14 2.35 14 2.37
Bowtie {v 2
n/a 28.1 n/a 24.8 n/a 15.8
PerM F
3
13-18 10.0 15-19 16.9 18-23 40.2
Bowtie {v 3
n/a 13.4 n/a 9.0 n/a 7.21
Bowtie does not use seeds. Seed weight is not applicable to Bowtie.
MAQ does not provide options for tolerance to 3 substitutions.
y
Speeds are shown in million reads per CPU hour.
Comparing PerM to FM-Index based strategies is more dicult because the perfor-
mance of FM-Index aligners depends highly on the input data and reporting standards.
The FM-index based alignment algorithm Bowtie [46], was the rst to implement the
Burrows-Wheeler Transform. This signicant reduction in required memory [44] led to
the quick adoption of the BWT by BWA [53] and SOAP v2 [57]. Although the reduction
in memory is signicant (ie. 1 GB vs. 15 GB), modern computing clusters typically have
91
more than 20 GB memory shared by multiple CPUs which makes the memory usage no
longer a top priority. The greatest drawback to BWT-based aligners is the large num-
ber of bottlenecks which arise for dierent types of data or reporting standards. If read
lengths are short (eg 25-35 bp) and many reads align exactly to the reference, Bowties
default reporting standard very quickly report single alignments for a number of reads.
Unfortunately, this reporting standard greedily reports only the rst alignment it nds
in fewer than a pre-determined number of tracebacks which results in many alignable
reads being discarded. By contrast if Bowtie is run with a
ag set to report every single
alignment within a tolerance threshold, alignment becomes extremely slow because every
read must explore an exponential number of tracebacks before all alignments are found.
Bowtie also slows down substantially for long reads (eg. >50bp) [46] or when few reads
nd alignments with the threshold, both features which signicantly increase the speed
of PerM. Finally, Bowtie reports alignments which have at most 3 substitutions, while
PerM can provide tolerance to four substitutions for 28-64bp reads and tolerance to 9
substitutions for 65-128bp reads and sensitivity to many more mismatches.
To provide a comparison which is fair in light of dierences between the algorithms,
the reporting standard "Minimal" was used for both algorithms with a tolerance level of
three substitutions. This standard was used in the experiments described in Table A.4
and Table A.5.
The experiments described in Tables A.4 and A.5 demonstrate the advantages of
PerM's single periodic variable weight seed design. Perhaps the primary reason for the
popularity of PerM in the scientic community is that advances in technology have re-
sulted in a steady increase in the lengths of short reads, which bottleneck other programs
92
Table A.5: Speed comparisons between PerM and Bowtie using SOLiD reads
PerM Bowtie
Seed/Options F
2
F
1;1 F
3
F
2;0
-v 2
--best
-v 3
--best
Seed Weight 25-28 20-25 19-23 20-24 n/a
y
n/a
y
Allowed Mismatches 2 3 3 4 2 3
Mapping Speed
(Reads/CPU hr)
64 M 58 M 37 M 43 M 9.18 M 5.35 M
y
Bowtie does not use seeds. Seed weight is not applicable to Bowtie.
y
PerM version 0.2.9.6, Bowtie version 0.12.5, and MAQ version 0.6.6
but result in PerM performing faster. However, in the case that algorithms require toler-
ance to a large proportion of substitutions (eg >10%), a single seed will not be capable
of providing ecient alignment. In the advent that the costly step of using multiple in-
dex tables is deemed necessary to increase seed weight, we have explored algorithms to
develop multiple seeds which combine to provide tolerance thresholds. Already a topic of
interest, in 2008 Ma and Yao [78] showed that the optimization of multiple seeds cannot
be easier than the Golomb Ruler Design problem, considered likely to be NP-hard. This
problem as well as potential solutions to design multiple seeds are current research topics
of interest regarding short read sequence alignment.
93
Abstract (if available)
Abstract
The development of second-generation sequencing (SGS) technology has provided sci- entists with a myriad of opportunities as well as new challenges. SGS machines are capable of sequencing billions of short reads at a fraction of the cost and time in com- parison to older technology. Often, the study of sequence data begins with the align- ment of billions of short dna reads to the 3 billion base pair human reference genome, a daunting computational task, especially if the error-rate between the reads and ref- erence is high. For this reason, PerM was developed to use periodic spaced seeds to efficiently and accurately provide highly sensitive ungapped alignment for Illumina and SOLiD reads. Inexact alignments are often the most interesting biologically, because mismatches between the read and reference are often the result of genetic variation. To accurately detect and discern variation from machine errors, we developed ComB, which iteratively applies Bayesian statistics to color or base alignment to accurately determine mutation probability. This allowed us to study a host of biological phenomena which result in rare nucleotide differences, including single nucleotide polymorphisms (SNPs), RNA-editing, and allele-specific expression. DNA-methylation of cytosine residues also produces single-base mismatches when dna is treated with sodium-bisulfite which changes all unmethylated cytosine residues to thymine. To accurately estimate methylation rates from sodium bisulfite treated dna we developed FadE, an algorithm which uses Newton- Raphson optimization to estimate the methylation rate at every cytosine residue in the genome. Finally, we have applied all our statistical tools to study human mRNA editing, and have shown that RNA editing in human brain tissue occurs at a much lower rate than previously thought.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Fast search and clustering for large-scale next generation sequencing data
PDF
Efficient algorithms to map whole genome bisulfite sequencing reads
PDF
Sharpening the edge of tools for microbial diversity analysis
PDF
Geometric interpretation of biological data: algorithmic solutions for next generation sequencing analysis at massive scale
PDF
Understanding DNA methylation and nucleosome organization in cancer cells using single molecule sequencing
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Big data analytics in metagenomics: integration, representation, management, and visualization
PDF
Genomic mapping: a statistical and algorithmic analysis of the optical mapping system
PDF
A generalization of the accumulation curve in species sampling and its applications to high‐throughput sequencing
PDF
Understanding the characteristic of single nucleotide variants
PDF
Identifying allele-specific DNA methylation in mammalian genomes
PDF
Genome-wide studies reveal the function and evolution of DNA shape
PDF
Developing statistical and algorithmic methods for shotgun metagenomics and time series analysis
PDF
Genome-wide studies reveal the isoform selection of genes
PDF
Improved methods for the quantification of transcription factor binding using SELEX-seq
PDF
Mechanistic basis for chromosomal translocations at the E2A gene
PDF
Detecting joint interactions between sets of variables in the context of studies with a dichotomous phenotype, with applications to asthma susceptibility involving epigenetics and epistasis
Asset Metadata
Creator
Souaiaia, Mourad (Tade)
(author)
Core Title
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
09/01/2012
Defense Date
08/02/2012
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
algorithms,computational biology,gene expression,genetics,genome sequencing,methylation,OAI-PMH Harvest,sequence alignment,SNP calling
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Chen, Ting (
committee chair
), Knowles, James (
committee member
), Sun, Fengzhu Z. (
committee member
)
Creator Email
tade.souaiaia@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-93879
Unique identifier
UC11289123
Identifier
usctheses-c3-93879 (legacy record id)
Legacy Identifier
etd-SouaiaiaMo-1180.pdf
Dmrecord
93879
Document Type
Dissertation
Rights
Souaiaia, Mourad (Tade)
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
algorithms
computational biology
gene expression
genetics
genome sequencing
methylation
sequence alignment
SNP calling