Close
The page header's logo
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected 
Invert selection
Deselect all
Deselect all
 Click here to refresh results
 Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Diploid genome reconstruction from shotgun sequencing
(USC Thesis Other) 

Diploid genome reconstruction from shotgun sequencing

doctype icon
play button
PDF
 Download
 Share
 Open document
 Flip pages
 More
 Download a page range
 Download transcript
Copy asset link
Request this asset
Transcript (if available)
Content DIPLOID GENOME RECONSTRUCTION FROM
SHOTGUN SEQUENCING
by
Jong Hyun Kim
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2006
Copyright 2006 Jong Hyun Kim
UMI Number: 3237132
3237132
2006
UMI Microform
Copyright
All rights reserved. This microform edition is protected against
   unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
    Ann Arbor, MI 48106-1346
by ProQuest Information and Learning Company.
Dedication
This dissertation is dedicated to my parents, Min Ja Lee and Haing Sik Kim.
ii
Acknowledgments
To complete this work, I am deeply indebted to my advisor Prof. Michael S. Wa-
terman. I would like to thank him for his encouraging, considerate, and invaluable
advice. IalsoexpressmygratitudetoProf. LeiLi. Iwouldliketothankhimforhis
technical help and the kindness he has shown to me. I also thank the Institute for
Information Technology Assessment (IITA) and the Government of the Republic
of Korea for their generous financial support. This work was supported by NSF
DMS 9971698, NIH Grant R01 HG02366, and NIH P50 HG002790 CEGS grant.
The sequence data of Ciona intestinalis were produced by the US Department of
Energy Joint Genome Institute http://www.jgi.doe.gov/.
iii
Contents
Dedication ii
Acknowledgments iii
List Of Tables vii
List Of Figures ix
Abstract xii
1 Background and Introduction 1
1.1 Sequencing technology . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 DNA synthesis . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Sanger dideoxy sequencing . . . . . . . . . . . . . . . . . . . 5
1.1.3 Band pattern . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Sequence quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1 Base-calling . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.2 Error probability assignment . . . . . . . . . . . . . . . . . . 11
1.3 Shotgun sequencing strategies . . . . . . . . . . . . . . . . . . . . . 14
1.3.1 Hierarchical shotgun sequencing . . . . . . . . . . . . . . . . 14
1.4 Sequence assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.1 Whole-genome shotgun sequencing . . . . . . . . . . . . . . 18
1.5 Diploid genome reconstruction problem . . . . . . . . . . . . . . . . 20
1.5.1 Polymorphism detection . . . . . . . . . . . . . . . . . . . . 20
1.5.2 Haplotype . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.5.3 Inferring haplotypes from genome assembly . . . . . . . . . . 22
1.5.4 Algorithmic approaches. . . . . . . . . . . . . . . . . . . . . 23
1.5.5 Statistical background . . . . . . . . . . . . . . . . . . . . . 31
1.5.6 Derivation of a consensus sequence . . . . . . . . . . . . . . 33
1.5.7 EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.5.8 Extension to diploid genome reconstruction . . . . . . . . . 37
1.5.9 Accuracy assessment . . . . . . . . . . . . . . . . . . . . . . 38
1.5.10 Gibbs sampling and its application to a real genome project 41
iv
2 Notation and Probabilistic Model 43
2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.2 Probabilistic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.2.1 Sequencing error rates . . . . . . . . . . . . . . . . . . . . . 45
2.2.2 Composition probabilities . . . . . . . . . . . . . . . . . . . 47
2.2.3 Fragment membership . . . . . . . . . . . . . . . . . . . . . 48
2.2.4 Missing information . . . . . . . . . . . . . . . . . . . . . . . 48
3 Methods 51
3.1 Pairwise Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.1.1 Genotype determination . . . . . . . . . . . . . . . . . . . . 52
3.1.2 Estimation of the most likely pairs . . . . . . . . . . . . . . 53
3.1.3 Reconstruction strategy . . . . . . . . . . . . . . . . . . . . 57
3.1.4 Confidence score for estimated haplotypes . . . . . . . . . . 58
3.2 A bridging strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3 A fast algorithm to compute Pr(X) . . . . . . . . . . . . . . . . . . 63
3.3.1 A Markov property . . . . . . . . . . . . . . . . . . . . . . . 65
3.3.2 Identification of uncertain connections . . . . . . . . . . . . 69
3.3.3 Maximum likelihood estimate of frequencies . . . . . . . . . 71
3.3.4 Composition probabilities . . . . . . . . . . . . . . . . . . . 72
3.4 Gibbs Sampling Approach . . . . . . . . . . . . . . . . . . . . . . . 73
3.4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.4.2 Gibbs sampler . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.4.3 ThedistributionoffragmentmembershipandtheGibbssam-
pling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.4.4 Position-specific sequencing error rates . . . . . . . . . . . . 84
3.4.5 Potential polymorphism detection . . . . . . . . . . . . . . . 85
3.4.6 Local haplotyping . . . . . . . . . . . . . . . . . . . . . . . . 88
3.4.7 Haplotype extension . . . . . . . . . . . . . . . . . . . . . . 90
4 Simulation Studies 95
4.1 Simulation study 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.1.2 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.1.3 Simulation model . . . . . . . . . . . . . . . . . . . . . . . . 98
4.1.4 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . 98
4.2 Simulation study 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.2.2 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.2.3 Simulation model . . . . . . . . . . . . . . . . . . . . . . . . 103
4.2.4 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . 105
4.3 Simulation study 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
v
4.3.2 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.3.3 Simulation model . . . . . . . . . . . . . . . . . . . . . . . . 114
4.3.4 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . 115
5 The Ciona intestinalis genome 117
5.1 Ciona intestinalis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.2 Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.4 Data deposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6 Conclusion and Discussion 132
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.2 Relaxation of assumptions . . . . . . . . . . . . . . . . . . . . . . . 133
6.3 Comparison of average haplotype length . . . . . . . . . . . . . . . 134
6.4 Integration into sequence assembler . . . . . . . . . . . . . . . . . . 135
6.5 Quality score adjustment . . . . . . . . . . . . . . . . . . . . . . . . 135
6.6 ConfidenceofdiploidconsensussequencesfromtheCionaintestinalis
genome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7 Future Work 137
7.1 Multiple-haplotype reconstruction . . . . . . . . . . . . . . . . . . . 137
7.2 Optical mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
8 Software 140
8.1 Accuracy assessment program . . . . . . . . . . . . . . . . . . . . . 140
8.1.1 Command . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
8.1.2 Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.1.3 Output files . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.2 Gibbs-sampling based program . . . . . . . . . . . . . . . . . . . . 145
8.2.1 Command . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
8.2.2 Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
8.2.3 Output files . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.3 Software deposition . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Bibliography 153
Appendix A
Proof of Theorem 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
vi
List Of Tables
3.1 Genotypes at one SNP site . . . . . . . . . . . . . . . . . . . . . . . 52
3.2 Number of haplotypes at two SNPs . . . . . . . . . . . . . . . . . . 53
4.1 Composition probability in the simulation . . . . . . . . . . . . . . 97
4.2 Sequencing error probabilities in the simulation . . . . . . . . . . . 97
4.3 Parameter values in the simulation . . . . . . . . . . . . . . . . . . 97
4.4 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.5 Simulation results. Continued . . . . . . . . . . . . . . . . . . . . . 100
4.6 Parameter values in the simulation . . . . . . . . . . . . . . . . . . 103
4.7 Reconstruction result from a simulation based on Ciona intestinalis
(λ= 1/2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.8 Reconstruction result from a simulation based on Ciona intestinalis
(λ= 1/4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.9 Reconstructionresultfromasimulationofapolymorphismrate0.3%
(λ= 1/2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.10 Reconstructionresultfromasimulationofapolymorphismrate0.3%
(λ= 1/4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.11 Error patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.12 Statistics of the maximum likelihood estimates of haplotype frequency107
vii
4.13 Composition probability in the Ciona intestinalis simulation . . . . 113
4.14 Results from the simulated data set . . . . . . . . . . . . . . . . . . 115
4.15 Error analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.1 Small insert libraries . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.2 Long insert libraries. . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.3 Polymorphism rate . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.4 Frequency of each polymorphism category . . . . . . . . . . . . . . 121
5.5 Composition of polymorphisms . . . . . . . . . . . . . . . . . . . . 121
5.6 Categories of point mutations and their frequencies . . . . . . . . . 121
viii
List Of Figures
1.1 A reaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Illustration of band pattern . . . . . . . . . . . . . . . . . . . . . . 4
1.3 An example of a sequence trace file . . . . . . . . . . . . . . . . . . 8
1.4 Hierarchical shotgun sequencing strategy . . . . . . . . . . . . . . . 15
1.5 Sequence assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6 Overlap detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.7 Whole-genome shotgun sequencing strategy . . . . . . . . . . . . . 19
1.8 SNP matrix, fragment conflict graph, and SNP conflict graph. . . . 25
2.1 An illustrative example of our problem . . . . . . . . . . . . . . . . 44
3.1 Consistency between adjacent pairs . . . . . . . . . . . . . . . . . . 55
3.2 Inconsistency between adjacent pairs . . . . . . . . . . . . . . . . . 56
3.3 Reconstructing haplotypes from pairwise calculations . . . . . . . . 57
3.4 First case of connecting adjacent haplotypes . . . . . . . . . . . . . 61
3.5 Second case of connecting adjacent haplotypes . . . . . . . . . . . . 61
3.6 Definition of four index sets . . . . . . . . . . . . . . . . . . . . . . 64
3.7 An illustration of the recursive structure of the index sets . . . . . . 66
ix
3.8 Block composition and local haplotyping . . . . . . . . . . . . . . . 92
3.9 Haplotype extension . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.10 Cases where the confidence score for inferred haplotypes are low . . 94
4.1 Examples of accuracy assessment . . . . . . . . . . . . . . . . . . . 108
4.2 Estimated confidence score vs. observed accuracy . . . . . . . . . . 109
4.3 Adjusted confidence score vs. observed accuracy . . . . . . . . . . . 110
4.4 Histograms of MLE estimates . . . . . . . . . . . . . . . . . . . . . 111
4.5 Log-likelihood curve . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.6 Part of a scaffold from Ciona intestinalis . . . . . . . . . . . . . . . 112
5.1 Distribution of polymorphisms on scaffold 1 . . . . . . . . . . . . . 123
5.2 Distribution of polymorphisms in a specific region on scaffold 1. . . 124
5.3 Distribution of polymorphisms on scaffold 140 . . . . . . . . . . . . 125
5.4 Distribution of polymorphisms on scaffold 643 . . . . . . . . . . . . 126
5.5 Distribution of polymorphisms on scaffold 990 . . . . . . . . . . . . 127
5.6 Distribution of haplotype lengths . . . . . . . . . . . . . . . . . . . 128
5.7 An example of the diploid genome reconstruction from scaffold 1 . . 129
5.8 Format of diploid genome sequences from scaffold 1 . . . . . . . . . 130
5.9 An example of discontinued diploid genome sequences from scaffold 1131
8.1 Format of a layout name file . . . . . . . . . . . . . . . . . . . . . . 141
8.2 Input of composition probabilities . . . . . . . . . . . . . . . . . . . 142
8.3 Format of a composition probabilities input file . . . . . . . . . . . 142
x
8.4 Input of sequencing error probabilities . . . . . . . . . . . . . . . . 146
8.5 Format of a sequencing error probabilities input file . . . . . . . . . 147
8.6 Format of an assembly layout input file . . . . . . . . . . . . . . . . 148
xi
Abstract
In genome sequencing projects, the objective has been to determine a haploid
consensus sequence. Although polymorphisms were identified from a eukaryotic
genome, haplotypes have not been inferred directly. In this article, we propose
a statistical method to detect polymorphisms in genome sequencing projects and
infer haplotypes from the polymorphisms.
First, assuming that polymorphic sites are given, we describe a probabilistic
model which incorporates sequence error rates, composition probability of each
base, and memberships of fragments. We provide a statistical method to calculate
probabilities of different haplotype configurations from the assembly. To reduce
computation complexity, the most likely haplotypes are calculated from adjacent
polymorphisms, and then those haplotypes are linked to form longer haplotypes.
We show the simulation results based on this method.
Next, we develop a fast algorithm to score the confidence of estimated haplo-
types. The complexity of the algorithm is linear with respect to the number of
polymorphisms. We utilize this scoring method toincrease the quality of estimated
haplotypes.
xii
To assemble the Ciona intestinalisgenome, we propose a novel approach, which
usestheGibbssamplersimultaneouslywithpolymorphismdetectionstrategy. Ciona
intestinalis is known for its high polymorphism rate which is reported to be 1.2%
(Dehaletal. [21]). Here,weremovetherestrictionthatpolymorphicsitesaregiven.
We report the polymorphism rate of Ciona intestinalis by using our method. Also,
simulation results are shown to evaluate the accuracy of our method.
xiii
Chapter 1
Background and Introduction
Genomics has changed biology dramatically toward an information-oriented sci-
ence, also from a component-oriented approach to a system-oriented approach. For
example, genomics has revolutionized ways to identify and understand functional
elements inorganisms. Traditionally, biologistswere focused onidentifying anindi-
vidual gene and unraveling its function (or its association with disease). Currently,
genomics enables biologists to study the networks and interactions of functional
elements, which have been believed to be essential in understanding phenomena in
life.
The bases of doing this are accurate genome sequences. In genome sequencing,
randomfragmentsaregeneratedfromwholegenomesorclones, andthensequenced
usingSangerdideoxytechnique(Sangeretal. [76]). Duetothelimitationofcurrent
sequencing technology, the length of high-quality reads typically ranges from 500
to 800. Therefore, sequencing projects require some redundancy in the coverage
of random fragments to improve the quality of the reconstructed genome sequence
1
ddATP
t a c g t c c g a t c c g DNA template 5’ 3’
Growing chain c g g 3’
Growing chain
DNA template 5’
3’
g c c t a g c c t g c a t 3’
c g g a t c g g c
a
a
A
A
5’
5’
Growing chain
DNA template 5’ g c c t a g c c a t g c a t 3’
3’ c g g a t c g g c a c g t A 5’
Figure 1.1: A reaction. For the given template DNA, three types of molecules
aresynthesizedwithdifferentsize. WhenddATP,insteadofdATP,isinserted
intothepositionwhichiscomplementaryto T,thegrowingchainhalts. These
events are random and possibly happen at every position which is comple-
mentary to T.
and also require an assembly strategy, called shotgun sequencing. Total amount
of sequence reads in base pairs divided by the genome size is called sequence cov-
erage. In shogun sequencing projects, the sequence coverage is usually between 5
and 10. The shotgun sequencing strategy includes a computational assembly strat-
egy, called sequence assembly, which traditionally follows a three-step procedure:
overlap-layout-consensus.
1.1 Sequencing technology
In 1977, the first sequencing technology was proposed by Maxam and Gilbert
(Maxam and Gilbert [55]). In this method, molecules are decomposed into pieces,
and then the bases are determined. Shortly, this method was replaced by the
2
Sanger dideoxy sequencing technique (Sanger et al. [76]). In contrast to the previ-
ous method, Sanger dideoxy technique determines the bases by synthesis. Sanger’s
dideoxy technique has been improved and now is established in mass-production
sequencing efforts because it is easy to automate. The improvements include ther-
mostablepolymerases(TaborandRichardson[81]),fluorescent-dye-labeled dideoxy
terminators(Proberetal. [68]),laser-baseddetectionoffluorescence-labeledmolecules
(Smithetal. [78];Hunkapilleretal. [39]),andcapillary-basedsequencing(Mullikin
and McMurray [61]; Meldrum [59]). Robotic systems also have been developed to
facilitate sequencing steps such as subclone library construction and sequencing
reaction preparation (Meldrum [58] [57]). The Sanger dideoxy technique is based
on the DNA synthesis process by using DNA polymerases. In addition to DNA
polymerases, this technique also requires several biochemical components: primers,
deoxynucleoside triphosphates (in other words, the DNA precursors dATP, dCTP,
dGTP,anddTTP),anddideoxynucleoside triphosphates(ddATP,ddCTP,ddGTP,
and ddTTP). Their roles are explained next.
1.1.1 DNA synthesis
In DNA synthesis, after the DNA template to be copied is inserted into a cloning
vector (plasmid or bacteriophage M13), a synthetic primer hybridizes to the target
site in the cloning vector. DNA polymerases extend the growing chain by adding
dXTP (any deoxynucleoside triphosphate) to the 3
′
end of the growing chain; with
3
20
1
2
3
4
5
6
7
A lane C lane G lane T lane
9
8
10
11
12
13
14
15
16
17
18
19
Figure 1.2: Band pattern. A separate lane exists for each reaction. In this
figure,theshortestmoleculeislocatedatthebottomandthelongestmolecule
is located at the top. After electrophoresis, the synthesized molecules are
sorted by size on each lane. Then reading the band pattern starts from the
bottom to the top on a gel. the number left to a band indicates the order of
reading the band. Therefore, the sequence will be AGTCATCAGTGCTGACTCGA.
the removal of pyrophosphate, DNA polymerases add complementary deoxynucle-
oside triphosphates, which do not contain an -OH group at the 2
′
position of the
ribose, to the 3
′
end of the growing chain. DNA polymerases are the catalytic
enzymes enabling DNA synthesis. DNA polymerases are often genetically modified
to remove the undesirable property such as exonuclease capability, and adapt to
the need of in vitro DNA synthesis. For instance, intact DNA polymerases often
have exonuclease capability: they can digest the growing chain (3
′
→ 5
′
exonu-
clease activity) or go backward to digest the growing chain (5
′
→ 3
′
exonuclease
activity). For example, a widely used DNA polymerase for sequencing, Taq DNA
polymerase, lacks the3
′
→ 5
′
exonuclease activity (Saikietal. [73]). Also, sequenc-
ing polymerases are genetically modified to make the chain grow rapidly without
4
beingaffectedbythesecondarystructureoftemplateDNAwhichcancausestalling
of the polymerase. The modified sequencing polymerase efficiently add substrates
other than dXTP (for example, thio-dXTP or dideoxynucleoside triphosphates).
Because the Sanger dideoxy technique uses DNA synthesis, it inserts deoxynucle-
oside triphosphates and dideoxynucleoside triphosphates into the growing chain,
thereby extending the growing chain from the primer. In this reaction, the chain
stops growing at the position where a dideoxynucleoside triphosphate is taken up.
1.1.2 Sanger dideoxy sequencing
We now describe the Sanger dideoxy sequencing in more detail. In the usual reac-
tion, if four deoxynucleoside triphosphates without any dideoxynucleoside triphos-
phates exist, the growing chain extends from the primer to the 5
′
end of the DNA
template. However, if a small amount of a particular dideoxynucleoside triphos-
phateisaddedtothereactionsolution, thegrowingchainstopspolymerizing atthe
first position at which a dideoxynucleoside triphosphate is incorporated. This po-
sition will occur at random. For example, suppose an A reaction case where dATP
and a small amount of ddATP added to the reaction solution (Figure 1.1). In most
of templates, reaching the T base, DNA polymerases take dATP and insert A into
the growing chain with 3
′
-OH available for the further growth. DNA polymerases
rarely insert ddATP into the chain instead of dATP. In this case, however, the
chain is not able to grow because there is no available 3
′
-OH for the further growth
5
and then the chain growth terminates. Because the chain growth can terminate at
any T position and many DNA templates exist in the reaction solution, there exist
many synthesized DNA molecules which terminate at every possible T position in
the reaction solution.
1.1.3 Band pattern
Usingelectrophoresistechniques, thesesynthesized moleculesaresortedbysizeona
gel(each lanepereachreaction)andsubsequently formbandpatterns (Figure1.2).
The longest molecule is located near the top of the gel and the shortest molecule
is located near the bottom of the gel. A band represents the synthesized molecules
which terminated at the same position, and the space between bands represents
the distance between termination bases. In automated sequencing, four different
fluorescent dyes are adopted fordifferent dideoxynucleoside triphosphates and each
of them is attached to the primer (Smith et al. [78]) or to the dideoxy chain-
terminating nucleotide (Prober et al. [68]). Because four different fluorescent dyes
are resolvable in a single lane of slab gels (or capillary gels in automated sequenc-
ing), fourreactions canoccursimultaneously andareanalyzed inasingle lane. The
reading process typically consists of four steps (Green [31]): “lane tracking”, “lane
profiling”, “trace processing”, and “base-calling”. Lane tracking step is necessary
if four lanes are used. In this step, boundaries between lanes are identified. In lane
profiling, the process of reading the synthesized molecules starts from the shortest
6
molecules on a lane, and then this reading jumps to the next shortest one in any
lane. At each time point with a uniform interval, signal intensities from four arrays
are recorded. This process continues until the resolution is degraded. Usually, the
maximum length ranges between 500 and 800. In trace processing, signal process-
ingtechniques areemployed toreducenoiseandeliminateerroneousdye effects. As
a result, sequencing of the synthesized molecules is generated in the form of chro-
matogram which represents an intensity curve of each signal with different color
(Figure 1.3). In practice, it is not straightforward to convert this chromatogram
into a sequence of base determination. The process of base determination from
the chromatogram is called, base-calling and will be described in Section 1.2. For
more detail of sequencing technology, see (Deonier, Tavare, and Waterman [88];
Green [31]; Ewing et al. [25]). In an ideal trace, all peaks are evenly spaced and
non-overlapping so that each peak corresponds to a specific base (Figure 1.3). In
real sequencing, ambiguous situations occur mainly from imperfections of the se-
quencing reactions, of gel electrophoresis, and of trace processing. Short synthetic
molecules move faster than expected. Their peaks are uneven and noisy (Ewing et
al. [25]). In genome sequencing projects, these poor quality regions are trimmed.
Inmoreevenly spacedregions,thereisalsoanwell-knownnoisefactor,called“com-
pression ,caused by electrophoresis anomalies” (Sanger and Coulson [74]; Sanger et
al. [76]); attheend ofsynthetic molecules, somebases canhybridize tootherbases,
thereby forminga hairpin-like structure, and this structure moves faster than other
7
T
Intensity
Time
A
C
G
T
A
C
G
Figure 1.3: An example of a sequencer trace file (chromatogram file). At each
time point, each signal intensity from a different fluorescence is represent in
adifferent color. Inthis figure, each signal intensity is displayed in adifferent
typeofline. Inthissimple case, thedeterminationof basesistrivialasshown
on top of each signal line.
synthetic molecules of the same length. This phenomenon causes anomalies in gel
electrophoresis (Ewing et al. [25]).
1.2 Sequence quality
1.2.1 Base-calling
Because of the noise occurring from the sequencing process, a method to differ-
entiate true base-calls from noise is required. For more complete solution, it is
also required to assign an error probability (quality score) to each called base.
Lawrence and Solovyev [45] recognized this problem. They proposed a method to
determine true bases and assign error probabilities by using discriminant analysis.
8
This method was superseded by Phred, which is now most widely used base-calling
software, and is described in (Ewing et al. [25], accuracy assessment; Ewing and
Green [24], error probabilities).
Two types of input processed trace (chromatogram files) from sequencing ma-
chines canbe provided: ABI formatorStandard ChromatogramFormat(Dearand
Staden [20]). The base-calling procedure of Phred consists of four phases: “lo-
cating predicted peaks” (phase 1), “locating observed peaks” (phase 2), “matching
observed peaks withpredicted peaks”(phase 3),and“findingmissed peaks”(phase
4).
In phase 1, idealized peaks are predicted. Simple Fourier methods (Press et al.
[67]) are used to generate idealized peaks.
In phase 2, observed peaks are identified by detecting concave regions from the
four trace arrays; concave regions are those regions satisfying 2×v(i)≥v(i+1)+
v(i−1) where v(i) is the trace value at time point i of a chromatogram file (Ewing
et al. [25]). The concave region is recognized as an observed peak if the area of the
region is larger than 10% of the average area of the 10previous observed peaks and
5% of the area of the previous observed peak.
In phase 3, observed peaks are matched to the idealized peaks from phase 1.
Phase 3 consists of three steps:
1. Match the easiest peak first: The shift is defined to be the ratio of the dis-
tance between the observed peak and the predicted peak to the period of the
9
predicted peak. If at least four consecutive predicted peaks have adjacent
observed peaks whose areas are > 2 and shifts range between -0.2 and 0.2,
those predicted peaks and observed peaks are matched.
2. Align observed peaks to the predicted peaks with the highest total score
by using dynamic programming: A score is assigned to a match between a
observed peak and a predicted peak. The score is the observed peak area
multiplied by some penalty factor (< 1) which is based on the direction of
the shift. Right shift implies that the predicted peak is located to the right
of the observed peak. Left shift implies that the predicted peak is located to
the left of the observed peak. A left shift is penalized less because molecules
often tend tomove faster than expected fromtheir lengths. With this scoring
scheme, a dynamic programming algorithm is used to calculate the optimal
score and assignment.
3. Assignobservedpeakswhichwerenotmatchedinthepreviousstagesbutpos-
sibly matched to other predicted peaks: The nearest predicted peaks without
associated matched peaks are assigned to the observed peaks without the
matched predicted peaks. Similarly, the nearest observed peaks without the
matched predicted peaks are assigned to the predicted peaks without the
matched observed peaks. In this case, the observed peaks can be split into
two peaks and then matched to the predicted peaks.
10
The three phases might leave some observed peaks unresolved. These unre-
solved observed peaks occur mainly because of “compression”, “extensive noise”,
or a “lane processing aberration interferes”. In this case, an additional step is re-
quired. In phase 4, we create new predicted peaks and match the predicted peaks
to the observed peaks if observed peaks satisfy the following five conditions which
is described in (Ewing et al. [25]).
1. The observed peak has stronger signal than other signals at its peak.
2. The size of the observed peak exceeds the minimum size criterion.
3. The observed peak is never split.
4. The two flanking observed peaks are matched to the predicted peaks.
5. Overall peak spacing is improved after the new predicted peak is created.
1.2.2 Error probability assignment
Phred also assigns an error probability (quality score) toeach base-call. Ewing and
Green [24] defined two criteria to evaluate assigned error probabilities: validity and
discrimination power. Error probabilities are valid if assigned error probabilities
correspondtoactualerrorprobabilities. Informally,errorprobabilitiesaredispersed
as wide as possible in a powerful discrimination method. The discrimination power
is formally defined as follows. Let e(b) be an error probability of base-call b. For
any subset B of B (B is an arbitrary subset of base-calls), the expected number
11
of errors in B is denoted by
P
b∈B
e(b). The expected error probability for B is
P
b∈B
e(b)
|B|
. Then, forany given error probabilityr, there exists a unique maximal set
of base-calls B
r
such that
P
b∈Br
e(b) is≤r and for a base-call b∈B
r
, B
r
includes
allotherbase-callswhoseerrorprobabilities≤e(b). Then,thediscriminationpower
at the error probability r is
P
r
=

|B
r
|
|B|

The goal of the error probability assignment was to develop a valid strategy with
high discrimination power at low error probability r (r≤ 0.01).
Four parameters were selected to differentiate correct bases from sequencing
errors. These parameters commonly consider the currently assessed peaks with
adjacent peaks, and are used to evaluate the uncertainty in each base-call. From
their tests, the following parameters were most effective to differentiate correct
bases from sequencing errors:
1. “peakspacing”inaseven-peakswindow: theratioofthelargestpeak-to-peak
spacing to the smallest peak-to-peak spacing in a seven-peaks window.
2. “uncalled/called ratio”: the ratio of the amplitude of the largest uncalled
peak to the smallest called peak in seven-peaks window.
3. “peakspacing”inathree-peakswindow: theratioofthelargestpeak-to-peak
spacing to the smallest peak-to-peak spacing in a three-peaks window.
12
4. “peak resolution”: the number of bases between the current base and the
nearestunresolved base, multiplied by-1. Forallfourparameters, thesmaller
ratio indicates the higher quality of base-calling.
The range of each parameter was evenly divided into 50 discrete values with
uniform intervals. The corresponding error probability foreach 4-tuple threshold is
estimated from the training data set. The observed error rate is calculated for the
set of bases defined by each 4-tuple parameter threshold. The 4-tuple parameter
threshold which has the smallest error rateis chosen andplaced in the firstline ofa
lookuptable. Theset ofbases which correspond tothisthreshold areremoved from
consideration. This process is iterated until the lookup table is completely filled.
As a result, the lookup table consists of the entries each of which has a 4-tuple
parameter threshold and the corresponding error probability. After this training,
the corresponding error probabilities can be obtained from the table if parameter
values in traces are given. This error probability is transformed to an quality score
by the formula 1.1 and the quality score is assigned to the base-calling.
q =−10log
10
(p) (1.1)
where p is the estimated error probability. For more detail of Phred, see (Ewing et
al. [25]; Ewing and Green [24])
The quality scores for base-calls are further used to infer an accurate consensus
sequenceandguidethefinishingstepbyprovidingtheinformationonthelocationof
13
low quality regions. For example, the sequence assembler, Phrap (Green [32]), uses
the quality scores for base-callings in inferring a consensus sequence and assigning
scores to the consensus bases.
1.3 Shotgun sequencing strategies
Because the maximum length of a high-quality read in Sanger dideoxy technique
reaches only 500 to 800 bases, an additional strategy, called shotgun sequencing,
is employed to sequence a genome > 1000 bp (Gardner et al. [27]; Anderson [5];
Sanger et al. [75]). Two shogun sequencing strategies for genomes are dominant:
hierarchical shotgun sequencing and whole-genome shotgun sequencing (WGS).
1.3.1 Hierarchical shotgun sequencing
In the hierarchical shotgun sequencing, contig maps of appropriate clones such as
yeastartificialchromosomes (YACs)orbacterialartificialchromosomes (BACs)are
constructed firstandcloneswhichmake upaminimum tilingpatharesubsequently
sequenced (Figure 1.4). The most common way to detect overlaps among clones is
to use restriction enzymes. Restriction enzymes are applied to each BAC clone and
fingerprints are obtained. This unique patterns (fingerprints) are used to detect
overlaps among clones and the overlapping clones are assembled into BAC contigs
14
Minimum tiling path of
A
B
DNA
purified BAC clones
Multiple copies of
Random fragments
BAC clones
Figure 1.4: Hierarchical shotgun sequencing strategy. (A) DNA to be se-
quenced is shown at the top. A constructed physical map of BAC clones is
shownatthebottom. (B)EachBACcloneisselectedtobesequenced. Multi-
ple copies of each BAC clone are prepared. Random fragments are generated
from the purified BAC clones.
15
(bactigs). These BAC contigs are mapped to the physical locations on the chro-
mosomes by analyzing the existence of sequence-tagged sites (STSs) and genetic
markers.
After the BAC contigs are assembled, the least redundant set of BAC clones
(minimal tiling path) arechosen forshotgun sequencing. Acandidate BAC clone is
finally selected for shotgun sequencing if the fingerprints by enzyme digest matches
that of another BAC clone. From each selected clone, multiple copies of BAC
DNA are purified for coverage redundancy and each copy is then randomly cut
by physical shearing methods. These random DNA fragments are subcloned into
plasmid or M13 vectors, which have different cloning biases. Reads of one or both
ends are generated from random subclones and these reads are assembled back to
a consensus sequence by a computational procedure, called sequence assembly.
1.4 Sequence assembly
Sequence assembly consists of three steps: overlap-layout-consensus (Figure 1.5).
Overlaps among reads, which are believed to be true, are detected by the Smith-
Waterman algorithm (Smith and Waterman [79]) (Figure 1.6). This overlap detec-
tion step is the main bottleneck in the sequence assembly process because there are
O(n
2
) comparisons, where n is the number of reads. In practice, the false overlaps
are easily eliminated by analyzing the coexistence of k-mers between reads (Myers
et al. [63]; Batzoglou et al. [8]). The rationale is that if reads do not share a
16
Scaffold
A
B
C
Overlap detection
Layout
Contigs
Figure 1.5: Sequence assembly. (A) The reads with overlapping regions are
subsequently merged. The enlarged view of the overlap detection step is
illustrated in Figure 1.6. (B) The merged reads are assembled into contigs.
Usually, each random fragments are two-end sequenced. Therefore, paired
reads are considered to be contiguous. This extra connectivity information
is called the mate-pair information. In this figure, the mate-pair information
is represented by a curved line. (C) The contigs which are connected by the
mate-pair information are merged into a scaffold. In this figure, the scaffold
contains an intra-scaffold gap between two contigs.
17
A
A C A Read #1 C T C T
T C
G
G C T G C G C A C Read #2
G A C A Read #1 C T C T
T C
G
G C T G C G C A C Read #2
G C G C A C T A T T G A Read #3
G A C A Read #1 C T C T
T C
G
G C T G C G C A Read #2
G C G C A Read #3
A
A G T T A C T
G C T A T T G A T A C T A Read #4
−
−
C
B
C
G
Figure 1.6: Overlap detection. (A) read #1 and read #2 are merged into a
contig by combining the overlapping sequence, CTGCT. (B) read #3 is merged
intothecontigbycombiningtheoverlappingsequence, GCGCAC.(C)Bymerging
read #4, a gap is inserted into the contig.
significant number of k-mers, it is unlikely that the overlap is true. In the layout
construction step, reads are collectively built into contigs and then constructed to
an assembly layout by multiple alignment. Finally, a consensus sequence is derived
from this layout. Churchill and Waterman [15] developed a statistical method to
derive a consensus sequence which is calculated from sequencing error rates and
base composition probabilities. The sequence assembler, Phrap derives consensus
bases based on the Phred quality scores (Green [32]; Ewing et al. [25]).
1.4.1 Whole-genome shotgun sequencing
On the other hand, a whole-genome shotgun sequencing strategy does not require
the physical map construction. In this approach, subclone libraries are prepared
from the entire genome without any use of a physical map where subclone libraries
18
fragments
DNA
Random
Figure 1.7: Whole-genome shotgun sequencing strategy. In this strategy, no
physicalmapisconstructed. Fragmentsarerandomlygenerateddirectlyfrom
the entire genome.
are created from the overlapping clones (Figure1.7). Naturally, this can reduce the
cost although this map construction cost accounts for less than 10% of the total
costs (Waterston, Lander, and Sulston [89]). A feature of most WGS strategies
is that mate-pair information from two-end sequencing of a DNA fragment is ex-
ploited to cope with repeats in the genome; usually, various sizes of subclones are
used to resolve different levels of repeats. For example, in whole-genome assembly
of the human genome, ∼ 2, ∼ 5, and ∼ 50 K bp subclone inserts were employed.
Weber and Myers [90] conducted simulations on the relationship between average
contig length and the long-to-short subclone ratio. They also proposed the opti-
mal long-to-short subclone ratio from the simulations. Although applying whole-
genome shotgun sequencing to the mammalian genomes was a challenge due to
their repetitive nature, this strategy was remarkably successful in sequencing the
human genome (Istrail et al. [40]; Venter et al. [84]) and themouse genome (Mural
19
et al. [62]). However, whole-genome shotgun sequencing strategy does not elimi-
nate completely the necessity of physical maps to finish the genome sequence. For
example, locating scaffolds on the physical chromosomes (Venter et al. [84]) and
filling gaps remain to be resolved. Today, the combination of hierarchical shotgun
sequencing and whole-genome shotgun sequencing is an active area of research. In
this hybrid strategy, subclone libraries are prepared not only from clones but also
directly from the entire genome. In principle, the hybrid strategy can utilize ad-
vantages from both a hierarchical shotgun sequencing approach and whole-genome
shotgun sequencing approach. This hybrid strategy was successfully applied to
mammalian genome sequencing projects (Mouse Genome Sequencing Consortium
[17]; Rat Genome Sequencing Project Consortium [69])
1.5 Diploid genome reconstruction problem
1.5.1 Polymorphism detection
In eukaryotic organisms, two sets of homologous chromosomes exist. However, the
objectiveofshotgunsequencinghasbeentodetermineahaploidconsensussequence
without assuming that the true genome is diploid. After a genome sequence is
assembled, polymorphisms can be detected because fragments originate from the
diploid genome and base-calling is highly accurate. Here, the term polymorphism
is used to include single-nucleotide polymorphisms (SNPs), single indels, multibase
20
substitutions, and multibase indels. Among these, SNPs are the most abundant
type of genetic variation in the genome (Hinds et al. [38]). Various SNP detection
methods are employed in real genome sequencing projects. Most of SNP detection
methods are based on the rationale that if significant inconsistency is observed
at the same locus of an assembly layout, the locus can be regarded as a SNP.
In the human genome, for instance, large-scale SNP discovery was facilitated by
reducedrepresentationshotgunsequencing(Altshuleretal. [3]),andheterozygosity
identification in the overlapping regions of clones (Taillon-Miller et al. [82]). In the
Ciona intestinalis and Fugu rubripes genome, a SNP was called if each allele at
the SNP site was observed at least twice (Aparicio et al. [6]; Dehal et al. [21]);
in the Candida albicans genome, a SNP was called if the posterior probability
of heterozygosity was greater than 0.99 (Jones et al. [41]). High-throughput SNP
genotypingmethods,requiringthepriorknowledgeofSNP-flankingsequences, have
been useful to produce population genotype data (Pastinen et al. [66]; Wang et al.
[86]; Chen et al. [13]). Those methods are not readily applicable in large-scale.
1.5.2 Haplotype
Haplotype is a set of alleles on the chromosome inherited from a parent. Single-
marker analysis was traditionally dominant in genetic studies. However, the power
of single-marker analysis often depends on the properties of individual markers,
thereby providingunclear pictureofallelicassociations(Dalyetal[19]). Haplotype
21
analysis is now considered by many to be more powerful in linkage disequilibrium
studies, disease association studies, and drug response (Daly et al [19]; Drysdale et
al. [23]). In silico haplotyping methods have gained popularity to infer haplotypes
from population SNPs. These methods can be categorized as based on parsimony
(Clark [16]), expectation-maximization (Excoffier and Slatkin [26]; Hawley and
Kidd [37]; Long et al. [54]), the Gibbs sampler (Stephens et al. [80]; Niu et al.
[65]), and phylogeny (Gusfield [34]; Haperin and Eskin [35]).
1.5.3 Inferring haplotypes from genome assembly
Although haplotypes are often inferred from population genotype data, haplotypes
also can be inferred directly from sequencing projects because phase information
between adjacent polymorphisms are preserved by shotgun reads and mate-pair
information. We term the reconstruction problem of diploid consensus sequences
as the diploid genome reconstruction problem. If viral genomes are sequenced and
clone libraries are prepared from more than one individual, diploid consensus se-
quences can be reconstructed. Throughout this article, we use the reconstruction
of diploid consensus sequences and the reconstruction of haplotypes interchange-
ably. However, because the origins offragments areunknown, the reconstruction of
diploid consensus sequences (or, interchangeably, the reconstruction of haplotypes)
in the consensus determination step of sequence assembly is a challenging prob-
lem. This problem can be separated into two parts. It is necessary to differentiate
22
polymorphisms from sequencing errors, and then to infer the phases between ad-
jacent polymorphisms. This haplotyping problem from sequencing projects, called
the SNP haplotyping problem, was recognized and proved to be NP-hard (Lancia et
al. [43]).
1.5.4 Algorithmic approaches
Lancia et al. [43] and Lippert et al. [50] proposed algorithmic strategies to address
the SNP haplotype problem without simulation results. If there are no sequencing
errors in sequence reads, it is trivial to reconstruct haplotypes from a layout. In
reality, these sequencing errors impose a challenge on reconstructing haplotypes. If
erroneous fragments which contain sequencing errors are excluded from a layout,
the layout can be thought to be an error-free assembly. Similarly, if erroneous
loci (columns) which contain sequencing errors are eliminated from the layout, the
layout canbethoughtobeanerror-freeassembly. Onceeither erroneousfragments
or erroneous loci are eliminated, the minimum number of fragments or loci should
be eliminated for more complete haplotype reconstruction. Therefore, Lancia et al.
[43] formulated this SNP haplotype problem as three optimization problems.
1. To find the minimum number of erroneous fragments and eliminate them.
2. Tofindtheminimumnumberoferroneouslociinalayoutandeliminatethem.
3. Toinferhaplotypes whose lengths aremaximized, afterthe minimum number
of erroneous fragments is eliminated.
23
After all the homozygous loci are removed from a layout, the layout can be
abridged as amatrix form, called “SNP matrix”(Figure1.8A). In this SNP matrix,
rows correspond to a set of fragmentsF ={1,...,m}, and columns correspond to
a set of SNPs S = {1,...,n}. Entries in each column take A or B each of which
indicates an allele. N denotes an ambiguous base determination. To solve the three
optimization problems by graph theory and algorithms, this SNP matrix will be
transformed into graphs.
For a SNP i, A
i
is a set of the fragments which cover the allele A and B
i
is a
set of the fragments which cover the allele B. An interesting property of relation
between fragments is observed as follows.
Definition 1.1. Lancia et al. [43]: the fragment j
1
and j
2
are “in conflict” if
j
1
∈A
k
and j
2
∈B
k
, or j
2
∈A
k
and j
1
∈B
k
for some SNP k.
This “in conflict” relation in a SNP matrix is represented as a form of a graph.
A graph is denoted by G = (V,E), where V is a set of vertexes and E is a set of
edges.
Definition 1.2. Lancia et al. [43]: for a SNP matrix M, the “fragment conflict
graph” is the graph G
F
(M) = (F,E
F
), where E
F
is a set of edges connecting
fragments in conflict.
An example of a fragment conflict graph is shown in Figure 1.8B. A special
form of a graph, called a bipartite graph, is defined in Definition 1.3. This bipartite
graph is essential to infer haplotypes from a fragment conflict graph.
24
5
A
1
2
3
4
5 B
5 6 1
B
2 3 4
A N B A
A A A
A B B A
N B N B
B A B
B
4
3
1
5 2
C
B
N B N
N
B
N
N
A
N
1
3
4
2
6
Figure 1.8: SNP matrix, fragment conflict graph, and SNP conflict graph. (A)
A SNP matrix M. Alleles are represented as either A or B. (B) The fragment
conflict graph G
F
(M) is derived from M. Each vertex represents a fragment.
Each edge represents a “in conflict” relationship between fragments. For the
definition of in conflict fragments, see Figure 1.1. (C) The fragment conflict
graph G
S
(M) is derived from M. Each vertex represents a SNP. Each edge
represents a “in conflict” relationship between SNPs. For the definition of in
conflict SNPs, see Figure 1.6.
25
Definition 1.3. Lancia et al. [43]: an undirected graph G = (V,E) is a bipartite
graph if V can be partitioned into two subsets V
1
and V
2
such that for any edge
(u,v)∈E, u∈V
1
and v∈V
2
or v∈V
1
and u∈V
2
. V
1
and V
2
are called “shores”.
Definition 1.4. Lancia et al. [43]: a SNP matrix M is “feasible” if G
F
(M) is a
bipartite graph.
If the fragment conflict graph G
F
(M) is feasible, it is trivial to infer haplotypes
by combining the fragments on each shore. In this case, each shore corresponds to
each haplotype.
Similar to Definition 1.1, the “in conflict” relationship between SNPs is defined
as follows.
Definition 1.5. Lancia et al. [43]: SNP i
1
and SNP i
2
are “in conflict” if there
exist fragments u and v (u ∈ A
i
1
, u ∈ A
i
2
, v ∈ A
i
1
, v ∈ A
i
2
) such that u and v
cover the same allele (either A or B) in one locus (either i
1
or i
2
) but cover the
different alleles in the other locus.
This “in conflict” relationship between SNPs is also summarized as a graph.
Definition 1.6. Lancia et al. [43]: for a SNP matrix M, the “SNP conflict graph”
is the graph G
S
(M) = (S,E
S
), where E
S
is a set of edges connecting SNPs in
conflict.
Anexample ofaSNPconflictgraphisshown inFigure1.8C.Byusingthesedef-
initions, the aforementioned three optimization problems can be formally restated
as follows.
26
1. “MFR (Minimum Fragment Removal)”: eliminate the minimum number of
fragments from a SNP matrix such that the modified matrix is feasible.
2. “MSR (Minimum SNP Removal)”: eliminate the minimum number of SNPs
from a SNP matrix such that the modified matrix is feasible.
3. “LHR(LongestHaplotypeReconstruction)”: eliminatetheminimum number
offragments from aSNP matrix such that the modified matrix isfeasible and
the total lengths of inferred haplotypes are maximized.
Thetheoreticalresultsonthethreeoptimizationproblemswillbeshownwithout
proofs in the following Theorem 1.1, Theorem 1.2, Theorem 1.3, Theorem 1.4,
Theorem 1.5, Theorem 1.6, and Theorem 1.7. For complete proofs, see Lancia et
al. [43].
Lancia et al. [43] proved that MFR and LHR are solvable in polynomial time
based on two simplifying assumptions as follows.
1. Mate pair information between two paired reads is discarded.
2. If a fragment j
1
starts earlier than a fragment j
2
, the fragment j
1
necessarily
ends before the fragment j
2
.
MFR is reduced to a “maximum cost flow problem”. Each vertex in the frag-
ment conflict graph is replaced by two vertexes which are connected by an edge
(cost:1, capacity:1 ). Then linear programming is applied to solve this problem in
polynomial time. The result is summarized in the following Theorem 1.1.
27
Theorem 1.1. Lancia et al. [43]: vertex disjoint paths, P
1
and P
2
, are calculated
from a fragment conflict graph G
F
(M) in polynomial time such that M[P
1
∪P
2
] is
feasible.
Toprove thatLHPcanbesolved inpolynomial time, asimilar reduction rule as
in Theorem 1.1 is used. In LHP case, the cost between two fragments is the differ-
enceoftheirlengths. Similarly, linearprogrammingisappliedtofindthemaximum
cost path. By maximizing the cost, the lengths of haplotypes are maximized.
Theorem 1.2. Lancia et al. [43]: LHP is solvable in polynomial time.
Moredefinitions suchasaclique andaperfectgrapharenecessary toprovethat
MSR is solvable in polynomial time. We still assume that mate-pair information is
not used. We say that two vertexes, u and v, are adjacent if there exists an edge
(u,v)∈E. The completeness of a graph is defined as follows.
Definition 1.7. Lancia et al. [43]: a graph is complete if there exists an edge for
any pair of vertexes. The complete graph with n vertexes is called a clique of size
n.
For a subset of vertexes A ⊆ V, the subgraph induced by A is denoted by
G
A
= (A,E
A
), where E
A
={(u,v)∈ E | u∈ A and v∈ A}. Among subgraphs in
G,therecanexistmanycliques. Amongthesecliques, weareparticularlyinterested
in a maximal clique.
28
Definition 1.8. Lancia et al. [43]: a clique K in G is maximal if there is no other
clique which properly contains K. ω(G) is the number of vertexes in the maximal
clique in G.
Supposewearepaintingeachvertexundertherestrictionthatadjacentvertexes
mustbepaintedwithdifferentcolors. Inthiscase,theminimumnumberofrequired
colors is denoted by χ(G). Then a perfect graph is defined as follows.
Definition 1.9. Lancia et al. [43]: G = (V,E) is a perfect graph if ω(G) = χ(G)
holds for any subset A⊆V
To prove that MSR is solved in polynomial time, G
F
(M) is proved to be a
perfect graph.
Theorem 1.3. Lancia et al. [43]: G
F
(M) is a perfect graph.
Definition 1.10. Lancia et al. [43]: a graph G = (V,E) is an independent set if
no vertexes are adjacent.
The maximum independent set is obtained by using the previous result that
the maximum independent set is derived from a perfect graph (Golumbic [30];
Groetscheletal. [33]). Thereexistsanequivalencerelationbetweenanindependent
set in G
S
(M) and a bipartite graph in G
F
(M) by the following Theorem 1.4.
Theorem 1.4. Lancia et al. [43]: G
S
(M) is an independent set if and only if
G
F
(M) is a bipartite graph.
29
It is obvious that an independent set of G
S
(M) can be transformed into a
bipartite graph of G
F
(M) in polynomial time. Each shore in a bipartite graph is
merged and reconstructed to a haplotype. Therefore, it is concluded that MSR is
also solvable in polynomial time.
Theorem 1.5. Lancia et al. [43]: MSR is solvable in polynomial time.
Theorem 1.6 and Theorem 1.7 are proved based on the assumption that the
mate-pair information is not exploited. In a real sequencing project, however, the
mate-pair information is used to assemble scaffolds by connecting contigs. If the
mate-pair information is used, gaps between paired reads should be allowed. This
extra complication unavoidably makes the SNP haplotyping problem intractable.
The NP-hardness of MFR and MSR is proved by reducing the existing NP-hard
problem to MFR and MSR (Lewis [46]; Yannakakis [92]).
Theorem 1.6. Lancia et al. [43]: MFR is an NP-hard problem.
Theorem 1.7. Lancia et al. [43]: MSR is an NP-hard problem.
In Theorem 1.1, a solution for MFR is obtained in polynomial time with linear
programming. However, there can be multiple feasible solutions for MFR. Lippert
et al. [50] proposed an algorithm to infer a consensus subgraph from multiple
feasiblefragmentconflictgraphs. Inprinciple, aconsensussubgraphcanbeinferred
by taking the intersection of the edges from multiple feasible fragment conflict
graphs. In the worst case, their algorithm has the exponential complexity to infer
a consensus subgraph. It is claimed that their algorithm is efficient in practice.
30
The algorithmconsists oftwosteps. Inthefirststep, thealgorithmprovides the
minimum number of vertexes such that with those vertexes a bipartite graph still
can be derived. If the minimum number is n for a graph G, the size of the graph
G is n. The algorithm starts with removing a large number of vertexes. Then it
checks whether a bipartite graph is derived as the size of a graph increases. If a
bipartite graph is no longer derived, the first step halts.
In the second step, the algorithm find the vertexes which are likely to be in a
consensus subgraph. Suppose that the size of a graph was 2 in the first step. For
each vertex, we remove the vertex and check whether a bipartite graph still can be
derived after any other extra vertex is also removed. If no bipartite can be derived,
it is very likely that the initially removed vertex is in a consensus graph.
1.5.5 Statistical background
In Section 3.1 (Methods Chapter), we present our first approach to statistical
reconstruction of haplotypes from aligned SNP fragments. Here, to simplify the
diploid genome reconstruction problem, we assume that SNP sites are known and
non-SNP sites are excluded from our estimation.
Churchill and Waterman [15] proposed a method for the statistical reconstruc-
tion of a long haploid DNA sequence from a set of assembled fragments. By adopt-
ing similar notation, probabilistic model, and assumptions as theirs, we start off
from their assumptions and probabilistic model which are briefly summarized as
31
follows. DNA sequence accuracy is considered as a function of the redundancy of
coverage and the sequencing errors rates in sequencing the fragment. It is assumed
that the fragments have been correctly assembled. A goal of sequence assembly
is to determine consensus bases and evaluate the accuracy of the consensus bases.
sequence is free from errors, in other words, its accuracy.
To facilitate the statistical analysis, we start off with a number of simplifying
assumptions: A1 – The DNA preparation and cloning stage are free of errors; A2 –
The fragment assembly is correct; A3 – All positions within a fragment are equally
reliable; A4 – All fragment sequences are equally reliable; A5 – Sequencing errors
areindependent oftheirlocalcontext; A6–Thesequencing errorratesareconstant
across the entire sequence; A7 – The composition of the sequence is independent,
both of adjacent bases and over large regions. In Chapter 6, we consider relaxing
some of these assumptions.
Their notation and model are described as follows. A set of fragment sequences
indexed byi =1,...,m are assembled by overlap detection and multiple alignment
process. Theresultoftheassemblyprocedurecanbethoughttobeamatrixwithm
rows. This matrix is called an assembly matrix. In this assembly matrix, each row
corresponds to the aligned sequence of nucleotide bases in a particular fragment
which is represented in either direct or reverse direction. Gaps may be inserted
internallyduringtheprocessofsequenceassembly. Thecolumnj = 1,...,nindexes
the genomic position. The index runs from the leftmost genomic position to the
rightmostgenomicpositionintheassembly. Thetruestatevariableoftheunknown
32
true DNA is denoted by s
j
, where index j indicate the genomic position in the
fragment assembly. The true state variable take a value from the character set
A ={A,C,G,T,−}. The symbol− is inserted into an assembly matrix when extra
column is needed to be aligned to a base in true DNA sequence which does not
existintheassembly matrix. ThealphabetsofbaseswhichappearinthetrueDNA
sequence is defined to be the set of all nonempty subsets ofA. A partial notation
for this set is given by the standard IUPAC DNA alphabet (Cornish-Bowden [18]).
This set of bases is used in describing a consensus sequence. The base-call of
the j-th position and fragment i in the fragment assembly matrix is denoted by
x
ij
. x
ij
may take a value from the set B = {A,C,G,T,−,N,φ}. − represents an
internal gap. N represents the ambiguous base-calling. Null symbol φ is used in
non-aligned regions beyond the ends of a fragment. The distribution of consensus
baseateachpositionisaprobabilitydistribution over thesetwhich isdefined tobe
π
j
(a) = Pr(S
j
=a|x
ij
, i = 1,...,m, a∈A). These probabilities can be computed
by using two parameters, sequencing error rates, composition probability of each
nucleotide.
1.5.6 Derivation of a consensus sequence
The consensus distribution can be used to infer a consensus base at each posi-
tion and subsequently to derive a consensus sequence. Churchill and Waterman
proposed three different approaches to derive a consensus sequence as follows. In
33
Procedure A, the most likely base s
j
is the one that maximizes π
j
(a) over a. This
value is denoted by s
j
. Then s
j
= argmax
a∈A
π
j
(a). In Procedure B, at each posi-
tion, redundant bases are considered as consensus bases. In this procedure, at each
position, the set of characters with minimum redundancy is obtained so that the
sum of the probabilities over the set exceeds a specified level 1−α. In summary,
the minimum set c
j
is selected to satisfy
X
a∈c
j
π
j
(a)≥ 1−α.
The process to select the smallest redundant set is described as follows. The most
likely base a
j
is taken to calculate π
j
(a
j
). If π
j
(a
j
) > 1−α, this process halts.
Otherwise, the next most likely base b
j
is taken to calculate π
j
(b
j
)+π
j
(a
j
). This
process continues until the sum of probabilities exceeds 1− α. In Procedure C,
the idea of Procedure B is extended to deal with the entire sequence or regions of
interests. The parameters in this model include the composition probabilities and
sequencing error rates. In their model, the composition probabilities are denoted
by
p(a) = Pr(S
i
=a), a∈A.
34
The sequencing error rates are denoted by
p(b|a) = Pr(X
ij
=b|S
i
=a), a∈A, b∈B.
In practice, those parameters need to be estimated from observation, in other
words, the fragment assembly. Churchill and Waterman derived a likelihood-based
procedure for the estimation of the parameters, which utilizes the EM algorithm.
The sequencing errorratesandcomposition probabilities areestimated by iterating
the following two steps.
1. Calculate π
i
(a), n
a
, and n
ab
by using the following formulas.
π
i
(a) =
p(a)
Q
m
j=1
p(x
ij
|a)
P
b∈A
p(b)
Q
m
j=1
p(x
ij
|b)
.
n
a
=
n
X
i=1
π
i
(a).
n
ab
=
n
X
i=1
m
X
j=1
1(x
ij
=b)π
i
(a).
2. By using the calculated π
i
(a), n
a
, and n
ab
, obtain the maximum likelihood
estimators of the two parameters, sequencing error rates and composition
probabilities.
p(a) =
n
a
n
.
p(b|a) =
n
ab
n
a
.
35
The general overview of the EM algorithm is described in Section 1.5.7. They
also derived formulas that relate depth of coverage to sequence accuracy.
1.5.7 EM algorithm
It is often hard to calculate the closed form of the maximum likelihood estimator
(MLE) when there exist the missing data. In the EM algorithm, after the missing
data is filled in, the process of maximizing the likelihood is replaced by a series of
iterations which is guaranteed to converge to a local maximum.
Let y be the incomplete data. Let x be the missing data. By combining the
incomplete and missing data, let (y, x) be the complete data. Let θ ∈ Θ be the
parameter to be maximized. Then L(θ|y,x) is the “complete-data likelihood”.
The EM algorithm consists of two steps, called E-step and M-step. The E-step
and M-step are repeated until some convergence condition is satisfied. The E-step
and M-step are described as follows.
1. E-step: calculate T(θ; θ
(k)
)=E
θ
(k){L(θ|y,x)|y}.
2. M-step: select θ
(k+1)
such that T(θ; θ
(k)
) is maximized.
In 1977, Dempster, Laird, and Runbin [22] formulated the EM algorithm in a
generalized framework. Before the theory of the EM algorithm was established by
them, many missing data problems had been dealt with the EM-style methods.
Some ofthem were understoodin the framework ofthe EM algorithmafterthe EM
theory was established.
36
Dempster, Laird, and Runbin [22] proved that the “incomplete-data likelihood”
L(θ
(k)
|y) does not decrease through EM iterations and converges to a “stationary
point” L
∗
if the sequence of L(θ
(k)
|y) is bounded. In 1983, Wu [91] proved that
L(θ
(k)
|y) monotonically converges to L
∗
= L(θ
∗
|y) for some stationary point θ
∗
under the following two regularity conditions.
1. For any L(θ
(0)
|y)>−∞, the set{θ∈ Θ|L(θ|y)≥L(θ
(0)
|y)} is compact
2. L(θ|y) is continuous in Θ and differentiable in the interior of Θ.
Wu [91] also proved the sequence {θ
(k)
} converges to the MLE when L(θ|y) is
unimodal. For more details of the EM algorithm, see Mclachlan and Krishnan [56].
1.5.8 Extension to diploid genome reconstruction
We extend the previously described model to solve the problem of reconstructing
twohaplotypes. InSection3.1,weconsiderthesimplecaseofdiploidgenomerecon-
struction. In this case, two homologous chromosomes originate from two parents.
AfterfragmentassemblyandSNPdetection, twohaplotypesarereconstructedfrom
the SNP alignment. Given a set of SNP sites detected from the assembly align-
ment, we wish to divide the fragment set into two sets, each of which represents
one haplotype. Adding one more parameter to the model in Section 1.5.5, our
model is based on three parameters, composition probabilities, sequencing errors
andfragmentmembership probabilities. Wecalculatejointprobabilitiesofdifferent
37
haplotype configurations from every adjacent pair. Then we connect them and ex-
tend haplotypes. Based on this pairwise reconstruction strategy, simulation results
are presented in Section 4.1 (Simulation studies Chapter). Also we propose a
conceptual strategy to assess the accuracy or confidence of the reconstructed hap-
lotypes. In Section 3.1 (Methods Chapter), the complexity of this strategy is
exponential and thus this strategy is impractical.
1.5.9 Accuracy assessment
Although Churchill and Waterman [15] proposed a statistical way to evaluate the
accuracy of a haploid consensus sequence, the method is not directly applicable to
diploid consensus sequences. Their method was based on the assumption that the
fragment originated from a haploid genome. To evaluate the accuracy of a haploid
consensus sequence, it is sufficient to evaluate the accuracy of consensus bases. It
is straightforward to provide the accuracy of consensus bases from the quality of
each base. The sequence assembler, Phrap (http://www.phrap.org), uses similar
approaches. After the quality score of each base-call is given by Phred, Phrap uses
the base-call quality score and calculates the score for each consensus base.
However, in diploid genome sequences, it is not straightforward to calculate the
quality score for consensus bases because there are polymorphic sites. The phase
information between polymorphisms should be considered as well. Therefore, for
a complete assessment, the phase information needs to be incorporated into the
38
model. In Section 3.1, an accuracy assessment strategy is sketched. However, the
strategy requires a calculation of the likelihood of an assembly. To calculate the
likelihood, all the haplotype configurations are enumerated, thereby requiring ex-
ponential exponential time complexity with respect to the number of SNP sites.
To be a realistic solution, the time complexity should be reduced to linear com-
plexity. In Section 3.3, we present, by using a Markov structure, a fast algorithm
of linear complexity with respect to the number of polymorphic sites to calculate
the likelihood ofanassembly. With this likelihood calculationmethod, overall time
complexityofourconfidencescoreisstilllinear. Furthermore,weusetheconfidence
score and likelihood calculation for improving haplotype estimation.
InSection3.1,ourreconstructionstrategyistoconsideronepairofpolymorphic
sites at a time, and then combine the phase information between adjacent pairs to
construct haplotypes because it is a computationally intensive problem to find the
most likely haplotype due tothe largenumber ofpossible haplotype configurations.
In Section3.3, the reconstructed haplotypes withconfidence score below aspec-
ified threshold are broken to increase the accuracy. The linear time likelihood cal-
culation is also used to estimate the sample fraction of each haplotype in the frag-
ments. Hereafter, we refer to this fraction as haplotype frequency. If the fragments
arefromoneindividual, thenominalhaplotypefrequency isone-half. However, due
to factors such as homozygosity, duplication, amplification bias and misalignment,
deviations may exist to some extent. A good estimate of the real haplotype fre-
quency may help us to detect abnormalities. We estimate haplotype frequency by
39
maximizing likelihood. In Section 4.2 (Simulation studies Chapter), we conduct
simulation study based on the likelihood calculation in Section 3.3. Here, we also
simulate two-endsequencing. Inshotgunsequencing, particularly, inwhole-genome
shogunsequencing, fragmentsofdifferentsizesarepreparedtoprovidedifferentlev-
els of continuity (Adams et al. [1]; Venter et al. [84]; Aparicio et al. [6]; Dehal
et al. [21]). The fragments are subcloned into plasmid, cosmid, or bacterial ar-
tificial chromosome (BAC) according to the size of fragments, and then two-end
sequenced. Our perspective is that the two-end sequencing strategy also provides
long-rangecontinuity in haplotype estimation because thetwo-end sequenced reads
ofeachfragmentcovermorepolymorphicsites. InSection4.2(Simulationstudies
Chapter), we exploit the information from two-end sequencing to estimate longer
haplotypes. In our simulation study, the polymorphism rate was based on the re-
ported polymorphism rate (1.2%) of Ciona intestinalis (Dehal et al. [21]); the high
polymorphism rate, which is higher than other organisms such as Fugu rubripes
(0.4%) and Homo sapiens (0.1%), was appropriate for evaluating the accuracy of
ourmethod(Aparicio et al. [6]; Venter et al. [84]). Two-endsequencing ofdifferent
clone libraries (plasmid, cosmid, BAC), was simulated; the fragments of variable
size(1.8Kbp∼120Kbp)weregeneratedcomplyingwiththeirproportionsinCiona
intestinalis sequencing project (Dehal et al. [21]).
40
1.5.10 Gibbs sampling and its application to a real genome
project
InSection3.1and3.3(MethodsChapter),thereconstructionstrategieshavethree
limitations. First, because the number of possible haplotypes grows exponentially
with the number of SNP sites, the most likely haplotypes were inferred from every
two adjacent SNP and then the inferred haplotypes for the adjacent pairs were
connected by using their phase consistency. However, it is more advantageous to
infer haplotypes from more than two SNPs at a time. Second, the pairwise recon-
struction strategy is designed to infer haplotypes when pre-determined SNP sites
are provided. To be applied to real genome sequencing projects, a SNP detection
methodneedstobeincorporatedinaphasedeterminationmethod. Third,constant
sequencing error probabilities are assumed in Section 3.1 and 3.3. Position-specific
quality scores can have more differentiating power in eliminating wrong base-calls.
In Section 3.4, we propose a Bayesian model selection to detect potential poly-
morphisms. After the potential polymorphisms are detected, we use the Gibbs
sampler which is a Markov chain Monte Carlo (MCMC) approach to determine the
phases between thepotentialpolymorphisms. Inthisnewapproach, themostlikely
haplotypes are inferred from four potential SNPs rather than two. To reflect the
reality of genome sequencing projects, mate-pair information is exploited to extend
the haplotypes beyond a sequence assembly contig in both simulation study (Sec-
tion 4.3, Simulation studies Chapter) and Ciona intestinalis genome sequence
41
(Ciona intestinalisgenome Chapter). Ourmethod can accommodates two-end
sequenced reads of any clone such as plasmid, cosmid, or BAC. The Ciona intesti-
nalis genome is an ideal organism to infer haplotypes from because its reported
polymorphism rate (1.2%)was high, and libraries were mostly prepared from a sin-
gle individual (Dehal et al. [21]). In Section 3.4, we integrate Phred quality scores
into our error model. In both simulation study and the Ciona intestinalis genome
project,position-specificqualityscoresareusedinsteadofconstantsequencingerror
probabilities.
42
Chapter 2
Notation and Probabilistic Model
2.1 Notation
The statistical reconstruction method by Churchill and Waterman is now extended
to a more generalized two haplotype case. The assumptions A1-A7 mentioned in
Section1.2isusedinourtwohaplotypecase. AsinSection1.2,theassemblylayout
can be summarized in a matrix form (Figure 2.1). In this figure, the targets, which
are shown at the top and bottom, are two haplotypes in a single individual. Six
fragmentsarealignedinthemiddle ofthematrix. This modelapplies tothecaseof
two-end sequencing if we assign the letter φ to those uncalled bases between paired
reads. Our observation is a layout where all fragment sequences are aligned as a
result of sequence assembly. Our reconstruction strategy is applied only to SNP
sites, which areassumed tobeknown. As aresult, ourobservationis the alignment
of SNP fragments. We will deal with non-SNP sites in Section 3.4. In Section 3.4,
our method will be applied to a real sequencing project. In this case, the statistical
43
Haplotype 1 A G C C T A G A T T C
Origin 1 A G A C T A G A φ φ φ
2 C C T A − G C T A φ φ
1 φ G C C T A G A T T φ
2 C C T A − T C T A G T
2 φ C T A − G C T − G T
1 φ φ φ C T A G A C T C
Haplotype 2 C C T A − G C T A G T
Figure 2.1: An illustrative example of the problem. The two target hap-
lotypes are shown at the top and bottom respectively. Six fragments are
aligned in the middle. In reality, the targets and origins of fragments
are not observed.
method to detect potential SNPs is required without disregarding non-SNP sites.
The issue of fragment orientation will be omitted for the sake of simplicity.
Our goal is to reconstruct two haplotypes. Ideally, when SNP sites and a set
of fragments are given, the fragment set can be separated into two sets based on
their origin. In practice, however, the distribution of fragment membership is hard
to obtain. We will explain this issue later. In Section 3.4, the distribution of
fragment membership is calculated by a series of sampling. We keep the notation
and definitions consistent with those in Section 1.2.
The given data set (observation) is an assembly matrix x
ij
with m rows, where
m is the total number of SNP fragments. Each row corresponds to a particular
fragment which is the ordered sequence of bases and possible gaps. The true hap-
lotypes, which are unknown, are denoted by σ
k
= s
k,1
s
k,2
...s
k,n
,k = 1,2, where
s
k,j
represents the true base of the j-th position in the haplotype k in the frag-
ment assembly. The true base takes value from the set A = {A,C,G,T,−}. The
44
membership (origin) of the i-th fragment is denoted by f
i
, i = 1,...,m. That is,
f
i
= 1 if it is from haplotype 1, and f
i
= 2 if it is from haplotype 2. The true
underlying base of fragment i at position j is denoted by y
ij
. Then, the follow-
ing relation holds: y
ij
= s
f
i
,j
, i = 1,2. To differentiate the state variable from
the following random variables, we use capital letters to represent the random ver-
sion of above variables. That is, the base-calls in assembly matrix are denoted
by X = {X
ij
, i = 1,...,m, j = 1,...,n}. Their underlying true bases are de-
noted by Y = {Y
ij
, i = 1,...,m, j = 1,...,n}. True haplotypes are denoted
by S = {S
i,j
, i = 1,2, j = 1,...,n}. The fragment memberships are denoted by
F ={F
i
, i = 1,...,m}. X
ij
and Y
ij
take values fromB ={A,C,G,T,−,N,φ} and
A,respectively. −represents agap. Nrepresents anyambiguous determinationofa
base. The null symbol φis fornon-aligned positionsbeyond theends ofafragment.
Each fragment in the assembly can be described as a set of base-calls, which are
denoted by X
i·
= {X
ij
, j = 1,...,n}. Each haplotype is also described as a set
of true bases, which are denoted by S
i·
= {S
i,j
, j = 1,...,n}. The base-calls in
column j is denoted by X
·j
={X
ij
, i = 1,...,m}.
2.2 Probabilistic Model
2.2.1 Sequencing error rates
In Section 1.2, we mentioned that the base-calls sometimes fail to reflect the true
bases. When base-calls are not the same as their underlying true bases, we call
45
these phenomena as sequencing errors. Because these sequencing errors occur in
practice, we incorporate this sequencing error into our probabilistic model. More
formally, according to our model, the sequencing errors are defined to be the case
where Y
ij
= a but X
ij
= b, b 6= a. Under our assumptions A5 and A6, We have
assumed that measurement errors occur independently with identical distribution
across the assembly because the notation is complicated without the assumption.
However, this assumption will be relaxed in Section 3.4. The sequencing error
probabilities are denoted by
Pr(X
ij
=b|Y
ij
=a), a∈A, b∈B.
If we assume the constant sequencing error probabilities across the assembly, the
sequencing error rates can be written as
p(b|a) =Pr(X
ij
=b|Y
ij
=a), a∈A, b∈B.
The errors can be categorized as single-nucleotide replacement, single-nucleotide
insertion, deletion.
46
2.2.2 Composition probabilities
The composition probabilities of the two haplotypes are denoted by
Pr(S
k,j
=a), a∈A.
Becausewehaveassumedthatbasesoccurindependentlywithidenticaldistribution
across the assembly, the indexes k,j can be dropped and rewritten as
p(a) = Pr(S
k,j
=a), a∈A.
If we consider genotypes at the same locus simultaneously, we denote the com-
position probabilities by
Pr(S
1,j
=a, S
2,j
=b) =Pr(S
1,j
=b, S
2,j
=a), a, b∈A.
We can also assume that the genotypes are independently and identically sampled
from the composition probabilities. Therefore, the probability of genotypes can be
rewritten as
p(a)p(b) =Pr(S
1,j
=a, S
2,j
=b) = Pr(S
1,j
=b, S
2,j
=a), a, b∈A.
47
Note that our definition of sequence composition includes the gap frequency.
Namely, we are also considering single-indel polymorphisms under our model by
including “-” inA.
2.2.3 Fragment membership
TheoriginsofthefragmentsaredenotedbyF ={F
i
,i = 1,...,m}. Itisreasonable
to consider that the fragments are sampled according to Bernoulli trials:
F
i
=







1 with prob λ
1
=λ,
2 with prob λ
2
= 1−λ.
In section 3.1, for simplicity, we assume that the rate at which the haplotype k
emits afragment isgiven byp(σ
k
) =Pr(F
i
=k) =λ =
1
2
, fork = 1,2. The random
fragments are generated either in the direct or reversed orientation. To deal with
the issue, we introduce the complementary bases as follows:
˜
A = T,
˜
T = A,
˜
G = C,
˜
C = G,
˜
φ = φ, and
˜
− = −. However, for the simplicity of our model, we do not
consider this orientation issue in the model.
2.2.4 Missing information
In sequence assembly, the base-calls in the assembly matrix {x
ij
} are observed
while theS andF is missing. Therefore, we need to estimateS andF based on the
observations (base-calls). The joint conditional distribution Pr(S,F|X) provides
48
information regarding both haplotypes and memberships. To calculate the joint
conditional distribution, we need to compute the joint distribution:
Pr(X,F,S)= Pr(X|F,S)Pr(F)Pr(S),
Pr(S) = Pr(S
1·
)Pr(S
2·
) =
n
Y
j=1
Pr(S
1j
)
n
Y
j=1
Pr(S
2j
), Pr(F) =
m
Y
i=1
Pr(F
i
),
Pr(X|F,S)=
m
Y
i=1
Pr(X
i·
|F
i
,S
F
i
·
) =
m
Y
i=1
Pr(X
i·
|Y
i·
=S
F
i
·
).
These probabilities are respectively the composition probabilities, haplotype fre-
quencies, and sequencing error rates. Consequently, Bayes’ rule gives us the condi-
tional distribution:
Pr(S,F|X)=
Pr(X,F,S)
Pr(X)
.
Ourspecific goalsaretoevaluatetheconditionaldistributionofthehaplotypecom-
position given observation: Pr(S|X) and to obtain the most probable haplotypes
by: max
S
Pr(S|X).
49
According to the Bayes’ rule, we have
Pr(S|X)=
Pr(X,S)
Pr(X)
, (2.1)
where Pr(X)=
P
S
Pr(X,S). The formula to compute Pr(X,S) is given by
Pr(X,S)= Pr(X|S)Pr(S)= Pr(S)
m
Y
i=1
Pr(X
i·
|S)
= Pr(S)
m
Y
i=1
[Pr(X
i·
,F
i
=1|S)+Pr(X
i·
,F
i
= 2|S)]
= Pr(S)
m
Y
i=1
[Pr(X
i·
|S,F
i
=1)Pr(F
i
=1)+Pr(X
i·
|S,F
i
=2)Pr(F
i
=2)]
= Pr(S)
m
Y
i=1
[Pr(X
i·
|Y
i·
=S
1·
)λ
1
+Pr(X
i·
|Y
i·
=S
2·
)λ
2
]
= Pr(S)[
m
Y
i=1
2
X
k=1
λ
k
n
Y
j=1
Pr(X
ij
|S
kj
)] (2.2)
If we set Pr(F
i
=k)=
1
2
,the formula to compute Pr(X,S) becomes
Pr(X,S)= Pr(X|S)Pr(S)= Pr(S)
m
Y
i=1
Pr(X
i·
|S)
= Pr(S)
m
Y
i=1
[Pr(X
i·
,F
i
=1|S)+Pr(X
i·
,F
i
= 2|S)]
= Pr(S)
m
Y
i=1
[Pr(X
i·
|S,F
i
=1)Pr(F
i
=1)+Pr(X
i·
|S,F
i
=2)Pr(F
i
=2)]
= (
1
2
)
m
Pr(S)
m
Y
i=1
[Pr(X
i·
|Y
i·
=S
1·
)+Pr(X
i·
|Y
i·
=S
2·
)]
= (
1
2
)
m
Pr(S)
m
Y
i=1
[
n
Y
j=1
Pr(X
ij
|S
1j
)+
n
Y
j=1
Pr(X
ij
|S
2j
)], (2.3)
50
Chapter 3
Methods
3.1 Pairwise Methods
In this section, we propose a simple reconstruction strategy which is based on
the pairwise calculation. The target genome sequence is diploid. The origin of
the a haplotype is paternal and the origin of the other haplotype is maternal.
Our reconstruction method is applied to the fragment assembly which we assume
to be correct without any misassembly. Here, our strategy is also based on the
assumption that SNP sites are correctly detected. From the assembly layout, our
method disentangles the fragments into two subsets based on their origins. As we
mentioned inChapter 2, the parameters ofprobabilistic modelaresequencing error
probabilities, composition probabilities, and fragment membership probabilities.
In this section, we also assumed that those parameters are constant. According to
our model, the reconstruction of haplotypes requires exponential time complexity
with respect to the number of SNP site. Therefore, we divide the reconstruction
51
heterozygote
{A,G},{A,C},{A,T},{A,-},{G, C},{G,T},{G,-},{C, T},{C,-},{T,-}
Table 3.1: Genotypes at one SNP site.
problem into smaller problems, solve the smaller problems, andthen combine those
solutionsintoaglobalsolution. WedeterminethephaseofeveryadjacentSNPsites
and then construct longer haplotypes from those pairs. Equivalently, we calculate
the probability of different haplotype configurations and choose the most likely
haplotypeconfigurationfromadjacentSNPsites. Wealsoproposeawaytoevaluate
the confidence of estimated haplotypes. The calculation of this method is also
computationally infeasible. The running time of this evaluation method will be
reduced to be linear in Section 3.3.
3.1.1 Genotype determination
We start from the simplest case. Suppose we set the genotype at the j-th site to
be s
1,j
and s
2,j
, then (2.3) becomes
Pr(X
·j
=x
·j
;S
1,j
=s
1,j
,S
2,j
=s
2,j
)
= (
1
2
)
m
Pr(S
1,j
=s
1,j
,S
2,j
=s
2,j
)
m
Y
i=1
[p(x
ij
|s
1,j
)+p(x
ij
|s
2,j
)]. (3.1)
52
1st SNP heterozygote
2nd SNP heterozygote
# haplotypes 10× 10× 2
Table 3.2: Number of haplotypes at two SNPs.
Consequently, the conditional probability we want to obtain is
Pr(S
1,j
=s
1,j
,S
2,j
=s
2,j
|X
·j
=x
·j
)=
Pr(X
·j
=x
·j
;S
1,j
=s
1,j
,S
2,j
=s
2,j
)
P
Pr(X
·j
=x
·j
;S
1,j
=s
1,j
,S
2,j
=s
2,j
)
.(3.2)
The most likely genotype can be determined at thej-th position from (3.2). As
shown in Table 3.1, we consider only 10 heterozygous cases of (s
1,j
,s
2,j
) because
we know that the given sites are heterozygous. For instance, heterozygotes such as
(A,G)and(G,A)cannotbedifferentiated. Byusing(3.2),theconditionalprobability
is calculated for each of 10 different heterozygous cases. The most likely genotype
is reported to be the final estimation. This calculation is very fast because we
consider a specific SNP site and m in (3.2) is the coverage of the SNP site.
3.1.2 Estimation of the most likely pairs
In addition to determine the genotypes, the phases among SNP sites should be de-
rived to infer haplotypes. To do this, Pr(S|X) should be calculated for more than
one SNP site simultaneously because there is no phase information in (3.2). The-
oretically, Pr(S|X) can be calculated from any number of SNP sites. In practice,
53
however, it is computationally infeasible because the complexity grows exponen-
tially as the number of SNP sites increase as all the possible haplotype configu-
rations are enumerated. Therefore, our strategy is to infer the phase information
from a pair of adjacent SNP sites. That is, for j
1
6=j
2
we calculate
Pr(S
k,j
=s
k,j
,k = 1,2,j =j
1
,j
2
|X
ij
=x
ij
, i = 1,...,m,j =j
1
,j
2
)
=
Pr(X
ij
=x
ij
,i= 1,...,m,j =j
1
,j
2
;S
k,j
=s
k,j
,k = 1,2,j =j
1
,j
2
)
P
Pr(X
ij
=x
ij
,i= 1,...,m,j =j
1
,j
2
;S
k,j
=s
k,j
,k = 1,2,j =j
1
,j
2
)
. (3.3)
To calculate this conditional probability, we need the formula (2.3), which now
becomes
Pr(X
ij
=x
ij
, i = 1,...,m,j =j
1
,j
2
;S
k,j
=s
k,j
,k = 1,2,j =j
1
,j
2
)(3.4)
= (
1
2
)
m
Pr(S
k,j
=s
k,j
,k =1,2,j =j
1
,j
2
) (3.5)
m
Y
i=1
[
2
X
k=1
Pr(X
i,j
1
=x
i,j
1
|S
k,j
1
=s
k,j
1
)Pr(X
i,j
2
=x
i,j
2
|S
k,j
2
=s
k,j
2
)].
After the most likely haplotypes are inferred from every adjacent pair, those
pairs are connected by their consistency (Figure 3.1).
In Table 3.2, the number of haplotype configurations is enumerated. The total
numberissummedupto200differenthaplotypeconfigurationsin(s
1,j
1
,s
2,j
1
,s
1,j
2
,s
2,j
2
).
A multiplication of the number of states in each position generates 10×10 states.
In addition to this number, extra 10× 10 states are added because extra states
occur by inverting the genotypes at the j
2
-th position. For instance, suppose that
54
j+1
Haplotype 2
Haplotype 1
C G
A T
Position
G
T
T
C
j
Position
Haplotype 1
Haplotype 2
A
C
T
G
C
T
A
B
j
j+1 j j−1
j−1
Figure 3.1: Consistency between adjacent pairs. At the (j−1)-th position
and j-th position, we have the phase-determined pair, AT and CG. At the j-
th position and (j+1)-th position, we obtain the pair, TC and GT. By the
consistency at the j-th position, those pairs are connected and become ATC
and CGT expanding from the (j−1)-th position to the (j+1)-th position.
we calculate the probability Pr(S|X) with the genotype A, C at the j
1
-th position
andthe genotype T, Aatthej
2
position. We firstcalculate theprobability Pr(S|X)
for the pair, AT and CA. Then we calculate the probability with the second position
j
2
inverted. In this case, the pair we consider is AA and CT.
Pr(the most likely haplotypes) is very low, which means the estimated haplo-
types are insignificant. Therefore, we define a confidence score to evaluate our esti-
mation. IfPr(the most likely haplotypes)is below athreshold, say 0.5, we consider
ourestimatedhaplotypestobeinsignificantanddonotconnectthoseadjacentSNP
sites. We also define another confidence score which is based on the ratio defined
as follows.
Pr(the most likely haplotypes)
Pr(the second most likely haplotypes)
. (3.6)
55
j+1
Haplotype 2
Haplotype 1
C G
A T
Position
G
A
T
C
Position
Haplotype 1
Haplotype 2
A
C
T
G
C
T
A
B
j j−1 j j+1
j−1 j
Figure 3.2: Inconsistency between adjacent pairs. (A) The inconsistency be-
tween adjacent pairs is observed. (B) By considering three SNP adjacent site
at a time, the inconsistency is resolved.
If the ratio is below a threshold, say 1.1, we also consider the estimated haplotypes
to be insignificant. We call this ratio odds ratio. Alternatively, we can define a
different ratio defined as follows.
Pr(most likely haplotypes)
1−Pr(most likely haplotypes)
.
However, our simulation shows that this ratio is not effective when it is combined
withPr(the most likely haplotypes). Inprinciple, wecanconsideranycombination
of two SNP sites. However, our simulation results confirm that it is reasonable to
focus our estimation on adjacent pairs of SNP sites.
Our pairwise strategy does not guarantee that two adjacent pairs are always
consistent. As shown in Figure 3.2, the pair at the (k−1)-th position and k-th
56
G
j j + 1
C
A
T
C
C
T
A
C
j + 2
A
T
j + 3 j + 4 j + 5 j − 1 Position j + 6 j + 7 j + 8 j + 9
C
T
A
C
C
G
G
T
C
A
A
Figure 3.3: Reconstructing haplotypes from pairwise calculations.
position might be inconsistent with the pair at the k-th position and (k + 1)-
th position. In this case, one of the pairs is estimated to be wrong because of
sequencing errors. We resolve this inconsistency by inferring haplotypes from three
SNP sites at a time. In other words, we calculate Pr(S|X) from three SNP sites.
The number of haplotype configurations in three SNP sites is still computationally
feasible.
3.1.3 Reconstruction strategy
We used the consistency of the phase between adjacent pairs to construct longer
haplotypes. Figure 3.3 illustrates our reconstruction strategy. The position of each
SNP site is indexed by the top row. In this figure, 11 SNP sites are indexed from
the (j − 1)-th position through the (j + 9)-th position. Each pair is connected
with an adjacent pair as long as the genotype at the shared position is consistent.
For example, we have haplotypes, CCTAA and ATCCT from the (j−1)-th position
through(j+3)-thposition. Wehave haplotypes, ACTAAand CGGCG fromthe(j+5)-
th position through (j +9)-th position. At some position, either of the confidence
score and ratio can be below a pre-specified threshold. Then we stop extending
57
haplotypes. In Figure 3.3, we observe two such positions, (j+3) and (j+4). Some
SNP sites might not be connected to other SNP sites. We call those isolated SNP
sites singletons. In Figure 3.3, a singleton exists at the (j +4)-th position.
3.1.4 Confidence score for estimated haplotypes
As the estimated pairs are evaluated for their confidence, we also need to evaluate
the confidence for estimated haplotypes which are longer than the pairs. Similar
to the confidence score for the pairs, we define a confidence score of haplotypes
s
1,h
s
1,h+1
...s
1,h+k
and s
2,h
s
2,h+1
...s
2,h+k
by
Pr(S
k,j
=s
k,j
,k = 1,2,j =h,...,h+k|X
ij
=x
ij
, i = 1,...,m,j =h,...,h+k).
(3.7)
To calculate this confidence score, we need to calculate the joint probability
Pr(S
k,j
=s
k,j
,k =1,2,j =h,...,h+k; X
ij
=x
ij
, i= 1,...,m,j =h,...,h+k),
which can be calculated using formula (2.3) and the marginal probability Pr(X
ij
=
x
ij
, i =1,...,m,j =h,h+1...,h+k).
Proposition 3.1. The number of different haplotypes is 10·20
K−1
.
Proof: Since positionsotherthanthefirstpositioncanbeinverted, thenumber
of haplotypes is 10
K
2
K−1
= 10·20
K−1
.
58
This marginal probability is the sum of joint probabilities over all haplotype
configurations. The number of haplotype grows exponentially as the size of hap-
lotypes increases. Because estimated haplotypes are much longer than a pair, this
imposes a challenge. We can address this problem in two different directions. One
direction is that we only use a reduced set of actually observed nucleotides in each
position. This approach leads to an approximate solution, which inevitably overes-
timates the probability of the most likely haplotypes. An accurate solution can be
obtainedwith alittlemorecost. Forexample, suppose thatonly twobases, Tand G
are observed at a SNP site. Then we compose a new character set{T,G,TG}, where
GT represents {A,C,−}. At the next SNP site, if we only observe two bases A and
C other than N, then we assume that the two letters of the genotype are from the
alphabet{A,C,AC}, where AC represents {G,T,−}. By doing so for each SNP site,
the list of candidate haplotypes is reduced. As indicated in Table 3.2, the number
of haplotype configurations is normally 2×10×10. In this particular example, the
number canbereduced to3×3. Ofcourse, wenowneed tocompute thesequencing
error rates for the new alphabet by Bayes’ rule. For example,
Pr(X
ij
=A|Y
ij
=AC) = Pr(X
ij
=A|Y
ij
=G, T, −)
=
p(G)p(A|G)+p(T)p(A|T)+p(−)p(A|−)
p(G)+p(T)+p(−)
.
59
The other direction is to reduce the running time by using a Markov property.
We introduce the second direction in Section 3.3. Our reconstruction strategy is
summarized in the following algorithm.
Algorithm3.1. SupposethatwearegivenanassemblyX ={X
ij
,i = 1,...,m,j =
1,...,n}.
1. For j = 1 to n−1, apply formula (3.4) to calculate the conditional probability
Pr(S
k,l
=s
k,l
,k =1,2,l =j,j +1|X
il
=x
il
, i = 1,...,m,l =j,j +1),
for all the possible haplotype configurations shown in Table 3.1 and 3.2. For
each estimated pair, the confidence is evaluated by the confidence score and
odds ratio. If either of them is below a threshold, the connection, which is
thought to be uncertain, is broken.
2. Connect the pairs obtained in Step 1 and construct longer haplotypes. If
inconsistent occur between adjacent pairs, then we consider three SNP sites
jointly. For those triples, we select the most probable haplotypes or break the
connection based on odds ratio and confidence score.
3. Evaluate the confidence scores of haplotypes according to formulas (3.7) and
(2.3).
4. Foreachsingleton, themostlikelygenotypeisestimatedaccordingtothesingle
confidence score (3.1).
60
S
(L)
1
ww '' ···
S
(R)
1
S
(L)
2
gg 77 ··· S
(R)
2
C
1
Figure 3.4: Connect two haplotypes by a bridging strategy.
S
(L)
1
ww '' ···
S
(R)
2
S
(L)
2
gg 77 ··· S
(R)
1
C
2
Figure3.5: Connect twohaplotypes byinvertingthephase. Notethat the
phase is inverted from Figure 3.4
3.2 A bridging strategy
ThepairwisereconstructionstrategyinSection3.1considerstheadjacenttwopoly-
morphic sites to infer phases. If the confidence score for a pair, (3.3) and (3.6),
is below a threshold, the extension does not continue further. However, it is ob-
vious that distant polymorphic sites are helpful to determine phases because each
fragment often covers more than two polymorphic sites. We propose a statistical
strategy which connects adjacent haplotypes. We call this strategy as a “bridging”
strategy. Theadjacenthaplotypesareconstructedfromthepairwisereconstruction
strategy. Thereexisttwoways toconnecttwoadjacenthaplotypes whichareshown
61
in Figure 3.4 and Figure 3.5. Two possible configurations are denoted by C
1
and
C
2
, which are shown in Figure 3.4 and Figure 3.5, respectively.
As shown in Figure 3.4, the left haplotypes are denoted by S
(L)
1
and S
(L)
2
in our
bridging strategy. Similarly, the right haplotypes are denoted by S
(R)
1
and S
(R)
2
.
Some fragment, Z
i
, spans two adjacent haplotypes. A set of such fragments are
denoted by Z ={Z
i
,i∈ I}. When a fragment Z
i
spans two adjacent haplotypes,
the indexes ofthe polymorphic sites inS
(L)
1
andS
(L)
2
aredenoted byJ
i,L
. Similarly,
the indexes of the polymorphic sites in S
(R)
1
and S
(R)
2
are denoted by J
i,R
. Then
Pr(Z|C
1
)=
Y
i∈I
Pr(Z
i
|C
1
), Pr(Z|C
2
) =
Y
i∈I
Pr(Z
i
|C
2
),
where
Pr(Z
i
|C
1
) =
1
2
Y
j∈J
i,L
Pr(Z
ij
|S
(L)
1j
)
Y
j∈J
i,R
Pr(Z
ij
|S
(R)
1j
)+
1
2
Y
j∈J
i,L
Pr(Z
ij
|S
(L)
2j
)
Y
j∈J
i,R
Pr(Z
ij
|S
(R)
2j
)
Pr(Z
i
|C
2
) =
1
2
Y
j∈J
i,L
Pr(Z
i
j|S
(L)
1j
)
Y
j∈J
i,R
Pr(Z
ij
|S
(R)
2j
)+
1
2
Y
j∈J
i,L
Pr(Z
ij
|S
(L)
2j
)
Y
j∈J
i,R
Pr(Z
ij
|S
(R)
1j
).
62
By calculating the Pr(Z|C
1
) and Pr(Z|C
2
), we determine the phase according
to the following decision rule:



















accept C
1
if
Pr(Z|C
1
)
Pr(Z|C
2
)
> threshold,
accept C
2
if
Pr(Z|C
2
)
Pr(Z|C
1
)
> threshold,
no decision if otherwise.
The value of the threshold should be greater than 1. Our bridging strategy is
readily adapted to two-end sequencing cases. The un-called region between paired
reads is filled with φ. Then the mate-pair information is used to connect adjacent
haplotypes.
3.3 A fast algorithm to compute Pr(X)
It is possible that all the phases are not determined because the SNP density, cov-
erage and origins of fragments are not strictly uniform across the entire clone. To
provide a complete solution of diploid genome reconstruction, we need a confidence
scoretoevaluateestimated haplotypes basedonourobservationfromtheassembly.
Based on our model, we need to calculate the likelihood of the assembly, Pr(X).
To calculate the marginal probability Pr(X), Pr(S,X) is summed over all S. Al-
though a pairwise reconstruction strategy is used to reduce the running time, it
is still required to enumerate all the haplotype configuration to calculate Pr(X).
63
c c c
c
c
c
c
c
c
c
c
Sequence
SNP (k-1)-th k-th (k+1)-th
Λ
1
(k)
Λ
2
(k)
Λ
3
(k)
Λ
4
(k)
Figure 3.6: Definition of four index sets. Γ(k) = Λ
3
(k)
S
Λ
4
(k) = Λ
2
(k +
1)
S
Λ
4
(k+1).
According to Proposition 3.1, the enumeration of all the haplotype configurations
leads to a computational infeasibility. As we mentioned, one way to avoid this is
to use a reduced character set of nucleotides at each position. In this section, the
secondapproachisintroduced. AMarkovpropertyisusedtoreducethecalculation
of Pr(X) to be linear with respect to the number of SNP sites.
The likelihood calculation and confidence score is further used to improve our
haplotype estimation in two directions. One direction is that low-scored phases
are disconnected by checking the confidence score of every connection. The other
direction is that instead of using nominal frequency
1
2
, the haplotype frequency is
estimated to reflect the actual contribution of each haplotype. In a small scaffold,
the true haplotype frequency is not always a half. In this case, we can use the
estimated haplotype frequency instead of the constant frequency
1
2
.
64
3.3.1 A Markov property
By using a Markov property, the likelihood calculation proceeds from the genomic
position with a low index to the positions with higher indexes. Suppose that we
are considering the k-th position. The fragments we are interested in at the k-th
position are the fragments which cover the k-th position. The set of fragments
is denoted by Ω(k), where k indicates the current position. The set of fragments
which cover atleast oneofpositions uptok is denoted byΘ(k). Obviously, Θ(k)=
S
k
j=1
Ω(j). ThesetΩ(k)iscategorizedasfourdisjointsubsets: Λ
1
(k),Λ
2
(k),Λ
3
(k),
and Λ
4
(k). The definition of four disjoint subset is illustrated in Figure 3.6. Λ
1
(k)
is the set of fragments which cover the k-th position but neither the (k− 1)-th
position nor the (k+1)-th position. Λ
2
(k) is the set of fragments which cover both
the (k−1)-th and k-th position but not the (k+1)-th position. Λ
3
(k) is the set of
fragments which cover both the k-th and (k+1)-th position but not the (k−1)-th
position. Λ
4
(k)isthesetoffragmentswhichcoverthe(k−1)-th,k-thand(k+1)-th
position.
We define Γ(k) = Λ
3
(k)
S
Λ
4
(k). Then there are close relationships among
Γ, Λ, and Θ. It is obvious that Γ(k) = Λ
2
(k + 1)
S
Λ
4
(k + 1), and Θ(k + 1) =
Θ(k)
S
Λ
1
(k+1)
S
Λ
3
(k+1).
The relationship between index sets in consecutive positions is illustrated in
Figure3.7. ThedependencestructureofΩ(k+1)onΘ(k)isobserved. Byusingthis
dependency, we can recursively calculate the likelihood Pr(X) without summing
65
k k+1
Λ
1
(k) Λ
1
(k+1)
Λ
2
(k) Λ
2
(k+1)
fragments starting
at k+1-th loci
vvn
n
n
n
n
n
n
n
n
n
n
n
hhP
P
P
P
P
P
P
P
P
P
P
P
Λ
3
(k)
%% K
K
K
K
K
K
K
K
K
K
:: u
u
u
u
u
u
u
u
u
u
u
Λ
3
(k+1)
Λ
4
(k)
// CC


















Λ
4
(k+1)
Figure 3.7: An illustration of the recursive structure of the index sets.
Formula (3.9) uses this structure.
over all the possible haplotype configurations. According to the definition of Γ(k)
and its relation to the (k+1)-th position, the following Proposition 3.2 is obvious.
Proposition 3.2.
Pr(X
ij
=x
ij
, j = 1,...,k+1,i∈ Θ(k+1);
S
k,1
=b
1
,S
k,2
=b
2
;S
k+1,1
=a
1
,S
k+1,2
=a
2
;F
i
=f
i
,i∈ Ω(k+1))
= Pr(X
i,k+1
=x
i,k+1
, i∈ Ω(k+1)|S
k+1,1
=a
1
,S
k+1,2
=a
2
;F
i
=f
i
,i∈ Ω(k+1))
Pr(X
ij
=x
ij
, j = 1,...,k, i∈ Θ(k);S
k,1
=b
1
,S
k,2
=b
2
;S
k+1,1
=a
1
,S
k+1,2
=a
2
;
F
i
=f
i
,i∈ Ω(k+1))
66
Please note that the dimension of the state space changes as we move along the
genome in this Markov structure. We define
α(k;a
1
,a
2
;f
i
, i∈ Γ(k))
= Pr(X
ij
=x
ij
, j = 1,...,k, i =1,...,m;S
k,1
=a
1
,S
k,2
=a
2
;F
i
=f
i
, i∈ Γ(k)).
If we calculate the likelihood of estimated haplotypes which consist of d SNP sites,
the set Γ(d) is empty at the d-th position. Therefore, the likelihood is
Pr(X
ij
=x
ij
, i = 1,...,m,j =1,...,d)=
X
a
1
,a
2
α(d;a
1
,a
2
). (3.8)
We can recursively calculate α(k;a
1
,a
2
;f
i
, i ∈ Γ(k)) by using the following
Theorem 3.1.
Theorem 3.1.
α(k+1;a
1
,a
2
;f
i
, i∈ Γ(k+1))=Pr(a
1
,a
2
)[
2
Y
h=1
λ
P
j∈Λ
3
(k+1)
1(F
j
=h)
h
]
[
Y
j∈Λ
1
(k+1)
[λ
1
Pr(x
j,k+1
|a
1
)+λ
2
Pr(x
j,k+1
|a
2
)]]
[
Y
j∈Λ
3
(k+1)
Pr(x
j,k+1
|a
f
j
)][
Y
j∈Λ
4
(k+1)
Pr(x
j,k+1
|a
f
j
)]


X
f
j
=1,2,j∈Λ
2
(k+1)
[
Y
j∈Λ
2
(k+1)
Pr(x
j,k+1
|a
f
j
)]
X
b
1
,b
2
α(k;b
1
,b
2
;f
j
, j∈ Γ(k))


(3.9)
Proof: See Appendix.
67
Please notice that if the set Λ
2
(k+1) is empty at some positions, then we skip
the corresponding summation in the formula.
Proposition 3.3. The average time complexity of the algorithm defined by (3.9)
and (3.8) is linear with respect the number of SNP sites. The expected complexity
is proportional to e
κ
, where κ is the average coverage.
Proof: During the recursion of calculating Pr(X), the states with their corre-
sponding values of{α(k;a
1
,a
2
;f
i
, i∈ Γ(k))} are stored in memory. The memory
requirement at each position is not fixed but changing across the genome. Sup-
pose the coverage at each position is a random variable which is denoted by K.
Then this random variable follows a Poisson distribution with average κ. The
expected memory requirement is E[2
K
]. We calculate this expectation by using
the moment generating function of Poisson distribution (E[e
tX
] = e
κ(e
t
−1)
). Thus,
E[2
K
] = E[e
log 2K
] = e
κ
. Similarly, we can also calculate the average time com-
plexity.
Although the average time and space complexity are represented as exponen-
tial forms, these exponential forms are manageable because the average sequence
coverage in sequencing projects is usually from 5 to 10. In the worst case, the
coverage is not extremely deep. Because the sequence coverage follows the Poisson
distribution, we can calculate the probability that the sequence coverage exceeds
68
some large number. When we denote the coverage by a random variableK and the
average coverage κ,
Pr(K≥m) =
∞
X
j=m
κ
j
e
−κ
j!
.
For the usual average coverage κ = 7, Pr(K ≥ 20) = 0.000044 and Pr(K ≥ 23)≤
10
−6
.
Afterthepairwisereconstructionstep,themate-pairinformationcanbeusedfor
further extension of haplotypes. In other words, two haplotypes can be connected
by the mate-pair information to form longer haplotypes. Our confidence score can
be calculated by fillingφ’s inthe middle ofpaired reads. This strategy unavoidably
increasesthecoverage,therebymakingthecalculationcomputationallychallenging.
As an alternative, we fill φ’s only for the fragments which provide the mate-pair
information to form longer haplotypes.
We often experienced the underflow problem as we calculate Pr(X) from many
positions. We solved this problem with taking a log scale and setting a bound
(-100) on the value logα(·). Whenever the value of logα(·) gets below the bound,
we add 3 and count it. After the calculation ends at the last position, say n, we
subtract 3×the counted number of adding 3.
3.3.2 Identification of uncertain connections
Our reconstruction strategy in Section 3.1 is based on the assumption that esti-
mating the most likely pairs and connecting those pairs lead to longer haplotypes.
69
In other words, the concatenation of local optimal solutions leads to the globally
optimalsolution. Inreality, thisisnottrue. Thepairwisereconstruction strategyis
employed just because the number of haplotype configurations grows exponentially
with the number of SNP sites increasing. We can complement the weakness of our
reconstruction strategy with the fast algorithm to calculate Pr(X). Since we have
a confidence score to evaluate each connection along long haplotypes, uncertain
connections can be identified by calculating the confidence score at every position.
The uncertain connections are then disconnected. The improved reconstruction
strategy is described as follows.
Algorithm 3.2. 1. The most likely pair is estimated from every adjacent pair.
The confidence of the most likely pair is evaluated by the odds ratio and con-
fidence score. If either of them is below a threshold, the most likely pair is
discarded.
2. The estimated pair from step 1 is connected together. If there is an inconsis-
tency between the pairs, the most likely triple is estimated from those three
positions.
3. Evaluate the overall confidence score for each estimated haplotypes by using
(3.8).
4. Iftheoverallconfidencescoreisbelowathreshold, theweakestpairofpositions
with the smallest pairwise confidence score is identified. After the phase is
70
inverted, the overall confidence score is calculated again. If the score exceeds
the threshold, we stop; otherwise the haplotypes are disconnected into two
parts. In this case, the step 3 and 4 are repeated to these two disconnected
segments.
5. Foreachsingleton, themostlikelygenotypeisestimatedaccordingtothesingle
confidence score (3.1).
In the case that the nominal frequency value is known, we can still use its
estimated value in the calculation of confidence scores to achieve adaptive recon-
struction and consequently to improve accuracy of reconstruction.
3.3.3 Maximum likelihood estimate of frequencies
Because we can calculate the likelihood Pr(X) which has linear time complexity,
the frequency which maximizes the likelihood can becalculated. The log-likelihood
of the observation is:
L(λ) =logPr(X;λ).
where, λ is the haplotype frequency.
For any arbitrary haplotype frequency λ, this log-likelihood is calculated very
efficientlybyusing(3.9)and(3.8). Bycalculatingthislog-likelihoodforanumberof
haplotype frequencies with uniform intervals, we can estimate the frequency which
71
maximizes the log-likelihood. An example of maximum likelihood estimation will
be shown in Simulation Chapter.
3.3.4 Composition probabilities
In our probabilistic model, the parameters are sequencing error rates, haplotype
frequency, and composition probabilities. Here, we deal with the issue on the com-
position probabilities. In the assumption A7 in Chapter 1, it is assumed that
the composition of the sequence is independent, both of adjacent bases and over
large regions. This simplified assumption was made because the complication is
to be avoided. One way to estimate the parameter (composition probabilities) is
to use E-M algorithm developed by Churchill and Waterman [15]. In reality, the
assumption about the constant composition probabilities is not true. The composi-
tion probabilities vary among the estimated haplotypes. In this case, the different
composition probabilities can be estimated from estimated haplotypes. We can es-
timate these composition probabilities adaptively. After we estimate the haplotype
frequency, we do not vary the frequency any more. Then we count the frequency
of each nucleotide and calculate the composition probability for each nucleotide.
This process is repeated until the change of the composition probabilities from the
previous step is insignificant. This adaptive strategy is summarized as follows.
Algorithm 3.3. 1. Estimate the haplotype frequency from each clone or scaf-
fold.
72
2. Reconstruct haplotypes by using the haplotype frequency obtained from step 1.
3. Calculate the frequency of each nucleotide. Step 2 and 3 are repeated until the
frequencies converge.
3.4 Gibbs Sampling Approach
In this section, we present a method based on the Gibbs sampler. First, we explain
the reasons why the Gibbs sampling is a more appropriate strategy in determining
the phases, especially in inferring haplotypes from real sequencing projects. For
instance, distinct from the methods in Section 3.1 and 3.3, the Gibbs sampler
estimates the origins of shotgun reads as a by-product. Next, we discuss an issue
on using position-specific quality score. We interpret the Phred quality score in the
framework of our error model. Potential polymorphisms are identified by using a
Bayesian model selection. According to the new phase determination method, we
also develop a new haplotype extension method.
In summary, our new diploid genome reconstruction strategy consists of four
steps: 1) potential polymorphism detection, 2) local haplotyping, 3) haplotype ex-
tension, and 4) extension by the bridging strategy described in Section 3.2. The
strategy is to screen potential polymorphic sites, to haplotype the potential poly-
morphisms locally, and to connect adjacent haplotypes to form longer haplotypes
by using the consistency of regions overlapping adjacent haplotypes.
73
In Section 4.3, the accuracy of the Gibbs sampler is tested on a simulated data
set. In Chapter 5, the Gibbs sampler is applied to the real Ciona intestinalis
genome.
We start off by discussing some statistical properties.
3.4.1 Background
Markov chain In a Markov chain with memory 1, given the current state, the
future state is independent of the history of the past states. A Markov chain is
formally defined as follows, where we implicitly intend a Markov chain of memory
1.
Definition 3.1. A process X
i
is a Markov chain if
P(X
n
=x
n
|X
0
=x
0
,X
1
=x
1
,...,X
n−1
=x
n−1
)=P(X
n
=x
n
|X
n−1
=x
n−1
).
for all n> 1 and all the states x
0
,x
1
,...,x
n
∈S.
We define the transition probability and transition matrix.
Definition 3.2. The transition probability p
ij
is defined as
p
ij
=P(X
n+1
=x
j
|X
n
=x
i
).
The transition matrix P is the |S|×|S| matrix where each entry is a transition
probability, p
ij
.
74
Wedenoten-steptransitionprobabilitiesbyp
ij
(m,m+n) =P(X
m+n
=x
j
|X
m
=
x
i
). The n-step transition probabilities from state x
0
are denoted by p
ij
(n) =
P(X
n
=x
j
|X
0
=x
i
).
Definition 3.3. The period of x
i
is defined to be d(i) =gcd{n:p
ii
(n)>0}. State
x
i
is called periodic if d(i)> 1 and aperiodic if d(i)= 1.
Definition 3.4. We say x
i
communicates with x
j
if there exist m,n≥ 0 such that
p
ij
(m) > 0 and p
ji
(m) > 0. We denote this communication between x
i
and x
j
by
x
i
↔x
j
.
Definition 3.5. A Markov chain is called irreducible if x
i
↔ x
j
holds for any x
i
and x
j
.
Definition 3.6. π is called a stationary distribution of the chain if the π satisfies
the following two conditions,
(a) π
j
≥ 0 for any state j and
P
j
π
j
= 1,
(b) π =πP.
The following well-known Theorem is presented without a proof.
Theorem3.2. ForanirreducibleandaperiodicMarkovchain, there existsaunique
stationary distribution π such that π =πP.
MarkovchainMonteCarlo TheMonteCarlomethodhasbeenwidelyusedfor
tackling computationally intensive problems by a sampling strategy. However, it is
75
sometimesimpossibletosampledirectlyfromthetargetdistribution. IntheMarkov
chain Monte Carlo (MCMC) method, samples are generated from a Markov chain
withthestationarydistributionwhichcorrespondstothetargetdistribution. Inthe
MCMC method, we try to derive an efficient transition rule to reach the stationary
distribution. As examples of the MCMC method, we describe three widely used
methods, the Metropolis algorithm, the Metropolis-Hastings algorithm, and the
Gibbs sampler.
Metropolisalgorithm Metropolisetal. [60]used asymmetric probability tran-
sition as a transition rule.
1. Let π be the target distribution. Given the current state x
(t)
, x
′
is sampled
from “proposal function” Q(x
(t)
,x
′
). The proposal function is required to be
a symmetric probability transition, Q(x
(t)
,x
′
) = Q(x
′
,x
(t)
). This proposal
function is considered as a random “unbiased perturbation”.
2. Sample u from Uniform[0,1]. Then,
x
(t+1)
=







x
′
if u≤
π(x
′
)
π(x
(t)
)
,
x
(t)
otherwise.
Metropolis-Hastingsalgorithm Hastings[36]eliminatedtherestrictiononthe
symmetries of the proposal function. The proposal function is just required to be
that Q(x,x
′
)>0 if and only if Q(x
′
,x)< 0.
76
1. Sample x
′
from the proposal distribution Q(x
(t)
,x
′
).
2. Sample u from Uniform[0,1]. Then,
x
(t+1)
=







x
′
if u≤K(x
(t)
,x
′
),
x
(t)
otherwise
K(x
(t)
,x
′
) =min{1,
π(x
′
)Q(x
′
,x
(t)
)
π(x
(t)
)Q(x
(t)
,x
′
)
} was suggested.
3.4.2 Gibbs sampler
History ThewidelyusedMCMCalgorithm,theGibbssampler,wasfirstexplored
in the image processing area (Geman and Geman [29]). The algorithm was named
the Gibbs sampler after the physicist, J. W. Gibbs, because of the analogy with
statistical physics. Afterward, the connection between the Gibbs sampler and the
Metropolis-Hastings algorithm was noticed. The Gibbs sampler turned out to be a
special case of the Metropolis-Hastings algorithm.
In the Metropolis algorithm and Metropolis-Hastings algorithm, it is often hard
to design a reasonable “unbiased perturbation” in practice (Liu [52]). Instead of
the proposal function, the Gibbs sampler uses the conditional distribution as the
transition rule. The conditional distribution is easy to derive in many cases, and
conditional transitions are known to be more global than the perturbation-based
77
transitions in the Metropolis algorithms; the perturbation-based transition tends
to depend on the local context (Liu [52]).
Data augmentation The analogy between the EM algorithm and the Bayesian
sampling structure is found in terms of imputing the missing values (Tanner and
Wong[83]). IfMLEiscalculatedbyfillinginthemissingdataintheEMalgorithm,
dataaugmentationintheBayesian frameworkcanbeunderstoodascalculating the
posterior probability by augmenting the observations, in other words, imputing the
missing data.
Lety be the observed data. Letx be the missing data. Let θ be the parameter
in Θ. The data augmentation is motivated by the calculation of the posterior
probability of θ.
p(θ|y) =
Z
X
p(θ|x,y)p(x|y)dx (3.10)
The density of the missing data is denoted by p(x|y), and defined in terms of
the posterior probability p(θ|y).
p(x|y)=
Z
Θ
p(x|θ,y)p(θ|y)dθ (3.11)
78
It often happens that (3.10) and (3.11) are not derived in closed forms. In this
case, the Monte Carlo method is applied to the integrations in (3.10) and (3.11).
In data augmentation strategy, the calculation of (3.10) and (3.11) is replaced
by the iterations of the following two steps.
1. θ
(1)
,...,θ
(n)
aresampledfromp(θ|x
(i)
,y). Calculateg(θ)=
1
m
P
n
i=1
p(θ|x
(i)
,y).
2. x
(1)
,...,x
(n)
are sampled from p(x|θ
(i)
,y).
The data augmentation is essentially equivalent to the two-component Gibbs
sampler (Liu [51]). For two-component Gibbs sampler, the geometric convergence
was proved under some regularity conditions (Liu, Wong, and Kong [53]). The
convergence rate is related to the correlation between two components (Rubin [72];
Liu[51]). Forastationarydistributionπ(x),two-componentGibbssamplerconsists
of the following two steps.
1. Sample x
(t+1)
1
from the conditional distribution π(X
1
|X
2
=x
(t)
2
).
2. Sample x
(t+1)
2
from the conditional distribution π(X
2
|X
1
=x
(t+1)
1
).
Sampling strategies There are several sampling strategies in the Gibbs sam-
pling (Roberts and Sahu [71]). In the random scanning strategy, the index to be
updated is randomly selected with equal probability. Let x = (x
1
,...,x
n
) be the
missing value. Letx
(t)
= (x
(t)
1
,...,x
(t)
n
)bethemissing valueattimet. x
[−i]
denotes
the set{x
j
;j6=i}. At each iteration t, the following steps are repeated n times.
79
1. Sample an index i to be updated with equal probability
1
n
.
2. Sample x
(t+1)
i
from π(·|x
[−i]
) and other components remain unchanged.
In the forward scanning strategy, every index is equally swept in the forward
direction fromthe lowest index. At each iterationt, the following step is performed
for i = 1,...,n.
1. Sample x
(t+1)
i
from π(·|x
[−i]
) and other components remain unchanged.
In the reversible scanning strategy, after the forward updates in the systematic
scanning strategy are conducted, the backward updates are also conducted. At
each iteration t, the following steps are performed.
1. In order of i =1,...,n, sample x
(t+1)
i
from π(·|x
[−i]
).
2. In order of i =n,...,1, sample x
(t+1)
i
from π(·|x
[−i]
).
In the randomly permuted scanning, a random permutation of indexes is gen-
erated. Then the components are updated in that order. At each iteration t, the
following steps are performed.
1. Randomly permute the order of i =1,...,n.
2. In that permuted order of i =1,...,n, sample x
(t+1)
i
from π(·|x
[−i]
).
The grouping of highly correlated components and updating the group simultane-
ously can show improved performance (Roberts and Sahu [71]).
80
Number of chains The issue of choosing a long single chain or shorter multiple
chains is discussed in Brooks [9]. In a single long chain, it is more likely that the
chain willeventually approximate thetargetdistribution. The number ofdiscarded
samples as a “burn-in” in a single long chain is less than the number in multiple
chains.
On the other hand, in multiple chains, it is more likely that the chains will visit
the entire sample space and running a single chain will leave some of sample space
unvisited. By running multiple chains, we can make use of parallel computation.
Multiple chains can be monitored to evaluate the convergence by starting from
several initial values.
Mykland et al. [64] developed a hybrid method to take advantage of both the
single-run and multiple-run approach to some extent. In this approach, a single
chain is run from a starting point. At some “regeneration” point, the single chain
is split into multiple chains. Because the multiple chains originate from a single
chain, the multiple chains will approximate the target distribution in the end. The
chains might visit more of the sample space than a single chain does.
Convergence diagnostics If the convergence rate of the Gibbs sampler is avail-
able,itiseasytodeterminethenumberofdiscardedsamplesasaburn-in. Although
the geometric convergence is known in the general setting of the Gibbs sampler,
an explicit bound on the convergence rate is still an open problem (Roberts and
Polson [70]; Chan [12]; Schervish and Carlin [77]).
81
For this reason, several methods, called convergence diagnostics, have been de-
veloped to determine the convergence empirically. One example of convergence
diagnostics is the method proposed by Gelman and Rubin [28].
In this method, multiple chains, say m, are run for some period, 2n iterations.
Thelastnsamplesfromeachchainistakentocalculatethe“within-chainvariance”
and the “between-chain variance”. The within-chain variance is the average of the
varianceofthesamplesfromeachchain. Thebetween-chainvarianceisthevariance
of the average of the samples from each chain. When the multiple chains initially
start to run, the between-chain variance dominates the within-chain variance. As
the ratio between the between-chain variance and the within-chain variance ap-
proaches 1, it is concluded that the starting values are no more influential and the
decision of the convergence is made. This convergence diagnostic is only applied to
the sampling of a scalar value.
3.4.3 The distribution of fragment membership and the
Gibbs sampling
Afterhaplotypesarereconstructed, thefragmentmembershipcanbealsoestimated
fromPr(F|X,S)forcompleteestimation. Theestimationoffragmentmemberships
is the dual problem of the estimation of haplotypes. Pr(F|X,S) can be calculated
from the fragments within the range of haplotypes because true haplotypes are
known from Pr(S|X) and our pairwise reconstruction strategy.
82
Maximizing this probability gives a solution to estimate fragment membership.
However, as we observed in the case of haplotype estimation, the calculation of
the fragment membership configuration which maximizes Pr(S|X) becomes com-
putationallyintensive when manyfragmentsareinvolved. Usually, manyfragments
tend to exist within the region of estimated haplotypes. A solution is to consider
one fragment at a time with other fragment memberships unchanged. Therefore, a
simple solution is to sample one fragment membership and update it while keeping
other fragment memberships unchanged. This sampling can be repeated through
all the fragments sequentially.
InSection3.1and3.3,SNPsitesarepre-determinedandthepairwisereconstruc-
tion method is applied to these SNP sites for computational efficiency. However,
theestimatedpairshappennottobecorrectbecausetheinformationspanningonly
two positions are used. On the one hand, obviously, it is more accurate to infer
haplotypes from more than two positions. On the other hand, in real assemblies
there is no guarantee that detected potential SNP sites are necessarily true SNPs.
In this case, we can replace the pairwise strategy by a series of sampling from more
thantwo positions. Wecan compute Pr(F|S,X)and Pr(S|F,X)by alternating the
following two steps:
1. Sample s
(l+1)
from Pr(S|F =f
(l)
,X =x);
2. Sample f
(l+1)
from Pr(F|S =s
(l+1)
,X =x).
83
This procedure can be viewed as disentangling the fragments by their member-
ships and estimating haplotype from each disentangled fragment group.
3.4.4 Position-specific sequencing error rates
Sequencingerrorprobabilitiesplayanimportantroleinourreconstructionstrategy.
In Section 3.1 and 3.3, under A5 and A6, the sequencing errors are assumed to be
constant and identically distributed across the assembly. Under these assumptions,
the sequencing error rates can be estimated from E-M algorithm developed by
Churchill and Waterman [15]. Alternately, the quality scores from the base-calling
software, Phred, can be directly used to reflect the local context of base-callings
(Ewing and Green [24]).
Let
ε
ij
= Pr(X
ij
=b|Y
ij
=a), b6=a, a∈A.
The quality scores are transformed from the error probability by (1.1):
q
ij
=−10log
10
ε
ij
.
where q
ij
is the Phred quality score at row i and column j in the assembly matrix.
Using this transformation form, we can incorporate the quality scores into our
model because the quality scores have a probabilistic interpretation.
84
Therefore, the sequencing error probabilities at the j-th locus in the fragment i
are denoted by
Pr(X
ij
6=a|Y
ij
=a) = 10
−
q
ij
10
, a∈A (3.12)
Thus,
Pr(X
ij
=b|Y
ij
=a)=







1−ε
ij
a =b,
ε
ij
ω(b|a) a6=b.
where a∈A, b∈B.
The error-bias parameters{ω(b|a), a6=b, a∈A, b∈B} either can be set equal
to some constant or can be trained from data. The quality score for an indel is set
to the average of two flanking quality scores.
3.4.5 Potential polymorphism detection
The number of genomic positions from which haplotypes are inferred need to be
reduced for computational efficiency. A Bayesian model selection is used to screen
potentialpolymorphisms. Afterpotentialpolymorphisms areidentified, haplotypes
are derived only from those potential polymorphisms. All the genomic positions
other than potential polymorphisms are excluded from estimation and regarded
to be homozygous. In the Bayesian model selection, our criterion is the ratio of
the two haplotype-based likelihood and the one haplotype-based likelihood. If the
two haplotype-based model is more supported than the one haplotype-based model
at some position, this position can be estimated to be potential polymorphism.
85
The more significant is this ratio, the more supported is the two haplotype-based
model. Two different thresholds (α
1
, α
2
) are used in this model selection, where
α
2
is greater than α
1
. Each threshold is used for a different purpose. If the ratio is
greaterthanorequaltoα
2
atsomeposition,wecallthispositionastronglypotential
polymorphism. If the ratio is greater than or equal to α
1
but less than α
2
at some
position, we call this position a weakly potential polymorphism. Strongly poten-
tial polymorphisms are used to extend estimated haplotypes. Therefore, strongly
potential polymorphisms are more likely to be true polymorphisms. This high ac-
curacy canbeachieved by increasingα
2
. Weakly potentialpolymorphisms areused
to lower the false negative rates (undetected polymorphisms). Therefore, α
1
needs
to be lowered in order not to miss true polymorphisms. More formally, potential
polymorphisms are defined as follows. For given thresholds α
1
and α
2
, the j-th
genomic locus is screened as a potential polymorphism if
Pr(M
2
|X
·j
=x
·j
)
Pr(M
1
|X
·j
=x
·j
)
=
Pr(X
·j
=x
·j
|M
2
)
Pr(X
·j
=x
·j
|M
1
)
Pr(M
2
)
Pr(M
1
)
≥α
k
, k = 1,2. (3.13)
where M
1
andM
2
represent one haplotype-based and two haplotype-based models,
respectively.
The genomic position where (3.13)holds fork = 2 are termed strongly potential
polymorphisms. Similarly, the genomic position where (3.13) holds for k = 1 are
termed weakly potential polymorphisms.
86
By these criteria, all strongly potential polymorphisms are also weakly poten-
tial polymorphisms. The likelihood calculations in (3.13) are based on (2.3) and
previous work (Churchill and Waterman [15]).
Pr(X
·j
=x
·j
|M
1
) =
X
s
1j
Pr(S
1j
)
m
Y
i=1
Pr(X
ij
|S
1j
).
Pr(X
·j
=x
·j
|M
2
) =
X
S
1j
,S
2j
(
1
2
)
m
Pr(S
1j
,S
2j
)
m
Y
i=1
[Pr(X
ij
|S
1j
)+Pr(X
ij
|S
2j
)].
Since
Pr(M
2
)
Pr(M
1
)
is a constant factor, (3.13) becomes
logPr(X
·j
=x
·j
|M
2
)−logPr(X
·j
=x
·j
|M
1
)≥α
′
k
, k = 1,2. (3.14)
The two thresholds, α
1
andα
2
are determined from a simulated data set. After the
assembly layout was constructed by aligning shotgun reads to the Ciona intesti-
nalis draft genome sequence (see Section 4.3), the artificial polymorphic sites were
generated alongthe genome sequence according to the reported polymorphism rate
(1.2%) (Dehal et al. [21]). The fragment sampling from the genome was simulated
by Bernoulli trials with p =
1
2
. The sequence readings were simulated according to
(3.12) and the real quality scores from the Ciona intestinalis assembly. α
′
2
and α
′
1
weretrainedandselected tobe2(truepositive rate> 0.99)and0.01(falsenegative
rate <0.03), respectively.
87
3.4.6 Local haplotyping
Before haplotypes are inferred from potential polymorphisms, potential polymor-
phismsareorganizedinaforminwhichfurtherextensionofhaplotypesarepossible.
In this step, we compose blocks each of which is comprised of four strongly poten-
tial polymorphisms with intervening weakly potential polymorphisms such that
the overlap (two strongly potential polymorphisms) exists between adjacent blocks
(Figure3.8A,B).Bycomposingablockwithfourstronglypotentialpolymorphisms,
the uncertain phases can be easily detected (see Section 3.4.7). The Gibbs sampler
is applied to determine the phases of each block. In our MCMC approach, a time
variable t is introduced. Therefore, S
(t)
ij
is S
ij
at time t, and F
(t)
i
is F
i
at time t.
Let S
[−(i,j)]
={S
i
′
j
′;(i
′
,j
′
)6=(i,j)} and F
[−i]
={F
i
′;i
′
6=i}. Samples are discarded
as burn-in until t reaches B. After burn-in, the following two steps are repeated T
times. R multiple chains are run and each chain is thinned by storing the sample
from every K-th iteration (we used B = 30,T = 1000,R = 20, and K = 20). We
usedtheconstantnumbers forthefourparameters(B,T,R,andK).Thisstrategy
is similar to theone that Stephens, Smith, and Donnelly [80]used. Our parameters
were tuned through a series of simulations. We tried two sampling strategies, the
random scanning and forward scanning, which showed similar results.
88
1. Foreach(i,j);i =1,2,j = 1,2,...,n,draws
(t+1)
ij
fromPr(S
(t+1)
ij
|S
(t)
[−(i,j)]
,F
(t)
=
f
(t)
,X) and set the remaining components as s
(t+1)
[−(i,j)]
=s
(t)
[−(i,j)]
.
Pr(S
(t+1)
ij
|S
(t)
[−(i,j)]
,F
(t)
=f
(t)
,X)
∝ Pr(S
(t+1)
ij
,S
(t)
[−(i,j)]
,F
(t)
=f
(t)
,X)
= Pr(X|S
(t+1)
ij
,S
(t)
[−(i,j)]
,F
(t)
=f
(t)
)Pr(F
(t)
=f
(t)
)Pr(S
(t+1)
ij
,S
(t)
[−(i,j)]
)
= {
Y
k:F
(t)
k
=i
Pr(X
kj
|Y
kj
=S
(t+1)
F
(t)
k
,j
)}{
Y
k:F
(t)
k
6=i
Pr(X
kj
|Y
kj
=S
(t)
F
(t)
k
,j
)}
{
m
Y
k=1
Y
l6=j
Pr(X
kl
|Y
kl
=S
(t)
F
(t)
k
,l
)}{
m
Y
k=1
Pr(F
(t)
k
=f
(t)
k
)}{Pr(S
(t+1)
ij
,S
(t)
[−(i,j)]
)}.
2. For each i;i =1,2,...,m, draw f
(t+1)
i
from Pr(F
(t+1)
i
|F
(t)
[−i]
,S
(t+1)
=s
(t+1)
,X)
and set the remaining components as f
(t+1)
[−i]
=f
(t)
[−i]
.
Pr(F
(t+1)
i
|F
(t)
[−i]
,S
(t+1)
=s
(t+1)
,X)
∝ Pr(F
(t+1)
i
,F
(t)
[−i]
,S
(t+1)
,X)
= Pr(X|F
(t+1)
i
,F
(t)
[−i]
,S
(t+1)
)Pr(F
(t+1)
i
,F
(t)
[−i]
)Pr(S
(t+1)
)
= {
n
Y
l=1
Pr(X
il
|Y
il
=S
(t+1)
F
(t+1)
i
,l
)}{
Y
k6=i
n
Y
l=1
Pr(X
kl
|Y
kl
=S
(t+1)
F
(t)
k
,l
)}
{
Y
k6=i
Pr(F
(t)
k
)}Pr(F
(t+1)
i
)Pr(S
(t+1)
).
If P(the most likely haplotypes for a block | observation) is greater than or
equal to a threshold (typically > 0.5), we say that the phases for the block are
89
determined (Figure 3.8C) and term the phase-determined blocks local haplotypes.
In the following haplotype extension step, local haplotypes are extended by con-
catenating local haplotypes. Haplotypes, which are a concatenation of at least two
local haplotypes, are termed extended haplotypes (Figure 3.9A).
3.4.7 Haplotype extension
According to our block composition method, adjacent local haplotypes share two
strongly potential polymorphisms (Figure 3.8B). Adjacent local haplotypes can
be further extended by using the consistency of these shared strongly potential
polymorphisms (Figure 3.9A, B). The shared strongly potential polymorphisms
might not be consistent between the extended haplotypes and local haplotypes
(Figure 3.9C (1)). In Figure 3.9C (2), the local haplotypes are extended to the
previous strongly potential polymorphisms. By inferring the local haplotypes from
the previous strongly potential polymorphisms, the ambiguity of the phases is re-
solved in the shared strongly potential polymorphisms. The haplotypes can be
extended by combining these resolved local haplotypes (Figure 3.9C (3)). If this
block extension fails to resolve the inconsistency, some mismatches are allowed as
long as the phases can be determined. The number of matches is calculated for
two different phase directions, and the majority direction is selected (Figure 3.9D).
If the confidence score of the inferred haplotypes is less than a threshold, the un-
certain phase (the confidence score < a threshold) is identified and disconnected
90
(Figure 3.10). Because a block is composed of four strongly potential polymor-
phisms, in Figure 3.10, the uncertain phase exists either from (i+3)-th position
to (i+4)-th position or (i+4)-th position to (i+5)-th position. We choose four
strongly potential polymorphisms for the block composition to easily identify the
uncertain phases.
91
Block j + 1
i i + 1
C
A
T
C
C
T
A
C
i + 2
A
T
C
G
T
A
i + 3 i + 4 i + 5
B
A
i − 1
C
(1)
T
C
C
T
A
C
A
T
A
C
A
T
C
G
T
A
(2)
(1)
T
C
C
T
A
C
A
T
A
C
A
T
C
G
T
A
(2)
Local haplotypes j + 1
Position
Local haplotypes j
Block j
Figure 3.8: Block composition and local haplotyping. In this figure, for sim-
plicity, only strongly potential polymorphisms are indexed by the top row
although there possibly exist weakly potential polymorphisms between ad-
jacent strongly potential polymorphisms. Here, we index 7 strongly poten-
tial polymorphisms from the (i−1)-th position to the (i+5)-th position. (A)
Genotypes are determined after the potential polymorphism detection step.
Dotted lines between strongly potential polymorphisms indicate that their
phasesarenotdeterminedyet. (B)Anexampleofblockcompositionisshown.
We compose two adjacent blocks, block
j
in (1) and block
j+1
in (2). The adja-
cent blocks share two strongly potential polymorphisms in the (i+2)-th and
(i+3)-th positions. (C) After the Gibbs sampler is applied to each block, the
phasesaredetermined. Eacharrowindicatestheconnection direction. In (1),
for instance, local haplotypes
j
, TCCT, CTAA are obtained.
92
Extended haplotypes
i i + 1
C
A
T
C
C
T
A
C
i + 2
A
T
i + 3 i + 4 i + 5
A
i − 1 Position
B (1)
A
C
A
T
C
G
T
A
A
C
A
T
C
G
T
A
(2)
C
T C
T
C
A
Local haplotypes
C (1)
T
C
A
T
A
C
A
T
C
G
T
A
(2)
C
G
T
A
C
T
A
C
A
T
C
G
T
A
(3)
C
T C
T A
Local haplotypes
Extended
Local haplotypes
C
D (1)
T
C
A
T
A
C
A
T
C
G
T
A
(2)
C
G
T
A
G
T
A
C
A
T
C
G
T
A
(3)
C
T C
T A
Local haplotypes
Extended
Local haplotypes
C
Figure3.9: Haplotypeextension. Haplotypeextensionproceedsfromthelower
index to the higher index. (A) Haplotypes can extend over more than 4
strongly potential polymorphisms as the result of the previous haplotype ex-
tension. We assume that the extended haplotypes, CCTAA, ATCCT are given
from the (i−1)-th position to the (i+3)-th position. (B) After the local hap-
lotypes, AACA, CTGT are obtained in (1), the extended haplotypes in A can be
combined with the local haplotypes and extended to CCTAACA and ATCCTGT by
the consistency of shared strongly potential polymorphisms (the dotted box)
at the (i+2)-th and (i+3)-th positions. The combined haplotypes are shown
in (2). (C) In (1), inconsistency is observed at the (i+2)-th position. After
the local haplotypes are extended to the (i+1)-th position, the inconsistency
is resolved in (2). The combined haplotypes are shown in (3). (D) Although
the local haplotypes are extended in (2), the inconsistency at the (i+2)-th
position is observed. The number of matches for two phases are calculated
from the (i+1)-th to the (i+3)-th position (the dotted box ).
93
Extended haplotypes
i i + 1
C
A
T
C
C
T
A
C
i + 2
A
T
i + 3 i + 4 i + 5
A
i − 1 Position
B (1)
A
C
A
T
C
G
T
A
A
C
A
T
C
G
T
A
(2)
C
T C
T
C
A
?
?
C (1)
A
C
A
T
C
G
T
A
Local haplotypes
A
C
A
T
C
G
T
A
(2)
C
T C
T
C
A
?
?
Local haplotypes
Figure 3.10: Cases where the confidence score for inferred haplotypes are low.
(A) Extended haplotypes are shown from the (i−1)-th to the (i+3)-th posi-
tion. (B) If the confidence score for the local haplotypes is less than a thresh-
old, we first check the confidence score for the pair consisting of the (i+3)-th
and (i+4)-th positions. If this confidence score for the pair is greater than
or equal to a threshold, we connect the pair with the extended haplotypes in
A. In (1), ? represents undetermined phase. We obtain extended haplotypes,
CCTAAC and ATCCTG in (2). We then compose a new block from the (i+5)-th
position. (C) Iftheconfidence score for the pairat the (i+3)-thand (i+4)-th
positions are less than a threshold, we compose a block from the (i+4)-th
position and infer local haplotypes as shown in (2).
94
Chapter 4
Simulation Studies
In this chapter, we present our simulation results based on the methods introduced
in Chapter 3. In Section 4.1, the pairwise-based method in Section 3.1 is tested on
a simulated data set. In Section 4.2, the Algorithm 3.2 based on the calculation
of Pr(X) is used to improve haplotype estimation. In Section 4.3, the accuracy of
the Gibbs sampler is evaluated on a simulated data set. In Section 4.3, the Phred
quality score of each base-call is used instead of the fixed global sequencing error
probabilities.
4.1 Simulation study 1
4.1.1 Overview
In Section 3.1, we proposed the pairwise-based reconstruction strategy. In that
strategy, we estimate the most likely pair from every adjacent pair, and extend
95
haplotypes by using those pairs. To evaluate the confidence of the estimated most
likely pairs, two confidence measures are adopted. One is the confidence score,
Pr(the most likely pair|X) for, and the other is the ratio between two most likely
pairs (odds ratio). The purposes of the following simulations in this section are
summarized as follows.
1. Verify the overall accuracy of our pairwise strategy.
2. Determine appropriate threshold levels for choosing the confidence score and
the odds ratio through a series of simulations.
4.1.2 Parameters
Inourmodel,threeparametersets,fragmentfrequencies, compositionprobabilities,
and sequencing error probabilities are specified. The fragment frequencies, which
are the probabilities that each fragment originates from one of two haplotypes, are
set to one half. As indicated in Table 4.1, the composition probability of each
nucleotide is equally set to 0.24 and the composition probability of gap 0.04. The
sequencingerrorrateis4%. Forexample, inTable4.2,Pr(X =G|S =A)+Pr(X =
C|S = A)+Pr(X = T|S = A)+Pr(X =−|S = A)+Pr(X = N|S = A) is equal
to 0.005+0.01+0.015+0.005+0.005=0.04.
96
A G C T -
0.24 0.24 0.24 0.24 0.04
Table 4.1: Composition probability in the simulation.
Observation
truth A G C T - N
A 0.96 0.005 0.01 0.015 0.005 0.005
G 0.005 0.96 0.015 0.01 0.005 0.005
C 0.005 0.015 0.96 0.01 0.005 0.005
T 0.015 0.005 0.01 0.96 0.005 0.005
- 0.003 0.003 0.003 0.003 0.985 0.003
Table 4.2: Sequencing error probabilities in the simulation
Clone length H 180,000bp
average distance between SNPs 300bp
average fragment length L 650bp
average coverage κ 7
Table 4.3: Parameter values in the simulation.
97
4.1.3 Simulation model
In our simulations, we employed the stochastic model proposed by Lander and Wa-
terman ([44]). Under this model, the clone length is denoted by H. SNP sites are
generated along the clone whose size is H. The rate of SNP sites is denoted by λ.
The number of SNP sites in the clone whose size is H is denoted by N(H). N(H)
follows the Poisson distribution with the parameter λH. In this case, the expec-
tation of the number of SNP sites is E[N(H)] = λH, and the standard deviation
is SD[N(H)] =
√
λH. After the number of SNP sites is decided, their positions
are uniformly generated along the clone. Alternatively, because the inter-arrival
time between SNP sites follows the exponential distribution with the parameter
1
λ
, the SNP sites are sequentially generated by the exponential distribution. The
parameter values in our simulation are shown in Table 4.3.
The average length of fragment is denoted by L. The number of fragments is
denoted by D. The average sequence coverage in clone (H) is denoted by κ. Then
the equation κH =DL holds.
4.1.4 Simulation results
By varying the two thresholds, the confidence score and odds ratio, we obtained
reasonable results as shown in Table 4.4 and 4.5. 1000 replicates are generated
accordingtooursimulationmodelandtheaverageresultof1000replicatesisshown.
For the confidence score, 7 configurations from 0.4 to 0.7 were used. For the odds
98
Threshold for odds ratio: 1
Threshold for confidence 0.4 0.45 0.5 0.55
False positive rate (pairwise comparison) 9.91% 9.50% 8.65% 7.92
# SNP reported (including singleton) 553 548 535 524
True positive rate (all reported) 92.05% 92.62% 94.32% 94.82%
Percentage of correctly detected pairs 71.65% 71.40% 70.72% 70.03%
Average segment length 5.31 5.17 4.78 4.67
Threshold for odds ratio: 1.1
Threshold for confidence 0.4 0.45 0.5 0.55
False positive rate (pairwise comparison) 9.19% 8.90% 8.47% 7.82%
# SNP reported (including singleton) 540 537 531 522
True positive rate (all reported) 93.86% 94.11% 94.43% 94.89%
Percentage of correctly detected pairs 70.94% 70.77% 70.48% 69.87%
Average segment length 4.85 4.80 4.75 4.66
Threshold for odds ratio: 1.3
Threshold for confidence 0.4 0.45 0.5 0.55
False positive rate (pairwise comparison) 8.59% 8.37% 8.04% 7.58%
# SNP reported (including singleton) 532 530 526 519
True positive rate (all reported) 94.22% 94.42% 94.67% 95.01%
Percentage of correctly detected pairs 70.50% 70.38% 70.17% 69.75%
Average segment length 4.77 4.74 4.70 4.64
Threshold for odds ratio: 1.5
Threshold for confidence 0.4 0.45 0.5 0.55
False positive rate (pairwise comparison) 8.40% 8.21% 7.91% 7.49%
# SNP reported (including singleton) 529 527 524 517
True positive rate (all reported) 94.37% 94.54% 94.77% 95.08%
Percentage of correctly detected pairs 70.29% 70.20% 70.01% 69.63%
Average segment length 4.74 4.71 4.67 4.62
Table 4.4: Simulation results. To determine the significance of pairwise com-
parison, we set thresholds for both confidence and odds ratio of the two most
probable cases. We show results under seven thresholds for confidence and
four thresholds for odds ratio. Each sub-table corresponds to a constant
threshold for odds ratio. Under each configuration, 1000 replicates were sim-
ulated. The false positive rates are for pairwise comparison. Some of these
positivepairswerelatereithermodified ordiscardedby checking consistency.
The SNP number in the final report includes singletons, namely, those single
sites that cannot be connected to others. In this case, we report their geno-
types. The true positive rates are for those reported sites, either genotypes
or haplotypes. The last two accounts arethe percentageof correctly detected
pairs among all and average lengths of haplotype segments.
99
Threshold for odds ratio: 1
Threshold for confidence 0.6 0.65 0.7
False positive rate (pairwise comparison) 7.08% 6.31% 5.65%
# SNP reported (including singleton) 510 498 486
True positive rate (all reported) 95.4% 95.9% 96.3%
Percentage of correctly detected pairs 69.2% 68.3% 67.4%
Average segment length 4.56 4.47 4.37
Threshold for odds ratio: 1.1
Threshold for confidence 0.6 0.65 0.7
False positive rate (pairwise comparison) 7.02% 6.30% 5.65%
# SNP reported (including singleton) 509 497 486
True positive rate (all reported) 95.45% 95.90% 96.31%
Percentage of correctly detected pairs 69.07% 68.23% 67.35%
Average segment length 4.55 4.46 4.37
Threshold for odds ratio: 1.3
Threshold for confidence 0.6 0.65 0.7
False positive rate (pairwise comparison) 6.83% 6.13% 5.54%
# SNP reported (including singleton) 507 496 485
True positive rate (all reported) 95.54% 95.97% 96.35%
Percentage of correctly detected pairs 68.99% 68.16% 67.30%
Average segment length 4.54 4.46 4.37
Threshold for odds ratio: 1.5
Threshold for confidence 0.6 0.65 0.7
False positive rate (pairwise comparison) 6.83% 6.13% 5.54%
# SNP reported (including singleton) 507 496 485
True positive rate (all reported) 95.54% 95.97% 96.35%
Percentage of correctly detected pairs 68.99% 68.16% 67.30%
Average segment length 4.54 4.46 4.37
Table 4.5: Simulation results. Continued .
100
ratio, 4 configurations from 1 to 1.5 were used. Totally, we tested our pairwise
reconstruction strategy under 28 different configurations.
In the case where the confidence score is 0.5 and the odds ratio is 1.3, 599 SNP
sites and 598 pairs were generated on average. For 97 pairs, no phases could be
derived because those pairs are located in different islands (Waterman [87]). On
average, 0.57 SNP sites fell into the oceans (Waterman [87]). For 45 pairs with the
low confidence score and odds ratio, the phases were not determined. Totally, 456
connected pairs were obtained. Among them, 419 pairs were estimated correctly as
true and the true positive ratewas 91.96%. 356pairs were consistent with adjacent
pairs. Wealsodetectedsingletons,SNPsiteswhichcannotbeconnectedwithother
SNP sites. By using the detected pairs, we resolved some inconsistencies by the
reconstruction strategies illustrated in Figure 3.1 and Figure 3.2. Totally, 526 SNP
sites were detected . Among them, the true positive rate was 94.67%. 70.17% of
all 598 adjacent pairs were correctly estimated. After inconsistencies were resolved
as Figure 3.2, 83.05% of all SNP sites generated were correctly estimated. The
average length of estimated haplotypes was 4.7.
As the confidence score increases to0.55andthe odds ratioremains unchanged,
the true positive rate increases to 95.01%. But the percentage of correctly detected
pairsdecreases to69.75%. Astheconfidence scoreremainsunchangedandtheodds
ratioincreasesto1.5,thetruepositiverateincreases to94.77%. Butthepercentage
of correctly detected pairs decreases to 70.01%. As the confidence score decreases
to 0.45 and the odds ratio remains unchanged, the true positive rate decreases to
101
94.42%. But the percentage of correctly detected pairs increases to 70.38%. As the
confidence score remains unchanged and the odds ratio decreases to 1.1, the true
positive rate decreases to 94.43%. But the percentage of correctly detected pairs
increases to 70.48%. The trade-off between sensitivity and specificity was clearly
seen in our simulations.
4.2 Simulation study 2
4.2.1 Overview
In this section, the method proposed in Section 3.3 is used. The fast algorithm to
calculate Pr(X) is used to improve the accuracy of our estimation. The uncertain
connections are identified by using this method and disconnected. In real sequenc-
ingprojects, asequence readispairedwithanothersequence readwhich isfromthe
same fragment. This connection information is called the mate-pair information.
In this section, our algorithm exploits the mate-pair information to estimate longer
haplotypes. Because the final goal of this article is to assemble diploid genome
sequences from the Ciona intestinalis, we reflected the polymorphism rate (1.2%)
ofthe Ciona intestinalisgenome(Dehaletal. [21]). This rateis4timeshigherthat
therateweusedinSection4.1. Obviously, thehigherthepolymorphismrateis, the
longer our estimated haplotypes will be. The purposes of the following simulations
in this section are summarized as follows.
102
Clone length H 120,000bp
average distance between SNPs 66bp/300bp
average fragment length L 650bp
average coverage κ 7
Table 4.6: Parameter values in the simulation.
1. Apply the fast algorithm to calculate Pr(X) and improve the accuracy.
2. Before assembling the real Ciona intestinalis genome, test our reconstruction
strategy under a highly polymorphic situation on a simulated data set.
3. Implement the bridging strategy (see Section 3.2) and test it on a simulated
data set.
4. Implement two-end sequencing and test it on a simulated data set.
5. Apply Algorithm 3.2 to a small part of the real Ciona intestinalis genome as
a preliminary work of assembling the whole genome.
4.2.2 Parameters
For this simulation, the same parameters described in Table 4.1 and 4.2 are used.
4.2.3 Simulation model
Again, the stochastic model proposed by Lander and Waterman was employed in
this section ([44]). Wedescribed themodelinSection 4.1.3. Two-end sequencing of
each fragment was simulated. The fragments of various size (1.8K bp∼ 120K bp)
103
weregeneratedaccordingtothetheirproportionsintheCionaintestinalis(Dehalet
al. [21]). The simulation parameters are summarized in Table 4.6. In this section,
we used two different polymorphism rates. In one setting, the inter-arrival time
between adjacent SNP sites is 66bp. In the other setting, the inter-arrival time is
300bp. Other parameters remain unchanged from Section 4.1.
haplotype frequency λ=1/2 λ=1/2
mate-pair information w/o mate-pair w/i mate-pair
total # polymorphism 674246 674246
# polymorphism reported (including singleton) 618034 618034
true positive rate (all reported) 97.51% 97.05%
percentage of correctly detected sites 89.38% 88.96%
average segment length 45.43 70.36
Table4.7: Reconstructionresultfromasimulationbasedon Ciona intestinalis
(λ=1/2). Thepolymorphism rateis 1.2%. Thetotalsize ofscaffolds isabout
60M bp. To determine the significance of pairwise comparison, we set the
thresholds for pairwise confidence to be 0.5 and odds ratio of the two most
probable cases to be 1.1. The number of polymorphisms in the final report
includes singletons, namely, those single sites that cannot be connected to
others. In this case, we report their genotypes. The true positive rates are
forthosereportedsites,eithergenotypesorhaplotypes. Thelasttwoaccounts
arethepercentageofcorrectlydetectedpairsamongallthepolymorphicsites
generated and average lengths of haplotype segments.
haplotype frequency λ=1/4
mate-pair information w/i mate-pair
total # polymorphism 671359
# polymorphism reported (including singleton) 554442
true positive rate (all reported) 96.08%
percentage of correctly detected sites 79.35%
average segment length 31.10
Table4.8: Reconstructionresultfromasimulationbasedon Ciona intestinalis
(λ=1/4). The polymorphism rate is 1.2%. cf. Table 4.7.
104
haplotype frequency λ=1/2 λ=1/2
mate-pair information w/o mate-pair w/i mate-pair
total # polymorphism 179686 179686
# polymorphism reported (including singleton) 165188 165188
true positive rate (all reported) 98.23% 96.0%
percentage of correctly detected sites 90.31% 88.25%
average segment length 5.06 33.87
Table 4.9: Reconstruction result from a simulation of a polymorphism rate
0.3% (λ=1/2), cf. Table 4.7.
haplotype frequency λ=1/4
mate-pair information w/i mate-pair
total # polymorphism 180294
# polymorphism reported (including singleton) 151501
true positive rate (all reported) 94.49%
percentage of correctly detected sites 79.40%
average segment length 19.39
Table 4.10: Reconstruction result from a simulation of a polymorphism rate
0.3% (λ=1/4), cf. Table 4.7.
4.2.4 Simulation results
The accuracy of Algorithm 3.2 was tested on simulated data sets, and the simu-
lation results are shown in Table 4.7, Table 4.8, Table 4.9, and Table 4.10. The
threshold for the confidence score is set to 0.5, and the threshold for the odds ratio
was set to 1.1. In the first data set, the origins of fragments were generated accord-
ing to the frequency λ =
1
2
(polymorphism rate: 1.2%). The simulation results are
summarized in Table 4.7. When neither the bridging strategy nor the mate-pair
information was used, the true positive rate was 97.51%. The percentage of cor-
rectly detected SNP sites was 89.38%. The average length of estimated haplotypes
105
Error type λ=1/2 λ=1/4
Truth Reconstructed percentage
heterozygote heterozygote, one match 1.23 2.77
heterozygote wrong phase 1.72 1.15
Total 2.95 3.92
Table 4.11: Error patterns. Most errors are partial genotyping errors.
was 45.43. If the bridging strategy and the mate-pair information were exploited,
the true positive rate slightly decreased to 97.05% because extra phase inversions
occurred. However, a significant increase in the average length of haplotypes was
obtained (70.36 vs. 45.43).
In the second data set, the sampling frequency of fragments was one quarter
(λ =
1
4
). The simulation results are shown in Table 4.8. Compared with the case
whentheλisequalto
1
2
,aslightlosswasobserved inaccuracy(96.08%vs. 97.05%)
whereas a significant loss occurred in average haplotype length (31.10 vs. 70.36).
We tested our approach in less polymorphic cases. In the third data set (Ta-
ble 4.9), the polymorphism rate was 0.3% and the sampling frequency is one half
(λ =
1
2
). When the low polymorphism rate was considered, haplotypes were re-
markably extended with the bridging strategy and the mate-pair information: the
average haplotype length was 33.87. Without the bridging strategy and the mate-
pair information, the average haplotype length was only 5.06.
If the sampling frequency is a quarter (λ =
1
4
), the estimated haplotypes were
less accurate (94.49% vs. 96.0%) and significantly short (19.39 vs. 33.87); see
Table 4.10.
106
mean STD
r=0.5 0.4867 0.0122
r=0.25 0.2501 0.0134
Table 4.12: Statistics of the maximum likelihood estimates of haplotype fre-
quency.
The types of errors in Table 4.7 and Table 4.8 were analyzed and summarized
in Table 4.11. Among the false positive rate (2.95%), the rate of phase errors was
1.72%. Inthesephaseerrors,theirgenotypeswerecorrect. Therateofthegenotype
errors was 1.23%. Because the genotype errors occur due to the sequencing errors,
there always exists one match. In the case of λ =
1
4
, the rate of phase errors were
1.15%. The rate of the genotype errors was 2.77%.
The confidence score is calculated for estimated haplotypes. Examples of the
generated confidence scores are shown in Table 4.1. We checked the validity of our
confidence score by plotting observed accuracy against estimated confidence score
(Figure 4.2). In the pairwise reconstruction step, we used the threshold 0.5 for
evaluating the confidence of estimated pairs. For the estimated confidence scores
below 0.5, the observed accuracy tends to be higher than the confidence scores.
Our confidence scores are still valid as a lower bound of actual error rates.
Because this happens regularly, we can adjust confidence scores by subtract-
ing some constant from estimated confidence scores. This constant value can be
trained by minimizing the mean squared errors (MSE) between confidence scores
and observed accuracy (Figure 4.3).
107
segment 7 starts at 144 and ends at 201 - size: 58
confidence score: 0.9960188778
ATACCTCGTTCCGAATGCGAATACCGCTCAATACTGAACCTGTAAACCAACGCGTAGA
TCCATAATGGTAC-TACACCGGTGTATCTCCGGTACTCGAGAAGTCTGTGAATTGTAG
segment 8 starts at 202 and ends at 230 - size: 29
confidence score: 0.9752074893
TAATTACCATAGTGACATCAGTTCAATTT
ACGCCGTACCCTCCTA-AGCAAAACGACA
segment 9 starts at 231 and ends at 254 - size: 24
confidence score: 0.6125626073
ATGCCAACATTCTCCCCGCAGCTA
-GAAATTGTGGTGTTAATTGAGAT
Figure 4.1: Examples of accuracy assessment. The haplotypes with positions,
sizes, and confidence scores are produced by our program.
The likelihood calculation Pr(X) was further used to estimate the maximum
likelihood estimator of the haplotype frequency in the assembly. 500 replicates of
120Kbp clones were generated with two different haplotype frequencies (λ =
1
2
,
λ =
1
4
) according to the parameters described in Table 4.6. In the case of λ =
1
2
,
the histogram of the maximum likelihood estimators is shown in the left panel
of Table 4.4. In the case of λ =
1
4
, the histogram of the maximum likelihood
estimators is shown in the right panel of Table 4.4. The statistics (mean and
standard deviation) ofthe maximum likelihood estimators are shown in Table 4.12.
The left panel of Figure 4.5 shows the detailed view of the maximum likelihood
estimation in a 120K clone when true frequency was λ =
1
2
. The log-likelihood
curve is maximized at the true haplotype frequency. The right panel of Figure 4.5
108
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Estimated confidence score
Observed accuracy
Figure 4.2: Estimated confidence score vs. observed accuracy. The confidence
score is calculated by (3.7), (3.8), and recursion (3.9). The sample size (the
number of haplotypes) was 47,504. Over-estimation of expected errors are
seen in the confidence scores below 0.5. 0.5 was the threshold for deciding
the goodness of estimated pairs in the pairwise reconstruction step. For the
confidence scores above 0.5, our score scheme fits the idealized scores (the
diagonal line) and well reflects the actual error rates.
109
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Estimated confidence score
Observed accuracy
Figure 4.3: Adjusted confidence score vs. observed accuracy. The adjustment
constant (0.143) was determined by calculating the MSE. The constant was
subtracted from the confidence scores below 0.5.
110
0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.3
0
50
100
150
haplotype frequency
frequency
0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5
0
50
100
150
200
250
300
350
haplotype frequency
frequency
Figure 4.4: Histograms of MLE estimates. Fragments were generated from
clones of size 120K bp. Left: the true haplotype frequency is one-quarter;
Right: the true haplotype frequency is one-half.
shows the detailed view of the maximum likelihood estimation in a 120K clone
when true frequency was λ= 1. The curve had the maximum value at 0 and 1.
As preliminary work for assembling the real Ciona intestinalis genome, Algo-
rithm3.2wasappliedtoasmallfractionofthegenomeassembly. Thesmallscaffold
(scaffold 1611) was obtained by aligning sequence reads to the published reference
genome. A fraction of the assembly is shown in Figure 5.7. In this figure, nine
fragments were aligned to the scaffold 1611. Four polymorphic sites were observed,
one of which was a multibase indel, CCC/---. Some sequencing errors were ob-
served as well, but those sequencing errors were not regarded as polymorphisms.
In Chapter 5, diploid genome sequences are constructed from the whole genome of
the Ciona intestinalis.
111
0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7
−8970
−8965
−8960
−8955
−8950
−8945
−8940
−8935
−8930
−8925
haplotype frequency
log likelihood
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
−2500
−2400
−2300
−2200
−2100
−2000
−1900
−1800
haplotype frequency
log likelihood
Figure 4.5: Log-likelihood curve. The above figures plot the logarithm of
probability of observations as a function of haplotype frequency. The clone
size is 120K bp. Left: the true haplotype frequency is one-half; Right: the
true haplotype frequency is one — homozygous case.
scaffold_1611 #1 aaCgagataatagaatTagaagtgt---atcttcccca-Ccct-t
#2 aaTgagataatagaatAagaagtgtCCCatcttcccca-Acct-t
aaCgagataatagaatTagaagtgt---atcttcccca-Ccct-t
aaCgagataatagaatTagaagtgt---atcttcccca-Ccct-t
aaCgagataatagaatTagaagtgt---atcttcccca-Ccct-t
aaCgagataatagaatTagaagtgt---atcttcccca-Ccct-t
aaTgagataatagaatAagaagtgtCCCatcttcccca-Acct-t
aaCgagataatagaatTagaagtgt---atcttccccacCcct-t
aaTgagataatagaatAagaagtgtCCCatcttcccca-Acct-t
-aCgagataatagaatTagaagtgt---atcttcccca-Ccct-t
atTagaagtgt---atcttcccca-Ccctgt
Figure 4.6: Part of a scaffold from Ciona intestinalis. Nine fragments were
aligned and four polymorphic sites were observed. Non-polymorphic and
polymorphic sitesarerepresentedbysmallandlargelettersrespectively. The
two targets are shown at the top.
112
A G C T -
0.315 0.18 0.18 0.315 0.01
Table 4.13: Composition probability in the Ciona intestinalis simulation.
4.3 Simulation study 3
4.3.1 Overview
In this section, the Gibbs sampler in Section 3.4 is used. Potential polymorphic
sites are detected by a Bayesian model selection. Blocks which consists of four
strongly potential polymorphisms and intervening weakly potential polymorphisms
are composed. Then the Gibbs sampler is applied to each block to infer the phases
between potential polymorphisms. Although the methods in Section 3.4 will be
applied to the real Ciona intestinalis genome in the following Chapter 5, the accu-
racy of the methods still needs to be evaluated. The purposes of this section are
summarized as follows.
1. Use the Phred quality score of each base-call.
2. Evaluate the accuracy of the Gibbs sampler on a simulated data set.
4.3.2 Parameters
The composition rates were calculated by counting the actual proportions in se-
quence reads in the Ciona intestinalisassembly. The rates are shown in Table 4.13.
Instead of using the constant sequencing error rates in Table 4.2, the sequencing
113
error rates were simulated from the real quality scores of base-calls and our error
model (3.12).
4.3.3 Simulation model
In Section 4.1 and 4.2, the assembly was simulated. In this section, the assembly
wasconstructedbyaligningtherealshotgunreadstothepublished draftgenomeof
Ciona intestinalis. Therefore, the genome sequence, sequence reads, quality scores,
and mate-pair informations were from the real data. The number of polymorphic
sites follows the model used in Section 4.1 and 4.2. In the published genome, the
draft genome consists of scaffolds whose size varies from 2Kbp to 1Mbp. In each
scaffold, the number of polymorphic sites follows the Poisson distribution. The
polymorphism rate we observed in the following Chapter 5 was higher than the
reported rate (1.2%). The polymorphism rate in this simulation was based on
our observation. In Section 4.1 and 4.2, only SNPs are generated. In this section,
however, multibase substitutionandindels weregeneratedaccordingtotheiractual
observed rates and lengths in the real Ciona intestinalis assembly. The origin of
each fragment was randomly generated with probability
1
2
. Base-calls of each read
were simulated based on the quality scores. To conclude in this simulation, true
diploid genome sequences were known and the fragment memberships were also
known. Therefore, the accuracy of estimated haplotypes could be verified.
114
Number of true polymorphisms 1,973,774
Number of identifiable polymorphisms 1,682,119
Number of identified polymorphisms 1,635,598
True positive rate 97.33%
Correct genotype, wrong phase 1.88%
False negative rate 3.53%
Table 4.14: Results from the simulated data set. Identifiable polymorphisms
are defined to be polymorphisms where at least once base-call occurs for
each base of the polymorphism. The false negative rate is defined to be the
number of undetected polymorphisms divided by the number of identifiable
polymorphisms.
4.3.4 Simulation results
The simulation results are shown in Table 4.14. In some regions of the genome,
one or both haplotypes fails to be sampled. Therefore, we generated 1,973,774
true polymorphisms, which are more than we actually expected to observe. Among
these polymorphisms, 1,682,119 polymorphisms were identifiable. Polymorphisms
are defined to be identifiable if each allele is observed at least one in the assembly.
Amongthese identifiable polymorphisms, 1,635,598polymorphisms were identified.
The truepositive rateoftheidentified polymorphisms was 97.33%. The haplotypes
are considered to be correctly estimated if both their genotypes and phases are
correct. 1.88% of the estimated haplotypes are correctly genotyped although their
phases were wrong. Summing those rate, 99.21% of our estimation was correct
in terms of genotype. 3.53% of the identifiable polymorphisms were not detected.
Thislowfalsenegativerateindicatesthatmostofpolymorphismsaredetectedonce
they are identifiable.
115
Truth Estimated %
homozygote heterozygote, one match 0.78
homozygote heterozygote, no match 0
heterozygote heterozygote, one match 0.01
heterozygote heterozygote, no match 0
Table 4.15: Error analysis.
We also analyzed the error types in Table 4.15. Some of homozygous positions
were identified to be polymorphisms. 0.78% of our estimated polymorphisms cor-
responds to this type of error. In 0.01% ofour estimated polymorphisms, one allele
was matched with true one although the estimation was wrong in terms of geno-
type. The sequencing errors in low coverage regions were the common causes to
those two types of errors.
116
Chapter 5
The Ciona intestinalis genome
5.1 Ciona intestinalis
The ascidian, Ciona intestinalis, is an organism the study of which gives insights
into the divergence of the chordates from the deuterostomes and the vertebrates
from the chordates (Dehal et al. [21]). Vertebrate genes, involved in cell signaling
and development, exist as primitive forms in the ascidian. Ciona intestinalis also
has lineage-specific genes which are involved in cellulose metabolism (Dehal et al.
[21]).
Ciona intestinalis is an important resource in developmental studies (for in-
stance, itsmolecularmechanisms ofcell-fatespecification)forthefollowingreasons:
(1) The tadpole comprises a small number of cells for which cell-lineage informa-
tion on tissues and organs exists. (2) The blastomeres of early embryos are large
enough to handle and visualize its differentiation throughout development. (3) It
117
Library name Insert size Std. dev. Read length Total sequence (Mbp)
SEZ 2,821 355 608 154.8
XPA 1,840 300 600 33.2
PJA 6,226 1,431 527 49.6
LQW 2,821 355 617 1,031.5
DEV 2,792 322 504 45.5
Table 5.1: Small insert libraries. The insert length of the second column is
the average size of each library. The read length of the fourth column is the
average length of the trimmed reads. This table is cited from (Dehal et al.
[21]).
Library name Insert size (Kbp) Std. dev. (Kbp) Read length Total sequence (Mbp)
JPB 120 20 429 5
PTR 36.5 6 593 20.2
OMM 120 20 504 4.2
Table 5.2: Long insert libraries. The insert length of the second column is
the average size of each library. The read length of the fourth column is the
average length of the trimmed reads. This table is cited from (Dehal et al.
[21]).
has a short embryogenesis (18 hours) and brief life cycle (< 3 months). (4) Elec-
troporation methods readily incorporate transgenic DNAs into developing embryos
(Dehal et al. [21]).
A whole-genome shotgun approach was taken to sequence the genome of Ciona
intestinalis by the Joint Genome Institute (JGI). DNA was purified mainly from
the sperm of an individual from Half Moon Bay, California, USA. BAC and cosmid
librarieswerepreparedinpartfromaJapaneseindividualandadifferentCalifornia
individual, respectively (Dehal et al. [21]). Fragments from each library (plasmid,
cosmid, BAC) were two-end sequenced and mate-pair information was generated
118
for long-range connection. The sequence coverage was∼8. For more detail of the
Ciona intestinalis sequencing project, see (Dehal et al. [21]).
5.2 Assembly
We assembled the Ciona intestinalis genome by aligning the shotgun reads with
quality scores to the reference genome sequence. The reference genome sequence,
the shotgun reads, and the quality files were publicly available and downloaded
from the web site of the JGI (genome.jgi-psf.org/ciona4). After low-quality regions
of shotgun reads were trimmed by using LUCY (Chou and Holmes [14]), BLASTN
was used for the pairwise alignments to the reference sequence (Altschul et al. [2]).
As indicated in Table 5.1 and Table 5.2, the size of clone inserts ranges from 1.8K
bp∼ 120K bp. Each fragment was two-end sequenced. The mate-pair information
wastracedthroughouralignmentprocess. Aligningsequence reads tothereference
sequence often generates more than one significant candidate region because of the
repeats in the genome. In this case, there was no clue that one alignment was
more significant than others. Therefore, we decided to use the traced mate-pair
information. In our assembly, the mate-pair information between two paired reads
is used to resolve the repeat problem. In approximately 99% of the fragments,
paired reads were located within the mean± 4 standard deviation of its fragment
length (Dehal et al. [21]; Aparicio et al. [6]). By this criterion, the paired reads
wereincludedifthepairedreadswerelocatedwithintherangeanduniquelyaligned
119
Genome size 107.6M bp
Number of identified polymorphisms (counted additively) 1,620,331
Polymorphism rate 1.52%
Number of identified polymorphisms (counted once) 1,334,406
Polymorphism rate 1.25%
Table5.3: Polymorphismrate. Intra-scaffoldgapsandtheregionsnotspanned
by any read were excluded in genome size below. Polymorphisms rate is the
number of polymorphisms divided by genome size.
to one region. The average and standard deviation of each library is summarized
in Table 5.1 and Table 5.2. After all the intra-scaffold gaps were eliminated, our
alignments spanned∼95% of the reference genome sequence. The size of the entire
genome is 116.7 Mbp (Dehal et al. [21]). The coverage we obtained was∼8, which
is consistent with the reported coverage (Dehal et al. [21]). The low quality bases
were further eliminated in two ways. Over all regions, the bases whose quality
scores were (≤ 4)were eliminated. The bases notsatisfying aneighborhood quality
standard (NQS) condition were eliminated over those regions; a base satisfies the
NQS condition if its Phred score is ≥ 20, and the 10 flanking bases have Phred
scores ≥ 15 (Altschuler et al. [3]). In low coverage regions (≤ 6), the non-NQS
bases were eliminated from our assembly.
5.3 Results
ThediploidgenomereconstructionstrategyinSection3.4wasappliedtoourassem-
bly. The results are shown in Table 5.3. The regions spanned by at least once read
120
Single substitutions 1,124,190
Single indels 70,058
Multibase substitutions 65,453
Multibase indels 74,706
Table 5.4: Frequency of each polymorphism category.
Genotype Frequency(%)
A, C 7.0
A, G 23.2
A, T 13.0
A, - 7.2
C, G 5.2
C, T 23.2
C, - 3.5
G, T 7.0
G, - 3.5
T, - 7.2
Table 5.5: Composition of polymorphisms.
Category Frequency (%)
Transition 46.4
Transversion 32.2
Indels 21.4
Table 5.6: Categories of point mutations and their frequencies.
121
cover 107.6Mbp. We counted the number of polymorphisms in two different ways
which produced different results. In total, 1,620,331polymorphisms were identified
when we counted a multibase polymorphism additively by its length. In this case,
the polymorphism rate was 1.52%. 1,334,406 polymorphisms were identified when
a multibase polymorphism was counted once. In this case, the polymorphism rate
was 1.25%. Either of the polymorphism rates was higher than the reported poly-
morphism rate (1.2%) (Dehal et al. [21]). Although the majority of the identified
polymorphisms were SNPs, there were many multibase substitutions andmultibase
indels. 65,453 multibase substitutions and 74,706 multibase indels were identified
as shown in Table 5.4. The difference in the polymorphism rates is also explained
by these multibase substitutions and multibase indels. The composition of poly-
morphisms is shown in Table 5.5. Based on Table 5.5, we categorize the point
mutations as transition, transversion, and indels in Table 5.6. In Figure 5.1, an
1,000bp window was slid along scaffold 1 and the number of polymorphisms was
counted in each window. 71 polymorphisms were identified in some 1000bp win-
dow. InFigure5.2,weenlargetheregionfrom40Kto60Kinscaffold 1. Thesizeof
sliding window is decreased to 100bp. We show the distribution of polymorphisms
in scaffold 140, scaffold 643, and scaffold 990 (Figure 5.3; Figure 5.4; Figure 5.5).
From these figures, it is concluded that the distribution of polymorphisms is not
uniform across the genome. The distribution of the haplotype lengths are shown
in Figure 5.6. In this figure, all the homozygous loci were excluded in calculating
122
0 2000 4000 6000 8000 10000
0
5
10
15
20
25
position (kb)
number of polymorphisms
Figure 5.1: Distribution of polymorphisms on scaffold 1. The position of win-
dow along the genome is shown on x axis and the number of polymorphisms
in window is shown on y axis. The size of the window is 1kb. Multibase
substitutions and multibase indels were additively counted based on their
lengths.
the lengths of the estimated haplotypes. The lengths of the estimated haplotypes
range from 2 to 4,123. There were a large number of isolated SNPs which could
not be connected to other SNPs. In Section 4.1 and 4.2, those SNPs are called
singletons. An example of our diploid genome sequences is shown in Figure 5.7.
The two consensus sequences instead of a consensus sequence is provided at the
top.
123
4000 4500 5000 5500 6000
0
2
4
6
8
10
12
14
16
18
20
position (100bp)
number of polymorphisms
Figure 5.2: Distribution of polymorphisms in a specific region on scaffold 1.
From Figure 5.1, the region from 40k to 60k is enlarged. The size of the
windowis100bp. Thepositionofwindowalongthegenomeisshownonxaxis
and the number of polymorphisms in window is shown on y axis. Multibase
substitutions and multibase indels were additively counted based on their
lengths.
124
0 500 1000 1500 2000
0
2
4
6
8
10
12
14
16
18
20
position (100bp)
number of polymorphisms
Figure 5.3: Distribution of polymorphisms on scaffold 140. The position of
window along the genome is shown on x axis and the number of polymor-
phisms in window is shown on y axis. The size of the window is 100bp.
Multibase substitutions and multibase indels were additively counted based
on their lengths.
125
0 50 100 150 200 250 300 350
0
2
4
6
8
10
12
position (100bp)
number of polymorphisms
Figure 5.4: Distribution of polymorphisms on scaffold 643. The position of
window along the genome is shown on x axis and the number of polymor-
phisms in window is shown on y axis. The size of the window is 100bp.
Multibase substitutions and multibase indels were additively counted based
on their lengths.
126
0 50 100 150
0
2
4
6
8
10
12
position (100bp)
number of polymorphisms
Figure 5.5: Distribution of polymorphisms on scaffold 990. The position of
window along the genome is shown on x axis and the number of polymor-
phisms in window is shown on y axis. The size of the window is 100bp.
Multibase substitutions and multibase indels were additively counted based
on their lengths.
127
0 500 1000 1500 2000 2500 3000 3500 4000
0
20
40
60
80
100
120
140
160
180
200
haplotype length
frequency
Figure 5.6: Distribution of haplotype lengths. The average size of haplotypes
was ∼103. As figure 5.1, multibase substitutions and multibase indels were
additively counted based on their lengths. The haplotypes which contain less
than 10 polymorphisms are omitted..
128
scaffold_1 #1 tgtgtctttgggCAAgaCacttaacgg-caAttggtccaacccagTGgtc
#2 tgtgtctttggg---gaAacttaacgg-caTttggtccaacccag-Cgtc
LQW610018.y1 tgtgtctttggg---gaAacttaacgg-caTttggtccaacccag-Cgtc
LQW684879.y1 tgtgtctttgggCAAgaCacttaacgg-caAttggtccaacccagTGgtc
LQW938309.x2 tgtgtctttggg---gaAacttaacggAcaTttggtccaacccag-Cgtc
LQW715176.x1 tgtgtctttggg---gaAacttaacgg-caTttggAccaacccag-Cgtc
LQW793007.x3 tgtgtctttgggCAAgaCacttaacgg-caAttggtccaacccagTGgtc
LQW750543.x1 tgtgtctttgggCAAgaCacttaacgg-caAttggtccaacccagTGgtc
LQW159357.x1 tgtgtctttgggCAAgaCacttaacgg-caAttggtccaacccagTGgtc
Figure 5.7: An example of the diploid genome reconstruction from scaffold 1.
The diploid consensus sequence is shown at the top. Polymorphic sites and
sequencing errors are represented as upper cases.
5.4 Data deposition
DiploidgenomesequencesofCionaintestinaliswillbedepositedinhttp://diploid.cmb
.usc.edu. Our format of diploid genome sequences is shown in Figure 5.8. The
numbers on the first line indicates the genomic indexes of the staring and ending
positions. Diploid genome sequences are displayed on the second and third lines.
50 nucleotides are displayed on each line. After a blank line, genomic indexes,
diploid genome sequences continue to be displayed with their phases preserved.
Discontinued diploid genome sequences are separated by a string of * (Figure 5.9).
129
43116 43165
aagaaccCacttcgtaagttttcAttttAtcacaataTccattaattaaa
aagaaccTacttcgtaagttttcGttttGtcacaataAccattaattaaa
43166 43215
agaTtagtatttgagatgtttgataTtttttagaacatggtatgtgacgt
agaGtagtatttgagatgtttgata-tttttagaacatggtatgtgacgt
43216 43265
cacaggTaaagtggaccgtcgagcacgaagatggaatgaaggttattatt
cacaggAaaagtggaccgtcgagcacgaagatggaatgaaggttattatt
43266 43315
acgtaataatatacaaaatattggacataactatgctacaataaacaaca
acgtaataatatacaaaatattggacataactatgctacaataaacaaca
43316 43365
attaaattattgcatattataagatgtacaaNNtGtttaattataaaagc
attaaattattgcatattataagatgtacaaNNtTtttaattataaaagc
43366 43415
atggtcgaNNNNNNNNNagaaaagtcagcagaaaaggctgcagaaggaaa
atggtcgaNNNNNNNNNagaaaagtcagcagaaaaggctgcagaaggaaa
43416 43465
gcacaaggtagtgtttgttgtttaaacaattataatatctgtatatatca
gcacaaggtagtgtttgttgtttaaacaattataatatctgtatatatca
43466 43515
acttaacaCAtagtattgtggtgtaagatgggatatatttagcacatatt
acttaacaTGtagtattgtggtgtaagatgggatatatttagcacatatt
43516 43565
accatataccatacagtacccatttAttagcgtgtaaggttacagTtata
accatataccatacagtacccatttCttagcgtgtaaggttacag-tata
Figure 5.8: Format of diploid genome sequences from scaffold 1. As in Fig-
ure 5.7, polymorphisms are represented by capital letters. N represents an
intra-scaffold gap.
130
1297 1346
atcagaacacttggatattatgcgttaaaggtgtcccatcttcttccaag
atcagaacacttggatattatgcgttaaaggtgtcccatcttcttccaag
1347 1396
ctactatatattttttgtgtagtacattactcaatttatgtttctattac
ctactatatattttttgtgtagtacattactcaatttatgtttctattac
1397 1446
tgaacaaactaaacgacaactatcgtctttatatttcaagctccaccatt
tgaacaaactaaacgacaactatcgtctttatatttcaagctccaccatt
1447 1496
atggcaatgtaaaccttatttatagccNNNNNNNNNNNNNNNNatcaata
atggcaatgtaaaccttatttatagccNNNNNNNNNNNNNNNNatcaata
1497 1510
tgtctgcttgactc
tgtctgcttgactc
**********************************************************
1511 1560
CctgtacaacagaaatacatagacacaacattatTataAaaggtatataa
Tctgtacaacagaaatacatagacacaacattat-ataTaaggtatataa
1561 1610
tttaaattcaatctccgtcatgcttaaacaacaacaaacctaacttttta
tttaaattcaatctccgtcatgcttaaacaacaacaaacctaacttttta
1611 1660
caacctcactacagctcctcccaatgtgtgacgtcatccctacccgttcc
caacctcactacagctcctcccaatgtgtgacgtcatccctacccgttcc
1661 1710
cagtgttcgtcccacctcccgctccgcccccaacacgcccacttccacCc
cagtgttcgtcccacctcccgctccgcccccaacacgcccacttccacTc
Figure 5.9: An example of discontinued diploid genome sequences from scaf-
fold 1. A string of * indicates that the phases are no longer preserved from
the previous sequences after the * string. As in Figure 5.7, polymorphisms
are represented by capital letters. N represents an intra-scaffold gap.
131
Chapter 6
Conclusion and Discussion
6.1 Conclusion
Throughout this article, we have developed methods to address the diploid genome
reconstruction problem. We proposed a simple pairwise reconstruction strategy in
Section 3.1. Based on this pairwise reconstruction strategy, we proposed a fast
algorithm to evaluate the confidence of estimated haplotypes in Section 3.3. We
developed a new strategy based on the Gibbs sampler to infer haplotypes from the
Ciona intestinalis genome. We successfully derived diploid consensus sequences
from the assembly. The estimated polymorphism rates (1.25% and 1.52%) were
higher than the reported polymorphism rate (1.2%). Because our methods are
applied to the assembly with the usual coverage (5∼10), it is now possible to infer
diploid genome sequences instead of a haploid sequence.
132
6.2 Relaxation of assumptions
We made 7 assumptions to make the problem simple. Those assumptions follows:
A1 – The DNA preparation and cloning stage are free of errors; A2 – The fragment
assembly is correct; A3 – All positions within a fragment are equally reliable; A4 –
Allfragmentsequences areequallyreliable; A5–Sequencing errorsareindependent
oftheirlocalcontext; A6–Thesequencingerrorratesareconstantacrosstheentire
sequence; A7 – The composition of the sequence is independent, both of adjacent
bases and over large regions.
By incorporating the local context in base-calling, we can eliminate some as-
sumptions. In Section 3.4, we used the Phred quality scores. In principle, the
Phred quality score considers the local context of each base-calling. Sequencing er-
ror probabilities are not constant and not independent of their local context. Then
A5 and A6 can be eliminated. Because the Phred quality scores assign the error
probability to each base-call in a fragment, we do not assume that all positions
within a fragment are equally reliable. Therefore, A3 is not needed. By consider-
ing the local context of each base-calling, every fragment sequence has a different
quality. Therefore, A4 is not needed. In Section 3.3.4, we proposed a method to
estimatethecompositionprobabilitiesadaptivelyinthelocalcontext. Thismethod
does not guarantee that A7 is completely eliminated. A1 and A2 are not directly
related to our problem. These problems are left for others to solve.
133
6.3 Comparison of average haplotype length
We compared the average haplotype length in the simulated data set (Section 4.3)
and the average haplotype length in the real data set (Section 5.3). The average
haplotype length in the simulated data set was∼161 while the average haplotype
length in the real data set was∼103. The shorter haplotype length is explained by
the complications in the real Ciona intestinalis sequencing project. Three reasons
are postulated as follows.
1. In the real data, recombination exists.
2. Sequence libraries were prepared from more than a single individual.
3. Misassembly can occur in the real sequence assembly.
In the Ciona intestinalis sequencing project, DNA was purified from sperm.
Numerous recombinations were observed from the assembly. In this case, recombi-
nations prevent our algorithm from determining the phases correctly. We can see
that longer haplotypes can be extended when there exists no recombination.
In the Ciona intestinalis sequencing project, subclone libraries were prepared
frommorethananindividual althoughtheproportionoftheexternal librariesfrom
other individuals was very low (∼10%). The polymorphism rate gets higher, but
the average haplotype length gets shorter by including external subclone libraries.
It is reported that the genome of a highly polymorphic organism is hard to
assemble because to differentiate true overlaps from false overlaps is non-trivial in
134
the presence of a large number of polymorphisms (Jones et al. [41]). As a result
the size of contigs tends to be shorter.
6.4 Integration into sequence assembler
Sequence assembly usually consists of the overlap-layout-consensus (Section 1.4).
Our diploid genome reconstruction methods are designed to replace the consensus
step of sequence assembly. Therefore, our methods are free from the approach
by which the genome sequence was assembled. Our methods can be applied to
the layout which was assembled by any sequence assembler. Our diploid genome
reconstruction methods can be integrated into any sequence assembler. Currently,
our program takes an assembly layout in format shown in Figure 8.6.
6.5 Quality score adjustment
In Section 3.4, we replaced the constant error probabilities by the quality scores of
base-calls. In using the quality scores, we made two following assumptions.
1. The quality scores from the Ciona intestinalis genome project reflects the
Phred error model described in (1.1).
2. The Phred quality scores reflect the true error probabilities.
Because the Phred quality score and the error assignment method were trained
from a large data set, it is considered that the scores reflect the error probabilities.
135
However, because sequencing machines operate under various conditions, it is not
guaranteed that the scores reflect the true error probabilities in all cases. Li et al.
[47] proposed a method to adjust the scores when the scores do not reflect the true
error probabilities. If our assumptions are violated, the score adjustment method
can be used.
6.6 Confidenceofdiploidconsensussequencesfrom
the Ciona intestinalis genome
In Section 3.3, the confidence score was used assess the accuracy of estimated hap-
lotypes. By using this confidence score, we can identify the low quality haplotype
which require further investigation. In Figure 4.1, we calculated the confidence
score by taking the constant sequencing error probabilities as inputs. However,
this confidence score also can be calculated by taking the Phred quality scores as
inputs.
136
Chapter 7
Future Work
7.1 Multiple-haplotype reconstruction
Subclone libraries were often prepared from more than a single individual. In this
case,multiplehaplotypescanexistoversomeregions. Currently, ourreconstruction
methods infer two haplotypes from the assembly. Our reconstruction methods can
be extended to infer multiple haplotypes. First, the Bayesian model selection to
identify potential polymorphisms can be extended to handle multiple haplotypes.
Pr(M
l+1
|X
·j
=x
·j
)
Pr(M
l
|X
·j
=x
·j
)
=
Pr(X
·j
=x
·j
|M
l+1
)
Pr(X
·j
=x
·j
|M
l
)
Pr(M
l+1
)
Pr(M
l
)
≥α
l,k
, k = 1,2, l = 1,2,....
(7.1)
whereM
l
andM
l+1
representlhaplotype-basedand(l+1)haplotype-basedmodels,
respectively.
137
The thresholds α
l,k
can be trained as described in Section 3.4. By increasing
the model complexity l, themodel which maximizes the (7.1)can be chosen. When
haplotypes are inferred from some region, the most complex model over the region
is selected. The Gibbs sampler and haplotype extension can be adapted according
to the selected model.
An ideal candidate for this multiple-haplotype reconstruction is environmental
shotgun assembly in which a mixture of numerous different organisms is sequenced
simultaneously (Venter et al. [85]). In this environmental shotgun assembly, the
closely-related organisms might be assembled into the same scaffold. Therefore,
there might exist a large varieties in haplotype compositions.
7.2 Optical mapping
A technique to construct an ordered restriction map, optical mapping, has been
long explored and optical maps were constructed from the small genomes such
as Plasmodium falciparum (Lai et al. [42]), Escherichia coli (Lim et al. [48]), and
Deinococcus radiodurans(Linetal. [49]). Opticalmapping canbeappliedtoBACs
(Cai et al. [11]), YACs (Cai et al. [10]), or randomly sheared DNA molecules (Lin
et al. [49]) to construct various types of restriction maps.
Inopticalmapping,aconsensus mapisconstructed inseveral steps. First,DNA
molecules are placed on glass surfaces and circulating fluid flows extend the DNA
molecules which are held on the surfaces by charge interactions. Then each DNA
138
molecule is digested by a selected restriction enzyme, but the original order of its
restriction fragments is preserved. The fragments are subsequently stained with
fluorescence dyes, visualized by fluorescence microscopy, and transformed to digital
images by a charge-coupled-device camera. The size of each fragment is measured
by its fluorescence intensity andasingle-molecule mapis obtained. Single-molecule
maps are finally assembled into map contigs and a consensus map is determined
(Aston et al. [7]).
Optical mapping facilitates genome sequencing projects by verifying the ac-
curacy of assembled contigs or scaffolds, estimating genome size before sequence
assembly and estimating the size of gaps between contigs after sequence assem-
bly, identifying extra-chromosomal elements such as plasmid, and functioning as
a physical map in which assembled contigs or scaffolds can be mapped (Lin et al.
[49]).
Iftheopticalmapisconstructedfromaneukaryoticgenome,themapisapoten-
tialresourcetodetectgenomicvariations(substitution, indels, andrearrangement).
Similartosequence assembly, becauseaDNAmoleculepreserves thephasebetween
adjacent restriction sites, haplotyping of restriction sites in optical mapping can be
an interesting problem (Anantharaman and Mysore [4]). Once an optical map and
haplotypes are constructed, these can be further utilized to guide the finishing of
diploid genome sequences because diploid scaffolds or contigs can be better aligned
to this diploid restriction map.
139
Chapter 8
Software
Two software packages are available from our works. One is the software based on
the method which was described in Section 3.3 The other is the software based on
the Gibbs sampler which was described in Section 3.4. Two software packages are
explained in the following Section 8.1 and 8.2.
8.1 Accuracy assessment program
8.1.1 Command
Type the following command on the command line.
hapseq [options] layout file1 [, layout file2, ... ]
Multiple layout files can be specified in the command.
140
Scaffold_1
Scaffold_2
Scaffold_3
Scaffold_4
Scaffold_5
Figure 8.1: Format of a layout name file.
8.1.2 Options
-i layout name file Instead of specifying multiple layout files on the command
line, The names of multiple layout files can be provided as a layout name file.
The format of the layout name file is shown in Figure 8.1. In this example, the
names of multiple layout files should be Scaffold 1.lyt, Scaffold 2.lyt, Scaffold 3.lyt,
Scaffold 4.lyt, and Scaffold 5.lyt. In the layout name file, the extension “.lyt” must
be omitted. The layout name file and layout file must reside in the same directory
as the executable command file.
-t threshold value The threshold for estimated pairs is specified with this -t
option. The detail of this threshold is described in Section 3.1.2. As this threshold
varies, the trade-off between the average haplotype length and the true positive
rate is explained in Table 4.4. The default value is 0.5.
-r odds ratio Another way to evaluate the confidence of estimated pairs is to
use the odds ratio (3.6). The default ratio is 1.1.
141
Pr(A) = 0.24
Pr(C) = 0.25
Pr(G) = 0.23
Pr(T) = 0.26
Pr(-) = 0.02
Figure 8.2: Input of composition probabilities.
0.24
0.25
0.23
0.26
0.02
Figure 8.3: Format of a composition probabilities input file according to Fig-
ure 8.2. Each composition probability is separated by a new line.
-h threshold value In Algorithm 3.2, the uncertain phases are identified by
checking the confidence score of the phases. We consider that some phases are
uncertain if the confidence score for the phases are below some threshold. This
threshold is specified with -h. The default value is 0.5.
-f InSection3.3,weestimatedthehaplotypefrequency. Thishaplotypefrequency
estimationisenabledwith-foption. Withoutthisoption, thefrequency isassumed
to be a half.
-m If this option is specified, the mate-pair information is used to extend esti-
mated haplotypes. Without this option, the mate-pair information is not used.
142
-c composition probabilities file The composition probabilities are given as a
file. Suppose that we used the composition probabilities shown in Figure8.2. Then
the corresponding format of composition probabilities file is shown in Figure 8.3.
The order of the probabilities must be consistent with Figure 8.2. The sum of
probabilities must be 1.
-s sequencing error probabilities file Sequencing error rates are also pro-
vided as a file. Figure 8.4 shows the sequencing error rates we wish to use in
some simulation. The format of sequencing error rates file is shown in Figure 8.5.
P
X
Pr(X|S) must be 1.
layout file The format of an input layout file is shown in Figure 8.6. Only SNP
sites must be provided. Each row corresponds to a base-call. The name of a read is
given in the first column. The identification number for a fragment is given in the
second column. This number must be unique among fragments. The identification
number for each read is given in the third column. This number must be unique
among reads. Because the mate-pair information between paired reads is used, the
change of the read identification number from 7 to 8 is observed between the 19th
line and the 20th line with the fragment identification number unchanged. The
position of this base-call is given in the fourth column. The quality score for this
base-call is given inthe fifthcolumn. This software discards the base-calling scores.
143
8.1.3 Output files
Severaloutputfilesarestoredinthesubdirectory“result”underthedirectorywhere
the executable command file is located.
result.out Overall results such as the number of polymorphisms, the estimated
haplotype frequency, and the average haplotype length are stored in this file.
segments.out Estimated haplotypes with their confidence scores are printed in
this file. The format of this file is shown in Figure 4.1.
pair.out Estimated pairs with their confidence score and odds ratio are stored in
this file.
cut.out The locations which no fragment covers is reported in this file.
mle.out When -f is used, the data for plotting maximum likelihood curves are
generated in this file.
odds ratio.out Some of estimated pairs are not used for reconstructing haplo-
types because either of their confidence scores or odds ratio is below threshold
specified with -t and -r options. Those low-scored pairs are reported in this file.
144
est target.out Estimatedhaplotypeswithoutconfidencescoresarestoredinthis
file.
layout.out The assembled layout is stored in this file.
8.2 Gibbs-sampling based program
8.2.1 Command
Type the following command on the command line.
dipsampler [options] layout file1 [, layout file2, ... ]
Multiple layout files can be specified in the command.
8.2.2 Options
-i layout name file Instead of specifying multiple layout files on the command
line, the names of multiple layout files can be provided as a layout name file.
The format of the layout name file is shown in Figure 8.1. In this example, the
names of multiple layout files should be Scaffold 1.lyt, Scaffold 2.lyt, Scaffold 3.lyt,
Scaffold 4.lyt, and Scaffold 5.lyt. In the layout name file, the extension “.lyt” must
be omitted. The layout name file and layout file must resided in the same directory
as the executable command file.
145
Pr(A|A) = 0.99
Pr(C|A) = 0.0015
Pr(G|A) = 0.0025
Pr(T|A) = 0.0015
Pr(-|A) = 0.0035
Pr(N|A) = 0.0010
Pr(A|C) = 0.0015
Pr(C|C) = 0.99
Pr(G|C) = 0.0015
Pr(T|C) = 0.0010
Pr(-|C) = 0.0025
Pr(N|C) = 0.0035
Pr(A|G) = 0.0015
Pr(C|G) = 0.0015
Pr(G|G) = 0.99
Pr(T|G) = 0.0010
Pr(-|G) = 0.0025
Pr(N|G) = 0.0035
Pr(A|T) = 0.0015
Pr(C|T) = 0.0035
Pr(G|T) = 0.0025
Pr(T|T) = 0.99
Pr(-|T) = 0.0015
Pr(N|T) = 0.0010
Pr(A|-) = 0.001
Pr(C|-) = 0.001
Pr(G|-) = 0.001
Pr(T|-) = 0.001
Pr(-|-) = 0.995
Pr(N|-) = 0.001
Figure 8.4: Input of sequencing error probabilities. On the first line, for
instance, Pr(A| A) implies Pr(X
ij
= A|Y
ij
= A). On the second line, Pr(C| A)
implies Pr(X
ij
= C|Y
ij
= A).
146
0.99
0.0015
0.0025
0.0015
0.0035
0.0010
0.0015
0.99
0.0015
0.0010
0.0025
0.0035
0.0015
0.0015
0.99
0.0010
0.0025
0.0035
0.0015
0.0035
0.0025
0.99
0.0015
0.0010
0.001
0.001
0.001
0.001
0.995
0.001
Figure 8.5: Format of a sequencing error probabilities input file. The order of
the probabilities must be consistent with Figure 8.4.
147
LQW536710.x1 4 7 930 A 38
LQW536710.x1 4 7 931 A 38
LQW536710.x1 4 7 932 A 37
LQW536710.x1 4 7 933 C 37
LQW536710.x1 4 7 934 T 37
LQW536710.x1 4 7 935 A 37
LQW536710.x1 4 7 936 G 43
LQW536710.x1 4 7 937 T 43
LQW536710.x1 4 7 938 T 26
LQW536710.x1 4 7 939 T 23
LQW536710.x1 4 7 940 T 30
LQW536710.x1 4 7 941 A 30
LQW536710.x1 4 7 942 G 31
LQW536710.x1 4 7 943 T 31
LQW536710.x1 4 7 944 T 33
LQW536710.x1 4 7 945 T 29
LQW536710.x1 4 7 946 T 23
LQW536710.x1 4 7 947 G 23
LQW536710.x1 4 7 948 G 23
LQW536710.y1 4 8 2274 C 21
LQW536710.y1 4 8 2275 T 15
LQW536710.y1 4 8 2276 T 15
LQW536710.y1 4 8 2277 A 15
LQW536710.y1 4 8 2278 C 21
LQW536710.y1 4 8 2279 T 17
LQW536710.y1 4 8 2280 G 15
LQW536710.y1 4 8 2281 C 20
LQW536710.y1 4 8 2282 T 19
LQW536710.y1 4 8 2283 A 20
LQW536710.y1 4 8 2284 T 17
Figure 8.6: Format of an assembly layout input file.
148
-t threshold value The threshold value for evaluating the confidence of esti-
mated local haplotypes is specified with this option. If the confidence score is
below this threshold, the uncertain phase is identified with the method described
in Figure 3.10. The default value is 0.6.
-bburn in Thisoptionspecifiesthenumberofiterationsforwhichtheestimation
is discarded as a burn-in. The default value is 100.
-m max number of iteration The maximum number of allowed iterations per
a chain is specified with this option. The default value is 2000.
-r number of chains We allow to run multiple chains. The number of multiple
chains is specified with this option. The default value is 10.
-k thinning interval The thinning interval is specified with this option. The
default value is 10.
-c composition probabilities file The composition probabilities are given as a
file. Suppose that we used the composition probabilities shown in Figure8.2. Then
the corresponding format of composition probabilities file is shown in Figure 8.3.
The order of the probabilities must be consistent with Figure 8.2. The sum of
probabilities must be 1.
149
-l level The value of level is one of 0, 1, or 2. In the regions of low coverage,
some oflow-quality bases need to be eliminated. This option determines the extent
to which low-quality bases are eliminated. If the level is 0, no low-quality bases
are eliminated. If the level is 1, the bases having quality scores below the cut-off
(specified with the-o) are eliminated from an assembly layout. If the level is 2, the
non-NQS bases are removed from the low-coverage regions (specified with the -u).
The default value is 2.
-o quality score If the -l option is used with either 1 or 2, this option specifies
the cut-off scores for the low-quality bases. The bases with the score which is less
than this cut-off score is defined to be the low-score bases. The default score is 5.
-u coverage If the -l option is used with 2, this option defines the low-coverage
regions. The regions having the coverage which is less than this threshold are
defined to be the low-coverage regions. The default value is 7.
-w window size This options specifies the window size for counting the number
of polymorphisms in each window as the window is slid along the genome. The
default value is 1000.
-s threshold value The threshold is necessary to screen strongly potential poly-
morphisms. If (3.14) is greater than or equal to this threshold at some loci, those
loci are called strongly potential polymorphisms. The default value is 2.
150
-q threshold value The threshold is necessary to screen weakly potential poly-
morphisms. If (3.14) is greater than or equal to this threshold at some loci, those
loci are called strongly potential polymorphisms. The default value is 0.01.
layout file The format of an input layout file is shown in Figure 8.6. Non-SNP
sites are also provided The software will detect polymorphic sites with the method
described inSection3.4.5. Eachrowcorrespondstoabase-call. Thenameofaread
isgiveninthefirstcolumn. The identificationnumber forafragmentisgiveninthe
second column. This number must be unique among fragments. The identification
number for each read is given in the third column. This number must be unique
among reads. Because the mate-pair information between paired reads is used, the
change of the read identification number from 7 to 8 is observed between the 19th
line and the 20th line with the fragment identification number unchanged. The
position of this base-call is given in the fourth column. The quality score for this
base-call is given in the fifth column. This software uses the base-calling scores
instead of the constant sequencing error rates.
8.2.3 Output files
Severaloutputfilesarestoredinthesubdirectory“result”underthedirectorywhere
the executable command file is located.
151
result.out The number of polymorphisms, polymorphism rates, average haplo-
type length are printed in this file.
scaffold name.segments.out Diploid consensus sequences are stored in this
file. The format is shown in Figure 5.8.
scaffold name.layout.out Theassemblylayoutisstoredinthisfile. Theformat
is shown in Figure 5.7.
scaffold name.walk.out The number of polymorphisms in each slid window is
generated in this file. This information is used to plot the distribution of polymor-
phisms along the genome (Figure 5.1).
hap len.out The distribution of haplotype lengths is stored in this file. This
information is used to plot the distribution of haplotype lengths (Figure 5.6).
poly len.out The lengths of substitutions is reported in this file.
poly len indel.out The length of indels is reported in this file.
8.3 Software deposition
The softwares are available and downloadable from http://diploid.cmb.usc.edu. For
academic use, the softwares can be used for free. For commercial use, contact
msw@usc.edu or lilei@usc.edu.
152
Bibliography
[1] M. Adam et al. The genome sequence of Drosophila melanogaster. Science,
287:2185–2195, 2000.
[2] S. Altschul et al. Gapped blast and psi-blast: a new generation of proten
database search program. Nucleic Acids Res., 25:3389–3402, 1997.
[3] D.Altshuler, V.J.Pollara, C. R.Cowles, W.J.VanEtten, J.Baldwin, L.Lin-
ton, and E. S. Lander. An SNP map of the human genome generated by
reduced representation shotgun sequencing. Nature, 407:513–516, 2000.
[4] T. S. Anantharaman and V. Mysore. Fast and cheap genome wide haplotype
construction via optical mapping. The Pacific Symposium on Biocomputing,
pages 385–396, 2005.
[5] S. Anderson. Shotgun DNA sequencing using cloned DNase I-generated frag-
ments. Nucleic Acids Res., 10:3015–3027, 1981.
[6] S. Aparicio et al. Whole-genome shotgun assembly and analysis of the genome
of Fugu rubripes. Science, 297:1301–1310, 2002.
[7] C. Aston, B. Mishra, and D. C. Schwartz. Optical mapping and its potential
forlarge-scale sequencing projects. Trends in Biotechnology,17:297–302,1999.
[8] S. Batzoglou et al. ARACHNE; A whole-genome shotgun assembler. Genome
Res., 12:177–189, 2002.
[9] S. P. Brooks. Markov chain Monte Carlo method and its application. The
Statistician, 47:69–100, 1998.
[10] W. Cai, H. Aburatani, V. P. Stanton, D. E. Housman, Y. Wang, and D. C.
Schwartz. Ordered restriction endonuclease maps of yeast artificial chromo-
somes created by optical mapping on surfaces. Proc. Natl. Acad. Sci. USA,
92:5164–5168, 1995.
[11] W. Cai et al. High-resolution restriction maps of bacterial artifical chromo-
somes constructed by optical mapping. Proc. Natl. Acad. Sci. USA, 95:3390–
3395, 1998.
153
[12] K. S. Chan. Asymptotic behaviour of the Gibbs sampler. J. Am. Statist. Ass.,
88:320–328, 1993.
[13] X. Chen et al. A homogeneous, ligase-mediated DNA diagnostic test. Genome
Res., 8:549–556, 1998.
[14] H.Chou and M. Holmes. DNA sequence quality trimming and vector removal.
Bioinformatics, 17:1093–1104, 2001.
[15] G. A. Churchill and M. S. Waterman. The accuracy of DNA sequences: Esti-
mating sequence quality. Genomics, 14:89–98, 1992.
[16] A. Clark. Inference of haplotypes from PCR-amplified samples of diploid pop-
ulations. Mol. Biol. Evol., 7:111–122, 1990.
[17] Mouse Genome Sequencing Consortium. Initial sequencing and comparative
analysis of the mouse genome. Nature, 420:520–562, 2002.
[18] A. Cornish-Bowden. Nomenclature for incompletely specified bases in nucleic
acid sequences: recommendations. Nucleic Acids Res., 13:3021–3030, 1985.
[19] M.Dalyetal. High-resolutionhaplotype structure inthe human genome. Nat.
Genet., 29:229–232, 2001.
[20] S. Dear and R. Staden. A standard file format for data from DNA sequencing
instruments. DNA Sequence, 3:107–110, 1992.
[21] P. Dehal et al. The draft genome of Ciona intestinalis: Insights into chordate
and vertebrate origins. Science, 298:2157–2167, 2002.
[22] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from
incomplete data via the EM algorithm (with discussion). Journal of the Royal
Statistical Society, 39:1–38, 1977.
[23] C.Drysdale etal. Complex promoterandcodingregionβ
2
-adrenergicreceptor
haplotypes alter receptor expression and predict in vivo responsiveness. Proc.
Natl. Acad. Sci. USA, 97:10483–10488, 2000.
[24] B.EwingandP.Green. Basecallingofautomatedsequencertracesusingphred.
ii. error probabilities. Genome Res., 8:186–194, 1998.
[25] B. Ewing, L. Hillier, M. C. Wendl, and P. Green. Basecalling of automated
sequencer tracesusingphred.i.accuracyassessment. Genome Res.,8:175–185,
1998.
[26] L.Excoffier andM.Slatkin. Maximum-likelihood estimation ofmolecularhap-
lotype frequencies in a diploid population. Mol Biol Evol, 12:921–927, 1995.
154
[27] R.C.Gardneretal. Thecompletenucleotide sequence ofaninfectious cloneof
cauliflower mosaic virus by M13mp7 shotgun sequencing. Nucleic Acids Res.,
9:2871–2888, 1981.
[28] A.GelmanandD.B.Rubin. Inferencefromiterativesimulationusingmultiple
sequences (with discussion). Statistical Science, 7:457–511, 1992.
[29] S. Geman and D. Geman. Stochastic relaxation, Gibbs distribution, and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 6:721–741, 1984.
[30] M. C. Golumbic. Algorithmic graph theory and perfect graphs. Academic
press, NY, 1980.
[31] E. D. Green. Strategies for the systematic sequencing of complex genomes.
Nature Review Genetics, 2:573–583, 2001.
[32] P. Green. Phrap documentation(http://www.phrap.org).
[33] M.Groetschel,L.Lovasz,andA.Schrijver. Apolymomialalgorithmforperfect
graph. Annals of Discr. Math., 21:325–356, 1984.
[34] D. Gusfield. Haplotying as perfect phylogeny: A direct approach. J. Comput.
Biol., 10:323–340, 2003.
[35] E. Halperin and E. Eskin. Haplotype reconstruction from genotype data using
imperfect phylogeny. Bioinformatics, 20:1842–1849, 2004.
[36] W.K.Hastings. MonteCarlosamplingmethodsusingmarkovchainsandtheir
applications. Biometrika, 57(1):97–109, 1970.
[37] M.HawleyandK.Kidd. Haplo: aprogramusingtheEMalgorithmtoestimate
the frequencies of multi-site haplotypes. J. Heredity, 86:409–411, 1995.
[38] D. Hinds et al. Whole-genome patterns of common DNA variation in three
human populations. Science, 307:1072–1079, 2005.
[39] T. Hunkapiller, R. J. Kaiser, B. F. Koop, and L. Hood. Large-scale and
automated DNA sequence determination. Science, 254:59–67, 1991.
[40] S. Istrail et al. Whole-genome shotgun assembly and comparison of human
genome assemblies. Proc. Natl. Acad. Sci. USA, 101:1916–1921, 2004.
[41] T. Jones et al. The diploid genome sequence of Candida albicans. Proc. Natl.
Acad. Sci. USA, 101:7329–7334, 2004.
[42] Z. Lai et al. A shotgun optical map of the entire plasmodium falciparum
genome. Nat. Genet., 23:309–313, 1999.
155
[43] G. Lancia, V. Bafna, S. Istrail, R. Lippert, and R. Schwartz. SNPs problems,
complexity, and algorithms. European Symposium on Algorithms, pages 182–
193, 2001.
[44] E.S.LanderandM.S.Waterman. Genomicmappingbyfingerprintingrandom
clones: a mathematical analysis. Genomics, 2:231–239, 1988.
[45] C. B. Lawrence and V. V. Solovyev. Assignment of position-specific error
probability to primary DNA sequence data. Nucleic Acids Res., 22:1272–1280,
1994.
[46] J.M.Lewis. Onthecomplexity ofthemaximum subgraphproblem. Xth ACM
symposium on Theory of Computing, pages 265–274, 1978.
[47] L. Li, M. Nordborg, and L. M. Li. Adjust quality scores from alignment and
improve sequencing accuracy. Nucleic Acids Res., 32:5183–5191, 2004.
[48] A. Lim et al. Shotgun optical maps of the whole Escherichia coli O157:H7
genome. Genome Res., 11:1584–1593, 2001.
[49] J. Lin et al. Whole-genome shotgun optical mapping of Deinococcus radiodu-
rans. Science, 285:1558–1561, 1999.
[50] R. Lippert, R. Schwartz, Giuseppe Lancia, and Sorin Istrail. Algorithmic
strategiesforthesinglenucleotide polymorphismhaplotypeassembly problem.
Briefings in bioinformatics, 3:(1):1–9, 2002.
[51] J. S. Liu. Fration of missing information and convergence rate of data aug-
mentation. Computationally intensive statistical methods: Proceedings of the
26th symposium on the Interface, Vol. 26 of Computing Science and Statistics,
Interface Foundation of North America, North Carolina, pages 490–497, 1994.
[52] J. S. Liu. Monte Carlo strategies in scientific computing. Springer, 2001.
[53] J. S. Liu, W. H. Wong, and A. Kong. Covariance structure and convergence
rate of the gibbs sampler with various scans. Journal of the Royal Statistical
Society, Series B, 57(1):157–169, 1995.
[54] J.Long,R.Williams, andM.Urbanek. AnE-Malgorithmandtestingstrategy
for multiple-locus haplotypes. Am. J. Hum. Genet., 56:799–810, 1995.
[55] A. M. Maxam and W. Gilbert. A new method for sequencing DNA. Proc.
Natl. Acad. Sci. USA, 74:560–564, 1977.
[56] J. G. McLachlan and T. Krishnan. The EM algorithm and extensions. John
Wiley & Sons, 1996.
156
[57] D.Meldrum. Automationforgenomics. ii. sequencers, microarrays, and future
trends. Genome Res., 1288-1303, 10.
[58] D.Meldrum. Automationforgenomics. i.preparationforsequencing. Genome
Res., 10:1081–1092, 2000.
[59] D.R.Meldrum. Sequencing genomes and beyond. Science,292:515–516,2001.
[60] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and
E.Teller. Equations ofstate calculations by fast computing machines. Journal
of Chemical Physics, 21(6):1087–1091, 1953.
[61] J. C. Mullikin and A. McMurray. Sequencing the genome fast. Science,
283:1867–1868, 1999.
[62] R. J. Mural et al. A comparison of whole-genome shotgun-derived mouse
chromosome 16 and the human genome. Science, 296:1661–1671, 2002.
[63] E.W.Myers etal. Awhoe-genomeassembly ofDrosophila. Science,287:2196–
2204, 2000.
[64] P. Mykland, L. Tierney, and B. Yu. Regeneration in Markov chain samplers.
J. Am. Statist. Ass., 90:233–241, 1995.
[65] T.Niu, Z.S.Qin, X.Xu, andJ.Liu. Bayesian haplotypeinference formultiple
linked single-nucleotide polymorphisms. Am J Hum Genet, 70:157–169, 2002.
[66] T. Pastinen et al. Minisequencing: A specific tool for DNA analysis and
diagnostics on oligonucleotide arrays. Genome Res., 7:606–614, 1997.
[67] W.H.Press, B.P.Flannery, S.A.Teukolsky, andW.T.Vetterling. Numerical
recipesinC:Theartofscientificcomputing. CambridgeUniversityPress,1988.
[68] J.M.Proberet al. Asystem forrapidDNAsequencing with fluorescent chain-
terminating dideoxynucleotides. Science, 238:336–341, 1987.
[69] Rat Genome Sequencing project Consortium. Genome sequence of the brown
norway rat yields insights into mammalian evolution. Nature, 428:493–521,
2004.
[70] G. O. Roberts and N. G. Sahu. On the geometric convergence of the Gibbs
sampler. Journal of the Royal Statistical Society, Series B, 56:377–384, 1994.
[71] G.O.RobertsandS.K.Sahu. Updatingschemes; correlationstructure; block-
ingandparameterizationfortheGibbssampler. JournaloftheRoyalStatistical
Society, Series B, 59(2):291–327, 1997.
157
[72] D. B. Rubin. A noniterative sampling/importance resampling alternative to
thedataaugmentationalgorithmforcreatingafewimputationswhenfractions
ofmissinginformationaremodest: theSIRalgorithm.JournaloftheAmerican
Statistical Association, 52:543–546, 1987.
[73] R. K. Saiki et al. Primer-directed enzymatic amplification of DNA with a
thermostable DNA polymerases. Science, 239:487–491, 1999.
[74] F. Sanger and A. Coulson. A rapid method for determining sequences in DNA
by primed synthesis with DNA polymerase. J. Mol. Biol., 94:441–448, 1975.
[75] F. Sanger, A. R. Coulson, G. F. Hong, D. F. Hill, and G. B. Petersen. Nu-
cleotide sequence of the bacteriophage lambda DNA. J. Mol. Biol., 162:729–
773, 1982.
[76] F.Sanger,S.Nicklen,andA.Coulson.DNAsequencingwithchain-terminating
inhibitor. Proc. Natl. Acad. Sci. USA, 74:5463–5467, 1977.
[77] M.J.SchervishandB.P.Carlin. Ontheconvergence ofsuccessive substitution
sampling. J., 1:111–127, 1992.
[78] L. M. Smith et al. Fluorescentce detection in automated DNA sequence anal-
ysis. Nature, 321:674–679, 1986.
[79] T. F. Smith and M. S. Waterman. Identification of common molecular subse-
quences. J. Mol. Biol., 147:195–197, 1981.
[80] M. Stephens, N. Smith, and P. Donnelly. A new statistical method for haplo-
type reconstruction from population data. Am. J. Hum. Genet., 68:978–989,
2001.
[81] S. Tabor and C. C. Richardson. A single residue in DNA polymerases of the
Escherichia coliDNApolymerase Ifamilyiscriticalfordistinguishing between
deoxy- and dideoxyribonucleotides. Proc. Natl. Acad. Sci. USA,92:6339–6343,
1995.
[82] P. Taillon-Miller et al. Overlapping genomic sequences: A treasure trove of
single-nucleotide polymorphisms. Genome Res., 8:748–754, 1998.
[83] M. Tanner and W. H. Wong. The calculation of posterior distributions by
data augmentation. Journal of the American Statistical Association, 82:528–
547, 1987.
[84] J. C. Venter et al. The sequence of human genome. Science, 291:1304–1351,
2001.
158
[85] J. C. Venter et al. Environmental genome shotgun sequencing of the sargasso
sea. Science, 304:66–74, 2004.
[86] D. Wang et al. Large-scale identification, mapping, and genotyping of single-
nucleotide polymorphisms in the human genome. Science, 280:1077–1082,
1998.
[87] M. S. Waterman. Introduction to computational biology. Chapman & Hall,
1995.
[88] M. S. Waterman, S. Tavare, and R. C. Deonier. Computational genome anal-
ysis. Springer, 2005.
[89] R. H. Waterston, E. S. Lander, and J. E. Sulston. On the sequencing of the
human genome. Proc. Natl. Acad. Sci. USA, 99:3712–3716, 2002.
[90] J. L. Weber and E. W. Myers. Human whole-genome shotgun sequencing.
Genome Res., 7:401–409, 1997.
[91] C. F. J. Wu. On the convergence properties of the EM algorithm. Annals of
Statistics, 11:95–103, 1983.
[92] M. Yannakakis. Node- and edge- deletion NP-complete problems. Xth ACM
symposium on Theory of Computing, pages 253–264, 1978.
159
Appendix A
Proof of Theorem 3.1
Because the derivation is quite lengthy. we include some explanatory remarks after
each step. Without loss of clarity, we use the same letter in different subscripts in
the same line. The scope of each subscript is restricted its domain. These domains
are separated by semicolons or can be identified from the context. According to
the definition of α(k), only the fragments in Θ(k) are relevant. Therefore we have
α(k+1;a
1
,a
2
;f
i
, i∈ Γ(k+1))
= Pr(X
ij
=x
ij
, j = 1,...,k, i∈ Θ(k+1);S
k+1,1
=a
1
,S
k+1,2
=a
2
;
F
i
=f
i
,i∈ Γ(k+1))
(Notice that Γ(k+1) =Λ
3
(k+1)
S
Λ
4
(k+1).
Next we sum over cases of Λ
1
(k+1) and Λ
2
(k+1))
=
X
b
1
,b
2
X
f
i
=1,2,i∈Λ
1
(k+1)∪Λ
2
(k+1)
{Pr(X
ij
=x
ij
, j = 1,...,k+1,i∈ Θ(k+1);
160
S
k,1
=b
1
,S
k,2
=b
2
;S
k+1,1
=a
1
,S
k+1,2
=a
2
;F
i
=f
i
,i∈ Ω(k+1))}
(Next apply the Markov structure)
=
X
b
1
,b
2
X
f
i
=1,2,i∈Λ
1
(k+1)∪Λ
2
(k+1)
{Pr(X
i,k+1
=x
i,k+1
, i∈ Ω(k+1)|S
k+1,1
=a
1
,S
k+1,2
=a
2
;F
i
=f
i
,i∈ Ω(k+1))
Pr(X
ij
=x
ij
, j = 1,...,k, i∈ Θ(k);S
k,1
=b
1
,S
k,2
=b
2
;S
k+1,1
=a
1
,S
k+1,2
=a
2
;
F
i
=f
i
,i∈ Ω(k+1))} (Notice that Γ(k)= Λ
2
(k+1)
S
Λ
4
(k+1))
=
X
b
1
,b
2
X
f
i
=1,2i∈Λ
1
(k+1)
X
f
i
=1,2,i∈Λ
2
(k+1)
{Pr(X
i,k+1
=x
i,k+1
, i∈ Ω(k+1)|S
k+1,1
=a
1
,S
k+1,2
=a
2
;F
i
=f
i
,i∈ Ω(k+1))
Pr(X
ij
=x
ij
, j = 1,...,k, i∈ Θ(k);S
k,1
=b
1
,S
k,2
=b
2
;S
k+1,1
=a
1
,S
k+1,2
=a
2
;
F
i
=f
i
,i∈ Γ(k);F
i
=f
i
,i∈ Λ
1
(k+1);F
i
=f
i
,i∈ Λ
3
(k+1))}
(Next factor some quantities by independence of fragments
and base compositions)
=
X
b
1
,b
2
X
f
i
i∈Λ
1
(k+1)
X
f
i
,i∈Λ
2
(k+1)
{[
Y
i∈Ω(k+1)
Pr(X
i,k+1
=x
i,k+1
|S
k+1,f
i
=a
f
i
)]
Pr(X
ij
=x
ij
, i∈ Θ(k),j =1,...,k;S
k,1
=b
1
,S
k,2
=b
2
;F
i
=f
i
,i∈ Γ(k))
Pr(S
k+1,1
=a
1
,S
k+1,2
=a
2
)Pr(F
i
=f
i
,i∈ Λ
1
(k+1))Pr(F
i
=f
i
,i∈ Λ
3
(k+1))}
(Notice that Ω(k+1)= Λ
1
(k+1)
S
Λ
2
(k+1)
S
Λ
3
(k+1)
S
Λ
4
(k+1)
and move common terms out of summation)
161
= Pr(S
k+1,1
=a
1
,S
k+1,2
=a
2
)Pr(F
i
=f
i
,i∈ Λ
3
(k+1))
[
Y
i∈Λ
3
(k+1)∪Λ
4
(k+1)
Pr(X
i,k+1
=x
i,k+1
|S
k+1,F
i
=a
f
i
)]
X
b
1
,b
2
X
f
i
,i∈Λ
2
(k+1)
{α(k;b
1
,b
2
;f
i
,i∈ Γ(k))[
Y
i∈Λ
2
(k+1)
Pr(X
i,k+1
=x
i,k+1
|S
k+1,F
i
=a
f
i
)]
[
X
f
i
,i∈Λ
1
(k+1)
Pr(F
i
=f
i
,i∈ Λ
1
(k+1))
Y
i∈Λ
1
(k+1)
Pr(X
i,k+1
=x
i,k+1
|S
k+1,F
i
=a
f
i
)]}
(The last line is [
Q
i∈Λ
1
(k+1)
[λ
1
Pr(x
i,k+1
|a
1
)+λ
2
Pr(x
i,k+1
|a
2
)])
= Pr(S
k+1,1
=a
1
,S
k+1,2
=a
2
)
2
Y
h=1
[λ
P
j∈Λ
3
(k+1)
1(F
j
=h)
h
][
Y
i∈Λ
3
(k+1)∪Λ
4
(k+1)
Pr(x
i,k+1
|a
f
i
)]
X
b
1
,b
2
X
f
i
,i∈Λ
2
(k+1)
{α(k;b
1
,b
2
;f
i
,i∈ Γ(k))[
Y
i∈Λ
2
(k+1)
Pr(x
i,k+1
|a
f
i
)]
[
Y
i∈Λ
1
(k+1)
[λ
1
Pr(x
i,k+1
|a
1
)+λ
2
Pr(x
i,k+1
|a
2
)]]}
(Move the terms relating to Λ
1
(k+1) out of summation
and rearrange summations)
= Pr(S
k+1,1
=a
1
,S
k+1,2
=a
2
)[
2
Y
h=1
λ
P
j∈Λ
3
(k+1)
1(F
j
=h)
h
]
[
Y
j∈Λ
1
(k+1)
[λ
1
Pr(x
j,k+1
|a
1
)+λ
2
Pr(x
j,k+1
|a
2
)][
Y
j∈Λ
3
(k+1)∪Λ
4
(k+1)
Pr(x
j,k+1
|a
f
j
)]
{
X
f
j
=1,2,j∈Λ
2
(k+1)
[
Y
j∈Λ
2
(k+1)
Pr(x
j,k+1
|a
f
j
)]
X
b
1
,b
2
α(k;b
1
,b
2
;f
j
, j∈ Γ(k))}.
162 
Asset Metadata
Creator Kim, Jong Hyun (author) 
Core Title Diploid genome reconstruction from shotgun sequencing 
Contributor Digitized by ProQuest (provenance) 
School Graduate School 
Degree Doctor of Philosophy 
Degree Program Computer Science 
Publisher University of Southern California (original), University of Southern California. Libraries (digital) 
Tag Biology, Bioinformatics,biology, molecular,Computer Science,OAI-PMH Harvest 
Language English
Advisor Waterman, Michael (committee chair), Li, Lei (committee member) 
Permanent Link (DOI) https://doi.org/10.25549/usctheses-c16-439406 
Unique identifier UC11336637 
Identifier 3237132.pdf (filename),usctheses-c16-439406 (legacy record id) 
Legacy Identifier 3237132.pdf 
Dmrecord 439406 
Document Type Dissertation 
Rights Kim, Jong Hyun 
Type texts
Source University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection) 
Access Conditions The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au... 
Repository Name University of Southern California Digital Library
Repository Location USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA
Tags
Biology, Bioinformatics
biology, molecular
Linked assets
University of Southern California Dissertations and Theses
doctype icon
University of Southern California Dissertations and Theses 
Action button