GENOMIC MAPPING: A STATISTICAL AND ALGORITHMIC ANALYSIS OF THE OPTICAL MAPPING SYSTEM

by
John Vu Nguyen

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)

May 2010

Copyright 2010 John Vu Nguyen

Dedication

For my parents

Acknowledgments

I am indebted to many people without whom this work would not have been possible. I would like to first give thanks to my advisor, Michael Waterman, for both his support and words of encouragement and advice that enabled me to complete my dissertation. I also want to thank David Schwartz, whose pioneering work on the optical mapping system forms the basis of this thesis, for allowing me the opportunity to collaborate with him as well as his mentorship, support, and advice. Several people in the Schwartz Lab helped me understand various aspects of the system: I would like to especially thank Shiguo Zhou for his help and patiently answering all of my questions. I would also like to thank Lei Li for stimulating discussions as well as his advice and suggestions.

This work would not have been possible without the support of the many friends and colleagues I have made here at the University of Southern California. They have helped me during my graduate school years in not only my work and studies, but my everyday life as well. I would like to thank my family who supported me during my graduate school days in so many ways. To my sister, thank you for always lending a hand even when I thought I didn't need it. And finally, to my parents, whose love and support makes this work as much theirs as it is mine.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
1 Overview of Optical Mapping
  1.1 Background
  1.2 Optical Mapping System
    1.2.1 History
    1.2.2 Data Collection
    1.2.3 Image Analysis
  1.3 Goals and Challenges
    1.3.1 Optical Map Data
    1.3.2 Map Alignment
    1.3.3 Map Assembly
  1.4 Applications
  1.5 Outline
2 Modeling Optical Mapping Data
  2.1 Statistical Models for Optical Mapping
    2.1.1 Stochastic Model
    2.1.2 Error Models
      Model Assumptions
  2.2 Optical Map Content Analysis
    2.2.1 Region Matching
    2.2.2 Region Matching with Sizing Errors
    2.2.3 Region Matching with Missing and False Cuts
    2.2.4 Region Matching with Sizing Errors, Missing and False Cuts
    2.2.5 Conclusion
3 Alignment of Optical Maps
  3.1 Map Alignment
    3.1.1 Introduction
    3.1.2 Optical Match Score Function
      Size Comparison
      Fragment Size Estimation
      D-size Statistic
      New Score Function
      Parameter Settings for De Novo Assembly
    3.1.3 Random Forest Classification
      Results
    3.1.4 Conclusion
  3.2 FPC Map Alignment
    3.2.1 Introduction
    3.2.2 Methods
    3.2.3 Algorithm Overview
      Fragment Matching
      Clone Locations
      FPC Alignment
      Anchored Clones
    3.2.4 Statistical Properties of Alignments
    3.2.5 Results
      Datasets
      Parameters
      Ctg-182 Validation
    3.2.6 Aligned Optical Maps
    3.2.7 Alignment Visualization
    3.2.8 FPC Gap Distances
    3.2.9 Conclusion
4 Assembly of Optical Maps
  4.1 De Novo Assembly
    4.1.1 Introduction
    4.1.2 Methods
      Overlap Computation
      Layout Graph Construction
      Graph Transformations
      Initial Contig Construction
    4.1.3 Results
    4.1.4 Conclusion
5 Future Work
Bibliography

List of Tables

2.1 In silico restriction fragment lengths for human genome
3.1 Comparison of estimators ŷ
3.2 Confusion Matrix for Alignment Methods
4.1 Assembled datasets
4.2 Assembly overlap statistics
4.3 Overlap graph edge statistics
4.4 Contig construction statistics
4.5 E. coli genome assemblies with score cutoff thresholds

List of Figures

1.1 Overview of genome mapping
1.2 Traditional restriction mapping
1.3 Optical mapping system overview
1.4 Sample optical mapping data
1.5 The map alignment problem
1.6 The map assembly problem
2.1 Density plot of NCBI Build 36 of the human genome
2.2 E. coli K12 restriction fragment lengths
2.3 Simulated genome for single fragment matching
2.4 Random fragment matching based on p_α and σ
2.5 Simulated genome with fragment matching based on p_α(σ, λ)
2.6 "Declumped" matching based on p_α(σ, λ)
3.1 Simulated distribution of Z_1 = |X_1 − X_2| under alternative hypothesis
3.2 QQ-plots of distribution of D-size statistic
3.3 QQ-plot of null distribution of score function
3.4 α parameter under null and alternative distribution of D-size statistic
3.5 Optimal parameters for De Novo Assembly
3.6 Performance comparison of linear versus likelihood scoring function
3.7 Likelihood and linear score function positive predictive value
3.8 Likelihood score function with upper bound
3.9 Random forest training with class weights and cutoff values
3.10 Random forest classification performance
3.11 Positive Predictive Value for Cutoff Probabilities
3.12 Diagram of arrangement of FPC map
3.13 Determining clone locations and initial edge construction of match graph
3.14 Final match graph and match graph traversal
3.15 An alignment between an optical map and fingerprint contigs
3.16 Histogram of fragment sizes from optical maps and sequence contigs
3.17 Histogram of sequence contigs per clone according to sequence status
3.18 Number of fragments
3.19 Max multi-fragment sequence contig
3.20 Alignment image visualization
3.21 Histogram of FPC gap distances
4.1 Bi-directed graph construction
4.2 Edge distance calculation
4.3 Transitive edge reduction
4.4 Assembly coverage for Sim genome
4.5 Histogram of fragment sizes from unassembled regions of Sim

Abstract

The ability to conduct whole genome analysis of variation is slowly becoming a reality with both improvements in biotechnology and advancements in data analysis. However, large-scale de novo sequencing still remains a formidable task for complex plant and mammalian genomes.
Although not providing resolution at the sequence nucleotide level, physical maps convey useful information that can be leveraged to discover biological events whose discovery is not possible with sequencing technologies. Optical mapping, a novel restriction mapping technology, is able to produce complete genome-wide physical maps both quickly and cheaply. These maps not only serve as invaluable aids for de novo sequencing, but can also be used directly to make valuable inferences regarding the underlying genome itself. However, in order for optical mapping to be useful as a tool for genomic analysis, both computational and statistical questions must be addressed. In this thesis, we explore some of the issues involved with analyzing optical mapping data. Specifically, we explore various statistical models and their implications for optical mapping data. We also develop a new scoring function for the alignment of optical maps using dynamic programming. A strategy for comparing optical mapping data against a clone-based sequencing strategy for a genome is examined. Finally, methods for assembling optical mapping data into a complete genome-wide physical map are presented.

Chapter 1
Overview of Optical Mapping

1.1 Background

The completion of the Human Genome Project (Collins et al., 2003) and the subsequent availability of the entire genomic landscape have opened up new areas of research in both science and medicine. Although much is yet to be discovered, having the full genomic sequence accessible is of vital importance for understanding the underlying mechanisms of biology. Any two individuals are largely identical in DNA sequence; however, it is the differences that contribute to the diverse set of phenotypes specific to both normal and disease biology. These variations manifest themselves as insertions, deletions, single nucleotide polymorphisms (SNPs), rearrangements, and copy number changes. Ascertaining the degree of variation and how it affects an individual is a central problem in the field of genomics.
The most straightforward approach to investigating differences between genomes would be to sequence each one directly. There is ongoing research into different types of technologies that will eventually make this method feasible (Metzker, 2005).

Whole Genome Mapping: Given a completely sequenced genome, a host of analyses can be conducted yielding insight into the global organization, expression, regulation, and evolution of the target organism. Although producing the full sequence is desirable, both technological and financial hurdles remain for sequencing organisms with large, complex genomes. Whole genome mapping is the creation of molecular views of a genome at different levels of resolution. A high resolution map of the genome would consist of the complete DNA sequence of an organism. Even without the full sequence, other types of maps are possible that can be used to localize genes and other features of interest. These lower resolution maps are based on other well-defined markers within the DNA sequence that can be determined and positioned along the genome.

Genetic Mapping: In the early days of genomic research, geneticists used low resolution genetic maps to obtain an initial rough view of the genomic content of an organism. Some of the first low resolution maps constructed were linkage maps created by analyzing the recombination frequency of crossovers from experimental data. The resulting map linked specific genes together into linkage groups showing their approximate locations relative to each other. Although successful for many genomes, the techniques involved were dependent upon organisms that could be easily manipulated and mated to produce informative test crosses providing enough statistical power to reliably determine the linkage groups. Furthermore, distances between genes are measured indirectly and are given in terms of genetic distance related to the recombination frequency, which can be inaccurate for genes physically located far from each other.

This led to the usage of differential markers other than the genes themselves to provide a higher resolution map.
The markers chosen were those that were more abundantly distributed within the genome and consisted of distinguishing features that readily identify their presence. Markers such as restriction fragment length polymorphisms (RFLPs), simple-sequence length polymorphisms (SSLPs), variable number tandem repeats (VNTRs), and randomly amplified polymorphic DNA (RAPD) have been utilized. Although these different markers offer varying levels of resolution, genetic maps can only offer relative positions of the markers since recombination frequencies are used to determine their location.

Physical Mapping: Physical maps, on the other hand, are low resolution maps that contain the actual spacing of the markers of interest in terms of basepairs along the genome. They are maps of physically isolated pieces of the genome, where a complete physical map consists of a series of maps for each chromosome in the haploid chromosome set of an organism. Although not providing resolution at the nucleotide level, physical maps have been leveraged to serve as an initial blueprint of an organism's sequence since they can locate genes and other DNA markers and the actual basepair distances between them.

Physical maps, similar to genetic maps, are constructed by identifying the locations of certain markers along the genome. These markers are often short, well-defined nucleotide sequences, and the resulting physical map contains the distance in basepairs between successive marker positions. Genomic variations reveal themselves by affecting the presence or absence of markers or by altering the distances and orientations between adjacent markers. Although these changes do not fully encompass the wide range of structural variation that can occur, physical maps are able to detect events such as indels and translocations as well as other chromosomal aberrations not easily discovered by other technologies.

Restriction Maps: Restriction maps are a specific type of physical map that involves the use of restriction enzymes.
Restriction enzymes cut double-stranded DNA at specific recognition sequences through a process called restriction digestion. The recognition sequences along the genome are called restriction sites, and the fragment produced between two adjacent restriction sites that are cut is called a restriction fragment. For example, the restriction enzyme EcoRI recognizes the sequence 5′-GAATTC-3′ and cleaves the DNA between the G and the adjacent A (the 1st and 2nd nucleotides). Different enzymes have different recognition sequences and thus provide different restriction maps. Restriction enzymes can also be sensitive to chemical modifications such as methylation occurring on DNA molecules. Methylation involves the addition of a methyl group to bases along double-stranded DNA, which can impede the ability of the restriction enzyme to cleave restriction sites.

[Figure 1.1 labels: molecular markers 1 to 3, genes, cloned fragments, and a sequenced segment (ACTGGAACTGCATGATACATACAT), arranged by level of resolution from genetic mapping through physical mapping to DNA sequencing.]

Figure 1.1: An overview of genomic mapping where we have: a genetic low-resolution map showing the locations of markers as well as genes, a physical map created by cloning the indicated region and positioning the cloned fragments using the indicated markers drawn as vertical lines, and a segment of a cloned fragment that is sequenced.

After a digest has been performed, the restriction fragments are measured to determine the distance between restriction sites. In early work with restriction maps using gel-based restriction mapping methods, the fragments were measured using agarose gel electrophoresis. However, the order of the fragments is not known, so digests must be performed with different enzymes both singly and in combination. By comparing the same genomic region digested with the enzymes, the order of the fragments can be inferred. Reconstructing the underlying map from this type of data gives rise to the double digest and partial digest problems. These problems were shown not only to be computationally hard (Goldstein and Waterman, 1987), but also not always to yield a unique solution (Schmitt and Waterman, 1991).

[Figure 1.2 shows a DNA sequence containing three SwaI sites (ATTTAAAT), an agarose gel, and the resulting unordered restriction fragments, labeled 12.4 Kb, 4.8 Kb, 9.2 Kb, and 8.6 Kb.]

Figure 1.2: Restriction maps are produced by: (1) Initially, digesting the DNA with a particular restriction enzyme. In the example above we use SwaI with recognition sequence 5′-ATTTAAAT-3′, which produces 4 restriction fragments. (2) The fragments are run through an agarose gel through which smaller fragments move faster and migrate farther than larger ones. Using sizing standards, the migration distance is translated into physical sizes in terms of basepairs as indicated by the sizes of the fragments on the right. In traditional restriction mapping, the order of the fragments is unknown.

[Figure 1.3 labels: DNA, glass surface, image processing, data analysis.]

Figure 1.3: An overview of the optical mapping system: (1) DNA is extracted directly from cells. (2) The DNA is mounted onto glass surfaces where it is elongated, digested, stained, and imaged. (3) Image processing software determines the sizes of the restriction fragments on each of the surfaces. (4) Subsequent data analysis is carried out on the set of optical maps to reveal discoveries about the underlying genome.

Optical Mapping: Optical mapping (Schwartz et al., 1993) is a restriction mapping system that utilizes fluorescence microscopy to directly image single-molecule DNA substrates. The novelty of the technique is the ability to produce ordered restriction maps where restriction sites appear in the order they would appear on the genome itself. This distinguishes optical mapping from traditional gel-based restriction mapping systems, which only provide estimates of unordered restriction fragment lengths. While lacking the nucleotide resolution afforded by sequencing technologies, optical mapping avoids the need for amplification, cloning, electrophoresis, or hybridization associated with these methods.
Data can be collected and generated both cheaply and efficiently at a fraction of the cost of complete genome sequencing.

1.2 Optical Mapping System

1.2.1 History

Schwartz and Koval (1989) first studied the conformational dynamics of DNA molecules by observing their movement through gel electrophoresis. This led to the development of the first optical mapping system, used to construct ordered restriction maps of Saccharomyces cerevisiae (Schwartz et al., 1993). The original system used fluid flow to stretch out DNA molecules dissolved in molten agarose. A restriction enzyme was added to the molten agarose-DNA mixture, and fluorescence microscopy coupled with digital image processing techniques was used to record cleavage sites. Further improvements to the system led to the use of derivatized glass surfaces to mount the target DNA (Cai et al., 1995). Surface mounting increased the capabilities of the system by making the molecules easier to detect during the imaging process. Initially, the elongation of the DNA molecule occurred within fluid flow developed within a drying droplet on the glass surface (Jing et al., 1998). This approach was suitable for smaller amounts of DNA but did not work as well for whole genomic DNA molecules that could span entire chromosomes. To overcome this challenge, simple capillary action was used by adding the DNA solution to glass surfaces sandwiched together (Lai et al., 1999).

1.2.2 Data Collection

In its current incarnation, optical mapping uses a microfluidic system to manipulate the DNA molecules (Dimalanta et al., 2004). The devices are manufactured by first etching a channel mask into a silicon wafer master. The etched pattern consists of sets of 10 parallel channels (100 µm width, 10 mm length) with triangular entrances and exits. Microchannels are formed by molding PDMS (poly(dimethylsiloxane)) and curing it on the master silicon wafer. The ends of the microchannels are trimmed to generate open-ended channels to facilitate capillary action.
Positively charged glass cover slips are derivatized (chemically prepared to facilitate DNA interaction) and rigorously cleaned to remove any contaminants. The PDMS microchannels are combined with the glass cover slips to form channels that allow for the loading of DNA substrates. DNA is extracted directly from the cells of a target organism and fragmented into pieces of around 500 Kb by random shearing. Microliter quantities of DNA-containing solution are placed at one end of the microchannel, where capillary action draws them through the channel. The process of loading the microchannel and the resulting capillary action creates a flow of buffer that allows the DNA molecules to elongate and linearize while affixed to the glass surface.

After the DNA has been deposited and elongated by the fluid flow, an enzymatic digestion buffer containing the restriction enzyme is added and incubated to allow for cleavage of the DNA. After complete digestion has occurred, the glass surfaces are stained with a solution containing fluorochromes (YOYO-1: 200 nM in 20% β-mercaptoethanol in 1x TE) and sealed onto the microscope slide with nail polish to prevent the sample from drying while imaging. The imaging process uses a laser-illuminated microscope equipped with a high-magnification objective and a charge-coupled device (CCD) digital camera to acquire images of the DNA along the microchannels. Consecutive images of each channel overlap by ∼20% to ensure that DNA molecules that are larger than one image or that span the boundary between images are correctly captured. Approximately 120 images are collected per microchannel, with 10 microchannels per optical mapping surface. This yields ∼1200 images per surface, which can be collected in less than 1 hour given the high-intensity laser illumination and high-speed CCD camera.

1.2.3 Image Analysis

After the images are collected for an optical mapping experiment, they must be analyzed to find the locations of the DNA molecules and cleavage sites.
Due to the adherence of the DNA molecules to the glass mount, the stretched DNA molecule pulls back at the ends of each cleavage site once it is cut. Cleavage sites become visible as tiny gaps between DNA fragments, which appear as lines of fluorescence within the image. The image analysis involves two distinct stages. The first stage assembles the individual images into a collage and corrects for the microscopic illumination profile and for camera translation, rotation, and scale differences to prepare the images for the next stage. The second stage processes the image set to detect the DNA molecules within the image along with the cleavage sites. These steps are further described below.

Image Preparation: The images are processed to remove imaging artifacts and to prepare them for the subsequent steps of marking the locations of the DNA molecules and cleavage sites. Each raw image contains errors due to an uneven illumination profile, where objects in the center of the image are brighter than those at the edges. This is caused by scattered light from the glass surface during the imaging process. The background is removed by interpolating the image, filtering out any high frequency components, and subtracting the interpolated image from the raw source image. The image is "flattened" by dividing by the normalized illumination profile image taken at the beginning of each collection.

The cameras used produce individual images of different regions of the microchannel given the resolution needed to detect the molecules. These images are combined into a larger super image of the entire microchannel by detecting overlap between them. Since the cameras used might not be aligned exactly with each other, differences in angle of orientation, magnification factor, and image displacement must be accounted for. Scale and rotation adjustments are calculated from a reference set of images and then applied to the experimental images collected.
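The background subtraction and flattening described under Image Preparation can be illustrated with a small sketch. It is only a toy version under stated assumptions: images are plain 2-D lists of floats, a background estimate is already available (the interpolation and filtering step is not reproduced), and the function names `normalize_profile` and `flatten_image` are hypothetical, not taken from the optical mapping software.

```python
def normalize_profile(profile):
    """Scale an illumination profile image so that its mean pixel value is 1."""
    pixels = [p for row in profile for p in row]
    mean = sum(pixels) / len(pixels)
    return [[p / mean for p in row] for row in profile]


def flatten_image(raw, background, profile):
    """Subtract the background estimate, then divide by the normalized profile."""
    norm = normalize_profile(profile)
    return [
        [(r - b) / n for r, b, n in zip(raw_row, bg_row, norm_row)]
        for raw_row, bg_row, norm_row in zip(raw, background, norm)
    ]


# Toy data (assumed): a uniform signal of 100 imaged under lighting that is
# brighter in the centre than at the edges, plus a constant background of 10.
profile = [[0.5, 1.0, 0.5], [1.0, 2.0, 1.0], [0.5, 1.0, 0.5]]
background = [[10.0] * 3 for _ in range(3)]
mean_profile = sum(sum(row) for row in profile) / 9
raw = [[100.0 * p / mean_profile + 10.0 for p in row] for row in profile]

flat = flatten_image(raw, background, profile)
```

In this toy case every pixel of `flat` comes back to the same value, which is the point of flattening: intensity differences that remain after correction reflect the amount of stained DNA rather than position within the field of view.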
Molecule Detection: After a super image has been created, the molecules and their fluorescence intensities are determined using customized software. The image is initially segmented into regions showing similar intensity values. The segmented regions are associated with fragment chains. Fragment chains are regions corresponding to straight line segments of high intensity values that represent a genomic DNA molecule. An intensity profile is generated along the fragment chain. The profile is analyzed to detect dips that correspond to restriction sites. The intensity profile is compared against sizing markers to generate a measurement in basepairs between the dips, which delimit the restriction fragments. The sizing markers consist of genomic molecules of known size and cut pattern that are deposited on the microchannel along with the target genome. Fragment sizes are calculated according to the following formula:

\[
\text{fragment (bp)} = \text{standard (bp)} \times \frac{\text{fragment integrated intensity}}{\text{standard fluorescence intensity}}
\]

The final output is the calculated sizes of the restriction fragments within each detected molecule, along with the intensity profile and other measurements associated with the image processing step.

1.3 Goals and Challenges

There are many statistical and computational aspects involved in analyzing optical mapping data. In a typical optical mapping experiment, a collection of optical maps representing random regions of the target genome is obtained. The main goal is to make inferences about the underlying restriction map based on the acquired data. Due to errors in optical mapping measurements, statistical models must be developed to analyze optical mapping data. Two of the main computational challenges mirror those of sequencing data, namely the issues of alignment and assembly of optical maps. Below, we give an overview of these fundamental statistical and computational tasks.

1.3.1 Optical Map Data

Sample Data: Optical mapping data are generally regarded as random snapshots obtained from the underlying genome.
Their locations are assumed to be uniformly distributed along the genome. The output of an optical mapping experiment is the estimated fragment sizes in the order they are detected along a molecule. Figure 1.4 shows a sample optical mapping dataset for a typical experiment, which consists of the maps and their measured fragment sizes as a simple text file.

  omdb:99615872:1321856_0_1
    BsiWI B 7.201 122.146 71.063 9.528 3.522 22.226 23.07 47.572 9.767 6.0 16.068

  omdb:99615873:1321856_0_2
    BsiWI B 16.405 17.07 22.372 13.318 5.399 15.472 30.381 14.879 1.727 10.004 2.193 11.402 16.92 9.253 10.679 6.432 54.304 18.461 54.206 40.958 22.341 12.141

  ...

Figure 1.4: The text file contains a map ID used to identify each collected map. The measured fragment sizes are specified in kilobases (Kb). The first two fields after the map ID contain the type of restriction enzyme used and its acronym.

Associated Errors: Unlike sequence reads, which are obtained as averages over many copies of a cloned region of DNA, optical maps consist of measurements made on individual DNA molecules derived directly from genomic DNA. Although providing a more straightforward glimpse at the underlying target genome, this also increases the amount of noise associated with optical mapping. These errors include:

• Non-uniform uptake of the fluorescence along the DNA molecule results in sizing errors when using fluorescent intensities. Imaging artifacts and balling effects at the ends of DNA fragments also contribute to inaccurate sizes when converting from fluorescent intensities to basepairs.

• Not all true restriction sites are observed; missing cuts occur due to imperfect activity of the restriction enzyme.

• Random breakage of DNA may cause false cuts to be detected as restriction sites within a map.

• Small fragments (≤ 2 Kb) do not adhere to the glass surface and desorb, so they are underrepresented in optical maps. It is possible that these fragments re-attach themselves to nearby fragments and potentially cause overestimation of the attached fragment sizes.
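To make the four error sources listed above concrete, the following sketch applies each of them in turn to a list of true fragment sizes. It is a toy simulation, not one of the statistical models developed in Chapter 2: the function name `simulate_optical_map`, the Gaussian sizing error with standard deviation proportional to the square root of the fragment size, and all default parameter values are assumptions chosen for illustration.

```python
import random


def simulate_optical_map(true_fragments, digest_rate=0.8, false_cut_rate=0.005,
                         sigma=0.3, small_cutoff=2.0, desorb_prob=0.5, rng=None):
    """Return noisy fragment sizes (Kb) for one molecule.

    true_fragments must be a non-empty list of sizes in Kb. Every default
    value here is an illustrative assumption, not a measured quantity.
    """
    rng = rng or random.Random()

    # 1. Missing cuts: each true internal cut is observed with probability
    #    digest_rate; an unobserved cut merges the two neighbouring fragments.
    merged = [true_fragments[0]]
    for size in true_fragments[1:]:
        if rng.random() < digest_rate:
            merged.append(size)
        else:
            merged[-1] += size

    # 2. False cuts: a fragment is broken once, at a uniform position, with a
    #    probability that grows with its length (false_cut_rate per Kb).
    with_false = []
    for size in merged:
        if rng.random() < min(1.0, false_cut_rate * size):
            cut = rng.uniform(0.0, size)
            with_false.extend([cut, size - cut])
        else:
            with_false.append(size)

    # 3. Desorption: fragments at or below small_cutoff are lost with
    #    probability desorb_prob (re-attachment is not modelled).
    # 4. Sizing error: Gaussian noise with standard deviation sigma * sqrt(size).
    observed = []
    for size in with_false:
        if size <= small_cutoff and rng.random() < desorb_prob:
            continue
        noisy = rng.gauss(size, sigma * size ** 0.5)
        observed.append(round(max(noisy, 0.1), 3))
    return observed


# Example: the four fragment sizes from Figure 1.2, with a fixed seed.
noisy_map = simulate_optical_map([12.4, 4.8, 9.2, 8.6], rng=random.Random(1))
```

Setting `digest_rate=1.0`, `false_cut_rate=0.0`, `sigma=0.0`, and `desorb_prob=0.0` switches every error source off and returns the true map unchanged, which is a convenient sanity check before varying one parameter at a time.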
All these sources of errors are confounded with image processing errors. Additionally, the image processing step may cause some maps to be chimeric, where two unrelated regions of the genome are erroneously marked up as one map because they overlap within the image. In Chapter 2, we develop statistical models addressing these errors and their implications for optical mapping as a tool for genomic analysis.

1.3.2 Map Alignment

The goal of alignment is to detect association or overlap between two or more restriction maps, allowing for their comparison. Algorithmically, aligning restriction maps resembles sequence alignment: algorithms such as Needleman-Wunsch and Smith-Waterman have been adapted by various authors to work with restriction maps (Waterman et al., 1984; Huang and Waterman, 1992; Myers and Huang, 1992). In sequence alignment the goal is to align nucleotide bases. Maps contain both restriction sites and the intersite distances between them. Map alignment consists of aligning restriction sites while taking into account the intersite distances. The main issue for these dynamic-programming-based algorithms is choosing a suitable scoring function that measures the quality of a potential alignment and assigns it a numerical value. Waterman et al. (1984) presented a linear scoring function for restriction map comparison, while Valouev et al. (2006b) developed a log-likelihood-based scoring function in the context of optical mapping and its associated errors.

Alignment Types: Various types of alignments are possible for restriction maps. We focus on two types of global alignment that are particularly useful in the context of optical mapping: overlap alignment and fit alignment. In overlap alignment, the suffix of one map is aligned to a prefix of another map. Fit alignment is an alignment for a map that is completely contained in another, usually much larger, map. The type of alignment desired is dependent upon the application.
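As a rough illustration of fit alignment by dynamic programming, the sketch below places a short query map inside a longer reference map. The scoring is a deliberately simple stand-in and is an assumption of this example: each aligned block is penalized by the absolute size difference plus a fixed `site_penalty` for every unmatched restriction site inside the block. It is not the linear function of Waterman et al. (1984), the likelihood score of Valouev et al. (2006b), or the score function developed in Chapter 3, and it treats the end fragments of the query as complete fragments. The name `fit_alignment` and its parameters are hypothetical.

```python
def fit_alignment(query, reference, max_merge=3, site_penalty=2.0):
    """Best placement of `query` inside `reference` (fragment sizes in Kb).

    Returns (score, start, end), meaning the whole query is aligned to
    reference[start:end]. Up to max_merge consecutive fragments on either
    map may be grouped into one block to absorb missing or false cuts.
    """
    m, n = len(query), len(reference)
    neg_inf = float("-inf")
    # score[i][j]: best score for aligning query[:i] so that it ends exactly
    # at the boundary after reference fragment j - 1.
    score = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    start = [[None] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        score[0][j] = 0.0  # the query may begin at any reference site for free
        start[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            for a in range(1, min(max_merge, i) + 1):
                for b in range(1, min(max_merge, j) + 1):
                    prev = score[i - a][j - b]
                    if prev == neg_inf:
                        continue
                    size_diff = abs(sum(query[i - a:i]) - sum(reference[j - b:j]))
                    cand = prev - size_diff - site_penalty * (a + b - 2)
                    if cand > score[i][j]:
                        score[i][j] = cand
                        start[i][j] = start[i - a][j - b]
    # The query must be used completely, but it may end at any reference site.
    end = max(range(n + 1), key=lambda j: score[m][j])
    return score[m][end], start[m][end], end


# Toy example: the middle query fragment spans two reference fragments
# (4.8 + 9.2 Kb), as would happen with one missing cut.
reference_map = [7.2, 12.4, 4.8, 9.2, 8.6, 30.1]
query_map = [12.0, 14.3, 8.5]
best_score, ref_start, ref_end = fit_alignment(query_map, reference_map)
```

With these inputs the query is placed on `reference_map[1:5]`. An overlap alignment would differ only in its boundary conditions: a prefix of one map and a suffix of the other would be left unaligned at no cost, rather than requiring the query to be fully contained.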
Map Types: It is also worth mentioning the types of maps used in the analysis of optical mapping data. Three types of maps are possible: individual optical maps from an experimental collection, reference maps derived from in silico digest of sequence data, and consensus maps derived from merging multiple optical maps. Reference maps are obtained from sequence data as a list of ordered fragments that theoretically would be produced if the corresponding DNA was experimentally digested. As such, they are considered free of the errors associated with optical mapping since they are produced directly from DNA sequence. Consensus maps, on the other hand, are usually the result of the assembly of optical maps with the information contained within them averaged over individual optical maps. Thus, they are not totally free of errors but usually yield more accurate information regarding the underlying restriction map since they are composed of multiple optical maps. Comparing an optical map against a reference map involves different assumptions since the reference map can be assumed to be accurate and generally error free. However, when comparing an optical map against another optical map, both contain errors that must be accounted for.

In Chapter 3, we discuss the problem of map alignment and focus on overlap alignment between a pair of maps. We develop a new scoring function for aligning maps and compare it to previous scoring functions.

Figure 1.5: The map alignment problem deals with finding a correspondence between the restriction fragments from a pair of maps. In the above picture, a possible alignment between a pair of maps is given with the lines drawn between the two maps indicating the aligned region.

1.3.3 Map Assembly

The map assembly problem involves the reconstruction of the entire genome from optical mapping data. This is analogous to sequence assembly where, instead of assembling sequence reads, restriction maps are assembled to produce a consensus map representing the entire genome.
The genome wide restriction map details the locations of all the restriction sites and the distances between them. In order to assemble optical mapping data, associations between the optical maps must be found. Anantharaman et al. (1999) designed the first map assembler using a Bayesian approach where a prior model for the unknown restriction map and a conditional distribution for optical maps given the true map are used to derive the posterior density for a hypothesized map. Comparisons between optical maps were found using a dynamic programming approach where a greedy algorithm successively merged together maps to produce a final consensus map. Valouev et al. (2006c) utilized a different approach by first computing all the pairwise alignments between all maps for a given dataset. An overlap graph encoding the maps and their alignments is constructed and, after applying graph correction procedures to remove erroneous maps and alignments, a path through the graph is extracted representing the consensus map.

We distinguish between two types of assembly termed reference assembly and de novo assembly. In reference assembly, a reference map for the target organism is available to guide the assembly of the consensus map. Although the consensus map might have differences from the reference map, most regions are similar so that the reference map can effectively "seed" the assembly by initially grouping together maps that originate from the same region on the reference map. Maps that differ from the reference map bridge together "seed" regions. De novo assembly is when there is no reference map available and thus the maps must be assembled to produce a consensus map with no prior information regarding the structure of the genome. De novo assembly is inherently more difficult than reference assembly. In Chapter 4, we examine the problem of map assembly and present methods for de novo assembly.

1.4 Applications

The overall application of optical mapping is to make inferences about the underlying restriction map.
As such, changes to the underlying restriction map are reflected in optical mapping data and can be used to study the genome itself. Some applications of optical mapping for genomic analysis are presented below.

Genomic Variation: Optical mapping can be used in structural variation studies where a reference genome is available for comparison (Sebat et al., 2004; Tuzun et al., 2005). In this context, optical mapping can quickly be used to identify a wide assortment of genomic variation in terms of insertions and deletions of sequences, single-nucleotide polymorphisms, and point mutations. Regions that contain insertions and deletions can be determined through differing sizes of restriction fragments when compared to a reference genome. Mutations manifest themselves through new or missing restriction sites that also contribute to altered restriction fragment sizes.

Figure 1.6: The map assembly problem involves the reconstruction of a genome given a dataset of optical maps. In the above picture, we show a portion of an assembled genome using an application called Genspect. The consensus map is represented at the top in blue while the underlying maps used to assemble a particular region are presented below.

Zhou et al. (2004) utilized optical mapping for genomic comparisons across different bacterial strains represented by several species. Microbial genomes exhibit extensive intraspecific variations where different strains or types within the same species can vary by as much as 20% in gene content (Lan and Reeves, 2000). By detecting map similarities by segmenting whole-genome restriction maps into overlapping map sections and aligning against reference maps, variant loci were discovered to exist between different strains of bacteria. The alignments revealed chromosomal differences in terms of insertions, translocations, inversions, and deletions indicative of genomic rearrangements.
These genomic differences demonstrated the ability of optical mapping to discover structural variations at the genome-wide level that can then be further investigated using PCR-based techniques to sequence and identify small novel insertions or confirm experimental results.

Valouev (2006) further developed statistical methods for comparing optical maps to identify regions consisting of genomic variations using a different approach. Variations were identified using a three-step procedure consisting of: (i) assembling optical maps from a target organism into contigs, (ii) aligning consensus maps to in silico maps to determine genomic positions of assembled consensus maps and screening for potential variation loci, and (iii) analyzing regions of interest using statistical tests to evaluate evidence of variation. These methods were applied to two human cancer cell lines that were compared against a reference human genome to discover somatic DNA aberrations present in many types of cancers. Various insertions and deletions as well as point mutations were uncovered, with select differences verified by PCR-based techniques confirming that they were indeed abnormalities caused by somatic changes within the aberrant cells.

Sequence Validation: Besides being useful for genomic analysis of variation, optical mapping is also well-suited for sequence validation for large-scale sequencing efforts (Aston et al., 1999). Physical maps serve to guide sequence assembly, characterize gaps, and confirm finished sequences. They are particularly useful for assembling genomic regions containing repeats since cleavage patterns can accurately identify such elements. The costly ending stages of a sequencing project benefit from the ability of optical maps to span sequencing gaps and validate finished sequences.

Genome-wide restriction maps can be used in sequence assembly validation by performing simulated restriction digests of sequence contigs and comparing them against the restriction maps. In this manner, repeat regions that are problematic for sequence assembly are significantly easier to resolve.
Contig order and orientation can be determined and validated, reducing finishing times by identifying faulty regions of sequence that can be targeted for re-sequencing and re-assembly. Physical maps provide an initial scaffold that sequence data can be anchored against and verified.

Various groups (Lai et al., 1999; Lim et al., 2001; Lin et al., 1999; Reslewic et al., 2005; Zhou et al., 2002, 2003) used optical mapping to characterize assembled sequence contigs and guide sequence assembly efforts. Cai et al. (1998) demonstrated the feasibility of using optical mapping on BAC clones, showing its potential in complementing and advancing large-scale sequencing projects. Their experiments demonstrated that high-resolution maps of BAC clones used extensively in sequencing projects were possible and could be used to provide a scaffold for anchoring sequence contigs. From this, Giacalone et al. (2000) employed optical mapping to analyze BAC clones derived from the human Y chromosome DAZ locus, a highly-repetitive genomic region containing multiple duplications. Optical maps of the DAZ gene complex yielded new insight and analysis not possible with conventional sequencing techniques due to the high degree of sequence similarity among the repeated segments within the region.

Antoniotti et al. (2001) describe a system for the validation of DNA sequences against physical maps. Their system employed a dynamic programming algorithm for aligning consensus maps against sequence derived maps based upon a maximum likelihood formulation of the problem that takes into account the errors associated with optical mapping data. Using their system, they checked various optical maps against the publicly available sequence data for Plasmodium falciparum, showing the viability of assessing the quality of sequence data using optical maps.

De Novo Assembly: One of the more interesting applications of optical mapping is the construction of genome-wide physical maps completely de novo.
This process is similar to shotgun sequence assembly with optical maps consisting of restriction fragments replacing sequence data. Ordered restriction maps enable a quick genome-wide analysis that can reveal structural detail for a fraction of the cost of full sequencing. Recent findings indicate that there is prevalent structural variation in human populations (Stefansson et al., 2005; Tuzun et al., 2005) with several loci associated with different types of diseases. Optical mapping facilitates these types of studies by offering a quick and cost effective method for conducting genome-wide investigations to find such variations.

Cancer genomes are also known to contain widespread aneuploidy and genetic aberrations due to genomic instability. The ability to detect these abnormalities will allow for the development of new diagnostics and treatment options (Tomlins et al., 2005). Current techniques for discovering these genetic alterations are limited by the prohibitive cost of using full sequencing and deficiencies in DNA hybridization based technologies (McCarroll et al., 2006; Sebat et al., 2004). Thus, optical mapping offers a viable method for uncovering and characterizing these structural events from a genome-wide point of view.

De novo optical map assembly has successfully been applied to moderately sized bacterial and viral genomes (Lai et al., 1999; Lim et al., 2001; Lin et al., 1999; Reslewic et al., 2005; Zhou et al., 2002, 2003). These genome-wide maps served to facilitate sequence assembly by assisting in the closure of sequence gaps as well as contig validation. Valouev et al. (2006c) assembled larger eukaryotic genomes consisting of Thalassiosira pseudonana, Oryza sativa ssp. japonica, and Homo sapiens. Comparisons of the optical map assembly of Oryza sativa ssp. japonica against available sequence data showed 91% alignment between contig maps and sequence maps, with the vast majority of unaligned regions due to small fragments not present in consensus maps.
These assemblies established optical mapping as a practical tool for de novo map assembly suitable for a variety of applications.

1.5 Outline

Optical mapping is a fast, inexpensive, single molecule system capable of producing whole genome restriction maps. However, both computational and statistical challenges must be addressed in order for optical mapping to be effectively used for the study of biology and genetics. In the following chapters, we study the various features of optical mapping and offer improvements on previous methods developed for analyzing optical mapping. In Chapter 2, we examine the various features of optical mapping and its associated errors and discuss the limitations of the system. We study the ability of optical mapping as a tool for genomic analysis using simulation experiments. Our main contribution is in Chapter 3 where we present and validate a new scoring function for detecting alignments for de novo assembly. A method for aligning optical maps against fingerprint contig maps that addresses the unique nature of fingerprint contig maps is also presented. Chapter 4 examines the problem of assembling optical maps and we present an assembly algorithm using graph-based methods for producing a consensus map. The assembly algorithm utilizes the new scoring function optimized for de novo assembly to calculate the alignments.

Chapter 2

Modeling Optical Mapping Data

In order to analyze optical mapping data, a necessary first step is to study the inherent features of the system. By examining the associated errors of optical mapping, we can gain an understanding of both the advantages and limitations of optical mapping for genomic analysis. One of the key advantages of optical mapping over traditional restriction mapping techniques is the ability to produce ordered restriction maps where restriction sites appear in the same order as they would on the genome itself. The system is also designed with extremely high throughput so that data can quickly be collected from a target organism.
However, optical maps collected from a genome of interest have unknown orientations; fragment lengths are not measured accurately; not all restriction sites are correctly identified; and small fragments are usually lost and underrepresented in the data. Furthermore, some maps are chimeric and consist of unrelated regions of the genome that have crossed over within the image and are detected as a single map during the imaging process. These errors add to the complexity of analyzing optical mapping data and must be accounted for.

2.1 Statistical Models for Optical Mapping

2.1.1 Stochastic Model

Restriction Sites: Restriction sites are the markers identified using optical mapping. They are short specific nucleotide sequences of around 6-12 basepairs that are recognized by the corresponding restriction enzyme. Clearly, the distribution of the DNA sequence affects the distribution of restriction sites. Restriction fragments are the result of digesting the DNA with the restriction enzyme that cleaves the DNA at the locations of restriction sites. Consequently, the distribution of the fragment lengths between successive restriction sites depends upon the distribution of the DNA sequence as well. One simple model for the distribution of DNA basepairs along a genome is that they are distributed independently over the nucleotides $\{A, C, T, G\}$. In real data, however, the assumption of the bases as independently distributed is not reasonable, leading to the usage of a Markov chain to more accurately model the distribution of bases (Bishop et al., 1983). In this model, the distribution of the bases is dependent upon previous bases occurring within the sequence so that dependence between nucleotide bases can be accounted for. Waterman (1983); Breen et al. (1985); Biggins and Cannings (1987) examined how the structure of the restriction site (the nucleotide pattern) itself could greatly affect the number of restriction sites and therefore change the resulting fragment lengths when both single and multiple restriction sites are considered.
For our purposes, we assume a uniform distribution over the nucleotide bases where a restriction site of length $l$ basepairs occurs with probability $p = 1/4^l$. For example, a 6-basepair site occurs at a given position with probability $1/4096$, or about once every 4 Kb on average. Under this model, we neglect the effects of the restriction site sequence and also only consider a single restriction site. As the genome size, $G$, is extremely large compared to the size of the restriction site, $G \gg l$, recognition sites along the genome are modeled as the realization of a homogeneous Poisson process and consequently, the fragment lengths are independently and identically distributed exponential variates. The Poisson process makes an excellent model due to the fact that the genome is extremely large and the probability of a recognition site occurring is small (Churchill et al., 1990). Even if dependence exists between the nucleotide bases where a Markov chain is more appropriate to model the distribution of the sequence of the restriction site, the Poisson process approximates the fragment lengths sufficiently. Under a Markovian assumption the fragment lengths are distributed geometrically according to the transition matrix, where the exponential distribution is the continuous analog of the geometric distribution.

The rate of the Poisson process is dependent upon the restriction enzyme being used as well as the genome being mapped. Figure 2.1 shows the distribution of fragment lengths for the human genome. Although the Poisson process is a good approximation, various features of the underlying sequence could affect the rate. Coding constraints and codon usage as well as other local fluctuations in base composition could have significant effects on the distribution of restriction sites. Furthermore, the density of genes and certain specific sequence patterns that play integral roles in the function of the organism also affect the distribution of the nucleotide bases.

2.1.2 Error Models

The error model presented is based on the model developed by Valouev et al. (2006b). We summarize below the relevant features of the model for our purposes.
[Figure 2.1: grid of density plots, one panel per chromosome; x-axis: Fragment Lengths (Kb), 0 to 150; y-axis: Density, 0.00 to 0.08. Panel titles, chromosome (mean fragment length in Kb): 1 (14.20), 2 (12.54), 3 (12.24), 4 (10.53), 5 (12.23), 6 (12.04), 7 (12.95), 8 (12.96), 9 (14.11), 10 (14.41), 11 (14.46), 12 (13.25), 13 (10.67), 14 (13.19), 15 (15.15), 16 (18.50), 17 (19.65), 18 (12.07), 19 (26.33), 20 (17.78), 21 (12.48), 22 (23.81), X (14.03), Y (14.10).]

Figure 2.1: Density plots of NCBI Build 36 of the human genome in silico digested with the restriction enzyme SwaI, grouped by chromosome. Restriction fragment lengths are exponentially distributed given that locations of restriction sites follow a Poisson process with the rate estimated by the average fragment length from the observed data. The upper strip of each plot also shows the mean fragment length for each individual chromosome. It is interesting to note that the graphs resemble exponential distributions but with different curves. This points to different rates for the underlying Poisson process dependent upon the chromosome being examined. The data consist only of restriction fragments ≤ 150 Kb, as extreme outliers are present on the sex chromosomes, where restriction fragments greater than 30 Mb occur.

Sizing Errors: Fragment length is determined by measuring the fluorescence intensities of the dye that attaches to the DNA molecules during the staining process. The intensities are measured against a measurement standard of DNA of known size that is deposited with the target genome. Consider a restriction fragment that is $n$ times the size of a unit DNA mass. The fluorescence intensity, $W$, of the restriction fragment can be computed as $W = \sum_{i=1}^{n} W_i$, where $W_i$ is the amount of measured fluorescence for the $i$-th unit of DNA mass. The individual $W_i$'s are modeled as independent and identically distributed random variables with expectation $E(W_i) = \mu$ and variance $\mathrm{Var}(W_i) = \gamma^2$.
This assumes that local effects such as the structure and accessibility of the DNA do not greatly affect the uptake of the dye along the span of the molecule. Alternative models that consider the local effects of the fluorescence are explored in Sarkar (2006). Given that the dye attaches itself along the DNA using the above model, by the central limit theorem, $W$ converges to a normal random variable where $W \to \mathrm{Normal}(n\mu, n\gamma^2)$. The size of the fragment is determined by dividing the total fluorescent intensity by the standard amount of intensity per unit DNA mass estimated from lambda DNA co-mounted along the target DNA. This allows the distribution parameter, $\mu$, to be calculated by examining the fluorescence intensity of lambda DNA of known size. Estimated sizes, $X$, of restriction fragments are identified as $X = W/\mu$. As $n$ becomes large, $X$ converges in distribution to a normal random variable, $X \to \mathrm{Normal}(n, n\sigma^2)$, where $\sigma^2 = (\gamma/\mu)^2$. Thus, if $Y$ is the underlying true size of the restriction fragment, the estimated size $X \to \mathrm{Normal}(Y, \sigma^2 Y)$ in distribution. We can view sizing errors as the difference between the estimated size and the true underlying size of a fragment, $\epsilon = X - Y$, so that $\epsilon \to \mathrm{Normal}(0, \sigma^2 Y)$ as $n$ increases. Under this model, the variance of the sizing error scales linearly with the true underlying size, $Y$, of the fragment.

The described sizing error model agrees with most of the data from relatively long fragments (≥ 4 Kb) but fails to hold for smaller fragments. One reason is that unlike long fragments that appear clearly stretched out and linearized within the image, smaller fragments are often balled up. To take this into account, fragments below a specified threshold, $\Delta$, are modeled as $\epsilon_\Delta \sim \mathrm{Normal}(0, \eta^2)$, where $\eta^2 \gg \sigma^2$.

Missing Cuts: Restriction sites in the true restriction map may fail to show up in the corresponding optical map and are known as missing cuts. These can be attributed to either incomplete digestion by the restriction enzyme or noise within the optical map image.
Cut sites are modeled as independent Bernoulli trials with success probability $p$, and consequently, missing cuts occur with probability $1 - p$. It may be possible that the probability should be dependent upon the proximity to other cuts, i.e., that the underlying restriction map exhibits locality effects that affect the processivity of the restriction enzyme. However, this is difficult to formalize and model and has not been closely examined. We assume that the observations of cut sites within an optical map are independent of each other.

False Cuts: Aside from missing cuts, optical maps also contain false cuts that do not correspond to any true restriction site. They arise due to random DNA breakage or image errors and are modeled as a homogeneous Poisson process with rate $\zeta$.

Chimerism: Two unrelated molecules may be mistakenly combined into one as the result of the DNA molecules crossing each other within the image. Although controls are used during the imaging process to minimize the markup of these chimeric maps, they are still present in the data.

Model Assumptions

We adopt the following model assumptions and parameters derived from Valouev et al. (2006b).

Proposition 2.1.1. Restriction fragment lengths, $X$, are exponentially distributed with $X \sim \mathrm{Exponential}(\lambda)$ and $E(X) = \lambda$. Restriction sites are laid down as a Poisson process with rate $\lambda^{-1}$, with the number of restriction sites, $N$, in $s$ Kb distributed as $N \sim \mathrm{Poisson}(s/\lambda)$.

Proposition 2.1.2. Measured fragment lengths from optical mapping data, $X$, are normally distributed as $X \sim \mathrm{Normal}(Y, \sigma^2 Y)$, given true fragment length, $Y$. Therefore, the sizing error, $\epsilon = X - Y$, is normally distributed as $\epsilon \sim \mathrm{Normal}(0, \sigma^2 Y)$. Furthermore, measured fragment sizes $X \le \Delta$ have sizing error distributed as $\epsilon_\Delta \sim \mathrm{Normal}(0, \eta^2)$, with $\eta^2 \gg \sigma^2$.

Proposition 2.1.3. Restriction sites are observed on optical maps with probability, $p$, independent of other sites. Thus, observed restriction sites occur as a thinned Poisson process with rate $p/\lambda$.

Proposition 2.1.4.
False cuts occur as a Poisson process with rate $\zeta$ and consequently the number of random breaks, $R$, in $s$ Kb is distributed as $R \sim \mathrm{Poisson}(\zeta s)$.

We make a few remarks regarding the above propositions. Proposition 2.1.1 assumes that across the entire genome, restriction sites are laid down as a homogeneous Poisson process with rate $\lambda^{-1}$ so that the lengths of restriction fragments are exponentially distributed with mean, $\lambda$. In Section 2.1.1, we saw that this is a good approximation for the distribution of restriction fragment lengths. However, the underlying sequence could greatly affect the lengths of restriction fragments for various regions of the genome. Table 2.1 lists summary statistics for the individual chromosomes of the human genome. We can see that the mean fragment length varies from chromosome to chromosome, and that the standard deviation does not quite agree with the model of exponentially distributed fragment lengths, under which the standard deviation would equal the mean. If we restrict ourselves to only fragments less than some prespecified threshold, the distribution of the fragment lengths more closely follows that of an exponential distribution. The presence of a few extremely large fragments greatly skews the distribution of the data. Thus, it might be more appropriate to consider separate rates for the Poisson process for the individual chromosomes of complex genomes with multiple chromosomes. A more complicated model would be the use of an inhomogeneous Poisson process that was dependent upon the location along the genome. However, these methods would require knowledge of the underlying sequence beforehand, which is only possible when a reference genome is available. Although the simplistic assumption of a homogeneous Poisson process with the same rate across the entire genome is not entirely valid, the approximation works well for most purposes.

Notation: We present some notation to be used in the following sections. Optical maps are represented as an ordered sequence of fragment lengths.
An optical map, $x$, consisting of $n$ fragments is represented as
\[ x = (x_1, \ldots, x_n) \]
where $x_i$ represents the length of the $i$-th fragment. The usage of parentheses specifies that the list is ordered. Another equivalent representation is to consider the optical map as a sequence of cut sites and their positions along the genome. The optical map, $x$, can be converted into this form, denoted as $S(x)$, by accumulating the fragment lengths where
\[ S(x) = \Bigl(s_0 = 0,\; s_1 = x_1,\; s_2 = x_1 + x_2,\; \ldots,\; s_n = \sum_{i=1}^{n} x_i\Bigr). \]
In this form, the sites $s_0$ and $s_n$ do not represent actual cut sites but the ends of the molecule. Depending upon the situation, both definitions are applicable and the representation being used will be clearly stated.

    Chr   Mean (Kb)   SD (Kb)   Max (Kb)   Median (Kb)
    1       15.90     164.94    20445.14       8.71
    2       12.89      27.87     3072.54       8.05
    3       12.65      39.67     4642.82       7.86
    4       10.81      26.11     3041.02       6.75
    5       12.50      30.06     3213.40       7.96
    6       12.38      29.96     3124.04       7.88
    7       13.57      33.76     3146.63       8.17
    8       13.54      33.91     3061.67       8.31
    9       16.62     197.87    18110.42       8.58
    10      14.90      32.66     2626.73       8.96
    11      15.35      38.46     3064.80       8.97
    12      13.67      22.68     1463.27       8.19
    13      12.77     189.90    17918.41       6.99
    14      16.19     223.42    18077.98       8.35
    15      18.78     250.89    18305.38       9.72
    16      22.05     156.74     9825.66      11.28
    17      20.97      28.09      379.16      11.69
    18      12.32      22.47     1422.10       7.73
    19      35.87     201.83     8348.38      16.28
    20      19.15      42.38     1814.98      11.19
    21      17.44     199.43     9857.52       7.75
    22      38.43     402.62    14443.83      15.04
    X       14.53      34.64     3137.85       9.27
    Y       32.75     720.19    30228.82       8.78

Table 2.1: The average, standard deviation, maximum, and median of restriction fragment lengths for the human genome NCBI Build 36 in silico digested with restriction enzyme SwaI.

Recall that different types of maps are possible when analyzing optical mapping data. Reference maps refer to those derived from in silico digest of sequence data and are considered free of errors. Optical maps, on the other hand, are those obtained experimentally and as such contain the errors previously described. In the following sections, we will use both types of maps in our analysis.
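The notation and error model of Section 2.1 can be illustrated with a small simulation. The following is a minimal sketch, not the procedure used in this work: it converts a map to the site representation $S(x)$ and then applies missing cuts, false cuts, and sizing error to an error-free reference map. The function names, the parameter values in the usage lines, the choice to apply the small-fragment threshold $\Delta$ to the true fragment size, and the clamping of sizes at zero are all assumptions made for illustration.

```python
import random
from itertools import accumulate


def sites_from_fragments(x):
    """S(x): cumulative cut positions with s_0 = 0 and s_n = sum(x)."""
    return [0.0] + list(accumulate(x))


def fragments_from_sites(s):
    """Inverse of S(x): differences between successive site positions."""
    return [b - a for a, b in zip(s, s[1:])]


def simulate_optical_map(reference, p, zeta, sigma2, eta2, delta, rng):
    """Sketch of an optical map drawn from a reference map (sizes in Kb).

    p      -- probability that a true restriction site is observed
    zeta   -- rate of false cuts per Kb (homogeneous Poisson process)
    sigma2 -- sizing variance per Kb for fragments above the threshold
    eta2   -- sizing variance for fragments at or below the threshold
    delta  -- small-fragment threshold (applied to the true size: assumption)
    """
    s = sites_from_fragments(reference)
    length = s[-1]
    # Missing cuts: keep each interior site independently with probability p.
    kept = [pos for pos in s[1:-1] if rng.random() < p]
    # False cuts: exponential gaps with rate zeta give a Poisson process.
    false_cuts = []
    if zeta > 0:
        pos = rng.expovariate(zeta)
        while pos < length:
            false_cuts.append(pos)
            pos += rng.expovariate(zeta)
    cuts = [0.0] + sorted(kept + false_cuts) + [length]
    observed = []
    for y in fragments_from_sites(cuts):
        # Sizing error: Normal(0, sigma2 * y), or Normal(0, eta2) when small.
        var = eta2 if y <= delta else sigma2 * y
        # Clamping at zero is an illustrative choice, not part of the model.
        observed.append(max(0.0, rng.gauss(y, var ** 0.5)))
    return observed


# Usage with purely illustrative parameter values.
rng = random.Random(2010)
reference = [rng.expovariate(1.0 / 15.0) for _ in range(20)]  # mean 15 Kb
print(simulate_optical_map(reference, 0.8, 0.005, 0.3, 0.8, 2.0, rng))
```

Setting `p=1`, `zeta=0`, and both variances to zero returns the reference map unchanged (up to floating-point rounding), which is a convenient sanity check on the sketch.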
2.2 Optical Map Content Analysis

Motivation: Alignment of optical maps is a necessary first step for analyzing optical mapping data. An alignment involves comparing the number of restriction sites and lengths of restriction fragments to produce pairs of aligned sites between maps. However, due to the errors of optical mapping, the alignment can be erroneous and falsely relate maps that have no association. This assumption that aligned maps should be associated with each other has different consequences depending upon the application of optical mapping.

In de novo assembly of a target genome, collected maps are pairwise aligned to derive relationships among them. Aligned maps are assumed to have originated from the same region of the genome and therefore should be assembled together. If maps from different locations of the genome share an alignment, they will incorrectly be associated with each other, leading to a faulty assembly. For the detection of structural variation using optical mapping, maps from a target genome are aligned to an available reference genome to discover differences and similarities. The aligned locations are assumed to represent the same genomic region on both the target and reference genome. Variations present themselves as differing restriction fragment lengths and the presence or absence of restriction sites. For optical mapping to be effective for this problem, the locations must be accurately found so that meaningful comparisons can be made. If the determined location is erroneous, the resulting comparisons lead to incorrect conclusions.

As optical maps are derived from random regions of the target genome, the question to study is the degree to which optical mapping errors cause different regions of the genome to exhibit an alignment.
Alignments are produced using dynamic programming with respect to a specific scoring function that assigns a value to an aligned region, where the recursion is
\[ S(i,j) = \max_{0 \le g < i,\; 0 \le h < j} \bigl\{ S(g,h) + X(s_i - s_g,\; t_j - t_h,\; i - g,\; j - h) \bigr\} \tag{2.1} \]
with $S(i,j)$ as the score of an alignment between two maps, $x$ and $y$, with the rightmost pair of sites $i$ and $j$ aligned. $X(s_i - s_g, t_j - t_h, i - g, j - h)$ gives the alignment extension score for the segment of the alignment $(g,h) \to (i,j)$. The scoring function, $X(x = s_i - s_g, y = t_j - t_h, n = i - g, m = j - h)$, compares the sizes $x, y$ and number of fragments $n, m$ of regions from the two optical maps and assigns a score quantifying their similarity. Although alignments are produced with respect to a specific scoring function using dynamic programming, we assume that any scoring function will have the following properties:

(i) Higher scores are given to regions with similar sizes. Due to measurement error, the sizes of the regions are approximately given. The scoring function takes into account the sizing error when assigning a score given the sizes $x, y$. It is also possible for the scoring function to take into account the distribution of the sizes themselves so that unique sizes are rewarded higher.

(ii) Regions with differing numbers of restriction sites should score lower. In an ideal case where there are no missing and false cuts, only regions consisting of single fragments would need to be considered. However, since missing and false cuts do occur, both can be present in either of the regions being aligned. Since missing and false cuts occur with small probability, we do not expect $n, m \gg 1$.

With the above assumptions, we want to examine a target genome first for regions that exhibit similar sizes and second for regions where missing and false cuts cause them to exhibit similar sizes when considering them as single fragments. In the following sections, we initially examine a target genome under an ideal setting of no sizing error or missing and false cuts.
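Recursion (2.1) can be sketched in code as follows. This is a hedged illustration only: `toy_score` is a hypothetical stand-in with properties (i) and (ii), not one of the scoring functions studied in this work, and the boundary condition ($S(0,0) = 0$ with all other border cells unreachable, so that both left ends are aligned) and the `max_span` bound on $i - g$ and $j - h$ are assumptions introduced to keep the example small.

```python
from itertools import accumulate

NEG_INF = float("-inf")


def toy_score(a, b, n, m, reward=1.0, cut_penalty=1.0):
    """Hypothetical extension score for two regions of sizes a, b (Kb).

    (i)  similar sizes score higher: squared difference scaled by total size;
    (ii) each extra fragment in either region (a missing or false cut)
         costs cut_penalty.
    """
    return reward - (a - b) ** 2 / (a + b) - cut_penalty * ((n - 1) + (m - 1))


def align(x, y, score=toy_score, max_span=3):
    """Fill the table S(i, j) of recursion (2.1) for fragment lists x and y."""
    s = [0.0] + list(accumulate(x))  # site positions of map x
    t = [0.0] + list(accumulate(y))  # site positions of map y
    rows, cols = len(s), len(t)
    S = [[NEG_INF] * cols for _ in range(rows)]
    S[0][0] = 0.0  # assumed boundary: the left ends of both maps are aligned
    for i in range(1, rows):
        for j in range(1, cols):
            best = NEG_INF
            # Only predecessors within max_span sites are examined.
            for g in range(max(0, i - max_span), i):
                for h in range(max(0, j - max_span), j):
                    if S[g][h] == NEG_INF:
                        continue
                    cand = S[g][h] + score(s[i] - s[g], t[j] - t[h], i - g, j - h)
                    if cand > best:
                        best = cand
            S[i][j] = best
    return S


# Usage: the second map lacks the cut between the 20 Kb and 30 Kb fragments.
table = align([10.0, 20.0, 30.0], [10.0, 50.0])
print(table[3][2])
```

Overlap and fit alignments would change only the boundary conditions and the cell from which the final score is read; the inner recursion stays the same.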
We then treat the situation of sizing errors and false and missing cuts, where we examine both types of errors together.

2.2.1 Region Matching

Motivation: We initially examine a target genome in an idealized setting where we ignore measurement error as well as missing and false cuts. This allows us to quantify the feasibility of using ordered restriction fragments to identify a particular region of the genome. Recall that dynamic programming is used to align restriction maps where the sizes and number of sites of regions from the maps are scored. Since we ignore the issue of missing and false cuts, we first focus on the sizes of single restriction fragments being aligned. As optical maps are composed of more than one fragment, we generalize our results to a genomic region consisting of multiple fragments. We will specify our target genome as a random genome denoted $G = (Z_1, \ldots, Z_n)$, of $n$ ordered fragment lengths where $Z_i \sim F_z$, for a specified distribution $F_z$. We will initially consider a specific region of the genome specified as $k$ ordered fragments, $T_i = (Z_i, \ldots, Z_{i+k-1})$, indexed by $i$, the starting fragment along the genome.

Single Fragments: We first examine the simple problem of counting the number of matching pairs of restriction fragments in a given genome. A match is defined as two restriction fragments with the same length. Assume that restriction sites are laid down along a genome as a Poisson process with rate $\lambda$ where $P(k \text{ sites in } [s, s+t)) = e^{-\lambda t}(\lambda t)^k / k!$. The distances between sites are independent exponential variables $X_i$ with mean $1/\lambda$. Let $X_i \sim \mathrm{Exponential}(\lambda)$ be the length of the restriction fragments and consider the probability that two random fragments have the same lengths. Discretize the restriction fragment lengths as $Z_i = \lceil X_i \rceil$ to approximate the lengths into discrete basepairs, where a simple calculation shows
\[ P(Z_i = t) = P(t-1 \le X_i \le t) = \int_{t-1}^{t} \lambda e^{-\lambda x}\,dx = e^{-\lambda(t-1)}\bigl(1 - e^{-\lambda}\bigr) = (1-\gamma)^{t-1}\gamma, \quad t \ge 1, \]
so that $Z_i \sim \mathrm{Geometric}(\gamma = 1 - e^{-\lambda})$ with $E(Z_i) = 1/\gamma$ and $\lambda = \log\bigl(\tfrac{1}{1-\gamma}\bigr) \approx \gamma$.
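The discretization step above can be checked numerically. The short sketch below compares the probability that an exponential length falls in $[t-1, t]$ with the geometric probability mass function for $\gamma = 1 - e^{-\lambda}$; the function names and the value of $\lambda$ are illustrative assumptions.

```python
import math


def ceil_exponential_pmf(t, lam):
    """P(t-1 <= X <= t) for X exponential with rate lam, from the CDF."""
    return math.exp(-lam * (t - 1)) - math.exp(-lam * t)


def geometric_pmf(t, gamma):
    """P(Z = t) for Z geometric on {1, 2, ...} with success probability gamma."""
    return (1.0 - gamma) ** (t - 1) * gamma


lam = 0.1  # illustrative rate
gamma = 1.0 - math.exp(-lam)

# The two probability mass functions agree for every fragment length t.
for t in range(1, 51):
    assert math.isclose(ceil_exponential_pmf(t, lam), geometric_pmf(t, gamma),
                        rel_tol=1e-9)

# lambda = log(1 / (1 - gamma)); the two are close when the rate is small.
print(lam, math.log(1.0 / (1.0 - gamma)), gamma)
```

For small rates, $\gamma = \lambda - \lambda^2/2 + \cdots$, which is why $\lambda \approx \gamma$ is a reasonable approximation at basepair resolution.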
For two random restriction fragment lengths, $Z_i, Z_j \sim \text{Geometric}(\gamma)$, we would like to know the probability, $p$, that they are equal in length. This is given by the following lemma.

Lemma 2.2.1. Let $X, Y \sim \text{Geometric}(\gamma)$ be independent; then $p \equiv P(X = Y) = \dfrac{\gamma}{2-\gamma}$.

Proof.

\[ P(Z_i = Z_j) = \left(\frac{\gamma}{1-\gamma}\right)^{2} \sum_{z=1}^{\infty} (1-\gamma)^{2z} = \frac{\gamma^2}{1-(1-\gamma)^2} = \frac{\gamma^2}{2\gamma - \gamma^2} = \frac{\gamma}{2-\gamma} \equiv p. \]

Consider a random genome, $G = (Z_1, \ldots, Z_n)$, of $n$ restriction fragment lengths where $Z_i \sim \text{Geometric}(\gamma)$, so that the expected number of matching fragments will be $\binom{n}{2} p = \frac{n(n-1)}{2}\,p$ with $p = \frac{\gamma}{2-\gamma}$. With large $n$, where $n \gg p$, the number of matching pairs will be approximately Poisson with mean $\mu = n^2 p$.

Other interesting questions can be asked using this simple model. Given a restriction fragment of length $z$ that has occurred, the number of restriction fragment lengths, $F$, that are encountered along the genome before seeing another restriction fragment of the same length can be calculated. The encountered restriction fragment lengths are those seen after the fragment of length $z$, formed from restriction sites occurring along the genome. The restriction fragment lengths are $Z_1, Z_2, \ldots, Z_F \sim \text{Geometric}(\gamma)$, where $P(Z_i = z) = (1-\gamma)^{z-1}\gamma$, and therefore the number of restriction fragments seen is $F \mid z \sim \text{Geometric}(r = P(Z_i = z))$.

Ordered k-Fragment Region: The above dealt with the simple case of a single restriction fragment. The situation of $k$ ordered restriction fragments is now addressed. Let $G = (Z_1, \ldots, Z_n)$ be a random genome consisting of $n$ restriction fragment lengths with $Z_i \sim \text{Geometric}(\gamma)$. A region of the genome is denoted as $k$ successive fragments, $T_i = (Z_i, \ldots, Z_{i+k-1})$, where $i = 1, \ldots, n-k+1$ denotes the starting fragment of the region. We want to know the number of matches, $M$, of a random region of the genome. Two regions, $T_i = (Z_i, \ldots, Z_{i+k-1})$ and $T_j = (Z_j, \ldots, Z_{j+k-1})$, match if their corresponding restriction fragments have the same length, $Z_{i+l} = Z_{j+l}$ for $l = 0, \ldots, k-1$. With the probability of two random fragments matching given by $p = \frac{\gamma}{2-\gamma}$, and the fragment lengths occurring independently of each other, the probability that two regions match
Since regions consist of k fragments, there is dependency among overlapping regions. To see this, region T i can only match to region T i+1 if Z i = Z i+1 ,Z i+1 = Z i+2 ,Z i+2 = Z i+3 ,...,Z i+k−2 = Z i+k−1 due to the overlap of region T i and T i+1 . This causes matches between regionT i and region T j for j = i+1,...,i +k− 1 to be dependent upon the self-overlap of region T i . If the probability, p, of two random fragments matching is small, the effect of self-overlap can be ignored. Withn−k+1 regionsoforderedk-fragmentsgivenagenomeofnfragments,theexpectednumberof matchingregionsis & n−k+1 2 ' p k ≈n 2 p k forn%k. Forlargen,thenumberofmatching regionsisagainapproximatelyPoissonwithµ =n 2 p k . To account for self-overlap, a similar approach by Ewens and Grant (2001) is used tosolveexactlyfor themean andvariance ofthe numberofmatches. We willshowthe following. Lemma 2.2.2. Consider a random ordered k-fragment region T i =(Z i ,...,Z i+k−1 ). Define the indicator variable I j =1 if T i matches the region ending at fragment j, and I j =0 if it does not. The total number of matches of T i can be represented as M =I k +I k+1 +···+I n where E(M) = (n−k +1)p k and Var(M) = (n−k +1)p k + ((2k−1)n−3k 2 +4k−1)p 2k +2 k−1 " j=1 (n−2k +j +1)p 2k−j . 37 Proof. Calculatingfirsttheexpectationyields E(M)= E(I k )+E(I k+1 )+···+E(I n ) = P(I k = 1)+P(I k+1 = 1)+···+P(I n = 1). Asbefore,theprobabilityofT i matchingtheregionendingatapositionj isp k sothat E(M) = (n−k +1)p k . ThevarianceofM canbecalculatedas Var(M)= E(I k +I k+1 +···+I n ) 2 −((n−k +1)p k ) 2 . Now 38 E(I k +I k+1 +···+I n ) 2 = E(I 2 k +I 2 k+1 +···+I 2 n ) + 2(n−k)termsinvolvingE(I j I j+1 ) + 2(n−k−1)termsinvolvingE(I j I j+2 ) + ··· + 2(n−2k +2)termsinvolvingE(I j I j+k−1 ) +(n−2k +2)(n−2k +1)termsinvolvingE(I j I m ), |m−j|>k−1 where the above assumes that n ≥ 2k− 2. 
For the first term, $I_k^2 = I_k$, since $I_k$ is an indicator random variable whose only possible values are 0 and 1, so that $E(I_k^2 + I_{k+1}^2 + \cdots + I_n^2) = E(I_k + I_{k+1} + \cdots + I_n) = (n-k+1)\,p^k$. The last term involves non-overlapping positions that region $T_i$ can match, so that the probability of this event occurring is $(p^k)^2$. The final term therefore contributes $(n-2k+2)(n-2k+1)(p^k)^2$ to the variance calculation.

The remaining terms make apparent the self-overlap properties of region $T_i$. The expectations of the form $E(I_j I_{j+k-l})$ for $l = 1, 2, \ldots, k-1$ involve the probability that the last $l$ fragments of region $T_i$ match its first $l$ fragments, which occurs with probability $p^l$. With the last $l$ fragments matching the first $l$ fragments, the remaining $2(k-l)$ fragments, namely the $k-l$ beginning fragments of $T_i$ and the $k-l$ ending fragments of $T_i$, must also match. Thus $E(I_j I_{j+k-l}) = p^{2(k-l)}\,p^l = p^{2k-l}$. Subtracting $\left((n-k+1)p^k\right)^2$ and collecting the $p^{2k}$ terms, using $(n-2k+2)(n-2k+1) - (n-k+1)^2 = -\left((2k-1)n - 3k^2 + 4k - 1\right)$, the variance is

\[ \mathrm{Var}(M) = (n-k+1)\,p^k - \left((2k-1)n - 3k^2 + 4k - 1\right)p^{2k} + 2\sum_{j=1}^{k-1} (n-2k+j+1)\,p^{2k-j}. \]

With small $p$, the terms involving self-overlap are negligible and the variance is approximately the same as the mean, suggesting a Poisson approximation as previously noted. If self-overlap occurs, matches occur in "clumps" due to the added variability from a prefix of $T_j$ matching a suffix of $T_i$. In this situation only the suffix fragments of $T_i$ need to match the prefix fragments of $T_j$. This also works with the roles of $T_i$ and $T_j$ exchanged. To be specific, we define a clump as follows.

Definition 2.2.1. A clump is defined as a set of matches of overlapping regions $T_j$ for a region $T_i$.

If the probability of two random fragments matching is high, that is, for large $p$, the variance is greatly increased and the Poisson approximation fails to hold. In this situation, only the first matching region in a clump of overlapping matching regions is counted, and in this way the matches are "declumped".
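The idea of declumping can be sketched on a toy genome. This is a minimal illustration under stated assumptions, not the construction used in the analysis: the helper names are hypothetical, matches are exact as in the idealized setting, and two matches are assumed to belong to the same clump when consecutive matching start positions are fewer than $k$ apart (that is, when the matching regions overlap).

```python
def region_matches(lengths, i, k):
    """Start indices j != i whose k-fragment region equals the region starting at i."""
    target = lengths[i:i + k]
    last_start = len(lengths) - k
    return [j for j in range(last_start + 1)
            if j != i and lengths[j:j + k] == target]

def declump(match_starts, k):
    """Keep only the first match of each clump.

    Assumption for this sketch: a new clump starts when the gap to the
    previous matching start is at least k, so the two regions do not overlap.
    """
    clump_starts = []
    previous = None
    for j in sorted(match_starts):
        if previous is None or j - previous >= k:
            clump_starts.append(j)
        previous = j
    return clump_starts

lengths = [5, 5, 5, 5, 9, 5, 5]    # toy genome with a repeated fragment size
matches = region_matches(lengths, i=0, k=2)
clumps = declump(matches, k=2)
```

In this toy example the region starting at index 0 matches at starts 1, 2 and 5; the overlapping matches at 1 and 2 form one clump, so two clumps are counted instead of three matches.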
In this "declumped" version, the number of clumps can be approximated with the Poisson distribution. Using the Chen-Stein method of Poisson approximation (Arratia et al., 1990), we can derive error bounds for the Poisson approximation; our analysis follows a treatment similar to that given in Waterman (1995). We state the following theorem.

Theorem 2.2.1 (Chen-Stein). Let $X_\upsilon$ for $\upsilon \in I$ be indicator random variables such that $X_\upsilon$ is independent of $\{X_\tau\}$, $\tau \notin J_\upsilon$. Let $W = \sum_{\upsilon \in I} X_\upsilon$ and $\lambda = E(W)$, and let $Z$ be a Poisson random variable with $E(Z) = \lambda$. Then

\[ \lVert W - Z \rVert \le 2(b_1 + b_2)\,\frac{1 - e^{-\lambda}}{\lambda} \le 2(b_1 + b_2) \]

and in particular,

\[ \left| P(W = 0) - e^{-\lambda} \right| \le (b_1 + b_2)\,\frac{1 - e^{-\lambda}}{\lambda}, \]

where

\[ b_1 = \sum_{\upsilon \in I} \sum_{\tau \in J_\upsilon} E(X_\upsilon)\,E(X_\tau) \quad \text{and} \quad b_2 = \sum_{\upsilon \in I} \sum_{\upsilon \ne \tau \in J_\upsilon} E(X_\upsilon X_\tau). \]

The above theorem gives explicit bounds using the values of $b_1$ and $b_2$. We obtain these bounds for our problem in the following lemma.

Lemma 2.2.3. Let $W$ be the number of occurrences of "clumps" for a random region $T_i$. Then

\[ b_1 < 2\lambda p^k + \frac{\lambda^2 (2k+1)}{n} \quad \text{and} \quad b_2 = 0, \]

with $\lambda = p^k + (n-k)(1-p)\,p^k$.

Proof. We first define the match indicator, $C_{i,j} = I(Z_i = Z_j)$, which equals 1 when fragment $Z_i$ matches fragment $Z_j$, and

\[ Y_{i,j} = C_{i,j}\,C_{i+1,j+1} \cdots C_{i+k-1,j+k-1}, \]

so that $Y_{i,j} = 1$ if there is a match between regions $T_i$ and $T_j$, and with declumping

\[ X_{i,j} = \begin{cases} Y_{i,j} & \text{if } i = 1 \text{ or } j = 1, \\ (1 - C_{i-1,j-1})\,Y_{i,j} & \text{otherwise.} \end{cases} \]

The declumped indicator variable $X_{i,j}$ counts only the first occurrence of a match between regions $T_i$ and $T_j$. Our index set is

\[ I = \{1, 2, \ldots, n-k+1\} \]

and the dependence set is

\[ J_i = \{ j \in I : |i - j| \le k \}. \]

Let $W$ be the number of clumps for region $T_i$, where

\[ W = \sum_{j \in I} X_{i,j}, \]

so that

\[ \lambda = E(W) = p^k + (n-k)(1-p)\,p^k \]

with $p = P(Z_i = Z_j) = \frac{\gamma}{2-\gamma}$ for $Z_i, Z_j \sim \text{Geometric}(\gamma)$. For $b_2$, we first notice that, due to the factor $(1 - C_{i-1,j-1})$ for declumping, $X_{i,r}$ and $X_{i,s}$ cannot both be 1 for $r \ne s$ where $r, s \in J_i$. Therefore,

\[ b_2 = \sum_{r \in I} \sum_{r \ne s \in J_r} E(X_{i,r} X_{i,s}) = 0. \]
To calculate a bound for $b_1$, we have the following:

\[
\begin{aligned}
b_1 &= \sum_{r \in I} \sum_{s \in J_r} E(X_{i,r})\,E(X_{i,s}) \\
&= p^k \sum_{s \in J_1} E(X_{i,s}) + \sum_{r=2}^{n-k+1} (1-p)\,p^k \sum_{s \in J_r} E(X_{i,s}) \\
&< p^k\left(p^k + 2k(1-p)\,p^k\right) + (n-k)(2k+1)\left((1-p)\,p^k\right)^2 \\
&< 2\lambda p^k + \frac{\lambda^2 (2k+1)}{n}.
\end{aligned}
\]

[Figure 2.2: two QQ-plot panels of fragment lengths (Kb) against exponential theoretical quantiles and geometric theoretical quantiles.]

Figure 2.2: Restriction fragment lengths for the E. coli K12 genome of size 4.6 Mb digested with the restriction enzyme StuI, which recognizes the 6-bp sequence 5′-AGGCCT-3′, producing 605 restriction fragments with average restriction fragment length of ∼11,411 base pairs. QQ-plots using theoretical exponential quantiles and geometric quantiles indicate a good fit to both distributions.

Numerical Simulations: To validate our models, we conducted simulation tests. Figure 2.2 shows the approximation of the fragment lengths using both the exponential and geometric distributions for the E. coli genome. The parameters for each distribution are estimated using the average fragment length. Both plots show that, when a Poisson process is used to model the occurrence of restriction sites, the resulting fragment lengths are approximated well by either distribution.

In order to test the single fragment matching model, we simulated genomes, $G_j = (Z_1, \ldots, Z_n)$, with $Z_i \sim \text{Geometric}(\gamma)$. For different values of $\gamma$, we simulate $m$ genomes and count the number of pairs of fragments, $M_j$, equal in length for each genome. Let $P(m,\gamma) = \sum_{j=1}^{m} M_j$ be the sum of the numbers of matching fragments over the $m$ genomes, with $M_j \sim \text{Poisson}(n^2 p)$ and $p = \gamma/(2-\gamma)$. We have

\[ Z(m,\gamma) = \frac{P(m,\gamma) - m\,n^2 p}{\sqrt{n^2 p\, m}} \to \text{Normal}(0,1) \quad \text{as } m \to \infty. \]

We can plot the resulting values of $Z(m,\gamma)$ and compare them against a standard normal distribution, as shown in Figure 2.3.

[Figure 2.3: QQ-plot of $Z(m,\gamma)$ against normal theoretical quantiles.]

Figure 2.3: For expected fragment lengths of 1,000 base pairs, 1,250 base pairs, ..., 20,000 base pairs, we set the corresponding $\gamma = 1/1000, 1/1250, \ldots, 1/20000$ and simulate $m = 100$ genomes, each with $n = 1000$ fragments, with each fragment distributed as $\text{Geometric}(\gamma)$. We set $P(m,\gamma)$ equal to the total number of matching fragments and plot the transformed value, $Z(m,\gamma) = (P(m,\gamma) - m\,n^2 p)/\sqrt{n^2 p\, m}$ with $p = \gamma/(2-\gamma)$, against the quantiles of a standard normal distribution. The resulting QQ-plot indicates a good fit to the standard normal distribution, validating the Poisson approximation.

2.2.2 Region Matching with Sizing Errors

Motivation: We had previously examined the target genome without considering sizing errors or missing and false cuts. In this section, we still ignore missing and false cuts but allow for sizing errors to occur. Since restriction fragments can no longer be precisely measured, we must define a match between two restriction fragments from the target genome taking the sizing error into account. Recall that the scoring function used during the dynamic programming determines the degree to which two regions are similar to each other. Although many different scoring functions are possible, we make use of the assumption that regions with similar sizes should be rewarded with a higher score. This leads us to our definition of a match between two restriction fragments, which serves as the basis for the rest of our analysis.

Random Genome: Consider a random genome consisting of $n$ fragments, $G = (Z_1, \ldots, Z_n)$, with $Z_i \sim \text{Exponential}(\lambda)$, parameterized here by its mean so that $E(Z_i) = \lambda$. Previously we had used the discretized version of the fragment lengths, which are geometrically distributed. We now use the continuous version, where the fragment lengths are exponentially distributed. Recall from Proposition 2.1.2 that measured fragment lengths $Y_i$ are normally distributed, so that $Y_i \mid Z_i \sim \text{Normal}(Z_i, \sigma^2 Z_i)$. We state the following lemma.

Lemma 2.2.4. Measured fragment lengths $Y_i$ have exponential density with mean

\[ \theta = \left( \frac{1}{\sigma}\sqrt{\frac{2}{\lambda} + \frac{1}{\sigma^2}} - \frac{1}{\sigma^2} \right)^{-1}. \]

Proof. We sketch the details of the proof with the full proof given in Valouev et al.
(2006b), to which we refer the reader for details. We can consider the density of $Y_i$ as the marginal distribution of the joint density of $Y_i$ and $Z_i$, where by integrating out $Z_i$ we have

\[ f_{Y_i}(y) = \int f_{Y_i, Z_i}(y, z)\,dz = \int f_{Y_i \mid Z_i}(y \mid z)\, f_{Z_i}(z)\,dz. \]

By substituting the densities for $f_{Y_i \mid Z_i}(y \mid z)$ and $f_{Z_i}(z)$ with $Y_i \mid Z_i \sim \text{Normal}(Z_i, \sigma^2 Z_i)$ and $Z_i \sim \text{Exponential}(\lambda)$, we arrive at $Y_i \sim \text{Exponential}(\theta)$.

From Lemma 2.2.4, we see that measured fragment lengths are exponentially distributed as well, but with a different parameter due to measurement error. Given a random genome of $n$ ordered restriction fragments, $G = (Z_1, \ldots, Z_n)$ with $Z_i \sim \text{Exponential}(\lambda)$, we can consider the measured genome, in which the fragment sizes are determined using optical mapping, as $n$ ordered and estimated restriction fragment lengths, $G_{\text{measured}} = (Y_1, \ldots, Y_n)$, where $Y_i \sim \text{Normal}(Z_i, \sigma^2 Z_i)$ given fragment $Z_i$ from the random genome. We would like to determine the number of matches between fragments under this model of measured fragment sizes. To do so, we define matches as follows.

Definition 2.2.2. Let $Z_i$ and $Z_j$ be two fragments from the random genome, $G = (Z_1, \ldots, Z_n)$. A match is declared between $Z_i$ and $Z_j$ at matching significance level $\alpha$ if

\[ \frac{|Z_i - Y_j|}{\sigma\sqrt{Z_i}} \le \alpha, \]

where $Y_j$ is the observed fragment length corresponding to $Z_j$ in the measured genome, $G_{\text{measured}}$.

The definition of a match between fragments examines the absolute difference in their sizes, $|Z_i - Y_j|$, while incorporating the sizing error associated with fragment $Z_i$, where measured fragments from $Z_i$ are given by $Y_i \sim \text{Normal}(Z_i, \sigma^2 Z_i)$. Thus, measured fragments have mean size $Z_i$ with variance $\sigma^2 Z_i$, so that the standard deviation is given by $\sigma\sqrt{Z_i}$. We quantify the degree of similarity in terms of the number of standard deviations that fragment $Y_j$ is away from the true size $Z_i$, using the parameter $\alpha$.
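Under this definition of matches, we would like to know the probability of a match between fragments from the random genome $G$. Since fragments $Z_i \sim \text{Exponential}(\lambda)$ come from $G$ and fragments $Y_i \sim \text{Exponential}(\theta)$ from $G_{\text{measured}}$, the probability of a match is given by the following lemma.

Lemma 2.2.5. Let $Z \sim \text{Exponential}(\lambda)$ and $Y \sim \text{Exponential}(\theta)$ be independent, where

\[ f_Z(z) = \frac{1}{\lambda} e^{-z/\lambda}, \quad z > 0, \qquad f_Y(y) = \frac{1}{\theta} e^{-y/\theta}, \quad y > 0. \]

Then

\[ P\!\left( \frac{|Z - Y|}{\sigma\sqrt{Z}} \le \alpha \right) = \frac{1}{a\theta}\left(1 - e^{-k/\lambda}\right) + \frac{1}{\lambda}\sqrt{\frac{\pi}{a}}\,\frac{b}{a}\,e^{\frac{b^2}{4a}} \left( Q\!\left(\sqrt{2ak} - \frac{b}{\sqrt{2a}}\right) + Q\!\left(\frac{b}{\sqrt{2a}}\right) \right), \]

where $k = \alpha^2\sigma^2$, $a = \frac{1}{\lambda} + \frac{1}{\theta}$, and $b = \frac{\alpha\sigma}{\theta}$, and $Q(x) = 1 - \Phi(x)$ with $\Phi(x)$ the cumulative distribution function of the standard normal.

Proof. We consider the events $\{Z \ge Y\}$ and $\{Z < Y\}$, where we have

\[ P\!\left( \frac{|Z - Y|}{\sigma\sqrt{Z}} \le \alpha \right) = P\!\left( \frac{|Z - Y|}{\sigma\sqrt{Z}} \le \alpha,\; Z \ge Y \right) + P\!\left( \frac{|Z - Y|}{\sigma\sqrt{Z}} \le \alpha,\; Z < Y \right). \]

Focusing first on $\{Z \ge Y\}$, we have $|Z - Y| = Z - Y$ and thus

\[ P\!\left( \frac{|Z - Y|}{\sigma\sqrt{Z}} \le \alpha,\; Z \ge Y \right) = \int_0^{\alpha^2\sigma^2}\!\!\int_0^{z} \frac{1}{\lambda} e^{-z/\lambda}\, \frac{1}{\theta} e^{-y/\theta}\,dy\,dz + \int_{\alpha^2\sigma^2}^{\infty}\!\!\int_{z - \alpha\sigma\sqrt{z}}^{z} \frac{1}{\lambda} e^{-z/\lambda}\, \frac{1}{\theta} e^{-y/\theta}\,dy\,dz. \]

The integral is broken into two parts in order to ensure that $y > 0$: the lower limit $z - \alpha\sigma\sqrt{z}$ is nonnegative exactly when $z \ge \alpha^2\sigma^2$. Solving the first integral gives

\[ \int_0^{\alpha^2\sigma^2}\!\!\int_0^{z} \frac{1}{\lambda} e^{-z/\lambda}\, \frac{1}{\theta} e^{-y/\theta}\,dy\,dz = \left(1 - e^{-k/\lambda}\right) - \frac{1}{a\lambda}\left(1 - e^{-ak}\right), \]

and the second

\[ \int_{\alpha^2\sigma^2}^{\infty}\!\!\int_{z - \alpha\sigma\sqrt{z}}^{z} \frac{1}{\lambda} e^{-z/\lambda}\, \frac{1}{\theta} e^{-y/\theta}\,dy\,dz = \left( \frac{1}{\lambda} \int_k^{\infty} e^{-az + b\sqrt{z}}\,dz \right) - \frac{1}{a\lambda}\, e^{-ak}, \]

where $k = \alpha^2\sigma^2$, $a = \frac{1}{\lambda} + \frac{1}{\theta}$, and $b = \frac{\alpha\sigma}{\theta}$. For the event $\{Z < Y\}$, we have $|Z - Y| = Y - Z$, so that

\[
\begin{aligned}
P\!\left( \frac{|Z - Y|}{\sigma\sqrt{Z}} \le \alpha,\; Z < Y \right) &= \int_0^{\infty}\!\!\int_{z}^{z + \alpha\sigma\sqrt{z}} \frac{1}{\lambda} e^{-z/\lambda}\, \frac{1}{\theta} e^{-y/\theta}\,dy\,dz \\
&= \frac{1}{\lambda} \int_0^{\infty} \left( e^{-\left(\frac{1}{\lambda} + \frac{1}{\theta}\right) z} - e^{-\left(\frac{1}{\lambda} + \frac{1}{\theta}\right) z - \frac{\alpha\sigma\sqrt{z}}{\theta}} \right) dz \\
&= \frac{1}{\lambda} \int_0^{\infty} \left( e^{-az} - e^{-az - b\sqrt{z}} \right) dz \\
&= \frac{1}{a\lambda} - \frac{1}{\lambda} \int_0^{\infty} e^{-az - b\sqrt{z}}\,dz.
\end{aligned}
\]

Combining the terms yields

\[ P\!\left( \frac{|Z - Y|}{\sigma\sqrt{Z}} \le \alpha \right) = \left(1 - e^{-k/\lambda}\right) + \frac{1}{\lambda} \left( \int_k^{\infty} e^{-az + b\sqrt{z}}\,dz - \int_0^{\infty} e^{-az - b\sqrt{z}}\,dz \right). \]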
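The last two integrals can be evaluated as follows. Let $r = \sqrt{z}$, so that $dr = \frac{1}{2} z^{-1/2}\,dz$ and $2r\,dr = dz$. Then

\[ \int_k^{\infty} e^{-az + b\sqrt{z}}\,dz = 2\int_{\sqrt{k}}^{\infty} r\,e^{-ar^2 + br}\,dr = 2e^{\frac{b^2}{4a}} \int_{\sqrt{k}}^{\infty} r\,e^{-a\left(r - \frac{b}{2a}\right)^2} dr. \]

Making the substitution $s = r - \frac{b}{2a}$, so that $ds = dr$ and $r = s + \frac{b}{2a}$, yields

\[ = 2e^{\frac{b^2}{4a}} \int_{\sqrt{k} - \frac{b}{2a}}^{\infty} \left(s + \frac{b}{2a}\right) e^{-as^2}\,ds = 2e^{\frac{b^2}{4a}} \left[ \int_{\sqrt{k} - \frac{b}{2a}}^{\infty} s\,e^{-as^2}\,ds + \frac{b}{2a} \int_{\sqrt{k} - \frac{b}{2a}}^{\infty} e^{-as^2}\,ds \right], \]

where, solving the first integral with the substitution $t = s^2$, so that $dt = 2s\,ds$, and rewriting the second integral in terms of the original variable $r$, we have

\[
\begin{aligned}
&= \frac{1}{a} e^{-ak + b\sqrt{k}} + \sqrt{\frac{\pi}{a}}\,\frac{b}{a}\,e^{\frac{b^2}{4a}} \int_{\sqrt{k}}^{\infty} \frac{1}{\sqrt{2\pi}\left(1/\sqrt{2a}\right)}\, e^{-\frac{\left(r - \frac{b}{2a}\right)^2}{2\left(1/\sqrt{2a}\right)^2}}\,dr \\
&= \frac{1}{a} e^{-ak + b\sqrt{k}} + \sqrt{\frac{\pi}{a}}\,\frac{b}{a}\,e^{\frac{b^2}{4a}} \left( 1 - \Phi\!\left( \sqrt{2ak} - \frac{b}{\sqrt{2a}} \right) \right).
\end{aligned}
\]

We recognize the integrand of the second integral as the density of a normal distribution with mean $\frac{b}{2a}$ and standard deviation $\frac{1}{\sqrt{2a}}$, so that we can rewrite it in terms of the cumulative distribution function of the standard normal, $\Phi(\cdot)$. The same steps with $b$ replaced by $-b$ and $k$ by 0 give $\int_0^{\infty} e^{-az - b\sqrt{z}}\,dz = \frac{1}{a} - \sqrt{\frac{\pi}{a}}\,\frac{b}{a}\,e^{\frac{b^2}{4a}}\,Q\!\left(\frac{b}{\sqrt{2a}}\right)$. With the above we can combine terms, noting that $-ak + b\sqrt{k} = -k/\lambda$ and $1 - \frac{1}{a\lambda} = \frac{1}{a\theta}$, so that we have

\[
\begin{aligned}
P\!\left( \frac{|Z - Y|}{\sigma\sqrt{Z}} \le \alpha \right) &= \left(1 - e^{-k/\lambda}\right) + \frac{1}{\lambda} \left[ \frac{1}{a} e^{-ak + b\sqrt{k}} + \sqrt{\frac{\pi}{a}}\,\frac{b}{a}\,e^{\frac{b^2}{4a}}\, Q\!\left( \sqrt{2ak} - \frac{b}{\sqrt{2a}} \right) - \frac{1}{a} + \sqrt{\frac{\pi}{a}}\,\frac{b}{a}\,e^{\frac{b^2}{4a}}\, Q\!\left( \frac{b}{\sqrt{2a}} \right) \right] \\
&= \frac{1}{a\theta}\left(1 - e^{-k/\lambda}\right) + \frac{1}{\lambda}\sqrt{\frac{\pi}{a}}\,\frac{b}{a}\,e^{\frac{b^2}{4a}} \left( Q\!\left( \sqrt{2ak} - \frac{b}{\sqrt{2a}} \right) + Q\!\left( \frac{b}{\sqrt{2a}} \right) \right)
\end{aligned}
\]

with $Q(x) = 1 - \Phi(x)$.

Matches between fragments at a particular significance level $\alpha$ are termed $\alpha$-level matches, with the resulting probability of a match denoted as $p_\alpha$. Figure 2.4 shows a plot of $p_\alpha$ for various values of $\alpha$ and $\sigma$. From the plot, we can see that $p_\alpha$ grows roughly linearly with $\alpha$, while higher values of $\sigma$ increase the slope at which $p_\alpha$ grows. We note that in practical applications we have $\sigma = 0.5$, where from the plot at $\alpha = 2$ the probability of a match is roughly 0.2. To put this into the context of sequences over the four different nucleotides, if we assume a uniform distribution, the probability of a match is $p = 0.25$. The key difference, however, is that the number of restriction fragments is extremely low compared to the number of nucleotides within a genome.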
For example, the human genome is composed of ∼200,000 restriction fragments when digested with the enzyme SwaI, whereas it contains 3 billion base pairs in terms of DNA sequence.

[Figure 2.4: plot of $p_\alpha$ (vertical axis) against $\alpha$ (horizontal axis), one curve per value of $\sigma$.]

Figure 2.4: We plot the probability, $p_\alpha$, of two random fragments matching versus the match significance $\alpha$. Each of the curves represents a different value of $\sigma$, which is the standard deviation associated with the sizing error of optical mapping. The bottom curve is at $\sigma = 0.1$, with each successive curve at values $0.2, 0.3, \ldots, 1$.

Ordered k-Fragment Region: We have examined matching for single fragments incorporating sizing error, and we would like to generalize these results to regions consisting of multiple fragments. We have already determined the probability $p_\alpha(\sigma,\lambda)$ of a match between single fragments given parameters $\alpha, \sigma, \lambda$. As the probability of a match between two fragments with sizing error is high ($\ge 0.10$ for practical values of $\alpha, \sigma, \lambda$), Lemma 2.2.2, which takes the self-overlap properties into account, is applicable in determining the number of matching regions for a given region $T_i = \{Z_i, \ldots, Z_{i+k-1}\}$. Lemma 2.2.3 can be used if the "declumped" version, which counts only the first matching region, is calculated.

We want to examine regions of the genome that match only a few other regions. To be more concrete, let $T_i = \{Z_i, \ldots, Z_{i+k-1}\}$ denote the region starting at the $i$-th fragment along $G_{\text{measured}}$. We are interested in the number of regions, $K_d$, that match at most $d$ other regions. We can use the "declumped" version of matching regions as given by Lemma 2.2.3, with the probability of a match between two fragments given by $p_\alpha(\sigma,\lambda)$. Clearly, if region $T_i$ does not match any other region, the number of clumps is 0 as well. This corresponds to $P(W = 0)$, where Theorem 2.2.1 gives this probability as $|P(W = 0) - e^{-\lambda}| \le b_1 \min\{1, 1/\lambda\}$.

Let $K_0$ be the number of regions that match at most themselves. The probability of a region matching at most itself was given previously as $\approx e^{-\lambda}$, where $\lambda$ in the exponent denotes the Chen-Stein mean $E(W)$ rather than the fragment-length parameter, that is, $\lambda = p_\alpha(\sigma,\lambda)^k + (n-k)\left(1 - p_\alpha(\sigma,\lambda)\right) p_\alpha(\sigma,\lambda)^k$.
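These quantities can be evaluated numerically. The sketch below is an illustration under stated assumptions rather than the procedure used for the figures: it codes the closed form of Lemma 2.2.5, checks it against a Monte Carlo estimate with independent exponential $Z$ and $Y$, and plugs the result into the expected number of clumps. The function names and the parameter values (mean fragment lengths of 10 for both $Z$ and $Y$, $n = 10{,}000$, $k = 10$) are assumptions chosen only for the example.

```python
import math
import random

def normal_tail(x):
    """Q(x) = 1 - Phi(x) for the standard normal."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def match_probability(alpha, sigma, lam, theta):
    """Closed form of Lemma 2.2.5; lam and theta are the means of Z and Y."""
    k = (alpha * sigma) ** 2
    a = 1.0 / lam + 1.0 / theta
    b = alpha * sigma / theta
    scale = math.sqrt(math.pi / a) * (b / a) * math.exp(b * b / (4.0 * a))
    tails = (normal_tail(math.sqrt(2.0 * a * k) - b / math.sqrt(2.0 * a))
             + normal_tail(b / math.sqrt(2.0 * a)))
    return (1.0 - math.exp(-k / lam)) / (a * theta) + scale * tails / lam

def match_probability_mc(alpha, sigma, lam, theta, trials=200_000, seed=1):
    """Monte Carlo estimate of P(|Z - Y| <= alpha * sigma * sqrt(Z))."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        z = rng.expovariate(1.0 / lam)      # expovariate takes a rate, not a mean
        y = rng.expovariate(1.0 / theta)
        if abs(z - y) <= alpha * sigma * math.sqrt(z):
            hits += 1
    return hits / trials

def clump_rate(p, n, k):
    """Expected number of declumped matches, p^k + (n - k)(1 - p) p^k."""
    return p ** k + (n - k) * (1.0 - p) * p ** k

p_alpha = match_probability(alpha=2.0, sigma=0.5, lam=10.0, theta=10.0)
p_sim = match_probability_mc(alpha=2.0, sigma=0.5, lam=10.0, theta=10.0)
rate = clump_rate(p_alpha, n=10_000, k=10)
prob_no_match = math.exp(-rate)             # Poisson approximation to P(W = 0)
```

For these assumed values the closed form and the simulation agree closely, and the match probability comes out near the value of roughly 0.2 quoted above for $\sigma = 0.5$ and $\alpha = 2$.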
Given that there are $n-k+1$ total regions, we have that $K_0 \approx \text{Binomial}(n-k+1,\, e^{-\lambda})$. The distribution is approximate, as there is dependency between the $K_0$ regions since they can overlap with each other. Define $K_d$ as the number of regions that match at most $d$ clumps. We have that $K_d \approx \text{Binomial}(n-k+1,\, p_z = P(Z \le d))$ with $Z \sim \text{Poisson}(\lambda)$. For large $n$, we have that $K_d \approx \text{Poisson}((n-k+1)\,p_z)$. Theorem 2.2.1 can be used to compute error bounds on our approximation.

We can also approximate the distance between the regions of $K_d$. If $P(Z \le d)$ with $Z \sim \text{Poisson}(\lambda)$ is the probability of a region matching at most $d$ clumps, the number of regions encountered before finding another region matching at most $d$ clumps is approximately geometric with success probability $p_z = P(Z \le d)$. Distances, $D$, between $K_d$ regions can be expressed in terms of the number of regions between them, $R \sim \text{Geometric}(p_z)$. Recall that the length of a $k$-fragment region is distributed as $\text{Gamma}(k, \lambda)$; for the distances between regions we count only the starting fragment of each region, whose length is distributed as $\text{Exponential}(\lambda)$. We can view the distances $D$ as $D \mid R \sim \text{Gamma}(R, \lambda)$ with $R \sim \text{Geometric}(p_z)$. Using Lemma 3.1.6, we can show that the marginal distribution is $D \sim \text{Exponential}(\lambda/p_z)$. We have that the expected distance between regions of $K_d$ is $E(D) = \lambda/p_z$ with $\mathrm{Var}(D) = (\lambda/p_z)^2$.

These $K_d$ regions are interesting for a few reasons. Given the distance between them, the $K_d$ regions act as unique sets of markers along the genome where few other regions match. With $d = 0$, where we have regions that match at most themselves, we should find that the distances $D$ between them are larger. As we increase $d$, $D$ decreases, as we allow the $K_d$ regions to match more clumps. This increases their frequency, so that the distances between them are smaller.

Numerical Simulations: We first ran simulations to verify Lemma 2.2.5, where we first simulated genomes $G = (Z_1, \ldots, Z_n)$ with exponentially distributed sizes with rate $\lambda$.
We generated $Y_i \mid Z_i \sim \text{Normal}(Z_i, \sigma^2 Z_i)$ based on the error model given by Proposition 2.1.2 and counted fragment matches using Definition 2.2.2. Figure 2.5 shows the results of our simulation.

To test our results regarding $K_d$, we performed the same simulation, where we simulated a genome and counted the number of regions with at most $d$ clumps as well as the distances, $D$, between them. We compared these values to our theoretical results, as shown in Figure 2.6. Our theoretical values agree with the simulated values of $K_d$ for small $d$, but the agreement worsens as $d$ increases. We can attribute this to the fact that the distribution is only approximate, due to both the Chen-Stein approximation and the dependency between overlapping regions. We also seem to be underestimating the values of $D$ slightly. The underestimation might be due to the fact that we neglect to count the number of overlapping regions within a clump, which adds to the distance.

[Figure 2.5: QQ-plot of $Z$ against normal theoretical quantiles.]

Figure 2.5: We simulated $m$ genomes containing $n$ fragments with sizes exponentially distributed with rate $\lambda$. We simulated a measured genome with sizing error with parameter $\sigma$ and counted the number of matches between the two. As the probability of a match, $p_\alpha(\sigma,\lambda)$, is small, the number of matches is $M \approx \text{Poisson}(m n^2 p_\alpha(\sigma,\lambda))$. We plot the transformed value $Z = (M - m n^2 p)/\sqrt{m n^2 p}$ against the quantiles of a standard normal distribution, indicating a good fit with some slight curvature.

2.2.3 Region Matching with Missing and False Cuts

Motivation: We had previously considered the sizing error associated with optical mapping. In this section, we examine the effects of missing and false cuts and how they affect matches between regions of a genome in terms of restriction fragments. Since we initially do not account for sizing error, we will use the discretized version of the distribution of the restriction fragment lengths. We will first consider the case of missing cuts and in turn consider false cuts.
Recall that in our previous discussions we had examined a $k$-ordered fragment region from the genome. We now make explicit the definition of an optical map: maps are sampled randomly from a genome and can contain a varying number of restriction fragments.

[Figure 2.6: panels for $d = 0, 1, 2, 3$ plotting simulated values against theoretical values of $K_d$ and of $D$.]

Figure 2.6: We simulate a genome consisting of $n = 10000$ restriction fragments with sizes distributed $\text{Exponential}(\lambda)$ for different values of $\lambda$. We then simulated a measured genome with sizing error $\sigma$ and counted "declumped" regions according to $\alpha$. On the right is a plot of $K_d$ for $d = 0, 1, 2, 3$, where we plot the simulated values against our computed theoretical values. On the left is the corresponding plot of the mean distance $D$ between $K_d$ regions.

Missing Cuts: In our previous analysis we considered a random genome, $G = (Z_1, \ldots, Z_N)$, of ordered restriction fragment lengths where $Z_i \sim \text{Geometric}(\gamma)$. We had assumed that the number of restriction fragments is known. However, in a practical setting we do not know the number of restriction fragments within the genome. We now consider a model where we know the size of the genome, $G$, with restriction sites laid down as a Poisson process with rate $\lambda$. We can again discretize the restriction fragment lengths as $\text{Geometric}(\gamma)$. The number of fragments is given by $N \sim \text{Poisson}(G/\lambda)$ according to Proposition 2.1.1.

From Proposition 2.1.3, restriction sites are observed with probability $p$, independent of other sites. Consider the $i$-th restriction fragment $Z_i$ along the genome, where we assume that the restriction site preceding $Z_i$ is cut. If the restriction site between $Z_i$ and $Z_{i+1}$ fails to be cut but the restriction site after $Z_{i+1}$ is cut, the fragments $Z_i$ and $Z_{i+1}$ are combined into a larger restriction fragment, $Y_i = Z_i + Z_{i+1}$.
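The merging of fragments under missing cuts can be sketched as follows. This is a minimal illustration, not code from the original work: the function name and the toy fragment lengths are made up, the input is assumed to be a non-empty list, and the two ends of the genome are treated as always present.

```python
import random

def apply_missing_cuts(fragments, p, seed=0):
    """Merge adjacent fragments whenever the site between them is not cut.

    Each interior restriction site is observed independently with probability p.
    """
    rng = random.Random(seed)
    merged = []
    current = fragments[0]
    for length in fragments[1:]:
        if rng.random() < p:      # site observed: close the current fragment
            merged.append(current)
            current = length
        else:                     # site missed: fragments fuse, Y = Z_i + Z_{i+1}
            current += length
    merged.append(current)
    return merged

true_fragments = [12, 7, 30, 5, 18, 9, 22, 14]   # made-up lengths
observed = apply_missing_cuts(true_fragments, p=0.8)
```

Merging never changes the total length, and the number of observed fragments equals the number of observed interior sites plus one, which is the sense in which an observed fragment is a sum of one or more true fragments.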
Consider the genome containing missing cuts, $G_{\text{missing cuts}} = (Y_1, \ldots, Y_M)$, with $M \le N$ restriction fragments derived from the original genome, $G$. Clearly, $M \sim \text{Binomial}(N, p) \mid N \sim \text{Poisson}(G/\lambda)$, as a restriction fragment is produced when a restriction site is cut. Each restriction fragment $Y_j$ from the genome containing missing cuts is composed of one or more fragments from the original genome. Let $Y_j = \sum_{k=1}^{r} Z_{k^*}$, where $Z_{k^*}$ denotes the starting restriction fragment along $G$. As $Z_{k^*} \sim \text{Geometric}(\gamma)$ and $Y_j$ is the sum of geometric random variables, it follows that $Y_j$ has a negative binomial distribution, $Y_j \sim \text{NegativeBinomial}(r, \gamma)$. For $r$, the number of restriction fragments from the original genome comprising $Y_j$, we know that, as each restriction site is cut with probability $p$ and not cut with probability $1-p$, $r$ follows a geometric distribution with probability $p$. We would like to derive the distribution of $Y_j$, which is given by the following lemma.

Lemma 2.2.6. If $Y \sim \text{NegativeBinomial}(r, \gamma) \mid r \sim \text{Geometric}(p)$, then $Y \sim \text{Geometric}(p\gamma)$.

Proof. We have

\[ P(Y = y) = \sum_{r=1}^{y} P(Y = y \mid r)\,P(r) = \sum_{r=1}^{y} \binom{y-1}{r-1} (1-\gamma)^{y-r}\,\gamma^{r}\,(1-p)^{r-1}\,p. \]

We note that $y \ge r$. We change variables using $n = y-1$ and $k = r-1$, so that

\[ P(Y = y) = p\gamma \sum_{k=0}^{n} \binom{n}{k} (1-\gamma)^{n-k} \left(\gamma(1-p)\right)^{k} = p\gamma \left(1 - \gamma + \gamma(1-p)\right)^{n} = p\gamma\,(1 - p\gamma)^{y-1}. \]

From Lemma 2.2.1, the probability that two fragments $Y_i$ and $Y_j$ match is $p\gamma/(2 - p\gamma)$. For $G_{\text{missing cuts}} = (Y_1, \ldots, Y_M)$, we have $E(M) = Gp/\lambda$. For large $G$ with $p\gamma/(2 - p\gamma) \ll G$, the number of matching fragments for $G_{\text{missing cuts}}$ is approximately Poisson with $\mu = (E(M))^2 \left(p\gamma/(2 - p\gamma)\right)$.

False Cuts: From Proposition 2.1.4, false cuts occur as a Poisson process with rate $\zeta$. Typical values are $\zeta \sim 0.005$, so that false cuts occur infrequently. Let $G = (Z_1, \ldots, Z_N)$ be a random genome of ordered restriction fragment lengths where $Z_i \sim \text{Geometric}(\gamma)$. Let $G_{\text{false cuts}} = \{Y_1, \ldots, Y_M\}$, consisting of $M$ fragments, be the genome containing false cuts.
To derive the distribution of $Y_i$, we note that, as false cuts are distributed as a Poisson process with rate $\zeta$, the distances $D$ between false cuts are exponentially distributed with rate $\zeta$ and $E(D) = 1/\zeta$. Recall that restriction sites are laid down as a Poisson process with rate $\lambda$. We can view the placement of both false cuts and true restriction sites as a Poisson process with rate $\zeta + \lambda$, so that the distances between observed sites are exponentially distributed with mean $1/(\zeta + \lambda)$. Discretizing the distance yields $Y_i \sim \text{Geometric}(\eta = 1 - e^{-(\zeta + \lambda)})$. Similar results hold for the number of matching fragments, where the probability that $Y_i$ and $Y_j$ match is given by $\eta/(2 - \eta)$.

To determine the number of false cuts, we first assume that false cuts occur between true restriction sites. For a particular fragment length $Z_i$, the number of false cuts, $K$, is distributed as $K \sim \text{Poisson}(\zeta Z_i)$ with $Z_i \sim \text{Geometric}(\gamma)$. The distribution is not straightforward to derive. Computing the expectation and variance gives

\[ E(K) = E(E(K \mid Z_i)) = \frac{\zeta}{\gamma} \]

and

\[ \mathrm{Var}(K) = \mathrm{Var}(E(K \mid Z_i)) + E(\mathrm{Var}(K \mid Z_i)) = \frac{\zeta^2 (1-\gamma)}{\gamma^2} + \frac{\zeta}{\gamma}. \]

With practical values of $\gamma \in [0.04, 0.08]$ and $\zeta \sim 0.005$, $E(K) \ll 1$ and $\mathrm{Var}(K) \ll 1$, so that most fragments do not contain any false cuts.

We can instead assume that the false cuts occur before digesting the DNA with the restriction enzyme. In a typical optical mapping experiment, extracted DNA is fragmented into smaller pieces to facilitate loading onto the glass surface. In this situation, the false cuts occur along the length of the fragmented DNA. We assume that the fragmentation process produces maps of size $L \sim \text{Gamma}(t, \kappa)$. This assumption is based on the fact that observed map sizes from an optical mapping experiment are composed of restriction fragments that are exponentially distributed. It is actually more appropriate to consider the sizes of the maps as coming from a truncated gamma distribution, as sizes below a certain threshold, $R$, are not observed.
We consider the simplified case where the gamma distribution is not truncated, since in most practical applications $P(L \le R)$ occurs with small probability. The number of false cuts, $C$, is distributed as $C \mid L \sim \mathrm{Poisson}(\zeta L)$. The distribution of $C$ is given by the next lemma.

Lemma 2.2.7. If $C \mid L \sim \mathrm{Poisson}(\zeta L)$ and $L \sim \mathrm{Gamma}(t,\kappa)$, then $C \sim \mathrm{NegativeBinomial}(t, 1/(1+\zeta\kappa))$.

Proof.
\[
f_C(c) = \int_0^\infty \frac{e^{-\zeta l}(\zeta l)^c}{c!}\,\frac{l^{t-1}e^{-l/\kappa}}{\Gamma(t)\kappa^t}\,dl
= \frac{\zeta^c}{c!\,\Gamma(t)\kappa^t}\int_0^\infty l^{(c+t)-1}e^{-(\zeta+\frac{1}{\kappa})l}\,dl
= \frac{1}{c!\,\Gamma(t)(\zeta\kappa)^t}\,\Gamma(c+t)\left(\frac{\zeta\kappa}{1+\zeta\kappa}\right)^{c+t}
\]
With $t$ a positive integer, we have
\[
f_C(c) = \binom{c+t-1}{c}\left(\frac{\zeta\kappa}{1+\zeta\kappa}\right)^{c}\left(\frac{1}{1+\zeta\kappa}\right)^{t}
\]
so that $C \sim \mathrm{NegativeBinomial}(t, 1/(1+\zeta\kappa))$.

Under this model, the number of false cuts within a map is distributed as a negative binomial. We would like to consider the total number of false cuts given a genome of size $G$. This requires us to determine the number of DNA maps produced for a given genome. We can make some simplifying assumptions regarding the distribution of the sizes of the maps. We can view them instead as exponentially distributed with rate parameter $\kappa'$. This model closely resembles real data since the truncated portion of the gamma distribution resembles a truncated exponential distribution. As the maps are now $L \sim \mathrm{Exponential}(\kappa')$ with $E(L) = \kappa'$, we can view the right end of each map being placed down along the genome as a Poisson process with rate $1/\kappa'$. The number of maps is distributed as $N \sim \mathrm{Poisson}(G/\kappa')$ with $G$ being the length of the genome. Each map has $C$ false cuts distributed as $C \sim \mathrm{Geometric}((1+\zeta\kappa')^{-1})$ from Lemma 2.2.7. The total number of false cuts, $F = C_1 + C_2 + \cdots + C_N$, is distributed as $F \mid N \sim \mathrm{NegativeBinomial}(N, q = (1+\zeta\kappa')^{-1})$ with $N \sim \mathrm{Poisson}(G/\kappa')$. This is a compound Poisson distribution, as a Poisson distributed number of variables enter the sum for $F$. We can calculate the expectation and variance of $F$ as
\[
E(F) = E(N)\,E(C) = \frac{G}{\kappa'}\,\frac{1}{q}
\]
and
\[
\mathrm{Var}(F) = E(N)\left(\mathrm{Var}(C) + E(C)^2\right) = \frac{G}{\kappa'}\left(\frac{1-q}{q^2} + \frac{1}{q^2}\right).
\]

To compute the number of matching fragments, we must also consider the number of observed true cut sites along with false cuts. We can view the placement of both false cuts and true restriction sites as a Poisson process with rate $\zeta + \lambda$, so that the total number of observed restriction sites per map is $O \sim \mathrm{Geometric}((1+(\zeta+\lambda)\kappa')^{-1})$. The total number of observed cut sites is given by $M = O_1 + \cdots + O_N$, distributed as $M \mid N \sim \mathrm{NegativeBinomial}(N, r = (1+(\zeta+\lambda)\kappa')^{-1})$ with $N \sim \mathrm{Poisson}(G/\kappa')$. We have already derived the distribution of fragment sizes as $\mathrm{Geometric}(\eta)$. The probability of a match between two fragments is $\eta/(2+\eta)$. We have for the genome $G_{\text{false cuts}} = (Y_1,\ldots,Y_M)$ that the expected number of matching fragments for large $G$ is approximately $E(M)^2\,(\eta/(2+\eta)) \approx (G(1-r)/\kappa r)^2\,(\eta/(2+\eta))$.

Missing and False Cuts: We would like to combine the two models of missing and false cuts into a single model that accounts for both of them. Let $G_{\text{missing \& false}}$ be the genome of missing and false cuts derived from a genome of size $G$. Recall that we now consider optical maps derived from the genome whose lengths are $L \sim \mathrm{Exponential}(\kappa)$ so that the number of maps is distributed as $T \sim \mathrm{Poisson}(G/\kappa)$. From Proposition 2.1.3, observed true restriction sites appear as a thinned Poisson process with rate $p/\lambda$. False cuts are laid down as a Poisson process with rate $\zeta$. Observed sites on an optical map consist of both observed true restriction sites and false cuts. We can view observed sites as a Poisson process with rate $\zeta + (p/\lambda)$. The number of observed sites per map is distributed as $N \sim \mathrm{Poisson}(L/(\zeta + (p/\lambda))) \mid L \sim \mathrm{Exponential}(\kappa)$. The marginal distribution of $N$ is $N \sim \mathrm{Geometric}(r = (1 + (\zeta + \kappa/(p/\lambda)))^{-1})$. The total number of fragments $M$ is given by the sum of the number of fragments in each individual map, where $M = N_1 + \cdots + N_T$ with $T \sim \mathrm{Poisson}(\nu = G/\kappa)$. This is a compound Poisson distribution, with the sum of geometric random variables forming the Pólya-Aeppli distribution (Johnson et al., 2005) given by
\[
P(M = m) = e^{-\nu}\sum_{n=1}^{m}\frac{\nu^n}{n!}\binom{m-1}{n-1}(1-r)^{m-n}r^n
= e^{-\nu}(1-r)^m\sum_{n=1}^{m}\binom{m-1}{n-1}\frac{(\nu r/(1-r))^n}{n!}
= e^{-\nu/(1-r)}\left(\frac{\nu r}{1-r}\right)(1-r)^m\,{}_1F_1\!\left(m+1;\,2;\,\frac{\nu r}{1-r}\right)
\]
where ${}_1F_1(a;b;z)$ is Kummer's confluent hypergeometric function, with $E(M) = \nu/r$ and $\mathrm{Var}(M) = \nu(2-r)/r^2$. Under this model, the fragment sizes of the maps are distributed as $\mathrm{Exponential}(\zeta + (p/\lambda))$ so that we can discretize the fragment sizes as $\mathrm{Geometric}(\gamma' = 1 - e^{-(\zeta + (p/\lambda))})$. For large $G$, the expected number of matching fragments can be computed as $\approx (\nu/r)^2\,(\gamma'/(2+\gamma'))$.

2.2.4 Region Matching with Sizing Errors, Missing and False Cuts

Motivation: In this section, we combine all of the optical mapping errors discussed previously into one model. We will discuss the implications of our results for optical mapping under this new model. We will again assume that we are given the genome size, $G$, with fragment lengths distributed as $X \sim \mathrm{Exponential}(\lambda)$ with $E(X) = \lambda$. We will also analyze our model in terms of maps derived from the genome.

Combined Model: In our combined model, we use the continuous version of the fragment sizes in order to incorporate sizing error. As before, we assume that we have a genome of size $G$ so that the number of restriction fragments is given by $N \sim \mathrm{Poisson}(G/\lambda)$ with fragment lengths given as $X \sim \mathrm{Exponential}(\lambda)$. We again assume that optical maps are derived from the genome as $L \sim \mathrm{Exponential}(\kappa)$, where we have determined the distribution of $M$, the number of fragments with both missing and false cuts, to be a Pólya-Aeppli distribution. Recall that observed sites are laid down as a Poisson process with rate $\zeta + (p/\lambda)$ so that observed fragment lengths are given as $Y \sim \mathrm{Exponential}(\theta = (\zeta + (p/\lambda))^{-1})$ with $E(Y) = \theta$. Results for matches between fragments using Definition 2.2.2 given by Lemma 2.2.5 then hold. The number of expected matching fragments for large $G$ is approximately $E(M)^2 \cdot p_\alpha(\sigma,\lambda) = (G/\kappa r)^2 \cdot p_\alpha(\sigma,\lambda)$ with $r = (1 + (\zeta + \kappa/(p/\lambda)))^{-1}$ and $p_\alpha(\sigma,\lambda)$ as the probability of a match between two fragments as given by Lemma 2.2.5 for $\alpha$-level matches.
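The moments quoted for the Pólya-Aeppli count, $E(M) = \nu/r$ and $\mathrm{Var}(M) = \nu(2-r)/r^2$, can be checked with a small simulation. The sketch below is illustrative only and is not part of the original analysis: it assumes each per-map count is geometric on $\{1, 2, \ldots\}$ with success probability $r$ (the convention under which the quoted moments hold), and the values of `nu` and `r` are arbitrary placeholders rather than fitted parameters.

```python
import math
import random


def polya_aeppli_mean_var(nu, r):
    """Closed-form moments quoted in the text: nu/r and nu(2 - r)/r^2."""
    return nu / r, nu * (2.0 - r) / r ** 2


def sample_poisson(nu, rng):
    """Draw Poisson(nu) by counting unit-rate exponential arrivals up to time nu."""
    count, t = 0, rng.expovariate(1.0)
    while t <= nu:
        count += 1
        t += rng.expovariate(1.0)
    return count


def sample_geometric(r, rng):
    """Draw a geometric variable on {1, 2, ...} with success probability r."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is finite
    return 1 + int(math.log(u) / math.log(1.0 - r))


def sample_total_fragments(nu, r, rng):
    """M = N_1 + ... + N_T with T ~ Poisson(nu) and N_i ~ Geometric(r)."""
    return sum(sample_geometric(r, rng) for _ in range(sample_poisson(nu, rng)))


def simulate_moments(nu, r, trials=20000, seed=1):
    """Sample mean and sample variance of M over repeated draws."""
    rng = random.Random(seed)
    draws = [sample_total_fragments(nu, r, rng) for _ in range(trials)]
    mean = sum(draws) / trials
    var = sum((d - mean) ** 2 for d in draws) / (trials - 1)
    return mean, var


if __name__ == "__main__":
    nu, r = 20.0, 0.25  # placeholder values
    print("closed form:", polya_aeppli_mean_var(nu, r))
    print("simulated:  ", simulate_moments(nu, r))
```

With enough trials the simulated mean and variance should settle near the closed-form values; a persistent gap would suggest that a different geometric convention (support starting at 0) is in use.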
We are also interested in $k$-fragment regions using optical maps derived from the genome. We note that similar results apply, except that the number of $k$-fragment regions is given by $M - k + 1$ with $M$ distributed as a random variable. We can apply Lemma 2.2.3 where we set
\[
\lambda_k = p^k + (E(M) - k)(1-p)\,p^k
\]
where $p$ is the probability of a match between two fragments, while we use $E(M) = G/\kappa r$ with $r = (1 + (\zeta + \kappa/(p/\lambda)))^{-1}$ as the number of places where a match can start for a $k$-fragment region. Similar results hold for the calculations of $K_d$ regions using our approximations.

2.2.5 Conclusion

In this chapter, we have analyzed different statistical models for optical mapping data and their implications. Our analysis shows that even with the given errors of optical mapping as described by our statistical model, regions of the genome can still be identified based on the sizes of the ordered restriction fragments derived from it. We first examined a given genome based on the number of matching fragments. Our definition of a match incorporated the sizing error associated with optical mapping. Even with sizing error, for practical values used in optical mapping, the probability of a match between fragments resembles that of DNA sequences uniformly distributed over the nucleotide bases. We also generalized these results to regions composed of $k$ fragments and applied the Chen-Stein theorem to derive a "declumped" version of matching regions. The "declumped" matching region analysis led us to the idea of a $K_d$ region, a region matching to at most $d$ other clumps within the genome. These are interesting because they act as unique features within a genome in terms of restriction fragments that can be used to identify one region from another. Their presence allows for an optical map derived from that particular region to contain a unique restriction fragment sequence that, in theory, should match to few other regions along the genome.
We also examined the distribution of missing and false cuts as they affect the number and size of restriction fragments for a given genome. For this treatment, we consider optical maps derived from a genome for a typical optical mapping experiment. We note that we assume optical maps can be derived from all regions of the genome with equal likelihood. However, since optical maps are composed of restriction fragments, we could view them as being laid down as a thinned Poisson process if we assume that the initial restriction site is from the original genome. Thus, regions of the genome with a higher density of restriction sites have more optical maps derived from them. We use the simplified model where maps are assumed to be derived from the genome with lengths exponentially distributed. This allows us to determine the distribution of the restriction fragments and their sizes with both missing and false cuts. Even when all errors are accounted for, the number of matching regions does not increase based on our model. Although the fragment size distribution as well as the distribution of the number of fragments is different, the probability of matching regions changes with respect to the errors so that the number of matching regions is only slightly changed. Our overall analysis demonstrates the feasibility of optical mapping and the usage of restriction fragments to identify genomic regions.

Chapter 3

Alignment of Optical Maps

In this chapter, we discuss some aspects involved in the alignment of optical maps. One of the key issues is the alignment score function that captures the degree of similarity between restriction sites and fragment lengths. In Section 3.1, we explore a score function optimized for recovering alignments for de novo assembly. Based on the new scoring function, we validate its performance versus previous scoring functions. In Section 3.2 we examine a different topic that also deals with alignment. We explore techniques for aligning optical mapping data against FPC maps.
FPC maps are made using traditional restriction mapping techniques, requiring the development of algorithms that allow for the comparison of optical mapping data to traditional restriction maps.

3.1 Map Alignment

3.1.1 Introduction

One of the main issues for the analysis of optical mapping data is alignment. Map alignment refers to finding a correspondence between restriction fragment sites and lengths between two optical maps. In many ways, map alignment resembles sequence alignment, so that formulations using dynamic programming for sequence alignment are applicable. This has allowed algorithms used for sequence alignment to be adapted for map alignment. As discussed in Section 1.3.2, many different types of alignments are possible. In sequence alignment, pairwise alignment is used to infer functional, structural, or evolutionary events between sequences. These same events can be detected between restriction maps using map alignment. Thus, we first focus on the problem of pairwise alignment.

A critical component of the dynamic programming algorithm is the scoring function used to assign a numerical value to the "goodness" of fit of regions between two maps. The scoring function should be able to differentiate between regions that truly align versus regions that have no association. The score function must also take into account the associated errors of optical mapping measurements that can ambiguously cause regions between maps to exhibit similarity when no true correspondence exists.

In this section, we design a new scoring function based on previous approaches. Our scoring function is tuned to deal with the problem of map alignment for de novo assembly and as such may not be applicable for other purposes. In de novo assembly, aligned maps are assumed to have originated from the same genomic region. The goal was to implement a simple scoring function that would optimize the number of true alignments found and lend itself to other techniques for faster computation of alignments.

Dynamic Programming Algorithms: The first algorithm for the alignment of ordered restriction maps dates back to Waterman et al. (1984). A general dynamic programming framework for the calculation of optimal restriction map alignments can be solved by the optimization of an alignment score. We give a brief overview of the algorithm. Using the notation for maps described in Section 2.1.2, an alignment between $x$ and $y$ can be viewed as an ordered set of index pairs
\[
\Pi = \left(\binom{i_1}{j_1}, \binom{i_2}{j_2}, \ldots, \binom{i_d}{j_d}\right)
\]
where the indices indicate correspondence between cut sites $s_{i_l}$ and $t_{j_l}$ for $l = 1,\ldots,d$ with $0 < i_1 < \cdots < i_d < n$ and $0 < j_1 < \cdots < j_d < m$. Each successive pair of aligned index pairs $\left(\binom{i_l}{j_l}, \binom{i_{l+1}}{j_{l+1}}\right)$ for $l = 1,\ldots,d-1$ denotes the $l$-th matched block between optical maps. An alignment consists of $d-1$ matched blocks. The $l$-th matched block has lengths $\bar{x}_l = s_{i_{l+1}} - s_{i_l}$ and $\bar{y}_l = t_{j_{l+1}} - t_{j_l}$ and consists of $n_l = i_{l+1} - i_l$ and $m_l = j_{l+1} - j_l$ fragments respectively.

The dynamic programming recursion was given in (2.1) with $X(s_i - s_g, t_j - t_h, i - g, j - h)$ as the alignment extension score for the segment of the alignment $(g,h) \to (i,j)$. The alignment score maximization has $O(t \cdot n^2 m^2)$ time complexity (with $t$ being the time to compute the alignment extension score) and $O(nm)$ storage requirements. Aligned sites usually occur within a limited number of sites of a given position, so it is usually reasonable to limit the search space for each aligned pair by only exploring at most $\delta$ steps in each direction. This restricted version has time complexity $O(t \cdot \delta^2 nm)$ and is formulated as
\[
S(i,j) = \max_{(0 \vee i-\delta) \le g < i,\ (0 \vee j-\delta) \le h < j} \left\{ S(g,h) + X(s_i - s_g,\, t_j - t_h,\, i - g,\, j - h) \right\}. \tag{3.1}
\]
Algorithm 3.1.1 gives the dynamic programming algorithm to compute the restricted version of the map alignment for determining overlap between a pair of maps. In overlap alignment, we want to align a prefix of one map to a suffix of another map. This type of alignment is used during de novo assembly since we want to find the overlaps between maps in order to assemble them.
Other types of alignments are possible depending upon the initialization of the scoring matrix and the region of the scoring matrix scanned to find the optimal alignment.

Huang and Waterman (1992) gave an extended version of the algorithm that dealt with the errors of traditional restriction mapping methods, where closely spaced sites for different enzymes are ordered incorrectly and closely spaced sites for the same enzyme map to a single site. Myers and Huang (1992) gave an improved version of the original algorithm with computational complexity $O(nm\log(nm))$.

Algorithm 3.1.1: Dynamic programming algorithm for alignment of two restriction maps. (The score matrix is written $S$ and the extension score $X$, matching recursion (3.1).)

Data: Pair of maps, $S(x) = (s_0,\ldots,s_n)$ and $S(y) = (t_0,\ldots,t_m)$.
Initialization (overlap): $S(i,0) \leftarrow 0$, $i = 1,\ldots,n$; $S(0,j) \leftarrow 0$, $j = 0,\ldots,m$.

    for i ← 1 to n do
        for j ← 1 to m do
            v ← −∞
            for g ← max(0, i−δ) to i−1 do
                for h ← max(0, j−δ) to j−1 do
                    v ← max{v, S(g,h) + X(s_i − s_g, t_j − t_h, i−g, j−h)}
                end for
            end for
            S(i,j) ← v
        end for
    end for

Although these algorithms can be successfully applied to optical mapping data, a critical component in order for them to be effective is the choice of the scoring function, $X(x = s_i - s_g,\ y = t_j - t_h,\ n = i - g,\ m = j - h)$. Waterman et al. (1984) defined the score function as
\[
X(x,y,n,m) = \upsilon - \lambda(n + m - 2) - \mu|x - y|
\]
for specified positive constants $\upsilon$, $\lambda$, and $\mu$. The score can be interpreted as follows: $\upsilon$ rewards each pair of matching sites, $\lambda$ penalizes each unaligned site, and $\mu$ is the multiplier for a linear sizing discrepancy penalty. Although this score has an easily understandable formulation, the best choice of alignment parameters is dependent upon the preferences of the user.

Valouev et al. (2006b) gave a formulation of the alignment score using a likelihood based method derived from the statistical distributions of optical map measurements.
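Before moving on to likelihood-based scores, the restricted recursion (3.1) combined with the linear score of Waterman et al. (1984) can be sketched in a few lines. This is a minimal illustration and not the implementation used in this work: the parameter values for `upsilon`, `lam`, and `mu` are placeholders, and scanning the last row and last column for the best overlap score is an assumption made here for the overlap case.

```python
def waterman_score(x, y, n, m, upsilon=1.0, lam=0.5, mu=0.1):
    """Linear block score: reward a matched site pair, penalise unaligned
    sites and the size discrepancy. Parameter values are placeholders."""
    return upsilon - lam * (n + m - 2) - mu * abs(x - y)


def overlap_alignment_scores(s, t, delta, score=waterman_score):
    """Fill the score matrix S of recursion (3.1).

    s and t are lists of cut-site positions (s[0], ..., s[n]) and
    (t[0], ..., t[m]); delta bounds how far back each step may look.
    """
    n, m = len(s) - 1, len(t) - 1
    # Row 0 and column 0 stay at 0 so an overlap may start at the left
    # end of either map without penalty.
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = float("-inf")
            for g in range(max(0, i - delta), i):
                for h in range(max(0, j - delta), j):
                    cand = S[g][h] + score(s[i] - s[g], t[j] - t[h], i - g, j - h)
                    best = max(best, cand)
            S[i][j] = best
    return S


def best_overlap_score(S):
    """Best score over the last row and last column (assumed end region)."""
    n, m = len(S) - 1, len(S[0]) - 1
    return max(max(S[n][1:]), max(S[i][m] for i in range(1, n + 1)))


if __name__ == "__main__":
    sites = [0, 10, 25, 40]  # toy map compared against itself
    S = overlap_alignment_scores(sites, sites, delta=2)
    print(best_overlap_score(S))
```

The four nested loops make the $O(t \cdot \delta^2 nm)$ cost visible: the extension score is evaluated at most $\delta^2$ times per matrix cell, which is why the cost of a single score evaluation matters so much in what follows.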
Using this formulation, the alignment score is defined as the log ratio of likelihoods under two hypotheses: the null hypothesis $H_0$, under which the given pair of optical maps are independent of each other and therefore share no correspondence, and the alternative hypothesis $H_1$, under which the optical maps represent the same genomic region and thus should be aligned. The alignment score is given as
\[
X(x,y,n,m) = \log\left(\frac{f_{H_0}(x,y,n,m)}{f_{H_1}(x,y,n,m)}\right)
\]
where $f_{H_0}(x,y,n,m)$ and $f_{H_1}(x,y,n,m)$ define statistical models corresponding to the two hypotheses respectively. The maximum score given by the dynamic programming recursion will correspond to the maximum distance between $H_0$ and $H_1$ and therefore discriminate against spurious alignments in favor of correct alignments. This scoring function has the benefit of not requiring ad hoc parameters as well as being derived directly from the statistical distributions of optical mapping measurements, taking into account the associated errors.

However, although based on a statistical framework that dealt with the specific features of optical mapping, the likelihood score function has some problems and issues. These include:

1. Evaluation of complicated likelihood function. The scoring function requires the evaluation of a likelihood function that is expensive computationally. Although some of the scores can be pre-computed to avoid redundant calculations, most of the scores cannot be, requiring full evaluation of the score function during the dynamic programming recursion.

2. Decomposition of the likelihood score function. Valouev (2006) notes that for practical reasons, the likelihood scoring function is decomposed into a size score and a site score component. The size score explains the discrepancy in sizes between the regions on the maps being considered, while the site score deals with the number of aligned and unaligned sites. During the dynamic programming algorithm, the actual score maximized is the size score function that deals only with the sizes of matching regions, while the site score is used to screen possible aligned regions for spurious alignments. This assumes the sizes of matching regions vary significantly while limiting the number of unaligned sites for a given matched region. However, this assumption is not always correct and often leads to more spurious alignments, since the size score is more permissive using the likelihood approach in terms of allowing matching regions to differ.

3. Score thresholds during alignment. Various score thresholds are utilized during the dynamic programming recursion to eliminate regions of the scoring matrix that lead to bad alignments. The thresholds speed up the algorithm by reducing the number of computations needed to calculate the alignment. However, the thresholds often remove regions of the alignment that correspond to true alignments. Furthermore, a final score alignment threshold is applied at the conclusion of the algorithm to filter out spurious alignments. The final score threshold does decrease the number of incorrect alignments, at the cost of limiting the total number of alignments reported. The difficulty of assembling a genome de novo increases as not enough alignments are available to reconstruct the entire genome.

3.1.2 Optical Match Score Function

Motivation: Due to the deficiencies of the likelihood scoring function, we explored an alternative scoring function optimized for finding true alignments for de novo assembly. Our scoring function is based on ideas developed by Yang (2005), using the statistical distribution of the optical mapping measurements in a more straightforward manner. The need for a linear score function was based on two requirements. One was the need for speed, since the score function is called repeatedly to evaluate possible aligned regions during the dynamic programming recursion.
The other requirement was a score function whose sensitivity at recovering alignments could be optimized depending upon the application. A score function that is close to linear is ideal for this situation since it can be quickly calculated and avoids the expensive computations of probabilities in the likelihood based scoring approach. Although the score function that we develop is not entirely linear, it will be substantially easier to compute than the likelihood score function.

An alignment between two maps will be given by the minimum partition of the alignment into blocks corresponding to minimal matching regions between compared maps. The minimal partition is found by maximizing the alignment score, and thus mismatched sites are penalized. An alignment block is defined as two matching regions of maps flanked by matching restriction sites that do not contain any internal matching sites. As discussed in Section 2.2, the score function is defined as $X(x_1, x_2, n, m)$ where $x_1, x_2$ refer to the lengths of the regions being considered and $n, m$ the number of fragments for each region respectively. The score function can be decomposed into two components: the size score that measures the degree of similarity between the lengths $x_1, x_2$ and the site score that assigns a value to the number of fragments $n, m$.

Our outline is as follows. We first examine how to measure the similarity between sizes $x_1, x_2$. To do so, we formulate two competing hypotheses: the null hypothesis, corresponding to no alignment, where the two regions are independent of each other, and the alternative hypothesis, where the two regions are derived from some common region. This will lead us to the definition of the D-size statistic for quantifying the degree of similarity of sizes $x_1, x_2$. We derive the distribution of the D-size statistic under both the null and alternative hypothesis and compare them against each other. We define our score function by utilizing the D-size statistic directly. We then incorporate the number of fragments $n, m$ in our score function.
The null distribution of the score function is examined as it can be used to speed up the dynamic programming recursion. We then discuss the parameters involved in our score function. Finally, we compare our score function with previous score functions and evaluate its performance.

Size Comparison

We have previously described the statistical model for optical mapping in Section 2.1.2. Based on the model, the following can be shown (proofs are given in Valouev et al. (2006b), to which we refer the reader).

Lemma 3.1.1. Fragment sizes underlying matching regions between two optical maps have exponential density with mean $\phi = \lambda/p^2$.

We first focus on the comparison of sizes of regions from the optical maps. We focus on the simple case of both regions consisting of a single fragment. Let $X_1$ and $X_2$ be the sizes of the regions being considered. We can view the size score under two competing hypotheses: $H_0$ and $H_1$. Although this follows the likelihood based approach, examining computed scores under the two hypotheses allows us further insight in finding a suitable scoring function. Under the null hypothesis, $H_0$, there is no dependence between the two regions and thus they are independent. This corresponds to $X_1, X_2 \sim \mathrm{Exponential}(\theta)$ with $E(X_i) = \theta$, distributed independently of each other, from Lemma 2.2.4. Under the alternative hypothesis, $H_1$, the two regions are derived from the same genomic region of unknown true size $Y$, where $Y \sim \mathrm{Exponential}(\phi)$ and $E(Y) = \phi$ from Lemma 3.1.1. Thus, we have that $X_1 = Y + \varepsilon_1$ and $X_2 = Y + \varepsilon_2$, where $\varepsilon_1, \varepsilon_2 \sim \mathrm{Normal}(0, \sigma^2 Y)$ due to measurement error, and $X_1$ and $X_2$ are conditionally independent given $Y$. The two hypotheses are
\[
H_0:\ X_1, X_2 \sim \mathrm{Exponential}(\theta),\quad X_1 \perp X_2
\]
\[
H_1:\ X_1 = Y + \varepsilon_1,\ X_2 = Y + \varepsilon_2,\quad \varepsilon_1, \varepsilon_2 \sim \mathrm{Normal}(0, \sigma^2 Y) \mid Y \sim \mathrm{Exponential}(\phi),\quad X_1 \perp X_2 \mid Y.
\]
We first show the following.

Lemma 3.1.2. If $X_1, X_2 \sim \mathrm{Exponential}(\theta)$ with $X_1 \perp X_2$, then $Z = |X_1 - X_2|$ is exponentially distributed with the same rate, $\theta$.

Proof. See Johnson and Kotz (1970).
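Lemma 3.1.2 is easy to check empirically. The following sketch is an illustration added for the reader, not the cited proof: it draws pairs of independent exponentials with mean $\theta$ and compares the sample mean, sample variance, and one tail probability of $|X_1 - X_2|$ with the values implied by an exponential distribution with mean $\theta$. The choice $\theta = 15$ is an arbitrary placeholder.

```python
import math
import random


def simulate_abs_difference(theta, trials=100000, seed=0):
    """Draw Z = |X1 - X2| for independent exponentials with mean theta."""
    rng = random.Random(seed)
    rate = 1.0 / theta  # random.expovariate takes a rate, i.e. 1 / mean
    return [abs(rng.expovariate(rate) - rng.expovariate(rate))
            for _ in range(trials)]


def summarize(values, threshold):
    """Sample mean, sample variance, and fraction of values above threshold."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    tail = sum(1 for v in values if v > threshold) / n
    return mean, var, tail


if __name__ == "__main__":
    theta = 15.0  # placeholder mean fragment size
    mean, var, tail = summarize(simulate_abs_difference(theta), threshold=theta)
    print(f"mean      {mean:8.3f}   expected {theta:8.3f}")
    print(f"variance  {var:8.3f}   expected {theta ** 2:8.3f}")
    print(f"P(Z > θ)  {tail:8.4f}   expected {math.exp(-1.0):8.4f}")
```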
Under the null hypothesis, $Z_0 = |X_1 - X_2|$ with $X_1, X_2 \sim \mathrm{Exponential}(\theta)$; then from Lemma 3.1.2, $Z_0 \sim \mathrm{Exponential}(\theta)$ so that $E(Z_0) = \theta$ and $\mathrm{Var}(Z_0) = \theta^2$. For the alternative hypothesis, we have that $Z_1 = |(Y + \varepsilon_1) - (Y + \varepsilon_2)| = |\varepsilon_1 - \varepsilon_2|$ where $\varepsilon_1 - \varepsilon_2 \sim \mathrm{Normal}(0, 2\sigma^2 Y) \mid Y$ with $Y \sim \mathrm{Exponential}(\phi)$. We show the following.

Lemma 3.1.3. If $Z_1 = |(Y + \varepsilon_1) - (Y + \varepsilon_2)| = |\varepsilon_1 - \varepsilon_2|$ where $\varepsilon_1 - \varepsilon_2 \sim \mathrm{Normal}(0, 2\sigma^2 Y) \mid Y$ with $Y \sim \mathrm{Exponential}(\phi)$, then $E(Z_1) = \sigma\sqrt{\phi}$ and $\mathrm{Var}(Z_1) = \sigma^2\phi$.

Proof. We first note that $Z_1 \mid Y$ follows a half-normal distribution, since $\varepsilon_1 - \varepsilon_2 \sim \mathrm{Normal}(0, 2\sigma^2 Y) \mid Y$ with $Y \sim \mathrm{Exponential}(\phi)$ and we take the absolute value of the normal distribution. Thus $Z_1 \mid Y \sim N_{1/2}(2\sigma^2 Y)$, the half-normal distribution with $E(Z_1 \mid Y) = \frac{2\sigma}{\sqrt{\pi}}\sqrt{Y}$ and $\mathrm{Var}(Z_1 \mid Y) = 2\sigma^2 Y\left(1 - \frac{2}{\pi}\right)$. We note that given $Y \sim \mathrm{Exponential}(\phi)$ we have that $\sqrt{Y} \sim \mathrm{Weibull}(2, \sqrt{\phi})$ with $E(\sqrt{Y}) = \sqrt{\phi}\,\Gamma(3/2)$ and $\mathrm{Var}(\sqrt{Y}) = \phi\left[\Gamma(2) - \Gamma^2(3/2)\right]$. Using conditional expectation and variance,
\[
E(Z_1) = E(E(Z_1 \mid Y)) = \frac{2\sigma}{\sqrt{\pi}}\,E(\sqrt{Y}) = \sigma\sqrt{\phi}
\]
and
\[
\mathrm{Var}(Z_1) = \mathrm{Var}(E(Z_1 \mid Y)) + E(\mathrm{Var}(Z_1 \mid Y))
= \frac{4\sigma^2}{\pi}\,\mathrm{Var}(\sqrt{Y}) + 2\sigma^2\left(1 - \frac{2}{\pi}\right)E(Y)
= \sigma^2\phi\left[\frac{4}{\pi}\left(1 - \frac{\pi}{4}\right) + 2\left(1 - \frac{2}{\pi}\right)\right]
= \sigma^2\phi.
\]

From Lemma 3.1.3, we can see that $E(Z_1) \le E(Z_0)$ and $\mathrm{Var}(Z_1) \le \mathrm{Var}(Z_0)$. The idea is to incorporate the absolute difference of the two sizes $Z = |X_1 - X_2|$ as a measure of whether the two regions are independent of each other or are derived from the same genomic region. Under the null hypothesis, the absolute difference is due to random measured regions with no association, where the difference is between two i.i.d. exponentials. Under the alternative hypothesis, however, we are examining the difference between two normal random variables associated with the measurement error of the underlying size that the two regions are derived from. We expect the absolute difference between two regions under the null hypothesis to be much larger than under the alternative hypothesis based on our assumptions.
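The moments in Lemma 3.1.3 can be checked with a short simulation under the alternative hypothesis. The sketch below is illustrative and is not the code used to produce any figure in this work; the values of $\phi$ and $\sigma$ are placeholders chosen from the ranges considered in this section. It draws a true size $Y$, adds two independent measurement errors with variance $\sigma^2 Y$, and summarizes $Z_1 = |X_1 - X_2|$.

```python
import math
import random


def simulate_z1(phi, sigma, trials=100000, seed=0):
    """Draw Z1 = |X1 - X2| with X_i = Y + eps_i, Y ~ Exponential(mean phi),
    and eps_i ~ Normal(0, sigma^2 * Y) independently given Y."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        y = rng.expovariate(1.0 / phi)   # true underlying size
        sd = sigma * math.sqrt(y)        # sizing error scales with sqrt(Y)
        x1 = y + rng.gauss(0.0, sd)
        x2 = y + rng.gauss(0.0, sd)
        values.append(abs(x1 - x2))
    return values


def summarize(values, threshold):
    """Sample mean, sample variance, and fraction of values above threshold."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    tail = sum(1 for v in values if v > threshold) / n
    return mean, var, tail


if __name__ == "__main__":
    phi, sigma = 15.0, 0.5  # placeholder values
    target = sigma * math.sqrt(phi)
    mean, var, tail = summarize(simulate_z1(phi, sigma), threshold=target)
    print(f"mean     {mean:.4f}  vs sigma*sqrt(phi) = {target:.4f}")
    print(f"variance {var:.4f}  vs sigma^2*phi     = {sigma ** 2 * phi:.4f}")
    print(f"P(Z1 > mean) {tail:.4f}  vs exp(-1) = {math.exp(-1.0):.4f}")
```

The tail fraction is included because a variance equal to the squared mean, together with a tail probability near $e^{-1}$ at the mean, is what one would expect if $Z_1$ behaves like an exponential variable.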
The distribution of $Z$ under the null hypothesis is easily derived, while the distribution under the alternative hypothesis is difficult to solve for analytically. We notice that under the alternative hypothesis the variance is the mean squared, suggesting an exponential distribution with mean $\sigma\sqrt{\phi}$. Simulation experiments shown in Figure 3.1 confirm that the resulting distribution is approximately exponential, so that $Z_1 \approx \mathrm{Exponential}(\sigma\sqrt{\phi})$.

Although we have determined the distributions of the absolute difference of the sizes, there are still some problems in using it solely as a measure that two regions should be aligned. Since the variance of the normally distributed measurement error scales with the size of the fragment, larger fragments have larger sizing errors. This results in a larger absolute difference that might be incorrectly attributed to an unaligned region. To correct for this, we want to measure the absolute difference in terms of the number of standard deviations away from the true size of the genomic region.

Under the null hypothesis, both sizes are independent and thus there is no underlying true size. Under the alternative hypothesis, we have as the true size the unknown value $Y \sim \mathrm{Exponential}(\phi)$. Measured fragment sizes are normally distributed as $X_i \sim \mathrm{Normal}(Y, \sigma^2 Y)$ with standard deviation $\sigma\sqrt{Y}$. We expect that measured sizes $X_1$ and $X_2$ should not deviate from the true size $Y$ substantially, so that $|X_1 - Y|$ and $|X_2 - Y|$ are small and correspondingly $|X_1 - X_2|$ should be small as well. We divide by the standard deviation $\sigma\sqrt{Y}$ to determine the number of standard deviations apart the sizes $X_1$ and $X_2$ are from each other.

[Figure 3.1: QQ-plot of standardized $Z_1(\phi,\sigma)$ (vertical axis) against normal theoretical quantiles (horizontal axis).]

Figure 3.1: Values of $Z_1 = |X_1 - X_2|$ were obtained by simulating $X_1 = Y + \varepsilon_1$, $X_2 = Y + \varepsilon_2$ with $\varepsilon_1, \varepsilon_2 \sim \mathrm{Normal}(0, \sigma^2 Y) \mid Y \sim \mathrm{Exponential}(\phi)$. We simulated $n = 10000$ values of $Y$ with $\phi = \{12, 13, \ldots, 20\}$ and corresponding $\varepsilon_1, \varepsilon_2$ with $\sigma = \{0.1, 0.2, \ldots, 1\}$. Let $Z_1(\phi,\sigma)$ denote the sum of the absolute differences for given $\phi, \sigma$ over the $n = 10000$ values of $Z_1$. The estimated distribution is $Z_1 \sim \mathrm{Exponential}(\sigma\sqrt{\phi})$ with $E(Z_1) = \sigma\sqrt{\phi}$ and $\mathrm{Var}(Z_1) = \sigma^2\phi$. By the CLT, we have that $(Z_1(\phi,\sigma) - n\sigma\sqrt{\phi})/(\sqrt{n}\,\sigma\sqrt{\phi}) \to \mathrm{Normal}(0,1)$, and so we can produce a QQ-plot against normal quantiles to compare. The resulting plot indicates a good fit to the estimated distribution for $Z_1$.

Fragment Size Estimation

As we do not know the true size, we must estimate it given sizes $X_1, X_2$. For the null hypothesis, both sizes are independent of each other and no true underlying size is assumed to exist. Under the alternative hypothesis, $X_1 = Y + \varepsilon_1$, $X_2 = Y + \varepsilon_2$ with $\varepsilon_1, \varepsilon_2 \sim \mathrm{Normal}(0, \sigma^2 Y) \mid Y \sim \mathrm{Exponential}(\phi)$, with $Y$ as the underlying true size. We have that $X_1, X_2 \sim \mathrm{Normal}(Y, \sigma^2 Y)$ and both are conditionally independent given $Y$. We can consider the likelihood function and corresponding log likelihood function of $Y$ given sizes $X_1$ and $X_2$ as
\[
L(y \mid x_1, x_2, \sigma^2) = \frac{1}{2\pi\sigma^2 y}\,\exp\!\left(-\frac{1}{2\sigma^2 y}\left((x_1 - y)^2 + (x_2 - y)^2\right)\right),\quad x_1, x_2 > 0
\]
\[
\log L(y \mid x_1, x_2, \sigma^2) = -\log 2\pi - \log \sigma^2 y - \frac{1}{2\sigma^2 y}\left[(x_1 - y)^2 + (x_2 - y)^2\right].
\]
Differentiating with respect to $y$ gives
\[
\frac{\partial}{\partial y}\log L(y \mid x_1, x_2, \sigma^2) = \frac{x_1^2 + x_2^2 - 2y(\sigma^2 + y)}{2\sigma^2 y^2}.
\]
Setting the derivative to 0 and solving for $y$ yields
\[
y = \frac{-\sigma^2 \pm \sqrt{2(x_1^2 + x_2^2) + \sigma^4}}{2}.
\]
As we assume that $y > 0$, we have as our maximum likelihood estimate (MLE)
\[
\hat{y}_{MLE} = \frac{1}{2}\left(\sqrt{2(x_1^2 + x_2^2) + \sigma^4} - \sigma^2\right).
\]
We note that the MLE is not the sample mean, since the unknown size $Y$ is involved in both the mean and variance of $X_1, X_2$. To verify that the MLE is a global maximum, taking the second derivative we have
\[
\frac{\partial^2}{\partial y^2}\log L(y \mid x_1, x_2, \sigma^2) = -\frac{x_1^2 + x_2^2 + \sigma^2 y}{\sigma^2 y^3}.
\]
If we evaluate the second derivative at $\hat{y}_{MLE}$,
\[
\left.\frac{\partial^2}{\partial y^2}\log L(y \mid x_1, x_2, \sigma^2)\right|_{y = \hat{y}_{MLE}}
= -\frac{x_1^2 + x_2^2 + \sigma^2\left(\sqrt{2(x_1^2 + x_2^2) + \sigma^4} - \sigma^2\right)}{\frac{\sigma^2}{4}\left(\sqrt{2(x_1^2 + x_2^2) + \sigma^4} - \sigma^2\right)^3} < 0,
\]
so $\hat{y}_{MLE}$ is a global maximum, since $\sigma \ll x_1, x_2$ and $\sigma, x_1, x_2 > 0$. Calculating the expectation of the MLE gives
\[
E(\hat{y}_{MLE}) = \frac{1}{2}\,E\!\left(\sqrt{2(x_1^2 + x_2^2) + \sigma^4}\right) - \frac{\sigma^2}{2}.
\]
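The closed-form estimate $\hat{y}_{MLE}$ can be sanity-checked numerically. The sketch below is an illustration under the model stated above, not part of the original derivation: it evaluates the log likelihood and its first derivative exactly as written in this section, and confirms for one arbitrary pair of sizes that the derivative vanishes at $\hat{y}_{MLE}$ and that the log likelihood is lower on either side. The sizes and $\sigma$ are placeholder values.

```python
import math


def log_likelihood(y, x1, x2, sigma):
    """log L(y | x1, x2, sigma^2) for X_i ~ Normal(y, sigma^2 * y)."""
    s2y = sigma ** 2 * y
    return (-math.log(2.0 * math.pi) - math.log(s2y)
            - ((x1 - y) ** 2 + (x2 - y) ** 2) / (2.0 * s2y))


def dlog_likelihood(y, x1, x2, sigma):
    """First derivative of the log likelihood with respect to y."""
    return ((x1 ** 2 + x2 ** 2 - 2.0 * y * (sigma ** 2 + y))
            / (2.0 * sigma ** 2 * y ** 2))


def mle_size(x1, x2, sigma):
    """Positive root of the score equation: the maximum likelihood estimate."""
    return 0.5 * (math.sqrt(2.0 * (x1 ** 2 + x2 ** 2) + sigma ** 4) - sigma ** 2)


if __name__ == "__main__":
    x1, x2, sigma = 10.0, 12.0, 0.3  # placeholder measurements
    y_hat = mle_size(x1, x2, sigma)
    print("MLE:", y_hat, " sample mean:", 0.5 * (x1 + x2))
    print("derivative at MLE:", dlog_likelihood(y_hat, x1, x2, sigma))
    for y in (y_hat - 0.05, y_hat, y_hat + 0.05):
        print(f"log L({y:.4f}) = {log_likelihood(y, x1, x2, sigma):.6f}")
```

For these inputs the estimate lies very close to, but not exactly at, the sample mean, which illustrates the remark that the two estimators differ because $y$ enters both the mean and the variance.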
In order to evaluate the expectation of the MLE, we must compute the expectation of $m = \sqrt{2(x_1^2 + x_2^2) + \sigma^4}$ with $x_1, x_2 \sim \mathrm{Normal}(y, \sigma^2 y)$. To do so, we make the following transformations, where we initially have
\[
\frac{m^2 - \sigma^4}{2} = x_1^2 + x_2^2.
\]
Since $x_1, x_2 \sim \mathrm{Normal}(y, \sigma^2 y)$, it follows that $(x_1^2 + x_2^2)/\sigma^2 y$ has a noncentral chi-square distribution, where $(x_1^2 + x_2^2)/\sigma^2 y \sim \chi^2_2(\lambda = 2y/\sigma^2)$ with 2 degrees of freedom and noncentrality parameter $\lambda = 2y/\sigma^2$. Let $T = (x_1^2 + x_2^2)/\sigma^2 y$ with $E(T) = 2 + (2y/\sigma^2)$ and $\mathrm{Var}(T) = 4 + (8y/\sigma^2)$, where
\[
\frac{m^2 - \sigma^4}{2\sigma^2 y} = T, \qquad m = \sqrt{2\sigma^2 y T + \sigma^4}.
\]
To calculate $E(m)$ we approximate using Taylor expansions with $f(x) = x^{1/2}$. Let $\mu = E(2\sigma^2 y T + \sigma^4) = (2y + \sigma^2)^2$ and $\gamma^2 = \mathrm{Var}(2\sigma^2 y T + \sigma^4) = 16\sigma^2 y^2(2y + \sigma^2)$, where
\[
E(m) = E\!\left(\sqrt{2\sigma^2 y T + \sigma^4}\right) \approx f(\mu) + \frac{f''(\mu)\gamma^2}{2} = (2y + \sigma^2) - \frac{2\sigma^2 y^2}{(2y + \sigma^2)^3}
\]
and similarly
\[
\mathrm{Var}(m) = \mathrm{Var}\!\left(\sqrt{2\sigma^2 y T + \sigma^4}\right) \approx (f'(\mu))^2\gamma^2 = \frac{\gamma^2}{4\mu} = \frac{4\sigma^2 y^2}{2y + \sigma^2}.
\]
We have as the expected value of the MLE
\[
E(\hat{y}_{MLE}) \approx y - \frac{\sigma^2 y^2}{(2y + \sigma^2)^3} \approx y\left(1 - \frac{\sigma^2}{8}\right).
\]
We see that $\hat{y}_{MLE}$ is biased, as it underestimates the true value $y$. The variance of the MLE is
\[
\mathrm{Var}(\hat{y}_{MLE}) \approx \frac{\sigma^2 y^2}{2y + \sigma^2} \approx \frac{\sigma^2 y}{2}
\]
with corresponding mean squared error
\[
E_y(\hat{y}_{MLE} - y)^2 = \mathrm{Var}(\hat{y}_{MLE}) + \left(y\left(1 - \frac{\sigma^2}{8}\right)\right)^2 = \frac{\sigma^2 y}{2} + \left(y\left(1 - \frac{\sigma^2}{8}\right)\right)^2.
\]

Another estimate is found by assuming one of the sizes is the true underlying size. Because the sizing error scales with the size of the fragment, we can conservatively choose the smaller measured size as our estimate of the true size. Using the smaller size results in a smaller sizing error, so that $X_1, X_2$ must be in closer agreement in order for the number of standard deviations they are apart to be small as well. This leads us to the estimator
\[
\hat{y}_{min} = \min(x_1, x_2).
\]
This estimator is not unbiased, as we can see by looking at its expected value
\[
E(\hat{y}_{min}) = E(\min(y + \varepsilon_1, y + \varepsilon_2)) = y + E(\min(\varepsilon_1, \varepsilon_2)).
\]
We recognize $\min(\varepsilon_1, \varepsilon_2)$ as the first order statistic for a pair of normally distributed random variables with $\varepsilon_1, \varepsilon_2 \sim \mathrm{Normal}(0, \sigma^2 y)$. The expected value of the first order statistic of a pair of normal random variables is taken here from a table of normal order statistics (Harter, 1961), giving the expected value
\[
E(\hat{y}_{min}) \approx y - \sigma\sqrt{y}\,(0.5642).
\]
Computing the mean squared error gives
\[
E_y(\hat{y}_{min} - y)^2 = \mathrm{Var}(\hat{y}_{min}) + (E(\min(\varepsilon_1, \varepsilon_2)))^2
= \mathrm{Var}(\min(\varepsilon_1, \varepsilon_2)) + (E(\min(\varepsilon_1, \varepsilon_2)))^2
\]
\[
= E(\min(\varepsilon_1, \varepsilon_2)^2) - (E(\min(\varepsilon_1, \varepsilon_2)))^2 + (E(\min(\varepsilon_1, \varepsilon_2)))^2
= \sigma^2 y\,E\!\left(\min\!\left(\frac{\varepsilon_1}{\sigma\sqrt{y}}, \frac{\varepsilon_2}{\sigma\sqrt{y}}\right)^2\right)
\]
with $\varepsilon_i/\sigma\sqrt{y} \sim \mathrm{Normal}(0,1)$. The second moment of the first order statistic of a pair of standard normals is approximated by $E(\min(\varepsilon_1/\sigma\sqrt{y}, \varepsilon_2/\sigma\sqrt{y})^2) \approx 1$ so that
\[
E_y(\hat{y}_{min} - y)^2 \approx \sigma^2 y.
\]

We can also consider the sample mean as our estimator, where
\[
\hat{y}_{mean} = \frac{x_1 + x_2}{2}
\]
so that we have an unbiased estimator, as $E(\hat{y}_{mean}) = y$. We have as the mean squared error
\[
E_y(\hat{y}_{mean} - y)^2 = \mathrm{Var}(\hat{y}_{mean}) = \frac{1}{4}\mathrm{Var}(x_1 + x_2) = \frac{\sigma^2 y}{2}
\]
with $x_1, x_2 \sim \mathrm{Normal}(y, \sigma^2 y)$. The distribution of $\hat{y}_{mean}$ is easily derived as $\hat{y}_{mean} \sim \mathrm{Normal}(y, \sigma^2 y/2)$.

Alternatively, we could also use the sample variance as our estimator, as the measured fragment sizes have variance that is dependent upon the true underlying size $y$. This estimator is given as
\[
\hat{y}_{var} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2}{\sigma^2}
\]
where $\bar{x} = (x_1 + x_2)/2$ denotes the sample mean. $\hat{y}_{var}$ is unbiased, as $E(\hat{y}_{var}) = \sigma^2 y/\sigma^2 = y$. As $\hat{y}_{var}/y \sim \chi^2_1$, a chi-square distribution with one degree of freedom, we have that $\mathrm{Var}(\hat{y}_{var}) = \mathrm{Var}(y(\hat{y}_{var}/y)) = y^2$, with mean squared error
\[
E_y(\hat{y}_{var} - y)^2 = \mathrm{Var}(\hat{y}_{var}) = y^2.
\]
Although the estimator is unbiased, its variance is larger than that of the estimator based on the sample mean, $\hat{y}_{mean}$.

We summarize our results regarding the various estimators in Table 3.1. Both $\hat{y}_{MLE}$ and $\hat{y}_{min}$ are biased in their estimation, while $\hat{y}_{mean}$ and $\hat{y}_{var}$ are unbiased.
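The bias and mean squared error stated for $\hat{y}_{min}$ and $\hat{y}_{mean}$ can be checked by simulation at a fixed true size. The sketch below is illustrative only and covers just these two estimators; the true size $y$ and $\sigma$ are placeholder values, and the function name is hypothetical.

```python
import math
import random


def estimator_summary(y, sigma, trials=100000, seed=0):
    """Monte Carlo mean and mean squared error of the min and sample-mean
    estimators when X1, X2 ~ Normal(y, sigma^2 * y) independently."""
    rng = random.Random(seed)
    sd = sigma * math.sqrt(y)
    totals = {"min": [0.0, 0.0], "mean": [0.0, 0.0]}
    for _ in range(trials):
        x1 = rng.gauss(y, sd)
        x2 = rng.gauss(y, sd)
        for name, est in (("min", min(x1, x2)), ("mean", 0.5 * (x1 + x2))):
            totals[name][0] += est             # accumulates the estimate
            totals[name][1] += (est - y) ** 2  # accumulates squared error
    return {name: (t[0] / trials, t[1] / trials) for name, t in totals.items()}


if __name__ == "__main__":
    y, sigma = 16.0, 0.5  # placeholder true size and sizing error
    result = estimator_summary(y, sigma)
    print("min : mean %.4f (theory %.4f), MSE %.4f (theory %.4f)" % (
        result["min"][0], y - 0.5642 * sigma * math.sqrt(y),
        result["min"][1], sigma ** 2 * y))
    print("mean: mean %.4f (theory %.4f), MSE %.4f (theory %.4f)" % (
        result["mean"][0], y,
        result["mean"][1], sigma ** 2 * y / 2.0))
```

The comparison makes the trade-off concrete: the minimum is shifted downward by roughly $0.5642\,\sigma\sqrt{y}$ and has about twice the mean squared error of the sample mean.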
Among the unbiased estimators, $\hat{y}_{mean}$ has lower variance and thus can be seen as the best method to estimate the underlying true fragment size. Furthermore, the sample mean is easier to compute than $\hat{y}_{MLE}$ and $\hat{y}_{var}$.

Table 3.1: Comparison of estimators $\hat{y}$

    Estimator           $E(\hat{y})$                     $\mathrm{Var}(\hat{y})$     $E(\hat{y} - y)^2$
    $\hat{y}_{MLE}$     $y(1 - \sigma^2/8)$              $\sigma^2 y/2$              $\sigma^2 y/2 + (y(1 - \sigma^2/8))^2$
    $\hat{y}_{min}$     $y - \sigma\sqrt{y}\,(0.5642)$   $\sigma^2 y\,(0.6817)$      $\sigma^2 y$
    $\hat{y}_{mean}$    $y$                              $\sigma^2 y/2$              $\sigma^2 y/2$
    $\hat{y}_{var}$     $y$                              $y^2$                       $y^2$

D-size Statistic

Our statistic for explaining the size discrepancy between two fragments is built upon the ideas we have previously discussed. We want to incorporate the absolute difference of the sizes so that the difference is measured in terms of standard deviations from a true underlying size. To estimate the true underlying size, we choose the estimator based on the sample mean. We have:

Definition 3.1.1. Given two measured fragment lengths $x_1$ and $x_2$, define the D-size statistic as
\[
D = \frac{|x_1 - x_2|}{\sigma\sqrt{(x_1 + x_2)/2}}.
\]
As $x_1, x_2 > 0$, we have that $D \ge 0$.

The D-size statistic as given by Definition 3.1.1 should be smaller for fragment lengths derived from the same genomic region and larger for fragment lengths that are independent of each other. Fragment lengths independent of each other have no true underlying size, and the computed sample mean can be attributed to noise.

Distribution of D-size Statistic: Our previous analyses considered the numerator of the D-size statistic, where we showed that under the null hypothesis the expected absolute difference of sizes is substantially larger than that of the alternative hypothesis. We now analyze the D-statistic for both the null and alternative hypothesis to characterize the distribution of the D-statistic under each. Our results should show that the expected value of the D-size statistic under the null hypothesis is much larger than under the alternative hypothesis.

The D-statistic under the null hypothesis, denoted as $D_0$, is given as
\[
D_0 = \frac{Z_0}{W_0}
\]
with $Z_0 = |X_1 - X_2|$ and $W_0 = \sigma\sqrt{(X_1 + X_2)/2}$ where $X_1, X_2 \sim \mathrm{Exponential}(\theta)$. We will show the following.

Lemma 3.1.4. If $Z_0 = |X_1 - X_2|$ and $W_0 = \sigma\sqrt{(X_1 + X_2)/2}$ with $X_1, X_2 \sim \mathrm{Exponential}(\theta)$, then for $D_0 = Z_0/W_0$,
\[
E(D_0^k) = \frac{(2\theta)^{k/2}}{(k+1)\sigma^k}\,\Gamma\!\left(\frac{k+4}{2}\right)
\]
where $E(D_0^k)$ denotes the $k$-th moment and $\Gamma(\cdot)$ is the Gamma function.

Proof. To derive the distribution of $D_0$, we initially determine the distribution of $Z_0$ and $W_0$. We first derive the joint density of $Z_0, W_0$, where $Z_0 = |X_1 - X_2|$ and $W_0 = \sigma\sqrt{(X_1 + X_2)/2}$, with $X_1, X_2 \sim \mathrm{Exponential}(\theta)$ and $X_1$ independent of $X_2$. The joint distribution of $X_1, X_2$ is
\[
f_{X_1,X_2}(x_1, x_2) = \frac{1}{\theta}e^{-x_1/\theta}\,\frac{1}{\theta}e^{-x_2/\theta},\quad x_1 \ge 0,\ x_2 \ge 0.
\]
The transformation $Z_0 = |X_1 - X_2|$ and $W_0 = \sigma\sqrt{(X_1 + X_2)/2}$ is not one-to-one, as $(x_1, x_2)$ and $(x_2, x_1)$ are mapped to the same point $(z, w)$. By restricting ourselves to the regions with $x_1 > x_2$ and $x_1 < x_2$, the transformation becomes one-to-one. Let $B = \{(x_1, x_2) : x_1 \ge 0, x_2 \ge 0\}$. Consider the partition of $B$ into the regions
\[
A_1 = \{(x_1, x_2) : x_1 > x_2\},\quad A_2 = \{(x_1, x_2) : x_1 < x_2\},\quad A_0 = \{(x_1, x_2) : x_1 = x_2\},
\]
with $P((X_1, X_2) \in A_0) = P(X_1 = X_2) = 0$. Within the regions $A_1$ and $A_2$ the transformation is one-to-one. For either $A_1$ or $A_2$, if $(x_1, x_2) \in A_i$, then $z = |x_1 - x_2| \ge 0$, and for a fixed value of $z = |x_1 - x_2|$, $w = \sigma\sqrt{(x_1 + x_2)/2}$ can be any positive number. Thus, $B = \{(x_1, x_2) : x_1 \ge 0, x_2 \ge 0\}$ is the image of both $A_1$ and $A_2$ under the transformation. The inverse transformations are given by
\[
A_1:\ x_1 = \frac{z + 2\left(\frac{w}{\sigma}\right)^2}{2},\quad x_2 = \frac{2\left(\frac{w}{\sigma}\right)^2 - z}{2},
\]
\[
A_2:\ x_1 = \frac{2\left(\frac{w}{\sigma}\right)^2 - z}{2},\quad x_2 = \frac{z + 2\left(\frac{w}{\sigma}\right)^2}{2},
\]
with the resulting Jacobians $J_1 = \frac{2w}{\sigma^2}$ and $J_2 = -\frac{2w}{\sigma^2}$. We notice that as $x_1, x_2 \ge 0$, this implies that for region $A_1$, $(2(w/\sigma)^2 - z)/2 \ge 0$ so that $z \le 2(w/\sigma)^2$, and the same condition holds for region $A_2$. Therefore the joint density is
\[
f_{Z_0,W_0}(z, w) = \frac{1}{\theta}e^{-\frac{1}{\theta}\left(\frac{z + 2(w/\sigma)^2}{2}\right)}\,\frac{1}{\theta}e^{-\frac{1}{\theta}\left(\frac{2(w/\sigma)^2 - z}{2}\right)}\left|\frac{2w}{\sigma^2}\right|
+ \frac{1}{\theta}e^{-\frac{1}{\theta}\left(\frac{2(w/\sigma)^2 - z}{2}\right)}\,\frac{1}{\theta}e^{-\frac{1}{\theta}\left(\frac{z + 2(w/\sigma)^2}{2}\right)}\left|-\frac{2w}{\sigma^2}\right|,\quad z \le 2\left(\frac{w}{\sigma}\right)^2.
\]
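And after rearranging terms we arrive at

\[ f_{Z_0,W_0}(z,w) = \frac{1}{\theta^2}\,\frac{4w}{\sigma^2}\,e^{-\frac{1}{\theta}\left(2(w/\sigma)^2\right)}, \quad z \le 2\left(\frac{w}{\sigma}\right)^2. \]

We notice that $Z_0$ and $W_0$ are not independent, since the range of $z$ depends upon $w$. To verify the correctness of our joint density, recall the marginal distribution $Z_0 \sim \mathrm{Exponential}(\theta)$. We can integrate out $w$ to check that $z$ has an exponential distribution. We have

\[ f_{Z_0}(z) = \int_{\sigma\sqrt{z/2}}^{\infty} f_{Z_0,W_0}(z,w)\,dw = \int_{\sigma\sqrt{z/2}}^{\infty} \frac{1}{\theta^2}\,\frac{4w}{\sigma^2}\,e^{-\frac{1}{\theta}\left(2(w/\sigma)^2\right)}\,dw \]

with $z \le 2(w/\sigma)^2$, so that $\sigma\sqrt{z/2} \le w < \infty$ since $w$ is positive. Making the substitution $u = \frac{1}{\theta}\left(2(w/\sigma)^2\right)$ with $du = \frac{4w}{\theta\sigma^2}\,dw$ yields

\[ = \int_{u=z/\theta}^{\infty} \frac{1}{\theta}e^{-u}\,du = \frac{1}{\theta}e^{-z/\theta}, \quad z \ge 0, \]

so that we have shown $Z_0 \sim \mathrm{Exponential}(\theta)$. Using the same technique, we have as the distribution of $W_0$

\[ f_{W_0}(w) = \int_0^{2(w/\sigma)^2} f_{Z_0,W_0}(z,w)\,dz = \int_0^{2(w/\sigma)^2} \frac{1}{\theta^2}\,\frac{4w}{\sigma^2}\,e^{-\frac{1}{\theta}\left(2(w/\sigma)^2\right)}\,dz = 2\left(\frac{w}{\sigma}\right)^2 \frac{1}{\theta^2}\,\frac{4w}{\sigma^2}\,e^{-\frac{1}{\theta}\left(2(w/\sigma)^2\right)}, \quad w \ge 0. \]

Given the joint density, we can now calculate the mean and variance of $D_0$. Consider the $k$-th moment of the D-statistic under the null hypothesis:

\[ E(D_0^k) = E\left[\left(\frac{Z_0}{W_0}\right)^k\right] = \iint \left(\frac{z}{w}\right)^k f_{Z_0,W_0}(z,w)\,dz\,dw \]
\[ = \int_0^{\infty}\int_0^{2(w/\sigma)^2} \left(\frac{z}{w}\right)^k \frac{1}{\theta^2}\,\frac{4w}{\sigma^2}\,e^{-\frac{1}{\theta}\left(2(w/\sigma)^2\right)}\,dz\,dw \]
\[ = \int_0^{\infty} \frac{1}{\theta^2}\,\frac{4w^{1-k}}{\sigma^2}\,e^{-\frac{1}{\theta}\left(2(w/\sigma)^2\right)} \int_0^{2(w/\sigma)^2} z^k\,dz\,dw \]
\[ = \int_0^{\infty} \frac{1}{\theta^2}\,\frac{4w^{1-k}}{\sigma^2}\,e^{-\frac{1}{\theta}\left(2(w/\sigma)^2\right)} \left[\frac{1}{k+1}z^{k+1}\right]_0^{2(w/\sigma)^2} dw \]
\[ = \int_0^{\infty} \frac{1}{\theta^2}\,\frac{4w^{1-k}}{\sigma^2}\,e^{-\frac{1}{\theta}\left(2(w/\sigma)^2\right)}\,\frac{1}{k+1}\left(2\left(\frac{w}{\sigma}\right)^2\right)^{k+1} dw \]
\[ = \frac{2^{k+3}}{(k+1)\,\theta^2\sigma^{2k+4}} \int_0^{\infty} w^{k+3}\,e^{-\frac{w^2}{(\theta\sigma^2/2)}}\,dw. \]

Letting $c = \sqrt{\theta\sigma^2/2}$ we have

\[ = \frac{2^{k+3}}{(k+1)\,\theta^2\sigma^{2k+4}}\,\frac{c^2}{2} \int_0^{\infty} w^{k+2}\left(\frac{2}{c}\left(\frac{w}{c}\right)e^{-(w/c)^2}\right)dw \]

so that we recognize the term in parentheses as the pdf of the Weibull distribution with shape parameter 2 and scale parameter $c$. The factor $w^{k+2}$ means the integral computes the $(k+2)$-th moment of this Weibull distribution. We can rearrange terms so that we have

\[ E(D_0^k) = \frac{2^{k+1}}{(k+1)\,\theta\sigma^{2k+2}}\,E(B^{k+2}) \]

which gives a relationship between the moments of $D_0$ and $B \sim \mathrm{Weibull}(2, \sqrt{\theta\sigma^2/2})$.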
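As a closed-form expression exists to describe the moments of a Weibull distribution,

\[ E(D_0^k) = \frac{2^{k+1}}{(k+1)\,\theta\sigma^{2k+2}}\left(\frac{\theta\sigma^2}{2}\right)^{(k/2)+1}\Gamma\!\left(1+\frac{k+2}{2}\right) = \frac{(2\theta)^{k/2}}{(k+1)\,\sigma^k}\,\Gamma\!\left(\frac{k+4}{2}\right) \]

where $\Gamma(\cdot)$ denotes the Gamma function.

With an expression to describe the moments of $D_0$, we can easily derive the mean

\[ E(D_0) = \frac{\sqrt{2\theta}}{2\sigma}\,\Gamma\!\left(\frac{5}{2}\right) = \frac{\sqrt{\theta}}{\sigma}\left(\frac{3}{4}\sqrt{\frac{\pi}{2}}\right) \approx 0.94\,\frac{\sqrt{\theta}}{\sigma} \]

and variance

\[ \mathrm{Var}(D_0) = E(D_0^2) - (E(D_0))^2 = \frac{2\theta}{3\sigma^2}\,\Gamma(3) - \left(\frac{3}{4\sigma}\sqrt{\frac{\pi\theta}{2}}\right)^2 = \frac{\theta}{\sigma^2}\left(\frac{4}{3} - \frac{9\pi}{32}\right) \approx 0.45\,\frac{\theta}{\sigma^2}. \]

With $\sigma \ll \theta$, the mean is roughly proportional to $\sqrt{\theta}$ and the variance proportional to $\theta$.

We now turn our attention to the alternative hypothesis. Under the alternative hypothesis, $X_1 = Y + \epsilon_1$, $X_2 = Y + \epsilon_2$ with $X_1, X_2 \sim \mathrm{Normal}(Y, \sigma^2 Y) \mid Y \sim \mathrm{Exponential}(\phi)$. We note that under the alternative hypothesis, we are conditioning on a true underlying size $Y$. Our D-statistic is

\[ D_1 = \frac{Z_1}{W_1} \]

with $Z_1 = |X_1 - X_2|$ and $W_1 = \sigma\sqrt{(X_1+X_2)/2}$. The distribution of $D_1$ is difficult to derive. We will show the following,

Lemma 3.1.5. If $X_1 = Y + \epsilon_1$, $X_2 = Y + \epsilon_2$ with $X_1, X_2 \sim \mathrm{Normal}(Y, \sigma^2 Y) \mid Y \sim \mathrm{Exponential}(\phi)$, and $Z_1 = |X_1 - X_2|$ and $W_1 = \sigma\sqrt{(X_1+X_2)/2}$, then the mean and variance of $D_1 = Z_1/W_1$ are approximately

\[ E(D_1) \approx \frac{2}{\sqrt{\pi}}, \qquad \mathrm{Var}(D_1) \approx \frac{4}{\pi} - \frac{4}{\pi^{3/2}} + \frac{4(4-\pi)}{\pi^2}. \]

Proof. Consider the Taylor series expansion for the function $f(x,y) = x/y$, so that we can approximate the mean of $D_1$ as

\[ E(D_1) = E\left(\frac{Z_1}{W_1}\right) \approx \frac{E(Z_1)}{E(W_1)}. \]

We have already determined that $E(Z_1) = \sigma\sqrt{\phi}$ and $\mathrm{Var}(Z_1) = \sigma^2\phi$. To determine the distribution of $W_1 = \sigma\sqrt{(X_1+X_2)/2}$, we initially have that $M = (X_1+X_2)/2 \sim \mathrm{Normal}(Y, \sigma^2 Y/2)$ as $X_1, X_2 \sim \mathrm{Normal}(Y, \sigma^2 Y)$. From Lemma 2.2.4, we have that the marginal distribution of $M$ is given by

\[ M \sim \mathrm{Exponential}\left(\varphi = \left(\frac{\sqrt{2}}{\sigma}\sqrt{\frac{2}{\phi} + \frac{2}{\sigma^2}} - \frac{2}{\sigma^2}\right)^{-1}\right) \]

with $E(M) = \varphi$. The distribution of $W_1 = \sigma\sqrt{M}$ is given by $\sigma\sqrt{M} \sim \mathrm{Weibull}(2, r = \sigma\sqrt{\varphi})$. The expectation of $W_1$ is given by $E(W_1) = r\cdot\Gamma(3/2)$, so that we have

\[ E(D_1) \approx \frac{\sigma\sqrt{\phi}}{\sigma\sqrt{\varphi}\cdot\Gamma(3/2)} = \sqrt{\frac{\phi}{\varphi}}\,\frac{2}{\sqrt{\pi}} \]

where for $\phi \gg \sigma$ we have that $\phi/\varphi \approx 1$, so that

\[ E(D_1) \approx \frac{2}{\sqrt{\pi}}. \]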
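The two means derived so far can be checked by simulation. The following is a minimal Monte Carlo sketch in Python, not the simulation procedure used in this work; the function names, the parameter values $\theta = \phi = 20$ and $\sigma = 0.5$, and the choice to discard alternative-hypothesis pairs with $X_1 + X_2 \le 0$ (where $W_1$ is undefined) are illustrative assumptions.

```python
import math
import random


def d_size(x1, x2, sigma):
    """D-size statistic |x1 - x2| / (sigma * sqrt((x1 + x2) / 2))."""
    return abs(x1 - x2) / (sigma * math.sqrt((x1 + x2) / 2.0))


def mean_d0_theory(theta, sigma):
    """Closed-form mean E(D_0) = (sqrt(theta) / sigma) * (3/4) * sqrt(pi / 2)."""
    return math.sqrt(theta) / sigma * 0.75 * math.sqrt(math.pi / 2.0)


def simulate_means(theta, phi, sigma, n, seed=0):
    """Return the sample means of D_0 (null) and D_1 (alternative)."""
    rng = random.Random(seed)

    # Null: two independent Exponential sizes with mean theta.
    # random.expovariate takes a rate, so we pass 1 / mean.
    total0 = 0.0
    for _ in range(n):
        x1 = rng.expovariate(1.0 / theta)
        x2 = rng.expovariate(1.0 / theta)
        total0 += d_size(x1, x2, sigma)

    # Alternative: a shared true size Y measured twice with variance sigma^2 * Y.
    total1, kept = 0.0, 0
    while kept < n:
        y = rng.expovariate(1.0 / phi)
        sd = sigma * math.sqrt(y)
        x1, x2 = rng.gauss(y, sd), rng.gauss(y, sd)
        if x1 + x2 <= 0.0:  # assumption: skip pairs where W_1 is undefined
            continue
        total1 += d_size(x1, x2, sigma)
        kept += 1

    return total0 / n, total1 / n


if __name__ == "__main__":
    m0, m1 = simulate_means(theta=20.0, phi=20.0, sigma=0.5, n=100_000, seed=1)
    print("null:        simulated", round(m0, 3),
          "theory", round(mean_d0_theory(20.0, 0.5), 3))
    print("alternative: simulated", round(m1, 3),
          "approximation", round(2.0 / math.sqrt(math.pi), 3))
```

Under these assumed values the null mean scales with $\sqrt{\theta}/\sigma$ while the alternative mean should stay near $2/\sqrt{\pi}$ regardless of $\phi$; rerunning with a different `phi` is a quick way to see that.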
We can approximate the variance of $D_1$ using a first-order Taylor series expansion, where

\[ \mathrm{Var}(D_1) = \mathrm{Var}\left(\frac{Z_1}{W_1}\right) \approx \frac{\mathrm{Var}(Z_1)}{E(W_1)^2} - \frac{2E(Z_1)}{E(W_1)^3}\,\mathrm{Cov}(Z_1,W_1) + \frac{E(Z_1)^2}{E(W_1)^4}\,\mathrm{Var}(W_1). \]

The joint density of $Z_1, W_1$ is difficult to derive analytically. To compute $\mathrm{Cov}(Z_1,W_1)$, we note that $Z_1, W_1$ are functions of $X_1, X_2 \sim \mathrm{Normal}(Y, \sigma^2 Y)$ where $Y \sim \mathrm{Exponential}(\phi)$, so that they depend upon the parameters $\sigma^2, \phi$. By simulating the covariance $\mathrm{Cov}(Z_1,W_1)$ for various values of $\sigma^2, \phi$ and fitting a linear regression model, we find that $\mathrm{Cov}(Z_1,W_1) \approx \sigma^2\phi/4$. We then have

\[ \mathrm{Var}(D_1) \approx \frac{\sigma^2\phi}{\sigma^2\varphi\,\Gamma^2(3/2)} - \frac{2\sigma\sqrt{\phi}}{\sigma^3\varphi^{3/2}\,\Gamma^3(3/2)}\,\frac{\sigma^2\phi}{4} + \frac{\sigma^2\phi}{\sigma^4\varphi^2\,\Gamma^4(3/2)}\,\sigma^2\varphi\left(\Gamma(2) - \Gamma^2(3/2)\right) \]

where for $\phi \gg \sigma$ we have that $\phi/\varphi \approx 1$, so that

\[ \mathrm{Var}(D_1) \approx \frac{4}{\pi} - \frac{4}{\pi^{3/2}} + \frac{4(4-\pi)}{\pi^2}. \]

We have the striking feature that the mean and variance of the D-statistic under the alternative hypothesis are independent of the parameters $\sigma, \phi$. This feature is useful, as we do not have to account for parameters for the alternative hypothesis and can focus solely on the behavior of the D-size statistic under the null hypothesis. The idea is to use the value of the D-size statistic to determine whether the null hypothesis or the alternative hypothesis applies. Small values of the D-size statistic are more likely to be observed under the alternative hypothesis than under the null hypothesis. We therefore use the value of the D-size statistic directly as a measure of whether two regions are derived from the same genomic region.

Numerical Simulations of D-size Statistic: To validate our results regarding the distributions of the D-size statistic under both the null and alternative hypothesis, we perform numerical simulations as follows,

1. For given parameters $\lambda, \zeta, p, \sigma^2$, we generate $n = 10000$ pairs of $(X_1, X_2) \sim \mathrm{Exponential}(\theta)$ under the null hypothesis. Under the alternative hypothesis we generate $n = 10000$ values $Y_i \sim \mathrm{Exponential}(\phi)$, from which we generate pairs $(M_1, M_2) \sim \mathrm{Normal}(Y_i, \sigma^2 Y_i)$.
2.
We calculate the D-size statistic under both the null hypothesis and alternative hypothesis for each pair of random variables. Let $\bar{D}_0$ be the sample mean of the D-size statistic for the pairs of random variables under the null hypothesis, and similarly let $\bar{D}_1$ be the sample mean for the alternative hypothesis.
3. By the Central Limit Theorem we have

\[ C_0 = \frac{\bar{D}_0 - \mu_{D_0}}{\sqrt{\nu^2_{D_0}/n}} \to \mathrm{Normal}(0,1), \qquad C_1 = \frac{\bar{D}_1 - \mu_{D_1}}{\sqrt{\nu^2_{D_1}/n}} \to \mathrm{Normal}(0,1) \]

where $\mu_{D_0}$ denotes the mean and $\nu^2_{D_0}$ denotes the variance of the D-size statistic under the null hypothesis, and similarly $\mu_{D_1}$ and $\nu^2_{D_1}$ for the alternative hypothesis.
4. We can construct a QQ-plot, compare against the quantiles of a standard normal distribution, and examine the fit.

Figure 3.2 shows the QQ-plots based on our simulation results. They indicate that the theoretical means and variances under the null hypothesis and alternative hypothesis agree with the simulated data.

[Figure 3.2: two QQ-plot panels, "Null Hypothesis: H_0" (quantity plotted: C_0) and "Alternative Hypothesis: H_1" (quantity plotted: C_1), each against Normal Theoretical Quantiles.]

Figure 3.2: QQ-plots for the D-size statistic under both the null and alternative hypothesis. The calculated means and variances for the D-size statistic agree with the simulated data for the null hypothesis. A Shapiro-Wilk test for normality yields a p-value of 0.8498, indicating no evidence for rejecting the null hypothesis that the data is normally distributed. For the alternative hypothesis, we obtain a p-value of 0.9294, which again indicates no evidence for rejecting that the data is normally distributed.

New Score Function

We have previously demonstrated that the distributions of the D-size statistic under the null and alternative hypothesis are extremely different in terms of the mean and variance. Our score function will utilize the value of the D-size statistic as a measure of agreement between sizes $x_1, x_2$ from two optical maps. The scoring function is defined as the following,

Definition 3.1.2. Let $x_1, x_2$ be the lengths of the region between two optical maps consisting of $n$ and $m$ fragments respectively.
Define the score function by

\[ S(x_1, x_2, n, m) = \left[\beta^{I(\alpha<d)}\cdot(\alpha - d)\right] - \left[\gamma(n+m-2)\right] \]

for specified positive constants $\alpha, \beta, \gamma$, where $d$ is the D-statistic previously defined as

\[ d = \frac{|x_1 - x_2|}{\sigma\sqrt{(x_1+x_2)/2}} \]

and $I(\alpha<d)$ is an indicator for the event $\alpha < d$.

The scoring function above consists of the parameters $\alpha$, $\beta$, and $\gamma$. The score can be seen as being composed of the size score, which deals specifically with the sizes $x_1, x_2$, and the site score, which deals with the number of fragments $n, m$. The score function is not truly linear due to the calculation of $d$. In our comparisons, we will refer to our function as linear when comparing with previous methods, since we can view it as linear with respect to $d$. Below, we explain how the parameters affect the alignment score.

• $\alpha$: Recall that the dynamic programming recursion seeks to maximize the alignment score. We had previously shown that small values of the D-size statistic correspond to the regions from the optical maps being derived from the same genomic region. The $\alpha$ parameter controls the maximum value of the D-size statistic that is still considered a good match. Since we have $(\alpha - d)$, regions with a D-size statistic greater than $\alpha$ incur a negative score. The purpose of $\alpha$ is to discriminate between values of the D-size statistic by assigning positive values for those less than $\alpha$ and negative values for those greater than $\alpha$.

• $\beta$: This parameter is a further penalty for a D-size statistic greater than $\alpha$. It allows these regions of the scoring matrix to incur negative scores at a faster rate for regions that are badly aligned in terms of the D-size statistic. Regions with a D-size statistic greater than $\alpha$ are already assigned a negative score, which is further multiplied by $\beta$.

• $\gamma$: This parameter penalizes the number of unaligned sites in the aligned region, as the total number of unaligned sites is equal to $n+m-2$ given $n, m$ fragments from each map. We expect that regions that align well should consist of single fragments, so that there are no missing or false cuts, with $n = m = 1$ and therefore a penalty of 0. Each unaligned site from either region incurs a penalty of $-\gamma$.
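The definition and the parameter roles above can be made concrete with a short Python sketch. This is an illustration of Definition 3.1.2 only; the function names and the example values ($\alpha = 3$, $\beta = 2$, $\gamma = 1.5$, $\sigma = 0.5$, sizes in arbitrary units) are assumptions chosen for the example.

```python
import math


def d_size(x1, x2, sigma):
    """D-size statistic |x1 - x2| / (sigma * sqrt((x1 + x2) / 2))."""
    return abs(x1 - x2) / (sigma * math.sqrt((x1 + x2) / 2.0))


def score(x1, x2, n, m, alpha, beta, gamma, sigma):
    """Score of a region of size x1 (n fragments) against size x2 (m fragments)."""
    d = d_size(x1, x2, sigma)
    size_score = alpha - d
    if d > alpha:
        # The indicator I(alpha < d) is 1 here, so the negative size score
        # is multiplied by beta.
        size_score *= beta
    site_score = gamma * (n + m - 2)  # one penalty of gamma per unaligned site
    return size_score - site_score


if __name__ == "__main__":
    params = dict(alpha=3.0, beta=2.0, gamma=1.5, sigma=0.5)
    print(score(20.0, 20.0, 1, 1, **params))  # identical single fragments
    print(score(20.0, 20.0, 2, 2, **params))  # same sizes, two unaligned sites
    print(score(30.0, 10.0, 1, 1, **params))  # sizes that disagree strongly
```

With identical sizes and single fragments the score equals $\alpha$; each unaligned site subtracts $\gamma$; and once $d$ exceeds $\alpha$ the size score is negative and scaled by $\beta$.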
Our score function rewards regions whose lengths closely match, based on values of the D-statistic less than $\alpha$, with positive scores, while giving negative scores to those regions with a D-statistic greater than $\alpha$. Those regions also incur an extra penalty factor by being multiplied by $\beta$. We note that a positive score implies that the D-size statistic for the aligned region is less than $\alpha$. Negative scores are more prevalent under this scoring function, as positive scores are only assigned to regions that initially have a D-size statistic less than $\alpha$. Unaligned sites are taken into account by penalizing $-\gamma$ for each unaligned site from both optical maps. In an ideal situation where two regions derived from the same genomic region have a D-size statistic of 0, the size score for the region would be $\alpha$. If we examine the penalty for unaligned sites, at most $\alpha/\gamma$ unaligned sites are possible so that the score remains positive, since

\[ \alpha - \gamma(n+m-2) \ge 0 \]
\[ \frac{\alpha}{\gamma} \ge (n+m-2). \]

The ratio $\alpha/\gamma$ determines the maximum number of unaligned sites for which the score still remains positive.

We have not yet discussed the site score component of the score function in detail. Our score function assumes that no internal matching sites exist between aligned regions. Under the null hypothesis, the regions are assumed to be independent, and the penalty for unaligned sites should decrease the score further, as we assume that the D-size statistic is large for unrelated regions. Although adding additional fragments might lower the D-size statistic, as the regions might agree in size more, we incur a penalty for each added fragment from each map. Under the alternative hypothesis, both regions are assumed to have derived from a common region, so that the unaligned sites correspond to missing or false cuts, as aligned regions contain no internal matching sites. Assume first that the unaligned sites are all due to missing cuts. The number of unaligned sites is given by $u = n + m - 2$.
We do not know the true number of cut sites within the common region that both regions are derived from, and we make the simplifying assumption that the common region has at most $u$ sites. It is possible that the common region contains more than $u$ sites that were not observed on either map. Each site from the common region not observed on either map occurs with probability $(1-p)^2$, where in practice $p = 0.8$ so that $(1-p)^2 = 0.04$, so such sites occur with small probability. If all unaligned sites represent missing cuts, each unaligned site is not observed in the corresponding compared region and occurs with probability $(1-p)$. The probability that all unaligned sites are due to missing cuts is $(1-p)^u$. For $u \ge 2$, and in practice for $p = 0.8$, the probability is $\le 0.04$. For most purposes, a good cutoff for the number of unaligned sites is $u \le 2$, since greater values occur with small probability. It is possible that some of the unaligned sites are due to false cuts. False cuts occur as a Poisson process with intensity $\zeta$. With typical values of $\zeta = 0.005$, they occur with small probability. We can replace the above calculated probabilities, taking into account the fact that some of the unaligned sites are due to false cuts. Since false cuts occur with even smaller probability than missing cuts, the resulting probabilities would be even lower, so that $u \le 2$ remains a suitable cutoff value.

Since we want positive scores to reflect regions that should be aligned together, we can bound the $\gamma$ parameter in terms of the $\alpha$ parameter. We have already shown that at most $\alpha/\gamma$ unaligned sites are possible so that the score still remains positive. With $u \le 2$ as the cutoff for the number of unaligned sites, we have

\[ \frac{\alpha}{\gamma} \ge 2. \]

This gives us an idea of how to set the $\gamma$ parameter to ensure that high quality aligned regions receive a positive score.

Our previous analysis for the D-size statistic had assumed that regions consisted of single fragments for both the null and alternative hypothesis.
For the alternative hypothesis, we still assume that regions derived from the same genomic region have no internal matching sites. Recall that for the dynamic programming recursion, we examine at most $\delta$ unaligned sites from each optical map, so that regions consisting of at most $\delta$ fragments are examined. This is due to the fact that missing and false cuts are present. As missing cuts are more prevalent, we make the simplifying assumption that most unaligned sites are due to missing cuts. It is more appropriate to view the sizes $x_1, x_2$ under the null hypothesis as $x_1 \sim \mathrm{Gamma}(n,\theta)$, $x_2 \sim \mathrm{Gamma}(m,\theta)$ with $n, m \in [1,\delta]$, since they can consist of multiple fragments. As missing cut sites occur independently of each other with probability $1-p$, we can view $n, m$ as geometric random variables with success probability $p$. This is because the size of $x_1, x_2$ is determined by the number of missed cut sites. Thus, $n, m$ are distributed as $n, m \sim \mathrm{Geometric}(p)$, so that our null hypothesis becomes

\[ H_0 : X_1 \sim \mathrm{Gamma}(n,\theta),\ X_2 \sim \mathrm{Gamma}(m,\theta) \mid n, m \sim \mathrm{Geometric}(p). \]

We would like to derive the density of $X_1, X_2$. We will show the following,

Lemma 3.1.6. If $X \sim \mathrm{Gamma}(n,\theta) \mid n \sim \mathrm{Geometric}(p)$, then $X \sim \mathrm{Exponential}(\vartheta = \theta/p)$.

Proof. We consider the joint density of $X, n$ and sum out $n$, so that

\[ f_X(x) = \sum_{i=1}^{\infty} f_{X,n}(x,i) = \sum_{i=1}^{\infty} f_{X|n}(x|i)\,f_n(i) = \sum_{i=1}^{\infty} \frac{x^{i-1}e^{-x/\theta}}{\Gamma(i)\,\theta^i}\,(1-p)^{i-1}p \]
\[ = \frac{p}{\theta}e^{-x/\theta}\sum_{i=1}^{\infty}\frac{\left(\frac{(1-p)x}{\theta}\right)^{i-1}}{(i-1)!} = \frac{p}{\theta}e^{-x/\theta}\sum_{j=0}^{\infty}\frac{\left(\frac{(1-p)x}{\theta}\right)^{j}}{j!} \]
\[ = \frac{p}{\theta}e^{-x/\theta}\,e^{(1-p)x/\theta} = \frac{p}{\theta}e^{-\frac{p}{\theta}x}, \]

so that $X \sim \mathrm{Exponential}(\vartheta = \theta/p)$ with $E(X) = \vartheta$.

Using Lemma 3.1.6, we can consider as our new null hypothesis

\[ H_0 : X_1, X_2 \sim \mathrm{Exponential}(\vartheta = \theta/p). \]

The mean and variance of the D-size statistic increase under the new null hypothesis, as $\vartheta = \theta/p > \theta$ with $p \in (0,1)$. We note that in practical applications, during the dynamic programming recursion, at most $\delta$ unaligned sites are examined to speed up the computations. Thus, we could view the distribution of $n, m$ as a truncated geometric distribution with $n, m \in [1,\ldots,\delta]$.
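Lemma 3.1.6 can be checked empirically: summing a Geometric($p$) number of Exponential($\theta$) fragments should give a region size whose mean is $\theta/p$ and whose tail probability is $P(X > t) = e^{-pt/\theta}$. The Python sketch below is an illustration under assumed values ($\theta = 20$, $p = 0.8$); the function names are hypothetical.

```python
import math
import random


def sample_region_size(theta, p, rng):
    """Sum of n Exponential(theta) fragments with n ~ Geometric(p) on {1, 2, ...}."""
    total = rng.expovariate(1.0 / theta)  # expovariate takes a rate (1 / mean)
    # With probability 1 - p the next cut site is missed, so the region
    # absorbs one more fragment.
    while rng.random() >= p:
        total += rng.expovariate(1.0 / theta)
    return total


def summarize(theta, p, n, t, seed=0):
    """Return the sample mean and the empirical tail probability P(X > t)."""
    rng = random.Random(seed)
    sizes = [sample_region_size(theta, p, rng) for _ in range(n)]
    mean = sum(sizes) / n
    tail = sum(1 for x in sizes if x > t) / n
    return mean, tail


if __name__ == "__main__":
    theta, p = 20.0, 0.8
    mean, tail = summarize(theta, p, n=100_000, t=theta / p, seed=3)
    print("sample mean", round(mean, 2), "expected", theta / p)
    print("P(X > theta/p)", round(tail, 4), "expected", round(math.exp(-1.0), 4))
```

The sketch uses the untruncated geometric distribution of the lemma; truncating $n$ at $\delta$, as in the practical recursion, would shorten the tail slightly.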
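Distribution of Score Function under Null Hypothesis: We would like to derive the distribution of the score function under the new null hypothesis. We had previously derived the density of $Z_0$ and $W_0$, with the D-size statistic defined as $D_0 = Z_0/W_0$. We will show the following,

Lemma 3.1.7. The density function of the D-size statistic under the new null hypothesis is

\[ f_{D_0}(d) = \frac{\sigma^2 d}{2\vartheta}\,e^{-\frac{\sigma^2 d^2}{2\vartheta}} + \sigma\sqrt{\frac{\pi}{2\vartheta}}\left(1 - \Phi\!\left(\frac{\sigma d}{\sqrt{\vartheta}}\right)\right) \]

where $\Phi(\cdot)$ is the cumulative distribution function for the standard normal, with the corresponding distribution function

\[ F_{D_0}(d) = \sigma d\sqrt{\frac{\pi}{2\vartheta}}\left(1 - \Phi\!\left(\frac{\sigma d}{\sqrt{\vartheta}}\right)\right) + \left(1 - e^{-\frac{\sigma^2 d^2}{2\vartheta}}\right). \]

Proof. If we consider the transformations $C = W_0$, $D_0 = Z_0/W_0$, the resulting joint density of $C, D_0$ is $f_{C,D_0}(c,d) = f_{Z_0,W_0}(cd,c)\,|c|$, so that if we integrate out $c$ we obtain

\[ f_{D_0}(d) = \int f_{C,D_0}(c,d)\,dc = \int_{\sigma^2 d/2}^{\infty} \frac{1}{\vartheta^2}\,\frac{4c^2}{\sigma^2}\,e^{-\frac{1}{\vartheta}\left(2(c/\sigma)^2\right)}\,dc. \]

Using integration by parts with $u = c$ and $dv = \frac{4c}{\vartheta^2\sigma^2}\,e^{-2c^2/(\vartheta\sigma^2)}\,dc$, so that $v = -\frac{1}{\vartheta}e^{-2c^2/(\vartheta\sigma^2)}$, we have

\[ = \left[-\frac{c\,e^{-2c^2/(\vartheta\sigma^2)}}{\vartheta}\right]_{\sigma^2 d/2}^{\infty} + \int_{\sigma^2 d/2}^{\infty} \frac{e^{-2c^2/(\vartheta\sigma^2)}}{\vartheta}\,dc \]
\[ = \frac{\sigma^2 d}{2\vartheta}\,e^{-\frac{\sigma^2 d^2}{2\vartheta}} + \frac{\sqrt{2\pi r}}{\vartheta}\int_{\sigma^2 d/2}^{\infty} \frac{1}{\sqrt{2\pi r}}\,e^{-\frac{c^2}{2r}}\,dc \]

where, with $r = \vartheta\sigma^2/4$, we recognize the integrand as the normal density with mean 0 and variance $r$. We can rewrite this as

\[ = \frac{\sigma^2 d}{2\vartheta}\,e^{-\frac{\sigma^2 d^2}{2\vartheta}} + \sigma\sqrt{\frac{\pi}{2\vartheta}}\left(1 - \Phi\!\left(\frac{\sigma d}{\sqrt{\vartheta}}\right)\right) \]

with $\Phi(\cdot)$ as the cumulative distribution function for the standard normal, with the corresponding distribution function

\[ F_{D_0}(d) = \int_0^d f_{D_0}(x)\,dx = \sigma d\sqrt{\frac{\pi}{2\vartheta}}\left(1 - \Phi\!\left(\frac{\sigma d}{\sqrt{\vartheta}}\right)\right) + \left(1 - e^{-\frac{\sigma^2 d^2}{2\vartheta}}\right). \]

Consider the expected score $S_0$ under the null hypothesis for sizes $x_1, x_2$ consisting of $n, m$ fragments respectively. We have

\[ E(S_0) = E\left(\beta^{I(\alpha<d)}(\alpha - d) - \gamma(n+m-2)\right) \]

where the null hypothesis is $x_1 \sim \mathrm{Gamma}(n,\theta) \mid n \sim \mathrm{Geometric}(p)$ and $x_2 \sim \mathrm{Gamma}(m,\theta) \mid m \sim \mathrm{Geometric}(p)$ with $x_1 \perp x_2$, $n \perp m$. We make some simplifying assumptions and treat the size score and site score components of the score function independently. This is justified since the size score dominates the score function, as $d \gg (n+m-2)$.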
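The sizes play a more important role in affecting the score than the number of fragments. Large values of $d$ where $d > \alpha$ are penalized further by being multiplied by $\beta$, so that negative scores are due to the size score. We have shown that the sizes $x_1, x_2 \sim \mathrm{Exponential}(\vartheta)$, and we have already derived the mean of the D-size statistic under the null hypothesis when the sizes are exponentially distributed. We show the following,

Lemma 3.1.8. The expectation of the score function under the new null hypothesis is given by

\[ E(S_0) = \alpha\beta + \alpha(1-\beta)F_{D_0}(\alpha) - \left(E_0^{\alpha}(D_0) + \beta E_{\alpha}^{\infty}(D_0)\right) - \gamma((2/p) - 2) \]

with $F_{D_0}$ the distribution function of the D-size statistic given in Lemma 3.1.7, and $E_0^{\alpha}(D_0)$ and $E_{\alpha}^{\infty}(D_0)$ the partial moments of $D_0$ whose form is given in the proof. The variance of the score function under the new null hypothesis is given by

\[ \mathrm{Var}(S_0) = \mathrm{Var}_0^{\alpha}(d) + \beta^2\,\mathrm{Var}_{\alpha}^{\infty}(d) + \gamma^2\left(\frac{2(1-p)}{p^2}\right) \]

with $\mathrm{Var}_0^{\alpha}(d)$ and $\mathrm{Var}_{\alpha}^{\infty}(d)$ given in the proof.

Proof. Computing first the expectation, we have

\[ E(S_0) = E\left(\beta^{I(\alpha<d)}(\alpha - d)\right) - \gamma(E(n) + E(m) - 2) = E\left(\beta^{I(\alpha<d)}(\alpha - d)\right) - \gamma((2/p) - 2) \]

since $n, m \sim \mathrm{Geometric}(p)$. To compute the expectation involving $d$, we have

\[ E\left(\beta^{I(\alpha<d)}(\alpha - d)\right) = \int_0^{\alpha}(\alpha - x)\,f_{D_0}(x)\,dx + \int_{\alpha}^{\infty}\beta(\alpha - x)\,f_{D_0}(x)\,dx \]
\[ = \alpha F_{D_0}(\alpha) + \alpha\beta(1 - F_{D_0}(\alpha)) - \left(E_0^{\alpha}(D_0) + \beta E_{\alpha}^{\infty}(D_0)\right) \]
\[ = \alpha\beta + \alpha(1-\beta)F_{D_0}(\alpha) - \left(E_0^{\alpha}(D_0) + \beta E_{\alpha}^{\infty}(D_0)\right) \]

with $E_a^b(D_0)$ as the partial moment of $D_0$ computed with respect to the bounds $a, b$. The partial moment is, using integration by parts and representing the terminating integral in terms of the cumulative distribution function of the standard normal $\Phi(\cdot)$,

\[ E_a^b(D_0) = \frac{\vartheta}{c^3}\int_a^b w^4 e^{-w^2/c}\,dw = \left[\frac{3\sqrt{\pi/c}\left(2\Phi\!\left(w\sqrt{2/c}\right) - 1\right) - e^{-w^2/c}\left(6w/c + 4w^3/c^2\right)}{8/\vartheta}\right]_a^b \]

with $c = \vartheta\sigma^2/2$, so that

\[ E_0^{\alpha}(D_0) = \frac{3\sqrt{\pi/c}\left(2\Phi\!\left(\alpha\sqrt{2/c}\right) - 1\right) - e^{-\alpha^2/c}\left(6\alpha/c + 4\alpha^3/c^2\right)}{8/\vartheta}. \]

Using the fact that we have already derived the moments of $D_0$,

\[ E_{\alpha}^{\infty}(D_0) = E(D_0) - E_0^{\alpha}(D_0). \]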
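The decomposition of $E(S_0)$ above can be evaluated numerically without relying on the closed-form partial moments. The Python sketch below is an illustration under stated assumptions: it takes the density and distribution function of Lemma 3.1.7 as given, computes the partial moments $E_0^{\alpha}(D_0)$ and $E_{\alpha}^{\infty}(D_0)$ with a simple trapezoidal rule, truncates the infinite upper limit at a point where the density is negligible, and uses illustrative parameter values. The function names are hypothetical.

```python
import math


def pdf_d0(d, sigma, vartheta):
    """Density of D_0 under the new null hypothesis (Lemma 3.1.7)."""
    a = sigma * sigma / (2.0 * vartheta)
    # 1 - Phi(sigma * d / sqrt(vartheta)), written with erfc for accuracy.
    upper_tail = 0.5 * math.erfc(sigma * d / math.sqrt(2.0 * vartheta))
    return (a * d * math.exp(-a * d * d)
            + sigma * math.sqrt(math.pi / (2.0 * vartheta)) * upper_tail)


def cdf_d0(d, sigma, vartheta):
    """Distribution function of D_0 (Lemma 3.1.7)."""
    a = sigma * sigma / (2.0 * vartheta)
    upper_tail = 0.5 * math.erfc(sigma * d / math.sqrt(2.0 * vartheta))
    return (sigma * d * math.sqrt(math.pi / (2.0 * vartheta)) * upper_tail
            + (1.0 - math.exp(-a * d * d)))


def integrate(f, lo, hi, steps=20000):
    """Composite trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h


def expected_null_score(alpha, beta, gamma, sigma, theta, p):
    """E(S_0) via the decomposition of Lemma 3.1.8 with numerical partial moments."""
    vartheta = theta / p
    # Assumption: the density is negligible beyond 12 * sqrt(vartheta) / sigma.
    upper = 12.0 * math.sqrt(vartheta) / sigma

    def first_moment(d):
        return d * pdf_d0(d, sigma, vartheta)

    low = integrate(first_moment, 0.0, alpha)     # E_0^alpha(D_0)
    high = integrate(first_moment, alpha, upper)  # E_alpha^infinity(D_0)
    f_alpha = cdf_d0(alpha, sigma, vartheta)
    size = alpha * beta + alpha * (1.0 - beta) * f_alpha - (low + beta * high)
    site = gamma * (2.0 / p - 2.0)
    return size - site


if __name__ == "__main__":
    print(expected_null_score(alpha=3.0, beta=2.0, gamma=1.5,
                              sigma=0.5, theta=20.0, p=0.8))
```

Numerical integration is used here only to keep the sketch independent of any particular closed form; the sum of the two partial moments should reproduce $E(D_0)$ with $\theta$ replaced by $\vartheta$.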
Similar computations can be made to compute the variance under the null hypothesis:

\[ \mathrm{Var}(S_0) = \mathrm{Var}\left(\beta^{I(\alpha<d)}(\alpha - d) - \gamma(n+m-2)\right) \]
\[ = \mathrm{Var}\left(\beta^{I(\alpha<d)}(\alpha - d)\right) + \gamma^2(\mathrm{Var}(n) + \mathrm{Var}(m)) \]
\[ = \mathrm{Var}\left(\beta^{I(\alpha<d)}(\alpha - d)\right) + \gamma^2\left(\frac{2(1-p)}{p^2}\right) \]

and the variance involving $d$ as

\[ \mathrm{Var}\left(\beta^{I(\alpha<d)}(\alpha - d)\right) = \mathrm{Var}_0^{\alpha}(d) + \beta^2\,\mathrm{Var}_{\alpha}^{\infty}(d) \]

with the variance $\mathrm{Var}_a^b(d)$ computed according to the given bounds $a, b$. The bounded variance is given by

\[ \mathrm{Var}_a^b(d) = E_a^b(D_0^2) - (E_a^b(D_0))^2 \]

where we have already computed the partial moment $E_a^b(D_0)$. The second partial moments are given as

\[ E_0^{\alpha}(D_0^2) = \vartheta - (1 + \alpha^2/c + \alpha^4/2c^2)\,e^{-\alpha^2/c} \]
\[ E_{\alpha}^{\infty}(D_0^2) = E(D_0^2) - E_0^{\alpha}(D_0^2) \]

so that the variance is computed as

\[ \mathrm{Var}(S_0) = \mathrm{Var}_0^{\alpha}(d) + \beta^2\,\mathrm{Var}_{\alpha}^{\infty}(d) + \gamma^2\left(\frac{2(1-p)}{p^2}\right). \]

To verify the computed mean and variance, we simulate the distribution of the score function under the null hypothesis and compare. Figure 3.3 shows a plot of simulated values against theoretically calculated values. The computed mean and variance can be used to determine the distribution of the score for randomly selected regions from optical maps. We can use these values as a guide for speeding up the dynamic programming algorithm by eliminating regions of the score matrix that fall below a certain score threshold. Given the dynamic programming recursion in (3.1), we can consider a low score threshold $T$ in terms of $k$ standard deviations away from the mean score under the null hypothesis as

\[ T = E(S_0) - k\sqrt{\mathrm{Var}(S_0)} \]

so that if $X(s_i - s_g, t_j - t_h, i-g, j-h) < T$ during the assignment to $S(i,j)$, we know the alignment ending at position $S(i,j)$ contains a badly aligned region. We can assign $-\infty$ to $S(i,j)$ to terminate alignments that contain $S(i,j)$, since they would contain the aligned region with score $X(s_i - s_g, t_j - t_h, i-g, j-h)$ below the threshold $T$. We can restate the dynamic programming recursion as the following algorithm, which ensures that aligned regions score above threshold $T$.

Data: Maps, $S(x) = (s_0, \ldots, s_n)$ and $S(y) = (t_0, \ldots, t_m)$ with score threshold, $T$.
Initialization: Overlap: $X(i,0) \leftarrow 0$, $i = 1, \ldots, n$; $X(0,j) \leftarrow 0$, $j = 0, \ldots, m$

for i ← 1 to n do
    for j ← 1 to m do
        y ← −∞
        for g ← max(0, i−δ) to i−1 do
            for h ← max(0, j−δ) to j−1 do
                v ← S(s_i − s_g, t_j − t_h, i−g, j−h)
                if v ≥ T then
                    y ← max{y, X(g,h) + v}
                end if
            end for
        end for
        X(i,j) ← y
    end for
end for

Algorithm 3.1.2: Dynamic programming algorithm for alignment of two restriction maps with threshold cutoff.

[Figure 3.3: QQ-plot; axes: Normal Theoretical Quantiles and Simulated Normal Quantiles.]

Figure 3.3: The mean $E(S_0)$ and variance $\mathrm{Var}(S_0)$ are compared against simulated means and variances of $S_0$ under the null hypothesis by generating pairs $x_1 \sim \mathrm{Gamma}(n,\theta) \mid n \sim \mathrm{Geometric}(p)$ and $x_2 \sim \mathrm{Gamma}(m,\theta) \mid m \sim \mathrm{Geometric}(p)$ with $x_1 \perp x_2$, $n \perp m$ and computing their score $S(x_1,x_2,n,m)$ for given parameters $\alpha, \beta, \gamma, \theta, p$. Using the CLT, we can compute the sample mean of our simulated score values and standardize it using our computed means and variances to plot against the quantiles of a standard normal distribution. From the graph, we can see that the computed means and variances agree fairly well, with some slight curvature.

Parameter Values: Clearly, the parameters $\alpha, \beta, \gamma$ affect the distribution of the scores. Recall the use of the D-size statistic to determine whether the sizes of two regions are derived from the same genomic region or are independent of each other. We have already derived the distribution of the D-size statistic under both the null and alternative hypothesis. The size component of the scoring function involves the parameters $\alpha, \beta$, where given sizes $x_1, x_2$ it is

\[ \left[\beta^{I(\alpha<d)}\cdot(\alpha - d)\right] \]

and $d$ is the D-size statistic defined as

\[ d = \frac{|x_1 - x_2|}{\sigma\sqrt{(x_1+x_2)/2}}. \]

Consider the first part of the size score, defined as $(\alpha - d)$, with values of $d$ above $\alpha$ assigned a negative score.
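Returning briefly to Algorithm 3.1.2, the recursion with the threshold cutoff can be sketched in Python as follows. This is a minimal sketch rather than the implementation used in this work: maps are assumed to be lists of cumulative site positions starting at 0, the region score follows Definition 3.1.2, and the function names, the example maps, and the values of $\alpha$, $\beta$, $\gamma$, $\sigma$, $\delta$, and $T$ are illustrative assumptions.

```python
import math

NEG_INF = float("-inf")


def make_score(alpha, beta, gamma, sigma):
    """Build the region score S(x1, x2, n, m) of Definition 3.1.2."""
    def score(x1, x2, n, m):
        d = abs(x1 - x2) / (sigma * math.sqrt((x1 + x2) / 2.0))
        size = alpha - d
        if d > alpha:
            size *= beta
        return size - gamma * (n + m - 2)
    return score


def align_with_threshold(s, t, score, delta, threshold):
    """Fill the score matrix X for maps s and t (cumulative site positions).

    Regions scoring below `threshold` are never extended, so a cell stays at
    -infinity when every candidate region ending there falls below it.
    """
    n, m = len(s) - 1, len(t) - 1
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):  # overlap initialization: free start on either map
        X[i][0] = 0.0
    for j in range(m + 1):
        X[0][j] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = NEG_INF
            for g in range(max(0, i - delta), i):
                for h in range(max(0, j - delta), j):
                    v = score(s[i] - s[g], t[j] - t[h], i - g, j - h)
                    if v >= threshold:
                        best = max(best, X[g][h] + v)
            X[i][j] = best
    return X


if __name__ == "__main__":
    score = make_score(alpha=3.0, beta=2.0, gamma=1.5, sigma=0.5)
    reference = [0.0, 10.0, 30.0, 45.0]
    missing_cut = [0.0, 30.0, 45.0]  # same map with the site at 10 unobserved
    X = align_with_threshold(reference, missing_cut, score, delta=2, threshold=-5.0)
    print(X[3][2])
```

In the example, the best path merges the first two reference fragments against the single 30-unit fragment (paying one unaligned-site penalty) and then matches the final fragments exactly.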
If we view $\alpha$ as discriminating between values of $d$, so that positive scores indicate sizes that match well to each other while negative scores are given to sizes that are independent, we should set $\alpha$ dependent upon the distribution of $d$, the D-size statistic. Under the alternative hypothesis, we have derived the mean $\mu_{D_1}$ and variance $\nu^2_{D_1}$, which we found to be largely independent of the parameters. We can use Chebyshev's Inequality, in its one-sided form, to bound the probability distribution as

\[ P(D_1 < \mu_{D_1} + a) > 1 - \frac{\nu^2_{D_1}}{\nu^2_{D_1} + a^2} \]

for positive $a > 0$. We want to set $\alpha$ to a value that captures most of the distribution of the D-size statistic under the alternative hypothesis, so that those values of $d$ are assigned a positive score. For $\alpha > \mu_{D_1}$, a positive score is assigned if $D_1 < \alpha$, so that

\[ P(D_1 < \alpha) = P(D_1 < \mu_{D_1} + (\alpha - \mu_{D_1})) > 1 - \frac{\nu^2_{D_1}}{\nu^2_{D_1} + (\alpha - \mu_{D_1})^2} \]

is a lower bound for the probability of assigning a positive score when $\alpha > \mu_{D_1}$ for values of $d$ under the alternative hypothesis. We can compare these probabilities against those for values of $d$ from the null hypothesis using the cumulative distribution function of $D_0$ that we have already derived in Lemma 3.1.7. Figure 3.4 shows the probabilities of both $P(D_0 < \alpha)$ and $P(D_1 < \alpha)$ for various values of $\alpha$ and $p$.

Since we want to maximize the probability of assigning a positive score to a $d$ value from the alternative distribution while minimizing the probability of assigning a positive score to a $d$ value from the null hypothesis, we can use the resulting calculated probabilities to choose an appropriate value for $\alpha$. Examining the graphs, we can see that at $\alpha \in [3, 3.5]$ with $p = 0.8$, $P(D_1 < \alpha) \approx 0.8$ and $P(D_0 < \alpha) \approx 0.25$.

The $\beta$ parameter is also used in the size component of the score function, since values of $d > \alpha$ incur an extra penalty factor by being multiplied by $\beta$ to make the score more negative. This allows for these regions to be penalized further so that they are not chosen during the dynamic programming recursion.
We can view the $\beta$ parameter as scoring regions with high $d$ values with greater negative scores, so that these regions can quickly be eliminated and no longer be considered during the dynamic programming recursion. The $\beta$ parameter should be chosen to reflect how often we expect a good alignment to contain an aligned region with $d > \alpha$. We have determined bounds on the distribution function $P(D_1 < \alpha)$ of the D-size statistic under the alternative hypothesis. The probability of observing an aligned region under the alternative hypothesis with a D-size statistic greater than $\alpha$ is $p_{\alpha} = 1 - P(D_1 < \alpha)$.

Assuming that aligned regions occur independently of each other, an aligned region with a D-size statistic greater than $\alpha$ under the alternative hypothesis occurs as a geometric distribution with success probability $p_{\alpha}$ and mean $1/p_{\alpha}$. Since good alignments should receive positive scores, we want to set $\beta$ so that the resulting alignment has a final score that is positive. We had previously determined that $P(D_1 < \alpha) \approx 0.8$ for $\alpha \in [3, 3.5]$ as a suitable range, with $P(D_1 > \alpha) \approx 0.2$. An aligned region with a D-size statistic greater than $\alpha$ occurs with mean $1/0.2 = 5$. In the ideal case, each aligned region with $d < \alpha$ has a $d$ value of 0 so that it receives a score of $\alpha$, assuming that there are no unaligned sites. As an aligned region with D-size statistic greater than $\alpha$ occurs with mean $\approx 5$, the aligned regions between them with a D-size statistic less than $\alpha$ have score $\in [0, 4\alpha]$. An aligned region with a D-size statistic greater than $\alpha$ but less than $2\alpha$ has a score $\in [-2\alpha\beta, 0]$. Since $P(\alpha < D_1 < 2\alpha) \approx 0.15$, we can use $2\alpha$ as a cutoff of the D-size statistic under the alternative hypothesis with $P(D_1 > \alpha) \approx 0.2$. Recall that we want alignments to receive an overall positive score, so we should set $\beta$

[Figure 3.4: three panels, p = 0.6, p = 0.7, p = 0.8; x-axis: α; y-axis: P(D_0 < α) and P(D_1 < α).]

Figure 3.4: The probabilities $P(D_1 < \alpha)$ and $P(D_0 < \alpha)$ under both the alternative and null hypothesis are plotted for corresponding values of $\alpha$ for different values of $p$.
The upper curve represents the alternative hypothesis of the D-size statistic. We can see from the graphs that the alternative distribution is largely independent of the parameters, while recall that for the null distribution $x_1, x_2 \sim \mathrm{Exponential}(\vartheta = \theta/p)$.

with $4\alpha - 2\alpha\beta > 0$, as the aligned regions with D-size statistic less than $\alpha$ have score at most $4\alpha$. This gives us that $2 > \beta$, so that we can set $\beta \in (0, 2]$ to ensure a positive score, so that the well aligned regions are not over penalized by an aligned region with D-size statistic greater than $\alpha$.

The site component of the score function penalizes each unaligned site using the parameter $\gamma$. Since positive scores are indicative of well aligned regions, we have previously shown that we want $\alpha/\gamma \ge 2$, as those represent high quality regions where the score remains positive. As $\alpha \in [3, 3.5]$ and $\alpha/\gamma \ge 2$, we have that $\gamma \in [1.5, 1.75]$ is a suitable range for the parameter $\gamma$.

Parameter Settings for De Novo Assembly

Although the new scoring function is simplistic and easily interpretable, it suffers from having to choose the alignment parameters, which could greatly affect the alignments produced. In certain applications, however, we can validate the parameters of the scoring function and quantify its performance given a set of parameters. For our purposes, we focus on choosing parameters for detecting alignments used during de novo assembly. We have already discussed suitable ranges for the parameters used in the scoring function. We perform simulations to evaluate whether the chosen parameter ranges perform well. In order to test a given set of parameters, we use the following procedure,

1. We simulate a genome $G = (X_1, \ldots, X_n)$ with fragment lengths distributed as $X_i \sim \mathrm{Exponential}(\lambda)$. From the simulated genome, we simulate maps with the associated errors of optical mapping. Since we simulate maps directly from the genome, we know the true location of each map. Based on the location of each map, we can infer the set of maps that each map should align with and the order of the aligned sites for each alignment.
We call the set of alignments, $\mathcal{T}$, that should be detected based on the known locations of each map the true alignments. It is appropriate to consider true and false positives and true and false negatives based on the set of true alignments $\mathcal{T}$. One issue to consider is when to count an alignment as correctly found, since the aligned sites given by a reported alignment might not match the aligned sites given by the true alignment involving the same pair of maps. We make the following definitions, to be used later, regarding true and false positives and true and false negatives. For a given true alignment $\Pi \in \mathcal{T}$ we define,

(a) True positive: The reported alignment involving the same pair of maps contains $\ge q\cdot|\Pi|$ aligned sites from the true alignment.
(b) False positive: The reported alignment involving the same pair of maps contains $< q\cdot|\Pi|$ aligned sites from the true alignment, or the alignment does not exist in the set of true alignments.

We also define the following,

(a) True negative: An alignment that exists neither in the set of true alignments $\mathcal{T}$ nor in the set of reported alignments $\mathcal{A}$.
(b) False negative: An alignment $\Pi \in \mathcal{T}$ that exists in the set of true alignments but not in the reported alignments, so that $\Pi \notin \mathcal{A}$.

2. Given a set of parameters $\alpha, \beta, \gamma$ and the set of true alignments $\mathcal{T}$, we compute the optimal alignment for the pair of maps involved in the alignment $\Pi \in \mathcal{T}$ using the new scoring scheme. If the reported alignment contains $\ge q\cdot|\Pi|$ aligned sites from the true alignment, we call the true alignment recovered. Otherwise we call the alignment unrecovered.

Given a range of parameter values for $\alpha$, $\beta$, and $\gamma$, we want to maximize the number of recovered alignments. We initially do not concern ourselves with true and false positives or true and false negatives, since we want to optimize for the number of alignments discovered from the set of true alignments. This is crucial for de novo assembly, since recovering as many true alignments as possible allows for more regions to be assembled. Figure 3.5 illustrates the process of choosing a set of optimal parameters based on a simulated data set.
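The recovery criterion used in this procedure can be sketched in Python as follows. This is an illustration only: representing an alignment as a set of aligned site pairs keyed by the pair of map identifiers is an assumption made for the example, as are the function names.

```python
def is_recovered(true_sites, reported_sites, q=0.85):
    """True if the reported alignment shares at least q * |Pi| aligned sites
    with the true alignment Pi (both given as sets of (site_i, site_j) pairs)."""
    shared = len(set(true_sites) & set(reported_sites))
    return shared >= q * len(true_sites)


def recovery_fraction(true_alignments, reported_alignments, q=0.85):
    """Fraction of true alignments that are recovered.

    Both arguments map a pair of map identifiers to a set of aligned site
    pairs; a true alignment with no reported counterpart is unrecovered.
    """
    recovered = 0
    for map_pair, true_sites in true_alignments.items():
        reported = reported_alignments.get(map_pair)
        if reported is not None and is_recovered(true_sites, reported, q):
            recovered += 1
    return recovered / len(true_alignments)


if __name__ == "__main__":
    truth = {("a", "b"): {(k, k) for k in range(10)},
             ("a", "c"): {(k, k + 1) for k in range(10)}}
    reported = {("a", "b"): {(k, k) for k in range(9)},      # 9 of 10 shared
                ("a", "c"): {(k, k + 1) for k in range(8)}}  # 8 of 10 shared
    print(recovery_fraction(truth, reported, q=0.85))
```

With $q = 0.85$ and ten true aligned sites, an alignment sharing nine sites counts as recovered while one sharing eight does not.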
Recall from our previous discussion that suitable ranges for the parameters were $\alpha \in [3, 3.5]$, $\beta \in (0, 2]$, and $\gamma \in [1.5, 1.75]$. The optimal parameters found consist of $\alpha = 3$, $\beta = 2.5$, and $\gamma = 2$, which closely follow the theoretical optimal parameters, with the $\beta$ and $\gamma$ parameters slightly outside the given ranges.

We note that the $\beta$ parameter does not significantly change the number of alignments recovered, as the $\alpha$ and $\gamma$ parameters play a more important role. However, the use of the $\beta$ parameter does allow for a recovery of $\sim 2\%$ more alignments. This can be significant when the alignments are to be used for de novo assembly, since recovering more alignments makes more data available for assembly. Thus, the $\beta$ parameter is useful in allowing a greater number of alignments to be recovered by penalizing badly aligned regions further during the dynamic programming algorithm.

[Figure 3.5: ten panels, β = 0.5 to β = 5 in steps of 0.5; axes: α and γ, each from 0 to 10; scale from 0.00 to 0.30; annotation: Max. Correct = 0.3247 at α = 3, β = 2.5, γ = 2.]

Figure 3.5: We tested parameter values for the following ranges: $\alpha \in (0, 10]$, $\beta \in (0, 5]$, and $\gamma \in (0, 10]$. We set $q = 0.85$ so that an alignment is recovered only if at least $q\cdot|\Pi|$ of the true aligned sites are in the reported alignment. We simulated a genome of size 5 Mb with 500 simulated maps representing roughly $\sim 10$x coverage. We exhaustively compute the set of reported alignments for different combinations of the parameter values by discretizing each range by 0.5. The above plot shows the percentage of the true alignments recovered, grouped by different values of $\beta$, while the x-axis and y-axis denote values of $\alpha$ and $\gamma$ respectively. The optimal set of parameter values obtained was $\alpha = 3$, $\beta = 2.5$ and $\gamma = 2$, with 32.5% of the true alignments recovered. For comparison, using the likelihood scoring scheme, roughly $\sim 12\%$ of the true alignments were recovered.

We compared the performance of our new scoring scheme for a given simulated data set against the likelihood scoring scheme.
For this comparison, we computed all pairwise alignments for a given set of simulated maps from a simulated genome. Based on the set of reported alignments, we can classify each of the reported and unreported alignments according to true and false positives and true and false negatives as previously defined. We set final score thresholds where alignments with scores above the threshold are retained under each of the scoring schemes. This allows us to compare the number of alignments of each type discovered under both scoring schemes.

Figure 3.6 compares the performance of both scoring schemes using precision-recall curves and ROC curves. The plots illustrate that the linear scoring scheme outperforms the likelihood scoring scheme for both precision-recall and true positive rate versus false positive rate. At different final score thresholds, the performance of the likelihood scoring scheme only slightly improves in terms of recall. This suggests that the likelihood scoring function does not detect a large number of true alignments and is too stringent in terms of both the scoring threshold and the score cutoffs used to eliminate regions of the scoring matrix during the dynamic programming recursion. This in turn causes only a small number of true alignments to be found regardless of the final score threshold, so that even lowering the threshold does not drastically change the number of true alignments reported. The linear scoring scheme, on the other hand, is optimized for detecting true alignments, so initially a large number of true alignments are discovered by choosing an optimal set of alignment parameters. The linear scoring scheme has improved recall for lower final scoring thresholds and is able to recover substantially more true alignments.

Comparing Against Likelihood Scoring Function: We have shown that the linear scoring function outperforms the likelihood scoring function although the latter is built upon a well-developed statistical model.
It is interesting to examine further the reasons why the linear scoring function is able to capture alignments that the likelihood scoring function cannot. One issue to examine is whether the distribution of fragment sizes for a particular genomic region plays a role in determining whether alignments from those regions can be recovered. We simulated maps from the E. coli genome and computed pairwise overlaps using both the likelihood and linear scoring functions. We determined the positions of the maps involved in each overlap and marked those that corresponded to true alignments and false alignments along the genome. In Figure 3.7, we plot the positive predictive value of both scoring functions, defined as the number of true alignments divided by the number of true and false alignments combined, along the genome. Positive predictive values measure the proportion of true alignments recovered out of all the alignments computed for a given position along the genome. The positive predictive value can be slightly misleading, as it depends on the prevalence of true alignments, which is small compared to the total number of alignments computed.

[Figure 3.6: left panel, precision versus recall; right panel, sensitivity (TPR) versus 1 − specificity (FPR); curves for the log-likelihood and linear scoring schemes, with points along each curve labeled by final score threshold.]

Figure 3.6: The left plot is a precision-recall graph showing the performance of the likelihood scoring scheme and linear scoring scheme. Precision refers to # true positives / (# true positives + # false positives) and can be interpreted as the proportion of the reported alignments that correspond to true alignments. Recall, on the other hand, is # true positives / (# true positives + # false negatives) and refers to the proportion of true alignments that appear within the set of reported alignments. Setting a high score threshold allows for high precision, since most of the returned alignments correspond to true alignments. However, at a high score threshold, recall is low, since a smaller number of alignments are reported and thus a smaller proportion of true alignments are recovered. We see that the linear scoring scheme vastly outperforms the likelihood scoring scheme under different score thresholds in terms of precision and recall. The right plot is the standard ROC curve plotting the true positive rate versus the false positive rate. We can see that the linear scoring scheme also performs better in terms of the ROC curve. The likelihood scoring scheme has a very low true positive rate and false positive rate even at high score thresholds. Lowering the score threshold does not improve the performance either, as the curve for the likelihood scoring scheme falls below the line of no discrimination.

From Figure 3.7, we observe that the positive predictive value roughly increases in regions with larger fragments. Intuitively, we expect this to be the case, as maps derived from regions with large fragments are easier to align due to their presence. Given that the larger fragments are more rare according to the distribution of the fragment sizes, despite the errors of optical mapping, these fragments are aligned together so that a correct alignment is produced. We note that the linear scoring function does substantially better across the genome as a whole. In regions where fragments are small or the variability of fragment sizes is low, the linear scoring function performs slightly worse, but the dropoff is more extreme with the likelihood scoring function. These regions are more difficult to align, since there is more ambiguity in resolving the correct alignment for these fragments.

Another feature of both scoring functions to examine is the scores themselves. The likelihood scoring function computes the score for a given aligned region as a likelihood ratio, so that the score is unbounded. This is due to the fact that the fragment sizes are continuously distributed.
Recall that the linear scoring function is bounded above so that the highest score an aligned region can receive is the parameter α. The alignment score is computed using dynamic programming where the goal is to maximize the alignment score over all aligned regions. Since the likelihood score function is unbounded, the score for an incorrectly aligned region could dominate the alignment score so that subsequent aligned regions are erroneous.

[Figure 3.7: x-axis, fragment position (0 to 400); y-axis, positive predictive value (0.000 to 0.025); curves for the linear and log-likelihood scoring functions, with bars along the bottom showing fragment sizes.]

Figure 3.7: The two upper curves represent the positive predictive value of the two scoring functions, defined as the number of true positives divided by the number of true and false positives combined. We plot this value across the genome, where we calculate the positions of true and false alignments using each scoring function. The bars at the bottom of the plot indicate the fragment sizes along the genome. The likelihood scoring function is unable to recover alignments from regions with small fragments or low variability in terms of fragment sizes. The regions between fragments 25-75 and 325-375 suffer extreme dropoffs in terms of the positive predictive value. Those regions either have small fragments, low variability of fragment sizes, or both. The linear scoring function also has problems in those regions, but its decline in recovering those alignments is smaller than that of the likelihood score function.

To test if this is the case, we used a bounded version of the likelihood scoring function and compared its performance to the unbounded version of the likelihood scoring function and the linear scoring function. The bounded version of the likelihood scoring function is given by the following

\[
X(x, y, n, m) = \min\left\{ \log\left( \frac{f_{H_0}(x, y, n, m)}{f_{H_1}(x, y, n, m)} \right),\ \nu \right\}
\]

for a given parameter ν that upper bounds the likelihood scoring function.
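As a small illustration of the bounding step, the sketch below caps a log-likelihood-ratio score at ν. Because ν is an upper bound, the sketch takes the smaller of the log ratio and ν, so that ν = ∞ leaves the score unchanged. The density values passed in are made-up numbers, and the function name is hypothetical; the actual densities come from the statistical model developed earlier in the thesis.

```python
import math


def bounded_score(f_h0, f_h1, nu=math.inf):
    """Log-likelihood ratio for one aligned region, capped above at nu.

    f_h0 and f_h1 are the two density values being compared (placeholders
    here). With nu = infinity the cap has no effect.
    """
    return min(math.log(f_h0 / f_h1), nu)


if __name__ == "__main__":
    # A region whose ratio is very large, as could happen with a rare large fragment.
    for nu in (1, 2, 4, math.inf):
        print(nu, bounded_score(5e-2, 1e-6, nu=nu))
```

With a finite ν, no single aligned region can contribute more than ν to the total alignment score, which limits how far one extreme region can pull the dynamic programming path.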
In Figure 3.8, by lowering the upper bound, we improve the performance of the likelihood scoring function. We note that this can be due to several reasons. One is that large fragments, under the likelihood scoring function, end up being aligned together regardless of whether they truly align, rendering the entire alignment incorrect. The bounded version of the likelihood scoring function decreases the effect of the presence of large fragments by restricting the maximum value of an aligned region. Furthermore, the likelihood scoring function is dependent upon examining the ratio between two competing hypotheses: (1) both maps are derived from the same genomic region or (2) they are independent of each other. The hypothesis that they are independent of each other is calculated based on the distribution of fragment sizes, where large fragments are rare. With a low probability of large fragments occurring, the alternative hypothesis of the fragments being derived from the same genomic region is more likely than the null hypothesis of independence.

[Figure 3.8: left panel, precision versus recall; right panel, sensitivity (TPR) versus 1 − specificity (FPR); curves for the log-likelihood scoring function with ν = 1, 2, 4, ∞ and for the linear scoring function.]

Figure 3.8: We plot both precision-recall and ROC curves for the original likelihood scoring function (with ν = ∞), the upper bounded likelihood function for ν = 1, 2, 4, and the linear scoring function. From the graphs, we can see that as we decrease ν, both the precision-recall and ROC curves improve. In all cases, the linear scoring function outperforms the likelihood scoring function.

3.1.3 Random Forest Classification

Motivation: Our usage of the new scoring function and optimizing it for de novo assembly was to maximize the number of recovered alignments.
However, despite the fact that we recover substantially more alignments than the likelihood scoring scheme, we still return many alignments that are false positives. These alignments are problematic since they incorrectly align maps that share no relationship with each other. For de novo assembly this would result in a faulty assembly, since maps derived from different genomic regions would be assembled together. We would like to reduce the number of false positives returned by screening the returned alignments and classifying them into true and false positives.

Random Forest: Random forest (Breiman, 2001) is a machine learning ensemble classifier consisting of many decision trees that outputs the class chosen by majority vote among the trees. The advantage of using random forest classification is that for many data sets it produces a highly accurate classifier while having the ability to handle a large number of input variables. It can also estimate the importance of variables in determining classification while providing an experimental way to detect variable interactions. Briefly, a random forest is constructed by creating each decision tree in the following manner,

1. A training set consisting of n cases is chosen with replacement (bootstrap sample) from a total set of N available training cases. The rest of the cases are used to estimate the error of the tree by predicting their classes.

2. For each node of the tree, randomly choose m variables from a total of M classifier variables on which to base decisions at that node. The best split based on these m variables is calculated in the training set.

3. Each of the trees is fully grown and not pruned.

The random forest can be seen as constructing trees by randomly sampling from a training set and, for each node of the tree, randomly sampling the predictor variables. The individual trees are used to vote for the final class. For our purposes, we want to classify a particular alignment as either a true positive or false positive.
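The two sources of randomness in steps 1 and 2, together with the final vote, can be sketched with the Python standard library alone. This is not a full random forest (no trees are actually grown), and the function names are hypothetical; it only illustrates the sampling and voting described above.

```python
import random
from collections import Counter


def bootstrap_sample(total_cases, n_draws, rng):
    """Step 1: draw n_draws case indices with replacement from total_cases.

    Returns (in_bag, out_of_bag); the out-of-bag cases are the ones left
    over for estimating the error of the tree.
    """
    in_bag = [rng.randrange(total_cases) for _ in range(n_draws)]
    out_of_bag = sorted(set(range(total_cases)) - set(in_bag))
    return in_bag, out_of_bag


def candidate_variables(total_variables, m, rng):
    """Step 2: choose m distinct predictor variables to consider at one node."""
    return rng.sample(range(total_variables), m)


def majority_vote(votes):
    """Final class: the label predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    rng = random.Random(0)
    in_bag, out_of_bag = bootstrap_sample(total_cases=10, n_draws=10, rng=rng)
    print("in bag:", in_bag)
    print("out of bag:", out_of_bag)
    print("variables at this node:", candidate_variables(5, 2, rng))
    print("forest prediction:", majority_vote(["true", "false", "true"]))
```

In practice an existing random forest implementation would be used instead; the sketch is only meant to make the sampling steps concrete.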
A true positive corresponds to maps derived from the same genomic region. Since we are using the alignments as our input, we base the predictor variables on the following features of the alignment:

(i) Alignment score: The returned alignment score as given by the new scoring function.

(ii) Number of aligned sites: The number of aligned sites within the alignment.

(iii) Number of unaligned sites: The number of unaligned sites within the alignment.

(iv) Individual map sizes: The sizes of the maps involved in the alignment.

(v) Overlap size: The combined size of the overlap region between the two maps, obtained by averaging the size of the overlap region from each map.

The alignment score is a natural choice for a predictor variable since we expect that true positives should have high alignment scores. However, since longer maps will incur higher alignment scores due to the fact that they contain more restriction fragments, we also utilize the sizes of both maps as predictor variables. This allows for alignments of smaller maps where the alignment score might be lower since they contain few restriction fragments. The number of aligned and unaligned sites should be indicative of a true alignment. Recall that unaligned sites are penalized, so we expect a true alignment to contain few unaligned sites. Finally, we also use the size of the overlap region as a predictor variable. True alignments should consist of “deep” overlaps where the overlap region from each map is large, since this will result in higher alignment scores.

Results

To test the performance of using random forest classification in order to decrease the number of false positives, we built a random forest using simulated data. We simulate a genome and random maps and produce a full set of all pairwise alignments between the maps. The set of pairwise alignments is used as the training set for the random forest classifier. We simulate another genome using the same average fragment size as the first.
From this genome we also simulate maps and perform pairwise alignments using both the likelihood scoring scheme and the new scoring function. The alignments produced using the new scoring function are classified into true and false positive alignments using the random forest trained from the first genome.

Training: We first train the random forest on simulated data to determine suitable class weights and cutoffs to be used for the classification. Data sets with highly unbalanced class sizes could lead to extremely different prediction errors for the different classes. Consider the training set consisting of all pairwise alignments for a given set of maps. Most of the alignments are false positives while only a small number are actually true positives. Random forests, in attempting to minimize the overall prediction error rate, will keep the error rate low on the larger class of false alignments while letting the smaller class of true alignments have a larger error. Class weights address this imbalance: the random forest attempts to decrease the error rate for classes given a higher weight. Cutoff values, on the other hand, are used to determine the prediction class of the random forest. The returned output of the random forest consists of either the prediction class or the probabilities associated with each possible prediction class. We chose to use the class probabilities in order to assess the accuracy of our random forest using different cutoff values. To test both class weights and cutoff values, we further divided our initial training set into separate training and test data sets. We trained our random forest using the training data set with various class weights and cutoff values and tested its performance on the test data set to determine suitable values for both parameters. Figure 3.9 shows our training and testing performance for a sample data set.

Performance: Using the random forest we had trained in the previous step, we classified a set of pairwise alignments to validate the performance of the random forest.
We simulated another genome with the same average fragment size and computed overlaps using all three methods: the original likelihood scoring function, the linear scoring function, and the linear scoring function combined with the random forest. We combined the linear scoring function with the random forest by first computing all of the pairwise overlaps and then classifying the resulting overlaps into true and false alignments using the random forest. Figure 3.10 shows the performance of the three different methods. From the graphs, we can see that we substantially reduce the number of false positives returned. Using the random forest classifier, we are able to drastically reduce the number of false alignments at the expense of losing only a small number of true alignments. This validates the use of the random forest in classifying the optical map alignments.

[Figure 3.9: left panel, precision versus recall; right panel, sensitivity (TPR) versus 1 − specificity (FPR); legend values 0.25, 0.5, 1, 2, 4.]

Figure 3.9: We simulated a genome of size 5 Mb and 500 maps from the genome with 18,156 true alignments. The full set of pairwise alignments consisted of 249,500 overlaps, where we consider both orientations of the overlap. ∼32.5% of the true alignments are recovered within the full set of pairwise alignments. The full set of pairwise alignments is divided into a training and a test data set. We trained a random forest using different values for class weights and cutoff values. The left graph shows a precision-recall plot where each line represents a different weighting for the false alignment class. The points on the line represent different cutoff values. From the graph, we can see that the class weights do not greatly affect the precision and recall curves. This could be attributed to the fact that the chosen features discriminate between true and false alignments with high accuracy despite the overabundance of false alignments within the data set. From the graph, a cutoff of 0.65 for a true alignment without any class weights achieves a recall of 0.38 and a precision of 0.59. The right graph is a ROC curve of the same data.

[Figure 3.10: left panel, precision versus recall; right panel, sensitivity (TPR) versus 1 − specificity (FPR); curves for the linear, linear & random forest, and log-likelihood methods, with points along each curve labeled by alignment score threshold.]

Figure 3.10: The random forest classified a set of pairwise alignments for another simulated genome of size 5 Mb with 10x coverage using the same average restriction fragment size. The plots above show the precision-recall and ROC curves for the three different methods: the linear scoring function, the linear scoring function combined with the random forest classifier, and the likelihood scoring function. The marked points indicate different alignment score thresholds for each of the scoring functions used. From the graphs, we can see that the linear scoring function combined with the random forest classifier outperforms using only the linear scoring function or the likelihood scoring function. At an alignment score threshold of 5 (alignments with score ≥ 5 are retained), we have a precision of ∼0.40 and a recall of ∼0.11. Using only the linear scoring function at the same threshold, we obtained a precision of ∼0.04 and a recall of ∼0.56. We are able to greatly increase our precision using the random forest classifier.

Although our results show that we do reduce the number of false positives, it is interesting to examine the increase in false negatives, since some true positives will be misclassified as false alignments. Recall that the purpose of our alignments is the assembly of a target genome, so it is crucial to maximize the number of true alignments recovered. Table 3.2 lists the number of true positives, false positives, true negatives, and false negatives for the various alignment methods at different alignment score thresholds. Comparing the linear scoring function against the linear scoring function combined with the random forest classifier, we see that at an alignment score threshold of 5, we obtain 4,213 true positives using the linear scoring function and 1,524 using the linear scoring function and random forest. Adding the random forest therefore loses 2,689 true positives, which amounts to ∼64% of the true positives found using the linear scoring function alone. However, we are able to reduce the number of false positives from 107,587 to 2,252, so that only ∼60% of our data set is composed of false alignments. Without the random forest classifier, we would encounter a data set in which ∼96% of the alignments are false positives. In either case, we still recover significantly more alignments than the likelihood method, which only manages to recover 1,441 true alignments if we use all of the returned alignments without any alignment score threshold.

Method                    TP      FP       FN      TN       Alignment Threshold
Likelihood                1322    50134    3521    69773     0.00
                          864     8108     7160    108618    2.00
                          385     904      8874    114587    4.00
                          168     153      9404    115025    6.00
                          30      22       9638    115060    8.00
                          2       3        9683    115062   10.00
Linear                    4570    245232   20      1670      0.00
                          4213    107587   3302    136390    5.00
                          2814    12666    9014    226998   10.00
                          1437    1734     12290   236031   15.00
                          619     411      13933   236529   20.00
                          202     111      14631   236548   25.00
Linear & Random Forest    1531    2948     12161   234852    0.00
                          1524    2252     12185   235531    5.00
                          1491    1526     12283   236192   10.00
                          1308    1250     12619   236315   15.00
                          619     411      13933   236529   20.00
                          202     111      14631   236548   25.00

Table 3.2: Confusion Matrix for Likelihood, Linear, and Linear & Random Forest Alignment Methods

We also examined regions of the genome that were being recovered more accurately using the combined linear scoring function with the random forest classification.
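The precision and recall values quoted for an alignment score threshold of 5 can be checked directly against the Table 3.2 counts. The following is a minimal sketch; the function and variable names are illustrative, and only the counts come from the table.

```python
def precision(tp, fp):
    """Proportion of reported alignments that are true: TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Proportion of true alignments that are reported: TP / (TP + FN)."""
    return tp / (tp + fn)


# Counts from Table 3.2 at an alignment score threshold of 5.
linear = {"tp": 4213, "fp": 107587, "fn": 3302}
linear_rf = {"tp": 1524, "fp": 2252, "fn": 12185}

if __name__ == "__main__":
    for name, c in (("Linear", linear), ("Linear & Random Forest", linear_rf)):
        print(f"{name}: precision={precision(c['tp'], c['fp']):.2f}, "
              f"recall={recall(c['tp'], c['fn']):.2f}")

    # True positives given up by adding the random forest filter.
    lost = linear["tp"] - linear_rf["tp"]
    print(f"lost true positives: {lost} "
          f"({lost / linear['tp']:.0%} of the linear total)")
```

The computed values (precision 0.04 and recall 0.56 for the linear scoring function; precision 0.40 and recall 0.11 with the random forest; 2,689 true positives lost, about 64%) agree with the figures quoted in the text.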
In Figure 3.11, we plot the positive predictive value for various cutoff probabilities using the random forest. From the graph, we see that as we decrease the cutoff probability, we increase the positive predictive value across the genome. Recall that the linear score function, in general, recovers more alignments in regions that have large fragment sizes, as these are easier to align together due to their rare occurrence. The situation remains the same using the random forest classification, as it is applied to the set of returned alignments from the linear scoring function. Since the number of true positives existing in the set of returned alignments is extremely small in comparison to the number of false positives, by lowering the cutoff probability we allow for more alignments to be selected by the random forest. This results in a higher positive predictive value at the expense of increasing the size of the data set of selected alignments.

[Figure 3.11: x-axis, fragment position (0 to 400); y-axis, positive predictive value (0.000 to 0.010); one curve for each cutoff probability 0.8, 0.6, 0.4, 0.2.]

Figure 3.11: Positive predictive value curves are plotted across the genome for various cutoff probabilities using the random forest. The cutoff probabilities determine whether we call an alignment a true alignment based on the random forest classification. We note that the positive predictive value increases as we lower the cutoff probabilities. In general, the positive predictive value increases in regions of the genome with large fragment sizes.

Implementation Issues: We briefly note some practical considerations when utilizing random forests with the extremely large data sets that we encounter from optical mapping. The most time-consuming process of the random forest is the training procedure, since we must repeatedly sample from our training set in order to construct the decision trees. To make this method feasible, we parallelize the training process by dividing the random forest into smaller components that are individually constructed.
We build our random forest in parallel and train each smaller component individually instead of constructing one large random forest. Specifically, we build 100 random forests consisting of 5 trees each from our training set. During the classification step, we aggregate the classifications from each random forest into a final classification. This does not affect the overall method, as the individual components can be viewed as random forests themselves. Recall that instead of outputting the class returned by the random forest, we return the class probabilities associated with each class. This allows us to fine-tune and select an appropriate probability threshold to determine if an alignment is either a true or false alignment. By dividing and parallelizing the construction and classification steps of the random forest, we are able to process our data sets more efficiently in terms of time and storage requirements.

3.1.4 Conclusion

We have developed and validated a new scoring function for the alignment of optical maps using dynamic programming. The new scoring function requires parameters whose values we have determined theoretically and shown to perform well in practice. We performed simulations to compare the optimal parameters against our theoretical values and found that they agree. The new scoring function recovers drastically more alignments than the previous likelihood scoring function, which improves the de novo assembly of a target genome. Although we are able to discover more alignments, we are faced with the huge number of false alignments that are returned. In order to limit the number of false alignments, we have used random forest classification to retain only the alignments that are more likely to be true alignments. This results in a substantial decrease in the number of false alignments with the tradeoff of losing some true alignments.
Since the goal of our alignment method is de novo assembly, we have made significant gains by: (1) recovering more true alignments to construct our assembly, and (2) removing false alignments that would result in errors and ambiguity when attempting to assemble maps using them.

3.2 FPC Map Alignment

3.2.1 Introduction

In the previous section, we developed a scoring function used to align optical maps for pairwise alignment. We now turn our attention to a different type of alignment problem where we are comparing optical maps against reference maps. Recall from Section 1.3.2 the distinction between optical and reference maps. Optical maps are the result of an experimental collection. Reference maps are obtained from sequence data as a list of ordered fragments that would be produced (with no error) by calculating exact lengths between restriction sites in the genomic sequence. We call fragments produced in such a manner in silico digested. In this section, we examine the problem of comparing optical maps against reference maps derived from sequence contigs for a large-scale sequencing project.

Sequencing methods can be classified according to two approaches: the whole genome shotgun approach and the clone-based hierarchical approach (Green, 1997; Weber and Myers, 1997). Both methods have their own pros and cons, which has led to a hybrid combination of the two. The whole genome approach involves randomly shearing target DNA into fragments, sequencing them, and using linking information and computational analysis to piece together and assemble the original genome. This approach works well for repeat-poor genomes but encounters problems with repetitive genomic regions because of the ambiguity in resolving linking information.

The clone-based hierarchical approach, on the other hand, involves the generation of a physical map to aid with the assembly process. Traditionally, physical maps are constructed by organizing large-insert clones covering the genome (Green, 2001).
The clones are placed into the physical map detailing the locations of each relative to the other clones along the genome. Clones along a minimal tiling path are then shotgun sequenced and assembled separately.

The hierarchical approach offers several advantages. The physical map reduces the chances of long-range misassembly since each clone is individually sequenced to produce a set of sequence contigs. Cloning biases from under-represented regions can be detected and retargeted for additional sequencing. The task of assembling individual clones can be divided among different groups, allowing the work to be easily distributed. For repeat-rich genomes, assembly quality is improved by lowering the frequency of misassembled regions, where the physical map is used to infer the nature of repeat-rich regions.

To organize the clones into a physical map, a ‘fingerprint’ of each clone is obtained. These fingerprints allow for the clones to be distinguished from each other and the amount of similarity between different clones to be assessed. One common fingerprint method is to perform a restriction digest of each clone and measure the lengths of the resulting restriction fragments by agarose gel electrophoresis, as discussed in Section 1.1. The clones are then assembled into fingerprint contigs (FPC) using software that analyzes the ‘fingerprint’ of each clone to merge overlapping clones (Soderlund et al., 2000). Unique molecular markers are also incorporated to assist in the assembly of the fingerprint contigs and validate their ordering and orientation as well as anchoring them against the genome. Although the clones are assembled into fingerprint contigs in known order and location on the genome, the locations and orientations of the sequence contigs derived from the clones are not known.

In this section, we present methods for comparing FPC maps against optical maps. Although both are physical maps, they are produced using different techniques.
Recall that the FPC is composed of a list of ordered clones. The clones themselves contain sequence contigs. In order to compare the two, reference maps derived from the clones' sequence contigs are compared against the optical maps. Our comparisons must take into account that the sequence contigs are organized according to the FPC map.

3.2.2 Methods

Data Description: The fingerprint contig (FPC) map is organized into individual fingerprint contigs anchored to specific chromosomes, where $F_c = (F_1, \dots, F_i)$ denotes the ordered list of fingerprint contigs for chromosome $c$. Each fingerprint contig $F_i$ is composed of an ordered list of clones $(B_1, \dots, B_{j_i})$. The fingerprint contigs can contain different numbers of clones. Each clone $B_{j_i}$ is individually shotgun sequenced, producing sequence contigs $\{S_1, \dots, S_{k_{j_i}}\}$. Figure 3.12 (a) depicts fingerprint contigs anchored to the genome and the corresponding clones. Denote $|B_{j_i}|$ as the size of the clone in base pairs. The order and orientation of the sequence contigs are unknown within the clone. In order to compare the sequence data against the optical map data, an in silico digest is performed on each sequence contig $S_{k_{j_i}}$, producing ordered restriction fragments with sizes $(r_{k_{j_i},1}, \dots, r_{k_{j_i},l_{k_{j_i}}})$, with $r_{k_{j_i},x}$ as the $x$-th ordered fragment for sequence contig $S_{k_{j_i}}$, as shown in Figure 3.12 (c). When the context is clear, we will use $F_i$ to refer to a particular fingerprint contig and $B_j$ as one of its clones containing sequence contig $S_k$ with fragments $(r_{k,1}, \dots, r_{k,l})$. We note that the terminal fragments for each sequence contig are not usable, since these fragments only have a single restriction site; they are discarded. After the terminal fragments are removed, we have two types of sequence contigs: single-fragment sequence contigs (those containing only a single in silico fragment) and multi-fragment sequence contigs.
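An in silico digest of a single sequence contig can be sketched as follows. The recognition sequence ATTTAAAT is the SwaI site used in the Figure 3.12 example; the cut offset of 4 (cutting in the middle of the site) and the function names are assumptions made for illustration. Because this site is its own reverse complement, scanning one strand is sufficient here.

```python
def in_silico_digest(sequence, site="ATTTAAAT", cut_offset=4):
    """Return the ordered fragment lengths (in bp) obtained by cutting
    `sequence` at every occurrence of `site`.

    cut_offset is the assumed position of the cut within the site.
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        # Advance by one base so overlapping occurrences are also found.
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


def usable_fragments(fragments):
    """Discard the two terminal fragments, which are bounded by only one
    restriction site each."""
    return fragments[1:-1]


if __name__ == "__main__":
    site = "ATTTAAAT"
    contig = "C" * 10 + site + "G" * 20 + site + "C" * 5
    all_fragments = in_silico_digest(contig)
    usable = usable_fragments(all_fragments)
    kind = "single-fragment" if len(usable) == 1 else "multi-fragment or empty"
    print(all_fragments, usable, kind)
```

A contig with fewer than two restriction sites yields no usable fragments under this rule, since every fragment it produces is terminal.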
Although the relative order and orientationof the sequence contigsare unknownwithineach clone, the order 134 of each sequence contig’s in silico fragments are known. The optical map consists of contigs O 1 ,...,O p . Each optical map O p contains ordered restriction fragments with sizes (t 1 ,...,t qp ). The FPC map also contains the approximate locationsof clones withineachfinger- print contig. Let d(B j ,B k ) be the distance between the midpoints of clones B j and B k as indicated by the FPC map. From the construction of the FPC map, the dis- tances between clones within the same fingerprint contig are known whereas distances ofclonesfromdifferentfingerprintcontigsaregenerallyunknown. Ouralgorithmtakes this into account by treating those distances separately when determiningan alignment foraparticularopticalmap. An alignmentbetweenan opticalmapO p andfingerprintcontigF i consistsofpairs of matching fragments between the two. The matching fragments from the fingerprint contig F i are derived from the in silico sequence contig fragments obtained from its clones. As an optical map cannot be larger than a chromsome from the genome of interest,we align an opticalmapto allfingerprint contigsfor a specific chromsomeF c . The resulting alignment of the sequence data for each clone to the optical map should have distances between clones that correspond to the distances indicated by the FPC map. 3.2.3 AlgorithmOverview ToalignanopticalmapagainstanFPCmap,wedivideouralgorithmintothefollowing steps: 1. Matches between restriction fragments from both the optical maps and sequence contigsarefound. 135 ~ ~ ~ ~ Genome B i S 1 S 2 S 3 S i (a) (b) (c) 12.5 Kb 18.4 Kb r 1 r 2 AA. . .CCATTTAAATTGA. . .ACTTTG. . .GAGTAAATTTAAATCTGATAC. . .ATAAGA. . .GTCATTTAAATCGG. . .GG F 1 F 3 F 2 F 4 F 5 B 1 B 2 B 3 B 4 B 1 B 2 B 3 B 1 B 2 B 3 B 4 B 1 B 2 B 3 B 4 B 1 B 2 Distances between overlapping clones are known. Distances between fingerprint contigs are unknown. 
2. The possible locations of all clones on the optical maps are found based on the matches between fragments from the optical maps to the sequence contigs of each clone.

3. The final alignment between an optical map and an FPC map is determined by incorporating the information of the locations of the clones given by the FPC map against the locations of the clones on the optical map previously discovered.

Each stage uses different methods that are described below.

Figure 3.12: Diagram of the arrangement of the FPC map where: (a) The fingerprint contigs $F_1, F_2, F_3, F_4, F_5$ are anchored to a specific chromosome. The distances between the overlapping clones within each fingerprint contig are known whereas the distances between the fingerprint contigs are not. (b) Each clone $B_i$ is shotgun sequenced, producing a set of sequence contigs whose order and orientation are unknown. In the example shown, the clone has three sequence contigs $S_1, S_2, S_3$ where the ordering and orientation is arbitrary. (c) An in silico digest is performed on the sequence contig $S_i$. In the example shown, the enzyme SwaI with recognition sequence 5'-ATTTAAAT-3' is used, where the digestion results in four fragments total with two usable fragments $r_1$ and $r_2$ of sizes 12.5 Kb and 18.4 Kb. Recall that terminal fragments are removed, so that for this particular sequence contig consisting of four fragments, there are only two usable fragments available.

Fragment Matching

Initially, matches between fragments from both optical maps and in silico digested sequence contigs are found. From Proposition 2.1.2, given a sequence contig fragment size $r_j$, observed optical map fragments have sizes normally distributed with mean $r_j$ and variance $\sigma^2 r_j$. A match is declared between an optical map fragment size $t_i$ and sequence contig fragment size $r_j$ if

\[ \frac{|t_i - r_j|}{\sigma \sqrt{r_j}} \le k \tag{3.2} \]

for given $k$, with $k$ as the number of standard deviations from the underlying true size $r_j$. This assumes that sequencing errors are rare and that the in silico fragment sizes from the sequence contigs are correct.
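As an illustration of criterion (3.2), the sketch below tests a single pair of fragments and then finds all matches for a list of contig fragments by sorting the optical map sizes once and using binary search on the interval $[r_j - k\sigma\sqrt{r_j},\, r_j + k\sigma\sqrt{r_j}]$. The function names are hypothetical, sizes are assumed to be in Kb, and the defaults $k = 1$, $\sigma = 0.55$ are the values listed under Parameters in Section 3.2.5.

```python
import bisect
import math


def is_match(t, r, sigma=0.55, k=1.0):
    """Criterion (3.2): |t - r| / (sigma * sqrt(r)) <= k, with sizes in Kb."""
    return abs(t - r) <= k * sigma * math.sqrt(r)


def find_matches(optical_sizes, contig_sizes, sigma=0.55, k=1.0):
    """Return sorted (contig_index, optical_index) pairs satisfying (3.2)."""
    # Sort the optical fragments by size once, remembering original positions.
    order = sorted(range(len(optical_sizes)), key=lambda i: optical_sizes[i])
    sorted_sizes = [optical_sizes[i] for i in order]
    matches = []
    for j, r in enumerate(contig_sizes):
        tol = k * sigma * math.sqrt(r)
        # Binary search for the block of optical sizes inside [r - tol, r + tol].
        lo = bisect.bisect_left(sorted_sizes, r - tol)
        hi = bisect.bisect_right(sorted_sizes, r + tol)
        matches.extend((j, order[i]) for i in range(lo, hi))
    return sorted(matches)
```

With the two usable fragments of Figure 3.12 (c), 12.5 Kb and 18.4 Kb, the tolerances are roughly 1.94 Kb and 2.36 Kb, so an optical fragment of 13.5 Kb matches the first and one of 18.0 Kb matches the second.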
We do not consider the situation of multiple fragments matching and assume that both the optical map and sequence contigs are error free. This can be incorrect, as undigested restriction sites and random DNA breakage cause both missing and false cuts to appear in the optical mapping data. Repetitive sequences also cause sequence contigs to be overcollapsed during the assembly process, resulting in derived in silico fragments that are much smaller than their true size. Although these cases are possible, we initially set the criteria for matching fragments to be strict and later relax them once we have found fragments between the optical map and sequence contig that match under more stringent conditions.

As noted in Section 2.1.2, the sizing error model agrees well for larger fragments (> 4 Kb), but does not hold for smaller fragments. Small fragments are also underrepresented in optical map data, as small DNA fragments have weak surface affinity and are unable to adhere to the surface mount during the linearization step. We chose to remove fragments < 1 Kb from the sequence contig data since these are difficult to detect in optical mapping data. Furthermore, fragments from both the sequence contig and optical map data below a given threshold $\Delta$ are considered matches regardless of their sizes.

We utilized techniques developed by Yang (2005) to quickly determine matches between fragments from the optical map contigs and sequence contigs. A list of $n$ fragment lengths from both contigs was constructed and sorted, taking $O(n \log n)$ time. We then scanned through the list and performed a binary search to determine and store the matching fragments for a given optical map fragment.

Clone Locations

The locations of all clones are determined based on the matching fragments of their sequence contigs. Recall that each clone $B_j$ is composed of sequence contigs $\{S_1, \ldots, S_k\}$ whose order and orientation are unknown, as shown in Figure 3.12 (b).
The goal of this step is to determine a possible alignment of a clone to a particular region on the optical map by finding alignments of the individual sequence contigs of the clone to the optical map. We first determine a possible location on an optical map that the clone can align to by quickly filtering for candidate regions based on the density of matching fragments between the optical map and the in silico fragments from the sequence contigs. Once a candidate region is found, an actual alignment is determined, taking into account that the sequence contigs have unknown order and orientation. The procedure is described next.

Candidate Regions: Given a clone $B_j$ and its sequence contigs $\{S_1, \ldots, S_k\}$, candidate regions where the clone can align to the optical map are first found. We construct a match matrix $M$ between the optical map fragments $(t_1, \ldots, t_q)$ and all sequence contig fragments $S_1 = (r_{1,1}, \ldots, r_{1,l_1}), \ldots, S_k = (r_{k,1}, \ldots, r_{k,l_k})$. The matrix contains a '•' if the fragment from a sequence contig matches an optical map fragment from the previous step. The idea is to scan through the matrix for regions where the density of '•'s warrants further consideration. Since each row and column represents matches to a particular sequence contig fragment and optical map fragment respectively, we can examine the rows and columns of a region for entries containing '•'. A clone can match to a region on the optical map only if the match matrix contains enough matching fragments for that region.

Using a sliding window approach, we scan through the match matrix to determine candidate regions for a particular clone on the optical map. The size of the sliding window, $W = |B_j| + k\sigma\sqrt{|B_j|}$, is set to the size of the clone $|B_j|$ plus the measurement error associated with optical mapping. The sizes of the sequence contig fragments do not necessarily add up to the size of the clone. Sequencing quality and status can result in fragments whose total size is much smaller than the given clone size.
For partially sequenced clones or those of low draft quality, the resulting in silico fragments derived from the sequence contigs are usually substantially smaller in total than the given clone size.

A given region on the optical map, $(t_a, \ldots, t_b)$, specified by an interval of starting and ending fragments $t_a$ and $t_b$, is evaluated as a candidate region by examining the number of matching fragments in the rows and columns of the match matrix. Each column specified by the positions $t_a, \ldots, t_b$ is marked if the optical map fragment specified by the column matches some sequence contig fragment, i.e., the column contains at least one '•'. The resulting number of marked columns is called the column occupancy $C$. Similarly, each row specified by the sequence contig fragments, $S_1 = (r_{1,1}, \ldots, r_{1,l_1}), \ldots, S_k = (r_{k,1}, \ldots, r_{k,l_k})$, is marked if the sequence contig fragment specified by the row matches some optical map fragment, i.e., the row contains at least one '•'. The resulting number of marked rows is called the row occupancy $R$. Figure 3.13 (a) depicts the column occupancy and row occupancy calculations. We expect that each sequence contig fragment matches a unique optical map fragment within a considered region. If each sequence contig fragment is uniquely matched, then the column occupancy and row occupancy should both be equal to the number of sequence contig fragments. However, it is possible that some sequence contig fragments might not be present on the optical map, or that the sizes of the reported sequence contig fragments might be incorrect, resulting in non-matches. A region is considered a candidate if both the column occupancy and row occupancy are above a set threshold based upon the total number of sequence contig fragments.

Given sequence contigs containing $T$ fragments in total, we consider a region a candidate if $C \ge \alpha T$ and $R \ge \alpha T$ for a given parameter $\alpha$, column occupancy $C$, and row occupancy $R$. The parameter $\alpha$ can be determined in a clone-specific manner to account for the varying number of sequence contigs for each clone.
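The occupancy test can be sketched as below. This is an illustrative sketch only: the function names, the representation of the match matrix as a set of (row, column) index pairs, and the way the window is grown until its total size would exceed the window size are assumptions, not a description of the implementation used here.

```python
def occupancy(matches, a, b):
    """Column and row occupancy for optical fragments a..b (inclusive).

    `matches` is a set of (row, col) pairs: `row` indexes a sequence contig
    fragment of the clone and `col` indexes an optical map fragment.
    """
    in_window = [(row, col) for row, col in matches if a <= col <= b]
    columns = {col for _, col in in_window}
    rows = {row for row, _ in in_window}
    return len(columns), len(rows)


def candidate_regions(matches, optical_sizes, window_size, total_fragments,
                      alpha=0.8):
    """Return windows (a, b) with C >= alpha * T and R >= alpha * T."""
    regions = []
    n = len(optical_sizes)
    for a in range(n):
        # Grow the window while its total size stays within window_size.
        b, span = a, optical_sizes[a]
        while b + 1 < n and span + optical_sizes[b + 1] <= window_size:
            b += 1
            span += optical_sizes[b]
        c, r = occupancy(matches, a, b)
        if c >= alpha * total_fragments and r >= alpha * total_fragments:
            regions.append((a, b))
    return regions
```

Counting marked rows and columns rather than raw '•' entries keeps a single promiscuous fragment, one that matches many optical fragments, from making a window look dense on its own.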
The presence of a low number of sequence contig fragments could be due to either the clone being partially sequenced or few restriction sites within the sequence contigs. An alternative method to cut down on the number of candidate regions found is to enforce that rare sequence contig fragment sizes are matched within a candidate region. To further reduce the number of candidate regions, we first examine only clones containing at least one multi-fragment sequence contig. Clones with only single-fragment sequence contigs often match to multiple locations, resulting in numerous false positives, since we are then examining matches of a single fragment to the optical map.

Match Graph: Once a candidate region is found, an alignment between the sequence contig fragments and optical map fragments must be determined. The previous step only considered the presence of matching fragments between the two and did not take into account that fragments belonging to the same sequence contig should be matched in the same order on the optical map. Each optical map fragment within a candidate region must uniquely match exactly one sequence contig fragment. Furthermore, the order and orientations of the sequence contigs are unknown. We want to maximize the number of matched fragments between the sequence contigs and optical map contig.

The problem of finding the matching fragments between the optical map contig and sequence contigs can be shown to be NP-complete with the following reduction. Consider the match matrix where each '•' indicates a match between a fragment from the sequence contig and optical map contig. In this situation, we want to select elements from the match matrix containing a '•' such that no two elements are on the same row or column of the match matrix. Elements in the same row indicate matches to the same sequence contig fragment, and similarly, elements in the same column indicate matches to the same optical map contig fragment.
As each sequence contig fragment can match only one optical map contig fragment, we must choose elements respecting the previous requirement. We first define the following problem.

Definition 3.2.1. We define the MATCH-FRAGMENTS problem as follows. We are given a set of $n$ fragments denoted as $T = \{f_1, \ldots, f_n\}$ and a set of $k$ matching fragment pairs $M = \{m_1 = (f_{i_1}, f_{j_1}), \ldots, m_k = (f_{i_k}, f_{j_k})\}$. Given fragment placements defined as subsets $P_1, \ldots, P_m \subseteq M$ with weight $w_i = |P_i|$, does there exist a collection of these subsets with total weight at least $k$ such that no fragment appears twice in the collection of subsets?

We will show the following.

Lemma 3.2.1. The MATCH-FRAGMENTS problem is NP-complete.

Proof. We first note that we defined the problem in this manner to ensure the requirement that each fragment is matched uniquely to only one other fragment. We define a placement in terms of the matching pairs of fragments. Figure 3.13 (a) gives an example of the matches between sequence contigs for a given clone to a region on the optical map, which we can view as the set of matching pairs $M$. We can see that each sequence contig can match to multiple locations, each of which corresponds to a particular placement. We will consider the simpler case where the placements are unweighted.

Given a candidate solution of the placements, we can easily check that the placements do not contain a fragment appearing twice, so the problem is in NP. To show that the problem is NP-hard, we will reduce from set packing. Recall the set packing problem: given a collection $U = \{x_1, \ldots, x_n\}$ of $n$ elements, a collection $S_1, \ldots, S_m$ of subsets of $U$, and a number $k$, does there exist a collection of at least $k$ of the subsets with the property that no two of them intersect? Our reduction is simple: we define our set of fragments of size $2n$ as $T = \{f_1, \ldots, f_n, f_1^*, \ldots, f_n^*\}$, where each element $x_i \in U$ is duplicated as fragments $f_i$ and $f_i^*$.
Our matching fragment pairs consist of the set $M = \{m_1 = (f_1, f_1^*), \ldots, m_n = (f_n, f_n^*)\}$, where we match each fragment with its duplicate. We transform each subset $S_i$ into a placement $P_i$ consisting of the original elements of $S_i$ along with their duplicates. If we have a matching fragment placement of size $k$ then we have a set packing of size $k$, as each of the placements $P_i$ and its corresponding set $S_i$ are disjoint according to our construction. Since $P_i$ consists of the original elements from $S_i$ as well as their duplicates, we know that if the $S_i$'s are disjoint, then the $P_i$'s are as well, as none of the elements $f_i$ are common to any two of the placements and neither are their duplicates $f_i^*$. Similar reasoning follows for the $S_i$'s. Clearly, the construction takes polynomial time.

Match Graph Construction: As MATCH-FRAGMENTS is NP-complete, we developed a heuristic method where we transform the matched fragments for each of the candidate regions into a match graph. A node in the graph represents an alignment of a sequence contig fragment to an optical map fragment within the candidate region. Edges are then drawn between the nodes to enforce the restrictions given above. The graph at this point encodes paths representing a full alignment of all sequence contigs within the clone. To find a path optimizing the number of matched fragments, we weight the edges so that paths representing high quality alignments are found. We then traverse the graph to find a path representing a possible alignment and check if the alignment contains enough matched fragments.

We first explain how the match graph is constructed. Recall that our candidate region consists of matches between sequence contig and optical map fragments. Let $r_{i,j}$ denote the $j$-th ordered sequence contig fragment for sequence contig $S_i$ and $t_k$ the $k$-th ordered optical map fragment. Given a candidate region, a directed graph $G$ is constructed with a node $v_{i,j,k}$ for each pair of matching fragments, $r_{i,j}$ from a sequence contig and optical map fragment $t_k$ within the interval $(t_a, \ldots, t_b)$.
We follow the steps below to place edges between the nodes.

1. Sequence Contig Edges.
For multi-fragment sequence contigs (contigs containing more than one sequence contig fragment), directed edges are initially placed from $v_{i,j,k} \to v_{i,j+1,k+1}$ to represent an alignment between a pair of fragments from the same sequence contig matched in the same order to a pair of optical map fragments. Similarly, edges are placed from $v_{i,j,k} \to v_{i,j-1,k+1}$ to account for an alignment of the sequence contig fragments in the reverse orientation to the optical map. Since we want to match sequence contigs containing more than one fragment, the edges placed at this point should link together matched fragments on both the sequence contig and optical map in either the forward or reverse orientation of the sequence contig. We can preliminarily enforce the ordering by collapsing all "single" paths containing nodes with in-degree and out-degree equal to 1. These represent unique placements of the corresponding sequence contig and optical map fragments, where the path can be replaced with a single edge containing the starting and ending match node. We note that for nodes involved in both orientations, we replace the node with two nodes, one associated with the forward orientation and one with the reverse orientation. To eliminate spurious matches between successive pairs of matched fragments that might reduce the number of paths that we are able to collapse, we can first remove all paths under a certain length that is dependent upon the number of fragments for a particular sequence contig. Figure 3.13 (b) shows the initial placement of nodes and edges for a sample match graph.

2. Link Edges.
At this point, all components consist of either a single node $v_{i,j,k}$ or a single edge, $e = v_{i,j',k'} \to v_{i,j,k}$ with $k' < k$, where node $v_{i,j',k'}$ is termed the begin node and $v_{i,j,k}$ is termed the end node, as shown in Figure 3.13 (b).
For each single node component and end node of a single edge component, $v_{i,j,k}$, the following edges are placed:
(a) For each single node graph component, $v_{a,b,c}$, with $k < c$ and $i \neq a$, an edge is drawn from $v_{i,j,k} \to v_{a,b,c}$.
(b) For each single edge graph component, $v_{a,b,c} \to v_{a,b',c'}$, with $k < c < c'$ and $i \neq a$, an edge is drawn from $v_{i,j,k} \to v_{a,b,c}$, the begin node.

3. Start Edges.
A start node, $v_s$, is also created with edges to all single node components and the begin node of all single edge components.

We note that the graph is acyclic since all edges are placed to point from an optical map fragment $t_k$ to $t_{k'}$ where $k < k'$. A final sample match graph is shown in Figure 3.14 (c).

Edge Weights: The above graph encodes possible alignments between the sequence contig fragments and optical map fragments. In order to determine a path traversal that optimizes the number of matched fragments, we weight the edges so that paths representing a high quality alignment will be found. The given weights reflect the types of paths we want, where:

1. Sequence Contig Edges.
These edges are weighted with the number of fragments along the original uncollapsed path. This is to give precedence to paths that use collapsed edges, since they represent successive fragments between a sequence contig and optical map matched in order.

2. Link Edges.
Edges between nodes representing matched fragments of different sequence contigs are weighted according to their position within their respective sequence contigs. For example, for sequence contig $S_i$ and sequence contig $S_{i'}$, suppose an edge exists between nodes $v_{i,j,k} \to v_{i',j',k'}$. Since we want full placement of each sequence contig (all sequence contig fragments are matched), we weight the edge as

\[ \left| \frac{|S_i|}{2} - j \right| + \left| \frac{|S_{i'}|}{2} - j' \right| - |k - k'| \]

where $|S_i|$ refers to the number of fragments in sequence contig $S_i$.
The first two terms give preference to matched fragments that occur near the ends of the sequence contigs, since we expect that edges linking alignments between different sequence contigs should match as much of each sequence contig as possible. The last term penalizes the number of missing fragments between both sequence contigs linked by the given alignment.

3. Start Edges.
Edges from the start node $v_s$ to each of the nodes $v_{i,j,k}$ are given weights $|(|S_i|/2) - j| - (T - k)$, where $T$ is the total number of sequence contig fragments. The first term rewards fragments matched near the ends of the sequence contig. The last term penalizes choosing sequence contigs not matched near the start of the candidate region. This affects the number of leftover fragments to be used to match the rest of the sequence contigs.

Path Traversal: At this point, we can traverse from the start node $v_s$ to a sink node using dynamic programming (Waterman, 1995) to find the longest path, which determines the best alignment according to our edge weightings, as shown in Figure 3.14 (d). Let $l(v)$ for $v \in G$ be the length of the longest path from node $v$. The recursion for the dynamic programming algorithm is

\[ l(v) = \begin{cases} 0 & \text{if } v \text{ is a sink node}, \\ \max_{v \to w} \{ d_{v,w} + l(w) \} & \text{otherwise}, \end{cases} \]

where $d_{v,w}$ is the weight of the edge $v \to w$. However, we encounter a problem with the fact that we must enforce that each sequence contig fragment is matched uniquely. This situation occurs when we have sequence contigs that match to many different locations within a candidate region. The optimal path through the constructed graph usually contains sequence fragments that are uniquely placed. The returned path sometimes does contain sequence contigs placed multiple times if the sequence contig is small and contains average sized fragments (according to the distribution of the optical mapping fragments). To account for returned paths that have sequence contigs multiply placed, we perform an exhaustive path traversal that examines all paths where the edges of the graph are reweighted to 1. We enumerate all possible paths within the graph in this situation and return the longest path.
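The recursion for $l(v)$ can be written as a short memoized traversal. The sketch below is illustrative only: the adjacency-list representation and the function name are assumptions, the weights in the usage example are arbitrary, and it does not implement the uniqueness check or the exhaustive fallback described above.

```python
from functools import lru_cache


def longest_path(edges, start):
    """Longest weighted path from `start` in a directed acyclic graph.

    `edges` maps each node to a list of (successor, weight) pairs. Returns
    (score, path), following l(v) = 0 for a sink node and
    l(v) = max over edges v -> w of d(v, w) + l(w) otherwise.
    """
    @lru_cache(maxsize=None)
    def best(v):
        successors = edges.get(v, [])
        if not successors:                      # sink node
            return 0.0, (v,)
        candidates = [(weight + best(w)[0], best(w)[1])
                      for w, weight in successors]
        score, path = max(candidates, key=lambda c: c[0])
        return score, (v,) + path

    return best(start)
```

For example, with `edges = {"s": [("a", 1.0), ("b", 2.0)], "a": [("c", 5.0)], "b": [("c", 1.0)], "c": []}`, the call `longest_path(edges, "s")` follows `s`, `a`, `c`. Because the graph is acyclic, each node is evaluated once, so the running time is linear in the number of edges; for very deep graphs an iterative pass in topological order would avoid Python's recursion limit.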
A candidate region is marked as a feasible alignment if the discovered path contains at least $\beta|T|$ aligned fragments, where $|T|$ is the total number of sequence contig fragments, for a given parameter $\beta$. This removes a large number of spurious alignments, where greater values of $\beta$ require more of the sequence contig fragments to be aligned. The parameter $\beta$ can be adjusted based on the sequencing status of the given clone to reflect the quality of the given sequence data. The given region on the optical map is marked at the midpoint optical map fragment location $j$ to indicate that the clone $B_i$ matches to a candidate region at that particular location, where we set $A(B_i, j) = 1$.

Figure 3.13: The steps for determining a clone location on an optical map where: (a) An example candidate region discovered on an optical map where the clone $B_j$ consists of three sequence contigs, $S_1, S_2, S_3$. They have 3 fragments, 1 fragment, and 2 fragments respectively. The match matrix between the clone $B_j$ and optical map is shown, where the dots indicate a match between a sequence contig fragment and optical map fragment. (b) Initial construction of the match graph, where the edges drawn indicate fragments matched in order on both a sequence contig and optical map. Notice that the edges lie on the diagonals of the match matrix, indicating placements of the sequence contig. The matched fragments between fragment $r_1$ of sequence contig $S_1$ and optical map fragments $t_6$ and $t_7$ are removed since they are paths of length 0 whereas sequence contig $S_1$ has three fragments. [Figure annotation for panel (a): column occupancy = 8, row occupancy = 6.]

FPC Alignment

In this step, we want to find an alignment of an optical map to the FPC map. The FPC map contains an ordered list of fingerprint contigs.
The fingerprint contigs are composed of an ordered list of clones. From the previous step, we have determined the locations of all the clones on the optical map. However, the locations are determined individually and do not take into account their position relative to other clones based on the FPC map. Using the locations of the clones, we can use dynamic programming to find an alignment of an optical map to the FPC map. An alignment of an optical map to the FPC map is shown in Figure 3.15 (a).

Figure 3.14: (c) The edges from the initial construction are collapsed and shown as dashed lines. The grey nodes represent nodes where additional edges are drawn according to the rules previously given. A start node, $v_s$, is also added to the graph with the corresponding edges. (d) The longest path is determined in the acyclic graph, as indicated by the thick, solid black lines with the corresponding matches between sequence contig fragments. The path results in the alignment of sequence contig $S_1$ in the reverse orientation with aligned fragments $(r_1, t_{10}), (r_2, t_9), (r_3, t_8)$, sequence contig $S_2$ with aligned fragments $(r_1, t_{16})$, and sequence contig $S_3$ in the forward orientation with aligned fragments $(r_1, t_{14}), (r_2, t_{15})$.

Match Score: We first define the match score $m(B_i, j)$ of a clone $B_i$ to the $j$-th ordered fragment on the optical map. We would like the match score to reflect the sequencing status of the clones as well as the fragment density. We expect clones of higher sequence quality to be aligned since they should be free of errors. Clones containing a large number of fragments should also be aligned since more fragments are available to correctly place them. However, high sequence quality and a large number of fragments are often not correlated with each other.
Ideally, each clone should only contain a single sequence contig spanning the entire clone if all gaps and ambiguities have been resolved. Finished clones, those that have undergone extensive manual intervention, usually have a small number of sequence contigs. Low draft quality clones, on the other hand, contain many sequence contigs due to the numerous sequencing gaps and unresolved inconsistencies in their assembly. The sequence contigs from these clones can possibly overlap and should be merged together, but the low sequence quality does not provide enough information to assemble them together. This artificially creates a greater number of sequence contig fragments from the increased number of sequence contigs.

From the previous step of determining the clone locations, $A(B_i, j) = 1$ if clone $B_i$ is found at the $j$-th ordered fragment on the optical map and 0 otherwise. To account for the sequence status and number of fragments of each clone, the match score is set to $m(B_i, j) = \zeta_{B_i} \cdot A(B_i, j) + T$, where $\zeta_{B_i}$ refers to the sequencing quality of the clone and $T$ is the total number of sequence contig fragments in $B_i$. This gives a linear match score in terms of the sequencing status and number of fragments.

Dynamic Programming Recursion: Consider an optical map $O_p$ and fingerprint contig $F_i = (B_1, \ldots, B_j)$. Let $S(i, j)$ denote the score of a match between clones $B_1, \ldots, B_i$ to locations $1, \ldots, j$ on optical map $O_p$. Recall that the distances between clones $B_j$ and $B_k$ are specified by $d(B_j, B_k)$ by the FPC map. Let $m(B_i, j)$ be the match score for aligning clone $B_i$ at location $j$ on the optical map as defined above. The recursion is then

\[ S(i,j) = \begin{cases} m(B_i, j) & i = 1, \\ \max_{k} \{ m(B_i, j) + S(i-1, k) \} & f_l(d(B_{i-1}, B_i), j) \le k \le f_u(d(B_{i-1}, B_i), j), \end{cases} \]

where $f_l(d(B_{i-1}, B_i), j)$ and $f_u(d(B_{i-1}, B_i), j)$ are the lower and upper bounds on the distances between clones $B_{i-1}$ and $B_i$.
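A minimal sketch of this recursion for the clones of a single fingerprint contig is given below. Several simplifying assumptions are made for illustration: locations are integer indices along the optical map, clone distances are expressed in the same index units, the bounds take the same-contig form $j - d(1+\lambda)$ and $j - d(1-\lambda)$ defined under Clone Distances, and gaps between different fingerprint contigs are not handled. All function and variable names are hypothetical.

```python
import math


def align_clones(match_score, distances, n_positions, lam=0.3):
    """Fill S(i, j) for the clones of one fingerprint contig.

    match_score[i][j] is m(B_i, j); distances[i] is d(B_{i-1}, B_i) for
    i >= 1 (entry 0 is unused). Returns the score matrix and back pointers.
    """
    neg_inf = float("-inf")
    n_clones = len(match_score)
    S = [[neg_inf] * n_positions for _ in range(n_clones)]
    back = [[None] * n_positions for _ in range(n_clones)]
    S[0] = list(match_score[0])                  # base case, first clone
    for i in range(1, n_clones):
        d = distances[i]
        for j in range(n_positions):
            lo = max(0, math.ceil(j - d * (1 + lam)))                # f_l
            hi = min(n_positions - 1, math.floor(j - d * (1 - lam)))  # f_u
            for k in range(lo, hi + 1):
                cand = match_score[i][j] + S[i - 1][k]
                if cand > S[i][j]:
                    S[i][j], back[i][j] = cand, k
    return S, back


def traceback(S, back):
    """Trace back from the largest entry of S; returns [(clone, position)]."""
    cells = ((i, j) for i in range(len(S)) for j in range(len(S[i])))
    i, j = max(cells, key=lambda ij: S[ij[0]][ij[1]])
    placement = []
    while True:
        placement.append((i, j))
        if i == 0:
            break
        i, j = i - 1, back[i][j]
    return placement[::-1]
```

Starting the traceback at the largest entry anywhere in the matrix, rather than in the last row only, lets the alignment end before the final clone.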
The recursion follows from aligning clone $B_i$ at the current position $j$ and examining back in the scoring matrix for a previous alignment of clone $B_{i-1}$ at the possible locations along the optical map given a match at position $j$. Since an optical map cannot be larger than an individual chromosome, the optical maps are aligned against all fingerprint contigs for a given chromosome. This results in a fit alignment where the largest value in the scoring matrix is found and traced back from. Figure 3.15 (b) illustrates an alignment between an optical map and fingerprint contigs anchored to the genome.

Figure 3.15: An alignment between an optical map and fingerprint contigs anchored to the genome where: (a) The optical map is aligned to the fingerprint contigs based upon the matching locations of the clones. The order of the clones and their distances within each fingerprint contig should correspond with the locations found on the optical map. The distances between fingerprint contigs are unknown but must be accounted for in the alignment. In the alignment, clones $B_2$ and $B_3$ of fingerprint contig $F_2$ are directly adjacent. Clones $B_1$ and $B_3$ of fingerprint contig $F_4$ are indirectly adjacent since clone $B_2$ is not aligned to the optical map. (b) The dynamic programming matrix for an optical map contig against a set of anchored fingerprint contigs. Locations of the clones for each fingerprint contig are indicated by dots on the optical map contig. The gray boxes represent the traceback giving the alignment of the optical map against the fingerprint contigs.

Clone Distances: The distances between clones on the same fingerprint contig are approximately known, but the distances between clones from different fingerprint contigs are not. When clones $B_{i-1}$ and $B_i$ are on the same fingerprint contig, we set the lower and upper bound distances as

\[ f_l(d(B_{i-1}, B_i), j) = j - [d(B_{i-1}, B_i)(1 + \lambda)] \]
\[ f_u(d(B_{i-1}, B_i), j) = j - [d(B_{i-1}, B_i)(1 - \lambda)] \]

where we weight the distances given by the FPC map by a factor of $\pm\lambda$. For clones in adjacent fingerprint contigs, the distances between them are usually not known. In this situation, we simply set $f_l(d(B_{i-1}, B_i), j) = j - g_u$ and $f_u(d(B_{i-1}, B_i), j) = j - g_l$, where $g_l, g_u$ denote thresholds on the gaps between fingerprint contigs.

Unknown Orientation: Since the orientations of the optical maps are unknown, they need to be aligned in both the forward and reverse orientation. Each optical map is aligned to an FPC map in both orientations and the best alignment is kept.

Anchored Clones

After the best alignment for an optical map is found, we attempt to improve the alignment by treating the clones aligned to the optical map as anchors. Unaligned clones between a pair of aligned clones are then re-examined during a second pass to see if the alignments could be improved. The parameters from the first step are relaxed so that more sequence contigs can be aligned. In particular:

• The value of $k$ is increased for equation (3.2).
• The parameter $\beta$ in Section 3.2.3 for the number of matched fragments is lowered for the returned path traversal in the match graph. Unmatched fragments are further examined if they consist mostly of small fragments.

The above are applied to regions on the optical map between a pair of aligned clones to find candidate regions for the unaligned clones that might have been missed. The aligned clones allow for a smaller region on the optical map to be examined. This reduces the number of spurious candidate regions that might be detected if less stringent parameters were applied.

Bipartite Matching: Recall that we initially only align clones containing at least one multi-fragment sequence contig. This eliminates the large number of spurious alignments resulting from aligning clones containing only single-fragment sequence contigs.
We now try to align the unaligned clones containing only single-fragment sequence contigs that lie between a pair of aligned clones. With single-fragment sequence contigs, we do not build a match graph but instead form a bipartite graph linking together matching fragments between the sequence contigs and optical map data. Network flow algorithms (Cormen et al., 2001) can be applied to find matching fragments between the two.

3.2.4 Statistical Properties of Alignments

In this section, we discuss some of the statistical properties of our method of aligning the FPC map to an optical map contig.

Fragment Matching Probability: The initial step for aligning FPC maps is to determine matches between fragments from both optical maps and in silico digested sequence contigs. As shown in Figure 3.16, the distributions of the sizes of both can be approximated by an exponential distribution. Let $X$ be a fragment size from a sequence contig where $X \sim \text{Exponential}(\lambda)$ and $Y$ a fragment size from an optical map contig where $Y \sim \text{Exponential}(\kappa)$. From (3.2), a match is declared when

\[ \frac{|Y - X|}{\sigma \sqrt{X}} \le k \]

for a given value of $k$. We have the following.

Lemma 3.2.2. If $X \sim \text{Exponential}(\lambda)$ and $Y \sim \text{Exponential}(\kappa)$ where

\[ f_X(x) = \frac{1}{\lambda} e^{-x/\lambda}, \quad x > 0, \qquad f_Y(y) = \frac{1}{\kappa} e^{-y/\kappa}, \quad y > 0, \]

then

\[ P\left( \frac{|Z - Y|}{\sigma \sqrt{Z}} \le \alpha \right) = \frac{1}{a\theta} \left( 1 - e^{-k/\lambda} \right) + \frac{1}{\lambda} \sqrt{\frac{\pi}{a}} \, \frac{b}{a} \, e^{\frac{b^2}{4a}} \left( Q\left( \sqrt{2ak} - \frac{b}{\sqrt{2a}} \right) + Q\left( \frac{b}{\sqrt{2a}} \right) \right) \]

where $k = \alpha^2 \sigma^2$, $a = \frac{1}{\lambda} + \frac{1}{\theta}$, and $b = \frac{\alpha\sigma}{\theta}$, and $Q(x) = 1 - \Phi(x)$ with $\Phi(x)$ as the cumulative distribution function of the standard normal. Here $Z$, $\theta$, and $\alpha$ correspond to $X$, $\kappa$, and the matching threshold of (3.2), respectively.

Proof. See Lemma 2.2.5.

3.2.5 Results

Datasets

To validate our methods, we used the 2,300-Mb maize B73 genome, presently being sequenced by a clone-based hierarchical approach. Maize (Zea mays L.) is one of the most important cereal crops as well as being a model for the study of genetics, evolution, and domestication. Maize is an extremely challenging genome to sequence due to the genome size, reduplication, and highly repetitive DNA content. The high-resolution physical map of Zea mays ssp. mays cv.
B73 was assembled using different resources: (1) deep-coverage large insert BAC libraries; (2) agarose and high information content fingerprint (HICF) datasets assembled with fingerprinted contig software (FPC); (3) a genetic marker dataset to anchor the physical map to the maize genetic map; and (4) sequence-based marker datasets to infer contig position and orientation with respect to thericereference sequence(Weietal.,2007). ThefinalFPCmapcontained721contigscoveringroughly2,150-Mb(93.5%ofthe 2,300-Mb genome). Of the 721 contigs, 421 are anchored to the maize genome’s 10 chromosomes where anchored contigs cover 1,981 Mb (86.1% of the maize genome). Of the remaining 300 unanchored contigs (7.4% of the genome), 189 contain fewer than ten BAC clones each. The 421 anchored contigs contain 15,128 BAC clones with average size between 100-200 Kb. Each clone has a designated sequencing status in orderofincreasingqualityas: FULLTOP,PREFIN, ACTIVEFIN,andIMPROVED. The optical map dataset for the maize genome was constructed with the enzyme SwaI and consists of 72 contigs spanning a total of 2,281 Mb. The average map size is 29 Mb with maps ranging from 3.2 Mb to 101 Mb in size. The average fragment size is 23.19 Kb with the largest map containing 4,416 fragments to the smallest map containing120fragments. The sequence contigs for the 421 anchored FPCs and 300 unanchored FPCs from thephysicalmapwere insilicodigestedwiththeenzyme SwaI.Thisproduced176,074 sequence contigs distributed among 16,155 BAC clones of which, after removing ter- minal fragments and those less than 1 Kb, 25,260 sequence contigs distributed among 14,017 BAC clones remained. Figure 3.17 shows the distribution of the number of sequencecontigsforeachcloneforthevarioussequencingstatuses. 156 Fragment Size (Kb) Percent of Total 0 10 20 30 40 50 60 100 200 300 400 Optical Map Data 100 200 300 400 Sequence Contig Data Figure3.16: Histogramofthefragmentsizesfrombothopticalmapandsequencecon- tig data. 
The optical map data had an average fragment size of 23.19 Kb with standard deviation of 24.19. The sequence contig data had an average fragment size of 11.95 Kb with standard deviation 11.12.

The remaining sequence contigs had an average fragment size of 11.95 Kb. Figure 3.16 shows the distribution of the fragment sizes for both datasets. The sequencing goal for the maize genome is to initially target clones that contain a greater proportion of genes. Most of the high quality clones are located in genic regions. The possible discrepancy between the average fragment size from the optical map data and the sequence contig data could be due to the density of restriction sites within these regions. It can also be attributed to the fact that the repetitive regions of the genome, which contain fewer restriction sites, are difficult to assemble and thus are underrepresented in the data. The average number of fragments per sequence contig is 2, with a maximum of 25.

[Figure 3.17 plot: one histogram panel per sequencing status (FULLTOP, PREFIN, ACTIVEFIN, IMPROVED); x-axis: Number of Sequence Contigs; y-axis: Count]

Figure 3.17: Histogram of the number of sequence contigs for each of the sequencing statuses. In order of increasing quality, the sequencing statuses are: FULLTOP, PREFIN, ACTIVEFIN, and IMPROVED. Most of the clones designated as IMPROVED have 1-2 sequence contigs whereas for the rest of the sequencing statuses, the number of contigs can vary from 1 sequence contig up to 7.

Parameters

We chose the following parameters during our initial alignment of the sequence contigs to the optical mapping data:

• k = 1, σ = 0.55: The number of standard deviations k to declare a match between a sequence contig fragment size and optical map fragment size. The value of the sizing error σ was estimated from previous optical mapping data.
• α = 0.8: We require at least 80% of the sequence contig fragments to match for a candidate region to be discovered.
• β = 0.8: We require at least 80% of the sequence contig fragments in a traversal path to match for a candidate region to be marked as a feasible alignment.
• λ = 0.3: We set the approximation factor for the locations of FPCs within the same contig to be 30% of the distance specified by the FPC map.
• $g_l = -1$ Kb, $g_u = 1$ Mb: The gaps between FPCs are in the range of [−1 Kb, 1 Mb], where the negative value indicates possible overlap between adjacent FPCs.

Ctg-182 Validation

To validate the algorithm with the above parameters, we use as our test case an entirely sequenced fingerprint contig, Ctg-182. The sequenced clones in the region were simulated to represent the sequencing quality found in the real data set. We aligned the fingerprint contig separately to the map to examine the effects of the parameters on correctly locating the optical map that the contig belonged to. We compared the results of our simulations by examining the number of clones aligned using different sets of parameters. We found that the above parameters work well in practice.

3.2.6 Aligned Optical Maps

Of the 71 optical maps, 65 aligned to the FPC map with total mass 2,168 Mb. Out of the 15,128 sequence contigs, 4,335 were aligned (∼28.7%). To further investigate the alignments that were found, we examined sequence contigs in both aligned and unaligned clones. Figure 3.18 shows the number of fragments for both aligned and unaligned sequence contigs grouped according to their sequencing status. We expect that sequence contigs of high sequence quality should be mostly aligned. We see that most of the unaligned sequence contigs contain few fragments regardless of the sequencing status. The low information content of having fewer fragments makes it difficult to reliably align them to the optical map data. As the sequencing status improves, a greater number of these sequence contigs should align to the optical map.
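As a rough illustration of how the parameters k, σ, and α interact, the following Python sketch applies the match rule (3.2) to individual fragment pairs and then checks the 80% requirement. The function names and the one-to-one pairing of fragments are assumptions made purely for illustration; the method described here locates candidate regions through the match graph rather than through a fixed pairing, and the fragment sizes below are invented.

```python
import math


def fragment_matches(x_kb, y_kb, sigma=0.55, k=1.0):
    """Match rule (3.2): |Y - X| / (sigma * sqrt(X)) <= k.

    x_kb is a sequence contig fragment size and y_kb an optical map
    fragment size, both in Kb. The defaults are the parameter values
    listed in the Parameters section.
    """
    return abs(y_kb - x_kb) <= k * sigma * math.sqrt(x_kb)


def is_candidate_region(pairs, alpha=0.8, sigma=0.55, k=1.0):
    """Return True if at least a fraction alpha of the fragment pairs match.

    pairs is a list of (x_kb, y_kb) tuples. Pairing fragments one-to-one
    is a simplifying assumption of this sketch.
    """
    matched = sum(fragment_matches(x, y, sigma, k) for x, y in pairs)
    return matched / len(pairs) >= alpha


# Invented fragment sizes (Kb); the last pair falls outside the tolerance.
pairs = [(25.0, 27.0), (9.0, 9.5), (16.0, 17.0), (4.0, 4.2), (36.0, 41.0)]
print(is_candidate_region(pairs))
```

With σ = 0.55 and k = 1, a 25 Kb sequence fragment tolerates a deviation of 0.55 · √25 = 2.75 Kb, so the tolerance grows with the square root of the fragment size rather than linearly.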
[Figure 3.18 plot: histograms of aligned and unaligned sequence contigs, one panel per BAC quality; x-axis: Number of Fragments; y-axis: Count; legend: Unaligned (10980), Aligned (4335)]
BAC Quality = FULLTOP: $n_A = 101$, $\mu_A = 3.65$, $n_U = 609$, $\mu_U = 1.64$
BAC Quality = PREFIN: $n_A = 1075$, $\mu_A = 4.02$, $n_U = 2920$, $\mu_U = 2.82$
BAC Quality = ACTIVEFIN: $n_A = 1143$, $\mu_A = 3.8$, $n_U = 3218$, $\mu_U = 2.34$
BAC Quality = IMPROVED: $n_A = 2016$, $\mu_A = 4.21$, $n_U = 4233$, $\mu_U = 3$

Figure 3.18: Number of fragments.

Figure 3.19 examines the alignments by considering the largest multi-fragment sequence contig for each unaligned and aligned clone. Again, we notice the same pattern: clones where the largest multi-fragment sequence contig has fewer than 2 fragments cannot be reliably aligned. As sequencing status improves, a greater proportion of those clones are aligned against the optical map.

[Figure 3.19 plot: histograms of aligned and unaligned clones, one panel per BAC quality; x-axis: Max Consecutive Run; y-axis: Count; legend: Unaligned (10980), Aligned (4335)]
BAC Quality = FULLTOP: $n_A = 101$, $\mu_A = 2.68$, median$_A$ = 2, $n_U = 609$, $\mu_U = 1.22$, median$_U$ = 1
BAC Quality = PREFIN: $n_A = 1075$, $\mu_A = 2.91$, median$_A$ = 2, $n_U = 2920$, $\mu_U = 1.98$, median$_U$ = 1
BAC Quality = ACTIVEFIN: $n_A = 1143$, $\mu_A = 2.73$, median$_A$ = 2, $n_U = 3218$, $\mu_U = 1.62$, median$_U$ = 1
BAC Quality = IMPROVED: $n_A = 2016$, $\mu_A = 3.13$, median$_A$ = 3, $n_U = 4233$, $\mu_U = 2.15$, median$_U$ = 1

Figure 3.19: Max multi-fragment sequence contig.

3.2.7 Alignment Visualization

In order to visualize our alignments, we separately developed image generation software able to produce high-quality SVG images detailing the alignments between the optical maps and FPC maps. One of the images gives a coarse view of the locations of the optical maps against the FPC map on a chromosomal level. The image shows the alignment of the clones from each fingerprint contig to the optical map and is accessible via a web server at http://dl403a-2.cmb.usc.edu/FPC_Maize.
The other high-resolution image gives a zoomed-in, detailed view of the alignment between the sequence contig itself and the optical map. This shows the aligned fragments between the two datasets and the approximate locations of the clones as specified by the optical map. The detailed alignments are available as a separate software package for individuals interested in obtaining a closer look at the alignments generated. Figure 3.20 shows an example illustration of an alignment between an optical map and fingerprint contig.

[Figure 3.20 image: optical map with clones AC186285 (240.10 Kb), AC199770 (215.60 Kb), AC193424 (191.10 Kb), and AC194213 (196.00 Kb) placed above it; individual fragment sizes in Kb are labelled in the image]

Figure 3.20: The generated alignment image has the optical map drawn with the aligned clones from the fingerprint contig placed above. Unaligned clones are drawn as well to indicate their possible position on the optical map. Matches between sequence contig fragments and optical map fragments are shown with vertical lines drawn below the optical map. In the illustration, the fragments marked black are unusable, either because they are the terminal fragments of a sequence contig or because they come from a sequence contig with fewer than 2 fragments. The approximate location of unaligned clone AC194213 is shown based upon the locations of neighboring aligned clones. Clone AC194213 has no usable fragments and thus no alignment is possible. For the aligned clones, the unordered and unoriented sequence contigs can be seen from the matches between the fragments from the sequence contig and optical map.

3.2.8 FPC Gap Distances

The distances between the fingerprint contigs are generally unknown since they are anchored to the genome using markers.
The alignment between an optical map and fingerprint contigs anchored to a specific chromosome can provide estimates of these distances. Based upon the locations of the aligned clones on the optical map, we can calculate estimates of distances between adjacent clones.

There are two types of adjacent aligned clones: those that are directly adjacent (no unaligned clones exist between them) and those that are indirectly adjacent (unaligned clones exist between them). Figure 3.15 (a) shows an example of clones within a fingerprint contig that are directly adjacent versus indirectly adjacent. The distances between clones within the same fingerprint contig are approximately known whereas the distances between fingerprint contigs are not known. Using the alignments of the fingerprint contigs against the optical map, we can estimate the distances between fingerprint contigs based on the corresponding alignments of the clones within each fingerprint contig. However, the clones used to make the estimates are not always directly adjacent. Since indirectly adjacent clones have unaligned clones between them, the distances given by them are less accurate, as the unaligned clones are unaccounted for.

Figure 3.21 shows a histogram of the estimated distances based on using clones from adjacent fingerprint contigs given by the FPC map, using both directly adjacent and indirectly adjacent clones. The distribution of estimated gap distances using indirectly adjacent clones is more varied than the estimates using directly adjacent clones. The estimates of the gap distances are useful for sequencing efforts since they indicate regions not covered by the FPC map. This allows researchers to locate sequencing gaps in terms of the FPC map.

3.2.9 Conclusion

Our results show that optical mapping can provide accurate genome-wide physical maps for complex genomes with highly repetitive sequences.
Constructing physical maps using traditional methods is extremely time-consuming, requiring large amounts of manual intervention. Optical maps can be produced relatively quickly and, as a scaffold for sequence data, become an invaluable resource for assessing sequencing efforts. With the development of short-read technologies that will enable the sequencing of large, complex genomes faster and cheaper than before, an accurate physical map becomes an essential tool in assembling the genome (Sundquist et al., 2007).

[Figure 3.21 plot: two histogram panels, "Directly Adjacent" and "Indirectly Adjacent"; x-axis: Gap Distances (Kb), from −1000 to 2000; y-axis: Percent of Total]

Figure 3.21: Histogram of the FPC gap distances using both directly adjacent and indirectly adjacent aligned clones. The gap distances estimated using directly adjacent clones range from -112 Kb to 1064 Kb. A negative gap distance indicates that the fingerprint clones possibly overlap with each other. The gap distances estimated using indirectly adjacent clones are more varied with range from -730.30 Kb to -1680 Kb. The variation could be attributed to the fact that these estimates are less accurate since unaligned clones exist between them.

Although the methods presented here were tailored towards the particular characteristics of the FPC map, the algorithm is general enough to be applicable to other areas. For example, with modifications to the match graph technique, sequence data containing gross abnormalities can be compared. Cancer genomes are known to be rife with chromosomal duplications and rearrangements. By segmenting the sequence data and treating the segments as individual sequence contigs, these large-scale events can be detected when compared against a reference map. This could offer insight into the types of chromosomal events that occur for different types of cancer.

Chapter 4

Assembly of Optical Maps

In this chapter, we deal with the challenges of reconstructing a genome using optical mapping data.
Chapter 3 dealt with the issue of alignment, where in Section 3.1 we focused on the alignment of optical maps for de novo assembly. In Section 3.2, we examined the problem of aligning optical maps to FPC maps, which are physical maps assembled using traditional means. In this chapter, we address the problem of assembling optical maps, as they are physical maps themselves. We first describe a method that constructs a graph modeling the overlap relationships between the optical maps. We then apply various graph transformations that detect subgraphs representing erroneous regions. Finally, we use information extracted and stored during our graph transformations to assemble a consensus map.

4.1 De Novo Assembly

4.1.1 Introduction

Previous Work: The first algorithms for optical map assembly were developed for the reconstruction of short restriction maps with the assumption that these were identical DNA molecules (Karp and Shamir, 1998). As such, they could not be applied to the more general problem of shotgun optical mapping data, which deals with optical maps taken from random regions of the genome. Anantharaman et al. (1999) designed the first map assembler capable of handling shotgun optical mapping data, using a Bayesian approach that greedily merged optical maps to produce an assembly. Although successful for smaller genomes, the algorithm was difficult to scale to larger plant and mammalian genomes.

Most existing sequence assemblers follow the general computational paradigm of overlap, layout, and consensus. Within this framework, the overlap stage consists of determining pairwise sequence read relationships, the layout phase produces assembly contigs by finding the relative placement of each read against each other, and the consensus step builds a finished assembly using the placement of all reads. Valouev et al. (2006c) adopted the overlap-layout-consensus strategy successful in sequence assembly and applied it to optical map assembly.
Building upon a dynamic programming algorithm using likelihood ratios to align optical maps (Valouev et al., 2006a), pairwise overlaps were calculated between the maps, allowing for the construction of a global overlap graph with nodes representing optical maps and edges representing overlap relationships between them. Graph error correction routines were then applied to cull the graph of edges representing false overlaps as well as remove nodes that represent chimeric maps. The graph was then examined to extract paths representing draft consensus maps that were then refined to produce a finished map.

The overlap-layout-consensus approach developed by Valouev et al. (2006c) allowed for the assembly of more complex genomes such as O. sativa and H. sapiens. However, scalability issues were encountered with datasets comparable to the size of the human genome with extremely deep coverage. The amount of noise within these datasets requires more sophisticated approaches to distinguishing optical maps that can be confidently assembled from those that are erroneous and do not represent true genomic regions.

We build upon the usage of a graph that models the connectivity relationships between maps and use additional methods to produce a more accurate map. Graph transformations are then applied to retain portions of the graph corresponding to genomic regions accurately represented by optical maps. By exploiting the deep coverage of the dataset, the graph transformations are more sensitive in separating real data from the noise. We then examine paths within the graph to construct initial contigs and use a segment matching refinement procedure to produce a final consensus map.

4.1.2 Methods

Our method follows the overlap-layout-consensus approach used by existing sequence assemblers (Myers et al., 2000; Batzoglou et al., 2002; Huang et al., 2003).
This three-step computational framework consists of the following steps: (i) sequence read connectivity relationships are computed in the overlap step, (ii) the reads are combined into assembly contigs and scaffolds where their relative order and orientation are determined using local and global connectivity in the layout step, and (iii) finished sequence contigs are constructed in the consensus step. Various assemblers incorporate different error detection techniques to ensure high accuracy of the finished sequence.

Sequence and optical map assembly are similar in many ways, so many of the ideas developed for sequence assembly are applicable. Errors unique to optical maps require methods that differ from those used by existing sequence assemblers. Furthermore, optical maps consist of ordered restriction fragments whereas sequence reads are composed of discrete nucleotide bases. The information content of the two differs, so sequence-based methods must be modified. Another unique feature of optical mapping is the ability to produce datasets with genome coverage an order of magnitude greater than that of sequencing. This added capability provides further information on how to detect errors during the assembly process. Below, we describe our modifications that deal specifically with optical mapping data within the context of the overlap-layout-consensus framework. We briefly outline the steps of our assembly algorithm and give the full details in the following sections.

1. Overlap Computation.
Pairwise alignments between all optical maps are initially computed. Most sequence assemblers employ hashing techniques to reduce the number of pairwise alignments that must be computed and to quickly screen for matching regions between sequence reads. Although successful for sequence data, these methods do not work well for optical mapping data due to the associated errors. We used methods described in Section 3.1 to compute the pairwise alignments.

2. Layout Graph Construction.
The resulting pairwise alignments are represented as a bi-directed graph (Kececioglu and Myers, 1995) with nodes representing optical maps and edges indicating alignments between the corresponding optical maps. This representation has the added advantage of encoding the orientations of the optical maps with respect to a given alignment.

3. Graph Transformations.
Errors within the layout graph must be detected and removed in order to construct initial contigs. These errors appear as: (i) false edges due to spurious overlaps, and (ii) false nodes due to chimeric maps that incorrectly join unrelated genomic regions. We first apply a transitive edge reduction technique to remove redundant edges, which also provides evidence of correctly joined nodes. The graph is then simplified by collapsing and combining paths representing initial contigs. These procedures take into account that some nodes possibly represent chimeric maps.

4. Initial Contig Construction.
The graph transformations separate the graph into distinct components representing sets of maps derived from a particular genomic region. Each of these components is examined to extract a path of overlapping maps that corresponds to a draft consensus map, obtained by merging overlaps between the optical maps along the path. We use information computed during the graph transformations carried out previously to find the best path to produce a draft consensus map.

5. Consensus Generation.
The best path is used to construct a contig. For sequence assembly, this problem is dealt with by constructing a multiple alignment where sequence accuracy can be assessed and corrected (Churchill and Waterman, 1992). However, a multiple alignment algorithm for optical mapping is difficult to formulate and computationally expensive to compute. Furthermore, the maps are assumed to be measurements from random individual DNA molecules, unlike sequence assembly where regions of the genome are amplified multiple times so that averaging from bulk measurement techniques is possible. We applied methods from Valouev et al.
(2006d) to refine the draft map into a consensus map. However, instead of using the entire data set of maps to align against the draft map to iteratively refine it, we restrict ourselves only to the maps within each component. Using only those maps that were originally within each component reduces the amount of noise introduced. Although maps not originally within the component can align to the draft map, we assume that they are erroneous since they did not exhibit enough evidence to remain within the component after our graph simplifications and error removal.

Overlap Computation

Overlaps between maps are computed using dynamic programming based on the scoring function described in Section 3.1.2. Computationally, this is the most intensive step: for a set of n maps with an average of m restriction fragments per map, an exhaustive all-against-all comparison would require $O(n^2)$ comparisons, with each taking roughly $O(m^2)$ time to compute. With datasets where $n \approx 3 \times 10^6$ and $m \approx 20$ for extremely large genomes with deep coverage, we have $9 \times 10^{12}$ comparisons that must be computed, most of which would not be significant.

In sequence assembly, the number of comparisons is reduced by means of k-mer hashing, where long matches between sequence reads are detected. These matches indicate possible overlap, so only reads with multiple matches are examined and the expensive step of dynamic programming is avoided. This method works in practice due to the relatively low error rates of sequence data and the discrete nature of the nucleotide bases. For optical mapping, however, such methods are difficult to adapt due to the higher rate of errors, which include missing and false cuts and sizing inaccuracies of the individual restriction fragments. Restriction fragment sizes follow a continuous distribution, complicating the design of an effective k-mer hashing scheme.
Furthermore, the presence of missing and false cuts allows for the possibility that regions with a differing number of restriction fragments between two optical maps match.

Despite these difficulties, reducing the number of candidate pairs to be considered would dramatically speed up this step. For example, the maize genome dataset consists of $\approx 2.1 \times 10^6$ maps containing an average of ≈ 17 restriction fragments with average map size of 438 Kb. An exhaustive pairwise comparison resulted in $\approx 3.2 \times 10^7$ overlaps involving $\approx 1.5 \times 10^6$ maps, which amounts to ≈ 0.000008 % of the number of possible overlaps that were examined. Clearly, the number of significant overlaps is extremely small compared to the total number of possible overlaps. An optimistic estimate of the coverage c, obtained by dividing the sum of the map lengths within the dataset by the genome size, gives c ≈ 370x coverage for the maize genome of size ≈ 2500 Mb. However, the set of maps comprising the significant overlaps suggests c ≈ 260x coverage, implying that there should be around $1.5 \times 10^6 \cdot 260 = 3.9 \times 10^8$ significant overlaps, an order of magnitude greater than the number of overlaps that were found.

We employed a score cutoff determined by simulated data to limit the number of returned overlaps, along with random forest classification, to compute our alignments as described in Section 3.1. The computations necessary for the overlap step were carried out on a computing cluster. We utilized the fact that the overlaps between sets of maps can be processed independently of each other to divide the full map set among the multiple processors. The random forest classification can also be done independently, so we also separated the maps into smaller groups so that the classifications are done in parallel. This allowed the overlaps to be computed and classified in a feasible amount of time.

Map Trimming: The ends of the optical maps are less reliable since they might not represent actual restriction fragments.
Since obtaining accurate sizes of the optical maps is crucial for the construction of the overlap graph described in the next step, we use the computed overlap alignments to screen the fragments on the ends of each optical map. More specifically, the coverage of the restriction fragments for each optical map is computed based on aligning maps. Terminal fragments not covered by any aligning map are subsequently removed from the optical map.

Layout Graph Construction

The bi-directed graph $G = (V, E)$ is defined by a set of nodes V corresponding to the set of optical maps and a set of bi-directed edges E corresponding to overlaps between optical maps. The edges encode overlaps between a pair of maps. Let $M = \{r_1, \ldots, r_n\}$ be the set of optical maps. An overlap, $o_k = \{r_{i_k}, r_{j_k}\}$, between a pair of maps consists of an alignment where a prefix of one map is aligned to the suffix of another. It is also possible that one map is contained entirely in another map, so improper prefixes and suffixes for an overlap are allowed. Recall that the orientations of the optical maps are unknown, so for each overlap, $r_{i_k}$ specifies that the optical map was aligned in the forward orientation and $\bar{r}_{i_k}$ specifies that it was aligned in the reverse orientation. Let $O = \{o_1 = \{r_{i_1}, r_{j_1}\}, \ldots, o_m = \{r_{i_m}, r_{j_m}\}\}$ be the set of overlaps. Each optical map $r_i$ has a corresponding node $v_{r_i} \in V$.

To encode the orientations of a given overlap, we direct an edge into a node if the corresponding overlap involves the forward orientation of the optical map; otherwise we direct an edge out of a node to indicate the overlap involves the reverse orientation. Note that our edges are bi-directed in the sense that they can be traversed in both directions, but a path through a node must involve one inward arrowhead and one outward arrowhead. It is possible at this stage that the graph contains edges with conflicting orientations, which we leave for subsequent steps to detect and correct. Figure 4.1 shows the different types of edges possible as well as an example of a bi-directed graph.
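The traversal rule can be made concrete with a small Python sketch. The class and method names are hypothetical, the representation (one arrowhead stored per edge endpoint) is just one possible encoding rather than the data structure used in this work, and the edge between X and Q is invented for the example. The two edges involving W follow the description given in the caption of Figure 4.1.

```python
IN, OUT = "in", "out"


class BidirectedGraph:
    """Minimal bi-directed graph: every edge carries one arrowhead per endpoint."""

    def __init__(self):
        # Maps frozenset({u, v}) -> {u: arrowhead at u, v: arrowhead at v}.
        # Parallel edges and self-loops are not supported in this sketch.
        self.arrowheads = {}

    def add_edge(self, u, head_u, v, head_v):
        self.arrowheads[frozenset((u, v))] = {u: head_u, v: head_v}

    def is_valid_walk(self, nodes):
        """Check that consecutive nodes are joined by edges and that every
        interior node is entered and left through opposite arrowheads."""
        steps = list(zip(nodes, nodes[1:]))
        if any(frozenset(step) not in self.arrowheads for step in steps):
            return False
        for (prev, mid), (_, nxt) in zip(steps, steps[1:]):
            head_a = self.arrowheads[frozenset((prev, mid))][mid]
            head_b = self.arrowheads[frozenset((mid, nxt))][mid]
            if {head_a, head_b} != {IN, OUT}:
                return False
        return True


g = BidirectedGraph()
g.add_edge("W", OUT, "X", IN)  # overlap {W, X}: edge out of W and into X
g.add_edge("W", IN, "Z", IN)   # W reversed: edge into W and into Z
g.add_edge("X", IN, "Q", IN)   # invented edge, for illustration only

print(g.is_valid_walk(["X", "W", "Z"]))  # W sees one inward, one outward arrowhead
print(g.is_valid_walk(["W", "X", "Q"]))  # both arrowheads at X point inward
```

Storing the arrowhead at each endpoint, instead of a single direction per edge, is what lets the same edge be walked in either direction while still constraining how a walk may pass through a node.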
[Figure 4.1 diagram: (A) edge types for maps in the same orientation and maps in opposite orientation; (B) example bi-directed graph on nodes W, X, Z, Y built from the maps W = abcdef, X = defghi, Y = jklihg, Z = cbajkl]

Figure 4.1: The bi-directed graph stores the orientation of the given overlap in the following manner: (a) The arrowheads indicate the orientation of the maps involved in the overlap, where on the left we have two maps in the same orientation while on the right we have maps in reverse orientation. (b) An example of a bi-directed graph where we use maps W, X, Y, Z and alphabetical letters for illustrative purposes. Given the maps, we have overlaps $\{W, X\}$, $\{X, \bar{Y}\}$, $\{\bar{W}, Z\}$, $\{Z, Y\}$ with $\bar{Y}$, $\bar{W}$ representing the reverse orientations of the maps. For overlap $\{W, X\}$, both maps are in the forward orientation so we direct an edge out of node W and into node X. For overlap $\{\bar{W}, Z\}$, map W is in reverse orientation so we direct an edge into node W, and Z is in forward orientation so we direct an edge into node Z. Notice that if we enter a node on an edge with an inward arrowhead then we must leave the node on an edge with an outward arrowhead.

Distances d are also assigned to each edge by computing the genomic distance between the centers of each optical map as specified by the overlap alignment (Valouev et al., 2006c). See Figure 4.2 for a description of the edge calculation. Each optical map represents a genomic region, and the corresponding sizes of the ordered restriction fragments can be used to estimate the distance in terms of genomic DNA between aligned optical maps. We also assign to each edge a quality score q that is calculated by considering the average fragment coverage of the overlap region corresponding to a particular edge. For each restriction fragment on both optical maps within the overlap region, we compute the number of overlapping maps that cover each fragment. The average fragment coverage is then computed over all fragments from both optical maps and assigned as the quality score for the corresponding edge.

[Figure 4.2 diagram: two aligned optical maps with fragment sizes labelled in Kb and the shared "Overlap Region" marked]

Figure 4.2: The distances to the edges are assigned by examining the overlap region between the pair of maps involved in the alignment for a given edge. The distances between the centers of each map are computed relative to the overlap region by averaging the size of the overlap region. The distances between the map centers are then computed by subtracting or adding as necessary based on the positions of the map centers relative to the overlap region.

Graph Transformations

The construction of the graph G models all the pairwise overlap relationships among the maps. Although analyses can be performed directly on this graph, the size of the graph can be significantly reduced by a series of graph transformations. The main advantage of the transformations is that the size of the overlap graph is reduced so that reconstruction of the underlying genome becomes easier. Furthermore, the transformations identify portions of the graph that correspond to true genomic regions. The goal of the transformations to the overlap graph is to reduce the number of edges and nodes without changing the space of potential solutions. Graph transformations have been used by other groups to address the complexity of sequence assembly in different contexts (Idury and Waterman, 1995; Myers et al., 1995; Myers, 2005).

[Figure 4.3 diagram: four panels (A) to (D) showing successive transformations of a graph on nodes a, b, c, d, e, f]

Figure 4.3: An example of the graph transformations where in (a) the transitive edges (dotted lines) are identified as a → b since there is a path a → c → b. Similarly, we have c → d and e → f as the other transitive edges. (b) After removing the transitive edges, we can identify the internal paths c → b → d and c → e → d. (c) The internal paths are collapsed into composite edges, where we then have multiple edges existing between nodes c, d. This is a graph bulge where we can merge the paths together. (d) The resulting graph now consists of an internal path which we can subsequently collapse into a composite edge as well.
Transitive Edge Reduction: The graph contains more edges than necessary in the form of paths between nodes that reconstruct the same genomic region. In particular, if the path v → w → x and edge v → x are ‘consistent’ with each other, then the edge v → x is unnecessary since the path v → w → x can be used to reconstruct the same region of the genome. In addition, the edge v → x offers supporting evidence that the path v → w → x is not composed of spurious overlaps. The edge v → x can be reduced (removed) without affecting regions of the genome that can be constructed using those nodes. The presence of the edge v → x serves to increase what we term the redundancy of the edges along the path v → w → x. Using a method similar to that described in Myers (2005), we can quickly detect when we will encounter this situation.

The redundancy r of each edge is initially set to 0, i.e. r[v → w] = 0, for all edges v → w ∈ E. When an edge v → x is found to be consistent with a path v → w → x, the redundancy is incremented for the edges v → w and w → x. Recall that distances are assigned to the edges in the graph corresponding to genomic regions represented by the underlying maps. If both the edge v → x and the path v → w → x represent the same genomic region, then their distances should be consistent with each other. However, the true genomic distance is not known, so we estimate it by considering the quality of the alignments along both the edge v → x and the path v → w → x. Let P = d(v → x) be the distance assigned to edge v → x and Q = d(v → w) + d(w → x) be the distance along the path v → w → x. We use as our estimate of the unknown distance

$$\hat{R} = \frac{q(v \to x) \cdot P + [q(v \to w) + q(w \to x)] \cdot Q}{q(v \to w) + q(w \to x) + q(v \to x)} \qquad (4.1)$$

where we take a weighted average of the given distances according to the quality scores. The distances P and Q are deemed consistent if

$$\frac{|P - Q|}{\sigma\sqrt{\hat{R}}} \le k \qquad (4.2)$$

for a given number of standard deviations k, since measured fragment sizes are normally distributed. This approach is similar to (3.2) except we are not comparing sizes of multiple fragments.
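The consistency test can be written down directly from (4.1) and (4.2). The following Python sketch is an illustration under stated assumptions: the function and argument names are hypothetical, the distances and quality scores in the example are invented, and the values σ = 0.55 and k = 1 are borrowed from the fragment matching parameters of Section 3.2.5 purely as placeholders.

```python
import math


def estimate_distance(p, q, qual_vx, qual_vw, qual_wx):
    """Quality-weighted estimate R-hat of (4.1).

    p is the distance of the edge v -> x, q the summed distance of the path
    v -> w -> x, and qual_* the quality scores of the three edges. Assumes
    at least one quality score is positive.
    """
    path_quality = qual_vw + qual_wx
    return (qual_vx * p + path_quality * q) / (path_quality + qual_vx)


def distances_consistent(p, q, qual_vx, qual_vw, qual_wx, sigma, k):
    """Consistency test (4.2): |P - Q| / (sigma * sqrt(R-hat)) <= k."""
    r_hat = estimate_distance(p, q, qual_vx, qual_vw, qual_wx)
    if r_hat <= 0:
        # Guard chosen for this sketch; the text does not discuss this case.
        return False
    return abs(p - q) / (sigma * math.sqrt(r_hat)) <= k


# Invented example: edge distance 100, path distance 104, the path edges
# having twice the quality of the direct edge.
print(estimate_distance(100.0, 104.0, 10.0, 20.0, 20.0))
print(distances_consistent(100.0, 104.0, 10.0, 20.0, 20.0, sigma=0.55, k=1.0))
print(distances_consistent(100.0, 120.0, 10.0, 20.0, 20.0, sigma=0.55, k=1.0))
```

Because the path contributes two quality scores to the weighting, the estimate is pulled towards Q whenever the three edges are of similar quality.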
Besides providing additional evidence that a given set of edges along a path are not spurious, finding redundant edges also gives support to whether a map is chimeric or not. Recall that a chimeric map joins together unrelated regions of the genome, so maps aligned to disparate regions on a chimeric map should exhibit no connectivity. Given a path v → w → x, the detected redundant edge v → x indicates that overlap exists between the nodes v and x, so the node w has a pair of aligned maps that also share an alignment with each other. The connectivity of a node, h, records the number of times this situation is encountered, where initially h[v] = 0 for all nodes v ∈ V. When a redundant edge is detected for a given path v → w → x, the connectivity of node w is updated to reflect the fact that the aligned maps represented by the nodes v and x are aligned with each other as well.

Algorithm 4.1.1 shows how the edge redundancy and node connectivity are calculated. We first set the connectivity h[v] = 0 for all nodes and redundancy r[e] = 0 for all edges. Initially, all nodes are unmarked and all edges are non-reduced. We then iterate over each node v and mark each of its adjacent nodes w. Each edge of v is then checked to see if it has a marked node w, in which case we iterate over all adjacent nodes x of w. If x is marked as well then we have found a path v → w → x and an edge v → x, as shown on line 12. The appropriate redundancy and connectivity of the edges and nodes detected are then updated. We then mark the edge v → x to be reduced at a later step.

Examining the running time, we see that the algorithm runs in time $O(|V|^3)$ as we must iterate through all the nodes in the graph and for each node examine each of its adjacent nodes. The paths from each adjacent node are checked as well. However, we make use of the fact that the edges of each node are sorted so that the algorithm performs faster in practice. In line 8, we first determine the edge with the greatest distance out of v. We know then that any path v → w → x that exceeds this distance does not have an edge v → x that can be reduced.
We can then immediately stop processing edges out of w, as those edges are sorted as well.

Data: Overlap graph G = (V, E).
Initialization: s[v] ← unmarked, h[v] ← 0, ∀v ∈ V; t[e] ← false, r[e] ← 0, ∀e ∈ E
 1: for v ∈ V do
 2:   Sort edges of v by d(v → w).
 3: end for
 4: for v ∈ V do
 5:   for v → w ∈ E do
 6:     s[w] ← marked
 7:   end for
 8:   L ← max_w d(v → w)
 9:   for v → w ∈ E do
10:     if s[w] = marked then
11:       for w → x ∈ E and d(v → w) + d(w → x) ≤ L do
12:         if s[x] = marked and DISTANCES-CONSISTENT(d(v → x), d(v → w) + d(w → x)) then
13:           Increment r[v → x], r[v → w], r[w → x], h[w]
14:           s[x] ← eliminated
15:         end if
16:       end for
17:     end if
18:   end for
19:   for v → w ∈ E do
20:     if s[w] = eliminated then
21:       t[v → w] ← true
22:     end if
23:     s[w] ← unmarked
24:   end for
25: end for
Algorithm 4.1.1: Calculation of edge redundancy and node connectivity

Chain Collapsal: Transitive edge reduction results in the graph containing some nodes with no branchings; they have in-degree and out-degree of exactly one. Nodes with in-degree and out-degree of exactly one are called internal nodes. A sequence of edges involving only internal nodes is called an internal path. Any reconstruction of the genome involving an internal path must traverse all nodes along that path, since there are no branching possibilities. Internal paths can be collapsed and represented by a composite edge. For a given internal path w_1 → w_2 → ··· → w_k, let v be the non-internal node preceding w_1 and x be the non-internal node succeeding w_k. The path can be collapsed into a new composite edge v → x that represents the full path v → w_1 → w_2 → ··· → w_k → x that would be traversed for this portion of the graph. The distance for the new edge is set to the accumulated distances of the path, so that

\[ d(v \to x) = d(v \to w_1) + \sum_{i=1}^{k-1} d(w_i \to w_{i+1}) + d(w_k \to x). \]

The composite edge implicitly represents an assembled portion of the genome from the maps and overlaps corresponding to the nodes and edges along the collapsed path.
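The collapse of internal paths into composite edges can be sketched as follows. This is a minimal Python illustration under simplifying assumptions: the graph is given as a dictionary from (u, v) pairs to distances, the function name `collapse_chains` is hypothetical, and the checks on redundancy and chimerism that accompany the collapsal are omitted.

```python
from collections import defaultdict


def collapse_chains(edges):
    """Collapse internal paths into composite edges.

    edges: dict mapping (u, v) -> distance d(u->v).
    Returns a list of (start, end, total_distance, n_edges) tuples, where
    n_edges = 1 marks an ordinary (non-composite) edge.
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for (u, v), dist in edges.items():
        succ[u].append((v, dist))
        outdeg[u] += 1
        indeg[v] += 1

    def is_internal(node):
        # Internal nodes have in-degree and out-degree of exactly one.
        return indeg[node] == 1 and outdeg[node] == 1

    collapsed = []
    for u in list(succ):
        if is_internal(u):
            continue  # chains are only started from non-internal nodes
        for v, dist in succ[u]:
            total, count, node = dist, 1, v
            # Follow the unique outgoing edge while the node is internal.
            while is_internal(node):
                node, step = succ[node][0]
                total += step
                count += 1
            collapsed.append((u, node, total, count))
    return collapsed
```

Because the result is a list rather than a simple graph, a composite edge may run parallel to an existing direct edge between the same pair of nodes; the sketch keeps both and leaves the choice between them to a later step.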
Before the collapsal, we ensure that the maps along the collapsed path are not chimeric, since we would otherwise be erroneously joining together unrelated regions of the genome. We check that the edges along the path are of high redundancy and the nodes have high connectivity. We take a conservative approach where we only collapse paths with similar redundancy numbers, as they are more likely to represent a well-sampled region of the genome. We also set the redundancy of the new composite edge to the average redundancy of the edges along the path.

Graph Bulges: After having performed the previous operations, we also identify subgraphs called graph bulges. Two internal paths p_1, p_2 from a node u that end at the same node v are called a graph bulge. Paths through the overlap graph represent a genomic region, so that two paths that begin and end at the same node ideally represent the same genomic region. However, due to the distribution of errors or biological variants, it is possible that these two paths contain overlapping maps that do not overlap with each other. The criteria for deciding whether two paths justify simplification can be complex, taking into account error models of the represented region as well as the possibility of haplotypes. For our purposes, we consider a simple model where we merge the two paths based on the path with higher redundancy. This prioritizes paths with higher coverage and assumes those regions are more reliable.

To find graph bulges, we use a method similar to the one used for the transitive edge reduction. We have already collapsed all internal paths into single composite edges. In some cases, our graph bulges will consist of multiple edges between two nodes, as the chain collapsal from the previous step can produce a composite edge connecting two nodes where a direct edge between the two already existed. This is due to the non-uniform distribution of the map lengths, so that a region of the genome can be tiled by a varying number of maps. To find graph bulges, we perform a breadth-first search where edges are traversed according to redundancy.
Although it would seem simpler to replace all simple paths and look for parallel edges, we want to select paths according to redundancy, as this provides evidence that they are more likely to be correct. When a marked node is encountered for a second time during the breadth-first search, we examine both paths. We choose the path with higher average redundancy as the path to merge the other path into. We note that the merging of one path into another offers additional evidence of the redundancy and connectivity of the edges and nodes along those paths. We update the path that we merge into to reflect the higher redundancy and connectivity.

Multiple Rounds of Simplification: The process of detecting chain collapsals and graph bulges can be done multiple times to simplify the graph further. This is because the chain collapsals reveal graph bulges that, when merged, produce further chain collapsals. We perform the two operations on the graph until both types of subgraphs are no longer found. At this point, we have simplified the graph to the extent allowable based on our criteria.

Removing Faulty Nodes and Edges: Once the connectivity of all nodes and redundancy of all edges have been established, we seek to retain nodes with high connectivity and edges with high redundancy. Both values are directly affected by the coverage of the data set, which is extremely variable and dependent upon the region of the genome being sampled. We set lower and upper thresholds [l_h, u_h] and [l_r, u_r] for the connectivity and redundancy respectively, so that all nodes and edges having connectivity and redundancy values within these ranges are kept. Edges with redundancy values outside of the range are removed, as well as any nodes left isolated after removal of the edges.

Nodes with low connectivity are checked to see if they possibly represent the ends of contigs. Since regions of the genome with large fragments have less coverage, there is a bias towards the ends of contigs containing large fragments.
For each node v we also keep track of i[v], the number of incoming redundant edges, and o[v], the number of outgoing redundant edges. Nodes that represent the ends of contigs should have a noticeable difference between the two values, since one side of the corresponding optical map will have an overabundance of overlapping maps. We check the ratio of the incoming and outgoing edges, E = i[v]/o[v], where if E ≥ t_in, then all outgoing edges from node v are removed, and if E ≤ t_out, then all incoming edges into node v are removed. Otherwise, if both i[v] < t_in and o[v] > t_out, then the node is considered chimeric and is replaced by two nodes, one having the original incoming edges and no outgoing edges and the other having no incoming edges and the original outgoing edges. We do not remove chimeric nodes but split them to disconnect the graph at the point of chimerism. Due to the highly variable coverage of an optical mapping data set, the removal of chimeric nodes would result in the separation of the graph into multiple small components. To prevent the graph from becoming extremely fractured, we instead split each chimeric node into two separate nodes, disconnecting the graph only at the point of chimerism.

Initial Contig Construction

We now have simplified the graph and removed all errors according to our conditions. The graph should contain edges with high redundancy and nodes with high connectivity. The removal of nodes and edges should separate our graph into distinct graph components. In an ideal situation, each component would represent an individual chromosome within the target genome, provided the chromosomes largely contain dissimilar regions so that no overlaps exist between them. However, in practice we find that we have a number of components in excess of the number of chromosomes. Most components consist of a small number of nodes, so we can immediately eliminate those; they do not contain enough information to reliably assemble a genomic region.
For those components consisting of a large number of nodes, we seek to find a path within each that represents an initial contig that we can construct. Recall that we have computed edge redundancy and distance values. We can use them to determine paths through the graph consisting of edges that represent true genomic regions. If we assume that the component is acyclic, then using dynamic programming and a technique similar to that of Valouev et al. (2006c), we can find a path that maximizes both edge redundancy and distance. Since the component is acyclic, there exist source nodes (no incoming edges) and sink nodes (no outgoing edges). We want to maximize the redundancy and distance, as our goal is to extract the longest and most reliable genomic region from a component.

Let r_e and d_e be the edge redundancy and distance respectively for a given edge e. Let n_e be the number of edges collapsed for a composite edge, where n_e = 1 represents a non-composite edge. Let S be the set of source nodes and T be the set of sink nodes for a given component. The recursion is then identical to the longest paths formulation given in Section 3.2.3, where

\[ w(u) = \begin{cases} 0, & u \in T \\ \max_{u \to v} \left\{ \alpha \cdot r_{u \to v} + \beta \cdot d_{u \to v} + \gamma \cdot n_{u \to v} + w(v) \right\}, & \text{otherwise.} \end{cases} \]

The recursion assumes that the graph is acyclic, so that we can find a path that maximizes the weight w(u) with respect to the parameters α, β, γ. The chosen parameters reflect the amount of importance we place on each of the edge properties. We note that for a practical application of the above recursion for dynamic programming, due to the number of nodes within a component, we organize our computations by topologically sorting the nodes first. Nodes are then processed in reverse order, so that for a particular node, all of the nodes on its outgoing edges are guaranteed to have been already computed.

We had assumed that our graph components are acyclic, so that the above recursion would correctly find the best path for each component. In practice, however, cycles within each component can exist.
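For the acyclic case, the recursion for w(u) and the reverse topological processing order can be sketched in Python as below. This is only an illustration: the function name `best_path`, the edge-attribute layout, and the default parameter values are assumptions, and the standard-library `graphlib` module (Python 3.9 or later) is used for the topological sort.

```python
from graphlib import TopologicalSorter


def best_path(edges, alpha=1.0, beta=1.0, gamma=1.0):
    """Find the path maximizing the weighted sum of edge properties.

    edges: dict mapping (u, v) -> (redundancy, distance, n_collapsed).
    Returns (score, path) for the best path starting at a source node.
    Raises graphlib.CycleError if the component is not acyclic.
    """
    succ = {}
    for (u, v), attrs in edges.items():
        succ.setdefault(u, []).append((v, attrs))
        succ.setdefault(v, [])

    # Passing successors in place of predecessors makes static_order()
    # yield every node after all nodes on its outgoing edges, i.e. a
    # reverse topological order.
    sorter = TopologicalSorter({u: [v for v, _ in out] for u, out in succ.items()})

    w, nxt = {}, {}
    for u in sorter.static_order():
        w[u], nxt[u] = 0.0, None  # sink nodes keep w(u) = 0
        for v, (r, d, n) in succ[u]:
            score = alpha * r + beta * d + gamma * n + w[v]
            if nxt[u] is None or score > w[u]:
                w[u], nxt[u] = score, v

    targets = {v for out in succ.values() for v, _ in out}
    sources = set(succ) - targets
    start = max(sources, key=lambda s: w[s])

    # Trace back the chosen successors to recover the path itself.
    path = [start]
    while nxt[path[-1]] is not None:
        path.append(nxt[path[-1]])
    return w[start], path
```

Storing the chosen successor for each node avoids recomputing the maximization when the path is read back from the best source node.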
Most cycles are erroneous and are eliminated during the graph simplification and error-correcting steps. However, it is possible that the component can still contain edges that cause cycles. The problem of removing edges from a graph so that it becomes acyclic is known in the literature as the feedback arc set problem (Garey and Johnson, 1979). In this setting, edges are given weights, and the minimum feedback arc set problem is to remove a set of edges of smallest total weight. The problem has been shown to be not only NP-hard but APX-hard as well (Kann, 1992).

Genome           Genome Size (Mb)   Enzyme   Avg. Fragment Size (Kb)   Coverage   Maps
Sim (2)          50                 -        13.6                      20x        2.3×10^3
E. coli (1)      4.6                StuI     16.8                      562x       6.6×10^3
O. sativa (12)   382                SwaI     20.8                      322x       2.6×10^5
Table 4.1: The number in parentheses indicates the number of chromosomes for each genome. The coverage was calculated by dividing the total mass of the collected maps for each genome by the genome size.

In sequence assembly, the graph encoding overlaps is built in a greedy fashion and cycles are disregarded, as they represent repetitive regions whose true nature is difficult to discern. As our methods are adapted from sequence assembly, we use the same approach, where we first focus on non-repetitive regions of the genome. Before applying the above recursion, we create a new graph by iterating through the edges of the component in order of decreasing redundancy and adding them to the graph as long as the addition of the edge does not cause a cycle. In this way, we greedily build our new acyclic graph from the original component that emphasizes highly redundant edges. To quickly detect edges that form a cycle, we use a variant of the online cycle detection algorithm described in Pearce and Kelly (2003). The algorithm allows us to discard edges that cause cycles to form as they are being added, under the assumption that the genomic region represented by the component does not contain repeats. We note that this assumption that repeats are rare generally holds, but it is possible that repeated regions exist.
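The greedy construction of the acyclic graph can be illustrated with the short Python sketch below. Note that it does not implement the online algorithm of Pearce and Kelly (2003); as a simpler stand-in, it checks reachability with a depth-first search before each insertion, which gives the same accept/reject decisions at a higher cost per edge. The function name `greedy_acyclic` and the edge tuple layout are hypothetical.

```python
def greedy_acyclic(edges):
    """Greedily build an acyclic graph, preferring highly redundant edges.

    edges: iterable of (u, v, redundancy) tuples.
    Returns (kept, rejected): edges added to the acyclic graph, and edges
    set aside because adding them would have closed a cycle.
    """
    succ = {}
    kept, rejected = [], []

    def reaches(src, dst):
        # Depth-first search over the edges accepted so far.
        stack, seen = [src], {src}
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            for nxt in succ.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    # Visit edges in order of decreasing redundancy.
    for u, v, r in sorted(edges, key=lambda e: e[2], reverse=True):
        if reaches(v, u):
            # u->v would close a cycle (this also rejects self-loops).
            rejected.append((u, v, r))
        else:
            succ.setdefault(u, set()).add(v)
            kept.append((u, v, r))
    return kept, rejected
```

The rejected list is returned rather than dropped so that the edges responsible for cycles remain available for later inspection.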
The cyclic edges are saved and can be used post-assembly to further analyze whether a repeat region truly exists within the genome.

4.1.3 Results

We tested our algorithm on various data sets to examine its performance. Table 4.1 lists summary statistics for each of the data sets tested. We used a simulated data set as well as two real data sets consisting of the E. coli and O. sativa genomes. Both of the real data sets have published sequence data readily available, so that construction of in silico reference maps is possible (Blattner et al., 1997; Goff et al., 2002). We note that the real data sets have extremely deep coverage (> 300x), whereas our simulated data set has coverage ∼ 20x.

Genome      Raw Overlaps   Computation   RF Overlaps (0.85)   Maps
Sim         4.7×10^5       40 hrs        8×10^3 (0.02)        2×10^3 (0.84)
E. coli     2.2×10^7       7 wks         3.6×10^4 (0.002)     4.2×10^3 (0.64)
O. sativa   3.0×10^9       18 yrs        2.1×10^7 (0.007)     2.1×10^5 (0.93)
Table 4.2: The assembly overlap statistics for the data sets, where we give the total number of computed overlaps for each genome and the amount of computation required if we were processing on a single desktop computer. The last two columns show the resulting number of overlaps and maps after applying the random forest classification. The numbers in parentheses indicate the fraction of the original data set.

                       Edges                          Nodes
Genome      Comps.     R          C     E             Ch       E
Sim         7          5.2×10^3   541   1.2×10^3      332      116
E. coli     1          4.6×10^3   46    2.8×10^4      742      1.3×10^3
O. sativa   389        1.3×10^7   567   6×10^6        1×10^3   8×10^4
Table 4.3: Overlap graph edge statistics (R = Reduced, C = Composite, E = Eliminated, Ch = Chimeric)

We first applied our algorithm to a simulated data set consisting of a genome with 2 chromosomes of total size 50 Mb and 20x coverage, with the results of overlap computation and random forest classification shown in Table 4.2. We constructed our overlap graph and ran our graph simplifications, producing the results shown in Table 4.3.
We ended up with 7 components after having reduced 5.2×10^3 edges, nearly half of the edges originally within the graph, and eliminating 1.2×10^3 edges that fell below our redundancy thresholds. The composite edges were composed on average of ∼ 3.2 edges with distance 267.5 Kb.

In addition, we identified 332 nodes as chimeric and eliminated 116 nodes with low connectivity. Of the 332 nodes that were marked as chimeric, only 114 were true chimeric maps. Examining the locations of the simulated chimeric maps, we note that most of the identified chimeric maps were located in genomic regions with large fragments. We expect that these chimeric maps should be easier to identify, as large fragments decrease the chance of maps randomly overlapping across the length of the chimeric map.

Genome      Contig Yield    Aligned Contigs   % Genome Covered
Sim         7 (46 Mb)       7 (45 Mb)         90
E. coli     1 (4.6 Mb)      1 (4.6 Mb)        100
O. sativa   289 (264 Mb)    224 (236 Mb)      75
Table 4.4: Contig statistics for the data sets

In Table 4.4, we list statistics for the contigs constructed for the Sim genome. We were able to assemble 7 contigs representing 46 Mb that recovered ∼ 90% of the genome. Figure 4.4 shows a coverage plot for the 2 chromosomes of the genome, where the resulting consensus maps are aligned to the simulated genome. We computed the coverage by determining the overlapping maps used to construct the consensus map. In Figure 4.5, we plot histograms of the fragment sizes within uncovered regions of the genome. We see that most uncovered regions consist of extremely small fragments that are difficult to recover in terms of overlap alignments.

The next data set that we attempted to assemble was the bacterial genome E. coli of size 4.6 Mb. For this data set, after computing the initial set of overlaps, we noticed a significant decrease after applying the random forest classification. We started with an overlap data set of size 2.2×10^7 that was reduced to 3.6×10^4 overlaps, comprising ∼ 0.2% of the original overlaps.
We attributed this to the high score cutoff threshold used during the random forest classification to filter the overlaps. It is possible to lower the threshold at the expense of increasing the amount of noise. From Table 4.3, we see that we are able to reduce ∼ 12% of the edges while eliminating ∼ 77% of the edges. We note that the reduction is substantially less than for our simulated genome Sim, leading to a significant increase in the number of eliminated edges. This could be due to the fact that our simulated data does not accurately reflect the amount of noise present in real data. We found that most of the eliminated edges are due to maps that randomly align to nodes with high connectivity. The average distance of the composite edges was ∼ 2.6 Mb, which already covers 56% of the genome.

[Figure 4.4: coverage (y-axis) versus fragment position (x-axis), one panel for each of chromosomes 1 and 2.]
Figure 4.4: We plot coverage along the Sim genome for both chromosomes by fragment position. For reference, we plot the fragment sizes along the bottom of the plots. We note that coverage increases in regions where there are larger fragments, as overlaps are more reliably detected given their presence. Regions with no coverage can be seen from the plot, where there are 2 gaps for chromosome 1 and 3 gaps for chromosome 2. We note that the ends of the chromosomes are also underrepresented. In practical applications of optical mapping, the ends of chromosomes in general are difficult to cover due to sampling difficulties.

Examining the contig yield in Table 4.4 shows that we are able to recover the whole genome. For the E. coli genome, we are aided by the extremely deep coverage (> 500x) of the data set, which allows us to obtain an accurate overlap set. Comparisons against the in silico map revealed no missing or false cuts, with only very small fragments not present in the consensus map.
Due to the stringent random forest score cutoff threshold, we only retained 1,251 maps in our final assembly. We tested the effects of the score cutoff thresholds used during the random forest classification by assembling using different thresholds, as shown in Table 4.5. We notice that the percentage of reduced and eliminated edges in terms of the original data set is largely unchanged, where we reduce ∼ 20% and eliminate ∼ 70% of the edges. The number of composite edges, however, decreases as the amount of noise within the data set increases. Recall that composite edges consist of internal paths with internal nodes. The added overlaps cause nodes on internal paths to become branch nodes, so that they can no longer be collapsed. We also noticed a decrease in the distance represented by composite edges, as fewer of them are present. Interestingly, the number of chimeric nodes decreases, as the increase in the number of overlaps results in fewer of them being identified. We note that the overall depth of coverage of the final assembly increased dramatically in response to the greater number of alignments and maps for the different score cutoffs. The assemblies yielded similar numbers in terms of coverage of the genome and alignment to the reference.

[Figure 4.5: percent of total (y-axis) versus fragment size in Kb (x-axis), one panel for each of chromosomes 1 and 2.]
Figure 4.5: Histogram of fragment sizes from unassembled regions of Sim

                                           Edges                          Nodes
Cutoff   Maps       Ovps       Comps.     R          C    E              Ch     E
0.75     4.8×10^3   5.7×10^4   1          9.7×10^3   38   4.1×10^4       886    134
0.65     5.3×10^3   8.1×10^4   1          1.6×10^4   32   5.9×10^4       1256   178
0.55     5.8×10^3   1.1×10^5   1          2.5×10^4   30   7.9×10^4       1114   192
0.45     6.4×10^3   1.6×10^5   1          3.9×10^4   25   1.2×10^5       987    211
0.35     6.5×10^3   2.5×10^5   1          7.2×10^4   12   1.9×10^5       864    222
Table 4.5: E. coli genome assemblies with score cutoff thresholds (R = Reduced, C = Composite, E = Eliminated, Ch = Chimeric)

The final data set that we assembled was the O. sativa plant genome consisting of 12 chromosomes. This data set is significantly larger than our two previous ones, producing 3×10^9 raw overlaps, with results shown in Table 4.2.
If we assume a uniform distribution of overlaps per map, then this corresponds to each map overlapping with ∼ 100 other maps. Performing edge reduction resulted in 1.3×10^7 fewer edges, representing 60% of the original edge set. 567 composite edges with average distance ∼ 30 Mb were found. As 8 chromosomes of O. sativa are between 30 Mb and 45 Mb in size, the composite edges represent significant portions of the genome. We eliminated 6×10^6 edges that represent ∼ 30% of the original edge data set. After checking for node connectivity, ∼ 40% of the map set was eliminated and 0.4% were detected as chimeras. Our final assembly contained 1.3×10^5 maps with an average coverage of 46x.

However, we note that our assembly yielded numerous contigs in excess of the number of chromosomes. We assembled 289 contigs, of which 224 aligned to the reference genome, representing 75% coverage of the genome. In terms of genome coverage, we noticed that most gaps occurred in regions near large fragments. This is due to the fact that alignments are difficult to recover at those points in the genome. In order for these regions to be assembled reliably, maps spanning these large fragments must exist. Due to sampling biases and optical mapping errors, maps from those regions are underrepresented. With fewer alignments recovered due to fewer maps, these regions have low coverage and are eliminated during our graph simplification routines. We attempted to remedy this by lowering the node connectivity threshold, but this in turn allowed for more noise, where the assembled contigs did not align to the reference genome. We note that a multi-stage approach might work best, where we progressively assemble contigs with less strict thresholds. We could use the contigs from a previous stage with stricter thresholds as a guide for assembling contigs for the next stage.

4.1.4 Conclusion

In conclusion, we have adapted methods for sequence assembly into a new algorithm for the assembly of optical maps.
Our method is based on an overlap-layout-consensus approach for assembly that utilizes the concept of an overlap graph. By simplifying the overlap graph and computing information regarding the connectivity of its nodes, we can remove erroneous edges and maps that do not correspond to true genomic regions. We stress that the usage of the overlap graph is dependent upon initially computing accurate alignments between all the maps. Inevitably, some of the alignments are false positives, and so we use the deep coverage afforded by optical mapping to further detect noise within the data set. Optical map assembly forms the basis of many more interesting applications. The ability to accurately and quickly assemble optical maps, allowing for a low-resolution view of a genome, has the potential to yield many biological discoveries.

Chapter 5

Future Work

In this thesis, we have analyzed statistical models and developed algorithms for practical applications of optical mapping. The models that we analyzed in Chapter 2 were a good starting point, as they were used as a foundation for ideas developed later on in the thesis. One of the key ideas that we explored was the matching of restriction fragments in Chapter 2 under different contexts. This provided us with models for the development of a new scoring function in Section 3.1. Although the new scoring function is simpler than previous methods, we note that other scoring functions are possible. Our scoring function is not entirely linear due to the calculation of the D-size statistic described in Section 3.1.2. A truly linear score function would allow the dynamic programming recursion to utilize Gotoh's algorithm (Gotoh, 1982), so that we achieve linear running time in terms of the number of fragments from each optical map. Future work towards a linear score function would be a huge improvement over current methods.

In Section 3.1.3, we utilized random forests to reduce the number of false positives for de novo assembly.
Although this works well in practice, a better solution would be the development of better alignment score statistics that can separate true positives from false positives. The statistics would have to normalize the alignment score with respect to the overlap region due to the distribution of map sizes. Another interesting line of research is the development of filtration methods that can quickly determine maps that should be aligned to each other. Recall that the k-mer hashing techniques for sequence assembly discussed in Section 4.1.2 did not work well for optical mapping data. Simpler methods that can quickly examine matching fragments between two maps might be used for obtaining pairs of maps that should be aligned using the full dynamic programming algorithm. These could incorporate some of the ideas developed in Section 3.2.3 in terms of a match matrix.

In Chapter 4, we addressed methods for optical map assembly. One area of further research that can be explored is the generation of a consensus map. We used refinement methods previously developed, but algorithms that utilize the information within the graph components to build a consensus could yield better results. Recall that we had noted that multiple alignment of optical maps is difficult to both formulate and compute. However, in the specialized context of building a consensus map, iterative methods that greedily build the consensus might work well in terms of a multiple alignment.

With the many different current applications of optical mapping, as well as potential new ones, new algorithms and models will need to be developed and analyzed. In this thesis, we have addressed the current capabilities of the system and examined methods for various problems using optical mapping. Overall, optical mapping is a useful tool for genomic analysis, and with improvements to the system many more types of analyses will be possible. These advancements will undoubtedly raise interesting questions for future researchers to answer.

Bibliography

T. Anantharaman, B. Mishra, and D. Schwartz.
Genomics via optical mapping III: Contiging genomic DNA and variations. The Seventh International Conference on Intelligent Systems for Molecular Biology, 7:18–27, 1999.
M. Antoniotti, T. Anantharaman, S. Paxia, and B. Mishra. Genomics via optical mapping IV: Sequence validation via optical map matching. Technical Report CIMS-TR-811, NYU Courant Bioinformatics Group, 2001.
R. Arratia, L. Goldstein, and L. Gordon. Poisson approximation and the Chen-Stein method. Statistical Science, pages 403–424, 1990.
C. Aston, B. Mishra, and D. Schwartz. Optical mapping and its potential for large-scale sequencing projects. Trends Biotech., 17:297–302, 1999.
S. Batzoglou, D. Jaffe, K. Stanley, J. Butler, S. Gnerre, E. Mauceli, B. Berger, J. Mesirov, and E. Lander. ARACHNE: A Whole-Genome Shotgun Assembler, 2002.
J. Biggins and C. Cannings. Markov renewal processes, counters and repeated sequences in Markov chains. Advances in Applied Probability, pages 521–545, 1987.
D. Bishop, J. Williamson, and M. Skolnick. A model for restriction fragment length distributions. American Journal of Human Genetics, 35(5):795, 1983.
F. Blattner, G. Plunkett III, C. Bloch, N. Perna, V. Burland, M. Riley, J. Collado-Vides, J. Glasner, C. Rode, G. Mayhew, et al. The complete genome sequence of Escherichia coli K-12. Science, 277(5331):1453, 1997.
S. Breen, M. Waterman, and N. Zhang. Renewal theory for several patterns. Journal of Applied Probability, pages 228–234, 1985.
L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
W. Cai, H. Aburatani, V. Stanton, D. Housman, Y. Wang, and D. Schwartz. Ordered Restriction Endonuclease Maps of Yeast Artificial Chromosomes Created by Optical Mapping on Surfaces. Proc Natl Acad Sci, 92(11):5164–5168, 1995.
W. Cai, J. Jing, B. Irvin, L. Ohler, E. Rose, H. Shizuya, U. Kim, M. Simon, T. Anantharaman, B. Mishra, and D. Schwartz. High-resolution restriction maps of bacterial artificial chromosomes constructed by optical mapping. Proc Natl Acad Sci, 95:3390–3395, 1998.
G. Churchill and M. Waterman. The accuracy of DNA sequences: estimating sequence quality. Genomics, 14(1):89–98, 1992.
G. Churchill, D. Daniels, and M. Waterman. The distribution of restriction enzyme sites in Escherichia coli. Nucleic Acids Research, 18(3):589, 1990.
F. Collins, M. Morgan, and A. Patrinos. The Human Genome Project: Lessons from Large-Scale Biology. Science, 300(5617):286, 2003.
T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2001.
E. Dimalanta, A. Lim, R. Runnheim, C. Lamers, C. Churas, D. Forrest, J. de Pablo, M. Graham, S. Coppersmith, S. Goldstein, et al. A microfluidic system for large DNA molecule arrays. Analytical Chemistry, 76:5293–5301, 2004.
W. Ewens and G. Grant. Statistical methods in bioinformatics: an introduction. New York: Springer, 2001.
M. Garey and D. Johnson. Computers and Intractability: A guide to the theory of NP-completeness. WH Freeman and Company, San Francisco, CA, 1979.
J. Giacalone, S. Delobette, V. Gibaja, L. Ni, Y. Skiadas, R. Qi, J. Edington, Z. Lai, D. Gebauer, H. Zhao, T. Anantharaman, B. Mishra, L. Brown, R. Saxena, D. Page, and D. Schwartz. Optical mapping of BAC clones from the human Y chromosome DAZ locus. Genome Res., 10:1421–1429, 2000.
S. Goff, D. Ricke, T. Lan, G. Presting, R. Wang, M. Dunn, J. Glazebrook, A. Sessions, P. Oeller, H. Varma, et al. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science, 296(5565):92–100, 2002.
L. Goldstein and M. Waterman. Mapping DNA by stochastic relaxation. Advances in Applied Mathematics, 8(2):194–207, 1987.
O. Gotoh. An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162(3):705, 1982.
E. Green. Strategies for the systematic sequencing of complex genomes. Nature Reviews Genetics, 2(8):573–583, 2001.
P. Green. Against a whole-genome shotgun. Genome Res, 7(5):410–417, May 1997.
H. Harter. Expected values of normal order statistics. Biometrika, 48(1-2):151–165, 1961.
X. Huang and M. Waterman. Dynamic programming algorithms for restriction map comparison, 1992.
X. Huang, J. Wang, S. Aluru, S. Yang, and L. Hillier. PCAP: A Whole-Genome Assembly Program, 2003.
R. Idury and M. Waterman.
A New Algorithm for DNA Sequence Assembly. Journal of Computational Biology, 2(2):291–306, 1995.
J. Jing, J. Reed, J. Huang, X. Hu, V. Clarke, J. Edington, D. Housman, T. Anantharaman, E. Huff, B. Mishra, et al. Automated high resolution optical mapping using arrayed, fluid-fixed DNA molecules, 1998.
A. Johnson and S. Kotz. Distributions in statistics: Continuous Univariate Distributions Vol. 1, Vol. 2. New York: Wiley, 1970.
N. Johnson, S. Kotz, and A. Kemp. Univariate discrete distributions. Wiley-Interscience, 2005.
V. Kann. On the Approximability of NP-complete Optimization Problems. PhD thesis, Royal Institute of Technology, 1992.
R. Karp and R. Shamir. Algorithms for optical mapping. Proceedings 2nd ACM Conference on Computational Molecular Biology, pages 117–124, 1998.
J. Kececioglu and E. Myers. Combinatorial algorithms for DNA sequence assembly. Algorithmica, 13(1):7–51, 1995.
Z. Lai, J. Jing, C. Aston, V. Clarke, J. Apodaca, E. Dimalanta, D. Carucci, M. Gardner, B. Mishra, and T. Anantharaman. A shotgun optical map of the entire Plasmodium falciparum genome. Nat. Genet., 23:309–313, 1999.
R. Lan and P. Reeves. Intraspecies variation in bacterial genomes: the need for a species genome concept. Trends in Microbiology, 8(9):396–401, 2000.
A. Lim, E. Dimalanta, K. Potamousis, G. Yen, J. Apodaca, C. Tao, J. Lin, R. Qi, J. Skiadas, and A. Ramanathan. Shotgun optical maps of the whole Escherichia coli O157:H7. Genome Res., 11:1584–1593, 2001.
J. Lin, R. Qi, C. Aston, J. Jing, T. Anantharaman, B. Mishra, O. White, M. Daly, K. Minton, and J. Venter. Whole-genome shotgun optical mapping of Deinococcus radiodurans. Science, 285:1558–1562, 1999.
S. McCarroll, T. Hadnott, G. Perry, P. Sabeti, M. Zody, J. Barrett, S. Dallaire, S. Gabriel, C. Lee, M. Daly, and D. Altshuler. Common deletion polymorphisms in the human genome. Nature Genetics, 38:86–92, 2006.
M. Metzker. Emerging technologies in DNA sequencing. Genome Research, 15(12):1767–1776, 2005.
E. Myers. The fragment assembly string graph. Bioinformatics, 21(2):85, 2005.
E. Myers and X. Huang.
An O(N^2 log N) restriction map comparison and search algorithm. Bulletin of Mathematical Biology, 54(4):599–618, 1992.
E. Myers, G. Sutton, A. Delcher, I. Dew, D. Fasulo, M. Flanigan, S. Kravitz, C. Mobarry, K. Reinert, K. Remington, et al. A Whole-Genome Assembly of Drosophila. Science, 287(5461):2196, 2000.
E. Myers et al. Toward Simplifying and Accurately Formulating Fragment Assembly. Journal of Computational Biology, 2(2):275–290, 1995.
D. Pearce and P. Kelly. Online algorithms for topological order and strongly connected components. Technical report, Imperial College, 2003.
S. Reslewic, S. Zhou, M. Place, Y. Zhang, A. Briska, S. Goldstein, C. Churas, R. Runnheim, D. Forrest, A. Lim, A. Lapidus, C. Han, G. Roberts, and D. Schwartz. Whole-genome shotgun optical mapping of Rhodospirillum rubrum. Appl. Environ. Microbiol., 71(9):5511–5522, 2005.
D. Sarkar. On the Analysis of Optical Mapping Data. PhD thesis, University of Wisconsin-Madison, 2006.
W. Schmitt and M. Waterman. Multiple solutions of DNA restriction mapping problems. Advances in Applied Mathematics, 12(4):412–427, 1991.
D. Schwartz and M. Koval. Conformational dynamics of individual DNA molecules during gel electrophoresis. Nature, 338(6215):520–522, 1989.
D. Schwartz, X. Li, L. Hernandez, S. Ramnarain, E. Huff, and Y. Wang. Ordered restriction maps of Saccharomyces cerevisiae chromosomes constructed by optical mapping. Science, 262(5130):110–114, 1993.
J. Sebat, B. Lakshmi, J. Troge, J. Alexander, J. Young, P. Lundin, S. Maner, H. Massa, M. Walker, M. Chi, N. Navin, R. Lucito, J. Healy, J. Hicks, K. Ye, A. Reiner, T. Gilliam, B. Trask, N. Patterson, A. Zetterberg, and M. Wigler. Large-scale copy number polymorphism in the human genome. Science, 305:525–528, 2004.
C. Soderlund, S. Humphray, A. Dunham, and L. French. Contigs built with fingerprints, markers, and FPC V4.7. Genome Res, 10(11):1772–1787, Nov 2000.
H. Stefansson, A. Helgason, G. Thorleifsson, V. Steinthorsdottir, G. Mason, J. Barnard, A. Baker, and A. Jonasdottir. A common inversion under selection in Europeans.
NatureGenetics,37:129–137,2005. A. Sundquist, M. Ronaghi, H. Tang, P. Pevzner, and S. Batzoglou. Whole-Genome Sequencing and Assembly with High-Throughput, Short-Read Technologies. PLoS ONE,2(5),2007. S. Tomlins, D. Rhodes, S. Perner, S. Dhanasekaran, R. Mehra, X. Sun, S. Varambally, X. Cao, J. Tchinda, R. Kuefer, C. Lee, J. Montie, R. Shah, K. Pienta, M. Rubin, andA. Chinnaiyan. Recurrent fusionof tmprss2and etstranscriptionfactor genesin prostatecancer. Science,310:644–648,2005. E. Tuzun, A. Sharp, J. Bailey, R. Kaul, V. Morrison, L. Pertz, E. Haugen, H. Hayden, D. Albertson, D. Pinkel, M. Olson, and E. Eichler. Fine-scale structural variation of thehumangenome. NatureGenetics,37:727–732,2005. A.Valouev.ShotgunOpticalMapping: AComprehensiveStatisticalandComputational Analysis. PhDthesis,UniversityofSouthernCalifornia,2006. A. Valouev, L. Li, Y. C. Liu, D. C. Schwartz, Y. Yang, Y. Zhang, and M. S. Waterman. Alignmentofopticalmaps. JComputBiol,13(2):442–462,Mar2006a. A. Valouev, L. Li, Y. C. Liu, D. C. Schwartz, Y. Yang, Y. Zhang, and M. S. Waterman. Alignmentofopticalmaps. JComputBiol,13(2):442–462,Mar2006b. A. Valouev,D. C. Schwartz, S. Zhou, and M.S. Waterman. An algorithmfor assembly ofordered restrictionmapsfromsinglednamolecules. ProcNatlAcadSci, 103(43): 15770–15775,Oct2006c. A. Valouev, Y. Zhang, D. Schwartz, and M. Waterman. Refinement of optical map assemblies. Bioinformatics,22(10):1217–1224,2006d. M. Waterman. Frequencies of restriction sites. Nucleic Acids Research, 11(24):8951, 1983. 198 M.Waterman. Introductiontocomputationalbiology. Chapman&HallNewYork,NY, 1995. M. Waterman, T. Smith, and H. Katcher. Algorithms for restriction map comparisons. NucleicAcidsRes,12(1):237–242,1984. J.L.WeberandE.W.Myers. Humanwhole-genomeshotgunsequencing. GenomeRes, 7(5):401–409,May1997. F. Wei, E. Coe, W. Nelson, A. Bharti, F. Engler, E. Butler, H. Kim, J. Goicoechea, M.Chen,S.Lee,etal. 
Physicalandgeneticstructureofthemaizegenomereflectsits complexevolutionaryhistory. PLoSGenet,3(7):e123,2007. Y. Yang. Computational Genome Analysis By Alignment. PhD thesis, University of SouthernCalifornia,2005. S. Zhou, W. Deng, T. Anantharaman, A. Lim, E. Dimalanta, J. Wang, T. Wu, C. Tao, R. Creighton, and A. Kile. A whole-genome shotgun optical map of Yersinia pestis strainkim. Appl.Environ.Microbiol.,68:6321–6331,2002. S. Zhou,E. Kvikstad,A.Kile, J.Severin, D.Forrest, R. Runnheim,C. Churas, J.Hick- man,C.Mackenzie,M.Choudhary,T.Donohue,S.Kaplan,andD.Schwartz. Whole- genomeshotgunopticalmappingofRhodobactersphaeroidesstrain2.4.1anditsuse forwhole-genomeshotgunsequenceassembly. GenomeRes.,13:2142–2151,2003. S. Zhou, A. Kile, M. Bechner, M. Place, E. Kvikstad, W. Deng, J. Wei, J. Severin, R.Runnheim,C.Churas,D.Forrest,E.Dimalanta,C.Lamers,V.Burland,F.Blattner, and D. Schwartz. Single-molecule approach to bacterial genomic comparisons via opticalmapping. J.Bacteriology,186(22):7773–7782,2004. 199
Abstract
Whole-genome analysis of variation is slowly becoming a reality, thanks to improvements in biotechnology and advances in data analysis. However, large-scale de novo sequencing remains a formidable task for complex plant and mammalian genomes. Although physical maps do not provide nucleotide-level resolution, they convey useful information that can be leveraged to discover biological events that sequencing technologies cannot detect. Optical mapping, a novel restriction mapping technology, can produce complete genome-wide physical maps both quickly and cheaply. These maps not only serve as invaluable aids for de novo sequencing, but can also be used directly to make valuable inferences about the underlying genome itself. However, for optical mapping to be useful as a tool for genomic analysis, both computational and statistical questions must be addressed. In this thesis, we explore some of the issues involved in analyzing optical mapping data. Specifically, we explore various statistical models and their implications for optical mapping data. We also develop a new scoring function for the alignment of optical maps using dynamic programming. We examine a strategy for comparing optical mapping data against a clone-based sequencing strategy for a genome. Finally, we present methods for assembling optical mapping data into a complete genome-wide physical map.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Integrative approach of biological systems analysis on regulatory pathways, modules, protein essentialities, and disease gene classification
Statistical analysis of microarray data and functional genomics of yeast ageing
Developing statistical and algorithmic methods for shotgun metagenomics and time series analysis
Application of machine learning methods in genomic data analysis
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
Geometric interpretation of biological data: algorithmic solutions for next generation sequencing analysis at massive scale
The use of alignment-free statistics for the evolutionary study of study of 5' cis-regulatory sequences
Genomic, regulatory and functional dynamics of the duplication process
Efficient algorithms to map whole genome bisulfite sequencing reads
Big data analytics in metagenomics: integration, representation, management, and visualization
Statistical modeling of sequence and gene expression data to infer gene regulatory networks
Mapping epigenetic and epistatic components of heritability in natural population
Sharpening the edge of tools for microbial diversity analysis
Breaking the plateau in de novo genome scaffolding
Computational algorithms for studying human genetic variations -- structural variations and variable number tandem repeats
Detecting and understanding differentiation of microarray expression data
Applications and improvements of background adjusted alignment-free dissimilarity measures
Probabilistic methods and randomized algorithms
Analysis of genomic polymorphism in Arabidopsis thaliana
Asset Metadata
Creator
Nguyen, John Vu (author)
Core Title
Genomic mapping: a statistical and algorithmic analysis of the optical mapping system
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology
Publication Date
01/27/2010
Defense Date
08/21/2009
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
genomic mapping, OAI-PMH Harvest, optical map alignment, optical map assembly, optical mapping
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Waterman, Michael S. (committee chair), Kempe, David (committee member), Li, Lei M. (committee member), Sun, Fengzhu Z. (committee member)
Creator Email
jvnguyen@gmail.com, jvnguyen@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m2815
Unique identifier
UC1188731
Identifier
etd-Nguyen-3472 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-289932 (legacy record id), usctheses-m2815 (legacy record id)
Legacy Identifier
etd-Nguyen-3472.pdf
Dmrecord
289932
Document Type
Dissertation
Rights
Nguyen, John Vu
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu